docs: add LLM provider evolution & benchmarks (section 16)

Document the full history of local LLM selection:
- Mistral → GLM → Gemma 4 26B MoE with reasons for each change
- Side-by-side benchmarks: Gemini vs Gemma4 vs GLM vs Mistral
- Quality comparison (language, tool calling, links, hallucination)
- Resource usage (params, RAM, disk, offline capability)
- Configuration examples for dev and prod
juanatsap
2026-04-09 20:48:52 +01:00
parent 1f17277a19
commit ff7cd4287b
+74 -3
@@ -510,7 +510,78 @@ Gemini 2.5 Flash free tier provides **15 requests/minute** with no credit card r
If the free tier is exceeded, Gemini returns a rate limit error, which the handler catches and displays as a generic error message to the user.
## 16. LLM Provider Evolution & Benchmarks
### Provider Architecture
The chat uses a **dual-provider** strategy with automatic fallback:
```
Primary: Gemini 2.5 Flash (Google API — production)
↓ (if fails)
Fallback: Gemma 4 26B MoE via Ollama (local — development)
```
In development, set `GOOGLE_API_KEY=""` to force the local model.
### Model Selection History
| Date | Local Model | Why Changed |
|------|-------------|-------------|
| 2026-03 | Mistral Small 3.2 (24B) | Initial choice — good tool calling support |
| 2026-04-09 | GLM-4.7-Flash (30B) | Better quality, Spanish support, 198K context |
| 2026-04-09 | **Gemma 4 26B MoE** | 3-4x faster (MoE: only 4B active), less RAM, excellent quality |
### Benchmark: Same 4 Questions, Same Hardware
| Test | Gemini 2.5 Flash | Gemma 4 26B | GLM-4.7-Flash | Mistral 3.2 |
|------|-----------------|-------------|---------------|-------------|
| Summary (no tool) | **2s** | 5s | 7s | 10s |
| Go question (tool call) | **6s** | 13s | 42s | 50s |
| Spanish tech search | **4s** | 10s | 15s | No Spanish |
| All companies (heavy) | **3s** | 16s | 63s | Timeout |
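To make the gap between the two strongest local models easier to read, the timings in the table can be turned into speedup ratios. The sketch below only redoes the arithmetic on the reported numbers; the `timing` type and `speedup` helper are illustrative names.

```go
package main

import "fmt"

// timing holds the seconds reported in the benchmark table for one test.
type timing struct {
	test       string
	gemma, glm float64
}

// speedup returns how many times faster the candidate is than the baseline.
func speedup(baseline, candidate float64) float64 {
	return baseline / candidate
}

func main() {
	rows := []timing{
		{"Summary (no tool)", 5, 7},
		{"Go question (tool call)", 13, 42},
		{"Spanish tech search", 10, 15},
		{"All companies (heavy)", 16, 63},
	}
	for _, r := range rows {
		// Gemma 4 26B relative to GLM-4.7-Flash on the same test.
		fmt.Printf("%-24s %.1fx\n", r.test, speedup(r.glm, r.gemma))
	}
}
```

With these figures the ratios come out at roughly 1.4x, 3.2x, 1.5x, and 3.9x, so the largest gains appear on the tool-calling and heavy queries.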
### Quality Comparison
| Aspect | Gemini 2.5 Flash | Gemma 4 26B | GLM-4.7-Flash | Mistral 3.2 |
|--------|-----------------|-------------|---------------|-------------|
| Language detection | Excellent | Excellent | Good | Poor |
| Tool calling | Native (ADK) | Via OpenAI compat | Via OpenAI compat | Via OpenAI compat |
| CV navigation links | Correct | Partial | Rare | None |
| Response exhaustiveness | Very complete | Very complete | Complete | Acceptable |
| Hallucination rate | None (tool-grounded) | None | Low | Medium |
### Resource Usage
| Model | Parameters | Active | Disk | RAM (inference) | Offline |
|-------|-----------|--------|------|-----------------|---------|
| Gemini 2.5 Flash | Cloud | Cloud | 0 | 0 | No |
| Gemma 4 26B MoE | 26B | **4B** | 18GB | **~8GB** | Yes |
| GLM-4.7-Flash | 30B | 30B | 19GB | ~19GB | Yes |
| Mistral Small 3.2 | 24B | 24B | 15GB | ~15GB | Yes |
### Why Gemma 4 26B MoE Wins for Local Development
1. **Mixture of Experts**: 26B total but only 4B activated per token — inference speed close to a 4B model with quality of a much larger one
2. **3-4x faster** than the dense 30B GLM on tool-calling and heavy queries (about 1.5x on simpler ones), measured on the same hardware
3. **~8GB RAM** vs 19GB for GLM — leaves room for other dev tools
4. **Excellent Spanish**: Responds in the correct language consistently
5. **Tool calling works**: Compatible with the OpenAI-compatible Ollama adapter
6. **256K context**: Largest context window of all local options tested
### Configuration
```bash
# .env (development: local Gemma 4)
OLLAMA_MODEL=gemma4:26b
# Leave GOOGLE_API_KEY empty or unset to force the local model
GOOGLE_API_KEY=

# .env (production: Gemini API)
GOOGLE_API_KEY=your-key
# Fallback model if Gemini fails
OLLAMA_MODEL=gemma4:26b
```
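One way the two variables could drive provider selection at startup is sketched below. The `Config` type, `loadConfig`, and `providerOrder` are hypothetical, and defaulting `OLLAMA_MODEL` to `gemma4:26b` is an assumption rather than documented behavior.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// Config mirrors the two variables from the .env examples.
type Config struct {
	GoogleAPIKey string
	OllamaModel  string
}

// loadConfig reads settings through getenv so it can be exercised
// without touching the real process environment.
func loadConfig(getenv func(string) string) Config {
	model := strings.TrimSpace(getenv("OLLAMA_MODEL"))
	if model == "" {
		model = "gemma4:26b" // assumed default, not confirmed by the source
	}
	return Config{
		GoogleAPIKey: strings.TrimSpace(getenv("GOOGLE_API_KEY")),
		OllamaModel:  model,
	}
}

// providerOrder lists providers in the order they would be tried.
// An empty API key skips Gemini and leaves only the local model.
func (c Config) providerOrder() []string {
	local := "ollama:" + c.OllamaModel
	if c.GoogleAPIKey == "" {
		return []string{local}
	}
	return []string{"gemini-2.5-flash", local}
}

func main() {
	cfg := loadConfig(os.Getenv)
	fmt.Println(cfg.providerOrder())
}
```

Injecting `getenv` keeps the selection logic testable; in the real handler the resulting order would feed whatever fallback mechanism the chat uses.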
## 17. Dependencies
| Package | Purpose | Size Impact |
|---------|---------|-------------|
@@ -519,7 +590,7 @@ If the free tier is exceeded, Gemini returns a rate limit error, which the handl
No frontend dependencies are added. The chat widget uses HTMX and Hyperscript which are already loaded by the site.
## 18. ADK Go Concepts Used
| ADK Concept | Go Type / Function | Usage in This Project |
|-------------|-------------------|----------------------|
@@ -534,7 +605,7 @@ No frontend dependencies are added. The chat widget uses HTMX and Hyperscript wh
| Tool Context | `tool.Context` | Passed to the tool function by ADK; provides access to session and agent state |
| JSON Schema | `jsonschema:"..."` struct tags | Describes tool parameters to the LLM for function calling |
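As an illustration of the JSON Schema row, the sketch below shows a parameter struct carrying `jsonschema` tags and reads them back with the standard `reflect` package. `SearchArgs`, its fields, and the tag text are hypothetical, and `describeParams` is only a way to inspect the tags, not how ADK builds its schema.

```go
package main

import (
	"fmt"
	"reflect"
	"strings"
)

// SearchArgs is a hypothetical tool-parameter struct; field names and
// tag text are illustrative, not taken from the project.
type SearchArgs struct {
	Query    string `json:"query" jsonschema:"free-text search over the CV"`
	Language string `json:"language,omitempty" jsonschema:"ISO 639-1 code for the reply language"`
}

// describeParams maps each field's JSON name to its jsonschema tag text.
// It returns an empty map for anything that is not a struct value.
func describeParams(v any) map[string]string {
	out := map[string]string{}
	t := reflect.TypeOf(v)
	if t == nil || t.Kind() != reflect.Struct {
		return out
	}
	for i := 0; i < t.NumField(); i++ {
		f := t.Field(i)
		// Drop options such as ",omitempty" from the json tag.
		name, _, _ := strings.Cut(f.Tag.Get("json"), ",")
		if name == "" {
			name = f.Name
		}
		out[name] = f.Tag.Get("jsonschema")
	}
	return out
}

func main() {
	for name, desc := range describeParams(SearchArgs{}) {
		fmt.Printf("%s: %s\n", name, desc)
	}
}
```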
## 19. Relation to Other Documentation
- **[01-ARCHITECTURE.md](01-ARCHITECTURE.md)**: Overall system design
- **[03-API.md](03-API.md)**: HTTP API reference (includes `POST /api/chat`)