docs: add LLM provider evolution & benchmarks (section 16)
Document the full history of local LLM selection:
- Mistral → GLM → Gemma 4 26B MoE with reasons for each change
- Side-by-side benchmarks: Gemini vs Gemma 4 vs GLM vs Mistral
- Quality comparison (language, tool calling, links, hallucination)
- Resource usage (params, RAM, disk, offline capability)
- Configuration examples for dev and prod
+74
-3
@@ -510,7 +510,78 @@ Gemini 2.5 Flash free tier provides **15 requests/minute** with no credit card r
If the free tier is exceeded, Gemini returns a rate limit error, which the handler catches and displays as a generic error message to the user.

-## 16. Dependencies
+## 16. LLM Provider Evolution & Benchmarks

### Provider Architecture

The chat uses a **dual-provider** strategy with automatic fallback:

```
Primary:  Gemini 2.5 Flash (Google API, production)
    ↓ (if it fails)
Fallback: Gemma 4 26B MoE via Ollama (local, development)
```

In development, set `GOOGLE_API_KEY=""` (or leave it unset) to force the local model.

### Model Selection History

| Date | Local Model | Why Chosen |
|------|-------------|------------|
| 2026-03 | Mistral Small 3.2 (24B) | Initial choice: good tool calling support |
| 2026-04-09 | GLM-4.7-Flash (30B) | Better quality, Spanish support, 198K context |
| 2026-04-09 | **Gemma 4 26B MoE** | 3-4x faster (MoE: only 4B active), less RAM, excellent quality |

### Benchmark: Same 4 Questions, Same Hardware

| Test (response time) | Gemini 2.5 Flash | Gemma 4 26B | GLM-4.7-Flash | Mistral 3.2 |
|----------------------|------------------|-------------|---------------|-------------|
| Summary (no tool) | **2s** | 5s | 7s | 10s |
| Go question (tool call) | **6s** | 13s | 42s | 50s |
| Spanish tech search | **4s** | 10s | 15s | No Spanish |
| All companies (heavy) | **3s** | 16s | 63s | Timeout |

### Quality Comparison

| Aspect | Gemini 2.5 Flash | Gemma 4 26B | GLM-4.7-Flash | Mistral 3.2 |
|--------|------------------|-------------|---------------|-------------|
| Language detection | Excellent | Excellent | Good | Poor |
| Tool calling | Native (ADK) | Via OpenAI-compatible adapter | Via OpenAI-compatible adapter | Via OpenAI-compatible adapter |
| CV navigation links | Correct | Partial | Rare | None |
| Response exhaustiveness | Very complete | Very complete | Complete | Acceptable |
| Hallucination rate | None (tool-grounded) | None | Low | Medium |

### Resource Usage

| Model | Total Parameters | Active Parameters | Disk | RAM (inference) | Works Offline |
|-------|------------------|-------------------|------|-----------------|---------------|
| Gemini 2.5 Flash | Cloud | Cloud | 0 | 0 | No |
| Gemma 4 26B MoE | 26B | **4B** | 18 GB | **~8 GB** | Yes |
| GLM-4.7-Flash | 30B | 30B | 19 GB | ~19 GB | Yes |
| Mistral Small 3.2 | 24B | 24B | 15 GB | ~15 GB | Yes |

### Why Gemma 4 26B MoE Wins for Local Development

1. **Mixture of Experts**: 26B parameters in total, but only 4B are activated per token, so inference speed is close to that of a 4B model while quality is closer to that of a much larger one
2. **3-4x faster** than the dense 30B GLM on the tool-call and heavy queries (13s vs 42s, 16s vs 63s) on the same hardware
3. **~8 GB RAM** vs ~19 GB for GLM, which leaves room for other dev tools
4. **Excellent Spanish**: responds in the correct language consistently
5. **Tool calling works**: compatible with the OpenAI-compatible Ollama adapter
6. **256K context**: largest context window of all local options tested

### Configuration

```bash
# .env for development (local Gemma 4)
OLLAMA_MODEL=gemma4:26b
# GOOGLE_API_KEY=   (leave empty to force local)

# .env for production (Gemini API)
GOOGLE_API_KEY=your-key
# Fallback model if Gemini fails
OLLAMA_MODEL=gemma4:26b
```

## 17. Dependencies

| Package | Purpose | Size Impact |
|---------|---------|-------------|
@@ -519,7 +590,7 @@ If the free tier is exceeded, Gemini returns a rate limit error, which the handl
No frontend dependencies are added. The chat widget uses HTMX and Hyperscript which are already loaded by the site.

-## 17. ADK Go Concepts Used
+## 18. ADK Go Concepts Used

| ADK Concept | Go Type / Function | Usage in This Project |
|-------------|-------------------|----------------------|
@@ -534,7 +605,7 @@ No frontend dependencies are added. The chat widget uses HTMX and Hyperscript wh
| Tool Context | `tool.Context` | Passed to the tool function by ADK; provides access to session and agent state |
| JSON Schema | `jsonschema:"..."` struct tags | Describes tool parameters to the LLM for function calling |

-## 18. Relation to Other Documentation
+## 19. Relation to Other Documentation

- **[01-ARCHITECTURE.md](01-ARCHITECTURE.md)** — Overall system design
- **[03-API.md](03-API.md)** — HTTP API reference (includes `POST /api/chat`)