From ff7cd4287b994103c1d32714edbc36ace55c75f0 Mon Sep 17 00:00:00 2001
From: juanatsap
Date: Thu, 9 Apr 2026 20:48:52 +0100
Subject: [PATCH] docs: add LLM provider evolution & benchmarks (section 16)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Document the full history of local LLM selection:
- Mistral → GLM → Gemma 4 26B MoE with reasons for each change
- Side-by-side benchmarks: Gemini vs Gemma4 vs GLM vs Mistral
- Quality comparison (language, tool calling, links, hallucination)
- Resource usage (params, RAM, disk, offline capability)
- Configuration examples for dev and prod
---
 doc/28-AI-CHAT-AGENT.md | 77 +++++++++++++++++++++++++++++++++++++++--
 1 file changed, 74 insertions(+), 3 deletions(-)

diff --git a/doc/28-AI-CHAT-AGENT.md b/doc/28-AI-CHAT-AGENT.md
index 0cef310..520e4ea 100644
--- a/doc/28-AI-CHAT-AGENT.md
+++ b/doc/28-AI-CHAT-AGENT.md
@@ -510,7 +510,78 @@ Gemini 2.5 Flash free tier provides **15 requests/minute** with no credit card r

 If the free tier is exceeded, Gemini returns a rate limit error, which the handler catches and displays as a generic error message to the user.

-## 16. Dependencies
+## 16. LLM Provider Evolution & Benchmarks
+
+### Provider Architecture
+
+The chat uses a **dual-provider** strategy with automatic fallback:
+
+```
+Primary: Gemini 2.5 Flash (Google API — production)
+ ↓ (if fails)
+Fallback: Gemma 4 26B MoE via Ollama (local — development)
+```
+
+In development, set `GOOGLE_API_KEY=""` to force the local model.
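Reviewer note (not part of the patch): the primary/fallback behavior described above could be sketched roughly as follows. This is only an illustration under stated assumptions: `Provider`, `stubProvider`, `selectProviders`, `generateWithFallback`, and `demo` are hypothetical names invented here, and the project's real ADK and Ollama wiring is not shown in this patch.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// Provider is a hypothetical abstraction over an LLM backend.
// The project's actual types are not visible in this patch.
type Provider interface {
	Name() string
	Generate(ctx context.Context, prompt string) (string, error)
}

// stubProvider stands in for a real backend so the sketch runs offline.
type stubProvider struct {
	name string
	fail bool // when true, Generate always returns an error
}

func (p stubProvider) Name() string { return p.name }

func (p stubProvider) Generate(ctx context.Context, prompt string) (string, error) {
	if p.fail {
		return "", errors.New("simulated failure")
	}
	return p.name + ": " + prompt, nil
}

// selectProviders returns the providers to try, in order. An empty API key
// skips the cloud provider entirely, mirroring GOOGLE_API_KEY="" in the doc.
func selectProviders(apiKey string, primary, local Provider) []Provider {
	if apiKey == "" {
		return []Provider{local}
	}
	return []Provider{primary, local}
}

// generateWithFallback returns the first successful response. If every
// provider fails, it returns all errors joined together (Go 1.20+).
func generateWithFallback(ctx context.Context, prompt string, providers ...Provider) (string, error) {
	if len(providers) == 0 {
		return "", errors.New("no providers configured")
	}
	var errs []error
	for _, p := range providers {
		out, err := p.Generate(ctx, prompt)
		if err == nil {
			return out, nil
		}
		errs = append(errs, fmt.Errorf("%s: %w", p.Name(), err))
	}
	return "", errors.Join(errs...)
}

// demo wires a failing "gemini" stub and a working "gemma4" stub.
func demo(apiKey string) (string, error) {
	gemini := stubProvider{name: "gemini", fail: true}
	gemma := stubProvider{name: "gemma4"}
	providers := selectProviders(apiKey, gemini, gemma)
	return generateWithFallback(context.Background(), "hello", providers...)
}

func main() {
	out, err := demo("some-key") // primary fails, local model answers
	fmt.Println(out, err)
}
```

Trying the providers in a fixed order keeps the policy easy to read in one place, and collecting every error means a total failure still reports what each backend returned.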
+
+### Model Selection History
+
+| Date | Local Model | Why Changed |
+|------|-------------|-------------|
+| 2026-03 | Mistral Small 3.2 (24B) | Initial choice — good tool calling support |
+| 2026-04-09 | GLM-4.7-Flash (30B) | Better quality, Spanish support, 198K context |
+| 2026-04-09 | **Gemma 4 26B MoE** | 3-4x faster (MoE: only 4B active), less RAM, excellent quality |
+
+### Benchmark: Same 4 Questions, Same Hardware
+
+| Test | Gemini 2.5 Flash | Gemma 4 26B | GLM-4.7-Flash | Mistral 3.2 |
+|------|-----------------|-------------|---------------|-------------|
+| Summary (no tool) | **2s** | 5s | 7s | 10s |
+| Go question (tool call) | **6s** | 13s | 42s | 50s |
+| Spanish tech search | **4s** | 10s | 15s | No español |
+| All companies (heavy) | **3s** | 16s | 63s | Timeout |
+
+### Quality Comparison
+
+| Aspect | Gemini 2.5 Flash | Gemma 4 26B | GLM-4.7-Flash | Mistral 3.2 |
+|--------|-----------------|-------------|---------------|-------------|
+| Language detection | Excellent | Excellent | Good | Poor |
+| Tool calling | Native (ADK) | Via OpenAI compat | Via OpenAI compat | Via OpenAI compat |
+| CV navigation links | Correct | Partial | Rare | None |
+| Response exhaustiveness | Very complete | Very complete | Complete | Acceptable |
+| Hallucination rate | None (tool-grounded) | None | Low | Medium |
+
+### Resource Usage
+
+| Model | Parameters | Active | Disk | RAM (inference) | Offline |
+|-------|-----------|--------|------|-----------------|---------|
+| Gemini 2.5 Flash | Cloud | Cloud | 0 | 0 | No |
+| Gemma 4 26B MoE | 26B | **4B** | 18GB | **~8GB** | Yes |
+| GLM-4.7-Flash | 30B | 30B | 19GB | ~19GB | Yes |
+| Mistral Small 3.2 | 24B | 24B | 15GB | ~15GB | Yes |
+
+### Why Gemma 4 26B MoE Wins for Local Development
+
+1. **Mixture of Experts**: 26B total but only 4B activated per token — inference speed close to a 4B model with quality of a much larger one
+2. **3-4x faster** than dense 30B models (GLM) on the same hardware
+3. **~8GB RAM** vs 19GB for GLM — leaves room for other dev tools
+4. **Excellent Spanish**: Responds in the correct language consistently
+5. **Tool calling works**: Compatible with the OpenAI-compatible Ollama adapter
+6. **256K context**: Largest context window of all local options tested
+
+### Configuration
+
+```bash
+# .env — Development (local Gemma 4)
+OLLAMA_MODEL=gemma4:26b
+# GOOGLE_API_KEY= (leave empty to force local)
+
+# .env — Production (Gemini API)
+GOOGLE_API_KEY=your-key
+OLLAMA_MODEL=gemma4:26b # fallback if Gemini fails
+```
+
+## 17. Dependencies

 | Package | Purpose | Size Impact |
 |---------|---------|-------------|
@@ -519,7 +590,7 @@ If the free tier is exceeded, Gemini returns a rate limit error, which the handl

 No frontend dependencies are added. The chat widget uses HTMX and Hyperscript which are already loaded by the site.

-## 17. ADK Go Concepts Used
+## 18. ADK Go Concepts Used

 | ADK Concept | Go Type / Function | Usage in This Project |
 |-------------|-------------------|----------------------|
@@ -534,7 +605,7 @@ No frontend dependencies are added. The chat widget uses HTMX and Hyperscript wh
 | Tool Context | `tool.Context` | Passed to the tool function by ADK; provides access to session and agent state |
 | JSON Schema | `jsonschema:"..."` struct tags | Describes tool parameters to the LLM for function calling |

-## 18. Relation to Other Documentation
+## 19. Relation to Other Documentation

 - **[01-ARCHITECTURE.md](01-ARCHITECTURE.md)** — Overall system design
 - **[03-API.md](03-API.md)** — HTTP API reference (includes `POST /api/chat`)