docs: update SEO guide — duplicate content fix, Search Console, AI-era strategy

- Document /text noindex + canonical header solution
- Add duplicate content prevention checklist
- Document Google Search Console verification setup
- Update files overview table with correct paths
- Add AI chat agent as modern SEO signal
2026-04-09 12:56:22 +01:00
parent ded519758b
commit f8b48b92a3
1 changed files with 53 additions and 4 deletions
@@ -1,7 +1,7 @@
 # SEO Implementation Guide

 **Project:** CV Interactive Website
-**Last Updated:** 2025-11-30
+**Last Updated:** 2026-04-09
 **Status:** Production Ready

 ---
@@ -197,6 +197,45 @@ curl -H "Accept: text/plain" https://juan.andres.morenorub.io/
 - Clean, structured text
 - All CV content preserved

+#### Duplicate Content Prevention (April 2026)
+
+**Problem discovered:** Google was indexing `/text` instead of the main HTML page, causing the plain text version to appear as the primary search result.
+
+**Root cause:** The `/text` endpoint served the same CV content as the HTML page but carried no SEO signals (no meta tags, no canonical, no noindex), so nothing told Google which URL was the preferred one. Google likely favored `/text` because plain text is easy to crawl and keyword-dense.
+
+**Solution implemented:**
+
+1. **`X-Robots-Tag: noindex, nofollow`** HTTP header on `/text` responses
+   - Tells search engines not to index the plain text version
+   - Does NOT block crawling — LLMs and text browsers can still access it
+   - Implementation: `internal/handlers/cv_text.go`
+
+2. **`Link` HTTP header with `rel="canonical"`** on `/text` responses
+   - Points to the HTML version: `<https://juan.andres.morenorub.io/?lang=en>; rel="canonical"`
+   - Tells search engines which version is the "official" one
+
+3. **robots.txt comment** (not a Disallow: `/text` stays intentionally crawlable for LLMs)
+   - `/text` remains accessible for AI crawlers, curl, and text browsers
+   - Search engine indexing is prevented only via the HTTP header, which crawlers can read only if they are allowed to fetch `/text`
+
+4. **Google Search Console verification**
+   - `<meta name="google-site-verification">` tag added to `<head>`
+   - Re-indexing requested manually for `/?lang=en` and `/?lang=es`
+   - Removal of `/text` from the search index requested manually
+
+**Verification:**
+```bash
+# Check that /text has noindex header:
+curl -sI 'https://juan.andres.morenorub.io/text?lang=en' | grep X-Robots
+# → X-Robots-Tag: noindex, nofollow
+
+# Check canonical points to HTML version:
+curl -sI 'https://juan.andres.morenorub.io/text?lang=en' | grep Link
+# → Link: <https://juan.andres.morenorub.io/?lang=en>; rel="canonical"
+```
+
+**Key principle:** The `/text` endpoint is for **consumption** (LLMs, terminals), not for **discovery** (search engines). Search results should always point to the rich HTML version with structured data, icons, and the AI chat agent.
+
 ---

 #### robots.txt AI Bot Rules (`static/robots.txt`)
@@ -253,13 +292,14 @@ The implementation supports Google's E-E-A-T (Experience, Expertise, Authority,

 | File | Purpose |
 |------|---------|
-| `templates/index.html` | Meta tags, JSON-LD schemas |
+| `templates/partials/layout/head.html` | Meta tags, canonical, hreflang, Google verification |
+| `templates/partials/layout/head-structured-data.html` | JSON-LD schemas (Person, WebSite, etc.) |
 | `static/robots.txt` | Search engine and AI bot directives |
-| `static/llms.txt` | AI crawler information file |
+| `static/llms.txt` | AI crawler information file (llmstxt.org) |
 | `static/sitemap.xml` | XML sitemap for search engines |
 | `data/cv-en.json` | SEO fields (pageTitle, metaTitle, etc.) |
 | `data/cv-es.json` | Spanish SEO fields |
-| `/text` endpoint | Plain text CV for CLI/TUI browsers |
+| `internal/handlers/cv_text.go` | Plain text endpoint with noindex + canonical headers |
 | `templates/cv-text.txt` | Plain text template |

 ---
@@ -324,6 +364,15 @@ Test at: [Google Robots.txt Tester](https://www.google.com/webmasters/tools/robo
 - [x] Comprehensive JSON-LD schemas
 - [x] AI bot permissions in robots.txt
 - [x] Clear, parseable content structure
+- [x] AI chat agent (Gemini) for interactive CV queries
+- [x] Plain text endpoint for LLM consumption (noindex for search engines)
+- [x] Google Search Console verified and monitored
+
+### Duplicate Content Prevention
+- [x] `/text` endpoint: `X-Robots-Tag: noindex, nofollow`
+- [x] `/text` endpoint: `Link` header with `rel="canonical"` pointing to HTML version
+- [x] Sitemap only contains HTML pages (not `/text`)
+- [x] Canonical URLs on all HTML pages
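The sitemap item in the checklist above can be checked mechanically. The sketch below is one possible approach, assuming the sitemap follows the standard `<urlset><url><loc>` layout; `urlSet` and `textURLsInSitemap` are hypothetical names, not part of the project.

```go
package main

import (
	"encoding/xml"
	"net/url"
	"strings"
)

// urlSet mirrors only the parts of a sitemap needed here: each <url><loc>.
type urlSet struct {
	URLs []struct {
		Loc string `xml:"loc"`
	} `xml:"url"`
}

// textURLsInSitemap returns every <loc> whose path is /text or below it.
// An empty result means the sitemap lists no plain text URLs.
func textURLsInSitemap(data []byte) ([]string, error) {
	var set urlSet
	if err := xml.Unmarshal(data, &set); err != nil {
		return nil, err
	}
	var hits []string
	for _, u := range set.URLs {
		loc := strings.TrimSpace(u.Loc)
		parsed, err := url.Parse(loc)
		if err != nil {
			return nil, err
		}
		if parsed.Path == "/text" || strings.HasPrefix(parsed.Path, "/text/") {
			hits = append(hits, loc)
		}
	}
	return hits, nil
}
```

Feeding it the bytes of `static/sitemap.xml` (for example via `os.ReadFile`) and failing a test when the returned slice is non-empty would keep this checklist item from regressing.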

 ---