From f8b48b92a3d07420185154aab0ce0190ab11f390 Mon Sep 17 00:00:00 2001
From: juanatsap
Date: Thu, 9 Apr 2026 12:56:22 +0100
Subject: [PATCH] =?UTF-8?q?docs:=20update=20SEO=20guide=20=E2=80=94=20dupl?=
 =?UTF-8?q?icate=20content=20fix,=20Search=20Console,=20AI-era=20strategy?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

- Document /text noindex + canonical header solution
- Add duplicate content prevention checklist
- Document Google Search Console verification setup
- Update files overview table with correct paths
- Add AI chat agent as modern SEO signal
---
 doc/15-SEO.md | 57 +++++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 53 insertions(+), 4 deletions(-)

diff --git a/doc/15-SEO.md b/doc/15-SEO.md
index 18cca06..d07450b 100644
--- a/doc/15-SEO.md
+++ b/doc/15-SEO.md
@@ -1,7 +1,7 @@
 # SEO Implementation Guide
 
 **Project:** CV Interactive Website
-**Last Updated:** 2025-11-30
+**Last Updated:** 2026-04-09
 **Status:** Production Ready
 
 ---
@@ -197,6 +197,45 @@ curl -H "Accept: text/plain" https://juan.andres.morenorub.io/
 - Clean, structured text
 - All CV content preserved
 
+#### Duplicate Content Prevention (April 2026)
+
+**Problem discovered:** Google was indexing `/text` instead of the main HTML page, causing the plain text version to appear as the primary search result.
+
+**Root cause:** The `/text` endpoint served the same CV content as the HTML page but with no SEO signals (no meta tags, no canonical, no noindex). Google favored it because plain text is easier to crawl and has dense keyword content.
+
+**Solution implemented:**
+
+1. **`X-Robots-Tag: noindex, nofollow`** HTTP header on `/text` responses
+   - Tells search engines not to index the plain text version
+   - Does NOT block crawling — LLMs and text browsers can still access it
+   - Implementation: `internal/handlers/cv_text.go`
+
+2. **`Link: canonical`** HTTP header on `/text` responses
+   - Points to the HTML version: `; rel="canonical"`
+   - Tells search engines which version is the "official" one
+
+3. **robots.txt comment** (not a Disallow — intentionally crawlable for LLMs)
+   - `/text` remains accessible for AI crawlers, curl, and text browsers
+   - Only search engine indexing is prevented via the HTTP header
+
+4. **Google Search Console verification**
+   - `` tag added to ``
+   - Manual re-indexation requested for `/?lang=en` and `/?lang=es`
+   - Manual removal of `/text` from search index
+
+**Verification:**
+```bash
+# Check that /text has noindex header:
+curl -sI 'https://juan.andres.morenorub.io/text?lang=en' | grep X-Robots
+# → X-Robots-Tag: noindex, nofollow
+
+# Check canonical points to HTML version:
+curl -sI 'https://juan.andres.morenorub.io/text?lang=en' | grep Link
+# → Link: ; rel="canonical"
+```
+
+**Key principle:** The `/text` endpoint is for **consumption** (LLMs, terminals), not for **discovery** (search engines). Search results should always point to the rich HTML version with structured data, icons, and the AI chat agent.
+
 ---
 
 #### robots.txt AI Bot Rules (`static/robots.txt`)
@@ -253,13 +292,14 @@ The implementation supports Google's E-E-A-T (Experience, Expertise, Authority,
 
 | File | Purpose |
 |------|---------|
-| `templates/index.html` | Meta tags, JSON-LD schemas |
+| `templates/partials/layout/head.html` | Meta tags, canonical, hreflang, Google verification |
+| `templates/partials/layout/head-structured-data.html` | JSON-LD schemas (Person, WebSite, etc.) |
 | `static/robots.txt` | Search engine and AI bot directives |
-| `static/llms.txt` | AI crawler information file |
+| `static/llms.txt` | AI crawler information file (llmstxt.org) |
 | `static/sitemap.xml` | XML sitemap for search engines |
 | `data/cv-en.json` | SEO fields (pageTitle, metaTitle, etc.) |
 | `data/cv-es.json` | Spanish SEO fields |
-| `/text` endpoint | Plain text CV for CLI/TUI browsers |
+| `internal/handlers/cv_text.go` | Plain text endpoint with noindex + canonical headers |
 | `templates/cv-text.txt` | Plain text template |
 
 ---
@@ -324,6 +364,15 @@ Test at: [Google Robots.txt Tester](https://www.google.com/webmasters/tools/robo
 - [x] Comprehensive JSON-LD schemas
 - [x] AI bot permissions in robots.txt
 - [x] Clear, parseable content structure
+- [x] AI chat agent (Gemini) for interactive CV queries
+- [x] Plain text endpoint for LLM consumption (noindex for search engines)
+- [x] Google Search Console verified and monitored
+
+### Duplicate Content Prevention
+- [x] `/text` endpoint: `X-Robots-Tag: noindex, nofollow`
+- [x] `/text` endpoint: `Link: canonical` pointing to HTML version
+- [x] Sitemap only contains HTML pages (not `/text`)
+- [x] Canonical URLs on all HTML pages
 
 ---
 