One site, all issues reproduced. Each section maps to one audit rule so it can be fixed and verified independently.
Audit rule: robots_has_errors | The broken robots.txt is live at /robots.txt.
Because Disallow:/staging is missing the space after the colon, the rule may be ignored — and Google crawls your staging pages anyway.
Imagine you put a sign on a door that says "Staff only" but you misspelled it as "Staf fonly".
The security guard (Google) cannot read it, ignores it, and lets everyone through.
That is exactly what happens with a malformed robots.txt.
Common real-world damage: your /staging or /admin pages end up indexed in Google
because the Disallow line had a syntax error and was silently skipped.
View live broken /robots.txt →
```
# 9 real syntax errors — each one silently ignored by crawlers
User-agent:*                     ← Error 1: missing space after colon
Disallow:/staging                ← Error 2: missing space → staging gets indexed
Disallow: /admin                 ← Error 3: no trailing slash → /administrator not blocked
Disallow: /api/*                 ← Error 4: wildcard only works in Google/Bing, others ignore it
Disallow: /checkout$             ← Error 5: dollar anchor only Google supports
Allow: /staging/public # comment ← Error 6: inline comment breaks the path value
User-agent: Googlebot            ← Error 7: no blank line before this group → bleeds into above
Disallow: /private
Crawl-delay:5                    ← Error 8: missing space → crawl-delay ignored
SITEMAP: https://...             ← Error 9: wrong casing, should be "Sitemap:"
```
```
# Correct robots.txt
User-agent: *
Disallow: /staging/
Disallow: /admin/
Disallow: /api/
Disallow: /checkout/
Allow: /staging/public/
Crawl-delay: 5

# ↑ blank line separates groups
User-agent: Googlebot
Disallow: /private/

# ↑ blank line before Sitemap
Sitemap: https://robots-txt-lab.pages.dev/sitemap.xml
```
| # | Broken | Problem | Real-world impact |
|---|---|---|---|
| 1 | User-agent:* | Missing space after : | Entire rule group may be ignored |
| 2 | Disallow:/staging | Missing space after : | Staging pages indexed in Google |
| 3 | Disallow: /admin | No trailing slash | Blocks /admin but not /administrator |
| 4 | Disallow: /api/* | Wildcard * in path | Only Google/Bing honour it; other crawlers ignore the line |
| 5 | Disallow: /checkout$ | Dollar anchor $ | Only Google supports it; other bots skip the line |
| 6 | Allow: /path # comment | Inline comment after value | Comment becomes part of the path — directive is broken |
| 7 | No blank line between groups | Missing group separator | Groups bleed together; Googlebot rules applied to all bots |
| 8 | Crawl-delay:5 | Missing space after : | Crawl-delay skipped; bot may hammer the server |
| 9 | SITEMAP: https://... | Wrong field-name casing | Sitemap not discovered by strict parsers |
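How strictly each of these lines is treated varies by parser. As a quick sanity check on the corrected file, Python's stdlib `urllib.robotparser` can parse it directly — a minimal sketch (example.com stands in for your domain; the `Allow: /staging/public/` line is deliberately left out, since the stdlib parser applies rules first-match rather than Google's longest-match precedence and would wrongly block that path):

```python
from urllib import robotparser

# Corrected robots.txt from above, minus the Allow line (see lead-in)
ROBOTS = """\
User-agent: *
Disallow: /staging/
Disallow: /admin/
Crawl-delay: 5

User-agent: Googlebot
Disallow: /private/
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS.splitlines())

assert not rp.can_fetch("*", "https://example.com/staging/secret")    # blocked
assert rp.can_fetch("*", "https://example.com/pricing")               # allowed
assert not rp.can_fetch("Googlebot", "https://example.com/private/x") # group applies
assert rp.crawl_delay("*") == 5                                       # delay parsed
```

If any assert fires after you edit your own robots.txt, a directive did not parse the way you intended.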
Audit rule: sitemap_big | The large sitemap is live at /sitemap.xml (generate it first — see deploy steps).
Google caps a single sitemap.xml at 50,000 URLs and 50 MB (uncompressed).
If your sitemap exceeds either limit, Google silently truncates it — URLs past the cut-off are never discovered or indexed.
No error is surfaced to you; pages just quietly disappear from search.
Think of Google's crawler as a delivery driver with a truck that holds exactly 50,000 packages. If you hand it 50,001, the driver takes the first 50,000 and drives away. Package #50,001 sits on the pavement and is never delivered — and you never find out.
In practice this hits large e-commerce sites (product + variant pages), news sites (article archives), or any site that auto-generates a single flat sitemap instead of a paginated sitemap index.
This site's sitemap.xml contains 50,001 URLs — one over the limit.
```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://robots-txt-lab.pages.dev/</loc></url>
  <url><loc>https://robots-txt-lab.pages.dev/page-1</loc></url>
  <url><loc>https://robots-txt-lab.pages.dev/page-2</loc></url>
  ... (continues for 50,001 total entries) ← Google silently stops at 50,000
```
Split into multiple sitemaps and serve a sitemap index — one master file that lists the child sitemaps, each under 50,000 URLs.
```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- /sitemap.xml — the INDEX (always under limit) -->
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://example.com/sitemap-1.xml</loc></sitemap> <!-- URLs 1-50000 -->
  <sitemap><loc>https://example.com/sitemap-2.xml</loc></sitemap> <!-- URLs 50001-... -->
</sitemapindex>
```
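One way to produce the child sitemaps plus the index, sketched in Python (example.com, the `sitemap-N.xml` naming, and the bare `<loc>`-only entries are illustrative — a real generator would typically add `<lastmod>` and write the files to disk):

```python
CHUNK = 50_000  # Google's per-sitemap URL limit

def build_sitemaps(urls, base="https://example.com"):
    """Split a flat URL list into <=50k-URL child sitemaps plus an index."""
    chunks = [urls[i:i + CHUNK] for i in range(0, len(urls), CHUNK)]
    files = {}
    for n, chunk in enumerate(chunks, 1):
        body = "\n".join(f"  <url><loc>{u}</loc></url>" for u in chunk)
        files[f"sitemap-{n}.xml"] = (
            '<?xml version="1.0" encoding="UTF-8"?>\n'
            '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
            f"{body}\n</urlset>"
        )
    index_body = "\n".join(
        f"  <sitemap><loc>{base}/sitemap-{n}.xml</loc></sitemap>"
        for n in range(1, len(chunks) + 1)
    )
    files["sitemap.xml"] = (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
        f"{index_body}\n</sitemapindex>"
    )
    return files

# 50,001 URLs — one over the limit — now split cleanly into two children
files = build_sitemaps([f"https://example.com/page-{i}" for i in range(50_001)])
print(sorted(files))  # ['sitemap-1.xml', 'sitemap-2.xml', 'sitemap.xml']
```

Nothing is truncated: URL #50,001 simply becomes the single entry of sitemap-2.xml.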
| Channel | Impact |
|---|---|
| SEO | Pages past URL #50,000 are never submitted to Google — they rely entirely on link discovery, which may never happen for deep pages. |
| AEO (AI Engines) | AI crawlers (GPTBot, PerplexityBot) that follow sitemaps also stop at the truncation point. Those pages are invisible to AI overviews. |
| GEO (Generative Engine) | Content past truncation cannot be cited in AI-generated responses that rely on indexed content. |
Audit rule: sitemap_4xx | The broken sitemap is live at /sitemap-4xx.xml.
The sitemap.xml lists URLs that return a 4XX HTTP error (most commonly 404 Not Found or 410 Gone).
Every time Google crawls the sitemap and hits a dead URL, it loses a little trust in the sitemap as a whole.
Over time, Google crawls it less often and may stop prioritising newly added pages.
Imagine giving a tour guide a list of 10 rooms to show visitors. They walk to room 4 and the door is missing — the room was demolished. After the third demolished room, the tour guide starts doubting your whole list and skips checking new rooms you add later.
This happens constantly on real sites: a product is deleted, a blog post is removed, a URL is restructured — but nobody updates the sitemap. The dead URLs pile up and Google's trust in the sitemap degrades.
View live /sitemap-4xx.xml → — contains 1 valid URL and 9 dead URLs (none of those pages exist on this site → all return 404).
```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://robots-txt-lab.pages.dev/</loc></url>                      ← ✅ 200 OK
  <url><loc>https://robots-txt-lab.pages.dev/deleted-product</loc></url>       ← ❌ 404
  <url><loc>https://robots-txt-lab.pages.dev/old-blog-post</loc></url>         ← ❌ 404
  <url><loc>https://robots-txt-lab.pages.dev/discontinued-service</loc></url>  ← ❌ 404
  <url><loc>https://robots-txt-lab.pages.dev/removed-about-us</loc></url>      ← ❌ 404
  <url><loc>https://robots-txt-lab.pages.dev/old-pricing</loc></url>           ← ❌ 404
  <url><loc>...5 more dead URLs...</loc></url>
</urlset>
```
Only include URLs that return 200. Remove or redirect dead pages before adding them to the sitemap.
```xml
<!-- Fixed: only verified-live URLs -->
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/about</loc></url>
  <url><loc>https://example.com/products</loc></url>
  <!-- deleted-product removed — 301 redirected, not listed -->
</urlset>
```
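Auditing this means extracting every `<loc>` and requesting each URL. The extraction half is pure stdlib; the HTTP half (a HEAD request per URL, dropping anything that isn't 200) is left as a comment since it needs network access — a sketch, with example.com standing in for your domain:

```python
import xml.etree.ElementTree as ET

NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def sitemap_locs(xml_text):
    """Return every <loc> URL listed in a urlset sitemap."""
    root = ET.fromstring(xml_text)
    return [loc.text for loc in root.iter(f"{NS}loc")]

SITEMAP = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/deleted-product</loc></url>
</urlset>"""

for url in sitemap_locs(SITEMAP):
    # In a real audit: HEAD-request each URL (e.g. with urllib.request)
    # and remove any entry that does not come back 200.
    print(url)
```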
| Channel | Impact |
|---|---|
| SEO | Dead URLs erode Google's trust in the sitemap. Google reduces crawl frequency of the sitemap, so new pages take longer to be discovered. |
| AEO (AI Engines) | AI crawlers that follow the sitemap (GPTBot, PerplexityBot) waste crawl budget on dead URLs and may miss new content. |
| GEO | Pages not crawled are not indexed, so they cannot appear in AI-generated answers that rely on indexed content. |
Audit rule: sitemap_noindex | The broken sitemap is at /sitemap-noindex.xml. The noindex pages: staging-page, admin-preview, draft-post.
The sitemap tells Google "please crawl and index these pages", but each listed page's HTML contains <meta name="robots" content="noindex">, which tells Google "don't index this page".
These two instructions directly contradict each other — Google wastes crawl budget visiting pages it then has to discard.
Imagine sending a formal invitation to a party, but when the guest arrives the door says "No entry". You wasted everyone's time. The guest (Google) showed up, spent time getting there, then had to turn around.
The real-world pattern that causes this: a developer adds noindex to a staging or draft page
to keep it out of search results, but forgets to remove it from the auto-generated sitemap.
Or: a CMS generates the sitemap from all published pages, including ones that marketing tagged as noindex.
Three pages exist and return 200 OK, but each has
<meta name="robots" content="noindex"> in the HTML head.
All three are listed in sitemap-noindex.xml.
```html
<!-- sitemap-noindex.xml — tells Google to visit these pages -->
<url><loc>https://robots-txt-lab.pages.dev/</loc></url>              ← ✅ indexable
<url><loc>https://robots-txt-lab.pages.dev/staging-page</loc></url>  ← ❌ noindex
<url><loc>https://robots-txt-lab.pages.dev/admin-preview</loc></url> ← ❌ noindex
<url><loc>https://robots-txt-lab.pages.dev/draft-post</loc></url>    ← ❌ noindex

<!-- Each noindex page has this in its <head> -->
<meta name="robots" content="noindex, nofollow">                     ← contradicts being in sitemap
```
Two valid options — pick one per page:
```html
<!-- Option A: Remove the noindex tag (make the page indexable) -->
<meta name="robots" content="index, follow">  <!-- or just remove the meta tag entirely -->

<!-- Option B: Remove the URL from the sitemap (keep noindex, stop wasting crawl budget) -->
<!-- simply delete the <url> entry from sitemap.xml -->

<!-- Never do both at the same time in opposite directions -->
```
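To catch the contradiction automatically, fetch each sitemap URL and scan its HTML for a robots meta tag. A minimal detection sketch with Python's stdlib `html.parser` (the sample HTML snippets are illustrative):

```python
from html.parser import HTMLParser

class RobotsMetaFinder(HTMLParser):
    """Collect the content value of every <meta name="robots"> tag."""
    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "meta" and a.get("name", "").lower() == "robots":
            self.directives.append(a.get("content", "").lower())

def has_noindex(html):
    finder = RobotsMetaFinder()
    finder.feed(html)
    return any("noindex" in d for d in finder.directives)

# A page like /staging-page: listed in the sitemap yet marked noindex → flag it
print(has_noindex('<head><meta name="robots" content="noindex, nofollow"></head>'))  # True
print(has_noindex('<head><title>Fine to index</title></head>'))                      # False
```

Any sitemap URL for which `has_noindex` returns True is sending Google the contradictory signal described above.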
| Channel | Impact |
|---|---|
| SEO | Contradictory signal confuses Google. Crawl budget is spent on pages that will never rank. Google's trust in the sitemap degrades over time. |
| AEO (AI Engines) | Once noindex is honoured, those pages are invisible to AI Overviews and AI-generated answers — even if listed in the sitemap. |
| GEO | Same as AEO — noindex pages are excluded from generative engine content regardless of sitemap presence. |
Audit rule: sitemap_noindex (same rule ID, different detection method) | Broken sitemap: /sitemap-noindex-header.xml | Pages: pdf-report, user-data-export, internal-dashboard
Issue ④ set <meta name="robots" content="noindex"> in the HTML — visible in View Source. This variant sends X-Robots-Tag: noindex as an HTTP response header instead —
completely invisible when you View Source. You can only detect it via the DevTools Network tab or a tool that checks headers.
Same rule, different mechanism, much harder to spot manually.
| | Meta Tag (issue ④) | HTTP Header (issue ④-alt) |
|---|---|---|
| Where set | Inside <head> of HTML | HTTP response header from server |
| Visible in View Source? | ✅ Yes | ❌ No — only in DevTools Network tab |
| Works for non-HTML? | ❌ HTML only | ✅ PDFs, images, any resource |
| How set on this site | In .html file head | Via Cloudflare Pages _headers file |
| Audit tool detection | HTML parser | HTTP response header check |
Three pages have no noindex meta tag in their HTML — they look clean in View Source.
But the _headers file tells Cloudflare Pages to send X-Robots-Tag: noindex
in the HTTP response. All three are listed in sitemap-noindex-header.xml.
```
# _headers file (Cloudflare Pages)
/pdf-report.html
  X-Robots-Tag: noindex
/user-data-export.html
  X-Robots-Tag: noindex, nofollow
/internal-dashboard.html
  X-Robots-Tag: noindex, nofollow

# HTML source of these pages has NO <meta name="robots"> — looks clean
# The noindex only shows up in the HTTP response headers
```
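For a local audit, the `_headers` syntax is simple enough to parse yourself: an unindented line starts a path rule, and the indented `Name: value` lines below it attach headers to that path. A minimal sketch (an approximation of Cloudflare Pages' parsing — it ignores placeholders, splats, and `!` removals):

```python
def parse_headers_file(text):
    """Map each path rule in a Cloudflare Pages _headers file to its headers."""
    rules, path = {}, None
    for line in text.splitlines():
        if not line.strip() or line.lstrip().startswith("#"):
            continue                         # skip blanks and comments
        if not line[0].isspace():            # unindented → new path rule
            path = line.strip()
            rules[path] = {}
        else:                                # indented → header for current path
            name, _, value = line.strip().partition(":")
            rules[path][name.strip()] = value.strip()
    return rules

HEADERS = """\
/pdf-report.html
  X-Robots-Tag: noindex
/internal-dashboard.html
  X-Robots-Tag: noindex, nofollow
"""
print(parse_headers_file(HEADERS)["/pdf-report.html"])
# {'X-Robots-Tag': 'noindex'}
```

Cross-reference the parsed paths against your sitemap entries: any URL appearing in both lists carries the same contradiction as the meta-tag variant.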
```bash
# Method 1 — curl (checks HTTP headers, not HTML)
curl -I https://robots-txt-lab.pages.dev/pdf-report.html
# Look for: X-Robots-Tag: noindex in the output

# Method 2 — View Source (should show NO noindex meta tag — that is the point)
# Open /pdf-report.html → View Source → search "noindex" → not found in HTML

# Method 3 — DevTools
# Open /pdf-report.html → F12 → Network tab → click the page request
# → Response Headers → look for X-Robots-Tag: noindex
```
| Channel | Impact |
|---|---|
| SEO | Same as meta tag noindex — Google respects it and won't index. The sitemap entry wastes crawl budget. Harder for developers to notice and fix because it's not visible in source. |
| AEO / GEO | Once X-Robots-Tag noindex is honoured, those pages are invisible to AI engines regardless of sitemap presence — same outcome as meta tag variant. |
Audit rule: sitemap_non_canonical | Broken sitemap: /sitemap-non-canonical.xml | Non-canonical pages: product-red, blog-page-2, print-version
Each page declares its official URL with <link rel="canonical" href="...">.
If a page's canonical points to a different URL, that page is non-canonical — a duplicate.
Your sitemap should only list the canonical (official) version of each URL, never the duplicates.
Think of canonical as a page saying "I am a copy — the real one lives over there." If your sitemap lists the copy instead of the original, you are pointing Google to a page that itself says "ignore me, go elsewhere." Google follows the canonical and loses trust in your sitemap's accuracy.
Three common real-world patterns that cause this:
| Pattern | Non-canonical URL in sitemap | Should be |
|---|---|---|
| Colour/size variant | /product-red → canonical: /product | List /product only |
| Paginated page | /blog-page-2 → canonical: /blog | List /blog only |
| Print version | /print-version → canonical: /article | List /article only |
Three pages exist and return 200 OK, but each declares a different URL as its canonical. All three are listed in sitemap-non-canonical.xml.
```html
<!-- sitemap-non-canonical.xml -->
<url><loc>https://robots-txt-lab.pages.dev/</loc></url>              ← ✅ self-canonical
<url><loc>https://robots-txt-lab.pages.dev/product-red</loc></url>   ← ❌ canonical is /product
<url><loc>https://robots-txt-lab.pages.dev/blog-page-2</loc></url>   ← ❌ canonical is /blog
<url><loc>https://robots-txt-lab.pages.dev/print-version</loc></url> ← ❌ canonical is /article

<!-- What each non-canonical page says in its <head> -->
<link rel="canonical" href="https://robots-txt-lab.pages.dev/product"> ← mismatch with sitemap
<link rel="canonical" href="https://robots-txt-lab.pages.dev/blog">    ← mismatch with sitemap
<link rel="canonical" href="https://robots-txt-lab.pages.dev/article"> ← mismatch with sitemap
```
```html
<!-- Fixed sitemap: only list canonical URLs -->
<url><loc>https://robots-txt-lab.pages.dev/</loc></url>
<url><loc>https://robots-txt-lab.pages.dev/product</loc></url> ← canonical, not the variant
<url><loc>https://robots-txt-lab.pages.dev/blog</loc></url>    ← page 1 only, not /blog-page-2
<url><loc>https://robots-txt-lab.pages.dev/article</loc></url> ← main article, not print version
```
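Auditing this automatically means comparing each page's declared canonical against the URL the sitemap lists for it. A sketch with the stdlib HTML parser (the URLs and page snippet are illustrative):

```python
from html.parser import HTMLParser

class CanonicalFinder(HTMLParser):
    """Grab the href of the first <link rel="canonical"> in a page."""
    def __init__(self):
        super().__init__()
        self.href = None

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "link" and a.get("rel", "").lower() == "canonical":
            if self.href is None:
                self.href = a.get("href")

def canonical_of(html):
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.href

# Hypothetical sitemap entry whose page points elsewhere
page_url = "https://example.com/product-red"
html = '<head><link rel="canonical" href="https://example.com/product"></head>'

canon = canonical_of(html)
if canon and canon != page_url:
    print(f"non-canonical: {page_url} → {canon}")  # flag: drop from sitemap
```

A mismatch means the sitemap entry should be replaced by the canonical URL (or removed, if the canonical is already listed).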
| Channel | Impact |
|---|---|
| SEO | Google ignores the non-canonical sitemap entries and consolidates signals to the canonical. Ranking credit is not diluted, but the sitemap signal is wasted and trust in the sitemap degrades. |
| AEO (AI Engines) | AI crawlers that rely on sitemaps for discovery may follow non-canonical URLs and attribute content to the wrong page, shifting citation accuracy. |
| GEO | Same as AEO — content attribution in AI-generated answers may reference the duplicate URL instead of the canonical one. |
Two related rules covered here:
title_duplicate — multiple pages share the same <title> text |
title_multiple — a single page has more than one <title> tag
Every page should have exactly one <title> tag that describes exactly what that page is about.
When multiple pages share the same title, Google cannot tell them apart and splits ranking signals between them.
When one page has two title tags, Google's behaviour is undefined — it may use either one, or rewrite both.
title_duplicate: Imagine three job applicants sending identical CVs with the same name on them.
The employer cannot tell who is who and picks one at random. The other two are ignored.
That is what happens when three pages all have <title>Our Services | Demo Company</title>.
title_multiple: Imagine a letter with two subject lines that contradict each other.
The reader is confused about what the letter is about. Same problem for Google when a page has
<title>Contact Us</title> AND <title>About Us</title> in the same HTML.
title_duplicate — three pages, one identical title:
```html
<!-- service-a.html -->
<title>Our Services | Demo Company</title> ← ❌ duplicate
<!-- service-b.html -->
<title>Our Services | Demo Company</title> ← ❌ duplicate
<!-- service-c.html -->
<title>Our Services | Demo Company</title> ← ❌ duplicate

<!-- What they should be -->
<title>Web Design Services | Demo Company</title>
<title>SEO Consulting | Demo Company</title>
<title>Content Marketing | Demo Company</title>
```
title_multiple — two title tags on one page (multiple-titles.html):
```html
<title>First Title Tag — Contact Us</title> ← ❌ tag 1 (browsers typically show this one)
... rest of <head> ...
<title>Second Title Tag — About Us</title>  ← ❌ tag 2 (which one Google uses is undefined)

<!-- Fix: keep exactly one -->
<title>Contact Us | Demo Company</title>
```
| Page | Issue | Title |
|---|---|---|
| service-a.html | title_duplicate | Our Services \| Demo Company |
| service-b.html | title_duplicate | Our Services \| Demo Company |
| service-c.html | title_duplicate | Our Services \| Demo Company |
| multiple-titles.html | title_multiple | Has two separate `<title>` tags |
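Both checks are easy to script: count `<title>` tags per page for title_multiple, then count identical title texts across pages for title_duplicate. A sketch with stdlib tools (the page snippets are illustrative stand-ins for fetched HTML):

```python
from collections import Counter
from html.parser import HTMLParser

class TitleCollector(HTMLParser):
    """Collect the text of every <title> element on a page."""
    def __init__(self):
        super().__init__()
        self.titles, self._in_title = [], False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
            self.titles.append("")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.titles[-1] += data

def titles_of(html):
    collector = TitleCollector()
    collector.feed(html)
    return collector.titles

pages = {
    "service-a.html": "<title>Our Services | Demo Company</title>",
    "service-b.html": "<title>Our Services | Demo Company</title>",
    "multiple-titles.html": "<title>Contact Us</title><title>About Us</title>",
}

# title_multiple: more than one <title> on a single page
multi = [p for p, h in pages.items() if len(titles_of(h)) > 1]
print(multi)  # ['multiple-titles.html']

# title_duplicate: the same title text appearing on more than one page
counts = Counter(t for h in pages.values() for t in titles_of(h))
dupes = [t for t, n in counts.items() if n > 1]
print(dupes)  # ['Our Services | Demo Company']
```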
| Channel | Impact |
|---|---|
| SEO | title_duplicate: Keyword cannibalization — all three pages compete for the same query. Google picks one to rank and demotes or ignores the others. title_multiple: Google may pick either title or rewrite the SERP snippet entirely. |
| AEO | Duplicate titles split citation signals. AI engines cannot clearly attribute which page answers which question when titles are identical. |
| GEO | Same as AEO — generative engines use page titles as strong signals for content topic identification. |
Audit rule: redirect_chain | Three chains reproduced via the _redirects file: Chain A (5 hops), Chain B (3 hops), Chain C (4 hops).
Imagine asking for directions and being told "go to the post office" — then the post office says "go to the library" — the library says "go to the park" — the park says "go to the cafe" — the cafe finally has what you need. By hop 3, most people give up. Google's crawler behaves the same way.
How chains accumulate in real projects:
| Cause | Example |
|---|---|
| URL restructured multiple times | /old-url → /v2/old-url → /v2/new-url → /new-url |
| HTTP → HTTPS migration + www removal | http://www.site.com/page → https://www.site.com/page → https://site.com/page |
| CMS slug change on top of old redirects | Each slug change adds a new hop instead of updating the original redirect |
| Platform migration | Old platform redirect + new platform redirect + canonical redirect |
```
# Chain A — 5 hops (worst case)
/chain-a-hop-1 301→ /chain-a-hop-2 301→ /chain-a-hop-3 301→ /chain-a-hop-4 301→ /chain-a-hop-5 301→ /chain-a-final

# Chain B — 3 hops (URL slug restructured twice)
/old-product-url 301→ /products/v2/item 301→ /products/item 301→ /chain-b-final

# Chain C — 4 hops (migration done in stages)
/legacy-page 301→ /legacy-page-v2 301→ /legacy-page-v3 301→ /legacy-page-v4 301→ /chain-c-final
```
```
# Broken: 5-hop chain
/chain-a-hop-1  /chain-a-hop-2  301
/chain-a-hop-2  /chain-a-hop-3  301
... (3 more hops)

# Fixed: collapse all hops into one direct redirect
/chain-a-hop-1  /chain-a-final  301
/chain-a-hop-2  /chain-a-final  301
/chain-a-hop-3  /chain-a-final  301
/chain-a-hop-4  /chain-a-final  301
/chain-a-hop-5  /chain-a-final  301

# Every old URL points directly to the final destination in a single hop
```
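The collapse step itself is mechanical: follow each chain to its end, then rewrite every source to point there directly. A sketch (the paths are the lab's Chain A, shortened to three hops; `max_hops` guards against redirect loops):

```python
def collapse(redirects, max_hops=10):
    """Rewrite every redirect source to point straight at its final target."""
    flat = {}
    for src in redirects:
        dst, hops = redirects[src], 0
        while dst in redirects and hops < max_hops:  # follow the chain
            dst, hops = redirects[dst], hops + 1
        flat[src] = dst
    return flat

chain_a = {
    "/chain-a-hop-1": "/chain-a-hop-2",
    "/chain-a-hop-2": "/chain-a-hop-3",
    "/chain-a-hop-3": "/chain-a-final",
}
for src, dst in collapse(chain_a).items():
    print(f"{src}  {dst}  301")  # lines ready for a _redirects file
```

Every source ends up pointing at /chain-a-final in one hop, which is exactly the shape of the fixed _redirects block above.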
| Channel | Impact |
|---|---|
| SEO | Google follows up to ~10 hops, but Google's John Mueller advises keeping chains under 3. Each hop dilutes PageRank passed through the chain. Chains longer than 3 hops risk the final page not being indexed, and crawl budget is wasted on intermediate hops. |
| AEO | Each redirect hop adds latency to content delivery. AI crawlers may time out before reaching the final destination on slow servers. |
| GEO | LLM retrieval systems that fetch URLs abandon long chains. Content at the end of a 5-hop chain may be unreachable to generative engine fetchers. |
Step 1 — Generate the large sitemap (run once, output is gitignored):
```bash
cd examples/robots-txt-lab
node generate-sitemap.js
# Output:
# ✅ Generated: sitemap.xml
#    URLs : 50,001 (Google limit: 50,000)
#    Size : ~2.50 MB
```
Step 2 — Deploy to Cloudflare Pages (drag-and-drop or CLI):
```bash
# Option A — drag and drop (no account setup needed)
# 1. Go to pages.cloudflare.com
# 2. Create a project → Upload assets
# 3. Drag the robots-txt-lab/ folder → Deploy

# Option B — CLI
npx wrangler pages deploy . --project-name robots-txt-lab
```
Step 3 — Verify the issues are present:
```bash
# Verify robots.txt errors — paste URL into:
# https://www.google.com/webmasters/tools/robots-testing-tool
# or: https://technicalseo.com/tools/robots-txt/

# Verify sitemap size — run after generating:
node -e "const f=require('fs').statSync('sitemap.xml'); console.log('URLs:', require('fs').readFileSync('sitemap.xml','utf8').match(/<url>/g).length, '| Size:', (f.size/1024/1024).toFixed(2)+'MB')"

# Or count manually:
grep -c "<url>" sitemap.xml
# → should print 50001

# Submit to Google Search Console:
# https://search.google.com/search-console → Sitemaps → add /sitemap.xml
# GSC will show: "Your Sitemap appears to be an HTML page"
# or flag the URL count in the coverage report.
```