One site, all issues reproduced. Each section maps to one audit rule so it can be fixed and verified independently.
Audit rule: robots_has_errors | The broken robots.txt is live at /robots.txt.
Disallow:/staging means the rule is ignored and Google crawls your staging pages anyway.
Imagine you put a sign on a door that says "Staff only" but you misspelled it as "Staf fonly".
The security guard (Google) cannot read it, ignores it, and lets everyone through.
That is exactly what happens with a malformed robots.txt.
Common real-world damage: your /staging or /admin pages end up indexed in Google
because the Disallow line had a syntax error and was silently skipped.
View live broken /robots.txt →
```
# 9 real syntax errors — each one silently ignored by crawlers
User-agent:*                      ← Error 1: missing space after colon
Disallow:/staging                 ← Error 2: missing space → staging gets indexed
Disallow: /admin                  ← Error 3: no trailing slash → /administrator not blocked
Disallow: /api/*                  ← Error 4: wildcard only works in Google/Bing, others ignore it
Disallow: /checkout$              ← Error 5: dollar anchor only Google supports
Allow: /staging/public # comment  ← Error 6: inline comment breaks the path value
User-agent: Googlebot             ← Error 7: no blank line before this group → bleeds into above
Disallow: /private
Crawl-delay:5                     ← Error 8: missing space → crawl-delay ignored
SITEMAP: https://...              ← Error 9: wrong casing, should be "Sitemap:"
```
```
# Correct robots.txt
User-agent: *
Disallow: /staging/
Disallow: /admin/
Disallow: /api/
Disallow: /checkout/
Allow: /staging/public/
Crawl-delay: 5
                                  ← blank line separates groups
User-agent: Googlebot
Disallow: /private/
                                  ← blank line before Sitemap
Sitemap: https://robots-txt-lab.pages.dev/sitemap.xml
```
| # | Broken | Problem | Real-world impact |
|---|---|---|---|
| 1 | User-agent:* | Missing space after : | Entire rule group may be ignored |
| 2 | Disallow:/staging | Missing space after : | Staging pages indexed in Google |
| 3 | Disallow: /admin | No trailing slash | Blocks /admin but not /administrator |
| 4 | Disallow: /api/* | Wildcard * in path | Only Google/Bing honour it; other crawlers ignore the line |
| 5 | Disallow: /checkout$ | Dollar anchor $ | Only Google supports it; other bots skip the line |
| 6 | Allow: /path # comment | Inline comment after value | Comment becomes part of the path — directive is broken |
| 7 | No blank line between groups | Missing group separator | Groups bleed together; Googlebot rules applied to all bots |
| 8 | Crawl-delay:5 | Missing space after : | Crawl-delay skipped; bot may hammer the server |
| 9 | SITEMAP: https://... | Wrong field-name casing | Sitemap not discovered by strict parsers |
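Most of these mistakes can be caught before deploying with a small lint pass over the file. The sketch below is a minimal Node.js example, not the audit tool's actual implementation; the file name `lint-robots.js`, the hard-coded `robots.txt` path, and the exact set of checks (missing space after the colon, inline comments, missing blank line between groups, all-caps field names) are assumptions for illustration.

```js
// lint-robots.js — minimal sketch of a robots.txt lint (assumed checks, not the audit tool's code)
const fs = require('fs');

const lines = fs.readFileSync('robots.txt', 'utf8').split(/\r?\n/);
const problems = [];
let previousWasDirective = false;

lines.forEach((line, i) => {
  const n = i + 1;
  const trimmed = line.trim();
  if (trimmed === '' || trimmed.startsWith('#')) {
    previousWasDirective = false;
    return;
  }

  const field = trimmed.split(':')[0];

  // Errors 1, 2, 8: "Field:value" with no space after the colon
  if (/^[A-Za-z-]+:\S/.test(trimmed)) problems.push(`line ${n}: missing space after ":"`);

  // Error 6: an inline comment after the value becomes part of the path
  if (/:\s*\S.*#/.test(trimmed)) problems.push(`line ${n}: inline comment after value`);

  // Error 7: a new User-agent group should be separated from the previous one by a blank line
  if (/^user-agent:/i.test(trimmed) && previousWasDirective) {
    problems.push(`line ${n}: no blank line before new User-agent group`);
  }

  // Error 9: all-caps field names ("SITEMAP:") trip up strict parsers
  if (field.length > 1 && field === field.toUpperCase()) {
    problems.push(`line ${n}: write "${field[0] + field.slice(1).toLowerCase()}:" instead of "${field}:"`);
  }

  previousWasDirective = true;
});

console.log(problems.length ? problems.join('\n') : 'No obvious syntax problems found.');
```

Run it from the folder that contains robots.txt with `node lint-robots.js`; against the broken file above it should flag errors 1, 2, 6, 7, 8 and 9.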
Audit rule: sitemap_big | The large sitemap is live at /sitemap.xml (generate it first — see deploy steps).
Google limits a single sitemap.xml to 50,000 URLs and 50 MB (uncompressed).
If your sitemap exceeds either limit, Google silently truncates it — URLs past the cut-off are never discovered or indexed.
No error is surfaced to you; pages just quietly disappear from search.
Think of Google's crawler as a delivery driver with a truck that holds exactly 50,000 packages. If you hand it 50,001, the driver takes the first 50,000 and drives away. Package #50,001 sits on the pavement and is never delivered — and you never find out.
In practice this hits large e-commerce sites (product + variant pages), news sites (article archives), or any site that auto-generates a single flat sitemap instead of a paginated sitemap index.
This site's sitemap.xml contains 50,001 URLs — one over the limit.
```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://robots-txt-lab.pages.dev/</loc></url>
  <url><loc>https://robots-txt-lab.pages.dev/page-1</loc></url>
  <url><loc>https://robots-txt-lab.pages.dev/page-2</loc></url>
  ... (continues for 50,001 total entries)  ← Google silently stops at 50,000
```
Split into multiple sitemaps and serve a sitemap index — one master file that lists the child sitemaps, each under 50,000 URLs.
```xml
<!-- /sitemap.xml — the INDEX (always under limit) -->
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://example.com/sitemap-1.xml</loc></sitemap> <!-- URLs 1-50000 -->
  <sitemap><loc>https://example.com/sitemap-2.xml</loc></sitemap> <!-- URLs 50001-... -->
</sitemapindex>
```
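If the sitemap is produced by a build script, the split can happen at generation time. Below is a minimal Node.js sketch under assumed names (`split-sitemap.js`, a `urls.txt` input with one URL per line, and `https://example.com` as the host); it writes child sitemaps of at most 50,000 URLs each plus the index that references them.

```js
// split-sitemap.js — sketch: chunk a flat URL list into <=50,000-URL sitemaps plus an index
// Assumptions: urls.txt holds one URL per line; child files are served from https://example.com/
const fs = require('fs');

const MAX_URLS = 50000;
const urls = fs.readFileSync('urls.txt', 'utf8').split('\n').filter(Boolean);

const header = '<?xml version="1.0" encoding="UTF-8"?>\n';
const indexEntries = [];

for (let i = 0; i * MAX_URLS < urls.length; i++) {
  const chunk = urls.slice(i * MAX_URLS, (i + 1) * MAX_URLS);
  const body = chunk.map((u) => `  <url><loc>${u}</loc></url>`).join('\n');
  fs.writeFileSync(
    `sitemap-${i + 1}.xml`,
    `${header}<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n${body}\n</urlset>\n`
  );
  indexEntries.push(`  <sitemap><loc>https://example.com/sitemap-${i + 1}.xml</loc></sitemap>`);
}

// The index itself stays tiny: one <sitemap> entry per child file
fs.writeFileSync(
  'sitemap.xml',
  `${header}<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n${indexEntries.join('\n')}\n</sitemapindex>\n`
);

console.log(`Wrote ${indexEntries.length} child sitemaps and sitemap.xml (index).`);
```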
| Channel | Impact |
|---|---|
| SEO | Pages past URL #50,000 are never submitted to Google — they rely entirely on link discovery, which may never happen for deep pages. |
| AEO (AI Engines) | AI crawlers (GPTBot, PerplexityBot) that follow sitemaps also stop at the truncation point. Those pages are invisible to AI overviews. |
| GEO (Generative Engine) | Content past truncation cannot be cited in AI-generated responses that rely on indexed content. |
Audit rule: sitemap_4xx | The broken sitemap is live at /sitemap-4xx.xml.
The sitemap.xml lists URLs that return a 4XX HTTP error (most commonly 404 Not Found or 410 Gone).
Every time Google crawls the sitemap and hits a dead URL, it loses a little trust in the sitemap as a whole.
Over time, Google crawls it less often and may stop prioritising newly added pages.
Imagine giving a tour guide a list of 10 rooms to show visitors. They walk to room 4 and the door is missing — the room was demolished. After the third demolished room, the tour guide starts doubting your whole list and skips checking new rooms you add later.
This happens constantly on real sites: a product is deleted, a blog post is removed, a URL is restructured — but nobody updates the sitemap. The dead URLs pile up and Google's trust in the sitemap degrades.
View live /sitemap-4xx.xml → it contains 1 valid URL and 9 dead URLs (none of those pages exist on this site, so all return 404).
```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://robots-txt-lab.pages.dev/</loc></url>                      ← ✅ 200 OK
  <url><loc>https://robots-txt-lab.pages.dev/deleted-product</loc></url>       ← ❌ 404
  <url><loc>https://robots-txt-lab.pages.dev/old-blog-post</loc></url>         ← ❌ 404
  <url><loc>https://robots-txt-lab.pages.dev/discontinued-service</loc></url>  ← ❌ 404
  <url><loc>https://robots-txt-lab.pages.dev/removed-about-us</loc></url>      ← ❌ 404
  <url><loc>https://robots-txt-lab.pages.dev/old-pricing</loc></url>           ← ❌ 404
  <url><loc>...5 more dead URLs...</loc></url>
</urlset>
```
Only include URLs that return 200. Remove dead URLs from the sitemap, or set up a redirect for the old page and list only the destination URL.
```xml
<!-- Fixed: only verified-live URLs -->
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/about</loc></url>
  <url><loc>https://example.com/products</loc></url>
  <!-- deleted-product removed — 301 redirected, not listed -->
</urlset>
```
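One way to keep a sitemap in this state is to check every listed URL's status code as part of a build or CI step. The sketch below is a minimal Node.js example (Node 18+ for the global fetch); the script name and the naive `<loc>` regex are assumptions, not the audit tool's parser.

```js
// check-sitemap-status.js — sketch: flag sitemap URLs that do not return 200
// Assumptions: Node 18+ (global fetch); sitemap URL passed as the first CLI argument
const sitemapUrl = process.argv[2] || 'https://robots-txt-lab.pages.dev/sitemap-4xx.xml';

async function main() {
  const xml = await (await fetch(sitemapUrl)).text();
  // Naive <loc> extraction is enough for a flat <urlset>; real tools use an XML parser
  const urls = [...xml.matchAll(/<loc>(.*?)<\/loc>/g)].map((m) => m[1]);

  for (const url of urls) {
    // HEAD is cheaper than GET; fall back to GET if the server rejects HEAD
    let res = await fetch(url, { method: 'HEAD' });
    if (res.status === 405) res = await fetch(url);
    const mark = res.status === 200 ? '✅' : '❌';
    console.log(`${mark} ${res.status} ${url}`);
  }
}

main().catch(console.error);
```

Run with `node check-sitemap-status.js https://robots-txt-lab.pages.dev/sitemap-4xx.xml`; any line marked ❌ should be removed from the sitemap or redirected.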
| Channel | Impact |
|---|---|
| SEO | Dead URLs erode Google's trust in the sitemap. Google reduces crawl frequency of the sitemap, so new pages take longer to be discovered. |
| AEO (AI Engines) | AI crawlers that follow the sitemap (GPTBot, PerplexityBot) waste crawl budget on dead URLs and may miss new content. |
| GEO | Pages not crawled are not indexed, so they cannot appear in AI-generated answers that rely on indexed content. |
Audit rule: sitemap_noindex | The broken sitemap is at /sitemap-noindex.xml. The noindex pages: staging-page, admin-preview, draft-post.
The sitemap lists pages whose HTML head contains <meta name="robots" content="noindex">, which tells Google "don't index this page".
These two instructions directly contradict each other — Google wastes crawl budget visiting pages it then has to discard.
Imagine sending a formal invitation to a party, but when the guest arrives the door says "No entry". You wasted everyone's time. The guest (Google) showed up, spent time getting there, then had to turn around.
The real-world pattern that causes this: a developer adds noindex to a staging or draft page
to keep it out of search results, but forgets to remove it from the auto-generated sitemap.
Or: a CMS generates the sitemap from all published pages, including ones that marketing tagged as noindex.
Three pages exist and return 200 OK, but each has
<meta name="robots" content="noindex"> in the HTML head.
All three are listed in sitemap-noindex.xml.
```html
<!-- sitemap-noindex.xml — tells Google to visit these pages -->
<url><loc>https://robots-txt-lab.pages.dev/</loc></url>               ← ✅ indexable
<url><loc>https://robots-txt-lab.pages.dev/staging-page</loc></url>   ← ❌ noindex
<url><loc>https://robots-txt-lab.pages.dev/admin-preview</loc></url>  ← ❌ noindex
<url><loc>https://robots-txt-lab.pages.dev/draft-post</loc></url>     ← ❌ noindex

<!-- Each noindex page has this in its <head> -->
<meta name="robots" content="noindex, nofollow">                      ← contradicts being in sitemap
```
Two valid options — pick one per page:
```html
<!-- Option A: Remove the noindex tag (make the page indexable) -->
<meta name="robots" content="index, follow">   ← or just remove the meta tag entirely

<!-- Option B: Remove the URL from the sitemap (keep noindex, stop wasting crawl budget) -->
<!-- simply delete the <url> entry from sitemap.xml -->

<!-- Never do both at the same time in opposite directions -->
```
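The same kind of check works for this contradiction: fetch every URL in the sitemap and look for a robots meta tag containing noindex. The sketch below is a minimal Node.js example (Node 18+ for the global fetch); the script name and the regex-based meta detection are illustrative assumptions rather than a robust HTML parse.

```js
// find-noindex-in-sitemap.js — sketch: flag sitemap URLs whose HTML carries a noindex robots meta
// Assumptions: Node 18+ (global fetch); regex matching instead of a real HTML parser
const sitemapUrl = process.argv[2] || 'https://robots-txt-lab.pages.dev/sitemap-noindex.xml';

async function main() {
  const xml = await (await fetch(sitemapUrl)).text();
  const urls = [...xml.matchAll(/<loc>(.*?)<\/loc>/g)].map((m) => m[1]);

  for (const url of urls) {
    const res = await fetch(url);
    if (res.status !== 200) continue; // dead URLs are covered by the sitemap_4xx rule
    const html = await res.text();
    // Look for a single <meta> tag that has name="robots" and a content value containing "noindex",
    // in either attribute order
    const noindex =
      /<meta[^>]*name=["']robots["'][^>]*content=["'][^"']*noindex/i.test(html) ||
      /<meta[^>]*content=["'][^"']*noindex[^"']*["'][^>]*name=["']robots["']/i.test(html);
    console.log(`${noindex ? '❌ noindex but listed in sitemap' : '✅ indexable'}  ${url}`);
  }
}

main().catch(console.error);
```

Every ❌ line is a page where you should apply Option A or Option B above.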
| Channel | Impact |
|---|---|
| SEO | Contradictory signal confuses Google. Crawl budget is spent on pages that will never rank. Google's trust in the sitemap degrades over time. |
| AEO (AI Engines) | Once noindex is honoured, those pages are invisible to AI Overviews and AI-generated answers — even if listed in the sitemap. |
| GEO | Same as AEO — noindex pages are excluded from generative engine content regardless of sitemap presence. |
Step 1 — Generate the large sitemap (run once, output is gitignored):
```bash
cd examples/robots-txt-lab
node generate-sitemap.js
# Output:
# ✅ Generated: sitemap.xml
#    URLs : 50,001  (Google limit: 50,000)
#    Size : ~2.50 MB
```
Step 2 — Deploy to Cloudflare Pages (drag-and-drop or CLI):
```bash
# Option A — drag and drop (no account setup needed)
# 1. Go to pages.cloudflare.com
# 2. Create a project → Upload assets
# 3. Drag the robots-txt-lab/ folder → Deploy

# Option B — CLI
npx wrangler pages deploy . --project-name robots-txt-lab
```
Step 3 — Verify the issues are present:
```bash
# Verify robots.txt errors — paste URL into:
# https://www.google.com/webmasters/tools/robots-testing-tool
# or: https://technicalseo.com/tools/robots-txt/

# Verify sitemap size — run after generating:
node -e "const f=require('fs').statSync('sitemap.xml'); console.log('URLs:', require('fs').readFileSync('sitemap.xml','utf8').match(/<url>/g).length, '| Size:', (f.size/1024/1024).toFixed(2)+'MB')"

# Or count manually:
grep -c "<url>" sitemap.xml
# → should print 50001

# Submit to Google Search Console:
# https://search.google.com/search-console → Sitemaps → add /sitemap.xml
# GSC will show: "Your Sitemap appears to be an HTML page"
# or flag the URL count in the coverage report.
```