One site, all issues reproduced. Each section maps to one audit rule so it can be fixed and verified independently.
Audit rule: robots_has_errors | The broken robots.txt is live at /robots.txt.
Disallow:/staging means the rule is ignored and Google crawls your staging pages anyway.
Imagine you put a sign on a door that says "Staff only" but you misspelled it as "Staf fonly".
The security guard (Google) cannot read it, ignores it, and lets everyone through.
That is exactly what happens with a malformed robots.txt.
Common real-world damage: your /staging or /admin pages end up indexed in Google
because the Disallow line had a syntax error and was silently skipped.
View live broken /robots.txt →
```
# 9 real syntax errors — each one silently ignored by crawlers
User-agent:*                      ← Error 1: missing space after colon
Disallow:/staging                 ← Error 2: missing space → staging gets indexed
Disallow: /admin                  ← Error 3: no trailing slash → /administrator not blocked
Disallow: /api/*                  ← Error 4: wildcard only works in Google/Bing, others ignore it
Disallow: /checkout$              ← Error 5: dollar anchor only Google supports
Allow: /staging/public # comment  ← Error 6: inline comment breaks the path value
User-agent: Googlebot             ← Error 7: no blank line before this group → bleeds into above
Disallow: /private
Crawl-delay:5                     ← Error 8: missing space → crawl-delay ignored
SITEMAP: https://...              ← Error 9: wrong casing, should be "Sitemap:"
```
```
# Correct robots.txt
User-agent: *
Disallow: /staging/
Disallow: /admin/
Disallow: /api/
Disallow: /checkout/
Allow: /staging/public/
Crawl-delay: 5
                                  ← blank line separates groups
User-agent: Googlebot
Disallow: /private/
                                  ← blank line before Sitemap
Sitemap: https://robots-txt-lab.pages.dev/sitemap.xml
```
| # | Broken | Problem | Real-world impact |
|---|---|---|---|
| 1 | User-agent:* | Missing space after : | Entire rule group may be ignored |
| 2 | Disallow:/staging | Missing space after : | Staging pages indexed in Google |
| 3 | Disallow: /admin | No trailing slash | Blocks /admin but not /administrator |
| 4 | Disallow: /api/* | Wildcard * in path | Only Google/Bing honour it; other crawlers ignore the line |
| 5 | Disallow: /checkout$ | Dollar anchor $ | Only Google supports it; other bots skip the line |
| 6 | Allow: /path # comment | Inline comment after value | Comment becomes part of the path — directive is broken |
| 7 | No blank line between groups | Missing group separator | Groups bleed together; Googlebot rules applied to all bots |
| 8 | Crawl-delay:5 | Missing space after : | Crawl-delay skipped; bot may hammer the server |
| 9 | SITEMAP: https://... | Wrong field-name casing | Sitemap not discovered by strict parsers |
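Most of these mistakes can be caught before deploying with a small lint pass over the file. The sketch below is a minimal Node.js example, not the audit tool's actual implementation; the file name `lint-robots.js`, the hard-coded `robots.txt` path, and the exact set of checks (missing space after the colon, inline comments, missing blank line between groups, all-caps field names) are assumptions for illustration.

```js
// lint-robots.js — minimal sketch of a robots.txt lint (assumed checks, not the audit tool's code)
const fs = require('fs');

const lines = fs.readFileSync('robots.txt', 'utf8').split(/\r?\n/);
const problems = [];
let previousWasDirective = false;

lines.forEach((line, i) => {
  const n = i + 1;
  const trimmed = line.trim();
  if (trimmed === '' || trimmed.startsWith('#')) {
    previousWasDirective = false;
    return;
  }

  const field = trimmed.split(':')[0];

  // Errors 1, 2, 8: "Field:value" with no space after the colon
  if (/^[A-Za-z-]+:\S/.test(trimmed)) problems.push(`line ${n}: missing space after ":"`);

  // Error 6: an inline comment after the value becomes part of the path
  if (/:\s*\S.*#/.test(trimmed)) problems.push(`line ${n}: inline comment after value`);

  // Error 7: a new User-agent group should be separated from the previous one by a blank line
  if (/^user-agent:/i.test(trimmed) && previousWasDirective) {
    problems.push(`line ${n}: no blank line before new User-agent group`);
  }

  // Error 9: all-caps field names ("SITEMAP:") trip up strict parsers
  if (field.length > 1 && field === field.toUpperCase()) {
    problems.push(`line ${n}: write "${field[0] + field.slice(1).toLowerCase()}:" instead of "${field}:"`);
  }

  previousWasDirective = true;
});

console.log(problems.length ? problems.join('\n') : 'No obvious syntax problems found.');
```

Run it from the folder that contains robots.txt with `node lint-robots.js`; against the broken file above it should flag errors 1, 2, 6, 7, 8 and 9.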
Audit rule: sitemap_big | The large sitemap is live at /sitemap.xml (generate it first — see deploy steps).
Google limits a single sitemap.xml to 50,000 URLs and 50 MB (uncompressed).
If your sitemap exceeds either limit, Google silently truncates it — URLs past the cut-off are never discovered or indexed.
No error is surfaced to you; pages just quietly disappear from search.
Think of Google's crawler as a delivery driver with a truck that holds exactly 50,000 packages. If you hand it 50,001, the driver takes the first 50,000 and drives away. Package #50,001 sits on the pavement and is never delivered — and you never find out.
In practice this hits large e-commerce sites (product + variant pages), news sites (article archives), or any site that auto-generates a single flat sitemap instead of a paginated sitemap index.
This site's sitemap.xml contains 50,001 URLs — one over the limit.
```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://robots-txt-lab.pages.dev/</loc></url>
  <url><loc>https://robots-txt-lab.pages.dev/page-1</loc></url>
  <url><loc>https://robots-txt-lab.pages.dev/page-2</loc></url>
  ... (continues for 50,001 total entries)  ← Google silently stops at 50,000
```
Split into multiple sitemaps and serve a sitemap index — one master file that lists the child sitemaps, each under 50,000 URLs.
```xml
<!-- /sitemap.xml — the INDEX (always under limit) -->
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://example.com/sitemap-1.xml</loc></sitemap> <!-- URLs 1-50000 -->
  <sitemap><loc>https://example.com/sitemap-2.xml</loc></sitemap> <!-- URLs 50001-... -->
</sitemapindex>
```
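If the sitemap is produced by a build script, the split can happen at generation time. Below is a minimal Node.js sketch under assumed names (`split-sitemap.js`, a `urls.txt` input with one URL per line, and `https://example.com` as the host); it writes child sitemaps of at most 50,000 URLs each plus the index that references them.

```js
// split-sitemap.js — sketch: chunk a flat URL list into <=50,000-URL sitemaps plus an index
// Assumptions: urls.txt holds one URL per line; child files are served from https://example.com/
const fs = require('fs');

const MAX_URLS = 50000;
const urls = fs.readFileSync('urls.txt', 'utf8').split('\n').filter(Boolean);

const header = '<?xml version="1.0" encoding="UTF-8"?>\n';
const indexEntries = [];

for (let i = 0; i * MAX_URLS < urls.length; i++) {
  const chunk = urls.slice(i * MAX_URLS, (i + 1) * MAX_URLS);
  const body = chunk.map((u) => `  <url><loc>${u}</loc></url>`).join('\n');
  fs.writeFileSync(
    `sitemap-${i + 1}.xml`,
    `${header}<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n${body}\n</urlset>\n`
  );
  indexEntries.push(`  <sitemap><loc>https://example.com/sitemap-${i + 1}.xml</loc></sitemap>`);
}

// The index itself stays tiny: one <sitemap> entry per child file
fs.writeFileSync(
  'sitemap.xml',
  `${header}<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n${indexEntries.join('\n')}\n</sitemapindex>\n`
);

console.log(`Wrote ${indexEntries.length} child sitemaps and sitemap.xml (index).`);
```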
| Channel | Impact |
|---|---|
| SEO | Pages past URL #50,000 are never submitted to Google — they rely entirely on link discovery, which may never happen for deep pages. |
| AEO (AI Engines) | AI crawlers (GPTBot, PerplexityBot) that follow sitemaps also stop at the truncation point. Those pages are invisible to AI overviews. |
| GEO (Generative Engine) | Content past truncation cannot be cited in AI-generated responses that rely on indexed content. |
Audit rule: sitemap_4xx | The broken sitemap is live at /sitemap-4xx.xml.
The sitemap.xml lists URLs that return a 4XX HTTP error (most commonly 404 Not Found or 410 Gone).
Every time Google crawls the sitemap and hits a dead URL, it loses a little trust in the sitemap as a whole.
Over time, Google crawls it less often and may stop prioritising newly added pages.
Imagine giving a tour guide a list of 10 rooms to show visitors. They walk to room 4 and the door is missing — the room was demolished. After the third demolished room, the tour guide starts doubting your whole list and skips checking new rooms you add later.
This happens constantly on real sites: a product is deleted, a blog post is removed, a URL is restructured — but nobody updates the sitemap. The dead URLs pile up and Google's trust in the sitemap degrades.
View live /sitemap-4xx.xml → it contains 1 valid URL and 9 dead URLs (none of those pages exist on this site, so all return 404).
```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://robots-txt-lab.pages.dev/</loc></url>                      ← ✅ 200 OK
  <url><loc>https://robots-txt-lab.pages.dev/deleted-product</loc></url>       ← ❌ 404
  <url><loc>https://robots-txt-lab.pages.dev/old-blog-post</loc></url>         ← ❌ 404
  <url><loc>https://robots-txt-lab.pages.dev/discontinued-service</loc></url>  ← ❌ 404
  <url><loc>https://robots-txt-lab.pages.dev/removed-about-us</loc></url>      ← ❌ 404
  <url><loc>https://robots-txt-lab.pages.dev/old-pricing</loc></url>           ← ❌ 404
  <url><loc>...5 more dead URLs...</loc></url>
</urlset>
```
Only include URLs that return 200. Remove dead URLs from the sitemap, or set up a redirect for the old page and list only the destination URL.
```xml
<!-- Fixed: only verified-live URLs -->
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/about</loc></url>
  <url><loc>https://example.com/products</loc></url>
  <!-- deleted-product removed — 301 redirected, not listed -->
</urlset>
```
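One way to keep a sitemap in this state is to check every listed URL's status code as part of a build or CI step. The sketch below is a minimal Node.js example (Node 18+ for the global fetch); the script name and the naive `<loc>` regex are assumptions, not the audit tool's parser.

```js
// check-sitemap-status.js — sketch: flag sitemap URLs that do not return 200
// Assumptions: Node 18+ (global fetch); sitemap URL passed as the first CLI argument
const sitemapUrl = process.argv[2] || 'https://robots-txt-lab.pages.dev/sitemap-4xx.xml';

async function main() {
  const xml = await (await fetch(sitemapUrl)).text();
  // Naive <loc> extraction is enough for a flat <urlset>; real tools use an XML parser
  const urls = [...xml.matchAll(/<loc>(.*?)<\/loc>/g)].map((m) => m[1]);

  for (const url of urls) {
    // HEAD is cheaper than GET; fall back to GET if the server rejects HEAD
    let res = await fetch(url, { method: 'HEAD' });
    if (res.status === 405) res = await fetch(url);
    const mark = res.status === 200 ? '✅' : '❌';
    console.log(`${mark} ${res.status} ${url}`);
  }
}

main().catch(console.error);
```

Run with `node check-sitemap-status.js https://robots-txt-lab.pages.dev/sitemap-4xx.xml`; any line marked ❌ should be removed from the sitemap or redirected.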
| Channel | Impact |
|---|---|
| SEO | Dead URLs erode Google's trust in the sitemap. Google reduces crawl frequency of the sitemap, so new pages take longer to be discovered. |
| AEO (AI Engines) | AI crawlers that follow the sitemap (GPTBot, PerplexityBot) waste crawl budget on dead URLs and may miss new content. |
| GEO | Pages not crawled are not indexed, so they cannot appear in AI-generated answers that rely on indexed content. |
Audit rule: sitemap_noindex | The broken sitemap is at /sitemap-noindex.xml. The noindex pages: staging-page, admin-preview, draft-post.
The sitemap lists pages whose HTML head contains <meta name="robots" content="noindex">, which tells Google "don't index this page".
These two instructions directly contradict each other — Google wastes crawl budget visiting pages it then has to discard.
Imagine sending a formal invitation to a party, but when the guest arrives the door says "No entry". You wasted everyone's time. The guest (Google) showed up, spent time getting there, then had to turn around.
The real-world pattern that causes this: a developer adds noindex to a staging or draft page
to keep it out of search results, but forgets to remove it from the auto-generated sitemap.
Or: a CMS generates the sitemap from all published pages, including ones that marketing tagged as noindex.
Three pages exist and return 200 OK, but each has
<meta name="robots" content="noindex"> in the HTML head.
All three are listed in sitemap-noindex.xml.
```html
<!-- sitemap-noindex.xml — tells Google to visit these pages -->
<url><loc>https://robots-txt-lab.pages.dev/</loc></url>               ← ✅ indexable
<url><loc>https://robots-txt-lab.pages.dev/staging-page</loc></url>   ← ❌ noindex
<url><loc>https://robots-txt-lab.pages.dev/admin-preview</loc></url>  ← ❌ noindex
<url><loc>https://robots-txt-lab.pages.dev/draft-post</loc></url>     ← ❌ noindex

<!-- Each noindex page has this in its <head> -->
<meta name="robots" content="noindex, nofollow">                      ← contradicts being in sitemap
```
Two valid options — pick one per page:
```html
<!-- Option A: Remove the noindex tag (make the page indexable) -->
<meta name="robots" content="index, follow">   ← or just remove the meta tag entirely

<!-- Option B: Remove the URL from the sitemap (keep noindex, stop wasting crawl budget) -->
<!-- simply delete the <url> entry from sitemap.xml -->

<!-- Never do both at the same time in opposite directions -->
```
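The same kind of check works for this contradiction: fetch every URL in the sitemap and look for a robots meta tag containing noindex. The sketch below is a minimal Node.js example (Node 18+ for the global fetch); the script name and the regex-based meta detection are illustrative assumptions rather than a robust HTML parse.

```js
// find-noindex-in-sitemap.js — sketch: flag sitemap URLs whose HTML carries a noindex robots meta
// Assumptions: Node 18+ (global fetch); regex matching instead of a real HTML parser
const sitemapUrl = process.argv[2] || 'https://robots-txt-lab.pages.dev/sitemap-noindex.xml';

async function main() {
  const xml = await (await fetch(sitemapUrl)).text();
  const urls = [...xml.matchAll(/<loc>(.*?)<\/loc>/g)].map((m) => m[1]);

  for (const url of urls) {
    const res = await fetch(url);
    if (res.status !== 200) continue; // dead URLs are covered by the sitemap_4xx rule
    const html = await res.text();
    // Look for a single <meta> tag that has name="robots" and a content value containing "noindex",
    // in either attribute order
    const noindex =
      /<meta[^>]*name=["']robots["'][^>]*content=["'][^"']*noindex/i.test(html) ||
      /<meta[^>]*content=["'][^"']*noindex[^"']*["'][^>]*name=["']robots["']/i.test(html);
    console.log(`${noindex ? '❌ noindex but listed in sitemap' : '✅ indexable'}  ${url}`);
  }
}

main().catch(console.error);
```

Every ❌ line is a page where you should apply Option A or Option B above.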
| Channel | Impact |
|---|---|
| SEO | Contradictory signal confuses Google. Crawl budget is spent on pages that will never rank. Google's trust in the sitemap degrades over time. |
| AEO (AI Engines) | Once noindex is honoured, those pages are invisible to AI Overviews and AI-generated answers — even if listed in the sitemap. |
| GEO | Same as AEO — noindex pages are excluded from generative engine content regardless of sitemap presence. |
Step 1 — Generate the large sitemap (run once, output is gitignored):
```bash
cd examples/robots-txt-lab
node generate-sitemap.js
# Output:
# ✅ Generated: sitemap.xml
#    URLs : 50,001  (Google limit: 50,000)
#    Size : ~2.50 MB
```
Step 2 — Deploy to Cloudflare Pages (drag-and-drop or CLI):
```bash
# Option A — drag and drop (no account setup needed)
# 1. Go to pages.cloudflare.com
# 2. Create a project → Upload assets
# 3. Drag the robots-txt-lab/ folder → Deploy

# Option B — CLI
npx wrangler pages deploy . --project-name robots-txt-lab
```
Step 3 — Verify the issues are present:
```bash
# Verify robots.txt errors — paste URL into:
# https://www.google.com/webmasters/tools/robots-testing-tool
# or: https://technicalseo.com/tools/robots-txt/

# Verify sitemap size — run after generating:
node -e "const f=require('fs').statSync('sitemap.xml'); console.log('URLs:', require('fs').readFileSync('sitemap.xml','utf8').match(/<url>/g).length, '| Size:', (f.size/1024/1024).toFixed(2)+'MB')"

# Or count manually:
grep -c "<url>" sitemap.xml
# → should print 50001

# Submit to Google Search Console:
# https://search.google.com/search-console → Sitemaps → add /sitemap.xml
# GSC will show: "Your Sitemap appears to be an HTML page"
# or flag the URL count in the coverage report.
```