Site Audit Issues — Demo Lab

One site, all issues reproduced. Each section maps to one audit rule so it can be fixed and verified independently.


① robots.txt Syntax Errors  robots_has_errors

Audit rule: robots_has_errors  |  The broken robots.txt is live at /robots.txt.

What is this issue?
Crawlers silently skip any line they cannot parse — no error is shown to you. A missing space like Disallow:/staging can mean the rule is dropped by stricter parsers, and your staging pages get crawled anyway.

1. What Is This Issue (Simple Terms)

Imagine you put a sign on a door that says "Staff only" but you misspelled it as "Staf fonly". The security guard (Google) cannot read it, ignores it, and lets everyone through. That is exactly what happens with a malformed robots.txt.

Common real-world damage: your /staging or /admin pages end up indexed in Google because the Disallow line had a syntax error and was silently skipped.

2. The Broken robots.txt on This Site


# 9 real syntax errors — each one silently ignored by crawlers

User-agent:*              ← Error 1: missing space after colon
Disallow:/staging         ← Error 2: missing space → staging gets indexed
Disallow: /admin           ← Error 3: no trailing slash → also blocks /administrator (over-blocking)
Disallow: /api/*           ← Error 4: wildcard only works in Google/Bing, others ignore it
Disallow: /checkout$       ← Error 5: dollar anchor only Google supports
Allow: /staging/public # comment  ← Error 6: inline comment breaks the path value
User-agent: Googlebot     ← Error 7: no blank line before this group → bleeds into above
Disallow: /private
Crawl-delay:5             ← Error 8: missing space → crawl-delay ignored
SITEMAP: https://...      ← Error 9: wrong casing, should be "Sitemap:"

3. What It Should Look Like (Fixed)

# Correct robots.txt

User-agent: *
Disallow: /staging/
Disallow: /admin/
Disallow: /api/
Disallow: /checkout/
Allow: /staging/public/
Crawl-delay: 5
                          ← blank line separates groups
User-agent: Googlebot
Disallow: /private/
                          ← blank line before Sitemap
Sitemap: https://robots-txt-lab.pages.dev/sitemap.xml

4. Error Reference Table

#  | Broken                        | Problem                     | Real-world impact
1  | User-agent:*                  | Missing space after :       | Entire rule group may be ignored
2  | Disallow:/staging             | Missing space after :       | Staging pages can end up indexed in Google
3  | Disallow: /admin              | No trailing slash           | Prefix match also blocks /administrator (over-blocking)
4  | Disallow: /api/*              | Wildcard * in path          | Only Google/Bing honour it; other crawlers ignore the line
5  | Disallow: /checkout$          | Dollar anchor $             | Only Google supports it; other bots skip the line
6  | Allow: /path # comment        | Inline comment after value  | Comment can become part of the path — directive breaks
7  | No blank line between groups  | Missing group separator     | Groups bleed together; Googlebot rules applied to all bots
8  | Crawl-delay:5                 | Missing space after :       | Crawl-delay skipped; bot may hammer the server
9  | SITEMAP: https://...          | Wrong field-name casing     | Sitemap not discovered by strict parsers
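
None of these slips throw an error anywhere, so a mechanical check is worth scripting. Below is a minimal Node lint sketch (hypothetical, not shipped with this lab; run as node check-robots.js robots.txt) that flags the patterns from the table: missing space after the colon, inline "#" in a value, and unknown or oddly cased field names.

// check-robots.js: hypothetical lint sketch, not part of this lab
const fs = require('fs');

const KNOWN = ['user-agent', 'disallow', 'allow', 'crawl-delay', 'sitemap'];

const lines = fs.readFileSync(process.argv[2] || 'robots.txt', 'utf8').split(/\r?\n/);
lines.forEach((line, i) => {
  const trimmed = line.trim();
  if (!trimmed || trimmed.startsWith('#')) return;   // blank lines and comments are fine
  const m = trimmed.match(/^([A-Za-z-]+):(.*)$/);
  if (!m) { console.log(`line ${i + 1}: not "field: value" shaped -> ${trimmed}`); return; }
  const [, field, value] = m;
  if (!KNOWN.includes(field.toLowerCase()))
    console.log(`line ${i + 1}: unknown field "${field}"`);
  else if (field !== field[0].toUpperCase() + field.slice(1).toLowerCase())
    console.log(`line ${i + 1}: odd casing "${field}" (strict parsers may skip it)`);
  if (!value.startsWith(' '))
    console.log(`line ${i + 1}: no space after ":" (some parsers drop the line)`);
  if (value.includes('#'))
    console.log(`line ${i + 1}: inline "#" (may be read as part of the value)`);
});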

② Sitemap Exceeds Size Limits  sitemap_big

Audit rule: sitemap_big  |  The large sitemap is live at /sitemap.xml (generate it first — see deploy steps).

What is this issue?
Google hard-limits each sitemap.xml to 50,000 URLs and 50 MB (uncompressed). If your sitemap exceeds either limit, Google truncates or rejects it — URLs past the cut-off are never discovered via the sitemap. Unless you watch Search Console closely, no error reaches you; pages just quietly disappear from search.

1. What Is This Issue (Simple Terms)

Think of Google's crawler as a delivery driver with a truck that holds exactly 50,000 packages. If you hand it 50,001, the driver takes the first 50,000 and drives away. Package #50,001 sits on the pavement and is never delivered — and you never find out.

In practice this hits large e-commerce sites (product + variant pages), news sites (article archives), or any site that auto-generates a single flat sitemap instead of a paginated sitemap index.

2. The Broken Sitemap on This Site

This site's sitemap.xml contains 50,001 URLs — one over the limit.

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://robots-txt-lab.pages.dev/</loc></url>
  <url><loc>https://robots-txt-lab.pages.dev/page-1</loc></url>
  <url><loc>https://robots-txt-lab.pages.dev/page-2</loc></url>
  ... (continues for 50,001 total entries)  ← Google silently stops at 50,000

3. What the Fix Looks Like

Split into multiple sitemaps and serve a sitemap index — one master file that lists the child sitemaps, each under 50,000 URLs.

<?xml version="1.0" encoding="UTF-8"?>
<!-- /sitemap.xml — the INDEX (always under limit); the XML declaration must stay on line 1 -->
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://example.com/sitemap-1.xml</loc></sitemap>  <!-- URLs 1-50000 -->
  <sitemap><loc>https://example.com/sitemap-2.xml</loc></sitemap>  <!-- URLs 50001-... -->
</sitemapindex>
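
The split itself is mechanical, so it belongs in the sitemap generator. A minimal Node sketch, assuming a flat array of URLs and an example.com base (both illustrative, not from this repo):

// split-sitemaps.js: illustrative sketch, not part of this lab
const fs = require('fs');

const BASE = 'https://example.com';   // assumed site root
const LIMIT = 50000;                  // Google's per-file URL cap

function writeSitemaps(urls) {
  const children = [];
  for (let i = 0; i < urls.length; i += LIMIT) {
    const name = `sitemap-${children.length + 1}.xml`;
    const body = urls.slice(i, i + LIMIT)
      .map(u => `  <url><loc>${u}</loc></url>`).join('\n');
    fs.writeFileSync(name, `<?xml version="1.0" encoding="UTF-8"?>\n` +
      `<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n${body}\n</urlset>\n`);
    children.push(name);
  }
  // one master index pointing at every child file
  const entries = children
    .map(n => `  <sitemap><loc>${BASE}/${n}</loc></sitemap>`).join('\n');
  fs.writeFileSync('sitemap.xml', `<?xml version="1.0" encoding="UTF-8"?>\n` +
    `<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n${entries}\n</sitemapindex>\n`);
}

// 50,001 URLs now land in sitemap-1.xml (50,000) and sitemap-2.xml (1)
writeSitemaps(Array.from({ length: 50001 }, (_, i) => `${BASE}/page-${i}`));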

4. Impact

SEO: Pages past URL #50,000 are never submitted to Google — they rely entirely on link discovery, which may never happen for deep pages.
AEO (AI Engines): AI crawlers (GPTBot, PerplexityBot) that follow sitemaps also stop at the truncation point. Those pages are invisible to AI Overviews.
GEO (Generative Engine): Content past truncation cannot be cited in AI-generated responses that rely on indexed content.

③ 4XX Errors in Sitemap  sitemap_4xx

Audit rule: sitemap_4xx  |  The broken sitemap is live at /sitemap-4xx.xml.

What is this issue?
Your sitemap.xml lists URLs that return a 4XX HTTP error (most commonly 404 Not Found or 410 Gone). Every time Google crawls the sitemap and hits a dead URL, it loses a little trust in the sitemap as a whole. Over time, Google crawls it less often and may stop prioritising newly added pages.

1. What Is This Issue (Simple Terms)

Imagine giving a tour guide a list of 10 rooms to show visitors. They walk to room 4 and the door is missing — the room was demolished. After the third demolished room, the tour guide starts doubting your whole list and skips checking new rooms you add later.

This happens constantly on real sites: a product is deleted, a blog post is removed, a URL is restructured — but nobody updates the sitemap. The dead URLs pile up and Google's trust in the sitemap degrades.

2. The Broken Sitemap on This Site

The live /sitemap-4xx.xml contains 1 valid URL and 9 dead URLs (none of those pages exist on this site, so all return 404).

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">

  <url><loc>https://robots-txt-lab.pages.dev/</loc></url>              ← ✅ 200 OK

  <url><loc>https://robots-txt-lab.pages.dev/deleted-product</loc></url>       ← ❌ 404
  <url><loc>https://robots-txt-lab.pages.dev/old-blog-post</loc></url>         ← ❌ 404
  <url><loc>https://robots-txt-lab.pages.dev/discontinued-service</loc></url>  ← ❌ 404
  <url><loc>https://robots-txt-lab.pages.dev/removed-about-us</loc></url>     ← ❌ 404
  <url><loc>https://robots-txt-lab.pages.dev/old-pricing</loc></url>         ← ❌ 404
  <url><loc>...5 more dead URLs...</loc></url>

</urlset>

3. What the Fix Looks Like

Only include URLs that return 200. Remove or redirect dead pages before adding them to the sitemap.

<!-- Fixed: only verified-live URLs -->
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/about</loc></url>
  <url><loc>https://example.com/products</loc></url>
  <!-- deleted-product removed — 301 redirected, not listed -->
</urlset>
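
The durable fix is to verify status codes when the sitemap is generated instead of trusting the CMS. A hypothetical Node sketch (Node 18+ for the global fetch; the file name is illustrative) that HEAD-checks every <loc> in a sitemap:

// audit-sitemap-status.js: hypothetical sketch (Node 18+)
const fs = require('fs');

async function main() {
  const xml = fs.readFileSync(process.argv[2] || 'sitemap.xml', 'utf8');
  const locs = [...xml.matchAll(/<loc>(.*?)<\/loc>/g)].map(m => m[1]);
  for (const url of locs) {
    const res = await fetch(url, { method: 'HEAD' });   // HEAD is enough for the status code
    if (res.status >= 400) console.log(`${res.status}  ${url}  <- remove or redirect`);
  }
}
main();

Anything it prints should be removed from the sitemap or 301-redirected before the next deploy.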

4. Impact

SEO: Dead URLs erode Google's trust in the sitemap. Google reduces crawl frequency of the sitemap, so new pages take longer to be discovered.
AEO (AI Engines): AI crawlers that follow the sitemap (GPTBot, PerplexityBot) waste crawl budget on dead URLs and may miss new content.
GEO: Pages not crawled are not indexed, so they cannot appear in AI-generated answers that rely on indexed content.

④ Noindex Pages in Sitemap  sitemap_noindex

Audit rule: sitemap_noindex  |  The broken sitemap is at /sitemap-noindex.xml. The noindex pages: staging-page, admin-preview, draft-post.

What is this issue?
Your sitemap tells Google "please discover and visit these URLs". But those pages have <meta name="robots" content="noindex"> which tells Google "don't index this page". These two instructions directly contradict each other — Google wastes crawl budget visiting pages it then has to discard.

1. What Is This Issue (Simple Terms)

Imagine sending a formal invitation to a party, but when the guest arrives the door says "No entry". You wasted everyone's time. The guest (Google) showed up, spent time getting there, then had to turn around.

The real-world pattern that causes this: a developer adds noindex to a staging or draft page to keep it out of search results, but forgets to remove it from the auto-generated sitemap. Or: a CMS generates the sitemap from all published pages, including ones that marketing tagged as noindex.

2. The Broken Setup on This Site

Three pages exist and return 200 OK, but each has <meta name="robots" content="noindex"> in the HTML head. All three are listed in sitemap-noindex.xml.

<!-- sitemap-noindex.xml — tells Google to visit these pages -->
<url><loc>https://robots-txt-lab.pages.dev/</loc></url>              ← ✅ indexable
<url><loc>https://robots-txt-lab.pages.dev/staging-page</loc></url>   ← ❌ noindex
<url><loc>https://robots-txt-lab.pages.dev/admin-preview</loc></url>  ← ❌ noindex
<url><loc>https://robots-txt-lab.pages.dev/draft-post</loc></url>     ← ❌ noindex

<!-- Each noindex page has this in its <head> -->
<meta name="robots" content="noindex, nofollow">  ← contradicts being in sitemap

3. What the Fix Looks Like

Two valid options — pick one per page:

<!-- Option A: Remove the noindex tag (make the page indexable) -->
<meta name="robots" content="index, follow">  ← or just remove the meta tag entirely

<!-- Option B: Remove the URL from the sitemap (keep noindex, stop wasting crawl budget) -->
<!-- simply delete the <url> entry from sitemap.xml -->

<!-- Whatever you pick, do not leave a page both noindexed AND listed in the sitemap -->
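
Detecting the contradiction site-wide is the same sitemap walk as in issue ③, with an HTML check added. A hypothetical Node sketch (Node 18+; the robots meta tag's attribute order can vary in the wild, so treat the regex as a heuristic):

// find-noindex-in-sitemap.js: hypothetical sketch (Node 18+)
const fs = require('fs');

async function main() {
  const xml = fs.readFileSync(process.argv[2] || 'sitemap-noindex.xml', 'utf8');
  for (const [, url] of xml.matchAll(/<loc>(.*?)<\/loc>/g)) {
    const html = await (await fetch(url)).text();
    // heuristic: assumes name="robots" appears before the content attribute
    if (/<meta[^>]+name=["']robots["'][^>]*noindex/i.test(html))
      console.log(`noindex but listed in sitemap: ${url}`);
  }
}
main();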

4. Impact

SEO: Contradictory signal confuses Google. Crawl budget is spent on pages that will never rank. Google's trust in the sitemap degrades over time.
AEO (AI Engines): Once noindex is honoured, those pages are invisible to AI Overviews and AI-generated answers — even if listed in the sitemap.
GEO: Same as AEO — noindex pages are excluded from generative engine content regardless of sitemap presence.

④-alt Noindex Pages in Sitemap — HTTP Header Variant  sitemap_noindex

Audit rule: sitemap_noindex (same rule ID, different detection method)  |  Broken sitemap: /sitemap-noindex-header.xml  |  Pages: pdf-report, user-data-export, internal-dashboard

Why this is different from issue ④:
Issue ④ uses <meta name="robots" content="noindex"> — visible in HTML source.
This variant uses X-Robots-Tag: noindex as an HTTP response header — completely invisible when you View Source. You can only detect it via the DevTools Network tab or a tool that checks headers. Same rule, different mechanism, much harder to spot manually.

1. What Is the Difference

                        | Meta Tag (issue ④)        | HTTP Header (issue ④-alt)
Where set               | Inside <head> of the HTML | HTTP response header from the server
Visible in View Source? | ✅ Yes                     | ❌ No — only in DevTools Network tab
Works for non-HTML?     | ❌ HTML only               | ✅ PDFs, images, any resource
How set on this site    | In the .html file head    | Via the Cloudflare Pages _headers file
Audit tool detection    | HTML parser               | HTTP response header check

2. The Broken Setup on This Site

Three pages have no noindex meta tag in their HTML — they look clean in View Source. But the _headers file tells Cloudflare Pages to send X-Robots-Tag: noindex in the HTTP response. All three are listed in sitemap-noindex-header.xml.

# _headers file (Cloudflare Pages)
/pdf-report.html
  X-Robots-Tag: noindex

/user-data-export.html
  X-Robots-Tag: noindex, nofollow

/internal-dashboard.html
  X-Robots-Tag: noindex, nofollow

# HTML source of these pages has NO <meta name="robots"> — looks clean
# The noindex only shows up in the HTTP response headers

3. How to Verify

# Method 1 — curl (checks HTTP headers, not HTML)
curl -I https://robots-txt-lab.pages.dev/pdf-report.html
# Look for: X-Robots-Tag: noindex in the output

# Method 2 — View Source (should show NO noindex meta tag — that is the point)
# Open /pdf-report.html → View Source → search "noindex" → not found in HTML

# Method 3 — DevTools
# Open /pdf-report.html → F12 → Network tab → click the page request
# → Response Headers → look for X-Robots-Tag: noindex
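
To sweep a whole sitemap for header-based noindex, swap the HTML check from issue ④ for a header read. A hypothetical sketch (Node 18+; Headers.get is case-insensitive, so the casing of X-Robots-Tag does not matter):

// find-header-noindex.js: hypothetical sketch (Node 18+)
const fs = require('fs');

async function main() {
  const xml = fs.readFileSync(process.argv[2] || 'sitemap-noindex-header.xml', 'utf8');
  for (const [, url] of xml.matchAll(/<loc>(.*?)<\/loc>/g)) {
    const res = await fetch(url, { method: 'HEAD' });
    const tag = res.headers.get('x-robots-tag') || '';
    if (tag.includes('noindex'))
      console.log(`X-Robots-Tag noindex but listed in sitemap: ${url}`);
  }
}
main();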

4. Impact

SEO: Same as meta tag noindex — Google respects it and won't index. The sitemap entry wastes crawl budget. Harder for developers to notice and fix because it's not visible in source.
AEO / GEO: Once X-Robots-Tag noindex is honoured, those pages are invisible to AI engines regardless of sitemap presence — same outcome as the meta tag variant.

⑤ Non-Canonical Pages in Sitemap  sitemap_non_canonical

Audit rule: sitemap_non_canonical  |  Broken sitemap: /sitemap-non-canonical.xml  |  Non-canonical pages: product-red, blog-page-2, print-version

What is this issue?
Every page can declare its "official" version using <link rel="canonical" href="...">. If a page's canonical points to a different URL, that page is non-canonical — a duplicate. Your sitemap should only list the canonical (official) version of each URL, never the duplicates.

1. What Is This Issue (Simple Terms)

Think of canonical as a page saying "I am a copy — the real one lives over there." If your sitemap lists the copy instead of the original, you are pointing Google to a page that itself says "ignore me, go elsewhere." Google follows the canonical and loses trust in your sitemap's accuracy.

Three common real-world patterns that cause this:

Pattern              | Non-canonical URL in sitemap          | Should be
Colour/size variant  | /product-red → canonical: /product    | List /product only
Paginated page       | /blog-page-2 → canonical: /blog       | List /blog only
Print version        | /print-version → canonical: /article  | List /article only

2. The Broken Setup on This Site

Three pages exist and return 200 OK, but each declares a different URL as its canonical. All three are listed in sitemap-non-canonical.xml.

<!-- sitemap-non-canonical.xml -->
<url><loc>https://robots-txt-lab.pages.dev/</loc></url>                ← ✅ self-canonical
<url><loc>https://robots-txt-lab.pages.dev/product-red</loc></url>    ← ❌ canonical is /product
<url><loc>https://robots-txt-lab.pages.dev/blog-page-2</loc></url>    ← ❌ canonical is /blog
<url><loc>https://robots-txt-lab.pages.dev/print-version</loc></url>  ← ❌ canonical is /article

<!-- What each non-canonical page says in its <head> -->
<link rel="canonical" href="https://robots-txt-lab.pages.dev/product">   ← mismatch with sitemap
<link rel="canonical" href="https://robots-txt-lab.pages.dev/blog">      ← mismatch with sitemap
<link rel="canonical" href="https://robots-txt-lab.pages.dev/article">   ← mismatch with sitemap

3. What the Fix Looks Like

<!-- Fixed sitemap: only list canonical URLs -->
<url><loc>https://robots-txt-lab.pages.dev/</loc></url>
<url><loc>https://robots-txt-lab.pages.dev/product</loc></url>      ← canonical, not the variant
<url><loc>https://robots-txt-lab.pages.dev/blog</loc></url>         ← page 1 only, not /blog-page-2
<url><loc>https://robots-txt-lab.pages.dev/article</loc></url>      ← main article, not print version
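
An automated check is the same sitemap walk again, this time comparing each page's declared canonical with its sitemap entry. A hypothetical sketch (Node 18+; the regex assumes rel comes before href, as on this site's pages):

// find-non-canonical.js: hypothetical sketch (Node 18+)
const fs = require('fs');

async function main() {
  const xml = fs.readFileSync(process.argv[2] || 'sitemap-non-canonical.xml', 'utf8');
  for (const [, loc] of xml.matchAll(/<loc>(.*?)<\/loc>/g)) {
    const html = await (await fetch(loc)).text();
    const m = html.match(/<link[^>]+rel=["']canonical["'][^>]+href=["']([^"']+)["']/i);
    // flag entries whose canonical (ignoring a trailing slash) is a different URL
    if (m && m[1].replace(/\/$/, '') !== loc.replace(/\/$/, ''))
      console.log(`sitemap lists ${loc} but its canonical is ${m[1]}`);
  }
}
main();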

4. Impact

SEO: Google ignores the non-canonical sitemap entries and consolidates signals to the canonical. Ranking credit is not diluted, but the sitemap signal is wasted and trust in the sitemap degrades.
AEO (AI Engines): AI crawlers that rely on sitemaps for discovery may follow non-canonical URLs and attribute content to the wrong page, hurting citation accuracy.
GEO: Same as AEO — content attribution in AI-generated answers may reference the duplicate URL instead of the canonical one.

⑥ Duplicate Title Tags  title_duplicate  title_multiple

Two related rules covered here:
title_duplicate — multiple pages share the same <title> text  |  title_multiple — a single page has more than one <title> tag

What is this issue?
Every page should have a unique <title> tag that describes exactly what that page is about. When multiple pages share the same title, Google cannot tell them apart and splits ranking signals between them. When one page has two title tags, Google's behaviour is undefined — it may use either one, or rewrite both.

1. What Is This Issue (Simple Terms)

title_duplicate: Imagine three job applicants sending identical CVs with the same name on them. The employer cannot tell who is who and picks one at random. The other two are ignored. That is what happens when three pages all have <title>Our Services | Demo Company</title>.

title_multiple: Imagine a letter with two subject lines that contradict each other. The reader is confused about what the letter is about. Same problem for Google when a page has <title>Contact Us</title> AND <title>About Us</title> in the same HTML.

2. The Broken Pages on This Site

title_duplicate — three pages, one identical title:

<!-- service-a.html -->  <title>Our Services | Demo Company</title>  ← ❌ duplicate
<!-- service-b.html -->  <title>Our Services | Demo Company</title>  ← ❌ duplicate
<!-- service-c.html -->  <title>Our Services | Demo Company</title>  ← ❌ duplicate

<!-- What they should be -->
<title>Web Design Services | Demo Company</title>
<title>SEO Consulting | Demo Company</title>
<title>Content Marketing | Demo Company</title>

title_multiple — two title tags on one page (multiple-titles.html):

<title>First Title Tag — Contact Us</title>   ← ❌ tag 1
... rest of <head> ...
<title>Second Title Tag — About Us</title>   ← ❌ tag 2 (browsers display the first tag; which one Google uses is undefined)

<!-- Fix: keep exactly one -->
<title>Contact Us | Demo Company</title>

3. Pages to Check

Page                  | Issue            | Title
service-a.html        | title_duplicate  | "Our Services | Demo Company"
service-b.html        | title_duplicate  | "Our Services | Demo Company"
service-c.html        | title_duplicate  | "Our Services | Demo Company"
multiple-titles.html  | title_multiple   | Has two separate <title> tags
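
Both rules are easy to check locally before deploying. A hypothetical Node sketch that scans the .html files in the current directory, reporting pages with more than one <title> and title text shared across files:

// check-titles.js: hypothetical sketch, scans local .html files
const fs = require('fs');

const seen = new Map();   // title text -> first file it appeared in
for (const file of fs.readdirSync('.').filter(f => f.endsWith('.html'))) {
  const titles = [...fs.readFileSync(file, 'utf8')
    .matchAll(/<title>(.*?)<\/title>/gis)].map(m => m[1].trim());
  if (titles.length > 1)
    console.log(`title_multiple: ${file} has ${titles.length} <title> tags`);
  for (const t of titles) {
    if (seen.has(t)) console.log(`title_duplicate: "${t}" in ${seen.get(t)} and ${file}`);
    else seen.set(t, file);
  }
}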

4. Impact

SEO: title_duplicate: Keyword cannibalisation — all three pages compete for the same query. Google picks one to rank and demotes or ignores the others. title_multiple: Google may pick either title or rewrite the SERP snippet entirely.
AEO: Duplicate titles split citation signals. AI engines cannot clearly attribute which page answers which question when titles are identical.
GEO: Same as AEO — generative engines use page titles as strong signals for content topic identification.

🚀 How to Deploy This Site on Cloudflare Pages

Step 1 — Generate the large sitemap (run once, output is gitignored):

cd examples/robots-txt-lab
node generate-sitemap.js

# Output:
# ✅ Generated: sitemap.xml
#    URLs    : 50,001  (Google limit: 50,000)
#    Size    : ~2.50 MB
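
If you are rebuilding the lab and generate-sitemap.js is missing, its job is simple. A minimal sketch that produces the same shape of output (assumed, not necessarily the repo's exact code):

// generate-sitemap.js: minimal sketch of the generator (assumed, not the repo's exact code)
const fs = require('fs');

const BASE = 'https://robots-txt-lab.pages.dev';
const COUNT = 50001;   // deliberately one URL over Google's 50,000 limit

const urls = [`  <url><loc>${BASE}/</loc></url>`];
for (let i = 1; i < COUNT; i++) urls.push(`  <url><loc>${BASE}/page-${i}</loc></url>`);

fs.writeFileSync('sitemap.xml',
  '<?xml version="1.0" encoding="UTF-8"?>\n' +
  '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n' +
  urls.join('\n') + '\n</urlset>\n');

console.log(`Generated sitemap.xml with ${urls.length} URLs`);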

Step 2 — Deploy to Cloudflare Pages (drag-and-drop or CLI):

# Option A — drag and drop (no account setup needed)
# 1. Go to pages.cloudflare.com
# 2. Create a project → Upload assets
# 3. Drag the robots-txt-lab/ folder → Deploy

# Option B — CLI
npx wrangler pages deploy . --project-name robots-txt-lab

Step 3 — Verify the issues are present:

# Verify robots.txt errors — paste the URL into a validator such as:
# https://technicalseo.com/tools/robots-txt/
# (Google's standalone robots.txt Tester has been retired; use the
# robots.txt report inside Search Console instead.)

# Verify sitemap size — run after generating:
node -e "const f=require('fs').statSync('sitemap.xml'); console.log('URLs:', require('fs').readFileSync('sitemap.xml','utf8').match(/<url>/g).length, '| Size:', (f.size/1024/1024).toFixed(2)+'MB')"

# Or count manually:
grep -c "<url>" sitemap.xml
# → should print 50001

# Submit to Google Search Console:
# https://search.google.com/search-console → Sitemaps → add /sitemap.xml
# GSC may show an error such as "Your Sitemap appears to be an HTML page"
# or flag the URL count in the coverage report.