Sitemap Completeness
Your sitemap says 500 pages exist. Our crawl finds 700. Those 200 missing URLs? AI crawlers will never know they exist.
Questions this article answers
- How do I find pages missing from my sitemap that AI crawlers will never see?
- Does an incomplete sitemap hurt my AI search visibility?
- How often should I update my sitemap lastmod dates for AI crawlers?
Quick Answer
Sitemap completeness compares the URLs in your sitemap.xml against actual pages found by crawling. Missing URLs, stale lastmod timestamps, and flat priority signals all get scored. A complete sitemap ensures AI crawlers discover every page you want indexed, not just the ones they stumble across.
Before & After
Before - Sitemap missing 40% of pages
<!-- sitemap.xml: 300 URLs listed -->
<!-- Live crawl discovers 500 URLs -->
<!-- 200 pages invisible to AI crawlers -->
<!-- All lastmod set to 2024-01-01 -->
After - Complete sitemap with accurate lastmod
<!-- sitemap.xml: 500 URLs listed -->
<url>
<loc>https://example.com/blog/post-1</loc>
<lastmod>2026-02-10</lastmod>
<priority>0.7</priority>
</url>
<!-- Referenced in robots.txt:
Sitemap: https://example.com/sitemap.xml -->
What This Actually Measures
We're quantifying the gap between what your XML sitemap declares and what actually exists. The audit crawls your site independently, following internal links from the homepage, then compares the discovered URLs against the sitemap.xml. The primary metric is the sitemap coverage ratio: the number of sitemap URLs that are also crawl-discoverable, divided by the total number of crawl-discoverable URLs.
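Reduced to code, the coverage ratio is a set comparison. A minimal sketch in Python follows; the example.com URLs are hypothetical, and both sets are assumed to be already normalized (scheme, host, trailing slashes).

def coverage_ratio(sitemap_urls: set[str], crawled_urls: set[str]) -> float:
    """Share of crawl-discoverable URLs that also appear in the sitemap."""
    if not crawled_urls:
        return 1.0  # nothing discoverable, so nothing can be missing
    return len(sitemap_urls & crawled_urls) / len(crawled_urls)

# Example: 500 pages found by crawling, only 300 of them listed in the sitemap.
sitemap = {f"https://example.com/page-{i}" for i in range(300)}
crawl = {f"https://example.com/page-{i}" for i in range(500)}
print(f"coverage: {coverage_ratio(sitemap, crawl):.0%}")  # coverage: 60%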
Beyond simple URL presence, we evaluate three metadata dimensions per entry. First, lastmod accuracy: does the timestamp reflect the actual last modification date? (We verify against HTTP headers and on-page dates.) Second, changefreq appropriateness: does the declared frequency match the actual update pattern? Third, priority distribution: do the values reflect a meaningful hierarchy, or is every page set to 1.0 or the default 0.5?
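As an illustration of the third check, here is a small sketch that flags a flat priority distribution. The 80% cutoff is an assumption for illustration, not a documented threshold.

from collections import Counter

def priority_signal(priorities: list[float], flat_cutoff: float = 0.8) -> str:
    """Classify a sitemap's priority values as flat or meaningful.

    flat_cutoff is an illustrative assumption: if one value covers most
    entries, the distribution carries essentially no signal.
    """
    if not priorities:
        return "no priority values"
    value, count = Counter(priorities).most_common(1)[0]
    if count / len(priorities) >= flat_cutoff:
        return f"flat: {value} on {count} of {len(priorities)} URLs"
    return "meaningful hierarchy"

print(priority_signal([0.5] * 480 + [1.0] * 20))   # flat: 0.5 on 480 of 500 URLs
print(priority_signal([1.0, 0.8, 0.8, 0.6, 0.3]))  # meaningful hierarchy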
We also check structural integrity. For sitemap index files, we validate that all referenced child sitemaps are accessible and contain valid URLs. We flag oversized sitemaps that exceed the 50,000-URL or 50 MB limits: crawlers silently truncate those. And we validate XML syntax, because a malformed sitemap is worse than no sitemap: crawlers fail silently.
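A sketch of the size check alone: the 50,000-URL and 50 MB limits come from the Sitemaps.org protocol, and the file is assumed to already be uncompressed in memory.

import xml.etree.ElementTree as ET

MAX_URLS = 50_000
MAX_BYTES = 50 * 1024 * 1024  # 50 MB, uncompressed

def check_size_limits(sitemap_xml: bytes) -> list[str]:
    """Flag sitemaps that exceed the protocol's URL-count or byte-size limits."""
    issues = []
    if len(sitemap_xml) > MAX_BYTES:
        issues.append(f"{len(sitemap_xml)} bytes exceeds the {MAX_BYTES}-byte limit")
    ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
    url_count = len(ET.fromstring(sitemap_xml).findall("sm:url", ns))
    if url_count > MAX_URLS:
        issues.append(f"{url_count} URLs exceeds the {MAX_URLS}-URL limit")
    return issues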
A secondary metric, "sitemap hygiene," captures the percentage of sitemap URLs returning 200 status codes. Sitemaps stuffed with 404s, 301s, or 410s waste crawler budget and signal poor maintenance.
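A minimal sketch of the hygiene metric, assuming the requests library; a real audit would throttle requests, retry transient failures, and fall back to GET where HEAD is unsupported.

import requests

def sitemap_hygiene(urls: list[str]) -> float:
    """Fraction of sitemap URLs answering with a plain HTTP 200 (redirects not followed)."""
    if not urls:
        return 1.0
    ok = 0
    for url in urls:
        try:
            resp = requests.head(url, timeout=10, allow_redirects=False)
            ok += resp.status_code == 200
        except requests.RequestException:
            pass  # unreachable URLs count against hygiene
    return ok / len(urls)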
Why Gaps in Your Sitemap Mean Invisible Pages
XML sitemaps are the primary way AI crawlers discover your content systematically. Internal links provide an organic discovery path, but sitemaps are the definitive manifest. When your sitemap is incomplete, unlisted pages depend entirely on being found through link-following, a process that takes weeks or months, if it happens at all.
AI crawlers have limited budgets. GPTBot, PerplexityBot, and ClaudeBot each allocate a finite number of requests per domain per cycle. When your sitemap lists 500 URLs, crawlers plan accordingly. But if 200 additional pages exist off-sitemap, those pages are only discovered if the crawler has budget remaining after processing the sitemap, and for most sites it doesn't.
Stale lastmod dates create a different problem. When your sitemap shows a page was modified two years ago, crawlers deprioritize recrawling it. If you've actually updated the content but didn't update the lastmod, the crawler may not see your changes for months. The flip side: sitemaps that set every page's lastmod to "today" on each generation lose credibility. Crawlers learn the values are unreliable and start ignoring them.
Priority values, when used thoughtfully, help crawlers allocate budget. A flat distribution (all 0.5 or all 1.0) provides zero signal. A meaningful hierarchy (homepage 1.0, category pages 0.8, product pages 0.6, utility pages 0.3) tells crawlers exactly where to focus when budget is constrained.
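As a sketch of what such a hierarchy can look like in sitemap-generation code, the URL patterns and values below mirror the example above and are illustrative, not prescriptive.

import re

# Illustrative pattern-to-priority map; adapt the patterns to your own URL structure.
PRIORITY_RULES = [
    (re.compile(r"^/$"), 1.0),                      # homepage
    (re.compile(r"^/category/[^/]+/?$"), 0.8),      # category pages
    (re.compile(r"^/products?/"), 0.6),             # product pages
    (re.compile(r"^/(privacy|terms|login)"), 0.3),  # utility pages
]

def priority_for(path: str, default: float = 0.5) -> float:
    for pattern, priority in PRIORITY_RULES:
        if pattern.search(path):
            return priority
    return default

print(priority_for("/"))                # 1.0
print(priority_for("/category/shoes"))  # 0.8
print(priority_for("/blog/post-1"))     # 0.5 (default)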
How We Check This
The audit starts by fetching the sitemap, checking /sitemap.xml, /sitemap_index.xml, and the Sitemap directive in robots.txt. For sitemap index files, we recursively fetch all child sitemaps. The complete URL set is collected along with each URL's lastmod, changefreq, and priority values.
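A simplified sketch of this discovery step, assuming the requests library; gzipped sitemaps and malformed XML are not handled here.

import requests
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def sitemap_candidates(base: str) -> list[str]:
    """Common sitemap paths plus any Sitemap: directives found in robots.txt."""
    candidates = [f"{base}/sitemap.xml", f"{base}/sitemap_index.xml"]
    robots = requests.get(f"{base}/robots.txt", timeout=10).text
    for line in robots.splitlines():
        if line.lower().startswith("sitemap:"):
            candidates.append(line.split(":", 1)[1].strip())
    return candidates

def collect_entries(sitemap_url: str) -> list[dict]:
    """Recursively expand index files and return every <url> entry with its metadata."""
    root = ET.fromstring(requests.get(sitemap_url, timeout=10).content)
    entries = []
    for child in root.findall("sm:sitemap/sm:loc", NS):   # sitemap index: recurse
        entries.extend(collect_entries(child.text.strip()))
    for url in root.findall("sm:url", NS):                # regular sitemap: collect
        entries.append({
            "loc": url.findtext("sm:loc", default="", namespaces=NS).strip(),
            "lastmod": url.findtext("sm:lastmod", namespaces=NS),
            "changefreq": url.findtext("sm:changefreq", namespaces=NS),
            "priority": url.findtext("sm:priority", namespaces=NS),
        })
    return entries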
In parallel, we run an independent crawl from the homepage, following internal links up to a configurable depth. This discovers the "actual" URL set: pages that exist and are internally linked. URLs matching common non-content patterns (pagination, sort/filter parameters, session IDs, tracking params) get excluded.
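A simplified breadth-first crawl using the standard library plus requests; the exclusion pattern is an illustrative subset of the non-content filters described above, not the audit's exact list.

import re
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

import requests

EXCLUDE = re.compile(r"[?&](page|sort|filter|sessionid|utm_)", re.I)

class LinkExtractor(HTMLParser):
    """Collect href values from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

def crawl(start: str, max_depth: int = 3) -> set[str]:
    """Follow internal links from the homepage up to max_depth."""
    origin = urlparse(start).netloc
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        url, depth = queue.popleft()
        if depth >= max_depth:
            continue
        try:
            html = requests.get(url, timeout=10).text
        except requests.RequestException:
            continue
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href).split("#")[0]
            if urlparse(absolute).netloc != origin or EXCLUDE.search(absolute):
                continue  # external link or non-content pattern
            if absolute not in seen:
                seen.add(absolute)
                queue.append((absolute, depth + 1))
    return seen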
The two URL sets are compared. Three lists emerge: URLs in both (properly covered), URLs found by crawling but missing from the sitemap (gaps), and URLs in the sitemap but not found by crawling (potential dead links or orphaned entries). Each list is categorized by URL pattern to identify template-level issues: "all /blog/* pages missing from the sitemap" points to a CMS configuration gap.
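The comparison itself is set arithmetic; the grouping below keys on the first path segment as a rough stand-in for template-level categorization.

from collections import defaultdict
from urllib.parse import urlparse

def compare_url_sets(sitemap_urls: set[str], crawled_urls: set[str]):
    covered = sitemap_urls & crawled_urls   # properly covered
    gaps = crawled_urls - sitemap_urls      # linked on the site, missing from the sitemap
    orphans = sitemap_urls - crawled_urls   # listed, but not reachable by crawling
    return covered, gaps, orphans

def by_template(urls: set[str]) -> dict[str, int]:
    """Group URLs by first path segment to surface template-level gaps."""
    buckets = defaultdict(int)
    for url in urls:
        segment = urlparse(url).path.strip("/").split("/")[0] or "(root)"
        buckets[f"/{segment}/*"] += 1
    return dict(buckets)

# e.g. {"/blog/*": 200} in the gap list points at a CMS configuration issue for blog pages.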
For lastmod validation, we fetch each sitemap URL and compare the sitemap's lastmod against three signals: the HTTP Last-Modified header, dateModified in JSON-LD, and visible "Last updated" text. Discrepancies greater than 7 days get flagged. We also identify "flat lastmod" patterns, where more than 80% of URLs share the same timestamp, which typically means automated generation without real tracking.
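A sketch of the first two signals (the HTTP Last-Modified header and JSON-LD dateModified) plus the flat-lastmod check; the regex-based JSON-LD extraction is deliberately naive and would miss some markup variants, and the 80% threshold follows the text above.

import json
import re
from collections import Counter
from datetime import datetime
from email.utils import parsedate_to_datetime

import requests

def observed_modification(url: str) -> datetime | None:
    """Best-effort 'actual' modification date from headers or JSON-LD."""
    resp = requests.get(url, timeout=10)
    if "Last-Modified" in resp.headers:
        return parsedate_to_datetime(resp.headers["Last-Modified"])
    pattern = r'<script type="application/ld\+json">(.*?)</script>'
    for block in re.findall(pattern, resp.text, re.S):
        try:
            date = json.loads(block).get("dateModified")
            if date:
                return datetime.fromisoformat(date.replace("Z", "+00:00"))
        except (ValueError, AttributeError):
            continue  # skip JSON-LD blocks that are lists or lack the field
    return None

def is_flat_lastmod(lastmods: list[str], threshold: float = 0.8) -> bool:
    """True when one timestamp covers 80%+ of entries, a sign of automated stamping."""
    if not lastmods:
        return False
    _, count = Counter(lastmods).most_common(1)[0]
    return count / len(lastmods) >= threshold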
XML validation checks the Sitemaps.org protocol: correct namespace declaration, valid URL format in <loc> elements, ISO 8601 dates in <lastmod>, and total file size under 50MB uncompressed.
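A minimal validation sketch covering namespace, <loc> format, and ISO 8601 lastmod values; a full protocol validator would check more element constraints than shown here, and the size check appears in the earlier sketch.

import re
import xml.etree.ElementTree as ET
from datetime import datetime

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def validate_sitemap(xml_bytes: bytes) -> list[str]:
    """Check namespace, <loc> format, and ISO 8601 <lastmod> values."""
    try:
        root = ET.fromstring(xml_bytes)
    except ET.ParseError as exc:
        return [f"XML parse error: {exc}"]
    issues = []
    if not root.tag.startswith("{" + SITEMAP_NS + "}"):
        issues.append("missing or wrong sitemap namespace")
    for url in root.findall(f"{{{SITEMAP_NS}}}url"):
        loc = url.findtext(f"{{{SITEMAP_NS}}}loc", default="").strip()
        if not re.match(r"https?://", loc):
            issues.append(f"invalid <loc>: {loc!r}")
        lastmod = url.findtext(f"{{{SITEMAP_NS}}}lastmod")
        if lastmod:
            try:
                datetime.fromisoformat(lastmod.strip().replace("Z", "+00:00"))
            except ValueError:
                issues.append(f"non-ISO 8601 <lastmod>: {lastmod!r}")
    return issues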
How We Score It
Sitemap completeness uses a four-component rubric:
1. URL coverage (4 points):
   - 95-100% of crawl-discovered URLs appear in sitemap: 4/4 points
   - 85-94%: 3/4 points
   - 70-84%: 2/4 points
   - 50-69%: 1/4 points
   - Below 50% or no sitemap found: 0/4 points
2. Sitemap hygiene - dead URLs (2 points):
   - Less than 2% of sitemap URLs return non-200: 2/2 points
   - 2-5% non-200: 1.5/2 points
   - 5-15% non-200: 1/2 points
   - More than 15% non-200: 0/2 points
3. lastmod accuracy (2 points):
   - 80%+ of URLs have lastmod within 7 days of actual modification: 2/2 points
   - 60-79% accurate, or lastmod present but unvalidated: 1.5/2 points
   - lastmod present on fewer than 50%: 1/2 points
   - No lastmod values or flat lastmod across all URLs: 0/2 points
4. Structural validity (2 points):
   - Valid XML, correct namespace, under size limits, referenced in robots.txt: 2/2 points
   - Valid XML with minor issues (missing robots.txt reference, no changefreq): 1.5/2 points
   - XML parsing warnings or oversized files: 1/2 points
   - XML parsing errors or completely invalid structure: 0/2 points

Deductions:
- -1 point if sitemap contains URLs blocked by robots.txt (contradictory signals)
- -0.5 points if sitemap index references child sitemaps that return errors
No sitemap at all = 0/10. Most CMS-generated sitemaps with defaults score 5-7 due to missing lastmod tracking and incomplete coverage.
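For illustration, here is a minimal sketch of the rubric above as a scoring function. The inputs are assumed to be precomputed fractions, the structural states are hypothetical labels, and the lastmod handling is simplified relative to the full rubric.

def score_sitemap(coverage: float, non_200_rate: float, lastmod_accuracy: float | None,
                  structural: str, blocked_in_robots: bool = False,
                  broken_children: bool = False) -> float:
    """Apply the four-component rubric. Fractions are in [0, 1];
    structural is one of 'clean', 'minor', 'warnings', 'invalid'."""
    # 1. URL coverage (4 points)
    if coverage >= 0.95:   score = 4.0
    elif coverage >= 0.85: score = 3.0
    elif coverage >= 0.70: score = 2.0
    elif coverage >= 0.50: score = 1.0
    else:                  score = 0.0
    # 2. Sitemap hygiene (2 points)
    if non_200_rate < 0.02:    score += 2.0
    elif non_200_rate <= 0.05: score += 1.5
    elif non_200_rate <= 0.15: score += 1.0
    # 3. lastmod accuracy (2 points); None means no usable lastmod values at all
    if lastmod_accuracy is not None:
        if lastmod_accuracy >= 0.80:   score += 2.0
        elif lastmod_accuracy >= 0.60: score += 1.5
        else:                          score += 1.0
    # 4. Structural validity (2 points)
    score += {"clean": 2.0, "minor": 1.5, "warnings": 1.0, "invalid": 0.0}[structural]
    # Deductions for contradictory or broken signals
    if blocked_in_robots:
        score -= 1.0
    if broken_children:
        score -= 0.5
    return max(score, 0.0)

print(score_sitemap(coverage=0.82, non_200_rate=0.03,
                    lastmod_accuracy=None, structural="minor"))  # 5.0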
Key Takeaways
- Compare your sitemap URLs against a live crawl to find pages that are missing from the map.
- Update lastmod only when actual content changes - flat or fake timestamps train crawlers to ignore the values.
- Use meaningful priority values to signal which pages matter most when crawl budget is limited.
- Remove 404s and redirects from your sitemap - dead entries waste crawler budget.
- Reference your sitemap in robots.txt so every crawler finds it automatically.
How does your site score on this criterion?
Get a free AEO audit and see where you stand across all 10 criteria.