Clean, Crawlable HTML
Ensuring your page content is accessible in the raw HTML source, not hidden behind JavaScript rendering, accordions, or dynamic loading.
What It Is
Clean, crawlable HTML means that the meaningful content of your pages is present in the initial HTML document that a server sends to the browser — before any JavaScript runs. AI crawlers and many search engine bots read HTML directly without executing JavaScript.
This includes:

- Text content visible in the page source (View Source, not Inspect Element)
- Proper semantic HTML elements (`<main>`, `<article>`, `<section>`, `<nav>`; see the sketch below)
- Content not hidden behind accordion clicks, tab switches, or "Read more" buttons
- Minimal DOM complexity from third-party scripts and widget frameworks
- Fast server response times with content delivered on first request
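To make this concrete, here is a rough sketch of a page whose content is fully present in the raw HTML; the headings and copy are placeholders, not a prescribed structure:

```html
<!-- Everything below appears in View Source, with no JavaScript required -->
<main>
  <article>
    <h1>Shipping policy</h1>
    <p>All orders ship within 2 business days from our warehouse.</p>
    <section>
      <h2>Frequently asked questions</h2>
      <details>
        <summary>Do you ship internationally?</summary>
        <p>Yes, to over 40 countries. Delivery typically takes 7 to 14 days.</p>
      </details>
    </section>
  </article>
</main>
```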
Why It Matters for AEO
Most AI crawlers (GPTBot, CCBot, PerplexityBot) behave like simplified browsers — they fetch the HTML and extract text, but they don't run JavaScript. This means:
- Content hidden in accordions with `aria-expanded="false"` may be invisible to AI
- Single-page applications (SPAs) that render content client-side can appear blank to crawlers
- FAQ answers that require a click to expand may never be indexed
- Heavy JavaScript payloads slow down crawling and may cause timeouts
- Dynamic content loaded via API calls after page render is typically missed
If your content isn't in the raw HTML, it doesn't exist for AI systems.
How to Implement
**1. Server-side render your content**

Use SSR (Server-Side Rendering) or SSG (Static Site Generation) to ensure content is in the HTML response. Frameworks like Next.js, Nuxt, and Astro handle this well.
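As a rough illustration (the file names and copy are hypothetical), compare what View Source returns for a client-rendered page versus a server-rendered one:

```html
<!-- Client-side rendered (SPA): the initial HTML is an empty shell,
     so a crawler that never runs JavaScript sees no content -->
<body>
  <div id="root"></div>
  <script src="/static/app.js"></script>
</body>

<!-- Server-side rendered or statically generated: the same content
     arrives in the HTML response on the first request -->
<body>
  <main>
    <article>
      <h1>How long does shipping take?</h1>
      <p>Orders ship within 2 business days.</p>
    </article>
  </main>
</body>
```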
**2. Make hidden content visible in HTML**

```html
<!-- Bad: Content requires JS to display -->
<div class="accordion" aria-expanded="false">
  <div class="content" style="display:none">Answer here</div>
</div>

<!-- Good: Content visible in HTML, CSS handles display -->
<details>
  <summary>Question here</summary>
  <p>Answer here — visible to crawlers in the HTML source</p>
</details>
```
**3. Audit your HTML**

```bash
# Check what crawlers see (no JS execution)
curl -s https://yoursite.com/page | grep -c "your key content phrase"

# If 0 results, your content is JS-dependent
```
**4. Minimize third-party script impact**

Load analytics, chat widgets, and social proof scripts with `defer` or `async`. Consider lazy-loading non-essential widgets.
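One way this can look in practice; the script and embed URLs are placeholders:

```html
<!-- Analytics: async, because it does not depend on other scripts or DOM order -->
<script async src="https://analytics.example.com/tag.js"></script>

<!-- Chat widget: defer, so it executes only after the HTML is parsed -->
<script defer src="https://chat.example.com/widget.js"></script>

<!-- Non-essential embed: lazy-load it so it never delays the initial content -->
<iframe src="https://reviews.example.com/embed" loading="lazy" title="Customer reviews"></iframe>
```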
**5. Use the `<noscript>` fallback**

For essential content that requires JavaScript, provide a `<noscript>` version with the same content.
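A minimal sketch of that fallback for a JavaScript-rendered widget; the markup and values are illustrative:

```html
<!-- Widget rendered client-side by JavaScript -->
<div id="pricing-widget"></div>
<script defer src="/js/pricing-widget.js"></script>

<!-- Same information as static text, present in the raw HTML for crawlers -->
<noscript>
  <p>Standard plan: $29/month, billed annually. Cancel anytime.</p>
</noscript>
```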
Common Mistakes
- Assuming "it works in my browser" means crawlers can see it — crawlers don't run JavaScript
- Using custom web components (`<my-accordion>`) without ensuring content is in the initial HTML
- Hiding blog excerpts with CSS `class="hidden"` — crawlers may skip hidden elements
- Loading critical content via fetch/XHR after page load — crawlers won't wait for it
- Over-relying on Shopify/WordPress theme features that render content client-side
External Resources
- Google's Mobile-Friendly Test — Shows how Googlebot sees your page
- web.dev/rendering-on-the-web — Comprehensive guide to rendering strategies
- Lighthouse — Audit performance and accessibility