CiteScan

Methodology

How we score your site.

The citation readiness score is calculated from six deterministic categories. Each category is based on facts extracted from your page — no AI guessing in the score itself.

Six scoring categories

Crawler access

20% of total

Checks your robots.txt for explicit rules covering GPTBot, OAI-SearchBot, PerplexityBot, ClaudeBot, Google-Extended, and Bingbot. The category starts at 100. Each blocked bot deducts 20 points, and each unclassified bot deducts 8 points. If all six bots are allowed, the score stays at 100.
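As a rough illustration of the deductions above, the following sketch scores a set of per-bot classifications. The status labels, the function name, the treatment of unlisted bots as unclassified, and the floor at 0 are all assumptions; the methodology does not say what happens when deductions exceed 100.

```python
# Hypothetical sketch of the crawler access deductions described above.
BOTS = [
    "GPTBot",
    "OAI-SearchBot",
    "PerplexityBot",
    "ClaudeBot",
    "Google-Extended",
    "Bingbot",
]

# Points deducted per bot, keyed by an assumed status label.
BOT_DEDUCTIONS = {"allowed": 0, "blocked": 20, "unclassified": 8}


def crawler_access_score(statuses: dict[str, str]) -> int:
    """Start at 100 and subtract the deduction for each of the six bots.

    Bots missing from `statuses` are treated as unclassified (an assumption).
    """
    score = 100
    for bot in BOTS:
        score -= BOT_DEDUCTIONS[statuses.get(bot, "unclassified")]
    return max(score, 0)  # assumed floor; not stated in the methodology
```

Under these assumptions, one blocked bot and one unclassified bot would give 100 − 20 − 8 = 72.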

Server-rendered content

20% of total

Detects whether the page returns readable text from the server without JavaScript execution. Pages that require JavaScript to render body content score lower because most AI crawlers do not execute JS. Deductions: under 100 words (−30), under 300 words (−15), missing H1 (−10).
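The content deductions can be sketched as follows. This assumes a starting score of 100, treats the two word-count deductions as mutually exclusive (a page under 100 words loses 30, not 45), and floors the result at 0. None of those details are stated above. The penalty for JavaScript-dependent rendering is left out because no figure is given for it.

```python
def server_rendered_score(word_count: int, has_h1: bool) -> int:
    """Apply the word-count and H1 deductions to an assumed base of 100."""
    score = 100  # assumed starting value
    if word_count < 100:
        score -= 30
    elif word_count < 300:  # assumed to apply only when the -30 tier does not
        score -= 15
    if not has_h1:
        score -= 10
    return max(score, 0)
```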

Query answer match

20% of total

Scores keyword overlap between the target query and the page's title, headings, and visible text. Only exact keyword matches count; unmatched query keywords are listed as issues. Pages with no question-and-answer format (FAQ, answer blocks) lose 10 points.
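The exact scoring formula is not given here, so the sketch below only shows one plausible way to find matched and unmatched query keywords. The tokenisation rule (lowercase runs of letters and digits), the function name, and the overlap ratio are assumptions, not the tool's actual method.

```python
import re


def _tokens(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric tokens (assumed rule)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def keyword_overlap(query: str, page_text: str) -> tuple[float, list[str]]:
    """Return (share of query keywords found on the page, unmatched keywords).

    `page_text` stands in for the combined title, headings, and visible text.
    """
    page_words = set(_tokens(page_text))
    keywords = list(dict.fromkeys(_tokens(query)))  # de-duplicate, keep order
    if not keywords:
        return 0.0, []
    unmatched = [k for k in keywords if k not in page_words]
    ratio = (len(keywords) - len(unmatched)) / len(keywords)
    return ratio, unmatched
```

For the query "best running shoes" against the text "Our running shoes are light", this returns a ratio of two thirds with "best" as the single unmatched keyword.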

Entity clarity and schema

15% of total

Checks for JSON-LD structured data. High-value types (Organization, LocalBusiness, Article, FAQPage, Product, BreadcrumbList, WebSite, Person) score 90. Low-value or absent schema scores 20–50. Missing contact signals (phone, email, address) deduct 10.
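A minimal sketch of the type check, assuming the JSON-LD has already been extracted from the page as a list of raw strings. The function name and the handling of `@graph` and malformed blocks are assumptions. The sketch reports which high-value types are present rather than a score, since the 20–50 range for other cases is not broken down above.

```python
import json

HIGH_VALUE_TYPES = {
    "Organization", "LocalBusiness", "Article", "FAQPage",
    "Product", "BreadcrumbList", "WebSite", "Person",
}


def high_value_types(jsonld_blocks: list[str]) -> set[str]:
    """Collect the high-value @type values found in raw JSON-LD strings."""
    found: set[str] = set()
    for raw in jsonld_blocks:
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip malformed blocks (assumed behaviour)
        nodes = data if isinstance(data, list) else [data]
        for node in nodes:
            if not isinstance(node, dict):
                continue
            graph = node.get("@graph")
            items = [node] + (graph if isinstance(graph, list) else [])
            for item in items:
                if not isinstance(item, dict):
                    continue
                types = item.get("@type")
                if isinstance(types, str):
                    types = [types]  # @type may be a string or a list
                if isinstance(types, list):
                    found.update(
                        t for t in types
                        if isinstance(t, str) and t in HIGH_VALUE_TYPES
                    )
    return found
```

A non-empty result would correspond to the case that scores 90.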

Evidence and citation readiness

15% of total

Starts at 40. Adds points for: statistics or data (two or more specific numbers or percentages +20), external citations (two or more external links +15), author attribution (+15), and a visible publication date (+10). Caps at 100.
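The additive rule above translates directly into a short function. The parameter names are hypothetical; the point values and the cap come from the description.

```python
def evidence_score(
    stat_count: int,
    external_link_count: int,
    has_author: bool,
    has_publication_date: bool,
) -> int:
    """Start at 40, add points for each evidence signal, cap at 100."""
    score = 40
    if stat_count >= 2:  # two or more specific numbers or percentages
        score += 20
    if external_link_count >= 2:  # two or more external links
        score += 15
    if has_author:
        score += 15
    if has_publication_date:
        score += 10
    return min(score, 100)
```

With every signal present the sum is 40 + 20 + 15 + 15 + 10 = 100, so the cap is only a safeguard.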

Meta, trust, and freshness

10% of total

Checks title (missing −30, wrong length −10), meta description (missing −25, wrong length −8), canonical URL (missing −10), og:title (missing −10), og:image (missing −10), and twitter:card (missing −5).
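The deduction table above can be sketched as a lookup. The field keys, the status labels, and the starting score of 100 are assumptions. The length limits that make a title or description count as "wrong length" are not stated, so the caller is assumed to have already classified each field.

```python
# Deductions per field and status, taken from the list above.
META_DEDUCTIONS = {
    "title": {"missing": 30, "wrong_length": 10},
    "meta_description": {"missing": 25, "wrong_length": 8},
    "canonical": {"missing": 10},
    "og:title": {"missing": 10},
    "og:image": {"missing": 10},
    "twitter:card": {"missing": 5},
}


def meta_score(findings: dict[str, str]) -> int:
    """Subtract the deduction for each reported field status from 100.

    `findings` maps a field name to "missing" or "wrong_length"; fields that
    are fine are simply omitted (an assumed convention).
    """
    score = 100  # assumed starting value
    for field, status in findings.items():
        score -= META_DEDUCTIONS.get(field, {}).get(status, 0)
    return score
```

If every field is missing, the deductions total 90, which leaves a score of 10 under these assumptions.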

Total score calculation

Total = sum of (category score × category weight). The six weights add up to 100%, so the result is a 0–100 integer. Scores of 80 or above indicate strong readiness. 60–79 is a good foundation. 40–59 is readable but weak. Below 40 means the site is hard to cite.
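The weighted sum and the score bands can be sketched as below. The category keys and function names are hypothetical, and the rounding mode is an assumption, since the text only says the result is an integer. Weights are kept as whole percentages to avoid floating-point drift.

```python
# Category weights in percent, as listed in the six sections above.
WEIGHTS = {
    "crawler_access": 20,
    "server_rendered_content": 20,
    "query_answer_match": 20,
    "entity_clarity_schema": 15,
    "evidence_citation": 15,
    "meta_trust_freshness": 10,
}


def total_score(category_scores: dict[str, int]) -> int:
    """Weighted sum of the six category scores, rounded to an integer."""
    weighted = sum(category_scores[name] * pct for name, pct in WEIGHTS.items())
    return round(weighted / 100)  # rounding mode is an assumption


def readiness_band(score: int) -> str:
    """Map a total score to the band described in the text."""
    if score >= 80:
        return "strong readiness"
    if score >= 60:
        return "good foundation"
    if score >= 40:
        return "readable but weak"
    return "hard to cite"
```

For example, category scores of 100, 85, 70, 90, 70, and 57 (in the order above) give 20 + 17 + 14 + 13.5 + 10.5 + 5.7 = 80.7, which rounds to 81.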

AI report generation

How the AI report works.

After the deterministic score is calculated, extracted page data (headings, visible text, meta tags, JSON-LD, crawler breakdown, and the deterministic score) is sent to Gemini 2.5 Flash to generate the AI-specific parts of the report.

The AI report contains: an AI page interpretation (how AI models may read the page), before/after AI interpretation summaries, citation likelihood and reason, semantic gaps, missing citation facts, copy-paste robots.txt and llms.txt drafts, meta tags, JSON-LD schema, answer blocks, and FAQ blocks.

All AI output is validated by a Zod schema before being shown to users. If the AI output fails schema validation, the report falls back to the deterministic score and crawler breakdown — which are always available and never AI-generated.
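The validate-then-fall-back flow can be illustrated as follows. The tool itself uses a Zod schema; this sketch substitutes a hand-written type check in Python purely to show the control flow. The field names in `REQUIRED_FIELDS` and the shape of the report are hypothetical.

```python
import json
from typing import Any

# Hypothetical subset of the fields a real schema would require.
REQUIRED_FIELDS = {
    "interpretation": str,
    "citation_likelihood": str,
    "semantic_gaps": list,
}


def build_report(ai_raw: str, deterministic: dict[str, Any]) -> dict[str, Any]:
    """Attach AI output only if it passes validation.

    The deterministic part is always included, so a failed validation
    still yields a usable report.
    """
    report: dict[str, Any] = {"deterministic": deterministic, "ai": None}
    try:
        parsed = json.loads(ai_raw)
    except json.JSONDecodeError:
        return report  # fall back: output was not valid JSON
    if not isinstance(parsed, dict):
        return report
    for field, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(parsed.get(field), expected_type):
            return report  # fall back: missing or mistyped field
    report["ai"] = parsed
    return report
```

The key design point is that the deterministic data is built first and never depends on the AI response.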

Limitations

  1. The tool scans only the homepage URL provided, not the full site.
  2. Most AI crawlers do not execute JavaScript. If your site requires JS to render, the scanner will detect this but cannot read the JS-rendered content.
  3. The citation readiness score is deterministic and factual; it does not guarantee AI citations or search rankings.
  4. AI report output is generated by Gemini 2.5 Flash and validated by Zod schema. Invalid or incomplete AI output falls back to the deterministic score.
  5. robots.txt rules are parsed from the URL's root domain only. Subdomain or subdirectory robots.txt files are not checked.
  6. The score reflects the state of the page at scan time. Scores may change if the page content, robots.txt, or structured data changes after the scan.
  7. The tool does not simulate a specific AI search engine's response; it assesses structural readiness for AI indexing and citation.

Data handling

Scanned URLs and report data are stored in Supabase Postgres. Raw HTML is not stored permanently. Reports are cached for 24 hours to reduce redundant scans of the same URL and query combination. Public reports (indexed by a slug) are accessible to anyone with the URL. Rate limits apply: 10 scans per IP per hour, 5 scans per domain per hour.
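The two rate limits can be sketched with an in-memory sliding window. This is an illustration only: the class name is hypothetical, the text does not say whether the window is sliding or fixed, and a production service would more likely keep these counters in shared storage than in process memory.

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 3600  # one hour
IP_LIMIT = 10          # scans per IP per hour
DOMAIN_LIMIT = 5       # scans per domain per hour


class ScanRateLimiter:
    """Sliding-window limiter keyed by client IP and by scanned domain."""

    def __init__(self) -> None:
        self._hits: dict[tuple[str, str], deque] = defaultdict(deque)

    def allow(self, ip: str, domain: str, now: float) -> bool:
        """Return True and record the scan if both limits have room."""
        checks = [(("ip", ip), IP_LIMIT), (("domain", domain), DOMAIN_LIMIT)]
        # Drop timestamps that have aged out of the one-hour window.
        for key, _ in checks:
            hits = self._hits[key]
            while hits and now - hits[0] >= WINDOW_SECONDS:
                hits.popleft()
        # Reject without recording if either limit is already reached.
        if any(len(self._hits[key]) >= limit for key, limit in checks):
            return False
        for key, _ in checks:
            self._hits[key].append(now)
        return True
```

Passing `now` explicitly (for example from `time.time()`) keeps the limiter easy to test.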
