How to measure AI visibility and citations
Classic SEO hands you a clean number: your rank for a keyword. AI search gives you almost nothing that tidy. There is no official dashboard for how often ChatGPT names your brand, no rank for a Perplexity answer, and no report Google ships for [AI Overviews](/glossary#ai-overview). Measuring [generative-engine visibility](/glossary#geo) means building your own instruments — and being honest that they are young, noisy, and directional. Here is a practical, skeptical way to measure whether AI engines see, cite, and repeat your brand, and where every method quietly breaks down.
Why does classic rank tracking miss AI visibility?
Rank tracking measures one thing: your position for a keyword in a list of blue links. AI answers have no fixed position, often no link, and no single keyword — the same question, asked twice, can return different sources. Tracking rank tells you nothing about whether a model quoted you.
The unit of AI visibility is the answer, not the page. A user asks a question and reads one synthesized paragraph, so what matters is whether your brand appears inside that paragraph — a state a keyword-position tool was never built to see.
Position is unstable by design. Generative engines sample, personalize, and re-retrieve, so there is no stable slot to rank for; two identical prompts minutes apart can cite different sources, which breaks the core assumption rank tracking rests on.
The classic SERP and the AI answer are different surfaces with different winners. A page can rank tenth in blue links yet be the source an engine quotes, or rank first and never get cited — so a rank report and a citation report can point in opposite directions.
This is why AI visibility needs its own instruments. You are no longer measuring where you sit in a list; you are measuring how often, and how favorably, a model repeats you — a fundamentally different question that needs a fundamentally different method.
How do you build a prompt set to test AI visibility?
Write a fixed list of real buyer questions — 30 to 100 prompts covering your category, problems, and competitors — then run the identical set across ChatGPT, Perplexity, Gemini, and Google AI Overviews on a schedule. Holding the prompts constant is what makes results comparable month over month.
The prompt set is your measurement instrument, so freeze it. Once you change the questions you lose the ability to compare this month to last, which means the discipline is boring on purpose: same prompts, same engines, same cadence.
Write prompts the way your buyers actually ask, not the way you would phrase a keyword. Natural-language, intent-rich questions such as 'What is the best tool for X?', 'X vs Y', and 'How do I do Z?' are what generative engines are built to answer, and they are the questions that surface brands.
Cover the full funnel, not just your brand name. Category questions ('best X for Y'), problem questions, and comparison questions reveal whether you show up when the user has not already decided — which is where AI visibility is actually won or lost.
Run the same set on each engine, because they retrieve differently. The sources ChatGPT reaches for are not the sources Perplexity or an AI Overview reaches for, so a brand can be cited consistently on one engine and absent on another. This is the raw material for learning how to get cited by ChatGPT, Perplexity, and AI Overviews.
Doing this by hand is tedious but honest. Whether you log answers in a spreadsheet or automate the runs, the value is the same: a repeatable, dated record of what engines said when asked your questions.
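One way to keep the prompt set frozen is to fingerprint it, so every logged answer records which version of the set produced it. The sketch below is a minimal illustration, not a prescribed tool: the `Prompt` class, the engine labels, the example date, and the row layout are all assumptions made for this example, and the "answer" field is left empty for you to fill in by hand or by whatever automation you use.

```python
import hashlib
import json
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Prompt:
    text: str
    stage: str  # "category", "problem", or "comparison"


# Placeholder prompts (hypothetical); replace with real buyer questions.
PROMPT_SET = (
    Prompt("What is the best tool for X?", "category"),
    Prompt("X vs Y", "comparison"),
    Prompt("How do I do Z?", "problem"),
)

# Engine labels are arbitrary strings chosen for this sketch.
ENGINES = ("chatgpt", "perplexity", "gemini", "ai_overviews")


def fingerprint(prompts):
    """Short stable hash of the prompt set; it changes if any prompt is edited."""
    payload = json.dumps([[p.text, p.stage] for p in prompts], ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


def planned_runs(prompts, engines, run_date):
    """One empty log row per (prompt, engine) pair for a dated run."""
    set_id = fingerprint(prompts)
    return [
        {
            "date": run_date.isoformat(),
            "set_id": set_id,
            "engine": engine,
            "prompt": prompt.text,
            "stage": prompt.stage,
            "answer": None,  # paste or store the engine's answer here
        }
        for prompt in prompts
        for engine in engines
    ]


# Example date only; use the actual date of each run.
rows = planned_runs(PROMPT_SET, ENGINES, date(2025, 1, 6))
```

If two months of rows carry different `set_id` values, the prompt set changed between them and the rates are no longer directly comparable.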
How do you track being cited, mentioned, or absent?
For every prompt on every engine, record one of three states: cited (named with a link to your site), mentioned (named without a link), or absent (not there at all). This three-way tally, repeated on a schedule, is the closest thing GEO has to a rank-tracking table.
The three states are not equal, so do not collapse them. A citation with a link can send referral traffic and confers the most trust; a mention names your brand as an entity without a click; absence is the gap to close — and tracking them separately tells you which.
Log the surrounding context, not just the yes or no. Was your brand the recommended option or a footnote? Was the claim about you accurate? A model that mentions you incorrectly is a different problem from one that omits you, and only the context reveals which.
Watch mentions without links especially closely. Generative engines frequently name brands they do not hyperlink, so a link-only tracker undercounts your true visibility — the mention still shapes the user's shortlist even when it sends no traffic.
Turn the tally into rates you can trend. 'Cited in 40% of category prompts on Perplexity, up from 25%' is a measurable, honest sentence; a single screenshot is not. Rates over time are what let you tell whether your GEO work is moving anything.
- Cited — named with a link to your site; the strongest state.
- Mentioned — named as a brand, no link, still shapes the shortlist.
- Absent — not in the answer at all; the gap to close.
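The three-state tally and the rates built on it can be sketched in a few lines. This is a rough illustration under stated assumptions: it treats any appearance of your domain in the stored answer text as a citation and any whole-word appearance of the brand name as a mention, which is cruder than reading the answer yourself. The brand "Acme", the domain "acme.example", and the sample answers are hypothetical.

```python
import re
from collections import Counter

STATES = ("cited", "mentioned", "absent")


def classify(answer_text, brand, domain):
    """Return 'cited', 'mentioned', or 'absent' for one stored answer.

    Assumes the stored text includes any source URLs the engine displayed.
    """
    text = answer_text.lower()
    if domain.lower() in text:
        return "cited"  # a link or URL to your site appears
    if re.search(r"\b" + re.escape(brand.lower()) + r"\b", text):
        return "mentioned"  # named as a brand, no link
    return "absent"


def state_rates(states):
    """Share of answers in each state, as fractions of all answers."""
    counts = Counter(states)
    total = len(states)
    return {s: (counts[s] / total if total else 0.0) for s in STATES}


# Hypothetical answers to four prompts on one engine.
answers = [
    "Acme (https://acme.example/pricing) is a common pick.",
    "Popular options include Acme and Rival.",
    "Most teams start with Rival.",
    "See acme.example for a comparison.",
]
states = [classify(a, "Acme", "acme.example") for a in answers]
rates = state_rates(states)
```

String matching cannot judge whether the brand was the recommended option or whether the claim about it was accurate, so the context notes still need a human read.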
How do you find AI-referral traffic in analytics?
In your analytics, segment referral traffic by source domain: chatgpt.com, perplexity.ai, gemini.google.com, copilot.microsoft.com. Clicks from these domains are visits an AI answer actually sent you. They are real, first-party evidence of citations, but they capture only the fraction of AI mentions that carry a clickable link.
AI-referral traffic is your only fully first-party signal, so build a segment for it. Unlike prompt testing, these are real users your analytics already logged; grouping the known AI domains into one channel lets you watch it grow without guessing.
Expect the numbers to be small and lagging. AI answers resolve many questions without a click, so referral volume badly undercounts how often you were actually cited — treat it as a floor on your visibility, never the whole picture.
Google Search Console will not itemize AI Overviews for you. Impressions and clicks from AI Overviews are currently folded into ordinary Search totals with no separate report, so GSC confirms you are in Google's index but cannot isolate your AI-answer performance.
Cross-check referral spikes against your prompt-set results. When a new citation shows up in testing and referral traffic from that engine ticks up at the same time, you have two independent signals agreeing, which is about as much certainty as this field currently offers. Making your site easy for AI crawlers to read, down to a clean llms.txt, may help more of those citations turn into clicks.
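If your analytics tool lets you export raw referrer URLs, grouping the known AI domains into one channel might look like the sketch below. The domain list is the four named in this section and is not exhaustive; the function names and the display labels are assumptions made for illustration.

```python
from collections import Counter
from urllib.parse import urlsplit

# The four referrer domains named in this section, mapped to display labels.
AI_REFERRERS = {
    "chatgpt.com": "ChatGPT",
    "perplexity.ai": "Perplexity",
    "gemini.google.com": "Gemini",
    "copilot.microsoft.com": "Copilot",
}


def ai_source(referrer_url):
    """Return the AI engine label for a referrer URL, or None if not an AI referrer."""
    host = (urlsplit(referrer_url).hostname or "").lower()
    for domain, label in AI_REFERRERS.items():
        # Match the domain itself or any subdomain such as www.perplexity.ai.
        if host == domain or host.endswith("." + domain):
            return label
    return None


def ai_referral_counts(referrer_urls):
    """Count visits per AI engine, ignoring all other referrers."""
    labels = (ai_source(url) for url in referrer_urls)
    return Counter(label for label in labels if label is not None)


# Hypothetical exported referrers.
sample = [
    "https://chatgpt.com/",
    "https://www.perplexity.ai/search?q=x",
    "https://www.google.com/",
    "",
]
counts = ai_referral_counts(sample)
```

Matching on the parsed hostname rather than on a substring keeps lookalike domains out of the segment.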
How do you measure share of voice against competitors?
Share of voice is how often your brand appears across your prompt set versus competitors. Count the citations and mentions each brand earns over the same questions and engines, then express yours as a percentage of the total. It reframes AI visibility from 'are we there?' to 'who is winning this answer?'
Absolute citation counts mislead without a denominator. Being cited in 30% of prompts sounds fine until a competitor is cited in 70% of the same set — share of voice puts your number next to theirs so you can read it honestly.
The same fixed prompt set powers this for free. Because you already run identical questions across engines, tallying which brands each answer names turns your visibility log into a competitive scoreboard at no extra cost.
Watch which competitors the engines volunteer, not just whether you appear. If a model repeatedly recommends three rivals and never you, that list is your real competitive set in the eyes of the engine — and closing the gap starts with becoming a citable entity it trusts.
Track share of voice per engine and over time. A brand can dominate Perplexity yet be invisible in AI Overviews, so a single blended number hides more than it shows; the trend per surface is where the useful signal lives.
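Share of voice per engine can be tallied from the same log. The sketch below assumes you have already extracted the list of brands each answer names (by hand or otherwise) and counts each brand at most once per answer; the brand names and observations are hypothetical.

```python
from collections import Counter, defaultdict


def share_of_voice(observations):
    """Per-engine share of voice.

    observations: iterable of (engine, brands_named_in_answer) pairs.
    Returns {engine: {brand: share}}, where the shares for one engine sum to 1.
    """
    counts = defaultdict(Counter)
    for engine, brands in observations:
        counts[engine].update(set(brands))  # one appearance per brand per answer
    shares = {}
    for engine, tally in counts.items():
        total = sum(tally.values())
        shares[engine] = {b: n / total for b, n in tally.items()} if total else {}
    return shares


# Hypothetical results: three prompts on one engine, one on another.
observations = [
    ("perplexity", ["Acme", "Rival"]),
    ("perplexity", ["Rival"]),
    ("perplexity", ["Rival", "Other"]),
    ("ai_overviews", ["Rival"]),
]
sov = share_of_voice(observations)
```

In this made-up sample, Acme holds a 20% share on one engine and none on the other. Keeping the result keyed by engine avoids the blended number that hides exactly that kind of split.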
What are the honest limits of measuring AI visibility?
Every method here is directional, not precise. There are no official visibility reports, engines are non-deterministic, sample sizes are small, and answers change without notice. You can measure trends and relative position credibly; you cannot get the clean, audited numbers classic rank tracking spoiled you with. Treat all of it as an estimate.
No engine ships an official visibility report, so every number you get is inferred. ChatGPT, Gemini, and Perplexity do not publish how often they name you; anyone selling a precise 'AI visibility score' is modeling an estimate, not reading a meter.
Non-determinism is the hard limit. The same prompt can yield different sources on repeat runs, so any single test is a sample, not a measurement — which is why cadence and larger prompt sets matter more than any one dramatic result.
Your sample is always tiny next to the real query space. A 50-prompt set cannot represent every way a buyer might phrase a question, so read your numbers as a directional read on a slice, not a census of your true visibility.
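To see how loose a rate from a small prompt set is, you can put a confidence interval around it. The sketch below uses the Wilson score interval, a standard choice for proportions from small samples. It is an illustration added here, not a method the engines or any vendor endorse, and it assumes the prompts behave like independent samples, which is only roughly true.

```python
import math


def wilson_interval(hits, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives about 95% coverage)."""
    if n == 0:
        return (0.0, 1.0)  # no data: the rate could be anything
    p = hits / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return (max(0.0, center - half), min(1.0, center + half))


# Hypothetical result: cited in 20 of 50 prompts.
low, high = wilson_interval(20, 50)
```

With 20 citations out of 50 prompts, the interval spans roughly 28% to 54%, so a move from 40% to 45% between two runs is well inside the noise.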
The ground keeps moving underneath the measurement. Engines change models, retrieval, and citation behavior without notice, so a drop can reflect a product update rather than anything you did — hold your conclusions loosely and weight trends over single points.
Measured honestly, it is still worth doing. Directional beats blind: knowing you went from absent to mentioned across your category, and that the trend moves as you strengthen trust signals like E-E-A-T, is real progress even if you never get an exact number.