How we measure AI visibility: one real measurement, opened up

Anyone can describe a methodology. This page shows one working, on real data, with the uncomfortable parts left in. Everything below comes from the measurement we ran on ourselves on 2 July 2026: 19 questions, three engines, 168 scored answers. Client measurements use the same system on a bigger set (a full baseline is roughly 75 questions and 700 answers).

1. The question set

We measure real buying questions, not keywords, grouped by what they are worth. This is a sample of the actual set we run on ourselves; a client set is built the same way from their customers' questions and their search data.

Intent	Question, verbatim
Money	best AEO agency UK
Money	answer engine optimisation agency UK
Money	who can help my business get recommended by ChatGPT
How-to	how to get my business recommended by ChatGPT
How-to	why does ChatGPT recommend my competitor and not me
Pricing	how much does AEO cost UK
Definitional	what is answer engine optimisation
Comparison	Scrunch vs Otterly
Brand	Aeova Partners reviews

2. What each answer is scored as

Every answer gets exactly one verdict. These are different things worth different amounts, and a tracker that blends them into one score is hiding information from you.

Verdict	Meaning
Recommended	The engine actively puts the business forward as an option. The only verdict that wins customers.
Mentioned	Named somewhere in the answer, but not put forward.
Cited only	The business's page is listed as a source the answer drew on, but the name never appears in the answer. Real visibility, invisible to the customer.
Negative	Named, and warned about or criticised.
Absent	Not named, not cited. Most common verdict in most categories.

One more distinction we keep separate: the inferred retrieval source. When an engine cites a directory that lists you, we can infer your page entered its reading list even though you were never shown. We report that as retrieval, never as visibility, because the customer never saw you.
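
As a rough illustration of the "exactly one verdict" rule, here is a minimal Python sketch. The signal names (`named`, `put_forward`, `criticised`, `cited`, `inferred_retrieval`) and the precedence order are assumptions made for this example, not a description of the production classifier. How each signal is detected is out of scope.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    RECOMMENDED = "Recommended"
    MENTIONED = "Mentioned"
    CITED_ONLY = "Cited only"
    NEGATIVE = "Negative"
    ABSENT = "Absent"


@dataclass(frozen=True)
class AnswerSignals:
    # Hypothetical per-answer flags, named for this sketch only.
    named: bool        # the business name appears in the answer text
    put_forward: bool  # the engine actively offers the business as an option
    criticised: bool   # the business is warned about or criticised
    cited: bool        # the business's own page is listed as a source
    inferred_retrieval: bool = False  # e.g. a cited directory lists the business


def classify(s: AnswerSignals) -> Verdict:
    """Return exactly one verdict. The precedence order is an assumption."""
    if s.named and s.criticised:
        return Verdict.NEGATIVE
    if s.named and s.put_forward:
        return Verdict.RECOMMENDED
    if s.named:
        return Verdict.MENTIONED
    if s.cited:
        return Verdict.CITED_ONLY
    # inferred_retrieval never changes the verdict: it is reported separately.
    return Verdict.ABSENT


# A directory that lists the business was cited, but the customer never saw it.
directory_only = AnswerSignals(named=False, put_forward=False,
                               criticised=False, cited=False,
                               inferred_retrieval=True)
print(classify(directory_only).value)  # Absent
```

Keeping `inferred_retrieval` outside the verdict is the design choice that matters here: it stays available for reporting without ever inflating a visibility number.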

3. One real answer, tracked end to end

Question: "answer engine optimisation agency UK". Asked on 2 July 2026, three times per engine. ChatGPT's answer recommended six UK providers in a table and listed the sources it drew them from:

What we recorded	Value, verbatim from the run
Verdict for Aeova, all 3 engines, all repeats	Absent (9 of 9)
Sources ChatGPT cited	found.co.uk · tilio.co.uk · aeoagency.co.uk · primointeractive.com · otterlabs.co.uk · scopesite.co.uk
What that means	Six agencies were recommended for the exact question our business exists to win, and we were not one of them. That row is our own before-photo, kept.

This is what citation tracking looks like in practice: every answer is stored with its sources, so when a verdict changes later, we can show which source changed it, not just that it changed.
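
A minimal sketch of that idea in Python, assuming each answer is stored as a record with a set of cited sources. The `StoredAnswer` type and `source_diff` helper are hypothetical names for this example. The "before" record uses the sources from the table above; the "after" record, including the `aeova.example` domain and its date, is invented purely to show the comparison and is not a real result.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StoredAnswer:
    question: str
    engine: str
    asked_on: str            # ISO date of the ask
    verdict: str
    sources: frozenset[str]  # domains the engine cited


def source_diff(before: StoredAnswer, after: StoredAnswer) -> dict[str, list[str]]:
    """List the sources that appeared or disappeared between two stored answers."""
    return {
        "added": sorted(after.sources - before.sources),
        "dropped": sorted(before.sources - after.sources),
    }


BEFORE = StoredAnswer(
    question="answer engine optimisation agency UK",
    engine="ChatGPT",
    asked_on="2026-07-02",
    verdict="Absent",
    sources=frozenset({
        "found.co.uk", "tilio.co.uk", "aeoagency.co.uk",
        "primointeractive.com", "otterlabs.co.uk", "scopesite.co.uk",
    }),
)

# Invented later reading, for illustration only.
AFTER = StoredAnswer(
    question=BEFORE.question,
    engine=BEFORE.engine,
    asked_on="2026-08-02",
    verdict="Mentioned",
    sources=(BEFORE.sources - {"scopesite.co.uk"}) | {"aeova.example"},
)

if BEFORE.verdict != AFTER.verdict:
    print(BEFORE.verdict, "->", AFTER.verdict, source_diff(BEFORE, AFTER))
```

Because the sources are stored as sets, the comparison is a plain set difference and does not depend on the order in which the engine listed them.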

4. The fan-out: the questions the machine actually asks

Engines do not search your customer's words. They rewrite the question into their own sub-queries, run those, and build the answer from what comes back. We capture those hidden sub-queries. Here is a real capture, unedited, for the question "Aeova Partners reviews":

#	Machine-generated sub-query
1	Aeova Partners reviews
2	Aenova Group reviews
3	BC Partners reviews
4	Aeova Partners company information
5	Is Aeova Partners related to Aenova Group or BC Partners?

Look at rows 2, 3 and 5: the machine is not sure we exist, so it spends three of its five searches on companies with similar names. That is a measurable, fixable problem (it is why our about page states so plainly who we are and are not), and it is invisible to anyone who only looks at the final answer. Content decisions should be made against these sub-queries, because these are the searches your pages actually have to win.
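
The count above can be reproduced with a few lines of Python. This is only a sketch: it assumes the other entities are already known by name (here taken from the capture itself) and uses simple substring matching, which a real pipeline would likely need to replace with something more robust.

```python
# The five sub-queries from the capture above, verbatim.
SUB_QUERIES = [
    "Aeova Partners reviews",
    "Aenova Group reviews",
    "BC Partners reviews",
    "Aeova Partners company information",
    "Is Aeova Partners related to Aenova Group or BC Partners?",
]

# Names of other companies that appear in the capture.
OTHER_ENTITIES = ["Aenova Group", "BC Partners"]


def off_target(sub_queries: list[str], other_entities: list[str]) -> list[str]:
    """Return the sub-queries that mention at least one other entity."""
    lowered = [name.lower() for name in other_entities]
    return [q for q in sub_queries if any(name in q.lower() for name in lowered)]


hits = off_target(SUB_QUERIES, OTHER_ENTITIES)
print(f"{len(hits)} of {len(SUB_QUERIES)} sub-queries reference another entity")
# prints: 3 of 5 sub-queries reference another entity
```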

5. The score: naming-margin, not rank

There is no "position 3" in an AI answer: you are named or you are not, and that varies between asks. So the honest score is a proportion: naming-margin = times named ÷ times asked. We always quote it with a confidence range, and always next to a same-day competitor panel, so that engine-wide mood swings (they are real and measured) cannot be passed off as anyone's work.

Worked on our own numbers: on 45 money-question answers we were named 0 times. Naming-margin is 0%, and with zero hits in 45 asks the confidence range puts the true rate at no more than about 8%. That "at most" is the point: we quote what the sample can actually support, which is also how you should read any improvement number we ever show you.
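
For readers who want to check the arithmetic, here is a minimal Python sketch. The page does not say which interval is used; this example assumes a 95% Wilson score interval (z = 1.96), which happens to reproduce the "about 8%" figure for 0 named out of 45 asked. Other interval methods would give slightly different bounds.

```python
import math


def naming_margin(named: int, asked: int) -> float:
    """Times named divided by times asked."""
    if asked <= 0:
        raise ValueError("asked must be positive")
    return named / asked


def wilson_interval(named: int, asked: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion (z = 1.96 is roughly 95%)."""
    p = naming_margin(named, asked)
    denom = 1 + z * z / asked
    centre = (p + z * z / (2 * asked)) / denom
    half = z * math.sqrt(p * (1 - p) / asked + z * z / (4 * asked * asked)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


low, high = wilson_interval(named=0, asked=45)
print(f"naming-margin {naming_margin(0, 45):.0%}, upper bound {high:.1%}")
# upper bound comes out at roughly 7.9%
```

With only three asks and zero hits, the same function returns an upper bound above 50%, which is why a standard three-repeat reading is not enough to quote a per-question number.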

6. Why we repeat, instead of tracking daily

Most AI-visibility products ask each tracked question once a day and draw a line through the results. Our own published data shows why that line misleads: 19% of identical same-day repeats change their verdict, and in our first client baseline, a third of the answers that named the brand at all dropped it on a same-day re-ask. A single daily ask charts that coin flip. So a daily tracker's graph moves most days, and neither the agency nor the client can say which movements are real.

Repeating each question and quoting a proportion with a confidence range costs more per data point and produces fewer, slower headlines. It is also the only version of this measurement where "the number went up" is allowed to mean something. That trade is the product.
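
One way to measure how often same-day repeats disagree is sketched below in Python. The definition used here (each repeat is compared with the first ask of the same question on the same engine) is an assumption for illustration; the published volatility figure may be computed differently. The records are toy data, not real results.

```python
from collections import defaultdict


def flip_rate(records: list[tuple[str, str, str]]) -> float:
    """Share of same-day repeats whose verdict differs from the first ask.

    records: (question, engine, verdict) tuples from one day, in ask order.
    """
    groups: dict[tuple[str, str], list[str]] = defaultdict(list)
    for question, engine, verdict in records:
        groups[(question, engine)].append(verdict)

    comparisons = 0
    flips = 0
    for verdicts in groups.values():
        first, *repeats = verdicts
        for verdict in repeats:
            comparisons += 1
            flips += verdict != first
    if comparisons == 0:
        raise ValueError("need at least one repeated ask")
    return flips / comparisons


# Toy data: one question, two hypothetical engines, three asks each.
TOY = [
    ("q1", "engine_a", "Absent"),
    ("q1", "engine_a", "Absent"),
    ("q1", "engine_a", "Mentioned"),
    ("q1", "engine_b", "Recommended"),
    ("q1", "engine_b", "Recommended"),
    ("q1", "engine_b", "Recommended"),
]
print(f"{flip_rate(TOY):.0%} of repeats changed verdict")  # 25%
```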

7. Our own baseline, the before-photo

Measured 2 July 2026, published unretouched. Re-measured monthly; the next reading replaces this table with movement, or with the honest absence of it.

Question group	Answers	Aeova named
Money (agency selection)	45	0
How-to	36	0
Definitional	27	0
Tools & comparisons	27	0
Pricing	18	0
Brand-name questions	15	8, several confusing us with similarly named firms
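
The table can be tallied in a few lines of Python, as a sanity check that the group counts add up to the 168 scored answers quoted at the top of this page. The figures are copied from the table; the dictionary layout is just a convenience for this sketch.

```python
# (answers, times Aeova was named) per question group, from the table above.
BASELINE = {
    "Money (agency selection)": (45, 0),
    "How-to": (36, 0),
    "Definitional": (27, 0),
    "Tools & comparisons": (27, 0),
    "Pricing": (18, 0),
    "Brand-name questions": (15, 8),
}

total_answers = sum(answers for answers, _ in BASELINE.values())
total_named = sum(named for _, named in BASELINE.values())

for group, (answers, named) in BASELINE.items():
    print(f"{group:<26} {named:>2}/{answers:<3} = {named / answers:.0%}")
print(f"Total: {total_named} named in {total_answers} answers")
# Total: 8 named in 168 answers
```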

For scale: our first client category baseline ran 75 real customer questions across the same three engines, roughly 700 answers: 20% recommended, 59% absent, with the absent share concentrated exactly where the buying intent is. The pattern of a business that looks fine in Google and does not exist in AI answers is the normal case, not the exception.

Questions we get about this

Why publish a measurement where you score zero?

Because AI answers change between sessions, a single good screenshot proves nothing, and every agency in this category shows you screenshots. A timestamped baseline that we cannot retroactively improve is the only honest starting point, and it is what makes any later movement checkable. We hold ourselves to the standard we sell.

What is the difference between a mention, a citation and a recommendation?

A mention is your name appearing anywhere in the answer text. A citation is your page listed as a source the answer drew on, which can happen without your name appearing at all. A recommendation is the engine actively putting you forward as an option. They move independently and they are worth different amounts, so we score them separately and never add them together.

How many times do you repeat each question?

At least three times per engine in a standard reading, because verdicts flip between repeats (our published volatility data measures exactly how often). Before we quote a naming-margin for a single question, that question gets 15 to 20 asks, so the number comes with a confidence range instead of a lucky draw.

Is this the same method behind the Answer Volatility Reading?

Yes. Same measurement engine, same engine configurations, same five-verdict classification. The reading is the aggregate view across every question we track; this page shows you one measurement up close.

Want this run on your business? The first ten questions are free with the ten-question audit.

Measured 2 July 2026 · Published 3 July 2026 · Method as in the Answer Volatility Reading
