Anyone can describe a methodology. This page shows one working, on real data, with the uncomfortable parts left in. Everything below comes from the measurement we ran on ourselves on 2 July 2026: 19 questions, three engines, 168 scored answers. Client measurements use the same system on a bigger set (a full baseline is roughly 75 questions and 700 answers).
1. The question set
We measure real buying questions, not keywords, grouped by what they are worth. This is a sample of the actual set we run on ourselves; a client set is built the same way from their customers' questions and their search data.
| Intent | Question, verbatim |
|---|---|
| Money | best AEO agency UK |
| Money | answer engine optimisation agency UK |
| Money | who can help my business get recommended by ChatGPT |
| How-to | how to get my business recommended by ChatGPT |
| How-to | why does ChatGPT recommend my competitor and not me |
| Pricing | how much does AEO cost UK |
| Definitional | what is answer engine optimisation |
| Comparison | Scrunch vs Otterly |
| Brand | Aeova Partners reviews |
2. What each answer is scored as
Every answer gets exactly one verdict. These are different things worth different amounts, and a tracker that blends them into one score is hiding information from you.
| Verdict | Meaning |
|---|---|
| Recommended | The engine actively puts the business forward as an option. The only verdict that wins customers. |
| Mentioned | Named somewhere in the answer, but not put forward. |
| Cited only | The business's page is listed as a source the answer drew on, but the name never appears in the answer. Real visibility, invisible to the customer. |
| Negative | Named, and warned about or criticised. |
| Absent | Not named, not cited. Most common verdict in most categories. |
One more distinction we keep separate: the inferred retrieval source. When an engine cites a directory that lists you, we can infer your page entered its reading list even though you were never shown. We report that as retrieval, never as visibility, because the customer never saw you.
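As a rough illustration of the one-verdict rule, the table above can be sketched as a small classifier. This is not our production code: the four input flags and the precedence between them (a warning outranks a recommendation) are assumptions made for the example.

```python
from enum import Enum


class Verdict(Enum):
    RECOMMENDED = "Recommended"
    MENTIONED = "Mentioned"
    CITED_ONLY = "Cited only"
    NEGATIVE = "Negative"
    ABSENT = "Absent"


def classify(named: bool, put_forward: bool, criticised: bool, cited: bool) -> Verdict:
    """Return exactly one verdict for one answer.

    The four flags are hypothetical inputs, assumed to come from reading the answer.
    """
    if named and criticised:
        # Assumption: a warning outranks anything else said about the business.
        return Verdict.NEGATIVE
    if named and put_forward:
        return Verdict.RECOMMENDED
    if named:
        return Verdict.MENTIONED
    if cited:
        # The page is listed as a source, but the customer never sees the name.
        return Verdict.CITED_ONLY
    return Verdict.ABSENT


print(classify(named=False, put_forward=False, criticised=False, cited=True).value)
```

An inferred retrieval source would sit beside the verdict as a separate field in a scheme like this, never as a sixth verdict, for the reason given above.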
3. One real answer, tracked end to end
Question: "answer engine optimisation agency UK". Asked 2 July 2026, three times per engine. ChatGPT's answer presented a table of six recommended UK providers and listed the sources it drew them from:
| What we recorded | Value, verbatim from the run |
|---|---|
| Verdict for Aeova, all 3 engines, all repeats | Absent (9 of 9) |
| Sources ChatGPT cited | found.co.uk · tilio.co.uk · aeoagency.co.uk · primointeractive.com · otterlabs.co.uk · scopesite.co.uk |
| What that means | Six agencies were recommended for the exact question our business exists to win, and we were not one of them. That row is our own before-photo, kept. |
This is what citation tracking looks like in practice: every answer is stored with its sources, so when a verdict changes later, we can show which source changed it, not just that it changed.
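A minimal sketch of why storing sources with every answer matters: with both readings on file, the difference between two source lists is a simple set comparison. The record layout and the `source_changes` helper are hypothetical, and the "later" reading is invented purely to show the mechanics; only the first record uses values from the run above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StoredAnswer:
    question: str
    engine: str
    asked_on: str
    verdict: str
    sources: frozenset[str]


def source_changes(before: StoredAnswer, after: StoredAnswer) -> dict[str, list[str]]:
    """List which cited sources appeared or disappeared between two readings."""
    return {
        "added": sorted(after.sources - before.sources),
        "dropped": sorted(before.sources - after.sources),
    }


# Values from the 2 July 2026 run described above.
before = StoredAnswer(
    question="answer engine optimisation agency UK",
    engine="ChatGPT",
    asked_on="2026-07-02",
    verdict="Absent",
    sources=frozenset({
        "found.co.uk", "tilio.co.uk", "aeoagency.co.uk",
        "primointeractive.com", "otterlabs.co.uk", "scopesite.co.uk",
    }),
)

# Hypothetical later reading, invented for illustration only.
after = StoredAnswer(
    question=before.question,
    engine=before.engine,
    asked_on="(hypothetical later date)",
    verdict="Mentioned",
    sources=(before.sources - {"scopesite.co.uk"}) | {"example.com"},
)

print(source_changes(before, after))
```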
4. The fan-out: the questions the machine actually asks
Engines do not search your customer's words. They rewrite the question into their own sub-queries, run those, and build the answer from what comes back. We capture those hidden sub-queries. Here is a real capture, unedited, for the question "Aeova Partners reviews":
| # | Machine-generated sub-query |
|---|---|
| 1 | Aeova Partners reviews |
| 2 | Aenova Group reviews |
| 3 | BC Partners reviews |
| 4 | Aeova Partners company information |
| 5 | Is Aeova Partners related to Aenova Group or BC Partners? |
Look at rows 2, 3 and 5: the machine is not sure we exist, so it spends three of its five searches on companies with similar names. That is a measurable, fixable problem (it is why our about page states so plainly who we are and are not), and it is invisible to anyone who only looks at the final answer. Content decisions should be made against these sub-queries, because these are the searches your pages actually have to win.
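The "three of five" count can be reproduced mechanically. The sketch below flags any captured sub-query that mentions a similarly named company; the sub-queries and the two lookalike names are taken from the capture above, while the function name and the simple substring match are assumptions for illustration.

```python
SUB_QUERIES = [
    "Aeova Partners reviews",
    "Aenova Group reviews",
    "BC Partners reviews",
    "Aeova Partners company information",
    "Is Aeova Partners related to Aenova Group or BC Partners?",
]

# Similarly named companies that appear in the capture.
LOOKALIKES = ["Aenova Group", "BC Partners"]


def off_target_rows(sub_queries, lookalikes):
    """Return 1-based row numbers of sub-queries that mention a lookalike name."""
    lowered = [name.lower() for name in lookalikes]
    return [
        row
        for row, query in enumerate(sub_queries, start=1)
        if any(name in query.lower() for name in lowered)
    ]


rows = off_target_rows(SUB_QUERIES, LOOKALIKES)
print(rows, f"{len(rows)} of {len(SUB_QUERIES)}")  # [2, 3, 5] 3 of 5
```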
5. The score: naming-margin, not rank
There is no "position 3" in an AI answer: you are named or you are not, and that varies between asks. So the honest score is a proportion: naming-margin = times named ÷ times asked. It is always quoted with a confidence range, and always next to a same-day competitor panel, so that engine-wide mood swings (they are real and measured) cannot be passed off as anyone's work.
Worked on our own numbers: on 45 money-question answers we were named 0 times. Naming-margin 0%, and with 45 asks the statistics say the true rate is at most about 8%. That "at most" is the point: we quote what the sample can actually support, which is also how you should read any improvement number we ever show you.
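For readers who want to check the arithmetic, here is a sketch of one common way to put a range on a proportion: the Wilson score interval. Choosing Wilson and a two-sided 95% level is an assumption on our part for this example; it happens to land on the "about 8%" ceiling for 0 named out of 45 asked.

```python
import math


def naming_margin(named: int, asked: int, z: float = 1.959964):
    """Return (rate, low, high): observed proportion plus a Wilson score interval.

    z = 1.959964 corresponds to a two-sided 95% interval (an assumed level).
    """
    if asked <= 0:
        raise ValueError("asked must be positive")
    p = named / asked
    denom = 1 + z * z / asked
    centre = (p + z * z / (2 * asked)) / denom
    half = z * math.sqrt(p * (1 - p) / asked + z * z / (4 * asked * asked)) / denom
    # Clamp to [0, 1] so rounding noise never produces an impossible rate.
    return p, max(0.0, centre - half), min(1.0, centre + half)


rate, low, high = naming_margin(0, 45)
print(f"{rate:.0%} named, upper bound {high:.1%}")  # 0% named, upper bound 7.9%
```

The same function applied to a small sample returns a wide range, which is the practical argument for quoting the range and not the bare percentage.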
6. Why we repeat, instead of tracking daily
Most AI-visibility products ask each tracked question once a day and draw a line through the results. Our own published data shows why that line misleads: 19% of identical same-day repeats change their verdict, and in our first client baseline, a third of the answers that named the brand at all dropped it on a same-day re-ask. A single daily ask charts that coin flip, so a daily tracker's graph moves most days, and neither the agency nor the client can say which movements are real.
Repeating each question and quoting a proportion with a confidence range costs more per data point and produces fewer, slower headlines. It is also the only version of this measurement where "the number went up" is allowed to mean something. That trade is the product.
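To make "repeats change their verdict" concrete, here is one possible way to compute a flip rate from stored repeats. Counting every pair of same-day repeats is an assumed definition for this sketch, not necessarily the one behind the published 19% figure, and the sample verdicts are invented.

```python
from itertools import combinations


def flip_rate(repeats_by_ask: dict[tuple[str, str], list[str]]) -> float:
    """Share of same-day repeat pairs whose verdicts disagree.

    Keys are (question, engine); values are that day's verdicts for the pair.
    Comparing every pair of repeats is an assumed definition.
    """
    pairs = 0
    flips = 0
    for verdicts in repeats_by_ask.values():
        for first, second in combinations(verdicts, 2):
            pairs += 1
            flips += first != second
    if pairs == 0:
        raise ValueError("need at least one ask with two or more repeats")
    return flips / pairs


# Invented verdicts, for illustration only.
sample = {
    ("question A", "engine 1"): ["Absent", "Absent", "Absent"],
    ("question A", "engine 2"): ["Mentioned", "Absent", "Mentioned"],
}
print(f"{flip_rate(sample):.0%}")  # 33%
```

A once-a-day tracker has no repeats to compare, so it cannot compute this number at all; it can only plot whichever verdict it happened to draw.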
7. Our own baseline, the before-photo
Measured 2 July 2026, published unretouched. Re-measured monthly; the next reading replaces this table with movement, or with the honest absence of it.
| Question group | Answers | Aeova named |
|---|---|---|
| Money (agency selection) | 45 | 0 |
| How-to | 36 | 0 |
| Definitional | 27 | 0 |
| Tools & comparisons | 27 | 0 |
| Pricing | 18 | 0 |
| Brand-name questions | 15 | 8 (several confused us with similarly named firms) |
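The table can be tallied in a few lines to confirm it matches the 168 scored answers quoted at the top of the page. The dictionary layout and variable names are just a convenient representation for this check.

```python
# (answers scored, answers naming Aeova) per question group, from the table above.
BASELINE = {
    "Money (agency selection)": (45, 0),
    "How-to": (36, 0),
    "Definitional": (27, 0),
    "Tools & comparisons": (27, 0),
    "Pricing": (18, 0),
    "Brand-name questions": (15, 8),
}

total_answers = sum(answers for answers, _ in BASELINE.values())
total_named = sum(named for _, named in BASELINE.values())
named_outside_brand = sum(
    named for group, (_, named) in BASELINE.items()
    if group != "Brand-name questions"
)

print(total_answers, total_named, named_outside_brand)  # 168 8 0
```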
For scale: our first client category baseline ran 75 real customer questions across the same three engines, roughly 700 answers: 20% recommended, 59% absent, with the absent share concentrated exactly where the buying intent is. The pattern of a business that looks fine in Google and does not exist in AI answers is the normal case, not the exception.
Questions we get about this
Why publish a measurement where you score zero?
Because AI answers change between sessions, a single good screenshot proves nothing, and every agency in this category shows you screenshots. A timestamped baseline that we cannot retroactively improve is the only honest starting point, and it is what makes any later movement checkable. We hold ourselves to the standard we sell.
What is the difference between a mention, a citation and a recommendation?
A mention is your name appearing anywhere in the answer text. A citation is your page listed as a source the answer drew on, which can happen without your name appearing at all. A recommendation is the engine actively putting you forward as an option. They move independently and they are worth different amounts, so we score them separately and never add them together.
How many times do you repeat each question?
At least three times per engine in a standard reading, because verdicts flip between repeats (our published volatility data measures exactly how often). Before we quote a naming-margin for a single question, that question gets 15 to 20 asks, so the number comes with a confidence range instead of a lucky draw.
Is this the same method behind the Answer Volatility Reading?
Yes. Same measurement engine, same engine configurations, same five-verdict classification. The reading is the aggregate view across every question we track; this page shows you one measurement up close.
Want this run on your business? The first ten questions are free with the ten-question audit.
Measured 2 July 2026 · Published 3 July 2026 · Method as in the Answer Volatility Reading