The Answer Volatility Reading: how much AI answers change between runs

Name: Aeova Partners Answer Volatility Reading, July 2026
Creator: Aeova Partners
License: https://creativecommons.org/licenses/by/4.0/

AI assistants change their answers. Across 1,211 measured answers, repeating an identical question on the same engine flipped whether a given business was recommended in 19% of repeated question-engine cells, and different engines gave different majority verdicts on 29% of questions. These numbers come from our own measurement engine and update as our measurement base grows.

1,211

Measured answers

Individual AI answers captured and classified to date, across ChatGPT, Perplexity and Google AI Overviews.

19%

Same question, same engine, different verdict

Of cells with three or more same-day repeats, 19% flipped between recommending and not recommending a given business across repeats.

29%

Engines disagree with each other

On 29% of questions, the majority verdict differed between engines measured on the same day.

Chart legend: answers that keep the same verdict on repeat; answers that flip on repeat (19%); questions where engines disagree with each other (29%).

What this means if you're buying AEO

A screenshot is not evidence. Anyone can screenshot a good answer, and about one question-engine cell in five flips its verdict on repeat. Demand proportions over repeated runs, before and after.
One engine is not the market. Nearly three in ten questions get a different majority verdict on a different assistant. Visibility on ChatGPT says little about Google AI Overviews, and vice versa.
Movement claims need controls. If answers wobble naturally, improvement claims must beat the wobble: measured lift on changed queries against untouched control queries.
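
The control comparison in the last point can be sketched as a difference of differences: the change in recommended share on changed queries, minus the change on untouched control queries over the same period. This is a minimal illustration, not our production calculation. The function names and the run lists are hypothetical, and the numbers are made up for the example.

```python
def recommended_share(verdicts):
    """Proportion of runs whose verdict is 'recommended'."""
    if not verdicts:
        raise ValueError("need at least one run")
    return sum(v == "recommended" for v in verdicts) / len(verdicts)


def controlled_lift(changed_before, changed_after, control_before, control_after):
    """Change on changed queries minus the drift seen on control queries."""
    changed_delta = recommended_share(changed_after) - recommended_share(changed_before)
    control_delta = recommended_share(control_after) - recommended_share(control_before)
    return changed_delta - control_delta


# Hypothetical illustration data, not measured results.
changed_before = ["absent"] * 8 + ["recommended"] * 2   # 20% recommended
changed_after = ["absent"] * 5 + ["recommended"] * 5    # 50% recommended
control_before = ["absent"] * 8 + ["recommended"] * 2   # 20% recommended
control_after = ["absent"] * 7 + ["recommended"] * 3    # 30% recommended

lift = controlled_lift(changed_before, changed_after, control_before, control_after)
print(f"controlled lift: {lift:+.0%}")
```

In this made-up case the changed queries rose by 30 points, but the controls rose by 10 on their own, so only about 20 points can be attributed to the change.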

Methodology in detail

Prompt universe. 96 unique questions across two UK service categories (77 in the pilot category, 19 in the second), drawn from real customer questions and, where available, Google Search Console demand, spanning commercial, local, price-led, comparison and informational intents. Questions are versioned and held fixed between readings.

Engine setup. "ChatGPT" means the OpenAI API with the web-search tool enabled, which behaves like the consumer product's search mode but is not identical to it (no personalisation, no chat memory). "Perplexity" means the Perplexity API (sonar-pro). "Google AI Overviews" means the AI Overview captured from UK desktop search results via a search API. All engines are UK-configured. API measurement is chosen deliberately: it is stable, repeatable and unpersonalised, which is what makes readings comparable over time.

Classification rules. Every answer receives exactly one verdict:

Verdict	Meaning
recommended	The business is actively suggested as a suitable option
mentioned	The business is named, but not recommended
negative	The business is named in a critical or warning context
cited	The business's page is used as a source, but the business is not named in the answer
absent	No mention and no citation
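
As an illustration of how a deterministic brand-and-domain pass might separate the bottom of this table from the top, here is a minimal sketch. It is an assumption about the general shape of such a matcher, not our classifier. The function name, its parameters and the example brand are hypothetical. It only resolves "absent" and "cited"; an answer that names the business is returned as "named" and still needs the finer recommended, mentioned or negative verdict.

```python
import re
from urllib.parse import urlparse


def first_pass_verdict(answer_text, cited_urls, brand_names, brand_domain):
    """Return 'absent', 'cited', or 'named' (the last still needs a finer verdict)."""
    # Whole-word, case-insensitive match on any known brand name.
    # Note: \b assumes names start and end with a letter or digit.
    named = any(
        re.search(rf"\b{re.escape(name)}\b", answer_text, flags=re.IGNORECASE)
        for name in brand_names
    )
    if named:
        return "named"  # recommended / mentioned / negative is decided elsewhere

    # A citation counts when the source host is the brand's domain or a subdomain.
    for url in cited_urls:
        host = (urlparse(url).hostname or "").lower()
        if host == brand_domain or host.endswith("." + brand_domain):
            return "cited"
    return "absent"


# Hypothetical example: the page is used as a source but the business is not named.
print(first_pass_verdict(
    "Several local firms offer this service.",
    ["https://www.example.co.uk/prices"],
    ["Example Plumbing"],
    "example.co.uk",
))
```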

Definitions. A question-engine cell is one question asked of one engine on one day. A cell with three or more same-day repeats is volatile if those repeats disagree on whether a given business is recommended. Engine disagreement means the majority verdict differs between engines on the same question, measured the same day. Classification is deterministic brand-and-domain matching, with a language-model pass only for ambiguous cases.
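
The two definitions above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not our measurement engine: rows are tuples of (question id, engine, repeat, verdict, date) for a single business, the majority verdict is taken over the five verdict labels, and ties go to the verdict seen first. The function names and the sample rows are hypothetical.

```python
from collections import Counter, defaultdict

MIN_REPEATS = 3  # cells with fewer same-day repeats are not scored for volatility


def group_cells(rows):
    """Map (question_id, engine, date) -> list of verdicts for that cell."""
    cells = defaultdict(list)
    for question_id, engine, _repeat, verdict, date in rows:
        cells[(question_id, engine, date)].append(verdict)
    return cells


def volatility_rate(rows):
    """Share of eligible cells whose repeats disagree on 'recommended'."""
    eligible = [v for v in group_cells(rows).values() if len(v) >= MIN_REPEATS]
    if not eligible:
        return None
    flipped = sum(len({x == "recommended" for x in v}) > 1 for v in eligible)
    return flipped / len(eligible)


def disagreement_rate(rows):
    """Share of (question, date) pairs where engines' majority verdicts differ."""
    per_question = defaultdict(list)
    for (question_id, _engine, date), verdicts in group_cells(rows).items():
        majority = Counter(verdicts).most_common(1)[0][0]  # tie: first seen wins
        per_question[(question_id, date)].append(majority)
    compared = [m for m in per_question.values() if len(m) >= 2]
    if not compared:
        return None
    return sum(len(set(m)) > 1 for m in compared) / len(compared)


def cell(question_id, engine, verdicts, date="2026-07-01"):
    """Build hypothetical rows for one cell, numbering repeats from 1."""
    return [(question_id, engine, i + 1, v, date) for i, v in enumerate(verdicts)]


# Hypothetical rows for illustration only.
rows = (
    cell("q1", "chatgpt", ["recommended", "absent", "recommended"])
    + cell("q1", "perplexity", ["absent", "absent", "absent"])
    + cell("q2", "chatgpt", ["mentioned", "mentioned", "mentioned"])
    + cell("q2", "perplexity", ["mentioned", "mentioned", "mentioned"])
)
print(volatility_rate(rows), disagreement_rate(rows))
```

In the sample, one of four cells flips on "recommended" and one of two questions has engines with different majority verdicts.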

Limits of this reading

Two UK service categories measured to date. Category mix affects the aggregate numbers (a category where a brand is uniformly absent contributes stable cells and lowers measured volatility), which is why the readings table states category count per reading.
Engines change their behaviour continuously; a reading describes the weeks it was measured, not a permanent property.
API measurement excludes personalisation and location effects that individual users' sessions may show.
Classification involves judgement at the margins even with fixed rules; the rules above are applied consistently across every reading.
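
The category-mix effect in the first limit is simple arithmetic, sketched here with made-up counts. The function name and the cell counts are hypothetical; the sketch assumes the aggregate is pooled over all eligible cells rather than averaged per category.

```python
def pooled_volatility(categories):
    """Pooled rate: total volatile cells over total eligible cells."""
    volatile = sum(c["volatile_cells"] for c in categories)
    eligible = sum(c["eligible_cells"] for c in categories)
    return volatile / eligible


# Hypothetical counts, not measured results.
contested = {"eligible_cells": 100, "volatile_cells": 30}       # brand appears in some runs
uniformly_absent = {"eligible_cells": 100, "volatile_cells": 0}  # brand never appears

alone = pooled_volatility([contested])
mixed = pooled_volatility([contested, uniformly_absent])
print(f"contested category alone: {alone:.0%}, with a stable category added: {mixed:.0%}")
```

Adding a category of stable cells halves the pooled figure in this example without any change in how the engines behave on the first category.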

Readings

Reading	Measured answers	Categories	Same-engine volatility	Cross-engine disagreement
July 2026	1,211	2	19%	29%

Citing this reading

Licensed CC BY 4.0. Cite as "Aeova Partners Answer Volatility Reading" with a link to this page. A small anonymised sample of the underlying rows is downloadable: volatility-sample.csv (52 rows: question id, engine, repeat, verdict, date). This page carries Dataset structured data. Journalists and researchers are welcome to ask for methodology detail: felix@aeovapartners.com.
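
For anyone working with the sample file, here is a minimal sketch of tallying verdicts per engine. The header names (question_id, engine, repeat, verdict, date) are an assumption based on the column list above, so check them against the file's actual first line. The inline rows are hypothetical stand-ins, not rows from the real file.

```python
import csv
import io
from collections import Counter

# Hypothetical rows in the assumed column layout; not taken from the real file.
SAMPLE = """question_id,engine,repeat,verdict,date
q001,chatgpt,1,recommended,2026-07-01
q001,chatgpt,2,absent,2026-07-01
q001,perplexity,1,mentioned,2026-07-01
"""


def verdict_counts_by_engine(csv_file):
    """Count how often each verdict appears per engine in an open CSV file."""
    counts = {}
    for row in csv.DictReader(csv_file):
        counts.setdefault(row["engine"], Counter())[row["verdict"]] += 1
    return counts


counts = verdict_counts_by_engine(io.StringIO(SAMPLE))
for engine, tally in counts.items():
    print(engine, dict(tally))
```

To run it on the downloaded file instead, pass `open("volatility-sample.csv", newline="")` in place of the `io.StringIO` object.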

Common questions

Why do AI answers change between identical runs?

Assistants generate answers rather than retrieve fixed ones: sampling variation, retrieval differences and answer phrasing all shift from run to run. That is not a flaw in measurement; it is the reason measurement must be repeated. A single screenshot, good or bad, proves nothing.

Does volatility mean AI recommendations can't be trusted?

It means single observations can't be trusted. Measured in proportions over repeated runs, AI recommendation behaviour is stable enough to track, compare and improve, which is exactly how we report it.

Can I cite these numbers?

Yes, with attribution and a link: "Aeova Partners Answer Volatility Reading, aeovapartners.com/answer-volatility". The reading is licensed CC BY 4.0 and updates as our measurement base grows.

Updated July 2026
