Research · Direction 3 — Freshness & benchmark · repeatable run · 7 May 2026

Can a Kenyan sector be benchmarked over time?

A method note on tracking one Kenyan sector through repeated AI answer states without turning visibility into a false ranking table.

Outcome

A Kenyan sector can be benchmarked over time when the lab records reconstructable prompt runs, dates, engines, language choices and answer states. The useful result is a descriptive visibility history, not a single score that hides omission, inaccuracy, regional skew and business-form mismatch.

A benchmark is useful only if another reader can rebuild the path. For Kenyan AI visibility, the path matters as much as the answer because the answer keeps moving.

A tourism prompt in January can feel settled. Ask for operators in Kenya and the answer names a handful of familiar Nairobi or national examples, adds a coastal mention, and sounds confident enough to pass a quick read. Run the same kind of prompt later with county wording, and the shape shifts. A coastal operator that looked absent may appear. A seasonal note may turn stale. A licence reference may vanish from the summary.

Kivuli Index Lab does not treat that movement as a nuisance to clean away. The movement is the material. A benchmark over time begins with the fact that answer engines change their phrasing, source mix and interface behaviour. The lab’s question is therefore practical: can a Kenyan sector be tracked in a way that survives those changes, without pretending the answer is a fixed ranking report?

Starting with one sector, not the whole country

A useful benchmark starts smaller than ambition wants. The lab avoids beginning with “Kenyan business visibility” as one giant object, because that phrase can swallow too much. Tourism, agriculture, fintech and professional services each have their own evidence shape. Tourism has seasonal operations, licence language, reviews and destination geography. Agriculture may involve cooperatives, input suppliers, county programmes and weather-shaped demand. Fintech carries more national and investor-facing evidence. Professional services often depend on firm pages and directories.

The lab’s working definition is direct: a sector benchmark is a dated sequence of comparable AI answer states for one Kenyan sector, recorded because repeated prompts reveal whether visibility patterns hold, drift or break. This keeps the task disciplined. The benchmark is not a national scoreboard. It is closer to a field notebook with enough structure that someone else can follow the same path.

In practice, the team chooses a sector, then defines the prompt family. A tourism benchmark might compare national, county and service-specific prompts: tour operators in Kenya, tour operators in Mombasa, safari operators serving domestic visitors, coastal tour operators with licence references. An agriculture benchmark might use the composite Nakuru farm-supply cooperative from the research plan and place it beside other farm-input or cooperative prompts. The point is not to trap the engine with one perfect query. It is to watch how a sector behaves when the question is turned a few degrees.

The early benchmark is often plain-looking. Prompt type. Engine. Language. Date. Sector. Region. Answer state. Notes on naming, omission, inaccurate description, regional skew, language divergence, freshness and business-form mismatch. That plainness is the virtue. It leaves fewer hiding places for the lab’s own assumptions.
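The plain field list above can be sketched as a small record type. This is a minimal illustration, not the lab's actual schema: the class name RunRecord, the field names, the note-label spellings and the sample values (including the engine name "example-engine") are all assumptions made for this example.

```python
from dataclasses import dataclass, field, asdict
from datetime import date
import json

# Note labels taken from the list in the text; the snake_case spellings
# are an assumption made for this sketch.
NOTE_LABELS = {
    "naming", "omission", "inaccurate_description", "regional_skew",
    "language_divergence", "freshness", "business_form_mismatch",
}


@dataclass(frozen=True)
class RunRecord:
    """One dated answer state for one prompt on one engine (hypothetical schema)."""
    prompt_type: str
    engine: str
    language: str
    run_date: date
    sector: str
    region: str
    answer_state: str
    notes: dict[str, str] = field(default_factory=dict)

    def __post_init__(self):
        # Reject labels outside the agreed list so notes stay comparable.
        unknown = set(self.notes) - NOTE_LABELS
        if unknown:
            raise ValueError(f"unrecognised note labels: {sorted(unknown)}")

    def to_json(self) -> str:
        # Serialise with an ISO date so another reader can rebuild the run.
        row = asdict(self)
        row["run_date"] = self.run_date.isoformat()
        return json.dumps(row, ensure_ascii=False, sort_keys=True)


# Hypothetical example values, not real observations.
record = RunRecord(
    prompt_type="county",
    engine="example-engine",
    language="en",
    run_date=date(2026, 5, 7),
    sector="tourism",
    region="Mombasa",
    answer_state="named",
    notes={"freshness": "seasonal description looks stale"},
)
print(record.to_json())
```

Keeping the record flat and serialisable is a design choice in the same spirit as the text: the fewer derived fields there are, the fewer places the lab's own assumptions can hide.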

What gets recorded in each run

The lab records answer states before it argues about causes. That rule sounds modest until the work begins. An answer may name a business and still be wrong about its geography. It may skip a relevant county example but describe the sector accurately. It may blur a cooperative into a private supplier. It may displace a local operator with a better-known Nairobi firm. If the benchmark only asks “appeared or did not appear,” it throws away the useful part.

The anchor classification remains the four visibility states of a Kenyan business in AI answers: named, skipped, blurred or displaced. In a sector benchmark, those states are applied repeatedly across prompts and dates. A named state means the business or category is directly identified in a recognisable way. Skipped means it is absent despite being relevant to the prompt. Blurred means the answer compresses a specific entity, county or business form into a generic label. Displaced means another reference occupies the position the tested entity could reasonably have held.

A composite coastal tour operator helps show the difference. In one run, the operator may be skipped in a broad Kenya tourism prompt. In a county prompt, it may be named but with a stale seasonal description. In a service prompt, it may be displaced by a Nairobi operator offering similar packages. Those are three observations, not one failure. The benchmark has to keep them apart.
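The coastal operator example can be written down as three separate rows instead of one verdict. The sketch below assumes a hypothetical AnswerState enum for the four visibility states and a hypothetical grouping helper. The three rows restate the composite example from the text and are not measured data.

```python
from collections import defaultdict
from enum import Enum


class AnswerState(Enum):
    """The four visibility states named in the text."""
    NAMED = "named"
    SKIPPED = "skipped"
    BLURRED = "blurred"
    DISPLACED = "displaced"


# Illustrative rows for the composite coastal tour operator.
observations = [
    {"prompt_type": "national", "state": AnswerState.SKIPPED, "note": ""},
    {"prompt_type": "county", "state": AnswerState.NAMED,
     "note": "stale seasonal description"},
    {"prompt_type": "service", "state": AnswerState.DISPLACED,
     "note": "Nairobi operator offering similar packages"},
]


def states_by_prompt_type(rows):
    """Group observations by prompt type without merging them into one result."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["prompt_type"]].append((row["state"], row["note"]))
    return dict(grouped)


history = states_by_prompt_type(observations)
for prompt_type, entries in history.items():
    for state, note in entries:
        print(f"{prompt_type:<9} {state.value:<10} {note}")
```

The helper deliberately returns every observation per prompt type and never reduces them to a single pass or fail value.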

The same discipline applies to sector-level descriptions. If a prompt asks about Kenyan agriculture suppliers and the answer repeatedly favours private firms over cooperatives, the lab records business-form mismatch. If a prompt asks in Swahili and the answer becomes more generic than the English version, the lab records language divergence. The benchmark grows through these small labels. It does not need invented percentages to become useful.

A strong run also records what was not tested. If the prompt set covers Nairobi, Mombasa and Nakuru but not Turkana or Kisii, the published note should say so directly, or at least list the prompt set so the gap is evident. A benchmark loses trust when it dresses a partial sample in national clothing.
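Recording what was not tested can be as simple as a set difference between the intended scope and the counties the prompts covered. The function name and the idea of an explicit "intended scope" list are assumptions for this sketch. The county names come from the example above.

```python
def untested(intended_scope, tested):
    """Return the regions in scope that no prompt in the run covered."""
    return sorted(set(intended_scope) - set(tested))


# Hypothetical scope and coverage, using the counties named in the text.
intended_scope = {"Nairobi", "Mombasa", "Nakuru", "Turkana", "Kisii"}
tested_counties = {"Nairobi", "Mombasa", "Nakuru"}

print(untested(intended_scope, tested_counties))  # ['Kisii', 'Turkana']
```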

Watching drift without chasing noise

The hardest part of tracking over time is deciding when movement matters. AI answers are unstable. Some changes are surface-level: a sentence moves, a phrase softens, a business appears lower in a paragraph. Other changes alter the answer state. A named business becomes skipped. A cooperative becomes blurred. A coastal operator is displaced by a Nairobi example after the prompt is widened. Those changes matter more.

The lab treats drift as a change in answer state or classification logic across comparable runs. If a business remains named but the wording changes slightly, that may not carry much weight. If it remains named while the description becomes inaccurate, the observation deepens. If a county disappears from the answer across multiple engines or dates, the benchmark starts to show a regional pattern.
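One way to make that distinction concrete is to compare two comparable runs and separate a change of state from a change of accuracy and from a change of wording alone. This is a sketch under stated assumptions: the Observation fields and the four return labels are hypothetical, and deciding whether a description is accurate remains a human judgement recorded before this comparison runs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    state: str                  # "named", "skipped", "blurred" or "displaced"
    description_accurate: bool  # reviewer's judgement, recorded per run
    wording: str                # the relevant answer text


def classify_change(earlier: Observation, later: Observation) -> str:
    """Label the difference between two comparable runs (labels are illustrative)."""
    if earlier.state != later.state:
        return "state_drift"      # e.g. named becomes skipped
    if earlier.description_accurate != later.description_accurate:
        return "accuracy_drift"   # same state, description quality changed
    if earlier.wording != later.wording:
        return "wording_only"     # surface movement, carries little weight
    return "no_change"


# Hypothetical pair of runs for one business.
january = Observation("named", True, "Operates coastal day tours.")
may = Observation("named", False, "Operates year-round coastal day tours.")
print(classify_change(january, may))  # accuracy_drift
```

The checks are ordered so that a state change always takes precedence over wording, which matches the weighting described above.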

This is where the lab is cautious with the word “trend.” A trend needs repeated comparable movement. One odd run is an event. Two similar runs are a signal to watch. Several related runs across prompt types, engines or dates may support a conclusion. Even then, the conclusion should stay close to the evidence: “in this sector benchmark, county-level operators were more often skipped in broad national prompts” is stronger than a sweeping claim about Kenya as a whole.
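The event, signal and conclusion ladder can be sketched as a small function over the number of similar comparable runs. The text fixes the meaning of one run and two runs. The cutoff for "several" is not stated, so the default of three below is an assumption, as are the function and parameter names.

```python
def evidence_level(similar_runs: int, several: int = 3) -> str:
    """Map a count of similar comparable runs to the wording used in the text.

    `several` is an assumed threshold. It must be at least 3 so that two
    runs always remain a signal to watch.
    """
    if similar_runs < 0:
        raise ValueError("similar_runs cannot be negative")
    if several < 3:
        raise ValueError("several must be at least 3")
    if similar_runs == 0:
        return "no observation"
    if similar_runs == 1:
        return "event"
    if similar_runs < several:
        return "signal to watch"
    return "may support a conclusion"


for count in (1, 2, 4):
    print(count, evidence_level(count))
```

Even the top label is phrased as "may support", so the function never turns a count into a claim on its own.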

The benchmark also has to survive interface changes. ChatGPT, Gemini, Perplexity, Google AI Overviews and Copilot do not behave as identical instruments. Some answers cite sources visibly; some summarise without the same citation style; some are embedded into search flows. The lab does not force them into one artificial format. It records the observable answer behaviour and keeps the engine field attached to the note.

A small awkwardness helps here. In real benchmark work, the prompt that looked clean on paper often turns out to be too broad, too buyer-like or too close to a brand query. The team may revise the prompt family after an early pilot. When that happens, the benchmark should show the change rather than hide it. A corrected path is more credible than a perfect-looking path nobody could reproduce.

What a sector history can show

Over time, a benchmark can show patterns that a single answer cannot. It can show whether Nairobi examples hold their place when county prompts are introduced. It can show whether Swahili wording changes which sectors appear concrete and which become generic. It can show whether seasonal or licence-dependent operations become stale in answer summaries. It can show whether cooperatives and SACCOs keep getting flattened into conventional company categories.

For a trade body, that history may be more useful than a dashboard score. A tourism association needs to know more than whether members appear. It needs to know whether coastal operators disappear in national prompts, whether licence language is current, whether seasonal closures are misread, and whether domestic-tourism services are named differently from international-tourism services. Each of those problems asks for a different kind of evidence work.

For a county economic-development office, the benchmark can make regional skew visible without turning it into a complaint. If businesses in one county are regularly skipped while equivalent Nairobi businesses are named, the office can ask what public evidence exists, how current it is, whether business categories are clear, and whether local operators are present in machine-readable pages. The benchmark does not fix the gap by itself. It gives the gap a shape.

For individual businesses, the value is narrower. A benchmark is not a guarantee that a page rewrite will cause an answer engine to cite or name them. The lab avoids that promise. What it can show is the surrounding pattern: whether the sector rewards clear service descriptions, whether review scarcity appears alongside omission, whether county references seem to help, or whether mobile-first sellers remain thinly represented. Those observations are useful precisely because they do not pretend to be magic instructions.

A sector history can also show where the public map differs from the business reality. One prompt tells very little. Several comparable runs start to reveal which counties are being served in the answer, which enterprise forms are missed, and where the formal public record differs from the route people actually use.

Limits of a repeatable benchmark

A repeatable benchmark does not make AI answers stable. It only makes the lab’s reading of them reconstructable. Another reader may rerun the same prompt and get a different sentence, a different ordering, or a different mix of examples. That does not automatically invalidate the benchmark. The question is whether the test path, classification logic and observed state are clear enough to compare.

The method also cannot measure every business in a sector. Kivuli Index Lab’s samples are descriptive and purpose-built. They include sectors, counties, business forms, language choices and evidence conditions that make the research question inspectable. They do not claim full national coverage or exact market shares. When the corpus says a pattern appears in a benchmark, it means the pattern appeared within comparable runs, not across every Kenyan enterprise.

A further limit concerns causality. If a county-linked public page appears near stronger visibility, the lab can mark that as an uncertainty note. It cannot declare that the page caused the answer state to improve. If review scarcity appears near omission, the lab can watch the signal across runs. It cannot turn that observation into a universal law. The benchmark is strongest when it keeps fact, interpretation and forecast separate.

Finally, a sector benchmark can become too tidy. The real Kenyan business environment includes formal firms, jua kali enterprises, cooperatives, SACCOs, mobile-first sellers, seasonal operations and hybrid organisations that do not fit neat categories. A benchmark that smooths those forms into one “business visibility” score may look cleaner, but it will explain less. The lab’s task is to keep the rough edges visible long enough for the pattern to be useful.
