The difference between claiming evidence and having it
One brand, one study, one boundary condition. The standard that separates evidence-based from evidence-inspired.
BUILDING BEHAVIOURKIT
Lauren Kelly
7/18/2025
I've spent the last few months building the evidence base for BehaviourKit's plays, and I want to talk about the standard I've set, because I think it matters for anyone building tools that claim to be evidence-based.
Here's the rule: every evidence item must be a specific, named, single case. One brand. One campaign. One programme. One study. No aggregations, no generalisations, no broad "research suggests."
"Headspace onboarding in 2019 used a single breathing exercise as the first action."
"NHS electronic repeat dispensing reduced medication renewal friction by automating re-applications."
"Opower's ongoing energy reports sustained household behaviour change over multiple years."
Each item needs a real, verifiable link to the actual write-up, paper, or documentation. And each item needs a boundary condition: when does this work, and when does it fail?
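To make that standard concrete, here's roughly what a single evidence item looks like as a record. This is a minimal sketch: the field names, the placeholder link, and the boundary-condition wording are my illustration, not BehaviourKit's actual schema.

```typescript
// Illustrative only: field names and example values are a sketch, not the real schema.
interface EvidenceItem {
  claim: string;       // the specific, named, single case in one sentence
  source: string;      // one brand, campaign, programme, or study
  year: number;        // when the case ran or was published
  sourceUrl: string;   // verifiable link to the write-up, paper, or documentation
  worksWhen: string;   // boundary condition: when does this hold?
  failsWhen: string;   // boundary condition: when does it stop holding?
}

const headspaceExample: EvidenceItem = {
  claim: "Headspace onboarding used a single breathing exercise as the first action.",
  source: "Headspace onboarding, 2019",
  year: 2019,
  sourceUrl: "https://example.com/placeholder-write-up", // stands in for the real link
  worksWhen: "the first action is tiny and completable in one sitting",   // illustrative
  failsWhen: "the first action demands setup or commitment before payoff", // illustrative
};
```

The point of the shape is that none of the fields are optional: a claim without a link, or a link without a boundary condition, doesn't meet the bar.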
This standard is considerably harder to meet than the one most tools in this space set for themselves. "Evidence-based" has become a marketing term that can mean anything from "we read some papers while building this" to "every recommendation is traceable to specific studies with documented boundary conditions." The gap between those two meanings is enormous, and most products sit somewhere in the middle without being transparent about where.
I want BehaviourKit to sit at the specific end of that spectrum, and I want the transparency to be built into the product rather than left to a footnote on the about page.
Confidence calls and the quality of each research link have always been captured in BehaviourKit, and they matter even more now that the system has so many moving parts. So every matrix cell now carries a confidence level: strong, good, or emerging. Strong means there's direct empirical evidence for the specific connection between the contradiction type and the recommended intervention. Good means there's solid evidence for the mechanism and supporting case studies. Emerging means the logic is sound and the mechanism is plausible, but the direct evidence is thin.
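If you wanted to sketch what a matrix cell carries, it would look something like this. The three levels are real; the field names and example values are just my illustration, not the product's internals.

```typescript
// Sketch only: the three levels mirror the post; everything else here is hypothetical.
type Confidence = "strong" | "good" | "emerging";

interface MatrixCell {
  contradictionType: string;  // the contradiction the cell addresses
  recommendedPlay: string;    // the intervention the cell recommends
  confidence: Confidence;     // strong | good | emerging
  rationale: string;          // why this level, in one line
}

const cell: MatrixCell = {
  contradictionType: "intends to act but never starts",
  recommendedPlay: "single tiny first action",
  confidence: "good",
  rationale: "solid evidence for the mechanism plus supporting case studies",
};
```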
Those distinctions matter more than they might seem. When the system recommends a play with strong confidence, the user knows this is well-trodden ground. When it recommends with emerging confidence, the user knows they're in newer territory and should monitor results more carefully. That honesty builds trust, because it signals that the system knows what it knows and also knows what it doesn't.
I've been running the evidence gathering using the same approach I described in the last post. I set the rules (what counts as evidence, what standard each item must meet, what fields are required) and then work through the plays systematically, checking each connection against the literature and the case base. The AI helps me manage the scale, checking hundreds of play-evidence links against the quality criteria, flagging gaps, and identifying where the confidence basis is weaker than the system currently claims.
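In spirit, that checking pass is simple: walk every play-evidence link and flag anything missing a verifiable source or a boundary condition, or claiming more confidence than its evidence supports. A rough sketch, with hypothetical field names and an arbitrary threshold, not the actual tooling:

```typescript
// Rough sketch of the gap check. Field names and the "strong needs 2+ items" rule are assumptions.
type ConfidenceLevel = "strong" | "good" | "emerging";

interface LinkedEvidence {
  sourceUrl?: string;  // verifiable link to the write-up or paper
  worksWhen?: string;  // boundary condition: when the play holds
  failsWhen?: string;  // boundary condition: when it stops holding
}

interface PlayEvidenceLink {
  playId: string;
  claimedConfidence: ConfidenceLevel;
  evidence: LinkedEvidence[];
}

function flagGaps(link: PlayEvidenceLink): string[] {
  const issues: string[] = [];
  for (const item of link.evidence) {
    if (!item.sourceUrl) issues.push(`${link.playId}: evidence item has no verifiable link`);
    if (!item.worksWhen || !item.failsWhen) {
      issues.push(`${link.playId}: evidence item has no boundary condition`);
    }
  }
  // flag cells where the claimed confidence looks stronger than the direct evidence
  if (link.claimedConfidence === "strong" && link.evidence.length < 2) {
    issues.push(`${link.playId}: "strong" claimed on thin direct evidence`);
  }
  return issues;
}
```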
The domains I'm drawing evidence from are deliberately wide. Healthtech, fintech, sustainability, public health programmes, L&D, retail, government initiatives, workplace change. The reason for casting the net broadly is the same reason I catalogued across domains back in 2020: patterns that recur across multiple fields are more trustworthy than patterns that only show up in one. If a mechanism works for medication adherence in the NHS and also for course completion in a corporate L&D programme, the mechanism itself is doing the work, not the specific context.
Boundary conditions turn out to be the most valuable part of the evidence gathering. Knowing that a play works is useful. Knowing when it stops working is more useful. For example: social proof interventions work well when the real number is high, but they can backfire spectacularly when the real number is low or when the reference group feels unfamiliar to the audience. That boundary condition changes how the system should recommend the play. It's no longer "use social proof" but "use social proof when the real number is high and the reference group feels familiar, and avoid it when it isn't."
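In code terms, a boundary condition turns a flat recommendation into a guarded one. A toy sketch, with made-up numbers and names:

```typescript
// Toy sketch: the threshold and field names are invented to show the shape of a guarded recommendation.
interface SocialProofContext {
  realComplianceRate: number;           // share of people already doing the behaviour (0 to 1)
  referenceGroupFeelsFamiliar: boolean; // does the audience recognise itself in the group?
}

function recommendSocialProof(ctx: SocialProofContext): string {
  // works well when the real number is high and the reference group feels familiar
  if (ctx.realComplianceRate >= 0.5 && ctx.referenceGroupFeelsFamiliar) {
    return "Use social proof: the majority is genuinely on your side.";
  }
  // backfires when the real number is low or the reference group feels distant
  return "Avoid social proof: it risks normalising the unwanted behaviour.";
}
```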
The process has also surfaced some uncomfortable findings. A few plays I was confident about turned out to have thinner evidence than I'd assumed. A few connections I thought were well-supported were actually based on mechanism logic rather than direct empirical testing. Acknowledging those gaps honestly is more useful than papering over them with vague citations.
The evidence layer, I'm increasingly convinced, is the thing that will make BehaviourKit genuinely different in a crowded market. Content is replicable. A good evidence base with real traceability, honest confidence levels, and documented boundary conditions is expensive to build and hard to copy. It requires doing the actual work, across hundreds of items, one case at a time. There aren't many shortcuts.
That's the kind of competitive advantage I'm comfortable building on.
