Keeping the science honest as the system gets smarter
Six rules for maintaining scientific validity when a taxonomy becomes an engine.
BUILDING BEHAVIOURKIT
Lauren Kelly
6/13/2025
There's a tension in BehaviourKit that I want to talk about openly, because I think it matters for anyone building applied tools on top of scientific foundations.
When BehaviourKit was a taxonomy, each piece needed to earn its place. A pattern made the list because I found it recurring across multiple studies and cases. A driver earned its spot because it appeared across multiple behavioural science frameworks. Each connection between a driver and a pattern needed a mechanism, a plausible causal chain grounded in what the science says about how behaviour works. If I couldn't explain why a particular driver connects to a particular intervention, the connection didn't go in.
That standard was manageable because the system was relatively flat. Drivers here. Patterns there. Connections between them. Each connection could be checked against the literature individually.
Now the system is growing into something more complex. The contradiction matrix doesn't just say "this driver connects to this pattern." It says: given this goal and this constraint, this principle resolves the contradiction, and this play delivers the principle, with this evidence supporting it, and this protective control running alongside it, and this confidence level reflecting how much we should trust the route.
That's a chain of reasoning with multiple links. And here's the thing about chains of reasoning: the overall validity is only as strong as the weakest link. If the principle is well-supported but the play is loosely matched, the recommendation inherits the looseness. If the evidence is strong for one context but the system applies it to a different context without flagging the transfer, the confidence is overstated.
In a taxonomy, you validate each piece. In an ontology, you validate each piece AND the inference rules that connect them AND the conditions under which those rules hold AND the variation that the system needs to handle honestly.
That's a significantly harder scientific validation problem.
So I've been thinking carefully about what rules the system needs to follow to keep the science honest as it gets more complex. Here's where I've landed, at least for now.
Rule one: every recommendation must be traceable to a specific mechanism.
The system doesn't recommend a play because it's popular or because it sounds good. It recommends it because there's a named behavioural mechanism connecting the user's situation to the intervention. "Task Ease is low, so Make It Easier helps by reducing the number of steps required to act" contains a mechanism: step reduction lowers cognitive and physical effort, which increases the probability of action. That mechanism has decades of evidence behind it.
When I can't identify the mechanism, the connection doesn't go in. Mechanism traceability is the minimum standard for every route in the system.
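To make the rule concrete, here's a minimal sketch of what mechanism traceability could look like in code. The class and field names are my illustration, not BehaviourKit's actual schema; the point is only that a route without a named mechanism is rejected at the door.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Route:
    driver: str      # e.g. "Task Ease is low"
    play: str        # e.g. "Make It Easier"
    mechanism: str   # the named causal link; never empty

def register_route(routes: list, route: Route) -> None:
    # Rule one: no named mechanism, no route in the system.
    # Popularity or plausibility alone is not enough.
    if not route.mechanism.strip():
        raise ValueError(
            f"Route {route.driver!r} -> {route.play!r} has no mechanism")
    routes.append(route)

routes: list = []
register_route(routes, Route(
    driver="Task Ease is low",
    play="Make It Easier",
    mechanism="Step reduction lowers cognitive and physical effort, "
              "which increases the probability of action.",
))
```

Making the mechanism a required field, rather than a documentation habit, is what turns the standard from an intention into an enforced constraint.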
Rule two: evidence must be specific and boundary-conditioned.
I talked about this in the last post with regard to the evidence base itself. But the rule has implications for the system's inference logic too. When the system routes someone from a contradiction to a principle to a play, the evidence supporting that chain should be specific enough that you could look it up, and honest enough about the conditions under which it holds.
"This works in healthcare settings with motivated participants" is different from "this works." The system needs to carry that context, because a recommendation that transfers well from a hospital to a workplace is different from one that only works in clinical settings with high-engagement participants.
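One way to carry that context is to attach boundary conditions to the evidence record itself, so a transfer outside them is flagged automatically. This is a sketch under assumed names ("example-study-id" is a placeholder, not a real citation), not BehaviourKit's real data model:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Evidence:
    citation: str        # specific enough that you could look it up
    contexts: frozenset  # settings where the finding actually held

def transfer_warning(evidence: Evidence, target: str) -> Optional[str]:
    # Rule two: applying evidence outside its boundary conditions
    # is a transfer, and transfers must be flagged, not hidden.
    if target in evidence.contexts:
        return None
    return (f"{evidence.citation}: observed in {sorted(evidence.contexts)}, "
            f"applied to {target!r} -- treat as a transfer.")

ev = Evidence(citation="example-study-id",  # hypothetical placeholder
              contexts=frozenset({"healthcare"}))
```

Calling `transfer_warning(ev, "workplace")` returns a warning string, while `transfer_warning(ev, "healthcare")` returns `None`: the context travels with the evidence instead of being lost at routing time.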
Rule three: confidence must be graded, visible, and honest.
Not all routes have the same quality of evidence behind them. Some have direct empirical support: a study tested this specific mechanism in this specific context and found this specific result. Others have strong theoretical backing but limited direct testing. Others are plausible inferences from adjacent evidence.
The system tracks these differences using confidence levels: strong, good, and emerging. Strong means direct empirical evidence for the specific connection. Good means solid mechanism evidence plus supporting case studies. Emerging means the logic is sound, the mechanism is plausible, but the direct evidence is thin.
These confidence levels are visible to the user. They're part of the recommendation, not hidden in a database field. The system says: "I'm recommending this with good confidence" or "I'm recommending this with emerging confidence, which means you should monitor results more carefully." That visibility matters because it lets the user calibrate their trust appropriately.
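The three levels map naturally onto a small enumeration, with the user-facing message generated from it. Again a sketch with names of my choosing, not the actual implementation:

```python
from enum import Enum

class Confidence(Enum):
    STRONG = "strong"      # direct empirical evidence for the specific connection
    GOOD = "good"          # solid mechanism evidence plus supporting case studies
    EMERGING = "emerging"  # sound logic, plausible mechanism, thin direct evidence

def present(play: str, confidence: Confidence) -> str:
    # Rule three: confidence travels with the recommendation itself,
    # not hidden in a database field.
    msg = f"Recommending {play!r} with {confidence.value} confidence."
    if confidence is Confidence.EMERGING:
        msg += " Monitor results more carefully."
    return msg
```

Because the caveat is produced by the same function that produces the recommendation, there's no code path where the user sees the play without its confidence level.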
Rule four: the system must not overclaim causation.
Behavioural science is full of associations that aren't clean causal relationships. Social proof is associated with increased adoption. That's well-supported. But saying "social proof causes adoption" overclaims the evidence. There are moderators, boundary conditions, and contextual factors that shape whether the association holds in any given case.
The system's language reflects this. Mechanism descriptions use phrases like "tends to," "is associated with," and "helps by" rather than "causes" or "guarantees." This isn't hedging for the sake of it. It's an accurate representation of what the evidence actually supports. Behaviour is complex, multicausal, and context-dependent. The system's language should reflect that reality.
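A rule about language can itself be checked mechanically. A simple lint over mechanism descriptions, sketched here with an illustrative word list of my own, could catch overclaiming before it reaches users:

```python
# Words that overclaim causation, versus the hedged forms the system prefers.
# Both lists are illustrative, not BehaviourKit's actual vocabulary.
OVERCLAIMS = ("causes", "guarantees")
PREFERRED = ("tends to", "is associated with", "helps by")

def overclaim_check(description: str) -> list:
    # Rule four: mechanism text should describe associations,
    # not promise causal outcomes.
    text = description.lower()
    return [word for word in OVERCLAIMS if word in text]
```

So "Social proof is associated with increased adoption" passes cleanly, while "Social proof causes adoption" is flagged for rewording.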
Rule five: construct definitions must be behaviourally defensible, not just intuitively appealing.
This is the rule that caused the most disruption during the ontology audit. Some constructs in the system had names that sounded right and definitions that felt plausible but weren't well-anchored in the behavioural science literature. "Willpower" is a good example. Everybody knows what willpower means in everyday conversation. In behavioural science, it's a contested construct with significant debate about whether it represents a real mechanism or a folk-psychology label covering several different things (depleted executive function, competing motivation, habit interference, poor environmental support).
The system can't use constructs that two behavioural scientists would define differently. If a construct is too vague or too contested to support consistent diagnosis, it either needs to be tightened to a defensible definition or flagged as provisional.
Rule six: variation must be acknowledged, not hidden.
A taxonomy can treat each construct as uniform. An ontology can't, because it encodes rules about how constructs interact, and those interactions differ by context. The relationship between confidence and task difficulty depends on prior experience. The effectiveness of social proof depends on whether the reference group feels relevant. Making something easier only helps if difficulty was the actual barrier rather than a symptom of something else.
The system handles this through route conditions: plain-language statements about when a route holds and when it doesn't. This is the frontier of what I can currently manage. The variation in human behaviour is enormous. The system captures some of it. It needs to be transparent about the limits of its coverage.
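Route conditions pair a plain-language statement (shown to the user) with a check against the diagnosed situation. A minimal sketch, with structure and names assumed by me:

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

@dataclass(frozen=True)
class RouteCondition:
    statement: str                 # plain-language boundary, shown to users
    holds: Callable[[Dict], bool]  # check against the diagnosed situation

def route_applies(conditions: List[RouteCondition],
                  situation: Dict) -> Tuple[bool, List[str]]:
    # Rule six: a route fires only when every stated condition holds;
    # unmet conditions are surfaced rather than silently ignored.
    unmet = [c.statement for c in conditions if not c.holds(situation)]
    return (not unmet, unmet)

ease_conditions = [
    RouteCondition(
        statement="Difficulty is the actual barrier, not a symptom of something else",
        holds=lambda s: s.get("barrier") == "difficulty",
    ),
]
```

If the user's situation reports a motivation barrier rather than a difficulty barrier, `route_applies` returns the unmet statement itself, so the user sees why "Make It Easier" wasn't recommended, not just that it wasn't.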
How I think about maintenance:
The evidence base isn't static. New studies get published. Existing findings get replicated or challenged. Mechanisms get clarified. The system needs a maintenance rhythm: regular passes through the evidence to check whether confidence levels still hold, flagging new research that strengthens or weakens existing routes, and marking constructs under active debate.
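That maintenance rhythm can be as simple as tracking when each route's evidence was last reviewed and flagging the stale ones. The one-year interval here is an assumption of mine; the post doesn't fix a cadence, and the route names are illustrative:

```python
from datetime import date, timedelta

REVIEW_INTERVAL = timedelta(days=365)  # assumed cadence, not a stated policy

def due_for_review(review_log, today: date) -> list:
    # Flag routes whose evidence hasn't been re-checked within the interval,
    # so confidence levels are periodically re-earned rather than assumed.
    return [name for name, last in review_log if today - last > REVIEW_INTERVAL]

review_log = [
    ("Task Ease -> Make It Easier", date(2024, 1, 10)),   # illustrative entries
    ("Social Proof -> Show Peers", date(2025, 5, 1)),
]
due = due_for_review(review_log, date(2025, 6, 13))
```

The first route comes back as due; the second doesn't. The mechanism is trivial, but it's what turns "maintain it continuously" from a resolution into a queue of work.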
Scientific validity in an applied system is a commitment, not a milestone. You don't achieve it once and move on. You maintain it continuously, or it degrades. That's the standard I'm holding BehaviourKit to. Whether I'm succeeding is something I'll need others to help evaluate. But the rules are explicit, and that's the starting point.
© 2026, BehaviourStudio. All rights reserved. Behaviour Thinking is a registered trademark of BehaviourStudio.
