Experimentation · Prioritisation · Strategy
Stop Guessing. Start Scoring.
How to build an A/B test prioritisation framework that removes subjectivity — and a free Notion template to do it in.
In this article
- The problem with "this feels important"
- What a non-subjective framework actually looks like
- The RICE foundation
- Breaking down Reach into 6 dimensions
- Scoring Impact, Confidence, and Effort
- How to use the framework in practice
- Common mistakes (and how to avoid them)
- Get the free Notion template
The problem with "this feels important"
Let me paint you a very familiar picture.
It's your weekly roadmap review. On the table: 14 A/B test ideas. The PM wants to test the homepage hero. Engineering says the checkout flow is too complex. The CEO saw something a competitor did and "it would be quick to replicate." The data analyst has a hypothesis backed by actual user research — but they're the quietest person in the room.
The data analyst watching the roadmap get decided by who talks loudest.
Two weeks later you're A/B testing button colours. The high-confidence, high-reach test that could move the needle? Pushed to Q3. Again.
This is what happens when prioritisation is subjective. The loudest opinion wins. The best ideas lose. And your roadmap becomes a political artefact rather than a strategic one.
The fix isn't a longer meeting. It's a framework that does the arguing for you.
What a non-subjective framework actually looks like
Let's be honest — you can't remove subjectivity entirely. Humans make the calls. But you can structure subjectivity so that:
- Every test is scored on the same criteria
- The criteria are agreed on in advance (not made up in the meeting)
- Scores come from structured dropdowns, not blank text fields
- The formula does the maths — no one gets to override it mid-presentation
That shift is everything. When the debate moves from "I feel like this is more impactful" to "should this be a 7 or an 8 on confidence?", you're suddenly having a useful conversation grounded in evidence.
The RICE foundation
The framework is built on RICE scoring — originally from Intercom — with one significant upgrade to the Reach dimension.
Each component gets a numerical score. The formula: multiply Reach (weighted ×1.5 to emphasise it) by Impact and Confidence, then divide by Effort. That's your Priority Score. Sort descending. Done.
The 1.5 multiplier on Reach is intentional. Tests that affect more users should rank higher than niche optimisations, all else being equal. If your A/B test only affects one market on desktop with one traffic source — it should score lower than a test that touches your entire audience.
Everyone when they realise scope matters.
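To make the maths concrete, here's a minimal sketch of the formula in Python. The function name and the exact form of the weighting are my reading of the description above, not pulled from the template itself:

```python
def priority_score(reach: float, impact: float, confidence: float, effort: float) -> float:
    """Priority = (Reach x 1.5) x Impact x Confidence / Effort.

    Assumes reach is the 0-10 average of the six reach dimensions (below),
    impact and confidence come from the dropdown tables, and effort is
    the 2-10 sum of the two effort scores.
    """
    if effort <= 0:
        raise ValueError("effort must be positive")
    return (reach * 1.5) * impact * confidence / effort

# A broad, well-evidenced, cheap test vs. a niche, opinion-backed one:
print(priority_score(reach=9.0, impact=6, confidence=10, effort=4))  # 202.5
print(priority_score(reach=3.0, impact=4, confidence=2, effort=3))   # 12.0
```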
Breaking down Reach into 6 dimensions
Standard RICE treats Reach as a single number ("how many users per quarter?"). That's fine for product features. For A/B tests on a website, it's too blunt.
A test might reach 100% of users in theory — but only on mobile, in one market, on one page type, hidden below the fold. That's not the same reach as a test that's visible to everyone, everywhere, above the fold.
So instead, Reach is scored across 6 dimensions, then averaged:
| Dimension | What it measures | Max points |
|---|---|---|
| Market scope | All markets vs. one local market | 10 |
| Device scope | Both mobile + desktop vs. one device type | 10 |
| Journey / audience scope | Most users vs. a very specific segment | 10 |
| Page traffic level | Very high-traffic area vs. low-traffic area | 10 |
| Template / page coverage | All pages of that type vs. one isolated page | 10 |
| Visibility on page | Above the fold and prominent vs. hidden | 10 |
Each dimension has a structured dropdown with clear options. Not a blank number field. Not a free text box. A dropdown where the label tells you exactly what score you get and why. Like this:
| Option in the dropdown (Market scope) | Score |
|---|---|
| All markets | 10 |
| Top markets only | 7 |
| Few secondary / non-top markets | 4 |
| One local market only | 1 |
You pick your option. The score calculates automatically. No debate needed.
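If you ever need the dropdown-to-score logic outside Notion (say, in a quick analysis script), here's a minimal sketch. Only the market-scope labels come from the table above; the other five dimensions would each get their own labelled mapping:

```python
# Market scope dropdown from the table above. Each of the six reach
# dimensions gets an equivalent label-to-score mapping.
MARKET_SCOPE = {
    "All markets": 10,
    "Top markets only": 7,
    "Few secondary / non-top markets": 4,
    "One local market only": 1,
}

def reach_score(dimension_scores: list[int]) -> float:
    """Average the six reach dimension scores (1-10 each) into one Reach score."""
    assert len(dimension_scores) == 6, "expected one score per reach dimension"
    return sum(dimension_scores) / len(dimension_scores)

# Example: all markets, mobile only, broad audience, high traffic,
# all templates of that type, but below the fold:
print(reach_score([MARKET_SCOPE["All markets"], 4, 10, 10, 10, 4]))  # 8.0
```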
Scoring Impact, Confidence, and Effort
Impact
Based on expected business impact within 30 days. Deliberately revenue-anchored to keep things concrete:
| Option | Score |
|---|---|
| Massive impact — 7%+ RPU (revenue per user) uplift or $50k+ incremental | 10 |
| High impact — 5–7% RPU or $25k–$50k | 8 |
| Moderate impact — 3–5% RPU or $10k–$25k | 6 |
| Small impact — 1.5–3% RPU or $5k–$10k | 4 |
| Minimal impact — 0.1–1.5% RPU or $1k–$5k | 2 |
If you don't know your RPU targets, use the revenue ranges as anchors. The key is that "I think it'll be impactful" becomes "I think this is a 6 because it could move RPU by 3–5%." That's a sentence someone can challenge with data.
Confidence
How much evidence backs this hypothesis? This is where the research actually gets rewarded:
| Option | Score |
|---|---|
| Quant + qual sources + prior successful A/B test | 12 |
| 3+ valid sources including quant and qual | 10 |
| 2 sources including quant and qual | 8 |
| 1 valid source | 6 |
| Weak directional signal below threshold | 4 |
| Opinions only — no research | 2 |
Notice the top score goes above 10. That's deliberate. Tests backed by prior successful A/B test results in the same area deserve to be elevated — they're not hypotheses, they're near-certainties.
"We have strong evidence for this." "Cool, is that based on actual data or the CEO's last trip to a competitor's website?"
Effort
Split into two parts — because building the test and shipping it permanently if it wins are very different things:
Test implementation effort (1–5 pts):
- Very light build, CSS tweak only → 1
- Light front-end test → 2
- Moderate FE + QA → 3
- Complex FE / feed / app logic → 4
- Backend or major dependency → 5
Permanent rollout effort (1–5 pts):
- Copy change only, minimal effort → 1
- Limited coordination needed → 2
- Multiple templates or markets → 3
- Multi-team dependency → 4
- Significant backend / legal / localisation → 5
Total effort = the two scores added together, giving a range of 2 to 10. This gets used as the denominator in the priority formula — so high-effort tests get pulled down proportionally.
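A quick worked example of that pull, reusing the hypothetical priority_score sketch from earlier. Same reach, impact, and confidence; only effort differs:

```python
# CSS tweak plus limited-coordination rollout (effort 1 + 2 = 3)
cheap = priority_score(reach=8.0, impact=6, confidence=8, effort=3)
# Backend build plus multi-team rollout (effort 5 + 4 = 9)
heavy = priority_score(reach=8.0, impact=6, confidence=8, effort=9)
print(cheap, heavy)  # 192.0 64.0
```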
How to use the framework in practice
Write the hypothesis first
Don't score until you have a proper "if/then/because" hypothesis. No hypothesis = no test.
Fill the dropdowns
Use the structured dropdowns. Do not type raw numbers. The labels tell you exactly what to pick.
Let the score calculate
The framework does the maths. Resist the urge to override it. If you disagree, challenge the inputs — not the output.
Sort by Priority Score
Top scores go first. Review as a team quarterly. Add new tests at any time.
The Priority Tier (High / Medium / Lower) is a manual override — for cases where the score is right but context means the test should wait. Use it sparingly, and document why in the Notes field.
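In script form, "sort descending, tier as a sparing override" might look like this. A sketch only: the test names are invented, and treating tier as the primary sort key is my interpretation of how the override should behave:

```python
TIER_ORDER = {"High": 0, "Medium": 1, "Lower": 2}

tests = [
    {"name": "Checkout trust badges", "priority": 192.0, "tier": "High"},
    {"name": "Homepage hero copy", "priority": 140.0, "tier": "Medium"},  # waiting on legal
    {"name": "Footer link colour", "priority": 34.0, "tier": "Lower"},
]

# Tier groups first (the rare manual override), then Priority Score descending.
roadmap = sorted(tests, key=lambda t: (TIER_ORDER[t["tier"]], -t["priority"]))
for t in roadmap:
    print(t["name"], t["priority"])
```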
The rule of one challenge
When someone disagrees with a score, they must challenge a specific input — not the total. "I think the market scope should be 4, not 7, because this only affects users in EN markets" is a valid challenge.
"I just feel like this test is more important" is not a valid challenge.
This rule alone transforms the quality of your roadmap conversations.
Common mistakes (and how to avoid them)
Mistake 1: Treating the framework as optional
If some tests go through the scoring process and others don't, the framework loses its authority. Every test — no matter who suggested it — goes through the same process. Yes, including the one the CEO mentioned.
Me explaining why the CEO's idea scored a 34 and the analyst's scored a 192.
Mistake 2: Being too generous with Confidence
It's tempting to give every idea a confident score. Resist. If your evidence is "we heard some users mention it in a session six months ago," that's a 4, not an 8. Be honest. Low confidence scores are informative — they tell you where to do more research.
Mistake 3: Using effort as a tiebreaker rather than a genuine score
Low-effort tests are easy to execute but often low-impact. Don't let "it's quick to build" become a backdoor to the top of the backlog. If the reach, impact, and confidence scores are low, a small effort score won't save it.
Mistake 4: Never updating scores
A test that was scored 3 months ago might have different evidence behind it now. Review scores quarterly. If your data team found new evidence, update the confidence score. If the test page got a redesign that changed its traffic, update the reach score.
Mistake 5: Treating Priority Tier as a second vote
The Priority Tier field exists for genuine edge cases — a legal constraint, a seasonal dependency, a technical blocker. It is not there so stakeholders can manually elevate their favourite ideas after the scoring disagreed with them. If you find yourself changing tiers in every meeting, you've rebuilt the subjective process you were trying to replace.
The free Notion template
I've built the entire framework in Notion — every scoring dimension with a dropdown, every score with an auto-calculated formula, and the final Priority Score computing automatically from your inputs.
No formulas to set up. No spreadsheet to maintain. Just open it, duplicate it into your workspace, and start scoring.
Here's what's inside:
- Structured dropdowns for all 6 Reach dimensions, Impact, Confidence, and Effort
- Auto-calculated scores next to every dropdown — pick your option, the score appears
- Reach Score, Total Effort, and Priority Score — all computed automatically
- Priority Tier — colour-coded High / Medium / Lower for quick scanning
- Status tracking — Backlog → Planned → Running → Completed
- Start Date + End Date — so it doubles as a test calendar
- Hypothesis, Problem Statement, KPI fields — because good tests start with good thinking
Me sharing the template. You get a framework! You get a framework!
Get the free Notion template
Comment "FRAMEWORK" on the LinkedIn post and I'll DM you the link directly. No email, no form, no catch.
Already commented? Check your DMs — I send them personally.
Closing thought
The hardest part of building a non-subjective framework isn't the scoring logic. It's getting everyone to agree to use it — and to actually trust it when it contradicts their instincts.
Start with one sprint. Score every test in the backlog using this framework. Don't change the order based on gut feel. See what happens. My guess is that the tests you'd have deprioritised under the old system will start outperforming.
Because it turns out, evidence is a pretty good predictor of outcomes. Who knew.
The team when the data analyst's hypothesis scores highest two sprints in a row.
If you try this and it helps — or you find something that doesn't work — I'd genuinely love to hear about it. Drop a comment below or reach out directly.
Good luck. May your backlog be forever sorted by Priority Score descending.
Stop guessing why users drop off
Discoveo connects your GA4 funnel data with user feedback to explain the why — and prioritise what to fix first.
Discover Discoveo →