What is AB Test Plan?

AB Test Plan is a free AI-powered tool that predicts A/B test outcomes using synthetic persona simulation. Instead of waiting weeks for real traffic, you get a prediction in 60 seconds, complete with a Run/Iterate/Kill verdict, persona-by-persona reasoning, and specific iteration suggestions.

How does the A/B test prediction work?

AB Test Plan generates 6 diverse synthetic personas, each with real economic constraints (fixed budgets, time pressure), specific behavioral patterns (skepticism levels, decision styles), and existing workflow investments (switching costs). Each persona independently evaluates your control and variant, then the tool synthesizes their responses into an actionable prediction with a Run, Iterate, or Kill verdict.

Why should I trust synthetic persona predictions?

Unlike generic AI chatbots, AB Test Plan's personas have rigid constraints that force honest trade-offs — like a real person deciding whether to spend their limited budget on your tool vs. keeping their current workflow. The behavioral anchoring methodology is based on Stanford Generative Agent research and forces personas to prioritize rather than agree.

How does ICE scoring work?

ICE scoring rates each experiment idea on three dimensions: Impact (how much will this move the needle, 1-10), Confidence (how sure are you it will work, 1-10), and Ease (how easy is it to implement, 1-10). The total ICE score helps you prioritize which experiments to run first. Higher scores indicate better candidates for testing.

What frameworks does AB Test Plan use?

AB Test Plan uses ICE Scoring for prioritization, Reforge Growth Loops, Cialdini's 6 Principles of Persuasion, Fogg Behavior Model, Jobs-to-be-Done framework, loss aversion, cognitive load theory, behavioral anchoring, and trade-off forcing methodology for realistic persona simulation.

How do I calculate the right sample size for an A/B test?

The built-in calculator determines sample size from your baseline conversion rate, minimum detectable effect (MDE), confidence level (typically 95%, which corresponds to a 5% significance level), and statistical power (typically 80%). It tells you how many visitors you need per variation and, given your daily traffic, how many days the test will take.
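As a rough sketch of the arithmetic a calculator like this performs, the standard normal-approximation formula for comparing two conversion rates can be written as follows. The function name, the treatment of MDE as a relative lift, and the two-sided test are assumptions made for illustration; this is not a description of AB Test Plan's actual implementation.

```python
import math
from statistics import NormalDist


def sample_size_per_variant(baseline, relative_mde, confidence=0.95, power=0.80):
    """Visitors needed per variation (two-sided two-proportion z-test).

    baseline      -- control conversion rate, e.g. 0.05 for 5%
    relative_mde  -- smallest relative lift worth detecting, e.g. 0.20 for +20%
    """
    p1 = baseline
    p2 = baseline * (1 + relative_mde)  # assumed: MDE is relative, not absolute
    alpha = 1 - confidence
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for significance
    z_beta = NormalDist().inv_cdf(power)           # critical value for power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2
    return math.ceil(n)


# Hypothetical inputs: 5% baseline, detect a 20% relative lift (5% -> 6%).
print(sample_size_per_variant(baseline=0.05, relative_mde=0.20))
```

Halving the MDE roughly quadruples the required sample, which is why small expected lifts need much longer tests.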

Is AB Test Plan free?

Yes, AB Test Plan is completely free. Generate experiment ideas, build hypotheses, calculate sample sizes, preview variants, and run persona predictions at no cost. No account or credit card required.

How long should I run an A/B test?

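Run your test until it reaches the sample size you calculated in advance and has run for at least 1-2 full business cycles (7-14 days minimum), then check whether the result is statistically significant (typically at 95% confidence). Stopping the moment a test first crosses the significance threshold inflates the false-positive rate. But first, run it through AB Test Plan's prediction simulation to make sure the test is worth running at all: by common industry estimates, 70-80% of A/B tests lose or are inconclusive.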

PIE vs ICE vs PXL: Which Test Prioritization Framework Wins?

The short answer: ICE is the fastest to use but the most subjective. PIE gives a better CRO focus with slightly more structure. PXL (created by CXL/ConversionXL) is the most objective of the three, replacing gut-feel scores with structured binary questions. Which one you should use depends on your team's maturity and how much bias you can tolerate in your backlog.

ICE (Impact, Confidence, Ease)

ICE was popularized by Sean Ellis and is the default framework at most growth teams today. You score each test idea on three dimensions, each from 1 to 10, then average (or multiply) the scores to rank your backlog.

Impact: How large is the potential lift on your target metric?
Confidence: How much evidence supports the hypothesis?
Ease: How fast and cheap is this to ship?

The appeal is speed. A team can score 20 backlog items in under 30 minutes. The weakness is that "1 to 10" leaves enormous room for individual bias — one person's 7 is another's 4, and senior voices tend to anchor the room.
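To make the average-versus-multiply choice concrete, here is a minimal sketch with two hypothetical backlog items. The idea names and scores are invented for illustration.

```python
from statistics import mean


def ice_average(impact, confidence, ease):
    """ICE score as the mean of the three 1-10 ratings."""
    return round(mean([impact, confidence, ease]), 1)


def ice_product(impact, confidence, ease):
    """ICE score as the product of the three 1-10 ratings."""
    return impact * confidence * ease


# Hypothetical backlog: (impact, confidence, ease)
backlog = {
    "Rebuild checkout flow": (9, 9, 1),
    "Rewrite CTA copy": (6, 6, 6),
}


def rank_by(score_fn):
    """Return idea names sorted from highest to lowest score."""
    return sorted(backlog, key=lambda name: score_fn(*backlog[name]), reverse=True)


print(rank_by(ice_average))  # ['Rebuild checkout flow', 'Rewrite CTA copy']
print(rank_by(ice_product))  # ['Rewrite CTA copy', 'Rebuild checkout flow']
```

Averaging lets two high ratings carry a very low one (6.3 vs 6.0), while multiplying penalizes any weak dimension (81 vs 216), so the two methods can rank the same backlog differently.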

For a deeper breakdown of each dimension, scoring tables, and the multiply-vs-average debate, read the full ICE scoring guide.

PIE (Potential, Importance, Ease)

PIE was introduced by Chris Goward at WiderFunnel and is specifically designed for conversion rate optimization contexts. The three dimensions are:

Potential: How much room for improvement does this page or flow have? A page already converting at 40% has less headroom than one converting at 1.2%.
Importance: How much traffic and revenue runs through this touchpoint? A test on your highest-traffic landing page is more important than one on a low-traffic thank-you screen, even if the conversion potential is similar.
Ease: Same as ICE — how hard is this to implement?

Each dimension is scored 1 to 10, and the three scores are averaged.

The key difference from ICE is the substitution of "Potential" and "Importance" for ICE's "Impact" and "Confidence." Potential grounds you in headroom rather than expected lift, which is a subtler and often more honest framing. Importance forces an explicit weighting of traffic volume — something ICE only captures implicitly in the Impact score.

PIE is a better default for CRO teams working on landing pages and funnels because it keeps you focused on where improvement is possible and where it matters most. The downside is the same as ICE: the 1-10 scale is still subjective, and teams can still score in ways that reflect enthusiasm rather than evidence.

PXL (CXL's Structured Framework)

PXL was developed by CXL (ConversionXL) as a direct response to the subjectivity problem in ICE and PIE. Instead of asking scorers to pick a number from 1 to 10, PXL replaces most dimensions with binary yes/no questions that each carry a fixed point value. The total score is the sum of those points.

The binary questions cover factors like:

Is this test above the fold? (+2 points)
Does it directly address a known user frustration from research? (+2 points)
Is it supported by A/B test data from your own site? (+2 points)
Does it address a primary goal (e.g., checkout conversion) rather than a secondary metric? (+2 points)
Is it easy to implement? (+1 point)
Is there analytics data supporting this change? (+1 point)

(CXL's full rubric has around 10 questions; the specific weights are published in their CRO certification materials.)
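The scoring mechanics are simple to sketch: each "yes" adds a fixed number of points. The rubric below uses four of the questions listed above with their illustrative point values; the key names and the two example ideas are hypothetical, and this is not CXL's full rubric.

```python
# Illustrative rubric: question -> points awarded for a "yes".
RUBRIC = {
    "above_the_fold": 2,
    "addresses_researched_frustration": 2,
    "easy_to_implement": 1,
    "supported_by_analytics": 1,
}


def pxl_score(answers, rubric=RUBRIC):
    """Sum the fixed point value of every question answered yes.

    Unanswered questions count as "no"; unknown questions raise an error.
    """
    unknown = set(answers) - set(rubric)
    if unknown:
        raise KeyError(f"Questions not in rubric: {sorted(unknown)}")
    return sum(points for q, points in rubric.items() if answers.get(q, False))


# Hypothetical ideas: one backed by research and analytics, one from gut feel.
researched_idea = {q: True for q in RUBRIC}
gut_feel_idea = {"above_the_fold": True, "easy_to_implement": True}

print(pxl_score(researched_idea))  # 6
print(pxl_score(gut_feel_idea))    # 3
```

Because the scorer only answers yes or no, two people with the same evidence in front of them should arrive at the same total.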

The effect is significant. Because scorers are answering objective questions rather than picking arbitrary numbers, inter-rater agreement goes up and HiPPO-driven prioritization (where the highest-paid person's opinion dominates) goes down. Ideas backed by research and data accumulate points naturally. Ideas born from gut feel or executive preference score low unless they happen to pass the structured checks.

The tradeoff is setup time. PXL requires you to document evidence for each idea before you can score it accurately — you can't fake the "supported by user research?" question if you haven't done user research. For early-stage teams without a strong research practice, that dependency can make PXL feel like overkill.

Side-by-Side Comparison

Framework	Dimensions	Objectivity	Speed	Best for
ICE	Impact, Confidence, Ease	Low — free-form 1-10	Very fast	Early-stage teams, large backlogs, growth experiments
PIE	Potential, Importance, Ease	Low — free-form 1-10	Fast	CRO teams focused on landing pages and funnels
PXL	~10 binary/structured questions	High — fixed point values per question	Slower	Mature CRO teams with research data, reducing stakeholder bias

Worked Example

Let's score the same experiment under all three frameworks. The hypothesis: adding a short testimonial directly beneath the checkout CTA will increase purchase conversion on a product detail page.

Before scoring, here's what we know: the page gets significant traffic, the checkout CTA is above the fold, we have heatmap data showing users pause near the CTA, and a prior test adding social proof to a different page lifted conversion by 11%.

ICE scoring:

Impact: 7 (meaningful potential based on prior analogous result)
Confidence: 8 (heatmap evidence + analogous past test)
Ease: 8 (copy and image change, no backend work)
ICE score: 7.7

PIE scoring:

Potential: 6 (page already converts reasonably well, moderate headroom)
Importance: 8 (high-traffic, revenue-critical page)
Ease: 8 (same as above)
PIE score: 7.3

PXL scoring (illustrative subset):

Above the fold? Yes (+2)
Addresses known user friction from research? Yes — heatmap data (+2)
Supported by your own A/B test data? Partial — analogous test, not direct (+1)
Targets primary conversion goal? Yes (+2)
Easy to implement? Yes (+1)
Supported by analytics data? Yes (+1)
PXL score: 9/~14 possible (high relative to most backlog items)
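The arithmetic behind the three scores above can be checked in a few lines. The dictionary keys are hypothetical labels; the numbers are the ones used in this worked example.

```python
from statistics import mean

# Ratings taken from the worked example (1-10 scale).
ice = {"impact": 7, "confidence": 8, "ease": 8}
pie = {"potential": 6, "importance": 8, "ease": 8}

# Points awarded per PXL question in the worked example.
pxl_points = {
    "above_the_fold": 2,
    "known_user_friction": 2,
    "own_ab_test_data": 1,  # partial credit: analogous test, not direct
    "primary_goal": 2,
    "easy_to_implement": 1,
    "analytics_support": 1,
}

ice_score = round(mean(ice.values()), 1)  # averaged
pie_score = round(mean(pie.values()), 1)  # averaged
pxl_score = sum(pxl_points.values())      # summed

print(ice_score, pie_score, pxl_score)  # 7.7 7.3 9
```

ICE and PIE share a 1-10 scale and can be compared directly. The PXL total is a sum of points on a different scale, so it only makes sense relative to other PXL-scored ideas.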

All three frameworks rank this experiment as high priority, but notice the differences. On the shared 1-10 scale, ICE produces a higher score than PIE (7.7 vs 7.3) because it rewards enthusiasm and the scorer's general sense that this will work, while PIE moderates slightly by forcing an honest look at headroom. PXL is scored on a different scale, so its 9 is not directly comparable. It surfaces the experiment as high priority through evidence: it scores well because there's actual data behind it, not just confidence.

Now imagine an experiment idea with no research behind it. Under ICE and PIE, an enthusiastic scorer could still give it a 7 across the board. Under PXL, it would score 2-3 points because it can't pass the structured questions. That's the bias-reduction PXL is designed for.

Which Framework Should You Use?

Use ICE if you're a small team, moving fast, and the primary goal is getting experiments out of your head and into a ranked list quickly. It works well when the team shares context and calibrates scores together. The ICE scoring guide covers how to run a calibration session effectively.

Use PIE if your team is specifically focused on conversion optimization — landing pages, signup flows, checkout — and you want the prioritization framework to push you toward high-traffic, high-headroom pages rather than just high-excitement ideas.

Use PXL if you have a mature CRO practice, run regular user research, and find that your experiment backlog keeps getting distorted by stakeholder pressure or unchecked optimism. PXL requires investment upfront but pays back in a more defensible, evidence-driven roadmap.

A common evolution: teams start with ICE, graduate to PIE as they get more CRO-specific, and move to PXL once they've built enough research infrastructure that the structured questions can be answered honestly.

Regardless of which framework you use, the starting point is the same: a clear test hypothesis and a prioritized backlog of ideas worth running. If you need both, AB Test Plan generates scored experiment ideas from a description of your product — giving you a ranked starting point you can refine with PIE, ICE, or PXL from there. You can also browse the A/B test ideas library for proven experiment types organized by conversion goal.

The Bottom Line

ICE wins on speed. PIE wins on CRO focus. PXL wins on objectivity. No framework eliminates judgment entirely — but PXL gets closest by replacing open-ended scores with structured questions that reward evidence. If you're just getting started, ICE is the right call. If you're managing a team where politics influence your backlog, move to PXL.
