Strategy

Your Email Split Tests Are Probably Lying to You

Real data from hundreds of thousands of sends - and the testing mistakes killing your results

By Alex Berman - Apr 29, 2026 - 19 min read

Email Split Tests Are Built on Broken Data

I see it constantly - split tests coming back with garbage data. The tester isn't bad at their job. The test itself was designed on a broken foundation.

Open rates are the most common metric for judging a split test winner. They're also the most unreliable number in your dashboard right now.

Apple's Mail Privacy Protection (MPP) preloads your tracking pixel through a proxy server the moment a subscriber's mailbox syncs - whether or not they ever read your email. That means every Apple Mail user on your list registers as an "open" automatically. One newsletter operator documented their open rate jumping from 28% to 55% overnight when MPP rolled out - with zero change in actual engagement.

Omeda tracked the before-and-after across roughly two billion emails and found that open rates nearly doubled after MPP rolled out. Apple was firing pixels on subscribers' behalf.

Mailchimp openly states that A/B test results based on open rates may not be accurate for Apple Mail users. Litmus confirmed that A/B testing subject lines using opens to determine a winner no longer works reliably.

Split testing guides - including the ones ranking above this article right now - still tell you to test subject lines and judge the winner by open rate.

That advice is now broken. This article tells you what to do instead.

What Email Split Testing Is

Email split testing - also called A/B testing - means sending two or more versions of an email to different segments of your list, then comparing performance to find the winner.

Version A goes to one group. Version B goes to another. You measure a meaningful metric. The winner gets sent to the rest of the list, or becomes your new control for future sends.

Simple in theory. The problems show up in execution.

I see it constantly - people testing the wrong variables. They use sample sizes that are too small. They measure the wrong metric. They call a winner before it's statistically valid. And now, with Apple MPP distorting open rate data, they're making decisions based on a metric that is - at least partially - a fabrication.

Start with the variables that actually move revenue.

Before you set up any test, decide which metric you're optimizing for. This decision alone determines whether your tests are useful or noise.

For cold email outreach, the only metric that matters is positive reply rate. Reply rate is the number that tells you if the prospect wants to talk. A positive reply means the prospect wants to talk. That's product-market-message fit.

One operator with significant cold email experience put it plainly: focus on one thing - did anyone reply positive? Because you can have the best campaign in the world, but if you're measuring the wrong signal, it'll look like failure when it's working.

For newsletter and marketing email, the hierarchy looks like this:

Conversion rate - the metric that directly reflects revenue or sign-ups
Click rate - a reliable engagement signal unaffected by Apple MPP
Reply rate - for personal-feeling newsletters where replies indicate true engagement
Open rate - now unreliable for Apple Mail users; use only for non-Apple Mail segments

Email operations consultant Jeanne Jennings, who has run A/B tests professionally for over two decades, is direct on this point: the most reliable KPI for split testing is conversion rate. If revenue is the goal, it doesn't make sense to use a different metric to judge your test.

Bad split test results come from measuring the wrong thing.

The Apple MPP Problem In Full

Here is what is happening to your open rate data right now.

Apple Mail Privacy Protection was introduced with iOS 15 in September 2021. When a subscriber has MPP enabled and uses Apple Mail - regardless of whether their email address is Gmail, Yahoo, or anything else - Apple routes your email through its own proxy servers. Those servers preload your email content, including your tracking pixel. That pixel fires. Your ESP logs an open. Your subscriber may have never seen the message.

The result: open rates are artificially inflated, sometimes dramatically. Omeda's six-month post-MPP study found open rates nearly doubled across their client base. Machine-generated phantom opens.

MPP applies to any email address accessed through Apple Mail - not just Apple email accounts. A contact with a Gmail address reading it in Apple Mail on their iPhone is invisible to your tracking. Their "open" is a proxy ghost.

This matters for split testing in a specific way. When you run an A/B subject line test and pick the winner based on open rate, you're picking between two populations of phantom opens. The winning subject line may have gotten more Apple proxy fires - not more human eyeballs. You've optimized for nothing.

Mailchimp's official guidance now tells marketers: if you're running A/B tests based on open rates, your results might not be accurate. Their recommendation is to use click rates instead.

For cold email, where reply rate is your north star, this problem is less acute - cold email tools typically don't have the same MPP exposure as newsletter platforms. But any cold email operator using open rate to judge which subject line to scale is still operating on broken data.

The fix: test on click rate, reply rate, or conversion rate. Never open rate.

Minimum Sample Size - The Number That Kills Most Tests

The second biggest reason email split tests produce useless results is sample size. Specifically: too small.

Here is what small sample sizes do to your results. Say your baseline reply rate is 3%. You run a test. Version A gets 10 replies from 200 sends. Version B gets 14 replies from 200 sends. Version B looks like a 40% improvement. You declare a winner and scale it.

Except at 200 sends per variant, that difference falls entirely within statistical noise. You didn't find a winner. You found randomness. And now you've scaled a losing variant while abandoning a winner - or vice versa.

The numbers that matter:

For cold email: minimum 500 sends per variant, 1,000 total. That's the floor for getting directionally useful data at typical cold email reply rates.
For newsletter/marketing email: 95% statistical confidence is the standard threshold. A sample size calculator using your baseline conversion rate and a realistic minimum detectable effect will tell you exactly how many you need per cell.
For high-stakes tests: Some practitioners run at 20,000 per cell to get significance on small conversion differences.

A cold email SaaS founder who advises on outreach programs recommends testing at least 1,000 emails with a 500/500 split. The reasoning is straightforward: below that, you're flipping a coin and calling it optimization.

The math is not complicated. If you're testing at a 3% reply rate and want to detect a 1 percentage point improvement (from 3% to 4%), you need thousands of sends per variant to hit 95% confidence. I see it constantly - operators running 100-200 per variant and wondering why nothing is conclusive.

One SDR sending 2,500 to 3,000 emails per week described it simply: always be testing, but only if you're sending to similar groups at enough volume to give you good data - hundreds or thousands of emails.

Volume is the non-negotiable. Without it, you're not split testing. You're guessing with extra steps.

For Cold Email Specifically - Send More Than You Think

Cold email split testing has a volume problem that newsletter operators don't face.

Newsletter senders build lists over time. They send to the same people repeatedly. Getting to 1,000 subscribers per variant is achievable with a decent-sized list.

Cold email operators are constantly sourcing new contacts. Testing 500 per variant means having 1,000 fresh, targeted, verified contacts available before you even send one email. At traditional per-lead pricing of $0.50 per contact, that's $500 per test. Want to test five different offer angles? That's $2,500 before you write a single word.

I've watched cold emailers hit this wall and never push through it. They send 50 emails to one message, 50 to another, see no difference, and conclude that copy doesn't matter. 50 sends per variant is not a test.

Flat-rate lead access changes the math entirely. When downloading leads isn't a per-unit cost, you can run a proper 500/500 test on five different angles without worrying about burning budget on the list itself. One operator described discovering how to test ten different industries with 2,000 contacts each - a $20,000 experiment at per-lead rates - for a fraction of that on flat-rate access. Flat-rate access is what makes proper split testing at cold email scale possible.

If you're serious about building a testing system for cold outreach, Try ScraperCity free - the flat-rate B2B lead platform built for operators who test at volume.

What to Test - Ranked by Impact

Not all email variables are equal. Some create massive swings in results. Some barely move the needle. Here's what the data shows, ranked from highest to lowest impact.

1. The Offer and Angle (Highest Impact)

I see this every week - split testers obsessing over copy. Practitioners getting results test offers and angles.

One framework that has held up for years: cold email results are roughly 80% list quality, 15% offer framing, and 5% copy. I see this every week - agencies charging thousands per month to optimize the 5%.

That means before you test subject line capitalization or emoji usage, you should be testing whether your core offer resonates with different market segments. A different angle - leading with a case study versus leading with a question versus leading with a provocative claim - can swing reply rates by multiples, not percentages.

The 3C framework (Compliment, Case Study, Call to Action) is a structural offer test worth running as your A/B control. Test the compliment-first open against a direct problem-statement open. Test a case study lead-in against a specific observation about the prospect's business. These are angle-level tests, and they move numbers more than word-swapping does.

2. Personalization Specificity (5.6x Difference)

One of the clearest data points in cold email testing comes from a documented 2,400-email personalization test. Three variants were compared:

Specific detail: "Congrats on the $8M Series A" - 23.4% reply rate
Generic detail: "Congrats on the recent funding" - 11.8% reply rate
No mention at all - 4.2% reply rate

The specific version outperformed the no-mention version by 5.6x. The generic version - which most personalization advice recommends - only delivered 2.8x. Vague personalization is better than none, but specific personalization is in a different league.

The testing implication: don't just test "personalized vs. not personalized." Test specific vs. vague personalization. That's where lift lives.

3. Send Timing (15 Percentage Point Swing)

Timing split tests are underused. I've watched operators pick a send time based on folklore ("Tuesday morning is best") and never verify it against their actual list.

A documented send-time test across a 200-person sample showed real variance:

Monday 9am: 18% reply rate
Tuesday 2pm: 24% reply rate
Wednesday 11am: 29% reply rate
Thursday 4pm: 22% reply rate
Friday (any time): 14% reply rate

The spread between best and worst was 15 percentage points. Friday was half as effective as Wednesday mid-morning. For an operator sending 10,000 emails per month, that difference compounds into thousands of missed replies annually.

One important note for newsletter operators: since Apple MPP pre-fires pixels when the mailbox syncs - not when the person reads the email - send time optimization based on open rate is now unreliable for Apple Mail users. Test timing using click rate or reply rate instead.

4. Subject Line Approach (3x Difference)

Subject lines matter, but not in the way most guides suggest. The biggest subject line split test variable is not length or keyword inclusion. It's the safety spectrum: safe vs. specific vs. slightly weird.

Real practitioner data shows safe subject lines average around 12% open rates, while lines that feel personal, specific, or slightly off-script hit 35% and above. That's a 3x difference from the same list.

The most viral insight on this came from a marketer who discovered that putting intentional typos in subject lines increased open rates by roughly 40%, because recipients assumed a human wrote it. Deliberate imperfection as a signal of authenticity is testable.

Worth testing specifically:

Clean, professional subject line vs. casual, slightly imperfect version
Generic trigger ("Quick question") vs. specific trigger ("Your pricing page - one thing")
First name in subject vs. no first name
Question format vs. statement format

But remember: test subject lines using click rate or reply rate as your metric. Not open rate. Not anymore.

5. Sequence Length and Follow-Up Timing

One documented finding that runs counter to most sequence advice: 58% of positive replies in a cold email sequence come from the first email. The remaining 42% come from email two.

That was from an operator running two-email sequences. For operators running six or seven-step sequences: you are spending 80% of your effort on the 40% of replies. The first email is doing the heavy lifting. Optimize it accordingly.

Instantly's benchmark data corroborates this: 58% of replies happen on the first touch. I see teams spending the majority of their energy on follow-up optimization.

The split test here: does a longer sequence actually outperform a shorter one for your list and offer? Test a two-email sequence against a five-email sequence with the same list segment. Measure total replies per thousand sends, not just per-email reply rate. You might find that the extra emails add noise and unsubscribes without adding meaningful replies.

6. CTA Format

Call to action testing is consistently underused. I see operators treat the CTA as an afterthought and focus all testing energy on the subject line.

The variables worth testing in your CTA:

Soft ask ("Would it make sense to chat?") vs. direct ask ("Are you free Thursday at 2pm?")
Single CTA vs. double CTA (offer two times, let them pick)
Question-format CTA vs. link-format CTA
Long CTA sentence vs. one-liner

An analysis of over 4,000 cold email script variants found that single CTAs consistently outperformed multiple asks. One clear request. One decision for the prospect. The more choices you give, the fewer choices get made.

7. Domain TLD - The Test That Settled It

For operators who have wondered whether .com domains outperform .io or .org domains in cold email deliverability and reply rates, one operator ran a definitive test: 600,000 emails split across .com, .org, and .info domains.

Results:

.com: 0.24% human reply rate
.org: 0.22% human reply rate
.info: 0.21% human reply rate

The variance is statistically insignificant at that scale. Domain TLD does not meaningfully affect cold email reply rates. Stop worrying about it. Use your testing capacity on variables that move numbers - like offer angle and personalization specificity.

The Tests That Burn Money

One of the most useful things to understand about split testing is when not to do it.

A copywriter who ran 200+ tests across multiple businesses over six months found that only 5-10% of variants beat the control. The opportunity cost of running the 90-95% of losing variants - in lost revenue from the control performance you didn't deploy - adds up fast.

That doesn't mean stop testing. It means test with a hypothesis worth testing. Changing button color from blue to green is not a hypothesis. Changing from a soft-ask CTA to a calendar-link CTA is.

The variables most likely to waste your testing budget:

Minor copy tweaks - changing "I" to "we," swapping one adjective, rewording the same idea. These rarely produce statistically significant differences.
Design changes in plain-text cold email - cold email is plain text. There's nothing to design. Yet operators spend time testing HTML vs. plain text when the answer is almost always plain text for cold outreach.
Emoji in subject lines for B2B cold email - the audience is reading on corporate email clients. Emoji signal mass sends, not personal outreach.
Send-from name variations when personalization score is low - if your message isn't resonating, changing from "John" to "John Smith" will not save it.

Test things with high expected impact. List quality and offer angle sit at the top of that list. Word-level copy tweaks are an afterthought.

How to Structure an Email Split Test That's Valid

Here is the exact process for setting up a test that produces usable data.

Step 1 - Write a falsifiable hypothesis

"I think Version B will work better" is not a hypothesis. "A subject line referencing the prospect's specific funding round will produce a higher reply rate than a subject line referencing funding generically, because specificity signals genuine research" is a hypothesis. It names the variable, the expected direction, and the reason.

Step 2 - Choose your single variable

Test one thing per send. If you change the subject line AND the opening line AND the CTA, you have no idea what caused the result. One variable. One test.

Step 3 - Set your metric before you send

For cold email: reply rate.
For newsletters: click rate or conversion rate.
Not open rate.

Define what counts as a positive reply before the test goes out. Otherwise you'll be tempted to redefine success based on whichever variant performed better on whichever metric happened to look good.

Step 4 - Calculate your minimum sample size

For cold email operators: 500 per variant minimum, 1,000 total. That's for detecting moderate-sized differences at typical cold email reply rates.

For newsletter operators: use a sample size calculator. Plug in your baseline conversion rate, the minimum improvement you want to detect, and a 95% confidence threshold. The calculator tells you how many recipients you need per cell.

If you don't have enough volume to hit that number, you have two options: wait until your list grows, or accept that your result is directional rather than conclusive.

Step 5 - Send to equivalent segments

Both variants need to go to contacts at the same stage of the funnel, in the same industry, at the same time. If Version A goes to your hottest leads and Version B goes to your coldest, the test is worthless. Any difference in results is audience difference, not message difference.

Step 6 - Wait for completion before calling a winner

For email, most replies or clicks happen within 48-72 hours. Let the test run its full cycle. Peeking at results at hour 4 and calling a winner is how false positives happen. Let the data settle.

Step 7 - Check significance before declaring a winner

Run your results through a free A/B test significance calculator. If your result doesn't hit 95% confidence, you don't have a winner. You have a direction. That's still useful information - but don't scale based on it.

Step 8 - Document and iterate

A test that doesn't beat the control is not a failure. It's information. Log what you tested, what you expected, what happened, and what you learned. Over time, your testing log becomes a library of what your specific list responds to. That's worth more than any generic benchmark.

The "Never Stop Testing" Framework

The best cold email operators don't run tests as projects. They run them as a permanent operating mode.

The practical setup: every month, pick one clear A test (your current best-performing script) and one B test (a genuinely different approach - different offer angle, different opening structure, different CTA, different market segment). Split them evenly. At minimum 3,000 sends total, 1,500 per variant.

The goal in early testing is not to optimize word-level copy. The goal is to find positive replies. Positive replies signal product-market-message fit. Once you have that signal, you scale. Then you refine.

Start with 6,000 emails per month. Send 3,000 to your A script and 3,000 to a genuinely different B approach. A completely different angle, offer frame, or target market. Both ideas get real volume. Then you read the data.

When you already have a winning offer - already seeing positive replies consistently - you're looking for incremental improvements to a proven message, not proof-of-concept validation. At that stage, you can run tighter tests on specific variables: subject line format, opening line structure, CTA phrasing.

But never stop testing. The market changes. Inboxes get more crowded. What worked at 8% reply rates in one period may fall to 3% as the message gets overused. Testing is how you catch that drift before it kills your pipeline.

The Metric That Replaces Open Rate

Track positive reply rate instead of open rate.

For cold email: positive reply rate. Track replies that express interest, ask for more information, or agree to a conversation. Ignore out-of-office and "not interested" replies in your primary metric (though track them separately - high "not interested" rates signal list quality problems).

For newsletter email:

Click-through rate - unaffected by Apple MPP, directly measures engagement with your content
Click-to-open rate - but only for non-Apple Mail segments, since total opens are distorted
Conversion rate - the gold standard; measures the action that matters to your business
Revenue per email sent - the purest metric for commerce-focused email programs

Clicks and purchases are both unaffected by Apple MPP. They represent genuine human action. Build your testing methodology around those signals and your data becomes reliable again.

If you have a mixed list with significant Apple Mail usage, segment your Apple Mail users out for subject line testing purposes. Test subject lines against the non-Apple Mail segment only, then apply the winner to the full list. It's not perfect, but it removes the single largest source of noise in your data.

What Good Testing Infrastructure Looks Like

The operators I've worked with who test well have a few things in common.

They have enough volume. You cannot run valid split tests on a 500-person cold email list or a 1,000-subscriber newsletter. The math doesn't work. Volume is the prerequisite.

They have clean lists. Bad data - unverified emails, contacts who changed companies, invalid domains - creates noise that distorts every test. A 10% bounce rate doesn't just hurt deliverability. It corrupts your reply rate calculations. Before you optimize copy, verify your list.

They test one thing at a time. Every operator who has run dozens of tests says the same thing: the biggest testing mistake is changing multiple variables at once. The results look good but you can't learn from them.

They wait for statistical validity. They don't call winners at 50% completion. They don't use results from 75-person test cells. Volume gets built in before they call anything.

They measure what matters to their business. Not open rates. Not the metric that happens to look good. The metric that reflects whether their emails are working at the level that matters - replies, clicks, conversions, revenue.

And they document everything. A testing log from 12 months of consistent testing is one of the most valuable assets a marketing operator can build. It tells you exactly what your specific audience responds to, stripped of generic best-practice assumptions that may not apply to your list at all.

The Testing Mistake That Costs the Most

There is one testing mistake that is more expensive than any other: testing the wrong layer of the funnel.

I see it constantly - email split testers working on copy. They test subject lines, opening sentences, CTA phrasing. These are variables that affect performance.

But they sit on top of two more important layers: list quality and offer.

List quality is the substrate everything else runs on. If your contacts are wrong-fit, unverified, or stale, no amount of subject line optimization rescues the campaign. The data from one practitioner is clear: roughly 80% of cold email results come from list quality. The 15% from offer framing. The 5% from copy.

Offer framing is what your email is asking for and how it's positioned. A service that helps SaaS companies reduce churn by 30% is the same service whether you frame it as "churn reduction" or "customer retention" - but those two framings can produce dramatically different reply rates because one resonates with how prospects think about their problem and the other doesn't.

Copy - the actual words in the email - is the final 5%. It matters. But it matters far less than most split testers treat it.

The implication for testing priority: test markets and offers first. When you find a market that replies and an offer that resonates, then tighten the copy. Not before.

One operator who built and sold a SaaS company within months documented this exact sequence. Test markets. Find positive replies. Then optimize the message that's already working. Test markets and offers first, then optimize copy.

The split testing playbook is different depending on which type of email you're running. I see this constantly - testing guides blurring these two together. They are not the same.

Cold email split testing is about finding message-market fit. You don't have a relationship with these contacts yet. You're testing whether your offer, angle, and targeting resonate enough to earn a reply. The metric is reply rate. The variable with the highest influence is offer angle and personalization specificity. Subject lines matter for getting through filters, but reply rate is the truth metric.

Newsletter split testing is about deepening an existing relationship and driving action from subscribers who already opted in. You're testing what content format, CTA placement, subject line tone, and send timing maximize engagement from an audience that already knows you. The metric is click rate or conversion rate. Subject lines matter more here because they determine whether a subscriber reads this email or skips it - but still measure with click rate, not open rate.

The overlap: both benefit from clean lists, adequate sample sizes, and single-variable testing discipline. But the specific variables worth testing and the metrics worth measuring are different. Treat them separately.

Putting the System Together

Email split testing done right requires discipline.

Test on reply rate or click rate. Not open rate.

Use 500+ sends per variant for cold email. More for newsletter segments where conversion rate differences are smaller.

Test one variable at a time. Offer angle first. Personalization approach second. Subject line format third. Then copy tweaks.

Document every test. What you tested. What you expected. What happened. And what you changed as a result.

Never call a winner before 95% statistical confidence.

Never stop testing. The winning script from six months ago is not necessarily the winning script today.

The operators getting the best results from cold email right now are not the ones with the cleverest subject lines. They are the ones sending enough volume to get real data, testing the variables that matter, and building a compounding library of what their specific audience responds to.

That's the system.

FAQ

Frequently Asked Questions

What is the minimum number of emails I need to send for a valid split test?

For cold email, send at least 500 per variant - 1,000 total. Below that, normal variance in reply rates will look like a meaningful difference when it isn't. For newsletter testing, calculate your sample size based on your baseline conversion rate and the minimum improvement you want to detect at 95% statistical confidence. A sample size calculator will give you a precise number. Most operators run tests that are 3-5x too small and wonder why their results don't hold when scaled.

Should I still test subject lines if Apple MPP makes open rates unreliable?

Yes - but change how you measure the winner. Instead of judging subject lines by open rate, judge them by click rate or reply rate. For newsletter lists with significant Apple Mail users, segment out Apple Mail users and test against the non-Apple segment only. Apply the winner to the full list. Subject lines still affect whether people engage - you just need a reliable signal to measure that engagement, and open rate is no longer it.

What's the single highest-impact variable to test in a cold email?

Offer angle and personalization specificity, in that order. Copy tweaks represent roughly 5% of cold email results. Offer framing represents about 15%. Getting the angle right - how you frame the problem your service solves and why this specific prospect should care - moves the biggest numbers. Once you have a resonating offer, test whether specific personalization details (like mentioning a funding round by amount) outperform generic personalization. That single variable produced a 5.6x difference in one documented 2,400-email test.

How is email split testing different from A/B testing?

They're the same thing. 'Split test' and 'A/B test' are interchangeable terms. Both mean sending two versions of an email to different segments and comparing performance. The only meaningful distinction is when people run multivariate tests - testing multiple variables simultaneously across three or more variants. For most operators, sticking to two-variant (A/B) tests with one variable changed is the right approach. Multivariate tests require much larger sample sizes to produce valid results.

How long should I wait before calling a split test winner?

For cold email, most replies come within 48-72 hours of send. Let the test run at least 72 hours before reviewing results. For newsletters, 24-48 hours captures the majority of clicks and conversions. The bigger issue is not time - it's statistical significance. Calling a winner at 50% of your planned volume because one variant looks better is how false positives happen. Wait until you hit your target sample size AND check statistical significance. If the result isn't at 95% confidence, you have directional data, not a winner.

What should I do if my test results are not statistically significant?

Three options: run more volume until you hit significance, accept the result as directional guidance rather than a conclusive winner, or conclude the variable you tested doesn't move the needle enough to matter and move on to a higher-impact variable. Not reaching significance is not a failure - it's information. It often means the variable you tested has less impact than you expected. Document it, move to a bigger lever, and test again.

Is it worth testing email send times?

Yes, and most operators skip it. One documented 200-person send-time test found a 15 percentage point spread between best time (Wednesday 11am, 29% reply rate) and worst time (Friday, 14% reply rate). That's a meaningful lift from a single variable change that costs nothing to implement. Test send times using reply rate or click rate - not open rate, since Apple MPP pre-fires open pixels when the mailbox syncs rather than when the human actually reads the email.