The P-Value: Our Savior in A/B Testing

Mohammad Eskandari
2 min read · Sep 24, 2021

When it comes to discovering the right product for the customer, A/B tests, and testing in general, are the product manager's bread and butter. The underlying idea is pretty simple: divide the users who are eligible for the test into two groups, A and B. Show the unmodified, baseline version of the product to group A, and the modified version to group B. Then analyze the results and see what difference there is between the two groups (if any) in terms of the metric defined for the test.

As PMs, we usually use A/B testing to gauge the effects of changes that other evidence tells us are good for the product. More often than not, A/B tests are the final, settling argument for whether a feature gets rolled out product-wide. Or are they?

Imagine a hypothetical situation: 1,000 users have seen the baseline A version, and 500 of them have made a purchase (a 50% conversion rate). As for the B version, 1,100 users have seen it and 580 have made a purchase (a conversion rate of roughly 53%). Would we be right in asserting that version B has outperformed A?
NO. The result is not statistically significant: the test has a p-value of roughly 0.1.
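As a rough sketch of where that number comes from (my own addition, not from the original post), here is a one-sided two-proportion z-test on the figures above, using the normal approximation. The use of this particular test is an assumption on my part about how a tool like ABtestGuide arrives at its result.

```python
from math import sqrt
from scipy.stats import norm

# Observed data from the example above
conv_a, n_a = 500, 1000   # baseline: 50% conversion
conv_b, n_b = 580, 1100   # variant: ~52.7% conversion

p_a, p_b = conv_a / n_a, conv_b / n_b

# Pooled conversion rate under the null hypothesis (A and B perform the same)
p_pool = (conv_a + conv_b) / (n_a + n_b)

# Standard error of the difference in proportions
se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))

# z-score of the observed lift and the one-sided p-value
z = (p_b - p_a) / se
p_value = norm.sf(z)  # P(a lift this large or larger, given no real difference)

print(f"lift = {p_b - p_a:.3f}, z = {z:.2f}, p-value = {p_value:.3f}")
# lift = 0.027, z = 1.25, p-value = 0.106 -> not significant at the 0.05 level
```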

Loosely speaking, the p-value answers this question: if there were truly no difference between the performance of the two versions, how likely would we be to see an improvement at least as large as the one we observed, purely by chance? In other words, let's imagine that A and B perform identically. If we ran the test again, how likely is it that we would get a conversion rate of 53% or higher for B just by luck?

Obviously, we want this pesky p-value to be as small as possible. In practice, we normally shoot for a p-value below 0.05. The best experiments yield p-values of less than 0.01.

Let’s visualize using ABtestGuide.com.

For an A/B test, the underlying probability distribution is a binomial distribution, equivalent to flipping a coin n times.

For the A version, we have flipped a coin 1,000 times and observed heads 500 times. For B, we have flipped a coin 1,100 times and observed heads 580 times. What we hope to learn from the test is whether the mean of the population has really shifted from 0.50 to 0.53 (has our coin changed or not?), but we have to account for one special event: the event that the coin is the same, and pure chance has produced a mean of 0.53.
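To make that "same coin" event concrete, here is a small Monte Carlo sketch (again my own addition): it assumes both groups share the pooled conversion rate, simulates the two coin-flip experiments many times, and counts how often chance alone produces a lift at least as large as the one we observed. That fraction is an empirical estimate of the p-value.

```python
import numpy as np

rng = np.random.default_rng(42)

conv_a, n_a = 500, 1000
conv_b, n_b = 580, 1100

observed_lift = conv_b / n_b - conv_a / n_a   # ~0.027
p_pool = (conv_a + conv_b) / (n_a + n_b)      # the "same coin" for both groups

n_sims = 100_000
# Flip the same coin n_a and n_b times in every simulated experiment
sim_a = rng.binomial(n_a, p_pool, size=n_sims) / n_a
sim_b = rng.binomial(n_b, p_pool, size=n_sims) / n_b

# How often does pure chance produce a lift at least as large as the observed one?
p_value_mc = np.mean(sim_b - sim_a >= observed_lift)
print(f"simulated p-value ~ {p_value_mc:.3f}")  # lands close to 0.1
```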
