Statistics and sample size

Alistair Hutton · October 2017

I am a computer programmer.

I am writing a new service to replace an old service.

I want to be confident that my new service works like the old service.

The old service gets called 30,000,000 times a day.

If I were to randomly sample 1,000 of those requests from the last day and hit my new service with them, and every result came back matching, how confident could I be that my new service matches the old service for that day's load of queries?

How confident would I be that it was 50% fucked if 50% of the results came back not matching?

I feel this is close to a survey margin-of-error calculation, but I know that my intuitive grasp of statistics is often completely wrong, so I would appreciate guidance.

(Addendum: We are planning on doing a full traffic replication as well, replaying all thirty million requests, but we want a quick comparison we can be confident in for iterative changes during development.)

Enc · October 2017

You probably want a sample size closer to 2,500.

Dis' · October 2017

Is the check a binary match/not-match?

Without sampling everything you'll never be 100% confident. You need to decide on a confidence level you're happy with (most people default to 95%, but you can choose 99%, 99.9%, etc.).
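For the "everything matched" case specifically, a rough sketch of what a chosen confidence level buys you. It assumes a simple random sample with independent draws from a population much larger than the sample; the function name is made up for illustration. With zero mismatches in n requests, the largest mismatch rate still consistent with that result solves (1 - p)^n = 1 - confidence, which at 95% is close to the "rule of three" value 3/n.

```python
def zero_failure_upper_bound(n: int, confidence: float = 0.95) -> float:
    """Upper bound on the true mismatch rate after 0 mismatches in n samples."""
    # Solve (1 - p)^n = 1 - confidence for p.
    alpha = 1.0 - confidence
    return 1.0 - alpha ** (1.0 / n)


for conf in (0.95, 0.99):
    bound = zero_failure_upper_bound(1000, conf)
    print(f"{conf:.0%} confident the mismatch rate is below {bound:.3%}")
```

For 1,000 clean samples at 95% this comes out at roughly 0.3%, which on 30,000,000 daily calls would still leave room for something like 90,000 mismatching requests.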

You'll then need to determine your margin of error, i.e. how close you need to be to the true value (say you want to be 99% confident you're within a 1% margin of error).
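As an illustration of how confidence level and margin of error relate, here is a minimal sketch using the usual normal approximation for a sampled proportion. It assumes a simple random sample; `margin_of_error` is a name invented for this example.

```python
import math
from statistics import NormalDist


def margin_of_error(p_hat: float, n: int, confidence: float = 0.95) -> float:
    """Half-width of the normal-approximation interval around p_hat."""
    # Two-sided critical value, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return z * math.sqrt(p_hat * (1 - p_hat) / n)


# 50% of 1,000 sampled requests did not match.
moe = margin_of_error(0.5, 1000, confidence=0.95)
print(f"mismatch rate is 50% plus or minus {moe:.1%}")
```

With 500 mismatches out of 1,000 this gives roughly plus or minus 3 percentage points at 95% confidence. The approximation collapses to a zero-width interval when the observed rate is exactly 0% or 100%, so it should not be used to interpret an all-matching sample.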

From the sounds of it you want a very tight margin of error and high confidence, so I plugged 99% confidence and a margin of error of 0.1% or less into a quick online calculator (https://www.surveymonkey.co.uk/mp/sample-size-calculator/) and got a sample size of about 1.5 million.
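A figure in that range can be reproduced with the standard sample-size formula for a proportion plus a finite population correction. This is only a sketch: whether the online calculator uses exactly this formula is an assumption, and `required_sample_size` is a hypothetical helper. It uses the worst-case proportion p = 0.5.

```python
import math
from statistics import NormalDist


def required_sample_size(population: int, margin: float,
                         confidence: float = 0.99, p: float = 0.5) -> int:
    """Sample size for a given margin of error, corrected for population size."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Size needed for an effectively infinite population.
    n0 = z * z * p * (1 - p) / (margin * margin)
    # Finite population correction shrinks it when n0 is a real fraction of N.
    n = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n)


daily_calls = 30_000_000
for margin in (0.01, 0.001):
    n = required_sample_size(daily_calls, margin, confidence=0.99)
    print(f"margin {margin:.1%}: sample about {n:,} requests")
```

Because the margin enters as a square, going from a 1% margin to a 0.1% margin multiplies the required sample by roughly 100, which is why the number jumps from tens of thousands to well over a million.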

Cauld · October 2017

Make sure your sample is representative of the population, i.e. that all types of calls are covered.
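One way to guarantee every call type shows up is to sample within each type rather than across the whole log. A minimal sketch, assuming each logged request can be mapped to a call type; the `endpoint` field and the request dicts are hypothetical.

```python
import random
from collections import defaultdict


def stratified_sample(requests, key, per_stratum, seed=0):
    """Draw up to per_stratum requests at random from each call type."""
    rng = random.Random(seed)  # fixed seed so a run can be repeated
    groups = defaultdict(list)
    for req in requests:
        groups[key(req)].append(req)
    sample = []
    for members in groups.values():
        k = min(per_stratum, len(members))  # rare types contribute all they have
        sample.extend(rng.sample(members, k))
    return sample


# Hypothetical log: one very common call type and two rare ones.
log = (
    [{"endpoint": "search", "id": i} for i in range(900)]
    + [{"endpoint": "lookup", "id": i} for i in range(95)]
    + [{"endpoint": "export", "id": i} for i in range(5)]
)
picked = stratified_sample(log, key=lambda r: r["endpoint"], per_stratum=10)
print(len(picked), "requests sampled")
```

Equal allocation like this deliberately over-represents rare call types, so the raw match rate from such a sample is not an estimate of the day's overall match rate unless each type is weighted back by its share of traffic.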

Inquisitor77 · October 2017

If you are performing a full replication test at the end, then your iterative tests are really a matter of testing efficiency vs. risk. Rather than focusing on a number for a random sample, I'd focus more on actually writing out the scenarios you intend to cover and explicitly testing them. In particular, you need to pay attention not just to happy-path calls but to the trickier or more esoteric ones that may have caused problems in the past and absolutely need to be covered. For example, if the vast majority of calculations involve integers but you have to support decimals, test how the service handles non-terminating values or DIV0 errors.

Otherwise all you are doing is fishing, so you might as well run as many sample comparisons as you can afford as often as possible. Which actually isn't a bad idea...

schuss · October 2017

Do you have a full matrix of possible transactions? You should be testing the conditions, with a random sample at the end as a just-in-case test. If you haven't already researched and mapped the conditions to tests, do that now. A random sample will most likely just get 90% of the same sort of transaction and a few edge cases.

Alistair Hutton · October 2017

schuss wrote: »

Do you have a full matrix of possible transactions? You should be testing the conditions, with a random sample at the end as a just-in-case test. If you haven't already researched and mapped the conditions to tests, do that now. A random sample will most likely just get 90% of the same sort of transaction and a few edge cases.

The full matrix of possible queries is large: 16,000,000 × 250 × 125 × 400.

In practice we are most concerned about the ones that happen most often, and we can afford a few failures around the edge cases as long as the common use cases are covered.
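To make "the ones that happen most often" concrete, one option is to count distinct queries in a day's log and keep the most frequent ones until they account for a target share of traffic. A rough sketch, assuming each request can be reduced to a hashable query key; the function name and the toy log are invented for illustration.

```python
from collections import Counter


def top_queries_covering(queries, coverage=0.95):
    """Most frequent distinct queries that together reach the target traffic share."""
    counts = Counter(queries)
    total = sum(counts.values())
    chosen, covered = [], 0
    for query, count in counts.most_common():
        if covered / total >= coverage:
            break
        chosen.append(query)
        covered += count
    return chosen, (covered / total if total else 0.0)


# Toy log: three distinct queries with very uneven frequencies.
log = ["a"] * 90 + ["b"] * 6 + ["c"] * 4
chosen, share = top_queries_covering(log, coverage=0.95)
print(chosen, f"cover {share:.0%} of traffic")
```

The resulting list is a fixed regression set for the common cases, so it complements a random sample rather than replacing it.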

schuss · October 2017

Ok. One thing to look up is the all-pairs (pairwise) method. It's an algorithm that builds test cases covering every pair of condition values, on the basis that the most likely failures come from a simple pair of conditions. It's a great tool to help narrow down conditions like that. Seriously, 16 million variables with different handling for each value, or is it just a numeric?
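For anyone who hasn't met it, here is a small greedy sketch of the all-pairs idea. It is not the algorithm of any particular tool, just one simple way to get pairwise coverage. The parameter names and values are hypothetical, each parameter is assumed to have at least one hashable value, and the exact tests produced can vary between runs.

```python
from itertools import combinations


def pair_key(i, a, j, b):
    """Store each pair with the lower parameter index first."""
    return (i, a, j, b) if i < j else (j, b, i, a)


def all_pairs(parameters):
    """Greedily build test cases until every value pair appears at least once."""
    names = list(parameters)
    uncovered = {
        (i, a, j, b)
        for i, j in combinations(range(len(names)), 2)
        for a in parameters[names[i]]
        for b in parameters[names[j]]
    }
    tests = []
    while uncovered:
        # Seed with a still-uncovered pair so every test makes progress.
        i, a, j, b = next(iter(uncovered))
        chosen = {i: a, j: b}
        for k in range(len(names)):
            if k in chosen:
                continue
            # Pick the value that covers the most uncovered pairs so far.
            chosen[k] = max(
                parameters[names[k]],
                key=lambda v: sum(
                    pair_key(k, v, m, w) in uncovered for m, w in chosen.items()
                ),
            )
        # Mark every pair contained in this test as covered.
        for m, n in combinations(sorted(chosen), 2):
            uncovered.discard((m, chosen[m], n, chosen[n]))
        tests.append({names[k]: chosen[k] for k in range(len(names))})
    return tests


# Hypothetical request parameters, purely for illustration.
params = {
    "region": ["eu", "us", "asia"],
    "format": ["json", "xml"],
    "cache": ["hit", "miss"],
}
tests = all_pairs(params)
print(len(tests), "pairwise tests versus", 3 * 2 * 2, "exhaustive combinations")
```

Pairwise coverage can never need fewer tests than the product of the two largest value counts, so it pays off most when there are many parameters with a modest number of values each.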

Alistair Hutton · October 2017

schuss wrote: »

Ok. One thing to look up is the all-pairs (pairwise) method. It's an algorithm that builds test cases covering every pair of condition values, on the basis that the most likely failures come from a simple pair of conditions. It's a great tool to help narrow down conditions like that. Seriously, 16 million variables with different handling for each value, or is it just a numeric?

Two variables, each with 4,000 categorical inputs.

Inquisitor77 · October 2017

Are these meaningful categories (e.g., reflective of different logic flows in the code) or are they just labels?
