On Fri, 28 Oct 2005, Johan Geldenhuys wrote:
> After I tested the previous code, I noticed that the odds is 1:49 that a > duplicate number can be found in the 6 digit range (and it happended) > and that 0 can also be found. Hi Johan, I'm not exactly sure how 1 in 49 are the odds of getting a duplication. Let's talk about this for a moment. [Warning: long post ahead.] Just to make sure: the situation is that we're rolling a fifty-sided dice six times, and we'd like to see what's the probability of having duplicates in the mix. We have 50**6 possible rolls, but there are only: 50 * 49 * 48 * 47 * 46 * 45 ways of making sure we don't hit duplicates. The probability that we don't hit any duplicates, then, is: ###### >>> (50 * 49 * 48 * 47 * 46 * 45) / (50 **6) 0L ##### Oops, forgot about integer division. *grin* Let me try that again: ###### >>> (50.0 * 49 * 48 * 47 * 46 * 45) / (50 **6) 0.73224345599999996 ###### Ok, so we have about a 73% chance of not having any duplicate numbers if we roll a 50-digit dice six times. Since we can either have no duplication or duplication, then he probability that we will have duplication is: (1 - probability of not having duplication) ###### >>> 1 - (50.0 * 49 * 48 * 47 * 46 * 45) / (50 **6) 0.26775654400000004 ###### So we have about a 27% percent chance of getting duplication when we roll a fifty sided dice six times. But that's just probability theory. Is this real? Is this truly probable? Let's test this experimentally. First, we need some way of generating a roll of dice. Here's a quick-and-dirty function to do that: ###### >>> def make_roll(n, r): ... return [random.randrange(n) for i in range(r)] ... >>> make_roll(50, 6) [11, 48, 39, 43, 30, 24] >>> make_roll(50, 6) [6, 47, 0, 34, 9, 38] ###### We can test things experimentally by writing a function to see a roll has duplication: ###### >>> def has_duplication(roll): ... d = {} ... for number in roll: ... d[number] = d.get(number, 0) + 1 ... for count in d.values(): ... if count > 1: ... return True ... return False ... >>> has_duplication([1, 2, 3]) False >>> has_duplication([1, 2, 2]) True ###### Once we have this tool, then it becomes easy to do a trial run and see, if we roll 10 times, what percentage of those are duplicates. ###### >>> count_duplicates = 0 >>> for i in range(10): ... if has_duplication(make_roll(50, 6)): ... count_duplicates = count_duplicates + 1 ... >>> count_duplicates 1 >>> 1.0 / 10 0.10000000000000001 ###### Experimentally, we're seeing 1%, but that might just be a fluke. Or maybe it's just because the trial size is much too small. *grin* Let's formalize this test as a function, just to make it easier to retry: ###### >>> def do_trial(n): ... """Returns the number of duplicates if we do a n-trial.""" ... count_duplicates = 0 ... for i in range(n): ... if has_duplication(make_roll(50, 6)): ... count_duplicates = count_duplicates + 1 ... return float(count_duplicates) / n ... >>> >>> do_trial(5) 0.40000000000000002 >>> do_trial(10) 0.29999999999999999 >>> do_trial(100) 0.27000000000000002 >>> do_trial(1000) 0.26900000000000002 >>> do_trial(10000) 0.2621 >>> do_trial(100000) 0.26676 ###### So we're experimentally getting evidence that the probability of rolling duplicates is about 26%, if we have a 50-sided dice six times. Hey, that's not bad: that's about the value we got from doing math. As we scale our expeiment higher, we start seeing the Law of Large Numbers taking place: when we use lots of trials, then our experimental results start getting very close to the one we calculated with probability theory. http://en.wikipedia.org/wiki/Law_of_large_numbers Here's a punchline: the problem we've been looking at is another disguise for the Birthday Paradox: http://en.wikipedia.org/wiki/Birthday_problem That article makes the assertion that if we have 23 people in a room, the probability that we have at least two people with the same birthday is pretty good (about 50%). We can modify do_trial() to experimentally see that! ###### >>> def do_trial(n, x, y): ... """Returns the number of duplicates if we do a n-trial.""" ... count_duplicates = 0 ... for i in range(n): ... if has_duplication(make_roll(x, y)): ... count_duplicates = count_duplicates + 1 ... return float(count_duplicates) / n ... >>> do_trial(10000, 365, 23) 0.51390000000000002 >>> do_trial(10000, 365, 1) 0.0 >>> do_trial(10000, 365, 2) 0.0023 >>> for i in range(23): ... print i, "people, probability of shared birthday:", do_trial(10000, 365, i) ... 0 people, probability of shared birthday: 0.0 1 people, probability of shared birthday: 0.0 2 people, probability of shared birthday: 0.0035 3 people, probability of shared birthday: 0.0087 4 people, probability of shared birthday: 0.0181 5 people, probability of shared birthday: 0.0273 6 people, probability of shared birthday: 0.0436 7 people, probability of shared birthday: 0.0551 8 people, probability of shared birthday: 0.0736 9 people, probability of shared birthday: 0.0932 10 people, probability of shared birthday: 0.1201 11 people, probability of shared birthday: 0.1325 12 people, probability of shared birthday: 0.1654 13 people, probability of shared birthday: 0.1967 14 people, probability of shared birthday: 0.2241 15 people, probability of shared birthday: 0.2581 16 people, probability of shared birthday: 0.281 17 people, probability of shared birthday: 0.3169 18 people, probability of shared birthday: 0.3497 19 people, probability of shared birthday: 0.3843 20 people, probability of shared birthday: 0.4171 21 people, probability of shared birthday: 0.438 22 people, probability of shared birthday: 0.4817 ###### The variable names 'x' and 'y' suck (I'll try to think of better ones next time), but I hope that it's clear what we're doing: we're making do_trial() more general so it can handle different dice rolls. And because they are parameters in our do_trial() function, we can then see what the situation loosk like as we get more and more kids in the classroom. Does this make sense? If you have any questions about this, please feel free to ask. Hope this helps! _______________________________________________ Tutor maillist - Tutor@python.org http://mail.python.org/mailman/listinfo/tutor