On Dec 29 2009, 6:38 pm, Cédric Krier <[email protected]> wrote: > Hi, > > I would like to add a new type of sequence in Tryton that will generate random > but unique number. > A simple solution will be to generate random number and compare it with > already generated number (stored in a table). But there is the birthday > paradox [1] that reduces the speed of this method. > > An other way will be to store in a table every possible value (if the length > of the number is fixed) and pick/remove one randomly. But here the generation > of all values can be slow. > > Is there anybody that knows a better way? > > [1]http://en.wikipedia.org/wiki/Birthday_paradox > > -- > Cédric Krier
About the scalability of this approach : Let's say we pick an integer. On my 32bit laptop, sys.maxint gives me: 2147483647, which is roughly 2*10**9. So this means that the odds of generating a number already in the sequence is 1 in 10 when 2*10**8 (200 million) numbers have been generated. One in ten is not bad this means that the the creation of the next item in the sequence is on average 10% slower than if there was no "collision". To give an idea of how much "time" is needed to generate a sequence with 2*10**8 numbers: It takes 5479 years if 100 numbers are generated each day (2*10**8 / (100*365) gives 5479). So scalability may become a problem if one want to generate more than 10,000 number per day over 50 years (10,000 number a day reach the 10% slowdown after 54 years) or more than 100,000 per day over 5 years. The only use case I would see for such numbers is a big cable/water/ electricity/phone company that want to generate statements for their customers. But if we want to scale more at the expense of being a bit less random would be to use a variant of the "store in a table every possible value" solution: - Generate a randomized table with the number from 1 to 10k. - Consume table - Generate a table with number from 10k+1 to 20k. - ... Another variant is to use two such "possible values tables": one that gives which intervals are still available (1to 10k, 10k+1 to 20k, etc) and the other giving the available values in the currently active interval. Bertrand -- [email protected] mailing list
