Re: Random number generators for NDFS block numbers
Paul Baclace wrote: Doug Cutting expressed a concern to me about using util.Random to generate random 64 bit block numbers for NDFS. The following is my analysis. Nice stuff, Paul. Thanks. It just occurred to me that perhaps we could simply use sequential block numbering. All block ids are generated centrally on the namenode. If it can simply maintain a persistent counter then sequential allocation could be used. Since each block addition is logged, this is easy to persist. When re-playing the log on namenode startup we simply keep track of the highest known block id. Blocks are not logged until the file is closed, so there could be a problem on restart if datanodes report blocks for files that were never closed. These would collide with yet-unallocated block numbers, potentially corrupting the filesystem. I'm not sure what the best way to handle that would be... Ideas? Doug
Re: Random number generators for NDFS block numbers
Doug Cutting wrote: It just occurred to me that perhaps we could simply use sequential block numbering. All block ids are generated centrally on the namenode. I'm not sure what the advantage of sequential block numbers would be since long period PRNG block numbering does not even need to store it's state, just pick a new starting place. Sequential block numbering does have the downside that picking a datanode based on (BlockNum % DataNodeCount) would devolve into round robin. Any attempt to pass the sequence through a hash ends up becoming a random number generator. Sequential numbering provides contiguous numbers, but after G.C. that would be lost, so no advantage there. When human beings eyeball block numbers, many with small differences are more likely to be misread than many that are totally different. If block numbering is sequential, then there is a temptation to use 32 bits instead of 64, but 32 bits leads to wrap-around and uh oh. Blocks are not logged until the file is closed, so there could be a problem on restart if datanodes report blocks for files that were never closed. These would collide with yet-unallocated block numbers, potentially corrupting the filesystem. I suppose a large margin of block ids could be skipped if there was doubt about the previous shutdown, but the random block numbers have many advantages. Paul
Re: Random number generators for NDFS block numbers
Thanks to archive.org, I found this additional reference: http://www.math.sci.hiroshima-u.ac.jp/~m-mat/eindex.html which is the English home page of Matsumoto san, co-originator of Mersenne Twister. There is a faq and the original C source. Paul Paul Baclace wrote: [...] Alternative MersenneTwister replacement for Random is best choice, has a mind-bogglingly long period. Matsumoto and Nishimura (1998) [...]