Re: Random number generators for NDFS block numbers

2005-09-26 Thread Doug Cutting

Paul Baclace wrote:

Doug Cutting expressed a concern to me about using util.Random to generate
random 64 bit block numbers for NDFS.  The following is my analysis.


Nice stuff, Paul.  Thanks.

It just occurred to me that perhaps we could simply use sequential block 
numbering.  All block ids are generated centrally on the namenode.  If 
it can simply maintain a persistent counter then sequential allocation 
could be used.  Since each block addition is logged, this is easy to 
persist.  When re-playing the log on namenode startup we simply keep 
track of the highest known block id.


Blocks are not logged until the file is closed, so there could be a 
problem on restart if datanodes report blocks for files that were never 
closed.  These would collide with yet-unallocated block numbers, 
potentially corrupting the filesystem.  I'm not sure what the best way 
to handle that would be... Ideas?


Doug


Re: Random number generators for NDFS block numbers

2005-09-26 Thread Paul Baclace

Doug Cutting wrote:
It just occurred to me that perhaps we could simply use sequential block 
numbering.  All block ids are generated centrally on the namenode.  


I'm not sure what the advantage of sequential block numbers would be
since long period PRNG block numbering does not even need to store
it's state, just pick a new starting place.

Sequential block numbering does have the downside that picking a
datanode based on (BlockNum % DataNodeCount) would devolve into
round robin.  Any attempt to pass the sequence through a hash
ends up becoming a random number generator.

Sequential numbering provides contiguous numbers, but after G.C.
that would be lost, so no advantage there.

When human beings eyeball block numbers, many with small differences
are more likely to be misread than many that are totally different.

If block numbering is sequential, then there is a temptation to use
32 bits instead of 64, but 32 bits leads to wrap-around and uh oh.

Blocks are not logged until the file is closed, so there could be a 
problem on restart if datanodes report blocks for files that were never 
closed.  These would collide with yet-unallocated block numbers, 
potentially corrupting the filesystem. 


I suppose a large margin of block ids could be skipped if there
was doubt about the previous shutdown, but the random block numbers
have many advantages.

Paul


Re: Random number generators for NDFS block numbers

2005-09-23 Thread Paul Baclace

Thanks to archive.org, I found this additional reference:

  http://www.math.sci.hiroshima-u.ac.jp/~m-mat/eindex.html

which is the English home page of Matsumoto san, co-originator of Mersenne 
Twister.
There is a faq and the original C source.

Paul


Paul Baclace wrote:

[...]
  Alternative
MersenneTwister replacement for Random is best choice, has a 
mind-bogglingly long period.

  Matsumoto and Nishimura (1998)

 [...]