On Wed, Feb 15, 2012 at 1:53 AM, Oliver Meyn (GBIF) <om...@gbif.org> wrote:
> So hacking around reveals that key collision is indeed the problem. I
> thought the modulo part of the getRandomRow method was suspect but while
> removing it improved the behaviour (I got ~8M rows instead of ~6.6M) it
> didn't fix it completely. Since that's really what UUIDs are for I gave that
> a shot (i.e UUID.randomUUID()) and sure enough now I get the full 10M rows.
> Those are 16-byte keys now though, instead of the 10-byte that the integers
> produced. But because we're testing scan performance I think using a
> sequentially written table would probably be cheating and so will stick with
> randomWrite with slightly bigger keys. That means it's a little harder to
> compare to the results that other people get, but at least I know my internal
> tests are apples to apples.
>
> Oh and I removed the outer 10x loop and that produced the desired number of
> mappers (ie what I passed in on the commandline) but made no difference in
> the key generation/collision story.
>
> Should I file bugs for these 2 issues?
>
Thanks, Oliver, for digging. Won't using UUIDs make it tougher on the other
end when reading? How do you divide up the UUID space? UUIDs are not well
distributed across the possible key space, IIUC. Should writing UUIDs be an
option on PE? Thanks again for figuring it out.

St.Ack
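
For reference, a rough sketch of why random integer keys collide this much. It assumes the simplest possible model, n keys drawn uniformly at random from a space of n possible values, and is not PE's actual getRandomRow code; the class and method names below are made up for illustration. Under that model only about 63% of the draws (a fraction of 1 - 1/e) land on distinct keys, which is in the same ballpark as the ~6.6M of 10M rows reported above, though PE's real key generation may differ in its details.

```java
import java.util.BitSet;
import java.util.Random;

// Hypothetical illustration only: not taken from HBase's PerformanceEvaluation.
class KeyCollisionSketch {

    // Draw `draws` keys uniformly from [0, keySpace) and count the distinct ones.
    static int countDistinct(int draws, int keySpace, long seed) {
        Random rng = new Random(seed);
        BitSet seen = new BitSet(keySpace);
        for (int i = 0; i < draws; i++) {
            seen.set(rng.nextInt(keySpace)); // a repeated key overwrites, as a put would
        }
        return seen.cardinality();
    }

    // Expected number of distinct keys: K * (1 - (1 - 1/K)^n).
    static double expectedDistinct(long draws, long keySpace) {
        return keySpace * (1.0 - Math.pow(1.0 - 1.0 / keySpace, draws));
    }

    public static void main(String[] args) {
        int n = 1_000_000; // smaller than 10M so the simulation runs quickly
        int distinct = countDistinct(n, n, 42L);
        System.out.printf("simulated: %d distinct of %d draws%n", distinct, n);
        System.out.printf("expected for 10M draws over 10M keys: %.0f%n",
                expectedDistinct(10_000_000L, 10_000_000L));
    }
}
```

With 122 random bits per key, UUID.randomUUID() makes a collision among 10M keys vanishingly unlikely, which fits the full 10M rows seen with UUIDs.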