Hi all, I've been running a battery of tests to understand our cluster's performance, using PerformanceEvaluation (picking up where Tim Robertson left off elsewhere on the list). I'm seeing two strange things that I hope someone can help with:
1) With a command line like 'hbase org.apache.hadoop.hbase.PerformanceEvaluation randomWrite 10' I see 100 mappers spawned, rather than the expected 10. I expect 10 because that's what the usage text implies, and what the javadoc explicitly states - quoting from doMapReduce: "Run as many maps as asked-for clients." The culprit appears to be the outer loop in writeInputFile, which sets up 10 splits for every asked-for client - at least, if I'm reading it right. Is this somehow expected, or is that code left over from some previous iteration/experiment?

2) With that same randomWrite command line, I would expect a resulting table with 10 * (1024 * 1024) rows (so 10485760, roughly 10M). Instead, the randomWrite job reports writing exactly that many rows, but running rowcounter against the table reveals only 6549899. A second attempt to build the table produces a slightly different count (e.g. 6627689). I see a similar discrepancy when using 50 clients instead of 10 (~35% smaller than expected). Key collisions could explain it, but that seems pretty unlikely, given I only need e.g. 10M keys out of a potential 2B.

Any and all input appreciated.

Thanks,
Oliver

--
Oliver Meyn
Software Developer
Global Biodiversity Information Facility (GBIF)
+45 35 32 15 12
http://www.gbif.org
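For reference, here's the back-of-envelope behind my "seems pretty unlikely" claim about collisions - a quick Python sketch, assuming the keys are drawn uniformly at random (with replacement) from a 2^31 space, which is my reading of the "potential 2B":

```python
import math

def expected_unique(n_draws, key_space):
    """Expected number of distinct values after n_draws uniform random
    draws (with replacement) from key_space possible values."""
    # E[unique] = N * (1 - (1 - 1/N)^n), computed stably via log1p/expm1
    return key_space * -math.expm1(n_draws * math.log1p(-1.0 / key_space))

n = 10 * 1024 * 1024  # 10485760 rows written by 10 clients
unique = expected_unique(n, 2**31)
print(f"expected distinct keys: {unique:,.0f}")
print(f"expected collisions:    {n - unique:,.0f}")
```

With 10M draws from a 2B space that works out to only a few tens of thousands of expected collisions - nowhere near enough to lose ~3.9M rows.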
