Answers interspersed below

On May 22, 2010, at 9:25 AM, Jean-Daniel Cryans wrote:

> My first question would be, what do you expect exactly? Would 5 min be
> enough? Or are you expecting something more like 1-2 secs (which is
> impossible since this is mapreduce)?

I don't have a set requirement.  I'm just trying to learn more about the system, 
and 25 minutes seemed excessive.  I have nothing to compare against and no real 
expectations, but it takes about 900 seconds (15 minutes) to run the count 
function in the shell.  My main goal is to figure out what times are reasonable 
for similar setups, or at least to get a general idea of what's acceptable, so 
I can make sure everything is configured properly.

> Then there's also Jon's questions.

I'm not sure how many regions there are per table.  My guess is whatever the 
default is, since this isn't something I've tried to change.  I'll look into it 
(a sketch of how I'd check is below) and update the thread.
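
For the record, I'm assuming I can check the region count with something like 
the following (a rough sketch against the 0.20-era client API; "mytable" is a 
placeholder for my real table name).  I believe the master web UI, on port 
60010 by default, also shows this per table.

    import org.apache.hadoop.hbase.client.HTable;

    public class RegionCount {
      public static void main(String[] args) throws Exception {
        // "mytable" is a placeholder for the real table name;
        // HTable picks up hbase-site.xml from the classpath
        HTable table = new HTable("mytable");
        // getRegionsInfo() maps each region to the server hosting it,
        // so its size is the number of regions in the table
        System.out.println("regions: " + table.getRegionsInfo().size());
      }
    }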

> Finally, did you set a higher scanner caching on that job?
> hbase.client.scanner.caching is the name of the config, which defaults
> to 1. When mapping an HBase table, if you don't set it higher you're
> basically benchmarking the RPC layer, since it does 1 call per next()
> invocation. Setting the right value depends on the size of your rows,
> e.g. are you storing 60 bytes or something large like 100KB? On our 13B
> row table (each row is a few bytes), we set it to 10k.

Again, my guess is that hbase.client.scanner.caching is still at the default 
of 1, as you mentioned, since I haven't touched it.  When calculating the size 
of a row, is that just the size of the data stored in the various columns, or 
do I need to factor in overhead as well?  Do you have a reference or any 
guidance on the optimal hbase.client.scanner.caching setting given the size of 
a typical row?  In my case, I have about 8 rows, each storing a decimal value.  
I haven't checked, but I'm assuming these are stored as doubles.
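
For what it's worth, once I settle on a value, my understanding is that I'd 
set it in the job setup roughly like this (a sketch based on the 0.20-era 
mapreduce API; the table name, the no-op MyMapper, and the value of 1000 are 
all placeholders, not recommendations):

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
    import org.apache.hadoop.hbase.mapreduce.TableMapper;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

    public class ScanCachingTest {
      // no-op placeholder; my real mapper would do the counting
      static class MyMapper extends TableMapper<ImmutableBytesWritable, Result> {
        protected void map(ImmutableBytesWritable key, Result value, Context ctx) {
          // intentionally empty
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = new Job(new HBaseConfiguration(), "scan caching test");
        job.setJarByClass(ScanCachingTest.class);
        Scan scan = new Scan();
        scan.setCaching(1000);      // rows per next() RPC instead of the default 1;
                                    // same knob as hbase.client.scanner.caching
        scan.setCacheBlocks(false); // don't churn the block cache on a full scan
        TableMapReduceUtil.initTableMapperJob("mytable", scan, MyMapper.class,
            ImmutableBytesWritable.class, Result.class, job);
        job.setNumReduceTasks(0);
        job.setOutputFormatClass(NullOutputFormat.class);
        job.waitForCompletion(true);
      }
    }

If I've got the API wrong anywhere, corrections welcome.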

Thanks!
