I am trying to assess the performance of scans on a 100 TB database on 180 nodes
running HBase 0.20.5.

I run a sharded scan (each map task runs a scan over a specific row range;
speculative execution is turned off so that there is no duplication of work across tasks)
on a fully compacted table.

1 MB block size, block cache enabled. Max of 2 tasks per node. Each row is
30 KB in size: one big column family with just a single field.
The region lease timeout is set to an hour, and since I don't get any socket timeout
exceptions, I have not changed the write socket timeout.
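For reference, the relevant server/client settings look roughly like this in hbase-site.xml (a sketch restating the setup above; property names as in the 0.20-era hbase-default.xml):

```xml
<!-- hbase-site.xml fragment (sketch; values restate the setup described above) -->
<property>
  <name>hbase.regionserver.lease.period</name>
  <value>3600000</value> <!-- scanner/region lease timeout: 1 hour, in ms -->
</property>
<property>
  <name>hbase.client.scanner.caching</name>
  <value>1</value> <!-- default rows fetched per scanner next() RPC (the client-level cache) -->
</property>
```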

I ran experiments on the following cases:

 1.  The client-level cache is set to 1 (the default; I got the number using
getCaching): the MR tasks take around 13 hours to finish on average, which
gives around 13.17 MBps per node. The worst case is 34 hours (for the
entire job to finish).
 2.  Client cache set to 20 rows: this is much worse than the previous case; we
get a super-low ~1 MBps per node.

         Question: Should I pick the cache size so that the block size is a
multiple of it? Or should I set the cache size to a much lower value?
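For context, each map task sets up its scan roughly like this (a sketch, not my exact code; method names are from the 0.20-era client API, and the table and mapper names are placeholders — the start/stop rows come from the task's shard):

```java
// Sketch of the per-task scan setup (HBase 0.20 client API).
// startRow/stopRow are this task's shard boundaries, computed elsewhere.
Scan scan = new Scan(startRow, stopRow);
scan.setCaching(1);          // rows fetched per next() RPC -- the value I'm varying (1 vs 20)
scan.setCacheBlocks(true);   // server-side block cache is enabled

TableMapReduceUtil.initTableMapperJob(
    "bigtable",              // placeholder table name
    scan,
    MyScanMapper.class,      // placeholder mapper
    ImmutableBytesWritable.class,
    Result.class,
    job);

// Speculative execution off, so no duplicate work from retried attempts:
job.getConfiguration().setBoolean("mapred.map.tasks.speculative.execution", false);
```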

I find that these numbers are much lower than the ones I get when running
with just a few nodes.

Can you guys help me with this problem?

Thank you
Vidhya
