On Mon, Jul 26, 2010 at 2:43 PM, Vidhyashankar Venkataraman <[email protected]> wrote:
> I am trying to assess the performance of Scans on a 100TB db on 180 nodes
> running HBase 0.20.5.
>
> I run a sharded scan (each Map task runs a scan on a specific range;
> speculative execution is turned off so that there is no duplication of
> tasks) on a fully compacted table.

How big is the range V? How many rows do you scan in your map task? They
are contiguous, right?
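To be clear on what I'm asking: I'm picturing each map task opening one
scanner over a single contiguous slice of the keyspace, something like the
below (a rough sketch against the 0.20 client API; the table name, family
name, and start/stop keys are made-up placeholders):

  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Result;
  import org.apache.hadoop.hbase.client.ResultScanner;
  import org.apache.hadoop.hbase.client.Scan;
  import org.apache.hadoop.hbase.util.Bytes;

  public class ShardScan {
    public static void main(String[] args) throws Exception {
      HTable table = new HTable(new HBaseConfiguration(), "bigtable");
      // One contiguous [start, stop) slice of the keyspace per task.
      Scan scan = new Scan(Bytes.toBytes(args[0]), Bytes.toBytes(args[1]));
      scan.addFamily(Bytes.toBytes("f"));  // the one big family
      ResultScanner scanner = table.getScanner(scan);
      try {
        for (Result row : scanner) {
          // process the ~30KB row here
        }
      } finally {
        scanner.close();
      }
    }
  }

If each task is working one such slice, the regionservers should be doing
mostly sequential reads.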
> 1 MB block size, block cache enabled. Max of 2 tasks per node. Each row
> is 30 KB in size: 1 big column family with just one field.
> Region lease timeout is set to an hour. And I don't get any socket
> timeout exceptions, so I have not reassigned the write socket timeout.

Did you try with defaults first?

> I ran experiments on the following cases:
>
> 1. The client-level cache is set to 1 (the default; I got the number
> using getCaching): the MR tasks take around 13 hours on average to
> finish, which gives around 13.17 MB/s per node. The worst case is 34
> hours (to finish the entire job).
> 2. Client cache set to 20 rows: this is much worse than the previous
> case; we get a super-low 1 MB/s per node.
>
> Question: Should I set it to a value such that the block size is a
> multiple of the cache size? Or set the cache size to a much lower value?
>
> I find that these numbers are much lower than the ones I get when running
> with just a few nodes.

What numbers do you see on a smaller cluster?

> Oh, and I forgot to add: 4 gig regions and 8 gig heap size.

So 4G to HBase and 8G on these machines in total? You are running
TaskTrackers on the same machines? 2 mappers, 1 DataNode, and 1
RegionServer on all 180 machines?

You are using Hadoop Streaming? How does that work? Streaming does text
only? I didn't think you could write to HBase out of Streaming.

> Does the HFile block size influence only the size of the index and the
> efficiency of random reads?

Generally, yes. I'd think, though, that with a bigger block size, and
especially if you are using caching so you cut down on the number of RPCs,
you should be approaching the scan speeds you'd see going against HDFS.
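For concreteness, here is roughly how I'd set the two knobs in question (a
sketch only; I'm going from memory on the HColumnDescriptor setter, so
double-check it against 0.20 -- there the block size may instead go through
the long-form HColumnDescriptor constructor -- and the names and numbers
are placeholders):

  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.HColumnDescriptor;
  import org.apache.hadoop.hbase.HTableDescriptor;
  import org.apache.hadoop.hbase.client.HBaseAdmin;
  import org.apache.hadoop.hbase.client.Scan;
  import org.apache.hadoop.hbase.util.Bytes;

  public class Knobs {
    public static void main(String[] args) throws Exception {
      HBaseConfiguration conf = new HBaseConfiguration();

      // Scanner caching: rows shipped per RPC. With 30KB rows, caching=30
      // moves roughly 1MB per round trip. Can also be set cluster-wide via
      // hbase.client.scanner.caching in hbase-site.xml.
      Scan scan = new Scan(Bytes.toBytes("start"), Bytes.toBytes("stop"));
      scan.setCaching(30);

      // Block size is schema-level, fixed per family at table creation.
      HTableDescriptor desc = new HTableDescriptor("bigtable");
      HColumnDescriptor family = new HColumnDescriptor(Bytes.toBytes("f"));
      family.setBlocksize(1024 * 1024);  // your current 1MB blocks
      family.setBlockCacheEnabled(true);
      desc.addFamily(family);
      new HBaseAdmin(conf).createTable(desc);
    }
  }

Note the two are set at different levels: caching is per-Scan (or
cluster-wide via config), while the block size is baked into the family
schema.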
> Just to make sure: the client uses ZooKeeper only for obtaining ROOT
> whenever it performs scans, right? So scans shouldn't face any master/ZK
> bottlenecks when we scale up the number of nodes, am I right?

That's right.
St.Ack