Re: MR sharded Scans giving poor performance..

Ryan Rawson Mon, 26 Jul 2010 15:38:18 -0700

Hey,

That sounds interesting - maybe you could tell us why your
system performs better? The default TableInputFormat just
creates N map tasks, one per region, and the regions are all roughly
the same data size.
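
As a rough illustration of that per-region splitting, here is a minimal Python sketch. It is a simplified model and not the actual TableInputFormat code: the function name `region_splits` and the sample keys are hypothetical, and it assumes the HBase convention that an empty key means "start of table" or "end of table".

```python
def region_splits(region_start_keys, scan_start=b"", scan_stop=b""):
    """Return one (start, stop) row-key range per region that overlaps the scan.

    region_start_keys: sorted region start keys; the first one is b"".
    An empty start/stop key means "from the beginning" / "to the end".
    """
    splits = []
    for i, region_start in enumerate(region_start_keys):
        # A region ends where the next one starts; the last region is open-ended.
        if i + 1 < len(region_start_keys):
            region_end = region_start_keys[i + 1]
        else:
            region_end = b""
        # Skip regions that lie entirely before the scan range.
        if region_end != b"" and scan_start >= region_end:
            continue
        # Skip regions that lie entirely after the scan range.
        if scan_stop != b"" and region_start >= scan_stop:
            continue
        start = max(scan_start, region_start)
        if region_end == b"":
            stop = scan_stop
        elif scan_stop == b"":
            stop = region_end
        else:
            stop = min(scan_stop, region_end)
        splits.append((start, stop))
    return splits


# Hypothetical table with four regions: one map task per region.
full_table = region_splits([b"", b"g", b"n", b"t"])
# A bounded scan only produces tasks for the regions it touches.
bounded = region_splits([b"", b"g", b"n", b"t"], b"h", b"p")
```

An input format that takes an array of Scans would instead let the caller supply the (start, stop) pairs directly, so the number of tasks is no longer tied to the number of regions.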


What do you do?
-ryan

On Mon, Jul 26, 2010 at 3:29 PM, Xavier Stevens <[email protected]> wrote:
>  We have something that might interest you.
>
> http://socorro.googlecode.com/svn/trunk/analysis/src/java/org/apache/hadoop/hbase/mapreduce/
>
> We haven't fully tested everything yet, so don't blame us if something
> goes wrong.  It's basically the exact same as TableInputFormat except it
> takes an array of Scans to be used for row-key ranges.  It requires the
> caller to set up the Scan array since they should have the best knowledge
> about their row-key structure.
>
> Preliminary results for us reduced a 15 minute job to under 2 minutes.
>
> Cheers,
>
>
> -Xavier
>
> On 7/26/10 3:16 PM, Vidhyashankar Venkataraman wrote:
>> I did not use a TableInputFormat: I ran my own scans on specific ranges 
>> (just for more control from my side to tune the ranges and the ease with 
>> which I can run a hadoop streaming job)..
>>
>> 1 MB for HFile block size.. Not the HDFS block size..
>> I increased it since I didn't care too much for random read performance.. 
>> HDFS block size is the default value... (I have a related question then: 
>> does the HFile block size influence only the size of the index and the 
>> efficiency of random reads?  I don't see an effect on scans though)...
>>
>>   I had previously run 5 tasks per machine with the client cache at 20 rows, 
>> but that resulted in scanner expiries (UnknownScannerException) and DFS 
>> socket timeouts.. So that's why I reduced the number of tasks.. Let me 
>> decrease the number of rows and see..
>>
>>   Just to make sure: the client uses ZooKeeper only for obtaining ROOT 
>> whenever it performs scans, doesn't it? So scans shouldn't face any master/ZK 
>> bottlenecks when we scale up wrt number of nodes, am I right?
>>
>> Thank you
>> Vidhya
>>
>> On 7/26/10 3:01 PM, "Ryan Rawson" <[email protected]> wrote:
>>
>> Hey,
>>
>> A few questions:
>>
>> - sharded scan, are you not using TableInputFormat?
>> - 1 MB block size - what block size?  You probably shouldn't set the
>> HDFS block size to 1MB, it just causes more NameNode traffic.
>> - Tests a year ago indicated that HFile block size really didn't
>> improve speed when you went beyond 64k or so.
>> - Run more maps/machine... one map task per disk probably?
>> - Try setting the client cache to an in-between level, 2-6 perhaps.
>>
>> Let us know about those other questions and we can go from there.
>> -ryan
>>
>> On Mon, Jul 26, 2010 at 2:43 PM, Vidhyashankar Venkataraman
>> <[email protected]> wrote:
>>> I am trying to assess the performance of Scans on a 100TB db on 180 nodes 
>>> running HBase 0.20.5..
>>>
>>> I run a sharded scan (each Map task runs a scan on a specific range: 
>>> speculative execution is turned off so that there is no duplication in 
>>> tasks) on a fully compacted table...
>>>
>>> 1 MB block size, Block cache enabled.. Max of 2 tasks per node..  Each row 
>>> is 30 KB in size: 1 big column family with just one field..
>>> Region lease timeout is set to an hour.. And I don't get any socket timeout 
>>> exceptions so I have not reassigned the write socket timeout...
>>>
>>> I ran experiments on the following cases:
>>>
>>>  1.  The client level cache is set to 1 (default: got the number using 
>>> getCaching): The MR tasks take around 13 hours to finish on average.. 
>>> Which gives around 13.17 MBps per node. The worst case is 34 hours (to 
>>> finish the entire job)...
>>>  2.  Client cache set to 20 rows: this is much worse than the previous 
>>> case: we get around a super low 1MBps per node...
>>>
>>>         Question: Should I set it to a value such that the block size is a 
>>> multiple of the above-mentioned cache size? Or the cache size to a much 
>>> lower value?
>>>
>>> I find that these numbers are much lower than the ones I get when it's 
>>> running with just a few nodes..
>>>
>>> Can you guys help me with this problem?
>>>
>>> Thank you
>>> Vidhya
>>>
>>
>
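
For reference, the figures quoted in the original post (30 KB rows, a 1 MB HFile block size, client cache of 1 vs 20 rows, 100TB over 180 nodes) can be turned into some back-of-the-envelope numbers. This is only a sketch under stated assumptions: the helper names are hypothetical, units are taken as binary (1 KB = 1024 bytes), rows are assumed to be exactly 30 KB, and the table is assumed to be exactly 100 TB spread evenly over the nodes, which is why the throughput estimate need not match the 13.17 MBps reported.

```python
KB = 1024
MB = 1024 * KB
TB = 1024 * 1024 * MB


def bytes_per_scanner_rpc(rows_cached, row_size_bytes):
    """Approximate payload fetched by one scanner round trip."""
    return rows_cached * row_size_bytes


def rows_per_block(block_size_bytes, row_size_bytes):
    """How many whole rows fit in one HFile block."""
    return block_size_bytes // row_size_bytes


def node_throughput_mb_per_s(total_bytes, nodes, hours):
    """Average per-node scan rate if the data is spread evenly."""
    return total_bytes / nodes / (hours * 3600) / MB


row_size = 30 * KB  # assumed exact; the post only says "30 KB in size"

rpc_at_1 = bytes_per_scanner_rpc(1, row_size)    # 30720 bytes (30 KB)
rpc_at_20 = bytes_per_scanner_rpc(20, row_size)  # 614400 bytes (600 KB)
rows_in_block = rows_per_block(1 * MB, row_size)  # 34 rows
rate = node_throughput_mb_per_s(100 * TB, 180, 13)  # roughly 12.4 MB/s
```

Under these assumptions a cache of 20 rows still fetches less than one 1 MB block per round trip, so a cache size in the 2-6 range suggested above stays well below a single block.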
