Re: MR to load HBase running slowly in reduce

Jean-Daniel Cryans Wed, 24 Nov 2010 12:16:31 -0800

On the same page where you see those 54 regions, there's a split
button... hit it a few times :) (wait a few minutes between each hit)


On why it took 3h to get there, my guess is that with 40 reducers
there was a lot of contention on the first few regions and I wouldn't
be surprised if it slowed you down a lot.
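
To illustrate how pre-splitting (see the createTable advice quoted
further down) would spread 40 reducers over many regions from the
start, here is a minimal sketch that only computes split keys. It
assumes row keys are spread evenly over the first byte (for example
because they are hashed), which this thread does not confirm for
raw_occurrence_record. The class and method names are hypothetical;
the resulting byte[][] is the kind of argument the quoted
HBaseAdmin.createTable(HTableDescriptor, byte[][]) method takes.

```java
public class PreSplitKeys {

    /**
     * Returns numRegions - 1 one-byte split keys that divide the key
     * space 0x00..0xFF into numRegions roughly equal ranges.
     * Assumption: row keys are uniformly distributed on their first byte.
     */
    public static byte[][] uniformSplits(int numRegions) {
        if (numRegions < 2 || numRegions > 256) {
            throw new IllegalArgumentException("numRegions must be in 2..256");
        }
        byte[][] splits = new byte[numRegions - 1][];
        for (int i = 1; i < numRegions; i++) {
            // Boundary between region i-1 and region i, as an unsigned byte value.
            int boundary = (i * 256) / numRegions;
            splits[i - 1] = new byte[] { (byte) boundary };
        }
        return splits;
    }

    public static void main(String[] args) {
        // Print the split points for a hypothetical 4-region table.
        for (byte[] key : uniformSplits(4)) {
            System.out.printf("split at 0x%02x%n", key[0] & 0xff);
        }
        // These keys would then be passed as the second argument of
        // HBaseAdmin.createTable(HTableDescriptor, byte[][]).
    }
}
```

If the real keys are not uniform (sequential ids, say), the split
points would have to be taken from a sample of the data instead.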

Since you're inserting largish values, I think you should also
consider playing with the compaction configurations. Take a look at
this presentation to understand what I'm referring to
http://people.apache.org/~jdcryans/HUG8/HUG8-rawson.pdf
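
As a rough aid for reading the "blocking 128.0m" messages in the
quoted region server log below: the blocking size is, as far as I
know, the memstore flush size multiplied by a block multiplier
(hbase.hregion.memstore.flush.size and
hbase.hregion.memstore.block.multiplier). The sketch below assumes the
values 64 MB and 2, which would match the 128.0m in the log but are
not confirmed anywhere in this thread. The class is hypothetical and
only does the arithmetic; it is not HBase code.

```java
public class MemstoreBlocking {

    /** Blocking threshold in bytes: flush size times block multiplier. */
    public static long blockingSizeBytes(long flushSizeBytes, int blockMultiplier) {
        return flushSizeBytes * blockMultiplier;
    }

    /** Mirrors the ">=" comparison reported in the log message. */
    public static boolean wouldBlock(long memstoreBytes, long blockingBytes) {
        return memstoreBytes >= blockingBytes;
    }

    public static void main(String[] args) {
        long flushSize = 64L * 1024 * 1024; // assumed flush size (64 MB)
        int multiplier = 2;                 // assumed block multiplier
        long blocking = blockingSizeBytes(flushSize, multiplier);

        // 138.7m is the first memstore size reported in the quoted log.
        long observed = (long) (138.7 * 1024 * 1024);

        System.out.println("blocking threshold (MB): " + blocking / (1024 * 1024));
        System.out.println("updates blocked: " + wouldBlock(observed, blocking));
    }
}
```

Raising either value lets a region absorb a larger burst before
updates block, at the cost of more heap held in memstores.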

J-D

On Wed, Nov 24, 2010 at 12:08 PM, Tim Robertson
<[email protected]> wrote:
>>> I'm using 
>>> http://hbase.apache.org/docs/r0.20.6/api/org/apache/hadoop/hbase/mapreduce/IdentityTableReducer.html
>>
>> Did you set it up with TableMapReduceUtil?
>>
>>> Not explicitly set by me
>>
>> If you use TableMapReduceUtil, then it's set to 2MB by default, but
>> looking at the RS logs the write buffer is probably not the problem.
>>
>>> 1 family
>>
>> Good
>>
>>> LZO
>>
>> Excellent
>>
>>> Indeed:
>>>
>>> memstore size 138.7m is >= than blocking 128.0m size
>>> 2010-11-24 17:12:49,136 INFO org.apache.hadoop.hbase.regionserver.HRegion: Blocking updates for 'IPC Server handler 4 on 60020' on region raw_occurrence_record,,1290613896288.841ac149ecacf4b721ac232960e98761.: memstore size 138.7m is >= than blocking 128.0m size
>>> 2010-11-24 17:12:49,155 INFO org.apache.hadoop.hbase.regionserver.HRegion: Blocking updates for 'IPC Server handler 10 on 60020' on region raw_occurrence_record,,1290613896288.841ac149ecacf4b721ac232960e98761.: memstore size 146.3m is >= than blocking 128.0m size
>>> 2010-11-24 17:12:49,169 INFO org.apache.hadoop.hbase.regionserver.HRegion: Blocking updates for 'IPC Server handler 5 on 60020' on region raw_occurrence_record,,1290613896288.841ac149ecacf4b721ac232960e98761.: memstore size 148.8m is >= than blocking 128.0m size
>>> 2010-11-24 17:12:49,193 INFO org.apache.hadoop.hbase.regionserver.HRegion: Blocking updates for 'IPC Server handler 8 on 60020' on region
>>>
>>> I guess this is bad, but could benefit from some guidance...
>>
>> How many regions do you have in your table? If you started with only 1
>> region (eg a new table), then all the load will go to that single
>> region. It's a good thing to create your tables pre-split if you're
>> planning to do a massive upload into them. See this method and the
>> others like it:
>> http://hbase.apache.org/docs/r0.89.20100924/apidocs/org/apache/hadoop/hbase/client/HBaseAdmin.html#createTable(org.apache.hadoop.hbase.HTableDescriptor, byte[][])
>>
>> To find how many regions you have in "raw_occurrence_record", go on
>> the master web UI and click on the table's name in the tables list.
>
> Yeah, I created it with 1 and now there are 54, with the reduce only at
> 12 million records out of 267 million, so loads more splits to go.  Would
> this account for 3 hours and under 5% of data though?
>
>> Finally, you might want to do a bulk load instead, see
>> http://hbase.apache.org/docs/r0.89.20100924/bulk-loads.html
>
> Thanks.  I am simulating a load that would come from data crawlers
> just to start getting some understanding of HBase.
>
>>
>>>
>>> What's the best way to do this please and I will?
>>
>> Open conf/hbase-env.sh and go to:
>>
>> # Uncomment below to enable java garbage collection logging.
>> # export HBASE_OPTS="$HBASE_OPTS -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:$HBASE_HOME/logs/gc-hbase.log"
>
> Will have to wait until tomorrow, but then I will.
>
> Cheers,
> Tim
>
>>
>> J-D
>>
>
