Thanks for the info. Right now the MapReduce Scan uses the FirstKeyOnlyFilter. From what I have read in the javadoc, FirstKeyOnlyFilter *should* be faster since it only grabs the first KV pair of each row. I will play around with setting the caching size to a much higher number and see how it performs.
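Roughly, that caching tweak might look like the sketch below. This is a minimal sketch, not the author's actual code: hbase.client.scanner.caching is the standard client scanner-caching setting, the value and table name are illustrative, and whether the MapReduce scan picks up the conf default can depend on the HBase/TableInputFormat version in use.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.mapreduce.RowCounter;
    import org.apache.hadoop.mapreduce.Job;

    public class CachingTweakSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            // Ask region servers for many rows per RPC instead of the small default.
            // How high you can go depends on cell size, RS heap, and mapper memory.
            conf.setInt("hbase.client.scanner.caching", 10000);
            // The table name here is illustrative.
            Job job = RowCounter.createSubmittableJob(conf, new String[] { "my_table" });
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }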
I do not think I have too much wiggle room to modify our cluster configurations, but will see what I can do.

Thanks!
Birch

On Sep 20, 2013, at 5:39 PM, Bryan Beaudreault <[email protected]> wrote:

> If your cells are extremely small, try setting the caching even higher than
> 10k. You want to strike a balance between MBs of response size and number
> of calls needed, leaning towards response sizes as large as your system can
> handle (account for RS max memory, and memory available to your mappers).
>
> You could use the KeyOnlyFilter to further limit the sizes of responses
> transferred.
>
> Another thing that may help would be increasing your block size. This
> would speed up sequential reads but slow down random access. It would be a
> matter of making the config change and then running a major compaction to
> re-write the data.
>
> We constantly run multiple MR jobs (often on the order of 10's) against the
> same HBase cluster and don't often see issues. They are not full table
> scans, but they do often overlap. I think it would be worth at least
> attempting to run multiple jobs at once.
>
> On Fri, Sep 20, 2013 at 8:09 PM, James Birchfield <
> [email protected]> wrote:
>
>> I did not implement accurate timing, but the current table being counted
>> has been running for about 10 hours, and the log is estimating the map
>> portion at 10%:
>>
>> 2013-09-20 23:40:24,099 INFO [main] Job : map 10% reduce 0%
>>
>> So a loooong time. Like I mentioned, we have billions, if not trillions,
>> of rows potentially.
>>
>> Thanks for the feedback on the approaches I mentioned. I was not sure if
>> they would have any effect overall.
>>
>> I will look further into coprocessors.
>>
>> Thanks!
>> Birch
>>
>> On Sep 20, 2013, at 4:58 PM, Vladimir Rodionov <[email protected]>
>> wrote:
>>
>>> How long does the RowCounter job take to finish on your largest table?
>>>
>>> Just curious.
>>>
>>> On your options:
>>>
>>> 1. Probably not worth it - you may overload your cluster.
>>> 2. Not sure this one differs from 1. It looks the same to me, but more
>>> complex.
>>> 3. The same as 1 and 2.
>>>
>>> Counting rows can be done in an efficient way if you sacrifice some
>>> accuracy:
>>>
>>> http://highscalability.com/blog/2012/4/5/big-data-counting-how-to-count-a-billion-distinct-objects-us.html
>>>
>>> Yeah, you will need coprocessors for that.
>>>
>>> Best regards,
>>> Vladimir Rodionov
>>> Principal Platform Engineer
>>> Carrier IQ, www.carrieriq.com
>>> e-mail: [email protected]
>>>
>>> ________________________________________
>>> From: James Birchfield [[email protected]]
>>> Sent: Friday, September 20, 2013 3:50 PM
>>> To: [email protected]
>>> Subject: Re: HBase Table Row Count Optimization - A Solicitation For Help
>>>
>>> Hadoop 2.0.0-cdh4.3.1
>>> HBase 0.94.6-cdh4.3.1
>>>
>>> 110 servers, 0 dead, 238.2364 average load
>>>
>>> Some other info, not sure if it helps or not:
>>>
>>> Configured Capacity: 1295277834158080 (1.15 PB)
>>> Present Capacity: 1224692609430678 (1.09 PB)
>>> DFS Remaining: 624376503857152 (567.87 TB)
>>> DFS Used: 600316105573526 (545.98 TB)
>>> DFS Used%: 49.02%
>>> Under replicated blocks: 0
>>> Blocks with corrupt replicas: 1
>>> Missing blocks: 0
>>>
>>> It is hitting a production cluster, but I am not really sure how to
>>> calculate the load placed on the cluster.
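As a side note on the block-size suggestion quoted above, a rough sketch of what that change plus the follow-up major compaction might look like with the 0.94-era admin API. The table name, column family name, and block size are illustrative, and the disable/enable step assumes online schema change is not enabled.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.util.Bytes;

    public class BlockSizeChangeSketch {
        public static void main(String[] args) throws Exception {
            // Table and column family names are illustrative.
            String tableName = "my_table";
            byte[] family = Bytes.toBytes("cf");

            Configuration conf = HBaseConfiguration.create();
            HBaseAdmin admin = new HBaseAdmin(conf);
            try {
                HTableDescriptor desc = admin.getTableDescriptor(Bytes.toBytes(tableName));
                HColumnDescriptor col = desc.getFamily(family);
                // Larger blocks favor long sequential scans at the cost of random reads.
                col.setBlocksize(256 * 1024);

                // Without online schema change enabled, the table must be disabled first.
                admin.disableTable(tableName);
                admin.modifyColumn(tableName, col);
                admin.enableTable(tableName);

                // Existing HFiles keep their old block size until they are rewritten.
                admin.majorCompact(tableName);
            } finally {
                admin.close();
            }
        }
    }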
>>> On Sep 20, 2013, at 3:19 PM, Ted Yu <[email protected]> wrote:
>>>
>>>> How many nodes do you have in your cluster?
>>>>
>>>> When counting rows, what other load would be placed on the cluster?
>>>>
>>>> What is the HBase version you're currently using / planning to use?
>>>>
>>>> Thanks
>>>>
>>>> On Fri, Sep 20, 2013 at 2:47 PM, James Birchfield <
>>>> [email protected]> wrote:
>>>>
>>>>> After reading the documentation and scouring the mailing list
>>>>> archives, I understand there is no real support for fast row counting in
>>>>> HBase unless you build some sort of tracking logic into your code. In our
>>>>> case, we do not have such logic, and we have massive amounts of data already
>>>>> persisted. I am running into the issue of very long execution of the
>>>>> RowCounter MapReduce job against very large tables (multi-billion rows for
>>>>> many of them is our estimate). I understand why this issue exists and am
>>>>> slowly accepting it, but I am hoping I can solicit some possible ideas to
>>>>> help speed things up a little.
>>>>>
>>>>> My current task is to provide total row counts on about 600
>>>>> tables, some extremely large, some not so much. Currently, I have a
>>>>> process that executes the MapReduce job in process like so:
>>>>>
>>>>> Job job = RowCounter.createSubmittableJob(
>>>>>     ConfigManager.getConfiguration(), new String[]{tableName});
>>>>> boolean waitForCompletion = job.waitForCompletion(true);
>>>>> Counters counters = job.getCounters();
>>>>> Counter rowCounter = counters.findCounter(hbaseadminconnection.Counters.ROWS);
>>>>> return rowCounter.getValue();
>>>>>
>>>>> At the moment, each MapReduce job is executed in serial order, so I am
>>>>> counting one table at a time. For the current implementation of this whole
>>>>> process, as it stands right now, my rough timing calculations indicate that
>>>>> fully counting all the rows of these 600 tables will take anywhere between
>>>>> 11 and 22 days. This is not what I consider a desirable timeframe.
>>>>>
>>>>> I have considered three alternative approaches to speed things up.
>>>>>
>>>>> First, since the application is not heavily CPU bound, I could use
>>>>> a ThreadPool and execute multiple MapReduce jobs at the same time, looking
>>>>> at different tables. I have never done this, so I am unsure if it would
>>>>> cause any unanticipated side effects.
>>>>>
>>>>> Second, I could distribute the processes. I could find as many
>>>>> machines as can successfully talk to the desired cluster, give
>>>>> them a subset of tables to work on, and then combine the results post
>>>>> process.
>>>>>
>>>>> Third, I could combine both of the above approaches and run a
>>>>> distributed set of multithreaded processes to execute the MapReduce jobs
>>>>> in parallel.
>>>>>
>>>>> Although it seems to have been asked and answered many times, I
>>>>> will ask once again. Without the need to change our current configurations
>>>>> or restart the clusters, is there a faster approach to obtaining row counts?
>>>>> FYI, my cache size for the Scan is set to 1000. I have experimented with
>>>>> different numbers, but nothing made a noticeable difference. Any advice or
>>>>> feedback would be greatly appreciated!
>>>>>
>>>>> Thanks,
>>>>> Birch
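A rough sketch of the first approach described above (a thread pool submitting several RowCounter jobs concurrently), reusing the RowCounter API shown in the thread. The table list, pool size, and counter group string are assumptions made to keep the example self-contained, not the author's actual code.

    import java.util.Arrays;
    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.mapreduce.RowCounter;
    import org.apache.hadoop.mapreduce.Job;

    public class ParallelRowCountSketch {
        public static void main(String[] args) throws Exception {
            // Illustrative table list; in practice it might come from HBaseAdmin.listTables().
            List<String> tables = Arrays.asList("table_a", "table_b", "table_c");

            // A handful of concurrent jobs; size the pool against what the cluster can absorb.
            ExecutorService pool = Executors.newFixedThreadPool(4);
            Map<String, Future<Long>> pending = new LinkedHashMap<String, Future<Long>>();

            for (final String tableName : tables) {
                pending.put(tableName, pool.submit(new Callable<Long>() {
                    public Long call() throws Exception {
                        Configuration conf = HBaseConfiguration.create();
                        Job job = RowCounter.createSubmittableJob(conf, new String[] { tableName });
                        if (!job.waitForCompletion(true)) {
                            throw new IllegalStateException("RowCounter failed for " + tableName);
                        }
                        // Look up RowCounter's ROWS counter by name. The group string below is
                        // an assumption about the counter enum's class name and may need
                        // adjusting for the HBase version in use.
                        return job.getCounters().findCounter(
                            "org.apache.hadoop.hbase.mapreduce.RowCounter$RowCounterMapper$Counters",
                            "ROWS").getValue();
                    }
                }));
            }

            for (Map.Entry<String, Future<Long>> entry : pending.entrySet()) {
                System.out.println(entry.getKey() + " rows: " + entry.getValue().get());
            }
            pool.shutdown();
        }
    }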