So, just to clarify where I am at this point, I have learned that I was 
absolutely not taking advantage of the cluster doing it the way I was.  Some 
quick tests running the 'correct' way, from the command line, using the 
built-in RowCounter MapReduce job, run orders of magnitude faster than what I 
am seeing.
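For anyone following along, the command-line invocation being referred to is the stock RowCounter job shipped with HBase; a minimal sketch, assuming the client machine has working HBase and MapReduce configuration, and with the table name below as a placeholder:

```shell
# Launch the built-in RowCounter MapReduce job against one table.
# With proper client configs this submits to the cluster rather than
# running a local single-process job.
hbase org.apache.hadoop.hbase.mapreduce.RowCounter 'my_table'
```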

So, my apologies for seeking help on a problem when I didn't fully understand 
the technology and its proper use.  However, I am very glad that this 
community was able to point this out and clue me in.  For that I am very, very 
appreciative.

I will rework my logic to use this technique, probably creating a customized 
RowCounter MapReduce implementation that can count multiple tables at once 
instead of having to issue 600 individual requests.

Thanks again!!!
Birch
On Sep 20, 2013, at 6:57 PM, James Birchfield <[email protected]> 
wrote:

> Yes, we have a fully set up cluster complete with all you pointed out.  But I 
> believe, now that it has been pointed out to me in this thread and your 
> reply, that it is exactly as you and Lars say.  I am running the MapReduce in 
> process from a standalone Java process, and I believe it is not taking 
> advantage of that infrastructure.
> 
> So I will pull this all out of the process, and run it on the cluster using 
> the example I have read about.
> 
> It is most likely just my ignorance leading to the root cause of this 
> problem.  All the help is very much appreciated.
> 
> Thanks!
> Birch
> On Sep 20, 2013, at 6:46 PM, Bryan Beaudreault <[email protected]> 
> wrote:
> 
>> I could be wrong, but based on the info in your most recent emails and the
>> logs therein as well, I believe you may be running this job as a single
>> process.
>> 
>> Do you actually have a full hadoop setup running, with a jobtracker and
>> tasktrackers?  In the absence of proper configuration, the hadoop code will
>> simply launch a local, single-process job.  The LocalJobRunner referenced
>> in your logs points to that.
>> 
>> If this is the case you are likely only running a single mapper and
>> reducer, or at most running a few mappers at once in threads in your local
>> process. Either way this would obviously greatly limit the throughput.
>> 
>> If you have a full hadoop set-up, make sure the client (dev machine) you
>> are running this job from has access to a mapred-site.xml and hdfs-site.xml
>> configuration file, or at the very least set the mapred.job.tracker value
>> manually in your job configuration before submitting.
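To make that last suggestion concrete, the client-side file might contain something like the following; the jobtracker host and port here are placeholders, not values from the thread:

```xml
<!-- mapred-site.xml on the submitting client; host/port are placeholders -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>jobtracker.example.com:8021</value>
  </property>
</configuration>
```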
>> 
>> Let me know if I'm totally off base here.
>> 
>> 
>> On Fri, Sep 20, 2013 at 9:34 PM, James Birchfield <
>> [email protected]> wrote:
>> 
>>> Excellent!  Will do!
>>> 
>>> Birch
>>> On Sep 20, 2013, at 6:32 PM, Ted Yu <[email protected]> wrote:
>>> 
>>>> Please take a look at the javadoc for
>>>> src/main/java/org/apache/hadoop/hbase/client/coprocessor/AggregationClient.java
>>>> 
>>>> As long as the machine can reach your HBase cluster, you should be able to
>>>> run AggregationClient and utilize the AggregateImplementation endpoint in
>>>> the region servers.
>>>> 
>>>> Cheers
>>>> 
>>>> 
>>>> On Fri, Sep 20, 2013 at 6:26 PM, James Birchfield <
>>>> [email protected]> wrote:
>>>> 
>>>>> Thanks Ted.
>>>>> 
>>>>> That was the direction I have been working towards as I am learning today.
>>>>> Much appreciation to all the replies to this thread.
>>>>> 
>>>>> Whether I keep the MapReduce job or utilize the Aggregation coprocessor
>>>>> (which, it is turning out, should be possible for me here), I need to
>>>>> make sure I am running the client in an efficient manner.  Lars may have
>>>>> hit upon the core problem.  I am not running the MapReduce job on the
>>>>> cluster, but rather from a standalone remote Java client executing the
>>>>> job in process.  This may very well turn out to be the number one issue.  I
>>>>> would love it if this turns out to be true.  Would make this a great
>>>>> learning lesson for me as a relative newcomer to working with HBase, and
>>>>> potentially allow me to finish this initial task much quicker than I was
>>>>> thinking.
>>>>> 
>>>>> So assuming the MapReduce jobs need to be run on the cluster instead of
>>>>> locally, does a coprocessor endpoint client need to be run the same, or is
>>>>> it safe to run it on a remote machine since the work gets distributed out
>>>>> to the region servers?  Just wondering if I would run into the same issues
>>>>> if what I said above holds true.
>>>>> 
>>>>> Thanks!
>>>>> Birch
>>>>> On Sep 20, 2013, at 6:17 PM, Ted Yu <[email protected]> wrote:
>>>>> 
>>>>>> In 0.94, we have AggregateImplementation, an endpoint coprocessor, which
>>>>>> implements getRowNum().
>>>>>> 
>>>>>> Example is in AggregationClient.java
>>>>>> 
>>>>>> Cheers
>>>>>> 
>>>>>> 
>>>>>> On Fri, Sep 20, 2013 at 6:09 PM, lars hofhansl <[email protected]> wrote:
>>>>>> 
>>>>>>> From your numbers below you have about 26k regions, thus each region is
>>>>>>> about 545tb/26k = 20gb. Good.
>>>>>>> 
>>>>>>> How many mappers are you running?
>>>>>>> And just to rule out the obvious, the M/R is running on the cluster and
>>>>>>> not locally, right? (it will default to a local runner when it cannot use
>>>>>>> the M/R cluster).
>>>>>>> 
>>>>>>> Some back of the envelope calculations tell me that assuming 1ge network
>>>>>>> cards, the best you can expect for 110 machines to map through this data
>>>>>>> is about 10h. (so way faster than what you see).
>>>>>>> (545tb/(110*1/8gb/s) ~ 40ks ~11h)
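Lars's arithmetic checks out; a quick sanity check of the estimate, using only the numbers from the thread (545 TB of data, 110 nodes, each bottlenecked by a 1GbE NIC at roughly 1/8 GB/s):

```java
// Sanity check of the back-of-the-envelope scan-time estimate: total data
// divided by aggregate network-limited read bandwidth across the cluster.
public class ScanEstimate {
    static double seconds(double totalGb, int nodes, double gbPerSecPerNode) {
        return totalGb / (nodes * gbPerSecPerNode);
    }

    public static void main(String[] args) {
        double s = seconds(545_000.0, 110, 1.0 / 8.0);
        // ~39,636 s, i.e. roughly 11 hours, matching "~40ks ~11h"
        System.out.printf("%.0f s = %.1f h%n", s, s / 3600.0);
    }
}
```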
>>>>>>> 
>>>>>>> 
>>>>>>> We should really add a rowcounting coprocessor to HBase and allow using
>>>>>>> it via M/R.
>>>>>>> 
>>>>>>> -- Lars
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> ________________________________
>>>>>>> From: James Birchfield <[email protected]>
>>>>>>> To: [email protected]
>>>>>>> Sent: Friday, September 20, 2013 5:09 PM
>>>>>>> Subject: Re: HBase Table Row Count Optimization - A Solicitation For Help
>>>>>>> 
>>>>>>> 
>>>>>>> I did not implement accurate timing, but the current table being counted
>>>>>>> has been running for about 10 hours, and the log is estimating the map
>>>>>>> portion at 10%
>>>>>>> 
>>>>>>> 2013-09-20 23:40:24,099 INFO  [main] Job : map 10% reduce 0%
>>>>>>> 
>>>>>>> So a loooong time.  Like I mentioned, we have billions, if not trillions
>>>>>>> of rows potentially.
>>>>>>> 
>>>>>>> Thanks for the feedback on the approaches I mentioned.  I was not sure if
>>>>>>> they would have any effect overall.
>>>>>>> 
>>>>>>> I will look further into coprocessors.
>>>>>>> 
>>>>>>> Thanks!
>>>>>>> Birch
>>>>>>> On Sep 20, 2013, at 4:58 PM, Vladimir Rodionov <[email protected]> wrote:
>>>>>>> 
>>>>>>>> How long does it take for the RowCounter job for the largest table to
>>>>>>>> finish on your cluster?
>>>>>>>> 
>>>>>>>> Just curious.
>>>>>>>> 
>>>>>>>> On your options:
>>>>>>>> 
>>>>>>>> 1. Not worth it probably - you may overload your cluster
>>>>>>>> 2. Not sure this one differs from 1. Looks the same to me but more
>>>>>>>> complex.
>>>>>>>> 3. The same as 1 and 2
>>>>>>>> 
>>>>>>>> Counting rows in an efficient way can be done if you sacrifice some
>>>>>>>> accuracy:
>>>>>>>> 
>>>>>>>> http://highscalability.com/blog/2012/4/5/big-data-counting-how-to-count-a-billion-distinct-objects-us.html
>>>>>>>> 
>>>>>>>> Yeah, you will need coprocessors for that.
>>>>>>>> 
>>>>>>>> Best regards,
>>>>>>>> Vladimir Rodionov
>>>>>>>> Principal Platform Engineer
>>>>>>>> Carrier IQ, www.carrieriq.com
>>>>>>>> e-mail: [email protected]
>>>>>>>> 
>>>>>>>> ________________________________________
>>>>>>>> From: James Birchfield [[email protected]]
>>>>>>>> Sent: Friday, September 20, 2013 3:50 PM
>>>>>>>> To: [email protected]
>>>>>>>> Subject: Re: HBase Table Row Count Optimization - A Solicitation For
>>>>> Help
>>>>>>>> 
>>>>>>>> Hadoop 2.0.0-cdh4.3.1
>>>>>>>> 
>>>>>>>> HBase 0.94.6-cdh4.3.1
>>>>>>>> 
>>>>>>>> 110 servers, 0 dead, 238.2364 average load
>>>>>>>> 
>>>>>>>> Some other info, not sure if it helps or not.
>>>>>>>> 
>>>>>>>> Configured Capacity: 1295277834158080 (1.15 PB)
>>>>>>>> Present Capacity: 1224692609430678 (1.09 PB)
>>>>>>>> DFS Remaining: 624376503857152 (567.87 TB)
>>>>>>>> DFS Used: 600316105573526 (545.98 TB)
>>>>>>>> DFS Used%: 49.02%
>>>>>>>> Under replicated blocks: 0
>>>>>>>> Blocks with corrupt replicas: 1
>>>>>>>> Missing blocks: 0
>>>>>>>> 
>>>>>>>> It is hitting a production cluster, but I am not really sure how to
>>>>>>>> calculate the load placed on the cluster.
>>>>>>>> On Sep 20, 2013, at 3:19 PM, Ted Yu <[email protected]> wrote:
>>>>>>>> 
>>>>>>>>> How many nodes do you have in your cluster ?
>>>>>>>>> 
>>>>>>>>> When counting rows, what other load would be placed on the cluster ?
>>>>>>>>> 
>>>>>>>>> What is the HBase version you're currently using / planning to use ?
>>>>>>>>> 
>>>>>>>>> Thanks
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On Fri, Sep 20, 2013 at 2:47 PM, James Birchfield <
>>>>>>>>> [email protected]> wrote:
>>>>>>>>> 
>>>>>>>>>>   After reading the documentation and scouring the mailing list
>>>>>>>>>> archives, I understand there is no real support for fast row counting
>>>>>>>>>> in HBase unless you build some sort of tracking logic into your code.
>>>>>>>>>> In our case, we do not have such logic, and have massive amounts of
>>>>>>>>>> data already persisted.  I am running into the issue of very long
>>>>>>>>>> execution of the RowCounter MapReduce job against very large tables
>>>>>>>>>> (multi-billion rows for many is our estimate).  I understand why this
>>>>>>>>>> issue exists and am slowly accepting it, but I am hoping I can solicit
>>>>>>>>>> some possible ideas to help speed things up a little.
>>>>>>>>>> 
>>>>>>>>>>   My current task is to provide total row counts on about 600
>>>>>>>>>> tables, some extremely large, some not so much.  Currently, I have a
>>>>>>>>>> process that executes the MapReduce job in process like so:
>>>>>>>>>> 
>>>>>>>>>>                   Job job = RowCounter.createSubmittableJob(
>>>>>>>>>>                           ConfigManager.getConfiguration(),
>>>>>>>>>>                           new String[]{tableName});
>>>>>>>>>>                   boolean waitForCompletion = job.waitForCompletion(true);
>>>>>>>>>>                   Counters counters = job.getCounters();
>>>>>>>>>>                   Counter rowCounter =
>>>>>>>>>>                           counters.findCounter(hbaseadminconnection.Counters.ROWS);
>>>>>>>>>>                   return rowCounter.getValue();
>>>>>>>>>> 
>>>>>>>>>>   At the moment, each MapReduce job is executed in serial order, so
>>>>>>>>>> counting one table at a time.  For the current implementation of this
>>>>>>>>>> whole process, as it stands right now, my rough timing calculations
>>>>>>>>>> indicate that fully counting all the rows of these 600 tables will
>>>>>>>>>> take anywhere between 11 and 22 days.  This is not what I consider a
>>>>>>>>>> desirable timeframe.
>>>>>>>>>> 
>>>>>>>>>>   I have considered three alternative approaches to speed things up.
>>>>>>>>>> 
>>>>>>>>>>   First, since the application is not heavily CPU bound, I could use
>>>>>>>>>> a ThreadPool and execute multiple MapReduce jobs at the same time,
>>>>>>>>>> looking at different tables.  I have never done this, so I am unsure
>>>>>>>>>> if this would cause any unanticipated side effects.
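As a rough illustration of that first option (a sketch only, not the author's code): a bounded client-side pool submits one counting task per table. The job submission itself is stubbed out with a placeholder, and the names `ParallelRowCounts` and `countRows` are hypothetical, so only the concurrency structure is shown.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelRowCounts {
    // Placeholder for submitting a RowCounter job and reading its ROWS counter;
    // a real version would call job.waitForCompletion(true) and return the count.
    static long countRows(String table) {
        return 1_000L;
    }

    // Count several tables concurrently with a bounded pool of client threads.
    static long totalRows(List<String> tables, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Long>> futures = new ArrayList<>();
            for (String t : tables) {
                futures.add(pool.submit(() -> countRows(t)));
            }
            long total = 0;
            for (Future<Long> f : futures) {
                total += f.get(); // propagates any per-table failure
            }
            return total;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(totalRows(List.of("t1", "t2", "t3"), 2));
    }
}
```

Note that each concurrent job still competes for the same cluster slots and region-server I/O, so the pool size would need to be kept small.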
>>>>>>>>>> 
>>>>>>>>>>   Second, I could distribute the processes.  I could find as many
>>>>>>>>>> machines as can successfully talk to the desired cluster properly,
>>>>>>>>>> give them a subset of tables to work on, and then combine the results
>>>>>>>>>> post process.
>>>>>>>>>> 
>>>>>>>>>>   Third, I could combine both of the above approaches and run a
>>>>>>>>>> distributed set of multithreaded processes to execute the MapReduce
>>>>>>>>>> jobs in parallel.
>>>>>>>>>> 
>>>>>>>>>>   Although it seems to have been asked and answered many times, I
>>>>>>>>>> will ask once again.  Without the need to change our current
>>>>>>>>>> configurations or restart the clusters, is there a faster approach to
>>>>>>>>>> obtain row counts?  FYI, my cache size for the Scan is set to 1000.
>>>>>>>>>> I have experimented with different numbers, but nothing made a
>>>>>>>>>> noticeable difference.  Any advice or feedback would be greatly
>>>>>>>>>> appreciated!
>>>>>>>>>> 
>>>>>>>>>> Thanks,
>>>>>>>>>> Birch
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>> 
>>>>> 
>>> 
>>> 
> 
