Sweet!  Thanks a lot Ted.  Like I said, I haven't looked at the code to try to 
determine whether I could understand any potential side effects of not requiring it. 
 But if it isn't detrimental to the speed, it would be nice to have it be optional 
for the case where you just don't care about, or don't even know, the column family 
makeup of the table.  Perhaps this is a use case specific to my particular usage, 
but an observation nonetheless.
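
For reference, the interim workaround mentioned below looks roughly like the
sketch here (the class name and structure are illustrative only, and it assumes
the 0.94 AggregationClient API): ask the admin client for the table's column
families, take the first one, and hand it to rowCount() via the Scan.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.client.coprocessor.AggregationClient;
    import org.apache.hadoop.hbase.client.coprocessor.LongColumnInterpreter;
    import org.apache.hadoop.hbase.util.Bytes;

    public final class FirstFamilyRowCount {
      public static long count(String tableName) throws Throwable {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);
        try {
          // The coprocessor client currently insists on exactly one column
          // family, so just take the first one the table descriptor reports.
          HColumnDescriptor[] families =
              admin.getTableDescriptor(Bytes.toBytes(tableName)).getColumnFamilies();
          Scan scan = new Scan();
          scan.addFamily(families[0].getName());
          // The counting itself runs in the AggregateImplementation endpoint on
          // the region servers; only per-region counts come back to the client.
          AggregationClient aggregationClient = new AggregationClient(conf);
          return aggregationClient.rowCount(
              Bytes.toBytes(tableName), new LongColumnInterpreter(), scan);
        } finally {
          admin.close();
        }
      }
    }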

Birch
On Sep 20, 2013, at 9:11 PM, Ted Yu <[email protected]> wrote:

> Thanks for the feedback.
> 
> I logged HBASE-9605 for relaxation of this requirement for row
> count aggregate.
> 
> 
> On Fri, Sep 20, 2013 at 8:46 PM, James Birchfield <
> [email protected]> wrote:
> 
>> Thanks.  I have been taking a look this evening.  We enabled the
>> Aggregation coprocessor and the AggregationClient works great.  I still
>> have to execute it with the 'hadoop jar' command, but I can live with
>> that.  When I try to run it in process, it just hangs.  I am not going to
>> fight it though.
>> 
>> The only thing I dislike about the AggregationClient is that it requires a
>> column family.  I was hoping to do this in a completely generic way,
>> without having any information about a table's column families, to get a row
>> count.  The provided implementation requires exactly one.  I was hoping
>> there was some sort of default column family always present on a
>> table, but it does not appear so.  I will look at the provided coprocessor
>> implementation, see why the family is required, and see whether it can be
>> made optional, and if so, what the performance penalty would be.  In the
>> meantime, I am just using the first column family returned from a query to
>> the admin client for a table.  It seems to work fine.
>> 
>> Thanks!
>> Birch
>> On Sep 20, 2013, at 8:41 PM, Ted Yu <[email protected]> wrote:
>> 
>>> HBase is open source. You can check out the repository and look at the
>>> source code.
>>> 
>>> $ svn info
>>> Path: .
>>> URL: http://svn.apache.org/repos/asf/hbase/branches/0.94
>>> Repository Root: http://svn.apache.org/repos/asf
>>> Repository UUID: 13f79535-47bb-0310-9956-ffa450edef68
>>> Revision: 1525061
>>> 
>>> 
>>> On Fri, Sep 20, 2013 at 6:46 PM, James Birchfield <
>>> [email protected]> wrote:
>>> 
>>>> Ted,
>>>> 
>>>>       My apologies if I am being thick, but I am looking at the API
>> docs
>>>> here: http://hbase.apache.org/apidocs/index.html and I do not see that
>>>> package.  And the coprocessor package only contains an exception.
>>>> 
>>>>       Ok, weird.  Those classes do not show up through normal
>> navigation
>>>> from that link, however, the documentation does exist if I google for it
>>>> directly.  Maybe the javadocs need to be regenerated???  Dunno, but I
>> will
>>>> check it out.
>>>> 
>>>> Birch
>>>> 
>>>> On Sep 20, 2013, at 6:32 PM, Ted Yu <[email protected]> wrote:
>>>> 
>>>>> Please take a look at the javadoc for
>>>>> src/main/java/org/apache/hadoop/hbase/client/coprocessor/AggregationClient.java
>>>>> 
>>>>> As long as the machine can reach your HBase cluster, you should be able
>>>> to
>>>>> run AggregationClient and utilize the AggregateImplementation endpoint
>> in
>>>>> the region servers.
>>>>> 
>>>>> Cheers
>>>>> 
>>>>> 
>>>>> On Fri, Sep 20, 2013 at 6:26 PM, James Birchfield <
>>>>> [email protected]> wrote:
>>>>> 
>>>>>> Thanks Ted.
>>>>>> 
>>>>>> That was the direction I have been working towards as I am learning
>>>>>> today.  Much appreciation for all the replies to this thread.
>>>>>> 
>>>>>> Whether I keep the MapReduce job or utilize the Aggregation
>> coprocessor
>>>>>> (which it is turning out should be possible for me here), I need
>> to
>>>>>> make sure I am running the client in an efficient manner.  Lars may
>> have
>>>>>> hit upon the core problem.  I am not running the map reduce job on the
>>>>>> cluster, but rather from a standalone remote Java client executing
>> the
>>>> job
>>>>>> in process.  This may very well turn out to be the number one issue.
>> I
>>>>>> would love it if this turns out to be true.  Would make this a great
>>>>>> learning lesson for me as a relative newcomer to working with HBase,
>> and
>>>>>> potentially allow me to finish this initial task much quicker than I
>> was
>>>>>> thinking.
>>>>>> 
>>>>>> So assuming the MapReduce jobs need to be run on the cluster instead
>> of
>>>>>> locally, does a coprocessor endpoint client need to be run the same,
>> or
>>>> is
>>>>>> it safe to run it on a remote machine since the work gets distributed
>>>> out
>>>>>> to the region servers?  Just wondering if I would run into the same
>>>> issues
>>>>>> if what I said above holds true.
>>>>>> 
>>>>>> Thanks!
>>>>>> Birch
>>>>>> On Sep 20, 2013, at 6:17 PM, Ted Yu <[email protected]> wrote:
>>>>>> 
>>>>>>> In 0.94, we have AggregateImplementation, an endpoint coprocessor,
>>>> which
>>>>>>> implements getRowNum().
>>>>>>> 
>>>>>>> Example is in AggregationClient.java
>>>>>>> 
>>>>>>> Cheers
>>>>>>> 
>>>>>>> 
>>>>>>> On Fri, Sep 20, 2013 at 6:09 PM, lars hofhansl <[email protected]>
>>>> wrote:
>>>>>>> 
>>>>>>>> From your numbers below you have about 26k regions, thus each region is
>>>>>>>> about 545 TB / 26k = 20 GB. Good.
>>>>>>>> 
>>>>>>>> How many mappers are you running?
>>>>>>>> And just to rule out the obvious, the M/R is running on the cluster
>>>> and
>>>>>>>> not locally, right? (it will default to a local runner when it
>> cannot
>>>>>> use
>>>>>>>> the M/R cluster).
>>>>>>>> 
>>>>>>>> Some back-of-the-envelope calculations tell me that, assuming 1GbE network
>>>>>>>> cards, the best you can expect for 110 machines to map through this data is
>>>>>>>> about 10h (so way faster than what you see).
>>>>>>>> (545 TB / (110 * 1/8 GB/s) ~ 40k s ~ 11h)
>>>>>>>> 
>>>>>>>> 
>>>>>>>> We should really add a rowcounting coprocessor to HBase and allow
>>>> using
>>>>>> it
>>>>>>>> via M/R.
>>>>>>>> 
>>>>>>>> -- Lars
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> ________________________________
>>>>>>>> From: James Birchfield <[email protected]>
>>>>>>>> To: [email protected]
>>>>>>>> Sent: Friday, September 20, 2013 5:09 PM
>>>>>>>> Subject: Re: HBase Table Row Count Optimization - A Solicitation For
>>>>>> Help
>>>>>>>> 
>>>>>>>> 
>>>>>>>> I did not implement accurate timing, but the current table being
>>>> counted
>>>>>>>> has been running for about 10 hours, and the log is estimating the
>> map
>>>>>>>> portion at 10%
>>>>>>>> 
>>>>>>>> 2013-09-20 23:40:24,099 INFO  [main] Job
>> :
>>>>>> map
>>>>>>>> 10% reduce 0%
>>>>>>>> 
>>>>>>>> So a loooong time.  Like I mentioned, we have billions, if not
>>>> trillions
>>>>>>>> of rows potentially.
>>>>>>>> 
>>>>>>>> Thanks for the feedback on the approaches I mentioned.  I was not
>> sure
>>>>>> if
>>>>>>>> they would have any effect overall.
>>>>>>>> 
>>>>>>>> I will look further into coprocessors.
>>>>>>>> 
>>>>>>>> Thanks!
>>>>>>>> Birch
>>>>>>>> On Sep 20, 2013, at 4:58 PM, Vladimir Rodionov <
>>>> [email protected]
>>>>>>> 
>>>>>>>> wrote:
>>>>>>>> 
>>>>>>>>> How long does it take for the RowCounter job on the largest table to
>>>>>>>>> finish on your cluster?
>>>>>>>>> 
>>>>>>>>> Just curious.
>>>>>>>>> 
>>>>>>>>> On your options:
>>>>>>>>> 
>>>>>>>>> 1. Not worth it probably - you may overload your cluster
>>>>>>>>> 2. Not sure this one differs from 1. Looks the same to me but more
>>>>>>>> complex.
>>>>>>>>> 3. The same as 1 and 2
>>>>>>>>> 
>>>>>>>>> Counting rows in an efficient way can be done if you sacrifice some
>>>>>>>>> accuracy:
>>>>>>>>> 
>>>>>>>>> http://highscalability.com/blog/2012/4/5/big-data-counting-how-to-count-a-billion-distinct-objects-us.html
>>>>>>>>> 
>>>>>>>>> Yeah, you will need coprocessors for that.
>>>>>>>>> 
>>>>>>>>> Best regards,
>>>>>>>>> Vladimir Rodionov
>>>>>>>>> Principal Platform Engineer
>>>>>>>>> Carrier IQ, www.carrieriq.com
>>>>>>>>> e-mail: [email protected]
>>>>>>>>> 
>>>>>>>>> ________________________________________
>>>>>>>>> From: James Birchfield [[email protected]]
>>>>>>>>> Sent: Friday, September 20, 2013 3:50 PM
>>>>>>>>> To: [email protected]
>>>>>>>>> Subject: Re: HBase Table Row Count Optimization - A Solicitation
>> For
>>>>>> Help
>>>>>>>>> 
>>>>>>>>> Hadoop 2.0.0-cdh4.3.1
>>>>>>>>> 
>>>>>>>>> HBase 0.94.6-cdh4.3.1
>>>>>>>>> 
>>>>>>>>> 110 servers, 0 dead, 238.2364 average load
>>>>>>>>> 
>>>>>>>>> Some other info, not sure if it helps or not.
>>>>>>>>> 
>>>>>>>>> Configured Capacity: 1295277834158080 (1.15 PB)
>>>>>>>>> Present Capacity: 1224692609430678 (1.09 PB)
>>>>>>>>> DFS Remaining: 624376503857152 (567.87 TB)
>>>>>>>>> DFS Used: 600316105573526 (545.98 TB)
>>>>>>>>> DFS Used%: 49.02%
>>>>>>>>> Under replicated blocks: 0
>>>>>>>>> Blocks with corrupt replicas: 1
>>>>>>>>> Missing blocks: 0
>>>>>>>>> 
>>>>>>>>> It is hitting a production cluster, but I am not really sure how to
>>>>>>>> calculate the load placed on the cluster.
>>>>>>>>> On Sep 20, 2013, at 3:19 PM, Ted Yu <[email protected]> wrote:
>>>>>>>>> 
>>>>>>>>>> How many nodes do you have in your cluster ?
>>>>>>>>>> 
>>>>>>>>>> When counting rows, what other load would be placed on the
>> cluster ?
>>>>>>>>>> 
>>>>>>>>>> What is the HBase version you're currently using / planning to
>> use ?
>>>>>>>>>> 
>>>>>>>>>> Thanks
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> On Fri, Sep 20, 2013 at 2:47 PM, James Birchfield <
>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>> 
>>>>>>>>>>>   After reading the documentation and scouring the mailing list
>>>>>>>>>>> archives, I understand there is no real support for fast row
>>>> counting
>>>>>>>> in
>>>>>>>>>>> HBase unless you build some sort of tracking logic into your
>> code.
>>>>>> In
>>>>>>>> our
>>>>>>>>>>> case, we do not have such logic, and have massive amounts of data
>>>>>>>> already
>>>>>>>>>>> persisted.  I am running into the issue of very long execution of
>>>> the
>>>>>>>>>>> RowCounter MapReduce job against very large tables (multi-billion rows
>>>>>>>>>>> for many is our estimate).  I understand why this issue exists and am
>> slowly
>>>>>>>>>>> accepting it, but I am hoping I can solicit some possible ideas
>> to
>>>>>> help
>>>>>>>>>>> speed things up a little.
>>>>>>>>>>> 
>>>>>>>>>>>   My current task is to provide total row counts on about 600
>>>>>>>>>>> tables, some extremely large, some not so much.  Currently, I have a
>>>>>>>>>>> process that executes the MapReduce job in process like so:
>>>>>>>>>>> 
>>>>>>>>>>>     Job job = RowCounter.createSubmittableJob(
>>>>>>>>>>>         ConfigManager.getConfiguration(), new String[] { tableName });
>>>>>>>>>>>     boolean waitForCompletion = job.waitForCompletion(true);
>>>>>>>>>>>     Counters counters = job.getCounters();
>>>>>>>>>>>     Counter rowCounter =
>>>>>>>>>>>         counters.findCounter(hbaseadminconnection.Counters.ROWS);
>>>>>>>>>>>     return rowCounter.getValue();
>>>>>>>>>>> 
>>>>>>>>>>>   At the moment, each MapReduce job is executed in serial order,
>>>>>> so
>>>>>>>>>>> counting one table at a time.  For the current implementation of
>>>> this
>>>>>>>> whole
>>>>>>>>>>> process, as it stands right now, my rough timing calculations
>>>>>> indicate
>>>>>>>> that
>>>>>>>>>>> fully counting all the rows of these 600 tables will take
>> anywhere
>>>>>>>> between
>>>>>>>>>>> 11 and 22 days.  This is not what I consider a desirable
>> timeframe.
>>>>>>>>>>> 
>>>>>>>>>>>   I have considered three alternative approaches to speed things
>>>>>>>> up.
>>>>>>>>>>> 
>>>>>>>>>>>   First, since the application is not heavily CPU bound, I could use
>>>>>>>>>>> a ThreadPool and execute multiple MapReduce jobs at the same time, each
>>>>>>>>>>> looking at a different table (a rough sketch of this is below, after the
>>>>>>>>>>> third option).  I have never done this, so I am unsure if this would
>>>>>>>>>>> cause any unanticipated side effects.
>>>>>>>>>>> 
>>>>>>>>>>>   Second, I could distribute the processes.  I could find as many
>>>>>>>>>>> machines as can successfully talk to the desired cluster, give each of
>>>>>>>>>>> them a subset of tables to work on, and then combine the results in a
>>>>>>>>>>> post process.
>>>>>>>>>>> 
>>>>>>>>>>>   Third, I could combine both of the above approaches and run a
>>>>>>>>>>> distributed set of multithreaded processes to execute the MapReduce
>>>>>>>>>>> jobs in parallel.
>>>>>>>>>>> 
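>>>>>>>>>>>   For the first option, a rough, untested sketch of what I have in
>>>>>>>>>>> mind (given the list of table names, and reusing the same calls as the
>>>>>>>>>>> serial version above; the pool size is arbitrary):
>>>>>>>>>>> 
>>>>>>>>>>>     // needs java.util.* and java.util.concurrent.* imports
>>>>>>>>>>>     ExecutorService pool = Executors.newFixedThreadPool(4);
>>>>>>>>>>>     List<Future<Long>> counts = new ArrayList<Future<Long>>();
>>>>>>>>>>>     for (final String tableName : tableNames) {
>>>>>>>>>>>       counts.add(pool.submit(new Callable<Long>() {
>>>>>>>>>>>         public Long call() throws Exception {
>>>>>>>>>>>           // each thread submits and waits on its own RowCounter job
>>>>>>>>>>>           Job job = RowCounter.createSubmittableJob(
>>>>>>>>>>>               ConfigManager.getConfiguration(),
>>>>>>>>>>>               new String[] { tableName });
>>>>>>>>>>>           job.waitForCompletion(true);
>>>>>>>>>>>           return job.getCounters()
>>>>>>>>>>>               .findCounter(hbaseadminconnection.Counters.ROWS)
>>>>>>>>>>>               .getValue();
>>>>>>>>>>>         }
>>>>>>>>>>>       }));
>>>>>>>>>>>     }
>>>>>>>>>>> 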
>>>>>>>>>>>   Although it seems to have been asked and answered many times,
>> I
>>>>>>>>>>> will ask once again.  Without the need to change our current
>>>>>>>> configurations
>>>>>>>>>>> or restart the clusters, is there a faster approach to obtain row
>>>>>>>> counts?
>>>>>>>>>>> FYI, my cache size for the Scan is set to 1000.  I have
>>>> experimented
>>>>>>>> with
>>>>>>>>>>> different numbers, but nothing made a noticeable difference.  Any
>>>>>>>> advice or
>>>>>>>>>>> feedback would be greatly appreciated!
>>>>>>>>>>> 
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Birch
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>> 
>>>>>> 
>>>> 
>>>> 
>> 
>> 
