Excellent! Will do!
Birch

On Sep 20, 2013, at 6:32 PM, Ted Yu <[email protected]> wrote:

Please take a look at the javadoc for
src/main/java/org/apache/hadoop/hbase/client/coprocessor/AggregationClient.java

As long as the machine can reach your HBase cluster, you should be able to
run AggregationClient and utilize the AggregateImplementation endpoint in
the region servers.

Cheers
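For reference, the endpoint-based count Ted describes boils down to something
like the sketch below. This is only a minimal illustration against the 0.94
client API; it assumes the AggregateImplementation coprocessor is already
loaded on the table's region servers (it is not enabled by default in 0.94),
and the table and column family names ("mytable", "cf") are placeholders.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.client.coprocessor.AggregationClient;
    import org.apache.hadoop.hbase.client.coprocessor.LongColumnInterpreter;
    import org.apache.hadoop.hbase.util.Bytes;

    public class EndpointRowCount {
      public static void main(String[] args) throws Throwable {
        // Plain client configuration; the client only needs to reach the
        // cluster (ZooKeeper quorum). The counting happens in the region servers.
        Configuration conf = HBaseConfiguration.create();

        // Placeholder table and column family names.
        byte[] tableName = Bytes.toBytes("mytable");
        Scan scan = new Scan();
        scan.addFamily(Bytes.toBytes("cf")); // the 0.94 aggregation client expects a single family on the scan

        AggregationClient aggregationClient = new AggregationClient(conf);
        // Each region computes its own count via AggregateImplementation's
        // getRowNum(); the client merely sums the per-region results.
        long rows = aggregationClient.rowCount(tableName, new LongColumnInterpreter(), scan);
        System.out.println("row count: " + rows);
      }
    }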
On Fri, Sep 20, 2013 at 6:26 PM, James Birchfield
<[email protected]> wrote:

Thanks Ted.

That was the direction I have been working towards as I am learning today.
Much appreciation for all the replies to this thread.

Whether I keep the MapReduce job or utilize the Aggregation coprocessor
(which it is turning out should be possible for me here), I need to make
sure I am running the client in an efficient manner. Lars may have hit upon
the core problem. I am not running the MapReduce job on the cluster, but
rather from a standalone remote Java client executing the job in process.
This may very well turn out to be the number one issue. I would love it if
this turns out to be true. It would make this a great learning lesson for me
as a relative newcomer to working with HBase, and potentially allow me to
finish this initial task much quicker than I was thinking.

So assuming the MapReduce jobs need to be run on the cluster instead of
locally, does a coprocessor endpoint client need to be run the same way, or
is it safe to run it on a remote machine since the work gets distributed out
to the region servers? Just wondering if I would run into the same issues if
what I said above holds true.

Thanks!
Birch

On Sep 20, 2013, at 6:17 PM, Ted Yu <[email protected]> wrote:

In 0.94, we have AggregateImplementation, an endpoint coprocessor, which
implements getRowNum().

An example is in AggregationClient.java.

Cheers

On Fri, Sep 20, 2013 at 6:09 PM, lars hofhansl <[email protected]> wrote:

From your numbers below you have about 26k regions, thus each region is
about 545tb/26k = 20gb. Good.

How many mappers are you running?
And just to rule out the obvious, the M/R is running on the cluster and not
locally, right? (It will default to a local runner when it cannot use the
M/R cluster.)

Some back-of-the-envelope calculations tell me that, assuming 1ge network
cards, the best you can expect for 110 machines to map through this data is
about 10h (so way faster than what you see):
545tb / (110 * 1/8 gb/s) ~ 40ks ~ 11h

We should really add a row-counting coprocessor to HBase and allow using it
via M/R.

-- Lars
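To make Lars's "local runner" check concrete: one way to see which runner a
submitted job would use is to inspect the job's configuration before calling
waitForCompletion. This is only a sketch; which property matters depends on
whether the cluster uses MR1 or YARN, and the table name below is a
placeholder.

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.mapreduce.RowCounter;
    import org.apache.hadoop.mapreduce.Job;

    public class CheckJobRunner {
      public static void main(String[] args) throws Exception {
        // "someTable" is a placeholder; any existing table works for this check.
        Job job = RowCounter.createSubmittableJob(
            HBaseConfiguration.create(), new String[] { "someTable" });

        // MR1 (classic MapReduce): a value of "local" means LocalJobRunner,
        // i.e. the mappers run inside this JVM instead of on the cluster.
        System.out.println("mapred.job.tracker       = "
            + job.getConfiguration().get("mapred.job.tracker"));

        // YARN / MR2 equivalent: "local" vs. "yarn".
        System.out.println("mapreduce.framework.name = "
            + job.getConfiguration().get("mapreduce.framework.name"));
      }
    }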
________________________________
From: James Birchfield <[email protected]>
To: [email protected]
Sent: Friday, September 20, 2013 5:09 PM
Subject: Re: HBase Table Row Count Optimization - A Solicitation For Help

I did not implement accurate timing, but the current table being counted has
been running for about 10 hours, and the log is estimating the map portion
at 10%:

2013-09-20 23:40:24,099 INFO [main] Job: map 10% reduce 0%

So a loooong time. Like I mentioned, we have billions, if not trillions, of
rows potentially.

Thanks for the feedback on the approaches I mentioned. I was not sure if
they would have any effect overall.

I will look further into coprocessors.

Thanks!
Birch

On Sep 20, 2013, at 4:58 PM, Vladimir Rodionov <[email protected]> wrote:

How long does it take for the RowCounter job on your largest table to finish
on your cluster?

Just curious.

On your options:

1. Probably not worth it - you may overload your cluster.
2. Not sure this one differs from 1. Looks the same to me, but more complex.
3. The same as 1 and 2.

Counting rows in an efficient way can be done if you sacrifice some accuracy:

http://highscalability.com/blog/2012/4/5/big-data-counting-how-to-count-a-billion-distinct-objects-us.html

Yeah, you will need coprocessors for that.

Best regards,
Vladimir Rodionov
Principal Platform Engineer
Carrier IQ, www.carrieriq.com
e-mail: [email protected]

________________________________________
From: James Birchfield [[email protected]]
Sent: Friday, September 20, 2013 3:50 PM
To: [email protected]
Subject: Re: HBase Table Row Count Optimization - A Solicitation For Help

Hadoop 2.0.0-cdh4.3.1
HBase 0.94.6-cdh4.3.1
110 servers, 0 dead, 238.2364 average load

Some other info, not sure if it helps or not:

Configured Capacity: 1295277834158080 (1.15 PB)
Present Capacity: 1224692609430678 (1.09 PB)
DFS Remaining: 624376503857152 (567.87 TB)
DFS Used: 600316105573526 (545.98 TB)
DFS Used%: 49.02%
Under replicated blocks: 0
Blocks with corrupt replicas: 1
Missing blocks: 0

It is hitting a production cluster, but I am not really sure how to
calculate the load placed on the cluster.

On Sep 20, 2013, at 3:19 PM, Ted Yu <[email protected]> wrote:

How many nodes do you have in your cluster?

When counting rows, what other load would be placed on the cluster?

What is the HBase version you're currently using / planning to use?

Thanks

On Fri, Sep 20, 2013 at 2:47 PM, James Birchfield
<[email protected]> wrote:
After reading the documentation and scouring the mailing list archives, I
understand there is no real support for fast row counting in HBase unless
you build some sort of tracking logic into your code. In our case, we do not
have such logic, and we have massive amounts of data already persisted. I am
running into the issue of very long execution of the RowCounter MapReduce
job against very large tables (multi-billion rows for many is our estimate).
I understand why this issue exists and am slowly accepting it, but I am
hoping I can solicit some possible ideas to help speed things up a little.

My current task is to provide total row counts on about 600 tables, some
extremely large, some not so much. Currently, I have a process that executes
the MapReduce job in process like so:

    // Submit the stock RowCounter job for one table, block until it finishes,
    // then read the ROWS counter from the completed job.
    Job job = RowCounter.createSubmittableJob(
        ConfigManager.getConfiguration(), new String[] { tableName });
    boolean waitForCompletion = job.waitForCompletion(true);
    Counters counters = job.getCounters();
    Counter rowCounter = counters.findCounter(hbaseadminconnection.Counters.ROWS);
    return rowCounter.getValue();

At the moment, each MapReduce job is executed in serial order, so I am
counting one table at a time. For the current implementation of this whole
process, as it stands right now, my rough timing calculations indicate that
fully counting all the rows of these 600 tables will take anywhere between
11 and 22 days. This is not what I consider a desirable timeframe.

I have considered three alternative approaches to speed things up.

First, since the application is not heavily CPU bound, I could use a
ThreadPool and execute multiple MapReduce jobs at the same time, looking at
different tables (a rough sketch of this idea follows at the end of this
message). I have never done this, so I am unsure if this would cause any
unanticipated side effects.

Second, I could distribute the processes. I could find as many machines as
can successfully talk to the desired cluster, give them a subset of tables
to work on, and then combine the results post process.

Third, I could combine both of the above approaches and run a distributed
set of multithreaded processes to execute the MapReduce jobs in parallel.

Although it seems to have been asked and answered many times, I will ask
once again. Without the need to change our current configurations or restart
the clusters, is there a faster approach to obtaining row counts? FYI, my
cache size for the Scan is set to 1000. I have experimented with different
numbers, but nothing made a noticeable difference. Any advice or feedback
would be greatly appreciated!

Thanks,
Birch
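On the first option above (a thread pool driving several RowCounter jobs at
once): since each worker thread mostly just blocks waiting on the cluster, a
small fixed pool is enough. The following is only a rough sketch; it assumes
the jobs actually run on the cluster (not the local runner), and the table
list, pool size, and the string-based counter lookup are illustrative - the
existing findCounter(...ROWS) call from the snippet above can be used instead.

    import java.util.Arrays;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.mapreduce.RowCounter;
    import org.apache.hadoop.mapreduce.Job;

    public class ParallelRowCounts {

      public static void main(String[] args) throws Exception {
        // Placeholder list; in practice this would be the full set of ~600 tables.
        List<String> tables = Arrays.asList("table_a", "table_b", "table_c");

        // A small pool: each worker only submits a job and waits for it, so the
        // real work (and the real load) stays on the cluster, not on this client.
        ExecutorService pool = Executors.newFixedThreadPool(4);
        Map<String, Future<Long>> pending = new HashMap<String, Future<Long>>();

        for (final String table : tables) {
          pending.put(table, pool.submit(new Callable<Long>() {
            public Long call() throws Exception {
              Configuration conf = HBaseConfiguration.create();
              Job job = RowCounter.createSubmittableJob(conf, new String[] { table });
              if (!job.waitForCompletion(true)) {
                throw new IllegalStateException("RowCounter failed for " + table);
              }
              // The ROWS counter is defined in RowCounter's (package-private) mapper
              // enum, so it is looked up here by group/counter name; reuse your own
              // findCounter(...ROWS) reference if you already have one that works.
              return job.getCounters().findCounter(
                  "org.apache.hadoop.hbase.mapreduce.RowCounter$RowCounterMapper$Counters",
                  "ROWS").getValue();
            }
          }));
        }

        for (Map.Entry<String, Future<Long>> entry : pending.entrySet()) {
          System.out.println(entry.getKey() + " = " + entry.getValue().get());
        }
        pool.shutdown();
      }
    }

How much this actually helps depends on how many concurrent map slots the
cluster can spare; the second and third options in the original post are the
same pattern spread across more client machines.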