Re: Coprocessors

Michael Segel Thu, 25 Apr 2013 19:43:32 -0700

Hi,

Let's reiterate what you've said...


You have a set of objects <O1, O2, ..., On> and a single field type <F1>, 
where F1 is part of your composite key. You want to fetch back a set of 
rows and then do some aggregation on the attributes. 


There was a similar discussion on this where someone had a random set of values 
and was having performance issues. 

If your set of objects is in sort order and you have only one field type <F1> 
you should be able to do the multi-gets. 

Are you currently using multi-gets?
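
To make the sorted multi-get idea concrete, here is a minimal sketch in Python, not HBase client code. It assumes rowkeys are the UTF-8 bytes of "object-id", a "-" separator, and "field-type" (the O1-F1 layout described in the quoted message). The names build_rowkeys and batches are hypothetical, and the batch size is arbitrary. It only prepares the key list; issuing the actual multi-get is left to whichever client is in use.

```python
def build_rowkeys(object_ids, field_type, sep="-"):
    """Return sorted, de-duplicated composite rowkeys for one field type."""
    # Assumption: keys are plain UTF-8 strings such as b"O1-F1".
    keys = {f"{obj}{sep}{field_type}".encode("utf-8") for obj in object_ids}
    # HBase orders rows by raw byte comparison, so sort the encoded bytes,
    # not the original strings.
    return sorted(keys)


def batches(keys, size):
    """Yield consecutive slices of the sorted key list, one per multi-get."""
    for start in range(0, len(keys), size):
        yield keys[start:start + size]


if __name__ == "__main__":
    rowkeys = build_rowkeys(["O3", "O1", "O2", "O1"], "F1")
    for batch in batches(rowkeys, 2):
        print(batch)
```

Keeping the keys in byte order means each batch tends to touch neighbouring rows, which is the property the sort-order remark above relies on.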



On Apr 25, 2013, at 5:36 PM, Sudarshan Kadambi (BLOOMBERG/ 731 LEXIN) 
<[email protected]> wrote:

> Michael: Fair enough. Let me see what relevant information I can add to what 
> I've already said:
> 
> 1. To Lars' point, my 250K keys are unlikely to fall into fewer than 250K 
> sub-ranges.
> 2. Here's a bit more about my schema:
> 2.1 My rowkeys are composed of 2 entities - let's call them object-id and 
> field-type. An object (O1) has 100s of field types (F1,F2,F3...). Each 
> object-id - field-type pair has 100s of attributes (A1,A2,A3...). 
> 2.2 My rowkeys are O1-F1, O1-F2, O1-F3, etc.
> 2.3 My primary application (not the one my original post was about) accesses 
> by these rowkeys.
> 2.4 My application that does aggregation is given a bunch of objects <O1, O2, 
> O3>, a field-type <F1>, a bunch of attributes <A1,A2> and some computation to 
> perform.
> 2.5 As you can see, scans are unlikely to be useful when fetching O1-F1, 
> O2-F1, O3-F1 etc.
> 
> Viral: How do I tackle aggregation using observers? Let's say I override the 
> postGet method. I do a multi-get from my client and my method gets called on 
> each region server for each row. What is the next step with this approach?
> 
> 
> ----- Original Message -----
> From: [email protected]
> To: [email protected], [email protected]
> Cc: Sudarshan Kadambi (BLOOMBERG/ 731 LEXIN)
> At: Apr 25 2013 18:12:46
> 
> I don't think Phoenix will solve his problem. 
> 
> He also needs to explain more about his problem before we can start to think 
> about the problem. 
> 
> 
> On Apr 25, 2013, at 4:54 PM, lars hofhansl <[email protected]> wrote:
> 
>> You might want to have a look at Phoenix 
>> (https://github.com/forcedotcom/phoenix), which does that and more, and 
>> gives a SQL/JDBC interface.
>> 
>> -- Lars
>> 
>> 
>> 
>> ________________________________
>> From: Sudarshan Kadambi (BLOOMBERG/ 731 LEXIN) <[email protected]>
>> To: [email protected] 
>> Sent: Thursday, April 25, 2013 2:44 PM
>> Subject: Coprocessors
>> 
>> 
>> Folks:
>> 
>> This is my first post on the HBase user mailing list. 
>> 
>> I have the following scenario:
>> I have an HBase table of up to a billion keys. I'm looking to support an 
>> application where on some user action, I'd need to fetch multiple columns 
>> for up to 250K keys and do some sort of aggregation on them. Fetching all 
>> that data and doing the aggregation in my application takes about a minute.
>> 
>> I'm looking to co-locate the aggregation logic with the region servers to
>> a. Distribute the aggregation
>> b. Avoid having to fetch large amounts of data over the network (this could 
>> potentially be cross-datacenter)
>> 
>> Neither observers nor aggregation endpoints work for this use case. 
>> Observers don't return data back to the client, while aggregation endpoints 
>> work in the context of scans, not multi-gets (are these correct 
>> assumptions?).
>> 
>> I'm looking to write a service that runs alongside the region servers and 
>> acts as a proxy between my application and the region servers. 
>> 
>> I plan to use the logic in HBase client's HConnectionManager, to segment my 
>> request of 1M rowkeys into sub-requests per region-server. These are sent 
>> over to the proxy which fetches the data from the region server, aggregates 
>> locally and sends data back. Does this sound reasonable or even a useful 
>> thing to pursue?
>> 
>> Regards,
>> -sudarshan
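
The proxy plan in the original post (split the requested rowkeys per region server, aggregate next to the data, return only partial results) can be sketched roughly as follows. This is an illustration in plain Python under stated assumptions, not the HConnectionManager logic: the region start keys are invented, a region is identified only by its start key, and rows are modelled as dicts of attribute values. The helpers group_by_region, partial_sum and combine are hypothetical names.

```python
from bisect import bisect_right
from collections import defaultdict


def group_by_region(rowkeys, region_start_keys):
    """Map each rowkey to the region whose [start, next_start) range holds it.

    region_start_keys must be sorted and begin with b"" (the open start).
    """
    groups = defaultdict(list)
    for key in rowkeys:
        # The last start key <= key identifies the owning region.
        idx = bisect_right(region_start_keys, key) - 1
        groups[region_start_keys[idx]].append(key)
    return dict(groups)


def partial_sum(rows, attribute):
    """Reduce local rows to (sum, count); this part would run near the data."""
    values = [row[attribute] for row in rows if attribute in row]
    return sum(values), len(values)


def combine(partials):
    """Merge the per-region (sum, count) pairs on the client side."""
    return sum(s for s, _ in partials), sum(c for _, c in partials)


if __name__ == "__main__":
    starts = [b"", b"O2", b"O5"]  # made-up region boundaries
    keys = [b"O1-F1", b"O2-F1", b"O3-F1", b"O7-F1"]
    print(group_by_region(keys, starts))
    print(combine([partial_sum([{"A1": 2.0}, {"A1": 4.0}], "A1"),
                   partial_sum([{"A1": 3.0}], "A1")]))
```

Returning a (sum, count) pair per region rather than the rows themselves is what keeps the cross-network payload small. Aggregations that cannot be merged from such partials (a median, for instance) would need a different intermediate form.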
