Thanks for the additional info, Sudarshan. This would fit well with the implementation of Phoenix's skip scan.

CREATE TABLE t (
    object_id INTEGER NOT NULL,
    field_type INTEGER NOT NULL,
    attrib_id INTEGER NOT NULL,
    value BIGINT,
    CONSTRAINT pk PRIMARY KEY (object_id, field_type, attrib_id));

SELECT count(value), sum(value), avg(value) FROM t
WHERE object_id IN (?,?,?) AND field_type IN (?,?,?) AND attrib_id IN (?,?,?)

and then your client would do whatever additional computation it needed on the results it got back.
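
A minimal sketch of that client side, assuming the Phoenix driver jar is on the 
classpath and "localhost" stands in for your ZooKeeper quorum (the bind values 
here are placeholders):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class PhoenixAggQuery {
    public static void main(String[] args) throws Exception {
        // Phoenix JDBC URLs have the form "jdbc:phoenix:<zookeeper quorum>".
        Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost");
        PreparedStatement stmt = conn.prepareStatement(
            "SELECT count(value), sum(value), avg(value) FROM t "
          + "WHERE object_id IN (?,?,?) AND field_type IN (?,?,?) "
          + "AND attrib_id IN (?,?,?)");

        // Placeholder bind values; in practice these come from the user request.
        int[] binds = {1, 2, 3, 10, 11, 12, 100, 101, 102};
        for (int i = 0; i < binds.length; i++) {
            stmt.setInt(i + 1, binds[i]);
        }

        ResultSet rs = stmt.executeQuery();
        if (rs.next()) {
            long count = rs.getLong(1);
            long sum = rs.getLong(2);
            double avg = rs.getDouble(3);
            // ... any additional client-side computation on count/sum/avg ...
        }
        rs.close();
        stmt.close();
        conn.close();
    }
}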

Would that fit with what you're trying to do?

    James

On 04/25/2013 03:36 PM, Sudarshan Kadambi (BLOOMBERG/ 731 LEXIN) wrote:
Michael: Fair enough. Let me see what relevant information I can add to what 
I've already said:

1. To Lars' point, my 250K keys are unlikely to fall into fewer than 250K 
sub-ranges.
2. Here's a bit more about my schema:
  2.1 My rowkeys are composed of 2 entities - let's call them object-id and 
field-type. An object (O1) has 100s of field types (F1, F2, F3, ...). Each 
object-id/field-type pair has 100s of attributes (A1, A2, A3, ...).
  2.2 My rowkeys are O1-F1, O1-F2, O1-F3, etc.
  2.3 My primary application (not the one my original post was about) accesses 
by these rowkeys.
  2.4 My application that does aggregation is given a bunch of objects <O1, O2, O3>, a 
field-type <F1>, a bunch of attributes <A1,A2> and some computation to perform.
  2.5 As you can see, scans are unlikely to be useful when fetching O1-F1, 
O2-F1, O3-F1, etc. (a sketch of that multi-get follows right after this list).
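
To make 2.4/2.5 concrete, here's a minimal sketch of what that fetch looks like 
today with a plain multi-get against the 0.94-era HBase client API. The table 
name "t", the column family "cf", and the "-" separator in the rowkey are 
assumptions for illustration:

import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class MultiGetFetch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "t");

        String[] objectIds = {"O1", "O2", "O3"};  // objects from the request
        String fieldType = "F1";                  // single field-type
        String[] attribs = {"A1", "A2"};          // attributes as column qualifiers (assumed)

        // One Get per object-id/field-type rowkey, fetching only the requested attributes.
        List<Get> gets = new ArrayList<Get>();
        for (String objectId : objectIds) {
            Get get = new Get(Bytes.toBytes(objectId + "-" + fieldType));
            for (String attrib : attribs) {
                get.addColumn(Bytes.toBytes("cf"), Bytes.toBytes(attrib));
            }
            gets.add(get);
        }

        // The client then does the aggregation over these results itself.
        Result[] results = table.get(gets);
        table.close();
    }
}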

Viral: How do I tackle aggregation using observers? Let's say I override the 
postGet method. I do a multi-get from my client and my method gets called on 
each region server for each row. What is the next step with this approach?
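
(To illustrate the premise of my question, a rough sketch of that override, 
assuming the 0.94-era RegionObserver API - the observer only ever sees one row 
at a time, which is exactly where I get stuck:)

import java.io.IOException;
import java.util.List;

import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
import org.apache.hadoop.hbase.coprocessor.ObserverContext;
import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;

public class PostGetObserver extends BaseRegionObserver {
    @Override
    public void postGet(ObserverContext<RegionCoprocessorEnvironment> e,
                        Get get, List<KeyValue> results) throws IOException {
        // Called once per row of the multi-get, on the region server hosting
        // that row. The KeyValues for this single row are visible here, but
        // there is no obvious way to combine them across rows or hand an
        // aggregate back to the client - which is the gap I'm asking about.
    }
}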


----- Original Message -----
From: [email protected]
To: [email protected], [email protected]
Cc: Sudarshan Kadambi (BLOOMBERG/ 731 LEXIN)
At: Apr 25 2013 18:12:46

I don't think Phoenix will solve his problem.

He also needs to explain more about his problem before we can start to think 
about possible solutions.


On Apr 25, 2013, at 4:54 PM, lars hofhansl <[email protected]> wrote:

You might want to have a look at Phoenix 
(https://github.com/forcedotcom/phoenix), which does that and more, and gives a 
SQL/JDBC interface.

-- Lars



________________________________
From: Sudarshan Kadambi (BLOOMBERG/ 731 LEXIN) <[email protected]>
To: [email protected]
Sent: Thursday, April 25, 2013 2:44 PM
Subject: Coprocessors


Folks:

This is my first post on the HBase user mailing list.

I have the following scenario:
I have an HBase table with up to a billion keys. I'm looking to support an 
application where, on some user action, I'd need to fetch multiple columns for 
up to 250K keys and do some sort of aggregation on them. Fetching all that data 
and doing the aggregation in my application takes about a minute.

I'm looking to co-locate the aggregation logic with the region servers to
a. Distribute the aggregation
b. Avoid having to fetch large amounts of data over the network (this could 
potentially be cross-datacenter)

Neither observers nor aggregation endpoints work for this use case. Observers 
don't return data back to the client, while aggregation endpoints work in the 
context of scans, not a multi-get (are these correct assumptions?).

I'm looking to write a service that runs alongside the region servers and acts 
as a proxy between my application and the region servers.

I plan to use the logic in the HBase client's HConnectionManager to segment my 
request of 1M rowkeys into sub-requests per region server. These are sent over 
to the proxy, which fetches the data from the region server, aggregates locally 
and sends the data back. Does this sound reasonable, or even a useful thing to 
pursue?
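
Not a definitive implementation, but here is a minimal sketch of just the 
segmenting step, assuming the 0.94-era client API: group the rowkeys by the 
region server currently hosting them (the proxy RPC itself is left out):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HRegionLocation;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.util.Bytes;

public class KeySegmenter {
    // Groups rowkeys by the hostname of the region server hosting them.
    public static Map<String, List<byte[]>> segmentByRegionServer(
            HTable table, List<byte[]> rowkeys) throws Exception {
        Map<String, List<byte[]>> byServer = new HashMap<String, List<byte[]>>();
        for (byte[] row : rowkeys) {
            HRegionLocation loc = table.getRegionLocation(row);
            String host = loc.getHostname();
            List<byte[]> keys = byServer.get(host);
            if (keys == null) {
                keys = new ArrayList<byte[]>();
                byServer.put(host, keys);
            }
            keys.add(row);
        }
        return byServer;
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "t");
        List<byte[]> rowkeys = new ArrayList<byte[]>();
        rowkeys.add(Bytes.toBytes("O1-F1"));
        rowkeys.add(Bytes.toBytes("O2-F1"));
        // ... up to 250K keys in the real request ...

        // Each map entry would become one sub-request to the proxy co-located
        // with that region server; the proxy does the gets and the local
        // aggregation, and the client merges the per-server partial results.
        Map<String, List<byte[]>> subRequests = segmentByRegionServer(table, rowkeys);
        table.close();
    }
}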

Regards,
-sudarshan
