Yes. St.Ack
On Mon, Jul 25, 2011 at 1:23 PM, Paul Nickerson <[email protected]> wrote:
> We currently run on the Cloudera stack. Would this be something that we can
> pull, compile, and plug right into that stack?
>
> ----- Original Message -----
> From: "Gary Helmling" <[email protected]>
> To: [email protected]
> Sent: Monday, July 25, 2011 2:02:50 PM
> Subject: Re: Fanning out hbase queries in parallel
>
> Coprocessors are currently only in trunk. They will be in the 0.92 release
> once we get that out. There's no set date for that, but personally I'll be
> trying to help get it out sooner rather than later.
>
> On Mon, Jul 25, 2011 at 7:37 AM, Michel Segel <[email protected]> wrote:
>> Which release(s) have coprocessors enabled?
>>
>> Sent from a remote device. Please excuse any typos...
>>
>> Mike Segel
>>
>> On Jul 24, 2011, at 11:03 PM, Sonal Goyal <[email protected]> wrote:
>>
>>> Hi Paul,
>>>
>>> Have you taken a look at HBase coprocessors? I think you will find them
>>> useful.
>>>
>>> Best Regards,
>>> Sonal
>>> Hadoop ETL and Data Integration <https://github.com/sonalgoyal/hiho>
>>> Nube Technologies <http://www.nubetech.co>
>>> <http://in.linkedin.com/in/sonalgoyal>
>>>
>>> On Mon, Jul 25, 2011 at 8:13 AM, Paul Nickerson <[email protected]> wrote:
>>>
>>>> I would like to implement a multidimensional query system that
>>>> aggregates large amounts of data on-the-fly by fanning out queries in
>>>> parallel. It should be fast enough for interactive exploration of the
>>>> data and extensible enough to take sets of hundreds or thousands of
>>>> dimensions with high cardinality, and aggregate them from high
>>>> granularity to low granularity. Dimensions and their values are stored
>>>> in the row key. For instance, row keys look like this:
>>>>
>>>> Foo=bar,blah=123
>>>>
>>>> Each row contains numerical values within its column families, such as
>>>> plays=100, versioned by the date of calculation.
>>>>
>>>> A user wants the top "Foo" values with blah=123, sorted downward by
>>>> total plays in July. My current thinking is that a query would be
>>>> executed by grouping all Foo-prefixed row keys by region server and
>>>> sending the query to each of those. Each region server iterates through
>>>> all of its row keys that start with Foo=something,blah=, and passes the
>>>> query on to all regions containing blahs that equal 123, which then
>>>> contain play counts. Matching row keys, as well as the sum of all their
>>>> play values within July, are passed back up the chain and
>>>> sorted/truncated when possible.
>>>>
>>>> It seems quite complicated and would involve either modifying HBase
>>>> source code or at the very least using the deep internals of the API.
>>>> Does this seem like a practical solution, or could someone offer some
>>>> ideas?
>>>>
>>>> Thank you!
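The client-side half of the fan-out Paul describes (scatter a query across region servers, filter on blah=123, sum plays per Foo value, then merge/sort/truncate the partials) can be sketched in plain Java. This is only an illustration of the merge logic under stated assumptions: the "shards" below are in-memory maps standing in for per-region scans, and the row keys and play counts are made up; no HBase API is used.

```java
import java.util.*;
import java.util.concurrent.*;

// Sketch of parallel fan-out aggregation: each "region" holds rows keyed
// like "Foo=<value>,blah=<n>" mapped to a play count. Shards are scanned
// in parallel; the partial sums are merged and sorted on the client.
public class FanOutSketch {
    // One shard's scan: keep rows matching blah=123, sum plays per Foo value.
    static Map<String, Long> scanShard(Map<String, Long> rows) {
        Map<String, Long> partial = new HashMap<>();
        for (Map.Entry<String, Long> e : rows.entrySet()) {
            String key = e.getKey();                  // e.g. "Foo=bar,blah=123"
            if (key.endsWith(",blah=123")) {
                String foo = key.substring(key.indexOf('=') + 1, key.indexOf(','));
                partial.merge(foo, e.getValue(), Long::sum);
            }
        }
        return partial;
    }

    public static void main(String[] args) throws Exception {
        // Two fake "region servers", each holding a slice of the row-key space.
        List<Map<String, Long>> shards = List.of(
            Map.of("Foo=bar,blah=123", 100L, "Foo=baz,blah=999", 50L),
            Map.of("Foo=bar,blah=123", 25L,  "Foo=qux,blah=123", 80L));

        ExecutorService pool = Executors.newFixedThreadPool(shards.size());
        List<Future<Map<String, Long>>> futures = new ArrayList<>();
        for (Map<String, Long> shard : shards)
            futures.add(pool.submit(() -> scanShard(shard)));

        // Merge the partial sums, then sort descending by total plays.
        Map<String, Long> totals = new HashMap<>();
        for (Future<Map<String, Long>> f : futures)
            f.get().forEach((k, v) -> totals.merge(k, v, Long::sum));
        pool.shutdown();

        totals.entrySet().stream()
            .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
            .forEach(e -> System.out.println(e.getKey() + "=" + e.getValue()));
    }
}
```

With a coprocessor endpoint (0.92+), the scanShard step would run server-side next to the data and ship back only the partial sums; the merge/sort/truncate shown in main() stays on the client either way.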
