RE: Results from a Map/Reduce

Peter Haidinyak Fri, 17 Dec 2010 13:43:55 -0800

So the idea is to aggregate the final result to an HBase Table and then from 
the client query that table. I'm going to have to find a quicker method. 
Currently on my small three-node cluster with 100 million rows it takes a couple 
of minutes to do a scan that brings back several million rows. My boss wants 
the query to be in the 'less than five second' range.
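
To illustrate why querying a pre-aggregated table should be much cheaper than the raw scan: with one row per customer per hour, a query for Y days touches at most 24 * Y rows, however many hits were logged. Below is a minimal sketch in plain Java, using a sorted in-memory map as a stand-in for the roll-up table. The class name, the `customerId|yyyyMMddHH` key layout, and the sample counts are assumptions made for illustration, not anything from the actual schema.

```java
import java.util.NavigableMap;
import java.util.TreeMap;

public class RollupQuery {
    // Returns the hourly counts for one customer in [startHour, stopHour),
    // the same shape as a scan bounded by a start row and a stop row.
    // Hours are assumed to be formatted as yyyyMMddHH so they sort correctly.
    public static NavigableMap<String, Long> hitsPerHour(
            NavigableMap<String, Long> table,
            String customerId, String startHour, String stopHour) {
        return table.subMap(customerId + "|" + startHour, true,
                            customerId + "|" + stopHour, false);
    }

    public static void main(String[] args) {
        // Hypothetical pre-aggregated rows: key -> hits in that hour.
        NavigableMap<String, Long> table = new TreeMap<>();
        table.put("custX|2010121623", 5L);
        table.put("custX|2010121700", 7L);
        table.put("custX|2010121723", 3L);
        table.put("custX|2010121800", 9L);
        table.put("custY|2010121705", 1L);

        // One day for custX: only the two rows inside the range are read.
        hitsPerHour(table, "custX", "2010121700", "2010121800")
            .forEach((key, hits) -> System.out.println(key + " -> " + hits));
    }
}
```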


Thanks

-Pete

-----Original Message-----
From: Jonathan Gray [mailto:[email protected]] 
Sent: Friday, December 17, 2010 1:19 PM
To: [email protected]
Subject: RE: Results from a Map/Reduce

If there's a customer waiting for the query, then you wouldn't want to have 
them wait for an MR job.

So what you're saying is you want to change this from on-demand scans to using 
MapReduce to aggregate roll-ups ahead of time and serve those?

In that case, your MR job doesn't need one final output, right?  You could do 
the Map over the entire table (or start/stop rows depending on schema) and with 
the appropriate filters.  You would output (customerid + hour bucket) as the 
key and 1 for the value.  You'd get a reduce for each customerid/hour bucket 
and would write that to HBase.
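
The map and reduce steps described above can be sketched in plain Java without any Hadoop or HBase classes. This is only an illustration of the key/value shape: the `Hit` class, the `|` separator, and the `yyyyMMddHH` UTC bucket format are assumptions, and a real job would read rows from the table and write each summed count back to HBase instead of returning a map.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class HourlyRollup {
    // Hypothetical log record holding only the fields the roll-up needs.
    public static final class Hit {
        final String customerId;
        final long epochMillis;

        public Hit(String customerId, long epochMillis) {
            this.customerId = customerId;
            this.epochMillis = epochMillis;
        }
    }

    private static final DateTimeFormatter HOUR =
        DateTimeFormatter.ofPattern("yyyyMMddHH").withZone(ZoneOffset.UTC);

    // "Map" side: build the (customerid + hour bucket) key for one hit.
    public static String bucketKey(Hit hit) {
        return hit.customerId + "|" + HOUR.format(Instant.ofEpochMilli(hit.epochMillis));
    }

    // "Reduce" side: every hit contributes a value of 1, summed per key.
    public static Map<String, Long> rollUp(List<Hit> hits) {
        Map<String, Long> counts = new TreeMap<>();
        for (Hit hit : hits) {
            counts.merge(bucketKey(hit), 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Hit> hits = Arrays.asList(
            new Hit("custX", 0L),          // 1970-01-01 00:00 UTC
            new Hit("custX", 1_800_000L),  // 30 minutes later, same hour
            new Hit("custX", 3_600_000L),  // start of the next hour
            new Hit("custY", 0L));
        rollUp(hits).forEach((key, count) -> System.out.println(key + " -> " + count));
    }
}
```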

One of the ideas behind coprocessors is you could do the per-customer 
scan/filter/aggregate as a parallel operation inside the RSs (without the 
overhead of MR or cross-JVM) and might be able to increase the number of rows 
you can process within a reasonable amount of time.

Another approach to these kinds of aggregates, if you care about realtime at 
some level, is to use HBase's increment capabilities and a similar 
hour-bucketed schema but updated on demand instead of in batch.
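
A rough sketch of that increment-on-write idea, again in plain Java with an in-memory map standing in for the counter table. The class, method names, and key layout are hypothetical. Against a real table, `recordHit` would instead issue an atomic counter increment through the HBase client (for example `incrementColumnValue` on the table handle).

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.LongAdder;

public class HourlyCounters {
    private static final long HOUR_MILLIS = 3_600_000L;

    // In-memory stand-in for the counter table: row key -> hit count.
    private final ConcurrentMap<String, LongAdder> counters = new ConcurrentHashMap<>();

    // Row key is customerid plus the zero-padded number of hours since the epoch.
    public static String rowKey(String customerId, long epochMillis) {
        long hourBucket = Math.floorDiv(epochMillis, HOUR_MILLIS);
        return customerId + "|" + String.format("%010d", hourBucket);
    }

    // Called once per logged hit, so the bucket is current as soon as the hit lands.
    public void recordHit(String customerId, long epochMillis) {
        counters.computeIfAbsent(rowKey(customerId, epochMillis), k -> new LongAdder())
                .increment();
    }

    // Reads the counter for the hour containing the given timestamp.
    public long hits(String customerId, long epochMillis) {
        LongAdder counter = counters.get(rowKey(customerId, epochMillis));
        return counter == null ? 0L : counter.sum();
    }

    public static void main(String[] args) {
        HourlyCounters table = new HourlyCounters();
        table.recordHit("custX", 0L);
        table.recordHit("custX", 1_000L);
        table.recordHit("custX", 3_600_000L);
        System.out.println(table.hits("custX", 500L));
    }
}
```

The trade-off versus the batch job is one extra write per hit in exchange for reads that never have to wait on a MapReduce run.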

Yeah, this is a "basic" operation but that only means there are 100 ways to 
implement it :)

JG

> -----Original Message-----
> From: Peter Haidinyak [mailto:[email protected]]
> Sent: Friday, December 17, 2010 12:13 PM
> To: [email protected]
> Subject: RE: Results from a Map/Reduce
> 
> What I have is basically a query on a log table to return the number of hits
> per hour for customer X for Y days, with the ability to filter on columns;
> these are to be displayed in a web page on demand.
> Currently, using a Scan, with a popular customer I can get back millions of
> rows to aggregate into 'Hits per hour' buckets. I wanted to push the
> aggregation back to a Map/Reduce and then have those results available to
> send back as a web page.
> This seems like such a basic operation that I am hoping there are 'Best
> Practices' or examples on how to accomplish this. I would also like a pony
> too. :-)
> 
> Thanks
> 
> -Pete
> 
> -----Original Message-----
> From: Jonathan Gray [mailto:[email protected]]
> Sent: Friday, December 17, 2010 12:01 PM
> To: [email protected]
> Subject: RE: Results from a Map/Reduce
> 
> There's not much in the way of examples for coprocessors besides the
> implementation of Security.  Check out HBASE-2000 and go from there.  If
> you're fairly new to HBase, then wait a couple months and there should be
> much better support around Coprocessors.
> 
> I'm unsure of a way to have a final result returned back to the main()
> method.  What exactly are you trying to do with this result?  Available to you
> to do what with it?  Could the MR job put the result back into HBase or could
> your reducer contain the logic you need to use with the final result?
> 
> > -----Original Message-----
> > From: Peter Haidinyak [mailto:[email protected]]
> > Sent: Friday, December 17, 2010 11:56 AM
> > To: [email protected]
> > Subject: RE: Results from a Map/Reduce
> >
> > Does that mean that when the job.waitForCompletion(true) returns that
> > I have the results from the Reducer(s) available to me? I haven't seen
> > much on coprocessors, can you point me to some examples of their use?
> >
> > Thanks
> > -Pete
> >
> > -----Original Message-----
> > From: Jonathan Gray [mailto:[email protected]]
> > Sent: Friday, December 17, 2010 11:13 AM
> > To: [email protected]
> > Subject: RE: Results from a Map/Reduce
> >
> > Hey Peter,
> >
> > That System.exit line is nothing important, just the main thread
> > waiting for the tasks to finish before closing.
> >
> > You're interested in having the MR job return a single result?  To do
> > that, you would need to roll-up the processing done in each of your
> > Map tasks into a single Reduce task.  With one reducer, you can have a
> > single point to do the final aggregation of the result.
> >
> > I'm not sure exactly what kind of aggregation you are doing but
> > funneling into a single reducer can range from no problem to don't
> > even try it.  Sounds like you just want a final number or something so
> > shouldn't be an issue.
> >
> > You might also consider doing your aggregations with coprocessors if
> > you're into experimenting on HBase Trunk :)
> >
> > As for FirstKeyOnlyFilter:
> >
> > /**
> >  * A filter that will only return the first KV from each row.
> >  * <p>
> >  * This filter can be used to more efficiently perform row count operations.
> >  */
> >
> > That's what it does.  If you scan a table, regardless of what you ask
> > for in the query, the filter will just return whatever the first
> > KeyValue is on each row and will skip every other column/version/value of
> > that row except the first.
> >
> > Like it says, it's generally useful for doing row counting but that's about 
> > it.
> >
> > JG
> >
> > > -----Original Message-----
> > > From: Peter Haidinyak [mailto:[email protected]]
> > > Sent: Friday, December 17, 2010 10:56 AM
> > > To: [email protected]
> > > Subject: Results from a Map/Reduce
> > >
> > > Hi, dumb question again.
> > >   I have been using a Scan to return a result back to my client,
> > > which works fine except when I am returning a million rows just to
> > > aggregate the results.
> > > The next logical step would be to do the aggregation in a Map/Reduce.
> > > I've been looking at what samples I could find and they all seem to do
> > > this...
> > >
> > >     System.exit(job.waitForCompletion(true) ? 0 : 1);
> > >
> > > My question: is there a way to return a result from the job, similar
> > > to getting a ResultScanner back and iterating through the results?
> > >
> > > Also, is there a good definition of what a 'FirstKeyOnlyFilter' does?
> > >
> > > Thanks
> > >
> > > -Pete
