RE: Results from a Map/Reduce

Jonathan Gray Fri, 17 Dec 2010 14:25:33 -0800

If you do aggregation, your queries will most likely be well under a second.  
The aggregates should reduce the amount of data that needs to be read by 
several orders of magnitude, no?
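
To make the read-side saving concrete, here is a minimal in-memory sketch of
the hour-bucketed roll-up discussed further down the thread. It is not HBase
code: a TreeMap stands in for the sorted table, and the class name
HourlyRollup, the "customerid|hourbucket" key layout, and the zero-padding
width are all assumptions made for illustration. The point is that a query for
one customer over Y days becomes a range read of at most 24 * Y rows, however
many raw hits were logged.

```java
import java.util.SortedMap;
import java.util.TreeMap;

/** In-memory sketch of hour-bucketed roll-ups; the TreeMap stands in for a sorted HBase table. */
final class HourlyRollup {
    static final long HOUR_MS = 3_600_000L;

    // Row key -> hit count. Keys sort by customer, then by hour bucket.
    private final TreeMap<String, Long> table = new TreeMap<>();

    /** Hypothetical key layout: customer id, '|', zero-padded hour bucket (ids must not contain '|'). */
    static String rowKey(String customerId, long hourBucket) {
        return String.format("%s|%010d", customerId, hourBucket);
    }

    /** Write side: add 1 to the counter for this customer and hour (what the reducer would emit). */
    void recordHit(String customerId, long timestampMs) {
        table.merge(rowKey(customerId, timestampMs / HOUR_MS), 1L, Long::sum);
    }

    /** Read side: range read over [startMs, endMs), analogous to a scan with start/stop rows. */
    SortedMap<String, Long> hitsPerHour(String customerId, long startMs, long endMs) {
        long firstBucket = startMs / HOUR_MS;
        long endBucket = (endMs + HOUR_MS - 1) / HOUR_MS; // exclusive bound, rounded up to a full hour
        return table.subMap(rowKey(customerId, firstBucket), rowKey(customerId, endBucket));
    }

    public static void main(String[] args) {
        HourlyRollup rollup = new HourlyRollup();
        // Three hits in hour 0 and one in hour 2 for "acme"; one hit for another customer.
        rollup.recordHit("acme", 0L);
        rollup.recordHit("acme", 1_000L);
        rollup.recordHit("acme", HOUR_MS - 1);
        rollup.recordHit("acme", 2 * HOUR_MS + 5);
        rollup.recordHit("zeta", 10L);
        // One day for "acme" touches at most 24 keys, regardless of the raw hit count.
        rollup.hitsPerHour("acme", 0L, 24 * HOUR_MS)
              .forEach((key, count) -> System.out.println(key + " -> " + count));
    }
}
```

In a real deployment the counts would be written by the MR reducer (or by
increments), and the range read would be a Scan with start and stop rows.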


> -----Original Message-----
> From: Peter Haidinyak [mailto:[email protected]]
> Sent: Friday, December 17, 2010 1:43 PM
> To: [email protected]
> Subject: RE: Results from a Map/Reduce
> 
> So the idea is to aggregate the final result to an HBase Table and then from
> the client query that table. I'm going to have to find a quicker method.
> Currently on my small three-node cluster with 100 million rows it takes a
> couple of minutes to do a scan that brings back several million rows. My boss
> wants the query to be in the 'less than five second' range.
> 
> Thanks
> 
> -Pete
> 
> -----Original Message-----
> From: Jonathan Gray [mailto:[email protected]]
> Sent: Friday, December 17, 2010 1:19 PM
> To: [email protected]
> Subject: RE: Results from a Map/Reduce
> 
> If there's a customer waiting for the query, then you wouldn't want to have
> them wait for an MR job.
> 
> So what you're saying is you want to change this from on-demand scans to
> using MapReduce to aggregate roll-ups ahead of time and serve those?
> 
> In that case, your MR job doesn't need one final output, right?  You could do
> the Map over the entire table (or a start/stop row range, depending on schema)
> with the appropriate filters.  You would output (customerid + hour bucket) as
> the key and 1 for the value.  You'd get a reduce for each customerid/hour
> bucket and would write that to HBase.
> 
> One of the ideas behind coprocessors is you could do the per-customer
> scan/filter/aggregate as a parallel operation inside the RSs (without the
> overhead of MR or cross-JVM) and might be able to increase the number of
> rows you can process within a reasonable amount of time.
> 
> Another approach to these kinds of aggregates, if you care about realtime at
> some level, is to use HBase's increment capabilities and a similar hour-
> bucketed schema but updated on demand instead of in batch.
> 
> Yeah, this is a "basic" operation but that only means there are 100 ways to
> implement it :)
> 
> JG
> 
> > -----Original Message-----
> > From: Peter Haidinyak [mailto:[email protected]]
> > Sent: Friday, December 17, 2010 12:13 PM
> > To: [email protected]
> > Subject: RE: Results from a Map/Reduce
> >
> > What I have is basically a query on a log table to return the number
> > of hits per hour for customer X for Y days, with the ability to
> > filter on columns; these are to be displayed in a web page on demand.
> > Currently, using a Scan, with a popular customer I can get back
> > millions of rows to aggregate into 'Hits per hour' buckets. I wanted
> > to push the aggregation back to a Map/Reduce and then have those
> > results available to send back as a web page.
> > This seems like such a basic operation that I am hoping there are
> > 'Best Practices' or examples on how to accomplish this. I would also like
> > a pony too. :-)
> >
> > Thanks
> >
> > -Pete
> >
> > -----Original Message-----
> > From: Jonathan Gray [mailto:[email protected]]
> > Sent: Friday, December 17, 2010 12:01 PM
> > To: [email protected]
> > Subject: RE: Results from a Map/Reduce
> >
> > There's not much in the way of examples for coprocessors besides the
> > implementation of Security.  Check out HBASE-2000 and go from there.
> > If you're fairly new to HBase, then wait a couple months and there
> > should be much better support around Coprocessors.
> >
> > I'm unsure of a way to have a final result returned back to the main()
> > method.  What exactly are you trying to do with this result?
> > Available to you to do what with it?  Could the MR job put the result
> > back into HBase or could your reducer contain the logic you need to use
> > with the final result?
> >
> > > -----Original Message-----
> > > From: Peter Haidinyak [mailto:[email protected]]
> > > Sent: Friday, December 17, 2010 11:56 AM
> > > To: [email protected]
> > > Subject: RE: Results from a Map/Reduce
> > >
> > > Does that mean that when the job.waitForCompletion(true) returns
> > > that I have the results from the Reducer(s) available to me? I
> > > haven't seen much on coprocessors, can you point me to some examples
> > > of their use?
> > >
> > > Thanks
> > > -Pete
> > >
> > > -----Original Message-----
> > > From: Jonathan Gray [mailto:[email protected]]
> > > Sent: Friday, December 17, 2010 11:13 AM
> > > To: [email protected]
> > > Subject: RE: Results from a Map/Reduce
> > >
> > > Hey Peter,
> > >
> > > That System.exit line is nothing important, just the main thread
> > > waiting for the tasks to finish before closing.
> > >
> > > You're interested in having the MR job return a single result?  To
> > > do that, you would need to roll-up the processing done in each of
> > > your Map tasks into a single Reduce task.  With one reducer, you can
> > > have a single point to do the final aggregation of the result.
> > >
> > > I'm not sure exactly what kind of aggregation you are doing but
> > > funneling into a single reducer can range from no problem to don't
> > > even try it.  Sounds like you just want a final number or something,
> > > so it shouldn't be an issue.
> > >
> > > You might also consider doing your aggregations with coprocessors if
> > > you're into experimenting on HBase Trunk :)
> > >
> > > As for FirstKeyOnlyFilter:
> > >
> > > /**
> > >  * A filter that will only return the first KV from each row.
> > >  * <p>
> > >  * This filter can be used to more efficiently perform row count operations.
> > >  */
> > >
> > > That's what it does.  If you scan a table, regardless of what you
> > > ask for in the query, the filter will just return whatever the first
> > > KeyValue is on each row and will skip every other
> > > column/version/value of that row except the first.
> > >
> > > Like it says, it's generally useful for doing row counting but that's 
> > > about it.
> > >
> > > JG
> > >
> > > > -----Original Message-----
> > > > From: Peter Haidinyak [mailto:[email protected]]
> > > > Sent: Friday, December 17, 2010 10:56 AM
> > > > To: [email protected]
> > > > Subject: Results from a Map/Reduce
> > > >
> > > > Hi, dumb question again.
> > > >   I have been using a Scan to return a result back to my client
> > > > which works fine except when I am returning a million rows just to
> > > > aggregate the results.
> > > > The next logical step would be to do the aggregation in a Map/Reduce.
> > > > I've been looking at what samples I could find and they all seem to
> > > > do this...
> > > >
> > > >     System.exit(job.waitForCompletion(true) ? 0 : 1);
> > > >
> > > > My question: is there a way to return a result from the job,
> > > > similar to getting a ResultScanner back and iterating through
> > > > the results?
> > > >
> > > > Also, is there a good definition of what a 'FirstKeyOnlyFilter' does?
> > > >
> > > > Thanks
> > > >
> > > > -Pete
