If there's a customer waiting for the query, then you wouldn't want to have them wait for an MR job.
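The hour-bucketed roll-up discussed in this thread can be sketched, minus the Hadoop and HBase plumbing, in plain Java. The class and method names below are illustrative (they do not come from the thread); in the real job the "map" step would run inside a TableMapper and the "reduce" step would write each bucket back to HBase:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.TimeUnit;

// Illustrative sketch of the hour-bucketed roll-up: the "map" step emits
// (customerId + hourBucket) with a count of 1, and the "reduce" step sums
// per key. Names are made up for this sketch.
public class HourBucketRollup {

    // Build the composite key: customer id plus the timestamp truncated
    // to the hour (epoch millis divided by millis-per-hour).
    static String bucketKey(String customerId, long timestampMillis) {
        long hourBucket = timestampMillis / TimeUnit.HOURS.toMillis(1);
        return customerId + "/" + hourBucket;
    }

    // "Reduce": aggregate a stream of (customerId, timestamp) hits into
    // per-customer, per-hour counts.
    static Map<String, Long> rollUp(String[] customerIds, long[] timestamps) {
        Map<String, Long> counts = new HashMap<>();
        for (int i = 0; i < customerIds.length; i++) {
            counts.merge(bucketKey(customerIds[i], timestamps[i]), 1L, Long::sum);
        }
        return counts;
    }
}
```

With the summed buckets stored back in HBase, the web tier serves 'Hits per hour' with cheap per-bucket reads instead of scanning millions of raw rows on demand.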
So what you're saying is you want to change this from on-demand scans to using MapReduce to aggregate roll-ups ahead of time and serve those? In that case, your MR job doesn't need one final output, right? You could do the Map over the entire table (or start/stop rows depending on schema) with the appropriate filters. You would output (customerid + hour bucket) as the key and 1 for the value. You'd get a reduce for each customerid/hour bucket and would write that to HBase.

One of the ideas behind coprocessors is that you could do the per-customer scan/filter/aggregate as a parallel operation inside the RegionServers (without the overhead of MR or crossing JVMs) and might be able to increase the number of rows you can process within a reasonable amount of time.

Another approach to these kinds of aggregates, if you care about realtime at some level, is to use HBase's increment capabilities with a similar hour-bucketed schema, but updated on demand instead of in batch.

Yeah, this is a "basic" operation, but that only means there are 100 ways to implement it :)

JG

> -----Original Message-----
> From: Peter Haidinyak [mailto:[email protected]]
> Sent: Friday, December 17, 2010 12:13 PM
> To: [email protected]
> Subject: RE: Results from a Map/Reduce
>
> What I have is basically a query on a log table to return the number of hits
> per hour for customer X for Y days, with the ability to filter on columns;
> these are to be displayed in a web page on demand.
> Currently, using a Scan, with a popular customer I can get back millions of
> rows to aggregate into 'Hits per hour' buckets. I wanted to push the
> aggregation back to a Map/Reduce and then have those results available to
> send back as a web page.
> This seems like such a basic operation that I am hoping there are 'Best
> Practices' or examples on how to accomplish this. I would also like a pony
> too.
> :-)
>
> Thanks
> -Pete
>
> -----Original Message-----
> From: Jonathan Gray [mailto:[email protected]]
> Sent: Friday, December 17, 2010 12:01 PM
> To: [email protected]
> Subject: RE: Results from a Map/Reduce
>
> There's not much in the way of examples for coprocessors besides the
> implementation of Security. Check out HBASE-2000 and go from there. If
> you're fairly new to HBase, then wait a couple of months and there should be
> much better support around coprocessors.
>
> I'm unsure of a way to have a final result returned back to the main()
> method. What exactly are you trying to do with this result? Available to you
> to do what with it? Could the MR job put the result back into HBase, or could
> your reducer contain the logic you need to apply to the final result?
>
> > -----Original Message-----
> > From: Peter Haidinyak [mailto:[email protected]]
> > Sent: Friday, December 17, 2010 11:56 AM
> > To: [email protected]
> > Subject: RE: Results from a Map/Reduce
> >
> > Does that mean that when job.waitForCompletion(true) returns,
> > I have the results from the Reducer(s) available to me? I haven't seen
> > much on coprocessors; can you point me to some examples of their use?
> >
> > Thanks
> > -Pete
> >
> > -----Original Message-----
> > From: Jonathan Gray [mailto:[email protected]]
> > Sent: Friday, December 17, 2010 11:13 AM
> > To: [email protected]
> > Subject: RE: Results from a Map/Reduce
> >
> > Hey Peter,
> >
> > That System.exit line is nothing important, just the main thread
> > waiting for the tasks to finish before closing.
> >
> > You're interested in having the MR job return a single result? To do
> > that, you would need to roll up the processing done in each of your
> > Map tasks into a single Reduce task. With one reducer, you can have a
> > single point to do the final aggregation of the result.
> >
> > I'm not sure exactly what kind of aggregation you are doing, but
> > funneling into a single reducer can range from no problem at all to don't
> > even try it. Sounds like you just want a final number or something, so it
> > shouldn't be an issue.
> >
> > You might also consider doing your aggregations with coprocessors if
> > you're into experimenting on HBase Trunk :)
> >
> > As for FirstKeyOnlyFilter:
> >
> > /**
> >  * A filter that will only return the first KV from each row.
> >  * <p>
> >  * This filter can be used to more efficiently perform row count operations.
> >  */
> >
> > That's what it does. If you scan a table, regardless of what you ask
> > for in the query, the filter will just return whatever the first
> > KeyValue is on each row and will skip every other column/version/value of
> > that row.
> >
> > Like it says, it's generally useful for doing row counting, but that's about
> > it.
> >
> > JG
> >
> > > -----Original Message-----
> > > From: Peter Haidinyak [mailto:[email protected]]
> > > Sent: Friday, December 17, 2010 10:56 AM
> > > To: [email protected]
> > > Subject: Results from a Map/Reduce
> > >
> > > Hi, dumb question again.
> > > I have been using a Scan to return a result back to my client,
> > > which works fine except when I am returning a million rows just to
> > > aggregate the results.
> > > The next logical step would be to do the aggregation in a Map/Reduce.
> > > I've been looking at what samples I could find and they seem to all do
> > > this...
> > >
> > > System.exit(job.waitForCompletion(true) ? 0 : 1);
> > >
> > > My question: is there a way to return a result from the job,
> > > similar to getting a ResultScanner back and iterating through the
> > > results?
> > >
> > > Also, is there a good definition of what a 'FirstKeyOnlyFilter' does?
> > >
> > > Thanks
> > >
> > > -Pete
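The row-counting use of FirstKeyOnlyFilter described in the thread might look like the following client-side sketch. This assumes the 0.90-era HBase client API of the time; the table name and caching value are made up, and it needs a running cluster, so treat it as an illustration rather than a drop-in implementation:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.FirstKeyOnlyFilter;

// Sketch: count rows by scanning with FirstKeyOnlyFilter so each row
// returns only its first KeyValue, keeping the scan as cheap as possible.
public class ScanRowCounter {
    public static long countRows(String tableName) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, tableName); // 0.90-era client API
        try {
            Scan scan = new Scan();
            scan.setFilter(new FirstKeyOnlyFilter()); // first KV per row only
            scan.setCaching(1000); // fetch rows in batches per RPC
            ResultScanner scanner = table.getScanner(scan);
            try {
                long count = 0;
                for (Result r : scanner) {
                    count++; // one Result per row
                }
                return count;
            } finally {
                scanner.close();
            }
        } finally {
            table.close();
        }
    }
}
```

This is essentially what the bundled RowCounter MR job does, distributed across map tasks instead of a single client scan.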
