Is it possible to do incremental processing more efficiently without putting the timestamp in the leading part of the row key, i.e., process only the data that arrived within the last hour, two hours, etc.? I can't seem to find a good answer to this question myself.
On Mon, Oct 10, 2011 at 12:09 AM, Steinmaurer Thomas <[email protected]> wrote:

> Leif,
>
> we are pretty much in the same boat with a custom timestamp at the end of a
> three-part rowkey, so basically we end up with reading all data when
> processing daily batches. Besides performance aspects, have you seen that
> using internal timestamps for scans etc. works reliably?
>
> Or did you come up with another solution to your problem?
>
> Thanks,
> Thomas
>
> -----Original Message-----
> From: Leif Wickland [mailto:[email protected]]
> Sent: Friday, 09. September 2011 20:33
> To: [email protected]
> Subject: Performance characteristics of scans using timestamp as the filter
>
> (Apologies if this has been answered before. I couldn't find anything in
> the archives quite along these lines.)
>
> I have a process which writes to HBase as new data arrives. I'd like to
> run a map-reduce periodically, say daily, that takes the new items as input.
> A naive approach would use a scan which grabs all of the rows that have a
> timestamp in a specified interval as the input to a MapReduce. I tested a
> scenario like that with 10s of GB of data and it seemed to perform OK.
> Should I expect that approach to continue to perform reasonably well
> when I have TBs of data?
>
> From what I understand of the HBase architecture, I don't see a reason that
> the scan approach would continue to perform well as the data grows. It
> seems like I may have to keep a log of modified keys and use that as the
> map-reduce input, instead.
>
> Thanks,
>
> Leif Wickland
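
For context, the naive time-range scan Leif describes would look roughly like this with the HBase Java client and TableMapReduceUtil. This is only a sketch: the table name "events", the class names, and the 24-hour window are placeholder assumptions, and exact signatures vary by HBase/Hadoop version.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
    import org.apache.hadoop.hbase.mapreduce.TableMapper;
    import org.apache.hadoop.mapreduce.Job;

    public class TimeRangeScanJob {

        // Identity mapper: emits each row the time-filtered scan returns.
        static class RecentRowsMapper
                extends TableMapper<ImmutableBytesWritable, Result> {
            @Override
            protected void map(ImmutableBytesWritable key, Result row,
                               Context context)
                    throws IOException, InterruptedException {
                context.write(key, row);
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            Job job = Job.getInstance(conf, "daily-incremental-scan");
            job.setJarByClass(TimeRangeScanJob.class);

            // Restrict the scan to cells written in the last 24 hours.
            long now = System.currentTimeMillis();
            Scan scan = new Scan();
            scan.setTimeRange(now - 24L * 60 * 60 * 1000, now); // [min, max)
            scan.setCaching(500);        // rows fetched per RPC
            scan.setCacheBlocks(false);  // usually disabled for MR scans

            TableMapReduceUtil.initTableMapperJob(
                    "events",                     // placeholder table name
                    scan,
                    RecentRowsMapper.class,
                    ImmutableBytesWritable.class, // map output key
                    Result.class,                 // map output value
                    job);
            job.setNumReduceTasks(0);             // map-only job

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Note that setTimeRange lets a region server skip whole store files whose recorded min/max timestamps fall outside the requested range, but the scan still has to visit every region of the table, which is consistent with Leif's concern that this approach degrades as the table grows to TBs.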
