Leif, we are pretty much in the same boat with a custom timestamp at the end of a three-part rowkey, so basically we end up reading all the data when processing daily batches. Besides the performance aspects, have you found that using the internal timestamps for scans etc. works reliably?
Or did you come up with another solution to your problem?

Thanks,
Thomas

-----Original Message-----
From: Leif Wickland [mailto:[email protected]]
Sent: Friday, 09. September 2011 20:33
To: [email protected]
Subject: Performance characteristics of scans using timestamp as the filter

(Apologies if this has been answered before. I couldn't find anything in the archives quite along these lines.)

I have a process which writes to HBase as new data arrives. I'd like to run a MapReduce periodically, say daily, that takes the new items as input. A naive approach would use a scan which grabs all of the rows that have a timestamp in a specified interval as the input to the MapReduce. I tested a scenario like that with tens of GB of data and it seemed to perform OK. Should I expect that approach to continue to perform reasonably well when I have TBs of data?

From what I understand of the HBase architecture, I don't see a reason that the scan approach would continue to perform well as the data grows. It seems like I may have to keep a log of modified keys and use that as the MapReduce input instead.

Thanks,
Leif Wickland
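For reference, below is a minimal sketch (not from the original thread) of the "naive approach" Leif describes: a Scan restricted to a timestamp interval used as the input to a MapReduce job via TableMapReduceUtil. The table name ("events") and class names are hypothetical, and the caching/time-range values are only illustrative.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;

public class DailyBatchScan {

  // Mapper only sees cells whose HBase timestamp falls inside the scan's time range.
  static class NewDataMapper extends TableMapper<Text, NullWritable> {
    @Override
    protected void map(ImmutableBytesWritable rowKey, Result result, Context context)
        throws IOException, InterruptedException {
      context.write(new Text(rowKey.copyBytes()), NullWritable.get());
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = Job.getInstance(conf, "daily-batch-scan");
    job.setJarByClass(DailyBatchScan.class);

    long end = System.currentTimeMillis();
    long start = end - 24L * 60 * 60 * 1000;   // last 24 hours

    Scan scan = new Scan();
    scan.setCaching(500);          // fetch rows in larger batches per RPC for MR scans
    scan.setCacheBlocks(false);    // avoid polluting the block cache with a one-off scan
    scan.setTimeRange(start, end); // only cells written inside [start, end) are returned

    // "events" is a placeholder table name.
    TableMapReduceUtil.initTableMapperJob("events", scan, NewDataMapper.class,
        Text.class, NullWritable.class, job);
    job.setNumReduceTasks(0);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Note that the time range filters cells server-side, but every store file whose timestamp range overlaps the interval still has to be read, which is why the scan can end up touching most of the table as it grows.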
