(Apologies if this has been answered before. I couldn't find anything in the archives quite along these lines.)
I have a process which writes to HBase as new data arrives. I'd like to run a MapReduce periodically, say daily, that takes the new items as input.

A naive approach would use a scan that grabs all of the rows with a timestamp in a specified interval as the input to the MapReduce, roughly as sketched below. I tested a scenario like that with tens of GB of data and it seemed to perform OK. Should I expect that approach to continue to perform reasonably well when I have TBs of data?

From what I understand of the HBase architecture, I don't see a reason the scan approach would continue to perform well as the data grows: as far as I can tell, a time-range scan still has to walk every region, so the work grows with the total table size rather than with the day's new data. It seems like I may have to keep a log of modified keys and use that as the MapReduce input instead.
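To make the question concrete, the job setup I'm describing looks something like the sketch below. The table name, mapper, and one-day window are placeholders, and I've left the real per-row processing out:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class DailyScanJob {

  // Placeholder mapper: only sees cells whose timestamps fall in the scan's time range.
  static class NewItemMapper extends TableMapper<NullWritable, NullWritable> {
    @Override
    protected void map(ImmutableBytesWritable row, Result value, Context context)
        throws IOException, InterruptedException {
      // process the newly written row here
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = Job.getInstance(conf, "daily-new-items");  // Hadoop 2-style job setup
    job.setJarByClass(DailyScanJob.class);

    long now = System.currentTimeMillis();
    long dayAgo = now - 24L * 60 * 60 * 1000;

    Scan scan = new Scan();
    scan.setTimeRange(dayAgo, now);  // only cells written in the last day
    scan.setCaching(500);            // bigger batches per RPC for a long scan
    scan.setCacheBlocks(false);      // don't pollute the block cache from an MR scan

    TableMapReduceUtil.initTableMapperJob(
        "my_table",                  // placeholder table name
        scan,
        NewItemMapper.class,
        NullWritable.class,
        NullWritable.class,
        job);
    job.setOutputFormatClass(NullOutputFormat.class);
    job.setNumReduceTasks(0);        // map-only, just to keep the sketch short

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

(I disable block caching on the scan so the nightly job doesn't evict the region servers' hot working set.)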
Thanks,
Leif Wickland