(Apologies if this has been answered before.  I couldn't find anything in
the archives quite along these lines.)

I have a process which writes to HBase as new data arrives.  I'd like to run
a map-reduce periodically, say daily, that takes the new items as input.  A
naive approach would use a scan which grabs all of the rows that have a
timestamp in a specified interval as the input to a MapReduce.  I tested a
scenario like that with 10s of GB of data and it seemed to perform OK.
Should I expect that approach to continue to perform reasonably well when
I have TBs of data?
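
In case it helps, here's roughly what I mean by the scan approach -- a minimal
sketch using TableMapReduceUtil, where the table name "items" and the
pass-through ItemMapper are just placeholders for my actual job:

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
    import org.apache.hadoop.hbase.mapreduce.TableMapper;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

    public class DailyScanJob {

      // Pass-through mapper; the real job would process each new row here.
      public static class ItemMapper
          extends TableMapper<ImmutableBytesWritable, Result> {
        @Override
        protected void map(ImmutableBytesWritable key, Result value, Context context)
            throws IOException, InterruptedException {
          context.write(key, value);
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = Job.getInstance(conf, "daily-new-items");
        job.setJarByClass(DailyScanJob.class);

        long end = System.currentTimeMillis();
        long start = end - 24L * 60 * 60 * 1000;  // previous 24 hours

        Scan scan = new Scan();
        scan.setTimeRange(start, end);  // only cells written in the interval
        scan.setCaching(500);           // bigger batches for a full-table MR scan
        scan.setCacheBlocks(false);     // don't churn the block cache

        TableMapReduceUtil.initTableMapperJob(
            "items", scan, ItemMapper.class,
            ImmutableBytesWritable.class, Result.class, job);
        job.setNumReduceTasks(0);                       // map-only in this sketch
        job.setOutputFormatClass(NullOutputFormat.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }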

From what I understand of the HBase architecture, the timestamp isn't part of
the row key, so a time-range scan still has to touch every region, and I don't
see a reason that the scan approach would continue to perform well as the data grows.  It
seems like I may have to keep a log of modified keys and use that as the
map-reduce input, instead.
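
The alternative I have in mind would look something like the sketch below:
whenever my ingest process writes a row, it also records the row key in a
small per-day changelog table, and the daily job scans only that table.
(The "changelog" table name and the "f:k" column are placeholders, not
anything I've actually built.)

    import java.io.IOException;

    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class ChangelogWriter {

      // Called alongside each write to the main data table.
      public static void recordModification(Connection conn, byte[] dataRowKey, String day)
          throws IOException {
        try (Table changelog = conn.getTable(TableName.valueOf("changelog"))) {
          // Prefix the key with the day so the daily job scans one contiguous range.
          byte[] logKey = Bytes.add(Bytes.toBytes(day + "|"), dataRowKey);
          Put put = new Put(logKey);
          put.addColumn(Bytes.toBytes("f"), Bytes.toBytes("k"), dataRowKey);
          changelog.put(put);
        }
      }
    }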

Thanks,

Leif Wickland
