Leif,

we are pretty much in the same boat: we have a custom timestamp at the end of a 
three-part rowkey, so we basically end up reading all data when processing 
daily batches. Besides the performance aspects, have you found that using the 
internal timestamps for scans etc. works reliably?
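
For illustration, our keys look roughly like this (the part names and the 
helper class are just placeholders, not our actual schema):

import org.apache.hadoop.hbase.util.Bytes;

public class RowKeys {
  // Composite key: <part1><part2><timestamp>, with the timestamp last.
  public static byte[] rowKey(String part1, String part2, long timestamp) {
    return Bytes.add(Bytes.toBytes(part1),
                     Bytes.toBytes(part2),
                     Bytes.toBytes(timestamp));
  }
  // Start/stop rows can only bound the leading parts of the key, so a daily
  // batch that wants "everything since yesterday" across all part1/part2
  // values cannot narrow the scan and ends up walking the whole table.
}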

Or did you come up with another solution to your problem?

Thanks,
Thomas

-----Original Message-----
From: Leif Wickland [mailto:[email protected]] 
Sent: Friday, 09. September 2011 20:33
To: [email protected]
Subject: Performance characteristics of scans using timestamp as the filter

(Apologies if this has been answered before.  I couldn't find anything in the 
archives quite along these lines.)

I have a process which writes to HBase as new data arrives.  I'd like to run a 
map-reduce periodically, say daily, that takes the new items as input.  A naive 
approach would use a scan which grabs all of the rows that have a timestamp in 
a specified interval as the input to a MapReduce.  I tested a scenario like 
that with 10s of GB of data and it seemed to perform OK.
Should I expect that approach to continue to perform reasonably well when I 
have TBs of data?
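
Concretely, the naive approach I have in mind looks something like the sketch 
below; the table name and classes are just placeholders, and the Scan's time 
range is applied against HBase's internal cell timestamps:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class DailyScanJob {

  static class NewItemMapper
      extends TableMapper<ImmutableBytesWritable, NullWritable> {
    @Override
    protected void map(ImmutableBytesWritable rowKey, Result row, Context ctx)
        throws IOException, InterruptedException {
      // Only rows with at least one cell inside the time range show up here.
      ctx.write(rowKey, NullWritable.get());
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "daily-new-items");
    job.setJarByClass(DailyScanJob.class);

    long end = System.currentTimeMillis();
    long start = end - 24L * 60 * 60 * 1000;  // last 24 hours

    Scan scan = new Scan();
    scan.setCaching(500);
    scan.setCacheBlocks(false);     // don't pollute the block cache from MR
    scan.setTimeRange(start, end);  // restrict by internal cell timestamps

    TableMapReduceUtil.initTableMapperJob(
        "items",                    // placeholder table name
        scan, NewItemMapper.class,
        ImmutableBytesWritable.class, NullWritable.class, job);
    job.setNumReduceTasks(0);
    job.setOutputFormatClass(NullOutputFormat.class);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

My worry is that even with the time range set, the scan still has to touch 
every region, which is why I'm not sure it scales.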

From what I understand of the HBase architecture, I don't see a reason that the 
scan approach would continue to perform well as the data grows.  It seems 
like I may have to keep a log of modified keys and use that as the map-reduce 
input, instead.
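
Roughly what I'm picturing for that log, in case it helps the discussion (the 
side table, column family and qualifier names are made up): every write also 
puts the affected rowkey into a change-log table keyed by day, so the daily 
job only scans that day's slice.

import java.io.IOException;
import java.text.SimpleDateFormat;
import java.util.Date;

import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class ChangeLog {
  private static final byte[] CF = Bytes.toBytes("f");
  private static final byte[] KEY_QUALIFIER = Bytes.toBytes("k");

  // On every write to the main table, also record the affected rowkey under
  // a day-prefixed key in the change-log table.
  public static void logModification(HTable changeLog, byte[] dataKey)
      throws IOException {
    String day = new SimpleDateFormat("yyyyMMdd").format(new Date());
    Put put = new Put(Bytes.add(Bytes.toBytes(day), dataKey));
    put.add(CF, KEY_QUALIFIER, dataKey);
    changeLog.put(put);
  }

  // A scan covering exactly one day's worth of modified keys, e.g.
  // dailyScan("20110909", "20110910"); use it as the map-reduce input.
  public static Scan dailyScan(String day, String nextDay) {
    Scan scan = new Scan(Bytes.toBytes(day), Bytes.toBytes(nextDay));
    scan.addColumn(CF, KEY_QUALIFIER);
    return scan;
  }
}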

Thanks,

Leif Wickland
