Is it possible to do incremental processing more efficiently without putting
the timestamp in the leading part of the row key, i.e., to process only the
data that arrived within the last hour, two hours, etc.? I can't seem to find
a good answer to this question myself.
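The closest thing I've found is Scan.setTimeRange(), which filters on HBase's
internal cell timestamps rather than on the row key. As I understand it, HBase
tracks the min/max timestamps per store file and can skip files that fall
entirely outside the requested range, although after a major compaction
everything sits in one file again, so this may not help the worst case. Below
is a minimal sketch of what I mean; the table name "events" is made up and I
haven't benchmarked it:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;

public class LastHourScan {
    public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "events"); // hypothetical table name
        try {
            long now = System.currentTimeMillis();
            Scan scan = new Scan();
            // Only return cells written in the last hour
            // (min timestamp inclusive, max exclusive).
            scan.setTimeRange(now - 60 * 60 * 1000L, now);
            ResultScanner scanner = table.getScanner(scan);
            try {
                for (Result row : scanner) {
                    // process each matching row here...
                }
            } finally {
                scanner.close();
            }
        } finally {
            table.close();
        }
    }
}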

On Mon, Oct 10, 2011 at 12:09 AM, Steinmaurer Thomas <
[email protected]> wrote:

> Leif,
>
> we are pretty much in the same boat, with a custom timestamp at the end of a
> three-part row key, so we basically end up reading all data when
> processing daily batches. Performance aspects aside, have you found that
> using internal timestamps for scans etc. works reliably?
>
> Or did you come up with another solution to your problem?
>
> Thanks,
> Thomas
>
> -----Original Message-----
> From: Leif Wickland [mailto:[email protected]]
> Sent: Friday, 09 September 2011 20:33
> To: [email protected]
> Subject: Performance characteristics of scans using timestamp as the filter
>
> (Apologies if this has been answered before.  I couldn't find anything in
> the archives quite along these lines.)
>
> I have a process which writes to HBase as new data arrives.  I'd like to
> run a map-reduce periodically, say daily, that takes the new items as input.
>  A naive approach would use a scan which grabs all of the rows that have a
> timestamp in a specified interval as the input to a MapReduce.  I tested a
> scenario like that with 10s of GB of data and it seemed to perform OK.
>  Should I expect that approach to continue to perform reasonably well
> when I have TBs of data?
>
> From what I understand of the HBase architecture, I don't see a reason that
> the scan approach would continue to perform well as the data grows.  It
> seems like I may have to keep a log of modified keys and use that as the
> map-reduce input, instead.
>
> Thanks,
>
> Leif Wickland
>
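For concreteness, here is roughly how I read the scan-based approach Leif
describes, wired up as a MapReduce input (a sketch only: the table name, job
name, and mapper are invented, and I haven't run this):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class DailyBatchJob {

    // Map-only job: sees every row with cells in the scan's time range.
    static class NewDataMapper extends TableMapper<NullWritable, NullWritable> {
        @Override
        protected void map(ImmutableBytesWritable rowKey, Result row, Context ctx)
                throws IOException, InterruptedException {
            // process the newly arrived row here...
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = new Job(conf, "daily-batch"); // hypothetical job name
        job.setJarByClass(DailyBatchJob.class);
        job.setNumReduceTasks(0);

        long now = System.currentTimeMillis();
        Scan scan = new Scan();
        scan.setCacheBlocks(false);                         // advised for MR scans
        scan.setTimeRange(now - 24 * 60 * 60 * 1000L, now); // last 24 hours

        TableMapReduceUtil.initTableMapperJob(
                "events",                                   // made-up source table
                scan, NewDataMapper.class,
                NullWritable.class, NullWritable.class, job);
        job.setOutputFormatClass(NullOutputFormat.class);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The alternative Leif hints at, keeping a log of modified keys, would replace the
time-range scan above with a scan over a small changelog table written alongside
each update and cleared after each batch, so the batch input stays proportional
to the new data rather than to the whole table.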
