In our tests, filtering rows by timestamp was much faster than using a
filter that results in a full table scan. However, I question the
reliability of using the internal timestamp to detect new data, and
whether this will still scale as the amount of data grows over the years.
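For reference, the HFile-skipping behavior discussed further down in the thread comes down to a simple interval-overlap test between the scan's requested timerange and the [min, max] timestamp range recorded in each file's metadata. A minimal sketch of that check (plain Java; this mirrors the idea only, not HBase's actual implementation, and the class/method names are made up):

```java
// Sketch of the per-HFile timerange check a scan can use to skip files.
// Hypothetical names; illustrates the overlap test, not HBase internals.
public class TimeRangeSkip {

    /**
     * Returns true when a file whose cells fall in [fileMin, fileMax]
     * may contain data for the requested scan range [scanMin, scanMax)
     * and therefore cannot be skipped.
     */
    public static boolean mayContain(long fileMin, long fileMax,
                                     long scanMin, long scanMax) {
        // Two intervals overlap unless one ends before the other starts.
        return fileMax >= scanMin && fileMin < scanMax;
    }

    public static void main(String[] args) {
        // File covering January 2011 (epoch millis); scan asks for February.
        long jan1 = 1293840000000L;
        long feb1 = 1296518400000L;
        long mar1 = 1298937600000L;
        System.out.println(mayContain(jan1, feb1 - 1, feb1, mar1)); // skip
        System.out.println(mayContain(jan1, feb1 - 1, jan1, feb1)); // read
    }
}
```

If compactions keep old data confined to files with narrow timeranges (as Carson suggests below), most files fail this test and a "last N days" scan touches only a few of them.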

Regards,
Thomas

-----Original Message-----
From: Carson Hoffacker [mailto:[email protected]] 
Sent: Thursday, 15 December 2011 05:36
To: [email protected]; Stuart Smith
Subject: Re: Questions on timestamps, insights on how
timerange/timestamp filter are processed?

I believe it's the same amount of work.

On Wed, Dec 14, 2011 at 3:37 PM, Stuart Smith <[email protected]> wrote:

> Ah. Thanks for clarifying my wrong answer.. !
>
> The only time I had to deal with timestamps I had to go through the 
> thrift API ...
> Never noticed the setTimeRange in the Scan() java API :)
>
> So now I'm curious.. If I use this and it can't skip HFiles.. is there
> any performance gain from doing this vs doing it client side?
> Or is it basically the same amount of work - a full scan checking & 
> skipping timestamps.. ?
>
>
> Take care,
>   -stu
>
>
>
> ________________________________
>  From: Carson Hoffacker <[email protected]>
> To: [email protected]; Stuart Smith <[email protected]>
> Sent: Wednesday, December 14, 2011 10:29 AM
> Subject: Re: Questions on timestamps, insights on how 
> timerange/timestamp filter are processed?
>
> The timerange scan is able to leverage metadata in each of the HFiles.
> Each HFile should store information about the timerange associated
> with the data it contains. If the timerange associated with the HFile
> does not overlap the timerange you are interested in, that HFile will
> be skipped completely. This can significantly increase scan
> performance.
>
> However, when these files get compacted and the data is merged into a
> smaller number of files, the timerange associated with each file
> widens. I don't think it works this way out of the box, but I believe
> you can be smart about how you manage compactions over time to get
> the behavior that you want. For example, you could have compactions
> compact all the data from January 2011 into a single file, and then
> compact all the data from February 2011 into a different file.
>
> -Carson
>
> On Wed, Dec 14, 2011 at 9:39 AM, Stuart Smith <[email protected]> wrote:
>
> > Hello Thomas,
> >
> >    Someone here could probably provide more help, but to start you
> > off, the only way I've filtered timestamps is to do a scan and just
> > filter out rows one by one. This definitely sounds like something
> > coprocessors could help with, but I don't really understand those
> > yet, so someone else will have to step up.. or you can really dig
> > into the documentation about them (AFAIK, it's a little bit of
> > custom code that runs on the regionservers and can pre-process your
> > gets.. but don't quote me on that!).
> >
> > But I can say that a major compaction should not affect them - I've 
> > never seen it happen, and if it does, I believe that's a bug.
> >
> > Take care,
> >   -stu
> >
> >
> >
> > ________________________________
> >  From: Steinmaurer Thomas <[email protected]>
> > To: [email protected]
> > Sent: Wednesday, December 14, 2011 12:38 AM
> > Subject: Questions on timestamps, insights on how 
> > timerange/timestamp filter are processed?
> >
> > Hello,
> >
> > Can anybody share some insights into how timerange/timestamp
> > filters are processed?
> >
> > Basically, we intend to use timerange/timestamp filters to process
> > relatively new data, based on the insertion timestamp.
> >
> > - How does the process of skipping records and/or regions work if
> > one uses timerange filters?
> > - I also wonder: do timestamps change when e.g. running a major
> > compaction?
> > - If data grows over the years, is there any chance that regions
> > with "older" rows remain "stable" in a way that they can be skipped
> > very quickly when querying data with a timerange filter of e.g. the
> > last three years?
> >
> > Thanks,
> > Thomas
> >
>
