In our tests, filtering rows by timestamp was much faster than using a filter that results in a full table scan. But I question the reliability of using the internal timestamp to detect new data, and whether this approach still scales as the amount of data grows over the years.
Regards,
Thomas

-----Original Message-----
From: Carson Hoffacker [mailto:[email protected]]
Sent: Thursday, 15 December 2011 05:36
To: [email protected]; Stuart Smith
Subject: Re: Questions on timestamps, insights on how timerange/timestamp filter are processed?

I believe it's the same amount of work.

On Wed, Dec 14, 2011 at 3:37 PM, Stuart Smith <[email protected]> wrote:
> Ah. Thanks for clarifying my wrong answer.. !
>
> The only time I had to deal with timestamps I had to go through the
> thrift API ...
> Never noticed the setTimeRange in the Scan() java API :)
>
> So now I'm curious.. If I use this and it can't skip HFiles.. is there
> any performance gain from doing this vs doing it client side?
> Or is it basically the same amount of work - a full scan checking &
> skipping timestamps.. ?
>
> Take care,
> -stu
>
> ________________________________
> From: Carson Hoffacker <[email protected]>
> To: [email protected]; Stuart Smith <[email protected]>
> Sent: Wednesday, December 14, 2011 10:29 AM
> Subject: Re: Questions on timestamps, insights on how
> timerange/timestamp filter are processed?
>
> The timerange scan is able to leverage metadata in each of the HFiles.
> Each HFile should store information about the timerange associated
> with the data within the HFile. If the timerange associated with the
> HFile is different from the timerange you are interested in, that
> HFile will be skipped completely. This can significantly increase
> scan performance.
>
> However, when these files get compacted and the data is merged into a
> smaller number of files, the time range associated with each file
> increases. I don't think it works this way out of the box, but I
> believe you can be smart about how you manage compactions over time to
> get the behavior that you want. You could have compactions compact all
> the data from January 2011 into a single file, and then compact all
> the data from February 2011 into a different file.
>
> -Carson
>
> On Wed, Dec 14, 2011 at 9:39 AM, Stuart Smith <[email protected]> wrote:
>
> > Hello Thomas,
> >
> > Someone here could probably provide more help, but to start you
> > off, the only way I've filtered timestamps is to do a scan, and just
> > filter out rows one by one. This definitely sounds like something
> > coprocessors could help with, but I don't really understand those
> > yet, so someone else will have to step up.. or you can really dig
> > into the documentation about them (AFAIK, it's a little bit of
> > custom code that runs on the regionservers that can pre-process
> > your gets.. but don't quote me on that!).
> >
> > But I can say that a major compaction should not affect them - I've
> > never seen it happen, and if it does, I believe that's a bug.
> >
> > Take care,
> > -stu
> >
> > ________________________________
> > From: Steinmaurer Thomas <[email protected]>
> > To: [email protected]
> > Sent: Wednesday, December 14, 2011 12:38 AM
> > Subject: Questions on timestamps, insights on how
> > timerange/timestamp filter are processed?
> >
> > Hello,
> >
> > can anybody share some insights on how timerange/timestamp filters
> > are processed?
> >
> > Basically we intend to use timerange/timestamp filters to process
> > rather new data from an insertion timestamp POV.
> >
> > - How does the process of skipping records and/or regions work if
> > one uses timerange filters?
> > - I also wonder, do timestamps change when e.g. running a major
> > compaction?
> > - If data grows over the years, is there any chance that regions
> > with "older" rows stay "stable" in a way that they can be skipped
> > very quickly when querying data with a timerange filter of e.g. the
> > last three years?
> >
> > Thanks,
> > Thomas
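Carson's description of HFile time-range pruning can be illustrated with a small self-contained sketch. This is a toy model, not HBase internals: the `StoreFile` class, its fields, and the `scan` helper are hypothetical names invented for illustration. Each file records the min/max timestamp of its cells, and a time-range scan skips any file whose metadata range does not overlap the requested window; on the real Java client the window is set with `Scan.setTimeRange(min, max)`.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of HFile time-range pruning: each "file" carries
 *  [minTs, maxTs] metadata, and a time-range scan skips any file
 *  whose range does not overlap the requested window. */
public class TimeRangePruning {

    /** Hypothetical stand-in for an HFile: cells plus time-range metadata. */
    static class StoreFile {
        final long minTs, maxTs;
        final List<Long> cellTimestamps;

        StoreFile(List<Long> cellTimestamps) {
            this.cellTimestamps = cellTimestamps;
            long min = Long.MAX_VALUE, max = Long.MIN_VALUE;
            for (long ts : cellTimestamps) {
                min = Math.min(min, ts);
                max = Math.max(max, ts);
            }
            this.minTs = min;
            this.maxTs = max;
        }

        /** True if the file's [minTs, maxTs] overlaps the window [from, to). */
        boolean mayContain(long from, long to) {
            return maxTs >= from && minTs < to;
        }
    }

    /** Number of files skipped by the last scan, via metadata alone. */
    static int filesSkipped = 0;

    /** Return all cell timestamps in [from, to), pruning whole files first. */
    static List<Long> scan(List<StoreFile> files, long from, long to) {
        List<Long> result = new ArrayList<>();
        filesSkipped = 0;
        for (StoreFile f : files) {
            if (!f.mayContain(from, to)) { // metadata check: skip whole file
                filesSkipped++;
                continue;
            }
            for (long ts : f.cellTimestamps) // otherwise check cell by cell
                if (ts >= from && ts < to) result.add(ts);
        }
        return result;
    }

    public static void main(String[] args) {
        List<StoreFile> files = List.of(
            new StoreFile(List.of(100L, 150L, 190L)),  // "January" file
            new StoreFile(List.of(200L, 250L, 290L)),  // "February" file
            new StoreFile(List.of(300L, 350L, 390L))); // "March" file

        System.out.println(scan(files, 200L, 300L)); // prints [200, 250, 290]
        System.out.println(filesSkipped);            // prints 2
    }
}
```

This also shows why Carson and Stuart agree that when pruning cannot happen (all files overlap the window), the server still does a full scan with per-cell timestamp checks, i.e. roughly the same amount of work as filtering client-side minus the network transfer.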

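Carson's caveat about compaction is also easy to demonstrate: when two store files are merged, the merged file's time-range metadata is the union of the inputs, so a query window that previously skipped both files can no longer skip the merged one. A minimal sketch, with a hypothetical `Range` type standing in for the metadata (not HBase code):

```java
/** Toy illustration of why compaction weakens time-range pruning:
 *  merging two files unions their [min, max] timestamp metadata. */
public class CompactionTimeRange {

    /** Hypothetical stand-in for a store file's time-range metadata. */
    record Range(long min, long max) {
        /** True if this range overlaps the query window [from, to). */
        boolean overlaps(long from, long to) {
            return max >= from && min < to;
        }

        /** Metadata of the file produced by compacting this file with another. */
        Range merge(Range other) {
            return new Range(Math.min(min, other.min), Math.max(max, other.max));
        }
    }

    public static void main(String[] args) {
        Range jan = new Range(100, 199); // "January" file
        Range mar = new Range(300, 399); // "March" file

        // A scan over [200, 300) can skip both files via metadata alone.
        System.out.println(jan.overlaps(200, 300)); // prints false
        System.out.println(mar.overlaps(200, 300)); // prints false

        // After compaction the merged file spans [100, 399] and must be read.
        Range merged = jan.merge(mar);
        System.out.println(merged.overlaps(200, 300)); // prints true
    }
}
```

This is the motivation behind the suggestion in the thread: if compactions can be managed so that files stay partitioned by time period (e.g. one file per month), the ranges stay narrow and pruning keeps working as data grows.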