Re: Actual min and max-value of NumericField during codec flush

2014-02-19 Thread Ravikumar Govindarajan
Thanks, Mike, for your time and help. On Monday, February 17, 2014, Michael McCandless wrote: > On Mon, Feb 17, 2014 at 8:33 AM, Ravikumar Govindarajan > > wrote: > >> > >> Well, this will change your scores? MultiReader will sum up all term > >> statistics across all SegmentReaders "up front", a

Re: Actual min and max-value of NumericField during codec flush

2014-02-17 Thread Michael McCandless
On Mon, Feb 17, 2014 at 8:33 AM, Ravikumar Govindarajan wrote: >> >> Well, this will change your scores? MultiReader will sum up all term >> statistics across all SegmentReaders "up front", and then scoring per >> segment will use those top-level weights. > > > Our app needs to do only matching a

Re: Actual min and max-value of NumericField during codec flush

2014-02-17 Thread Ravikumar Govindarajan
> > Well, this will change your scores? MultiReader will sum up all term > statistics across all SegmentReaders "up front", and then scoring per > segment will use those top-level weights. Our app needs to do only matching and sorting. In-fact, it would be fully OK to by-pass scoring. But I feel

Re: Actual min and max-value of NumericField during codec flush

2014-02-14 Thread Michael McCandless
On Fri, Feb 14, 2014 at 12:14 AM, Ravikumar Govindarajan wrote: > Early-Query termination quits by throwing an Exception, right? Is it OK to > individually search using SegmentReader and then break off, instead of > using a MultiReader, especially when the order is known before search > begins?

Re: Actual min and max-value of NumericField during codec flush

2014-02-13 Thread Ravikumar Govindarajan
Yeah, now I understand a little better. Since LogMP always merges adjacent segments, that should pretty much serve my use-case when used with a SortingMP. Early-Query termination quits by throwing an Exception, right? Is it OK to individually search using SegmentReader and then break off, instead of

Re: Actual min and max-value of NumericField during codec flush

2014-02-12 Thread Shai Erera
Hi, LogMP *always* picks adjacent segments together. Therefore, if you have segments S1, S2, S3, S4 where the date-wise sort order is S4>S3>S2>S1, then LogMP will pick either S1-S4, S2-S4, S2-S3 and so on, but always adjacent segments and in a row (i.e. it doesn't skip segments). I guess what both
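To illustrate why adjacent-only merging keeps the date-wise order of S1..S4 intact, here is a minimal library-free sketch. It is not Lucene's LogMergePolicy; the `Seg` type, the timestamps, and the `mergeAdjacent` / `isTimeOrdered` helpers are all hypothetical names introduced only for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Library-free model of adjacent-only merging. Not Lucene's LogMergePolicy. */
class AdjacentMergeSketch {

    /** Hypothetical stand-in for a segment: just the time range of its documents. */
    static final class Seg {
        final String name;
        final long minTs;
        final long maxTs;

        Seg(String name, long minTs, long maxTs) {
            this.name = name;
            this.minTs = minTs;
            this.maxTs = maxTs;
        }

        @Override
        public String toString() {
            return name + "[" + minTs + ".." + maxTs + "]";
        }
    }

    /** Replaces the adjacent run [from, to) with a single merged segment. */
    static List<Seg> mergeAdjacent(List<Seg> segs, int from, int to) {
        if (from < 0 || to > segs.size() || to - from < 2) {
            throw new IllegalArgumentException("need an adjacent run of at least two segments");
        }
        long min = Long.MAX_VALUE;
        long max = Long.MIN_VALUE;
        StringBuilder name = new StringBuilder();
        for (Seg s : segs.subList(from, to)) {
            min = Math.min(min, s.minTs);
            max = Math.max(max, s.maxTs);
            name.append(s.name);
        }
        List<Seg> out = new ArrayList<>(segs.subList(0, from));
        out.add(new Seg(name.toString(), min, max));
        out.addAll(segs.subList(to, segs.size()));
        return out;
    }

    /** True when every segment ends strictly before the next one starts. */
    static boolean isTimeOrdered(List<Seg> segs) {
        for (int i = 1; i < segs.size(); i++) {
            if (segs.get(i - 1).maxTs >= segs.get(i).minTs) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<Seg> segs = Arrays.asList(
                new Seg("S1", 0, 9), new Seg("S2", 10, 19),
                new Seg("S3", 20, 29), new Seg("S4", 30, 39));
        List<Seg> merged = mergeAdjacent(segs, 1, 3); // merge S2 and S3 only
        System.out.println(merged + " ordered=" + isTimeOrdered(merged));
    }
}
```

Because the merged segment takes the place of the run it replaced, the segment list stays ordered by time as long as the input list was ordered and the run is contiguous.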

Re: Actual min and max-value of NumericField during codec flush

2014-02-12 Thread Ravikumar Govindarajan
@Mike, I had suggested the same approach in one of my previous mails, whereby each segment records min/max timestamps in seg-info diagnostics and uses them for merging adjacent segments. "Then, I define a TimeMergePolicy extends LogMergePolicy and define the segment-size=Long.MAX_VALUE - SEG_LEAST_
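The idea of recording a segment's min/max timestamps as diagnostics can be sketched roughly as follows. This is a library-free illustration, not Lucene code: the diagnostics are modelled as a plain string map, and the key names `minTimestamp` / `maxTimestamp` and the `timeRangeDiagnostics` helper are assumptions made up for this example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Library-free sketch of computing per-segment min/max timestamp diagnostics. */
class FlushTimeRangeSketch {

    // Hypothetical diagnostics keys; the thread does not name them.
    static final String MIN_KEY = "minTimestamp";
    static final String MAX_KEY = "maxTimestamp";

    /** Scans the timestamps of the documents being flushed and returns string diagnostics. */
    static Map<String, String> timeRangeDiagnostics(long[] timestamps) {
        if (timestamps.length == 0) {
            throw new IllegalArgumentException("no documents to flush");
        }
        long min = Long.MAX_VALUE;
        long max = Long.MIN_VALUE;
        for (long ts : timestamps) {
            min = Math.min(min, ts);
            max = Math.max(max, ts);
        }
        Map<String, String> diagnostics = new LinkedHashMap<>();
        diagnostics.put(MIN_KEY, Long.toString(min));
        diagnostics.put(MAX_KEY, Long.toString(max));
        return diagnostics;
    }

    public static void main(String[] args) {
        long[] flushed = {1392200000L, 1392100000L, 1392300000L};
        System.out.println(timeRangeDiagnostics(flushed));
    }
}
```

A merge policy could later parse these two values back with `Long.parseLong` to decide which segments are time-adjacent; how they would actually be attached to a segment is left open here.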

Re: Actual min and max-value of NumericField during codec flush

2014-02-12 Thread Michael McCandless
Right, I think you'll need to use either of the LogXMergePolicy (or subclass LogMergePolicy and make your own): they always pick adjacent segments to merge. SortingMP lets you pass in the MP to wrap, so just pass in a LogXMP, and then sort by timestamp? Mike McCandless http://blog.mikemccandles

Re: Actual min and max-value of NumericField during codec flush

2014-02-12 Thread Shai Erera
Why not use LogByteSizeMP in conjunction w/ SortingMP? LogMP picks adjacent segments and SortingMP ensures the merged segment is also sorted. Shai On Wed, Feb 12, 2014 at 3:16 PM, Ravikumar Govindarajan < ravikumar.govindara...@gmail.com> wrote: > Yes exactly as you have described. > > Ex: Cons

Re: Actual min and max-value of NumericField during codec flush

2014-02-12 Thread Ravikumar Govindarajan
Yes, exactly as you have described. Ex: Consider segments [S1, S2, S3 & S4] that are in reverse-chronological order and go for a merge. While SortingMergePolicy will correctly solve the merge part, it does not, however, play any role in picking segments to merge, right? SMP internally delegates to TieredMer

Re: Actual min and max-value of NumericField during codec flush

2014-02-12 Thread Michael McCandless
OK, I see (early termination). That's a challenge, because you really want the docs sorted backwards from how they were added right? And, e.g., merged and then searched in "reverse segment order"? I think you should be able to do this w/ SortingMergePolicy? And then use a custom collector that
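The "reverse segment order" search with early termination can be sketched without Lucene as follows. This is only a model under stated assumptions: each segment is reduced to an array of matching timestamps already sorted newest-first, segments do not overlap in time and are listed oldest to newest, and the `topN` helper is a hypothetical name, not a Lucene collector.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Library-free model of newest-first search with early termination. */
class NewestFirstSearchSketch {

    /**
     * Collects the n most recent hits.
     *
     * @param segments segments listed oldest to newest; each array holds the matching
     *                 timestamps of one segment, sorted in descending order
     */
    static List<Long> topN(List<long[]> segments, int n) {
        List<Long> hits = new ArrayList<>();
        // Walk segments newest-first and stop as soon as n hits are collected.
        for (int i = segments.size() - 1; i >= 0 && hits.size() < n; i--) {
            for (long ts : segments.get(i)) {
                if (hits.size() == n) {
                    break; // early termination: older hits cannot rank higher
                }
                hits.add(ts);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<long[]> segments = Arrays.asList(
                new long[] {9, 5, 1},     // oldest segment
                new long[] {19, 15},
                new long[] {29, 25, 21}); // newest segment
        System.out.println(topN(segments, 4));
    }
}
```

The early exit is only safe because of the two assumptions above: sorted documents inside each segment, and no time overlap between segments.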

Re: Actual min and max-value of NumericField during codec flush

2014-02-12 Thread Ravikumar Govindarajan
Mike, All our queries need to be sorted by timestamp field, in descending order of time. [latest-first] Each segment is sorted in itself. But TieredMergePolicy picks arbitrary segments and merges them [even with SortingMergePolicy etc...]. I am trying to avoid this and see if an approximate globa

Re: Actual min and max-value of NumericField during codec flush

2014-02-07 Thread Michael McCandless
LogByteSizeMergePolicy (and LogDocMergePolicy) will preserve the total order. Only TieredMergePolicy merges out-of-order segments. I don't understand why you need to encourage merging of the more recent (by your "time" field) segments... Mike McCandless http://blog.mikemccandless.com On Fri

Re: Actual min and max-value of NumericField during codec flush

2014-02-07 Thread Ravikumar Govindarajan
Mike, each of my flushed segments is fully ordered by time. But TieredMergePolicy or LogByteSizeMergePolicy is going to pick arbitrary time-segments and disturb this arrangement, and I wanted some kind of control over this. But like you pointed out, going by only time-adjacent merges can be disast

Re: Actual min and max-value of NumericField during codec flush

2014-02-07 Thread Michael McCandless
You want to focus merging on the segments containing newer documents? Why? This seems somewhat dangerous... Not taking into account the "true" segment size can lead to very very poor merge decisions ... you should turn on IndexWriter's infoStream and do a long running test to convince yourself th

Re: Actual min and max-value of NumericField during codec flush

2014-02-06 Thread Ravikumar Govindarajan
Thanks Mike, will try your suggestion. I will try to describe the actual use-case itself. There is a requirement for merging time-adjacent segments [append-only, rolling time-series data]. All documents have a timestamp affixed, and during flush I need to note down the least timestamp for all docum

Re: Actual min and max-value of NumericField during codec flush

2014-02-06 Thread Michael McCandless
Somewhere in those numeric trie terms are the exact integers from your documents, encoded. You can use oal.util.NumericUtils.prefixCodedToInt to get the int value back from the BytesRef term. But you need to filter out the "higher level" terms, e.g. using NumericUtils.getPrefixCodedLongShift(term