Thanks Mike for your time and help
On Monday, February 17, 2014, Michael McCandless wrote:
On Mon, Feb 17, 2014 at 8:33 AM, Ravikumar Govindarajan wrote:
>>
>> Well, this will change your scores? MultiReader will sum up all term
>> statistics across all SegmentReaders "up front", and then scoring per
>> segment will use those top-level weights.
Our app needs to do only matching and sorting. In fact, it would be fully
OK to bypass scoring. But I feel
On Fri, Feb 14, 2014 at 12:14 AM, Ravikumar Govindarajan wrote:
> Early-query termination quits by throwing an Exception, right? Is it OK to
> individually search using each SegmentReader and then break off, instead of
> using a MultiReader, especially when the order is known before the search
> begins?
Yeah, now I understand it a bit better.
Since LogMP always merges adjacent segments, that should pretty much serve
my use-case when used with a SortingMP.
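The per-segment alternative asked about above doesn't actually need an exception: you can just visit the leaves in a known order and stop looping. A rough sketch, assuming Lucene 4.x-era APIs (class names differ across versions); `dir`, `query`, and `wanted` are placeholders, and the assumption that newer segments sit later in `reader.leaves()` holds for append-only indexes with adjacent-only merges:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

import org.apache.lucene.index.AtomicReaderContext;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;

// Search each segment individually, newest first, and break off once we
// have enough hits -- no MultiReader, no exception-based termination.
DirectoryReader reader = DirectoryReader.open(dir);
List<AtomicReaderContext> leaves = new ArrayList<>(reader.leaves());
Collections.reverse(leaves); // assumption: newer segments come last in the index

List<ScoreDoc> hits = new ArrayList<>();
for (AtomicReaderContext leaf : leaves) {
  IndexSearcher segSearcher = new IndexSearcher(leaf.reader());
  TopDocs td = segSearcher.search(query, wanted - hits.size());
  // NOTE: doc IDs here are segment-local; add leaf.docBase if you need
  // global doc IDs relative to the top-level reader.
  Collections.addAll(hits, td.scoreDocs);
  if (hits.size() >= wanted) {
    break; // early termination, without throwing anything
  }
}
```

As the thread notes, scoring this way uses per-segment term statistics rather than the top-level ones a MultiReader would compute, which is fine here since scores are being bypassed anyway.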
Hi,
LogMP *always* picks adjacent segments together. Therefore, if you have
segments S1, S2, S3, S4 where the date-wise sort order is S4>S3>S2>S1, then
LogMP will pick S1-S4, S2-S4, S2-S3 and so on. But always adjacent
segments, and in a row (i.e. it doesn't skip segments).
@Mike,
I had suggested the same approach in one of my previous mails, whereby
each segment records min/max timestamps in its seg-info diagnostics and
uses them for merging adjacent segments.
"Then, I define a TimeMergePolicy extends LogMergePolicy and define the
segment-size=Long.MAX_VALUE - SEG_LEAST_
Right, I think you'll need to use either of the LogXMergePolicy (or
subclass LogMergePolicy and make your own): they always pick adjacent
segments to merge.
SortingMP lets you pass in the MP to wrap, so just pass in a LogXMP,
and then sort by timestamp?
Mike McCandless
http://blog.mikemccandless.com
Why not use LogByteSizeMP in conjunction w/ SortingMP? LogMP picks adjacent
segments and SortingMP ensures the merged segment is also sorted.
Shai
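Shai's suggestion could be wired up along these lines. A minimal sketch assuming Lucene 4.x-era APIs (the SortingMergePolicy constructor and Version constant vary by release; `dir` and `analyzer` are placeholders):

```java
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.LogByteSizeMergePolicy;
import org.apache.lucene.index.MergePolicy;
import org.apache.lucene.index.sorter.SortingMergePolicy;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.util.Version;

// LogByteSizeMP picks adjacent segments; SortingMP ensures the merged
// segment comes out sorted by descending timestamp (reverse = true).
Sort sort = new Sort(new SortField("timestamp", SortField.Type.LONG, true));
MergePolicy mp = new SortingMergePolicy(new LogByteSizeMergePolicy(), sort);

IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_46, analyzer);
iwc.setMergePolicy(mp);
IndexWriter writer = new IndexWriter(dir, iwc);
```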
On Wed, Feb 12, 2014 at 3:16 PM, Ravikumar Govindarajan <ravikumar.govindara...@gmail.com> wrote:
Yes, exactly as you have described.
Ex: Consider segments [S1, S2, S3 & S4] in reverse-chronological order,
going for a merge.
While SortingMergePolicy will correctly solve the merge part, it does not
play any role in picking the segments to merge, right?
SMP internally delegates to TieredMer
OK, I see (early termination).
That's a challenge, because you really want the docs sorted backwards
from how they were added right? And, e.g., merged and then searched
in "reverse segment order"?
I think you should be able to do this w/ SortingMergePolicy? And then
use a custom collector that
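The "custom collector" Mike mentions might look roughly like this, assuming the Lucene 4.x Collector API. CollectionTerminatedException is caught by IndexSearcher per segment, so throwing it stops only the current leaf; the `done` flag re-throws on subsequent leaves to stop the whole search. This is an illustrative sketch, not code from the thread:

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.index.AtomicReaderContext;
import org.apache.lucene.search.CollectionTerminatedException;
import org.apache.lucene.search.Collector;
import org.apache.lucene.search.Scorer;

// Collects the first N matching docs in index order. Since each segment
// is internally sorted newest-first (via SortingMergePolicy), the first
// N hits per segment are the newest ones, and we can terminate early.
public class FirstNDocsCollector extends Collector {
  private final int limit;
  private int collected;
  private int docBase;
  private boolean done;
  private final List<Integer> docs = new ArrayList<>();

  public FirstNDocsCollector(int limit) { this.limit = limit; }

  @Override public void setScorer(Scorer scorer) { /* scores unused */ }

  @Override public void setNextReader(AtomicReaderContext ctx) {
    if (done) {
      throw new CollectionTerminatedException(); // skip remaining leaves too
    }
    docBase = ctx.docBase;
  }

  @Override public boolean acceptsDocsOutOfOrder() { return false; }

  @Override public void collect(int doc) {
    docs.add(docBase + doc); // store global doc ID
    if (++collected >= limit) {
      done = true;
      throw new CollectionTerminatedException(); // stop this segment early
    }
  }
}
```

If memory serves, Lucene's misc module later shipped an EarlyTerminatingSortingCollector built on the same trick, which may be worth checking before rolling your own.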
Mike,
All our queries need to be sorted by timestamp field, in descending order
of time. [latest-first]
Each segment is sorted within itself. But TieredMergePolicy picks arbitrary
segments and merges them [even with SortingMergePolicy etc...]. I am trying
to avoid this and see if an approximate globa
LogByteSizeMergePolicy (and LogDocMergePolicy) will preserve the total order.
Only TieredMergePolicy merges out-of-order segments.
I don't understand why you need to encourage merging of the more
recent (by your "time" field) segments...
Mike McCandless
http://blog.mikemccandless.com
Mike,
Each of my flushed segments is fully ordered by time. But TieredMergePolicy
or LogByteSizeMergePolicy is going to pick arbitrary time-segments and
disturb this arrangement, and I wanted some kind of control over this.
But like you pointed out, going by only time-adjacent merges can be
disast
You want to focus merging on the segments containing newer documents?
Why? This seems somewhat dangerous...
Not taking into account the "true" segment size can lead to very, very
poor merge decisions... you should turn on IndexWriter's infoStream
and do a long-running test to convince yourself th
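Turning on infoStream is a one-liner on the config. A sketch assuming Lucene 4.x (`dir` and `analyzer` are placeholders; there is also an `InfoStream`-typed overload):

```java
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.util.Version;

// Log flush and merge decisions to stdout so you can inspect what the
// merge policy is actually choosing during a long-running test.
IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_46, analyzer);
iwc.setInfoStream(System.out);
IndexWriter writer = new IndexWriter(dir, iwc);
```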
Thanks Mike,
Will try your suggestion. I will try to describe the actual use-case itself.
There is a requirement for merging time-adjacent segments [append-only,
rolling time-series data]
All Documents have a timestamp affixed and during flush I need to note down
the least timestamp for all docum
Somewhere in those numeric trie terms are the exact integers from your
documents, encoded.
You can use oal.util.NumericUtils.prefixCodedToInt to get the int
value back from the BytesRef term.
But you need to filter out the "higher level" terms, e.g. using
NumericUtils.getPrefixCodedLongShift(term
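Putting Mike's pointers together, decoding a numeric trie field might look like this. A sketch assuming Lucene 4.x `NumericUtils` (shown here with the long-valued variants, matching `getPrefixCodedLongShift`; use the `Int` counterparts for int fields). The field name `"timestamp"` and `reader` are placeholders:

```java
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.MultiFields;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.NumericUtils;

// Walk the terms of a numeric trie field, keep only the full-precision
// terms (shift == 0), and decode each back to its original long value.
Terms terms = MultiFields.getTerms(reader, "timestamp");
if (terms != null) {
  TermsEnum te = terms.iterator(null);
  BytesRef term;
  while ((term = te.next()) != null) {
    if (NumericUtils.getPrefixCodedLongShift(term) != 0) {
      continue; // skip the "higher level" (lower precision) trie terms
    }
    long value = NumericUtils.prefixCodedToLong(term);
    // ... use the decoded timestamp ...
  }
}
```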