Re: SortingMergePolicy for already sorted segments

2014-06-17 Thread Ravikumar Govindarajan
> These lookups are expensive and will be done millions of times (each term, each DV field, each .. everything).
Yes, I think you have described the issue correctly. There is no way we can achieve speed-ups without a DocMap, especially for repeated lookups/merge IndexWriter relies on this i

Re: SortingMergePolicy for already sorted segments

2014-06-17 Thread Shai Erera
That said... if we generate the global DocMap up front, there's no reason to not execute the merge of the segments more efficiently, i.e. without wrapping them in a SlowCompositeReaderWrapper. But that's not work for SortingMergePolicy, it's either a special SortingAtomicReader which wraps a group

Re: SortingMergePolicy for already sorted segments

2014-06-17 Thread Shai Erera
OK I think I now understand what you're asking :). It's unrelated though to SortingMergePolicy. You propose to do the "merge" part of a merge-sort, since we know the indexes are already sorted, right? This is something we've considered in the past, but it is very tricky (see below) and we went wit

Re: SortingMergePolicy for already sorted segments

2014-06-17 Thread Ravikumar Govindarajan
> Therefore the DocMap is initialized only when the merge actually executes ... what is there more to postpone?
Agreed. However, what I am asking is whether an alternative to DocMap would be better. Please read on.
And besides, if the segments are already sorted, you should return a n

Re: SortingMergePolicy for already sorted segments

2014-06-17 Thread Shai Erera
> I am afraid the DocMap still maintains doc-id mappings till merge and I am trying to avoid it...
What do you mean 'till merge'? The method OneMerge.getMergeReaders() is called only when the merge is executed, not when the MergePolicy decided to merge those segments. Therefore the DocMap is

Re: SortingMergePolicy for already sorted segments

2014-06-17 Thread Ravikumar Govindarajan
I am afraid the DocMap still maintains doc-id mappings till merge and I am trying to avoid it... I think Lucene itself has a MergeIterator in the o.a.l.util package. A MergePolicy can wrap a simple MergeIterator for iterating docs across different AtomicReaders in correct sort-order for a given field
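The idea in the message above (walking several already-sorted segments in one global sort order) is the "merge" step of a merge sort. A minimal sketch in plain Java follows; it uses java.util.PriorityQueue rather than any Lucene class, and the names SortedSegmentMerge, Cursor, and mergeOrder are hypothetical. It assumes each segment's sort keys are available as a long[] indexed by local doc ID, which the thread does not specify.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

public class SortedSegmentMerge {
    /** Cursor over one segment's sort keys; the position is the local doc ID. */
    private static final class Cursor {
        final int segment;
        final long[] keys;
        int doc = 0;

        Cursor(int segment, long[] keys) {
            this.segment = segment;
            this.keys = keys;
        }

        long key() {
            return keys[doc];
        }
    }

    /**
     * Hypothetical helper: given per-segment sort keys that are each already
     * in ascending order, returns {segment, localDoc} pairs in global order.
     * Ties are broken by segment index, so the result is stable.
     */
    public static List<int[]> mergeOrder(long[][] segments) {
        PriorityQueue<Cursor> queue = new PriorityQueue<>((a, b) -> {
            int c = Long.compare(a.key(), b.key());
            return c != 0 ? c : Integer.compare(a.segment, b.segment);
        });
        for (int s = 0; s < segments.length; s++) {
            if (segments[s].length > 0) {
                queue.add(new Cursor(s, segments[s]));
            }
        }
        List<int[]> order = new ArrayList<>();
        while (!queue.isEmpty()) {
            // The smallest remaining key across all segments comes out next.
            Cursor top = queue.poll();
            order.add(new int[] {top.segment, top.doc});
            top.doc++;
            if (top.doc < top.keys.length) {
                queue.add(top); // re-insert with the segment's next key
            }
        }
        return order;
    }
}
```

The position of each pair in the returned list is that document's ID in the merged segment, so the list is effectively a global doc map computed in one pass, with a queue holding one cursor per segment.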

Re: SortingMergePolicy for already sorted segments

2014-06-17 Thread Shai Erera
loadSortTerm is your method, right? In the current Sorter.sort implementation, I see this code:
    boolean sorted = true;
    for (int i = 1; i < maxDoc; ++i) {
      if (comparator.compare(i-1, i) > 0) {
        sorted = false;
        break;
      }
    }
    if (sorted) {
      return null;
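The check quoted above can be tried in isolation. The sketch below is a standalone approximation, not Lucene's code: SortedCheck and its nested DocComparator interface are hypothetical stand-ins, and the comparator is assumed to compare two doc IDs by their sort key.

```java
public class SortedCheck {
    /** Hypothetical stand-in for a doc comparator: compares two doc IDs. */
    public interface DocComparator {
        int compare(int docID1, int docID2);
    }

    /**
     * Returns true when every adjacent pair of docs is in non-descending
     * order, which is the condition under which the quoted snippet returns
     * null instead of building a DocMap.
     */
    public static boolean isSorted(int maxDoc, DocComparator comparator) {
        for (int i = 1; i < maxDoc; ++i) {
            if (comparator.compare(i - 1, i) > 0) {
                return false; // found an out-of-order pair, stop early
            }
        }
        return true;
    }
}
```

The scan costs at most maxDoc - 1 comparisons and stops at the first inversion, so an already sorted segment pays one linear pass and allocates no mapping.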

Re: SortingMergePolicy for already sorted segments

2014-06-16 Thread Ravikumar Govindarajan
Shai, This is the code snippet I use inside my class...
    public class MySorter extends Sorter {
      @Override
      public DocMap sort(AtomicReader reader) throws IOException {
        final Map docVsId = loadSortTerm(reader);
        final Sorter.DocComparator comparator = new Sorter.DocComparator() {
          @Override

Re: SortingMergePolicy for already sorted segments

2014-06-16 Thread Shai Erera
I'm not sure that I follow ... where do you see DocMap being loaded up front? Specifically, Sorter.sort may return null if the readers are already sorted ... I think we already optimized for the case where the readers are sorted. Shai On Tue, Jun 17, 2014 at 4:04 AM, Ravikumar Govindarajan < rav