[jira] [Commented] (SOLR-2976) stats.facet no longer works on single valued trie fields that don't use precision step

2013-02-23 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13585163#comment-13585163
 ] 

Adrien Grand commented on SOLR-2976:


bq. if precisionStep != 0, faceting on a single-valued numeric field builds an 
UninvertedField 

I think the last commits on SOLR-3855 fix it (they even make faceting use the 
numeric field caches instead of the terms index).

 stats.facet no longer works on single valued trie fields that don't use 
 precision step
 --

 Key: SOLR-2976
 URL: https://issues.apache.org/jira/browse/SOLR-2976
 Project: Solr
  Issue Type: Bug
Affects Versions: 3.5
Reporter: Hoss Man
 Attachments: SOLR-2976_3.4_test.patch, SOLR-2976.patch


 As reported on the mailing list, 3.5 introduced a regression that prevents 
 single valued Trie fields that don't use precision steps (to add coarse 
 grained terms) from being used in stats.facet.
 two immediately obvious problems...
 1) in 3.5 the stats component is checking if isTokenized() is true for the 
 field type (which is probably wise) but regardless of the precisionStep used, 
 TrieField.isTokenized is hardcoded to return true
 2) the 3.5 stats faceting will fail if the FieldType is multivalued - it 
 doesn't check if the SchemaField is configured to be single valued 
 (overriding the FieldType)
 so even if a user has something like this in their schema...
 {code}
 <fieldType name="long" class="solr.TrieLongField" precisionStep="0" 
 omitNorms="true" />
 <field name="ts" type="long" indexed="true" stored="true" required="true" 
 multiValued="false" />
 {code}
 ...stats.facet will not work.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4792) Smaller doc maps

2013-02-24 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13585490#comment-13585490
 ] 

Adrien Grand commented on LUCENE-4792:
--

In case someone would like to use this class, I'd add that:
 - the encoded sequence does not strictly need to be monotonic: it can encode 
any sequence of values but it compresses best when the stream contains 
monotonic sub-sequences of 1024 longs at least (for example it would have a 
good compression ratio if there are first 1 increasing values and then 5000 
decreasing values),
 - it can address up to 2^42 values,
 - there are writer/reader equivalents called MonotonicBlockPackedWriter and 
MonotonicBlockPackedReader (which can either load values in memory or read from 
disk).
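
To illustrate why monotonic sub-sequences compress well, here is a minimal sketch of the general idea behind block-wise monotonic encoding: a block keeps its first value and an average slope, and stores only each value's deviation from that line. This is an assumption about the technique for illustration, not the actual MonotonicAppendingLongBuffer code; the class and method names are hypothetical.

```java
// Hypothetical sketch, not Lucene's implementation: encode one block of longs
// as (first value, average slope, per-value deviation from the expected line).
final class MonotonicBlockSketch {

  final long base;         // first value of the block
  final float avgSlope;    // average increment between consecutive values
  final long[] deviations; // actual - expected; small when the block is near-linear

  MonotonicBlockSketch(long[] block) {
    int n = block.length; // assumed to be at least 1
    base = block[0];
    avgSlope = n == 1 ? 0f : (float) (block[n - 1] - block[0]) / (n - 1);
    deviations = new long[n];
    for (int i = 0; i < n; i++) {
      deviations[i] = block[i] - expected(i);
    }
  }

  // Value predicted for index i by the line through the block.
  private long expected(int i) {
    return base + (long) (avgSlope * i);
  }

  // Decodes the value at index i.
  long get(int i) {
    return expected(i) + deviations[i];
  }

  // Bits needed per deviation if they were packed after subtracting the minimum.
  int bitsPerValue() {
    long min = Long.MAX_VALUE;
    long max = Long.MIN_VALUE;
    for (long d : deviations) {
      min = Math.min(min, d);
      max = Math.max(max, d);
    }
    return 64 - Long.numberOfLeadingZeros(max - min);
  }
}
```

A perfectly linear block needs zero bits per deviation in this sketch, while an arbitrary block still round-trips but needs more bits, which matches the "compresses best on monotonic sub-sequences" remark above.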

 Smaller doc maps
 

 Key: LUCENE-4792
 URL: https://issues.apache.org/jira/browse/LUCENE-4792
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Adrien Grand
Assignee: Adrien Grand
Priority: Minor
 Fix For: 4.2

 Attachments: LUCENE-4792.patch


 MergeState.DocMap could leverage MonotonicAppendingLongBuffer to save memory.




[jira] [Commented] (LUCENE-4795) Add FacetsCollector based on SortedSetDocValues

2013-02-25 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13585878#comment-13585878
 ] 

Adrien Grand commented on LUCENE-4795:
--

Not having to manage a taxonomy index is very appealing to me!

What about collecting based on segment ords and bulk translating these ords to 
the global ords in setNextReader and when the collection ends? This way 
ordinalMap.get would be called less often (once per value per segment instead 
of once per value per doc per segment) and in a sequential way so I assume it 
would be faster while remaining easy to implement?
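
A rough sketch of the suggestion above, using plain arrays as stand-ins for the doc values and for the ordinal map (both are assumptions, as are all names): counts are accumulated by segment ord during collection, and each segment ord is translated to its global ord only once when the segment is finished.

```java
// Hypothetical sketch: count facet hits by segment ordinal while collecting,
// then translate each segment ordinal to its global ordinal once per segment.
final class SegmentOrdCounter {

  private final int[] globalCounts;

  SegmentOrdCounter(int globalOrdCount) {
    globalCounts = new int[globalOrdCount];
  }

  /**
   * @param docOrds     for each collected doc, the segment ordinals of its values
   * @param segToGlobal maps a segment ordinal to a global ordinal
   *                    (stand-in for an ordinal map lookup)
   */
  void collectSegment(int[][] docOrds, int[] segToGlobal) {
    int[] segCounts = new int[segToGlobal.length];
    // Once per value per doc: only a cheap array increment.
    for (int[] ords : docOrds) {
      for (int ord : ords) {
        segCounts[ord]++;
      }
    }
    // Once per value per segment, in sequential ordinal order.
    for (int ord = 0; ord < segCounts.length; ord++) {
      if (segCounts[ord] != 0) {
        globalCounts[segToGlobal[ord]] += segCounts[ord];
      }
    }
  }

  int count(int globalOrd) {
    return globalCounts[globalOrd];
  }
}
```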

 Add FacetsCollector based on SortedSetDocValues
 ---

 Key: LUCENE-4795
 URL: https://issues.apache.org/jira/browse/LUCENE-4795
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/facet
Reporter: Michael McCandless
 Attachments: LUCENE-4795.patch, pleaseBenchmarkMe.patch


 Recently (LUCENE-4765) we added multi-valued DocValues field
 (SortedSetDocValuesField), and this can be used for faceting in Solr
 (SOLR-4490).  I think we should also add support in the facet module?
 It'd be an option with different tradeoffs.  Eg, it wouldn't require
 the taxonomy index, since the main index handles label/ord resolving.
 There are at least two possible approaches:
   * On every reopen, build the seg -> global ord map, and then on
 every collect, get the seg ord, map it to the global ord space,
 and increment counts.  This adds cost during reopen in proportion
 to number of unique terms ...
   * On every collect, increment counts based on the seg ords, and then
 do a merge in the end just like distributed faceting does.
 The first approach is much easier so I built a quick prototype using
 that.  The prototype does the counting, but it does NOT do the top K
 facets gathering in the end, and it doesn't know parent/child ord
 relationships, so there's tons more to do before this is real.  I also
 was unsure how to properly integrate it since the existing classes
 seem to expect that you use a taxonomy index to resolve ords.
 I ran a quick performance test.  base = trunk except I disabled the
 compute top-K in FacetsAccumulator to make the comparison fair; comp
 = using the prototype collector in the patch:
 {noformat}
 Task              QPS base    StdDev  QPS comp    StdDev  Pct diff
 OrHighLow            18.79    (2.5%)     14.36    (3.3%)    -23.6% ( -28% -  -18%)
 HighTerm             21.58    (2.4%)     16.53    (3.7%)    -23.4% ( -28% -  -17%)
 OrHighMed            18.20    (2.5%)     13.99    (3.3%)    -23.2% ( -28% -  -17%)
 Prefix3              14.37    (1.5%)     11.62    (3.5%)    -19.1% ( -23% -  -14%)
 LowTerm             130.80    (1.6%)    106.95    (2.4%)    -18.2% ( -21% -  -14%)
 OrHighHigh            9.60    (2.6%)      7.88    (3.5%)    -17.9% ( -23% -  -12%)
 AndHighHigh          24.61    (0.7%)     20.74    (1.9%)    -15.7% ( -18% -  -13%)
 Fuzzy1               49.40    (2.5%)     43.48    (1.9%)    -12.0% ( -15% -   -7%)
 MedSloppyPhrase      27.06    (1.6%)     23.95    (2.3%)    -11.5% ( -15% -   -7%)
 MedTerm              51.43    (2.0%)     46.21    (2.7%)    -10.2% ( -14% -   -5%)
 IntNRQ                4.02    (1.6%)      3.63    (4.0%)     -9.7% ( -15% -   -4%)
 Wildcard             29.14    (1.5%)     26.46    (2.5%)     -9.2% ( -13% -   -5%)
 HighSloppyPhrase      0.92    (4.5%)      0.87    (5.8%)     -5.4% ( -15% -    5%)
 MedSpanNear          29.51    (2.5%)     27.94    (2.2%)     -5.3% (  -9% -    0%)
 HighSpanNear          3.55    (2.4%)      3.38    (2.0%)     -4.9% (  -9% -    0%)
 AndHighMed          108.34    (0.9%)    104.55    (1.1%)     -3.5% (  -5% -   -1%)
 LowSloppyPhrase      20.50    (2.0%)     20.09    (4.2%)     -2.0% (  -8% -    4%)
 LowPhrase            21.60    (6.0%)     21.26    (5.1%)     -1.6% ( -11% -   10%)
 Fuzzy2               53.16    (3.9%)     52.40    (2.7%)     -1.4% (  -7% -    5%)
 LowSpanNear           8.42    (3.2%)      8.45    (3.0%)      0.3% (  -5% -    6%)
 Respell              45.17    (4.3%)     45.38    (4.4%)      0.5% (  -7% -    9%)
 MedPhrase           113.93    (5.8%)    115.02    (4.9%)      1.0% (  -9% -   12%)
 AndHighLow          596.42    (2.5%)    617.12    (2.8%)      3.5% (  -1% -    8%)
 HighPhrase           17.30   (10.5%)     18.36    (9.1%)      6.2% ( -12% -   28%)
 {noformat}

[jira] [Commented] (SOLR-4490) add support for multivalued docvalues

2013-02-25 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13586349#comment-13586349
 ] 

Adrien Grand commented on SOLR-4490:


+1 

 add support  for multivalued docvalues
 --

 Key: SOLR-4490
 URL: https://issues.apache.org/jira/browse/SOLR-4490
 Project: Solr
  Issue Type: New Feature
Reporter: Robert Muir
 Attachments: SOLR-4490.patch, SOLR-4490.patch


 exposing LUCENE-4765 essentially. 
 I think we don't need any new options, it just means doing the right thing 
 when someone has docValues=true and multivalued=true.




[jira] [Commented] (LUCENE-4752) Merge segments to sort them

2013-03-04 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13592107#comment-13592107
 ] 

Adrien Grand commented on LUCENE-4752:
--

I think a very simple first step could be to have an experimental 
IndexWriterConfig option to tell IndexWriter to provide an atomic sorted view 
(easy once LUCENE-3918 is committed) of the segments to merge to SegmentMerger 
instead of the segments themselves (see calls to 
SegmentMerger.add(SegmentReader) in IndexWriter.mergeMiddle). Initial segments 
would remain unsorted, but the big ones, those that are interesting for both 
index compression and early query termination, would be sorted.

It can seem inefficient to sort segments over and over but I don't think we 
should worry too much:
 - if we are merging initial segments (those created from IndexWriter.flush), 
they would be small so sorting/merging them would be fast?
 - if we are merging big segments, I think that the following reasons could 
make merging slower than a regular merge:
   1. computing the new doc ID mapping,
   2. random I/O access,
   3. not being able to use the specialized codec merging methods.

But if the segments to merge are sorted, computing the new doc ID mapping could 
be actually fast (some sorting algorithms such as 
[TimSort|http://en.wikipedia.org/wiki/Timsort] perform in O(n) when the input 
is a succession of sorted sequences), and the access patterns to the individual 
segments would be I/O cache-friendly (because each segment would be read 
sequentially). So I think this approach could be fast enough?
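
The TimSort remark can be checked with a small experiment. The sketch below counts the comparisons made by java.util.Arrays.sort on an object array (a TimSort variant in OpenJDK 7 and later) for an input made of a few sorted runs, standing in for already-sorted segments, versus the same values shuffled. The class name, sizes and seed are arbitrary choices for illustration.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Random;

// Sketch: compare the work done when sorting a succession of sorted runs
// with the work done when sorting the same values in random order.
final class SortedRunsSketch {

  // Builds `runs` ascending sequences of `runLength` values each, back to back.
  // The runs interleave in value, so sorting still has to merge them.
  static Integer[] sortedRuns(int runs, int runLength) {
    Integer[] values = new Integer[runs * runLength];
    for (int r = 0; r < runs; r++) {
      for (int i = 0; i < runLength; i++) {
        values[r * runLength + i] = i * runs + r;
      }
    }
    return values;
  }

  // Same values as sortedRuns, shuffled with a fixed seed.
  static Integer[] shuffled(int runs, int runLength) {
    Integer[] values = sortedRuns(runs, runLength);
    Collections.shuffle(Arrays.asList(values), new Random(42));
    return values;
  }

  // Sorts in place and returns how many comparisons the sort performed.
  static long sortAndCount(Integer[] values) {
    final long[] comparisons = {0};
    Arrays.sort(values, (a, b) -> {
      comparisons[0]++;
      return Integer.compare(a, b);
    });
    return comparisons[0];
  }
}
```

Calling sortAndCount(sortedRuns(4, 2500)) and sortAndCount(shuffled(4, 2500)) should show noticeably fewer comparisons for the run-structured input; the exact numbers depend on the JDK.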

 Merge segments to sort them
 ---

 Key: LUCENE-4752
 URL: https://issues.apache.org/jira/browse/LUCENE-4752
 Project: Lucene - Core
  Issue Type: New Feature
  Components: core/index
Reporter: David Smiley
Assignee: Adrien Grand

 It would be awesome if Lucene could write the documents out in a segment 
 based on a configurable order.  This of course applies to merging segments 
 too. The benefit is increased locality on disk of documents that are likely to 
 be accessed together.  This often applies to documents near each other in 
 time, but also spatially.




[jira] [Commented] (LUCENE-4752) Merge segments to sort them

2013-03-04 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13592366#comment-13592366
 ] 

Adrien Grand commented on LUCENE-4752:
--

bq. How can you early terminate a query for a single segment? [...] To early 
terminate efficiently, you must have the segments in a consistent order, e.g. 
S1  S2  S3.

I think this is just an API limitation? Segments being processed independently, 
we should be able to terminate collection on a per-segment basis? 

bq. instead of stuffing into IWC what seems like a random setting 
(pick-segments-for-sorting), we should have something more generic, like 
AtomicReaderFactory

I didn't mean this should be a boolean. Of course it should be something more 
flexible/configurable! I'm very bad at picking names, but following your naming 
suggestion, we could have something like
{code}
abstract class AtomicReaderFactory {
  abstract List<AtomicReader> reorder(List<SegmentReader> segmentReaders);
}
{code}?

The default impl would be the identity whereas the sorting impl would return a 
singleton containing a sorted view over the segment readers?
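
A self-contained sketch of these two implementations, with tiny stand-in types in place of AtomicReader and SegmentReader (every name below is hypothetical, and the actual doc ID remapping of a sorted view is left out):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Stand-ins for AtomicReader / SegmentReader so the sketch compiles on its own.
interface SketchReader {
  int maxDoc();
}

final class SketchSegment implements SketchReader {
  private final int maxDoc;

  SketchSegment(int maxDoc) {
    this.maxDoc = maxDoc;
  }

  @Override
  public int maxDoc() {
    return maxDoc;
  }
}

// One reader exposing several segments as a single unit. A real sorting view
// would also remap doc IDs into sorted order; that part is omitted here.
final class SketchMergedView implements SketchReader {
  private final List<SketchSegment> segments;

  SketchMergedView(List<SketchSegment> segments) {
    this.segments = segments;
  }

  @Override
  public int maxDoc() {
    int sum = 0;
    for (SketchSegment s : segments) {
      sum += s.maxDoc();
    }
    return sum;
  }
}

abstract class ReaderFactorySketch {
  abstract List<SketchReader> reorder(List<SketchSegment> segments);

  // Default: hand the segments to the merger unchanged.
  static final ReaderFactorySketch IDENTITY = new ReaderFactorySketch() {
    @Override
    List<SketchReader> reorder(List<SketchSegment> segments) {
      return new ArrayList<SketchReader>(segments);
    }
  };

  // Sorting variant: a singleton list holding one view over all segments.
  static final ReaderFactorySketch SINGLE_VIEW = new ReaderFactorySketch() {
    @Override
    List<SketchReader> reorder(List<SketchSegment> segments) {
      return Collections.<SketchReader>singletonList(new SketchMergedView(segments));
    }
  };
}
```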

bq. Also, a custom SegmentMerger to implement the zig-zag merge would help too.

This is another option. I actually started exploring this option when David 
opened this issue, but it can become complicated, especially for postings lists 
merging, whereas reusing the sorted view from LUCENE-3918 would make merging 
trivial.

 Merge segments to sort them
 ---

 Key: LUCENE-4752
 URL: https://issues.apache.org/jira/browse/LUCENE-4752
 Project: Lucene - Core
  Issue Type: New Feature
  Components: core/index
Reporter: David Smiley
Assignee: Adrien Grand

 It would be awesome if Lucene could write the documents out in a segment 
 based on a configurable order.  This of course applies to merging segments 
 too. The benefit is increased locality on disk of documents that are likely to 
 be accessed together.  This often applies to documents near each other in 
 time, but also spatially.




[jira] [Commented] (LUCENE-3918) Port index sorter to trunk APIs

2013-03-06 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13594650#comment-13594650
 ] 

Adrien Grand commented on LUCENE-3918:
--

Thanks for your work Shai. Indeed it looks really good now! Here are a few 
suggestions/questions:
 - Are there actual use-cases for sorting by stored fields or payloads? If not 
I think we should remove StoredFieldsSorter and PayloadSorter?
 - Remove IndexSorter.java and make SortDoc package-private?

{code}
+  // we cannot reuse the given DocsAndPositionsEnum because we return our
+  // own wrapper, and not all Codecs like it.
{code}

Maybe we could check if the docs enum to reuse is an instance of 
SortingDocsEnum and reuse its wrapped DocsEnum?



 Port index sorter to trunk APIs
 ---

 Key: LUCENE-3918
 URL: https://issues.apache.org/jira/browse/LUCENE-3918
 Project: Lucene - Core
  Issue Type: Task
  Components: modules/other
Affects Versions: 4.0-ALPHA
Reporter: Robert Muir
 Fix For: 4.2, 5.0

 Attachments: LUCENE-3918.patch, LUCENE-3918.patch, LUCENE-3918.patch, 
 LUCENE-3918.patch, LUCENE-3918.patch, LUCENE-3918.patch, LUCENE-3918.patch, 
 LUCENE-3918.patch, LUCENE-3918.patch, LUCENE-3918.patch


 LUCENE-2482 added an IndexSorter to 3.x, but we need to port this
 functionality to 4.0 apis.




[jira] [Commented] (LUCENE-3918) Port index sorter to trunk APIs

2013-03-06 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13594787#comment-13594787
 ] 

Adrien Grand commented on LUCENE-3918:
--

Regarding PayloadSorter and StoredFieldsSorter I'm just afraid that the fact 
that they exist might make users think these are viable options...

bq. IndexSorter is a convenient utility for sorting a Directory end-to-end. Why 
remove it?

I think taking an AtomicReader as an argument (instead of a Directory) and 
feeding an IndexWriter (instead of another Directory) would be much more 
flexible but then it would just be a call to IndexWriter.addIndexes... If we 
want a utility to sort indexes, maybe it should rather be something callable 
from the command line? (java oal.index.sorter.IndexSorter fromDir toDir sortField)

bq. Get rid of SortDoc. Sorter is now abstract class with a helper int[] 
compute(int[] docs, T[] values)

I think it's better! Maybe a List instead of an array would be even better so 
that NumericDocValuesSorter could use a view over the doc values instead of 
loading all of them into memory?

Reuse of DocsEnum looks great!






 Port index sorter to trunk APIs
 ---

 Key: LUCENE-3918
 URL: https://issues.apache.org/jira/browse/LUCENE-3918
 Project: Lucene - Core
  Issue Type: Task
  Components: modules/other
Affects Versions: 4.0-ALPHA
Reporter: Robert Muir
 Fix For: 4.2, 5.0

 Attachments: LUCENE-3918.patch, LUCENE-3918.patch, LUCENE-3918.patch, 
 LUCENE-3918.patch, LUCENE-3918.patch, LUCENE-3918.patch, LUCENE-3918.patch, 
 LUCENE-3918.patch, LUCENE-3918.patch, LUCENE-3918.patch, LUCENE-3918.patch


 LUCENE-2482 added an IndexSorter to 3.x, but we need to port this
 functionality to 4.0 apis.




[jira] [Updated] (LUCENE-3918) Port index sorter to trunk APIs

2013-03-06 Thread Adrien Grand (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand updated LUCENE-3918:
-

Attachment: LUCENE-3918.patch

bq. I use two parallel arrays to sort the documents (docs and values)

I updated the patch to use doc IDs as ords so that values are never swapped 
(only doc IDs) and the numeric doc values don't need to be all loaded in memory.
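
A minimal sketch of that idea, not the patch itself: only an array of doc IDs is permuted, and values are looked up on demand through a function, so they never have to be copied into a parallel array. The names are hypothetical, and IntToLongFunction (Java 8) stands in for a numeric doc values lookup.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.function.IntToLongFunction;

// Hypothetical sketch: sort doc IDs by a per-document value that is fetched
// lazily, instead of sorting two parallel arrays (docs and values).
final class DocIdSortSketch {

  // Returns newToOld: position i holds the old doc ID that becomes doc i.
  static int[] sortDocs(int maxDoc, final IntToLongFunction valueOfDoc) {
    Integer[] docs = new Integer[maxDoc];
    for (int i = 0; i < maxDoc; i++) {
      docs[i] = i;
    }
    Comparator<Integer> byValue =
        (a, b) -> Long.compare(valueOfDoc.applyAsLong(a), valueOfDoc.applyAsLong(b));
    Arrays.sort(docs, byValue); // stable: ties keep their original doc order
    int[] newToOld = new int[maxDoc];
    for (int i = 0; i < maxDoc; i++) {
      newToOld[i] = docs[i];
    }
    return newToOld;
  }
}
```

The boxed Integer[] is only there to keep the sketch short with the JDK sort; a real implementation would presumably sort primitive ints with a custom sorter.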

bq. So one option is to remove the class, but still keep a test around which 
does the addIndexes to make sure it works.

+1

bq. I don't want however to add a main that is limited to NumericDV ... and I 
do think that stored fields / payload value are viable options.

I still don't get why someone would use stored fields rather than doc values 
(either binary, sorted or numeric) to sort his index. I think it's important to 
make users understand that stored fields are only useful to display results?

 Port index sorter to trunk APIs
 ---

 Key: LUCENE-3918
 URL: https://issues.apache.org/jira/browse/LUCENE-3918
 Project: Lucene - Core
  Issue Type: Task
  Components: modules/other
Affects Versions: 4.0-ALPHA
Reporter: Robert Muir
 Fix For: 4.2, 5.0

 Attachments: LUCENE-3918.patch, LUCENE-3918.patch, LUCENE-3918.patch, 
 LUCENE-3918.patch, LUCENE-3918.patch, LUCENE-3918.patch, LUCENE-3918.patch, 
 LUCENE-3918.patch, LUCENE-3918.patch, LUCENE-3918.patch, LUCENE-3918.patch, 
 LUCENE-3918.patch, LUCENE-3918.patch


 LUCENE-2482 added an IndexSorter to 3.x, but we need to port this
 functionality to 4.0 apis.




[jira] [Commented] (LUCENE-4752) Merge segments to sort them

2013-03-10 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13598331#comment-13598331
 ] 

Adrien Grand commented on LUCENE-4752:
--

bq. the SortingSegmentMerger will accumulate the readers in add(SegmentReader) 
and open a SortingAtomicReader over a MultiReader of all SegReaders... what do 
you think?

I think this is a good idea!

However, I don't understand this global sorting issue. What would it bring?

 Merge segments to sort them
 ---

 Key: LUCENE-4752
 URL: https://issues.apache.org/jira/browse/LUCENE-4752
 Project: Lucene - Core
  Issue Type: New Feature
  Components: core/index
Reporter: David Smiley
Assignee: Adrien Grand

 It would be awesome if Lucene could write the documents out in a segment 
 based on a configurable order.  This of course applies to merging segments 
 too. The benefit is increased locality on disk of documents that are likely to 
 be accessed together.  This often applies to documents near each other in 
 time, but also spatially.




[jira] [Created] (LUCENE-4830) Sorter API: use an abstract doc map instead of an array

2013-03-13 Thread Adrien Grand (JIRA)
Adrien Grand created LUCENE-4830:


 Summary: Sorter API: use an abstract doc map instead of an array
 Key: LUCENE-4830
 URL: https://issues.apache.org/jira/browse/LUCENE-4830
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Adrien Grand
Assignee: Adrien Grand
Priority: Trivial
 Fix For: 4.3


The sorter API uses arrays to store the old->new and new->old doc ID mappings. 
It should rather be an abstract class given that in some cases an array is not 
required at all (reverse mapping for example).
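
As a sketch of what such an abstraction could look like (the names and signatures here are assumptions for illustration, not necessarily those of the patch), the reverse mapping is computed on the fly while an arbitrary permutation is still backed by arrays:

```java
// Hypothetical sketch of an abstract doc map with two implementations.
abstract class DocMapSketch {

  abstract int oldToNew(int docID);

  abstract int newToOld(int docID);

  abstract int size();

  // Reversed order: no array needed, both directions are a simple subtraction.
  static DocMapSketch reverse(final int maxDoc) {
    return new DocMapSketch() {
      @Override
      int oldToNew(int docID) {
        return maxDoc - 1 - docID;
      }

      @Override
      int newToOld(int docID) {
        return maxDoc - 1 - docID;
      }

      @Override
      int size() {
        return maxDoc;
      }
    };
  }

  // Arbitrary permutation: backed by arrays, as in the array-based API.
  static DocMapSketch fromArray(final int[] newToOld) {
    final int[] oldToNew = new int[newToOld.length];
    for (int i = 0; i < newToOld.length; i++) {
      oldToNew[newToOld[i]] = i;
    }
    return new DocMapSketch() {
      @Override
      int oldToNew(int docID) {
        return oldToNew[docID];
      }

      @Override
      int newToOld(int docID) {
        return newToOld[docID];
      }

      @Override
      int size() {
        return newToOld.length;
      }
    };
  }
}
```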




[jira] [Updated] (LUCENE-4830) Sorter API: use an abstract doc map instead of an array

2013-03-13 Thread Adrien Grand (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand updated LUCENE-4830:
-

Attachment: LUCENE-4830.patch

Patch. I also changed SortingAtomicReader.liveDocs() to be a view over the 
original liveDocs.
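
For readers unfamiliar with the "view" idea, here is a small sketch under stated assumptions: BitsSketch is a stand-in for a live docs bit set, and newToOld is an already computed doc ID mapping. The view answers lookups by delegating to the original bits, so no second bit set is materialized. All names are hypothetical.

```java
// Stand-in for a read-only bit set of live documents.
interface BitsSketch {
  boolean get(int index);

  int length();
}

// Simple array-backed bits, used here as the "original" live docs.
final class ArrayBitsSketch implements BitsSketch {
  private final boolean[] bits;

  ArrayBitsSketch(boolean[] bits) {
    this.bits = bits;
  }

  @Override
  public boolean get(int index) {
    return bits[index];
  }

  @Override
  public int length() {
    return bits.length;
  }
}

// View over the original live docs, addressed by new (sorted) doc IDs.
final class SortedLiveDocsView implements BitsSketch {
  private final BitsSketch in;
  private final int[] newToOld;

  SortedLiveDocsView(BitsSketch in, int[] newToOld) {
    this.in = in;
    this.newToOld = newToOld;
  }

  @Override
  public boolean get(int newDocID) {
    return in.get(newToOld[newDocID]); // translate, then delegate
  }

  @Override
  public int length() {
    return in.length();
  }
}
```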

 Sorter API: use an abstract doc map instead of an array
 ---

 Key: LUCENE-4830
 URL: https://issues.apache.org/jira/browse/LUCENE-4830
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Adrien Grand
Assignee: Adrien Grand
Priority: Trivial
 Fix For: 4.3

 Attachments: LUCENE-4830.patch


 The sorter API uses arrays to store the old->new and new->old doc ID 
 mappings. It should rather be an abstract class given that in some cases an 
 array is not required at all (reverse mapping for example).




[jira] [Created] (LUCENE-4833) Fix default MergePolicy in IndexWriterConfig

2013-03-14 Thread Adrien Grand (JIRA)
Adrien Grand created LUCENE-4833:


 Summary: Fix default MergePolicy in IndexWriterConfig
 Key: LUCENE-4833
 URL: https://issues.apache.org/jira/browse/LUCENE-4833
 Project: Lucene - Core
  Issue Type: Bug
Reporter: Adrien Grand
Assignee: Adrien Grand
Priority: Minor


Although the default merge policy is TieredMergePolicy (as documented in 
IndexWriterConfig constructor), setMergePolicy assumes that the default is 
LogByteSizeMergePolicy.




[jira] [Updated] (LUCENE-4833) Fix default MergePolicy in IndexWriterConfig

2013-03-14 Thread Adrien Grand (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand updated LUCENE-4833:
-

Attachment: LUCENE-4833.patch

Patch.

 Fix default MergePolicy in IndexWriterConfig
 

 Key: LUCENE-4833
 URL: https://issues.apache.org/jira/browse/LUCENE-4833
 Project: Lucene - Core
  Issue Type: Bug
Reporter: Adrien Grand
Assignee: Adrien Grand
Priority: Minor
 Attachments: LUCENE-4833.patch


 Although the default merge policy is TieredMergePolicy (as documented in 
 IndexWriterConfig constructor), setMergePolicy assumes that the default is 
 LogByteSizeMergePolicy.




[jira] [Commented] (LUCENE-4833) Fix default MergePolicy in IndexWriterConfig

2013-03-14 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13602265#comment-13602265
 ] 

Adrien Grand commented on LUCENE-4833:
--

Good point. I copied the behavior of setCodec which throws an NPE although you 
are right that most methods seem to set the default value...

 Fix default MergePolicy in IndexWriterConfig
 

 Key: LUCENE-4833
 URL: https://issues.apache.org/jira/browse/LUCENE-4833
 Project: Lucene - Core
  Issue Type: Bug
Reporter: Adrien Grand
Assignee: Adrien Grand
Priority: Minor
 Attachments: LUCENE-4833.patch


 Although the default merge policy is TieredMergePolicy (as documented in 
 IndexWriterConfig constructor), setMergePolicy assumes that the default is 
 LogByteSizeMergePolicy.




[jira] [Commented] (LUCENE-4833) Fix default MergePolicy in IndexWriterConfig

2013-03-14 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13602269#comment-13602269
 ] 

Adrien Grand commented on LUCENE-4833:
--

I'm not sure I like the fact that passing null to setXXX actually sets the 
default value, what do other committers think?

 Fix default MergePolicy in IndexWriterConfig
 

 Key: LUCENE-4833
 URL: https://issues.apache.org/jira/browse/LUCENE-4833
 Project: Lucene - Core
  Issue Type: Bug
Reporter: Adrien Grand
Assignee: Adrien Grand
Priority: Minor
 Attachments: LUCENE-4833.patch


 Although the default merge policy is TieredMergePolicy (as documented in 
 IndexWriterConfig constructor), setMergePolicy assumes that the default is 
 LogByteSizeMergePolicy.




[jira] [Commented] (LUCENE-4833) Fix default MergePolicy in IndexWriterConfig

2013-03-14 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13602274#comment-13602274
 ] 

Adrien Grand commented on LUCENE-4833:
--

My point is that if someone wants to use the default value, all he has to do is 
to never call the setter? Moreover users can't pass null to methods that expect 
primitive types (such as setMaxBufferedDocs) so throwing an exception when 
encountering null would be more consistent?

 Fix default MergePolicy in IndexWriterConfig
 

 Key: LUCENE-4833
 URL: https://issues.apache.org/jira/browse/LUCENE-4833
 Project: Lucene - Core
  Issue Type: Bug
Reporter: Adrien Grand
Assignee: Adrien Grand
Priority: Minor
 Attachments: LUCENE-4833.patch


 Although the default merge policy is TieredMergePolicy (as documented in 
 IndexWriterConfig constructor), setMergePolicy assumes that the default is 
 LogByteSizeMergePolicy.




[jira] [Commented] (LUCENE-4833) Fix default MergePolicy in IndexWriterConfig

2013-03-14 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13602285#comment-13602285
 ] 

Adrien Grand commented on LUCENE-4833:
--

bq. We throw IllegalArg in the other setters (which take primitives), so maybe 
throw that and not NPE?

+1 I'll update the patch.

 Fix default MergePolicy in IndexWriterConfig
 

 Key: LUCENE-4833
 URL: https://issues.apache.org/jira/browse/LUCENE-4833
 Project: Lucene - Core
  Issue Type: Bug
Reporter: Adrien Grand
Assignee: Adrien Grand
Priority: Minor
 Attachments: LUCENE-4833.patch


 Although the default merge policy is TieredMergePolicy (as documented in 
 IndexWriterConfig constructor), setMergePolicy assumes that the default is 
 LogByteSizeMergePolicy.




[jira] [Updated] (LUCENE-4833) Fix default MergePolicy in IndexWriterConfig

2013-03-14 Thread Adrien Grand (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand updated LUCENE-4833:
-

Attachment: LUCENE-4833.patch

Updated patch. IndexWriterConfig.setXXX methods now throw an 
IllegalArgumentException when passed null instead of setting the default value. 
Tests pass.
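
The behavior described here can be sketched as follows. This is an illustration with hypothetical stand-in classes, not the IndexWriterConfig code from the patch: the default is assigned in exactly one place, and the setter rejects null instead of silently restoring a default.

```java
// Hypothetical stand-in for a merge policy; only its name matters here.
final class PolicySketch {
  final String name;

  PolicySketch(String name) {
    this.name = name;
  }
}

// Hypothetical config object following the "reject null" convention.
final class WriterConfigSketch {

  // The default lives in one place, so a setter cannot disagree with it.
  private PolicySketch mergePolicy = new PolicySketch("tiered");

  WriterConfigSketch setMergePolicy(PolicySketch mergePolicy) {
    if (mergePolicy == null) {
      throw new IllegalArgumentException("mergePolicy must not be null");
    }
    this.mergePolicy = mergePolicy;
    return this;
  }

  PolicySketch getMergePolicy() {
    return mergePolicy;
  }
}
```

A caller who wants the default simply never calls the setter, which is the argument made earlier in this thread.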

 Fix default MergePolicy in IndexWriterConfig
 

 Key: LUCENE-4833
 URL: https://issues.apache.org/jira/browse/LUCENE-4833
 Project: Lucene - Core
  Issue Type: Bug
Reporter: Adrien Grand
Assignee: Adrien Grand
Priority: Minor
 Attachments: LUCENE-4833.patch, LUCENE-4833.patch


 Although the default merge policy is TieredMergePolicy (as documented in 
 IndexWriterConfig constructor), setMergePolicy assumes that the default is 
 LogByteSizeMergePolicy.




[jira] [Resolved] (LUCENE-4833) Fix default MergePolicy in IndexWriterConfig

2013-03-14 Thread Adrien Grand (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand resolved LUCENE-4833.
--

Resolution: Fixed

 Fix default MergePolicy in IndexWriterConfig
 

 Key: LUCENE-4833
 URL: https://issues.apache.org/jira/browse/LUCENE-4833
 Project: Lucene - Core
  Issue Type: Bug
Reporter: Adrien Grand
Assignee: Adrien Grand
Priority: Minor
 Attachments: LUCENE-4833.patch, LUCENE-4833.patch


 Although the default merge policy is TieredMergePolicy (as documented in 
 IndexWriterConfig constructor), setMergePolicy assumes that the default is 
 LogByteSizeMergePolicy.




[jira] [Commented] (LUCENE-4830) Sorter API: use an abstract doc map instead of an array

2013-03-14 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13602775#comment-13602775
 ] 

Adrien Grand commented on LUCENE-4830:
--

bq. I think that we should make the DocMap impl final? Maybe it will encourage 
JIT ...

Looks like it doesn't help much? 
http://stackoverflow.com/questions/8354412/do-java-finals-help-the-compiler-create-more-efficient-bytecode

 Sorter API: use an abstract doc map instead of an array
 ---

 Key: LUCENE-4830
 URL: https://issues.apache.org/jira/browse/LUCENE-4830
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Adrien Grand
Assignee: Adrien Grand
Priority: Trivial
 Fix For: 4.3

 Attachments: LUCENE-4830.patch


 The sorter API uses arrays to store the old-new and new-old doc ID 
 mappings. It should rather be an abstract class given that in some cases an 
 array is not required at all (reverse mapping for example).
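
A rough sketch of the idea in the description above: an abstract mapping with one implementation that computes the reverse order on the fly and another that is backed by arrays. The class name DocMapSketch and its factory methods are assumptions made for illustration and are not taken from the patch.

```java
/** Illustrative abstract doc ID mapping; names are hypothetical. */
abstract class DocMapSketch {
    /** Maps an old doc ID to its new position. */
    abstract int oldToNew(int docID);

    /** Maps a new doc ID back to the old one. */
    abstract int newToOld(int docID);

    /** Number of documents covered by the mapping. */
    abstract int size();

    /** Reverse order: computed arithmetically, so no array is allocated. */
    static DocMapSketch reverse(final int maxDoc) {
        return new DocMapSketch() {
            @Override int oldToNew(int docID) { return maxDoc - 1 - docID; }
            @Override int newToOld(int docID) { return maxDoc - 1 - docID; }
            @Override int size() { return maxDoc; }
        };
    }

    /** Array-backed mapping for arbitrary permutations of 0..n-1. */
    static DocMapSketch fromArray(final int[] forward) {
        final int[] backward = new int[forward.length];
        for (int old = 0; old < forward.length; old++) {
            backward[forward[old]] = old;
        }
        return new DocMapSketch() {
            @Override int oldToNew(int docID) { return forward[docID]; }
            @Override int newToOld(int docID) { return backward[docID]; }
            @Override int size() { return forward.length; }
        };
    }
}
```

Callers only see the abstract type, so the reverse mapping costs constant memory while arbitrary permutations still work.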




[jira] [Updated] (LUCENE-4752) Merge segments to sort them

2013-03-14 Thread Adrien Grand (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand updated LUCENE-4752:
-

Attachment: LUCENE-4752.patch

I've tried playing with SegmentMerger to make it configurable. This could be 
used to reorder document IDs (if you look at the diff in LuceneTestCase, all 
that is needed to reorder doc IDs is to wrap the SlowCompositeReaderWrapper 
with a SortingAtomicReader). Do you think it is a step in the right direction? 

 Merge segments to sort them
 ---

 Key: LUCENE-4752
 URL: https://issues.apache.org/jira/browse/LUCENE-4752
 Project: Lucene - Core
  Issue Type: New Feature
  Components: core/index
Reporter: David Smiley
Assignee: Adrien Grand
 Attachments: LUCENE-4752.patch


 It would be awesome if Lucene could write the documents out in a segment 
 based on a configurable order.  This of course applies to merging segments 
 too. The benefit is increased locality on disk of documents that are likely to 
 be accessed together.  This often applies to documents near each other in 
 time, but also spatially.




[jira] [Resolved] (LUCENE-4830) Sorter API: use an abstract doc map instead of an array

2013-03-15 Thread Adrien Grand (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand resolved LUCENE-4830.
--

Resolution: Fixed

Thank you for the review, Shai!

 Sorter API: use an abstract doc map instead of an array
 ---

 Key: LUCENE-4830
 URL: https://issues.apache.org/jira/browse/LUCENE-4830
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Adrien Grand
Assignee: Adrien Grand
Priority: Trivial
 Fix For: 4.3

 Attachments: LUCENE-4830.patch


 The sorter API uses arrays to store the old-new and new-old doc ID 
 mappings. It should rather be an abstract class given that in some cases an 
 array is not required at all (reverse mapping for example).




[jira] [Created] (LUCENE-4834) Sorter API: Make TermsEnum.docs accept any source of liveDocs

2013-03-15 Thread Adrien Grand (JIRA)
Adrien Grand created LUCENE-4834:


 Summary: Sorter API: Make TermsEnum.docs accept any source of 
liveDocs
 Key: LUCENE-4834
 URL: https://issues.apache.org/jira/browse/LUCENE-4834
 Project: Lucene - Core
  Issue Type: Bug
Reporter: Adrien Grand
Assignee: Adrien Grand
Priority: Minor
 Fix For: 4.3


TermsEnum.docs currently only works when liveDocs is null or the reader's 
liveDocs. This is enough for addIndexes but it would be cleaner to follow 
the TermsEnum.docs contract and accept any source of liveDocs.
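
To illustrate what "any source of liveDocs" means, here is a minimal sketch that filters a postings list through an arbitrary bit source supplied by the caller. LiveDocsSketch and FilteredDocsSketch are hypothetical stand-ins and do not reproduce Lucene's Bits or TermsEnum classes.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a set of live documents. */
interface LiveDocsSketch {
    boolean get(int docID);
}

final class FilteredDocsSketch {
    /**
     * Returns the doc IDs from postings that are live according to liveDocs.
     * A null liveDocs means that every document is considered live.
     */
    static List<Integer> docs(int[] postings, LiveDocsSketch liveDocs) {
        List<Integer> result = new ArrayList<Integer>();
        for (int doc : postings) {
            if (liveDocs == null || liveDocs.get(doc)) {
                result.add(doc);
            }
        }
        return result;
    }
}
```

The point of the sketch is that the filter never checks where liveDocs came from; it only asks it whether each document is live.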




[jira] [Updated] (LUCENE-4834) Sorter API: Make TermsEnum.docs accept any source of liveDocs

2013-03-15 Thread Adrien Grand (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand updated LUCENE-4834:
-

Attachment: LUCENE-4834.patch

Patch. I'll commit soon.

 Sorter API: Make TermsEnum.docs accept any source of liveDocs
 -

 Key: LUCENE-4834
 URL: https://issues.apache.org/jira/browse/LUCENE-4834
 Project: Lucene - Core
  Issue Type: Bug
Reporter: Adrien Grand
Assignee: Adrien Grand
Priority: Minor
 Fix For: 4.3

 Attachments: LUCENE-4834.patch


 TermsEnum.docs currently only works when liveDocs is null or the reader's 
 liveDocs. This is enough for addIndexes but it would be cleaner to follow 
 the TermsEnum.docs contract and accept any source of liveDocs.




[jira] [Resolved] (LUCENE-4834) Sorter API: Make TermsEnum.docs accept any source of liveDocs

2013-03-15 Thread Adrien Grand (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand resolved LUCENE-4834.
--

Resolution: Fixed

Thanks Shai.

 Sorter API: Make TermsEnum.docs accept any source of liveDocs
 -

 Key: LUCENE-4834
 URL: https://issues.apache.org/jira/browse/LUCENE-4834
 Project: Lucene - Core
  Issue Type: Bug
Reporter: Adrien Grand
Assignee: Adrien Grand
Priority: Minor
 Fix For: 4.3

 Attachments: LUCENE-4834.patch


 TermsEnum.docs currently only works when liveDocs is null or the reader's 
 liveDocs. This is enough for addIndexes but it would be cleaner to follow 
 the TermsEnum.docs contract and accept any source of liveDocs.




[jira] [Created] (LUCENE-4839) Sorter API: Use TimSort to sort doc IDs and postings lists

2013-03-16 Thread Adrien Grand (JIRA)
Adrien Grand created LUCENE-4839:


 Summary: Sorter API: Use TimSort to sort doc IDs and postings lists
 Key: LUCENE-4839
 URL: https://issues.apache.org/jira/browse/LUCENE-4839
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Adrien Grand
Assignee: Adrien Grand
Priority: Minor


TimSort (http://svn.python.org/projects/python/trunk/Objects/listsort.txt, used 
by Python and Java's Arrays.sort(Object[]) in particular) is a sorting 
algorithm that performs very well on partially-sorted data. Indeed, with 
TimSort, sorting an array which is in reverse order or a finite concatenation 
of sorted arrays is a linear operation (instead of O(n ln(n))).

The sorter API could benefit from this algorithm when using Sorter.REVERSE_DOCS 
or merging several sorted readers for example.
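
The linear behavior on reversed input can be observed by counting comparator calls. The sketch below relies on java.util.Arrays.sort with a Comparator, which is TimSort-based on a default Java 7+ JDK; the class name TimSortComparisonsSketch and its helpers are made up for this example and are unrelated to the sorter API.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.Random;

final class TimSortComparisonsSketch {
    /** Sorts a copy of data and returns how many comparisons were made. */
    static long countComparisons(Integer[] data) {
        final long[] count = {0};
        Integer[] copy = data.clone();
        Arrays.sort(copy, new Comparator<Integer>() {
            @Override
            public int compare(Integer a, Integer b) {
                count[0]++;
                return a.compareTo(b);
            }
        });
        return count[0];
    }

    /** Builds n-1, n-2, ..., 0: a single strictly descending run. */
    static Integer[] reversed(int n) {
        Integer[] a = new Integer[n];
        for (int i = 0; i < n; i++) {
            a[i] = n - 1 - i;
        }
        return a;
    }

    /** Builds a fixed-seed random permutation of 0..n-1. */
    static Integer[] shuffled(int n, long seed) {
        Integer[] a = reversed(n);
        Collections.shuffle(Arrays.asList(a), new Random(seed));
        return a;
    }

    public static void main(String[] args) {
        int n = 100000;
        System.out.println("reversed: " + countComparisons(reversed(n)));
        System.out.println("shuffled: " + countComparisons(shuffled(n, 42L)));
    }
}
```

On such a JDK the reversed input should need on the order of n comparisons, while the shuffled input needs on the order of n log n.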




[jira] [Commented] (LUCENE-4752) Merge segments to sort them

2013-03-16 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13604272#comment-13604272
 ] 

Adrien Grand commented on LUCENE-4752:
--

bq. Is it possible to make fieldInfos final?

Sure. I removed the final keyword because it was easier to hack up a quick 
patch but this can definitely be fixed.

bq. Adrien, perhaps add a SortingSegmentMerger to the sorter package? Or at 
least add a test that verifies merges keep things sorted?

I'll do that in the next patches!

bq. And finally i think it would be way better to provide whatever 'hook' is 
needed for this kinda stuff rather than allow subclassing of segmentmerger.

I'm fine with that option too, I need to think more about how to name it and 
where to plug it.

In addition to the API, I think something important to validate is whether 
sorting the segments to merge is viable and doesn't blow up memory or indexing 
time... I started working on this (LUCENE-4830 for memory and LUCENE-4839 for 
complexity) and will run some indexing benchmarks with the Wikipedia corpus to 
see how it behaves compared to natural merging.


 Merge segments to sort them
 ---

 Key: LUCENE-4752
 URL: https://issues.apache.org/jira/browse/LUCENE-4752
 Project: Lucene - Core
  Issue Type: New Feature
  Components: core/index
Reporter: David Smiley
Assignee: Adrien Grand
 Attachments: LUCENE-4752.patch


 It would be awesome if Lucene could write the documents out in a segment 
 based on a configurable order.  This of course applies to merging segments 
 too. The benefit is increased locality on disk of documents that are likely to 
 be accessed together.  This often applies to documents near each other in 
 time, but also spatially.




[jira] [Commented] (LUCENE-4839) Sorter API: Use TimSort to sort doc IDs and postings lists

2013-03-16 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4839?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13604279#comment-13604279
 ] 

Adrien Grand commented on LUCENE-4839:
--

One major difference with the original impl is that I reused the merge routine 
used by mergeSort instead of porting the original one which has a few 
optimizations to merge runs which have different lengths and/or some patterns 
(look for galloping in listsort.txt) but requires extra memory. This doesn't 
change the fact that this impl performs extremely well when data is partially 
sorted.
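
Whatever merge routine is used, the gain on partially sorted data comes mostly from detecting natural runs before merging. The following is a sketch of that run-detection step only, written from the general description in listsort.txt; it is not the code from the patch, and RunDetectionSketch is a hypothetical name.

```java
final class RunDetectionSketch {
    /**
     * Returns the exclusive end of the natural run starting at lo.
     * A strictly descending run is reversed in place so that every
     * detected run ends up ascending.
     */
    static int nextRun(int[] a, int lo, int hi) {
        if (hi - lo <= 1) {
            return hi;
        }
        int end = lo + 1;
        if (a[end] < a[lo]) {
            // Strictly descending: extend the run, then reverse it.
            end++;
            while (end < hi && a[end] < a[end - 1]) {
                end++;
            }
            reverse(a, lo, end);
        } else {
            // Non-descending: extend while elements do not decrease.
            end++;
            while (end < hi && a[end] >= a[end - 1]) {
                end++;
            }
        }
        return end;
    }

    /** Reverses a[from..to) in place. */
    static void reverse(int[] a, int from, int to) {
        for (int i = from, j = to - 1; i < j; i++, j--) {
            int tmp = a[i];
            a[i] = a[j];
            a[j] = tmp;
        }
    }

    /** Counts natural runs; note that descending runs get reversed in a. */
    static int countRuns(int[] a) {
        int runs = 0;
        for (int lo = 0; lo < a.length; lo = nextRun(a, lo, a.length)) {
            runs++;
        }
        return runs;
    }
}
```

A fully reversed array is a single run, so it is sorted by one reversal and no merge at all; a concatenation of k sorted arrays yields at most k runs to merge.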

 Sorter API: Use TimSort to sort doc IDs and postings lists
 --

 Key: LUCENE-4839
 URL: https://issues.apache.org/jira/browse/LUCENE-4839
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Adrien Grand
Assignee: Adrien Grand
Priority: Minor
 Attachments: LUCENE-4839.patch


 TimSort (http://svn.python.org/projects/python/trunk/Objects/listsort.txt, 
 used by Python and Java's Arrays.sort(Object[]) in particular) is a sorting 
 algorithm that performs very well on partially-sorted data. Indeed, with 
 TimSort, sorting an array which is in reverse order or a finite concatenation 
 of sorted arrays is a linear operation (instead of O(n ln(n))).
 The sorter API could benefit from this algorithm when using 
 Sorter.REVERSE_DOCS or merging several sorted readers for example.




[jira] [Updated] (LUCENE-4839) Sorter API: Use TimSort to sort doc IDs and postings lists

2013-03-16 Thread Adrien Grand (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand updated LUCENE-4839:
-

Attachment: LUCENE-4839.patch

bq. Nice! Why do we need the private inner class TimSort?

It's not needed, but my first patch (not uploaded) did not use a helper class and 
was hard to read, so I think it is better this way?

bq. I would be happy to also add the timSort algorithm to ArrayUtils and 
CollectionUtils.

Done in the patch.

bq. The bonus would be: The extensive random tests in TestArrayUtils and 
TestCollectionUtils could be used for timSort, too (their existence is the 
reason why there is no TestSorterTemplate class in current code).

Done.

 Sorter API: Use TimSort to sort doc IDs and postings lists
 --

 Key: LUCENE-4839
 URL: https://issues.apache.org/jira/browse/LUCENE-4839
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Adrien Grand
Assignee: Adrien Grand
Priority: Minor
 Attachments: LUCENE-4839.patch, LUCENE-4839.patch


 TimSort (http://svn.python.org/projects/python/trunk/Objects/listsort.txt, 
 used by Python and Java's Arrays.sort(Object[]) in particular) is a sorting 
 algorithm that performs very well on partially-sorted data. Indeed, with 
 TimSort, sorting an array which is in reverse order or a finite concatenation 
 of sorted arrays is a linear operation (instead of O(n ln(n))).
 The sorter API could benefit from this algorithm when using 
 Sorter.REVERSE_DOCS or merging several sorted readers for example.




[jira] [Commented] (LUCENE-4839) Sorter API: Use TimSort to sort doc IDs and postings lists

2013-03-16 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4839?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13604318#comment-13604318
 ] 

Adrien Grand commented on LUCENE-4839:
--

Thanks Uwe, I'll fix it before committing!

 Sorter API: Use TimSort to sort doc IDs and postings lists
 --

 Key: LUCENE-4839
 URL: https://issues.apache.org/jira/browse/LUCENE-4839
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Adrien Grand
Assignee: Adrien Grand
Priority: Minor
 Attachments: LUCENE-4839.patch, LUCENE-4839.patch


 TimSort (http://svn.python.org/projects/python/trunk/Objects/listsort.txt, 
 used by Python and Java's Arrays.sort(Object[]) in particular) is a sorting 
 algorithm that performs very well on partially-sorted data. Indeed, with 
 TimSort, sorting an array which is in reverse order or a finite concatenation 
 of sorted arrays is a linear operation (instead of O(n ln(n))).
 The sorter API could benefit from this algorithm when using 
 Sorter.REVERSE_DOCS or merging several sorted readers for example.




[jira] [Resolved] (LUCENE-4839) Sorter API: Use TimSort to sort doc IDs and postings lists

2013-03-16 Thread Adrien Grand (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand resolved LUCENE-4839.
--

Resolution: Fixed

 Sorter API: Use TimSort to sort doc IDs and postings lists
 --

 Key: LUCENE-4839
 URL: https://issues.apache.org/jira/browse/LUCENE-4839
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Adrien Grand
Assignee: Adrien Grand
Priority: Minor
 Attachments: LUCENE-4839.patch, LUCENE-4839.patch


 TimSort (http://svn.python.org/projects/python/trunk/Objects/listsort.txt, 
 used by Python and Java's Arrays.sort(Object[]) in particular) is a sorting 
 algorithm that performs very well on partially-sorted data. Indeed, with 
 TimSort, sorting an array which is in reverse order or a finite concatenation 
 of sorted arrays is a linear operation (instead of O(n ln(n))).
 The sorter API could benefit from this algorithm when using 
 Sorter.REVERSE_DOCS or merging several sorted readers for example.




[jira] [Updated] (LUCENE-4752) Merge segments to sort them

2013-03-16 Thread Adrien Grand (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand updated LUCENE-4752:
-

Attachment: LUCENE-4752.patch

bq. i think it would be way better to provide whatever 'hook' is needed for 
this kinda stuff rather than allow subclassing of segmentmerger. like a proper 
pluggable api (e.g. codec is an example of this) versus letting people just 
subclass concrete things.

Here is a patch that allows for reordering via a simple hook instead of having 
to subclass a class that does concrete things like SegmentMerger. The hook is 
on MergePolicy because I felt like it makes sense to think about doc ID 
reordering at merging time as part of a merge policy but it could also be put 
somewhere else or have its own class. (The patch is just here to gather some 
API feedback, I haven't tried to run anything with it yet). Does it look more 
reasonable?

 Merge segments to sort them
 ---

 Key: LUCENE-4752
 URL: https://issues.apache.org/jira/browse/LUCENE-4752
 Project: Lucene - Core
  Issue Type: New Feature
  Components: core/index
Reporter: David Smiley
Assignee: Adrien Grand
 Attachments: LUCENE-4752.patch, LUCENE-4752.patch


 It would be awesome if Lucene could write the documents out in a segment 
 based on a configurable order.  This of course applies to merging segments 
 too. The benefit is increased locality on disk of documents that are likely to 
 be accessed together.  This often applies to documents near each other in 
 time, but also spatially.




[jira] [Commented] (LUCENE-4752) Merge segments to sort them

2013-03-17 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13604536#comment-13604536
 ] 

Adrien Grand commented on LUCENE-4752:
--

bq. This looks less invasive indeed, but I feel that MP.reorder() is kind of 
out of the blue. Maybe we should find a way to stuff it into OneMerge?

Indeed, I thought about OneMerge too and liked this option better but I think 
this is a problem for addIndexes(IndexReader...): this method doesn't need to 
find merges and as a consequence doesn't manipulate OneMerge instances. How 
would we make addIndexes(IndexReader...) sort doc IDs?

 Merge segments to sort them
 ---

 Key: LUCENE-4752
 URL: https://issues.apache.org/jira/browse/LUCENE-4752
 Project: Lucene - Core
  Issue Type: New Feature
  Components: core/index
Reporter: David Smiley
Assignee: Adrien Grand
 Attachments: LUCENE-4752.patch, LUCENE-4752.patch


 It would be awesome if Lucene could write the documents out in a segment 
 based on a configurable order.  This of course applies to merging segments 
 too. The benefit is increased locality on disk of documents that are likely to 
 be accessed together.  This often applies to documents near each other in 
 time, but also spatially.




[jira] [Commented] (LUCENE-4752) Merge segments to sort them

2013-03-17 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13604540#comment-13604540
 ] 

Adrien Grand commented on LUCENE-4752:
--

Good point! I'll update the patch!

 Merge segments to sort them
 ---

 Key: LUCENE-4752
 URL: https://issues.apache.org/jira/browse/LUCENE-4752
 Project: Lucene - Core
  Issue Type: New Feature
  Components: core/index
Reporter: David Smiley
Assignee: Adrien Grand
 Attachments: LUCENE-4752.patch, LUCENE-4752.patch


 It would be awesome if Lucene could write the documents out in a segment 
 based on a configurable order.  This of course applies to merging segments 
 too. The benefit is increased locality on disk of documents that are likely to 
 be accessed together.  This often applies to documents near each other in 
 time, but also spatially.




[jira] [Updated] (LUCENE-4752) Merge segments to sort them

2013-03-17 Thread Adrien Grand (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand updated LUCENE-4752:
-

Attachment: LUCENE-4752.patch

Patch with tests that makes OneMerge responsible for reordering doc IDs. 
Thoughts?

 Merge segments to sort them
 ---

 Key: LUCENE-4752
 URL: https://issues.apache.org/jira/browse/LUCENE-4752
 Project: Lucene - Core
  Issue Type: New Feature
  Components: core/index
Reporter: David Smiley
Assignee: Adrien Grand
 Attachments: LUCENE-4752.patch, LUCENE-4752.patch, LUCENE-4752.patch


 It would be awesome if Lucene could write the documents out in a segment 
 based on a configurable order.  This of course applies to merging segments 
 too. The benefit is increased locality on disk of documents that are likely to 
 be accessed together.  This often applies to documents near each other in 
 time, but also spatially.




[jira] [Updated] (LUCENE-4752) Merge segments to sort them

2013-03-17 Thread Adrien Grand (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand updated LUCENE-4752:
-

Attachment: LUCENE-4752.patch

bq. But, since LTC is quite big, perhaps we can move these methods to a util, 
e.g. CompareIndexes?

Why is the size of the class a concern? I think it's more convenient to have 
all assert*Equals methods in the same class? (LuceneTestCase already has many 
assert*Equals methods inherited from Assert.) And it makes these methods easier 
to find when writing a test?

bq. Can we make OneMerge.readers private and add OneMerge.add(AtomicReader) for 
IW to use? It looks odd that IW manipulates OneMerge.readers directly, but then 
calls OneMerge.getMergeReaders()

I think it would be odd if getMergeReaders was just a getter but it is more 
than that since it filters out empty readers and can even return an arbitrary 
view over the readers to merge. But here it is just a method that computes data 
based on the class members, like segString?

bq. Can we remove SegmentMerger.add()

Good point, I updated the patch.

 Merge segments to sort them
 ---

 Key: LUCENE-4752
 URL: https://issues.apache.org/jira/browse/LUCENE-4752
 Project: Lucene - Core
  Issue Type: New Feature
  Components: core/index
Reporter: David Smiley
Assignee: Adrien Grand
 Attachments: LUCENE-4752.patch, LUCENE-4752.patch, LUCENE-4752.patch, 
 LUCENE-4752.patch


 It would be awesome if Lucene could write the documents out in a segment 
 based on a configurable order.  This of course applies to merging segments 
 too. The benefit is increased locality on disk of documents that are likely to 
 be accessed together.  This often applies to documents near each other in 
 time, but also spatially.




[jira] [Created] (LUCENE-4847) Sorter API: Fully reuse docs enums

2013-03-17 Thread Adrien Grand (JIRA)
Adrien Grand created LUCENE-4847:


 Summary: Sorter API: Fully reuse docs enums
 Key: LUCENE-4847
 URL: https://issues.apache.org/jira/browse/LUCENE-4847
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Adrien Grand
Assignee: Adrien Grand
Priority: Minor
 Fix For: 4.3


SortingAtomicReader reuses the filtered docs enums but not the wrapper. In the 
case of SortingAtomicReader this can be a problem because the wrappers are 
heavyweight (they load the whole postings list into memory), so an index with 
many terms with high freqs will make the JVM allocate a lot of memory when 
browsing the postings lists.
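
The reuse pattern at stake can be sketched as follows: a heavyweight wrapper keeps its buffer across calls and is reloaded rather than reallocated when the caller passes it back in. BufferedDocsSketch and its methods are hypothetical names chosen for this example; they do not mirror the actual SortingAtomicReader internals.

```java
/** Hypothetical heavyweight wrapper that buffers a whole postings list. */
final class BufferedDocsSketch {
    private int[] docs = new int[0];
    private int upTo; // number of valid entries in docs
    private int pos;  // next entry to return

    /** Reloads this instance, growing the buffer only when it is too small. */
    BufferedDocsSketch reset(int[] postings) {
        if (docs.length < postings.length) {
            docs = new int[postings.length];
        }
        System.arraycopy(postings, 0, docs, 0, postings.length);
        upTo = postings.length;
        pos = 0;
        return this;
    }

    /** Returns the next buffered doc ID, or -1 when exhausted. */
    int nextDoc() {
        return pos < upTo ? docs[pos++] : -1;
    }

    /** Reuses the supplied wrapper when there is one instead of allocating. */
    static BufferedDocsSketch docs(int[] postings, BufferedDocsSketch reuse) {
        BufferedDocsSketch wrapper = reuse != null ? reuse : new BufferedDocsSketch();
        return wrapper.reset(postings);
    }
}
```

When the same wrapper is passed back for every term, the buffer is allocated once per size increase rather than once per postings list.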




[jira] [Updated] (LUCENE-4847) Sorter API: Fully reuse docs enums

2013-03-18 Thread Adrien Grand (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand updated LUCENE-4847:
-

Attachment: LUCENE-4847.patch

Patch.

 Sorter API: Fully reuse docs enums
 --

 Key: LUCENE-4847
 URL: https://issues.apache.org/jira/browse/LUCENE-4847
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Adrien Grand
Assignee: Adrien Grand
Priority: Minor
 Fix For: 4.3

 Attachments: LUCENE-4847.patch


 SortingAtomicReader reuses the filtered docs enums but not the wrapper. In 
 the case of SortingAtomicReader this can be a problem because the wrappers 
 are heavyweight (they load the whole postings list into memory), so an index 
 with many terms with high freqs will make the JVM allocate a lot of memory 
 when browsing the postings lists.




[jira] [Commented] (LUCENE-4747) java7 as a minimum requirement for lucene 5

2013-03-18 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13605126#comment-13605126
 ] 

Adrien Grand commented on LUCENE-4747:
--

Maybe we should fix all places that should use Integer.compare/Long.compare/... 
too?

 java7 as a minimum requirement for lucene 5
 ---

 Key: LUCENE-4747
 URL: https://issues.apache.org/jira/browse/LUCENE-4747
 Project: Lucene - Core
  Issue Type: Task
Affects Versions: 5.0
Reporter: Robert Muir
Assignee: Uwe Schindler
 Fix For: 5.0

 Attachments: LUCENE-4747.patch, LUCENE-4747.patch


 Spinoff from LUCENE-4746. 
 I propose we make this change on trunk only. 




[jira] [Created] (LUCENE-4851) Use Java 7's {Integer,Long,Float,Double}.compare instead of branches

2013-03-18 Thread Adrien Grand (JIRA)
Adrien Grand created LUCENE-4851:


 Summary: Use Java 7's {Integer,Long,Float,Double}.compare instead 
of branches
 Key: LUCENE-4851
 URL: https://issues.apache.org/jira/browse/LUCENE-4851
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Adrien Grand
Assignee: Adrien Grand
Priority: Minor
 Fix For: 5.0


We can use those methods now that trunk is on Java 7.
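
For illustration, here is the kind of rewrite this issue is about, shown on int values. CompareSketch is a made-up class for this example and is not code from FieldComparator or from the patch; the subtraction variant is included only to show why it is not a safe shortcut.

```java
final class CompareSketch {
    /** Branch-based comparison of the kind the issue proposes to replace. */
    static int compareWithBranches(int a, int b) {
        if (a < b) {
            return -1;
        } else if (a > b) {
            return 1;
        } else {
            return 0;
        }
    }

    /** Same ordering, delegated to the JDK helper. */
    static int compareWithHelper(int a, int b) {
        return Integer.compare(a, b);
    }

    /** Subtraction shortcut: overflows when the operands are far apart. */
    static int compareWithSubtraction(int a, int b) {
        return a - b;
    }
}
```

Long.compare follows the same pattern. Float.compare and Double.compare are not a drop-in replacement for the < and > operators in every case, since they define an ordering for NaN and distinguish -0.0 from 0.0.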




[jira] [Updated] (LUCENE-4851) Use Java 7's {Integer,Long,Float,Double}.compare instead of branches

2013-03-18 Thread Adrien Grand (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand updated LUCENE-4851:
-

Attachment: LUCENE-4851.patch

Patch. Most changes are in FieldComparator.

 Use Java 7's {Integer,Long,Float,Double}.compare instead of branches
 

 Key: LUCENE-4851
 URL: https://issues.apache.org/jira/browse/LUCENE-4851
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Adrien Grand
Assignee: Adrien Grand
Priority: Minor
 Fix For: 5.0

 Attachments: LUCENE-4851.patch


 We can use those methods now that trunk is on Java 7.




[jira] [Commented] (LUCENE-4851) Use Java 7's {Integer,Long,Float,Double}.compare instead of branches

2013-03-18 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13605216#comment-13605216
 ] 

Adrien Grand commented on LUCENE-4851:
--

Good idea, I'll do it!

 Use Java 7's {Integer,Long,Float,Double}.compare instead of branches
 

 Key: LUCENE-4851
 URL: https://issues.apache.org/jira/browse/LUCENE-4851
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Adrien Grand
Assignee: Adrien Grand
Priority: Minor
 Fix For: 5.0

 Attachments: LUCENE-4851.patch


 We can use those methods now that trunk is on Java 7.




[jira] [Updated] (LUCENE-4851) Use Java 7's {Integer,Long,Float,Double}.compare instead of branches

2013-03-18 Thread Adrien Grand (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand updated LUCENE-4851:
-

Attachment: LUCENE-4851.patch

It found two calls to signum in ConjunctionScorer and PostingsHighlighter.

 Use Java 7's {Integer,Long,Float,Double}.compare instead of branches
 

 Key: LUCENE-4851
 URL: https://issues.apache.org/jira/browse/LUCENE-4851
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Adrien Grand
Assignee: Adrien Grand
Priority: Minor
 Fix For: 5.0

 Attachments: LUCENE-4851.patch, LUCENE-4851.patch


 We can use those methods now that trunk is on Java 7.




[jira] [Resolved] (LUCENE-4851) Use Java 7's {Integer,Long,Float,Double}.compare instead of branches

2013-03-18 Thread Adrien Grand (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand resolved LUCENE-4851.
--

Resolution: Fixed

 Use Java 7's {Integer,Long,Float,Double}.compare instead of branches
 

 Key: LUCENE-4851
 URL: https://issues.apache.org/jira/browse/LUCENE-4851
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Adrien Grand
Assignee: Adrien Grand
Priority: Minor
 Fix For: 5.0

 Attachments: LUCENE-4851.patch, LUCENE-4851.patch


 We can use those methods now that trunk is on Java 7.




[jira] [Commented] (LUCENE-4852) BaseStoredFieldsFormatTestCase

2013-03-18 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13605466#comment-13605466
 ] 

Adrien Grand commented on LUCENE-4852:
--

Patch looks good!

 BaseStoredFieldsFormatTestCase
 --

 Key: LUCENE-4852
 URL: https://issues.apache.org/jira/browse/LUCENE-4852
 Project: Lucene - Core
  Issue Type: Task
  Components: general/test
Reporter: Robert Muir
 Attachments: LUCENE-4852.patch, LUCENE-4852_prototype.patch


 The idea is similar to Base[Postings/DocValues/TermVectors]TestCase.
 We ensure each codec has certain checks, and it's easier to maintain and also 
 easier to ensure new impls are correct.
 For example hunting around today, a lot of the best tests are actually tucked 
 away in TestCompressingStoredFieldsFormat.




[jira] [Updated] (LUCENE-4752) Merge segments to sort them

2013-03-18 Thread Adrien Grand (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand updated LUCENE-4752:
-

Attachment: sorting_10M_ingestion.log
natural_10M_ingestion.log
LUCENE-4752.patch

bq. Maybe just put a comment in IW where it calls merge.getReaders() why we 
don't access the readers list directly

Done.

bq. I started working on this (LUCENE-4830 for memory and LUCENE-4839 for 
complexity) and will run some indexing benchmarks with the Wikipedia corpus to 
see how it behaves compared to natural merging.

Now that SortingAtomicReader uses TimSort to compute the doc ID mapping and 
sort postings lists, using SortingMergePolicy only increases the merge 
complexity by constant factors compared to a natural merge if the readers to 
merge are sorted (I'm assuming the number of segments to merge is bounded). I 
think this makes online sorting a viable option.
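
To illustrate what "computing the doc ID mapping" means, here is a minimal sketch (not the SortingAtomicReader code; `DocMapSketch` and its fields are hypothetical names). It assumes each document has a single long sort value and relies on the fact that Arrays.sort on object arrays is stable and, in OpenJDK, adapted from TimSort, so input that is already mostly sorted is cheap to process:

```java
import java.util.Arrays;

// Hypothetical sketch, not the Lucene implementation.
final class DocMapSketch {
  final int[] newToOld; // newToOld[newDocId] = docId before sorting
  final int[] oldToNew; // oldToNew[oldDocId] = docId after sorting

  DocMapSketch(long[] sortValues) {
    int maxDoc = sortValues.length;
    Integer[] docs = new Integer[maxDoc];
    for (int i = 0; i < maxDoc; i++) {
      docs[i] = i;
    }
    // Stable sort of doc IDs by their sort value; ties keep their
    // original relative order.
    Arrays.sort(docs, (a, b) -> Long.compare(sortValues[a], sortValues[b]));

    newToOld = new int[maxDoc];
    oldToNew = new int[maxDoc];
    for (int newDoc = 0; newDoc < maxDoc; newDoc++) {
      int oldDoc = docs[newDoc];
      newToOld[newDoc] = oldDoc;
      oldToNew[oldDoc] = newDoc;
    }
  }
}
```

For sort values {30, 10, 20, 10} this gives newToOld = {1, 3, 2, 0} and oldToNew = {3, 0, 2, 1}.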

I ran some indexing benchmarks to see how much slower indexing is with 
SortingMergePolicy. To do this I quickly patched luceneutil to add a random 
NumericDocValuesField to all documents and wrap the merge policy with 
SortingMergePolicy. Indexing 10M docs from the wikimedium collection was 2x 
slower with SortingMergePolicy (see ingestion rate logs attached). To measure 
pure merge performance, I ran a forceMerge(1) on those indexes and 
SortingMergePolicy made this forceMerge 3.5x slower (856415 ms vs 250054 ms). 
If you're curious, here is where the merging time is spent with 
SortingMergePolicy according to my profiler:
 - 32%: CompressingStoredField.visitDocument (vs.  1% when using a regular 
merge policy)
 - 17%: TimSort: to sort the doc mapping and postings lists
 - 6%: Sorter.DocMap.oldToNew: used by SortingDocsEnum to map the old IDs to 
the new ones

Most of the time is spent not in the actual sorting but in visitDocument, because 
the codec-specific merge routine can't be used, so the stored fields format 
decompresses every chunk multiple times (a few hundred times given that my 
docs are really small; this would be less noticeable with larger docs).
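
To make the repeated decompression concrete, here is a small simulation (an illustrative model, not the actual stored fields reader). It assumes documents are packed into fixed-size chunks and that the reader keeps only the most recently decompressed chunk; `ChunkAccessSketch`, `docsPerChunk` and the one-chunk cache are assumptions made for the example:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical model of chunked stored fields, for illustration only.
final class ChunkAccessSketch {

  // Counts how many times a chunk has to be decompressed when documents
  // are visited in the given order and only the last chunk is cached.
  static int countDecompressions(List<Integer> docOrder, int docsPerChunk) {
    int decompressions = 0;
    int cachedChunk = -1;
    for (int doc : docOrder) {
      int chunk = doc / docsPerChunk;
      if (chunk != cachedChunk) {
        decompressions++;
        cachedChunk = chunk;
      }
    }
    return decompressions;
  }

  // Doc IDs in index order, as a natural merge would read them.
  static List<Integer> sequentialOrder(int maxDoc) {
    List<Integer> docs = new ArrayList<>(maxDoc);
    for (int i = 0; i < maxDoc; i++) {
      docs.add(i);
    }
    return docs;
  }

  // Doc IDs in an unrelated order, as a sorting merge may read them.
  static List<Integer> shuffledOrder(int maxDoc, long seed) {
    List<Integer> docs = sequentialOrder(maxDoc);
    Collections.shuffle(docs, new Random(seed));
    return docs;
  }
}
```

With 1,000 documents in chunks of 100, reading in index order decompresses each chunk once (10 decompressions), while a shuffled order decompresses many more times because consecutive reads rarely hit the cached chunk.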

I think it's close, what do you think?

 Merge segments to sort them
 ---

 Key: LUCENE-4752
 URL: https://issues.apache.org/jira/browse/LUCENE-4752
 Project: Lucene - Core
  Issue Type: New Feature
  Components: core/index
Reporter: David Smiley
Assignee: Adrien Grand
 Attachments: LUCENE-4752.patch, LUCENE-4752.patch, LUCENE-4752.patch, 
 LUCENE-4752.patch, LUCENE-4752.patch, natural_10M_ingestion.log, 
 sorting_10M_ingestion.log


 It would be awesome if Lucene could write the documents out in a segment 
 based on a configurable order.  This of course applies to merging segments 
 too. The benefit is increased locality on disk of documents that are likely to 
 be accessed together.  This often applies to documents near each other in 
 time, but also spatially.




[jira] [Updated] (LUCENE-4752) Merge segments to sort them

2013-03-19 Thread Adrien Grand (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand updated LUCENE-4752:
-

Attachment: LUCENE-4752.patch

bq. I think these are not bad numbers.

Me neither! I'm rather happy with them actually.

bq. As for search, perhaps we can quickly hack up IndexSearcher to allow 
terminating per-segment and then compare two Collectors TopFields and 
TopSortedFields [...] but in order to do that, we must make sure that each 
segment is sorted (i.e. those that are not hit by MP are still in random 
order), or we somehow mark on each segment whether it's sorted or not

The attached patch takes a different approach: the idea is to use 
SortingMergePolicy together with IndexWriterConfig.getMaxBufferedDocs, which 
guarantees that all segments whose size is above maxBufferedDocs are sorted. 
Then there is a new EarlyTerminationIndexSearcher that extends search to 
collect segments that are in random order normally and to terminate collection 
early on segments that are sorted.

bq. Accessing close documents together ... we can make an artificial test 
which accesses documents with sort-by-value in a specific range. But that's a 
too artificial test, not sure what it will tell us.

Yes, I think the important thing to validate here is that merging does not get 
exponentially slower as segments grow. Other checks are just bonus.

 Merge segments to sort them
 ---

 Key: LUCENE-4752
 URL: https://issues.apache.org/jira/browse/LUCENE-4752
 Project: Lucene - Core
  Issue Type: New Feature
  Components: core/index
Reporter: David Smiley
Assignee: Adrien Grand
 Attachments: LUCENE-4752.patch, LUCENE-4752.patch, LUCENE-4752.patch, 
 LUCENE-4752.patch, LUCENE-4752.patch, LUCENE-4752.patch, 
 natural_10M_ingestion.log, sorting_10M_ingestion.log


 It would be awesome if Lucene could write the documents out in a segment 
 based on a configurable order.  This of course applies to merging segments 
 too. The benefit is increased locality on disk of documents that are likely to 
 be accessed together.  This often applies to documents near each other in 
 time, but also spatially.




[jira] [Created] (LUCENE-4858) Ability to terminate queries on a per-segment basis

2013-03-20 Thread Adrien Grand (JIRA)
Adrien Grand created LUCENE-4858:


 Summary: Ability to terminate queries on a per-segment basis
 Key: LUCENE-4858
 URL: https://issues.apache.org/jira/browse/LUCENE-4858
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Adrien Grand
Assignee: Adrien Grand
Priority: Minor
 Fix For: 4.3


Spin-off of LUCENE-4752.

When an index is sorted per-segment, queries that sort according to the index 
sort order could be early terminated.





[jira] [Commented] (LUCENE-4752) Merge segments to sort them

2013-03-20 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13607647#comment-13607647
 ] 

Adrien Grand commented on LUCENE-4752:
--

I opened LUCENE-4858 to deal with early query termination (as you suggested 
earlier) so that we can concentrate on sorting in this issue.

bq. Adrien, perhaps in order to keep the patch small, commit separately the 
changes to LTC and TestDuelingCodec (as well as the SortingAtomicReader.wrap 
change)

I'll do that soon if nobody objects.

 Merge segments to sort them
 ---

 Key: LUCENE-4752
 URL: https://issues.apache.org/jira/browse/LUCENE-4752
 Project: Lucene - Core
  Issue Type: New Feature
  Components: core/index
Reporter: David Smiley
Assignee: Adrien Grand
 Attachments: LUCENE-4752.patch, LUCENE-4752.patch, LUCENE-4752.patch, 
 LUCENE-4752.patch, LUCENE-4752.patch, LUCENE-4752.patch, 
 natural_10M_ingestion.log, sorting_10M_ingestion.log


 It would be awesome if Lucene could write the documents out in a segment 
 based on a configurable order.  This of course applies to merging segments 
 too. The benefit is increased locality on disk of documents that are likely to 
 be accessed together.  This often applies to documents near each other in 
 time, but also spatially.




[jira] [Updated] (LUCENE-4858) Ability to terminate queries on a per-segment basis

2013-03-20 Thread Adrien Grand (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand updated LUCENE-4858:
-

Description: 
Spin-off of LUCENE-4752, see 
https://issues.apache.org/jira/browse/LUCENE-4752?focusedCommentId=13606565&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13606565
 and 
https://issues.apache.org/jira/browse/LUCENE-4752?focusedCommentId=13607282&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13607282

When an index is sorted per-segment, queries that sort according to the index 
sort order could be early terminated.


  was:
Spin-off of LUCENE-4752.

When an index is sorted per-segment, queries that sort according to the index 
sort order could be early terminated.



 Ability to terminate queries on a per-segment basis
 ---

 Key: LUCENE-4858
 URL: https://issues.apache.org/jira/browse/LUCENE-4858
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Adrien Grand
Assignee: Adrien Grand
Priority: Minor
 Fix For: 4.3


 Spin-off of LUCENE-4752, see 
 https://issues.apache.org/jira/browse/LUCENE-4752?focusedCommentId=13606565&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13606565
  and 
 https://issues.apache.org/jira/browse/LUCENE-4752?focusedCommentId=13607282&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13607282
 When an index is sorted per-segment, queries that sort according to the index 
 sort order could be early terminated.




[jira] [Commented] (LUCENE-4858) Ability to terminate queries on a per-segment basis

2013-03-20 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13607711#comment-13607711
 ] 

Adrien Grand commented on LUCENE-4858:
--

{quote} What in the patch guarantees that any segment with more than 
maxBufferedDocs is sorted? Perhaps I've missed it, but I looked for code which 
ensures every such segment gets picked up by SortingMP, however didn't find it.

I don't think that in general we should make assumptions based on a 
maxBufferedDocs setting because the default setting in IWC is per RAM 
consumption and also it seems slightly unrelated. I.e. if a segment is sorted, 
but has deletions such that numDocs < maxBufferedDocs, we do full collection, 
while we can early terminate as usual?{quote}

Indeed I think that finding out which segments are sorted is the main issue. My 
idea was to say that if you want to use early query termination, you need to 
set maxBufferedDocs to a given limit (low values improve early query 
termination while high values improve indexing speed), so that large segments 
(the ones that are interesting for early query termination since they require 
time to collect) that have more than maxBufferedDocs documents (deleted or not) 
are known to be sorted, because they result from a merge. Of course, this could 
miss some small segments which are sorted but since they are small, they're not 
as interesting for early query termination?
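
The heuristic described above can be sketched as follows. This is an illustration under simplifying assumptions, not the patch: it pretends every document matches the query, that a sorted segment can stop after topN documents, and that segment sizes count deleted documents too. `SortedSegmentHeuristicSketch` and its methods are hypothetical names:

```java
// Hypothetical sketch of the maxBufferedDocs heuristic, not the patch.
final class SortedSegmentHeuristicSketch {

  // A segment larger than maxBufferedDocs cannot come from a flush, so it
  // must result from a merge and is assumed to be sorted.
  static boolean assumedSorted(int segmentMaxDoc, int maxBufferedDocs) {
    return segmentMaxDoc > maxBufferedDocs;
  }

  // Number of documents a match-all query would have to collect when
  // collection stops after topN documents on segments assumed sorted.
  static long docsToCollect(int[] segmentMaxDocs, int maxBufferedDocs, int topN) {
    long total = 0;
    for (int maxDoc : segmentMaxDocs) {
      if (assumedSorted(maxDoc, maxBufferedDocs)) {
        total += Math.min(topN, maxDoc);
      } else {
        total += maxDoc; // small, possibly unsorted segment: full collection
      }
    }
    return total;
  }
}
```

For segments of 1,000,000, 50,000, 800 and 300 documents with maxBufferedDocs = 1000 and topN = 10, only 10 + 10 + 800 + 300 = 1120 documents are collected, which is why missing the small segments matters little.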

What options do we have here? I think you mentioned tagging sorted segments, 
do you have an idea where/how we could do that?

bq. And hopefully we can stuff the early termination logic down to 
IndexSearcher eventually. There are other scenarios for early termination, such 
as time limit, and therefore I think it's ok if we have an 
EarlyTerminationException which IndexSearcher responds to.

Indeed, I think this makes sense.

 Ability to terminate queries on a per-segment basis
 ---

 Key: LUCENE-4858
 URL: https://issues.apache.org/jira/browse/LUCENE-4858
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Adrien Grand
Assignee: Adrien Grand
Priority: Minor
 Fix For: 4.3


 Spin-off of LUCENE-4752, see 
 https://issues.apache.org/jira/browse/LUCENE-4752?focusedCommentId=13606565&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13606565
  and 
 https://issues.apache.org/jira/browse/LUCENE-4752?focusedCommentId=13607282&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13607282
 When an index is sorted per-segment, queries that sort according to the index 
 sort order could be early terminated.




[jira] [Created] (LUCENE-4862) Ability to terminate queries on a per-segment basis

2013-03-20 Thread Adrien Grand (JIRA)
Adrien Grand created LUCENE-4862:


 Summary: Ability to terminate queries on a per-segment basis
 Key: LUCENE-4862
 URL: https://issues.apache.org/jira/browse/LUCENE-4862
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Adrien Grand
Assignee: Adrien Grand
Priority: Minor
 Fix For: 4.3


Spin-off of LUCENE-4752. The idea is to add a marker exception that tells 
IndexSearcher to terminate the collection of the current segment.




[jira] [Updated] (LUCENE-4858) Early termination with SortingMergePolicy

2013-03-20 Thread Adrien Grand (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand updated LUCENE-4858:
-

Summary: Early termination with SortingMergePolicy  (was: Ability to 
terminate queries on a per-segment basis)

 Early termination with SortingMergePolicy
 -

 Key: LUCENE-4858
 URL: https://issues.apache.org/jira/browse/LUCENE-4858
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Adrien Grand
Assignee: Adrien Grand
Priority: Minor
 Fix For: 4.3


 Spin-off of LUCENE-4752, see 
 https://issues.apache.org/jira/browse/LUCENE-4752?focusedCommentId=13606565&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13606565
  and 
 https://issues.apache.org/jira/browse/LUCENE-4752?focusedCommentId=13607282&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13607282
 When an index is sorted per-segment, queries that sort according to the index 
 sort order could be early terminated.




[jira] [Commented] (LUCENE-4858) Early termination with SortingMergePolicy

2013-03-20 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13607816#comment-13607816
 ] 

Adrien Grand commented on LUCENE-4858:
--

bq. Can't we split this issue up? I think the current discussion is focused 
much on this sorted segments thing, but that's not the only possible 
implementation for this kind of thing.

I created LUCENE-4862.

 Early termination with SortingMergePolicy
 -

 Key: LUCENE-4858
 URL: https://issues.apache.org/jira/browse/LUCENE-4858
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Adrien Grand
Assignee: Adrien Grand
Priority: Minor
 Fix For: 4.3


 Spin-off of LUCENE-4752, see 
 https://issues.apache.org/jira/browse/LUCENE-4752?focusedCommentId=13606565&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13606565
  and 
 https://issues.apache.org/jira/browse/LUCENE-4752?focusedCommentId=13607282&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13607282
 When an index is sorted per-segment, queries that sort according to the index 
 sort order could be early terminated.




[jira] [Updated] (LUCENE-4862) Ability to terminate queries on a per-segment basis

2013-03-20 Thread Adrien Grand (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand updated LUCENE-4862:
-

Attachment: LUCENE-4862.patch

Patch that adds a new CollectionTerminatedException. When thrown from 
Collector.collect, IndexSearcher swallows it and terminates collection of the 
current IndexReader leaf.
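
The control flow can be sketched in plain Java as below. This is a self-contained illustration that does not use Lucene classes: `TerminateSegmentException` is a hypothetical stand-in for the marker exception, segments are modeled as int arrays of doc IDs, and the search loop is a simplified stand-in for what IndexSearcher does per leaf:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the marker-exception pattern, without Lucene.
final class EarlyTerminationSketch {

  // Stand-in for the marker exception: carries no information, it only
  // signals "stop collecting the current segment".
  static final class TerminateSegmentException extends RuntimeException {}

  // Collects at most `limit` documents per segment, then asks to stop.
  static final class FirstNPerSegmentCollector {
    private final int limit;
    private int collectedInSegment;
    final List<Integer> hits = new ArrayList<>();

    FirstNPerSegmentCollector(int limit) {
      this.limit = limit;
    }

    void startSegment() {
      collectedInSegment = 0;
    }

    void collect(int doc) {
      if (collectedInSegment == limit) {
        throw new TerminateSegmentException();
      }
      hits.add(doc);
      collectedInSegment++;
    }
  }

  // Simplified search loop: the exception is swallowed and collection
  // simply continues with the next segment.
  static void search(int[][] segments, FirstNPerSegmentCollector collector) {
    for (int[] segment : segments) {
      collector.startSegment();
      try {
        for (int doc : segment) {
          collector.collect(doc);
        }
      } catch (TerminateSegmentException e) {
        // Expected: the collector is done with this segment only.
      }
    }
  }
}
```

The point of using an exception here is that the collector can abort one segment without the search loop having to know why, and without aborting the whole search.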

 Ability to terminate queries on a per-segment basis
 ---

 Key: LUCENE-4862
 URL: https://issues.apache.org/jira/browse/LUCENE-4862
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Adrien Grand
Assignee: Adrien Grand
Priority: Minor
 Fix For: 4.3

 Attachments: LUCENE-4862.patch


 Spin-off of LUCENE-4752. The idea is to add a marker exception that tells 
 IndexSearcher to terminate the collection of the current segment.




[jira] [Resolved] (LUCENE-4847) Sorter API: Fully reuse docs enums

2013-03-20 Thread Adrien Grand (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand resolved LUCENE-4847.
--

Resolution: Fixed

 Sorter API: Fully reuse docs enums
 --

 Key: LUCENE-4847
 URL: https://issues.apache.org/jira/browse/LUCENE-4847
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Adrien Grand
Assignee: Adrien Grand
Priority: Minor
 Fix For: 4.3

 Attachments: LUCENE-4847.patch


 SortingAtomicReader reuses the filtered docs enums but not the wrapper. In 
 the case of SortingAtomicReader this can be a problem because the wrappers 
 are heavyweight (they load the whole postings list into memory), so an index 
 with many terms with high freqs will make the JVM allocate a lot of memory 
 when browsing the postings lists.




[jira] [Updated] (LUCENE-4752) Merge segments to sort them

2013-03-20 Thread Adrien Grand (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand updated LUCENE-4752:
-

Attachment: LUCENE-4752.patch

New patch, focused on SortingMergePolicy, ready to be reviewed!

 Merge segments to sort them
 ---

 Key: LUCENE-4752
 URL: https://issues.apache.org/jira/browse/LUCENE-4752
 Project: Lucene - Core
  Issue Type: New Feature
  Components: core/index
Reporter: David Smiley
Assignee: Adrien Grand
 Attachments: LUCENE-4752.patch, LUCENE-4752.patch, LUCENE-4752.patch, 
 LUCENE-4752.patch, LUCENE-4752.patch, LUCENE-4752.patch, LUCENE-4752.patch, 
 natural_10M_ingestion.log, sorting_10M_ingestion.log


 It would be awesome if Lucene could write the documents out in a segment 
 based on a configurable order.  This of course applies to merging segments 
 too. The benefit is increased locality on disk of documents that are likely to 
 be accessed together.  This often applies to documents near each other in 
 time, but also spatially.




[jira] [Created] (LUCENE-4867) SorterTemplate.merge is slow

2013-03-21 Thread Adrien Grand (JIRA)
Adrien Grand created LUCENE-4867:


 Summary: SorterTemplate.merge is slow
 Key: LUCENE-4867
 URL: https://issues.apache.org/jira/browse/LUCENE-4867
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Adrien Grand
Assignee: Adrien Grand
Priority: Minor


SorterTemplate.mergeSort/timSort can be very slow. For example, I ran a quick 
benchmark that sorts an Integer[] array of 50M elements, and mergeSort was 
almost 4x slower than quickSort (~12.8s for quickSort, ~46.5s for mergeSort). 
This is even worse when the cost of a swap is higher (e.g. parallel arrays).

This is due to SorterTemplate.merge. I first feared that this method might not 
be linear, but it is, so the slowness is due to the fact that this method needs 
to swap lots of values in order not to require extra memory. Could we make it 
faster?

For reference, I hacked a SorterTemplate instance to use the usual merge 
routine (that requires n/2 elements in memory), and it was much faster: ~17s on 
average, so there is room for improvement.
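
For readers unfamiliar with the "usual merge routine", here is a minimal sketch of it on an int[] (this is not the hacked SorterTemplate instance from the benchmark; `BufferedMergeSortSketch` is a hypothetical name). It copies the left run into a scratch buffer of n/2 elements and merges back into the array, so each element is moved a small constant number of times per merge:

```java
// Hypothetical sketch of a merge sort that uses n/2 elements of scratch
// space; not the code used for the benchmark numbers above.
final class BufferedMergeSortSketch {

  static void mergeSort(int[] a) {
    // The left run of any merge has at most a.length / 2 elements.
    int[] tmp = new int[a.length / 2];
    sort(a, 0, a.length, tmp);
  }

  private static void sort(int[] a, int lo, int hi, int[] tmp) {
    if (hi - lo < 2) {
      return;
    }
    int mid = (lo + hi) >>> 1;
    sort(a, lo, mid, tmp);
    sort(a, mid, hi, tmp);
    merge(a, lo, mid, hi, tmp);
  }

  // Merges the sorted runs [lo, mid) and [mid, hi).
  private static void merge(int[] a, int lo, int mid, int hi, int[] tmp) {
    int leftLen = mid - lo;
    System.arraycopy(a, lo, tmp, 0, leftLen);
    int i = 0;   // next unread element of the copied left run
    int j = mid; // next unread element of the right run
    int k = lo;  // next write position; k never overtakes j
    while (i < leftLen && j < hi) {
      // Taking from the left run on ties keeps the sort stable.
      a[k++] = (a[j] < tmp[i]) ? a[j++] : tmp[i++];
    }
    while (i < leftLen) {
      a[k++] = tmp[i++];
    }
    // Any remaining right-run elements are already in their final place.
  }
}
```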




[jira] [Updated] (LUCENE-4867) SorterTemplate.merge is slow

2013-03-21 Thread Adrien Grand (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand updated LUCENE-4867:
-

Attachment: SortBench.java

Here is the program I used for testing.

 SorterTemplate.merge is slow
 

 Key: LUCENE-4867
 URL: https://issues.apache.org/jira/browse/LUCENE-4867
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Adrien Grand
Assignee: Adrien Grand
Priority: Minor
 Attachments: SortBench.java


 SorterTemplate.mergeSort/timSort can be very slow. For example, I ran a quick 
 benchmark that sorts an Integer[] array of 50M elements, and mergeSort was 
 almost 4x slower than quickSort (~12.8s for quickSort, ~46.5s for mergeSort). 
 This is even worse when the cost of a swap is higher (e.g. parallel arrays).
 This is due to SorterTemplate.merge. I first feared that this method might 
 not be linear, but it is, so the slowness is due to the fact that this method 
 needs to swap lots of values in order not to require extra memory. Could we 
 make it faster?
 For reference, I hacked a SorterTemplate instance to use the usual merge 
 routine (that requires n/2 elements in memory), and it was much faster: ~17s 
 on average, so there is room for improvement.




[jira] [Updated] (LUCENE-4867) SorterTemplate.merge is slow

2013-03-21 Thread Adrien Grand (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand updated LUCENE-4867:
-

Attachment: LUCENE-4867.patch

bq. If you want a faster algorithm, you have to move away from in-place.

In that case, could we make SorterTemplate.merge overridable (protected) so 
that custom templates can use extra memory to merge? The attached patch 
modifies ArrayUtil to show how it could be used to implement a faster merge, 
which makes mergeSort more than 2x faster (~21s on average on my 50M array) 
although it only requires 1% of additional memory. What do you think?
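
A rough sketch of how a merge with a small, bounded buffer could work is shown below. It is not the code from the attached patch: `BoundedBufferMergeSketch` and `bufferSize` are hypothetical, and the in-place fallback is the classic rotation-based merge, which may differ from what SorterTemplate does. The idea is to use the buffer whenever the left run fits in it and to fall back to the in-place routine otherwise:

```java
// Hypothetical sketch, not the LUCENE-4867 patch.
final class BoundedBufferMergeSketch {

  static void mergeSort(int[] a, int bufferSize) {
    sort(a, 0, a.length, new int[bufferSize]);
  }

  private static void sort(int[] a, int lo, int hi, int[] buf) {
    if (hi - lo < 2) return;
    int mid = (lo + hi) >>> 1;
    sort(a, lo, mid, buf);
    sort(a, mid, hi, buf);
    merge(a, lo, mid, hi, buf);
  }

  // Merges the sorted runs [lo, mid) and [mid, hi).
  static void merge(int[] a, int lo, int mid, int hi, int[] buf) {
    int len1 = mid - lo, len2 = hi - mid;
    if (len1 == 0 || len2 == 0) return;
    if (len1 <= buf.length) { // fast path: left run fits in the buffer
      mergeWithBuffer(a, lo, mid, hi, buf);
      return;
    }
    if (len1 + len2 == 2) {
      if (a[mid] < a[lo]) { int t = a[lo]; a[lo] = a[mid]; a[mid] = t; }
      return;
    }
    // In-place fallback: cut both runs around a pivot, rotate the middle
    // part so that small elements come first, then merge both halves.
    int firstCut, secondCut;
    if (len1 > len2) {
      firstCut = lo + len1 / 2;
      secondCut = lowerBound(a, mid, hi, a[firstCut]);
    } else {
      secondCut = mid + len2 / 2;
      firstCut = upperBound(a, lo, mid, a[secondCut]);
    }
    rotate(a, firstCut, mid, secondCut);
    int newMid = firstCut + (secondCut - mid);
    merge(a, lo, firstCut, newMid, buf);
    merge(a, newMid, secondCut, hi, buf);
  }

  private static void mergeWithBuffer(int[] a, int lo, int mid, int hi, int[] buf) {
    int leftLen = mid - lo;
    System.arraycopy(a, lo, buf, 0, leftLen);
    int i = 0, j = mid, k = lo;
    while (i < leftLen && j < hi) {
      a[k++] = (a[j] < buf[i]) ? a[j++] : buf[i++];
    }
    while (i < leftLen) a[k++] = buf[i++];
  }

  // First index in [from, to) whose value is >= key, or `to` if none.
  private static int lowerBound(int[] a, int from, int to, int key) {
    while (from < to) {
      int m = (from + to) >>> 1;
      if (a[m] < key) from = m + 1; else to = m;
    }
    return from;
  }

  // First index in [from, to) whose value is > key, or `to` if none.
  private static int upperBound(int[] a, int from, int to, int key) {
    while (from < to) {
      int m = (from + to) >>> 1;
      if (a[m] <= key) from = m + 1; else to = m;
    }
    return from;
  }

  // Moves [mid, hi) in front of [lo, mid) using three reversals.
  private static void rotate(int[] a, int lo, int mid, int hi) {
    reverse(a, lo, mid);
    reverse(a, mid, hi);
    reverse(a, lo, hi);
  }

  private static void reverse(int[] a, int from, int to) {
    for (int i = from, j = to - 1; i < j; i++, j--) {
      int t = a[i]; a[i] = a[j]; a[j] = t;
    }
  }
}
```

Calling mergeSort(array, array.length / 100) corresponds to the 1% of additional memory mentioned above; a buffer size of 0 degenerates to the fully in-place merge.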

 SorterTemplate.merge is slow
 

 Key: LUCENE-4867
 URL: https://issues.apache.org/jira/browse/LUCENE-4867
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Adrien Grand
Assignee: Adrien Grand
Priority: Minor
 Attachments: LUCENE-4867.patch, SortBench.java


 SorterTemplate.mergeSort/timSort can be very slow. For example, I ran a quick 
 benchmark that sorts an Integer[] array of 50M elements, and mergeSort was 
 almost 4x slower than quickSort (~12.8s for quickSort, ~46.5s for mergeSort). 
 This is even worse when the cost of a swap is higher (e.g. parallel arrays).
 This is due to SorterTemplate.merge. I first feared that this method might 
 not be linear, but it is, so the slowness is due to the fact that this method 
 needs to swap lots of values in order not to require extra memory. Could we 
 make it faster?
 For reference, I hacked a SorterTemplate instance to use the usual merge 
 routine (that requires n/2 elements in memory), and it was much faster: ~17s 
 on average, so there is room for improvement.




[jira] [Commented] (LUCENE-4867) SorterTemplate.merge is slow

2013-03-21 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13609014#comment-13609014
 ] 

Adrien Grand commented on LUCENE-4867:
--

bq. Or did you implement it separate to not allocate the extra array, if only 
quicksort is called?

Exactly.

 SorterTemplate.merge is slow
 

 Key: LUCENE-4867
 URL: https://issues.apache.org/jira/browse/LUCENE-4867
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Adrien Grand
Assignee: Adrien Grand
Priority: Minor
 Attachments: LUCENE-4867.patch, SortBench.java


 SorterTemplate.mergeSort/timSort can be very slow. For example, I ran a quick 
 benchmark that sorts an Integer[] array of 50M elements, and mergeSort was 
 almost 4x slower than quickSort (~12.8s for quickSort, ~46.5s for mergeSort). 
 This is even worse when the cost of a swap is higher (e.g. parallel arrays).
 This is due to SorterTemplate.merge. I first feared that this method might 
 not be linear, but it is, so the slowness is due to the fact that this method 
 needs to swap lots of values in order not to require extra memory. Could we 
 make it faster?
 For reference, I hacked a SorterTemplate instance to use the usual merge 
 routine (that requires n/2 elements in memory), and it was much faster: ~17s 
 on average, so there is room for improvement.




[jira] [Commented] (LUCENE-4867) SorterTemplate.merge is slow

2013-03-21 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13609024#comment-13609024
 ] 

Adrien Grand commented on LUCENE-4867:
--

bq. Otherwise I am fine with doing it that way, if we do not enforce users to 
implement the merge code.

OK. I'll update the patch to port the same behavior to CollectionUtil.

 SorterTemplate.merge is slow
 

 Key: LUCENE-4867
 URL: https://issues.apache.org/jira/browse/LUCENE-4867
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Adrien Grand
Assignee: Adrien Grand
Priority: Minor
 Attachments: LUCENE-4867.patch, SortBench.java


 SorterTemplate.mergeSort/timSort can be very slow. For example, I ran a quick 
 benchmark that sorts an Integer[] array of 50M elements, and mergeSort was 
 almost 4x slower than quickSort (~12.8s for quickSort, ~46.5s for mergeSort). 
 This is even worse when the cost of a swap is higher (e.g. parallel arrays).
 This is due to SorterTemplate.merge. I first feared that this method might 
 not be linear, but it is, so the slowness is due to the fact that this method 
 needs to swap lots of values in order not to require extra memory. Could we 
 make it faster?
 For reference, I hacked a SorterTemplate instance to use the usual merge 
 routine (that requires n/2 elements in memory), and it was much faster: ~17s 
 on average, so there is room for improvement.




[jira] [Updated] (LUCENE-4867) SorterTemplate.merge is slow

2013-03-21 Thread Adrien Grand (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand updated LUCENE-4867:
-

Attachment: LUCENE-4867.patch

Patch that makes SorterTemplate.merge protected and makes ArrayUtil and 
CollectionUtil use specialized SorterTemplate instances that use up to 1% extra 
memory for faster merge-based sorts.

I'll open a separate issue to use the same optimizations for the sorter API's 
timsorts.

 SorterTemplate.merge is slow
 

 Key: LUCENE-4867
 URL: https://issues.apache.org/jira/browse/LUCENE-4867
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Adrien Grand
Assignee: Adrien Grand
Priority: Minor
 Attachments: LUCENE-4867.patch, LUCENE-4867.patch, SortBench.java


 SorterTemplate.mergeSort/timSort can be very slow. For example, I ran a quick 
 benchmark that sorts an Integer[] array of 50M elements, and mergeSort was 
 almost 4x slower than quickSort (~12.8s for quickSort, ~46.5s for mergeSort). 
 This is even worse when the cost of a swap is higher (e.g. parallel arrays).
 This is due to SorterTemplate.merge. I first feared that this method might 
 not be linear, but it is, so the slowness is due to the fact that this method 
 needs to swap lots of values in order not to require extra memory. Could we 
 make it faster?
 For reference, I hacked a SorterTemplate instance to use the usual merge 
 routine (that requires n/2 elements in memory), and it was much faster: ~17s 
 on average, so there is room for improvement.




[jira] [Resolved] (LUCENE-4862) Ability to terminate queries on a per-segment basis

2013-03-21 Thread Adrien Grand (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand resolved LUCENE-4862.
--

Resolution: Fixed

Thank you for the review Shai!

 Ability to terminate queries on a per-segment basis
 ---

 Key: LUCENE-4862
 URL: https://issues.apache.org/jira/browse/LUCENE-4862
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Adrien Grand
Assignee: Adrien Grand
Priority: Minor
 Fix For: 4.3

 Attachments: LUCENE-4862.patch


 Spin-off of LUCENE-4752. The idea is to add a marker exception that tells 
 IndexSearcher to terminate the collection of the current segment.
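A minimal sketch of the marker-exception idea, assuming a toy search loop over int[] segments instead of the real IndexSearcher and Collector classes. TerminateSegmentException, SegmentCollector and firstNPerSegment are hypothetical names used only for this example.

```java
import java.util.ArrayList;
import java.util.List;

public class SegmentTerminationSketch {

  /** Hypothetical marker exception: carries no state, only signals "skip the rest of this segment". */
  static final class TerminateSegmentException extends RuntimeException {}

  interface SegmentCollector {
    void collect(int segment, int doc);
  }

  /** Toy search loop: every inner array stands for the matching docs of one segment. */
  static void search(int[][] segments, SegmentCollector collector) {
    for (int seg = 0; seg < segments.length; seg++) {
      try {
        for (int doc : segments[seg]) {
          collector.collect(seg, doc);
        }
      } catch (TerminateSegmentException e) {
        // Only the current segment is abandoned; the loop moves on to the next one.
      }
    }
  }

  /** Collects at most n docs per segment by throwing the marker exception. */
  static List<Integer> firstNPerSegment(int[][] segments, int n) {
    List<Integer> hits = new ArrayList<>();
    int[] seen = new int[segments.length];
    search(segments, (seg, doc) -> {
      if (seen[seg] == n) {
        throw new TerminateSegmentException();
      }
      seen[seg]++;
      hits.add(doc);
    });
    return hits;
  }

  public static void main(String[] args) {
    int[][] segments = {{1, 2, 3, 4}, {5}, {6, 7, 8}};
    System.out.println(firstNPerSegment(segments, 2));
  }
}
```

The point of the design is that the collector decides when a segment is done, while the search loop stays in charge of moving to the next segment.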




[jira] [Commented] (LUCENE-4752) Merge segments to sort them

2013-03-21 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13609231#comment-13609231
 ] 

Adrien Grand commented on LUCENE-4752:
--

I plan to commit it tomorrow unless someone objects.

 Merge segments to sort them
 ---

 Key: LUCENE-4752
 URL: https://issues.apache.org/jira/browse/LUCENE-4752
 Project: Lucene - Core
  Issue Type: New Feature
  Components: core/index
Reporter: David Smiley
Assignee: Adrien Grand
 Attachments: LUCENE-4752.patch, LUCENE-4752.patch, LUCENE-4752.patch, 
 LUCENE-4752.patch, LUCENE-4752.patch, LUCENE-4752.patch, LUCENE-4752.patch, 
 natural_10M_ingestion.log, sorting_10M_ingestion.log


 It would be awesome if Lucene could write the documents out in a segment 
 based on a configurable order.  This of course applies to merging segments 
 too. The benefit is increased locality on disk of documents that are likely to 
 be accessed together.  This often applies to documents near each other in 
 time, but also spatially.




[jira] [Commented] (LUCENE-4571) speedup disjunction with minShouldMatch

2013-03-22 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13610099#comment-13610099
 ] 

Adrien Grand commented on LUCENE-4571:
--

Agreed, these speedups are awesome!

 speedup disjunction with minShouldMatch 
 

 Key: LUCENE-4571
 URL: https://issues.apache.org/jira/browse/LUCENE-4571
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/search
Affects Versions: 4.1
Reporter: Mikhail Khludnev
 Attachments: LUCENE-4571.patch, LUCENE-4571.patch, LUCENE-4571.patch, 
 LUCENE-4571.patch, LUCENE-4571.patch, LUCENE-4571.patch


 Even when minShouldMatch is supplied to DisjunctionSumScorer, it enumerates the 
 whole disjunction and verifies the minShouldMatch condition [on every 
 doc|https://github.com/apache/lucene-solr/blob/trunk/lucene/core/src/java/org/apache/lucene/search/DisjunctionSumScorer.java#L70]:
 {code}
   public int nextDoc() throws IOException {
     assert doc != NO_MORE_DOCS;
     while(true) {
       while (subScorers[0].docID() == doc) {
         if (subScorers[0].nextDoc() != NO_MORE_DOCS) {
           heapAdjust(0);
         } else {
           heapRemoveRoot();
           if (numScorers < minimumNrMatchers) {
             return doc = NO_MORE_DOCS;
           }
         }
       }
       afterNext();
       if (nrMatchers >= minimumNrMatchers) {
         break;
       }
     }
 
     return doc;
   }
 {code}
 [~spo] proposes (as far as I understand it) to pop nrMatchers-1 scorers from 
 the heap first, and then push them back after advancing them past that top 
 doc. For me, question no. 1 is whether there is a performance test for 
 minShouldMatch-constrained disjunctions.




[jira] [Created] (LUCENE-4871) Sorter API: better compress positions, offsets and payloads in SortingDocsAndPositionsEnum

2013-03-22 Thread Adrien Grand (JIRA)
Adrien Grand created LUCENE-4871:


 Summary: Sorter API: better compress positions, offsets and 
payloads in SortingDocsAndPositionsEnum
 Key: LUCENE-4871
 URL: https://issues.apache.org/jira/browse/LUCENE-4871
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Adrien Grand
Assignee: Adrien Grand
Priority: Trivial
 Fix For: 4.3


SortingDocsAndPositionsEnum could easily save memory by using a 
Lucene40TCF-like compression method for positions, offsets and payloads:
 - delta-encode positions and startOffsets (with the previous end offset),
 - store the length of the tokens instead of their end offset (endOffset == 
startOffset + length),
 - use a single bit to say whether the token has a payload.
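The three points above can be sketched as a small round-trip encoder. This is only an illustration under stated assumptions, not the SortingDocsAndPositionsEnum code: TokenDeltaSketch is a made-up class, tokens are plain int arrays, and the output is an int[] rather than a byte stream. The gain in a real implementation would presumably come from writing these small values with a variable-length encoding, which the sketch leaves out.

```java
import java.util.Arrays;

public class TokenDeltaSketch {

  /** Each token is {position, startOffset, endOffset, hasPayload (0 or 1)}. */
  static int[] encode(int[][] tokens) {
    int[] out = new int[tokens.length * 3];
    int prevPosition = 0;
    int prevEndOffset = 0;
    int k = 0;
    for (int[] t : tokens) {
      int position = t[0], start = t[1], end = t[2], hasPayload = t[3];
      // Position delta and payload flag share one int: the low bit is the flag.
      out[k++] = ((position - prevPosition) << 1) | hasPayload;
      // Start offset relative to the previous end offset
      // (can be negative when tokens overlap).
      out[k++] = start - prevEndOffset;
      // Token length instead of the end offset.
      out[k++] = end - start;
      prevPosition = position;
      prevEndOffset = end;
    }
    return out;
  }

  static int[][] decode(int[] encoded) {
    int[][] tokens = new int[encoded.length / 3][];
    int prevPosition = 0;
    int prevEndOffset = 0;
    for (int i = 0, k = 0; i < tokens.length; i++) {
      int head = encoded[k++];
      int position = prevPosition + (head >>> 1);
      int hasPayload = head & 1;
      int start = prevEndOffset + encoded[k++];
      int end = start + encoded[k++]; // endOffset == startOffset + length
      tokens[i] = new int[] {position, start, end, hasPayload};
      prevPosition = position;
      prevEndOffset = end;
    }
    return tokens;
  }

  public static void main(String[] args) {
    int[][] tokens = {{0, 0, 5, 0}, {1, 6, 11, 1}, {4, 20, 23, 0}};
    int[] encoded = encode(tokens);
    System.out.println(Arrays.toString(encoded));
    System.out.println(Arrays.deepToString(decode(encoded)));
  }
}
```

The sketch assumes positions never decrease within a document, so the position delta is non-negative and can be shifted safely.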




[jira] [Updated] (LUCENE-4871) Sorter API: better compress positions, offsets and payloads in SortingDocsAndPositionsEnum

2013-03-22 Thread Adrien Grand (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand updated LUCENE-4871:
-

Attachment: LUCENE-4871.patch

Patch.

 Sorter API: better compress positions, offsets and payloads in 
 SortingDocsAndPositionsEnum
 --

 Key: LUCENE-4871
 URL: https://issues.apache.org/jira/browse/LUCENE-4871
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Adrien Grand
Assignee: Adrien Grand
Priority: Trivial
 Fix For: 4.3

 Attachments: LUCENE-4871.patch


 SortingDocsAndPositionsEnum could easily save memory by using a 
 Lucene40TCF-like compression method for positions, offsets and payloads:
  - delta-encode positions and startOffsets (with the previous end offset),
  - store the length of the tokens instead of their end offset (endOffset == 
 startOffset + length),
  - use a single bit to say whether the token has a payload.




[jira] [Resolved] (LUCENE-4752) Merge segments to sort them

2013-03-22 Thread Adrien Grand (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand resolved LUCENE-4752.
--

Resolution: Fixed

bq. Adrien, you didn't put your name in the CHANGES entry . +1 to commit.

Fixed and committed. Thank you Shai!

 Merge segments to sort them
 ---

 Key: LUCENE-4752
 URL: https://issues.apache.org/jira/browse/LUCENE-4752
 Project: Lucene - Core
  Issue Type: New Feature
  Components: core/index
Reporter: David Smiley
Assignee: Adrien Grand
 Attachments: LUCENE-4752.patch, LUCENE-4752.patch, LUCENE-4752.patch, 
 LUCENE-4752.patch, LUCENE-4752.patch, LUCENE-4752.patch, LUCENE-4752.patch, 
 natural_10M_ingestion.log, sorting_10M_ingestion.log


 It would be awesome if Lucene could write the documents out in a segment 
 based on a configurable order.  This of course applies to merging segments 
 too. The benefit is increased locality on disk of documents that are likely to 
 be accessed together.  This often applies to documents near each other in 
 time, but also spatially.




[jira] [Resolved] (LUCENE-4871) Sorter API: better compress positions, offsets and payloads in SortingDocsAndPositionsEnum

2013-03-22 Thread Adrien Grand (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand resolved LUCENE-4871.
--

Resolution: Fixed

 Sorter API: better compress positions, offsets and payloads in 
 SortingDocsAndPositionsEnum
 --

 Key: LUCENE-4871
 URL: https://issues.apache.org/jira/browse/LUCENE-4871
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Adrien Grand
Assignee: Adrien Grand
Priority: Trivial
 Fix For: 4.3

 Attachments: LUCENE-4871.patch


 SortingDocsAndPositionsEnum could easily save memory by using a 
 Lucene40TCF-like compression method for positions, offsets and payloads:
  - delta-encode positions and startOffsets (with the previous end offset),
  - store the length of the tokens instead of their end offset (endOffset == 
 startOffset + length),
  - use a single bit to say whether the token has a payload.




[jira] [Resolved] (LUCENE-4867) SorterTemplate.merge is slow

2013-03-22 Thread Adrien Grand (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand resolved LUCENE-4867.
--

Resolution: Fixed

 SorterTemplate.merge is slow
 

 Key: LUCENE-4867
 URL: https://issues.apache.org/jira/browse/LUCENE-4867
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Adrien Grand
Assignee: Adrien Grand
Priority: Minor
 Attachments: LUCENE-4867.patch, LUCENE-4867.patch, SortBench.java


 SorterTemplate.mergeSort/timSort can be very slow. For example, I ran a quick 
 benchmark that sorts an Integer[] array of 50M elements, and mergeSort was 
 almost 4x slower than quickSort (~12.8s for quickSort, ~46.5s for mergeSort). 
 This is even worse when the cost of a swap is higher (e.g. parallel arrays).
 This is due to SorterTemplate.merge. I first feared that this method might 
 not be linear, but it is, so the slowness is due to the fact that this method 
 needs to swap lots of values in order not to require extra memory. Could we 
 make it faster?
 For reference, I hacked a SorterTemplate instance to use the usual merge 
 routine (that requires n/2 elements in memory), and it was much faster: ~17s 
 on average, so there is room for improvement.




[jira] [Created] (LUCENE-4874) Remove FilterTerms.intersect

2013-03-23 Thread Adrien Grand (JIRA)
Adrien Grand created LUCENE-4874:


 Summary: Remove FilterTerms.intersect
 Key: LUCENE-4874
 URL: https://issues.apache.org/jira/browse/LUCENE-4874
 Project: Lucene - Core
  Issue Type: Bug
Reporter: Adrien Grand
Priority: Minor


Terms.intersect is an optional method. The fact that it is overridden in 
FilterTerms forces any non-trivial class that extends Terms to override 
intersect in order for this method to behave correctly. If FilterTerms did 
not override this method and used the default impl, we would not have this 
problem.
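To make the problem concrete, here is a small sketch with hypothetical stand-in classes (TermSource, ForwardingFilter, PlainFilter), not the real Terms/FilterTerms API. A filter that hides some terms by overriding only the required method silently leaks them through the optional method when the filter base class forwards that method to the wrapped instance.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class FilterDelegationSketch {

  /** Hypothetical stand-in for Terms: one required method, one optional method. */
  abstract static class TermSource {
    abstract List<String> terms();

    /** Optional method; the default implementation is built on terms(). */
    List<String> intersect(String prefix) {
      List<String> out = new ArrayList<>();
      for (String t : terms()) {
        if (t.startsWith(prefix)) {
          out.add(t);
        }
      }
      return out;
    }
  }

  static class ListTermSource extends TermSource {
    private final List<String> values;
    ListTermSource(String... values) { this.values = Arrays.asList(values); }
    @Override List<String> terms() { return values; }
  }

  /** Filter base that forwards the optional method to the wrapped instance. */
  static class ForwardingFilter extends TermSource {
    final TermSource in;
    ForwardingFilter(TermSource in) { this.in = in; }
    @Override List<String> terms() { return in.terms(); }
    @Override List<String> intersect(String prefix) { return in.intersect(prefix); }
  }

  /** Filter base that keeps the default intersect, which calls this.terms(). */
  static class PlainFilter extends TermSource {
    final TermSource in;
    PlainFilter(TermSource in) { this.in = in; }
    @Override List<String> terms() { return in.terms(); }
  }

  /** Drops every term starting with "b"; shared by both subclasses below. */
  static List<String> hideB(List<String> terms) {
    List<String> out = new ArrayList<>();
    for (String t : terms) {
      if (!t.startsWith("b")) {
        out.add(t);
      }
    }
    return out;
  }

  static TermSource hidingViaForwarding(TermSource wrapped) {
    return new ForwardingFilter(wrapped) {
      @Override List<String> terms() { return hideB(in.terms()); }
    };
  }

  static TermSource hidingViaPlain(TermSource wrapped) {
    return new PlainFilter(wrapped) {
      @Override List<String> terms() { return hideB(in.terms()); }
    };
  }

  public static void main(String[] args) {
    TermSource base = new ListTermSource("bar", "baz", "foo");
    // The forwarded intersect bypasses the overridden terms() and leaks hidden terms.
    System.out.println(hidingViaForwarding(base).intersect("b"));
    // The default intersect goes through the overridden terms().
    System.out.println(hidingViaPlain(base).intersect("b"));
  }
}
```

With the forwarding base class, every such subclass has to remember to override intersect as well; with the plain base class the default implementation stays consistent automatically, at the cost of losing whatever optimized intersect the wrapped instance provides.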




[jira] [Updated] (LUCENE-4874) Remove FilterTerms.intersect

2013-03-23 Thread Adrien Grand (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand updated LUCENE-4874:
-

Description: Terms.intersect is an optional method. The fact that it is 
overridden in FilterTerms forces any non-trivial class that extends FilterTerms 
to override intersect in order for this method to behave correctly. If 
FilterTerms did not override this method and used the default impl, we would 
not have this problem.  (was: Terms.intersect is an optional method. The fact 
that it is overridden in FilterTerms forces any non-trivial class that extends 
Terms to override intersect in order this method to have a correct behavior. If 
FilterTerms did not override this method and used the default impl, we would 
not have this problem.)

 Remove FilterTerms.intersect
 

 Key: LUCENE-4874
 URL: https://issues.apache.org/jira/browse/LUCENE-4874
 Project: Lucene - Core
  Issue Type: Bug
Reporter: Adrien Grand
Priority: Minor

 Terms.intersect is an optional method. The fact that it is overridden in 
 FilterTerms forces any non-trivial class that extends FilterTerms to override 
 intersect in order for this method to behave correctly. If FilterTerms did 
 not override this method and used the default impl, we would not have this 
 problem.




[jira] [Commented] (LUCENE-4874) Remove FilterTerms.intersect

2013-03-23 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13611829#comment-13611829
 ] 

Adrien Grand commented on LUCENE-4874:
--

This makes sense. I found another bug in SortingAtomicReader, which doesn't 
override getCoreCacheKey; this could lead to very bad things if an atomic 
reader and its sorted view were both used with the same FieldCache instance.

I've started looking at methods that override default impls and would like to 
have your opinion on some of them:
 - shouldn't IndexReader.hasDeletions return numDeletedDocs() > 0 by default 
instead of being abstract?
 - isn't the default impl of TermsEnum.termState dangerous? Shouldn't it throw 
an UnsupportedOperationException or be abstract instead?

 Remove FilterTerms.intersect
 

 Key: LUCENE-4874
 URL: https://issues.apache.org/jira/browse/LUCENE-4874
 Project: Lucene - Core
  Issue Type: Bug
Reporter: Adrien Grand
Priority: Minor

 Terms.intersect is an optional method. The fact that it is overridden in 
 FilterTerms forces any non-trivial class that extends FilterTerms to override 
 intersect in order for this method to behave correctly. If FilterTerms did 
 not override this method and used the default impl, we would not have this 
 problem.




[jira] [Created] (LUCENE-4875) Make SorterTemplate.mergeSort run in linear time on sorted arrays

2013-03-23 Thread Adrien Grand (JIRA)
Adrien Grand created LUCENE-4875:


 Summary: Make SorterTemplate.mergeSort run in linear time on 
sorted arrays
 Key: LUCENE-4875
 URL: https://issues.apache.org/jira/browse/LUCENE-4875
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Adrien Grand
Assignee: Adrien Grand
Priority: Trivial
 Fix For: 4.3


Through minor modifications, SorterTemplate.mergeSort could run in linear time 
on sorted arrays, so I think we should do it? The idea is to modify merge so 
that it returns instantly when compare(pivot-1, pivot) <= 0.
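A minimal sketch of that early return, written against a plain int[] rather than SorterTemplate (SortedRunSkipSketch and the mergesPerformed counter are made up for the example). When the last element of the left run is not greater than the first element of the right run, the two runs are already in order and merge has nothing to do, so on sorted input every merge call costs constant time and the whole sort is linear.

```java
import java.util.Arrays;

public class SortedRunSkipSketch {

  /** Number of merge calls that actually had to move elements. */
  static int mergesPerformed = 0;

  static void mergeSort(int[] a, int lo, int hi) {
    if (hi - lo < 2) {
      return;
    }
    int mid = (lo + hi) >>> 1;
    mergeSort(a, lo, mid);
    mergeSort(a, mid, hi);
    merge(a, lo, mid, hi);
  }

  /** Merges the sorted runs a[lo..pivot) and a[pivot..hi). */
  static void merge(int[] a, int lo, int pivot, int hi) {
    // Largest element of the left run <= smallest element of the right run:
    // the concatenation is already sorted, so return immediately.
    if (a[pivot - 1] <= a[pivot]) {
      return;
    }
    mergesPerformed++;
    int[] left = Arrays.copyOfRange(a, lo, pivot);
    int i = 0, j = pivot, dest = lo;
    while (i < left.length && j < hi) {
      a[dest++] = left[i] <= a[j] ? left[i++] : a[j++];
    }
    while (i < left.length) {
      a[dest++] = left[i++];
    }
  }

  public static void main(String[] args) {
    int[] sorted = new int[1000];
    for (int i = 0; i < sorted.length; i++) {
      sorted[i] = i;
    }
    mergeSort(sorted, 0, sorted.length);
    System.out.println("merges on sorted input: " + mergesPerformed);
  }
}
```

The merge body here uses a temporary buffer for brevity; the early-return check is independent of how the actual merge is implemented.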




[jira] [Updated] (LUCENE-4875) Make SorterTemplate.mergeSort run in linear time on sorted arrays

2013-03-23 Thread Adrien Grand (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand updated LUCENE-4875:
-

Attachment: LUCENE-4875.patch

Patch. I modified the test case to make sure merge is never called when the 
concatenation of the two runs to merge is already sorted.

 Make SorterTemplate.mergeSort run in linear time on sorted arrays
 -

 Key: LUCENE-4875
 URL: https://issues.apache.org/jira/browse/LUCENE-4875
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Adrien Grand
Assignee: Adrien Grand
Priority: Trivial
 Fix For: 4.3

 Attachments: LUCENE-4875.patch


 Through minor modifications, SorterTemplate.mergeSort could run in linear 
 time on sorted arrays, so I think we should do it? The idea is to modify 
 merge so that it returns instantly when compare(pivot-1, pivot) <= 0.




[jira] [Created] (LUCENE-4876) IndexWriterConfig.clone should clone the MergeScheduler

2013-03-23 Thread Adrien Grand (JIRA)
Adrien Grand created LUCENE-4876:


 Summary: IndexWriterConfig.clone should clone the MergeScheduler
 Key: LUCENE-4876
 URL: https://issues.apache.org/jira/browse/LUCENE-4876
 Project: Lucene - Core
  Issue Type: Bug
Reporter: Adrien Grand
 Fix For: 4.3


ConcurrentMergeScheduler has a List<MergeThread> member to track the running 
merge threads, so IndexWriterConfig.clone should clone the merge scheduler so 
that both IndexWriterConfig instances are independent.
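The difference between the two clone behaviors can be sketched with hypothetical stand-ins (Config and Scheduler below are not the real IndexWriterConfig and ConcurrentMergeScheduler). Giving the cloned scheduler a fresh, empty tracking list is an assumption made for the example.

```java
import java.util.ArrayList;
import java.util.List;

public class ConfigCloneSketch {

  /** Hypothetical scheduler with mutable state, like a list of running merge threads. */
  static class Scheduler implements Cloneable {
    List<String> runningThreads = new ArrayList<>();

    @Override
    public Scheduler clone() {
      try {
        Scheduler copy = (Scheduler) super.clone();
        // Assumption for this sketch: a clone starts with its own empty list.
        copy.runningThreads = new ArrayList<>();
        return copy;
      } catch (CloneNotSupportedException e) {
        throw new AssertionError(e);
      }
    }
  }

  /** Hypothetical config holding a scheduler. */
  static class Config implements Cloneable {
    Scheduler scheduler = new Scheduler();

    /** Field-by-field copy: both configs end up sharing one scheduler. */
    Config shallowClone() {
      try {
        return (Config) super.clone();
      } catch (CloneNotSupportedException e) {
        throw new AssertionError(e);
      }
    }

    /** Also clones the scheduler, so the two configs are independent. */
    Config deepClone() {
      Config copy = shallowClone();
      copy.scheduler = scheduler.clone();
      return copy;
    }
  }

  public static void main(String[] args) {
    Config original = new Config();
    Config shallow = original.shallowClone();
    Config deep = original.deepClone();
    original.scheduler.runningThreads.add("merge-thread-1");
    System.out.println("shallow clone sees: " + shallow.scheduler.runningThreads);
    System.out.println("deep clone sees: " + deep.scheduler.runningThreads);
  }
}
```

With the shallow copy, a merge thread registered through one config shows up in the other, which is the kind of coupling the issue wants to avoid.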




[jira] [Reopened] (LUCENE-4752) Merge segments to sort them

2013-03-24 Thread Adrien Grand (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand reopened LUCENE-4752:
--


I just found what caused the last Jenkins failures: sometimes deletions happen 
concurrently with a merge. In this case, deletes are still applied to the old 
ReaderAndLiveDocs and once the merge is finished, IndexWriter runs 
commitMergedDeletes to apply deletes to the new segment too, but since it 
assumes doc IDs are assigned sequentially, it doesn't work with 
SortingMergePolicy. (This explains why the bug was hard to reproduce too.)

 Merge segments to sort them
 ---

 Key: LUCENE-4752
 URL: https://issues.apache.org/jira/browse/LUCENE-4752
 Project: Lucene - Core
  Issue Type: New Feature
  Components: core/index
Reporter: David Smiley
Assignee: Adrien Grand
 Attachments: LUCENE-4752.patch, LUCENE-4752.patch, LUCENE-4752.patch, 
 LUCENE-4752.patch, LUCENE-4752.patch, LUCENE-4752.patch, LUCENE-4752.patch, 
 natural_10M_ingestion.log, sorting_10M_ingestion.log


 It would be awesome if Lucene could write the documents out in a segment 
 based on a configurable order.  This of course applies to merging segments 
 too. The benefit is increased locality on disk of documents that are likely to 
 be accessed together.  This often applies to documents near each other in 
 time, but also spatially.




[jira] [Assigned] (LUCENE-4874) Remove FilterTerms.intersect

2013-03-24 Thread Adrien Grand (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand reassigned LUCENE-4874:


Assignee: Adrien Grand

 Remove FilterTerms.intersect
 

 Key: LUCENE-4874
 URL: https://issues.apache.org/jira/browse/LUCENE-4874
 Project: Lucene - Core
  Issue Type: Bug
Reporter: Adrien Grand
Assignee: Adrien Grand
Priority: Minor

 Terms.intersect is an optional method. The fact that it is overridden in 
 FilterTerms forces any non-trivial class that extends FilterTerms to override 
 intersect in order for this method to behave correctly. If FilterTerms did 
 not override this method and used the default impl, we would not have this 
 problem.




[jira] [Commented] (LUCENE-4874) Remove FilterTerms.intersect

2013-03-25 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13612908#comment-13612908
 ] 

Adrien Grand commented on LUCENE-4874:
--

Although DocIdSetIterator.advance is abstract, it describes a default 
implementation that many classes that extend DocsEnum/DocsAndPositionsEnum 
duplicate. Maybe we should just provide a default implementation for advance, 
this would save copy-pastes.
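For reference, the default behavior in question is a linear scan built on nextDoc(). The sketch below uses a hypothetical DocIterator base class rather than DocIdSetIterator itself, and ArrayDocIterator is a toy subclass that inherits advance without copying it.

```java
public class DefaultAdvanceSketch {

  /** Sentinel returned once the iterator is exhausted. */
  static final int NO_MORE_DOCS = Integer.MAX_VALUE;

  /** Hypothetical iterator base class with a default advance(). */
  abstract static class DocIterator {
    abstract int docID();

    abstract int nextDoc();

    /** Default: call nextDoc() until reaching a doc that is >= target. */
    int advance(int target) {
      int doc;
      while ((doc = nextDoc()) < target) {
        // keep scanning
      }
      return doc;
    }
  }

  /** Toy iterator over a sorted array of doc IDs; it does not override advance. */
  static class ArrayDocIterator extends DocIterator {
    private final int[] docs;
    private int idx = -1;

    ArrayDocIterator(int... docs) {
      this.docs = docs;
    }

    @Override
    int docID() {
      if (idx < 0) {
        return -1; // not positioned yet
      }
      return idx < docs.length ? docs[idx] : NO_MORE_DOCS;
    }

    @Override
    int nextDoc() {
      idx++;
      return docID();
    }
  }

  public static void main(String[] args) {
    DocIterator it = new ArrayDocIterator(1, 4, 7, 10);
    System.out.println(it.advance(5));
    System.out.println(it.advance(8));
    System.out.println(it.advance(11) == NO_MORE_DOCS);
  }
}
```

Subclasses with a skip structure would still override advance; the default only spares the simple ones from repeating the loop.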

 Remove FilterTerms.intersect
 

 Key: LUCENE-4874
 URL: https://issues.apache.org/jira/browse/LUCENE-4874
 Project: Lucene - Core
  Issue Type: Bug
Reporter: Adrien Grand
Assignee: Adrien Grand
Priority: Minor

 Terms.intersect is an optional method. The fact that it is overridden in 
 FilterTerms forces any non-trivial class that extends FilterTerms to override 
 intersect in order for this method to behave correctly. If FilterTerms did 
 not override this method and used the default impl, we would not have this 
 problem.




[jira] [Created] (LUCENE-4888) SloppyPhraseScorer calls DocsAndPositionsEnum.advance with target = -1

2013-03-26 Thread Adrien Grand (JIRA)
Adrien Grand created LUCENE-4888:


 Summary: SloppyPhraseScorer calls DocsAndPositionsEnum.advance 
with target = -1
 Key: LUCENE-4888
 URL: https://issues.apache.org/jira/browse/LUCENE-4888
 Project: Lucene - Core
  Issue Type: Bug
Affects Versions: 4.2
Reporter: Adrien Grand


SloppyPhraseScorer calls DocsAndPositionsEnum.advance with target = -1 although 
the behavior of this method is undefined in such cases.




[jira] [Updated] (LUCENE-4888) SloppyPhraseScorer calls DocsAndPositionsEnum.advance with target = -1

2013-03-26 Thread Adrien Grand (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand updated LUCENE-4888:
-

Attachment: LUCENE-4888.patch

A patch that adds assertions to AssertingDocsAndPositionsEnum. You can 
reproduce the issue by applying this patch and running {{ant test 
-Dtestcase=TestSloppyPhraseQuery -Dtests.codec=Asserting}}.

 SloppyPhraseScorer calls DocsAndPositionsEnum.advance with target = -1
 --

 Key: LUCENE-4888
 URL: https://issues.apache.org/jira/browse/LUCENE-4888
 Project: Lucene - Core
  Issue Type: Bug
Affects Versions: 4.2
Reporter: Adrien Grand
 Attachments: LUCENE-4888.patch


 SloppyPhraseScorer calls DocsAndPositionsEnum.advance with target = -1 
 although the behavior of this method is undefined in such cases.




[jira] [Updated] (LUCENE-4752) Merge segments to sort them

2013-03-26 Thread Adrien Grand (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand updated LUCENE-4752:
-

Attachment: LUCENE-4752-2.patch

Patch:
 - fixes the issue by allowing OneMerges to return a doc map that translates 
doc IDs to their new value so that IndexWriter can commit merged deletes,
 - TestSortingMergePolicy has been modified to make deletions more likely to 
happen concurrently with a merge.
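The doc map idea can be illustrated with a standalone sketch. It does not use the IndexWriter or OneMerge APIs; MergedDeletesSketch, sortDocMap and translateDeletes are hypothetical names. Deletes recorded against the old doc IDs while the merge was running are carried over to the sorted segment by looking each old ID up in the map.

```java
import java.util.Arrays;
import java.util.BitSet;
import java.util.Comparator;

public class MergedDeletesSketch {

  /** Returns oldToNew, where oldToNew[oldDocId] is the doc's ID after sorting by key. */
  static int[] sortDocMap(long[] keys) {
    Integer[] newToOld = new Integer[keys.length];
    for (int i = 0; i < keys.length; i++) {
      newToOld[i] = i;
    }
    // Sorting an object array is stable, so docs with equal keys keep their order.
    Arrays.sort(newToOld, Comparator.comparingLong((Integer i) -> keys[i]));
    int[] oldToNew = new int[keys.length];
    for (int newId = 0; newId < newToOld.length; newId++) {
      oldToNew[newToOld[newId]] = newId;
    }
    return oldToNew;
  }

  /** Translates deletes expressed in old doc IDs into the merged segment's doc IDs. */
  static BitSet translateDeletes(BitSet deletedOld, int[] oldToNew) {
    BitSet deletedNew = new BitSet(oldToNew.length);
    for (int oldId = deletedOld.nextSetBit(0); oldId >= 0; oldId = deletedOld.nextSetBit(oldId + 1)) {
      deletedNew.set(oldToNew[oldId]);
    }
    return deletedNew;
  }

  public static void main(String[] args) {
    long[] timestamps = {30L, 10L, 20L}; // sort key per old doc ID
    int[] oldToNew = sortDocMap(timestamps);
    BitSet deletedDuringMerge = new BitSet();
    deletedDuringMerge.set(0); // old doc 0 was deleted while the merge ran
    System.out.println(Arrays.toString(oldToNew));
    System.out.println(translateDeletes(deletedDuringMerge, oldToNew));
  }
}
```

Assuming sequential doc ID assignment amounts to using the identity map here, which would mark the wrong document as deleted whenever the merge reorders documents.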

 Merge segments to sort them
 ---

 Key: LUCENE-4752
 URL: https://issues.apache.org/jira/browse/LUCENE-4752
 Project: Lucene - Core
  Issue Type: New Feature
  Components: core/index
Reporter: David Smiley
Assignee: Adrien Grand
 Attachments: LUCENE-4752-2.patch, LUCENE-4752.patch, 
 LUCENE-4752.patch, LUCENE-4752.patch, LUCENE-4752.patch, LUCENE-4752.patch, 
 LUCENE-4752.patch, LUCENE-4752.patch, natural_10M_ingestion.log, 
 sorting_10M_ingestion.log


 It would be awesome if Lucene could write the documents out in a segment 
 based on a configurable order.  This of course applies to merging segments 
 too. The benefit is increased locality on disk of documents that are likely to 
 be accessed together.  This often applies to documents near each other in 
 time, but also spatially.




[jira] [Created] (SOLR-4647) Grouping is broken on docvalues-only fields

2013-03-27 Thread Adrien Grand (JIRA)
Adrien Grand created SOLR-4647:
--

 Summary: Grouping is broken on docvalues-only fields
 Key: SOLR-4647
 URL: https://issues.apache.org/jira/browse/SOLR-4647
 Project: Solr
  Issue Type: Bug
Affects Versions: 4.2
Reporter: Adrien Grand


There are a few places where grouping uses 
FieldType.toObject(SchemaField.createField(String, float)) to translate a 
String field value to an Object. The problem is that createField returns null 
when the field is neither stored nor indexed, even if it has doc values.

An option to fix it could be to use the ValueSource instead to resolve the 
Object value (similarly to NumericFacets).




[jira] [Assigned] (LUCENE-4876) IndexWriterConfig.clone should clone the MergeScheduler

2013-03-27 Thread Adrien Grand (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand reassigned LUCENE-4876:


Assignee: Adrien Grand

 IndexWriterConfig.clone should clone the MergeScheduler
 ---

 Key: LUCENE-4876
 URL: https://issues.apache.org/jira/browse/LUCENE-4876
 Project: Lucene - Core
  Issue Type: Bug
Reporter: Adrien Grand
Assignee: Adrien Grand
 Fix For: 4.3


 ConcurrentMergeScheduler has a List<MergeThread> member to track the running 
 merging threads, so IndexWriterConfig.clone should clone the merge scheduler 
 so that both IndexWriterConfig instances are independent.




[jira] [Updated] (LUCENE-4876) IndexWriterConfig.clone should clone the MergeScheduler

2013-03-27 Thread Adrien Grand (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand updated LUCENE-4876:
-

Attachment: LUCENE-4876.patch

Patch:

 * MergeScheduler implements Cloneable

 * IndexDeletionPolicy is now an abstract class (so that it can provide a 
default clone impl) and implements Cloneable

 * InfoStream implements Cloneable. There is no need for this today, but I 
assumed that some people might be interested in displaying line numbers or other 
things that would require adding state to the InfoStream. I have no strong 
feelings about it and can remove it if you think it shouldn't implement 
Cloneable.

 * MergeSchedulers and IndexDeletionPolicies have been fixed so that clones 
don't share state with the instance they've been cloned from

 * IndexWriterConfig clones mergeScheduler and delPolicy (in addition to 
mergePolicy, flushPolicy and indexerThreadPool which were already cloned)

 * Most of the patch changes are due to the fact that many tests assumed that 
the IndexDeletionPolicy instance passed to IndexWriterConfig was the same one 
as the one used by IndexWriter (which is not true now since IndexWriter clones 
the provided config in its constructor and we now clone del policies in 
IndexWriterConfig.clone).
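
As an illustration of the sharing problem, here is a minimal sketch with hypothetical stand-in classes (SchedulerSketch and ConfigSketch are not Lucene classes). Giving the clone a fresh, empty list is an assumption of the sketch, not necessarily what the patch does. The point is only that Object.clone() is shallow, so mutable members must be replaced or cloned explicitly.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a merge scheduler that tracks running merges.
class SchedulerSketch implements Cloneable {
  // Mutable per-instance state, comparable to the list of running merge threads.
  List<String> runningMerges = new ArrayList<>();

  @Override
  public SchedulerSketch clone() {
    try {
      SchedulerSketch copy = (SchedulerSketch) super.clone();
      // super.clone() is shallow: without this line both instances share one list.
      copy.runningMerges = new ArrayList<>();
      return copy;
    } catch (CloneNotSupportedException e) {
      throw new AssertionError(e); // cannot happen, the class is Cloneable
    }
  }
}

// Hypothetical stand-in for a config object that owns a scheduler.
class ConfigSketch implements Cloneable {
  SchedulerSketch scheduler = new SchedulerSketch();

  @Override
  public ConfigSketch clone() {
    try {
      ConfigSketch copy = (ConfigSketch) super.clone();
      // Clone the member too, so that the two configs are independent.
      copy.scheduler = scheduler.clone();
      return copy;
    } catch (CloneNotSupportedException e) {
      throw new AssertionError(e);
    }
  }
}
```

With this in place, adding an entry to the original's runningMerges leaves the clone's list empty.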

 IndexWriterConfig.clone should clone the MergeScheduler
 ---

 Key: LUCENE-4876
 URL: https://issues.apache.org/jira/browse/LUCENE-4876
 Project: Lucene - Core
  Issue Type: Bug
Reporter: Adrien Grand
Assignee: Adrien Grand
 Fix For: 4.3

 Attachments: LUCENE-4876.patch


 ConcurrentMergeScheduler has a List<MergeThread> member to track the running 
 merging threads, so IndexWriterConfig.clone should clone the merge scheduler 
 so that both IndexWriterConfig instances are independent.




[jira] [Commented] (LUCENE-4888) SloppyPhraseScorer calls DocsAndPositionsEnum.advance with target = -1

2013-03-27 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13615496#comment-13615496
 ] 

Adrien Grand commented on LUCENE-4888:
--

Could someone confirm that the assertions I added to 
AssertingDocsAndPositionsEnum are correct (meaning there is actually a bug in 
SloppyPhraseScorer)?

 SloppyPhraseScorer calls DocsAndPositionsEnum.advance with target = -1
 --

 Key: LUCENE-4888
 URL: https://issues.apache.org/jira/browse/LUCENE-4888
 Project: Lucene - Core
  Issue Type: Bug
Affects Versions: 4.2
Reporter: Adrien Grand
 Attachments: LUCENE-4888.patch


 SloppyPhraseScorer calls DocsAndPositionsEnum.advance with target = -1 
 although the behavior of this method is undefined in such cases.




[jira] [Commented] (LUCENE-4752) Merge segments to sort them

2013-03-27 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13615534#comment-13615534
 ] 

Adrien Grand commented on LUCENE-4752:
--

Thank you for the review, Mike. I hope it will pass tests now!

 Merge segments to sort them
 ---

 Key: LUCENE-4752
 URL: https://issues.apache.org/jira/browse/LUCENE-4752
 Project: Lucene - Core
  Issue Type: New Feature
  Components: core/index
Reporter: David Smiley
Assignee: Adrien Grand
 Attachments: LUCENE-4752-2.patch, LUCENE-4752.patch, 
 LUCENE-4752.patch, LUCENE-4752.patch, LUCENE-4752.patch, LUCENE-4752.patch, 
 LUCENE-4752.patch, LUCENE-4752.patch, natural_10M_ingestion.log, 
 sorting_10M_ingestion.log


 It would be awesome if Lucene could write the documents out in a segment 
 based on a configurable order.  This of course applies to merging segments 
 too. The benefit is increased locality on disk of documents that are likely to 
 be accessed together.  This often applies to documents near each other in 
 time, but also spatially.




[jira] [Resolved] (LUCENE-4752) Merge segments to sort them

2013-03-27 Thread Adrien Grand (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand resolved LUCENE-4752.
--

Resolution: Fixed

 Merge segments to sort them
 ---

 Key: LUCENE-4752
 URL: https://issues.apache.org/jira/browse/LUCENE-4752
 Project: Lucene - Core
  Issue Type: New Feature
  Components: core/index
Reporter: David Smiley
Assignee: Adrien Grand
 Attachments: LUCENE-4752-2.patch, LUCENE-4752.patch, 
 LUCENE-4752.patch, LUCENE-4752.patch, LUCENE-4752.patch, LUCENE-4752.patch, 
 LUCENE-4752.patch, LUCENE-4752.patch, natural_10M_ingestion.log, 
 sorting_10M_ingestion.log


 It would be awesome if Lucene could write the documents out in a segment 
 based on a configurable order.  This of course applies to merging segments 
 too. The benefit is increased locality on disk of documents that are likely to 
 be accessed together.  This often applies to documents near each other in 
 time, but also spatially.




[jira] [Resolved] (LUCENE-4875) Make SorterTemplate.mergeSort run in linear time on sorted arrays

2013-03-27 Thread Adrien Grand (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand resolved LUCENE-4875.
--

Resolution: Fixed

 Make SorterTemplate.mergeSort run in linear time on sorted arrays
 -

 Key: LUCENE-4875
 URL: https://issues.apache.org/jira/browse/LUCENE-4875
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Adrien Grand
Assignee: Adrien Grand
Priority: Trivial
 Fix For: 4.3

 Attachments: LUCENE-4875.patch


 Through minor modifications, SorterTemplate.mergeSort could run in linear 
 time on sorted arrays, so I think we should do it? The idea is to modify 
 merge so that it returns instantly when compare(pivot-1, pivot) <= 0.
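
 The early return can be sketched on a plain int[] as follows. This is a simplified illustration, not SorterTemplate itself: MergeSortSketch is a hypothetical class, it compares ints directly instead of calling an abstract compare method, and its merge step uses a temporary copy of the left half for brevity. The merges counter exists only to show that already-sorted input never reaches the element-moving part of merge.

```java
import java.util.Arrays;

// Hypothetical sketch on int[]; not the SorterTemplate implementation.
final class MergeSortSketch {
  static int merges; // counts merges that actually had to move elements

  /** Sorts the range a[lo, hi). */
  static void mergeSort(int[] a, int lo, int hi) {
    if (hi - lo < 2) {
      return;
    }
    int pivot = (lo + hi) >>> 1;
    mergeSort(a, lo, pivot);
    mergeSort(a, pivot, hi);
    merge(a, lo, pivot, hi);
  }

  static void merge(int[] a, int lo, int pivot, int hi) {
    // Both halves are sorted. If the last element of the left half is not
    // greater than the first element of the right half, nothing has to move.
    if (a[pivot - 1] <= a[pivot]) {
      return;
    }
    merges++;
    int[] left = Arrays.copyOfRange(a, lo, pivot);
    int i = 0, j = pivot, k = lo;
    while (i < left.length && j < hi) {
      a[k++] = (left[i] <= a[j]) ? left[i++] : a[j++];
    }
    while (i < left.length) {
      a[k++] = left[i++];
    }
    // Any remaining right-half elements are already in their final place.
  }
}
```

 On sorted input every call to merge costs a single comparison, and there are fewer than n such calls for n elements, so the whole sort is linear.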




[jira] [Commented] (LUCENE-4858) Early termination with SortingMergePolicy

2013-03-27 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13615741#comment-13615741
 ] 

Adrien Grand commented on LUCENE-4858:
--

bq. I am thinking for some time on segment-level metadata. Something like 
SegmentInfo.attributes().

I agree that something like SegmentInfo.attributes would be helpful but why not 
SegmentInfo.attributes themselves? (I'm not trying to push for it, just curious 
what their use-cases are, they seem to be unused today?)

 Early termination with SortingMergePolicy
 -

 Key: LUCENE-4858
 URL: https://issues.apache.org/jira/browse/LUCENE-4858
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Adrien Grand
Assignee: Adrien Grand
Priority: Minor
 Fix For: 4.3


 Spin-off of LUCENE-4752, see 
 https://issues.apache.org/jira/browse/LUCENE-4752?focusedCommentId=13606565&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13606565
  and 
 https://issues.apache.org/jira/browse/LUCENE-4752?focusedCommentId=13607282&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13607282
 When an index is sorted per-segment, queries that sort according to the index 
 sort order could be early terminated.




[jira] [Commented] (LUCENE-4858) Early termination with SortingMergePolicy

2013-03-27 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13615752#comment-13615752
 ] 

Adrien Grand commented on LUCENE-4858:
--

bq. Why is additional metadata necessary? Isn't 
SegmentInfo.getDiagnostics().get("source") enough to tell you if the segment 
was created via a flush or a merge... maybe a little evil but the data is 
already there. 

It looks good, I hadn't noticed that we store this information in the 
diagnostics, thanks!

 Early termination with SortingMergePolicy
 -

 Key: LUCENE-4858
 URL: https://issues.apache.org/jira/browse/LUCENE-4858
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Adrien Grand
Assignee: Adrien Grand
Priority: Minor
 Fix For: 4.3


 Spin-off of LUCENE-4752, see 
 https://issues.apache.org/jira/browse/LUCENE-4752?focusedCommentId=13606565&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13606565
  and 
 https://issues.apache.org/jira/browse/LUCENE-4752?focusedCommentId=13607282&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13607282
 When an index is sorted per-segment, queries that sort according to the index 
 sort order could be early terminated.




[jira] [Commented] (LUCENE-4876) IndexWriterConfig.clone should clone the MergeScheduler

2013-03-27 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13615839#comment-13615839
 ] 

Adrien Grand commented on LUCENE-4876:
--

bq. Does PersistentSnapshotDeletionPolicy need clone() too?

At first, I thought about making its clone() method throw an exception but we 
can't because IndexWriter constructor always clones the provided 
IndexWriterConfig. I'll add warnings about sharing in the javadocs.

 IndexWriterConfig.clone should clone the MergeScheduler
 ---

 Key: LUCENE-4876
 URL: https://issues.apache.org/jira/browse/LUCENE-4876
 Project: Lucene - Core
  Issue Type: Bug
Reporter: Adrien Grand
Assignee: Adrien Grand
 Fix For: 4.3

 Attachments: LUCENE-4876.patch


 ConcurrentMergeScheduler has a List<MergeThread> member to track the running 
 merging threads, so IndexWriterConfig.clone should clone the merge scheduler 
 so that both IndexWriterConfig instances are independent.




[jira] [Updated] (LUCENE-4858) Early termination with SortingMergePolicy

2013-03-27 Thread Adrien Grand (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand updated LUCENE-4858:
-

Attachment: LUCENE-4858.patch

Here is a first patch:

 * New convenient abstract collector class: EarlyTerminationCollector which 
makes no assumption about the readers it collects (it relies on sub-classes in 
order to know whether the collected context is sorted and how many docs should 
be collected at most).

 * New collector: SortingMergePolicyCollector, which assumes that segments that 
result from a merge are sorted (to do so it inspects the diagnostics of the 
SegmentInfo). I named it this way to make it clear that it needs to be used with 
SortingMergePolicy.

 * I made SegmentReader.getSegmentInfo public (instead of pkg-private) to be 
able to read the diagnostics. Is it OK to do so, or is there a cleaner way to 
expose diagnostics to high-level APIs?
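
To illustrate why sorted segments allow early termination, here is a minimal, self-contained sketch. It does not use Lucene's Collector API: EarlyTerminationSketch, Segment and topN are hypothetical names, the query is assumed to match every document, and the sorted flag is passed in directly (in the patch that information comes from the SegmentInfo diagnostics). The visited counter only makes the saving observable.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch; it does not use Lucene's Collector API.
final class EarlyTerminationSketch {

  /** A segment is reduced to its sort-key values in doc ID order, plus a sorted flag. */
  static final class Segment {
    final boolean sorted;
    final long[] values;

    Segment(boolean sorted, long... values) {
      this.sorted = sorted;
      this.values = values;
    }
  }

  static int visited; // number of documents actually collected

  /** Returns the numHits smallest values across all segments. */
  static List<Long> topN(List<Segment> segments, int numHits) {
    List<Long> candidates = new ArrayList<>();
    for (Segment segment : segments) {
      int collected = 0;
      for (long value : segment.values) {
        if (segment.sorted && collected == numHits) {
          // Later docs of a sorted segment cannot beat the ones already collected.
          break;
        }
        candidates.add(value);
        collected++;
        visited++;
      }
    }
    Collections.sort(candidates);
    return new ArrayList<>(candidates.subList(0, Math.min(numHits, candidates.size())));
  }
}
```

Unsorted segments (for example freshly flushed ones) still have to be collected in full; only the sorted ones can be cut short.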

 Early termination with SortingMergePolicy
 -

 Key: LUCENE-4858
 URL: https://issues.apache.org/jira/browse/LUCENE-4858
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Adrien Grand
Assignee: Adrien Grand
Priority: Minor
 Fix For: 4.3

 Attachments: LUCENE-4858.patch


 Spin-off of LUCENE-4752, see 
 https://issues.apache.org/jira/browse/LUCENE-4752?focusedCommentId=13606565&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13606565
  and 
 https://issues.apache.org/jira/browse/LUCENE-4752?focusedCommentId=13607282&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13607282
 When an index is sorted per-segment, queries that sort according to the index 
 sort order could be early terminated.




[jira] [Updated] (LUCENE-4876) IndexWriterConfig.clone should clone the MergeScheduler

2013-03-28 Thread Adrien Grand (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand updated LUCENE-4876:
-

Attachment: LUCENE-4876.patch

New patch:

 * Added CHANGES entries

 * Added documentation to PersistentSnapshotDeletionPolicy to make it clear that 
instances of this class must not be shared across IndexWriters

 * Some Solr tests were failing because Solr expects SolrCore.solrDelPolicy to 
be the same instance as IndexWriter.getConfig().getIndexDeletionPolicy(). There 
is sensitive code relying on it (SnapShooter/ReplicationHandler in particular) 
so I preferred emulating the old behavior by making 
IndexDeletionPolicyWrapper.clone() return 'this' for the moment. This is not 
a problem because each core has its own private deletion policy and never opens 
more than one IndexWriter with it.

 IndexWriterConfig.clone should clone the MergeScheduler
 ---

 Key: LUCENE-4876
 URL: https://issues.apache.org/jira/browse/LUCENE-4876
 Project: Lucene - Core
  Issue Type: Bug
Reporter: Adrien Grand
Assignee: Adrien Grand
 Fix For: 4.3

 Attachments: LUCENE-4876.patch, LUCENE-4876.patch


 ConcurrentMergeScheduler has a List<MergeThread> member to track the running 
 merging threads, so IndexWriterConfig.clone should clone the merge scheduler 
 so that both IndexWriterConfig instances are independent.




[jira] [Assigned] (LUCENE-4888) SloppyPhraseScorer calls DocsAndPositionsEnum.advance with target = -1

2013-03-29 Thread Adrien Grand (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand reassigned LUCENE-4888:


Assignee: Adrien Grand

 SloppyPhraseScorer calls DocsAndPositionsEnum.advance with target = -1
 --

 Key: LUCENE-4888
 URL: https://issues.apache.org/jira/browse/LUCENE-4888
 Project: Lucene - Core
  Issue Type: Bug
Affects Versions: 4.2
Reporter: Adrien Grand
Assignee: Adrien Grand
 Attachments: LUCENE-4888.patch


 SloppyPhraseScorer calls DocsAndPositionsEnum.advance with target = -1 
 although the behavior of this method is undefined in such cases.




[jira] [Updated] (LUCENE-4888) SloppyPhraseScorer calls DocsAndPositionsEnum.advance with target = -1

2013-03-29 Thread Adrien Grand (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand updated LUCENE-4888:
-

Attachment: LUCENE-4888.patch

Patch that adds bug fixes on top of the assertions from the previous patch:
 - SloppyPhraseScorer.advance
 - MultiDocs(AndPositions)Enum.advance
 - MultiSpansWrapper.skipTo

These three methods relied on the assumption that advance(target) is equivalent 
to nextDoc() when target is <= the current position (which is wrong, although 
all our impls behave this way).
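
The contract can be illustrated with a small stand-alone iterator. This is a hypothetical sketch, not DocsAndPositionsEnum or the asserting wrapper from the patch: DocIteratorSketch and advanceOrNext are made-up names, and the check throws IllegalArgumentException (instead of using assert) so that it fires even when assertions are disabled. advance rejects targets that are not beyond the current doc, and advanceOrNext shows what a caller can do when it cannot guarantee that.

```java
// Hypothetical sketch of an iterator over sorted doc IDs with a strict advance().
final class DocIteratorSketch {
  static final int NO_MORE_DOCS = Integer.MAX_VALUE;

  private final int[] docs; // sorted doc IDs
  private int index = -1;
  private int doc = -1; // -1 means "not positioned yet"

  DocIteratorSketch(int... docs) {
    this.docs = docs;
  }

  int docID() {
    return doc;
  }

  int nextDoc() {
    index++;
    doc = index < docs.length ? docs[index] : NO_MORE_DOCS;
    return doc;
  }

  /** Strict variant: rejects targets that are not beyond the current doc. */
  int advance(int target) {
    if (target <= doc) {
      throw new IllegalArgumentException("target=" + target + " must be > docID=" + doc);
    }
    while (nextDoc() < target) {
      // linear scan, good enough for a sketch
    }
    return doc;
  }

  /** What a caller can do when it cannot guarantee target > docID(). */
  static int advanceOrNext(DocIteratorSketch it, int target) {
    return target <= it.docID() ? it.nextDoc() : it.advance(target);
  }
}
```

On a fresh iterator docID() is -1, so advance(-1) is rejected, which is the case reported in this issue.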

 SloppyPhraseScorer calls DocsAndPositionsEnum.advance with target = -1
 --

 Key: LUCENE-4888
 URL: https://issues.apache.org/jira/browse/LUCENE-4888
 Project: Lucene - Core
  Issue Type: Bug
Affects Versions: 4.2
Reporter: Adrien Grand
Assignee: Adrien Grand
 Attachments: LUCENE-4888.patch, LUCENE-4888.patch


 SloppyPhraseScorer calls DocsAndPositionsEnum.advance with target = -1 
 although the behavior of this method is undefined in such cases.




[jira] [Commented] (LUCENE-4877) Fix analyzer factories to throw exception when arguments are invalid

2013-03-29 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13617472#comment-13617472
 ] 

Adrien Grand commented on LUCENE-4877:
--

+1

 Fix analyzer factories to throw exception when arguments are invalid
 

 Key: LUCENE-4877
 URL: https://issues.apache.org/jira/browse/LUCENE-4877
 Project: Lucene - Core
  Issue Type: Bug
  Components: modules/analysis
Reporter: Robert Muir
 Attachments: LUCENE-4877_one_solution_prototype.patch


 Currently if someone typos an argument someParamater=xyz instead of 
 someParameter=xyz, they get no exception and sometimes incorrect behavior.
 It would be way better if these factories threw an exception on unknown params, 
 e.g. by removing the args they use and checking that none are left at the end.
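
 The suggested approach could look roughly like the sketch below. LengthFilterFactorySketch is a hypothetical factory with two made-up parameters (min and max), not one of the real analysis factories, and the prototype patch attached to this issue may do it differently. Each known parameter is removed from a copy of the argument map as it is read, and anything still left at the end is reported as unknown.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical factory; not one of Lucene's analysis factories.
final class LengthFilterFactorySketch {
  final int min;
  final int max;

  LengthFilterFactorySketch(Map<String, String> args) {
    // Work on a copy so that the caller's map is left untouched.
    Map<String, String> remaining = new HashMap<>(args);
    min = Integer.parseInt(require(remaining, "min"));
    max = Integer.parseInt(require(remaining, "max"));
    // Every known parameter has been removed; anything left is unknown.
    if (!remaining.isEmpty()) {
      throw new IllegalArgumentException("Unknown parameters: " + remaining.keySet());
    }
  }

  /** Removes and returns a required parameter, failing if it is absent. */
  private static String require(Map<String, String> args, String name) {
    String value = args.remove(name);
    if (value == null) {
      throw new IllegalArgumentException("Missing required parameter: " + name);
    }
    return value;
  }
}
```

 With this check a misspelled key such as maxx=7 fails at construction time instead of being silently ignored.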




[jira] [Created] (SOLR-4654) Integrate Lucene's sorting and early query termination capabilities into Solr

2013-03-29 Thread Adrien Grand (JIRA)
Adrien Grand created SOLR-4654:
--

 Summary: Integrate Lucene's sorting and early query termination 
capabilities into Solr
 Key: SOLR-4654
 URL: https://issues.apache.org/jira/browse/SOLR-4654
 Project: Solr
  Issue Type: Improvement
Reporter: Adrien Grand
Priority: Trivial


I think there would be some interesting work to do to integrate Lucene's 
sorting and early query termination capabilities into Solr, in particular (just 
ideas, maybe they're not all interesting/useful):
 - configuring a SortingMergePolicy,
 - figuring out when the sort order of queries matches the sort order of the 
index segments,
 - giving the ability to get approximated results when the query is not sorted 
but only boosted by the sort order of the index,
 - integration with TimeLimitingCollector: maybe it's better to collect only 
the first half of every segment than to fully collect half of the segments,
 - approximation of the number of matches based on the ratio of collected 
documents,
 - ...



