[jira] [Resolved] (LUCENE-8975) Code Cleanup: Use entryset for map iteration wherever possible.

2019-09-13 Thread Adrien Grand (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand resolved LUCENE-8975.
--
Fix Version/s: 8.3
   Resolution: Fixed

> Code Cleanup: Use entryset for map iteration wherever possible.
> ---
>
> Key: LUCENE-8975
> URL: https://issues.apache.org/jira/browse/LUCENE-8975
> Project: Lucene - Core
>  Issue Type: Improvement
>Affects Versions: 8.2
>Reporter: Koen De Groote
>Priority: Trivial
> Fix For: 8.3
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Simple, non-important code cleanup.
> Again, to clarify, please don't bother yourself with this ticket on company 
> time, on personal time you could be working on something that makes you money 
> or improves the product for your feature personally.
>  
> This entire ticket is an afterthought. A look back at the code base that most 
> people don't have the time for.
>  
> 
>  
> While it is true that using `entrySet()` is really only an improvement for 
> traversing a TreeMap (at least that's how it was in JDK 8), it's good 
> practice in general to use it over `keySet()` if you then use each key to 
> do an extra lookup to get the value as well.
>  
> So that's what this ticket is.
>  
> All changes were done automatically via Intellij's built-in code analysis.
>  
> Putting this on LUCENE because code both in lucene and solr was changed.
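
A minimal sketch of the pattern the ticket describes, not taken from the actual patch. The class and method names are made up for illustration. The first method iterates keys and then looks each value up again; the second reads key and value from the same entry.

```java
import java.util.Map;

final class EntrySetExample {
  // Before: iterate the keys, then do one extra lookup per key.
  static long sumWithKeySet(Map<String, Integer> counts) {
    long sum = 0;
    for (String key : counts.keySet()) {
      sum += counts.get(key); // extra lookup; O(log n) per call on a TreeMap
    }
    return sum;
  }

  // After: each entry already carries both the key and the value.
  static long sumWithEntrySet(Map<String, Integer> counts) {
    long sum = 0;
    for (Map.Entry<String, Integer> entry : counts.entrySet()) {
      sum += entry.getValue();
    }
    return sum;
  }
}
```

Both methods return the same result; only the number of map lookups differs.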



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8939) Shared Hit Count Early Termination

2019-09-13 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16929307#comment-16929307
 ] 

Adrien Grand commented on LUCENE-8939:
--

Thanks [~mikemccand], I had lost track of this. I agree it would be nice to not 
have to wait for 9.0 for this improvement.

> Shared Hit Count Early Termination
> --
>
> Key: LUCENE-8939
> URL: https://issues.apache.org/jira/browse/LUCENE-8939
> Project: Lucene - Core
>  Issue Type: Sub-task
>Reporter: Atri Sharma
>Priority: Major
> Fix For: 8.3
>
>  Time Spent: 12h 20m
>  Remaining Estimate: 0h
>
> When collecting hits across sorted segments, it should be possible to 
> terminate early across all slices when enough hits have been collected 
> globally i.e. hit count > numHits AND hit count < totalHitsThreshold
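
One way to picture a hit count shared across slices is an atomic counter that every slice increments and checks. This is only a sketch under assumptions: `SharedHitCounter` and its single threshold are hypothetical and are not the classes or the exact condition used in the Lucene change.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical helper: one instance is shared by the collectors of all slices.
final class SharedHitCounter {
  private final AtomicLong globalHits = new AtomicLong();
  private final long threshold;

  SharedHitCounter(long threshold) {
    this.threshold = threshold;
  }

  // Records one hit and returns true once the global count has reached
  // the threshold, i.e. when the calling slice may stop collecting.
  boolean recordHitAndCheck() {
    return globalHits.incrementAndGet() >= threshold;
  }

  long hits() {
    return globalHits.get();
  }
}
```

Each slice would call `recordHitAndCheck()` per collected hit and terminate once it returns true, so slices stop based on the global count rather than their own.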






[jira] [Resolved] (LUCENE-8939) Shared Hit Count Early Termination

2019-09-13 Thread Adrien Grand (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand resolved LUCENE-8939.
--
Fix Version/s: 8.3
   Resolution: Fixed

> Shared Hit Count Early Termination
> --
>
> Key: LUCENE-8939
> URL: https://issues.apache.org/jira/browse/LUCENE-8939
> Project: Lucene - Core
>  Issue Type: Sub-task
>Reporter: Atri Sharma
>Priority: Major
> Fix For: 8.3
>
>  Time Spent: 12h 20m
>  Remaining Estimate: 0h
>
> When collecting hits across sorted segments, it should be possible to 
> terminate early across all slices when enough hits have been collected 
> globally i.e. hit count > numHits AND hit count < totalHitsThreshold






[jira] [Commented] (LUCENE-8970) TopFieldCollector(s) Should Prepopulate Sentinel Objects

2019-09-13 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16929154#comment-16929154
 ] 

Adrien Grand commented on LUCENE-8970:
--

Maybe we can try to do a quick hack to see how much it could bring, but my 
intuition is that it wouldn't help with performance given that we are 
looking at a condition that is easily predictable?

> TopFieldCollector(s) Should Prepopulate Sentinel Objects
> 
>
> Key: LUCENE-8970
> URL: https://issues.apache.org/jira/browse/LUCENE-8970
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Atri Sharma
>Priority: Major
>
> We do not prepopulate the hit queue with sentinel values today, thus leading 
> to extra checks and extra code.






[jira] [Resolved] (LUCENE-7521) Simplify PackedInts

2019-09-11 Thread Adrien Grand (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-7521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand resolved LUCENE-7521.
--
Fix Version/s: master (9.0)
   Resolution: Fixed

> Simplify PackedInts
> ---
>
> Key: LUCENE-7521
> URL: https://issues.apache.org/jira/browse/LUCENE-7521
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Adrien Grand
>Priority: Minor
> Fix For: master (9.0)
>
> Attachments: LUCENE-7521.patch
>
>
> We have a lot of specialization in PackedInts about how to keep packed arrays 
> of longs in memory. However, most use-cases have slowly moved to DirectWriter 
> and DirectMonotonicWriter and most specializations we have are barely used 
> for performance-sensitive operations, so I'd like to clean this up a bit.






[jira] [Commented] (LUCENE-8961) CheckIndex: pre-exorcise document id salvage

2019-09-09 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16926078#comment-16926078
 ] 

Adrien Grand commented on LUCENE-8961:
--

Agreed it is awkward. When I said "on top of CheckIndex", I was rather thinking 
of running CheckIndex programmatically and then looking at the return value to 
understand what segments might need salvaging. A separate stand-alone tool 
sounds good to me too.

> CheckIndex: pre-exorcise document id salvage
> 
>
> Key: LUCENE-8961
> URL: https://issues.apache.org/jira/browse/LUCENE-8961
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Christine Poerschke
>Priority: Minor
> Attachments: LUCENE-8961.patch, LUCENE-8961.patch
>
>
> The 
> [CheckIndex|https://github.com/apache/lucene-solr/blob/releases/lucene-solr/8.2.0/lucene/core/src/java/org/apache/lucene/index/CheckIndex.java]
>  tool supports the exorcising of corrupt segments from an index.
> This ticket proposes to add an extra option which could first be used to 
> potentially salvage the document ids of the segment(s) about to be exorcised. 
> Re-ingestion for those documents could then be arranged so as to repair the 
> data damage caused by the exorcising.






[jira] [Commented] (LUCENE-8917) Remove the "Direct" doc-value format

2019-09-05 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16923711#comment-16923711
 ] 

Adrien Grand commented on LUCENE-8917:
--

Thanks Hoss, I shouldn't have neglected to test Solr...

> Remove the "Direct" doc-value format
> 
>
> Key: LUCENE-8917
> URL: https://issues.apache.org/jira/browse/LUCENE-8917
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
> Fix For: master (9.0)
>
>
> This is the last user of the Legacy*DocValues APIs. Another option would be 
> to move this format to doc-value iterators, but I don't think it's worth the 
> effort: let's just remove it in Lucene 9?






[jira] [Resolved] (LUCENE-8956) QueryRescorer sort optimization

2019-09-05 Thread Adrien Grand (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand resolved LUCENE-8956.
--
Resolution: Fixed

Merged, thanks.

> QueryRescorer sort optimization
> ---
>
> Key: LUCENE-8956
> URL: https://issues.apache.org/jira/browse/LUCENE-8956
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/query/scoring
>Reporter: Paul Sanwald
>Priority: Minor
> Fix For: 8.3
>
> Attachments: LUCENE-8956.patch
>
>
> This patch addresses a TODO in QueryRescorer: We should not sort the full 
> array of the results returned from rescoring, but rather only topN, when topN 
> is less than total hits.
>  
> Made this optimization with some suggestions from [~jpountz] and [~jimczi], 
> this is my first lucene patch submission.
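
To illustrate the idea of ordering only the best topN hits instead of sorting the whole array, here is a sketch that uses a bounded heap. This is a swapped-in technique chosen for brevity; the attached patch may select the top hits differently. `Hit`, `BEST_FIRST` and `topN` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

final class TopNSelect {
  // Hypothetical stand-in for a rescored hit.
  static final class Hit {
    final int doc;
    final float score;

    Hit(int doc, float score) {
      this.doc = doc;
      this.score = score;
    }
  }

  // Best first: higher score wins, ties are broken by lower doc id.
  static final Comparator<Hit> BEST_FIRST = (a, b) -> {
    int cmp = Float.compare(b.score, a.score);
    return cmp != 0 ? cmp : Integer.compare(a.doc, b.doc);
  };

  // Returns the n best hits in BEST_FIRST order without sorting all hits.
  static List<Hit> topN(List<Hit> hits, int n) {
    if (n <= 0) {
      return new ArrayList<>();
    }
    // Worst-first heap: the root is the weakest hit kept so far.
    PriorityQueue<Hit> heap = new PriorityQueue<>(BEST_FIRST.reversed());
    for (Hit h : hits) {
      if (heap.size() < n) {
        heap.add(h);
      } else if (BEST_FIRST.compare(h, heap.peek()) < 0) {
        heap.poll();
        heap.add(h);
      }
    }
    List<Hit> result = new ArrayList<>(heap);
    result.sort(BEST_FIRST); // only n elements are fully sorted
    return result;
  }
}
```

With a deterministic tie-break in the comparator, the first n results should match the first n results of a full sort, which is the property a partial sort has to preserve.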






[jira] [Reopened] (LUCENE-8956) QueryRescorer sort optimization

2019-09-05 Thread Adrien Grand (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand reopened LUCENE-8956:
--

I had to revert this change due to test failures such as

{noformat}
13:40:13[junit4] Suite: org.apache.lucene.search.TestQueryRescorer
13:40:13[junit4]   2> NOTE: reproduce with: ant test  
-Dtestcase=TestQueryRescorer -Dtests.method=testRescoreIsIdempotent 
-Dtests.seed=2E252C4ACD6D510E -Dtests.slow=true -Dtests.badapples=true 
-Dtests.locale=nb -Dtests.timezone=America/Indiana/Winamac -Dtests.asserts=true 
-Dtests.file.encoding=UTF8
13:40:13[junit4] FAILURE 0.04s J3 | 
TestQueryRescorer.testRescoreIsIdempotent <<<
13:40:13[junit4]> Throwable #1: junit.framework.AssertionFailedError: 
Hit 3 docnumbers don't match
13:40:13[junit4]> Hits length1=30   length2=30
13:40:13[junit4]> hit=0: doc11=0.027509231 shardIndex=-1,
doc11=0.027509231 shardIndex=-1
13:40:13[junit4]> hit=1: doc59=0.026626626 shardIndex=-1,
doc59=0.026626626 shardIndex=-1
13:40:13[junit4]> hit=2: doc94=0.025820786 shardIndex=-1,
doc94=0.025820786 shardIndex=-1
13:40:13[junit4]> hit=3: doc14=0.025403785 shardIndex=-1,
doc13=0.025586303 shardIndex=-1
13:40:13[junit4]> hit=4: doc13=0.025586303 shardIndex=-1,
doc14=0.025403785 shardIndex=-1
13:40:13[junit4]> hit=5: doc69=0.02535518 shardIndex=-1, 
doc69=0.02535518 shardIndex=-1
13:40:13[junit4]> hit=6: doc37=0.024901403 shardIndex=-1,
doc37=0.024901403 shardIndex=-1
13:40:13[junit4]> hit=7: doc73=0.023868036 shardIndex=-1,
doc73=0.023868036 shardIndex=-1
13:40:13[junit4]> hit=8: doc63=0.021619176 shardIndex=-1,
doc63=0.021619176 shardIndex=-1
13:40:13[junit4]> hit=9: doc72=0.019876242 shardIndex=-1,
doc72=0.019876242 shardIndex=-1
13:40:13[junit4]> hit=10: doc50=0.01923588 shardIndex=-1,
doc50=0.01923588 shardIndex=-1
13:40:13[junit4]> hit=11: doc10=0.018147592 shardIndex=-1,   
doc10=0.018147592 shardIndex=-1
13:40:13[junit4]> hit=12: doc0=0.018087288 shardIndex=-1,
doc0=0.018087288 shardIndex=-1
13:40:13[junit4]> hit=13: doc98=0.017422743 shardIndex=-1,   
doc98=0.017422743 shardIndex=-1
13:40:13[junit4]> hit=14: doc47=0.016934035 shardIndex=-1,   
doc47=0.016934035 shardIndex=-1
13:40:13[junit4]> hit=15: doc6=0.016177062 shardIndex=-1,
doc79=0.016415473 shardIndex=-1
13:40:13[junit4]> hit=16: doc96=0.016177062 shardIndex=-1,   
doc6=0.016177062 shardIndex=-1
13:40:13[junit4]> hit=17: doc79=0.016415473 shardIndex=-1,   
doc96=0.016177062 shardIndex=-1
13:40:13[junit4]> hit=18: doc55=0.015521079 shardIndex=-1,   
doc55=0.015521079 shardIndex=-1
13:40:13[junit4]> hit=19: doc34=0.011353219 shardIndex=-1,   
doc34=0.011353219 shardIndex=-1
13:40:13[junit4]> hit=20: doc9=0.010401427 shardIndex=-1,
doc9=0.010401427 shardIndex=-1
13:40:13[junit4]> hit=21: doc12=0.0033432373 shardIndex=-1,  
doc12=0.0033432373 shardIndex=-1
13:40:13[junit4]> hit=22: doc15=0.0033432373 shardIndex=-1,  
doc15=0.0033432373 shardIndex=-1
13:40:13[junit4]> hit=23: doc23=0.0033432373 shardIndex=-1,  
doc23=0.0033432373 shardIndex=-1
13:40:13[junit4]> hit=24: doc26=0.0033432373 shardIndex=-1,  
doc26=0.0033432373 shardIndex=-1
13:40:13[junit4]> hit=25: doc31=0.0033432373 shardIndex=-1,  
doc31=0.0033432373 shardIndex=-1
13:40:13[junit4]> hit=26: doc33=0.0033432373 shardIndex=-1,  
doc33=0.0033432373 shardIndex=-1
13:40:13[junit4]> hit=27: doc41=0.0033432373 shardIndex=-1,  
doc41=0.0033432373 shardIndex=-1
13:40:13[junit4]> hit=28: doc43=0.0033432373 shardIndex=-1,  
doc43=0.0033432373 shardIndex=-1
13:40:13[junit4]> hit=29: doc45=0.0033432373 shardIndex=-1,  
doc45=0.0033432373 shardIndex=-1
13:40:13[junit4]> for query:field:"river river"~1
13:40:13[junit4]>   at 
__randomizedtesting.SeedInfo.seed([2E252C4ACD6D510E:E4C37C59CD79ADE6]:0)
13:40:13[junit4]>   at junit.framework.Assert.fail(Assert.java:57)
13:40:13[junit4]>   at 
org.apache.lucene.search.CheckHits.checkEqual(CheckHits.java:205)
13:40:13[junit4]>   at 
org.apache.lucene.search.TestQueryRescorer.testRescoreIsIdempotent(TestQueryRescorer.java:142)
13:40:13[junit4]>   at 
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
13:40:13[junit4]>   at 
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
13:40:13[junit4]>   at 
java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
13:40:13[junit4]>   at 

[jira] [Resolved] (LUCENE-8905) TopDocsCollector Should Have Better Error Handling For Illegal Arguments

2019-09-05 Thread Adrien Grand (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand resolved LUCENE-8905.
--
Fix Version/s: master (9.0)
   Resolution: Fixed

> TopDocsCollector Should Have Better Error Handling For Illegal Arguments
> 
>
> Key: LUCENE-8905
> URL: https://issues.apache.org/jira/browse/LUCENE-8905
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Atri Sharma
>Priority: Major
> Fix For: master (9.0)
>
>  Time Spent: 7h 10m
>  Remaining Estimate: 0h
>
> While writing some tests, I realised that TopDocsCollector does not behave 
> well when illegal arguments are passed in (e.g., requesting more hits than 
> the number of hits collected). Instead, we return a TopDocs instance with 0 
> hits.
>  
> This can be problematic when queries are being formed by applications. This 
> can hide bugs where malformed queries return no hits and that is surfaced 
> upstream to client applications.
>  
> I found a TODO at the relevant code space, so I believe it is time to fix the 
> problem and throw an IllegalArgumentException.
>  
>  
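
A sketch of the fail-fast behaviour being asked for, assuming a hypothetical `HitWindow.slice` helper that works on plain doc ids. It is not the TopDocsCollector code and the exact argument checks in the real fix may differ.

```java
import java.util.Arrays;

final class HitWindow {
  // Returns up to howMany collected doc ids starting at start.
  // Invalid requests fail loudly instead of yielding an empty result.
  static int[] slice(int[] collected, int start, int howMany) {
    if (start < 0 || start >= collected.length) {
      throw new IllegalArgumentException(
          "start must be in [0, " + collected.length + "), got " + start);
    }
    if (howMany <= 0) {
      throw new IllegalArgumentException("howMany must be > 0, got " + howMany);
    }
    int end = Math.min(collected.length, start + howMany);
    return Arrays.copyOfRange(collected, start, end);
  }
}
```

A caller that builds a bad request now sees an exception at the call site rather than an empty hit list that looks like a legitimate "no results" answer.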






[jira] [Commented] (LUCENE-8920) Reduce size of FSTs due to use of direct-addressing encoding

2019-09-05 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16923446#comment-16923446
 ] 

Adrien Grand commented on LUCENE-8920:
--

This is a cool idea.

> Reduce size of FSTs due to use of direct-addressing encoding 
> -
>
> Key: LUCENE-8920
> URL: https://issues.apache.org/jira/browse/LUCENE-8920
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Mike Sokolov
>Priority: Blocker
> Fix For: 8.3
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> Some data can lead to worst-case ~4x RAM usage due to this optimization. 
> Several ideas were suggested to combat this on the mailing list:
> bq. I think we can improve the situation here by tracking, per-FST instance, 
> the size increase we're seeing while building (or perhaps do a preliminary 
> pass before building) in order to decide whether to apply the encoding. 
> bq. we could also make the encoding a bit more efficient. For instance I 
> noticed that arc metadata is pretty large in some cases (in the 10-20 byte 
> range), which makes gaps very costly. Associating each label with a dense id and 
> having an intermediate lookup, ie. lookup label -> id and then id->arc offset 
> instead of doing label->arc directly could save a lot of space in some cases? 
> Also it seems that we are repeating the label in the arc metadata when 
> array-with-gaps is used, even though it shouldn't be necessary since the 
> label is implicit from the address?






[jira] [Resolved] (LUCENE-8917) Remove the "Direct" doc-value format

2019-09-05 Thread Adrien Grand (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand resolved LUCENE-8917.
--
Fix Version/s: master (9.0)
   Resolution: Fixed

> Remove the "Direct" doc-value format
> 
>
> Key: LUCENE-8917
> URL: https://issues.apache.org/jira/browse/LUCENE-8917
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
> Fix For: master (9.0)
>
>
> This is the last user of the Legacy*DocValues APIs. Another option would be 
> to move this format to doc-value iterators, but I don't think it's worth the 
> effort: let's just remove it in Lucene 9?






[jira] [Commented] (LUCENE-7521) Simplify PackedInts

2019-09-05 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-7521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16923212#comment-16923212
 ] 

Adrien Grand commented on LUCENE-7521:
--

I'd like to move forward on this. Hopefully since we last discussed this 
cleanup, more users took the time to move from FieldCache to doc values, which 
has been our recommendation for a very long time now. I will only push this 
change to master.

> Simplify PackedInts
> ---
>
> Key: LUCENE-7521
> URL: https://issues.apache.org/jira/browse/LUCENE-7521
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Adrien Grand
>Priority: Minor
> Attachments: LUCENE-7521.patch
>
>
> We have a lot of specialization in PackedInts about how to keep packed arrays 
> of longs in memory. However, most use-cases have slowly moved to DirectWriter 
> and DirectMonotonicWriter and most specializations we have are barely used 
> for performance-sensitive operations, so I'd like to clean this up a bit.






[jira] [Resolved] (LUCENE-8942) Tighten Up LRUQueryCache's Methods

2019-09-05 Thread Adrien Grand (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand resolved LUCENE-8942.
--
Fix Version/s: 8.3
   Resolution: Fixed

> Tighten Up LRUQueryCache's Methods
> --
>
> Key: LUCENE-8942
> URL: https://issues.apache.org/jira/browse/LUCENE-8942
> Project: Lucene - Core
>  Issue Type: Improvement
> Environment: LRUQueryCache has less strict visibility of methods than 
> it can, and has some redundant parameters.
>Reporter: Atri Sharma
>Priority: Minor
> Fix For: 8.3
>
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>







[jira] [Commented] (LUCENE-8932) Allow BKDReader packedIndex to be off heap

2019-09-05 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16923137#comment-16923137
 ] 

Adrien Grand commented on LUCENE-8932:
--

I was hoping LUCENE-8833 would land and we could build on it, but it's taking 
more time than I thought and I don't think we need to wait. I'd suggest 
the following to move forward:
 - Add a parameter to BKDReader's constructor to decide whether the index 
should be on or off heap. This is useful so that tests can use both flavors of 
index.
 - Keep a constructor that only takes an IndexInput, and load the index off 
heap iff the IndexInput extends ByteBufferIndexInput. This would be consistent 
with the terms index.
 - Update TestBKD to randomly load the index on or off-heap.

Then when LUCENE-8833 lands we can look into doing this a bit differently. 
[~jdconradson] Does it sound good to you?

The patch looks good to me in general. One minor thing that looks inconsistent 
to me is that BKDInput#setPosition doesn't throw an IOException. It forces the 
offheap impl to catch and rethrow, while readBytes throws an IOException so 
it's up to callers to deal with the exception. In my opinion, we should make 
setPosition and readBytes consistent: either both throw an IOException, or 
neither does?
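
The consistency point about setPosition and readBytes could look roughly like the sketch below. `PositionedInput` and `HeapInput` are hypothetical names, and the signatures are assumptions rather than the ones in the attached patch. Both methods declare IOException, so no implementation has to catch and rethrow.

```java
import java.io.IOException;

// Hypothetical interface: both operations declare IOException.
interface PositionedInput {
  void setPosition(long pos) throws IOException;

  void readBytes(byte[] dst, int offset, int length) throws IOException;
}

// On-heap flavour backed by a byte array.
final class HeapInput implements PositionedInput {
  private final byte[] data;
  private int pos;

  HeapInput(byte[] data) {
    this.data = data;
  }

  @Override
  public void setPosition(long newPos) throws IOException {
    if (newPos < 0 || newPos > data.length) {
      throw new IOException("position out of bounds: " + newPos);
    }
    pos = (int) newPos;
  }

  @Override
  public void readBytes(byte[] dst, int offset, int length) throws IOException {
    if (length < 0 || pos + length > data.length) {
      throw new IOException("read past end of input");
    }
    System.arraycopy(data, pos, dst, offset, length);
    pos += length;
  }
}
```

An off-heap implementation backed by an IndexInput could then propagate the exceptions from its underlying seek and read calls directly.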

> Allow BKDReader packedIndex to be off heap
> --
>
> Key: LUCENE-8932
> URL: https://issues.apache.org/jira/browse/LUCENE-8932
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Jack Conradson
>Priority: Minor
> Attachments: LUCENE-8932.patch
>
>
> This change modifies BKDReader to read the packedIndex bytes off heap rather 
> than load them all on heap at a single time.
> Questions for discussion:
>  # Should BKDReader only support packedIndex off heap?
>  # If not, how should the choice be made?
> Using luceneutils IndexAndSearchOpenStreetMaps gives the following test 
> results:
> with -box -points (patch)
> READER MB: 1.1345596313476562
> BEST M hits/sec: 73.34277344984474
> BEST QPS: 74.63011169783009
> with -box -points (original)
> READER MB: 1.7249317169189453
> BEST M hits/sec: 73.77125157623486
> BEST QPS: 75.06611062353801
> with -nearest 10 -points (patch)
> READER MB: 1.1345596313476562
> BEST M hits/sec: 0.013586298373879497
> BEST QPS: 1358.6298373879497
> with -nearest 10 -points (original)
> READER MB: 1.7249317169189453
> BEST M hits/sec: 0.01445208197367343
> BEST QPS: 1445.208197367343
> with -box -geo3d (patch)
> READER MB: 1.1345596313476562
> BEST M hits/sec: 39.84968715299074
> BEST QPS: 40.54914292796736
> with -box -geo3d (original)
> READER MB: 1.7456226348876953
> BEST M hits/sec: 40.45051734329004
> BEST QPS: 41.160519101846695






[jira] [Updated] (LUCENE-8956) QueryRescorer sort optimization

2019-09-05 Thread Adrien Grand (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand updated LUCENE-8956:
-
Fix Version/s: 8.3
   Resolution: Fixed
   Status: Resolved  (was: Patch Available)

This looks great, thanks [~pcsanwald].

> QueryRescorer sort optimization
> ---
>
> Key: LUCENE-8956
> URL: https://issues.apache.org/jira/browse/LUCENE-8956
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/query/scoring
>Reporter: Paul Sanwald
>Priority: Minor
> Fix For: 8.3
>
> Attachments: LUCENE-8956.patch
>
>
> This patch addresses a TODO in QueryRescorer: We should not sort the full 
> array of the results returned from rescoring, but rather only topN, when topN 
> is less than total hits.
>  
> Made this optimization with some suggestions from [~jpountz] and [~jimczi], 
> this is my first lucene patch submission.






[jira] [Commented] (LUCENE-8910) upgrade to icu 62.1 must be completed

2019-09-05 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16923106#comment-16923106
 ] 

Adrien Grand commented on LUCENE-8910:
--

Thank you [~matmarie]!

> upgrade to icu 62.1 must be completed
> 
>
> Key: LUCENE-8910
> URL: https://issues.apache.org/jira/browse/LUCENE-8910
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/analysis
>Affects Versions: trunk, 7.5, 7.6, 7.7, 7.7.1, 7.7.2, 8.0, 8.1, 8.1.1
>Reporter: Mathieu Marie
>Priority: Minor
> Fix For: master (9.0), 8.3
>
> Attachments: LUCENE-8910.patch
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> LUCENE-8366 migrated the icu components to version 62-1.
> There is however a place where the version number is still 60-2:
> [https://github.com/apache/lucene-solr/blob/branch_7_5/lucene/analysis/icu/src/tools/java/org/apache/lucene/analysis/icu/GenerateUTR30DataFiles.java#L66]






[jira] [Resolved] (LUCENE-8910) upgrade to icu 62.1 must be completed

2019-09-05 Thread Adrien Grand (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand resolved LUCENE-8910.
--
Fix Version/s: 8.3
   Resolution: Fixed

> upgrade to icu 62.1 must be completed
> 
>
> Key: LUCENE-8910
> URL: https://issues.apache.org/jira/browse/LUCENE-8910
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/analysis
>Affects Versions: trunk, 7.5, 7.6, 7.7, 7.7.1, 7.7.2, 8.0, 8.1, 8.1.1
>Reporter: Mathieu Marie
>Priority: Minor
> Fix For: master (9.0), 8.3
>
> Attachments: LUCENE-8910.patch
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> LUCENE-8366 migrated the icu components to version 62-1.
> There is however a place where the version number is still 60-2:
> [https://github.com/apache/lucene-solr/blob/branch_7_5/lucene/analysis/icu/src/tools/java/org/apache/lucene/analysis/icu/GenerateUTR30DataFiles.java#L66]






[jira] [Commented] (LUCENE-8963) Allow Collectors To "Publish" If They Can Be Used In Concurrent Search

2019-09-04 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16922304#comment-16922304
 ] 

Adrien Grand commented on LUCENE-8963:
--

I don't think this would solve any problem? Collectors can only run from a 
single thread anyway, and all collectors could have a CollectorManager provided 
that there is a way that the results that they produce can be merged?

> Allow Collectors To "Publish" If They Can Be Used In Concurrent Search
> --
>
> Key: LUCENE-8963
> URL: https://issues.apache.org/jira/browse/LUCENE-8963
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Atri Sharma
>Priority: Major
>
> There is an implied assumption today that all we need to run a query 
> concurrently is a CollectorManager implementation. While that is true, there 
> might be some corner cases where a Collector's semantics do not allow it to 
> be concurrently executed (think of ES's aggregates). If a user manages to 
> write a CollectorManager with a Collector that is not really concurrent 
> friendly, we could end up in an undefined state.
>  
> This Jira is more of a rhetorical discussion, and to explore if we should 
> allow Collectors to implement an API which simply returns a boolean 
> signifying if a Collector is parallel ready or not. The default would be 
> true, until a Collector explicitly overrides it?






[jira] [Updated] (LUCENE-8150) Remove references to segments.gen.

2019-09-04 Thread Adrien Grand (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand updated LUCENE-8150:
-
Fix Version/s: master (9.0)
   Resolution: Fixed
   Status: Resolved  (was: Patch Available)

> Remove references to segments.gen.
> --
>
> Key: LUCENE-8150
> URL: https://issues.apache.org/jira/browse/LUCENE-8150
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Adrien Grand
>Priority: Minor
> Fix For: master (9.0)
>
> Attachments: LUCENE-8150.patch, LUCENE-8150.patch
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> This was the way we wrote pending segment files before we switched to 
> {{pending_segments_N}} in LUCENE-5925.






[jira] [Commented] (LUCENE-8860) LatLonShapeBoundingBoxQuery could make more decisions on inner nodes

2019-09-02 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16920960#comment-16920960
 ] 

Adrien Grand commented on LUCENE-8860:
--

Why is it a problem? If the query polygon fully contains all left edges of the 
MBRs on a given node then we would know it intersects all indexed triangles, 
just like when the query is a box?

> LatLonShapeBoundingBoxQuery could make more decisions on inner nodes
> 
>
> Key: LUCENE-8860
> URL: https://issues.apache.org/jira/browse/LUCENE-8860
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
> Attachments: fig1.png, fig2.png, fig3.png
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Currently LatLonShapeBoundingBoxQuery with the INTERSECTS relation only 
> returns CELL_INSIDE_QUERY if the query contains ALL minimum bounding 
> rectangles of the indexed triangles.
> I think we could return CELL_INSIDE_QUERY if the box contains either of the 
> edges of all MBRs of indexed triangles since triangles are guaranteed to 
> touch all edges of their MBR by definition. In some cases this would help 
> save decoding triangles and running costly point-in-triangle computations.
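
The geometric argument can be sketched in plain planar coordinates, ignoring dateline wrapping and the encoded representation Lucene actually uses. `Box` and `containsAnyEdge` are hypothetical names. A triangle touches every edge of its MBR, so if the query box fully contains any one MBR edge, the triangle must intersect the box.

```java
final class MbrEdgeCheck {
  // Hypothetical axis-aligned box in planar coordinates.
  static final class Box {
    final double minX, minY, maxX, maxY;

    Box(double minX, double minY, double maxX, double maxY) {
      this.minX = minX;
      this.minY = minY;
      this.maxX = maxX;
      this.maxY = maxY;
    }
  }

  private static boolean inRange(double v, double lo, double hi) {
    return lo <= v && v <= hi;
  }

  // True if the query box fully contains at least one edge of the MBR.
  static boolean containsAnyEdge(Box query, Box mbr) {
    boolean coversX = query.minX <= mbr.minX && mbr.maxX <= query.maxX;
    boolean coversY = query.minY <= mbr.minY && mbr.maxY <= query.maxY;
    boolean left = coversY && inRange(mbr.minX, query.minX, query.maxX);
    boolean right = coversY && inRange(mbr.maxX, query.minX, query.maxX);
    boolean bottom = coversX && inRange(mbr.minY, query.minY, query.maxY);
    boolean top = coversX && inRange(mbr.maxY, query.minY, query.maxY);
    return left || right || bottom || top;
  }
}
```

An inner node could report that every triangle below it intersects the query when this check holds for all MBRs it covers, which avoids decoding the individual triangles.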






[jira] [Commented] (LUCENE-8961) CheckIndex: pre-exorcise document id salvage

2019-09-02 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16920958#comment-16920958
 ] 

Adrien Grand commented on LUCENE-8961:
--

This feels too unsafe to me for CheckIndex. For instance, what if idField is 
the corrupt field? You could end up with missing ids or the wrong ids. I'm fine 
with adding more information to the CheckIndex status in order to make it 
easier to do this kind of hack on top of CheckIndex, but I'd like to keep 
CheckIndex something that is rock solid.

> CheckIndex: pre-exorcise document id salvage
> 
>
> Key: LUCENE-8961
> URL: https://issues.apache.org/jira/browse/LUCENE-8961
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Christine Poerschke
>Priority: Minor
> Attachments: LUCENE-8961.patch
>
>
> The 
> [CheckIndex|https://github.com/apache/lucene-solr/blob/releases/lucene-solr/8.2.0/lucene/core/src/java/org/apache/lucene/index/CheckIndex.java]
>  tool supports the exorcising of corrupt segments from an index.
> This ticket proposes to add an extra option which could first be used to 
> potentially salvage the document ids of the segment(s) about to be exorcised. 
> Re-ingestion for those documents could then be arranged so as to repair the 
> data damage caused by the exorcising.



[jira] [Commented] (LUCENE-8956) QueryRescorer sort optimization

2019-09-02 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16920853#comment-16920853
 ] 

Adrien Grand commented on LUCENE-8956:
--

I was thinking that we could verify that we have the right hits by rescoring 
twice, once with topN=random().nextInt(numDocs) like in your patch, and another 
time with topN=numDocs, then make sure that the first topN hits are the same in 
both cases (CheckHits#checkEquals might help).
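
A minimal, Lucene-free sketch of the check suggested above: rescore once keeping only topN hits, once keeping everything, and verify that the first topN hits agree. `Hit`, `fullSortTopN` and `partialSortTopN` are hypothetical names, and the bounded heap is only a stand-in for whatever partial selection the patch actually uses.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;
import java.util.Random;

class RescoreSubsetCheck {

  /** Hypothetical stand-in for a rescored hit (not Lucene's ScoreDoc). */
  static final class Hit {
    final int doc;
    final float score;
    Hit(int doc, float score) { this.doc = doc; this.score = score; }
  }

  /** Higher score first; ties broken by doc id so that the order is total. */
  static final Comparator<Hit> BY_SCORE_DESC = (a, b) -> {
    int c = Float.compare(b.score, a.score);
    return c != 0 ? c : Integer.compare(a.doc, b.doc);
  };

  /** Reference behaviour: sort every hit, then keep the first topN. */
  static List<Hit> fullSortTopN(List<Hit> hits, int topN) {
    List<Hit> copy = new ArrayList<>(hits);
    copy.sort(BY_SCORE_DESC);
    return copy.subList(0, Math.min(topN, copy.size()));
  }

  /** Optimized behaviour: keep only the best topN in a bounded heap, then sort those. */
  static List<Hit> partialSortTopN(List<Hit> hits, int topN) {
    // The head of the heap is the worst hit kept so far, hence the reversed comparator.
    PriorityQueue<Hit> heap = new PriorityQueue<>(BY_SCORE_DESC.reversed());
    for (Hit h : hits) {
      heap.add(h);
      if (heap.size() > topN) {
        heap.poll();
      }
    }
    List<Hit> top = new ArrayList<>(heap);
    top.sort(BY_SCORE_DESC);
    return top;
  }

  /** True if both lists contain the same doc ids in the same order. */
  static boolean sameDocs(List<Hit> a, List<Hit> b) {
    if (a.size() != b.size()) return false;
    for (int i = 0; i < a.size(); i++) {
      if (a.get(i).doc != b.get(i).doc) return false;
    }
    return true;
  }

  /** Builds random hits (with many score ties) and compares both strategies. */
  static boolean check(long seed, int numDocs) {
    Random random = new Random(seed);
    List<Hit> hits = new ArrayList<>();
    for (int doc = 0; doc < numDocs; doc++) {
      hits.add(new Hit(doc, random.nextInt(10)));
    }
    int topN = random.nextInt(numDocs); // same shape as topN=random().nextInt(numDocs)
    return sameDocs(partialSortTopN(hits, topN), fullSortTopN(hits, topN));
  }

  public static void main(String[] args) {
    System.out.println(check(42L, 100));
  }
}
```

In a real test the two hit lists would come from two calls to the rescorer, and the comparison could be delegated to an existing helper rather than a hand-written `sameDocs`.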


> QueryRescorer sort optimization
> ---
>
> Key: LUCENE-8956
> URL: https://issues.apache.org/jira/browse/LUCENE-8956
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/query/scoring
>Reporter: Paul Sanwald
>Priority: Minor
> Attachments: LUCENE-8956.patch
>
>
> This patch addresses a TODO in QueryRescorer: We should not sort the full 
> array of the results returned from rescoring, but rather only topN, when topN 
> is less than total hits.
>  
> Made this optimization with some suggestions from [~jpountz] and [~jimczi], 
> this is my first lucene patch submission.



[jira] [Commented] (LUCENE-8403) Support 'filtered' term vectors - don't require all terms to be present

2019-09-02 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16920659#comment-16920659
 ] 

Adrien Grand commented on LUCENE-8403:
--

bq. mucking with the higher level features to use a separate field for the term 
vector (e.g. in a highlighter)

We could do this quite transparently for highlighters by using a FilterReader 
that redirects term vector calls to the filtered field. This would still be a 
hack as the reader would not pass CheckIndex either, but a much more contained 
one that might be ok, and would avoid making highlighters too complex?

> Support 'filtered' term vectors - don't require all terms to be present
> ---
>
> Key: LUCENE-8403
> URL: https://issues.apache.org/jira/browse/LUCENE-8403
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Michael Braun
>Priority: Minor
> Attachments: LUCENE-8403.patch
>
>
> The genesis of this was a conversation and idea from [~dsmiley] several years 
> ago.
> In order to optimize term vector storage, we may not actually need all tokens 
> to be present in the term vectors - and if so, ideally our codec could just 
> opt not to store them.
> I attempted to fork the standard codec and override the TermVectorsFormat and 
> TermVectorsWriter to ignore storing certain Terms within a field. This 
> worked, however, CheckIndex checks that the terms present in the standard 
> postings are also present in the TVs, if TVs enabled. So this then doesn't 
> work as 'valid' according to CheckIndex.
> Can the TermVectorsFormat be made in such a way to support configuration of 
> tokens that should not be stored (benefits: less storage, more optimal 
> retrieval per doc)? Is this valuable to the wider community? Is there a way 
> we can design this to not break CheckIndex's contract while at the same time 
> lessening storage for unneeded tokens?



[jira] [Commented] (LUCENE-8956) QueryRescorer sort optimization

2019-08-22 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16913694#comment-16913694
 ] 

Adrien Grand commented on LUCENE-8956:
--

The changes in src/java look good to me. Maybe we could make the test a bit 
better:
 - The name is a bit vague, as there could be lots of different optimizations 
in the rescoring logic. Maybe call it "testRescoreSubsetOfHits" or something 
along those lines?
 - Currently the test only checks that the correct number of hits is returned, 
we should also check that we got the right hits?

Some minor comments:
 - No need for a BooleanQuery in the test since you are adding a single clause. 
Or did you plan to add wordTwo in another clause?
 - I can't check right now but I think randomizedtesting already has a utility 
class called RandomPicks to help pick a random element from a list.


> QueryRescorer sort optimization
> ---
>
> Key: LUCENE-8956
> URL: https://issues.apache.org/jira/browse/LUCENE-8956
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/query/scoring
>Reporter: Paul Sanwald
>Priority: Minor
> Attachments: LUCENE-8956.patch
>
>
> This patch addresses a TODO in QueryRescorer: We should not sort the full 
> array of the results returned from rescoring, but rather only topN, when topN 
> is less than total hits.
>  
> Made this optimization with some suggestions from [~jpountz] and [~jimczi], 
> this is my first lucene patch submission.



[jira] [Commented] (LUCENE-8860) LatLonShapeBoundingBoxQuery could make more decisions on inner nodes

2019-08-20 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16911753#comment-16911753
 ] 

Adrien Grand commented on LUCENE-8860:
--

I suspect that the confusion is that we don't use minPackedValue and 
maxPackedValue directly but a combination of them. For instance the current 
logic short circuits the query on Inner nodes when the query shape fully 
contains all indexed triangles, using the minimum (x,y) coordinates from 
minPackedValue and the maximum (x,y) coordinates from maxPackedValue to compute 
a rectangle that contains all MBRs. In your example this rectangle is [(0,0), 
(30,20)].

The idea of this issue is that there are weaker conditions we can check that 
allow us to make the same conclusion. For instance if we take the minimum (x,y) 
coordinates from minPackedValue, the maximum x from minPackedValue and the 
maximum y from maxPackedValue we get a smaller rectangle that contains the left 
edge of all MBRs on the leaf block. If the query fully contains this rectangle 
then it is guaranteed to intersect all triangles of the leaf given that 
triangles have at least one point on the left edge of their MBR. In your 
example this rectangle is [(0,0), (20,20)]. We can do the same test for the 
right, top and bottom edges.
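
The left-edge test described above can be sketched in plain Java as follows. This is only an illustration under stated assumptions: `Rect`, `boundsOfAll` and `boundsOfLeftEdges` are hypothetical names, the code works on already-decoded MBRs rather than on packed values, and the two MBRs are made-up values chosen so that the resulting rectangles match the ones quoted above.

```java
class LeftEdgeRectangle {

  /** Axis-aligned rectangle, also used for a triangle's minimum bounding rectangle (MBR). */
  static final class Rect {
    final int minX, minY, maxX, maxY;

    Rect(int minX, int minY, int maxX, int maxY) {
      this.minX = minX;
      this.minY = minY;
      this.maxX = maxX;
      this.maxY = maxY;
    }

    /** True if this rectangle fully contains the other one. */
    boolean contains(Rect o) {
      return minX <= o.minX && minY <= o.minY && maxX >= o.maxX && maxY >= o.maxY;
    }

    @Override
    public String toString() {
      return "[(" + minX + "," + minY + "), (" + maxX + "," + maxY + ")]";
    }
  }

  /** Smallest rectangle that contains every MBR (the stronger condition). */
  static Rect boundsOfAll(Rect[] mbrs) {
    int minX = Integer.MAX_VALUE, minY = Integer.MAX_VALUE;
    int maxX = Integer.MIN_VALUE, maxY = Integer.MIN_VALUE;
    for (Rect r : mbrs) {
      minX = Math.min(minX, r.minX);
      minY = Math.min(minY, r.minY);
      maxX = Math.max(maxX, r.maxX);
      maxY = Math.max(maxY, r.maxY);
    }
    return new Rect(minX, minY, maxX, maxY);
  }

  /** Smallest rectangle that contains the left edge (x == minX) of every MBR. */
  static Rect boundsOfLeftEdges(Rect[] mbrs) {
    int minX = Integer.MAX_VALUE, minY = Integer.MAX_VALUE;
    int maxX = Integer.MIN_VALUE, maxY = Integer.MIN_VALUE;
    for (Rect r : mbrs) {
      minX = Math.min(minX, r.minX);
      maxX = Math.max(maxX, r.minX); // left edges only span the minX values
      minY = Math.min(minY, r.minY);
      maxY = Math.max(maxY, r.maxY);
    }
    return new Rect(minX, minY, maxX, maxY);
  }

  public static void main(String[] args) {
    Rect[] mbrs = {new Rect(0, 0, 10, 10), new Rect(20, 5, 30, 20)};
    Rect query = new Rect(-1, -1, 25, 21); // hypothetical query box

    Rect all = boundsOfAll(mbrs);
    Rect leftEdges = boundsOfLeftEdges(mbrs);
    System.out.println(all + " contained: " + query.contains(all));
    System.out.println(leftEdges + " contained: " + query.contains(leftEdges));
  }
}
```

With these values the query box does not contain the rectangle around all MBRs, but it does contain the rectangle around their left edges, which is enough to conclude that it intersects every triangle.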

With your red box query, we wouldn't be able to make a decision on the inner 
node since it doesn't fully contain any of the 4 rectangles that contain all 
left, right, top and bottom edges, respectively.

> LatLonShapeBoundingBoxQuery could make more decisions on inner nodes
> 
>
> Key: LUCENE-8860
> URL: https://issues.apache.org/jira/browse/LUCENE-8860
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
> Attachments: fig1.png, fig2.png
>
>
> Currently LatLonShapeBoundingBoxQuery with the INTERSECTS relation only 
> returns CELL_INSIDE_QUERY if the query contains ALL minimum bounding 
> rectangles of the indexed triangles.
> I think we could return CELL_INSIDE_QUERY if the box contains one given edge 
> (left, right, top or bottom) of all MBRs of indexed triangles, since 
> triangles are guaranteed to touch all edges of their MBR by definition. In 
> some cases this would help save decoding triangles and running costly 
> point-in-triangle computations.



[jira] [Commented] (LUCENE-8950) FieldComparators Should Not Maintain Implicit PQs

2019-08-14 Thread Adrien Grand (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16907090#comment-16907090
 ] 

Adrien Grand commented on LUCENE-8950:
--

OK, I got confused by the notion of deprecation, which suggests to me that we 
need to find a replacement. Reading your last message, my understanding is 
that you would like to introduce a subclass of FieldComparator that hides the 
fact that it maintains an implicit PQ, and make simple comparators extend this 
subclass instead of FieldComparator directly? I would be +1 to giving it a try 
and making the change if the perf hit is negligible.

> FieldComparators Should Not Maintain Implicit PQs
> -
>
> Key: LUCENE-8950
> URL: https://issues.apache.org/jira/browse/LUCENE-8950
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Atri Sharma
>Priority: Major
>
> While doing some perf tests, I realised that FieldComparators inherently 
> maintain implicit priority queues for maintaining the sorted order of 
> documents for the given sort order. This is wasteful especially in the case 
> of a multi feature sort order and a large number of hits requested.
>  
> We should change this to have FieldComparators maintain only the top and 
> bottom values, and use them as barriers to compare



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8947) Indexing fails with "too many tokens for field" when using custom term frequencies

2019-08-14 Thread Adrien Grand (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16907086#comment-16907086
 ] 

Adrien Grand commented on LUCENE-8947:
--

Changing it to a long might be challenging for norms, since the current 
encoding relies on the fact that the length is an integer. Are you using 
norms? I guess not. Maybe we could skip computing the field length when norms 
are disabled?
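
To illustrate why the int accumulator overflows, here is a small standalone sketch. It does not use Lucene: the class and method names are hypothetical, the frequency of 998,000 is the rough value from the report, and the token count of 3,000 is an assumption made for the example.

```java
class FieldLengthOverflow {

  /**
   * Accumulates termFreq into an int total, as the quoted indexing code does.
   * Returns the 1-based index of the token at which Math.addExact overflows,
   * or -1 if the total fits in an int.
   */
  static int firstOverflowingToken(int termFreq, int numTokens) {
    int length = 0;
    for (int i = 1; i <= numTokens; i++) {
      try {
        length = Math.addExact(length, termFreq);
      } catch (ArithmeticException e) {
        return i;
      }
    }
    return -1;
  }

  /** Same accumulation with a long total, which has ample headroom here. */
  static long totalAsLong(int termFreq, int numTokens) {
    long length = 0;
    for (int i = 0; i < numTokens; i++) {
      length = Math.addExact(length, (long) termFreq);
    }
    return length;
  }

  public static void main(String[] args) {
    System.out.println(firstOverflowingToken(998_000, 3_000)); // prints 2152
    System.out.println(totalAsLong(998_000, 3_000));           // prints 2994000000
  }
}
```

With frequencies of this size, a couple of thousand tokens are enough to exceed Integer.MAX_VALUE, which is why either a wider accumulator or skipping the computation altogether would avoid the exception.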

> Indexing fails with "too many tokens for field" when using custom term 
> frequencies
> --
>
> Key: LUCENE-8947
> URL: https://issues.apache.org/jira/browse/LUCENE-8947
> Project: Lucene - Core
>  Issue Type: Improvement
>Affects Versions: 7.5
>Reporter: Michael McCandless
>Priority: Major
>
> We are using custom term frequencies (LUCENE-7854) to index per-token scoring 
> signals, however for one document that had many tokens and those tokens had 
> fairly large (~998,000) scoring signals, we hit this exception:
> {noformat}
> 2019-08-05T21:32:37,048 [ERROR] (LuceneIndexing-3-thread-3) 
> com.amazon.lucene.index.IndexGCRDocument: Failed to index doc: 
> java.lang.IllegalArgumentException: too many tokens for field "foobar"
> at 
> org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:825)
> at 
> org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:430)
> at 
> org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:394)
> at 
> org.apache.lucene.index.DocumentsWriterPerThread.updateDocuments(DocumentsWriterPerThread.java:297)
> at 
> org.apache.lucene.index.DocumentsWriter.updateDocuments(DocumentsWriter.java:450)
> at org.apache.lucene.index.IndexWriter.updateDocuments(IndexWriter.java:1291)
> at org.apache.lucene.index.IndexWriter.addDocuments(IndexWriter.java:1264)
> {noformat}
> This is happening in this code in {{DefaultIndexingChain.java}}:
> {noformat}
> try {
>   invertState.length = Math.addExact(invertState.length,
>       invertState.termFreqAttribute.getTermFrequency());
> } catch (ArithmeticException ae) {
>   throw new IllegalArgumentException("too many tokens for field \"" +
>       field.name() + "\"");
> }
> {noformat}
> Where Lucene is accumulating the total length (number of tokens) for the 
> field.  But total length doesn't really make sense if you are using custom 
> term frequencies to hold arbitrary scoring signals?  Or, maybe it does make 
> sense, if user is using this as simple boosting, but maybe we should allow 
> this length to be a {{long}}?



[jira] [Commented] (LUCENE-8950) FieldComparators Should Not Maintain Implicit PQs

2019-08-14 Thread Adrien Grand (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16907069#comment-16907069
 ] 

Adrien Grand commented on LUCENE-8950:
--

This looks like a duplicate of LUCENE-8878?

I think all of us agree on the fact that it would be nice to have a simpler 
FieldComparator API. The challenge is that we don't want to trade too much 
efficiency. For instance the API you are proposing wouldn't work well with 
geo-distance sorting since it would require computing the actual distance for 
every new document, while the current implementation tries to be smart to first 
check a bounding box, and then compute a sort key that compares like the actual 
distance but is much cheaper to compute (see discussion on LUCENE-8878 for more 
details).
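
The idea of a sort key that "compares like the actual distance" can be sketched as follows. This is not Lucene's implementation: the class, the method names and the Earth radius constant are assumptions made for the example. The key is the haversine term, and the real distance is a monotonically increasing function of it, so comparing keys gives the same order while skipping the final square root and arcsine.

```java
class DistanceSortKey {

  // Mean Earth radius in meters; an illustrative constant.
  static final double EARTH_RADIUS_METERS = 6_371_008.8;

  /** Haversine term h in [0, 1]: enough to order points by distance. */
  static double sortKey(double lat1, double lon1, double lat2, double lon2) {
    double phi1 = Math.toRadians(lat1);
    double phi2 = Math.toRadians(lat2);
    double sinHalfDLat = Math.sin((phi2 - phi1) / 2);
    double sinHalfDLon = Math.sin(Math.toRadians(lon2 - lon1) / 2);
    return sinHalfDLat * sinHalfDLat
        + Math.cos(phi1) * Math.cos(phi2) * sinHalfDLon * sinHalfDLon;
  }

  /** Converts a sort key to meters; only needed for the hits that are finally returned. */
  static double toMeters(double sortKey) {
    return 2 * EARTH_RADIUS_METERS * Math.asin(Math.min(1, Math.sqrt(sortKey)));
  }

  public static void main(String[] args) {
    // Approximate coordinates of Paris, London and Berlin.
    double parisLondon = sortKey(48.8566, 2.3522, 51.5074, -0.1278);
    double parisBerlin = sortKey(48.8566, 2.3522, 52.52, 13.405);

    // Comparing keys yields the same order as comparing distances.
    System.out.println(parisLondon < parisBerlin);
    System.out.println(toMeters(parisLondon) < toMeters(parisBerlin));
  }
}
```

A comparator built this way can keep only keys in its queue and convert to meters at the very end, which is the kind of saving that an API forcing an exact value per document would lose.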

> FieldComparators Should Not Maintain Implicit PQs
> -
>
> Key: LUCENE-8950
> URL: https://issues.apache.org/jira/browse/LUCENE-8950
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Atri Sharma
>Priority: Major
>
> While doing some perf tests, I realised that FieldComparators inherently 
> maintain implicit priority queues for maintaining the sorted order of 
> documents for the given sort order. This is wasteful especially in the case 
> of a multi feature sort order and a large number of hits requested.
>  
> We should change this to have FieldComparators maintain only the top and 
> bottom values, and use them as barriers to compare



[jira] [Commented] (LUCENE-8935) BooleanQuery with no scoring clauses cannot skip documents when running TOP_SCORES mode

2019-07-26 Thread Adrien Grand (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16893786#comment-16893786
 ] 

Adrien Grand commented on LUCENE-8935:
--

Whoops, indeed you are right. +1 to the attached patch!

> BooleanQuery with no scoring clauses cannot skip documents when running 
> TOP_SCORES mode
> ---
>
> Key: LUCENE-8935
> URL: https://issues.apache.org/jira/browse/LUCENE-8935
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Jim Ferenczi
>Priority: Minor
> Attachments: LUCENE-8935.patch
>
>
> Today a boolean query that is composed of filtering clauses only (more than 
> one) cannot skip documents when the search is executed with the TOP_SCORES 
> mode. However, since all documents have a score of 0, it should be possible 
> to early terminate the query as soon as we have collected enough top hits. 
> Wrapping the resulting boolean scorer in a constant score scorer should allow 
> early termination in this case and would speed up the retrieval of top hits 
> considerably if the total hit count is not requested.



[jira] [Commented] (LUCENE-8935) BooleanQuery with no scoring clauses cannot skip documents when running TOP_SCORES mode

2019-07-26 Thread Adrien Grand (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16893690#comment-16893690
 ] 

Adrien Grand commented on LUCENE-8935:
--

The approach works for me. I'm wondering whether, if we put this logic at the 
very bottom of Boolean2ScorerSupplier#get instead, we'd also cover the case 
when there is a SHOULD clause in addition to the FILTER clauses, but it 
produces a null scorer.

> BooleanQuery with no scoring clauses cannot skip documents when running 
> TOP_SCORES mode
> ---
>
> Key: LUCENE-8935
> URL: https://issues.apache.org/jira/browse/LUCENE-8935
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Jim Ferenczi
>Priority: Minor
> Attachments: LUCENE-8935.patch
>
>
> Today a boolean query that is composed of filtering clauses only (more than 
> one) cannot skip documents when the search is executed with the TOP_SCORES 
> mode. However, since all documents have a score of 0, it should be possible 
> to early terminate the query as soon as we have collected enough top hits. 
> Wrapping the resulting boolean scorer in a constant score scorer should allow 
> early termination in this case and would speed up the retrieval of top hits 
> considerably if the total hit count is not requested.



[jira] [Resolved] (LUCENE-8915) Document that RateLimiter's limits may be updated over time

2019-07-26 Thread Adrien Grand (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand resolved LUCENE-8915.
--
   Resolution: Fixed
Fix Version/s: 8.3

Thanks [~atris]

> Document that RateLimiter's limits may be updated over time
> ---
>
> Key: LUCENE-8915
> URL: https://issues.apache.org/jira/browse/LUCENE-8915
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Atri Sharma
>Priority: Major
> Fix For: 8.3
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> RateLimiter does not allow dynamic configuration of the rate limit today. 
> This limits the kind of applications that the functionality can be applied 
> to. This Jira tracks 1) allowing the rate limiter to change limits 
> dynamically, and 2) adding a RateLimiter subclass which exposes the same.



[jira] [Updated] (LUCENE-8915) Document that RateLimiter's limits may be updated over time

2019-07-26 Thread Adrien Grand (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand updated LUCENE-8915:
-
Summary: Document that RateLimiter's limits may be updated over time  (was: 
Allow RateLimiter To Have Dynamic Limits)

> Document that RateLimiter's limits may be updated over time
> ---
>
> Key: LUCENE-8915
> URL: https://issues.apache.org/jira/browse/LUCENE-8915
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Atri Sharma
>Priority: Major
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> RateLimiter does not allow dynamic configuration of the rate limit today. 
> This limits the kind of applications that the functionality can be applied 
> to. This Jira tracks 1) allowing the rate limiter to change limits 
> dynamically, and 2) adding a RateLimiter subclass which exposes the same.



[jira] [Commented] (LUCENE-8933) JapaneseTokenizer creates Token objects with corrupt offsets

2019-07-26 Thread Adrien Grand (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16893662#comment-16893662
 ] 

Adrien Grand commented on LUCENE-8933:
--

Should we go further and check that the concatenation of the segments is equal 
to the surface form?
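
A rough sketch of such a check, assuming a user dictionary line has the shape "surface,segmentation,readings,part-of-speech" with a space-separated segmentation. The class and method names are hypothetical, and ASCII placeholders stand in for real dictionary entries.

```java
class UserDictionaryEntryCheck {

  /**
   * Returns true if the segments, once concatenated, are equal to the surface form.
   * Assumes the first CSV field is the surface form and the second one is the
   * space-separated segmentation.
   */
  static boolean segmentsMatchSurface(String csvLine) {
    String[] fields = csvLine.split(",");
    if (fields.length < 2) {
      throw new IllegalArgumentException("expected at least two fields: " + csvLine);
    }
    String surface = fields[0];
    String concatenated = fields[1].replaceAll("\\s+", "");
    return surface.equals(concatenated);
  }

  public static void main(String[] args) {
    // Segments "ab" + "cd" rebuild the surface form "abcd".
    System.out.println(segmentsMatchSurface("abcd,ab cd,AB CD,noun"));
    // Segments "ab" + "c" are shorter than the surface form "abcd".
    System.out.println(segmentsMatchSurface("abcd,ab c,AB C,noun"));
  }
}
```

The entry used in the simplified reproduction on this issue would fail this check, since its segmentation field is not equal to its surface form.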

> JapaneseTokenizer creates Token objects with corrupt offsets
> 
>
> Key: LUCENE-8933
> URL: https://issues.apache.org/jira/browse/LUCENE-8933
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Adrien Grand
>Priority: Minor
>
> An Elasticsearch user reported the following stack trace when parsing 
> synonyms. It looks like the only reason why this might occur is if the offset 
> of a {{org.apache.lucene.analysis.ja.Token}} is not within the expected range.
>  
> {noformat}
> Caused by: java.lang.ArrayIndexOutOfBoundsException
> at 
> org.apache.lucene.analysis.tokenattributes.CharTermAttributeImpl.copyBuffer(CharTermAttributeImpl.java:44)
>  ~[lucene-core-7.6.0.jar:7.6.0 719cde97f84640faa1e3525690d262946571245f - 
> nknize - 2018-12-07 14:44:20]
> at 
> org.apache.lucene.analysis.ja.JapaneseTokenizer.incrementToken(JapaneseTokenizer.java:486)
>  ~[?:?]
> at 
> org.apache.lucene.analysis.synonym.SynonymMap$Parser.analyze(SynonymMap.java:318)
>  ~[lucene-analyzers-common-7.6.0.jar:7.6.0 
> 719cde97f84640faa1e3525690d262946571245f - nknize - 2018-12-07 14:44:48]
> at 
> org.elasticsearch.index.analysis.ESSolrSynonymParser.analyze(ESSolrSynonymParser.java:57)
>  ~[elasticsearch-6.6.1.jar:6.6.1]
> at 
> org.apache.lucene.analysis.synonym.SolrSynonymParser.addInternal(SolrSynonymParser.java:114)
>  ~[lucene-analyzers-common-7.6.0.jar:7.6.0 
> 719cde97f84640faa1e3525690d262946571245f - nknize - 2018-12-07 14:44:48]
> at 
> org.apache.lucene.analysis.synonym.SolrSynonymParser.parse(SolrSynonymParser.java:70)
>  ~[lucene-analyzers-common-7.6.0.jar:7.6.0 
> 719cde97f84640faa1e3525690d262946571245f - nknize - 2018-12-07 14:44:48]
> at 
> org.elasticsearch.index.analysis.SynonymTokenFilterFactory.buildSynonyms(SynonymTokenFilterFactory.java:154)
>  ~[elasticsearch-6.6.1.jar:6.6.1]
> ... 24 more
> {noformat}



[jira] [Commented] (LUCENE-8927) Cut Over To Set.copyOf and Set.Of From Collections.unmodifiableSet

2019-07-26 Thread Adrien Grand (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16893477#comment-16893477
 ] 

Adrien Grand commented on LUCENE-8927:
--

Thanks for fixing [~ichattopadhyaya]!

> Cut Over To Set.copyOf and Set.Of From Collections.unmodifiableSet
> --
>
> Key: LUCENE-8927
> URL: https://issues.apache.org/jira/browse/LUCENE-8927
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Atri Sharma
>Priority: Major
> Fix For: master (9.0)
>
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>




[jira] [Resolved] (LUCENE-8927) Cut Over To Set.copyOf and Set.Of From Collections.unmodifiableSet

2019-07-25 Thread Adrien Grand (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand resolved LUCENE-8927.
--
   Resolution: Fixed
Fix Version/s: master (9.0)

> Cut Over To Set.copyOf and Set.Of From Collections.unmodifiableSet
> --
>
> Key: LUCENE-8927
> URL: https://issues.apache.org/jira/browse/LUCENE-8927
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Atri Sharma
>Priority: Major
> Fix For: master (9.0)
>
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>




[jira] [Resolved] (LUCENE-8931) TestTopFieldCollectorEarlyTermination Should Use CheckHits

2019-07-25 Thread Adrien Grand (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand resolved LUCENE-8931.
--
   Resolution: Fixed
Fix Version/s: 8.3

> TestTopFieldCollectorEarlyTermination Should Use CheckHits
> --
>
> Key: LUCENE-8931
> URL: https://issues.apache.org/jira/browse/LUCENE-8931
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Atri Sharma
>Priority: Minor
> Fix For: 8.3
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> TestTopFieldCollectorEarlyTermination invents a new way of checking equality 
> of hits. That is redundant since CheckHits provides the same functionality 
> and is the de facto standard now.



[jira] [Resolved] (LUCENE-8922) Speed up retrieval of top hits of DisjunctionMaxQuery

2019-07-25 Thread Adrien Grand (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand resolved LUCENE-8922.
--
   Resolution: Fixed
Fix Version/s: 8.3

> Speed up retrieval of top hits of DisjunctionMaxQuery
> -
>
> Key: LUCENE-8922
> URL: https://issues.apache.org/jira/browse/LUCENE-8922
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
> Fix For: 8.3
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> There is a simple optimization that we are not doing in the case that 
> tieBreakMultiplier is 0: we could propagate the min competitive score to sub 
> clauses as-is.
> Even in the general case, we currently compute the block boundary of the 
> DisjunctionMaxQuery as the minimum of the block boundaries of its sub 
> clauses. This generates blocks that have very low score upper bounds but 
> unfortunately they are also very small, which means that we might sometimes 
> not make progress quickly enough.
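
The tieBreakMultiplier == 0 case from the description can be illustrated with a small sketch. It assumes the usual definition of a disjunction-max score (the maximum sub-score plus tieBreakMultiplier times the sum of the other sub-scores); the class and method names are hypothetical and no Lucene classes are involved.

```java
class DisMaxScoreSketch {

  /** Disjunction-max score: max sub-score plus tieBreakMultiplier times the other sub-scores. */
  static float score(float[] subScores, float tieBreakMultiplier) {
    float max = 0f;
    float sum = 0f;
    for (float s : subScores) {
      max = Math.max(max, s);
      sum += s;
    }
    return max + tieBreakMultiplier * (sum - max);
  }

  /** True if at least one sub-clause alone reaches the min competitive score. */
  static boolean anySubScoreAtLeast(float[] subScores, float minCompetitiveScore) {
    for (float s : subScores) {
      if (s >= minCompetitiveScore) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    float min = 1.0f;
    float[] doc = {0.8f, 0.8f};

    // With a multiplier of 0 the score is the max sub-score, so a document is
    // competitive exactly when one of its sub-clauses is competitive on its own.
    System.out.println(score(doc, 0f) >= min);
    System.out.println(anySubScoreAtLeast(doc, min));

    // With a non-zero multiplier the same document can become competitive even
    // though no single sub-clause reaches the threshold.
    System.out.println(score(doc, 0.5f) >= min);
  }
}
```

This is why the threshold can be handed to the sub clauses unchanged only when the multiplier is 0; with a non-zero multiplier a sub clause that skipped everything below the threshold could miss competitive documents.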



[jira] [Commented] (LUCENE-8933) JapaneseTokenizer creates Token objects with corrupt offsets

2019-07-25 Thread Adrien Grand (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16892644#comment-16892644
 ] 

Adrien Grand commented on LUCENE-8933:
--

Ah, thanks for digging [~tomoko] and [~danmuzi]. I simplified Tomoko's 
recreation a bit more:

{code:java}
UserDictionary dict = UserDictionary.open(
    new StringReader("アメリカン航空,アメカン航空,アメリカンコウクウ,カスタム用語"));
JapaneseTokenizer tok = new JapaneseTokenizer(dict, true, Mode.NORMAL);
tok.setReader(new StringReader("アメリカン航空"));
tok.reset();
tok.incrementToken();
{code}

Tomoko, I wonder whether the fact that the issue doesn't occur when the emoji 
is at other positions might be due to the fact that the Position class 
initializes its buffers' sizes to 8?

> JapaneseTokenizer creates Token objects with corrupt offsets
> 
>
> Key: LUCENE-8933
> URL: https://issues.apache.org/jira/browse/LUCENE-8933
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Adrien Grand
>Priority: Minor
>
> An Elasticsearch user reported the following stack trace when parsing 
> synonyms. It looks like the only reason why this might occur is if the offset 
> of a {{org.apache.lucene.analysis.ja.Token}} is not within the expected range.
>  
> {noformat}
> Caused by: java.lang.ArrayIndexOutOfBoundsException
> at 
> org.apache.lucene.analysis.tokenattributes.CharTermAttributeImpl.copyBuffer(CharTermAttributeImpl.java:44)
>  ~[lucene-core-7.6.0.jar:7.6.0 719cde97f84640faa1e3525690d262946571245f - 
> nknize - 2018-12-07 14:44:20]
> at 
> org.apache.lucene.analysis.ja.JapaneseTokenizer.incrementToken(JapaneseTokenizer.java:486)
>  ~[?:?]
> at 
> org.apache.lucene.analysis.synonym.SynonymMap$Parser.analyze(SynonymMap.java:318)
>  ~[lucene-analyzers-common-7.6.0.jar:7.6.0 
> 719cde97f84640faa1e3525690d262946571245f - nknize - 2018-12-07 14:44:48]
> at 
> org.elasticsearch.index.analysis.ESSolrSynonymParser.analyze(ESSolrSynonymParser.java:57)
>  ~[elasticsearch-6.6.1.jar:6.6.1]
> at 
> org.apache.lucene.analysis.synonym.SolrSynonymParser.addInternal(SolrSynonymParser.java:114)
>  ~[lucene-analyzers-common-7.6.0.jar:7.6.0 
> 719cde97f84640faa1e3525690d262946571245f - nknize - 2018-12-07 14:44:48]
> at 
> org.apache.lucene.analysis.synonym.SolrSynonymParser.parse(SolrSynonymParser.java:70)
>  ~[lucene-analyzers-common-7.6.0.jar:7.6.0 
> 719cde97f84640faa1e3525690d262946571245f - nknize - 2018-12-07 14:44:48]
> at 
> org.elasticsearch.index.analysis.SynonymTokenFilterFactory.buildSynonyms(SynonymTokenFilterFactory.java:154)
>  ~[elasticsearch-6.6.1.jar:6.6.1]
> ... 24 more
> {noformat}



[jira] [Commented] (LUCENE-8932) Allow BKDReader packedIndex to be off heap

2019-07-24 Thread Adrien Grand (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16892193#comment-16892193
 ] 

Adrien Grand commented on LUCENE-8932:
--

This is interesting! It doesn't seem to affect performance at all so I believe 
we could just load the index off-heap all the time? And maybe something like 
LUCENE-8833 could help make sure it is hot in the page cache.

> Allow BKDReader packedIndex to be off heap
> --
>
> Key: LUCENE-8932
> URL: https://issues.apache.org/jira/browse/LUCENE-8932
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Jack Conradson
>Priority: Minor
> Attachments: LUCENE-8932.patch
>
>
> This change modifies BKDReader to read the packedIndex bytes off heap rather 
> than load them all on heap at a single time.
> Questions for discussion:
>  # Should BKDReader only support packedIndex off heap?
>  # If not, how should the choice be made?
> Using luceneutils IndexAndSearchOpenStreetMaps produces the following test 
> results:
> with -box -points (patch)
> READER MB: 1.1345596313476562
> BEST M hits/sec: 73.34277344984474
> BEST QPS: 74.63011169783009
> with -box -points (original)
> READER MB: 1.7249317169189453
> BEST M hits/sec: 73.77125157623486
> BEST QPS: 75.06611062353801
> with -nearest 10 -points (patch)
> READER MB: 1.1345596313476562
> BEST M hits/sec: 0.013586298373879497
> BEST QPS: 1358.6298373879497
> with -nearest 10 -points (original)
> READER MB: 1.7249317169189453
> BEST M hits/sec: 0.01445208197367343
> BEST QPS: 1445.208197367343
> with -box -geo3d (patch)
> READER MB: 1.1345596313476562
> BEST M hits/sec: 39.84968715299074
> BEST QPS: 40.54914292796736
> with -box -geo3d (original)
> READER MB: 1.7456226348876953
> BEST M hits/sec: 40.45051734329004
> BEST QPS: 41.160519101846695



[jira] [Created] (LUCENE-8933) JapaneseTokenizer creates Token objects with corrupt offsets

2019-07-24 Thread Adrien Grand (JIRA)
Adrien Grand created LUCENE-8933:


 Summary: JapaneseTokenizer creates Token objects with corrupt 
offsets
 Key: LUCENE-8933
 URL: https://issues.apache.org/jira/browse/LUCENE-8933
 Project: Lucene - Core
  Issue Type: Bug
Reporter: Adrien Grand


An Elasticsearch user reported the following stack trace when parsing synonyms. 
It looks like the only reason why this might occur is if the offset of a 
{{org.apache.lucene.analysis.ja.Token}} is not within the expected range.

 
{noformat}
Caused by: java.lang.ArrayIndexOutOfBoundsException
    at org.apache.lucene.analysis.tokenattributes.CharTermAttributeImpl.copyBuffer(CharTermAttributeImpl.java:44) ~[lucene-core-7.6.0.jar:7.6.0 719cde97f84640faa1e3525690d262946571245f - nknize - 2018-12-07 14:44:20]
    at org.apache.lucene.analysis.ja.JapaneseTokenizer.incrementToken(JapaneseTokenizer.java:486) ~[?:?]
    at org.apache.lucene.analysis.synonym.SynonymMap$Parser.analyze(SynonymMap.java:318) ~[lucene-analyzers-common-7.6.0.jar:7.6.0 719cde97f84640faa1e3525690d262946571245f - nknize - 2018-12-07 14:44:48]
    at org.elasticsearch.index.analysis.ESSolrSynonymParser.analyze(ESSolrSynonymParser.java:57) ~[elasticsearch-6.6.1.jar:6.6.1]
    at org.apache.lucene.analysis.synonym.SolrSynonymParser.addInternal(SolrSynonymParser.java:114) ~[lucene-analyzers-common-7.6.0.jar:7.6.0 719cde97f84640faa1e3525690d262946571245f - nknize - 2018-12-07 14:44:48]
    at org.apache.lucene.analysis.synonym.SolrSynonymParser.parse(SolrSynonymParser.java:70) ~[lucene-analyzers-common-7.6.0.jar:7.6.0 719cde97f84640faa1e3525690d262946571245f - nknize - 2018-12-07 14:44:48]
    at org.elasticsearch.index.analysis.SynonymTokenFilterFactory.buildSynonyms(SynonymTokenFilterFactory.java:154) ~[elasticsearch-6.6.1.jar:6.6.1]
    ... 24 more
{noformat}
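
For readers unfamiliar with this failure mode, here is a minimal sketch of how an 
out-of-range offset can surface as this kind of exception. It assumes the copy 
ultimately goes through System.arraycopy; CopyBufferSketch is a stand-in written for 
this note and is not Lucene's CharTermAttributeImpl.

```java
// Stand-in for a term buffer copy; NOT the actual Lucene implementation.
final class CopyBufferSketch {
  /** Copies {@code length} chars of {@code source} starting at {@code offset}. */
  static char[] copyBuffer(char[] source, int offset, int length) {
    char[] term = new char[length];
    // Throws an IndexOutOfBoundsException (ArrayIndexOutOfBoundsException is a
    // subclass) when offset + length exceeds source.length.
    System.arraycopy(source, offset, term, 0, length);
    return term;
  }

  /** Returns true if a token whose offset points past the buffer makes the copy fail. */
  static boolean failsWithBadOffset() {
    char[] text = "tokyo".toCharArray();
    try {
      copyBuffer(text, 3, 5); // offset + length = 8, but the buffer holds 5 chars
      return false;
    } catch (IndexOutOfBoundsException e) {
      return true;
    }
  }
}
```

A tokenizer that hands such a copy an offset/length pair computed against a different 
buffer state would fail in the same way, which is consistent with the suspicion above.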






[jira] [Commented] (LUCENE-8929) Early Terminating CollectorManager

2019-07-23 Thread Adrien Grand (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16890753#comment-16890753
 ] 

Adrien Grand commented on LUCENE-8929:
--

Hmm I am confused now, I don't think you can spread the top numHits hits across 
collectors given that index sorting works on a per-segment basis. So you need 
to collect each segment at least until \{numHits} hits have been collected, or 
until the last collected hit was not competitive globally (whichever comes 
first)?

> Early Terminating CollectorManager
> --
>
> Key: LUCENE-8929
> URL: https://issues.apache.org/jira/browse/LUCENE-8929
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Atri Sharma
>Priority: Major
>
> We should have an early terminating collector manager which accurately tracks 
> hits across all of its collectors and determines when there are enough hits, 
> allowing all the collectors to abort.
> The options for the same are:
> 1) Shared total count : Global "scoreboard" where all collectors update their 
> current hit count. At the end of each document's collection, collector checks 
> if N > threshold, and aborts if true
> 2) State Reporting Collectors: Collectors report their total number of counts 
> collected periodically using a callback mechanism, and get a proceed or abort 
> decision.
> 1) has the overhead of synchronization in the hot path, 2) can collect 
> unnecessary hits before aborting.
> I am planning to work on 2), unless there are objections.
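
Option 1) from the description above can be sketched roughly as follows. This is only 
an illustration under assumptions: HitScoreboard and collectSlice are hypothetical 
names, not Lucene classes, and the loop stands in for a real collector.

```java
import java.util.concurrent.atomic.AtomicInteger;

final class SharedHitCounterSketch {
  /** Global "scoreboard" shared by all collectors of one search (hypothetical). */
  static final class HitScoreboard {
    private final AtomicInteger totalHits = new AtomicInteger();
    private final int threshold;

    HitScoreboard(int threshold) {
      this.threshold = threshold;
    }

    /** Records one hit; returns true once the global count exceeds the threshold. */
    boolean recordHitAndCheck() {
      return totalHits.incrementAndGet() > threshold;
    }

    int total() {
      return totalHits.get();
    }
  }

  /** Stand-in for one collector: collects docs of its slice until told to stop. */
  static int collectSlice(HitScoreboard board, int docsInSlice) {
    int collected = 0;
    for (int doc = 0; doc < docsInSlice; doc++) {
      collected++;
      if (board.recordHitAndCheck()) {
        break; // enough hits globally: abort this collector
      }
    }
    return collected;
  }
}
```

The atomic increment on every collected document is the hot-path synchronization cost 
mentioned for option 1); option 2) would trade it for less frequent, batched reports.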






[jira] [Commented] (LUCENE-8929) Early Terminating CollectorManager

2019-07-23 Thread Adrien Grand (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16890746#comment-16890746
 ] 

Adrien Grand commented on LUCENE-8929:
--

OK, so if I understand correctly you are still collecting the first numHits 
hits as today, but you are trying to avoid collecting 
${totalHitsThreshold-numHits} additional hits on every slice with this global 
counter?

> Early Terminating CollectorManager
> --
>
> Key: LUCENE-8929
> URL: https://issues.apache.org/jira/browse/LUCENE-8929
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Atri Sharma
>Priority: Major
>
> We should have an early terminating collector manager which accurately tracks 
> hits across all of its collectors and determines when there are enough hits, 
> allowing all the collectors to abort.
> The options for the same are:
> 1) Shared total count : Global "scoreboard" where all collectors update their 
> current hit count. At the end of each document's collection, collector checks 
> if N > threshold, and aborts if true
> 2) State Reporting Collectors: Collectors report their total number of counts 
> collected periodically using a callback mechanism, and get a proceed or abort 
> decision.
> 1) has the overhead of synchronization in the hot path, 2) can collect 
> unnecessary hits before aborting.
> I am planning to work on 2), unless there are objections.






[jira] [Commented] (LUCENE-8929) Early Terminating CollectorManager

2019-07-23 Thread Adrien Grand (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16890732#comment-16890732
 ] 

Adrien Grand commented on LUCENE-8929:
--

What collector do you have in mind? Is it TopFieldCollector?

> Early Terminating CollectorManager
> --
>
> Key: LUCENE-8929
> URL: https://issues.apache.org/jira/browse/LUCENE-8929
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Atri Sharma
>Priority: Major
>
> We should have an early terminating collector manager which accurately tracks 
> hits across all of its collectors and determines when there are enough hits, 
> allowing all the collectors to abort.
> The options for the same are:
> 1) Shared total count : Global "scoreboard" where all collectors update their 
> current hit count. At the end of each document's collection, collector checks 
> if N > threshold, and aborts if true
> 2) State Reporting Collectors: Collectors report their total number of counts 
> collected periodically using a callback mechanism, and get a proceed or abort 
> decision.
> 1) has the overhead of synchronization in the hot path, 2) can collect 
> unnecessary hits before aborting.
> I am planning to work on 2), unless there are objections.






[jira] [Commented] (LUCENE-8921) IndexSearcher.termStatistics should not require TermStates but docFreq and totalTermFreq

2019-07-19 Thread Adrien Grand (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1614#comment-1614
 ] 

Adrien Grand commented on LUCENE-8921:
--

+1

> IndexSearcher.termStatistics should not require TermStates but docFreq and 
> totalTermFreq
> 
>
> Key: LUCENE-8921
> URL: https://issues.apache.org/jira/browse/LUCENE-8921
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/search
>Affects Versions: 8.1
>Reporter: Bruno Roustant
>Priority: Major
> Fix For: master (9.0)
>
>
> IndexSearcher.termStatistics(Term term, TermStates context) is the way to 
> create a TermStatistics. It requires a TermStates param although it only 
> cares about the docFreq and totalTermFreq.
>  
> For customizations that want to create TermStatistics based on docFreq and 
> totalTermFreq, but that do not have TermStates available, this method forces 
> them to create a TermStates instance (which is not very lightweight) only to 
> pass two ints.
> termStatistics could be modified to the following signature:
> termStatistics(Term term, int docFreq, int totalTermFreq)
> Since it would change the API, it could be done in master for next major 
> release.






[jira] [Commented] (LUCENE-8920) Reduce size of FSTs due to use of direct-addressing encoding

2019-07-19 Thread Adrien Grand (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1612#comment-1612
 ] 

Adrien Grand commented on LUCENE-8920:
--

This sounds good Mike. I'm making it a blocker for 8.3 since we haven't 
reverted from branch_8x.

> Reduce size of FSTs due to use of direct-addressing encoding 
> -
>
> Key: LUCENE-8920
> URL: https://issues.apache.org/jira/browse/LUCENE-8920
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Mike Sokolov
>Priority: Blocker
> Fix For: 8.3
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Some data can lead to worst-case ~4x RAM usage due to this optimization. 
> Several ideas were suggested to combat this on the mailing list:
> bq. I think we can improve the situation here by tracking, per-FST instance, 
> the size increase we're seeing while building (or perhaps do a preliminary 
> pass before building) in order to decide whether to apply the encoding. 
> bq. we could also make the encoding a bit more efficient. For instance I 
> noticed that arc metadata is pretty large in some cases (in the 10-20 byte 
> range), which makes gaps very costly. Associating each label with a dense id and 
> having an intermediate lookup, i.e. lookup label -> id and then id -> arc offset 
> instead of doing label->arc directly could save a lot of space in some cases? 
> Also it seems that we are repeating the label in the arc metadata when 
> array-with-gaps is used, even though it shouldn't be necessary since the 
> label is implicit from the address?
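
The second quoted idea (label -> dense id -> arc) can be illustrated with a rough size 
comparison. Everything here is an assumption made for the example: ArcLayoutSketch, the 
one-byte id table, and the fixed bytes-per-arc figure are not taken from the FST code.

```java
import java.util.Arrays;

// Rough size model, not the FST implementation.
final class ArcLayoutSketch {
  /** Array-with-gaps: one full arc slot for every label in [first, last]. */
  static long directAddressingBytes(int[] sortedLabels, int bytesPerArc) {
    int range = sortedLabels[sortedLabels.length - 1] - sortedLabels[0] + 1;
    return (long) range * bytesPerArc;
  }

  /** Two-step lookup: a small id per label slot, plus one arc slot per real arc. */
  static long denseIdBytes(int[] sortedLabels, int bytesPerArc, int bytesPerId) {
    int range = sortedLabels[sortedLabels.length - 1] - sortedLabels[0] + 1;
    return (long) range * bytesPerId + (long) sortedLabels.length * bytesPerArc;
  }

  /** Builds the label -> dense id table; -1 marks a gap. Assumes at most 127 arcs. */
  static byte[] buildIdTable(int[] sortedLabels) {
    int first = sortedLabels[0];
    byte[] table = new byte[sortedLabels[sortedLabels.length - 1] - first + 1];
    Arrays.fill(table, (byte) -1);
    for (int id = 0; id < sortedLabels.length; id++) {
      table[sortedLabels[id] - first] = (byte) id;
    }
    return table;
  }

  /** Returns the dense arc index for a label, or -1 if the node has no such arc. */
  static int arcIndex(byte[] idTable, int firstLabel, int label) {
    int slot = label - firstLabel;
    if (slot < 0 || slot >= idTable.length) {
      return -1;
    }
    return idTable[slot];
  }
}
```

With labels 'a', 'm' and 'z' and 16 bytes of metadata per arc, this model gives 
26 * 16 = 416 bytes for the direct layout versus 26 + 3 * 16 = 74 bytes for the 
two-step layout, which is the kind of saving the comment is hinting at for sparse nodes.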






[jira] [Updated] (LUCENE-8920) Reduce size of FSTs due to use of direct-addressing encoding

2019-07-19 Thread Adrien Grand (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand updated LUCENE-8920:
-
Priority: Blocker  (was: Major)

> Reduce size of FSTs due to use of direct-addressing encoding 
> -
>
> Key: LUCENE-8920
> URL: https://issues.apache.org/jira/browse/LUCENE-8920
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Mike Sokolov
>Priority: Blocker
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Some data can lead to worst-case ~4x RAM usage due to this optimization. 
> Several ideas were suggested to combat this on the mailing list:
> bq. I think we can improve the situation here by tracking, per-FST instance, 
> the size increase we're seeing while building (or perhaps do a preliminary 
> pass before building) in order to decide whether to apply the encoding. 
> bq. we could also make the encoding a bit more efficient. For instance I 
> noticed that arc metadata is pretty large in some cases (in the 10-20 byte 
> range), which makes gaps very costly. Associating each label with a dense id and 
> having an intermediate lookup, i.e. lookup label -> id and then id -> arc offset 
> instead of doing label->arc directly could save a lot of space in some cases? 
> Also it seems that we are repeating the label in the arc metadata when 
> array-with-gaps is used, even though it shouldn't be necessary since the 
> label is implicit from the address?






[jira] [Updated] (LUCENE-8920) Reduce size of FSTs due to use of direct-addressing encoding

2019-07-19 Thread Adrien Grand (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand updated LUCENE-8920:
-
Fix Version/s: 8.3

> Reduce size of FSTs due to use of direct-addressing encoding 
> -
>
> Key: LUCENE-8920
> URL: https://issues.apache.org/jira/browse/LUCENE-8920
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Mike Sokolov
>Priority: Blocker
> Fix For: 8.3
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Some data can lead to worst-case ~4x RAM usage due to this optimization. 
> Several ideas were suggested to combat this on the mailing list:
> bq. I think we can improve the situation here by tracking, per-FST instance, 
> the size increase we're seeing while building (or perhaps do a preliminary 
> pass before building) in order to decide whether to apply the encoding. 
> bq. we could also make the encoding a bit more efficient. For instance I 
> noticed that arc metadata is pretty large in some cases (in the 10-20 byte 
> range), which makes gaps very costly. Associating each label with a dense id and 
> having an intermediate lookup, i.e. lookup label -> id and then id -> arc offset 
> instead of doing label->arc directly could save a lot of space in some cases? 
> Also it seems that we are repeating the label in the arc metadata when 
> array-with-gaps is used, even though it shouldn't be necessary since the 
> label is implicit from the address?






[jira] [Commented] (LUCENE-8928) BKDWriter could make splitting decisions based on the actual range of values

2019-07-19 Thread Adrien Grand (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16888722#comment-16888722
 ] 

Adrien Grand commented on LUCENE-8928:
--

I played with this idea a bit at 
https://github.com/jpountz/lucene-solr/commit/16e6594af44b753c9ac498a063eb9b9d6102e020
 and 
https://github.com/mikemccand/luceneutil/blob/master/src/main/perf/IndexAndSearchOpenStreetMaps.java
 with shapes. It's a bit artificial since we are using shapes to index points, 
but nevertheless I got 62% slower indexing (130 seconds instead of 80) but 45% 
faster searching for box queries (63.0 QPS instead of 43.5).

> BKDWriter could make splitting decisions based on the actual range of values
> 
>
> Key: LUCENE-8928
> URL: https://issues.apache.org/jira/browse/LUCENE-8928
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
>
> Currently BKDWriter assumes that splitting on one dimension has no effect on 
> values in other dimensions. While this may be ok for geo points, this is 
> usually not true for ranges (or geo shapes, which are ranges too). Maybe we 
> could get better indexing by re-computing the range of values on each 
> dimension before making the choice of the split dimension?
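
A minimal sketch of the idea, assuming the split dimension is simply the one with the 
widest range of values. SplitDimensionSketch and widestDimension are hypothetical names 
and this is not BKDWriter's code; the point is only that the range is recomputed from 
the points actually present in a partition rather than inherited from the parent.

```java
// Hypothetical helper, not BKDWriter's actual logic.
final class SplitDimensionSketch {
  /** Returns the dimension with the widest actual value range in points[from, to). */
  static int widestDimension(int[][] points, int from, int to) {
    int dims = points[0].length;
    int best = 0;
    long bestRange = -1;
    for (int d = 0; d < dims; d++) {
      int min = Integer.MAX_VALUE;
      int max = Integer.MIN_VALUE;
      for (int i = from; i < to; i++) {
        min = Math.min(min, points[i][d]);
        max = Math.max(max, points[i][d]);
      }
      long range = (long) max - min;
      if (range > bestRange) { // ties keep the lower dimension
        bestRange = range;
        best = d;
      }
    }
    return best;
  }
}
```

For ranges encoded as (min, max) pairs the two dimensions are correlated, so after a 
split on the first dimension the second dimension's range in each half can be much 
narrower than in the parent, and the choice of the next split dimension may change.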






[jira] [Created] (LUCENE-8928) BKDWriter could make splitting decisions based on the actual range of values

2019-07-19 Thread Adrien Grand (JIRA)
Adrien Grand created LUCENE-8928:


 Summary: BKDWriter could make splitting decisions based on the 
actual range of values
 Key: LUCENE-8928
 URL: https://issues.apache.org/jira/browse/LUCENE-8928
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Adrien Grand


Currently BKDWriter assumes that splitting on one dimension has no effect on 
values in other dimensions. While this may be ok for geo points, this is 
usually not true for ranges (or geo shapes, which are ranges too). Maybe we 
could get better indexing by re-computing the range of values on each dimension 
before making the choice of the split dimension?






[jira] [Commented] (LUCENE-8924) Remove Fields Order Checks from CheckIndex?

2019-07-17 Thread Adrien Grand (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16887300#comment-16887300
 ] 

Adrien Grand commented on LUCENE-8924:
--

We rely on the order for merging, see "MultiFields".

> Remove Fields Order Checks from CheckIndex?
> ---
>
> Key: LUCENE-8924
> URL: https://issues.apache.org/jira/browse/LUCENE-8924
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Atri Sharma
>Priority: Major
>
> CheckIndex checks the order of fields read from the FieldsEnum for the 
> posting reader. However, we do not explicitly sort or use a sorted data 
> structure to represent keys (at least explicitly), and no FieldsEnum depends 
> on the order apart from MultiFieldsEnum, which no longer exists.
>  
> Should we remove the check?






[jira] [Commented] (LUCENE-8909) Deprecate getFieldNames from IndexWriter

2019-07-17 Thread Adrien Grand (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16887291#comment-16887291
 ] 

Adrien Grand commented on LUCENE-8909:
--

+1

> Deprecate getFieldNames from IndexWriter
> 
>
> Key: LUCENE-8909
> URL: https://issues.apache.org/jira/browse/LUCENE-8909
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Munendra S N
>Priority: Major
> Attachments: LUCENE-8909.patch
>
>
> From SOLR-12368
> {quote}Would be nice to be able to remove IndexWriter.getFieldNames as well, 
> which was added in LUCENE-7659 only for this workaround.{quote}
> Once the Solr task is resolved, deprecate {{IndexWriter#getFieldNames}} in 8x 
> and remove it from master.






[jira] [Commented] (LUCENE-8908) Specified default value not returned for query() when doc doesn't match

2019-07-17 Thread Adrien Grand (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16887290#comment-16887290
 ] 

Adrien Grand commented on LUCENE-8908:
--

+1

> Specified default value not returned for query() when doc doesn't match
> ---
>
> Key: LUCENE-8908
> URL: https://issues.apache.org/jira/browse/LUCENE-8908
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Bill Bell
>Priority: Major
> Attachments: LUCENE-8908.patch, SOLR-7845.patch, SOLR-7845.patch
>
>
> The 2 arg version of the "query()" was designed so that the second argument 
> would specify the value used for any document that does not match the query 
> specified by the first argument -- but the "exists" property of the resulting 
> ValueSource only takes into consideration whether or not the document matches 
> the query -- and ignores the use of the second argument.
> 
> The workaround is to ignore the 2 arg form of the query() function, and 
> instead wrap the query function in def().
> For example:  {{def(query($something), $defaultval)}} instead of 
> {{query($something, $defaultval)}}






[jira] [Commented] (LUCENE-8922) Speed up retrieval of top hits of DisjunctionMaxQuery

2019-07-17 Thread Adrien Grand (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16886944#comment-16886944
 ] 

Adrien Grand commented on LUCENE-8922:
--

Here is a patch. It uses the first clause that has a score greater than or 
equal to the minimum competitive score to lead iteration of impacts and 
propagates min competitive scores when the tie break multiplier is 0.

I ran wikibigall with the wikinightly tasks where I added 4 new tasks:
 - DisMaxHighMed: same as OrHighMed but with a DisjunctionMaxQuery and a tie 
break multiplier of 0.1
 - DisMaxHighHigh: same as OrHighHigh but with a DisjunctionMaxQuery and a tie 
break multiplier of 0.1
 - DisMax0HighMed: same as OrHighMed but with a DisjunctionMaxQuery and a tie 
break multiplier of 0
 - DisMax0HighHigh: same as OrHighHigh but with a DisjunctionMaxQuery and a tie 
break multiplier of 0

{noformat}
            Task  QPS baseline   StdDev  QPS patch   StdDev  Pct diff
          Fuzzy1        177.71  (11.7%)     174.01  (11.2%)    -2.1% ( -22% -   23%)
    SloppyPhrase          6.26   (6.1%)       6.23   (6.2%)    -0.4% ( -12% -   12%)
        SpanNear          2.32   (3.0%)       2.32   (3.4%)    -0.0% (  -6% -    6%)
IntervalsOrdered          0.85   (1.7%)       0.85   (1.8%)     0.0% (  -3% -    3%)
         Prefix3         47.79  (12.6%)      47.85  (12.7%)     0.1% ( -22% -   29%)
      OrHighHigh          9.87   (2.8%)       9.89   (2.8%)     0.2% (  -5% -    5%)
          Phrase         70.88   (3.2%)      71.04   (3.1%)     0.2% (  -5% -    6%)
        Wildcard        128.13   (8.6%)     128.43   (9.0%)     0.2% ( -16% -   19%)
      AndHighMed         65.61   (3.5%)      65.85   (2.9%)     0.4% (  -5% -    6%)
     AndHighHigh         36.41   (3.4%)      36.60   (3.1%)     0.5% (  -5% -    7%)
 AndHighOrMedMed         25.99   (2.0%)      26.13   (1.8%)     0.5% (  -3% -    4%)
       OrHighMed         36.42   (2.7%)      36.61   (2.6%)     0.5% (  -4% -    5%)
          Fuzzy2         92.96  (16.1%)      93.59  (13.7%)     0.7% ( -25% -   36%)
          IntNRQ        132.08  (37.3%)     133.02  (38.0%)     0.7% ( -54% -  121%)
AndMedOrHighHigh         26.80   (2.0%)      27.07   (2.1%)     1.0% (  -3% -    5%)
            Term       1308.93   (3.6%)    1331.58   (3.7%)     1.7% (  -5% -    9%)
   DisMaxHighMed         83.40   (3.1%)     111.26   (3.0%)    33.4% (  26% -   40%)
  DisMaxHighHigh         54.28   (4.8%)      81.35   (4.1%)    49.9% (  39% -   61%)
 DisMax0HighHigh         45.39   (5.7%)     217.70  (20.1%)   379.6% ( 334% -  430%)
  DisMax0HighMed        129.09   (3.9%)     905.16  (16.5%)   601.2% ( 558% -  646%)
{noformat}

> Speed up retrieval of top hits of DisjunctionMaxQuery
> -
>
> Key: LUCENE-8922
> URL: https://issues.apache.org/jira/browse/LUCENE-8922
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> There is a simple optimization that we are not doing in the case that 
> tieBreakMultiplier is 0: we could propagate the min competitive score to sub 
> clauses as-is.
> Even in the general case, we currently compute the block boundary of the 
> DisjunctionMaxQuery as the minimum of the block boundaries of its sub 
> clauses. This generates blocks that have very low score upper bounds but 
> unfortunately they are also very small, which means that we might sometimes 
> not make progress quickly enough.
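
Why a tie break multiplier of 0 allows propagating the minimum competitive score 
unchanged can be seen from the usual dismax formula, score = max + tieBreak * (sum - max). 
The sketch below is standalone Java written for this note (DisMaxScoreSketch is a 
made-up name, not a Lucene class).

```java
// Standalone illustration of the dismax score formula.
final class DisMaxScoreSketch {
  /** score = max + tieBreakMultiplier * (sum - max). */
  static float score(float[] subScores, float tieBreakMultiplier) {
    float max = 0f;
    float sum = 0f;
    for (float s : subScores) {
      max = Math.max(max, s);
      sum += s;
    }
    return max + (sum - max) * tieBreakMultiplier;
  }

  /** True if at least one clause reaches the minimum competitive score on its own. */
  static boolean anyClauseCompetitive(float[] subScores, float minCompetitiveScore) {
    for (float s : subScores) {
      if (s >= minCompetitiveScore) {
        return true;
      }
    }
    return false;
  }
}
```

With a multiplier of 0 the document score equals the best clause score, so a document 
is competitive exactly when some clause is competitive by itself, and each clause can 
safely skip anything below the threshold. With a non-zero multiplier that no longer 
holds: sub scores 2 and 3 with a multiplier of 0.5 give 4, which beats a threshold of 
3.5 even though neither clause does.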






[jira] [Commented] (LUCENE-8923) Release procedure does not add new version in CHANGES.txt in master

2019-07-17 Thread Adrien Grand (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16886941#comment-16886941
 ] 

Adrien Grand commented on LUCENE-8923:
--

+1 Even if some changes are missing I think we'd benefit from pushing this 
rather soon so that developers don't automatically add their changes to 8.2 as 
the last minor.

> Release procedure does not add new version in CHANGES.txt in master
> ---
>
> Key: LUCENE-8923
> URL: https://issues.apache.org/jira/browse/LUCENE-8923
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Ignacio Vera
>Priority: Minor
> Attachments: LUCENE-8923.patch
>
>
> This issue is just to track something that may be missing in the release 
> procedure. It currently adds a new version to CHANGES.txt in the minor 
> version branch but it does not do it in master.






[jira] [Created] (LUCENE-8922) Speed up retrieval of top hits of DisjunctionMaxQuery

2019-07-17 Thread Adrien Grand (JIRA)
Adrien Grand created LUCENE-8922:


 Summary: Speed up retrieval of top hits of DisjunctionMaxQuery
 Key: LUCENE-8922
 URL: https://issues.apache.org/jira/browse/LUCENE-8922
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Adrien Grand


There is a simple optimization that we are not doing in the case that 
tieBreakMultiplier is 0: we could propagate the min competitive score to sub 
clauses as-is.

Even in the general case, we currently compute the block boundary of the 
DisjunctionMaxQuery as the minimum of the block boundaries of its sub clauses. 
This generates blocks that have very low score upper bounds but unfortunately 
they are also very small, which means that we might sometimes not make progress 
quickly enough.






[jira] [Commented] (LUCENE-8883) CHANGES.txt: Auto add issue categories on new releases

2019-07-16 Thread Adrien Grand (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16886114#comment-16886114
 ] 

Adrien Grand commented on LUCENE-8883:
--

I have a slight preference for having "Optimizations" as one category.

> CHANGES.txt: Auto add issue categories on new releases
> --
>
> Key: LUCENE-8883
> URL: https://issues.apache.org/jira/browse/LUCENE-8883
> Project: Lucene - Core
>  Issue Type: Task
>  Components: general/build
>Reporter: David Smiley
>Assignee: David Smiley
>Priority: Minor
> Attachments: LUCENE-8883.patch, LUCENE-8883.patch
>
>
> As I write this, looking at Solr's CHANGES.txt for 8.2 I see we have some 
> sections: "Upgrade Notes", "New Features", "Bug Fixes", and "Other Changes".  
> There is no "Improvements" so no surprise here, the New Features category 
> has issues that ought to be listed as such.  I think the order varies as well.  
> I propose that on new releases, the initial state of the next release in 
> CHANGES.txt have these sections.  They can easily be removed at the upcoming 
> release if there are no such sections, or they could stay as empty.  It seems 
> addVersion.py is the code that sets this up and it could be enhanced.






[jira] [Commented] (LUCENE-8884) Add Directory wrapper to track per-query IO counters

2019-07-16 Thread Adrien Grand (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16886043#comment-16886043
 ] 

Adrien Grand commented on LUCENE-8884:
--

I'm not seeing any attachment on this JIRA, did you forget to attach a patch?

> Add Directory wrapper to track per-query IO counters
> 
>
> Key: LUCENE-8884
> URL: https://issues.apache.org/jira/browse/LUCENE-8884
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/store
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>Priority: Minor
>
> Lucene's IO abstractions ({{Directory, IndexInput/Output}}) make it really 
> easy to track counters of how many IOPs and net bytes are read for each 
> query, which is a useful metric to track/aggregate/alarm on in production or 
> dev benchmarks.
> At my day job we use these wrappers in our nightly benchmarks to catch any 
> accidental performance regressions.






[jira] [Resolved] (LUCENE-8810) Flattening of nested disjunctions does not take into account number of clause limitation of builder

2019-07-15 Thread Adrien Grand (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand resolved LUCENE-8810.
--
   Resolution: Fixed
Fix Version/s: (was: 8.1.1)
   8.2

> Flattening of nested disjunctions does not take into account number of clause 
> limitation of builder
> ---
>
> Key: LUCENE-8810
> URL: https://issues.apache.org/jira/browse/LUCENE-8810
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/search
>Affects Versions: 8.0
>Reporter: Mickaël Sauvée
>Priority: Minor
> Fix For: 8.2
>
> Attachments: LUCENE-8810.patch
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> In org.apache.lucene.search.BooleanQuery, at the end of the function 
> rewrite(IndexReader reader), the query is rewritten to flatten nested 
> disjunctions.
> This does not take into account the limitation on the number of clauses in a 
> builder (1024).
>  In some circumstances, this limit can be reached, hence an exception is 
> thrown.
> Here is a unit test that highlights this.
> {code:java}
>   public void testFlattenInnerDisjunctionsWithMoreThan1024Terms() throws IOException {
>     IndexSearcher searcher = newSearcher(new MultiReader());
>     BooleanQuery.Builder builder1024 = new BooleanQuery.Builder();
>     for (int i = 0; i < 1024; i++) {
>       builder1024.add(new TermQuery(new Term("foo", "bar-" + i)), Occur.SHOULD);
>     }
>     Query inner = builder1024.build();
>     Query query = new BooleanQuery.Builder()
>         .add(inner, Occur.SHOULD)
>         .add(new TermQuery(new Term("foo", "baz")), Occur.SHOULD)
>         .build();
>     searcher.rewrite(query);
>   }
> {code}






[jira] [Created] (LUCENE-8917) Remove the "Direct" doc-value format

2019-07-15 Thread Adrien Grand (JIRA)
Adrien Grand created LUCENE-8917:


 Summary: Remove the "Direct" doc-value format
 Key: LUCENE-8917
 URL: https://issues.apache.org/jira/browse/LUCENE-8917
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Adrien Grand


This is the last user of the Legacy*DocValues APIs. Another option would be to 
move this format to doc-value iterators, but I don't think it's worth the 
effort: let's just remove it in Lucene 9?






[jira] [Commented] (LUCENE-8811) Add maximum clause count check to IndexSearcher rather than BooleanQuery

2019-07-15 Thread Adrien Grand (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16884944#comment-16884944
 ] 

Adrien Grand commented on LUCENE-8811:
--

Thanks! I'll revert this change from 8.x and 8.2 in the meantime.

> Add maximum clause count check to IndexSearcher rather than BooleanQuery
> 
>
> Key: LUCENE-8811
> URL: https://issues.apache.org/jira/browse/LUCENE-8811
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Assignee: Alan Woodward
>Priority: Minor
> Fix For: 8.2
>
> Attachments: LUCENE-8811.patch, LUCENE-8811.patch, LUCENE-8811.patch, 
> LUCENE-8811.patch, LUCENE-8811.patch, LUCENE-8811.patch
>
>
> Currently we only check whether boolean queries have too many clauses. 
> However there are other ways that queries may have too many clauses, for 
> instance if you have boolean queries that have themselves inner boolean 
> queries.
> Could we use the new Query visitor API to move this check from BooleanQuery 
> to IndexSearcher in order to make this check more consistent across queries? 
> See for instance LUCENE-8810 where a rewrite rule caused the maximum clause 
> count to be hit even though the total number of leaf queries remained the 
> same.






[jira] [Commented] (LUCENE-8811) Add maximum clause count check to IndexSearcher rather than BooleanQuery

2019-07-15 Thread Adrien Grand (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16884932#comment-16884932
 ] 

Adrien Grand commented on LUCENE-8811:
--

[~atris] If the patch you are thinking of is the one on LUCENE-8810, I was 
thinking of something even simpler that would catch the TooManyClauses 
exception when trying to flatten the query.

> Add maximum clause count check to IndexSearcher rather than BooleanQuery
> 
>
> Key: LUCENE-8811
> URL: https://issues.apache.org/jira/browse/LUCENE-8811
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Assignee: Alan Woodward
>Priority: Minor
> Fix For: 8.2
>
> Attachments: LUCENE-8811.patch, LUCENE-8811.patch, LUCENE-8811.patch, 
> LUCENE-8811.patch, LUCENE-8811.patch, LUCENE-8811.patch
>
>
> Currently we only check whether boolean queries have too many clauses. 
> However there are other ways that queries may have too many clauses, for 
> instance if you have boolean queries that have themselves inner boolean 
> queries.
> Could we use the new Query visitor API to move this check from BooleanQuery 
> to IndexSearcher in order to make this check more consistent across queries? 
> See for instance LUCENE-8810 where a rewrite rule caused the maximum clause 
> count to be hit even though the total number of leaf queries remained the 
> same.






[jira] [Commented] (LUCENE-8811) Add maximum clause count check to IndexSearcher rather than BooleanQuery

2019-07-15 Thread Adrien Grand (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16884904#comment-16884904
 ] 

Adrien Grand commented on LUCENE-8811:
--

I was reviewing the changelog for 8.2, this change looks a bit too breaking for 
a minor and should probably wait for 9.0? We can separately address LUCENE-8810 
by disabling the flattening of disjunctions if the new BooleanQuery would have 
more than 1024 clauses?
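
A minimal sketch of such a guard, written against hypothetical stand-in types rather than Lucene's own classes (`Node`, `Leaf`, `Disjunction`, `FlattenGuard` and `MAX_CLAUSES` are all assumptions made for illustration): nested disjunctions are inlined into their parent only while the resulting clause list stays within the limit, otherwise the query is left untouched.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for queries; these are NOT Lucene classes.
interface Node {}

final class Leaf implements Node {
  final String term;

  Leaf(String term) {
    this.term = term;
  }
}

final class Disjunction implements Node {
  final List<Node> clauses;

  Disjunction(List<Node> clauses) {
    this.clauses = clauses;
  }
}

final class FlattenGuard {
  // Assumed limit, mirroring the 1024 figure mentioned in the comment.
  static final int MAX_CLAUSES = 1024;

  // Inlines the clauses of directly nested disjunctions into the parent.
  // Returns the original query unchanged if the flattened clause list
  // would exceed MAX_CLAUSES, so the rewrite is skipped instead of failing.
  static Disjunction flatten(Disjunction query) {
    List<Node> flattened = new ArrayList<>();
    for (Node clause : query.clauses) {
      if (clause instanceof Disjunction) {
        flattened.addAll(((Disjunction) clause).clauses);
      } else {
        flattened.add(clause);
      }
      if (flattened.size() > MAX_CLAUSES) {
        return query;
      }
    }
    return new Disjunction(flattened);
  }
}
```

With an inner disjunction of 1024 leaves plus one sibling leaf, `flatten` hands back the same object it was given, so no TooManyClauses-style failure is triggered by the rewrite itself.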

> Add maximum clause count check to IndexSearcher rather than BooleanQuery
> 
>
> Key: LUCENE-8811
> URL: https://issues.apache.org/jira/browse/LUCENE-8811
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Assignee: Alan Woodward
>Priority: Minor
> Fix For: 8.2
>
> Attachments: LUCENE-8811.patch, LUCENE-8811.patch, LUCENE-8811.patch, 
> LUCENE-8811.patch, LUCENE-8811.patch, LUCENE-8811.patch
>
>
> Currently we only check whether boolean queries have too many clauses. 
> However there are other ways that queries may have too many clauses, for 
> instance if you have boolean queries that have themselves inner boolean 
> queries.
> Could we use the new Query visitor API to move this check from BooleanQuery 
> to IndexSearcher in order to make this check more consistent across queries? 
> See for instance LUCENE-8810 where a rewrite rule caused the maximum clause 
> count to be hit even though the total number of leaf queries remained the 
> same.






[jira] [Resolved] (LUCENE-8311) Leverage impacts for phrase queries

2019-07-11 Thread Adrien Grand (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand resolved LUCENE-8311.
--
   Resolution: Fixed
Fix Version/s: 8.2

> Leverage impacts for phrase queries
> ---
>
> Key: LUCENE-8311
> URL: https://issues.apache.org/jira/browse/LUCENE-8311
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
> Fix For: 8.2
>
> Attachments: LUCENE-8311.patch
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Now that we expose raw impacts, we could leverage them for phrase queries.
> For instance for exact phrases, we could take the minimum term frequency for 
> each unique norm value in order to get upper bounds of the score for the 
> phrase.






[jira] [Commented] (LUCENE-8909) Deprecate getFieldNames from IndexWriter

2019-07-10 Thread Adrien Grand (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16882325#comment-16882325
 ] 

Adrien Grand commented on LUCENE-8909:
--

+1

> Deprecate getFieldNames from IndexWriter
> 
>
> Key: LUCENE-8909
> URL: https://issues.apache.org/jira/browse/LUCENE-8909
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Munendra S N
>Priority: Major
>
> From SOLR-12368
> {quote}Would be nice to be able to remove IndexWriter.getFieldNames as well, 
> which was added in LUCENE-7659 only for this workaround.{quote}
> Once the Solr task is resolved, deprecate {{IndexWriter#getFieldNames}} on 8.x and 
> remove it from master.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Resolved] (LUCENE-8875) Should TopScoreDocCollector Always Populate Sentinel Values?

2019-07-10 Thread Adrien Grand (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand resolved LUCENE-8875.
--
   Resolution: Fixed
Fix Version/s: 8.2

> Should TopScoreDocCollector Always Populate Sentinel Values?
> 
>
> Key: LUCENE-8875
> URL: https://issues.apache.org/jira/browse/LUCENE-8875
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Atri Sharma
>Priority: Major
> Fix For: 8.2
>
>  Time Spent: 9h
>  Remaining Estimate: 0h
>
> TopScoreDocCollector always initializes HitQueue as the PQ implementation, 
> and instructs HitQueue to populate with sentinels. While this is a great 
> safety mechanism, for very large datasets where the query's selectivity is 
> high, the sentinel population can be redundant and can become a large enough 
> bottleneck in itself. Does it make sense to introduce a new parameter in 
> TopScoreDocCollector which uses a heuristic (say number of hits > 10k) and 
> does not populate sentinels?






[jira] [Commented] (LUCENE-8907) Provide backward compatibility for loading analysis factories

2019-07-10 Thread Adrien Grand (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16882074#comment-16882074
 ] 

Adrien Grand commented on LUCENE-8907:
--

I have a slight preference for reverting on 8.x and having this change only in 
9.0. My worry is that the backward compatibility layer would either need to 
introduce leniency (factories without a NAME) or stronger checks (same NAME and 
class name), and it could end up causing as much trouble as the original change.

[~thetaphi] What do you think?

> Provide backward compatibility for loading analysis factories
> -
>
> Key: LUCENE-8907
> URL: https://issues.apache.org/jira/browse/LUCENE-8907
> Project: Lucene - Core
>  Issue Type: Task
>  Components: modules/analysis
>Reporter: Tomoko Uchida
>Assignee: Tomoko Uchida
>Priority: Blocker
>
> The changes in LUCENE-8778 introduce breaking changes to the analysis factory 
> interface, and custom factories implemented by users / 3rd parties will be 
> affected. We need to keep some backwards compatibility during 8.x.
> Please see the discussion in SOLR-13593 for details.






[jira] [Commented] (LUCENE-8907) Provide backward compatibility for loading analysis factories

2019-07-10 Thread Adrien Grand (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16881982#comment-16881982
 ] 

Adrien Grand commented on LUCENE-8907:
--

I don't like 2, it feels too breaking to me for a minor release. I like 1 
better but then I think we should also fail if an analysis factory has a NAME 
constant that is not the same as the name that would be derived from the class 
name?
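
To make the check concrete, here is a rough sketch using plain reflection. It assumes the legacy name is the simple class name minus a known suffix, lower-cased; `FactoryNameCheck`, its methods, and the three sample classes are hypothetical and not part of Lucene.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.Locale;

final class FactoryNameCheck {
  // Assumed convention: legacy name = simple class name minus suffix, lower-cased.
  static String deriveName(Class<?> clazz, String suffix) {
    String simple = clazz.getSimpleName();
    if (!simple.endsWith(suffix)) {
      throw new IllegalArgumentException(simple + " does not end with " + suffix);
    }
    return simple.substring(0, simple.length() - suffix.length()).toLowerCase(Locale.ROOT);
  }

  // Returns the value of a public static String NAME field, or null if there is none.
  static String declaredName(Class<?> clazz) {
    try {
      Field field = clazz.getField("NAME");
      if (Modifier.isStatic(field.getModifiers()) && field.getType() == String.class) {
        return (String) field.get(null);
      }
      return null;
    } catch (NoSuchFieldException | IllegalAccessException e) {
      return null;
    }
  }

  // Lenient when NAME is missing, strict when NAME disagrees with the class name.
  static String resolveName(Class<?> clazz, String suffix) {
    String derived = deriveName(clazz, suffix);
    String declared = declaredName(clazz);
    if (declared != null && !declared.equals(derived)) {
      throw new IllegalStateException(
          "NAME '" + declared + "' does not match derived name '" + derived + "'");
    }
    return derived;
  }
}

// Hypothetical factories used only to exercise the check.
class SampleFilterFactory {
  public static final String NAME = "sample";
}

class LegacyFilterFactory {}

class MismatchedFilterFactory {
  public static final String NAME = "other";
}
```

Here `SampleFilterFactory` and `LegacyFilterFactory` resolve to "sample" and "legacy", while `MismatchedFilterFactory` is rejected because its NAME disagrees with its class name.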

Another option could be to revert from 8.x. In any case we should add migration 
instructions to lucene/MIGRATE.txt on master.

bq. some warning messages would be helpful

We never log from Lucene since this is a library.

> Provide backward compatibility for loading analysis factories
> -
>
> Key: LUCENE-8907
> URL: https://issues.apache.org/jira/browse/LUCENE-8907
> Project: Lucene - Core
>  Issue Type: Task
>  Components: modules/analysis
>Reporter: Tomoko Uchida
>Priority: Blocker
>
> The changes in LUCENE-8778 introduce breaking changes to the analysis factory 
> interface, and custom factories implemented by users / 3rd parties will be 
> affected. We need to keep some backwards compatibility during 8.x.
> Please see the discussion in SOLR-13593 for details.






[jira] [Commented] (LUCENE-8311) Leverage impacts for phrase queries

2019-07-10 Thread Adrien Grand (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16881967#comment-16881967
 ] 

Adrien Grand commented on LUCENE-8311:
--

This made exact phrase queries 3x faster in the nightly benchmarks 
http://people.apache.org/~mikemccand/lucenebench/Phrase.html and term queries 
about 10% slower http://people.apache.org/~mikemccand/lucenebench/Term.html.

> Leverage impacts for phrase queries
> ---
>
> Key: LUCENE-8311
> URL: https://issues.apache.org/jira/browse/LUCENE-8311
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
> Attachments: LUCENE-8311.patch
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Now that we expose raw impacts, we could leverage them for phrase queries.
> For instance for exact phrases, we could take the minimum term frequency for 
> each unique norm value in order to get upper bounds of the score for the 
> phrase.






[jira] [Commented] (LUCENE-8883) CHANGES.txt: Auto add issue categories on new releases

2019-07-09 Thread Adrien Grand (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16881541#comment-16881541
 ] 

Adrien Grand commented on LUCENE-8883:
--

I did grep "^[A-Z]" CHANGES.txt | sort | uniq -c | sort -nr.

> CHANGES.txt: Auto add issue categories on new releases
> --
>
> Key: LUCENE-8883
> URL: https://issues.apache.org/jira/browse/LUCENE-8883
> Project: Lucene - Core
>  Issue Type: Task
>  Components: general/build
>Reporter: David Smiley
>Assignee: David Smiley
>Priority: Minor
> Attachments: LUCENE-8883.patch
>
>
> As I write this, looking at Solr's CHANGES.txt for 8.2 I see we have some 
> sections: "Upgrade Notes", "New Features", "Bug Fixes", and "Other Changes".  
> There is no "Improvements" so no surprise here, the New Features category 
> has issues that ought to be listed as such.  I think the order varies as well.  
> I propose that on new releases, the initial state of the next release in 
> CHANGES.txt have these sections.  They can easily be removed at the upcoming 
> release if there are no such sections, or they could stay as empty.  It seems 
> addVersion.py is the code that sets this up and it could be enhanced.






[jira] [Commented] (LUCENE-8069) Allow index sorting by field length

2019-07-09 Thread Adrien Grand (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16881491#comment-16881491
 ] 

Adrien Grand commented on LUCENE-8069:
--

I didn't measure the indexing rate, I can do that next. Yes, I hacked a way to 
sort by norm field indeed. The solution that you proposed would likely yield 
similar benefits.

bq. is luceneutil assuming the search query doesn't want the number of total 
hits

Yes, like in nightly benchmarks.

bq. Yet this is not how most people use Lucene...

There are many use-cases for Lucene, but getting top full-text hits by score is 
a pretty common one and it typically doesn't require computing hit counts?

> Allow index sorting by field length
> ---
>
> Key: LUCENE-8069
> URL: https://issues.apache.org/jira/browse/LUCENE-8069
> Project: Lucene - Core
>  Issue Type: Wish
>Reporter: Adrien Grand
>Priority: Minor
>
> Short documents are more likely to get higher scores, so sorting an index by 
> field length would mean we would be likely to collect best matches first. 
> Depending on the similarity implementation, this might even allow early 
> termination of the collection of top documents on term queries.






[jira] [Resolved] (LUCENE-8900) Simplify MultiSorter

2019-07-09 Thread Adrien Grand (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand resolved LUCENE-8900.
--
   Resolution: Fixed
Fix Version/s: 8.2

Thanks [~danmuzi].

> Simplify MultiSorter
> 
>
> Key: LUCENE-8900
> URL: https://issues.apache.org/jira/browse/LUCENE-8900
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
> Fix For: 8.2
>
> Attachments: LUCENE-8900.patch
>
>







[jira] [Commented] (LUCENE-8883) CHANGES.txt: Auto add issue categories on new releases

2019-07-09 Thread Adrien Grand (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16880981#comment-16880981
 ] 

Adrien Grand commented on LUCENE-8883:
--

I just looked at the section names that we used at least 10 times in the 
changelog:

{noformat}
 55 API Changes
 53 Bug Fixes
 52 Optimizations
 46 Build
 41 Bug fixes
 37 New Features
 25 Documentation
 24 Other
 21 Changes in Runtime Behavior
 19 Changes in backwards compatibility policy
 15 New features
 15 Improvements
 14 Changes in runtime behavior
 10 Tests
{noformat}

Maybe your patch should rename "Other Changes" to "Other" which seems to be 
what we have used historically, and maybe also add "API Changes" and 
"Optimizations", which seem pretty popular?

Maybe we could specialize bugfix versions and only introduce a "Bug Fixes" 
section in that case?

bq. Also I didn't add "(No changes)"; seems needless / self-evident.

I think it helps clarify since it is very uncommon to release software without 
any changes. We could do it only for new bugfix releases if you think that 
helps since I think those are the only ones that we ever released without new 
changes.

> CHANGES.txt: Auto add issue categories on new releases
> --
>
> Key: LUCENE-8883
> URL: https://issues.apache.org/jira/browse/LUCENE-8883
> Project: Lucene - Core
>  Issue Type: Task
>  Components: general/build
>Reporter: David Smiley
>Assignee: David Smiley
>Priority: Minor
> Attachments: LUCENE-8883.patch
>
>
> As I write this, looking at Solr's CHANGES.txt for 8.2 I see we have some 
> sections: "Upgrade Notes", "New Features", "Bug Fixes", and "Other Changes".  
> There is no "Improvements" so no surprise here, the New Features category 
> has issues that ought to be listed as such.  I think the order varies as well.  
> I propose that on new releases, the initial state of the next release in 
> CHANGES.txt have these sections.  They can easily be removed at the upcoming 
> release if there are no such sections, or they could stay as empty.  It seems 
> addVersion.py is the code that sets this up and it could be enhanced.






[jira] [Commented] (LUCENE-4312) Index format to store position length per position

2019-07-08 Thread Adrien Grand (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-4312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16880606#comment-16880606
 ] 

Adrien Grand commented on LUCENE-4312:
--

bq. the complexity of query execution would be driven by what's actually in the 
index

I don't think this is true.

For instance an exact phrase query trying to match "A B C" that is currently 
positioned on A (position=3, length=1), B (position=4, length=1), C 
(position=6, length=1) would need to advance B to the next position in case 
there is another match on position 4 that has a length of 2. And then we should 
advance C first because maybe it also has another match on position 4 
of a different length.

Also we can't advance positions on terms in the order we want anymore. Today we 
use the rarer term to lead the iteration of positions. If we had position 
lengths in the index we would need to advance positions in the order in which 
terms occur in the phrase query since the start position that B must have 
depends on the length of A on the current position: position starts are 
guaranteed to come in order in the index but position ends are not (at least we 
don't enforce it in token streams today).
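
The backtracking problem can be illustrated with a toy model. This is only a sketch under assumed types (`PositionLengthPhrase` and `Occ` are hypothetical, and real postings are iterated lazily rather than held in lists): each term must start exactly where the previous one ends, so every occurrence at a given position, whatever its length, has to be tried.

```java
import java.util.List;

final class PositionLengthPhrase {
  // One occurrence of a term in a document: start position and position length.
  static final class Occ {
    final int position;
    final int length;

    Occ(int position, int length) {
      this.position = position;
      this.length = length;
    }
  }

  // terms.get(i) holds all occurrences of the i-th phrase term in one document.
  // Returns true if the terms can be chained so that each term starts where
  // the previous one ends.
  static boolean matches(List<List<Occ>> terms) {
    if (terms.isEmpty()) {
      return false;
    }
    for (Occ first : terms.get(0)) {
      if (matchesFrom(terms, 1, first.position + first.length)) {
        return true;
      }
    }
    return false;
  }

  // Tries every occurrence of the current term that starts at expectedStart;
  // several occurrences may share a start position but differ in length,
  // which is what forces the backtracking.
  private static boolean matchesFrom(List<List<Occ>> terms, int termIndex, int expectedStart) {
    if (termIndex == terms.size()) {
      return true;
    }
    for (Occ occ : terms.get(termIndex)) {
      if (occ.position == expectedStart
          && matchesFrom(terms, termIndex + 1, occ.position + occ.length)) {
        return true;
      }
    }
    return false;
  }
}
```

With A at (position=3, length=1) and C at (position=6, length=1), the phrase only matches if B has an occurrence at position 4 with length 2; the occurrence with length 1 leads to a dead end that has to be abandoned.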

> Index format to store position length per position
> --
>
> Key: LUCENE-4312
> URL: https://issues.apache.org/jira/browse/LUCENE-4312
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Affects Versions: 6.0
>Reporter: Gang Luo
>Priority: Minor
>  Labels: Suggestion
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> Mike McCandless said: TokenStreams are actually graphs.
> The indexer ignores PositionLengthAttribute. We need to change the index format (and 
> Codec APIs) to store an additional int position length per position.






[jira] [Commented] (LUCENE-4312) Index format to store position length per position

2019-07-08 Thread Adrien Grand (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-4312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16880422#comment-16880422
 ] 

Adrien Grand commented on LUCENE-4312:
--

Recording position lengths in the index is the easy part of the problem in my 
opinion. I'm concerned that this will introduce significant complexity to 
phrase queries (they will require backtracking in order to deal with the case 
that a term exists twice at the same position with different position lengths), 
and even make sloppy phrase queries and their spans/intervals counterparts 
meaningless (as terms could be very distant according to the index only because 
there is one term in-between that has a multi-term synonym indexed). 

> Index format to store position length per position
> --
>
> Key: LUCENE-4312
> URL: https://issues.apache.org/jira/browse/LUCENE-4312
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Affects Versions: 6.0
>Reporter: Gang Luo
>Priority: Minor
>  Labels: Suggestion
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> Mike McCandless said: TokenStreams are actually graphs.
> The indexer ignores PositionLengthAttribute. We need to change the index format (and 
> Codec APIs) to store an additional int position length per position.






[jira] [Commented] (LUCENE-8860) LatLonShapeBoundingBoxQuery could make more decisions on inner nodes

2019-07-05 Thread Adrien Grand (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16879116#comment-16879116
 ] 

Adrien Grand commented on LUCENE-8860:
--

I made the issue about box queries, but that would actually work for polygons 
too.

> LatLonShapeBoundingBoxQuery could make more decisions on inner nodes
> 
>
> Key: LUCENE-8860
> URL: https://issues.apache.org/jira/browse/LUCENE-8860
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
>
> Currently LatLonShapeBoundingBoxQuery with the INTERSECTS relation only 
> returns CELL_INSIDE_QUERY if the query contains ALL minimum bounding 
> rectangles of the indexed triangles.
> I think we could return CELL_INSIDE_QUERY if the box contains either of the 
> edges of all MBRs of indexed triangles since triangles are guaranteed to 
> touch all edges of their MBR by definition. In some cases this would help 
> save decoding triangles and running costly point-in-triangle computations.






[jira] [Commented] (SOLR-12368) in-place DV updates should no longer have to jump through hoops if field does not yet exist

2019-07-04 Thread Adrien Grand (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-12368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16878866#comment-16878866
 ] 

Adrien Grand commented on SOLR-12368:
-

I'm not familiar enough with how Solr routes updates to doc values or stored 
fields, but indeed we don't need to avoid updates on fields that don't exist 
anymore. Thanks for cleaning this up! Can you mark IndexWriter#getFieldNames as 
deprecated on 8.x instead of removing it (removal is the right thing to do on 
master)?

> in-place DV updates should no longer have to jump through hoops if field does 
> not yet exist
> ---
>
> Key: SOLR-12368
> URL: https://issues.apache.org/jira/browse/SOLR-12368
> Project: Solr
>  Issue Type: Improvement
>Reporter: Hoss Man
>Priority: Major
> Attachments: SOLR-12368.patch, SOLR-12368.patch, SOLR-12368.patch
>
>
> When SOLR-5944 first added "in-place" DocValue updates to Solr, one of the 
> edge cases that had to be dealt with was the limitation imposed by 
> IndexWriter that docValues could only be updated if they already existed - if 
> a shard did not yet have a document w/a value in the field where the update 
> was attempted, we would get an error.
> LUCENE-8316 seems to have removed this error, which I believe means we can 
> simplify & speed up some of the checks in Solr, and support this situation as 
> well, rather than falling back on full "read stored fields & reindex" atomic 
> update






[jira] [Commented] (LUCENE-8311) Leverage impacts for phrase queries

2019-07-03 Thread Adrien Grand (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16877919#comment-16877919
 ] 

Adrien Grand commented on LUCENE-8311:
--

I opened https://github.com/apache/lucene-solr/pull/760. Performance is a bit 
better than what we had before:

{noformat}
                    Task    QPS baseline  StdDev   QPS patch  StdDev                Pct diff
                HighTerm         1395.12  (5.1%)     1230.78  (4.3%)  -11.8% ( -20% -   -2%)
                 MedTerm         2352.56  (4.7%)     2170.42  (3.9%)   -7.7% ( -15% -    0%)
             LowSpanNear           13.70  (7.0%)       12.67  (4.9%)   -7.5% ( -18% -    4%)
            HighSpanNear            5.69  (5.3%)        5.31  (3.2%)   -6.5% ( -14% -    2%)
             MedSpanNear           23.33  (4.2%)       21.97  (2.4%)   -5.8% ( -11% -    0%)
              AndHighMed          114.70  (2.9%)      109.40  (4.1%)   -4.6% ( -11% -    2%)
             AndHighHigh           35.08  (3.2%)       33.51  (4.1%)   -4.5% ( -11% -    2%)
                 LowTerm         3014.11  (4.7%)     2893.44  (4.7%)   -4.0% ( -12% -    5%)
               OrHighMed           60.26  (2.5%)       57.96  (2.1%)   -3.8% (  -8% -    0%)
              OrHighHigh           15.45  (2.5%)       14.87  (2.3%)   -3.8% (  -8% -    1%)
               LowPhrase           25.81  (3.4%)       24.89  (2.8%)   -3.6% (  -9% -    2%)
        HighSloppyPhrase            7.44  (6.3%)        7.20  (5.7%)   -3.3% ( -14% -    9%)
         MedSloppyPhrase           12.76  (5.1%)       12.51  (4.6%)   -1.9% ( -10% -    8%)
         LowSloppyPhrase           34.24  (4.1%)       33.59  (3.8%)   -1.9% (  -9% -    6%)
       HighTermMonthSort           70.86 (10.9%)       69.98 (10.7%)   -1.2% ( -20% -   22%)
                  Fuzzy1          211.28  (3.5%)      208.86  (2.2%)   -1.1% (  -6% -    4%)
                  Fuzzy2          180.97  (4.4%)      179.47  (2.6%)   -0.8% (  -7% -    6%)
               OrHighLow          467.25  (2.9%)      467.94  (2.0%)    0.1% (  -4% -    5%)
                 Prefix3           91.35  (8.1%)       91.52  (7.2%)    0.2% ( -14% -   16%)
   HighTermDayOfYearSort           62.77  (6.9%)       62.96  (7.5%)    0.3% ( -13% -   15%)
                Wildcard          129.49  (4.3%)      129.99  (2.8%)    0.4% (  -6% -    7%)
                 Respell          210.68  (1.9%)      211.58  (2.4%)    0.4% (  -3% -    4%)
              AndHighLow          541.64  (3.1%)      544.44  (3.2%)    0.5% (  -5% -    7%)
                  IntNRQ          148.56  (8.3%)      149.44 (10.4%)    0.6% ( -16% -   21%)
              HighPhrase           10.86  (9.0%)       13.92 (15.2%)   28.2% (   3% -   57%)
               MedPhrase           62.22  (2.1%)       97.61  (4.6%)   56.9% (  49% -   64%)
{noformat}

But there is a lot of variance across runs because it depends a lot on which 
query gets picked up. For instance on another run I got

{noformat}
               LowPhrase           39.39  (1.9%)       51.21  (2.2%)   30.0% (  25% -   34%)
              HighPhrase           13.09  (3.2%)      192.76 (26.8%) 1372.5% (1301% - 1448%)
{noformat}

In spite of some queries that get slightly slower, I think we should merge this 
since we need phrases to expose good impacts if we want to give boolean queries 
a chance to speed up queries that include phrases. Term queries appear to be a 
bit slower, I'm assuming that this is due to the fact that the JVM cannot do as 
much inlining as before since we are starting to use classes for phrases that 
were only used for term queries before.
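
For readers unfamiliar with the idea, the per-norm upper bound for exact phrases can be sketched as follows. This is a deliberately simplified model, not the code in the pull request: it assumes each term is summarized as a map from norm value to the highest term frequency seen for that norm, and `PhraseImpactsSketch` and `phraseUpperBounds` are hypothetical names.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class PhraseImpactsSketch {
  // perTerm.get(i) maps a norm value to the highest frequency of the i-th
  // phrase term among documents with that norm (a simplifying assumption).
  // A phrase cannot occur more often than its rarest term, so for each norm
  // shared by all terms the bound is the minimum of the per-term frequencies.
  static Map<Long, Integer> phraseUpperBounds(List<Map<Long, Integer>> perTerm) {
    Map<Long, Integer> result = new TreeMap<>();
    if (perTerm.isEmpty()) {
      return result;
    }
    for (Map.Entry<Long, Integer> entry : perTerm.get(0).entrySet()) {
      Long norm = entry.getKey();
      int minFreq = entry.getValue();
      boolean presentInAll = true;
      for (int i = 1; i < perTerm.size() && presentInAll; i++) {
        Integer freq = perTerm.get(i).get(norm);
        if (freq == null) {
          // No document with this norm contains every term of the phrase.
          presentInAll = false;
        } else {
          minFreq = Math.min(minFreq, freq);
        }
      }
      if (presentInAll) {
        result.put(norm, minFreq);
      }
    }
    return result;
  }
}
```

The resulting (frequency, norm) pairs are what a scorer could turn into score upper bounds, which is what lets boolean queries skip blocks that cannot be competitive.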

> Leverage impacts for phrase queries
> ---
>
> Key: LUCENE-8311
> URL: https://issues.apache.org/jira/browse/LUCENE-8311
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
> Attachments: LUCENE-8311.patch
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Now that we expose raw impacts, we could leverage them for phrase queries.
> For instance for exact phrases, we could take the minimum term frequency for 
> each unique norm value in order to get upper bounds of the score for the 
> phrase.






[jira] [Commented] (LUCENE-8762) Lucene50PostingsReader should specialize reading docs+freqs with impacts

2019-07-03 Thread Adrien Grand (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16877916#comment-16877916
 ] 

Adrien Grand commented on LUCENE-8762:
--

I proposed to change the specialization for docs+freqs+positions as part of 
LUCENE-8311. But it doesn't add any specialization for docs+freqs, which would 
still probably be worth adding?

> Lucene50PostingsReader should specialize reading docs+freqs with impacts
> 
>
> Key: LUCENE-8762
> URL: https://issues.apache.org/jira/browse/LUCENE-8762
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
>
> Currently if you ask for impacts, we only have one implementation that is 
> able to expose everything: docs, freqs, positions and offsets. In contrast, 
> if you don't need impacts, we have specialization for docs+freqs, 
> docs+freqs+positions and docs+freqs+positions+offsets.
> Maybe we should add specialization for the docs+freqs case with impacts, 
> which should be the most common case, and remove specialization for 
> docs+freqs+positions when impacts are not requested?






[jira] [Commented] (LUCENE-8311) Leverage impacts for phrase queries

2019-07-03 Thread Adrien Grand (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16877849#comment-16877849
 ] 

Adrien Grand commented on LUCENE-8311:
--

It turns out that part of the reason why the patch is making things slower is 
that it is moving phrase queries from BlockPostingsEnum, which is specialized 
to read freqs and positions only, to BlockImpactsEverythingEnum, which can read 
any of docs+freqs, docs+freqs+positions or docs+freqs+positions+offsets. Maybe 
we should remove BlockPostingsEnum and have a specialized impacts enum for 
positions instead.

The merged impacts look like they have some room for improvement as well. I'm 
looking into those issues so that we can then do better testing of LUCENE-8806.

> Leverage impacts for phrase queries
> ---
>
> Key: LUCENE-8311
> URL: https://issues.apache.org/jira/browse/LUCENE-8311
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
> Attachments: LUCENE-8311.patch
>
>
> Now that we expose raw impacts, we could leverage them for phrase queries.
> For instance for exact phrases, we could take the minimum term frequency for 
> each unique norm value in order to get upper bounds of the score for the 
> phrase.






[jira] [Commented] (LUCENE-8899) Implementation of MultiTermQuery for ORed Queries

2019-07-02 Thread Adrien Grand (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16877185#comment-16877185
 ] 

Adrien Grand commented on LUCENE-8899:
--

This sounds very similar to what TermInSetQuery is doing, am I missing 
something?

> Implementation of MultiTermQuery for ORed Queries
> -
>
> Key: LUCENE-8899
> URL: https://issues.apache.org/jira/browse/LUCENE-8899
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Atri Sharma
>Priority: Major
>
> While working on multi range queries, I realised that it would be good to 
> specialize for cases where all clauses in a query are ORed together. 
> MultiTermQuery springs to mind, when all terms are basically disjuncted.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8900) Simplify MultiSorter

2019-07-02 Thread Adrien Grand (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16877183#comment-16877183
 ] 

Adrien Grand commented on LUCENE-8900:
--

Thanks [~danmuzi], I will apply your first suggestion. However I can't apply 2 
because I merged the logic for integers and longs, which means that in some 
cases the missing value will be an Integer and in other cases it will be a Long.

> Simplify MultiSorter
> 
>
> Key: LUCENE-8900
> URL: https://issues.apache.org/jira/browse/LUCENE-8900
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
> Attachments: LUCENE-8900.patch
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8901) Load frequencies lazily for postings and impacts

2019-07-02 Thread Adrien Grand (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16877164#comment-16877164
 ] 

Adrien Grand commented on LUCENE-8901:
--

Thanks [~mayyas]!

> Load frequencies lazily for postings and impacts
> 
>
> Key: LUCENE-8901
> URL: https://issues.apache.org/jira/browse/LUCENE-8901
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Mayya Sharipova
>Priority: Minor
> Fix For: 8.2
>
>
> Allow frequencies blocks to be loaded lazily when they are not needed



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-8901) Load frequencies lazily for postings and impacts

2019-07-02 Thread Adrien Grand (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand updated LUCENE-8901:
-
Fix Version/s: 8.2

> Load frequencies lazily for postings and impacts
> 
>
> Key: LUCENE-8901
> URL: https://issues.apache.org/jira/browse/LUCENE-8901
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Mayya Sharipova
>Priority: Minor
> Fix For: 8.2
>
>
> Allow frequencies blocks to be loaded lazily when they are not needed



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8857) Refactor TopDocs#Merge To Take In Custom Tie Breakers

2019-07-02 Thread Adrien Grand (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16877157#comment-16877157
 ] 

Adrien Grand commented on LUCENE-8857:
--

Double checking, have you run all Solr tests or only TestDistributedGrouping?

> Refactor TopDocs#Merge To Take In Custom Tie Breakers
> -
>
> Key: LUCENE-8857
> URL: https://issues.apache.org/jira/browse/LUCENE-8857
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Atri Sharma
>Priority: Major
> Fix For: master (9.0)
>
> Attachments: LUCENE-8857-compile-fix.patch, LUCENE-8857.patch, 
> LUCENE-8857.patch, LUCENE-8857.patch, LUCENE-8857.patch, LUCENE-8857.patch
>
>  Time Spent: 4.5h
>  Remaining Estimate: 0h
>
> In LUCENE-8829, the idea of having lambdas passed in to the API to allow 
> finer control over the process was discussed.
> This JIRA tracks adding a parameter to the API which allows passing in 
> lambdas to define custom tie breakers, thus allowing users to do custom 
> algorithms when required.
> CC: [~jpountz]  [~simonw] 
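
To illustrate what a caller-supplied tie breaker could look like, here is a minimal, self-contained sketch. The `Hit` class and the `merge` method are hypothetical stand-ins and do not claim to match the signature that was eventually committed. For brevity it sorts all hits instead of using a priority queue:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical sketch of merging per-shard hits with a custom tie breaker. */
final class MergeWithTieBreakerSketch {

  static final class Hit {
    final int shard;
    final int doc;
    final float score;

    Hit(int shard, int doc, float score) {
      this.shard = shard;
      this.doc = doc;
      this.score = score;
    }
  }

  /** Returns the topN hits by descending score, using tieBreaker to order equal scores. */
  static List<Hit> merge(int topN, List<List<Hit>> shardHits, Comparator<Hit> tieBreaker) {
    Comparator<Hit> byScoreDesc = (a, b) -> Float.compare(b.score, a.score);
    Comparator<Hit> order = byScoreDesc.thenComparing(tieBreaker);
    List<Hit> all = new ArrayList<>();
    for (List<Hit> hits : shardHits) {
      all.addAll(hits);
    }
    all.sort(order);
    return new ArrayList<>(all.subList(0, Math.min(topN, all.size())));
  }
}
```

A caller could pass `Comparator.comparingInt(h -> h.doc)` to prefer lower doc IDs on ties, or `Comparator.comparingInt(h -> h.shard)` to prefer lower shard indices.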



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8857) Refactor TopDocs#Merge To Take In Custom Tie Breakers

2019-07-02 Thread Adrien Grand (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16877155#comment-16877155
 ] 

Adrien Grand commented on LUCENE-8857:
--

Thanks [~atris] I'll look into merging now. MIGRATE is considered quite loud 
already, plus the fact that it is there makes it pretty likely to be included 
in the release notes.

> Refactor TopDocs#Merge To Take In Custom Tie Breakers
> -
>
> Key: LUCENE-8857
> URL: https://issues.apache.org/jira/browse/LUCENE-8857
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Atri Sharma
>Priority: Major
> Fix For: master (9.0)
>
> Attachments: LUCENE-8857-compile-fix.patch, LUCENE-8857.patch, 
> LUCENE-8857.patch, LUCENE-8857.patch, LUCENE-8857.patch, LUCENE-8857.patch
>
>  Time Spent: 4.5h
>  Remaining Estimate: 0h
>
> In LUCENE-8829, the idea of having lambdas passed in to the API to allow 
> finer control over the process was discussed.
> This JIRA tracks adding a parameter to the API which allows passing in 
> lambdas to define custom tie breakers, thus allowing users to do custom 
> algorithms when required.
> CC: [~jpountz]  [~simonw] 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Reopened] (LUCENE-8069) Allow index sorting by field length

2019-07-02 Thread Adrien Grand (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand reopened LUCENE-8069:
--

I've had this idea come back to my mind several times since I opened it. 
Sorting by norm brings the following benefits:
 - Better compression: smaller doc IDs are likely to have tiny term frequencies, 
since most of the time the term frequency is less than or equal to the norm.
 - Smaller impacts: since each block of postings has only one unique norm value 
on average, then it also only has one impact on average. This helps at search 
time since computing the score of this impact gives us immediately the best 
score of the block, as opposed to having to iterate several impacts and take 
the highest score.
 - For term queries, it makes sure that among all documents that have X 
occurrences of the queried term, we visit the documents that have the lowest 
norm first, and thus the ones that trigger the better scores.
 - Boolean queries are interesting: they get the same above benefit as term 
queries but on the other hand the norm tends to correlate with the number of 
unique terms so it might be that you need to collect more matches before you 
find one that matches several query terms.
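
To make the point about impacts concrete, here is a small sketch of how the best score of a block is derived from its impacts. The `Impact` and `ScoreFunction` types are hypothetical, self-contained stand-ins rather than Lucene classes, and the scoring function is left to the caller. With a single impact per block, the loop evaluates the scoring function exactly once:

```java
import java.util.List;

/** Hypothetical sketch, not Lucene code. */
final class BlockMaxScoreSketch {

  /** A (term frequency, norm) pair that may produce a competitive score. */
  static final class Impact {
    final int freq;
    final long norm;

    Impact(int freq, long norm) {
      this.freq = freq;
      this.norm = norm;
    }
  }

  /** Stand-in for a similarity: maps a frequency and a norm to a score. */
  interface ScoreFunction {
    double score(int freq, long norm);
  }

  /** Best possible score of a block: the highest score over all of its impacts. */
  static double maxScore(List<Impact> impacts, ScoreFunction function) {
    double max = 0;
    for (Impact impact : impacts) {
      max = Math.max(max, function.score(impact.freq, impact.norm));
    }
    return max;
  }
}
```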

I hacked a quick prototype and ran luceneutil on wikibig, results are 
encouraging:
{noformat}
                 Task  QPS baseline   StdDev  QPS patch   StdDev  Pct diff
HighTermDayOfYearSort         37.64   (6.4%)      33.96   (4.7%)   -9.8% ( -19% -    1%)
           HighPhrase         26.45   (2.7%)      25.24   (2.8%)   -4.6% (  -9% -    0%)
            OrHighLow        341.59   (2.8%)     327.84   (2.6%)   -4.0% (  -9% -    1%)
               Fuzzy2        153.15   (5.3%)     147.70   (5.1%)   -3.6% ( -13% -    7%)
               IntNRQ        151.43   (1.4%)     147.04   (3.4%)   -2.9% (  -7% -    1%)
    HighTermMonthSort         79.28   (6.4%)      79.44   (7.6%)    0.2% ( -12% -   15%)
              Respell        229.10   (2.2%)     230.62   (1.8%)    0.7% (  -3% -    4%)
               Fuzzy1        285.25   (6.9%)     288.99   (6.8%)    1.3% ( -11% -   16%)
              Prefix3         34.60  (10.3%)      35.14  (10.6%)    1.6% ( -17% -   25%)
             Wildcard         72.36   (5.8%)      73.86   (6.3%)    2.1% (  -9% -   15%)
              MedTerm       1895.68   (4.2%)    1939.92   (4.2%)    2.3% (  -5% -   11%)
         HighSpanNear          5.25   (6.0%)       5.46   (6.0%)    3.9% (  -7% -   17%)
      LowSloppyPhrase          6.85   (6.5%)       7.13   (6.3%)    4.2% (  -8% -   18%)
            LowPhrase         46.08   (1.7%)      48.56   (1.8%)    5.4% (   1% -    9%)
          LowSpanNear         24.03   (3.7%)      25.68   (4.3%)    6.9% (  -1% -   15%)
          MedSpanNear          5.20  (13.2%)       5.63  (15.2%)    8.3% ( -17% -   42%)
      MedSloppyPhrase         11.01   (4.5%)      11.95   (4.7%)    8.6% (   0% -   18%)
            MedPhrase         23.39   (2.6%)      25.64   (2.2%)    9.6% (   4% -   14%)
     HighSloppyPhrase          3.84   (5.9%)       4.26   (5.8%)   11.0% (   0% -   24%)
           AndHighLow        401.13   (3.4%)     458.11   (3.0%)   14.2% (   7% -   21%)
              LowTerm       2294.98   (4.0%)    2863.59   (7.0%)   24.8% (  13% -   37%)
           AndHighMed         53.62   (3.8%)      71.40   (1.8%)   33.2% (  26% -   40%)
             HighTerm       1286.59   (3.9%)    1917.61   (5.7%)   49.0% (  38% -   60%)
          AndHighHigh         41.24   (3.5%)      69.17   (4.2%)   67.7% (  58% -   78%)
            OrHighMed         49.92   (2.4%)      84.95   (4.0%)   70.2% (  62% -   78%)
           OrHighHigh         43.55   (2.3%)      90.06   (4.8%)  106.8% (  97% -  116%)
{noformat}

The {{doc}} file is 12% smaller.

> Allow index sorting by field length
> ---
>
> Key: LUCENE-8069
> URL: https://issues.apache.org/jira/browse/LUCENE-8069
> Project: Lucene - Core
>  Issue Type: Wish
>Reporter: Adrien Grand
>Priority: Minor
>
> Short documents are more likely to get higher scores, so sorting an index by 
> field length would mean we would be likely to collect best matches first. 
> Depending on the similarity implementation, this might even allow early 
> termination of the collection of top documents on term queries.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-8900) Simplify MultiSorter

2019-07-02 Thread Adrien Grand (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand updated LUCENE-8900:
-
Attachment: LUCENE-8900.patch
Status: Open  (was: Open)

Here is a patch, it does two things:
 - Uses advanceExact instead of advance on doc-value iterators.
 - Replaces usage of Comparable with longs, since in all cases values can be 
converted to comparable longs, which avoids issues with generics.
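
For readers wondering how floating-point values can be compared as longs, here is a minimal sketch of an order-preserving conversion. It is similar in spirit to Lucene's `NumericUtils.floatToSortableInt` and `NumericUtils.doubleToSortableLong`, but the class below is a hypothetical, self-contained illustration and not necessarily how the patch does it:

```java
/** Hypothetical sketch: map numeric sort values to longs that compare in the same order. */
final class ComparableLongsSketch {

  /** Integers already compare correctly once widened to long. */
  static long fromInt(int value) {
    return value;
  }

  /** Flips the low 31 bits of negative floats so that the int order matches the float order. */
  static long fromFloat(float value) {
    int bits = Float.floatToIntBits(value);
    return bits ^ ((bits >> 31) & 0x7fffffff);
  }

  /** Flips the low 63 bits of negative doubles so that the long order matches the double order. */
  static long fromDouble(double value) {
    long bits = Double.doubleToLongBits(value);
    return bits ^ ((bits >> 63) & 0x7fffffffffffffffL);
  }
}
```

Once every value is a long, a single `Long.compare` call can serve all numeric sort types, which is what removes the need for generic `Comparable` plumbing.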

> Simplify MultiSorter
> 
>
> Key: LUCENE-8900
> URL: https://issues.apache.org/jira/browse/LUCENE-8900
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
> Attachments: LUCENE-8900.patch
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Created] (LUCENE-8900) Simplify MultiSorter

2019-07-02 Thread Adrien Grand (JIRA)
Adrien Grand created LUCENE-8900:


 Summary: Simplify MultiSorter
 Key: LUCENE-8900
 URL: https://issues.apache.org/jira/browse/LUCENE-8900
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Adrien Grand






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8892) Missing closing parens in string representation of MultiBoolFunction

2019-07-02 Thread Adrien Grand (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16876712#comment-16876712
 ] 

Adrien Grand commented on LUCENE-8892:
--

[~munendrasn] I resolved and gave you permission. It should work next time.

> Missing closing parens in string representation of MultiBoolFunction
> 
>
> Key: LUCENE-8892
> URL: https://issues.apache.org/jira/browse/LUCENE-8892
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Florian Diebold
>Priority: Trivial
> Fix For: 8.2
>
> Attachments: 0001-Fix-missing-parenthesis-in-MultiBoolFunction.patch, 
> LUCENE-8892.patch, SOLR-13514.patch
>
>
> The {{description}} function of {{MultiBoolFunction}} includes an open 
> parenthesis, but doesn't close it. This makes score explanations more 
> confusing than necessary sometimes.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-8892) Missing closing parens in string representation of MultiBoolFunction

2019-07-02 Thread Adrien Grand (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand updated LUCENE-8892:
-
   Resolution: Fixed
Fix Version/s: 8.2
   Status: Resolved  (was: Patch Available)

> Missing closing parens in string representation of MultiBoolFunction
> 
>
> Key: LUCENE-8892
> URL: https://issues.apache.org/jira/browse/LUCENE-8892
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Florian Diebold
>Priority: Trivial
> Fix For: 8.2
>
> Attachments: 0001-Fix-missing-parenthesis-in-MultiBoolFunction.patch, 
> LUCENE-8892.patch, SOLR-13514.patch
>
>
> The {{description}} function of {{MultiBoolFunction}} includes an open 
> parenthesis, but doesn't close it. This makes score explanations more 
> confusing than necessary sometimes.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8857) Refactor TopDocs#Merge To Take In Custom Tie Breakers

2019-07-02 Thread Adrien Grand (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16876705#comment-16876705
 ] 

Adrien Grand commented on LUCENE-8857:
--

We need to have all changes in the same pull request, otherwise there will be a 
window of time during which we will get test failures when testing Solr, which 
could break a lot of people. As you noticed, it didn't take long for Munendra to 
notice something had broken.

> Refactor TopDocs#Merge To Take In Custom Tie Breakers
> -
>
> Key: LUCENE-8857
> URL: https://issues.apache.org/jira/browse/LUCENE-8857
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Atri Sharma
>Priority: Major
> Fix For: master (9.0)
>
> Attachments: LUCENE-8857-compile-fix.patch, LUCENE-8857.patch, 
> LUCENE-8857.patch, LUCENE-8857.patch, LUCENE-8857.patch, LUCENE-8857.patch
>
>  Time Spent: 4.5h
>  Remaining Estimate: 0h
>
> In LUCENE-8829, the idea of having lambdas passed in to the API to allow 
> finer control over the process was discussed.
> This JIRA tracks adding a parameter to the API which allows passing in 
> lambdas to define custom tie breakers, thus allowing users to do custom 
> algorithms when required.
> CC: [~jpountz]  [~simonw] 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8857) Refactor TopDocs#Merge To Take In Custom Tie Breakers

2019-07-02 Thread Adrien Grand (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16876698#comment-16876698
 ] 

Adrien Grand commented on LUCENE-8857:
--

[~atris] Thanks for looking into the grouping failure. I'm not seeing changes 
to Solr, so I'm assuming we would still get the failure that [~munendrasn] 
shared if we pushed?

> Refactor TopDocs#Merge To Take In Custom Tie Breakers
> -
>
> Key: LUCENE-8857
> URL: https://issues.apache.org/jira/browse/LUCENE-8857
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Atri Sharma
>Priority: Major
> Fix For: master (9.0)
>
> Attachments: LUCENE-8857.patch, LUCENE-8857.patch, LUCENE-8857.patch, 
> LUCENE-8857.patch, LUCENE-8857.patch
>
>  Time Spent: 4.5h
>  Remaining Estimate: 0h
>
> In LUCENE-8829, the idea of having lambdas passed in to the API to allow 
> finer control over the process was discussed.
> This JIRA tracks adding a parameter to the API which allows passing in 
> lambdas to define custom tie breakers, thus allowing users to do custom 
> algorithms when required.
> CC: [~jpountz]  [~simonw] 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8757) Better Segment To Thread Mapping Algorithm

2019-07-02 Thread Adrien Grand (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16876689#comment-16876689
 ] 

Adrien Grand commented on LUCENE-8757:
--

This change has been reverted from 8.x due to the fact that it required changes 
to TopDocs#merge that would necessarily be breaking to our users.

> Better Segment To Thread Mapping Algorithm
> --
>
> Key: LUCENE-8757
> URL: https://issues.apache.org/jira/browse/LUCENE-8757
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Atri Sharma
>Assignee: Simon Willnauer
>Priority: Major
> Fix For: master (9.0)
>
> Attachments: LUCENE-8757.patch, LUCENE-8757.patch, LUCENE-8757.patch, 
> LUCENE-8757.patch, LUCENE-8757.patch, LUCENE-8757.patch, LUCENE-8757.patch, 
> LUCENE-8757.patch, LUCENE-8757.patch, LUCENE-8757.patch
>
>
> The current segments to threads allocation algorithm always allocates one 
> thread per segment. This is detrimental to performance in case of skew in 
> segment sizes since small segments also get their dedicated thread. This can 
> lead to performance degradation due to context switching overheads.
>  
> A better algorithm which is cognizant of size skew would have better 
> performance for realistic scenarios
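
One possible size-aware allocation, as a sketch only: segments larger than a threshold get their own slice (and thus their own thread), while smaller segments are packed together until the threshold is reached. The class name and the `maxDocsPerSlice` parameter are hypothetical, and this is not claimed to be the algorithm from the attached patches:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical sketch of grouping segments into slices based on their sizes. */
final class SegmentSlicesSketch {

  /** Takes the doc count of each segment and returns groups of doc counts, one group per thread. */
  static List<List<Integer>> slices(List<Integer> segmentDocCounts, int maxDocsPerSlice) {
    List<Integer> sorted = new ArrayList<>(segmentDocCounts);
    sorted.sort(Comparator.reverseOrder()); // largest segments first
    List<List<Integer>> slices = new ArrayList<>();
    List<Integer> current = new ArrayList<>();
    long docsInCurrent = 0;
    for (int docCount : sorted) {
      if (docCount > maxDocsPerSlice) {
        // Large segments keep a dedicated slice.
        slices.add(List.of(docCount));
        continue;
      }
      if (!current.isEmpty() && docsInCurrent + docCount > maxDocsPerSlice) {
        // The current slice is full: close it and start a new one.
        slices.add(current);
        current = new ArrayList<>();
        docsInCurrent = 0;
      }
      current.add(docCount);
      docsInCurrent += docCount;
    }
    if (!current.isEmpty()) {
      slices.add(current);
    }
    return slices;
  }
}
```

With doc counts 1000, 600, 30, 20 and 10 and a threshold of 100, this yields three slices instead of five threads: one for each large segment and one shared by the three small ones.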



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-8757) Better Segment To Thread Mapping Algorithm

2019-07-02 Thread Adrien Grand (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand updated LUCENE-8757:
-
Fix Version/s: (was: 8.2)

> Better Segment To Thread Mapping Algorithm
> --
>
> Key: LUCENE-8757
> URL: https://issues.apache.org/jira/browse/LUCENE-8757
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Atri Sharma
>Assignee: Simon Willnauer
>Priority: Major
> Fix For: master (9.0)
>
> Attachments: LUCENE-8757.patch, LUCENE-8757.patch, LUCENE-8757.patch, 
> LUCENE-8757.patch, LUCENE-8757.patch, LUCENE-8757.patch, LUCENE-8757.patch, 
> LUCENE-8757.patch, LUCENE-8757.patch, LUCENE-8757.patch
>
>
> The current segments to threads allocation algorithm always allocates one 
> thread per segment. This is detrimental to performance in case of skew in 
> segment sizes since small segments also get their dedicated thread. This can 
> lead to performance degradation due to context switching overheads.
>  
> A better algorithm which is cognizant of size skew would have better 
> performance for realistic scenarios



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Created] (LUCENE-8898) TestRamUsageEstimator.testMap failures

2019-07-01 Thread Adrien Grand (JIRA)
Adrien Grand created LUCENE-8898:


 Summary: TestRamUsageEstimator.testMap failures
 Key: LUCENE-8898
 URL: https://issues.apache.org/jira/browse/LUCENE-8898
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Adrien Grand
 Fix For: 8.2


Here is an example failure:

{noformat}
4 tests failed.
FAILED:  org.apache.lucene.util.TestRamUsageEstimator.testMap

Error Message:
expected:<25152.0> but was:<30184.0>

Stack Trace:
java.lang.AssertionError: expected:<25152.0> but was:<30184.0>
    at __randomizedtesting.SeedInfo.seed([ED7055A14021EA69:CD56E1725ADAF91B]:0)
    at org.junit.Assert.fail(Assert.java:88)
    at org.junit.Assert.failNotEquals(Assert.java:834)
    at org.junit.Assert.assertEquals(Assert.java:553)
    at org.junit.Assert.assertEquals(Assert.java:683)
    at org.apache.lucene.util.TestRamUsageEstimator.testMap(TestRamUsageEstimator.java:136)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:566)
    at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1750)
    at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:938)
    at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:974)
    at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:988)
    at org.apache.lucene.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:49)
    at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
    at org.apache.lucene.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:48)
    at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64)
    at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47)
    at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
    at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:368)
    at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:817)
    at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:468)
    at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:947)
    at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:832)
    at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:883)
    at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:894)
    at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
    at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
    at org.apache.lucene.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:41)
    at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
    at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
    at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
    at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
    at org.apache.lucene.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
    at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47)
    at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64)
    at org.apache.lucene.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:54)
    at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
    at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:368)
    at java.base/java.lang.Thread.run(Thread.java:834)
{noformat}

This happens on master and branch_8x, but apparently only when the JVM version 
is greater than or equal to 11.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: 

[jira] [Updated] (LUCENE-8898) TestRamUsageEstimator.testMap failures

2019-07-01 Thread Adrien Grand (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand updated LUCENE-8898:
-
Issue Type: Bug  (was: Improvement)

> TestRamUsageEstimator.testMap failures
> --
>
> Key: LUCENE-8898
> URL: https://issues.apache.org/jira/browse/LUCENE-8898
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Adrien Grand
>Priority: Blocker
> Fix For: 8.2
>
>
> Here is an example failure:
> {noformat}
> 4 tests failed.
> FAILED:  org.apache.lucene.util.TestRamUsageEstimator.testMap
> Error Message:
> expected:<25152.0> but was:<30184.0>
> Stack Trace:
> java.lang.AssertionError: expected:<25152.0> but was:<30184.0>
> at 
> __randomizedtesting.SeedInfo.seed([ED7055A14021EA69:CD56E1725ADAF91B]:0)
> at org.junit.Assert.fail(Assert.java:88)
> at org.junit.Assert.failNotEquals(Assert.java:834)
> at org.junit.Assert.assertEquals(Assert.java:553)
> at org.junit.Assert.assertEquals(Assert.java:683)
> at 
> org.apache.lucene.util.TestRamUsageEstimator.testMap(TestRamUsageEstimator.java:136)
> at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.base/java.lang.reflect.Method.invoke(Method.java:566)
> at 
> com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1750)
> at 
> com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:938)
> at 
> com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:974)
> at 
> com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:988)
> at 
> org.apache.lucene.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:49)
> at 
> org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
> at 
> org.apache.lucene.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:48)
> at 
> org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64)
> at 
> org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47)
> at 
> com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
> at 
> com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:368)
> at 
> com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:817)
> at 
> com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:468)
> at 
> com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:947)
> at 
> com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:832)
> at 
> com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:883)
> at 
> com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:894)
> at 
> org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
> at 
> com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
> at 
> org.apache.lucene.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:41)
> at 
> com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
> at 
> com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
> at 
> com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
> at 
> com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
> at 
> org.apache.lucene.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
> at 
> org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47)
> at 
> org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64)
> at 
> org.apache.lucene.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:54)
> at 
> com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
> at 
> 

[jira] [Commented] (LUCENE-8857) Refactor TopDocs#Merge To Take In Custom Tie Breakers

2019-07-01 Thread Adrien Grand (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16876248#comment-16876248
 ] 

Adrien Grand commented on LUCENE-8857:
--

[~atris] Can you look into those failures? I had understood from your comment 
on the PR that you had run tests? FYI I'm seeing issues on the Lucene end as 
well, e.g. ant test  -Dtestcase=TestGrouping -Dtests.method=testRandom 
-Dtests.seed=1039BE5B957F7FDD -Dtests.slow=true -Dtests.badapples=true 
-Dtests.locale=be -Dtests.timezone=Europe/Rome -Dtests.asserts=true 
-Dtests.file.encoding=UTF-8.

> Refactor TopDocs#Merge To Take In Custom Tie Breakers
> -
>
> Key: LUCENE-8857
> URL: https://issues.apache.org/jira/browse/LUCENE-8857
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Atri Sharma
>Priority: Major
> Fix For: master (9.0)
>
> Attachments: LUCENE-8857.patch, LUCENE-8857.patch, LUCENE-8857.patch, 
> LUCENE-8857.patch, LUCENE-8857.patch
>
>  Time Spent: 4h 20m
>  Remaining Estimate: 0h
>
> In LUCENE-8829, the idea of having lambdas passed in to the API to allow 
> finer control over the process was discussed.
> This JIRA tracks adding a parameter to the API which allows passing in 
> lambdas to define custom tie breakers, thus allowing users to do custom 
> algorithms when required.
> CC: [~jpountz]  [~simonw] 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8857) Refactor TopDocs#Merge To Take In Custom Tie Breakers

2019-07-01 Thread Adrien Grand (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16876242#comment-16876242
 ] 

Adrien Grand commented on LUCENE-8857:
--

Thanks [~munendrasn] I'm reverting now.

> Refactor TopDocs#Merge To Take In Custom Tie Breakers
> -
>
> Key: LUCENE-8857
> URL: https://issues.apache.org/jira/browse/LUCENE-8857
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Atri Sharma
>Priority: Major
> Fix For: master (9.0)
>
> Attachments: LUCENE-8857.patch, LUCENE-8857.patch, LUCENE-8857.patch, 
> LUCENE-8857.patch, LUCENE-8857.patch
>
>  Time Spent: 4h 20m
>  Remaining Estimate: 0h
>
> In LUCENE-8829, the idea of having lambdas passed in to the API to allow 
> finer control over the process was discussed.
> This JIRA tracks adding a parameter to the API which allows passing in 
> lambdas to define custom tie breakers, thus allowing users to do custom 
> algorithms when required.
> CC: [~jpountz]  [~simonw] 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Resolved] (LUCENE-8857) Refactor TopDocs#Merge To Take In Custom Tie Breakers

2019-07-01 Thread Adrien Grand (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand resolved LUCENE-8857.
--
   Resolution: Fixed
Fix Version/s: master (9.0)

> Refactor TopDocs#Merge To Take In Custom Tie Breakers
> -
>
> Key: LUCENE-8857
> URL: https://issues.apache.org/jira/browse/LUCENE-8857
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Atri Sharma
>Priority: Major
> Fix For: master (9.0)
>
> Attachments: LUCENE-8857.patch, LUCENE-8857.patch, LUCENE-8857.patch, 
> LUCENE-8857.patch, LUCENE-8857.patch
>
>  Time Spent: 4h 10m
>  Remaining Estimate: 0h
>
> In LUCENE-8829, the idea of having lambdas passed in to the API to allow 
> finer control over the process was discussed.
> This JIRA tracks adding a parameter to the API which allows passing in 
> lambdas to define custom tie breakers, thus allowing users to do custom 
> algorithms when required.
> CC: [~jpountz]  [~simonw] 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org


