[jira] [Updated] (LUCENE-8708) Can we simplify conjunctions of range queries automatically?

2019-04-07 Thread Atri Sharma (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Atri Sharma updated LUCENE-8708:

Attachment: interval_range_clauses_merging0704.patch

> Can we simplify conjunctions of range queries automatically?
> 
>
> Key: LUCENE-8708
> URL: https://issues.apache.org/jira/browse/LUCENE-8708
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
> Attachments: interval_range_clauses_merging0704.patch
>
>
> BooleanQuery#rewrite already has some logic to make queries more efficient, 
> such as deduplicating filters or rewriting boolean queries that wrap a single 
> positive clause to that clause.
> It would be nice to also simplify conjunctions of range queries, so that eg. 
> {{foo: [5 TO *] AND foo:[* TO 20]}} would be rewritten to {{foo:[5 TO 20]}}. 
> When constructing queries manually or via the classic query parser, it feels 
> unnecessary as this is something that the user can fix easily. However if you 
> want to implement a query parser that only allows specifying one bound at 
> once, such as Gmail ({{after:2018-12-31}} 
> https://support.google.com/mail/answer/7190?hl=en) or GitHub 
> ({{updated:>=2018-12-31}} 
> https://help.github.com/en/articles/searching-issues-and-pull-requests#search-by-when-an-issue-or-pull-request-was-created-or-last-updated)
>  then you might end up with inefficient queries if the end user specifies 
> both an upper and a lower bound. It would be nice if we optimized those 
> automatically.






[jira] [Commented] (LUCENE-8708) Can we simplify conjunctions of range queries automatically?

2019-04-07 Thread Atri Sharma (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16811799#comment-16811799
 ] 

Atri Sharma commented on LUCENE-8708:
-

Attached is a WIP patch for this. One existing test still needs to be refactored 
to comply with the new API, which I will do before the final commit. The intent 
of this patch is to get early feedback and surface potential blockers.

This patch introduces a ToString interface. While a bit controversial, the 
interface is necessary to allow creating new range queries of a given type after 
the merge. I am happy to replace it with any alternative that seems saner.

[^interval_range_clauses_merging0704.patch]
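
For context while reviewing, here is a minimal sketch of the bound-intersection 
idea behind the rewrite. It is not the attached patch's API: the class and field 
names are made up, and plain int bounds stand in for Lucene's encoded byte[] 
points.

{code:java}
// Sketch only: intersect the bounds of two MUST range clauses on the same field.
// A BooleanQuery rewrite could replace the two clauses with the merged range,
// or with a match-no-docs query when the ranges are disjoint.
final class IntRange {
  final String field;
  final int lower, upper; // inclusive bounds

  IntRange(String field, int lower, int upper) {
    this.field = field;
    this.lower = lower;
    this.upper = upper;
  }

  // Returns the intersection of two ranges on the same field, or null if disjoint.
  static IntRange intersect(IntRange a, IntRange b) {
    if (!a.field.equals(b.field)) {
      throw new IllegalArgumentException("ranges must target the same field");
    }
    int lo = Math.max(a.lower, b.lower);
    int hi = Math.min(a.upper, b.upper);
    return lo <= hi ? new IntRange(a.field, lo, hi) : null; // null => no possible match
  }
}
{code}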

> Can we simplify conjunctions of range queries automatically?
> 
>
> Key: LUCENE-8708
> URL: https://issues.apache.org/jira/browse/LUCENE-8708
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
> Attachments: interval_range_clauses_merging0704.patch
>
>
> BooleanQuery#rewrite already has some logic to make queries more efficient, 
> such as deduplicating filters or rewriting boolean queries that wrap a single 
> positive clause to that clause.
> It would be nice to also simplify conjunctions of range queries, so that eg. 
> {{foo: [5 TO *] AND foo:[* TO 20]}} would be rewritten to {{foo:[5 TO 20]}}. 
> When constructing queries manually or via the classic query parser, it feels 
> unnecessary as this is something that the user can fix easily. However if you 
> want to implement a query parser that only allows specifying one bound at 
> once, such as Gmail ({{after:2018-12-31}} 
> https://support.google.com/mail/answer/7190?hl=en) or GitHub 
> ({{updated:>=2018-12-31}} 
> https://help.github.com/en/articles/searching-issues-and-pull-requests#search-by-when-an-issue-or-pull-request-was-created-or-last-updated)
>  then you might end up with inefficient queries if the end user specifies 
> both an upper and a lower bound. It would be nice if we optimized those 
> automatically.






[jira] [Updated] (LUCENE-8708) Can we simplify conjunctions of range queries automatically?

2019-04-07 Thread Atri Sharma (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Atri Sharma updated LUCENE-8708:

Attachment: (was: interval_range_clauses_merging0704.patch)

> Can we simplify conjunctions of range queries automatically?
> 
>
> Key: LUCENE-8708
> URL: https://issues.apache.org/jira/browse/LUCENE-8708
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
> Attachments: interval_range_clauses_merging0704.patch
>
>
> BooleanQuery#rewrite already has some logic to make queries more efficient, 
> such as deduplicating filters or rewriting boolean queries that wrap a single 
> positive clause to that clause.
> It would be nice to also simplify conjunctions of range queries, so that eg. 
> {{foo: [5 TO *] AND foo:[* TO 20]}} would be rewritten to {{foo:[5 TO 20]}}. 
> When constructing queries manually or via the classic query parser, it feels 
> unnecessary as this is something that the user can fix easily. However if you 
> want to implement a query parser that only allows specifying one bound at 
> once, such as Gmail ({{after:2018-12-31}} 
> https://support.google.com/mail/answer/7190?hl=en) or GitHub 
> ({{updated:>=2018-12-31}} 
> https://help.github.com/en/articles/searching-issues-and-pull-requests#search-by-when-an-issue-or-pull-request-was-created-or-last-updated)
>  then you might end up with inefficient queries if the end user specifies 
> both an upper and a lower bound. It would be nice if we optimized those 
> automatically.






[jira] [Updated] (LUCENE-8708) Can we simplify conjunctions of range queries automatically?

2019-04-07 Thread Atri Sharma (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Atri Sharma updated LUCENE-8708:

Attachment: interval_range_clauses_merging0704.patch

> Can we simplify conjunctions of range queries automatically?
> 
>
> Key: LUCENE-8708
> URL: https://issues.apache.org/jira/browse/LUCENE-8708
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
> Attachments: interval_range_clauses_merging0704.patch
>
>
> BooleanQuery#rewrite already has some logic to make queries more efficient, 
> such as deduplicating filters or rewriting boolean queries that wrap a single 
> positive clause to that clause.
> It would be nice to also simplify conjunctions of range queries, so that eg. 
> {{foo: [5 TO *] AND foo:[* TO 20]}} would be rewritten to {{foo:[5 TO 20]}}. 
> When constructing queries manually or via the classic query parser, it feels 
> unnecessary as this is something that the user can fix easily. However if you 
> want to implement a query parser that only allows specifying one bound at 
> once, such as Gmail ({{after:2018-12-31}} 
> https://support.google.com/mail/answer/7190?hl=en) or GitHub 
> ({{updated:>=2018-12-31}} 
> https://help.github.com/en/articles/searching-issues-and-pull-requests#search-by-when-an-issue-or-pull-request-was-created-or-last-updated)
>  then you might end up with inefficient queries if the end user specifies 
> both an upper and a lower bound. It would be nice if we optimized those 
> automatically.






[jira] [Commented] (LUCENE-8749) Proposal: Pluggable Interface for Slice Allocation Algorithm

2019-04-07 Thread Atri Sharma (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16811802#comment-16811802
 ] 

Atri Sharma commented on LUCENE-8749:
-

Agreed, a better IndexSearcher#slices implementation would definitely help.

 

However, I believe that the ability to customize the method through an external 
object gives a user more granular control over the slice allocation algorithm. 
Two users might have wildly different parameters on which they want to allocate 
slices, so designing a single best-fit algorithm for both of them might be hard.

I believe having both capabilities is a good idea. I am happy to open another 
Jira to track work on a better default slice allocation algorithm.

 

Thoughts?
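
To make "pluggable" concrete, a rough sketch of the kind of hook being discussed 
follows. The interface name and shape are hypothetical, not an existing Lucene 
API; only the referenced Lucene types are real.

{code:java}
import java.util.List;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.search.IndexSearcher.LeafSlice;

// Hypothetical strategy interface: IndexSearcher could delegate its slice
// computation to an externally supplied implementation instead of hard-coding it.
public interface SliceAllocationStrategy {
  // Group the index leaves (segments) into slices; each slice is one unit of
  // work for a search thread.
  LeafSlice[] allocate(List<LeafReaderContext> leaves);
}
{code}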

> Proposal: Pluggable Interface for Slice Allocation Algorithm
> 
>
> Key: LUCENE-8749
> URL: https://issues.apache.org/jira/browse/LUCENE-8749
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/search
>Reporter: Atri Sharma
>Priority: Major
>
> The slice allocation method allocates one thread per segment today. If a user 
> wishes to use a different slice allocation algorithm, there is no way except 
> to make a change in IndexSearcher. This Jira proposes an interface to 
> decouple the slice allocation mechanism from IndexSearcher and allow plugging 
> in the method from an external factory (like Collectors).






[jira] [Created] (LUCENE-8757) Better Segment To Thread Mapping Algorithm

2019-04-09 Thread Atri Sharma (JIRA)
Atri Sharma created LUCENE-8757:
---

 Summary: Better Segment To Thread Mapping Algorithm
 Key: LUCENE-8757
 URL: https://issues.apache.org/jira/browse/LUCENE-8757
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Atri Sharma


The current segment-to-thread allocation algorithm always allocates one thread 
per segment. This is detrimental to performance when segment sizes are skewed, 
since small segments also get a dedicated thread, which can degrade performance 
due to context-switching overhead.

A better algorithm that is cognizant of size skew would perform better in 
realistic scenarios.
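
One possible shape of such an algorithm, sketched with plain placeholder types; 
the threshold and names are made up for illustration and are not a proposed 
Lucene API.

{code:java}
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Sketch: pack segments into slices so that tiny segments share a thread
// instead of each getting a dedicated one.
final class SizeAwareSlicer {
  static final int MIN_DOCS_PER_SLICE = 250_000; // arbitrary example threshold

  // Each inner list is one slice, i.e. one unit of work for a search thread.
  // Input is simply the per-segment document counts.
  static List<List<Integer>> slice(List<Integer> segmentDocCounts) {
    List<Integer> sorted = new ArrayList<>(segmentDocCounts);
    sorted.sort(Comparator.reverseOrder()); // big segments first
    List<List<Integer>> slices = new ArrayList<>();
    List<Integer> current = new ArrayList<>();
    int docsInCurrent = 0;
    for (int docCount : sorted) {
      current.add(docCount);
      docsInCurrent += docCount;
      if (docsInCurrent >= MIN_DOCS_PER_SLICE) {
        slices.add(current);
        current = new ArrayList<>();
        docsInCurrent = 0;
      }
    }
    if (!current.isEmpty()) {
      slices.add(current); // leftover small segments share one slice
    }
    return slices;
  }
}
{code}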






[jira] [Created] (LUCENE-8769) Range Query Type With Logically Connected Ranges

2019-04-18 Thread Atri Sharma (JIRA)
Atri Sharma created LUCENE-8769:
---

 Summary: Range Query Type With Logically Connected Ranges
 Key: LUCENE-8769
 URL: https://issues.apache.org/jira/browse/LUCENE-8769
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Atri Sharma


Today, we visit the BKD tree once for each range specified via PointRangeQuery. 
It would be good to have a range query type which can take multiple ranges, 
logically ANDed or ORed.
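
As a toy illustration of the matching semantics (completely independent of the 
BKD traversal details; the class and fields are invented for this sketch), see 
below. A real implementation would push this check into a single tree visit 
rather than one visit per range.

{code:java}
// Sketch: per-value semantics of a query over several logically connected ranges.
final class MultiRange {
  final long[] lowers, uppers;  // parallel arrays, inclusive bounds
  final boolean conjunction;    // true = all ranges must match (AND), false = any (OR)

  MultiRange(long[] lowers, long[] uppers, boolean conjunction) {
    this.lowers = lowers;
    this.uppers = uppers;
    this.conjunction = conjunction;
  }

  boolean matches(long value) {
    for (int i = 0; i < lowers.length; i++) {
      boolean in = value >= lowers[i] && value <= uppers[i];
      if (conjunction && !in) return false; // AND: one miss is fatal
      if (!conjunction && in) return true;  // OR: one hit is enough
    }
    return conjunction; // AND: every range matched; OR: none did
  }
}
{code}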






[jira] [Commented] (LUCENE-8708) Can we simplify conjunctions of range queries automatically?

2019-04-18 Thread Atri Sharma (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16821001#comment-16821001
 ] 

Atri Sharma commented on LUCENE-8708:
-

[~ivera] Thanks, that makes sense. I have created an issue for the same: 
https://issues.apache.org/jira/browse/LUCENE-8769

 

However, I think that we should still optimize overlapping ranges as this issue 
proposes so that existing users also get the performance advantage.

 

[~jpountz] Any thoughts on how we could simplify the patch?

> Can we simplify conjunctions of range queries automatically?
> 
>
> Key: LUCENE-8708
> URL: https://issues.apache.org/jira/browse/LUCENE-8708
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
> Attachments: interval_range_clauses_merging0704.patch
>
>
> BooleanQuery#rewrite already has some logic to make queries more efficient, 
> such as deduplicating filters or rewriting boolean queries that wrap a single 
> positive clause to that clause.
> It would be nice to also simplify conjunctions of range queries, so that eg. 
> {{foo: [5 TO *] AND foo:[* TO 20]}} would be rewritten to {{foo:[5 TO 20]}}. 
> When constructing queries manually or via the classic query parser, it feels 
> unnecessary as this is something that the user can fix easily. However if you 
> want to implement a query parser that only allows specifying one bound at 
> once, such as Gmail ({{after:2018-12-31}} 
> https://support.google.com/mail/answer/7190?hl=en) or GitHub 
> ({{updated:>=2018-12-31}} 
> https://help.github.com/en/articles/searching-issues-and-pull-requests#search-by-when-an-issue-or-pull-request-was-created-or-last-updated)
>  then you might end up with inefficient queries if the end user specifies 
> both an upper and a lower bound. It would be nice if we optimized those 
> automatically.






[jira] [Comment Edited] (LUCENE-8675) Divide Segment Search Amongst Multiple Threads

2019-04-18 Thread Atri Sharma (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16779287#comment-16779287
 ] 

Atri Sharma edited comment on LUCENE-8675 at 4/18/19 12:09 PM:
---

Here are the results of luceneutil (patched to generate P50 and P90 and to run 
concurrent searching within IndexSearcher; the patch is posted to the luceneutil 
repo).

Adrien has a valid point about costly scorers not benefiting from this approach. 
Specifically, range queries can take a hit since the BKD tree's scorer is 
two-phase and expensive to construct, so building it per portion of a segment 
would lead to an increase in latency, as is evident from the increase in P90 
latency in the results. I am evaluating how to tackle this problem and will post 
any approaches that look viable. These benchmarks are targeted at measuring the 
changes in the "happy" path, i.e. big index sizes and low-QPS cases; luceneutil 
was configured accordingly (low number of search threads, impacts turned off).

In summary, the queries scanning a higher amount of data and having higher read 
latencies tend to have the maximum improvement. Term queries and queries 
involving term queries on higher frequency terms get a reasonable latency 
reduction.

The following are P50 and P90 latencies calculated by Luceneutil. P50 Base is 
the P50 latency of the base, P50 Cmp is the P50 latency of the competitor 
(patched version), and the same for P90.

Note: the QPS jumps are not real. Since luceneutil was configured to run a 
single searcher thread, the QPS jump is proportional to the latency drop for 
each task.

Luceneutil results: 
[https://gist.github.com/atris/9a06d511fdfa9de1b48b47e09d5ab8d2]

I have attached the P50 and P90 latency graphs for high frequency phrase and 
term queries. It is apparent that queries with high frequency terms have 
sizeable improvements.

To address Adrien's point, I have some ideas to improve the performance of the 
BKD tree scorer for this case; I will open a separate JIRA issue and link it 
here.

[~jpountz] Are there any other concerns that you see here? Happy to address 
your feedback.

 

 


was (Author: atris):
Here are the results of luceneutil (patched to generate P50 and P90 and to run 
concurrent searching within IndexSearcher. Patch is posted to luceneutil repo).

Adrien has a valid point about costly scorers not benefitting from this 
approach. Specifically, range queries can take a hit since BKD Tree's scorer is 
two phase and is expensive to construct, so doing them per portion of a segment 
would lead to increase in latency, as is evident from the increase in P90 
latency in the above results. I am spending time to evaluate how to tackle this 
problem and will post any thoughts that I see as viable. These benchmarks are 
targeted to measure the changes in the "happy" path i.e. the targeted big index 
sizes and low QPS cases. Luceneutil was configured accordingly (low number of 
search threads, impacts turned off)

In summary, the queries scanning a higher amount of data and having higher read 
latencies tend to have the maximum improvement. Term queries and queries 
involving term queries on higher frequency terms get a reasonable latency 
reduction.

The following are P50 and P90 latencies calculated by Luceneutil. P50 Base is 
the P50 latency of the base, P50 Cmp is the P50 latency of the competitor 
(patched version), and the same for P90.

Note: the QPS jumps are not real. Since luceneutil was configured to run a 
single searcher thread, the QPS jump is proportional to the latency drop for 
each task.

Luceneutil results:

{{||Task ('Wildcard', None)||P50 Base 9.993697||P50 Cmp 11.906981||Pct 
19.1449070349||P90 Base 14.431318||P90 Cmp 13.953923|| Pct -3.3080485095}}
{{||Task ('HighTermDayOfYearSort', 'DayOfYear')||P50 Base 39.556908||P50 Cmp 
44.389095||Pct 12.2157854198||P90 Base 62.421873||P90 Cmp 49.214184|| Pct 
-21.1587515165}}
{{||Task ('AndHighHigh', None)||P50 Base 3.814074||P50 Cmp 2.459326||Pct 
-35.5197093711||P90 Base 5.045984||P90 Cmp 7.932029|| Pct 57.1948900353}}
{{||Task ('OrHighHigh', None)||P50 Base 9.586193||P50 Cmp 5.846643||Pct 
-39.0097507947||P90 Base 14.978843||P90 Cmp 7.078967|| Pct -52.7402283341}}
{{||Task ('MedPhrase', None)||P50 Base 3.210464||P50 Cmp 2.276356||Pct 
-29.0957319565||P90 Base 4.217049||P90 Cmp 3.852337|| Pct -8.64851226533}}
{{||Task ('LowSpanNear', None)||P50 Base 11.247447||P50 Cmp 4.986828||Pct 
-55.6625783611||P90 Base 16.095342||P90 Cmp 6.121194|| Pct -61.9691585305}}
{{||Task ('Fuzzy2', None)||P50 Base 23.636902||P50 Cmp 20.959304||Pct 
-11.3280412128||P90 Base 112.5086||P90 Cmp 105.188025|| Pct -6.50668037821}}
{{||Task ('OrNotHighHigh', None)||P50 Base 4.225917||P50 Cmp 2.62127||Pct 
-37.9715692476||P90 Base 6.11225||P90 Cmp 8.525249|| Pct 39.4780809031}}
{{||Task ('OrHighNotLow', None)||P50 Base 4.015982||P50 Cmp 2.250697||Pct 
-43.9564968

[jira] [Updated] (LUCENE-8675) Divide Segment Search Amongst Multiple Threads

2019-04-18 Thread Atri Sharma (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Atri Sharma updated LUCENE-8675:

Attachment: TermHighFreqP90.png
TermHighFreqP50.png
PhraseHighFreqP90.png
PhraseHighFreqP50.png

> Divide Segment Search Amongst Multiple Threads
> --
>
> Key: LUCENE-8675
> URL: https://issues.apache.org/jira/browse/LUCENE-8675
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/search
>Reporter: Atri Sharma
>Priority: Major
> Attachments: PhraseHighFreqP50.png, PhraseHighFreqP90.png, 
> TermHighFreqP50.png, TermHighFreqP90.png
>
>
> Segment search is a single threaded operation today, which can be a 
> bottleneck for large analytical queries which index a lot of data and have 
> complex queries which touch multiple segments (imagine a composite query with 
> range query and filters on top). This ticket is for discussing the idea of 
> splitting a single segment into multiple threads based on mutually exclusive 
> document ID ranges.
> This will be a two phase effort, the first phase targeting queries returning 
> all matching documents (collectors not terminating early). The second phase 
> patch will introduce staged execution and will build on top of this patch.
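
A minimal sketch of the doc ID partitioning idea described above, using plain 
arithmetic rather than any Lucene API:

{code:java}
// Sketch: split a segment's doc ID space [0, maxDoc) into `parts` contiguous,
// mutually exclusive ranges, one per worker thread. Each range is {from, to}
// with an exclusive upper bound.
final class DocIdRanges {
  static int[][] partition(int maxDoc, int parts) {
    int[][] ranges = new int[parts][2];
    int chunk = (maxDoc + parts - 1) / parts; // ceiling division
    for (int i = 0; i < parts; i++) {
      int from = Math.min(i * chunk, maxDoc);
      int to = Math.min(maxDoc, from + chunk);
      ranges[i][0] = from;
      ranges[i][1] = to;
    }
    return ranges;
  }
}
{code}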






[jira] [Commented] (LUCENE-8770) BlockMaxConjunctionScorer should support two-phase scorers

2019-04-18 Thread Atri Sharma (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16821253#comment-16821253
 ] 

Atri Sharma commented on LUCENE-8770:
-

+1.

A big win would be BKD Scorer. How difficult do you think that effort would be?

> BlockMaxConjunctionScorer should support two-phase scorers
> --
>
> Key: LUCENE-8770
> URL: https://issues.apache.org/jira/browse/LUCENE-8770
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Jim Ferenczi
>Priority: Minor
>
> The support for two-phase scorers in BlockMaxConjunctionScorer is missing. 
> This can slow down some queries that need to execute costly second phase on 
> more documents.
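
For readers following along, a generic sketch of what honouring a two-phase 
scorer looks like: drive the cheap approximation first and only call matches() 
(the costly second phase) on surviving candidates. This is a simplified 
illustration, not BlockMaxConjunctionScorer's actual code.

{code:java}
import java.io.IOException;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.search.Scorer;
import org.apache.lucene.search.TwoPhaseIterator;

final class TwoPhaseSketch {
  static int countMatches(Scorer scorer) throws IOException {
    TwoPhaseIterator twoPhase = scorer.twoPhaseIterator();
    if (twoPhase == null) {
      // No second phase: iterate the scorer directly.
      DocIdSetIterator it = scorer.iterator();
      int count = 0;
      while (it.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) {
        count++;
      }
      return count;
    }
    DocIdSetIterator approximation = twoPhase.approximation();
    int count = 0;
    for (int doc = approximation.nextDoc();
         doc != DocIdSetIterator.NO_MORE_DOCS;
         doc = approximation.nextDoc()) {
      if (twoPhase.matches()) { // costly verification deferred to the second phase
        count++;
      }
    }
    return count;
  }
}
{code}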






[jira] [Created] (LUCENE-8963) Allow Collectors To "Publish" If They Can Be Used In Concurrent Search

2019-09-04 Thread Atri Sharma (Jira)
Atri Sharma created LUCENE-8963:
---

 Summary: Allow Collectors To "Publish" If They Can Be Used In 
Concurrent Search
 Key: LUCENE-8963
 URL: https://issues.apache.org/jira/browse/LUCENE-8963
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Atri Sharma


There is an implied assumption today that all we need to run a query 
concurrently is a CollectorManager implementation. While that is true, there 
might be some corner cases where a Collector's semantics do not allow it to be 
concurrently executed (think of ES's aggregates). If a user manages to write a 
CollectorManager with a Collector that is not really concurrent friendly, we 
could end up in an undefined state.

 

This Jira is more of an open-ended discussion, to explore whether we should 
allow Collectors to implement an API which simply returns a boolean signifying 
whether a Collector is parallel-ready or not. The default would be true, unless 
a Collector explicitly overrides it.
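
In code, the proposal roughly amounts to something like the following; the 
interface and method name are hypothetical, not an existing Lucene API.

{code:java}
import org.apache.lucene.search.Collector;

// Hypothetical marker API: a Collector advertises whether its semantics are
// safe for concurrent (per-slice) collection followed by a reduce step.
interface ConcurrencyAwareCollector extends Collector {
  // Default is true, as proposed above; a Collector with single-threaded
  // assumptions would override this to return false.
  default boolean supportsConcurrentCollection() {
    return true;
  }
}
{code}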






[jira] [Commented] (LUCENE-8963) Allow Collectors To "Publish" If They Can Be Used In Concurrent Search

2019-09-04 Thread Atri Sharma (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16922322#comment-16922322
 ] 

Atri Sharma commented on LUCENE-8963:
-

Yeah, I agree.

 

My only gripe is that if a collector is not really reducible, or has some 
semantic constraints against concurrency, we do not provide any defense against 
getting into an unknown state.

Maybe it is not an engine problem but more of a user issue; I wanted to raise 
the point and see if there are any thoughts about it.

> Allow Collectors To "Publish" If They Can Be Used In Concurrent Search
> --
>
> Key: LUCENE-8963
> URL: https://issues.apache.org/jira/browse/LUCENE-8963
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Atri Sharma
>Priority: Major
>
> There is an implied assumption today that all we need to run a query 
> concurrently is a CollectorManager implementation. While that is true, there 
> might be some corner cases where a Collector's semantics do not allow it to 
> be concurrently executed (think of ES's aggregates). If a user manages to 
> write a CollectorManager with a Collector that is not really concurrent 
> friendly, we could end up in an undefined state.
>  
> This Jira is more of an open-ended discussion, to explore whether we should 
> allow Collectors to implement an API which simply returns a boolean 
> signifying whether a Collector is parallel-ready or not. The default would be 
> true, unless a Collector explicitly overrides it.






[jira] [Created] (LUCENE-8970) TopFieldCollector(s) Should Prepopulate Sentinel Objects

2019-09-06 Thread Atri Sharma (Jira)
Atri Sharma created LUCENE-8970:
---

 Summary: TopFieldCollector(s) Should Prepopulate Sentinel Objects
 Key: LUCENE-8970
 URL: https://issues.apache.org/jira/browse/LUCENE-8970
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Atri Sharma


We do not prepopulate the hit queue with sentinel values today, which leads to 
extra checks and extra code.
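
For readers unfamiliar with the trick, a small sketch of what prepopulating with 
sentinels buys, using a plain java.util.PriorityQueue of scores rather than 
Lucene's hit queue:

{code:java}
import java.util.PriorityQueue;

// Sketch: fill the hit queue with "worst possible" sentinel scores up front.
// Every later insertion is then a plain compare-against-bottom plus replace,
// with no "is the queue full yet?" branch on the hot collection path.
final class SentinelQueueDemo {
  public static void main(String[] args) {
    int numHits = 3;
    PriorityQueue<Float> pq = new PriorityQueue<>(numHits);
    for (int i = 0; i < numHits; i++) {
      pq.add(Float.NEGATIVE_INFINITY); // sentinel: worse than any real hit
    }
    float[] scores = {0.3f, 2.1f, 0.9f, 1.7f};
    for (float score : scores) {
      if (score > pq.peek()) { // the bottom is always defined, no size check needed
        pq.poll();
        pq.add(score);
      }
    }
    System.out.println(pq); // the numHits competitive scores
  }
}
{code}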






[jira] [Commented] (LUCENE-8970) TopFieldCollector(s) Should Prepopulate Sentinel Objects

2019-09-10 Thread Atri Sharma (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16926480#comment-16926480
 ] 

Atri Sharma commented on LUCENE-8970:
-

I did a prototype of this -- it is a bit hairy since, unlike TopDocsCollector, 
TopFieldCollector does not directly perform comparisons against the bottom but 
instead uses FieldComparator to do the job. The problem is that a 
FieldComparator could maintain its own internal queue, which would also need to 
be set to sentinel values if the hit queue is prepopulated. This works well for 
straightforward implementations, but it can be an issue for comparators like 
RelevanceComparator, which do not use the passed-in slot and instead depend on 
the presence of the scorer instance to generate the doc to be placed.

I wonder if it is worth exposing a prePopulate API in FieldComparator which does 
what it advertises -- allows prepopulating the internal structure used for 
maintaining docID mappings.

> TopFieldCollector(s) Should Prepopulate Sentinel Objects
> 
>
> Key: LUCENE-8970
> URL: https://issues.apache.org/jira/browse/LUCENE-8970
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Atri Sharma
>Priority: Major
>
> We do not prepopulate the hit queue with sentinel values today, which leads 
> to extra checks and extra code.






[jira] [Created] (LUCENE-8974) Shared Bottom Score Based Early Termination For Concurrent Search

2019-09-10 Thread Atri Sharma (Jira)
Atri Sharma created LUCENE-8974:
---

 Summary: Shared Bottom Score Based Early Termination For 
Concurrent Search
 Key: LUCENE-8974
 URL: https://issues.apache.org/jira/browse/LUCENE-8974
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Atri Sharma


Following up to LUCENE-8939, post collection of numHits, we should share a 
bottom score which can be used to globally filter hits and choose competitive 
hits






[jira] [Commented] (LUCENE-7282) search APIs should take advantage of index sort by default

2019-09-10 Thread Atri Sharma (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-7282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16926845#comment-16926845
 ] 

Atri Sharma commented on LUCENE-7282:
-

I think LUCENE-7714 does a similar thing for range queries. However, I don't 
think we do this optimisation for exact queries yet (I might be mistaken though; 
[~jtibshirani], any thoughts here?).

> search APIs should take advantage of index sort by default
> --
>
> Key: LUCENE-7282
> URL: https://issues.apache.org/jira/browse/LUCENE-7282
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Michael McCandless
>Priority: Major
>
> Spinoff from LUCENE-6766, where we made it very easy to have Lucene sort 
> documents in the index (at merge time).
> An index-time sort is powerful because if you then search that index by the 
> same sort (or by a "prefix" of it), you can early-terminate per segment once 
> you've collected enough hits.  But doing this by default would mean accepting 
> an approximate hit count, and could not be used in cases that need to see 
> every hit, e.g. if you are also faceting.
> Separately, `TermQuery` on the leading sort field can be very fast since we 
> can advance to the first docID, and only match to the last docID for the 
> requested value.  This would not be approximate, and should be lower risk / 
> easier.
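
As a side note for readers, per-segment early termination on a sorted index is 
typically expressed by throwing CollectionTerminatedException from a leaf 
collector; a heavily simplified sketch (not Lucene's actual collector code):

{code:java}
import org.apache.lucene.search.CollectionTerminatedException;

// Sketch: when the index sort matches the query sort, every doc after the
// first `numHits` collected in a segment is guaranteed to sort worse, so the
// segment can be abandoned early.
final class EarlyTerminationSketch {
  private final int numHits;
  private int collected;

  EarlyTerminationSketch(int numHits) {
    this.numHits = numHits;
  }

  void collect(int doc) {
    // ... record the hit for `doc` ...
    if (++collected >= numHits) {
      // Lucene's search loop catches this and moves on to the next segment.
      throw new CollectionTerminatedException();
    }
  }
}
{code}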






[jira] [Created] (LUCENE-8978) "Max Bottom" Based Early Termination For Concurrent Search

2019-09-12 Thread Atri Sharma (Jira)
Atri Sharma created LUCENE-8978:
---

 Summary: "Max Bottom" Based Early Termination For Concurrent Search
 Key: LUCENE-8978
 URL: https://issues.apache.org/jira/browse/LUCENE-8978
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Atri Sharma


When running a search concurrently, collectors which have collected the number 
of hits requested locally (i.e. their local priority queue is full) can publish 
their bottom hit's score globally, and other collectors can then use that score 
as a filter. If multiple collectors have full priority queues, the maximum of 
all bottom scores will be considered the global bottom score.
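
A bare-bones sketch of the shared "max bottom" described above (illustrative 
only, not the actual implementation):

{code:java}
// Each collector whose local priority queue is full publishes its bottom score.
// Taking the max follows the description above: once some collector already
// holds numHits hits at or above that score, lower-scoring hits cannot enter
// the merged top hits, so the shared value is a safe filter.
final class GlobalBottomScore {
  private float globalBottom = Float.NEGATIVE_INFINITY;

  // Called by a collector once its local priority queue is full.
  synchronized void publish(float localBottom) {
    globalBottom = Math.max(globalBottom, localBottom);
  }

  // Consulted by collectors when deciding whether a hit is competitive.
  synchronized float get() {
    return globalBottom;
  }
}
{code}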






[jira] [Commented] (LUCENE-8978) "Max Bottom" Based Early Termination For Concurrent Search

2019-09-13 Thread Atri Sharma (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16929030#comment-16929030
 ] 

Atri Sharma commented on LUCENE-8978:
-

||Task ('HighSpanNear', None)||P50 Base 11.060489||P50 Cmp 11.859525||Pct Diff 
7.22423755405||P90 Base 15.826127||P90 Cmp 15.409751||Pct Diff 
-2.63094059589||P99 Base 17.0499||P99 Cmp 15.787728||Pct Diff 
-7.4028117467||P999 Base 17.0499||P999 Cmp 15.787728||Pct Diff 
-7.4028117467||P100 Base 369.613225||P100 Cmp 411.489965||Pct Diff 11.3298813916
||Task ('BrowseDayOfYearSSDVFacets', None)||P50 Base 26.011344||P50 Cmp 
25.870156||Pct Diff -0.542793944058||P90 Base 27.199846||P90 Cmp 26.948776||Pct 
Diff -0.923056696718||P99 Base 50.355332||P99 Cmp 62.389047||Pct Diff 
23.8975983715||P999 Base 50.355332||P999 Cmp 62.389047||Pct Diff 
23.8975983715||P100 Base 265.301527||P100 Cmp 242.147844||Pct Diff 
-8.72730860686
||Task ('HighTermDayOfYearSort', 'DayOfYear')||P50 Base 4.855392||P50 Cmp 
5.073211||Pct Diff 4.48612593999||P90 Base 91.615585||P90 Cmp 90.944365||Pct 
Diff -0.73264827158||P99 Base 139.177491||P99 Cmp 134.249562||Pct Diff 
-3.54075142797||P999 Base 139.177491||P999 Cmp 134.249562||Pct Diff 
-3.54075142797||P100 Base 413.078905||P100 Cmp 399.62664||Pct Diff 
-3.25658484061
||Task ('IntNRQ', None)||P50 Base 4.003539||P50 Cmp 4.117275||Pct Diff 
2.84088652565||P90 Base 68.282386||P90 Cmp 67.613176||Pct Diff 
-0.980062413168||P99 Base 168.038952||P99 Cmp 162.14838||Pct Diff 
-3.50548008655||P999 Base 168.038952||P999 Cmp 162.14838||Pct Diff 
-3.50548008655||P100 Base 183.270534||P100 Cmp 180.209181||Pct Diff 
-1.67040109132
||Task ('LowTerm', None)||P50 Base 0.736588||P50 Cmp 0.802246||Pct Diff 
8.91380255991||P90 Base 1.433158||P90 Cmp 9.655967||Pct Diff 573.754533694||P99 
Base 9.67953||P99 Cmp 41.953847||Pct Diff 333.428554899||P999 Base 
9.67953||P999 Cmp 41.953847||Pct Diff 333.428554899||P100 Base 57.585597||P100 
Cmp 212.693297||Pct Diff 269.351553306
||Task ('AndHighLow', None)||P50 Base 1.54347||P50 Cmp 1.634274||Pct Diff 
5.88310754339||P90 Base 2.434604||P90 Cmp 3.283687||Pct Diff 34.8756101608||P99 
Base 3.374315||P99 Cmp 10.557446||Pct Diff 212.8767172||P999 Base 
3.374315||P999 Cmp 10.557446||Pct Diff 212.8767172||P100 Base 41.81324||P100 
Cmp 50.963314||Pct Diff 21.8831977622
||Task ('MedTerm', None)||P50 Base 0.89585||P50 Cmp 0.944529||Pct Diff 
5.43383378914||P90 Base 1.404803||P90 Cmp 1.912129||Pct Diff 36.1136757254||P99 
Base 1.721718||P99 Cmp 2.879041||Pct Diff 67.2190800119||P999 Base 
1.721718||P999 Cmp 2.879041||Pct Diff 67.2190800119||P100 Base 57.913331||P100 
Cmp 6.156178||Pct Diff -89.3700156878
||Task ('AndHighHigh', None)||P50 Base 9.298414||P50 Cmp 9.193083||Pct Diff 
-1.13278458025||P90 Base 17.43996||P90 Cmp 28.767063||Pct Diff 
64.9491340576||P99 Base 29.387967||P99 Cmp 36.807631||Pct Diff 
25.2472857343||P999 Base 29.387967||P999 Cmp 36.807631||Pct Diff 
25.2472857343||P100 Base 109.854089||P100 Cmp 107.673127||Pct Diff 
-1.98532619027
||Task ('LowSloppyPhrase', None)||P50 Base 5.680762||P50 Cmp 5.562709||Pct Diff 
-2.0781190974||P90 Base 10.573096||P90 Cmp 8.783411||Pct Diff 
-16.9267828458||P99 Base 11.119536||P99 Cmp 10.675304||Pct Diff 
-3.99505878663||P999 Base 11.119536||P999 Cmp 10.675304||Pct Diff 
-3.99505878663||P100 Base 279.186923||P100 Cmp 253.176147||Pct Diff 
-9.3166168818
||Task ('Wildcard', None)||P50 Base 5.493537||P50 Cmp 5.347662||Pct Diff 
-2.65539305551||P90 Base 251.824224||P90 Cmp 242.036414||Pct Diff 
-3.88676269682||P99 Base 410.472925||P99 Cmp 411.681977||Pct Diff 
0.294550974343||P999 Base 410.472925||P999 Cmp 411.681977||Pct Diff 
0.294550974343||P100 Base 473.53058||P100 Cmp 467.82275||Pct Diff -1.20537727468
||Task ('HighSloppyPhrase', None)||P50 Base 11.728682||P50 Cmp 11.905609||Pct 
Diff 1.50849856787||P90 Base 78.56345||P90 Cmp 23.156508||Pct Diff 
-70.5250876839||P99 Base 165.526231||P99 Cmp 24.095868||Pct Diff 
-85.4428703811||P999 Base 165.526231||P999 Cmp 24.095868||Pct Diff 
-85.4428703811||P100 Base 239.459867||P100 Cmp 154.765063||Pct Diff 
-35.369101746
||Task ('HighIntervalsOrdered', None)||P50 Base 18.723819||P50 Cmp 
19.239293||Pct Diff 2.75303878979||P90 Base 20.32576||P90 Cmp 20.59||Pct 
Diff 2.22377416638||P99 Base 21.323183||P99 Cmp 21.997505||Pct Diff 
3.16238902982||P999 Base 21.323183||P999 Cmp 21.997505||Pct Diff 
3.16238902982||P100 Base 365.748746||P100 Cmp 306.958046||Pct Diff 
-16.0740674146
||Task ('HighTerm', None)||P50 Base 0.982074||P50 Cmp 1.08638||Pct Diff 
10.6209919008||P90 Base 1.859062||P90 Cmp 4.64411||Pct Diff 149.809312438||P99 
Base 2.090176||P99 Cmp 25.399617||Pct Diff 1115.19034761||P999 Base 
2.090176||P999 Cmp 25.399617||Pct Diff 1115.19034761||P100 Base 4.26937||P100 
Cmp 54.324505||Pct Diff 1172.4243858
||Task ('BrowseDayOfYearTaxoFacets', None)||P50 Base 0.111432||P50 Cmp 
0.116611||Pct Diff 4.64767750736||P90 Base 0.177541||P90 Cm

[jira] [Commented] (LUCENE-8978) "Max Bottom" Based Early Termination For Concurrent Search

2019-09-13 Thread Atri Sharma (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16929277#comment-16929277
 ] 

Atri Sharma commented on LUCENE-8978:
-

Run with propagating global minimum scores

||Task ('HighSpanNear', None)||P50 Base 63.640386||P50 Cmp 65.506369||Pct Diff 
2.93207366781||P90 Base 68.082931||P90 Cmp 68.719427||Pct Diff 
0.9348833704||P99 Base 98.544661||P99 Cmp 90.023821||Pct Diff 
-8.64667848418||P999 Base 98.544661||P999 Cmp 90.023821||Pct Diff 
-8.64667848418||P100 Base 120.85372||P100 Cmp 115.88214||Pct Diff -4.1137169795
||Task ('BrowseDayOfYearSSDVFacets', None)||P50 Base 25.833619||P50 Cmp 
25.713409||Pct Diff -0.465323886677||P90 Base 28.549801||P90 Cmp 35.187339||Pct 
Diff 23.2489816654||P99 Base 34.097888||P99 Cmp 61.883127||Pct Diff 
81.4866862135||P999 Base 34.097888||P999 Cmp 61.883127||Pct Diff 
81.4866862135||P100 Base 214.305793||P100 Cmp 275.876451||Pct Diff 28.7302816868
||Task ('HighTermDayOfYearSort', 'DayOfYear')||P50 Base 4.600415||P50 Cmp 
5.241538||Pct Diff 13.9361992342||P90 Base 54.632331||P90 Cmp 41.589045||Pct 
Diff -23.8746649855||P99 Base 140.777103||P99 Cmp 113.980705||Pct Diff 
-19.0346280957||P999 Base 140.777103||P999 Cmp 113.980705||Pct Diff 
-19.0346280957||P100 Base 212.259622||P100 Cmp 232.746881||Pct Diff 
9.65198128922
||Task ('HighTerm', None)||P50 Base 0.707935||P50 Cmp 0.767744||Pct Diff 
8.44837449766||P90 Base 2.481444||P90 Cmp 2.45366||Pct Diff -1.11967064338||P99 
Base 2.819463||P99 Cmp 3.250364||Pct Diff 15.283087595||P999 Base 
2.819463||P999 Cmp 3.250364||Pct Diff 15.283087595||P100 Base 5.743958||P100 
Cmp 67.726682||Pct Diff 1079.09431093
||Task ('LowTerm', None)||P50 Base 0.662316||P50 Cmp 0.730491||Pct Diff 
10.2934248908||P90 Base 1.215188||P90 Cmp 3.100794||Pct Diff 155.169899637||P99 
Base 10.361147||P99 Cmp 8.509808||Pct Diff -17.8680893148||P999 Base 
10.361147||P999 Cmp 8.509808||Pct Diff -17.8680893148||P100 Base 
40.860202||P100 Cmp 43.746191||Pct Diff 7.06308059857
||Task ('AndHighLow', None)||P50 Base 1.000578||P50 Cmp 1.001309||Pct Diff 
0.0730577726074||P90 Base 1.841719||P90 Cmp 1.74311||Pct Diff 
-5.35418269562||P99 Base 2.803872||P99 Cmp 7.829637||Pct Diff 
179.243738659||P999 Base 2.803872||P999 Cmp 7.829637||Pct Diff 
179.243738659||P100 Base 8.888941||P100 Cmp 26.286796||Pct Diff 195.724721314
||Task ('MedTerm', None)||P50 Base 0.702324||P50 Cmp 0.760572||Pct Diff 
8.29360807832||P90 Base 1.789433||P90 Cmp 5.539351||Pct Diff 209.559005562||P99 
Base 4.193817||P99 Cmp 14.309771||Pct Diff 241.211144883||P999 Base 
4.193817||P999 Cmp 14.309771||Pct Diff 241.211144883||P100 Base 12.924386||P100 
Cmp 69.040778||Pct Diff 434.190003301
||Task ('AndHighHigh', None)||P50 Base 8.716311||P50 Cmp 8.766923||Pct Diff 
0.580658491878||P90 Base 22.896812||P90 Cmp 14.794421||Pct Diff 
-35.3865463891||P99 Base 76.380162||P99 Cmp 27.420985||Pct Diff 
-64.0993364219||P999 Base 76.380162||P999 Cmp 27.420985||Pct Diff 
-64.0993364219||P100 Base 192.565741||P100 Cmp 209.282678||Pct Diff 
8.68115839982
||Task ('LowSloppyPhrase', None)||P50 Base 2.504543||P50 Cmp 2.496497||Pct Diff 
-0.321256213209||P90 Base 5.864326||P90 Cmp 17.025432||Pct Diff 
190.322059176||P99 Base 17.061955||P99 Cmp 26.972014||Pct Diff 
58.0827871132||P999 Base 17.061955||P999 Cmp 26.972014||Pct Diff 
58.0827871132||P100 Base 28.311233||P100 Cmp 38.382978||Pct Diff 35.5750842784
||Task ('Wildcard', None)||P50 Base 4.622608||P50 Cmp 4.604615||Pct Diff 
-0.389239148117||P90 Base 13.902747||P90 Cmp 9.311908||Pct Diff 
-33.0210928819||P99 Base 212.077852||P99 Cmp 217.640103||Pct Diff 
2.62274016242||P999 Base 212.077852||P999 Cmp 217.640103||Pct Diff 
2.62274016242||P100 Base 256.120499||P100 Cmp 348.976972||Pct Diff 36.254994568
||Task ('HighSloppyPhrase', None)||P50 Base 40.021589||P50 Cmp 40.71495||Pct 
Diff 1.73246744401||P90 Base 41.349646||P90 Cmp 42.092274||Pct Diff 
1.7959718446||P99 Base 43.137416||P99 Cmp 63.876883||Pct Diff 
48.0776757699||P999 Base 43.137416||P999 Cmp 63.876883||Pct Diff 
48.0776757699||P100 Base 889.481117||P100 Cmp 748.568262||Pct Diff 
-15.8421412559
||Task ('HighIntervalsOrdered', None)||P50 Base 17.065112||P50 Cmp 
17.259941||Pct Diff 1.1416801718||P90 Base 18.188702||P90 Cmp 18.965857||Pct 
Diff 4.27273479988||P99 Base 18.315874||P99 Cmp 50.189647||Pct Diff 
174.022670171||P999 Base 18.315874||P999 Cmp 50.189647||Pct Diff 
174.022670171||P100 Base 302.418464||P100 Cmp 329.078973||Pct Diff 8.81576761133
||Task ('IntNRQ', None)||P50 Base 4.603492||P50 Cmp 5.553211||Pct Diff 
20.6304040498||P90 Base 61.351885||P90 Cmp 61.48353||Pct Diff 
0.214573684248||P99 Base 164.30294||P99 Cmp 163.250118||Pct Diff 
-0.640780986634||P999 Base 164.30294||P999 Cmp 163.250118||Pct Diff 
-0.640780986634||P100 Base 224.633428||P100 Cmp 224.348545||Pct Diff 
-0.126821285032
||Task ('BrowseDayOfYearTaxoFacets', None)||P50 Base 0.121258||P50 Cmp 
0.121229||Pct Dif

[jira] [Commented] (LUCENE-8978) "Max Bottom" Based Early Termination For Concurrent Search

2019-09-13 Thread Atri Sharma (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16929280#comment-16929280
 ] 

Atri Sharma commented on LUCENE-8978:
-

Both runs are for wikimedium2m with concurrent searching enabled.

> "Max Bottom" Based Early Termination For Concurrent Search
> --
>
> Key: LUCENE-8978
> URL: https://issues.apache.org/jira/browse/LUCENE-8978
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Atri Sharma
>Priority: Major
>  Time Spent: 4h 50m
>  Remaining Estimate: 0h
>
> When running a search concurrently, collectors which have collected the 
> number of hits requested locally i.e. their local priority queue is full can 
> then globally publish their bottom hit's score, and other collectors can then 
> use that score as the filter. If multiple collectors have full priority 
> queues, the maximum of all bottom scores will be considered as the global 
> bottom score.






[jira] [Commented] (LUCENE-8881) Query.rewrite Should Move To QueryVisitor

2019-06-27 Thread Atri Sharma (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16873984#comment-16873984
 ] 

Atri Sharma commented on LUCENE-8881:
-

[~romseygeek] Agreed; however, we could use QueryVisitor's recursion mechanism 
to get query-specific rewrites done (please see my PR adding metadata state to 
QueryVisitor). We could add a boolean property such as DO_REWRITE=true, fire a 
visitor, and have each query check for that property.

My main point is that it seems incorrect for two independent query tree 
traversal mechanisms to exist. This Jira is primarily opened to trade thoughts 
on that front, and maybe see if we can draw a common baseline between the two 
existing mechanisms. WDYT?
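
For reference, QueryVisitor's recursive traversal in its current form looks like 
this; the example just counts leaf queries (the DO_REWRITE idea above is 
hypothetical and not shown).

{code:java}
import org.apache.lucene.search.Query;
import org.apache.lucene.search.QueryVisitor;

// QueryVisitor already walks the query tree recursively; this visitor merely
// counts leaf queries to show the traversal the comment refers to.
final class LeafCounter extends QueryVisitor {
  int leaves;

  @Override
  public void visitLeaf(Query query) {
    leaves++;
  }

  static int countLeaves(Query query) {
    LeafCounter counter = new LeafCounter();
    query.visit(counter);
    return counter.leaves;
  }
}
{code}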

> Query.rewrite Should Move To QueryVisitor
> -
>
> Key: LUCENE-8881
> URL: https://issues.apache.org/jira/browse/LUCENE-8881
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Atri Sharma
>Priority: Major
>
> Now that we have QueryVisitor, the rewrite functionality should belong there, 
> since rewrite is essentially a recursive visitation of underlying queries, 
> which sounds exactly as what QueryVisitor is designed to be.






[jira] [Created] (LUCENE-8889) Remove Dead Code From PointRangeQuery

2019-06-27 Thread Atri Sharma (JIRA)
Atri Sharma created LUCENE-8889:
---

 Summary: Remove Dead Code From PointRangeQuery
 Key: LUCENE-8889
 URL: https://issues.apache.org/jira/browse/LUCENE-8889
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Atri Sharma


PointRangeQuery has accessors for the underlying points in the query but those 
are never accessed. We should remove them






[jira] [Commented] (LUCENE-8889) Remove Dead Code From PointRangeQuery

2019-06-27 Thread Atri Sharma (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16874002#comment-16874002
 ] 

Atri Sharma commented on LUCENE-8889:
-

[~jim.ferenczi] Call me old school, but I believe that APIs should have at least 
one user within the library code base (for purely external-facing APIs, tests 
are the way to go, as you suggested).

I have raised a PR to beef up the equality tests using the said API; let me know 
if it looks fine.

> Remove Dead Code From PointRangeQuery
> -
>
> Key: LUCENE-8889
> URL: https://issues.apache.org/jira/browse/LUCENE-8889
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Atri Sharma
>Priority: Minor
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> PointRangeQuery has accessors for the underlying points in the query but 
> those are never accessed. We should remove them






[jira] [Commented] (LUCENE-8862) Collector Level Dynamic Memory Accounting

2019-06-27 Thread Atri Sharma (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16874013#comment-16874013
 ] 

Atri Sharma commented on LUCENE-8862:
-

Updated the PR with latest comments and moved to misc module. Happy to iterate 
further.

> Collector Level Dynamic Memory Accounting
> -
>
> Key: LUCENE-8862
> URL: https://issues.apache.org/jira/browse/LUCENE-8862
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Atri Sharma
>Priority: Major
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> Inspired from LUCENE-8855, I am thinking of adding a new interface which 
> tracks dynamic memory used by Collectors. This shall allow users to get an 
> accountability as to the memory usage of their Collectors and better plan 
> their resource capacity. This shall also allow us to add Collector level 
> limits for memory usage, thus allowing users a finer control over their 
> resources.
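
As an aside, a minimal sketch of what per-collector accounting could look like 
if it reuses Lucene's existing Accountable/RamUsageEstimator utilities; the 
class here is invented for illustration and is not part of the PR.

{code:java}
import java.util.Arrays;
import org.apache.lucene.util.Accountable;
import org.apache.lucene.util.RamUsageEstimator;

// Sketch: a buffer owned by a Collector that reports its dynamic memory usage.
final class TrackedDocBuffer implements Accountable {
  private int[] collectedDocs = new int[0];

  void grow(int newSize) {
    collectedDocs = Arrays.copyOf(collectedDocs, newSize);
  }

  @Override
  public long ramBytesUsed() {
    return RamUsageEstimator.sizeOf(collectedDocs);
  }
}
{code}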






[jira] [Commented] (LUCENE-8857) Refactor TopDocs#Merge To Take In Custom Tie Breakers

2019-06-27 Thread Atri Sharma (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16874125#comment-16874125
 ] 

Atri Sharma commented on LUCENE-8857:
-

Updated the PR with latest comments, removing merge functionality as well. 
Happy to iterate further

> Refactor TopDocs#Merge To Take In Custom Tie Breakers
> -
>
> Key: LUCENE-8857
> URL: https://issues.apache.org/jira/browse/LUCENE-8857
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Atri Sharma
>Priority: Major
> Attachments: LUCENE-8857.patch, LUCENE-8857.patch, LUCENE-8857.patch, 
> LUCENE-8857.patch, LUCENE-8857.patch
>
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> In LUCENE-8829, the idea of having lambdas passed in to the API to allow 
> finer control over the process was discussed.
> This JIRA tracks adding a parameter to the API which allows passing in 
> lambdas to define custom tie breakers, thus allowing users to do custom 
> algorithms when required.
> CC: [~jpountz]  [~simonw] 
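
To illustrate the kind of lambda being discussed, here is a sketch of a tie 
breaker expressed as a Comparator over ScoreDoc; whether the merge API takes 
exactly this type is up to the patch.

{code:java}
import java.util.Comparator;
import org.apache.lucene.search.ScoreDoc;

// Example tie breaker: when two hits compare equal on their sort values,
// prefer the hit from the lower shard, then the lower doc ID.
final class TieBreakers {
  static final Comparator<ScoreDoc> SHARD_THEN_DOC =
      Comparator.<ScoreDoc>comparingInt(d -> d.shardIndex).thenComparingInt(d -> d.doc);
}
{code}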






[jira] [Commented] (LUCENE-8857) Refactor TopDocs#Merge To Take In Custom Tie Breakers

2019-07-01 Thread Atri Sharma (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16876026#comment-16876026
 ] 

Atri Sharma commented on LUCENE-8857:
-

Should we push the latest iteration on the PR, if it looks fine?

> Refactor TopDocs#Merge To Take In Custom Tie Breakers
> -
>
> Key: LUCENE-8857
> URL: https://issues.apache.org/jira/browse/LUCENE-8857
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Atri Sharma
>Priority: Major
> Attachments: LUCENE-8857.patch, LUCENE-8857.patch, LUCENE-8857.patch, 
> LUCENE-8857.patch, LUCENE-8857.patch
>
>  Time Spent: 2h 50m
>  Remaining Estimate: 0h
>
> In LUCENE-8829, the idea of having lambdas passed in to the API to allow 
> finer control over the process was discussed.
> This JIRA tracks adding a parameter to the API which allows passing in 
> lambdas to define custom tie breakers, thus allowing users to do custom 
> algorithms when required.
> CC: [~jpountz]  [~simonw] 






[jira] [Commented] (LUCENE-8896) Override default implementation of IntersectVisitor#visit(DocIDSetBuilder, byte[]) for several queries

2019-07-01 Thread Atri Sharma (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16876037#comment-16876037
 ] 

Atri Sharma commented on LUCENE-8896:
-

Does PointRangeQuery not already have its custom intersects implementation?

> Override default implementation of IntersectVisitor#visit(DocIDSetBuilder, 
> byte[]) for several queries
> --
>
> Key: LUCENE-8896
> URL: https://issues.apache.org/jira/browse/LUCENE-8896
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Ignacio Vera
>Priority: Major
>
> In LUCENE-8885, it was introduced a new method on the {{IntersectsVisitor}} 
> interface. It contains a default implementation but queries can override it 
> and therefore benefit when there are several documents on a leaf associated 
> to the same point.
> In this issue the following queries are proposed to override the default 
> implementation
> * LatLonShapeQuery
> * RangeFieldQuery
> * LatLonPointInPolygonQuery
> * LatLonPointDistanceQuery
> * PointRangeQuery
> * PointInSetQuery






[jira] [Created] (LUCENE-8897) Allow Callbacks For Events In Collectors/ CollectorManagers

2019-07-01 Thread Atri Sharma (JIRA)
Atri Sharma created LUCENE-8897:
---

 Summary: Allow Callbacks For Events In Collectors/ 
CollectorManagers
 Key: LUCENE-8897
 URL: https://issues.apache.org/jira/browse/LUCENE-8897
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Atri Sharma


It would be good to allow Collectors and CollectorManagers to register callbacks 
for specific events (such as the collection of N doc IDs across all Collectors 
of a CollectorManager). This would allow things like more accurate early 
termination.






[jira] [Commented] (LUCENE-8862) Collector Level Dynamic Memory Accounting

2019-07-01 Thread Atri Sharma (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16876130#comment-16876130
 ] 

Atri Sharma commented on LUCENE-8862:
-

[~jpountz] Thanks for pushing and reviewing!

> Collector Level Dynamic Memory Accounting
> -
>
> Key: LUCENE-8862
> URL: https://issues.apache.org/jira/browse/LUCENE-8862
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Atri Sharma
>Priority: Major
>  Time Spent: 3h 20m
>  Remaining Estimate: 0h
>
> Inspired from LUCENE-8855, I am thinking of adding a new interface which 
> tracks dynamic memory used by Collectors. This shall allow users to get an 
> accountability as to the memory usage of their Collectors and better plan 
> their resource capacity. This shall also allow us to add Collector level 
> limits for memory usage, thus allowing users a finer control over their 
> resources.






[jira] [Commented] (LUCENE-8857) Refactor TopDocs#Merge To Take In Custom Tie Breakers

2019-07-01 Thread Atri Sharma (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16876218#comment-16876218
 ] 

Atri Sharma commented on LUCENE-8857:
-

[~jpountz] Thanks for committing and reviewing, [~simonw] Thanks for your 
constructive inputs!

> Refactor TopDocs#Merge To Take In Custom Tie Breakers
> -
>
> Key: LUCENE-8857
> URL: https://issues.apache.org/jira/browse/LUCENE-8857
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Atri Sharma
>Priority: Major
> Fix For: master (9.0)
>
> Attachments: LUCENE-8857.patch, LUCENE-8857.patch, LUCENE-8857.patch, 
> LUCENE-8857.patch, LUCENE-8857.patch
>
>  Time Spent: 4h 10m
>  Remaining Estimate: 0h
>
> In LUCENE-8829, the idea of having lambdas passed in to the API to allow 
> finer control over the process was discussed.
> This JIRA tracks adding a parameter to the API which allows passing in 
> lambdas to define custom tie breakers, thus allowing users to do custom 
> algorithms when required.
> CC: [~jpountz]  [~simonw] 






[jira] [Commented] (LUCENE-8857) Refactor TopDocs#Merge To Take In Custom Tie Breakers

2019-07-01 Thread Atri Sharma (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16876253#comment-16876253
 ] 

Atri Sharma commented on LUCENE-8857:
-

I did – I was not able to see any failures (probably due to seeds?). I will try 
with the seed in your command now.

> Refactor TopDocs#Merge To Take In Custom Tie Breakers
> -
>
> Key: LUCENE-8857
> URL: https://issues.apache.org/jira/browse/LUCENE-8857
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Atri Sharma
>Priority: Major
> Fix For: master (9.0)
>
> Attachments: LUCENE-8857.patch, LUCENE-8857.patch, LUCENE-8857.patch, 
> LUCENE-8857.patch, LUCENE-8857.patch
>
>  Time Spent: 4h 20m
>  Remaining Estimate: 0h
>
> In LUCENE-8829, the idea of having lambdas passed in to the API to allow 
> finer control over the process was discussed.
> This JIRA tracks adding a parameter to the API which allows passing in 
> lambdas to define custom tie breakers, thus allowing users to do custom 
> algorithms when required.
> CC: [~jpountz]  [~simonw] 






[jira] [Commented] (LUCENE-8857) Refactor TopDocs#Merge To Take In Custom Tie Breakers

2019-07-01 Thread Atri Sharma (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16876335#comment-16876335
 ] 

Atri Sharma commented on LUCENE-8857:
-

[~jpountz] I investigated this and it turned out to be a test limitation 
(testGrouping assumed that TopDocs.merge was setting the shard indices). It took 
a while to reproduce since it was the random test which was failing (thanks for 
providing the seed!). I have fixed the test and ran ant test a couple of times 
-- it came in clean:

 

Can we push this in now?

 
{code:java}
[junit4:tophints]  49.54s | 
org.apache.lucene.search.suggest.document.TestSuggestField
[junit4:tophints]  21.55s | 
org.apache.lucene.search.suggest.analyzing.FuzzySuggesterTest
[junit4:tophints]  21.51s | 
org.apache.lucene.search.suggest.DocumentDictionaryTest
[junit4:tophints]  15.45s | org.apache.lucene.search.spell.TestSpellChecker

-check-totals:

common.test:

-check-totals:

test:

BUILD SUCCESSFUL
Total time: 49 minutes 49 seconds
f01898a404cf:lucene atris$ {code}
 

[~munendrasn] I am not too aware of Solr's internals, but looking at the error 
you pointed to, it looks like the test is not setting shard indices or hit 
indices. This points to an assumption in the test -- that TopDocs.merge is 
setting the shard indices. Can you check
{code:java}
search/grouping/distributed/responseprocessor/TopGroupsShardResponseProcessor.java{code}
where the TopDocs.merge call is done? We can set shard indices for all TopHits 
based on the QueryCommandResult they come from.
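
For clarity, a sketch of the suggestion above: stamp each shard's hits with its 
shard index before handing the per-shard TopDocs to TopDocs.merge, since merge 
itself no longer sets them. The helper class is hypothetical; the Lucene fields 
used are real.

{code:java}
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;

final class ShardIndexStamper {
  // perShard[i] holds the hits returned by shard i.
  static void stamp(TopDocs[] perShard) {
    for (int shard = 0; shard < perShard.length; shard++) {
      for (ScoreDoc hit : perShard[shard].scoreDocs) {
        hit.shardIndex = shard;
      }
    }
  }
}
{code}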

 

> Refactor TopDocs#Merge To Take In Custom Tie Breakers
> -
>
> Key: LUCENE-8857
> URL: https://issues.apache.org/jira/browse/LUCENE-8857
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Atri Sharma
>Priority: Major
> Fix For: master (9.0)
>
> Attachments: LUCENE-8857.patch, LUCENE-8857.patch, LUCENE-8857.patch, 
> LUCENE-8857.patch, LUCENE-8857.patch
>
>  Time Spent: 4h 20m
>  Remaining Estimate: 0h
>
> In LUCENE-8829, the idea of having lambdas passed in to the API to allow 
> finer control over the process was discussed.
> This JIRA tracks adding a parameter to the API which allows passing in 
> lambdas to define custom tie breakers, thus allowing users to do custom 
> algorithms when required.
> CC: [~jpountz]  [~simonw] 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8857) Refactor TopDocs#Merge To Take In Custom Tie Breakers

2019-07-01 Thread Atri Sharma (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16876378#comment-16876378
 ] 

Atri Sharma commented on LUCENE-8857:
-

[~jpountz] Ran ant test 5 times again: all came in clean:

 

I have raised a new PR with testGrouping fixes: 
[https://github.com/apache/lucene-solr/pull/757]

 

Can we merge it, if it looks fine?
{code:java}
[junit4:tophints]  54.39s | 
org.apache.lucene.search.suggest.document.TestSuggestField
[junit4:tophints]  16.93s | 
org.apache.lucene.search.suggest.DocumentDictionaryTest
[junit4:tophints]  16.63s | 
org.apache.lucene.search.suggest.analyzing.FuzzySuggesterTest
[junit4:tophints]  16.42s | 
org.apache.lucene.search.suggest.fst.FSTCompletionTest

-check-totals:

common.test:

-check-totals:

test:

BUILD SUCCESSFUL
Total time: 45 minutes 8 seconds
f01898a404cf:lucene atris$ {code}

> Refactor TopDocs#Merge To Take In Custom Tie Breakers
> -
>
> Key: LUCENE-8857
> URL: https://issues.apache.org/jira/browse/LUCENE-8857
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Atri Sharma
>Priority: Major
> Fix For: master (9.0)
>
> Attachments: LUCENE-8857.patch, LUCENE-8857.patch, LUCENE-8857.patch, 
> LUCENE-8857.patch, LUCENE-8857.patch
>
>  Time Spent: 4.5h
>  Remaining Estimate: 0h
>
> In LUCENE-8829, the idea of having lambdas passed in to the API to allow 
> finer control over the process was discussed.
> This JIRA tracks adding a parameter to the API which allows passing in 
> lambdas to define custom tie breakers, thus allowing users to do custom 
> algorithms when required.
> CC: [~jpountz]  [~simonw] 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Created] (SOLR-13597) TopGroups Should Respect the API in Lucene's TopDocs.merge

2019-07-01 Thread Atri Sharma (JIRA)
Atri Sharma created SOLR-13597:
--

 Summary: TopGroups Should Respect the API in Lucene's TopDocs.merge
 Key: SOLR-13597
 URL: https://issues.apache.org/jira/browse/SOLR-13597
 Project: Solr
  Issue Type: Improvement
  Security Level: Public (Default Security Level. Issues are Public)
Reporter: Atri Sharma


In LUCENE-8857, TopDocs.merge loses the ability to set shard indices, so 
callers have to set shard indices themselves before calling merge, or use docID 
based tie breaker.

 

TopGroups uses this now non-existent capability of Lucene, hence the 
corresponding tests break. This Jira tracks the effort to fix TopGroups to 
respect the new API, and should be merged after LUCENE-8857 is merged.
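
For illustration only, here is a minimal sketch of what such a caller could do 
under the new API, assuming the plain TopDocs.merge(int, TopDocs[]) overload; 
the class and method names below are made up for the example.
{code:java}
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;

// Hypothetical helper: assign shard ordinals explicitly before merging,
// instead of relying on TopDocs.merge to set them.
public final class ShardMergeSketch {
  public static TopDocs mergeWithExplicitShardIndices(int topN, TopDocs[] shardHits) {
    for (int shard = 0; shard < shardHits.length; shard++) {
      for (ScoreDoc scoreDoc : shardHits[shard].scoreDocs) {
        scoreDoc.shardIndex = shard; // caller-assigned shard ordinal
      }
    }
    return TopDocs.merge(topN, shardHits);
  }
}
{code}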



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8857) Refactor TopDocs#Merge To Take In Custom Tie Breakers

2019-07-01 Thread Atri Sharma (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16876692#comment-16876692
 ] 

Atri Sharma commented on LUCENE-8857:
-

I have opened https://issues.apache.org/jira/browse/SOLR-13597 to track fixes 
to Solr to use the new API (that is what is causing the Solr test to fail). I 
will raise a PR for that Jira post the merging of this PR.

> Refactor TopDocs#Merge To Take In Custom Tie Breakers
> -
>
> Key: LUCENE-8857
> URL: https://issues.apache.org/jira/browse/LUCENE-8857
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Atri Sharma
>Priority: Major
> Fix For: master (9.0)
>
> Attachments: LUCENE-8857.patch, LUCENE-8857.patch, LUCENE-8857.patch, 
> LUCENE-8857.patch, LUCENE-8857.patch
>
>  Time Spent: 4.5h
>  Remaining Estimate: 0h
>
> In LUCENE-8829, the idea of having lambdas passed in to the API to allow 
> finer control over the process was discussed.
> This JIRA tracks adding a parameter to the API which allows passing in 
> lambdas to define custom tie breakers, thus allowing users to do custom 
> algorithms when required.
> CC: [~jpountz]  [~simonw] 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8857) Refactor TopDocs#Merge To Take In Custom Tie Breakers

2019-07-01 Thread Atri Sharma (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16876699#comment-16876699
 ] 

Atri Sharma commented on LUCENE-8857:
-

[~jpountz] Yes, we will. I did not want to add the fix for Solr in this PR 
since that would muddle things up (the change would span two modules). I can 
raise a separate PR just for the Solr fixes, though, if that works.

> Refactor TopDocs#Merge To Take In Custom Tie Breakers
> -
>
> Key: LUCENE-8857
> URL: https://issues.apache.org/jira/browse/LUCENE-8857
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Atri Sharma
>Priority: Major
> Fix For: master (9.0)
>
> Attachments: LUCENE-8857.patch, LUCENE-8857.patch, LUCENE-8857.patch, 
> LUCENE-8857.patch, LUCENE-8857.patch
>
>  Time Spent: 4.5h
>  Remaining Estimate: 0h
>
> In LUCENE-8829, the idea of having lambdas passed in to the API to allow 
> finer control over the process was discussed.
> This JIRA tracks adding a parameter to the API which allows passing in 
> lambdas to define custom tie breakers, thus allowing users to do custom 
> algorithms when required.
> CC: [~jpountz]  [~simonw] 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8857) Refactor TopDocs#Merge To Take In Custom Tie Breakers

2019-07-01 Thread Atri Sharma (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16876708#comment-16876708
 ] 

Atri Sharma commented on LUCENE-8857:
-

Ok, updating the PR now.

> Refactor TopDocs#Merge To Take In Custom Tie Breakers
> -
>
> Key: LUCENE-8857
> URL: https://issues.apache.org/jira/browse/LUCENE-8857
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Atri Sharma
>Priority: Major
> Fix For: master (9.0)
>
> Attachments: LUCENE-8857-compile-fix.patch, LUCENE-8857.patch, 
> LUCENE-8857.patch, LUCENE-8857.patch, LUCENE-8857.patch, LUCENE-8857.patch
>
>  Time Spent: 4.5h
>  Remaining Estimate: 0h
>
> In LUCENE-8829, the idea of having lambdas passed in to the API to allow 
> finer control over the process was discussed.
> This JIRA tracks adding a parameter to the API which allows passing in 
> lambdas to define custom tie breakers, thus allowing users to do custom 
> algorithms when required.
> CC: [~jpountz]  [~simonw] 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8857) Refactor TopDocs#Merge To Take In Custom Tie Breakers

2019-07-01 Thread Atri Sharma (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16876711#comment-16876711
 ] 

Atri Sharma commented on LUCENE-8857:
-

[~munendrasn] Thanks for the compilation fix.

Yes, the test will fail. I fixed that test failure – I will update the PR once 
my local test suite run completes.

> Refactor TopDocs#Merge To Take In Custom Tie Breakers
> -
>
> Key: LUCENE-8857
> URL: https://issues.apache.org/jira/browse/LUCENE-8857
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Atri Sharma
>Priority: Major
> Fix For: master (9.0)
>
> Attachments: LUCENE-8857-compile-fix.patch, LUCENE-8857.patch, 
> LUCENE-8857.patch, LUCENE-8857.patch, LUCENE-8857.patch, LUCENE-8857.patch
>
>  Time Spent: 4.5h
>  Remaining Estimate: 0h
>
> In LUCENE-8829, the idea of having lambdas passed in to the API to allow 
> finer control over the process was discussed.
> This JIRA tracks adding a parameter to the API which allows passing in 
> lambdas to define custom tie breakers, thus allowing users to do custom 
> algorithms when required.
> CC: [~jpountz]  [~simonw] 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8882) Add State To QueryVisitor

2019-07-02 Thread Atri Sharma (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16876754#comment-16876754
 ] 

Atri Sharma commented on LUCENE-8882:
-

My idea was not to replace IndexOrDocValues, but to allow it to be more 
generally applicable.

 

For example, take the specific case of the optimized query which is applicable 
only in the limited cases where the index is sorted: we would ideally be better 
off using that query over point values (even though that query is a doc-values 
based implementation). However, the query is too specialized for 
IndexOrDocValues to factor in.

 

What I was envisioning was a state where, at the start of the query, 
IndexSearcher creates a QueryVisitor, sees that the index is sorted by key X, 
and populates a property in the QueryVisitor's metadata (INDEX_SORTED_KEY=X).

 

IndexOrDocValuesQuery, then, instead of making an immediate decision as to 
whether to use Points or DocValues, passes the visitor on to both of the 
branches. Further down the line, the sorted-index query type will see the 
metadata in the visitor and volunteer itself (by adding another property to the 
visitor's metadata, e.g. SORTED_PLAN_AVAILABLE=true).

 

In the end, IndexOrDocValues will perform an evaluation, which includes the 
costing that it does today plus the metadata state gathered from both branches, 
and then choose the branch to execute. This will allow new query types for 
specific use cases to be added easily (just add a new property type and a 
listener query for it), and let the engine make better decisions as to when to 
execute which queries, which can potentially lead to better query performance.

 

Thoughts?
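
To make the above concrete, here is a minimal, self-contained sketch of the 
property-bag idea; PropertyBagVisitor, INDEX_SORTED_KEY and 
SORTED_PLAN_AVAILABLE are hypothetical names used only for illustration and are 
not part of the existing QueryVisitor API.
{code:java}
import java.util.HashMap;
import java.util.Map;

// Hypothetical property bag carried by a visitor: the searcher seeds it with
// index-level facts, branches add hints, and the wrapping query consults the
// accumulated state before choosing a plan.
final class PropertyBagVisitor {
  static final String INDEX_SORTED_KEY = "index.sorted.key";           // set by the searcher
  static final String SORTED_PLAN_AVAILABLE = "sorted.plan.available"; // set by a branch

  private final Map<String, Object> metadata = new HashMap<>();

  void put(String key, Object value) { metadata.put(key, value); }
  Object get(String key) { return metadata.get(key); }
}

final class PlanChoiceSketch {
  static String choosePlan(String indexSortField, String querySortField) {
    PropertyBagVisitor visitor = new PropertyBagVisitor();

    // Searcher side: record that the index is sorted by indexSortField.
    visitor.put(PropertyBagVisitor.INDEX_SORTED_KEY, indexSortField);

    // Branch side: the specialized sorted-index query volunteers itself if the sort matches.
    if (querySortField.equals(visitor.get(PropertyBagVisitor.INDEX_SORTED_KEY))) {
      visitor.put(PropertyBagVisitor.SORTED_PLAN_AVAILABLE, Boolean.TRUE);
    }

    // Wrapper side: fold the gathered hints into the usual cost-based decision.
    return Boolean.TRUE.equals(visitor.get(PropertyBagVisitor.SORTED_PLAN_AVAILABLE))
        ? "doc-values (sorted index) plan"
        : "points plan";
  }
}
{code}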

> Add State To QueryVisitor
> -
>
> Key: LUCENE-8882
> URL: https://issues.apache.org/jira/browse/LUCENE-8882
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Atri Sharma
>Priority: Major
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> QueryVisitor has no state passed in either up or down recursion. This limits 
> the width of decisions that can be taken by visitation of QueryVisitor. For 
> eg, for LUCENE-8881, we need a way to specify is the visitor is a rewriter 
> visitor.
>  
> This Jira proposes adding a property bag model to QueryVisitor, which can 
> then be referred to by the Query instance being visited by QueryVisitor.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8857) Refactor TopDocs#Merge To Take In Custom Tie Breakers

2019-07-02 Thread Atri Sharma (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16876759#comment-16876759
 ] 

Atri Sharma commented on LUCENE-8857:
-

[~jpountz] I have pushed the latest iteration to the new PR. It passes ant test:

 
{code:java}
 
[junit4:tophints]  59.58s | 
org.apache.lucene.search.suggest.document.TestSuggestField
[junit4:tophints]  17.10s | 
org.apache.lucene.search.suggest.DocumentDictionaryTest
[junit4:tophints]  14.56s | 
org.apache.lucene.search.suggest.fst.FSTCompletionTest
[junit4:tophints]  14.21s | 
org.apache.lucene.search.suggest.analyzing.FuzzySuggesterTest

-check-totals:

common.test:

-check-totals:

test:

BUILD SUCCESSFUL
Total time: 74 minutes 29 seconds
f01898a404cf:lucene atris$
{code}
 

 

It also passes the offending Solr test:

 
ant test  -Dtestcase=TestDistributedGrouping -Dtests.method=test 
-Dtests.seed=B5D95BEAE23E9468 -Dtests.slow=true -Dtests.badapples=true 
-Dtests.locale=nl-AW -Dtests.timezone=Asia/Jayapura -Dtests.asserts=true 
-Dtests.file.encoding=UTF-8
 
{code:java}
27429 INFO  (closeThreadPool-74-thread-4) [ ] o.e.j.s.AbstractConnector 
Stopped ServerConnector@3e6caf50{HTTP/1.1,[http/1.1, h2c]}{127.0.0.1:0}
27430 INFO  (closeThreadPool-74-thread-4) [ ] o.e.j.s.h.ContextHandler 
Stopped o.e.j.s.ServletContextHandler@169e0265{/,null,UNAVAILABLE}
27430 INFO  (closeThreadPool-74-thread-4) [ ] o.e.j.s.session node0 Stopped 
scavenging
27431 INFO  (closeThreadPool-74-thread-1) [ ] o.e.j.s.AbstractConnector 
Stopped ServerConnector@1be02e89{HTTP/1.1,[http/1.1, h2c]}{127.0.0.1:0}
27431 INFO  (closeThreadPool-74-thread-1) [ ] o.e.j.s.h.ContextHandler 
Stopped o.e.j.s.ServletContextHandler@6b6f3dda{/,null,UNAVAILABLE}
27432 INFO  (closeThreadPool-74-thread-1) [ ] o.e.j.s.session node0 Stopped 
scavenging
27432 INFO  (closeThreadPool-74-thread-5) [ ] o.e.j.s.AbstractConnector 
Stopped ServerConnector@4052b482{HTTP/1.1,[http/1.1, h2c]}{127.0.0.1:0}
27432 INFO  (closeThreadPool-74-thread-5) [ ] o.e.j.s.h.ContextHandler 
Stopped o.e.j.s.ServletContextHandler@7063254f{/,null,UNAVAILABLE}
27432 INFO  (closeThreadPool-74-thread-5) [ ] o.e.j.s.session node0 Stopped 
scavenging

27436 INFO  (SUITE-TestDistributedGrouping-seed#[C817F4DEFFC8F2A7]-worker) [
 ] o.a.s.SolrTestCaseJ4 --- 
Done waiting for tracked resources to be released{code}
 

 

> Refactor TopDocs#Merge To Take In Custom Tie Breakers
> -
>
> Key: LUCENE-8857
> URL: https://issues.apache.org/jira/browse/LUCENE-8857
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Atri Sharma
>Priority: Major
> Fix For: master (9.0)
>
> Attachments: LUCENE-8857-compile-fix.patch, LUCENE-8857.patch, 
> LUCENE-8857.patch, LUCENE-8857.patch, LUCENE-8857.patch, LUCENE-8857.patch
>
>  Time Spent: 4.5h
>  Remaining Estimate: 0h
>
> In LUCENE-8829, the idea of having lambdas passed in to the API to allow 
> finer control over the process was discussed.
> This JIRA tracks adding a parameter to the API which allows passing in 
> lambdas to define custom tie breakers, thus allowing users to do custom 
> algorithms when required.
> CC: [~jpountz]  [~simonw] 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8857) Refactor TopDocs#Merge To Take In Custom Tie Breakers

2019-07-02 Thread Atri Sharma (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16876773#comment-16876773
 ] 

Atri Sharma commented on LUCENE-8857:
-

JFYI, the latest iteration of the PR also fixes the compilation failure in 
Solr introduced in SOLR-13404.

> Refactor TopDocs#Merge To Take In Custom Tie Breakers
> -
>
> Key: LUCENE-8857
> URL: https://issues.apache.org/jira/browse/LUCENE-8857
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Atri Sharma
>Priority: Major
> Fix For: master (9.0)
>
> Attachments: LUCENE-8857-compile-fix.patch, LUCENE-8857.patch, 
> LUCENE-8857.patch, LUCENE-8857.patch, LUCENE-8857.patch, LUCENE-8857.patch
>
>  Time Spent: 4.5h
>  Remaining Estimate: 0h
>
> In LUCENE-8829, the idea of having lambdas passed in to the API to allow 
> finer control over the process was discussed.
> This JIRA tracks adding a parameter to the API which allows passing in 
> lambdas to define custom tie breakers, thus allowing users to do custom 
> algorithms when required.
> CC: [~jpountz]  [~simonw] 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Created] (LUCENE-8899) Implementation of MultiTermQuery for ORed Queries

2019-07-02 Thread Atri Sharma (JIRA)
Atri Sharma created LUCENE-8899:
---

 Summary: Implementation of MultiTermQuery for ORed Queries
 Key: LUCENE-8899
 URL: https://issues.apache.org/jira/browse/LUCENE-8899
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Atri Sharma


While working on multi range queries, I realised that it would be good to 
specialize for cases where all clauses in a query are ORed together. 
MultiTermQuery springs to mind, when all terms are basically disjuncted.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8857) Refactor TopDocs#Merge To Take In Custom Tie Breakers

2019-07-02 Thread Atri Sharma (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16876965#comment-16876965
 ] 

Atri Sharma commented on LUCENE-8857:
-

Since this is a breaking API change, is there a way we can highlight this to 
existing users in a "louder" manner, or is a MIGRATE.txt entry enough?

> Refactor TopDocs#Merge To Take In Custom Tie Breakers
> -
>
> Key: LUCENE-8857
> URL: https://issues.apache.org/jira/browse/LUCENE-8857
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Atri Sharma
>Priority: Major
> Fix For: master (9.0)
>
> Attachments: LUCENE-8857-compile-fix.patch, LUCENE-8857.patch, 
> LUCENE-8857.patch, LUCENE-8857.patch, LUCENE-8857.patch, LUCENE-8857.patch
>
>  Time Spent: 4.5h
>  Remaining Estimate: 0h
>
> In LUCENE-8829, the idea of having lambdas passed in to the API to allow 
> finer control over the process was discussed.
> This JIRA tracks adding a parameter to the API which allows passing in 
> lambdas to define custom tie breakers, thus allowing users to do custom 
> algorithms when required.
> CC: [~jpountz]  [~simonw] 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Resolved] (SOLR-13597) TopGroups Should Respect the API in Lucene's TopDocs.merge

2019-07-02 Thread Atri Sharma (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-13597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Atri Sharma resolved SOLR-13597.

Resolution: Not A Problem

This can be done at the Lucene level itself, given Solr's usage pattern for 
TopDocs.merge.

> TopGroups Should Respect the API in Lucene's TopDocs.merge
> --
>
> Key: SOLR-13597
> URL: https://issues.apache.org/jira/browse/SOLR-13597
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public (Default Security Level. Issues are Public) 
>Reporter: Atri Sharma
>Priority: Major
>
> In LUCENE-8857, TopDocs.merge loses the ability to set shard indices, so 
> callers have to set shard indices themselves before calling merge, or use 
> docID based tie breaker.
>  
> TopGroups uses this non existent capability of Lucene, hence the 
> corresponding tests break. This Jira tracks the efforts to fix TopGroups to 
> respect the new API, and should be merged post merge of LUCENE-8857



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8899) Implementation of MultiTermQuery for ORed Queries

2019-07-02 Thread Atri Sharma (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16877154#comment-16877154
 ] 

Atri Sharma commented on LUCENE-8899:
-

The way I am thinking of this is by using the fact that 
MultiTermQueryConstantScoreWrapper will always convert to a BooleanQuery with 
each clause as SHOULD, so it should be a simple matter to reuse that logic. The 
main change will be the introduction of a new TermsEnum implementation which 
can filter the input terms based on a filter built from the terms list given in 
the query.

 

Does this seem like a reasonable approach?
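
As a toy illustration of the filtering idea (using plain Java strings rather 
than Lucene's TermsEnum/BytesRef API, so none of the names below are real 
Lucene classes):
{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Toy sketch: walk the terms of a field and keep only those present in the
// query's term set, mirroring what a filtering TermsEnum would do.
final class TermFilterSketch {
  static List<String> filterTerms(List<String> fieldTerms, Set<String> queryTerms) {
    List<String> matches = new ArrayList<>();
    for (String term : fieldTerms) {
      if (queryTerms.contains(term)) {
        matches.add(term);
      }
    }
    return matches;
  }

  public static void main(String[] args) {
    System.out.println(filterTerms(
        List.of("apple", "banana", "cherry", "date"),
        Set.of("banana", "date"))); // [banana, date]
  }
}
{code}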

> Implementation of MultiTermQuery for ORed Queries
> -
>
> Key: LUCENE-8899
> URL: https://issues.apache.org/jira/browse/LUCENE-8899
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Atri Sharma
>Priority: Major
>
> While working on multi range queries, I realised that it would be good to 
> specialize for cases where all clauses in a query are ORed together. 
> MultiTermQuery springs to mind, when all terms are basically disjuncted.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8857) Refactor TopDocs#Merge To Take In Custom Tie Breakers

2019-07-02 Thread Atri Sharma (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16877156#comment-16877156
 ] 

Atri Sharma commented on LUCENE-8857:
-

[~jpountz] Thanks for confirming. I wanted to ensure that no unsuspecting user 
gets bitten :)

> Refactor TopDocs#Merge To Take In Custom Tie Breakers
> -
>
> Key: LUCENE-8857
> URL: https://issues.apache.org/jira/browse/LUCENE-8857
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Atri Sharma
>Priority: Major
> Fix For: master (9.0)
>
> Attachments: LUCENE-8857-compile-fix.patch, LUCENE-8857.patch, 
> LUCENE-8857.patch, LUCENE-8857.patch, LUCENE-8857.patch, LUCENE-8857.patch
>
>  Time Spent: 4.5h
>  Remaining Estimate: 0h
>
> In LUCENE-8829, the idea of having lambdas passed in to the API to allow 
> finer control over the process was discussed.
> This JIRA tracks adding a parameter to the API which allows passing in 
> lambdas to define custom tie breakers, thus allowing users to do custom 
> algorithms when required.
> CC: [~jpountz]  [~simonw] 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8899) Implementation of MultiTermQuery for ORed Queries

2019-07-02 Thread Atri Sharma (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16877219#comment-16877219
 ] 

Atri Sharma commented on LUCENE-8899:
-

Hmm, true. I was thinking of a query type just for the disjunctive case, but it 
looks like TermInSetQuery already covers it.

 

Thanks for pointing it out!

> Implementation of MultiTermQuery for ORed Queries
> -
>
> Key: LUCENE-8899
> URL: https://issues.apache.org/jira/browse/LUCENE-8899
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Atri Sharma
>Priority: Major
>
> While working on multi range queries, I realised that it would be good to 
> specialize for cases where all clauses in a query are ORed together. 
> MultiTermQuery springs to mind, when all terms are basically disjuncted.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Resolved] (LUCENE-8899) Implementation of MultiTermQuery for ORed Queries

2019-07-02 Thread Atri Sharma (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Atri Sharma resolved LUCENE-8899.
-
Resolution: Not A Problem

> Implementation of MultiTermQuery for ORed Queries
> -
>
> Key: LUCENE-8899
> URL: https://issues.apache.org/jira/browse/LUCENE-8899
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Atri Sharma
>Priority: Major
>
> While working on multi range queries, I realised that it would be good to 
> specialize for cases where all clauses in a query are ORed together. 
> MultiTermQuery springs to mind, when all terms are basically disjuncted.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8857) Refactor TopDocs#Merge To Take In Custom Tie Breakers

2019-07-02 Thread Atri Sharma (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16877232#comment-16877232
 ] 

Atri Sharma commented on LUCENE-8857:
-

[~jpountz] Yes, I ran the Solr suite twice. The first time, there were failures 
due to the tracer not being able to close. The second time, the entire suite 
came in clean.

 

I also ran ant precommit – came in clean.

 

 

> Refactor TopDocs#Merge To Take In Custom Tie Breakers
> -
>
> Key: LUCENE-8857
> URL: https://issues.apache.org/jira/browse/LUCENE-8857
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Atri Sharma
>Priority: Major
> Fix For: master (9.0)
>
> Attachments: LUCENE-8857-compile-fix.patch, LUCENE-8857.patch, 
> LUCENE-8857.patch, LUCENE-8857.patch, LUCENE-8857.patch, LUCENE-8857.patch
>
>  Time Spent: 4.5h
>  Remaining Estimate: 0h
>
> In LUCENE-8829, the idea of having lambdas passed in to the API to allow 
> finer control over the process was discussed.
> This JIRA tracks adding a parameter to the API which allows passing in 
> lambdas to define custom tie breakers, thus allowing users to do custom 
> algorithms when required.
> CC: [~jpountz]  [~simonw] 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8762) Lucene50PostingsReader should specialize reading docs+freqs with impacts

2019-07-03 Thread Atri Sharma (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16877864#comment-16877864
 ] 

Atri Sharma commented on LUCENE-8762:
-

I will take a crack at this and post a patch soon.

> Lucene50PostingsReader should specialize reading docs+freqs with impacts
> 
>
> Key: LUCENE-8762
> URL: https://issues.apache.org/jira/browse/LUCENE-8762
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
>
> Currently if you ask for impacts, we only have one implementation that is 
> able to expose everything: docs, freqs, positions and offsets. In contrast, 
> if you don't need impacts, we have specialization for docs+freqs, 
> docs+freqs+positions and docs+freqs+positions+offsets.
> Maybe we should add specialization for the docs+freqs case with impacts, 
> which should be the most common case, and remove specialization for 
> docs+freqs+positions when impacts are not requested?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Resolved] (LUCENE-8794) Cost Based Slice Allocation Algorithm

2019-07-03 Thread Atri Sharma (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Atri Sharma resolved LUCENE-8794.
-
Resolution: Fixed

Merged to master

> Cost Based Slice Allocation Algorithm
> -
>
> Key: LUCENE-8794
> URL: https://issues.apache.org/jira/browse/LUCENE-8794
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Atri Sharma
>Priority: Major
>
> In https://issues.apache.org/jira/browse/LUCENE-8757, the idea of a cost 
> based and dynamically adjusting slice allocation algorithm was conceived. We 
> should ideally have a hard cap on the number of threads that can be consumed 
> by a single query, and have static cost factors associated with segments and 
> assign them to threads in a fair manner. We will also need to ensure that we do 
> not end up assigning individual threads to small segments, or making more 
> threads than needed (thread context switching could outweigh the benefits).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Resolved] (LUCENE-8829) TopDocs#Merge is Tightly Coupled To Number Of Collectors Involved

2019-07-03 Thread Atri Sharma (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Atri Sharma resolved LUCENE-8829.
-
Resolution: Fixed

Merged to master

> TopDocs#Merge is Tightly Coupled To Number Of Collectors Involved
> -
>
> Key: LUCENE-8829
> URL: https://issues.apache.org/jira/browse/LUCENE-8829
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Atri Sharma
>Priority: Major
> Attachments: LUCENE-8829.patch, LUCENE-8829.patch, LUCENE-8829.patch, 
> LUCENE-8829.patch
>
>
> While investigating LUCENE-8819, I understood that TopDocs#merge's order of 
> results are indirectly dependent on the number of collectors involved in the 
> merge. This is troubling because 1) The number of collectors involved in a 
> merge are cost based and directly dependent on the number of slices created 
> for the parallel searcher case. 2) TopN hits code path will invoke merge with 
> a single Collector, so essentially, doing the same TopN query with single 
> threaded and parallel threaded searcher will invoke different order of 
> results, which is a bad invariant that breaks.
>  
> The reason why this happens is because of the subtle way TopDocs#merge sets 
> shardIndex in the ScoreDoc population during populating the priority queue 
> used for merging. ShardIndex is essentially set to the ordinal of the 
> collector which generates the hit. This means that the shardIndex is 
> dependent on the number of collectors, even for the same set of hits.
>  
> In case of no sort order specified, shardIndex is used for tie breaking when 
> scores are equal. This translates to different orders for same hits with 
> different shardIndices.
>  
> I propose that we remove shardIndex from the default tie breaking mechanism 
> and replace it with docID. DocID order is the de facto that is expected 
> during collection, so it might make sense to use the same factor during tie 
> breaking when scores are the same.
>  
> CC: [~ivera]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Created] (LUCENE-8905) TopDocsCollector Should Have Better Error Handling For Illegal Arguments

2019-07-08 Thread Atri Sharma (JIRA)
Atri Sharma created LUCENE-8905:
---

 Summary: TopDocsCollector Should Have Better Error Handling For 
Illegal Arguments
 Key: LUCENE-8905
 URL: https://issues.apache.org/jira/browse/LUCENE-8905
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Atri Sharma


While writing some tests, I realised that TopDocsCollector does not behave well 
when illegal arguments are passed in (e.g. requesting more hits than the number 
of hits collected). Instead, we return a TopDocs instance with 0 hits.

 

This can be problematic when queries are being formed programmatically by 
applications: it can hide bugs where malformed queries return no hits, and that 
empty result is surfaced upstream to client applications.

 

I found a TODO at the relevant spot in the code, so I believe it is time to fix 
the problem and throw an IllegalArgumentException.
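
The kind of up-front validation I have in mind would look roughly like the 
following (a sketch only; the parameter names are illustrative and this is not 
the actual TopDocsCollector code):
{code:java}
// Hypothetical validation: reject an out-of-range request up front instead of
// silently returning an empty TopDocs.
final class TopDocsArgsCheck {
  static void checkRange(int howMany, int totalCollected) {
    if (howMany <= 0) {
      throw new IllegalArgumentException("howMany must be > 0, got " + howMany);
    }
    if (howMany > totalCollected) {
      throw new IllegalArgumentException(
          "requested " + howMany + " hits but only " + totalCollected + " were collected");
    }
  }
}
{code}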

 

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Created] (LUCENE-8950) FieldComparators Should Not Maintain Implicit PQs

2019-08-13 Thread Atri Sharma (JIRA)
Atri Sharma created LUCENE-8950:
---

 Summary: FieldComparators Should Not Maintain Implicit PQs
 Key: LUCENE-8950
 URL: https://issues.apache.org/jira/browse/LUCENE-8950
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Atri Sharma


While doing some perf tests, I realised that FieldComparators inherently 
maintain implicit priority queues in order to keep documents in the requested 
sort order. This is wasteful, especially in the case of a multi-feature sort 
order and a large number of requested hits.

 

We should change this to have FieldComparators maintain only the top and bottom 
values, and use them as barriers for comparisons.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8950) FieldComparators Should Not Maintain Implicit PQs

2019-08-14 Thread Atri Sharma (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16906978#comment-16906978
 ] 

Atri Sharma commented on LUCENE-8950:
-

I confess I do not have a very clean idea as to how this can be implemented: 
the typical usages of FieldComparator mandate that the user maintain a list of 
slots into the FieldComparator, which can implicitly be as bad in terms of size 
as the queue itself. FieldComparator provides a convenient API to allow 
comparisons between two values of the type maintained in the queue, which can 
form the basis of this observation.

 

Here is the first cut of proposal that I have in mind:

1) Deprecate compare(slot, slot) so that new implementations do not depend on 
this method, but rather use compare(T val, T val).

2) Start with some comparators (Numeric comparators?), get rid of the implicit 
priority queue and make the user maintain those values.

3) Make Numeric comparators track only the top and bottom values, as needed.

 

Note that I am treating NumericComparators as the starting point/example, but 
the approach should extend for other comparators as well.

 

With [https://github.com/apache/lucene-solr/pull/831], getting values out of 
leaf comparators should be easy, so the logical step after this PR is to depend 
on compare (val, val) more than we rely on compare (slot, slot).

 

Happy to receive feedback and alternate proposals
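
To illustrate what "maintain only the top and bottom values" could look like 
for a numeric comparator, here is a hypothetical sketch (not the current 
FieldComparator API; names are made up for the example):
{code:java}
// Hypothetical numeric comparator that keeps only the two barrier values
// instead of an implicit queue of slots.
final class MinimalLongComparatorSketch {
  private final boolean reverse;
  private Long top;     // value of the current top entry, if any
  private Long bottom;  // value of the current least competitive entry, if any

  MinimalLongComparatorSketch(boolean reverse) { this.reverse = reverse; }

  int compareValues(long a, long b) {
    int cmp = Long.compare(a, b);
    return reverse ? -cmp : cmp;
  }

  void setTop(long value) { top = value; }
  void setBottom(long value) { bottom = value; }

  // A new hit is competitive only if it beats the current bottom barrier.
  boolean isCompetitive(long value) {
    return bottom == null || compareValues(value, bottom) < 0;
  }

  boolean beatsTop(long value) {
    return top == null || compareValues(value, top) < 0;
  }
}
{code}
The caller (e.g. the collector's own queue) would then be responsible for 
tracking which documents actually occupy the top and bottom positions.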

> FieldComparators Should Not Maintain Implicit PQs
> -
>
> Key: LUCENE-8950
> URL: https://issues.apache.org/jira/browse/LUCENE-8950
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Atri Sharma
>Priority: Major
>
> While doing some perf tests, I realised that FieldComparators inherently 
> maintain implicit priority queues for maintaining the sorted order of 
> documents for the given sort order. This is wasteful especially in the case 
> of a multi feature sort order and a large number of hits requested.
>  
> We should change this to have FieldComparators maintain only the top and 
> bottom values, and use them as barriers to compare



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8950) FieldComparators Should Not Maintain Implicit PQs

2019-08-14 Thread Atri Sharma (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16907075#comment-16907075
 ] 

Atri Sharma commented on LUCENE-8950:
-

{quote}This looks like a duplicate of LUCENE-8878?
{quote}
Not necessarily – 8878 targets refactoring the API to be simpler, whereas this 
Jira only targets removing the requirement that FieldComparators maintain their 
own priority queues. I believe this Jira complements 8878.
{quote}I think all of us agree on the fact that it would be nice to have a 
simpler FieldComparator API. The challenge is that we don't want to trade too 
much efficiency. For instance the API you are proposing wouldn't work well with 
geo-distance sorting since it would require computing the actual distance for 
every new document, while the current implementation tries to be smart to first 
check a bounding box, and then compute a sort key that compares like the actual 
distance but is much cheaper to compute
{quote}
Agreed, that is precisely why I suggested deprecating compare (slot, slot) 
instead of removing it completely. The idea is that comparators that require 
access to an internal PQ for whatever reason are free to keep one, but it 
should not be mandatory, and future comparators should not take on this 
dependency without understanding the tradeoffs.

> FieldComparators Should Not Maintain Implicit PQs
> -
>
> Key: LUCENE-8950
> URL: https://issues.apache.org/jira/browse/LUCENE-8950
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Atri Sharma
>Priority: Major
>
> While doing some perf tests, I realised that FieldComparators inherently 
> maintain implicit priority queues for maintaining the sorted order of 
> documents for the given sort order. This is wasteful especially in the case 
> of a multi feature sort order and a large number of hits requested.
>  
> We should change this to have FieldComparators maintain only the top and 
> bottom values, and use them as barriers to compare



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8950) FieldComparators Should Not Maintain Implicit PQs

2019-08-14 Thread Atri Sharma (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16907093#comment-16907093
 ] 

Atri Sharma commented on LUCENE-8950:
-

{quote}you would like to introduce a sub class of FieldComparator that hides 
the fact that it maintains an implicit PQ, and make simple comparators extend 
this sub class instead of FieldComparator directly?
{quote}
Yes, exactly.

 

Thanks for validating – I will work on a PR now.

> FieldComparators Should Not Maintain Implicit PQs
> -
>
> Key: LUCENE-8950
> URL: https://issues.apache.org/jira/browse/LUCENE-8950
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Atri Sharma
>Priority: Major
>
> While doing some perf tests, I realised that FieldComparators inherently 
> maintain implicit priority queues for maintaining the sorted order of 
> documents for the given sort order. This is wasteful especially in the case 
> of a multi feature sort order and a large number of hits requested.
>  
> We should change this to have FieldComparators maintain only the top and 
> bottom values, and use them as barriers to compare



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8403) Support 'filtered' term vectors - don't require all terms to be present

2019-08-25 Thread Atri Sharma (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16915472#comment-16915472
 ] 

Atri Sharma commented on LUCENE-8403:
-

Any thoughts on this one?

> Support 'filtered' term vectors - don't require all terms to be present
> ---
>
> Key: LUCENE-8403
> URL: https://issues.apache.org/jira/browse/LUCENE-8403
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Michael Braun
>Priority: Minor
> Attachments: LUCENE-8403.patch
>
>
> The genesis of this was a conversation and idea from [~dsmiley] several years 
> ago.
> In order to optimize term vector storage, we may not actually need all tokens 
> to be present in the term vectors - and if so, ideally our codec could just 
> opt not to store them.
> I attempted to fork the standard codec and override the TermVectorsFormat and 
> TermVectorsWriter to ignore storing certain Terms within a field. This 
> worked, however, CheckIndex checks that the terms present in the standard 
> postings are also present in the TVs, if TVs enabled. So this then doesn't 
> work as 'valid' according to CheckIndex.
> Can the TermVectorsFormat be made in such a way to support configuration of 
> tokens that should not be stored (benefits: less storage, more optimal 
> retrieval per doc)? Is this valuable to the wider community? Is there a way 
> we can design this to not break CheckIndex's contract while at the same time 
> lessening storage for unneeded tokens?



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8403) Support 'filtered' term vectors - don't require all terms to be present

2019-08-26 Thread Atri Sharma (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16916404#comment-16916404
 ] 

Atri Sharma commented on LUCENE-8403:
-

Thanks for reviewing, David.

 

I did notice a CheckHits breakage on this patch – I was hoping to get some 
early feedback on the patch and then seek advice to solve the open problems.

 

Does it make sense for me to adapt the patch to support pattern based filtering?

 

RE: CheckHits fix, how about Hoss's idea to allow the TermVector codec to 
publish which terms are available?

> Support 'filtered' term vectors - don't require all terms to be present
> ---
>
> Key: LUCENE-8403
> URL: https://issues.apache.org/jira/browse/LUCENE-8403
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Michael Braun
>Priority: Minor
> Attachments: LUCENE-8403.patch
>
>
> The genesis of this was a conversation and idea from [~dsmiley] several years 
> ago.
> In order to optimize term vector storage, we may not actually need all tokens 
> to be present in the term vectors - and if so, ideally our codec could just 
> opt not to store them.
> I attempted to fork the standard codec and override the TermVectorsFormat and 
> TermVectorsWriter to ignore storing certain Terms within a field. This 
> worked, however, CheckIndex checks that the terms present in the standard 
> postings are also present in the TVs, if TVs enabled. So this then doesn't 
> work as 'valid' according to CheckIndex.
> Can the TermVectorsFormat be made in such a way to support configuration of 
> tokens that should not be stored (benefits: less storage, more optimal 
> retrieval per doc)? Is this valuable to the wider community? Is there a way 
> we can design this to not break CheckIndex's contract while at the same time 
> lessening storage for unneeded tokens?



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Created] (LUCENE-8958) Add Shared Count Based Concurrent Early Termination For TopScoreDocCollector

2019-08-27 Thread Atri Sharma (Jira)
Atri Sharma created LUCENE-8958:
---

 Summary: Add Shared Count Based Concurrent Early Termination For 
TopScoreDocCollector
 Key: LUCENE-8958
 URL: https://issues.apache.org/jira/browse/LUCENE-8958
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Atri Sharma


LUCENE-8939 implements a shared count early termination collector manager for 
indices sorted by non relevance fields. This Jira tracks efforts for 
implementing the same for TopScoreDocCollector when the index is sorted by 
relevance



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8403) Support 'filtered' term vectors - don't require all terms to be present

2019-08-29 Thread Atri Sharma (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16918358#comment-16918358
 ] 

Atri Sharma commented on LUCENE-8403:
-

David, sorry for the delay in response – this somehow was misplaced by my inbox.

 

 I get a NullPointerException when CheckIndex tries to validate term vectors.

 

I understand the approaches – your approach seems to be a longer term solution 
(I am not sure of the complexity implications though).

 

How do you suggest we approach this?

> Support 'filtered' term vectors - don't require all terms to be present
> ---
>
> Key: LUCENE-8403
> URL: https://issues.apache.org/jira/browse/LUCENE-8403
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Michael Braun
>Priority: Minor
> Attachments: LUCENE-8403.patch
>
>
> The genesis of this was a conversation and idea from [~dsmiley] several years 
> ago.
> In order to optimize term vector storage, we may not actually need all tokens 
> to be present in the term vectors - and if so, ideally our codec could just 
> opt not to store them.
> I attempted to fork the standard codec and override the TermVectorsFormat and 
> TermVectorsWriter to ignore storing certain Terms within a field. This 
> worked, however, CheckIndex checks that the terms present in the standard 
> postings are also present in the TVs, if TVs enabled. So this then doesn't 
> work as 'valid' according to CheckIndex.
> Can the TermVectorsFormat be made in such a way to support configuration of 
> tokens that should not be stored (benefits: less storage, more optimal 
> retrieval per doc)? Is this valuable to the wider community? Is there a way 
> we can design this to not break CheckIndex's contract while at the same time 
> lessening storage for unneeded tokens?



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Created] (LUCENE-8915) Allow RateLimiter To Have Dynamic Limits

2019-07-15 Thread Atri Sharma (JIRA)
Atri Sharma created LUCENE-8915:
---

 Summary: Allow RateLimiter To Have Dynamic Limits
 Key: LUCENE-8915
 URL: https://issues.apache.org/jira/browse/LUCENE-8915
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Atri Sharma


While working on multi range queries, I realised that it would be good to 
specialize for cases where all clauses in a query are ORed together. 
MultiTermQuery springs to mind, when all terms are basically disjuncted.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-8915) Allow RateLimiter To Have Dynamic Limits

2019-07-15 Thread Atri Sharma (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Atri Sharma updated LUCENE-8915:

Description: RateLimiter does not allow dynamic configuration of the rate 
limit today. This limits the kind of applications that the functionality can be 
applied to. This Jira tracks 1) allowing the rate limiter to change limits 
dynamically. 2) Add a RateLimiter subclass which exposes the same.  (was: While 
working on multi range queries, I realised that it would be good to specialize 
for cases where all clauses in a query are ORed together. MultiTermQuery 
springs to mind, when all terms are basically disjuncted.)

> Allow RateLimiter To Have Dynamic Limits
> 
>
> Key: LUCENE-8915
> URL: https://issues.apache.org/jira/browse/LUCENE-8915
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Atri Sharma
>Priority: Major
>
> RateLimiter does not allow dynamic configuration of the rate limit today. 
> This limits the kind of applications that the functionality can be applied 
> to. This Jira tracks 1) allowing the rate limiter to change limits 
> dynamically. 2) Add a RateLimiter subclass which exposes the same.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8811) Add maximum clause count check to IndexSearcher rather than BooleanQuery

2019-07-15 Thread Atri Sharma (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16884930#comment-16884930
 ] 

Atri Sharma commented on LUCENE-8811:
-

[~jpountz] I had originally raised a patch which implemented your suggested 
approach; should we commit that for 8.2, and let all other branches get the 
actual change introduced by this JIRA?

> Add maximum clause count check to IndexSearcher rather than BooleanQuery
> 
>
> Key: LUCENE-8811
> URL: https://issues.apache.org/jira/browse/LUCENE-8811
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Assignee: Alan Woodward
>Priority: Minor
> Fix For: 8.2
>
> Attachments: LUCENE-8811.patch, LUCENE-8811.patch, LUCENE-8811.patch, 
> LUCENE-8811.patch, LUCENE-8811.patch, LUCENE-8811.patch
>
>
> Currently we only check whether boolean queries have too many clauses. 
> However there are other ways that queries may have too many clauses, for 
> instance if you have boolean queries that have themselves inner boolean 
> queries.
> Could we use the new Query visitor API to move this check from BooleanQuery 
> to IndexSearcher in order to make this check more consistent across queries? 
> See for instance LUCENE-8810 where a rewrite rule caused the maximum clause 
> count to be hit even though the total number of leaf queries remained the 
> same.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8811) Add maximum clause count check to IndexSearcher rather than BooleanQuery

2019-07-15 Thread Atri Sharma (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16884939#comment-16884939
 ] 

Atri Sharma commented on LUCENE-8811:
-

[~jpountz] Yeah, that is what I was thinking of, but I see your viewpoint.

 

I will raise a PR shortly

> Add maximum clause count check to IndexSearcher rather than BooleanQuery
> 
>
> Key: LUCENE-8811
> URL: https://issues.apache.org/jira/browse/LUCENE-8811
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Assignee: Alan Woodward
>Priority: Minor
> Fix For: 8.2
>
> Attachments: LUCENE-8811.patch, LUCENE-8811.patch, LUCENE-8811.patch, 
> LUCENE-8811.patch, LUCENE-8811.patch, LUCENE-8811.patch
>
>
> Currently we only check whether boolean queries have too many clauses. 
> However there are other ways that queries may have too many clauses, for 
> instance if you have boolean queries that have themselves inner boolean 
> queries.
> Could we use the new Query visitor API to move this check from BooleanQuery 
> to IndexSearcher in order to make this check more consistent across queries? 
> See for instance LUCENE-8810 where a rewrite rule caused the maximum clause 
> count to be hit even though the total number of leaf queries remained the 
> same.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Created] (LUCENE-8919) Query Metadata Aggregator

2019-07-15 Thread Atri Sharma (JIRA)
Atri Sharma created LUCENE-8919:
---

 Summary: Query Metadata Aggregator
 Key: LUCENE-8919
 URL: https://issues.apache.org/jira/browse/LUCENE-8919
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Atri Sharma


It would be good if there was a mechanism to allow aggregation of metadata for 
queries (e.g. number of clauses, types of clauses, terms involved, etc.). This 
is particularly useful for complex queries with multiple levels of nesting and 
a high degree of branching. This should help debug query performance issues and 
reveal patterns when a query is misbehaving. With the QueryVisitor now present, 
this should be doable.
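
As a rough sketch of the kind of aggregation I mean, assuming the QueryVisitor 
callbacks (visitLeaf/consumeTerms) behave as documented; the aggregator class 
itself is hypothetical:
{code:java}
import java.util.HashSet;
import java.util.Set;

import org.apache.lucene.index.Term;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.QueryVisitor;

// Hypothetical aggregator: counts leaf queries (term-producing leaves report
// through consumeTerms, others through visitLeaf) and collects the terms a
// query touches.
final class QueryMetadataAggregator extends QueryVisitor {
  int leafCount = 0;
  final Set<Term> terms = new HashSet<>();

  @Override
  public void visitLeaf(Query query) {
    leafCount++;
  }

  @Override
  public void consumeTerms(Query query, Term... queryTerms) {
    leafCount++;
    for (Term term : queryTerms) {
      terms.add(term);
    }
  }

  static void report(Query query) {
    QueryMetadataAggregator aggregator = new QueryMetadataAggregator();
    query.visit(aggregator);
    System.out.println(aggregator.leafCount + " leaf queries, terms: " + aggregator.terms);
  }
}
{code}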



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8915) Allow RateLimiter To Have Dynamic Limits

2019-07-16 Thread Atri Sharma (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16885887#comment-16885887
 ] 

Atri Sharma commented on LUCENE-8915:
-

Hmm, I do not see a reason why SimpleRateLimiter cannot have its rate set 
dynamically today (the setter is public).

 

Should we make the rate limit value protected instead, or update the 
javadocs/comments to reflect that dynamic updates are supported?
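
For example (assuming the existing public setter on 
RateLimiter.SimpleRateLimiter), dynamic adjustment already looks like this:
{code:java}
import org.apache.lucene.store.RateLimiter;

// Sketch: SimpleRateLimiter's rate can already be changed on the fly via its
// public setter, so callers can react to changing load.
public final class DynamicRateLimitSketch {
  public static void main(String[] args) {
    RateLimiter.SimpleRateLimiter limiter = new RateLimiter.SimpleRateLimiter(20.0); // 20 MB/sec
    // ... later, e.g. when the node comes under pressure:
    limiter.setMBPerSec(5.0);
    System.out.println("current limit: " + limiter.getMBPerSec() + " MB/sec");
  }
}
{code}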

> Allow RateLimiter To Have Dynamic Limits
> 
>
> Key: LUCENE-8915
> URL: https://issues.apache.org/jira/browse/LUCENE-8915
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Atri Sharma
>Priority: Major
>
> RateLimiter does not allow dynamic configuration of the rate limit today. 
> This limits the kind of applications that the functionality can be applied 
> to. This Jira tracks 1) allowing the rate limiter to change limits 
> dynamically. 2) Add a RateLimiter subclass which exposes the same.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8915) Allow RateLimiter To Have Dynamic Limits

2019-07-16 Thread Atri Sharma (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16885941#comment-16885941
 ] 

Atri Sharma commented on LUCENE-8915:
-

[~ab] Thanks, raised a PR doing the same.

 

[https://github.com/apache/lucene-solr/pull/789]

> Allow RateLimiter To Have Dynamic Limits
> 
>
> Key: LUCENE-8915
> URL: https://issues.apache.org/jira/browse/LUCENE-8915
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Atri Sharma
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> RateLimiter does not allow dynamic configuration of the rate limit today. 
> This limits the kind of applications that the functionality can be applied 
> to. This Jira tracks 1) allowing the rate limiter to change limits 
> dynamically. 2) Add a RateLimiter subclass which exposes the same.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Created] (LUCENE-8924) Remove Fields Order Checks from CheckIndex?

2019-07-17 Thread Atri Sharma (JIRA)
Atri Sharma created LUCENE-8924:
---

 Summary: Remove Fields Order Checks from CheckIndex?
 Key: LUCENE-8924
 URL: https://issues.apache.org/jira/browse/LUCENE-8924
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Atri Sharma


CheckIndex checks the order of fields read from the FieldsEnum for the postings 
reader. We do not explicitly sort or use a sorted data structure to represent 
the keys (at least not explicitly), and no FieldsEnum consumer depends on the 
order apart from MultiFieldsEnum, which no longer exists.

 

Should we remove the check?



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8924) Remove Fields Order Checks from CheckIndex?

2019-07-17 Thread Atri Sharma (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16887340#comment-16887340
 ] 

Atri Sharma commented on LUCENE-8924:
-

I see. Should we make this more explicit and robust then? For example, since we 
do not explicitly maintain a sort order but rely on the key set to do the right 
thing, a change from Collections.unmodifiableSet to Set.copyOf breaks this 
assertion in CheckIndex (Set.copyOf explicitly calls out that there is no 
guarantee on the order of traversal).
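
To make the concern concrete, a small JDK-only sketch (the field names are made 
up) of how the two constructs differ on iteration order:

{code:java}
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

public class CopyOfOrderDemo {
  public static void main(String[] args) {
    Map<String, Integer> fields = new LinkedHashMap<>();
    fields.put("title", 1);
    fields.put("body", 2);
    fields.put("author", 3);

    // Wrapping the key set preserves the backing map's iteration order,
    // which is what the CheckIndex assertion implicitly relies on.
    Set<String> wrapped = Collections.unmodifiableSet(fields.keySet());

    // Set.copyOf makes an immutable copy with no documented iteration order,
    // so the same assertion can start failing after the cut-over.
    Set<String> copied = Set.copyOf(fields.keySet());

    System.out.println("wrapped order: " + wrapped); // insertion order
    System.out.println("copied order:  " + copied);  // unspecified order
  }
}
{code}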

> Remove Fields Order Checks from CheckIndex?
> ---
>
> Key: LUCENE-8924
> URL: https://issues.apache.org/jira/browse/LUCENE-8924
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Atri Sharma
>Priority: Major
>
> CheckIndex checks the order of fields read from the FieldsEnum for the 
> posting reader. Since we do not explicitly sort or use a sorted data 
> structure to represent keys (atleast explicitly), and no FieldsEnum depends 
> on the order apart from MultiFieldsEnum, which no longer exists.
>  
> Should we remove the check?



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Created] (LUCENE-8927) Cut Over To Set.copyOf and Set.Of From Collections.unmodifiableSet

2019-07-18 Thread Atri Sharma (JIRA)
Atri Sharma created LUCENE-8927:
---

 Summary: Cut Over To Set.copyOf and Set.Of From 
Collections.unmodifiableSet
 Key: LUCENE-8927
 URL: https://issues.apache.org/jira/browse/LUCENE-8927
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Atri Sharma






--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8727) IndexSearcher#search(Query,int) should operate on a shared priority queue when configured with an executor

2019-07-19 Thread Atri Sharma (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16888766#comment-16888766
 ] 

Atri Sharma commented on LUCENE-8727:
-

[~jpountz] Here are two thoughts on how to implement this:

 

1) Shared Priority Queue: A shared priority queue, held in the parent 
CollectorManager, is used by all Collectors. This flows down naturally since, 
once the top N hits have been collected globally, the minimum competitive score 
can be increased without the Collectors getting involved, and further hits will 
be ranked accordingly. However, the downside is that the priority queue 
implementation has to be synchronized, so there can be a performance hit since 
the critical path of segment collection is affected.

 

2) An alternative: for N hits, each slice starts with an equal prorated share 
of hits (M collectors, so N/M hits each). Each Collector gets a callback 
supplier which the Collector calls with the number of hits collected so far and 
the score of its highest scoring local hit. The callback returns the minimum 
competitive score seen globally so far, and the Collector uses that score to 
filter out remaining hits. When a Collector invokes the callback can vary; the 
simplest choice is after every N/M hits. The callback is provided by the 
CollectorManager. The downside of this approach is the communication between 
Collectors and the CollectorManager, and that some redundant hits can be 
collected because the callback is only invoked periodically. In contrast, the 
shared priority queue mechanism allows for exact filtering.
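
Something along these lines could work for the callback of option 2 (all names 
are hypothetical, and the aggregation policy is deliberately simplified):

{code:java}
// Hypothetical callback owned by the CollectorManager. Collectors report their
// local progress and get back the best minimum competitive score observed
// globally so far.
public class GlobalMinScoreCallback {
  private float globalMinCompetitiveScore = 0f;

  // hitsCollectedLocally could additionally be used to track global progress
  // (e.g. to decide when enough hits exist); omitted here for brevity.
  public synchronized float report(int hitsCollectedLocally, float localMinCompetitiveScore) {
    // Only raise the global bar, never lower it. A fuller implementation
    // would derive this from the merged top-N state rather than a single float.
    if (localMinCompetitiveScore > globalMinCompetitiveScore) {
      globalMinCompetitiveScore = localMinCompetitiveScore;
    }
    return globalMinCompetitiveScore;
  }
}
{code}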

 

WDYT?

> IndexSearcher#search(Query,int) should operate on a shared priority queue 
> when configured with an executor
> --
>
> Key: LUCENE-8727
> URL: https://issues.apache.org/jira/browse/LUCENE-8727
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
>
> If IndexSearcher is configured with an executor, then the top docs for each 
> slice are computed separately before being merged once the top docs for all 
> slices are computed. With block-max WAND this is a bit of a waste of 
> resources: it would be better if an increase of the min competitive score 
> could help skip non-competitive hits on every slice and not just the current 
> one.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8727) IndexSearcher#search(Query,int) should operate on a shared priority queue when configured with an executor

2019-07-22 Thread Atri Sharma (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16889970#comment-16889970
 ] 

Atri Sharma commented on LUCENE-8727:
-

bq. we will have to skip all these docs with smaller doc Ids even if they have 
the same scores as docs with higher doc Ids and should be selected instead.

That should be avoidable: since we would need a custom PQ implementation anyway 
if we decided to share the queue, the PQ can tie-break the other way round on 
doc IDs. One advantage of sharing the PQ is that we can skip the merge step in 
the CollectorManager's reduce call.

I am hesitant to introduce a synchronized block into the collector-level 
collection mechanism -- it has the potential to blow up in our faces and become 
a performance bottleneck.

I wonder whether we should simply have both versions -- the shared PQ/min 
score, and the CollectorManager that exposes callbacks which the dependent 
Collectors invoke at regular intervals. The former can work well with a small 
number of slices, while the latter can work well with a large number of slices.

> IndexSearcher#search(Query,int) should operate on a shared priority queue 
> when configured with an executor
> --
>
> Key: LUCENE-8727
> URL: https://issues.apache.org/jira/browse/LUCENE-8727
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
>
> If IndexSearcher is configured with an executor, then the top docs for each 
> slice are computed separately before being merged once the top docs for all 
> slices are computed. With block-max WAND this is a bit of a waste of 
> resources: it would be better if an increase of the min competitive score 
> could help skip non-competitive hits on every slice and not just the current 
> one.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Created] (LUCENE-8929) Early Terminating CollectorManager

2019-07-22 Thread Atri Sharma (JIRA)
Atri Sharma created LUCENE-8929:
---

 Summary: Early Terminating CollectorManager
 Key: LUCENE-8929
 URL: https://issues.apache.org/jira/browse/LUCENE-8929
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Atri Sharma


We should have an early terminating collector manager which accurately tracks 
hits across all of its collectors and determines when there are enough hits, 
allowing all the collectors to abort.

The options are:

1) Shared total count: a global "scoreboard" where all collectors update their 
current hit count. After collecting each document, a collector checks whether 
N > threshold and aborts if so.

2) State Reporting Collectors: collectors periodically report the number of 
hits they have collected via a callback mechanism, and get back a 
proceed-or-abort decision.

1) has the overhead of synchronization in the hot path; 2) can collect 
unnecessary hits before aborting.

I am planning to work on 2), unless there are objections.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8929) Early Terminating CollectorManager

2019-07-23 Thread Atri Sharma (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16890734#comment-16890734
 ] 

Atri Sharma commented on LUCENE-8929:
-

{quote}What collector do you have in mind? Is it TopFieldCollector?
{quote}
Yes, that is the one.

 

I did some tests, and am now inclined to go with 1), since it is a less 
invasive change and allows accurate termination with minimal overhead (< 3% 
degradation). This is because AtomicInteger is generally not implemented with a 
synchronization lock on modern hardware.
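
A minimal sketch of the shared-counter idea (class and method names are 
hypothetical; the real change would live in the TopFieldCollector / 
CollectorManager pair):

{code:java}
import java.util.concurrent.atomic.AtomicInteger;

import org.apache.lucene.search.CollectionTerminatedException;
import org.apache.lucene.search.ScoreMode;
import org.apache.lucene.search.SimpleCollector;

// Hypothetical collector: all collectors created by one CollectorManager share
// the same AtomicInteger, so termination is based on the global hit count.
public class GloballyCountingCollector extends SimpleCollector {
  private final AtomicInteger globalHitCount; // shared across all slices
  private final int threshold;

  public GloballyCountingCollector(AtomicInteger globalHitCount, int threshold) {
    this.globalHitCount = globalHitCount;
    this.threshold = threshold;
  }

  @Override
  public void collect(int doc) {
    // ... collect the hit into the local queue here ...
    if (globalHitCount.incrementAndGet() > threshold) {
      // Enough hits collected across all slices: stop collecting this leaf;
      // subsequent leaves will bail out almost immediately as well.
      throw new CollectionTerminatedException();
    }
  }

  @Override
  public ScoreMode scoreMode() {
    return ScoreMode.COMPLETE;
  }
}
{code}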

> Early Terminating CollectorManager
> --
>
> Key: LUCENE-8929
> URL: https://issues.apache.org/jira/browse/LUCENE-8929
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Atri Sharma
>Priority: Major
>
> We should have an early terminating collector manager which accurately tracks 
> hits across all of its collectors and determines when there are enough hits, 
> allowing all the collectors to abort.
> The options for the same are:
> 1) Shared total count : Global "scoreboard" where all collectors update their 
> current hit count. At the end of each document's collection, collector checks 
> if N > threshold, and aborts if true
> 2) State Reporting Collectors: Collectors report their total number of counts 
> collected periodically using a callback mechanism, and get a proceed or abort 
> decision.
> 1) has the overhead of synchronization in the hot path, 2) can collect 
> unnecessary hits before aborting.
> I am planning to work on 2), unless objections



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8929) Early Terminating CollectorManager

2019-07-23 Thread Atri Sharma (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16890749#comment-16890749
 ] 

Atri Sharma commented on LUCENE-8929:
-

bq. OK, so if I understand correctly you are still collecting the first numHits 
hits as today, but you are trying to avoid collecting 
${totalHitsThreshold-numHits} additional hits on every slice with this global 
counter?

Yeah, exactly.

The first numHits hits can be spread across all the involved collectors, but 
with the global counter, all collectors will abort once they realize that 
numHits hits have been collected globally, even though the per-collector hit 
count is, obviously, < numHits.

> Early Terminating CollectorManager
> --
>
> Key: LUCENE-8929
> URL: https://issues.apache.org/jira/browse/LUCENE-8929
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Atri Sharma
>Priority: Major
>
> We should have an early terminating collector manager which accurately tracks 
> hits across all of its collectors and determines when there are enough hits, 
> allowing all the collectors to abort.
> The options for the same are:
> 1) Shared total count : Global "scoreboard" where all collectors update their 
> current hit count. At the end of each document's collection, collector checks 
> if N > threshold, and aborts if true
> 2) State Reporting Collectors: Collectors report their total number of counts 
> collected periodically using a callback mechanism, and get a proceed or abort 
> decision.
> 1) has the overhead of synchronization in the hot path, 2) can collect 
> unnecessary hits before aborting.
> I am planning to work on 2), unless objections



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8929) Early Terminating CollectorManager

2019-07-23 Thread Atri Sharma (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16890760#comment-16890760
 ] 

Atri Sharma commented on LUCENE-8929:
-

bq. So you need to collect each segment at least until {numHits} hits have been 
collected, or until the last collected hit was not competitive globally 
(whichever comes first)

Yeah, sorry I was not clear. Per collector, we will collect until numHits hits 
are collected.

I have opened a PR implementing the same: 
https://github.com/apache/lucene-solr/pull/803

Hoping the code gives more clarity

> Early Terminating CollectorManager
> --
>
> Key: LUCENE-8929
> URL: https://issues.apache.org/jira/browse/LUCENE-8929
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Atri Sharma
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> We should have an early terminating collector manager which accurately tracks 
> hits across all of its collectors and determines when there are enough hits, 
> allowing all the collectors to abort.
> The options for the same are:
> 1) Shared total count : Global "scoreboard" where all collectors update their 
> current hit count. At the end of each document's collection, collector checks 
> if N > threshold, and aborts if true
> 2) State Reporting Collectors: Collectors report their total number of counts 
> collected periodically using a callback mechanism, and get a proceed or abort 
> decision.
> 1) has the overhead of synchronization in the hot path, 2) can collect 
> unnecessary hits before aborting.
> I am planning to work on 2), unless objections



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Created] (LUCENE-8931) TestTopFieldCollectorEarlyTermination Should Use CheckHits

2019-07-23 Thread Atri Sharma (JIRA)
Atri Sharma created LUCENE-8931:
---

 Summary: TestTopFieldCollectorEarlyTermination Should Use CheckHits
 Key: LUCENE-8931
 URL: https://issues.apache.org/jira/browse/LUCENE-8931
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Atri Sharma


TestTopFieldCollectorEarlyTermination invents a new way of checking equality of 
hits. That is redundant since CheckHits provides the same functionality and is 
the de facto standard now.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Created] (LUCENE-8675) Divide Segment Search Amongst Multiple Threads

2019-01-31 Thread Atri Sharma (JIRA)
Atri Sharma created LUCENE-8675:
---

 Summary: Divide Segment Search Amongst Multiple Threads
 Key: LUCENE-8675
 URL: https://issues.apache.org/jira/browse/LUCENE-8675
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/search
Reporter: Atri Sharma


Segment search is a single-threaded operation today, which can be a bottleneck 
for large analytical workloads that index a lot of data and run complex queries 
touching multiple segments (imagine a composite query with a range query and 
filters on top). This ticket is for discussing the idea of splitting the search 
of a single segment across multiple threads based on mutually exclusive 
document ID ranges.

This will be a two-phase effort, the first phase targeting queries that return 
all matching documents (collectors not terminating early). The second phase 
will introduce staged execution and will build on top of this patch.
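
To make the range idea concrete, here is a purely illustrative sketch of 
partitioning a segment's doc ID space; the real change would have to plug into 
IndexSearcher's slicing and execution machinery:

{code:java}
// Illustrative only: split a segment's doc ID space [0, maxDoc) into
// numThreads mutually exclusive, contiguous ranges.
public class DocIdRangePartitioner {
  public static int[][] partition(int maxDoc, int numThreads) {
    int[][] ranges = new int[numThreads][2];
    int chunk = (maxDoc + numThreads - 1) / numThreads; // ceiling division
    for (int i = 0; i < numThreads; i++) {
      int start = Math.min(maxDoc, i * chunk);
      int end = Math.min(maxDoc, start + chunk); // exclusive upper bound
      ranges[i][0] = start;
      ranges[i][1] = end;
    }
    return ranges;
  }
}
{code}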



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8675) Divide Segment Search Amongst Multiple Threads

2019-01-31 Thread Atri Sharma (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16757614#comment-16757614
 ] 

Atri Sharma commented on LUCENE-8675:
-

Thanks for the comments.

Having a multi-shard approach makes sense, but a search is still bottlenecked 
by the largest segment it needs to scan. If there are many segments of that 
kind, that might become a problem.

While I agree that range queries might not benefit directly from parallel 
scans, other queries (such as TermQueries) might benefit from a 
segment-parallel scan. In a typical Elasticsearch interactive query, we see 
latency spikes when a large segment is hit. Such cases can be optimized with 
parallel scans.

We should have a way of deciding whether a scan should be parallelized, and 
then let the execution operator get a set of nodes to execute. That is probably 
outside the scope of this JIRA, but I wanted to open this thread to get the 
conversation going.

> Divide Segment Search Amongst Multiple Threads
> --
>
> Key: LUCENE-8675
> URL: https://issues.apache.org/jira/browse/LUCENE-8675
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/search
>Reporter: Atri Sharma
>Priority: Major
>
> Segment search is a single threaded operation today, which can be a 
> bottleneck for large analytical queries which index a lot of data and have 
> complex queries which touch multiple segments (imagine a composite query with 
> range query and filters on top). This ticket is for discussing the idea of 
> splitting a single segment into multiple threads based on mutually exclusive 
> document ID ranges.
> This will be a two phase effort, the first phase targeting queries returning 
> all matching documents (collectors not terminating early). The second phase 
> patch will introduce staged execution and will build on top of this patch.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8675) Divide Segment Search Amongst Multiple Threads

2019-02-01 Thread Atri Sharma (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16758862#comment-16758862
 ] 

Atri Sharma commented on LUCENE-8675:
-

{quote}If some segments are getting large enough that intra-segment parallelism 
becomes appealing, then maybe an easier and more efficient way to increase 
parallelism is to instead reduce the maximum segment size so that inter-segment 
parallelism has more potential for parallelizing query execution.
{quote}
Would that not lead to many more segments than required? That could cause 
issues such as a large number of open file handles and too many threads needed 
for scanning (although we would assign multiple small segments to a single 
thread).

Thanks for the point about range queries, that is an important consideration. I 
will follow up with a separate patch on top of this one which does the first 
phase of BKD iteration and shares the generated bitset across N parallel 
threads, where N equals the number of remaining clauses and each thread 
intersects one clause with the bitset.
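
A rough sketch of that intersection step (names are hypothetical; error 
handling and the final merge of the per-clause results are omitted):

{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;

import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.search.Scorer;
import org.apache.lucene.util.FixedBitSet;

public class ParallelClauseIntersection {
  // bkdBitSet is the bitset produced by the first-phase BKD iteration; it is
  // only read here, so sharing it across threads is safe.
  static List<Future<FixedBitSet>> intersect(FixedBitSet bkdBitSet, int maxDoc,
      List<Scorer> remainingClauses, ExecutorService executor) throws Exception {
    List<Callable<FixedBitSet>> tasks = new ArrayList<>();
    for (Scorer clause : remainingClauses) {
      tasks.add(() -> {
        FixedBitSet result = new FixedBitSet(maxDoc);
        DocIdSetIterator it = clause.iterator();
        for (int doc = it.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc = it.nextDoc()) {
          if (bkdBitSet.get(doc)) { // intersect this clause with the BKD result
            result.set(doc);
          }
        }
        return result;
      });
    }
    return executor.invokeAll(tasks);
  }
}
{code}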

> Divide Segment Search Amongst Multiple Threads
> --
>
> Key: LUCENE-8675
> URL: https://issues.apache.org/jira/browse/LUCENE-8675
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/search
>Reporter: Atri Sharma
>Priority: Major
>
> Segment search is a single threaded operation today, which can be a 
> bottleneck for large analytical queries which index a lot of data and have 
> complex queries which touch multiple segments (imagine a composite query with 
> range query and filters on top). This ticket is for discussing the idea of 
> splitting a single segment into multiple threads based on mutually exclusive 
> document ID ranges.
> This will be a two phase effort, the first phase targeting queries returning 
> all matching documents (collectors not terminating early). The second phase 
> patch will introduce staged execution and will build on top of this patch.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8708) Can we simplify conjunctions of range queries automatically?

2019-02-25 Thread Atri Sharma (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16777138#comment-16777138
 ] 

Atri Sharma commented on LUCENE-8708:
-

We could extend this approach to identify overlapping ranges ([5, 20], [15, 35] 
can be converted to 5 to 35).

 

I can take a crack at this one, if you are not planning to actively work on it

> Can we simplify conjunctions of range queries automatically?
> 
>
> Key: LUCENE-8708
> URL: https://issues.apache.org/jira/browse/LUCENE-8708
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
>
> BooleanQuery#rewrite already has some logic to make queries more efficient, 
> such as deduplicating filters or rewriting boolean queries that wrap a single 
> positive clause to that clause.
> It would be nice to also simplify conjunctions of range queries, so that eg. 
> {{foo: [5 TO *] AND foo:[* TO 20]}} would be rewritten to {{foo:[5 TO 20]}}. 
> When constructing queries manually or via the classic query parser, it feels 
> unnecessary as this is something that the user can fix easily. However if you 
> want to implement a query parser that only allows specifying one bound at 
> once, such as Gmail ({{after:2018-12-31}} 
> https://support.google.com/mail/answer/7190?hl=en) or GitHub 
> ({{updated:>=2018-12-31}} 
> https://help.github.com/en/articles/searching-issues-and-pull-requests#search-by-when-an-issue-or-pull-request-was-created-or-last-updated)
>  then you might end up with inefficient queries if the end user specifies 
> both an upper and a lower bound. It would be nice if we optimized those 
> automatically.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8675) Divide Segment Search Amongst Multiple Threads

2019-02-27 Thread Atri Sharma (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16779287#comment-16779287
 ] 

Atri Sharma commented on LUCENE-8675:
-

Here are the results of luceneutil (patched to generate P50 and P90 and to run 
concurrent searching within IndexSearcher; the patch is posted to the 
luceneutil repo).

Adrien has a valid point about costly scorers not benefitting from this 
approach. Specifically, range queries can take a hit since the BKD tree's 
scorer is two-phase and expensive to construct, so building it per portion of a 
segment increases latency, as is evident from the increase in P90 latency in 
the results below. I am evaluating how to tackle this problem and will post any 
ideas that look viable. These benchmarks are meant to measure the "happy" path, 
i.e. the targeted large index sizes and low-QPS cases, and luceneutil was 
configured accordingly (low number of search threads, impacts turned off).

In summary, queries that scan a larger amount of data and have higher read 
latencies see the biggest improvement. Term queries, and queries involving term 
queries on higher-frequency terms, get a reasonable latency reduction.

The following are P50 and P90 latencies calculated by luceneutil. P50 Base is 
the P50 latency of the baseline, P50 Cmp is the P50 latency of the competitor 
(patched version), and likewise for P90.

Note: the QPS jumps are not real. Since luceneutil was configured to run a 
single searcher thread, the QPS jump is proportional to the latency drop for a 
task.

Luceneutil results:

||Task||P50 Base||P50 Cmp||P50 Pct||P90 Base||P90 Cmp||P90 Pct||
|Wildcard|9.993697|11.906981|19.1449070349|14.431318|13.953923|-3.3080485095|
|HighTermDayOfYearSort (DayOfYear)|39.556908|44.389095|12.2157854198|62.421873|49.214184|-21.1587515165|
|AndHighHigh|3.814074|2.459326|-35.5197093711|5.045984|7.932029|57.1948900353|
|OrHighHigh|9.586193|5.846643|-39.0097507947|14.978843|7.078967|-52.7402283341|
|MedPhrase|3.210464|2.276356|-29.0957319565|4.217049|3.852337|-8.64851226533|
|LowSpanNear|11.247447|4.986828|-55.6625783611|16.095342|6.121194|-61.9691585305|
|Fuzzy2|23.636902|20.959304|-11.3280412128|112.5086|105.188025|-6.50668037821|
|OrNotHighHigh|4.225917|2.62127|-37.9715692476|6.11225|8.525249|39.4780809031|
|OrHighNotLow|4.015982|2.250697|-43.956496817|10.636566|3.134868|-70.5274427856|
|BrowseMonthSSDVFacets|66.920633|66.986841|0.0989351072038|67.230757|76.011531|13.0606502021|
|Fuzzy1|14.779783|12.559705|-15.0210459788|46.329521|218.272906|371.131367838|
|HighSloppyPhrase|21.362967|10.563982|-50.5500242546|33.009649|15.74507|-52.3016133858|
|OrNotHighMed|2.032775|1.584332|-22.0606314029|2.529475|2.044107|-19.1884877297|
|LowPhrase|4.937747|2.8876|-41.5198875115|6.910574|5.159077|-25.345173932|
|AndHighLow|1.097696|1.416176|29.0134973617|3.426081|13.987273|308.258678064|
|LowTerm|0.787595|1.038949|31.9141182968|1.12006|39.639455|3439.04746174|
|BrowseDayOfYearSSDVFacets|80.006624|80.215023|0.260477182489|80.610476|81.187614|0.71595905227|
|Prefix3|3.347358|3.219213|-3.82824305019|6.716371|5.21174|-22.4024402464|
|HighTermMonthSort (Month)|20.684075|19.601521|-5.23375592092|21.341383|20.092673|-5.85112033274|
|HighTerm|2.991271|1.891199|-36.7760727798|4.058212|2.320309|-42.8243522024|
|Respell|17.33154|17.397468|0.38039320222|99.071728|66.75552|-32.6190010535|
|MedTerm|3.011125|1.793175|-40.4483374154|4.206761|2.392798|-43.120|

[jira] [Commented] (LUCENE-8362) Add DocValue support for RangeFields

2019-05-30 Thread Atri Sharma (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16852082#comment-16852082
 ] 

Atri Sharma commented on LUCENE-8362:
-

[~jpountz] Thanks for the comments. Attached is an updated patch:

 

[^LUCENE-8362.patch]

> Add DocValue support for RangeFields 
> -
>
> Key: LUCENE-8362
> URL: https://issues.apache.org/jira/browse/LUCENE-8362
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Nicholas Knize
>Priority: Minor
> Attachments: LUCENE-8362-approach2.patch, LUCENE-8362.patch, 
> LUCENE-8362.patch, LUCENE-8362.patch, LUCENE-8362.patch
>
>
> I'm opening this issue to discuss adding DocValue support to 
> {{\{Int|Long|Float|Double\}Range}} field types. Since existing numeric range 
> fields already provide the methods for encoding ranges as a byte array I 
> think this could be as simple as adding syntactic sugar to existing range 
> fields that simply build an instance of {{BinaryDocValues}} using that same 
> encoding. I'm envisioning something like 
> {{doc.add(IntRange.newDocValuesField("intDV", 100)}} But I'd like to solicit 
> other ideas or potential drawbacks to this approach.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8362) Add DocValue support for RangeFields

2019-05-30 Thread Atri Sharma (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16852087#comment-16852087
 ] 

Atri Sharma commented on LUCENE-8362:
-

[~mgrigorov] No specific reason. I am accustomed to patches and have vim hacks 
that allow easy generation of patches

> Add DocValue support for RangeFields 
> -
>
> Key: LUCENE-8362
> URL: https://issues.apache.org/jira/browse/LUCENE-8362
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Nicholas Knize
>Priority: Minor
> Attachments: LUCENE-8362-approach2.patch, LUCENE-8362.patch, 
> LUCENE-8362.patch, LUCENE-8362.patch, LUCENE-8362.patch
>
>
> I'm opening this issue to discuss adding DocValue support to 
> {{\{Int|Long|Float|Double\}Range}} field types. Since existing numeric range 
> fields already provide the methods for encoding ranges as a byte array I 
> think this could be as simple as adding syntactic sugar to existing range 
> fields that simply build an instance of {{BinaryDocValues}} using that same 
> encoding. I'm envisioning something like 
> {{doc.add(IntRange.newDocValuesField("intDV", 100)}} But I'd like to solicit 
> other ideas or potential drawbacks to this approach.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8819) org.apache.lucene.search.TestTopDocsMerge.testSort_1 failure

2019-06-03 Thread Atri Sharma (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16854285#comment-16854285
 ] 

Atri Sharma commented on LUCENE-8819:
-

I took a look at this, and it looks like the test failure is occurring because 
of AssertingCollector's newly added check from LUCENE-8757, which asserts that 
LeafReaderContexts are ordered by docID, i.e. a LeafReaderContext's docBase is 
greater than the predecessor's maxDoc.

 

I will dig deeper into this and update

> org.apache.lucene.search.TestTopDocsMerge.testSort_1 failure
> 
>
> Key: LUCENE-8819
> URL: https://issues.apache.org/jira/browse/LUCENE-8819
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Ignacio Vera
>Priority: Major
>
> It can be reproduced with:
>  
> {code:java}
> ant test  -Dtestcase=TestTopDocsMerge -Dtests.method=testSort_1 
> -Dtests.seed=E916688CE5BC9122 -Dtests.multiplier=3 -Dtests.slow=true 
> -Dtests.locale=es-US -Dtests.timezone=Pacific/Johnston -Dtests.asserts=true 
> -Dtests.file.encoding=ISO-8859-1{code}
>  
> Test fails in master and branch 8.x but it does not fail in branch 8.1. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8819) org.apache.lucene.search.TestTopDocsMerge.testSort_1 failure

2019-06-03 Thread Atri Sharma (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16854307#comment-16854307
 ] 

Atri Sharma commented on LUCENE-8819:
-

Acked, same assert.

> org.apache.lucene.search.TestTopDocsMerge.testSort_1 failure
> 
>
> Key: LUCENE-8819
> URL: https://issues.apache.org/jira/browse/LUCENE-8819
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Ignacio Vera
>Priority: Major
>
> It can be reproduced with:
>  
> {code:java}
> ant test  -Dtestcase=TestTopDocsMerge -Dtests.method=testSort_1 
> -Dtests.seed=E916688CE5BC9122 -Dtests.multiplier=3 -Dtests.slow=true 
> -Dtests.locale=es-US -Dtests.timezone=Pacific/Johnston -Dtests.asserts=true 
> -Dtests.file.encoding=ISO-8859-1{code}
>  
> Test fails in master and branch 8.x but it does not fail in branch 8.1. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-8819) org.apache.lucene.search.TestTopDocsMerge.testSort_1 failure

2019-06-03 Thread Atri Sharma (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Atri Sharma updated LUCENE-8819:

Attachment: LUCENE-8819.patch

> org.apache.lucene.search.TestTopDocsMerge.testSort_1 failure
> 
>
> Key: LUCENE-8819
> URL: https://issues.apache.org/jira/browse/LUCENE-8819
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Ignacio Vera
>Priority: Major
> Attachments: LUCENE-8819.patch
>
>
> It can be reproduced with:
>  
> {code:java}
> ant test  -Dtestcase=TestTopDocsMerge -Dtests.method=testSort_1 
> -Dtests.seed=E916688CE5BC9122 -Dtests.multiplier=3 -Dtests.slow=true 
> -Dtests.locale=es-US -Dtests.timezone=Pacific/Johnston -Dtests.asserts=true 
> -Dtests.file.encoding=ISO-8859-1{code}
>  
> Test fails in master and branch 8.x but it does not fail in branch 8.1. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8819) org.apache.lucene.search.TestTopDocsMerge.testSort_1 failure

2019-06-03 Thread Atri Sharma (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16854352#comment-16854352
 ] 

Atri Sharma commented on LUCENE-8819:
-

The problem is that the assert does not account for docIDs within a single 
segment, i.e. when multiple documents within a segment are collected by the 
same leaf collector. The assert should only check at segment boundaries to 
ensure that subsequent segments have the right sequence of docIDs.

 

Attached is a patch to fix this. I ran ant check and the previously failing 
tests passed.

 

[~ivera] Could you check if this fix passes at your end as well?

 

[^LUCENE-8819.patch]

> org.apache.lucene.search.TestTopDocsMerge.testSort_1 failure
> 
>
> Key: LUCENE-8819
> URL: https://issues.apache.org/jira/browse/LUCENE-8819
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Ignacio Vera
>Priority: Major
> Attachments: LUCENE-8819.patch
>
>
> It can be reproduced with:
>  
> {code:java}
> ant test  -Dtestcase=TestTopDocsMerge -Dtests.method=testSort_1 
> -Dtests.seed=E916688CE5BC9122 -Dtests.multiplier=3 -Dtests.slow=true 
> -Dtests.locale=es-US -Dtests.timezone=Pacific/Johnston -Dtests.asserts=true 
> -Dtests.file.encoding=ISO-8859-1{code}
>  
> Test fails in master and branch 8.x but it does not fail in branch 8.1. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8819) org.apache.lucene.search.TestTopDocsMerge.testSort_1 failure

2019-06-03 Thread Atri Sharma (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16854539#comment-16854539
 ] 

Atri Sharma commented on LUCENE-8819:
-

[~ivera] Hmm, I was working on the original forked branch for 8757, now rebased 
to master. Interestingly, the problem I described above does not occur in 
master, thanks to Adrien's replacement of that code block.

 

Running the tests in IntelliJ did not reproduce the problem for me. Running 
from the command line triggers the assert below. I am not sure I understand how 
8757 can affect this, since 8757 primarily introduces two changes: one that is 
triggered only when building LeafSlice contexts (which in turn is invoked only 
for parallel search, and the failing tests do not do parallel segment reads 
AFAIK), and tightened asserts in AssertingCollector (which are not getting 
tripped). Is this the same stack that you see in the test failure for 
TestRegexpRandom2?:

 

 
{code:java}
at junit.framework.Assert.fail(Assert.java:57)
   [junit4]    > at 
org.apache.lucene.search.CheckHits.checkEqual(CheckHits.java:205)
   [junit4]    > at 
org.apache.lucene.search.TestRegexpRandom2.assertSame(TestRegexpRandom2.java:178)
   [junit4]    > at 
org.apache.lucene.search.TestRegexpRandom2.testRegexps(TestRegexpRandom2.java:164)
{code}
 

> org.apache.lucene.search.TestTopDocsMerge.testSort_1 failure
> 
>
> Key: LUCENE-8819
> URL: https://issues.apache.org/jira/browse/LUCENE-8819
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Ignacio Vera
>Priority: Major
> Attachments: LUCENE-8819.patch
>
>
> It can be reproduced with:
>  
> {code:java}
> ant test  -Dtestcase=TestTopDocsMerge -Dtests.method=testSort_1 
> -Dtests.seed=E916688CE5BC9122 -Dtests.multiplier=3 -Dtests.slow=true 
> -Dtests.locale=es-US -Dtests.timezone=Pacific/Johnston -Dtests.asserts=true 
> -Dtests.file.encoding=ISO-8859-1{code}
>  
> Test fails in master and branch 8.x but it does not fail in branch 8.1. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8819) org.apache.lucene.search.TestTopDocsMerge.testSort_1 failure

2019-06-03 Thread Atri Sharma (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16854583#comment-16854583
 ] 

Atri Sharma commented on LUCENE-8819:
-

[~ivera] I did set the seed in the test run (I am assuming you are also setting 
it in the VM configuration in test config setup?)

 

I do agree with you that 8757 is what changes the order of segments for 
parallel runs. I will dive deeper to see if I can find the root cause. Thanks 
for your investigation on this :)

> org.apache.lucene.search.TestTopDocsMerge.testSort_1 failure
> 
>
> Key: LUCENE-8819
> URL: https://issues.apache.org/jira/browse/LUCENE-8819
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Ignacio Vera
>Priority: Major
> Attachments: LUCENE-8819.patch
>
>
> It can be reproduced with:
>  
> {code:java}
> ant test  -Dtestcase=TestTopDocsMerge -Dtests.method=testSort_1 
> -Dtests.seed=E916688CE5BC9122 -Dtests.multiplier=3 -Dtests.slow=true 
> -Dtests.locale=es-US -Dtests.timezone=Pacific/Johnston -Dtests.asserts=true 
> -Dtests.file.encoding=ISO-8859-1{code}
>  
> Test fails in master and branch 8.x but it does not fail in branch 8.1. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8362) Add DocValue support for RangeFields

2019-06-03 Thread Atri Sharma (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16855400#comment-16855400
 ] 

Atri Sharma commented on LUCENE-8362:
-

[~jpountz] Another thought, BTW. Given how the patch is structured now, it 
should be simple to add support for other types of range queries (CROSSES et 
al). I have not added it in this iteration, but will post a follow up patch if 
you recommend. WDYT?

> Add DocValue support for RangeFields 
> -
>
> Key: LUCENE-8362
> URL: https://issues.apache.org/jira/browse/LUCENE-8362
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Nicholas Knize
>Priority: Minor
> Attachments: LUCENE-8362-approach2.patch, LUCENE-8362.patch, 
> LUCENE-8362.patch, LUCENE-8362.patch, LUCENE-8362.patch
>
>
> I'm opening this issue to discuss adding DocValue support to 
> {{\{Int|Long|Float|Double\}Range}} field types. Since existing numeric range 
> fields already provide the methods for encoding ranges as a byte array I 
> think this could be as simple as adding syntactic sugar to existing range 
> fields that simply build an instance of {{BinaryDocValues}} using that same 
> encoding. I'm envisioning something like 
> {{doc.add(IntRange.newDocValuesField("intDV", 100)}} But I'd like to solicit 
> other ideas or potential drawbacks to this approach.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8362) Add DocValue support for RangeFields

2019-06-04 Thread Atri Sharma (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16855465#comment-16855465
 ] 

Atri Sharma commented on LUCENE-8362:
-

[~jpountz] Thanks, attached is an updated patch.

 

Subclasses of BinaryRangeFieldRangeQuery do call super.rewrite. Did I miss a 
point here?

 

[^LUCENE-8362.patch]

> Add DocValue support for RangeFields 
> -
>
> Key: LUCENE-8362
> URL: https://issues.apache.org/jira/browse/LUCENE-8362
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Nicholas Knize
>Priority: Minor
> Attachments: LUCENE-8362-approach2.patch, LUCENE-8362.patch, 
> LUCENE-8362.patch, LUCENE-8362.patch, LUCENE-8362.patch, LUCENE-8362.patch
>
>
> I'm opening this issue to discuss adding DocValue support to 
> {{\{Int|Long|Float|Double\}Range}} field types. Since existing numeric range 
> fields already provide the methods for encoding ranges as a byte array I 
> think this could be as simple as adding syntactic sugar to existing range 
> fields that simply build an instance of {{BinaryDocValues}} using that same 
> encoding. I'm envisioning something like 
> {{doc.add(IntRange.newDocValuesField("intDV", 100)}} But I'd like to solicit 
> other ideas or potential drawbacks to this approach.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-8362) Add DocValue support for RangeFields

2019-06-04 Thread Atri Sharma (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Atri Sharma updated LUCENE-8362:

Attachment: LUCENE-8362.patch

> Add DocValue support for RangeFields 
> -
>
> Key: LUCENE-8362
> URL: https://issues.apache.org/jira/browse/LUCENE-8362
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Nicholas Knize
>Priority: Minor
> Attachments: LUCENE-8362-approach2.patch, LUCENE-8362.patch, 
> LUCENE-8362.patch, LUCENE-8362.patch, LUCENE-8362.patch, LUCENE-8362.patch
>
>
> I'm opening this issue to discuss adding DocValue support to 
> {{\{Int|Long|Float|Double\}Range}} field types. Since existing numeric range 
> fields already provide the methods for encoding ranges as a byte array I 
> think this could be as simple as adding syntactic sugar to existing range 
> fields that simply build an instance of {{BinaryDocValues}} using that same 
> encoding. I'm envisioning something like 
> {{doc.add(IntRange.newDocValuesField("intDV", 100)}} But I'd like to solicit 
> other ideas or potential drawbacks to this approach.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Created] (LUCENE-8824) TestTopDocsMerge Is Broken

2019-06-04 Thread Atri Sharma (JIRA)
Atri Sharma created LUCENE-8824:
---

 Summary: TestTopDocsMerge Is Broken
 Key: LUCENE-8824
 URL: https://issues.apache.org/jira/browse/LUCENE-8824
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Atri Sharma


Investigating a test failure post-LUCENE-8757, I realized that TestTopDocsMerge 
relies on a non-obvious invariant: that the number of Collectors involved in 
the merge equals the number of LeafReaderContexts originally present. This is 
propagated into the corresponding ScoreDocs' shardIndex fields, which can lead 
to subtle issues since shardIndex is used for tie-breaking in the priority 
queue used during the merge. I believe this is a dangerous and unnecessary 
dependency to take, since the IndexSearcher#slices method does not advertise 
any such guarantee.

 

The underlying assumption worked well in the past since the default slice 
allocation algorithm always allocated one slice per segment, guaranteeing that 
the number of Collectors (== number of slices) equals the number of leaf 
contexts. With 8757, this is no longer true.

 

I propose a rewrite of the test, where ShardSearcher is allowed to take a 
LeafSlice instance and can internally do a sequential search over the leaf 
contexts of the passed-in slice. This will allow TestTopDocsMerge to create N 
subsearchers, where N is equal to the number of slices used by the 
IndexSearcher it is compared against.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8819) org.apache.lucene.search.TestTopDocsMerge.testSort_1 failure

2019-06-04 Thread Atri Sharma (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16855493#comment-16855493
 ] 

Atri Sharma commented on LUCENE-8819:
-

[~ivera] Thanks, that worked for TestTopDocsMerge (not for TestRandomRegExp2 
though, are you using different seed and multiplier there?)

 

I investigated TestTopDocsMerge failure and listed the issue in 
https://issues.apache.org/jira/browse/LUCENE-8824

 

Let me know if it makes sense.

> org.apache.lucene.search.TestTopDocsMerge.testSort_1 failure
> 
>
> Key: LUCENE-8819
> URL: https://issues.apache.org/jira/browse/LUCENE-8819
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Ignacio Vera
>Priority: Major
> Attachments: LUCENE-8819.patch
>
>
> It can be reproduced with:
>  
> {code:java}
> ant test  -Dtestcase=TestTopDocsMerge -Dtests.method=testSort_1 
> -Dtests.seed=E916688CE5BC9122 -Dtests.multiplier=3 -Dtests.slow=true 
> -Dtests.locale=es-US -Dtests.timezone=Pacific/Johnston -Dtests.asserts=true 
> -Dtests.file.encoding=ISO-8859-1{code}
>  
> Test fails in master and branch 8.x but it does not fail in branch 8.1. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Created] (LUCENE-8825) Improve Print Info Of CheckHits

2019-06-04 Thread Atri Sharma (JIRA)
Atri Sharma created LUCENE-8825:
---

 Summary: Improve Print Info Of CheckHits
 Key: LUCENE-8825
 URL: https://issues.apache.org/jira/browse/LUCENE-8825
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Atri Sharma


CheckHits should publish the shardIndex of the two involved ScoreDoc instances 
when there is a mismatch. Since shardIndex can affect the ordering of result 
ScoreDocs (it is used for tie-breaking when no sort order is specified), this 
can be useful for understanding test failures involving CheckHits.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-8825) Improve Print Info Of CheckHits

2019-06-04 Thread Atri Sharma (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Atri Sharma updated LUCENE-8825:

Attachment: LUCENE-8825.patch

> Improve Print Info Of CheckHits
> ---
>
> Key: LUCENE-8825
> URL: https://issues.apache.org/jira/browse/LUCENE-8825
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Atri Sharma
>Priority: Minor
> Attachments: LUCENE-8825.patch
>
>
> CheckHits should publish the shardIndex of the two involved ScoreDoc 
> instances when there is a mismatch. Since shardIndex can be involved in the 
> ordering of result ScoreDocs (due to it being considered during tie break 
> when no sort order is specified), this can be useful for understanding test 
> failures involving CheckHits.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8825) Improve Print Info Of CheckHits

2019-06-04 Thread Atri Sharma (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16855586#comment-16855586
 ] 

Atri Sharma commented on LUCENE-8825:
-

Attached is a patch for the same.

 

[^LUCENE-8825.patch]

> Improve Print Info Of CheckHits
> ---
>
> Key: LUCENE-8825
> URL: https://issues.apache.org/jira/browse/LUCENE-8825
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Atri Sharma
>Priority: Minor
> Attachments: LUCENE-8825.patch
>
>
> CheckHits should publish the shardIndex of the two involved ScoreDoc 
> instances when there is a mismatch. Since shardIndex can be involved in the 
> ordering of result ScoreDocs (due to it being considered during tie break 
> when no sort order is specified), this can be useful for understanding test 
> failures involving CheckHits.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org


