[jira] [Commented] (LUCENE-8978) "Max Bottom" Based Early Termination For Concurrent Search

2019-09-13 Thread Atri Sharma (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16929280#comment-16929280
 ] 

Atri Sharma commented on LUCENE-8978:
-

Both the runs are for wikimedium2m with concurrent searches enabled

> "Max Bottom" Based Early Termination For Concurrent Search
> --
>
> Key: LUCENE-8978
> URL: https://issues.apache.org/jira/browse/LUCENE-8978
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Atri Sharma
>Priority: Major
>  Time Spent: 4h 50m
>  Remaining Estimate: 0h
>
> When running a search concurrently, collectors that have locally collected 
> the requested number of hits (i.e., whose local priority queue is full) can 
> publish their bottom hit's score globally, and other collectors can then use 
> that score as a filter. If multiple collectors have full priority queues, the 
> maximum of all bottom scores is taken as the global bottom score.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8978) "Max Bottom" Based Early Termination For Concurrent Search

2019-09-13 Thread Atri Sharma (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16929277#comment-16929277
 ] 

Atri Sharma commented on LUCENE-8978:
-

Run with propagating global minimum scores

||Task||Percentile||Base||Cmp||Pct Diff||
|('HighSpanNear', None)|P50|63.640386|65.506369|2.93207366781|
|('HighSpanNear', None)|P90|68.082931|68.719427|0.9348833704|
|('HighSpanNear', None)|P99|98.544661|90.023821|-8.64667848418|
|('HighSpanNear', None)|P999|98.544661|90.023821|-8.64667848418|
|('HighSpanNear', None)|P100|120.85372|115.88214|-4.1137169795|
|('BrowseDayOfYearSSDVFacets', None)|P50|25.833619|25.713409|-0.465323886677|
|('BrowseDayOfYearSSDVFacets', None)|P90|28.549801|35.187339|23.2489816654|
|('BrowseDayOfYearSSDVFacets', None)|P99|34.097888|61.883127|81.4866862135|
|('BrowseDayOfYearSSDVFacets', None)|P999|34.097888|61.883127|81.4866862135|
|('BrowseDayOfYearSSDVFacets', None)|P100|214.305793|275.876451|28.7302816868|
|('HighTermDayOfYearSort', 'DayOfYear')|P50|4.600415|5.241538|13.9361992342|
|('HighTermDayOfYearSort', 'DayOfYear')|P90|54.632331|41.589045|-23.8746649855|
|('HighTermDayOfYearSort', 'DayOfYear')|P99|140.777103|113.980705|-19.0346280957|
|('HighTermDayOfYearSort', 'DayOfYear')|P999|140.777103|113.980705|-19.0346280957|
|('HighTermDayOfYearSort', 'DayOfYear')|P100|212.259622|232.746881|9.65198128922|
|('HighTerm', None)|P50|0.707935|0.767744|8.44837449766|
|('HighTerm', None)|P90|2.481444|2.45366|-1.11967064338|
|('HighTerm', None)|P99|2.819463|3.250364|15.283087595|
|('HighTerm', None)|P999|2.819463|3.250364|15.283087595|
|('HighTerm', None)|P100|5.743958|67.726682|1079.09431093|
|('LowTerm', None)|P50|0.662316|0.730491|10.2934248908|
|('LowTerm', None)|P90|1.215188|3.100794|155.169899637|
|('LowTerm', None)|P99|10.361147|8.509808|-17.8680893148|
|('LowTerm', None)|P999|10.361147|8.509808|-17.8680893148|
|('LowTerm', None)|P100|40.860202|43.746191|7.06308059857|
|('AndHighLow', None)|P50|1.000578|1.001309|0.0730577726074|
|('AndHighLow', None)|P90|1.841719|1.74311|-5.35418269562|
|('AndHighLow', None)|P99|2.803872|7.829637|179.243738659|
|('AndHighLow', None)|P999|2.803872|7.829637|179.243738659|
|('AndHighLow', None)|P100|8.888941|26.286796|195.724721314|
|('MedTerm', None)|P50|0.702324|0.760572|8.29360807832|
|('MedTerm', None)|P90|1.789433|5.539351|209.559005562|
|('MedTerm', None)|P99|4.193817|14.309771|241.211144883|
|('MedTerm', None)|P999|4.193817|14.309771|241.211144883|
|('MedTerm', None)|P100|12.924386|69.040778|434.190003301|
|('AndHighHigh', None)|P50|8.716311|8.766923|0.580658491878|
|('AndHighHigh', None)|P90|22.896812|14.794421|-35.3865463891|
|('AndHighHigh', None)|P99|76.380162|27.420985|-64.0993364219|
|('AndHighHigh', None)|P999|76.380162|27.420985|-64.0993364219|
|('AndHighHigh', None)|P100|192.565741|209.282678|8.68115839982|
|('LowSloppyPhrase', None)|P50|2.504543|2.496497|-0.321256213209|
|('LowSloppyPhrase', None)|P90|5.864326|17.025432|190.322059176|
|('LowSloppyPhrase', None)|P99|17.061955|26.972014|58.0827871132|
|('LowSloppyPhrase', None)|P999|17.061955|26.972014|58.0827871132|
|('LowSloppyPhrase', None)|P100|28.311233|38.382978|35.5750842784|
|('Wildcard', None)|P50|4.622608|4.604615|-0.389239148117|
|('Wildcard', None)|P90|13.902747|9.311908|-33.0210928819|
|('Wildcard', None)|P99|212.077852|217.640103|2.62274016242|
|('Wildcard', None)|P999|212.077852|217.640103|2.62274016242|
|('Wildcard', None)|P100|256.120499|348.976972|36.254994568|
|('HighSloppyPhrase', None)|P50|40.021589|40.71495|1.73246744401|
|('HighSloppyPhrase', None)|P90|41.349646|42.092274|1.7959718446|
|('HighSloppyPhrase', None)|P99|43.137416|63.876883|48.0776757699|
|('HighSloppyPhrase', None)|P999|43.137416|63.876883|48.0776757699|
|('HighSloppyPhrase', None)|P100|889.481117|748.568262|-15.8421412559|
|('HighIntervalsOrdered', None)|P50|17.065112|17.259941|1.1416801718|
|('HighIntervalsOrdered', None)|P90|18.188702|18.965857|4.27273479988|
|('HighIntervalsOrdered', None)|P99|18.315874|50.189647|174.022670171|
|('HighIntervalsOrdered', None)|P999|18.315874|50.189647|174.022670171|
|('HighIntervalsOrdered', None)|P100|302.418464|329.078973|8.81576761133|
|('IntNRQ', None)|P50|4.603492|5.553211|20.6304040498|
|('IntNRQ', None)|P90|61.351885|61.48353|0.214573684248|
|('IntNRQ', None)|P99|164.30294|163.250118|-0.640780986634|
|('IntNRQ', None)|P999|164.30294|163.250118|-0.640780986634|
|('IntNRQ', None)|P100|224.633428|224.348545|-0.126821285032|
||Task ('BrowseDayOfYearTaxoFacets', None)||P50 Base 0.121258||P50 Cmp 
0.121229||Pct Dif

[jira] [Commented] (LUCENE-8978) "Max Bottom" Based Early Termination For Concurrent Search

2019-09-13 Thread Atri Sharma (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16929030#comment-16929030
 ] 

Atri Sharma commented on LUCENE-8978:
-

||Task||Percentile||Base||Cmp||Pct Diff||
|('HighSpanNear', None)|P50|11.060489|11.859525|7.22423755405|
|('HighSpanNear', None)|P90|15.826127|15.409751|-2.63094059589|
|('HighSpanNear', None)|P99|17.0499|15.787728|-7.4028117467|
|('HighSpanNear', None)|P999|17.0499|15.787728|-7.4028117467|
|('HighSpanNear', None)|P100|369.613225|411.489965|11.3298813916|
|('BrowseDayOfYearSSDVFacets', None)|P50|26.011344|25.870156|-0.542793944058|
|('BrowseDayOfYearSSDVFacets', None)|P90|27.199846|26.948776|-0.923056696718|
|('BrowseDayOfYearSSDVFacets', None)|P99|50.355332|62.389047|23.8975983715|
|('BrowseDayOfYearSSDVFacets', None)|P999|50.355332|62.389047|23.8975983715|
|('BrowseDayOfYearSSDVFacets', None)|P100|265.301527|242.147844|-8.72730860686|
|('HighTermDayOfYearSort', 'DayOfYear')|P50|4.855392|5.073211|4.48612593999|
|('HighTermDayOfYearSort', 'DayOfYear')|P90|91.615585|90.944365|-0.73264827158|
|('HighTermDayOfYearSort', 'DayOfYear')|P99|139.177491|134.249562|-3.54075142797|
|('HighTermDayOfYearSort', 'DayOfYear')|P999|139.177491|134.249562|-3.54075142797|
|('HighTermDayOfYearSort', 'DayOfYear')|P100|413.078905|399.62664|-3.25658484061|
|('IntNRQ', None)|P50|4.003539|4.117275|2.84088652565|
|('IntNRQ', None)|P90|68.282386|67.613176|-0.980062413168|
|('IntNRQ', None)|P99|168.038952|162.14838|-3.50548008655|
|('IntNRQ', None)|P999|168.038952|162.14838|-3.50548008655|
|('IntNRQ', None)|P100|183.270534|180.209181|-1.67040109132|
|('LowTerm', None)|P50|0.736588|0.802246|8.91380255991|
|('LowTerm', None)|P90|1.433158|9.655967|573.754533694|
|('LowTerm', None)|P99|9.67953|41.953847|333.428554899|
|('LowTerm', None)|P999|9.67953|41.953847|333.428554899|
|('LowTerm', None)|P100|57.585597|212.693297|269.351553306|
|('AndHighLow', None)|P50|1.54347|1.634274|5.88310754339|
|('AndHighLow', None)|P90|2.434604|3.283687|34.8756101608|
|('AndHighLow', None)|P99|3.374315|10.557446|212.8767172|
|('AndHighLow', None)|P999|3.374315|10.557446|212.8767172|
|('AndHighLow', None)|P100|41.81324|50.963314|21.8831977622|
|('MedTerm', None)|P50|0.89585|0.944529|5.43383378914|
|('MedTerm', None)|P90|1.404803|1.912129|36.1136757254|
|('MedTerm', None)|P99|1.721718|2.879041|67.2190800119|
|('MedTerm', None)|P999|1.721718|2.879041|67.2190800119|
|('MedTerm', None)|P100|57.913331|6.156178|-89.3700156878|
|('AndHighHigh', None)|P50|9.298414|9.193083|-1.13278458025|
|('AndHighHigh', None)|P90|17.43996|28.767063|64.9491340576|
|('AndHighHigh', None)|P99|29.387967|36.807631|25.2472857343|
|('AndHighHigh', None)|P999|29.387967|36.807631|25.2472857343|
|('AndHighHigh', None)|P100|109.854089|107.673127|-1.98532619027|
|('LowSloppyPhrase', None)|P50|5.680762|5.562709|-2.0781190974|
|('LowSloppyPhrase', None)|P90|10.573096|8.783411|-16.9267828458|
|('LowSloppyPhrase', None)|P99|11.119536|10.675304|-3.99505878663|
|('LowSloppyPhrase', None)|P999|11.119536|10.675304|-3.99505878663|
|('LowSloppyPhrase', None)|P100|279.186923|253.176147|-9.3166168818|
|('Wildcard', None)|P50|5.493537|5.347662|-2.65539305551|
|('Wildcard', None)|P90|251.824224|242.036414|-3.88676269682|
|('Wildcard', None)|P99|410.472925|411.681977|0.294550974343|
|('Wildcard', None)|P999|410.472925|411.681977|0.294550974343|
|('Wildcard', None)|P100|473.53058|467.82275|-1.20537727468|
|('HighSloppyPhrase', None)|P50|11.728682|11.905609|1.50849856787|
|('HighSloppyPhrase', None)|P90|78.56345|23.156508|-70.5250876839|
|('HighSloppyPhrase', None)|P99|165.526231|24.095868|-85.4428703811|
|('HighSloppyPhrase', None)|P999|165.526231|24.095868|-85.4428703811|
|('HighSloppyPhrase', None)|P100|239.459867|154.765063|-35.369101746|
|('HighIntervalsOrdered', None)|P50|18.723819|19.239293|2.75303878979|
|('HighIntervalsOrdered', None)|P90|20.32576|20.59|2.22377416638|
|('HighIntervalsOrdered', None)|P99|21.323183|21.997505|3.16238902982|
|('HighIntervalsOrdered', None)|P999|21.323183|21.997505|3.16238902982|
|('HighIntervalsOrdered', None)|P100|365.748746|306.958046|-16.0740674146|
|('HighTerm', None)|P50|0.982074|1.08638|10.6209919008|
|('HighTerm', None)|P90|1.859062|4.64411|149.809312438|
|('HighTerm', None)|P99|2.090176|25.399617|1115.19034761|
|('HighTerm', None)|P999|2.090176|25.399617|1115.19034761|
|('HighTerm', None)|P100|4.26937|54.324505|1172.4243858|
||Task ('BrowseDayOfYearTaxoFacets', None)||P50 Base 0.111432||P50 Cmp 
0.116611||Pct Diff 4.64767750736||P90 Base 0.177541||P90 Cm

[jira] [Created] (LUCENE-8978) "Max Bottom" Based Early Termination For Concurrent Search

2019-09-12 Thread Atri Sharma (Jira)
Atri Sharma created LUCENE-8978:
---

 Summary: "Max Bottom" Based Early Termination For Concurrent Search
 Key: LUCENE-8978
 URL: https://issues.apache.org/jira/browse/LUCENE-8978
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Atri Sharma


When running a search concurrently, collectors that have locally collected the 
requested number of hits (i.e., whose local priority queue is full) can publish 
their bottom hit's score globally, and other collectors can then use that score 
as a filter. If multiple collectors have full priority queues, the maximum of 
all bottom scores is taken as the global bottom score.
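
To make the idea concrete, here is a minimal sketch of the shared state, 
assuming plain JDK atomics. MaxBottomScore and its methods are hypothetical 
names for illustration only and are not part of Lucene. Ties are treated as 
competitive so that tie-breaking stays with the local collector.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical holder for the global bottom score shared by all collectors of one search. */
final class MaxBottomScore {
    // The float is stored as its int bits so a single atomic update is enough.
    private final AtomicInteger bits =
        new AtomicInteger(Float.floatToIntBits(Float.NEGATIVE_INFINITY));

    /** Called by a collector whose local queue is full; keeps the maximum of all published bottoms. */
    void publish(float localBottom) {
        bits.accumulateAndGet(
            Float.floatToIntBits(localBottom),
            (current, candidate) ->
                Float.intBitsToFloat(current) >= Float.intBitsToFloat(candidate) ? current : candidate);
    }

    /** Current global bottom score; negative infinity until some collector publishes. */
    float get() {
        return Float.intBitsToFloat(bits.get());
    }

    /** A hit scoring below the global bottom cannot enter the merged top hits. */
    boolean isCompetitive(float score) {
        return score >= get();
    }
}
```

Each collector would call publish once its queue fills and again whenever its 
local bottom improves, and would consult isCompetitive before inserting a hit.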






[jira] [Commented] (LUCENE-7282) search APIs should take advantage of index sort by default

2019-09-10 Thread Atri Sharma (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-7282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16926845#comment-16926845
 ] 

Atri Sharma commented on LUCENE-7282:
-

I think LUCENE-7714 does a similar thing for range queries. However, I don’t 
think we do this optimisation for exact queries yet (I might be mistaken, 
though). [~jtibshirani], any thoughts here?

> search APIs should take advantage of index sort by default
> --
>
> Key: LUCENE-7282
> URL: https://issues.apache.org/jira/browse/LUCENE-7282
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Michael McCandless
>Priority: Major
>
> Spinoff from LUCENE-6766, where we made it very easy to have Lucene sort 
> documents in the index (at merge time).
> An index-time sort is powerful because if you then search that index by the 
> same sort (or by a "prefix" of it), you can early-terminate per segment once 
> you've collected enough hits.  But doing this by default would mean accepting 
> an approximate hit count, and could not be used in cases that need to see 
> every hit, e.g. if you are also faceting.
> Separately, `TermQuery` on the leading sort field can be very fast since we 
> can advance to the first docID, and only match to the last docID for the 
> requested value.  This would not be approximate, and should be lower risk / 
> easier.
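
The docID-range idea in the description above can be sketched with two binary 
searches. This is an illustration only: SortedDocRange and the in-memory array 
are hypothetical stand-ins for the leading sort field's per-document values, 
assumed ascending by docID, and are not how Lucene stores them.

```java
/** Hypothetical helper: sortedValues[docID] is the leading sort field's value, ascending by docID. */
final class SortedDocRange {

    /** Returns {firstDoc, endDocExclusive} for docs whose value equals target; an empty range if none match. */
    static int[] range(long[] sortedValues, long target) {
        int first = lowerBound(sortedValues, target);
        // Guard against overflow when searching for the first value greater than target.
        int end = target == Long.MAX_VALUE
            ? sortedValues.length
            : lowerBound(sortedValues, target + 1);
        return new int[] {first, end};
    }

    /** Index of the first element that is >= key, or a.length if there is none. */
    private static int lowerBound(long[] a, long key) {
        int lo = 0;
        int hi = a.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (a[mid] < key) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        return lo;
    }
}
```

Because every matching document lies inside the returned range, the result is 
exact rather than approximate.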






[jira] [Created] (LUCENE-8974) Shared Bottom Score Based Early Termination For Concurrent Search

2019-09-10 Thread Atri Sharma (Jira)
Atri Sharma created LUCENE-8974:
---

 Summary: Shared Bottom Score Based Early Termination For 
Concurrent Search
 Key: LUCENE-8974
 URL: https://issues.apache.org/jira/browse/LUCENE-8974
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Atri Sharma


Following up on LUCENE-8939: once numHits hits have been collected, we should 
share a bottom score that can be used to filter hits globally and keep only 
competitive hits.






[jira] [Commented] (LUCENE-8970) TopFieldCollector(s) Should Prepopulate Sentinel Objects

2019-09-10 Thread Atri Sharma (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16926480#comment-16926480
 ] 

Atri Sharma commented on LUCENE-8970:
-

I did a prototype of this. It is a bit hairy since, unlike TopDocsCollector, 
TopFieldCollector does not directly perform comparisons against the bottom but 
instead uses FieldComparator to do the job. The problem is that a 
FieldComparator could maintain its own internal queue, which needs to be set 
with sentinel values accordingly if the queue is prepopulated. This works well 
with straightforward implementations, but for comparators like 
RelevanceComparator, which do not use the passed-in slot but instead depend on 
the presence of the scorer instance to generate the doc to be placed, this can 
be an issue.

I wonder if it is worth exposing a prePopulate API in FieldComparator which 
does what it advertises – allows prepopulating the internal structure used for 
maintaining docID mappings.
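
A rough sketch of what such a hook could look like, assuming an ascending sort 
on a long field. LongSlotComparator is a simplified, hypothetical stand-in and 
not Lucene's FieldComparator; prePopulate is the proposed (not existing) 
method.

```java
import java.util.Arrays;

/** Hypothetical, simplified long-valued comparator; not Lucene's FieldComparator. */
final class LongSlotComparator {
    private final long[] values; // per-slot values, parallel to the collector's queue
    private long bottom;

    LongSlotComparator(int numHits) {
        values = new long[numHits];
    }

    /** Proposed hook: fill every slot with a sentinel that loses to any real value. */
    void prePopulate(long sentinel) {
        Arrays.fill(values, sentinel);
        bottom = sentinel;
    }

    /** Marks the value held in the given slot as the current bottom. */
    void setBottom(int slot) {
        bottom = values[slot];
    }

    /** Positive when the candidate sorts before the bottom (ascending), i.e. it is competitive. */
    int compareBottom(long candidate) {
        return Long.compare(bottom, candidate);
    }

    /** Stores a candidate value into a slot, replacing a sentinel or an evicted hit. */
    void copy(int slot, long candidate) {
        values[slot] = candidate;
    }

    long value(int slot) {
        return values[slot];
    }
}
```

With Long.MAX_VALUE as the sentinel, every real value beats the prepopulated 
bottom, so the collector no longer needs a separate "queue not yet full" 
branch. Comparators that derive values from a scorer rather than a slot would 
need a different sentinel strategy.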

> TopFieldCollector(s) Should Prepopulate Sentinel Objects
> 
>
> Key: LUCENE-8970
> URL: https://issues.apache.org/jira/browse/LUCENE-8970
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Atri Sharma
>Priority: Major
>
> We do not prepopulate the hit queue with sentinel values today, thus leading 
> to extra checks and extra code.






[jira] [Created] (LUCENE-8970) TopFieldCollector(s) Should Prepopulate Sentinel Objects

2019-09-06 Thread Atri Sharma (Jira)
Atri Sharma created LUCENE-8970:
---

 Summary: TopFieldCollector(s) Should Prepopulate Sentinel Objects
 Key: LUCENE-8970
 URL: https://issues.apache.org/jira/browse/LUCENE-8970
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Atri Sharma


We do not prepopulate the hit queue with sentinel values today, thus leading to 
extra checks and extra code.






[jira] [Commented] (LUCENE-8963) Allow Collectors To "Publish" If They Can Be Used In Concurrent Search

2019-09-04 Thread Atri Sharma (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16922322#comment-16922322
 ] 

Atri Sharma commented on LUCENE-8963:
-

Yeah, I agree.

 

My only gripe is that in case a collector is not really reducible or has some 
semantic constraints against concurrency, we do not provide any defense against 
getting into an unknown state.

 

Maybe it is not an engine problem but more of a user issue – but I wanted to 
raise this point and see if we have any thoughts about this.

> Allow Collectors To "Publish" If They Can Be Used In Concurrent Search
> --
>
> Key: LUCENE-8963
> URL: https://issues.apache.org/jira/browse/LUCENE-8963
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Atri Sharma
>Priority: Major
>
> There is an implied assumption today that all we need to run a query 
> concurrently is a CollectorManager implementation. While that is true, there 
> might be some corner cases where a Collector's semantics do not allow it to 
> be concurrently executed (think of ES's aggregates). If a user manages to 
> write a CollectorManager with a Collector that is not really concurrent 
> friendly, we could end up in an undefined state.
>  
> This Jira is more of a rhetorical discussion, and to explore if we should 
> allow Collectors to implement an API which simply returns a boolean 
> signifying if a Collector is parallel ready or not. The default would be 
> true, until a Collector explicitly overrides it?






[jira] [Created] (LUCENE-8963) Allow Collectors To "Publish" If They Can Be Used In Concurrent Search

2019-09-04 Thread Atri Sharma (Jira)
Atri Sharma created LUCENE-8963:
---

 Summary: Allow Collectors To "Publish" If They Can Be Used In 
Concurrent Search
 Key: LUCENE-8963
 URL: https://issues.apache.org/jira/browse/LUCENE-8963
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Atri Sharma


There is an implied assumption today that all we need to run a query 
concurrently is a CollectorManager implementation. While that is true, there 
might be some corner cases where a Collector's semantics do not allow it to be 
concurrently executed (think of ES's aggregates). If a user manages to write a 
CollectorManager with a Collector that is not really concurrent friendly, we 
could end up in an undefined state.

 

This Jira is more of a rhetorical discussion, and to explore if we should allow 
Collectors to implement an API which simply returns a boolean signifying if a 
Collector is parallel ready or not. The default would be true, until a 
Collector explicitly overrides it?
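
For discussion, the proposed flag could be as small as a default method. The 
sketch below is self-contained and uses hypothetical names 
(ParallelAwareCollector, isParallelReady, ConcurrencyGuard); none of them 
exist in Lucene today.

```java
import java.util.List;

/** Hypothetical collector interface; only the capability flag is modelled here. */
interface ParallelAwareCollector {
    /** Defaults to true, as proposed; collectors with concurrency constraints override it. */
    default boolean isParallelReady() {
        return true;
    }
}

/** A collector with no special constraints simply inherits the default. */
final class CountingCollector implements ParallelAwareCollector {}

/** A collector whose semantics depend on visiting order opts out explicitly. */
final class OrderDependentCollector implements ParallelAwareCollector {
    @Override
    public boolean isParallelReady() {
        return false;
    }
}

final class ConcurrencyGuard {
    /** The searcher would run slices concurrently only if every collector agrees. */
    static boolean canRunConcurrently(List<? extends ParallelAwareCollector> collectors) {
        return collectors.stream().allMatch(ParallelAwareCollector::isParallelReady);
    }
}
```

A searcher could fall back to sequential execution, or fail fast, when the 
guard returns false.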






[jira] [Commented] (LUCENE-8403) Support 'filtered' term vectors - don't require all terms to be present

2019-08-29 Thread Atri Sharma (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16918358#comment-16918358
 ] 

Atri Sharma commented on LUCENE-8403:
-

David, sorry for the delay in response – this somehow was misplaced by my inbox.

 

 I get a NullPointerException when CheckIndex tries to validate term vectors.

 

I understand the approaches – your approach seems to be a longer term solution 
(I am not sure of the complexity implications though).

 

How do you suggest we approach this?

> Support 'filtered' term vectors - don't require all terms to be present
> ---
>
> Key: LUCENE-8403
> URL: https://issues.apache.org/jira/browse/LUCENE-8403
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Michael Braun
>Priority: Minor
> Attachments: LUCENE-8403.patch
>
>
> The genesis of this was a conversation and idea from [~dsmiley] several years 
> ago.
> In order to optimize term vector storage, we may not actually need all tokens 
> to be present in the term vectors - and if so, ideally our codec could just 
> opt not to store them.
> I attempted to fork the standard codec and override the TermVectorsFormat and 
> TermVectorsWriter to ignore storing certain Terms within a field. This 
> worked, however, CheckIndex checks that the terms present in the standard 
> postings are also present in the TVs, if TVs enabled. So this then doesn't 
> work as 'valid' according to CheckIndex.
> Can the TermVectorsFormat be made in such a way to support configuration of 
> tokens that should not be stored (benefits: less storage, more optimal 
> retrieval per doc)? Is this valuable to the wider community? Is there a way 
> we can design this to not break CheckIndex's contract while at the same time 
> lessening storage for unneeded tokens?






[jira] [Created] (LUCENE-8958) Add Shared Count Based Concurrent Early Termination For TopScoreDocCollector

2019-08-27 Thread Atri Sharma (Jira)
Atri Sharma created LUCENE-8958:
---

 Summary: Add Shared Count Based Concurrent Early Termination For 
TopScoreDocCollector
 Key: LUCENE-8958
 URL: https://issues.apache.org/jira/browse/LUCENE-8958
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Atri Sharma


LUCENE-8939 implements a shared-count early termination collector manager for 
indices sorted by non-relevance fields. This Jira tracks efforts for 
implementing the same for TopScoreDocCollector when the index is sorted by 
relevance.
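
As an illustration of the shared-count part only, here is a minimal sketch 
using a JDK atomic counter. SharedHitCounter is a hypothetical name, and the 
sketch deliberately leaves out the conditions under which terminating a 
segment early is actually safe.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical counter shared by all collectors created for one concurrent search. */
final class SharedHitCounter {
    private final int numHits;
    private final AtomicInteger collected = new AtomicInteger();

    SharedHitCounter(int numHits) {
        this.numHits = numHits;
    }

    /** Called by any collector each time it accepts a hit. */
    void recordHit() {
        collected.incrementAndGet();
    }

    /** True once the collectors have, between them, gathered the requested number of hits. */
    boolean reachedNumHits() {
        return collected.get() >= numHits;
    }
}
```

A collector would check reachedNumHits between documents and combine it with 
its own local state before deciding to stop.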






[jira] [Commented] (LUCENE-8403) Support 'filtered' term vectors - don't require all terms to be present

2019-08-26 Thread Atri Sharma (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16916404#comment-16916404
 ] 

Atri Sharma commented on LUCENE-8403:
-

Thanks for reviewing, David.

 

I did notice a CheckHits breakage on this patch – I was hoping to get some 
early feedback on the patch and then seek advice to solve the open problems.

 

Does it make sense for me to adapt the patch to support pattern based filtering?

 

RE: CheckHits fix, how about Hoss's idea to allow the TermVector codec to 
publish which terms are available?







[jira] [Commented] (LUCENE-8403) Support 'filtered' term vectors - don't require all terms to be present

2019-08-25 Thread Atri Sharma (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16915472#comment-16915472
 ] 

Atri Sharma commented on LUCENE-8403:
-

Any thoughts on this one?







[jira] [Commented] (LUCENE-8950) FieldComparators Should Not Maintain Implicit PQs

2019-08-14 Thread Atri Sharma (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16907093#comment-16907093
 ] 

Atri Sharma commented on LUCENE-8950:
-

{quote}you would like to introduce a sub class of FieldComparator that hides 
the fact that it maintains an implicit PQ, and make simple comparators extend 
this sub class instead of FieldComparator directly?
{quote}
Yes, exactly.

 

Thanks for validating – I will work on a PR now.

> FieldComparators Should Not Maintain Implicit PQs
> -
>
> Key: LUCENE-8950
> URL: https://issues.apache.org/jira/browse/LUCENE-8950
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Atri Sharma
>Priority: Major
>
> While doing some perf tests, I realised that FieldComparators inherently 
> maintain implicit priority queues for maintaining the sorted order of 
> documents for the given sort order. This is wasteful especially in the case 
> of a multi feature sort order and a large number of hits requested.
>  
> We should change this to have FieldComparators maintain only the top and 
> bottom values, and use them as barriers to compare



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8950) FieldComparators Should Not Maintain Implicit PQs

2019-08-14 Thread Atri Sharma (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16907075#comment-16907075
 ] 

Atri Sharma commented on LUCENE-8950:
-

{quote}This looks like a duplicate of LUCENE-8878?
{quote}
Not necessarily – 8878 targets refactoring the API to be simpler, whereas this 
Jira only targets removing the requirement that FieldComparators maintain 
their own priority queues. I believe this Jira complements 8878.
{quote}I think all of us agree on the fact that it would be nice to have a 
simpler FieldComparator API. The challenge is that we don't want to trade too 
much efficiency. For instance the API you are proposing wouldn't work well with 
geo-distance sorting since it would require computing the actual distance for 
every new document, while the current implementation tries to be smart to first 
check a bounding box, and then compute a sort key that compares like the actual 
distance but is much cheaper to compute
{quote}
Agreed, that is precisely why I suggested deprecating compare(slot, slot) 
instead of removing it completely. The idea is that comparators that require 
access to an internal PQ for whatever reason are free to do so, but it should 
not be mandatory, and future comparators should not take on this dependency 
without understanding the tradeoffs.







[jira] [Commented] (LUCENE-8950) FieldComparators Should Not Maintain Implicit PQs

2019-08-14 Thread Atri Sharma (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16906978#comment-16906978
 ] 

Atri Sharma commented on LUCENE-8950:
-

I confess I do not have a very clean idea as to how this can be implemented: 
the typical usages of FieldComparator mandate that the user maintain a list of 
slots into the FieldComparator, which can implicitly be as bad in terms of size 
as the queue itself. FieldComparator provides a convenient API to allow 
comparisons between two values of the type maintained in the queue, which can 
form the basis of this observation.

 

Here is the first cut of proposal that I have in mind:

1) Deprecate compare(slot, slot) so that new implementations do not depend on 
this method, but rather use compare(T val, T val).

2) Start with some comparators (Numeric comparators?), get rid of the implicit 
priority queue and make the user maintain those values.

3) Make Numeric comparators track only the top and bottom values, as needed.

 

Note that I am treating NumericComparators as the starting point/example, but 
the approach should extend for other comparators as well.

 

With https://github.com/apache/lucene-solr/pull/831, getting values out of 
leaf comparators should be easy, so the logical step after this PR is to depend 
on compare(val, val) more than we rely on compare(slot, slot).

 

Happy to receive feedback and alternate proposals.
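
To make steps 2) and 3) a little more concrete, here is a minimal sketch under 
stated assumptions. It is not the FieldComparator API: 
BottomBarrierLongComparator, TopNLongs and all of their methods are 
hypothetical names. The comparator keeps no per-slot array and only remembers 
the current bottom value as a barrier, while the caller owns the queue of 
values (ascending sort, smaller is better).

```java
import java.util.PriorityQueue;

/** Hypothetical sketch; names are illustrative and not part of Lucene. */
final class BottomBarrierLongComparator {
  private long bottom = Long.MAX_VALUE; // weakest value currently kept
  private boolean hasBottom = false;

  /** Stand-in for compare(val, val): plain value comparison, no slots. */
  int compareValues(long a, long b) {
    return Long.compare(a, b);
  }

  /** The caller, which owns the queue, publishes its current bottom. */
  void setBottom(long value) {
    bottom = value;
    hasBottom = true;
  }

  /** True if the candidate would beat the current bottom. */
  boolean isCompetitive(long candidate) {
    return !hasBottom || compareValues(candidate, bottom) < 0;
  }
}

/** The caller keeps the top-N values itself, here in a max-heap. */
final class TopNLongs {
  private final int numHits;
  private final BottomBarrierLongComparator comparator = new BottomBarrierLongComparator();
  private final PriorityQueue<Long> queue;

  TopNLongs(int numHits) {
    this.numHits = numHits;
    // Reversed order: the head is the largest, i.e. the worst kept value.
    this.queue = new PriorityQueue<>((a, b) -> comparator.compareValues(b, a));
  }

  void collect(long value) {
    if (queue.size() < numHits) {
      queue.add(value);
      if (queue.size() == numHits) {
        comparator.setBottom(queue.peek());
      }
    } else if (comparator.isCompetitive(value)) {
      queue.poll();
      queue.add(value);
      comparator.setBottom(queue.peek());
    }
  }

  long worstKept() {
    return queue.peek();
  }

  int size() {
    return queue.size();
  }
}
```

The point of the sketch is only the division of labour: the comparator answers 
"is this value competitive?" from a single barrier, and the size of the queue 
is the caller's concern.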

> FieldComparators Should Not Maintain Implicit PQs
> -
>
> Key: LUCENE-8950
> URL: https://issues.apache.org/jira/browse/LUCENE-8950
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Atri Sharma
>Priority: Major
>
> While doing some perf tests, I realised that FieldComparators inherently 
> maintain implicit priority queues for maintaining the sorted order of 
> documents for the given sort order. This is wasteful especially in the case 
> of a multi feature sort order and a large number of hits requested.
>  
> We should change this to have FieldComparators maintain only the top and 
> bottom values, and use them as barriers to compare






[jira] [Created] (LUCENE-8950) FieldComparators Should Not Maintain Implicit PQs

2019-08-13 Thread Atri Sharma (JIRA)
Atri Sharma created LUCENE-8950:
---

 Summary: FieldComparators Should Not Maintain Implicit PQs
 Key: LUCENE-8950
 URL: https://issues.apache.org/jira/browse/LUCENE-8950
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Atri Sharma


While doing some perf tests, I realised that FieldComparators inherently 
maintain implicit priority queues for maintaining the sorted order of documents 
for the given sort order. This is wasteful especially in the case of a multi 
feature sort order and a large number of hits requested.

 

We should change this to have FieldComparators maintain only the top and bottom 
values, and use them as barriers to compare






[jira] [Created] (LUCENE-8949) Allow LeafFieldComparators to publish feature values

2019-08-12 Thread Atri Sharma (JIRA)
Atri Sharma created LUCENE-8949:
---

 Summary: Allow LeafFieldComparators to publish feature values
 Key: LUCENE-8949
 URL: https://issues.apache.org/jira/browse/LUCENE-8949
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Atri Sharma


We allow LeafFieldComparators to only accept a docID, get the equivalent 
feature value(s) and compare against the bottom/top of the values set for the 
comparator. This mandates that the values being compared against the bottom/top 
should originate from the same comparator. This does not allow use cases such 
as cross comparator value comparisons i.e. if a user wanted to compute the 
"global" minimum across multiple comparators.

 

FieldComparators expose an API to get the feature value corresponding to a 
docID. We should let LeafFieldComparators do the same. A new comparison method 
is not required since the parent FieldComparator's compare method can be used 
once the values are retrieved.
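
A rough sketch of what this could enable, assuming a hypothetical interface 
(ValuePublishingLeafComparator and GlobalMinimum are made-up names, not Lucene 
classes). Each leaf comparator publishes the value for a docID, and the parent 
comparator's compare method, modelled here by a plain java.util.Comparator, is 
reused to find the "global" minimum.

```java
import java.util.Comparator;
import java.util.List;

/** Hypothetical sketch; this interface is not the Lucene LeafFieldComparator. */
interface ValuePublishingLeafComparator<T> {
  /** Returns the feature value for the given segment-local docID. */
  T valueOf(int doc);
}

final class GlobalMinimum {
  private GlobalMinimum() {}

  /**
   * Picks the smallest value across several leaf comparators, each asked for
   * the value of its own docID. No new comparison method is needed: the
   * parent's value comparison is passed in as parentCompare.
   */
  static <T> T of(
      List<ValuePublishingLeafComparator<T>> leaves, int[] docs, Comparator<T> parentCompare) {
    if (leaves.isEmpty() || leaves.size() != docs.length) {
      throw new IllegalArgumentException("need one docID per leaf comparator");
    }
    T min = leaves.get(0).valueOf(docs[0]);
    for (int i = 1; i < docs.length; i++) {
      T candidate = leaves.get(i).valueOf(docs[i]);
      if (parentCompare.compare(candidate, min) < 0) {
        min = candidate;
      }
    }
    return min;
  }
}
```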






[jira] [Commented] (LUCENE-8213) Cache costly subqueries asynchronously

2019-08-05 Thread Atri Sharma (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16900160#comment-16900160
 ] 

Atri Sharma commented on LUCENE-8213:
-

I raised a PR for the same. The performance numbers from the newly enhanced 
luceneutil for wikimedium10m are:

 

Latencies:
||Task||P50 Base||P50 Cmp||Pct Diff||P90 Base||P90 Cmp||Pct Diff||P99 Base||P99 Cmp||Pct Diff||P999 Base||P999 Cmp||Pct Diff||P100 Base||P100 Cmp||Pct Diff||
|('Wildcard', None)|2.045201|2.089539|2.16790427933|18.845334|35.346911|87.5631973411|83.02344|48.300884|-41.8225937157|83.02344|48.300884|-41.8225937157|249.902876|87.512667|-64.9813285862|
|('HighTermDayOfYearSort', 'DayOfYear')|4.295828|4.727759|10.0546623375|9.037488|55.639159|515.648496573|82.149576|81.261365|-1.08121191033|82.149576|81.261365|-1.08121191033|86.642014|168.84768|94.8796804285|
|('MedSloppyPhrase', None)|9.18549|8.683321|-5.46698107559|29.233836|30.984274|5.98771232075|34.303039|35.978633|4.88468091705|34.303039|35.978633|4.88468091705|181.426025|261.742214|44.2693869306|
|('OrHighHigh', None)|20.997779|16.938239|-19.3331875719|26.989668|29.711731|10.0855742279|71.1345|72.914457|2.50224152837|71.1345|72.914457|2.50224152837|288.85441|203.02949|-29.7121723016|
|('MedPhrase', None)|6.935508|6.676061|-3.74085070625|8.834132|7.366097|-16.6177616545|61.645788|59.423887|-3.60430302229|61.645788|59.423887|-3.60430302229|65.592528|63.493249|-3.20048496987|
|('LowSpanNear', None)|23.256239|23.17936|-0.330573658105|33.890598|34.205568|0.929372801271|34.958863|34.857876|-0.288873811485|34.958863|34.857876|-0.288873811485|96.937787|121.889403|25.7398242442|
|('Fuzzy2', None)|25.45292|25.25128|-0.792207730979|79.376572|106.649481|34.3588899254|108.933154|122.051216|12.0423044026|108.933154|122.051216|12.0423044026|212.373308|209.138442|-1.52319800942|
|('OrNotHighHigh', None)|1.903331|2.16024|13.4978624317|4.890325|4.723459|-3.4121658581|102.556452|102.641448|0.0828772820651|102.556452|102.641448|0.0828772820651|226.783706|308.709148|36.1249242483|
|('OrHighNotLow', None)|1.434646|1.52378|6.21296124619|3.905319|4.569729|17.0129507986|6.321682|7.281513|15.1831585328|6.321682|7.281513|15.1831585328|7.720665|15.035781|94.7472270847|
|('BrowseMonthSSDVFacets', None)|93.940495|93.939183|-0.00139662879145|102.50354|98.604983|-3.80333888956|103.572854|106.785928|3.10223564951|103.572854|106.785928|3.10223564951|283.457123|244.054099|-13.9008762888|
|('Fuzzy1', None)|26.559456|29.050383|9.37868230434|159.424881|171.063113|7.30013529068|339.7673|179.733118|-47.1011136151|339.7673|179.733118|-47.1011136151|417.349072|395.168736|-5.31457657105|
|('HighSloppyPhrase', None)|9.489382|9.980939|5.18007389733|14.424659|15.315198|6.17372653315|37.046395|31.348423|-15.380638251|37.046395|31.348423|-15.380638251|51.797966|33.660774|-35.0152590934|
|('OrNotHighMed', None)|1.605631|1.549948|-3.46798236955|16.030506|11.175798|-30.2841844169|63.933462|63.33348|-0.938447537848|63.933462|63.33348|-0.938447537848|176.946354|[truncated]|[truncated]|

[jira] [Created] (LUCENE-8946) LRUQueryCache#doCache Should Be More Verbose

2019-08-04 Thread Atri Sharma (JIRA)
Atri Sharma created LUCENE-8946:
---

 Summary: LRUQueryCache#doCache Should Be More Verbose
 Key: LUCENE-8946
 URL: https://issues.apache.org/jira/browse/LUCENE-8946
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Atri Sharma


doCache does not really cache the query on its invocation. The actual caching 
(or checks) will happen only during scoring. doCache is basically creating the 
caching weight wrapper around the original weight of the query.

 

We should 1) rename the method and/or 2) update the documentation around the 
method to explicitly call out this behaviour.
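
To illustrate the distinction only, here is a small sketch. LazyCache is a 
hypothetical stand-in, not LRUQueryCache, and wrap() merely plays the role 
that doCache plays above: it returns a wrapper and stores nothing. An entry is 
created only when the wrapper is actually used.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical stand-in for a cache; not the LRUQueryCache API. */
final class LazyCache<K, V> {
  private final Map<K, V> entries = new HashMap<>();

  /** Analogous to doCache: only wraps the original, caches nothing yet. */
  Function<K, V> wrap(Function<K, V> original) {
    // The cache is populated on first use of the wrapper, not here.
    return key -> entries.computeIfAbsent(key, original);
  }

  int size() {
    return entries.size();
  }
}
```

A reader who sees a method named like wrap() is unlikely to assume that the 
cache is populated after the call, which is the argument for the rename.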






[jira] [Created] (LUCENE-8942) Tighten Up LRUQueryCache's Methods

2019-08-01 Thread Atri Sharma (JIRA)
Atri Sharma created LUCENE-8942:
---

 Summary: Tighten Up LRUQueryCache's Methods
 Key: LUCENE-8942
 URL: https://issues.apache.org/jira/browse/LUCENE-8942
 Project: Lucene - Core
  Issue Type: Improvement
 Environment: LRUQueryCache has less strict visibility of methods than 
it can, and has some redundant parameters.
Reporter: Atri Sharma









[jira] [Updated] (LUCENE-8929) Early Terminating CollectorManager

2019-07-28 Thread Atri Sharma (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Atri Sharma updated LUCENE-8929:

Issue Type: Sub-task  (was: Improvement)
Parent: LUCENE-8940

> Early Terminating CollectorManager
> --
>
> Key: LUCENE-8929
> URL: https://issues.apache.org/jira/browse/LUCENE-8929
> Project: Lucene - Core
>  Issue Type: Sub-task
>Reporter: Atri Sharma
>Priority: Major
>  Time Spent: 2h 50m
>  Remaining Estimate: 0h
>
> We should have an early terminating collector manager which accurately tracks 
> hits across all of its collectors and determines when there are enough hits, 
> allowing all the collectors to abort.
> The options for the same are:
> 1) Shared total count : Global "scoreboard" where all collectors update their 
> current hit count. At the end of each document's collection, collector checks 
> if N > threshold, and aborts if true
> 2) State Reporting Collectors: Collectors report their total number of counts 
> collected periodically using a callback mechanism, and get a proceed or abort 
> decision.
> 1) has the overhead of synchronization in the hot path, 2) can collect 
> unnecessary hits before aborting.
> I am planning to work on 2), unless objections






[jira] [Updated] (LUCENE-8939) Shared Hit Count Early Termination

2019-07-28 Thread Atri Sharma (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Atri Sharma updated LUCENE-8939:

Summary: Shared Hit Count Early Termination  (was: Global Early Termination 
For Sorted Collections)

> Shared Hit Count Early Termination
> --
>
> Key: LUCENE-8939
> URL: https://issues.apache.org/jira/browse/LUCENE-8939
> Project: Lucene - Core
>  Issue Type: Sub-task
>Reporter: Atri Sharma
>Priority: Major
>
> When collecting hits across sorted segments, it should be possible to 
> terminate early across all slices when enough hits have been collected 
> globally i.e. hit count > numHits AND hit count < totalHitsThreshold






[jira] [Created] (LUCENE-8940) Early Termination Across Slices

2019-07-28 Thread Atri Sharma (JIRA)
Atri Sharma created LUCENE-8940:
---

 Summary: Early Termination Across Slices
 Key: LUCENE-8940
 URL: https://issues.apache.org/jira/browse/LUCENE-8940
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Atri Sharma


This JIRA tracks efforts for global early termination when segments are sorted. 
The cases being chased are:

1) Sorted segments -- hit count > numHits but less than threshold

2) Sorted segments and sort key is non score -- use shared PQ

3) Sorted segments and sort key is score -- propagate global minimum score






[jira] [Updated] (LUCENE-8939) Global Early Termination For Sorted Collections

2019-07-28 Thread Atri Sharma (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Atri Sharma updated LUCENE-8939:

Issue Type: Sub-task  (was: Improvement)
Parent: LUCENE-8940

> Global Early Termination For Sorted Collections
> ---
>
> Key: LUCENE-8939
> URL: https://issues.apache.org/jira/browse/LUCENE-8939
> Project: Lucene - Core
>  Issue Type: Sub-task
>Reporter: Atri Sharma
>Priority: Major
>
> When collecting hits across sorted segments, it should be possible to 
> terminate early across all slices when enough hits have been collected 
> globally i.e. hit count > numHits AND hit count < totalHitsThreshold






[jira] [Created] (LUCENE-8939) Global Early Termination For Sorted Collections

2019-07-28 Thread Atri Sharma (JIRA)
Atri Sharma created LUCENE-8939:
---

 Summary: Global Early Termination For Sorted Collections
 Key: LUCENE-8939
 URL: https://issues.apache.org/jira/browse/LUCENE-8939
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Atri Sharma


When collecting hits across sorted segments, it should be possible to terminate 
early across all slices when enough hits have been collected globally i.e. hit 
count > numHits AND hit count < totalHitsThreshold






[jira] [Commented] (LUCENE-8936) Add SpanishMinimalStemFilter

2019-07-27 Thread Atri Sharma (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16894336#comment-16894336
 ] 

Atri Sharma commented on LUCENE-8936:
-

Hello Vinod!

Welcome to the community. Thank you for your contribution.

I would suggest following either of two approaches: 1) Attach a patch to this 
JIRA or 2) Open a pull request on the Lucene-Solr GitHub repository. Somebody 
will review your contribution soon and provide feedback.

> Add SpanishMinimalStemFilter
> 
>
> Key: LUCENE-8936
> URL: https://issues.apache.org/jira/browse/LUCENE-8936
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: vinod kumar
>Priority: Major
> Attachments: LUCENE-8936.patch
>
>
> SpanishMinimalStemmerFilter is less aggressive stemmer than 
> SpanishLightStemmerFilter
> Ex:
> input tokens -> output tokens
>  1. camiseta niños -> *camiseta* and *nino*
>  2. camisas -> camisa
> *camisetas* and *camisas* are t-shirts and shirts respectively.
>  Stemming both of the tokens to *camis* will match both tokens and returns 
> both t-shirts and shirts for query camisas(shirts). 
> SpanishMinimalStemmerFilter will help handling these cases.
> And importantly It will preserve gender context with tokens.
> Ex:  *niños* ,*niñas* *chicos* and *chicas* are stemmed to *nino*, *nina*, 
> *chico* and *chica*






[jira] [Created] (SOLR-13655) Cut Over Collections.unmodifiedSet usages to Set.*

2019-07-26 Thread Atri Sharma (JIRA)
Atri Sharma created SOLR-13655:
--

 Summary: Cut Over Collections.unmodifiedSet usages to Set.*
 Key: SOLR-13655
 URL: https://issues.apache.org/jira/browse/SOLR-13655
 Project: Solr
  Issue Type: Improvement
  Security Level: Public (Default Security Level. Issues are Public)
Reporter: Atri Sharma









[jira] [Commented] (LUCENE-8929) Early Terminating CollectorManager

2019-07-25 Thread Atri Sharma (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16892643#comment-16892643
 ] 

Atri Sharma commented on LUCENE-8929:
-

Ok, so I have been working on this and am wondering what the definition 
(parameter) of a globally competitive hit should be. Should it be the largest 
of the worst accepted hits across all collectors, with all collectors using 
that as the minimum threshold when filtering further hits?
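
One way to picture that definition, as a sketch only (MaxBottomScore and its 
methods are hypothetical names, not Lucene API): every collector whose local 
queue is full publishes its worst accepted score, and the shared threshold is 
the maximum of the published values. Ties are treated as competitive here, 
which is an assumption on my part.

```java
import java.util.concurrent.atomic.DoubleAccumulator;

/** Hypothetical sketch; tracks the largest bottom score published so far. */
final class MaxBottomScore {
  private final DoubleAccumulator maxBottom =
      new DoubleAccumulator(Math::max, Double.NEGATIVE_INFINITY);

  /** Called by a collector whose local queue is full, with its worst accepted score. */
  void publishBottom(double localBottomScore) {
    maxBottom.accumulate(localBottomScore);
  }

  /** A hit scoring below the shared threshold is filtered out. */
  boolean isGloballyCompetitive(double score) {
    return score >= maxBottom.get();
  }
}
```

The reasoning behind taking the maximum: a collector with a full queue and 
bottom score b already holds numHits hits scoring at least b, so a hit below b 
cannot make the merged top numHits.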

> Early Terminating CollectorManager
> --
>
> Key: LUCENE-8929
> URL: https://issues.apache.org/jira/browse/LUCENE-8929
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Atri Sharma
>Priority: Major
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> We should have an early terminating collector manager which accurately tracks 
> hits across all of its collectors and determines when there are enough hits, 
> allowing all the collectors to abort.
> The options for the same are:
> 1) Shared total count : Global "scoreboard" where all collectors update their 
> current hit count. At the end of each document's collection, collector checks 
> if N > threshold, and aborts if true
> 2) State Reporting Collectors: Collectors report their total number of counts 
> collected periodically using a callback mechanism, and get a proceed or abort 
> decision.
> 1) has the overhead of synchronization in the hot path, 2) can collect 
> unnecessary hits before aborting.
> I am planning to work on 2), unless objections






[jira] [Created] (LUCENE-8931) TestTopFieldCollectorEarlyTermination Should Use CheckHits

2019-07-23 Thread Atri Sharma (JIRA)
Atri Sharma created LUCENE-8931:
---

 Summary: TestTopFieldCollectorEarlyTermination Should Use CheckHits
 Key: LUCENE-8931
 URL: https://issues.apache.org/jira/browse/LUCENE-8931
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Atri Sharma


TestTopFieldCollectorEarlyTermination invents a new way of checking equality of 
hits. That is redundant since CheckHits provides the same functionality and is 
the de facto standard now.






[jira] [Commented] (LUCENE-8929) Early Terminating CollectorManager

2019-07-23 Thread Atri Sharma (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16890760#comment-16890760
 ] 

Atri Sharma commented on LUCENE-8929:
-

bq. So you need to collect each segment at least until {numHits} hits have been 
collected, or until the last collected hit was not competitive globally 
(whichever comes first)

Yeah, sorry I was not clear. Per collector, we will collect until numHits hits 
are collected.

I have opened a PR implementing the same: 
https://github.com/apache/lucene-solr/pull/803

Hoping the code gives more clarity

> Early Terminating CollectorManager
> --
>
> Key: LUCENE-8929
> URL: https://issues.apache.org/jira/browse/LUCENE-8929
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Atri Sharma
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> We should have an early terminating collector manager which accurately tracks 
> hits across all of its collectors and determines when there are enough hits, 
> allowing all the collectors to abort.
> The options for the same are:
> 1) Shared total count : Global "scoreboard" where all collectors update their 
> current hit count. At the end of each document's collection, collector checks 
> if N > threshold, and aborts if true
> 2) State Reporting Collectors: Collectors report their total number of counts 
> collected periodically using a callback mechanism, and get a proceed or abort 
> decision.
> 1) has the overhead of synchronization in the hot path, 2) can collect 
> unnecessary hits before aborting.
> I am planning to work on 2), unless objections






[jira] [Commented] (LUCENE-8929) Early Terminating CollectorManager

2019-07-23 Thread Atri Sharma (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16890749#comment-16890749
 ] 

Atri Sharma commented on LUCENE-8929:
-

bq. OK, so if I understand correctly you are still collecting the first numHits 
hits as today, but you are trying to avoid collecting 
${totalHitsThreshold-numHits} additional hits on every slice with this global 
counter?

Yeah, exactly.

The first numHits hits can be spread across all the involved collectors, but 
with the global counter, all collectors will abort once they realize that 
numHits number of hits have been collected globally, even if the total hit 
count per collector is, obviously, < numHits.

> Early Terminating CollectorManager
> --
>
> Key: LUCENE-8929
> URL: https://issues.apache.org/jira/browse/LUCENE-8929
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Atri Sharma
>Priority: Major
>
> We should have an early terminating collector manager which accurately tracks 
> hits across all of its collectors and determines when there are enough hits, 
> allowing all the collectors to abort.
> The options for the same are:
> 1) Shared total count : Global "scoreboard" where all collectors update their 
> current hit count. At the end of each document's collection, collector checks 
> if N > threshold, and aborts if true
> 2) State Reporting Collectors: Collectors report their total number of counts 
> collected periodically using a callback mechanism, and get a proceed or abort 
> decision.
> 1) has the overhead of synchronization in the hot path, 2) can collect 
> unnecessary hits before aborting.
> I am planning to work on 2), unless objections






[jira] [Commented] (LUCENE-8929) Early Terminating CollectorManager

2019-07-23 Thread Atri Sharma (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16890734#comment-16890734
 ] 

Atri Sharma commented on LUCENE-8929:
-

{quote}What collector do you have in mind? Is it TopFieldCollector?
{quote}
Yes, that is the one.

 

I did some tests, and am now inclined to go with 1), since that is a less 
invasive change and allows accurate termination with minimal overhead (< 3% 
degradation). This is because AtomicInteger is typically implemented with 
lock-free atomic instructions rather than a synchronization lock on modern 
hardware.
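
For reference, option 1) boils down to something like the following sketch. 
SharedHitCounter is a hypothetical name and this is not the code in the PR. 
Every collector shares one instance and calls it once per collected document.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical sketch of option 1; names are illustrative, not Lucene API. */
final class SharedHitCounter {
  private final AtomicInteger globalHits = new AtomicInteger();
  private final int threshold;

  SharedHitCounter(int threshold) {
    this.threshold = threshold;
  }

  /**
   * Called once per collected document. Returns true when the calling
   * collector should abort because the global count exceeds the threshold.
   */
  boolean recordHitAndCheck() {
    // A single atomic increment; no lock is taken on the hot path.
    return globalHits.incrementAndGet() > threshold;
  }

  int total() {
    return globalHits.get();
  }
}
```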

> Early Terminating CollectorManager
> --
>
> Key: LUCENE-8929
> URL: https://issues.apache.org/jira/browse/LUCENE-8929
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Atri Sharma
>Priority: Major
>
> We should have an early terminating collector manager which accurately tracks 
> hits across all of its collectors and determines when there are enough hits, 
> allowing all the collectors to abort.
> The options for the same are:
> 1) Shared total count : Global "scoreboard" where all collectors update their 
> current hit count. At the end of each document's collection, collector checks 
> if N > threshold, and aborts if true
> 2) State Reporting Collectors: Collectors report their total number of counts 
> collected periodically using a callback mechanism, and get a proceed or abort 
> decision.
> 1) has the overhead of synchronization in the hot path, 2) can collect 
> unnecessary hits before aborting.
> I am planning to work on 2), unless objections






[jira] [Created] (LUCENE-8929) Early Terminating CollectorManager

2019-07-22 Thread Atri Sharma (JIRA)
Atri Sharma created LUCENE-8929:
---

 Summary: Early Terminating CollectorManager
 Key: LUCENE-8929
 URL: https://issues.apache.org/jira/browse/LUCENE-8929
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Atri Sharma


We should have an early terminating collector manager which accurately tracks 
hits across all of its collectors and determines when there are enough hits, 
allowing all the collectors to abort.

The options for the same are:

1) Shared total count : Global "scoreboard" where all collectors update their 
current hit count. At the end of each document's collection, collector checks 
if N > threshold, and aborts if true

2) State Reporting Collectors: Collectors report their total number of counts 
collected periodically using a callback mechanism, and get a proceed or abort 
decision.

1) has the overhead of synchronization in the hot path, 2) can collect 
unnecessary hits before aborting.

I am planning to work on 2), unless there are objections.
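
A minimal sketch of what 2) could look like, assuming hypothetical names 
(ReportingCollector, HitCountManager) and a fixed reporting interval. None of 
this is existing Lucene API. The collector reports its newly collected hits 
every reportInterval documents and receives a proceed or abort decision.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.IntPredicate;

/** Hypothetical sketch of option 2; names are illustrative, not Lucene API. */
final class ReportingCollector {
  private final IntPredicate shouldProceed; // callback supplied by the manager
  private final int reportInterval;
  private int unreported = 0;
  private int collected = 0;
  private boolean aborted = false;

  ReportingCollector(IntPredicate shouldProceed, int reportInterval) {
    this.shouldProceed = shouldProceed;
    this.reportInterval = reportInterval;
  }

  /** Collects one hit; returns false once the manager has said abort. */
  boolean collect() {
    if (aborted) {
      return false;
    }
    collected++;
    if (++unreported == reportInterval) {
      aborted = !shouldProceed.test(unreported);
      unreported = 0;
    }
    return !aborted;
  }

  int collected() {
    return collected;
  }
}

final class HitCountManager {
  private final AtomicInteger total = new AtomicInteger();
  private final int threshold;

  HitCountManager(int threshold) {
    this.threshold = threshold;
  }

  /** Adds the newly reported hits; true means proceed, false means abort. */
  boolean report(int newHits) {
    return total.addAndGet(newHits) <= threshold;
  }

  ReportingCollector newCollector(int reportInterval) {
    return new ReportingCollector(this::report, reportInterval);
  }
}
```

The sketch also shows the stated downside: with a threshold of 10 and an 
interval of 4, a collector only learns about the abort at its next report, so 
a few unnecessary hits are collected in between.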






[jira] [Commented] (LUCENE-8727) IndexSearcher#search(Query,int) should operate on a shared priority queue when configured with an executor

2019-07-22 Thread Atri Sharma (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16889970#comment-16889970
 ] 

Atri Sharma commented on LUCENE-8727:
-

bq. we will have to skip all these docs with smaller doc Ids even if they have 
the same scores as docs with higher doc Ids and should be selected instead.

That should be avoidable, since we will need a custom PQ implementation anyway 
if we decide to share the queue, so the PQ can tie-break the other way round 
on doc IDs. One advantage of sharing the PQ is that we can skip the merge 
process during the reduce call of the CollectorManager.

I am hesitant to introduce a synchronized block into the collector-level 
collection mechanism -- it has the potential to blow up in our face and 
become a performance bottleneck.

I wonder whether we should simply have both versions -- sharing the PQ/min 
score, and a CollectorManager which allows callbacks that are invoked at 
regular intervals by the dependent Collectors. The former can work well with 
a smaller number of slices, while the latter can work well with a large number 
of slices.
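
To show what I mean by tie-breaking on doc IDs in a shared queue, here is a 
sketch under assumptions: Hit and TieBreakingTopHits are hypothetical classes, 
not the Lucene HitQueue, and I assume that among equal scores the smaller doc 
ID should win regardless of arrival order. The queue is synchronized here, 
which is exactly the cost I am worried about above.

```java
import java.util.Comparator;
import java.util.PriorityQueue;

/** Hypothetical sketch; not the Lucene HitQueue. */
final class Hit {
  final int doc;
  final float score;

  Hit(int doc, float score) {
    this.doc = doc;
    this.score = score;
  }
}

final class TieBreakingTopHits {
  // Head of the queue is the least competitive hit: lowest score first and,
  // among equal scores, the highest docID first.
  private static final Comparator<Hit> LEAST_COMPETITIVE_FIRST =
      Comparator.<Hit>comparingDouble(h -> h.score)
          .thenComparing(Comparator.<Hit>comparingInt(h -> h.doc).reversed());

  private final PriorityQueue<Hit> queue = new PriorityQueue<>(LEAST_COMPETITIVE_FIRST);
  private final int numHits;

  TieBreakingTopHits(int numHits) {
    this.numHits = numHits;
  }

  /** Accepts hits in any arrival order; synchronized because slices share it. */
  synchronized void offer(Hit hit) {
    if (queue.size() < numHits) {
      queue.add(hit);
    } else if (LEAST_COMPETITIVE_FIRST.compare(hit, queue.peek()) > 0) {
      queue.poll();
      queue.add(hit);
    }
  }

  synchronized Hit leastCompetitive() {
    return queue.peek();
  }
}
```

With this ordering, a hit with a smaller doc ID that arrives late from another 
slice still displaces an equally scored hit with a larger doc ID.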

> IndexSearcher#search(Query,int) should operate on a shared priority queue 
> when configured with an executor
> --
>
> Key: LUCENE-8727
> URL: https://issues.apache.org/jira/browse/LUCENE-8727
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
>
> If IndexSearcher is configured with an executor, then the top docs for each 
> slice are computed separately before being merged once the top docs for all 
> slices are computed. With block-max WAND this is a bit of a waste of 
> resources: it would be better if an increase of the min competitive score 
> could help skip non-competitive hits on every slice and not just the current 
> one.






[jira] [Commented] (LUCENE-8727) IndexSearcher#search(Query,int) should operate on a shared priority queue when configured with an executor

2019-07-19 Thread Atri Sharma (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16888766#comment-16888766
 ] 

Atri Sharma commented on LUCENE-8727:
-

[~jpountz] Here are two thoughts on how to implement this:

 

1) Shared Priority Queue: A shared priority queue held in the parent 
CollectorManager is used by all Collectors. This flows down naturally: once 
the top N hits have been collected globally, the minimum competitive score can 
be increased without the Collectors getting involved, and further hits will be 
ranked accordingly. However, the downside is that the priority queue 
implementation will have to be synchronized, so there can be a performance hit, 
as the critical path of segment collection will be affected.

 

2) An alternative is that, for N hits, each slice gets an equal, prorated 
share of hits to start with (M collectors, so N/M hits each). Each Collector 
gets a callback supplier, which the Collector will call with the number of hits 
collected so far and the score of its highest scoring local hit. The callback 
will return the minimum competitive hit seen globally so far, and the Collector 
will use that score to filter out the remaining hits. The point at which a 
Collector invokes the callback can vary, the simplest being after every N/M 
hits. The callback will be provided by the CollectorManager. The downside of 
this approach is that there is communication involved between the Collectors 
and the CollectorManager, and some redundant hits can be collected due to the 
periodic callback invocation. In contrast, the shared priority queue mechanism 
allows for accurate filtering.

 

WDYT?

> IndexSearcher#search(Query,int) should operate on a shared priority queue 
> when configured with an executor
> --
>
> Key: LUCENE-8727
> URL: https://issues.apache.org/jira/browse/LUCENE-8727
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
>
> If IndexSearcher is configured with an executor, then the top docs for each 
> slice are computed separately before being merged once the top docs for all 
> slices are computed. With block-max WAND this is a bit of a waste of 
> resources: it would be better if an increase of the min competitive score 
> could help skip non-competitive hits on every slice and not just the current 
> one.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Created] (LUCENE-8927) Cut Over To Set.copyOf and Set.Of From Collections.unmodifiableSet

2019-07-18 Thread Atri Sharma (JIRA)
Atri Sharma created LUCENE-8927:
---

 Summary: Cut Over To Set.copyOf and Set.Of From 
Collections.unmodifiableSet
 Key: LUCENE-8927
 URL: https://issues.apache.org/jira/browse/LUCENE-8927
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Atri Sharma









[jira] [Commented] (LUCENE-8924) Remove Fields Order Checks from CheckIndex?

2019-07-17 Thread Atri Sharma (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16887340#comment-16887340
 ] 

Atri Sharma commented on LUCENE-8924:
-

I see. Should we make this more explicit and robust then? For example, since 
we do not explicitly maintain a sort order but rely on the key set to do the 
right thing, a change from Collections.unmodifiableSet to Set.copyOf breaks 
this assertion in CheckIndex (the Set.copyOf documentation explicitly says that 
there is no guarantee on the iteration order).
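
The difference can be illustrated with a small, self-contained sketch. It 
assumes, purely for illustration, that the field names live in a TreeSet; 
FieldOrderDemo and iteratesInSortedOrder are hypothetical names and not part of 
CheckIndex.

```java
import java.util.Collections;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

final class FieldOrderDemo {
  /** Returns true if iterating the set yields strictly ascending field names. */
  static boolean iteratesInSortedOrder(Set<String> fields) {
    String previous = null;
    for (String field : fields) {
      if (previous != null && previous.compareTo(field) >= 0) {
        return false;
      }
      previous = field;
    }
    return true;
  }

  public static void main(String[] args) {
    TreeSet<String> sorted = new TreeSet<>(List.of("title", "body", "id", "date"));

    // A read-only view: iteration order is that of the backing TreeSet.
    Set<String> view = Collections.unmodifiableSet(sorted);
    // An independent copy: iteration order is unspecified.
    Set<String> copy = Set.copyOf(sorted);

    System.out.println("view sorted: " + iteratesInSortedOrder(view));
    System.out.println("copy sorted: " + iteratesInSortedOrder(copy));
  }
}
```

The first line always reports true for a sorted backing set. The second may 
report either value, which is why an order assertion on top of Set.copyOf 
cannot be relied on.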

> Remove Fields Order Checks from CheckIndex?
> ---
>
> Key: LUCENE-8924
> URL: https://issues.apache.org/jira/browse/LUCENE-8924
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Atri Sharma
>Priority: Major
>
> CheckIndex checks the order of fields read from the FieldsEnum for the 
> posting reader. However, we do not sort or use a sorted data structure to 
> represent keys (at least not explicitly), and no FieldsEnum depends on the 
> order apart from MultiFieldsEnum, which no longer exists.
>  
> Should we remove the check?






[jira] [Created] (LUCENE-8924) Remove Fields Order Checks from CheckIndex?

2019-07-17 Thread Atri Sharma (JIRA)
Atri Sharma created LUCENE-8924:
---

 Summary: Remove Fields Order Checks from CheckIndex?
 Key: LUCENE-8924
 URL: https://issues.apache.org/jira/browse/LUCENE-8924
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Atri Sharma


CheckIndex checks the order of fields read from the FieldsEnum for the posting 
reader. However, we do not sort or use a sorted data structure to represent 
keys (at least not explicitly), and no FieldsEnum depends on the order apart 
from MultiFieldsEnum, which no longer exists.

 

Should we remove the check?






[jira] [Commented] (LUCENE-8915) Allow RateLimiter To Have Dynamic Limits

2019-07-16 Thread Atri Sharma (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16885941#comment-16885941
 ] 

Atri Sharma commented on LUCENE-8915:
-

[~ab] Thanks, raised a PR doing the same.

 

[https://github.com/apache/lucene-solr/pull/789]

> Allow RateLimiter To Have Dynamic Limits
> 
>
> Key: LUCENE-8915
> URL: https://issues.apache.org/jira/browse/LUCENE-8915
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Atri Sharma
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> RateLimiter does not allow dynamic configuration of the rate limit today. 
> This limits the kind of applications that the functionality can be applied 
> to. This Jira tracks 1) allowing the rate limiter to change limits 
> dynamically. 2) Add a RateLimiter subclass which exposes the same.






[jira] [Commented] (LUCENE-8915) Allow RateLimiter To Have Dynamic Limits

2019-07-16 Thread Atri Sharma (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16885887#comment-16885887
 ] 

Atri Sharma commented on LUCENE-8915:
-

Hmm, I do not see a reason why SimpleRateLimiter cannot dynamically set values 
today (the setter is public).

 

Should we make the rate limit value protected, or update the 
javadocs/comments to reflect that the limit can already be updated dynamically?
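
For reference, a limiter with a dynamically updatable limit can be as small as 
the following sketch. DynamicRateLimiter and targetNanos are hypothetical names 
for illustration only; this is not the SimpleRateLimiter code.

```java
/** Hypothetical limiter whose rate can be changed while it is in use. */
final class DynamicRateLimiter {
  // volatile so that a limit set by one thread is visible to writer threads.
  private volatile double mbPerSec;

  DynamicRateLimiter(double mbPerSec) {
    setMBPerSec(mbPerSec);
  }

  /** Changes the limit; takes effect for subsequent targetNanos calls. */
  void setMBPerSec(double mbPerSec) {
    if (!(mbPerSec > 0.0)) {
      throw new IllegalArgumentException("mbPerSec must be positive, got " + mbPerSec);
    }
    this.mbPerSec = mbPerSec;
  }

  double getMBPerSec() {
    return mbPerSec;
  }

  /** Nanoseconds that writing the given number of bytes should take at the current limit. */
  long targetNanos(long bytes) {
    double megabytes = bytes / 1024.0 / 1024.0;
    return (long) (1_000_000_000L * (megabytes / mbPerSec));
  }
}
```

A caller would compare targetNanos(...) against the time actually spent and 
sleep for the difference.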

> Allow RateLimiter To Have Dynamic Limits
> 
>
> Key: LUCENE-8915
> URL: https://issues.apache.org/jira/browse/LUCENE-8915
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Atri Sharma
>Priority: Major
>
> RateLimiter does not allow dynamic configuration of the rate limit today. 
> This limits the kind of applications that the functionality can be applied 
> to. This Jira tracks 1) allowing the rate limiter to change limits 
> dynamically. 2) Add a RateLimiter subclass which exposes the same.






[jira] [Created] (LUCENE-8919) Query Metadata Aggregator

2019-07-15 Thread Atri Sharma (JIRA)
Atri Sharma created LUCENE-8919:
---

 Summary: Query Metadata Aggregator
 Key: LUCENE-8919
 URL: https://issues.apache.org/jira/browse/LUCENE-8919
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Atri Sharma


It would be good if there was a mechanism to allow aggregation of metadata for 
queries (e.g., number of clauses, types of clauses, terms involved). This is 
particularly useful for complex queries with multiple levels of nesting and a 
high degree of branching. It should help debug query performance issues and 
spot patterns when a query is misbehaving. Now that QueryVisitor is available, 
this should be doable.
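
A rough sketch of what such an aggregator could collect is shown below. To keep 
it self-contained it walks a toy query tree rather than Lucene's Query classes; 
every type here (ToyQuery, ToyVisitor, QueryMetadataAggregator) is hypothetical 
and only mirrors the general visitor idea.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy query tree standing in for real query classes (illustration only). */
abstract class ToyQuery {
  abstract void accept(ToyVisitor visitor, int depth);
}

final class ToyTermQuery extends ToyQuery {
  final String term;
  ToyTermQuery(String term) { this.term = term; }
  @Override
  void accept(ToyVisitor visitor, int depth) { visitor.visitLeaf(this, depth); }
}

final class ToyBooleanQuery extends ToyQuery {
  final List<ToyQuery> clauses = new ArrayList<>();
  ToyBooleanQuery add(ToyQuery clause) { clauses.add(clause); return this; }
  @Override
  void accept(ToyVisitor visitor, int depth) {
    visitor.visitBranch(this, depth);
    for (ToyQuery clause : clauses) {
      clause.accept(visitor, depth + 1); // children sit one level deeper
    }
  }
}

interface ToyVisitor {
  void visitLeaf(ToyTermQuery query, int depth);
  void visitBranch(ToyBooleanQuery query, int depth);
}

/** Collects clause counts, clause types, terms and nesting depth in one pass. */
final class QueryMetadataAggregator implements ToyVisitor {
  int leafClauses;
  int maxDepth;
  final Map<String, Integer> clauseTypes = new TreeMap<>();
  final List<String> terms = new ArrayList<>();

  @Override
  public void visitLeaf(ToyTermQuery query, int depth) {
    leafClauses++;
    terms.add(query.term);
    record("term", depth);
  }

  @Override
  public void visitBranch(ToyBooleanQuery query, int depth) {
    record("boolean", depth);
  }

  private void record(String type, int depth) {
    clauseTypes.merge(type, 1, Integer::sum);
    maxDepth = Math.max(maxDepth, depth);
  }
}
```

Calling query.accept(aggregator, 0) on the root fills in the counters, which 
could then be logged next to slow query timings.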






[jira] [Commented] (LUCENE-8811) Add maximum clause count check to IndexSearcher rather than BooleanQuery

2019-07-15 Thread Atri Sharma (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16884939#comment-16884939
 ] 

Atri Sharma commented on LUCENE-8811:
-

[~jpountz] Yeah, that is what I was thinking of, but I see your view point.

 

I will raise a PR shortly

> Add maximum clause count check to IndexSearcher rather than BooleanQuery
> 
>
> Key: LUCENE-8811
> URL: https://issues.apache.org/jira/browse/LUCENE-8811
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Assignee: Alan Woodward
>Priority: Minor
> Fix For: 8.2
>
> Attachments: LUCENE-8811.patch, LUCENE-8811.patch, LUCENE-8811.patch, 
> LUCENE-8811.patch, LUCENE-8811.patch, LUCENE-8811.patch
>
>
> Currently we only check whether boolean queries have too many clauses. 
> However there are other ways that queries may have too many clauses, for 
> instance if you have boolean queries that have themselves inner boolean 
> queries.
> Could we use the new Query visitor API to move this check from BooleanQuery 
> to IndexSearcher in order to make this check more consistent across queries? 
> See for instance LUCENE-8810 where a rewrite rule caused the maximum clause 
> count to be hit even though the total number of leaf queries remained the 
> same.






[jira] [Commented] (LUCENE-8811) Add maximum clause count check to IndexSearcher rather than BooleanQuery

2019-07-15 Thread Atri Sharma (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16884930#comment-16884930
 ] 

Atri Sharma commented on LUCENE-8811:
-

[~jpountz] I had originally raised a patch which implemented your suggested 
approach, should we commit that for 8.2, and let all other branches have the 
actual change introduced by this JIRA?

> Add maximum clause count check to IndexSearcher rather than BooleanQuery
> 
>
> Key: LUCENE-8811
> URL: https://issues.apache.org/jira/browse/LUCENE-8811
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Assignee: Alan Woodward
>Priority: Minor
> Fix For: 8.2
>
> Attachments: LUCENE-8811.patch, LUCENE-8811.patch, LUCENE-8811.patch, 
> LUCENE-8811.patch, LUCENE-8811.patch, LUCENE-8811.patch
>
>
> Currently we only check whether boolean queries have too many clauses. 
> However there are other ways that queries may have too many clauses, for 
> instance if you have boolean queries that have themselves inner boolean 
> queries.
> Could we use the new Query visitor API to move this check from BooleanQuery 
> to IndexSearcher in order to make this check more consistent across queries? 
> See for instance LUCENE-8810 where a rewrite rule caused the maximum clause 
> count to be hit even though the total number of leaf queries remained the 
> same.






[jira] [Updated] (LUCENE-8915) Allow RateLimiter To Have Dynamic Limits

2019-07-15 Thread Atri Sharma (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Atri Sharma updated LUCENE-8915:

Description: RateLimiter does not allow dynamic configuration of the rate 
limit today. This limits the kind of applications that the functionality can be 
applied to. This Jira tracks 1) allowing the rate limiter to change limits 
dynamically. 2) Add a RateLimiter subclass which exposes the same.  (was: While 
working on multi range queries, I realised that it would be good to specialize 
for cases where all clauses in a query are ORed together. MultiTermQuery 
springs to mind, when all terms are basically disjuncted.)

> Allow RateLimiter To Have Dynamic Limits
> 
>
> Key: LUCENE-8915
> URL: https://issues.apache.org/jira/browse/LUCENE-8915
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Atri Sharma
>Priority: Major
>
> RateLimiter does not allow dynamic configuration of the rate limit today. 
> This limits the kind of applications that the functionality can be applied 
> to. This Jira tracks 1) allowing the rate limiter to change limits 
> dynamically. 2) Add a RateLimiter subclass which exposes the same.






[jira] [Created] (LUCENE-8915) Allow RateLimiter To Have Dynamic Limits

2019-07-15 Thread Atri Sharma (JIRA)
Atri Sharma created LUCENE-8915:
---

 Summary: Allow RateLimiter To Have Dynamic Limits
 Key: LUCENE-8915
 URL: https://issues.apache.org/jira/browse/LUCENE-8915
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Atri Sharma


While working on multi range queries, I realised that it would be good to 
specialize for cases where all clauses in a query are ORed together. 
MultiTermQuery springs to mind, when all terms are basically disjuncted.






[jira] [Created] (LUCENE-8905) TopDocsCollector Should Have Better Error Handling For Illegal Arguments

2019-07-08 Thread Atri Sharma (JIRA)
Atri Sharma created LUCENE-8905:
---

 Summary: TopDocsCollector Should Have Better Error Handling For 
Illegal Arguments
 Key: LUCENE-8905
 URL: https://issues.apache.org/jira/browse/LUCENE-8905
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Atri Sharma


While writing some tests, I realised that TopDocsCollector does not behave well 
when illegal arguments are passed in (e.g., requesting more hits than the 
number of hits collected). Instead of reporting an error, we return a TopDocs 
instance with 0 hits.

 

This can be problematic when queries are being formed by applications: it can 
hide bugs where a malformed query returns no hits, and that empty result is 
surfaced upstream to client applications.

 

I found a TODO at the relevant place in the code, so I believe it is time to 
fix the problem and throw an IllegalArgumentException.
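
As a sketch of the intended behaviour, the check could look roughly like the 
following. TopHitsSlice and slice are hypothetical names, the hits are plain 
doc IDs rather than a TopDocs, and the exact policy (for example, allowing 
start 0 on an empty result) is an assumption of this sketch rather than a 
decision taken here.

```java
import java.util.List;

final class TopHitsSlice {
  /**
   * Returns the hits in [start, start + howMany), clamped to what was collected.
   * Throws instead of silently returning an empty result for an illegal range.
   */
  static List<Integer> slice(List<Integer> collected, int start, int howMany) {
    if (start < 0 || howMany < 0) {
      throw new IllegalArgumentException(
          "start and howMany must be non-negative: start=" + start + ", howMany=" + howMany);
    }
    // The first page of an empty result is legal; any other out-of-range start is not.
    if (start != 0 && start >= collected.size()) {
      throw new IllegalArgumentException(
          "start=" + start + " is beyond the " + collected.size() + " collected hits");
    }
    // Widen to long so that start + howMany cannot overflow.
    int end = (int) Math.min((long) start + howMany, collected.size());
    return collected.subList(start, end);
  }
}
```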

 

 






[jira] [Resolved] (LUCENE-8829) TopDocs#Merge is Tightly Coupled To Number Of Collectors Involved

2019-07-03 Thread Atri Sharma (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Atri Sharma resolved LUCENE-8829.
-
Resolution: Fixed

Merged to master

> TopDocs#Merge is Tightly Coupled To Number Of Collectors Involved
> -
>
> Key: LUCENE-8829
> URL: https://issues.apache.org/jira/browse/LUCENE-8829
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Atri Sharma
>Priority: Major
> Attachments: LUCENE-8829.patch, LUCENE-8829.patch, LUCENE-8829.patch, 
> LUCENE-8829.patch
>
>
> While investigating LUCENE-8819, I understood that the order of results from 
> TopDocs#merge is indirectly dependent on the number of collectors involved in 
> the merge. This is troubling because 1) the number of collectors involved in 
> a merge is cost based and directly dependent on the number of slices created 
> for the parallel searcher case, and 2) the TopN hits code path will invoke 
> merge with a single Collector, so running the same TopN query with a single 
> threaded and a parallel searcher will produce a different order of results, 
> which breaks an invariant that should hold.
>  
> This happens because of the subtle way TopDocs#merge sets shardIndex on each 
> ScoreDoc while populating the priority queue used for merging. shardIndex is 
> essentially set to the ordinal of the collector which generated the hit. This 
> means that the shardIndex is dependent on the number of collectors, even for 
> the same set of hits.
>  
> When no sort order is specified, shardIndex is used for tie breaking when 
> scores are equal. This translates to different orders for the same hits with 
> different shardIndices.
>  
> I propose that we remove shardIndex from the default tie breaking mechanism 
> and replace it with docID. DocID order is the de facto order expected during 
> collection, so it might make sense to use the same factor for tie breaking 
> when scores are equal.
> CC: [~ivera]
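
The effect of the proposed tie breaker can be sketched in a few lines. This is 
an illustration only: Hit and DocIdTieBreakMerge are hypothetical stand-ins, 
not ScoreDoc or TopDocs#merge, and doc IDs are assumed to be global across 
collectors, as they are when slices of one index are searched.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Minimal stand-in for a scored hit. */
final class Hit {
  final int docId;
  final float score;

  Hit(int docId, float score) {
    this.docId = docId;
    this.score = score;
  }
}

final class DocIdTieBreakMerge {
  // Higher score first; equal scores are ordered by ascending docId,
  // which does not depend on which collector produced the hit.
  static final Comparator<Hit> SCORE_THEN_DOC =
      Comparator.comparingDouble((Hit h) -> h.score)
          .reversed()
          .thenComparingInt(h -> h.docId);

  /** Merges per-collector hit lists and returns the doc IDs of the top N hits. */
  static List<Integer> merge(List<List<Hit>> perCollector, int topN) {
    List<Hit> all = new ArrayList<>();
    for (List<Hit> hits : perCollector) {
      all.addAll(hits);
    }
    all.sort(SCORE_THEN_DOC);

    List<Integer> docIds = new ArrayList<>();
    for (Hit hit : all.subList(0, Math.min(topN, all.size()))) {
      docIds.add(hit.docId);
    }
    return docIds;
  }
}
```

With this comparator, the same hits yield the same order whether they arrive 
from one collector or are split across several.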






[jira] [Resolved] (LUCENE-8794) Cost Based Slice Allocation Algorithm

2019-07-03 Thread Atri Sharma (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Atri Sharma resolved LUCENE-8794.
-
Resolution: Fixed

Merged to master

> Cost Based Slice Allocation Algorithm
> -
>
> Key: LUCENE-8794
> URL: https://issues.apache.org/jira/browse/LUCENE-8794
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Atri Sharma
>Priority: Major
>
> In https://issues.apache.org/jira/browse/LUCENE-8757, the idea of a cost 
> based and dynamically adjusting slice allocation algorithm was conceived. We 
> should ideally have a hard cap on the number of threads that can be consumed 
> by a single query, have static cost factors associated with segments, and 
> assign segments to threads in a fair manner. We will also need to ensure that 
> we do not end up assigning individual threads to small segments, or creating 
> more threads than needed (thread context switching could outweigh the 
> benefits).
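
One possible shape for such an allocation is sketched below, using document 
count as the static cost of a segment. SliceAllocator, allocate and both 
thresholds are hypothetical and chosen for illustration; the sketch does not 
implement the hard cap on threads per query and is not the committed 
algorithm.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class SliceAllocator {
  /**
   * Groups segments (given by their doc counts) into slices. Large segments get
   * a slice of their own; small segments are batched together so that no thread
   * is spent on a tiny segment alone.
   */
  static List<List<Integer>> allocate(
      List<Integer> segmentDocCounts, int maxDocsPerSlice, int maxSegmentsPerSlice) {
    // Largest segments first, so all oversized segments are handled up front.
    List<Integer> sorted = new ArrayList<>(segmentDocCounts);
    sorted.sort(Comparator.reverseOrder());

    List<List<Integer>> slices = new ArrayList<>();
    List<Integer> current = new ArrayList<>();
    long docsInCurrent = 0;

    for (int docs : sorted) {
      if (docs > maxDocsPerSlice) {
        slices.add(List.of(docs)); // too costly to share a slice
        continue;
      }
      current.add(docs);
      docsInCurrent += docs;
      // Close the slice once it is full by segment count or by doc count.
      if (current.size() >= maxSegmentsPerSlice || docsInCurrent > maxDocsPerSlice) {
        slices.add(current);
        current = new ArrayList<>();
        docsInCurrent = 0;
      }
    }
    if (!current.isEmpty()) {
      slices.add(current);
    }
    return slices;
  }
}
```

For doc counts 10, 300, 20, 30 and 5 with a limit of 100 docs and 3 segments 
per slice, this yields three slices: the 300 doc segment alone, then 30, 20 and 
10 together, then 5.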






[jira] [Commented] (LUCENE-8762) Lucene50PostingsReader should specialize reading docs+freqs with impacts

2019-07-03 Thread Atri Sharma (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16877864#comment-16877864
 ] 

Atri Sharma commented on LUCENE-8762:
-

I will take a crack at this and post a patch soon.

> Lucene50PostingsReader should specialize reading docs+freqs with impacts
> 
>
> Key: LUCENE-8762
> URL: https://issues.apache.org/jira/browse/LUCENE-8762
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
>
> Currently if you ask for impacts, we only have one implementation that is 
> able to expose everything: docs, freqs, positions and offsets. In contrast, 
> if you don't need impacts, we have specialization for docs+freqs, 
> docs+freqs+positions and docs+freqs+positions+offsets.
> Maybe we should add specialization for the docs+freqs case with impacts, 
> which should be the most common case, and remove specialization for 
> docs+freqs+positions when impacts are not requested?






[jira] [Commented] (LUCENE-8857) Refactor TopDocs#Merge To Take In Custom Tie Breakers

2019-07-02 Thread Atri Sharma (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16877232#comment-16877232
 ] 

Atri Sharma commented on LUCENE-8857:
-

[~jpountz] Yes, I ran the Solr suite twice. The first time, there were 
failures where the tracer could not be closed. The second time, the entire 
suite came in clean.

 

I also ran ant precommit – came in clean.

 

 

> Refactor TopDocs#Merge To Take In Custom Tie Breakers
> -
>
> Key: LUCENE-8857
> URL: https://issues.apache.org/jira/browse/LUCENE-8857
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Atri Sharma
>Priority: Major
> Fix For: master (9.0)
>
> Attachments: LUCENE-8857-compile-fix.patch, LUCENE-8857.patch, 
> LUCENE-8857.patch, LUCENE-8857.patch, LUCENE-8857.patch, LUCENE-8857.patch
>
>  Time Spent: 4.5h
>  Remaining Estimate: 0h
>
> In LUCENE-8829, the idea of having lambdas passed in to the API to allow 
> finer control over the process was discussed.
> This JIRA tracks adding a parameter to the API which allows passing in 
> lambdas to define custom tie breakers, thus allowing users to do custom 
> algorithms when required.
> CC: [~jpountz]  [~simonw] 






[jira] [Resolved] (LUCENE-8899) Implementation of MultiTermQuery for ORed Queries

2019-07-02 Thread Atri Sharma (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Atri Sharma resolved LUCENE-8899.
-
Resolution: Not A Problem

> Implementation of MultiTermQuery for ORed Queries
> -
>
> Key: LUCENE-8899
> URL: https://issues.apache.org/jira/browse/LUCENE-8899
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Atri Sharma
>Priority: Major
>
> While working on multi range queries, I realised that it would be good to 
> specialize for cases where all clauses in a query are ORed together. 
> MultiTermQuery springs to mind, when all terms are basically disjuncted.






[jira] [Commented] (LUCENE-8899) Implementation of MultiTermQuery for ORed Queries

2019-07-02 Thread Atri Sharma (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16877219#comment-16877219
 ] 

Atri Sharma commented on LUCENE-8899:
-

Hmm, true. I was thinking of a query type just for the disjunctives, but looks 
like TermInSetQuery already covers it.

 

Thanks for pointing it out!

> Implementation of MultiTermQuery for ORed Queries
> -
>
> Key: LUCENE-8899
> URL: https://issues.apache.org/jira/browse/LUCENE-8899
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Atri Sharma
>Priority: Major
>
> While working on multi range queries, I realised that it would be good to 
> specialize for cases where all clauses in a query are ORed together. 
> MultiTermQuery springs to mind, when all terms are basically disjuncted.






[jira] [Commented] (LUCENE-8857) Refactor TopDocs#Merge To Take In Custom Tie Breakers

2019-07-02 Thread Atri Sharma (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16877156#comment-16877156
 ] 

Atri Sharma commented on LUCENE-8857:
-

[~jpountz] Thanks for confirming. I wanted to ensure that no unsuspecting user 
gets bitten :)

> Refactor TopDocs#Merge To Take In Custom Tie Breakers
> -
>
> Key: LUCENE-8857
> URL: https://issues.apache.org/jira/browse/LUCENE-8857
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Atri Sharma
>Priority: Major
> Fix For: master (9.0)
>
> Attachments: LUCENE-8857-compile-fix.patch, LUCENE-8857.patch, 
> LUCENE-8857.patch, LUCENE-8857.patch, LUCENE-8857.patch, LUCENE-8857.patch
>
>  Time Spent: 4.5h
>  Remaining Estimate: 0h
>
> In LUCENE-8829, the idea of having lambdas passed in to the API to allow 
> finer control over the process was discussed.
> This JIRA tracks adding a parameter to the API which allows passing in 
> lambdas to define custom tie breakers, thus allowing users to do custom 
> algorithms when required.
> CC: [~jpountz]  [~simonw] 






[jira] [Commented] (LUCENE-8899) Implementation of MultiTermQuery for ORed Queries

2019-07-02 Thread Atri Sharma (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16877154#comment-16877154
 ] 

Atri Sharma commented on LUCENE-8899:
-

The way I am thinking of this is by using the fact that 
MultiTermQueryConstantScoreWrapper will always convert to a BooleanQuery with 
each clause as SHOULD. So it should be a simple matter to use that logic. The 
main change will be introduction of a new TermsEnum implementation which can 
filter the input terms based on a filter built from the terms list given in the 
query.

 

Does this seem like a reasonable approach?

> Implementation of MultiTermQuery for ORed Queries
> -
>
> Key: LUCENE-8899
> URL: https://issues.apache.org/jira/browse/LUCENE-8899
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Atri Sharma
>Priority: Major
>
> While working on multi range queries, I realised that it would be good to 
> specialize for cases where all clauses in a query are ORed together. 
> MultiTermQuery springs to mind, when all terms are basically disjuncted.






[jira] [Resolved] (SOLR-13597) TopGroups Should Respect the API in Lucene's TopDocs.merge

2019-07-02 Thread Atri Sharma (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-13597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Atri Sharma resolved SOLR-13597.

Resolution: Not A Problem

This can be done at Lucene level itself, given the usage pattern of Solr for 
TopDocs.merge

> TopGroups Should Respect the API in Lucene's TopDocs.merge
> --
>
> Key: SOLR-13597
> URL: https://issues.apache.org/jira/browse/SOLR-13597
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Atri Sharma
>Priority: Major
>
> In LUCENE-8857, TopDocs.merge loses the ability to set shard indices, so 
> callers have to set shard indices themselves before calling merge, or use 
> docID based tie breaker.
>  
> TopGroups uses this non existent capability of Lucene, hence the 
> corresponding tests break. This Jira tracks the efforts to fix TopGroups to 
> respect the new API, and should be merged post merge of LUCENE-8857






[jira] [Commented] (LUCENE-8857) Refactor TopDocs#Merge To Take In Custom Tie Breakers

2019-07-02 Thread Atri Sharma (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16876965#comment-16876965
 ] 

Atri Sharma commented on LUCENE-8857:
-

Since this is a breaking API change, is there a way we can highlight this to 
existing users in a "louder" manner, or is MIGRATE.txt entry enough?

> Refactor TopDocs#Merge To Take In Custom Tie Breakers
> -
>
> Key: LUCENE-8857
> URL: https://issues.apache.org/jira/browse/LUCENE-8857
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Atri Sharma
>Priority: Major
> Fix For: master (9.0)
>
> Attachments: LUCENE-8857-compile-fix.patch, LUCENE-8857.patch, 
> LUCENE-8857.patch, LUCENE-8857.patch, LUCENE-8857.patch, LUCENE-8857.patch
>
>  Time Spent: 4.5h
>  Remaining Estimate: 0h
>
> In LUCENE-8829, the idea of having lambdas passed in to the API to allow 
> finer control over the process was discussed.
> This JIRA tracks adding a parameter to the API which allows passing in 
> lambdas to define custom tie breakers, thus allowing users to do custom 
> algorithms when required.
> CC: [~jpountz]  [~simonw] 






[jira] [Created] (LUCENE-8899) Implementation of MultiTermQuery for ORed Queries

2019-07-02 Thread Atri Sharma (JIRA)
Atri Sharma created LUCENE-8899:
---

 Summary: Implementation of MultiTermQuery for ORed Queries
 Key: LUCENE-8899
 URL: https://issues.apache.org/jira/browse/LUCENE-8899
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Atri Sharma


While working on multi range queries, I realised that it would be good to 
specialize for cases where all clauses in a query are ORed together. 
MultiTermQuery springs to mind, when all terms are basically disjuncted.






[jira] [Commented] (LUCENE-8857) Refactor TopDocs#Merge To Take In Custom Tie Breakers

2019-07-02 Thread Atri Sharma (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16876773#comment-16876773
 ] 

Atri Sharma commented on LUCENE-8857:
-

JFYI The latest iteration on PR also fixes the compilation failure in Solr, 
introduced in SOLR-13404

> Refactor TopDocs#Merge To Take In Custom Tie Breakers
> -
>
> Key: LUCENE-8857
> URL: https://issues.apache.org/jira/browse/LUCENE-8857
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Atri Sharma
>Priority: Major
> Fix For: master (9.0)
>
> Attachments: LUCENE-8857-compile-fix.patch, LUCENE-8857.patch, 
> LUCENE-8857.patch, LUCENE-8857.patch, LUCENE-8857.patch, LUCENE-8857.patch
>
>  Time Spent: 4.5h
>  Remaining Estimate: 0h
>
> In LUCENE-8829, the idea of having lambdas passed in to the API to allow 
> finer control over the process was discussed.
> This JIRA tracks adding a parameter to the API which allows passing in 
> lambdas to define custom tie breakers, thus allowing users to do custom 
> algorithms when required.
> CC: [~jpountz]  [~simonw] 






[jira] [Commented] (LUCENE-8857) Refactor TopDocs#Merge To Take In Custom Tie Breakers

2019-07-02 Thread Atri Sharma (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16876759#comment-16876759
 ] 

Atri Sharma commented on LUCENE-8857:
-

[~jpountz] I have pushed the latest iteration to the new PR. It passes ant test:

 
{code:java}
 
[junit4:tophints]  59.58s | org.apache.lucene.search.suggest.document.TestSuggestField
[junit4:tophints]  17.10s | org.apache.lucene.search.suggest.DocumentDictionaryTest
[junit4:tophints]  14.56s | org.apache.lucene.search.suggest.fst.FSTCompletionTest
[junit4:tophints]  14.21s | org.apache.lucene.search.suggest.analyzing.FuzzySuggesterTest

-check-totals:

common.test:

-check-totals:

test:

BUILD SUCCESSFUL
Total time: 74 minutes 29 seconds
f01898a404cf:lucene atris$
{code}
 

 

It also passes the offending Solr test:

 
ant test  -Dtestcase=TestDistributedGrouping -Dtests.method=test 
-Dtests.seed=B5D95BEAE23E9468 -Dtests.slow=true -Dtests.badapples=true 
-Dtests.locale=nl-AW -Dtests.timezone=Asia/Jayapura -Dtests.asserts=true 
-Dtests.file.encoding=UTF-8
 
{code:java}
27429 INFO  (closeThreadPool-74-thread-4) [ ] o.e.j.s.AbstractConnector Stopped ServerConnector@3e6caf50{HTTP/1.1,[http/1.1, h2c]}{127.0.0.1:0}
27430 INFO  (closeThreadPool-74-thread-4) [ ] o.e.j.s.h.ContextHandler Stopped o.e.j.s.ServletContextHandler@169e0265{/,null,UNAVAILABLE}
27430 INFO  (closeThreadPool-74-thread-4) [ ] o.e.j.s.session node0 Stopped scavenging
27431 INFO  (closeThreadPool-74-thread-1) [ ] o.e.j.s.AbstractConnector Stopped ServerConnector@1be02e89{HTTP/1.1,[http/1.1, h2c]}{127.0.0.1:0}
27431 INFO  (closeThreadPool-74-thread-1) [ ] o.e.j.s.h.ContextHandler Stopped o.e.j.s.ServletContextHandler@6b6f3dda{/,null,UNAVAILABLE}
27432 INFO  (closeThreadPool-74-thread-1) [ ] o.e.j.s.session node0 Stopped scavenging
27432 INFO  (closeThreadPool-74-thread-5) [ ] o.e.j.s.AbstractConnector Stopped ServerConnector@4052b482{HTTP/1.1,[http/1.1, h2c]}{127.0.0.1:0}
27432 INFO  (closeThreadPool-74-thread-5) [ ] o.e.j.s.h.ContextHandler Stopped o.e.j.s.ServletContextHandler@7063254f{/,null,UNAVAILABLE}
27432 INFO  (closeThreadPool-74-thread-5) [ ] o.e.j.s.session node0 Stopped scavenging

27436 INFO  (SUITE-TestDistributedGrouping-seed#[C817F4DEFFC8F2A7]-worker) [ ] o.a.s.SolrTestCaseJ4 --- Done waiting for tracked resources to be released
{code}
 

 

> Refactor TopDocs#Merge To Take In Custom Tie Breakers
> -
>
> Key: LUCENE-8857
> URL: https://issues.apache.org/jira/browse/LUCENE-8857
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Atri Sharma
>Priority: Major
> Fix For: master (9.0)
>
> Attachments: LUCENE-8857-compile-fix.patch, LUCENE-8857.patch, 
> LUCENE-8857.patch, LUCENE-8857.patch, LUCENE-8857.patch, LUCENE-8857.patch
>
>  Time Spent: 4.5h
>  Remaining Estimate: 0h
>
> In LUCENE-8829, the idea of having lambdas passed in to the API to allow 
> finer control over the process was discussed.
> This JIRA tracks adding a parameter to the API which allows passing in 
> lambdas to define custom tie breakers, thus allowing users to do custom 
> algorithms when required.
> CC: [~jpountz]  [~simonw] 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8882) Add State To QueryVisitor

2019-07-02 Thread Atri Sharma (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16876754#comment-16876754
 ] 

Atri Sharma commented on LUCENE-8882:
-

My idea was not to replace IndexOrDocValues, but to allow it to be more 
generally applicable.

 

For example, take the optimized query that applies only in the limited cases 
where the index is sorted: ideally we would use that query in preference to 
point values (even though it is a docvalues based implementation). However, 
the query is too specialized for IndexOrDocValues to factor in.

 

What I was envisioning was a state where, at the start of the query, 
IndexSearcher creates a QueryVisitor, sees that the index is sorted by key X, 
and populates a property in the QueryVisitor's metadata (INDEX_SORTED_KEY=X).

 

IndexOrDocValuesQuery, then, instead of making an immediate decision as to 
whether to use Points or DocValues, passes the visitor on to both of the 
branches. Further down the line, the sorted index query type will see the 
metadata in the visitor and volunteer itself by adding another property to the 
metadata of the visitor (SORTED_PLAN_AVAILABLE=true or something).

 

In the end, IndexOrDocValues will perform an evaluation, which includes the 
costing it does today plus the metadata state gathered from both branches, 
and then choose the branch to execute. This will allow new query types for 
specific use cases to be added easily (just add a new property type and a 
listener query for it), and let the engine make better decisions about when to 
execute which queries, which can potentially lead to better query 
performance.
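
A minimal sketch of the flow described above, in plain Java. Everything here 
is hypothetical: StatefulVisitor, Branch, PointsBranch and SortedIndexBranch 
are stand-ins invented for illustration and are not Lucene's QueryVisitor or 
IndexOrDocValuesQuery, and the costs are made-up numbers. Only the property 
names INDEX_SORTED_KEY and SORTED_PLAN_AVAILABLE come from the comment.

```java
import java.util.HashMap;
import java.util.Map;

final class VisitorStateSketch {

    /** Hypothetical visitor with a property bag; NOT Lucene's QueryVisitor. */
    static final class StatefulVisitor {
        private final Map<String, Object> properties = new HashMap<>();

        void setProperty(String key, Object value) {
            properties.put(key, value);
        }

        Object getProperty(String key) {
            return properties.get(key);
        }
    }

    /** Hypothetical stand-in for one branch of an IndexOrDocValues-style query. */
    interface Branch {
        void visit(StatefulVisitor visitor);

        long cost();
    }

    /** Points branch: contributes nothing to the visitor state. */
    static final class PointsBranch implements Branch {
        @Override
        public void visit(StatefulVisitor visitor) {}

        @Override
        public long cost() {
            return 100; // made-up cost
        }
    }

    /** Sorted-index branch: volunteers itself when the index is sorted on its field. */
    static final class SortedIndexBranch implements Branch {
        private final String field;

        SortedIndexBranch(String field) {
            this.field = field;
        }

        @Override
        public void visit(StatefulVisitor visitor) {
            if (field.equals(visitor.getProperty("INDEX_SORTED_KEY"))) {
                visitor.setProperty("SORTED_PLAN_AVAILABLE", Boolean.TRUE);
            }
        }

        @Override
        public long cost() {
            return 500; // made-up cost
        }
    }

    /** Visits both branches, then combines plain costing with the gathered state. */
    static Branch choose(StatefulVisitor visitor, Branch points, Branch docValues) {
        points.visit(visitor);
        docValues.visit(visitor);
        if (Boolean.TRUE.equals(visitor.getProperty("SORTED_PLAN_AVAILABLE"))) {
            return docValues;
        }
        return points.cost() <= docValues.cost() ? points : docValues;
    }

    public static void main(String[] args) {
        StatefulVisitor visitor = new StatefulVisitor();
        visitor.setProperty("INDEX_SORTED_KEY", "timestamp"); // set up front by the searcher
        Branch chosen = choose(visitor, new PointsBranch(), new SortedIndexBranch("timestamp"));
        System.out.println(chosen.getClass().getSimpleName());
    }
}
```

In this sketch the volunteered property simply overrides the cost comparison; 
a real implementation would more likely fold it into the cost estimate.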

 

Thoughts?

> Add State To QueryVisitor
> -
>
> Key: LUCENE-8882
> URL: https://issues.apache.org/jira/browse/LUCENE-8882
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Atri Sharma
>Priority: Major
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> QueryVisitor has no state passed in either up or down recursion. This limits 
> the range of decisions that can be made while visiting with QueryVisitor. For 
> example, for LUCENE-8881, we need a way to specify whether the visitor is a 
> rewriter visitor.
>  
> This Jira proposes adding a property bag model to QueryVisitor, which can 
> then be referred to by the Query instance being visited by QueryVisitor.






[jira] [Commented] (LUCENE-8857) Refactor TopDocs#Merge To Take In Custom Tie Breakers

2019-07-01 Thread Atri Sharma (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16876711#comment-16876711
 ] 

Atri Sharma commented on LUCENE-8857:
-

[~munendrasn] Thanks for the compilation fix.

Yes, the test will fail. I have fixed that test failure and will update the PR 
once my local test suite run completes.

> Refactor TopDocs#Merge To Take In Custom Tie Breakers
> -
>
> Key: LUCENE-8857
> URL: https://issues.apache.org/jira/browse/LUCENE-8857
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Atri Sharma
>Priority: Major
> Fix For: master (9.0)
>
> Attachments: LUCENE-8857-compile-fix.patch, LUCENE-8857.patch, 
> LUCENE-8857.patch, LUCENE-8857.patch, LUCENE-8857.patch, LUCENE-8857.patch
>
>  Time Spent: 4.5h
>  Remaining Estimate: 0h
>
> In LUCENE-8829, the idea of having lambdas passed in to the API to allow 
> finer control over the process was discussed.
> This JIRA tracks adding a parameter to the API which allows passing in 
> lambdas to define custom tie breakers, thus allowing users to do custom 
> algorithms when required.
> CC: [~jpountz]  [~simonw] 






[jira] [Commented] (LUCENE-8857) Refactor TopDocs#Merge To Take In Custom Tie Breakers

2019-07-01 Thread Atri Sharma (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16876708#comment-16876708
 ] 

Atri Sharma commented on LUCENE-8857:
-

Ok, updating the PR now.

> Refactor TopDocs#Merge To Take In Custom Tie Breakers
> -
>
> Key: LUCENE-8857
> URL: https://issues.apache.org/jira/browse/LUCENE-8857
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Atri Sharma
>Priority: Major
> Fix For: master (9.0)
>
> Attachments: LUCENE-8857-compile-fix.patch, LUCENE-8857.patch, 
> LUCENE-8857.patch, LUCENE-8857.patch, LUCENE-8857.patch, LUCENE-8857.patch
>
>  Time Spent: 4.5h
>  Remaining Estimate: 0h
>
> In LUCENE-8829, the idea of having lambdas passed in to the API to allow 
> finer control over the process was discussed.
> This JIRA tracks adding a parameter to the API which allows passing in 
> lambdas to define custom tie breakers, thus allowing users to do custom 
> algorithms when required.
> CC: [~jpountz]  [~simonw] 






[jira] [Commented] (LUCENE-8857) Refactor TopDocs#Merge To Take In Custom Tie Breakers

2019-07-01 Thread Atri Sharma (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16876699#comment-16876699
 ] 

Atri Sharma commented on LUCENE-8857:
-

[~jpountz] Yes, we will. I did not want to add the fix for Solr in this PR 
since that would muddle things up (the change would span two modules). I can 
raise a separate PR just for the Solr fixes, though, if that works.

> Refactor TopDocs#Merge To Take In Custom Tie Breakers
> -
>
> Key: LUCENE-8857
> URL: https://issues.apache.org/jira/browse/LUCENE-8857
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Atri Sharma
>Priority: Major
> Fix For: master (9.0)
>
> Attachments: LUCENE-8857.patch, LUCENE-8857.patch, LUCENE-8857.patch, 
> LUCENE-8857.patch, LUCENE-8857.patch
>
>  Time Spent: 4.5h
>  Remaining Estimate: 0h
>
> In LUCENE-8829, the idea of having lambdas passed in to the API to allow 
> finer control over the process was discussed.
> This JIRA tracks adding a parameter to the API which allows passing in 
> lambdas to define custom tie breakers, thus allowing users to do custom 
> algorithms when required.
> CC: [~jpountz]  [~simonw] 






[jira] [Commented] (LUCENE-8857) Refactor TopDocs#Merge To Take In Custom Tie Breakers

2019-07-01 Thread Atri Sharma (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16876692#comment-16876692
 ] 

Atri Sharma commented on LUCENE-8857:
-

I have opened https://issues.apache.org/jira/browse/SOLR-13597 to track fixes 
to Solr to use the new API (that is what is causing the Solr test to fail). I 
will raise a PR for that Jira after this PR is merged.

> Refactor TopDocs#Merge To Take In Custom Tie Breakers
> -
>
> Key: LUCENE-8857
> URL: https://issues.apache.org/jira/browse/LUCENE-8857
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Atri Sharma
>Priority: Major
> Fix For: master (9.0)
>
> Attachments: LUCENE-8857.patch, LUCENE-8857.patch, LUCENE-8857.patch, 
> LUCENE-8857.patch, LUCENE-8857.patch
>
>  Time Spent: 4.5h
>  Remaining Estimate: 0h
>
> In LUCENE-8829, the idea of having lambdas passed in to the API to allow 
> finer control over the process was discussed.
> This JIRA tracks adding a parameter to the API which allows passing in 
> lambdas to define custom tie breakers, thus allowing users to do custom 
> algorithms when required.
> CC: [~jpountz]  [~simonw] 






[jira] [Created] (SOLR-13597) TopGroups Should Respect the API in Lucene's TopDocs.merge

2019-07-01 Thread Atri Sharma (JIRA)
Atri Sharma created SOLR-13597:
--

 Summary: TopGroups Should Respect the API in Lucene's TopDocs.merge
 Key: SOLR-13597
 URL: https://issues.apache.org/jira/browse/SOLR-13597
 Project: Solr
  Issue Type: Improvement
  Security Level: Public (Default Security Level. Issues are Public)
Reporter: Atri Sharma


In LUCENE-8857, TopDocs.merge loses the ability to set shard indices, so 
callers have to set shard indices themselves before calling merge, or use a 
docID based tie breaker.

 

TopGroups relies on this capability, which no longer exists in Lucene, hence 
the corresponding tests break. This Jira tracks the efforts to fix TopGroups 
to respect the new API, and should be merged after LUCENE-8857 is merged.
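
A minimal sketch of the two options mentioned above (the caller sets shard 
indices before merging, or merges with a docID based tie breaker). This is 
plain Java with invented stand-ins: Hit, setShardIndices and merge are 
hypothetical and are not Lucene's ScoreDoc or TopDocs.merge, whose exact 
signature is not shown in this thread. The merge here is a simple full sort 
rather than a priority-queue merge.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class ShardMergeSketch {

    /** Simplified stand-in for a scored hit; NOT Lucene's ScoreDoc. */
    static final class Hit {
        final int doc;
        final float score;
        int shardIndex = -1; // unset until the caller assigns it

        Hit(int doc, float score) {
            this.doc = doc;
            this.score = score;
        }
    }

    /** Caller-side step: tag every hit with the index of the shard it came from. */
    static void setShardIndices(List<List<Hit>> shardHits) {
        for (int shard = 0; shard < shardHits.size(); shard++) {
            for (Hit hit : shardHits.get(shard)) {
                hit.shardIndex = shard;
            }
        }
    }

    /** Tie breaker that needs shard indices to have been set by the caller. */
    static final Comparator<Hit> SHARD_THEN_DOC =
        Comparator.<Hit>comparingInt(h -> h.shardIndex).thenComparingInt(h -> h.doc);

    /** Tie breaker that ignores shard indices entirely. */
    static final Comparator<Hit> DOC_ONLY = Comparator.comparingInt(h -> h.doc);

    /** Merges per-shard hits by descending score, breaking ties with the given comparator. */
    static List<Hit> merge(int topN, List<List<Hit>> shardHits, Comparator<Hit> tieBreaker) {
        List<Hit> all = new ArrayList<>();
        for (List<Hit> hits : shardHits) {
            all.addAll(hits);
        }
        Comparator<Hit> byScoreDescending = (a, b) -> Float.compare(b.score, a.score);
        all.sort(byScoreDescending.thenComparing(tieBreaker));
        return new ArrayList<>(all.subList(0, Math.min(topN, all.size())));
    }

    public static void main(String[] args) {
        List<List<Hit>> shards = new ArrayList<>();
        shards.add(List.of(new Hit(5, 1.0f)));
        shards.add(List.of(new Hit(2, 1.0f), new Hit(7, 2.0f)));
        setShardIndices(shards);
        for (Hit hit : merge(3, shards, SHARD_THEN_DOC)) {
            System.out.println("shard=" + hit.shardIndex + " doc=" + hit.doc + " score=" + hit.score);
        }
    }
}
```

With SHARD_THEN_DOC the two hits tied on score are ordered by the shard they 
came from; with DOC_ONLY they are ordered by doc ID alone, so shard indices 
never need to be set.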






[jira] [Commented] (LUCENE-8857) Refactor TopDocs#Merge To Take In Custom Tie Breakers

2019-07-01 Thread Atri Sharma (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16876378#comment-16876378
 ] 

Atri Sharma commented on LUCENE-8857:
-

[~jpountz] Ran ant test 5 times again: all runs came in clean.

 

I have raised a new PR with testGrouping fixes: 
[https://github.com/apache/lucene-solr/pull/757]

 

Can we merge it, if it looks fine?
{code:java}
[junit4:tophints]  54.39s | 
org.apache.lucene.search.suggest.document.TestSuggestField
[junit4:tophints]  16.93s | 
org.apache.lucene.search.suggest.DocumentDictionaryTest
[junit4:tophints]  16.63s | 
org.apache.lucene.search.suggest.analyzing.FuzzySuggesterTest
[junit4:tophints]  16.42s | 
org.apache.lucene.search.suggest.fst.FSTCompletionTest

-check-totals:

common.test:

-check-totals:

test:

BUILD SUCCESSFUL
Total time: 45 minutes 8 seconds
f01898a404cf:lucene atris$ {code}

> Refactor TopDocs#Merge To Take In Custom Tie Breakers
> -
>
> Key: LUCENE-8857
> URL: https://issues.apache.org/jira/browse/LUCENE-8857
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Atri Sharma
>Priority: Major
> Fix For: master (9.0)
>
> Attachments: LUCENE-8857.patch, LUCENE-8857.patch, LUCENE-8857.patch, 
> LUCENE-8857.patch, LUCENE-8857.patch
>
>  Time Spent: 4.5h
>  Remaining Estimate: 0h
>
> In LUCENE-8829, the idea of having lambdas passed in to the API to allow 
> finer control over the process was discussed.
> This JIRA tracks adding a parameter to the API which allows passing in 
> lambdas to define custom tie breakers, thus allowing users to do custom 
> algorithms when required.
> CC: [~jpountz]  [~simonw] 






[jira] [Commented] (LUCENE-8857) Refactor TopDocs#Merge To Take In Custom Tie Breakers

2019-07-01 Thread Atri Sharma (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16876335#comment-16876335
 ] 

Atri Sharma commented on LUCENE-8857:
-

[~jpountz] I investigated this and it turned out to be a test limitation 
(testGrouping assumed that TopDocs.merge was setting the shard indices). It 
took a while to reproduce since it was the randomized test which was failing 
(thanks for providing the seed!). I have fixed the test and run ant test a 
couple of times; it came in clean:

 

Can we push this in now?

 
{code:java}
[junit4:tophints]  49.54s | 
org.apache.lucene.search.suggest.document.TestSuggestField
[junit4:tophints]  21.55s | 
org.apache.lucene.search.suggest.analyzing.FuzzySuggesterTest
[junit4:tophints]  21.51s | 
org.apache.lucene.search.suggest.DocumentDictionaryTest
[junit4:tophints]  15.45s | org.apache.lucene.search.spell.TestSpellChecker

-check-totals:

common.test:

-check-totals:

test:

BUILD SUCCESSFUL
Total time: 49 minutes 49 seconds
f01898a404cf:lucene atris$ {code}
 

[~munendrasn] I am not too familiar with Solr's internals, but looking at the 
error you pointed to, it looks like the test is not setting shard indices or 
hit indices. This points to an assumption in the test: that TopDocs.merge is 
setting the shard indices. Can you check
{code:java}
search/grouping/distributed/responseprocessor/TopGroupsShardResponseProcessor.java{code}
where the TopDocs.merge call is done? We can set shard indices for all TopHits 
based on the QueryCommandResult they come from.

 

> Refactor TopDocs#Merge To Take In Custom Tie Breakers
> -
>
> Key: LUCENE-8857
> URL: https://issues.apache.org/jira/browse/LUCENE-8857
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Atri Sharma
>Priority: Major
> Fix For: master (9.0)
>
> Attachments: LUCENE-8857.patch, LUCENE-8857.patch, LUCENE-8857.patch, 
> LUCENE-8857.patch, LUCENE-8857.patch
>
>  Time Spent: 4h 20m
>  Remaining Estimate: 0h
>
> In LUCENE-8829, the idea of having lambdas passed in to the API to allow 
> finer control over the process was discussed.
> This JIRA tracks adding a parameter to the API which allows passing in 
> lambdas to define custom tie breakers, thus allowing users to do custom 
> algorithms when required.
> CC: [~jpountz]  [~simonw] 






[jira] [Commented] (LUCENE-8857) Refactor TopDocs#Merge To Take In Custom Tie Breakers

2019-07-01 Thread Atri Sharma (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16876253#comment-16876253
 ] 

Atri Sharma commented on LUCENE-8857:
-

I did – I was not able to see any failures (probably due to seeds?). I will try 
with the seed in your command now.

> Refactor TopDocs#Merge To Take In Custom Tie Breakers
> -
>
> Key: LUCENE-8857
> URL: https://issues.apache.org/jira/browse/LUCENE-8857
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Atri Sharma
>Priority: Major
> Fix For: master (9.0)
>
> Attachments: LUCENE-8857.patch, LUCENE-8857.patch, LUCENE-8857.patch, 
> LUCENE-8857.patch, LUCENE-8857.patch
>
>  Time Spent: 4h 20m
>  Remaining Estimate: 0h
>
> In LUCENE-8829, the idea of having lambdas passed in to the API to allow 
> finer control over the process was discussed.
> This JIRA tracks adding a parameter to the API which allows passing in 
> lambdas to define custom tie breakers, thus allowing users to do custom 
> algorithms when required.
> CC: [~jpountz]  [~simonw] 






[jira] [Commented] (LUCENE-8857) Refactor TopDocs#Merge To Take In Custom Tie Breakers

2019-07-01 Thread Atri Sharma (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16876218#comment-16876218
 ] 

Atri Sharma commented on LUCENE-8857:
-

[~jpountz] Thanks for committing and reviewing, [~simonw] Thanks for your 
constructive inputs!

> Refactor TopDocs#Merge To Take In Custom Tie Breakers
> -
>
> Key: LUCENE-8857
> URL: https://issues.apache.org/jira/browse/LUCENE-8857
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Atri Sharma
>Priority: Major
> Fix For: master (9.0)
>
> Attachments: LUCENE-8857.patch, LUCENE-8857.patch, LUCENE-8857.patch, 
> LUCENE-8857.patch, LUCENE-8857.patch
>
>  Time Spent: 4h 10m
>  Remaining Estimate: 0h
>
> In LUCENE-8829, the idea of having lambdas passed in to the API to allow 
> finer control over the process was discussed.
> This JIRA tracks adding a parameter to the API which allows passing in 
> lambdas to define custom tie breakers, thus allowing users to do custom 
> algorithms when required.
> CC: [~jpountz]  [~simonw] 






[jira] [Commented] (LUCENE-8862) Collector Level Dynamic Memory Accounting

2019-07-01 Thread Atri Sharma (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16876130#comment-16876130
 ] 

Atri Sharma commented on LUCENE-8862:
-

[~jpountz] Thanks for pushing and reviewing!

> Collector Level Dynamic Memory Accounting
> -
>
> Key: LUCENE-8862
> URL: https://issues.apache.org/jira/browse/LUCENE-8862
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Atri Sharma
>Priority: Major
>  Time Spent: 3h 20m
>  Remaining Estimate: 0h
>
> Inspired by LUCENE-8855, I am thinking of adding a new interface which 
> tracks dynamic memory used by Collectors. This would let users account for 
> the memory usage of their Collectors and better plan their resource 
> capacity. It would also allow us to add Collector level limits for memory 
> usage, thus giving users finer control over their resources.






[jira] [Created] (LUCENE-8897) Allow Callbacks For Events In Collectors/ CollectorManagers

2019-07-01 Thread Atri Sharma (JIRA)
Atri Sharma created LUCENE-8897:
---

 Summary: Allow Callbacks For Events In Collectors/ 
CollectorManagers
 Key: LUCENE-8897
 URL: https://issues.apache.org/jira/browse/LUCENE-8897
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Atri Sharma


It would be good for Collectors and CollectorManagers to allow callbacks for 
specific events (such as collection of N doc IDs across all Collectors of a 
CollectorManager). This would enable things like more accurate early 
termination.
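
A minimal sketch of one such callback, assuming the event of interest is "N 
doc IDs collected across all Collectors of one manager". SharedHitCounter and 
SliceCollector are hypothetical names invented for illustration and are not 
part of Lucene's Collector or CollectorManager APIs.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

final class CollectorCallbackSketch {

    /** Hypothetical state shared by all collectors created by one manager. */
    static final class SharedHitCounter {
        private final int threshold;
        private final Runnable onThreshold;
        private final AtomicInteger collected = new AtomicInteger();
        private final AtomicBoolean fired = new AtomicBoolean();

        SharedHitCounter(int threshold, Runnable onThreshold) {
            this.threshold = threshold;
            this.onThreshold = onThreshold;
        }

        /** Called once per collected doc; runs the callback exactly once at the threshold. */
        void onCollect() {
            if (collected.incrementAndGet() >= threshold && fired.compareAndSet(false, true)) {
                onThreshold.run();
            }
        }

        boolean thresholdReached() {
            return fired.get();
        }
    }

    /** Hypothetical per-slice collector that reports to the shared counter. */
    static final class SliceCollector {
        private final SharedHitCounter counter;
        int localHits;

        SliceCollector(SharedHitCounter counter) {
            this.counter = counter;
        }

        /** Returns false once the global threshold was reached, i.e. the caller may stop. */
        boolean collect(int doc) {
            if (counter.thresholdReached()) {
                return false;
            }
            localHits++;
            counter.onCollect();
            return true;
        }
    }

    public static void main(String[] args) {
        SharedHitCounter counter =
            new SharedHitCounter(3, () -> System.out.println("threshold reached"));
        SliceCollector first = new SliceCollector(counter);
        SliceCollector second = new SliceCollector(counter);
        first.collect(1);
        second.collect(2);
        first.collect(3); // third hit overall: the callback fires here
        System.out.println("second may continue: " + second.collect(4));
    }
}
```

The atomic counter keeps the total consistent when collectors run on separate 
threads, and compareAndSet guarantees that the callback runs only once.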






[jira] [Commented] (LUCENE-8896) Override default implementation of IntersectVisitor#visit(DocIDSetBuilder, byte[]) for several queries

2019-07-01 Thread Atri Sharma (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16876037#comment-16876037
 ] 

Atri Sharma commented on LUCENE-8896:
-

Does PointRangeQuery not already have its custom intersects implementation?

> Override default implementation of IntersectVisitor#visit(DocIDSetBuilder, 
> byte[]) for several queries
> --
>
> Key: LUCENE-8896
> URL: https://issues.apache.org/jira/browse/LUCENE-8896
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Ignacio Vera
>Priority: Major
>
> LUCENE-8885 introduced a new method on the {{IntersectVisitor}} interface. 
> It contains a default implementation, but queries can override it and 
> therefore benefit when there are several documents on a leaf associated 
> with the same point.
> In this issue the following queries are proposed to override the default 
> implementation
> * LatLonShapeQuery
> * RangeFieldQuery
> * LatLonPointInPolygonQuery
> * LatLonPointDistanceQuery
> * PointRangeQuery
> * PointInSetQuery






[jira] [Commented] (LUCENE-8857) Refactor TopDocs#Merge To Take In Custom Tie Breakers

2019-07-01 Thread Atri Sharma (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16876026#comment-16876026
 ] 

Atri Sharma commented on LUCENE-8857:
-

Should we push the latest iteration on the PR, if it looks fine?

> Refactor TopDocs#Merge To Take In Custom Tie Breakers
> -
>
> Key: LUCENE-8857
> URL: https://issues.apache.org/jira/browse/LUCENE-8857
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Atri Sharma
>Priority: Major
> Attachments: LUCENE-8857.patch, LUCENE-8857.patch, LUCENE-8857.patch, 
> LUCENE-8857.patch, LUCENE-8857.patch
>
>  Time Spent: 2h 50m
>  Remaining Estimate: 0h
>
> In LUCENE-8829, the idea of having lambdas passed in to the API to allow 
> finer control over the process was discussed.
> This JIRA tracks adding a parameter to the API which allows passing in 
> lambdas to define custom tie breakers, thus allowing users to do custom 
> algorithms when required.
> CC: [~jpountz]  [~simonw] 






[jira] [Commented] (LUCENE-8857) Refactor TopDocs#Merge To Take In Custom Tie Breakers

2019-06-27 Thread Atri Sharma (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16874125#comment-16874125
 ] 

Atri Sharma commented on LUCENE-8857:
-

Updated the PR with latest comments, removing merge functionality as well. 
Happy to iterate further

> Refactor TopDocs#Merge To Take In Custom Tie Breakers
> -
>
> Key: LUCENE-8857
> URL: https://issues.apache.org/jira/browse/LUCENE-8857
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Atri Sharma
>Priority: Major
> Attachments: LUCENE-8857.patch, LUCENE-8857.patch, LUCENE-8857.patch, 
> LUCENE-8857.patch, LUCENE-8857.patch
>
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> In LUCENE-8829, the idea of having lambdas passed in to the API to allow 
> finer control over the process was discussed.
> This JIRA tracks adding a parameter to the API which allows passing in 
> lambdas to define custom tie breakers, thus allowing users to do custom 
> algorithms when required.
> CC: [~jpountz]  [~simonw] 






[jira] [Commented] (LUCENE-8862) Collector Level Dynamic Memory Accounting

2019-06-27 Thread Atri Sharma (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16874013#comment-16874013
 ] 

Atri Sharma commented on LUCENE-8862:
-

Updated the PR with latest comments and moved to misc module. Happy to iterate 
further.

> Collector Level Dynamic Memory Accounting
> -
>
> Key: LUCENE-8862
> URL: https://issues.apache.org/jira/browse/LUCENE-8862
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Atri Sharma
>Priority: Major
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> Inspired by LUCENE-8855, I am thinking of adding a new interface which 
> tracks dynamic memory used by Collectors. This would let users account for 
> the memory usage of their Collectors and better plan their resource 
> capacity. It would also allow us to add Collector level limits for memory 
> usage, thus giving users finer control over their resources.






[jira] [Commented] (LUCENE-8889) Remove Dead Code From PointRangeQuery

2019-06-27 Thread Atri Sharma (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16874002#comment-16874002
 ] 

Atri Sharma commented on LUCENE-8889:
-

[~jim.ferenczi] Call me old school, but I believe that APIs should have at 
least one user within the library code base (for purely external facing APIs, 
tests are the way, as you suggested).

 

I have raised a PR to strengthen the equality tests using that API; let me 
know if it looks fine.

> Remove Dead Code From PointRangeQuery
> -
>
> Key: LUCENE-8889
> URL: https://issues.apache.org/jira/browse/LUCENE-8889
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Atri Sharma
>Priority: Minor
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> PointRangeQuery has accessors for the underlying points in the query but 
> those are never accessed. We should remove them






[jira] [Created] (LUCENE-8889) Remove Dead Code From PointRangeQuery

2019-06-27 Thread Atri Sharma (JIRA)
Atri Sharma created LUCENE-8889:
---

 Summary: Remove Dead Code From PointRangeQuery
 Key: LUCENE-8889
 URL: https://issues.apache.org/jira/browse/LUCENE-8889
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Atri Sharma


PointRangeQuery has accessors for the underlying points in the query but those 
are never accessed. We should remove them






[jira] [Commented] (LUCENE-8881) Query.rewrite Should Move To QueryVisitor

2019-06-27 Thread Atri Sharma (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16873984#comment-16873984
 ] 

Atri Sharma commented on LUCENE-8881:
-

[~romseygeek] Agreed; however, we could use QueryVisitor's recursion mechanism 
to get query-specific rewrites done (please see my PR to add metadata state to 
QueryVisitor). We could add a boolean property saying DO_REWRITE=true and fire 
a visitor, and each query would check for that property.

 

My main point is that it seems incorrect for two query tree traversal 
mechanisms to exist independently. This Jira is primarily opened to trade 
thoughts on that front, and maybe see if we can draw a common baseline between 
the two existing mechanisms. WDYT?

> Query.rewrite Should Move To QueryVisitor
> -
>
> Key: LUCENE-8881
> URL: https://issues.apache.org/jira/browse/LUCENE-8881
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Atri Sharma
>Priority: Major
>
> Now that we have QueryVisitor, the rewrite functionality should belong there, 
> since rewrite is essentially a recursive visitation of underlying queries, 
> which sounds exactly like what QueryVisitor is designed for.






[jira] [Commented] (LUCENE-8811) Add maximum clause count check to IndexSearcher rather than BooleanQuery

2019-06-26 Thread Atri Sharma (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16873207#comment-16873207
 ] 

Atri Sharma commented on LUCENE-8811:
-

Thanks [~romseygeek] for pushing!

 

A small nit: I think git somehow botched up the patch during commit? (I see 
your name as both author and committer).

> Add maximum clause count check to IndexSearcher rather than BooleanQuery
> 
>
> Key: LUCENE-8811
> URL: https://issues.apache.org/jira/browse/LUCENE-8811
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Assignee: Alan Woodward
>Priority: Minor
> Fix For: 8.2
>
> Attachments: LUCENE-8811.patch, LUCENE-8811.patch, LUCENE-8811.patch, 
> LUCENE-8811.patch, LUCENE-8811.patch, LUCENE-8811.patch
>
>
> Currently we only check whether boolean queries have too many clauses. 
> However there are other ways that queries may have too many clauses, for 
> instance if you have boolean queries that have themselves inner boolean 
> queries.
> Could we use the new Query visitor API to move this check from BooleanQuery 
> to IndexSearcher in order to make this check more consistent across queries? 
> See for instance LUCENE-8810 where a rewrite rule caused the maximum clause 
> count to be hit even though the total number of leaf queries remained the 
> same.






[jira] [Commented] (LUCENE-8882) Add State To QueryVisitor

2019-06-25 Thread Atri Sharma (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16873002#comment-16873002
 ] 

Atri Sharma commented on LUCENE-8882:
-

I think this is useful even outside LUCENE-8881: it allows upper queries to 
collect metadata about the lower leaf level queries and make decisions 
(motivated by the excellent work done recently to use the property of a sorted 
index to perform binary searches on docIDs). So we could use a property such as 
INDEX_SORTED, which is populated at some query and visible to the entire query 
tree, and then a query looks at the property and decides to use a specific type 
of query. This can even be factored into the cost of the query, but in a 
localised form so that not all heuristics are crammed into one specialized 
query (IndexOrDocValues?).

 

Objections/Thoughts/Comments?

> Add State To QueryVisitor
> -
>
> Key: LUCENE-8882
> URL: https://issues.apache.org/jira/browse/LUCENE-8882
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Atri Sharma
>Priority: Major
>
> QueryVisitor has no state passed in either up or down recursion. This limits 
> the range of decisions that can be made while visiting with QueryVisitor. For 
> example, for LUCENE-8881, we need a way to specify whether the visitor is a 
> rewriter visitor.
>  
> This Jira proposes adding a property bag model to QueryVisitor, which can 
> then be referred to by the Query instance being visited by QueryVisitor.






[jira] [Created] (LUCENE-8882) Add State To QueryVisitor

2019-06-25 Thread Atri Sharma (JIRA)
Atri Sharma created LUCENE-8882:
---

 Summary: Add State To QueryVisitor
 Key: LUCENE-8882
 URL: https://issues.apache.org/jira/browse/LUCENE-8882
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Atri Sharma


QueryVisitor has no state passed in either up or down recursion. This limits 
the range of decisions that can be made while visiting with QueryVisitor. For 
example, for LUCENE-8881, we need a way to specify whether the visitor is a 
rewriter visitor.

 

This Jira proposes adding a property bag model to QueryVisitor, which can then 
be referred to by the Query instance being visited by QueryVisitor.






[jira] [Created] (LUCENE-8881) Query.rewrite Should Move To QueryVisitor

2019-06-25 Thread Atri Sharma (JIRA)
Atri Sharma created LUCENE-8881:
---

 Summary: Query.rewrite Should Move To QueryVisitor
 Key: LUCENE-8881
 URL: https://issues.apache.org/jira/browse/LUCENE-8881
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Atri Sharma


Now that we have QueryVisitor, the rewrite functionality should belong there, 
since rewrite is essentially a recursive visitation of underlying queries, 
which is exactly what QueryVisitor is designed for.






[jira] [Commented] (LUCENE-8811) Add maximum clause count check to IndexSearcher rather than BooleanQuery

2019-06-25 Thread Atri Sharma (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16872550#comment-16872550
 ] 

Atri Sharma commented on LUCENE-8811:
-

Any chance we could push this one? Happy to make any changes

> Add maximum clause count check to IndexSearcher rather than BooleanQuery
> 
>
> Key: LUCENE-8811
> URL: https://issues.apache.org/jira/browse/LUCENE-8811
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
> Attachments: LUCENE-8811.patch, LUCENE-8811.patch, LUCENE-8811.patch, 
> LUCENE-8811.patch, LUCENE-8811.patch, LUCENE-8811.patch
>
>
> Currently we only check whether boolean queries have too many clauses. 
> However there are other ways that queries may have too many clauses, for 
> instance if you have boolean queries that have themselves inner boolean 
> queries.
> Could we use the new Query visitor API to move this check from BooleanQuery 
> to IndexSearcher in order to make this check more consistent across queries? 
> See for instance LUCENE-8810 where a rewrite rule caused the maximum clause 
> count to be hit even though the total number of leaf queries remained the 
> same.
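
A minimal sketch of the check described above, assuming the limit is applied 
to the total number of leaf queries in the whole tree rather than per boolean 
query. The Node, Leaf and Bool types and the check method are hypothetical 
stand-ins, not Lucene's Query, BooleanQuery or QueryVisitor classes.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch: a toy query tree, not Lucene's classes. */
final class ClauseCountSketch {

  interface Node {
    int countLeaves();
  }

  static final class Leaf implements Node {
    @Override
    public int countLeaves() {
      return 1;
    }
  }

  static final class Bool implements Node {
    final List<Node> clauses;

    Bool(Node... clauses) {
      this.clauses = Arrays.asList(clauses);
    }

    @Override
    public int countLeaves() {
      int total = 0;
      for (Node clause : clauses) {
        total += clause.countLeaves(); // recurse into nested boolean queries
      }
      return total;
    }
  }

  /** Rejects the tree when the total number of leaves exceeds maxClauses. */
  static void check(Node root, int maxClauses) {
    int leaves = root.countLeaves();
    if (leaves > maxClauses) {
      throw new IllegalStateException("too many clauses: " + leaves + " > " + maxClauses);
    }
  }

  public static void main(String[] args) {
    Node nested = new Bool(new Bool(new Leaf(), new Leaf()), new Leaf());
    System.out.println(nested.countLeaves());
    check(nested, 1024);
  }
}
```

Because the count depends only on leaves, a rewrite that reshapes the tree 
without adding leaf queries would not change the outcome of this check.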






[jira] [Created] (LUCENE-8880) Add a TopDocsCollector which does not sort by score

2019-06-25 Thread Atri Sharma (JIRA)
Atri Sharma created LUCENE-8880:
---

 Summary: Add a TopDocsCollector which does not sort by score
 Key: LUCENE-8880
 URL: https://issues.apache.org/jira/browse/LUCENE-8880
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Atri Sharma


We currently assume that the user cares about the underlying hits being ordered 
by score. This Jira explores adding a collector which does not make this 
guarantee and therefore does not need a priority queue as the collection data 
structure. This should help when the number of hits is large, where the heap's 
rebalancing can become a bottleneck.
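
A minimal sketch of what collecting without a priority queue could look like, 
assuming hits are simply appended to growing arrays in arrival order. 
UnorderedCollectorSketch and its methods are hypothetical names, not an 
existing Lucene collector.

```java
import java.util.Arrays;

/** Hypothetical sketch: appends hits in arrival order, with no heap. */
final class UnorderedCollectorSketch {
  private int[] docs = new int[16];
  private float[] scores = new float[16];
  private int size;

  /** Amortized O(1) per hit, versus O(log n) for a heap insertion. */
  void collect(int doc, float score) {
    if (size == docs.length) {
      docs = Arrays.copyOf(docs, size * 2);
      scores = Arrays.copyOf(scores, size * 2);
    }
    docs[size] = doc;
    scores[size] = score;
    size++;
  }

  int size() {
    return size;
  }

  int doc(int i) {
    return docs[i];
  }

  float score(int i) {
    return scores[i];
  }

  public static void main(String[] args) {
    UnorderedCollectorSketch collector = new UnorderedCollectorSketch();
    for (int doc = 0; doc < 100; doc++) {
      collector.collect(doc, 1.0f / (doc + 1));
    }
    System.out.println(collector.size());
  }
}
```

The trade-off is that this sketch keeps every hit it sees and makes no 
ordering guarantee, so any top-N selection would have to happen afterwards.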






[jira] [Commented] (LUCENE-8877) TopDocsCollector Should Not Depend on Priority Queue

2019-06-25 Thread Atri Sharma (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16872544#comment-16872544
 ] 

Atri Sharma commented on LUCENE-8877:
-

Any thoughts on this? I envision eventually getting to a state where the 
underlying data structure is opaque to the IndexSearcher API. This should 
allow an abstraction with a high degree of flexibility.

> TopDocsCollector Should Not Depend on Priority Queue
> 
>
> Key: LUCENE-8877
> URL: https://issues.apache.org/jira/browse/LUCENE-8877
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Atri Sharma
>Priority: Major
>
> TopDocsCollector is tightly coupled to the notion of a priority queue, which 
> is not necessarily a good abstraction to have, since the collector really 
> just needs an interface to iterate on and hold docID and score, possibly 
> with shard indexes.
>  
> We should rewrite this against a simpler interface, with the priority queue 
> being the default implementation.






[jira] [Commented] (LUCENE-8875) Should TopScoreDocCollector Always Populate Sentinel Values?

2019-06-24 Thread Atri Sharma (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16871620#comment-16871620
 ] 

Atri Sharma commented on LUCENE-8875:
-

I meant Elasticsearch aggregates (although, on second thought, I am not sure 
whether the proposed collector would directly improve things on that front).

The main point here is that I believe the least invasive path is to introduce 
a new collector which clearly states that it is meant for cases where N is 
very large (>10k?), and which lists the benefits and trade-offs clearly.

Are there any catches that are applicable here?

> Should TopScoreDocCollector Always Populate Sentinel Values?
> 
>
> Key: LUCENE-8875
> URL: https://issues.apache.org/jira/browse/LUCENE-8875
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Atri Sharma
>Priority: Major
>
> TopScoreDocCollector always initializes HitQueue as the PQ implementation, 
> and instructs HitQueue to populate with sentinels. While this is a great 
> safety mechanism, for very large datasets where the query's selectivity is 
> high, the sentinel population can be redundant and can become a large enough 
> bottleneck in itself. Does it make sense to introduce a new parameter in 
> TopScoreDocCollector which uses a heuristic (say number of hits > 10k) and 
> does not populate sentinels?






[jira] [Commented] (LUCENE-8875) Should TopScoreDocCollector Always Populate Sentinel Values?

2019-06-24 Thread Atri Sharma (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16871543#comment-16871543
 ] 

Atri Sharma commented on LUCENE-8875:
-

While I agree that very large hit counts are not what top N hits are intended 
for, some increasingly popular use cases lean in that direction (bucket 
aggregates?). I think it would be fair to allow such users to use a different 
Collector which optimises for their case without muddying the commonly used 
code path. WDYT?

> Should TopScoreDocCollector Always Populate Sentinel Values?
> 
>
> Key: LUCENE-8875
> URL: https://issues.apache.org/jira/browse/LUCENE-8875
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Atri Sharma
>Priority: Major
>
> TopScoreDocCollector always initializes HitQueue as the PQ implementation, 
> and instructs HitQueue to populate with sentinels. While this is a great 
> safety mechanism, for very large datasets where the query's selectivity is 
> high, the sentinel population can be redundant and can become a large enough 
> bottleneck in itself. Does it make sense to introduce a new parameter in 
> TopScoreDocCollector which uses a heuristic (say number of hits > 10k) and 
> does not populate sentinels?






[jira] [Created] (LUCENE-8877) TopDocsCollector Should Not Depend on Priority Queue

2019-06-24 Thread Atri Sharma (JIRA)
Atri Sharma created LUCENE-8877:
---

 Summary: TopDocsCollector Should Not Depend on Priority Queue
 Key: LUCENE-8877
 URL: https://issues.apache.org/jira/browse/LUCENE-8877
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Atri Sharma


TopDocsCollector is tightly coupled to the notion of a priority queue, which is 
not necessarily a good abstraction to have, since the collector really just 
needs an interface to iterate on and hold docID and score, possibly with shard 
indexes.

We should rewrite this against a simpler interface, with the priority queue 
being the default implementation.
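
A minimal sketch of such an interface, assuming the collector only needs to 
add hits, count them and iterate over them. HitStore, Hit and 
PriorityQueueHitStore are hypothetical names; the heap used here is 
java.util.PriorityQueue, not Lucene's own PriorityQueue class.

```java
import java.util.Iterator;
import java.util.PriorityQueue;

/** Hypothetical sketch: none of these types exist in Lucene. */
final class HitStoreSketch {

  static final class Hit {
    final int doc;
    final float score;
    final int shardIndex;

    Hit(int doc, float score, int shardIndex) {
      this.doc = doc;
      this.score = score;
      this.shardIndex = shardIndex;
    }
  }

  /** The minimal contract: add hits, count them, iterate over them. */
  interface HitStore extends Iterable<Hit> {
    void add(Hit hit);

    int size();
  }

  /** Default implementation: a bounded min-heap keeping the top N by score. */
  static final class PriorityQueueHitStore implements HitStore {
    private final int capacity;
    private final PriorityQueue<Hit> heap =
        new PriorityQueue<>((a, b) -> Float.compare(a.score, b.score));

    PriorityQueueHitStore(int capacity) {
      this.capacity = capacity;
    }

    @Override
    public void add(Hit hit) {
      if (heap.size() < capacity) {
        heap.add(hit);
      } else if (capacity > 0 && hit.score > heap.peek().score) {
        heap.poll(); // evict the current weakest hit
        heap.add(hit);
      }
    }

    @Override
    public int size() {
      return heap.size();
    }

    @Override
    public Iterator<Hit> iterator() {
      return heap.iterator(); // heap order, not sorted order
    }
  }

  public static void main(String[] args) {
    HitStore store = new PriorityQueueHitStore(2);
    store.add(new Hit(1, 1.0f, -1));
    store.add(new Hit(2, 3.0f, -1));
    store.add(new Hit(3, 2.0f, -1));
    for (Hit hit : store) {
      System.out.println(hit.doc + " " + hit.score);
    }
  }
}
```

A collector written against HitStore would not need to know whether the 
backing structure is a heap, a plain list or something else.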






[jira] [Commented] (LUCENE-8875) Should TopScoreDocCollector Always Populate Sentinel Values?

2019-06-23 Thread Atri Sharma (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16870679#comment-16870679
 ] 

Atri Sharma commented on LUCENE-8875:
-

Another thing to explore is to use a compact set of arrays instead of ScoreDoc 
objects: [https://sbdevel.wordpress.com/2015/10/05/speeding-up-core-search/]

Maybe we could have a new PQ implementation based on this idea, and a new 
Collector which combines the threshold-based sentinel filling with the new PQ, 
used only for very large N?

> Should TopScoreDocCollector Always Populate Sentinel Values?
> 
>
> Key: LUCENE-8875
> URL: https://issues.apache.org/jira/browse/LUCENE-8875
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Atri Sharma
>Priority: Major
>
> TopScoreDocCollector always initializes HitQueue as the PQ implementation, 
> and instructs HitQueue to populate with sentinels. While this is a great 
> safety mechanism, for very large datasets where the query's selectivity is 
> high, the sentinel population can be redundant and can become a large enough 
> bottleneck in itself. Does it make sense to introduce a new parameter in 
> TopScoreDocCollector which uses a heuristic (say number of hits > 10k) and 
> does not populate sentinels?






[jira] [Created] (LUCENE-8875) Should TopScoreDocCollector Always Populate Sentinel Values?

2019-06-23 Thread Atri Sharma (JIRA)
Atri Sharma created LUCENE-8875:
---

 Summary: Should TopScoreDocCollector Always Populate Sentinel Values?
 Key: LUCENE-8875
 URL: https://issues.apache.org/jira/browse/LUCENE-8875
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Atri Sharma


TopScoreDocCollector always initializes HitQueue as the PQ implementation, and 
instructs HitQueue to populate with sentinels. While this is a great safety 
mechanism, for very large datasets where the query's selectivity is high, the 
sentinel population can be redundant and can become a large enough bottleneck 
in itself. Does it make sense to introduce a new parameter in 
TopScoreDocCollector which uses a heuristic (say number of hits > 10k) and does 
not populate sentinels?
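
A minimal sketch of the proposed heuristic, assuming sentinels are 
negative-infinity scores that pre-fill the queue so the collector never has to 
check whether the queue is full. It uses java.util.PriorityQueue over bare 
scores instead of Lucene's HitQueue; SENTINEL_THRESHOLD, newQueue and collect 
are hypothetical names.

```java
import java.util.PriorityQueue;

/** Hypothetical sketch: a score-only top-N queue with optional sentinels. */
final class SentinelSketch {
  // Taken from the "number of hits > 10k" heuristic in the description.
  static final int SENTINEL_THRESHOLD = 10_000;

  /** Pre-fills the queue with sentinels only when n is small enough. */
  static PriorityQueue<Float> newQueue(int n) {
    PriorityQueue<Float> pq = new PriorityQueue<>();
    if (n <= SENTINEL_THRESHOLD) {
      for (int i = 0; i < n; i++) {
        pq.add(Float.NEGATIVE_INFINITY); // O(n) work before any hit arrives
      }
    }
    return pq;
  }

  /** Keeps the n best scores; n is assumed to be at least 1. */
  static void collect(PriorityQueue<Float> pq, int n, float score) {
    if (pq.size() < n) {
      pq.add(score); // only reachable when sentinels were skipped
    } else if (score > pq.peek()) {
      pq.poll();
      pq.add(score);
    }
  }

  public static void main(String[] args) {
    PriorityQueue<Float> eager = newQueue(3);
    PriorityQueue<Float> lazy = newQueue(20_000);
    for (float score = 1f; score <= 5f; score++) {
      collect(eager, 3, score);
      collect(lazy, 20_000, score);
    }
    System.out.println(eager.size() + " " + lazy.size());
  }
}
```

With sentinels skipped, the queue only grows as hits arrive, which is the 
saving the description is after when far fewer than N documents match.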






[jira] [Commented] (LUCENE-8857) Refactor TopDocs#Merge To Take In Custom Tie Breakers

2019-06-20 Thread Atri Sharma (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16868909#comment-16868909
 ] 

Atri Sharma commented on LUCENE-8857:
-

[~simonw] I have added the default tie breaker, which breaks ties by shard 
index first and then by docID, as suggested. The new PR has the latest 
iteration; please let me know if it looks fine.
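
A minimal sketch of that ordering, assuming hits are compared by descending 
score and ties are passed to a pluggable Comparator. The Hit class, 
DEFAULT_TIE_BREAKER and byScoreThen are hypothetical names and do not 
reproduce the signature added to TopDocs#merge in the patch.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical sketch: not the actual TopDocs#merge API. */
final class TieBreakerSketch {

  static final class Hit {
    final int shardIndex;
    final int doc;
    final float score;

    Hit(int shardIndex, int doc, float score) {
      this.shardIndex = shardIndex;
      this.doc = doc;
      this.score = score;
    }
  }

  /** Breaks ties by shard index first, then by docID. */
  static final Comparator<Hit> DEFAULT_TIE_BREAKER =
      Comparator.<Hit>comparingInt(h -> h.shardIndex).thenComparingInt(h -> h.doc);

  /** Higher score first; equal scores are resolved by the given tie breaker. */
  static Comparator<Hit> byScoreThen(Comparator<Hit> tieBreaker) {
    return (a, b) -> {
      int byScore = Float.compare(b.score, a.score);
      return byScore != 0 ? byScore : tieBreaker.compare(a, b);
    };
  }

  public static void main(String[] args) {
    List<Hit> hits = new ArrayList<>();
    hits.add(new Hit(1, 5, 1.0f));
    hits.add(new Hit(0, 9, 1.0f));
    hits.add(new Hit(0, 2, 1.0f));
    hits.sort(byScoreThen(DEFAULT_TIE_BREAKER));
    for (Hit hit : hits) {
      System.out.println(hit.shardIndex + ":" + hit.doc);
    }
  }
}
```

When no shard index is set and every hit carries -1, the first comparison is 
always a tie, so the ordering falls through to docID.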

> Refactor TopDocs#Merge To Take In Custom Tie Breakers
> -
>
> Key: LUCENE-8857
> URL: https://issues.apache.org/jira/browse/LUCENE-8857
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Atri Sharma
>Priority: Major
> Attachments: LUCENE-8857.patch, LUCENE-8857.patch, LUCENE-8857.patch, 
> LUCENE-8857.patch, LUCENE-8857.patch
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> In LUCENE-8829, the idea of having lambdas passed in to the API to allow 
> finer control over the process was discussed.
> This JIRA tracks adding a parameter to the API which allows passing in 
> lambdas to define custom tie breakers, thus allowing users to do custom 
> algorithms when required.
> CC: [~jpountz]  [~simonw] 






[jira] [Commented] (LUCENE-8857) Refactor TopDocs#Merge To Take In Custom Tie Breakers

2019-06-20 Thread Atri Sharma (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16868717#comment-16868717
 ] 

Atri Sharma commented on LUCENE-8857:
-

{quote}Any chance we can select the tie-breaker based on if one of the TopDocs 
has a shardIndex != -1 and assert that all of them have it or not? Another 
option would be to have only one comparator and first tie-break on shardIndex 
and then on doc since we don't set the shard index it should be fine since they 
are all -1?
{quote}
Would that not defeat the purpose of passing in a custom tie breaker? I 
thought the reason we added the Comparator parameter was to allow users to 
specify their own tie-breaking algorithms. Am I missing something?

 

 

> Refactor TopDocs#Merge To Take In Custom Tie Breakers
> -
>
> Key: LUCENE-8857
> URL: https://issues.apache.org/jira/browse/LUCENE-8857
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Atri Sharma
>Priority: Major
> Attachments: LUCENE-8857.patch, LUCENE-8857.patch, LUCENE-8857.patch, 
> LUCENE-8857.patch, LUCENE-8857.patch
>
>
> In LUCENE-8829, the idea of having lambdas passed in to the API to allow 
> finer control over the process was discussed.
> This JIRA tracks adding a parameter to the API which allows passing in 
> lambdas to define custom tie breakers, thus allowing users to do custom 
> algorithms when required.
> CC: [~jpountz]  [~simonw] 






[jira] [Commented] (LUCENE-8857) Refactor TopDocs#Merge To Take In Custom Tie Breakers

2019-06-20 Thread Atri Sharma (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16868479#comment-16868479
 ] 

Atri Sharma commented on LUCENE-8857:
-

Does this iteration look fine? Happy to iterate further if needed.

> Refactor TopDocs#Merge To Take In Custom Tie Breakers
> -
>
> Key: LUCENE-8857
> URL: https://issues.apache.org/jira/browse/LUCENE-8857
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Atri Sharma
>Priority: Major
> Attachments: LUCENE-8857.patch, LUCENE-8857.patch, LUCENE-8857.patch, 
> LUCENE-8857.patch, LUCENE-8857.patch
>
>
> In LUCENE-8829, the idea of having lambdas passed in to the API to allow 
> finer control over the process was discussed.
> This JIRA tracks adding a parameter to the API which allows passing in 
> lambdas to define custom tie breakers, thus allowing users to do custom 
> algorithms when required.
> CC: [~jpountz]  [~simonw] 






[jira] [Commented] (LUCENE-8857) Refactor TopDocs#Merge To Take In Custom Tie Breakers

2019-06-19 Thread Atri Sharma (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16867411#comment-16867411
 ] 

Atri Sharma commented on LUCENE-8857:
-

Updated patch with improved javadocs and removal of now redundant methods

 

[^LUCENE-8857.patch]

 

 

> Refactor TopDocs#Merge To Take In Custom Tie Breakers
> -
>
> Key: LUCENE-8857
> URL: https://issues.apache.org/jira/browse/LUCENE-8857
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Atri Sharma
>Priority: Major
> Attachments: LUCENE-8857.patch, LUCENE-8857.patch, LUCENE-8857.patch, 
> LUCENE-8857.patch, LUCENE-8857.patch
>
>
> In LUCENE-8829, the idea of having lambdas passed in to the API to allow 
> finer control over the process was discussed.
> This JIRA tracks adding a parameter to the API which allows passing in 
> lambdas to define custom tie breakers, thus allowing users to do custom 
> algorithms when required.
> CC: [~jpountz]  [~simonw] 






[jira] [Updated] (LUCENE-8857) Refactor TopDocs#Merge To Take In Custom Tie Breakers

2019-06-19 Thread Atri Sharma (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Atri Sharma updated LUCENE-8857:

Attachment: LUCENE-8857.patch

> Refactor TopDocs#Merge To Take In Custom Tie Breakers
> -
>
> Key: LUCENE-8857
> URL: https://issues.apache.org/jira/browse/LUCENE-8857
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Atri Sharma
>Priority: Major
> Attachments: LUCENE-8857.patch, LUCENE-8857.patch, LUCENE-8857.patch, 
> LUCENE-8857.patch, LUCENE-8857.patch
>
>
> In LUCENE-8829, the idea of having lambdas passed in to the API to allow 
> finer control over the process was discussed.
> This JIRA tracks adding a parameter to the API which allows passing in 
> lambdas to define custom tie breakers, thus allowing users to do custom 
> algorithms when required.
> CC: [~jpountz]  [~simonw] 






[jira] [Commented] (LUCENE-8862) Collector Level Dynamic Memory Accounting

2019-06-19 Thread Atri Sharma (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16867325#comment-16867325
 ] 

Atri Sharma commented on LUCENE-8862:
-

I have opened a PR for this. Please let me know if it looks fine.

Once we merge this, I am planning to open a Jira to enable Solr's facet 
collector to account for memory. For default cases, the limit can be 
Long.MAX_VALUE.

Thoughts?
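
A minimal sketch of collector-level accounting with a limit, assuming the 
collector reports each allocation to a tracker that rejects anything past the 
limit. LimitedTracker and its methods are hypothetical names and are not the 
interface proposed in the PR.

```java
/** Hypothetical sketch: not the interface from the PR. */
final class MemoryAccountingSketch {

  static final class LimitedTracker {
    private final long limitBytes;
    private long usedBytes;

    LimitedTracker(long limitBytes) {
      this.limitBytes = limitBytes;
    }

    /** Default case from the comment above: effectively no limit. */
    static LimitedTracker unlimited() {
      return new LimitedTracker(Long.MAX_VALUE);
    }

    /** Records an allocation; bytes is assumed to be non-negative. */
    void add(long bytes) {
      // Compare against the remaining budget to avoid overflowing usedBytes.
      if (bytes > limitBytes - usedBytes) {
        throw new IllegalStateException(
            "memory limit exceeded: " + usedBytes + " + " + bytes + " > " + limitBytes);
      }
      usedBytes += bytes;
    }

    long used() {
      return usedBytes;
    }
  }

  public static void main(String[] args) {
    LimitedTracker tracker = new LimitedTracker(100);
    tracker.add(60);
    tracker.add(40);
    System.out.println(tracker.used());
  }
}
```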

> Collector Level Dynamic Memory Accounting
> -
>
> Key: LUCENE-8862
> URL: https://issues.apache.org/jira/browse/LUCENE-8862
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Atri Sharma
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Inspired by LUCENE-8855, I am thinking of adding a new interface which 
> tracks dynamic memory used by Collectors. This would let users account for 
> the memory usage of their Collectors and better plan their resource 
> capacity. It would also allow us to add Collector-level limits for memory 
> usage, giving users finer control over their resources.






[jira] [Commented] (LUCENE-8864) Add Query Memory Estimation Ability in QueryVisitor

2019-06-18 Thread Atri Sharma (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16866933#comment-16866933
 ] 

Atri Sharma commented on LUCENE-8864:
-

Right, the purpose of this Jira was twofold:

1) To throw out thoughts about making memory accounting a first-class citizen 
within QueryVisitor. I think it would be good if we added a method which 
returned the overall size of the underlying query. This fits in nicely with 
QueryVisitor's model, since queries can be nested, so it is good to get the 
"deep" memory usage of the parent query. As you said, the new method could 
return the Accountable's estimate, or the shallow size if Accountable is not 
supported.

2) To borrow ideas from the QueryVisitor design to see if we can improve 
Accountable itself. While this is orthogonal and I have not really thought 
through every corner case, my instinct says that there might be opportunities 
to make Accountable's APIs more recursive in nature. For example, there are a 
ton of instanceof checks present today, one for each Query type. Should we 
think about delegating some of that calculation to a visitor-type model which 
localizes the per-query calculation to the query's scope?
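
A minimal sketch of the "deep" estimate from point 1, assuming each node 
either reports its own estimate or falls back to a shallow size, and that the 
reported estimate covers only the node itself. Node and deepBytes are 
hypothetical and do not use Lucene's Accountable or QueryVisitor.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch: a toy query tree, not Lucene's classes. */
final class DeepMemorySketch {

  static final class Node {
    final long shallowBytes;
    final Long reportedBytes; // null when the node offers no estimate of its own
    final List<Node> children;

    Node(long shallowBytes, Long reportedBytes, Node... children) {
      this.shallowBytes = shallowBytes;
      this.reportedBytes = reportedBytes;
      this.children = Arrays.asList(children);
    }
  }

  /** Sums the node's own estimate (or shallow size) with all nested queries. */
  static long deepBytes(Node node) {
    long total = node.reportedBytes != null ? node.reportedBytes : node.shallowBytes;
    for (Node child : node.children) {
      total += deepBytes(child);
    }
    return total;
  }

  public static void main(String[] args) {
    Node plain = new Node(16, null);
    Node reporting = new Node(16, 100L);
    Node parent = new Node(32, null, plain, reporting);
    System.out.println(deepBytes(parent));
  }
}
```

The recursion lives in one place and each node only answers for itself, which 
is the alternative to per-type instanceof checks raised in point 2.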

> Add Query Memory Estimation Ability in QueryVisitor
> ---
>
> Key: LUCENE-8864
> URL: https://issues.apache.org/jira/browse/LUCENE-8864
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Atri Sharma
>Priority: Major
>
> In LUCENE-8855, there is a discussion around adding memory accounting 
> capabilities to QueryVisitor to allow estimation of memory consumption by 
> queries.
> This Jira tracks the effort.






[jira] [Commented] (LUCENE-8769) Range Query Type With Logically Connected Ranges

2019-06-18 Thread Atri Sharma (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16866884#comment-16866884
 ] 

Atri Sharma commented on LUCENE-8769:
-

Thinking more about this, I think what can be done is:

 

1) Introduce NOT semantics by translating NOT (a, b) to (-infinity, a) OR (b, 
infinity)

2) Introduce a RangeClause which contains a bunch of ranges and the associated 
AND and NOT operators (not OR). Each RangeClause will be executed 
independently, and the final results are then ANDed or ORed. For example:

(a AND b) OR (c NOT d) converts to two RangeClauses: {a, b, AND}, {c, d, 
NOT}, where the RangeClauses are connected by OR, so the independent results of 
both clauses are then ORed to give the final result.

 

Does this seem useful and a doable approach?
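
A minimal sketch of how such clauses could be evaluated for a single value. It 
assumes closed ranges over long values, reads "c NOT d" as "inside c and 
outside d", and treats being outside d as falling below it or above it. 
Range, RangeClause, Op and matchesAny are hypothetical names and do not come 
from the attached patch.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch: not the implementation in the attached patch. */
final class RangeClauseSketch {

  enum Op { AND, NOT }

  static final class Range {
    final long lo;
    final long hi;

    Range(long lo, long hi) {
      this.lo = lo;
      this.hi = hi;
    }

    boolean contains(long v) {
      return lo <= v && v <= hi;
    }
  }

  /** Two ranges joined by AND or NOT, evaluated independently of other clauses. */
  static final class RangeClause {
    final Range first;
    final Range second;
    final Op op;

    RangeClause(Range first, Range second, Op op) {
      this.first = first;
      this.second = second;
      this.op = op;
    }

    boolean matches(long v) {
      if (op == Op.AND) {
        return first.contains(v) && second.contains(v);
      }
      // NOT: inside the first range and below or above the second one.
      return first.contains(v) && (v < second.lo || v > second.hi);
    }
  }

  /** Clauses connected by OR: a value matches if any clause matches. */
  static boolean matchesAny(List<RangeClause> clauses, long v) {
    for (RangeClause clause : clauses) {
      if (clause.matches(v)) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    List<RangeClause> clauses = Arrays.asList(
        new RangeClause(new Range(0, 10), new Range(5, 20), Op.AND),
        new RangeClause(new Range(100, 200), new Range(150, 160), Op.NOT));
    System.out.println(matchesAny(clauses, 7));
    System.out.println(matchesAny(clauses, 155));
  }
}
```

This only checks one value at a time; the benefit discussed in the issue 
would come from resolving all ranges of a clause in a single tree traversal.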

> Range Query Type With Logically Connected Ranges
> 
>
> Key: LUCENE-8769
> URL: https://issues.apache.org/jira/browse/LUCENE-8769
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Atri Sharma
>Priority: Major
> Attachments: LUCENE-8769.patch, LUCENE-8769.patch, LUCENE-8769.patch
>
>
> Today, we visit the BKD tree once for each range specified for 
> PointRangeQuery. It would be good to have a range query type which can take 
> multiple ranges logically ANDed or ORed.





