[jira] [Resolved] (LUCENE-10644) Facets#getAllChildren testing should ignore child order

2022-08-18 Thread Greg Miller (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Miller resolved LUCENE-10644.
--
Fix Version/s: 9.4
   Resolution: Fixed

> Facets#getAllChildren testing should ignore child order
> ---
>
> Key: LUCENE-10644
> URL: https://issues.apache.org/jira/browse/LUCENE-10644
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/facet
>Reporter: Greg Miller
>Priority: Minor
> Fix For: 9.4
>
> Attachments: failing tests.png
>
>  Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> Our javadoc for {{Facets#getAllChildren}} explicitly calls out that callers 
> should make no assumptions about child ordering, but a number of our own unit 
> tests turn around and make that assumption. I ran into this when recently 
> trying an optimization that would result in a different child ordering for 
> {{getAllChildren}}, and found a number of unit tests that started 
> failing. I'll upload a list of what I found failing.
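
A minimal sketch of the kind of order-insensitive check this implies, written against plain Java collections rather than the real facet types. {{ChildOrderCheck}}, {{child}} and {{sameChildren}} are hypothetical names for illustration only, not code from the actual patch; in a real test the (label, count) pairs would presumably be taken from the children of a facet result.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Hypothetical helper: compares two child lists while ignoring their order.
class ChildOrderCheck {

  // Builds a (label, count) pair; stands in for a facet child in this sketch.
  static Map.Entry<String, Integer> child(String label, int count) {
    return new SimpleEntry<>(label, count);
  }

  // Returns true when both lists hold the same (label, count) pairs in any order.
  static boolean sameChildren(
      List<Map.Entry<String, Integer>> expected, List<Map.Entry<String, Integer>> actual) {
    return sorted(expected).equals(sorted(actual));
  }

  // Sorts a copy by label, then by count, so that ordering differences disappear.
  private static List<Map.Entry<String, Integer>> sorted(List<Map.Entry<String, Integer>> in) {
    List<Map.Entry<String, Integer>> copy = new ArrayList<>(in);
    copy.sort(
        Comparator.comparing((Map.Entry<String, Integer> e) -> e.getKey())
            .thenComparing(e -> e.getValue()));
    return copy;
  }
}
```

Sorting copies of both lists keeps the inputs untouched, so the same result can still be passed to other assertions afterwards.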



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10644) Facets#getAllChildren testing should ignore child order

2022-08-18 Thread Greg Miller (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17581476#comment-17581476
 ] 

Greg Miller commented on LUCENE-10644:
--

Merged and backported. Thanks!

> Facets#getAllChildren testing should ignore child order
> ---
>
> Key: LUCENE-10644
> URL: https://issues.apache.org/jira/browse/LUCENE-10644
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/facet
>Reporter: Greg Miller
>Priority: Minor
> Attachments: failing tests.png
>
>  Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> Our javadoc for {{Facets#getAllChildren}} explicitly calls out that callers 
> should make no assumptions about child ordering, but a number of our own unit 
> tests turn around and make that assumption. I ran into this when recently 
> trying an optimization that would result in a different child ordering for 
> {{getAllChildren}}, and found a number of unit tests that started 
> failing. I'll upload a list of what I found failing.






[jira] [Commented] (LUCENE-10207) Make TermInSetQuery usable with IndexOrDocValuesQuery

2022-08-04 Thread Greg Miller (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17575402#comment-17575402
 ] 

Greg Miller commented on LUCENE-10207:
--

I'm coming back to this work now as I'm working on another project that would 
benefit from the ability to use a {{TermInSetQuery}} within an 
{{IndexOrDocValuesQuery}}. Where this work stalled last year was in answering 
whether or not making {{TermInSetQuery}} extend {{MultiTermQuery}} would have a 
negative performance impact, since the term intersection implementation would 
differ. The motivation for extending {{MultiTermQuery}} was to make a doc 
values-based term-in-set implementation easy (using the existing 
{{DocValuesRewriteMethod}}).

I suggest we separate some of these concerns to make progress. The sandbox 
module already has {{DocValuesTermsQuery}}, which could be paired with 
{{TermInSetQuery}} inside of {{IndexOrDocValuesQuery}}. But we still can't use 
{{TermInSetQuery}} in an {{IndexOrDocValuesQuery}} since {{TermInSetQuery}} 
doesn't provide a {{ScorerSupplier}} with cost estimation. I propose we address 
this first, and not worry about refactoring {{TermInSetQuery}} to extend 
{{MultiTermQuery}} at this point. This would be incremental progress that 
enables using {{TermInSetQuery}} + {{DocValuesTermsQuery}} in an 
{{IndexOrDocValuesQuery}}, while not requiring us to answer the performance 
impact of changing {{TermInSetQuery}} to extend {{MultiTermQuery}}.

I've opened a separate PR to make this iterative step: 
https://github.com/apache/lucene/pull/1058

> Make TermInSetQuery usable with IndexOrDocValuesQuery
> -
>
> Key: LUCENE-10207
> URL: https://issues.apache.org/jira/browse/LUCENE-10207
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Assignee: Greg Miller
>Priority: Minor
> Attachments: LUCENE-10207_multitermquery.patch
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> IndexOrDocValuesQuery is very useful to pick the right execution mode for a 
> query depending on other bits of the query tree.
> We would like to be able to use it to optimize execution of TermInSetQuery. 
> However IndexOrDocValuesQuery only works well if the "index" query can give 
> an estimation of the cost of the query without doing anything expensive (like 
> looking up all terms of the TermInSetQuery in the terms dict). Maybe we could 
> implement it for primary keys (terms.size() == sumDocFreq) by returning the 
> number of terms of the query? Another idea is to multiply the number of terms 
> by the average postings length, though this could be dangerous if the field 
> has a zipfian distribution and some terms have a much higher doc frequency 
> than the average.
> [~romseygeek] and I were discussing this a few weeks ago, and more recently 
> [~mikemccand] and [~gsmiller] again independently. So it looks like there is 
> interest in this. Here is an email thread where this was recently discussed: 
> https://lists.apache.org/thread.html/re3b20a486c9a4e66b2ca4a2646e2d3be48535a90cdd95911a8445183%40%3Cdev.lucene.apache.org%3E.
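
The two cost heuristics floated in the description above can be sketched as plain arithmetic. This is only an illustration under stated assumptions: {{TermInSetCostSketch}} and {{estimateCost}} are hypothetical names, the inputs are assumed to be field-level statistics (unique term count and total postings count) plus the number of query terms, and nothing here is the cost logic Lucene actually ships.

```java
// Hypothetical sketch of the two cost heuristics described above; not Lucene code.
class TermInSetCostSketch {

  // termsSize: number of unique terms in the field.
  // sumDocFreq: total number of postings (sum of doc frequencies) in the field.
  // queryTermCount: number of terms in the term-in-set query.
  static long estimateCost(long termsSize, long sumDocFreq, long queryTermCount) {
    if (termsSize <= 0 || queryTermCount <= 0) {
      return 0; // nothing indexed or nothing to look up
    }
    if (termsSize == sumDocFreq) {
      // Primary-key-like field: each query term matches at most one doc.
      return queryTermCount;
    }
    // Otherwise scale by the average postings length (rounded up). As the
    // description warns, this can badly underestimate on skewed (zipfian) fields.
    long avgPostingsLength = (sumDocFreq + termsSize - 1) / termsSize;
    return queryTermCount * avgPostingsLength;
  }
}
```

Neither branch touches the terms dictionary, which is the point of the proposal: the estimate has to be cheap enough to compute before choosing an execution mode.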






[jira] [Resolved] (LUCENE-10668) Should we deprecate/remove DocValuesTermsQuery in sandbox?

2022-07-29 Thread Greg Miller (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Miller resolved LUCENE-10668.
--
Resolution: Won't Fix

> Should we deprecate/remove DocValuesTermsQuery in sandbox?
> --
>
> Key: LUCENE-10668
> URL: https://issues.apache.org/jira/browse/LUCENE-10668
> Project: Lucene - Core
>  Issue Type: Task
>  Components: modules/sandbox
>Reporter: Greg Miller
>Priority: Minor
>
> I came across the sandbox {{DocValuesTermsQuery}} and it sure looks a lot 
> like {{TermInSetQuery}}. I wonder if we ought to deprecate and remove it? Any 
> reason to keep this around?






[jira] [Commented] (LUCENE-10207) Make TermInSetQuery usable with IndexOrDocValuesQuery

2022-07-29 Thread Greg Miller (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17573027#comment-17573027
 ] 

Greg Miller commented on LUCENE-10207:
--

I also just came across {{DocValuesTermsQuery}} in the sandbox module. Once we 
see this work through (adding doc value rewrite support to TermInSet), we can 
deprecate/remove this.

> Make TermInSetQuery usable with IndexOrDocValuesQuery
> -
>
> Key: LUCENE-10207
> URL: https://issues.apache.org/jira/browse/LUCENE-10207
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Assignee: Greg Miller
>Priority: Minor
> Attachments: LUCENE-10207_multitermquery.patch
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> IndexOrDocValuesQuery is very useful to pick the right execution mode for a 
> query depending on other bits of the query tree.
> We would like to be able to use it to optimize execution of TermInSetQuery. 
> However IndexOrDocValuesQuery only works well if the "index" query can give 
> an estimation of the cost of the query without doing anything expensive (like 
> looking up all terms of the TermInSetQuery in the terms dict). Maybe we could 
> implement it for primary keys (terms.size() == sumDocFreq) by returning the 
> number of terms of the query? Another idea is to multiply the number of terms 
> by the average postings length, though this could be dangerous if the field 
> has a zipfian distribution and some terms have a much higher doc frequency 
> than the average.
> [~romseygeek] and I were discussing this a few weeks ago, and more recently 
> [~mikemccand] and [~gsmiller] again independently. So it looks like there is 
> interest in this. Here is an email thread where this was recently discussed: 
> https://lists.apache.org/thread.html/re3b20a486c9a4e66b2ca4a2646e2d3be48535a90cdd95911a8445183%40%3Cdev.lucene.apache.org%3E.






[jira] [Commented] (LUCENE-10668) Should we deprecate/remove DocValuesTermsQuery in sandbox?

2022-07-29 Thread Greg Miller (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17573026#comment-17573026
 ] 

Greg Miller commented on LUCENE-10668:
--

Ha, oops... right [~jpountz]. I've been working on 
[LUCENE-10207|https://issues.apache.org/jira/browse/LUCENE-10207] again and had 
the new doc value based implementation in the brain. I'll just mention over in 
LUCENE-10207 that we can deprecate {{DocValuesTermsQuery}} when we see that 
work through. My mistake.

> Should we deprecate/remove DocValuesTermsQuery in sandbox?
> --
>
> Key: LUCENE-10668
> URL: https://issues.apache.org/jira/browse/LUCENE-10668
> Project: Lucene - Core
>  Issue Type: Task
>  Components: modules/sandbox
>Reporter: Greg Miller
>Priority: Minor
>
> I came across the sandbox {{DocValuesTermsQuery}} and it sure looks a lot 
> like {{TermInSetQuery}}. I wonder if we ought to deprecate and remove it? Any 
> reason to keep this around?






[jira] [Created] (LUCENE-10668) Should we deprecate/remove DocValuesTermsQuery in sandbox?

2022-07-28 Thread Greg Miller (Jira)
Greg Miller created LUCENE-10668:


 Summary: Should we deprecate/remove DocValuesTermsQuery in sandbox?
 Key: LUCENE-10668
 URL: https://issues.apache.org/jira/browse/LUCENE-10668
 Project: Lucene - Core
  Issue Type: Task
  Components: modules/sandbox
Reporter: Greg Miller


I came across the sandbox {{DocValuesTermsQuery}} and it sure looks a lot like 
{{TermInSetQuery}}. I wonder if we ought to deprecate and remove it? Any reason 
to keep this around?






[jira] [Commented] (LUCENE-10659) Fix random TestDisiPriorityQueue bug

2022-07-23 Thread Greg Miller (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17570315#comment-17570315
 ] 

Greg Miller commented on LUCENE-10659:
--

Patched this additional fix in as well. Hopefully this test is good to go now. 
I'll keep an eye on it.

> Fix random TestDisiPriorityQueue bug
> 
>
> Key: LUCENE-10659
> URL: https://issues.apache.org/jira/browse/LUCENE-10659
> Project: Lucene - Core
>  Issue Type: Bug
>Affects Versions: 9.3
>Reporter: Greg Miller
>Assignee: Greg Miller
>Priority: Blocker
> Fix For: 9.3
>
>
> A recently added test ({{TestDisiPriorityQueue}}) has a bug that can randomly 
> trip (my fault). I fixed this on {{main}} and {{branch_9x}}, but I think we 
> should roll it into the 9.3 release. I'll prepare a PR, but raising it here 
> for visibility.






[jira] [Resolved] (LUCENE-10659) Fix random TestDisiPriorityQueue bug

2022-07-23 Thread Greg Miller (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Miller resolved LUCENE-10659.
--
Resolution: Fixed

> Fix random TestDisiPriorityQueue bug
> 
>
> Key: LUCENE-10659
> URL: https://issues.apache.org/jira/browse/LUCENE-10659
> Project: Lucene - Core
>  Issue Type: Bug
>Affects Versions: 9.3
>Reporter: Greg Miller
>Assignee: Greg Miller
>Priority: Blocker
> Fix For: 9.3
>
>
> A recently added test ({{TestDisiPriorityQueue}}) has a bug that can randomly 
> trip (my fault). I fixed this on {{main}} and {{branch_9x}}, but I think we 
> should roll it into the 9.3 release. I'll prepare a PR, but raising it here 
> for visibility.






[jira] [Commented] (LUCENE-10659) Fix random TestDisiPriorityQueue bug

2022-07-22 Thread Greg Miller (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17570209#comment-17570209
 ] 

Greg Miller commented on LUCENE-10659:
--

Another fix here: https://github.com/apache/lucene/pull/1044

> Fix random TestDisiPriorityQueue bug
> 
>
> Key: LUCENE-10659
> URL: https://issues.apache.org/jira/browse/LUCENE-10659
> Project: Lucene - Core
>  Issue Type: Bug
>Affects Versions: 9.3
>Reporter: Greg Miller
>Assignee: Greg Miller
>Priority: Blocker
> Fix For: 9.3
>
>
> A recently added test ({{TestDisiPriorityQueue}}) has a bug that can randomly 
> trip (my fault). I fixed this on {{main}} and {{branch_9x}}, but I think we 
> should roll it into the 9.3 release. I'll prepare a PR, but raising it here 
> for visibility.






[jira] [Reopened] (LUCENE-10659) Fix random TestDisiPriorityQueue bug

2022-07-22 Thread Greg Miller (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Miller reopened LUCENE-10659:
--
  Assignee: Greg Miller

There's still an issue with the test. Tripped it again last night. Working on a 
fix now. Let's block 9.3 until this fix is in. PR will be up shortly.

> Fix random TestDisiPriorityQueue bug
> 
>
> Key: LUCENE-10659
> URL: https://issues.apache.org/jira/browse/LUCENE-10659
> Project: Lucene - Core
>  Issue Type: Bug
>Affects Versions: 9.3
>Reporter: Greg Miller
>Assignee: Greg Miller
>Priority: Blocker
> Fix For: 9.3
>
>
> A recently added test ({{TestDisiPriorityQueue}}) has a bug that can randomly 
> trip (my fault). I fixed this on {{main}} and {{branch_9x}}, but I think we 
> should roll it into the 9.3 release. I'll prepare a PR, but raising it here 
> for visibility.






[jira] [Resolved] (LUCENE-10659) Fix random TestDisiPriorityQueue bug

2022-07-21 Thread Greg Miller (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Miller resolved LUCENE-10659.
--
Fix Version/s: 9.3
   Resolution: Fixed

> Fix random TestDisiPriorityQueue bug
> 
>
> Key: LUCENE-10659
> URL: https://issues.apache.org/jira/browse/LUCENE-10659
> Project: Lucene - Core
>  Issue Type: Bug
>Affects Versions: 9.3
>Reporter: Greg Miller
>Priority: Blocker
> Fix For: 9.3
>
>
> A recently added test ({{TestDisiPriorityQueue}}) has a bug that can randomly 
> trip (my fault). I fixed this on {{main}} and {{branch_9x}}, but I think we 
> should roll it into the 9.3 release. I'll prepare a PR, but raising it here 
> for visibility.






[jira] [Updated] (LUCENE-10659) Fix random TestDisiPriorityQueue bug

2022-07-20 Thread Greg Miller (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Miller updated LUCENE-10659:
-
Priority: Blocker  (was: Minor)

> Fix random TestDisiPriorityQueue bug
> 
>
> Key: LUCENE-10659
> URL: https://issues.apache.org/jira/browse/LUCENE-10659
> Project: Lucene - Core
>  Issue Type: Bug
>Affects Versions: 9.3
>Reporter: Greg Miller
>Priority: Blocker
>
> A recently added test ({{TestDisiPriorityQueue}}) has a bug that can randomly 
> trip (my fault). I fixed this on {{main}} and {{branch_9x}}, but I think we 
> should roll it into the 9.3 release. I'll prepare a PR, but raising it here 
> for visibility.






[jira] [Commented] (LUCENE-10659) Fix random TestDisiPriorityQueue bug

2022-07-20 Thread Greg Miller (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17569184#comment-17569184
 ] 

Greg Miller commented on LUCENE-10659:
--

PR for pulling this fix into 9.3: https://github.com/apache/lucene/pull/1038

> Fix random TestDisiPriorityQueue bug
> 
>
> Key: LUCENE-10659
> URL: https://issues.apache.org/jira/browse/LUCENE-10659
> Project: Lucene - Core
>  Issue Type: Bug
>Affects Versions: 9.3
>Reporter: Greg Miller
>Priority: Minor
>
> A recently added test ({{TestDisiPriorityQueue}}) has a bug that can randomly 
> trip (my fault). I fixed this on {{main}} and {{branch_9x}}, but I think we 
> should roll it into the 9.3 release. I'll prepare a PR, but raising it here 
> for visibility.






[jira] [Created] (LUCENE-10659) Fix random TestDisiPriorityQueue bug

2022-07-20 Thread Greg Miller (Jira)
Greg Miller created LUCENE-10659:


 Summary: Fix random TestDisiPriorityQueue bug
 Key: LUCENE-10659
 URL: https://issues.apache.org/jira/browse/LUCENE-10659
 Project: Lucene - Core
  Issue Type: Bug
Affects Versions: 9.3
Reporter: Greg Miller


A recently added test ({{TestDisiPriorityQueue}}) has a bug that can randomly 
trip (my fault). I fixed this on {{main}} and {{branch_9x}}, but I think we 
should roll it into the 9.3 release. I'll prepare a PR, but raising it here for 
visibility.






[jira] [Resolved] (LUCENE-10653) Should BlockMaxMaxscoreScorer rebuild its heap in bulk?

2022-07-19 Thread Greg Miller (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Miller resolved LUCENE-10653.
--
Fix Version/s: 9.3
 Assignee: Greg Miller
   Resolution: Fixed

> Should BlockMaxMaxscoreScorer rebuild its heap in bulk?
> ---
>
> Key: LUCENE-10653
> URL: https://issues.apache.org/jira/browse/LUCENE-10653
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/search
>Reporter: Greg Miller
>Assignee: Greg Miller
>Priority: Minor
> Fix For: 9.3
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> BMMScorer has to frequently rebuild its heap, and does so by clearing and 
> then iteratively calling {{add}}. It would be more efficient to heapify 
> in bulk. This is more academic than anything right now though since BMMScorer 
> is only used with two-clause disjunctions, so it's sort of a silly 
> optimization if it's not supporting a greater number of clauses.
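
For context on why bulk heapify is cheaper: inserting n elements one at a time costs O(n log n) in the worst case, while bottom-up (Floyd) heap construction is O(n). The sketch below shows the bottom-up technique on a plain {{int[]}} min-heap. {{BulkHeapSketch}} and its methods are hypothetical and are not the scorer's or the priority queue's actual code.

```java
// Hypothetical illustration of bulk heap construction; not the BMMScorer code.
class BulkHeapSketch {

  // Rearranges the array in place into a binary min-heap in O(n) time by
  // sifting down every non-leaf node, starting from the last one.
  static void heapify(int[] heap) {
    for (int i = heap.length / 2 - 1; i >= 0; i--) {
      siftDown(heap, i);
    }
  }

  // Moves the value at index i down until neither child is smaller than it.
  private static void siftDown(int[] heap, int i) {
    int n = heap.length;
    while (true) {
      int left = 2 * i + 1;
      int right = left + 1;
      int smallest = i;
      if (left < n && heap[left] < heap[smallest]) smallest = left;
      if (right < n && heap[right] < heap[smallest]) smallest = right;
      if (smallest == i) return;
      int tmp = heap[i];
      heap[i] = heap[smallest];
      heap[smallest] = tmp;
      i = smallest;
    }
  }

  // Checks the min-heap invariant: every parent is <= each of its children.
  static boolean isMinHeap(int[] heap) {
    for (int i = 1; i < heap.length; i++) {
      if (heap[(i - 1) / 2] > heap[i]) return false;
    }
    return true;
  }
}
```

With only two clauses the difference is negligible, which matches the issue's own caveat that the optimization matters mainly once more clauses are supported.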






[jira] [Commented] (LUCENE-10633) Dynamic pruning for queries sorted by SORTED(_SET) field

2022-07-19 Thread Greg Miller (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17568587#comment-17568587
 ] 

Greg Miller commented on LUCENE-10633:
--

{quote}It also relates to [~gsmiller] 's work about running term-in-set queries 
using doc values, which would only help if doc values are enabled on the field.
{quote}
Which is actually perfect timing as I've just come back to working on this 
(LUCENE-10207) after setting it aside for a while. Thanks for making this 
change to {{luceneutil}}!

> Dynamic pruning for queries sorted by SORTED(_SET) field
> 
>
> Key: LUCENE-10633
> URL: https://issues.apache.org/jira/browse/LUCENE-10633
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> LUCENE-9280 introduced the ability to dynamically prune non-competitive hits 
> when sorting by a numeric field, by leveraging the points index to skip 
> documents that do not compare better than the top of the priority queue 
> maintained by the field comparator.
> However queries sorted by a SORTED(_SET) field still look at all hits, which 
> is disappointing. Could we leverage the terms index to skip hits?






[jira] [Commented] (LUCENE-10603) Improve iteration of ords for SortedSetDocValues

2022-07-13 Thread Greg Miller (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17566615#comment-17566615
 ] 

Greg Miller commented on LUCENE-10603:
--

Shouldn't be any more to do on this now. Resolving. FWIW, I ran benchmarks 
on {{wikimediumall}} and didn't see any significant changes. Thought we might see 
a small improvement for SSDV heavy faceting, but nothing showed up.

> Improve iteration of ords for SortedSetDocValues
> 
>
> Key: LUCENE-10603
> URL: https://issues.apache.org/jira/browse/LUCENE-10603
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Lu Xugang
>Assignee: Lu Xugang
>Priority: Trivial
>  Time Spent: 6h
>  Remaining Estimate: 0h
>
> After SortedSetDocValues#docValueCount was added in Lucene 9.2, should we 
> refactor the implementation of ords iteration to use docValueCount instead of 
> NO_MORE_ORDS?
> This would be similar to what SortedNumericDocValues does.
> From 
> {code:java}
> for (long ord = values.nextOrd(); ord != SortedSetDocValues.NO_MORE_ORDS; ord = values.nextOrd()) {
> }{code}
> to
> {code:java}
> for (int i = 0; i < values.docValueCount(); i++) {
>   long ord = values.nextOrd();
> }{code}






[jira] [Resolved] (LUCENE-10603) Improve iteration of ords for SortedSetDocValues

2022-07-13 Thread Greg Miller (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Miller resolved LUCENE-10603.
--
Resolution: Fixed

> Improve iteration of ords for SortedSetDocValues
> 
>
> Key: LUCENE-10603
> URL: https://issues.apache.org/jira/browse/LUCENE-10603
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Lu Xugang
>Assignee: Lu Xugang
>Priority: Trivial
>  Time Spent: 6h
>  Remaining Estimate: 0h
>
> After SortedSetDocValues#docValueCount was added in Lucene 9.2, should we 
> refactor the implementation of ords iteration to use docValueCount instead of 
> NO_MORE_ORDS?
> This would be similar to what SortedNumericDocValues does.
> From 
> {code:java}
> for (long ord = values.nextOrd(); ord != SortedSetDocValues.NO_MORE_ORDS; ord = values.nextOrd()) {
> }{code}
> to
> {code:java}
> for (int i = 0; i < values.docValueCount(); i++) {
>   long ord = values.nextOrd();
> }{code}






[jira] [Updated] (LUCENE-10632) Change getAllChildren to return all children regardless of the count

2022-07-13 Thread Greg Miller (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Miller updated LUCENE-10632:
-
Component/s: modules/facet

> Change getAllChildren to return all children regardless of the count
> 
>
> Key: LUCENE-10632
> URL: https://issues.apache.org/jira/browse/LUCENE-10632
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/facet
>Reporter: Yuting Gan
>Priority: Minor
>
> Currently, the getAllChildren functionality is implemented in a way that is 
> similar to getTopChildren, where they only return children with count that is 
> greater than zero.
> However, the original getTopChildren in RangeFacetCounts returned all children 
> whether-or-not the count was zero. This actually has good use cases and we 
> should continue supporting the feature in getAllChildren, so that we will not 
> lose it after properly supporting getTopChildren in RangeFacetCounts.
> As discussed with [~gsmiller] in the [LUCENE-10614 
> pr|https://github.com/apache/lucene/pull/974], allowing getAllChildren to 
> behave differently from getTopChildren can actually be more helpful for 
> users. If users want to get children with only positive count, we have 
> getTopChildren supporting this behavior already. Therefore, the 
> getAllChildren API should provide all children in all of the implementations, 
> whether-or-not the count is zero.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10632) Change getAllChildren to return all children regardless of the count

2022-07-13 Thread Greg Miller (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17566560#comment-17566560
 ] 

Greg Miller commented on LUCENE-10632:
--

Bringing an offline conversation about this issue here for transparency 
and future discovery. While I think it would be ideal if 
{{getAllChildren}} could return _all_ children, regardless of the 
count, it's not really practical in most of our {{Facets}} implementations 
since they only "see" children that exist in the docs they're counting. So if 
they're counting from a {{FacetsCollector}}, and those hits don't contain 
some of the possible child values for a given dimension, it's quite hard for 
{{getAllChildren}} to know about them.

So for now, I think it's reasonable that range facet counting behaves a little 
differently from the rest and actually returns all the ranges it was asked 
about, regardless of count. This is consistent with the behavior of 
{{getSpecificValue}}: in both use cases the user provides the value(s) they 
care about. But this does create a small 
inconsistency in the behavior of {{getAllChildren}} generally.
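
To make the distinction concrete, here is a rough sketch of the two behaviors for range facets, using a map from range label to count. {{RangeChildrenSketch}}, {{allChildren}} and {{topChildren}} are hypothetical names and simplified signatures, not the {{Facets}} API; the tie-breaking rule in particular is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the two behaviors for range facets; not the Lucene API.
class RangeChildrenSketch {

  // Returns every requested range, including ranges whose count is zero.
  // This sketch happens to keep the user-supplied order, but callers
  // should not rely on any particular ordering.
  static List<Map.Entry<String, Integer>> allChildren(LinkedHashMap<String, Integer> counts) {
    return new ArrayList<>(counts.entrySet());
  }

  // Returns at most topN ranges with a positive count, highest count first.
  static List<Map.Entry<String, Integer>> topChildren(
      LinkedHashMap<String, Integer> counts, int topN) {
    List<Map.Entry<String, Integer>> result = new ArrayList<>();
    for (Map.Entry<String, Integer> e : counts.entrySet()) {
      if (e.getValue() > 0) result.add(e);
    }
    // List.sort is stable, so ties keep the user-supplied order (an assumption here).
    result.sort((a, b) -> Integer.compare(b.getValue(), a.getValue()));
    return result.subList(0, Math.min(topN, result.size()));
  }
}
```

Returning zero-count ranges is only feasible because the user supplied the ranges up front; implementations that discover children from the collected hits have no equivalent list to fall back on.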

> Change getAllChildren to return all children regardless of the count
> 
>
> Key: LUCENE-10632
> URL: https://issues.apache.org/jira/browse/LUCENE-10632
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Yuting Gan
>Priority: Minor
>
> Currently, the getAllChildren functionality is implemented in a way that is 
> similar to getTopChildren, where they only return children with count that is 
> greater than zero.
> However, the original getTopChildren in RangeFacetCounts returned all children 
> whether-or-not the count was zero. This actually has good use cases and we 
> should continue supporting the feature in getAllChildren, so that we will not 
> lose it after properly supporting getTopChildren in RangeFacetCounts.
> As discussed with [~gsmiller] in the [LUCENE-10614 
> pr|https://github.com/apache/lucene/pull/974], allowing getAllChildren to 
> behave differently from getTopChildren can actually be more helpful for 
> users. If users want to get children with only positive count, we have 
> getTopChildren supporting this behavior already. Therefore, the 
> getAllChildren API should provide all children in all of the implementations, 
> whether-or-not the count is zero.






[jira] [Commented] (LUCENE-10653) Should BlockMaxMaxscoreScorer rebuild its heap in bulk?

2022-07-11 Thread Greg Miller (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17565195#comment-17565195
 ] 

Greg Miller commented on LUCENE-10653:
--

Here's essentially what I'm thinking: 
https://github.com/gsmiller/lucene/commit/597a760d6c0b0524ba1d72c290689e4dc4b4b9e9

> Should BlockMaxMaxscoreScorer rebuild its heap in bulk?
> ---
>
> Key: LUCENE-10653
> URL: https://issues.apache.org/jira/browse/LUCENE-10653
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/search
>Reporter: Greg Miller
>Priority: Minor
>
> BMMScorer has to frequently rebuild its heap, and does so by clearing and 
> then iteratively calling {{add}}. It would be more efficient to heapify 
> in bulk. This is more academic than anything right now though since BMMScorer 
> is only used with two-clause disjunctions, so it's sort of a silly 
> optimization if it's not supporting a greater number of clauses.






[jira] [Created] (LUCENE-10653) Should BlockMaxMaxscoreScorer rebuild its heap in bulk?

2022-07-11 Thread Greg Miller (Jira)
Greg Miller created LUCENE-10653:


 Summary: Should BlockMaxMaxscoreScorer rebuild its heap in bulk?
 Key: LUCENE-10653
 URL: https://issues.apache.org/jira/browse/LUCENE-10653
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/search
Reporter: Greg Miller


BMMScorer has to frequently rebuild its heap, and does so by clearing and then 
iteratively calling {{add}}. It would be more efficient to heapify in bulk. 
This is more academic than anything right now though since BMMScorer is only 
used with two-clause disjunctions, so it's sort of a silly optimization if it's 
not supporting a greater number of clauses.






[jira] [Resolved] (LUCENE-10614) Properly support getTopChildren in RangeFacetCounts

2022-07-11 Thread Greg Miller (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Miller resolved LUCENE-10614.
--
Fix Version/s: 10.0 (main)
   Resolution: Fixed

> Properly support getTopChildren in RangeFacetCounts
> ---
>
> Key: LUCENE-10614
> URL: https://issues.apache.org/jira/browse/LUCENE-10614
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/facet
>Affects Versions: 10.0 (main)
>Reporter: Greg Miller
>Priority: Minor
> Fix For: 10.0 (main)
>
>  Time Spent: 4h 10m
>  Remaining Estimate: 0h
>
> As mentioned in LUCENE-10538, {{RangeFacetCounts}} is not implementing 
> {{getTopChildren}}. Instead of returning "top" ranges, it returns all 
> user-provided ranges in the order the user specified them when instantiating. 
> This is probably more useful functionality, but it would be nice to support 
> {{getTopChildren}} as well.
> LUCENE-10550 is introducing the concept of {{getAllChildren}}, so once that 
> lands, we can replace the current implementation of {{getTopChildren}} with 
> an actual "top children" implementation and direct users to 
> {{getAllChildren}} if they want to maintain the current behavior.






[jira] [Commented] (LUCENE-10614) Properly support getTopChildren in RangeFacetCounts

2022-07-11 Thread Greg Miller (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17565084#comment-17565084
 ] 

Greg Miller commented on LUCENE-10614:
--

Thanks again [~yutinggan] !

> Properly support getTopChildren in RangeFacetCounts
> ---
>
> Key: LUCENE-10614
> URL: https://issues.apache.org/jira/browse/LUCENE-10614
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/facet
>Affects Versions: 10.0 (main)
>Reporter: Greg Miller
>Priority: Minor
>  Time Spent: 4h 10m
>  Remaining Estimate: 0h
>
> As mentioned in LUCENE-10538, {{RangeFacetCounts}} is not implementing 
> {{getTopChildren}}. Instead of returning "top" ranges, it returns all 
> user-provided ranges in the order the user specified them when instantiating. 
> This is probably more useful functionality, but it would be nice to support 
> {{getTopChildren}} as well.
> LUCENE-10550 is introducing the concept of {{getAllChildren}}, so once that 
> lands, we can replace the current implementation of {{getTopChildren}} with 
> an actual "top children" implementation and direct users to 
> {{getAllChildren}} if they want to maintain the current behavior.






[jira] [Commented] (LUCENE-10614) Properly support getTopChildren in RangeFacetCounts

2022-07-11 Thread Greg Miller (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17565083#comment-17565083
 ] 

Greg Miller commented on LUCENE-10614:
--

Just merged this to {{{}main{}}}. I don't think we should backport this to 9.x 
since it is a functional change to an existing API. Because of this, I moved 
the CHANGES entry under 10.0 and added an entry to MIGRATE describing the 
difference and how to retain the 9.x functionality if desired.
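
As a rough illustration of the behavioral difference described here, the 
sketch below contrasts "all children in the order the user specified" with 
"top children ordered by count". It is a toy over label/count pairs with 
hypothetical names, not the {{RangeFacetCounts}} implementation, and the 
tie-break by label is an assumption.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration only; not the RangeFacetCounts implementation. */
class RangeChildrenSketch {

  /** Returns every range in the order the caller supplied them. */
  static List<Map.Entry<String, Integer>> allChildren(List<Map.Entry<String, Integer>> ranges) {
    return new ArrayList<>(ranges);
  }

  /** Returns at most topN ranges, highest count first (ties broken by label; an assumption). */
  static List<Map.Entry<String, Integer>> topChildren(
      int topN, List<Map.Entry<String, Integer>> ranges) {
    List<Map.Entry<String, Integer>> sorted = new ArrayList<>(ranges);
    Comparator<Map.Entry<String, Integer>> byCountDesc =
        Map.Entry.<String, Integer>comparingByValue().reversed();
    sorted.sort(byCountDesc.thenComparing(Map.Entry.<String, Integer>comparingByKey()));
    return new ArrayList<>(sorted.subList(0, Math.min(topN, sorted.size())));
  }
}
```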

> Properly support getTopChildren in RangeFacetCounts
> ---
>
> Key: LUCENE-10614
> URL: https://issues.apache.org/jira/browse/LUCENE-10614
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/facet
>Affects Versions: 10.0 (main)
>Reporter: Greg Miller
>Priority: Minor
>  Time Spent: 4h 10m
>  Remaining Estimate: 0h
>
> As mentioned in LUCENE-10538, {{RangeFacetCounts}} is not implementing 
> {{getTopChildren}}. Instead of returning "top" ranges, it returns all 
> user-provided ranges in the order the user specified them when instantiating. 
> This is probably more useful functionality, but it would be nice to support 
> {{getTopChildren}} as well.
> LUCENE-10550 is introducing the concept of {{getAllChildren}}, so once that 
> lands, we can replace the current implementation of {{getTopChildren}} with 
> an actual "top children" implementation and direct users to 
> {{getAllChildren}} if they want to maintain the current behavior.






[jira] [Commented] (LUCENE-10603) Improve iteration of ords for SortedSetDocValues

2022-07-07 Thread Greg Miller (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17563912#comment-17563912
 ] 

Greg Miller commented on LUCENE-10603:
--

It looks like the only remaining work is to:
 # Remove the NO_MORE_ORDS definition
 # Update all the SortedSetDocValues implementations to stop returning 
NO_MORE_ORDS in nextOrd()
 # Remove all the test assertions that validate that SSDV#nextOrd() returns 
NO_MORE_ORDS

This should all be main branch work, and not something we backport to 9.x. I 
think 9.x is now good.

> Improve iteration of ords for SortedSetDocValues
> 
>
> Key: LUCENE-10603
> URL: https://issues.apache.org/jira/browse/LUCENE-10603
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Lu Xugang
>Assignee: Lu Xugang
>Priority: Trivial
>  Time Spent: 5h 20m
>  Remaining Estimate: 0h
>
> Now that SortedSetDocValues#docValueCount has been added (since Lucene 9.2), 
> should we refactor the implementation of ords iteration to use docValueCount 
> instead of NO_MORE_ORDS?
> Similar to how SortedNumericDocValues did.
> From 
> {code:java}
> for (long ord = values.nextOrd(); ord != SortedSetDocValues.NO_MORE_ORDS; ord = values.nextOrd()) {
> }{code}
> to
> {code:java}
> for (int i = 0; i < values.docValueCount(); i++) {
>   long ord = values.nextOrd();
> }{code}






[jira] [Created] (LUCENE-10644) Facets#getAllChildren testing should ignore child order

2022-07-06 Thread Greg Miller (Jira)
Greg Miller created LUCENE-10644:


 Summary: Facets#getAllChildren testing should ignore child order
 Key: LUCENE-10644
 URL: https://issues.apache.org/jira/browse/LUCENE-10644
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/facet
Reporter: Greg Miller
 Attachments: failing tests.png

Our javadoc for {{Facets#getAllChildren}} explicitly calls out that callers 
should make no assumptions about child ordering, but a number of our own unit 
tests turn around and make that assumption. I ran into this when recently 
trying an optimization that would result in a different child ordering for 
{{{}getAllChildren{}}}, and found a number of unit tests that started failing. 
I'll upload a list of what I found failing.
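
To make the idea concrete, here is a minimal sketch of an order-insensitive 
comparison of children: both sides are sorted by label before being compared. 
It uses plain label/value map entries as a stand-in for {{LabelAndValue}}, and 
the class and method names are hypothetical rather than an existing test 
utility.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration only; label/value entries stand in for LabelAndValue. */
class UnorderedChildrenSketch {

  /** Returns a copy of the children sorted by label so that input order no longer matters. */
  static List<Map.Entry<String, Integer>> sortedByLabel(
      List<Map.Entry<String, Integer>> children) {
    List<Map.Entry<String, Integer>> copy = new ArrayList<>(children);
    copy.sort(Map.Entry.<String, Integer>comparingByKey());
    return copy;
  }

  /** True if both lists hold the same label/value pairs, regardless of order. */
  static boolean sameChildrenIgnoringOrder(
      List<Map.Entry<String, Integer>> expected, List<Map.Entry<String, Integer>> actual) {
    return sortedByLabel(expected).equals(sortedByLabel(actual));
  }
}
```

A test could then assert on {{sameChildrenIgnoringOrder(expected, actual)}} 
instead of comparing the child arrays position by position.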






[jira] [Updated] (LUCENE-10644) Facets#getAllChildren testing should ignore child order

2022-07-06 Thread Greg Miller (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Miller updated LUCENE-10644:
-
Attachment: failing tests.png

> Facets#getAllChildren testing should ignore child order
> ---
>
> Key: LUCENE-10644
> URL: https://issues.apache.org/jira/browse/LUCENE-10644
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/facet
>Reporter: Greg Miller
>Priority: Minor
> Attachments: failing tests.png
>
>
> Our javadoc for {{Facets#getAllChildren}} explicitly calls out that callers 
> should make no assumptions about child ordering, but a number of our own unit 
> tests turn around and make that assumption. I ran into this when recently 
> trying an optimization that would result in a different child ordering for 
> {{{}getAllChildren{}}}, and found a number of unit tests that started 
> failing. I'll upload a list of what I found failing.






[jira] [Commented] (LUCENE-10603) Improve iteration of ords for SortedSetDocValues

2022-07-05 Thread Greg Miller (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17562832#comment-17562832
 ] 

Greg Miller commented on LUCENE-10603:
--

Thanks [~stefanvodita] for jumping in as well to help! I left a little feedback 
on the PR. Thanks again!

> Improve iteration of ords for SortedSetDocValues
> 
>
> Key: LUCENE-10603
> URL: https://issues.apache.org/jira/browse/LUCENE-10603
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Lu Xugang
>Assignee: Lu Xugang
>Priority: Trivial
>  Time Spent: 4h
>  Remaining Estimate: 0h
>
> Now that SortedSetDocValues#docValueCount has been added (since Lucene 9.2), 
> should we refactor the implementation of ords iteration to use docValueCount 
> instead of NO_MORE_ORDS?
> Similar to how SortedNumericDocValues did.
> From 
> {code:java}
> for (long ord = values.nextOrd(); ord != SortedSetDocValues.NO_MORE_ORDS; ord = values.nextOrd()) {
> }{code}
> to
> {code:java}
> for (int i = 0; i < values.docValueCount(); i++) {
>   long ord = values.nextOrd();
> }{code}






[jira] [Commented] (LUCENE-10639) WANDScorer performs better without two-phase

2022-07-05 Thread Greg Miller (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17562786#comment-17562786
 ] 

Greg Miller commented on LUCENE-10639:
--

As a quick update, I ran benchmarks with just [livedoc checking broken 
out|https://github.com/gsmiller/lucene/commit/f4e9614a299523b57c854a3bd3371253f0a7fb17]
 in {{DefaultBulkScorer}}. Surprisingly, I didn't see any difference. So maybe 
something else is going on here?

Note that I ran this with {{wikimedium10m}} instead of {{all}} to get a 
datapoint a bit quicker:

{code:java}
Task                         QPS baseline   StdDev QPS candidate   StdDev Pct diff                 p-value
Prefix3                            118.98  (10.2%)        114.60   (9.9%)    -3.7% ( -21% -   18%)   0.247
Wildcard                            40.69   (6.9%)         39.62   (7.2%)    -2.6% ( -15% -   12%)   0.236
TermDTSort                          17.76  (20.4%)         17.33  (14.2%)    -2.4% ( -30% -   40%)   0.663
OrNotHighHigh                      881.01   (4.4%)        861.34   (3.9%)    -2.2% ( -10% -    6%)   0.089
AndHighHigh                          8.87   (5.0%)          8.70   (6.2%)    -1.8% ( -12% -    9%)   0.296
MedTerm                           1771.40   (4.2%)       1740.50   (4.4%)    -1.7% (  -9% -    7%)   0.198
AndHighMed                          30.59   (4.0%)         30.06   (5.6%)    -1.7% ( -10% -    8%)   0.267
OrHighNotLow                       782.90   (4.8%)        769.92   (5.1%)    -1.7% ( -11% -    8%)   0.291
HighPhrase                         392.18   (2.7%)        386.50   (2.7%)    -1.4% (  -6% -    4%)   0.087
OrHighNotHigh                      830.76   (4.3%)        818.83   (4.3%)    -1.4% (  -9% -    7%)   0.295
OrNotHighMed                       585.86   (2.6%)        578.07   (3.1%)    -1.3% (  -6% -    4%)   0.146
OrHighNotMed                       966.75   (3.6%)        956.07   (3.9%)    -1.1% (  -8% -    6%)   0.352
LowPhrase                          546.02   (2.1%)        540.42   (2.4%)    -1.0% (  -5% -    3%)   0.148
MedPhrase                           24.65   (2.3%)         24.40   (3.0%)    -1.0% (  -6% -    4%)   0.225
AndHighLow                         508.37   (3.7%)        503.84   (4.7%)    -0.9% (  -8% -    7%)   0.506
OrNotHighLow                       672.15   (2.7%)        666.29   (2.8%)    -0.9% (  -6% -    4%)   0.313
BrowseMonthTaxoFacets                8.92  (14.5%)          8.84  (13.9%)    -0.9% ( -25% -   32%)   0.846
AndHighMedDayTaxoFacets             39.14   (2.2%)         38.82   (2.2%)    -0.8% (  -5% -    3%)   0.241
AndHighHighDayTaxoFacets             8.01   (2.8%)          7.96   (2.8%)    -0.7% (  -6% -    4%)   0.416
LowSloppyPhrase                      5.83   (3.8%)          5.79   (3.8%)    -0.7% (  -8% -    7%)   0.556
OrHighLow                          128.01   (3.7%)        127.11   (3.8%)    -0.7% (  -7% -    7%)   0.554
HighTerm                          1190.03   (4.4%)       1183.10   (4.1%)    -0.6% (  -8% -    8%)   0.663
MedSloppyPhrase                     11.67   (2.1%)         11.61   (2.6%)    -0.5% (  -5% -    4%)   0.480
MedTermDayTaxoFacets                14.09   (3.1%)         14.03   (4.1%)    -0.5% (  -7% -    6%)   0.686
IntNRQ                             110.15   (2.3%)        109.69   (2.1%)    -0.4% (  -4% -    4%)   0.546
HighSloppyPhrase                     9.56   (4.5%)          9.53   (4.5%)    -0.4% (  -8% -    9%)   0.794
BrowseDateSSDVFacets                 0.85  (10.4%)          0.85  (10.8%)    -0.3% ( -19% -   23%)   0.939
Respell                             33.65   (1.7%)         33.58   (1.7%)    -0.2% (  -3% -    3%)   0.684
Fuzzy2                              74.16   (1.9%)         74.02   (1.7%)    -0.2% (  -3% -    3%)   0.740
LowTerm                           1522.48   (2.9%)       1520.76   (3.3%)    -0.1% (  -6% -    6%)   0.909
LowIntervalsOrdered                 12.75   (3.3%)         12.74   (3.3%)    -0.1% (  -6% -    6%)   0.915
HighIntervalsOrdered                 6.30   (4.2%)          6.31   (4.0%)     0.1% (  -7% -    8%)   0.923
BrowseRandomLabelSSDVFacets          2.57   (4.9%)          2.57   (4.9%)     0.1% (  -9% -   10%)   0.927
Fuzzy1                              57.11   (1.9%)         57.26   (1.7%)     0.2% (  -3% -    3%)   0.666
BrowseRandomLabelTaxoFacets          6.32   (9.3%)          6.34  (10.3%)     0.3% ( -17% -   21%)   0.911
LowSpanNear                         15.95   (2.9%)         16.01   (2.7%)     0.4% (  -5% -    6%)   0.680
MedIntervalsOrdered                  1.61   (5.8%)          1.62   (5.8%)     0.4% ( -10% -   12%)   0.834
HighSpanNear                         2.27   (4.2%)          2.28   (4.0%)     0.6% (  -7% -    9%)   0.636
   

[jira] [Commented] (LUCENE-10639) WANDScorer performs better without two-phase

2022-07-02 Thread Greg Miller (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17561772#comment-17561772
 ] 

Greg Miller commented on LUCENE-10639:
--

{quote}I suspected there was some overhead to two-phase iteration but not as 
much as this.
{quote}
Right. Yeah, I guess I was so surprised by the performance shift that I assumed 
there must be an interesting second-phase happening. But from what you're 
saying, it sounds like these {{OrHighLow/Med/High}} tasks aren't doing that. 
And that the performance change is purely some side-effect of running the two 
phases instead of doing all the checks in the first phase. I should have dug 
into what these tasks are doing.
{quote}Hotspot was not always able to optimize "if (liveDocs == null)" checks
{quote}
Interesting. Seems worth a shot.

 

Thanks for the quick thoughts!

> WANDScorer performs better without two-phase
> 
>
> Key: LUCENE-10639
> URL: https://issues.apache.org/jira/browse/LUCENE-10639
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/search
>Reporter: Greg Miller
>Priority: Major
>
> After looking at the recent improvement [~jpountz] made to WAND scoring in 
> LUCENE-10634, which does additional work during match confirmation to not 
> confirm a match whose score wouldn't be competitive, I wanted to see how 
> performance would shift if we squashed the two-phase iteration completely and 
> only returned true matches (that were also known to be competitive by score) 
> in the "approximation" phase. I was a bit surprised to find that luceneutil 
> benchmarks (run with {{{}wikimediumall{}}}) improve significantly on some 
> disjunction tasks and don't show significant regressions anywhere else.
> Note that I used LUCENE-10634 as a baseline, and built my candidate change on 
> top of that. The diff can be seen here: 
> [DIFF|https://github.com/gsmiller/lucene/compare/b2d46440998fe4a972e8cc8c948580111359ed0f..c5bab794c92dbc66e70f9389948c1bdfe9b45231]
> A simple conclusion here might be that we shouldn't do two-phase iteration in 
> WANDScorer, but I'm pretty sure that's not right. I wonder if what's really 
> going on is that we're under-estimating the cost of confirming a match? Right 
> now we just return the tail size as the cost. While the cost of confirming a 
> match is proportional to the tail size, the actual work involved can be quite 
> significant (having to advance tail iterators to new blocks and decompress 
> them). I wonder if the WAND second phase is being run too early on 
> approximate candidates, and if less-expensive, (and even possibly more 
> restrictive?), second phases could/should be running first?
> I'm raising this here as more of a curiosity to see if it sparks ideas on how 
> to move forward. Again, I'm not proposing we do away with two-phase 
> iteration, but it seems we might be able to improve things. Maybe I'll 
> explore changing the cost heuristic next. Also, maybe there's some different 
> benchmarking that would be useful here that I may not be familiar with?
> Benchmark results on wikimediumall:
> {code:java}
> Task                         QPS baseline   StdDev QPS candidate   StdDev Pct diff                 p-value
> HighTermTitleBDVSort                22.52  (18.9%)         21.66  (15.6%)    -3.8% ( -32% -   37%)   0.485
> Prefix3                              9.38   (9.2%)          9.09  (10.6%)    -3.1% ( -20% -   18%)   0.326
> HighTermMonthSort                   25.37  (16.0%)         24.87  (17.1%)    -2.0% ( -30% -   37%)   0.710
> MedTermDayTaxoFacets                 9.62   (4.2%)          9.51   (4.1%)    -1.2% (  -9% -    7%)   0.368
> TermDTSort                          74.69  (18.0%)         74.13  (18.2%)    -0.7% ( -31% -   43%)   0.897
> HighTermDayOfYearSort               52.64  (16.1%)         52.32  (15.4%)    -0.6% ( -27% -   36%)   0.903
> BrowseMonthTaxoFacets                8.64  (19.1%)          8.59  (19.8%)    -0.6% ( -33% -   47%)   0.926
> BrowseDateSSDVFacets                 0.86   (9.5%)          0.86  (13.1%)    -0.4% ( -20% -   24%)   0.914
> PKLookup                           147.18   (3.9%)        146.66   (3.3%)    -0.3% (  -7% -    7%)   0.759
> BrowseDayOfYearSSDVFacets            3.47   (4.5%)          3.45   (4.8%)    -0.3% (  -9% -    9%)   0.822
> Wildcard                            36.36   (4.4%)         36.26   (5.2%)    -0.3% (  -9% -    9%)   0.866
> BrowseMonthSSDVFacets                4.15  (12.7%)          4.13  (12.8%)    -0.3% ( -22% -   28%)   0.950
> AndHighMedDayTaxoFacets             15.21   (2.7%)         15.18   (2.9%)    -0.2% (  -5% -    5%)   0.819
> Fuzzy1                              68.33   (1.8%)         68.22 

[jira] [Created] (LUCENE-10639) WANDScorer performs better without two-phase

2022-07-02 Thread Greg Miller (Jira)
Greg Miller created LUCENE-10639:


 Summary: WANDScorer performs better without two-phase
 Key: LUCENE-10639
 URL: https://issues.apache.org/jira/browse/LUCENE-10639
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/search
Reporter: Greg Miller


After looking at the recent improvement [~jpountz] made to WAND scoring in 
LUCENE-10634, which does additional work during match confirmation to not 
confirm a match whose score wouldn't be competitive, I wanted to see how 
performance would shift if we squashed the two-phase iteration completely and 
only returned true matches (that were also known to be competitive by score) in 
the "approximation" phase. I was a bit surprised to find that luceneutil 
benchmarks (run with {{{}wikimediumall{}}}) improve significantly on some 
disjunction tasks and don't show significant regressions anywhere else.

Note that I used LUCENE-10634 as a baseline, and built my candidate change on 
top of that. The diff can be seen here: 
[DIFF|https://github.com/gsmiller/lucene/compare/b2d46440998fe4a972e8cc8c948580111359ed0f..c5bab794c92dbc66e70f9389948c1bdfe9b45231]

A simple conclusion here might be that we shouldn't do two-phase iteration in 
WANDScorer, but I'm pretty sure that's not right. I wonder if what's really 
going on is that we're under-estimating the cost of confirming a match? Right 
now we just return the tail size as the cost. While the cost of confirming a 
match is proportional to the tail size, the actual work involved can be quite 
significant (having to advance tail iterators to new blocks and decompress 
them). I wonder if the WAND second phase is being run too early on approximate 
candidates, and if less-expensive, (and even possibly more restrictive?), 
second phases could/should be running first?

I'm raising this here as more of a curiosity to see if it sparks ideas on how 
to move forward. Again, I'm not proposing we do away with two-phase iteration, 
but it seems we might be able to improve things. Maybe I'll explore changing 
the cost heuristic next. Also, maybe there's some different benchmarking that 
would be useful here that I may not be familiar with?
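
As a toy illustration of why the cost estimate matters, the sketch below runs 
a set of second-phase checks on a candidate doc in ascending order of their 
estimated cost and stops at the first failure. If an expensive check reports 
too low a cost, it runs before cheaper checks that could have rejected the 
candidate first. All names here are hypothetical, and this is an assumption 
about the general technique rather than the WANDScorer or conjunction code.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.function.IntPredicate;

/** Hypothetical illustration only; not Lucene's two-phase iteration code. */
class MatchCostOrderingSketch {

  /** A second-phase check with an estimated (not measured) cost. */
  static final class Check {
    final String name;
    final float estimatedCost;
    final IntPredicate matches;

    Check(String name, float estimatedCost, IntPredicate matches) {
      this.name = name;
      this.estimatedCost = estimatedCost;
      this.matches = matches;
    }
  }

  /**
   * Confirms a candidate doc by running checks from lowest to highest estimated cost,
   * stopping at the first failure. Names of the checks that ran are appended to executed.
   */
  static boolean confirm(int doc, List<Check> checks, List<String> executed) {
    List<Check> ordered = new ArrayList<>(checks);
    ordered.sort(Comparator.comparingDouble(c -> c.estimatedCost));
    for (Check c : ordered) {
      executed.add(c.name);
      if (!c.matches.test(doc)) {
        return false;
      }
    }
    return true;
  }
}
```

In this toy, a check that is expensive in practice but reports a low estimate 
is always evaluated, while the same check with a higher estimate is skipped 
whenever a cheaper check rejects the doc.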

Benchmark results on wikimediumall:
{code:java}
Task                         QPS baseline   StdDev QPS candidate   StdDev Pct diff                 p-value
HighTermTitleBDVSort                22.52  (18.9%)         21.66  (15.6%)    -3.8% ( -32% -   37%)   0.485
Prefix3                              9.38   (9.2%)          9.09  (10.6%)    -3.1% ( -20% -   18%)   0.326
HighTermMonthSort                   25.37  (16.0%)         24.87  (17.1%)    -2.0% ( -30% -   37%)   0.710
MedTermDayTaxoFacets                 9.62   (4.2%)          9.51   (4.1%)    -1.2% (  -9% -    7%)   0.368
TermDTSort                          74.69  (18.0%)         74.13  (18.2%)    -0.7% ( -31% -   43%)   0.897
HighTermDayOfYearSort               52.64  (16.1%)         52.32  (15.4%)    -0.6% ( -27% -   36%)   0.903
BrowseMonthTaxoFacets                8.64  (19.1%)          8.59  (19.8%)    -0.6% ( -33% -   47%)   0.926
BrowseDateSSDVFacets                 0.86   (9.5%)          0.86  (13.1%)    -0.4% ( -20% -   24%)   0.914
PKLookup                           147.18   (3.9%)        146.66   (3.3%)    -0.3% (  -7% -    7%)   0.759
BrowseDayOfYearSSDVFacets            3.47   (4.5%)          3.45   (4.8%)    -0.3% (  -9% -    9%)   0.822
Wildcard                            36.36   (4.4%)         36.26   (5.2%)    -0.3% (  -9% -    9%)   0.866
BrowseMonthSSDVFacets                4.15  (12.7%)          4.13  (12.8%)    -0.3% ( -22% -   28%)   0.950
AndHighMedDayTaxoFacets             15.21   (2.7%)         15.18   (2.9%)    -0.2% (  -5% -    5%)   0.819
Fuzzy1                              68.33   (1.8%)         68.22   (2.0%)    -0.2% (  -3% -    3%)   0.783
OrHighMedDayTaxoFacets               2.90   (4.1%)          2.89   (4.0%)    -0.1% (  -7% -    8%)   0.930
MedPhrase                           52.81   (2.3%)         52.76   (1.8%)    -0.1% (  -4% -    4%)   0.878
Respell                             36.80   (1.9%)         36.78   (1.9%)    -0.1% (  -3% -    3%)   0.933
Fuzzy2                              63.06   (1.9%)         63.05   (2.1%)    -0.0% (  -3% -    4%)   0.971
LowPhrase                           74.60   (1.9%)         74.61   (1.8%)     0.0% (  -3% -    3%)   0.987
AndHighHighDayTaxoFacets             4.54   (2.3%)          4.55   (2.0%)     0.0% (  -4% -    4%)   0.960
HighPhrase                         353.13   (2.6%)        353.28   (2.5%)     0.0% (  -4% -    5%)   0.958
OrNotHighHigh                      761.72   (4.0%)        762.48   (3.6%)     0.1% (  -7% -    8%)   0.935
OrHighNotLow                      1129.94   (4.1%)       1131.56  

[jira] [Comment Edited] (LUCENE-10246) Support getting counts from "association" facets

2022-06-30 Thread Greg Miller (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17561222#comment-17561222
 ] 

Greg Miller edited comment on LUCENE-10246 at 7/1/22 12:27 AM:
---

[~shahrs87] I'd start by becoming familiar with the existing "association 
facet" implementations ({{TaxonomyFacetIntAssociations}} and 
{{TaxonomyFacetFloatAssociations}} as well as looking at some demo code like 
{{AssociationsFacetsExample}}). The API contract they implement represents 
results with {{FacetResult}}, which contains a list of {{LabelAndValue}} 
instances. {{LabelAndValue}} only models a single label along with a single 
numeric value. The value "usually" represents a total faceting count for a 
label in "non-association" facets, but with association faceting, value takes 
on an aggregated weight "associated" with the label.

The idea with this Jira is to be able to convey _both_ an aggregated weight and 
the count associated with a label. The best way to do that without creating a 
weird API for non-association cases is something that will probably take a 
little thought. Should we just put another "count" field in {{LabelAndValue}} 
and have both value and count be populated with a count for non-association 
cases? That sounds weird.

So beyond understanding what's currently there, I think the next step is to 
think about the right way to evolve the API that doesn't create a weird 
interaction for non-association faceting, especially since those are more 
commonly used.

Please reach out here as you have questions and I'll do my best to answer in a 
timely fashion. Thanks for having a look at this!


was (Author: gsmiller):
[~shahrs87] I'd start by becoming familiar with the existing "association 
facet" implementations ({{TaxonomyFacetIntAssociations}} and 
{{TaxonomyFacetFloatAssociations}} as well as looking at some demo code like 
{{AssociationsFacetsExample}}). The API contract they implement represent 
results with {{FacetResult}}, which contains a list of {{LabelAndValue}} 
instances. {{LabelAndValue}} only models a single label along with a single 
numeric value. The value "usually" represents a total faceting count for a 
label in "non-association" facets, but with association faceting, value takes 
on an aggregated weight "associated" with the label.

The idea with this Jira is to be able to convey _both_ an aggregated weight and 
the count associated with a label. The best way to do that without creating a 
weird API for non-association cases is something that will probably take a 
little thought. Should we just put another "count" field in {{LabelAndValue}} 
and have both value and count be populated with a count for non-assocation 
cases? That sounds weird.

So beyond understanding what's currently there, I think the next step is to 
think about the right way to evolve the API that doesn't create a weird 
interaction for non-association faceting, especially since those are more 
commonly used.

> Support getting counts from "association" facets
> 
>
> Key: LUCENE-10246
> URL: https://issues.apache.org/jira/browse/LUCENE-10246
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/facet
>Reporter: Greg Miller
>Priority: Minor
>
> We have these nice "association" facet implementations today that aggregate 
> "weights" from the docs that facet over, but they don't keep track of counts. 
> So the user can get "top-n" values for a dim by aggregated weight (great!), 
> but can't know how many docs matched each value. It would be nice to support 
> this so users could show the top-n values but _also_ show counts associated 
> with each.






[jira] [Commented] (LUCENE-10246) Support getting counts from "association" facets

2022-06-30 Thread Greg Miller (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17561222#comment-17561222
 ] 

Greg Miller commented on LUCENE-10246:
--

[~shahrs87] I'd start by becoming familiar with the existing "association 
facet" implementations ({{TaxonomyFacetIntAssociations}} and 
{{TaxonomyFacetFloatAssociations}} as well as looking at some demo code like 
{{AssociationsFacetsExample}}). The API contract they implement represents 
results with {{FacetResult}}, which contains a list of {{LabelAndValue}} 
instances. {{LabelAndValue}} only models a single label along with a single 
numeric value. The value "usually" represents a total faceting count for a 
label in "non-association" facets, but with association faceting, value takes 
on an aggregated weight "associated" with the label.

The idea with this Jira is to be able to convey _both_ an aggregated weight and 
the count associated with a label. The best way to do that without creating a 
weird API for non-association cases is something that will probably take a 
little thought. Should we just put another "count" field in {{LabelAndValue}} 
and have both value and count be populated with a count for non-association 
cases? That sounds weird.

So beyond understanding what's currently there, I think the next step is to 
think about the right way to evolve the API that doesn't create a weird 
interaction for non-association faceting, especially since those are more 
commonly used.
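
Purely to illustrate what "both an aggregated weight and the count" could mean, 
here is a small sketch that sums a weight per label while also counting how 
many entries contributed. The {{WeightAndCount}} type and {{aggregate}} method 
are hypothetical, summation is an assumed aggregation, and this is not a 
proposal for the actual {{LabelAndValue}} API shape.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical illustration only; not a proposed Lucene API. */
class WeightAndCountSketch {

  /** Per-label result carrying both numbers. */
  static final class WeightAndCount {
    float weight; // aggregated association weight (sum is an assumption)
    int count;    // number of (label, weight) entries that contributed
  }

  /** Aggregates parallel arrays of labels and weights, one entry per matching doc. */
  static Map<String, WeightAndCount> aggregate(String[] labels, float[] weights) {
    Map<String, WeightAndCount> out = new LinkedHashMap<>();
    for (int i = 0; i < labels.length; i++) {
      WeightAndCount agg = out.computeIfAbsent(labels[i], k -> new WeightAndCount());
      agg.weight += weights[i];
      agg.count++;
    }
    return out;
  }
}
```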

> Support getting counts from "association" facets
> 
>
> Key: LUCENE-10246
> URL: https://issues.apache.org/jira/browse/LUCENE-10246
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/facet
>Reporter: Greg Miller
>Priority: Minor
>
> We have these nice "association" facet implementations today that aggregate 
> "weights" from the docs that facet over, but they don't keep track of counts. 
> So the user can get "top-n" values for a dim by aggregated weight (great!), 
> but can't know how many docs matched each value. It would be nice to support 
> this so users could show the top-n values but _also_ show counts associated 
> with each.






[jira] [Commented] (LUCENE-10603) Improve iteration of ords for SortedSetDocValues

2022-06-30 Thread Greg Miller (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17561203#comment-17561203
 ] 

Greg Miller commented on LUCENE-10603:
--

I pushed another commit that takes care of the remaining "production" code 
iteration. I think the next step is to knock out all remaining iteration 
patterns, which should only exist in "test" related code. When I get some more 
free time I'll take a pass at it, but might be a week or so. Happy to have 
someone beat me to it :)

> Improve iteration of ords for SortedSetDocValues
> 
>
> Key: LUCENE-10603
> URL: https://issues.apache.org/jira/browse/LUCENE-10603
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Lu Xugang
>Assignee: Lu Xugang
>Priority: Trivial
>  Time Spent: 3h 40m
>  Remaining Estimate: 0h
>
> Now that SortedSetDocValues#docValueCount has been added (since Lucene 9.2), 
> should we refactor the implementation of ords iteration to use docValueCount 
> instead of NO_MORE_ORDS?
> Similar to how SortedNumericDocValues did.
> From 
> {code:java}
> for (long ord = values.nextOrd(); ord != SortedSetDocValues.NO_MORE_ORDS; ord = values.nextOrd()) {
> }{code}
> to
> {code:java}
> for (int i = 0; i < values.docValueCount(); i++) {
>   long ord = values.nextOrd();
> }{code}






[jira] [Commented] (LUCENE-10546) Update Faceting user guide

2022-06-30 Thread Greg Miller (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17561199#comment-17561199
 ] 

Greg Miller commented on LUCENE-10546:
--

Great, thanks [~epotiom]! I'm not aware of anyone else working on this.

> Update Faceting user guide
> --
>
> Key: LUCENE-10546
> URL: https://issues.apache.org/jira/browse/LUCENE-10546
> Project: Lucene - Core
>  Issue Type: Wish
>  Components: modules/facet
>Reporter: Greg Miller
>Priority: Minor
>
> The  [facet user 
> guide|https://lucene.apache.org/core/4_1_0/facet/org/apache/lucene/facet/doc-files/userguide.html]
>  was written based on 4.1. Since there's been a fair amount of active 
> facet-related development over the last year+, it would be nice to review the 
> guide and see what updates make sense.






[jira] [Resolved] (LUCENE-10274) Implement "hyperrectangle" faceting

2022-06-28 Thread Greg Miller (Jira)

Greg Miller resolved as Fixed

Very excited to see this shipped! Thanks Shai Erera and Marc D'Mello for all 
the PR iterations and conversation. Great example of shipping something much 
stronger than the original idea after rounds of discussion and iteration. 
Thanks again!

Lucene - Core / LUCENE-10274
Implement "hyperrectangle" faceting

Change By: Greg Miller
Fix Version/s: 9.3
Resolution: Fixed
Status: Open -> Resolved

This message was sent by Atlassian Jira (v8.20.10#820010-sha1:ace47f9)


[jira] [Commented] (LUCENE-10603) Improve iteration of ords for SortedSetDocValues

2022-06-28 Thread Greg Miller (Jira)

Greg Miller commented on LUCENE-10603:
--

Thanks Lu Xugang for letting me know! As I have some free time, I'll try to 
migrate a few more modules over (and will update here as I put out PRs for the 
modules).

> Improve iteration of ords for SortedSetDocValues
> ---
>
> Key: LUCENE-10603



[jira] [Commented] (LUCENE-10624) Binary Search for Sparse IndexedDISI advanceWithinBlock & advanceExactWithinBlock

2022-06-24 Thread Greg Miller (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17558709#comment-17558709
 ] 

Greg Miller commented on LUCENE-10624:
--

Oh, and just to clarify my above comment, I'm not weighing in (yet?) on 
whether or not this change makes sense, just adding a data point that we didn't 
see an impact in our particular application one way or the other. So it doesn't 
seem to help our usage patterns, but it also doesn't seem to hurt. +1 to 
[~jpountz]'s sentiment, though, that we should understand why the benchmark 
tasks where you saw an impact changed. Thanks for pursuing this!

> Binary Search for Sparse IndexedDISI advanceWithinBlock & 
> advanceExactWithinBlock
> -
>
> Key: LUCENE-10624
> URL: https://issues.apache.org/jira/browse/LUCENE-10624
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Affects Versions: 9.0, 9.1, 9.2
>Reporter: Weiming Wu
>Priority: Major
> Attachments: baseline_sparseTaxis_searchsparse-sorted.0.log, 
> candiate-exponential-searchsparse-sorted.0.log, 
> candidate_sparseTaxis_searchsparse-sorted.0.log
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> h3. Problem Statement
> We noticed a DocValue read performance regression with the iterative API when 
> upgrading from Lucene 5 to Lucene 9. Our latency increased by 50%. The 
> degradation is similar to what's described in 
> https://issues.apache.org/jira/browse/SOLR-9599 
> By analyzing profiling data, we found that the methods "advanceWithinBlock" 
> and "advanceExactWithinBlock" for Sparse IndexedDISI are slow in Lucene 9 due 
> to their O(N) doc lookup algorithm.
> h3. Changes
> Used a binary search algorithm to replace the current O(N) lookup algorithm 
> in Sparse IndexedDISI "advanceWithinBlock" and "advanceExactWithinBlock". 
> This is possible because docs are stored in ascending order.
> h3. Test
> {code:java}
> ./gradlew tidy
> ./gradlew check {code}
> h3. Benchmark
> Ran sparseTaxis test cases from luceneutil. Attached the reports of baseline 
> and candidates in attachments section.
> 1. Most cases have 5-10% search latency reduction.
> 2. Some highlights (>20%):
>  * *T0 green_pickup_latitude:[40.75 TO 40.9] 
> yellow_pickup_latitude:[40.75 TO 40.9] sort=null*
>  ** *Baseline:* 10973978+ hits hits in *726.81967 msec*
>  ** *Candidate:* 10973978+ hits hits in *484.544594 msec*
>  * *T0 cab_color:y cab_color:g sort=null*
>  ** *Baseline:* 2300174+ hits hits in *95.698324 msec*
>  ** *Candidate:* 2300174+ hits hits in *78.336193 msec*
>  * *T1 cab_color:y cab_color:g sort=null*
>  ** *Baseline:* 2300174+ hits hits in *391.565239 msec*
>  ** *Candidate:* 300174+ hits hits in *227.592885 msec*
>  * *...*
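
For readers skimming the thread, the difference between the O(N) lookup and the binary search described in the issue can be sketched roughly as follows. This is only an illustration under simplifying assumptions: the class and method names are hypothetical, and the doc IDs live in a plain in-memory array, whereas the real IndexedDISI code reads them from the index. Both methods return the position of the first doc ID that is greater than or equal to the target.

```java
/** Hypothetical stand-in for one sparse block: doc IDs stored in ascending order. */
final class SparseBlockSketch {

  /** O(N) scan: index of the first doc >= target, or docs.length if there is none. */
  static int advanceLinear(int[] docs, int target) {
    for (int i = 0; i < docs.length; i++) {
      if (docs[i] >= target) {
        return i;
      }
    }
    return docs.length;
  }

  /** O(log N) lower-bound binary search; only valid because docs is sorted ascending. */
  static int advanceBinary(int[] docs, int target) {
    int lo = 0;
    int hi = docs.length; // exclusive upper bound
    while (lo < hi) {
      int mid = (lo + hi) >>> 1;
      if (docs[mid] < target) {
        lo = mid + 1; // everything up to mid is too small
      } else {
        hi = mid; // mid may be the answer, keep it in range
      }
    }
    return lo;
  }

  public static void main(String[] args) {
    int[] docs = {3, 8, 15, 42, 97};
    System.out.println(advanceLinear(docs, 10)); // prints 2
    System.out.println(advanceBinary(docs, 10)); // prints 2
  }
}
```

Whether the binary search is faster in practice likely depends on block size and access pattern, which is consistent with the mixed benchmark observations in this thread.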



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10624) Binary Search for Sparse IndexedDISI advanceWithinBlock & advanceExactWithinBlock

2022-06-24 Thread Greg Miller (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17558700#comment-17558700
 ] 

Greg Miller commented on LUCENE-10624:
--

For what it's worth, I ran a benchmark on the Amazon Product Search engine with 
this change, where we do lots of doc value access for various purposes, and saw 
effectively no change to latency or throughput (qps). Just adding that as a 
datapoint from a real-world, large-scale application.

> Binary Search for Sparse IndexedDISI advanceWithinBlock & 
> advanceExactWithinBlock
> -
>
> Key: LUCENE-10624
> URL: https://issues.apache.org/jira/browse/LUCENE-10624
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Affects Versions: 9.0, 9.1, 9.2
>Reporter: Weiming Wu
>Priority: Major
> Attachments: baseline_sparseTaxis_searchsparse-sorted.0.log, 
> candiate-exponential-searchsparse-sorted.0.log, 
> candidate_sparseTaxis_searchsparse-sorted.0.log
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> h3. Problem Statement
> We noticed a DocValue read performance regression with the iterative API when 
> upgrading from Lucene 5 to Lucene 9. Our latency increased by 50%. The 
> degradation is similar to what's described in 
> https://issues.apache.org/jira/browse/SOLR-9599 
> By analyzing profiling data, we found that the methods "advanceWithinBlock" 
> and "advanceExactWithinBlock" for Sparse IndexedDISI are slow in Lucene 9 due 
> to their O(N) doc lookup algorithm.
> h3. Changes
> Used a binary search algorithm to replace the current O(N) lookup algorithm 
> in Sparse IndexedDISI "advanceWithinBlock" and "advanceExactWithinBlock". 
> This is possible because docs are stored in ascending order.
> h3. Test
> {code:java}
> ./gradlew tidy
> ./gradlew check {code}
> h3. Benchmark
> Ran sparseTaxis test cases from luceneutil. Attached the reports of baseline 
> and candidates in attachments section.
> 1. Most cases have 5-10% search latency reduction.
> 2. Some highlights (>20%):
>  * *T0 green_pickup_latitude:[40.75 TO 40.9] 
> yellow_pickup_latitude:[40.75 TO 40.9] sort=null*
>  ** *Baseline:* 10973978+ hits hits in *726.81967 msec*
>  ** *Candidate:* 10973978+ hits hits in *484.544594 msec*
>  * *T0 cab_color:y cab_color:g sort=null*
>  ** *Baseline:* 2300174+ hits hits in *95.698324 msec*
>  ** *Candidate:* 2300174+ hits hits in *78.336193 msec*
>  * *T1 cab_color:y cab_color:g sort=null*
>  ** *Baseline:* 2300174+ hits hits in *391.565239 msec*
>  ** *Candidate:* 300174+ hits hits in *227.592885 msec*
>  * *...*






[jira] [Resolved] (LUCENE-10550) Add getAllChildren functionality to facets

2022-06-22 Thread Greg Miller (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Miller resolved LUCENE-10550.
--
Fix Version/s: 9.3
   Resolution: Fixed

Thanks again [~yutinggan] !

> Add getAllChildren functionality to facets
> --
>
> Key: LUCENE-10550
> URL: https://issues.apache.org/jira/browse/LUCENE-10550
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/facet
>Reporter: Yuting Gan
>Priority: Minor
> Fix For: 9.3
>
>  Time Spent: 2h 40m
>  Remaining Estimate: 0h
>
> Currently Lucene does not support returning range counts sorted by label 
> values, but there are use cases demanding this feature. For example, a user 
> specifies ranges (e.g., [0, 10], [10, 20]) and wants to get range counts 
> without changing the range order. Today we can only call getTopChildren to 
> populate range counts, but it would return ranges sorted by counts (e.g., 
> [10, 20] 100, [0, 10] 50) instead of range values. 
> Lucene has an API, getAllChildrenSortByValue, that returns numeric values 
> with counts sorted by label values; please see 
> [LUCENE-7927|https://issues.apache.org/jira/browse/LUCENE-7927] for details. 
> Therefore, it would be nice to also have a similar API to support range 
> counts. The proposed getAllChildren API would return value/range counts 
> sorted by label values instead of counts. 
> This proposal was inspired by the discussions with [~gsmiller] when I was 
> working on the LUCENE-10538 [PR|https://github.com/apache/lucene/pull/843], 
> and we believe users would benefit from adding this API to Facets. 
> I hope to get some feedback from the community, since this proposal would 
> require changes to the getTopChildren API in RangeFacetCounts. Thanks!
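
The two behaviors contrasted in the description can be illustrated with the example ranges it gives. This is a rough, self-contained sketch and not the Lucene facets API: `RangeCount`, `allChildren`, and `topChildren` are hypothetical names, and the counts are taken from the example above.

```java
import java.util.ArrayList;
import java.util.List;

final class RangeCountsSketch {

  /** Hypothetical label/count pair; not the Lucene LabelAndValue class. */
  static final class RangeCount {
    final String label;
    final int count;

    RangeCount(String label, int count) {
      this.label = label;
      this.count = count;
    }
  }

  /** "All children" behavior: every range, in the order the user specified them. */
  static List<RangeCount> allChildren(List<RangeCount> ranges) {
    return new ArrayList<>(ranges);
  }

  /** "Top children" behavior: sorted by count, highest first, truncated to n entries. */
  static List<RangeCount> topChildren(List<RangeCount> ranges, int n) {
    List<RangeCount> sorted = new ArrayList<>(ranges);
    sorted.sort((a, b) -> Integer.compare(b.count, a.count));
    return sorted.subList(0, Math.min(n, sorted.size()));
  }

  public static void main(String[] args) {
    List<RangeCount> ranges = new ArrayList<>();
    ranges.add(new RangeCount("[0, 10]", 50));
    ranges.add(new RangeCount("[10, 20]", 100));
    for (RangeCount rc : allChildren(ranges)) {
      System.out.println("all: " + rc.label + " " + rc.count);
    }
    for (RangeCount rc : topChildren(ranges, 2)) {
      System.out.println("top: " + rc.label + " " + rc.count);
    }
  }
}
```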






[jira] [Commented] (LUCENE-10614) Properly support getTopChildren in RangeFacetCounts

2022-06-22 Thread Greg Miller (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17557453#comment-17557453
 ] 

Greg Miller commented on LUCENE-10614:
--

Great, thanks [~yutinggan] !

> Properly support getTopChildren in RangeFacetCounts
> ---
>
> Key: LUCENE-10614
> URL: https://issues.apache.org/jira/browse/LUCENE-10614
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/facet
>Affects Versions: 10.0 (main)
>Reporter: Greg Miller
>Priority: Minor
>
> As mentioned in LUCENE-10538, {{RangeFacetCounts}} is not implementing 
> {{getTopChildren}}. Instead of returning "top" ranges, it returns all 
> user-provided ranges in the order the user specified them when instantiating. 
> This is probably more useful functionality, but it would be nice to support 
> {{getTopChildren}} as well.
> LUCENE-10550 is introducing the concept of {{getAllChildren}}, so once that 
> lands, we can replace the current implementation of {{getTopChildren}} with 
> an actual "top children" implementation and direct users to 
> {{getAllChildren}} if they want to maintain the current behavior.






[jira] [Commented] (LUCENE-10603) Improve iteration of ords for SortedSetDocValues

2022-06-21 Thread Greg Miller (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17556891#comment-17556891
 ] 

Greg Miller commented on LUCENE-10603:
--

[~ChrisLu] thanks again for proposing this. I've merged the work in the 
{{facets}} module to use the new style of iteration, but there are still plenty 
of other locations in our code base that need updating. Let me know if you want 
any help with this. I'm happy to divide up some of the modules if you'd like 
(or maybe we can recruit others if interested as well).

In the meantime, I propose we get this {{NO_MORE_ORDS}} constant marked as 
{{deprecated}} so we have a shot at removing it in a 10.0 release. By removing 
it, as [~jpountz] points out in 
[#954|https://github.com/apache/lucene/pull/954], we may get a performance 
benefit since we won't need the book-keeping to keep it updated. I opened 
another PR for this: [#969|https://github.com/apache/lucene/pull/969].

> Improve iteration of ords for SortedSetDocValues
> 
>
> Key: LUCENE-10603
> URL: https://issues.apache.org/jira/browse/LUCENE-10603
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Lu Xugang
>Assignee: Lu Xugang
>Priority: Trivial
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> Now that SortedSetDocValues#docValueCount has been added (as of Lucene 9.2), 
> should we refactor the ords iteration code to use docValueCount instead of 
> NO_MORE_ORDS?
> This would be similar to how SortedNumericDocValues is iterated.
> From 
> {code:java}
> for (long ord = values.nextOrd(); ord != SortedSetDocValues.NO_MORE_ORDS; ord = values.nextOrd()) {
> }{code}
> to
> {code:java}
> for (int i = 0; i < values.docValueCount(); i++) {
>   long ord = values.nextOrd();
> }{code}
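
To make the two iteration styles concrete without pulling in Lucene, here is a minimal, self-contained sketch. `DocOrds` is a hypothetical stand-in for the ords of a single document, not the real {{SortedSetDocValues}} class; it only mimics the three members discussed in this issue ({{nextOrd}}, {{docValueCount}}, and the {{NO_MORE_ORDS}} sentinel).

```java
import java.util.ArrayList;
import java.util.List;

final class OrdIterationSketch {

  /** Hypothetical stand-in for one document's ords; not the Lucene class. */
  static final class DocOrds {
    static final long NO_MORE_ORDS = -1;
    private final long[] ords;
    private int upto;

    DocOrds(long... ords) {
      this.ords = ords;
    }

    int docValueCount() {
      return ords.length;
    }

    /** Returns the next ord, or NO_MORE_ORDS once exhausted (the extra bookkeeping). */
    long nextOrd() {
      return upto < ords.length ? ords[upto++] : NO_MORE_ORDS;
    }
  }

  /** Old style: keep calling nextOrd() until the sentinel comes back. */
  static List<Long> sentinelStyle(DocOrds values) {
    List<Long> out = new ArrayList<>();
    for (long ord = values.nextOrd(); ord != DocOrds.NO_MORE_ORDS; ord = values.nextOrd()) {
      out.add(ord);
    }
    return out;
  }

  /** New style: read the count once and call nextOrd() exactly that many times. */
  static List<Long> countStyle(DocOrds values) {
    List<Long> out = new ArrayList<>();
    int count = values.docValueCount();
    for (int i = 0; i < count; i++) {
      out.add(values.nextOrd());
    }
    return out;
  }

  public static void main(String[] args) {
    System.out.println(sentinelStyle(new DocOrds(2, 5, 9)));
    System.out.println(countStyle(new DocOrds(2, 5, 9)));
  }
}
```

In the count-based style the sentinel is never observed, which is why the constant (and the bookkeeping behind it) could eventually be removed.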






[jira] [Resolved] (LUCENE-10584) SSDV facets should support hierarchical paths in #getSpecificValue

2022-06-15 Thread Greg Miller (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Miller resolved LUCENE-10584.
--
Fix Version/s: 9.3
   Resolution: Fixed

> SSDV facets should support hierarchical paths in #getSpecificValue
> --
>
> Key: LUCENE-10584
> URL: https://issues.apache.org/jira/browse/LUCENE-10584
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/facet
>Reporter: Greg Miller
>Assignee: Greg Miller
>Priority: Major
> Fix For: 9.3
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> We added hierarchical pathing capabilities to SSDV faceting recently but it 
> looks like we didn't update #getSpecificValue to work with hierarchical paths.






[jira] [Commented] (LUCENE-10584) SSDV facets should support hierarchical paths in #getSpecificValue

2022-06-15 Thread Greg Miller (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17554758#comment-17554758
 ] 

Greg Miller commented on LUCENE-10584:
--

Fixed and backported. Resolving.

> SSDV facets should support hierarchical paths in #getSpecificValue
> --
>
> Key: LUCENE-10584
> URL: https://issues.apache.org/jira/browse/LUCENE-10584
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/facet
>Reporter: Greg Miller
>Assignee: Greg Miller
>Priority: Major
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> We added hierarchical pathing capabilities to SSDV faceting recently but it 
> looks like we didn't update #getSpecificValue to work with hierarchical paths.






[jira] [Created] (LUCENE-10614) Properly support getTopChildren in RangeFacetCounts

2022-06-13 Thread Greg Miller (Jira)
Greg Miller created LUCENE-10614:


 Summary: Properly support getTopChildren in RangeFacetCounts
 Key: LUCENE-10614
 URL: https://issues.apache.org/jira/browse/LUCENE-10614
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/facet
Affects Versions: 10.0 (main)
Reporter: Greg Miller


As mentioned in LUCENE-10538, {{RangeFacetCounts}} is not implementing 
{{getTopChildren}}. Instead of returning "top" ranges, it returns all 
user-provided ranges in the order the user specified them when instantiating. 
This is probably more useful functionality, but it would be nice to support 
{{getTopChildren}} as well.

LUCENE-10550 is introducing the concept of {{getAllChildren}}, so once that 
lands, we can replace the current implementation of {{getTopChildren}} with an 
actual "top children" implementation and direct users to {{getAllChildren}} if 
they want to maintain the current behavior.






[jira] [Commented] (LUCENE-10603) Improve iteration of ords for SortedSetDocValues

2022-06-10 Thread Greg Miller (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17552851#comment-17552851
 ] 

Greg Miller commented on LUCENE-10603:
--

OK, thanks [~ChrisLu]! +1 to doing this for consistency.

I took a pass at making this change within the faceting module since there are 
a number of places where we rely on SSDV ordinal iteration. Since we can 
probably tackle this change through multiple PRs, I figured I'd lend a hand 
with faceting: https://github.com/apache/lucene/pull/954

> Improve iteration of ords for SortedSetDocValues
> 
>
> Key: LUCENE-10603
> URL: https://issues.apache.org/jira/browse/LUCENE-10603
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Lu Xugang
>Priority: Trivial
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Now that SortedSetDocValues#docValueCount has been added (as of Lucene 9.2), 
> should we refactor the ords iteration code to use docValueCount instead of 
> NO_MORE_ORDS?
> This would be similar to how SortedNumericDocValues is iterated.
> From 
> {code:java}
> for (long ord = values.nextOrd(); ord != SortedSetDocValues.NO_MORE_ORDS; ord = values.nextOrd()) {
> }{code}
> to
> {code:java}
> for (int i = 0; i < values.docValueCount(); i++) {
>   long ord = values.nextOrd();
> }{code}






[jira] [Commented] (LUCENE-10603) Improve iteration of ords for SortedSetDocValues

2022-06-09 Thread Greg Miller (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17552390#comment-17552390
 ] 

Greg Miller commented on LUCENE-10603:
--

Seems reasonable. Is there an expected benefit of moving to this iteration 
style? Do we think the loops can be better optimized by the JVM/hotspot since 
the number of iterations is known ahead of time? 

> Improve iteration of ords for SortedSetDocValues
> 
>
> Key: LUCENE-10603
> URL: https://issues.apache.org/jira/browse/LUCENE-10603
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Lu Xugang
>Priority: Trivial
>
> Now that SortedSetDocValues#docValueCount has been added (as of Lucene 9.2), 
> should we refactor the ords iteration code to use docValueCount instead of 
> NO_MORE_ORDS?
> This would be similar to how SortedNumericDocValues is iterated.
> From 
> {code:java}
> for (long ord = values.nextOrd(); ord != SortedSetDocValues.NO_MORE_ORDS; ord = values.nextOrd()) {
> }{code}
> to
> {code:java}
> for (int i = 0; i < values.docValueCount(); i++) {
>   long ord = values.nextOrd();
> }{code}






[jira] [Created] (LUCENE-10595) TestGroupFacetCollector#testRandom has caught IndexOutOfBoundsException a couple times

2022-05-29 Thread Greg Miller (Jira)
Greg Miller created LUCENE-10595:


 Summary: TestGroupFacetCollector#testRandom has caught 
IndexOutOfBoundsException a couple times
 Key: LUCENE-10595
 URL: https://issues.apache.org/jira/browse/LUCENE-10595
 Project: Lucene - Core
  Issue Type: Bug
  Components: modules/grouping
Reporter: Greg Miller


Random testing has caught an {{IndexOutOfBoundsException}} a couple times now 
in {{org.apache.lucene.search.grouping.TestGroupFacetCollector.testRandom}}. I 
was able to reproduce locally with {{./gradlew :lucene:grouping:test --tests 
"org.apache.lucene.search.grouping.TestGroupFacetCollector.testRandom" 
-Ptests.jvms=4 -Ptests.haltonfailure=false 
-Ptests.jvmargs=-XX:TieredStopAtLevel=1 -Ptests.seed=91EC8BE9DE2A5BAB 
-Ptests.multiplier=2 -Ptests.badapples=false -Ptests.file.encoding=US-ASCII}}.

From what I can tell, the exception is coming from way down in 
{{ByteBuffersDataInput#readBytes}} on [this 
line|https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/store/ByteBuffersDataInput.java#L155].
 Popping the stack a bit, it seems like the issue is maybe in 
{{TermGroupFacetCollector$MV$SegmentResult#nextTerm}}.

I'm not totally sure if this is an actual bug or a bug in testing methodology. 
Haven't had time to dig in further and likely won't in the near future, so 
opening this Jira to track.






[jira] [Resolved] (LUCENE-10585) Cleanup copy/paste code in facets, particularly in SSDV

2022-05-29 Thread Greg Miller (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Miller resolved LUCENE-10585.
--
Fix Version/s: 10.0 (main)
   9.3
   Resolution: Fixed

> Cleanup copy/paste code in facets, particularly in SSDV
> ---
>
> Key: LUCENE-10585
> URL: https://issues.apache.org/jira/browse/LUCENE-10585
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/facet
>Reporter: Greg Miller
>Assignee: Greg Miller
>Priority: Minor
> Fix For: 10.0 (main), 9.3
>
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> We've accumulated some copy/paste code in the facets modules, especially in 
> SSDV-related classes. I'm going to take a pass at cleaning this up to help 
> make the code more readable and easier to maintain.






[jira] [Comment Edited] (LUCENE-10544) Should ExitableTermsEnum wrap postings and impacts?

2022-05-20 Thread Greg Miller (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17540309#comment-17540309
 ] 

Greg Miller edited comment on LUCENE-10544 at 5/20/22 10:03 PM:


As I understood Adrien's suggestion, I think the idea is to create a new 
{{BulkScorer}} sub-class that would wrap another {{BulkScorer}} (provided as 
{{in}} in Adrien's code snippet). This class would override the {{score}} 
method as Adrien shows above to periodically check timeouts, but otherwise just 
delegate to {{in}} if the query has not yet timed out. I imagine somewhere in 
{{IndexSearcher}} you would instantiate this new "timeout enforcing bulk 
scorer", wrapping the {{BulkScorer}} provided by the query's weight. Does that 
help?

Also, can I request that we move this conversation over to 
[LUCENE-10151|https://issues.apache.org/jira/browse/LUCENE-10151]? This issue 
is really about modifying {{ExitableTermsEnum}}, which we may want to 
eventually do independently of adding timeout support to {{IndexSearcher}}. 
Since this discussion is really about adding timeout support to 
{{IndexSearcher}}, it would be best to capture the conversation in LUCENE-10151 
to make it easier to dig up in the future. Thank you!


was (Author: gsmiller):
As I understood Adrien's suggestion, I think the idea is to create a new 
{{BulkScorer}} sub-class that would wrap another {{BulkScorer}} (provided as 
{{in}} in Adrien's code snippet). This class would override the {{score}} 
method as Adrien shows above to periodically check timeouts, but otherwise just 
delegate to {{in}} if the query has not yet timed out. I image somewhere in 
{{IndexSearcher}} you would instantiate this new "timeout enforcing bulk 
scorer", wrapping the {{BulkScorer}} provided by the query's weight. Does that 
help?

Also, can I request that we move this conversation over to 
[LUCENE-10151|https://issues.apache.org/jira/browse/LUCENE-10151]? This issue 
is really about modifying {{ExitableTermsEnum}}, which we may want to 
eventually do independently of adding timeout support to {{IndexSearcher}}. 
Since this discussion is really about adding timeout support to 
{{IndexSearcher}}, it would be best to capture the conversation in LUCENE-10151 
to make it easier to dig up in the future. Thank you!

> Should ExitableTermsEnum wrap postings and impacts?
> ---
>
> Key: LUCENE-10544
> URL: https://issues.apache.org/jira/browse/LUCENE-10544
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/index
>Reporter: Greg Miller
>Priority: Major
>
> While looking into options for LUCENE-10151, I noticed that 
> {{ExitableDirectoryReader}} doesn't actually do any timeout checking once you 
> start iterating postings/impacts. It *does* create an {{ExitableTermsEnum}} 
> wrapper when loading a {{TermsEnum}}, but that wrapper doesn't do 
> anything to wrap postings or impacts. So timeouts will be enforced when 
> moving to the "next" term, but not when iterating the postings/impacts 
> associated with a term.
> I think we ought to wrap the postings/impacts as well with some form of 
> timeout checking so timeouts can be enforced on long-running queries. I'm not 
> sure why this wasn't done originally (back in 2014), but it was questioned 
> back in 2020 on the original Jira SOLR-5986. Does anyone know of a good 
> reason why we shouldn't enforce timeouts in this way?
> Related, we may also want to wrap things like {{seekExact}} and {{seekCeil}} 
> given that only {{next}} is being wrapped currently.






[jira] [Commented] (LUCENE-10544) Should ExitableTermsEnum wrap postings and impacts?

2022-05-20 Thread Greg Miller (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17540309#comment-17540309
 ] 

Greg Miller commented on LUCENE-10544:
--

As I understood Adrien's suggestion, I think the idea is to create a new 
{{BulkScorer}} sub-class that would wrap another {{BulkScorer}} (provided as 
{{in}} in Adrien's code snippet). This class would override the {{score}} 
method as Adrien shows above to periodically check timeouts, but otherwise just 
delegate to {{in}} if the query has not yet timed out. I imagine somewhere in 
{{IndexSearcher}} you would instantiate this new "timeout enforcing bulk 
scorer", wrapping the {{BulkScorer}} provided by the query's weight. Does that 
help?

Also, can I request that we move this conversation over to 
[LUCENE-10151|https://issues.apache.org/jira/browse/LUCENE-10151]? This issue 
is really about modifying {{ExitableTermsEnum}}, which we may want to 
eventually do independently of adding timeout support to {{IndexSearcher}}. 
Since this discussion is really about adding timeout support to 
{{IndexSearcher}}, it would be best to capture the conversation in LUCENE-10151 
to make it easier to dig up in the future. Thank you!
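
A rough sketch of the wrapping idea described above, under heavy simplifying assumptions. It does not reproduce Adrien's snippet (which is not part of this message) and does not use the real {{BulkScorer}} signature: `RangeScorer`, `TimeoutRangeScorer`, `TimeExceededException`, and the window size are all hypothetical names and values chosen for illustration.

```java
import java.util.function.BooleanSupplier;

final class TimeoutScorerSketch {

  /** Hypothetical, simplified scorer: scores docs in [min, max), returns the next doc to score. */
  interface RangeScorer {
    int score(int min, int max);
  }

  /** Hypothetical exception signalling that the query ran out of time. */
  static final class TimeExceededException extends RuntimeException {}

  /** Wraps another scorer ("in") and checks the timeout before each window of docs. */
  static final class TimeoutRangeScorer implements RangeScorer {
    private static final int INTERVAL = 100; // docs per window; arbitrary for this sketch
    private final RangeScorer in;
    private final BooleanSupplier timedOut;

    TimeoutRangeScorer(RangeScorer in, BooleanSupplier timedOut) {
      this.in = in;
      this.timedOut = timedOut;
    }

    @Override
    public int score(int min, int max) {
      int doc = min;
      while (doc < max) {
        if (timedOut.getAsBoolean()) {
          throw new TimeExceededException();
        }
        // Score one window, then loop back to re-check the timeout.
        int windowEnd = (int) Math.min((long) doc + INTERVAL, max);
        doc = in.score(doc, windowEnd);
      }
      return doc;
    }
  }

  public static void main(String[] args) {
    RangeScorer in = (min, max) -> max; // pretend every window is scored instantly
    long deadline = System.nanoTime() + 1_000_000_000L;
    RangeScorer wrapped = new TimeoutRangeScorer(in, () -> System.nanoTime() > deadline);
    System.out.println(wrapped.score(0, 250));
  }
}
```

The sketch assumes the wrapped scorer always returns a doc ID at or beyond the end of the window it was asked to score; the timeout granularity is then one window of docs.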

> Should ExitableTermsEnum wrap postings and impacts?
> ---
>
> Key: LUCENE-10544
> URL: https://issues.apache.org/jira/browse/LUCENE-10544
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/index
>Reporter: Greg Miller
>Priority: Major
>
> While looking into options for LUCENE-10151, I noticed that 
> {{ExitableDirectoryReader}} doesn't actually do any timeout checking once you 
> start iterating postings/impacts. It *does* create an {{ExitableTermsEnum}} 
> wrapper when loading a {{TermsEnum}}, but that wrapper doesn't do 
> anything to wrap postings or impacts. So timeouts will be enforced when 
> moving to the "next" term, but not when iterating the postings/impacts 
> associated with a term.
> I think we ought to wrap the postings/impacts as well with some form of 
> timeout checking so timeouts can be enforced on long-running queries. I'm not 
> sure why this wasn't done originally (back in 2014), but it was questioned 
> back in 2020 on the original Jira SOLR-5986. Does anyone know of a good 
> reason why we shouldn't enforce timeouts in this way?
> Related, we may also want to wrap things like {{seekExact}} and {{seekCeil}} 
> given that only {{next}} is being wrapped currently.






[jira] [Assigned] (LUCENE-10585) Cleanup copy/paste code in facets, particularly in SSDV

2022-05-20 Thread Greg Miller (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Miller reassigned LUCENE-10585:


Assignee: Greg Miller

> Cleanup copy/paste code in facets, particularly in SSDV
> ---
>
> Key: LUCENE-10585
> URL: https://issues.apache.org/jira/browse/LUCENE-10585
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/facet
>Reporter: Greg Miller
>Assignee: Greg Miller
>Priority: Minor
>
> We've accumulated some copy/paste code in the facets modules, especially in 
> SSDV-related classes. I'm going to take a pass at cleaning this up to help 
> make the code more readable and easier to maintain.






[jira] [Created] (LUCENE-10585) Cleanup copy/paste code in facets, particularly in SSDV

2022-05-20 Thread Greg Miller (Jira)
Greg Miller created LUCENE-10585:


 Summary: Cleanup copy/paste code in facets, particularly in SSDV
 Key: LUCENE-10585
 URL: https://issues.apache.org/jira/browse/LUCENE-10585
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/facet
Reporter: Greg Miller


We've accumulated some copy/paste code in the facets modules, especially in 
SSDV-related classes. I'm going to take a pass at cleaning this up to help make 
the code more readable and easier to maintain.






[jira] [Assigned] (LUCENE-10584) SSDV facets should support hierarchical paths in #getSpecificValue

2022-05-20 Thread Greg Miller (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Miller reassigned LUCENE-10584:


Assignee: Greg Miller

> SSDV facets should support hierarchical paths in #getSpecificValue
> --
>
> Key: LUCENE-10584
> URL: https://issues.apache.org/jira/browse/LUCENE-10584
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/facet
>Reporter: Greg Miller
>Assignee: Greg Miller
>Priority: Major
>
> We added hierarchical pathing capabilities to SSDV faceting recently but it 
> looks like we didn't update #getSpecificValue to work with hierarchical paths.






[jira] [Created] (LUCENE-10584) SSDV facets should support hierarchical paths in #getSpecificValue

2022-05-20 Thread Greg Miller (Jira)
Greg Miller created LUCENE-10584:


 Summary: SSDV facets should support hierarchical paths in 
#getSpecificValue
 Key: LUCENE-10584
 URL: https://issues.apache.org/jira/browse/LUCENE-10584
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/facet
Reporter: Greg Miller


We added hierarchical pathing capabilities to SSDV faceting recently but it 
looks like we didn't update #getSpecificValue to work with hierarchical paths.






[jira] [Created] (LUCENE-10580) Should we add a "slow range query" to xxxPoint classes?

2022-05-19 Thread Greg Miller (Jira)
Greg Miller created LUCENE-10580:


 Summary: Should we add a "slow range query" to xxxPoint classes?
 Key: LUCENE-10580
 URL: https://issues.apache.org/jira/browse/LUCENE-10580
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/index
Reporter: Greg Miller


Users that index 1D point data have the option of running a range query with 
1) the points index (via {{LongPoint#newRangeQuery}}), or 2) a doc values field 
(via {{SortedNumericDocValuesField#newSlowRangeQuery}}). But if users are 
indexing points data in higher dimensions, there's no equivalent "slow" query 
that I'm aware of (relying on doc values). It's useful to have both and be able 
to wrap them in {{IndexOrDocValuesQuery}}.

I wonder if we should model a "point" doc value type (could just extend from 
{{BinaryDocValuesField}}) that supports creating "slow" range queries?
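
To make the idea concrete, here is a minimal sketch of what a doc-values-style "slow" range check over multi-dimensional points could look like. It is plain Java and uses no Lucene API: the names SlowPointRangeSketch, inRange and countMatches are hypothetical, and per-document values are modeled as simple long arrays rather than an encoded binary doc value.

```java
/** Hypothetical sketch only; this is not a Lucene class. */
final class SlowPointRangeSketch {

  /** Returns true if every dimension d satisfies lower[d] <= point[d] <= upper[d]. */
  static boolean inRange(long[] point, long[] lower, long[] upper) {
    if (point.length != lower.length || point.length != upper.length) {
      throw new IllegalArgumentException("dimension mismatch");
    }
    for (int d = 0; d < point.length; d++) {
      if (point[d] < lower[d] || point[d] > upper[d]) {
        return false;
      }
    }
    return true;
  }

  /** Linear scan over per-document values, the way a "slow" doc-values query verifies docs. */
  static int countMatches(long[][] docValues, long[] lower, long[] upper) {
    int count = 0;
    for (long[] point : docValues) {
      if (inRange(point, lower, upper)) {
        count++;
      }
    }
    return count;
  }
}
```

The per-document check is cheap when only a few candidate docs need verifying, which is the case where wrapping it next to a points-based query would presumably pay off.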






[jira] [Commented] (LUCENE-10544) Should ExitableTermsEnum wrap postings and impacts?

2022-05-17 Thread Greg Miller (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17538424#comment-17538424
 ] 

Greg Miller commented on LUCENE-10544:
--

+1 to pursuing this delegating bulk scorer suggestion. I really like that idea 
[~jpountz]. Seems like a simple, easy to understand approach that still allows 
queries to provide their own custom bulk scoring logic as necessary. 

> Should ExitableTermsEnum wrap postings and impacts?
> ---
>
> Key: LUCENE-10544
> URL: https://issues.apache.org/jira/browse/LUCENE-10544
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/index
>Reporter: Greg Miller
>Priority: Major
>
> While looking into options for LUCENE-10151, I noticed that 
> {{ExitableDirectoryReader}} doesn't actually do any timeout checking once you 
> start iterating postings/impacts. It *does* create an {{ExitableTermsEnum}} 
> wrapper when loading a {{{}TermsEnum{}}}, but that wrapper doesn't do 
> anything to wrap postings or impacts. So timeouts will be enforced when 
> moving to the "next" term, but not when iterating the postings/impacts 
> associated with a term.
> I think we ought to wrap the postings/impacts as well with some form of 
> timeout checking so timeouts can be enforced on long-running queries. I'm not 
> sure why this wasn't done originally (back in 2014), but it was questioned 
> back in 2020 on the original Jira SOLR-5986. Does anyone know of a good 
> reason why we shouldn't enforce timeouts in this way?
> Related, we may also want to wrap things like {{seekExact}} and {{seekCeil}} 
> given that only {{next}} is being wrapped currently.






[jira] [Resolved] (LUCENE-10488) Optimize Facets#getTopDims across Facets implementations

2022-05-13 Thread Greg Miller (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Miller resolved LUCENE-10488.
--
Fix Version/s: 9.2
   Resolution: Fixed

Merged to {{main}} and {{branch_9x}}. Resolving. Thanks again [~yutinggan]!

> Optimize Facets#getTopDims across Facets implementations
> 
>
> Key: LUCENE-10488
> URL: https://issues.apache.org/jira/browse/LUCENE-10488
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/facet
>Reporter: Greg Miller
>Priority: Minor
> Fix For: 9.2
>
>  Time Spent: 4h 10m
>  Remaining Estimate: 0h
>
> LUCENE-10325 added a new {{getTopDims}} API, allowing users to specify the 
> number of "top" dimensions they want. The default implementation just 
> delegates to {{getAllDims}} and returns the number of top dims requested, but 
> some Facets sub-classes can do this more optimally. LUCENE-10325 demonstrated 
> this in {{SortedSetDocValuesFacetCounts}}, but we can take it further. There's 
> at least some opportunity to do better in:
> * {{ConcurrentSortedSetDocValuesFacetCounts}}
> * {{FastTaxonomyFacetCounts}}
> * {{TaxonomyFacetSumFloatAssociations}}
> * {{TaxonomyFacetSumIntAssociations}}
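
As an illustration of why a dedicated top-dims path can beat "compute everything, then truncate", here is a small sketch that keeps only the best topN dimensions in a bounded heap. It is plain Java over a map of hypothetical dimension counts; TopDimsSketch and topDims are made-up names, and nothing here reflects how the Lucene implementations are actually written.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

/** Hypothetical sketch only; this is not a Lucene class. */
final class TopDimsSketch {

  /** Returns the topN dimension names, highest count first (ties: smaller name first). */
  static List<String> topDims(Map<String, Integer> dimCounts, int topN) {
    if (topN <= 0) {
      throw new IllegalArgumentException("topN must be > 0");
    }
    // "Weakest first" ordering: lower count is weaker; on ties the larger name is weaker.
    Comparator<Map.Entry<String, Integer>> weakestFirst = (a, b) -> {
      int byCount = Integer.compare(a.getValue(), b.getValue());
      return byCount != 0 ? byCount : b.getKey().compareTo(a.getKey());
    };
    PriorityQueue<Map.Entry<String, Integer>> heap = new PriorityQueue<>(weakestFirst);
    for (Map.Entry<String, Integer> e : dimCounts.entrySet()) {
      heap.offer(e);
      if (heap.size() > topN) {
        heap.poll(); // evict the current weakest so the heap never exceeds topN entries
      }
    }
    List<String> result = new ArrayList<>();
    while (!heap.isEmpty()) {
      result.add(0, heap.poll().getKey()); // weakest comes out first, so prepend
    }
    return result;
  }
}
```

The heap never holds more than topN + 1 entries, so no full sort of all dimensions is needed.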






[jira] [Commented] (LUCENE-10565) Can we "warm" SSDV ordinal maps on index reopen?

2022-05-12 Thread Greg Miller (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17536357#comment-17536357
 ] 

Greg Miller commented on LUCENE-10565:
--

A tricky aspect of this is identifying which fields to pre-build the ordinal 
maps for, but I wonder if we could leverage {{FacetsConfig}} for this. 
Unfortunately, users don't have to register a facet field with {{FacetsConfig}} 
if they want all the default behavior, but maybe there's something we could do 
with this to make it more straightforward to identify all the SSDV fields 
being used for faceting on reopen so the ordinal maps could be built. 

> Can we "warm" SSDV ordinal maps on index reopen?
> 
>
> Key: LUCENE-10565
> URL: https://issues.apache.org/jira/browse/LUCENE-10565
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/facet
>Reporter: Greg Miller
>Priority: Major
>
> As [~rcmuir] and [~jpountz] pointed out in a [discussion about facet 
> benchmarks|https://github.com/mikemccand/luceneutil/issues/169], we lazily 
> build ordinal maps needed for SSDV faceting the first time we need them for a 
> given index field instead of eagerly building them when the index is 
> reopened. This puts an expensive penalty on the search path whenever an index 
> is reloaded. Let's see if we can eagerly build these maps as part of 
> reopening the index so the user doesn't get hit with this at search time.






[jira] [Created] (LUCENE-10565) Can we "warm" SSDV ordinal maps on index reopen?

2022-05-10 Thread Greg Miller (Jira)
Greg Miller created LUCENE-10565:


 Summary: Can we "warm" SSDV ordinal maps on index reopen?
 Key: LUCENE-10565
 URL: https://issues.apache.org/jira/browse/LUCENE-10565
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/facet
Reporter: Greg Miller


As [~rcmuir] and [~jpountz] pointed out in a [discussion about facet 
benchmarks|https://github.com/mikemccand/luceneutil/issues/169], we lazily 
build ordinal maps needed for SSDV faceting the first time we need them for a 
given index field instead of eagerly building them when the index is reopened. 
This puts an expensive penalty on the search path whenever an index is 
reloaded. Let's see if we can eagerly build these maps as part of reopening the 
index so the user doesn't get hit with this at search time.
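
The lazy-versus-eager distinction can be sketched in a few lines. This is a simplified model, not the Lucene code: OrdinalMapWarmingSketch is a hypothetical class, the expensive build step is an arbitrary function supplied by the caller, and it assumes one cache instance would be created per reopened reader.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Hypothetical sketch only; M stands in for whatever an ordinal map would be. */
final class OrdinalMapWarmingSketch<M> {
  private final Function<String, M> builder; // the expensive per-field build step (assumed)
  private final Map<String, M> cache = new ConcurrentHashMap<>();

  OrdinalMapWarmingSketch(Function<String, M> builder) {
    this.builder = builder;
  }

  /** Lazy path: the first caller for a field pays the build cost at search time. */
  M get(String field) {
    return cache.computeIfAbsent(field, builder);
  }

  /** Eager path: invoked as part of reopen, so later get() calls are cache hits. */
  void warm(Iterable<String> fields) {
    for (String field : fields) {
      get(field);
    }
  }
}
```

Note that warm needs to be told which fields to build for, which is the open question here.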






[jira] [Commented] (LUCENE-10538) TopN is not being used in getTopChildren()

2022-05-09 Thread Greg Miller (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17533912#comment-17533912
 ] 

Greg Miller commented on LUCENE-10538:
--

So I think the order of operations here is:
1. Deliver [LUCENE-10550|https://issues.apache.org/jira/browse/LUCENE-10550], 
which would effectively _copy_ the current "top children" functionality of 
range faceting to a new API method for getting all children (which is what it's 
really doing).
2. Fix the existing "top children" functionality of range faceting to actually 
return top children (and honor the top-n parameter).

I think this issue now effectively captures #2, and is blocked until 
LUCENE-10550 is delivered. Does that sound right [~yutinggan]?

> TopN is not being used in getTopChildren()
> --
>
> Key: LUCENE-10538
> URL: https://issues.apache.org/jira/browse/LUCENE-10538
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/facet
>Reporter: Yuting Gan
>Priority: Minor
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> When looking at the overridden implementation getTopChildren(int topN, String 
> dim, String... path) in RangeFacetCounts, I found that the topN parameter is 
> not being used in the code, and the unit tests did not test this function 
> properly. I will create a PR to fix this, and will look into other overridden 
> implementations and see if they have the same issue. Please let me know if 
> there are any questions. Thanks!






[jira] [Updated] (LUCENE-10538) TopN is not being used in getTopChildren()

2022-05-05 Thread Greg Miller (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Miller updated LUCENE-10538:
-
Component/s: modules/facet

> TopN is not being used in getTopChildren()
> --
>
> Key: LUCENE-10538
> URL: https://issues.apache.org/jira/browse/LUCENE-10538
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/facet
>Reporter: Yuting Gan
>Priority: Minor
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> When looking at the overridden implementation getTopChildren(int topN, String 
> dim, String... path) in RangeFacetCounts, I found that the topN parameter is 
> not being used in the code, and the unit tests did not test this function 
> properly. I will create a PR to fix this, and will look into other overridden 
> implementations and see if they have the same issue. Please let me know if 
> there are any questions. Thanks!






[jira] [Comment Edited] (LUCENE-10550) Add getAllChildren functionality to facets

2022-05-05 Thread Greg Miller (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17532277#comment-17532277
 ] 

Greg Miller edited comment on LUCENE-10550 at 5/5/22 2:37 PM:
--

I'm also +1 on this but with a minor suggestion.

{quote}The proposed getAllChildren API is to return value/range counts sorted 
by label values instead of counts. {quote}

I wonder if we should "sort" at all for this functionality? If we're returning 
all children for a specified path, the caller can just as easily sort by 
whatever criteria they want (or maybe none at all), so sorting within the 
implementation might be wasteful. Also, for range faceting, the user is 
providing a list of ranges they care about up-front in a specific order. I 
would actually propose we retain that order instead of sorting by the range 
"values" in some way. This is what range faceting currently implements 
(somewhat confusingly) behind the {{getTopChildren}} API. The order of those 
ranges might have some meaning to the caller, so it might be best to retain it. 
What do you think?


was (Author: gsmiller):
I'm also +1 on this but with a minor suggestion.

> The proposed getAllChildren API is to return value/range counts sorted by 
> label values instead of counts. 

I wonder if we should "sort" at all for this functionality? If we're returning 
all children for a specified path, the caller can just as easily sort by 
whatever criteria they want (or maybe none at all), so sorting within the 
implementation might be wasteful. Also, for range faceting, the user is 
providing a list of ranges they care about up-front in a specific order. I 
would actually propose we retain that order instead of sorting by the range 
"values" in some way. This is what range faceting currently implements 
(somewhat confusingly) behind the {{getTopChildren}} API. The order of those 
ranges might have some meaning to the caller, so it might be best to retain it. 
What do you think?

> Add getAllChildren functionality to facets
> --
>
> Key: LUCENE-10550
> URL: https://issues.apache.org/jira/browse/LUCENE-10550
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/facet
>Reporter: Yuting Gan
>Priority: Minor
>
> Currently Lucene does not support returning range counts sorted by label 
> values, but there are use cases demanding this feature. For example, a user 
> specifies ranges (e.g., [0, 10], [10, 20]) and wants to get range counts 
> without changing the range order. Today we can only call getTopChildren to 
> populate range counts, but it would return ranges sorted by counts (e.g., 
> [10, 20] 100, [0, 10] 50) instead of range values. 
> Lucene has an API, getAllChildrenSortByValue, that returns numeric values with 
> counts sorted by label values; please see 
> [LUCENE-7927|https://issues.apache.org/jira/browse/LUCENE-7927] for details. 
> Therefore, it would be nice to also have a similar API to support 
> range counts. The proposed getAllChildren API is to return value/range counts 
> sorted by label values instead of counts. 
> This proposal was inspired by the discussions with [~gsmiller] when I was 
> working on the LUCENE-10538 [PR|https://github.com/apache/lucene/pull/843], 
> and we believe users would benefit from adding this API to Facets. 
> Hope I can get some feedback from the community since this proposal would 
> require changes to the getTopChildren API in RangeFacetCounts. Thanks!






[jira] [Commented] (LUCENE-10550) Add getAllChildren functionality to facets

2022-05-05 Thread Greg Miller (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17532277#comment-17532277
 ] 

Greg Miller commented on LUCENE-10550:
--

I'm also +1 on this but with a minor suggestion.

> The proposed getAllChildren API is to return value/range counts sorted by 
> label values instead of counts. 

I wonder if we should "sort" at all for this functionality? If we're returning 
all children for a specified path, the caller can just as easily sort by 
whatever criteria they want (or maybe none at all), so sorting within the 
implementation might be wasteful. Also, for range faceting, the user is 
providing a list of ranges they care about up-front in a specific order. I 
would actually propose we retain that order instead of sorting by the range 
"values" in some way. This is what range faceting currently implements 
(somewhat confusingly) behind the {{getTopChildren}} API. The order of those 
ranges might have some meaning to the caller, so it might be best to retain it. 
What do you think?

> Add getAllChildren functionality to facets
> --
>
> Key: LUCENE-10550
> URL: https://issues.apache.org/jira/browse/LUCENE-10550
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/facet
>Reporter: Yuting Gan
>Priority: Minor
>
> Currently Lucene does not support returning range counts sorted by label 
> values, but there are use cases demanding this feature. For example, a user 
> specifies ranges (e.g., [0, 10], [10, 20]) and wants to get range counts 
> without changing the range order. Today we can only call getTopChildren to 
> populate range counts, but it would return ranges sorted by counts (e.g., 
> [10, 20] 100, [0, 10] 50) instead of range values. 
> Lucene has an API, getAllChildrenSortByValue, that returns numeric values with 
> counts sorted by label values; please see 
> [LUCENE-7927|https://issues.apache.org/jira/browse/LUCENE-7927] for details. 
> Therefore, it would be nice to also have a similar API to support 
> range counts. The proposed getAllChildren API is to return value/range counts 
> sorted by label values instead of counts. 
> This proposal was inspired by the discussions with [~gsmiller] when I was 
> working on the LUCENE-10538 [PR|https://github.com/apache/lucene/pull/843], 
> and we believe users would benefit from adding this API to Facets. 
> Hope I can get some feedback from the community since this proposal would 
> require changes to the getTopChildren API in RangeFacetCounts. Thanks!






[jira] [Updated] (LUCENE-10550) Add getAllChildren functionality to facets

2022-05-05 Thread Greg Miller (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Miller updated LUCENE-10550:
-
Component/s: modules/facet

> Add getAllChildren functionality to facets
> --
>
> Key: LUCENE-10550
> URL: https://issues.apache.org/jira/browse/LUCENE-10550
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/facet
>Reporter: Yuting Gan
>Priority: Minor
>
> Currently Lucene does not support returning range counts sorted by label 
> values, but there are use cases demanding this feature. For example, a user 
> specifies ranges (e.g., [0, 10], [10, 20]) and wants to get range counts 
> without changing the range order. Today we can only call getTopChildren to 
> populate range counts, but it would return ranges sorted by counts (e.g., 
> [10, 20] 100, [0, 10] 50) instead of range values. 
> Lucene has an API, getAllChildrenSortByValue, that returns numeric values with 
> counts sorted by label values; please see 
> [LUCENE-7927|https://issues.apache.org/jira/browse/LUCENE-7927] for details. 
> Therefore, it would be nice to also have a similar API to support 
> range counts. The proposed getAllChildren API is to return value/range counts 
> sorted by label values instead of counts. 
> This proposal was inspired by the discussions with [~gsmiller] when I was 
> working on the LUCENE-10538 [PR|https://github.com/apache/lucene/pull/843], 
> and we believe users would benefit from adding this API to Facets. 
> Hope I can get some feedback from the community since this proposal would 
> require changes to the getTopChildren API in RangeFacetCounts. Thanks!






[jira] [Commented] (LUCENE-10538) TopN is not being used in getTopChildren()

2022-05-02 Thread Greg Miller (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17530905#comment-17530905
 ] 

Greg Miller commented on LUCENE-10538:
--

We discussed this in the PR but I wanted to bring the conversation here for 
visibility as well and to make sure this issue isn't just left hanging.

While it appears buggy that range facets don't use topN, I believe this is 
intentional. Range facets are (somewhat confusingly) overloading the 
{{getTopChildren}} faceting API with slightly different functionality that 
returns counts for all requested ranges in the order the ranges were provided. 
I think this existing functionality is important to retain, and I don't want to 
lose it by truncating to topN.

I also think properly implementing {{getTopChildren}} for range faceting would 
be useful for users. Meaning, a method that actually returns the top-n ranges 
in decreasing count order, just like other faceting implementations.

What I'd actually suggest we do here is add a {{getAllChildren}} method to the 
faceting API. Then we can migrate the existing {{getTopChildren}} functionality 
implemented in range faceting to {{getAllChildren}}. Finally, we can replace 
the existing {{getTopChildren}} range faceting implementation with a proper one.

{{LongValueFacetCounts}} is another faceting implementation where we've already 
implemented "get all children" functionality, so I think there's value beyond 
just range faceting here (i.e., we could migrate that implementation behind a 
new method defined for all {{Facets}}).
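
The difference between the two behaviors described above can be sketched as follows. This is plain Java with hypothetical names (RangeChildrenSketch, LabelAndCount, allChildren, topChildren); it only models ranges as labels with precomputed counts and is not the Lucene implementation.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch only; this is not a Lucene class. */
final class RangeChildrenSketch {

  /** Hypothetical stand-in for a facet label plus its count. */
  static final class LabelAndCount {
    final String label;
    final int count;

    LabelAndCount(String label, int count) {
      this.label = label;
      this.count = count;
    }
  }

  /** "Get all children": every range with its count, in the order the ranges were provided. */
  static List<LabelAndCount> allChildren(String[] labels, int[] counts) {
    List<LabelAndCount> result = new ArrayList<>();
    for (int i = 0; i < labels.length; i++) {
      result.add(new LabelAndCount(labels[i], counts[i]));
    }
    return result;
  }

  /** "Get top children": at most topN ranges, highest count first; ties keep provided order. */
  static List<LabelAndCount> topChildren(int topN, String[] labels, int[] counts) {
    if (topN <= 0) {
      throw new IllegalArgumentException("topN must be > 0");
    }
    List<LabelAndCount> sorted = allChildren(labels, counts);
    sorted.sort((a, b) -> Integer.compare(b.count, a.count)); // List.sort is stable
    return new ArrayList<>(sorted.subList(0, Math.min(topN, sorted.size())));
  }
}
```

With ranges 0-10, 10-20 and 20-30 and counts 50, 100 and 7, allChildren keeps the provided order while topChildren(2, ...) yields 10-20 followed by 0-10.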

> TopN is not being used in getTopChildren()
> --
>
> Key: LUCENE-10538
> URL: https://issues.apache.org/jira/browse/LUCENE-10538
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Yuting Gan
>Priority: Minor
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> When looking at the overridden implementation getTopChildren(int topN, String 
> dim, String... path) in RangeFacetCounts, I found that the topN parameter is 
> not being used in the code, and the unit tests did not test this function 
> properly. I will create a PR to fix this, and will look into other overridden 
> implementations and see if they have the same issue. Please let me know if 
> there are any questions. Thanks!






[jira] [Resolved] (LUCENE-10530) TestTaxonomyFacetAssociations test failure

2022-04-29 Thread Greg Miller (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Miller resolved LUCENE-10530.
--
Fix Version/s: 10.0 (main)
   9.2
   Resolution: Fixed

> TestTaxonomyFacetAssociations test failure
> --
>
> Key: LUCENE-10530
> URL: https://issues.apache.org/jira/browse/LUCENE-10530
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Vigya Sharma
>Priority: Major
> Fix For: 10.0 (main), 9.2
>
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> TestTaxonomyFacetAssociations.testFloatAssociationRandom seems to have some 
> flakiness; it fails with the following random seed.
> {code:java}
> ./gradlew test --tests 
> TestTaxonomyFacetAssociations.testFloatAssociationRandom \ 
> -Dtests.seed=4DFBA8209AC82EB2 -Dtests.slow=true -Dtests.locale=fr-VU \
> -Dtests.timezone=Europe/Athens -Dtests.asserts=true 
> -Dtests.file.encoding=UTF-8 {code}
> This is because of a mismatch in (SUM) aggregated multi-valued, 
> {{float_random}} facet field. We accept an error delta of 1 in this 
> aggregation, but for the failing random seed, the delta is 1.3. Maybe we 
> should change this delta to 1.5?
> My hunch is that it is some floating point approximation error. I'm unable to 
> repro it without the randomization seed.
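
For readers unfamiliar with the kind of floating point error suspected above, the following sketch shows that a float sum can depend on the order in which values are added, and how a relative tolerance differs from a fixed delta. It is generic Java, not the test code from this issue; FloatSumOrderSketch and closeEnough are hypothetical names, and a relative tolerance is only one possible approach, not necessarily the fix that was applied.

```java
/** Hypothetical sketch only; unrelated to the actual Lucene test code. */
final class FloatSumOrderSketch {

  /** Sums left to right in float precision. */
  static float sum(float[] values) {
    float total = 0f;
    for (float v : values) {
      total += v;
    }
    return total;
  }

  /** Compares with a tolerance that scales with the magnitude of the operands. */
  static boolean closeEnough(float expected, float actual, float relTol) {
    float scale = Math.max(Math.abs(expected), Math.abs(actual));
    return Math.abs(expected - actual) <= relTol * scale;
  }

  public static void main(String[] args) {
    float big = 16777216f; // 2^24: above this, float can no longer represent every integer
    float bigFirst = sum(new float[] {big, 1f, 1f});   // each +1 is rounded away
    float smallFirst = sum(new float[] {1f, 1f, big}); // 1 + 1 = 2 survives the final add
    System.out.println(bigFirst + " vs " + smallFirst);
    System.out.println(closeEnough(bigFirst, smallFirst, 1e-6f));
  }
}
```

The same three values produce sums that differ by 2 depending on order, so an absolute delta that looks generous for small sums can be too tight for sums in the millions.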






[jira] [Resolved] (LUCENE-10529) TestTaxonomyFacetAssociations may have floating point issues

2022-04-29 Thread Greg Miller (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Miller resolved LUCENE-10529.
--
Fix Version/s: 10.0 (main)
   9.2
   Resolution: Fixed

> TestTaxonomyFacetAssociations may have floating point issues
> 
>
> Key: LUCENE-10529
> URL: https://issues.apache.org/jira/browse/LUCENE-10529
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Robert Muir
>Priority: Major
> Fix For: 10.0 (main), 9.2
>
>
> Hit this in a jenkins CI build while testing something else:
> {noformat}
> gradlew test --tests TestTaxonomyFacetAssociations.testFloatAssociationRandom 
> -Dtests.seed=B39C450F4870F7F1 -Dtests.locale=ar-IQ 
> -Dtests.timezone=America/Rankin_Inlet -Dtests.asserts=true 
> -Dtests.file.encoding=UTF-8
> ...
> org.apache.lucene.facet.taxonomy.TestTaxonomyFacetAssociations > 
> testFloatAssociationRandom FAILED
> java.lang.AssertionError: expected:<2605996.5> but was:<2605995.2>
> {noformat}






[jira] [Created] (LUCENE-10547) Implement "flattened" Facets#getTopChildren

2022-04-29 Thread Greg Miller (Jira)
Greg Miller created LUCENE-10547:


 Summary: Implement "flattened" Facets#getTopChildren
 Key: LUCENE-10547
 URL: https://issues.apache.org/jira/browse/LUCENE-10547
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/facet
Reporter: Greg Miller


The current implementation of {{Facets#getTopChildren}} only considers the 
immediate children of the user-provided path. In many cases, this is probably 
what the user is looking for, but it would be useful to also have an 
implementation that considers any descendant of the path, regardless of 
"level." This would allow the user to build a deeper set of facet path options 
in "one shot," instead of having to iteratively call {{getTopChildren}}.

Of course the shallower paths, and specifically the immediate children of the 
provided path, will always outweigh "deeper" paths due to counts/weights 
accumulating along the ancestry paths, but by providing a topN value larger 
than the number of immediate children, the user could build up a more complete 
view of path options in a taxonomy with a lot of depth.
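
A rough sketch of the "flattened" behavior, under simplifying assumptions: paths are modeled as slash-separated strings, counts are given in a plain map and are assumed to already include the contributions of descendants, and the names FlattenedTopChildrenSketch and topDescendants are hypothetical. This is not a proposed Lucene signature.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch only; this is not a Lucene class. */
final class FlattenedTopChildrenSketch {

  /** Returns the topN descendants of parent at any depth, highest count first. */
  static List<String> topDescendants(Map<String, Integer> pathCounts, String parent, int topN) {
    if (topN <= 0) {
      throw new IllegalArgumentException("topN must be > 0");
    }
    String prefix = parent + "/"; // trailing slash so "a" does not match "ab/..."
    List<Map.Entry<String, Integer>> candidates = new ArrayList<>();
    for (Map.Entry<String, Integer> e : pathCounts.entrySet()) {
      if (e.getKey().startsWith(prefix)) {
        candidates.add(e);
      }
    }
    // Highest count first; ties broken by path so the result is deterministic.
    candidates.sort((a, b) -> {
      int byCount = Integer.compare(b.getValue(), a.getValue());
      return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
    });
    List<String> result = new ArrayList<>();
    for (int i = 0; i < Math.min(topN, candidates.size()); i++) {
      result.add(candidates.get(i).getKey());
    }
    return result;
  }
}
```

For counts a=10, a/b=6, a/c=4, a/b/x=5, a/b/y=1, asking for the top 3 under "a" returns a/b, a/b/x, a/c, whereas an immediate-children-only call could only ever return a/b and a/c.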






[jira] [Created] (LUCENE-10546) Update Faceting user guide

2022-04-29 Thread Greg Miller (Jira)
Greg Miller created LUCENE-10546:


 Summary: Update Faceting user guide
 Key: LUCENE-10546
 URL: https://issues.apache.org/jira/browse/LUCENE-10546
 Project: Lucene - Core
  Issue Type: Wish
  Components: modules/facet
Reporter: Greg Miller


The [facet user 
guide|https://lucene.apache.org/core/4_1_0/facet/org/apache/lucene/facet/doc-files/userguide.html] 
was written based on 4.1. Since there's been a fair amount of active 
facet-related development over the last year+, it would be nice to review the 
guide and see what updates make sense.






[jira] [Commented] (LUCENE-10544) Should ExitableTermsEnum wrap postings and impacts?

2022-04-29 Thread Greg Miller (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1753#comment-1753
 ] 

Greg Miller commented on LUCENE-10544:
--

{quote}In my opinion, a better solution that has less overhead and would still 
support cancelling such slow queries consists of leveraging 
{{BulkScorer#score}} to score small-ish ranges of doc IDs at a time.
{quote}
+1. We've had success by implementing a "timeout enforcing" Query that does 
timeout enforcement within the Scorer it provides as a short-term solution, but 
there are a number of flaws with this approach. Hooking into the BulkScorer 
makes sense but does need some thought as [~dpsharma] mentions since Queries 
may (and do!) provide their own BulkScorers in some cases (e.g., 
{{{}BooleanScorer{}}}).
{quote}Long-term I'd like ExitableDirectoryReader and other tooling to handle 
cancellation/timeout to become mostly implementation details, and have proper 
support directly on IndexSearcher (LUCENE-10151).
{quote}
+1. For full disclosure, [~dpsharma] and I work together at Amazon and she is 
working on LUCENE-10151. One idea is to use {{ExitableDirectoryReader}} as an 
internal implementation detail of {{IndexSearcher}} to add first-class timeout 
support. While we were debugging some prototype code, we ran into this issue 
with {{ExitableDirectoryReader}} and I thought it warranted a spin-off issue 
since it seems like something we might want to generally fix.
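
The windowed approach from the first quote could be sketched roughly like this. It is a simplified model in plain Java, not Lucene code: RangeScorer is a hypothetical stand-in for a bulk scorer that can score a doc ID range, TimeExceeded is a made-up exception, and the timeout is represented by a caller-supplied boolean check.

```java
import java.util.function.BooleanSupplier;

/** Hypothetical sketch only; this is not a Lucene class. */
final class WindowedScoringSketch {

  /** Hypothetical stand-in for something that scores all docs in [min, max). */
  interface RangeScorer {
    void score(int min, int max);
  }

  /** Thrown when the timeout check fires between two windows. */
  static final class TimeExceeded extends RuntimeException {}

  /**
   * Scores [0, maxDoc) in windows of at most windowSize docs, checking the timeout once
   * before each window. Returns the number of windows that were scored.
   */
  static int scoreAll(RangeScorer delegate, int maxDoc, int windowSize, BooleanSupplier timedOut) {
    if (windowSize <= 0) {
      throw new IllegalArgumentException("windowSize must be > 0");
    }
    int windows = 0;
    int min = 0;
    while (min < maxDoc) {
      if (timedOut.getAsBoolean()) {
        throw new TimeExceeded();
      }
      // Use long arithmetic so min + windowSize cannot overflow near Integer.MAX_VALUE.
      int max = (int) Math.min((long) min + windowSize, (long) maxDoc);
      delegate.score(min, max); // the delegate keeps its own (possibly custom) scoring logic
      windows++;
      min = max;
    }
    return windows;
  }
}
```

The window size trades off check overhead against how long a query can overrun its deadline: the timeout is only observed at window boundaries.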

> Should ExitableTermsEnum wrap postings and impacts?
> ---
>
> Key: LUCENE-10544
> URL: https://issues.apache.org/jira/browse/LUCENE-10544
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/index
>Reporter: Greg Miller
>Priority: Major
>
> While looking into options for LUCENE-10151, I noticed that 
> {{ExitableDirectoryReader}} doesn't actually do any timeout checking once you 
> start iterating postings/impacts. It *does* create an {{ExitableTermsEnum}} 
> wrapper when loading a {{{}TermsEnum{}}}, but that wrapper doesn't do 
> anything to wrap postings or impacts. So timeouts will be enforced when 
> moving to the "next" term, but not when iterating the postings/impacts 
> associated with a term.
> I think we ought to wrap the postings/impacts as well with some form of 
> timeout checking so timeouts can be enforced on long-running queries. I'm not 
> sure why this wasn't done originally (back in 2014), but it was questioned 
> back in 2020 on the original Jira SOLR-5986. Does anyone know of a good 
> reason why we shouldn't enforce timeouts in this way?
> Related, we may also want to wrap things like {{seekExact}} and {{seekCeil}} 
> given that only {{next}} is being wrapped currently.






[jira] [Commented] (LUCENE-10544) Should ExitableTermsEnum wrap postings and impacts?

2022-04-28 Thread Greg Miller (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17529481#comment-17529481
 ] 

Greg Miller commented on LUCENE-10544:
--

Thanks [~jpountz]. One issue with the collector approach is that it doesn't 
catch two-phase iterator situations where there are many approximate hits but 
very few confirmed matches, since the collector will only be invoked after a 
match is confirmed. If the second phase check is costly, this can be 
particularly problematic. So it would be nice to enforce the check at a 
lower-level and solve for this issue if possible.
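
A toy model of the two-phase situation described here: it counts how many candidates go through the (potentially costly) confirmation step versus how often a collector would actually be invoked. Everything in it is hypothetical (TwoPhaseTimeoutSketch, run, and the two predicates standing in for the approximation and the second-phase check); it does not use Lucene's iterator classes.

```java
import java.util.function.IntPredicate;

/** Hypothetical sketch only; this is not a Lucene class. */
final class TwoPhaseTimeoutSketch {

  /**
   * Walks docs [0, maxDoc). Returns {confirmChecks, collectorCalls}: how many approximate hits
   * went through the second-phase check, and how many confirmed matches reached the collector.
   */
  static int[] run(int maxDoc, IntPredicate approximation, IntPredicate confirm) {
    int confirmChecks = 0;
    int collectorCalls = 0;
    for (int doc = 0; doc < maxDoc; doc++) {
      if (!approximation.test(doc)) {
        continue;
      }
      confirmChecks++; // the expensive work happens here, before any collector sees the doc
      if (confirm.test(doc)) {
        collectorCalls++; // a collector-level timeout check could only run at this point
      }
    }
    return new int[] {confirmChecks, collectorCalls};
  }
}
```

If 1000 docs pass the approximation but only two are confirmed, a collector-based check gets just two chances to notice a timeout while the second phase has run 1000 times.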

> Should ExitableTermsEnum wrap postings and impacts?
> ---
>
> Key: LUCENE-10544
> URL: https://issues.apache.org/jira/browse/LUCENE-10544
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/index
>Reporter: Greg Miller
>Priority: Major
>
> While looking into options for LUCENE-10151, I noticed that 
> {{ExitableDirectoryReader}} doesn't actually do any timeout checking once you 
> start iterating postings/impacts. It *does* create an {{ExitableTermsEnum}} 
> wrapper when loading a {{{}TermsEnum{}}}, but that wrapper doesn't do 
> anything to wrap postings or impacts. So timeouts will be enforced when 
> moving to the "next" term, but not when iterating the postings/impacts 
> associated with a term.
> I think we ought to wrap the postings/impacts as well with some form of 
> timeout checking so timeouts can be enforced on long-running queries. I'm not 
> sure why this wasn't done originally (back in 2014), but it was questioned 
> back in 2020 on the original Jira SOLR-5986. Does anyone know of a good 
> reason why we shouldn't enforce timeouts in this way?
> Related, we may also want to wrap things like {{seekExact}} and {{seekCeil}} 
> given that only {{next}} is being wrapped currently.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)




[jira] [Updated] (LUCENE-10544) Should ExitableTermsEnum wrap postings and impacts?

2022-04-28 Thread Greg Miller (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Miller updated LUCENE-10544:
-
Description: 
While looking into options for LUCENE-10151, I noticed that 
{{ExitableDirectoryReader}} doesn't actually do any timeout checking once you 
start iterating postings/impacts. It *does* create a {{ExitableTermsEnum}} 
wrapper when loading a {{{}TermsEnum{}}}, but that wrapper doesn't do anything 
to wrap postings or impacts. So timeouts will be enforced when moving to the 
"next" term, but not when iterating the postings/impacts associated with a term.

I think we ought to wrap the postings/impacts as well with some form of timeout 
checking so timeouts can be enforced on long-running queries. I'm not sure why 
this wasn't done originally (back in 2014), but it was questioned back in 2020 
on the original Jira SOLR-5986. Does anyone know of a good reason why we 
shouldn't enforce timeouts in this way?

Related, we may also want to wrap things like {{seekExact}} and {{seekCeil}} 
given that only {{next}} is being wrapped currently.

  was:
While looking into options for 
[LUCENE-10151|https://issues.apache.org/jira/browse/LUCENE-10151], I noticed 
that {{ExitableDirectoryReader}} doesn't actually do any timeout checking once 
you start iterating postings/impact. The does create a {{ExitableTermsEnum}} 
wrapper when loading a {{TermsEnum}}, but that wrapper doesn't do anything to 
wrap postings or impact. So timeouts will be enforced when moving to the "next" 
term, but not when iterating the postings/impact associated with a term.

I think we ought to wrap the postings/impacts as well with some form of timeout 
checking so timeouts can be enforced on long-running queries. I'm not sure why 
this wasn't done originally (back in 2014), but it was questioned back in 2020 
on the original Jira 
[SOLR-5986|https://issues.apache.org/jira/browse/SOLR-5986?focusedCommentId=17177009=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17177009].
 Does anyone know of a good reason why we shouldn't enforce timeouts in this 
way?

Related, we may also want to wrap things like {{seekExact}} and {{seekCeil}} 
given that only {{next}} is being wrapped currently.


> Should ExitableTermsEnum wrap postings and impacts?
> ---
>
> Key: LUCENE-10544
> URL: https://issues.apache.org/jira/browse/LUCENE-10544
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/index
>Reporter: Greg Miller
>Priority: Major
>
> While looking into options for LUCENE-10151, I noticed that 
> {{ExitableDirectoryReader}} doesn't actually do any timeout checking once you 
> start iterating postings/impacts. It *does* create a {{ExitableTermsEnum}} 
> wrapper when loading a {{{}TermsEnum{}}}, but that wrapper doesn't do 
> anything to wrap postings or impacts. So timeouts will be enforced when 
> moving to the "next" term, but not when iterating the postings/impacts 
> associated with a term.
> I think we ought to wrap the postings/impacts as well with some form of 
> timeout checking so timeouts can be enforced on long-running queries. I'm not 
> sure why this wasn't done originally (back in 2014), but it was questioned 
> back in 2020 on the original Jira SOLR-5986. Does anyone know of a good 
> reason why we shouldn't enforce timeouts in this way?
> Related, we may also want to wrap things like {{seekExact}} and {{seekCeil}} 
> given that only {{next}} is being wrapped currently.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)




[jira] [Created] (LUCENE-10544) Should ExitableTermsEnum wrap postings and impacts?

2022-04-28 Thread Greg Miller (Jira)
Greg Miller created LUCENE-10544:


 Summary: Should ExitableTermsEnum wrap postings and impacts?
 Key: LUCENE-10544
 URL: https://issues.apache.org/jira/browse/LUCENE-10544
 Project: Lucene - Core
  Issue Type: Bug
  Components: core/index
Reporter: Greg Miller


While looking into options for 
[LUCENE-10151|https://issues.apache.org/jira/browse/LUCENE-10151], I noticed 
that {{ExitableDirectoryReader}} doesn't actually do any timeout checking once 
you start iterating postings/impacts. It does create a {{ExitableTermsEnum}} 
wrapper when loading a {{TermsEnum}}, but that wrapper doesn't do anything to 
wrap postings or impacts. So timeouts will be enforced when moving to the "next" 
term, but not when iterating the postings/impacts associated with a term.

I think we ought to wrap the postings/impacts as well with some form of timeout 
checking so timeouts can be enforced on long-running queries. I'm not sure why 
this wasn't done originally (back in 2014), but it was questioned back in 2020 
on the original Jira 
[SOLR-5986|https://issues.apache.org/jira/browse/SOLR-5986?focusedCommentId=17177009=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17177009].
 Does anyone know of a good reason why we shouldn't enforce timeouts in this 
way?

Related, we may also want to wrap things like {{seekExact}} and {{seekCeil}} 
given that only {{next}} is being wrapped currently.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)




[jira] [Commented] (LUCENE-10274) Implement "hyperrectangle" faceting

2022-04-27 Thread Greg Miller (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17529122#comment-17529122
 ] 

Greg Miller commented on LUCENE-10274:
--

Exciting! I'll have a look at the PR in the next couple of days and get some 
feedback your way if nobody beats me to it. Thanks so much for having a go at 
this!

> Implement "hyperrectangle" faceting
> ---
>
> Key: LUCENE-10274
> URL: https://issues.apache.org/jira/browse/LUCENE-10274
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/facet
>Reporter: Greg Miller
>Priority: Minor
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> I'd be interested in expanding Lucene's faceting capabilities to aggregate a 
> point field against a set of user-provided n-dimensional 
> [hyperrectangles|https://en.wikipedia.org/wiki/Hyperrectangle]. This would be 
> a generalization of {{LongRangeFacets}} / {{DoubleRangeFacets}} from a single 
> dimension to n dimensions, and would complement {{PointRangeQuery}} well, 
> providing the ability to facet ahead of "drilling down" on such a query.
> As a motivating use-case, imagine searching against movie documents that 
> contain a 2-dimensional point storing "awards" the movie has received. One 
> dimension encodes the year the award was won, while the other encodes the 
> type of award as an ordinal. For example, the film "Nomadland" won the 
> "Academy Awards Best Picture" award in 2021. Imagine providing a 
> two-dimensional refinement to users allowing them to filter by the 
> combination of award + year in a single action (e.g., using 
> {{{}PointRangeQuery{}}}) and needing to get facet counts for these 
> combinations ahead of time.
> Curious if the community thinks this functionality would be useful. Any 
> thoughts? 
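
As a rough illustration of what counting against hyperrectangles means, the following plain-Java sketch counts points per axis-aligned box by brute force. The class and method names are hypothetical and nothing here reflects the API in the PR; it only shows the containment test generalized from one dimension to n.

```java
class HyperRectangleCountSketch {
  // A hypothetical axis-aligned box: min[d] <= point[d] <= max[d] for every dimension d.
  static boolean contains(long[] min, long[] max, long[] point) {
    for (int d = 0; d < point.length; d++) {
      if (point[d] < min[d] || point[d] > max[d]) {
        return false;
      }
    }
    return true;
  }

  // Returns one count per rectangle: how many points fall inside it.
  static int[] count(long[][] mins, long[][] maxs, long[][] points) {
    int[] counts = new int[mins.length];
    for (long[] p : points) {
      for (int r = 0; r < mins.length; r++) {
        if (contains(mins[r], maxs[r], p)) {
          counts[r]++;
        }
      }
    }
    return counts;
  }
}
```

In the movie example, each point would be (year, award ordinal) and each rectangle one year/award combination offered as a refinement.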



--
This message was sent by Atlassian Jira
(v8.20.7#820007)




[jira] [Commented] (LUCENE-10529) TestTaxonomyFacetAssociations may have floating point issues

2022-04-27 Thread Greg Miller (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17529049#comment-17529049
 ] 

Greg Miller commented on LUCENE-10529:
--

I've got a PR up for this, but associated it with the dup Jira (10530). Linking 
the PR here as well for visibility. Since the fix for the NPE also reported 
here was trivial, I pushed it last night separately.

https://github.com/apache/lucene/pull/848

> TestTaxonomyFacetAssociations may have floating point issues
> 
>
> Key: LUCENE-10529
> URL: https://issues.apache.org/jira/browse/LUCENE-10529
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Robert Muir
>Priority: Major
>
> Hit this in a jenkins CI build while testing something else:
> {noformat}
> gradlew test --tests TestTaxonomyFacetAssociations.testFloatAssociationRandom 
> -Dtests.seed=B39C450F4870F7F1 -Dtests.locale=ar-IQ 
> -Dtests.timezone=America/Rankin_Inlet -Dtests.asserts=true 
> -Dtests.file.encoding=UTF-8
> ...
> org.apache.lucene.facet.taxonomy.TestTaxonomyFacetAssociations > 
> testFloatAssociationRandom FAILED
> java.lang.AssertionError: expected:<2605996.5> but was:<2605995.2>
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)




[jira] [Commented] (LUCENE-10530) TestTaxonomyFacetAssociations test failure

2022-04-27 Thread Greg Miller (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17529048#comment-17529048
 ] 

Greg Miller commented on LUCENE-10530:
--

PR is up for this.

> TestTaxonomyFacetAssociations test failure
> --
>
> Key: LUCENE-10530
> URL: https://issues.apache.org/jira/browse/LUCENE-10530
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Vigya Sharma
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> TestTaxonomyFacetAssociations.testFloatAssociationRandom seems to have some 
> flakiness; it fails on the following random seed.
> {code:java}
> ./gradlew test --tests 
> TestTaxonomyFacetAssociations.testFloatAssociationRandom \ 
> -Dtests.seed=4DFBA8209AC82EB2 -Dtests.slow=true -Dtests.locale=fr-VU \
> -Dtests.timezone=Europe/Athens -Dtests.asserts=true 
> -Dtests.file.encoding=UTF-8 {code}
> This is because of a mismatch in (SUM) aggregated multi-valued, 
> {{float_random}} facet field. We accept an error delta of 1 in this 
> aggregation, but for the failing random seed, the delta is 1.3. Maybe we 
> should change this delta to 1.5?
> My hunch is that it is some floating point approximation error. I'm unable to 
> repro it without the randomization seed.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)




[jira] [Commented] (LUCENE-10530) TestTaxonomyFacetAssociations test failure

2022-04-27 Thread Greg Miller (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17529027#comment-17529027
 ] 

Greg Miller commented on LUCENE-10530:
--

Instead of increasing the acceptable delta, I'd prefer to ensure we sum the 
floats in the same order and then expect them to be exactly equal. This should 
be a more robust solution than fiddling with the delta every time we trip a 
random case that breaks things. The issue is that we keep track of all float 
values we index for the purpose of determining "expected" sums, but the order 
ends up differing from the order we visit the values when iterating the index. 
I think I have a solution that lets us reconcile this ordering difference.
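
For anyone wondering why the order matters: float addition is not associative, so the same values summed in two different orders can give different results. A small standalone sketch (the class name and the values are made up for illustration, not taken from the test):

```java
class FloatSumOrderSketch {
  // Sums values left to right in float precision.
  static float sumInOrder(float[] values) {
    float sum = 0f;
    for (float v : values) {
      sum += v;
    }
    return sum;
  }

  public static void main(String[] args) {
    float[] indexedOrder = {1e8f, 1f, -1e8f};
    float[] visitedOrder = {1e8f, -1e8f, 1f};
    // 1e8f + 1f rounds back to 1e8f, so the 1 is lost before the cancellation.
    System.out.println(sumInOrder(indexedOrder)); // prints 0.0
    // Here the large values cancel first, so the 1 survives.
    System.out.println(sumInOrder(visitedOrder)); // prints 1.0
  }
}
```

Summing the expected values in the same order as the index visits them removes this source of difference, which is why an exact comparison becomes possible.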

> TestTaxonomyFacetAssociations test failure
> --
>
> Key: LUCENE-10530
> URL: https://issues.apache.org/jira/browse/LUCENE-10530
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Vigya Sharma
>Priority: Major
>
> TestTaxonomyFacetAssociations.testFloatAssociationRandom seems to have some 
> flakiness; it fails on the following random seed.
> {code:java}
> ./gradlew test --tests 
> TestTaxonomyFacetAssociations.testFloatAssociationRandom \ 
> -Dtests.seed=4DFBA8209AC82EB2 -Dtests.slow=true -Dtests.locale=fr-VU \
> -Dtests.timezone=Europe/Athens -Dtests.asserts=true 
> -Dtests.file.encoding=UTF-8 {code}
> This is because of a mismatch in (SUM) aggregated multi-valued, 
> {{float_random}} facet field. We accept an error delta of 1 in this 
> aggregation, but for the failing random seed, the delta is 1.3. Maybe we 
> should change this delta to 1.5?
> My hunch is that it is some floating point approximation error. I'm unable to 
> repro it without the randomization seed.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)




[jira] [Commented] (LUCENE-10204) Support iteration of sub-matches in join queries (ToParentBlockJoinQuery / ToChildBlockJoinQuery)

2022-04-26 Thread Greg Miller (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17528507#comment-17528507
 ] 

Greg Miller commented on LUCENE-10204:
--

Yeah +1 to not pursuing further right now. Because various query evaluation 
optimizations (current and possibly future) mean that not all children will 
necessarily be visited in a complex disjunction clause when determining 
matching parents, I think it's fundamentally flawed to try to track all child 
hits while evaluating the query. For example, in BMW, a sub-clause may never 
get advanced to a given parent match if it's determined to be a match based on 
a minimum number of other clauses confirming the match.

From what I can tell, the only accurate way to find all child matches is to 
issue a separate query that identifies them, and doesn't "join" to the parents.

> Support iteration of sub-matches in join queries (ToParentBlockJoinQuery / 
> ToChildBlockJoinQuery)
> -
>
> Key: LUCENE-10204
> URL: https://issues.apache.org/jira/browse/LUCENE-10204
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/join
>Reporter: Greg Miller
>Priority: Minor
>
> It would be nice to be able to iterate over the "sub-matches" in these join 
> queries for the purpose of faceting (or possibly other use-cases?).
> For example, we have a use-case where our query matches on "child" docs, 
> using a {{ToParentBlockJoinQuery}} to "emit" the associated parents, which 
> are ultimately added to our match set. But, we want to iterate over the 
> matching "children" for the purpose of faceting.
> To make it concrete, consider searching over a product catalog where "offers" 
> and "items" are indexed side-by-side, with the offers being represented as 
> "children" of the parent items. An offer contains information like 
> "condition" (new vs. used), selling price, etc. for the parent item. If we 
> want to facet on "condition", we want to observe all children that matched 
> the query to know if the parent item had a "new" or "used" offer (or both). 
> This requires iterating over the child matches when faceting, which we cannot 
> do today since the child hit information isn't retained anywhere.
> We can support this by "caching" the child hits in a bitset but there is some 
> complexity when multiple join queries appear in a query structure (would need 
> to logically combine various "cached" bitsets using the same boolean 
> operations as in the original query structure).



--
This message was sent by Atlassian Jira
(v8.20.7#820007)




[jira] [Resolved] (LUCENE-10204) Support iteration of sub-matches in join queries (ToParentBlockJoinQuery / ToChildBlockJoinQuery)

2022-04-26 Thread Greg Miller (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Miller resolved LUCENE-10204.
--
Resolution: Won't Fix

> Support iteration of sub-matches in join queries (ToParentBlockJoinQuery / 
> ToChildBlockJoinQuery)
> -
>
> Key: LUCENE-10204
> URL: https://issues.apache.org/jira/browse/LUCENE-10204
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/join
>Reporter: Greg Miller
>Priority: Minor
>
> It would be nice to be able to iterate over the "sub-matches" in these join 
> queries for the purpose of faceting (or possibly other use-cases?).
> For example, we have a use-case where our query matches on "child" docs, 
> using a {{ToParentBlockJoinQuery}} to "emit" the associated parents, which 
> are ultimately added to our match set. But, we want to iterate over the 
> matching "children" for the purpose of faceting.
> To make it concrete, consider searching over a product catalog where "offers" 
> and "items" are indexed side-by-side, with the offers being represented as 
> "children" of the parent items. An offer contains information like 
> "condition" (new vs. used), selling price, etc. for the parent item. If we 
> want to facet on "condition", we want to observe all children that matched 
> the query to know if the parent item had a "new" or "used" offer (or both). 
> This requires iterating over the child matches when faceting, which we cannot 
> do today since the child hit information isn't retained anywhere.
> We can support this by "caching" the child hits in a bitset but there is some 
> complexity when multiple join queries appear in a query structure (would need 
> to logically combine various "cached" bitsets using the same boolean 
> operations as in the original query structure).



--
This message was sent by Atlassian Jira
(v8.20.7#820007)




[jira] [Commented] (LUCENE-10529) TestTaxonomyFacetAssociations may have floating point issues

2022-04-26 Thread Greg Miller (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17528504#comment-17528504
 ] 

Greg Miller commented on LUCENE-10529:
--

Just pushed a fix for the NPE (rare random case where no docs get indexed for a 
dim in the test case was handled incorrectly). For some reason I'm not able to 
repro the originally reported floating point precision issue but I am able to 
reproduce with the seed in LUCENE-10530. I'll work on a fix for that tomorrow. 
Thanks for reporting and apologies for the random test failures.

> TestTaxonomyFacetAssociations may have floating point issues
> 
>
> Key: LUCENE-10529
> URL: https://issues.apache.org/jira/browse/LUCENE-10529
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Robert Muir
>Priority: Major
>
> Hit this in a jenkins CI build while testing something else:
> {noformat}
> gradlew test --tests TestTaxonomyFacetAssociations.testFloatAssociationRandom 
> -Dtests.seed=B39C450F4870F7F1 -Dtests.locale=ar-IQ 
> -Dtests.timezone=America/Rankin_Inlet -Dtests.asserts=true 
> -Dtests.file.encoding=UTF-8
> ...
> org.apache.lucene.facet.taxonomy.TestTaxonomyFacetAssociations > 
> testFloatAssociationRandom FAILED
> java.lang.AssertionError: expected:<2605996.5> but was:<2605995.2>
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)




[jira] [Commented] (LUCENE-10529) TestTaxonomyFacetAssociations may have floating point issues

2022-04-26 Thread Greg Miller (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17528401#comment-17528401
 ] 

Greg Miller commented on LUCENE-10529:
--

Looks like maybe the same thing reported in LUCENE-10530. I'll have a look at 
this.

> TestTaxonomyFacetAssociations may have floating point issues
> 
>
> Key: LUCENE-10529
> URL: https://issues.apache.org/jira/browse/LUCENE-10529
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Robert Muir
>Priority: Major
>
> Hit this in a jenkins CI build while testing something else:
> {noformat}
> gradlew test --tests TestTaxonomyFacetAssociations.testFloatAssociationRandom 
> -Dtests.seed=B39C450F4870F7F1 -Dtests.locale=ar-IQ 
> -Dtests.timezone=America/Rankin_Inlet -Dtests.asserts=true 
> -Dtests.file.encoding=UTF-8
> ...
> org.apache.lucene.facet.taxonomy.TestTaxonomyFacetAssociations > 
> testFloatAssociationRandom FAILED
> java.lang.AssertionError: expected:<2605996.5> but was:<2605995.2>
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)




[jira] [Commented] (LUCENE-10530) TestTaxonomyFacetAssociations test failure

2022-04-26 Thread Greg Miller (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17528402#comment-17528402
 ] 

Greg Miller commented on LUCENE-10530:
--

Possibly the same issue also reported in LUCENE-10529. I'll have a look.

> TestTaxonomyFacetAssociations test failure
> --
>
> Key: LUCENE-10530
> URL: https://issues.apache.org/jira/browse/LUCENE-10530
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Vigya Sharma
>Priority: Major
>
> TestTaxonomyFacetAssociations.testFloatAssociationRandom seems to have some 
> flakiness; it fails on the following random seed.
> {code:java}
> ./gradlew test --tests 
> TestTaxonomyFacetAssociations.testFloatAssociationRandom \ 
> -Dtests.seed=4DFBA8209AC82EB2 -Dtests.slow=true -Dtests.locale=fr-VU \
> -Dtests.timezone=Europe/Athens -Dtests.asserts=true 
> -Dtests.file.encoding=UTF-8 {code}
> This is because of a mismatch in (SUM) aggregated multi-valued, 
> {{float_random}} facet field. We accept an error delta of 1 in this 
> aggregation, but for the failing random seed, the delta is 1.3. Maybe we 
> should change this delta to 1.5?
> My hunch is that it is some floating point approximation error. I'm unable to 
> repro it without the randomization seed.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)




[jira] [Resolved] (LUCENE-10495) Fix return statement of siblingsLoaded() in TaxonomyFacets

2022-04-26 Thread Greg Miller (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Miller resolved LUCENE-10495.
--
Fix Version/s: 9.2
   Resolution: Fixed

> Fix return statement of siblingsLoaded() in TaxonomyFacets
> --
>
> Key: LUCENE-10495
> URL: https://issues.apache.org/jira/browse/LUCENE-10495
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Yuting Gan
>Priority: Minor
> Fix For: 9.2
>
> Attachments: Screen Shot 2022-03-30 at 8.02.15 PM.png
>
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> Found a bug in TaxonomyFacets when trying to use the siblingsLoaded function. 
> siblingsLoaded() should return siblings != null and it returns children != 
> null currently. 
>  
>  
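
The shape of the bug can be shown with a small standalone sketch. This is not the TaxonomyFacets source; the class, its fields and the loader methods are hypothetical and only mirror the description above, where two lazily loaded arrays each need their own null check.

```java
class LazyTaxonomyArraysSketch {
  private int[] children; // loaded lazily
  private int[] siblings; // loaded lazily, independently of children

  void loadChildren() {
    children = new int[] {1, 2};
  }

  void loadSiblings() {
    siblings = new int[] {-1, 2};
  }

  boolean childrenLoaded() {
    return children != null;
  }

  // The fix described above: test the siblings array, not the children array.
  boolean siblingsLoaded() {
    return siblings != null;
  }
}
```

With the buggy version (returning children != null), a caller that had loaded only children would be told siblings were ready too.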



--
This message was sent by Atlassian Jira
(v8.20.7#820007)




[jira] [Resolved] (LUCENE-10444) Support alternate aggregation functions in association facets

2022-04-07 Thread Greg Miller (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Miller resolved LUCENE-10444.
--
Fix Version/s: 9.2
   Resolution: Fixed

> Support alternate aggregation functions in association facets
> -
>
> Key: LUCENE-10444
> URL: https://issues.apache.org/jira/browse/LUCENE-10444
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/facet
>Reporter: Greg Miller
>Assignee: Greg Miller
>Priority: Minor
> Fix For: 9.2
>
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> We currently only support {{sum}} aggregations in the various association 
> facet implementations. I'd be really interested in extending the association 
> facet implementations to support other aggregations, starting with {{max}} 
> and {{min}} (in addition to {{{}sum{}}}). 
> I've been sketching up a prototype of this and I think I have a reasonable 
> way to introduce this idea. Will get a PR out for feedback soon.
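
One way to picture pluggable aggregations is as a small set of binary operators folded over the association values. The sketch below is an assumption for illustration only; AggregationSketch is a made-up name and is not the class introduced by this change.

```java
import java.util.function.IntBinaryOperator;

enum AggregationSketch {
  SUM((a, b) -> a + b),
  MAX(Math::max),
  MIN(Math::min);

  private final IntBinaryOperator op;

  AggregationSketch(IntBinaryOperator op) {
    this.op = op;
  }

  // Folds the values with this function; expects at least one value.
  int aggregate(int... values) {
    int acc = values[0];
    for (int i = 1; i < values.length; i++) {
      acc = op.applyAsInt(acc, values[i]);
    }
    return acc;
  }
}
```

The counting loop stays the same for every function; only the operator applied when a value is merged into the running total changes.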



--
This message was sent by Atlassian Jira
(v8.20.1#820001)




[jira] [Commented] (LUCENE-10488) Optimize Facets#getTopDims across Facets implementations

2022-04-07 Thread Greg Miller (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17518939#comment-17518939
 ] 

Greg Miller commented on LUCENE-10488:
--

Very exciting. Thanks [~yutinggan]! Also, please note that the refactoring 
change I mentioned above for association facets is now merged (LUCENE-10444), 
so it should be easy now to move forward with optimizations there as well if 
you're interested (or if anyone else is interested). Thanks again!

> Optimize Facets#getTopDims across Facets implementations
> 
>
> Key: LUCENE-10488
> URL: https://issues.apache.org/jira/browse/LUCENE-10488
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/facet
>Reporter: Greg Miller
>Priority: Minor
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> LUCENE-10325 added a new {{getTopDims}} API, allowing users to specify the 
> number of "top" dimensions they want. The default implementation just 
> delegates to {{getAllDims}} and returns the number of top dims requested, but 
> some Facets sub-classes can do this more optimally. LUCENE-10325 demonstrated 
> this in {{SortedSetDocValueFacetCounts}}, but we can take it further. There's 
> at least some opportunity to do better in:
> * {{ConcurrentSortedSetDocValuesFacetCounts}}
> * {{FastTaxonomyFacetCounts}}
> * {{TaxonomyFacetSumFloatAssociations}}
> * {{TaxonomyFacetSumIntAssociations}}
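
To make the difference concrete, here is a hedged sketch contrasting the "compute everything, then truncate" approach with one that keeps only topN candidates in a heap. It works on a plain map of dimension name to count; TopDimsSketch and both method names are hypothetical, and the actual Facets implementations may optimize quite differently.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

class TopDimsSketch {
  // Baseline: sort every dimension by count, then truncate to topN.
  static List<String> topDimsBySortingAll(Map<String, Integer> dimCounts, int topN) {
    List<Map.Entry<String, Integer>> all = new ArrayList<>(dimCounts.entrySet());
    all.sort(Map.Entry.<String, Integer>comparingByValue().reversed());
    List<String> result = new ArrayList<>();
    for (int i = 0; i < Math.min(topN, all.size()); i++) {
      result.add(all.get(i).getKey());
    }
    return result;
  }

  // Alternative: keep only topN candidates in a min-heap while scanning once.
  static List<String> topDimsWithHeap(Map<String, Integer> dimCounts, int topN) {
    PriorityQueue<Map.Entry<String, Integer>> heap =
        new PriorityQueue<>(Map.Entry.<String, Integer>comparingByValue());
    for (Map.Entry<String, Integer> e : dimCounts.entrySet()) {
      heap.offer(e);
      if (heap.size() > topN) {
        heap.poll(); // drop the current smallest
      }
    }
    List<String> result = new ArrayList<>();
    while (!heap.isEmpty()) {
      result.add(0, heap.poll().getKey()); // largest ends up first
    }
    return result;
  }
}
```

Ties between equal counts are not ordered deterministically in this sketch; a real implementation would need a tie-break rule.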



--
This message was sent by Atlassian Jira
(v8.20.1#820001)




[jira] [Commented] (LUCENE-10507) Should it be more likely to search concurrently in tests?

2022-04-07 Thread Greg Miller (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17518935#comment-17518935
 ] 

Greg Miller commented on LUCENE-10507:
--

+1. I think this is a great idea!

> Should it be more likely to search concurrently in tests?
> -
>
> Key: LUCENE-10507
> URL: https://issues.apache.org/jira/browse/LUCENE-10507
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Luca Cavanna
>Priority: Minor
>
> As part of LUCENE-10002 we are migrating test usages of 
> IndexSearcher#search(Query, Collector) to use the corresponding search method 
> that takes a CollectorManager in place of a Collector. As part of such 
> changes, I've been paying attention to whether searchers are created through 
> LuceneTestCase#newSearcher and migrating to it when possible.
> This caused some recent test failures following test changes, which were in 
> most cases test issues, although they were quite rare due to the fact that we 
> only rarely exercise the concurrent code-path in tests.
> One recent failure uncovered LUCENE-10500, which was an actual bug that 
> affected concurrent searches only, and was uncovered by a test run that 
> indexed a considerable amount of docs and was lucky enough to get an executor 
> set to its index searcher as well as get multiple slices.
> LuceneTestCase#newIndexSearcher(IndexReader) uses threads only rarely, and 
> even when useThreads is true, the searcher may not get an executor set. Also, 
> it can often happen that, even though an executor is set, the searcher will hold 
> only one slice, as not enough documents are indexed. Some nightly tests index 
> enough documents, and LuceneTestCase also lowers the slice limits, but only 
> 50% of the time and only when wrapWithAssertions is false. Also I wonder if 
> the lower limits are low enough:
> {code:java}
> int maxDocPerSlice = 1 + random.nextInt(10);
> int maxSegmentsPerSlice = 1 + random.nextInt(20);
> {code}
> All in all, I wonder if we should make it more likely for real concurrent 
> searches to happen while testing across multiple slices. It seems like it 
> could be useful especially as we'd like users to use collector managers 
> instead of collectors (although that does not necessarily translate to 
> concurrent search).



--
This message was sent by Atlassian Jira
(v8.20.1#820001)




[jira] [Resolved] (LUCENE-10467) Throws IllegalArgumentException for getAllDims and getTopChildren if topN <= 0

2022-04-05 Thread Greg Miller (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Miller resolved LUCENE-10467.
--
Fix Version/s: 9.2
   Resolution: Fixed

Merged and backported. Thanks [~yutinggan]!

> Throws IllegalArgumentException for getAllDims and getTopChildren if topN <= 0
> --
>
> Key: LUCENE-10467
> URL: https://issues.apache.org/jira/browse/LUCENE-10467
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/facet
>Reporter: Yuting Gan
>Priority: Minor
> Fix For: 9.2
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> Currently, subclasses that implement and override getAllDims and 
> getTopChildren behave differently when passed an invalid topN 
> parameter (topN <= 0). Some overridden implementations throw a 
> NullPointerException, some throw an IllegalArgumentException, and others 
> throw no exception.
> It would provide a better user experience by consistently throwing an 
> IllegalArgumentException when requesting topN <= 0 for these two 
> functionalities across all implementations.
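
The consistent behavior described above amounts to a shared argument check run before any implementation-specific work. A minimal sketch, with a hypothetical helper name and message text (the merged change may word and place the check differently):

```java
class TopNValidationSketch {
  // Rejects non-positive topN up front so every implementation fails the same way.
  static void validateTopN(int topN) {
    if (topN <= 0) {
      throw new IllegalArgumentException("topN must be > 0 (got: " + topN + ")");
    }
  }
}
```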



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Resolved] (LUCENE-10491) TaxonomyFacetSumValueSource incorrectly provides scores to doc values

2022-03-30 Thread Greg Miller (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Miller resolved LUCENE-10491.
--
Fix Version/s: 9.2
   Resolution: Fixed

Fixed and backported.

> TaxonomyFacetSumValueSource incorrectly provides scores to doc values
> -
>
> Key: LUCENE-10491
> URL: https://issues.apache.org/jira/browse/LUCENE-10491
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/facet
>Affects Versions: 10.0 (main), 9.2
>Reporter: Greg Miller
>Assignee: Greg Miller
>Priority: Major
> Fix For: 9.2
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> {{TaxonomyFacetSumValueSource}} has a bug in the way it provides scores to 
> the user-provided doc values. [On this 
> line|https://github.com/apache/lucene/blob/main/lucene/facet/src/java/org/apache/lucene/facet/taxonomy/TaxonomyFacetSumValueSource.java#L78]
>  it should be {{index = doc}}, not {{index++}}. Thanks to [~mikemccand] for 
> finding this over in #718!
> I've reproduced with a test and will post the test and a fix shortly.






[jira] [Assigned] (LUCENE-10491) TaxonomyFacetSumValueSource incorrectly provides scores to doc values

2022-03-30 Thread Greg Miller (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Miller reassigned LUCENE-10491:


Assignee: Greg Miller

> TaxonomyFacetSumValueSource incorrectly provides scores to doc values
> -
>
> Key: LUCENE-10491
> URL: https://issues.apache.org/jira/browse/LUCENE-10491
> Project: Lucene - Core
> Issue Type: Bug
> Components: modules/facet
> Affects Versions: 10.0 (main), 9.2
> Reporter: Greg Miller
> Assignee: Greg Miller
> Priority: Major
> Time Spent: 10m
> Remaining Estimate: 0h
>
> {{TaxonomyFacetSumValueSource}} has a bug in the way it provides scores to 
> the user-provided doc values. [On this 
> line|https://github.com/apache/lucene/blob/main/lucene/facet/src/java/org/apache/lucene/facet/taxonomy/TaxonomyFacetSumValueSource.java#L78]
>  it should be {{index = doc}}, not {{index++}}. Thanks to [~mikemccand] for 
> finding this over in #718!
> I've reproduced with a test and will post the test and a fix shortly.






[jira] [Created] (LUCENE-10491) TaxonomyFacetSumValueSource incorrectly provides scores to doc values

2022-03-30 Thread Greg Miller (Jira)
Greg Miller created LUCENE-10491:


 Summary: TaxonomyFacetSumValueSource incorrectly provides scores to doc values
 Key: LUCENE-10491
 URL: https://issues.apache.org/jira/browse/LUCENE-10491
 Project: Lucene - Core
 Issue Type: Bug
 Components: modules/facet
 Affects Versions: 10.0 (main), 9.2
 Reporter: Greg Miller


{{TaxonomyFacetSumValueSource}} has a bug in the way it provides scores to the 
user-provided doc values. [On this 
line|https://github.com/apache/lucene/blob/main/lucene/facet/src/java/org/apache/lucene/facet/taxonomy/TaxonomyFacetSumValueSource.java#L78]
 it should be {{index = doc}}, not {{index++}}. Thanks to [~mikemccand] for 
finding this over in #718!

I've reproduced with a test and will post the test and a fix shortly.
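
To make the difference between {{index++}} and {{index = doc}} concrete, here is a simplified, self-contained model. It is not the code from TaxonomyFacetSumValueSource: the class names are hypothetical, and it assumes a scores array addressable by doc id, which may not match how Lucene stores scores. It only shows how a cursor that steps forward on every call drifts away from the requested doc when callers skip docs.

```java
/** Simplified model only; class names and array layout are assumptions. */
final class ScoreCursorSketch {
    private ScoreCursorSketch() {}

    /** Variant using index++: ignores the requested doc and just steps forward. */
    static final class PositionalCursor {
        private final float[] scores;
        private int index = -1;

        PositionalCursor(float[] scores) { this.scores = scores; }

        boolean advanceExact(int doc) { index++; return true; }

        float value() { return scores[index]; }
    }

    /** Variant using index = doc: positions itself on the requested doc. */
    static final class DocKeyedCursor {
        private final float[] scores;
        private int index = -1;

        DocKeyedCursor(float[] scores) { this.scores = scores; }

        boolean advanceExact(int doc) { index = doc; return true; }

        float value() { return scores[index]; }
    }
}
```

In this model, advancing straight to doc 2 makes the positional cursor report the score stored for doc 0, while the doc-keyed cursor reports the score stored for doc 2.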






[jira] [Commented] (LUCENE-10325) Add getTopDims functionality to Facets

2022-03-28 Thread Greg Miller (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17513732#comment-17513732
 ] 

Greg Miller commented on LUCENE-10325:
--

Also opened LUCENE-10488 to track other optimizations.

> Add getTopDims functionality to Facets
> --
>
> Key: LUCENE-10325
> URL: https://issues.apache.org/jira/browse/LUCENE-10325
> Project: Lucene - Core
> Issue Type: Improvement
> Components: modules/facet
> Reporter: Greg Miller
> Priority: Major
> Fix For: 9.2
>
> Time Spent: 9h
> Remaining Estimate: 0h
>
> The current {{getAllDims}} functionality is really the only way for users to 
> determine the "top" dimensions in a faceting field (i.e., get the top dims by 
> count along with their top-n children), but it has the unfortunate 
> side-effect of resolving all child paths for every dim, even if the user 
> doesn't intend to use those dims. For example, if a match set contains docs 
> relating to 100 different dims (and various values under each), but the user 
> only wants the top 10 dims with their top 5 children, they can call 
> getAllDims(5) then just grab the first 10 results, but a lot of wasted work 
> has been done for the other 90 dims.
> It would be nice to implement something like {{getTopDims(int numDims, int 
> numChildren)}} that would only do the work necessary to resolve {{numDims}} 
> dims instead of all dims.
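
The wasted work described above can be illustrated with a toy model. Everything here is hypothetical (the class, the nested-map data layout, the resolvedDims counter) and none of it is Lucene code; it only assumes that resolving a dim's children is the expensive step and that dims can be ranked before that step runs.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model of facet results; the class and its methods are hypothetical. */
final class TopDimsSketch {
    /** dim -> (child label -> count), standing in for aggregated facet counts. */
    private final Map<String, Map<String, Integer>> counts;
    /** How many dims had their children resolved (the costly step in this model). */
    int resolvedDims = 0;

    TopDimsSketch(Map<String, Map<String, Integer>> counts) {
        this.counts = counts;
    }

    /** Builds numDims dims with distinct totals so the ranking is unambiguous. */
    static Map<String, Map<String, Integer>> sampleCounts(int numDims) {
        Map<String, Map<String, Integer>> counts = new LinkedHashMap<>();
        for (int i = 0; i < numDims; i++) {
            Map<String, Integer> children = new LinkedHashMap<>();
            children.put("a", i + 1);
            children.put("b", 1);
            counts.put("dim" + i, children);
        }
        return counts;
    }

    private int dimTotal(String dim) {
        int total = 0;
        for (int c : counts.get(dim).values()) total += c;
        return total;
    }

    private List<String> dimsByTotalDesc() {
        List<String> dims = new ArrayList<>(counts.keySet());
        dims.sort(Comparator.comparingInt((String d) -> dimTotal(d)).reversed());
        return dims;
    }

    private List<String> resolveTopChildren(String dim, int topN) {
        resolvedDims++;
        List<Map.Entry<String, Integer>> children = new ArrayList<>(counts.get(dim).entrySet());
        children.sort(Map.Entry.<String, Integer>comparingByValue().reversed());
        List<String> labels = new ArrayList<>();
        for (Map.Entry<String, Integer> e : children.subList(0, Math.min(topN, children.size()))) {
            labels.add(e.getKey());
        }
        return labels;
    }

    private Map<String, List<String>> resolve(List<String> dims, int topNChildren) {
        Map<String, List<String>> out = new LinkedHashMap<>();
        for (String dim : dims) out.put(dim, resolveTopChildren(dim, topNChildren));
        return out;
    }

    /** Resolves children for every dim, as getAllDims is described as doing. */
    Map<String, List<String>> getAllDims(int topNChildren) {
        return resolve(dimsByTotalDesc(), topNChildren);
    }

    /** Ranks dims first and resolves children only for the requested dims. */
    Map<String, List<String>> getTopDims(int numDims, int numChildren) {
        List<String> dims = dimsByTotalDesc();
        return resolve(dims.subList(0, Math.min(numDims, dims.size())), numChildren);
    }
}
```

With 100 dims, calling getAllDims(5) and keeping the first 10 entries resolves children for all 100 dims, whereas getTopDims(10, 5) returns the same 10 dims after resolving only 10.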






[jira] [Resolved] (LUCENE-10325) Add getTopDims functionality to Facets

2022-03-28 Thread Greg Miller (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Miller resolved LUCENE-10325.
--
Fix Version/s: 9.2
   Resolution: Fixed

Merged and backported. Thanks again [~yutinggan]!







[jira] [Comment Edited] (LUCENE-10488) Optimize Facets#getTopDims across Facets implementations

2022-03-28 Thread Greg Miller (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17513722#comment-17513722
 ] 

Greg Miller edited comment on LUCENE-10488 at 3/28/22, 11:41 PM:
-

Note that I have an [open PR|https://github.com/apache/lucene/pull/719] that 
proposes some significant changes to association facets, so might be worth 
trying to avoid large merge collisions with that if someone jumps on this.


was (Author: gsmiller):
Note that I have an [open PR](https://github.com/apache/lucene/pull/719) that 
proposes some significant changes to association facets, so might be worth 
trying to avoid large merge collisions with that if someone jumps on this.

> Optimize Facets#getTopDims across Facets implementations
> 
>
> Key: LUCENE-10488
> URL: https://issues.apache.org/jira/browse/LUCENE-10488
> Project: Lucene - Core
> Issue Type: Improvement
> Components: modules/facet
> Reporter: Greg Miller
> Priority: Minor
>
> LUCENE-10325 added a new {{getTopDims}} API, allowing users to specify the 
> number of "top" dimensions they want. The default implementation just 
> delegates to {{getAllDims}} and returns the number of top dims requested, but 
> some Facets sub-classes can do this more optimally. LUCENE-10325 demonstrated 
> this in {{SortedSetDocValuesFacetCounts}}, but we can take it further. There's 
> at least some opportunity to do better in:
> * {{ConcurrentSortedSetDocValuesFacetCounts}}
> * {{FastTaxonomyFacetCounts}}
> * {{TaxonomyFacetSumFloatAssociations}}
> * {{TaxonomyFacetSumIntAssociations}}






[jira] [Commented] (LUCENE-10488) Optimize Facets#getTopDims across Facets implementations

2022-03-28 Thread Greg Miller (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17513722#comment-17513722
 ] 

Greg Miller commented on LUCENE-10488:
--

Note that I have an [open PR](https://github.com/apache/lucene/pull/719) that 
proposes some significant changes to association facets, so might be worth 
trying to avoid large merge collisions with that if someone jumps on this.







[jira] [Created] (LUCENE-10488) Optimize Facets#getTopDims across Facets implementations

2022-03-28 Thread Greg Miller (Jira)
Greg Miller created LUCENE-10488:


 Summary: Optimize Facets#getTopDims across Facets implementations
 Key: LUCENE-10488
 URL: https://issues.apache.org/jira/browse/LUCENE-10488
 Project: Lucene - Core
 Issue Type: Improvement
 Components: modules/facet
 Reporter: Greg Miller


LUCENE-10325 added a new {{getTopDims}} API, allowing users to specify the 
number of "top" dimensions they want. The default implementation just delegates 
to {{getAllDims}} and returns the number of top dims requested, but some Facets 
sub-classes can do this more optimally. LUCENE-10325 demonstrated this in 
{{SortedSetDocValuesFacetCounts}}, but we can take it further. There's at least 
some opportunity to do better in:
* {{ConcurrentSortedSetDocValuesFacetCounts}}
* {{FastTaxonomyFacetCounts}}
* {{TaxonomyFacetSumFloatAssociations}}
* {{TaxonomyFacetSumIntAssociations}}
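
The "default delegates, subclasses specialize" pattern can be sketched as follows. The interface and classes are hypothetical stand-ins rather than the Facets API, and the bounded-heap override is only one possible way a subclass might avoid ranking every dim; the issue does not say which technique any Lucene class uses.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

/** Hypothetical stand-in for the Facets API; only dim names are returned. */
interface DimRanker {
    List<String> getAllDims();

    /** Default behaviour: compute everything, then truncate. */
    default List<String> getTopDims(int topNDims) {
        List<String> all = getAllDims();
        return new ArrayList<>(all.subList(0, Math.min(topNDims, all.size())));
    }
}

/** Ranks dims by a precomputed count and inherits the default getTopDims. */
class CountingDimRanker implements DimRanker {
    protected final Map<String, Integer> dimCounts;

    CountingDimRanker(Map<String, Integer> dimCounts) {
        this.dimCounts = dimCounts;
    }

    @Override
    public List<String> getAllDims() {
        List<String> dims = new ArrayList<>(dimCounts.keySet());
        dims.sort(Comparator.comparingInt((String d) -> dimCounts.get(d)).reversed());
        return dims;
    }
}

/** Override that keeps at most topNDims candidates in a bounded min-heap. */
final class HeapDimRanker extends CountingDimRanker {
    HeapDimRanker(Map<String, Integer> dimCounts) {
        super(dimCounts);
    }

    @Override
    public List<String> getTopDims(int topNDims) {
        PriorityQueue<String> heap =
            new PriorityQueue<>(Comparator.comparingInt((String d) -> dimCounts.get(d)));
        for (String dim : dimCounts.keySet()) {
            heap.offer(dim);
            if (heap.size() > topNDims) {
                heap.poll(); // evict the dim with the smallest count
            }
        }
        List<String> top = new ArrayList<>(heap);
        top.sort(Comparator.comparingInt((String d) -> dimCounts.get(d)).reversed());
        return top;
    }
}
```

Both rankers return the same dims; the override simply never sorts more than topNDims + 1 candidates at a time.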






[jira] [Updated] (LUCENE-10484) Add support for concurrent facets random sampling

2022-03-28 Thread Greg Miller (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Miller updated LUCENE-10484:
-
Component/s: modules/facet

> Add support for concurrent facets random sampling
> -
>
> Key: LUCENE-10484
> URL: https://issues.apache.org/jira/browse/LUCENE-10484
> Project: Lucene - Core
> Issue Type: Improvement
> Components: modules/facet
> Reporter: Luca Cavanna
> Priority: Minor
>
> While FacetsCollectorManager exists to allow users to concurrently do facets 
> collection through FacetsCollector, RandomSamplingFacetsCollector does not 
> have a corresponding collector manager that easily allows users to 
> concurrently do random sampling. The needed collector manager would be very 
> similar to FacetsCollectorManager, yet it would need to expose a specialized 
> reduced RandomSamplingFacetsCollector, and the reduction should call 
> getOriginalMatchingDocs instead of getMatchingDocs, which modifies the 
> internal totalHits when called.
> This relates to LUCENE-10002 and would allow using a collector manager 
> instead of a collector when doing random sampling, as part of the effort to 
> reduce usages of IndexSearcher#search(Query, Collector).
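
The shape of the proposed manager can be sketched with a toy model. The classes below are hypothetical stand-ins, not RandomSamplingFacetsCollector or Lucene's CollectorManager, and the lazily computed totalHits and the every-other-doc sampling are assumptions chosen only to make the side effect visible. The point is that the reduce step reads the side-effect-free accessor on each per-slice collector.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

/** Toy stand-in for a sampling collector; behaviour is assumed, not Lucene's. */
final class SamplingCollectorSketch {
    static final int NOT_CALCULATED = -1;

    private final List<Integer> original = new ArrayList<>();
    int totalHits = NOT_CALCULATED;

    void collect(int doc) {
        original.add(doc);
    }

    /** Side-effect-free view of everything collected. */
    List<Integer> getOriginalMatchingDocs() {
        return original;
    }

    /** Returns a sample and, as a side effect, fills in totalHits. */
    List<Integer> getMatchingDocs() {
        totalHits = original.size();
        List<Integer> sampled = new ArrayList<>();
        for (int i = 0; i < original.size(); i += 2) {
            sampled.add(original.get(i)); // keep every other doc as a crude sample
        }
        return sampled;
    }
}

/** Toy manager: one collector per slice, merged without triggering sampling. */
final class SamplingCollectorManagerSketch {
    SamplingCollectorSketch newCollector() {
        return new SamplingCollectorSketch();
    }

    SamplingCollectorSketch reduce(Collection<SamplingCollectorSketch> collectors) {
        SamplingCollectorSketch merged = new SamplingCollectorSketch();
        for (SamplingCollectorSketch c : collectors) {
            // Use the original docs so per-slice state is left untouched.
            for (int doc : c.getOriginalMatchingDocs()) {
                merged.collect(doc);
            }
        }
        return merged;
    }
}
```

Sampling then happens once, on the merged collector, rather than independently on each slice.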






[jira] [Commented] (LUCENE-10468) Do not always do checkField() in DocValues.getXXX(LeafReader, String)

2022-03-15 Thread Greg Miller (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17507137#comment-17507137
 ] 

Greg Miller commented on LUCENE-10468:
--

+1, I appreciate the field checking done by the DocValues factory methods. It 
only throws if the field exists but was indexed with a different type, which 
likely indicates a user-initiated error.

Note that you can always use lower-level access by loading doc values directly 
from a LeafReader if you have some special use-case, or you can load FieldInfos 
and check those yourself as well. I've seen a few use-cases where this is 
useful, primarily optimizing the {{null}} case.

> Do not always do checkField() in DocValues.getXXX(LeafReader, String)
> -
>
> Key: LUCENE-10468
> URL: https://issues.apache.org/jira/browse/LUCENE-10468
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Lu Xugang
> Priority: Trivial
> Attachments: 1.png
>
>
> An IndexQuery always returns an empty result when the field in the query 
> does not exist, or even when it was indexed with a different FieldType.
> But with a DocValuesQuery on a field that does not exist: if the field was 
> not indexed with any other FieldType, the DocValues query behaves the same as 
> an IndexQuery; otherwise it throws an exception, because getting a 
> DocValuesIterator always calls DocValues#checkField(...).
> I think checkField(...) is not needed when only getting a DocValuesIterator, 
> and the exception's message is not friendly, so can we keep query results 
> consistent?
>  





