[jira] [Resolved] (LUCENE-10644) Facets#getAllChildren testing should ignore child order
[ https://issues.apache.org/jira/browse/LUCENE-10644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Greg Miller resolved LUCENE-10644.
----------------------------------
    Fix Version/s: 9.4
       Resolution: Fixed

> Facets#getAllChildren testing should ignore child order
> -------------------------------------------------------
>
>                 Key: LUCENE-10644
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10644
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/facet
>            Reporter: Greg Miller
>            Priority: Minor
>             Fix For: 9.4
>
>         Attachments: failing tests.png
>
>          Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> Our javadoc for {{Facets#getAllChildren}} explicitly calls out that callers
> should make no assumptions about child ordering, but a number of our own unit
> tests turn around and make that assumption. I ran into this when recently
> trying an optimization that would result in a different child ordering for
> {{getAllChildren}}, and found a number of unit tests that started
> failing. I'll upload a list of what I found failing.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
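One way a test can avoid depending on child order is to sort both the expected and the actual children before comparing them. The following is only a minimal sketch of that idea, not the change that was merged for this issue. It assumes each child can be reduced to a (label, count) pair; the class and method names are hypothetical and no Lucene types are used.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical test helper; children are modeled as (label, count) pairs. */
final class ChildOrderCheck {

  /** Returns true if both lists hold the same (label, count) pairs, in any order. */
  static boolean sameChildrenIgnoringOrder(
      List<Map.Entry<String, Integer>> expected, List<Map.Entry<String, Integer>> actual) {
    return sorted(expected).equals(sorted(actual));
  }

  /** Copies the input and sorts the copy by label, then by count, so the caller's list is untouched. */
  private static List<Map.Entry<String, Integer>> sorted(List<Map.Entry<String, Integer>> in) {
    List<Map.Entry<String, Integer>> copy = new ArrayList<>(in);
    copy.sort(
        Map.Entry.<String, Integer>comparingByKey().thenComparing(Map.Entry.comparingByValue()));
    return copy;
  }
}
```

Sorting copies keeps duplicates significant (two identical children must appear twice on both sides), which a plain set comparison would hide.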
[jira] [Commented] (LUCENE-10644) Facets#getAllChildren testing should ignore child order
[ https://issues.apache.org/jira/browse/LUCENE-10644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17581476#comment-17581476 ]

Greg Miller commented on LUCENE-10644:
--------------------------------------

Merged and backported. Thanks!
[jira] [Commented] (LUCENE-10207) Make TermInSetQuery usable with IndexOrDocValuesQuery
[ https://issues.apache.org/jira/browse/LUCENE-10207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17575402#comment-17575402 ]

Greg Miller commented on LUCENE-10207:
--------------------------------------

I'm coming back to this work now as I'm working on another project that would benefit from the ability to use a {{TermInSetQuery}} within an {{IndexOrDocValuesQuery}}. Where this work stalled last year was in answering whether-or-not making {{TermInSetQuery}} extend {{MultiTermQuery}} would have a negative performance impact, since the term intersection implementation would differ. The motivation for extending {{MultiTermQuery}} was to make a doc values-based term-in-set implementation easy (using the existing {{DocValuesRewriteMethod}}).

I suggest we separate some of these concerns to make progress. The sandbox module already has {{DocValuesTermsQuery}}, which could be paired with {{TermInSetQuery}} inside of {{IndexOrDocValuesQuery}}. But we still can't use {{TermInSetQuery}} in an {{IndexOrDocValuesQuery}}, since {{TermInSetQuery}} doesn't provide a {{ScorerSupplier}} with cost estimation. I propose we address this first, and not worry about refactoring {{TermInSetQuery}} to extend {{MultiTermQuery}} at this point. This would be incremental progress that enables using {{TermInSetQuery}} + {{DocValuesTermsQuery}} in an {{IndexOrDocValuesQuery}}, while not requiring us to answer the performance impact of changing {{TermInSetQuery}} to extend {{MultiTermQuery}}.

I've opened a separate PR to make this iterative step: https://github.com/apache/lucene/pull/1058

> Make TermInSetQuery usable with IndexOrDocValuesQuery
> -----------------------------------------------------
>
>                 Key: LUCENE-10207
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10207
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Adrien Grand
>            Assignee: Greg Miller
>            Priority: Minor
>         Attachments: LUCENE-10207_multitermquery.patch
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> IndexOrDocValuesQuery is very useful to pick the right execution mode for a
> query depending on other bits of the query tree.
> We would like to be able to use it to optimize execution of TermInSetQuery.
> However IndexOrDocValuesQuery only works well if the "index" query can give
> an estimation of the cost of the query without doing anything expensive (like
> looking up all terms of the TermInSetQuery in the terms dict). Maybe we could
> implement it for primary keys (terms.size() == sumDocFreq) by returning the
> number of terms of the query? Another idea is to multiply the number of terms
> by the average postings length, though this could be dangerous if the field
> has a Zipfian distribution and some terms have a much higher doc frequency
> than the average.
> [~romseygeek] and I were discussing this a few weeks ago, and more recently
> [~mikemccand] and [~gsmiller] again independently. So it looks like there is
> interest in this. Here is an email thread where this was recently discussed:
> https://lists.apache.org/thread.html/re3b20a486c9a4e66b2ca4a2646e2d3be48535a90cdd95911a8445183%40%3Cdev.lucene.apache.org%3E.
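The two cost heuristics floated in the issue description can be written down as plain arithmetic over field statistics. This is a sketch only and makes no claim about what the linked PR implements. The class name, the parameter names, the pessimistic fallback when statistics are unavailable, and the cap at {{maxDoc}} are all assumptions added for illustration.

```java
/** Hypothetical helper; not part of Lucene. */
final class TermInSetCostSketch {

  /**
   * Estimates how many documents a term-in-set query might visit.
   *
   * @param queryTermCount number of terms in the query
   * @param fieldTermCount number of unique terms in the field (terms.size()); non-positive if unknown
   * @param sumDocFreq total number of postings across all terms of the field; non-positive if unknown
   * @param maxDoc number of documents, used as an upper bound (assumption)
   */
  static long estimateCost(long queryTermCount, long fieldTermCount, long sumDocFreq, long maxDoc) {
    if (queryTermCount <= 0) {
      return 0;
    }
    if (fieldTermCount <= 0 || sumDocFreq <= 0) {
      return maxDoc; // statistics unavailable: assume the worst (assumption)
    }
    if (fieldTermCount == sumDocFreq) {
      // Primary-key-like field: every term matches exactly one document.
      return Math.min(queryTermCount, maxDoc);
    }
    // Average postings length, rounded up. This can badly underestimate when a
    // few terms are far more frequent than the average (Zipfian fields).
    long avgPostings = (sumDocFreq + fieldTermCount - 1) / fieldTermCount;
    if (avgPostings > maxDoc / queryTermCount) {
      return maxDoc; // also guards against overflow in the product below
    }
    return Math.min(queryTermCount * avgPostings, maxDoc);
  }
}
```

Both branches use only statistics that are cheap to read, which matches the constraint in the description that the estimate must not require looking up every query term in the terms dictionary.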
[jira] [Resolved] (LUCENE-10668) Should we deprecate/remove DocValuesTermsQuery in sandbox?
[ https://issues.apache.org/jira/browse/LUCENE-10668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Greg Miller resolved LUCENE-10668.
----------------------------------
    Resolution: Won't Fix

> Should we deprecate/remove DocValuesTermsQuery in sandbox?
> ----------------------------------------------------------
>
>                 Key: LUCENE-10668
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10668
>             Project: Lucene - Core
>          Issue Type: Task
>          Components: modules/sandbox
>            Reporter: Greg Miller
>            Priority: Minor
>
> I came across the sandbox {{DocValuesTermsQuery}} and it sure looks a lot
> like {{TermInSetQuery}}. I wonder if we ought to deprecate and remove it? Any
> reason to keep this around?
[jira] [Commented] (LUCENE-10207) Make TermInSetQuery usable with IndexOrDocValuesQuery
[ https://issues.apache.org/jira/browse/LUCENE-10207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17573027#comment-17573027 ]

Greg Miller commented on LUCENE-10207:
--------------------------------------

I also just came across {{DocValuesTermsQuery}} in the sandbox module. Once we see this work through (adding doc value rewrite support to TermInSet), we can deprecate/remove this.
[jira] [Commented] (LUCENE-10668) Should we deprecate/remove DocValuesTermsQuery in sandbox?
[ https://issues.apache.org/jira/browse/LUCENE-10668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17573026#comment-17573026 ]

Greg Miller commented on LUCENE-10668:
--------------------------------------

Ha, oops... right [~jpountz]. I've been working on [LUCENE-10207|https://issues.apache.org/jira/browse/LUCENE-10207] again and had the new doc value based implementation in the brain. I'll just mention over in LUCENE-10207 that we can deprecate {{DocValuesTermsQuery}} when we see that work through. My mistake.
[jira] [Created] (LUCENE-10668) Should we deprecate/remove DocValuesTermsQuery in sandbox?
Greg Miller created LUCENE-10668:
------------------------------------

             Summary: Should we deprecate/remove DocValuesTermsQuery in sandbox?
                 Key: LUCENE-10668
                 URL: https://issues.apache.org/jira/browse/LUCENE-10668
             Project: Lucene - Core
          Issue Type: Task
          Components: modules/sandbox
            Reporter: Greg Miller

I came across the sandbox {{DocValuesTermsQuery}} and it sure looks a lot like {{TermInSetQuery}}. I wonder if we ought to deprecate and remove it? Any reason to keep this around?
[jira] [Commented] (LUCENE-10659) Fix random TestDisiPriorityQueue bug
[ https://issues.apache.org/jira/browse/LUCENE-10659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17570315#comment-17570315 ]

Greg Miller commented on LUCENE-10659:
--------------------------------------

Patched this additional fix in as well. Hopefully this test is good to go now. I'll keep an eye on it.

> Fix random TestDisiPriorityQueue bug
> ------------------------------------
>
>                 Key: LUCENE-10659
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10659
>             Project: Lucene - Core
>          Issue Type: Bug
>    Affects Versions: 9.3
>            Reporter: Greg Miller
>            Assignee: Greg Miller
>            Priority: Blocker
>             Fix For: 9.3
>
> A recently added test ({{TestDisiPriorityQueue}}) has a bug that can randomly
> trip (my fault). I fixed this on {{main}} and {{branch_9x}}, but I think we
> should roll it into the 9.3 release. I'll prepare a PR, but raising it here
> for visibility.
[jira] [Resolved] (LUCENE-10659) Fix random TestDisiPriorityQueue bug
[ https://issues.apache.org/jira/browse/LUCENE-10659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Greg Miller resolved LUCENE-10659.
----------------------------------
    Resolution: Fixed
[jira] [Commented] (LUCENE-10659) Fix random TestDisiPriorityQueue bug
[ https://issues.apache.org/jira/browse/LUCENE-10659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17570209#comment-17570209 ]

Greg Miller commented on LUCENE-10659:
--------------------------------------

Another fix here: https://github.com/apache/lucene/pull/1044
[jira] [Reopened] (LUCENE-10659) Fix random TestDisiPriorityQueue bug
[ https://issues.apache.org/jira/browse/LUCENE-10659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Greg Miller reopened LUCENE-10659:
----------------------------------
    Assignee: Greg Miller

There's still an issue with the test. Tripped it again last night. Working on a fix now. Let's block 9.3 until this fix is in. PR will be up shortly.
[jira] [Resolved] (LUCENE-10659) Fix random TestDisiPriorityQueue bug
[ https://issues.apache.org/jira/browse/LUCENE-10659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Greg Miller resolved LUCENE-10659.
----------------------------------
    Fix Version/s: 9.3
       Resolution: Fixed
[jira] [Updated] (LUCENE-10659) Fix random TestDisiPriorityQueue bug
[ https://issues.apache.org/jira/browse/LUCENE-10659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Greg Miller updated LUCENE-10659:
---------------------------------
    Priority: Blocker  (was: Minor)
[jira] [Commented] (LUCENE-10659) Fix random TestDisiPriorityQueue bug
[ https://issues.apache.org/jira/browse/LUCENE-10659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17569184#comment-17569184 ]

Greg Miller commented on LUCENE-10659:
--------------------------------------

PR for pulling this fix into 9.3: https://github.com/apache/lucene/pull/1038
[jira] [Created] (LUCENE-10659) Fix random TestDisiPriorityQueue bug
Greg Miller created LUCENE-10659:
------------------------------------

             Summary: Fix random TestDisiPriorityQueue bug
                 Key: LUCENE-10659
                 URL: https://issues.apache.org/jira/browse/LUCENE-10659
             Project: Lucene - Core
          Issue Type: Bug
    Affects Versions: 9.3
            Reporter: Greg Miller

A recently added test ({{TestDisiPriorityQueue}}) has a bug that can randomly trip (my fault). I fixed this on {{main}} and {{branch_9x}}, but I think we should roll it into the 9.3 release. I'll prepare a PR, but raising it here for visibility.
[jira] [Resolved] (LUCENE-10653) Should BlockMaxMaxscoreScorer rebuild its heap in bulk?
[ https://issues.apache.org/jira/browse/LUCENE-10653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Greg Miller resolved LUCENE-10653.
----------------------------------
    Fix Version/s: 9.3
         Assignee: Greg Miller
       Resolution: Fixed

> Should BlockMaxMaxscoreScorer rebuild its heap in bulk?
> -------------------------------------------------------
>
>                 Key: LUCENE-10653
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10653
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/search
>            Reporter: Greg Miller
>            Assignee: Greg Miller
>            Priority: Minor
>             Fix For: 9.3
>
>          Time Spent: 40m
>  Remaining Estimate: 0h
>
> BMMScorer has to frequently rebuild its heap, and does so by clearing and
> then iteratively calling {{add}}. It would be more efficient to heapify
> in bulk. This is more academic than anything right now though since BMMScorer
> is only used with two-clause disjunctions, so it's sort of a silly
> optimization if it's not supporting a greater number of clauses.
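For readers unfamiliar with the distinction drawn in the description: adding n elements one at a time costs O(n log n) in the worst case, while a bottom-up heapify builds the same kind of heap in O(n). The sketch below shows the general bottom-up technique on a plain {{long[]}} min-heap. It is a generic illustration and not the BMMScorer change; the class name and the use of bare {{long}} keys are assumptions.

```java
/** Hypothetical illustration of bottom-up (bulk) heap construction; not Lucene code. */
final class BulkHeapify {

  /** Rearranges a[0..size) into a binary min-heap by sifting down from the last parent to the root. */
  static void heapify(long[] a, int size) {
    for (int i = (size >>> 1) - 1; i >= 0; i--) {
      siftDown(a, i, size);
    }
  }

  /** Moves the value at index i down until neither child is smaller than it. */
  private static void siftDown(long[] a, int i, int size) {
    long v = a[i];
    while (true) {
      int child = 2 * i + 1;
      if (child >= size) {
        break; // i is a leaf
      }
      if (child + 1 < size && a[child + 1] < a[child]) {
        child++; // pick the smaller of the two children
      }
      if (a[child] >= v) {
        break; // heap property already holds here
      }
      a[i] = a[child];
      i = child;
    }
    a[i] = v;
  }

  /** Checks the min-heap property: no element is smaller than its parent. */
  static boolean isMinHeap(long[] a, int size) {
    for (int i = 1; i < size; i++) {
      if (a[(i - 1) / 2] > a[i]) {
        return false;
      }
    }
    return true;
  }
}
```

With only two clauses, as the description notes, the difference between the two approaches is negligible; the gain would only matter for larger heaps.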
[jira] [Commented] (LUCENE-10633) Dynamic pruning for queries sorted by SORTED(_SET) field
[ https://issues.apache.org/jira/browse/LUCENE-10633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17568587#comment-17568587 ]

Greg Miller commented on LUCENE-10633:
--------------------------------------

{quote}It also relates to [~gsmiller]'s work about running term-in-set queries using doc values, which would only help if doc values are enabled on the field.
{quote}
Which is actually perfect timing as I've just come back to working on this (LUCENE-10207) after setting it aside for a while. Thanks for making this change to {{luceneutil}}!

> Dynamic pruning for queries sorted by SORTED(_SET) field
> --------------------------------------------------------
>
>                 Key: LUCENE-10633
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10633
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Adrien Grand
>            Priority: Minor
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> LUCENE-9280 introduced the ability to dynamically prune non-competitive hits
> when sorting by a numeric field, by leveraging the points index to skip
> documents that do not compare better than the top of the priority queue
> maintained by the field comparator.
> However queries sorted by a SORTED(_SET) field still look at all hits, which
> is disappointing. Could we leverage the terms index to skip hits?
[jira] [Commented] (LUCENE-10603) Improve iteration of ords for SortedSetDocValues
[ https://issues.apache.org/jira/browse/LUCENE-10603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17566615#comment-17566615 ]

Greg Miller commented on LUCENE-10603:
--------------------------------------

Shouldn't be any more to do on this now. Resolving. FWIW, I ran benchmarks on {{wikimediumall}} and didn't see any significant changes. Thought we might see a small improvement for SSDV heavy faceting, but nothing showed up.

> Improve iteration of ords for SortedSetDocValues
> ------------------------------------------------
>
>                 Key: LUCENE-10603
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10603
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Lu Xugang
>            Assignee: Lu Xugang
>            Priority: Trivial
>          Time Spent: 6h
>  Remaining Estimate: 0h
>
> Now that SortedSetDocValues#docValueCount has been added (since Lucene 9.2),
> should we refactor the implementation of ords iteration to use docValueCount
> instead of NO_MORE_ORDS?
> Similar to how SortedNumericDocValues did it.
> From
> {code:java}
> for (long ord = values.nextOrd(); ord != SortedSetDocValues.NO_MORE_ORDS; ord = values.nextOrd()) {
> }{code}
> to
> {code:java}
> for (int i = 0; i < values.docValueCount(); i++) {
>   long ord = values.nextOrd();
> }{code}
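The two loop shapes in the description visit the same ordinals; the count-based form simply avoids the sentinel comparison on every step. The sketch below demonstrates that equivalence against a small stand-in class. {{FakeOrds}} is hypothetical and only mimics the two methods the loops call; it is not the Lucene {{SortedSetDocValues}} class, and its sentinel value is defined locally for the example.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in exposing only the two methods the loops need; not the Lucene class. */
final class FakeOrds {
  static final long NO_MORE_ORDS = -1; // sentinel for this stand-in only

  private final long[] ords;
  private int pos;

  FakeOrds(long... ords) {
    this.ords = ords;
  }

  /** Number of ordinals available for the current (single, simulated) document. */
  int docValueCount() {
    return ords.length;
  }

  /** Returns the next ordinal, or the sentinel once all ordinals have been consumed. */
  long nextOrd() {
    return pos < ords.length ? ords[pos++] : NO_MORE_ORDS;
  }
}

final class OrdIteration {

  /** Sentinel-terminated loop: keeps calling nextOrd() until the sentinel comes back. */
  static List<Long> collectWithSentinel(FakeOrds values) {
    List<Long> out = new ArrayList<>();
    for (long ord = values.nextOrd(); ord != FakeOrds.NO_MORE_ORDS; ord = values.nextOrd()) {
      out.add(ord);
    }
    return out;
  }

  /** Count-based loop: calls nextOrd() exactly docValueCount() times and never reads the sentinel. */
  static List<Long> collectWithCount(FakeOrds values) {
    List<Long> out = new ArrayList<>();
    int count = values.docValueCount();
    for (int i = 0; i < count; i++) {
      out.add(values.nextOrd());
    }
    return out;
  }
}
```

The count is read once before the loop here; the snippet in the description calls {{docValueCount()}} in the loop condition, which visits the same ordinals.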
[jira] [Resolved] (LUCENE-10603) Improve iteration of ords for SortedSetDocValues
[ https://issues.apache.org/jira/browse/LUCENE-10603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Greg Miller resolved LUCENE-10603.
----------------------------------
    Resolution: Fixed
[jira] [Updated] (LUCENE-10632) Change getAllChildren to return all children regardless of the count
[ https://issues.apache.org/jira/browse/LUCENE-10632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Greg Miller updated LUCENE-10632:
---------------------------------
    Component/s: modules/facet

> Change getAllChildren to return all children regardless of the count
> ---------------------------------------------------------------------
>
>                 Key: LUCENE-10632
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10632
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/facet
>            Reporter: Yuting Gan
>            Priority: Minor
>
> Currently, the getAllChildren functionality is implemented in a way that is
> similar to getTopChildren, where they only return children with count that is
> greater than zero.
> However, the original getTopChildren in RangeFacetCounts returned all children
> whether-or-not the count was zero. This actually has good use cases and we
> should continue supporting the feature in getAllChildren, so that we will not
> lose it after properly supporting getTopChildren in RangeFacetCounts.
> As discussed with [~gsmiller] in the [LUCENE-10614
> pr|https://github.com/apache/lucene/pull/974], allowing getAllChildren to
> behave differently from getTopChildren can actually be more helpful for
> users. If users want to get children with only positive count, we have
> getTopChildren supporting this behavior already. Therefore, the
> getAllChildren API should provide all children in all of the implementations,
> whether-or-not the count is zero.
[jira] [Commented] (LUCENE-10632) Change getAllChildren to return all children regardless of the count
[ https://issues.apache.org/jira/browse/LUCENE-10632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17566560#comment-17566560 ]

Greg Miller commented on LUCENE-10632:
--------------------------------------

Bringing a conversation we had offline about this issue here for transparency and future discovery.

While I think it would be ideal if {{getAllChildren}} could return _all_ children, regardless of the count, it's not really practical in most of our {{Facets}} implementations since they only "see" children that exist in the docs they're counting. So if they're counting from a {{FacetsCollector}}, and those hits don't contain some of the possible child values for a given dimension, it's quite hard for {{getAllChildren}} to know about them.

So for now, I think it's reasonable that range facet counting behaves a little differently from the rest and returns all the ranges it was asked about, regardless of count. This is consistent with the behavior of {{getSpecificValue}}; the two are similar use-cases in that the user is providing the value(s) they care about. But this does create a small inconsistency in the behavior of {{getAllChildren}} generally.
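The distinction discussed for range faceting can be pictured as two views over the same per-range counts: one that returns every requested range in the order given, zero counts included, and one that returns only the highest positive counts. The sketch below is illustrative only and does not use the Lucene facet classes. The class name, the method names, the (label, count) modeling, and the choice to keep request order among equal counts are all assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical model of range counts; not the RangeFacetCounts implementation. */
final class RangeChildrenSketch {

  /** Every requested range, in the order the caller supplied them, zero counts included. */
  static List<Map.Entry<String, Integer>> allChildren(List<String> labels, int[] counts) {
    List<Map.Entry<String, Integer>> out = new ArrayList<>();
    for (int i = 0; i < labels.size(); i++) {
      out.add(Map.entry(labels.get(i), counts[i]));
    }
    return out;
  }

  /** Up to topN ranges with a positive count, highest count first. */
  static List<Map.Entry<String, Integer>> topChildren(int topN, List<String> labels, int[] counts) {
    List<Map.Entry<String, Integer>> out = new ArrayList<>();
    for (Map.Entry<String, Integer> child : allChildren(labels, counts)) {
      if (child.getValue() > 0) {
        out.add(child);
      }
    }
    // List.sort is stable, so ranges with equal counts keep their request order (assumption).
    out.sort((a, b) -> Integer.compare(b.getValue(), a.getValue()));
    return new ArrayList<>(out.subList(0, Math.min(topN, out.size())));
  }
}
```

Because the caller supplies the ranges up front, the "all" view can report zero-count ranges, which is exactly what implementations driven only by the values seen in matching documents cannot do.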
[jira] [Commented] (LUCENE-10653) Should BlockMaxMaxscoreScorer rebuild its heap in bulk?
[ https://issues.apache.org/jira/browse/LUCENE-10653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17565195#comment-17565195 ]

Greg Miller commented on LUCENE-10653:
--------------------------------------

Here's essentially what I'm thinking: https://github.com/gsmiller/lucene/commit/597a760d6c0b0524ba1d72c290689e4dc4b4b9e9
[jira] [Created] (LUCENE-10653) Should BlockMaxMaxscoreScorer rebuild its heap in bulk?
Greg Miller created LUCENE-10653:
------------------------------------

             Summary: Should BlockMaxMaxscoreScorer rebuild its heap in bulk?
                 Key: LUCENE-10653
                 URL: https://issues.apache.org/jira/browse/LUCENE-10653
             Project: Lucene - Core
          Issue Type: Improvement
          Components: core/search
            Reporter: Greg Miller

BMMScorer has to frequently rebuild its heap, and does so by clearing and then iteratively calling {{add}}. It would be more efficient to heapify in bulk. This is more academic than anything right now though since BMMScorer is only used with two-clause disjunctions, so it's sort of a silly optimization if it's not supporting a greater number of clauses.
[jira] [Resolved] (LUCENE-10614) Properly support getTopChildren in RangeFacetCounts
[ https://issues.apache.org/jira/browse/LUCENE-10614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Greg Miller resolved LUCENE-10614.
----------------------------------
    Fix Version/s: 10.0 (main)
       Resolution: Fixed

> Properly support getTopChildren in RangeFacetCounts
> ---------------------------------------------------
>
>                 Key: LUCENE-10614
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10614
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/facet
>    Affects Versions: 10.0 (main)
>            Reporter: Greg Miller
>            Priority: Minor
>             Fix For: 10.0 (main)
>
>          Time Spent: 4h 10m
>  Remaining Estimate: 0h
>
> As mentioned in LUCENE-10538, {{RangeFacetCounts}} is not implementing
> {{getTopChildren}}. Instead of returning "top" ranges, it returns all
> user-provided ranges in the order the user specified them when instantiating.
> This is probably more useful functionality, but it would be nice to support
> {{getTopChildren}} as well.
> LUCENE-10550 is introducing the concept of {{getAllChildren}}, so once that
> lands, we can replace the current implementation of {{getTopChildren}} with
> an actual "top children" implementation and direct users to
> {{getAllChildren}} if they want to maintain the current behavior.
[jira] [Commented] (LUCENE-10614) Properly support getTopChildren in RangeFacetCounts
[ https://issues.apache.org/jira/browse/LUCENE-10614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17565084#comment-17565084 ] Greg Miller commented on LUCENE-10614: -- Thanks again [~yutinggan] ! > Properly support getTopChildren in RangeFacetCounts > --- > > Key: LUCENE-10614 > URL: https://issues.apache.org/jira/browse/LUCENE-10614 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/facet >Affects Versions: 10.0 (main) >Reporter: Greg Miller >Priority: Minor > Time Spent: 4h 10m > Remaining Estimate: 0h > > As mentioned in LUCENE-10538, {{RangeFacetCounts}} is not implementing > {{getTopChildren}}. Instead of returning "top" ranges, it returns all > user-provided ranges in the order the user specified them when instantiating. > This is probably more useful functionality, but it would be nice to support > {{getTopChildren}} as well. > LUCENE-10550 is introducing the concept of {{getAllChildren}}, so once that > lands, we can replace the current implementation of {{getTopChildren}} with > an actual "top children" implementation and direct users to > {{getAllChildren}} if they want to maintain the current behavior. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10614) Properly support getTopChildren in RangeFacetCounts
[ https://issues.apache.org/jira/browse/LUCENE-10614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17565083#comment-17565083 ] Greg Miller commented on LUCENE-10614: -- Just merged this to {{{}main{}}}. I don't think we should backport this to 9.x since it is a functional change to an existing API. Because of this, I moved the CHANGES entry under 10.0 and added an entry to MIGRATE describing the difference and how to retain the 9.x functionality if desired. > Properly support getTopChildren in RangeFacetCounts > --- > > Key: LUCENE-10614 > URL: https://issues.apache.org/jira/browse/LUCENE-10614 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/facet >Affects Versions: 10.0 (main) >Reporter: Greg Miller >Priority: Minor > Time Spent: 4h 10m > Remaining Estimate: 0h > > As mentioned in LUCENE-10538, {{RangeFacetCounts}} is not implementing > {{getTopChildren}}. Instead of returning "top" ranges, it returns all > user-provided ranges in the order the user specified them when instantiating. > This is probably more useful functionality, but it would be nice to support > {{getTopChildren}} as well. > LUCENE-10550 is introducing the concept of {{getAllChildren}}, so once that > lands, we can replace the current implementation of {{getTopChildren}} with > an actual "top children" implementation and direct users to > {{getAllChildren}} if they want to maintain the current behavior. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10603) Improve iteration of ords for SortedSetDocValues
[ https://issues.apache.org/jira/browse/LUCENE-10603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17563912#comment-17563912 ] Greg Miller commented on LUCENE-10603: -- It looks like the only remaining work is to: # Remove the NO_MORE_ORDS definition # Update all the SortedSetDocValue implementations to stop returning NO_MORE_ORDS in nextOrd() # Remove all the test assertions that validate that SSDV#nextOrd() returns NO_MORE_ORDS This should all be main branch work, and not something we backport to 9.x. I think 9.x is now good. > Improve iteration of ords for SortedSetDocValues > > > Key: LUCENE-10603 > URL: https://issues.apache.org/jira/browse/LUCENE-10603 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Lu Xugang >Assignee: Lu Xugang >Priority: Trivial > Time Spent: 5h 20m > Remaining Estimate: 0h > > After SortedSetDocValues#docValueCount added since Lucene 9.2, should we > refactor the implementation of ords iterations using docValueCount instead of > NO_MORE_ORDS? > Similar how SortedNumericDocValues did > From > {code:java} > for (long ord = values.nextOrd();ord != SortedSetDocValues.NO_MORE_ORDS; ord > = values.nextOrd()) { > }{code} > to > {code:java} > for (int i = 0; i < values.docValueCount(); i++) { > long ord = values.nextOrd(); > }{code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Created] (LUCENE-10644) Facets#getAllChildren testing should ignore child order
Greg Miller created LUCENE-10644: Summary: Facets#getAllChildren testing should ignore child order Key: LUCENE-10644 URL: https://issues.apache.org/jira/browse/LUCENE-10644 Project: Lucene - Core Issue Type: Improvement Components: modules/facet Reporter: Greg Miller Attachments: failing tests.png Our javadoc for {{Facets#getAllChildren}} explicitly calls out that callers should make no assumptions about child ordering, but a number of our own unit tests turn around and make that assumption. I ran into this when recently trying an optimization that would result in a different child ordering for {{{}getAllChildren{}}}, and found a number of unit tests that started failing. I'll upload a list of what I found failing. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
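One way a test could compare {{getAllChildren}} results without assuming an order is to sort both sides by label before comparing. The sketch below is only an illustration under that assumption: {{UnorderedChildrenSketch}} is a hypothetical helper, it uses {{Map.Entry<String, Integer>}} as a stand-in for label/value pairs rather than Lucene's own result types, and it assumes labels are unique within a dimension.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical test helper; (label, value) pairs stand in for facet children. */
final class UnorderedChildrenSketch {

  /** Returns true if both lists hold the same (label, value) pairs, regardless of order. */
  static boolean sameChildrenIgnoringOrder(
      List<Map.Entry<String, Integer>> expected, List<Map.Entry<String, Integer>> actual) {
    return sortedByLabel(expected).equals(sortedByLabel(actual));
  }

  /** Copies the list and sorts the copy by label, leaving the input untouched. */
  private static List<Map.Entry<String, Integer>> sortedByLabel(
      List<Map.Entry<String, Integer>> children) {
    List<Map.Entry<String, Integer>> copy = new ArrayList<>(children);
    copy.sort(Map.Entry.comparingByKey());
    return copy;
  }
}
```

A test written this way keeps passing if an optimization changes the order in which children come back, which is the situation the issue describes.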
[jira] [Updated] (LUCENE-10644) Facets#getAllChildren testing should ignore child order
[ https://issues.apache.org/jira/browse/LUCENE-10644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Miller updated LUCENE-10644: - Attachment: failing tests.png > Facets#getAllChildren testing should ignore child order > --- > > Key: LUCENE-10644 > URL: https://issues.apache.org/jira/browse/LUCENE-10644 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/facet >Reporter: Greg Miller >Priority: Minor > Attachments: failing tests.png > > > Our javadoc for {{Facets#getAllChildren}} explicitly calls out that callers > should make no assumptions about child ordering, but a number of our own unit > tests turn around and make that assumption. I ran into this when recently > trying an optimization that would result in a different child ordering for > {{{}getAllChildren{}}}, and found a number of unit tests that started > failing. I'll upload a list of what I found failing. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10603) Improve iteration of ords for SortedSetDocValues
[ https://issues.apache.org/jira/browse/LUCENE-10603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17562832#comment-17562832 ] Greg Miller commented on LUCENE-10603: -- Thanks [~stefanvodita] for jumping in as well to help! I left a little feedback on the PR. Thanks again! > Improve iteration of ords for SortedSetDocValues > > > Key: LUCENE-10603 > URL: https://issues.apache.org/jira/browse/LUCENE-10603 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Lu Xugang >Assignee: Lu Xugang >Priority: Trivial > Time Spent: 4h > Remaining Estimate: 0h > > After SortedSetDocValues#docValueCount added since Lucene 9.2, should we > refactor the implementation of ords iterations using docValueCount instead of > NO_MORE_ORDS? > Similar how SortedNumericDocValues did > From > {code:java} > for (long ord = values.nextOrd();ord != SortedSetDocValues.NO_MORE_ORDS; ord > = values.nextOrd()) { > }{code} > to > {code:java} > for (int i = 0; i < values.docValueCount(); i++) { > long ord = values.nextOrd(); > }{code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10639) WANDScorer performs better without two-phase
[ https://issues.apache.org/jira/browse/LUCENE-10639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17562786#comment-17562786 ] Greg Miller commented on LUCENE-10639: -- As a quick update, I ran benchmarks with just [livedoc checking broken out|https://github.com/gsmiller/lucene/commit/f4e9614a299523b57c854a3bd3371253f0a7fb17] in {{DefaultBulkScorer}}. I surprisingly didn't see any difference. So maybe something else going on here? Note that I ran this with {{wikimedium10m}} instead of {{all}} to get a datapoint a bit quicker:
{code:java}
Task                        QPS baseline  StdDev  QPS candidate  StdDev  Pct diff  p-value
Prefix3                      118.98  (10.2%)   114.60   (9.9%)  -3.7%  ( -21% -  18%)  0.247
Wildcard                      40.69   (6.9%)    39.62   (7.2%)  -2.6%  ( -15% -  12%)  0.236
TermDTSort                    17.76  (20.4%)    17.33  (14.2%)  -2.4%  ( -30% -  40%)  0.663
OrNotHighHigh                881.01   (4.4%)   861.34   (3.9%)  -2.2%  ( -10% -   6%)  0.089
AndHighHigh                    8.87   (5.0%)     8.70   (6.2%)  -1.8%  ( -12% -   9%)  0.296
MedTerm                     1771.40   (4.2%)  1740.50   (4.4%)  -1.7%  (  -9% -   7%)  0.198
AndHighMed                    30.59   (4.0%)    30.06   (5.6%)  -1.7%  ( -10% -   8%)  0.267
OrHighNotLow                 782.90   (4.8%)   769.92   (5.1%)  -1.7%  ( -11% -   8%)  0.291
HighPhrase                   392.18   (2.7%)   386.50   (2.7%)  -1.4%  (  -6% -   4%)  0.087
OrHighNotHigh                830.76   (4.3%)   818.83   (4.3%)  -1.4%  (  -9% -   7%)  0.295
OrNotHighMed                 585.86   (2.6%)   578.07   (3.1%)  -1.3%  (  -6% -   4%)  0.146
OrHighNotMed                 966.75   (3.6%)   956.07   (3.9%)  -1.1%  (  -8% -   6%)  0.352
LowPhrase                    546.02   (2.1%)   540.42   (2.4%)  -1.0%  (  -5% -   3%)  0.148
MedPhrase                     24.65   (2.3%)    24.40   (3.0%)  -1.0%  (  -6% -   4%)  0.225
AndHighLow                   508.37   (3.7%)   503.84   (4.7%)  -0.9%  (  -8% -   7%)  0.506
OrNotHighLow                 672.15   (2.7%)   666.29   (2.8%)  -0.9%  (  -6% -   4%)  0.313
BrowseMonthTaxoFacets          8.92  (14.5%)     8.84  (13.9%)  -0.9%  ( -25% -  32%)  0.846
AndHighMedDayTaxoFacets       39.14   (2.2%)    38.82   (2.2%)  -0.8%  (  -5% -   3%)  0.241
AndHighHighDayTaxoFacets       8.01   (2.8%)     7.96   (2.8%)  -0.7%  (  -6% -   4%)  0.416
LowSloppyPhrase                5.83   (3.8%)     5.79   (3.8%)  -0.7%  (  -8% -   7%)  0.556
OrHighLow                    128.01   (3.7%)   127.11   (3.8%)  -0.7%  (  -7% -   7%)  0.554
HighTerm                    1190.03   (4.4%)  1183.10   (4.1%)  -0.6%  (  -8% -   8%)  0.663
MedSloppyPhrase               11.67   (2.1%)    11.61   (2.6%)  -0.5%  (  -5% -   4%)  0.480
MedTermDayTaxoFacets          14.09   (3.1%)    14.03   (4.1%)  -0.5%  (  -7% -   6%)  0.686
IntNRQ                       110.15   (2.3%)   109.69   (2.1%)  -0.4%  (  -4% -   4%)  0.546
HighSloppyPhrase               9.56   (4.5%)     9.53   (4.5%)  -0.4%  (  -8% -   9%)  0.794
BrowseDateSSDVFacets           0.85  (10.4%)     0.85  (10.8%)  -0.3%  ( -19% -  23%)  0.939
Respell                       33.65   (1.7%)    33.58   (1.7%)  -0.2%  (  -3% -   3%)  0.684
Fuzzy2                        74.16   (1.9%)    74.02   (1.7%)  -0.2%  (  -3% -   3%)  0.740
LowTerm                     1522.48   (2.9%)  1520.76   (3.3%)  -0.1%  (  -6% -   6%)  0.909
LowIntervalsOrdered           12.75   (3.3%)    12.74   (3.3%)  -0.1%  (  -6% -   6%)  0.915
HighIntervalsOrdered           6.30   (4.2%)     6.31   (4.0%)   0.1%  (  -7% -   8%)  0.923
BrowseRandomLabelSSDVFacets    2.57   (4.9%)     2.57   (4.9%)   0.1%  (  -9% -  10%)  0.927
Fuzzy1                        57.11   (1.9%)    57.26   (1.7%)   0.2%  (  -3% -   3%)  0.666
BrowseRandomLabelTaxoFacets    6.32   (9.3%)     6.34  (10.3%)   0.3%  ( -17% -  21%)  0.911
LowSpanNear                   15.95   (2.9%)    16.01   (2.7%)   0.4%  (  -5% -   6%)  0.680
MedIntervalsOrdered            1.61   (5.8%)     1.62   (5.8%)   0.4%  ( -10% -  12%)  0.834
HighSpanNear                   2.27   (4.2%)     2.28   (4.0%)   0.6%  (  -7% -   9%)  0.636
[jira] [Commented] (LUCENE-10639) WANDScorer performs better without two-phase
[ https://issues.apache.org/jira/browse/LUCENE-10639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17561772#comment-17561772 ] Greg Miller commented on LUCENE-10639: -- {quote}I suspected there was some overhead to two-phase iteration but not as much as this. {quote} Right. Yeah, I guess I was so surprised by the performance shift that I assumed there must be an interesting second-phase happening. But from what you're saying, it sounds like these {{OrHighLow/Med/High}} tasks aren't doing that. And that the performance change is purely some side-effect of running the two phases instead of doing all the checks in the first phase. I should have dug into what these tasks are doing. {quote}Hotspot was not always able to optimize "if (liveDocs == null)" checks {quote} Interesting. Seems worth a shot. Thanks for the quick thoughts! > WANDScorer performs better without two-phase > > > Key: LUCENE-10639 > URL: https://issues.apache.org/jira/browse/LUCENE-10639 > Project: Lucene - Core > Issue Type: Improvement > Components: core/search >Reporter: Greg Miller >Priority: Major > > After looking at the recent improvement [~jpountz] made to WAND scoring in > LUCENE-10634, which does additional work during match confirmation to not > confirm a match who's score wouldn't be competitive, I wanted to see how > performance would shift if we squashed the two-phase iteration completely and > only returned true matches (that were also known to be competitive by score) > in the "approximation" phase. I was a bit surprised to find that luceneutil > benchmarks (run with {{{}wikimediumall{}}}), improves significantly on some > disjunction tasks and doesn't show significant regressions anywhere else. > Note that I used LUCENE-10634 as a baseline, and built my candidate change on > top of that. 
The diff can be seen here: > [DIFF|https://github.com/gsmiller/lucene/compare/b2d46440998fe4a972e8cc8c948580111359ed0f..c5bab794c92dbc66e70f9389948c1bdfe9b45231] > A simple conclusion here might be that we shouldn't do two-phase iteration in > WANDScorer, but I'm pretty sure that's not right. I wonder if what's really > going on is that we're under-estimating the cost of confirming a match? Right > now we just return the tail size as the cost. While the cost of confirming a > match is proportional to the tail size, the actual work involved can be quite > significant (having to advance tail iterators to new blocks and decompress > them). I wonder if the WAND second phase is being run too early on > approximate candidates, and if less-expensive, (and even possibly more > restrictive?), second phases could/should be running first? > I'm raising this here as more of a curiosity to see if it sparks ideas on how > to move forward. Again, I'm not proposing we do away with two-phase > iteration, but it seems we might be able to improve things. Maybe I'll > explore changing the cost heuristic next. Also, maybe there's some different > benchmarking that would be useful here that I may not be familiar with? 
> Benchmark results on wikimediumall:
> {code:java}
> Task                        QPS baseline  StdDev  QPS candidate  StdDev  Pct diff  p-value
> HighTermTitleBDVSort          22.52  (18.9%)    21.66  (15.6%)  -3.8%  ( -32% -  37%)  0.485
> Prefix3                        9.38   (9.2%)     9.09  (10.6%)  -3.1%  ( -20% -  18%)  0.326
> HighTermMonthSort             25.37  (16.0%)    24.87  (17.1%)  -2.0%  ( -30% -  37%)  0.710
> MedTermDayTaxoFacets           9.62   (4.2%)     9.51   (4.1%)  -1.2%  (  -9% -   7%)  0.368
> TermDTSort                    74.69  (18.0%)    74.13  (18.2%)  -0.7%  ( -31% -  43%)  0.897
> HighTermDayOfYearSort         52.64  (16.1%)    52.32  (15.4%)  -0.6%  ( -27% -  36%)  0.903
> BrowseMonthTaxoFacets          8.64  (19.1%)     8.59  (19.8%)  -0.6%  ( -33% -  47%)  0.926
> BrowseDateSSDVFacets           0.86   (9.5%)     0.86  (13.1%)  -0.4%  ( -20% -  24%)  0.914
> PKLookup                     147.18   (3.9%)   146.66   (3.3%)  -0.3%  (  -7% -   7%)  0.759
> BrowseDayOfYearSSDVFacets      3.47   (4.5%)     3.45   (4.8%)  -0.3%  (  -9% -   9%)  0.822
> Wildcard                      36.36   (4.4%)    36.26   (5.2%)  -0.3%  (  -9% -   9%)  0.866
> BrowseMonthSSDVFacets          4.15  (12.7%)     4.13  (12.8%)  -0.3%  ( -22% -  28%)  0.950
> AndHighMedDayTaxoFacets       15.21   (2.7%)    15.18   (2.9%)  -0.2%  (  -5% -   5%)  0.819
> Fuzzy1                        68.33   (1.8%)    68.22
[jira] [Created] (LUCENE-10639) WANDScorer performs better without two-phase
Greg Miller created LUCENE-10639: Summary: WANDScorer performs better without two-phase Key: LUCENE-10639 URL: https://issues.apache.org/jira/browse/LUCENE-10639 Project: Lucene - Core Issue Type: Improvement Components: core/search Reporter: Greg Miller After looking at the recent improvement [~jpountz] made to WAND scoring in LUCENE-10634, which does additional work during match confirmation to not confirm a match whose score wouldn't be competitive, I wanted to see how performance would shift if we squashed the two-phase iteration completely and only returned true matches (that were also known to be competitive by score) in the "approximation" phase. I was a bit surprised to find that luceneutil benchmarks (run with {{{}wikimediumall{}}}) improve significantly on some disjunction tasks and don't show significant regressions anywhere else. Note that I used LUCENE-10634 as a baseline, and built my candidate change on top of that. The diff can be seen here: [DIFF|https://github.com/gsmiller/lucene/compare/b2d46440998fe4a972e8cc8c948580111359ed0f..c5bab794c92dbc66e70f9389948c1bdfe9b45231] A simple conclusion here might be that we shouldn't do two-phase iteration in WANDScorer, but I'm pretty sure that's not right. I wonder if what's really going on is that we're under-estimating the cost of confirming a match? Right now we just return the tail size as the cost. While the cost of confirming a match is proportional to the tail size, the actual work involved can be quite significant (having to advance tail iterators to new blocks and decompress them). I wonder if the WAND second phase is being run too early on approximate candidates, and if less-expensive (and even possibly more restrictive?) second phases could/should be running first? I'm raising this here as more of a curiosity to see if it sparks ideas on how to move forward. Again, I'm not proposing we do away with two-phase iteration, but it seems we might be able to improve things. 
Maybe I'll explore changing the cost heuristic next. Also, maybe there's some different benchmarking that would be useful here that I may not be familiar with? Benchmark results on wikimediumall:
{code:java}
Task                        QPS baseline  StdDev  QPS candidate  StdDev  Pct diff  p-value
HighTermTitleBDVSort          22.52  (18.9%)    21.66  (15.6%)  -3.8%  ( -32% -  37%)  0.485
Prefix3                        9.38   (9.2%)     9.09  (10.6%)  -3.1%  ( -20% -  18%)  0.326
HighTermMonthSort             25.37  (16.0%)    24.87  (17.1%)  -2.0%  ( -30% -  37%)  0.710
MedTermDayTaxoFacets           9.62   (4.2%)     9.51   (4.1%)  -1.2%  (  -9% -   7%)  0.368
TermDTSort                    74.69  (18.0%)    74.13  (18.2%)  -0.7%  ( -31% -  43%)  0.897
HighTermDayOfYearSort         52.64  (16.1%)    52.32  (15.4%)  -0.6%  ( -27% -  36%)  0.903
BrowseMonthTaxoFacets          8.64  (19.1%)     8.59  (19.8%)  -0.6%  ( -33% -  47%)  0.926
BrowseDateSSDVFacets           0.86   (9.5%)     0.86  (13.1%)  -0.4%  ( -20% -  24%)  0.914
PKLookup                     147.18   (3.9%)   146.66   (3.3%)  -0.3%  (  -7% -   7%)  0.759
BrowseDayOfYearSSDVFacets      3.47   (4.5%)     3.45   (4.8%)  -0.3%  (  -9% -   9%)  0.822
Wildcard                      36.36   (4.4%)    36.26   (5.2%)  -0.3%  (  -9% -   9%)  0.866
BrowseMonthSSDVFacets          4.15  (12.7%)     4.13  (12.8%)  -0.3%  ( -22% -  28%)  0.950
AndHighMedDayTaxoFacets       15.21   (2.7%)    15.18   (2.9%)  -0.2%  (  -5% -   5%)  0.819
Fuzzy1                        68.33   (1.8%)    68.22   (2.0%)  -0.2%  (  -3% -   3%)  0.783
OrHighMedDayTaxoFacets         2.90   (4.1%)     2.89   (4.0%)  -0.1%  (  -7% -   8%)  0.930
MedPhrase                     52.81   (2.3%)    52.76   (1.8%)  -0.1%  (  -4% -   4%)  0.878
Respell                       36.80   (1.9%)    36.78   (1.9%)  -0.1%  (  -3% -   3%)  0.933
Fuzzy2                        63.06   (1.9%)    63.05   (2.1%)  -0.0%  (  -3% -   4%)  0.971
LowPhrase                     74.60   (1.9%)    74.61   (1.8%)   0.0%  (  -3% -   3%)  0.987
AndHighHighDayTaxoFacets       4.54   (2.3%)     4.55   (2.0%)   0.0%  (  -4% -   4%)  0.960
HighPhrase                   353.13   (2.6%)   353.28   (2.5%)   0.0%  (  -4% -   5%)  0.958
OrNotHighHigh                761.72   (4.0%)   762.48   (3.6%)   0.1%  (  -7% -   8%)  0.935
OrHighNotLow                1129.94   (4.1%)  1131.56
[jira] [Comment Edited] (LUCENE-10246) Support getting counts from "association" facets
[ https://issues.apache.org/jira/browse/LUCENE-10246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17561222#comment-17561222 ] Greg Miller edited comment on LUCENE-10246 at 7/1/22 12:27 AM: --- [~shahrs87] I'd start by becoming familiar with the existing "association facet" implementations ({{TaxonomyFacetIntAssociations}} and {{TaxonomyFacetFloatAssociations}} as well as looking at some demo code like {{AssociationsFacetsExample}}). The API contract they implement represent results with {{FacetResult}}, which contains a list of {{LabelAndValue}} instances. {{LabelAndValue}} only models a single label along with a single numeric value. The value "usually" represents a total faceting count for a label in "non-association" facets, but with association faceting, value takes on an aggregated weight "associated" with the label. The idea with this Jira is to be able to convey _both_ an aggregated weight and the count associated with a label. The best way to do that without creating a weird API for non-association cases is something that will probably take a little thought. Should we just put another "count" field in {{LabelAndValue}} and have both value and count be populated with a count for non-association cases? That sounds weird. So beyond understanding what's currently there, I think the next step is to think about the right way to evolve the API that doesn't create a weird interaction for non-association faceting, especially since those are more commonly used. Please reach out here as you have questions and I'll do my best to answer in a timely fashion. Thanks for having a look at this! was (Author: gsmiller): [~shahrs87] I'd start by becoming familiar with the existing "association facet" implementations ({{TaxonomyFacetIntAssociations}} and {{TaxonomyFacetFloatAssociations}} as well as looking at some demo code like {{AssociationsFacetsExample}}). 
The API contract they implement represent results with {{FacetResult}}, which contains a list of {{LabelAndValue}} instances. {{LabelAndValue}} only models a single label along with a single numeric value. The value "usually" represents a total faceting count for a label in "non-association" facets, but with association faceting, value takes on an aggregated weight "associated" with the label. The idea with this Jira is to be able to convey _both_ an aggregated weight and the count associated with a label. The best way to do that without creating a weird API for non-association cases is something that will probably take a little thought. Should we just put another "count" field in {{LabelAndValue}} and have both value and count be populated with a count for non-assocation cases? That sounds weird. So beyond understanding what's currently there, I think the next step is to think about the right way to evolve the API that doesn't create a weird interaction for non-association faceting, especially since those are more commonly used. > Support getting counts from "association" facets > > > Key: LUCENE-10246 > URL: https://issues.apache.org/jira/browse/LUCENE-10246 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/facet >Reporter: Greg Miller >Priority: Minor > > We have these nice "association" facet implementations today that aggregate > "weights" from the docs that facet over, but they don't keep track of counts. > So the user can get "top-n" values for a dim by aggregated weight (great!), > but can't know how many docs matched each value. It would be nice to support > this so users could show the top-n values but _also_ show counts associated > with each. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10246) Support getting counts from "association" facets
[ https://issues.apache.org/jira/browse/LUCENE-10246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17561222#comment-17561222 ] Greg Miller commented on LUCENE-10246: -- [~shahrs87] I'd start by becoming familiar with the existing "association facet" implementations ({{TaxonomyFacetIntAssociations}} and {{TaxonomyFacetFloatAssociations}} as well as looking at some demo code like {{AssociationsFacetsExample}}). The API contract they implement represents results with {{FacetResult}}, which contains a list of {{LabelAndValue}} instances. {{LabelAndValue}} only models a single label along with a single numeric value. The value "usually" represents a total faceting count for a label in "non-association" facets, but with association faceting, value takes on an aggregated weight "associated" with the label. The idea with this Jira is to be able to convey _both_ an aggregated weight and the count associated with a label. The best way to do that without creating a weird API for non-association cases is something that will probably take a little thought. Should we just put another "count" field in {{LabelAndValue}} and have both value and count be populated with a count for non-association cases? That sounds weird. So beyond understanding what's currently there, I think the next step is to think about the right way to evolve the API that doesn't create a weird interaction for non-association faceting, especially since those are more commonly used. > Support getting counts from "association" facets > > > Key: LUCENE-10246 > URL: https://issues.apache.org/jira/browse/LUCENE-10246 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/facet >Reporter: Greg Miller >Priority: Minor > > We have these nice "association" facet implementations today that aggregate > "weights" from the docs that facet over, but they don't keep track of counts. 
> So the user can get "top-n" values for a dim by aggregated weight (great!), > but can't know how many docs matched each value. It would be nice to support > this so users could show the top-n values but _also_ show counts associated > with each. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
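To illustrate what "both an aggregated weight and a count per label" could mean in practice, here is a small sketch. It is deliberately not a proposal for the Lucene API: {{AssociationCountSketch}} and {{WeightAndCount}} are hypothetical names, and the input is a pair of parallel arrays (one label and one weight per matching doc) instead of anything read from an index. It only shows the aggregation the issue asks for, leaving open the question of how {{LabelAndValue}} should evolve.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical illustration only; these types are not part of Lucene's facet module. */
final class AssociationCountSketch {

  /** Aggregated weight for a label plus the number of docs that contributed to it. */
  static final class WeightAndCount {
    float weight;
    int count;
  }

  /**
   * Aggregates per-doc associations. labels[i] and weights[i] describe one
   * (label, weight) association taken from one matching doc.
   */
  static Map<String, WeightAndCount> aggregate(String[] labels, float[] weights) {
    Map<String, WeightAndCount> out = new LinkedHashMap<>();
    for (int i = 0; i < labels.length; i++) {
      WeightAndCount wc = out.computeIfAbsent(labels[i], k -> new WeightAndCount());
      wc.weight += weights[i]; // what association facets already aggregate
      wc.count++;              // the extra per-label count this issue asks for
    }
    return out;
  }
}
```

With both numbers tracked side by side, a caller could rank by weight and still display how many docs matched each label.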
[jira] [Commented] (LUCENE-10603) Improve iteration of ords for SortedSetDocValues
[ https://issues.apache.org/jira/browse/LUCENE-10603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17561203#comment-17561203 ] Greg Miller commented on LUCENE-10603: -- I pushed another commit that takes care of the remaining "production" code iteration. I think the next step is to knock out all remaining iteration patterns, which should only exist in "test" related code. When I get some more free time I'll take a pass at it, but might be a week or so. Happy to have someone beat me to it :) > Improve iteration of ords for SortedSetDocValues > > > Key: LUCENE-10603 > URL: https://issues.apache.org/jira/browse/LUCENE-10603 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Lu Xugang >Assignee: Lu Xugang >Priority: Trivial > Time Spent: 3h 40m > Remaining Estimate: 0h > > After SortedSetDocValues#docValueCount added since Lucene 9.2, should we > refactor the implementation of ords iterations using docValueCount instead of > NO_MORE_ORDS? > Similar how SortedNumericDocValues did > From > {code:java} > for (long ord = values.nextOrd();ord != SortedSetDocValues.NO_MORE_ORDS; ord > = values.nextOrd()) { > }{code} > to > {code:java} > for (int i = 0; i < values.docValueCount(); i++) { > long ord = values.nextOrd(); > }{code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10546) Update Faceting user guide
[ https://issues.apache.org/jira/browse/LUCENE-10546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17561199#comment-17561199 ] Greg Miller commented on LUCENE-10546: -- Great, thanks [~epotiom]! I'm not aware of anyone else working on this. > Update Faceting user guide > -- > > Key: LUCENE-10546 > URL: https://issues.apache.org/jira/browse/LUCENE-10546 > Project: Lucene - Core > Issue Type: Wish > Components: modules/facet >Reporter: Greg Miller >Priority: Minor > > The [facet user > guide|https://lucene.apache.org/core/4_1_0/facet/org/apache/lucene/facet/doc-files/userguide.html] > was written based on 4.1. Since there's been a fair amount of active > facet-related development over the last year+, it would be nice to review the > guide and see what updates make sense. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Resolved] (LUCENE-10274) Implement "hyperrectangle" faceting
Greg Miller resolved as Fixed Very excited to see this shipped! Thanks Shai Erera and Marc D'Mello for all the PR iterations and conversation. Great example of shipping something much stronger than the original idea after rounds of discussion and iteration. Thanks again! Lucene - Core / LUCENE-10274 Implement "hyperrectangle" faceting Change By: Greg Miller Fix Version/s: 9.3 Resolution: Fixed Status: Open Resolved This message was sent by Atlassian Jira (v8.20.10#820010-sha1:ace47f9)
[jira] [Commented] (LUCENE-10603) Improve iteration of ords for SortedSetDocValues
Greg Miller commented on LUCENE-10603 Re: Improve iteration of ords for SortedSetDocValues Thanks Lu Xugang for letting me know! As I have some free time, I'll try to migrate a few more modules over (and will update here as I put out PRs for the modules). This message was sent by Atlassian Jira (v8.20.10#820010-sha1:ace47f9)
[jira] [Commented] (LUCENE-10624) Binary Search for Sparse IndexedDISI advanceWithinBlock & advanceExactWithinBlock
[ https://issues.apache.org/jira/browse/LUCENE-10624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17558709#comment-17558709 ] Greg Miller commented on LUCENE-10624: -- Oh, and just to clarify my above comment, I'm not weighing in (yet?) on whether-or-not this change makes sense, just adding a data point that we didn't see an impact in our particular application one way or the other. So it doesn't seem to help our usage patterns, but it also doesn't seem to hurt. +1 to [~jpountz]'s sentiment though to understand why those benchmark tasks you saw impact on changed. Thanks for pursuing this! > Binary Search for Sparse IndexedDISI advanceWithinBlock & > advanceExactWithinBlock > - > > Key: LUCENE-10624 > URL: https://issues.apache.org/jira/browse/LUCENE-10624 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Affects Versions: 9.0, 9.1, 9.2 >Reporter: Weiming Wu >Priority: Major > Attachments: baseline_sparseTaxis_searchsparse-sorted.0.log, > candiate-exponential-searchsparse-sorted.0.log, > candidate_sparseTaxis_searchsparse-sorted.0.log > > Time Spent: 40m > Remaining Estimate: 0h > > h3. Problem Statement > We noticed DocValue read performance regression with the iterative API when > upgrading from Lucene 5 to Lucene 9. Our latency is increased by 50%. The > degradation is similar to what's described in > https://issues.apache.org/jira/browse/SOLR-9599 > By analyzing profiling data, we found method "advanceWithinBlock" and > "advanceExactWithinBlock" for Sparse IndexedDISI is slow in Lucene 9 due to > their O(N) doc lookup algorithm. > h3. Changes > Used binary search algorithm to replace current O(N) lookup algorithm in > Sparse IndexedDISI "advanceWithinBlock" and "advanceExactWithinBlock" because > docs are in ascending order. > h3. Test > {code:java} > ./gradlew tidy > ./gradlew check {code} > h3. Benchmark > Ran sparseTaxis test cases from luceneutil. 
Attached the > reports of baseline and candidates in attachments section. > 1. Most cases have 5-10% search latency reduction. > 2. Some highlights (>20%): > * *T0 green_pickup_latitude:[40.75 TO 40.9] > yellow_pickup_latitude:[40.75 TO 40.9] sort=null* > ** *Baseline:* 10973978+ hits hits in *726.81967 msec* > ** *Candidate:* 10973978+ hits hits in *484.544594 > msec* > * *T0 cab_color:y cab_color:g sort=null* > ** *Baseline:* 2300174+ hits hits in *95.698324 msec* > ** *Candidate:* 2300174+ hits hits in *78.336193 msec* > * *T1 cab_color:y cab_color:g sort=null* > ** *Baseline:* 2300174+ hits hits in *391.565239 msec* > ** *Candidate:* 300174+ hits hits in *227.592885 > msec* > * *...* -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
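The change described in LUCENE-10624 replaces a linear scan with a binary search over doc IDs that are stored in ascending order. The sketch below shows the two lookups side by side under simplifying assumptions: {{SparseBlockSearchSketch}} is a hypothetical class, and it searches an in-memory int[] of doc IDs, whereas the real IndexedDISI code reads its sparse blocks from an index input. Both methods return the index of the first doc ID that is greater than or equal to the target, starting from a given position.

```java
/** Hypothetical illustration; the real IndexedDISI reads doc IDs from an index input. */
final class SparseBlockSearchSketch {

  /** O(N) scan: index of the first doc >= target at or after 'from', or docs.length if none. */
  static int advanceLinear(int[] docs, int from, int target) {
    for (int i = from; i < docs.length; i++) {
      if (docs[i] >= target) {
        return i;
      }
    }
    return docs.length;
  }

  /** O(log N) lower-bound binary search over the same ascending array, same contract. */
  static int advanceBinary(int[] docs, int from, int target) {
    int lo = from;
    int hi = docs.length;
    while (lo < hi) {
      int mid = (lo + hi) >>> 1;
      if (docs[mid] < target) {
        lo = mid + 1; // everything up to mid is too small
      } else {
        hi = mid; // mid is a candidate; keep searching to its left
      }
    }
    return lo;
  }
}
```

Whether the binary search wins in practice depends on how far each advance typically jumps; a short forward hop can be cheaper with the linear scan, which is consistent with the mixed results reported in the comments.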
[jira] [Commented] (LUCENE-10624) Binary Search for Sparse IndexedDISI advanceWithinBlock & advanceExactWithinBlock
[ https://issues.apache.org/jira/browse/LUCENE-10624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17558700#comment-17558700 ] Greg Miller commented on LUCENE-10624: -- For what it's worth, I ran a benchmark on the Amazon Product Search engine with this change, where we do lots of doc value access for various purposes, and saw effectively no change to latency or throughput (qps). Just adding that as a datapoint from a real-world, large-scale application. > Binary Search for Sparse IndexedDISI advanceWithinBlock & > advanceExactWithinBlock > - > > Key: LUCENE-10624 > URL: https://issues.apache.org/jira/browse/LUCENE-10624 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Affects Versions: 9.0, 9.1, 9.2 >Reporter: Weiming Wu >Priority: Major > Attachments: baseline_sparseTaxis_searchsparse-sorted.0.log, > candiate-exponential-searchsparse-sorted.0.log, > candidate_sparseTaxis_searchsparse-sorted.0.log > > Time Spent: 40m > Remaining Estimate: 0h > > h3. Problem Statement > We noticed DocValue read performance regression with the iterative API when > upgrading from Lucene 5 to Lucene 9. Our latency is increased by 50%. The > degradation is similar to what's described in > https://issues.apache.org/jira/browse/SOLR-9599 > By analyzing profiling data, we found method "advanceWithinBlock" and > "advanceExactWithinBlock" for Sparse IndexedDISI is slow in Lucene 9 due to > their O(N) doc lookup algorithm. > h3. Changes > Used binary search algorithm to replace current O(N) lookup algorithm in > Sparse IndexedDISI "advanceWithinBlock" and "advanceExactWithinBlock" because > docs are in ascending order. > h3. Test > {code:java} > ./gradlew tidy > ./gradlew check {code} > h3. Benchmark > Ran sparseTaxis test cases from {color:#1d1c1d}luceneutil. Attached the > reports of baseline and candidates in attachments section.{color} > {color:#1d1c1d}1. Most cases have 5-10% search latency reduction.{color} > {color:#1d1c1d}2. 
Some highlights (>20%):{color} > * *{color:#1d1c1d}T0 green_pickup_latitude:[40.75 TO 40.9] > yellow_pickup_latitude:[40.75 TO 40.9] sort=null{color}* > ** {color:#1d1c1d}*Baseline:* 10973978+ hits hits in *726.81967 msec*{color} > ** {color:#1d1c1d}*Candidate:* 10973978+ hits hits in *484.544594 > msec*{color} > * *{color:#1d1c1d}T0 cab_color:y cab_color:g sort=null{color}* > ** {color:#1d1c1d}*Baseline:* 2300174+ hits hits in *95.698324 msec*{color} > ** {color:#1d1c1d}*Candidate:* 2300174+ hits hits in *78.336193 msec*{color} > * {color:#1d1c1d}*T1 cab_color:y cab_color:g sort=null*{color} > ** {color:#1d1c1d}*Baseline:* 2300174+ hits hits in *391.565239 msec*{color} > ** {color:#1d1c1d}*Candidate:* 300174+ hits hits in *227.592885 > msec*{color}{*}{*} > * {color:#1d1c1d}*...*{color} -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Resolved] (LUCENE-10550) Add getAllChildren functionality to facets
[ https://issues.apache.org/jira/browse/LUCENE-10550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Miller resolved LUCENE-10550. -- Fix Version/s: 9.3 Resolution: Fixed Thanks again [~yutinggan] ! > Add getAllChildren functionality to facets > -- > > Key: LUCENE-10550 > URL: https://issues.apache.org/jira/browse/LUCENE-10550 > Project: Lucene - Core > Issue Type: New Feature > Components: modules/facet >Reporter: Yuting Gan >Priority: Minor > Fix For: 9.3 > > Time Spent: 2h 40m > Remaining Estimate: 0h > > Currently Lucene does not support returning range counts sorted by label > values, but there are use cases demanding this feature. For example, a user > specifies ranges (e.g., [0, 10], [10, 20]) and wants to get range counts > without changing the range order. Today we can only call getTopChildren to > populate range counts, but it would return ranges sorted by counts (e.g., > [10, 20] 100, [0, 10] 50) instead of range values. > Lucene has an API, getAllChildrenSortByValue, that returns numeric values with > counts sorted by label values; please see > [LUCENE-7927|https://issues.apache.org/jira/browse/LUCENE-7927] for details. > Therefore, it would be nice to also have a similar API to support > range counts. The proposed getAllChildren API is to return value/range counts > sorted by label values instead of counts. > This proposal was inspired by the discussions with [~gsmiller] when I was > working on the LUCENE-10538 [PR|https://github.com/apache/lucene/pull/843], > and we believe users would benefit from adding this API to Facets. > Hope I can get some feedback from the community since this proposal would > require changes to the getTopChildren API in RangeFacetCounts. Thanks! -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10614) Properly support getTopChildren in RangeFacetCounts
[ https://issues.apache.org/jira/browse/LUCENE-10614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17557453#comment-17557453 ] Greg Miller commented on LUCENE-10614: -- Great, thanks [~yutinggan] ! > Properly support getTopChildren in RangeFacetCounts > --- > > Key: LUCENE-10614 > URL: https://issues.apache.org/jira/browse/LUCENE-10614 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/facet >Affects Versions: 10.0 (main) >Reporter: Greg Miller >Priority: Minor > > As mentioned in LUCENE-10538, {{RangeFacetCounts}} is not implementing > {{getTopChildren}}. Instead of returning "top" ranges, it returns all > user-provided ranges in the order the user specified them when instantiating. > This is probably more useful functionality, but it would be nice to support > {{getTopChildren}} as well. > LUCENE-10550 is introducing the concept of {{getAllChildren}}, so once that > lands, we can replace the current implementation of {{getTopChildren}} with > an actual "top children" implementation and direct users to > {{getAllChildren}} if they want to maintain the current behavior. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10603) Improve iteration of ords for SortedSetDocValues
[ https://issues.apache.org/jira/browse/LUCENE-10603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17556891#comment-17556891 ] Greg Miller commented on LUCENE-10603: -- [~ChrisLu] thanks again for proposing this. I've merged the work in the {{facets}} module to use the new style of iteration, but there's still plenty more locations in our code base that need updating. Let me know if you want any help with this. I'm happy to divide up some of the modules if you'd like (or maybe we can recruit others if interested as well). In the meantime, I propose we get this {{NO_MORE_ORDS}} constant marked as {{deprecated}} so we have a shot of removing it in a 10.0 release. By removing it, as [~jpountz] points out in [#954|https://github.com/apache/lucene/pull/954], we may have a performance benefit since we won't need the book-keeping to keep it updated. I opened another PR for this: [#969|https://github.com/apache/lucene/pull/969]. > Improve iteration of ords for SortedSetDocValues > > > Key: LUCENE-10603 > URL: https://issues.apache.org/jira/browse/LUCENE-10603 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Lu Xugang >Assignee: Lu Xugang >Priority: Trivial > Time Spent: 1h 40m > Remaining Estimate: 0h > > After SortedSetDocValues#docValueCount added since Lucene 9.2, should we > refactor the implementation of ords iterations using docValueCount instead of > NO_MORE_ORDS? > Similar how SortedNumericDocValues did > From > {code:java} > for (long ord = values.nextOrd();ord != SortedSetDocValues.NO_MORE_ORDS; ord > = values.nextOrd()) { > }{code} > to > {code:java} > for (int i = 0; i < values.docValueCount(); i++) { > long ord = values.nextOrd(); > }{code} -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
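The two iteration styles quoted in the issue can be compared side by side in a small, self-contained sketch. The class {{OrdIterationSketch}} below is a hypothetical array-backed stand-in, not the real {{SortedSetDocValues}}; only the method names {{nextOrd}} and {{docValueCount}} and the {{NO_MORE_ORDS}} sentinel are taken from the discussion.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical array-backed stand-in for the ords of a single document. */
final class OrdIterationSketch {
  /** Sentinel value, mirroring the constant discussed in the issue. */
  static final long NO_MORE_ORDS = -1;

  private final long[] ords;
  private int pos = 0;

  OrdIterationSketch(long... ords) {
    this.ords = ords;
  }

  /** Returns the next ord, or NO_MORE_ORDS once the document is exhausted. */
  long nextOrd() {
    return pos < ords.length ? ords[pos++] : NO_MORE_ORDS;
  }

  /** Number of ords for the current document. */
  int docValueCount() {
    return ords.length;
  }

  /** Old style: loop until the sentinel is returned. */
  static List<Long> collectWithSentinel(OrdIterationSketch values) {
    List<Long> out = new ArrayList<>();
    for (long ord = values.nextOrd(); ord != NO_MORE_ORDS; ord = values.nextOrd()) {
      out.add(ord);
    }
    return out;
  }

  /** New style: the number of iterations is known before the loop starts. */
  static List<Long> collectWithCount(OrdIterationSketch values) {
    List<Long> out = new ArrayList<>();
    int count = values.docValueCount();
    for (int i = 0; i < count; i++) {
      out.add(values.nextOrd());
    }
    return out;
  }
}
```

Both helpers visit the same ords in the same order. In the count-based loop the implementation never has to produce the sentinel, which is the book-keeping the comment above hopes to remove.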
[jira] [Resolved] (LUCENE-10584) SSDV facets should support hierarchical paths in #getSpecificValue
[ https://issues.apache.org/jira/browse/LUCENE-10584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Miller resolved LUCENE-10584. -- Fix Version/s: 9.3 Resolution: Fixed > SSDV facets should support hierarchical paths in #getSpecificValue > -- > > Key: LUCENE-10584 > URL: https://issues.apache.org/jira/browse/LUCENE-10584 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/facet >Reporter: Greg Miller >Assignee: Greg Miller >Priority: Major > Fix For: 9.3 > > Time Spent: 1h > Remaining Estimate: 0h > > We added hierarchical pathing capabilities to SSDV faceting recently but it > looks like we didn't update #getSpecificValue to work with hierarchical paths. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10584) SSDV facets should support hierarchical paths in #getSpecificValue
[ https://issues.apache.org/jira/browse/LUCENE-10584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17554758#comment-17554758 ] Greg Miller commented on LUCENE-10584: -- Fixed and backported. Resolving. > SSDV facets should support hierarchical paths in #getSpecificValue > -- > > Key: LUCENE-10584 > URL: https://issues.apache.org/jira/browse/LUCENE-10584 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/facet >Reporter: Greg Miller >Assignee: Greg Miller >Priority: Major > Time Spent: 1h > Remaining Estimate: 0h > > We added hierarchical pathing capabilities to SSDV faceting recently but it > looks like we didn't update #getSpecificValue to work with hierarchical paths. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Created] (LUCENE-10614) Properly support getTopChildren in RangeFacetCounts
Greg Miller created LUCENE-10614: Summary: Properly support getTopChildren in RangeFacetCounts Key: LUCENE-10614 URL: https://issues.apache.org/jira/browse/LUCENE-10614 Project: Lucene - Core Issue Type: Improvement Components: modules/facet Affects Versions: 10.0 (main) Reporter: Greg Miller As mentioned in LUCENE-10538, {{RangeFacetCounts}} is not implementing {{getTopChildren}}. Instead of returning "top" ranges, it returns all user-provided ranges in the order the user specified them when instantiating. This is probably more useful functionality, but it would be nice to support {{getTopChildren}} as well. LUCENE-10550 is introducing the concept of {{getAllChildren}}, so once that lands, we can replace the current implementation of {{getTopChildren}} with an actual "top children" implementation and direct users to {{getAllChildren}} if they want to maintain the current behavior. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10603) Improve iteration of ords for SortedSetDocValues
[ https://issues.apache.org/jira/browse/LUCENE-10603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17552851#comment-17552851 ] Greg Miller commented on LUCENE-10603: -- OK, thanks [~ChrisLu]! +1 to doing this for consistency. I took a pass at making this change within the faceting module since there are a number of places we rely on SSDV ordinal iteration. I figured we could probably tackle this change through multiple PRs, so I figured I'd lend a hand with faceting: https://github.com/apache/lucene/pull/954 > Improve iteration of ords for SortedSetDocValues > > > Key: LUCENE-10603 > URL: https://issues.apache.org/jira/browse/LUCENE-10603 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Lu Xugang >Priority: Trivial > Time Spent: 10m > Remaining Estimate: 0h > > After SortedSetDocValues#docValueCount added since Lucene 9.2, should we > refactor the implementation of ords iterations using docValueCount instead of > NO_MORE_ORDS? > Similar how SortedNumericDocValues did > From > {code:java} > for (long ord = values.nextOrd();ord != SortedSetDocValues.NO_MORE_ORDS; ord > = values.nextOrd()) { > }{code} > to > {code:java} > for (int i = 0; i < values.docValueCount(); i++) { > long ord = values.nextOrd(); > }{code} -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10603) Improve iteration of ords for SortedSetDocValues
[ https://issues.apache.org/jira/browse/LUCENE-10603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17552390#comment-17552390 ] Greg Miller commented on LUCENE-10603: -- Seems reasonable. Is there an expected benefit of moving to this iteration style? Do we think the loops can be better optimized by the JVM/hotspot since the number of iterations is known ahead of time? > Improve iteration of ords for SortedSetDocValues > > > Key: LUCENE-10603 > URL: https://issues.apache.org/jira/browse/LUCENE-10603 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Lu Xugang >Priority: Trivial > > After SortedSetDocValues#docValueCount added since Lucene 9.2, should we > refactor the implementation of ords iterations using docValueCount instead of > NO_MORE_ORDS? > Similar how SortedNumericDocValues did > From > {code:java} > for (long ord = values.nextOrd();ord != SortedSetDocValues.NO_MORE_ORDS; ord > = values.nextOrd()) { > }{code} > to > {code:java} > for (int i = 0; i < values.docValueCount(); i++) { > long ord = values.nextOrd(); > }{code} -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Created] (LUCENE-10595) TestGroupFacetCollector#testRandom has caught IndexOutOfBoundsException a couple times
Greg Miller created LUCENE-10595: Summary: TestGroupFacetCollector#testRandom has caught IndexOutOfBoundsException a couple times Key: LUCENE-10595 URL: https://issues.apache.org/jira/browse/LUCENE-10595 Project: Lucene - Core Issue Type: Bug Components: modules/grouping Reporter: Greg Miller Random testing has caught an {{IndexOutOfBoundsException}} a couple times now in {{org.apache.lucene.search.grouping.TestGroupFacetCollector.testRandom}}. I was able to reproduce locally with {{./gradlew :lucene:grouping:test --tests "org.apache.lucene.search.grouping.TestGroupFacetCollector.testRandom" -Ptests.jvms=4 -Ptests.haltonfailure=false -Ptests.jvmargs=-XX:TieredStopAtLevel=1 -Ptests.seed=91EC8BE9DE2A5BAB -Ptests.multiplier=2 -Ptests.badapples=false -Ptests.file.encoding=US-ASCII}}. From what I can tell, the exception is coming from way down in {{ByteBuffersDataInput#readBytes}} on [this line|https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/store/ByteBuffersDataInput.java#L155]. Popping the stack a bit, it seems like the issue is maybe in {{TermGroupFacetCollector$MV$SegmentResult#nextTerm}}. I'm not totally sure if this is an actual bug or a bug in testing methodology. Haven't had time to dig in further and likely won't in the near future, so opening this Jira to track. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Resolved] (LUCENE-10585) Cleanup copy/paste code in facets, particularly in SSDV
[ https://issues.apache.org/jira/browse/LUCENE-10585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Miller resolved LUCENE-10585. -- Fix Version/s: 10.0 (main) 9.3 Resolution: Fixed > Cleanup copy/paste code in facets, particularly in SSDV > --- > > Key: LUCENE-10585 > URL: https://issues.apache.org/jira/browse/LUCENE-10585 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/facet >Reporter: Greg Miller >Assignee: Greg Miller >Priority: Minor > Fix For: 10.0 (main), 9.3 > > Time Spent: 2h 10m > Remaining Estimate: 0h > > We've accumulated some copy/paste code in the facets modules, especially in > SSDV-related classes. I'm going to take a pass at cleaning this up to help > make the code more readable and easier to maintain. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Comment Edited] (LUCENE-10544) Should ExitableTermsEnum wrap postings and impacts?
[ https://issues.apache.org/jira/browse/LUCENE-10544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17540309#comment-17540309 ] Greg Miller edited comment on LUCENE-10544 at 5/20/22 10:03 PM: As I understood Adrien's suggestion, I think the idea is to create a new {{BulkScorer}} sub-class that would wrap another {{BulkScorer}} (provided as {{in}} in Adrien's code snippet). This class would override the {{score}} method as Adrien shows above to periodically check timeouts, but otherwise just delegate to {{in}} if the query has not yet timed out. I imagine somewhere in {{IndexSearcher}} you would instantiate this new "timeout enforcing bulk scorer", wrapping the {{BulkScorer}} provided by the query's weight. Does that help? Also, can I request that we move this conversation over to [LUCENE-10151|https://issues.apache.org/jira/browse/LUCENE-10151]? This issue is really about modifying {{ExitableTermsEnum}}, which we may want to eventually do independently of adding timeout support to {{IndexSearcher}}. Since this discussion is really about adding timeout support to {{IndexSearcher}}, it would be best to capture the conversation in LUCENE-10151 to make it easier to dig up in the future. Thank you! was (Author: gsmiller): As I understood Adrien's suggestion, I think the idea is to create a new {{BulkScorer}} sub-class that would wrap another {{BulkScorer}} (provided as {{in}} in Adrien's code snippet). This class would override the {{score}} method as Adrien shows above to periodically check timeouts, but otherwise just delegate to {{in}} if the query has not yet timed out. I image somewhere in {{IndexSearcher}} you would instantiate this new "timeout enforcing bulk scorer", wrapping the {{BulkScorer}} provided by the query's weight. Does that help? Also, can I request that we move this conversation over to [LUCENE-10151|https://issues.apache.org/jira/browse/LUCENE-10151]? 
This issue is really about modifying {{ExitableTermsEnum}}, which we may want to eventually due independent of adding timeout support to {{IndexSearcher}}. Since this discussion is really about adding timeout support to {{IndexSearcher}}, it would be best to capture the conversation in LUCENE-10151 to make it easier to dig up in the future. Thank you! > Should ExitableTermsEnum wrap postings and impacts? > --- > > Key: LUCENE-10544 > URL: https://issues.apache.org/jira/browse/LUCENE-10544 > Project: Lucene - Core > Issue Type: Bug > Components: core/index >Reporter: Greg Miller >Priority: Major > > While looking into options for LUCENE-10151, I noticed that > {{ExitableDirectoryReader}} doesn't actually do any timeout checking once you > start iterating postings/impacts. It *does* create a {{ExitableTermsEnum}} > wrapper when loading a {{{}TermsEnum{}}}, but that wrapper doesn't do > anything to wrap postings or impacts. So timeouts will be enforced when > moving to the "next" term, but not when iterating the postings/impacts > associated with a term. > I think we ought to wrap the postings/impacts as well with some form of > timeout checking so timeouts can be enforced on long-running queries. I'm not > sure why this wasn't done originally (back in 2014), but it was questioned > back in 2020 on the original Jira SOLR-5986. Does anyone know of a good > reason why we shouldn't enforce timeouts in this way? > Related, we may also want to wrap things like {{seekExact}} and {{seekCeil}} > given that only {{next}} is being wrapped currently. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10544) Should ExitableTermsEnum wrap postings and impacts?
[ https://issues.apache.org/jira/browse/LUCENE-10544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17540309#comment-17540309 ] Greg Miller commented on LUCENE-10544: -- As I understood Adrien's suggestion, I think the idea is to create a new {{BulkScorer}} sub-class that would wrap another {{BulkScorer}} (provided as {{in}} in Adrien's code snippet). This class would override the {{score}} method as Adrien shows above to periodically check timeouts, but otherwise just delegate to {{in}} if the query has not yet timed out. I imagine somewhere in {{IndexSearcher}} you would instantiate this new "timeout enforcing bulk scorer", wrapping the {{BulkScorer}} provided by the query's weight. Does that help? Also, can I request that we move this conversation over to [LUCENE-10151|https://issues.apache.org/jira/browse/LUCENE-10151]? This issue is really about modifying {{ExitableTermsEnum}}, which we may want to eventually do independently of adding timeout support to {{IndexSearcher}}. Since this discussion is really about adding timeout support to {{IndexSearcher}}, it would be best to capture the conversation in LUCENE-10151 to make it easier to dig up in the future. Thank you! > Should ExitableTermsEnum wrap postings and impacts? > --- > > Key: LUCENE-10544 > URL: https://issues.apache.org/jira/browse/LUCENE-10544 > Project: Lucene - Core > Issue Type: Bug > Components: core/index >Reporter: Greg Miller >Priority: Major > > While looking into options for LUCENE-10151, I noticed that > {{ExitableDirectoryReader}} doesn't actually do any timeout checking once you > start iterating postings/impacts. It *does* create a {{ExitableTermsEnum}} > wrapper when loading a {{{}TermsEnum{}}}, but that wrapper doesn't do > anything to wrap postings or impacts. So timeouts will be enforced when > moving to the "next" term, but not when iterating the postings/impacts > associated with a term. 
> I think we ought to wrap the postings/impacts as well with some form of > timeout checking so timeouts can be enforced on long-running queries. I'm not > sure why this wasn't done originally (back in 2014), but it was questioned > back in 2020 on the original Jira SOLR-5986. Does anyone know of a good > reason why we shouldn't enforce timeouts in this way? > Related, we may also want to wrap things like {{seekExact}} and {{seekCeil}} > given that only {{next}} is being wrapped currently. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
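The delegating "timeout enforcing bulk scorer" described in the comment above could look roughly like the sketch below. This is a simplified illustration under stated assumptions, not Adrien's snippet and not the Lucene API: {{RangeScorer}} is a hypothetical, cut-down stand-in for {{BulkScorer}} (no collector or live-docs arguments), and {{TimeExceededException}}, {{withTimeout}}, and the {{interval}} parameter are made up for this example.

```java
import java.util.function.BooleanSupplier;

final class TimeoutBulkScorerSketch {

  /** Hypothetical simplified scorer: scores docs in [min, max) and returns the next doc to score (>= max). */
  interface RangeScorer {
    int score(int min, int max);
  }

  /** Hypothetical exception type thrown when the timeout check fires. */
  static final class TimeExceededException extends RuntimeException {
    TimeExceededException() {
      super("query timed out");
    }
  }

  /**
   * Wraps {@code in} so that scoring proceeds in windows of at most {@code interval} docs,
   * with a timeout check before each window. All real scoring is delegated to {@code in}.
   */
  static RangeScorer withTimeout(RangeScorer in, BooleanSupplier timedOut, int interval) {
    if (interval <= 0) {
      throw new IllegalArgumentException("interval must be positive");
    }
    return (min, max) -> {
      int doc = min;
      while (doc < max) {
        if (timedOut.getAsBoolean()) {
          throw new TimeExceededException();
        }
        // Use long arithmetic so doc + interval cannot overflow near Integer.MAX_VALUE.
        int windowEnd = (int) Math.min((long) doc + interval, max);
        doc = in.score(doc, windowEnd);
      }
      return doc;
    };
  }
}
```

Because the wrapper only slices the doc range and delegates, a query that supplies its own custom bulk scoring logic keeps that logic. The window size trades timeout granularity against per-window overhead.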
[jira] [Assigned] (LUCENE-10585) Cleanup copy/paste code in facets, particularly in SSDV
[ https://issues.apache.org/jira/browse/LUCENE-10585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Miller reassigned LUCENE-10585: Assignee: Greg Miller > Cleanup copy/paste code in facets, particularly in SSDV > --- > > Key: LUCENE-10585 > URL: https://issues.apache.org/jira/browse/LUCENE-10585 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/facet >Reporter: Greg Miller >Assignee: Greg Miller >Priority: Minor > > We've accumulated some copy/paste code in the facets modules, especially in > SSDV-related classes. I'm going to take a pass at cleaning this up to help > make the code more readable and easier to maintain. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Created] (LUCENE-10585) Cleanup copy/paste code in facets, particularly in SSDV
Greg Miller created LUCENE-10585: Summary: Cleanup copy/paste code in facets, particularly in SSDV Key: LUCENE-10585 URL: https://issues.apache.org/jira/browse/LUCENE-10585 Project: Lucene - Core Issue Type: Improvement Components: modules/facet Reporter: Greg Miller We've accumulated some copy/paste code in the facets modules, especially in SSDV-related classes. I'm going to take a pass at cleaning this up to help make the code more readable and easier to maintain. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Assigned] (LUCENE-10584) SSDV facets should support hierarchical paths in #getSpecificValue
[ https://issues.apache.org/jira/browse/LUCENE-10584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Miller reassigned LUCENE-10584: Assignee: Greg Miller > SSDV facets should support hierarchical paths in #getSpecificValue > -- > > Key: LUCENE-10584 > URL: https://issues.apache.org/jira/browse/LUCENE-10584 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/facet >Reporter: Greg Miller >Assignee: Greg Miller >Priority: Major > > We added hierarchical pathing capabilities to SSDV faceting recently but it > looks like we didn't update #getSpecificValue to work with hierarchical paths. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Created] (LUCENE-10584) SSDV facets should support hierarchical paths in #getSpecificValue
Greg Miller created LUCENE-10584: Summary: SSDV facets should support hierarchical paths in #getSpecificValue Key: LUCENE-10584 URL: https://issues.apache.org/jira/browse/LUCENE-10584 Project: Lucene - Core Issue Type: Improvement Components: modules/facet Reporter: Greg Miller We added hierarchical pathing capabilities to SSDV faceting recently but it looks like we didn't update #getSpecificValue to work with hierarchical paths. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Created] (LUCENE-10580) Should we add a "slow range query" to xxxPoint classes?
Greg Miller created LUCENE-10580: Summary: Should we add a "slow range query" to xxxPoint classes? Key: LUCENE-10580 URL: https://issues.apache.org/jira/browse/LUCENE-10580 Project: Lucene - Core Issue Type: Improvement Components: core/index Reporter: Greg Miller Users that index 1D point data have the option of running a range query with, 1) the points index (via {{LongPoint#newRangeQuery}}), or 2) a doc values field (via {{SortedNumericDocValuesField#newSlowRangeQuery}}). But if users are indexing points data in higher dimensions, there's no equivalent "slow" query that I'm aware of (relying on doc values). It's useful to have both and be able to wrap them in {{IndexOrDocValuesQuery}}. I wonder if we should model a "point" doc value type (could just extend from {{BinaryDocValuesField}}) that supports creating "slow" range queries? -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
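To make the idea of a "slow" multi-dimensional range query concrete, the sketch below shows the per-document check such a query would presumably perform: each candidate document's point is compared against the query box one dimension at a time, with no points index involved. Everything here is hypothetical ({{SlowPointRangeSketch}}, {{inRange}}, {{countMatches}}); how a real "point" doc value type would encode or decode its values is not covered in this message and is not assumed here.

```java
/** Hypothetical per-document check for a doc-values-backed multi-dimensional range query. */
final class SlowPointRangeSketch {

  /** True if the point lies inside the inclusive box [lower, upper] in every dimension. */
  static boolean inRange(long[] point, long[] lower, long[] upper) {
    if (point.length != lower.length || point.length != upper.length) {
      throw new IllegalArgumentException("dimension mismatch");
    }
    for (int d = 0; d < point.length; d++) {
      if (point[d] < lower[d] || point[d] > upper[d]) {
        return false;
      }
    }
    return true;
  }

  /** Visits every document's point, which is what makes this approach "slow". */
  static int countMatches(long[][] docPoints, long[] lower, long[] upper) {
    int matches = 0;
    for (long[] point : docPoints) {
      if (inRange(point, lower, upper)) {
        matches++;
      }
    }
    return matches;
  }
}
```

A check like this costs time proportional to the number of candidate documents, so it tends to pay off only when another clause has already narrowed the candidates, which is the situation {{IndexOrDocValuesQuery}} is meant to choose between.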
[jira] [Commented] (LUCENE-10544) Should ExitableTermsEnum wrap postings and impacts?
[ https://issues.apache.org/jira/browse/LUCENE-10544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17538424#comment-17538424 ] Greg Miller commented on LUCENE-10544: -- +1 to pursuing this delegating bulk scorer suggestion. I really like that idea [~jpountz]. Seems like a simple, easy to understand approach that still allows queries to provide their own custom bulk scoring logic as necessary. > Should ExitableTermsEnum wrap postings and impacts? > --- > > Key: LUCENE-10544 > URL: https://issues.apache.org/jira/browse/LUCENE-10544 > Project: Lucene - Core > Issue Type: Bug > Components: core/index >Reporter: Greg Miller >Priority: Major > > While looking into options for LUCENE-10151, I noticed that > {{ExitableDirectoryReader}} doesn't actually do any timeout checking once you > start iterating postings/impacts. It *does* create a {{ExitableTermsEnum}} > wrapper when loading a {{{}TermsEnum{}}}, but that wrapper doesn't do > anything to wrap postings or impacts. So timeouts will be enforced when > moving to the "next" term, but not when iterating the postings/impacts > associated with a term. > I think we ought to wrap the postings/impacts as well with some form of > timeout checking so timeouts can be enforced on long-running queries. I'm not > sure why this wasn't done originally (back in 2014), but it was questioned > back in 2020 on the original Jira SOLR-5986. Does anyone know of a good > reason why we shouldn't enforce timeouts in this way? > Related, we may also want to wrap things like {{seekExact}} and {{seekCeil}} > given that only {{next}} is being wrapped currently. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Resolved] (LUCENE-10488) Optimize Facets#getTopDims across Facets implementations
[ https://issues.apache.org/jira/browse/LUCENE-10488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Miller resolved LUCENE-10488. -- Fix Version/s: 9.2 Resolution: Fixed Merged to {{main}} and {{branch_9x}}. Resolving. Thanks again [~yutinggan]! > Optimize Facets#getTopDims across Facets implementations > > > Key: LUCENE-10488 > URL: https://issues.apache.org/jira/browse/LUCENE-10488 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/facet >Reporter: Greg Miller >Priority: Minor > Fix For: 9.2 > > Time Spent: 4h 10m > Remaining Estimate: 0h > > LUCENE-10325 added a new {{getTopDims}} API, allowing users to specify the > number of "top" dimensions they want. The default implementation just > delegates to {{getAllDims}} and returns the number of top dims requested, but > some Facets sub-classes can do this more optimally. LUCENE-10325 demonstrated > this in {{SortedSetDocValueFacetCounts}}, but we can take it further. There's > at least some opportunity to do better in: > * {{ConcurrentSortedSetDocValuesFacetCounts}} > * {{FastTaxonomyFacetCounts}} > * {{TaxonomyFacetSumFloatAssociations}} > * {{TaxonomyFacetSumIntAssociations}} -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
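As a generic illustration of "top dims" selection that avoids sorting every dimension, the sketch below keeps a bounded heap of size n. It is an assumption-laden example, not what the {{Facets}} sub-classes listed above actually do: {{TopDimsSketch}} and {{topDims}} are hypothetical names, and a plain map from dimension name to count stands in for real facet results.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

final class TopDimsSketch {

  /** Returns the n dimension names with the highest counts, best first; ties are broken by name. */
  static List<String> topDims(Map<String, Integer> dimCounts, int n) {
    if (n <= 0) {
      return Collections.emptyList();
    }
    // "Worst first" ordering: lowest count first; among equal counts, the later name is worse.
    Comparator<Map.Entry<String, Integer>> worstFirst = (a, b) -> {
      int cmp = Integer.compare(a.getValue(), b.getValue());
      return cmp != 0 ? cmp : b.getKey().compareTo(a.getKey());
    };
    PriorityQueue<Map.Entry<String, Integer>> heap = new PriorityQueue<>(worstFirst);
    for (Map.Entry<String, Integer> entry : dimCounts.entrySet()) {
      heap.offer(entry);
      if (heap.size() > n) {
        heap.poll(); // evict the current worst so the heap never holds more than n entries
      }
    }
    // The heap yields worst first, so fill the result from the back.
    String[] result = new String[heap.size()];
    for (int i = result.length - 1; i >= 0; i--) {
      result[i] = heap.poll().getKey();
    }
    return Arrays.asList(result);
  }
}
```

With D dimensions this does O(D log n) work instead of sorting all D entries. The larger saving hinted at in the issue, skipping per-dimension child computation for dimensions that do not make the cut, would depend on each implementation's internals.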
[jira] [Commented] (LUCENE-10565) Can we "warm" SSDV ordinal maps on index reopen?
[ https://issues.apache.org/jira/browse/LUCENE-10565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17536357#comment-17536357 ] Greg Miller commented on LUCENE-10565: -- A tricky aspect of this is identifying which fields to pre-build the ordinal maps for, but I wonder if we could leverage {{FacetsConfig}} for this. Unfortunately, users don't have to register a facet field with {{FacetsConfig}} if they want all the default behavior, but maybe there's something we could do with this to make it more straight-forward to identify all the SSDV fields being used for faceting on reopen so the ordinal maps could be built. > Can we "warm" SSDV ordinal maps on index reopen? > > > Key: LUCENE-10565 > URL: https://issues.apache.org/jira/browse/LUCENE-10565 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/facet >Reporter: Greg Miller >Priority: Major > > As [~rcmuir] and [~jpountz] pointed out in a [discussion about facet > benchmarks|https://github.com/mikemccand/luceneutil/issues/169], we lazily > build ordinal maps needed for SSDV faceting the first time we need them for a > given index field instead of eagerly building them when the index is > reopened. This puts an expensive penalty on the search path whenever an index > is reloaded. Let's see if we can eagerly build these maps as part of > reopening the index so the user doesn't get hit with this at search time. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Created] (LUCENE-10565) Can we "warm" SSDV ordinal maps on index reopen?
Greg Miller created LUCENE-10565: Summary: Can we "warm" SSDV ordinal maps on index reopen? Key: LUCENE-10565 URL: https://issues.apache.org/jira/browse/LUCENE-10565 Project: Lucene - Core Issue Type: Improvement Components: modules/facet Reporter: Greg Miller As [~rcmuir] and [~jpountz] pointed out in a [discussion about facet benchmarks|https://github.com/mikemccand/luceneutil/issues/169], we lazily build ordinal maps needed for SSDV faceting the first time we need them for a given index field instead of eagerly building them when the index is reopened. This puts an expensive penalty on the search path whenever an index is reloaded. Let's see if we can eagerly build these maps as part of reopening the index so the user doesn't get hit with this at search time. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
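The lazy-versus-eager trade-off described in this issue can be illustrated without any Lucene types. The sketch below is an assumption-laden model only: {{OrdinalMapWarmingSketch}} is a hypothetical class, the "ordinal map" is whatever the supplied builder function returns, and the list of fields to warm is assumed to be known to the caller (which, as noted in the comment on this issue, is the tricky part in practice).

```java
import java.util.Collection;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical illustration; M stands in for an expensive per-field structure.
class OrdinalMapWarmingSketch<M> {
  private final Map<String, M> cache = new ConcurrentHashMap<>();
  private final Function<String, M> builder;

  OrdinalMapWarmingSketch(Function<String, M> builder) {
    this.builder = builder;
  }

  // Lazy path: the first search that touches a field pays the build cost.
  M get(String field) {
    return cache.computeIfAbsent(field, builder);
  }

  // Eager path: call once after a reopen so searches find the maps already built.
  void warm(Collection<String> fields) {
    for (String field : fields) {
      get(field);
    }
  }

  boolean isBuilt(String field) {
    return cache.containsKey(field);
  }
}
```

A caller would create a fresh instance per reopened reader and invoke {{warm}} before publishing that reader to search threads, so the build cost moves off the search path.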
[jira] [Commented] (LUCENE-10538) TopN is not being used in getTopChildren()
[ https://issues.apache.org/jira/browse/LUCENE-10538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17533912#comment-17533912 ] Greg Miller commented on LUCENE-10538: -- So I think the order of operations here is: 1. Deliver [LUCENE-10550|https://issues.apache.org/jira/browse/LUCENE-10550], which would effectively _copy_ the currently "top children" functionality of range faceting to a new API method for getting all children (which is what it's really doing). 2. Fix the existing "top children" functionality of range faceting to actually return top children (and honor the top-n parameter). I think this issue now effectively captures #2, and is blocked until LUCENE-10550 is delivered. Does that sound right [~yutinggan]? > TopN is not being used in getTopChildren() > -- > > Key: LUCENE-10538 > URL: https://issues.apache.org/jira/browse/LUCENE-10538 > Project: Lucene - Core > Issue Type: Bug > Components: modules/facet >Reporter: Yuting Gan >Priority: Minor > Time Spent: 1.5h > Remaining Estimate: 0h > > When looking at the overridden implementation getTopChildren(int topN, String > dim, String... path) in RangeFacetCounts, I found that the topN parameter is > not being used in the code, and the unit tests did not test this function > properly. I will create a PR to fix this, and will look into other overridden > implementations and see if they have the same issue. Please let me know if > there is any question. Thanks! -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Updated] (LUCENE-10538) TopN is not being used in getTopChildren()
[ https://issues.apache.org/jira/browse/LUCENE-10538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Miller updated LUCENE-10538: - Component/s: modules/facet
[jira] [Comment Edited] (LUCENE-10550) Add getAllChildren functionality to facets
[ https://issues.apache.org/jira/browse/LUCENE-10550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17532277#comment-17532277 ] Greg Miller edited comment on LUCENE-10550 at 5/5/22 2:37 PM: -- I'm also +1 on this but with a minor suggestion. {quote}The proposed getAllChildren API is to return value/range counts sorted by label values instead of counts. {quote} I wonder if we should "sort" at all for this functionality? If we're returning all children for a specified path, the caller can just as easily sort by whatever criteria they want (or maybe none at all), so sorting within the implementation might be wasteful. Also, for range faceting, the user is providing a list of ranges they care about up-front in a specific order. I would actually propose we retain that order instead of sorting by the range "values" in some way. This is what range faceting currently implements (somewhat confusingly) behind the {{getTopChildren}} API. The order of those ranges might have some meaning to the caller, so it might be best to retain it. What do you think? was (Author: gsmiller): I'm also +1 on this but with a minor suggestion. > The proposed getAllChildren API is to return value/range counts sorted by > label values instead of counts. I wonder if we should "sort" at all for this functionality? If we're returning all children for a specified path, the caller can just as easily sort by whatever criteria they want (or maybe none at all), so sorting within the implementation might be wasteful. Also, for range faceting, the user is providing a list of ranges they care about up-front in a specific order. I would actually propose we retain that order instead of sorting by the range "values" in some way. This is what range faceting currently implements (somewhat confusingly) behind the {{getTopChildren}} API. The order of those ranges might have some meaning to the caller, so it might be best to retain it. What do you think? 
> Add getAllChildren functionality to facets > -- > > Key: LUCENE-10550 > URL: https://issues.apache.org/jira/browse/LUCENE-10550 > Project: Lucene - Core > Issue Type: New Feature > Components: modules/facet >Reporter: Yuting Gan >Priority: Minor > > Currently Lucene does not support returning range counts sorted by label > values, but there are use cases demanding this feature. For example, a user > specifies ranges (e.g., [0, 10], [10, 20]) and wants to get range counts > without changing the range order. Today we can only call getTopChildren to > populate range counts, but it would return ranges sorted by counts (e.g., > [10, 20] 100, [0, 10] 50) instead of range values. > Lucene has an API, getAllChildrenSortByValue, that returns numeric values with > counts sorted by label values; please see > [LUCENE-7927|https://issues.apache.org/jira/browse/LUCENE-7927] for details. > Therefore, it would be nice to also have a similar API to support > range counts. The proposed getAllChildren API is to return value/range counts > sorted by label values instead of counts. > This proposal was inspired by the discussions with [~gsmiller] when I was > working on the LUCENE-10538 [PR|https://github.com/apache/lucene/pull/843], > and we believe users would benefit from adding this API to Facets. > I hope to get some feedback from the community since this proposal would > require changes to the getTopChildren API in RangeFacetCounts. Thanks!
[jira] [Updated] (LUCENE-10550) Add getAllChildren functionality to facets
[ https://issues.apache.org/jira/browse/LUCENE-10550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Miller updated LUCENE-10550: - Component/s: modules/facet
[jira] [Commented] (LUCENE-10538) TopN is not being used in getTopChildren()
[ https://issues.apache.org/jira/browse/LUCENE-10538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17530905#comment-17530905 ] Greg Miller commented on LUCENE-10538: -- We discussed this in the PR but I wanted to bring the conversation here for visibility as well and to make sure this issue isn't just left hanging. While it appears buggy that range facets don't use topN, I believe this is intentional. Range facets are (somewhat confusingly) overloading the {{getTopChildren}} faceting API with slightly different functionality that returns counts for all requested ranges in the order the ranges were provided. I think this existing functionality is important to retain, and I don't want to lose it by truncating to topN. I also think properly implementing {{getTopChildren}} for range faceting would be useful for users. Meaning, a method that actually returns the top-n ranges in decreasing count order, just like other faceting implementations. What I'd actually suggest we do here is add a {{getAllChildren}} method to the faceting API. Then we can migrate the existing {{getTopChildren}} functionality implemented in range faceting to {{getAllChildren}}. Finally, we can replace the existing {{getTopChildren}} range faceting implementation with a proper one. {{LongValueFacetCounts}} is another faceting implementation where we've already implemented "get all children" functionality, so I think there's value beyond just range faceting here (i.e., we could migrate that implementation behind a new method defined for all {{Facets}}).
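The two behaviors discussed in this comment, returning every requested range in the order it was provided versus returning only the top-n ranges by count, can be contrasted with a small plain-Java sketch. It is an illustration under assumptions rather than the {{RangeFacetCounts}} code: {{RangeChildrenSketch}} and its methods are hypothetical, and ranges are modeled as labels in a {{LinkedHashMap}} whose insertion order stands in for the caller's range order.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration; not RangeFacetCounts.
class RangeChildrenSketch {
  // "All children": every requested range, in the order the caller supplied them.
  static List<Map.Entry<String, Integer>> allChildren(
      LinkedHashMap<String, Integer> rangeCounts) {
    return new ArrayList<>(rangeCounts.entrySet());
  }

  // "Top children": at most topN ranges, highest count first.
  static List<Map.Entry<String, Integer>> topChildren(
      int topN, LinkedHashMap<String, Integer> rangeCounts) {
    List<Map.Entry<String, Integer>> sorted = new ArrayList<>(rangeCounts.entrySet());
    // List.sort is stable, so ranges with equal counts keep the caller's order.
    sorted.sort((a, b) -> Integer.compare(b.getValue(), a.getValue()));
    int limit = Math.max(0, Math.min(topN, sorted.size()));
    return new ArrayList<>(sorted.subList(0, limit));
  }

  public static void main(String[] args) {
    LinkedHashMap<String, Integer> counts = new LinkedHashMap<>();
    counts.put("[0, 10]", 50);
    counts.put("[10, 20]", 100);
    System.out.println(allChildren(counts)); // caller's range order preserved
    System.out.println(topChildren(1, counts)); // only the highest-count range
  }
}
```

Keeping the two operations as separate methods means neither has to compromise: the first never reorders or truncates, and the second always honors the top-n parameter.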
[jira] [Resolved] (LUCENE-10530) TestTaxonomyFacetAssociations test failure
[ https://issues.apache.org/jira/browse/LUCENE-10530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Miller resolved LUCENE-10530. -- Fix Version/s: 10.0 (main) 9.2 Resolution: Fixed > TestTaxonomyFacetAssociations test failure > -- > > Key: LUCENE-10530 > URL: https://issues.apache.org/jira/browse/LUCENE-10530 > Project: Lucene - Core > Issue Type: Bug >Reporter: Vigya Sharma >Priority: Major > Fix For: 10.0 (main), 9.2 > > Time Spent: 2h 20m > Remaining Estimate: 0h > > TestTaxonomyFacetAssociations.testFloatAssociationRandom seems to have some > flakiness, it fails on the following random seed. > {code:java} > ./gradlew test --tests > TestTaxonomyFacetAssociations.testFloatAssociationRandom \ > -Dtests.seed=4DFBA8209AC82EB2 -Dtests.slow=true -Dtests.locale=fr-VU \ > -Dtests.timezone=Europe/Athens -Dtests.asserts=true > -Dtests.file.encoding=UTF-8 {code} > This is because of a mismatch in (SUM) aggregated multi-valued, > {{float_random}} facet field. We accept an error delta of 1 in this > aggregation, but for the failing random seed, the delta is 1.3. Maybe we > should change this delta to 1.5? > My hunch is that it is some floating point approximation error. I'm unable to > repro it without the randomization seed. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Resolved] (LUCENE-10529) TestTaxonomyFacetAssociations may have floating point issues
[ https://issues.apache.org/jira/browse/LUCENE-10529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Miller resolved LUCENE-10529. -- Fix Version/s: 10.0 (main) 9.2 Resolution: Fixed > TestTaxonomyFacetAssociations may have floating point issues > > > Key: LUCENE-10529 > URL: https://issues.apache.org/jira/browse/LUCENE-10529 > Project: Lucene - Core > Issue Type: Task >Reporter: Robert Muir >Priority: Major > Fix For: 10.0 (main), 9.2 > > > Hit this in a jenkins CI build while testing something else: > {noformat} > gradlew test --tests TestTaxonomyFacetAssociations.testFloatAssociationRandom > -Dtests.seed=B39C450F4870F7F1 -Dtests.locale=ar-IQ > -Dtests.timezone=America/Rankin_Inlet -Dtests.asserts=true > -Dtests.file.encoding=UTF-8 > ... > org.apache.lucene.facet.taxonomy.TestTaxonomyFacetAssociations > > testFloatAssociationRandom FAILED > java.lang.AssertionError: expected:<2605996.5> but was:<2605995.2> > {noformat} -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
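The kind of mismatch reported here (two float sums that differ by slightly more than 1 at a magnitude of about 2.6 million) is consistent with float accumulation being sensitive to summation order. The sketch below shows that effect and one possible way to compare such sums with a tolerance that scales with magnitude. It is an assumption-based illustration: {{FloatSumSketch}} and {{withinRelativeDelta}} are hypothetical, and this is not necessarily the approach taken in the actual fix.

```java
// Hypothetical illustration of order-dependent float sums; not Lucene test code.
class FloatSumSketch {
  // Accumulates in float, in exactly the order given.
  static float sum(float[] values) {
    float total = 0f;
    for (float v : values) {
      total += v;
    }
    return total;
  }

  // Returns a copy of the input in reverse order.
  static float[] reversed(float[] values) {
    float[] out = new float[values.length];
    for (int i = 0; i < values.length; i++) {
      out[i] = values[values.length - 1 - i];
    }
    return out;
  }

  // Tolerance that grows with the magnitude of the values being compared.
  static boolean withinRelativeDelta(float expected, float actual, double relDelta) {
    double diff = Math.abs((double) expected - (double) actual);
    double scale = Math.max(Math.abs((double) expected), Math.abs((double) actual));
    return diff <= relDelta * scale;
  }

  public static void main(String[] args) {
    // One large value followed by eight small ones.
    float[] values = {1e8f, 1f, 1f, 1f, 1f, 1f, 1f, 1f, 1f};
    // Adding 1 to 1e8 in float is lost to rounding; adding the ones first is not.
    System.out.println(sum(values) == sum(reversed(values)));
  }
}
```

A fixed absolute delta of 1 cannot account for rounding error that grows with the size of the sum, whereas a relative delta (or accumulating the expected value in the same order and precision as the code under test) can.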
[jira] [Created] (LUCENE-10547) Implement "flattened" Facets#getTopChildren
Greg Miller created LUCENE-10547: Summary: Implement "flattened" Facets#getTopChildren Key: LUCENE-10547 URL: https://issues.apache.org/jira/browse/LUCENE-10547 Project: Lucene - Core Issue Type: New Feature Components: modules/facet Reporter: Greg Miller The current implementation of {{Facets#getTopChildren}} only considers the immediate children of the user-provided path. In many cases, this is probably what the user is looking for, but it would be useful to also have an implementation that considers any descendant of the path, regardless of "level." This would allow the user to build a deeper set of facet path options in "one shot," instead of having to iteratively call {{getTopChildren}}. Of course, the shallower paths, and specifically the immediate children of the provided path, will always outweigh "deeper" paths due to counts/weights accumulating along the ancestry paths, but by providing a topN value larger than the number of immediate children, the user could build up a more complete view of path options in a taxonomy with a lot of depth.
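The difference between considering only immediate children and considering any descendant can be sketched in plain Java. This is a hypothetical model, not a proposed Lucene implementation: {{FlattenedTopChildrenSketch}} is an invented name, facet paths are represented as "/"-separated strings, and each path's count is assumed to already include the counts of its descendants.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical illustration; paths are "/"-separated strings, not Lucene FacetLabels.
class FlattenedTopChildrenSketch {
  // Returns up to topN paths under parent, highest count first.
  // flattened=false considers immediate children only; true considers every descendant.
  static List<String> topChildren(
      Map<String, Integer> pathCounts, String parent, int topN, boolean flattened) {
    String prefix = parent + "/";
    List<Map.Entry<String, Integer>> candidates = new ArrayList<>();
    for (Map.Entry<String, Integer> e : pathCounts.entrySet()) {
      String path = e.getKey();
      if (!path.startsWith(prefix) || path.length() == prefix.length()) {
        continue; // not under the requested parent
      }
      boolean immediate = path.indexOf('/', prefix.length()) < 0;
      if (flattened || immediate) {
        candidates.add(e);
      }
    }
    // Highest count first; ties broken by path so output is deterministic.
    candidates.sort(
        (a, b) -> {
          int c = Integer.compare(b.getValue(), a.getValue());
          return c != 0 ? c : a.getKey().compareTo(b.getKey());
        });
    List<String> out = new ArrayList<>();
    for (int i = 0; i < Math.min(topN, candidates.size()); i++) {
      out.add(candidates.get(i).getKey());
    }
    return out;
  }
}
```

With counts accumulated along ancestry paths, a child can never rank below its own descendants, so the flattened result only starts to include deeper paths once topN is large enough to get past the heavier shallow ones.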
[jira] [Created] (LUCENE-10546) Update Faceting user guide
Greg Miller created LUCENE-10546: Summary: Update Faceting user guide Key: LUCENE-10546 URL: https://issues.apache.org/jira/browse/LUCENE-10546 Project: Lucene - Core Issue Type: Wish Components: modules/facet Reporter: Greg Miller The [facet user guide|https://lucene.apache.org/core/4_1_0/facet/org/apache/lucene/facet/doc-files/userguide.html] was written based on 4.1. Since there's been a fair amount of active facet-related development over the last year+, it would be nice to review the guide and see what updates make sense. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10544) Should ExitableTermsEnum wrap postings and impacts?
[ https://issues.apache.org/jira/browse/LUCENE-10544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1753#comment-1753 ] Greg Miller commented on LUCENE-10544: -- {quote}In my opinion, a better solution that has less overhead and would still support cancelling such slow queries consists of leveraging {{BulkScorer#score}} to score small-ish ranges of doc IDs at a time. {quote} +1. We've had success by implementing a "timeout enforcing" Query that does timeout enforcement within the Scorer it provides as a short-term solution, but there are a number of flaws with this approach. Hooking into the BulkScorer makes sense but does need some thought as [~dpsharma] mentions since Queries may (and do!) provide their own BulkScorers in some cases (e.g., {{{}BooleanScorer{}}}). {quote}Long-term I'd like ExitableDirectoryReader and other tooling to handle cancellation/timeout to become mostly implementation details, and have proper support directly on IndexSearcher (LUCENE-10151). {quote} +1. For full disclosure, [~dpsharma] and I work together at Amazon and she is working on LUCENE-10151. One idea is to use {{ExitableDirectoryReader}} as an internal implementation detail of {{IndexSearcher}} to add first-class timeout support. While we were debugging some prototype code, we ran into this issue with {{ExitableDirectoryReader}} and I thought it warranted a spin-off issue since it seems like something we might want to generally fix. > Should ExitableTermsEnum wrap postings and impacts? > --- > > Key: LUCENE-10544 > URL: https://issues.apache.org/jira/browse/LUCENE-10544 > Project: Lucene - Core > Issue Type: Bug > Components: core/index >Reporter: Greg Miller >Priority: Major > > While looking into options for LUCENE-10151, I noticed that > {{ExitableDirectoryReader}} doesn't actually do any timeout checking once you > start iterating postings/impacts. 
It *does* create a {{ExitableTermsEnum}} > wrapper when loading a {{{}TermsEnum{}}}, but that wrapper doesn't do > anything to wrap postings or impacts. So timeouts will be enforced when > moving to the "next" term, but not when iterating the postings/impacts > associated with a term. > I think we ought to wrap the postings/impacts as well with some form of > timeout checking so timeouts can be enforced on long-running queries. I'm not > sure why this wasn't done originally (back in 2014), but it was questioned > back in 2020 on the original Jira SOLR-5986. Does anyone know of a good > reason why we shouldn't enforce timeouts in this way? > Related, we may also want to wrap things like {{seekExact}} and {{seekCeil}} > given that only {{next}} is being wrapped currently. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10544) Should ExitableTermsEnum wrap postings and impacts?
[ https://issues.apache.org/jira/browse/LUCENE-10544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17529481#comment-17529481 ] Greg Miller commented on LUCENE-10544: -- Thanks [~jpountz]. One issue with the collector approach is that it doesn't catch two-phase iterator situations where there are many approximate hits but very few confirmed matches, since the collector will only be invoked after a match is confirmed. If the second phase check is costly, this can be particularly problematic. So it would be nice to enforce the check at a lower-level and solve for this issue if possible.
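Enforcing the check at a lower level, as suggested in this comment, amounts to wrapping the iterator itself so the timeout is consulted while advancing rather than only when a hit is collected. The sketch below shows that idea with self-contained types. Everything in it is an assumption for illustration: {{DocIterator}}, {{ArrayDocIterator}}, {{TimeoutCheckingIterator}} and {{TimeoutExceededException}} are hypothetical stand-ins, not Lucene classes, and the check interval of 16 is arbitrary.

```java
import java.util.function.BooleanSupplier;

// Hypothetical minimal iterator contract; not Lucene's DocIdSetIterator.
interface DocIterator {
  int NO_MORE_DOCS = Integer.MAX_VALUE;

  int nextDoc();
}

// Hypothetical exception type signalling that the time budget ran out.
class TimeoutExceededException extends RuntimeException {
  TimeoutExceededException(String message) {
    super(message);
  }
}

// Simple in-memory iterator so the sketch is runnable.
class ArrayDocIterator implements DocIterator {
  private final int[] docs;
  private int pos = 0;

  ArrayDocIterator(int[] docs) {
    this.docs = docs;
  }

  @Override
  public int nextDoc() {
    return pos < docs.length ? docs[pos++] : NO_MORE_DOCS;
  }
}

// Wrapper that consults the timeout while iterating, not only at collection time.
class TimeoutCheckingIterator implements DocIterator {
  private static final int CHECK_INTERVAL = 16; // must be a power of two for the mask below
  private final DocIterator in;
  private final BooleanSupplier shouldExit;
  private int calls = 0;

  TimeoutCheckingIterator(DocIterator in, BooleanSupplier shouldExit) {
    this.in = in;
    this.shouldExit = shouldExit;
  }

  @Override
  public int nextDoc() {
    // Check on the first call and then every CHECK_INTERVAL calls to limit overhead.
    if ((calls++ & (CHECK_INTERVAL - 1)) == 0 && shouldExit.getAsBoolean()) {
      throw new TimeoutExceededException("time budget exceeded while iterating");
    }
    return in.nextDoc();
  }
}
```

Because the check sits on the iteration path, it also fires for approximate hits that never reach a collector. Sampling the check every few calls is one way to trade timeout precision for lower per-document overhead.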
[jira] [Updated] (LUCENE-10544) Should ExitableTermsEnum wrap postings and impacts?
[ https://issues.apache.org/jira/browse/LUCENE-10544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Miller updated LUCENE-10544: - Description: While looking into options for LUCENE-10151, I noticed that {{ExitableDirectoryReader}} doesn't actually do any timeout checking once you start iterating postings/impacts. It *does* create a {{ExitableTermsEnum}} wrapper when loading a {{{}TermsEnum{}}}, but that wrapper doesn't do anything to wrap postings or impacts. So timeouts will be enforced when moving to the "next" term, but not when iterating the postings/impacts associated with a term. I think we ought to wrap the postings/impacts as well with some form of timeout checking so timeouts can be enforced on long-running queries. I'm not sure why this wasn't done originally (back in 2014), but it was questioned back in 2020 on the original Jira SOLR-5986. Does anyone know of a good reason why we shouldn't enforce timeouts in this way? Related, we may also want to wrap things like {{seekExact}} and {{seekCeil}} given that only {{next}} is being wrapped currently. was: While looking into options for [LUCENE-10151|https://issues.apache.org/jira/browse/LUCENE-10151], I noticed that {{ExitableDirectoryReader}} doesn't actually do any timeout checking once you start iterating postings/impact. The does create a {{ExitableTermsEnum}} wrapper when loading a {{TermsEnum}}, but that wrapper doesn't do anything to wrap postings or impact. So timeouts will be enforced when moving to the "next" term, but not when iterating the postings/impact associated with a term. I think we ought to wrap the postings/impacts as well with some form of timeout checking so timeouts can be enforced on long-running queries. I'm not sure why this wasn't done originally (back in 2014), but it was questioned back in 2020 on the original Jira [SOLR-5986|https://issues.apache.org/jira/browse/SOLR-5986?focusedCommentId=17177009=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17177009]. Does anyone know of a good reason why we shouldn't enforce timeouts in this way? Related, we may also want to wrap things like {{seekExact}} and {{seekCeil}} given that only {{next}} is being wrapped currently.
[jira] [Created] (LUCENE-10544) Should ExitableTermsEnum wrap postings and impacts?
Greg Miller created LUCENE-10544: Summary: Should ExitableTermsEnum wrap postings and impacts? Key: LUCENE-10544 URL: https://issues.apache.org/jira/browse/LUCENE-10544 Project: Lucene - Core Issue Type: Bug Components: core/index Reporter: Greg Miller While looking into options for [LUCENE-10151|https://issues.apache.org/jira/browse/LUCENE-10151], I noticed that {{ExitableDirectoryReader}} doesn't actually do any timeout checking once you start iterating postings/impacts. It does create an {{ExitableTermsEnum}} wrapper when loading a {{TermsEnum}}, but that wrapper doesn't do anything to wrap postings or impacts. So timeouts will be enforced when moving to the "next" term, but not when iterating the postings/impacts associated with a term. I think we ought to wrap the postings/impacts as well with some form of timeout checking so timeouts can be enforced on long-running queries. I'm not sure why this wasn't done originally (back in 2014), but it was questioned back in 2020 on the original Jira [SOLR-5986|https://issues.apache.org/jira/browse/SOLR-5986?focusedCommentId=17177009=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17177009]. Does anyone know of a good reason why we shouldn't enforce timeouts in this way? Related, we may also want to wrap things like {{seekExact}} and {{seekCeil}} given that only {{next}} is being wrapped currently.
[jira] [Commented] (LUCENE-10274) Implement "hyperrectangle" faceting
[ https://issues.apache.org/jira/browse/LUCENE-10274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17529122#comment-17529122 ] Greg Miller commented on LUCENE-10274: -- Exciting! I'll have a look at the PR in the next couple of days and get some feedback your way if nobody beats me to it. Thanks so much for having a go at this! > Implement "hyperrectangle" faceting > --- > > Key: LUCENE-10274 > URL: https://issues.apache.org/jira/browse/LUCENE-10274 > Project: Lucene - Core > Issue Type: New Feature > Components: modules/facet >Reporter: Greg Miller >Priority: Minor > Time Spent: 10m > Remaining Estimate: 0h > > I'd be interested in expanding Lucene's faceting capabilities to aggregate a > point field against a set of user-provided n-dimensional > [hyperrectangles|https://en.wikipedia.org/wiki/Hyperrectangle]. This would be > a generalization of {{LongRangeFacets}} / {{DoubleRangeFacets}} from a single > dimension to n-dimensions, and would complement {{PointRangeQuery}} well, > providing the ability to facet ahead of "drilling down" on such a query. > As a motivating use-case, imagine searching against movie documents that > contain a 2-dimensional point storing "awards" the movie has received. One > dimension encodes the year the award was won, while the other encodes the > type of award as an ordinal. For example, the film "Nomadland" won the > "Academy Awards Best Picture" award in 2021. Imagine providing a > two-dimensional refinement to users allowing them to filter by the > combination of award + year in a single action (e.g., using > {{{}PointRangeQuery{}}}) and needing to get facet counts for these > combinations ahead of time. > Curious if the community thinks this functionality would be useful. Any > thoughts?
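The aggregation described in this issue, counting how many n-dimensional points fall inside each user-provided hyperrectangle, can be shown with a brute-force sketch. This is an illustration under assumptions, not the approach taken in the PR: {{HyperRectangleSketch}} is a hypothetical class, points and rectangle bounds are plain {{long[]}} arrays, and bounds are treated as inclusive on both ends.

```java
// Hypothetical brute-force illustration; not the Lucene hyperrectangle faceting code.
class HyperRectangleSketch {
  // True if the point lies within [min, max] (inclusive) in every dimension.
  static boolean contains(long[] min, long[] max, long[] point) {
    for (int d = 0; d < point.length; d++) {
      if (point[d] < min[d] || point[d] > max[d]) {
        return false;
      }
    }
    return true;
  }

  // Returns one count per rectangle; mins[r] and maxs[r] describe rectangle r.
  static int[] countPerRectangle(long[][] mins, long[][] maxs, long[][] points) {
    int[] counts = new int[mins.length];
    for (long[] point : points) {
      for (int r = 0; r < mins.length; r++) {
        if (contains(mins[r], maxs[r], point)) {
          counts[r]++;
        }
      }
    }
    return counts;
  }

  public static void main(String[] args) {
    // Dimension 0 is the award year, dimension 1 is the award type as an ordinal.
    long[][] mins = {{2020L, 0L}};
    long[][] maxs = {{2021L, 0L}};
    long[][] awards = {{2021L, 0L}, {2019L, 0L}, {2021L, 3L}};
    System.out.println(countPerRectangle(mins, maxs, awards)[0]);
  }
}
```

With a single dimension this reduces to ordinary range faceting, which is the sense in which the proposal generalizes the existing range facet implementations.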
[jira] [Commented] (LUCENE-10529) TestTaxonomyFacetAssociations may have floating point issues
[ https://issues.apache.org/jira/browse/LUCENE-10529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17529049#comment-17529049 ] Greg Miller commented on LUCENE-10529: -- I've got a PR up for this, but associated it with the dup Jira (10530). Linking the PR here as well for visibility. Since the fix for the NPE also reported here was trivial, I pushed it last night separately. https://github.com/apache/lucene/pull/848 > TestTaxonomyFacetAssociations may have floating point issues > > > Key: LUCENE-10529 > URL: https://issues.apache.org/jira/browse/LUCENE-10529 > Project: Lucene - Core > Issue Type: Task >Reporter: Robert Muir >Priority: Major > > Hit this in a jenkins CI build while testing something else: > {noformat} > gradlew test --tests TestTaxonomyFacetAssociations.testFloatAssociationRandom > -Dtests.seed=B39C450F4870F7F1 -Dtests.locale=ar-IQ > -Dtests.timezone=America/Rankin_Inlet -Dtests.asserts=true > -Dtests.file.encoding=UTF-8 > ... > org.apache.lucene.facet.taxonomy.TestTaxonomyFacetAssociations > > testFloatAssociationRandom FAILED > java.lang.AssertionError: expected:<2605996.5> but was:<2605995.2> > {noformat} -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10530) TestTaxonomyFacetAssociations test failure
[ https://issues.apache.org/jira/browse/LUCENE-10530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17529048#comment-17529048 ] Greg Miller commented on LUCENE-10530: -- PR is up for this. > TestTaxonomyFacetAssociations test failure > -- > > Key: LUCENE-10530 > URL: https://issues.apache.org/jira/browse/LUCENE-10530 > Project: Lucene - Core > Issue Type: Bug >Reporter: Vigya Sharma >Priority: Major > Time Spent: 10m > Remaining Estimate: 0h > > TestTaxonomyFacetAssociations.testFloatAssociationRandom seems to have some > flakiness, it fails on the following random seed. > {code:java} > ./gradlew test --tests > TestTaxonomyFacetAssociations.testFloatAssociationRandom \ > -Dtests.seed=4DFBA8209AC82EB2 -Dtests.slow=true -Dtests.locale=fr-VU \ > -Dtests.timezone=Europe/Athens -Dtests.asserts=true > -Dtests.file.encoding=UTF-8 {code} > This is because of a mismatch in (SUM) aggregated multi-valued, > {{float_random}} facet field. We accept an error delta of 1 in this > aggregation, but for the failing random seed, the delta is 1.3. Maybe we > should change this delta to 1.5? > My hunch is that it is some floating point approximation error. I'm unable to > repro it without the randomization seed. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10530) TestTaxonomyFacetAssociations test failure
[ https://issues.apache.org/jira/browse/LUCENE-10530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17529027#comment-17529027 ] Greg Miller commented on LUCENE-10530: -- Instead of increasing the acceptable delta, I'd prefer to ensure we sum the floats in the same order and then expect them to be exactly equal. This should be a more robust solution than fiddling with the delta every time we trip a random case that breaks things. The issue is that we keep track of all float values we index for the purpose of determining "expected" sums, but the order ends up differing from the order we visit the values when iterating the index. I think I have a solution that lets us reconcile this ordering difference. > TestTaxonomyFacetAssociations test failure > -- > > Key: LUCENE-10530 > URL: https://issues.apache.org/jira/browse/LUCENE-10530 > Project: Lucene - Core > Issue Type: Bug >Reporter: Vigya Sharma >Priority: Major > > TestTaxonomyFacetAssociations.testFloatAssociationRandom seems to have some > flakiness, it fails on the following random seed. > {code:java} > ./gradlew test --tests > TestTaxonomyFacetAssociations.testFloatAssociationRandom \ > -Dtests.seed=4DFBA8209AC82EB2 -Dtests.slow=true -Dtests.locale=fr-VU \ > -Dtests.timezone=Europe/Athens -Dtests.asserts=true > -Dtests.file.encoding=UTF-8 {code} > This is because of a mismatch in (SUM) aggregated multi-valued, > {{float_random}} facet field. We accept an error delta of 1 in this > aggregation, but for the failing random seed, the delta is 1.3. Maybe we > should change this delta to 1.5? > My hunch is that it is some floating point approximation error. I'm unable to > repro it without the randomization seed. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
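The point about summation order can be shown in isolation. The sketch below is not the test's code; `FloatSumOrderSketch` and its array names are hypothetical, and the values are chosen to exaggerate the effect. It shows that single-precision addition is not associative, so two sums over the same values can differ, while summing in the same order always produces identical results.

```java
final class FloatSumOrderSketch {

  /** Sums the values left to right in single precision. */
  static float sum(float[] values) {
    float total = 0f;
    for (float v : values) {
      total += v;
    }
    return total;
  }

  public static void main(String[] args) {
    // Same three values, two different visiting orders.
    float[] indexedOrder = {1e8f, -1e8f, 1f};
    float[] visitedOrder = {1e8f, 1f, -1e8f};

    // The large values cancel first, so the 1 survives.
    System.out.println(sum(indexedOrder)); // prints 1.0

    // 1e8f + 1f rounds back to 1e8f, so the 1 is lost before the cancellation.
    System.out.println(sum(visitedOrder)); // prints 0.0

    // Summing in the same order is deterministic, so exact equality is a safe assertion.
    System.out.println(sum(indexedOrder) == sum(indexedOrder.clone())); // prints true
  }
}
```

This is why widening the delta only postpones the next failure: the error depends on the random values and their order, whereas matching the order removes the discrepancy entirely.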
[jira] [Commented] (LUCENE-10204) Support iteration of sub-matches in join queries (ToParentBlockJoinQuery / ToChildBlockJoinQuery)
[ https://issues.apache.org/jira/browse/LUCENE-10204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17528507#comment-17528507 ] Greg Miller commented on LUCENE-10204: -- Yeah +1 to not pursuing further right now. Because various query evaluation optimizations (current and possibly future) mean that not all children will necessarily be visited in a complex disjunction clause when determining matching parents, I think it's fundamentally flawed to try to track all child hits while evaluating the query. For example, in BMW, a sub-clause may never get advanced to a given parent match if it's determined to be a match based on a minimum number of other clauses confirming the match. From what I can tell, the only accurate way to find all child matches is to issue a separate query that identifies them and doesn't "join" to the parents. > Support iteration of sub-matches in join queries (ToParentBlockJoinQuery / > ToChildBlockJoinQuery) > - > > Key: LUCENE-10204 > URL: https://issues.apache.org/jira/browse/LUCENE-10204 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/join >Reporter: Greg Miller >Priority: Minor > > It would be nice to be able to iterate over the "sub-matches" in these join > queries for the purpose of faceting (or possibly other use-cases?). > For example, we have a use-case where our query matches on "child" docs, > using a {{ToParentBlockJoinQuery}} to "emit" the associated parents, which > are ultimately added to our match set. But, we want to iterate over the > matching "children" for the purpose of faceting. > To make it concrete, consider searching over a product catalog where "offers" > and "items" are indexed side-by-side, with the offers being represented as > "children" of the parent items. An offer contains information like > "condition" (new vs. used), selling price, etc. for the parent item. If we > want to facet on "condition", we want to observe all children that matched > the query to know if the parent item had a "new" or "used" offer (or both). > This requires iterating over the child matches when faceting, which we cannot > do today since the child hit information isn't retained anywhere. > We can support this by "caching" the child hits in a bitset but there is some > complexity when multiple join queries appear in a query structure (would need > to logically combine various "cached" bitsets using the same boolean > operations as in the original query structure). -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Resolved] (LUCENE-10204) Support iteration of sub-matches in join queries (ToParentBlockJoinQuery / ToChildBlockJoinQuery)
[ https://issues.apache.org/jira/browse/LUCENE-10204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Miller resolved LUCENE-10204. -- Resolution: Won't Fix > Support iteration of sub-matches in join queries (ToParentBlockJoinQuery / > ToChildBlockJoinQuery) > - > > Key: LUCENE-10204 > URL: https://issues.apache.org/jira/browse/LUCENE-10204 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/join >Reporter: Greg Miller >Priority: Minor > > It would be nice to be able to iterate over the "sub-matches" in these join > queries for the purpose of faceting (or possibly other use-cases?). > For example, we have a use-case where our query matches on "child" docs, > using a {{ToParentBlockJoinQuery}} to "emit" the associated parents, which > are ultimately added to our match set. But, we want to iterate over the > matching "children" for the purpose of faceting. > To make it concrete, consider searching over a product catalog where "offers" > and "items" are indexed side-by-side, with the offers being represented as > "children" of the parent items. An offer contains information like > "condition" (new vs. used), selling price, etc. for the parent item. If we > want to facet on "condition", we want to observe all children that matched > the query to know if the parent item had a "new" or "used" offer (or both). > This requires iterating over the child matches when faceting, which we cannot > do today since the child hit information isn't retained anywhere. > We can support this by "caching" the child hits in a bitset but there is some > complexity when multiple join queries appear in a query structure (would need > to logically combine various "cached" bitsets using the same boolean > operations as in the original query structure). -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10529) TestTaxonomyFacetAssociations may have floating point issues
[ https://issues.apache.org/jira/browse/LUCENE-10529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17528504#comment-17528504 ] Greg Miller commented on LUCENE-10529: -- Just pushed a fix for the NPE (a rare random case where no docs get indexed for a dim in the test case was handled incorrectly). For some reason I'm not able to repro the originally reported floating point precision issue, but I am able to reproduce with the seed in LUCENE-10530. I'll work on a fix for that tomorrow. Thanks for reporting and apologies for the random test failures. > TestTaxonomyFacetAssociations may have floating point issues > > > Key: LUCENE-10529 > URL: https://issues.apache.org/jira/browse/LUCENE-10529 > Project: Lucene - Core > Issue Type: Task >Reporter: Robert Muir >Priority: Major > > Hit this in a jenkins CI build while testing something else: > {noformat} > gradlew test --tests TestTaxonomyFacetAssociations.testFloatAssociationRandom > -Dtests.seed=B39C450F4870F7F1 -Dtests.locale=ar-IQ > -Dtests.timezone=America/Rankin_Inlet -Dtests.asserts=true > -Dtests.file.encoding=UTF-8 > ... > org.apache.lucene.facet.taxonomy.TestTaxonomyFacetAssociations > > testFloatAssociationRandom FAILED > java.lang.AssertionError: expected:<2605996.5> but was:<2605995.2> > {noformat} -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10529) TestTaxonomyFacetAssociations may have floating point issues
[ https://issues.apache.org/jira/browse/LUCENE-10529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17528401#comment-17528401 ] Greg Miller commented on LUCENE-10529: -- Looks like maybe the same thing reported in LUCENE-10530. I'll have a look at this. > TestTaxonomyFacetAssociations may have floating point issues > > > Key: LUCENE-10529 > URL: https://issues.apache.org/jira/browse/LUCENE-10529 > Project: Lucene - Core > Issue Type: Task >Reporter: Robert Muir >Priority: Major > > Hit this in a jenkins CI build while testing something else: > {noformat} > gradlew test --tests TestTaxonomyFacetAssociations.testFloatAssociationRandom > -Dtests.seed=B39C450F4870F7F1 -Dtests.locale=ar-IQ > -Dtests.timezone=America/Rankin_Inlet -Dtests.asserts=true > -Dtests.file.encoding=UTF-8 > ... > org.apache.lucene.facet.taxonomy.TestTaxonomyFacetAssociations > > testFloatAssociationRandom FAILED > java.lang.AssertionError: expected:<2605996.5> but was:<2605995.2> > {noformat} -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10530) TestTaxonomyFacetAssociations test failure
[ https://issues.apache.org/jira/browse/LUCENE-10530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17528402#comment-17528402 ] Greg Miller commented on LUCENE-10530: -- Possibly the same issue also reported in LUCENE-10529. I'll have a look. > TestTaxonomyFacetAssociations test failure > -- > > Key: LUCENE-10530 > URL: https://issues.apache.org/jira/browse/LUCENE-10530 > Project: Lucene - Core > Issue Type: Bug >Reporter: Vigya Sharma >Priority: Major > > TestTaxonomyFacetAssociations.testFloatAssociationRandom seems to have some > flakiness, it fails on the following random seed. > {code:java} > ./gradlew test --tests > TestTaxonomyFacetAssociations.testFloatAssociationRandom \ > -Dtests.seed=4DFBA8209AC82EB2 -Dtests.slow=true -Dtests.locale=fr-VU \ > -Dtests.timezone=Europe/Athens -Dtests.asserts=true > -Dtests.file.encoding=UTF-8 {code} > This is because of a mismatch in (SUM) aggregated multi-valued, > {{float_random}} facet field. We accept an error delta of 1 in this > aggregation, but for the failing random seed, the delta is 1.3. Maybe we > should change this delta to 1.5? > My hunch is that it is some floating point approximation error. I'm unable to > repro it without the randomization seed. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Resolved] (LUCENE-10495) Fix return statement of siblingsLoaded() in TaxonomyFacets
[ https://issues.apache.org/jira/browse/LUCENE-10495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Miller resolved LUCENE-10495. -- Fix Version/s: 9.2 Resolution: Fixed > Fix return statement of siblingsLoaded() in TaxonomyFacets > -- > > Key: LUCENE-10495 > URL: https://issues.apache.org/jira/browse/LUCENE-10495 > Project: Lucene - Core > Issue Type: Bug >Reporter: Yuting Gan >Priority: Minor > Fix For: 9.2 > > Attachments: Screen Shot 2022-03-30 at 8.02.15 PM.png > > Time Spent: 1h 50m > Remaining Estimate: 0h > > Found a bug in TaxonomyFacets when trying to use the siblingsLoaded function. > siblingsLoaded() should return siblings != null and it returns children != > null currently. > > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
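The bug described in this issue is easy to picture with a stripped-down stand-in. The class below is a hypothetical sketch, not the `TaxonomyFacets` source: it assumes two lazily loaded arrays, and it keeps the buggy check next to the corrected one to show when they disagree.

```java
final class LazyTaxonomyArraysSketch {
  // Both arrays are assumed to be loaded lazily and stay null until requested.
  private int[] children;
  private int[] siblings;

  void loadChildren(int size) {
    children = new int[size];
  }

  void loadSiblings(int size) {
    siblings = new int[size];
  }

  boolean childrenLoaded() {
    return children != null;
  }

  /** The behavior the issue reports: the check inspects the wrong array. */
  boolean siblingsLoadedBuggy() {
    return children != null;
  }

  /** The corrected check: siblings are loaded only if the siblings array exists. */
  boolean siblingsLoaded() {
    return siblings != null;
  }

  public static void main(String[] args) {
    LazyTaxonomyArraysSketch arrays = new LazyTaxonomyArraysSketch();
    arrays.loadChildren(8); // only children are loaded at this point
    System.out.println(arrays.siblingsLoadedBuggy()); // prints true, which is wrong
    System.out.println(arrays.siblingsLoaded()); // prints false
  }
}
```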
[jira] [Resolved] (LUCENE-10444) Support alternate aggregation functions in association facets
[ https://issues.apache.org/jira/browse/LUCENE-10444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Miller resolved LUCENE-10444. -- Fix Version/s: 9.2 Resolution: Fixed > Support alternate aggregation functions in association facets > - > > Key: LUCENE-10444 > URL: https://issues.apache.org/jira/browse/LUCENE-10444 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/facet >Reporter: Greg Miller >Assignee: Greg Miller >Priority: Minor > Fix For: 9.2 > > Time Spent: 2h 10m > Remaining Estimate: 0h > > We currently only support {{sum}} aggregations in the various association > facet implementations. I'd be really interested in extending the association > facet implementations to support other aggregations, starting with {{max}} > and {{min}} (in addition to {{{}sum{}}}). > I've been sketching up a prototype of this and I think I have a reasonable > way to introduce this idea. Will get a PR out for feedback soon. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10488) Optimize Facets#getTopDims across Facets implementations
[ https://issues.apache.org/jira/browse/LUCENE-10488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17518939#comment-17518939 ] Greg Miller commented on LUCENE-10488: -- Very exciting. Thanks [~yutinggan]! Also, please note that the refactoring change I mentioned above for association facets is now merged (LUCENE-10444), so it should be easy now to move forward with optimizations there as well if you're interested (or if anyone else is interested). Thanks again! > Optimize Facets#getTopDims across Facets implementations > > > Key: LUCENE-10488 > URL: https://issues.apache.org/jira/browse/LUCENE-10488 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/facet >Reporter: Greg Miller >Priority: Minor > Time Spent: 50m > Remaining Estimate: 0h > > LUCENE-10325 added a new {{getTopDims}} API, allowing users to specify the > number of "top" dimensions they want. The default implementation just > delegates to {{getAllDims}} and returns the number of top dims requested, but > some Facets sub-classes can do this more optimally. LUCENE-10325 demonstrated > this in {{SortedSetDocValueFacetCounts}}, but we can take it further. There's > at least some opportunity to do better in: > * {{ConcurrentSortedSetDocValuesFacetCounts}} > * {{FastTaxonomyFacetCounts}} > * {{TaxonomyFacetSumFloatAssociations}} > * {{TaxonomyFacetSumIntAssociations}} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10507) Should it be more likely to search concurrently in tests?
[ https://issues.apache.org/jira/browse/LUCENE-10507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17518935#comment-17518935 ] Greg Miller commented on LUCENE-10507: -- +1. I think this is a great idea! > Should it be more likely to search concurrently in tests? > - > > Key: LUCENE-10507 > URL: https://issues.apache.org/jira/browse/LUCENE-10507 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Luca Cavanna >Priority: Minor > > As part of LUCENE-10002 we are migrating test usages of > IndexSearcher#search(Query, Collector) to use the corresponding search method > that takes a CollectorManager in place of a Collector. As part of such > changes, I've been paying attention to whether searchers are created through > LuceneTestCase#newSearcher and migrating to it when possible. > This caused some recent test failures following test changes, which were in > most cases test issues, although they were quite rare due to the fact that we > only rarely exercise the concurrent code-path in tests. > One recent failure uncovered LUCENE-10500, which was an actual bug that > affected concurrent searches only, and was uncovered by a test run that > indexed a considerable amount of docs and was lucky enough to get an executor > set to its index searcher as well as get multiple slices. > LuceneTestCase#newIndexSearcher(IndexReader) uses threads only rarely, and > even when useThreads is true, the searcher may not get an executor set. Also, > it can often happen that even when an executor is set, the searcher will hold > only one slice, as not enough documents are indexed. Some nightly tests index > enough documents, and LuceneTestCase also lowers the slice limits but only > 50% of the time and only when wrapWithAssertions is false. Also, I wonder if > the lower limits are low enough: > {code:java} > int maxDocPerSlice = 1 + random.nextInt(10); > int maxSegmentsPerSlice = 1 + random.nextInt(20); > {code} > All in all, I wonder if we should make it more likely for real concurrent > searches to happen while testing across multiple slices. It seems like it > could be useful especially as we'd like users to use collector managers > instead of collectors (although that does not necessarily translate to > concurrent search). -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Resolved] (LUCENE-10467) Throws IllegalArgumentException for getAllDims and getTopChildren if topN <= 0
[ https://issues.apache.org/jira/browse/LUCENE-10467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Miller resolved LUCENE-10467. -- Fix Version/s: 9.2 Resolution: Fixed Merged and backported. Thanks [~yutinggan]! > Throws IllegalArgumentException for getAllDims and getTopChildren if topN <= 0 > -- > > Key: LUCENE-10467 > URL: https://issues.apache.org/jira/browse/LUCENE-10467 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/facet >Reporter: Yuting Gan >Priority: Minor > Fix For: 9.2 > > Time Spent: 1.5h > Remaining Estimate: 0h > > Currently, the subclasses that implement or override getAllDims and > getTopChildren behave differently when passed an invalid topN > parameter (topN <= 0). Some overridden implementations throw a > NullPointerException, some throw an IllegalArgumentException, and others > throw no exception. > Consistently throwing an IllegalArgumentException when topN <= 0 is > requested, across all implementations of these two methods, would provide a > better user experience. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
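The consistent behavior requested here amounts to a shared argument check that runs before any work is done. The following is a rough sketch under assumptions: `TopNValidationSketch`, `validateTopN`, and the list-based `getTopChildren` are hypothetical stand-ins, not the signatures in the facet module.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class TopNValidationSketch {

  /** Shared guard so that every implementation rejects invalid input the same way. */
  static void validateTopN(int topN) {
    if (topN <= 0) {
      throw new IllegalArgumentException("topN must be > 0 (got: " + topN + ")");
    }
  }

  /** Hypothetical stand-in: returns at most topN entries of an already sorted child list. */
  static List<String> getTopChildren(int topN, List<String> sortedChildren) {
    validateTopN(topN); // fail fast, before touching any data
    int end = Math.min(topN, sortedChildren.size());
    return new ArrayList<>(sortedChildren.subList(0, end));
  }

  public static void main(String[] args) {
    List<String> children = Arrays.asList("new", "used", "refurbished");
    System.out.println(getTopChildren(2, children)); // prints [new, used]
    try {
      getTopChildren(0, children);
    } catch (IllegalArgumentException e) {
      System.out.println("rejected: " + e.getMessage());
    }
  }
}
```

Placing the check in one helper that every override calls first is one way to keep a NullPointerException from surfacing deeper in the code path.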
[jira] [Resolved] (LUCENE-10491) TaxonomyFacetSumValueSource incorrectly provides scores to doc values
[ https://issues.apache.org/jira/browse/LUCENE-10491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Miller resolved LUCENE-10491. -- Fix Version/s: 9.2 Resolution: Fixed Fixed and backported. > TaxonomyFacetSumValueSource incorrectly provides scores to doc values > - > > Key: LUCENE-10491 > URL: https://issues.apache.org/jira/browse/LUCENE-10491 > Project: Lucene - Core > Issue Type: Bug > Components: modules/facet >Affects Versions: 10.0 (main), 9.2 >Reporter: Greg Miller >Assignee: Greg Miller >Priority: Major > Fix For: 9.2 > > Time Spent: 20m > Remaining Estimate: 0h > > {{TaxonomyFacetSumValueSource}} has a bug in the way it provides scores to > the user-provided doc values. [On this > line|https://github.com/apache/lucene/blob/main/lucene/facet/src/java/org/apache/lucene/facet/taxonomy/TaxonomyFacetSumValueSource.java#L78] > it should be {{index = doc}}, not {{index++}}. Thanks to [~mikemccand] for > finding this over in #718! > I've reproduced with a test and will post the test and a fix shortly. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Assigned] (LUCENE-10491) TaxonomyFacetSumValueSource incorrectly provides scores to doc values
[ https://issues.apache.org/jira/browse/LUCENE-10491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Miller reassigned LUCENE-10491: Assignee: Greg Miller > TaxonomyFacetSumValueSource incorrectly provides scores to doc values > - > > Key: LUCENE-10491 > URL: https://issues.apache.org/jira/browse/LUCENE-10491 > Project: Lucene - Core > Issue Type: Bug > Components: modules/facet >Affects Versions: 10.0 (main), 9.2 >Reporter: Greg Miller >Assignee: Greg Miller >Priority: Major > Time Spent: 10m > Remaining Estimate: 0h > > {{TaxonomyFacetSumValueSource}} has a bug in the way it provides scores to > the user-provided doc values. [On this > line|https://github.com/apache/lucene/blob/main/lucene/facet/src/java/org/apache/lucene/facet/taxonomy/TaxonomyFacetSumValueSource.java#L78] > it should be {{index = doc}}, not {{index++}}. Thanks to [~mikemccand] for > finding this over in #718! > I've reproduced with a test and will post the test and a fix shortly. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Created] (LUCENE-10491) TaxonomyFacetSumValueSource incorrectly provides scores to doc values
Greg Miller created LUCENE-10491: Summary: TaxonomyFacetSumValueSource incorrectly provides scores to doc values Key: LUCENE-10491 URL: https://issues.apache.org/jira/browse/LUCENE-10491 Project: Lucene - Core Issue Type: Bug Components: modules/facet Affects Versions: 10.0 (main), 9.2 Reporter: Greg Miller {{TaxonomyFacetSumValueSource}} has a bug in the way it provides scores to the user-provided doc values. [On this line|https://github.com/apache/lucene/blob/main/lucene/facet/src/java/org/apache/lucene/facet/taxonomy/TaxonomyFacetSumValueSource.java#L78] it should be {{index = doc}}, not {{index++}}. Thanks to [~mikemccand] for finding this over in #718! I've reproduced with a test and will post the test and a fix shortly. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10325) Add getTopDims functionality to Facets
[ https://issues.apache.org/jira/browse/LUCENE-10325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17513732#comment-17513732 ] Greg Miller commented on LUCENE-10325: -- Also opened LUCENE-10488 to track other optimizations. > Add getTopDims functionality to Facets > -- > > Key: LUCENE-10325 > URL: https://issues.apache.org/jira/browse/LUCENE-10325 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/facet >Reporter: Greg Miller >Priority: Major > Fix For: 9.2 > > Time Spent: 9h > Remaining Estimate: 0h > > The current {{getAllDims}} functionality is really the only way for users to > determine the "top" dimensions in a faceting field (i.e., get the top dims by > count along with their top-n children), but it has the unfortunate > side-effect of resolving all child paths for every dim, even if the user > doesn't intend to use those dims. For example, if a match set contains docs > relating to 100 different dims (and various values under each), but the user > only wants the top 10 dims with their top 5 children, they can call > getAllDims(5) then just grab the first 10 results, but a lot of wasted work > has been done for the other 90 dims. > It would be nice to implement something like {{getTopDims(int numDims, int > numChildren)}} that would only do the work necessary to resolve {{numDims}} > dims instead of all dims. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Resolved] (LUCENE-10325) Add getTopDims functionality to Facets
[ https://issues.apache.org/jira/browse/LUCENE-10325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Miller resolved LUCENE-10325. -- Fix Version/s: 9.2 Resolution: Fixed Merged and backported. Thanks again [~yutinggan]! > Add getTopDims functionality to Facets > -- > > Key: LUCENE-10325 > URL: https://issues.apache.org/jira/browse/LUCENE-10325 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/facet >Reporter: Greg Miller >Priority: Major > Fix For: 9.2 > > Time Spent: 9h > Remaining Estimate: 0h > > The current {{getAllDims}} functionality is really the only way for users to > determine the "top" dimensions in a faceting field (i.e., get the top dims by > count along with their top-n children), but it has the unfortunate > side-effect of resolving all child paths for every dim, even if the user > doesn't intend to use those dims. For example, if a match set contains docs > relating to 100 different dims (and various values under each), but the user > only wants the top 10 dims with their top 5 children, they can call > getAllDims(5) then just grab the first 10 results, but a lot of wasted work > has been done for the other 90 dims. > It would be nice to implement something like {{getTopDims(int numDims, int > numChildren)}} that would only do the work necessary to resolve {{numDims}} > dims instead of all dims. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
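The saving described in the quoted issue comes from choosing the top dimensions before resolving any children. The sketch below illustrates that selection step only, with a bounded min-heap over per-dimension counts; `TopDimsSketch`, `topDims`, and the sample dimension names are hypothetical, and it is not the implementation that was merged.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

final class TopDimsSketch {

  /** Returns the names of the numDims dimensions with the highest counts, highest first. */
  static List<String> topDims(Map<String, Integer> dimCounts, int numDims) {
    if (numDims <= 0) {
      throw new IllegalArgumentException("numDims must be > 0 (got: " + numDims + ")");
    }
    // Weakest entry first: lowest count, and on ties the lexicographically larger name.
    Comparator<Map.Entry<String, Integer>> weakestFirst =
        (a, b) -> {
          int cmp = Integer.compare(a.getValue(), b.getValue());
          return cmp != 0 ? cmp : b.getKey().compareTo(a.getKey());
        };
    PriorityQueue<Map.Entry<String, Integer>> heap = new PriorityQueue<>(weakestFirst);
    for (Map.Entry<String, Integer> entry : dimCounts.entrySet()) {
      heap.offer(entry);
      if (heap.size() > numDims) {
        heap.poll(); // evict the weakest so the heap never holds more than numDims entries
      }
    }
    List<String> result = new ArrayList<>();
    while (!heap.isEmpty()) {
      result.add(heap.poll().getKey());
    }
    Collections.reverse(result); // the heap drains weakest first
    return result;
  }

  public static void main(String[] args) {
    Map<String, Integer> dimCounts = new HashMap<>();
    dimCounts.put("Author", 5);
    dimCounts.put("Publish Date", 9);
    dimCounts.put("Genre", 2);
    // Only the dims returned here would then have their top children resolved.
    System.out.println(topDims(dimCounts, 2)); // prints [Publish Date, Author]
  }
}
```

With 100 dimensions and a request for the top 10, child resolution would run for 10 dimensions instead of all 100, which is the wasted work the issue calls out.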
[jira] [Comment Edited] (LUCENE-10488) Optimize Facets#getTopDims across Facets implementations
[ https://issues.apache.org/jira/browse/LUCENE-10488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17513722#comment-17513722 ] Greg Miller edited comment on LUCENE-10488 at 3/28/22, 11:41 PM: - Note that I have an [open PR|https://github.com/apache/lucene/pull/719] that proposes some significant changes to association facets, so might be worth trying to avoid large merge collisions with that if someone jumps on this. was (Author: gsmiller): Note that I have an [open PR](https://github.com/apache/lucene/pull/719) that proposes some significant changes to association facets, so might be worth trying to avoid large merge collisions with that if someone jumps on this. > Optimize Facets#getTopDims across Facets implementations > > > Key: LUCENE-10488 > URL: https://issues.apache.org/jira/browse/LUCENE-10488 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/facet >Reporter: Greg Miller >Priority: Minor > > LUCENE-10325 added a new {{getTopDims}} API, allowing users to specify the > number of "top" dimensions they want. The default implementation just > delegates to {{getAllDims}} and returns the number of top dims requested, but > some Facets sub-classes can do this more optimally. LUCENE-10325 demonstrated > this in {{SortedSetDocValueFacetCounts}}, but we can take it further. There's > at least some opportunity to do better in: > * {{ConcurrentSortedSetDocValuesFacetCounts}} > * {{FastTaxonomyFacetCounts}} > * {{TaxonomyFacetSumFloatAssociations}} > * {{TaxonomyFacetSumIntAssociations}} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10488) Optimize Facets#getTopDims across Facets implementations
[ https://issues.apache.org/jira/browse/LUCENE-10488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17513722#comment-17513722 ] Greg Miller commented on LUCENE-10488: -- Note that I have an [open PR](https://github.com/apache/lucene/pull/719) that proposes some significant changes to association facets, so might be worth trying to avoid large merge collisions with that if someone jumps on this. > Optimize Facets#getTopDims across Facets implementations > > > Key: LUCENE-10488 > URL: https://issues.apache.org/jira/browse/LUCENE-10488 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/facet >Reporter: Greg Miller >Priority: Minor > > LUCENE-10325 added a new {{getTopDims}} API, allowing users to specify the > number of "top" dimensions they want. The default implementation just > delegates to {{getAllDims}} and returns the number of top dims requested, but > some Facets sub-classes can do this more optimally. LUCENE-10325 demonstrated > this in {{SortedSetDocValueFacetCounts}}, but we can take it further. There's > at least some opportunity to do better in: > * {{ConcurrentSortedSetDocValuesFacetCounts}} > * {{FastTaxonomyFacetCounts}} > * {{TaxonomyFacetSumFloatAssociations}} > * {{TaxonomyFacetSumIntAssociations}} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Created] (LUCENE-10488) Optimize Facets#getTopDims across Facets implementations
Greg Miller created LUCENE-10488: Summary: Optimize Facets#getTopDims across Facets implementations Key: LUCENE-10488 URL: https://issues.apache.org/jira/browse/LUCENE-10488 Project: Lucene - Core Issue Type: Improvement Components: modules/facet Reporter: Greg Miller LUCENE-10325 added a new {{getTopDims}} API, allowing users to specify the number of "top" dimensions they want. The default implementation just delegates to {{getAllDims}} and returns the number of top dims requested, but some Facets sub-classes can do this more efficiently. LUCENE-10325 demonstrated this in {{SortedSetDocValuesFacetCounts}}, but we can take it further. There's at least some opportunity to do better in: * {{ConcurrentSortedSetDocValuesFacetCounts}} * {{FastTaxonomyFacetCounts}} * {{TaxonomyFacetSumFloatAssociations}} * {{TaxonomyFacetSumIntAssociations}}
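The description above contrasts a default {{getTopDims}} that computes every dimension and keeps the first N with implementations that can avoid some of that work. The following is a minimal sketch in plain Java, not Lucene code: {{Dim}}, {{topDimsViaSort}} and {{topDimsViaHeap}} are hypothetical names, and the bounded heap is only one generic way to avoid a full sort. It is not claimed to be the approach taken in LUCENE-10325 or in any Facets sub-class.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class TopDimsSketch {

  /** Hypothetical stand-in for a per-dimension result; not Lucene's FacetResult. */
  public static final class Dim {
    public final String name;
    public final long value;

    public Dim(String name, long value) {
      this.name = name;
      this.value = value;
    }
  }

  /** Higher value first; ties broken by dimension name, ascending. */
  static final Comparator<Dim> BEST_FIRST = (a, b) -> {
    int cmp = Long.compare(b.value, a.value);
    return cmp != 0 ? cmp : a.name.compareTo(b.name);
  };

  /** Mimics the "delegate to getAllDims, then truncate" default: sort everything, keep n. */
  public static List<Dim> topDimsViaSort(List<Dim> allDims, int n) {
    List<Dim> sorted = new ArrayList<>(allDims);
    sorted.sort(BEST_FIRST);
    return new ArrayList<>(sorted.subList(0, Math.max(0, Math.min(n, sorted.size()))));
  }

  /** Keeps at most n candidates in a heap whose head is the current worst entry. */
  public static List<Dim> topDimsViaHeap(List<Dim> allDims, int n) {
    if (n <= 0) {
      return new ArrayList<>();
    }
    PriorityQueue<Dim> heap = new PriorityQueue<>(BEST_FIRST.reversed());
    for (Dim d : allDims) {
      if (heap.size() < n) {
        heap.add(d);
      } else if (BEST_FIRST.compare(d, heap.peek()) < 0) {
        // d ranks ahead of the worst kept entry, so it replaces that entry.
        heap.poll();
        heap.add(d);
      }
    }
    List<Dim> top = new ArrayList<>(heap);
    top.sort(BEST_FIRST);
    return top;
  }
}
```

Both methods return the same dimensions in the same order; the heap variant only ever holds n entries, which matters when the number of dimensions is much larger than n.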
[jira] [Updated] (LUCENE-10484) Add support for concurrent facets random sampling
[ https://issues.apache.org/jira/browse/LUCENE-10484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Miller updated LUCENE-10484: - Component/s: modules/facet > Add support for concurrent facets random sampling > - > > Key: LUCENE-10484 > URL: https://issues.apache.org/jira/browse/LUCENE-10484 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/facet >Reporter: Luca Cavanna >Priority: Minor > > While FacetsCollectorManager exists to allow users to concurrently do facets > collection through FacetsCollector, RandomSamplingFacetsCollector does not > have a corresponding collector manager that easily allows users to > concurrently do random sampling. The needed collector manager would be very > similar to FacetsCollectorManager, yet it would need to expose a specialized > reduced RandomSamplingFacetsCollector, and the reduction should call > getOriginalMatchingDocs instead of getMatchingDocs, which modifies the > internal totalHits when called. > This relates to LUCENE-10002 and would allow using a collector manager > instead of a collector when doing random sampling, as part of the effort to reduce > usages of IndexSearcher#search(Query, Collector).
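The shape of the collector manager described above (one collector per slice, then a reduction that merges the unsampled matches) can be sketched in plain Java. This is an illustration only and does not use Lucene: {{SamplingCollector}}, {{newCollector}} and {{reduce}} are hypothetical stand-ins, and only the accessor name {{getOriginalMatchingDocs}} is taken from the issue text.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

public class SamplingManagerSketch {

  /** Hypothetical per-slice collector; it only models the read-only accessor named in the issue. */
  public static final class SamplingCollector {
    private final List<Integer> matchingDocs = new ArrayList<>();

    /** Records one matching document id. */
    public void collect(int docId) {
      matchingDocs.add(docId);
    }

    /** Returns the unsampled matches without changing any internal state. */
    public List<Integer> getOriginalMatchingDocs() {
      return matchingDocs;
    }
  }

  /** One fresh collector per slice, so slices can be searched concurrently. */
  public static SamplingCollector newCollector() {
    return new SamplingCollector();
  }

  /** Merges the per-slice collectors into one reduced collector, reading only the original matches. */
  public static SamplingCollector reduce(Collection<SamplingCollector> collectors) {
    SamplingCollector reduced = new SamplingCollector();
    for (SamplingCollector c : collectors) {
      for (int doc : c.getOriginalMatchingDocs()) {
        reduced.collect(doc);
      }
    }
    return reduced;
  }
}
```

In this sketch sampling would be applied once, to the reduced collector, which is why the reduction reads the original matches from each slice.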
[jira] [Commented] (LUCENE-10468) Do not always do checkField() in DocValues.getXXX(LeafReader, String)
[ https://issues.apache.org/jira/browse/LUCENE-10468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17507137#comment-17507137 ] Greg Miller commented on LUCENE-10468: -- +1, I appreciate the field checking done by the DocValues factory methods. It only throws if the field exists but was indexed with a different type, which likely indicates a user-initiated error. Note that you can always use lower-level access by loading doc values directly from a LeafReader if you have some special use-case, or you can load FieldInfos and check those yourself as well. I've seen a few use-cases where this is useful, primarily optimizing the {{null}} case. > Do not always do checkField() in DocValues.getXXX(LeafReader, String) > - > > Key: LUCENE-10468 > URL: https://issues.apache.org/jira/browse/LUCENE-10468 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Lu Xugang >Priority: Trivial > Attachments: 1.png > > > An index query always returns an empty result when the field in the query does not exist, > or even when it was indexed with a different FieldType. > But for a doc-values query whose field does not exist, if this > field was not indexed with any other FieldType, the doc-values query's behavior is > the same as the index query's; otherwise it throws an exception, because > getting a DocValuesIterator always calls DocValues#checkField(...). > I think checkField(...) is not needed when only getting a DocValuesIterator, > and the exception's message is not friendly, so could we keep 'query result > consistency'? >
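The behavior discussed in this thread (no values when the field is absent, an exception only when the field exists with a different doc values type) can be modelled with a small plain-Java sketch. Nothing here is Lucene API: {{ValuesType}}, {{getNumeric}} and the map standing in for FieldInfos are hypothetical, and the exception message is illustrative.

```java
import java.util.Map;
import java.util.Optional;

public class CheckFieldSketch {

  /** Hypothetical doc values types; a stand-in, not Lucene's DocValuesType. */
  public enum ValuesType { NONE, NUMERIC, SORTED }

  /**
   * Looks up "numeric doc values" for a field.
   * The map plays the role of per-segment field metadata (an assumption of this sketch).
   */
  public static Optional<String> getNumeric(Map<String, ValuesType> fieldInfos, String field) {
    ValuesType actual = fieldInfos.get(field);
    if (actual == ValuesType.NUMERIC) {
      return Optional.of("numeric values for " + field);
    }
    if (actual != null) {
      // The field exists but with a different type: treated as a likely user error.
      throw new IllegalStateException(
          "unexpected doc values type " + actual + " for field '" + field + "' (expected=NUMERIC)");
    }
    // The field does not exist at all: behave like an empty result.
    return Optional.empty();
  }
}
```

A caller that prefers an empty result in every mismatch case could inspect the metadata map itself before asking for values, which corresponds to the "load FieldInfos and check those yourself" suggestion in the comment.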