[GitHub] [lucene] jtibshirani commented on pull request #416: LUCENE-10054 Make HnswGraph hierarchical
jtibshirani commented on pull request #416: URL: https://github.com/apache/lucene/pull/416#issuecomment-957098407 Got it, it sounds like you already adjusted my set-up to include warm-ups. Overall it looks like a positive performance improvement. I'm in favor of merging this even though the improvement is relatively small -- I think it's good to implement the actual algorithm that we claim to! I also think this sets us up well for future performance improvements, by closely comparing to other HNSW implementations. One last thing to check regarding performance: does it have an impact on indexing speed? Reviewing the code with fresh eyes, I found some more parts where I had questions. I know 9.0 feature freeze is coming up really soon, maybe we want to discuss the timing of this PR? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] jtibshirani commented on a change in pull request #416: LUCENE-10054 Make HnswGraph hierarchical
jtibshirani commented on a change in pull request #416: URL: https://github.com/apache/lucene/pull/416#discussion_r740706179

## File path: lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraph.java
@@ -99,32 +121,72 @@ public static NeighborQueue search( Bits acceptOrds, SplittableRandom random)

Review comment: I think this `SplittableRandom` is unused.

## File path: lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraph.java
@@ -99,32 +121,72 @@ public static NeighborQueue search( Bits acceptOrds, SplittableRandom random) throws IOException {
+    int size = graphValues.size();
+    int boundedNumSeed = Math.max(topK, Math.min(numSeed, 2 * size));
+    NeighborQueue results;
+
+    int[] eps = new int[] {graphValues.entryNode()};
+    for (int level = graphValues.numLevels() - 1; level >= 1; level--) {
+      results =
+          HnswGraph.searchLevel(

Review comment: Tiny comment, we can remove the `HnswGraph` qualifier.

## File path: lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90HnswVectorsReader.java
@@ -205,6 +215,43 @@ private FieldEntry readField(DataInput input) throws IOException { return new FieldEntry(input, similarityFunction); }
+  private void fillGraphNodesAndOffsetsByLevel() throws IOException {
+    for (FieldEntry entry : fields.values()) {
+      IndexInput input =

Review comment: If I understand correctly, the graph index file is only used to populate these data structures on `FieldEntry` (the graph level information). It feels a bit surprising that `FieldEntry` is constructed across two different files (both the metadata and graph index files). It also means `FieldEntry` isn't immutable. Hmm, I'm not sure what typically belongs in a metadata file. I assume it doesn't make sense to move the graph index information there, alongside the ord -> doc mapping. Maybe we could keep the file as-is but create a new class to hold the graph index, something like `GraphLevels`?
## File path: lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraph.java
@@ -56,31 +59,50 @@ public final class HnswGraph extends KnnGraphValues {
   private final int maxConn;
+  private int numLevels; // the current number of levels in the graph
+  private int entryNode; // the current graph entry node on the top level
-  // Each entry lists the top maxConn neighbors of a node. The nodes correspond to vectors added to
-  // HnswBuilder, and the
-  // node values are the ordinals of those vectors.
-  private final List<NeighborArray> graph;
+  // Nodes by level expressed as the level 0's nodes' ordinals.
+  // As level 0 contains all nodes, nodesByLevel.get(0) is null.
+  private final List<int[]> nodesByLevel;
+
+  // graph is a list of graph levels.
+  // Each level is represented as List<NeighborArray> – nodes' connections on this level.
+  // Each entry in the list has the top maxConn neighbors of a node. The nodes correspond to vectors
+  // added to HnswBuilder, and the node values are the ordinals of those vectors.
+  // Thus, on all levels, neighbors expressed as the level 0's nodes' ordinals.
+  private final List<List<NeighborArray>> graph;

   // KnnGraphValues iterator members
   private int upto;
   private NeighborArray cur;

-  HnswGraph(int maxConn) {
-    graph = new ArrayList<>();
-    // Typically with diversity criteria we see nodes not fully occupied; average fanout seems to be
-    // about 1/2 maxConn. There is some indexing time penalty for under-allocating, but saves RAM
-    graph.add(new NeighborArray(Math.max(32, maxConn / 4)));
+  HnswGraph(int maxConn, int levelOfFirstNode) {
     this.maxConn = maxConn;
+    this.numLevels = levelOfFirstNode + 1;
+    this.graph = new ArrayList<>(numLevels);
+    this.entryNode = 0;
+    for (int i = 0; i < numLevels; i++) {
+      graph.add(new ArrayList<>());
+      // Typically with diversity criteria we see nodes not fully occupied;
+      // average fanout seems to be about 1/2 maxConn.
+      // There is some indexing time penalty for under-allocating, but saves RAM
+      graph.get(i).add(new NeighborArray(Math.max(32, maxConn / 4)));

Review comment: I am a little confused why we are careful not to overallocate here, but in `addNode` we do `new NeighborArray(maxConn + 1)));`? I see you maintained the existing behavior, so not critical to address in this PR, I was just curious about it.

## File path: lucene/core/src/java/org/apache/lucene/index/KnnGraphValues.java
@@ -32,25 +33,41 @@ protected KnnGraphValues() {}

   /**
-   * Move the pointer to exactly {@code target}, the id of a node in the graph. After this method
+   * Move the pointer to exactly the given {@code level}'s {@code target}. After this method
    * returns, call {@link #nextNeighbor()} to return successive (ordered) connected node ordinals.
    *
-   * @param target must be a valid node in the
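Stepping back from the diff hunks above: the search loop under review is the standard hierarchical HNSW descent (greedy search with a single entry point per upper level, then a wider search on level 0). A minimal self-contained sketch of that control flow, with plain maps and 1-D integer "vectors" standing in for Lucene's `KnnGraphValues` and `NeighborQueue` (all names here are hypothetical illustrations, not the PR's actual API):

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class HnswSketch {
  // level -> node -> neighbors; node ids are level-0 ordinals, as in the PR
  private final List<Map<Integer, int[]>> levels;
  private final int[] values; // stand-in for vectors: 1-D ints
  private final int entryNode; // entry point on the top level

  HnswSketch(List<Map<Integer, int[]>> levels, int[] values, int entryNode) {
    this.levels = levels;
    this.values = values;
    this.entryNode = entryNode;
  }

  private int dist(int node, int query) {
    return Math.abs(values[node] - query);
  }

  /** Greedy descent on the upper levels, then a full scan of level 0's component. */
  public int search(int query) {
    int ep = entryNode;
    // upper levels: hop to the closest neighbor until no improvement
    for (int level = levels.size() - 1; level >= 1; level--) {
      boolean improved = true;
      while (improved) {
        improved = false;
        for (int nb : levels.get(level).getOrDefault(ep, new int[0])) {
          if (dist(nb, query) < dist(ep, query)) {
            ep = nb;
            improved = true;
          }
        }
      }
    }
    // level 0: exhaustive expansion from the entry point, standing in for
    // the real bounded beam search over the numSeed nearest candidates
    int best = ep;
    Deque<Integer> stack = new ArrayDeque<>();
    Set<Integer> seen = new HashSet<>();
    stack.push(ep);
    while (!stack.isEmpty()) {
      int n = stack.pop();
      if (!seen.add(n)) continue;
      if (dist(n, query) < dist(best, query)) best = n;
      for (int nb : levels.get(0).getOrDefault(n, new int[0])) stack.push(nb);
    }
    return best;
  }
}
```

The real implementation bounds the level-0 expansion with a priority queue of size `boundedNumSeed`; the exhaustive scan here only illustrates the two-phase structure.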
[jira] [Created] (LUCENE-10216) Add concurrency to addIndexes(CodecReader…) API
Vigya Sharma created LUCENE-10216:
-
Summary: Add concurrency to addIndexes(CodecReader…) API
Key: LUCENE-10216
URL: https://issues.apache.org/jira/browse/LUCENE-10216
Project: Lucene - Core
Issue Type: Improvement
Components: core/index
Reporter: Vigya Sharma

I work at Amazon Product Search, and we use Lucene to power search for the e-commerce platform. I’m working on a project that involves applying metadata+ETL transforms and indexing documents on n different _indexing_ boxes, combining them into a single index on a separate _reducer_ box, and making it available for queries on m different _search_ boxes (replicas). Segments are asynchronously copied from indexers to reducers to searchers as they become available for the next layer to consume. I am using the addIndexes API to combine multiple indexes into one on the reducer boxes. Since we also have taxonomy data, we need to remap facet field ordinals, which means I need to use the {{addIndexes(CodecReader…)}} version of this API. The API leverages {{SegmentMerger.merge()}} to create segments with new ordinal values while also merging all provided segments in the process. _This is, however, a blocking call that runs in a single thread._ Until we have written segments with new ordinal values, we cannot copy them to searcher boxes, which increases the time to make documents available for search. I was playing around with the API by running multiple concurrent merges, each with only a single reader, giving a concurrently running 1:1 conversion from old segments to new ones (with new ordinal values). We follow this up with non-blocking background merges. This lets us copy the segments to searchers and replicas as soon as they are available, and later replace them with merged segments as background jobs complete. On the Amazon dataset I profiled, this gave us around a 2.5 to 3x improvement in addIndexes() time. Each call was given about 5 readers to add on average. This might be a useful addition to Lucene.
We could create another {{addIndexes()}} API with a {{boolean}} flag for concurrency, that internally submits multiple merge jobs (each with a single reader) to the {{ConcurrentMergeScheduler}}, and waits for them to complete before returning. While this is doable from outside Lucene by using your own thread pool, starting multiple addIndexes() calls and waiting for them to complete, I felt it needs some understanding of what addIndexes does, why you need to wait on the merge, and why it makes sense to pass a single reader to the addIndexes API. Out-of-the-box support in Lucene could simplify this for folks with a similar use case. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
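The pattern proposed above (submit one merge job per reader, then block until all of them complete) can be sketched with a plain `ExecutorService`. Here `mergeOne` is a hypothetical stand-in for a single-reader `addIndexes(CodecReader)` call, not Lucene code; sorting a list of ints stands in for the expensive `SegmentMerger` work:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ConcurrentAddIndexes {
  /** Stand-in for a single-reader addIndexes(CodecReader) call: "merges" one input. */
  static List<Integer> mergeOne(List<Integer> reader) {
    List<Integer> copy = new ArrayList<>(reader);
    Collections.sort(copy); // pretend this is the expensive SegmentMerger.merge() work
    return copy;
  }

  /** Submit one job per reader and wait for all of them, as the proposed API would. */
  public static List<List<Integer>> addAllConcurrently(List<List<Integer>> readers) {
    ExecutorService pool = Executors.newFixedThreadPool(4);
    try {
      List<Future<List<Integer>>> futures = new ArrayList<>();
      for (List<Integer> r : readers) {
        futures.add(pool.submit(() -> mergeOne(r)));
      }
      List<List<Integer>> out = new ArrayList<>();
      for (Future<List<Integer>> f : futures) {
        try {
          out.add(f.get()); // block until every 1:1 conversion completes
        } catch (InterruptedException | ExecutionException e) {
          throw new RuntimeException(e);
        }
      }
      return out;
    } finally {
      pool.shutdown();
    }
  }
}
```

The issue's point is that wiring this up correctly requires knowing that one reader per call is the unit of parallelism; an in-tree API could hide that detail.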
[GitHub] [lucene-solr] thelabdude merged pull request #2598: SOLR-12666: Add authn & authz plugins that supports multiple authentication schemes, such as Bearer and Basic
thelabdude merged pull request #2598: URL: https://github.com/apache/lucene-solr/pull/2598
[GitHub] [lucene-solr] thelabdude opened a new pull request #2598: SOLR-12666: Add authn & authz plugins that supports multiple authentication schemes, such as Bearer and Basic
thelabdude opened a new pull request #2598: URL: https://github.com/apache/lucene-solr/pull/2598 backport of https://github.com/apache/solr/pull/355
[jira] [Resolved] (LUCENE-10141) Update releaseWizard for 8x to correctly create back-compat indices and update Version in main after repo split
[ https://issues.apache.org/jira/browse/LUCENE-10141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Timothy Potter resolved LUCENE-10141.
-
Lucene Fields: (was: New)
Resolution: Fixed

> Update releaseWizard for 8x to correctly create back-compat indices and
> update Version in main after repo split
> ---
>
> Key: LUCENE-10141
> URL: https://issues.apache.org/jira/browse/LUCENE-10141
> Project: Lucene - Core
> Issue Type: Task
> Components: release wizard
> Reporter: Timothy Potter
> Assignee: Timothy Potter
> Priority: Major
> Fix For: 8.11
>
> Time Spent: 0.5h
> Remaining Estimate: 0h
>
> Need to update the release wizard in 8x to create the back-compat indices and
> update the Version info so that issues like
> https://issues.apache.org/jira/browse/LUCENE-10131 don't impact future 8x
> release managers. Hopefully an 8.11 is NOT needed, but release managers have
> enough on their plate to get right that we should fix this if possible. If
> not, we at least need to document the process of doing it manually.
[jira] [Commented] (LUCENE-10141) Update releaseWizard for 8x to correctly create back-compat indices and update Version in main after repo split
[ https://issues.apache.org/jira/browse/LUCENE-10141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17437031#comment-17437031 ]

ASF subversion and git services commented on LUCENE-10141:
--
Commit 8b3a5899cd1450690aeafb4fcea666cf06646b97 in lucene-solr's branch refs/heads/branch_8x from Timothy Potter [ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=8b3a589 ]

LUCENE-10141: Add the next minor version on Lucene's main branch in the split repo so the backcompat_master task works (#2595)
[GitHub] [lucene-solr] thelabdude merged pull request #2595: LUCENE-10141: Add the next minor version on Lucene's main branch in the split repo so the backcompat_master task works
thelabdude merged pull request #2595: URL: https://github.com/apache/lucene-solr/pull/2595
[GitHub] [lucene] uschindler removed a comment on pull request #421: LUCENE-10195: Add gradle cache option and make some tasks cacheable
uschindler removed a comment on pull request #421: URL: https://github.com/apache/lucene/pull/421#issuecomment-956588955 Wasn't the idea to make the cache optional? Looks like it is enabled by default.
[GitHub] [lucene] uschindler commented on pull request #421: LUCENE-10195: Add gradle cache option and make some tasks cacheable
uschindler commented on pull request #421: URL: https://github.com/apache/lucene/pull/421#issuecomment-956588955 Wasn't the idea to make the cache optional? Looks like it is enabled by default.
[jira] [Commented] (LUCENE-10195) Add gradle cache option and make some tasks cacheable
[ https://issues.apache.org/jira/browse/LUCENE-10195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17437026#comment-17437026 ]

Dawid Weiss commented on LUCENE-10195:
--
I've just tried this and it seems to work just fine. I'd appreciate it if you could have a second look at the modified PR at [https://github.com/apache/lucene/pull/421]; I think it's ready to go.

> Add gradle cache option and make some tasks cacheable
> -
>
> Key: LUCENE-10195
> URL: https://issues.apache.org/jira/browse/LUCENE-10195
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Jerome Prinet
> Priority: Major
> Time Spent: 10m
> Remaining Estimate: 0h
>
> Increase Gradle build speed with help of Gradle built-in features, mostly
> cache and up-to-date checks
[jira] [Commented] (LUCENE-10195) Add gradle cache option and make some tasks cacheable
[ https://issues.apache.org/jira/browse/LUCENE-10195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17437013#comment-17437013 ]

Dawid Weiss commented on LUCENE-10195:
--
Hi Jerome. I reviewed the PR again - added a changes entry and a non-enabled option to the default gradle.properties. I also corrected unnecessary exclusions in validateSourcePatterns (.gradle and .idea are at the root level only, so they only need to be excluded for the root project). Finally, after some deliberation, I decided not to include the changes to test-related tasks (ecjLint and test). You know our position on the subject of caching test results, and the changes you added (while informative about how gradle handles such things) add a layer of additional complexity that I don't think is needed. Maybe it'll change in the future, who knows - if so, your PR stays and can be a source of code/inspiration. I'd also like to correct the renderJavadoc task - I'd like to keep the map for offline links, as it makes it much simpler to configure and maybe add new linked javadocs later. Can this map be made cacheable by exposing a converting Input getter method with a list of classes marked Nested and two fields (a string and a file)? So that we can keep using the map but provide more semantics to gradle?
[jira] [Commented] (LUCENE-10120) Lazy initialize FixedBitSet in LRUQueryCache
[ https://issues.apache.org/jira/browse/LUCENE-10120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17437012#comment-17437012 ]

Greg Miller commented on LUCENE-10120:
--
[~ChrisLu] if I'm understanding this correctly, it sounds like your proposal here is to add a new special case to the LRU caching that optimizes for an extremely dense iterator, where all documents within a given min/max range are present (the extreme case being where all docs match). Is that a correct understanding? If so, it could be interesting to try this out. I think we'd only need to modify the {{cacheIntoBitSet}} method to behave similar to {{DocsWithFieldSet}} (as you point out). I don't know if we'll see much impact, but I like the idea!

> Lazy initialize FixedBitSet in LRUQueryCache
> -
>
> Key: LUCENE-10120
> URL: https://issues.apache.org/jira/browse/LUCENE-10120
> Project: Lucene - Core
> Issue Type: Improvement
> Components: core/search
> Affects Versions: main (9.0)
> Reporter: Lu Xugang
> Priority: Major
> Attachments: 1.png
>
> Basing on the implementation of collecting docIds in DocsWithFieldSet, maybe we
> could cache the docIdSet in a similar way in
> *LRUQueryCache#cacheIntoBitSet(BulkScorer scorer, int maxDoc)* when the docIdSet
> is dense.
> In this way, we do not always init a huge FixedBitSet, which is sometimes not
> necessary when maxDoc is large
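The `DocsWithFieldSet`-style behavior under discussion can be sketched as: assume the collected doc ids form one dense run, and only allocate a real bit set on the first gap. A hypothetical stand-in (not Lucene's classes; `java.util.BitSet` replaces `FixedBitSet` for simplicity):

```java
import java.util.BitSet;

class LazyDocIdSet {
  private int minDoc = -1; // smallest collected doc id
  private int lastDoc = -1; // largest collected doc id
  private BitSet bits; // allocated only once a gap is seen

  /** Collect doc ids in increasing order. */
  public void add(int doc) {
    if (bits == null) {
      if (minDoc == -1 || doc == lastDoc + 1) {
        if (minDoc == -1) minDoc = doc;
        lastDoc = doc;
        return; // still a dense [minDoc, lastDoc] range: no allocation yet
      }
      // first gap: fall back to a bit set, back-filling the dense prefix
      bits = new BitSet();
      bits.set(minDoc, lastDoc + 1);
    }
    bits.set(doc);
    lastDoc = doc;
  }

  public boolean isDense() {
    return bits == null;
  }

  public boolean contains(int doc) {
    if (bits == null) return doc >= minDoc && doc <= lastDoc;
    return bits.get(doc);
  }
}
```

In the dense case only two ints are stored, matching the issue's goal of not always allocating a maxDoc-sized `FixedBitSet` up front.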
[jira] [Updated] (LUCENE-10195) Add gradle cache option and make some tasks cacheable
[ https://issues.apache.org/jira/browse/LUCENE-10195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dawid Weiss updated LUCENE-10195:
-
Summary: Add gradle cache option and make some tasks cacheable (was: Gradle build speed improvement)
[GitHub] [lucene] zhaih opened a new pull request #420: [DRAFT] LUCENE-10122 Explore using NumericDocValue to store taxonomy parent array
zhaih opened a new pull request #420: URL: https://github.com/apache/lucene/pull/420

# Description
As mentioned in the issue, use NumericDocValues to store the parent array instead of term positioning.

# TODO
* benchmark
* address the backward compatibility issue

# Checklist
Please review the following and check all that apply:
- [x] I have reviewed the guidelines for [How to Contribute](https://wiki.apache.org/lucene/HowToContribute) and my code conforms to the standards described there to the best of my ability.
- [x] I have created a Jira issue and added the issue ID to my pull request title.
- [x] I have given Lucene maintainers [access](https://help.github.com/en/articles/allowing-changes-to-a-pull-request-branch-created-from-a-fork) to contribute to my PR branch. (optional but recommended)
- [x] I have developed this patch against the `main` branch.
- [ ] I have run `./gradlew check`. (Not yet, need to deal with the backward compatibility issue)
- [ ] I have added tests for my changes. (Old tests should be sufficient)
[jira] [Commented] (LUCENE-10207) Make TermInSetQuery usable with IndexOrDocValuesQuery
[ https://issues.apache.org/jira/browse/LUCENE-10207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17436974#comment-17436974 ]

Robert Muir commented on LUCENE-10207:
--
My line of thinking was just that it might be consistent with a disjunction-OR of the terms (which is really equivalent in the amount of work done)

> Make TermInSetQuery usable with IndexOrDocValuesQuery
> -
>
> Key: LUCENE-10207
> URL: https://issues.apache.org/jira/browse/LUCENE-10207
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Adrien Grand
> Priority: Minor
> Attachments: LUCENE-10207_multitermquery.patch
>
> IndexOrDocValuesQuery is very useful to pick the right execution mode for a
> query depending on other bits of the query tree.
> We would like to be able to use it to optimize execution of TermInSetQuery.
> However IndexOrDocValuesQuery only works well if the "index" query can give
> an estimation of the cost of the query without doing anything expensive (like
> looking up all terms of the TermInSetQuery in the terms dict). Maybe we could
> implement it for primary keys (terms.size() == sumDocFreq) by returning the
> number of terms of the query? Another idea is to multiply the number of terms
> by the average postings length, though this could be dangerous if the field
> has a zipfian distribution and some terms have a much higher doc frequency
> than the average.
> [~romseygeek] and I were discussing this a few weeks ago, and more recently
> [~mikemccand] and [~gsmiller] again independently. So it looks like there is
> interest in this. Here is an email thread where this was recently discussed:
> https://lists.apache.org/thread.html/re3b20a486c9a4e66b2ca4a2646e2d3be48535a90cdd95911a8445183%40%3Cdev.lucene.apache.org%3E.
[jira] [Commented] (LUCENE-10207) Make TermInSetQuery usable with IndexOrDocValuesQuery
[ https://issues.apache.org/jira/browse/LUCENE-10207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17436972#comment-17436972 ]

Greg Miller commented on LUCENE-10207:
--
[~rcmuir] as a cost heuristic for running the term-based scorer, I agree that sumDocFreq() is a better fit than getDocCount(). But, I thought that {{ScorerSupplier#cost()}} was meant to estimate the number of docs the scorer would produce if leading iteration. Am I misunderstanding that? Thanks!
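The cost estimation being debated in this thread reduces to simple arithmetic: for a primary-key-like field (terms.size() == sumDocFreq), each term matches exactly one doc, so the estimate is just the number of query terms; otherwise one could multiply by the average postings length, with the zipfian-distribution caveat the issue notes. A hedged sketch of those two heuristics (all names hypothetical, not a proposed Lucene API):

```java
class TermInSetCost {
  /**
   * Estimated number of matching docs for a TermInSetQuery, computed without
   * touching the terms dictionary. queryTermCount is the number of terms in
   * the query; fieldTermCount and sumDocFreq come from the field's Terms stats.
   */
  public static long estimateCost(long queryTermCount, long fieldTermCount, long sumDocFreq) {
    if (fieldTermCount == sumDocFreq) {
      // primary-key-like field: every term matches exactly one doc
      return queryTermCount;
    }
    // otherwise: number of query terms times the average postings length
    // (risky when the field is zipfian and a few terms dominate doc freq)
    long avgPostings = Math.max(1, sumDocFreq / Math.max(1, fieldTermCount));
    return queryTermCount * avgPostings;
  }
}
```

Either estimate is cheap enough for `ScorerSupplier#cost()`, which is the property `IndexOrDocValuesQuery` needs to choose between the index and doc-values execution modes.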
[jira] [Resolved] (LUCENE-10196) Improve IntroSorter with 3-ways partitioning
[ https://issues.apache.org/jira/browse/LUCENE-10196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bruno Roustant resolved LUCENE-10196.
-
Fix Version/s: 8.11
Resolution: Fixed

Thanks reviewers!

> Improve IntroSorter with 3-ways partitioning
> -
>
> Key: LUCENE-10196
> URL: https://issues.apache.org/jira/browse/LUCENE-10196
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Bruno Roustant
> Priority: Major
> Fix For: 8.11
>
> Time Spent: 2h
> Remaining Estimate: 0h
>
> I added a SorterBenchmark to evaluate the performance of the various Sorter
> implementations depending on the strategies defined in BaseSortTestCase
> (random, random-low-cardinality, ascending, descending, etc).
> By changing the implementation of the IntroSorter to use a 3-ways
> partitioning, we can gain a significant performance improvement when sorting
> low-cardinality lists, and with additional changes we can also improve the
> performance for all the strategies.
> Proposed changes:
> - Sort small ranges with insertion sort (instead of binary sort).
> - Select the quick sort pivot with medians.
> - Partition with the fast Bentley-McIlroy 3-ways partitioning algorithm.
> - Replace the tail recursion by a loop.
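The proposed changes listed above can be sketched as a small int-array sorter: insertion sort for short ranges, a median-of-three pivot, 3-way (Dutch national flag) partitioning so that equal keys are excluded from recursion, and a loop replacing the tail recursion. This is a simplified illustration, not Lucene's actual IntroSorter (which, among other things, also falls back to heapsort at a recursion-depth limit):

```java
class ThreeWaySort {
  private static final int INSERTION_THRESHOLD = 16;

  public static void sort(int[] a) {
    sort(a, 0, a.length);
  }

  // sort a[from, to)
  private static void sort(int[] a, int from, int to) {
    while (to - from > INSERTION_THRESHOLD) {
      int mid = (from + to) >>> 1;
      int pivot = median(a[from], a[mid], a[to - 1]);
      // 3-way partition: [from, lt) < pivot | [lt, gt) == pivot | [gt, to) > pivot
      int lt = from, i = from, gt = to;
      while (i < gt) {
        if (a[i] < pivot) swap(a, lt++, i++);
        else if (a[i] > pivot) swap(a, i, --gt);
        else i++;
      }
      // recurse on the smaller side; loop on the larger (replaces tail recursion);
      // the == pivot middle run is never touched again, which is where the
      // low-cardinality speedup comes from
      if (lt - from < to - gt) {
        sort(a, from, lt);
        from = gt;
      } else {
        sort(a, gt, to);
        to = lt;
      }
    }
    insertionSort(a, from, to);
  }

  private static void insertionSort(int[] a, int from, int to) {
    for (int i = from + 1; i < to; i++) {
      int v = a[i], j = i - 1;
      while (j >= from && a[j] > v) a[j + 1] = a[j--];
      a[j + 1] = v;
    }
  }

  private static int median(int x, int y, int z) {
    return Math.max(Math.min(x, y), Math.min(Math.max(x, y), z));
  }

  private static void swap(int[] a, int i, int j) {
    int t = a[i];
    a[i] = a[j];
    a[j] = t;
  }
}
```

On a list with few distinct values, most of the range falls into the `== pivot` band after one pass, so the recursion depth collapses quickly.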
[GitHub] [lucene] bruno-roustant closed pull request #404: LUCENE-10196: Improve IntroSorter with 3-ways partitioning.
bruno-roustant closed pull request #404: URL: https://github.com/apache/lucene/pull/404
[jira] [Commented] (LUCENE-10196) Improve IntroSorter with 3-ways partitioning
[ https://issues.apache.org/jira/browse/LUCENE-10196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17436911#comment-17436911 ]

ASF subversion and git services commented on LUCENE-10196:
--
Commit e0ce294868c17cbdbf54a344bbb7a94d7a62d7ff in lucene-solr's branch refs/heads/branch_8x from Bruno Roustant [ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=e0ce294 ]

LUCENE-10196: Improve IntroSorter with 3-ways partitioning.
[jira] [Comment Edited] (LUCENE-10120) Lazy initialize FixedBitSet in LRUQueryCache
[ https://issues.apache.org/jira/browse/LUCENE-10120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17436737#comment-17436737 ] Lu Xugang edited comment on LUCENE-10120 at 11/1/21, 10:31 AM: --- Hi, [~jpountz] , In the case of large numbers of document matched, During the collection process, if the doc id is sequential and ordered from small to large. We only need to record the smallest and largest doc id, and use the method *DocIdSetIterator#range(int minDoc, int maxDoc)* currently provided in the source code to get a *DocIdSetIterator*. During the collection process, if the doc id is not sequential or ordered found , then using old way to create a new big size *FixedBitSet*, and initialize it with a range(*FixedBitSet#**set(int startIndex, int endIndex)**)* by smallest and largest doc id currently collected . was (Author: chrislu): Hi, [~jpountz] , In the case of large numbers of document matched, During the collection process, if the doc id is sequential and ordered from small to large. We only need to record the smallest and largest doc id, and use the method *DocIdSetIterator#range(int minDoc, int maxDoc)* currently provided in the source code to get a DocIdSetIterator. During the collection process, if the doc id is not sequential or ordered found , then using old way to create a new big size FixedBitSet, and initialize it with a range(*FixedBitSet#**set(int startIndex, int endIndex)**)* by smallest and largest doc id currently collected . 
> Lazy initialize FixedBitSet in LRUQueryCache
> --------------------------------------------
>
>                 Key: LUCENE-10120
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10120
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/search
>    Affects Versions: main (9.0)
>            Reporter: Lu Xugang
>            Priority: Major
>         Attachments: 1.png
>
> Based on the implementation of collecting docIds in DocsWithFieldSet, maybe
> we could cache the docIdSet in a similar way in
> *LRUQueryCache#cacheIntoBitSet(BulkScorer scorer, int maxDoc)* when the
> docIdSet is dense.
> This way, we do not always initialize a huge FixedBitSet, which is sometimes
> unnecessary when maxDoc is large.
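The fallback logic described in the comment above can be sketched as a small self-contained model (the real Lucene code would use `FixedBitSet` and `DocIdSetIterator.range`; here `java.util.BitSet` stands in, and the class name `LazyDocIdCollector` is hypothetical): while doc ids arrive contiguously and in increasing order, only `[minDoc, maxDoc]` is tracked; a bit set is allocated lazily, on the first out-of-sequence doc id, and seeded with the range collected so far.

```java
import java.util.BitSet;

public class LazyDocIdCollector {
  private int minDoc = -1;
  private int maxDoc = -1;     // inclusive
  private BitSet bits = null;  // null while the [minDoc, maxDoc] range suffices

  public void collect(int doc) {
    if (bits == null) {
      if (minDoc == -1) {            // first doc id: start the range
        minDoc = maxDoc = doc;
        return;
      }
      if (doc == maxDoc + 1) {       // still contiguous: just extend the range
        maxDoc = doc;
        return;
      }
      // Gap or out-of-order doc id detected: fall back to a bit set, seeding
      // it with the contiguous range collected so far (the role of
      // FixedBitSet#set(startIndex, endIndex) in the proposal).
      bits = new BitSet();
      bits.set(minDoc, maxDoc + 1);
    }
    bits.set(doc);
  }

  /** True while no bit set was needed, i.e. all doc ids were contiguous. */
  public boolean isDenseRange() {
    return bits == null;
  }

  public boolean contains(int doc) {
    return bits != null ? bits.get(doc) : doc >= minDoc && doc <= maxDoc;
  }

  public static void main(String[] args) {
    LazyDocIdCollector c = new LazyDocIdCollector();
    for (int d = 10; d < 100; d++) {
      c.collect(d);                  // contiguous: no allocation happens
    }
    System.out.println(c.isDenseRange()); // prints "true"
    c.collect(200);                  // gap: triggers the bit-set fallback
    System.out.println(c.isDenseRange()); // prints "false"
  }
}
```

The memory saving comes from the dense path: for a query matching one contiguous run of documents, the collector stores two ints instead of a maxDoc-sized bit set.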
[jira] [Commented] (LUCENE-10196) Improve IntroSorter with 3-ways partitioning
[ https://issues.apache.org/jira/browse/LUCENE-10196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17436728#comment-17436728 ]

ASF subversion and git services commented on LUCENE-10196:
----------------------------------------------------------

Commit 63b9e603e6f53dae40ece03814a9aa613f6cc189 in lucene's branch refs/heads/main from Bruno Roustant
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=63b9e60 ]

LUCENE-10196: Improve IntroSorter with 3-ways partitioning.
[jira] [Resolved] (LUCENE-10200) Restructure and modernize the release artifacts
[ https://issues.apache.org/jira/browse/LUCENE-10200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dawid Weiss resolved LUCENE-10200.
----------------------------------
    Fix Version/s: main (9.0)
       Resolution: Fixed

> Restructure and modernize the release artifacts
> -----------------------------------------------
>
>                 Key: LUCENE-10200
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10200
>             Project: Lucene - Core
>          Issue Type: Task
>            Reporter: Dawid Weiss
>            Assignee: Dawid Weiss
>            Priority: Major
>             Fix For: main (9.0)
>
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> This is an umbrella issue for various sub-tasks as per my e-mail [1].
> [1] [https://markmail.org/thread/f7yrggnynq2ijgmy]
> In this order, perhaps:
> * (/) Apply small text file changes (LUCENE-10163)
> * (/) Simplify artifacts (LUCENE-10199, drop ZIP binary)
> * (/) LUCENE-10192: drop third-party JARs
> * -Create an additional binary artifact for Luke (LUCENE-9978).-
> * (-) -Only include relevant binary license/notice files-
> * (/) Make sure the source package can be compiled (no .git folder).
> * (/) Test everything with the smoke tester.
[jira] [Commented] (LUCENE-10200) Restructure and modernize the release artifacts
[ https://issues.apache.org/jira/browse/LUCENE-10200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17436682#comment-17436682 ]

ASF subversion and git services commented on LUCENE-10200:
----------------------------------------------------------

Commit 0544819b789235faecd718a339564e5669847731 in lucene's branch refs/heads/main from Dawid Weiss
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=0544819 ]

LUCENE-10200: store git revision in the release folder and read it back from buildAndPushRelease (#419)
[GitHub] [lucene] dweiss merged pull request #419: LUCENE-10200: store git revision in the release folder and read it back from buildAndPushRelease
dweiss merged pull request #419:
URL: https://github.com/apache/lucene/pull/419

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
[GitHub] [lucene] dweiss commented on pull request #419: LUCENE-10200: store git revision in the release folder and read it back from buildAndPushRelease
dweiss commented on pull request #419:
URL: https://github.com/apache/lucene/pull/419#issuecomment-956034449

I tested it on Linux and it worked fine for me. Applying; we can always correct it if something is not right.