[GitHub] [lucene] jtibshirani commented on pull request #416: LUCENE-10054 Make HnswGraph hierarchical

2021-11-01 Thread GitBox


jtibshirani commented on pull request #416:
URL: https://github.com/apache/lucene/pull/416#issuecomment-957098407


   Got it, it sounds like you already adjusted my set-up to include warm-ups. 
Overall it looks like a positive performance improvement. I'm in favor of 
merging this even though the improvement is relatively small -- I think it's 
good to implement the actual algorithm that we claim to! I also think this sets 
us up well for future performance improvements, by closely comparing to other 
HNSW implementations.
   
   One last thing to check regarding performance: does it have an impact on 
indexing speed?
   
   Reviewing the code with fresh eyes, I found some more parts where I had 
questions. I know 9.0 feature freeze is coming up really soon, maybe we want to 
discuss the timing of this PR?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] jtibshirani commented on a change in pull request #416: LUCENE-10054 Make HnswGraph hierarchical

2021-11-01 Thread GitBox


jtibshirani commented on a change in pull request #416:
URL: https://github.com/apache/lucene/pull/416#discussion_r740706179



##
File path: lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraph.java
##
@@ -99,32 +121,72 @@ public static NeighborQueue search(
   Bits acceptOrds,
   SplittableRandom random)

Review comment:
   I think this `SplittableRandom` is unused.

##
File path: lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraph.java
##
@@ -99,32 +121,72 @@ public static NeighborQueue search(
   Bits acceptOrds,
   SplittableRandom random)
   throws IOException {
+
     int size = graphValues.size();
+    int boundedNumSeed = Math.max(topK, Math.min(numSeed, 2 * size));
+    NeighborQueue results;
+
+    int[] eps = new int[] {graphValues.entryNode()};
+    for (int level = graphValues.numLevels() - 1; level >= 1; level--) {
+      results =
+          HnswGraph.searchLevel(

Review comment:
   Tiny comment, we can remove the `HnswGraph` qualifier.
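   For readers following along, the descent the diff above sketches can be modeled in isolation: greedily walk each upper level toward the query, then hand the resulting entry point to the wider level-0 search. The code below is a self-contained toy (plain adjacency maps stand in for `KnnGraphValues`), a sketch of the algorithm rather than Lucene's implementation.

```java
import java.util.List;
import java.util.Map;
import java.util.function.IntToDoubleFunction;

// Toy model of the HNSW upper-level descent: on each level above 0, hop
// greedily to any strictly closer neighbor until no hop improves, then
// descend. The final node seeds the full beam search on level 0.
class HnswDescentSketch {

  // levels.get(level).get(node) -> neighbor ordinals of node on that level;
  // index 0 is level 0 (not walked here), the last index is the top level.
  static int findLevel0EntryPoint(List<Map<Integer, int[]>> levels, int entryNode,
                                  IntToDoubleFunction distToQuery) {
    int ep = entryNode;
    for (int level = levels.size() - 1; level >= 1; level--) {
      boolean improved = true;
      while (improved) { // greedy: hop to any strictly closer neighbor
        improved = false;
        for (int n : levels.get(level).getOrDefault(ep, new int[0])) {
          if (distToQuery.applyAsDouble(n) < distToQuery.applyAsDouble(ep)) {
            ep = n;
            improved = true;
          }
        }
      }
    }
    return ep; // seed for the full search on level 0
  }
}
```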

##
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90HnswVectorsReader.java
##
@@ -205,6 +215,43 @@ private FieldEntry readField(DataInput input) throws IOException {
 return new FieldEntry(input, similarityFunction);
   }
 
+  private void fillGraphNodesAndOffsetsByLevel() throws IOException {
+for (FieldEntry entry : fields.values()) {
+  IndexInput input =

Review comment:
   If I understand correctly, the graph index file is only used to populate 
these data structures on `FieldEntry` (the graph level information). It feels a 
bit surprising that `FieldEntry` is constructed across two different files 
(both the metadata and graph index files). It also means `FieldEntry` isn't 
immutable.
   
   hmm, I'm not sure what typically belongs in a metadata file. I assume it 
doesn't make sense to move the graph index information there, alongside the ord 
-> doc mapping. Maybe we could keep the file as-is but create a new class to 
hold the graph index, something like `GraphLevels` ?
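   One possible shape for the class suggested here: an immutable holder built once from the graph index file, so `FieldEntry` can stay immutable and be constructed from the metadata file alone. All field names below are guesses for illustration, not Lucene's actual reader internals.

```java
import java.util.Map;

// Hypothetical immutable holder for per-field graph level information,
// sketching the `GraphLevels` idea from the review comment above.
final class GraphLevels {
  final int numLevels;
  final Map<Integer, int[]> nodesByLevel;         // level -> node ordinals on that level
  final Map<Integer, long[]> graphOffsetsByLevel; // level -> per-node offsets in the graph file

  GraphLevels(int numLevels,
              Map<Integer, int[]> nodesByLevel,
              Map<Integer, long[]> graphOffsetsByLevel) {
    this.numLevels = numLevels;
    // Defensive copies keep the holder immutable after construction.
    this.nodesByLevel = Map.copyOf(nodesByLevel);
    this.graphOffsetsByLevel = Map.copyOf(graphOffsetsByLevel);
  }
}
```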

##
File path: lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraph.java
##
@@ -56,31 +59,50 @@
 public final class HnswGraph extends KnnGraphValues {
 
   private final int maxConn;
+  private int numLevels; // the current number of levels in the graph
+  private int entryNode; // the current graph entry node on the top level
 
-  // Each entry lists the top maxConn neighbors of a node. The nodes correspond to vectors added to
-  // HnswBuilder, and the
-  // node values are the ordinals of those vectors.
-  private final List<NeighborArray> graph;
+  // Nodes by level, expressed as level 0 node ordinals.
+  // As level 0 contains all nodes, nodesByLevel.get(0) is null.
+  private final List<int[]> nodesByLevel;
+
+  // graph is a list of graph levels.
+  // Each level is represented as a List<NeighborArray> – the nodes' connections on this level.
+  // Each entry in the list has the top maxConn neighbors of a node. The nodes correspond to vectors
+  // added to HnswBuilder, and the node values are the ordinals of those vectors.
+  // Thus, on all levels, neighbors are expressed as level 0 node ordinals.
+  private final List<List<NeighborArray>> graph;
 
   // KnnGraphValues iterator members
   private int upto;
   private NeighborArray cur;
 
-  HnswGraph(int maxConn) {
-    graph = new ArrayList<>();
-    // Typically with diversity criteria we see nodes not fully occupied; average fanout seems to be
-    // about 1/2 maxConn. There is some indexing time penalty for under-allocating, but saves RAM
-    graph.add(new NeighborArray(Math.max(32, maxConn / 4)));
+  HnswGraph(int maxConn, int levelOfFirstNode) {
     this.maxConn = maxConn;
+    this.numLevels = levelOfFirstNode + 1;
+    this.graph = new ArrayList<>(numLevels);
+    this.entryNode = 0;
+    for (int i = 0; i < numLevels; i++) {
+      graph.add(new ArrayList<>());
+      // Typically with diversity criteria we see nodes not fully occupied;
+      // average fanout seems to be about 1/2 maxConn.
+      // There is some indexing time penalty for under-allocating, but saves RAM
+      graph.get(i).add(new NeighborArray(Math.max(32, maxConn / 4)));

Review comment:
   I am a little confused why we are careful not to overallocate here, but in `addNode` we do `new NeighborArray(maxConn + 1)`? I see you maintained the existing behavior, so it's not critical to address in this PR; I was just curious about it.

##
File path: lucene/core/src/java/org/apache/lucene/index/KnnGraphValues.java
##
@@ -32,25 +33,41 @@
   protected KnnGraphValues() {}
 
   /**
-   * Move the pointer to exactly {@code target}, the id of a node in the graph. After this method
+   * Move the pointer to exactly the given {@code level}'s {@code target}. After this method
    * returns, call {@link #nextNeighbor()} to return successive (ordered) connected node ordinals.
    *
-   * @param target must be a valid node in the

[jira] [Created] (LUCENE-10216) Add concurrency to addIndexes(CodecReader…) API

2021-11-01 Thread Vigya Sharma (Jira)
Vigya Sharma created LUCENE-10216:
-

 Summary: Add concurrency to addIndexes(CodecReader…) API
 Key: LUCENE-10216
 URL: https://issues.apache.org/jira/browse/LUCENE-10216
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/index
Reporter: Vigya Sharma


I work at Amazon Product Search, and we use Lucene to power search for the 
e-commerce platform. I’m working on a project that involves applying 
metadata+ETL transforms and indexing documents on n different _indexing_ boxes, 
combining them into a single index on a separate _reducer_ box, and making it 
available for queries on m different _search_ boxes (replicas). Segments are 
asynchronously copied from indexers to reducers to searchers as they become 
available for the next layer to consume.

I am using the addIndexes API to combine multiple indexes into one on the 
reducer boxes. Since we also have taxonomy data, we need to remap facet field 
ordinals, which means I need to use the {{addIndexes(CodecReader…)}} version of 
this API. The API leverages {{SegmentMerger.merge()}} to create segments with 
new ordinal values while also merging all provided segments in the process.

_This is however a blocking call that runs in a single thread._ Until we have 
written segments with new ordinal values, we cannot copy them to searcher 
boxes, which increases the time to make documents available for search.

I was playing around with the API by creating multiple concurrent merges, each 
with only a single reader, creating a concurrently running 1:1 conversion from 
old segments to new ones (with new ordinal values). We follow this up with 
non-blocking background merges. This lets us copy the segments to searchers and 
replicas as soon as they are available, and later replace them with merged 
segments as background jobs complete. On the Amazon dataset I profiled, this 
gave us around 2.5 to 3x improvement in addIndexes() time. Each call was given 
about 5 readers to add on average.

This might be a useful addition to Lucene. We could create another {{addIndexes()}} API with a {{boolean}} flag for concurrency that internally submits multiple merge jobs (each with a single reader) to the {{ConcurrentMergeScheduler}} and waits for them to complete before returning.

While this is doable from outside Lucene (using your own thread pool to start multiple addIndexes() calls and wait for them to complete), it requires some understanding of what addIndexes does, why you need to wait on the merge, and why it makes sense to pass a single reader to the addIndexes API.

Out-of-the-box support in Lucene could simplify this for folks with a similar use case.
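The fan-out the issue describes can be sketched with plain JDK concurrency. Below, `mergeOne()` is a hypothetical stand-in for a single-reader `addIndexes(CodecReader...)` call and an `ExecutorService` stands in for the `ConcurrentMergeScheduler`; this models the proposed pattern under those assumptions, not Lucene's actual API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Models the proposal: one single-reader merge job per incoming reader,
// all submitted to a pool, followed by a blocking wait for every job.
class ConcurrentAddIndexesSketch {

  // Hypothetical 1:1 rewrite of a segment (e.g. with remapped ordinals).
  static String mergeOne(String reader) {
    return "merged(" + reader + ")";
  }

  // Submit one merge per reader; block until all complete, mirroring the
  // blocking semantics the proposed addIndexes() variant would keep.
  static List<String> addIndexesConcurrently(List<String> readers) {
    ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, readers.size()));
    try {
      List<Future<String>> futures = new ArrayList<>();
      for (String r : readers) {
        futures.add(pool.submit(() -> mergeOne(r)));
      }
      List<String> merged = new ArrayList<>();
      for (Future<String> f : futures) {
        try {
          merged.add(f.get()); // wait for this job to finish
        } catch (Exception e) {
          throw new RuntimeException(e);
        }
      }
      return merged;
    } finally {
      pool.shutdown();
    }
  }
}
```

The per-job outputs become available as each future completes, which is what allows segments to be shipped to searchers before the background merges finish.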



--
This message was sent by Atlassian Jira
(v8.3.4#803005)




[GitHub] [lucene-solr] thelabdude merged pull request #2598: SOLR-12666: Add authn & authz plugins that supports multiple authentication schemes, such as Bearer and Basic

2021-11-01 Thread GitBox


thelabdude merged pull request #2598:
URL: https://github.com/apache/lucene-solr/pull/2598


   





[GitHub] [lucene-solr] thelabdude opened a new pull request #2598: SOLR-12666: Add authn & authz plugins that supports multiple authentication schemes, such as Bearer and Basic

2021-11-01 Thread GitBox


thelabdude opened a new pull request #2598:
URL: https://github.com/apache/lucene-solr/pull/2598


   backport of https://github.com/apache/solr/pull/355





[jira] [Resolved] (LUCENE-10141) Update releaseWizard for 8x to correctly create back-compat indices and update Version in main after repo split

2021-11-01 Thread Timothy Potter (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Timothy Potter resolved LUCENE-10141.
-
Lucene Fields:   (was: New)
   Resolution: Fixed

> Update releaseWizard for 8x to correctly create back-compat indices and 
> update Version in main after repo split
> ---
>
> Key: LUCENE-10141
> URL: https://issues.apache.org/jira/browse/LUCENE-10141
> Project: Lucene - Core
>  Issue Type: Task
>  Components: release wizard
>Reporter: Timothy Potter
>Assignee: Timothy Potter
>Priority: Major
> Fix For: 8.11
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Need to update the release wizard in 8x to create the back-compat indices and 
> update the Version info so that issues like: 
> https://issues.apache.org/jira/browse/LUCENE-10131 don't impact future 8x 
> release managers. Hopefully an 8.11 is NOT needed but release managers have 
> enough on their plate to get right that we should fix this if possible. If 
> not, we at least need to document the process of doing it manually. 






[jira] [Commented] (LUCENE-10141) Update releaseWizard for 8x to correctly create back-compat indices and update Version in main after repo split

2021-11-01 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17437031#comment-17437031
 ] 

ASF subversion and git services commented on LUCENE-10141:
--

Commit 8b3a5899cd1450690aeafb4fcea666cf06646b97 in lucene-solr's branch 
refs/heads/branch_8x from Timothy Potter
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=8b3a589 ]

LUCENE-10141: Add the next minor version on Lucene's main branch in the split 
repo so the backcompat_master task works (#2595)



> Update releaseWizard for 8x to correctly create back-compat indices and 
> update Version in main after repo split
> ---
>
> Key: LUCENE-10141
> URL: https://issues.apache.org/jira/browse/LUCENE-10141
> Project: Lucene - Core
>  Issue Type: Task
>  Components: release wizard
>Reporter: Timothy Potter
>Assignee: Timothy Potter
>Priority: Major
> Fix For: 8.11
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Need to update the release wizard in 8x to create the back-compat indices and 
> update the Version info so that issues like: 
> https://issues.apache.org/jira/browse/LUCENE-10131 don't impact future 8x 
> release managers. Hopefully an 8.11 is NOT needed but release managers have 
> enough on their plate to get right that we should fix this if possible. If 
> not, we at least need to document the process of doing it manually. 






[GitHub] [lucene-solr] thelabdude merged pull request #2595: LUCENE-10141: Add the next minor version on Lucene's main branch in the split repo so the backcompat_master task works

2021-11-01 Thread GitBox


thelabdude merged pull request #2595:
URL: https://github.com/apache/lucene-solr/pull/2595


   





[GitHub] [lucene] uschindler removed a comment on pull request #421: LUCENE-10195: Add gradle cache option and make some tasks cacheable

2021-11-01 Thread GitBox


uschindler removed a comment on pull request #421:
URL: https://github.com/apache/lucene/pull/421#issuecomment-956588955


   Wasn't the idea to make the cache optional? Looks like it is enabled by 
default.





[GitHub] [lucene] uschindler commented on pull request #421: LUCENE-10195: Add gradle cache option and make some tasks cacheable

2021-11-01 Thread GitBox


uschindler commented on pull request #421:
URL: https://github.com/apache/lucene/pull/421#issuecomment-956588955


   Wasn't the idea to make the cache optional? Looks like it is enabled by 
default.





[jira] [Commented] (LUCENE-10195) Add gradle cache option and make some tasks cacheable

2021-11-01 Thread Dawid Weiss (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17437026#comment-17437026
 ] 

Dawid Weiss commented on LUCENE-10195:
--

I've just tried this and it seems to work just fine. I'd appreciate it if you 
could have a second look at the modified PR at 
[https://github.com/apache/lucene/pull/421]; I think it's ready to go.

> Add gradle cache option and make some tasks cacheable
> -
>
> Key: LUCENE-10195
> URL: https://issues.apache.org/jira/browse/LUCENE-10195
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Jerome Prinet
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Increase Gradle build speed with help of Gradle built-in features, mostly 
> cache and up-to-date checks
>  






[jira] [Commented] (LUCENE-10195) Add gradle cache option and make some tasks cacheable

2021-11-01 Thread Dawid Weiss (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17437013#comment-17437013
 ] 

Dawid Weiss commented on LUCENE-10195:
--

Hi Jerome. I reviewed the PR again: I added a changes entry and a non-enabled 
option to the default gradle.properties. I also corrected unnecessary 
exclusions in validateSourcePatterns (.gradle and .idea are at the root level 
only, so they only need to be excluded for the root project).

Finally, after some deliberation, I decided not to include the changes to 
test-related tasks (ecjLint and test). You know our position on the subject of 
caching test results, and the changes you added (while informative about how 
gradle handles such things) add a layer of additional complexity that I don't 
think is needed. Maybe it'll change in the future, who knows; if so, your PR 
stays and can be a source of code/inspiration.

I'd also like to correct the renderJavadoc task: I'd like to keep the map for 
offline links, as it makes it much simpler to configure and maybe add new 
linked javadocs later. Can this map be made cacheable by exposing a converting 
@Input getter method with a list of classes marked @Nested and two fields (a 
string and a file)? So that we can keep using the map but provide more 
semantics to gradle?

> Add gradle cache option and make some tasks cacheable
> -
>
> Key: LUCENE-10195
> URL: https://issues.apache.org/jira/browse/LUCENE-10195
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Jerome Prinet
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Increase Gradle build speed with help of Gradle built-in features, mostly 
> cache and up-to-date checks
>  






[jira] [Commented] (LUCENE-10120) Lazy initialize FixedBitSet in LRUQueryCache

2021-11-01 Thread Greg Miller (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17437012#comment-17437012
 ] 

Greg Miller commented on LUCENE-10120:
--

[~ChrisLu] if I'm understanding this correctly, it sounds like your proposal 
here is to add a new special-case to the LRU caching that optimizes for an 
extremely dense iterator, where all documents within a given min/max range are 
present (and the extreme case of this being where all docs match). Is that a 
correct understanding? If so, it could be interesting to try this out. I think 
we'd only need to modify the {{cacheIntoBitSet}} method to behave similarly to 
{{DocsWithFieldSet}} (as you point out). I don't know if we'll see much impact, 
but I like the idea!

> Lazy initialize FixedBitSet in LRUQueryCache
> 
>
> Key: LUCENE-10120
> URL: https://issues.apache.org/jira/browse/LUCENE-10120
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/search
>Affects Versions: main (9.0)
>Reporter: Lu Xugang
>Priority: Major
> Attachments: 1.png
>
>
> Based on how DocsWithFieldSet collects docIds, maybe we could cache the 
> docIdSet in a similar way in 
> *LRUQueryCache#cacheIntoBitSet(BulkScorer scorer, int maxDoc)* when the 
> docIdSet is dense.
> That way we would not always allocate a huge FixedBitSet, which is sometimes 
> unnecessary when maxDoc is large.
>  
>  
>  






[jira] [Updated] (LUCENE-10195) Add gradle cache option and make some tasks cacheable

2021-11-01 Thread Dawid Weiss (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dawid Weiss updated LUCENE-10195:
-
Summary: Add gradle cache option and make some tasks cacheable  (was: 
Gradle build speed improvement)

> Add gradle cache option and make some tasks cacheable
> -
>
> Key: LUCENE-10195
> URL: https://issues.apache.org/jira/browse/LUCENE-10195
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Jerome Prinet
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Increase Gradle build speed with help of Gradle built-in features, mostly 
> cache and up-to-date checks
>  






[GitHub] [lucene] zhaih opened a new pull request #420: [DRAFT] LUCENE-10122 Explore using NumericDocValue to store taxonomy parent array

2021-11-01 Thread GitBox


zhaih opened a new pull request #420:
URL: https://github.com/apache/lucene/pull/420


   
   
   
   
   # Description
   
   As mentioned in the issue, use NumericDocValues to store parent array 
instead of term positioning.
   
   # TODO
* benchmark
* address the backward compatibility issue
   
   # Checklist
   
   Please review the following and check all that apply:
   
   - [x] I have reviewed the guidelines for [How to 
Contribute](https://wiki.apache.org/lucene/HowToContribute) and my code 
conforms to the standards described there to the best of my ability.
   - [x] I have created a Jira issue and added the issue ID to my pull request 
title.
   - [x] I have given Lucene maintainers 
[access](https://help.github.com/en/articles/allowing-changes-to-a-pull-request-branch-created-from-a-fork)
 to contribute to my PR branch. (optional but recommended)
   - [x] I have developed this patch against the `main` branch.
   - [ ] I have run `./gradlew check`. (Not yet, need to deal with backward 
compatibility issue)
   - [ ] I have added tests for my changes. (Old tests should be sufficient)
   





[jira] [Commented] (LUCENE-10207) Make TermInSetQuery usable with IndexOrDocValuesQuery

2021-11-01 Thread Robert Muir (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17436974#comment-17436974
 ] 

Robert Muir commented on LUCENE-10207:
--

My line of thinking was just that it might be consistent with a disjunction-OR 
of the terms (which is really equivalent in the amount of work done).

> Make TermInSetQuery usable with IndexOrDocValuesQuery
> -
>
> Key: LUCENE-10207
> URL: https://issues.apache.org/jira/browse/LUCENE-10207
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
> Attachments: LUCENE-10207_multitermquery.patch
>
>
> IndexOrDocValuesQuery is very useful to pick the right execution mode for a 
> query depending on other bits of the query tree.
> We would like to be able to use it to optimize execution of TermInSetQuery. 
> However IndexOrDocValuesQuery only works well if the "index" query can give 
> an estimation of the cost of the query without doing anything expensive (like 
> looking up all terms of the TermInSetQuery in the terms dict). Maybe we could 
> implement it for primary keys (terms.size() == sumDocFreq) by returning the 
> number of terms of the query? Another idea is to multiply the number of terms 
> by the average postings length, though this could be dangerous if the field 
> has a zipfian distribution and some terms have a much higher doc frequency 
> than the average.
> [~romseygeek] and I were discussing this a few weeks ago, and more recently 
> [~mikemccand] and [~gsmiller] again independently. So it looks like there is 
> interest in this. Here is an email thread where this was recently discussed: 
> https://lists.apache.org/thread.html/re3b20a486c9a4e66b2ca4a2646e2d3be48535a90cdd95911a8445183%40%3Cdev.lucene.apache.org%3E.






[jira] [Commented] (LUCENE-10207) Make TermInSetQuery usable with IndexOrDocValuesQuery

2021-11-01 Thread Greg Miller (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17436972#comment-17436972
 ] 

Greg Miller commented on LUCENE-10207:
--

[~rcmuir] as a cost heuristic for running the term-based scorer, I agree that 
sumDocFreq() is a better fit than getDocCount(). But, I thought that 
{{ScorerSupplier#cost()}} was meant to estimate the number of docs the scorer 
would produce if leading the iteration. Am I misunderstanding that? Thanks!
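The two estimates under discussion can be written out as a small model: for primary-key-like fields (where terms.size() == sumDocFreq) the cost is just the number of query terms; otherwise, multiply the query terms by the field's average postings length. Method names below are illustrative, not Lucene API.

```java
// Sketch of the cost heuristics from this thread, under the assumption that
// the caller already has the field-level statistics at hand.
class TermInSetCostSketch {

  static boolean looksLikePrimaryKey(long fieldTermCount, long sumDocFreq) {
    return fieldTermCount == sumDocFreq; // every term occurs in exactly one doc
  }

  static long estimateCost(long numQueryTerms, long fieldTermCount, long sumDocFreq) {
    if (looksLikePrimaryKey(fieldTermCount, sumDocFreq)) {
      return numQueryTerms; // at most one hit per query term
    }
    // Average-postings heuristic; as noted in the issue, this can badly
    // underestimate on zipfian fields where a few terms dominate.
    long avgPostings = Math.max(1L, sumDocFreq / Math.max(1L, fieldTermCount));
    return numQueryTerms * avgPostings;
  }
}
```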

> Make TermInSetQuery usable with IndexOrDocValuesQuery
> -
>
> Key: LUCENE-10207
> URL: https://issues.apache.org/jira/browse/LUCENE-10207
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
> Attachments: LUCENE-10207_multitermquery.patch
>
>
> IndexOrDocValuesQuery is very useful to pick the right execution mode for a 
> query depending on other bits of the query tree.
> We would like to be able to use it to optimize execution of TermInSetQuery. 
> However IndexOrDocValuesQuery only works well if the "index" query can give 
> an estimation of the cost of the query without doing anything expensive (like 
> looking up all terms of the TermInSetQuery in the terms dict). Maybe we could 
> implement it for primary keys (terms.size() == sumDocFreq) by returning the 
> number of terms of the query? Another idea is to multiply the number of terms 
> by the average postings length, though this could be dangerous if the field 
> has a zipfian distribution and some terms have a much higher doc frequency 
> than the average.
> [~romseygeek] and I were discussing this a few weeks ago, and more recently 
> [~mikemccand] and [~gsmiller] again independently. So it looks like there is 
> interest in this. Here is an email thread where this was recently discussed: 
> https://lists.apache.org/thread.html/re3b20a486c9a4e66b2ca4a2646e2d3be48535a90cdd95911a8445183%40%3Cdev.lucene.apache.org%3E.






[jira] [Resolved] (LUCENE-10196) Improve IntroSorter with 3-ways partitioning

2021-11-01 Thread Bruno Roustant (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruno Roustant resolved LUCENE-10196.
-
Fix Version/s: 8.11
   Resolution: Fixed

Thanks reviewers!

> Improve IntroSorter with 3-ways partitioning
> 
>
> Key: LUCENE-10196
> URL: https://issues.apache.org/jira/browse/LUCENE-10196
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Bruno Roustant
>Priority: Major
> Fix For: 8.11
>
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> I added a SorterBenchmark to evaluate the performance of the various Sorter 
> implementations depending on the strategies defined in BaseSortTestCase 
> (random, random-low-cardinality, ascending, descending, etc).
> By changing the implementation of the IntroSorter to use a 3-ways 
> partitioning, we can gain a significant performance improvement when sorting 
> low-cardinality lists, and with additional changes we can also improve the 
> performance for all the strategies.
> Proposed changes:
>  - Sort small ranges with insertion sort (instead of binary sort).
>  - Select the quick sort pivot with medians.
>  - Partition with the fast Bentley-McIlroy 3-ways partitioning algorithm.
>  - Replace the tail recursion by a loop.
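The central idea in the list above, 3-way partitioning, groups keys equal to the pivot with the pivot itself so they are excluded from both recursive calls, which is what makes low-cardinality inputs cheap. Below is the simple Dutch-national-flag variant with the tail recursion replaced by a loop; it is a sketch of the technique, not Lucene's Bentley-McIlroy IntroSorter.

```java
// 3-way quicksort: partitions into [< pivot][== pivot][> pivot] and only
// recurses into the strict bands, so runs of equal keys are sorted once.
class ThreeWaySortSketch {

  static void sort(int[] a, int lo, int hi) {
    while (lo < hi) { // loop on one side replaces the tail recursion
      int pivot = a[lo];
      int lt = lo, gt = hi, i = lo + 1;
      while (i <= gt) {
        if (a[i] < pivot) swap(a, lt++, i++);
        else if (a[i] > pivot) swap(a, i, gt--);
        else i++; // equal to pivot: leave in the middle band
      }
      sort(a, lo, lt - 1); // recurse on the < band
      lo = gt + 1;         // iterate on the > band
    }
  }

  private static void swap(int[] a, int i, int j) {
    int t = a[i];
    a[i] = a[j];
    a[j] = t;
  }
}
```

The full Bentley-McIlroy scheme additionally parks equal keys at the array ends during partitioning and swaps them into the middle afterwards, which avoids the extra swaps this simple variant performs.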






[GitHub] [lucene] bruno-roustant closed pull request #404: LUCENE-10196: Improve IntroSorter with 3-ways partitioning.

2021-11-01 Thread GitBox


bruno-roustant closed pull request #404:
URL: https://github.com/apache/lucene/pull/404


   





[jira] [Commented] (LUCENE-10196) Improve IntroSorter with 3-ways partitioning

2021-11-01 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17436911#comment-17436911
 ] 

ASF subversion and git services commented on LUCENE-10196:
--

Commit e0ce294868c17cbdbf54a344bbb7a94d7a62d7ff in lucene-solr's branch 
refs/heads/branch_8x from Bruno Roustant
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=e0ce294 ]

LUCENE-10196: Improve IntroSorter with 3-ways partitioning.


> Improve IntroSorter with 3-ways partitioning
> 
>
> Key: LUCENE-10196
> URL: https://issues.apache.org/jira/browse/LUCENE-10196
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Bruno Roustant
>Priority: Major
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> I added a SorterBenchmark to evaluate the performance of the various Sorter 
> implementations under the strategies defined in BaseSortTestCase (random, 
> random-low-cardinality, ascending, descending, etc.).
> By changing the implementation of the IntroSorter to use 3-way partitioning, 
> we gain a significant performance improvement when sorting low-cardinality 
> lists, and with additional changes we can also improve the performance for 
> all the strategies.
> Proposed changes:
>  - Sort small ranges with insertion sort (instead of binary sort).
>  - Select the quicksort pivot with medians.
>  - Partition with the fast Bentley-McIlroy 3-way partitioning algorithm.
>  - Replace the tail recursion with a loop.
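The proposed changes above can be sketched as a toy sort. This is an illustration in the spirit of the proposal, not Lucene's actual IntroSorter code: the class name, the threshold value, and the use of a Dijkstra-style 3-way partition (rather than the exact Bentley-McIlroy variant) are assumptions made for the sketch.

```java
/**
 * Toy sketch of the proposed changes: insertion sort for small ranges,
 * a median-of-three pivot, 3-way partitioning, and a loop in place of
 * the tail recursion. All names and constants are illustrative.
 */
class ThreeWaySorter {
  static final int INSERTION_SORT_THRESHOLD = 16;

  /** Sorts a[from, to) in ascending order. */
  static void sort(int[] a, int from, int to) {
    while (to - from > INSERTION_SORT_THRESHOLD) {
      // Select the pivot as the median of first, middle, and last elements.
      int mid = (from + to) >>> 1;
      int pivot = median(a[from], a[mid], a[to - 1]);
      // 3-way partition: [from, lt) < pivot, [lt, gt] == pivot, (gt, to) > pivot.
      int lt = from, gt = to - 1, i = from;
      while (i <= gt) {
        if (a[i] < pivot) swap(a, lt++, i++);
        else if (a[i] > pivot) swap(a, i, gt--);
        else i++;
      }
      // Recurse on the smaller side, loop on the larger one
      // (replaces the tail recursion and bounds the stack depth).
      if (lt - from < to - gt - 1) {
        sort(a, from, lt);
        from = gt + 1;
      } else {
        sort(a, gt + 1, to);
        to = lt;
      }
    }
    insertionSort(a, from, to);
  }

  static void insertionSort(int[] a, int from, int to) {
    for (int i = from + 1; i < to; i++) {
      for (int j = i; j > from && a[j - 1] > a[j]; j--) swap(a, j - 1, j);
    }
  }

  static int median(int x, int y, int z) {
    return Math.max(Math.min(x, y), Math.min(Math.max(x, y), z));
  }

  static void swap(int[] a, int i, int j) {
    int tmp = a[i]; a[i] = a[j]; a[j] = tmp;
  }
}
```

On low-cardinality input the equal-to-pivot band absorbs many elements in a single pass, which is where the proposal expects the largest win.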



--
This message was sent by Atlassian Jira
(v8.3.4#803005)




[jira] [Comment Edited] (LUCENE-10120) Lazy initialize FixedBitSet in LRUQueryCache

2021-11-01 Thread Lu Xugang (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17436737#comment-17436737
 ] 

Lu Xugang edited comment on LUCENE-10120 at 11/1/21, 10:31 AM:
---

Hi, [~jpountz], in the case where a large number of documents match: if, during 
collection, the doc ids are sequential and ordered from smallest to largest, we 
only need to record the smallest and largest doc id and use the method 
*DocIdSetIterator#range(int minDoc, int maxDoc)* already provided in the source 
code to get a *DocIdSetIterator*.

If, during collection, a doc id turns out to be non-sequential or out of order, 
we fall back to the old way: create a new, large *FixedBitSet* and initialize 
it with a range (*FixedBitSet#set(int startIndex, int endIndex)*) covering the 
smallest and largest doc ids collected so far.
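A minimal sketch of that lazy-initialization idea, assuming a collector-style API: `java.util.BitSet` stands in for Lucene's FixedBitSet so the sketch is self-contained, and all class and method names are illustrative rather than Lucene's actual code.

```java
import java.util.BitSet;

/**
 * Sketch: while the collected doc ids stay strictly sequential, only the
 * min/max are tracked (which would map to DocIdSetIterator#range); the bit
 * set is allocated only when a gap or out-of-order doc id appears.
 */
class LazyDocIdCollector {
  private final int maxDoc;
  private int minDocId = -1;
  private int maxDocId = -1;
  private BitSet bits; // null while the collected docs form a dense range

  LazyDocIdCollector(int maxDoc) { this.maxDoc = maxDoc; }

  void collect(int doc) {
    if (bits == null) {
      if (minDocId == -1) {       // first doc: just start the range
        minDocId = maxDocId = doc;
        return;
      }
      if (doc == maxDocId + 1) {  // still sequential: extend the range
        maxDocId = doc;
        return;
      }
      // Gap or out-of-order doc: fall back to a real bit set and bulk-set
      // the range collected so far (the FixedBitSet#set(start, end) step).
      bits = new BitSet(maxDoc);
      bits.set(minDocId, maxDocId + 1);
    }
    bits.set(doc);
  }

  boolean isDenseRange() { return bits == null; }

  boolean contains(int doc) {
    return bits == null ? (doc >= minDocId && doc <= maxDocId) : bits.get(doc);
  }
}
```

The point of the design is that the large allocation is deferred until it is actually needed, so purely sequential matches never pay for a bit set sized to maxDoc.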


was (Author: chrislu):
Hi, [~jpountz], in the case where a large number of documents match: if, during 
collection, the doc ids are sequential and ordered from smallest to largest, we 
only need to record the smallest and largest doc id and use the method 
*DocIdSetIterator#range(int minDoc, int maxDoc)* already provided in the source 
code to get a DocIdSetIterator.

If, during collection, a doc id turns out to be non-sequential or out of order, 
we fall back to the old way: create a new, large FixedBitSet and initialize it 
with a range (*FixedBitSet#set(int startIndex, int endIndex)*) covering the 
smallest and largest doc ids collected so far.

> Lazy initialize FixedBitSet in LRUQueryCache
> 
>
> Key: LUCENE-10120
> URL: https://issues.apache.org/jira/browse/LUCENE-10120
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/search
>Affects Versions: main (9.0)
>Reporter: Lu Xugang
>Priority: Major
> Attachments: 1.png
>
>
> Based on the way DocsWithFieldSet collects docIds, maybe we could cache the 
> docIdSet in a similar way in 
> *LRUQueryCache#cacheIntoBitSet(BulkScorer scorer, int maxDoc)* when the 
> docIdSet is dense.
> That way we would not always initialize a huge FixedBitSet, which is 
> sometimes unnecessary when maxDoc is large.
>  
>  
>  






[jira] [Commented] (LUCENE-10120) Lazy initialize FixedBitSet in LRUQueryCache

2021-11-01 Thread Lu Xugang (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17436737#comment-17436737
 ] 

Lu Xugang commented on LUCENE-10120:


Hi, [~jpountz], in the case where a large number of documents match: if, during 
collection, the doc ids are sequential and ordered from smallest to largest, we 
only need to record the smallest and largest doc id and use the method 
*DocIdSetIterator#range(int minDoc, int maxDoc)* already provided in the source 
code to get a DocIdSetIterator.

If, during collection, a doc id turns out to be non-sequential or out of order, 
we fall back to the old way: create a new, large FixedBitSet and initialize it 
with a range (*FixedBitSet#set(int startIndex, int endIndex)*) covering the 
smallest and largest doc ids collected so far.

> Lazy initialize FixedBitSet in LRUQueryCache
> 
>
> Key: LUCENE-10120
> URL: https://issues.apache.org/jira/browse/LUCENE-10120
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/search
>Affects Versions: main (9.0)
>Reporter: Lu Xugang
>Priority: Major
> Attachments: 1.png
>
>
> Based on the way DocsWithFieldSet collects docIds, maybe we could cache the 
> docIdSet in a similar way in 
> *LRUQueryCache#cacheIntoBitSet(BulkScorer scorer, int maxDoc)* when the 
> docIdSet is dense.
> That way we would not always initialize a huge FixedBitSet, which is 
> sometimes unnecessary when maxDoc is large.
>  
>  
>  






[jira] [Commented] (LUCENE-10196) Improve IntroSorter with 3-ways partitioning

2021-11-01 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17436728#comment-17436728
 ] 

ASF subversion and git services commented on LUCENE-10196:
--

Commit 63b9e603e6f53dae40ece03814a9aa613f6cc189 in lucene's branch 
refs/heads/main from Bruno Roustant
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=63b9e60 ]

LUCENE-10196: Improve IntroSorter with 3-ways partitioning.


> Improve IntroSorter with 3-ways partitioning
> 
>
> Key: LUCENE-10196
> URL: https://issues.apache.org/jira/browse/LUCENE-10196
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Bruno Roustant
>Priority: Major
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> I added a SorterBenchmark to evaluate the performance of the various Sorter 
> implementations under the strategies defined in BaseSortTestCase (random, 
> random-low-cardinality, ascending, descending, etc.).
> By changing the implementation of the IntroSorter to use 3-way partitioning, 
> we gain a significant performance improvement when sorting low-cardinality 
> lists, and with additional changes we can also improve the performance for 
> all the strategies.
> Proposed changes:
>  - Sort small ranges with insertion sort (instead of binary sort).
>  - Select the quicksort pivot with medians.
>  - Partition with the fast Bentley-McIlroy 3-way partitioning algorithm.
>  - Replace the tail recursion with a loop.






[jira] [Resolved] (LUCENE-10200) Restructure and modernize the release artifacts

2021-11-01 Thread Dawid Weiss (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dawid Weiss resolved LUCENE-10200.
--
Fix Version/s: main (9.0)
   Resolution: Fixed

> Restructure and modernize the release artifacts
> ---
>
> Key: LUCENE-10200
> URL: https://issues.apache.org/jira/browse/LUCENE-10200
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Dawid Weiss
>Assignee: Dawid Weiss
>Priority: Major
> Fix For: main (9.0)
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> This is an umbrella issue for various sub-tasks as per my e-mail [1].
>  [1] [https://markmail.org/thread/f7yrggnynq2ijgmy]
> In this order, perhaps:
>  * (/) Apply small text file changes (LUCENE-10163)
>  * (/) Simplify artifacts (LUCENE-10199 drop ZIP binary),
>  * (/) LUCENE-10192 drop third party JARs.
>  * -Create an additional binary artifact for Luke (LUCENE-9978).-
>  * (-) -only include relevant binary license/ notice files-
>  * (/) make sure source package can be compiled (no .git folder).
>  * (/) Test everything with the smoke tester.






[jira] [Commented] (LUCENE-10200) Restructure and modernize the release artifacts

2021-11-01 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17436682#comment-17436682
 ] 

ASF subversion and git services commented on LUCENE-10200:
--

Commit 0544819b789235faecd718a339564e5669847731 in lucene's branch 
refs/heads/main from Dawid Weiss
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=0544819 ]

LUCENE-10200: store git revision in the release folder and read it back from 
buildAndPushRelease (#419)



> Restructure and modernize the release artifacts
> ---
>
> Key: LUCENE-10200
> URL: https://issues.apache.org/jira/browse/LUCENE-10200
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Dawid Weiss
>Assignee: Dawid Weiss
>Priority: Major
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> This is an umbrella issue for various sub-tasks as per my e-mail [1].
>  [1] [https://markmail.org/thread/f7yrggnynq2ijgmy]
> In this order, perhaps:
>  * (/) Apply small text file changes (LUCENE-10163)
>  * (/) Simplify artifacts (LUCENE-10199 drop ZIP binary),
>  * (/) LUCENE-10192 drop third party JARs.
>  * -Create an additional binary artifact for Luke (LUCENE-9978).-
>  * (-) -only include relevant binary license/ notice files-
>  * (/) make sure source package can be compiled (no .git folder).
>  * (/) Test everything with the smoke tester.






[GitHub] [lucene] dweiss merged pull request #419: LUCENE-10200: store git revision in the release folder and read it back from buildAndPushRelease

2021-11-01 Thread GitBox


dweiss merged pull request #419:
URL: https://github.com/apache/lucene/pull/419


   





[GitHub] [lucene] dweiss commented on pull request #419: LUCENE-10200: store git revision in the release folder and read it back from buildAndPushRelease

2021-11-01 Thread GitBox


dweiss commented on pull request #419:
URL: https://github.com/apache/lucene/pull/419#issuecomment-956034449


   I tested it on Linux and it worked fine for me. Applying - we can always 
correct if something is not right.

