[jira] [Commented] (LUCENE-8675) Divide Segment Search Amongst Multiple Threads

2022-08-04 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17575251#comment-17575251
 ] 

Adrien Grand commented on LUCENE-8675:
--

I wonder if we could avoid paying the cost of Scorer/BulkScorer initialization 
multiple times by implementing Cloneable on these classes, similarly to how we 
use cloning on IndexInputs to consume them from multiple threads. It would 
require implementing Cloneable on a few other classes, e.g. PostingsEnum, and 
maybe we'd need to set some restrictions to keep this feature reasonable, e.g. 
it's only legal to clone when the current doc ID is -1. But this could help 
parallelize collecting a single segment by assigning each clone its own range 
of doc IDs.

A downside of this approach is that it wouldn't help parallelize the 
initialization of Scorers, but I don't know if there is a way around it.
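For discussion, here is a rough sketch of what collecting a doc ID range from a
clone could look like. This is hypothetical: Scorer does not implement Cloneable
today, and the "clone only at docID == -1" restriction is the one suggested above.

{code:java}
import java.io.IOException;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.search.LeafCollector;
import org.apache.lucene.search.Scorer;

class ScorerSlice {
  /** Each thread receives its own clone (taken while the scorer is still
   *  positioned at doc -1) and only collects its [minDoc, maxDoc) range. */
  static void collectRange(Scorer clone, LeafCollector collector, int minDoc, int maxDoc)
      throws IOException {
    collector.setScorer(clone);
    DocIdSetIterator it = clone.iterator();
    for (int doc = it.advance(minDoc); doc < maxDoc; doc = it.nextDoc()) {
      collector.collect(doc);
    }
  }
}
{code}

The Scorer would still be initialized once on a single thread before the clones
fan out, which is the downside mentioned above.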

> Divide Segment Search Amongst Multiple Threads
> --
>
> Key: LUCENE-8675
> URL: https://issues.apache.org/jira/browse/LUCENE-8675
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/search
>Reporter: Atri Sharma
>Priority: Major
> Attachments: PhraseHighFreqP50.png, PhraseHighFreqP90.png, 
> TermHighFreqP50.png, TermHighFreqP90.png
>
>
> Segment search is a single threaded operation today, which can be a 
> bottleneck for large analytical queries which index a lot of data and have 
> complex queries which touch multiple segments (imagine a composite query with 
> a range query and filters on top). This ticket is for discussing the idea of 
> splitting a single segment into multiple threads based on mutually exclusive 
> document ID ranges.
> This will be a two-phase effort, the first phase targeting queries returning 
> all matching documents (collectors not terminating early). The second-phase 
> patch will introduce staged execution and will build on top of this patch.






[jira] [Created] (LUCENE-10672) Re-evaluate different ways to encode postings

2022-08-03 Thread Adrien Grand (Jira)
Adrien Grand created LUCENE-10672:
-

 Summary: Re-evaluate different ways to encode postings
 Key: LUCENE-10672
 URL: https://issues.apache.org/jira/browse/LUCENE-10672
 Project: Lucene - Core
  Issue Type: Task
Reporter: Adrien Grand


In Lucene 4, we moved to FOR to encode postings because it would give better 
throughput compared to the VInts that we had been using until then. This was a time 
when Lucene would often need to evaluate entire postings lists, and 
optimizations like BS1 were very important for good performance.

Nowadays, Lucene performs more dynamic pruning and it's less frequent that 
Lucene needs to evaluate all hits that match a query. So the performance of 
{{nextDoc()}} has become a bit less relevant while the performance of 
{{advance(target)}} has become more relevant.

I wonder if we should re-evaluate other ways to encode postings that are 
theoretically better at skipping, such as Elias-Fano coding, since they support 
skipping directly on the encoded representation instead of requiring decoding a 
full block of integers where only a couple of them would be relevant.
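For readers unfamiliar with the trick, below is a toy sketch (not Lucene code,
and only one of several possible layouts) of how Elias-Fano supports
advance(target) directly on the encoded form. A real implementation would jump
via sampled select positions instead of scanning every bit:

{code:java}
/**
 * Toy Elias-Fano encoding of a monotone sequence, only to illustrate the
 * property mentioned above: advance(target) walks the unary upper-bits
 * stream and never decodes a full block of integers.
 */
final class EliasFanoExample {
  private final int l;        // low bits per value: floor(log2(u/n))
  private final long[] upper; // unary upper bits: for value i, bit (v_i >>> l) + i is set
  private final long[] low;   // low parts, kept unpacked here for readability
  private final int n;

  EliasFanoExample(long[] sorted, long u) { // ascending values in [0, u)
    n = sorted.length;
    long perValue = Math.max(1, u / Math.max(1, n));
    l = 63 - Long.numberOfLeadingZeros(perValue);
    low = new long[n];
    long upperLen = n + (u >>> l) + 1;
    upper = new long[(int) ((upperLen + 63) >>> 6)];
    for (int i = 0; i < n; i++) {
      low[i] = sorted[i] & ((1L << l) - 1);
      long pos = (sorted[i] >>> l) + i; // i ones and high(v_i) zeros precede
      upper[(int) (pos >>> 6)] |= 1L << (pos & 63);
    }
  }

  /** Returns the first value >= target, or -1 if none. */
  long advance(long target) {
    long targetHigh = target >>> l;
    long high = 0;
    int i = 0;
    for (long pos = 0; i < n; pos++) {
      if ((upper[(int) (pos >>> 6)] & (1L << (pos & 63))) == 0) {
        high++; // a zero: the high part increases by one
      } else {  // a one: value i has this high part
        if (high >= targetHigh) {
          long v = (high << l) | low[i];
          if (v >= target) {
            return v;
          }
        }
        i++;
      }
    }
    return -1;
  }

  public static void main(String[] args) {
    EliasFanoExample ef = new EliasFanoExample(new long[] {3, 4, 17, 90}, 128);
    System.out.println(ef.advance(10)); // 17, found without block decoding
  }
}
{code}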






[jira] [Resolved] (LUCENE-10627) Using ByteBuffersDataInput reduce memory copy on compressing data

2022-08-01 Thread Adrien Grand (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand resolved LUCENE-10627.
---
Fix Version/s: 9.4
   Resolution: Fixed

> Using ByteBuffersDataInput reduce memory copy on compressing data
> -
>
> Key: LUCENE-10627
> URL: https://issues.apache.org/jira/browse/LUCENE-10627
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs, core/store
>Reporter: LuYunCheng
>Priority: Major
> Fix For: 9.4
>
>  Time Spent: 4.5h
>  Remaining Estimate: 0h
>
> Code: [https://github.com/apache/lucene/pull/987]
> I see that when Lucene does flush and merge of stored fields, it needs many memory copies:
> {code:java}
> Lucene Merge Thread #25940]" #906546 daemon prio=5 os_prio=0 cpu=20503.95ms 
> elapsed=68.76s tid=0x7ee990002c50 nid=0x3aac54 runnable  
> [0x7f17718db000]
>    java.lang.Thread.State: RUNNABLE
>     at 
> org.apache.lucene.store.ByteBuffersDataOutput.toArrayCopy(ByteBuffersDataOutput.java:271)
>     at 
> org.apache.lucene.codecs.compressing.CompressingStoredFieldsWriter.flush(CompressingStoredFieldsWriter.java:239)
>     at 
> org.apache.lucene.codecs.compressing.CompressingStoredFieldsWriter.finishDocument(CompressingStoredFieldsWriter.java:169)
>     at 
> org.apache.lucene.codecs.compressing.CompressingStoredFieldsWriter.merge(CompressingStoredFieldsWriter.java:654)
>     at 
> org.apache.lucene.index.SegmentMerger.mergeFields(SegmentMerger.java:228)
>     at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:105)
>     at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4760)
>     at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:4364)
>     at 
> org.apache.lucene.index.IndexWriter$IndexWriterMergeSource.merge(IndexWriter.java:5923)
>     at 
> org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:624)
>     at 
> org.elasticsearch.index.engine.ElasticsearchConcurrentMergeScheduler.doMerge(ElasticsearchConcurrentMergeScheduler.java:100)
>     at 
> org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:682)
>  {code}
> When Lucene *CompressingStoredFieldsWriter* flushes documents, it needs many 
> memory copies:
> With Lucene90 using {*}LZ4WithPresetDictCompressionMode{*}:
>  # bufferedDocs.toArrayCopy copies blocks into one continuous buffer for chunk 
> compression
>  # the compressor copies dict and data into one block buffer
>  # compress
>  # copy the compressed data out
> With Lucene90 using {*}DeflateWithPresetDictCompressionMode{*}:
>  # bufferedDocs.toArrayCopy copies blocks into one continuous buffer for chunk 
> compression
>  # compress
>  # copy the compressed data out
>  
> I think we can use -CompositeByteBuf- to reduce temporary memory copies:
>  # we do not have to *bufferedDocs.toArrayCopy* when we just need continuous 
> content for chunk compression
>  
> I wrote a simple mini benchmark in test code ([link 
> |https://github.com/apache/lucene/blob/5a406a5c483c7fadaf0e8a5f06732c79ad174d11/lucene/core/src/test/org/apache/lucene/codecs/lucene90/compressing/TestCompressingStoredFieldsFormat.java#L353]):
> *LZ4WithPresetDict run* Capacity:41943040(bytes) , iter 10times: Origin 
> elapse:5391ms , New elapse:5297ms
> *DeflateWithPresetDict run* Capacity:41943040(bytes), iter 10times: Origin 
> elapse:{*}115ms{*}, New elapse:{*}12ms{*}
>  
> And I ran runStoredFieldsBenchmark with doc_limit=-1, which 
> shows:
> ||Msec to index||BEST_SPEED ||BEST_COMPRESSION||
> |Baseline|318877.00|606288.00|
> |Candidate|314442.00|604719.00|
>  
> --UPDATE--
>  
>  I tried to *reuse ByteBuffersDataInput* to reduce memory copies, because it 
> can be obtained from ByteBuffersDataOutput.toDataInput, and it could reduce this 
> complexity ([PR|https://github.com/apache/lucene/pull/987]).
> BUT I am not sure whether we can change the Compressor interface's compress 
> input param from byte[] to ByteBuffersDataInput. Changing this interface 
> [like this|https://github.com/apache/lucene/blob/382962f22df3ee3af3fb538b877c98d61a622ddb/lucene/core/src/java/org/apache/lucene/codecs/compressing/Compressor.java#L35]
>  increases the backport code 
> [like this|https://github.com/apache/lucene/blob/382962f22df3ee3af3fb538b877c98d61a622ddb/lucene/core/src/java/org/apache/lucene/codecs/compressing/CompressionMode.java#L274];
>  however, if we change the interface to ByteBuffersDataInput, we can 
> optimize memory copies in the different compression algorithm implementations.
> Also, I found we can reduce more memory copies in 
> {{CompressingStoredFieldsWriter.copyOneDoc
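For context, a sketch of the interface change being discussed (signatures
reconstructed from the description above; the linked PR is authoritative):

{code:java}
import java.io.Closeable;
import java.io.IOException;
import org.apache.lucene.store.ByteBuffersDataInput;
import org.apache.lucene.store.DataOutput;

public abstract class Compressor implements Closeable {
  // Before: callers must first materialize one contiguous byte[], which is
  // where ByteBuffersDataOutput.toArrayCopy shows up in the profile above.
  // public abstract void compress(byte[] bytes, int off, int len, DataOutput out)
  //     throws IOException;

  // After: the compressor pulls from the buffered blocks directly, obtained
  // via ByteBuffersDataOutput#toDataInput, so no contiguous copy is needed.
  public abstract void compress(ByteBuffersDataInput buffersInput, DataOutput out)
      throws IOException;
}
{code}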

[jira] [Commented] (LUCENE-10629) Add fastMatchQuery param to MatchingFacetSetCounts

2022-08-01 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17573757#comment-17573757
 ] 

Adrien Grand commented on LUCENE-10629:
---

Sure thing, it was an easy fix!

> Add fastMatchQuery param to MatchingFacetSetCounts
> --
>
> Key: LUCENE-10629
> URL: https://issues.apache.org/jira/browse/LUCENE-10629
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Marc D'Mello
>Assignee: Shai Erera
>Priority: Minor
> Fix For: 9.4
>
>  Time Spent: 2h 50m
>  Remaining Estimate: 0h
>
> Some facet counters, like {{RangeFacetCounts}}, allow the user to pass in a 
> {{fastMatchQuery}} parameter in order to quickly and efficiently filter out 
> documents in the passed-in match set. We should create this same parameter in 
> {{MatchingFacetSetCounts}} as well.
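For reference, this is roughly how the existing range faceting exposes the
parameter (constructor shape quoted from memory, so treat it as illustrative);
the proposal is to mirror this in {{MatchingFacetSetCounts}}:

{code:java}
import org.apache.lucene.document.LongPoint;
import org.apache.lucene.facet.Facets;
import org.apache.lucene.facet.FacetsCollector;
import org.apache.lucene.facet.range.LongRange;
import org.apache.lucene.facet.range.LongRangeFacetCounts;
import org.apache.lucene.search.Query;

FacetsCollector fc = ...; // hits collected during the search
// The fastMatchQuery cheaply pre-filters the match set before counting.
Query fastMatch = LongPoint.newRangeQuery("timestamp", 0L, 1000L);
Facets facets = new LongRangeFacetCounts("timestamp", fc, fastMatch,
    new LongRange("first half", 0L, true, 500L, false),
    new LongRange("second half", 500L, true, 1000L, false));
{code}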






[jira] [Resolved] (LUCENE-10633) Dynamic pruning for queries sorted by SORTED(_SET) field

2022-07-29 Thread Adrien Grand (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand resolved LUCENE-10633.
---
Fix Version/s: 9.4
   Resolution: Fixed

> Dynamic pruning for queries sorted by SORTED(_SET) field
> 
>
> Key: LUCENE-10633
> URL: https://issues.apache.org/jira/browse/LUCENE-10633
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
> Fix For: 9.4
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> LUCENE-9280 introduced the ability to dynamically prune non-competitive hits 
> when sorting by a numeric field, by leveraging the points index to skip 
> documents that do not compare better than the top of the priority queue 
> maintained by the field comparator.
> However queries sorted by a SORTED(_SET) field still look at all hits, which 
> is disappointing. Could we leverage the terms index to skip hits?






[jira] [Commented] (LUCENE-10544) Should ExitableTermsEnum wrap postings and impacts?

2022-07-29 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17572859#comment-17572859
 ] 

Adrien Grand commented on LUCENE-10544:
---

I guess my only concern about this is that there are likely users who rely on 
the fact that ExitableDirectoryReader doesn't wrap postings in order to get 
good performance for their queries (by not adding a wrapper on every postings 
list) while still enabling timeouts via a collector or BulkScorer approach 
(Elasticsearch, for example, is in this situation).

I would suggest that we wait for IndexSearcher's timeout support to be more 
complete (LUCENE-10641) before doing this, so that users never have to wrap 
with ExitableDirectoryReader themselves and can instead rely fully on 
IndexSearcher doing the right thing.
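For reference, the kind of wrapper being discussed would look roughly like the
sketch below (illustrative only; the check frequency and the exception type are
made-up details, and this is exactly the per-postings overhead mentioned above):

{code:java}
import java.io.IOException;
import org.apache.lucene.index.FilterLeafReader;
import org.apache.lucene.index.PostingsEnum;
import org.apache.lucene.index.QueryTimeout;

class ExitablePostingsEnum extends FilterLeafReader.FilterPostingsEnum {
  private final QueryTimeout timeout;
  private int calls;

  ExitablePostingsEnum(PostingsEnum in, QueryTimeout timeout) {
    super(in);
    this.timeout = timeout;
  }

  private void checkTimeout() {
    // Only check every 64 calls to keep the overhead of the wrapper low.
    if ((++calls & 0x3F) == 0 && timeout.shouldExit()) {
      throw new RuntimeException("Query timed out while iterating postings");
    }
  }

  @Override
  public int nextDoc() throws IOException {
    checkTimeout();
    return super.nextDoc();
  }

  @Override
  public int advance(int target) throws IOException {
    checkTimeout();
    return super.advance(target);
  }
}
{code}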

> Should ExitableTermsEnum wrap postings and impacts?
> ---
>
> Key: LUCENE-10544
> URL: https://issues.apache.org/jira/browse/LUCENE-10544
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/index
>Reporter: Greg Miller
>Priority: Major
>
> While looking into options for LUCENE-10151, I noticed that 
> {{ExitableDirectoryReader}} doesn't actually do any timeout checking once you 
> start iterating postings/impacts. It *does* create an {{ExitableTermsEnum}} 
> wrapper when loading a {{TermsEnum}}, but that wrapper doesn't do 
> anything to wrap postings or impacts. So timeouts will be enforced when 
> moving to the "next" term, but not when iterating the postings/impacts 
> associated with a term.
> I think we ought to wrap the postings/impacts as well with some form of 
> timeout checking so timeouts can be enforced on long-running queries. I'm not 
> sure why this wasn't done originally (back in 2014), but it was questioned 
> back in 2020 on the original Jira SOLR-5986. Does anyone know of a good 
> reason why we shouldn't enforce timeouts in this way?
> Related, we may also want to wrap things like {{seekExact}} and {{seekCeil}} 
> given that only {{next}} is being wrapped currently.






[jira] [Commented] (LUCENE-10668) Should we deprecate/remove DocValuesTermsQuery in sandbox?

2022-07-29 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17572854#comment-17572854
 ] 

Adrien Grand commented on LUCENE-10668:
---

It is like TermInSetQuery except that it operates on doc values instead of the 
inverted index?

> Should we deprecate/remove DocValuesTermsQuery in sandbox?
> --
>
> Key: LUCENE-10668
> URL: https://issues.apache.org/jira/browse/LUCENE-10668
> Project: Lucene - Core
>  Issue Type: Task
>  Components: modules/sandbox
>Reporter: Greg Miller
>Priority: Minor
>
> I came across the sandbox {{DocValuesTermsQuery}} and it sure looks a lot 
> like {{TermInSetQuery}}. I wonder if we ought to deprecate and remove it? Any 
> reason to keep this around?






[jira] [Commented] (LUCENE-10658) Merges should periodically check for abort

2022-07-28 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17572282#comment-17572282
 ] 

Adrien Grand commented on LUCENE-10658:
---

{{Thread#interrupt}} has the side effect of invalidating every 
{{NIOFSIndexInput}} that is blocked on I/O (the javadocs on {{NIOFSDirectory}} 
give some more details), so we generally discourage using 
{{Thread#interrupt}}. I guess it would not be OK to interrupt because there 
might be open IndexReaders that reference some of the segments that are being 
merged and that users plan to keep using after the call to rollback?
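For illustration, the kind of periodic check being asked for might look like the
sketch below; whether the abort flag is polled via {{OneMerge#isAborted}} or the
merge progress object is an implementation detail, and the poll interval is made up:

{code:java}
import java.io.IOException;
import org.apache.lucene.index.MergePolicy;

class AbortCheckingMergeLoop {
  /** Polls the abort flag every 4096 work items, so rollback() is not blocked
   *  until a long-running merge (e.g. of a completion field) finishes. */
  static void run(Iterable<Object> workItems, MergePolicy.OneMerge merge) throws IOException {
    int processed = 0;
    for (Object item : workItems) {
      process(item); // merge work that may not touch any output file
      if ((++processed & 0xFFF) == 0 && merge.isAborted()) {
        throw new MergePolicy.MergeAbortedException("merge is aborted");
      }
    }
  }

  private static void process(Object item) {
    // placeholder for the real per-item merge work
  }
}
{code}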

> Merges should periodically check for abort
> --
>
> Key: LUCENE-10658
> URL: https://issues.apache.org/jira/browse/LUCENE-10658
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/index
>Affects Versions: 9.3
>Reporter: Nhat Nguyen
>Priority: Major
>
> Rolling back an IndexWriter without committing shouldn't take long (i.e., 
> less than several seconds), and Elasticsearch cluster coordination [relies 
> on|https://github.com/elastic/elasticsearch/issues/88055] this assumption. If 
> some merges are taking place, the rollback can take several minutes as merges 
> only check for abort when writing to files via 
> [MergeRateLimiter|https://github.com/apache/lucene/blob/3d7d85f245381f84c46c766119695a8645cde2b8/lucene/core/src/java/org/apache/lucene/index/MergeRateLimiter.java#L117-L119].
>  Merging a completion field, for example, can take a long time without 
> touching output files. Another reason merges should periodically check for 
> abort is that their outputs will be discarded.






[jira] [Commented] (LUCENE-8810) Flattening of nested disjunctions does not take into account number of clause limitation of builder

2022-07-27 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17572021#comment-17572021
 ] 

Adrien Grand commented on LUCENE-8810:
--

Indeed this change no longer makes sense on 9.x since Lucene no longer checks the 
number of clauses per boolean query but globally, which makes more sense.

If you want to avoid getting this exception, you need to either avoid creating 
queries that have many clauses (e.g. using TermInSetQuery when applicable) or 
increase the maximum clause count (which is discouraged).
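For example, the 1024+ SHOULD clauses from the unit test below can be collapsed
into a single clause with TermInSetQuery:

{code:java}
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermInSetQuery;
import org.apache.lucene.util.BytesRef;

List<BytesRef> terms = new ArrayList<>();
for (int i = 0; i < 1024; i++) {
  terms.add(new BytesRef("bar-" + i));
}
// Counts as a single clause, no matter how many terms it matches.
Query query = new TermInSetQuery("foo", terms);
{code}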

> Flattening of nested disjunctions does not take into account number of clause 
> limitation of builder
> ---
>
> Key: LUCENE-8810
> URL: https://issues.apache.org/jira/browse/LUCENE-8810
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/search
>Affects Versions: 8.0
>Reporter: Mickaël Sauvée
>Priority: Minor
> Fix For: 8.2
>
> Attachments: LUCENE-8810.patch
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> In org.apache.lucene.search.BooleanQuery, at the end of the function 
> rewrite(IndexReader reader), the query is rewritten to flatten nested 
> disjunctions.
> This does not take into account the limitation on the number of clauses in a 
> builder (1024).
>  In some circumstances, this limit can be reached, and hence an exception is 
> thrown.
> Here is a unit test that highlights this.
> {code:java}
>   public void testFlattenInnerDisjunctionsWithMoreThan1024Terms() throws 
> IOException {
> IndexSearcher searcher = newSearcher(new MultiReader());
> BooleanQuery.Builder builder1024 = new BooleanQuery.Builder();
> for(int i = 0; i < 1024; i++) {
>   builder1024.add(new TermQuery(new Term("foo", "bar-" + i)), 
> Occur.SHOULD);
> }
> Query inner = builder1024.build();
> Query query = new BooleanQuery.Builder()
> .add(inner, Occur.SHOULD)
> .add(new TermQuery(new Term("foo", "baz")), Occur.SHOULD)
> .build();
> searcher.rewrite(query);
>   }
> {code}






[jira] [Commented] (LUCENE-10661) Reduce memory copy in BytesStore

2022-07-27 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17571971#comment-17571971
 ] 

Adrien Grand commented on LUCENE-10661:
---

Thank you [~luyuncheng]!

> Reduce memory copy in BytesStore
> 
>
> Key: LUCENE-10661
> URL: https://issues.apache.org/jira/browse/LUCENE-10661
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: LuYunCheng
>Priority: Major
> Fix For: 9.4
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> This is derived from 
> [LUCENE-10627](https://github.com/apache/lucene/pull/987) AND 
> [LUCENE-10657](https://github.com/apache/lucene/pull/1034)
> The abstract method copyBytes in DataOutput has to copy from the input into a 
> copyBuffer and then write into BytesStore.blocks; it is called during FST 
> initialization when reading from metaIn. 
> Although this copies only a few bytes (in the test case only 3-10 
> bytes), I think we can save this memory copy and avoid having 
> DataOutput.copyBytes allocate a new 16384-byte copyBuffer.






[jira] [Resolved] (LUCENE-10661) Reduce memory copy in BytesStore

2022-07-27 Thread Adrien Grand (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand resolved LUCENE-10661.
---
Fix Version/s: 9.4
   Resolution: Fixed

> Reduce memory copy in BytesStore
> 
>
> Key: LUCENE-10661
> URL: https://issues.apache.org/jira/browse/LUCENE-10661
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: LuYunCheng
>Priority: Major
> Fix For: 9.4
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> This is derived from 
> [LUCENE-10627](https://github.com/apache/lucene/pull/987) AND 
> [LUCENE-10657](https://github.com/apache/lucene/pull/1034)
> The abstract method copyBytes in DataOutput has to copy from the input into a 
> copyBuffer and then write into BytesStore.blocks; it is called during FST 
> initialization when reading from metaIn. 
> Although this copies only a few bytes (in the test case only 3-10 
> bytes), I think we can save this memory copy and avoid having 
> DataOutput.copyBytes allocate a new 16384-byte copyBuffer.






[jira] [Commented] (LUCENE-10664) SearcherManager should return new IndexSearchers every time

2022-07-27 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17571935#comment-17571935
 ] 

Adrien Grand commented on LUCENE-10664:
---

Or maybe we should just deprecate/remove SearcherManager and suggest that users 
use ReaderManager directly.
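A sketch of what that usage would look like (ReaderManager and per-request
IndexSearcher construction are existing APIs; the setTimeout call is the
per-searcher state that motivates this issue):

{code:java}
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.ReaderManager;
import org.apache.lucene.search.IndexSearcher;

ReaderManager manager = new ReaderManager(directory);
// per request:
DirectoryReader reader = manager.acquire();
try {
  IndexSearcher searcher = new IndexSearcher(reader); // cheap to construct
  // searcher.setTimeout(...); // per-request state stays per-request
  // ... run the query ...
} finally {
  manager.release(reader);
}
{code}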

> SearcherManager should return new IndexSearchers every time
> ---
>
> Key: LUCENE-10664
> URL: https://issues.apache.org/jira/browse/LUCENE-10664
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Adrien Grand
>Priority: Major
>
> SearcherManager caches IndexSearcher instances. This is no longer a good 
> approach now that IndexSearcher has timeout support (LUCENE-10151) and keeps 
> track of the time until which queries are allowed to run.






[jira] [Created] (LUCENE-10664) SearcherManager should return new IndexSearchers every time

2022-07-27 Thread Adrien Grand (Jira)
Adrien Grand created LUCENE-10664:
-

 Summary: SearcherManager should return new IndexSearchers every 
time
 Key: LUCENE-10664
 URL: https://issues.apache.org/jira/browse/LUCENE-10664
 Project: Lucene - Core
  Issue Type: Bug
Reporter: Adrien Grand


SearcherManager caches IndexSearcher instances. This is no longer a good 
approach now that IndexSearcher has timeout support (LUCENE-10151) and keeps 
track of the time until which queries are allowed to run.






[jira] [Commented] (LUCENE-10633) Dynamic pruning for queries sorted by SORTED(_SET) field

2022-07-27 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17571807#comment-17571807
 ] 

Adrien Grand commented on LUCENE-10633:
---

The PR is ready for review now if someone is interested in having a look. I 
made an improvement for the very sparse case, so that after collecting 
{{numHits}} matches, the collector would tell the query to only look at 
documents that have a value for the sort field.

One assumption that this change makes is that terms are encoded exactly the 
same way in the terms index and in the doc-values terms dictionary. I think 
it's a fine assumption, but wanted to make it explicit because this 
optimization will lead to runtime errors if this assumption isn't met. This is 
the same assumption that we are already making today when sorting numeric 
fields and using the points index to dynamically prune irrelevant hits.

I ran luceneutil again to verify performance is still good:

{noformat}
Task                             QPS baseline         QPS my_modified_version    Pct diff                p-value
HighSloppyPhrase                 11.46 (4.3%)         11.19 (5.3%)               -2.4% ( -11% -  7%)     0.120
Prefix3                          53.30 (16.7%)        52.06 (16.8%)              -2.3% ( -30% - 37%)     0.659
BrowseDateSSDVFacets             5.23 (11.1%)         5.13 (13.5%)               -1.9% ( -23% - 25%)     0.632
BrowseDayOfYearSSDVFacets        20.33 (7.6%)         19.96 (8.6%)               -1.9% ( -16% - 15%)     0.470
BrowseMonthTaxoFacets            28.62 (12.0%)        28.11 (7.8%)               -1.8% ( -19% - 20%)     0.582
OrHighNotLow                     1357.76 (6.3%)       1334.12 (4.8%)             -1.7% ( -12% -  9%)     0.325
OrHighNotMed                     1568.25 (4.3%)       1541.21 (4.8%)             -1.7% ( -10% -  7%)     0.232
MedTerm                          2422.95 (5.2%)       2381.38 (4.6%)             -1.7% ( -10% -  8%)     0.269
HighTerm                         1736.81 (6.5%)       1710.26 (5.6%)             -1.5% ( -12% - 11%)     0.426
MedSloppyPhrase                  62.45 (3.4%)         61.59 (4.1%)               -1.4% (  -8% -  6%)     0.249
OrNotHighHigh                    931.81 (5.4%)        919.74 (4.4%)              -1.3% ( -10% -  8%)     0.403
OrHighHigh                       58.41 (5.3%)         57.65 (4.1%)               -1.3% ( -10% -  8%)     0.388
OrNotHighMed                     1179.51 (3.0%)       1168.53 (3.2%)             -0.9% (  -6% -  5%)     0.338
BrowseRandomLabelSSDVFacets      14.52 (1.9%)         14.40 (1.9%)               -0.8% (  -4% -  3%)     0.186
LowTerm                          1589.67 (3.6%)       1579.95 (4.6%)             -0.6% (  -8% -  7%)     0.642
MedTermDayTaxoFacets             52.00 (4.3%)         51.70 (4.3%)               -0.6% (  -8% -  8%)     0.672
OrHighNotHigh                    1008.27 (5.9%)       1002.78 (5.1%)             -0.5% ( -10% - 11%)     0.756
LowIntervalsOrdered              11.03 (4.8%)         10.98 (4.4%)               -0.5% (  -9% -  9%)     0.724
OrHighMedDayTaxoFacets           22.72 (3.5%)         22.64 (3.1%)               -0.4% (  -6% -  6%)     0.718
OrHighLow                        899.20 (3.3%)        896.35 (3.0%)              -0.3% (  -6% -  6%)     0.750
MedIntervalsOrdered              43.37 (3.6%)         43.25 (3.7%)               -0.3% (  -7% -  7%)     0.799
HighIntervalsOrdered             24.44 (5.3%)         24.37 (5.5%)               -0.3% ( -10% - 11%)     0.864
OrNotHighLow                     1448.52 (4.0%)       1446.40 (3.5%)             -0.1% (  -7% -  7%)     0.901
LowSpanNear                      85.70 (2.4%)         85.59 (2.2%)               -0.1% (  -4% -  4%)     0.851
AndHighLow                       1043.29 (5.2%)       1042.26 (3.9%)             -0.1% (  -8% -  9%)     0.946
PKLookup                         236.83 (1.4%)        236.69 (2.2%)              -0.1% (  -3% -  3%)     0.919
HighTermTitleBDVSort             25.03 (3.5%)         25.02 (2.6%)               -0.0% (  -5% -  6%)     0.977
Wildcard                         156.78 (1.9%)        156.93 (1.8%)               0.1% (  -3% -  3%)     0.877
MedSpanNear                      214.11 (4.2%)        214.32 (2.9%)               0.1% (  -6% -  7%)     0.929
Fuzzy1                           118.50 (1.2%)        118.67 (0.9%)               0.1% (  -1% -  2%)     0.664
Respell                          59.34 (1.0%)         59.43 (0.8%)                0.1% (  -1% -  2%)     0.630
Fuzzy2                           115.77 (1.1%)        116.01 (1.1%)               0.2% (  -1% -  2%)     0.549
LowSloppyPhrase                  89.17 (2.6%)         89.38 (2.6%)                0.2% (  -4% -  5%)     0.771
HighSpanNear                     31.18 (4.1%)         31.28 (3.2%)                0.3% (  -6% -  8%)     0.769
{noformat}

[jira] [Commented] (LUCENE-10151) Add timeout support to IndexSearcher

2022-07-26 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17571370#comment-17571370
 ] 

Adrien Grand commented on LUCENE-10151:
---

I just noticed that the push of my backport had failed, so it will be in 9.4, 
not 9.3. I don't think it's worth respinning for it.

> Add timeout support to IndexSearcher
> 
>
> Key: LUCENE-10151
> URL: https://issues.apache.org/jira/browse/LUCENE-10151
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/search
>Reporter: Greg Miller
>Priority: Minor
> Fix For: 9.3
>
>  Time Spent: 3h 40m
>  Remaining Estimate: 0h
>
> I'd like to explore adding optional "timeout" capabilities to 
> {{IndexSearcher}}. This would enable users to (optionally) specify a maximum 
> time budget for search execution. If the search "times out", partial results 
> would be available.
> This idea originated on the dev list (thanks [~jpountz] for the suggestion). 
> Thread for reference: 
> [http://mail-archives.apache.org/mod_mbox/lucene-dev/202110.mbox/%3CCAL8PwkZdNGmYJopPjeXYK%3DF7rvLkWon91UEXVxMM4MeeJ3UHxQ%40mail.gmail.com%3E]
>  
> A couple things to watch out for with this change:
>  # We want to make sure it's robust to a two-phase query evaluation scenario 
> where the "approximate" step matches a large number of candidates but the 
> "confirmation" step matches very few (or none). This is a particularly tricky 
> case.
>  # We want to make sure the {{TotalHits#Relation}} reported by {{TopDocs}} is 
> {{GREATER_THAN_OR_EQUAL_TO}} if the query times out
>  # We want to make sure it plays nice with the {{LRUCache}} since it iterates 
> the query to pre-populate a {{BitSet}} when caching. That step shouldn't be 
> allowed to overrun the timeout. The proper way to handle this probably needs 
> some thought.
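As it landed, usage looks roughly like this (method names per the 9.x API added
by this issue; QueryTimeout is assumed to be the single-method shouldExit()
interface, so a lambda works):

{code:java}
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MatchAllDocsQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.TotalHits;

IndexSearcher searcher = new IndexSearcher(reader);
long deadline = System.nanoTime() + 100_000_000L; // 100ms budget
searcher.setTimeout(() -> System.nanoTime() - deadline > 0);
TopDocs top = searcher.search(new MatchAllDocsQuery(), 10);
if (searcher.timedOut()) {
  // Partial results: the reported relation is GREATER_THAN_OR_EQUAL_TO.
  assert top.totalHits.relation == TotalHits.Relation.GREATER_THAN_OR_EQUAL_TO;
}
{code}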






[jira] [Updated] (LUCENE-10660) precompute the max level in LogMergePolicy

2022-07-26 Thread Adrien Grand (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand updated LUCENE-10660:
--
Fix Version/s: 9.4
   (was: 9.3)

> precompute the max level in LogMergePolicy
> --
>
> Key: LUCENE-10660
> URL: https://issues.apache.org/jira/browse/LUCENE-10660
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Affects Versions: 9.2
>Reporter: tang donghai
>Priority: Minor
> Fix For: 9.4
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> I noticed LogMergePolicy#findMerges always calculates the max level on the 
> right side when finding the next segments to merge.
>  
> I think we could calculate the max levels only once, and when we need the max 
> level, we could simply do
> {code:java}
> float maxLevel = maxLevels[start];
> {code}
> and the precomputation looks like the code below, comparing each level in 
> levels from right to left 
> {code:java}
> float[] maxLevels = new float[numMergeableSegments + 1];
> maxLevels[numMergeableSegments] = -1.0f;
> for (int i = numMergeableSegments - 1; i >= 0; i--) {
>   maxLevels[i] = Math.max(levels.get(i).level, maxLevels[i + 1]);
> }
> {code}
>  






[jira] [Commented] (LUCENE-10660) precompute the max level in LogMergePolicy

2022-07-26 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17571369#comment-17571369
 ] 

Adrien Grand commented on LUCENE-10660:
---

The change made sense to me and I merged it, thank you [~tangdh]!

> precompute the max level in LogMergePolicy
> --
>
> Key: LUCENE-10660
> URL: https://issues.apache.org/jira/browse/LUCENE-10660
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Affects Versions: 9.2
>Reporter: tang donghai
>Priority: Minor
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> I noticed LogMergePolicy#findMerges always calculates the max level on the 
> right side when finding the next segments to merge.
>  
> I think we could calculate the max levels only once, and when we need the max 
> level, we could simply do
> {code:java}
> float maxLevel = maxLevels[start];
> {code}
> and the precomputation looks like the code below, comparing each level in 
> levels from right to left 
> {code:java}
> float[] maxLevels = new float[numMergeableSegments + 1];
> maxLevels[numMergeableSegments] = -1.0f;
> for (int i = numMergeableSegments - 1; i >= 0; i--) {
>   maxLevels[i] = Math.max(levels.get(i).level, maxLevels[i + 1]);
> }
> {code}
>  






[jira] [Resolved] (LUCENE-10660) precompute the max level in LogMergePolicy

2022-07-26 Thread Adrien Grand (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand resolved LUCENE-10660.
---
Fix Version/s: 9.3
   Resolution: Fixed

> precompute the max level in LogMergePolicy
> --
>
> Key: LUCENE-10660
> URL: https://issues.apache.org/jira/browse/LUCENE-10660
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Affects Versions: 9.2
>Reporter: tang donghai
>Priority: Minor
> Fix For: 9.3
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> I noticed LogMergePolicy#findMerges always calculates the max level on the 
> right side when finding the next segments to merge.
>  
> I think we could calculate the max levels only once, and when we need the max 
> level, we could simply do
> {code:java}
> float maxLevel = maxLevels[start];
> {code}
> and the precomputation looks like the code below, comparing each level in 
> levels from right to left 
> {code:java}
> float[] maxLevels = new float[numMergeableSegments + 1];
> maxLevels[numMergeableSegments] = -1.0f;
> for (int i = numMergeableSegments - 1; i >= 0; i--) {
>   maxLevels[i] = Math.max(levels.get(i).level, maxLevels[i + 1]);
> }
> {code}
>  






[jira] [Commented] (LUCENE-10592) Should we build HNSW graph on the fly during indexing

2022-07-26 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17571355#comment-17571355
 ] 

Adrien Grand commented on LUCENE-10592:
---

I just pushed an annotation that should show up in the next couple of days.

> Should we build HNSW graph on the fly during indexing
> -
>
> Key: LUCENE-10592
> URL: https://issues.apache.org/jira/browse/LUCENE-10592
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Mayya Sharipova
>Assignee: Mayya Sharipova
>Priority: Minor
> Fix For: 9.4
>
> Attachments: Screen Shot 2022-07-25 at 9.04.11 AM.png
>
>  Time Spent: 8h
>  Remaining Estimate: 0h
>
> Currently, when we index vectors for KnnVectorField, we buffer those vectors 
> in memory and on flush during segment construction we build an HNSW graph.  
> As building an HNSW graph is very expensive, this makes the flush operation 
> take a lot of time. This also makes overall indexing performance quite 
> unpredictable (as the number of flushes is defined by the memory used and the 
> presence of concurrent searches), e.g. some indexing operations return almost 
> instantly while others that trigger a flush take a lot of time. 
> Building an HNSW graph on the fly as we index vectors allows us to avoid this 
> problem, and spreads the load of HNSW graph construction evenly during indexing.
> This will also supersede LUCENE-10194






[jira] [Updated] (LUCENE-10603) Improve iteration of ords for SortedSetDocValues

2022-07-19 Thread Adrien Grand (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand updated LUCENE-10603:
--
Fix Version/s: 9.3

> Improve iteration of ords for SortedSetDocValues
> 
>
> Key: LUCENE-10603
> URL: https://issues.apache.org/jira/browse/LUCENE-10603
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Lu Xugang
>Assignee: Lu Xugang
>Priority: Trivial
> Fix For: 9.3
>
>  Time Spent: 6.5h
>  Remaining Estimate: 0h
>
> Now that SortedSetDocValues#docValueCount has been added in Lucene 9.2, should 
> we refactor the implementation of ords iteration to use docValueCount instead 
> of NO_MORE_ORDS?
> Similar to how SortedNumericDocValues does it.
> From 
> {code:java}
> for (long ord = values.nextOrd();ord != SortedSetDocValues.NO_MORE_ORDS; ord 
> = values.nextOrd()) {
> }{code}
> to
> {code:java}
> for (int i = 0; i < values.docValueCount(); i++) {
>   long ord = values.nextOrd();
> }{code}






[jira] [Resolved] (LUCENE-7713) Optimize TopFieldDocCollector for the sorted case

2022-07-19 Thread Adrien Grand (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-7713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand resolved LUCENE-7713.
--
Resolution: Fixed

> Optimize TopFieldDocCollector for the sorted case
> -
>
> Key: LUCENE-7713
> URL: https://issues.apache.org/jira/browse/LUCENE-7713
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
>
> When the sort order is a prefix of the index sort order, 
> {{TopFieldDocCollector}} could skip reading doc values and comparing them 
> against the bottom value after {{numHits}} documents have been collected, and 
> just count matches.






[jira] [Commented] (LUCENE-7713) Optimize TopFieldDocCollector for the sorted case

2022-07-19 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-7713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17568758#comment-17568758
 ] 

Adrien Grand commented on LUCENE-7713:
--

Looking at the current code, it looks like this optimization has been added since I 
opened this issue, via {{TopFieldLeafCollector#collectedAllCompetitiveHits}}.

> Optimize TopFieldDocCollector for the sorted case
> -
>
> Key: LUCENE-7713
> URL: https://issues.apache.org/jira/browse/LUCENE-7713
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
>
> When the sort order is a prefix of the index sort order, 
> {{TopFieldDocCollector}} could skip reading doc values and comparing them 
> against the bottom value after {{numHits}} documents have been collected, and 
> just count matches.






[jira] [Commented] (LUCENE-10151) Add timeout support to IndexSearcher

2022-07-19 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17568757#comment-17568757
 ] 

Adrien Grand commented on LUCENE-10151:
---

I backported #996 too.

> Add timeout support to IndexSearcher
> 
>
> Key: LUCENE-10151
> URL: https://issues.apache.org/jira/browse/LUCENE-10151
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/search
>Reporter: Greg Miller
>Priority: Minor
> Fix For: 9.3
>
>  Time Spent: 3h 40m
>  Remaining Estimate: 0h
>
> I'd like to explore adding optional "timeout" capabilities to 
> {{IndexSearcher}}. This would enable users to (optionally) specify a maximum 
> time budget for search execution. If the search "times out", partial results 
> would be available.
> This idea originated on the dev list (thanks [~jpountz] for the suggestion). 
> Thread for reference: 
> [http://mail-archives.apache.org/mod_mbox/lucene-dev/202110.mbox/%3CCAL8PwkZdNGmYJopPjeXYK%3DF7rvLkWon91UEXVxMM4MeeJ3UHxQ%40mail.gmail.com%3E]
>  
> A couple things to watch out for with this change:
>  # We want to make sure it's robust to a two-phase query evaluation scenario 
> where the "approximate" step matches a large number of candidates but the 
> "confirmation" step matches very few (or none). This is a particularly tricky 
> case.
>  # We want to make sure the {{TotalHits#Relation}} reported by {{TopDocs}} is 
> {{GREATER_THAN_OR_EQUAL_TO}} if the query times out
>  # We want to make sure it plays nice with the {{LRUCache}} since it iterates 
> the query to pre-populate a {{BitSet}} when caching. That step shouldn't be 
> allowed to overrun the timeout. The proper way to handle this probably needs 
> some thought.






[jira] [Resolved] (LUCENE-10524) Augment CONTRIBUTING.md guide with instructions on how/when to benchmark

2022-07-19 Thread Adrien Grand (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand resolved LUCENE-10524.
---
Fix Version/s: 10.0 (main)
   Resolution: Fixed

> Augment CONTRIBUTING.md guide with instructions on how/when to benchmark
> 
>
> Key: LUCENE-10524
> URL: https://issues.apache.org/jira/browse/LUCENE-10524
> Project: Lucene - Core
>  Issue Type: Wish
>Reporter: Gautam Worah
>Priority: Minor
> Fix For: 10.0 (main)
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> This came up when I was trying to think about improving the experience for 
> new contributors.
> Today, new contributors are usually unaware of where luceneutil benchmarks 
> are and when/how to run them. Committers usually end up pointing contributors 
> to the benchmarks package when they make perf-impacting changes and then they 
> run the benchmarks.
>  
> Adding benchmark details to the Lucene repo will also make them more 
> accessible to other researchers who want to experiment/benchmark their own 
> custom task implementation with Java Lucene.
>  
> What does the community think?
>  






[jira] [Resolved] (LUCENE-10605) fix error in 32bit jvm object alignment gap calculation

2022-07-19 Thread Adrien Grand (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand resolved LUCENE-10605.
---
Fix Version/s: 9.3
   Resolution: Fixed

> fix error in 32bit jvm object alignment gap calculation
> ---
>
> Key: LUCENE-10605
> URL: https://issues.apache.org/jira/browse/LUCENE-10605
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/other
>Affects Versions: 8.11.1
> Environment: jdk 7 32-bit
> jdk 8 32-bit
>Reporter: sun wuqiang
>Priority: Trivial
> Fix For: 9.3
>
> Attachments: image-2022-06-08-20-50-27-712.png, 
> image-2022-06-08-21-24-57-674.png, image-2022-06-09-08-25-55-289.png, 
> image-2022-06-09-08-26-36-528.png
>
>  Time Spent: 3h 20m
>  Remaining Estimate: 0h
>
> ArrayUtil.{*}oversize{*}(int minTargetSize, int bytesPerElement)
> This method is used to calculate the optimal length of an array during 
> expansion.
>  
> According to the current logic, in order to avoid the space waste caused by the 
> *object alignment gap* in a *32-bit* JVM, the array length will be chosen from 
> the numbers (the +current optional+ columns) in the table below. But the results 
> aren't perfect.
> For example, if I want to expand byte[2], I will call the method 
> oversize(2,1) to get the size of the next array, which returns 8.
> But byte[8] is not the best result.
> Since byte[8] and byte[12] use the same memory space (both are 24 bytes due 
> to the alignment gap),
> it's best to return 12 here.
> See the table below.
> !image-2022-06-09-08-26-36-528.png!
>  
> I used *jol-core* to calculate object alignment gap
> {code:java}
> 
> org.openjdk.jol
> jol-core
> 0.16
> compile
>  {code}
>  
> Execute the following code:
> {code:java}
> System.out.println(ClassLayout.parseInstance(new byte[6]).toPrintable()); 
> {code}
>  
> !image-2022-06-08-21-24-57-674.png!
>  
> To further verify that the tool's results are correct, I wrote the following 
> code to infer how much space arrays of different lengths actually occupy, 
> based on when the OOM occurs. The conclusion is consistent with jol-core.
> {code:java}
> // -Xms16m -Xmx16m
> // Used to infer the memory space occupied
> // by the length of various arrays
> public static void main(String[] args) {
> byte[][] arr = new byte[1024 * 1024][];
> for (int i = 0; i < arr.length; i++) {
> if (i % 100 == 0) {
> System.out.println(i);
> }
> // According to OOM occurrence time
> // in 32-bit JVM,
> // Arrays range in length from 5 to 12,
> // occupying the same amount of memory
> arr[i]=new byte[5];
> }
> } {code}
> *new byte[5]* and *new byte[12]* use the same amount of memory
> 
>  
> In addition, +*-XX:ObjectAlignmentInBytes*+ should also affect the return 
> value of this method, but I don't know whether it is necessary to support 
> that. If necessary, I will modify it as well. Thank you very much!
>  






[jira] [Resolved] (LUCENE-10598) SortedSetDocValues#docValueCount() should be always greater than zero

2022-07-19 Thread Adrien Grand (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand resolved LUCENE-10598.
---
Fix Version/s: 9.3
   Resolution: Fixed

> SortedSetDocValues#docValueCount() should be always greater than zero
> -
>
> Key: LUCENE-10598
> URL: https://issues.apache.org/jira/browse/LUCENE-10598
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Lu Xugang
>Priority: Major
> Fix For: 9.3
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> This test runs failed.
> {code:java}
>   public void testDocValueCount() throws IOException {
>   try (Directory d = newDirectory()) {
> try (IndexWriter w = new IndexWriter(d, new IndexWriterConfig())) {
>   for (int j = 0; j < 1; j++) {
> Document doc = new Document();
> doc.add(new SortedSetDocValuesField("field", new BytesRef("a")));
> doc.add(new SortedSetDocValuesField("field", new BytesRef("a")));
> doc.add(new SortedSetDocValuesField("field", new BytesRef("b")));
> w.addDocument(doc);
>   }
> }
> try (IndexReader reader = DirectoryReader.open(d)) {
>   assertEquals(1, reader.leaves().size());
>   for (LeafReaderContext leaf : reader.leaves()) {
> SortedSetDocValues docValues = 
> leaf.reader().getSortedSetDocValues("field");
> for (int doc1 = docValues.nextDoc(); doc1 != 
> DocIdSetIterator.NO_MORE_DOCS; doc1 = docValues.nextDoc()) {
>   assert docValues.docValueCount() > 0;
> }
>   }
> }
> }
>   }
> {code}






[jira] [Resolved] (LUCENE-10648) Fix TestAssertingPointsFormat.testWithExceptions failure

2022-07-19 Thread Adrien Grand (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand resolved LUCENE-10648.
---
Fix Version/s: 10.0 (main)
   Resolution: Fixed

> Fix TestAssertingPointsFormat.testWithExceptions failure
> 
>
> Key: LUCENE-10648
> URL: https://issues.apache.org/jira/browse/LUCENE-10648
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Vigya Sharma
>Priority: Major
> Fix For: 10.0 (main)
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> We are seeing build failures due to 
> TestAssertingPointsFormat.testWithExceptions. I am able to repro this on my 
> box with the random seed. Tracking the issue here.
> Sample Failing Build: 
> https://ci-builds.apache.org/job/Lucene/job/Lucene-Check-main/6057/






[jira] [Commented] (LUCENE-10216) Add concurrency to addIndexes(CodecReader…) API

2022-07-19 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17568745#comment-17568745
 ] 

Adrien Grand commented on LUCENE-10216:
---

Can this issue be resolved?

> Add concurrency to addIndexes(CodecReader…) API
> ---
>
> Key: LUCENE-10216
> URL: https://issues.apache.org/jira/browse/LUCENE-10216
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Vigya Sharma
>Priority: Major
>  Time Spent: 11h 10m
>  Remaining Estimate: 0h
>
> I work at Amazon Product Search, and we use Lucene to power search for the 
> e-commerce platform. I’m working on a project that involves applying 
> metadata+ETL transforms and indexing documents on n different _indexing_ 
> boxes, combining them into a single index on a separate _reducer_ box, and 
> making it available for queries on m different _search_ boxes (replicas). 
> Segments are asynchronously copied from indexers to reducers to searchers as 
> they become available for the next layer to consume.
> I am using the addIndexes API to combine multiple indexes into one on the 
> reducer boxes. Since we also have taxonomy data, we need to remap facet field 
> ordinals, which means I need to use the {{addIndexes(CodecReader…)}} version 
> of this API. The API leverages {{SegmentMerger.merge()}} to create segments 
> with new ordinal values while also merging all provided segments in the 
> process.
> _This is however a blocking call that runs in a single thread._ Until we have 
> written segments with new ordinal values, we cannot copy them to searcher 
> boxes, which increases the time to make documents available for search.
> I was playing around with the API by creating multiple concurrent merges, 
> each with only a single reader, creating a concurrently running 1:1 
> conversion from old segments to new ones (with new ordinal values). We follow 
> this up with non-blocking background merges. This lets us copy the segments 
> to searchers and replicas as soon as they are available, and later replace 
> them with merged segments as background jobs complete. On the Amazon dataset 
> I profiled, this gave us around 2.5 to 3x improvement in addIndexes() time. 
> Each call was given about 5 readers to add on average.
> This might be useful add to Lucene. We could create another {{addIndexes()}} 
> API with a {{boolean}} flag for concurrency, that internally submits multiple 
> merge jobs (each with a single reader) to the {{ConcurrentMergeScheduler}}, 
> and waits for them to complete before returning.
> While this is doable from outside Lucene by using your own thread pool, starting 
> multiple addIndexes() calls and waiting for them to complete, I felt it requires 
> some understanding of what addIndexes does, why you need to wait on the merge, 
> and why it makes sense to pass a single reader to the addIndexes API.
> Out-of-the-box support in Lucene could simplify this for folks with a similar use case.
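The "from outside Lucene" version described above looks roughly like this
(illustrative; error handling and pool sizing elided):

{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import org.apache.lucene.index.CodecReader;
import org.apache.lucene.index.IndexWriter;

class ConcurrentAddIndexes {
  /** One blocking addIndexes(CodecReader...) call per reader, run in parallel:
   *  each call becomes a 1:1 segment rewrite (with remapped ordinals) that can
   *  be copied to searchers as soon as it finishes. */
  static void addAllConcurrently(IndexWriter writer, List<CodecReader> readers)
      throws Exception {
    ExecutorService pool = Executors.newFixedThreadPool(4);
    try {
      List<Future<?>> futures = new ArrayList<>();
      for (CodecReader reader : readers) {
        futures.add(pool.submit(() -> {
          writer.addIndexes(reader);
          return null;
        }));
      }
      for (Future<?> f : futures) {
        f.get(); // wait for all rewrites and surface failures
      }
    } finally {
      pool.shutdown();
    }
  }
}
{code}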






[jira] [Resolved] (LUCENE-10507) Should it be more likely to search concurrently in tests?

2022-07-19 Thread Adrien Grand (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand resolved LUCENE-10507.
---
Fix Version/s: 9.3
   Resolution: Fixed

> Should it be more likely to search concurrently in tests?
> -
>
> Key: LUCENE-10507
> URL: https://issues.apache.org/jira/browse/LUCENE-10507
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Luca Cavanna
>Priority: Minor
> Fix For: 9.3
>
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> As part of LUCENE-10002 we are migrating test usages of 
> IndexSearcher#search(Query, Collector) to use the corresponding search method 
> that takes a CollectorManager in place of a Collector. As part of such 
> changes, I've been paying attention to whether searchers are created through 
> LuceneTestCase#newSearcher and migrating to it when possible.
> This caused some recent test failures following test changes, which were in 
> most cases test issues, although they were quite rare due to the fact that we 
> only rarely exercise the concurrent code-path in tests.
> One recent failure uncovered LUCENE-10500, which was an actual bug that 
> affected concurrent searches only, and was uncovered by a test run that 
> indexed a considerable amount of docs and was lucky enough to get an executor 
> set to its index searcher as well as get multiple slices.
> LuceneTestCase#newIndexSearcher(IndexReader) uses threads only rarely, and 
> even when useThreads is true, the searcher may not get an executor set. Also, 
> it can often happen that despite an executor is set, the searcher will hold 
> only one slice, as not enough documents are indexed. Some nightly tests index 
> enough documents, and LuceneTestCase also lowers the slice limits but only 
> 50% of the times and only when wrapWithAssertions is false. Also I wonder if 
> the lower limits are low enough:
> {code:java}
> int maxDocPerSlice = 1 + random.nextInt(10);
> int maxSegmentsPerSlice = 1 + random.nextInt(20);
> {code}
> All in all, I wonder if we should make it more likely for real concurrent 
> searches to happen while testing across multiple slices. It seems like it 
> could be useful especially as we'd like users to use collector managers 
> instead of collectors (although that does not necessarily translate to 
> concurrent search).






[jira] [Resolved] (LUCENE-10657) CopyBytes now saves one memory copy on ByteBuffersDataOutput

2022-07-19 Thread Adrien Grand (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand resolved LUCENE-10657.
---
Fix Version/s: 9.3
   Resolution: Fixed

> CopyBytes now saves one memory copy on ByteBuffersDataOutput
> 
>
> Key: LUCENE-10657
> URL: https://issues.apache.org/jira/browse/LUCENE-10657
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/store
>Reporter: LuYunCheng
>Priority: Major
> Fix For: 9.3
>
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> This is derived from 
> [LUCENE-10627|https://github.com/apache/lucene/pull/987]
> Code: [https://github.com/apache/lucene/pull/1034] 
> The abstract method `copyBytes` in DataOutput has to copy from the input to a 
> copyBuffer and then write into ByteBuffersDataOutput.blocks; I think this is 
> unnecessary, as we can override it to copy directly from the input into the output.
> With this override we can:
>  # Reduce memory copy in `Lucene90CompressingStoredFieldsWriter#copyOneDoc` 
> -> `bufferedDocs.copyBytes(DataInput input)`
>  # Reduce memory copy in `Lucene90CompoundFormat.writeCompoundFile` -> 
> `data.copyBytes` when the input is `BufferedChecksumIndexInput` and the output is 
> `ByteBuffersDataOutput`
>  # Reduce memory copy in `IndexWriter#copySegmentAsIs` -> `copyFrom` -> `copyBytes`
>  
>  
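A simplified sketch of the override (the real PR works against
ByteBuffersDataOutput's internal block list; currentBlock() below is a
hypothetical stand-in for that bookkeeping):

{code:java}
import java.io.IOException;
import java.nio.ByteBuffer;
import org.apache.lucene.store.DataInput;

@Override
public void copyBytes(DataInput input, long numBytes) throws IOException {
  while (numBytes > 0) {
    ByteBuffer block = currentBlock(); // hypothetical: the current writable block
    int chunk = (int) Math.min(numBytes, block.remaining());
    // Read straight into the block's backing array: no intermediate copyBuffer.
    input.readBytes(block.array(), block.arrayOffset() + block.position(), chunk);
    block.position(block.position() + chunk);
    numBytes -= chunk;
  }
}
{code}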






[jira] [Resolved] (LUCENE-10649) Failure in TestDemoParallelLeafReader.testRandomMultipleSchemaGensSameField

2022-07-19 Thread Adrien Grand (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand resolved LUCENE-10649.
---
Fix Version/s: 9.3
   Resolution: Fixed

> Failure in TestDemoParallelLeafReader.testRandomMultipleSchemaGensSameField
> ---
>
> Key: LUCENE-10649
> URL: https://issues.apache.org/jira/browse/LUCENE-10649
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Vigya Sharma
>Priority: Major
> Fix For: 9.3
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Failing Build Link: 
> [https://jenkins.thetaphi.de/job/Lucene-main-Linux/35617/testReport/junit/org.apache.lucene.index/TestDemoParallelLeafReader/testRandomMultipleSchemaGensSameField/]
> Repro:
> {code:java}
> gradlew test --tests 
> TestDemoParallelLeafReader.testRandomMultipleSchemaGensSameField 
> -Dtests.seed=A7496D7D3957981A -Dtests.multiplier=3 -Dtests.locale=sr-Latn-BA 
> -Dtests.timezone=Etc/GMT-7 -Dtests.asserts=true -Dtests.file.encoding=UTF-8 
> {code}
> Error:
> {code:java}
> java.lang.AssertionError: expected:<103> but was:<2147483647>
>     at 
> __randomizedtesting.SeedInfo.seed([A7496D7D3957981A:F71866BCCEA1C903]:0)
>     at org.junit.Assert.fail(Assert.java:89)
>     at org.junit.Assert.failNotEquals(Assert.java:835)
>     at org.junit.Assert.assertEquals(Assert.java:647)
>     at org.junit.Assert.assertEquals(Assert.java:633)
>     at 
> org.apache.lucene.index.TestDemoParallelLeafReader.testRandomMultipleSchemaGensSameField(TestDemoParallelLeafReader.java:1347)
>     at 
> java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104)
>  {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10651) SimpleQueryParser stack overflow for large nested queries.

2022-07-18 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17568116#comment-17568116
 ] 

Adrien Grand commented on LUCENE-10651:
---

I'm not familiar with the simple query parser, but we seem to create sub state 
objects in {{consumeSubQuery}}; do we need to add the number of nested clauses 
of these sub states to the top-level state to properly count the overall number 
of clauses?
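
For illustration, a minimal, self-contained sketch of that bookkeeping (hypothetical names; the real parser would throw {{IndexSearcher.TooManyClauses}}): a single counter is shared by all nested parse states, so clauses created while consuming a sub query count against the same top-level limit.

{code:java}
// Sketch: pass the same counter instance to every nested state, so clauses
// added while parsing a sub query are counted at the top level too.
final class ClauseCounter {
  private final int limit;
  private int count;

  ClauseCounter(int limit) {
    this.limit = limit;
  }

  void addClause() {
    if (++count > limit) {
      // stand-in for IndexSearcher.TooManyClauses
      throw new IllegalStateException("too many clauses: " + count);
    }
  }
}
{code}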

> SimpleQueryParser stack overflow for large nested queries.
> --
>
> Key: LUCENE-10651
> URL: https://issues.apache.org/jira/browse/LUCENE-10651
> Project: Lucene - Core
>  Issue Type: Bug
>Affects Versions: 9.1, 8.10, 9.2, 9.3
>Reporter: Marc Handalian
>Priority: Major
>
> The OpenSearch project received an issue [1] where stack overflow can occur 
> for large nested boolean queries during rewrite.  In trying to reproduce this 
> error I've also encountered SO during parsing where queries expand beyond the 
> default 1024 clause limit.  This unit test will fail with SO:
> {code:java}
> public void testSimpleQueryParserWithTooManyClauses() {
>   StringBuilder queryString = new StringBuilder("foo");
>   for (int i = 0; i < 1024; i++) {
> queryString.append(" | bar").append(i).append(" + baz");
>   }
>   expectThrows(IndexSearcher.TooManyClauses.class, () -> 
> parse(queryString.toString()));
> }
>  {code}
> I would expect this case to also fail with TooManyClauses, is my 
> understanding correct?  If so, I've attempted a fix [2] that during parsing 
> increments a counter whenever a clause is added.
>  [1] [https://github.com/opensearch-project/OpenSearch/issues/3760]
>  [2] 
> [https://github.com/mch2/lucene/commit/6a558f17f448b92ae4cf8c43e0b759ff7425acdf]
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10651) SimpleQueryParser stack overflow for large nested queries.

2022-07-18 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17568114#comment-17568114
 ] 

Adrien Grand commented on LUCENE-10651:
---

bq. But I like your fix – it prevents a StackOverflowException when the 
returned Query would have failed with TooManyClauses anyways.

I like this approach too.

> SimpleQueryParser stack overflow for large nested queries.
> --
>
> Key: LUCENE-10651
> URL: https://issues.apache.org/jira/browse/LUCENE-10651
> Project: Lucene - Core
>  Issue Type: Bug
>Affects Versions: 9.1, 8.10, 9.2, 9.3
>Reporter: Marc Handalian
>Priority: Major
>
> The OpenSearch project received an issue [1] where stack overflow can occur 
> for large nested boolean queries during rewrite.  In trying to reproduce this 
> error I've also encountered SO during parsing where queries expand beyond the 
> default 1024 clause limit.  This unit test will fail with SO:
> {code:java}
> public void testSimpleQueryParserWithTooManyClauses() {
>   StringBuilder queryString = new StringBuilder("foo");
>   for (int i = 0; i < 1024; i++) {
> queryString.append(" | bar").append(i).append(" + baz");
>   }
>   expectThrows(IndexSearcher.TooManyClauses.class, () -> 
> parse(queryString.toString()));
> }
>  {code}
> I would expect this case to also fail with TooManyClauses, is my 
> understanding correct?  If so, I've attempted a fix [2] that during parsing 
> increments a counter whenever a clause is added.
>  [1] [https://github.com/opensearch-project/OpenSearch/issues/3760]
>  [2] 
> [https://github.com/mch2/lucene/commit/6a558f17f448b92ae4cf8c43e0b759ff7425acdf]
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10633) Dynamic pruning for queries sorted by SORTED(_SET) field

2022-07-18 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17568109#comment-17568109
 ] 

Adrien Grand commented on LUCENE-10633:
---

I opened https://github.com/mikemccand/luceneutil/pull/185.

> Dynamic pruning for queries sorted by SORTED(_SET) field
> 
>
> Key: LUCENE-10633
> URL: https://issues.apache.org/jira/browse/LUCENE-10633
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> LUCENE-9280 introduced the ability to dynamically prune non-competitive hits 
> when sorting by a numeric field, by leveraging the points index to skip 
> documents that do not compare better than the top of the priority queue 
> maintained by the field comparator.
> However queries sorted by a SORTED(_SET) field still look at all hits, which 
> is disappointing. Could we leverage the terms index to skip hits?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10633) Dynamic pruning for queries sorted by SORTED(_SET) field

2022-07-17 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17567649#comment-17567649
 ] 

Adrien Grand commented on LUCENE-10633:
---

Double yes [~mikemccand] ! I plan on opening a PR against luceneutil and I 
already opened LUCENE-10162 a while back about making this sort of things a 
more obvious choice. It also relates to [~gsmiller] 's work about running 
term-in-set queries using doc values, which would only help if doc values are 
enabled on the field.

> Dynamic pruning for queries sorted by SORTED(_SET) field
> 
>
> Key: LUCENE-10633
> URL: https://issues.apache.org/jira/browse/LUCENE-10633
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> LUCENE-9280 introduced the ability to dynamically prune non-competitive hits 
> when sorting by a numeric field, by leveraging the points index to skip 
> documents that do not compare better than the top of the priority queue 
> maintained by the field comparator.
> However queries sorted by a SORTED(_SET) field still look at all hits, which 
> is disappointing. Could we leverage the terms index to skip hits?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10633) Dynamic pruning for queries sorted by SORTED(_SET) field

2022-07-16 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17567485#comment-17567485
 ] 

Adrien Grand commented on LUCENE-10633:
---

Indeed the speedup is impressive. :) I should have noted that I had to tweak 
luceneutil to also index fields that were used for sorting so that the inverted 
index could be used to skip hits.

This change is very similar to LUCENE-9280, which led to annotation DD on 
[https://home.apache.org/~mikemccand/lucenebench/TermDayOfYearSort.html] and 
https://home.apache.org/~mikemccand/lucenebench/TermDTSort.html.

> Dynamic pruning for queries sorted by SORTED(_SET) field
> 
>
> Key: LUCENE-10633
> URL: https://issues.apache.org/jira/browse/LUCENE-10633
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> LUCENE-9280 introduced the ability to dynamically prune non-competitive hits 
> when sorting by a numeric field, by leveraging the points index to skip 
> documents that do not compare better than the top of the priority queue 
> maintained by the field comparator.
> However queries sorted by a SORTED(_SET) field still look at all hits, which 
> is disappointing. Could we leverage the terms index to skip hits?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10655) can we optimize visited bitset usage in HNSW graph search/indexing?

2022-07-16 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17567482#comment-17567482
 ] 

Adrien Grand commented on LUCENE-10655:
---

I've been wondering if using a simple int hash set would help. FixedBitSet is 
super efficient CPU-wise, but it also requires lots of memory on large segments 
while we typically only set a limited number of bits, so it can quickly become 
memory-bound for the random-access pattern we have when building the graph. An 
int hash set should also be cheaper to clear.
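
For illustration, a minimal open-addressing int set (a sketch, not Lucene code) shows why this can win: memory is proportional to the number of visited nodes rather than to the segment size, and clearing amounts to dropping a small table.

{code:java}
// Sketch of a primitive int hash set for non-negative node ids. Values are
// stored as value+1 so that 0 can mean "empty slot".
final class IntHashSet {
  private int[] table = new int[16]; // power-of-two size
  private int size;

  boolean add(int value) {
    if ((size + 1) * 2 > table.length) {
      rehash();
    }
    int mask = table.length - 1;
    int slot = (value * 0x9E3779B9) & mask; // multiplicative hash
    while (table[slot] != 0) {
      if (table[slot] == value + 1) {
        return false; // already visited
      }
      slot = (slot + 1) & mask; // linear probing
    }
    table[slot] = value + 1;
    size++;
    return true;
  }

  private void rehash() {
    int[] old = table;
    table = new int[old.length * 2];
    size = 0;
    for (int v : old) {
      if (v != 0) {
        add(v - 1);
      }
    }
  }
}
{code}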

> can we optimize visited bitset usage in HNSW graph search/indexing?
> ---
>
> Key: LUCENE-10655
> URL: https://issues.apache.org/jira/browse/LUCENE-10655
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/hnsw
>Reporter: Michael Sokolov
>Priority: Major
>
> When running {{luceneutil}}  I noticed that {{FixedBitSet.clear()}} dominates 
> the CPU profiler output. I had a few ideas:
>  # In upper graph layers, the occupied nodes are very sparse - maybe 
> {{SparseFixedBitSet}} would be a better fit for those
>  # We are caching these bitsets, but they are only used for a single search 
> (single document insert, during indexing). Should we cache across searches? 
> We would need to pool them though, and they would vary by field since fields 
> can have different numbers of vector nodes. This starts to get complex
>  # Are we sure that clearing a bitset is more efficient than allocating a new 
> one? Maybe the JDK maintains a pool of already-zeroed memory for us
> I think we could try specializing the bitset type by graph level, and then I 
> think we ought to measure the performance of allocation vs the limited reuse 
> that we currently have.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10633) Dynamic pruning for queries sorted by SORTED(_SET) field

2022-07-15 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17567268#comment-17567268
 ] 

Adrien Grand commented on LUCENE-10633:
---

I played with a prototype that starts dynamically pruning matches as soon as 
there are 128 competitive ordinals or fewer left, by pulling postings to iterate 
over the remaining documents that have competitive values. I still need to 
think about simplifying the logic and improving tests, but the initial benchmarks 
on wikimedium10m are very encouraging (assuming I didn't get anything wrong):

{noformat}
Task                            QPS baseline  StdDev   QPS my_modified_version  StdDev   Pct diff              p-value
Prefix3                               248.74   (6.1%)                   242.61   (5.8%)   -2.5% ( -13% -  10%)   0.191
BrowseMonthTaxoFacets                  27.71  (10.1%)                    27.34  (10.6%)   -1.3% ( -20% -  21%)   0.682
BrowseDateSSDVFacets                    4.99  (10.3%)                     4.94   (8.4%)   -1.1% ( -17% -  19%)   0.707
BrowseDateTaxoFacets                   44.26  (12.2%)                    43.97  (13.1%)   -0.7% ( -23% -  28%)   0.870
Wildcard                              137.61   (3.0%)                   136.97   (2.6%)   -0.5% (  -5% -   5%)   0.592
BrowseDayOfYearTaxoFacets              45.53  (12.4%)                    45.44  (13.4%)   -0.2% ( -23% -  29%)   0.963
IntNRQ                                198.27   (8.1%)                   197.94   (7.4%)   -0.2% ( -14% -  16%)   0.946
BrowseRandomLabelSSDVFacets            14.51   (2.2%)                    14.49   (2.4%)   -0.2% (  -4% -   4%)   0.835
AndHighHighDayTaxoFacets                8.32   (5.1%)                     8.31   (5.7%)   -0.1% ( -10% -  11%)   0.956
LowSpanNear                            46.83   (1.6%)                    46.82   (2.0%)   -0.0% (  -3% -   3%)   0.990
BrowseRandomLabelTaxoFacets            36.18  (10.5%)                    36.18  (12.6%)    0.0% ( -20% -  25%)   0.998
MedTermDayTaxoFacets                   73.59   (4.8%)                    73.66   (5.7%)    0.1% (  -9% -  11%)   0.954
OrNotHighHigh                        1476.08   (5.3%)                  1477.58   (3.9%)    0.1% (  -8% -   9%)   0.945
TermDTSort                            746.55   (2.4%)                   747.70   (1.7%)    0.2% (  -3% -   4%)   0.817
Fuzzy2                                 96.18   (1.3%)                    96.39   (1.4%)    0.2% (  -2% -   2%)   0.617
AndHighMedDayTaxoFacets               154.89   (1.8%)                   155.29   (1.6%)    0.3% (  -3% -   3%)   0.629
AndHighMed                            378.38   (3.7%)                   379.50   (4.4%)    0.3% (  -7% -   8%)   0.817
PKLookup                              243.14   (1.9%)                   243.99   (1.9%)    0.4% (  -3% -   4%)   0.552
HighPhrase                            279.13   (2.1%)                   280.21   (1.5%)    0.4% (  -3% -   4%)   0.510
Respell                                71.59   (1.5%)                    71.87   (1.5%)    0.4% (  -2% -   3%)   0.406
OrHighHigh                             66.95   (6.5%)                    67.21   (5.7%)    0.4% ( -11% -  13%)   0.837
Fuzzy1                                101.53   (1.5%)                   101.95   (1.5%)    0.4% (  -2% -   3%)   0.382
LowPhrase                             101.76   (2.3%)                   102.22   (2.6%)    0.5% (  -4% -   5%)   0.558
LowSloppyPhrase                        21.14   (3.1%)                    21.25   (4.1%)    0.5% (  -6% -   7%)   0.661
MedPhrase                             173.45   (2.7%)                   174.55   (2.6%)    0.6% (  -4% -   6%)   0.443
MedSpanNear                            17.77   (4.5%)                    17.88   (4.8%)    0.6% (  -8% -  10%)   0.661
OrHighNotLow                         1396.26   (5.6%)                  1406.85   (6.4%)    0.8% ( -10% -  13%)   0.692
OrHighMed                             162.41   (5.3%)                   163.69   (4.8%)    0.8% (  -8% -  11%)   0.625
HighTermDayOfYearSort                1476.11   (2.7%)                  1488.26   (2.4%)    0.8% (  -4% -   6%)   0.312
MedIntervalsOrdered                   113.65   (4.2%)                   114.59   (7.0%)    0.8% (  -9% -  12%)   0.652
OrHighLow                             828.13   (5.2%)                   835.45   (4.7%)    0.9% (  -8% -  11%)   0.574
MedTerm                              2356.21   (4.7%)                  2377.47   (5.0%)    0.9% (  -8% -  11%)   0.554
MedSloppyPhrase                        62.13   (3.4%)                    62.72   (3.9%)    0.9% (  -6% -   8%)   0.420
HighIntervalsOrdered                   18.19   (5.7%)                    18.37   (8.6%)    1.0% ( -12% -  16%)   0.673
AndHighHigh                            54.46   (6.2%)                    55.01   (6.3%)    1.0% ( -10% -  14%)   0.615
LowTerm                              2247.13   (4.7%)                  2270.19   (3.7%)    1.0% (  -7% -   9%)   0.446
OrNotHighLow                         1728.71   (4.3%)                  1748.19   (4.7%)    1.1% (  -7% -  10%)   0.427
HighTermTitleBDVSort                   14.31   (3.3%)                    14.47   (5.7%)
{noformat}

[jira] [Commented] (LUCENE-10650) "after_effect": "no" was removed what replaces it?

2022-07-13 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17566422#comment-17566422
 ] 

Adrien Grand commented on LUCENE-10650:
---

Indeed Elasticsearch would change the after effect to `L` instead of `no` to 
work around the fact that Lucene removed support for `no`. You may not need to 
reindex: I believe it would be possible to close your index, update its settings 
to use this new scripted similarity, and then open the index again to make the 
change effective (I did not test this).

> "after_effect": "no" was removed what replaces it?
> --
>
> Key: LUCENE-10650
> URL: https://issues.apache.org/jira/browse/LUCENE-10650
> Project: Lucene - Core
>  Issue Type: Wish
>Reporter: Nathan Meisels
>Priority: Major
>
> Hi!
> We have been using an old version of elasticsearch with the following 
> settings:
>  
> {code:java}
>         "default": {
>           "queryNorm": "1",
>           "type": "DFR",
>           "basic_model": "in",
>           "after_effect": "no",
>           "normalization": "no"
>         }{code}
>  
> I see [here|https://issues.apache.org/jira/browse/LUCENE-8015] that 
> "after_effect": "no" was removed.
> In 
> [old|https://github.com/apache/lucene-solr/blob/releases/lucene-solr/5.5.0/lucene/core/src/java/org/apache/lucene/search/similarities/BasicModelIn.java#L33]
>  version score was:
> {code:java}
> return tfn * (float)(log2((N + 1) / (n + 0.5)));{code}
> In 
> [new|https://github.com/apache/lucene-solr/blob/releases/lucene-solr/8.11.2/lucene/core/src/java/org/apache/lucene/search/similarities/BasicModelIn.java#L43]
>  version it's:
> {code:java}
> long N = stats.getNumberOfDocuments();
> long n = stats.getDocFreq();
> double A = log2((N + 1) / (n + 0.5));
> // basic model I should return A * tfn
> // which we rewrite to A * (1 + tfn) - A
> // so that it can be combined with the after effect while still guaranteeing
> // that the result is non-decreasing with tfn
> return A * aeTimes1pTfn * (1 - 1 / (1 + tfn));
> {code}
> I tried changing after_effect to "l" but the scoring is different from what 
> we are used to. (We depend heavily on the exact scoring.)
> Do you have any advice on how we can keep the same scoring as before?
> Thanks



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Resolved] (LUCENE-10619) Optimize the writeBytes in TermsHashPerField

2022-07-12 Thread Adrien Grand (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand resolved LUCENE-10619.
---
Fix Version/s: 9.3
   Resolution: Fixed

> Optimize the writeBytes in TermsHashPerField
> 
>
> Key: LUCENE-10619
> URL: https://issues.apache.org/jira/browse/LUCENE-10619
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Affects Versions: 9.2
>Reporter: tang donghai
>Priority: Major
> Fix For: 9.3
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> Because we don't know the length of the slice, writeBytes always writes bytes 
> one at a time instead of writing a block of bytes.
> Maybe we could return both offset and length in ByteBlockPool#allocSlice?
> 1. BYTE_BLOCK_SIZE is 32768, so an offset fits in at most 15 bits.
> 2. The slice size is at most 200, so it fits in 8 bits.
> So we could pack them together into an int: offset | length.
> There are only two places where this function is used, so the cost of changing 
> it is relatively small.
> If allocSlice could return the offset and length of the new slice, we could 
> change writeBytes as below:
> {code:java}
> // write a block of bytes each time
> while (remaining > 0) {
>   int offsetAndLength = allocSlice(bytes, offset);
>   length = min(remaining, (offsetAndLength & 0xff) - 1);
>   offset = offsetAndLength >> 8;
>   System.arraycopy(src, srcPos, bytePool.buffer, offset, length);
>   srcPos += length; // advance the read position in the source
>   remaining -= length;
>   offset += (length + 1);
> }
> {code}
> If this could work, I'd like to raise a PR.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10650) "after_effect": "no" was removed what replaces it?

2022-07-12 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17565384#comment-17565384
 ] 

Adrien Grand commented on LUCENE-10650:
---

{{query.boost}} is the {{query.getBoost()}} from DFRSimilarity's {{double 
score(BasicStats stats, double freq, double docLen)}}, which does 
{{stats.getBoost() * basicModel.score(stats, tfn, aeTimes1pTfn)}}.

The division by log(2) is not the tfn but a way to turn Math.log, which is a 
natural (base e) logarithm, into a log in base 2.

I wouldn't expect latency to be higher; this should get compiled to more or 
less the same code that you used to rely on in DFRSimilarity.
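
For reference, the change of base works because log2(x) = ln(x) / ln(2), and Java's {{Math.log}} is the natural logarithm:

{code:java}
// Change of base: log2(x) = ln(x) / ln(2). Math.log is the natural log.
static double log2(double x) {
  return Math.log(x) / Math.log(2);
}
{code}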

> "after_effect": "no" was removed what replaces it?
> --
>
> Key: LUCENE-10650
> URL: https://issues.apache.org/jira/browse/LUCENE-10650
> Project: Lucene - Core
>  Issue Type: Wish
>Reporter: Nathan Meisels
>Priority: Major
>
> Hi!
> We have been using an old version of elasticsearch with the following 
> settings:
>  
> {code:java}
>         "default": {
>           "queryNorm": "1",
>           "type": "DFR",
>           "basic_model": "in",
>           "after_effect": "no",
>           "normalization": "no"
>         }{code}
>  
> I see [here|https://issues.apache.org/jira/browse/LUCENE-8015] that 
> "after_effect": "no" was removed.
> In 
> [old|https://github.com/apache/lucene-solr/blob/releases/lucene-solr/5.5.0/lucene/core/src/java/org/apache/lucene/search/similarities/BasicModelIn.java#L33]
>  version score was:
> {code:java}
> return tfn * (float)(log2((N + 1) / (n + 0.5)));{code}
> In 
> [new|https://github.com/apache/lucene-solr/blob/releases/lucene-solr/8.11.2/lucene/core/src/java/org/apache/lucene/search/similarities/BasicModelIn.java#L43]
>  version it's:
> {code:java}
> long N = stats.getNumberOfDocuments();
> long n = stats.getDocFreq();
> double A = log2((N + 1) / (n + 0.5));
> // basic model I should return A * tfn
> // which we rewrite to A * (1 + tfn) - A
> // so that it can be combined with the after effect while still guaranteeing
> // that the result is non-decreasing with tfn
> return A * aeTimes1pTfn * (1 - 1 / (1 + tfn));
> {code}
> I tried changing after_effect to "l" but the scoring is different from what 
> we are used to. (We depend heavily on the exact scoring.)
> Do you have any advice on how we can keep the same scoring as before?
> Thanks



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10653) Should BlockMaxMaxscoreScorer rebuild its heap in bulk?

2022-07-12 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17565380#comment-17565380
 ] 

Adrien Grand commented on LUCENE-10653:
---

+1 to doing a bulk heapify

The fact that this scorer only handles 2 clauses for now is only a way to give 
us more time to evaluate when we should use it vs. WANDScorer in my opinion. 
Most likely it will be used for more than 2 clauses at some point in the future.
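
For reference, a bulk heapify (Floyd's bottom-up construction) runs in O(n), versus O(n log n) for n successive {{add}} calls. A minimal sketch on a min-heap of ints, not the actual scorer code:

{code:java}
// Build a binary min-heap in place in O(n) by sifting down every internal
// node, from the last one up to the root.
static void heapify(int[] heap, int size) {
  for (int i = size / 2 - 1; i >= 0; i--) {
    siftDown(heap, i, size);
  }
}

static void siftDown(int[] heap, int i, int size) {
  while (true) {
    int left = 2 * i + 1, right = left + 1, smallest = i;
    if (left < size && heap[left] < heap[smallest]) smallest = left;
    if (right < size && heap[right] < heap[smallest]) smallest = right;
    if (smallest == i) return; // heap property restored
    int tmp = heap[i]; heap[i] = heap[smallest]; heap[smallest] = tmp;
    i = smallest;
  }
}
{code}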

> Should BlockMaxMaxscoreScorer rebuild its heap in bulk?
> ---
>
> Key: LUCENE-10653
> URL: https://issues.apache.org/jira/browse/LUCENE-10653
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/search
>Reporter: Greg Miller
>Priority: Minor
>
> BMMScorer has to frequently rebuild its heap, and does so by clearing it and 
> then iteratively calling {{add}}. It would be more efficient to heapify in 
> bulk. This is more academic than anything right now though, since BMMScorer 
> is only used with two-clause disjunctions, so it's sort of a silly 
> optimization if it's not supporting a greater number of clauses.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10649) Failure in TestDemoParallelLeafReader.testRandomMultipleSchemaGensSameField

2022-07-12 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17565885#comment-17565885
 ] 

Adrien Grand commented on LUCENE-10649:
---

Good catch [~vigyas], it looks related indeed. The bug seems to be that 
{{ReindexingMergePolicy}} doesn't override {{findFullFlushMerges}} to wrap 
input readers, so the merged segment doesn't get fields from the parallel 
reader. Would you like to open a PR?

> Failure in TestDemoParallelLeafReader.testRandomMultipleSchemaGensSameField
> ---
>
> Key: LUCENE-10649
> URL: https://issues.apache.org/jira/browse/LUCENE-10649
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Vigya Sharma
>Priority: Major
>
> Failing Build Link: 
> [https://jenkins.thetaphi.de/job/Lucene-main-Linux/35617/testReport/junit/org.apache.lucene.index/TestDemoParallelLeafReader/testRandomMultipleSchemaGensSameField/]
> Repro:
> {code:java}
> gradlew test --tests 
> TestDemoParallelLeafReader.testRandomMultipleSchemaGensSameField 
> -Dtests.seed=A7496D7D3957981A -Dtests.multiplier=3 -Dtests.locale=sr-Latn-BA 
> -Dtests.timezone=Etc/GMT-7 -Dtests.asserts=true -Dtests.file.encoding=UTF-8 
> {code}
> Error:
> {code:java}
> java.lang.AssertionError: expected:<103> but was:<2147483647>
>     at 
> __randomizedtesting.SeedInfo.seed([A7496D7D3957981A:F71866BCCEA1C903]:0)
>     at org.junit.Assert.fail(Assert.java:89)
>     at org.junit.Assert.failNotEquals(Assert.java:835)
>     at org.junit.Assert.assertEquals(Assert.java:647)
>     at org.junit.Assert.assertEquals(Assert.java:633)
>     at 
> org.apache.lucene.index.TestDemoParallelLeafReader.testRandomMultipleSchemaGensSameField(TestDemoParallelLeafReader.java:1347)
>     at 
> java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104)
>  {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10603) Improve iteration of ords for SortedSetDocValues

2022-07-12 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17565402#comment-17565402
 ] 

Adrien Grand commented on LUCENE-10603:
---

+1

> Improve iteration of ords for SortedSetDocValues
> 
>
> Key: LUCENE-10603
> URL: https://issues.apache.org/jira/browse/LUCENE-10603
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Lu Xugang
>Assignee: Lu Xugang
>Priority: Trivial
>  Time Spent: 5h 20m
>  Remaining Estimate: 0h
>
> Now that SortedSetDocValues#docValueCount has been added in Lucene 9.2, should 
> we refactor the implementation of ords iteration to use docValueCount instead 
> of NO_MORE_ORDS, similar to how SortedNumericDocValues does it?
> From 
> {code:java}
> for (long ord = values.nextOrd();ord != SortedSetDocValues.NO_MORE_ORDS; ord 
> = values.nextOrd()) {
> }{code}
> to
> {code:java}
> for (int i = 0; i < values.docValueCount(); i++) {
>   long ord = values.nextOrd();
> }{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Updated] (LUCENE-10600) SortedSetDocValues#docValueCount should be an int, not long

2022-07-12 Thread Adrien Grand (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand updated LUCENE-10600:
--
Fix Version/s: 9.3

> SortedSetDocValues#docValueCount should be an int, not long
> ---
>
> Key: LUCENE-10600
> URL: https://issues.apache.org/jira/browse/LUCENE-10600
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Adrien Grand
>Assignee: Lu Xugang
>Priority: Minor
> Fix For: 9.3
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10480) Specialize 2-clauses disjunctions

2022-07-12 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17565375#comment-17565375
 ] 

Adrien Grand commented on LUCENE-10480:
---

+1 to explore this in a separate issue.

bq. Do you think this slowdown to AndHighOrMedMed may be considered as blocker 
to 9.3 release? 

I wouldn't say blocker, but maybe we could indeed buy ourselves some time by 
only using this new scorer on top-level disjunctions for now, so that we have 
more time to figure out whether we should stick to BMW or switch to BMM for 
inner disjunctions.

> Specialize 2-clauses disjunctions
> -
>
> Key: LUCENE-10480
> URL: https://issues.apache.org/jira/browse/LUCENE-10480
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Adrien Grand
>Priority: Minor
>  Time Spent: 7h 20m
>  Remaining Estimate: 0h
>
> WANDScorer is nice, but it also has lots of overhead to maintain its 
> invariants: one linked list for the current candidates, one priority queue of 
> scorers that are behind, another one for scorers that are ahead. All this 
> could be simplified in the 2-clauses case, which feels worth specializing for 
> as it's very common that end users enter queries that only have two terms?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10650) "after_effect": "no" was removed what replaces it?

2022-07-11 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17564988#comment-17564988
 ] 

Adrien Grand commented on LUCENE-10650:
---

Hi Nathan. When we introduced dynamic pruning to Lucene, we also introduced the 
requirement that similarities produce scores that are non-decreasing when tf 
increases or when the length norm decreases (all other things equal). 
Unfortunately, this property could not be retained while keeping DFR 
similarities pluggable as they were, so we removed support for the "no" after 
effect and only retained L and B.

It looks like this specific similarity that you are looking for could still be 
implemented in a way that scores are non-decreasing with increasing tf or 
decreasing norm, so you should be able to re-implement it using a scripted 
similarity for instance 
(https://www.elastic.co/guide/en/elasticsearch/reference/current/index-modules-similarity.html#scripted_similarity)
 with something like below (untested):

{code}
"similarity": {
  "my_dfr_sim": {
"type": "scripted",
"weight_script": {
  "source": "return query.boost * 
Math.log((field.docCount+1.0)/(term.docFreq+0.5)) / Math.log(2);"
},
"script": {
  "source": "return weight * doc.freq;"
}
  }
}
{code}

> "after_effect": "no" was removed what replaces it?
> --
>
> Key: LUCENE-10650
> URL: https://issues.apache.org/jira/browse/LUCENE-10650
> Project: Lucene - Core
>  Issue Type: Wish
>Reporter: Nathan Meisels
>Priority: Major
>
> Hi!
> We have been using an old version of elasticsearch with the following 
> settings:
>  
> {code:java}
>         "default": {
>           "queryNorm": "1",
>           "type": "DFR",
>           "basic_model": "in",
>           "after_effect": "no",
>           "normalization": "no"
>         }{code}
>  
> I see [here|https://issues.apache.org/jira/browse/LUCENE-8015] that 
> "after_effect": "no" was removed.
> In 
> [old|https://github.com/apache/lucene-solr/blob/releases/lucene-solr/5.5.0/lucene/core/src/java/org/apache/lucene/search/similarities/BasicModelIn.java#L33]
>  version score was:
> {code:java}
> return tfn * (float)(log2((N + 1) / (n + 0.5)));{code}
> In 
> [new|https://github.com/apache/lucene-solr/blob/releases/lucene-solr/8.11.2/lucene/core/src/java/org/apache/lucene/search/similarities/BasicModelIn.java#L43]
>  version it's:
> {code:java}
> long N = stats.getNumberOfDocuments();
> long n = stats.getDocFreq();
> double A = log2((N + 1) / (n + 0.5));
> // basic model I should return A * tfn
> // which we rewrite to A * (1 + tfn) - A
> // so that it can be combined with the after effect while still guaranteeing
> // that the result is non-decreasing with tfn
> return A * aeTimes1pTfn * (1 - 1 / (1 + tfn));
> {code}
> I tried changing after_effect to "l" but the scoring is different from what 
> we are used to. (We depend heavily on the exact scoring.)
> Do you have any advice on how we can keep the same scoring as before?
> Thanks



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Resolved] (LUCENE-10650) "after_effect": "no" was removed what replaces it?

2022-07-11 Thread Adrien Grand (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand resolved LUCENE-10650.
---
Resolution: Won't Fix

> "after_effect": "no" was removed what replaces it?
> --
>
> Key: LUCENE-10650
> URL: https://issues.apache.org/jira/browse/LUCENE-10650
> Project: Lucene - Core
>  Issue Type: Wish
>Reporter: Nathan Meisels
>Priority: Major
>
> Hi!
> We have been using an old version of elasticsearch with the following 
> settings:
>  
> {code:java}
>         "default": {
>           "queryNorm": "1",
>           "type": "DFR",
>           "basic_model": "in",
>           "after_effect": "no",
>           "normalization": "no"
>         }{code}
>  
> I see [here|https://issues.apache.org/jira/browse/LUCENE-8015] that 
> "after_effect": "no" was removed.
> In 
> [old|https://github.com/apache/lucene-solr/blob/releases/lucene-solr/5.5.0/lucene/core/src/java/org/apache/lucene/search/similarities/BasicModelIn.java#L33]
>  version score was:
> {code:java}
> return tfn * (float)(log2((N + 1) / (n + 0.5)));{code}
> In 
> [new|https://github.com/apache/lucene-solr/blob/releases/lucene-solr/8.11.2/lucene/core/src/java/org/apache/lucene/search/similarities/BasicModelIn.java#L43]
>  version it's:
> {code:java}
> long N = stats.getNumberOfDocuments();
> long n = stats.getDocFreq();
> double A = log2((N + 1) / (n + 0.5));
> // basic model I should return A * tfn
> // which we rewrite to A * (1 + tfn) - A
> // so that it can be combined with the after effect while still guaranteeing
> // that the result is non-decreasing with tfn
> return A * aeTimes1pTfn * (1 - 1 / (1 + tfn));
> {code}
> I tried changing after_effect to "l" but the scoring is different from what 
> we are used to. (We depend heavily on the exact scoring.)
> Do you have any advice on how we can keep the same scoring as before?
> Thanks



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Resolved] (LUCENE-10647) Failure in TestMergeSchedulerExternal.testSubclassConcurrentMergeScheduler

2022-07-11 Thread Adrien Grand (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand resolved LUCENE-10647.
---
Fix Version/s: 9.3
   Resolution: Fixed

> Failure in TestMergeSchedulerExternal.testSubclassConcurrentMergeScheduler
> --
>
> Key: LUCENE-10647
> URL: https://issues.apache.org/jira/browse/LUCENE-10647
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Vigya Sharma
>Priority: Major
> Fix For: 9.3
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Recent builds are intermittently failing on 
> TestMergeSchedulerExternal.testSubclassConcurrentMergeScheduler. Example:
> https://jenkins.thetaphi.de/job/Lucene-main-Linux/35576/testReport/junit/org.apache.lucene/TestMergeSchedulerExternal/testSubclassConcurrentMergeScheduler/



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10480) Specialize 2-clauses disjunctions

2022-07-11 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17564885#comment-17564885
 ] 

Adrien Grand commented on LUCENE-10480:
---

I haven't tried to reproduce it, but the steps you took by running on wikibigall 
with the nightly tasks file sound good to me. Another thing that sometimes 
changes performance is the doc ID order; were you maybe using multiple indexing 
threads?

Ignoring the fact that we cannot reproduce the slowdown, if I try to think of 
the main differences between WANDScorer and BlockMaxMaxscoreScorer for 
AndHighOrMedMed, I think the main one is the way that {{advanceShallow}} is 
computed. Conjunctions use block boundaries of the clause that has the lowest 
cost, so this could explain why we are seeing a slowdown with AndHighOrMedMed 
(since the conjunction uses block boundaries of OrMedMed) and not 
AndMedOrHighHigh (since the conjunction uses block boundaries of Med). Maybe we 
could explore other approaches for {{advanceShallow}} such as taking the 
minimum block boundary across essential clauses only instead of all clauses.

> Specialize 2-clauses disjunctions
> -
>
> Key: LUCENE-10480
> URL: https://issues.apache.org/jira/browse/LUCENE-10480
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Adrien Grand
>Priority: Minor
>  Time Spent: 7h 20m
>  Remaining Estimate: 0h
>
> WANDScorer is nice, but it also has lots of overhead to maintain its 
> invariants: one linked list for the current candidates, one priority queue of 
> scorers that are behind, another one for scorers that are ahead. All this 
> could be simplified in the 2-clauses case, which feels worth specializing for 
> as it's very common that end users enter queries that only have two terms?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10480) Specialize 2-clauses disjunctions

2022-07-09 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17564565#comment-17564565
 ] 

Adrien Grand commented on LUCENE-10480:
---

[AndMedOrHighHigh|https://home.apache.org/~mikemccand/lucenebench/AndMedOrHighHigh.html]
 recovered fully but 
[AndHighOrMedMed|https://home.apache.org/~mikemccand/lucenebench/AndHighOrMedMed.html]
 only a bit. I'm unsure what explains why there is still a slowdown compared to 
BMW.

> Specialize 2-clauses disjunctions
> -
>
> Key: LUCENE-10480
> URL: https://issues.apache.org/jira/browse/LUCENE-10480
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Adrien Grand
>Priority: Minor
>  Time Spent: 7h 20m
>  Remaining Estimate: 0h
>
> WANDScorer is nice, but it also has lots of overhead to maintain its 
> invariants: one linked list for the current candidates, one priority queue of 
> scorers that are behind, another one for scorers that are ahead. All this 
> could be simplified in the 2-clauses case, which feels worth specializing for 
> as it's very common that end users enter queries that only have two terms?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10480) Specialize 2-clauses disjunctions

2022-07-06 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17563022#comment-17563022
 ] 

Adrien Grand commented on LUCENE-10480:
---

I still suspect that one issue when only running queries that are very good at 
dynamic pruning is that the JVM doesn't have time to warm up. These queries can 
figure out the top 10 hits by only evaluating a few thousand hits, so parts of 
the logic probably still run in interpreted mode. The fact that queries run 
slower when you run them in isolation further suggests that this is the 
problematic scenario, not the case when the benchmark includes multiple types 
of queries.

> Specialize 2-clauses disjunctions
> -
>
> Key: LUCENE-10480
> URL: https://issues.apache.org/jira/browse/LUCENE-10480
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Adrien Grand
>Priority: Minor
>  Time Spent: 6h
>  Remaining Estimate: 0h
>
> WANDScorer is nice, but it also has lots of overhead to maintain its 
> invariants: one linked list for the current candidates, one priority queue of 
> scorers that are behind, another one for scorers that are ahead. All this 
> could be simplified in the 2-clauses case, which feels worth specializing for 
> as it's very common that end users enter queries that only have two terms?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10480) Specialize 2-clauses disjunctions

2022-07-05 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17562730#comment-17562730
 ] 

Adrien Grand commented on LUCENE-10480:
---

Looking at this new scorer from the perspective of disjunctions within 
conjunctions, maybe there are bits from advance() that we could move to 
matches(), so that we would hand control over to the other clause before we 
start doing expensive operations like computing scores. What do you think 
[~zacharymorn]?
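
For illustration, the shape of such a split (a sketch, not the actual WANDScorer change): the approximation stays cheap, and anything expensive is deferred to {{matches()}}, so a conjunction can reject the doc on another clause before the expensive work runs.

{code:java}
import java.io.IOException;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.search.TwoPhaseIterator;

// Sketch: defer expensive confirmation work from advance() to matches().
final class DeferredWorkTwoPhase extends TwoPhaseIterator {
  DeferredWorkTwoPhase(DocIdSetIterator approximation) {
    super(approximation);
  }

  @Override
  public boolean matches() throws IOException {
    // expensive work (advancing tail clauses, computing scores) would go here
    return true;
  }

  @Override
  public float matchCost() {
    return 10f; // rough estimate of the per-doc cost of matches()
  }
}
{code}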

> Specialize 2-clauses disjunctions
> -
>
> Key: LUCENE-10480
> URL: https://issues.apache.org/jira/browse/LUCENE-10480
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Adrien Grand
>Priority: Minor
>  Time Spent: 5h 40m
>  Remaining Estimate: 0h
>
> WANDScorer is nice, but it also has lots of overhead to maintain its 
> invariants: one linked list for the current candidates, one priority queue of 
> scorers that are behind, another one for scorers that are ahead. All this 
> could be simplified in the 2-clauses case, which feels worth specializing for 
> as it's very common that end users enter queries that only have two terms?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10480) Specialize 2-clauses disjunctions

2022-07-05 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17562711#comment-17562711
 ] 

Adrien Grand commented on LUCENE-10480:
---

Nightly benchmarks picked up the change and top-level disjunctions are seeing 
massive speedups, see 
[OrHighHigh|http://people.apache.org/~mikemccand/lucenebench/OrHighHigh.html] 
or [OrHighMed|http://people.apache.org/~mikemccand/lucenebench/OrHighMed.html]. 
However disjunctions within conjunctions got a slowdown, see 
[AndHighOrMedMed|http://people.apache.org/~mikemccand/lucenebench/AndHighOrMedMed.html]
 or 
[AndMedOrHighHigh|http://people.apache.org/~mikemccand/lucenebench/AndMedOrHighHigh.html].

> Specialize 2-clauses disjunctions
> -
>
> Key: LUCENE-10480
> URL: https://issues.apache.org/jira/browse/LUCENE-10480
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Adrien Grand
>Priority: Minor
>  Time Spent: 5h 40m
>  Remaining Estimate: 0h
>
> WANDScorer is nice, but it also has lots of overhead to maintain its 
> invariants: one linked list for the current candidates, one priority queue of 
> scorers that are behind, another one for scorers that are ahead. All this 
> could be simplified in the 2-clauses case, which feels worth specializing for 
> as it's very common that end users enter queries that only have two terms?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Resolved] (LUCENE-10636) Could the partial score sum from essential list scores be cached?

2022-07-05 Thread Adrien Grand (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand resolved LUCENE-10636.
---
Fix Version/s: 9.3
   Resolution: Fixed

> Could the partial score sum from essential list scores be cached?
> -
>
> Key: LUCENE-10636
> URL: https://issues.apache.org/jira/browse/LUCENE-10636
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Zach Chen
>Priority: Minor
> Fix For: 9.3
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> This is a follow-up issue from discussion 
> [https://github.com/apache/lucene/pull/972#discussion_r909300200] . Currently 
> in the implementation of BlockMaxMaxscoreScorer, there's duplicated 
> computation of summing up scores from essential list scorers. We would like 
> to see if this duplicated computation can be cached without introducing much 
> overhead or data structure that might out-weight the benefit of caching.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10151) Add timeout support to IndexSearcher

2022-07-04 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17562247#comment-17562247
 ] 

Adrien Grand commented on LUCENE-10151:
---

For reference, I opened new JIRA issues for suggested follow-ups: LUCENE-10640, 
LUCENE-10641.

> Add timeout support to IndexSearcher
> 
>
> Key: LUCENE-10151
> URL: https://issues.apache.org/jira/browse/LUCENE-10151
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/search
>Reporter: Greg Miller
>Priority: Minor
> Fix For: 9.3
>
>  Time Spent: 3h 40m
>  Remaining Estimate: 0h
>
> I'd like to explore adding optional "timeout" capabilities to 
> {{IndexSearcher}}. This would enable users to (optionally) specify a maximum 
> time budget for search execution. If the search "times out", partial results 
> would be available.
> This idea originated on the dev list (thanks [~jpountz] for the suggestion). 
> Thread for reference: 
> [http://mail-archives.apache.org/mod_mbox/lucene-dev/202110.mbox/%3CCAL8PwkZdNGmYJopPjeXYK%3DF7rvLkWon91UEXVxMM4MeeJ3UHxQ%40mail.gmail.com%3E]
>  
> A couple things to watch out for with this change:
>  # We want to make sure it's robust to a two-phase query evaluation scenario 
> where the "approximate" step matches a large number of candidates but the 
> "confirmation" step matches very few (or none). This is a particularly tricky 
> case.
>  # We want to make sure the {{TotalHits#Relation}} reported by {{TopDocs}} is 
> {{GREATER_THAN_OR_EQUAL_TO}} if the query times out
>  # We want to make sure it plays nice with the {{LRUCache}} since it iterates 
> the query to pre-populate a {{BitSet}} when caching. That step shouldn't be 
> allowed to overrun the timeout. The proper way to handle this probably needs 
> some thought.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10151) Add timeout support to IndexSearcher

2022-07-04 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17562219#comment-17562219
 ] 

Adrien Grand commented on LUCENE-10151:
---

bq. I've merged this now to main and backported to 9.x

Did you forget to push to branch_9x? I cannot see the change there.

> Add timeout support to IndexSearcher
> 
>
> Key: LUCENE-10151
> URL: https://issues.apache.org/jira/browse/LUCENE-10151
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/search
>Reporter: Greg Miller
>Priority: Minor
> Fix For: 9.3
>
>  Time Spent: 3h 40m
>  Remaining Estimate: 0h
>
> I'd like to explore adding optional "timeout" capabilities to 
> {{IndexSearcher}}. This would enable users to (optionally) specify a maximum 
> time budget for search execution. If the search "times out", partial results 
> would be available.
> This idea originated on the dev list (thanks [~jpountz] for the suggestion). 
> Thread for reference: 
> [http://mail-archives.apache.org/mod_mbox/lucene-dev/202110.mbox/%3CCAL8PwkZdNGmYJopPjeXYK%3DF7rvLkWon91UEXVxMM4MeeJ3UHxQ%40mail.gmail.com%3E]
>  
> A couple things to watch out for with this change:
>  # We want to make sure it's robust to a two-phase query evaluation scenario 
> where the "approximate" step matches a large number of candidates but the 
> "confirmation" step matches very few (or none). This is a particularly tricky 
> case.
>  # We want to make sure the {{TotalHits#Relation}} reported by {{TopDocs}} is 
> {{GREATER_THAN_OR_EQUAL_TO}} if the query times out
>  # We want to make sure it plays nice with the {{LRUCache}} since it iterates 
> the query to pre-populate a {{BitSet}} when caching. That step shouldn't be 
> allowed to overrun the timeout. The proper way to handle this probably needs 
> some thought.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Created] (LUCENE-10641) IndexSearcher#setTimeout should also abort query rewrites, point ranges and vector searches

2022-07-04 Thread Adrien Grand (Jira)
Adrien Grand created LUCENE-10641:
-

 Summary: IndexSearcher#setTimeout should also abort query 
rewrites, point ranges and vector searches
 Key: LUCENE-10641
 URL: https://issues.apache.org/jira/browse/LUCENE-10641
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Adrien Grand


{{IndexSearcher}} only checks the query timeout in the collection phase for 
now. It should check the timeout in other operations that may take time, such 
as intersecting a fuzzy automaton with a terms dictionary, evaluating points 
that fall into a range, or running a vector search. This should be possible to 
do by wrapping the IndexReader's data structures in the same way as 
{{ExitableDirectoryReader}} does.
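
For illustration, a minimal sketch of the wrapping idea (hypothetical class, in the spirit of {{ExitableDirectoryReader}}): a {{TermsEnum}} wrapper that checks a deadline as terms are visited.

{code:java}
import java.io.IOException;
import org.apache.lucene.index.FilterLeafReader;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.util.BytesRef;

// Sketch: abort long-running terms-dictionary intersections (e.g. a fuzzy
// automaton) once a deadline has passed.
final class TimeoutTermsEnum extends FilterLeafReader.FilterTermsEnum {
  private final long deadlineNanos;

  TimeoutTermsEnum(TermsEnum in, long deadlineNanos) {
    super(in);
    this.deadlineNanos = deadlineNanos;
  }

  @Override
  public BytesRef next() throws IOException {
    if (System.nanoTime() - deadlineNanos > 0) {
      throw new RuntimeException("Timeout exceeded while visiting terms");
    }
    return in.next();
  }
}
{code}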



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Created] (LUCENE-10640) Can TimeLimitingBulkScorer exponentially grow the window size?

2022-07-04 Thread Adrien Grand (Jira)
Adrien Grand created LUCENE-10640:
-

 Summary: Can TimeLimitingBulkScorer exponentially grow the window 
size?
 Key: LUCENE-10640
 URL: https://issues.apache.org/jira/browse/LUCENE-10640
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Adrien Grand


{{TimeLimitingBulkScorer}} scores 100 documents at a time. Unfortunately, bulk 
scorers have nonzero overhead per {{BulkScorer#score}} call, since they need to 
set the scorer, figure out how to combine the Scorer with the competitive 
iterator of the collector, etc. Larger windows of doc IDs would help better 
amortize such costs.

Could we grow the window of scored doc IDs exponentially, maybe with guarantees 
such as making sure that the new window is at most 50% of the doc IDs that have 
been scored so far, so that this exponential growth could only exceed the 
configured timeout by 50%?
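
A minimal sketch of the proposed growth policy (illustrative only, not the actual {{TimeLimitingBulkScorer}} change):

{code:java}
// Grow the window of scored doc IDs exponentially, but cap each window at 50%
// of what has been scored so far: a timeout that is checked between windows
// can then be exceeded by at most ~50%.
public class WindowGrowthSketch {
  public static void main(String[] args) {
    final int maxDoc = 1_000_000;
    int window = 100; // the current fixed window size
    int scored = 0;
    while (scored < maxDoc) {
      // a real scorer would check the timeout here, then score the window
      scored = Math.min(scored + window, maxDoc);
      // double the window, capped at 50% of the docs scored so far, keeping
      // the initial window size as a floor
      window = Math.max(100, Math.min(window * 2, scored / 2));
    }
  }
}
{code}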



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10616) Moving to dictionaries has made stored fields slower at skipping

2022-07-04 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17562072#comment-17562072
 ] 

Adrien Grand commented on LUCENE-10616:
---

Thanks [~joe hou] for giving it a try! The high-level idea looks good to me, of 
somehow leveraging information in the {{StoredFieldVisitor}} to only decompress 
the bits that matter. In terms of implementation, I would like to see if we can 
avoid introducing the new {{StoredFieldVisitor#hasMoreFieldsToVisit}} method 
and rely on {{StoredFieldVisitor#needsField}} returning {{STOP}} instead. The 
fact that decompressing data and decoding decompressed data are interleaved 
also makes the code harder to test. I wonder if we could change the signature 
of {{Decompressor#decompress}} to return an {{InputStream}} that would 
decompress data lazily instead of filling a {{BytesRef}}, so that it's possible 
to stop decompressing early while still being able to test decompression and 
decoding in isolation.
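
For illustration, the suggested signature change could look roughly like this (a sketch under assumptions, not the actual {{Decompressor}} API):

{code:java}
import java.io.IOException;
import java.io.InputStream;
import org.apache.lucene.store.DataInput;

// Sketch: expose decompression as a lazy stream so a StoredFieldVisitor that
// returns STOP lets the reader abandon the remaining sub blocks undecoded.
abstract class LazyDecompressor {
  /**
   * Returns a stream over the decompressed bytes; sub blocks are only
   * decompressed as the stream is consumed.
   */
  abstract InputStream decompress(DataInput in, int originalLength) throws IOException;
}
{code}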

> Moving to dictionaries has made stored fields slower at skipping
> 
>
> Key: LUCENE-10616
> URL: https://issues.apache.org/jira/browse/LUCENE-10616
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Adrien Grand
>Priority: Minor
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> [~ywelsch] has been digging into a regression of stored fields retrieval that 
> is caused by LUCENE-9486.
> Say your documents have two stored fields, one that is 100B and is stored 
> first, and the other one that is 100kB, and you are only interested in the 
> first one. While the idea behind blocks of stored fields is to store multiple 
> documents in the same block to leverage redundancy across documents, 
> sometimes documents are larger than the block size. As soon as documents are 
> larger than 2x the block size, our stored fields format splits such large 
> documents into multiple blocks, so that you wouldn't need to decompress 
> everything only to retrieve a couple small fields.
> Before LUCENE-9486, BEST_SPEED had a block size of 16kB, so only retrieving 
> the first field value would only need to decompress 16kB of data. With the 
> move to preset dictionaries in LUCENE-9486 and then LUCENE-9917, we now have 
> blocks of 80kB, so stored fields would now need to decompress 80kB of data, 
> 5x more than before.
> With dictionaries, our blocks are now split into 10 sub blocks. We happen to 
> eagerly decompress all sub blocks that intersect with the stored document, 
> which is why we would decompress 80kB of data, but this is an implementation 
> detail. It should be possible to decompress these sub blocks lazily so that 
> we would only decompress those that intersect with one of the field values 
> that the user is interested in retrieving?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10635) Ensure test coverage for WANDScorer after additional scorers get added

2022-07-02 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17561754#comment-17561754
 ] 

Adrien Grand commented on LUCENE-10635:
---

Thinking out loud, maybe one way to do this would be to have a specialized 
WANDQuery in the test folder that is guaranteed to produce a WANDScorer?

> Ensure test coverage for WANDScorer after additional scorers get added
> --
>
> Key: LUCENE-10635
> URL: https://issues.apache.org/jira/browse/LUCENE-10635
> Project: Lucene - Core
>  Issue Type: Test
>Reporter: Zach Chen
>Priority: Major
>
> This is a follow-up issue from discussions 
> [https://github.com/apache/lucene/pull/972#issuecomment-1170684358] & 
> [https://github.com/apache/lucene/pull/972#pullrequestreview-1024377641] .
>  
> As additional scorers such as BlockMaxMaxscoreScorer get added, some tests in 
> TestWANDScorer that used to test WANDScorer now test BlockMaxMaxscoreScorer 
> instead, reducing test coverage for WANDScorer. We would like to see how we 
> can ensure TestWANDScorer reliably tests WANDScorer, perhaps by initiating 
> the scorer directly inside the tests?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10639) WANDScorer performs better without two-phase

2022-07-02 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17561753#comment-17561753
 ] 

Adrien Grand commented on LUCENE-10639:
---

On a recent PR [~ChrisHegarty] found out that Hotspot was not always able to 
optimize "if (liveDocs == null)" checks within for loops 
(https://github.com/apache/lucene/pull/812#discussion_r851301618). Since then 
I've been wondering if DefaultBulkScorer is affected by this. If it is, we 
could look into the performance benefit of moving the {{if (liveDocs == null)}} 
check out of the for loop here: 
https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/Weight.java#L311-L317.
 This might also help the compiler figure out that the approximation and 
TwoPhaseIterator's matches run in sequence?
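A minimal sketch of the hoisting idea (simplified: the actual DefaultBulkScorer 
loop also deals with two-phase iterators and score windows):

{code:java}
import java.io.IOException;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.search.LeafCollector;
import org.apache.lucene.util.Bits;

// One branch outside the loop instead of one branch per document.
static void scoreAll(LeafCollector collector, DocIdSetIterator it, Bits liveDocs)
    throws IOException {
  if (liveDocs == null) {
    // No deletions: dedicated loop with no per-document liveness check.
    for (int doc = it.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc = it.nextDoc()) {
      collector.collect(doc);
    }
  } else {
    for (int doc = it.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc = it.nextDoc()) {
      if (liveDocs.get(doc)) {
        collector.collect(doc);
      }
    }
  }
}
{code}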

> WANDScorer performs better without two-phase
> 
>
> Key: LUCENE-10639
> URL: https://issues.apache.org/jira/browse/LUCENE-10639
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/search
>Reporter: Greg Miller
>Priority: Major
>
> After looking at the recent improvement [~jpountz] made to WAND scoring in 
> LUCENE-10634, which does additional work during match confirmation to not 
> confirm a match whose score wouldn't be competitive, I wanted to see how 
> performance would shift if we squashed the two-phase iteration completely and 
> only returned true matches (that were also known to be competitive by score) 
> in the "approximation" phase. I was a bit surprised to find that luceneutil 
> benchmarks (run with {{wikimediumall}}) improve significantly on some 
> disjunction tasks and don't show significant regressions anywhere else.
> Note that I used LUCENE-10634 as a baseline, and built my candidate change on 
> top of that. The diff can be seen here: 
> [DIFF|https://github.com/gsmiller/lucene/compare/b2d46440998fe4a972e8cc8c948580111359ed0f..c5bab794c92dbc66e70f9389948c1bdfe9b45231]
> A simple conclusion here might be that we shouldn't do two-phase iteration in 
> WANDScorer, but I'm pretty sure that's not right. I wonder if what's really 
> going on is that we're under-estimating the cost of confirming a match? Right 
> now we just return the tail size as the cost. While the cost of confirming a 
> match is proportional to the tail size, the actual work involved can be quite 
> significant (having to advance tail iterators to new blocks and decompress 
> them). I wonder if the WAND second phase is being run too early on 
> approximate candidates, and if less-expensive, (and even possibly more 
> restrictive?), second phases could/should be running first?
> I'm raising this here as more of a curiosity to see if it sparks ideas on how 
> to move forward. Again, I'm not proposing we do away with two-phase 
> iteration, but it seems we might be able to improve things. Maybe I'll 
> explore changing the cost heuristic next. Also, maybe there's some different 
> benchmarking that would be useful here that I may not be familiar with?
> Benchmark results on wikimediumall:
> {code:java}
> Task                        QPS baseline (StdDev)  QPS candidate (StdDev)  Pct diff               p-value
> HighTermTitleBDVSort           22.52 (18.9%)          21.66 (15.6%)        -3.8% ( -32% -   37%)    0.485
> Prefix3                         9.38  (9.2%)           9.09 (10.6%)        -3.1% ( -20% -   18%)    0.326
> HighTermMonthSort              25.37 (16.0%)          24.87 (17.1%)        -2.0% ( -30% -   37%)    0.710
> MedTermDayTaxoFacets            9.62  (4.2%)           9.51  (4.1%)        -1.2% (  -9% -    7%)    0.368
> TermDTSort                     74.69 (18.0%)          74.13 (18.2%)        -0.7% ( -31% -   43%)    0.897
> HighTermDayOfYearSort          52.64 (16.1%)          52.32 (15.4%)        -0.6% ( -27% -   36%)    0.903
> BrowseMonthTaxoFacets           8.64 (19.1%)           8.59 (19.8%)        -0.6% ( -33% -   47%)    0.926
> BrowseDateSSDVFacets            0.86  (9.5%)           0.86 (13.1%)        -0.4% ( -20% -   24%)    0.914
> PKLookup                      147.18  (3.9%)         146.66  (3.3%)        -0.3% (  -7% -    7%)    0.759
> BrowseDayOfYearSSDVFacets       3.47  (4.5%)           3.45  (4.8%)        -0.3% (  -9% -    9%)    0.822
> Wildcard                       36.36  (4.4%)          36.26  (5.2%)        -0.3% (  -9% -    9%)    0.866
> BrowseMonthSSDVFacets           4.15 (12.7%)           4.13 (12.8%)        -0.3% ( -22% -   28%)    0.950
> AndHighMedDayTaxoFacets        15.21  (2.7%)          15.18  (2.9%)        -0.2% (  -5% -    5%)    0.819
> Fuzzy1                         68.33  (1.8%)          68.22  (2.0%)        -0.2% (  -3% -    3%)

[jira] [Commented] (LUCENE-10639) WANDScorer performs better without two-phase

2022-07-02 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17561751#comment-17561751
 ] 

Adrien Grand commented on LUCENE-10639:
---

I suspected there was some overhead to two-phase iteration but not as much as 
this. Two-phase iteration doesn't aim at improving the performance of queries 
on their own, but when combined with other queries through conjunctions: 
conjunctions make sure to reach agreement across approximations before they 
proceed with the match phase. This is the feature that makes Lucene perform 
better than other search libraries on the query `+"the who" +uk` at 
https://tantivy-search.github.io/bench/, because Lucene makes sure that 
documents contain all of "the", "who" and "uk" before it starts checking 
positions. I would also expect two-phase iteration to help on [AndMedOrHighHigh 
on nightly 
benchmarks|https://home.apache.org/~mikemccand/lucenebench/AndMedOrHighHigh.html]
 since WANDScorer will do less work to return the next candidate on or beyond 
the lead doc ID produced by the "Med" term.
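To make that concrete, here is an illustrative sketch of how a conjunction 
consumes two-phase iterators (shape only, not the actual ConjunctionDISI code):

{code:java}
import java.io.IOException;
import java.util.List;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.search.TwoPhaseIterator;

// "approximation" is the conjunction of the cheap approximations; the
// expensive confirmation (e.g. checking positions) only runs on candidates
// where all approximations already agree.
static int nextMatch(DocIdSetIterator approximation, List<TwoPhaseIterator> twoPhases)
    throws IOException {
  for (int doc = approximation.nextDoc();
      doc != DocIdSetIterator.NO_MORE_DOCS;
      doc = approximation.nextDoc()) {
    boolean match = true;
    for (TwoPhaseIterator tpi : twoPhases) {
      if (tpi.matches() == false) {
        match = false;
        break;
      }
    }
    if (match) {
      return doc;
    }
  }
  return DocIdSetIterator.NO_MORE_DOCS;
}
{code}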

> WANDScorer performs better without two-phase
> 
>
> Key: LUCENE-10639
> URL: https://issues.apache.org/jira/browse/LUCENE-10639
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/search
>Reporter: Greg Miller
>Priority: Major
>
> After looking at the recent improvement [~jpountz] made to WAND scoring in 
> LUCENE-10634, which does additional work during match confirmation to not 
> confirm a match whose score wouldn't be competitive, I wanted to see how 
> performance would shift if we squashed the two-phase iteration completely and 
> only returned true matches (that were also known to be competitive by score) 
> in the "approximation" phase. I was a bit surprised to find that luceneutil 
> benchmarks (run with {{wikimediumall}}) improve significantly on some 
> disjunction tasks and don't show significant regressions anywhere else.
> Note that I used LUCENE-10634 as a baseline, and built my candidate change on 
> top of that. The diff can be seen here: 
> [DIFF|https://github.com/gsmiller/lucene/compare/b2d46440998fe4a972e8cc8c948580111359ed0f..c5bab794c92dbc66e70f9389948c1bdfe9b45231]
> A simple conclusion here might be that we shouldn't do two-phase iteration in 
> WANDScorer, but I'm pretty sure that's not right. I wonder if what's really 
> going on is that we're under-estimating the cost of confirming a match? Right 
> now we just return the tail size as the cost. While the cost of confirming a 
> match is proportional to the tail size, the actual work involved can be quite 
> significant (having to advance tail iterators to new blocks and decompress 
> them). I wonder if the WAND second phase is being run too early on 
> approximate candidates, and if less-expensive, (and even possibly more 
> restrictive?), second phases could/should be running first?
> I'm raising this here as more of a curiosity to see if it sparks ideas on how 
> to move forward. Again, I'm not proposing we do away with two-phase 
> iteration, but it seems we might be able to improve things. Maybe I'll 
> explore changing the cost heuristic next. Also, maybe there's some different 
> benchmarking that would be useful here that I may not be familiar with?
> Benchmark results on wikimediumall:
> {code:java}
> Task                        QPS baseline (StdDev)  QPS candidate (StdDev)  Pct diff               p-value
> HighTermTitleBDVSort           22.52 (18.9%)          21.66 (15.6%)        -3.8% ( -32% -   37%)    0.485
> Prefix3                         9.38  (9.2%)           9.09 (10.6%)        -3.1% ( -20% -   18%)    0.326
> HighTermMonthSort              25.37 (16.0%)          24.87 (17.1%)        -2.0% ( -30% -   37%)    0.710
> MedTermDayTaxoFacets            9.62  (4.2%)           9.51  (4.1%)        -1.2% (  -9% -    7%)    0.368
> TermDTSort                     74.69 (18.0%)          74.13 (18.2%)        -0.7% ( -31% -   43%)    0.897
> HighTermDayOfYearSort          52.64 (16.1%)          52.32 (15.4%)        -0.6% ( -27% -   36%)    0.903
> BrowseMonthTaxoFacets           8.64 (19.1%)           8.59 (19.8%)        -0.6% ( -33% -   47%)    0.926
> BrowseDateSSDVFacets            0.86  (9.5%)           0.86 (13.1%)        -0.4% ( -20% -   24%)    0.914
> PKLookup                      147.18  (3.9%)         146.66  (3.3%)        -0.3% (  -7% -    7%)    0.759
> BrowseDayOfYearSSDVFacets       3.47  (4.5%)           3.45  (4.8%)        -0.3% (  -9% -    9%)    0.822
> Wildcard                       36.36  (4.4%)          36.26  (5.2%)        -0.3% (  -9% -    9%)    0.866
> BrowseMonthSSDVFacets           4.15 (12.7%)           4.13

[jira] [Resolved] (LUCENE-10581) Optimize stored fields merges on the first segment

2022-06-30 Thread Adrien Grand (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand resolved LUCENE-10581.
---
Resolution: Won't Fix

> Optimize stored fields merges on the first segment
> --
>
> Key: LUCENE-10581
> URL: https://issues.apache.org/jira/browse/LUCENE-10581
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> This is mostly repurposing LUCENE-10573. Even though our merge policies no 
> longer perform quadratic merging, it's still possible to configure them with 
> low merge factors (e.g. 2) or they might decide to create unbalanced merges 
> where the biggest segment of the merge accounts for a large part of the 
> merge. In such cases, copying compressed data directly still yields 
> significant benefits.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Created] (LUCENE-10634) Speed up WANDScorer by computing scores before advancing tail scorers

2022-06-30 Thread Adrien Grand (Jira)
Adrien Grand created LUCENE-10634:
-

 Summary: Speed up WANDScorer by computing scores before advancing 
tail scorers
 Key: LUCENE-10634
 URL: https://issues.apache.org/jira/browse/LUCENE-10634
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Adrien Grand


While looking at performance numbers on LUCENE-10480, I noticed that it is 
often faster to compute an exact score in order to get a finer-grained estimate 
of the best score that the current document can possibly get, before advancing 
a tail scorer.

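A minimal sketch of the idea, with hypothetical names (leadScorers, 
tailMaxScoreSum and minCompetitiveScore are illustrative, not the actual 
WANDScorer fields):

{code:java}
import java.io.IOException;
import java.util.List;
import org.apache.lucene.search.Scorer;

// Compute the exact score of the scorers already positioned on the candidate,
// add the best case for the tail scorers, and only advance tail scorers when
// this tighter upper bound is still competitive.
static boolean mayCompete(List<Scorer> leadScorers, float tailMaxScoreSum,
    float minCompetitiveScore) throws IOException {
  float leadScore = 0;
  for (Scorer lead : leadScorers) {
    leadScore += lead.score(); // exact, and cheaper than advancing a tail scorer
  }
  return leadScore + tailMaxScoreSum >= minCompetitiveScore;
}
{code}
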
Making this change to WANDScorer yielded a small but reproducible speedup:

{noformat}
Task                        QPS baseline (StdDev)  QPS my_modified_version (StdDev)  Pct diff               p-value
IntNRQ                        186.50 (11.8%)         175.34 (19.1%)        -6.0% ( -33% -   28%)    0.234
HighTermTitleBDVSort          167.27 (20.6%)         161.85 (17.2%)        -3.2% ( -34% -   43%)    0.591
MedSloppyPhrase               194.77  (5.5%)         190.45  (7.8%)        -2.2% ( -14% -   11%)    0.299
HighTermDayOfYearSort         229.61  (7.7%)         225.74  (7.1%)        -1.7% ( -15% -   14%)    0.471
LowSloppyPhrase                20.22  (4.3%)          19.95  (4.8%)        -1.3% ( -10% -    8%)    0.366
TermDTSort                    319.62  (7.7%)         316.78  (7.5%)        -0.9% ( -14% -   15%)    0.712
OrHighNotLow                 1856.44  (5.6%)        1842.88  (5.7%)        -0.7% ( -11% -   11%)    0.682
AndMedOrHighHigh               73.87  (3.8%)          73.51  (3.6%)        -0.5% (  -7% -    7%)    0.677
OrHighNotHigh                2000.56  (5.6%)        1991.65  (6.9%)        -0.4% ( -12% -   12%)    0.823
LowPhrase                     106.90  (2.4%)         106.61  (2.9%)        -0.3% (  -5% -    5%)    0.750
AndHighLow                   1661.80  (3.5%)        1658.56  (3.7%)        -0.2% (  -7% -    7%)    0.865
Fuzzy2                        110.64  (1.8%)         110.43  (1.9%)        -0.2% (  -3% -    3%)    0.752
HighTermMonthSort              73.74 (17.5%)          73.68 (20.8%)        -0.1% ( -32% -   46%)    0.989
PKLookup                      242.86  (1.8%)         242.75  (1.8%)        -0.0% (  -3% -    3%)    0.934
OrHighNotMed                 1454.98  (5.3%)        1456.26  (5.8%)         0.1% ( -10% -   11%)    0.960
HighPhrase                    523.22  (2.9%)         524.01  (2.6%)         0.2% (  -5% -    5%)    0.862
MedPhrase                     140.65  (2.7%)         140.87  (2.9%)         0.2% (  -5% -    5%)    0.862
HighSloppyPhrase                8.74  (4.6%)           8.75  (5.5%)         0.2% (  -9% -   10%)    0.914
LowSpanNear                    28.05  (3.6%)          28.14  (3.0%)         0.3% (  -6% -    7%)    0.777
MedSpanNear                     7.59  (3.5%)           7.61  (3.4%)         0.3% (  -6% -    7%)    0.778
Respell                        67.62  (1.9%)          67.82  (1.8%)         0.3% (  -3% -    4%)    0.595
OrAndHigMedAndHighMed         127.87  (3.1%)         128.27  (4.0%)         0.3% (  -6% -    7%)    0.780
OrNotHighLow                 1513.24  (2.1%)        1520.33  (2.6%)         0.5% (  -4% -    5%)    0.528
OrHighPhraseHighPhrase         25.26  (3.0%)          25.38  (3.0%)         0.5% (  -5% -    6%)    0.616
OrNotHighMed                 1544.04  (4.5%)        1552.26  (4.2%)         0.5% (  -7% -    9%)    0.697
AndHighHigh                    92.24  (4.8%)          92.79  (6.6%)         0.6% ( -10% -   12%)    0.744
AndHighMed                    420.42  (3.1%)         423.19  (5.2%)         0.7% (  -7% -    9%)    0.624
Fuzzy1                        117.42  (1.9%)         118.19  (2.2%)         0.7% (  -3% -    4%)    0.307
MedTerm                      2209.36  (4.6%)        2224.54  (5.3%)         0.7% (  -8% -   11%)    0.661
MedIntervalsOrdered           124.18  (8.1%)         125.12  (8.0%)         0.8% ( -14% -   18%)    0.767
OrNotHighHigh                1239.43  (4.6%)        1249.63  (4.8%)         0.8% (  -8% -   10%)    0.580
AndHighOrMedMed                95.02  (4.3%)          95.82  (3.8%)         0.8% (  -6% -    9%)    0.515
Wildcard                      315.22 (23.3%)         317.98 (22.5%)         0.9% ( -36% -   60%)    0.904
LowTerm                      2775.81  (4.0%)        2808.32  (5.2%)         1.2% (  -7% -   10%)    0.425
HighIntervalsOrdered           14.24  (8.0%)          14.41  (8.4%)         1.2% ( -14% -   19%)    0.646
LowIntervalsOrdered           120.62  (5.8%)         122.09  (6.6%)         1.2% ( -10% -   14%)    0.534
HighSpanNear                   39.04  (6.7%)          39.71  (4.3%)         1.7% (  -8% -   13%)    0.332

[jira] [Created] (LUCENE-10633) Dynamic pruning for queries sorted by SORTED(_SET) field

2022-06-30 Thread Adrien Grand (Jira)
Adrien Grand created LUCENE-10633:
-

 Summary: Dynamic pruning for queries sorted by SORTED(_SET) field
 Key: LUCENE-10633
 URL: https://issues.apache.org/jira/browse/LUCENE-10633
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Adrien Grand


LUCENE-9280 introduced the ability to dynamically prune non-competitive hits 
when sorting by a numeric field, by leveraging the points index to skip 
documents that do not compare better than the top of the priority queue 
maintained by the field comparator.

However, queries sorted by a SORTED(_SET) field still look at all hits, which 
is disappointing. Could we leverage the terms index to skip hits?
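As a rough illustration of the kind of check this could enable (hypothetical 
names; the open question is turning the check into actual skipping via the 
terms index):

{code:java}
import java.io.IOException;
import org.apache.lucene.index.SortedDocValues;

// Ascending sort, priority queue full: only ordinals strictly below the
// bottom of the queue can still enter the top hits. "bottomOrd" would come
// from the comparator's bottom entry.
static boolean isCompetitive(SortedDocValues values, int doc, int bottomOrd)
    throws IOException {
  return values.advanceExact(doc) && values.ordValue() < bottomOrd;
}
{code}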



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10627) Using CompositeByteBuf to Reduce Memory Copy

2022-06-28 Thread Adrien Grand (Jira)

Adrien Grand commented on LUCENE-10627:
---

 I understand how the change helps, but overall based on the benchmark result that you shared, this is only a 0.3% (BEST_COMPRESSION) or 1.4% (BEST_SPEED) improvement while the change adds some complexity. I wonder if we could reduce this complexity by reusing some existing abstractions like ByteBuffersDataInput instead of this new CompositeByteBuf, and have a single Compressor#compress API instead of two.  
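For illustration, the suggested consolidation could look roughly like this (a 
sketch of the proposed shape, not the current Compressor API):

{code:java}
import java.io.Closeable;
import java.io.IOException;
import org.apache.lucene.store.ByteBuffersDataInput;
import org.apache.lucene.store.DataOutput;

// Hypothetical single entry point: the compressor pulls bytes from a
// ByteBuffersDataInput, so callers no longer need to flatten multiple buffers
// into one contiguous byte[] before compressing.
public abstract class Compressor implements Closeable {
  public abstract void compress(ByteBuffersDataInput buffersInput, DataOutput out)
      throws IOException;
}
{code}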


--
This message was sent by Atlassian Jira
(v8.20.10#820010-sha1:ace47f9)



[jira] [Commented] (LUCENE-10624) Binary Search for Sparse IndexedDISI advanceWithinBlock & advanceExactWithinBlock

2022-06-24 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17558472#comment-17558472
 ] 

Adrien Grand commented on LUCENE-10624:
---

The code from SearchTaxis.java that you copied doesn't use doc values at all; 
it just looks at stored fields, so it shouldn't benefit from your change. 
Sorry, maybe your change actually does speed things up, but it's just unclear 
to me why, and I'd like to make sure that I understand why. :)

bq. I plan to open a new issue for exponential search. Does it make sense?

I'm unsure. My worry is that a naive binary search would make things slower 
than the current main branch for many users who have relatively dense fields 
that get advanced by small increments.

> Binary Search for Sparse IndexedDISI advanceWithinBlock & 
> advanceExactWithinBlock
> -
>
> Key: LUCENE-10624
> URL: https://issues.apache.org/jira/browse/LUCENE-10624
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Affects Versions: 9.0, 9.1, 9.2
>Reporter: Weiming Wu
>Priority: Major
> Attachments: baseline_sparseTaxis_searchsparse-sorted.0.log, 
> candiate-exponential-searchsparse-sorted.0.log, 
> candidate_sparseTaxis_searchsparse-sorted.0.log
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> h3. Problem Statement
> We noticed DocValue read performance regression with the iterative API when 
> upgrading from Lucene 5 to Lucene 9. Our latency is increased by 50%. The 
> degradation is similar to what's described in 
> https://issues.apache.org/jira/browse/SOLR-9599 
> By analyzing profiling data, we found method "advanceWithinBlock" and 
> "advanceExactWithinBlock" for Sparse IndexedDISI is slow in Lucene 9 due to 
> their O(N) doc lookup algorithm.
> h3. Changes
> Used binary search algorithm to replace current O(N) lookup algorithm in 
> Sparse IndexedDISI "advanceWithinBlock" and "advanceExactWithinBlock" because 
> docs are in ascending order.
> h3. Test
> {code:java}
> ./gradlew tidy
> ./gradlew check {code}
> h3. Benchmark
> Ran sparseTaxis test cases from luceneutil. Attached the reports of the 
> baseline and candidates in the attachments section.
> 1. Most cases have 5-10% search latency reduction.
> 2. Some highlights (>20%):
>  * T0 green_pickup_latitude:[40.75 TO 40.9] yellow_pickup_latitude:[40.75 TO 
> 40.9] sort=null
>  ** Baseline:  10973978+ hits in 726.81967 msec
>  ** Candidate: 10973978+ hits in 484.544594 msec
>  * T0 cab_color:y cab_color:g sort=null
>  ** Baseline:  2300174+ hits in 95.698324 msec
>  ** Candidate: 2300174+ hits in 78.336193 msec
>  * T1 cab_color:y cab_color:g sort=null
>  ** Baseline:  2300174+ hits in 391.565239 msec
>  ** Candidate: 300174+ hits in 227.592885 msec
>  * ...



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Resolved] (LUCENE-10620) Can we pass the Weight to Collector?

2022-06-23 Thread Adrien Grand (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand resolved LUCENE-10620.
---
Fix Version/s: 9.3
   Resolution: Fixed

> Can we pass the Weight to Collector?
> 
>
> Key: LUCENE-10620
> URL: https://issues.apache.org/jira/browse/LUCENE-10620
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
> Fix For: 9.3
>
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> Today collectors cannot know about the Weight, and thus they cannot leverage 
> {{Weight#count}}. {{IndexSearcher#count}} works around it by extending 
> {{TotalHitCountCollector}} in order to shortcut counting the number of hits 
> on a segment via {{Weight#count}} whenever possible.
> It works, but I would prefer this shortcut to work for all users of 
> TotalHitCountCollector. For instance the faceting module creates a 
> MultiCollector over a TotalHitCountCollector and a FacetCollector, and today 
> it doesn't benefit from quick counts, which would enable it to only collect 
> matches into a FacetCollector.
> I'm considering adding a new {{Collector#setWeight}} API to allow collectors 
> to leverage {{Weight#count}}. I gave {{TotalHitCountCollector}} as an example 
> above, but this could have applications for our top-docs collectors too, 
> which could skip counting hits at all if the weight can provide them with the 
> hit count up-front.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10507) Should it be more likely to search concurrently in tests?

2022-06-21 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17556690#comment-17556690
 ] 

Adrien Grand commented on LUCENE-10507:
---

OK, I found the issue with the test. The comparator was not correctly 
implemented: {{compareValues}} would sort values in the opposite order from 
{{compare}}. I pushed a fix.

> Should it be more likely to search concurrently in tests?
> -
>
> Key: LUCENE-10507
> URL: https://issues.apache.org/jira/browse/LUCENE-10507
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Luca Cavanna
>Priority: Minor
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> As part of LUCENE-10002 we are migrating test usages of 
> IndexSearcher#search(Query, Collector) to use the corresponding search method 
> that takes a CollectorManager in place of a Collector. As part of such 
> changes, I've been paying attention to whether searchers are created through 
> LuceneTestCase#newSearcher and migrating to it when possible.
> This caused some recent test failures following test changes, which were in 
> most cases test issues, although they were quite rare due to the fact that we 
> only rarely exercise the concurrent code-path in tests.
> One recent failure uncovered LUCENE-10500, which was an actual bug that 
> affected concurrent searches only, and was uncovered by a test run that 
> indexed a considerable amount of docs and was lucky enough to get an executor 
> set to its index searcher as well as get multiple slices.
> LuceneTestCase#newIndexSearcher(IndexReader) uses threads only rarely, and 
> even when useThreads is true, the searcher may not get an executor set. Also, 
> it can often happen that despite an executor is set, the searcher will hold 
> only one slice, as not enough documents are indexed. Some nightly tests index 
> enough documents, and LuceneTestCase also lowers the slice limits but only 
> 50% of the times and only when wrapWithAssertions is false. Also I wonder if 
> the lower limits are low enough:
> {code:java}
> int maxDocPerSlice = 1 + random.nextInt(10);
> int maxSegmentsPerSlice = 1 + random.nextInt(20);
> {code}
> All in all, I wonder if we should make it more likely for real concurrent 
> searches to happen while testing across multiple slices. It seems like it 
> could be useful especially as we'd like users to use collector managers 
> instead of collectors (although that does not necessarily translate to 
> concurrent search).



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10507) Should it be more likely to search concurrently in tests?

2022-06-21 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17556684#comment-17556684
 ] 

Adrien Grand commented on LUCENE-10507:
---

Also we wondered if this change could affect the time it takes to run tests, 
but things look good so far: 
http://people.apache.org/~mikemccand/lucenebench/antcleantest.html.

> Should it be more likely to search concurrently in tests?
> -
>
> Key: LUCENE-10507
> URL: https://issues.apache.org/jira/browse/LUCENE-10507
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Luca Cavanna
>Priority: Minor
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> As part of LUCENE-10002 we are migrating test usages of 
> IndexSearcher#search(Query, Collector) to use the corresponding search method 
> that takes a CollectorManager in place of a Collector. As part of such 
> changes, I've been paying attention to whether searchers are created through 
> LuceneTestCase#newSearcher and migrating to it when possible.
> This caused some recent test failures following test changes, which were in 
> most cases test issues, although they were quite rare due to the fact that we 
> only rarely exercise the concurrent code-path in tests.
> One recent failure uncovered LUCENE-10500, which was an actual bug that 
> affected concurrent searches only, and was uncovered by a test run that 
> indexed a considerable amount of docs and was lucky enough to get an executor 
> set to its index searcher as well as get multiple slices.
> LuceneTestCase#newIndexSearcher(IndexReader) uses threads only rarely, and 
> even when useThreads is true, the searcher may not get an executor set. Also, 
> it can often happen that despite an executor is set, the searcher will hold 
> only one slice, as not enough documents are indexed. Some nightly tests index 
> enough documents, and LuceneTestCase also lowers the slice limits but only 
> 50% of the times and only when wrapWithAssertions is false. Also I wonder if 
> the lower limits are low enough:
> {code:java}
> int maxDocPerSlice = 1 + random.nextInt(10);
> int maxSegmentsPerSlice = 1 + random.nextInt(20);
> {code}
> All in all, I wonder if we should make it more likely for real concurrent 
> searches to happen while testing across multiple slices. It seems like it 
> could be useful especially as we'd like users to use collector managers 
> instead of collectors (although that does not necessarily translate to 
> concurrent search).



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10507) Should it be more likely to search concurrently in tests?

2022-06-21 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17556683#comment-17556683
 ] 

Adrien Grand commented on LUCENE-10507:
---

It looks like this change helped find a reproducible test failure:
./gradlew test --tests TestElevationComparator.testSorting 
-Dtests.seed=3AC6BE539DA8C1F3 -Dtests.locale=sg-CF 
-Dtests.timezone=America/Indiana/Knox -Dtests.asserts=true 
-Dtests.file.encoding=UTF-8

I don't understand the reason yet.

> Should it be more likely to search concurrently in tests?
> -
>
> Key: LUCENE-10507
> URL: https://issues.apache.org/jira/browse/LUCENE-10507
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Luca Cavanna
>Priority: Minor
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> As part of LUCENE-10002 we are migrating test usages of 
> IndexSearcher#search(Query, Collector) to use the corresponding search method 
> that takes a CollectorManager in place of a Collector. As part of such 
> changes, I've been paying attention to whether searchers are created through 
> LuceneTestCase#newSearcher and migrating to it when possible.
> This caused some recent test failures following test changes, which were in 
> most cases test issues, although they were quite rare due to the fact that we 
> only rarely exercise the concurrent code-path in tests.
> One recent failure uncovered LUCENE-10500, which was an actual bug that 
> affected concurrent searches only, and was uncovered by a test run that 
> indexed a considerable amount of docs and was lucky enough to get an executor 
> set to its index searcher as well as get multiple slices.
> LuceneTestCase#newIndexSearcher(IndexReader) uses threads only rarely, and 
> even when useThreads is true, the searcher may not get an executor set. Also, 
> it can often happen that despite an executor is set, the searcher will hold 
> only one slice, as not enough documents are indexed. Some nightly tests index 
> enough documents, and LuceneTestCase also lowers the slice limits but only 
> 50% of the times and only when wrapWithAssertions is false. Also I wonder if 
> the lower limits are low enough:
> {code:java}
> int maxDocPerSlice = 1 + random.nextInt(10);
> int maxSegmentsPerSlice = 1 + random.nextInt(20);
> {code}
> All in all, I wonder if we should make it more likely for real concurrent 
> searches to happen while testing across multiple slices. It seems like it 
> could be useful especially as we'd like users to use collector managers 
> instead of collectors (although that does not necessarily translate to 
> concurrent search).



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10624) Binary Search for Sparse IndexedDISI advanceWithinBlock & advanceExactWithinBlock

2022-06-21 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17556673#comment-17556673
 ] 

Adrien Grand commented on LUCENE-10624:
---

I find these speedups surprising since I was not expecting these queries to 
leverage doc values. The one query where I would expect a speedup is the term 
query sorted by field: 
http://people.apache.org/~mikemccand/lucenebench/sparseResults.html#search_sort_qps.

Regarding the implementation, in the past we observed better performance for 
this sort of thing with exponential search than with binary search, since 
exponential search better optimizes for the case when callers repeatedly 
call advance() on small increments.
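For reference, a sketch of exponential (galloping) search over an ascending 
array (illustrative, not the IndexedDISI code): it stays cheap when the target 
is close to the current position, and still bounds large jumps logarithmically.

{code:java}
// Returns the index of the first entry >= target in docs[from, to).
static int exponentialSearch(int[] docs, int from, int to, int target) {
  // Grow the window exponentially until it covers the target.
  int bound = 1;
  while (from + bound < to && docs[from + bound] < target) {
    bound <<= 1;
  }
  // Then binary search inside the last window.
  int lo = from + (bound >> 1);
  int hi = Math.min(from + bound, to - 1);
  while (lo <= hi) {
    int mid = (lo + hi) >>> 1;
    if (docs[mid] < target) {
      lo = mid + 1;
    } else {
      hi = mid - 1;
    }
  }
  return lo;
}
{code}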

> Binary Search for Sparse IndexedDISI advanceWithinBlock & 
> advanceExactWithinBlock
> -
>
> Key: LUCENE-10624
> URL: https://issues.apache.org/jira/browse/LUCENE-10624
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Affects Versions: 9.0, 9.1, 9.2
>Reporter: Weiming Wu
>Priority: Major
> Attachments: baseline_sparseTaxis_searchsparse-sorted.0.log, 
> candidate_sparseTaxis_searchsparse-sorted.0.log
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> h3. Problem Statement
> We noticed DocValue read performance regression with the iterative API when 
> upgrading from Lucene 5 to Lucene 9. Our latency is increased by 50%. The 
> degradation is similar to what's described in 
> https://issues.apache.org/jira/browse/SOLR-9599 
> By analyzing profiling data, we found method "advanceWithinBlock" and 
> "advanceExactWithinBlock" for Sparse IndexedDISI is slow in Lucene 9 due to 
> their O(N) doc lookup algorithm.
> h3. Changes
> Used binary search algorithm to replace current O(N) lookup algorithm in 
> Sparse IndexedDISI "advanceWithinBlock" and "advanceExactWithinBlock" because 
> docs are in ascending order.
> h3. Test
> {code:java}
> ./gradlew tidy
> ./gradlew check {code}
> h3. Benchmark
> Ran sparseTaxis test cases from luceneutil. Attached the reports of the 
> baseline and candidates in the attachments section.
> 1. Most cases have 5-10% search latency reduction.
> 2. Some highlights (>20%):
>  * T0 green_pickup_latitude:[40.75 TO 40.9] yellow_pickup_latitude:[40.75 TO 
> 40.9] sort=null
>  ** Baseline:  10973978+ hits in 726.81967 msec
>  ** Candidate: 10973978+ hits in 484.544594 msec
>  * T0 cab_color:y cab_color:g sort=null
>  ** Baseline:  2300174+ hits in 95.698324 msec
>  ** Candidate: 2300174+ hits in 78.336193 msec
>  * T1 cab_color:y cab_color:g sort=null
>  ** Baseline:  2300174+ hits in 391.565239 msec
>  ** Candidate: 300174+ hits in 227.592885 msec
>  * ...



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Resolved] (LUCENE-10618) Implement BooleanQuery rewrite rules based for minimumShouldMatch

2022-06-20 Thread Adrien Grand (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand resolved LUCENE-10618.
---
Fix Version/s: 9.3
   Resolution: Fixed

Thanks [~joe hou]!

> Implement BooleanQuery rewrite rules based for minimumShouldMatch
> -
>
> Key: LUCENE-10618
> URL: https://issues.apache.org/jira/browse/LUCENE-10618
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
> Fix For: 9.3
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> While looking into a test failure I noticed that we sometimes create weights 
> for boolean queries with no SHOULD clauses and a non-zero 
> minimumNumberShouldMatch.
> We could rewrite BooleanQuery to MatchNoDocsQuery when the number of SHOULD 
> clauses is less than minimumNumberShouldMatch, and make SHOULD clauses 
> required when the number of SHOULD clauses is equal to 
> minimumNumberShouldMatch.
> This feels a bit like a degenerate case (why would the user create such a 
> query in the first place?) but this case can also happen to non-degenerate 
> queries if some SHOULD clauses rewrite to a MatchNoDocsQuery and get removed 
> through rewrite.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8806) WANDScorer should support two-phase iterator

2022-06-17 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17555683#comment-17555683
 ] 

Adrien Grand commented on LUCENE-8806:
--

Sorry [~denimorim] I'm not getting your question.

> WANDScorer should support two-phase iterator
> 
>
> Key: LUCENE-8806
> URL: https://issues.apache.org/jira/browse/LUCENE-8806
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Jim Ferenczi
>Priority: Major
> Attachments: LUCENE-8806.patch, LUCENE-8806.patch
>
>
> Following https://issues.apache.org/jira/browse/LUCENE-8770 the WANDScorer 
> should leverage two-phase iterators in order to be faster when used in 
> conjunctions.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Resolved] (LUCENE-10617) Investigate recent Jenkins build failures in TestMergeSchedulerExternal.testSubclassConcurrentMergeScheduler

2022-06-17 Thread Adrien Grand (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand resolved LUCENE-10617.
---
Fix Version/s: 9.3
   Resolution: Fixed

This one looks addressed.

> Investigate recent Jenkins build failures in 
> TestMergeSchedulerExternal.testSubclassConcurrentMergeScheduler
> 
>
> Key: LUCENE-10617
> URL: https://issues.apache.org/jira/browse/LUCENE-10617
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Gautam Worah
>Priority: Minor
> Fix For: 9.3
>
>
> Sample failures: https://jenkins.thetaphi.de/job/Lucene-9.x-MacOSX/692/ and 
> https://jenkins.thetaphi.de/job/Lucene-main-MacOSX/8177/



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10619) Optimize the writeBytes in TermsHashPerField

2022-06-17 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17555646#comment-17555646
 ] 

Adrien Grand commented on LUCENE-10619:
---

This looks like an interesting idea!

> Optimize the writeBytes in TermsHashPerField
> 
>
> Key: LUCENE-10619
> URL: https://issues.apache.org/jira/browse/LUCENE-10619
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Affects Versions: 9.2
>Reporter: tangdh
>Priority: Major
>
> Because we don't know the length of a slice, writeBytes always writes bytes 
> one after another instead of copying a block of bytes at a time.
> Maybe we could return both the offset and the length from 
> ByteBlockPool#allocSlice?
> 1. BYTE_BLOCK_SIZE is 32768, so an offset fits in at most 15 bits.
> 2. A slice size is at most 200, so it fits in 8 bits.
> So we could pack them together into a single int: offset | length.
> There are only two places where this function is used, so the cost of 
> changing it is relatively small.
> If allocSlice returned the offset and length of the new slice, we could 
> change writeBytes like below:
> {code:java}
> // copy a block of bytes on each iteration instead of one byte at a time
> while (remaining > 0) {
>   int offsetAndLength = allocSlice(bytes, offset); // packed as offset | length
>   int length = Math.min(remaining, (offsetAndLength & 0xff) - 1);
>   offset = offsetAndLength >>> 8;
>   System.arraycopy(src, srcPos, bytePool.buffer, offset, length);
>   srcPos += length;
>   remaining -= length;
>   offset += length; // the reserved last byte of the slice comes next
> }
> {code}
> If this could work, I'd like to raise a PR.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10620) Can we pass the Weight to Collector?

2022-06-16 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17555001#comment-17555001
 ] 

Adrien Grand commented on LUCENE-10620:
---

I opened a draft PR that demonstrates the idea: 
https://github.com/apache/lucene/pull/964.

> Can we pass the Weight to Collector?
> 
>
> Key: LUCENE-10620
> URL: https://issues.apache.org/jira/browse/LUCENE-10620
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Today collectors cannot know about the Weight, and thus they cannot leverage 
> {{Weight#count}}. {{IndexSearcher#count}} works around it by extending 
> {{TotalHitCountCollector}} in order to shortcut counting the number of hits 
> on a segment via {{Weight#count}} whenever possible.
> It works, but I would prefer this shortcut to work for all users of 
> TotalHitCountCollector. For instance the faceting module creates a 
> MultiCollector over a TotalHitCountCollector and a FacetCollector, and today 
> it doesn't benefit from quick counts, which would enable it to only collect 
> matches into a FacetCollector.
> I'm considering adding a new {{Collector#setWeight}} API to allow collectors 
> to leverage {{Weight#count}}. I gave {{TotalHitCountCollector}} as an example 
> above, but this could have applications for our top-docs collectors too, 
> which could skip counting hits at all if the weight can provide them with the 
> hit count up-front.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Created] (LUCENE-10620) Can we pass the Weight to Collector?

2022-06-16 Thread Adrien Grand (Jira)
Adrien Grand created LUCENE-10620:
-

 Summary: Can we pass the Weight to Collector?
 Key: LUCENE-10620
 URL: https://issues.apache.org/jira/browse/LUCENE-10620
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Adrien Grand


Today collectors cannot know about the Weight, and thus they cannot leverage 
{{Weight#count}}. {{IndexSearcher#count}} works around it by extending 
{{TotalHitCountCollector}} in order to shortcut counting the number of hits on 
a segment via {{Weight#count}} whenever possible.

It works, but I would prefer this shortcut to work for all users of 
TotalHitCountCollector. For instance the faceting module creates a 
MultiCollector over a TotalHitCountCollector and a FacetCollector, and today it 
doesn't benefit from quick counts, which would enable it to only collect 
matches into a FacetCollector.

I'm considering adding a new {{Collector#setWeight}} API to allow collectors to 
leverage {{Weight#count}}. I gave {{TotalHitCountCollector}} as an example 
above, but this could have applications for our top-docs collectors too, which 
could skip counting hits at all if the weight can provide them with the hit 
count up-front.
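A rough sketch of how a collector could use such a hook (the API shape here is 
hypothetical):

{code:java}
import java.io.IOException;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.search.*;

public final class CountingCollector implements Collector {
  private Weight weight;
  private int totalHits;

  // The hypothetical new hook: IndexSearcher would call this before collection.
  public void setWeight(Weight weight) {
    this.weight = weight;
  }

  @Override
  public LeafCollector getLeafCollector(LeafReaderContext context) throws IOException {
    if (weight != null) {
      int leafCount = weight.count(context); // -1 when counting is not cheap
      if (leafCount != -1) {
        totalHits += leafCount;
        throw new CollectionTerminatedException(); // skip collecting this leaf
      }
    }
    return new LeafCollector() {
      @Override
      public void setScorer(Scorable scorer) {}

      @Override
      public void collect(int doc) {
        totalHits++;
      }
    };
  }

  @Override
  public ScoreMode scoreMode() {
    return ScoreMode.COMPLETE_NO_SCORES;
  }

  public int getTotalHits() {
    return totalHits;
  }
}
{code}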



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10618) Implement BooleanQuery rewrite rules based for minimumShouldMatch

2022-06-15 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17554645#comment-17554645
 ] 

Adrien Grand commented on LUCENE-10618:
---

Sure, feel free to give it a try and ping me for reviews!

> Implement BooleanQuery rewrite rules based for minimumShouldMatch
> -
>
> Key: LUCENE-10618
> URL: https://issues.apache.org/jira/browse/LUCENE-10618
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
>
> While looking into a test failure I noticed that we sometimes create weights 
> for boolean queries with no SHOULD clauses and a non-zero 
> minimumNumberShouldMatch.
> We could rewrite BooleanQuery to MatchNoDocsQuery when the number of SHOULD 
> clauses is less than minimumNumberShouldMatch, and make SHOULD clauses 
> required when the number of SHOULD clauses is equal to 
> minimumNumberShouldMatch.
> This feels a bit like a degenerate case (why would the user create such a 
> query in the first place?) but this case can also happen to non-degenerate 
> queries if some SHOULD clauses rewrite to a MatchNoDocsQuery and get removed 
> through rewrite.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10600) SortedSetDocValues#docValueCount should be an int, not long

2022-06-15 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17554625#comment-17554625
 ] 

Adrien Grand commented on LUCENE-10600:
---

[~mikemccand] I don't think so; we still need to make 
{{SortedSetDocValues#docValueCount}} an integer in my opinion.

> SortedSetDocValues#docValueCount should be an int, not long
> ---
>
> Key: LUCENE-10600
> URL: https://issues.apache.org/jira/browse/LUCENE-10600
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Adrien Grand
>Assignee: Lu Xugang
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Created] (LUCENE-10618) Implement BooleanQuery rewrite rules based for minimumShouldMatch

2022-06-15 Thread Adrien Grand (Jira)
Adrien Grand created LUCENE-10618:
-

 Summary: Implement BooleanQuery rewrite rules based for 
minimumShouldMatch
 Key: LUCENE-10618
 URL: https://issues.apache.org/jira/browse/LUCENE-10618
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Adrien Grand


While looking into a test failure I noticed that we sometimes create weights 
for boolean queries with no SHOULD clauses and a non-zero 
minimumNumberShouldMatch.

We could rewrite BooleanQuery to MatchNoDocsQuery when the number of SHOULD 
clauses is less than minimumNumberShouldMatch, and make SHOULD clauses required 
when the number of SHOULD clauses is equal to minimumNumberShouldMatch.

This feels a bit like a degenerate case (why would the user create such a query 
in the first place?) but this case can also happen to non-degenerate queries if 
some SHOULD clauses rewrite to a MatchNoDocsQuery and get removed through 
rewrite.
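A sketch of the two rewrite rules (illustrative; the real change would live in 
BooleanQuery's rewrite logic):

{code:java}
import org.apache.lucene.search.*;

// Rule 1: more required SHOULD matches than SHOULD clauses => nothing matches.
// Rule 2: exactly as many SHOULD clauses as required matches => all of them
// are effectively MUST clauses.
static Query rewriteMinShouldMatch(BooleanQuery query) {
  int numShould = 0;
  for (BooleanClause c : query.clauses()) {
    if (c.getOccur() == BooleanClause.Occur.SHOULD) {
      numShould++;
    }
  }
  int msm = query.getMinimumNumberShouldMatch();
  if (msm > numShould) {
    return new MatchNoDocsQuery("minimumShouldMatch exceeds the number of SHOULD clauses");
  }
  if (msm > 0 && msm == numShould) {
    BooleanQuery.Builder builder = new BooleanQuery.Builder();
    for (BooleanClause c : query.clauses()) {
      builder.add(c.getQuery(),
          c.getOccur() == BooleanClause.Occur.SHOULD ? BooleanClause.Occur.MUST : c.getOccur());
    }
    return builder.build();
  }
  return query;
}
{code}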



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10617) Investigate recent Jenkins build failures in TestMergeSchedulerExternal.testSubclassConcurrentMergeScheduler

2022-06-15 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17554440#comment-17554440
 ] 

Adrien Grand commented on LUCENE-10617:
---

Neither of these seeds reproduces for me, but it looks like a test bug where the 
test expects indexing threads to not notice that there is an exception in a 
background merge, while `hasPendingMerges()` now complains if the writer hit a 
tragic exception. I'll push a tentative fix and watch failures for this test.

> Investigate recent Jenkins build failures in 
> TestMergeSchedulerExternal.testSubclassConcurrentMergeScheduler
> 
>
> Key: LUCENE-10617
> URL: https://issues.apache.org/jira/browse/LUCENE-10617
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Gautam Worah
>Priority: Minor
>
> Sample failures: https://jenkins.thetaphi.de/job/Lucene-9.x-MacOSX/692/ and 
> https://jenkins.thetaphi.de/job/Lucene-main-MacOSX/8177/



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Created] (LUCENE-10616) Moving to dictionaries has made stored fields slower at skipping

2022-06-14 Thread Adrien Grand (Jira)
Adrien Grand created LUCENE-10616:
-

 Summary: Moving to dictionaries has made stored fields slower at 
skipping
 Key: LUCENE-10616
 URL: https://issues.apache.org/jira/browse/LUCENE-10616
 Project: Lucene - Core
  Issue Type: Bug
Reporter: Adrien Grand


[~ywelsch] has been digging into a regression of stored fields retrieval that 
is caused by LUCENE-9486.

Say your documents have two stored fields, one that is 100B and is stored 
first, and the other one that is 100kB, and you are only interested in the 
first one. While the idea behind blocks of stored fields is to store multiple 
documents in the same block to leverage redundancy across documents, sometimes 
documents are larger than the block size. As soon as documents are larger than 
2x the block size, our stored fields format splits such large documents into 
multiple blocks, so that you wouldn't need to decompress everything only to 
retrieve a couple of small fields.

Before LUCENE-9486, BEST_SPEED had a block size of 16kB, so retrieving the 
first field value would only need to decompress 16kB of data. With the move to 
preset dictionaries in LUCENE-9486 and then LUCENE-9917, we now have blocks of 
80kB, so stored fields would now need to decompress 80kB of data, 5x more than 
before.

With dictionaries, our blocks are now split into 10 sub blocks. We happen to 
eagerly decompress all sub blocks that intersect with the stored document, 
which is why we would decompress 80kB of data, but this is an implementation 
detail. It should be possible to decompress these sub blocks lazily so that we 
would only decompress those that intersect with one of the field values that 
the user is interested in retrieving?
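A minimal sketch of what lazy decompression could look like ("CompressedSubBlock" 
is a hypothetical type; the real logic would live in the stored fields reader):

{code:java}
// Sub blocks are only inflated when a requested field's byte range reaches
// them; sub blocks that no requested field touches are never decompressed.
final class LazyBlock {
  private final CompressedSubBlock[] subBlocks; // hypothetical type
  private final byte[][] decompressed; // cache, one slot per sub block

  LazyBlock(CompressedSubBlock[] subBlocks) {
    this.subBlocks = subBlocks;
    this.decompressed = new byte[subBlocks.length][];
  }

  byte[] subBlock(int index) throws java.io.IOException {
    if (decompressed[index] == null) {
      decompressed[index] = subBlocks[index].decompress(); // pay only on demand
    }
    return decompressed[index];
  }
}
{code}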



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10600) SortedSetDocValues#docValueCount should be an int, not long

2022-06-14 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17554004#comment-17554004
 ] 

Adrien Grand commented on LUCENE-10600:
---

bq. should we also make SortedSetDocValues#nextOrd() returns int

No, SORTED_SET doc values could have more than Integer.MAX_VALUE unique values 
overall. SortedSetDocValuesWriter does indeed use ints to represent term IDs, 
but this class is only used for flushes, and flushes have a hard bound of ~2GB 
per thread, so you can't have more than Integer.MAX_VALUE unique terms in a 
flush. However, the unique count of terms can grow beyond Integer.MAX_VALUE 
through merges.
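An illustrative iteration that shows the distinction: the per-document count 
fits in an int, while ordinals stay longs.

{code:java}
import java.io.IOException;
import org.apache.lucene.index.DocValues;
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.index.SortedSetDocValues;
import org.apache.lucene.util.BytesRef;

static void readOrds(LeafReader reader, int docId) throws IOException {
  SortedSetDocValues values = DocValues.getSortedSet(reader, "field");
  if (values.advanceExact(docId)) {
    for (long ord = values.nextOrd();
        ord != SortedSetDocValues.NO_MORE_ORDS;
        ord = values.nextOrd()) {
      BytesRef term = values.lookupOrd(ord); // ord is a long-valued global ordinal
    }
  }
}
{code}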

> SortedSetDocValues#docValueCount should be an int, not long
> ---
>
> Key: LUCENE-10600
> URL: https://issues.apache.org/jira/browse/LUCENE-10600
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Adrien Grand
>Assignee: Lu Xugang
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Resolved] (LUCENE-10608) Implement Weight#count for pure conjunctions

2022-06-14 Thread Adrien Grand (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand resolved LUCENE-10608.
---
Fix Version/s: 9.3
   Resolution: Fixed

> Implement Weight#count for pure conjunctions
> 
>
> Key: LUCENE-10608
> URL: https://issues.apache.org/jira/browse/LUCENE-10608
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Adrien Grand
>Priority: Minor
> Fix For: 9.3
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> It's common for Elasticsearch to ingest time-based data where newer segments 
> contain recent data and older segments contain older data. On such indices, 
> it's common for range queries on the time field to match either all of or 
> none of the documents in the segment.
> We could implement Weight#count on pure conjunctions to take advantage of 
> this by either returning 0 if any of the clauses has a match count of 0, or 
> the count of the only clause that doesn't have a match count that is equal to 
> maxDoc.
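A sketch of the counting rule described above (illustrative, and simplified in 
that it ignores deletions):

{code:java}
import java.io.IOException;
import java.util.List;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.search.Weight;

// A pure conjunction can be counted cheaply when at most one clause does not
// match all documents of the segment.
static int count(List<Weight> requiredClauses, LeafReaderContext context)
    throws IOException {
  final int maxDoc = context.reader().maxDoc();
  int result = maxDoc;
  for (Weight w : requiredClauses) {
    int subCount = w.count(context);
    if (subCount == -1) {
      return -1; // this clause cannot be counted cheaply
    }
    if (subCount == 0) {
      return 0; // one clause matches nothing, so the conjunction matches nothing
    }
    if (subCount != maxDoc) {
      if (result != maxDoc) {
        return -1; // two "partial" clauses: their intersection is unknown
      }
      result = subCount; // the only partial clause determines the count
    }
  }
  return result;
}
{code}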



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10078) Enable merge-on-refresh by default?

2022-06-13 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17553697#comment-17553697
 ] 

Adrien Grand commented on LUCENE-10078:
---

As expected, this increased refresh latency a bit: 
http://people.apache.org/~mikemccand/lucenebench/nrt.html. I pushed an 
annotation that should show up in the coming days.

> Enable merge-on-refresh by default?
> ---
>
> Key: LUCENE-10078
> URL: https://issues.apache.org/jira/browse/LUCENE-10078
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Michael McCandless
>Priority: Major
> Fix For: 9.3
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> This is a spinoff from the discussion in LUCENE-10073.
> The newish merge-on-refresh ([crazy origin 
> story|https://blog.mikemccandless.com/2021/03/open-source-collaboration-or-how-we.html])
>  feature is a powerful way to reduce searched segment counts, especially 
> helpful for applications using many indexing threads.  Such usage will write 
> many tiny segments on each refresh, which could quickly be merged up during 
> the {{refresh}} operation.
> We would have to implement a default for {{findFullFlushMerges}} 
> (LUCENE-10064 is open for this), and then we would need 
> {{IndexWriterConfig.getMaxFullFlushMergeWaitMillis}} to default to a non-zero 
> value (this issue).



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Resolved] (LUCENE-10266) Move nearest-neighbor search on points to core?

2022-06-13 Thread Adrien Grand (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand resolved LUCENE-10266.
---
Fix Version/s: 10.0 (main)
   Resolution: Fixed

> Move nearest-neighbor search on points to core?
> ---
>
> Key: LUCENE-10266
> URL: https://issues.apache.org/jira/browse/LUCENE-10266
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Adrien Grand
>Priority: Minor
> Fix For: 10.0 (main)
>
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> Now that the Points' public API supports running nearest-neighbor 
> search, should we move it to core via helper methods on {{LatLonPoint}} and 
> {{XYPoint}}?



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10527) Use bigger maxConn for last layer in HNSW

2022-06-13 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17553699#comment-17553699
 ] 

Adrien Grand commented on LUCENE-10527:
---

I pushed an annotation to nightly benchmarks for the above performance change. 
It should show up in the coming days.

> Use bigger maxConn for last layer in HNSW
> -
>
> Key: LUCENE-10527
> URL: https://issues.apache.org/jira/browse/LUCENE-10527
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Julie Tibshirani
>Assignee: Mayya Sharipova
>Priority: Minor
> Fix For: 9.2
>
> Attachments: Screen Shot 2022-05-18 at 4.26.14 PM.png, Screen Shot 
> 2022-05-18 at 4.26.24 PM.png, Screen Shot 2022-05-18 at 4.27.37 PM.png, 
> image-2022-04-20-14-53-58-484.png
>
>  Time Spent: 4h 40m
>  Remaining Estimate: 0h
>
> Recently I was rereading the HNSW paper 
> ([https://arxiv.org/pdf/1603.09320.pdf)] and noticed that they suggest using 
> a different maxConn for the upper layers vs. the bottom one (which contains 
> the full neighborhood graph). Specifically, they suggest using maxConn=M for 
> upper layers and maxConn=2*M for the bottom. This differs from what we do, 
> which is to use maxConn=M for all layers.
> I tried updating our logic using a hacky patch, and noticed an improvement in 
> latency for higher recall values (which is consistent with the paper's 
> observation):
> *Results on glove-100-angular*
> Parameters: M=32, efConstruction=100
> !image-2022-04-20-14-53-58-484.png|width=400,height=367!
> As we'd expect, indexing becomes a bit slower:
> {code:java}
> Baseline: Indexed 1183514 documents in 733s 
> Candidate: Indexed 1183514 documents in 948s{code}
> When we benchmarked Lucene HNSW against hnswlib in LUCENE-9937, we noticed a 
> big difference in recall for the same settings of M and efConstruction. (Even 
> adding graph layers in LUCENE-10054 didn't really affect recall.) With this 
> change, the recall is now very similar:
> *Results on glove-100-angular*
> Parameters: M=32, efConstruction=100
> {code:java}
> k     Approach                                             Recall        QPS
> 10    luceneknn dim=100 {'M': 32, 'efConstruction': 100}   0.563    4410.499
> 50    luceneknn dim=100 {'M': 32, 'efConstruction': 100}   0.798    1956.280
> 100   luceneknn dim=100 {'M': 32, 'efConstruction': 100}   0.862    1209.734
> 500   luceneknn dim=100 {'M': 32, 'efConstruction': 100}   0.958     341.428
> 800   luceneknn dim=100 {'M': 32, 'efConstruction': 100}   0.974     230.396
> 1000  luceneknn dim=100 {'M': 32, 'efConstruction': 100}   0.980     188.757
> 10    hnswlib ({'M': 32, 'efConstruction': 100})           0.552   16745.433
> 50    hnswlib ({'M': 32, 'efConstruction': 100})           0.794    5738.468
> 100   hnswlib ({'M': 32, 'efConstruction': 100})           0.860    3336.386
> 500   hnswlib ({'M': 32, 'efConstruction': 100})           0.956     832.982
> 800   hnswlib ({'M': 32, 'efConstruction': 100})           0.973     541.097
> 1000  hnswlib ({'M': 32, 'efConstruction': 100})           0.979     442.163
> {code}
> I think it'd be a nice update to maxConn so that we faithfully implement the 
> paper's algorithm. This is probably least surprising for users, and I don't 
> see a strong reason to take a different approach from the paper? Let me know 
> what you think!
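The suggested policy boils down to a one-liner (sketch):

{code:java}
// The bottom layer (level 0) holds the full neighborhood graph and gets twice
// as many links per node, as the HNSW paper suggests.
static int maxConnForLevel(int level, int M) {
  return level == 0 ? 2 * M : M;
}
{code}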



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10612) Add parameters for HNSW codec in Lucene93Codec

2022-06-13 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17553499#comment-17553499
 ] 

Adrien Grand commented on LUCENE-10612:
---

We have been rejecting such requests in the past due to the impact they have on 
backward compatibility, as the default codec has strong backward compatibility 
guarantees, and we need to make sure that the compatibility guarantees hold for 
every combination of options.

Stored fields are indeed an exception because it was hard to come up with 
values that would work well enough for everyone. But it was done in a way that 
has a very small surface: it doesn't expose the algorithm that is used under 
the hood, the size of the blocks, or the DEFLATE compression level; it's only 
two options with opaque implementation details. On the other hand, maxConn 
and beamWidth are specific implementation details of HNSW that can take a large 
range of values. And even with only two possible options, we still set the bar 
pretty high for configurability of the default codec, e.g. there was an option 
for doc values at some point that we ended up removing.

Would it work for you to override `Lucene93Codec#getKnnVectorsFormatForField`? 
The caveat is that it is customizing file formats, so it puts you on your own 
regarding backward compatibility.
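
For illustration, here is a minimal sketch of that workaround, assuming the 9.x 
per-field hooks and the Lucene92HnswVectorsFormat(maxConn, beamWidth) 
constructor; the class name and usage are made up for the example, not a 
recommendation:

{code:java}
import org.apache.lucene.codecs.Codec;
import org.apache.lucene.codecs.KnnVectorsFormat;
import org.apache.lucene.codecs.lucene92.Lucene92HnswVectorsFormat;
import org.apache.lucene.codecs.lucene93.Lucene93Codec;
import org.apache.lucene.index.IndexWriterConfig;

public class CustomHnswCodecExample {
  public static IndexWriterConfig newConfig(int maxConn, int beamWidth) {
    // Override the per-field hook to pick custom HNSW graph parameters.
    Codec codec = new Lucene93Codec() {
      @Override
      public KnnVectorsFormat getKnnVectorsFormatForField(String field) {
        // Same parameters for every vector field; a real implementation
        // could switch on the field name.
        return new Lucene92HnswVectorsFormat(maxConn, beamWidth);
      }
    };
    return new IndexWriterConfig().setCodec(codec);
  }
}
{code}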

> Add parameters for HNSW codec in Lucene93Codec
> --
>
> Key: LUCENE-10612
> URL: https://issues.apache.org/jira/browse/LUCENE-10612
> Project: Lucene - Core
>  Issue Type: Task
>  Components: core/codecs
>Reporter: Elia Porciani
>Priority: Major
>
> Currently, it is possible to specify only the compression mode for stored 
> fields in the LuceneXXCodec constructors.
> With the introduction of HNSW graph, and the LuceneXXHnswCodecFormat, 
> LuceneXXCodec should provide an easy way to specify custom parameters for 
> HNSW graph layout:
> * maxConn
> * beamWidth



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10480) Specialize 2-clauses disjunctions

2022-06-13 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17553494#comment-17553494
 ] 

Adrien Grand commented on LUCENE-10480:
---

Good question. Looking at your BlockMaxMaxScoreScorer, it looks like it also has 
potential for being specialized in the 2-clauses case, by having two sub scorers 
and tracking during document collection whether the scorer that produces lower 
scores is optional or required. I didn't have concrete plans in mind when 
opening the issue; I was just observing that we pay significant overhead for 
supporting arbitrary numbers of clauses when disjunctions often have only two 
clauses.
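
To make the overhead point concrete, here is an illustrative sketch of how 
small the iteration logic becomes with exactly two clauses. It only covers 
iterating over the union (no block-max logic or score tracking), and the class 
name is made up:

{code:java}
import java.io.IOException;
import org.apache.lucene.search.DocIdSetIterator;

// Sketch: a 2-clause disjunction needs no linked list or priority queues,
// just two sub-iterators whose minimum doc ID is the current match.
final class TwoClauseDisjunctionIterator extends DocIdSetIterator {
  private final DocIdSetIterator s1, s2;

  TwoClauseDisjunctionIterator(DocIdSetIterator s1, DocIdSetIterator s2) {
    this.s1 = s1;
    this.s2 = s2;
  }

  @Override
  public int docID() {
    return Math.min(s1.docID(), s2.docID());
  }

  @Override
  public int nextDoc() throws IOException {
    int doc = docID();
    // Advance whichever sub-iterator(s) sit on the current doc.
    if (s1.docID() == doc) {
      s1.nextDoc();
    }
    if (s2.docID() == doc) {
      s2.nextDoc();
    }
    return docID();
  }

  @Override
  public int advance(int target) throws IOException {
    if (s1.docID() < target) {
      s1.advance(target);
    }
    if (s2.docID() < target) {
      s2.advance(target);
    }
    return docID();
  }

  @Override
  public long cost() {
    return s1.cost() + s2.cost();
  }
}
{code}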

> Specialize 2-clauses disjunctions
> -
>
> Key: LUCENE-10480
> URL: https://issues.apache.org/jira/browse/LUCENE-10480
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Adrien Grand
>Priority: Minor
>
> WANDScorer is nice, but it also has lots of overhead to maintain its 
> invariants: one linked list for the current candidates, one priority queue of 
> scorers that are behind, another one for scorers that are ahead. All this 
> could be simplified in the 2-clauses case, which feels worth specializing for 
> as it's very common that end users enter queries that only have two terms?



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10603) Improve iteration of ords for SortedSetDocValues

2022-06-09 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17552280#comment-17552280
 ] 

Adrien Grand commented on LUCENE-10603:
---

+1

I'd be curious to get thoughts from other people, but my understanding was that 
we'd like to deprecate iteration using NO_MORE_ORDS over the longer term.

> Improve iteration of ords for SortedSetDocValues
> 
>
> Key: LUCENE-10603
> URL: https://issues.apache.org/jira/browse/LUCENE-10603
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Lu Xugang
>Priority: Trivial
>
> After SortedSetDocValues#docValueCount was added in Lucene 9.2, should we 
> refactor the implementation of ords iteration to use docValueCount instead of 
> NO_MORE_ORDS?
> Similar to how SortedNumericDocValues did it.
> From 
> {code:java}
> for (long ord = values.nextOrd(); ord != SortedSetDocValues.NO_MORE_ORDS; ord 
> = values.nextOrd()) {
> }{code}
> to
> {code:java}
> for (int i = 0; i < values.docValueCount(); i++) {
>   long ord = values.nextOrd();
> }{code}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10602) Dynamic Index Cache Sizing

2022-06-09 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17552279#comment-17552279
 ] 

Adrien Grand commented on LUCENE-10602:
---

We sure can build something custom in Elasticsearch; I was thinking that 
evicting unused entries would be generally useful to all Lucene users, who are 
encouraged to use LRUQueryCache since it's currently the only implementation 
of the QueryCache interface.

> Dynamic Index Cache Sizing
> --
>
> Key: LUCENE-10602
> URL: https://issues.apache.org/jira/browse/LUCENE-10602
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Chris Earle
>Priority: Major
>
> Working with Lucene's filter cache, it has become apparent that it can be an 
> enormous drain on the heap and therefore the JVM. After extensive usage of an 
> index, it is not uncommon to tune performance by shrinking or altogether 
> removing the filter cache.
> Lucene tracks hit/miss stats of the filter cache, but it does nothing with 
> the data other than inform an interested user about the effectiveness of 
> their index's caching.
> It would be interesting if Lucene would be able to tune the index filter 
> cache heuristically based on actual usage (age, frequency, and value).
> This could ultimately be used to give GBs of heap back to an individual 
> Lucene instance instead of burning it on cache storage that's not effectively 
> used (or useful).



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Resolved] (LUCENE-10599) Improve LogMergePolicy's handling of maxMergeSize

2022-06-09 Thread Adrien Grand (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand resolved LUCENE-10599.
---
Fix Version/s: 9.3
   Resolution: Fixed

> Improve LogMergePolicy's handling of maxMergeSize
> -
>
> Key: LUCENE-10599
> URL: https://issues.apache.org/jira/browse/LUCENE-10599
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
> Fix For: 9.3
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> LogMergePolicy excludes from merging segments whose size is greater than or 
> equal to maxMergeSize. Since a segment whose size is maxMergeSize-1 is still 
> considered for merging, segments will effectively reach a size somewhere 
> between maxMergeSize and mergeFactor*maxMergeSize before they are not 
> considered for merging anymore.
> At least this is what I thought. When LogMergePolicy ignores a segment that 
> is too large for merging, it also ignores other segments that are in the same 
> window of mergeFactor segments for merging if they are on the same tier. So 
> actually segments might reach a size that is somewhere between maxMergeSize / 
> mergeFactor^0.75 and maxMergeSize * mergeFactor before they are not 
> considered for merging anymore.
> Assuming a merge factor of 10 and a max merge size of 1,000 this means that 
> segments will reach their maximum size somewhere between 178 and 10,000. This 
> range is too large and makes maxMergeSize too hard to reason about?
> Specifically, if you have 10 999-docs segments, then LogDocMergePolicy will 
> happily merge them into a single 9990-docs segment. However, if you have one 
> 1,000-docs segment and 9 180-docs segments, then the 180-docs segments will not 
> get merged with any other segment, even if you keep adding segments to the 
> index.
> I propose to change this behavior so that when a large segment is 
> encountered, we wouldn't skip the entire window of mergeFactor segments, 
> but only the segments that are too large.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Created] (LUCENE-10608) Implement Weight#count for pure conjunctions

2022-06-09 Thread Adrien Grand (Jira)
Adrien Grand created LUCENE-10608:
-

 Summary: Implement Weight#count for pure conjunctions
 Key: LUCENE-10608
 URL: https://issues.apache.org/jira/browse/LUCENE-10608
 Project: Lucene - Core
  Issue Type: New Feature
Reporter: Adrien Grand


It's common for Elasticsearch to ingest time-based data where newer segments 
contain recent data and older segments contain older data. On such indices, 
range queries on the time field often match either all or none of the documents 
in a segment.

We could implement Weight#count on pure conjunctions to take advantage of this 
by returning 0 if any of the clauses has a match count of 0, or, when every 
other clause matches all documents, the count of the single clause whose match 
count is below maxDoc.
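
For illustration, a rough sketch of this logic, assuming Weight#count returns 
-1 when a per-leaf count is not cheaply available; the helper class below is 
made up, not the actual patch:

{code:java}
import java.io.IOException;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.search.Weight;

final class ConjunctionCountExample {
  private ConjunctionCountExample() {}

  /** Match count of the conjunction of the given clauses on this leaf,
   *  or -1 if it cannot be computed cheaply. */
  static int count(LeafReaderContext context, Weight... clauses) throws IOException {
    int maxDoc = context.reader().maxDoc();
    int count = maxDoc; // result if every clause matches all docs
    for (Weight clause : clauses) {
      int subCount = clause.count(context);
      if (subCount == 0) {
        return 0; // one clause matches nothing, so the conjunction does too
      }
      if (subCount == -1 || (subCount != maxDoc && count != maxDoc)) {
        return -1; // unknown sub-count, or two clauses that only partially match
      }
      if (subCount != maxDoc) {
        count = subCount; // the single partially-matching clause drives the count
      }
    }
    return count;
  }
}
{code}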



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10602) Dynamic Index Cache Sizing

2022-06-08 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17551712#comment-17551712
 ] 

Adrien Grand commented on LUCENE-10602:
---

I work with Chris and suggested that he open this issue, so I'll try to provide 
a bit more context.

It's not uncommon for us to have nodes that handle several TBs of data. With 
documents that need ~250 bytes each, which is also typical, this gives ~4.4B 
documents per TB of data. Caching a query for 4.4B documents requires ~520MB of 
memory assuming one bit per document. So if we want to be able to cache, say 4 
queries across 1TB of data, then we need ~2GB of heap for the query cache.
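
For reference, the back-of-the-envelope arithmetic behind these numbers, 
assuming 1 TiB of data, ~250 bytes per document, and one bit per document per 
cached query:

{code:java}
public class CacheSizingMath {
  public static void main(String[] args) {
    long oneTiB = 1L << 40;                            // 1,099,511,627,776 bytes
    long docs = oneTiB / 250;                          // ~4.4B documents
    long bytesPerCachedQuery = docs / 8;               // one bit per doc: ~550MB (~520MiB)
    long heapForFourQueries = 4 * bytesPerCachedQuery; // ~2GB of heap
    System.out.println(docs + " docs, " + bytesPerCachedQuery
        + " bytes per cached query, " + heapForFourQueries + " bytes for 4 queries");
  }
}
{code}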

We could give less memory to the cache, but this would increase the risk that 
every new entry in the cache evicts a hot entry from the cache. This is 
potentially an issue for the query cache since computing cache entries has 
overhead: it requires evaluating all documents that match the query, while the 
query that is being cached might be used in a conjunction that only requires 
evaluating a subset of the matching docs.

But we're also seeing the opposite case when the cache is oversized for the 
amount of data that a node handles. And because the cache only evicts when it's 
full or when segments get closed, the cache will often grow until it's 
completely full, even though most cache entries never get used.

We don't know at node startup time how much data this node is going to handle 
eventually, which makes it impossible to size the query cache correctly. So if 
Lucene's query cache could evict entries from the cache when they appear to be 
very little used, this would help not spend large amounts of heap on useless 
cache entries.

> Dynamic Index Cache Sizing
> --
>
> Key: LUCENE-10602
> URL: https://issues.apache.org/jira/browse/LUCENE-10602
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Chris Earle
>Priority: Major
>
> Working with Lucene's filter cache, it has become apparent that it can be an 
> enormous drain on the heap and therefore the JVM. After extensive usage of an 
> index, it is not uncommon to tune performance by shrinking or altogether 
> removing the filter cache.
> Lucene tracks hit/miss stats of the filter cache, but it does nothing with 
> the data other than inform an interested user about the effectiveness of 
> their index's caching.
> It would be interesting if Lucene would be able to tune the index filter 
> cache heuristically based on actual usage (age, frequency, and value).
> This could ultimately be used to give GBs of heap back to an individual 
> Lucene instance instead of burning it on cache storage that's not effectively 
> used (or useful).



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10396) Automatically create sparse indexes for sort fields

2022-06-08 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17551429#comment-17551429
 ] 

Adrien Grand commented on LUCENE-10396:
---

Another potential use-case we are interested in for Elasticsearch would be to 
have the ability to visit a single document per field value for a field that is 
the primary index sort. This has a few applications, one of them is to compute 
the number of unique values of this primary sort field for documents that match 
a query. The collector could implement {{LeafCollector#competitiveIterator}} by 
using the sparse index to skip all documents that have the same value as the 
last collected hit.
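
A purely illustrative sketch of such a collector follows; the SparseIndex 
interface and its nextValueChange method are hypothetical, since the sparse 
index abstraction proposed on this issue doesn't exist yet:

{code:java}
import java.io.IOException;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.search.LeafCollector;
import org.apache.lucene.search.Scorable;

/** Counts unique values of the primary sort field among matching docs. */
final class UniqueValueCountCollector implements LeafCollector {

  /** Hypothetical accessor over a sparse index of the primary sort field. */
  interface SparseIndex {
    /** First docID after {@code doc} whose sort value differs from doc's value. */
    int nextValueChange(int doc) throws IOException;
  }

  private final SparseIndex sparseIndex;
  private final int maxDoc;
  long uniqueValues;
  private int firstCompetitive = 0; // docs before this share an already-counted value

  UniqueValueCountCollector(SparseIndex sparseIndex, int maxDoc) {
    this.sparseIndex = sparseIndex;
    this.maxDoc = maxDoc;
  }

  @Override
  public void setScorer(Scorable scorer) {}

  @Override
  public void collect(int doc) throws IOException {
    uniqueValues++;
    // All remaining docs with the same sort value can be skipped.
    firstCompetitive = sparseIndex.nextValueChange(doc);
  }

  @Override
  public DocIdSetIterator competitiveIterator() {
    // The search loop intersects matches with this iterator, which jumps
    // over the docs made non-competitive by collect().
    return new DocIdSetIterator() {
      private int doc = -1;

      @Override public int docID() { return doc; }

      @Override public long cost() { return maxDoc; }

      @Override
      public int nextDoc() throws IOException {
        return advance(doc + 1);
      }

      @Override
      public int advance(int target) throws IOException {
        doc = Math.max(target, firstCompetitive);
        return doc < maxDoc ? doc : (doc = NO_MORE_DOCS);
      }
    };
  }
}
{code}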

> Automatically create sparse indexes for sort fields
> ---
>
> Key: LUCENE-10396
> URL: https://issues.apache.org/jira/browse/LUCENE-10396
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Adrien Grand
>Priority: Minor
> Attachments: sorted_conjunction.png
>
>
> On Elasticsearch we're more and more leveraging index sorting not as a way to 
> be able to early terminate sorted queries, but as a way to cluster doc IDs 
> that share similar properties so that queries can take advantage of it. For 
> instance imagine you're maintaining a catalog of cars for sale, by sorting by 
> car type, then fuel type then price. Then all cars with the same type, fuel 
> type and similar prices will be stored in a contiguous range of doc IDs. 
> Without index sorting, conjunctions across these 3 fields would be almost a 
> worst-case scenario as every clause might match lots of documents while their 
> intersection might not. With index sorting enabled however, there's only a 
> very small number of calls to advance() that would lead to doc IDs that do 
> not match, because these advance() calls that do not lead to a match would 
> always jump over a large number of doc IDs. I created the example below for 
> ApacheCon last year, which demonstrates the benefits of index sorting on 
> conjunctions. In both cases, the index stores the same data; it just gets 
> a different doc ID ordering thanks to index sorting:
> !sorted_conjunction.png!
> While index sorting can help improve query efficiency out-of-the-box, there 
> is a lot more we can do by taking advantage of the index sort explicitly. For 
> instance {{IndexSortSortedNumericDocValuesRangeQuery}} can speed up range 
> queries on fields that are primary sort fields by performing a binary search 
> to identify the first and last documents that match the range query. I would 
> like to introduce [sparse 
> indexes|https://en.wikipedia.org/wiki/Database_index#Sparse_index] for fields 
> that are used for index sorting, with the goal of improving the runtime of 
> {{IndexSortSortedNumericDocValuesRangeQuery}} by making it less I/O intensive 
> and making it easier and more efficient to leverage index sorting to filter 
> on subsequent sort fields. A simple form of a sparse index could consist of 
> storing every N-th values of the fields that are used for index sorting.
> In terms of implementation, sparse indexing should be cheap enough that we 
> wouldn't need to make it configurable and could enable it automatically as 
> soon as index sorting is enabled. And it would get its own file format 
> abstraction.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10592) Should we build HNSW graph on the fly during indexing

2022-06-07 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17551199#comment-17551199
 ] 

Adrien Grand commented on LUCENE-10592:
---

+1

In general I have a preference for "pull" APIs like we have for points and doc 
values; for instance, they make it possible to iterate over the data twice 
without materializing a temporary representation of the data. That said, 
it's indeed bad how indexing is super fast today but flushing is dog slow. It 
creates surprising situations where flushes might get stalled because too many 
flushes are still in progress and the overall indexing rate is very irregular. 
So I'd be supportive of moving to a push API that helps us move more of the 
cost of indexing vectors from flushing to indexing.

I guess that one argument against it could be that we're optimizing for one 
particular implementation, and future implementations might better benefit from 
a pull API. I know too little about vector search to have a sense of how likely 
we are to switch to a completely different algorithm in the near future, but in 
my opinion it'd be ok to reconsider the API then since codec APIs are expert.
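
To illustrate the distinction, here is a hedged sketch of the two API styles; 
these interfaces are made up for the example, not Lucene's actual codec API:

{code:java}
import java.io.IOException;

// Pull style: the codec drives iteration at flush time and may iterate the
// buffered vectors several times without materializing a temporary copy.
interface PullVectorsWriter {
  void writeField(String field, Iterable<float[]> vectors) throws IOException;
}

// Push style: the indexing chain hands over each vector as it is indexed,
// so graph construction cost is paid incrementally rather than at flush.
interface PushVectorsWriter {
  void addValue(int docID, float[] vector) throws IOException;
  void finish() throws IOException;
}
{code}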

> Should we build HNSW graph on the fly during indexing
> -
>
> Key: LUCENE-10592
> URL: https://issues.apache.org/jira/browse/LUCENE-10592
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Mayya Sharipova
>Assignee: Mayya Sharipova
>Priority: Minor
>
> Currently, when we index vectors for KnnVectorField, we buffer those vectors 
> in memory, and on flush during segment construction we build an HNSW graph.  
> As building an HNSW graph is very expensive, this makes the flush operation 
> take a lot of time. This also makes overall indexing performance quite 
> unpredictable (as the number of flushes is defined by the memory used and the 
> presence of concurrent searches), e.g. some indexing operations return almost 
> instantly while others that trigger a flush take a lot of time. 
> Building an HNSW graph on the fly as we index vectors allows us to avoid this 
> problem, and spreads the load of HNSW graph construction evenly during indexing.
> This will also supersede LUCENE-10194



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Resolved] (LUCENE-10078) Enable merge-on-refresh by default?

2022-06-07 Thread Adrien Grand (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand resolved LUCENE-10078.
---
Fix Version/s: 9.3
   Resolution: Fixed

> Enable merge-on-refresh by default?
> ---
>
> Key: LUCENE-10078
> URL: https://issues.apache.org/jira/browse/LUCENE-10078
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Michael McCandless
>Priority: Major
> Fix For: 9.3
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> This is a spinoff from the discussion in LUCENE-10073.
> The newish merge-on-refresh ([crazy origin 
> story|https://blog.mikemccandless.com/2021/03/open-source-collaboration-or-how-we.html])
>  feature is a powerful way to reduce searched segment counts, especially 
> helpful for applications using many indexing threads.  Such usage will write 
> many tiny segments on each refresh, which could quickly be merged up during 
> the {{refresh}} operation.
> We would have to implement a default for {{findFullFlushMerges}} 
> (LUCENE-10064 is open for this), and then we would need 
> {{IndexWriterConfig.getMaxFullFlushMergeWaitMillis}} a non-zero value (this 
> issue).



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10601) NullPointerException in Lucene Merge (v7.4)

2022-06-07 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17550849#comment-17550849
 ] 

Adrien Grand commented on LUCENE-10601:
---

This is an old version, but this bit of code hasn't changed in a long time. At 
first sight I have a hard time seeing how we could run into an NPE there. Do you 
have a unit test that reproduces the issue? What JVM are you running?

> NullPointerException in Lucene Merge (v7.4)
> ---
>
> Key: LUCENE-10601
> URL: https://issues.apache.org/jira/browse/LUCENE-10601
> Project: Lucene - Core
>  Issue Type: Bug
>Affects Versions: 7.4
>Reporter: Anantha Krishnan Mahalingam
>Priority: Minor
>
> Exception in thread "Lucene Merge Thread #1" 
> org.apache.lucene.index.MergePolicy$MergeException: 
> java.lang.NullPointerException
> at 
> org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:704)
> at 
> org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:684)
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.lucene.index.SortingTermVectorsConsumer.abort(SortingTermVectorsConsumer.java:86)
>   at org.apache.lucene.index.TermsHash.abort(TermsHash.java:67)
>   at 
> org.apache.lucene.index.DefaultIndexingChain$$Lambda$58/2086447408.close(Unknown
>  Source)
>   at 
> org.apache.lucene.index.DefaultIndexingChain.abort(DefaultIndexingChain.java:321)
>   at 
> org.apache.lucene.index.DocumentsWriterPerThread.abort(DocumentsWriterPerThread.java:138)
>   at 
> org.apache.lucene.index.DocumentsWriter.abortThreadState(DocumentsWriter.java:330)
>   at org.apache.lucene.index.DocumentsWriter.abort(DocumentsWriter.java:232)
>   at 
> org.apache.lucene.index.IndexWriter.rollbackInternalNoCommit(IndexWriter.java:2298)
>   at 
> org.apache.lucene.index.IndexWriter.rollbackInternal(IndexWriter.java:2274)
>   at 
> org.apache.lucene.index.IndexWriter.maybeCloseOnTragicEvent(IndexWriter.java:4860)
>   at org.apache.lucene.index.IndexWriter.tragicEvent(IndexWriter.java:4850)
>   at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:4087)
>   at 
> org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:625)
>   at 
> org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:662)



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



  1   2   3   4   5   6   7   8   9   10   >