[jira] [Commented] (LUCENE-10676) FieldInfo#name contributes significantly to heap usage at scale

2022-08-08 Thread Michael McCandless (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17576678#comment-17576678
 ] 

Michael McCandless commented on LUCENE-10676:
-

Is each field name exotically long as well?

Lucene used to do just this -- intern {{FieldInfo.name}} and then use {{==}} to 
compare field names everywhere.  But we decided long ago that this was 
dangerous and not an important optimization.  Still, that decision was maybe 
made back in pre-Java 7 days, when the interned pool was stored in {{PermGen}} 
instead of "ordinary" heap and was more likely to cause {{OutOfMemoryError}}?

Maybe dig into those long-ago issues / dev-list threads to see the motivation 
for stopping the interning?
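For illustration, interning/deduplication here just means mapping every incoming field name to one canonical {{String}} instance. A minimal, hypothetical sketch (the class name is made up; this is not a Lucene API):

```java
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical deduplication pool for field names -- not a Lucene API.
// Each distinct name is stored once; all FieldInfo instances across
// segments (or indices) would then share the same String object.
final class FieldNamePool {
  private static final ConcurrentHashMap<String, String> POOL = new ConcurrentHashMap<>();

  // Returns the canonical instance for this field name.
  static String canonicalize(String name) {
    String existing = POOL.putIfAbsent(name, name);
    return existing != null ? existing : name;
  }
}
```

Unlike {{String.intern()}}, such a pool lives on the ordinary heap and can be sized, cleared, or scoped per process as needed.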

> FieldInfo#name contributes significantly to heap usage at scale
> ---
>
> Key: LUCENE-10676
> URL: https://issues.apache.org/jira/browse/LUCENE-10676
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/codecs
>Affects Versions: 9.3
> Environment: Seen in Lucene 9.3.0 running on Linux using JDK18 but 
> seems independent of environment.
>Reporter: David Turner
>Priority: Minor
>  Labels: heap, scalability
>
> We encountered an Elasticsearch user with high heap usage, a significant 
> proportion of which was down to the contents of `FieldInfo#name`.
> This user was certainly pushing some scalability boundaries: this single 
> process had thousands of active Lucene indices, many with 10k+ fields, and 
> many indices had hundreds of segments due to an excess of flushes, so in 
> total they had an enormous number of `FieldInfo` instances. Still, the bulk 
> of the heap usage was just field names, and the total number of distinct 
> field names was fairly small. That's pretty common, especially for time-based 
> data like logs. Some kind of interning or deduplication of these strings 
> would have reduced their heap usage by many GBs.
> Is there a way we could deduplicate these strings? Deduplicating them across 
> segments within each index would already have helped, but ideally we'd like 
> to deduplicate them across indices too.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Resolved] (LUCENE-10646) Add some comment on LevenshteinAutomata

2022-08-07 Thread Michael McCandless (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless resolved LUCENE-10646.
-
Fix Version/s: 10.0 (main)
   9.4
   Resolution: Fixed

Thank you [~tangdh]!

> Add some comment on LevenshteinAutomata
> ---
>
> Key: LUCENE-10646
> URL: https://issues.apache.org/jira/browse/LUCENE-10646
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/FSTs
>Affects Versions: 9.2
>Reporter: tang donghai
>Priority: Minor
> Fix For: 10.0 (main), 9.4
>
>  Time Spent: 5h 20m
>  Remaining Estimate: 0h
>
> After having a hard time reading the code, I believe I now understand the 
> relevant code of LevenshteinAutomata, except for the minErrors part.
> That part of the code is too difficult to understand and is full of magic 
> numbers. I will sort it out and then raise a PR adding the necessary 
> comments, so that others can understand this part of the code more easily.






[jira] [Commented] (LUCENE-8675) Divide Segment Search Amongst Multiple Threads

2022-08-04 Thread Michael McCandless (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17575323#comment-17575323
 ] 

Michael McCandless commented on LUCENE-8675:


bq. I wonder if we could avoid paying the cost of Scorer/BulkScorer 
initialization multiple times by implementing Cloneable on these classes, 
similarly to how we use cloning on IndexInputs to consume them from multiple 
threads. 

+1
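The issue below proposes splitting a single segment into mutually exclusive doc ID ranges, one per thread. A tiny sketch of how such ranges could be computed (hypothetical helper, not a Lucene API):

```java
// Hypothetical helper: split [0, maxDoc) into numSlices contiguous,
// mutually exclusive doc ID ranges, one per worker thread.
// Each range is {startInclusive, endExclusive}.
final class SegmentSlices {
  static int[][] docIdRanges(int maxDoc, int numSlices) {
    int[][] ranges = new int[numSlices][];
    int sliceSize = (maxDoc + numSlices - 1) / numSlices; // ceil(maxDoc / numSlices)
    for (int i = 0; i < numSlices; i++) {
      int start = Math.min(maxDoc, i * sliceSize);
      int end = Math.min(maxDoc, start + sliceSize);
      ranges[i] = new int[] {start, end};
    }
    return ranges;
  }
}
```

Each worker would then score only its own range, e.g. with a per-slice clone of the Scorer/BulkScorer as suggested above.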

> Divide Segment Search Amongst Multiple Threads
> --
>
> Key: LUCENE-8675
> URL: https://issues.apache.org/jira/browse/LUCENE-8675
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/search
>Reporter: Atri Sharma
>Priority: Major
> Attachments: PhraseHighFreqP50.png, PhraseHighFreqP90.png, 
> TermHighFreqP50.png, TermHighFreqP90.png
>
>
> Segment search is a single threaded operation today, which can be a 
> bottleneck for large analytical queries which index a lot of data and have 
> complex queries which touch multiple segments (imagine a composite query with 
> range query and filters on top). This ticket is for discussing the idea of 
> splitting a single segment into multiple threads based on mutually exclusive 
> document ID ranges.
> This will be a two-phase effort, the first phase targeting queries that return 
> all matching documents (collectors not terminating early). The second phase 
> will introduce staged execution and will build on top of this patch.






[jira] [Commented] (LUCENE-10672) Re-evaluate different ways to encode postings

2022-08-03 Thread Michael McCandless (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17574668#comment-17574668
 ] 

Michael McCandless commented on LUCENE-10672:
-

+1

Maybe we can also peek at how Vespa and Tantivy do their encoding ... any 
inspiration there?

The world needs more popular open-source search engines.
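The issue below contrasts block codecs like FOR with codings that can skip directly on the compressed form. As a reminder of what FOR-style block encoding involves: doc IDs are delta-gapped and the whole block is packed at the bit width of the largest gap. A toy sketch of those first steps (illustrative only, not Lucene's actual codec):

```java
// Toy illustration of FOR-style block encoding preliminaries: delta-gap a
// sorted postings block, then compute the bit width needed to pack it.
// Not Lucene's actual implementation.
final class ForSketch {
  static int[] deltas(int[] sortedDocIds) {
    int[] gaps = new int[sortedDocIds.length];
    int prev = 0;
    for (int i = 0; i < gaps.length; i++) {
      gaps[i] = sortedDocIds[i] - prev;
      prev = sortedDocIds[i];
    }
    return gaps;
  }

  static int bitsPerValue(int[] gaps) {
    int max = 1; // use at least one bit, even for an all-zero block
    for (int g : gaps) max = Math.max(max, g);
    return 32 - Integer.numberOfLeadingZeros(max);
  }
}
```

The skipping cost the issue describes follows from this layout: {{advance(target)}} must decode a whole packed block even if only a couple of values in it are relevant, whereas Elias-Fano-style codings support skipping on the encoded representation itself.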

> Re-evaluate different ways to encode postings
> -
>
> Key: LUCENE-10672
> URL: https://issues.apache.org/jira/browse/LUCENE-10672
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Adrien Grand
>Priority: Minor
>
> In Lucene 4, we moved to FOR to encode postings because it would give better 
> throughput compared to VInts that we had been using until then. This was a 
> time when Lucene would often need to evaluate entire postings lists, and 
> optimizations like BS1 were very important for good performance.
> Nowadays, Lucene performs more dynamic pruning and it's less frequent that 
> Lucene needs to evaluate all hits that match a query. So the performance of 
> {{nextDoc()}} has become a bit less relevant while the performance of 
> {{advance(target)}} has become more relevant.
> I wonder if we should re-evaluate other ways to encode postings that are 
> theoretically better at skipping, such as Elias-Fano coding, since they 
> support skipping directly on the encoded representation instead of requiring 
> decoding a full block of integers where only a couple of them would be 
> relevant.






[jira] [Resolved] (LUCENE-10216) Add concurrency to addIndexes(CodecReader…) API

2022-08-01 Thread Michael McCandless (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless resolved LUCENE-10216.
-
Fix Version/s: 9.4
   Resolution: Fixed
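The proposal below boils down to submitting one single-reader merge job per incoming reader and waiting for all of them to complete. Sketched with a plain executor (a hypothetical shape, not the actual Lucene API; the jobs stand in for {{SegmentMerger.merge()}} calls):

```java
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch: run one merge job per reader concurrently and
// block until all complete, mirroring the proposed concurrent
// addIndexes(CodecReader...) behavior.
final class ConcurrentAddSketch {
  static void runAll(List<Callable<Void>> mergeJobs) throws Exception {
    ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, mergeJobs.size()));
    try {
      for (Future<Void> f : pool.invokeAll(mergeJobs)) {
        f.get(); // propagate the first merge failure, if any
      }
    } finally {
      pool.shutdown();
    }
  }
}
```

Using {{ConcurrentMergeScheduler}} instead of a private pool, as the issue suggests, additionally gets merge throttling and thread accounting for free.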

> Add concurrency to addIndexes(CodecReader…) API
> ---
>
> Key: LUCENE-10216
> URL: https://issues.apache.org/jira/browse/LUCENE-10216
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Vigya Sharma
>Priority: Major
> Fix For: main, 9.4
>
>  Time Spent: 11h 40m
>  Remaining Estimate: 0h
>
> I work at Amazon Product Search, and we use Lucene to power search for the 
> e-commerce platform. I’m working on a project that involves applying 
> metadata+ETL transforms and indexing documents on n different _indexing_ 
> boxes, combining them into a single index on a separate _reducer_ box, and 
> making it available for queries on m different _search_ boxes (replicas). 
> Segments are asynchronously copied from indexers to reducers to searchers as 
> they become available for the next layer to consume.
> I am using the addIndexes API to combine multiple indexes into one on the 
> reducer boxes. Since we also have taxonomy data, we need to remap facet field 
> ordinals, which means I need to use the {{addIndexes(CodecReader…)}} version 
> of this API. The API leverages {{SegmentMerger.merge()}} to create segments 
> with new ordinal values while also merging all provided segments in the 
> process.
> _This is however a blocking call that runs in a single thread._ Until we have 
> written segments with new ordinal values, we cannot copy them to searcher 
> boxes, which increases the time to make documents available for search.
> I was playing around with the API by creating multiple concurrent merges, 
> each with only a single reader, creating a concurrently running 1:1 
> conversion from old segments to new ones (with new ordinal values). We follow 
> this up with non-blocking background merges. This lets us copy the segments 
> to searchers and replicas as soon as they are available, and later replace 
> them with merged segments as background jobs complete. On the Amazon dataset 
> I profiled, this gave us around 2.5 to 3x improvement in addIndexes() time. 
> Each call was given about 5 readers to add on average.
> This might be a useful addition to Lucene. We could create another 
> {{addIndexes()}} API with a {{boolean}} flag for concurrency, that internally 
> submits multiple merge jobs (each with a single reader) to the 
> {{ConcurrentMergeScheduler}}, and waits for them to complete before returning.
> While this is doable from outside Lucene by using your own thread pool, 
> starting multiple addIndexes() calls and waiting for them to complete, it 
> requires some understanding of what addIndexes does, why you need to wait on 
> the merge, and why it makes sense to pass a single reader to the addIndexes 
> API. Out-of-the-box support in Lucene could simplify this for folks with a 
> similar use case.






[jira] [Commented] (LUCENE-10216) Add concurrency to addIndexes(CodecReader…) API

2022-08-01 Thread Michael McCandless (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17573796#comment-17573796
 ] 

Michael McCandless commented on LUCENE-10216:
-

Awesome!  I think we can close this now [~vigyas]?

> Add concurrency to addIndexes(CodecReader…) API
> ---
>
> Key: LUCENE-10216
> URL: https://issues.apache.org/jira/browse/LUCENE-10216
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Vigya Sharma
>Priority: Major
> Fix For: main
>
>  Time Spent: 11h 40m
>  Remaining Estimate: 0h
>






[jira] [Commented] (LUCENE-10664) SearcherManager should return new IndexSearchers every time

2022-07-27 Thread Michael McCandless (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17572105#comment-17572105
 ] 

Michael McCandless commented on LUCENE-10664:
-

Hmm, {{SearcherLifetimeManager}} too.  Maybe we need an 
{{IndexReaderLifetimeManager}}.
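The underlying pattern being discussed: keep sharing the expensive, ref-counted resource (the reader), but hand out a fresh, cheap wrapper (the searcher) on every acquire, so mutable per-request state such as a timeout deadline is never reused across queries. A generic sketch of that shape (illustrative only, not the Lucene API):

```java
// Generic sketch of the fix under discussion: the manager shares one
// expensive resource (standing in for the cached IndexReader) but returns
// a brand-new cheap wrapper (standing in for an IndexSearcher) on every
// acquire, so per-request state is never shared between queries.
final class SharedResourceManager {
  static final class Wrapper {
    final Object resource;
    long timeoutDeadlineNanos; // per-request state; must not be reused

    Wrapper(Object resource) {
      this.resource = resource;
    }
  }

  private final Object sharedResource = new Object();

  Wrapper acquire() {
    return new Wrapper(sharedResource); // fresh wrapper each call
  }
}
```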

> SearcherManager should return new IndexSearchers every time
> ---
>
> Key: LUCENE-10664
> URL: https://issues.apache.org/jira/browse/LUCENE-10664
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Adrien Grand
>Priority: Major
>
> SearcherManager caches IndexSearcher instances. This is no longer a good 
> approach now that IndexSearcher has timeout support (LUCENE-10151) and keeps 
> track of the time until which queries are allowed to run.






[jira] [Updated] (LUCENE-10583) Deadlock with MMapDirectory while waitForMerges

2022-07-27 Thread Michael McCandless (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-10583:

Fix Version/s: 9.4
   (was: 9.3)

> Deadlock with MMapDirectory while waitForMerges
> ---
>
> Key: LUCENE-10583
> URL: https://issues.apache.org/jira/browse/LUCENE-10583
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Affects Versions: 8.11.1
> Environment: Java 17
> OS: Windows 2016
>Reporter: Thomas Hoffmann
>Priority: Minor
> Fix For: 10.0 (main), 9.4
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Hello,
> a deadlock situation happened in our application. We are using MMapDirectory 
> on Windows 2016 and got the following stacktrace:
> {code:java}
> "https-openssl-nio-443-exec-30" #166 daemon prio=5 os_prio=0 cpu=78703.13ms 
> elapsed=81248.18s tid=0x2860af10 nid=0x237c in Object.wait()  
> [0x413fc000]
>    java.lang.Thread.State: TIMED_WAITING (on object monitor)
>     at java.lang.Object.wait(java.base@17.0.2/Native Method)
>     - waiting on 
>     at org.apache.lucene.index.IndexWriter.doWait(IndexWriter.java:4983)
>     - locked <0x0006ef1fc020> (a org.apache.lucene.index.IndexWriter)
>     at 
> org.apache.lucene.index.IndexWriter.waitForMerges(IndexWriter.java:2697)
>     - locked <0x0006ef1fc020> (a org.apache.lucene.index.IndexWriter)
>     at org.apache.lucene.index.IndexWriter.shutdown(IndexWriter.java:1236)
>     at org.apache.lucene.index.IndexWriter.close(IndexWriter.java:1278)
>     at 
> com.speed4trade.ebs.module.search.SearchService.updateSearchIndex(SearchService.java:1723)
>     - locked <0x0006d5c00208> (a org.apache.lucene.store.MMapDirectory)
>     at 
> com.speed4trade.ebs.module.businessrelations.ticket.TicketChangedListener.postUpdate(TicketChangedListener.java:142)
> ...{code}
> All threads were waiting to lock <0x0006d5c00208>, which was never 
> released.
> A Lucene thread was also blocked; I don't know if this is relevant:
> {code:java}
> "Lucene Merge Thread #0" #18466 daemon prio=5 os_prio=0 cpu=15.63ms 
> elapsed=3499.07s tid=0x459453e0 nid=0x1f8 waiting for monitor entry  
> [0x5da9e000]
>    java.lang.Thread.State: BLOCKED (on object monitor)
>     at 
> org.apache.lucene.store.FSDirectory.deletePendingFiles(FSDirectory.java:346)
>     - waiting to lock <0x0006d5c00208> (a 
> org.apache.lucene.store.MMapDirectory)
>     at 
> org.apache.lucene.store.FSDirectory.maybeDeletePendingFiles(FSDirectory.java:363)
>     at org.apache.lucene.store.FSDirectory.createOutput(FSDirectory.java:248)
>     at 
> org.apache.lucene.store.LockValidatingDirectoryWrapper.createOutput(LockValidatingDirectoryWrapper.java:44)
>     at 
> org.apache.lucene.index.ConcurrentMergeScheduler$1.createOutput(ConcurrentMergeScheduler.java:289)
>     at 
> org.apache.lucene.store.TrackingDirectoryWrapper.createOutput(TrackingDirectoryWrapper.java:43)
>     at 
> org.apache.lucene.codecs.compressing.CompressingStoredFieldsWriter.<init>(CompressingStoredFieldsWriter.java:121)
>     at 
> org.apache.lucene.codecs.compressing.CompressingStoredFieldsFormat.fieldsWriter(CompressingStoredFieldsFormat.java:130)
>     at 
> org.apache.lucene.codecs.lucene87.Lucene87StoredFieldsFormat.fieldsWriter(Lucene87StoredFieldsFormat.java:141)
>     at 
> org.apache.lucene.index.SegmentMerger.mergeFields(SegmentMerger.java:227)
>     at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:105)
>     at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4757)
>     at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:4361)
>     at 
> org.apache.lucene.index.IndexWriter$IndexWriterMergeSource.merge(IndexWriter.java:5920)
>     at 
> org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:626)
>     at 
> org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:684){code}
> It looks like the merge operation never finished and never released the lock.
> Is there any way to prevent this deadlock, or how can we investigate it further?
> Unfortunately, a load test didn't reproduce the problem.






[jira] [Resolved] (LUCENE-10583) Deadlock with MMapDirectory while waitForMerges

2022-07-27 Thread Michael McCandless (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless resolved LUCENE-10583.
-
Fix Version/s: 10.0 (main)
   9.3
   Resolution: Fixed

> Deadlock with MMapDirectory while waitForMerges
> ---
>
> Key: LUCENE-10583
> URL: https://issues.apache.org/jira/browse/LUCENE-10583
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Affects Versions: 8.11.1
> Environment: Java 17
> OS: Windows 2016
>Reporter: Thomas Hoffmann
>Priority: Minor
> Fix For: 10.0 (main), 9.3
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>






[jira] [Commented] (LUCENE-10583) Deadlock with MMapDirectory while waitForMerges

2022-07-27 Thread Michael McCandless (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17571822#comment-17571822
 ] 

Michael McCandless commented on LUCENE-10583:
-

bq. We could assert that common lucene objects are lock free at some popular 
public entry points; but how do we differentiate on whether the lock is 
acquired by an internal lucene thread or an external user thread..? We do lock 
on IndexWriter at multiple places within lucene.

Yeah, good point.  And even if we find a public entry point that is not used 
internally today, maybe tomorrow it will be.

So +1 to just resolve this with the javadocs improvements.  Thanks [~vigyas]!
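For reference, the best-effort check discussed above would hinge on {{Thread.holdsLock}}, which only inspects the current thread's monitors -- exactly why it cannot distinguish an internal Lucene thread from a user thread, as noted. A sketch of what such an entry-point check could look like (hypothetical helper, not Lucene code):

```java
// Hypothetical entry-point check: fail fast if the calling thread already
// holds a monitor that Lucene will need later (e.g. while deleting pending
// files during a merge). Thread.holdsLock only sees the current thread,
// so this cannot tell internal Lucene threads from user threads -- the
// limitation discussed above.
final class EntryPointChecks {
  static void checkNoExternalLock(Object directory, Object writer) {
    if (Thread.holdsLock(directory)) {
      throw new IllegalStateException("caller must not synchronize on the Directory");
    }
    if (Thread.holdsLock(writer)) {
      throw new IllegalStateException("caller must not synchronize on the IndexWriter");
    }
  }
}
```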

> Deadlock with MMapDirectory while waitForMerges
> ---
>
> Key: LUCENE-10583
> URL: https://issues.apache.org/jira/browse/LUCENE-10583
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Affects Versions: 8.11.1
> Environment: Java 17
> OS: Windows 2016
>Reporter: Thomas Hoffmann
>Priority: Minor
>  Time Spent: 40m
>  Remaining Estimate: 0h
>





[jira] [Commented] (LUCENE-10583) Deadlock with MMapDirectory while waitForMerges

2022-07-26 Thread Michael McCandless (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17571393#comment-17571393
 ] 

Michael McCandless commented on LUCENE-10583:
-

We could perhaps make a best-effort check, at common incoming APIs, that 
external locks are not already held on {{Directory}} and {{IndexWriter}}?

> Deadlock with MMapDirectory while waitForMerges
> ---
>
> Key: LUCENE-10583
> URL: https://issues.apache.org/jira/browse/LUCENE-10583
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Affects Versions: 8.11.1
> Environment: Java 17
> OS: Windows 2016
>Reporter: Thomas Hoffmann
>Priority: Minor
>  Time Spent: 40m
>  Remaining Estimate: 0h
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10583) Deadlock with MMapDirectory while waitForMerges

2022-07-26 Thread Michael McCandless (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17571383#comment-17571383
 ] 

Michael McCandless commented on LUCENE-10583:
-

[~vigyas] can this be resolved now?

> Deadlock with MMapDirectory while waitForMerges
> ---
>
> Key: LUCENE-10583
> URL: https://issues.apache.org/jira/browse/LUCENE-10583
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Affects Versions: 8.11.1
> Environment: Java 17
> OS: Windows 2016
>Reporter: Thomas Hoffmann
>Priority: Minor
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Hello,
> a deadlock situation happened in our application. We are using MMapDirectory 
> on Windows 2016 and got the following stacktrace:
> {code:java}
> "https-openssl-nio-443-exec-30" #166 daemon prio=5 os_prio=0 cpu=78703.13ms 
> elapsed=81248.18s tid=0x2860af10 nid=0x237c in Object.wait()  
> [0x413fc000]
>    java.lang.Thread.State: TIMED_WAITING (on object monitor)
>     at java.lang.Object.wait(java.base@17.0.2/Native Method)
>     - waiting on 
>     at org.apache.lucene.index.IndexWriter.doWait(IndexWriter.java:4983)
>     - locked <0x0006ef1fc020> (a org.apache.lucene.index.IndexWriter)
>     at 
> org.apache.lucene.index.IndexWriter.waitForMerges(IndexWriter.java:2697)
>     - locked <0x0006ef1fc020> (a org.apache.lucene.index.IndexWriter)
>     at org.apache.lucene.index.IndexWriter.shutdown(IndexWriter.java:1236)
>     at org.apache.lucene.index.IndexWriter.close(IndexWriter.java:1278)
>     at 
> com.speed4trade.ebs.module.search.SearchService.updateSearchIndex(SearchService.java:1723)
>     - locked <0x0006d5c00208> (a org.apache.lucene.store.MMapDirectory)
>     at 
> com.speed4trade.ebs.module.businessrelations.ticket.TicketChangedListener.postUpdate(TicketChangedListener.java:142)
> ...{code}
> All threads were waiting to lock <0x0006d5c00208>, which was never 
> released.
> A Lucene thread was also blocked; I don't know if this is relevant:
> {code:java}
> "Lucene Merge Thread #0" #18466 daemon prio=5 os_prio=0 cpu=15.63ms 
> elapsed=3499.07s tid=0x459453e0 nid=0x1f8 waiting for monitor entry  
> [0x5da9e000]
>    java.lang.Thread.State: BLOCKED (on object monitor)
>     at 
> org.apache.lucene.store.FSDirectory.deletePendingFiles(FSDirectory.java:346)
>     - waiting to lock <0x0006d5c00208> (a 
> org.apache.lucene.store.MMapDirectory)
>     at 
> org.apache.lucene.store.FSDirectory.maybeDeletePendingFiles(FSDirectory.java:363)
>     at org.apache.lucene.store.FSDirectory.createOutput(FSDirectory.java:248)
>     at 
> org.apache.lucene.store.LockValidatingDirectoryWrapper.createOutput(LockValidatingDirectoryWrapper.java:44)
>     at 
> org.apache.lucene.index.ConcurrentMergeScheduler$1.createOutput(ConcurrentMergeScheduler.java:289)
>     at 
> org.apache.lucene.store.TrackingDirectoryWrapper.createOutput(TrackingDirectoryWrapper.java:43)
>     at 
> org.apache.lucene.codecs.compressing.CompressingStoredFieldsWriter.(CompressingStoredFieldsWriter.java:121)
>     at 
> org.apache.lucene.codecs.compressing.CompressingStoredFieldsFormat.fieldsWriter(CompressingStoredFieldsFormat.java:130)
>     at 
> org.apache.lucene.codecs.lucene87.Lucene87StoredFieldsFormat.fieldsWriter(Lucene87StoredFieldsFormat.java:141)
>     at 
> org.apache.lucene.index.SegmentMerger.mergeFields(SegmentMerger.java:227)
>     at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:105)
>     at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4757)
>     at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:4361)
>     at 
> org.apache.lucene.index.IndexWriter$IndexWriterMergeSource.merge(IndexWriter.java:5920)
>     at 
> org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:626)
>     at 
> org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:684){code}
> It looks like the merge operation never finished and never released the lock.
> Is there any option to prevent this deadlock, or how can we investigate it further?
> Unfortunately, a load test didn't reproduce the problem.
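On the "how to investigate it further" question: the JDK can report monitor deadlocks programmatically via {{ThreadMXBean}}, which is often easier to automate than reading full thread dumps. Below is a self-contained sketch using only standard JVM tooling (nothing Lucene-specific); the deliberate two-lock deadlock exists purely to exercise the probe.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class DeadlockProbe {

    /** Deliberately deadlock two daemon threads, then detect them via ThreadMXBean. */
    public static long[] probe() throws InterruptedException {
        final Object lockA = new Object();
        final Object lockB = new Object();
        Thread t1 = locker("probe-t1", lockA, lockB);
        Thread t2 = locker("probe-t2", lockB, lockA);
        t1.start();
        t2.start();
        Thread.sleep(300); // give both threads time to block on each other's monitor

        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        long[] ids = mx.findDeadlockedThreads(); // null when no monitor deadlock exists
        if (ids != null) {
            for (ThreadInfo ti : mx.getThreadInfo(ids)) {
                System.out.println(ti.getThreadName() + " is blocked on " + ti.getLockName());
            }
        }
        return ids;
    }

    /** Locks 'first', pauses, then tries to lock 'second' -- half of a lock-order inversion. */
    private static Thread locker(String name, Object first, Object second) {
        Thread t = new Thread(() -> {
            synchronized (first) {
                try { Thread.sleep(100); } catch (InterruptedException ignored) {}
                synchronized (second) { }
            }
        }, name);
        t.setDaemon(true); // so the deadlocked pair cannot keep the JVM alive
        return t;
    }

    public static void main(String[] args) throws InterruptedException {
        long[] ids = probe();
        System.out.println("deadlocked threads found: " + (ids == null ? 0 : ids.length));
    }
}
```

Running a probe like this periodically (or on shutdown timeout) pinpoints exactly which threads hold and wait on which monitors, without attaching an external tool.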



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10658) Merges should periodically check for abort

2022-07-20 Thread Michael McCandless (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17568962#comment-17568962
 ] 

Michael McCandless commented on LUCENE-10658:
-

+1, merges should abort promptly.  But it is indeed only a "best effort" 
mechanism.

I guess Lucene's completion field is building FSTs during merging and not 
writing bytes to disk as it builds the large FST, until the end?

Maybe there are other parts of Lucene merging that also fail to check promptly 
enough, e.g. maybe when dimensional points are doing a (large) offline sort 
before writing anything to the output files?

Maybe we could instrument {{MergeRateLimiter}} to write a WARNING into 
{{infoStream}} whenever too much time has elapsed between visits to its 
{{maybePause}} API?  We could use that to tease out other places that are 
failing to write bytes frequently enough for abort checking.

Lucene used to check for merge abort deep inside {{IndexWriter}} and merging 
code (e.g. merging postings would check periodically, same for doc values, 
etc.), but I think we refactored that down to the rate limiter only in 
LUCENE-7700 which was a nice cleanup / step forward.
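The instrumentation idea above could look roughly like this standalone sketch; {{PauseGapMonitor}}, its threshold, and the warning text are hypothetical illustrations, not actual Lucene {{MergeRateLimiter}} code.

```java
import java.util.concurrent.TimeUnit;

/**
 * Hypothetical monitor for the gap between visits to a rate limiter's
 * maybePause()-style hook. A large gap means some merge phase ran for a long
 * time without writing bytes -- and therefore without checking for abort.
 */
public class PauseGapMonitor {
    private final long warnGapNanos;
    private long lastVisitNanos = System.nanoTime();

    public PauseGapMonitor(long warnGapMillis) {
        this.warnGapNanos = TimeUnit.MILLISECONDS.toNanos(warnGapMillis);
    }

    /** Called from the pause hook; returns the elapsed gap and warns when it is too large. */
    public long recordVisit() {
        long now = System.nanoTime();
        long gap = now - lastVisitNanos;
        lastVisitNanos = now;
        if (gap > warnGapNanos) {
            // In Lucene this would go to infoStream rather than stdout.
            System.out.println("WARNING: " + TimeUnit.NANOSECONDS.toMillis(gap)
                + " ms elapsed since the last pause check; abort checking was starved");
        }
        return gap;
    }
}
```

Wiring something like this into the pause hook would surface exactly which merge phases (completion-field FSTs, offline sorts, etc.) go too long between abort checks.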

> Merges should periodically check for abort
> --
>
> Key: LUCENE-10658
> URL: https://issues.apache.org/jira/browse/LUCENE-10658
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/index
>Affects Versions: 9.3
>Reporter: Nhat Nguyen
>Priority: Major
>
> Rolling back an IndexWriter without committing shouldn't take long (i.e., 
> less than several seconds), and Elasticsearch cluster coordination [relies 
> on|https://github.com/elastic/elasticsearch/issues/88055] this assumption. If 
> some merges are taking place, the rollback can take several minutes as merges 
> only check for abort when writing to files via 
> [MergeRateLimiter|https://github.com/apache/lucene/blob/3d7d85f245381f84c46c766119695a8645cde2b8/lucene/core/src/java/org/apache/lucene/index/MergeRateLimiter.java#L117-L119].
>  Merging a completion field, for example, can take a long time without 
> touching output files. Another reason merges should periodically check for 
> abort is that their outputs will be discarded.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10216) Add concurrency to addIndexes(CodecReader…) API

2022-07-19 Thread Michael McCandless (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17568756#comment-17568756
 ] 

Michael McCandless commented on LUCENE-10216:
-

I think we can backport to 9.x?  But we should not rush it for 9.3.  It has 
baked for quite a while in {{main}} and [~vigyas] has fixed some follow-on build 
failures.

> Add concurrency to addIndexes(CodecReader…) API
> ---
>
> Key: LUCENE-10216
> URL: https://issues.apache.org/jira/browse/LUCENE-10216
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Vigya Sharma
>Priority: Major
>  Time Spent: 11h 10m
>  Remaining Estimate: 0h
>
> I work at Amazon Product Search, and we use Lucene to power search for the 
> e-commerce platform. I’m working on a project that involves applying 
> metadata+ETL transforms and indexing documents on n different _indexing_ 
> boxes, combining them into a single index on a separate _reducer_ box, and 
> making it available for queries on m different _search_ boxes (replicas). 
> Segments are asynchronously copied from indexers to reducers to searchers as 
> they become available for the next layer to consume.
> I am using the addIndexes API to combine multiple indexes into one on the 
> reducer boxes. Since we also have taxonomy data, we need to remap facet field 
> ordinals, which means I need to use the {{addIndexes(CodecReader…)}} version 
> of this API. The API leverages {{SegmentMerger.merge()}} to create segments 
> with new ordinal values while also merging all provided segments in the 
> process.
> _This is however a blocking call that runs in a single thread._ Until we have 
> written segments with new ordinal values, we cannot copy them to searcher 
> boxes, which increases the time to make documents available for search.
> I was playing around with the API by creating multiple concurrent merges, 
> each with only a single reader, creating a concurrently running 1:1 
> conversion from old segments to new ones (with new ordinal values). We follow 
> this up with non-blocking background merges. This lets us copy the segments 
> to searchers and replicas as soon as they are available, and later replace 
> them with merged segments as background jobs complete. On the Amazon dataset 
> I profiled, this gave us around 2.5 to 3x improvement in addIndexes() time. 
> Each call was given about 5 readers to add on average.
> This might be a useful addition to Lucene. We could create another {{addIndexes()}} 
> API with a {{boolean}} flag for concurrency, that internally submits multiple 
> merge jobs (each with a single reader) to the {{ConcurrentMergeScheduler}}, 
> and waits for them to complete before returning.
> While this is doable from outside Lucene by using your thread pool, starting 
> multiple addIndexes() calls and waiting for them to complete, I felt it needs 
> some understanding of what addIndexes does, why you need to wait on the merge 
> and why it makes sense to pass a single reader in the addIndexes API.
> Out-of-the-box support in Lucene could simplify this for folks with a similar use case.
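The "submit one merge job per reader, then wait for all of them" shape described above can be sketched with a plain executor. This is a hedged illustration: the string "readers" and "segments" are placeholders standing in for single-reader {{addIndexes(CodecReader...)}} calls, not real Lucene objects.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ConcurrentAddIndexes {

    /** Submit one 1:1 conversion job per reader, then block until all finish. */
    public static List<String> addAll(List<String> readers, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> jobs = new ArrayList<>();
            for (String reader : readers) {
                // Each job wraps exactly one reader, like addIndexes(singleReader),
                // so conversions with remapped ordinals run concurrently.
                jobs.add(pool.submit(() -> "segment-for-" + reader));
            }
            List<String> segments = new ArrayList<>();
            for (Future<String> job : jobs) {
                segments.add(job.get()); // wait for every conversion before returning
            }
            return segments;
        } finally {
            pool.shutdown();
        }
    }
}
```

The resulting per-reader segments could then be published immediately and later replaced by ordinary background merges, as the description suggests.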



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Updated] (LUCENE-10216) Add concurrency to addIndexes(CodecReader…) API

2022-07-19 Thread Michael McCandless (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-10216:

Fix Version/s: main

> Add concurrency to addIndexes(CodecReader…) API
> ---
>
> Key: LUCENE-10216
> URL: https://issues.apache.org/jira/browse/LUCENE-10216
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Vigya Sharma
>Priority: Major
> Fix For: main
>
>  Time Spent: 11h 10m
>  Remaining Estimate: 0h
>
> I work at Amazon Product Search, and we use Lucene to power search for the 
> e-commerce platform. I’m working on a project that involves applying 
> metadata+ETL transforms and indexing documents on n different _indexing_ 
> boxes, combining them into a single index on a separate _reducer_ box, and 
> making it available for queries on m different _search_ boxes (replicas). 
> Segments are asynchronously copied from indexers to reducers to searchers as 
> they become available for the next layer to consume.
> I am using the addIndexes API to combine multiple indexes into one on the 
> reducer boxes. Since we also have taxonomy data, we need to remap facet field 
> ordinals, which means I need to use the {{addIndexes(CodecReader…)}} version 
> of this API. The API leverages {{SegmentMerger.merge()}} to create segments 
> with new ordinal values while also merging all provided segments in the 
> process.
> _This is however a blocking call that runs in a single thread._ Until we have 
> written segments with new ordinal values, we cannot copy them to searcher 
> boxes, which increases the time to make documents available for search.
> I was playing around with the API by creating multiple concurrent merges, 
> each with only a single reader, creating a concurrently running 1:1 
> conversion from old segments to new ones (with new ordinal values). We follow 
> this up with non-blocking background merges. This lets us copy the segments 
> to searchers and replicas as soon as they are available, and later replace 
> them with merged segments as background jobs complete. On the Amazon dataset 
> I profiled, this gave us around 2.5 to 3x improvement in addIndexes() time. 
> Each call was given about 5 readers to add on average.
> This might be a useful addition to Lucene. We could create another {{addIndexes()}} 
> API with a {{boolean}} flag for concurrency, that internally submits multiple 
> merge jobs (each with a single reader) to the {{ConcurrentMergeScheduler}}, 
> and waits for them to complete before returning.
> While this is doable from outside Lucene by using your thread pool, starting 
> multiple addIndexes() calls and waiting for them to complete, I felt it needs 
> some understanding of what addIndexes does, why you need to wait on the merge 
> and why it makes sense to pass a single reader in the addIndexes API.
> Out-of-the-box support in Lucene could simplify this for folks with a similar use case.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10651) SimpleQueryParser stack overflow for large nested queries.

2022-07-18 Thread Michael McCandless (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17568068#comment-17568068
 ] 

Michael McCandless commented on LUCENE-10651:
-

It's a bit depressing that {{SimpleQueryParser}} consumes Java's stack in 
proportion to the number of clauses.  Though, it is "simple", and just 
recursing to parse another clause is indeed simple.  Too bad we seem to have 
lost tail recursion from the Lisp days ...

This is really a bug/limitation in {{SimpleQueryParser}}.  E.g. a user may 
increase that clause limit and then try to parse a {{BooleanQuery}} that should 
work (doesn't have too many clauses) yet hit {{StackOverflowException}}.  
I wonder if our other query parsers have this problem too?

But I like your fix – it prevents a {{StackOverflowException}} when the 
returned Query would have failed with {{TooManyClauses}} anyway.  It's at 
least progress, not perfection?
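The counter-based guard can be illustrated with a toy recursive parser (hypothetical code, not {{SimpleQueryParser}} itself): because the parser spends one stack frame per clause, incrementing a counter as each clause is added and failing fast also bounds the recursion depth.

```java
public class ClauseLimitParser {

    /** Stand-in for IndexSearcher.TooManyClauses. */
    public static class TooManyClauses extends RuntimeException {}

    private final int maxClauses;
    private int clauseCount;

    public ClauseLimitParser(int maxClauses) {
        this.maxClauses = maxClauses;
    }

    /** Parses a "a | b | c" style disjunction; returns the clause count or throws. */
    public int parse(String query) {
        clauseCount = 0;
        parseOr(query.split("\\s*\\|\\s*"), 0);
        return clauseCount;
    }

    // Recursive on purpose, like the real parser: one stack frame per clause.
    // The counter check fires long before the stack can overflow.
    private void parseOr(String[] clauses, int i) {
        if (i == clauses.length) {
            return;
        }
        if (++clauseCount > maxClauses) {
            throw new TooManyClauses();
        }
        parseOr(clauses, i + 1);
    }
}
```

With the guard, an oversized query fails with the intended {{TooManyClauses}}-style exception instead of exhausting the stack during parsing.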

> SimpleQueryParser stack overflow for large nested queries.
> --
>
> Key: LUCENE-10651
> URL: https://issues.apache.org/jira/browse/LUCENE-10651
> Project: Lucene - Core
>  Issue Type: Bug
>Affects Versions: 9.1, 8.10, 9.2, 9.3
>Reporter: Marc Handalian
>Priority: Major
>
> The OpenSearch project received an issue [1] where stack overflow can occur 
> for large nested boolean queries during rewrite.  In trying to reproduce this 
> error I've also encountered SO during parsing where queries expand beyond the 
> default 1024 clause limit.  This unit test will fail with SO:
> {code:java}
> public void testSimpleQueryParserWithTooManyClauses() {
>   StringBuilder queryString = new StringBuilder("foo");
>   for (int i = 0; i < 1024; i++) {
> queryString.append(" | bar").append(i).append(" + baz");
>   }
>   expectThrows(IndexSearcher.TooManyClauses.class, () -> 
> parse(queryString.toString()));
> }
>  {code}
> I would expect this case to also fail with TooManyClauses; is my 
> understanding correct?  If so, I've attempted a fix [2] that increments a 
> counter during parsing whenever a clause is added.
>  [1] [https://github.com/opensearch-project/OpenSearch/issues/3760]
>  [2] 
> [https://github.com/mch2/lucene/commit/6a558f17f448b92ae4cf8c43e0b759ff7425acdf]
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10633) Dynamic pruning for queries sorted by SORTED(_SET) field

2022-07-18 Thread Michael McCandless (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17567913#comment-17567913
 ] 

Michael McCandless commented on LUCENE-10633:
-

{quote}I plan on opening a PR against luceneutil and I already opened 
LUCENE-10162 a while back about making this sort of thing a more obvious 
choice. It also relates to [~gsmiller] 's work about running term-in-set 
queries using doc values, which would only help if doc values are enabled on 
the field.
{quote}
Awesome, thanks [~jpountz]!

> Dynamic pruning for queries sorted by SORTED(_SET) field
> 
>
> Key: LUCENE-10633
> URL: https://issues.apache.org/jira/browse/LUCENE-10633
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> LUCENE-9280 introduced the ability to dynamically prune non-competitive hits 
> when sorting by a numeric field, by leveraging the points index to skip 
> documents that do not compare better than the top of the priority queue 
> maintained by the field comparator.
> However queries sorted by a SORTED(_SET) field still look at all hits, which 
> is disappointing. Could we leverage the terms index to skip hits?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10557) Migrate to GitHub issue from Jira

2022-07-17 Thread Michael McCandless (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17567630#comment-17567630
 ] 

Michael McCandless commented on LUCENE-10557:
-

{quote}Again - I don't think we have had any discussion about making Jira 
read-only in the mail list. I don't have strong opinion on that, but I am 
against it if we are going to do so without giving others time to consider / 
chance to express opinions on this.
{quote}
OK – I'll start a [DISCUSS] and then [VOTE] thread to reach consensus on this.

> Migrate to GitHub issue from Jira
> -
>
> Key: LUCENE-10557
> URL: https://issues.apache.org/jira/browse/LUCENE-10557
> Project: Lucene - Core
>  Issue Type: Sub-task
>Reporter: Tomoko Uchida
>Assignee: Tomoko Uchida
>Priority: Major
> Attachments: Screen Shot 2022-06-29 at 11.02.35 AM.png, 
> image-2022-06-29-13-36-57-365.png, screenshot-1.png
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> A few (not the majority) Apache projects already use the GitHub issue instead 
> of Jira. For example,
> Airflow: [https://github.com/apache/airflow/issues]
> BookKeeper: [https://github.com/apache/bookkeeper/issues]
> So I think it'd be technically possible that we move to GitHub issue. I have 
> little knowledge of how to proceed with it, I'd like to discuss whether we 
> should migrate to it, and if so, how to smoothly handle the migration.
> The major tasks would be:
>  * (/) Get a consensus about the migration among committers
>  * (/) Choose issues that should be moved to GitHub - We'll migrate all 
> issues towards an atomic switch to GitHub if no major technical obstacles 
> show up.
>  ** Discussion thread 
> [https://lists.apache.org/thread/1p3p90k5c0d4othd2ct7nj14bkrxkr12]
>  ** -Conclusion for now: We don't migrate any issues. Only new issues should 
> be opened on GitHub.-
>  ** Write a prototype migration script - the decision could be made on that. 
> Things to consider:
>  *** version numbers - labels or milestones?
>  *** add a comment/ prepend a link to the source Jira issue on github side,
>  *** add a comment/ prepend a link on the jira side to the new issue on 
> github side (for people who access jira from blogs, mailing list archives and 
> other sources that will have stale links),
>  *** convert cross-issue automatic links in comments/ descriptions (as 
> suggested by Robert),
>  *** strategy to deal with sub-issues (hierarchies),
>  *** maybe prefix (or postfix) the issue title on github side with the 
> original LUCENE-XYZ key so that it is easier to search for a particular issue 
> there?
>  *** how to deal with user IDs (author, reporter, commenters)? Do they have 
> to be github users? Will information about people not registered on github be 
> lost?
>  *** create an extra mapping file of old-issue-new-issue URLs for any 
> potential future uses.
>  *** what to do with issue numbers in git/svn commits? These could be 
> rewritten but it'd change the entire git history tree - I don't think this is 
> practical, while doable.
> * Prepare a complete migration tool
> ** See https://github.com/apache/lucene-jira-archive/issues/5 
> * Build the convention for issue label/milestone management
>  ** See [https://github.com/apache/lucene-jira-archive/issues/6]
>  ** Do some experiments on a sandbox repository 
> [https://github.com/mocobeta/sandbox-lucene-10557]
>  ** Make documentation for metadata (label/milestone) management 
>  * (/) Enable Github issue on the lucene's repository
>  ** Raise an issue on INFRA
>  ** (Create an issue-only private repository for sensitive issues if it's 
> needed and allowed)
>  ** Set a mail hook to 
> [issues@lucene.apache.org|mailto:issues@lucene.apache.org] (many thanks to 
> the general mail group name)
>  * Set a schedule for migration
>  ** See [https://github.com/apache/lucene-jira-archive/issues/7]
>  ** Give some time to committers to play around with issues/labels/milestones 
> before the actual migration
>  ** Make an announcement on the mail lists
>  ** Show some text messages when opening a new Jira issue (in issue template?)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10633) Dynamic pruning for queries sorted by SORTED(_SET) field

2022-07-17 Thread Michael McCandless (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17567603#comment-17567603
 ] 

Michael McCandless commented on LUCENE-10633:
-

Should we make that change to luceneutil permanent (indexing sort fields in 
both points and doc values)?

Maybe we need to make this path more of a default / obvious choice for users so 
they see these optimizations?  E.g. some sort of combined 
{{DocValuesAndPointsField}}?

> Dynamic pruning for queries sorted by SORTED(_SET) field
> 
>
> Key: LUCENE-10633
> URL: https://issues.apache.org/jira/browse/LUCENE-10633
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> LUCENE-9280 introduced the ability to dynamically prune non-competitive hits 
> when sorting by a numeric field, by leveraging the points index to skip 
> documents that do not compare better than the top of the priority queue 
> maintained by the field comparator.
> However queries sorted by a SORTED(_SET) field still look at all hits, which 
> is disappointing. Could we leverage the terms index to skip hits?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10633) Dynamic pruning for queries sorted by SORTED(_SET) field

2022-07-17 Thread Michael McCandless (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17567602#comment-17567602
 ] 

Michael McCandless commented on LUCENE-10633:
-

Good grief :)  It is not every day you see a 77X speedup in Lucene queries!!!

> Dynamic pruning for queries sorted by SORTED(_SET) field
> 
>
> Key: LUCENE-10633
> URL: https://issues.apache.org/jira/browse/LUCENE-10633
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> LUCENE-9280 introduced the ability to dynamically prune non-competitive hits 
> when sorting by a numeric field, by leveraging the points index to skip 
> documents that do not compare better than the top of the priority queue 
> maintained by the field comparator.
> However queries sorted by a SORTED(_SET) field still look at all hits, which 
> is disappointing. Could we leverage the terms index to skip hits?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10557) Migrate to GitHub issue from Jira

2022-07-17 Thread Michael McCandless (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17567600#comment-17567600
 ] 

Michael McCandless commented on LUCENE-10557:
-

Though, one small wrinkle we have is that we will need to append one final 
comment to all Jiras (what [~tomoko] has been iterating on recently!) after the 
GitHub issue assignment is known ... so we can't make it completely read-only 
until after the migration.  Tricky.  There will be a window where users may 
update some Jiras mid-migration.

> Migrate to GitHub issue from Jira
> -
>
> Key: LUCENE-10557
> URL: https://issues.apache.org/jira/browse/LUCENE-10557
> Project: Lucene - Core
>  Issue Type: Sub-task
>Reporter: Tomoko Uchida
>Assignee: Tomoko Uchida
>Priority: Major
> Attachments: Screen Shot 2022-06-29 at 11.02.35 AM.png, 
> image-2022-06-29-13-36-57-365.png, screenshot-1.png
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> A few (not the majority) Apache projects already use the GitHub issue instead 
> of Jira. For example,
> Airflow: [https://github.com/apache/airflow/issues]
> BookKeeper: [https://github.com/apache/bookkeeper/issues]
> So I think it'd be technically possible that we move to GitHub issue. I have 
> little knowledge of how to proceed with it, I'd like to discuss whether we 
> should migrate to it, and if so, how to smoothly handle the migration.
> The major tasks would be:
>  * (/) Get a consensus about the migration among committers
>  * (/) Choose issues that should be moved to GitHub - We'll migrate all 
> issues towards an atomic switch to GitHub if no major technical obstacles 
> show up.
>  ** Discussion thread 
> [https://lists.apache.org/thread/1p3p90k5c0d4othd2ct7nj14bkrxkr12]
>  ** -Conclusion for now: We don't migrate any issues. Only new issues should 
> be opened on GitHub.-
>  ** Write a prototype migration script - the decision could be made on that. 
> Things to consider:
>  *** version numbers - labels or milestones?
>  *** add a comment/ prepend a link to the source Jira issue on github side,
>  *** add a comment/ prepend a link on the jira side to the new issue on 
> github side (for people who access jira from blogs, mailing list archives and 
> other sources that will have stale links),
>  *** convert cross-issue automatic links in comments/ descriptions (as 
> suggested by Robert),
>  *** strategy to deal with sub-issues (hierarchies),
>  *** maybe prefix (or postfix) the issue title on github side with the 
> original LUCENE-XYZ key so that it is easier to search for a particular issue 
> there?
>  *** how to deal with user IDs (author, reporter, commenters)? Do they have 
> to be github users? Will information about people not registered on github be 
> lost?
>  *** create an extra mapping file of old-issue-new-issue URLs for any 
> potential future uses.
>  *** what to do with issue numbers in git/svn commits? These could be 
> rewritten but it'd change the entire git history tree - I don't think this is 
> practical, while doable.
> * Prepare a complete migration tool
> ** See https://github.com/apache/lucene-jira-archive/issues/5 
> * Build the convention for issue label/milestone management
>  ** See [https://github.com/apache/lucene-jira-archive/issues/6]
>  ** Do some experiments on a sandbox repository 
> [https://github.com/mocobeta/sandbox-lucene-10557]
>  ** Make documentation for metadata (label/milestone) management 
>  * (/) Enable Github issue on the lucene's repository
>  ** Raise an issue on INFRA
>  ** (Create an issue-only private repository for sensitive issues if it's 
> needed and allowed)
>  ** Set a mail hook to 
> [issues@lucene.apache.org|mailto:issues@lucene.apache.org] (many thanks to 
> the general mail group name)
>  * Set a schedule for migration
>  ** See [https://github.com/apache/lucene-jira-archive/issues/7]
>  ** Give some time to committers to play around with issues/labels/milestones 
> before the actual migration
>  ** Make an announcement on the mail lists
>  ** Show some text messages when opening a new Jira issue (in issue template?)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10557) Migrate to GitHub issue from Jira

2022-07-17 Thread Michael McCandless (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17567599#comment-17567599
 ] 

Michael McCandless commented on LUCENE-10557:
-

I realized that the Infra team sometimes makes Jira projects read-only (e.g. 
when projects are retired to the Attic): 
https://issues.apache.org/jira/browse/INFRA-23440?page=com.atlassian.jira.plugin.system.issuetabpanels%3Aall-tabpanel

I'll send an email to Infra to see if they have some clean/simple way to do 
this at the start of our migration process.

> Migrate to GitHub issue from Jira
> -
>
> Key: LUCENE-10557
> URL: https://issues.apache.org/jira/browse/LUCENE-10557
> Project: Lucene - Core
>  Issue Type: Sub-task
>Reporter: Tomoko Uchida
>Assignee: Tomoko Uchida
>Priority: Major
> Attachments: Screen Shot 2022-06-29 at 11.02.35 AM.png, 
> image-2022-06-29-13-36-57-365.png, screenshot-1.png
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> A few (though not the majority of) Apache projects already use GitHub issues 
> instead of Jira. For example,
> Airflow: [https://github.com/apache/airflow/issues]
> BookKeeper: [https://github.com/apache/bookkeeper/issues]
> So I think it'd be technically possible for us to move to GitHub issues. I 
> have little knowledge of how to proceed with it, so I'd like to discuss 
> whether we should migrate, and if so, how to handle the migration smoothly.
> The major tasks would be:
>  * (/) Get a consensus about the migration among committers
>  * (/) Choose issues that should be moved to GitHub - We'll migrate all 
> issues towards an atomic switch to GitHub if no major technical obstacles 
> show up.
>  ** Discussion thread 
> [https://lists.apache.org/thread/1p3p90k5c0d4othd2ct7nj14bkrxkr12]
>  ** -Conclusion for now: We don't migrate any issues. Only new issues should 
> be opened on GitHub.-
>  ** Write a prototype migration script - the decision could be made based on 
> it. Things to consider:
>  *** version numbers - labels or milestones?
>  *** add a comment / prepend a link to the source Jira issue on the GitHub 
> side,
>  *** add a comment / prepend a link on the Jira side to the new issue on the 
> GitHub side (for people who reach Jira from blogs, mailing list archives, and 
> other sources that will have stale links),
>  *** convert cross-issue automatic links in comments / descriptions (as 
> suggested by Robert),
>  *** a strategy to deal with sub-issues (hierarchies),
>  *** maybe prefix (or postfix) the issue title on the GitHub side with the 
> original LUCENE-XYZ key so that it is easier to search for a particular issue 
> there?
>  *** how to deal with user IDs (author, reporter, commenters)? Do they have 
> to be GitHub users? Will information about people not registered on GitHub be 
> lost?
>  *** create an extra mapping file of old-issue/new-issue URLs for any 
> potential future uses.
>  *** what to do with issue numbers in git/svn commits? These could be 
> rewritten, but that would change the entire git history tree - doable, but I 
> don't think it's practical.
>  * Prepare a complete migration tool
>  ** See [https://github.com/apache/lucene-jira-archive/issues/5]
>  * Build the conventions for issue label/milestone management
>  ** See [https://github.com/apache/lucene-jira-archive/issues/6]
>  ** Do some experiments on a sandbox repository 
> [https://github.com/mocobeta/sandbox-lucene-10557]
>  ** Write documentation for metadata (label/milestone) management
>  * (/) Enable GitHub issues on the Lucene repository
>  ** Raise an issue with INFRA
>  ** (Create an issue-only private repository for sensitive issues if it's 
> needed and allowed)
>  ** Set a mail hook to 
> [issues@lucene.apache.org|mailto:issues@lucene.apache.org] (many thanks to 
> the general mail group name)
>  * Set a schedule for the migration
>  ** See [https://github.com/apache/lucene-jira-archive/issues/7]
>  ** Give committers some time to play around with issues/labels/milestones 
> before the actual migration
>  ** Make an announcement on the mailing lists
>  ** Show some text messages when opening a new Jira issue (in the issue 
> template?)






[jira] [Commented] (LUCENE-10557) Migrate to GitHub issue from Jira

2022-07-16 Thread Michael McCandless (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17567504#comment-17567504
 ] 

Michael McCandless commented on LUCENE-10557:
-

OK hmm we will likely need Infra's help for this then.







[jira] [Commented] (LUCENE-10557) Migrate to GitHub issue from Jira

2022-07-16 Thread Michael McCandless (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17567486#comment-17567486
 ] 

Michael McCandless commented on LUCENE-10557:
-

Hi [~tomoko] – I added you as a Jira Administrator so you can poke around if 
you want to.

But I'll still try to figure out how to make Jira read-only.







[jira] [Commented] (LUCENE-10557) Migrate to GitHub issue from Jira

2022-07-16 Thread Michael McCandless (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17567466#comment-17567466
 ] 

Michael McCandless commented on LUCENE-10557:
-

OK I am able to administer our Jira instance.

There are some wrinkles – apparently because some of our workflows are shared 
across two projects (Lucene and Solr), the workflows themselves are read-only!  
So we cannot change them unless we unshare the workflows first.

But there is much discussion about this problem, e.g.: 
[https://community.atlassian.com/t5/Jira-questions/Fastest-way-to-make-JIRA-read-only/qaq-p/1261492]

I'll try to find the simplest way that works for us.







[jira] [Commented] (LUCENE-10557) Migrate to GitHub issue from Jira

2022-07-16 Thread Michael McCandless (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17567465#comment-17567465
 ] 

Michael McCandless commented on LUCENE-10557:
-

bq. [TEST] This was moved to GitHub issue: 
https://github.com/mocobeta/migration-test-3/issues/196.

Oooh that looks promising!!







[jira] [Updated] (LUCENE-10557) Migrate to GitHub issue from Jira

2022-06-29 Thread Michael McCandless (Jira)
Michael McCandless updated an issue:

Lucene - Core / LUCENE-10557: Migrate to GitHub issue from Jira

Change By: Michael McCandless
Attachment: Screen Shot 2022-06-29 at 11.02.35 AM.png

This message was sent by Atlassian Jira (v8.20.10#820010-sha1:ace47f9)


[jira] [Updated] (LUCENE-10557) Migrate to GitHub issue from Jira

2022-06-29 Thread Michael McCandless (Jira)
Michael McCandless updated an issue:

Lucene - Core / LUCENE-10557: Migrate to GitHub issue from Jira

Change By: Michael McCandless
Attachment: Screen Shot 2022-06-05 at 8.13.41 AM.png



[jira] [Commented] (LUCENE-10557) Migrate to GitHub issue from Jira

2022-06-29 Thread Michael McCandless (Jira)
Michael McCandless commented on LUCENE-10557 (Re: Migrate to GitHub issue from Jira):

So cool!  I asked for all (open and closed) issues from Tomoko Uchida's latest 
migration, sorted by oldest, and I see all the original issues (LUCENE-1, -2, 
-3, etc.): [screenshot attached]



[jira] [Updated] (LUCENE-10557) Migrate to GitHub issue from Jira

2022-06-29 Thread Michael McCandless (Jira)
Michael McCandless updated an issue:

Lucene - Core / LUCENE-10557: Migrate to GitHub issue from Jira

Change By: Michael McCandless
Attachment: Screen Shot 2022-06-05 at 8.13.41 AM.png



[jira] [Commented] (LUCENE-10557) Migrate to GitHub issue from Jira

2022-06-29 Thread Michael McCandless (Jira)
Michael McCandless commented on LUCENE-10557 (Re: Migrate to GitHub issue from Jira):

{quote}I think I have addressed attachments.{quote}
Woot!  I love seeing the attached patch file rendered inline via GitHub like 
that (versus downloading it to my local disk in Jira).  This is awesome 
progress – thanks [~tomoko]!

{quote}Just for your information, we now have a public ASF repository 
https://github.com/apache/lucene-jira-archive for the migration and I pushed 
the migration scripts there to develop/archive it under Apache. I also opened 
a few issues for it.{quote}
YAY!



[jira] [Commented] (LUCENE-10557) Migrate to GitHub issue from Jira

2022-06-27 Thread Michael McCandless (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17559276#comment-17559276
 ] 

Michael McCandless commented on LUCENE-10557:
-

{quote}  Jira markup is converted into Markdown for rendering.
 * There are many conversion errors and need close investigation.{quote}
This seems perhaps solvable, relatively quickly – the conversion tool is 
open-source right?  Tables seem flaky ... what other markup?  I can try to dive 
deep on this if I can make some time.  Let's not rush this conversion.
{quote}"attachments" (patches, images, etc) cannot be migrated with basic 
GitHub API functionality.
 * There could be workarounds; e.g. save them in another github repo and 
rewrite attachment links to refer to them.{quote}
I thought the "unofficial" migration API might support attachments?  Or are 
there big problems with using that API?
{quote}As a reference, I will migrate all existing issues into a test 
repository shortly. Hope we can make a decision by looking at it - I mean, 
I'll not be able to further invest my time in this PoC.

I'll post the PoC migration result to the dev list next week to ask whether we 
should proceed with it.
{quote}
+1!  Thank you for pushing so hard on this [~tomoko]!  Let's not rush the 
decision ... others can try to push your PoC forwards too to improve the 
migration quality.  This is worth the one-time investment.  And hey, maybe we 
enable something that future Jira -> GitHub issues migrations can use.
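The Jira-markup-to-Markdown conversion discussed above can be sketched roughly as below. This is a minimal illustration, not the actual open-source conversion tool (libraries such as jira2markdown cover far more constructs); the patterns here are assumptions, and the long tail of edge cases is exactly why tables and other markup come out flaky.

```python
# Minimal sketch of Jira-markup -> Markdown conversion; illustrative only.
import re

def jira_to_markdown(text: str) -> str:
    # {code:java} ... {code} -> fenced code block delimiters
    text = re.sub(r"\{code(?::(\w+))?\}", lambda m: "```" + (m.group(1) or ""), text)
    # {quote} delimiters: real converters emit "> " blockquotes; as a
    # placeholder we simply drop the markers.
    text = text.replace("{quote}", "")
    # *bold* -> **bold**
    text = re.sub(r"(?<!\w)\*(\S[^*\n]*?)\*(?!\w)", r"**\1**", text)
    # [title|http://url] -> [title](http://url)
    text = re.sub(r"\[([^|\]]+)\|([^\]]+)\]", r"[\1](\2)", text)
    return text

converted = jira_to_markdown(
    "See [the repo|https://github.com/apache/lucene] and {code:java}int x = 1;{code}")
```

Even these four rules interact (e.g. a `*` inside a link title), which is why close investigation of conversion errors is needed.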





[jira] [Commented] (LUCENE-10557) Migrate to GitHub issue from Jira

2022-06-24 Thread Michael McCandless (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17558558#comment-17558558
 ] 

Michael McCandless commented on LUCENE-10557:
-

[~tomoko] could you share the source code of the import tool you are working 
on?  Maybe post it in a personal public GitHub repo?  We all can try to make 
PRs / reviews ;)

> Migrate to GitHub issue from Jira
> -
>
> Key: LUCENE-10557
> URL: https://issues.apache.org/jira/browse/LUCENE-10557
> Project: Lucene - Core
>  Issue Type: Sub-task
>Reporter: Tomoko Uchida
>Assignee: Tomoko Uchida
>Priority: Major
>
> A few (not the majority) Apache projects already use the GitHub issue instead 
> of Jira. For example,
> Airflow: [https://github.com/apache/airflow/issues]
> BookKeeper: [https://github.com/apache/bookkeeper/issues]
> So I think it'd be technically possible that we move to GitHub issue. I have 
> little knowledge of how to proceed with it, I'd like to discuss whether we 
> should migrate to it, and if so, how to smoothly handle the migration.
> The major tasks would be:
>  * (/) Get a consensus about the migration among committers
>  * Choose issues that should be moved to GitHub
>  ** Discussion thread 
> [https://lists.apache.org/thread/1p3p90k5c0d4othd2ct7nj14bkrxkr12]
>  ** -Conclusion for now: We don't migrate any issues. Only new issues should 
> be opened on GitHub.-
>  ** Write a prototype migration script - the decision could be made on that. 
> Things to consider:
>  *** version numbers - labels or milestones?
>  *** add a comment/ prepend a link to the source Jira issue on github side,
>  *** add a comment/ prepend a link on the jira side to the new issue on 
> github side (for people who access jira from blogs, mailing list archives and 
> other sources that will have stale links),
>  *** convert cross-issue automatic links in comments/ descriptions (as 
> suggested by Robert),
>  *** strategy to deal with sub-issues (hierarchies),
>  *** maybe prefix (or postfix) the issue title on github side with the 
> original LUCENE-XYZ key so that it is easier to search for a particular issue 
> there?
>  *** how to deal with user IDs (author, reporter, commenters)? Do they have 
> to be github users? Will information about people not registered on github be 
> lost?
>  *** create an extra mapping file of old-issue-new-issue URLs for any 
> potential future uses. 
>  *** what to do with issue numbers in git/svn commits? These could be 
> rewritten but it'd change the entire git history tree - I don't think this is 
> practical, while doable.
>  * Build the convention for issue label/milestone management
>  ** Do some experiments on a sandbox repository 
> [https://github.com/mocobeta/sandbox-lucene-10557]
>  ** Make documentation for metadata (label/milestone) management 
>  * Enable Github issue on the lucene's repository
>  ** Raise an issue on INFRA
>  ** (Create an issue-only private repository for sensitive issues if it's 
> needed and allowed)
>  ** Set a mail hook to 
> [issues@lucene.apache.org|mailto:issues@lucene.apache.org] (many thanks to 
> the general mail group name)
>  * Set a schedule for migration
>  ** Give some time to committers to play around with issues/labels/milestones 
> before the actual migration
>  ** Make an announcement on the mail lists
>  ** Show some text messages when opening a new Jira issue (in issue template?)
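One of the items above, converting cross-issue automatic links in comments and descriptions, can be sketched with a tiny script driven by the old-issue/new-issue mapping file also listed above. The mapping entry and GitHub issue number below are hypothetical placeholders, not real migration data:

```python
import re

# Hypothetical old-issue/new-issue mapping, as would be produced by the
# "extra mapping file" step; the GitHub issue number here is made up.
KEY_TO_GITHUB = {"LUCENE-10557": 11630}

def convert_cross_links(text, repo="apache/lucene"):
    """Rewrite bare LUCENE-XYZ keys into Markdown links: migrated issues
    point at GitHub, everything else keeps pointing at Jira."""
    def repl(match):
        key = match.group(0)
        number = KEY_TO_GITHUB.get(key)
        if number is None:
            # Unmigrated keys keep pointing at Jira so old links stay valid.
            return f"[{key}](https://issues.apache.org/jira/browse/{key})"
        return f"[{key}](https://github.com/{repo}/issues/{number})"
    return re.sub(r"\bLUCENE-\d+\b", repl, text)
```

Keeping unmapped keys as Jira links means the script is safe to run even before every issue has a GitHub counterpart.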



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10557) Migrate to GitHub issue from Jira

2022-06-24 Thread Michael McCandless (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17558548#comment-17558548
 ] 

Michael McCandless commented on LUCENE-10557:
-

I suppose we cannot ask GitHub to set the creation timestamp?  I.e. it always 
creates an issue "right now"?
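If the API indeed always stamps "right now" as the creation time, one workaround (a sketch of a possible convention, not the project's actual tooling) is to carry the original Jira timestamp inside the issue body:

```python
def build_issue_payload(title, body, jira_key, created_iso):
    """Build a GitHub 'create issue' payload that preserves the original
    Jira creation time inside the body, since the API stamps 'now' as the
    creation date.  The header format is just one possible convention."""
    header = (
        f"*Original issue: "
        f"[{jira_key}](https://issues.apache.org/jira/browse/{jira_key}), "
        f"created {created_iso}*\n\n"
    )
    # Postfixing the Jira key onto the title keeps old keys searchable.
    return {"title": f"{title} [{jira_key}]", "body": header + body}
```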




[jira] [Commented] (LUCENE-10557) Migrate to GitHub issue from Jira

2022-06-24 Thread Michael McCandless (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17558546#comment-17558546
 ] 

Michael McCandless commented on LUCENE-10557:
-

{quote}Seems converting Jira "table" markup to Markdown is error-prone
{quote}
It looks like the converter didn't even recognize that there was a Jira table 
in that comment, maybe because extra whitespace was inserted on the Jira 
export?  Or, perhaps Jira recognizes a table even with excessive newlines 
inserted, but the converter does not?  Might be a simple addition of {{\r\n}} 
into the regexp the converter is using.  Do you have the raw Jira exported text 
for this issue?  The converter can be invoked from the command line for fast 
debugging / iterating.
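A hedged sketch of the kind of newline-tolerant table detection being discussed here (illustrative only, not the converter's actual regexp): normalize stray {{\r\n}} sequences first, then probe for Jira table rows with a line-anchored pattern.

```python
import re

# A Jira table header row looks like "||col A||col B||".  Normalizing
# "\r\n" to "\n" before matching means the MULTILINE anchors still see
# clean row boundaries even if the Jira export used Windows line endings.
TABLE_ROW = re.compile(r"^\s*\|\|(?:[^|\n]+\|\|)+\s*$", re.MULTILINE)

def looks_like_jira_table(text):
    normalized = text.replace("\r\n", "\n")
    return bool(TABLE_ROW.search(normalized))
```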




[jira] [Commented] (LUCENE-10557) Migrate to GitHub issue from Jira

2022-06-23 Thread Michael McCandless (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17558263#comment-17558263
 ] 

Michael McCandless commented on LUCENE-10557:
-

{quote}I'm still not fully sure if we can/should make Jira completely 
read-only; maybe we'll have a discussion on the mail list later.
{quote}
OK that's fair – I just think having two writable issue trackers at the same 
time is asking for disaster.  It really should be an atomic switch from Jira to 
GitHub issues to close that risk.  But we can defer that discussion until we 
agree the migration is even the right choice.  Maybe we will end up deciding to 
live with Jira forever instead of hard switching to GitHub issues.
{quote}{quote}Did you see/start from [the Lucene.Net migration 
tool|https://github.com/bongohrtech/jira-issues-importer/tree/lucenenet]?
{quote}
No - Lucene.Net and Lucene have different requirements, and data 
migration/conversion scripts like this are usually not reusable.  I think it'd 
be easier to write a tool that fits our needs from scratch than to tweak 
others' work that is optimized for their needs. (It's not technically difficult 
- a set of tiny scripts is sufficient; there are just many uncertainties.)
{quote}
OK that's fair.

I just wanted to make sure you were aware of how Lucene.Net accomplished their 
Jira -> GitHub Issues migration so we could build on that / improve for our 
specific requirements.  We are not the first Apache project that feels the need 
to 1) migrate from Jira -> GitHub issues, and 2) preserve the history.  So 
let's learn from past projects like Lucene.Net and others.



[jira] [Commented] (LUCENE-10557) Migrate to GitHub issue from Jira

2022-06-23 Thread Michael McCandless (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17558029#comment-17558029
 ] 

Michael McCandless commented on LUCENE-10557:
-

Finally catching up over here!

*Thank you* [~tomoko] for tackling this!

I agree testing what is realistic/possible will enable us to make an informed 
decision.  I really hope we are not stuck asking all future developers to fall 
back to Jira and use two search engines.

To make Jira effectively read-only post-migration, Robert suggested we could 
use Jira's workflow controls to make a "degraded" workflow that does not allow 
any writes to the issues (creating new issues, adding comments, changing 
milestones, etc.).  We can add that to the (draft) migration steps.

For committers, [https://id.apache.org|https://id.apache.org/] has the mapping 
of apache userid to GitHub id, though I'm not sure if that is publicly 
queryable.  And as [~msoko...@gmail.com] pointed out on the dev list thread, 
the [GitHub Apache org might also have it|https://github.com/apache].
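Whatever the id source turns out to be, attribution will likely need a mapping-with-fallback step so that people without a known GitHub account are not lost. A minimal sketch (the mapping entry is hypothetical; real data would come from id.apache.org or the ASF GitHub org):

```python
# Hypothetical committer mapping; real entries would come from
# https://id.apache.org or the ASF GitHub org membership.
APACHE_TO_GITHUB = {"mikemccand": "mikemccand"}

def attribute_comment(apache_id, display_name, body):
    """Prefix a migrated comment with an @-mention when the GitHub account
    is known, falling back to the plain display name so information about
    people without a known GitHub account is preserved as text."""
    github_id = APACHE_TO_GITHUB.get(apache_id)
    author = f"@{github_id}" if github_id else display_name
    return f"{author} (migrated from Jira):\n\n{body}"
```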

Did you see/start from [the Lucene.Net migration 
tool|https://github.com/bongohrtech/jira-issues-importer/tree/lucenenet]? This 
is what [~nightowl888] pointed to (up above).

Those few migrated issues look like a great start!
{quote}{*}There is no way to upload files to GitHub with REST APIs{*}; it is 
only allowed via the Web Interface.
{quote}
Wow, that is indeed disappointing.  I wonder whether GitHub's issue search also 
searches attachments?  Does Jira's?
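Given that limitation, one pragmatic fallback (a sketch; the /secure/attachment/{id}/{name} URL pattern is an assumption for illustration, not verified against the migration tooling) is to link back to the Jira-hosted files instead of uploading:

```python
def attachment_section(attachments):
    """Render a Markdown list linking back to the Jira-hosted copies of an
    issue's attachments, given (attachment_id, filename) pairs, since the
    REST API offers no upload endpoint."""
    lines = ["**Attachments** (still hosted on Jira):"]
    for attachment_id, name in attachments:
        url = (f"https://issues.apache.org/jira/secure/attachment/"
               f"{attachment_id}/{name}")
        lines.append(f"- [{name}]({url})")
    return "\n".join(lines)
```

This keeps the attachments discoverable from the GitHub issue as long as Jira stays up in read-only mode.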
 


[jira] [Commented] (LUCENE-10600) SortedSetDocValues#docValueCount should be an int, not long

2022-06-15 Thread Michael McCandless (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17554764#comment-17554764
 ] 

Michael McCandless commented on LUCENE-10600:
-

OK thanks.

I guess because the "number of ords" in one doc is limited (based on the 2 GB 
RAM buffer flush limit) we feel that no single doc could have more unique ords 
than the {{int}} limit?  Merging segments will not increase that per-doc count.
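That bound can be sanity-checked with back-of-the-envelope arithmetic. The two-bytes-per-ord floor below is a deliberately loose assumption for illustration; the real per-ord cost in the indexing buffer is higher:

```python
# Loose sanity check: if each buffered ord costs at least two bytes (an
# assumed, deliberately conservative floor), the ~2 GB flush ceiling caps
# per-doc ords under Integer.MAX_VALUE, so an int docValueCount suffices.
MAX_BUFFER_BYTES = 2 * 1024**3   # IndexWriter's ~2 GB RAM buffer limit
MIN_BYTES_PER_ORD = 2            # assumed floor, not a measured figure
INT_MAX = 2**31 - 1              # Java Integer.MAX_VALUE

max_ords_per_doc = MAX_BUFFER_BYTES // MIN_BYTES_PER_ORD
print(max_ords_per_doc <= INT_MAX)   # → True
```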

> SortedSetDocValues#docValueCount should be an int, not long
> ---
>
> Key: LUCENE-10600
> URL: https://issues.apache.org/jira/browse/LUCENE-10600
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Adrien Grand
>Assignee: Lu Xugang
>Priority: Minor
>  Time Spent: 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10600) SortedSetDocValues#docValueCount should be an int, not long

2022-06-15 Thread Michael McCandless (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17554620#comment-17554620
 ] 

Michael McCandless commented on LUCENE-10600:
-

Can this be resolved as {{Not a Problem}} now?




[jira] [Updated] (LUCENE-10557) Migrate to GitHub issue from Jira

2022-06-15 Thread Michael McCandless (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-10557:

Summary: Migrate to GitHub issue from Jira  (was: Migrate to GitHub issue 
from Jira?)

> Migrate to GitHub issue from Jira
> -
>
> Key: LUCENE-10557
> URL: https://issues.apache.org/jira/browse/LUCENE-10557
> Project: Lucene - Core
>  Issue Type: Sub-task
>Reporter: Tomoko Uchida
>Assignee: Tomoko Uchida
>Priority: Major
>
> A few (not the majority) Apache projects already use the GitHub issue instead 
> of Jira. For example,
> Airflow: [https://github.com/apache/airflow/issues]
> BookKeeper: [https://github.com/apache/bookkeeper/issues]
> So I think it'd be technically possible that we move to GitHub issue. I have 
> little knowledge of how to proceed with it, I'd like to discuss whether we 
> should migrate to it, and if so, how to smoothly handle the migration.
> The major tasks would be:
>  * Get a consensus about the migration among committers
>  * Enable Github issue on the lucene's repository (currently, it is disabled 
> on it)
>  * Build the convention or rules for issue label/milestone management
>  * Choose issues that should be moved to GitHub (I think too old or obsolete 
> issues can remain Jira.)






[jira] [Commented] (LUCENE-10557) Migrate to GitHub issue from Jira?

2022-06-15 Thread Michael McCandless (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17554603#comment-17554603
 ] 

Michael McCandless commented on LUCENE-10557:
-

Thanks [~tomoko] for driving this discussion/vote!  I will now remove the 
trailing {{?}} from the issue title :)

 

> Migrate to GitHub issue from Jira?
> --
>
> Key: LUCENE-10557
> URL: https://issues.apache.org/jira/browse/LUCENE-10557
> Project: Lucene - Core
>  Issue Type: Sub-task
>Reporter: Tomoko Uchida
>Assignee: Tomoko Uchida
>Priority: Major
>



[jira] [Commented] (LUCENE-10596) Remove unused parameter in #getOrAddPerField

2022-05-29 Thread Michael McCandless (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17543691#comment-17543691
 ] 

Michael McCandless commented on LUCENE-10596:
-

Thanks for noticing this!!  +1 to remove it if indeed it is a pointless method!!

 

> Remove unused parameter in #getOrAddPerField
> 
>
> Key: LUCENE-10596
> URL: https://issues.apache.org/jira/browse/LUCENE-10596
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Affects Versions: 9.2
>Reporter: tangdh
>Priority: Minor
>
> I noticed that the parameter fieldType is no longer used in the method 
> getOrAddPerField(indexingChain.java:773), do we need to remove it? If so, I'd 
> be happy to raise a PR






[jira] [Comment Edited] (LUCENE-10596) Remove unused parameter in #getOrAddPerField

2022-05-29 Thread Michael McCandless (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17543691#comment-17543691
 ] 

Michael McCandless edited comment on LUCENE-10596 at 5/29/22 5:44 PM:
--

Thanks for noticing this!  +1 to remove it if indeed it is a pointless method!

 


was (Author: mikemccand):
Thanks for noticing this!!  +1 to remove it if indeed it is a pointless method!!

 




[jira] [Resolved] (LUCENE-10591) Invalid character in SortableSingleDocSource.java

2022-05-25 Thread Michael McCandless (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless resolved LUCENE-10591.
-
Fix Version/s: 10.0 (main)
   9.3
   Resolution: Fixed

Thank you for the attention to detail [~asalamon74]!  I merged the PR to 
main/10.0 and cherry-picked to 9.x (eventually 9.3).

> Invalid character in SortableSingleDocSource.java
> -
>
> Key: LUCENE-10591
> URL: https://issues.apache.org/jira/browse/LUCENE-10591
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Andras Salamon
>Priority: Trivial
> Fix For: 10.0 (main), 9.3
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> There are invalid UTF-8 characters in SortableSingleDocSource.java
> "S�o Tom� and Pr�ncipe"
> Sonar gave me a warning because of this.






[jira] [Commented] (LUCENE-10586) Minor refactoring in Lucene90BlockTreeTermsReader local variables: metaIn, indexMetaIn, termsMetaIn

2022-05-22 Thread Michael McCandless (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17540708#comment-17540708
 ] 

Michael McCandless commented on LUCENE-10586:
-

+1

> Minor refactoring in Lucene90BlockTreeTermsReader local variables: metaIn, 
> indexMetaIn, termsMetaIn
> ---
>
> Key: LUCENE-10586
> URL: https://issues.apache.org/jira/browse/LUCENE-10586
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Tomoko Uchida
>Priority: Trivial
>
> Those three local variables refer to the same {{IndexInput}} object (no 
> clone() is called).
> {code}
> indexMetaIn = termsMetaIn = metaIn;
> {code}
> I'm not sure, but maybe there are some historical reasons. I wonder if it 
> would be better to have only one reference to the underlying {{IndexInput}} 
> object, to make the code a little easier to follow.






[jira] [Commented] (LUCENE-10587) Rename "master seed" to "root seed" or "main seed" or so?

2022-05-22 Thread Michael McCandless (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17540621#comment-17540621
 ] 

Michael McCandless commented on LUCENE-10587:
-

Ahh OK thanks [~dweiss]!

> Rename "master seed" to "root seed" or "main seed" or so?
> -
>
> Key: LUCENE-10587
> URL: https://issues.apache.org/jira/browse/LUCENE-10587
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Michael McCandless
>Priority: Major
>
> I noticed that Lucene's test infrastructure (or perhaps it's in the 
> {{RandomizedTesting}} dependency?) still says things like this:
> {noformat}
> [junit4:junit4]  says Привет! Master seed: 3296009A5B3B7A05
> {noformat}
> Let's rename away from the term {{master}}?






[jira] [Commented] (LUCENE-10587) Rename "master seed" to "root seed" or "main seed" or so?

2022-05-22 Thread Michael McCandless (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17540606#comment-17540606
 ] 

Michael McCandless commented on LUCENE-10587:
-

Woops, my bad – this was based on old test output in old issues!

These days we say this:
{noformat}
Running tests with randomization seed: tests.seed=7FB5EB33F1ED3689 {noformat}
Perfect :)




[jira] [Resolved] (LUCENE-10587) Rename "master seed" to "root seed" or "main seed" or so?

2022-05-22 Thread Michael McCandless (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless resolved LUCENE-10587.
-
Resolution: Not A Problem




[jira] [Created] (LUCENE-10587) Rename "master seed" to "root seed" or "main seed" or so?

2022-05-22 Thread Michael McCandless (Jira)
Michael McCandless created LUCENE-10587:
---

 Summary: Rename "master seed" to "root seed" or "main seed" or so?
 Key: LUCENE-10587
 URL: https://issues.apache.org/jira/browse/LUCENE-10587
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Michael McCandless


I noticed that Lucene's test infrastructure (or perhaps it's in the 
{{RandomizedTesting}} dependency?) still says things like this:
{noformat}
[junit4:junit4]  says Привет! Master seed: 3296009A5B3B7A05 {noformat}
Let's rename away from the term {{master}}?






[jira] [Commented] (LUCENE-10481) FacetsCollector does not need scores when not keeping them

2022-05-18 Thread Michael McCandless (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17538891#comment-17538891
 ] 

Michael McCandless commented on LUCENE-10481:
-

{quote}Hmm... some slightly disappointing results - although we saw great 
improvement with this change, that doesn't seem to persist with Lucene 9.1 
benchmarking that I'm trying to do right now. Possible that something else has 
taken care of this optimization in a different way.
{quote}
That's interesting ... I wonder what other change could've stolen this thunder?

> FacetsCollector does not need scores when not keeping them
> --
>
> Key: LUCENE-10481
> URL: https://issues.apache.org/jira/browse/LUCENE-10481
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/facet
>Reporter: Mike Drob
>Assignee: Mike Drob
>Priority: Major
> Fix For: 8.11.2, 9.2
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> FacetsCollector currently always specifies ScoreMode.COMPLETE, we could get 
> better performance by not requesting scores when we don't need them.






[jira] [Commented] (LUCENE-10481) FacetsCollector does not need scores when not keeping them

2022-05-18 Thread Michael McCandless (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17538889#comment-17538889
 ] 

Michael McCandless commented on LUCENE-10481:
-

I think the reason why it may sometimes need scores is if you ask it to 
aggregate the relevance for each facet value, using "association facets", and 
then pick top N by descending relevance.  Maybe?

But yeah +1 to the change – we should not ask for scores if we won't use them :)
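A stdlib-only sketch of the pattern this comment endorses (the class and method names below are invented for illustration; Lucene's real FacetsCollector is more involved, though {{ScoreMode.COMPLETE_NO_SCORES}} is its actual constant): the collector advertises the cheaper score mode unless its caller asked to keep scores.

```java
// Illustrative only: a collector should advertise the cheaper score mode
// unless the caller asked to keep scores. All names below are made up.
public class ScoreModeChoice {

  enum ScoreMode { COMPLETE, COMPLETE_NO_SCORES }

  // Before the fix the collector unconditionally returned COMPLETE,
  // forcing scorers to compute scores even when they were thrown away.
  static ScoreMode scoreMode(boolean keepScores) {
    return keepScores ? ScoreMode.COMPLETE : ScoreMode.COMPLETE_NO_SCORES;
  }

  public static void main(String[] args) {
    System.out.println(scoreMode(false)); // prints COMPLETE_NO_SCORES
    System.out.println(scoreMode(true));  // prints COMPLETE
  }
}
```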

> FacetsCollector does not need scores when not keeping them
> --
>
> Key: LUCENE-10481
> URL: https://issues.apache.org/jira/browse/LUCENE-10481
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/facet
>Reporter: Mike Drob
>Assignee: Mike Drob
>Priority: Major
> Fix For: 8.11.2, 9.2
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> FacetsCollector currently always specifies ScoreMode.COMPLETE, we could get 
> better performance by not requesting scores when we don't need them.






[jira] [Commented] (LUCENE-10574) Remove O(n^2) from TieredMergePolicy or change defaults to one that doesn't do this

2022-05-18 Thread Michael McCandless (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17538797#comment-17538797
 ] 

Michael McCandless commented on LUCENE-10574:
-

If anyone finally gives a talk about "How Lucene developers try to use 
algorithms that minimize adversarial use cases", this might be a good example 
to add.  We try to choose algorithms that minimize the adversarial cases even 
if it means sometimes slower performance for normal usage.  Maybe someone could 
submit this talk for ApacheCon :)

> Remove O(n^2) from TieredMergePolicy or change defaults to one that doesn't 
> do this
> ---
>
> Key: LUCENE-10574
> URL: https://issues.apache.org/jira/browse/LUCENE-10574
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Robert Muir
>Priority: Major
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Remove {{floorSegmentBytes}} parameter, or change lucene's default to a merge 
> policy that doesn't merge in an O(n^2) way.
> I have the feeling it might have to be the latter, as folks seem really wed 
> to this crazy O(n^2) behavior.






[jira] [Commented] (LUCENE-10574) Remove O(n^2) from TieredMergePolicy or change defaults to one that doesn't do this

2022-05-18 Thread Michael McCandless (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17538787#comment-17538787
 ] 

Michael McCandless commented on LUCENE-10574:
-

I like [~jpountz]'s approach!

It forces the "below floor" merges to not be pathological by insisting that the 
sizes of the segments being merged are somewhat balanced (less balanced than 
once the segments are over the floor size). The cost is O(N * log(N)) again, 
with a higher constant factor, not O(N^2) anymore.  Progress not perfection (hi 
[~dweiss]).

I do think (long-term) we should consider removing the floor entirely (open a 
follow-on issue after [~jpountz]'s PR), perhaps only once we enable 
merge-on-refresh by default. Applications that flush/refresh/commit tiny 
segments would pay a higher search-time price for the long tail of minuscule 
segments, but that is already an inefficient thing to do and so those users 
perhaps are not optimizing / caring about performance. If you follow the best 
practice for faster indexing (and you use merge-on-refresh/commit) you should 
be unaffected by complete removal of the floor merge size.
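A toy sketch of that idea (this is NOT Lucene's TieredMergePolicy code; every name and threshold below is invented): sort segments by size once, then only merge a window of similarly sized sub-floor segments, so selection stays O(N log N) overall and no tiny segment gets rewritten over and over.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Toy illustration of balanced below-floor merge selection.
public class BalancedFloorMerges {

  // Returns windows of `factor` segments whose sizes stay within a factor
  // of `maxSkew` of each other. Sorting dominates: O(N log N) overall.
  static List<List<Long>> selectMerges(List<Long> sizes, int factor, double maxSkew) {
    List<Long> sorted = new ArrayList<>(sizes);
    Collections.sort(sorted);
    List<List<Long>> merges = new ArrayList<>();
    for (int i = 0; i + factor <= sorted.size(); ) {
      long smallest = sorted.get(i);
      long largest = sorted.get(i + factor - 1);
      if (largest <= smallest * maxSkew) {
        merges.add(sorted.subList(i, i + factor));
        i += factor;  // consume the whole balanced window
      } else {
        i++;          // too lopsided; slide past the runt
      }
    }
    return merges;
  }

  public static void main(String[] args) {
    // One tiny flushed segment plus similarly sized peers:
    List<Long> sizes = List.of(1L, 100L, 110L, 120L, 130L);
    // The 1-unit runt is skipped; [100, 110, 120, 130] merge together.
    System.out.println(selectMerges(sizes, 4, 2.0)); // prints [[100, 110, 120, 130]]
  }
}
```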

> Remove O(n^2) from TieredMergePolicy or change defaults to one that doesn't 
> do this
> ---
>
> Key: LUCENE-10574
> URL: https://issues.apache.org/jira/browse/LUCENE-10574
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Robert Muir
>Priority: Major
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Remove {{floorSegmentBytes}} parameter, or change lucene's default to a merge 
> policy that doesn't merge in an O(n^2) way.
> I have the feeling it might have to be the latter, as folks seem really wed 
> to this crazy O(n^2) behavior.






[jira] [Commented] (LUCENE-10572) Can we optimize BytesRefHash?

2022-05-16 Thread Michael McCandless (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17537577#comment-17537577
 ] 

Michael McCandless commented on LUCENE-10572:
-

OK I ran a simple {{luceneutil}} benchmark, indexing all EN Wikipedia docs.  I 
turned off merging ({{{}NoMergePolicy{}}}), set IW's RAM buffer to 64 GB, 
indexed with 12 threads.  I turned off stored fields, doc values, facets, 
points, to try to focus on just the inverted index throughput.

Net/net the results look very noisy and I can't see any difference in 
performance (blue is {{{}main{}}}, and red is Uwe's "nuke baby vInt" PR) after 
26 iterations:

!Screen Shot 2022-05-16 at 10.28.22 AM.png!

trunk: mean 350.0871463644134, var 423.00080226708536

PR: mean 345.57021109942616, var 869.254999491771

JSFiddle for the chart is here: [https://jsfiddle.net/4jv8ew91/]
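The encoding change being benchmarked (drop the 1-or-2-byte vInt length for a fixed two-byte length) can be sketched with plain Java. This is a simplification written for illustration, not the PR's actual bytes; it works because IndexWriter caps term length well under 64K:

```java
// Contrasts the two length encodings discussed in this issue.
public class LengthEncodingDemo {

  // Variable-length: 1 byte for lengths < 128, else 2 bytes with a
  // continuation marker (covers lengths up to 32767 here).
  static int writeVInt(byte[] dst, int pos, int len) {
    if (len < 0x80) {
      dst[pos] = (byte) len;
      return 1;
    }
    dst[pos] = (byte) (0x80 | (len >> 8));
    dst[pos + 1] = (byte) (len & 0xFF);
    return 2;
  }

  // Fixed-width: always 2 bytes, branch-free on the hot write path.
  static int writeShort(byte[] dst, int pos, int len) {
    dst[pos] = (byte) len;
    dst[pos + 1] = (byte) (len >>> 8);
    return 2;
  }

  static int readShort(byte[] src, int pos) {
    return (src[pos] & 0xFF) | ((src[pos + 1] & 0xFF) << 8);
  }

  public static void main(String[] args) {
    byte[] buf = new byte[4];
    System.out.println(writeVInt(buf, 0, 100));  // prints 1 (short term)
    System.out.println(writeVInt(buf, 0, 1000)); // prints 2 (len >= 128)
    writeShort(buf, 0, 1000);
    System.out.println(readShort(buf, 0));       // prints 1000 (round-trips)
  }
}
```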

 

> Can we optimize BytesRefHash?
> -
>
> Key: LUCENE-10572
> URL: https://issues.apache.org/jira/browse/LUCENE-10572
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Michael McCandless
>Priority: Major
> Attachments: Screen Shot 2022-05-16 at 10.28.22 AM.png
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> I was poking around in our nightly benchmarks 
> ([https://home.apache.org/~mikemccand/lucenebench]) and noticed in the JFR 
> profiling that the hottest method is this:
> {noformat}
> PERCENT   CPU SAMPLES   STACK
> 9.28% 53848 org.apache.lucene.util.BytesRefHash#equals()
>   at 
> org.apache.lucene.util.BytesRefHash#findHash()
>   at org.apache.lucene.util.BytesRefHash#add()
>   at 
> org.apache.lucene.index.TermsHashPerField#add()
>   at 
> org.apache.lucene.index.IndexingChain$PerField#invert()
>   at 
> org.apache.lucene.index.IndexingChain#processField()
>   at 
> org.apache.lucene.index.IndexingChain#processDocument()
>   at 
> org.apache.lucene.index.DocumentsWriterPerThread#updateDocuments() {noformat}
> This is kinda crazy – comparing if the term to be inserted into the inverted 
> index hash equals the term already added to {{BytesRefHash}} is the hottest 
> method during nightly benchmarks.
> Discussing offline with [~rcmuir] and [~jpountz] they noticed a few 
> questionable things about our current implementation:
>  * Why are we using a 1 or 2 byte {{vInt}} to encode the length of the 
> inserted term into the hash?  Let's just use two bytes always, since IW 
> limits term length to 32 K (< 64K that an unsigned short can cover)
>  * Why are we doing byte swapping in this deep hotspot using {{VarHandles}} 
> (BitUtil.VH_BE_SHORT.get)
>  * Is it possible our growth strategy for {{BytesRefHash}} (on rehash) is not 
> aggressive enough?  Or the initial sizing of the hash is too small?
>  * Maybe {{MurmurHash}} is not great (causing too many conflicts, and too 
> many {{equals}} calls as a result?) – {{Fnv}} and {{xxhash}} are possible 
> "upgrades"?
>  * If we stick with {{{}MurmurHash{}}}, why are we using the 32 bit version 
> ({{{}murmurhash3_x86_32{}}})?
>  * Are we using the JVM's intrinsics to compare multiple bytes in a single 
> SIMD instruction ([~rcmuir] is quite sure we are indeed)?
>  * [~jpountz] suggested maybe the hash insert is simply memory bound
>  * {{TermsHashPerField.writeByte}} is also depressingly slow (~5% of total 
> CPU cost)
> I pulled these observations from a recent (5/6/22) profiler output: 
> [https://home.apache.org/~mikemccand/lucenebench/2022.05.06.06.33.00.html]
> Maybe we can improve our performance on this crazy hotspot?
> Or maybe this is a "healthy" hotspot and we should leave it be!






[jira] [Updated] (LUCENE-10572) Can we optimize BytesRefHash?

2022-05-16 Thread Michael McCandless (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-10572:

Attachment: Screen Shot 2022-05-16 at 10.28.22 AM.png

> Can we optimize BytesRefHash?
> -
>
> Key: LUCENE-10572
> URL: https://issues.apache.org/jira/browse/LUCENE-10572
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Michael McCandless
>Priority: Major
> Attachments: Screen Shot 2022-05-16 at 10.28.22 AM.png
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> I was poking around in our nightly benchmarks 
> ([https://home.apache.org/~mikemccand/lucenebench]) and noticed in the JFR 
> profiling that the hottest method is this:
> {noformat}
> PERCENT   CPU SAMPLES   STACK
> 9.28% 53848 org.apache.lucene.util.BytesRefHash#equals()
>   at 
> org.apache.lucene.util.BytesRefHash#findHash()
>   at org.apache.lucene.util.BytesRefHash#add()
>   at 
> org.apache.lucene.index.TermsHashPerField#add()
>   at 
> org.apache.lucene.index.IndexingChain$PerField#invert()
>   at 
> org.apache.lucene.index.IndexingChain#processField()
>   at 
> org.apache.lucene.index.IndexingChain#processDocument()
>   at 
> org.apache.lucene.index.DocumentsWriterPerThread#updateDocuments() {noformat}
> This is kinda crazy – comparing if the term to be inserted into the inverted 
> index hash equals the term already added to {{BytesRefHash}} is the hottest 
> method during nightly benchmarks.
> Discussing offline with [~rcmuir] and [~jpountz] they noticed a few 
> questionable things about our current implementation:
>  * Why are we using a 1 or 2 byte {{vInt}} to encode the length of the 
> inserted term into the hash?  Let's just use two bytes always, since IW 
> limits term length to 32 K (< 64K that an unsigned short can cover)
>  * Why are we doing byte swapping in this deep hotspot using {{VarHandles}} 
> (BitUtil.VH_BE_SHORT.get)
>  * Is it possible our growth strategy for {{BytesRefHash}} (on rehash) is not 
> aggressive enough?  Or the initial sizing of the hash is too small?
>  * Maybe {{MurmurHash}} is not great (causing too many conflicts, and too 
> many {{equals}} calls as a result?) – {{Fnv}} and {{xxhash}} are possible 
> "upgrades"?
>  * If we stick with {{{}MurmurHash{}}}, why are we using the 32 bit version 
> ({{{}murmurhash3_x86_32{}}})?
>  * Are we using the JVM's intrinsics to compare multiple bytes in a single 
> SIMD instruction ([~rcmuir] is quite sure we are indeed)?
>  * [~jpountz] suggested maybe the hash insert is simply memory bound
>  * {{TermsHashPerField.writeByte}} is also depressingly slow (~5% of total 
> CPU cost)
> I pulled these observations from a recent (5/6/22) profiler output: 
> [https://home.apache.org/~mikemccand/lucenebench/2022.05.06.06.33.00.html]
> Maybe we can improve our performance on this crazy hotspot?
> Or maybe this is a "healthy" hotspot and we should leave it be!






[jira] [Commented] (LUCENE-10572) Can we optimize BytesRefHash?

2022-05-14 Thread Michael McCandless (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17537023#comment-17537023
 ] 

Michael McCandless commented on LUCENE-10572:
-

{quote}Mike, could you make a test on how much memory increases by the PR and if 
there's a speed improvement at all?
{quote}
+1, I will try to benchmark the PR!  Thank you for the fast iterations here!  
Exciting :)

> Can we optimize BytesRefHash?
> -
>
> Key: LUCENE-10572
> URL: https://issues.apache.org/jira/browse/LUCENE-10572
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Michael McCandless
>Priority: Major
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> I was poking around in our nightly benchmarks 
> ([https://home.apache.org/~mikemccand/lucenebench]) and noticed in the JFR 
> profiling that the hottest method is this:
> {noformat}
> PERCENT   CPU SAMPLES   STACK
> 9.28% 53848 org.apache.lucene.util.BytesRefHash#equals()
>   at 
> org.apache.lucene.util.BytesRefHash#findHash()
>   at org.apache.lucene.util.BytesRefHash#add()
>   at 
> org.apache.lucene.index.TermsHashPerField#add()
>   at 
> org.apache.lucene.index.IndexingChain$PerField#invert()
>   at 
> org.apache.lucene.index.IndexingChain#processField()
>   at 
> org.apache.lucene.index.IndexingChain#processDocument()
>   at 
> org.apache.lucene.index.DocumentsWriterPerThread#updateDocuments() {noformat}
> This is kinda crazy – comparing if the term to be inserted into the inverted 
> index hash equals the term already added to {{BytesRefHash}} is the hottest 
> method during nightly benchmarks.
> Discussing offline with [~rcmuir] and [~jpountz] they noticed a few 
> questionable things about our current implementation:
>  * Why are we using a 1 or 2 byte {{vInt}} to encode the length of the 
> inserted term into the hash?  Let's just use two bytes always, since IW 
> limits term length to 32 K (< 64K that an unsigned short can cover)
>  * Why are we doing byte swapping in this deep hotspot using {{VarHandles}} 
> (BitUtil.VH_BE_SHORT.get)
>  * Is it possible our growth strategy for {{BytesRefHash}} (on rehash) is not 
> aggressive enough?  Or the initial sizing of the hash is too small?
>  * Maybe {{MurmurHash}} is not great (causing too many conflicts, and too 
> many {{equals}} calls as a result?) – {{Fnv}} and {{xxhash}} are possible 
> "upgrades"?
>  * If we stick with {{{}MurmurHash{}}}, why are we using the 32 bit version 
> ({{{}murmurhash3_x86_32{}}})?
>  * Are we using the JVM's intrinsics to compare multiple bytes in a single 
> SIMD instruction ([~rcmuir] is quite sure we are indeed)?
>  * [~jpountz] suggested maybe the hash insert is simply memory bound
>  * {{TermsHashPerField.writeByte}} is also depressingly slow (~5% of total 
> CPU cost)
> I pulled these observations from a recent (5/6/22) profiler output: 
> [https://home.apache.org/~mikemccand/lucenebench/2022.05.06.06.33.00.html]
> Maybe we can improve our performance on this crazy hotspot?
> Or maybe this is a "healthy" hotspot and we should leave it be!






[jira] [Updated] (LUCENE-10572) Can we optimize BytesRefHash?

2022-05-13 Thread Michael McCandless (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-10572:

Summary: Can we optimize BytesRefHash?  (was: Can we optimize BytesRefHash)

> Can we optimize BytesRefHash?
> -
>
> Key: LUCENE-10572
> URL: https://issues.apache.org/jira/browse/LUCENE-10572
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Michael McCandless
>Priority: Major
>
> I was poking around in our nightly benchmarks 
> ([https://home.apache.org/~mikemccand/lucenebench]) and noticed in the JFR 
> profiling that the hottest method is this:
> {noformat}
> PERCENT   CPU SAMPLES   STACK
> 9.28% 53848 org.apache.lucene.util.BytesRefHash#equals()
>   at 
> org.apache.lucene.util.BytesRefHash#findHash()
>   at org.apache.lucene.util.BytesRefHash#add()
>   at 
> org.apache.lucene.index.TermsHashPerField#add()
>   at 
> org.apache.lucene.index.IndexingChain$PerField#invert()
>   at 
> org.apache.lucene.index.IndexingChain#processField()
>   at 
> org.apache.lucene.index.IndexingChain#processDocument()
>   at 
> org.apache.lucene.index.DocumentsWriterPerThread#updateDocuments() {noformat}
> This is kinda crazy – comparing if the term to be inserted into the inverted 
> index hash equals the term already added to {{BytesRefHash}} is the hottest 
> method during nightly benchmarks.
> Discussing offline with [~rcmuir] and [~jpountz] they noticed a few 
> questionable things about our current implementation:
>  * Why are we using a 1 or 2 byte {{vInt}} to encode the length of the 
> inserted term into the hash?  Let's just use two bytes always, since IW 
> limits term length to 32 K (< 64K that an unsigned short can cover)
>  * Why are we doing byte swapping in this deep hotspot using {{VarHandles}} 
> (BitUtil.VH_BE_SHORT.get)
>  * Is it possible our growth strategy for {{BytesRefHash}} (on rehash) is not 
> aggressive enough?  Or the initial sizing of the hash is too small?
>  * Maybe {{MurmurHash}} is not great (causing too many conflicts, and too 
> many {{equals}} calls as a result?) – {{Fnv}} and {{xxhash}} are possible 
> "upgrades"?
>  * If we stick with {{{}MurmurHash{}}}, why are we using the 32 bit version 
> ({{{}murmurhash3_x86_32{}}})?
>  * Are we using the JVM's intrinsics to compare multiple bytes in a single 
> SIMD instruction ([~rcmuir] is quite sure we are indeed)?
>  * [~jpountz] suggested maybe the hash insert is simply memory bound
>  * {{TermsHashPerField.writeByte}} is also depressingly slow (~5% of total 
> CPU cost)
> I pulled these observations from a recent (5/6/22) profiler output: 
> [https://home.apache.org/~mikemccand/lucenebench/2022.05.06.06.33.00.html]
> Maybe we can improve our performance on this crazy hotspot?
> Or maybe this is a "healthy" hotspot and we should leave it be!






[jira] [Updated] (LUCENE-10572) Can we optimize BytesRefHash?

2022-05-13 Thread Michael McCandless (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-10572:

Description: 
I was poking around in our nightly benchmarks 
([https://home.apache.org/~mikemccand/lucenebench]) and noticed in the JFR 
profiling that the hottest method is this:
{noformat}
PERCENT   CPU SAMPLES   STACK
9.28% 53848 org.apache.lucene.util.BytesRefHash#equals()
  at org.apache.lucene.util.BytesRefHash#findHash()
  at org.apache.lucene.util.BytesRefHash#add()
  at org.apache.lucene.index.TermsHashPerField#add()
  at 
org.apache.lucene.index.IndexingChain$PerField#invert()
  at 
org.apache.lucene.index.IndexingChain#processField()
  at 
org.apache.lucene.index.IndexingChain#processDocument()
  at 
org.apache.lucene.index.DocumentsWriterPerThread#updateDocuments() {noformat}
This is kinda crazy – comparing if the term to be inserted into the inverted 
index hash equals the term already added to {{BytesRefHash}} is the hottest 
method during nightly benchmarks.

Discussing offline with [~rcmuir] and [~jpountz] they noticed a few 
questionable things about our current implementation:
 * Why are we using a 1 or 2 byte {{vInt}} to encode the length of the inserted 
term into the hash?  Let's just use two bytes always, since IW limits term 
length to 32 K (< 64K that an unsigned short can cover)

 * Why are we doing byte swapping in this deep hotspot using {{VarHandles}} 
(BitUtil.VH_BE_SHORT.get)


 * Is it possible our growth strategy for {{BytesRefHash}} (on rehash) is not 
aggressive enough?  Or the initial sizing of the hash is too small?

 * Maybe {{MurmurHash}} is not great (causing too many conflicts, and too many 
{{equals}} calls as a result?) – {{Fnv}} and {{xxhash}} are possible "upgrades"?

 * If we stick with {{{}MurmurHash{}}}, why are we using the 32 bit version 
({{{}murmurhash3_x86_32{}}})?

 * Are we using the JVM's intrinsics to compare multiple bytes in a single SIMD 
instruction ([~rcmuir] is quite sure we are indeed)?

 * [~jpountz] suggested maybe the hash insert is simply memory bound

 * {{TermsHashPerField.writeByte}} is also depressingly slow (~5% of total CPU 
cost)

I pulled these observations from a recent (5/6/22) profiler output: 
[https://home.apache.org/~mikemccand/lucenebench/2022.05.06.06.33.00.html]

Maybe we can improve our performance on this crazy hotspot?

Or maybe this is a "healthy" hotspot and we should leave it be!

  was:
I was poking around in our nightly benchmarks 
([https://home.apache.org/~mikemccand/lucenebench]) and noticed in the JFR 
profiling that the hottest method is this:
{noformat}
PERCENT   CPU SAMPLES   STACK
9.28% 53848 org.apache.lucene.util.BytesRefHash#equals()
  at org.apache.lucene.util.BytesRefHash#findHash()
  at org.apache.lucene.util.BytesRefHash#add()
  at org.apache.lucene.index.TermsHashPerField#add()
  at 
org.apache.lucene.index.IndexingChain$PerField#invert()
  at 
org.apache.lucene.index.IndexingChain#processField()
  at 
org.apache.lucene.index.IndexingChain#processDocument()
  at 
org.apache.lucene.index.DocumentsWriterPerThread#updateDocuments() {noformat}
This is kinda crazy – comparing if the term to be inserted into the inverted 
index hash equals the term already added to {{BytesRefHash}} is the hottest 
method during nightly benchmarks.

Discussing offline with [~rcmuir] and [~jpountz] they noticed a few 
questionable things about our current implementation:
 * Why are we using a 1 or 2 byte {{vInt}} to encode the length of the inserted 
term into the hash?  Let's just use two bytes always, since IW limits term 
length to 32 K (< 64K that an unsigned short can cover)


 * Why are we doing byte swapping in this deep hotspot using {{VarHandles}} 
(BitUtil.VH_BE_SHORT.get)
 * Is it possible our growth strategy for {{BytesRefHash}} (on rehash) is not 
aggressive enough?  Or the initial sizing of the hash is too small?

 * Maybe {{MurmurHash}} is not great (causing too many conflicts, and too many 
{{equals}} calls as a result?) – {{Fnv}} and {{xxhash}} are possible "upgrades"?


 * If we stick with {{{}MurmurHash{}}}, why are we using the 32 bit version 
({{{}murmurhash3_x86_32{}}})?

 * Are we using the JVM's intrinsics to compare multiple bytes in a single SIMD 
instruction ([~rcmuir] is quite sure we are indeed)?


 * [~jpountz] suggested maybe the hash insert is simply memory bound


 * {{TermsHashPerField.writeByte}} is also depressingly slow (~5% of total CPU 
cost)

I 

[jira] [Created] (LUCENE-10572) Can we optimize BytesRefHash

2022-05-13 Thread Michael McCandless (Jira)
Michael McCandless created LUCENE-10572:
---

 Summary: Can we optimize BytesRefHash
 Key: LUCENE-10572
 URL: https://issues.apache.org/jira/browse/LUCENE-10572
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Michael McCandless


I was poking around in our nightly benchmarks 
([https://home.apache.org/~mikemccand/lucenebench]) and noticed in the JFR 
profiling that the hottest method is this:
{noformat}
PERCENT   CPU SAMPLES   STACK
9.28% 53848 org.apache.lucene.util.BytesRefHash#equals()
  at org.apache.lucene.util.BytesRefHash#findHash()
  at org.apache.lucene.util.BytesRefHash#add()
  at org.apache.lucene.index.TermsHashPerField#add()
  at 
org.apache.lucene.index.IndexingChain$PerField#invert()
  at 
org.apache.lucene.index.IndexingChain#processField()
  at 
org.apache.lucene.index.IndexingChain#processDocument()
  at 
org.apache.lucene.index.DocumentsWriterPerThread#updateDocuments() {noformat}
This is kinda crazy – comparing if the term to be inserted into the inverted 
index hash equals the term already added to {{BytesRefHash}} is the hottest 
method during nightly benchmarks.

Discussing offline with [~rcmuir] and [~jpountz] they noticed a few 
questionable things about our current implementation:
 * Why are we using a 1 or 2 byte {{vInt}} to encode the length of the inserted 
term into the hash?  Let's just use two bytes always, since IW limits term 
length to 32 K (< 64K that an unsigned short can cover)


 * Why are we doing byte swapping in this deep hotspot using {{VarHandles}} 
(BitUtil.VH_BE_SHORT.get)
 * Is it possible our growth strategy for {{BytesRefHash}} (on rehash) is not 
aggressive enough?  Or the initial sizing of the hash is too small?

 * Maybe {{MurmurHash}} is not great (causing too many conflicts, and too many 
{{equals}} calls as a result?) – {{Fnv}} and {{xxhash}} are possible "upgrades"?


 * If we stick with {{{}MurmurHash{}}}, why are we using the 32 bit version 
({{{}murmurhash3_x86_32{}}})?

 * Are we using the JVM's intrinsics to compare multiple bytes in a single SIMD 
instruction ([~rcmuir] is quite sure we are indeed)?


 * [~jpountz] suggested maybe the hash insert is simply memory bound


 * {{TermsHashPerField.writeByte}} is also depressingly slow (~5% of total CPU 
cost)

I pulled these observations from a recent (5/6/22) profiler output: 
[https://home.apache.org/~mikemccand/lucenebench/2022.05.06.06.33.00.html]

Maybe we can improve our performance on this crazy hotspot?

Or maybe this is a "healthy" hotspot and we should leave it be!
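For the "{{Fnv}} as a possible upgrade" bullet above, a 32-bit FNV-1a over a byte range looks like this (written from the published FNV constants; this is not code from Lucene or any PR):

```java
// 32-bit FNV-1a over a byte range, from the published constants.
public class Fnv1a32 {

  static final int FNV_OFFSET_BASIS = 0x811C9DC5;
  static final int FNV_PRIME = 0x01000193;

  static int hash(byte[] bytes, int offset, int length) {
    int h = FNV_OFFSET_BASIS;
    for (int i = offset; i < offset + length; i++) {
      h ^= bytes[i] & 0xFF; // xor-then-multiply is the "1a" variant
      h *= FNV_PRIME;
    }
    return h;
  }

  public static void main(String[] args) {
    byte[] term = "lucene".getBytes(java.nio.charset.StandardCharsets.UTF_8);
    System.out.printf("%08x%n", hash(term, 0, term.length));
  }
}
```

Whether it actually reduces {{equals}} calls versus MurmurHash would need measurement against real term distributions, per the benchmarking discussion above.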






[jira] [Commented] (LUCENE-10556) Relax the maximum dirtiness for stored fields and term vectors?

2022-05-12 Thread Michael McCandless (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17536285#comment-17536285
 ] 

Michael McCandless commented on LUCENE-10556:
-

{quote}I'm not sure if we should change the MP in the benchmark though, since 
so many users do use TieredMP (the default).
{quote}
Or, maybe we need to improve TMP's defaults!!!  If the floor segment MB size is 
causing too much O(N^2) behavior we should fix that default ...

> Relax the maximum dirtiness for stored fields and term vectors?
> ---
>
> Key: LUCENE-10556
> URL: https://issues.apache.org/jira/browse/LUCENE-10556
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
>
> Stored fields and term vectors compress data and have merge-time 
> optimizations to copy compressed data directly instead of decompressing and 
> recompressing over and over again. However, sometimes incomplete blocks get 
> carried over (typically the last block of a flushed segment) and so these 
> file formats keep track of how "dirty" their current blocks are to know 
> whether stored fields / term vectors for a segment should be re-compressed.
> Currently the logic is to recompress if more than 1% of the blocks are 
> incomplete, or if the total number of missing documents across incomplete 
> blocks is more than the configured maximum number of documents per block.
> I'd be interested in evaluating what the compression ratio would be if we 
> relaxed these conditions a bit, e.g. by allowing up to 5% dirtiness. My gut 
> feeling is that the compression ratio could be barely worse while index-time 
> CPU usage could be significantly improved. 






[jira] [Commented] (LUCENE-10557) Migrate to GitHub issue from Jira?

2022-05-12 Thread Michael McCandless (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17536176#comment-17536176
 ] 

Michael McCandless commented on LUCENE-10557:
-

Oh that's great news; thanks for sharing [~nightowl888]!

> Migrate to GitHub issue from Jira?
> --
>
> Key: LUCENE-10557
> URL: https://issues.apache.org/jira/browse/LUCENE-10557
> Project: Lucene - Core
>  Issue Type: Sub-task
>Reporter: Tomoko Uchida
>Priority: Major
>
> A few (not the majority) Apache projects already use the GitHub issue instead 
> of Jira. For example,
> Airflow: [https://github.com/apache/airflow/issues]
> BookKeeper: [https://github.com/apache/bookkeeper/issues]
> So I think it'd be technically possible that we move to GitHub issue. I have 
> little knowledge of how to proceed with it, I'd like to discuss whether we 
> should migrate to it, and if so, how to smoothly handle the migration.
> The major tasks would be:
>  * Get a consensus about the migration among committers
>  * Enable Github issue on the lucene's repository (currently, it is disabled 
> on it)
>  * Build the convention or rules for issue label/milestone management
>  * Choose issues that should be moved to GitHub (I think too old or obsolete 
> issues can remain Jira.)






[jira] [Commented] (LUCENE-10568) Javadocs errors in IndexWriter.DocStats

2022-05-12 Thread Michael McCandless (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17536173#comment-17536173
 ] 

Michael McCandless commented on LUCENE-10568:
-

Thanks [~wormday] – the wording is indeed VERY confusing :)   I commented on 
the PR.

> Javadocs errors in IndexWriter.DocStats
> ---
>
> Key: LUCENE-10568
> URL: https://issues.apache.org/jira/browse/LUCENE-10568
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: general/javadocs
>Affects Versions: 8.0, 9.1
>Reporter: sun wuqiang
>Priority: Trivial
> Attachments: Image 007.png
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> *org.apache.lucene.index.IndexWriter.DocStats*
> This class has two fields
> The field maxDoc should include numDeletedDocs, while numDocs should not 
> include numDeletedDocs.
> However, the javadocs say just the opposite.
> !Image 007.png!






[jira] [Commented] (LUCENE-10551) LowercaseAsciiCompression should return false when it's unable to compress

2022-05-11 Thread Michael McCandless (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17535093#comment-17535093
 ] 

Michael McCandless commented on LUCENE-10551:
---------------------------------------------

+1 to get to the bottom of the GraalVM mis-compilation.

And also +1 if we can find a simple code change, with low risk and little 
performance impact for other JVM users, that could side-step this bug.

> LowercaseAsciiCompression should return false when it's unable to compress
> --
>
> Key: LUCENE-10551
> URL: https://issues.apache.org/jira/browse/LUCENE-10551
> Project: Lucene - Core
>  Issue Type: Bug
> Environment: Lucene version 8.11.1
>Reporter: Peixin Li
>Priority: Major
> Attachments: LUCENE-10551-test.patch
>
>  Time Spent: 2h 40m
>  Remaining Estimate: 0h
>
> {code:java}
>  Failed to commit..
> java.lang.IllegalStateException: 10 <> 5 
> cion1cion_desarrollociones_oraclecionesnaturacionesnatura2tedppsa-integrationdemotiontion
>  cloud gen2tion instance - dev1tion instance - 
> testtion-devbtion-instancetion-prdtion-promerication-qation064533tion535217tion697401tion761348tion892818tion_matrationcauto_simmonsintgic_testtioncloudprodictioncloudservicetiongateway10tioninstance-jtsundatamartprd??o
>         at 
> org.apache.lucene.util.compress.LowercaseAsciiCompression.compress(LowercaseAsciiCompression.java:115)
>         at 
> org.apache.lucene.codecs.blocktree.BlockTreeTermsWriter$TermsWriter.writeBlock(BlockTreeTermsWriter.java:834)
>         at 
> org.apache.lucene.codecs.blocktree.BlockTreeTermsWriter$TermsWriter.writeBlocks(BlockTreeTermsWriter.java:628)
>         at 
> org.apache.lucene.codecs.blocktree.BlockTreeTermsWriter$TermsWriter.pushTerm(BlockTreeTermsWriter.java:947)
>         at 
> org.apache.lucene.codecs.blocktree.BlockTreeTermsWriter$TermsWriter.write(BlockTreeTermsWriter.java:912)
>         at 
> org.apache.lucene.codecs.blocktree.BlockTreeTermsWriter.write(BlockTreeTermsWriter.java:318)
>         at 
> org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsWriter.write(PerFieldPostingsFormat.java:170)
>         at 
> org.apache.lucene.index.FreqProxTermsWriter.flush(FreqProxTermsWriter.java:120)
>         at 
> org.apache.lucene.index.DefaultIndexingChain.flush(DefaultIndexingChain.java:267)
>         at 
> org.apache.lucene.index.DocumentsWriterPerThread.flush(DocumentsWriterPerThread.java:350)
>         at 
> org.apache.lucene.index.DocumentsWriter.doFlush(DocumentsWriter.java:476)
>         at 
> org.apache.lucene.index.DocumentsWriter.flushAllThreads(DocumentsWriter.java:656)
>         at 
> org.apache.lucene.index.IndexWriter.prepareCommitInternal(IndexWriter.java:3364)
>         at 
> org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:3770)
>         at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:3728)
>        {code}
> {code:java}
> key=och-live--WorkResource.renewAssignmentToken.ResourceTime[namespace=workflow,
>  resourceGroup=workflow-service-overlay]{availabilityDomain=iad-ad-1, 
> domainId=och-live, host=workflow-service-overlay-01341.node.ad1.us-ashburn-1})
> java.lang.IllegalStateException: 29 <> 16 
> 

[jira] [Comment Edited] (LUCENE-10557) Migrate to GitHub issue from Jira?

2022-05-04 Thread Michael McCandless (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17531680#comment-17531680
 ] 

Michael McCandless edited comment on LUCENE-10557 at 5/4/22 11:29 AM:
----------------------------------------------------------------------

Hmm at least ~4 years ago, migrating was not 
[possible/easy|https://issues.apache.org/jira/browse/INFRA-15702?focusedCommentId=16312429=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16312429].


was (Author: mikemccand):
Hmm at least ~4 years ago, migrating was 
not[possible/easy]|https://issues.apache.org/jira/browse/INFRA-15702?focusedCommentId=16312429=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16312429].

> Migrate to GitHub issue from Jira?
> --
>
> Key: LUCENE-10557
> URL: https://issues.apache.org/jira/browse/LUCENE-10557
> Project: Lucene - Core
>  Issue Type: Sub-task
>Reporter: Tomoko Uchida
>Priority: Major
>
> A few (not the majority) Apache projects already use the GitHub issue instead 
> of Jira. For example,
> Airflow: [https://github.com/apache/airflow/issues]
> BookKeeper: [https://github.com/apache/bookkeeper/issues]
> So I think it'd be technically possible that we move to GitHub issue. I have 
> little knowledge of how to proceed with it, I'd like to discuss whether we 
> should migrate to it, and if so, how to smoothly handle the migration.
> The major tasks would be:
>  * Get a consensus about the migration among committers
>  * Enable Github issue on the lucene's repository (currently, it is disabled 
> on it)
>  * Build the convention or rules for issue label/milestone management






[jira] [Commented] (LUCENE-10557) Migrate to GitHub issue from Jira?

2022-05-04 Thread Michael McCandless (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17531681#comment-17531681
 ] 

Michael McCandless commented on LUCENE-10557:
---------------------------------------------

INFRA-16128 was the issue when RocketMQ migrated: 
https://issues.apache.org/jira/browse/INFRA-16128

> Migrate to GitHub issue from Jira?
> --
>
> Key: LUCENE-10557
> URL: https://issues.apache.org/jira/browse/LUCENE-10557
> Project: Lucene - Core
>  Issue Type: Sub-task
>Reporter: Tomoko Uchida
>Priority: Major
>
> A few (not the majority) Apache projects already use the GitHub issue instead 
> of Jira. For example,
> Airflow: [https://github.com/apache/airflow/issues]
> BookKeeper: [https://github.com/apache/bookkeeper/issues]
> So I think it'd be technically possible that we move to GitHub issue. I have 
> little knowledge of how to proceed with it, I'd like to discuss whether we 
> should migrate to it, and if so, how to smoothly handle the migration.
> The major tasks would be:
>  * Get a consensus about the migration among committers
>  * Enable Github issue on the lucene's repository (currently, it is disabled 
> on it)
>  * Build the convention or rules for issue label/milestone management






[jira] [Commented] (LUCENE-10557) Migrate to GitHub issue from Jira?

2022-05-04 Thread Michael McCandless (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17531680#comment-17531680
 ] 

Michael McCandless commented on LUCENE-10557:
---------------------------------------------

Hmm at least ~4 years ago, migrating was not 
[possible/easy|https://issues.apache.org/jira/browse/INFRA-15702?focusedCommentId=16312429=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16312429].

> Migrate to GitHub issue from Jira?
> --
>
> Key: LUCENE-10557
> URL: https://issues.apache.org/jira/browse/LUCENE-10557
> Project: Lucene - Core
>  Issue Type: Sub-task
>Reporter: Tomoko Uchida
>Priority: Major
>
> A few (not the majority) Apache projects already use the GitHub issue instead 
> of Jira. For example,
> Airflow: [https://github.com/apache/airflow/issues]
> BookKeeper: [https://github.com/apache/bookkeeper/issues]
> So I think it'd be technically possible that we move to GitHub issue. I have 
> little knowledge of how to proceed with it, I'd like to discuss whether we 
> should migrate to it, and if so, how to smoothly handle the migration.
> The major tasks would be:
>  * Get a consensus about the migration among committers
>  * Enable Github issue on the lucene's repository (currently, it is disabled 
> on it)
>  * Build the convention or rules for issue label/milestone management






[jira] [Commented] (LUCENE-10557) Migrate to GitHub issue from Jira?

2022-05-04 Thread Michael McCandless (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17531677#comment-17531677
 ] 

Michael McCandless commented on LUCENE-10557:
---------------------------------------------

+1

Do we know whether we could (relatively easily) migrate our Jira issues over to 
GitHub issues?

> Migrate to GitHub issue from Jira?
> --
>
> Key: LUCENE-10557
> URL: https://issues.apache.org/jira/browse/LUCENE-10557
> Project: Lucene - Core
>  Issue Type: Sub-task
>Reporter: Tomoko Uchida
>Priority: Major
>
> A few (not the majority) Apache projects already use the GitHub issue instead 
> of Jira. For example,
> Airflow: [https://github.com/apache/airflow/issues]
> BookKeeper: [https://github.com/apache/bookkeeper/issues]
> So I think it'd be technically possible that we move to GitHub issue. I have 
> little knowledge of how to proceed with it, I'd like to discuss whether we 
> should migrate to it, and if so, how to smoothly handle the migration.
> The major tasks would be:
>  * Get a consensus about the migration among committers
>  * Enable Github issue on the lucene's repository (currently, it is disabled 
> on it)
>  * Build the convention or rules for issue label/milestone management






[jira] [Resolved] (LUCENE-10188) Give SortedSetDocValues a docValueCount()?

2022-05-02 Thread Michael McCandless (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless resolved LUCENE-10188.
---------------------------------------------
Fix Version/s: 10.0 (main)
   9.2
   Resolution: Fixed

> Give SortedSetDocValues a docValueCount()?
> --
>
> Key: LUCENE-10188
> URL: https://issues.apache.org/jira/browse/LUCENE-10188
> Project: Lucene - Core
>  Issue Type: Wish
>Reporter: Adrien Grand
>Priority: Minor
> Fix For: 10.0 (main), 9.2
>
>
> Theoretically SortedSetDocValues gives more options to codecs with regard to 
> how SORTED_SET doc values could store ords. However in practice we currently 
> always store counts. Maybe giving SORTED_SET doc values an API that is closer 
> to the API of SORTED_NUMERIC doc values would be a better trade-off?






[jira] [Commented] (LUCENE-10188) Give SortedSetDocValues a docValueCount()?

2022-05-02 Thread Michael McCandless (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17530769#comment-17530769
 ] 

Michael McCandless commented on LUCENE-10188:
---------------------------------------------

Thanks [~spike.liu] – I just pushed this and backported to 9.x.

In the backport I also had to add this method to the Lucene70 backward codec.
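For readers unfamiliar with the API difference under discussion: before this change, SORTED_SET ords were consumed until a NO_MORE_ORDS sentinel, while SORTED_NUMERIC exposed a docValueCount() up front. The stand-in class below is purely illustrative (it is not Lucene's SortedSetDocValues API); it only contrasts the two loop shapes:

```java
// Hypothetical stand-in for a per-document ord stream; NOT Lucene's real
// SortedSetDocValues class. It only illustrates the two consumption styles.
public class OrdIterationSketch {
    static final long NO_MORE_ORDS = -1;

    // Ords for one document, in increasing order.
    static final long[] ORDS = {3, 7, 42};

    // Old style: pull ords until the sentinel is returned.
    static long sumViaSentinel() {
        long sum = 0;
        int i = 0;
        long ord;
        while ((ord = (i < ORDS.length ? ORDS[i++] : NO_MORE_ORDS)) != NO_MORE_ORDS) {
            sum += ord;
        }
        return sum;
    }

    // New style (LUCENE-10188): ask for the value count up front and loop a
    // known number of times, mirroring SortedNumericDocValues.
    static int docValueCount() {
        return ORDS.length;
    }

    static long sumViaCount() {
        long sum = 0;
        for (int i = 0; i < docValueCount(); i++) {
            sum += ORDS[i];
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(sumViaSentinel() + " " + sumViaCount());
    }
}
```

Both loops visit the same ords; the count-based form simply lets callers size buffers and branch without probing for a sentinel.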

> Give SortedSetDocValues a docValueCount()?
> --
>
> Key: LUCENE-10188
> URL: https://issues.apache.org/jira/browse/LUCENE-10188
> Project: Lucene - Core
>  Issue Type: Wish
>Reporter: Adrien Grand
>Priority: Minor
>
> Theoretically SortedSetDocValues gives more options to codecs with regard to 
> how SORTED_SET doc values could store ords. However in practice we currently 
> always store counts. Maybe giving SORTED_SET doc values an API that is closer 
> to the API of SORTED_NUMERIC doc values would be a better trade-off?






[jira] [Commented] (LUCENE-10551) LowercaseAsciiCompression should return false when it's unable to compress

2022-05-02 Thread Michael McCandless (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17530738#comment-17530738
 ] 

Michael McCandless commented on LUCENE-10551:
---------------------------------------------

I committed some simple test improvements – let's see if CI builds uncover 
anything exciting?

[~irislpx] – which JDK full version are you using?

> LowercaseAsciiCompression should return false when it's unable to compress
> --
>
> Key: LUCENE-10551
> URL: https://issues.apache.org/jira/browse/LUCENE-10551
> Project: Lucene - Core
>  Issue Type: Bug
> Environment: Lucene version 8.11.1
>Reporter: Peixin Li
>Priority: Major
> Attachments: LUCENE-10551-test.patch
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> {code:java}
>  Failed to commit..
> java.lang.IllegalStateException: 10 <> 5 
> cion1cion_desarrollociones_oraclecionesnaturacionesnatura2tedppsa-integrationdemotiontion
>  cloud gen2tion instance - dev1tion instance - 
> testtion-devbtion-instancetion-prdtion-promerication-qation064533tion535217tion697401tion761348tion892818tion_matrationcauto_simmonsintgic_testtioncloudprodictioncloudservicetiongateway10tioninstance-jtsundatamartprd??o
>         at 
> org.apache.lucene.util.compress.LowercaseAsciiCompression.compress(LowercaseAsciiCompression.java:115)
>         at 
> org.apache.lucene.codecs.blocktree.BlockTreeTermsWriter$TermsWriter.writeBlock(BlockTreeTermsWriter.java:834)
>         at 
> org.apache.lucene.codecs.blocktree.BlockTreeTermsWriter$TermsWriter.writeBlocks(BlockTreeTermsWriter.java:628)
>         at 
> org.apache.lucene.codecs.blocktree.BlockTreeTermsWriter$TermsWriter.pushTerm(BlockTreeTermsWriter.java:947)
>         at 
> org.apache.lucene.codecs.blocktree.BlockTreeTermsWriter$TermsWriter.write(BlockTreeTermsWriter.java:912)
>         at 
> org.apache.lucene.codecs.blocktree.BlockTreeTermsWriter.write(BlockTreeTermsWriter.java:318)
>         at 
> org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsWriter.write(PerFieldPostingsFormat.java:170)
>         at 
> org.apache.lucene.index.FreqProxTermsWriter.flush(FreqProxTermsWriter.java:120)
>         at 
> org.apache.lucene.index.DefaultIndexingChain.flush(DefaultIndexingChain.java:267)
>         at 
> org.apache.lucene.index.DocumentsWriterPerThread.flush(DocumentsWriterPerThread.java:350)
>         at 
> org.apache.lucene.index.DocumentsWriter.doFlush(DocumentsWriter.java:476)
>         at 
> org.apache.lucene.index.DocumentsWriter.flushAllThreads(DocumentsWriter.java:656)
>         at 
> org.apache.lucene.index.IndexWriter.prepareCommitInternal(IndexWriter.java:3364)
>         at 
> org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:3770)
>         at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:3728)
>        {code}
> {code:java}
> key=och-live--WorkResource.renewAssignmentToken.ResourceTime[namespace=workflow,
>  resourceGroup=workflow-service-overlay]{availabilityDomain=iad-ad-1, 
> domainId=och-live, host=workflow-service-overlay-01341.node.ad1.us-ashburn-1})
> java.lang.IllegalStateException: 29 <> 16 
> 

[jira] [Commented] (LUCENE-10550) Add getAllChildren functionality to facets

2022-05-02 Thread Michael McCandless (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17530719#comment-17530719
 ] 

Michael McCandless commented on LUCENE-10550:
---------------------------------------------

+1

> Add getAllChildren functionality to facets
> --
>
> Key: LUCENE-10550
> URL: https://issues.apache.org/jira/browse/LUCENE-10550
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Yuting Gan
>Priority: Minor
>
> Currently Lucene does not support returning range counts sorted by label 
> values, but there are use cases demanding this feature. For example, a user 
> specifies ranges (e.g., [0, 10], [10, 20]) and wants to get range counts 
> without changing the range order. Today we can only call getTopChildren to 
> populate range counts, but it would return ranges sorted by counts (e.g., 
> [10, 20] 100, [0, 10] 50) instead of range values. 
> Lucene has a API, getAllChildrenSortByValue, that returns numeric values with 
> counts sorted by label values, please see 
> [LUCENE-7927|https://issues.apache.org/jira/browse/LUCENE-7927] for details. 
> Therefore, it would be nice that we can also have a similar API to support 
> range counts. The proposed getAllChildren API is to return value/range counts 
> sorted by label values instead of counts. 
> This proposal was inspired from the discussions with [~gsmiller] when I was 
> working on the LUCENE-10538 [PR|https://github.com/apache/lucene/pull/843], 
> and we believe users would benefit from adding this API to Facets. 
> Hope I can get some feedback from the community since this proposal would 
> require changes to the getTopChildren API in RangeFacetCounts. Thanks!
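The ordering difference the description complains about can be shown with a toy example (this is not Lucene's Facets API; the labels and counts are the hypothetical ones from the description):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class FacetOrderingSketch {
    // (range label, count) pairs in the order the user specified the ranges.
    static LinkedHashMap<String, Integer> counts() {
        LinkedHashMap<String, Integer> m = new LinkedHashMap<>();
        m.put("[0, 10]", 50);
        m.put("[10, 20]", 100);
        return m;
    }

    // getTopChildren-style result: sorted by count, descending.
    static List<String> byCount() {
        List<Map.Entry<String, Integer>> es = new ArrayList<>(counts().entrySet());
        es.sort((a, b) -> b.getValue() - a.getValue());
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, Integer> e : es) {
            out.add(e.getKey() + " " + e.getValue());
        }
        return out;
    }

    // Proposed getAllChildren-style result: the user's range order is kept.
    static List<String> byLabelOrder() {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, Integer> e : counts().entrySet()) {
            out.add(e.getKey() + " " + e.getValue());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(byCount());      // count order: [10, 20] first
        System.out.println(byLabelOrder()); // label order: [0, 10] first
    }
}
```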






[jira] [Commented] (LUCENE-10551) LowercaseAsciiCompression should return false when it's unable to compress

2022-04-30 Thread Michael McCandless (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17530423#comment-17530423
 ] 

Michael McCandless commented on LUCENE-10551:
---------------------------------------------

I think at a minimum we should fix the exception message to not expect/require 
that the incoming {{byte[]}} is really {{UTF-8}} – we should change the 
{{.toUTF8String()}} to {{.toString()}} which will render the bytes accurately 
in hex I think.
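The concern is easy to demonstrate in plain Java without Lucene's BytesRef: decoding arbitrary term bytes as UTF-8 silently substitutes U+FFFD for invalid sequences, whereas a hex dump loses nothing. A minimal sketch:

```java
import java.nio.charset.StandardCharsets;

public class ByteRenderingSketch {
    // Render bytes as hex, which never loses information.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder("[");
        for (int i = 0; i < bytes.length; i++) {
            if (i > 0) sb.append(' ');
            sb.append(Integer.toHexString(bytes[i] & 0xff));
        }
        return sb.append(']').toString();
    }

    public static void main(String[] args) {
        // 0xC3 with no continuation byte is an invalid UTF-8 sequence.
        byte[] term = {'a', (byte) 0xC3, 'b'};
        // Java's decoder replaces the malformed byte with U+FFFD, so the
        // original value is unrecoverable from the decoded string.
        String asUtf8 = new String(term, StandardCharsets.UTF_8);
        System.out.println(asUtf8);
        System.out.println(toHex(term)); // prints [61 c3 62]: fully recoverable
    }
}
```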

> LowercaseAsciiCompression should return false when it's unable to compress
> --
>
> Key: LUCENE-10551
> URL: https://issues.apache.org/jira/browse/LUCENE-10551
> Project: Lucene - Core
>  Issue Type: Bug
> Environment: Lucene version 8.11.1
>Reporter: Peixin Li
>Priority: Major
> Attachments: LUCENE-10551-test.patch
>
>
> {code:java}
>  Failed to commit..
> java.lang.IllegalStateException: 10 <> 5 
> cion1cion_desarrollociones_oraclecionesnaturacionesnatura2tedppsa-integrationdemotiontion
>  cloud gen2tion instance - dev1tion instance - 
> testtion-devbtion-instancetion-prdtion-promerication-qation064533tion535217tion697401tion761348tion892818tion_matrationcauto_simmonsintgic_testtioncloudprodictioncloudservicetiongateway10tioninstance-jtsundatamartprd??o
>         at 
> org.apache.lucene.util.compress.LowercaseAsciiCompression.compress(LowercaseAsciiCompression.java:115)
>         at 
> org.apache.lucene.codecs.blocktree.BlockTreeTermsWriter$TermsWriter.writeBlock(BlockTreeTermsWriter.java:834)
>         at 
> org.apache.lucene.codecs.blocktree.BlockTreeTermsWriter$TermsWriter.writeBlocks(BlockTreeTermsWriter.java:628)
>         at 
> org.apache.lucene.codecs.blocktree.BlockTreeTermsWriter$TermsWriter.pushTerm(BlockTreeTermsWriter.java:947)
>         at 
> org.apache.lucene.codecs.blocktree.BlockTreeTermsWriter$TermsWriter.write(BlockTreeTermsWriter.java:912)
>         at 
> org.apache.lucene.codecs.blocktree.BlockTreeTermsWriter.write(BlockTreeTermsWriter.java:318)
>         at 
> org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsWriter.write(PerFieldPostingsFormat.java:170)
>         at 
> org.apache.lucene.index.FreqProxTermsWriter.flush(FreqProxTermsWriter.java:120)
>         at 
> org.apache.lucene.index.DefaultIndexingChain.flush(DefaultIndexingChain.java:267)
>         at 
> org.apache.lucene.index.DocumentsWriterPerThread.flush(DocumentsWriterPerThread.java:350)
>         at 
> org.apache.lucene.index.DocumentsWriter.doFlush(DocumentsWriter.java:476)
>         at 
> org.apache.lucene.index.DocumentsWriter.flushAllThreads(DocumentsWriter.java:656)
>         at 
> org.apache.lucene.index.IndexWriter.prepareCommitInternal(IndexWriter.java:3364)
>         at 
> org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:3770)
>         at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:3728)
>        {code}
> {code:java}
> key=och-live--WorkResource.renewAssignmentToken.ResourceTime[namespace=workflow,
>  resourceGroup=workflow-service-overlay]{availabilityDomain=iad-ad-1, 
> domainId=och-live, host=workflow-service-overlay-01341.node.ad1.us-ashburn-1})
> java.lang.IllegalStateException: 29 <> 16 
> 

[jira] [Updated] (LUCENE-10551) LowercaseAsciiCompression should return false when it's unable to compress

2022-04-30 Thread Michael McCandless (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-10551:

Attachment: LUCENE-10551-test.patch
Status: Open  (was: Open)

Hmm I wrote a simple test case for each of the reported strings here but they 
do not fail.  Maybe this test is invoking the API slightly differently than 
{{blocktree}}'s suffix compression?

> LowercaseAsciiCompression should return false when it's unable to compress
> --
>
> Key: LUCENE-10551
> URL: https://issues.apache.org/jira/browse/LUCENE-10551
> Project: Lucene - Core
>  Issue Type: Bug
> Environment: Lucene version 8.11.1
>Reporter: Peixin Li
>Priority: Major
> Attachments: LUCENE-10551-test.patch
>
>
> {code:java}
>  Failed to commit..
> java.lang.IllegalStateException: 10 <> 5 
> cion1cion_desarrollociones_oraclecionesnaturacionesnatura2tedppsa-integrationdemotiontion
>  cloud gen2tion instance - dev1tion instance - 
> testtion-devbtion-instancetion-prdtion-promerication-qation064533tion535217tion697401tion761348tion892818tion_matrationcauto_simmonsintgic_testtioncloudprodictioncloudservicetiongateway10tioninstance-jtsundatamartprd??o
>         at 
> org.apache.lucene.util.compress.LowercaseAsciiCompression.compress(LowercaseAsciiCompression.java:115)
>         at 
> org.apache.lucene.codecs.blocktree.BlockTreeTermsWriter$TermsWriter.writeBlock(BlockTreeTermsWriter.java:834)
>         at 
> org.apache.lucene.codecs.blocktree.BlockTreeTermsWriter$TermsWriter.writeBlocks(BlockTreeTermsWriter.java:628)
>         at 
> org.apache.lucene.codecs.blocktree.BlockTreeTermsWriter$TermsWriter.pushTerm(BlockTreeTermsWriter.java:947)
>         at 
> org.apache.lucene.codecs.blocktree.BlockTreeTermsWriter$TermsWriter.write(BlockTreeTermsWriter.java:912)
>         at 
> org.apache.lucene.codecs.blocktree.BlockTreeTermsWriter.write(BlockTreeTermsWriter.java:318)
>         at 
> org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsWriter.write(PerFieldPostingsFormat.java:170)
>         at 
> org.apache.lucene.index.FreqProxTermsWriter.flush(FreqProxTermsWriter.java:120)
>         at 
> org.apache.lucene.index.DefaultIndexingChain.flush(DefaultIndexingChain.java:267)
>         at 
> org.apache.lucene.index.DocumentsWriterPerThread.flush(DocumentsWriterPerThread.java:350)
>         at 
> org.apache.lucene.index.DocumentsWriter.doFlush(DocumentsWriter.java:476)
>         at 
> org.apache.lucene.index.DocumentsWriter.flushAllThreads(DocumentsWriter.java:656)
>         at 
> org.apache.lucene.index.IndexWriter.prepareCommitInternal(IndexWriter.java:3364)
>         at 
> org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:3770)
>         at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:3728)
>        {code}
> {code:java}
> key=och-live--WorkResource.renewAssignmentToken.ResourceTime[namespace=workflow,
>  resourceGroup=workflow-service-overlay]{availabilityDomain=iad-ad-1, 
> domainId=och-live, host=workflow-service-overlay-01341.node.ad1.us-ashburn-1})
> java.lang.IllegalStateException: 29 <> 16 
> 

[jira] [Commented] (LUCENE-10551) LowercaseAsciiCompression should return false when it's unable to compress

2022-04-30 Thread Michael McCandless (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17530419#comment-17530419
 ] 

Michael McCandless commented on LUCENE-10551:
---------------------------------------------

Yeah this is definitely no good.  The exception includes the exact term 
{{LowercaseAsciiCompression}} was trying to compress – I'll see if that exact 
string repros the exception.

Thanks for reporting [~irislpx]!

> LowercaseAsciiCompression should return false when it's unable to compress
> --
>
> Key: LUCENE-10551
> URL: https://issues.apache.org/jira/browse/LUCENE-10551
> Project: Lucene - Core
>  Issue Type: Bug
> Environment: Lucene version 8.11.1
>Reporter: Peixin Li
>Priority: Major
>
> {code:java}
>  Failed to commit..
> java.lang.IllegalStateException: 10 <> 5 
> cion1cion_desarrollociones_oraclecionesnaturacionesnatura2tedppsa-integrationdemotiontion
>  cloud gen2tion instance - dev1tion instance - 
> testtion-devbtion-instancetion-prdtion-promerication-qation064533tion535217tion697401tion761348tion892818tion_matrationcauto_simmonsintgic_testtioncloudprodictioncloudservicetiongateway10tioninstance-jtsundatamartprd??o
>         at 
> org.apache.lucene.util.compress.LowercaseAsciiCompression.compress(LowercaseAsciiCompression.java:115)
>         at 
> org.apache.lucene.codecs.blocktree.BlockTreeTermsWriter$TermsWriter.writeBlock(BlockTreeTermsWriter.java:834)
>         at 
> org.apache.lucene.codecs.blocktree.BlockTreeTermsWriter$TermsWriter.writeBlocks(BlockTreeTermsWriter.java:628)
>         at 
> org.apache.lucene.codecs.blocktree.BlockTreeTermsWriter$TermsWriter.pushTerm(BlockTreeTermsWriter.java:947)
>         at 
> org.apache.lucene.codecs.blocktree.BlockTreeTermsWriter$TermsWriter.write(BlockTreeTermsWriter.java:912)
>         at 
> org.apache.lucene.codecs.blocktree.BlockTreeTermsWriter.write(BlockTreeTermsWriter.java:318)
>         at 
> org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsWriter.write(PerFieldPostingsFormat.java:170)
>         at 
> org.apache.lucene.index.FreqProxTermsWriter.flush(FreqProxTermsWriter.java:120)
>         at 
> org.apache.lucene.index.DefaultIndexingChain.flush(DefaultIndexingChain.java:267)
>         at 
> org.apache.lucene.index.DocumentsWriterPerThread.flush(DocumentsWriterPerThread.java:350)
>         at 
> org.apache.lucene.index.DocumentsWriter.doFlush(DocumentsWriter.java:476)
>         at 
> org.apache.lucene.index.DocumentsWriter.flushAllThreads(DocumentsWriter.java:656)
>         at 
> org.apache.lucene.index.IndexWriter.prepareCommitInternal(IndexWriter.java:3364)
>         at 
> org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:3770)
>         at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:3728)
>        {code}
> {code:java}
> key=och-live--WorkResource.renewAssignmentToken.ResourceTime[namespace=workflow,
>  resourceGroup=workflow-service-overlay]{availabilityDomain=iad-ad-1, 
> domainId=och-live, host=workflow-service-overlay-01341.node.ad1.us-ashburn-1})
> java.lang.IllegalStateException: 29 <> 16 
> 

[jira] [Commented] (LUCENE-10543) Achieve contribution workflow perfection (with progress)

2022-04-28 Thread Michael McCandless (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17529452#comment-17529452
 ] 

Michael McCandless commented on LUCENE-10543:
---------------------------------------------

Also, I love this new "Achieve XYZ perfection (with progress)" template ;)

> Achieve contribution workflow perfection (with progress)
> 
>
> Key: LUCENE-10543
> URL: https://issues.apache.org/jira/browse/LUCENE-10543
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Robert Muir
>Priority: Major
>
> Inspired by Dawid's build issue which has worked out for us: LUCENE-9871
> He hasn't even linked 10% of the issues/subtasks involved in that work 
> either, but we know.
> I think we need a similar approach for the contribution workflow. There have 
> been some major improvements recently; a couple that come to mind:
> * Tomoko made a CONTRIBUTING.md file which github recognizes and is way 
> better than the wiki stuff
> * Some hazards/error messages/mazes in the build process and so on have 
> gotten fixed.
> But there is more to do in my opinion; here are three ideas:
> * Creating a PR still has a massive checklist template. But now this template 
> links to CONTRIBUTING.md, so why include the other stuff/checklist? Isn't it 
> enough to just link to CONTRIBUTING.md and fix that as needed?
> * Creating a PR still requires signing up for Apache JIRA and creating a JIRA 
> issue. There is zero value to this additional process. We often end up with 
> either JIRAs and/or PRs that have zero content, or maybe conflicting/outdated 
> content. This is just an unnecessary dance, can we use github issues instead?
> * Haven't dug into the github actions or configs very deeply. Maybe there's 
> simple stuff we can do such as give useful notifications if checks fail. Try 
> to guide the user to run ./gradlew check and fix it. It sucks to have to 
> review, look at logs, and manually add comments to do this stuff.
> So let's have an issue to improve this area.






[jira] [Commented] (LUCENE-10543) Achieve contribution workflow perfection (with progress)

2022-04-28 Thread Michael McCandless (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17529437#comment-17529437
 ] 

Michael McCandless commented on LUCENE-10543:
---------------------------------------------

+1 to work out a migration plan to switch to GitHub issues.  Can we preserve 
our whole history here?

The most compelling reason to me is that our Jira instance still does not 
(cannot?) support Markdown.  Maybe all these comments where we all 
optimistically tried to use Markdown will then render correctly on migration to 
GitHub issues!!

I even see attempted Markdown in our CHANGES.txt, but does our 
`changes2html.pl` support rendering/translating MD to HTML?

Hmm then I will have to figure out how to migrate 
[https://jirasearch.mikemccandless.com|https://jirasearch.mikemccandless.com/] 
near-real-time indexing onto GitHub issues too!

> Achieve contribution workflow perfection (with progress)
> 
>
> Key: LUCENE-10543
> URL: https://issues.apache.org/jira/browse/LUCENE-10543
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Robert Muir
>Priority: Major
>
> Inspired by Dawid's build issue which has worked out for us: LUCENE-9871
> He hasn't even linked 10% of the issues/subtasks involved in that work 
> either, but we know.
> I think we need a similar approach for the contribution workflow. There have 
> been some major improvements recently; a couple that come to mind:
> * Tomoko made a CONTRIBUTING.md file which github recognizes and is way 
> better than the wiki stuff
> * Some hazards/error messages/mazes in the build process and so on have 
> gotten fixed.
> But there is more to do in my opinion; here are three ideas:
> * Creating a PR still has a massive checklist template. But now this template 
> links to CONTRIBUTING.md, so why include the other stuff/checklist? Isn't it 
> enough to just link to CONTRIBUTING.md and fix that as needed?
> * Creating a PR still requires signing up for Apache JIRA and creating a JIRA 
> issue. There is zero value to this additional process. We often end up with 
> either JIRAs and/or PRs that have zero content, or maybe conflicting/outdated 
> content. This is just an unnecessary dance; can we use GitHub issues instead?
> * Haven't dug into the github actions or configs very deeply. Maybe there's 
> simple stuff we can do such as give useful notifications if checks fail. Try 
> to guide the user to run ./gradlew check and fix it. It sucks to have to 
> review, look at logs, and manually add comments to do this stuff.
> So let's have an issue to improve this area.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10541) What to do about massive terms in our Wikipedia EN LineFileDocs?

2022-04-28 Thread Michael McCandless (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17529416#comment-17529416
 ] 

Michael McCandless commented on LUCENE-10541:
---------------------------------------------

{quote}enwiki lines contains 2 million lines. It'd be nice to calculate the 
probability of any of the k faulty (long-term) lines being drawn in n tries and 
distribute it over time - this would address Mike's question about why it took 
so long to discover them. :)
{quote}
LOL this is indeed fun to work out.

There are a couple wrinkles to modeling this though :)

First, it's not really "randomly picking N lines for each test run", it's 
seeking to one spot and then reading N sequential lines from there.  Assuming 
the file is well shuffled (I think it is), this is maybe not changing the 
result over picking N random lines, since those N sequential lines were already 
randomized.
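Ignoring the wrinkles for a moment, a first-order model treats each line read as an independent uniform draw. A rough sketch of that model (the function and example numbers are made up purely for illustration; the real file has ~2 million lines and an unknown handful of faulty ones):

```python
def p_hit(total_lines, faulty_lines, lines_per_run, runs):
    # Probability that at least one of `runs` test runs, each reading
    # `lines_per_run` lines from a well-shuffled file, draws at least
    # one of the `faulty_lines` "Darth Term" lines.  Simplification:
    # every line read is treated as an independent uniform draw.
    p_miss_one_run = (1 - faulty_lines / total_lines) ** lines_per_run
    return 1 - p_miss_one_run ** runs

# e.g. 5 hypothetical faulty lines out of 2M, 1000 lines per run, 10k runs:
print(p_hit(2_000_000, 5, 1000, 10_000))
```

This is only the naive model; the two wrinkles discussed here (sequential reads and length-biased seeking) would pull the real probability away from it.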

Second, the way the seeking works is to pick a random spot (byte location), 
seek there, scan to the end of that line, and start reading from the following 
line forwards.  Many of the lines are very short, but some of them are longer, 
and even fewer of them are truly massive (and might have an evil Darth Term in 
there).  One wrinkle here is that if you seek into the middle of one of the 
Darth Terms, you'll then seek to end of line and skip that large term entirely. 
 Given that these massive lines take more bytes it seems more likely the 
seeking will then skip the Darth Term lines?
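That seek-then-skip selection can be sketched as follows (a simplified stand-in with hypothetical names, not the real LineFileDocs logic). Note the length bias: a random byte offset is more likely to land inside a long line, whose remainder is then discarded:

```python
import os
import random

def read_lines_after_random_seek(path, n, rng):
    # Pick a random byte offset, discard the (possibly partial) line we
    # landed in, then read n complete lines from there.  Because long
    # lines span more bytes, the seek lands *inside* them more often,
    # and the scan-to-end-of-line then skips them entirely.
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        f.seek(rng.randrange(size))
        f.readline()                      # skip the partial line
        return [f.readline() for _ in range(n)]
```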

Finally, there is one more crazy wrinkle – the nightly LineFileDocs is no 
longer a simple text file – it also has a pre-chunked "index" so test 
randomization can jump to one of the pre-computed known skip points.  Maybe that 
chunking introduced some sort of bias?

Fun to think about the Darth Terms!!

> What to do about massive terms in our Wikipedia EN LineFileDocs?
> 
>
> Key: LUCENE-10541
> URL: https://issues.apache.org/jira/browse/LUCENE-10541
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Michael McCandless
>Priority: Major
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> Spinoff from this fun build failure that [~dweiss] root caused: 
> [https://lucene.markmail.org/thread/pculfuazll4oebra]
> Thank you and sorry [~dweiss]!!
> This test failure happened because the test case randomly indexed a chunk of 
> the nightly (many GBs) LineFileDocs Wikipedia file that had a massive (> IW's 
> ~32 KB limit) term, and IW threw an {{IllegalArgumentException}} failing the 
> test.
> It's crazy that it took so long for Lucene's randomized tests to discover 
> this too-massive term in Lucene's nightly benchmarks.  It's like searching 
> for Nessie, or 
> [SETI|https://en.wikipedia.org/wiki/Search_for_extraterrestrial_intelligence].
> We need to prevent such false failures, somehow, and there are multiple 
> options: fix this test to not use {{LineFileDocs}}, remove all "massive" 
> terms from all tests (nightly and git) {{LineFileDocs}}, fix 
> {{MockTokenizer}} to trim such ridiculous terms (I think this is the best 
> option?), ...






[jira] [Commented] (LUCENE-10541) What to do about massive terms in our Wikipedia EN LineFileDocs?

2022-04-27 Thread Michael McCandless (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17528913#comment-17528913
 ] 

Michael McCandless commented on LUCENE-10541:
---------------------------------------------

{quote}Let's fix the default. I know the real analyzers default to something 
like 255.
{quote}
+1

> What to do about massive terms in our Wikipedia EN LineFileDocs?
> 
>
> Key: LUCENE-10541
> URL: https://issues.apache.org/jira/browse/LUCENE-10541
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Michael McCandless
>Priority: Major
>
> Spinoff from this fun build failure that [~dweiss] root caused: 
> [https://lucene.markmail.org/thread/pculfuazll4oebra]
> Thank you and sorry [~dweiss]!!
> This test failure happened because the test case randomly indexed a chunk of 
> the nightly (many GBs) LineFileDocs Wikipedia file that had a massive (> IW's 
> ~32 KB limit) term, and IW threw an {{IllegalArgumentException}} failing the 
> test.
> It's crazy that it took so long for Lucene's randomized tests to discover 
> this too-massive term in Lucene's nightly benchmarks.  It's like searching 
> for Nessie, or 
> [SETI|https://en.wikipedia.org/wiki/Search_for_extraterrestrial_intelligence].
> We need to prevent such false failures, somehow, and there are multiple 
> options: fix this test to not use {{LineFileDocs}}, remove all "massive" 
> terms from all tests (nightly and git) {{LineFileDocs}}, fix 
> {{MockTokenizer}} to trim such ridiculous terms (I think this is the best 
> option?), ...






[jira] [Created] (LUCENE-10541) What to do about massive terms in our Wikipedia EN LineFileDocs?

2022-04-27 Thread Michael McCandless (Jira)
Michael McCandless created LUCENE-10541:
---

 Summary: What to do about massive terms in our Wikipedia EN 
LineFileDocs?
 Key: LUCENE-10541
 URL: https://issues.apache.org/jira/browse/LUCENE-10541
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Michael McCandless


Spinoff from this fun build failure that [~dweiss] root caused: 
[https://lucene.markmail.org/thread/pculfuazll4oebra]

Thank you and sorry [~dweiss]!!

This test failure happened because the test case randomly indexed a chunk of 
the nightly (many GBs) LineFileDocs Wikipedia file that had a massive (> IW's 
~32 KB limit) term, and IW threw an {{IllegalArgumentException}} failing the 
test.

It's crazy that it took so long for Lucene's randomized tests to discover this 
too-massive term in Lucene's nightly benchmarks.  It's like searching for 
Nessie, or 
[SETI|https://en.wikipedia.org/wiki/Search_for_extraterrestrial_intelligence].

We need to prevent such false failures, somehow, and there are multiple 
options: fix this test to not use {{LineFileDocs}}, remove all "massive" 
terms from all tests (nightly and git) {{LineFileDocs}}, fix 
{{MockTokenizer}} to trim such ridiculous terms (I think this is the best 
option?), ...






[jira] [Commented] (LUCENE-1383) Work around ThreadLocal's "leak"

2022-04-22 Thread Michael McCandless (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-1383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17526499#comment-17526499
 ] 

Michael McCandless commented on LUCENE-1383:


> Even if the issue is closed, for those who want to know why ThreadLocal had 
> to be fixed : [http://www.javaspecialists.eu/archive/Issue164.html]

Thanks [~adrian.tarau] – this was a very interesting read (even 12 years too 
late)!!!  I wonder if modern JDKs have improved this situation?

> Work around ThreadLocal's "leak"
> 
>
> Key: LUCENE-1383
> URL: https://issues.apache.org/jira/browse/LUCENE-1383
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/index
>Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>Priority: Major
> Fix For: 2.4
>
> Attachments: LUCENE-1383.patch, ScreenHunter_01 Sep. 13 08.40.jpg, 
> ScreenHunter_02 Sep. 13 08.42.jpg, ScreenHunter_03 Sep. 13 08.43.jpg, 
> ScreenHunter_07 Sep. 13 19.13.jpg
>
>
> Java's ThreadLocal is dangerous to use because it can take a
> surprisingly long time to release references to the values you
> store in it.  Even when a ThreadLocal instance itself is GC'd, hard
> references to the values you had stored in it are easily kept for
> quite some time later.
> While this is not technically a "memory leak", because eventually
> (when the underlying Map that stores the values cleans up its "stale"
> references) the hard reference will be cleared, and GC can proceed,
> its end behavior is not different from a memory leak in that under the
> right situation you can easily tie up far more memory than you'd
> expect, and then hit an unexpected OOM error despite allocating an
> extremely large heap to your JVM.
> Lucene users have hit this many times.  Here's the most recent thread:
>   
> http://mail-archives.apache.org/mod_mbox/lucene-java-dev/200809.mbox/%3C6e3ae6310809091157j7a9fe46bxcc31f6e63305fcdc%40mail.gmail.com%3E
> And here's another:
>   
> http://mail-archives.apache.org/mod_mbox/lucene-java-dev/200807.mbox/%3CF5FC94B2-E5C7-40C0-8B73-E12245B91CEE%40mikemccandless.com%3E
> And then there's LUCENE-436 and LUCENE-529 at least.
> A google search for "ThreadLocal leak" yields many compelling hits.
> Sun does this for performance reasons, but I think it's a terrible
> trap and we should work around it with Lucene.






[jira] [Commented] (LUCENE-10517) Improve performance of SortedSetDV faceting by iterating on class types

2022-04-19 Thread Michael McCandless (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17524262#comment-17524262
 ] 

Michael McCandless commented on LUCENE-10517:
---------------------------------------------

This is a very impressive performance jump for the "pure browse" faceting case! 
 I'll review the PR soon.   Thanks [~ChrisHegarty]!

> Improve performance of SortedSetDV faceting by iterating on class types
> ---
>
> Key: LUCENE-10517
> URL: https://issues.apache.org/jira/browse/LUCENE-10517
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Affects Versions: 9.1
>Reporter: Chris Hegarty
>Priority: Minor
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> While analysing various profiles, [@grcevski|https://github.com/grcevski] and 
> I came across this potential improvement.
> SortedSetDV faceting (and friends) can improve performance within tight 
> loops by using invokevirtual (rather than invokeinterface). The C2 JIT 
> compiler can produce slightly more optimal code in this case, and since these 
> loops are very hot, the impact can be significant (on the order of 10-30%).
> This issue is in some ways similar to, and builds upon, prior optimisations 
> in this area, like say LUCENE-5300 or more recently LUCENE-5309






[jira] [Commented] (LUCENE-10521) Tests in windows are failing for the new testAlwaysRefreshDirectoryTaxonomyReader test

2022-04-19 Thread Michael McCandless (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17524258#comment-17524258
 ] 

Michael McCandless commented on LUCENE-10521:
---------------------------------------------

Phew!  This time I put the {{@Ignore}} in the right place, I think :)  I 
confirmed now when I run that one test case, it indeed says skipped.  Hopefully 
[~uschindler]'s awesome Windows Jenkins builds are OK again.  Sorry for all the 
flailing :)

> Tests in windows are failing for the new 
> testAlwaysRefreshDirectoryTaxonomyReader test
> --
>
> Key: LUCENE-10521
> URL: https://issues.apache.org/jira/browse/LUCENE-10521
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/facet
> Environment: Windows 10
>Reporter: Gautam Worah
>Priority: Minor
>
> Build: [https://jenkins.thetaphi.de/job/Lucene-main-Windows/10725/] is 
> failing.
>  
> Specifically, the loop which checks if any files still remain to be deleted 
> is not ending.
> We have added an exception to the main test class to not run the test on 
> WindowsFS (not sure if this is related).
>  
> ```
> SEVERE: 1 thread leaked from SUITE scope at 
> org.apache.lucene.facet.taxonomy.directory.TestAlwaysRefreshDirectoryTaxonomyReader:
>  1) Thread[id=19, 
> name=TEST-TestAlwaysRefreshDirectoryTaxonomyReader.testAlwaysRefreshDirectoryTaxonomyReader-seed#[F46E42CB7F2B6959],
>  state=RUNNABLE, group=TGRP-TestAlwaysRefreshDirectoryTaxonomyReader] at 
> java.base@18/sun.nio.fs.WindowsNativeDispatcher.GetFileAttributesEx0(Native 
> Method) at 
> java.base@18/sun.nio.fs.WindowsNativeDispatcher.GetFileAttributesEx(WindowsNativeDispatcher.java:390)
>  at 
> java.base@18/sun.nio.fs.WindowsFileAttributes.get(WindowsFileAttributes.java:307)
>  at 
> java.base@18/sun.nio.fs.WindowsFileSystemProvider.implDelete(WindowsFileSystemProvider.java:251)
>  at 
> java.base@18/sun.nio.fs.AbstractFileSystemProvider.delete(AbstractFileSystemProvider.java:105)
>  at 
> app/org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.mockfile.FilterFileSystemProvider.delete(FilterFileSystemProvider.java:130)
>  at 
> app/org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.mockfile.FilterFileSystemProvider.delete(FilterFileSystemProvider.java:130)
>  at 
> app/org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.mockfile.FilterFileSystemProvider.delete(FilterFileSystemProvider.java:130)
>  at 
> app/org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.mockfile.FilterFileSystemProvider.delete(FilterFileSystemProvider.java:130)
>  at 
> app/org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.mockfile.FilterFileSystemProvider.delete(FilterFileSystemProvider.java:130)
>  at java.base@18/java.nio.file.Files.delete(Files.java:1152) at 
> app/org.apache.lucene.core@10.0.0-SNAPSHOT/org.apache.lucene.store.FSDirectory.privateDeleteFile(FSDirectory.java:344)
>  at 
> app/org.apache.lucene.core@10.0.0-SNAPSHOT/org.apache.lucene.store.FSDirectory.deletePendingFiles(FSDirectory.java:325)
>  at 
> app/org.apache.lucene.core@10.0.0-SNAPSHOT/org.apache.lucene.store.FSDirectory.getPendingDeletions(FSDirectory.java:410)
>  at 
> app/org.apache.lucene.core@10.0.0-SNAPSHOT/org.apache.lucene.store.FilterDirectory.getPendingDeletions(FilterDirectory.java:121)
>  at 
> app//org.apache.lucene.facet.taxonomy.directory.TestAlwaysRefreshDirectoryTaxonomyReader.testAlwaysRefreshDirectoryTaxonomyReader(TestAlwaysRefreshDirectoryTaxonomyReader.java:97)
> ```






[jira] [Commented] (LUCENE-10482) Allow users to create their own DirectoryTaxonomyReaders with empty taxoArrays instead of letting the taxoEpoch decide

2022-04-17 Thread Michael McCandless (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17523347#comment-17523347
 ] 

Michael McCandless commented on LUCENE-10482:
---------------------------------------------

Thanks [~gworah] -- can we resolve this now?

> Allow users to create their own DirectoryTaxonomyReaders with empty 
> taxoArrays instead of letting the taxoEpoch decide
> --
>
> Key: LUCENE-10482
> URL: https://issues.apache.org/jira/browse/LUCENE-10482
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/facet
>Affects Versions: 9.1
>Reporter: Gautam Worah
>Priority: Minor
>  Time Spent: 6h 20m
>  Remaining Estimate: 0h
>
> I was experimenting with the taxonomy index and {{DirectoryTaxonomyReaders}} 
> in my day job where we were trying to replace the index underneath a reader 
> asynchronously and then call {{doOpenIfChanged}} on it.
> It turns out that the taxonomy index uses its own index-based counter (the 
> {{taxonomyIndexEpoch}}) to determine if the index was opened in write 
> mode after the last time it was written and if not, it directly tries to 
> reuse the previous {{taxoArrays}} it had created. This logic fails in a 
> scenario where both the old and new index were opened just once but the index 
> itself is completely different in both cases.
> In such a case, it would be good to give the user the flexibility to inform 
> the DTR to recreate its {{taxoArrays}}, {{ordinalCache}} and 
> {{categoryCache}} (not refreshing these arrays causes it to fail in 
> various ways). Luckily, such a constructor already exists! But it is private 
> today! The idea here is to allow subclasses of DTR to use this constructor.
> Curious to see what other folks think about this idea. 






[jira] [Commented] (LUCENE-10421) Non-deterministic results from KnnVectorQuery?

2022-02-16 Thread Michael McCandless (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17493306#comment-17493306
 ] 

Michael McCandless commented on LUCENE-10421:
---------------------------------------------

+1 for a constant.  42 seems good?
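With a constant seed, the per-node level assignment (and hence the whole graph) becomes reproducible run to run. A rough Python model of the idea (illustrative only, with assumed parameter values; this is not Lucene's actual HnswGraphBuilder code):

```python
import math
import random

ML = 1 / math.log(16)  # level-normalization factor, assuming M=16 neighbors

def assign_level(rng):
    # HNSW-style level assignment: draw from a geometric-like
    # distribution.  With a constant seed the sequence is deterministic.
    return int(-math.log(rng.random()) * ML)

def levels(seed, n):
    # Same seed -> same level sequence -> same graph for the same input,
    # unlike seeding from System.currentTimeMillis().
    rng = random.Random(seed)
    return [assign_level(rng) for _ in range(n)]
```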

> Non-deterministic results from KnnVectorQuery?
> --
>
> Key: LUCENE-10421
> URL: https://issues.apache.org/jira/browse/LUCENE-10421
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Michael McCandless
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> [Nightly benchmarks|https://home.apache.org/~mikemccand/lucenebench/] have 
> been upset for the past ~1.5 weeks because it looks like {{KnnVectorQuery}} 
> is giving slightly different results on every run, even on an identical 
> (deterministically constructed – single thread indexing, flush by doc count, 
> {{SerialMergeSchedule}}, {{LogDocCountMergePolicy}}, etc.) index each 
> night.  It produces failures like this, which then abort the benchmark to 
> help us catch any recent accidental bug that alters our precise top N search 
> hits and scores:
> {noformat}
>  Traceback (most recent call last):
>  File “/l/util.nightly/src/python/nightlyBench.py”, line 2177, in 
>   run()
>  File “/l/util.nightly/src/python/nightlyBench.py”, line 1225, in run
>   raise RuntimeError(‘search result differences: %s’ % str(errors))
> RuntimeError: search result differences: 
> [“query=KnnVectorQuery:vector[-0.07267512,...][10] filter=None sort=None 
> groupField=None hitCount=10: hit 4 has wrong field/score value ([20844660], 
> ‘0.92060816’) vs ([254438\
> 06], ‘0.920046’)“, “query=KnnVectorQuery:vector[-0.12073054,...][10] 
> filter=None sort=None groupField=None hitCount=10: hit 7 has wrong 
> field/score value ([25501982], ‘0.99630797’) vs ([13688085], ‘0.9961489’)“, 
> “qu\
> ery=KnnVectorQuery:vector[0.02227773,...][10] filter=None sort=None 
> groupField=None hitCount=10: hit 0 has wrong field/score value ([4741915], 
> ‘0.9481132’) vs ([14220828], ‘0.9579846’)“, “query=KnnVectorQuery:vector\
> [0.024077624,...][10] filter=None sort=None groupField=None hitCount=10: hit 
> 0 has wrong field/score value ([7472373], ‘0.8460249’) vs ([12577825], 
> ‘0.8378446’)“]{noformat}
> At first I thought this might be expected because of the recent (awesome!!) 
> improvements to HNSW, so I tried to simply "regold".  But the regold did not 
> "take", so it indeed looks like there is some non-determinism here.
> I pinged [~msoko...@gmail.com] and he found this random seeding that is most 
> likely the cause?
> {noformat}
> public final class HnswGraphBuilder {
>   /** Default random seed for level generation * */
>   private static final long DEFAULT_RAND_SEED = System.currentTimeMillis(); 
> {noformat}
> Can we somehow make this deterministic instead?  Or maybe the nightly 
> benchmarks could somehow pass something in to make results deterministic for 
> benchmarking?  Or ... we could also relax the benchmarks to accept 
> non-determinism for {{KnnVectorQuery}} task?






[jira] [Commented] (LUCENE-10391) Reuse data structures across HnswGraph invocations

2022-02-16 Thread Michael McCandless (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17493305#comment-17493305
 ] 

Michael McCandless commented on LUCENE-10391:
---------------------------------------------

Sorry for the nightly benchmarks down-time!  I think I [pushed a fix just 
now|https://github.com/mikemccand/luceneutil/commit/36eec79e5ea3cb336c38d53bd4ea35bd6847b4c5]
 that should get them running again ... cross fingers!

> Reuse data structures across HnswGraph invocations
> --
>
> Key: LUCENE-10391
> URL: https://issues.apache.org/jira/browse/LUCENE-10391
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Adrien Grand
>Assignee: Julie Tibshirani
>Priority: Minor
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> Creating HNSW graphs involves doing many repeated calls to HnswGraph#search. 
> Profiles from nightly benchmarks suggest that allocating data-structures 
> incurs both lots of heap allocations 
> ([http://people.apache.org/~mikemccand/lucenebench/2022.01.23.18.03.17.html#profiler_1kb_indexing_vectors_4_heap)]
>  and CPU usage 
> ([http://people.apache.org/~mikemccand/lucenebench/2022.01.23.18.03.17.html#profiler_1kb_indexing_vectors_4_cpu).]
>  It looks like reusing data structures across invocations would be a 
> low-hanging fruit that could help save significant CPU?
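The reuse idea can be sketched with a toy traversal (a hypothetical stand-in for HnswGraph#search, not Lucene's code): scratch structures are allocated once and merely cleared between invocations, so repeated calls produce no fresh garbage.

```python
class SearchScratch:
    # Scratch state allocated once and reused across repeated graph
    # searches, instead of fresh allocations on every invocation.
    def __init__(self):
        self.visited = set()
        self.frontier = []

    def reset(self):
        # Clearing is cheaper than reallocating and leaves nothing
        # for the GC to collect per call.
        self.visited.clear()
        self.frontier.clear()

def reachable(adj, start, scratch):
    # Toy graph traversal; the point is only that `scratch` carries no
    # state between calls once reset() has run.
    scratch.reset()
    scratch.frontier.append(start)
    scratch.visited.add(start)
    while scratch.frontier:
        node = scratch.frontier.pop()
        for nbr in adj.get(node, ()):
            if nbr not in scratch.visited:
                scratch.visited.add(nbr)
                scratch.frontier.append(nbr)
    return set(scratch.visited)

scratch = SearchScratch()  # allocated once, used for many searches
```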






[jira] [Created] (LUCENE-10421) Non-deterministic results from KnnVectorQuery?

2022-02-14 Thread Michael McCandless (Jira)
Michael McCandless created LUCENE-10421:
---

 Summary: Non-deterministic results from KnnVectorQuery?
 Key: LUCENE-10421
 URL: https://issues.apache.org/jira/browse/LUCENE-10421
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Michael McCandless


[Nightly benchmarks|https://home.apache.org/~mikemccand/lucenebench/] have been 
upset for the past ~1.5 weeks because it looks like {{KnnVectorQuery}} is 
giving slightly different results on every run, even on an identical 
(deterministically constructed – single thread indexing, flush by doc count, 
{{SerialMergeSchedule}}, {{LogDocCountMergePolicy}}, etc.) index each 
night.  It produces failures like this, which then abort the benchmark to help 
us catch any recent accidental bug that alters our precise top N search hits 
and scores:
{noformat}
 Traceback (most recent call last):
 File “/l/util.nightly/src/python/nightlyBench.py”, line 2177, in 
  run()
 File “/l/util.nightly/src/python/nightlyBench.py”, line 1225, in run
  raise RuntimeError(‘search result differences: %s’ % str(errors))
RuntimeError: search result differences: 
[“query=KnnVectorQuery:vector[-0.07267512,...][10] filter=None sort=None 
groupField=None hitCount=10: hit 4 has wrong field/score value ([20844660], 
‘0.92060816’) vs ([254438\
06], ‘0.920046’)“, “query=KnnVectorQuery:vector[-0.12073054,...][10] 
filter=None sort=None groupField=None hitCount=10: hit 7 has wrong field/score 
value ([25501982], ‘0.99630797’) vs ([13688085], ‘0.9961489’)“, “qu\
ery=KnnVectorQuery:vector[0.02227773,...][10] filter=None sort=None 
groupField=None hitCount=10: hit 0 has wrong field/score value ([4741915], 
‘0.9481132’) vs ([14220828], ‘0.9579846’)“, “query=KnnVectorQuery:vector\
[0.024077624,...][10] filter=None sort=None groupField=None hitCount=10: hit 0 
has wrong field/score value ([7472373], ‘0.8460249’) vs ([12577825], 
‘0.8378446’)“]{noformat}
At first I thought this might be expected because of the recent (awesome!!) 
improvements to HNSW, so I tried to simply "regold".  But the regold did not 
"take", so it indeed looks like there is some non-determinism here.

I pinged [~msoko...@gmail.com] and he found this random seeding that is most 
likely the cause?
{noformat}
public final class HnswGraphBuilder {

  /** Default random seed for level generation * */
  private static final long DEFAULT_RAND_SEED = System.currentTimeMillis(); 
{noformat}

Can we somehow make this deterministic instead?  Or maybe the nightly 
benchmarks could somehow pass something in to make results deterministic for 
benchmarking?  Or ... we could also relax the benchmarks to accept 
non-determinism for {{KnnVectorQuery}} task?






[jira] [Commented] (LUCENE-10250) Add hierarchical labels to SSDV facets

2021-12-03 Thread Michael McCandless (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17453149#comment-17453149
 ] 

Michael McCandless commented on LUCENE-10250:
---------------------------------------------

{quote}So I think, separately, it would be great to think about improving 
benchmarking with a more realistic use-case to drive decisions and tuning.
{quote}
+1, this is a great idea!  I had not realized Wikipedia had this.  I'll open a 
{{luceneutil}} issue.

> Add hierarchical labels to SSDV facets
> --
>
> Key: LUCENE-10250
> URL: https://issues.apache.org/jira/browse/LUCENE-10250
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/facet
>Reporter: Marc D'Mello
>Priority: Major
>  Labels: discussion
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Hi all,
> I recently [added a new benchmarking 
> task|https://github.com/mikemccand/luceneutil/issues/141] to {{luceneutil}} 
> to count facets on a random word chosen from each document which would give 
> us a very high cardinality facet benchmarking compared to the faceting 
> benchmarks we already had. After being merged, [~mikemccand] pointed out some 
> [interesting 
> results|https://home.apache.org/~mikemccand/lucenebench/BrowseRandomLabelTaxoFacets.html]
>  in the nightly benchmarks where the {{BrowseRandomLabelSSDVFacets}} task was 
> much faster than the {{BrowseRandomLabelTaxoFacets}} task.
> I was thinking that using SSDV facets instead of taxonomy facets for our use 
> case at Amazon Product Search could potentially lead to some increases in QPS 
> and decreases in index size, but the issue is we use hierarchical labels, and 
> as I understand it, SSDV faceting only supports a 2-level hierarchy as of 
> today. This leads to my question: why is there a limitation like this on 
> SSDV facets? Are hierarchical labels just a feature that hasn't been 
> implemented in SSDV facets yet, or is there some more complex reason that we 
> can't add hierarchical labels to SSDV facets?
> Thanks!






[jira] [Commented] (LUCENE-10265) IO write throttle rate will beyond the Ceiling(1024MB/s) in the merge

2021-11-27 Thread Michael McCandless (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17449841#comment-17449841
 ] 

Michael McCandless commented on LUCENE-10265:
---------------------------------------------

{quote}it should't beyond the Ceiling(1024MB/s).
{quote}
Hmm – it's actually 10 * 1024 MB/sec (i.e. 10 GB/sec):
{noformat}
  /** Ceiling for IO write rate limit (we will never go any higher than this) */
  private static final double MAX_MERGE_MB_PER_SEC = 10240.0; {noformat}
{quote}`targetMBPerSec` is shared by many merge threads, it will be changed by 
the way:

The modification process is not a atomic operation:
{quote}
Hmm but this is inside a {{synchronized}} method ({{updateIOThrottle}}), 
right?
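For reference, the guarded update looks roughly like this. A Python sketch of the synchronized read-modify-write, with the lock standing in for Java's {{synchronized}} and assumed constant values; this is not Lucene's actual code:

```python
import threading

MIN_MERGE_MB_PER_SEC = 5.0
MAX_MERGE_MB_PER_SEC = 10240.0   # 10 GB/sec ceiling

class IOThrottle:
    def __init__(self):
        self._lock = threading.Lock()   # stands in for `synchronized`
        self.target_mb_per_sec = 20.0

    def update(self, new_backlog):
        # Holding the lock makes the multiply-then-clamp atomic, so no
        # other thread can observe a transient out-of-range value.
        with self._lock:
            if new_backlog:
                self.target_mb_per_sec = min(
                    self.target_mb_per_sec * 1.20, MAX_MERGE_MB_PER_SEC)
            else:
                self.target_mb_per_sec = max(
                    self.target_mb_per_sec / 1.10, MIN_MERGE_MB_PER_SEC)
```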

> IO write throttle rate will beyond the Ceiling(1024MB/s) in the merge
> -
>
> Key: LUCENE-10265
> URL: https://issues.apache.org/jira/browse/LUCENE-10265
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/other
>Affects Versions: 8.6.2
>Reporter: kkewwei
>Priority: Major
>
> It's known that the merge IO write throttle rate is under the control of 
> `targetMBPerSec` in ConcurrentMergeScheduler; it shouldn't go beyond the 
> ceiling (1024 MB/s).
> `targetMBPerSec` is shared by many merge threads, it will be changed by the 
> way:
> {code:java}
> if (newBacklog) {
>   // This new merge adds to the backlog: increase IO throttle by 20%
>   targetMBPerSec *= 1.20; 
>   if (targetMBPerSec > MAX_MERGE_MB_PER_SEC) {
> targetMBPerSec = MAX_MERGE_MB_PER_SEC;
>   }
>   ..
> } else {
>   // We are not falling behind: decrease IO throttle by 10%
>   targetMBPerSec /= 1.10;
>   if (targetMBPerSec < MIN_MERGE_MB_PER_SEC) {
> targetMBPerSec = MIN_MERGE_MB_PER_SEC;
>   }
>  ..
> }
> {code}
> The modification process is not an atomic operation:
> # `targetMBPerSec` is changed by the first merge thread from 1024 to 1024*1.2
> # other merge thread will read the new value(1024*1.2).
> # the first merge thread limit the value to be 1024.
> So the bad case can happen.
> In production, we do observe that the IO write throttle rate goes beyond the 
> ceiling (1024 MB/s) during merges.
> {code:java}
> [2021-11-26T15:27:19,861][TRACE][o.e.i.e.E.MS ] [data1] 
> [test1][25] elasticsearch[data1][refresh][T#5] MS: io throttle: current merge 
> backlog; leave IO rate at 3589.1 MB/sec
> [2021-11-26T15:27:20,304][TRACE][o.e.i.e.E.MS ] [data1] 
> [test1][13] elasticsearch[data1][write][T#3] MS: io throttle: current merge 
> backlog; leave IO rate at 192.4 MB/sec
> [2021-11-26T15:27:25,330][TRACE][o.e.i.e.E.MS ] [data1] 
> [test1][22] elasticsearch[data1][[test1][22]: Lucene Merge Thread #1026] MS: 
> io throttle: current merge backlog; leave IO rate at 96.3 MB/sec
> [2021-11-26T15:27:25,995][TRACE][o.e.i.e.E.MS ] [data1] 
> [test1][16] elasticsearch[data1][[test1][16]: Lucene Merge Thread #1063] MS: 
> io throttle: current merge backlog; leave IO rate at 419.2 MB/sec
> [2021-11-26T15:27:38,335][TRACE][o.e.i.e.E.MS ] [data1] 
> [test1][19] elasticsearch[data1][write][T#2] MS: io throttle: current merge 
> backlog; leave IO rate at 3051.5 MB/sec
> {code}
> We should do the following:
> 1. change it with an atomic operation;
> 2. add the `volatile` modifier to `targetMBPerSec`.






[jira] [Commented] (LUCENE-10266) Move nearest-neighbor search on points to core?

2021-11-27 Thread Michael McCandless (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17449838#comment-17449838
 ] 

Michael McCandless commented on LUCENE-10266:
-

+1

> Move nearest-neighbor search on points to core?
> ---
>
> Key: LUCENE-10266
> URL: https://issues.apache.org/jira/browse/LUCENE-10266
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Adrien Grand
>Priority: Minor
>
> Now that the Points' public API supports running nearest-neighbor 
> search, should we move it to core via helper methods on {{LatLonPoint}} and 
> {{XYPoint}}?






[jira] [Commented] (LUCENE-10232) MultiRangeQuery incorrectly matches docs that only match on a single dimension

2021-11-16 Thread Michael McCandless (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17444845#comment-17444845
 ] 

Michael McCandless commented on LUCENE-10232:
-

Egads, this is awful.  Who/what uses {{MultiRangeQuery}}?

> MultiRangeQuery incorrectly matches docs that only match on a single dimension
> --
>
> Key: LUCENE-10232
> URL: https://issues.apache.org/jira/browse/LUCENE-10232
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/sandbox
>Reporter: Greg Miller
>Assignee: Greg Miller
>Priority: Minor
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> When {{MultiRangeQuery}} iterates the multiple dimensions to see if any of 
> them contain a point/range, it incorrectly short-circuits as soon as one 
> dimension matches. It should instead confirm that all dimensions for that 
> range match.
> I'll attach a PR shortly with updated tests that show this bug along with a 
> fix.






[jira] [Commented] (LUCENE-10122) Explore using NumericDocValue to store taxonomy parent array

2021-11-15 Thread Michael McCandless (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17444091#comment-17444091
 ] 

Michael McCandless commented on LUCENE-10122:
-

OK I am convinced too – let's move forward!  It is really weird to abuse 
positions like this.

[~zhai7631] what do you think?  I think the only issue on the PR is back-compat 
from 8.x indices?

Later we can also try to eliminate the separate large heap-resident {{int[]}} 
as [~rmuir] suggested on the dev list today.

> Explore using NumericDocValue to store taxonomy parent array
> 
>
> Key: LUCENE-10122
> URL: https://issues.apache.org/jira/browse/LUCENE-10122
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/facet
>Affects Versions: main (10.0)
>Reporter: Haoyu Zhai
>Priority: Minor
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> We currently use the term position of a hardcoded term in a hardcoded field to 
> represent the parent ordinal of each taxonomy label. That is an old approach, 
> probably dating back to the time before doc values existed.
> We probably would want to use NumericDocValues instead given we have spent 
> quite a lot of effort optimizing them.






[jira] [Commented] (LUCENE-10216) Add concurrency to addIndexes(CodecReader…) API

2021-11-15 Thread Michael McCandless (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17443953#comment-17443953
 ] 

Michael McCandless commented on LUCENE-10216:
-

I like this plan (extending {{MergePolicy}} so it also has purview over how 
merging is done in {{addIndexes(CodecReader[])}}).

Reference counting might get tricky, if {{OneMerge}} or {{IndexWriter}} holding 
completed {{OneMerge}} instances try to {{decRef}} readers.

To improve testing we could create a new {{LuceneTestCase}} method to 
{{addIndexes}} from {{Directory[]}} that randomly does so with both impls and 
fix tests to sometimes use that for adding indices.

> Add concurrency to addIndexes(CodecReader…) API
> ---
>
> Key: LUCENE-10216
> URL: https://issues.apache.org/jira/browse/LUCENE-10216
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Vigya Sharma
>Priority: Major
>
> I work at Amazon Product Search, and we use Lucene to power search for the 
> e-commerce platform. I’m working on a project that involves applying 
> metadata+ETL transforms and indexing documents on n different _indexing_ 
> boxes, combining them into a single index on a separate _reducer_ box, and 
> making it available for queries on m different _search_ boxes (replicas). 
> Segments are asynchronously copied from indexers to reducers to searchers as 
> they become available for the next layer to consume.
> I am using the addIndexes API to combine multiple indexes into one on the 
> reducer boxes. Since we also have taxonomy data, we need to remap facet field 
> ordinals, which means I need to use the {{addIndexes(CodecReader…)}} version 
> of this API. The API leverages {{SegmentMerger.merge()}} to create segments 
> with new ordinal values while also merging all provided segments in the 
> process.
> _This is however a blocking call that runs in a single thread._ Until we have 
> written segments with new ordinal values, we cannot copy them to searcher 
> boxes, which increases the time to make documents available for search.
> I was playing around with the API by creating multiple concurrent merges, 
> each with only a single reader, creating a concurrently running 1:1 
> conversion from old segments to new ones (with new ordinal values). We follow 
> this up with non-blocking background merges. This lets us copy the segments 
> to searchers and replicas as soon as they are available, and later replace 
> them with merged segments as background jobs complete. On the Amazon dataset 
> I profiled, this gave us around 2.5 to 3x improvement in addIndexes() time. 
> Each call was given about 5 readers to add on average.
> This might be useful add to Lucene. We could create another {{addIndexes()}} 
> API with a {{boolean}} flag for concurrency, that internally submits multiple 
> merge jobs (each with a single reader) to the {{ConcurrentMergeScheduler}}, 
> and waits for them to complete before returning.
> While this is doable from outside Lucene by using your thread pool, starting 
> multiple addIndexes() calls and waiting for them to complete, I felt it needs 
> some understanding of what addIndexes does, why you need to wait on the merge 
> and why it makes sense to pass a single reader in the addIndexes API.
> Out-of-box support in Lucene could simplify this for folks with a similar use case.
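The proposed flow (one merge job per reader submitted to a scheduler, then waiting for all jobs before returning) can be sketched with a plain executor. {{MergedSegment}} and the reader names below are hypothetical stand-ins, not Lucene APIs:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ConcurrentAddIndexesSketch {

    // Hypothetical stand-in for the result of merging one incoming reader
    // into a freshly written segment (with remapped ordinals).
    static class MergedSegment {
        final String name;
        MergedSegment(String name) { this.name = name; }
    }

    // Submit one single-reader merge job per incoming reader, then block
    // until every job finishes before returning, as proposed above.
    static List<MergedSegment> addIndexes(List<String> readerNames, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<MergedSegment>> jobs = new ArrayList<>();
            for (String reader : readerNames) {
                jobs.add(() -> new MergedSegment("_" + reader)); // 1:1 rewrite
            }
            List<MergedSegment> out = new ArrayList<>();
            for (Future<MergedSegment> f : pool.invokeAll(jobs)) {
                out.add(f.get()); // wait for each concurrent merge to complete
            }
            return out;
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

{{invokeAll}} returns futures in task-submission order, so the resulting segments line up with the input readers; background merges can then compact the 1:1 segments later.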






[jira] [Resolved] (LUCENE-9673) The level of IntBlockPool slice is always 1

2021-10-28 Thread Michael McCandless (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless resolved LUCENE-9673.

Fix Version/s: 8.11
   main (9.0)
   Resolution: Fixed

Thanks [~mashudong]!

I pushed to main (9.0) and 8.11.

Let's watch nightly benchy tonite and see if that small RAM efficiency 
improvement for packing {{int[]}} moves the needle.

I opened LUCENE-10211 to move the slice logic out of core's {{IntBlockPool}} 
into private {{MemoryIndex}}.

> The level of IntBlockPool slice is always 1 
> 
>
> Key: LUCENE-9673
> URL: https://issues.apache.org/jira/browse/LUCENE-9673
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/other
>Reporter: mashudong
>Priority: Minor
> Fix For: main (9.0), 8.11
>
> Attachments: LUCENE-9673.patch
>
>
> First slice is allocated by IntBlockPool.newSlice(), and its level is 1,
>  
> {code:java}
> private int newSlice(final int size) {
>   if (intUpto > INT_BLOCK_SIZE - size) {
>     nextBuffer();
>     assert assertSliceBuffer(buffer);
>   }
>
>   final int upto = intUpto;
>   intUpto += size;
>   buffer[intUpto - 1] = 1;
>   return upto;
> }
> {code}
>  
>  
> If one slice is not enough, IntBlockPool.allocSlice() is called to allocate 
> more slices,
> as the following code shows, level is 1, newLevel is NEXT_LEVEL_ARRAY[0] 
> which is also 1.
>  
> The result is the level of IntBlockPool slice is always 1, the first slice is 
>  2 bytes long, and all subsequent slices are 4 bytes long.
>  
> {code:java}
> private static final int[] NEXT_LEVEL_ARRAY = {1, 2, 3, 4, 5, 6, 7, 8, 9, 9};
>
> private int allocSlice(final int[] slice, final int sliceOffset) {
>   final int level = slice[sliceOffset];
>   final int newLevel = NEXT_LEVEL_ARRAY[level - 1];
>   final int newSize = LEVEL_SIZE_ARRAY[newLevel];
>
>   // Maybe allocate another block
>   if (intUpto > INT_BLOCK_SIZE - newSize) {
>     nextBuffer();
>     assert assertSliceBuffer(buffer);
>   }
>
>   final int newUpto = intUpto;
>   final int offset = newUpto + intOffset;
>   intUpto += newSize;
>
>   // Write forwarding address at end of last slice:
>   slice[sliceOffset] = offset;
>
>   // Write new level:
>   buffer[intUpto - 1] = newLevel;
>
>   return newUpto;
> }
> {code}
>  
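The stuck level is easy to reproduce with the transition table alone. This toy sketch (plain Java, independent of Lucene) follows allocSlice()'s computation: starting from the level 1 written by newSlice(), NEXT_LEVEL_ARRAY[level - 1] maps 1 back to 1, so the slice level never advances.

```java
public class SliceLevelDemo {
    static final int[] NEXT_LEVEL_ARRAY = {1, 2, 3, 4, 5, 6, 7, 8, 9, 9};

    // Apply allocSlice()'s level transition repeatedly:
    // newLevel = NEXT_LEVEL_ARRAY[level - 1].
    static int levelAfter(int level, int allocations) {
        for (int i = 0; i < allocations; i++) {
            level = NEXT_LEVEL_ARRAY[level - 1];
        }
        return level;
    }

    public static void main(String[] args) {
        // newSlice() writes level 1; every subsequent allocSlice() computes
        // NEXT_LEVEL_ARRAY[0] == 1, so the level is stuck at 1.
        System.out.println(levelAfter(1, 10)); // prints 1
    }
}
```

With this indexing every level maps to itself, so slices of any starting level never grow to larger sizes.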



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Created] (LUCENE-10211) Move IntBlockPool's slice allocator and SliceReader/Writer out to MemoryIndex

2021-10-28 Thread Michael McCandless (Jira)
Michael McCandless created LUCENE-10211:
---

 Summary: Move IntBlockPool's slice allocator and 
SliceReader/Writer out to MemoryIndex
 Key: LUCENE-10211
 URL: https://issues.apache.org/jira/browse/LUCENE-10211
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Michael McCandless


Spinoff from exciting issue LUCENE-9673.

Lucene's {{ByteBlockPool}} and {{IntBlockPool}} are sort of like a little 
malloc implementation, embedding many logical growing {{int[]}} into a series 
of block buffers.  This is needed to compactly/efficiently store postings data, 
since there can be many unique terms, each with their own growing {{byte[]}} 
holding postings.

{{IntBlockPool}} also has these slices; however, nowhere in {{core}} do we use 
them.  Rather, the allocation needs for {{int[]}} are simpler: just allocate a 
fixed-length 1, 2 or 3 {{int[]}} per unique term.  We can greatly simplify 
{{IntBlockPool}}.

However, {{MemoryIndex}} does use these slices from {{IntBlockPool}}.  I think 
we should move the complex slice logic in {{IntBlockPool}} out to 
{{MemoryIndex}}, simplifying core?
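The simpler allocation pattern described above (a fixed-length 1, 2 or 3 ints per term, carved out of reusable block buffers) can be sketched in a few lines; this is a hypothetical toy, not the proposed Lucene change:

```java
import java.util.ArrayList;
import java.util.List;

// A toy fixed-size int-slice allocator over block buffers, illustrating the
// malloc-like embedding of many small int[] into a few large buffers.
public class SimpleIntBlockPool {
    static final int INT_BLOCK_SIZE = 8192;

    private final List<int[]> buffers = new ArrayList<>();
    private int[] buffer = new int[INT_BLOCK_SIZE];
    private int intUpto = 0;

    public SimpleIntBlockPool() {
        buffers.add(buffer);
    }

    // Allocate a fixed-length slice (1, 2 or 3 ints per unique term) and
    // return its start offset within the current buffer.
    int newFixedSlice(int size) {
        if (size < 1 || size > 3) {
            throw new IllegalArgumentException("size=" + size);
        }
        if (intUpto > INT_BLOCK_SIZE - size) {  // roll over to a fresh block
            buffer = new int[INT_BLOCK_SIZE];
            buffers.add(buffer);
            intUpto = 0;
        }
        int start = intUpto;
        intUpto += size;
        return start;
    }
}
```

No forwarding addresses or level bytes are needed because slices never grow, which is exactly the simplification available once the growable-slice logic moves out to {{MemoryIndex}}.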






[jira] [Commented] (LUCENE-9673) The level of IntBlockPool slice is always 1

2021-10-28 Thread Michael McCandless (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17435401#comment-17435401
 ] 

Michael McCandless commented on LUCENE-9673:


OK phew catching up on this issue again [~mashudong]! Sorry for the crazy long 
delay.

It turns out nothing in Lucene's {{core}} uses any of this complex growable 
{{int[]}} logic – only {{MemoryIndex}} does (today anyway).  {{core}}'s 
{{int[]}} allocation needs are simpler: just allocating 1, 2 or 3 ints per 
new term encountered during indexing (depending on whether docs, freqs and prox 
are enabled). For {{byte[]}} storage, we do still use/need the growing slices to 
account for longer and shorter vInt-encoded postings lists.

I will open a follow-on issue to promote this out of {{core}} into 
{{MemoryIndex}}.

For this issue let's just fix this sneaky {{IntBlockPool}} performance bug!

Oh and I also found this long-standing {{TODO}}:
{noformat}
   // TODO: figure out why this is 2*streamCount here. streamCount should be 
enough?{noformat}
And indeed it is over-allocating – we are wasting half of the {{int[]}} RAM we 
are allocating!  I fixed that, tests pass.  So this will be a little RAM 
efficiency improvement for {{IndexWriter}}.

Separately, I wonder if we could run a static "locally dead code detector" from 
gradle that would crawl the source graph dependencies, excluding tests?  I.e. 
this code was not technically dead, since unit tests were indeed exercising it, 
and another Lucene module was also using it, but nothing in Lucene's {{core}} 
was in fact using it.  I wish such code were automatically removed from our 
repository, or proposed to be moved out to the module that really needs it :)  
Sort of a source code garbage collector ...

> The level of IntBlockPool slice is always 1 
> 
>
> Key: LUCENE-9673
> URL: https://issues.apache.org/jira/browse/LUCENE-9673
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/other
>Reporter: mashudong
>Priority: Minor
> Attachments: LUCENE-9673.patch
>
>
> First slice is allocated by IntBlockPool.newSlice(), and its level is 1,
>  
> {code:java}
> private int newSlice(final int size) {
>   if (intUpto > INT_BLOCK_SIZE - size) {
>     nextBuffer();
>     assert assertSliceBuffer(buffer);
>   }
>
>   final int upto = intUpto;
>   intUpto += size;
>   buffer[intUpto - 1] = 1;
>   return upto;
> }
> {code}
>  
>  
> If one slice is not enough, IntBlockPool.allocSlice() is called to allocate 
> more slices,
> as the following code shows, level is 1, newLevel is NEXT_LEVEL_ARRAY[0] 
> which is also 1.
>  
> The result is the level of IntBlockPool slice is always 1, the first slice is 
>  2 bytes long, and all subsequent slices are 4 bytes long.
>  
> {code:java}
> private static final int[] NEXT_LEVEL_ARRAY = {1, 2, 3, 4, 5, 6, 7, 8, 9, 9};
>
> private int allocSlice(final int[] slice, final int sliceOffset) {
>   final int level = slice[sliceOffset];
>   final int newLevel = NEXT_LEVEL_ARRAY[level - 1];
>   final int newSize = LEVEL_SIZE_ARRAY[newLevel];
>
>   // Maybe allocate another block
>   if (intUpto > INT_BLOCK_SIZE - newSize) {
>     nextBuffer();
>     assert assertSliceBuffer(buffer);
>   }
>
>   final int newUpto = intUpto;
>   final int offset = newUpto + intOffset;
>   intUpto += newSize;
>
>   // Write forwarding address at end of last slice:
>   slice[sliceOffset] = offset;
>
>   // Write new level:
>   buffer[intUpto - 1] = newLevel;
>
>   return newUpto;
> }
> {code}
>  






[jira] [Commented] (LUCENE-10207) Make TermInSetQuery usable with IndexOrDocValuesQuery

2021-10-27 Thread Michael McCandless (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17434769#comment-17434769
 ] 

Michael McCandless commented on LUCENE-10207:
-

I love this idea!  Using the aggregate term statistics already in the index to 
efficiently guesstimate the cost on the index side of things.  The user can 
always override the decision if they know something is unusual about their 
index?  (Hmm, maybe not – looks like the logic is hardcoded deep inside an 
anonymous {{ScorerSupplier}} in {{IoDVQ}}).

Should we try to take deletions into account at all?  Because a PK field with 
deletions will look like it is not "precisely" PK based on the aggregate stats. 
 Though I suppose even with e.g. 50% deletions in the index, this proposed cost 
metric is close enough.

> Make TermInSetQuery usable with IndexOrDocValuesQuery
> -
>
> Key: LUCENE-10207
> URL: https://issues.apache.org/jira/browse/LUCENE-10207
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
>
> IndexOrDocValuesQuery is very useful to pick the right execution mode for a 
> query depending on other bits of the query tree.
> We would like to be able to use it to optimize execution of TermInSetQuery. 
> However IndexOrDocValuesQuery only works well if the "index" query can give 
> an estimation of the cost of the query without doing anything expensive (like 
> looking up all terms of the TermInSetQuery in the terms dict). Maybe we could 
> implement it for primary keys (terms.size() == sumDocFreq) by returning the 
> number of terms of the query? Another idea is to multiply the number of terms 
> by the average postings length, though this could be dangerous if the field 
> has a zipfian distribution and some terms have a much higher doc frequency 
> than the average.
> [~romseygeek] and I were discussing this a few weeks ago, and more recently 
> [~mikemccand] and [~gsmiller] again independently. So it looks like there is 
> interest in this. Here is an email thread where this was recently discussed: 
> https://lists.apache.org/thread.html/re3b20a486c9a4e66b2ca4a2646e2d3be48535a90cdd95911a8445183%40%3Cdev.lucene.apache.org%3E.
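The heuristic sketched in the description can be written down directly. This is a hypothetical illustration of the cost estimate, not Lucene code: it uses only aggregate field statistics, so no term-dictionary lookups are needed.

```java
public class TermInSetCostEstimate {

    // Estimate how many documents a TermInSetQuery might match, using only
    // aggregate stats: the field's unique term count and its sumDocFreq.
    static long estimateCost(long fieldTermCount, long sumDocFreq, long queryTermCount) {
        if (fieldTermCount == sumDocFreq) {
            // Every term occurs in exactly one doc: a primary-key-like field,
            // so the query matches at most one doc per query term.
            return queryTermCount;
        }
        // Otherwise scale by the average postings length. As noted above,
        // this can badly underestimate zipfian fields where a few terms
        // have much higher doc frequency than the average.
        double avgPostingsLength = (double) sumDocFreq / fieldTermCount;
        return (long) Math.ceil(queryTermCount * avgPostingsLength);
    }
}
```

IndexOrDocValuesQuery can then compare this cheap index-side estimate against the doc-values lead cost to pick an execution mode.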






[jira] [Resolved] (LUCENE-10008) CommonGramsFilterFactory doesn't respect ignoreCase=true when default stopwords are used

2021-10-21 Thread Michael McCandless (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless resolved LUCENE-10008.
-
Fix Version/s: 8.11
   main (9.0)
   Resolution: Fixed

Thanks [~vigyas]!

> CommonGramsFilterFactory doesn't respect ignoreCase=true when default 
> stopwords are used
> 
>
> Key: LUCENE-10008
> URL: https://issues.apache.org/jira/browse/LUCENE-10008
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Chris M. Hostetter
>Priority: Major
> Fix For: main (9.0), 8.11
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> CommonGramsFilterFactory's use of the "words" and "ignoreCase" config options 
> is inconsistent with how StopFilterFactory uses them - leading to 
> "ignoreCase=true" not being respected unless "words" is specified...
> StopFilterFactory...
> {code:java}
>   public void inform(ResourceLoader loader) throws IOException {
> if (stopWordFiles != null) {
>   ...
> } else {
>   ...
>   stopWords = new CharArraySet(EnglishAnalyzer.ENGLISH_STOP_WORDS_SET, 
> ignoreCase);
> }
>   }
> {code}
> CommonGramsFilterFactory...
> {code:java}
>   @Override
>   public void inform(ResourceLoader loader) throws IOException {
> if (commonWordFiles != null) {
>   ...
> } else {
>   commonWords = EnglishAnalyzer.ENGLISH_STOP_WORDS_SET;
> }
>   }
> {code}
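The effect of dropping {{ignoreCase}} can be illustrated without Lucene's {{CharArraySet}}. This plain-Java stand-in (a hypothetical analogy, not the actual fix) shows why a default stopword set built without case folding misses differently-cased tokens:

```java
import java.util.Set;
import java.util.TreeSet;

public class IgnoreCaseDemo {

    // Build a stopword set, honoring (or ignoring) the ignoreCase flag,
    // analogous to CharArraySet's ignoreCase constructor argument.
    static Set<String> stopWords(boolean ignoreCase) {
        Set<String> s = ignoreCase
            ? new TreeSet<>(String.CASE_INSENSITIVE_ORDER)
            : new TreeSet<>();
        s.add("the");
        s.add("of");
        return s;
    }

    public static void main(String[] args) {
        // Honoring the flag: "The" matches. Dropping it (the bug): "The" is missed.
        System.out.println(stopWords(true).contains("The"));  // prints true
        System.out.println(stopWords(false).contains("The")); // prints false
    }
}
```

StopFilterFactory wraps the default set with the flag; CommonGramsFilterFactory returned the default set unchanged, which behaves like the second case.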






[jira] [Resolved] (LUCENE-10093) TestTieredMergePolicy.testForcedMergesUseLeastNumberOfMerges test failure

2021-10-21 Thread Michael McCandless (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless resolved LUCENE-10093.
-
Fix Version/s: main (9.0)
   Resolution: Fixed

> TestTieredMergePolicy.testForcedMergesUseLeastNumberOfMerges test failure
> -
>
> Key: LUCENE-10093
> URL: https://issues.apache.org/jira/browse/LUCENE-10093
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Michael McCandless
>Priority: Major
> Fix For: main (9.0)
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> This test fails periodically in our CI builds, and the failing seed repros 
> for me:
> {noformat}
> org.apache.lucene.index.TestTieredMergePolicy > test suite's output saved to 
> /l/trunk/lucene/core/build/test-results/test/outputs/OUTPUT-org.apache.lucene.index.TestTieredMergePolicy.txt,
>  copied below:
>    >     java.lang.AssertionError
>    >         at 
> __randomizedtesting.SeedInfo.seed([7B591E657503510C:C958DC291BD5CF0A]:0)
>    >         at org.junit.Assert.fail(Assert.java:87)
>    >         at org.junit.Assert.assertTrue(Assert.java:42)
>    >         at org.junit.Assert.assertTrue(Assert.java:53)
>    >         at 
> org.apache.lucene.index.TestTieredMergePolicy.assertMaxSize(TestTieredMergePolicy.java:497)
>    >         at 
> org.apache.lucene.index.TestTieredMergePolicy.testForcedMergesUseLeastNumberOfMerges(TestTieredMergePolicy.java:454)
>    >         at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>    >         at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:78)
>    >         at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>    >         at java.base/java.lang.reflect.Method.invoke(Method.java:567)
>    >         at 
> com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1754)
>    >         at 
> com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:942)
>    >         at 
> com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:978)
>    >         at 
> com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:992)
>    >         at 
> org.apache.lucene.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:44)
>    >         at 
> org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
>    >         at 
> org.apache.lucene.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45)
>    >         at 
> org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
>    >         at 
> org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
>    >         at org.junit.rules.RunRules.evaluate(RunRules.java:20)
>    >         at 
> com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
>    >         at 
> com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:370)
>    >         at 
> com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:819)
>    >         at 
> com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:470)
>    >         at 
> com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:951)
>    >         at 
> com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:836)
>    >         at 
> com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:887)
>    >         at 
> com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:898)
>    >         at 
> org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
>    >         at 
> com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
>    >         at 
> org.apache.lucene.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:38)
>    >         at 
> com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
>    >         at 
> com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
>    >         at 
> com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
>    >         at 
> com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
>    >         at 
> 

[jira] [Commented] (LUCENE-10093) TestTieredMergePolicy.testForcedMergesUseLeastNumberOfMerges test failure

2021-10-21 Thread Michael McCandless (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17432495#comment-17432495
 ] 

Michael McCandless commented on LUCENE-10093:
-

The above ^^ fix should resolve this.

> TestTieredMergePolicy.testForcedMergesUseLeastNumberOfMerges test failure
> -
>
> Key: LUCENE-10093
> URL: https://issues.apache.org/jira/browse/LUCENE-10093
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Michael McCandless
>Priority: Major
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> This test fails periodically in our CI builds, and the failing seed repros 
> for me:
> {noformat}
> org.apache.lucene.index.TestTieredMergePolicy > test suite's output saved to 
> /l/trunk/lucene/core/build/test-results/test/outputs/OUTPUT-org.apache.lucene.index.TestTieredMergePolicy.txt,
>  copied below:
>    >     java.lang.AssertionError
>    >         at 
> __randomizedtesting.SeedInfo.seed([7B591E657503510C:C958DC291BD5CF0A]:0)
>    >         at org.junit.Assert.fail(Assert.java:87)
>    >         at org.junit.Assert.assertTrue(Assert.java:42)
>    >         at org.junit.Assert.assertTrue(Assert.java:53)
>    >         at 
> org.apache.lucene.index.TestTieredMergePolicy.assertMaxSize(TestTieredMergePolicy.java:497)
>    >         at 
> org.apache.lucene.index.TestTieredMergePolicy.testForcedMergesUseLeastNumberOfMerges(TestTieredMergePolicy.java:454)
>    >         at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>    >         at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:78)
>    >         at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>    >         at java.base/java.lang.reflect.Method.invoke(Method.java:567)
>    >         at 
> com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1754)
>    >         at 
> com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:942)
>    >         at 
> com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:978)
>    >         at 
> com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:992)
>    >         at 
> org.apache.lucene.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:44)
>    >         at 
> org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
>    >         at 
> org.apache.lucene.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45)
>    >         at 
> org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
>    >         at 
> org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
>    >         at org.junit.rules.RunRules.evaluate(RunRules.java:20)
>    >         at 
> com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
>    >         at 
> com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:370)
>    >         at 
> com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:819)
>    >         at 
> com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:470)
>    >         at 
> com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:951)
>    >         at 
> com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:836)
>    >         at 
> com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:887)
>    >         at 
> com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:898)
>    >         at 
> org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
>    >         at 
> com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
>    >         at 
> org.apache.lucene.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:38)
>    >         at 
> com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
>    >         at 
> com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
>    >         at 
> com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
>    >         at 
> com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
>    >         at 
> 

[jira] [Commented] (LUCENE-8739) ZSTD Compressor support in Lucene

2021-10-21 Thread Michael McCandless (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17432444#comment-17432444
 ] 

Michael McCandless commented on LUCENE-8739:


{quote}My codec passed all test cases with test option -Dtests.codec=MyCodec.
{quote}
Aha, that is great news!  Lucene's tests tend to stress out new Codecs.  If you 
want to evil-up the tests, pass {{-Dtests.nightly=true}}.  The tests will run 
longer but try harder to find problems.

> ZSTD Compressor support in Lucene
> -
>
> Key: LUCENE-8739
> URL: https://issues.apache.org/jira/browse/LUCENE-8739
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: core/codecs
>Reporter: Sean Torres
>Priority: Minor
>  Labels: features
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> ZStandard has a great speed and compression ratio tradeoff. 
> ZStandard is open source compression from Facebook.
> More about ZSTD
> [https://github.com/facebook/zstd]
> [https://code.facebook.com/posts/1658392934479273/smaller-and-faster-data-compression-with-zstandard/]





