[GitHub] [lucene-jira-archive] manishbafna commented on a diff in pull request #75: Update account-map.csv.20220722.verified

2022-07-26 Thread GitBox


manishbafna commented on code in PR #75:
URL: https://github.com/apache/lucene-jira-archive/pull/75#discussion_r930634094


##
migration/mappings-data/account-map.csv.20220722.verified:
##
@@ -169,3 +169,4 @@ mharwood,markharwood,Mark Harwood
 hossman,hossman,Chris M. Hostetter
 munendrasn,munendrasn,Munendra S N
 vajda,ovalhub,Andi Vajda
+manish1982,manishbafna,Manish

Review Comment:
   https://user-images.githubusercontent.com/1758199/181168231-9e0d1ac2-b47b-4879-ad94-19db05900bec.png
   My username in JIRA is manish1982. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10583) Deadlock with MMapDirectory while waitForMerges

2022-07-26 Thread Vigya Sharma (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17571691#comment-17571691
 ] 

Vigya Sharma commented on LUCENE-10583:
---

{quote}We could perhaps make a best effort to detect on common incoming APIs 
that external locks are not already held on {{Directory}} and 
{{IndexWriter}}?
{quote}
Interesting thought. I like the idea of safe-guarding users against such 
errors, but I don't have a good practical solution for it yet.

We could assert that common Lucene objects are lock-free at some popular 
{{public}} entry points, but how do we tell whether a lock was acquired by 
an internal Lucene thread or an external user thread? We do lock on 
IndexWriter in multiple places within Lucene.
{quote}can this be resolved now?
{quote}
We added doc strings at a couple of places to warn users, and the user who 
reported this issue is unblocked. I don't have a concrete plan for anything 
else we can do here. Unless there are more ideas, we could go ahead and resolve 
this.
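
As a rough illustration of the best-effort check discussed above, here is a minimal sketch built on {{Thread.holdsLock}}. The class and method names ({{ExternalLockCheck}}, {{callerHoldsAny}}, {{ensureNoExternalLocks}}) are hypothetical and not part of Lucene; the sketch only shows what an entry-point guard could look like under the assumption that the objects of interest are locked via {{synchronized}}.

```java
/** Hypothetical helper, not part of Lucene: best-effort detection of held monitors. */
final class ExternalLockCheck {
  private ExternalLockCheck() {}

  /** Returns true if the calling thread holds the monitor of any of the given objects. */
  static boolean callerHoldsAny(Object... monitors) {
    for (Object monitor : monitors) {
      // Thread.holdsLock only reports on the current thread, never on other threads.
      if (Thread.holdsLock(monitor)) {
        return true;
      }
    }
    return false;
  }

  /** Throws if the caller enters while already holding one of the given monitors. */
  static void ensureNoExternalLocks(Object... monitors) {
    if (callerHoldsAny(monitors)) {
      throw new IllegalStateException(
          "caller already holds a monitor that background threads may need");
    }
  }
}
```

Such a guard cannot tell a user thread from a Lucene thread that legitimately re-enters a public method while synchronized on the same object, which is exactly the open question raised above, so it would likely produce false positives without additional bookkeeping.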

 

> Deadlock with MMapDirectory while waitForMerges
> ---
>
> Key: LUCENE-10583
> URL: https://issues.apache.org/jira/browse/LUCENE-10583
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Affects Versions: 8.11.1
> Environment: Java 17
> OS: Windows 2016
>Reporter: Thomas Hoffmann
>Priority: Minor
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Hello,
> a deadlock situation happened in our application. We are using MMapDirectory 
> on Windows 2016 and got the following stacktrace:
> {code:java}
> "https-openssl-nio-443-exec-30" #166 daemon prio=5 os_prio=0 cpu=78703.13ms 
> elapsed=81248.18s tid=0x2860af10 nid=0x237c in Object.wait()  
> [0x413fc000]
>    java.lang.Thread.State: TIMED_WAITING (on object monitor)
>     at java.lang.Object.wait(java.base@17.0.2/Native Method)
>     - waiting on 
>     at org.apache.lucene.index.IndexWriter.doWait(IndexWriter.java:4983)
>     - locked <0x0006ef1fc020> (a org.apache.lucene.index.IndexWriter)
>     at 
> org.apache.lucene.index.IndexWriter.waitForMerges(IndexWriter.java:2697)
>     - locked <0x0006ef1fc020> (a org.apache.lucene.index.IndexWriter)
>     at org.apache.lucene.index.IndexWriter.shutdown(IndexWriter.java:1236)
>     at org.apache.lucene.index.IndexWriter.close(IndexWriter.java:1278)
>     at 
> com.speed4trade.ebs.module.search.SearchService.updateSearchIndex(SearchService.java:1723)
>     - locked <0x0006d5c00208> (a org.apache.lucene.store.MMapDirectory)
>     at 
> com.speed4trade.ebs.module.businessrelations.ticket.TicketChangedListener.postUpdate(TicketChangedListener.java:142)
> ...{code}
> All threads were waiting to lock <0x0006d5c00208>, which never got 
> released.
> A Lucene thread was also blocked; I don't know if this is relevant:
> {code:java}
> "Lucene Merge Thread #0" #18466 daemon prio=5 os_prio=0 cpu=15.63ms 
> elapsed=3499.07s tid=0x459453e0 nid=0x1f8 waiting for monitor entry  
> [0x5da9e000]
>    java.lang.Thread.State: BLOCKED (on object monitor)
>     at 
> org.apache.lucene.store.FSDirectory.deletePendingFiles(FSDirectory.java:346)
>     - waiting to lock <0x0006d5c00208> (a 
> org.apache.lucene.store.MMapDirectory)
>     at 
> org.apache.lucene.store.FSDirectory.maybeDeletePendingFiles(FSDirectory.java:363)
>     at org.apache.lucene.store.FSDirectory.createOutput(FSDirectory.java:248)
>     at 
> org.apache.lucene.store.LockValidatingDirectoryWrapper.createOutput(LockValidatingDirectoryWrapper.java:44)
>     at 
> org.apache.lucene.index.ConcurrentMergeScheduler$1.createOutput(ConcurrentMergeScheduler.java:289)
>     at 
> org.apache.lucene.store.TrackingDirectoryWrapper.createOutput(TrackingDirectoryWrapper.java:43)
>     at 
> org.apache.lucene.codecs.compressing.CompressingStoredFieldsWriter.<init>(CompressingStoredFieldsWriter.java:121)
>     at 
> org.apache.lucene.codecs.compressing.CompressingStoredFieldsFormat.fieldsWriter(CompressingStoredFieldsFormat.java:130)
>     at 
> org.apache.lucene.codecs.lucene87.Lucene87StoredFieldsFormat.fieldsWriter(Lucene87StoredFieldsFormat.java:141)
>     at 
> org.apache.lucene.index.SegmentMerger.mergeFields(SegmentMerger.java:227)
>     at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:105)
>     at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4757)
>     at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:4361)
>     at 
> org.apache.lucene.index.IndexWriter$IndexWriterMergeSource.merge(IndexWriter.java:5920)
>     at 
> org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:626)
>     at 
> 

[jira] [Updated] (LUCENE-10654) New companion doc value format for LatLonShape and XYShape field types

2022-07-26 Thread Nick Knize (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Knize updated LUCENE-10654:

Fix Version/s: 9.4
   (was: 9.3)

> New companion doc value format for LatLonShape and XYShape field types
> --
>
> Key: LUCENE-10654
> URL: https://issues.apache.org/jira/browse/LUCENE-10654
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Nick Knize
>Priority: Major
> Fix For: 9.4
>
>  Time Spent: 5h 20m
>  Remaining Estimate: 0h
>
> {{XYDocValuesField}} provides doc value support for {{XYPoint}}. 
> {{LatLonDocValuesField}} provides docvalue support for {{LatLonPoint}}.
> However, neither {{LatLonShape}} nor {{XYShape}} currently have a docvalue 
> format. 
> This lack of doc value support for shapes means facets, aggregations, and 
> IndexOrDocValues queries are currently not possible for Shape field types. 
> This gap needs to be closed in Lucene.
> To support IndexOrDocValues queries along with various geometry aggregations 
> and facets, the ability to compute the spatial relation with the doc value is 
> needed. This is straightforward with {{XYPoint}} and {{LatLonPoint}} since 
> the doc value encoding is nothing more than a simple 2D integer encoding of 
> the x,y and lat,lon dimensional components. Accomplishing the same with a 
> naive integer encoded binary representation for N-vertex shapes would be 
> costly. 
> {{ComponentTree}} already provides an efficient in memory structure for 
> quickly computing spatial relations over Shape types based on a binary tree 
> of tessellated triangles provided by the {{Tessellator}}. Furthermore, this 
> tessellation is already computed at index time. If we create an on-disk 
> representation of {{ComponentTree}} 's binary tree of tessellated triangles 
> and use this as the doc value {{binaryValue}} format we will be able to 
> efficiently compute spatial relations with this binary representation and 
> achieve the same facet/aggregation result over shapes as we can with points 
> today (e.g., grid facets, centroid, area, etc).
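
For readers unfamiliar with the "simple 2D integer encoding" mentioned above for point doc values, the following is a minimal, self-contained sketch of the idea: each coordinate is quantized to a 32-bit integer and the two integers are packed into one 64-bit value. The class and method names are hypothetical; Lucene's own utilities may differ in details such as rounding and edge-case handling.

```java
/** Illustrative sketch only; names are hypothetical and not Lucene's actual API. */
final class PointDocValueSketch {
  private static final double LAT_SCALE = (0x1L << 32) / 180.0;
  private static final double LON_SCALE = (0x1L << 32) / 360.0;

  private PointDocValueSketch() {}

  /** Quantizes a latitude in [-90, 90] to a signed 32-bit integer. */
  static int encodeLat(double lat) {
    if (lat < -90.0 || lat > 90.0) {
      throw new IllegalArgumentException("latitude out of range: " + lat);
    }
    if (lat == 90.0) {
      lat = Math.nextDown(lat); // keep the maximum value inside the int range
    }
    return (int) Math.floor(lat * LAT_SCALE);
  }

  /** Quantizes a longitude in [-180, 180] to a signed 32-bit integer. */
  static int encodeLon(double lon) {
    if (lon < -180.0 || lon > 180.0) {
      throw new IllegalArgumentException("longitude out of range: " + lon);
    }
    if (lon == 180.0) {
      lon = Math.nextDown(lon);
    }
    return (int) Math.floor(lon * LON_SCALE);
  }

  static double decodeLat(int encoded) {
    return encoded / LAT_SCALE;
  }

  static double decodeLon(int encoded) {
    return encoded / LON_SCALE;
  }

  /** Packs both encoded dimensions into one long: latitude high, longitude low. */
  static long pack(int latEnc, int lonEnc) {
    return (((long) latEnc) << 32) | (lonEnc & 0xFFFFFFFFL);
  }

  static int unpackLat(long packed) {
    return (int) (packed >>> 32);
  }

  static int unpackLon(long packed) {
    return (int) packed;
  }
}
```

A point therefore costs a fixed 8 bytes and spatial relations reduce to integer comparisons, whereas an N-vertex shape has no such fixed-size form, which is the motivation for the tree-based binary format proposed here.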



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10654) New companion doc value format for LatLonShape and XYShape field types

2022-07-26 Thread Nick Knize (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17571655#comment-17571655
 ] 

Nick Knize commented on LUCENE-10654:
-

As per the discussion on the PR, I think this is too late for 9.3, so I'd like to 
move forward for 9.4 and iterate on the "nice to haves" (visitor access pattern) 
in a follow-up.

> New companion doc value format for LatLonShape and XYShape field types
> --
>
> Key: LUCENE-10654
> URL: https://issues.apache.org/jira/browse/LUCENE-10654
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Nick Knize
>Priority: Major
> Fix For: 9.3
>
>  Time Spent: 5h 20m
>  Remaining Estimate: 0h
>
> {{XYDocValuesField}} provides doc value support for {{XYPoint}}. 
> {{LatLonDocValuesField}} provides docvalue support for {{LatLonPoint}}.
> However, neither {{LatLonShape}} nor {{XYShape}} currently have a docvalue 
> format. 
> This lack of doc value support for shapes means facets, aggregations, and 
> IndexOrDocValues queries are currently not possible for Shape field types. 
> This gap needs to be closed in Lucene.
> To support IndexOrDocValues queries along with various geometry aggregations 
> and facets, the ability to compute the spatial relation with the doc value is 
> needed. This is straightforward with {{XYPoint}} and {{LatLonPoint}} since 
> the doc value encoding is nothing more than a simple 2D integer encoding of 
> the x,y and lat,lon dimensional components. Accomplishing the same with a 
> naive integer encoded binary representation for N-vertex shapes would be 
> costly. 
> {{ComponentTree}} already provides an efficient in memory structure for 
> quickly computing spatial relations over Shape types based on a binary tree 
> of tessellated triangles provided by the {{Tessellator}}. Furthermore, this 
> tessellation is already computed at index time. If we create an on-disk 
> representation of {{ComponentTree}} 's binary tree of tessellated triangles 
> and use this as the doc value {{binaryValue}} format we will be able to 
> efficiently compute spatial relations with this binary representation and 
> achieve the same facet/aggregation result over shapes as we can with points 
> today (e.g., grid facets, centroid, area, etc).






[jira] [Commented] (LUCENE-10577) Quantize vector values

2022-07-26 Thread Michael Sokolov (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17571631#comment-17571631
 ] 

Michael Sokolov commented on LUCENE-10577:
--

OK, I will revive the FieldInfo version of this thing and see about making a 
byte-oriented KnnVectorField; perhaps the VectorFormat can remain internal in 
that case. It seems likely to me that if this is a win for this algorithm that 
it could very well be so for others. Plus there is an easy fallback position 
which is to accept bytes and inflate them to four-byte floats, so the burden is 
not necessarily so great on future vector formats. Agree we can add Euclidean 
distance.
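
The fallback mentioned above (accept bytes and inflate them to floats) could look roughly like the sketch below. The class name {{ByteVectorInflater}} and the {{scale}} parameter are assumptions made for illustration; nothing here is an existing Lucene API.

```java
/** Hypothetical sketch: widen a byte-quantized vector back to 4-byte floats. */
final class ByteVectorInflater {
  private ByteVectorInflater() {}

  /**
   * Converts each signed byte to a float and multiplies it by {@code scale},
   * the (assumed) step size that was used when the vector was quantized.
   */
  static float[] inflate(byte[] quantized, float scale) {
    float[] out = new float[quantized.length];
    for (int i = 0; i < quantized.length; i++) {
      out[i] = quantized[i] * scale;
    }
    return out;
  }
}
```

A vector format that only understands floats could call something like this on read and otherwise stay unchanged, at the cost of giving up the memory savings during scoring.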

> Quantize vector values
> --
>
> Key: LUCENE-10577
> URL: https://issues.apache.org/jira/browse/LUCENE-10577
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Reporter: Michael Sokolov
>Priority: Major
>  Time Spent: 3h 20m
>  Remaining Estimate: 0h
>
> The {{KnnVectorField}} api handles vectors with 4-byte floating point values. 
> These fields can be used (via {{KnnVectorsReader}}) in two main ways:
> 1. The {{VectorValues}} iterator enables retrieving values
> 2. Approximate nearest-neighbor search
> The main point of this addition was to provide the search capability, and to 
> support that it is not really necessary to store vectors in full precision. 
> Perhaps users may also be willing to retrieve values in lower precision for 
> whatever purpose those serve, if they are able to store more samples. We know 
> that 8 bits is enough to provide a very near approximation to the same 
> recall/performance tradeoff that is achieved with the full-precision vectors. 
> I'd like to explore how we could enable 4:1 compression of these fields by 
> reducing their precision.
> A few ways I can imagine this would be done:
> 1. Provide a parallel byte-oriented API. This would allow users to provide 
> their data in reduced-precision format and give control over the quantization 
> to them. It would have a major impact on the Lucene API surface though, 
> essentially requiring us to duplicate all of the vector APIs.
> 2. Automatically quantize the stored vector data when we can. This would 
> require no or perhaps very limited change to the existing API to enable the 
> feature.
> I've been exploring (2), and what I find is that we can achieve very good 
> recall results using dot-product similarity scoring by simple linear scaling 
> + quantization of the vector values, so long as  we choose the scale that 
> minimizes the quantization error. Dot-product is amenable to this treatment 
> since vectors are required to be unit-length when used with that similarity 
> function. 
>  Even still there is variability in the ideal scale over different data sets. 
> A good choice seems to be max(abs(min-value), abs(max-value)), but of course 
> this assumes that the data set doesn't have a few outlier data points. A 
> theoretical range can be obtained by 1/sqrt(dimension), but this is only 
> useful when the samples are normally distributed. We could in theory 
> determine the ideal scale when flushing a segment and manage this 
> quantization per-segment, but then numerical error could creep in when 
> merging.
> I'll post a patch/PR with an experimental setup I've been using for 
> evaluation purposes. It is pretty self-contained and simple, but has some 
> drawbacks that need to be addressed:
> 1. No automated mechanism for determining quantization scale (it's a constant 
> that I have been playing with)
> 2. Converts from byte/float when computing dot-product instead of directly 
> computing on byte values
> I'd like to get people's feedback on the approach and whether in general we 
> should think about doing this compression under the hood, or expose a 
> byte-oriented API. Whatever we do I think a 4:1 compression ratio is pretty 
> compelling and we should pursue something.
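
To make the linear scaling + quantization idea in the description concrete, here is a minimal sketch. It assumes the scale is chosen per vector as the largest absolute component and maps values onto the signed byte range [-127, 127]; the class {{VectorQuantizer}} and its methods are hypothetical and are not the experimental patch referred to above.

```java
/** Hypothetical sketch of linear scaling + quantization of float vectors to bytes. */
final class VectorQuantizer {
  private VectorQuantizer() {}

  /** Largest absolute component, i.e. max(abs(min-value), abs(max-value)). */
  static float maxAbs(float[] v) {
    float max = 0f;
    for (float x : v) {
      max = Math.max(max, Math.abs(x));
    }
    return max;
  }

  /** Scales each component by 127 / maxAbs and rounds to the nearest byte. */
  static byte[] quantize(float[] v, float maxAbs) {
    byte[] out = new byte[v.length];
    if (maxAbs == 0f) {
      return out; // all-zero vector stays all zero
    }
    for (int i = 0; i < v.length; i++) {
      int q = Math.round(v[i] / maxAbs * 127f);
      out[i] = (byte) Math.max(-127, Math.min(127, q));
    }
    return out;
  }

  /** Dot product computed directly on byte values, accumulated in an int. */
  static int dotProduct(byte[] a, byte[] b) {
    int sum = 0;
    for (int i = 0; i < a.length; i++) {
      sum += a[i] * b[i];
    }
    return sum;
  }

  /** Rescales the integer dot product back to the original float range. */
  static float approximateDot(byte[] a, float maxAbsA, byte[] b, float maxAbsB) {
    return dotProduct(a, b) * (maxAbsA / 127f) * (maxAbsB / 127f);
  }
}
```

For the unit vector (0.6, -0.8) this yields the bytes (95, -127), and the rescaled self dot product comes out close to 1, which illustrates why unit-length vectors suit this treatment. A single outlier component would inflate {{maxAbs}} and waste most of the byte range, which is the drawback noted in the description.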






[jira] [Comment Edited] (LUCENE-10404) Use hash set for visited nodes in HNSW search?

2022-07-26 Thread Michael Sokolov (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17571594#comment-17571594
 ] 

Michael Sokolov edited comment on LUCENE-10404 at 7/26/22 8:01 PM:
---

Here is a test using GloVe 100-dim vectors plus much more aggressive indexing 
settings, and we can see that here the IntIntHashMap is adding cost
h3. baseline

{{recall  latency nDoc    fanout  maxConn beamWidth       visited index ms}}
{{0.991    0.92   1   50      64      500     150     12068}}
{{0.996    1.11   1   100     64      500     200     0}}
{{0.999    1.45   1   200     64      500     300     0}}
{{1.000    1.94   1   400     64      500     500     0}}
{{0.955    2.53   10  50      64      500     150     463142}}
{{0.973    3.03   10  100     64      500     200     0}}
{{0.988    4.44   10  200     64      500     300     0}}
{{0.997    6.57   10  400     64      500     500     0}}
{{0.895    3.44   100 50      64      500     150     9811483}}
{{0.920    4.33   100 100     64      500     200     0}}
{{0.950    6.20   100 200     64      500     300     0}}
{{0.974    9.53   100 400     64      500     500     0}}

h3. IntIntHashMap

{{recall  latency nDoc    fanout  maxConn beamWidth       visited index ms}}
{{0.991    1.03   1   50      64      500     150     13274}}
{{0.996    1.24   1   100     64      500     200     0}}
{{0.999    1.62   1   200     64      500     300     0}}
{{1.000    2.09   1   400     64      500     500     0}}
{{0.955    2.47   10  50      64      500     150     485131}}
{{0.973    3.31   10  100     64      500     200     0}}
{{0.988    4.66   10  200     64      500     300     0}}
{{0.997    7.26   10  400     64      500     500     0}}
{{0.895    3.58   100 50      64      500     150     10173818}}
{{0.920    4.49   100 100     64      500     200     0}}
{{0.950    6.45   100 200     64      500     300     0}}
{{0.974    9.91   100 400     64      500     500     0}}


was (Author: sokolov):
Here is a test using GloVe 100-dim vectors plus much more aggressive indexing 
settings, and we can see that here the IntIntHashMap is adding cost

h3. baseline

{{recall  latency nDoc    fanout  maxConn beamWidth       visited index ms
0.991   0.92    1       50      64      500             150     12068
0.996   1.11    1       100     64      500             200     0
0.999   1.45    1       200     64      500             300     0
1.000   1.94    1       400     64      500             500     0
0.955   2.53    10      50      64      500             150     463142
0.973   3.03    10      100     64      500             200     0
0.988   4.44    10      200     64      500             300     0
0.997   6.57    10      400     64      500             500     0
0.895   3.44    100     50      64      500             150     9811483
0.920   4.33    100     100     64      500             200     0
0.950   6.20    100     200     64      500             300     0
0.974   9.53    100     400     64      500             500     0}}
}}

h3. IntIntHashMap

{{recall  latency nDoc    fanout  maxConn beamWidth       visited index ms
0.991   1.03    1       50      64      500             150     13274
0.996   1.24    1       100     64      500             200     0
0.999   1.62    1       200     64      500             300     0
1.000   2.09    1       400     64      500             500     0
0.955   2.47    10      50      64      500             150     485131
0.973   3.31    10      100     64      500             200     0
0.988   4.66    10      200     64      500             300     0
0.997   7.26    10      400     64      500             500     0
0.895   3.58    100     50      64      500             150     10173818
0.920   4.49    100     100     64      500             200     0
0.950   6.45    100     200     64      500             300     0
0.974   9.91    100     400     64      500             500     0
}}


> Use hash set for visited nodes in HNSW search?
> --
>
> Key: LUCENE-10404
> URL: https://issues.apache.org/jira/browse/LUCENE-10404
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Julie Tibshirani
>Priority: Minor
>
> While searching each layer, HNSW tracks the nodes it has already visited 
> using a BitSet. We could look into using something like IntHashSet instead. I 
> tried out the idea quickly by switching to IntIntHashMap (which has already 
> been copied from hppc) and saw an improvement in index performance. 
> *Baseline:* 760896 msec to write vectors
> *Using IntIntHashMap:* 733017 msec to write vectors
> I noticed search performance actually got a little bit worse with the change 
> -- that is something to look into.
> For background, it's good to be aware that HNSW can visit a lot of nodes. For 
> example, on the glove-100-angular dataset with ~1.2 million docs, HNSW search 
> visits ~1000 - 

[jira] [Comment Edited] (LUCENE-10404) Use hash set for visited nodes in HNSW search?

2022-07-26 Thread Michael Sokolov (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17571594#comment-17571594
 ] 

Michael Sokolov edited comment on LUCENE-10404 at 7/26/22 7:59 PM:
---

Here is a test using GloVe 100-dim vectors plus much more aggressive indexing 
settings, and we can see that here the IntIntHashMap is adding cost

h3. baseline

{{recall  latency nDoc    fanout  maxConn beamWidth       visited index ms
0.991   0.92    1       50      64      500             150     12068
0.996   1.11    1       100     64      500             200     0
0.999   1.45    1       200     64      500             300     0
1.000   1.94    1       400     64      500             500     0
0.955   2.53    10      50      64      500             150     463142
0.973   3.03    10      100     64      500             200     0
0.988   4.44    10      200     64      500             300     0
0.997   6.57    10      400     64      500             500     0
0.895   3.44    100     50      64      500             150     9811483
0.920   4.33    100     100     64      500             200     0
0.950   6.20    100     200     64      500             300     0
0.974   9.53    100     400     64      500             500     0}}
}}

h3. IntIntHashMap

{{recall  latency nDoc    fanout  maxConn beamWidth       visited index ms
0.991   1.03    1       50      64      500             150     13274
0.996   1.24    1       100     64      500             200     0
0.999   1.62    1       200     64      500             300     0
1.000   2.09    1       400     64      500             500     0
0.955   2.47    10      50      64      500             150     485131
0.973   3.31    10      100     64      500             200     0
0.988   4.66    10      200     64      500             300     0
0.997   7.26    10      400     64      500             500     0
0.895   3.58    100     50      64      500             150     10173818
0.920   4.49    100     100     64      500             200     0
0.950   6.45    100     200     64      500             300     0
0.974   9.91    100     400     64      500             500     0
}}



was (Author: sokolov):
Here is a test using GloVe 100-dim vectors plus much more aggressive indexing 
settings, and we can see that here the IntIntHashMap is adding cost

h3. baseline

{{
recall  latency nDoc    fanout  maxConn beamWidth       visited index ms
0.991   0.92    1       50      64      500             150     12068
0.996   1.11    1       100     64      500             200     0
0.999   1.45    1       200     64      500             300     0
1.000   1.94    1       400     64      500             500     0
0.955   2.53    10      50      64      500             150     463142
0.973   3.03    10      100     64      500             200     0
0.988   4.44    10      200     64      500             300     0
0.997   6.57    10      400     64      500             500     0
0.895   3.44    100     50      64      500             150     9811483
0.920   4.33    100     100     64      500             200     0
0.950   6.20    100     200     64      500             300     0
0.974   9.53    100     400     64      500             500     0
}}

h3. IntIntHashMap

{{
recall  latency nDoc    fanout  maxConn beamWidth       visited index ms
0.991   1.03    1       50      64      500             150     13274
0.996   1.24    1       100     64      500             200     0
0.999   1.62    1       200     64      500             300     0
1.000   2.09    1       400     64      500             500     0
0.955   2.47    10      50      64      500             150     485131
0.973   3.31    10      100     64      500             200     0
0.988   4.66    10      200     64      500             300     0
0.997   7.26    10      400     64      500             500     0
0.895   3.58    100     50      64      500             150     10173818
0.920   4.49    100     100     64      500             200     0
0.950   6.45    100     200     64      500             300     0
0.974   9.91    100     400     64      500             500     0
}}


> Use hash set for visited nodes in HNSW search?
> --
>
> Key: LUCENE-10404
> URL: https://issues.apache.org/jira/browse/LUCENE-10404
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Julie Tibshirani
>Priority: Minor
>
> While searching each layer, HNSW tracks the nodes it has already visited 
> using a BitSet. We could look into using something like IntHashSet instead. I 
> tried out the idea quickly by switching to IntIntHashMap (which has already 
> been copied from hppc) and saw an improvement in index performance. 
> *Baseline:* 760896 msec to write vectors
> *Using IntIntHashMap:* 733017 msec to write vectors
> I noticed search performance actually got a little bit worse with the change 
> -- that is something to look into.
> For background, it's good to be aware that HNSW can visit a lot of nodes. For 
> example, on the glove-100-angular dataset with ~1.2 million docs, HNSW search 
> visits ~1000 - 15,000 docs depending on the recall. This number can increase 
> when searching with deleted 

[jira] [Commented] (LUCENE-10404) Use hash set for visited nodes in HNSW search?

2022-07-26 Thread Michael Sokolov (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17571594#comment-17571594
 ] 

Michael Sokolov commented on LUCENE-10404:
--

Here is a test using GloVe 100-dim vectors plus much more aggressive indexing 
settings, and we can see that here the IntIntHashMap is adding cost

h3. baseline

{{
recall  latency nDoc    fanout  maxConn beamWidth       visited index ms
0.991   0.92    1       50      64      500             150     12068
0.996   1.11    1       100     64      500             200     0
0.999   1.45    1       200     64      500             300     0
1.000   1.94    1       400     64      500             500     0
0.955   2.53    10      50      64      500             150     463142
0.973   3.03    10      100     64      500             200     0
0.988   4.44    10      200     64      500             300     0
0.997   6.57    10      400     64      500             500     0
0.895   3.44    100     50      64      500             150     9811483
0.920   4.33    100     100     64      500             200     0
0.950   6.20    100     200     64      500             300     0
0.974   9.53    100     400     64      500             500     0
}}

h3. IntIntHashMap

{{
recall  latency nDoc    fanout  maxConn beamWidth       visited index ms
0.991   1.03    1       50      64      500             150     13274
0.996   1.24    1       100     64      500             200     0
0.999   1.62    1       200     64      500             300     0
1.000   2.09    1       400     64      500             500     0
0.955   2.47    10      50      64      500             150     485131
0.973   3.31    10      100     64      500             200     0
0.988   4.66    10      200     64      500             300     0
0.997   7.26    10      400     64      500             500     0
0.895   3.58    100     50      64      500             150     10173818
0.920   4.49    100     100     64      500             200     0
0.950   6.45    100     200     64      500             300     0
0.974   9.91    100     400     64      500             500     0
}}


> Use hash set for visited nodes in HNSW search?
> --
>
> Key: LUCENE-10404
> URL: https://issues.apache.org/jira/browse/LUCENE-10404
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Julie Tibshirani
>Priority: Minor
>
> While searching each layer, HNSW tracks the nodes it has already visited 
> using a BitSet. We could look into using something like IntHashSet instead. I 
> tried out the idea quickly by switching to IntIntHashMap (which has already 
> been copied from hppc) and saw an improvement in index performance. 
> *Baseline:* 760896 msec to write vectors
> *Using IntIntHashMap:* 733017 msec to write vectors
> I noticed search performance actually got a little bit worse with the change 
> -- that is something to look into.
> For background, it's good to be aware that HNSW can visit a lot of nodes. For 
> example, on the glove-100-angular dataset with ~1.2 million docs, HNSW search 
> visits ~1000 - 15,000 docs depending on the recall. This number can increase 
> when searching with deleted docs, especially if you hit a "pathological" case 
> where the deleted docs happen to be closest to the query vector.
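
The trade-off described in this issue can be sketched with plain JDK collections. The interface and class names below are hypothetical, and {{java.util.HashSet}} is used only as a boxed stand-in for a primitive set such as hppc's IntHashSet, so this shows the shape of the two approaches and says nothing about their relative speed.

```java
import java.util.BitSet;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical abstraction over "have we visited this node already?". */
interface VisitedSet {
  /** Marks the node as visited; returns true only the first time it is seen. */
  boolean markVisited(int node);
}

/** BitSet-backed: memory grows with the total number of nodes in the graph. */
final class BitSetVisited implements VisitedSet {
  private final BitSet bits;

  BitSetVisited(int numNodes) {
    this.bits = new BitSet(numNodes);
  }

  @Override
  public boolean markVisited(int node) {
    if (bits.get(node)) {
      return false;
    }
    bits.set(node);
    return true;
  }
}

/** Hash-backed: memory grows with the number of nodes actually visited. */
final class HashSetVisited implements VisitedSet {
  private final Set<Integer> seen = new HashSet<>();

  @Override
  public boolean markVisited(int node) {
    return seen.add(node); // add() returns false if the node was already present
  }
}
```

With roughly 1.2 million docs and only a few thousand visits per search, the hash-backed variant touches far less memory per query, while the bit-set variant keeps each membership test to a single word read; which effect dominates is what the benchmarks in this thread are probing.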






[jira] [Commented] (LUCENE-10054) Handle hierarchy in HNSW graph

2022-07-26 Thread Mike Sokolov (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17571588#comment-17571588
 ] 

Mike Sokolov commented on LUCENE-10054:
---

what is it with this issue that spammers love so much!? I wonder if we
could somehow lock it as read-only ...



> Handle hierarchy in HNSW graph
> --
>
> Key: LUCENE-10054
> URL: https://issues.apache.org/jira/browse/LUCENE-10054
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Mayya Sharipova
>Priority: Major
>  Labels: vector-based-search
> Fix For: 9.1
>
>  Time Spent: 20h 20m
>  Remaining Estimate: 0h
>
> Currently HNSW graph is represented as a single layer graph. 
>  We would like to extend it to handle hierarchy as per 
> [discussion|https://issues.apache.org/jira/browse/LUCENE-9004?focusedCommentId=17393216=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17393216].
>  
>  
> TODO tasks:
> - add multiple layers in the HnswGraph class
>  - modify the format in  Lucene90HnswVectorsWriter and 
> Lucene90HnswVectorsReader to handle multiple layers
> - modify graph construction and search algorithm to handle hierarchy
>  - run benchmarks






[GitHub] [lucene] nknize commented on a diff in pull request #1017: LUCENE-10654: Add new ShapeDocValuesField for LatLonShape and XYShape

2022-07-26 Thread GitBox


nknize commented on code in PR #1017:
URL: https://github.com/apache/lucene/pull/1017#discussion_r930257797


##
lucene/core/src/java/org/apache/lucene/document/ShapeDocValuesField.java:
##
@@ -0,0 +1,896 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.document;
+
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.Comparator;
+import java.util.List;
+import org.apache.lucene.analysis.Analyzer;
+import org.apache.lucene.analysis.TokenStream;
+import org.apache.lucene.document.ShapeField.DecodedTriangle.TYPE;
+import org.apache.lucene.document.ShapeField.QueryRelation;
+import org.apache.lucene.document.SpatialQuery.EncodedRectangle;
+import org.apache.lucene.index.DocValuesType;
+import org.apache.lucene.index.IndexableFieldType;
+import org.apache.lucene.index.PointValues.Relation;
+import org.apache.lucene.search.Query;
+import org.apache.lucene.store.ByteArrayDataInput;
+import org.apache.lucene.store.ByteBuffersDataOutput;
+import org.apache.lucene.store.DataInput;
+import org.apache.lucene.util.ArrayUtil;
+import org.apache.lucene.util.BytesRef;
+
+/** A doc values field representation for {@link LatLonShape} and {@link XYShape} */
+public final class ShapeDocValuesField extends Field {
+  private final ShapeComparator shapeComparator;
+
+  private static final FieldType FIELD_TYPE = new FieldType();
+
+  static {
+FIELD_TYPE.setDocValuesType(DocValuesType.BINARY);
+FIELD_TYPE.setOmitNorms(true);
+FIELD_TYPE.freeze();
+  }
+
+  /**
+   * Creates a {@code ShapeDocValuesField} instance from a shape tessellation
+   *
+   * @param name The Field Name (must not be null)
+   * @param tessellation The tessellation (must not be null)
+   */
+  ShapeDocValuesField(String name, List<ShapeField.DecodedTriangle> tessellation) {
+super(name, FIELD_TYPE);
+BytesRef b = computeBinaryValue(tessellation);
+this.fieldsData = b;
+try {
+  this.shapeComparator = new ShapeComparator(b);
+} catch (IOException e) {
+  throw new IllegalArgumentException("unable to read binary shape doc value field. ", e);
+}
+  }
+
+  /** Creates a {@code ShapeDocValue} field from a given serialized value */
+  ShapeDocValuesField(String name, BytesRef binaryValue) {
+super(name, FIELD_TYPE);
+this.fieldsData = binaryValue;
+try {
+  this.shapeComparator = new ShapeComparator(binaryValue);
+} catch (IOException e) {
+  throw new IllegalArgumentException("unable to read binary shape doc value field. ", e);
+}
+  }
+
+  /** The name of the field */
+  @Override
+  public String name() {
+return name;
+  }
+
+  /** Gets the {@code IndexableFieldType} for this ShapeDocValue field */
+  @Override
+  public IndexableFieldType fieldType() {
+return FIELD_TYPE;
+  }
+
+  /** Currently there is no string representation for the ShapeDocValueField */
+  @Override
+  public String stringValue() {
+return null;
+  }
+
+  /** TokenStreams are not yet supported */
+  @Override
+  public TokenStream tokenStream(Analyzer analyzer, TokenStream reuse) {
+return null;
+  }
+
+  /** create a shape docvalue field from indexable fields */
+  public static ShapeDocValuesField createDocValueField(String fieldName, Field[] indexableFields) {
+ArrayList<ShapeField.DecodedTriangle> tess = new ArrayList<>(indexableFields.length);
+final byte[] scratch = new byte[7 * Integer.BYTES];
+for (Field f : indexableFields) {
+  BytesRef br = f.binaryValue();
+  assert br.length == 7 * ShapeField.BYTES;
+  System.arraycopy(br.bytes, br.offset, scratch, 0, 7 * ShapeField.BYTES);
+  ShapeField.DecodedTriangle t = new ShapeField.DecodedTriangle();
+  ShapeField.decodeTriangle(scratch, t);
+  tess.add(t);
+}
+return new ShapeDocValuesField(fieldName, tess);
+  }
+
+  /** Returns the number of terms (tessellated triangles) for this shape */
+  public int numberOfTerms() {
+return shapeComparator.numberOfTerms();
+  }
+
+  /** Creates a geometry query for shape docvalues */
+  public static Query newGeometryQuery(
+  final String field, final QueryRelation relation, Object... geometries) {
+return null;
+// TODO
+//  return new ShapeDocValuesQuery(field, relation, 

[jira] [Commented] (LUCENE-10662) Make LuceneTestCase to not extend from org.junit.Assert

2022-07-26 Thread Marios Trivyzas (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17571529#comment-17571529
 ] 

Marios Trivyzas commented on LUCENE-10662:
--

[~dweiss] Thx! Check out how it looks without the renaming: 
https://github.com/apache/lucene/pull/1049/commits/7b71302c915bc81d9d29ad49f1e917c219ee

> Make LuceneTestCase to not extend from org.junit.Assert
> ---
>
> Key: LUCENE-10662
> URL: https://issues.apache.org/jira/browse/LUCENE-10662
> Project: Lucene - Core
>  Issue Type: Test
>  Components: general/test
>Reporter: Marios Trivyzas
>Priority: Major
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Since *LuceneTestCase* is a very useful abstract class that can be extended 
> and used by many projects, having it extending *org.junit.Assert* limits all 
> users to exclusively use the static methods of {*}org.junit.Assert{*}. In our 
> project we want to use [https://joel-costigliola.github.io/assertj] where the 
> main method to call is *org.assertj.core.api.Assertions.assertThat* which 
> conflicts with the deprecated {*}org.junit.Assert.assertThat{*}, recognized 
> by default by the compiler. So one can only use assertj by writing the 
> fully qualified name of the *assertThat* method on every call, i.e.
>  
> {code:java}
> org.assertj.core.api.Assertions.assertThat(myObj.name()).isEqualTo(expectedName)
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)




[jira] [Commented] (LUCENE-10662) Make LuceneTestCase to not extend from org.junit.Assert

2022-07-26 Thread Dawid Weiss (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17571469#comment-17571469
 ] 

Dawid Weiss commented on LUCENE-10662:
--

I think the compiler should be able to pick the most specific variant based on 
argument types, unless there really is ambiguity - I admit I haven't checked 
whether this is the case, for example here:

https://github.com/apache/lucene/pull/1049/files#diff-334836e7b61b74a76eec5aa18eacec6b14c1496f5595b684842ce05583a6df22L209-R213
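
The "most specific variant" rule mentioned above can be illustrated with a minimal, self-contained sketch. The class and method names below are hypothetical and are not taken from the PR. Note that the rule only chooses among methods that are already candidates for the call, so the sketch does not by itself settle whether the linked case compiles.

```java
public class OverloadSketch {
  // Two overloads with the same name and arity. String is a subtype of Object,
  // so the String variant is the more specific one.
  static String describe(Object value) {
    return "Object";
  }

  static String describe(String value) {
    return "String";
  }

  // A different arity never competes with the one-argument variants.
  static String describe(Object value, String reason) {
    return "Object+reason";
  }

  public static void main(String[] args) {
    System.out.println(describe("abc"));          // String
    System.out.println(describe(42));             // Object (the int is boxed to Integer)
    System.out.println(describe((Object) "abc")); // Object (the static type decides)
    System.out.println(describe("abc", "why"));   // Object+reason
  }
}
```

The choice is made at compile time from the static argument types, which is why the cast in the third call changes the result.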




[GitHub] [lucene-jira-archive] mocobeta commented on issue #3: Create mapping on Jira user id -> GitHub account

2022-07-26 Thread GitBox


mocobeta commented on issue #3:
URL: 
https://github.com/apache/lucene-jira-archive/issues/3#issuecomment-1195564497

   I'll try to improve the candidate generation and verification steps, maybe 
next week.





[jira] [Updated] (LUCENE-10662) Make LuceneTestCase to not extend from org.junit.Assert

2022-07-26 Thread Marios Trivyzas (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marios Trivyzas updated LUCENE-10662:
-
Summary: Make LuceneTestCase to not extend from org.junit.Assert  (was: 
Make LuceneTestCase not extending from org.junit.Assert)




[jira] [Comment Edited] (LUCENE-10662) Make LuceneTestCase not extending from org.junit.Assert

2022-07-26 Thread Marios Trivyzas (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17571453#comment-17571453
 ] 

Marios Trivyzas edited comment on LUCENE-10662 at 7/26/22 2:14 PM:
---

{quote}I wouldn't rename any methods (assertEquals becomes assertEquality) - 
this will be even more confusing for downstream users. I'd remove the extend 
and assertEquals* methods from LuceneTestCase and move those methods into a 
separate class (like LuceneAssertions or something) - then the upgrade would be 
about importing them statically from junit's Assert or LuceneAssertions.
{quote}
I don't get how we can resolve a few issues: for example the *private void 
assertEquals(Sort a, Sort b)* in {*}TestSort{*}, if it remains like that and we 
also *import static org.junit.Assert.assertEquals* in the same class, the 
compiler doesn't know which one to use unless we call *Assert.assertEquals()* 
everywhere else to actually use the junit one.

 

The most important point is what you mentioned about all the projects that 
use {*}LuceneTestCase{*}, so let's see what other people also think about this.
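
The naming conflict described above can be sketched in plain Java under some assumptions: `Asserts` is a hypothetical stand-in for org.junit.Assert, and `StringBuilder` stands in for `Sort`. When the assertion methods are inherited, a locally declared `assertEquals` is just one more overload. When they are not inherited, a method declared in the class shadows a statically imported method of the same name, so unqualified calls only see the local declaration.

```java
public class ShadowingSketch {
  /** Hypothetical stand-in for org.junit.Assert. */
  static class Asserts {
    static String assertEquals(Object expected, Object actual) {
      return "generic";
    }
  }

  /** Inheritance layout: inherited and local methods are overloads of one another. */
  static class InheritingTest extends Asserts {
    static String assertEquals(StringBuilder a, StringBuilder b) {
      return "local";
    }

    static String run() {
      // Both variants are members of this class, so overload resolution
      // picks Asserts.assertEquals(Object, Object) for two ints.
      return assertEquals(1, 1);
    }
  }

  /** No inheritance: the local assertEquals is the only unqualified candidate. */
  static class StandaloneTest {
    static String assertEquals(StringBuilder a, StringBuilder b) {
      return "local";
    }

    static String run() {
      // An unqualified assertEquals(1, 1) would not compile here, even with a
      // static import of Asserts.assertEquals, because the member declared
      // above shadows the import. Qualifying the call avoids the problem.
      return Asserts.assertEquals(1, 1);
    }
  }

  public static void main(String[] args) {
    System.out.println(InheritingTest.run()); // generic
    System.out.println(StandaloneTest.run()); // generic
  }
}
```

Under these assumptions, the options are to rename the local helper, qualify the junit calls, or move the helper into a separate class.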

 


was (Author: matriv):
{quote}

I wouldn't rename any methods (assertEquals becomes assertEquality) - this will 
be even more confusing for downstream users. I'd remove the extend and 
assertEquals* methods from LuceneTestCase and move those methods into a 
separate class (like LuceneAssertions or something) - then the upgrade would be 
about importing them statically from junit's Assert or LuceneAssertions.

{quote}

I don't get how we can resolve a few issues: for example the *private void 
assertEquals(Sort a, Sort b)* in {*}TestSort{*}, if it remains like that and we 
also *import static org.junit.Assert.assertEquals* in the same class, the 
compiler doesn't know which one to use unless we call *Assert.assertEquals()* 
everywhere else to actually use the junit one.

 

The most important point is what you mentioned about all the projects that use 
{*}LuceneTestCase{*}, so let's see what other people also think about this.

 




[jira] [Commented] (LUCENE-10662) Make LuceneTestCase not extending from org.junit.Assert

2022-07-26 Thread Marios Trivyzas (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17571453#comment-17571453
 ] 

Marios Trivyzas commented on LUCENE-10662:
--

{quote}

I wouldn't rename any methods (assertEquals becomes assertEquality) - this will 
be even more confusing for downstream users. I'd remove the extend and 
assertEquals* methods from LuceneTestCase and move those methods into a 
separate class (like LuceneAssertions or something) - then the upgrade would be 
about importing them statically from junit's Assert or LuceneAssertions.

{quote}

I don't get how we can resolve a few issues: for example the *private void 
assertEquals(Sort a, Sort b)* in {*}TestSort{*}, if it remains like that and we 
also *import static org.junit.Assert.assertEquals* in the same class, the 
compiler doesn't know which one to use unless we call *Assert.assertEquals()* 
everywhere else to actually use the junit one.

 

The most important point is what you mentioned about all the projects that use 
{*}LuceneTestCase{*}, so let's see what other people also think about this.

 




[jira] [Commented] (LUCENE-10662) Make LuceneTestCase not extending from org.junit.Assert

2022-07-26 Thread Dawid Weiss (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17571418#comment-17571418
 ] 

Dawid Weiss commented on LUCENE-10662:
--

Changing these methods will require a huge follow-up and cleanup in any other 
project that uses LuceneTestCase (and there are many). I don't think people 
will be happy with it (even though my heart is with you on assertj - I also 
prefer it to what's in hamcrest/junit). 

Even if people agree to change it, looking at the patch, I wouldn't rename any 
methods (assertEquals becomes assertEquality) - this will be even more 
confusing for downstream users. I'd remove the extend and assertEquals* methods 
from LuceneTestCase and move those methods into a separate class (like 
LuceneAssertions or something) - then the upgrade would be about importing them 
statically from junit's Assert or LuceneAssertions.

Again, I'm not convinced this is a necessary improvement. I've lived with an 
explicit Assertions.* call from assertj - this is fine and explicit. And even 
used within Lucene code itself:

[https://github.com/apache/lucene/blob/main/lucene/distribution.tests/src/test/org/apache/lucene/distribution/TestModularLayer.java#L117]




[GitHub] [lucene-jira-archive] mikemccand commented on issue #1: Fix markup conversion error

2022-07-26 Thread GitBox


mikemccand commented on issue #1:
URL: 
https://github.com/apache/lucene-jira-archive/issues/1#issuecomment-1195457720

   > GitHub won't accept labels such as `legacy-jira-label:java11` for some 
reason
   
   That's really weird ;)
   
   I was able to apply the label to [this 
issue](https://github.com/apache/lucene-jira-archive/issues/78) through the 
GitHub web UI.  Not sure why import API would fail on it!





[GitHub] [lucene-jira-archive] mocobeta commented on issue #1: Fix markup conversion error

2022-07-26 Thread GitBox


mocobeta commented on issue #1:
URL: 
https://github.com/apache/lucene-jira-archive/issues/1#issuecomment-1195451400

   The rehearsal failed - 52 issues could not be imported because of errors. 
The errors come from recent changes in the conversion script (for example, 
GitHub won't accept labels such as `legacy-jira-label:java11` for some reason). 
I'll investigate the errors and retry.





[GitHub] [lucene-jira-archive] mocobeta commented on pull request #64: Cover all Jira components in module label mapping

2022-07-26 Thread GitBox


mocobeta commented on PR #64:
URL: 
https://github.com/apache/lucene-jira-archive/pull/64#issuecomment-1195443803

   Looks like this causes import errors for some issues.
   ```
   [2022-07-26 18:53:05,540] ERROR:import_github_issues: Import GitHub issue 
/mnt/hdd/repo/lucene-jira-archive/migration/github-import-data/GH-LUCENE-9500.json
 was failed. status=failed, errors=[{'location': '/issue/labels[11]', 
'resource': 'Label', 'field': 'name', 'value': 'legacy-jira-label:java11', 
'code': 'invalid'}]
   ```





[GitHub] [lucene] matriv commented on pull request #1049: LUCENE-10662 Make LuceneTestCase to not extend from org.junit.Assert

2022-07-26 Thread GitBox


matriv commented on PR #1049:
URL: https://github.com/apache/lucene/pull/1049#issuecomment-1195437459

   - 4904fedef1a3e0ca0a67f8f0db0961b09db51f30 Renames some methods to avoid 
naming conflicts
   - b9fe0008b10ecff6b29feb3b61250ba343a1b1bd Removes `extends Assert` from 
`LuceneTestCase` and adds static imports of `org.junit.Assert.xxx` everywhere





[jira] [Commented] (LUCENE-10583) Deadlock with MMapDirectory while waitForMerges

2022-07-26 Thread Michael McCandless (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17571393#comment-17571393
 ] 

Michael McCandless commented on LUCENE-10583:
-

We could perhaps make a best effort to detect on common incoming APIs that 
external locks are not already held on {{Directory}} and {{IndexWriter}}?
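
A best-effort check of this kind could be sketched with `Thread.holdsLock`, which reports whether the current thread holds an object's monitor. Everything else below (the class name, `checkNoExternalLock`, the plain `Object` standing in for a `Directory`) is a hypothetical illustration, not an existing Lucene API. The check cannot tell an application thread from an internal one, so it would only make sense at an outermost public entry point, before Lucene takes any of its own locks.

```java
public class ExternalLockCheckSketch {
  /** Hypothetical entry-point guard: fail fast if the caller already holds the monitor. */
  static void checkNoExternalLock(Object resource, String what) {
    if (Thread.holdsLock(resource)) {
      throw new IllegalStateException("caller already holds the monitor of " + what);
    }
  }

  /** Returns true if the guard rejects the call, false if it lets it through. */
  static boolean rejects(Object resource, boolean lockFirst) {
    try {
      if (lockFirst) {
        // Mimics application code that synchronizes on the directory
        // before calling into the library.
        synchronized (resource) {
          checkNoExternalLock(resource, "directory");
        }
      } else {
        checkNoExternalLock(resource, "directory");
      }
      return false;
    } catch (IllegalStateException e) {
      return true;
    }
  }

  public static void main(String[] args) {
    Object directory = new Object(); // stand-in for a Directory instance
    System.out.println(rejects(directory, true));  // true
    System.out.println(rejects(directory, false)); // false
  }
}
```

In the reported stack trace the application thread holds the `MMapDirectory` monitor while calling `IndexWriter.close`, which is the situation such a guard would flag.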

> Deadlock with MMapDirectory while waitForMerges
> ---
>
> Key: LUCENE-10583
> URL: https://issues.apache.org/jira/browse/LUCENE-10583
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Affects Versions: 8.11.1
> Environment: Java 17
> OS: Windows 2016
>Reporter: Thomas Hoffmann
>Priority: Minor
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Hello,
> a deadlock situation happened in our application. We are using MMapDirectory 
> on Windows 2016 and got the following stacktrace:
> {code:java}
> "https-openssl-nio-443-exec-30" #166 daemon prio=5 os_prio=0 cpu=78703.13ms 
> elapsed=81248.18s tid=0x2860af10 nid=0x237c in Object.wait()  
> [0x413fc000]
>    java.lang.Thread.State: TIMED_WAITING (on object monitor)
>     at java.lang.Object.wait(java.base@17.0.2/Native Method)
>     - waiting on 
>     at org.apache.lucene.index.IndexWriter.doWait(IndexWriter.java:4983)
>     - locked <0x0006ef1fc020> (a org.apache.lucene.index.IndexWriter)
>     at 
> org.apache.lucene.index.IndexWriter.waitForMerges(IndexWriter.java:2697)
>     - locked <0x0006ef1fc020> (a org.apache.lucene.index.IndexWriter)
>     at org.apache.lucene.index.IndexWriter.shutdown(IndexWriter.java:1236)
>     at org.apache.lucene.index.IndexWriter.close(IndexWriter.java:1278)
>     at 
> com.speed4trade.ebs.module.search.SearchService.updateSearchIndex(SearchService.java:1723)
>     - locked <0x0006d5c00208> (a org.apache.lucene.store.MMapDirectory)
>     at 
> com.speed4trade.ebs.module.businessrelations.ticket.TicketChangedListener.postUpdate(TicketChangedListener.java:142)
> ...{code}
> All threads were waiting to lock <0x0006d5c00208> which never got 
> released.
> A lucene thread was also blocked, I don't know if this is relevant:
> {code:java}
> "Lucene Merge Thread #0" #18466 daemon prio=5 os_prio=0 cpu=15.63ms 
> elapsed=3499.07s tid=0x459453e0 nid=0x1f8 waiting for monitor entry  
> [0x5da9e000]
>    java.lang.Thread.State: BLOCKED (on object monitor)
>     at 
> org.apache.lucene.store.FSDirectory.deletePendingFiles(FSDirectory.java:346)
>     - waiting to lock <0x0006d5c00208> (a 
> org.apache.lucene.store.MMapDirectory)
>     at 
> org.apache.lucene.store.FSDirectory.maybeDeletePendingFiles(FSDirectory.java:363)
>     at org.apache.lucene.store.FSDirectory.createOutput(FSDirectory.java:248)
>     at 
> org.apache.lucene.store.LockValidatingDirectoryWrapper.createOutput(LockValidatingDirectoryWrapper.java:44)
>     at 
> org.apache.lucene.index.ConcurrentMergeScheduler$1.createOutput(ConcurrentMergeScheduler.java:289)
>     at 
> org.apache.lucene.store.TrackingDirectoryWrapper.createOutput(TrackingDirectoryWrapper.java:43)
>     at 
> org.apache.lucene.codecs.compressing.CompressingStoredFieldsWriter.<init>(CompressingStoredFieldsWriter.java:121)
>     at 
> org.apache.lucene.codecs.compressing.CompressingStoredFieldsFormat.fieldsWriter(CompressingStoredFieldsFormat.java:130)
>     at 
> org.apache.lucene.codecs.lucene87.Lucene87StoredFieldsFormat.fieldsWriter(Lucene87StoredFieldsFormat.java:141)
>     at 
> org.apache.lucene.index.SegmentMerger.mergeFields(SegmentMerger.java:227)
>     at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:105)
>     at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4757)
>     at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:4361)
>     at 
> org.apache.lucene.index.IndexWriter$IndexWriterMergeSource.merge(IndexWriter.java:5920)
>     at 
> org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:626)
>     at 
> org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:684){code}
> It looks like the merge operation never finished and released the lock.
> Is there any option to prevent this deadlock or how to investigate it further?
> A load-test didn't show this problem unfortunately.






[GitHub] [lucene-jira-archive] mocobeta commented on issue #3: Create mapping on Jira user id -> GitHub account

2022-07-26 Thread GitBox


mocobeta commented on issue #3:
URL: 
https://github.com/apache/lucene-jira-archive/issues/3#issuecomment-1195419311

   > Could we expand the matching so that if the userid in jira == the userid 
in GitHub we strongly suggest a match? E.g. mdmarshmallow would have been 
matched this way.
   
   It'd be easy to pick up such candidates - I think we'd need to manually 
verify all of them.





[jira] [Commented] (LUCENE-10583) Deadlock with MMapDirectory while waitForMerges

2022-07-26 Thread Michael McCandless (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17571383#comment-17571383
 ] 

Michael McCandless commented on LUCENE-10583:
-

[~vigyas] can this be resolved now?




[GitHub] [lucene-jira-archive] mikemccand commented on issue #3: Create mapping on Jira user id -> GitHub account

2022-07-26 Thread GitBox


mikemccand commented on issue #3:
URL: 
https://github.com/apache/lucene-jira-archive/issues/3#issuecomment-1195411445

   OK got it.
   
   Could we expand the matching so that if the userid in jira == the userid in 
GitHub we strongly suggest a match?  E.g. `mdmarshmallow` would have been 
matched this way.
   
   Hmm, actually, his presented name (`Marc D'mello`) looks the same [in 
GitHub](https://github.com/mdmarshmallow) and 
[Jira](https://issues.apache.org/jira/secure/ViewProfile.jspa?name=mdmarshmallow).
  Oh, wait, no!  One is `Marc D'mello` and the other is `Marc D'Mello` (m vs 
M).  Maybe we can do a case insensitive comparison?
   
   But I'll push his account to the verified file separately.
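
   The two matching rules suggested above (identical user ids, or full names that differ only in case) can be sketched as follows. The class and method names are hypothetical and this is not the migration script's actual code; the sample values are the ones quoted in this thread. Any match would still be only a candidate to verify by hand.

```java
public class AccountMatchSketch {
  /** Exact match on user id, as in "userid in jira == userid in GitHub". */
  static boolean sameUserId(String jiraId, String githubId) {
    return jiraId.equals(githubId);
  }

  /** Case-insensitive match on the displayed full name, ignoring outer whitespace. */
  static boolean sameFullName(String jiraName, String githubName) {
    return jiraName.trim().equalsIgnoreCase(githubName.trim());
  }

  /** Either rule is enough to propose a candidate mapping. */
  static boolean isCandidate(
      String jiraId, String jiraName, String githubId, String githubName) {
    return sameUserId(jiraId, githubId) || sameFullName(jiraName, githubName);
  }

  public static void main(String[] args) {
    System.out.println("Marc D'mello".equals("Marc D'Mello"));        // false
    System.out.println(sameFullName("Marc D'mello", "Marc D'Mello")); // true
    System.out.println(sameUserId("mdmarshmallow", "mdmarshmallow")); // true
  }
}
```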





[GitHub] [lucene-jira-archive] mocobeta commented on issue #3: Create mapping on Jira user id -> GitHub account

2022-07-26 Thread GitBox


mocobeta commented on issue #3:
URL: 
https://github.com/apache/lucene-jira-archive/issues/3#issuecomment-1195406408

   Strictly speaking, the current "verified" account mapping includes both 
committers and commit authors. "Commit authors" can be committers or 
contributors.
   
   ```
   4. Verify the candidate GitHub accounts by checking  if (1) the GitHub 
account has push access to [apache/lucene 
repository](https://github.com/apache/lucene), or (2) the GitHub account has 
been logged as commit author in the repo's commit history at least once.
   ```





[GitHub] [lucene-jira-archive] mocobeta commented on issue #3: Create mapping on Jira user id -> GitHub account

2022-07-26 Thread GitBox


mocobeta commented on issue #3:
URL: 
https://github.com/apache/lucene-jira-archive/issues/3#issuecomment-1195399616

   "Authors" are not necessarily committers; they are literally pull request 
authors (contributors).
   For example 
https://github.com/apache/lucene/commit/2cf12b8cdcc629617b2d58c0a2a6336679ff9249





[GitHub] [lucene-jira-archive] mikemccand commented on issue #3: Create mapping on Jira user id -> GitHub account

2022-07-26 Thread GitBox


mikemccand commented on issue #3:
URL: 
https://github.com/apache/lucene-jira-archive/issues/3#issuecomment-1195397098

   > We already include merged pull requests' authors (if their GitHub full 
names are set to the same string as Jira full names).
   Maybe we could also consider all opened pull requests' authors.
   
   OK thanks, but does this only work for committers?
   
   I was thinking if a contributor who is not a committer comments on a Jira 
issue and also opens a PR, linked to the issue, we could maybe correlate those 
two events to speculate about ID mapping.  And then verify by hand after.
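The correlation idea can be sketched roughly as follows. All names and data structures here are hypothetical (the real tooling may look nothing like this): for each Jira issue with a linked PR, pair every Jira commenter with every PR author and count how often each pair co-occurs, so a human can review the most frequent pairs first.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

final class IdCorrelationSketch {
  // Counts co-occurrences of (Jira commenter, GitHub PR author) on the same issue.
  // Keys of both maps are issue keys; the result maps "jiraId -> githubId" to a count.
  static Map<String, Integer> candidatePairs(
      Map<String, Set<String>> jiraCommentersByIssue,
      Map<String, Set<String>> prAuthorsByIssue) {
    Map<String, Integer> counts = new TreeMap<>();
    for (Map.Entry<String, Set<String>> entry : prAuthorsByIssue.entrySet()) {
      Set<String> commenters =
          jiraCommentersByIssue.getOrDefault(entry.getKey(), Collections.emptySet());
      for (String githubId : entry.getValue()) {
        for (String jiraId : commenters) {
          counts.merge(jiraId + " -> " + githubId, 1, Integer::sum);
        }
      }
    }
    return counts;
  }

  public static void main(String[] args) {
    // Made-up placeholder IDs, not real accounts.
    Map<String, Set<String>> jira = new TreeMap<>();
    jira.put("ISSUE-1", Set.of("jira-a", "jira-b"));
    jira.put("ISSUE-2", Set.of("jira-a"));
    Map<String, Set<String>> prs = new TreeMap<>();
    prs.put("ISSUE-1", Set.of("gh-a"));
    prs.put("ISSUE-2", Set.of("gh-a"));
    System.out.println(candidatePairs(jira, prs));
  }
}
```

The counts are only a ranking signal for speculation; every pair would still be verified by hand afterwards.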





[GitHub] [lucene-jira-archive] mocobeta commented on issue #3: Create mapping on Jira user id -> GitHub account

2022-07-26 Thread GitBox


mocobeta commented on issue #3:
URL: 
https://github.com/apache/lucene-jira-archive/issues/3#issuecomment-1195389801

   We already include merged pull requests' authors (if their GitHub full names 
are set to the same string as Jira full names).
   Maybe we could also consider all opened pull requests' authors.





[jira] [Commented] (LUCENE-10151) Add timeout support to IndexSearcher

2022-07-26 Thread Michael Sokolov (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17571374#comment-17571374
 ] 

Michael Sokolov commented on LUCENE-10151:
--

Oh, too bad. Well, this feature is new, so at least no existing usage will be 
broken.

> Add timeout support to IndexSearcher
> 
>
> Key: LUCENE-10151
> URL: https://issues.apache.org/jira/browse/LUCENE-10151
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/search
>Reporter: Greg Miller
>Priority: Minor
> Fix For: 9.3
>
>  Time Spent: 3h 40m
>  Remaining Estimate: 0h
>
> I'd like to explore adding optional "timeout" capabilities to 
> {{IndexSearcher}}. This would enable users to (optionally) specify a maximum 
> time budget for search execution. If the search "times out", partial results 
> would be available.
> This idea originated on the dev list (thanks [~jpountz] for the suggestion). 
> Thread for reference: 
> [http://mail-archives.apache.org/mod_mbox/lucene-dev/202110.mbox/%3CCAL8PwkZdNGmYJopPjeXYK%3DF7rvLkWon91UEXVxMM4MeeJ3UHxQ%40mail.gmail.com%3E]
>  
> A couple things to watch out for with this change:
>  # We want to make sure it's robust to a two-phase query evaluation scenario 
> where the "approximate" step matches a large number of candidates but the 
> "confirmation" step matches very few (or none). This is a particularly tricky 
> case.
>  # We want to make sure the {{TotalHits#Relation}} reported by {{TopDocs}} is 
> {{GREATER_THAN_OR_EQUAL_TO}} if the query times out
>  # We want to make sure it plays nice with the {{LRUCache}} since it iterates 
> the query to pre-populate a {{BitSet}} when caching. That step shouldn't be 
> allowed to overrun the timeout. The proper way to handle this probably needs 
> some thought.






[GitHub] [lucene] jpountz commented on a diff in pull request #1039: LUCENE-10635: Ensure test coverage for WANDScorer by using a test query

2022-07-26 Thread GitBox


jpountz commented on code in PR #1039:
URL: https://github.com/apache/lucene/pull/1039#discussion_r929867414


##
lucene/core/src/test/org/apache/lucene/search/TestWANDScorer.java:
##
@@ -815,7 +856,7 @@ private void doTestRandomSpecialMaxScore(float maxScore) throws IOException {
 }
 builder.add(query, Occur.SHOULD);
   }
-  Query query = builder.build();
+  Query query = numClauses > 0 ? new WANDScorerQuery(builder.build()) : builder.build();

Review Comment:
   Maybe we could instead handle it in `WANDScorerQuery` by returning the single 
scorer when there is a single clause?



##
lucene/core/src/test/org/apache/lucene/search/TestWANDScorer.java:
##
@@ -947,4 +988,82 @@ public long cost() {
   };
 }
   }
+
+  private static class WANDScorerQuery extends Query {
+private final BooleanQuery query;
+
+private WANDScorerQuery(BooleanQuery query) {

Review Comment:
   I wonder if it would make the tests easier to read if we took an array of 
queries here:
   
   ```suggestion
   private WANDScorerQuery(Query... query) {
   ```
   
   while still creating a `BooleanQuery` under the hood to reuse 
equals/hashcode/etc.






[GitHub] [lucene] jpountz commented on pull request #1042: Cache decoded length bytes for TFIDFSimilarity scorer.

2022-07-26 Thread GitBox


jpountz commented on PR #1042:
URL: https://github.com/apache/lucene/pull/1042#issuecomment-1195379555

   Thanks @wuwm!





[GitHub] [lucene] jpountz merged pull request #1042: Cache decoded length bytes for TFIDFSimilarity scorer.

2022-07-26 Thread GitBox


jpountz merged PR #1042:
URL: https://github.com/apache/lucene/pull/1042





[jira] [Commented] (LUCENE-10151) Add timeout support to IndexSearcher

2022-07-26 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17571370#comment-17571370
 ] 

Adrien Grand commented on LUCENE-10151:
---

I just noticed that the push of my backport had failed, so it will be in 9.4, 
not 9.3. I don't think it's worth respinning for it.




[jira] [Updated] (LUCENE-10660) precompute the max level in LogMergePolicy

2022-07-26 Thread Adrien Grand (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand updated LUCENE-10660:
--
Fix Version/s: 9.4
   (was: 9.3)

> precompute the max level in LogMergePolicy
> --
>
> Key: LUCENE-10660
> URL: https://issues.apache.org/jira/browse/LUCENE-10660
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Affects Versions: 9.2
>Reporter: tang donghai
>Priority: Minor
> Fix For: 9.4
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> I noticed that LogMergePolicy#findMerges always recalculates the max level on 
> the right side when finding the next segments to merge.
>  
> I think we could calculate the max levels only once, and whenever we need the 
> max level, we could simply look it up:
> {code:java}
> float maxLevel = maxLevels[start];
> {code}
> The precomputation looks like the code below, which compares each level in 
> levels from right to left:
> {code:java}
> float[] maxLevels = new float[numMergeableSegments + 1];
> maxLevels[numMergeableSegments] = -1.0f;
> for (int i = numMergeableSegments - 1; i >= 0; i--) {
>   maxLevels[i] = Math.max(levels.get(i).level, maxLevels[i + 1]);
> }
> {code}
>  
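To make the precomputation above concrete, here is a minimal, self-contained sketch of the same suffix-maximum idea. The class name and the plain `float[]` input are assumptions for illustration only; the actual change works on the merge policy's own list of segment levels.

```java
import java.util.Arrays;

final class SuffixMaxSketch {
  // maxLevels[i] is the maximum of levels[i..n-1]; maxLevels[n] is a -1.0f sentinel,
  // mirroring the snippet quoted in the issue.
  static float[] suffixMax(float[] levels) {
    int n = levels.length;
    float[] maxLevels = new float[n + 1];
    maxLevels[n] = -1.0f;
    for (int i = n - 1; i >= 0; i--) {
      maxLevels[i] = Math.max(levels[i], maxLevels[i + 1]);
    }
    return maxLevels;
  }

  public static void main(String[] args) {
    float[] levels = {3.0f, 1.0f, 2.5f, 0.5f};
    // Prints [3.0, 2.5, 2.5, 0.5, -1.0]
    System.out.println(Arrays.toString(suffixMax(levels)));
  }
}
```

With the array built once in a single right-to-left pass, each later lookup of "the max level from `start` to the end" is a constant-time array read instead of a rescan.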






[jira] [Commented] (LUCENE-10660) precompute the max level in LogMergePolicy

2022-07-26 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17571369#comment-17571369
 ] 

Adrien Grand commented on LUCENE-10660:
---

The change made sense to me and I merged it, thank you [~tangdh]!

> precompute the max level in LogMergePolicy
> --
>
> Key: LUCENE-10660
> URL: https://issues.apache.org/jira/browse/LUCENE-10660
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Affects Versions: 9.2
>Reporter: tang donghai
>Priority: Minor
>  Time Spent: 20m
>  Remaining Estimate: 0h
>



[jira] [Resolved] (LUCENE-10660) precompute the max level in LogMergePolicy

2022-07-26 Thread Adrien Grand (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand resolved LUCENE-10660.
---
Fix Version/s: 9.3
   Resolution: Fixed

> precompute the max level in LogMergePolicy
> --
>
> Key: LUCENE-10660
> URL: https://issues.apache.org/jira/browse/LUCENE-10660
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Affects Versions: 9.2
>Reporter: tang donghai
>Priority: Minor
> Fix For: 9.3
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>



[jira] [Commented] (LUCENE-10151) Add timeout support to IndexSearcher

2022-07-26 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17571367#comment-17571367
 ] 

ASF subversion and git services commented on LUCENE-10151:
--

Commit be81cd79346e869da94d9db89e1b863bfaabbd65 in lucene's branch 
refs/heads/branch_9x from Adrien Grand
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=be81cd79346 ]

LUCENE-10151: Some fixes to query timeouts. (#996)

I noticed some minor bugs in the original PR #927 that this PR should fix:
 - When a timeout is set, we would no longer catch
   `CollectionTerminatedException`.
 - I added randomization to `LuceneTestCase` to randomly set a timeout, it
   would have caught the above bug.
 - Fixed visibility of `TimeLimitingBulkScorer`.




[GitHub] [lucene] jpountz merged pull request #1045: LUCENE-10660: precompute maxlevel in LogMergePolicy

2022-07-26 Thread GitBox


jpountz merged PR #1045:
URL: https://github.com/apache/lucene/pull/1045





[GitHub] [lucene] jpountz commented on a diff in pull request #1047: LUCENE-10661: Reduce memory copy in BytesStore

2022-07-26 Thread GitBox


jpountz commented on code in PR #1047:
URL: https://github.com/apache/lucene/pull/1047#discussion_r929854326


##
lucene/core/src/java/org/apache/lucene/util/fst/BytesStore.java:
##
@@ -179,6 +179,30 @@ void writeBytes(long dest, byte[] b, int offset, int len) {
 }
   }
 
+  @Override
+  public void copyBytes(DataInput input, long numBytes) throws IOException {
+assert numBytes >= 0 : "numBytes=" + numBytes;
+assert input != null;
+int len = (int) numBytes;

Review Comment:
   We could make `len` a long and avoid the unchecked cast, couldn't we?
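One way to avoid the unchecked cast is to keep the remaining byte count as a `long` and narrow only the per-chunk size, which is bounded by the buffer length. The sketch below uses plain `java.io` streams and hypothetical names as stand-ins; it is not the `BytesStore` code from the PR.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;

final class LongCopySketch {
  // Copies exactly numBytes from in to out without ever casting numBytes itself to int.
  static void copyBytes(InputStream in, OutputStream out, long numBytes, int chunkSize)
      throws IOException {
    if (numBytes < 0 || chunkSize <= 0) {
      throw new IllegalArgumentException("numBytes=" + numBytes + " chunkSize=" + chunkSize);
    }
    byte[] buffer = new byte[chunkSize];
    long remaining = numBytes;
    while (remaining > 0) {
      // This narrowing cast is safe: the value is at most buffer.length.
      int chunk = (int) Math.min(remaining, buffer.length);
      int read = in.read(buffer, 0, chunk);
      if (read < 0) {
        throw new EOFException("stream ended with " + remaining + " bytes left");
      }
      out.write(buffer, 0, read);
      remaining -= read;
    }
  }

  // Convenience wrapper for in-memory data; wraps the checked exception.
  static byte[] copyPrefix(byte[] source, long numBytes, int chunkSize) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    try {
      copyBytes(new ByteArrayInputStream(source), out, numBytes, chunkSize);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return out.toByteArray();
  }

  public static void main(String[] args) {
    byte[] source = new byte[12];
    System.out.println(copyPrefix(source, 10L, 4).length);
  }
}
```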






[GitHub] [lucene] matriv opened a new pull request, #1049: [LUCENE-10662] Make LuceneTestCase to not extend from org.junit.Assert

2022-07-26 Thread GitBox


matriv opened a new pull request, #1049:
URL: https://github.com/apache/lucene/pull/1049

   ### Description (or a Jira issue link if you have one)
   
   https://issues.apache.org/jira/browse/LUCENE-10662





[jira] [Created] (LUCENE-10662) Make LuceneTestCase not extending from org.junit.Assert

2022-07-26 Thread Marios Trivyzas (Jira)
Marios Trivyzas created LUCENE-10662:


 Summary: Make LuceneTestCase not extending from org.junit.Assert
 Key: LUCENE-10662
 URL: https://issues.apache.org/jira/browse/LUCENE-10662
 Project: Lucene - Core
  Issue Type: Test
  Components: general/test
Reporter: Marios Trivyzas


Since *LuceneTestCase* is a very useful abstract class that can be extended and 
used by many projects, having it extend *org.junit.Assert* limits all users 
to using only the static methods of {*}org.junit.Assert{*}. In our project 
we want to use [https://joel-costigliola.github.io/assertj], where the main 
method to call is *org.assertj.core.api.Assertions.assertThat*. This conflicts 
with the deprecated {*}org.junit.Assert.assertThat{*}, which the compiler 
resolves by default. So one can only use AssertJ by writing the fully qualified 
name of the *assertThat* method on every call, e.g.

 
{code:java}
org.assertj.core.api.Assertions.assertThat(myObj.name()).isEqualTo(expectedName)
{code}






[jira] [Commented] (LUCENE-10592) Should we build HNSW graph on the fly during indexing

2022-07-26 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17571355#comment-17571355
 ] 

Adrien Grand commented on LUCENE-10592:
---

I just pushed an annotation that should show up in the next couple of days.

> Should we build HNSW graph on the fly during indexing
> -
>
> Key: LUCENE-10592
> URL: https://issues.apache.org/jira/browse/LUCENE-10592
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Mayya Sharipova
>Assignee: Mayya Sharipova
>Priority: Minor
> Fix For: 9.4
>
> Attachments: Screen Shot 2022-07-25 at 9.04.11 AM.png
>
>  Time Spent: 8h
>  Remaining Estimate: 0h
>
> Currently, when we index vectors for KnnVectorField, we buffer those vectors 
> in memory and, on flush during segment construction, we build an HNSW graph.  
> As building an HNSW graph is very expensive, this makes the flush operation 
> take a lot of time. This also makes overall indexing performance quite 
> unpredictable (as the number of flushes is determined by the memory used and 
> the presence of concurrent searches), e.g. some indexing operations return 
> almost instantly while others that trigger a flush take a lot of time. 
> Building an HNSW graph on the fly as we index vectors avoids this problem and 
> spreads the load of HNSW graph construction evenly during indexing.
> This will also supersede LUCENE-10194.






[GitHub] [lucene-jira-archive] mikemccand commented on issue #3: Create mapping on Jira user id -> GitHub account

2022-07-26 Thread GitBox


mikemccand commented on issue #3:
URL: 
https://github.com/apache/lucene-jira-archive/issues/3#issuecomment-1195354814

   Could we maybe look for Jira issues that have GitHub PRs attached and 
"correlate" the ids of who opened the PR against who commented on the issue?
   
   It would clearly not be perfect, but it could provide input for a human to 
sift through and carry over some verified accounts.





[GitHub] [lucene-jira-archive] mikemccand closed issue #27: Improve the `Jira Information` header?

2022-07-26 Thread GitBox


mikemccand closed issue #27: Improve the `Jira Information` header?
URL: https://github.com/apache/lucene-jira-archive/issues/27





[GitHub] [lucene-jira-archive] mikemccand commented on issue #27: Improve the `Jira Information` header?

2022-07-26 Thread GitBox


mikemccand commented on issue #27:
URL: 
https://github.com/apache/lucene-jira-archive/issues/27#issuecomment-1195343198

   I think this one is done!





[GitHub] [lucene-jira-archive] mikemccand closed issue #79: Carry parent issue over

2022-07-26 Thread GitBox


mikemccand closed issue #79: Carry parent issue over
URL: https://github.com/apache/lucene-jira-archive/issues/79





[GitHub] [lucene-jira-archive] mikemccand merged pull request #80: #79: include parent issue link

2022-07-26 Thread GitBox


mikemccand merged PR #80:
URL: https://github.com/apache/lucene-jira-archive/pull/80





[GitHub] [lucene] dweiss commented on issue #1048: Why lucene doc id changes after updating or merging?

2022-07-26 Thread GitBox


dweiss commented on issue #1048:
URL: https://github.com/apache/lucene/issues/1048#issuecomment-1195234738

   If you need constant IDs, use a stored document field. IDs are internal 
because they are used for per-segment document ordering and once segments are 
merged, any previous document IDs (they're actually an ordinal sequence) are 
discarded. The naming "ID" may be confusing - it's not a "global" identifier, 
it is only unique within a segment.
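As a rough illustration of why these IDs are not stable, the toy model below treats a segment as a plain list and a document's ID as its position in that list. This is a simplification with hypothetical names and does not use the Lucene API; it only shows how ordinals shift once segments are merged and deleted documents are dropped, while an application-level key stored with the document stays the same.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class DocIdSketch {
  // Toy "merge": concatenates two segments and drops the deleted document.
  // The position of a key in the returned list plays the role of the new doc ID.
  static List<String> merge(List<String> first, List<String> second, String deletedKey) {
    List<String> merged = new ArrayList<>();
    for (String key : first) {
      if (!key.equals(deletedKey)) {
        merged.add(key);
      }
    }
    for (String key : second) {
      if (!key.equals(deletedKey)) {
        merged.add(key);
      }
    }
    return merged;
  }

  public static void main(String[] args) {
    List<String> segment0 = Arrays.asList("user-1", "user-2");
    List<String> segment1 = Arrays.asList("user-3");
    // Before the merge, "user-3" has ordinal 0 within its own segment.
    System.out.println(segment1.indexOf("user-3"));
    // After the merge (with "user-1" deleted), its ordinal is different,
    // but the stored key "user-3" still identifies the document.
    List<String> merged = merge(segment0, segment1, "user-1");
    System.out.println(merged.indexOf("user-3"));
  }
}
```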





[GitHub] [lucene] dweiss closed issue #1048: Why lucene doc id changes after updating or merging?

2022-07-26 Thread GitBox


dweiss closed issue #1048: Why lucene doc id changes after updating or merging?
URL: https://github.com/apache/lucene/issues/1048





[jira] [Commented] (LUCENE-10616) Moving to dictionaries has made stored fields slower at skipping

2022-07-26 Thread fang hou (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17571217#comment-17571217
 ] 

fang hou commented on LUCENE-10616:
---

I think this PR [https://github.com/apache/lucene/pull/1003] is ready for 
review. As Adrien advised above, this PR changes the {{decompress}} signature 
to return an {{InputStream}} so that decompression can happen lazily. Rather 
than returning {{STOP}} in {{StoredFieldVisitor#needsField}} (I tried this but 
found it is probably not feasible because of multi-valued fields; see the test 
case), this PR makes the skip method smarter: it bypasses unneeded compressed 
blocks by reading the compressed block length. So for a large unneeded field, 
we can save a lot of decompression time. This applies to both {{BEST_SPEED}} 
and {{HIGH_COMPRESSION}} modes, so this PR optimizes both modes with preset 
dictionaries. Could someone give some feedback? Thanks. cc [~jpountz] 

> Moving to dictionaries has made stored fields slower at skipping
> 
>
> Key: LUCENE-10616
> URL: https://issues.apache.org/jira/browse/LUCENE-10616
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Adrien Grand
>Priority: Minor
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> [~ywelsch] has been digging into a regression of stored fields retrieval that 
> is caused by LUCENE-9486.
> Say your documents have two stored fields, one that is 100B and is stored 
> first, and the other one that is 100kB, and you are only interested in the 
> first one. While the idea behind blocks of stored fields is to store multiple 
> documents in the same block to leverage redundancy across documents, 
> sometimes documents are larger than the block size. As soon as documents are 
> larger than 2x the block size, our stored fields format splits such large 
> documents into multiple blocks, so that you wouldn't need to decompress 
> everything only to retrieve a couple small fields.
> Before LUCENE-9486, BEST_SPEED had a block size of 16kB, so only retrieving 
> the first field value would only need to decompress 16kB of data. With the 
> move to preset dictionaries in LUCENE-9486 and then LUCENE-9917, we now have 
> blocks of 80kB, so stored fields would now need to decompress 80kB of data, 
> 5x more than before.
> With dictionaries, our blocks are now split into 10 sub blocks. We happen to 
> eagerly decompress all sub blocks that intersect with the stored document, 
> which is why we would decompress 80kB of data, but this is an implementation 
> detail. It should be possible to decompress these sub blocks lazily so that 
> we would only decompress those that intersect with one of the field values 
> that the user is interested in retrieving?
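The potential saving from lazy decompression can be estimated with simple arithmetic. The sketch below assumes the 80kB block is split into 10 equal sub blocks of 8kB each (the description gives only the totals, so the equal split is an assumption) and uses hypothetical names; it computes how many bytes must be decompressed to cover a field that occupies a given byte range of the block.

```java
final class SubBlockSketch {
  // Bytes to decompress when only the sub blocks that intersect
  // [offset, offset + length) are decompressed.
  static long bytesToDecompress(long offset, long length, int subBlockSize) {
    if (offset < 0 || length <= 0 || subBlockSize <= 0) {
      throw new IllegalArgumentException("invalid range or sub block size");
    }
    long firstSubBlock = offset / subBlockSize;
    long lastSubBlock = (offset + length - 1) / subBlockSize;
    return (lastSubBlock - firstSubBlock + 1) * subBlockSize;
  }

  public static void main(String[] args) {
    int subBlockSize = 8 * 1024; // assumed: 80kB block / 10 sub blocks
    long eager = 10L * subBlockSize; // eager decompression touches all 10 sub blocks
    long lazy = bytesToDecompress(0, 100, subBlockSize); // 100B field stored first
    System.out.println(eager + " bytes eagerly vs " + lazy + " bytes lazily");
  }
}
```

Under these assumptions, retrieving only the 100B first field would touch a single 8kB sub block instead of the whole 80kB block.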






[GitHub] [lucene] JoeHF commented on pull request #1003: LUCENE-10616: optimizing decompress when only retrieving some fields

2022-07-26 Thread GitBox


JoeHF commented on PR #1003:
URL: https://github.com/apache/lucene/pull/1003#issuecomment-1195050814

   No obvious regression or perf improvement; I guess there are no such cases in 
the benchmark.
   wikimedium10k:
                          Task  QPS baseline   StdDev  QPS my_modified_version   StdDev                Pct diff  p-value
   BrowseRandomLabelTaxoFacets        569.65   (7.9%)                   543.58  (15.4%)   -4.6% ( -25% -   20%)    0.236
                       Prefix3        377.77   (9.1%)                   368.32   (6.6%)   -2.5% ( -16% -   14%)    0.321
                    AndHighMed        656.18   (8.2%)                   648.39  (10.6%)   -1.2% ( -18% -   19%)    0.691
           MedIntervalsOrdered        574.68   (6.3%)                   567.95   (9.7%)   -1.2% ( -16% -   15%)    0.651
                    AndHighLow        978.77   (9.3%)                   972.00   (8.5%)   -0.7% ( -16% -   18%)    0.806
                  HighSpanNear        425.66   (8.3%)                   423.78  (10.2%)   -0.4% ( -17% -   19%)    0.880
                     OrHighMed        656.72   (8.2%)                   655.28  (10.5%)   -0.2% ( -17% -   20%)    0.942
           LowIntervalsOrdered        481.42   (5.2%)                   480.65  (10.3%)   -0.2% ( -14% -   16%)    0.951
                    HighPhrase        500.26   (7.6%)                   499.86  (11.4%)   -0.1% ( -17% -   20%)    0.979
                       Respell        123.33  (11.8%)                   123.48  (10.4%)    0.1% ( -19% -   25%)    0.973
                    OrHighHigh        416.58   (6.9%)                   417.19   (9.4%)    0.1% ( -15% -   17%)    0.955
                       MedTerm       2063.41   (9.5%)                  2069.51  (11.0%)    0.3% ( -18% -   23%)    0.928
               LowSloppyPhrase        301.12   (7.5%)                   303.12  (12.6%)    0.7% ( -18% -   22%)    0.840
                      HighTerm       1088.05   (9.8%)                  1102.10  (14.8%)    1.3% ( -21% -   28%)    0.745
                     LowPhrase        896.10   (8.4%)                   907.71   (9.8%)    1.3% ( -15% -   21%)    0.654
              HighSloppyPhrase        309.31   (8.1%)                   313.60  (10.0%)    1.4% ( -15% -   21%)    0.629
                        Fuzzy2         42.78  (11.1%)                    43.46  (12.2%)    1.6% ( -19% -   27%)    0.665
                      Wildcard        315.36   (9.2%)                   320.46   (7.7%)    1.6% ( -14% -   20%)    0.548
                   MedSpanNear        520.33   (6.6%)                   530.21  (11.6%)    1.9% ( -15% -   21%)    0.524
          HighIntervalsOrdered        356.49  (10.3%)                   363.39  (10.1%)    1.9% ( -16% -   24%)    0.547
                   AndHighHigh        619.32   (5.9%)                   631.54   (9.5%)    2.0% ( -12% -   18%)    0.432
             HighTermMonthSort       1479.95   (6.0%)                  1509.95  (11.1%)    2.0% ( -14% -   20%)    0.472
               MedSloppyPhrase        230.30   (8.6%)                   235.24  (10.8%)    2.1% ( -15% -   23%)    0.488
                     MedPhrase        567.04   (6.2%)                   579.72  (11.5%)    2.2% ( -14% -   21%)    0.442
   BrowseRandomLabelSSDVFacets        350.13  (10.2%)                   358.12  (16.8%)    2.3% ( -22% -   32%)    0.604
         HighTermDayOfYearSort       1087.80   (7.4%)                  1118.61   (8.5%)    2.8% ( -12% -   20%)    0.260
                       LowTerm       2557.43   (9.2%)                  2636.37   (8.9%)    3.1% ( -13% -   23%)    0.281
                   LowSpanNear        795.88   (9.0%)                   828.70  (11.1%)    4.1% ( -14% -   26%)    0.195
                      PKLookup         26.79  (16.3%)                    27.91  (19.8%)    4.2% ( -27% -   48%)    0.466
                        Fuzzy1        136.23   (9.7%)                   142.21  (16.8%)    4.4% ( -20% -   34%)    0.312
         BrowseMonthTaxoFacets        801.97  (17.7%)                   840.43  (19.1%)    4.8% ( -27% -   50%)    0.410
                        IntNRQ        603.46  (10.0%)                   636.52   (7.9%)    5.5% ( -11% -   25%)    0.054
                     OrHighLow        532.25   (9.0%)                   562.37  (13.6%)    5.7% ( -15% -   31%)    0.121
         BrowseMonthSSDVFacets        839.55  (20.9%)                   894.35  (22.1%)    6.5% ( -30% -   62%)    0.337
     BrowseDayOfYearTaxoFacets        784.80  (16.4%)                   839.36  (25.1%)    7.0% ( -29% -   58%)    0.300
          BrowseDateTaxoFacets        849.34  (17.9%)                   908.75  (25.8%)    7.0% ( -31% -   61%)    0.319
     BrowseDayOfYearSSDVFacets        832.41  (17.6%)                   907.43  (22.9%)    9.0% ( -26% -   60%)    0.163
          BrowseDateSSDVFacets        215.63  (21.1%)                   241.38  (27.5%)   11.9% ( -30% -   76%)    0.123
   
   wikimedium1m:
   Task  QPS baseline  StdDev  QPS my_modified_version  StdDev  Pct diff  p-value
Respell   39.67 (11.8%)   37.57 
(14.2%)   -5.3% ( -27% -   23%) 0.200