[GitHub] [lucene-jira-archive] manishbafna commented on a diff in pull request #75: Update account-map.csv.20220722.verified
manishbafna commented on code in PR #75:
URL: https://github.com/apache/lucene-jira-archive/pull/75#discussion_r930634094

## migration/mappings-data/account-map.csv.20220722.verified:

## @@ -169,3 +169,4 @@ mharwood,markharwood,Mark Harwood
 hossman,hossman,Chris M. Hostetter
 munendrasn,munendrasn,Munendra S N
 vajda,ovalhub,Andi Vajda
+manish1982,manishbafna,Manish

Review Comment:
   https://user-images.githubusercontent.com/1758199/181168231-9e0d1ac2-b47b-4879-ad94-19db05900bec.png
   My username in JIRA is manish1982.

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Commented] (LUCENE-10583) Deadlock with MMapDirectory while waitForMerges
[ https://issues.apache.org/jira/browse/LUCENE-10583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17571691#comment-17571691 ]

Vigya Sharma commented on LUCENE-10583:
---------------------------------------

{quote}We could perhaps make a best effort to detect on common incoming APIs that external locks are not already held on {{Directory}} and {{IndexWriter}}?
{quote}
Interesting thought. I like the idea of safeguarding users against such errors, but I don't have a good practical solution for it yet. We could assert that common Lucene objects are lock-free at some popular {{public}} entry points, but how do we differentiate whether the lock was acquired by an internal Lucene thread or an external user thread? We do lock on IndexWriter at multiple places within Lucene.

{quote}can this be resolved now?
{quote}
We added doc strings at a couple of places to warn users, and the user who reported this issue is unblocked. I don't have a concrete plan for anything else we can do here. Unless there are more ideas, we could go ahead and resolve this.

> Deadlock with MMapDirectory while waitForMerges
> -----------------------------------------------
>
>                 Key: LUCENE-10583
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10583
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/index
>    Affects Versions: 8.11.1
>         Environment: Java 17
> OS: Windows 2016
>            Reporter: Thomas Hoffmann
>            Priority: Minor
>          Time Spent: 40m
>  Remaining Estimate: 0h
>
> Hello,
> a deadlock situation happened in our application.
> We are using MMapDirectory on Windows 2016 and got the following stacktrace:
> {code:java}
> "https-openssl-nio-443-exec-30" #166 daemon prio=5 os_prio=0 cpu=78703.13ms elapsed=81248.18s tid=0x2860af10 nid=0x237c in Object.wait() [0x413fc000]
>    java.lang.Thread.State: TIMED_WAITING (on object monitor)
> 	at java.lang.Object.wait(java.base@17.0.2/Native Method)
> 	- waiting on
> 	at org.apache.lucene.index.IndexWriter.doWait(IndexWriter.java:4983)
> 	- locked <0x0006ef1fc020> (a org.apache.lucene.index.IndexWriter)
> 	at org.apache.lucene.index.IndexWriter.waitForMerges(IndexWriter.java:2697)
> 	- locked <0x0006ef1fc020> (a org.apache.lucene.index.IndexWriter)
> 	at org.apache.lucene.index.IndexWriter.shutdown(IndexWriter.java:1236)
> 	at org.apache.lucene.index.IndexWriter.close(IndexWriter.java:1278)
> 	at com.speed4trade.ebs.module.search.SearchService.updateSearchIndex(SearchService.java:1723)
> 	- locked <0x0006d5c00208> (a org.apache.lucene.store.MMapDirectory)
> 	at com.speed4trade.ebs.module.businessrelations.ticket.TicketChangedListener.postUpdate(TicketChangedListener.java:142)
> ...{code}
> All threads were waiting to lock <0x0006d5c00208> which never got released.
> A Lucene thread was also blocked; I don't know if this is relevant:
> {code:java}
> "Lucene Merge Thread #0" #18466 daemon prio=5 os_prio=0 cpu=15.63ms elapsed=3499.07s tid=0x459453e0 nid=0x1f8 waiting for monitor entry [0x5da9e000]
>    java.lang.Thread.State: BLOCKED (on object monitor)
> 	at org.apache.lucene.store.FSDirectory.deletePendingFiles(FSDirectory.java:346)
> 	- waiting to lock <0x0006d5c00208> (a org.apache.lucene.store.MMapDirectory)
> 	at org.apache.lucene.store.FSDirectory.maybeDeletePendingFiles(FSDirectory.java:363)
> 	at org.apache.lucene.store.FSDirectory.createOutput(FSDirectory.java:248)
> 	at org.apache.lucene.store.LockValidatingDirectoryWrapper.createOutput(LockValidatingDirectoryWrapper.java:44)
> 	at org.apache.lucene.index.ConcurrentMergeScheduler$1.createOutput(ConcurrentMergeScheduler.java:289)
> 	at org.apache.lucene.store.TrackingDirectoryWrapper.createOutput(TrackingDirectoryWrapper.java:43)
> 	at org.apache.lucene.codecs.compressing.CompressingStoredFieldsWriter.<init>(CompressingStoredFieldsWriter.java:121)
> 	at org.apache.lucene.codecs.compressing.CompressingStoredFieldsFormat.fieldsWriter(CompressingStoredFieldsFormat.java:130)
> 	at org.apache.lucene.codecs.lucene87.Lucene87StoredFieldsFormat.fieldsWriter(Lucene87StoredFieldsFormat.java:141)
> 	at org.apache.lucene.index.SegmentMerger.mergeFields(SegmentMerger.java:227)
> 	at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:105)
> 	at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4757)
> 	at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:4361)
> 	at org.apache.lucene.index.IndexWriter$IndexWriterMergeSource.merge(IndexWriter.java:5920)
> 	at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:626)
> 	at
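The "detect external locks at common entry points" idea floated in this thread can be sketched with `Thread.holdsLock`. This is a hypothetical guard, not a Lucene API; all names here (`LockGuard`, `assertNotHeld`) are illustrative. It also shows the comment's caveat in action: the check cannot tell an internal Lucene thread from a user thread, it only sees that the current thread holds the monitor.

```java
// Hypothetical best-effort guard, as discussed above: assert at a public
// entry point that the caller does not already hold the monitor of the
// Directory it passes in. Not a Lucene API; a sketch only.
final class LockGuard {
  static void assertNotHeld(Object externalLock) {
    if (Thread.holdsLock(externalLock)) {
      throw new IllegalStateException(
          "caller already holds the lock on " + externalLock
              + "; calling close()/waitForMerges() while holding it can deadlock merge threads");
    }
  }

  public static void main(String[] args) {
    Object directory = new Object(); // stands in for an MMapDirectory instance
    assertNotHeld(directory); // fine: no external lock held

    synchronized (directory) { // the pattern from the reported stack trace
      try {
        assertNotHeld(directory);
      } catch (IllegalStateException expected) {
        System.out.println("guard tripped: deadlock-prone call detected");
      }
    }
  }
}
```

As the comment notes, Lucene itself locks on IndexWriter internally, so a check like this would fire on legitimate internal calls too — which is why it stayed an idea rather than a patch.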
[jira] [Updated] (LUCENE-10654) New companion doc value format for LatLonShape and XYShape field types
[ https://issues.apache.org/jira/browse/LUCENE-10654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nick Knize updated LUCENE-10654:
--------------------------------
    Fix Version/s: 9.4
                       (was: 9.3)

> New companion doc value format for LatLonShape and XYShape field types
> ----------------------------------------------------------------------
>
>                 Key: LUCENE-10654
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10654
>             Project: Lucene - Core
>          Issue Type: New Feature
>            Reporter: Nick Knize
>            Priority: Major
>             Fix For: 9.4
>
>          Time Spent: 5h 20m
>  Remaining Estimate: 0h
>
> {{XYDocValuesField}} provides doc value support for {{XYPoint}}, and
> {{LatLonDocValuesField}} provides doc value support for {{LatLonPoint}}.
> However, neither {{LatLonShape}} nor {{XYShape}} currently has a doc value
> format.
> This lack of doc value support for shapes means facets, aggregations, and
> IndexOrDocValues queries are currently not possible for Shape field types.
> This gap needs to be closed in Lucene.
> To support IndexOrDocValues queries along with various geometry aggregations
> and facets, we need the ability to compute the spatial relation with the doc
> value. This is straightforward with {{XYPoint}} and {{LatLonPoint}} since
> the doc value encoding is nothing more than a simple 2D integer encoding of
> the x,y and lat,lon dimensional components. Accomplishing the same with a
> naive integer-encoded binary representation for N-vertex shapes would be
> costly.
> {{ComponentTree}} already provides an efficient in-memory structure for
> quickly computing spatial relations over Shape types, based on a binary tree
> of tessellated triangles provided by the {{Tessellator}}. Furthermore, this
> tessellation is already computed at index time. If we create an on-disk
> representation of {{ComponentTree}}'s binary tree of tessellated triangles
> and use it as the doc value {{binaryValue}} format, we will be able to
> efficiently compute spatial relations with this binary representation and
> achieve the same facet/aggregation results over shapes as we can with points
> today (e.g., grid facets, centroid, area, etc.).

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
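The "simple 2D integer encoding" that makes point doc values cheap to compare can be sketched as below. The scaling constants are an assumption for illustration; Lucene's actual `GeoEncodingUtils` uses a similar but not identical quantization scheme.

```java
// Sketch of the per-dimension integer encoding referenced above for
// lat/lon point doc values: each dimension quantizes to a signed 32-bit
// int, so spatial relations reduce to integer comparisons. Constants and
// names are illustrative, not Lucene's GeoEncodingUtils.
final class PointEncodingSketch {
  private static final double LAT_SCALE = (0x1L << 31) / 180.0; // lat in [-90, 90]
  private static final double LON_SCALE = (0x1L << 31) / 360.0; // lon in [-180, 180]

  static int encodeLatitude(double lat) {
    return (int) Math.floor(lat * LAT_SCALE);
  }

  static double decodeLatitude(int encoded) {
    return encoded / LAT_SCALE;
  }

  static int encodeLongitude(double lon) {
    return (int) Math.floor(lon * LON_SCALE);
  }

  static double decodeLongitude(int encoded) {
    return encoded / LON_SCALE;
  }
}
```

With this encoding, a bounding-box relation is a handful of integer comparisons per point. The issue's point is that no such trivially comparable fixed-width encoding exists for an N-vertex shape, which is why the proposal leans on the tessellated-triangle tree instead.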
[jira] [Commented] (LUCENE-10654) New companion doc value format for LatLonShape and XYShape field types
[ https://issues.apache.org/jira/browse/LUCENE-10654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17571655#comment-17571655 ]

Nick Knize commented on LUCENE-10654:
-------------------------------------

As per the discussion on the PR, I think this is too late for 9.3, so I'd like to move forward for 9.4 and iterate on the "nice to haves" (visitor access pattern) in a follow-up.

> New companion doc value format for LatLonShape and XYShape field types
> ----------------------------------------------------------------------
>
>                 Key: LUCENE-10654
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10654
>             Project: Lucene - Core
>          Issue Type: New Feature
>            Reporter: Nick Knize
>            Priority: Major
>             Fix For: 9.3
>
>          Time Spent: 5h 20m
>  Remaining Estimate: 0h
[jira] [Commented] (LUCENE-10577) Quantize vector values
[ https://issues.apache.org/jira/browse/LUCENE-10577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17571631#comment-17571631 ]

Michael Sokolov commented on LUCENE-10577:
------------------------------------------

OK, I will revive the FieldInfo version of this thing and see about making a byte-oriented KnnVectorField; perhaps the VectorFormat can remain internal in that case. It seems likely to me that if this is a win for this algorithm, it could very well be so for others. Plus there is an easy fallback position, which is to accept bytes and inflate them to four-byte floats, so the burden is not necessarily so great on future vector formats. Agree we can add Euclidean distance.

> Quantize vector values
> ----------------------
>
>                 Key: LUCENE-10577
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10577
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/codecs
>            Reporter: Michael Sokolov
>            Priority: Major
>          Time Spent: 3h 20m
>  Remaining Estimate: 0h
>
> The {{KnnVectorField}} api handles vectors with 4-byte floating point values.
> These fields can be used (via {{KnnVectorsReader}}) in two main ways:
> 1. The {{VectorValues}} iterator enables retrieving values
> 2. Approximate nearest-neighbor search
> The main point of this addition was to provide the search capability, and to
> support that it is not really necessary to store vectors in full precision.
> Perhaps users may also be willing to retrieve values in lower precision for
> whatever purpose those serve, if they are able to store more samples. We know
> that 8 bits is enough to provide a very near approximation to the same
> recall/performance tradeoff that is achieved with the full-precision vectors.
> I'd like to explore how we could enable 4:1 compression of these fields by
> reducing their precision.
> A few ways I can imagine this would be done:
> 1. Provide a parallel byte-oriented API. This would allow users to provide
> their data in reduced-precision format and give control over the quantization
> to them. It would have a major impact on the Lucene API surface though,
> essentially requiring us to duplicate all of the vector APIs.
> 2. Automatically quantize the stored vector data when we can. This would
> require no or perhaps very limited change to the existing API to enable the
> feature.
> I've been exploring (2), and what I find is that we can achieve very good
> recall results using dot-product similarity scoring by simple linear scaling
> + quantization of the vector values, so long as we choose the scale that
> minimizes the quantization error. Dot-product is amenable to this treatment
> since vectors are required to be unit-length when used with that similarity
> function.
> Even still there is variability in the ideal scale over different data sets.
> A good choice seems to be max(abs(min-value), abs(max-value)), but of course
> this assumes that the data set doesn't have a few outlier data points. A
> theoretical range can be obtained by 1/sqrt(dimension), but this is only
> useful when the samples are normally distributed. We could in theory
> determine the ideal scale when flushing a segment and manage this
> quantization per-segment, but then numerical error could creep in when
> merging.
> I'll post a patch/PR with an experimental setup I've been using for
> evaluation purposes. It is pretty self-contained and simple, but has some
> drawbacks that need to be addressed:
> 1. No automated mechanism for determining quantization scale (it's a
> constant that I have been playing with)
> 2. Converts from byte/float when computing dot-product instead of directly
> computing on byte values
> I'd like to get people's feedback on the approach and whether in general we
> should think about doing this compression under the hood, or expose a
> byte-oriented API. Whatever we do I think a 4:1 compression ratio is pretty
> compelling and we should pursue something.
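The "simple linear scaling + quantization" that the issue describes for option (2) can be sketched as follows. The scale heuristic max(abs(min), abs(max)) follows the comment; the class and method names are illustrative only and are not Lucene APIs.

```java
// Sketch of option (2) above: linearly scale float vectors into signed
// bytes, score with an integer dot product, then undo the scaling. This is
// an editor's illustration of the described technique, not Lucene code.
final class QuantizeSketch {
  /** scale heuristic from the issue: the largest absolute component */
  static float maxAbs(float[] v) {
    float m = 0;
    for (float x : v) m = Math.max(m, Math.abs(x));
    return m;
  }

  /** map each component from [-scale, scale] onto a signed byte in [-127, 127] */
  static byte[] quantize(float[] v, float scale) {
    byte[] out = new byte[v.length];
    for (int i = 0; i < v.length; i++) {
      int q = Math.round(v[i] / scale * 127f);
      out[i] = (byte) Math.max(-127, Math.min(127, q));
    }
    return out;
  }

  /** integer dot product over the quantized bytes (no float conversion) */
  static int dot(byte[] a, byte[] b) {
    int sum = 0;
    for (int i = 0; i < a.length; i++) sum += a[i] * b[i];
    return sum;
  }

  /** undo the scaling to recover an approximate float dot product */
  static float rescale(int byteDot, float scale) {
    return byteDot * (scale / 127f) * (scale / 127f);
  }
}
```

For unit-length vectors (as dot-product similarity requires), the rescaled byte dot product tracks the float dot product closely, which is the basis for the 4:1 compression claim; the issue's open problems (choosing the scale automatically, computing directly on bytes) are exactly the parts this sketch hard-codes.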
[jira] [Comment Edited] (LUCENE-10404) Use hash set for visited nodes in HNSW search?
[ https://issues.apache.org/jira/browse/LUCENE-10404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17571594#comment-17571594 ]

Michael Sokolov edited comment on LUCENE-10404 at 7/26/22 8:01 PM:
-------------------------------------------------------------------

Here is a test using GloVe 100-dim vectors plus much more aggressive indexing settings, and we can see that here the IntIntHashMap is adding cost

h3. baseline
{noformat}
recall  latency  nDoc  fanout  maxConn  beamWidth  visited  index ms
 0.991     0.92     1      50       64        500      150     12068
 0.996     1.11     1     100       64        500      200         0
 0.999     1.45     1     200       64        500      300         0
 1.000     1.94     1     400       64        500      500         0
 0.955     2.53    10      50       64        500      150    463142
 0.973     3.03    10     100       64        500      200         0
 0.988     4.44    10     200       64        500      300         0
 0.997     6.57    10     400       64        500      500         0
 0.895     3.44   100      50       64        500      150   9811483
 0.920     4.33   100     100       64        500      200         0
 0.950     6.20   100     200       64        500      300         0
 0.974     9.53   100     400       64        500      500         0
{noformat}

h3. IntIntHashMap
{noformat}
recall  latency  nDoc  fanout  maxConn  beamWidth  visited  index ms
 0.991     1.03     1      50       64        500      150     13274
 0.996     1.24     1     100       64        500      200         0
 0.999     1.62     1     200       64        500      300         0
 1.000     2.09     1     400       64        500      500         0
 0.955     2.47    10      50       64        500      150    485131
 0.973     3.31    10     100       64        500      200         0
 0.988     4.66    10     200       64        500      300         0
 0.997     7.26    10     400       64        500      500         0
 0.895     3.58   100      50       64        500      150  10173818
 0.920     4.49   100     100       64        500      200         0
 0.950     6.45   100     200       64        500      300         0
 0.974     9.91   100     400       64        500      500         0
{noformat}

> Use hash set for visited nodes in HNSW search?
> ----------------------------------------------
>
>                 Key: LUCENE-10404
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10404
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Julie Tibshirani
>            Priority: Minor
>
> While searching each layer, HNSW tracks the nodes it has already visited
> using a BitSet. We could look into using something like IntHashSet instead. I
> tried out the idea quickly by switching to IntIntHashMap (which has already
> been copied from hppc) and saw an improvement in index performance.
> *Baseline:* 760896 msec to write vectors
> *Using IntIntHashMap:* 733017 msec to write vectors
> I noticed search performance actually got a little bit worse with the change
> -- that is something to look into.
> For background, it's good to be aware that HNSW can visit a lot of nodes. For
> example, on the glove-100-angular dataset with ~1.2 million docs, HNSW search
> visits ~1000 - 15,000 docs depending on the recall. This number can increase
> when searching with deleted docs, especially if you hit a "pathological" case
> where the deleted docs happen to be closest to the query vector.
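The trade-off being measured in this thread can be illustrated with a small sketch: a dense bit set pays an allocation cost proportional to the total node count up front, while a hash set grows only with the nodes actually visited but pays hashing overhead per lookup. Plain `java.util` collections stand in here for Lucene's `FixedBitSet` and the hppc-derived `IntIntHashMap`; this is an editor's illustration, not Lucene code.

```java
// Two strategies for tracking visited HNSW nodes, as discussed above.
import java.util.BitSet;
import java.util.HashSet;
import java.util.Set;

final class VisitedSetSketch {
  interface Visited {
    /** returns true the first time a node is seen, false afterwards */
    boolean mark(int node);
  }

  /** dense: O(maxDoc) bits allocated regardless of how few nodes are visited */
  static final class DenseVisited implements Visited {
    private final BitSet bits;
    DenseVisited(int maxDoc) {
      bits = new BitSet(maxDoc);
    }
    public boolean mark(int node) {
      if (bits.get(node)) return false;
      bits.set(node);
      return true;
    }
  }

  /** sparse: storage grows only with nodes actually visited, but each mark hashes */
  static final class SparseVisited implements Visited {
    private final Set<Integer> seen = new HashSet<>();
    public boolean mark(int node) {
      return seen.add(node); // add() returns false if already present
    }
  }
}
```

Since a search visits only ~1,000-15,000 of ~1.2M nodes, the sparse variant touches far less memory per query; the per-lookup hashing cost is consistent with the slightly higher latencies and index times in the IntIntHashMap runs above.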
[jira] [Commented] (LUCENE-10054) Handle hierarchy in HNSW graph
[ https://issues.apache.org/jira/browse/LUCENE-10054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17571588#comment-17571588 ]

Mike Sokolov commented on LUCENE-10054:
---------------------------------------

What is it with this issue that spammers love so much!? I wonder if we could somehow lock it as read-only ...

> Handle hierarchy in HNSW graph
> ------------------------------
>
>                 Key: LUCENE-10054
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10054
>             Project: Lucene - Core
>          Issue Type: Task
>            Reporter: Mayya Sharipova
>            Priority: Major
>              Labels: vector-based-search
>             Fix For: 9.1
>
>          Time Spent: 20h 20m
>  Remaining Estimate: 0h
>
> Currently the HNSW graph is represented as a single-layer graph.
> We would like to extend it to handle hierarchy as per the
> [discussion|https://issues.apache.org/jira/browse/LUCENE-9004?focusedCommentId=17393216=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17393216].
>
> TODO tasks:
> - add multiple layers in the HnswGraph class
> - modify the format in Lucene90HnswVectorsWriter and
> Lucene90HnswVectorsReader to handle multiple layers
> - modify graph construction and search algorithm to handle hierarchy
> - run benchmarks
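The hierarchical search that the TODO list implies can be sketched as a greedy descent: start at the entry point on the top layer, hop to the closest neighbor until no neighbor improves, drop a layer, and repeat; only layer 0 then gets the full beam search. The per-layer adjacency-array representation and all names below are an editor's illustration, not the Lucene90 HNSW format.

```java
// Greedy descent through HNSW layers, as described above. Illustrative
// sketch only: neighbors[layer][node] lists adjacent node ids on that
// layer, and squared Euclidean distance stands in for the similarity.
final class HierarchySketch {
  /** returns the best entry point found for the layer-0 beam search */
  static int greedyDescend(int[][][] neighbors, float[][] vectors, float[] query, int entryPoint) {
    int curr = entryPoint;
    // walk each upper layer greedily; layer 0 is left to the beam search
    for (int layer = neighbors.length - 1; layer >= 1; layer--) {
      boolean improved = true;
      while (improved) {
        improved = false;
        for (int cand : neighbors[layer][curr]) {
          if (dist(vectors[cand], query) < dist(vectors[curr], query)) {
            curr = cand;
            improved = true;
          }
        }
      }
    }
    return curr;
  }

  static float dist(float[] a, float[] b) {
    float sum = 0;
    for (int i = 0; i < a.length; i++) {
      float d = a[i] - b[i];
      sum += d * d;
    }
    return sum;
  }
}
```

The format change in the TODOs follows from this loop: the writer and reader must persist one adjacency structure per layer (upper layers holding only a sample of nodes) instead of the single flat graph used before.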
[GitHub] [lucene] nknize commented on a diff in pull request #1017: LUCENE-10654: Add new ShapeDocValuesField for LatLonShape and XYShape
nknize commented on code in PR #1017:
URL: https://github.com/apache/lucene/pull/1017#discussion_r930257797

## lucene/core/src/java/org/apache/lucene/document/ShapeDocValuesField.java:

## @@ -0,0 +1,896 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.document;
+
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.Comparator;
+import java.util.List;
+import org.apache.lucene.analysis.Analyzer;
+import org.apache.lucene.analysis.TokenStream;
+import org.apache.lucene.document.ShapeField.DecodedTriangle.TYPE;
+import org.apache.lucene.document.ShapeField.QueryRelation;
+import org.apache.lucene.document.SpatialQuery.EncodedRectangle;
+import org.apache.lucene.index.DocValuesType;
+import org.apache.lucene.index.IndexableFieldType;
+import org.apache.lucene.index.PointValues.Relation;
+import org.apache.lucene.search.Query;
+import org.apache.lucene.store.ByteArrayDataInput;
+import org.apache.lucene.store.ByteBuffersDataOutput;
+import org.apache.lucene.store.DataInput;
+import org.apache.lucene.util.ArrayUtil;
+import org.apache.lucene.util.BytesRef;
+
+/** A doc values field representation for {@link LatLonShape} and {@link XYShape} */
+public final class ShapeDocValuesField extends Field {
+  private final ShapeComparator shapeComparator;
+
+  private static final FieldType FIELD_TYPE = new FieldType();
+
+  static {
+    FIELD_TYPE.setDocValuesType(DocValuesType.BINARY);
+    FIELD_TYPE.setOmitNorms(true);
+    FIELD_TYPE.freeze();
+  }
+
+  /**
+   * Creates a {@code ShapeDocValuesField} instance from a shape tessellation
+   *
+   * @param name The Field Name (must not be null)
+   * @param tessellation The tessellation (must not be null)
+   */
+  ShapeDocValuesField(String name, List<ShapeField.DecodedTriangle> tessellation) {
+    super(name, FIELD_TYPE);
+    BytesRef b = computeBinaryValue(tessellation);
+    this.fieldsData = b;
+    try {
+      this.shapeComparator = new ShapeComparator(b);
+    } catch (IOException e) {
+      throw new IllegalArgumentException("unable to read binary shape doc value field. ", e);
+    }
+  }
+
+  /** Creates a {@code ShapeDocValue} field from a given serialized value */
+  ShapeDocValuesField(String name, BytesRef binaryValue) {
+    super(name, FIELD_TYPE);
+    this.fieldsData = binaryValue;
+    try {
+      this.shapeComparator = new ShapeComparator(binaryValue);
+    } catch (IOException e) {
+      throw new IllegalArgumentException("unable to read binary shape doc value field. ", e);
+    }
+  }
+
+  /** The name of the field */
+  @Override
+  public String name() {
+    return name;
+  }
+
+  /** Gets the {@code IndexableFieldType} for this ShapeDocValue field */
+  @Override
+  public IndexableFieldType fieldType() {
+    return FIELD_TYPE;
+  }
+
+  /** Currently there is no string representation for the ShapeDocValueField */
+  @Override
+  public String stringValue() {
+    return null;
+  }
+
+  /** TokenStreams are not yet supported */
+  @Override
+  public TokenStream tokenStream(Analyzer analyzer, TokenStream reuse) {
+    return null;
+  }
+
+  /** create a shape docvalue field from indexable fields */
+  public static ShapeDocValuesField createDocValueField(String fieldName, Field[] indexableFields) {
+    ArrayList<ShapeField.DecodedTriangle> tess = new ArrayList<>(indexableFields.length);
+    final byte[] scratch = new byte[7 * Integer.BYTES];
+    for (Field f : indexableFields) {
+      BytesRef br = f.binaryValue();
+      assert br.length == 7 * ShapeField.BYTES;
+      System.arraycopy(br.bytes, br.offset, scratch, 0, 7 * ShapeField.BYTES);
+      ShapeField.DecodedTriangle t = new ShapeField.DecodedTriangle();
+      ShapeField.decodeTriangle(scratch, t);
+      tess.add(t);
+    }
+    return new ShapeDocValuesField(fieldName, tess);
+  }
+
+  /** Returns the number of terms (tessellated triangles) for this shape */
+  public int numberOfTerms() {
+    return shapeComparator.numberOfTerms();
+  }
+
+  /** Creates a geometry query for shape docvalues */
+  public static Query newGeometryQuery(
+      final String field, final QueryRelation relation, Object... geometries) {
+    return null;
+    // TODO
+    // return new ShapeDocValuesQuery(field, relation,
[jira] [Commented] (LUCENE-10662) Make LuceneTestCase to not extend from org.junit.Assert
[ https://issues.apache.org/jira/browse/LUCENE-10662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17571529#comment-17571529 ]

Marios Trivyzas commented on LUCENE-10662:
------------------------------------------

[~dweiss] Thx! Check out how it looks without the renaming:
https://github.com/apache/lucene/pull/1049/commits/7b71302c915bc81d9d29ad49f1e917c219ee

> Make LuceneTestCase to not extend from org.junit.Assert
> -------------------------------------------------------
>
>                 Key: LUCENE-10662
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10662
>             Project: Lucene - Core
>          Issue Type: Test
>          Components: general/test
>            Reporter: Marios Trivyzas
>            Priority: Major
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> Since *LuceneTestCase* is a very useful abstract class that can be extended
> and used by many projects, having it extend *org.junit.Assert* limits all
> users to exclusively using the static methods of {*}org.junit.Assert{*}. In
> our project we want to use [https://joel-costigliola.github.io/assertj],
> where the main method to call is *org.assertj.core.api.Assertions.assertThat*,
> which conflicts with the deprecated {*}org.junit.Assert.assertThat{*}
> recognized by default by the compiler. So one can only use assertj if every
> call uses the fully qualified name for the *assertThat* method, i.e.
>
> {code:java}
> org.assertj.core.api.Assertions.assertThat(myObj.name()).isEqualTo(expectedName)
> {code}
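The name clash described in this issue can be reproduced in miniature with stand-in classes (the names below are hypothetical, not the real `org.junit.Assert` / AssertJ types): a method inherited from a superclass wins over any other candidate with the same simple name, so inside a subclass of the test base an unqualified `assertThat(...)` always resolves to the inherited junit-style variant, and the fluent one must be fully qualified.

```java
// Minimal illustration of the clash: Base plays the role of
// org.junit.Assert (inherited via LuceneTestCase), Fluent plays the role
// of org.assertj.core.api.Assertions. Stand-in names, not the real APIs.
class Base {
  static String assertThat(Object actual) {
    return "junit-style"; // the deprecated Assert.assertThat analogue
  }
}

class Fluent {
  static String assertThat(Object actual) {
    return "assertj-style"; // the fluent Assertions.assertThat analogue
  }
}

class NameClashDemo extends Base {
  String unqualified() {
    return assertThat("x"); // resolves to the inherited Base.assertThat
  }

  String qualified() {
    return Fluent.assertThat("x"); // must fully qualify to reach the fluent API
  }
}
```

This is why dropping the `extends org.junit.Assert` from LuceneTestCase (or renaming) is the only way to let subclasses use the short fluent form.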
[jira] [Commented] (LUCENE-10662) Make LuceneTestCase to not extend from org.junit.Assert
[ https://issues.apache.org/jira/browse/LUCENE-10662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17571469#comment-17571469 ] Dawid Weiss commented on LUCENE-10662: -- I think the compiler should be able to pick the most specific variant based on argument types, unless there really is ambiguity - I admit I haven't checked whether this is the case, for example here: https://github.com/apache/lucene/pull/1049/files#diff-334836e7b61b74a76eec5aa18eacec6b14c1496f5595b684842ce05583a6df22L209-R213
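The overload-resolution behavior mentioned in the comment above can be checked with a small sketch (class and method names are hypothetical): when several overloads are applicable, javac picks the most specific one.

```java
// When both overloads are applicable, the compiler selects the most
// specific one rather than reporting an ambiguity.
class OverloadResolutionDemo {
    static String pick(Object a, Object b) {
        return "Object overload";
    }

    static String pick(String a, String b) {
        return "String overload";
    }

    static String callWithStrings() {
        // Both overloads apply to two Strings; the String variant is more
        // specific, so it wins without an ambiguity error.
        return pick("a", "b");
    }
}
```

A cast to `Object` on an argument is enough to steer resolution back to the broader overload.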
[GitHub] [lucene-jira-archive] mocobeta commented on issue #3: Create mapping on Jira user id -> GitHub account
mocobeta commented on issue #3: URL: https://github.com/apache/lucene-jira-archive/issues/3#issuecomment-1195564497 I'll try to improve candidate generation and verification steps maybe next week. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Updated] (LUCENE-10662) Make LuceneTestCase to not extend from org.junit.Assert
[ https://issues.apache.org/jira/browse/LUCENE-10662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marios Trivyzas updated LUCENE-10662: - Summary: Make LuceneTestCase to not extend from org.junit.Assert (was: Make LuceneTestCase not extending from org.junit.Assert)
[jira] [Comment Edited] (LUCENE-10662) Make LuceneTestCase not extending from org.junit.Assert
[ https://issues.apache.org/jira/browse/LUCENE-10662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17571453#comment-17571453 ] Marios Trivyzas edited comment on LUCENE-10662 at 7/26/22 2:14 PM: ---
{quote}I wouldn't rename any methods (assertEquals becomes assertEquality) - this will be even more confusing for downstream users. I'd remove the extend and assertEquals* methods from LuceneTestCase and move those methods into a separate class (like LuceneAssertions or something) - then the upgrade would be about importing them statically from junit's Assert or LuceneAssertions.
{quote}
I don't get how we can resolve a few issues: for example, the *private void assertEquals(Sort a, Sort b)* in *TestSort* - if it remains as is and we also *import static org.junit.Assert.assertEquals* in the same class, the compiler doesn't know which one is being used unless we write *Assert.assertEquals()* everywhere else to actually use the junit one.

The most important point is what you mentioned about all the projects that use *LuceneTestCase*, so let's see what other people also think about this.

was (Author: matriv):
{quote}I wouldn't rename any methods (assertEquals becomes assertEquality) - this will be even more confusing for downstream users. I'd remove the extend and assertEquals* methods from LuceneTestCase and move those methods into a separate class (like LuceneAssertions or something) - then the upgrade would be about importing them statically from junit's Assert or LuceneAssertions.
{quote}
I don't get how we can resolve a few issues: for example, the *private void assertEquals(Sort a, Sort b)* in *TestSort* - if it remains as is and we also *import static org.junit.Assert.assertEquals* in the same class, the compiler doesn't know which one is being used unless we write *Assert.assertEquals()* everywhere else to actually use the junit one.

The most important point is what you mentioned about all the projects that use *LuceneTestCase*, so let's see what other people also think about this.
[jira] [Commented] (LUCENE-10662) Make LuceneTestCase not extending from org.junit.Assert
[ https://issues.apache.org/jira/browse/LUCENE-10662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17571453#comment-17571453 ] Marios Trivyzas commented on LUCENE-10662: --
{quote}I wouldn't rename any methods (assertEquals becomes assertEquality) - this will be even more confusing for downstream users. I'd remove the extend and assertEquals* methods from LuceneTestCase and move those methods into a separate class (like LuceneAssertions or something) - then the upgrade would be about importing them statically from junit's Assert or LuceneAssertions.
{quote}
I don't get how we can resolve a few issues: for example, the *private void assertEquals(Sort a, Sort b)* in *TestSort* - if it remains as is and we also *import static org.junit.Assert.assertEquals* in the same class, the compiler doesn't know which one is being used unless we write *Assert.assertEquals()* everywhere else to actually use the junit one.

The most important point is what you mentioned about all the projects that use *LuceneTestCase*, so let's see what other people also think about this.
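The hiding problem described in the comment can be shown in miniature. In this sketch, `Sort`, `JunitAssert`, and `TestSortSketch` are hypothetical stand-ins for Lucene's `Sort`, `org.junit.Assert`, and `TestSort`: once the class declares its own `assertEquals`, unqualified calls no longer reach a same-named static import, so the junit variant must be qualified.

```java
// Hypothetical stand-ins for the TestSort situation described above.
class Sort {
    final String field;

    Sort(String field) {
        this.field = field;
    }
}

class JunitAssert { // plays the role of org.junit.Assert
    static void assertEquals(long expected, long actual) {
        if (expected != actual) {
            throw new AssertionError(expected + " != " + actual);
        }
    }
}

class TestSortSketch {
    // A private helper like TestSort's assertEquals(Sort, Sort); declaring it
    // hides any same-named static import inside this class.
    private static void assertEquals(Sort a, Sort b) {
        if (!a.field.equals(b.field)) {
            throw new AssertionError(a.field + " != " + b.field);
        }
    }

    static boolean runChecks() {
        assertEquals(new Sort("score"), new Sort("score")); // local helper
        JunitAssert.assertEquals(1L, 1L); // the junit call must be qualified
        return true;
    }
}
```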
[jira] [Commented] (LUCENE-10662) Make LuceneTestCase not extending from org.junit.Assert
[ https://issues.apache.org/jira/browse/LUCENE-10662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17571418#comment-17571418 ] Dawid Weiss commented on LUCENE-10662: -- Changing these methods will require a huge follow-up and cleanup in any other project that uses LuceneTestCase (and there are many). I don't think people will be happy with it (even though my heart is with you on assertj - I also prefer it to what's in hamcrest/junit). Even if people agree to change it, looking at the patch, I wouldn't rename any methods (assertEquals becomes assertEquality) - this will be even more confusing for downstream users. I'd remove the extends and the assertEquals* methods from LuceneTestCase and move those methods into a separate class (like LuceneAssertions or something) - then the upgrade would be about importing them statically from junit's Assert or LuceneAssertions. Again, I'm not convinced this is a necessary improvement. I've lived with explicit Assertions.* calls from assertj - this is fine and explicit, and even used within Lucene code itself: https://github.com/apache/lucene/blob/main/lucene/distribution.tests/src/test/org/apache/lucene/distribution/TestModularLayer.java#L117
[GitHub] [lucene-jira-archive] mikemccand commented on issue #1: Fix markup conversion error
mikemccand commented on issue #1: URL: https://github.com/apache/lucene-jira-archive/issues/1#issuecomment-1195457720 > GitHub won't accept labels such as `legacy-jira-label:java11` for some reason That's really weird ;) I was able to apply the label to [this issue](https://github.com/apache/lucene-jira-archive/issues/78) through the GitHub web UI. Not sure why the import API would fail on it!
[GitHub] [lucene-jira-archive] mocobeta commented on issue #1: Fix markup conversion error
mocobeta commented on issue #1: URL: https://github.com/apache/lucene-jira-archive/issues/1#issuecomment-1195451400 The rehearsal failed - 52 issues weren't imported due to errors. The errors come from recent changes in the conversion script (for example, GitHub won't accept labels such as `legacy-jira-label:java11` for some reason). I'll investigate the errors and retry.
[GitHub] [lucene-jira-archive] mocobeta commented on pull request #64: Cover all Jira components in module label mapping
mocobeta commented on PR #64: URL: https://github.com/apache/lucene-jira-archive/pull/64#issuecomment-1195443803 Looks like this causes an import error for some issues.

```
[2022-07-26 18:53:05,540] ERROR:import_github_issues: Import GitHub issue /mnt/hdd/repo/lucene-jira-archive/migration/github-import-data/GH-LUCENE-9500.json was failed. status=failed, errors=[{'location': '/issue/labels[11]', 'resource': 'Label', 'field': 'name', 'value': 'legacy-jira-label:java11', 'code': 'invalid'}]
```
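One possible workaround for the failure above is to sanitize labels before import. The exact character rules of the GitHub import API are an assumption here (the web UI accepted the very same label, so its rules clearly differ); this hypothetical sketch only shows the shape of such a fix for the colon the failed run choked on.

```java
// Hypothetical sanitizer for labels rejected by the GitHub import API.
// Assumption: the colon is the offending character in the error above.
class LabelSanitizer {
    static String sanitize(String label) {
        // Replace the colon with a dash so the label survives import.
        return label.replace(':', '-');
    }
}
```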
[GitHub] [lucene] matriv commented on pull request #1049: LUCENE-10662 Make LuceneTestCase to not extend from org.junit.Assert
matriv commented on PR #1049: URL: https://github.com/apache/lucene/pull/1049#issuecomment-1195437459 - 4904fedef1a3e0ca0a67f8f0db0961b09db51f30 Renames some methods to avoid naming conflicts - b9fe0008b10ecff6b29feb3b61250ba343a1b1bd Removes `extends Assert` from `LuceneTestCase` and adds static imports of `org.junit.Assert.xxx` everywhere
[jira] [Commented] (LUCENE-10583) Deadlock with MMapDirectory while waitForMerges
[ https://issues.apache.org/jira/browse/LUCENE-10583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17571393#comment-17571393 ] Michael McCandless commented on LUCENE-10583: - We could perhaps make a best effort to detect on common incoming APIs that external locks are not already held on {{Directory}} and {{IndexWriter}}? > Deadlock with MMapDirectory while waitForMerges > --- > > Key: LUCENE-10583 > URL: https://issues.apache.org/jira/browse/LUCENE-10583 > Project: Lucene - Core > Issue Type: Improvement > Components: core/index >Affects Versions: 8.11.1 > Environment: Java 17 > OS: Windows 2016 >Reporter: Thomas Hoffmann >Priority: Minor > Time Spent: 40m > Remaining Estimate: 0h > > Hello, > a deadlock situation happened in our application. We are using MMapDirectory > on Windows 2016 and got the following stacktrace: > {code:java} > "https-openssl-nio-443-exec-30" #166 daemon prio=5 os_prio=0 cpu=78703.13ms > "https-openssl-nio-443-exec-30" #166 daemon prio=5 os_prio=0 cpu=78703.13ms > elapsed=81248.18s tid=0x2860af10 nid=0x237c in Object.wait() > [0x413fc000] > java.lang.Thread.State: TIMED_WAITING (on object monitor) > at java.lang.Object.wait(java.base@17.0.2/Native Method) > - waiting on > at org.apache.lucene.index.IndexWriter.doWait(IndexWriter.java:4983) > - locked <0x0006ef1fc020> (a org.apache.lucene.index.IndexWriter) > at > org.apache.lucene.index.IndexWriter.waitForMerges(IndexWriter.java:2697) > - locked <0x0006ef1fc020> (a org.apache.lucene.index.IndexWriter) > at org.apache.lucene.index.IndexWriter.shutdown(IndexWriter.java:1236) > at org.apache.lucene.index.IndexWriter.close(IndexWriter.java:1278) > at > com.speed4trade.ebs.module.search.SearchService.updateSearchIndex(SearchService.java:1723) > - locked <0x0006d5c00208> (a org.apache.lucene.store.MMapDirectory) > at > com.speed4trade.ebs.module.businessrelations.ticket.TicketChangedListener.postUpdate(TicketChangedListener.java:142) > ...{code} > All threads were waiting 
to lock <0x0006d5c00208>, which was never released.
> A Lucene thread was also blocked; I don't know if this is relevant:
> {code:java}
> "Lucene Merge Thread #0" #18466 daemon prio=5 os_prio=0 cpu=15.63ms elapsed=3499.07s tid=0x459453e0 nid=0x1f8 waiting for monitor entry [0x5da9e000]
>    java.lang.Thread.State: BLOCKED (on object monitor)
>         at org.apache.lucene.store.FSDirectory.deletePendingFiles(FSDirectory.java:346)
>         - waiting to lock <0x0006d5c00208> (a org.apache.lucene.store.MMapDirectory)
>         at org.apache.lucene.store.FSDirectory.maybeDeletePendingFiles(FSDirectory.java:363)
>         at org.apache.lucene.store.FSDirectory.createOutput(FSDirectory.java:248)
>         at org.apache.lucene.store.LockValidatingDirectoryWrapper.createOutput(LockValidatingDirectoryWrapper.java:44)
>         at org.apache.lucene.index.ConcurrentMergeScheduler$1.createOutput(ConcurrentMergeScheduler.java:289)
>         at org.apache.lucene.store.TrackingDirectoryWrapper.createOutput(TrackingDirectoryWrapper.java:43)
>         at org.apache.lucene.codecs.compressing.CompressingStoredFieldsWriter.<init>(CompressingStoredFieldsWriter.java:121)
>         at org.apache.lucene.codecs.compressing.CompressingStoredFieldsFormat.fieldsWriter(CompressingStoredFieldsFormat.java:130)
>         at org.apache.lucene.codecs.lucene87.Lucene87StoredFieldsFormat.fieldsWriter(Lucene87StoredFieldsFormat.java:141)
>         at org.apache.lucene.index.SegmentMerger.mergeFields(SegmentMerger.java:227)
>         at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:105)
>         at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4757)
>         at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:4361)
>         at org.apache.lucene.index.IndexWriter$IndexWriterMergeSource.merge(IndexWriter.java:5920)
>         at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:626)
>         at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:684){code}
> It looks like the merge
operation never finished and never released the lock.
> Is there any option to prevent this deadlock, or how could we investigate it further?
> Unfortunately, a load test didn't reproduce this problem.
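The hazard in the stack traces above boils down to a lock-ordering cycle: application code holds the `MMapDirectory` monitor while closing the writer, close waits for merges, and the merge thread blocks trying to lock the directory. This sketch illustrates the pattern together with the kind of best-effort guard discussed in the comments; the `directory` monitor stands in for `MMapDirectory`, and the names are hypothetical, not Lucene's implementation.

```java
// Illustration of the lock-ordering hazard from the report above, with a
// best-effort guard a close()-like entry point could apply.
class DeadlockSketch {
    static final Object directory = new Object(); // stands in for MMapDirectory

    // A close()-like entry point could refuse to run while the caller holds
    // the directory monitor, because merge threads need it to finish.
    static boolean safeToClose() {
        return !Thread.holdsLock(directory);
    }

    static boolean closeUnderDirectoryLock() {
        synchronized (directory) {
            // Calling writer.close() here risks the reported deadlock:
            // close() waits for merges, while the merge thread blocks
            // trying to lock 'directory'.
            return safeToClose(); // false: the guard would reject this call
        }
    }
}
```

The guard is only best-effort, matching the caveat in the discussion: it cannot tell an internal Lucene thread holding the monitor apart from an external user thread.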
[GitHub] [lucene-jira-archive] mocobeta commented on issue #3: Create mapping on Jira user id -> GitHub account
mocobeta commented on issue #3: URL: https://github.com/apache/lucene-jira-archive/issues/3#issuecomment-1195419311 > Could we expand the matching so that if the userid in jira == the userid in GitHub we strongly suggest a match? E.g. mdmarshmallow would have been matched this way. It'd be easy to pick up such candidates - I think we'd need to manually verify all of them.
[jira] [Commented] (LUCENE-10583) Deadlock with MMapDirectory while waitForMerges
[ https://issues.apache.org/jira/browse/LUCENE-10583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17571383#comment-17571383 ] Michael McCandless commented on LUCENE-10583: - [~vigyas] can this be resolved now?
[GitHub] [lucene-jira-archive] mikemccand commented on issue #3: Create mapping on Jira user id -> GitHub account
mikemccand commented on issue #3: URL: https://github.com/apache/lucene-jira-archive/issues/3#issuecomment-1195411445 OK got it. Could we expand the matching so that if the userid in jira == the userid in GitHub we strongly suggest a match? E.g. `mdmarshmallow` would have been matched this way. Hmm, actually, his presented name (`Marc D'mello`) looks the same [in GitHub](https://github.com/mdmarshmallow) and [Jira](https://issues.apache.org/jira/secure/ViewProfile.jspa?name=mdmarshmallow). Oh, wait, no! One is `Marc D'mello` and the other is `Marc D'Mello` (m vs M). Maybe we can do a case insensitive comparison? But I'll push his account to the verified file separately.
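The case-insensitive comparison suggested above is a one-liner. This is a hypothetical sketch (the real mapping scripts live in lucene-jira-archive's migration tooling, and `AccountMatcher` is not part of them); it handles exactly the "Marc D'mello" vs "Marc D'Mello" case.

```java
// Hypothetical matcher for Jira vs GitHub display names, comparing
// case-insensitively so that differences like 'm' vs 'M' don't block a match.
class AccountMatcher {
    static boolean sameDisplayName(String jiraName, String gitHubName) {
        return jiraName != null
            && gitHubName != null
            && jiraName.equalsIgnoreCase(gitHubName);
    }
}
```

Candidates matched this way would still go through the manual verification step described earlier in the thread.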
[GitHub] [lucene-jira-archive] mocobeta commented on issue #3: Create mapping on Jira user id -> GitHub account
mocobeta commented on issue #3: URL: https://github.com/apache/lucene-jira-archive/issues/3#issuecomment-1195406408 Strictly speaking, the current "verified" account mapping includes both committers and commit authors; "commit authors" can be committers or contributors.

```
4. Verify the candidate GitHub accounts by checking if (1) the GitHub account has push access to [apache/lucene repository](https://github.com/apache/lucene), or (2) the GitHub account has been logged as a commit author in the repo's commit history at least once.
```
[GitHub] [lucene-jira-archive] mocobeta commented on issue #3: Create mapping on Jira user id -> GitHub account
mocobeta commented on issue #3: URL: https://github.com/apache/lucene-jira-archive/issues/3#issuecomment-1195399616 "Authors" are not necessarily committers; they are literally pull request authors (contributors). For example: https://github.com/apache/lucene/commit/2cf12b8cdcc629617b2d58c0a2a6336679ff9249
[GitHub] [lucene-jira-archive] mikemccand commented on issue #3: Create mapping on Jira user id -> GitHub account
mikemccand commented on issue #3: URL: https://github.com/apache/lucene-jira-archive/issues/3#issuecomment-1195397098 > We already include merged pull requests' authors (if their GitHub full names are set to the same string as Jira full names). Maybe we could also consider all opened pull requests' authors. OK thanks, but does this only work for committers? I was thinking if a contributor who is not a committer comments on a Jira issue and also opens a PR, linked to the issue, we could maybe correlate those two events to speculate about ID mapping. And then verify by hand after.
[GitHub] [lucene-jira-archive] mocobeta commented on issue #3: Create mapping on Jira user id -> GitHub account
mocobeta commented on issue #3: URL: https://github.com/apache/lucene-jira-archive/issues/3#issuecomment-1195389801 We already include merged pull requests' authors (if their GitHub full names are set to the same string as Jira full names). Maybe we could also consider all opened pull requests' authors.
[jira] [Commented] (LUCENE-10151) Add timeout support to IndexSearcher
[ https://issues.apache.org/jira/browse/LUCENE-10151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17571374#comment-17571374 ] Michael Sokolov commented on LUCENE-10151: -- oh, too bad. Well this feature is new, so at least no existing usage will be broken.

> Add timeout support to IndexSearcher
> ---
>
> Key: LUCENE-10151
> URL: https://issues.apache.org/jira/browse/LUCENE-10151
> Project: Lucene - Core
> Issue Type: Improvement
> Components: core/search
> Reporter: Greg Miller
> Priority: Minor
> Fix For: 9.3
>
> Time Spent: 3h 40m
> Remaining Estimate: 0h
>
> I'd like to explore adding optional "timeout" capabilities to {{IndexSearcher}}. This would enable users to (optionally) specify a maximum time budget for search execution. If the search "times out", partial results would be available.
> This idea originated on the dev list (thanks [~jpountz] for the suggestion). Thread for reference: http://mail-archives.apache.org/mod_mbox/lucene-dev/202110.mbox/%3CCAL8PwkZdNGmYJopPjeXYK%3DF7rvLkWon91UEXVxMM4MeeJ3UHxQ%40mail.gmail.com%3E
>
> A couple things to watch out for with this change:
> # We want to make sure it's robust to a two-phase query evaluation scenario where the "approximate" step matches a large number of candidates but the "confirmation" step matches very few (or none). This is a particularly tricky case.
> # We want to make sure the {{TotalHits#Relation}} reported by {{TopDocs}} is {{GREATER_THAN_OR_EQUAL_TO}} if the query times out
> # We want to make sure it plays nice with the {{LRUCache}} since it iterates the query to pre-populate a {{BitSet}} when caching. That step shouldn't be allowed to overrun the timeout. The proper way to handle this probably needs some thought.
[GitHub] [lucene] jpountz commented on a diff in pull request #1039: LUCENE-10635: Ensure test coverage for WANDScorer by using a test query
jpountz commented on code in PR #1039: URL: https://github.com/apache/lucene/pull/1039#discussion_r929867414 ## lucene/core/src/test/org/apache/lucene/search/TestWANDScorer.java: ## @@ -815,7 +856,7 @@ private void doTestRandomSpecialMaxScore(float maxScore) throws IOException { } builder.add(query, Occur.SHOULD); } - Query query = builder.build(); + Query query = numClauses > 0 ? new WANDScorerQuery(builder.build()) : builder.build(); Review Comment: Maybe we could instead handle it in WandScorerQuery by returning the single scorer when there is a single clause? ## lucene/core/src/test/org/apache/lucene/search/TestWANDScorer.java: ## @@ -947,4 +988,82 @@ public long cost() { }; } } + + private static class WANDScorerQuery extends Query { +private final BooleanQuery query; + +private WANDScorerQuery(BooleanQuery query) { Review Comment: I wonder if it would make the tests easier to read if we took an array of queries here: ```suggestion private WANDScorerQuery(Query... query) { ``` while still creating a `BooleanQuery` under the hood to reuse equals/hashcode/etc. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] jpountz commented on pull request #1042: Cache decoded length bytes for TFIDFSimilarity scorer.
jpountz commented on PR #1042: URL: https://github.com/apache/lucene/pull/1042#issuecomment-1195379555 Thanks @wuwm!
[GitHub] [lucene] jpountz merged pull request #1042: Cache decoded length bytes for TFIDFSimilarity scorer.
jpountz merged PR #1042: URL: https://github.com/apache/lucene/pull/1042
[jira] [Commented] (LUCENE-10151) Add timeout support to IndexSearcher
[ https://issues.apache.org/jira/browse/LUCENE-10151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17571370#comment-17571370 ] Adrien Grand commented on LUCENE-10151: --- I just noticed that the push of my backport had failed, so it will be in 9.4, not 9.3. I don't think it's worth respinning for it.
[jira] [Updated] (LUCENE-10660) precompute the max level in LogMergePolicy
[ https://issues.apache.org/jira/browse/LUCENE-10660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand updated LUCENE-10660: -- Fix Version/s: 9.4 (was: 9.3) > precompute the max level in LogMergePolicy > -- > > Key: LUCENE-10660 > URL: https://issues.apache.org/jira/browse/LUCENE-10660 > Project: Lucene - Core > Issue Type: Improvement > Components: core/index >Affects Versions: 9.2 >Reporter: tang donghai >Priority: Minor > Fix For: 9.4 > > Time Spent: 20m > Remaining Estimate: 0h > > I notice LogMergePolicy#findMerges will always recalculate the max level on the > right side when finding the next segments to merge. > > I think we could calculate the max levels only once, and when we need the max > level, we could simply > {code:java} > float maxLevel = maxLevels[start]; > {code} > and the precomputed code looks like below, comparing each level in levels from > right to left > {code:java} > float[] maxLevels = new float[numMergeableSegments + 1]; > maxLevels[numMergeableSegments] = -1.0f; > for (int i = numMergeableSegments - 1; i >= 0; i--) { > maxLevels[i] = Math.max(levels.get(i).level, maxLevels[i + 1]); > } > {code} >
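The suffix-maximum precomputation proposed in the issue can be sketched in isolation. Here a plain `float[]` stands in for LogMergePolicy's per-segment levels; the class and method names are invented for illustration, not Lucene code.

```java
// Self-contained sketch of the suffix-max idea from LUCENE-10660:
// precompute maxLevels once so "max level from `start` to the end"
// becomes an O(1) array lookup instead of a rescan per candidate range.
public class MaxLevels {
  // maxLevels[i] = max of levels[i..n-1]; maxLevels[n] = -1.0f sentinel.
  static float[] suffixMax(float[] levels) {
    int n = levels.length;
    float[] maxLevels = new float[n + 1];
    maxLevels[n] = -1.0f;
    for (int i = n - 1; i >= 0; i--) {
      maxLevels[i] = Math.max(levels[i], maxLevels[i + 1]);
    }
    return maxLevels;
  }

  public static void main(String[] args) {
    float[] levels = {3.0f, 7.0f, 2.0f, 5.0f, 1.0f};
    float[] max = suffixMax(levels);
    // The max level over levels[start..end] is now simply max[start]:
    assert max[0] == 7.0f; // max of {3, 7, 2, 5, 1}
    assert max[2] == 5.0f; // max of {2, 5, 1}
    assert max[4] == 1.0f; // max of {1}
    assert max[5] == -1.0f; // sentinel past the end
    System.out.println("ok");
  }
}
```

One O(n) right-to-left pass replaces the repeated right-side scans, which is the whole optimization.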
[jira] [Commented] (LUCENE-10660) precompute the max level in LogMergePolicy
[ https://issues.apache.org/jira/browse/LUCENE-10660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17571369#comment-17571369 ] Adrien Grand commented on LUCENE-10660: --- The change made sense to me and I merged it, thank you [~tangdh]!
[jira] [Resolved] (LUCENE-10660) precompute the max level in LogMergePolicy
[ https://issues.apache.org/jira/browse/LUCENE-10660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand resolved LUCENE-10660. --- Fix Version/s: 9.3 Resolution: Fixed
[jira] [Commented] (LUCENE-10151) Add timeout support to IndexSearcher
[ https://issues.apache.org/jira/browse/LUCENE-10151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17571367#comment-17571367 ] ASF subversion and git services commented on LUCENE-10151: -- Commit be81cd79346e869da94d9db89e1b863bfaabbd65 in lucene's branch refs/heads/branch_9x from Adrien Grand [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=be81cd79346 ] LUCENE-10151: Some fixes to query timeouts. (#996) I noticed some minor bugs in the original PR #927 that this PR should fix: - When a timeout is set, we would no longer catch `CollectionTerminatedException`. - I added randomization to `LuceneTestCase` to randomly set a timeout, it would have caught the above bug. - Fixed visibility of `TimeLimitingBulkScorer`.
[GitHub] [lucene] jpountz merged pull request #1045: LUCENE-10660: precompute maxlevel in LogMergePolicy
jpountz merged PR #1045: URL: https://github.com/apache/lucene/pull/1045
[GitHub] [lucene] jpountz commented on a diff in pull request #1047: LUCENE-10661: Reduce memory copy in BytesStore
jpountz commented on code in PR #1047: URL: https://github.com/apache/lucene/pull/1047#discussion_r929854326 ## lucene/core/src/java/org/apache/lucene/util/fst/BytesStore.java: ## @@ -179,6 +179,30 @@ void writeBytes(long dest, byte[] b, int offset, int len) { } } + @Override + public void copyBytes(DataInput input, long numBytes) throws IOException { +assert numBytes >= 0 : "numBytes=" + numBytes; +assert input != null; +int len = (int) numBytes; Review Comment: We could make `len` a long and avoid the unchecked cast, couldn't we?
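The review point — keep the remaining byte count as a `long` and only narrow to `int` once it is bounded by a page-sized chunk — can be sketched with a self-contained paged copy. This is not Lucene's `BytesStore`; the class, page size, and method names below are invented for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

// Sketch of copying `numBytes` from an input straight into fixed-size pages,
// with no intermediate buffer and no unchecked (int) cast of the full length:
// the cast happens only on a chunk already bounded by PAGE_SIZE.
public class PagedCopy {
  static final int PAGE_SIZE = 8;
  final List<byte[]> pages = new ArrayList<>();
  int upto = PAGE_SIZE; // position in current page; forces allocation on first write

  void copyBytes(InputStream in, long numBytes) throws IOException {
    assert numBytes >= 0 : "numBytes=" + numBytes;
    long left = numBytes; // stays a long for the whole copy
    while (left > 0) {
      if (upto == PAGE_SIZE) { pages.add(new byte[PAGE_SIZE]); upto = 0; }
      int chunk = (int) Math.min(left, PAGE_SIZE - upto); // safe: <= PAGE_SIZE
      int read = in.read(pages.get(pages.size() - 1), upto, chunk);
      if (read == -1) throw new IOException("unexpected end of input");
      upto += read;
      left -= read;
    }
  }

  public static void main(String[] args) throws IOException {
    PagedCopy store = new PagedCopy();
    byte[] data = new byte[20];
    for (int i = 0; i < data.length; i++) data[i] = (byte) i;
    store.copyBytes(new ByteArrayInputStream(data), data.length);
    assert store.pages.size() == 3;    // ceil(20 / 8) pages allocated
    assert store.pages.get(1)[0] == 8; // second page starts at byte 8
    System.out.println("ok");
  }
}
```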
[GitHub] [lucene] matriv opened a new pull request, #1049: [LUCENE-10662] Make LuceneTestCase to not extend from org.junit.Assert
matriv opened a new pull request, #1049: URL: https://github.com/apache/lucene/pull/1049 ### Description (or a Jira issue link if you have one) https://issues.apache.org/jira/browse/LUCENE-10662
[jira] [Created] (LUCENE-10662) Make LuceneTestCase not extending from org.junit.Assert
Marios Trivyzas created LUCENE-10662: Summary: Make LuceneTestCase not extending from org.junit.Assert Key: LUCENE-10662 URL: https://issues.apache.org/jira/browse/LUCENE-10662 Project: Lucene - Core Issue Type: Test Components: general/test Reporter: Marios Trivyzas Since *LuceneTestCase* is a very useful abstract class that can be extended and used by many projects, having it extend *org.junit.Assert* limits all users to exclusively use the static methods of {*}org.junit.Assert{*}. In our project we want to use [https://joel-costigliola.github.io/assertj] where the main method to call is *org.assertj.core.api.Assertions.assertThat*, which conflicts with the deprecated {*}org.junit.Assert.assertThat{*} that the compiler resolves by default. So one can only use assertj by writing the fully qualified name for the *assertThat* method on every call, i.e. {code:java} org.assertj.core.api.Assertions.assertThat(myObj.name()).isEqualTo(expectedName) {code}
[jira] [Commented] (LUCENE-10592) Should we build HNSW graph on the fly during indexing
[ https://issues.apache.org/jira/browse/LUCENE-10592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17571355#comment-17571355 ] Adrien Grand commented on LUCENE-10592: --- I just pushed an annotation that should show up in the next couple days. > Should we build HNSW graph on the fly during indexing > - > > Key: LUCENE-10592 > URL: https://issues.apache.org/jira/browse/LUCENE-10592 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Mayya Sharipova >Assignee: Mayya Sharipova >Priority: Minor > Fix For: 9.4 > > Attachments: Screen Shot 2022-07-25 at 9.04.11 AM.png > > Time Spent: 8h > Remaining Estimate: 0h > > Currently, when we index vectors for KnnVectorField, we buffer those vectors > in memory and on flush during a segment construction we build an HNSW graph. > As building an HNSW graph is very expensive, this makes flush operation take > a lot of time. This also makes overall indexing performance quite > unpredictable (as the number of flushes are defined by memory used, and the > presence of concurrent searches), e.g. some indexing operations return almost > instantly while others that trigger flush take a lot of time. > Building an HNSW graph on the fly as we index vectors allows to avoid this > problem, and spread a load of HNSW graph construction evenly during indexing. > This will also supersede LUCENE-10194
[GitHub] [lucene-jira-archive] mikemccand commented on issue #3: Create mapping on Jira user id -> GitHub account
mikemccand commented on issue #3: URL: https://github.com/apache/lucene-jira-archive/issues/3#issuecomment-1195354814 Could we maybe look for Jira issues that have GitHub PRs attached and "correlate" the ids of who opened the PR against who commented on the issue? It would clearly not be perfect, but it could provide input for a human to sift through and carry over some verified accounts.
[GitHub] [lucene-jira-archive] mikemccand closed issue #27: Improve the `Jira Information` header?
mikemccand closed issue #27: Improve the `Jira Information` header? URL: https://github.com/apache/lucene-jira-archive/issues/27
[GitHub] [lucene-jira-archive] mikemccand commented on issue #27: Improve the `Jira Information` header?
mikemccand commented on issue #27: URL: https://github.com/apache/lucene-jira-archive/issues/27#issuecomment-1195343198 I think this one is done!
[GitHub] [lucene-jira-archive] mikemccand closed issue #79: Carry parent issue over
mikemccand closed issue #79: Carry parent issue over URL: https://github.com/apache/lucene-jira-archive/issues/79
[GitHub] [lucene-jira-archive] mikemccand merged pull request #80: #79: include parent issue link
mikemccand merged PR #80: URL: https://github.com/apache/lucene-jira-archive/pull/80
[GitHub] [lucene] dweiss commented on issue #1048: Why lucene doc id changes after updating or merging?
dweiss commented on issue #1048: URL: https://github.com/apache/lucene/issues/1048#issuecomment-1195234738 If you need constant IDs, use a stored document field. IDs are internal because they are used for per-segment document ordering and once segments are merged, any previous document IDs (they're actually an ordinal sequence) are discarded. The naming "ID" may be confusing - it's not a "global" identifier, it is only unique within a segment.
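The explanation above — internal doc IDs are just per-segment ordinals that get reassigned on merge, so only an application-level stored field is stable — can be shown with a toy model. This is not Lucene internals; a "segment" below is just a list of application keys, and the invented helper treats each document's current position as its internal ID.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model (not Lucene code) of why internal doc IDs are unstable:
// the "ID" is nothing more than a document's current ordinal in its segment,
// and merging segments hands out fresh ordinals.
public class SegmentMergeDemo {
  static List<String> merge(List<String> seg1, List<String> seg2) {
    List<String> merged = new ArrayList<>(seg1);
    merged.addAll(seg2); // every doc's ordinal is recomputed in the merged segment
    return merged;
  }

  public static void main(String[] args) {
    List<String> seg1 = List.of("user-7", "user-9");
    List<String> seg2 = List.of("user-3");
    // Before the merge, "user-3" has internal ID 0 *within its own segment*.
    assert seg2.indexOf("user-3") == 0;
    List<String> merged = merge(seg1, seg2);
    // After the merge its ordinal changed; only the stored key "user-3" is stable.
    assert merged.indexOf("user-3") == 2;
    System.out.println("ok");
  }
}
```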
[GitHub] [lucene] dweiss closed issue #1048: Why lucene doc id changes after updating or merging?
dweiss closed issue #1048: Why lucene doc id changes after updating or merging? URL: https://github.com/apache/lucene/issues/1048
[jira] [Commented] (LUCENE-10616) Moving to dictionaries has made stored fields slower at skipping
[ https://issues.apache.org/jira/browse/LUCENE-10616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17571217#comment-17571217 ] fang hou commented on LUCENE-10616: --- I think this PR [https://github.com/apache/lucene/pull/1003] is ready for review. As Adrien advised above, it changes the {{decompress}} signature to return an {{InputStream}} so decompression can happen lazily. Rather than returning {{STOP}} from {{{}StoredFieldVisitor#needsField{}}} (tried, but that appears impossible due to multi-valued fields; see the test case), the PR makes the skip method smart enough to bypass unneeded compressed blocks by reading the compressed block length. So for a large unneeded field we can save a lot of decompression time. This applies to both {{BEST_SPEED}} mode and {{HIGH_COMPRESSION}} mode, so the PR optimizes both modes with a preset dictionary. Could someone give some feedback? Thanks, cc [~jpountz] > Moving to dictionaries has made stored fields slower at skipping > > > Key: LUCENE-10616 > URL: https://issues.apache.org/jira/browse/LUCENE-10616 > Project: Lucene - Core > Issue Type: Bug >Reporter: Adrien Grand >Priority: Minor > Time Spent: 1h 40m > Remaining Estimate: 0h > > [~ywelsch] has been digging into a regression of stored fields retrieval that > is caused by LUCENE-9486. > Say your documents have two stored fields, one that is 100B and is stored > first, and the other one that is 100kB, and you are only interested in the > first one. While the idea behind blocks of stored fields is to store multiple > documents in the same block to leverage redundancy across documents, > sometimes documents are larger than the block size. As soon as documents are > larger than 2x the block size, our stored fields format splits such large > documents into multiple blocks, so that you wouldn't need to decompress > everything only to retrieve a couple small fields.
> Before LUCENE-9486, BEST_SPEED had a block size of 16kB, so only retrieving > the first field value would only need to decompress 16kB of data. With the > move to preset dictionaries in LUCENE-9486 and then LUCENE-9917, we now have > blocks of 80kB, so stored fields would now need to decompress 80kB of data, > 5x more than before. > With dictionaries, our blocks are now split into 10 sub blocks. We happen to > eagerly decompress all sub blocks that intersect with the stored document, > which is why we would decompress 80kB of data, but this is an implementation > detail. It should be possible to decompress these sub blocks lazily so that > we would only decompress those that intersect with one of the field values > that the user is interested in retrieving?
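The skip-by-length idea from the PR discussion — hop over an unneeded compressed sub-block by reading its stored length instead of decompressing it — can be sketched with plain `java.util.zip`. This is not Lucene's stored-fields format; the framing ([4-byte length][compressed data] per sub-block) and names below are invented for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Sketch of lazy sub-block decompression: each sub-block is written as
// [compressedLength][compressed data], so a reader can skip unwanted
// sub-blocks via the length prefix without inflating them.
public class LazyBlockDemo {
  static byte[] writeBlocks(String[] blocks) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    Deflater deflater = new Deflater();
    byte[] buf = new byte[1 << 16];
    for (String block : blocks) {
      deflater.reset();
      deflater.setInput(block.getBytes(StandardCharsets.UTF_8));
      deflater.finish();
      int len = deflater.deflate(buf); // small inputs fit in one call
      out.write(ByteBuffer.allocate(4).putInt(len).array(), 0, 4);
      out.write(buf, 0, len);
    }
    return out.toByteArray();
  }

  // Decompress only block `target`, skipping earlier blocks by their prefixes.
  static String readBlock(byte[] data, int target) throws DataFormatException {
    ByteBuffer in = ByteBuffer.wrap(data);
    for (int i = 0; i < target; i++) {
      in.position(in.position() + in.getInt()); // skip: no decompression work
    }
    int len = in.getInt();
    byte[] compressed = new byte[len];
    in.get(compressed);
    Inflater inflater = new Inflater();
    inflater.setInput(compressed);
    byte[] buf = new byte[1 << 16];
    int n = inflater.inflate(buf);
    return new String(buf, 0, n, StandardCharsets.UTF_8);
  }

  public static void main(String[] args) throws DataFormatException {
    byte[] data = writeBlocks(new String[] {"small field", "a much larger field we do not need"});
    assert readBlock(data, 0).equals("small field"); // the large block is never inflated
    System.out.println("ok");
  }
}
```

In the 80kB-block scenario above, this is the difference between inflating all 10 sub-blocks eagerly and inflating only the one containing the requested field value.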
[GitHub] [lucene] JoeHF commented on pull request #1003: LUCENE-10616: optimizing decompress when only retrieving some fields
JoeHF commented on PR #1003: URL: https://github.com/apache/lucene/pull/1003#issuecomment-1195050814 no obvious regression or perf improvement, guess there are no such cases in benchmark

wikimedium10k:

Task  QPS baseline  StdDev  QPS my_modified_version  StdDev  Pct diff  p-value
BrowseRandomLabelTaxoFacets  569.65 (7.9%)  543.58 (15.4%)  -4.6% ( -25% - 20%)  0.236
Prefix3  377.77 (9.1%)  368.32 (6.6%)  -2.5% ( -16% - 14%)  0.321
AndHighMed  656.18 (8.2%)  648.39 (10.6%)  -1.2% ( -18% - 19%)  0.691
MedIntervalsOrdered  574.68 (6.3%)  567.95 (9.7%)  -1.2% ( -16% - 15%)  0.651
AndHighLow  978.77 (9.3%)  972.00 (8.5%)  -0.7% ( -16% - 18%)  0.806
HighSpanNear  425.66 (8.3%)  423.78 (10.2%)  -0.4% ( -17% - 19%)  0.880
OrHighMed  656.72 (8.2%)  655.28 (10.5%)  -0.2% ( -17% - 20%)  0.942
LowIntervalsOrdered  481.42 (5.2%)  480.65 (10.3%)  -0.2% ( -14% - 16%)  0.951
HighPhrase  500.26 (7.6%)  499.86 (11.4%)  -0.1% ( -17% - 20%)  0.979
Respell  123.33 (11.8%)  123.48 (10.4%)  0.1% ( -19% - 25%)  0.973
OrHighHigh  416.58 (6.9%)  417.19 (9.4%)  0.1% ( -15% - 17%)  0.955
MedTerm  2063.41 (9.5%)  2069.51 (11.0%)  0.3% ( -18% - 23%)  0.928
LowSloppyPhrase  301.12 (7.5%)  303.12 (12.6%)  0.7% ( -18% - 22%)  0.840
HighTerm  1088.05 (9.8%)  1102.10 (14.8%)  1.3% ( -21% - 28%)  0.745
LowPhrase  896.10 (8.4%)  907.71 (9.8%)  1.3% ( -15% - 21%)  0.654
HighSloppyPhrase  309.31 (8.1%)  313.60 (10.0%)  1.4% ( -15% - 21%)  0.629
Fuzzy2  42.78 (11.1%)  43.46 (12.2%)  1.6% ( -19% - 27%)  0.665
Wildcard  315.36 (9.2%)  320.46 (7.7%)  1.6% ( -14% - 20%)  0.548
MedSpanNear  520.33 (6.6%)  530.21 (11.6%)  1.9% ( -15% - 21%)  0.524
HighIntervalsOrdered  356.49 (10.3%)  363.39 (10.1%)  1.9% ( -16% - 24%)  0.547
AndHighHigh  619.32 (5.9%)  631.54 (9.5%)  2.0% ( -12% - 18%)  0.432
HighTermMonthSort  1479.95 (6.0%)  1509.95 (11.1%)  2.0% ( -14% - 20%)  0.472
MedSloppyPhrase  230.30 (8.6%)  235.24 (10.8%)  2.1% ( -15% - 23%)  0.488
MedPhrase  567.04 (6.2%)  579.72 (11.5%)  2.2% ( -14% - 21%)  0.442
BrowseRandomLabelSSDVFacets  350.13 (10.2%)  358.12 (16.8%)  2.3% ( -22% - 32%)  0.604
HighTermDayOfYearSort  1087.80 (7.4%)  1118.61 (8.5%)  2.8% ( -12% - 20%)  0.260
LowTerm  2557.43 (9.2%)  2636.37 (8.9%)  3.1% ( -13% - 23%)  0.281
LowSpanNear  795.88 (9.0%)  828.70 (11.1%)  4.1% ( -14% - 26%)  0.195
PKLookup  26.79 (16.3%)  27.91 (19.8%)  4.2% ( -27% - 48%)  0.466
Fuzzy1  136.23 (9.7%)  142.21 (16.8%)  4.4% ( -20% - 34%)  0.312
BrowseMonthTaxoFacets  801.97 (17.7%)  840.43 (19.1%)  4.8% ( -27% - 50%)  0.410
IntNRQ  603.46 (10.0%)  636.52 (7.9%)  5.5% ( -11% - 25%)  0.054
OrHighLow  532.25 (9.0%)  562.37 (13.6%)  5.7% ( -15% - 31%)  0.121
BrowseMonthSSDVFacets  839.55 (20.9%)  894.35 (22.1%)  6.5% ( -30% - 62%)  0.337
BrowseDayOfYearTaxoFacets  784.80 (16.4%)  839.36 (25.1%)  7.0% ( -29% - 58%)  0.300
BrowseDateTaxoFacets  849.34 (17.9%)  908.75 (25.8%)  7.0% ( -31% - 61%)  0.319
BrowseDayOfYearSSDVFacets  832.41 (17.6%)  907.43 (22.9%)  9.0% ( -26% - 60%)  0.163
BrowseDateSSDVFacets  215.63 (21.1%)  241.38 (27.5%)  11.9% ( -30% - 76%)  0.123

wikimedium1m:

Task  QPS baseline  StdDev  QPS my_modified_version  StdDev  Pct diff  p-value
Respell  39.67 (11.8%)  37.57 (14.2%)  -5.3% ( -27% - 23%)  0.200