[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?
[ https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17306249#comment-17306249 ] Michael McCandless commented on LUCENE-8962: Hi Team! This issue was so exotic and such a great example of powerful open-source collaboration, which I suspect many people didn't even notice, so I wrote fun blog post explaining/uncovering all the juicy details: [http://blog.mikemccandless.com/2021/03/open-source-collaboration-or-how-we.html] > Can we merge small segments during refresh, for faster searching? > - > > Key: LUCENE-8962 > URL: https://issues.apache.org/jira/browse/LUCENE-8962 > Project: Lucene - Core > Issue Type: Improvement > Components: core/index >Reporter: Michael McCandless >Priority: Major > Fix For: main (9.0), 8.6, 8.7 > > Attachments: LUCENE-8962_demo.png, failed-tests.patch, > failure_log.txt, image-2021-03-22-10-36-32-201.png, test.diff > > Time Spent: 31h > Remaining Estimate: 0h > > Two improvements were added: 8.6 has merge-on-commit (by Froh et. all), 8.7 > has merge-on-refresh (by Simon). See \{{MergePolicy.findFullFlushMerges}} > The original description follows: > > With near-real-time search we ask {{IndexWriter}} to write all in-memory > segments to disk and open an {{IndexReader}} to search them, and this is > typically a quick operation. > However, when you use many threads for concurrent indexing, {{IndexWriter}} > will accumulate write many small segments during {{refresh}} and this then > adds search-time cost as searching must visit all of these tiny segments. > The merge policy would normally quickly coalesce these small segments if > given a little time ... so, could we somehow improve \{{IndexWriter'}}s > refresh to optionally kick off merge policy to merge segments below some > threshold before opening the near-real-time reader? It'd be a bit tricky > because while we are waiting for merges, indexing may continue, and new > segments may be flushed, but those new segments shouldn't be included in the > point-in-time segments returned by refresh ... > One could almost do this on top of Lucene today, with a custom merge policy, > and some hackity logic to have the merge policy target small segments just > written by refresh, but it's tricky to then open a near-real-time reader, > excluding newly flushed but including newly merged segments since the refresh > originally finished ... > I'm not yet sure how best to solve this, so I wanted to open an issue for > discussion! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?
[ https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17184735#comment-17184735 ] Michael McCandless commented on LUCENE-8962: Yeah, +1 we should have opened a new issue for the {{getReader}} case (separate from the {{commit}} case which already shipped in 8.6). It is a related but different feature, and, yes, in a different Lucene 8.x feature release. This was an exciting change, conceived originally by [~msfroh], contributed by [~msoko...@gmail.com], undergoing many iterations after that, exotic test failures, tons of help from [~simonw] to iterate the change to a clean solution that would work for both {{commit}} and {{getReader}}, beasting, more iterations, commits then reverts, etc. I think it would make a great blog post! > Can we merge small segments during refresh, for faster searching? > - > > Key: LUCENE-8962 > URL: https://issues.apache.org/jira/browse/LUCENE-8962 > Project: Lucene - Core > Issue Type: Improvement > Components: core/index >Reporter: Michael McCandless >Priority: Major > Fix For: master (9.0), 8.6 > > Attachments: LUCENE-8962_demo.png, failed-tests.patch, > failure_log.txt, test.diff > > Time Spent: 31h > Remaining Estimate: 0h > > With near-real-time search we ask {{IndexWriter}} to write all in-memory > segments to disk and open an {{IndexReader}} to search them, and this is > typically a quick operation. > However, when you use many threads for concurrent indexing, {{IndexWriter}} > will accumulate write many small segments during {{refresh}} and this then > adds search-time cost as searching must visit all of these tiny segments. > The merge policy would normally quickly coalesce these small segments if > given a little time ... so, could we somehow improve {{IndexWriter'}}s > refresh to optionally kick off merge policy to merge segments below some > threshold before opening the near-real-time reader? It'd be a bit tricky > because while we are waiting for merges, indexing may continue, and new > segments may be flushed, but those new segments shouldn't be included in the > point-in-time segments returned by refresh ... > One could almost do this on top of Lucene today, with a custom merge policy, > and some hackity logic to have the merge policy target small segments just > written by refresh, but it's tricky to then open a near-real-time reader, > excluding newly flushed but including newly merged segments since the refresh > originally finished ... > I'm not yet sure how best to solve this, so I wanted to open an issue for > discussion! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?
[ https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17184702#comment-17184702 ] David Smiley commented on LUCENE-8962: -- Then we need a new JIRA issue targeted at 8.7 discussing this. Once a release ships, the issues it references (in CHANGES.txt) are read-only from a code standpoint. If 8.6 hadn't shipped, then we could have iterated on it in this issue. Although sadly git history is forever, the situation can be fixed up by using a new issue and changing the CHANTES.txt to refer to it. > Can we merge small segments during refresh, for faster searching? > - > > Key: LUCENE-8962 > URL: https://issues.apache.org/jira/browse/LUCENE-8962 > Project: Lucene - Core > Issue Type: Improvement > Components: core/index >Reporter: Michael McCandless >Priority: Major > Fix For: master (9.0), 8.6 > > Attachments: LUCENE-8962_demo.png, failed-tests.patch, > failure_log.txt, test.diff > > Time Spent: 31h > Remaining Estimate: 0h > > With near-real-time search we ask {{IndexWriter}} to write all in-memory > segments to disk and open an {{IndexReader}} to search them, and this is > typically a quick operation. > However, when you use many threads for concurrent indexing, {{IndexWriter}} > will accumulate write many small segments during {{refresh}} and this then > adds search-time cost as searching must visit all of these tiny segments. > The merge policy would normally quickly coalesce these small segments if > given a little time ... so, could we somehow improve {{IndexWriter'}}s > refresh to optionally kick off merge policy to merge segments below some > threshold before opening the near-real-time reader? It'd be a bit tricky > because while we are waiting for merges, indexing may continue, and new > segments may be flushed, but those new segments shouldn't be included in the > point-in-time segments returned by refresh ... > One could almost do this on top of Lucene today, with a custom merge policy, > and some hackity logic to have the merge policy target small segments just > written by refresh, but it's tricky to then open a near-real-time reader, > excluding newly flushed but including newly merged segments since the refresh > originally finished ... > I'm not yet sure how best to solve this, so I wanted to open an issue for > discussion! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?
[ https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17184699#comment-17184699 ] Michael Froh commented on LUCENE-8962: -- 8.6 added ability to merge small segments on commit. The more recent changes add the ability to merge on getReader calls (which is what the original issue was asking for -- merging on commit was a slightly easier step on the way there). > Can we merge small segments during refresh, for faster searching? > - > > Key: LUCENE-8962 > URL: https://issues.apache.org/jira/browse/LUCENE-8962 > Project: Lucene - Core > Issue Type: Improvement > Components: core/index >Reporter: Michael McCandless >Priority: Major > Fix For: master (9.0), 8.6 > > Attachments: LUCENE-8962_demo.png, failed-tests.patch, > failure_log.txt, test.diff > > Time Spent: 31h > Remaining Estimate: 0h > > With near-real-time search we ask {{IndexWriter}} to write all in-memory > segments to disk and open an {{IndexReader}} to search them, and this is > typically a quick operation. > However, when you use many threads for concurrent indexing, {{IndexWriter}} > will accumulate write many small segments during {{refresh}} and this then > adds search-time cost as searching must visit all of these tiny segments. > The merge policy would normally quickly coalesce these small segments if > given a little time ... so, could we somehow improve {{IndexWriter'}}s > refresh to optionally kick off merge policy to merge segments below some > threshold before opening the near-real-time reader? It'd be a bit tricky > because while we are waiting for merges, indexing may continue, and new > segments may be flushed, but those new segments shouldn't be included in the > point-in-time segments returned by refresh ... > One could almost do this on top of Lucene today, with a custom merge policy, > and some hackity logic to have the merge policy target small segments just > written by refresh, but it's tricky to then open a near-real-time reader, > excluding newly flushed but including newly merged segments since the refresh > originally finished ... > I'm not yet sure how best to solve this, so I wanted to open an issue for > discussion! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?
[ https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17184698#comment-17184698 ] David Smiley commented on LUCENE-8962: -- I've been following along here (or _trying to_) and I'm confused by the recent commits. LUCENE-8962 was fixed & released into 8.6 which shipped. Why are new commits landing that reference it? > Can we merge small segments during refresh, for faster searching? > - > > Key: LUCENE-8962 > URL: https://issues.apache.org/jira/browse/LUCENE-8962 > Project: Lucene - Core > Issue Type: Improvement > Components: core/index >Reporter: Michael McCandless >Priority: Major > Fix For: master (9.0), 8.6 > > Attachments: LUCENE-8962_demo.png, failed-tests.patch, > failure_log.txt, test.diff > > Time Spent: 31h > Remaining Estimate: 0h > > With near-real-time search we ask {{IndexWriter}} to write all in-memory > segments to disk and open an {{IndexReader}} to search them, and this is > typically a quick operation. > However, when you use many threads for concurrent indexing, {{IndexWriter}} > will accumulate write many small segments during {{refresh}} and this then > adds search-time cost as searching must visit all of these tiny segments. > The merge policy would normally quickly coalesce these small segments if > given a little time ... so, could we somehow improve {{IndexWriter'}}s > refresh to optionally kick off merge policy to merge segments below some > threshold before opening the near-real-time reader? It'd be a bit tricky > because while we are waiting for merges, indexing may continue, and new > segments may be flushed, but those new segments shouldn't be included in the > point-in-time segments returned by refresh ... > One could almost do this on top of Lucene today, with a custom merge policy, > and some hackity logic to have the merge policy target small segments just > written by refresh, but it's tricky to then open a near-real-time reader, > excluding newly flushed but including newly merged segments since the refresh > originally finished ... > I'm not yet sure how best to solve this, so I wanted to open an issue for > discussion! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?
[ https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17184679#comment-17184679 ] Michael McCandless commented on LUCENE-8962: Woot! Thank you [~simonw]! > Can we merge small segments during refresh, for faster searching? > - > > Key: LUCENE-8962 > URL: https://issues.apache.org/jira/browse/LUCENE-8962 > Project: Lucene - Core > Issue Type: Improvement > Components: core/index >Reporter: Michael McCandless >Priority: Major > Fix For: master (9.0), 8.6 > > Attachments: LUCENE-8962_demo.png, failed-tests.patch, > failure_log.txt, test.diff > > Time Spent: 31h > Remaining Estimate: 0h > > With near-real-time search we ask {{IndexWriter}} to write all in-memory > segments to disk and open an {{IndexReader}} to search them, and this is > typically a quick operation. > However, when you use many threads for concurrent indexing, {{IndexWriter}} > will accumulate write many small segments during {{refresh}} and this then > adds search-time cost as searching must visit all of these tiny segments. > The merge policy would normally quickly coalesce these small segments if > given a little time ... so, could we somehow improve {{IndexWriter'}}s > refresh to optionally kick off merge policy to merge segments below some > threshold before opening the near-real-time reader? It'd be a bit tricky > because while we are waiting for merges, indexing may continue, and new > segments may be flushed, but those new segments shouldn't be included in the > point-in-time segments returned by refresh ... > One could almost do this on top of Lucene today, with a custom merge policy, > and some hackity logic to have the merge policy target small segments just > written by refresh, but it's tricky to then open a near-real-time reader, > excluding newly flushed but including newly merged segments since the refresh > originally finished ... > I'm not yet sure how best to solve this, so I wanted to open an issue for > discussion! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?
[ https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17183522#comment-17183522 ] ASF subversion and git services commented on LUCENE-8962: - Commit 2f2146aad46b5969e6fbc99d85e21f595aba56c9 in lucene-solr's branch refs/heads/branch_8x from Simon Willnauer [ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=2f2146a ] LUCENE-8962: Merge segments on getReader (#1623) Add IndexWriter merge-on-refresh feature to selectively merge small segments on getReader, subject to a configurable timeout, to improve search performance by reducing the number of small segments for searching. Co-authored-by: Mike McCandless > Can we merge small segments during refresh, for faster searching? > - > > Key: LUCENE-8962 > URL: https://issues.apache.org/jira/browse/LUCENE-8962 > Project: Lucene - Core > Issue Type: Improvement > Components: core/index >Reporter: Michael McCandless >Priority: Major > Fix For: master (9.0), 8.6 > > Attachments: LUCENE-8962_demo.png, failed-tests.patch, > failure_log.txt, test.diff > > Time Spent: 31h > Remaining Estimate: 0h > > With near-real-time search we ask {{IndexWriter}} to write all in-memory > segments to disk and open an {{IndexReader}} to search them, and this is > typically a quick operation. > However, when you use many threads for concurrent indexing, {{IndexWriter}} > will accumulate write many small segments during {{refresh}} and this then > adds search-time cost as searching must visit all of these tiny segments. > The merge policy would normally quickly coalesce these small segments if > given a little time ... so, could we somehow improve {{IndexWriter'}}s > refresh to optionally kick off merge policy to merge segments below some > threshold before opening the near-real-time reader? It'd be a bit tricky > because while we are waiting for merges, indexing may continue, and new > segments may be flushed, but those new segments shouldn't be included in the > point-in-time segments returned by refresh ... > One could almost do this on top of Lucene today, with a custom merge policy, > and some hackity logic to have the merge policy target small segments just > written by refresh, but it's tricky to then open a near-real-time reader, > excluding newly flushed but including newly merged segments since the refresh > originally finished ... > I'm not yet sure how best to solve this, so I wanted to open an issue for > discussion! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?
[ https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17183517#comment-17183517 ] ASF subversion and git services commented on LUCENE-8962: - Commit 8294e1ae2068aa39e91c25cbaabf62afae40a02e in lucene-solr's branch refs/heads/master from Simon Willnauer [ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=8294e1a ] LUCENE-8962: Merge segments on getReader (#1623) Add IndexWriter merge-on-refresh feature to selectively merge small segments on getReader, subject to a configurable timeout, to improve search performance by reducing the number of small segments for searching. Co-authored-by: Mike McCandless > Can we merge small segments during refresh, for faster searching? > - > > Key: LUCENE-8962 > URL: https://issues.apache.org/jira/browse/LUCENE-8962 > Project: Lucene - Core > Issue Type: Improvement > Components: core/index >Reporter: Michael McCandless >Priority: Major > Fix For: master (9.0), 8.6 > > Attachments: LUCENE-8962_demo.png, failed-tests.patch, > failure_log.txt, test.diff > > Time Spent: 31h > Remaining Estimate: 0h > > With near-real-time search we ask {{IndexWriter}} to write all in-memory > segments to disk and open an {{IndexReader}} to search them, and this is > typically a quick operation. > However, when you use many threads for concurrent indexing, {{IndexWriter}} > will accumulate write many small segments during {{refresh}} and this then > adds search-time cost as searching must visit all of these tiny segments. > The merge policy would normally quickly coalesce these small segments if > given a little time ... so, could we somehow improve {{IndexWriter'}}s > refresh to optionally kick off merge policy to merge segments below some > threshold before opening the near-real-time reader? It'd be a bit tricky > because while we are waiting for merges, indexing may continue, and new > segments may be flushed, but those new segments shouldn't be included in the > point-in-time segments returned by refresh ... > One could almost do this on top of Lucene today, with a custom merge policy, > and some hackity logic to have the merge policy target small segments just > written by refresh, but it's tricky to then open a near-real-time reader, > excluding newly flushed but including newly merged segments since the refresh > originally finished ... > I'm not yet sure how best to solve this, so I wanted to open an issue for > discussion! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?
[ https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17148366#comment-17148366 ] ASF subversion and git services commented on LUCENE-8962: - Commit 3377b09fcbf46315ac64350ab259a45b1ec2889c in lucene-solr's branch refs/heads/jira/SOLR-14354 from Simon Willnauer [ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=3377b09 ] LUCENE-8962: Ensure we never flush by ram buffer or doc count in test > Can we merge small segments during refresh, for faster searching? > - > > Key: LUCENE-8962 > URL: https://issues.apache.org/jira/browse/LUCENE-8962 > Project: Lucene - Core > Issue Type: Improvement > Components: core/index >Reporter: Michael McCandless >Priority: Major > Fix For: master (9.0), 8.6 > > Attachments: LUCENE-8962_demo.png, failed-tests.patch, > failure_log.txt, test.diff > > Time Spent: 23h 20m > Remaining Estimate: 0h > > With near-real-time search we ask {{IndexWriter}} to write all in-memory > segments to disk and open an {{IndexReader}} to search them, and this is > typically a quick operation. > However, when you use many threads for concurrent indexing, {{IndexWriter}} > will accumulate write many small segments during {{refresh}} and this then > adds search-time cost as searching must visit all of these tiny segments. > The merge policy would normally quickly coalesce these small segments if > given a little time ... so, could we somehow improve {{IndexWriter'}}s > refresh to optionally kick off merge policy to merge segments below some > threshold before opening the near-real-time reader? It'd be a bit tricky > because while we are waiting for merges, indexing may continue, and new > segments may be flushed, but those new segments shouldn't be included in the > point-in-time segments returned by refresh ... > One could almost do this on top of Lucene today, with a custom merge policy, > and some hackity logic to have the merge policy target small segments just > written by refresh, but it's tricky to then open a near-real-time reader, > excluding newly flushed but including newly merged segments since the refresh > originally finished ... > I'm not yet sure how best to solve this, so I wanted to open an issue for > discussion! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?
[ https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17147239#comment-17147239 ] ASF subversion and git services commented on LUCENE-8962: - Commit f8b1d698af981251120e4e96255dbc0ee42e3ed8 in lucene-solr's branch refs/heads/branch_8x from Simon Willnauer [ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=f8b1d69 ] LUCENE-8962: Ensure we never flush by ram buffer or doc count in test > Can we merge small segments during refresh, for faster searching? > - > > Key: LUCENE-8962 > URL: https://issues.apache.org/jira/browse/LUCENE-8962 > Project: Lucene - Core > Issue Type: Improvement > Components: core/index >Reporter: Michael McCandless >Priority: Major > Fix For: master (9.0), 8.6 > > Attachments: LUCENE-8962_demo.png, failed-tests.patch, > failure_log.txt, test.diff > > Time Spent: 22h 40m > Remaining Estimate: 0h > > With near-real-time search we ask {{IndexWriter}} to write all in-memory > segments to disk and open an {{IndexReader}} to search them, and this is > typically a quick operation. > However, when you use many threads for concurrent indexing, {{IndexWriter}} > will accumulate write many small segments during {{refresh}} and this then > adds search-time cost as searching must visit all of these tiny segments. > The merge policy would normally quickly coalesce these small segments if > given a little time ... so, could we somehow improve {{IndexWriter'}}s > refresh to optionally kick off merge policy to merge segments below some > threshold before opening the near-real-time reader? It'd be a bit tricky > because while we are waiting for merges, indexing may continue, and new > segments may be flushed, but those new segments shouldn't be included in the > point-in-time segments returned by refresh ... > One could almost do this on top of Lucene today, with a custom merge policy, > and some hackity logic to have the merge policy target small segments just > written by refresh, but it's tricky to then open a near-real-time reader, > excluding newly flushed but including newly merged segments since the refresh > originally finished ... > I'm not yet sure how best to solve this, so I wanted to open an issue for > discussion! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?
[ https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17147238#comment-17147238 ] ASF subversion and git services commented on LUCENE-8962: - Commit 3377b09fcbf46315ac64350ab259a45b1ec2889c in lucene-solr's branch refs/heads/master from Simon Willnauer [ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=3377b09 ] LUCENE-8962: Ensure we never flush by ram buffer or doc count in test > Can we merge small segments during refresh, for faster searching? > - > > Key: LUCENE-8962 > URL: https://issues.apache.org/jira/browse/LUCENE-8962 > Project: Lucene - Core > Issue Type: Improvement > Components: core/index >Reporter: Michael McCandless >Priority: Major > Fix For: master (9.0), 8.6 > > Attachments: LUCENE-8962_demo.png, failed-tests.patch, > failure_log.txt, test.diff > > Time Spent: 22h 40m > Remaining Estimate: 0h > > With near-real-time search we ask {{IndexWriter}} to write all in-memory > segments to disk and open an {{IndexReader}} to search them, and this is > typically a quick operation. > However, when you use many threads for concurrent indexing, {{IndexWriter}} > will accumulate write many small segments during {{refresh}} and this then > adds search-time cost as searching must visit all of these tiny segments. > The merge policy would normally quickly coalesce these small segments if > given a little time ... so, could we somehow improve {{IndexWriter'}}s > refresh to optionally kick off merge policy to merge segments below some > threshold before opening the near-real-time reader? It'd be a bit tricky > because while we are waiting for merges, indexing may continue, and new > segments may be flushed, but those new segments shouldn't be included in the > point-in-time segments returned by refresh ... > One could almost do this on top of Lucene today, with a custom merge policy, > and some hackity logic to have the merge policy target small segments just > written by refresh, but it's tricky to then open a near-real-time reader, > excluding newly flushed but including newly merged segments since the refresh > originally finished ... > I'm not yet sure how best to solve this, so I wanted to open an issue for > discussion! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?
[ https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17147105#comment-17147105 ] ASF subversion and git services commented on LUCENE-8962: - Commit 9b1c92809e730a70c1d9bf9e5b65a4b521bddc8b in lucene-solr's branch refs/heads/branch_8x from Simon Willnauer [ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=9b1c928 ] LUCENE-8962: Merge small segments on commit (#1617) Add IndexWriter merge-on-commit feature to selectively merge small segments on commit, subject to a configurable timeout, to improve search performance by reducing the number of small segments for searching. Co-authored-by: Michael Froh Co-authored-by: Michael Sokolov Co-authored-by: Mike McCandless > Can we merge small segments during refresh, for faster searching? > - > > Key: LUCENE-8962 > URL: https://issues.apache.org/jira/browse/LUCENE-8962 > Project: Lucene - Core > Issue Type: Improvement > Components: core/index >Reporter: Michael McCandless >Priority: Major > Fix For: master (9.0), 8.6 > > Attachments: LUCENE-8962_demo.png, failed-tests.patch, > failure_log.txt, test.diff > > Time Spent: 22h 10m > Remaining Estimate: 0h > > With near-real-time search we ask {{IndexWriter}} to write all in-memory > segments to disk and open an {{IndexReader}} to search them, and this is > typically a quick operation. > However, when you use many threads for concurrent indexing, {{IndexWriter}} > will accumulate write many small segments during {{refresh}} and this then > adds search-time cost as searching must visit all of these tiny segments. > The merge policy would normally quickly coalesce these small segments if > given a little time ... so, could we somehow improve {{IndexWriter'}}s > refresh to optionally kick off merge policy to merge segments below some > threshold before opening the near-real-time reader? It'd be a bit tricky > because while we are waiting for merges, indexing may continue, and new > segments may be flushed, but those new segments shouldn't be included in the > point-in-time segments returned by refresh ... > One could almost do this on top of Lucene today, with a custom merge policy, > and some hackity logic to have the merge policy target small segments just > written by refresh, but it's tricky to then open a near-real-time reader, > excluding newly flushed but including newly merged segments since the refresh > originally finished ... > I'm not yet sure how best to solve this, so I wanted to open an issue for > discussion! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?
[ https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17147100#comment-17147100 ] ASF subversion and git services commented on LUCENE-8962: - Commit fb3c5d23537c650f7c0ba020f5e08f128e0a9b5a in lucene-solr's branch refs/heads/master from Simon Willnauer [ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=fb3c5d2 ] LUCENE-8962: Fix changes entry. This feature is added to 8.6 > Can we merge small segments during refresh, for faster searching? > - > > Key: LUCENE-8962 > URL: https://issues.apache.org/jira/browse/LUCENE-8962 > Project: Lucene - Core > Issue Type: Improvement > Components: core/index >Reporter: Michael McCandless >Priority: Major > Fix For: 8.6 > > Attachments: LUCENE-8962_demo.png, failed-tests.patch, > failure_log.txt, test.diff > > Time Spent: 22h 10m > Remaining Estimate: 0h > > With near-real-time search we ask {{IndexWriter}} to write all in-memory > segments to disk and open an {{IndexReader}} to search them, and this is > typically a quick operation. > However, when you use many threads for concurrent indexing, {{IndexWriter}} > will accumulate write many small segments during {{refresh}} and this then > adds search-time cost as searching must visit all of these tiny segments. > The merge policy would normally quickly coalesce these small segments if > given a little time ... so, could we somehow improve {{IndexWriter'}}s > refresh to optionally kick off merge policy to merge segments below some > threshold before opening the near-real-time reader? It'd be a bit tricky > because while we are waiting for merges, indexing may continue, and new > segments may be flushed, but those new segments shouldn't be included in the > point-in-time segments returned by refresh ... > One could almost do this on top of Lucene today, with a custom merge policy, > and some hackity logic to have the merge policy target small segments just > written by refresh, but it's tricky to then open a near-real-time reader, > excluding newly flushed but including newly merged segments since the refresh > originally finished ... > I'm not yet sure how best to solve this, so I wanted to open an issue for > discussion! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?
[ https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17147099#comment-17147099 ] ASF subversion and git services commented on LUCENE-8962: - Commit 7f352a96659b4d304ae3ef0c7a1b62784ce2a406 in lucene-solr's branch refs/heads/master from Simon Willnauer [ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=7f352a9 ] LUCENE-8962: Merge small segments on commit (#1617) Add IndexWriter merge-on-commit feature to selectively merge small segments on commit, subject to a configurable timeout, to improve search performance by reducing the number of small segments for searching. Co-authored-by: Michael Froh Co-authored-by: Michael Sokolov Co-authored-by: Mike McCandless > Can we merge small segments during refresh, for faster searching? > - > > Key: LUCENE-8962 > URL: https://issues.apache.org/jira/browse/LUCENE-8962 > Project: Lucene - Core > Issue Type: Improvement > Components: core/index >Reporter: Michael McCandless >Priority: Major > Fix For: 8.6 > > Attachments: LUCENE-8962_demo.png, failed-tests.patch, > failure_log.txt, test.diff > > Time Spent: 22h 10m > Remaining Estimate: 0h > > With near-real-time search we ask {{IndexWriter}} to write all in-memory > segments to disk and open an {{IndexReader}} to search them, and this is > typically a quick operation. > However, when you use many threads for concurrent indexing, {{IndexWriter}} > will accumulate write many small segments during {{refresh}} and this then > adds search-time cost as searching must visit all of these tiny segments. > The merge policy would normally quickly coalesce these small segments if > given a little time ... so, could we somehow improve {{IndexWriter'}}s > refresh to optionally kick off merge policy to merge segments below some > threshold before opening the near-real-time reader? It'd be a bit tricky > because while we are waiting for merges, indexing may continue, and new > segments may be flushed, but those new segments shouldn't be included in the > point-in-time segments returned by refresh ... > One could almost do this on top of Lucene today, with a custom merge policy, > and some hackity logic to have the merge policy target small segments just > written by refresh, but it's tricky to then open a near-real-time reader, > excluding newly flushed but including newly merged segments since the refresh > originally finished ... > I'm not yet sure how best to solve this, so I wanted to open an issue for > discussion! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?
[ https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17146689#comment-17146689 ] Michael McCandless commented on LUCENE-8962: OK it looks to me like this failure is pre-existing, because it got {{MockRandomMergePolicy}} which randomly decided to merge the two segments the test wrote into one segment, *after* (not during!) the second commit. I'll fix the test to use a deterministic {{MergePolicy}}. > Can we merge small segments during refresh, for faster searching? > - > > Key: LUCENE-8962 > URL: https://issues.apache.org/jira/browse/LUCENE-8962 > Project: Lucene - Core > Issue Type: Improvement > Components: core/index >Reporter: Michael McCandless >Priority: Major > Fix For: 8.6 > > Attachments: LUCENE-8962_demo.png, failed-tests.patch, > failure_log.txt, test.diff > > Time Spent: 21h 50m > Remaining Estimate: 0h > > With near-real-time search we ask {{IndexWriter}} to write all in-memory > segments to disk and open an {{IndexReader}} to search them, and this is > typically a quick operation. > However, when you use many threads for concurrent indexing, {{IndexWriter}} > will accumulate write many small segments during {{refresh}} and this then > adds search-time cost as searching must visit all of these tiny segments. > The merge policy would normally quickly coalesce these small segments if > given a little time ... so, could we somehow improve {{IndexWriter'}}s > refresh to optionally kick off merge policy to merge segments below some > threshold before opening the near-real-time reader? It'd be a bit tricky > because while we are waiting for merges, indexing may continue, and new > segments may be flushed, but those new segments shouldn't be included in the > point-in-time segments returned by refresh ... > One could almost do this on top of Lucene today, with a custom merge policy, > and some hackity logic to have the merge policy target small segments just > written by refresh, but it's tricky to then open a near-real-time reader, > excluding newly flushed but including newly merged segments since the refresh > originally finished ... > I'm not yet sure how best to solve this, so I wanted to open an issue for > discussion! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?
[ https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17146683#comment-17146683 ] Michael McCandless commented on LUCENE-8962: LOL, that's awesome. Hmm, beasting just hit another failure in same source: {noformat} [junit4:pickseed] Seed property 'tests.seed' already defined: BC66D44FB7C99235 [junit4] says salut! Master seed: BC66D44FB7C99235 [junit4] Executing 1 suite with 1 JVM. [junit4] [junit4] Started J0 PID(1739111@localhost). [junit4] Suite: org.apache.lucene.search.TestPhraseWildcardQuery [junit4] 2> NOTE: reproduce with: ant test -Dtestcase=TestPhraseWildcardQuery -Dtests.method=testMaxExpansions -Dtests.seed=BC66D44FB7C99235 -Dtests.slow=true -Dtests.badapples=true \ -Dtests.locale=es -Dtests.timezone=Africa/Douala -Dtests.asserts=true -Dtests.file.encoding=UTF-8 [junit4] FAILURE 0.22s | TestPhraseWildcardQuery.testMaxExpansions <<< [junit4] > Throwable #1: java.lang.AssertionError: test test relies on 2 segments expected:<2> but was:<1> [junit4] > at __randomizedtesting.SeedInfo.seed([BC66D44FB7C99235:83D14835E8C05BA3]:0) [junit4] > at org.apache.lucene.search.TestPhraseWildcardQuery.setUp(TestPhraseWildcardQuery.java:76) [junit4] > at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) [junit4] > at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) [junit4] > at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) [junit4] > at java.base/java.lang.reflect.Method.invoke(Method.java:566) [junit4] > at java.base/java.lang.Thread.run(Thread.java:834) [junit4] 2> NOTE: test params are: codec=Asserting(Lucene86), sim=Asserting(RandomSimilarity(queryNorm=true): {other=DFR I(n)B2, author=org.apache.lucene.search.similarities.BooleanSi\ milarity@3dcd0436, title=DFR GL3(800.0), category=DFR I(F)LZ(0.3)}), locale=es, timezone=Africa/Douala [junit4] 2> NOTE: Linux 5.5.6-arch1-1 amd64/Oracle Corporation 11.0.6 (64-bit)/cpus=128,threads=1,free=477476440,total=536870912 [junit4] 2> NOTE: All tests run in this JVM: [TestPhraseWildcardQuery] [junit4] Completed [1/1 (1!)] in 0.46s, 1 test, 1 failure <<< FAILURES! [junit4] [junit4] [junit4] Tests with failures [seed: BC66D44FB7C99235]: [junit4] - org.apache.lucene.search.TestPhraseWildcardQuery.testMaxExpansions [junit4] [junit4] [junit4] JVM J0: 0.34 .. 1.19 = 0.85s [junit4] Execution time total: 1.20 sec. [junit4] Tests summary: 1 suite, 1 test, 1 failure {noformat} which is odd because you fix (in {{.setUp}}) should have also fixed this one. > Can we merge small segments during refresh, for faster searching? > - > > Key: LUCENE-8962 > URL: https://issues.apache.org/jira/browse/LUCENE-8962 > Project: Lucene - Core > Issue Type: Improvement > Components: core/index >Reporter: Michael McCandless >Priority: Major > Fix For: 8.6 > > Attachments: LUCENE-8962_demo.png, failed-tests.patch, > failure_log.txt, test.diff > > Time Spent: 21h 50m > Remaining Estimate: 0h > > With near-real-time search we ask {{IndexWriter}} to write all in-memory > segments to disk and open an {{IndexReader}} to search them, and this is > typically a quick operation. > However, when you use many threads for concurrent indexing, {{IndexWriter}} > will accumulate write many small segments during {{refresh}} and this then > adds search-time cost as searching must visit all of these tiny segments. > The merge policy would normally quickly coalesce these small segments if > given a little time ... so, could we somehow improve {{IndexWriter'}}s > refresh to optionally kick off merge policy to merge segments below some > threshold before opening the near-real-time reader? It'd be a bit tricky > because while we are waiting for merges, indexing may continue, and new > segments may be flushed, but those new segments shouldn't be included in the > point-in-time segments returned by refresh ... > One could almost do this on top of Lucene today, with a custom merge policy, > and some hackity logic to have the merge policy target small segments just > written by refresh, but it's tricky to then open a near-real-time reader, > excluding newly flushed but including newly merged segments since the refresh > originally finished ... > I'm not yet sure how best to solve this, so I wanted to open an issue for > discussion! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsu
[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?
[ https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17146635#comment-17146635 ] Simon Willnauer commented on LUCENE-8962: - I did push a fix. it's a test bug, this test relies on 2 segments that got merged away by this feature :) > Can we merge small segments during refresh, for faster searching? > - > > Key: LUCENE-8962 > URL: https://issues.apache.org/jira/browse/LUCENE-8962 > Project: Lucene - Core > Issue Type: Improvement > Components: core/index >Reporter: Michael McCandless >Priority: Major > Fix For: 8.6 > > Attachments: LUCENE-8962_demo.png, failed-tests.patch, > failure_log.txt, test.diff > > Time Spent: 21h 50m > Remaining Estimate: 0h > > With near-real-time search we ask {{IndexWriter}} to write all in-memory > segments to disk and open an {{IndexReader}} to search them, and this is > typically a quick operation. > However, when you use many threads for concurrent indexing, {{IndexWriter}} > will accumulate write many small segments during {{refresh}} and this then > adds search-time cost as searching must visit all of these tiny segments. > The merge policy would normally quickly coalesce these small segments if > given a little time ... so, could we somehow improve {{IndexWriter'}}s > refresh to optionally kick off merge policy to merge segments below some > threshold before opening the near-real-time reader? It'd be a bit tricky > because while we are waiting for merges, indexing may continue, and new > segments may be flushed, but those new segments shouldn't be included in the > point-in-time segments returned by refresh ... > One could almost do this on top of Lucene today, with a custom merge policy, > and some hackity logic to have the merge policy target small segments just > written by refresh, but it's tricky to then open a near-real-time reader, > excluding newly flushed but including newly merged segments since the refresh > originally finished ... > I'm not yet sure how best to solve this, so I wanted to open an issue for > discussion! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?
[ https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17146633#comment-17146633 ] Michael McCandless commented on LUCENE-8962: Thanks [~simonw], PR looks great. I left a couple comments and pushed some small fixes. I started beasting, 37 full iterations of Lucene core + modules tests. It uncovered this curious failure, which does reproduce: {noformat} org.apache.lucene.search.TestPhraseWildcardQuery > test suite's output saved to /home/mike/src/simon/lucene/sandbox/build/test-results/test/outputs/OUTPUT-org.apache.lucene.search.TestPhraseWildcardQuery.txt, copied below: > java.lang.AssertionError: expected:<1> but was:<0> > at __randomizedtesting.SeedInfo.seed([8FCBBE65B983A2FA:393FF384952858B5]:0) > at org.junit.Assert.fail(Assert.java:88) > at org.junit.Assert.failNotEquals(Assert.java:834) > at org.junit.Assert.assertEquals(Assert.java:645) > at org.junit.Assert.assertEquals(Assert.java:631) > at org.apache.lucene.search.TestPhraseWildcardQuery.testExplain(TestPhraseWildcardQuery.java:259) > at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.base/java.lang.reflect.Method.invoke(Method.java:566) > at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1754) > at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:942) > at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:978) > at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:992) > at org.apache.lucene.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:49) > at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45) > at org.apache.lucene.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:48) > at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64) > at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47) > at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36) > at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:370) > at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:819) > at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:470) > at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:951) > at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:836) > at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:887) > at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:898) > at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45) > at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36) > at org.apache.lucene.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:41) > at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40) > at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40) > at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36) > at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36) > at org.apache.lucene.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53) > at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47) > at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64) > at org.apache.lucene.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:54) > at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36) > at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(
[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?
[ https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17146321#comment-17146321 ] Simon Willnauer commented on LUCENE-8962: - I think we are ready for review I opened a PR here https://github.com/apache/lucene-solr/pull/1617 > Can we merge small segments during refresh, for faster searching? > - > > Key: LUCENE-8962 > URL: https://issues.apache.org/jira/browse/LUCENE-8962 > Project: Lucene - Core > Issue Type: Improvement > Components: core/index >Reporter: Michael McCandless >Priority: Major > Fix For: 8.6 > > Attachments: LUCENE-8962_demo.png, failed-tests.patch, > failure_log.txt, test.diff > > Time Spent: 20h 40m > Remaining Estimate: 0h > > With near-real-time search we ask {{IndexWriter}} to write all in-memory > segments to disk and open an {{IndexReader}} to search them, and this is > typically a quick operation. > However, when you use many threads for concurrent indexing, {{IndexWriter}} > will accumulate write many small segments during {{refresh}} and this then > adds search-time cost as searching must visit all of these tiny segments. > The merge policy would normally quickly coalesce these small segments if > given a little time ... so, could we somehow improve {{IndexWriter'}}s > refresh to optionally kick off merge policy to merge segments below some > threshold before opening the near-real-time reader? It'd be a bit tricky > because while we are waiting for merges, indexing may continue, and new > segments may be flushed, but those new segments shouldn't be included in the > point-in-time segments returned by refresh ... > One could almost do this on top of Lucene today, with a custom merge policy, > and some hackity logic to have the merge policy target small segments just > written by refresh, but it's tricky to then open a near-real-time reader, > excluding newly flushed but including newly merged segments since the refresh > originally finished ... > I'm not yet sure how best to solve this, so I wanted to open an issue for > discussion! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?
[ https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17146286#comment-17146286 ] Simon Willnauer commented on LUCENE-8962: - [~mikemccand] I pushed fixes for both failures > Can we merge small segments during refresh, for faster searching? > - > > Key: LUCENE-8962 > URL: https://issues.apache.org/jira/browse/LUCENE-8962 > Project: Lucene - Core > Issue Type: Improvement > Components: core/index >Reporter: Michael McCandless >Priority: Major > Fix For: 8.6 > > Attachments: LUCENE-8962_demo.png, failed-tests.patch, > failure_log.txt, test.diff > > Time Spent: 20h 20m > Remaining Estimate: 0h > > With near-real-time search we ask {{IndexWriter}} to write all in-memory > segments to disk and open an {{IndexReader}} to search them, and this is > typically a quick operation. > However, when you use many threads for concurrent indexing, {{IndexWriter}} > will accumulate write many small segments during {{refresh}} and this then > adds search-time cost as searching must visit all of these tiny segments. > The merge policy would normally quickly coalesce these small segments if > given a little time ... so, could we somehow improve {{IndexWriter'}}s > refresh to optionally kick off merge policy to merge segments below some > threshold before opening the near-real-time reader? It'd be a bit tricky > because while we are waiting for merges, indexing may continue, and new > segments may be flushed, but those new segments shouldn't be included in the > point-in-time segments returned by refresh ... > One could almost do this on top of Lucene today, with a custom merge policy, > and some hackity logic to have the merge policy target small segments just > written by refresh, but it's tricky to then open a near-real-time reader, > excluding newly flushed but including newly merged segments since the refresh > originally finished ... > I'm not yet sure how best to solve this, so I wanted to open an issue for > discussion! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?
[ https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17145871#comment-17145871 ] Michael McCandless commented on LUCENE-8962: Ooh, found another failing test (after 341 full iterations), and this one does reproduce! {noformat} [junit4:pickseed] Seed property 'tests.seed' already defined: D75F6483D2E6C62C [junit4] says ¡Hola! Master seed: D75F6483D2E6C62C [junit4] Executing 1 suite with 1 JVM. [junit4] [junit4] Started J0 PID(3328881@localhost). [junit4] Suite: org.apache.lucene.index.TestIndexWriterMergePolicy [junit4] 2> NOTE: reproduce with: ant test -Dtestcase=TestIndexWriterMergePolicy -Dtests.method=testMergeOnCommit -Dtests.seed=D75F6483D2E6C62C -Dtests.slow=true -Dtests.badapples=tr\ ue -Dtests.locale=en-CM -Dtests.timezone=Eire -Dtests.asserts=true -Dtests.file.encoding=UTF-8 [junit4] FAILURE 0.27s | TestIndexWriterMergePolicy.testMergeOnCommit <<< [junit4] > Throwable #1: java.lang.AssertionError: expected:<1> but was:<6> [junit4] > at __randomizedtesting.SeedInfo.seed([D75F6483D2E6C62C:F3D97E843303240C]:0) [junit4] > at org.apache.lucene.index.TestIndexWriterMergePolicy.testMergeOnCommit(TestIndexWriterMergePolicy.java:340) [junit4] > at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) [junit4] > at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) [junit4] > at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) [junit4] > at java.base/java.lang.reflect.Method.invoke(Method.java:566) [junit4] > at java.base/java.lang.Thread.run(Thread.java:834) [junit4] 2> Jun 25, 2020 5:41:39 PM com.carrotsearch.randomizedtesting.ThreadLeakControl checkThreadLeaks [junit4] 2> WARNING: Will linger awaiting termination of 1 leaked thread(s). [junit4] 2> NOTE: test params are: codec=CheapBastard, sim=Asserting(RandomSimilarity(queryNorm=true): {content=DFI(ChiSquared)}), locale=en-CM, timezone=Eire [junit4] 2> NOTE: Linux 5.5.6-arch1-1 amd64/Oracle Corporation 11.0.6 (64-bit)/cpus=128,threads=1,free=477412792,total=536870912 [junit4] 2> NOTE: All tests run in this JVM: [TestIndexWriterMergePolicy] [junit4] Completed [1/1 (1!)] in 0.56s, 1 test, 1 failure <<< FAILURES! [junit4] [junit4] [junit4] Tests with failures [seed: D75F6483D2E6C62C]: [junit4] - org.apache.lucene.index.TestIndexWriterMergePolicy.testMergeOnCommit [junit4] [junit4] [junit4] JVM J0: 0.34 .. 1.45 = 1.11s [junit4] Execution time total: 1.45 sec. [junit4] Tests summary: 1 suite, 1 test, 1 failure {noformat} > Can we merge small segments during refresh, for faster searching? > - > > Key: LUCENE-8962 > URL: https://issues.apache.org/jira/browse/LUCENE-8962 > Project: Lucene - Core > Issue Type: Improvement > Components: core/index >Reporter: Michael McCandless >Priority: Major > Fix For: 8.6 > > Attachments: LUCENE-8962_demo.png, failed-tests.patch, > failure_log.txt, test.diff > > Time Spent: 20h 20m > Remaining Estimate: 0h > > With near-real-time search we ask {{IndexWriter}} to write all in-memory > segments to disk and open an {{IndexReader}} to search them, and this is > typically a quick operation. > However, when you use many threads for concurrent indexing, {{IndexWriter}} > will accumulate write many small segments during {{refresh}} and this then > adds search-time cost as searching must visit all of these tiny segments. > The merge policy would normally quickly coalesce these small segments if > given a little time ... so, could we somehow improve {{IndexWriter'}}s > refresh to optionally kick off merge policy to merge segments below some > threshold before opening the near-real-time reader? It'd be a bit tricky > because while we are waiting for merges, indexing may continue, and new > segments may be flushed, but those new segments shouldn't be included in the > point-in-time segments returned by refresh ... > One could almost do this on top of Lucene today, with a custom merge policy, > and some hackity logic to have the merge policy target small segments just > written by refresh, but it's tricky to then open a near-real-time reader, > excluding newly flushed but including newly merged segments since the refresh > originally finished ... > I'm not yet sure how best to solve this, so I wanted to open an issue for > discussion! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubs
[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?
[ https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17145716#comment-17145716 ] Michael McCandless commented on LUCENE-8962: Beasting the branch (all Lucene core + module tests), 216 iterations, and hit one test failure: {noformat} > TEST thread 1: got interrupt > org.apache.lucene.util.ThreadInterruptedException: java.lang.InterruptedException: sleep interrupted > at org.apache.lucene.store.SlowOpeningMockIndexInputWrapper.(SlowOpeningMockIndexInputWrapper.java:38) > at org.apache.lucene.store.MockDirectoryWrapper.openInput(MockDirectoryWrapper.java:769) > at org.apache.lucene.store.Directory.openChecksumInput(Directory.java:157) > at org.apache.lucene.store.MockDirectoryWrapper.openChecksumInput(MockDirectoryWrapper.java:1044) > at org.apache.lucene.codecs.compressing.FieldsIndexReader.(FieldsIndexReader.java:53) > at org.apache.lucene.codecs.compressing.CompressingStoredFieldsReader.(CompressingStoredFieldsReader.java:166) > at org.apache.lucene.codecs.compressing.CompressingStoredFieldsFormat.fieldsReader(CompressingStoredFieldsFormat.java:123) > at org.apache.lucene.codecs.lucene50.Lucene50StoredFieldsFormat.fieldsReader(Lucene50StoredFieldsFormat.java:135) > at org.apache.lucene.codecs.asserting.AssertingStoredFieldsFormat.fieldsReader(AssertingStoredFieldsFormat.java:43) > at org.apache.lucene.index.SegmentCoreReaders.(SegmentCoreReaders.java:127) > at org.apache.lucene.index.SegmentReader.(SegmentReader.java:83) > at org.apache.lucene.index.ReadersAndUpdates.getReader(ReadersAndUpdates.java:171) > at org.apache.lucene.index.ReadersAndUpdates.getReaderForMerge(ReadersAndUpdates.java:714) > at org.apache.lucene.index.IndexWriter.lambda$prepareOnCommitMerge$6(IndexWriter.java:3382) > at org.apache.lucene.index.MergePolicy$OneMerge.initMergeReaders(MergePolicy.java:438) > at org.apache.lucene.index.IndexWriter$2.initMergeReaders(IndexWriter.java:3361) > at org.apache.lucene.index.IndexWriter.prepareOnCommitMerge(IndexWriter.java:3376) > at org.apache.lucene.index.IndexWriter.prepareCommitInternal(IndexWriter.java:3247) > at org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:3547) > at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:3505) > at org.apache.lucene.index.TestIndexWriter$IndexerThreadInterrupt.run(TestIndexWriter.java:1004) > Caused by: java.lang.InterruptedException: sleep interrupted > at java.base/java.lang.Thread.sleep(Native Method) > at org.apache.lucene.store.SlowOpeningMockIndexInputWrapper.(SlowOpeningMockIndexInputWrapper.java:33) > ... 20 more > thread 1 FAILED; unexpected exception > java.lang.IllegalStateException: this writer hit an unrecoverable error; cannot commit > at org.apache.lucene.index.IndexWriter.prepareCommitInternal(IndexWriter.java:3170) > at org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:3547) > at org.apache.lucene.index.IndexWriter.shutdown(IndexWriter.java:1099) > at org.apache.lucene.index.IndexWriter.close(IndexWriter.java:1140) > at org.apache.lucene.index.TestIndexWriter$IndexerThreadInterrupt.run(TestIndexWriter.java:942) > Caused by: java.lang.AssertionError: can't be done and not suppressing exceptions > at org.apache.lucene.index.IndexWriter.closeMergeReaders(IndexWriter.java:4426) > at org.apache.lucene.index.IndexWriter.commitMerge(IndexWriter.java:4075) > at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4581) > at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:4140) > at org.apache.lucene.index.IndexWriter$IndexWriterMergeSource.merge(IndexWriter.java:5692) > at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:624) > at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:682) > index files: [_42.fdm, _42.fdt, _42.fdx, _42.fnm, _42.nvd, _42.nvm, _42.si, _42_Lucene80_0.dvd, _42_Lucene80_0.dvm, _42_Lucene84_0.doc, _42_Lucene84_0.pos, _42_Lucene84_0.tim, _42\ _Lucene84_0.tip, _42_Lucene84_0.tmd, _42_TestBloomFilteredLucenePostings_0.blm, _42_TestBloomFilteredLucenePostings_0.doc, _42_TestBloomFilteredLucenePostings_0.pos, _42_TestBloomFilteredL\ ucenePostings_0.tim, _42_TestBloomFilteredLucenePostings_0.tip, _42_TestBloomFilteredLucenePostings_0.tmd, _47.fdm, _47.fdt,
[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?
[ https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17144969#comment-17144969 ] Michael Sokolov commented on LUCENE-8962: - Yes, thanks Simon! I'll take a look too; hopefully I learn something :) > Can we merge small segments during refresh, for faster searching? > - > > Key: LUCENE-8962 > URL: https://issues.apache.org/jira/browse/LUCENE-8962 > Project: Lucene - Core > Issue Type: Improvement > Components: core/index >Reporter: Michael McCandless >Priority: Major > Fix For: 8.6 > > Attachments: LUCENE-8962_demo.png, failed-tests.patch, > failure_log.txt, test.diff > > Time Spent: 20h 20m > Remaining Estimate: 0h > > With near-real-time search we ask {{IndexWriter}} to write all in-memory > segments to disk and open an {{IndexReader}} to search them, and this is > typically a quick operation. > However, when you use many threads for concurrent indexing, {{IndexWriter}} > will accumulate write many small segments during {{refresh}} and this then > adds search-time cost as searching must visit all of these tiny segments. > The merge policy would normally quickly coalesce these small segments if > given a little time ... so, could we somehow improve {{IndexWriter'}}s > refresh to optionally kick off merge policy to merge segments below some > threshold before opening the near-real-time reader? It'd be a bit tricky > because while we are waiting for merges, indexing may continue, and new > segments may be flushed, but those new segments shouldn't be included in the > point-in-time segments returned by refresh ... > One could almost do this on top of Lucene today, with a custom merge policy, > and some hackity logic to have the merge policy target small segments just > written by refresh, but it's tricky to then open a near-real-time reader, > excluding newly flushed but including newly merged segments since the refresh > originally finished ... > I'm not yet sure how best to solve this, so I wanted to open an issue for > discussion! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?
[ https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17144904#comment-17144904 ] Michael McCandless commented on LUCENE-8962: Thanks [~simonw]. I will have a look at the latest & greatest, and kick off beasting on my 128 core Ryzen Threadripper box :) > Can we merge small segments during refresh, for faster searching? > - > > Key: LUCENE-8962 > URL: https://issues.apache.org/jira/browse/LUCENE-8962 > Project: Lucene - Core > Issue Type: Improvement > Components: core/index >Reporter: Michael McCandless >Priority: Major > Fix For: 8.6 > > Attachments: LUCENE-8962_demo.png, failed-tests.patch, > failure_log.txt, test.diff > > Time Spent: 20h 20m > Remaining Estimate: 0h > > With near-real-time search we ask {{IndexWriter}} to write all in-memory > segments to disk and open an {{IndexReader}} to search them, and this is > typically a quick operation. > However, when you use many threads for concurrent indexing, {{IndexWriter}} > will accumulate write many small segments during {{refresh}} and this then > adds search-time cost as searching must visit all of these tiny segments. > The merge policy would normally quickly coalesce these small segments if > given a little time ... so, could we somehow improve {{IndexWriter'}}s > refresh to optionally kick off merge policy to merge segments below some > threshold before opening the near-real-time reader? It'd be a bit tricky > because while we are waiting for merges, indexing may continue, and new > segments may be flushed, but those new segments shouldn't be included in the > point-in-time segments returned by refresh ... > One could almost do this on top of Lucene today, with a custom merge policy, > and some hackity logic to have the merge policy target small segments just > written by refresh, but it's tricky to then open a near-real-time reader, > excluding newly flushed but including newly merged segments since the refresh > originally finished ... > I'm not yet sure how best to solve this, so I wanted to open an issue for > discussion! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?
[ https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17144366#comment-17144366 ] Simon Willnauer commented on LUCENE-8962: - [~sokolov] thanks, I updated the branch with the approach that I think can work. I added quite some changes that you folks should catch up on. I think we need to do more on the testing end with this feature being explicitly enabled. I already pushed some tests for it. I would appreciate if you and [~mikemccand] could spend some time looking that the work that I did. I am happy to open a PR do do the reviews but maybe take a look at it first? > Can we merge small segments during refresh, for faster searching? > - > > Key: LUCENE-8962 > URL: https://issues.apache.org/jira/browse/LUCENE-8962 > Project: Lucene - Core > Issue Type: Improvement > Components: core/index >Reporter: Michael McCandless >Priority: Major > Fix For: 8.6 > > Attachments: LUCENE-8962_demo.png, failed-tests.patch, > failure_log.txt, test.diff > > Time Spent: 20h 20m > Remaining Estimate: 0h > > With near-real-time search we ask {{IndexWriter}} to write all in-memory > segments to disk and open an {{IndexReader}} to search them, and this is > typically a quick operation. > However, when you use many threads for concurrent indexing, {{IndexWriter}} > will accumulate write many small segments during {{refresh}} and this then > adds search-time cost as searching must visit all of these tiny segments. > The merge policy would normally quickly coalesce these small segments if > given a little time ... so, could we somehow improve {{IndexWriter'}}s > refresh to optionally kick off merge policy to merge segments below some > threshold before opening the near-real-time reader? It'd be a bit tricky > because while we are waiting for merges, indexing may continue, and new > segments may be flushed, but those new segments shouldn't be included in the > point-in-time segments returned by refresh ... > One could almost do this on top of Lucene today, with a custom merge policy, > and some hackity logic to have the merge policy target small segments just > written by refresh, but it's tricky to then open a near-real-time reader, > excluding newly flushed but including newly merged segments since the refresh > originally finished ... > I'm not yet sure how best to solve this, so I wanted to open an issue for > discussion! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?
[ https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17143890#comment-17143890 ] Michael Sokolov commented on LUCENE-8962: - [~simonw] I think we can continue to use https://github.com/apache/lucene-solr/tree/jira/lucene-8962? But perhaps with a new PR > Can we merge small segments during refresh, for faster searching? > - > > Key: LUCENE-8962 > URL: https://issues.apache.org/jira/browse/LUCENE-8962 > Project: Lucene - Core > Issue Type: Improvement > Components: core/index >Reporter: Michael McCandless >Priority: Major > Fix For: 8.6 > > Attachments: LUCENE-8962_demo.png, failed-tests.patch, > failure_log.txt, test.diff > > Time Spent: 19h 40m > Remaining Estimate: 0h > > With near-real-time search we ask {{IndexWriter}} to write all in-memory > segments to disk and open an {{IndexReader}} to search them, and this is > typically a quick operation. > However, when you use many threads for concurrent indexing, {{IndexWriter}} > will accumulate write many small segments during {{refresh}} and this then > adds search-time cost as searching must visit all of these tiny segments. > The merge policy would normally quickly coalesce these small segments if > given a little time ... so, could we somehow improve {{IndexWriter'}}s > refresh to optionally kick off merge policy to merge segments below some > threshold before opening the near-real-time reader? It'd be a bit tricky > because while we are waiting for merges, indexing may continue, and new > segments may be flushed, but those new segments shouldn't be included in the > point-in-time segments returned by refresh ... > One could almost do this on top of Lucene today, with a custom merge policy, > and some hackity logic to have the merge policy target small segments just > written by refresh, but it's tricky to then open a near-real-time reader, > excluding newly flushed but including newly merged segments since the refresh > originally finished ... > I'm not yet sure how best to solve this, so I wanted to open an issue for > discussion! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?
[ https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17143881#comment-17143881 ] Simon Willnauer commented on LUCENE-8962: - I agree mike! I looked into it and it seems doable with some modifications. I already pushed a prototype of what I think can work [here|https://github.com/apache/lucene-solr/pull/1601/commits/de7677f1c98ffd38e9febce74f265fb44484c1d1] [~sokolov] do you have a branch we can work on for this? > Can we merge small segments during refresh, for faster searching? > - > > Key: LUCENE-8962 > URL: https://issues.apache.org/jira/browse/LUCENE-8962 > Project: Lucene - Core > Issue Type: Improvement > Components: core/index >Reporter: Michael McCandless >Priority: Major > Fix For: 8.6 > > Attachments: LUCENE-8962_demo.png, failed-tests.patch, > failure_log.txt, test.diff > > Time Spent: 19h 40m > Remaining Estimate: 0h > > With near-real-time search we ask {{IndexWriter}} to write all in-memory > segments to disk and open an {{IndexReader}} to search them, and this is > typically a quick operation. > However, when you use many threads for concurrent indexing, {{IndexWriter}} > will accumulate write many small segments during {{refresh}} and this then > adds search-time cost as searching must visit all of these tiny segments. > The merge policy would normally quickly coalesce these small segments if > given a little time ... so, could we somehow improve {{IndexWriter'}}s > refresh to optionally kick off merge policy to merge segments below some > threshold before opening the near-real-time reader? It'd be a bit tricky > because while we are waiting for merges, indexing may continue, and new > segments may be flushed, but those new segments shouldn't be included in the > point-in-time segments returned by refresh ... > One could almost do this on top of Lucene today, with a custom merge policy, > and some hackity logic to have the merge policy target small segments just > written by refresh, but it's tricky to then open a near-real-time reader, > excluding newly flushed but including newly merged segments since the refresh > originally finished ... > I'm not yet sure how best to solve this, so I wanted to open an issue for > discussion! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?
[ https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17143295#comment-17143295 ] Michael McCandless commented on LUCENE-8962: Thanks [~simonw], what a delightful unit test! Clearly it confirms what we thought – the feature as implemented now breaks the atomicity guarantee of {{updateDocument}} which is clearly no good! And I think this is likely tricky to fix, if we need new locking around IW acquiring merge readers to ensure a concurrent {{updateDocument}} is atomic with respect to that. Let's mull/sleep on it ... > Can we merge small segments during refresh, for faster searching? > - > > Key: LUCENE-8962 > URL: https://issues.apache.org/jira/browse/LUCENE-8962 > Project: Lucene - Core > Issue Type: Improvement > Components: core/index >Reporter: Michael McCandless >Priority: Major > Fix For: 8.6 > > Attachments: LUCENE-8962_demo.png, failed-tests.patch, > failure_log.txt, test.diff > > Time Spent: 19h 40m > Remaining Estimate: 0h > > With near-real-time search we ask {{IndexWriter}} to write all in-memory > segments to disk and open an {{IndexReader}} to search them, and this is > typically a quick operation. > However, when you use many threads for concurrent indexing, {{IndexWriter}} > will accumulate write many small segments during {{refresh}} and this then > adds search-time cost as searching must visit all of these tiny segments. > The merge policy would normally quickly coalesce these small segments if > given a little time ... so, could we somehow improve {{IndexWriter'}}s > refresh to optionally kick off merge policy to merge segments below some > threshold before opening the near-real-time reader? It'd be a bit tricky > because while we are waiting for merges, indexing may continue, and new > segments may be flushed, but those new segments shouldn't be included in the > point-in-time segments returned by refresh ... > One could almost do this on top of Lucene today, with a custom merge policy, > and some hackity logic to have the merge policy target small segments just > written by refresh, but it's tricky to then open a near-real-time reader, > excluding newly flushed but including newly merged segments since the refresh > originally finished ... > I'm not yet sure how best to solve this, so I wanted to open an issue for > discussion! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?
[ https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17142433#comment-17142433 ] ASF subversion and git services commented on LUCENE-8962: - Commit ba2a812f84e2a05cf6d7039a19000df10d25590c in lucene-solr's branch refs/heads/branch_8x from Michael Sokolov [ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=ba2a812 ] Revert "LUCENE-8962: add ability to selectively merge on commit (#1552)" This reverts commit 28f2e1cd7f39e482b100d55ec94c46fc292f7b67. > Can we merge small segments during refresh, for faster searching? > - > > Key: LUCENE-8962 > URL: https://issues.apache.org/jira/browse/LUCENE-8962 > Project: Lucene - Core > Issue Type: Improvement > Components: core/index >Reporter: Michael McCandless >Priority: Major > Fix For: 8.6 > > Attachments: LUCENE-8962_demo.png, failed-tests.patch, failure_log.txt > > Time Spent: 19h 40m > Remaining Estimate: 0h > > With near-real-time search we ask {{IndexWriter}} to write all in-memory > segments to disk and open an {{IndexReader}} to search them, and this is > typically a quick operation. > However, when you use many threads for concurrent indexing, {{IndexWriter}} > will accumulate write many small segments during {{refresh}} and this then > adds search-time cost as searching must visit all of these tiny segments. > The merge policy would normally quickly coalesce these small segments if > given a little time ... so, could we somehow improve {{IndexWriter'}}s > refresh to optionally kick off merge policy to merge segments below some > threshold before opening the near-real-time reader? It'd be a bit tricky > because while we are waiting for merges, indexing may continue, and new > segments may be flushed, but those new segments shouldn't be included in the > point-in-time segments returned by refresh ... > One could almost do this on top of Lucene today, with a custom merge policy, > and some hackity logic to have the merge policy target small segments just > written by refresh, but it's tricky to then open a near-real-time reader, > excluding newly flushed but including newly merged segments since the refresh > originally finished ... > I'm not yet sure how best to solve this, so I wanted to open an issue for > discussion! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?
[ https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17142431#comment-17142431 ] ASF subversion and git services commented on LUCENE-8962: - Commit 5d43e73c6616479bec19dd8edf57076e084d6063 in lucene-solr's branch refs/heads/master from Michael Sokolov [ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=5d43e73 ] Revert "LUCENE-8962: add ability to selectively merge on commit (#1552)" This reverts commit 972c84022ec9c2b867848cb11f6d4c6bc466e8f2. > Can we merge small segments during refresh, for faster searching? > - > > Key: LUCENE-8962 > URL: https://issues.apache.org/jira/browse/LUCENE-8962 > Project: Lucene - Core > Issue Type: Improvement > Components: core/index >Reporter: Michael McCandless >Priority: Major > Fix For: 8.6 > > Attachments: LUCENE-8962_demo.png, failed-tests.patch, failure_log.txt > > Time Spent: 19h 40m > Remaining Estimate: 0h > > With near-real-time search we ask {{IndexWriter}} to write all in-memory > segments to disk and open an {{IndexReader}} to search them, and this is > typically a quick operation. > However, when you use many threads for concurrent indexing, {{IndexWriter}} > will accumulate write many small segments during {{refresh}} and this then > adds search-time cost as searching must visit all of these tiny segments. > The merge policy would normally quickly coalesce these small segments if > given a little time ... so, could we somehow improve {{IndexWriter'}}s > refresh to optionally kick off merge policy to merge segments below some > threshold before opening the near-real-time reader? It'd be a bit tricky > because while we are waiting for merges, indexing may continue, and new > segments may be flushed, but those new segments shouldn't be included in the > point-in-time segments returned by refresh ... > One could almost do this on top of Lucene today, with a custom merge policy, > and some hackity logic to have the merge policy target small segments just > written by refresh, but it's tricky to then open a near-real-time reader, > excluding newly flushed but including newly merged segments since the refresh > originally finished ... > I'm not yet sure how best to solve this, so I wanted to open an issue for > discussion! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?
[ https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17142410#comment-17142410 ] Michael Sokolov commented on LUCENE-8962: - OK I'll revert from master and branch_8x then > Can we merge small segments during refresh, for faster searching? > - > > Key: LUCENE-8962 > URL: https://issues.apache.org/jira/browse/LUCENE-8962 > Project: Lucene - Core > Issue Type: Improvement > Components: core/index >Reporter: Michael McCandless >Priority: Major > Fix For: 8.6 > > Attachments: LUCENE-8962_demo.png, failed-tests.patch, failure_log.txt > > Time Spent: 19h 40m > Remaining Estimate: 0h > > With near-real-time search we ask {{IndexWriter}} to write all in-memory > segments to disk and open an {{IndexReader}} to search them, and this is > typically a quick operation. > However, when you use many threads for concurrent indexing, {{IndexWriter}} > will accumulate write many small segments during {{refresh}} and this then > adds search-time cost as searching must visit all of these tiny segments. > The merge policy would normally quickly coalesce these small segments if > given a little time ... so, could we somehow improve {{IndexWriter'}}s > refresh to optionally kick off merge policy to merge segments below some > threshold before opening the near-real-time reader? It'd be a bit tricky > because while we are waiting for merges, indexing may continue, and new > segments may be flushed, but those new segments shouldn't be included in the > point-in-time segments returned by refresh ... > One could almost do this on top of Lucene today, with a custom merge policy, > and some hackity logic to have the merge policy target small segments just > written by refresh, but it's tricky to then open a near-real-time reader, > excluding newly flushed but including newly merged segments since the refresh > originally finished ... > I'm not yet sure how best to solve this, so I wanted to open an issue for > discussion! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?
[ https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17142395#comment-17142395 ] Michael McCandless commented on LUCENE-8962: {quote}[~mikemccand] WDYT I think we should roll it back for now. {quote} Yeah, +1, this is one challenging feature! I'm glad we had that tricky test failure above leading to this. We cannot break atomicity of {{updateDocument}} – Lucene's point-in-time guarantee would be broken. I do think this is fixable, by changing {{commitMergedDeletes}} to NOT carry over concurrent deletes that had occurred to docs in the source segments while they were being merged. But let's start with a test case showing that the current approach did indeed break atomicity guarantee of {{updateDocument}}... > Can we merge small segments during refresh, for faster searching? > - > > Key: LUCENE-8962 > URL: https://issues.apache.org/jira/browse/LUCENE-8962 > Project: Lucene - Core > Issue Type: Improvement > Components: core/index >Reporter: Michael McCandless >Priority: Major > Fix For: 8.6 > > Attachments: LUCENE-8962_demo.png, failed-tests.patch, failure_log.txt > > Time Spent: 19h 40m > Remaining Estimate: 0h > > With near-real-time search we ask {{IndexWriter}} to write all in-memory > segments to disk and open an {{IndexReader}} to search them, and this is > typically a quick operation. > However, when you use many threads for concurrent indexing, {{IndexWriter}} > will accumulate write many small segments during {{refresh}} and this then > adds search-time cost as searching must visit all of these tiny segments. > The merge policy would normally quickly coalesce these small segments if > given a little time ... so, could we somehow improve {{IndexWriter'}}s > refresh to optionally kick off merge policy to merge segments below some > threshold before opening the near-real-time reader? It'd be a bit tricky > because while we are waiting for merges, indexing may continue, and new > segments may be flushed, but those new segments shouldn't be included in the > point-in-time segments returned by refresh ... > One could almost do this on top of Lucene today, with a custom merge policy, > and some hackity logic to have the merge policy target small segments just > written by refresh, but it's tricky to then open a near-real-time reader, > excluding newly flushed but including newly merged segments since the refresh > originally finished ... > I'm not yet sure how best to solve this, so I wanted to open an issue for > discussion! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?
[ https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17142389#comment-17142389 ] Simon Willnauer commented on LUCENE-8962: - This would also explain the failure I am looking into right now on elastic CI: {noformat} 12:20:39[junit4] Suite: org.apache.lucene.index.TestIndexingSequenceNumbers 12:20:39[junit4] 2> NOTE: reproduce with: ant test -Dtestcase=TestIndexingSequenceNumbers -Dtests.method=testStressConcurrentCommit -Dtests.seed=FA92259FC8239E7 -Dtests.nightly=true -Dtests.slow=true -Dtests.badapples=true -Dtests.locale=en-TC -Dtests.timezone=Pacific/Pago_Pago -Dtests.asserts=true -Dtests.file.encoding=UTF8 12:20:39[junit4] FAILURE 85.1s J2 | TestIndexingSequenceNumbers.testStressConcurrentCommit <<< 12:20:39[junit4]> Throwable #1: java.lang.AssertionError: expected:<1> but was:<0> 12:20:39[junit4]> at __randomizedtesting.SeedInfo.seed([FA92259FC8239E7:743DB8E67E421850]:0) 12:20:39[junit4]> at org.apache.lucene.index.TestIndexingSequenceNumbers.testStressConcurrentCommit(TestIndexingSequenceNumbers.java:273) 12:20:39[junit4]> at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 12:20:39[junit4]> at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 12:20:39[junit4]> at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) 12:20:39[junit4]> at java.base/java.lang.reflect.Method.invoke(Method.java:566) 12:20:39[junit4]> at java.base/java.lang.Thread.run(Thread.java:834) 12:20:39[junit4] 2> NOTE: leaving temporary files on disk at: /var/lib/jenkins/workspace/apache+lucene-solr+nightly+branch_8x/lucene/build/core/test/J2/temp/lucene.index.TestIndexingSequenceNumbers_FA92259FC8239E7-001 12:20:39[junit4] 2> NOTE: test params are: codec=Asserting(Lucene86): {id=Lucene84}, docValues:{thread=DocValuesFormat(name=Asserting), ___soft_deletes=DocValuesFormat(name=Asserting)}, maxPointsInLeafNode=279, maxMBSortInHeap=6.3565394726088424, sim=Asserting(org.apache.lucene.search.similarities.AssertingSimilarity@1c08a55f), locale=en-TC, timezone=Pacific/Pago_Pago 12:20:39[junit4] 2> NOTE: Linux 4.18.0-193.6.3.el8_2.x86_64 amd64/Oracle Corporation 11.0.2 (64-bit)/cpus=32,threads=1,free=339006432,total=400556032 12:20:39[junit4] 2> NOTE: All tests run in this JVM: [TestMultiTermConstantScore, TestSparseFixedBitSet, TestSpanBoostQuery, TestOfflineSorter, TestRadixSelector, TestBagOfPositions, TestPhrasePrefixQuery, TestAxiomaticF2LOG, TestLucene50CompoundFormat, TestLongBitSet, TestBinaryDocument, TestBooleanOr, TestComplexExplanationsOfNonMatches, TestTransactions, TestTermQuery, TestUTF32ToUTF8, TestWildcard, TestIndexWriterForceMerge, TestMinShouldMatch2, TestCodecs, TestBytesStore, Test2BNumericDocValues, TestCharFilter, TestXYMultiPolygonShapeQueries, TestFeatureDoubleValues, TestPackedInts, TestGraphTokenizers, TestFilterCodecReader, TestSetOnce, TestLatLonPoint, TestLongRange, TestQueryRescorer, TestNRTThreads, TestMergedIterator, TestLucene86SegmentInfoFormat, TestPackedTokenAttributeImpl, TestBasicModelIne, TestDocCount, TestLMDirichletSimilarity, TestAttributeSource, TestPositiveScoresOnlyCollector, TestXYPointQueries, TestElevationComparator, TestIndexWriterMergePolicy, TestUpgradeIndexMergePolicy, TestIntRangeFieldQueries, TestIndexingSequenceNumbers] 12:20:39[junit4] Completed [369/567 (1!)] on J2 in 168.09s, 8 tests, 1 failure <<< FAILURES! {noformat} it's missing a document, pretty scary but explains what we see here. > Can we merge small segments during refresh, for faster searching? > - > > Key: LUCENE-8962 > URL: https://issues.apache.org/jira/browse/LUCENE-8962 > Project: Lucene - Core > Issue Type: Improvement > Components: core/index >Reporter: Michael McCandless >Priority: Major > Fix For: 8.6 > > Attachments: LUCENE-8962_demo.png, failed-tests.patch, failure_log.txt > > Time Spent: 19h 40m > Remaining Estimate: 0h > > With near-real-time search we ask {{IndexWriter}} to write all in-memory > segments to disk and open an {{IndexReader}} to search them, and this is > typically a quick operation. > However, when you use many threads for concurrent indexing, {{IndexWriter}} > will accumulate write many small segments during {{refresh}} and this then > adds search-time cost as searching must visit all of these tiny segments. > The merge policy would normally quickly coalesce these small segments if > given a little time ... so, could we somehow improve {{IndexWriter'}}s > refres
[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?
[ https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17142388#comment-17142388 ] Simon Willnauer commented on LUCENE-8962: - [~msoko...@gmail.com] I think I have bad news for you. I think we forgot about an important guarantee that we have from updateDocument that is now with this feature enabled not holding up anymore. During a merge we carry over deletes that are made concurrently to the source segments. This would mean that if we continue including deleted that are carried over in a commit we are currently making we don't carry over the corresponding added document which means that there are documents not visible in the comment that should be visible. I think this is a pretty significant problem and we should roll back this feature for now. I am sorry that it took me a while to understand the implications here but these IW logs are not the most intuitive thing to read and stare at. We need to look into commitMergedDeletes if we can make it work or if we need to do the merge again since we now need point in time semantics for merges too. I am not sure if that's actually fixable, but lets look into it. [~mikemccand] WDYT I think we should roll it back for now. > Can we merge small segments during refresh, for faster searching? > - > > Key: LUCENE-8962 > URL: https://issues.apache.org/jira/browse/LUCENE-8962 > Project: Lucene - Core > Issue Type: Improvement > Components: core/index >Reporter: Michael McCandless >Priority: Major > Fix For: 8.6 > > Attachments: LUCENE-8962_demo.png, failed-tests.patch, failure_log.txt > > Time Spent: 19h 40m > Remaining Estimate: 0h > > With near-real-time search we ask {{IndexWriter}} to write all in-memory > segments to disk and open an {{IndexReader}} to search them, and this is > typically a quick operation. > However, when you use many threads for concurrent indexing, {{IndexWriter}} > will accumulate write many small segments during {{refresh}} and this then > adds search-time cost as searching must visit all of these tiny segments. > The merge policy would normally quickly coalesce these small segments if > given a little time ... so, could we somehow improve {{IndexWriter'}}s > refresh to optionally kick off merge policy to merge segments below some > threshold before opening the near-real-time reader? It'd be a bit tricky > because while we are waiting for merges, indexing may continue, and new > segments may be flushed, but those new segments shouldn't be included in the > point-in-time segments returned by refresh ... > One could almost do this on top of Lucene today, with a custom merge policy, > and some hackity logic to have the merge policy target small segments just > written by refresh, but it's tricky to then open a near-real-time reader, > excluding newly flushed but including newly merged segments since the refresh > originally finished ... > I'm not yet sure how best to solve this, so I wanted to open an issue for > discussion! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?
[ https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17142380#comment-17142380 ] David Smiley commented on LUCENE-8962: -- bq. If we decide to do this in the future we can keep the default and implement that method by default without enabling the feature. If we set it to non-zero that option goes away. +1 makes sense Simon > Can we merge small segments during refresh, for faster searching? > - > > Key: LUCENE-8962 > URL: https://issues.apache.org/jira/browse/LUCENE-8962 > Project: Lucene - Core > Issue Type: Improvement > Components: core/index >Reporter: Michael McCandless >Priority: Major > Fix For: 8.6 > > Attachments: LUCENE-8962_demo.png, failed-tests.patch, failure_log.txt > > Time Spent: 19h 40m > Remaining Estimate: 0h > > With near-real-time search we ask {{IndexWriter}} to write all in-memory > segments to disk and open an {{IndexReader}} to search them, and this is > typically a quick operation. > However, when you use many threads for concurrent indexing, {{IndexWriter}} > will accumulate write many small segments during {{refresh}} and this then > adds search-time cost as searching must visit all of these tiny segments. > The merge policy would normally quickly coalesce these small segments if > given a little time ... so, could we somehow improve {{IndexWriter'}}s > refresh to optionally kick off merge policy to merge segments below some > threshold before opening the near-real-time reader? It'd be a bit tricky > because while we are waiting for merges, indexing may continue, and new > segments may be flushed, but those new segments shouldn't be included in the > point-in-time segments returned by refresh ... > One could almost do this on top of Lucene today, with a custom merge policy, > and some hackity logic to have the merge policy target small segments just > written by refresh, but it's tricky to then open a near-real-time reader, > excluding newly flushed but including newly merged segments since the refresh > originally finished ... > I'm not yet sure how best to solve this, so I wanted to open an issue for > discussion! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?
[ https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17142303#comment-17142303 ] Simon Willnauer commented on LUCENE-8962: - I opened a PR [here|https://github.com/apache/lucene-solr/pull/1601] > Can we merge small segments during refresh, for faster searching? > - > > Key: LUCENE-8962 > URL: https://issues.apache.org/jira/browse/LUCENE-8962 > Project: Lucene - Core > Issue Type: Improvement > Components: core/index >Reporter: Michael McCandless >Priority: Major > Fix For: 8.6 > > Attachments: LUCENE-8962_demo.png, failed-tests.patch, failure_log.txt > > Time Spent: 19.5h > Remaining Estimate: 0h > > With near-real-time search we ask {{IndexWriter}} to write all in-memory > segments to disk and open an {{IndexReader}} to search them, and this is > typically a quick operation. > However, when you use many threads for concurrent indexing, {{IndexWriter}} > will accumulate write many small segments during {{refresh}} and this then > adds search-time cost as searching must visit all of these tiny segments. > The merge policy would normally quickly coalesce these small segments if > given a little time ... so, could we somehow improve {{IndexWriter'}}s > refresh to optionally kick off merge policy to merge segments below some > threshold before opening the near-real-time reader? It'd be a bit tricky > because while we are waiting for merges, indexing may continue, and new > segments may be flushed, but those new segments shouldn't be included in the > point-in-time segments returned by refresh ... > One could almost do this on top of Lucene today, with a custom merge policy, > and some hackity logic to have the merge policy target small segments just > written by refresh, but it's tricky to then open a near-real-time reader, > excluding newly flushed but including newly merged segments since the refresh > originally finished ... > I'm not yet sure how best to solve this, so I wanted to open an issue for > discussion! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?
[ https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17142197#comment-17142197 ] Simon Willnauer commented on LUCENE-8962: - I attached a log file with a failure and index writer logging enabled. I think I have a smoking gun here: {noformat} [junit4] 1> IW 2 [2020-06-22T15:33:08.479370Z; Lucene Merge Thread #31]: merged segment _2f(9.0.0):c1:[diagnostics={os.version=10.14.6, mergeMaxNumSegments=-1, java.version=12.0.1, java.vm.version=12.0.1+12, lucene.version=9.0.0, timestamp=1592839988473, os=Mac OS X, java.runtime.version=12.0.1+12, mergeFactor=1, os.arch=x86_64, source=merge, java.vendor=Oracle Corporation}]:[attributes={Lucene50StoredFieldsFormat.mode=BEST_SPEED}] :id=70ut1o7n9kgx798rria9sma3n is 100% deleted; skipping insert [junit4] 1> SM 2 [2020-06-22T15:33:08.479615Z; Lucene Merge Thread #32]: 0 msec to write field infos [29 docs] [junit4] 1> IFD 2 [2020-06-22T15:33:08.479662Z; Lucene Merge Thread #31]: will delete new file "_2f.cfe" [junit4] 1> IFD 2 [2020-06-22T15:33:08.479712Z; Lucene Merge Thread #31]: will delete new file "_2f.cfs" [junit4] 1> IFD 2 [2020-06-22T15:33:08.479740Z; Lucene Merge Thread #31]: will delete new file "_2f.si" [junit4] 1> IFD 2 [2020-06-22T15:33:08.479770Z; Lucene Merge Thread #31]: delete [_2f.cfe, _2f.cfs, _2f.si] {noformat} and further down {noformat} [junit4] 2> Caused by: java.nio.file.NoSuchFileException: _2f.cfs [junit4] 2>at org.apache.lucene.store.ByteBuffersDirectory.deleteFile(ByteBuffersDirectory.java:148) [junit4] 2>at org.apache.lucene.store.MockDirectoryWrapper.deleteFile(MockDirectoryWrapper.java:607) [junit4] 2>at org.apache.lucene.store.LockValidatingDirectoryWrapper.deleteFile(LockValidatingDirectoryWrapper.java:38) [junit4] 2>at org.apache.lucene.index.IndexFileDeleter.deleteFile(IndexFileDeleter.java:696) [junit4] 2>at org.apache.lucene.index.IndexFileDeleter.deleteFiles(IndexFileDeleter.java:690) [junit4] 2>at org.apache.lucene.index.IndexFileDeleter.decRef(IndexFileDeleter.java:589) [junit4] 2>at org.apache.lucene.index.IndexFileDeleter.deleteCommits(IndexFileDeleter.java:382) [junit4] 2>at org.apache.lucene.index.IndexFileDeleter.checkpoint(IndexFileDeleter.java:527) [junit4] 2>at org.apache.lucene.index.IndexWriter.finishCommit(IndexWriter.java:3546) [junit4] 2>at org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:3502) [junit4] 2>at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:3452) [junit4] 2>at org.apache.lucene.index.TestIndexWriter.lambda$testRandomOperations$48(TestIndexWriter.java:3879) [junit4] 2>... 1 more [junit4] 2> {noformat} I think we are not respecting the fact that it's fully deleted and dropped because of this before we get a chance to incRef the segment. I will work on a patch for this. > Can we merge small segments during refresh, for faster searching? > - > > Key: LUCENE-8962 > URL: https://issues.apache.org/jira/browse/LUCENE-8962 > Project: Lucene - Core > Issue Type: Improvement > Components: core/index >Reporter: Michael McCandless >Priority: Major > Fix For: 8.6 > > Attachments: LUCENE-8962_demo.png, failed-tests.patch, failure_log.txt > > Time Spent: 18h 50m > Remaining Estimate: 0h > > With near-real-time search we ask {{IndexWriter}} to write all in-memory > segments to disk and open an {{IndexReader}} to search them, and this is > typically a quick operation. > However, when you use many threads for concurrent indexing, {{IndexWriter}} > will accumulate write many small segments during {{refresh}} and this then > adds search-time cost as searching must visit all of these tiny segments. > The merge policy would normally quickly coalesce these small segments if > given a little time ... so, could we somehow improve {{IndexWriter'}}s > refresh to optionally kick off merge policy to merge segments below some > threshold before opening the near-real-time reader? It'd be a bit tricky > because while we are waiting for merges, indexing may continue, and new > segments may be flushed, but those new segments shouldn't be included in the > point-in-time segments returned by refresh ... > One could almost do this on top of Lucene today, with a custom merge policy, > and some hackity logic to have the merge policy target small segments just > written by refresh, but it's tricky to then open a near-real-time reader, > excluding newly flushed but including newly merged segments since the refresh > originally finished ... > I'm not y
[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?
[ https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17142190#comment-17142190 ] Erick Erickson commented on LUCENE-8962: Sounds like it may _not_ be related to SOLR-13268 then, it's a perfectly legitimate error if some run happens to produce more than 8K of output. If none of the messages look like they came from another test, it's probably legitimate and not a result of async logging. Mostly I wanted you to be aware that it _might_ be related, but that's looking less likely. > Can we merge small segments during refresh, for faster searching? > - > > Key: LUCENE-8962 > URL: https://issues.apache.org/jira/browse/LUCENE-8962 > Project: Lucene - Core > Issue Type: Improvement > Components: core/index >Reporter: Michael McCandless >Priority: Major > Fix For: 8.6 > > Attachments: LUCENE-8962_demo.png, failed-tests.patch > > Time Spent: 18h 50m > Remaining Estimate: 0h > > With near-real-time search we ask {{IndexWriter}} to write all in-memory > segments to disk and open an {{IndexReader}} to search them, and this is > typically a quick operation. > However, when you use many threads for concurrent indexing, {{IndexWriter}} > will accumulate write many small segments during {{refresh}} and this then > adds search-time cost as searching must visit all of these tiny segments. > The merge policy would normally quickly coalesce these small segments if > given a little time ... so, could we somehow improve {{IndexWriter'}}s > refresh to optionally kick off merge policy to merge segments below some > threshold before opening the near-real-time reader? It'd be a bit tricky > because while we are waiting for merges, indexing may continue, and new > segments may be flushed, but those new segments shouldn't be included in the > point-in-time segments returned by refresh ... > One could almost do this on top of Lucene today, with a custom merge policy, > and some hackity logic to have the merge policy target small segments just > written by refresh, but it's tricky to then open a near-real-time reader, > excluding newly flushed but including newly merged segments since the refresh > originally finished ... > I'm not yet sure how best to solve this, so I wanted to open an issue for > discussion! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?
[ https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17142184#comment-17142184 ] Simon Willnauer commented on LUCENE-8962: - I can reproduce this every 10 iterations. Funny enough I can't in my IDE only when I repeatedly run it on the CMD. This line reproduces for me quite frequently: {noformat} ant test -Dtestcase=TestIndexWriter -Dtests.method=testRandomOperations -Dtests.seed=83A78A6E34594E08 -Dtests.nightly=true -Dtests.slow=true -Dtests.badapples=true -Dtests.locale=teo-KE -Dtests.timezone=Antarctica/Rothera -Dtests.asserts=true -Dtests.file.encoding=ISO-8859-1 {noformat} I also checked if it's related to calling mergeFinished multiple times but it doesn't > Can we merge small segments during refresh, for faster searching? > - > > Key: LUCENE-8962 > URL: https://issues.apache.org/jira/browse/LUCENE-8962 > Project: Lucene - Core > Issue Type: Improvement > Components: core/index >Reporter: Michael McCandless >Priority: Major > Fix For: 8.6 > > Attachments: LUCENE-8962_demo.png, failed-tests.patch > > Time Spent: 18h 50m > Remaining Estimate: 0h > > With near-real-time search we ask {{IndexWriter}} to write all in-memory > segments to disk and open an {{IndexReader}} to search them, and this is > typically a quick operation. > However, when you use many threads for concurrent indexing, {{IndexWriter}} > will accumulate write many small segments during {{refresh}} and this then > adds search-time cost as searching must visit all of these tiny segments. > The merge policy would normally quickly coalesce these small segments if > given a little time ... so, could we somehow improve {{IndexWriter'}}s > refresh to optionally kick off merge policy to merge segments below some > threshold before opening the near-real-time reader? It'd be a bit tricky > because while we are waiting for merges, indexing may continue, and new > segments may be flushed, but those new segments shouldn't be included in the > point-in-time segments returned by refresh ... > One could almost do this on top of Lucene today, with a custom merge policy, > and some hackity logic to have the merge policy target small segments just > written by refresh, but it's tricky to then open a near-real-time reader, > excluding newly flushed but including newly merged segments since the refresh > originally finished ... > I'm not yet sure how best to solve this, so I wanted to open an issue for > discussion! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?
[ https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17142110#comment-17142110 ] Erick Erickson commented on LUCENE-8962: [~mikemccand] Ever since I changed Solr to use asnyc logging by default, these errors show up _very_ occasionally and I have been unable to figure out why. My best guess is that "somehow" the logging buffer isn't being flushed when the logger is closed, as it should be. See: SOLR-13268. It happens so rarely that it's hard to diagnose/fix, any insights appreciated. I do wonder if the right thing to do is go back to synchronous logging for the tests. My guess is that some buffer flushing isn't quite completed when loggers shut down, but in the real world this won't happen unless the app is going away so shouldn't be a problem? WDYT? > Can we merge small segments during refresh, for faster searching? > - > > Key: LUCENE-8962 > URL: https://issues.apache.org/jira/browse/LUCENE-8962 > Project: Lucene - Core > Issue Type: Improvement > Components: core/index >Reporter: Michael McCandless >Priority: Major > Fix For: 8.6 > > Attachments: LUCENE-8962_demo.png, failed-tests.patch > > Time Spent: 18h 50m > Remaining Estimate: 0h > > With near-real-time search we ask {{IndexWriter}} to write all in-memory > segments to disk and open an {{IndexReader}} to search them, and this is > typically a quick operation. > However, when you use many threads for concurrent indexing, {{IndexWriter}} > will accumulate write many small segments during {{refresh}} and this then > adds search-time cost as searching must visit all of these tiny segments. > The merge policy would normally quickly coalesce these small segments if > given a little time ... so, could we somehow improve {{IndexWriter'}}s > refresh to optionally kick off merge policy to merge segments below some > threshold before opening the near-real-time reader? It'd be a bit tricky > because while we are waiting for merges, indexing may continue, and new > segments may be flushed, but those new segments shouldn't be included in the > point-in-time segments returned by refresh ... > One could almost do this on top of Lucene today, with a custom merge policy, > and some hackity logic to have the merge policy target small segments just > written by refresh, but it's tricky to then open a near-real-time reader, > excluding newly flushed but including newly merged segments since the refresh > originally finished ... > I'm not yet sure how best to solve this, so I wanted to open an issue for > discussion! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?
[ https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17142026#comment-17142026 ] Michael McCandless commented on LUCENE-8962: {quote}It seems like this caused this test failure: {quote} Hmm it does not repro for me on first try. Also, I wonder why the repro line is missing the test case ({{testRandomOperations}})? Oh, hmm, I wonder if the test case itself did not fail due to all the unhandled exceptions in threads, and it was just the final {noformat} Throwable #1: java.lang.AssertionError: The test or suite printed 9264 bytes to stdout and stderr, even though the limit was set to 8192 bytes. Increase the limit with @Limit, ignore it completely with @SuppressSysoutChecks or run with -Dtests.verbose=true {noformat} that "failed"? This is a nice evil test. > Can we merge small segments during refresh, for faster searching? > - > > Key: LUCENE-8962 > URL: https://issues.apache.org/jira/browse/LUCENE-8962 > Project: Lucene - Core > Issue Type: Improvement > Components: core/index >Reporter: Michael McCandless >Priority: Major > Fix For: 8.6 > > Attachments: LUCENE-8962_demo.png, failed-tests.patch > > Time Spent: 18h 40m > Remaining Estimate: 0h > > With near-real-time search we ask {{IndexWriter}} to write all in-memory > segments to disk and open an {{IndexReader}} to search them, and this is > typically a quick operation. > However, when you use many threads for concurrent indexing, {{IndexWriter}} > will accumulate write many small segments during {{refresh}} and this then > adds search-time cost as searching must visit all of these tiny segments. > The merge policy would normally quickly coalesce these small segments if > given a little time ... so, could we somehow improve {{IndexWriter'}}s > refresh to optionally kick off merge policy to merge segments below some > threshold before opening the near-real-time reader? It'd be a bit tricky > because while we are waiting for merges, indexing may continue, and new > segments may be flushed, but those new segments shouldn't be included in the > point-in-time segments returned by refresh ... > One could almost do this on top of Lucene today, with a custom merge policy, > and some hackity logic to have the merge policy target small segments just > written by refresh, but it's tricky to then open a near-real-time reader, > excluding newly flushed but including newly merged segments since the refresh > originally finished ... > I'm not yet sure how best to solve this, so I wanted to open an issue for > discussion! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?
[ https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17142010#comment-17142010 ] Michael Sokolov commented on LUCENE-8962: - Re: the TestIndexWriter failure. I tried running this morning (in IntelliJ, on a linux laptop), and couldn't reproduce that seed after 3000 iterations. Just to rule out any system dependency of the test failure; where did you get the stack trace, [~simonw]? Was this from Elastic tests? Also, I wonder if the machine has any impact - in particular does the number of CPUs/cores influence the IndexWriter threading in some way? > Can we merge small segments during refresh, for faster searching? > - > > Key: LUCENE-8962 > URL: https://issues.apache.org/jira/browse/LUCENE-8962 > Project: Lucene - Core > Issue Type: Improvement > Components: core/index >Reporter: Michael McCandless >Priority: Major > Fix For: 8.6 > > Attachments: LUCENE-8962_demo.png, failed-tests.patch > > Time Spent: 18h 40m > Remaining Estimate: 0h > > With near-real-time search we ask {{IndexWriter}} to write all in-memory > segments to disk and open an {{IndexReader}} to search them, and this is > typically a quick operation. > However, when you use many threads for concurrent indexing, {{IndexWriter}} > will accumulate write many small segments during {{refresh}} and this then > adds search-time cost as searching must visit all of these tiny segments. > The merge policy would normally quickly coalesce these small segments if > given a little time ... so, could we somehow improve {{IndexWriter'}}s > refresh to optionally kick off merge policy to merge segments below some > threshold before opening the near-real-time reader? It'd be a bit tricky > because while we are waiting for merges, indexing may continue, and new > segments may be flushed, but those new segments shouldn't be included in the > point-in-time segments returned by refresh ... > One could almost do this on top of Lucene today, with a custom merge policy, > and some hackity logic to have the merge policy target small segments just > written by refresh, but it's tricky to then open a near-real-time reader, > excluding newly flushed but including newly merged segments since the refresh > originally finished ... > I'm not yet sure how best to solve this, so I wanted to open an issue for > discussion! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?
[ https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17141811#comment-17141811 ] Simon Willnauer commented on LUCENE-8962: - {quote} I'm surprised to see that IndexWriterConfig.DEFAULT_MAX_COMMIT_MERGE_WAIT_SECONDS is 0. After all, the default is not to merge on commit so if someone configured a merge policy to do this, shouldn't we wait? I think by default we should wait indefinitely. It's up to the developer/configure-er of the merge policy to find cheap merges (if any). Keeping it at 0 creates a gotcha for yet another setting that's required to merge on commit. {quote} I don't think we should. It's such an expert setting and unless we implement it ourself I think we should keep it that way. I'd assume if we make it non-zero by default that our MPs implement it and I can use it right away. If we decide to do this in the future we can keep the default and implement that method by default without enabling the feature. If we set it to non-zero that option goes away. > Can we merge small segments during refresh, for faster searching? > - > > Key: LUCENE-8962 > URL: https://issues.apache.org/jira/browse/LUCENE-8962 > Project: Lucene - Core > Issue Type: Improvement > Components: core/index >Reporter: Michael McCandless >Priority: Major > Fix For: 8.6 > > Attachments: LUCENE-8962_demo.png, failed-tests.patch > > Time Spent: 18.5h > Remaining Estimate: 0h > > With near-real-time search we ask {{IndexWriter}} to write all in-memory > segments to disk and open an {{IndexReader}} to search them, and this is > typically a quick operation. > However, when you use many threads for concurrent indexing, {{IndexWriter}} > will accumulate write many small segments during {{refresh}} and this then > adds search-time cost as searching must visit all of these tiny segments. > The merge policy would normally quickly coalesce these small segments if > given a little time ... so, could we somehow improve {{IndexWriter'}}s > refresh to optionally kick off merge policy to merge segments below some > threshold before opening the near-real-time reader? It'd be a bit tricky > because while we are waiting for merges, indexing may continue, and new > segments may be flushed, but those new segments shouldn't be included in the > point-in-time segments returned by refresh ... > One could almost do this on top of Lucene today, with a custom merge policy, > and some hackity logic to have the merge policy target small segments just > written by refresh, but it's tricky to then open a near-real-time reader, > excluding newly flushed but including newly merged segments since the refresh > originally finished ... > I'm not yet sure how best to solve this, so I wanted to open an issue for > discussion! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?
[ https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17141809#comment-17141809 ] Simon Willnauer commented on LUCENE-8962: - It seems like this caused this test failure: {noformat} 18:06:29[junit4] Suite: org.apache.lucene.index.TestIndexWriter 18:06:29[junit4] 2> Mod 21, 2020 3:06:19 EBONGI com.carrotsearch.randomizedtesting.RandomizedRunner$QueueUncaughtExceptionsHandler uncaughtException 18:06:29[junit4] 2> WARNING: Uncaught exception in thread: Thread[Thread-1905,5,TGRP-TestIndexWriter] 18:06:29[junit4] 2> java.lang.AssertionError: org.apache.lucene.store.AlreadyClosedException: this IndexWriter is closed 18:06:29[junit4] 2> at __randomizedtesting.SeedInfo.seed([83A78A6E34594E08]:0) 18:06:29[junit4] 2> at org.apache.lucene.index.TestIndexWriter.lambda$testRandomOperations$48(TestIndexWriter.java:3886) 18:06:29[junit4] 2> at java.base/java.lang.Thread.run(Thread.java:834) 18:06:29[junit4] 2> Caused by: org.apache.lucene.store.AlreadyClosedException: this IndexWriter is closed 18:06:29[junit4] 2> at org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:739) 18:06:29[junit4] 2> at org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:753) 18:06:29[junit4] 2> at org.apache.lucene.index.IndexWriter.updateDocuments(IndexWriter.java:1330) 18:06:29[junit4] 2> at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1619) 18:06:29[junit4] 2> at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1262) 18:06:29[junit4] 2> at org.apache.lucene.index.TestIndexWriter.lambda$testRandomOperations$48(TestIndexWriter.java:3873) 18:06:29[junit4] 2> ... 1 more 18:06:29[junit4] 2> Caused by: java.nio.file.NoSuchFileException: _7.cfs 18:06:29[junit4] 2> at org.apache.lucene.store.ByteBuffersDirectory.deleteFile(ByteBuffersDirectory.java:148) 18:06:29[junit4] 2> at org.apache.lucene.store.MockDirectoryWrapper.deleteFile(MockDirectoryWrapper.java:607) 18:06:29[junit4] 2> at org.apache.lucene.store.LockValidatingDirectoryWrapper.deleteFile(LockValidatingDirectoryWrapper.java:38) 18:06:29[junit4] 2> at org.apache.lucene.index.IndexFileDeleter.deleteFile(IndexFileDeleter.java:696) 18:06:29[junit4] 2> at org.apache.lucene.index.IndexFileDeleter.deleteFiles(IndexFileDeleter.java:690) 18:06:29[junit4] 2> at org.apache.lucene.index.IndexFileDeleter.decRef(IndexFileDeleter.java:589) 18:06:29[junit4] 2> at org.apache.lucene.index.IndexFileDeleter.deleteCommits(IndexFileDeleter.java:382) 18:06:29[junit4] 2> at org.apache.lucene.index.IndexFileDeleter.checkpoint(IndexFileDeleter.java:527) 18:06:29[junit4] 2> at org.apache.lucene.index.IndexWriter.finishCommit(IndexWriter.java:3546) 18:06:29[junit4] 2> at org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:3502) 18:06:29[junit4] 2> at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:3452) 18:06:29[junit4] 2> at org.apache.lucene.index.TestIndexWriter.lambda$testRandomOperations$48(TestIndexWriter.java:3879) 18:06:29[junit4] 2> ... 1 more 18:06:29[junit4] 2> 18:06:29[junit4] 2> Mod 21, 2020 3:06:19 EBONGI com.carrotsearch.randomizedtesting.RandomizedRunner$QueueUncaughtExceptionsHandler uncaughtException 18:06:29[junit4] 2> WARNING: Uncaught exception in thread: Thread[Lucene Merge Thread #3,5,TGRP-TestIndexWriter] 18:06:29[junit4] 2> org.apache.lucene.index.MergePolicy$MergeException: java.lang.IllegalStateException: this writer hit an unrecoverable error; cannot merge 18:06:29[junit4] 2> at __randomizedtesting.SeedInfo.seed([83A78A6E34594E08]:0) 18:06:29[junit4] 2> at org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:703) 18:06:29[junit4] 2> at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:694) 18:06:29[junit4] 2> Caused by: java.lang.IllegalStateException: this writer hit an unrecoverable error; cannot merge 18:06:29[junit4] 2> at org.apache.lucene.index.IndexWriter._mergeInit(IndexWriter.java:4250) 18:06:29[junit4] 2> at org.apache.lucene.index.IndexWriter.mergeInit(IndexWriter.java:4230) 18:06:29[junit4] 2> at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:4083) 18:06:29[junit4] 2> at org.apache.lucene.index.IndexWriter$IndexWriterMergeSource.merge(IndexWriter.java:5650) 18:06:29[junit4] 2> at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:624) 18:06:29[j
[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?
[ https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17141583#comment-17141583 ] David Smiley commented on LUCENE-8962: -- I'm surprised to see that {{IndexWriterConfig.DEFAULT_MAX_COMMIT_MERGE_WAIT_SECONDS}} is 0. After all, the default is not to merge on commit so if someone configured a merge policy to do this, shouldn't we wait? I think by default we should wait indefinitely. It's up to the developer/configure-er of the merge policy to find cheap merges (if any). Keeping it at 0 creates a gotcha for yet another setting that's required to merge on commit. > Can we merge small segments during refresh, for faster searching? > - > > Key: LUCENE-8962 > URL: https://issues.apache.org/jira/browse/LUCENE-8962 > Project: Lucene - Core > Issue Type: Improvement > Components: core/index >Reporter: Michael McCandless >Priority: Major > Fix For: 8.6 > > Attachments: LUCENE-8962_demo.png, failed-tests.patch > > Time Spent: 18h 20m > Remaining Estimate: 0h > > With near-real-time search we ask {{IndexWriter}} to write all in-memory > segments to disk and open an {{IndexReader}} to search them, and this is > typically a quick operation. > However, when you use many threads for concurrent indexing, {{IndexWriter}} > will accumulate write many small segments during {{refresh}} and this then > adds search-time cost as searching must visit all of these tiny segments. > The merge policy would normally quickly coalesce these small segments if > given a little time ... so, could we somehow improve {{IndexWriter'}}s > refresh to optionally kick off merge policy to merge segments below some > threshold before opening the near-real-time reader? It'd be a bit tricky > because while we are waiting for merges, indexing may continue, and new > segments may be flushed, but those new segments shouldn't be included in the > point-in-time segments returned by refresh ... > One could almost do this on top of Lucene today, with a custom merge policy, > and some hackity logic to have the merge policy target small segments just > written by refresh, but it's tricky to then open a near-real-time reader, > excluding newly flushed but including newly merged segments since the refresh > originally finished ... > I'm not yet sure how best to solve this, so I wanted to open an issue for > discussion! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?
[ https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17140039#comment-17140039 ] Michael Sokolov commented on LUCENE-8962: - pushed [https://github.com/apache/lucene-solr/pull/1552] to master, and cherry-picked to branch_8x, resolving > Can we merge small segments during refresh, for faster searching? > - > > Key: LUCENE-8962 > URL: https://issues.apache.org/jira/browse/LUCENE-8962 > Project: Lucene - Core > Issue Type: Improvement > Components: core/index >Reporter: Michael McCandless >Priority: Major > Fix For: 8.6 > > Attachments: LUCENE-8962_demo.png, failed-tests.patch > > Time Spent: 18h 20m > Remaining Estimate: 0h > > With near-real-time search we ask {{IndexWriter}} to write all in-memory > segments to disk and open an {{IndexReader}} to search them, and this is > typically a quick operation. > However, when you use many threads for concurrent indexing, {{IndexWriter}} > will accumulate write many small segments during {{refresh}} and this then > adds search-time cost as searching must visit all of these tiny segments. > The merge policy would normally quickly coalesce these small segments if > given a little time ... so, could we somehow improve {{IndexWriter'}}s > refresh to optionally kick off merge policy to merge segments below some > threshold before opening the near-real-time reader? It'd be a bit tricky > because while we are waiting for merges, indexing may continue, and new > segments may be flushed, but those new segments shouldn't be included in the > point-in-time segments returned by refresh ... > One could almost do this on top of Lucene today, with a custom merge policy, > and some hackity logic to have the merge policy target small segments just > written by refresh, but it's tricky to then open a near-real-time reader, > excluding newly flushed but including newly merged segments since the refresh > originally finished ... > I'm not yet sure how best to solve this, so I wanted to open an issue for > discussion! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?
[ https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17138827#comment-17138827 ] ASF subversion and git services commented on LUCENE-8962: - Commit d65dcb43728dd6bb64393226e24576525328cecc in lucene-solr's branch refs/heads/branch_8x from Simon Willnauer [ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=d65dcb4 ] LUCENE-8962: Allow waiting for all merges in a merge spec (#1585) This change adds infrastructure to allow straight forward waiting on one or more merges or an entire merge specification. This is a basis for LUCENE-8962. > Can we merge small segments during refresh, for faster searching? > - > > Key: LUCENE-8962 > URL: https://issues.apache.org/jira/browse/LUCENE-8962 > Project: Lucene - Core > Issue Type: Improvement > Components: core/index >Reporter: Michael McCandless >Priority: Major > Fix For: 8.6 > > Attachments: LUCENE-8962_demo.png, failed-tests.patch > > Time Spent: 16h 50m > Remaining Estimate: 0h > > With near-real-time search we ask {{IndexWriter}} to write all in-memory > segments to disk and open an {{IndexReader}} to search them, and this is > typically a quick operation. > However, when you use many threads for concurrent indexing, {{IndexWriter}} > will accumulate write many small segments during {{refresh}} and this then > adds search-time cost as searching must visit all of these tiny segments. > The merge policy would normally quickly coalesce these small segments if > given a little time ... so, could we somehow improve {{IndexWriter'}}s > refresh to optionally kick off merge policy to merge segments below some > threshold before opening the near-real-time reader? It'd be a bit tricky > because while we are waiting for merges, indexing may continue, and new > segments may be flushed, but those new segments shouldn't be included in the > point-in-time segments returned by refresh ... > One could almost do this on top of Lucene today, with a custom merge policy, > and some hackity logic to have the merge policy target small segments just > written by refresh, but it's tricky to then open a near-real-time reader, > excluding newly flushed but including newly merged segments since the refresh > originally finished ... > I'm not yet sure how best to solve this, so I wanted to open an issue for > discussion! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?
[ https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17138826#comment-17138826 ] ASF subversion and git services commented on LUCENE-8962: - Commit d65dcb43728dd6bb64393226e24576525328cecc in lucene-solr's branch refs/heads/branch_8x from Simon Willnauer [ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=d65dcb4 ] LUCENE-8962: Allow waiting for all merges in a merge spec (#1585) This change adds infrastructure to allow straight forward waiting on one or more merges or an entire merge specification. This is a basis for LUCENE-8962. > Can we merge small segments during refresh, for faster searching? > - > > Key: LUCENE-8962 > URL: https://issues.apache.org/jira/browse/LUCENE-8962 > Project: Lucene - Core > Issue Type: Improvement > Components: core/index >Reporter: Michael McCandless >Priority: Major > Fix For: 8.6 > > Attachments: LUCENE-8962_demo.png, failed-tests.patch > > Time Spent: 16h 50m > Remaining Estimate: 0h > > With near-real-time search we ask {{IndexWriter}} to write all in-memory > segments to disk and open an {{IndexReader}} to search them, and this is > typically a quick operation. > However, when you use many threads for concurrent indexing, {{IndexWriter}} > will accumulate write many small segments during {{refresh}} and this then > adds search-time cost as searching must visit all of these tiny segments. > The merge policy would normally quickly coalesce these small segments if > given a little time ... so, could we somehow improve {{IndexWriter'}}s > refresh to optionally kick off merge policy to merge segments below some > threshold before opening the near-real-time reader? It'd be a bit tricky > because while we are waiting for merges, indexing may continue, and new > segments may be flushed, but those new segments shouldn't be included in the > point-in-time segments returned by refresh ... > One could almost do this on top of Lucene today, with a custom merge policy, > and some hackity logic to have the merge policy target small segments just > written by refresh, but it's tricky to then open a near-real-time reader, > excluding newly flushed but including newly merged segments since the refresh > originally finished ... > I'm not yet sure how best to solve this, so I wanted to open an issue for > discussion! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?
[ https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17138820#comment-17138820 ] ASF subversion and git services commented on LUCENE-8962: - Commit 59efe22ac29c95f9ba85b7214fcf5e30cc979222 in lucene-solr's branch refs/heads/master from Simon Willnauer [ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=59efe22 ] LUCENE-8962: Allow waiting for all merges in a merge spec (#1585) This change adds infrastructure to allow straight forward waiting on one or more merges or an entire merge specification. This is a basis for LUCENE-8962. > Can we merge small segments during refresh, for faster searching? > - > > Key: LUCENE-8962 > URL: https://issues.apache.org/jira/browse/LUCENE-8962 > Project: Lucene - Core > Issue Type: Improvement > Components: core/index >Reporter: Michael McCandless >Priority: Major > Fix For: 8.6 > > Attachments: LUCENE-8962_demo.png, failed-tests.patch > > Time Spent: 16h 50m > Remaining Estimate: 0h > > With near-real-time search we ask {{IndexWriter}} to write all in-memory > segments to disk and open an {{IndexReader}} to search them, and this is > typically a quick operation. > However, when you use many threads for concurrent indexing, {{IndexWriter}} > will accumulate write many small segments during {{refresh}} and this then > adds search-time cost as searching must visit all of these tiny segments. > The merge policy would normally quickly coalesce these small segments if > given a little time ... so, could we somehow improve {{IndexWriter'}}s > refresh to optionally kick off merge policy to merge segments below some > threshold before opening the near-real-time reader? It'd be a bit tricky > because while we are waiting for merges, indexing may continue, and new > segments may be flushed, but those new segments shouldn't be included in the > point-in-time segments returned by refresh ... > One could almost do this on top of Lucene today, with a custom merge policy, > and some hackity logic to have the merge policy target small segments just > written by refresh, but it's tricky to then open a near-real-time reader, > excluding newly flushed but including newly merged segments since the refresh > originally finished ... > I'm not yet sure how best to solve this, so I wanted to open an issue for > discussion! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?
[ https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17138821#comment-17138821 ] ASF subversion and git services commented on LUCENE-8962: - Commit 59efe22ac29c95f9ba85b7214fcf5e30cc979222 in lucene-solr's branch refs/heads/master from Simon Willnauer [ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=59efe22 ] LUCENE-8962: Allow waiting for all merges in a merge spec (#1585) This change adds infrastructure to allow straight forward waiting on one or more merges or an entire merge specification. This is a basis for LUCENE-8962. > Can we merge small segments during refresh, for faster searching? > - > > Key: LUCENE-8962 > URL: https://issues.apache.org/jira/browse/LUCENE-8962 > Project: Lucene - Core > Issue Type: Improvement > Components: core/index >Reporter: Michael McCandless >Priority: Major > Fix For: 8.6 > > Attachments: LUCENE-8962_demo.png, failed-tests.patch > > Time Spent: 16h 50m > Remaining Estimate: 0h > > With near-real-time search we ask {{IndexWriter}} to write all in-memory > segments to disk and open an {{IndexReader}} to search them, and this is > typically a quick operation. > However, when you use many threads for concurrent indexing, {{IndexWriter}} > will accumulate write many small segments during {{refresh}} and this then > adds search-time cost as searching must visit all of these tiny segments. > The merge policy would normally quickly coalesce these small segments if > given a little time ... so, could we somehow improve {{IndexWriter'}}s > refresh to optionally kick off merge policy to merge segments below some > threshold before opening the near-real-time reader? It'd be a bit tricky > because while we are waiting for merges, indexing may continue, and new > segments may be flushed, but those new segments shouldn't be included in the > point-in-time segments returned by refresh ... > One could almost do this on top of Lucene today, with a custom merge policy, > and some hackity logic to have the merge policy target small segments just > written by refresh, but it's tricky to then open a near-real-time reader, > excluding newly flushed but including newly merged segments since the refresh > originally finished ... > I'm not yet sure how best to solve this, so I wanted to open an issue for > discussion! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?
[ https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17126253#comment-17126253 ] Nhat Nguyen commented on LUCENE-8962: - > For some reason it's not linked above, and I'm not sure how to remedy that I guess the title of the PR needs to be in the format of "LUCENE-: description". > Can we merge small segments during refresh, for faster searching? > - > > Key: LUCENE-8962 > URL: https://issues.apache.org/jira/browse/LUCENE-8962 > Project: Lucene - Core > Issue Type: Improvement > Components: core/index >Reporter: Michael McCandless >Priority: Major > Fix For: 8.6 > > Attachments: LUCENE-8962_demo.png, failed-tests.patch > > Time Spent: 10h > Remaining Estimate: 0h > > With near-real-time search we ask {{IndexWriter}} to write all in-memory > segments to disk and open an {{IndexReader}} to search them, and this is > typically a quick operation. > However, when you use many threads for concurrent indexing, {{IndexWriter}} > will accumulate write many small segments during {{refresh}} and this then > adds search-time cost as searching must visit all of these tiny segments. > The merge policy would normally quickly coalesce these small segments if > given a little time ... so, could we somehow improve {{IndexWriter'}}s > refresh to optionally kick off merge policy to merge segments below some > threshold before opening the near-real-time reader? It'd be a bit tricky > because while we are waiting for merges, indexing may continue, and new > segments may be flushed, but those new segments shouldn't be included in the > point-in-time segments returned by refresh ... > One could almost do this on top of Lucene today, with a custom merge policy, > and some hackity logic to have the merge policy target small segments just > written by refresh, but it's tricky to then open a near-real-time reader, > excluding newly flushed but including newly merged segments since the refresh > originally finished ... > I'm not yet sure how best to solve this, so I wanted to open an issue for > discussion! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?
[ https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17125282#comment-17125282 ] Michael Sokolov commented on LUCENE-8962: - Posted a new PR that fixes the test failures we were seeing: [https://github.com/apache/lucene-solr/pull/1552] For some reason it's not linked above, and I'm not sure how to remedy that > Can we merge small segments during refresh, for faster searching? > - > > Key: LUCENE-8962 > URL: https://issues.apache.org/jira/browse/LUCENE-8962 > Project: Lucene - Core > Issue Type: Improvement > Components: core/index >Reporter: Michael McCandless >Priority: Major > Fix For: 8.6 > > Attachments: LUCENE-8962_demo.png, failed-tests.patch > > Time Spent: 9h 40m > Remaining Estimate: 0h > > With near-real-time search we ask {{IndexWriter}} to write all in-memory > segments to disk and open an {{IndexReader}} to search them, and this is > typically a quick operation. > However, when you use many threads for concurrent indexing, {{IndexWriter}} > will accumulate write many small segments during {{refresh}} and this then > adds search-time cost as searching must visit all of these tiny segments. > The merge policy would normally quickly coalesce these small segments if > given a little time ... so, could we somehow improve {{IndexWriter'}}s > refresh to optionally kick off merge policy to merge segments below some > threshold before opening the near-real-time reader? It'd be a bit tricky > because while we are waiting for merges, indexing may continue, and new > segments may be flushed, but those new segments shouldn't be included in the > point-in-time segments returned by refresh ... > One could almost do this on top of Lucene today, with a custom merge policy, > and some hackity logic to have the merge policy target small segments just > written by refresh, but it's tricky to then open a near-real-time reader, > excluding newly flushed but including newly merged segments since the refresh > originally finished ... > I'm not yet sure how best to solve this, so I wanted to open an issue for > discussion! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?
[ https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17066966#comment-17066966 ] ASF subversion and git services commented on LUCENE-8962: - Commit 075adac59865b3277adcf86052f2fae3e6d11135 in lucene-solr's branch refs/heads/master from Michael Sokolov [ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=075adac ] remove LUCENE-8962 from CHANGES.txt > Can we merge small segments during refresh, for faster searching? > - > > Key: LUCENE-8962 > URL: https://issues.apache.org/jira/browse/LUCENE-8962 > Project: Lucene - Core > Issue Type: Improvement > Components: core/index >Reporter: Michael McCandless >Priority: Major > Fix For: 8.6 > > Attachments: LUCENE-8962_demo.png, failed-tests.patch > > Time Spent: 9.5h > Remaining Estimate: 0h > > With near-real-time search we ask {{IndexWriter}} to write all in-memory > segments to disk and open an {{IndexReader}} to search them, and this is > typically a quick operation. > However, when you use many threads for concurrent indexing, {{IndexWriter}} > will accumulate write many small segments during {{refresh}} and this then > adds search-time cost as searching must visit all of these tiny segments. > The merge policy would normally quickly coalesce these small segments if > given a little time ... so, could we somehow improve {{IndexWriter'}}s > refresh to optionally kick off merge policy to merge segments below some > threshold before opening the near-real-time reader? It'd be a bit tricky > because while we are waiting for merges, indexing may continue, and new > segments may be flushed, but those new segments shouldn't be included in the > point-in-time segments returned by refresh ... > One could almost do this on top of Lucene today, with a custom merge policy, > and some hackity logic to have the merge policy target small segments just > written by refresh, but it's tricky to then open a near-real-time reader, > excluding newly flushed but including newly merged segments since the refresh > originally finished ... > I'm not yet sure how best to solve this, so I wanted to open an issue for > discussion! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?
[ https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17066968#comment-17066968 ] Michael Sokolov commented on LUCENE-8962: - Yes, sorry, merging all those reverts was slightly hellacious and I missed a spot. Now removed. > Can we merge small segments during refresh, for faster searching? > - > > Key: LUCENE-8962 > URL: https://issues.apache.org/jira/browse/LUCENE-8962 > Project: Lucene - Core > Issue Type: Improvement > Components: core/index >Reporter: Michael McCandless >Priority: Major > Fix For: 8.6 > > Attachments: LUCENE-8962_demo.png, failed-tests.patch > > Time Spent: 9.5h > Remaining Estimate: 0h > > With near-real-time search we ask {{IndexWriter}} to write all in-memory > segments to disk and open an {{IndexReader}} to search them, and this is > typically a quick operation. > However, when you use many threads for concurrent indexing, {{IndexWriter}} > will accumulate write many small segments during {{refresh}} and this then > adds search-time cost as searching must visit all of these tiny segments. > The merge policy would normally quickly coalesce these small segments if > given a little time ... so, could we somehow improve {{IndexWriter'}}s > refresh to optionally kick off merge policy to merge segments below some > threshold before opening the near-real-time reader? It'd be a bit tricky > because while we are waiting for merges, indexing may continue, and new > segments may be flushed, but those new segments shouldn't be included in the > point-in-time segments returned by refresh ... > One could almost do this on top of Lucene today, with a custom merge policy, > and some hackity logic to have the merge policy target small segments just > written by refresh, but it's tricky to then open a near-real-time reader, > excluding newly flushed but including newly merged segments since the refresh > originally finished ... > I'm not yet sure how best to solve this, so I wanted to open an issue for > discussion! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?
[ https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17066742#comment-17066742 ] David Smiley commented on LUCENE-8962: -- ok; makes sense. > Can we merge small segments during refresh, for faster searching? > - > > Key: LUCENE-8962 > URL: https://issues.apache.org/jira/browse/LUCENE-8962 > Project: Lucene - Core > Issue Type: Improvement > Components: core/index >Reporter: Michael McCandless >Priority: Major > Fix For: 8.6 > > Attachments: LUCENE-8962_demo.png, failed-tests.patch > > Time Spent: 9.5h > Remaining Estimate: 0h > > With near-real-time search we ask {{IndexWriter}} to write all in-memory > segments to disk and open an {{IndexReader}} to search them, and this is > typically a quick operation. > However, when you use many threads for concurrent indexing, {{IndexWriter}} > will accumulate write many small segments during {{refresh}} and this then > adds search-time cost as searching must visit all of these tiny segments. > The merge policy would normally quickly coalesce these small segments if > given a little time ... so, could we somehow improve {{IndexWriter'}}s > refresh to optionally kick off merge policy to merge segments below some > threshold before opening the near-real-time reader? It'd be a bit tricky > because while we are waiting for merges, indexing may continue, and new > segments may be flushed, but those new segments shouldn't be included in the > point-in-time segments returned by refresh ... > One could almost do this on top of Lucene today, with a custom merge policy, > and some hackity logic to have the merge policy target small segments just > written by refresh, but it's tricky to then open a near-real-time reader, > excluding newly flushed but including newly merged segments since the refresh > originally finished ... > I'm not yet sure how best to solve this, so I wanted to open an issue for > discussion! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?
[ https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17066573#comment-17066573 ] Alan Woodward commented on LUCENE-8962: --- It's not in the released 8.5 CHANGES.txt on branch_8_5 so I think this is probably only incorrect in master? > Can we merge small segments during refresh, for faster searching? > - > > Key: LUCENE-8962 > URL: https://issues.apache.org/jira/browse/LUCENE-8962 > Project: Lucene - Core > Issue Type: Improvement > Components: core/index >Reporter: Michael McCandless >Priority: Major > Fix For: 8.6 > > Attachments: LUCENE-8962_demo.png, failed-tests.patch > > Time Spent: 9.5h > Remaining Estimate: 0h > > With near-real-time search we ask {{IndexWriter}} to write all in-memory > segments to disk and open an {{IndexReader}} to search them, and this is > typically a quick operation. > However, when you use many threads for concurrent indexing, {{IndexWriter}} > will accumulate write many small segments during {{refresh}} and this then > adds search-time cost as searching must visit all of these tiny segments. > The merge policy would normally quickly coalesce these small segments if > given a little time ... so, could we somehow improve {{IndexWriter'}}s > refresh to optionally kick off merge policy to merge segments below some > threshold before opening the near-real-time reader? It'd be a bit tricky > because while we are waiting for merges, indexing may continue, and new > segments may be flushed, but those new segments shouldn't be included in the > point-in-time segments returned by refresh ... > One could almost do this on top of Lucene today, with a custom merge policy, > and some hackity logic to have the merge policy target small segments just > written by refresh, but it's tricky to then open a near-real-time reader, > excluding newly flushed but including newly merged segments since the refresh > originally finished ... > I'm not yet sure how best to solve this, so I wanted to open an issue for > discussion! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?
[ https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17066150#comment-17066150 ] David Smiley commented on LUCENE-8962: -- I see this in 8.5 CHANGES.txt yet this issue isn't "Closed". So in CHANGES.txt wrong or the Jira status? > Can we merge small segments during refresh, for faster searching? > - > > Key: LUCENE-8962 > URL: https://issues.apache.org/jira/browse/LUCENE-8962 > Project: Lucene - Core > Issue Type: Improvement > Components: core/index >Reporter: Michael McCandless >Priority: Major > Fix For: 8.6 > > Attachments: LUCENE-8962_demo.png, failed-tests.patch > > Time Spent: 9.5h > Remaining Estimate: 0h > > With near-real-time search we ask {{IndexWriter}} to write all in-memory > segments to disk and open an {{IndexReader}} to search them, and this is > typically a quick operation. > However, when you use many threads for concurrent indexing, {{IndexWriter}} > will accumulate write many small segments during {{refresh}} and this then > adds search-time cost as searching must visit all of these tiny segments. > The merge policy would normally quickly coalesce these small segments if > given a little time ... so, could we somehow improve {{IndexWriter'}}s > refresh to optionally kick off merge policy to merge segments below some > threshold before opening the near-real-time reader? It'd be a bit tricky > because while we are waiting for merges, indexing may continue, and new > segments may be flushed, but those new segments shouldn't be included in the > point-in-time segments returned by refresh ... > One could almost do this on top of Lucene today, with a custom merge policy, > and some hackity logic to have the merge policy target small segments just > written by refresh, but it's tricky to then open a near-real-time reader, > excluding newly flushed but including newly merged segments since the refresh > originally finished ... > I'm not yet sure how best to solve this, so I wanted to open an issue for > discussion! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?
[ https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17054545#comment-17054545 ] Michael McCandless commented on LUCENE-8962: Thanks @sokolovm! This is a tricky feature, but high impact. > Can we merge small segments during refresh, for faster searching? > - > > Key: LUCENE-8962 > URL: https://issues.apache.org/jira/browse/LUCENE-8962 > Project: Lucene - Core > Issue Type: Improvement > Components: core/index >Reporter: Michael McCandless >Priority: Major > Fix For: 8.5 > > Attachments: LUCENE-8962_demo.png, failed-tests.patch > > Time Spent: 9.5h > Remaining Estimate: 0h > > With near-real-time search we ask {{IndexWriter}} to write all in-memory > segments to disk and open an {{IndexReader}} to search them, and this is > typically a quick operation. > However, when you use many threads for concurrent indexing, {{IndexWriter}} > will accumulate write many small segments during {{refresh}} and this then > adds search-time cost as searching must visit all of these tiny segments. > The merge policy would normally quickly coalesce these small segments if > given a little time ... so, could we somehow improve {{IndexWriter'}}s > refresh to optionally kick off merge policy to merge segments below some > threshold before opening the near-real-time reader? It'd be a bit tricky > because while we are waiting for merges, indexing may continue, and new > segments may be flushed, but those new segments shouldn't be included in the > point-in-time segments returned by refresh ... > One could almost do this on top of Lucene today, with a custom merge policy, > and some hackity logic to have the merge policy target small segments just > written by refresh, but it's tricky to then open a near-real-time reader, > excluding newly flushed but including newly merged segments since the refresh > originally finished ... > I'm not yet sure how best to solve this, so I wanted to open an issue for > discussion! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?
[ https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17054541#comment-17054541 ] Michael Sokolov commented on LUCENE-8962: - > Wouldn’t it be easier to sort it all out in a branch rather on master? Fair enough - I reverted from master and pushed jira/lucene-8962 > Can we merge small segments during refresh, for faster searching? > - > > Key: LUCENE-8962 > URL: https://issues.apache.org/jira/browse/LUCENE-8962 > Project: Lucene - Core > Issue Type: Improvement > Components: core/index >Reporter: Michael McCandless >Priority: Major > Fix For: 8.5 > > Attachments: LUCENE-8962_demo.png, failed-tests.patch > > Time Spent: 9.5h > Remaining Estimate: 0h > > With near-real-time search we ask {{IndexWriter}} to write all in-memory > segments to disk and open an {{IndexReader}} to search them, and this is > typically a quick operation. > However, when you use many threads for concurrent indexing, {{IndexWriter}} > will accumulate write many small segments during {{refresh}} and this then > adds search-time cost as searching must visit all of these tiny segments. > The merge policy would normally quickly coalesce these small segments if > given a little time ... so, could we somehow improve {{IndexWriter'}}s > refresh to optionally kick off merge policy to merge segments below some > threshold before opening the near-real-time reader? It'd be a bit tricky > because while we are waiting for merges, indexing may continue, and new > segments may be flushed, but those new segments shouldn't be included in the > point-in-time segments returned by refresh ... > One could almost do this on top of Lucene today, with a custom merge policy, > and some hackity logic to have the merge policy target small segments just > written by refresh, but it's tricky to then open a near-real-time reader, > excluding newly flushed but including newly merged segments since the refresh > originally finished ... > I'm not yet sure how best to solve this, so I wanted to open an issue for > discussion! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?
[ https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17054537#comment-17054537 ] ASF subversion and git services commented on LUCENE-8962: - Commit 4501b3d3fdbc35af99bde6abe7432cfc5e8b5547 in lucene-solr's branch refs/heads/master from Michael Sokolov [ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=4501b3d ] Revert "LUCENE-8962: Split test case (#1313)" This reverts commit 90aced5a51f92ffd6e97449eb7c44aacc643c8a3. Revert "LUCENE-8962: woops, remove leftover accidental copyright (darned IDEs)" This reverts commit 3dbfd102794419551f2ba4b43344cf9e6242a2b8. Revert "LUCENE-8962: Fix intermittent test failures" This reverts commit a5475de57fed6b339cd5565bd1bd2650f265a537. Revert "LUCENE-8962: Add ability to selectively merge on commit (#1155)" This reverts commit a1791e77143aa8087c0b5ee0e8eb57422e59a09a. > Can we merge small segments during refresh, for faster searching? > - > > Key: LUCENE-8962 > URL: https://issues.apache.org/jira/browse/LUCENE-8962 > Project: Lucene - Core > Issue Type: Improvement > Components: core/index >Reporter: Michael McCandless >Priority: Major > Fix For: 8.5 > > Attachments: LUCENE-8962_demo.png, failed-tests.patch > > Time Spent: 9.5h > Remaining Estimate: 0h > > With near-real-time search we ask {{IndexWriter}} to write all in-memory > segments to disk and open an {{IndexReader}} to search them, and this is > typically a quick operation. > However, when you use many threads for concurrent indexing, {{IndexWriter}} > will accumulate write many small segments during {{refresh}} and this then > adds search-time cost as searching must visit all of these tiny segments. > The merge policy would normally quickly coalesce these small segments if > given a little time ... so, could we somehow improve {{IndexWriter'}}s > refresh to optionally kick off merge policy to merge segments below some > threshold before opening the near-real-time reader? It'd be a bit tricky > because while we are waiting for merges, indexing may continue, and new > segments may be flushed, but those new segments shouldn't be included in the > point-in-time segments returned by refresh ... > One could almost do this on top of Lucene today, with a custom merge policy, > and some hackity logic to have the merge policy target small segments just > written by refresh, but it's tricky to then open a near-real-time reader, > excluding newly flushed but including newly merged segments since the refresh > originally finished ... > I'm not yet sure how best to solve this, so I wanted to open an issue for > discussion! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?
[ https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17054539#comment-17054539 ] ASF subversion and git services commented on LUCENE-8962: - Commit 4501b3d3fdbc35af99bde6abe7432cfc5e8b5547 in lucene-solr's branch refs/heads/master from Michael Sokolov [ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=4501b3d ] Revert "LUCENE-8962: Split test case (#1313)" This reverts commit 90aced5a51f92ffd6e97449eb7c44aacc643c8a3. Revert "LUCENE-8962: woops, remove leftover accidental copyright (darned IDEs)" This reverts commit 3dbfd102794419551f2ba4b43344cf9e6242a2b8. Revert "LUCENE-8962: Fix intermittent test failures" This reverts commit a5475de57fed6b339cd5565bd1bd2650f265a537. Revert "LUCENE-8962: Add ability to selectively merge on commit (#1155)" This reverts commit a1791e77143aa8087c0b5ee0e8eb57422e59a09a. > Can we merge small segments during refresh, for faster searching? > - > > Key: LUCENE-8962 > URL: https://issues.apache.org/jira/browse/LUCENE-8962 > Project: Lucene - Core > Issue Type: Improvement > Components: core/index >Reporter: Michael McCandless >Priority: Major > Fix For: 8.5 > > Attachments: LUCENE-8962_demo.png, failed-tests.patch > > Time Spent: 9.5h > Remaining Estimate: 0h > > With near-real-time search we ask {{IndexWriter}} to write all in-memory > segments to disk and open an {{IndexReader}} to search them, and this is > typically a quick operation. > However, when you use many threads for concurrent indexing, {{IndexWriter}} > will accumulate write many small segments during {{refresh}} and this then > adds search-time cost as searching must visit all of these tiny segments. > The merge policy would normally quickly coalesce these small segments if > given a little time ... so, could we somehow improve {{IndexWriter'}}s > refresh to optionally kick off merge policy to merge segments below some > threshold before opening the near-real-time reader? It'd be a bit tricky > because while we are waiting for merges, indexing may continue, and new > segments may be flushed, but those new segments shouldn't be included in the > point-in-time segments returned by refresh ... > One could almost do this on top of Lucene today, with a custom merge policy, > and some hackity logic to have the merge policy target small segments just > written by refresh, but it's tricky to then open a near-real-time reader, > excluding newly flushed but including newly merged segments since the refresh > originally finished ... > I'm not yet sure how best to solve this, so I wanted to open an issue for > discussion! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?
[ https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17054540#comment-17054540 ] ASF subversion and git services commented on LUCENE-8962: - Commit 4501b3d3fdbc35af99bde6abe7432cfc5e8b5547 in lucene-solr's branch refs/heads/master from Michael Sokolov [ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=4501b3d ] Revert "LUCENE-8962: Split test case (#1313)" This reverts commit 90aced5a51f92ffd6e97449eb7c44aacc643c8a3. Revert "LUCENE-8962: woops, remove leftover accidental copyright (darned IDEs)" This reverts commit 3dbfd102794419551f2ba4b43344cf9e6242a2b8. Revert "LUCENE-8962: Fix intermittent test failures" This reverts commit a5475de57fed6b339cd5565bd1bd2650f265a537. Revert "LUCENE-8962: Add ability to selectively merge on commit (#1155)" This reverts commit a1791e77143aa8087c0b5ee0e8eb57422e59a09a. > Can we merge small segments during refresh, for faster searching? > - > > Key: LUCENE-8962 > URL: https://issues.apache.org/jira/browse/LUCENE-8962 > Project: Lucene - Core > Issue Type: Improvement > Components: core/index >Reporter: Michael McCandless >Priority: Major > Fix For: 8.5 > > Attachments: LUCENE-8962_demo.png, failed-tests.patch > > Time Spent: 9.5h > Remaining Estimate: 0h > > With near-real-time search we ask {{IndexWriter}} to write all in-memory > segments to disk and open an {{IndexReader}} to search them, and this is > typically a quick operation. > However, when you use many threads for concurrent indexing, {{IndexWriter}} > will accumulate write many small segments during {{refresh}} and this then > adds search-time cost as searching must visit all of these tiny segments. > The merge policy would normally quickly coalesce these small segments if > given a little time ... so, could we somehow improve {{IndexWriter'}}s > refresh to optionally kick off merge policy to merge segments below some > threshold before opening the near-real-time reader? It'd be a bit tricky > because while we are waiting for merges, indexing may continue, and new > segments may be flushed, but those new segments shouldn't be included in the > point-in-time segments returned by refresh ... > One could almost do this on top of Lucene today, with a custom merge policy, > and some hackity logic to have the merge policy target small segments just > written by refresh, but it's tricky to then open a near-real-time reader, > excluding newly flushed but including newly merged segments since the refresh > originally finished ... > I'm not yet sure how best to solve this, so I wanted to open an issue for > discussion! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?
[ https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17054538#comment-17054538 ] ASF subversion and git services commented on LUCENE-8962: - Commit 4501b3d3fdbc35af99bde6abe7432cfc5e8b5547 in lucene-solr's branch refs/heads/master from Michael Sokolov [ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=4501b3d ] Revert "LUCENE-8962: Split test case (#1313)" This reverts commit 90aced5a51f92ffd6e97449eb7c44aacc643c8a3. Revert "LUCENE-8962: woops, remove leftover accidental copyright (darned IDEs)" This reverts commit 3dbfd102794419551f2ba4b43344cf9e6242a2b8. Revert "LUCENE-8962: Fix intermittent test failures" This reverts commit a5475de57fed6b339cd5565bd1bd2650f265a537. Revert "LUCENE-8962: Add ability to selectively merge on commit (#1155)" This reverts commit a1791e77143aa8087c0b5ee0e8eb57422e59a09a. > Can we merge small segments during refresh, for faster searching? > - > > Key: LUCENE-8962 > URL: https://issues.apache.org/jira/browse/LUCENE-8962 > Project: Lucene - Core > Issue Type: Improvement > Components: core/index >Reporter: Michael McCandless >Priority: Major > Fix For: 8.5 > > Attachments: LUCENE-8962_demo.png, failed-tests.patch > > Time Spent: 9.5h > Remaining Estimate: 0h > > With near-real-time search we ask {{IndexWriter}} to write all in-memory > segments to disk and open an {{IndexReader}} to search them, and this is > typically a quick operation. > However, when you use many threads for concurrent indexing, {{IndexWriter}} > will accumulate write many small segments during {{refresh}} and this then > adds search-time cost as searching must visit all of these tiny segments. > The merge policy would normally quickly coalesce these small segments if > given a little time ... so, could we somehow improve {{IndexWriter'}}s > refresh to optionally kick off merge policy to merge segments below some > threshold before opening the near-real-time reader? It'd be a bit tricky > because while we are waiting for merges, indexing may continue, and new > segments may be flushed, but those new segments shouldn't be included in the > point-in-time segments returned by refresh ... > One could almost do this on top of Lucene today, with a custom merge policy, > and some hackity logic to have the merge policy target small segments just > written by refresh, but it's tricky to then open a near-real-time reader, > excluding newly flushed but including newly merged segments since the refresh > originally finished ... > I'm not yet sure how best to solve this, so I wanted to open an issue for > discussion! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?
[ https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17054506#comment-17054506 ] Simon Willnauer commented on LUCENE-8962: - {quote} I don't think we should do this – IndexWriter's purpose is making changes to the index, and IndexReader simply reads what IndexWriter created. There are wildly diverse users of Lucene and if we now set down the path of expecting/allowing IndexReader to do it's own "little" optimizations on startup, that can add a lot of unexpected cost, and surprising bugs, to many use cases. IndexWriter is indeed complex, but we should find ways to reduce that complexity so that we can implement features in the right classes, rather than shifting index-changing features out to IndexReader. {quote} This was really just an idea of how this could be done without adding much complexity to existing methods. I can see this being a method like this _IndexReader optimize(IndexReader reader, MergePolicy policy)_ than can even commit the changes. It can run in parallel to the normal indexing but is idempotent and much simpler to test. Again, just an idea to another approach. > Can we merge small segments during refresh, for faster searching? > - > > Key: LUCENE-8962 > URL: https://issues.apache.org/jira/browse/LUCENE-8962 > Project: Lucene - Core > Issue Type: Improvement > Components: core/index >Reporter: Michael McCandless >Priority: Major > Fix For: 8.5 > > Attachments: LUCENE-8962_demo.png, failed-tests.patch > > Time Spent: 9.5h > Remaining Estimate: 0h > > With near-real-time search we ask {{IndexWriter}} to write all in-memory > segments to disk and open an {{IndexReader}} to search them, and this is > typically a quick operation. > However, when you use many threads for concurrent indexing, {{IndexWriter}} > will accumulate write many small segments during {{refresh}} and this then > adds search-time cost as searching must visit all of these tiny segments. > The merge policy would normally quickly coalesce these small segments if > given a little time ... so, could we somehow improve {{IndexWriter'}}s > refresh to optionally kick off merge policy to merge segments below some > threshold before opening the near-real-time reader? It'd be a bit tricky > because while we are waiting for merges, indexing may continue, and new > segments may be flushed, but those new segments shouldn't be included in the > point-in-time segments returned by refresh ... > One could almost do this on top of Lucene today, with a custom merge policy, > and some hackity logic to have the merge policy target small segments just > written by refresh, but it's tricky to then open a near-real-time reader, > excluding newly flushed but including newly merged segments since the refresh > originally finished ... > I'm not yet sure how best to solve this, so I wanted to open an issue for > discussion! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?
[ https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17054489#comment-17054489 ] Nhat Nguyen commented on LUCENE-8962: - {quote}it's no good if ES tests are failing but Lucene's tests are not! Are these ES test runs / logs publicly visible somewhere?{quote} [~mikemccand] Thank you for responding. The failing Elasticsearch tests are tracked at [https://github.com/elastic/elasticsearch/issues/53195|https://github.com/elastic/elasticsearch/issues/53195]. I've opened [https://github.com/apache/lucene-solr/pull/1331|https://github.com/apache/lucene-solr/pull/1331] to port those tests to Lucene. > Can we merge small segments during refresh, for faster searching? > - > > Key: LUCENE-8962 > URL: https://issues.apache.org/jira/browse/LUCENE-8962 > Project: Lucene - Core > Issue Type: Improvement > Components: core/index >Reporter: Michael McCandless >Priority: Major > Fix For: 8.5 > > Attachments: LUCENE-8962_demo.png, failed-tests.patch > > Time Spent: 9.5h > Remaining Estimate: 0h > > With near-real-time search we ask {{IndexWriter}} to write all in-memory > segments to disk and open an {{IndexReader}} to search them, and this is > typically a quick operation. > However, when you use many threads for concurrent indexing, {{IndexWriter}} > will accumulate write many small segments during {{refresh}} and this then > adds search-time cost as searching must visit all of these tiny segments. > The merge policy would normally quickly coalesce these small segments if > given a little time ... so, could we somehow improve {{IndexWriter'}}s > refresh to optionally kick off merge policy to merge segments below some > threshold before opening the near-real-time reader? It'd be a bit tricky > because while we are waiting for merges, indexing may continue, and new > segments may be flushed, but those new segments shouldn't be included in the > point-in-time segments returned by refresh ... > One could almost do this on top of Lucene today, with a custom merge policy, > and some hackity logic to have the merge policy target small segments just > written by refresh, but it's tricky to then open a near-real-time reader, > excluding newly flushed but including newly merged segments since the refresh > originally finished ... > I'm not yet sure how best to solve this, so I wanted to open an issue for > discussion! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?
[ https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17054483#comment-17054483 ] Michael McCandless commented on LUCENE-8962: {quote}I have translated some Elasticsearch tests (see [^failed-tests.patch]) to Lucene. {quote} Thanks [~dnhatn] – it's no good if ES tests are failing but Lucene's tests are not! Are these ES test runs / logs publicly visible somewhere? {quote}Wouldn’t it be easier to sort it all out in a branch rather on master? {quote} +1, let's revert on master too, until we can work through these new randomized testing failures. This is certainly tricky code, and we should iterate outside master/8.x until it's working well with all tests. {quote}I'm thinking of adding a {{boolean}} field to {{OneMerge}} that gets set once a merge is successfully committed (e.g. just before the call to {{closeMergeReaders}} in {{IndexWriter.commitMerge()}}), which the {{mergeFinished}} override can use to determine if the merge completed successfully or not. {quote} +1 – clearly {{mergeFinished}} needs some way to know if the merge was successful or not. Maybe open a separate issue/PR to solve this first, as a precursor? {quote}With a slightly refactored IW we can share the merge logic and let the reader re-write itself since we are talking about very small segments the overhead is very small. {quote} I don't think we should do this – {{IndexWriter}}'s purpose is making changes to the index, and {{IndexReader}} simply reads what {{IndexWriter}} created. There are wildly diverse users of Lucene and if we now set down the path of expecting/allowing {{IndexReader}} to do it's own "little" optimizations on startup, that can add a lot of unexpected cost, and surprising bugs, to many use cases. {{IndexWriter}} is indeed complex, but we should find ways to reduce that complexity so that we can implement features in the right classes, rather than shifting index-changing features out to {{IndexReader}}. In our (Amazon customer facing product search) usage of Lucene, this optimization is impactful because a single index is replicated across deep replica count, and shifting work that could be done once by {{IndexWriter}} to every {{IndexReader}} that later opens that commit point would add quite a bit of cost to those readers. That said, I'm definitely +1 to find a simpler way to achieve this powerful feature, just within {{IndexWriter}}. If you look at the discussion in the original PR, we considered making it just another trigger under the existing {{MergePolicy}} methods, but there were downsides (discussed on the PR) to that. Also, this feature/change is only targeting {{commit}} which is already a simplification (NOT trying to target NRT readers). But maybe [~simonwillnauer] you see a simpler way to implement this during {{IndexWriter.commit}}? > Can we merge small segments during refresh, for faster searching? > - > > Key: LUCENE-8962 > URL: https://issues.apache.org/jira/browse/LUCENE-8962 > Project: Lucene - Core > Issue Type: Improvement > Components: core/index >Reporter: Michael McCandless >Priority: Major > Fix For: 8.5 > > Attachments: LUCENE-8962_demo.png, failed-tests.patch > > Time Spent: 9.5h > Remaining Estimate: 0h > > With near-real-time search we ask {{IndexWriter}} to write all in-memory > segments to disk and open an {{IndexReader}} to search them, and this is > typically a quick operation. > However, when you use many threads for concurrent indexing, {{IndexWriter}} > will accumulate write many small segments during {{refresh}} and this then > adds search-time cost as searching must visit all of these tiny segments. > The merge policy would normally quickly coalesce these small segments if > given a little time ... so, could we somehow improve {{IndexWriter'}}s > refresh to optionally kick off merge policy to merge segments below some > threshold before opening the near-real-time reader? It'd be a bit tricky > because while we are waiting for merges, indexing may continue, and new > segments may be flushed, but those new segments shouldn't be included in the > point-in-time segments returned by refresh ... > One could almost do this on top of Lucene today, with a custom merge policy, > and some hackity logic to have the merge policy target small segments just > written by refresh, but it's tricky to then open a near-real-time reader, > excluding newly flushed but including newly merged segments since the refresh > originally finished ... > I'm not yet sure how best to solve this, so I wanted to open an issue for > discussion! -- This message was sent by Atlassian Jira (v8.3.4#803005) --
[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?
[ https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17054163#comment-17054163 ] Nhat Nguyen commented on LUCENE-8962: - I have translated some Elasticsearch tests (see [^failed-tests.patch]) to Lucene. The test _testMergeOnCommitKeepFullyDeletedSegments_ consistently fails because _mergeFinished_ was called twice. This happens because we forget to set _success_ to _true_ when the new segment does not have any document (i.e., _SegmentMerger#shouldMerge_ is false). Even with that fix, both _testRandomOperations_ and _testRandomOperationsWithSoftDeletes_ still randomly fail after a few thousand iterations. Since both are multi-threaded tests, we will need more time to figure out the root cause. > Can we merge small segments during refresh, for faster searching? > - > > Key: LUCENE-8962 > URL: https://issues.apache.org/jira/browse/LUCENE-8962 > Project: Lucene - Core > Issue Type: Improvement > Components: core/index >Reporter: Michael McCandless >Priority: Major > Fix For: 8.5 > > Attachments: LUCENE-8962_demo.png, failed-tests.patch > > Time Spent: 9.5h > Remaining Estimate: 0h > > With near-real-time search we ask {{IndexWriter}} to write all in-memory > segments to disk and open an {{IndexReader}} to search them, and this is > typically a quick operation. > However, when you use many threads for concurrent indexing, {{IndexWriter}} > will accumulate write many small segments during {{refresh}} and this then > adds search-time cost as searching must visit all of these tiny segments. > The merge policy would normally quickly coalesce these small segments if > given a little time ... so, could we somehow improve {{IndexWriter'}}s > refresh to optionally kick off merge policy to merge segments below some > threshold before opening the near-real-time reader? It'd be a bit tricky > because while we are waiting for merges, indexing may continue, and new > segments may be flushed, but those new segments shouldn't be included in the > point-in-time segments returned by refresh ... > One could almost do this on top of Lucene today, with a custom merge policy, > and some hackity logic to have the merge policy target small segments just > written by refresh, but it's tricky to then open a near-real-time reader, > excluding newly flushed but including newly merged segments since the refresh > originally finished ... > I'm not yet sure how best to solve this, so I wanted to open an issue for > discussion! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?
[ https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17054111#comment-17054111 ] Simon Willnauer commented on LUCENE-8962: - Wouldn’t it be easier to sort it all out in a branch rather on master? > Can we merge small segments during refresh, for faster searching? > - > > Key: LUCENE-8962 > URL: https://issues.apache.org/jira/browse/LUCENE-8962 > Project: Lucene - Core > Issue Type: Improvement > Components: core/index >Reporter: Michael McCandless >Priority: Major > Fix For: 8.5 > > Attachments: LUCENE-8962_demo.png > > Time Spent: 9.5h > Remaining Estimate: 0h > > With near-real-time search we ask {{IndexWriter}} to write all in-memory > segments to disk and open an {{IndexReader}} to search them, and this is > typically a quick operation. > However, when you use many threads for concurrent indexing, {{IndexWriter}} > will accumulate write many small segments during {{refresh}} and this then > adds search-time cost as searching must visit all of these tiny segments. > The merge policy would normally quickly coalesce these small segments if > given a little time ... so, could we somehow improve {{IndexWriter'}}s > refresh to optionally kick off merge policy to merge segments below some > threshold before opening the near-real-time reader? It'd be a bit tricky > because while we are waiting for merges, indexing may continue, and new > segments may be flushed, but those new segments shouldn't be included in the > point-in-time segments returned by refresh ... > One could almost do this on top of Lucene today, with a custom merge policy, > and some hackity logic to have the merge policy target small segments just > written by refresh, but it's tricky to then open a near-real-time reader, > excluding newly flushed but including newly merged segments since the refresh > originally finished ... > I'm not yet sure how best to solve this, so I wanted to open an issue for > discussion! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?
[ https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17054078#comment-17054078 ] Michael Sokolov commented on LUCENE-8962: - I'm leaving on master for now until we can sort out how to move forward > Can we merge small segments during refresh, for faster searching? > - > > Key: LUCENE-8962 > URL: https://issues.apache.org/jira/browse/LUCENE-8962 > Project: Lucene - Core > Issue Type: Improvement > Components: core/index >Reporter: Michael McCandless >Priority: Major > Fix For: 8.5 > > Attachments: LUCENE-8962_demo.png > > Time Spent: 9.5h > Remaining Estimate: 0h > > With near-real-time search we ask {{IndexWriter}} to write all in-memory > segments to disk and open an {{IndexReader}} to search them, and this is > typically a quick operation. > However, when you use many threads for concurrent indexing, {{IndexWriter}} > will accumulate write many small segments during {{refresh}} and this then > adds search-time cost as searching must visit all of these tiny segments. > The merge policy would normally quickly coalesce these small segments if > given a little time ... so, could we somehow improve {{IndexWriter'}}s > refresh to optionally kick off merge policy to merge segments below some > threshold before opening the near-real-time reader? It'd be a bit tricky > because while we are waiting for merges, indexing may continue, and new > segments may be flushed, but those new segments shouldn't be included in the > point-in-time segments returned by refresh ... > One could almost do this on top of Lucene today, with a custom merge policy, > and some hackity logic to have the merge policy target small segments just > written by refresh, but it's tricky to then open a near-real-time reader, > excluding newly flushed but including newly merged segments since the refresh > originally finished ... > I'm not yet sure how best to solve this, so I wanted to open an issue for > discussion! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?
[ https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17054076#comment-17054076 ] ASF subversion and git services commented on LUCENE-8962: - Commit 89083d43ae6c884e498e40f92053c689925abd66 in lucene-solr's branch refs/heads/branch_8x from Michael Sokolov [ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=89083d4 ] Revert "LUCENE-8962: Split test case (#1313)" This reverts commit 90aced5a51f92ffd6e97449eb7c44aacc643c8a3. Revert "LUCENE-8962: woops, remove leftover accidental copyright (darned IDEs)" This reverts commit 3dbfd102794419551f2ba4b43344cf9e6242a2b8. Revert "LUCENE-8962: Fix intermittent test failures" This reverts commit a5475de57fed6b339cd5565bd1bd2650f265a537. Revert "LUCENE-8962: Add ability to selectively merge on commit (#1155)" This reverts commit a1791e77143aa8087c0b5ee0e8eb57422e59a09a. > Can we merge small segments during refresh, for faster searching? > - > > Key: LUCENE-8962 > URL: https://issues.apache.org/jira/browse/LUCENE-8962 > Project: Lucene - Core > Issue Type: Improvement > Components: core/index >Reporter: Michael McCandless >Priority: Major > Fix For: 8.5 > > Attachments: LUCENE-8962_demo.png > > Time Spent: 9.5h > Remaining Estimate: 0h > > With near-real-time search we ask {{IndexWriter}} to write all in-memory > segments to disk and open an {{IndexReader}} to search them, and this is > typically a quick operation. > However, when you use many threads for concurrent indexing, {{IndexWriter}} > will accumulate write many small segments during {{refresh}} and this then > adds search-time cost as searching must visit all of these tiny segments. > The merge policy would normally quickly coalesce these small segments if > given a little time ... so, could we somehow improve {{IndexWriter'}}s > refresh to optionally kick off merge policy to merge segments below some > threshold before opening the near-real-time reader? It'd be a bit tricky > because while we are waiting for merges, indexing may continue, and new > segments may be flushed, but those new segments shouldn't be included in the > point-in-time segments returned by refresh ... > One could almost do this on top of Lucene today, with a custom merge policy, > and some hackity logic to have the merge policy target small segments just > written by refresh, but it's tricky to then open a near-real-time reader, > excluding newly flushed but including newly merged segments since the refresh > originally finished ... > I'm not yet sure how best to solve this, so I wanted to open an issue for > discussion! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?
[ https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17054075#comment-17054075 ] ASF subversion and git services commented on LUCENE-8962: - Commit 89083d43ae6c884e498e40f92053c689925abd66 in lucene-solr's branch refs/heads/branch_8x from Michael Sokolov [ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=89083d4 ] Revert "LUCENE-8962: Split test case (#1313)" This reverts commit 90aced5a51f92ffd6e97449eb7c44aacc643c8a3. Revert "LUCENE-8962: woops, remove leftover accidental copyright (darned IDEs)" This reverts commit 3dbfd102794419551f2ba4b43344cf9e6242a2b8. Revert "LUCENE-8962: Fix intermittent test failures" This reverts commit a5475de57fed6b339cd5565bd1bd2650f265a537. Revert "LUCENE-8962: Add ability to selectively merge on commit (#1155)" This reverts commit a1791e77143aa8087c0b5ee0e8eb57422e59a09a. > Can we merge small segments during refresh, for faster searching? > - > > Key: LUCENE-8962 > URL: https://issues.apache.org/jira/browse/LUCENE-8962 > Project: Lucene - Core > Issue Type: Improvement > Components: core/index >Reporter: Michael McCandless >Priority: Major > Fix For: 8.5 > > Attachments: LUCENE-8962_demo.png > > Time Spent: 9.5h > Remaining Estimate: 0h > > With near-real-time search we ask {{IndexWriter}} to write all in-memory > segments to disk and open an {{IndexReader}} to search them, and this is > typically a quick operation. > However, when you use many threads for concurrent indexing, {{IndexWriter}} > will accumulate write many small segments during {{refresh}} and this then > adds search-time cost as searching must visit all of these tiny segments. > The merge policy would normally quickly coalesce these small segments if > given a little time ... so, could we somehow improve {{IndexWriter'}}s > refresh to optionally kick off merge policy to merge segments below some > threshold before opening the near-real-time reader? It'd be a bit tricky > because while we are waiting for merges, indexing may continue, and new > segments may be flushed, but those new segments shouldn't be included in the > point-in-time segments returned by refresh ... > One could almost do this on top of Lucene today, with a custom merge policy, > and some hackity logic to have the merge policy target small segments just > written by refresh, but it's tricky to then open a near-real-time reader, > excluding newly flushed but including newly merged segments since the refresh > originally finished ... > I'm not yet sure how best to solve this, so I wanted to open an issue for > discussion! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?
[ https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17054077#comment-17054077 ] ASF subversion and git services commented on LUCENE-8962: - Commit 89083d43ae6c884e498e40f92053c689925abd66 in lucene-solr's branch refs/heads/branch_8x from Michael Sokolov [ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=89083d4 ] Revert "LUCENE-8962: Split test case (#1313)" This reverts commit 90aced5a51f92ffd6e97449eb7c44aacc643c8a3. Revert "LUCENE-8962: woops, remove leftover accidental copyright (darned IDEs)" This reverts commit 3dbfd102794419551f2ba4b43344cf9e6242a2b8. Revert "LUCENE-8962: Fix intermittent test failures" This reverts commit a5475de57fed6b339cd5565bd1bd2650f265a537. Revert "LUCENE-8962: Add ability to selectively merge on commit (#1155)" This reverts commit a1791e77143aa8087c0b5ee0e8eb57422e59a09a. > Can we merge small segments during refresh, for faster searching? > - > > Key: LUCENE-8962 > URL: https://issues.apache.org/jira/browse/LUCENE-8962 > Project: Lucene - Core > Issue Type: Improvement > Components: core/index >Reporter: Michael McCandless >Priority: Major > Fix For: 8.5 > > Attachments: LUCENE-8962_demo.png > > Time Spent: 9.5h > Remaining Estimate: 0h > > With near-real-time search we ask {{IndexWriter}} to write all in-memory > segments to disk and open an {{IndexReader}} to search them, and this is > typically a quick operation. > However, when you use many threads for concurrent indexing, {{IndexWriter}} > will accumulate write many small segments during {{refresh}} and this then > adds search-time cost as searching must visit all of these tiny segments. > The merge policy would normally quickly coalesce these small segments if > given a little time ... so, could we somehow improve {{IndexWriter'}}s > refresh to optionally kick off merge policy to merge segments below some > threshold before opening the near-real-time reader? It'd be a bit tricky > because while we are waiting for merges, indexing may continue, and new > segments may be flushed, but those new segments shouldn't be included in the > point-in-time segments returned by refresh ... > One could almost do this on top of Lucene today, with a custom merge policy, > and some hackity logic to have the merge policy target small segments just > written by refresh, but it's tricky to then open a near-real-time reader, > excluding newly flushed but including newly merged segments since the refresh > originally finished ... > I'm not yet sure how best to solve this, so I wanted to open an issue for > discussion! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?
[ https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17054074#comment-17054074 ] ASF subversion and git services commented on LUCENE-8962: - Commit 89083d43ae6c884e498e40f92053c689925abd66 in lucene-solr's branch refs/heads/branch_8x from Michael Sokolov [ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=89083d4 ] Revert "LUCENE-8962: Split test case (#1313)" This reverts commit 90aced5a51f92ffd6e97449eb7c44aacc643c8a3. Revert "LUCENE-8962: woops, remove leftover accidental copyright (darned IDEs)" This reverts commit 3dbfd102794419551f2ba4b43344cf9e6242a2b8. Revert "LUCENE-8962: Fix intermittent test failures" This reverts commit a5475de57fed6b339cd5565bd1bd2650f265a537. Revert "LUCENE-8962: Add ability to selectively merge on commit (#1155)" This reverts commit a1791e77143aa8087c0b5ee0e8eb57422e59a09a. > Can we merge small segments during refresh, for faster searching? > - > > Key: LUCENE-8962 > URL: https://issues.apache.org/jira/browse/LUCENE-8962 > Project: Lucene - Core > Issue Type: Improvement > Components: core/index >Reporter: Michael McCandless >Priority: Major > Fix For: 8.5 > > Attachments: LUCENE-8962_demo.png > > Time Spent: 9.5h > Remaining Estimate: 0h > > With near-real-time search we ask {{IndexWriter}} to write all in-memory > segments to disk and open an {{IndexReader}} to search them, and this is > typically a quick operation. > However, when you use many threads for concurrent indexing, {{IndexWriter}} > will accumulate write many small segments during {{refresh}} and this then > adds search-time cost as searching must visit all of these tiny segments. > The merge policy would normally quickly coalesce these small segments if > given a little time ... so, could we somehow improve {{IndexWriter'}}s > refresh to optionally kick off merge policy to merge segments below some > threshold before opening the near-real-time reader? It'd be a bit tricky > because while we are waiting for merges, indexing may continue, and new > segments may be flushed, but those new segments shouldn't be included in the > point-in-time segments returned by refresh ... > One could almost do this on top of Lucene today, with a custom merge policy, > and some hackity logic to have the merge policy target small segments just > written by refresh, but it's tricky to then open a near-real-time reader, > excluding newly flushed but including newly merged segments since the refresh > originally finished ... > I'm not yet sure how best to solve this, so I wanted to open an issue for > discussion! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?
[ https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17054068#comment-17054068 ] ASF subversion and git services commented on LUCENE-8962: - Commit 27b4fee074437a892c511898e9e86d62c127f1d7 in lucene-solr's branch refs/heads/branch_8_5 from Michael Sokolov [ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=27b4fee ] Revert "LUCENE-8962: Fix intermittent test failures" This reverts commit a5475de57fed6b339cd5565bd1bd2650f265a537. Revert "Add CHANGES entry for LUCENE-8962" This reverts commit fdac6d866344611290c45c164112277581328bc9. Revert "LUCENE-8962: Add ability to selectively merge on commit (#1155)" This reverts commit a1791e77143aa8087c0b5ee0e8eb57422e59a09a. > Can we merge small segments during refresh, for faster searching? > - > > Key: LUCENE-8962 > URL: https://issues.apache.org/jira/browse/LUCENE-8962 > Project: Lucene - Core > Issue Type: Improvement > Components: core/index >Reporter: Michael McCandless >Priority: Major > Fix For: 8.5 > > Attachments: LUCENE-8962_demo.png > > Time Spent: 9.5h > Remaining Estimate: 0h > > With near-real-time search we ask {{IndexWriter}} to write all in-memory > segments to disk and open an {{IndexReader}} to search them, and this is > typically a quick operation. > However, when you use many threads for concurrent indexing, {{IndexWriter}} > will accumulate write many small segments during {{refresh}} and this then > adds search-time cost as searching must visit all of these tiny segments. > The merge policy would normally quickly coalesce these small segments if > given a little time ... so, could we somehow improve {{IndexWriter'}}s > refresh to optionally kick off merge policy to merge segments below some > threshold before opening the near-real-time reader? It'd be a bit tricky > because while we are waiting for merges, indexing may continue, and new > segments may be flushed, but those new segments shouldn't be included in the > point-in-time segments returned by refresh ... > One could almost do this on top of Lucene today, with a custom merge policy, > and some hackity logic to have the merge policy target small segments just > written by refresh, but it's tricky to then open a near-real-time reader, > excluding newly flushed but including newly merged segments since the refresh > originally finished ... > I'm not yet sure how best to solve this, so I wanted to open an issue for > discussion! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?
[ https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17054066#comment-17054066 ] ASF subversion and git services commented on LUCENE-8962: - Commit 27b4fee074437a892c511898e9e86d62c127f1d7 in lucene-solr's branch refs/heads/branch_8_5 from Michael Sokolov [ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=27b4fee ] Revert "LUCENE-8962: Fix intermittent test failures" This reverts commit a5475de57fed6b339cd5565bd1bd2650f265a537. Revert "Add CHANGES entry for LUCENE-8962" This reverts commit fdac6d866344611290c45c164112277581328bc9. Revert "LUCENE-8962: Add ability to selectively merge on commit (#1155)" This reverts commit a1791e77143aa8087c0b5ee0e8eb57422e59a09a. > Can we merge small segments during refresh, for faster searching? > - > > Key: LUCENE-8962 > URL: https://issues.apache.org/jira/browse/LUCENE-8962 > Project: Lucene - Core > Issue Type: Improvement > Components: core/index >Reporter: Michael McCandless >Priority: Major > Fix For: 8.5 > > Attachments: LUCENE-8962_demo.png > > Time Spent: 9.5h > Remaining Estimate: 0h > > With near-real-time search we ask {{IndexWriter}} to write all in-memory > segments to disk and open an {{IndexReader}} to search them, and this is > typically a quick operation. > However, when you use many threads for concurrent indexing, {{IndexWriter}} > will accumulate write many small segments during {{refresh}} and this then > adds search-time cost as searching must visit all of these tiny segments. > The merge policy would normally quickly coalesce these small segments if > given a little time ... so, could we somehow improve {{IndexWriter'}}s > refresh to optionally kick off merge policy to merge segments below some > threshold before opening the near-real-time reader? It'd be a bit tricky > because while we are waiting for merges, indexing may continue, and new > segments may be flushed, but those new segments shouldn't be included in the > point-in-time segments returned by refresh ... > One could almost do this on top of Lucene today, with a custom merge policy, > and some hackity logic to have the merge policy target small segments just > written by refresh, but it's tricky to then open a near-real-time reader, > excluding newly flushed but including newly merged segments since the refresh > originally finished ... > I'm not yet sure how best to solve this, so I wanted to open an issue for > discussion! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?
[ https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17054067#comment-17054067 ] ASF subversion and git services commented on LUCENE-8962: - Commit 27b4fee074437a892c511898e9e86d62c127f1d7 in lucene-solr's branch refs/heads/branch_8_5 from Michael Sokolov [ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=27b4fee ] Revert "LUCENE-8962: Fix intermittent test failures" This reverts commit a5475de57fed6b339cd5565bd1bd2650f265a537. Revert "Add CHANGES entry for LUCENE-8962" This reverts commit fdac6d866344611290c45c164112277581328bc9. Revert "LUCENE-8962: Add ability to selectively merge on commit (#1155)" This reverts commit a1791e77143aa8087c0b5ee0e8eb57422e59a09a. > Can we merge small segments during refresh, for faster searching? > - > > Key: LUCENE-8962 > URL: https://issues.apache.org/jira/browse/LUCENE-8962 > Project: Lucene - Core > Issue Type: Improvement > Components: core/index >Reporter: Michael McCandless >Priority: Major > Fix For: 8.5 > > Attachments: LUCENE-8962_demo.png > > Time Spent: 9.5h > Remaining Estimate: 0h > > With near-real-time search we ask {{IndexWriter}} to write all in-memory > segments to disk and open an {{IndexReader}} to search them, and this is > typically a quick operation. > However, when you use many threads for concurrent indexing, {{IndexWriter}} > will accumulate write many small segments during {{refresh}} and this then > adds search-time cost as searching must visit all of these tiny segments. > The merge policy would normally quickly coalesce these small segments if > given a little time ... so, could we somehow improve {{IndexWriter'}}s > refresh to optionally kick off merge policy to merge segments below some > threshold before opening the near-real-time reader? It'd be a bit tricky > because while we are waiting for merges, indexing may continue, and new > segments may be flushed, but those new segments shouldn't be included in the > point-in-time segments returned by refresh ... > One could almost do this on top of Lucene today, with a custom merge policy, > and some hackity logic to have the merge policy target small segments just > written by refresh, but it's tricky to then open a near-real-time reader, > excluding newly flushed but including newly merged segments since the refresh > originally finished ... > I'm not yet sure how best to solve this, so I wanted to open an issue for > discussion! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?
[ https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17053640#comment-17053640 ] Michael Froh commented on LUCENE-8962: -- bq. With a slightly refactored IW we can share the merge logic and let the reader re-write itself since we are talking about very small segments the overhead is very small. This would in turn mean that we are doing the work twice ie. the IW would do its normal work and might merge later etc. Just to provide a bit more context, for the case where my team uses this change, we're replicating the index (think Solr master/slave) from "writers" to many "searchers", so we're avoiding doing the work many times. An earlier (less invasive) approach I tried to address the small flushed segments problem was roughly: call commit on writer, hard link the commit files to another filesystem directory to "clone" the index, open an IW on that directory, merge small segments on the clone, let searchers replicate from the clone. That approach does mean that the merging work happens twice (since the "real" index doesn't benefit from the merge on the clone), but it doesn't involve any changes in Lucene. Maybe that less-invasive approach is a better way to address this. It's certainly more consistent with [~simonw]'s suggestion above. > Can we merge small segments during refresh, for faster searching? > - > > Key: LUCENE-8962 > URL: https://issues.apache.org/jira/browse/LUCENE-8962 > Project: Lucene - Core > Issue Type: Improvement > Components: core/index >Reporter: Michael McCandless >Priority: Major > Fix For: 8.5 > > Attachments: LUCENE-8962_demo.png > > Time Spent: 9.5h > Remaining Estimate: 0h > > With near-real-time search we ask {{IndexWriter}} to write all in-memory > segments to disk and open an {{IndexReader}} to search them, and this is > typically a quick operation. > However, when you use many threads for concurrent indexing, {{IndexWriter}} > will accumulate write many small segments during {{refresh}} and this then > adds search-time cost as searching must visit all of these tiny segments. > The merge policy would normally quickly coalesce these small segments if > given a little time ... so, could we somehow improve {{IndexWriter'}}s > refresh to optionally kick off merge policy to merge segments below some > threshold before opening the near-real-time reader? It'd be a bit tricky > because while we are waiting for merges, indexing may continue, and new > segments may be flushed, but those new segments shouldn't be included in the > point-in-time segments returned by refresh ... > One could almost do this on top of Lucene today, with a custom merge policy, > and some hackity logic to have the merge policy target small segments just > written by refresh, but it's tricky to then open a near-real-time reader, > excluding newly flushed but including newly merged segments since the refresh > originally finished ... > I'm not yet sure how best to solve this, so I wanted to open an issue for > discussion! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?
[ https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17053620#comment-17053620 ] Michael Sokolov commented on LUCENE-8962: - Based on [~simonw]'s recent comments in github, plus difficulty getting tests to pass consistently (apparently there are more failing tests in Elasticland), we should probably revert for now, at least from 8.x and 8.5 branches. I am tied up for the moment, but will be able to do the revert this weekend. > Can we merge small segments during refresh, for faster searching? > - > > Key: LUCENE-8962 > URL: https://issues.apache.org/jira/browse/LUCENE-8962 > Project: Lucene - Core > Issue Type: Improvement > Components: core/index >Reporter: Michael McCandless >Priority: Major > Fix For: 8.5 > > Attachments: LUCENE-8962_demo.png > > Time Spent: 9.5h > Remaining Estimate: 0h > > With near-real-time search we ask {{IndexWriter}} to write all in-memory > segments to disk and open an {{IndexReader}} to search them, and this is > typically a quick operation. > However, when you use many threads for concurrent indexing, {{IndexWriter}} > will accumulate write many small segments during {{refresh}} and this then > adds search-time cost as searching must visit all of these tiny segments. > The merge policy would normally quickly coalesce these small segments if > given a little time ... so, could we somehow improve {{IndexWriter'}}s > refresh to optionally kick off merge policy to merge segments below some > threshold before opening the near-real-time reader? It'd be a bit tricky > because while we are waiting for merges, indexing may continue, and new > segments may be flushed, but those new segments shouldn't be included in the > point-in-time segments returned by refresh ... > One could almost do this on top of Lucene today, with a custom merge policy, > and some hackity logic to have the merge policy target small segments just > written by refresh, but it's tricky to then open a near-real-time reader, > excluding newly flushed but including newly merged segments since the refresh > originally finished ... > I'm not yet sure how best to solve this, so I wanted to open an issue for > discussion! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?
[ https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17053545#comment-17053545 ] Nhat Nguyen commented on LUCENE-8962: - Some engine tests in Elasticsearch are failing because of this change. I am working to backport them to Lucene so that we can catch similar issues in Lucene. > Can we merge small segments during refresh, for faster searching? > - > > Key: LUCENE-8962 > URL: https://issues.apache.org/jira/browse/LUCENE-8962 > Project: Lucene - Core > Issue Type: Improvement > Components: core/index >Reporter: Michael McCandless >Priority: Major > Fix For: 8.5 > > Attachments: LUCENE-8962_demo.png > > Time Spent: 9.5h > Remaining Estimate: 0h > > With near-real-time search we ask {{IndexWriter}} to write all in-memory > segments to disk and open an {{IndexReader}} to search them, and this is > typically a quick operation. > However, when you use many threads for concurrent indexing, {{IndexWriter}} > will accumulate write many small segments during {{refresh}} and this then > adds search-time cost as searching must visit all of these tiny segments. > The merge policy would normally quickly coalesce these small segments if > given a little time ... so, could we somehow improve {{IndexWriter'}}s > refresh to optionally kick off merge policy to merge segments below some > threshold before opening the near-real-time reader? It'd be a bit tricky > because while we are waiting for merges, indexing may continue, and new > segments may be flushed, but those new segments shouldn't be included in the > point-in-time segments returned by refresh ... > One could almost do this on top of Lucene today, with a custom merge policy, > and some hackity logic to have the merge policy target small segments just > written by refresh, but it's tricky to then open a near-real-time reader, > excluding newly flushed but including newly merged segments since the refresh > originally finished ... > I'm not yet sure how best to solve this, so I wanted to open an issue for > discussion! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?
[ https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17053427#comment-17053427 ] David Smiley commented on LUCENE-8962: -- Thanks so much for your input Simon! We need to fight the complexity here. > Can we merge small segments during refresh, for faster searching? > - > > Key: LUCENE-8962 > URL: https://issues.apache.org/jira/browse/LUCENE-8962 > Project: Lucene - Core > Issue Type: Improvement > Components: core/index >Reporter: Michael McCandless >Priority: Major > Fix For: 8.5 > > Attachments: LUCENE-8962_demo.png > > Time Spent: 9.5h > Remaining Estimate: 0h > > With near-real-time search we ask {{IndexWriter}} to write all in-memory > segments to disk and open an {{IndexReader}} to search them, and this is > typically a quick operation. > However, when you use many threads for concurrent indexing, {{IndexWriter}} > will accumulate write many small segments during {{refresh}} and this then > adds search-time cost as searching must visit all of these tiny segments. > The merge policy would normally quickly coalesce these small segments if > given a little time ... so, could we somehow improve {{IndexWriter'}}s > refresh to optionally kick off merge policy to merge segments below some > threshold before opening the near-real-time reader? It'd be a bit tricky > because while we are waiting for merges, indexing may continue, and new > segments may be flushed, but those new segments shouldn't be included in the > point-in-time segments returned by refresh ... > One could almost do this on top of Lucene today, with a custom merge policy, > and some hackity logic to have the merge policy target small segments just > written by refresh, but it's tricky to then open a near-real-time reader, > excluding newly flushed but including newly merged segments since the refresh > originally finished ... > I'm not yet sure how best to solve this, so I wanted to open an issue for > discussion! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?
[ https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17053381#comment-17053381 ] ASF subversion and git services commented on LUCENE-8962: - Commit 90aced5a51f92ffd6e97449eb7c44aacc643c8a3 in lucene-solr's branch refs/heads/branch_8x from Michael Sokolov [ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=90aced5 ] LUCENE-8962: Split test case (#1313) * LUCENE-8962: Simplify test case The testMergeOnCommit test case was trying to verify too many things at once: basic semantics of merge on commit and proper behavior when a bunch of indexing threads are writing and committing all at once. Now we just verify basic behavior, with strict assertions on invariants, while leaving it to MockRandomMergePolicy to enable merge on commit in existing test cases to verify that indexing generally works as expected and no new unexpected exceptions are thrown. * LUCENE-8962: Only update toCommit if merge was committed The code was previously assuming that if mergeFinished() was called and isAborted() was false, then the merge must have completed successfully. Instead, we should know for sure if a given merge was committed, and only then update our pending commit SegmentInfos. > Can we merge small segments during refresh, for faster searching? > - > > Key: LUCENE-8962 > URL: https://issues.apache.org/jira/browse/LUCENE-8962 > Project: Lucene - Core > Issue Type: Improvement > Components: core/index >Reporter: Michael McCandless >Priority: Major > Fix For: 8.5 > > Attachments: LUCENE-8962_demo.png > > Time Spent: 9h 20m > Remaining Estimate: 0h > > With near-real-time search we ask {{IndexWriter}} to write all in-memory > segments to disk and open an {{IndexReader}} to search them, and this is > typically a quick operation. > However, when you use many threads for concurrent indexing, {{IndexWriter}} > will accumulate write many small segments during {{refresh}} and this then > adds search-time cost as searching must visit all of these tiny segments. > The merge policy would normally quickly coalesce these small segments if > given a little time ... so, could we somehow improve {{IndexWriter'}}s > refresh to optionally kick off merge policy to merge segments below some > threshold before opening the near-real-time reader? It'd be a bit tricky > because while we are waiting for merges, indexing may continue, and new > segments may be flushed, but those new segments shouldn't be included in the > point-in-time segments returned by refresh ... > One could almost do this on top of Lucene today, with a custom merge policy, > and some hackity logic to have the merge policy target small segments just > written by refresh, but it's tricky to then open a near-real-time reader, > excluding newly flushed but including newly merged segments since the refresh > originally finished ... > I'm not yet sure how best to solve this, so I wanted to open an issue for > discussion! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?
[ https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17053380#comment-17053380 ] ASF subversion and git services commented on LUCENE-8962: - Commit 90aced5a51f92ffd6e97449eb7c44aacc643c8a3 in lucene-solr's branch refs/heads/branch_8x from Michael Sokolov [ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=90aced5 ] LUCENE-8962: Split test case (#1313) * LUCENE-8962: Simplify test case The testMergeOnCommit test case was trying to verify too many things at once: basic semantics of merge on commit and proper behavior when a bunch of indexing threads are writing and committing all at once. Now we just verify basic behavior, with strict assertions on invariants, while leaving it to MockRandomMergePolicy to enable merge on commit in existing test cases to verify that indexing generally works as expected and no new unexpected exceptions are thrown. * LUCENE-8962: Only update toCommit if merge was committed The code was previously assuming that if mergeFinished() was called and isAborted() was false, then the merge must have completed successfully. Instead, we should know for sure if a given merge was committed, and only then update our pending commit SegmentInfos. > Can we merge small segments during refresh, for faster searching? > - > > Key: LUCENE-8962 > URL: https://issues.apache.org/jira/browse/LUCENE-8962 > Project: Lucene - Core > Issue Type: Improvement > Components: core/index >Reporter: Michael McCandless >Priority: Major > Fix For: 8.5 > > Attachments: LUCENE-8962_demo.png > > Time Spent: 9h 20m > Remaining Estimate: 0h > > With near-real-time search we ask {{IndexWriter}} to write all in-memory > segments to disk and open an {{IndexReader}} to search them, and this is > typically a quick operation. > However, when you use many threads for concurrent indexing, {{IndexWriter}} > will accumulate write many small segments during {{refresh}} and this then > adds search-time cost as searching must visit all of these tiny segments. > The merge policy would normally quickly coalesce these small segments if > given a little time ... so, could we somehow improve {{IndexWriter'}}s > refresh to optionally kick off merge policy to merge segments below some > threshold before opening the near-real-time reader? It'd be a bit tricky > because while we are waiting for merges, indexing may continue, and new > segments may be flushed, but those new segments shouldn't be included in the > point-in-time segments returned by refresh ... > One could almost do this on top of Lucene today, with a custom merge policy, > and some hackity logic to have the merge policy target small segments just > written by refresh, but it's tricky to then open a near-real-time reader, > excluding newly flushed but including newly merged segments since the refresh > originally finished ... > I'm not yet sure how best to solve this, so I wanted to open an issue for > discussion! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?
[ https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17053382#comment-17053382 ] ASF subversion and git services commented on LUCENE-8962: - Commit 90aced5a51f92ffd6e97449eb7c44aacc643c8a3 in lucene-solr's branch refs/heads/branch_8x from Michael Sokolov [ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=90aced5 ] LUCENE-8962: Split test case (#1313) * LUCENE-8962: Simplify test case The testMergeOnCommit test case was trying to verify too many things at once: basic semantics of merge on commit and proper behavior when a bunch of indexing threads are writing and committing all at once. Now we just verify basic behavior, with strict assertions on invariants, while leaving it to MockRandomMergePolicy to enable merge on commit in existing test cases to verify that indexing generally works as expected and no new unexpected exceptions are thrown. * LUCENE-8962: Only update toCommit if merge was committed The code was previously assuming that if mergeFinished() was called and isAborted() was false, then the merge must have completed successfully. Instead, we should know for sure if a given merge was committed, and only then update our pending commit SegmentInfos. > Can we merge small segments during refresh, for faster searching? > - > > Key: LUCENE-8962 > URL: https://issues.apache.org/jira/browse/LUCENE-8962 > Project: Lucene - Core > Issue Type: Improvement > Components: core/index >Reporter: Michael McCandless >Priority: Major > Fix For: 8.5 > > Attachments: LUCENE-8962_demo.png > > Time Spent: 9h 20m > Remaining Estimate: 0h > > With near-real-time search we ask {{IndexWriter}} to write all in-memory > segments to disk and open an {{IndexReader}} to search them, and this is > typically a quick operation. > However, when you use many threads for concurrent indexing, {{IndexWriter}} > will accumulate write many small segments during {{refresh}} and this then > adds search-time cost as searching must visit all of these tiny segments. > The merge policy would normally quickly coalesce these small segments if > given a little time ... so, could we somehow improve {{IndexWriter'}}s > refresh to optionally kick off merge policy to merge segments below some > threshold before opening the near-real-time reader? It'd be a bit tricky > because while we are waiting for merges, indexing may continue, and new > segments may be flushed, but those new segments shouldn't be included in the > point-in-time segments returned by refresh ... > One could almost do this on top of Lucene today, with a custom merge policy, > and some hackity logic to have the merge policy target small segments just > written by refresh, but it's tricky to then open a near-real-time reader, > excluding newly flushed but including newly merged segments since the refresh > originally finished ... > I'm not yet sure how best to solve this, so I wanted to open an issue for > discussion! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?
[ https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17053231#comment-17053231 ] Simon Willnauer commented on LUCENE-8962: - I read through this issue and I want to share some of my thoughts. First, I understand the need for this and the motivation, yet every time we add something like this to the IndexWriter to do something _as part of_ another method it triggers an alarm on my end. I have spent hours and days thinking about how IW can be simpler and the biggest issues that I see is that the primitives on IW like commit or openReader are doing too much. Just look at openReader it's pretty involved and changing the bus factor or making it easier to understand is hard. Adding stuff like _wait for merge_ with something like a timeout is not what I think we should do neither to _openReader_ nor to _commit_. That said, I think we can make the same things happen but we should think in primitives rather than changing method behavior with configuration. Let me explain what I mean: Lets say we keep _commit_ and _openReader_ the way it is and would instead allow to use an existing reader NRT or not and allow itself to _optimize_ itself (yeah I said that - it might be a good name after all). With a slightly refactored IW we can share the merge logic and let the reader re-write itself since we are talking about very small segments the overhead is very small. This would in turn mean that we are doing the work twice ie. the IW would do its normal work and might merge later etc. We might even merge this stuff into heap-space or so if we have enough I haven't thought too much about that. This way we can clean up IW potentially and add a very nice optimization that works for commit as well as NRT. We should strive for making IW simpler not do more. I hope I wasn't too discouraging. > Can we merge small segments during refresh, for faster searching? > - > > Key: LUCENE-8962 > URL: https://issues.apache.org/jira/browse/LUCENE-8962 > Project: Lucene - Core > Issue Type: Improvement > Components: core/index >Reporter: Michael McCandless >Priority: Major > Fix For: 8.5 > > Attachments: LUCENE-8962_demo.png > > Time Spent: 9h 20m > Remaining Estimate: 0h > > With near-real-time search we ask {{IndexWriter}} to write all in-memory > segments to disk and open an {{IndexReader}} to search them, and this is > typically a quick operation. > However, when you use many threads for concurrent indexing, {{IndexWriter}} > will accumulate write many small segments during {{refresh}} and this then > adds search-time cost as searching must visit all of these tiny segments. > The merge policy would normally quickly coalesce these small segments if > given a little time ... so, could we somehow improve {{IndexWriter'}}s > refresh to optionally kick off merge policy to merge segments below some > threshold before opening the near-real-time reader? It'd be a bit tricky > because while we are waiting for merges, indexing may continue, and new > segments may be flushed, but those new segments shouldn't be included in the > point-in-time segments returned by refresh ... > One could almost do this on top of Lucene today, with a custom merge policy, > and some hackity logic to have the merge policy target small segments just > written by refresh, but it's tricky to then open a near-real-time reader, > excluding newly flushed but including newly merged segments since the refresh > originally finished ... > I'm not yet sure how best to solve this, so I wanted to open an issue for > discussion! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?
[ https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17052608#comment-17052608 ] ASF subversion and git services commented on LUCENE-8962: - Commit 3dbfd102794419551f2ba4b43344cf9e6242a2b8 in lucene-solr's branch refs/heads/branch_8x from Michael McCandless [ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=3dbfd10 ] LUCENE-8962: woops, remove leftover accidental copyright (darned IDEs) > Can we merge small segments during refresh, for faster searching? > - > > Key: LUCENE-8962 > URL: https://issues.apache.org/jira/browse/LUCENE-8962 > Project: Lucene - Core > Issue Type: Improvement > Components: core/index >Reporter: Michael McCandless >Priority: Major > Fix For: 8.5 > > Attachments: LUCENE-8962_demo.png > > Time Spent: 7h 10m > Remaining Estimate: 0h > > With near-real-time search we ask {{IndexWriter}} to write all in-memory > segments to disk and open an {{IndexReader}} to search them, and this is > typically a quick operation. > However, when you use many threads for concurrent indexing, {{IndexWriter}} > will accumulate write many small segments during {{refresh}} and this then > adds search-time cost as searching must visit all of these tiny segments. > The merge policy would normally quickly coalesce these small segments if > given a little time ... so, could we somehow improve {{IndexWriter'}}s > refresh to optionally kick off merge policy to merge segments below some > threshold before opening the near-real-time reader? It'd be a bit tricky > because while we are waiting for merges, indexing may continue, and new > segments may be flushed, but those new segments shouldn't be included in the > point-in-time segments returned by refresh ... > One could almost do this on top of Lucene today, with a custom merge policy, > and some hackity logic to have the merge policy target small segments just > written by refresh, but it's tricky to then open a near-real-time reader, > excluding newly flushed but including newly merged segments since the refresh > originally finished ... > I'm not yet sure how best to solve this, so I wanted to open an issue for > discussion! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?
[ https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17052607#comment-17052607 ] ASF subversion and git services commented on LUCENE-8962: - Commit e5be034df2fc22f1b88e4d271b25c8fae1c3093f in lucene-solr's branch refs/heads/master from Michael McCandless [ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=e5be034 ] LUCENE-8962: woops, remove leftover accidental copyright (darned IDEs) > Can we merge small segments during refresh, for faster searching? > - > > Key: LUCENE-8962 > URL: https://issues.apache.org/jira/browse/LUCENE-8962 > Project: Lucene - Core > Issue Type: Improvement > Components: core/index >Reporter: Michael McCandless >Priority: Major > Fix For: 8.5 > > Attachments: LUCENE-8962_demo.png > > Time Spent: 7h 10m > Remaining Estimate: 0h > > With near-real-time search we ask {{IndexWriter}} to write all in-memory > segments to disk and open an {{IndexReader}} to search them, and this is > typically a quick operation. > However, when you use many threads for concurrent indexing, {{IndexWriter}} > will accumulate write many small segments during {{refresh}} and this then > adds search-time cost as searching must visit all of these tiny segments. > The merge policy would normally quickly coalesce these small segments if > given a little time ... so, could we somehow improve {{IndexWriter'}}s > refresh to optionally kick off merge policy to merge segments below some > threshold before opening the near-real-time reader? It'd be a bit tricky > because while we are waiting for merges, indexing may continue, and new > segments may be flushed, but those new segments shouldn't be included in the > point-in-time segments returned by refresh ... > One could almost do this on top of Lucene today, with a custom merge policy, > and some hackity logic to have the merge policy target small segments just > written by refresh, but it's tricky to then open a near-real-time reader, > excluding newly flushed but including newly merged segments since the refresh > originally finished ... > I'm not yet sure how best to solve this, so I wanted to open an issue for > discussion! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?
[ https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17052510#comment-17052510 ] ASF subversion and git services commented on LUCENE-8962: - Commit a030207a5e547a70db01d72fe4bd1627814ea94c in lucene-solr's branch refs/heads/master from Michael Sokolov [ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=a030207 ] LUCENE-8962: Split test case (#1313) * LUCENE-8962: Simplify test case The testMergeOnCommit test case was trying to verify too many things at once: basic semantics of merge on commit and proper behavior when a bunch of indexing threads are writing and committing all at once. Now we just verify basic behavior, with strict assertions on invariants, while leaving it to MockRandomMergePolicy to enable merge on commit in existing test cases to verify that indexing generally works as expected and no new unexpected exceptions are thrown. * LUCENE-8962: Only update toCommit if merge was committed The code was previously assuming that if mergeFinished() was called and isAborted() was false, then the merge must have completed successfully. Instead, we should know for sure if a given merge was committed, and only then update our pending commit SegmentInfos. > Can we merge small segments during refresh, for faster searching? > - > > Key: LUCENE-8962 > URL: https://issues.apache.org/jira/browse/LUCENE-8962 > Project: Lucene - Core > Issue Type: Improvement > Components: core/index >Reporter: Michael McCandless >Priority: Major > Fix For: 8.5 > > Attachments: LUCENE-8962_demo.png > > Time Spent: 7h > Remaining Estimate: 0h > > With near-real-time search we ask {{IndexWriter}} to write all in-memory > segments to disk and open an {{IndexReader}} to search them, and this is > typically a quick operation. > However, when you use many threads for concurrent indexing, {{IndexWriter}} > will accumulate write many small segments during {{refresh}} and this then > adds search-time cost as searching must visit all of these tiny segments. > The merge policy would normally quickly coalesce these small segments if > given a little time ... so, could we somehow improve {{IndexWriter'}}s > refresh to optionally kick off merge policy to merge segments below some > threshold before opening the near-real-time reader? It'd be a bit tricky > because while we are waiting for merges, indexing may continue, and new > segments may be flushed, but those new segments shouldn't be included in the > point-in-time segments returned by refresh ... > One could almost do this on top of Lucene today, with a custom merge policy, > and some hackity logic to have the merge policy target small segments just > written by refresh, but it's tricky to then open a near-real-time reader, > excluding newly flushed but including newly merged segments since the refresh > originally finished ... > I'm not yet sure how best to solve this, so I wanted to open an issue for > discussion! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?
[ https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17052512#comment-17052512 ] ASF subversion and git services commented on LUCENE-8962: - Commit a030207a5e547a70db01d72fe4bd1627814ea94c in lucene-solr's branch refs/heads/master from Michael Sokolov [ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=a030207 ] LUCENE-8962: Split test case (#1313) * LUCENE-8962: Simplify test case The testMergeOnCommit test case was trying to verify too many things at once: basic semantics of merge on commit and proper behavior when a bunch of indexing threads are writing and committing all at once. Now we just verify basic behavior, with strict assertions on invariants, while leaving it to MockRandomMergePolicy to enable merge on commit in existing test cases to verify that indexing generally works as expected and no new unexpected exceptions are thrown. * LUCENE-8962: Only update toCommit if merge was committed The code was previously assuming that if mergeFinished() was called and isAborted() was false, then the merge must have completed successfully. Instead, we should know for sure if a given merge was committed, and only then update our pending commit SegmentInfos. > Can we merge small segments during refresh, for faster searching? > - > > Key: LUCENE-8962 > URL: https://issues.apache.org/jira/browse/LUCENE-8962 > Project: Lucene - Core > Issue Type: Improvement > Components: core/index >Reporter: Michael McCandless >Priority: Major > Fix For: 8.5 > > Attachments: LUCENE-8962_demo.png > > Time Spent: 7h > Remaining Estimate: 0h > > With near-real-time search we ask {{IndexWriter}} to write all in-memory > segments to disk and open an {{IndexReader}} to search them, and this is > typically a quick operation. > However, when you use many threads for concurrent indexing, {{IndexWriter}} > will accumulate write many small segments during {{refresh}} and this then > adds search-time cost as searching must visit all of these tiny segments. > The merge policy would normally quickly coalesce these small segments if > given a little time ... so, could we somehow improve {{IndexWriter'}}s > refresh to optionally kick off merge policy to merge segments below some > threshold before opening the near-real-time reader? It'd be a bit tricky > because while we are waiting for merges, indexing may continue, and new > segments may be flushed, but those new segments shouldn't be included in the > point-in-time segments returned by refresh ... > One could almost do this on top of Lucene today, with a custom merge policy, > and some hackity logic to have the merge policy target small segments just > written by refresh, but it's tricky to then open a near-real-time reader, > excluding newly flushed but including newly merged segments since the refresh > originally finished ... > I'm not yet sure how best to solve this, so I wanted to open an issue for > discussion! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?
[ https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17052508#comment-17052508 ] ASF subversion and git services commented on LUCENE-8962: - Commit a030207a5e547a70db01d72fe4bd1627814ea94c in lucene-solr's branch refs/heads/master from Michael Sokolov [ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=a030207 ] LUCENE-8962: Split test case (#1313) * LUCENE-8962: Simplify test case The testMergeOnCommit test case was trying to verify too many things at once: basic semantics of merge on commit and proper behavior when a bunch of indexing threads are writing and committing all at once. Now we just verify basic behavior, with strict assertions on invariants, while leaving it to MockRandomMergePolicy to enable merge on commit in existing test cases to verify that indexing generally works as expected and no new unexpected exceptions are thrown. * LUCENE-8962: Only update toCommit if merge was committed The code was previously assuming that if mergeFinished() was called and isAborted() was false, then the merge must have completed successfully. Instead, we should know for sure if a given merge was committed, and only then update our pending commit SegmentInfos. > Can we merge small segments during refresh, for faster searching? > - > > Key: LUCENE-8962 > URL: https://issues.apache.org/jira/browse/LUCENE-8962 > Project: Lucene - Core > Issue Type: Improvement > Components: core/index >Reporter: Michael McCandless >Priority: Major > Fix For: 8.5 > > Attachments: LUCENE-8962_demo.png > > Time Spent: 7h > Remaining Estimate: 0h > > With near-real-time search we ask {{IndexWriter}} to write all in-memory > segments to disk and open an {{IndexReader}} to search them, and this is > typically a quick operation. > However, when you use many threads for concurrent indexing, {{IndexWriter}} > will accumulate write many small segments during {{refresh}} and this then > adds search-time cost as searching must visit all of these tiny segments. > The merge policy would normally quickly coalesce these small segments if > given a little time ... so, could we somehow improve {{IndexWriter'}}s > refresh to optionally kick off merge policy to merge segments below some > threshold before opening the near-real-time reader? It'd be a bit tricky > because while we are waiting for merges, indexing may continue, and new > segments may be flushed, but those new segments shouldn't be included in the > point-in-time segments returned by refresh ... > One could almost do this on top of Lucene today, with a custom merge policy, > and some hackity logic to have the merge policy target small segments just > written by refresh, but it's tricky to then open a near-real-time reader, > excluding newly flushed but including newly merged segments since the refresh > originally finished ... > I'm not yet sure how best to solve this, so I wanted to open an issue for > discussion! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?
[ https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17051571#comment-17051571 ] Michael Froh commented on LUCENE-8962: -- I updated https://github.com/apache/lucene-solr/pull/1313 with that proposed fix (adding a {{boolean}} field to OneMerge that gets set once a merge is successfully committed). > Can we merge small segments during refresh, for faster searching? > - > > Key: LUCENE-8962 > URL: https://issues.apache.org/jira/browse/LUCENE-8962 > Project: Lucene - Core > Issue Type: Improvement > Components: core/index >Reporter: Michael McCandless >Priority: Major > Fix For: 8.5 > > Attachments: LUCENE-8962_demo.png > > Time Spent: 6.5h > Remaining Estimate: 0h > > With near-real-time search we ask {{IndexWriter}} to write all in-memory > segments to disk and open an {{IndexReader}} to search them, and this is > typically a quick operation. > However, when you use many threads for concurrent indexing, {{IndexWriter}} > will accumulate write many small segments during {{refresh}} and this then > adds search-time cost as searching must visit all of these tiny segments. > The merge policy would normally quickly coalesce these small segments if > given a little time ... so, could we somehow improve {{IndexWriter'}}s > refresh to optionally kick off merge policy to merge segments below some > threshold before opening the near-real-time reader? It'd be a bit tricky > because while we are waiting for merges, indexing may continue, and new > segments may be flushed, but those new segments shouldn't be included in the > point-in-time segments returned by refresh ... > One could almost do this on top of Lucene today, with a custom merge policy, > and some hackity logic to have the merge policy target small segments just > written by refresh, but it's tricky to then open a near-real-time reader, > excluding newly flushed but including newly merged segments since the refresh > originally finished ... > I'm not yet sure how best to solve this, so I wanted to open an issue for > discussion! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?
[ https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17051558#comment-17051558 ] Michael Froh commented on LUCENE-8962: -- It's not immediately obvious to me how to fix the failure on {{TestIndexWriterExceptions2}}. A merge on commit fails (because it's using {{CrankyCodec}}), closing the merge readers, which calls the custom {{mergeFinished}} override, which assumes the merge completed (since it wasn't aborted), and tries to reference the files for the merged segment (to increment their reference counts). That triggers an {{IllegalStateException}} because the files weren't set (because we didn't get that far in the merge). Unfortunately, stepping through the debugger, I don't see a clear way of telling in {{mergeFinished}} that a merge failed. Obviously, I could wrap the call to {{SegmentCommitInfo.files()}} in a try-catch, and assume that the {{IllegalStateException}} means that the merge failed, but that would fail to catch an IOException when e.g. committing the merge. I'm thinking of adding a {{boolean}} field to {{OneMerge}} that gets set once a merge is successfully committed (e.g. just before the call to {{closeMergeReaders}} in {{IndexWriter.commitMerge()}}), which the {{mergeFinished}} override can use to determine if the merge completed successfully or not. > Can we merge small segments during refresh, for faster searching? > - > > Key: LUCENE-8962 > URL: https://issues.apache.org/jira/browse/LUCENE-8962 > Project: Lucene - Core > Issue Type: Improvement > Components: core/index >Reporter: Michael McCandless >Priority: Major > Fix For: 8.5 > > Attachments: LUCENE-8962_demo.png > > Time Spent: 6h 20m > Remaining Estimate: 0h > > With near-real-time search we ask {{IndexWriter}} to write all in-memory > segments to disk and open an {{IndexReader}} to search them, and this is > typically a quick operation. > However, when you use many threads for concurrent indexing, {{IndexWriter}} > will accumulate write many small segments during {{refresh}} and this then > adds search-time cost as searching must visit all of these tiny segments. > The merge policy would normally quickly coalesce these small segments if > given a little time ... so, could we somehow improve {{IndexWriter'}}s > refresh to optionally kick off merge policy to merge segments below some > threshold before opening the near-real-time reader? It'd be a bit tricky > because while we are waiting for merges, indexing may continue, and new > segments may be flushed, but those new segments shouldn't be included in the > point-in-time segments returned by refresh ... > One could almost do this on top of Lucene today, with a custom merge policy, > and some hackity logic to have the merge policy target small segments just > written by refresh, but it's tricky to then open a near-real-time reader, > excluding newly flushed but including newly merged segments since the refresh > originally finished ... > I'm not yet sure how best to solve this, so I wanted to open an issue for > discussion! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?
[ https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17051229#comment-17051229 ] Michael Sokolov commented on LUCENE-8962: - I think I'll test and push [~msfroh]'s test fix since it seems to fail frequently, while we investigate this new failure. It looks as if we are not properly handling the case when an IOException is thrown at some inopportune time? > Can we merge small segments during refresh, for faster searching? > - > > Key: LUCENE-8962 > URL: https://issues.apache.org/jira/browse/LUCENE-8962 > Project: Lucene - Core > Issue Type: Improvement > Components: core/index >Reporter: Michael McCandless >Priority: Major > Fix For: 8.5 > > Attachments: LUCENE-8962_demo.png > > Time Spent: 6h 10m > Remaining Estimate: 0h > > With near-real-time search we ask {{IndexWriter}} to write all in-memory > segments to disk and open an {{IndexReader}} to search them, and this is > typically a quick operation. > However, when you use many threads for concurrent indexing, {{IndexWriter}} > will accumulate write many small segments during {{refresh}} and this then > adds search-time cost as searching must visit all of these tiny segments. > The merge policy would normally quickly coalesce these small segments if > given a little time ... so, could we somehow improve {{IndexWriter'}}s > refresh to optionally kick off merge policy to merge segments below some > threshold before opening the near-real-time reader? It'd be a bit tricky > because while we are waiting for merges, indexing may continue, and new > segments may be flushed, but those new segments shouldn't be included in the > point-in-time segments returned by refresh ... > One could almost do this on top of Lucene today, with a custom merge policy, > and some hackity logic to have the merge policy target small segments just > written by refresh, but it's tricky to then open a near-real-time reader, > excluding newly flushed but including newly merged segments since the refresh > originally finished ... > I'm not yet sure how best to solve this, so I wanted to open an issue for > discussion! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?
[ https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17050967#comment-17050967 ] Adrien Grand commented on LUCENE-8962: -- FYI this test seems to fail too because of this change: {noformat} NOTE: reproduce with: ant test -Dtestcase=TestIndexWriterExceptions2 -Dtests.method=testBasics -Dtests.seed=1F5EDA9C353C018F -Dtests.slow=true -Dtests.badapples=true -Dtests.locale=kkj-CM -Dtests.timezone=Etc/UCT -Dtests.asserts=true -Dtests.file.encoding=ISO-8859-1 {noformat} > Can we merge small segments during refresh, for faster searching? > - > > Key: LUCENE-8962 > URL: https://issues.apache.org/jira/browse/LUCENE-8962 > Project: Lucene - Core > Issue Type: Improvement > Components: core/index >Reporter: Michael McCandless >Priority: Major > Fix For: 8.5 > > Attachments: LUCENE-8962_demo.png > > Time Spent: 6h 10m > Remaining Estimate: 0h > > With near-real-time search we ask {{IndexWriter}} to write all in-memory > segments to disk and open an {{IndexReader}} to search them, and this is > typically a quick operation. > However, when you use many threads for concurrent indexing, {{IndexWriter}} > will accumulate write many small segments during {{refresh}} and this then > adds search-time cost as searching must visit all of these tiny segments. > The merge policy would normally quickly coalesce these small segments if > given a little time ... so, could we somehow improve {{IndexWriter'}}s > refresh to optionally kick off merge policy to merge segments below some > threshold before opening the near-real-time reader? It'd be a bit tricky > because while we are waiting for merges, indexing may continue, and new > segments may be flushed, but those new segments shouldn't be included in the > point-in-time segments returned by refresh ... > One could almost do this on top of Lucene today, with a custom merge policy, > and some hackity logic to have the merge policy target small segments just > written by refresh, but it's tricky to then open a near-real-time reader, > excluding newly flushed but including newly merged segments since the refresh > originally finished ... > I'm not yet sure how best to solve this, so I wanted to open an issue for > discussion! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?
[ https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17050607#comment-17050607 ] Michael Froh commented on LUCENE-8962: -- I ended up splitting testMergeOnCommit into two test cases. One runs through the basic invariants on a single thread and confirms that everything behaves as expected. The other tries indexing and committing from multiple threads, but doesn't really make any assumptions about the segment topology in the end (since randomness and concurrency can lead to all kinds of possible valid segment counts). Instead it just verifies that it doesn't fail and doesn't lose any documents. https://github.com/apache/lucene-solr/pull/1313 > Can we merge small segments during refresh, for faster searching? > - > > Key: LUCENE-8962 > URL: https://issues.apache.org/jira/browse/LUCENE-8962 > Project: Lucene - Core > Issue Type: Improvement > Components: core/index >Reporter: Michael McCandless >Priority: Major > Fix For: 8.5 > > Attachments: LUCENE-8962_demo.png > > Time Spent: 6h 10m > Remaining Estimate: 0h > > With near-real-time search we ask {{IndexWriter}} to write all in-memory > segments to disk and open an {{IndexReader}} to search them, and this is > typically a quick operation. > However, when you use many threads for concurrent indexing, {{IndexWriter}} > will accumulate write many small segments during {{refresh}} and this then > adds search-time cost as searching must visit all of these tiny segments. > The merge policy would normally quickly coalesce these small segments if > given a little time ... so, could we somehow improve {{IndexWriter'}}s > refresh to optionally kick off merge policy to merge segments below some > threshold before opening the near-real-time reader? It'd be a bit tricky > because while we are waiting for merges, indexing may continue, and new > segments may be flushed, but those new segments shouldn't be included in the > point-in-time segments returned by refresh ... > One could almost do this on top of Lucene today, with a custom merge policy, > and some hackity logic to have the merge policy target small segments just > written by refresh, but it's tricky to then open a near-real-time reader, > excluding newly flushed but including newly merged segments since the refresh > originally finished ... > I'm not yet sure how best to solve this, so I wanted to open an issue for > discussion! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?
[ https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17049754#comment-17049754 ] ASF subversion and git services commented on LUCENE-8962: - Commit e308e538731f392eb81ba81cfb3ec5fc526fd383 in lucene-solr's branch refs/heads/master from Michael Sokolov [ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=e308e53 ] Add CHANGES entry for LUCENE-8962 > Can we merge small segments during refresh, for faster searching? > - > > Key: LUCENE-8962 > URL: https://issues.apache.org/jira/browse/LUCENE-8962 > Project: Lucene - Core > Issue Type: Improvement > Components: core/index >Reporter: Michael McCandless >Priority: Major > Attachments: LUCENE-8962_demo.png > > Time Spent: 6h > Remaining Estimate: 0h > > With near-real-time search we ask {{IndexWriter}} to write all in-memory > segments to disk and open an {{IndexReader}} to search them, and this is > typically a quick operation. > However, when you use many threads for concurrent indexing, {{IndexWriter}} > will accumulate write many small segments during {{refresh}} and this then > adds search-time cost as searching must visit all of these tiny segments. > The merge policy would normally quickly coalesce these small segments if > given a little time ... so, could we somehow improve {{IndexWriter'}}s > refresh to optionally kick off merge policy to merge segments below some > threshold before opening the near-real-time reader? It'd be a bit tricky > because while we are waiting for merges, indexing may continue, and new > segments may be flushed, but those new segments shouldn't be included in the > point-in-time segments returned by refresh ... > One could almost do this on top of Lucene today, with a custom merge policy, > and some hackity logic to have the merge policy target small segments just > written by refresh, but it's tricky to then open a near-real-time reader, > excluding newly flushed but including newly merged segments since the refresh > originally finished ... > I'm not yet sure how best to solve this, so I wanted to open an issue for > discussion! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?
[ https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17049750#comment-17049750 ] ASF subversion and git services commented on LUCENE-8962: - Commit a1791e77143aa8087c0b5ee0e8eb57422e59a09a in lucene-solr's branch refs/heads/branch_8x from msfroh [ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=a1791e7 ] LUCENE-8962: Add ability to selectively merge on commit (#1155) * LUCENE-8962: Add ability to selectively merge on commit This adds a new "findCommitMerges" method to MergePolicy, which can specify merges to be executed before the IndexWriter.prepareCommitInternal method returns. If we have many index writer threads, they will flush their DWPT buffers on commit, resulting in many small segments, which can be merged before the commit returns. * Add missing Javadoc * Fix incorrect comment * Refactoring and fix intermittent test failure 1. Made some changes to the callback to update toCommit, leveraging SegmentInfos.applyMergeChanges. 2. I realized that we'll never end up with 0 registered merges, because we throw an exception if we fail to register a merge. 3. Moved the IndexWriterEvents.beginMergeOnCommit notification to before we call MergeScheduler.merge, since we may not be merging on another thread. 4. There was an intermittent test failure due to randomness in the time it takes for merges to complete. Before doing the final commit, we wait for pending merges to finish. We may still end up abandoning the final merge, but we can detect that and assert that either the merge was abandoned (and we have > 1 segment) or we did merge down to 1 segment. * Fix typo * Fix/improve comments based on PR feedback * More comment improvements from PR feedback * Rename method and add new MergeTrigger 1. Renamed findCommitMerges -> findFullFlushMerges. 2. Added MergeTrigger.COMMIT, passed to findFullFlushMerges and to MergeScheduler when merging on commit. * Update renamed method name in strings and comments > Can we merge small segments during refresh, for faster searching? > - > > Key: LUCENE-8962 > URL: https://issues.apache.org/jira/browse/LUCENE-8962 > Project: Lucene - Core > Issue Type: Improvement > Components: core/index >Reporter: Michael McCandless >Priority: Major > Attachments: LUCENE-8962_demo.png > > Time Spent: 6h > Remaining Estimate: 0h > > With near-real-time search we ask {{IndexWriter}} to write all in-memory > segments to disk and open an {{IndexReader}} to search them, and this is > typically a quick operation. > However, when you use many threads for concurrent indexing, {{IndexWriter}} > will accumulate write many small segments during {{refresh}} and this then > adds search-time cost as searching must visit all of these tiny segments. > The merge policy would normally quickly coalesce these small segments if > given a little time ... so, could we somehow improve {{IndexWriter'}}s > refresh to optionally kick off merge policy to merge segments below some > threshold before opening the near-real-time reader? It'd be a bit tricky > because while we are waiting for merges, indexing may continue, and new > segments may be flushed, but those new segments shouldn't be included in the > point-in-time segments returned by refresh ... > One could almost do this on top of Lucene today, with a custom merge policy, > and some hackity logic to have the merge policy target small segments just > written by refresh, but it's tricky to then open a near-real-time reader, > excluding newly flushed but including newly merged segments since the refresh > originally finished ... > I'm not yet sure how best to solve this, so I wanted to open an issue for > discussion! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?
[ https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17049751#comment-17049751 ] ASF subversion and git services commented on LUCENE-8962: - Commit a1791e77143aa8087c0b5ee0e8eb57422e59a09a in lucene-solr's branch refs/heads/branch_8x from msfroh [ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=a1791e7 ] LUCENE-8962: Add ability to selectively merge on commit (#1155) * LUCENE-8962: Add ability to selectively merge on commit This adds a new "findCommitMerges" method to MergePolicy, which can specify merges to be executed before the IndexWriter.prepareCommitInternal method returns. If we have many index writer threads, they will flush their DWPT buffers on commit, resulting in many small segments, which can be merged before the commit returns. * Add missing Javadoc * Fix incorrect comment * Refactoring and fix intermittent test failure 1. Made some changes to the callback to update toCommit, leveraging SegmentInfos.applyMergeChanges. 2. I realized that we'll never end up with 0 registered merges, because we throw an exception if we fail to register a merge. 3. Moved the IndexWriterEvents.beginMergeOnCommit notification to before we call MergeScheduler.merge, since we may not be merging on another thread. 4. There was an intermittent test failure due to randomness in the time it takes for merges to complete. Before doing the final commit, we wait for pending merges to finish. We may still end up abandoning the final merge, but we can detect that and assert that either the merge was abandoned (and we have > 1 segment) or we did merge down to 1 segment. * Fix typo * Fix/improve comments based on PR feedback * More comment improvements from PR feedback * Rename method and add new MergeTrigger 1. Renamed findCommitMerges -> findFullFlushMerges. 2. Added MergeTrigger.COMMIT, passed to findFullFlushMerges and to MergeScheduler when merging on commit. * Update renamed method name in strings and comments > Can we merge small segments during refresh, for faster searching? > - > > Key: LUCENE-8962 > URL: https://issues.apache.org/jira/browse/LUCENE-8962 > Project: Lucene - Core > Issue Type: Improvement > Components: core/index >Reporter: Michael McCandless >Priority: Major > Attachments: LUCENE-8962_demo.png > > Time Spent: 6h > Remaining Estimate: 0h > > With near-real-time search we ask {{IndexWriter}} to write all in-memory > segments to disk and open an {{IndexReader}} to search them, and this is > typically a quick operation. > However, when you use many threads for concurrent indexing, {{IndexWriter}} > will accumulate write many small segments during {{refresh}} and this then > adds search-time cost as searching must visit all of these tiny segments. > The merge policy would normally quickly coalesce these small segments if > given a little time ... so, could we somehow improve {{IndexWriter'}}s > refresh to optionally kick off merge policy to merge segments below some > threshold before opening the near-real-time reader? It'd be a bit tricky > because while we are waiting for merges, indexing may continue, and new > segments may be flushed, but those new segments shouldn't be included in the > point-in-time segments returned by refresh ... > One could almost do this on top of Lucene today, with a custom merge policy, > and some hackity logic to have the merge policy target small segments just > written by refresh, but it's tricky to then open a near-real-time reader, > excluding newly flushed but including newly merged segments since the refresh > originally finished ... > I'm not yet sure how best to solve this, so I wanted to open an issue for > discussion! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?
[ https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17049753#comment-17049753 ] ASF subversion and git services commented on LUCENE-8962: - Commit fdac6d866344611290c45c164112277581328bc9 in lucene-solr's branch refs/heads/branch_8x from Michael Sokolov [ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=fdac6d8 ] Add CHANGES entry for LUCENE-8962 > Can we merge small segments during refresh, for faster searching? > - > > Key: LUCENE-8962 > URL: https://issues.apache.org/jira/browse/LUCENE-8962 > Project: Lucene - Core > Issue Type: Improvement > Components: core/index >Reporter: Michael McCandless >Priority: Major > Attachments: LUCENE-8962_demo.png > > Time Spent: 6h > Remaining Estimate: 0h > > With near-real-time search we ask {{IndexWriter}} to write all in-memory > segments to disk and open an {{IndexReader}} to search them, and this is > typically a quick operation. > However, when you use many threads for concurrent indexing, {{IndexWriter}} > will accumulate write many small segments during {{refresh}} and this then > adds search-time cost as searching must visit all of these tiny segments. > The merge policy would normally quickly coalesce these small segments if > given a little time ... so, could we somehow improve {{IndexWriter'}}s > refresh to optionally kick off merge policy to merge segments below some > threshold before opening the near-real-time reader? It'd be a bit tricky > because while we are waiting for merges, indexing may continue, and new > segments may be flushed, but those new segments shouldn't be included in the > point-in-time segments returned by refresh ... > One could almost do this on top of Lucene today, with a custom merge policy, > and some hackity logic to have the merge policy target small segments just > written by refresh, but it's tricky to then open a near-real-time reader, > excluding newly flushed but including newly merged segments since the refresh > originally finished ... > I'm not yet sure how best to solve this, so I wanted to open an issue for > discussion! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?
[ https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17049752#comment-17049752 ] ASF subversion and git services commented on LUCENE-8962: - Commit a5475de57fed6b339cd5565bd1bd2650f265a537 in lucene-solr's branch refs/heads/branch_8x from Michael Froh [ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=a5475de ] LUCENE-8962: Fix intermittent test failures 1. TestIndexWriterMergePolicy.testMergeOnCommit will fail if the last commit (the one that should trigger the full merge) doesn't have any pending changes (which could occur if the last indexing thread commits at the end). We can fix that by adding one more document before that commit. 2. The previous implementation was throwing IOException if the commit thread gets interrupted while waiting for merges to complete. This violates IndexWriter's documented behavior of throwing ThreadInterruptedException. > Can we merge small segments during refresh, for faster searching? > - > > Key: LUCENE-8962 > URL: https://issues.apache.org/jira/browse/LUCENE-8962 > Project: Lucene - Core > Issue Type: Improvement > Components: core/index >Reporter: Michael McCandless >Priority: Major > Attachments: LUCENE-8962_demo.png > > Time Spent: 6h > Remaining Estimate: 0h > > With near-real-time search we ask {{IndexWriter}} to write all in-memory > segments to disk and open an {{IndexReader}} to search them, and this is > typically a quick operation. > However, when you use many threads for concurrent indexing, {{IndexWriter}} > will accumulate write many small segments during {{refresh}} and this then > adds search-time cost as searching must visit all of these tiny segments. > The merge policy would normally quickly coalesce these small segments if > given a little time ... so, could we somehow improve {{IndexWriter'}}s > refresh to optionally kick off merge policy to merge segments below some > threshold before opening the near-real-time reader? It'd be a bit tricky > because while we are waiting for merges, indexing may continue, and new > segments may be flushed, but those new segments shouldn't be included in the > point-in-time segments returned by refresh ... > One could almost do this on top of Lucene today, with a custom merge policy, > and some hackity logic to have the merge policy target small segments just > written by refresh, but it's tricky to then open a near-real-time reader, > excluding newly flushed but including newly merged segments since the refresh > originally finished ... > I'm not yet sure how best to solve this, so I wanted to open an issue for > discussion! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?
[ https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17049748#comment-17049748 ] ASF subversion and git services commented on LUCENE-8962: - Commit f017ae465ec416b9cf5ac91f9aa12ff71abd7de0 in lucene-solr's branch refs/heads/master from msfroh [ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=f017ae4 ] LUCENE-8962: Fix intermittent test failures (#1307) 1. TestIndexWriterMergePolicy.testMergeOnCommit will fail if the last commit (the one that should trigger the full merge) doesn't have any pending changes (which could occur if the last indexing thread commits at the end). We can fix that by adding one more document before that commit. 2. The previous implementation was throwing IOException if the commit thread gets interrupted while waiting for merges to complete. This violates IndexWriter's documented behavior of throwing ThreadInterruptedException. > Can we merge small segments during refresh, for faster searching? > - > > Key: LUCENE-8962 > URL: https://issues.apache.org/jira/browse/LUCENE-8962 > Project: Lucene - Core > Issue Type: Improvement > Components: core/index >Reporter: Michael McCandless >Priority: Major > Attachments: LUCENE-8962_demo.png > > Time Spent: 6h > Remaining Estimate: 0h > > With near-real-time search we ask {{IndexWriter}} to write all in-memory > segments to disk and open an {{IndexReader}} to search them, and this is > typically a quick operation. > However, when you use many threads for concurrent indexing, {{IndexWriter}} > will accumulate write many small segments during {{refresh}} and this then > adds search-time cost as searching must visit all of these tiny segments. > The merge policy would normally quickly coalesce these small segments if > given a little time ... so, could we somehow improve {{IndexWriter'}}s > refresh to optionally kick off merge policy to merge segments below some > threshold before opening the near-real-time reader? It'd be a bit tricky > because while we are waiting for merges, indexing may continue, and new > segments may be flushed, but those new segments shouldn't be included in the > point-in-time segments returned by refresh ... > One could almost do this on top of Lucene today, with a custom merge policy, > and some hackity logic to have the merge policy target small segments just > written by refresh, but it's tricky to then open a near-real-time reader, > excluding newly flushed but including newly merged segments since the refresh > originally finished ... > I'm not yet sure how best to solve this, so I wanted to open an issue for > discussion! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org