subject:"\[jira\] \[Commented\] \(LUCENE\-8962\) Can we merge small segments during refresh, for faster searching\?"

[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?

2021-03-22 Thread Michael McCandless (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17306249#comment-17306249
 ] 

Michael McCandless commented on LUCENE-8962:


Hi Team!  This issue was so exotic and such a great example of powerful 
open-source collaboration, which I suspect many people didn't even notice, so I 
wrote fun blog post explaining/uncovering all the juicy details: 
[http://blog.mikemccandless.com/2021/03/open-source-collaboration-or-how-we.html]

> Can we merge small segments during refresh, for faster searching?
> -
>
> Key: LUCENE-8962
> URL: https://issues.apache.org/jira/browse/LUCENE-8962
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Michael McCandless
>Priority: Major
> Fix For: main (9.0), 8.6, 8.7
>
> Attachments: LUCENE-8962_demo.png, failed-tests.patch, 
> failure_log.txt, image-2021-03-22-10-36-32-201.png, test.diff
>
>  Time Spent: 31h
>  Remaining Estimate: 0h
>
> Two improvements were added: 8.6 has merge-on-commit (by Froh et. all), 8.7 
> has merge-on-refresh (by Simon).  See \{{MergePolicy.findFullFlushMerges}}
> The original description follows:
> 
> With near-real-time search we ask {{IndexWriter}} to write all in-memory 
> segments to disk and open an {{IndexReader}} to search them, and this is 
> typically a quick operation.
> However, when you use many threads for concurrent indexing, {{IndexWriter}} 
> will accumulate write many small segments during {{refresh}} and this then 
> adds search-time cost as searching must visit all of these tiny segments.
> The merge policy would normally quickly coalesce these small segments if 
> given a little time ... so, could we somehow improve \{{IndexWriter'}}s 
> refresh to optionally kick off merge policy to merge segments below some 
> threshold before opening the near-real-time reader?  It'd be a bit tricky 
> because while we are waiting for merges, indexing may continue, and new 
> segments may be flushed, but those new segments shouldn't be included in the 
> point-in-time segments returned by refresh ...
> One could almost do this on top of Lucene today, with a custom merge policy, 
> and some hackity logic to have the merge policy target small segments just 
> written by refresh, but it's tricky to then open a near-real-time reader, 
> excluding newly flushed but including newly merged segments since the refresh 
> originally finished ...
> I'm not yet sure how best to solve this, so I wanted to open an issue for 
> discussion!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?

2020-08-25 Thread Michael McCandless (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17184735#comment-17184735
 ] 

Michael McCandless commented on LUCENE-8962:


Yeah, +1 we should have opened a new issue for the {{getReader}} case (separate 
from the {{commit}} case which already shipped in 8.6).  It is a related but 
different feature, and, yes, in a different Lucene 8.x feature release.

This was an exciting change, conceived originally by [~msfroh], contributed by 
[~msoko...@gmail.com], undergoing many iterations after that, exotic test 
failures, tons of help from [~simonw] to iterate the change to a clean solution 
that would work for both {{commit}} and {{getReader}}, beasting, more 
iterations, commits then reverts, etc.  I think it would make a great blog post!

> Can we merge small segments during refresh, for faster searching?
> -
>
> Key: LUCENE-8962
> URL: https://issues.apache.org/jira/browse/LUCENE-8962
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Michael McCandless
>Priority: Major
> Fix For: master (9.0), 8.6
>
> Attachments: LUCENE-8962_demo.png, failed-tests.patch, 
> failure_log.txt, test.diff
>
>  Time Spent: 31h
>  Remaining Estimate: 0h
>
> With near-real-time search we ask {{IndexWriter}} to write all in-memory 
> segments to disk and open an {{IndexReader}} to search them, and this is 
> typically a quick operation.
> However, when you use many threads for concurrent indexing, {{IndexWriter}} 
> will accumulate write many small segments during {{refresh}} and this then 
> adds search-time cost as searching must visit all of these tiny segments.
> The merge policy would normally quickly coalesce these small segments if 
> given a little time ... so, could we somehow improve {{IndexWriter'}}s 
> refresh to optionally kick off merge policy to merge segments below some 
> threshold before opening the near-real-time reader?  It'd be a bit tricky 
> because while we are waiting for merges, indexing may continue, and new 
> segments may be flushed, but those new segments shouldn't be included in the 
> point-in-time segments returned by refresh ...
> One could almost do this on top of Lucene today, with a custom merge policy, 
> and some hackity logic to have the merge policy target small segments just 
> written by refresh, but it's tricky to then open a near-real-time reader, 
> excluding newly flushed but including newly merged segments since the refresh 
> originally finished ...
> I'm not yet sure how best to solve this, so I wanted to open an issue for 
> discussion!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?

2020-08-25 Thread David Smiley (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17184702#comment-17184702
 ] 

David Smiley commented on LUCENE-8962:
--

Then we need a new JIRA issue targeted at 8.7 discussing this.  Once a release 
ships, the issues it references (in CHANGES.txt) are read-only from a code 
standpoint.  If 8.6 hadn't shipped, then we could have iterated on it in this 
issue.  Although sadly git history is forever, the situation can be fixed up by 
using a new issue and changing the CHANTES.txt to refer to it.

> Can we merge small segments during refresh, for faster searching?
> -
>
> Key: LUCENE-8962
> URL: https://issues.apache.org/jira/browse/LUCENE-8962
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Michael McCandless
>Priority: Major
> Fix For: master (9.0), 8.6
>
> Attachments: LUCENE-8962_demo.png, failed-tests.patch, 
> failure_log.txt, test.diff
>
>  Time Spent: 31h
>  Remaining Estimate: 0h
>
> With near-real-time search we ask {{IndexWriter}} to write all in-memory 
> segments to disk and open an {{IndexReader}} to search them, and this is 
> typically a quick operation.
> However, when you use many threads for concurrent indexing, {{IndexWriter}} 
> will accumulate write many small segments during {{refresh}} and this then 
> adds search-time cost as searching must visit all of these tiny segments.
> The merge policy would normally quickly coalesce these small segments if 
> given a little time ... so, could we somehow improve {{IndexWriter'}}s 
> refresh to optionally kick off merge policy to merge segments below some 
> threshold before opening the near-real-time reader?  It'd be a bit tricky 
> because while we are waiting for merges, indexing may continue, and new 
> segments may be flushed, but those new segments shouldn't be included in the 
> point-in-time segments returned by refresh ...
> One could almost do this on top of Lucene today, with a custom merge policy, 
> and some hackity logic to have the merge policy target small segments just 
> written by refresh, but it's tricky to then open a near-real-time reader, 
> excluding newly flushed but including newly merged segments since the refresh 
> originally finished ...
> I'm not yet sure how best to solve this, so I wanted to open an issue for 
> discussion!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?

2020-08-25 Thread Michael Froh (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17184699#comment-17184699
 ] 

Michael Froh commented on LUCENE-8962:
--

8.6 added ability to merge small segments on commit.

The more recent changes add the ability to merge on getReader calls (which is 
what the original issue was asking for -- merging on commit was a slightly 
easier step on the way there).

> Can we merge small segments during refresh, for faster searching?
> -
>
> Key: LUCENE-8962
> URL: https://issues.apache.org/jira/browse/LUCENE-8962
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Michael McCandless
>Priority: Major
> Fix For: master (9.0), 8.6
>
> Attachments: LUCENE-8962_demo.png, failed-tests.patch, 
> failure_log.txt, test.diff
>
>  Time Spent: 31h
>  Remaining Estimate: 0h
>
> With near-real-time search we ask {{IndexWriter}} to write all in-memory 
> segments to disk and open an {{IndexReader}} to search them, and this is 
> typically a quick operation.
> However, when you use many threads for concurrent indexing, {{IndexWriter}} 
> will accumulate write many small segments during {{refresh}} and this then 
> adds search-time cost as searching must visit all of these tiny segments.
> The merge policy would normally quickly coalesce these small segments if 
> given a little time ... so, could we somehow improve {{IndexWriter'}}s 
> refresh to optionally kick off merge policy to merge segments below some 
> threshold before opening the near-real-time reader?  It'd be a bit tricky 
> because while we are waiting for merges, indexing may continue, and new 
> segments may be flushed, but those new segments shouldn't be included in the 
> point-in-time segments returned by refresh ...
> One could almost do this on top of Lucene today, with a custom merge policy, 
> and some hackity logic to have the merge policy target small segments just 
> written by refresh, but it's tricky to then open a near-real-time reader, 
> excluding newly flushed but including newly merged segments since the refresh 
> originally finished ...
> I'm not yet sure how best to solve this, so I wanted to open an issue for 
> discussion!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?

2020-08-25 Thread David Smiley (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17184698#comment-17184698
 ] 

David Smiley commented on LUCENE-8962:
--

I've been following along here (or _trying to_) and I'm confused by the recent 
commits.  LUCENE-8962 was fixed & released into 8.6 which shipped.  Why are new 
commits landing that reference it?

> Can we merge small segments during refresh, for faster searching?
> -
>
> Key: LUCENE-8962
> URL: https://issues.apache.org/jira/browse/LUCENE-8962
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Michael McCandless
>Priority: Major
> Fix For: master (9.0), 8.6
>
> Attachments: LUCENE-8962_demo.png, failed-tests.patch, 
> failure_log.txt, test.diff
>
>  Time Spent: 31h
>  Remaining Estimate: 0h
>
> With near-real-time search we ask {{IndexWriter}} to write all in-memory 
> segments to disk and open an {{IndexReader}} to search them, and this is 
> typically a quick operation.
> However, when you use many threads for concurrent indexing, {{IndexWriter}} 
> will accumulate write many small segments during {{refresh}} and this then 
> adds search-time cost as searching must visit all of these tiny segments.
> The merge policy would normally quickly coalesce these small segments if 
> given a little time ... so, could we somehow improve {{IndexWriter'}}s 
> refresh to optionally kick off merge policy to merge segments below some 
> threshold before opening the near-real-time reader?  It'd be a bit tricky 
> because while we are waiting for merges, indexing may continue, and new 
> segments may be flushed, but those new segments shouldn't be included in the 
> point-in-time segments returned by refresh ...
> One could almost do this on top of Lucene today, with a custom merge policy, 
> and some hackity logic to have the merge policy target small segments just 
> written by refresh, but it's tricky to then open a near-real-time reader, 
> excluding newly flushed but including newly merged segments since the refresh 
> originally finished ...
> I'm not yet sure how best to solve this, so I wanted to open an issue for 
> discussion!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?

2020-08-25 Thread Michael McCandless (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17184679#comment-17184679
 ] 

Michael McCandless commented on LUCENE-8962:


Woot!  Thank you [~simonw]!

> Can we merge small segments during refresh, for faster searching?
> -
>
> Key: LUCENE-8962
> URL: https://issues.apache.org/jira/browse/LUCENE-8962
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Michael McCandless
>Priority: Major
> Fix For: master (9.0), 8.6
>
> Attachments: LUCENE-8962_demo.png, failed-tests.patch, 
> failure_log.txt, test.diff
>
>  Time Spent: 31h
>  Remaining Estimate: 0h
>
> With near-real-time search we ask {{IndexWriter}} to write all in-memory 
> segments to disk and open an {{IndexReader}} to search them, and this is 
> typically a quick operation.
> However, when you use many threads for concurrent indexing, {{IndexWriter}} 
> will accumulate write many small segments during {{refresh}} and this then 
> adds search-time cost as searching must visit all of these tiny segments.
> The merge policy would normally quickly coalesce these small segments if 
> given a little time ... so, could we somehow improve {{IndexWriter'}}s 
> refresh to optionally kick off merge policy to merge segments below some 
> threshold before opening the near-real-time reader?  It'd be a bit tricky 
> because while we are waiting for merges, indexing may continue, and new 
> segments may be flushed, but those new segments shouldn't be included in the 
> point-in-time segments returned by refresh ...
> One could almost do this on top of Lucene today, with a custom merge policy, 
> and some hackity logic to have the merge policy target small segments just 
> written by refresh, but it's tricky to then open a near-real-time reader, 
> excluding newly flushed but including newly merged segments since the refresh 
> originally finished ...
> I'm not yet sure how best to solve this, so I wanted to open an issue for 
> discussion!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?

2020-08-24 Thread ASF subversion and git services (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17183522#comment-17183522
 ] 

ASF subversion and git services commented on LUCENE-8962:
-

Commit 2f2146aad46b5969e6fbc99d85e21f595aba56c9 in lucene-solr's branch 
refs/heads/branch_8x from Simon Willnauer
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=2f2146a ]

LUCENE-8962: Merge segments on getReader (#1623)

Add IndexWriter merge-on-refresh feature to selectively merge 
small segments on getReader, subject to a configurable timeout, 
to improve search performance by reducing the number of small 
segments for searching.

Co-authored-by: Mike McCandless 

> Can we merge small segments during refresh, for faster searching?
> -
>
> Key: LUCENE-8962
> URL: https://issues.apache.org/jira/browse/LUCENE-8962
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Michael McCandless
>Priority: Major
> Fix For: master (9.0), 8.6
>
> Attachments: LUCENE-8962_demo.png, failed-tests.patch, 
> failure_log.txt, test.diff
>
>  Time Spent: 31h
>  Remaining Estimate: 0h
>
> With near-real-time search we ask {{IndexWriter}} to write all in-memory 
> segments to disk and open an {{IndexReader}} to search them, and this is 
> typically a quick operation.
> However, when you use many threads for concurrent indexing, {{IndexWriter}} 
> will accumulate write many small segments during {{refresh}} and this then 
> adds search-time cost as searching must visit all of these tiny segments.
> The merge policy would normally quickly coalesce these small segments if 
> given a little time ... so, could we somehow improve {{IndexWriter'}}s 
> refresh to optionally kick off merge policy to merge segments below some 
> threshold before opening the near-real-time reader?  It'd be a bit tricky 
> because while we are waiting for merges, indexing may continue, and new 
> segments may be flushed, but those new segments shouldn't be included in the 
> point-in-time segments returned by refresh ...
> One could almost do this on top of Lucene today, with a custom merge policy, 
> and some hackity logic to have the merge policy target small segments just 
> written by refresh, but it's tricky to then open a near-real-time reader, 
> excluding newly flushed but including newly merged segments since the refresh 
> originally finished ...
> I'm not yet sure how best to solve this, so I wanted to open an issue for 
> discussion!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?

2020-08-24 Thread ASF subversion and git services (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17183517#comment-17183517
 ] 

ASF subversion and git services commented on LUCENE-8962:
-

Commit 8294e1ae2068aa39e91c25cbaabf62afae40a02e in lucene-solr's branch 
refs/heads/master from Simon Willnauer
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=8294e1a ]

LUCENE-8962: Merge segments on getReader (#1623)

Add IndexWriter merge-on-refresh feature to selectively merge 
small segments on getReader, subject to a configurable timeout, 
to improve search performance by reducing the number of small 
segments for searching.

Co-authored-by: Mike McCandless 

> Can we merge small segments during refresh, for faster searching?
> -
>
> Key: LUCENE-8962
> URL: https://issues.apache.org/jira/browse/LUCENE-8962
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Michael McCandless
>Priority: Major
> Fix For: master (9.0), 8.6
>
> Attachments: LUCENE-8962_demo.png, failed-tests.patch, 
> failure_log.txt, test.diff
>
>  Time Spent: 31h
>  Remaining Estimate: 0h
>
> With near-real-time search we ask {{IndexWriter}} to write all in-memory 
> segments to disk and open an {{IndexReader}} to search them, and this is 
> typically a quick operation.
> However, when you use many threads for concurrent indexing, {{IndexWriter}} 
> will accumulate write many small segments during {{refresh}} and this then 
> adds search-time cost as searching must visit all of these tiny segments.
> The merge policy would normally quickly coalesce these small segments if 
> given a little time ... so, could we somehow improve {{IndexWriter'}}s 
> refresh to optionally kick off merge policy to merge segments below some 
> threshold before opening the near-real-time reader?  It'd be a bit tricky 
> because while we are waiting for merges, indexing may continue, and new 
> segments may be flushed, but those new segments shouldn't be included in the 
> point-in-time segments returned by refresh ...
> One could almost do this on top of Lucene today, with a custom merge policy, 
> and some hackity logic to have the merge policy target small segments just 
> written by refresh, but it's tricky to then open a near-real-time reader, 
> excluding newly flushed but including newly merged segments since the refresh 
> originally finished ...
> I'm not yet sure how best to solve this, so I wanted to open an issue for 
> discussion!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?

2020-06-29 Thread ASF subversion and git services (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17148366#comment-17148366
 ] 

ASF subversion and git services commented on LUCENE-8962:
-

Commit 3377b09fcbf46315ac64350ab259a45b1ec2889c in lucene-solr's branch 
refs/heads/jira/SOLR-14354 from Simon Willnauer
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=3377b09 ]

LUCENE-8962: Ensure we never flush by ram buffer or doc count in test


> Can we merge small segments during refresh, for faster searching?
> -
>
> Key: LUCENE-8962
> URL: https://issues.apache.org/jira/browse/LUCENE-8962
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Michael McCandless
>Priority: Major
> Fix For: master (9.0), 8.6
>
> Attachments: LUCENE-8962_demo.png, failed-tests.patch, 
> failure_log.txt, test.diff
>
>  Time Spent: 23h 20m
>  Remaining Estimate: 0h
>
> With near-real-time search we ask {{IndexWriter}} to write all in-memory 
> segments to disk and open an {{IndexReader}} to search them, and this is 
> typically a quick operation.
> However, when you use many threads for concurrent indexing, {{IndexWriter}} 
> will accumulate write many small segments during {{refresh}} and this then 
> adds search-time cost as searching must visit all of these tiny segments.
> The merge policy would normally quickly coalesce these small segments if 
> given a little time ... so, could we somehow improve {{IndexWriter'}}s 
> refresh to optionally kick off merge policy to merge segments below some 
> threshold before opening the near-real-time reader?  It'd be a bit tricky 
> because while we are waiting for merges, indexing may continue, and new 
> segments may be flushed, but those new segments shouldn't be included in the 
> point-in-time segments returned by refresh ...
> One could almost do this on top of Lucene today, with a custom merge policy, 
> and some hackity logic to have the merge policy target small segments just 
> written by refresh, but it's tricky to then open a near-real-time reader, 
> excluding newly flushed but including newly merged segments since the refresh 
> originally finished ...
> I'm not yet sure how best to solve this, so I wanted to open an issue for 
> discussion!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?

2020-06-28 Thread ASF subversion and git services (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17147239#comment-17147239
 ] 

ASF subversion and git services commented on LUCENE-8962:
-

Commit f8b1d698af981251120e4e96255dbc0ee42e3ed8 in lucene-solr's branch 
refs/heads/branch_8x from Simon Willnauer
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=f8b1d69 ]

LUCENE-8962: Ensure we never flush by ram buffer or doc count in test


> Can we merge small segments during refresh, for faster searching?
> -
>
> Key: LUCENE-8962
> URL: https://issues.apache.org/jira/browse/LUCENE-8962
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Michael McCandless
>Priority: Major
> Fix For: master (9.0), 8.6
>
> Attachments: LUCENE-8962_demo.png, failed-tests.patch, 
> failure_log.txt, test.diff
>
>  Time Spent: 22h 40m
>  Remaining Estimate: 0h
>
> With near-real-time search we ask {{IndexWriter}} to write all in-memory 
> segments to disk and open an {{IndexReader}} to search them, and this is 
> typically a quick operation.
> However, when you use many threads for concurrent indexing, {{IndexWriter}} 
> will accumulate write many small segments during {{refresh}} and this then 
> adds search-time cost as searching must visit all of these tiny segments.
> The merge policy would normally quickly coalesce these small segments if 
> given a little time ... so, could we somehow improve {{IndexWriter'}}s 
> refresh to optionally kick off merge policy to merge segments below some 
> threshold before opening the near-real-time reader?  It'd be a bit tricky 
> because while we are waiting for merges, indexing may continue, and new 
> segments may be flushed, but those new segments shouldn't be included in the 
> point-in-time segments returned by refresh ...
> One could almost do this on top of Lucene today, with a custom merge policy, 
> and some hackity logic to have the merge policy target small segments just 
> written by refresh, but it's tricky to then open a near-real-time reader, 
> excluding newly flushed but including newly merged segments since the refresh 
> originally finished ...
> I'm not yet sure how best to solve this, so I wanted to open an issue for 
> discussion!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?

2020-06-28 Thread ASF subversion and git services (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17147238#comment-17147238
 ] 

ASF subversion and git services commented on LUCENE-8962:
-

Commit 3377b09fcbf46315ac64350ab259a45b1ec2889c in lucene-solr's branch 
refs/heads/master from Simon Willnauer
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=3377b09 ]

LUCENE-8962: Ensure we never flush by ram buffer or doc count in test


> Can we merge small segments during refresh, for faster searching?
> -
>
> Key: LUCENE-8962
> URL: https://issues.apache.org/jira/browse/LUCENE-8962
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Michael McCandless
>Priority: Major
> Fix For: master (9.0), 8.6
>
> Attachments: LUCENE-8962_demo.png, failed-tests.patch, 
> failure_log.txt, test.diff
>
>  Time Spent: 22h 40m
>  Remaining Estimate: 0h
>
> With near-real-time search we ask {{IndexWriter}} to write all in-memory 
> segments to disk and open an {{IndexReader}} to search them, and this is 
> typically a quick operation.
> However, when you use many threads for concurrent indexing, {{IndexWriter}} 
> will accumulate write many small segments during {{refresh}} and this then 
> adds search-time cost as searching must visit all of these tiny segments.
> The merge policy would normally quickly coalesce these small segments if 
> given a little time ... so, could we somehow improve {{IndexWriter'}}s 
> refresh to optionally kick off merge policy to merge segments below some 
> threshold before opening the near-real-time reader?  It'd be a bit tricky 
> because while we are waiting for merges, indexing may continue, and new 
> segments may be flushed, but those new segments shouldn't be included in the 
> point-in-time segments returned by refresh ...
> One could almost do this on top of Lucene today, with a custom merge policy, 
> and some hackity logic to have the merge policy target small segments just 
> written by refresh, but it's tricky to then open a near-real-time reader, 
> excluding newly flushed but including newly merged segments since the refresh 
> originally finished ...
> I'm not yet sure how best to solve this, so I wanted to open an issue for 
> discussion!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?

2020-06-27 Thread ASF subversion and git services (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17147105#comment-17147105
 ] 

ASF subversion and git services commented on LUCENE-8962:
-

Commit 9b1c92809e730a70c1d9bf9e5b65a4b521bddc8b in lucene-solr's branch 
refs/heads/branch_8x from Simon Willnauer
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=9b1c928 ]

LUCENE-8962: Merge small segments on commit (#1617)

Add IndexWriter merge-on-commit feature to selectively merge small segments on 
commit,
subject to a configurable timeout, to improve search performance by reducing 
the number of small
segments for searching.

Co-authored-by: Michael Froh 
Co-authored-by: Michael Sokolov 
Co-authored-by: Mike McCandless 


> Can we merge small segments during refresh, for faster searching?
> -
>
> Key: LUCENE-8962
> URL: https://issues.apache.org/jira/browse/LUCENE-8962
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Michael McCandless
>Priority: Major
> Fix For: master (9.0), 8.6
>
> Attachments: LUCENE-8962_demo.png, failed-tests.patch, 
> failure_log.txt, test.diff
>
>  Time Spent: 22h 10m
>  Remaining Estimate: 0h
>
> With near-real-time search we ask {{IndexWriter}} to write all in-memory 
> segments to disk and open an {{IndexReader}} to search them, and this is 
> typically a quick operation.
> However, when you use many threads for concurrent indexing, {{IndexWriter}} 
> will accumulate write many small segments during {{refresh}} and this then 
> adds search-time cost as searching must visit all of these tiny segments.
> The merge policy would normally quickly coalesce these small segments if 
> given a little time ... so, could we somehow improve {{IndexWriter'}}s 
> refresh to optionally kick off merge policy to merge segments below some 
> threshold before opening the near-real-time reader?  It'd be a bit tricky 
> because while we are waiting for merges, indexing may continue, and new 
> segments may be flushed, but those new segments shouldn't be included in the 
> point-in-time segments returned by refresh ...
> One could almost do this on top of Lucene today, with a custom merge policy, 
> and some hackity logic to have the merge policy target small segments just 
> written by refresh, but it's tricky to then open a near-real-time reader, 
> excluding newly flushed but including newly merged segments since the refresh 
> originally finished ...
> I'm not yet sure how best to solve this, so I wanted to open an issue for 
> discussion!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?

2020-06-27 Thread ASF subversion and git services (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17147100#comment-17147100
 ] 

ASF subversion and git services commented on LUCENE-8962:
-

Commit fb3c5d23537c650f7c0ba020f5e08f128e0a9b5a in lucene-solr's branch 
refs/heads/master from Simon Willnauer
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=fb3c5d2 ]

LUCENE-8962: Fix changes entry. This feature is added to 8.6


> Can we merge small segments during refresh, for faster searching?
> -
>
> Key: LUCENE-8962
> URL: https://issues.apache.org/jira/browse/LUCENE-8962
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Michael McCandless
>Priority: Major
> Fix For: 8.6
>
> Attachments: LUCENE-8962_demo.png, failed-tests.patch, 
> failure_log.txt, test.diff
>
>  Time Spent: 22h 10m
>  Remaining Estimate: 0h
>
> With near-real-time search we ask {{IndexWriter}} to write all in-memory 
> segments to disk and open an {{IndexReader}} to search them, and this is 
> typically a quick operation.
> However, when you use many threads for concurrent indexing, {{IndexWriter}} 
> will accumulate write many small segments during {{refresh}} and this then 
> adds search-time cost as searching must visit all of these tiny segments.
> The merge policy would normally quickly coalesce these small segments if 
> given a little time ... so, could we somehow improve {{IndexWriter'}}s 
> refresh to optionally kick off merge policy to merge segments below some 
> threshold before opening the near-real-time reader?  It'd be a bit tricky 
> because while we are waiting for merges, indexing may continue, and new 
> segments may be flushed, but those new segments shouldn't be included in the 
> point-in-time segments returned by refresh ...
> One could almost do this on top of Lucene today, with a custom merge policy, 
> and some hackity logic to have the merge policy target small segments just 
> written by refresh, but it's tricky to then open a near-real-time reader, 
> excluding newly flushed but including newly merged segments since the refresh 
> originally finished ...
> I'm not yet sure how best to solve this, so I wanted to open an issue for 
> discussion!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?

2020-06-27 Thread ASF subversion and git services (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17147099#comment-17147099
 ] 

ASF subversion and git services commented on LUCENE-8962:
-

Commit 7f352a96659b4d304ae3ef0c7a1b62784ce2a406 in lucene-solr's branch 
refs/heads/master from Simon Willnauer
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=7f352a9 ]

LUCENE-8962: Merge small segments on commit (#1617)

Add IndexWriter merge-on-commit feature to selectively merge small segments on 
commit,
subject to a configurable timeout, to improve search performance by reducing 
the number of small
segments for searching.

Co-authored-by: Michael Froh 
Co-authored-by: Michael Sokolov 
Co-authored-by: Mike McCandless 

> Can we merge small segments during refresh, for faster searching?
> -
>
> Key: LUCENE-8962
> URL: https://issues.apache.org/jira/browse/LUCENE-8962
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Michael McCandless
>Priority: Major
> Fix For: 8.6
>
> Attachments: LUCENE-8962_demo.png, failed-tests.patch, 
> failure_log.txt, test.diff
>
>  Time Spent: 22h 10m
>  Remaining Estimate: 0h
>
> With near-real-time search we ask {{IndexWriter}} to write all in-memory 
> segments to disk and open an {{IndexReader}} to search them, and this is 
> typically a quick operation.
> However, when you use many threads for concurrent indexing, {{IndexWriter}} 
> will accumulate write many small segments during {{refresh}} and this then 
> adds search-time cost as searching must visit all of these tiny segments.
> The merge policy would normally quickly coalesce these small segments if 
> given a little time ... so, could we somehow improve {{IndexWriter'}}s 
> refresh to optionally kick off merge policy to merge segments below some 
> threshold before opening the near-real-time reader?  It'd be a bit tricky 
> because while we are waiting for merges, indexing may continue, and new 
> segments may be flushed, but those new segments shouldn't be included in the 
> point-in-time segments returned by refresh ...
> One could almost do this on top of Lucene today, with a custom merge policy, 
> and some hackity logic to have the merge policy target small segments just 
> written by refresh, but it's tricky to then open a near-real-time reader, 
> excluding newly flushed but including newly merged segments since the refresh 
> originally finished ...
> I'm not yet sure how best to solve this, so I wanted to open an issue for 
> discussion!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?

2020-06-26 Thread Michael McCandless (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17146689#comment-17146689
 ] 

Michael McCandless commented on LUCENE-8962:


OK it looks to me like this failure is pre-existing, because it got 
{{MockRandomMergePolicy}} which randomly decided to merge the two segments the 
test wrote into one segment, *after* (not during!) the second commit.

I'll fix the test to use a deterministic {{MergePolicy}}.

> Can we merge small segments during refresh, for faster searching?
> -
>
> Key: LUCENE-8962
> URL: https://issues.apache.org/jira/browse/LUCENE-8962
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Michael McCandless
>Priority: Major
> Fix For: 8.6
>
> Attachments: LUCENE-8962_demo.png, failed-tests.patch, 
> failure_log.txt, test.diff
>
>  Time Spent: 21h 50m
>  Remaining Estimate: 0h
>
> With near-real-time search we ask {{IndexWriter}} to write all in-memory 
> segments to disk and open an {{IndexReader}} to search them, and this is 
> typically a quick operation.
> However, when you use many threads for concurrent indexing, {{IndexWriter}} 
> will accumulate write many small segments during {{refresh}} and this then 
> adds search-time cost as searching must visit all of these tiny segments.
> The merge policy would normally quickly coalesce these small segments if 
> given a little time ... so, could we somehow improve {{IndexWriter'}}s 
> refresh to optionally kick off merge policy to merge segments below some 
> threshold before opening the near-real-time reader?  It'd be a bit tricky 
> because while we are waiting for merges, indexing may continue, and new 
> segments may be flushed, but those new segments shouldn't be included in the 
> point-in-time segments returned by refresh ...
> One could almost do this on top of Lucene today, with a custom merge policy, 
> and some hackity logic to have the merge policy target small segments just 
> written by refresh, but it's tricky to then open a near-real-time reader, 
> excluding newly flushed but including newly merged segments since the refresh 
> originally finished ...
> I'm not yet sure how best to solve this, so I wanted to open an issue for 
> discussion!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?

2020-06-26 Thread Michael McCandless (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17146683#comment-17146683
 ] 

Michael McCandless commented on LUCENE-8962:


LOL, that's awesome.

Hmm, beasting just hit another failure in same source:
{noformat}
[junit4:pickseed] Seed property 'tests.seed' already defined: BC66D44FB7C99235
   [junit4]  says salut! Master seed: BC66D44FB7C99235
   [junit4] Executing 1 suite with 1 JVM.
   [junit4]
   [junit4] Started J0 PID(1739111@localhost).
   [junit4] Suite: org.apache.lucene.search.TestPhraseWildcardQuery
   [junit4]   2> NOTE: reproduce with: ant test  
-Dtestcase=TestPhraseWildcardQuery -Dtests.method=testMaxExpansions 
-Dtests.seed=BC66D44FB7C99235 -Dtests.slow=true -Dtests.badapples=true \
-Dtests.locale=es -Dtests.timezone=Africa/Douala -Dtests.asserts=true 
-Dtests.file.encoding=UTF-8
   [junit4] FAILURE 0.22s | TestPhraseWildcardQuery.testMaxExpansions <<<
   [junit4]    > Throwable #1: java.lang.AssertionError: test test relies on 2 
segments expected:<2> but was:<1>
   [junit4]    >        at 
__randomizedtesting.SeedInfo.seed([BC66D44FB7C99235:83D14835E8C05BA3]:0)
   [junit4]    >        at 
org.apache.lucene.search.TestPhraseWildcardQuery.setUp(TestPhraseWildcardQuery.java:76)
   [junit4]    >        at 
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   [junit4]    >        at 
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
   [junit4]    >        at 
java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   [junit4]    >        at 
java.base/java.lang.reflect.Method.invoke(Method.java:566)
   [junit4]    >        at java.base/java.lang.Thread.run(Thread.java:834)
   [junit4]   2> NOTE: test params are: codec=Asserting(Lucene86), 
sim=Asserting(RandomSimilarity(queryNorm=true): {other=DFR I(n)B2, 
author=org.apache.lucene.search.similarities.BooleanSi\
milarity@3dcd0436, title=DFR GL3(800.0), category=DFR I(F)LZ(0.3)}), locale=es, 
timezone=Africa/Douala
   [junit4]   2> NOTE: Linux 5.5.6-arch1-1 amd64/Oracle Corporation 11.0.6 
(64-bit)/cpus=128,threads=1,free=477476440,total=536870912
   [junit4]   2> NOTE: All tests run in this JVM: [TestPhraseWildcardQuery]
   [junit4] Completed [1/1 (1!)] in 0.46s, 1 test, 1 failure <<< FAILURES!
   [junit4]
   [junit4]
   [junit4] Tests with failures [seed: BC66D44FB7C99235]:
   [junit4]   - 
org.apache.lucene.search.TestPhraseWildcardQuery.testMaxExpansions
   [junit4]
   [junit4]
   [junit4] JVM J0:     0.34 ..     1.19 =     0.85s
   [junit4] Execution time total: 1.20 sec.
   [junit4] Tests summary: 1 suite, 1 test, 1 failure {noformat}
which is odd because you fix (in {{.setUp}}) should have also fixed this one.

> Can we merge small segments during refresh, for faster searching?
> -
>
> Key: LUCENE-8962
> URL: https://issues.apache.org/jira/browse/LUCENE-8962
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Michael McCandless
>Priority: Major
> Fix For: 8.6
>
> Attachments: LUCENE-8962_demo.png, failed-tests.patch, 
> failure_log.txt, test.diff
>
>  Time Spent: 21h 50m
>  Remaining Estimate: 0h
>
> With near-real-time search we ask {{IndexWriter}} to write all in-memory 
> segments to disk and open an {{IndexReader}} to search them, and this is 
> typically a quick operation.
> However, when you use many threads for concurrent indexing, {{IndexWriter}} 
> will accumulate write many small segments during {{refresh}} and this then 
> adds search-time cost as searching must visit all of these tiny segments.
> The merge policy would normally quickly coalesce these small segments if 
> given a little time ... so, could we somehow improve {{IndexWriter'}}s 
> refresh to optionally kick off merge policy to merge segments below some 
> threshold before opening the near-real-time reader?  It'd be a bit tricky 
> because while we are waiting for merges, indexing may continue, and new 
> segments may be flushed, but those new segments shouldn't be included in the 
> point-in-time segments returned by refresh ...
> One could almost do this on top of Lucene today, with a custom merge policy, 
> and some hackity logic to have the merge policy target small segments just 
> written by refresh, but it's tricky to then open a near-real-time reader, 
> excluding newly flushed but including newly merged segments since the refresh 
> originally finished ...
> I'm not yet sure how best to solve this, so I wanted to open an issue for 
> discussion!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsu

[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?

2020-06-26 Thread Simon Willnauer (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17146635#comment-17146635
 ] 

Simon Willnauer commented on LUCENE-8962:
-

I did push a fix. it's a test bug, this test relies on 2 segments that got 
merged away by this feature :)

> Can we merge small segments during refresh, for faster searching?
> -
>
> Key: LUCENE-8962
> URL: https://issues.apache.org/jira/browse/LUCENE-8962
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Michael McCandless
>Priority: Major
> Fix For: 8.6
>
> Attachments: LUCENE-8962_demo.png, failed-tests.patch, 
> failure_log.txt, test.diff
>
>  Time Spent: 21h 50m
>  Remaining Estimate: 0h
>
> With near-real-time search we ask {{IndexWriter}} to write all in-memory 
> segments to disk and open an {{IndexReader}} to search them, and this is 
> typically a quick operation.
> However, when you use many threads for concurrent indexing, {{IndexWriter}} 
> will accumulate write many small segments during {{refresh}} and this then 
> adds search-time cost as searching must visit all of these tiny segments.
> The merge policy would normally quickly coalesce these small segments if 
> given a little time ... so, could we somehow improve {{IndexWriter'}}s 
> refresh to optionally kick off merge policy to merge segments below some 
> threshold before opening the near-real-time reader?  It'd be a bit tricky 
> because while we are waiting for merges, indexing may continue, and new 
> segments may be flushed, but those new segments shouldn't be included in the 
> point-in-time segments returned by refresh ...
> One could almost do this on top of Lucene today, with a custom merge policy, 
> and some hackity logic to have the merge policy target small segments just 
> written by refresh, but it's tricky to then open a near-real-time reader, 
> excluding newly flushed but including newly merged segments since the refresh 
> originally finished ...
> I'm not yet sure how best to solve this, so I wanted to open an issue for 
> discussion!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?

2020-06-26 Thread Michael McCandless (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17146633#comment-17146633
 ] 

Michael McCandless commented on LUCENE-8962:


Thanks [~simonw], PR looks great.  I left a couple comments and pushed some 
small fixes.

I started beasting, 37 full iterations of Lucene core + modules tests.  It 
uncovered this curious failure, which does reproduce:
{noformat}
org.apache.lucene.search.TestPhraseWildcardQuery > test suite's output saved to 
/home/mike/src/simon/lucene/sandbox/build/test-results/test/outputs/OUTPUT-org.apache.lucene.search.TestPhraseWildcardQuery.txt,
 copied below:
   >     java.lang.AssertionError: expected:<1> but was:<0>
   >         at 
__randomizedtesting.SeedInfo.seed([8FCBBE65B983A2FA:393FF384952858B5]:0)
   >         at org.junit.Assert.fail(Assert.java:88)
   >         at org.junit.Assert.failNotEquals(Assert.java:834)
   >         at org.junit.Assert.assertEquals(Assert.java:645)
   >         at org.junit.Assert.assertEquals(Assert.java:631)
   >         at 
org.apache.lucene.search.TestPhraseWildcardQuery.testExplain(TestPhraseWildcardQuery.java:259)
   >         at 
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   >         at 
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
   >         at 
java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   >         at java.base/java.lang.reflect.Method.invoke(Method.java:566)
   >         at 
com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1754)
   >         at 
com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:942)
   >         at 
com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:978)
   >         at 
com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:992)
   >         at 
org.apache.lucene.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:49)
   >         at 
org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
   >         at 
org.apache.lucene.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:48)
   >         at 
org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64)
   >         at 
org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47)
   >         at 
com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
   >         at 
com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:370)
   >         at 
com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:819)
   >         at 
com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:470)
   >         at 
com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:951)
   >         at 
com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:836)
   >         at 
com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:887)
   >         at 
com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:898)
   >         at 
org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
   >         at 
com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
   >         at 
org.apache.lucene.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:41)
   >         at 
com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
   >         at 
com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
   >         at 
com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
   >         at 
com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
   >         at 
org.apache.lucene.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
   >         at 
org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47)
   >         at 
org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64)
   >         at 
org.apache.lucene.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:54)
   >         at 
com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
   >         at 
com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(

[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?

2020-06-26 Thread Simon Willnauer (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17146321#comment-17146321
 ] 

Simon Willnauer commented on LUCENE-8962:
-

I think we are ready for review I opened a PR here 
https://github.com/apache/lucene-solr/pull/1617

> Can we merge small segments during refresh, for faster searching?
> -
>
> Key: LUCENE-8962
> URL: https://issues.apache.org/jira/browse/LUCENE-8962
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Michael McCandless
>Priority: Major
> Fix For: 8.6
>
> Attachments: LUCENE-8962_demo.png, failed-tests.patch, 
> failure_log.txt, test.diff
>
>  Time Spent: 20h 40m
>  Remaining Estimate: 0h
>
> With near-real-time search we ask {{IndexWriter}} to write all in-memory 
> segments to disk and open an {{IndexReader}} to search them, and this is 
> typically a quick operation.
> However, when you use many threads for concurrent indexing, {{IndexWriter}} 
> will accumulate write many small segments during {{refresh}} and this then 
> adds search-time cost as searching must visit all of these tiny segments.
> The merge policy would normally quickly coalesce these small segments if 
> given a little time ... so, could we somehow improve {{IndexWriter'}}s 
> refresh to optionally kick off merge policy to merge segments below some 
> threshold before opening the near-real-time reader?  It'd be a bit tricky 
> because while we are waiting for merges, indexing may continue, and new 
> segments may be flushed, but those new segments shouldn't be included in the 
> point-in-time segments returned by refresh ...
> One could almost do this on top of Lucene today, with a custom merge policy, 
> and some hackity logic to have the merge policy target small segments just 
> written by refresh, but it's tricky to then open a near-real-time reader, 
> excluding newly flushed but including newly merged segments since the refresh 
> originally finished ...
> I'm not yet sure how best to solve this, so I wanted to open an issue for 
> discussion!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?

2020-06-26 Thread Simon Willnauer (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17146286#comment-17146286
 ] 

Simon Willnauer commented on LUCENE-8962:
-

[~mikemccand] I pushed fixes for both failures

> Can we merge small segments during refresh, for faster searching?
> -
>
> Key: LUCENE-8962
> URL: https://issues.apache.org/jira/browse/LUCENE-8962
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Michael McCandless
>Priority: Major
> Fix For: 8.6
>
> Attachments: LUCENE-8962_demo.png, failed-tests.patch, 
> failure_log.txt, test.diff
>
>  Time Spent: 20h 20m
>  Remaining Estimate: 0h
>
> With near-real-time search we ask {{IndexWriter}} to write all in-memory 
> segments to disk and open an {{IndexReader}} to search them, and this is 
> typically a quick operation.
> However, when you use many threads for concurrent indexing, {{IndexWriter}} 
> will accumulate write many small segments during {{refresh}} and this then 
> adds search-time cost as searching must visit all of these tiny segments.
> The merge policy would normally quickly coalesce these small segments if 
> given a little time ... so, could we somehow improve {{IndexWriter'}}s 
> refresh to optionally kick off merge policy to merge segments below some 
> threshold before opening the near-real-time reader?  It'd be a bit tricky 
> because while we are waiting for merges, indexing may continue, and new 
> segments may be flushed, but those new segments shouldn't be included in the 
> point-in-time segments returned by refresh ...
> One could almost do this on top of Lucene today, with a custom merge policy, 
> and some hackity logic to have the merge policy target small segments just 
> written by refresh, but it's tricky to then open a near-real-time reader, 
> excluding newly flushed but including newly merged segments since the refresh 
> originally finished ...
> I'm not yet sure how best to solve this, so I wanted to open an issue for 
> discussion!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?

2020-06-25 Thread Michael McCandless (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17145871#comment-17145871
 ] 

Michael McCandless commented on LUCENE-8962:


Ooh, found another failing test (after 341 full iterations), and this one does 
reproduce!

 
{noformat}
[junit4:pickseed] Seed property 'tests.seed' already defined: D75F6483D2E6C62C
   [junit4]  says ¡Hola! Master seed: D75F6483D2E6C62C
   [junit4] Executing 1 suite with 1 JVM.
   [junit4]
   [junit4] Started J0 PID(3328881@localhost).
   [junit4] Suite: org.apache.lucene.index.TestIndexWriterMergePolicy
   [junit4]   2> NOTE: reproduce with: ant test  
-Dtestcase=TestIndexWriterMergePolicy -Dtests.method=testMergeOnCommit 
-Dtests.seed=D75F6483D2E6C62C -Dtests.slow=true -Dtests.badapples=tr\
ue -Dtests.locale=en-CM -Dtests.timezone=Eire -Dtests.asserts=true 
-Dtests.file.encoding=UTF-8
   [junit4] FAILURE 0.27s | TestIndexWriterMergePolicy.testMergeOnCommit <<<
   [junit4]    > Throwable #1: java.lang.AssertionError: expected:<1> but 
was:<6>
   [junit4]    >        at 
__randomizedtesting.SeedInfo.seed([D75F6483D2E6C62C:F3D97E843303240C]:0)
   [junit4]    >        at 
org.apache.lucene.index.TestIndexWriterMergePolicy.testMergeOnCommit(TestIndexWriterMergePolicy.java:340)
   [junit4]    >        at 
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   [junit4]    >        at 
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
   [junit4]    >        at 
java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   [junit4]    >        at 
java.base/java.lang.reflect.Method.invoke(Method.java:566)
   [junit4]    >        at java.base/java.lang.Thread.run(Thread.java:834)
   [junit4]   2> Jun 25, 2020 5:41:39 PM 
com.carrotsearch.randomizedtesting.ThreadLeakControl checkThreadLeaks
   [junit4]   2> WARNING: Will linger awaiting termination of 1 leaked 
thread(s).
   [junit4]   2> NOTE: test params are: codec=CheapBastard, 
sim=Asserting(RandomSimilarity(queryNorm=true): {content=DFI(ChiSquared)}), 
locale=en-CM, timezone=Eire
   [junit4]   2> NOTE: Linux 5.5.6-arch1-1 amd64/Oracle Corporation 11.0.6 
(64-bit)/cpus=128,threads=1,free=477412792,total=536870912
   [junit4]   2> NOTE: All tests run in this JVM: [TestIndexWriterMergePolicy]
   [junit4] Completed [1/1 (1!)] in 0.56s, 1 test, 1 failure <<< FAILURES!
   [junit4]
   [junit4]
   [junit4] Tests with failures [seed: D75F6483D2E6C62C]:
   [junit4]   - 
org.apache.lucene.index.TestIndexWriterMergePolicy.testMergeOnCommit
   [junit4]
   [junit4]
   [junit4] JVM J0:     0.34 ..     1.45 =     1.11s
   [junit4] Execution time total: 1.45 sec.
   [junit4] Tests summary: 1 suite, 1 test, 1 failure {noformat}

> Can we merge small segments during refresh, for faster searching?
> -
>
> Key: LUCENE-8962
> URL: https://issues.apache.org/jira/browse/LUCENE-8962
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Michael McCandless
>Priority: Major
> Fix For: 8.6
>
> Attachments: LUCENE-8962_demo.png, failed-tests.patch, 
> failure_log.txt, test.diff
>
>  Time Spent: 20h 20m
>  Remaining Estimate: 0h
>
> With near-real-time search we ask {{IndexWriter}} to write all in-memory 
> segments to disk and open an {{IndexReader}} to search them, and this is 
> typically a quick operation.
> However, when you use many threads for concurrent indexing, {{IndexWriter}} 
> will accumulate write many small segments during {{refresh}} and this then 
> adds search-time cost as searching must visit all of these tiny segments.
> The merge policy would normally quickly coalesce these small segments if 
> given a little time ... so, could we somehow improve {{IndexWriter'}}s 
> refresh to optionally kick off merge policy to merge segments below some 
> threshold before opening the near-real-time reader?  It'd be a bit tricky 
> because while we are waiting for merges, indexing may continue, and new 
> segments may be flushed, but those new segments shouldn't be included in the 
> point-in-time segments returned by refresh ...
> One could almost do this on top of Lucene today, with a custom merge policy, 
> and some hackity logic to have the merge policy target small segments just 
> written by refresh, but it's tricky to then open a near-real-time reader, 
> excluding newly flushed but including newly merged segments since the refresh 
> originally finished ...
> I'm not yet sure how best to solve this, so I wanted to open an issue for 
> discussion!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubs

[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?

2020-06-25 Thread Michael McCandless (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17145716#comment-17145716
 ] 

Michael McCandless commented on LUCENE-8962:


Beasting the branch (all Lucene core + module tests), 216 iterations, and hit 
one test failure:
{noformat}

   >     TEST thread 1: got interrupt
   >     org.apache.lucene.util.ThreadInterruptedException: 
java.lang.InterruptedException: sleep interrupted
   >            at 
org.apache.lucene.store.SlowOpeningMockIndexInputWrapper.(SlowOpeningMockIndexInputWrapper.java:38)
   >            at 
org.apache.lucene.store.MockDirectoryWrapper.openInput(MockDirectoryWrapper.java:769)
   >            at 
org.apache.lucene.store.Directory.openChecksumInput(Directory.java:157)
   >            at 
org.apache.lucene.store.MockDirectoryWrapper.openChecksumInput(MockDirectoryWrapper.java:1044)
   >            at 
org.apache.lucene.codecs.compressing.FieldsIndexReader.(FieldsIndexReader.java:53)
   >            at 
org.apache.lucene.codecs.compressing.CompressingStoredFieldsReader.(CompressingStoredFieldsReader.java:166)
   >            at 
org.apache.lucene.codecs.compressing.CompressingStoredFieldsFormat.fieldsReader(CompressingStoredFieldsFormat.java:123)
   >            at 
org.apache.lucene.codecs.lucene50.Lucene50StoredFieldsFormat.fieldsReader(Lucene50StoredFieldsFormat.java:135)
   >            at 
org.apache.lucene.codecs.asserting.AssertingStoredFieldsFormat.fieldsReader(AssertingStoredFieldsFormat.java:43)
   >            at 
org.apache.lucene.index.SegmentCoreReaders.(SegmentCoreReaders.java:127)
   >            at 
org.apache.lucene.index.SegmentReader.(SegmentReader.java:83)
   >            at 
org.apache.lucene.index.ReadersAndUpdates.getReader(ReadersAndUpdates.java:171)
   >            at 
org.apache.lucene.index.ReadersAndUpdates.getReaderForMerge(ReadersAndUpdates.java:714)
   >            at 
org.apache.lucene.index.IndexWriter.lambda$prepareOnCommitMerge$6(IndexWriter.java:3382)
   >            at 
org.apache.lucene.index.MergePolicy$OneMerge.initMergeReaders(MergePolicy.java:438)
   >            at 
org.apache.lucene.index.IndexWriter$2.initMergeReaders(IndexWriter.java:3361)
   >            at 
org.apache.lucene.index.IndexWriter.prepareOnCommitMerge(IndexWriter.java:3376)
   >            at 
org.apache.lucene.index.IndexWriter.prepareCommitInternal(IndexWriter.java:3247)
   >            at 
org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:3547)
   >            at 
org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:3505)
   >            at 
org.apache.lucene.index.TestIndexWriter$IndexerThreadInterrupt.run(TestIndexWriter.java:1004)
   >     Caused by: java.lang.InterruptedException: sleep interrupted
   >            at java.base/java.lang.Thread.sleep(Native Method)
   >            at 
org.apache.lucene.store.SlowOpeningMockIndexInputWrapper.(SlowOpeningMockIndexInputWrapper.java:33)
   >            ... 20 more
   >     thread 1 FAILED; unexpected exception
   >     java.lang.IllegalStateException: this writer hit an unrecoverable 
error; cannot commit
   >            at 
org.apache.lucene.index.IndexWriter.prepareCommitInternal(IndexWriter.java:3170)
   >            at 
org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:3547)
   >            at 
org.apache.lucene.index.IndexWriter.shutdown(IndexWriter.java:1099)
   >            at 
org.apache.lucene.index.IndexWriter.close(IndexWriter.java:1140)
   >            at 
org.apache.lucene.index.TestIndexWriter$IndexerThreadInterrupt.run(TestIndexWriter.java:942)
   >     Caused by: java.lang.AssertionError: can't be done and not suppressing 
exceptions
   >            at 
org.apache.lucene.index.IndexWriter.closeMergeReaders(IndexWriter.java:4426)
   >            at 
org.apache.lucene.index.IndexWriter.commitMerge(IndexWriter.java:4075)
   >            at 
org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4581)
   >            at 
org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:4140)
   >            at 
org.apache.lucene.index.IndexWriter$IndexWriterMergeSource.merge(IndexWriter.java:5692)
   >            at 
org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:624)
   >            at 
org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:682)
   >     index files: [_42.fdm, _42.fdt, _42.fdx, _42.fnm, _42.nvd, _42.nvm, 
_42.si, _42_Lucene80_0.dvd, _42_Lucene80_0.dvm, _42_Lucene84_0.doc, 
_42_Lucene84_0.pos, _42_Lucene84_0.tim, _42\
_Lucene84_0.tip, _42_Lucene84_0.tmd, _42_TestBloomFilteredLucenePostings_0.blm, 
_42_TestBloomFilteredLucenePostings_0.doc, 
_42_TestBloomFilteredLucenePostings_0.pos, _42_TestBloomFilteredL\
ucenePostings_0.tim, _42_TestBloomFilteredLucenePostings_0.tip, 
_42_TestBloomFilteredLucenePostings_0.tmd, _47.fdm, _47.fdt,

[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?

2020-06-25 Thread Michael Sokolov (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17144969#comment-17144969
 ] 

Michael Sokolov commented on LUCENE-8962:
-

Yes, thanks Simon! I'll take a look too; hopefully I learn something :)

> Can we merge small segments during refresh, for faster searching?
> -
>
> Key: LUCENE-8962
> URL: https://issues.apache.org/jira/browse/LUCENE-8962
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Michael McCandless
>Priority: Major
> Fix For: 8.6
>
> Attachments: LUCENE-8962_demo.png, failed-tests.patch, 
> failure_log.txt, test.diff
>
>  Time Spent: 20h 20m
>  Remaining Estimate: 0h
>
> With near-real-time search we ask {{IndexWriter}} to write all in-memory 
> segments to disk and open an {{IndexReader}} to search them, and this is 
> typically a quick operation.
> However, when you use many threads for concurrent indexing, {{IndexWriter}} 
> will accumulate write many small segments during {{refresh}} and this then 
> adds search-time cost as searching must visit all of these tiny segments.
> The merge policy would normally quickly coalesce these small segments if 
> given a little time ... so, could we somehow improve {{IndexWriter'}}s 
> refresh to optionally kick off merge policy to merge segments below some 
> threshold before opening the near-real-time reader?  It'd be a bit tricky 
> because while we are waiting for merges, indexing may continue, and new 
> segments may be flushed, but those new segments shouldn't be included in the 
> point-in-time segments returned by refresh ...
> One could almost do this on top of Lucene today, with a custom merge policy, 
> and some hackity logic to have the merge policy target small segments just 
> written by refresh, but it's tricky to then open a near-real-time reader, 
> excluding newly flushed but including newly merged segments since the refresh 
> originally finished ...
> I'm not yet sure how best to solve this, so I wanted to open an issue for 
> discussion!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?

2020-06-25 Thread Michael McCandless (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17144904#comment-17144904
 ] 

Michael McCandless commented on LUCENE-8962:


Thanks [~simonw].  I will have a look at the latest & greatest, and kick off 
beasting on my 128 core Ryzen Threadripper box :)

> Can we merge small segments during refresh, for faster searching?
> -
>
> Key: LUCENE-8962
> URL: https://issues.apache.org/jira/browse/LUCENE-8962
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Michael McCandless
>Priority: Major
> Fix For: 8.6
>
> Attachments: LUCENE-8962_demo.png, failed-tests.patch, 
> failure_log.txt, test.diff
>
>  Time Spent: 20h 20m
>  Remaining Estimate: 0h
>
> With near-real-time search we ask {{IndexWriter}} to write all in-memory 
> segments to disk and open an {{IndexReader}} to search them, and this is 
> typically a quick operation.
> However, when you use many threads for concurrent indexing, {{IndexWriter}} 
> will accumulate write many small segments during {{refresh}} and this then 
> adds search-time cost as searching must visit all of these tiny segments.
> The merge policy would normally quickly coalesce these small segments if 
> given a little time ... so, could we somehow improve {{IndexWriter'}}s 
> refresh to optionally kick off merge policy to merge segments below some 
> threshold before opening the near-real-time reader?  It'd be a bit tricky 
> because while we are waiting for merges, indexing may continue, and new 
> segments may be flushed, but those new segments shouldn't be included in the 
> point-in-time segments returned by refresh ...
> One could almost do this on top of Lucene today, with a custom merge policy, 
> and some hackity logic to have the merge policy target small segments just 
> written by refresh, but it's tricky to then open a near-real-time reader, 
> excluding newly flushed but including newly merged segments since the refresh 
> originally finished ...
> I'm not yet sure how best to solve this, so I wanted to open an issue for 
> discussion!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?

2020-06-24 Thread Simon Willnauer (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17144366#comment-17144366
 ] 

Simon Willnauer commented on LUCENE-8962:
-

[~sokolov] thanks, I updated the branch with the approach that I think can 
work. I added quite some changes that you folks should catch up on. I think we 
need to do more on the testing end with this feature being explicitly enabled. 
I already pushed some tests for it. I would appreciate if you and [~mikemccand] 
could spend some time looking that the work that I did. I am happy to open a PR 
do do the reviews but maybe take a look at it first? 

> Can we merge small segments during refresh, for faster searching?
> -
>
> Key: LUCENE-8962
> URL: https://issues.apache.org/jira/browse/LUCENE-8962
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Michael McCandless
>Priority: Major
> Fix For: 8.6
>
> Attachments: LUCENE-8962_demo.png, failed-tests.patch, 
> failure_log.txt, test.diff
>
>  Time Spent: 20h 20m
>  Remaining Estimate: 0h
>
> With near-real-time search we ask {{IndexWriter}} to write all in-memory 
> segments to disk and open an {{IndexReader}} to search them, and this is 
> typically a quick operation.
> However, when you use many threads for concurrent indexing, {{IndexWriter}} 
> will accumulate write many small segments during {{refresh}} and this then 
> adds search-time cost as searching must visit all of these tiny segments.
> The merge policy would normally quickly coalesce these small segments if 
> given a little time ... so, could we somehow improve {{IndexWriter'}}s 
> refresh to optionally kick off merge policy to merge segments below some 
> threshold before opening the near-real-time reader?  It'd be a bit tricky 
> because while we are waiting for merges, indexing may continue, and new 
> segments may be flushed, but those new segments shouldn't be included in the 
> point-in-time segments returned by refresh ...
> One could almost do this on top of Lucene today, with a custom merge policy, 
> and some hackity logic to have the merge policy target small segments just 
> written by refresh, but it's tricky to then open a near-real-time reader, 
> excluding newly flushed but including newly merged segments since the refresh 
> originally finished ...
> I'm not yet sure how best to solve this, so I wanted to open an issue for 
> discussion!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?

2020-06-24 Thread Michael Sokolov (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17143890#comment-17143890
 ] 

Michael Sokolov commented on LUCENE-8962:
-

[~simonw] I think we can continue to use 
https://github.com/apache/lucene-solr/tree/jira/lucene-8962? But perhaps with a 
new PR

> Can we merge small segments during refresh, for faster searching?
> -
>
> Key: LUCENE-8962
> URL: https://issues.apache.org/jira/browse/LUCENE-8962
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Michael McCandless
>Priority: Major
> Fix For: 8.6
>
> Attachments: LUCENE-8962_demo.png, failed-tests.patch, 
> failure_log.txt, test.diff
>
>  Time Spent: 19h 40m
>  Remaining Estimate: 0h
>
> With near-real-time search we ask {{IndexWriter}} to write all in-memory 
> segments to disk and open an {{IndexReader}} to search them, and this is 
> typically a quick operation.
> However, when you use many threads for concurrent indexing, {{IndexWriter}} 
> will accumulate write many small segments during {{refresh}} and this then 
> adds search-time cost as searching must visit all of these tiny segments.
> The merge policy would normally quickly coalesce these small segments if 
> given a little time ... so, could we somehow improve {{IndexWriter'}}s 
> refresh to optionally kick off merge policy to merge segments below some 
> threshold before opening the near-real-time reader?  It'd be a bit tricky 
> because while we are waiting for merges, indexing may continue, and new 
> segments may be flushed, but those new segments shouldn't be included in the 
> point-in-time segments returned by refresh ...
> One could almost do this on top of Lucene today, with a custom merge policy, 
> and some hackity logic to have the merge policy target small segments just 
> written by refresh, but it's tricky to then open a near-real-time reader, 
> excluding newly flushed but including newly merged segments since the refresh 
> originally finished ...
> I'm not yet sure how best to solve this, so I wanted to open an issue for 
> discussion!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?

2020-06-24 Thread Simon Willnauer (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17143881#comment-17143881
 ] 

Simon Willnauer commented on LUCENE-8962:
-

I agree mike! I looked into it and it seems doable with some modifications. I 
already pushed a prototype of what I think can work 
[here|https://github.com/apache/lucene-solr/pull/1601/commits/de7677f1c98ffd38e9febce74f265fb44484c1d1]
 [~sokolov] do you have a branch we can work on for this?

> Can we merge small segments during refresh, for faster searching?
> -
>
> Key: LUCENE-8962
> URL: https://issues.apache.org/jira/browse/LUCENE-8962
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Michael McCandless
>Priority: Major
> Fix For: 8.6
>
> Attachments: LUCENE-8962_demo.png, failed-tests.patch, 
> failure_log.txt, test.diff
>
>  Time Spent: 19h 40m
>  Remaining Estimate: 0h
>
> With near-real-time search we ask {{IndexWriter}} to write all in-memory 
> segments to disk and open an {{IndexReader}} to search them, and this is 
> typically a quick operation.
> However, when you use many threads for concurrent indexing, {{IndexWriter}} 
> will accumulate write many small segments during {{refresh}} and this then 
> adds search-time cost as searching must visit all of these tiny segments.
> The merge policy would normally quickly coalesce these small segments if 
> given a little time ... so, could we somehow improve {{IndexWriter'}}s 
> refresh to optionally kick off merge policy to merge segments below some 
> threshold before opening the near-real-time reader?  It'd be a bit tricky 
> because while we are waiting for merges, indexing may continue, and new 
> segments may be flushed, but those new segments shouldn't be included in the 
> point-in-time segments returned by refresh ...
> One could almost do this on top of Lucene today, with a custom merge policy, 
> and some hackity logic to have the merge policy target small segments just 
> written by refresh, but it's tricky to then open a near-real-time reader, 
> excluding newly flushed but including newly merged segments since the refresh 
> originally finished ...
> I'm not yet sure how best to solve this, so I wanted to open an issue for 
> discussion!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?

2020-06-23 Thread Michael McCandless (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17143295#comment-17143295
 ] 

Michael McCandless commented on LUCENE-8962:


Thanks [~simonw], what a delightful unit test!  Clearly it confirms what we 
thought – the feature as implemented now breaks the atomicity guarantee of 
{{updateDocument}} which is clearly no good!  And I think this is likely tricky 
to fix, if we need new locking around IW acquiring merge readers to ensure a 
concurrent {{updateDocument}} is atomic with respect to that.  Let's mull/sleep 
on it ...

> Can we merge small segments during refresh, for faster searching?
> -
>
> Key: LUCENE-8962
> URL: https://issues.apache.org/jira/browse/LUCENE-8962
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Michael McCandless
>Priority: Major
> Fix For: 8.6
>
> Attachments: LUCENE-8962_demo.png, failed-tests.patch, 
> failure_log.txt, test.diff
>
>  Time Spent: 19h 40m
>  Remaining Estimate: 0h
>
> With near-real-time search we ask {{IndexWriter}} to write all in-memory 
> segments to disk and open an {{IndexReader}} to search them, and this is 
> typically a quick operation.
> However, when you use many threads for concurrent indexing, {{IndexWriter}} 
> will accumulate write many small segments during {{refresh}} and this then 
> adds search-time cost as searching must visit all of these tiny segments.
> The merge policy would normally quickly coalesce these small segments if 
> given a little time ... so, could we somehow improve {{IndexWriter'}}s 
> refresh to optionally kick off merge policy to merge segments below some 
> threshold before opening the near-real-time reader?  It'd be a bit tricky 
> because while we are waiting for merges, indexing may continue, and new 
> segments may be flushed, but those new segments shouldn't be included in the 
> point-in-time segments returned by refresh ...
> One could almost do this on top of Lucene today, with a custom merge policy, 
> and some hackity logic to have the merge policy target small segments just 
> written by refresh, but it's tricky to then open a near-real-time reader, 
> excluding newly flushed but including newly merged segments since the refresh 
> originally finished ...
> I'm not yet sure how best to solve this, so I wanted to open an issue for 
> discussion!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?

2020-06-22 Thread ASF subversion and git services (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17142433#comment-17142433
 ] 

ASF subversion and git services commented on LUCENE-8962:
-

Commit ba2a812f84e2a05cf6d7039a19000df10d25590c in lucene-solr's branch 
refs/heads/branch_8x from Michael Sokolov
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=ba2a812 ]

Revert "LUCENE-8962: add ability to selectively merge on commit (#1552)"

This reverts commit 28f2e1cd7f39e482b100d55ec94c46fc292f7b67.


> Can we merge small segments during refresh, for faster searching?
> -
>
> Key: LUCENE-8962
> URL: https://issues.apache.org/jira/browse/LUCENE-8962
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Michael McCandless
>Priority: Major
> Fix For: 8.6
>
> Attachments: LUCENE-8962_demo.png, failed-tests.patch, failure_log.txt
>
>  Time Spent: 19h 40m
>  Remaining Estimate: 0h
>
> With near-real-time search we ask {{IndexWriter}} to write all in-memory 
> segments to disk and open an {{IndexReader}} to search them, and this is 
> typically a quick operation.
> However, when you use many threads for concurrent indexing, {{IndexWriter}} 
> will accumulate write many small segments during {{refresh}} and this then 
> adds search-time cost as searching must visit all of these tiny segments.
> The merge policy would normally quickly coalesce these small segments if 
> given a little time ... so, could we somehow improve {{IndexWriter'}}s 
> refresh to optionally kick off merge policy to merge segments below some 
> threshold before opening the near-real-time reader?  It'd be a bit tricky 
> because while we are waiting for merges, indexing may continue, and new 
> segments may be flushed, but those new segments shouldn't be included in the 
> point-in-time segments returned by refresh ...
> One could almost do this on top of Lucene today, with a custom merge policy, 
> and some hackity logic to have the merge policy target small segments just 
> written by refresh, but it's tricky to then open a near-real-time reader, 
> excluding newly flushed but including newly merged segments since the refresh 
> originally finished ...
> I'm not yet sure how best to solve this, so I wanted to open an issue for 
> discussion!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?

2020-06-22 Thread ASF subversion and git services (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17142431#comment-17142431
 ] 

ASF subversion and git services commented on LUCENE-8962:
-

Commit 5d43e73c6616479bec19dd8edf57076e084d6063 in lucene-solr's branch 
refs/heads/master from Michael Sokolov
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=5d43e73 ]

Revert "LUCENE-8962: add ability to selectively merge on commit (#1552)"

This reverts commit 972c84022ec9c2b867848cb11f6d4c6bc466e8f2.


> Can we merge small segments during refresh, for faster searching?
> -
>
> Key: LUCENE-8962
> URL: https://issues.apache.org/jira/browse/LUCENE-8962
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Michael McCandless
>Priority: Major
> Fix For: 8.6
>
> Attachments: LUCENE-8962_demo.png, failed-tests.patch, failure_log.txt
>
>  Time Spent: 19h 40m
>  Remaining Estimate: 0h
>
> With near-real-time search we ask {{IndexWriter}} to write all in-memory 
> segments to disk and open an {{IndexReader}} to search them, and this is 
> typically a quick operation.
> However, when you use many threads for concurrent indexing, {{IndexWriter}} 
> will accumulate write many small segments during {{refresh}} and this then 
> adds search-time cost as searching must visit all of these tiny segments.
> The merge policy would normally quickly coalesce these small segments if 
> given a little time ... so, could we somehow improve {{IndexWriter'}}s 
> refresh to optionally kick off merge policy to merge segments below some 
> threshold before opening the near-real-time reader?  It'd be a bit tricky 
> because while we are waiting for merges, indexing may continue, and new 
> segments may be flushed, but those new segments shouldn't be included in the 
> point-in-time segments returned by refresh ...
> One could almost do this on top of Lucene today, with a custom merge policy, 
> and some hackity logic to have the merge policy target small segments just 
> written by refresh, but it's tricky to then open a near-real-time reader, 
> excluding newly flushed but including newly merged segments since the refresh 
> originally finished ...
> I'm not yet sure how best to solve this, so I wanted to open an issue for 
> discussion!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?

2020-06-22 Thread Michael Sokolov (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17142410#comment-17142410
 ] 

Michael Sokolov commented on LUCENE-8962:
-

OK I'll revert from master and branch_8x then

> Can we merge small segments during refresh, for faster searching?
> -
>
> Key: LUCENE-8962
> URL: https://issues.apache.org/jira/browse/LUCENE-8962
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Michael McCandless
>Priority: Major
> Fix For: 8.6
>
> Attachments: LUCENE-8962_demo.png, failed-tests.patch, failure_log.txt
>
>  Time Spent: 19h 40m
>  Remaining Estimate: 0h
>
> With near-real-time search we ask {{IndexWriter}} to write all in-memory 
> segments to disk and open an {{IndexReader}} to search them, and this is 
> typically a quick operation.
> However, when you use many threads for concurrent indexing, {{IndexWriter}} 
> will accumulate write many small segments during {{refresh}} and this then 
> adds search-time cost as searching must visit all of these tiny segments.
> The merge policy would normally quickly coalesce these small segments if 
> given a little time ... so, could we somehow improve {{IndexWriter'}}s 
> refresh to optionally kick off merge policy to merge segments below some 
> threshold before opening the near-real-time reader?  It'd be a bit tricky 
> because while we are waiting for merges, indexing may continue, and new 
> segments may be flushed, but those new segments shouldn't be included in the 
> point-in-time segments returned by refresh ...
> One could almost do this on top of Lucene today, with a custom merge policy, 
> and some hackity logic to have the merge policy target small segments just 
> written by refresh, but it's tricky to then open a near-real-time reader, 
> excluding newly flushed but including newly merged segments since the refresh 
> originally finished ...
> I'm not yet sure how best to solve this, so I wanted to open an issue for 
> discussion!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?

2020-06-22 Thread Michael McCandless (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17142395#comment-17142395
 ] 

Michael McCandless commented on LUCENE-8962:


{quote}[~mikemccand] WDYT I think we should roll it back for now.
{quote}
Yeah, +1, this is one challenging feature!

I'm glad we had that tricky test failure above leading to this.  We cannot 
break atomicity of {{updateDocument}} – Lucene's point-in-time guarantee would 
be broken.

I do think this is fixable, by changing {{commitMergedDeletes}} to NOT carry 
over concurrent deletes that had occurred to docs in the source segments while 
they were being merged.  But let's start with a test case showing that the 
current approach did indeed break atomicity guarantee of {{updateDocument}}...

> Can we merge small segments during refresh, for faster searching?
> -
>
> Key: LUCENE-8962
> URL: https://issues.apache.org/jira/browse/LUCENE-8962
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Michael McCandless
>Priority: Major
> Fix For: 8.6
>
> Attachments: LUCENE-8962_demo.png, failed-tests.patch, failure_log.txt
>
>  Time Spent: 19h 40m
>  Remaining Estimate: 0h
>
> With near-real-time search we ask {{IndexWriter}} to write all in-memory 
> segments to disk and open an {{IndexReader}} to search them, and this is 
> typically a quick operation.
> However, when you use many threads for concurrent indexing, {{IndexWriter}} 
> will accumulate write many small segments during {{refresh}} and this then 
> adds search-time cost as searching must visit all of these tiny segments.
> The merge policy would normally quickly coalesce these small segments if 
> given a little time ... so, could we somehow improve {{IndexWriter'}}s 
> refresh to optionally kick off merge policy to merge segments below some 
> threshold before opening the near-real-time reader?  It'd be a bit tricky 
> because while we are waiting for merges, indexing may continue, and new 
> segments may be flushed, but those new segments shouldn't be included in the 
> point-in-time segments returned by refresh ...
> One could almost do this on top of Lucene today, with a custom merge policy, 
> and some hackity logic to have the merge policy target small segments just 
> written by refresh, but it's tricky to then open a near-real-time reader, 
> excluding newly flushed but including newly merged segments since the refresh 
> originally finished ...
> I'm not yet sure how best to solve this, so I wanted to open an issue for 
> discussion!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?

2020-06-22 Thread Simon Willnauer (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17142389#comment-17142389
 ] 

Simon Willnauer commented on LUCENE-8962:
-

This would also explain the failure I am looking into right now on elastic CI:


{noformat}
12:20:39[junit4] Suite: org.apache.lucene.index.TestIndexingSequenceNumbers
12:20:39[junit4]   2> NOTE: reproduce with: ant test  
-Dtestcase=TestIndexingSequenceNumbers 
-Dtests.method=testStressConcurrentCommit -Dtests.seed=FA92259FC8239E7 
-Dtests.nightly=true -Dtests.slow=true -Dtests.badapples=true 
-Dtests.locale=en-TC -Dtests.timezone=Pacific/Pago_Pago -Dtests.asserts=true 
-Dtests.file.encoding=UTF8
12:20:39[junit4] FAILURE 85.1s J2 | 
TestIndexingSequenceNumbers.testStressConcurrentCommit <<<
12:20:39[junit4]> Throwable #1: java.lang.AssertionError: expected:<1> 
but was:<0>
12:20:39[junit4]>   at 
__randomizedtesting.SeedInfo.seed([FA92259FC8239E7:743DB8E67E421850]:0)
12:20:39[junit4]>   at 
org.apache.lucene.index.TestIndexingSequenceNumbers.testStressConcurrentCommit(TestIndexingSequenceNumbers.java:273)
12:20:39[junit4]>   at 
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
12:20:39[junit4]>   at 
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
12:20:39[junit4]>   at 
java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
12:20:39[junit4]>   at 
java.base/java.lang.reflect.Method.invoke(Method.java:566)
12:20:39[junit4]>   at 
java.base/java.lang.Thread.run(Thread.java:834)
12:20:39[junit4]   2> NOTE: leaving temporary files on disk at: 
/var/lib/jenkins/workspace/apache+lucene-solr+nightly+branch_8x/lucene/build/core/test/J2/temp/lucene.index.TestIndexingSequenceNumbers_FA92259FC8239E7-001
12:20:39[junit4]   2> NOTE: test params are: codec=Asserting(Lucene86): 
{id=Lucene84}, docValues:{thread=DocValuesFormat(name=Asserting), 
___soft_deletes=DocValuesFormat(name=Asserting)}, maxPointsInLeafNode=279, 
maxMBSortInHeap=6.3565394726088424, 
sim=Asserting(org.apache.lucene.search.similarities.AssertingSimilarity@1c08a55f),
 locale=en-TC, timezone=Pacific/Pago_Pago
12:20:39[junit4]   2> NOTE: Linux 4.18.0-193.6.3.el8_2.x86_64 amd64/Oracle 
Corporation 11.0.2 (64-bit)/cpus=32,threads=1,free=339006432,total=400556032
12:20:39[junit4]   2> NOTE: All tests run in this JVM: 
[TestMultiTermConstantScore, TestSparseFixedBitSet, TestSpanBoostQuery, 
TestOfflineSorter, TestRadixSelector, TestBagOfPositions, 
TestPhrasePrefixQuery, TestAxiomaticF2LOG, TestLucene50CompoundFormat, 
TestLongBitSet, TestBinaryDocument, TestBooleanOr, 
TestComplexExplanationsOfNonMatches, TestTransactions, TestTermQuery, 
TestUTF32ToUTF8, TestWildcard, TestIndexWriterForceMerge, TestMinShouldMatch2, 
TestCodecs, TestBytesStore, Test2BNumericDocValues, TestCharFilter, 
TestXYMultiPolygonShapeQueries, TestFeatureDoubleValues, TestPackedInts, 
TestGraphTokenizers, TestFilterCodecReader, TestSetOnce, TestLatLonPoint, 
TestLongRange, TestQueryRescorer, TestNRTThreads, TestMergedIterator, 
TestLucene86SegmentInfoFormat, TestPackedTokenAttributeImpl, TestBasicModelIne, 
TestDocCount, TestLMDirichletSimilarity, TestAttributeSource, 
TestPositiveScoresOnlyCollector, TestXYPointQueries, TestElevationComparator, 
TestIndexWriterMergePolicy, TestUpgradeIndexMergePolicy, 
TestIntRangeFieldQueries, TestIndexingSequenceNumbers]
12:20:39[junit4] Completed [369/567 (1!)] on J2 in 168.09s, 8 tests, 1 
failure <<< FAILURES!
{noformat}

it's missing a document, pretty scary but explains what we see here.


> Can we merge small segments during refresh, for faster searching?
> -
>
> Key: LUCENE-8962
> URL: https://issues.apache.org/jira/browse/LUCENE-8962
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Michael McCandless
>Priority: Major
> Fix For: 8.6
>
> Attachments: LUCENE-8962_demo.png, failed-tests.patch, failure_log.txt
>
>  Time Spent: 19h 40m
>  Remaining Estimate: 0h
>
> With near-real-time search we ask {{IndexWriter}} to write all in-memory 
> segments to disk and open an {{IndexReader}} to search them, and this is 
> typically a quick operation.
> However, when you use many threads for concurrent indexing, {{IndexWriter}} 
> will accumulate write many small segments during {{refresh}} and this then 
> adds search-time cost as searching must visit all of these tiny segments.
> The merge policy would normally quickly coalesce these small segments if 
> given a little time ... so, could we somehow improve {{IndexWriter'}}s 
> refres

[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?

2020-06-22 Thread Simon Willnauer (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17142388#comment-17142388
 ] 

Simon Willnauer commented on LUCENE-8962:
-

[~msoko...@gmail.com] I think I have bad news for you. I think we forgot about 
an important guarantee that we have from updateDocument that is now with this 
feature enabled not holding up anymore. During a merge we carry over deletes 
that are made concurrently to the source segments. This would mean that if we 
continue including deleted that are carried over in a commit we are currently 
making we don't carry over the corresponding added document which means that 
there are documents not visible in the comment that should be visible. I think 
this is a pretty significant problem and we should roll back this feature for 
now. I am sorry that it took me a while to understand the implications here but 
these IW logs are not the most intuitive thing to read and stare at. 

We need to look into commitMergedDeletes if we can make it work or if we need 
to do the merge again since we now need point in time semantics for merges too. 
I am not sure if that's actually fixable, but lets look into it. [~mikemccand] 
WDYT I think we should roll it back for now. 

> Can we merge small segments during refresh, for faster searching?
> -
>
> Key: LUCENE-8962
> URL: https://issues.apache.org/jira/browse/LUCENE-8962
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Michael McCandless
>Priority: Major
> Fix For: 8.6
>
> Attachments: LUCENE-8962_demo.png, failed-tests.patch, failure_log.txt
>
>  Time Spent: 19h 40m
>  Remaining Estimate: 0h
>
> With near-real-time search we ask {{IndexWriter}} to write all in-memory 
> segments to disk and open an {{IndexReader}} to search them, and this is 
> typically a quick operation.
> However, when you use many threads for concurrent indexing, {{IndexWriter}} 
> will accumulate write many small segments during {{refresh}} and this then 
> adds search-time cost as searching must visit all of these tiny segments.
> The merge policy would normally quickly coalesce these small segments if 
> given a little time ... so, could we somehow improve {{IndexWriter'}}s 
> refresh to optionally kick off merge policy to merge segments below some 
> threshold before opening the near-real-time reader?  It'd be a bit tricky 
> because while we are waiting for merges, indexing may continue, and new 
> segments may be flushed, but those new segments shouldn't be included in the 
> point-in-time segments returned by refresh ...
> One could almost do this on top of Lucene today, with a custom merge policy, 
> and some hackity logic to have the merge policy target small segments just 
> written by refresh, but it's tricky to then open a near-real-time reader, 
> excluding newly flushed but including newly merged segments since the refresh 
> originally finished ...
> I'm not yet sure how best to solve this, so I wanted to open an issue for 
> discussion!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?

2020-06-22 Thread David Smiley (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17142380#comment-17142380
 ] 

David Smiley commented on LUCENE-8962:
--

bq. If we decide to do this in the future we can keep the default and implement 
that method by default without enabling the feature. If we set it to non-zero 
that option goes away.

+1 makes sense Simon

> Can we merge small segments during refresh, for faster searching?
> -
>
> Key: LUCENE-8962
> URL: https://issues.apache.org/jira/browse/LUCENE-8962
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Michael McCandless
>Priority: Major
> Fix For: 8.6
>
> Attachments: LUCENE-8962_demo.png, failed-tests.patch, failure_log.txt
>
>  Time Spent: 19h 40m
>  Remaining Estimate: 0h
>
> With near-real-time search we ask {{IndexWriter}} to write all in-memory 
> segments to disk and open an {{IndexReader}} to search them, and this is 
> typically a quick operation.
> However, when you use many threads for concurrent indexing, {{IndexWriter}} 
> will accumulate write many small segments during {{refresh}} and this then 
> adds search-time cost as searching must visit all of these tiny segments.
> The merge policy would normally quickly coalesce these small segments if 
> given a little time ... so, could we somehow improve {{IndexWriter'}}s 
> refresh to optionally kick off merge policy to merge segments below some 
> threshold before opening the near-real-time reader?  It'd be a bit tricky 
> because while we are waiting for merges, indexing may continue, and new 
> segments may be flushed, but those new segments shouldn't be included in the 
> point-in-time segments returned by refresh ...
> One could almost do this on top of Lucene today, with a custom merge policy, 
> and some hackity logic to have the merge policy target small segments just 
> written by refresh, but it's tricky to then open a near-real-time reader, 
> excluding newly flushed but including newly merged segments since the refresh 
> originally finished ...
> I'm not yet sure how best to solve this, so I wanted to open an issue for 
> discussion!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?

2020-06-22 Thread Simon Willnauer (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17142303#comment-17142303
 ] 

Simon Willnauer commented on LUCENE-8962:
-

I opened a PR [here|https://github.com/apache/lucene-solr/pull/1601] 

> Can we merge small segments during refresh, for faster searching?
> -
>
> Key: LUCENE-8962
> URL: https://issues.apache.org/jira/browse/LUCENE-8962
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Michael McCandless
>Priority: Major
> Fix For: 8.6
>
> Attachments: LUCENE-8962_demo.png, failed-tests.patch, failure_log.txt
>
>  Time Spent: 19.5h
>  Remaining Estimate: 0h
>
> With near-real-time search we ask {{IndexWriter}} to write all in-memory 
> segments to disk and open an {{IndexReader}} to search them, and this is 
> typically a quick operation.
> However, when you use many threads for concurrent indexing, {{IndexWriter}} 
> will accumulate write many small segments during {{refresh}} and this then 
> adds search-time cost as searching must visit all of these tiny segments.
> The merge policy would normally quickly coalesce these small segments if 
> given a little time ... so, could we somehow improve {{IndexWriter'}}s 
> refresh to optionally kick off merge policy to merge segments below some 
> threshold before opening the near-real-time reader?  It'd be a bit tricky 
> because while we are waiting for merges, indexing may continue, and new 
> segments may be flushed, but those new segments shouldn't be included in the 
> point-in-time segments returned by refresh ...
> One could almost do this on top of Lucene today, with a custom merge policy, 
> and some hackity logic to have the merge policy target small segments just 
> written by refresh, but it's tricky to then open a near-real-time reader, 
> excluding newly flushed but including newly merged segments since the refresh 
> originally finished ...
> I'm not yet sure how best to solve this, so I wanted to open an issue for 
> discussion!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?

2020-06-22 Thread Simon Willnauer (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17142197#comment-17142197
 ] 

Simon Willnauer commented on LUCENE-8962:
-

I attached a log file with a failure and index writer logging enabled. I think 
I have a smoking gun here:

{noformat}
 [junit4]   1> IW 2 [2020-06-22T15:33:08.479370Z; Lucene Merge Thread #31]: 
merged segment _2f(9.0.0):c1:[diagnostics={os.version=10.14.6, 
mergeMaxNumSegments=-1, java.version=12.0.1, java.vm.version=12.0.1+12, 
lucene.version=9.0.0, timestamp=1592839988473, os=Mac OS X, 
java.runtime.version=12.0.1+12, mergeFactor=1, os.arch=x86_64, source=merge, 
java.vendor=Oracle 
Corporation}]:[attributes={Lucene50StoredFieldsFormat.mode=BEST_SPEED}] 
:id=70ut1o7n9kgx798rria9sma3n is 100% deleted; skipping insert
   [junit4]   1> SM 2 [2020-06-22T15:33:08.479615Z; Lucene Merge Thread #32]: 0 
msec to write field infos [29 docs]
   [junit4]   1> IFD 2 [2020-06-22T15:33:08.479662Z; Lucene Merge Thread #31]: 
will delete new file "_2f.cfe"
   [junit4]   1> IFD 2 [2020-06-22T15:33:08.479712Z; Lucene Merge Thread #31]: 
will delete new file "_2f.cfs"
   [junit4]   1> IFD 2 [2020-06-22T15:33:08.479740Z; Lucene Merge Thread #31]: 
will delete new file "_2f.si"
   [junit4]   1> IFD 2 [2020-06-22T15:33:08.479770Z; Lucene Merge Thread #31]: 
delete [_2f.cfe, _2f.cfs, _2f.si]
{noformat}

and further down


{noformat}
   [junit4]   2> Caused by: java.nio.file.NoSuchFileException: _2f.cfs
   [junit4]   2>at 
org.apache.lucene.store.ByteBuffersDirectory.deleteFile(ByteBuffersDirectory.java:148)
   [junit4]   2>at 
org.apache.lucene.store.MockDirectoryWrapper.deleteFile(MockDirectoryWrapper.java:607)
   [junit4]   2>at 
org.apache.lucene.store.LockValidatingDirectoryWrapper.deleteFile(LockValidatingDirectoryWrapper.java:38)
   [junit4]   2>at 
org.apache.lucene.index.IndexFileDeleter.deleteFile(IndexFileDeleter.java:696)
   [junit4]   2>at 
org.apache.lucene.index.IndexFileDeleter.deleteFiles(IndexFileDeleter.java:690)
   [junit4]   2>at 
org.apache.lucene.index.IndexFileDeleter.decRef(IndexFileDeleter.java:589)
   [junit4]   2>at 
org.apache.lucene.index.IndexFileDeleter.deleteCommits(IndexFileDeleter.java:382)
   [junit4]   2>at 
org.apache.lucene.index.IndexFileDeleter.checkpoint(IndexFileDeleter.java:527)
   [junit4]   2>at 
org.apache.lucene.index.IndexWriter.finishCommit(IndexWriter.java:3546)
   [junit4]   2>at 
org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:3502)
   [junit4]   2>at 
org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:3452)
   [junit4]   2>at 
org.apache.lucene.index.TestIndexWriter.lambda$testRandomOperations$48(TestIndexWriter.java:3879)
   [junit4]   2>... 1 more
   [junit4]   2>

{noformat}

I think we are not respecting the fact that it's fully deleted and dropped 
because of this before we get a chance to incRef the segment. I will work on a 
patch for this.





> Can we merge small segments during refresh, for faster searching?
> -
>
> Key: LUCENE-8962
> URL: https://issues.apache.org/jira/browse/LUCENE-8962
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Michael McCandless
>Priority: Major
> Fix For: 8.6
>
> Attachments: LUCENE-8962_demo.png, failed-tests.patch, failure_log.txt
>
>  Time Spent: 18h 50m
>  Remaining Estimate: 0h
>
> With near-real-time search we ask {{IndexWriter}} to write all in-memory 
> segments to disk and open an {{IndexReader}} to search them, and this is 
> typically a quick operation.
> However, when you use many threads for concurrent indexing, {{IndexWriter}} 
> will accumulate write many small segments during {{refresh}} and this then 
> adds search-time cost as searching must visit all of these tiny segments.
> The merge policy would normally quickly coalesce these small segments if 
> given a little time ... so, could we somehow improve {{IndexWriter'}}s 
> refresh to optionally kick off merge policy to merge segments below some 
> threshold before opening the near-real-time reader?  It'd be a bit tricky 
> because while we are waiting for merges, indexing may continue, and new 
> segments may be flushed, but those new segments shouldn't be included in the 
> point-in-time segments returned by refresh ...
> One could almost do this on top of Lucene today, with a custom merge policy, 
> and some hackity logic to have the merge policy target small segments just 
> written by refresh, but it's tricky to then open a near-real-time reader, 
> excluding newly flushed but including newly merged segments since the refresh 
> originally finished ...
> I'm not y

[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?

2020-06-22 Thread Erick Erickson (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17142190#comment-17142190
 ] 

Erick Erickson commented on LUCENE-8962:


Sounds like it may _not_ be related to SOLR-13268 then, it's a perfectly 
legitimate error if some run happens to produce more than 8K of output. If none 
of the messages look like they came from another test, it's probably legitimate 
and not a result of async logging.

Mostly I wanted you to be aware that it _might_ be related, but that's looking 
less likely.

> Can we merge small segments during refresh, for faster searching?
> -
>
> Key: LUCENE-8962
> URL: https://issues.apache.org/jira/browse/LUCENE-8962
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Michael McCandless
>Priority: Major
> Fix For: 8.6
>
> Attachments: LUCENE-8962_demo.png, failed-tests.patch
>
>  Time Spent: 18h 50m
>  Remaining Estimate: 0h
>
> With near-real-time search we ask {{IndexWriter}} to write all in-memory 
> segments to disk and open an {{IndexReader}} to search them, and this is 
> typically a quick operation.
> However, when you use many threads for concurrent indexing, {{IndexWriter}} 
> will accumulate write many small segments during {{refresh}} and this then 
> adds search-time cost as searching must visit all of these tiny segments.
> The merge policy would normally quickly coalesce these small segments if 
> given a little time ... so, could we somehow improve {{IndexWriter'}}s 
> refresh to optionally kick off merge policy to merge segments below some 
> threshold before opening the near-real-time reader?  It'd be a bit tricky 
> because while we are waiting for merges, indexing may continue, and new 
> segments may be flushed, but those new segments shouldn't be included in the 
> point-in-time segments returned by refresh ...
> One could almost do this on top of Lucene today, with a custom merge policy, 
> and some hackity logic to have the merge policy target small segments just 
> written by refresh, but it's tricky to then open a near-real-time reader, 
> excluding newly flushed but including newly merged segments since the refresh 
> originally finished ...
> I'm not yet sure how best to solve this, so I wanted to open an issue for 
> discussion!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?

2020-06-22 Thread Simon Willnauer (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17142184#comment-17142184
 ] 

Simon Willnauer commented on LUCENE-8962:
-

I can reproduce this every 10 iterations. Funny enough I can't in my IDE only 
when I repeatedly run it on the CMD. This line reproduces for me quite 
frequently:


{noformat}
ant test  -Dtestcase=TestIndexWriter -Dtests.method=testRandomOperations 
-Dtests.seed=83A78A6E34594E08 -Dtests.nightly=true -Dtests.slow=true 
-Dtests.badapples=true -Dtests.locale=teo-KE 
-Dtests.timezone=Antarctica/Rothera -Dtests.asserts=true 
-Dtests.file.encoding=ISO-8859-1
{noformat}

I also checked if it's related to calling mergeFinished multiple times but it 
doesn't


> Can we merge small segments during refresh, for faster searching?
> -
>
> Key: LUCENE-8962
> URL: https://issues.apache.org/jira/browse/LUCENE-8962
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Michael McCandless
>Priority: Major
> Fix For: 8.6
>
> Attachments: LUCENE-8962_demo.png, failed-tests.patch
>
>  Time Spent: 18h 50m
>  Remaining Estimate: 0h
>
> With near-real-time search we ask {{IndexWriter}} to write all in-memory 
> segments to disk and open an {{IndexReader}} to search them, and this is 
> typically a quick operation.
> However, when you use many threads for concurrent indexing, {{IndexWriter}} 
> will accumulate write many small segments during {{refresh}} and this then 
> adds search-time cost as searching must visit all of these tiny segments.
> The merge policy would normally quickly coalesce these small segments if 
> given a little time ... so, could we somehow improve {{IndexWriter'}}s 
> refresh to optionally kick off merge policy to merge segments below some 
> threshold before opening the near-real-time reader?  It'd be a bit tricky 
> because while we are waiting for merges, indexing may continue, and new 
> segments may be flushed, but those new segments shouldn't be included in the 
> point-in-time segments returned by refresh ...
> One could almost do this on top of Lucene today, with a custom merge policy, 
> and some hackity logic to have the merge policy target small segments just 
> written by refresh, but it's tricky to then open a near-real-time reader, 
> excluding newly flushed but including newly merged segments since the refresh 
> originally finished ...
> I'm not yet sure how best to solve this, so I wanted to open an issue for 
> discussion!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?

2020-06-22 Thread Erick Erickson (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17142110#comment-17142110
 ] 

Erick Erickson commented on LUCENE-8962:


[~mikemccand] Ever since I changed Solr to use asnyc logging by default, these 
errors show up _very_ occasionally and I have been unable to figure out why. My 
best guess is that "somehow" the logging buffer isn't being flushed when the 
logger is closed, as it should be. See: SOLR-13268.

It happens so rarely that it's hard to diagnose/fix, any insights appreciated. 
I do wonder if the right thing to do is go back to synchronous logging for the 
tests. My guess is that some buffer flushing isn't quite completed when loggers 
shut down, but in the real world this won't happen unless the app is going away 
so shouldn't be a problem?

WDYT?

> Can we merge small segments during refresh, for faster searching?
> -
>
> Key: LUCENE-8962
> URL: https://issues.apache.org/jira/browse/LUCENE-8962
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Michael McCandless
>Priority: Major
> Fix For: 8.6
>
> Attachments: LUCENE-8962_demo.png, failed-tests.patch
>
>  Time Spent: 18h 50m
>  Remaining Estimate: 0h
>
> With near-real-time search we ask {{IndexWriter}} to write all in-memory 
> segments to disk and open an {{IndexReader}} to search them, and this is 
> typically a quick operation.
> However, when you use many threads for concurrent indexing, {{IndexWriter}} 
> will accumulate write many small segments during {{refresh}} and this then 
> adds search-time cost as searching must visit all of these tiny segments.
> The merge policy would normally quickly coalesce these small segments if 
> given a little time ... so, could we somehow improve {{IndexWriter'}}s 
> refresh to optionally kick off merge policy to merge segments below some 
> threshold before opening the near-real-time reader?  It'd be a bit tricky 
> because while we are waiting for merges, indexing may continue, and new 
> segments may be flushed, but those new segments shouldn't be included in the 
> point-in-time segments returned by refresh ...
> One could almost do this on top of Lucene today, with a custom merge policy, 
> and some hackity logic to have the merge policy target small segments just 
> written by refresh, but it's tricky to then open a near-real-time reader, 
> excluding newly flushed but including newly merged segments since the refresh 
> originally finished ...
> I'm not yet sure how best to solve this, so I wanted to open an issue for 
> discussion!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?

2020-06-22 Thread Michael McCandless (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17142026#comment-17142026
 ] 

Michael McCandless commented on LUCENE-8962:


{quote}It seems like this caused this test failure:
{quote}
Hmm it does not repro for me on first try.  Also, I wonder why the repro line 
is missing the test case ({{testRandomOperations}})?  Oh, hmm, I wonder if the 
test case itself did not fail due to all the unhandled exceptions in threads, 
and it was just the final
{noformat}
Throwable #1: java.lang.AssertionError: The test or suite printed 9264 bytes to 
stdout and stderr, even though the limit was set to 8192 bytes. Increase the 
limit with @Limit, ignore it completely with @SuppressSysoutChecks or run with 
-Dtests.verbose=true {noformat}
that "failed"?

This is a nice evil test.

> Can we merge small segments during refresh, for faster searching?
> -
>
> Key: LUCENE-8962
> URL: https://issues.apache.org/jira/browse/LUCENE-8962
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Michael McCandless
>Priority: Major
> Fix For: 8.6
>
> Attachments: LUCENE-8962_demo.png, failed-tests.patch
>
>  Time Spent: 18h 40m
>  Remaining Estimate: 0h
>
> With near-real-time search we ask {{IndexWriter}} to write all in-memory 
> segments to disk and open an {{IndexReader}} to search them, and this is 
> typically a quick operation.
> However, when you use many threads for concurrent indexing, {{IndexWriter}} 
> will accumulate write many small segments during {{refresh}} and this then 
> adds search-time cost as searching must visit all of these tiny segments.
> The merge policy would normally quickly coalesce these small segments if 
> given a little time ... so, could we somehow improve {{IndexWriter'}}s 
> refresh to optionally kick off merge policy to merge segments below some 
> threshold before opening the near-real-time reader?  It'd be a bit tricky 
> because while we are waiting for merges, indexing may continue, and new 
> segments may be flushed, but those new segments shouldn't be included in the 
> point-in-time segments returned by refresh ...
> One could almost do this on top of Lucene today, with a custom merge policy, 
> and some hackity logic to have the merge policy target small segments just 
> written by refresh, but it's tricky to then open a near-real-time reader, 
> excluding newly flushed but including newly merged segments since the refresh 
> originally finished ...
> I'm not yet sure how best to solve this, so I wanted to open an issue for 
> discussion!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?

2020-06-22 Thread Michael Sokolov (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17142010#comment-17142010
 ] 

Michael Sokolov commented on LUCENE-8962:
-

Re: the TestIndexWriter failure. I tried running this morning (in IntelliJ, on 
a linux laptop), and couldn't reproduce that seed after 3000 iterations. Just 
to rule out any system dependency of the test failure; where did you get the 
stack trace, [~simonw]? Was this from Elastic tests? Also, I wonder if the 
machine has any impact - in particular does the number of CPUs/cores influence 
the IndexWriter threading in some way? 

> Can we merge small segments during refresh, for faster searching?
> -
>
> Key: LUCENE-8962
> URL: https://issues.apache.org/jira/browse/LUCENE-8962
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Michael McCandless
>Priority: Major
> Fix For: 8.6
>
> Attachments: LUCENE-8962_demo.png, failed-tests.patch
>
>  Time Spent: 18h 40m
>  Remaining Estimate: 0h
>
> With near-real-time search we ask {{IndexWriter}} to write all in-memory 
> segments to disk and open an {{IndexReader}} to search them, and this is 
> typically a quick operation.
> However, when you use many threads for concurrent indexing, {{IndexWriter}} 
> will accumulate write many small segments during {{refresh}} and this then 
> adds search-time cost as searching must visit all of these tiny segments.
> The merge policy would normally quickly coalesce these small segments if 
> given a little time ... so, could we somehow improve {{IndexWriter'}}s 
> refresh to optionally kick off merge policy to merge segments below some 
> threshold before opening the near-real-time reader?  It'd be a bit tricky 
> because while we are waiting for merges, indexing may continue, and new 
> segments may be flushed, but those new segments shouldn't be included in the 
> point-in-time segments returned by refresh ...
> One could almost do this on top of Lucene today, with a custom merge policy, 
> and some hackity logic to have the merge policy target small segments just 
> written by refresh, but it's tricky to then open a near-real-time reader, 
> excluding newly flushed but including newly merged segments since the refresh 
> originally finished ...
> I'm not yet sure how best to solve this, so I wanted to open an issue for 
> discussion!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?

2020-06-22 Thread Simon Willnauer (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17141811#comment-17141811
 ] 

Simon Willnauer commented on LUCENE-8962:
-

{quote}
I'm surprised to see that 
IndexWriterConfig.DEFAULT_MAX_COMMIT_MERGE_WAIT_SECONDS is 0. After all, the 
default is not to merge on commit so if someone configured a merge policy to do 
this, shouldn't we wait? I think by default we should wait indefinitely. It's 
up to the developer/configure-er of the merge policy to find cheap merges (if 
any). Keeping it at 0 creates a gotcha for yet another setting that's required 
to merge on commit.
{quote}

I don't think we should. It's such an expert setting and unless we implement it 
ourself I think we should keep it that way. I'd assume if we make it non-zero 
by default that our MPs implement it and I can use it right away. If we decide 
to do this in the future we can keep the default and implement that method by 
default without enabling the feature. If we set it to non-zero that option goes 
away.

> Can we merge small segments during refresh, for faster searching?
> -
>
> Key: LUCENE-8962
> URL: https://issues.apache.org/jira/browse/LUCENE-8962
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Michael McCandless
>Priority: Major
> Fix For: 8.6
>
> Attachments: LUCENE-8962_demo.png, failed-tests.patch
>
>  Time Spent: 18.5h
>  Remaining Estimate: 0h
>
> With near-real-time search we ask {{IndexWriter}} to write all in-memory 
> segments to disk and open an {{IndexReader}} to search them, and this is 
> typically a quick operation.
> However, when you use many threads for concurrent indexing, {{IndexWriter}} 
> will accumulate write many small segments during {{refresh}} and this then 
> adds search-time cost as searching must visit all of these tiny segments.
> The merge policy would normally quickly coalesce these small segments if 
> given a little time ... so, could we somehow improve {{IndexWriter'}}s 
> refresh to optionally kick off merge policy to merge segments below some 
> threshold before opening the near-real-time reader?  It'd be a bit tricky 
> because while we are waiting for merges, indexing may continue, and new 
> segments may be flushed, but those new segments shouldn't be included in the 
> point-in-time segments returned by refresh ...
> One could almost do this on top of Lucene today, with a custom merge policy, 
> and some hackity logic to have the merge policy target small segments just 
> written by refresh, but it's tricky to then open a near-real-time reader, 
> excluding newly flushed but including newly merged segments since the refresh 
> originally finished ...
> I'm not yet sure how best to solve this, so I wanted to open an issue for 
> discussion!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?

2020-06-22 Thread Simon Willnauer (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17141809#comment-17141809
 ] 

Simon Willnauer commented on LUCENE-8962:
-

It seems like this caused this test failure:

{noformat}
18:06:29[junit4] Suite: org.apache.lucene.index.TestIndexWriter
18:06:29[junit4]   2> Mod 21, 2020 3:06:19 EBONGI 
com.carrotsearch.randomizedtesting.RandomizedRunner$QueueUncaughtExceptionsHandler
 uncaughtException
18:06:29[junit4]   2> WARNING: Uncaught exception in thread: 
Thread[Thread-1905,5,TGRP-TestIndexWriter]
18:06:29[junit4]   2> java.lang.AssertionError: 
org.apache.lucene.store.AlreadyClosedException: this IndexWriter is closed
18:06:29[junit4]   2>   at 
__randomizedtesting.SeedInfo.seed([83A78A6E34594E08]:0)
18:06:29[junit4]   2>   at 
org.apache.lucene.index.TestIndexWriter.lambda$testRandomOperations$48(TestIndexWriter.java:3886)
18:06:29[junit4]   2>   at 
java.base/java.lang.Thread.run(Thread.java:834)
18:06:29[junit4]   2> Caused by: 
org.apache.lucene.store.AlreadyClosedException: this IndexWriter is closed
18:06:29[junit4]   2>   at 
org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:739)
18:06:29[junit4]   2>   at 
org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:753)
18:06:29[junit4]   2>   at 
org.apache.lucene.index.IndexWriter.updateDocuments(IndexWriter.java:1330)
18:06:29[junit4]   2>   at 
org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1619)
18:06:29[junit4]   2>   at 
org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1262)
18:06:29[junit4]   2>   at 
org.apache.lucene.index.TestIndexWriter.lambda$testRandomOperations$48(TestIndexWriter.java:3873)
18:06:29[junit4]   2>   ... 1 more
18:06:29[junit4]   2> Caused by: java.nio.file.NoSuchFileException: _7.cfs
18:06:29[junit4]   2>   at 
org.apache.lucene.store.ByteBuffersDirectory.deleteFile(ByteBuffersDirectory.java:148)
18:06:29[junit4]   2>   at 
org.apache.lucene.store.MockDirectoryWrapper.deleteFile(MockDirectoryWrapper.java:607)
18:06:29[junit4]   2>   at 
org.apache.lucene.store.LockValidatingDirectoryWrapper.deleteFile(LockValidatingDirectoryWrapper.java:38)
18:06:29[junit4]   2>   at 
org.apache.lucene.index.IndexFileDeleter.deleteFile(IndexFileDeleter.java:696)
18:06:29[junit4]   2>   at 
org.apache.lucene.index.IndexFileDeleter.deleteFiles(IndexFileDeleter.java:690)
18:06:29[junit4]   2>   at 
org.apache.lucene.index.IndexFileDeleter.decRef(IndexFileDeleter.java:589)
18:06:29[junit4]   2>   at 
org.apache.lucene.index.IndexFileDeleter.deleteCommits(IndexFileDeleter.java:382)
18:06:29[junit4]   2>   at 
org.apache.lucene.index.IndexFileDeleter.checkpoint(IndexFileDeleter.java:527)
18:06:29[junit4]   2>   at 
org.apache.lucene.index.IndexWriter.finishCommit(IndexWriter.java:3546)
18:06:29[junit4]   2>   at 
org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:3502)
18:06:29[junit4]   2>   at 
org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:3452)
18:06:29[junit4]   2>   at 
org.apache.lucene.index.TestIndexWriter.lambda$testRandomOperations$48(TestIndexWriter.java:3879)
18:06:29[junit4]   2>   ... 1 more
18:06:29[junit4]   2> 
18:06:29[junit4]   2> Mod 21, 2020 3:06:19 EBONGI 
com.carrotsearch.randomizedtesting.RandomizedRunner$QueueUncaughtExceptionsHandler
 uncaughtException
18:06:29[junit4]   2> WARNING: Uncaught exception in thread: Thread[Lucene 
Merge Thread #3,5,TGRP-TestIndexWriter]
18:06:29[junit4]   2> org.apache.lucene.index.MergePolicy$MergeException: 
java.lang.IllegalStateException: this writer hit an unrecoverable error; cannot 
merge
18:06:29[junit4]   2>   at 
__randomizedtesting.SeedInfo.seed([83A78A6E34594E08]:0)
18:06:29[junit4]   2>   at 
org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:703)
18:06:29[junit4]   2>   at 
org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:694)
18:06:29[junit4]   2> Caused by: java.lang.IllegalStateException: this 
writer hit an unrecoverable error; cannot merge
18:06:29[junit4]   2>   at 
org.apache.lucene.index.IndexWriter._mergeInit(IndexWriter.java:4250)
18:06:29[junit4]   2>   at 
org.apache.lucene.index.IndexWriter.mergeInit(IndexWriter.java:4230)
18:06:29[junit4]   2>   at 
org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:4083)
18:06:29[junit4]   2>   at 
org.apache.lucene.index.IndexWriter$IndexWriterMergeSource.merge(IndexWriter.java:5650)
18:06:29[junit4]   2>   at 
org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:624)
18:06:29[j

[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?

2020-06-21 Thread David Smiley (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17141583#comment-17141583
 ] 

David Smiley commented on LUCENE-8962:
--

I'm surprised to see that 
{{IndexWriterConfig.DEFAULT_MAX_COMMIT_MERGE_WAIT_SECONDS}} is 0.  After all, 
the default is not to merge on commit so if someone configured a merge policy 
to do this, shouldn't we wait?  I think by default we should wait indefinitely. 
 It's up to the developer/configure-er of the merge policy to find cheap merges 
(if any).  Keeping it at 0 creates a gotcha for yet another setting that's 
required to merge on commit.

> Can we merge small segments during refresh, for faster searching?
> -
>
> Key: LUCENE-8962
> URL: https://issues.apache.org/jira/browse/LUCENE-8962
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Michael McCandless
>Priority: Major
> Fix For: 8.6
>
> Attachments: LUCENE-8962_demo.png, failed-tests.patch
>
>  Time Spent: 18h 20m
>  Remaining Estimate: 0h
>
> With near-real-time search we ask {{IndexWriter}} to write all in-memory 
> segments to disk and open an {{IndexReader}} to search them, and this is 
> typically a quick operation.
> However, when you use many threads for concurrent indexing, {{IndexWriter}} 
> will accumulate write many small segments during {{refresh}} and this then 
> adds search-time cost as searching must visit all of these tiny segments.
> The merge policy would normally quickly coalesce these small segments if 
> given a little time ... so, could we somehow improve {{IndexWriter'}}s 
> refresh to optionally kick off merge policy to merge segments below some 
> threshold before opening the near-real-time reader?  It'd be a bit tricky 
> because while we are waiting for merges, indexing may continue, and new 
> segments may be flushed, but those new segments shouldn't be included in the 
> point-in-time segments returned by refresh ...
> One could almost do this on top of Lucene today, with a custom merge policy, 
> and some hackity logic to have the merge policy target small segments just 
> written by refresh, but it's tricky to then open a near-real-time reader, 
> excluding newly flushed but including newly merged segments since the refresh 
> originally finished ...
> I'm not yet sure how best to solve this, so I wanted to open an issue for 
> discussion!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?

2020-06-18 Thread Michael Sokolov (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17140039#comment-17140039
 ] 

Michael Sokolov commented on LUCENE-8962:
-

pushed [https://github.com/apache/lucene-solr/pull/1552] to master, and 
cherry-picked to branch_8x, resolving

> Can we merge small segments during refresh, for faster searching?
> -
>
> Key: LUCENE-8962
> URL: https://issues.apache.org/jira/browse/LUCENE-8962
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Michael McCandless
>Priority: Major
> Fix For: 8.6
>
> Attachments: LUCENE-8962_demo.png, failed-tests.patch
>
>  Time Spent: 18h 20m
>  Remaining Estimate: 0h
>
> With near-real-time search we ask {{IndexWriter}} to write all in-memory 
> segments to disk and open an {{IndexReader}} to search them, and this is 
> typically a quick operation.
> However, when you use many threads for concurrent indexing, {{IndexWriter}} 
> will accumulate write many small segments during {{refresh}} and this then 
> adds search-time cost as searching must visit all of these tiny segments.
> The merge policy would normally quickly coalesce these small segments if 
> given a little time ... so, could we somehow improve {{IndexWriter'}}s 
> refresh to optionally kick off merge policy to merge segments below some 
> threshold before opening the near-real-time reader?  It'd be a bit tricky 
> because while we are waiting for merges, indexing may continue, and new 
> segments may be flushed, but those new segments shouldn't be included in the 
> point-in-time segments returned by refresh ...
> One could almost do this on top of Lucene today, with a custom merge policy, 
> and some hackity logic to have the merge policy target small segments just 
> written by refresh, but it's tricky to then open a near-real-time reader, 
> excluding newly flushed but including newly merged segments since the refresh 
> originally finished ...
> I'm not yet sure how best to solve this, so I wanted to open an issue for 
> discussion!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?

2020-06-17 Thread ASF subversion and git services (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17138827#comment-17138827
 ] 

ASF subversion and git services commented on LUCENE-8962:
-

Commit d65dcb43728dd6bb64393226e24576525328cecc in lucene-solr's branch 
refs/heads/branch_8x from Simon Willnauer
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=d65dcb4 ]

LUCENE-8962: Allow waiting for all merges in a merge spec (#1585)

This change adds infrastructure to allow straight forward waiting
on one or more merges or an entire merge specification. This is
a basis for LUCENE-8962.


> Can we merge small segments during refresh, for faster searching?
> -
>
> Key: LUCENE-8962
> URL: https://issues.apache.org/jira/browse/LUCENE-8962
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Michael McCandless
>Priority: Major
> Fix For: 8.6
>
> Attachments: LUCENE-8962_demo.png, failed-tests.patch
>
>  Time Spent: 16h 50m
>  Remaining Estimate: 0h
>
> With near-real-time search we ask {{IndexWriter}} to write all in-memory 
> segments to disk and open an {{IndexReader}} to search them, and this is 
> typically a quick operation.
> However, when you use many threads for concurrent indexing, {{IndexWriter}} 
> will accumulate write many small segments during {{refresh}} and this then 
> adds search-time cost as searching must visit all of these tiny segments.
> The merge policy would normally quickly coalesce these small segments if 
> given a little time ... so, could we somehow improve {{IndexWriter'}}s 
> refresh to optionally kick off merge policy to merge segments below some 
> threshold before opening the near-real-time reader?  It'd be a bit tricky 
> because while we are waiting for merges, indexing may continue, and new 
> segments may be flushed, but those new segments shouldn't be included in the 
> point-in-time segments returned by refresh ...
> One could almost do this on top of Lucene today, with a custom merge policy, 
> and some hackity logic to have the merge policy target small segments just 
> written by refresh, but it's tricky to then open a near-real-time reader, 
> excluding newly flushed but including newly merged segments since the refresh 
> originally finished ...
> I'm not yet sure how best to solve this, so I wanted to open an issue for 
> discussion!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?

2020-06-17 Thread ASF subversion and git services (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17138826#comment-17138826
 ] 

ASF subversion and git services commented on LUCENE-8962:
-

Commit d65dcb43728dd6bb64393226e24576525328cecc in lucene-solr's branch 
refs/heads/branch_8x from Simon Willnauer
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=d65dcb4 ]

LUCENE-8962: Allow waiting for all merges in a merge spec (#1585)

This change adds infrastructure to allow straight forward waiting
on one or more merges or an entire merge specification. This is
a basis for LUCENE-8962.


> Can we merge small segments during refresh, for faster searching?
> -
>
> Key: LUCENE-8962
> URL: https://issues.apache.org/jira/browse/LUCENE-8962
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Michael McCandless
>Priority: Major
> Fix For: 8.6
>
> Attachments: LUCENE-8962_demo.png, failed-tests.patch
>
>  Time Spent: 16h 50m
>  Remaining Estimate: 0h
>
> With near-real-time search we ask {{IndexWriter}} to write all in-memory 
> segments to disk and open an {{IndexReader}} to search them, and this is 
> typically a quick operation.
> However, when you use many threads for concurrent indexing, {{IndexWriter}} 
> will accumulate write many small segments during {{refresh}} and this then 
> adds search-time cost as searching must visit all of these tiny segments.
> The merge policy would normally quickly coalesce these small segments if 
> given a little time ... so, could we somehow improve {{IndexWriter'}}s 
> refresh to optionally kick off merge policy to merge segments below some 
> threshold before opening the near-real-time reader?  It'd be a bit tricky 
> because while we are waiting for merges, indexing may continue, and new 
> segments may be flushed, but those new segments shouldn't be included in the 
> point-in-time segments returned by refresh ...
> One could almost do this on top of Lucene today, with a custom merge policy, 
> and some hackity logic to have the merge policy target small segments just 
> written by refresh, but it's tricky to then open a near-real-time reader, 
> excluding newly flushed but including newly merged segments since the refresh 
> originally finished ...
> I'm not yet sure how best to solve this, so I wanted to open an issue for 
> discussion!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?

2020-06-17 Thread ASF subversion and git services (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17138820#comment-17138820
 ] 

ASF subversion and git services commented on LUCENE-8962:
-

Commit 59efe22ac29c95f9ba85b7214fcf5e30cc979222 in lucene-solr's branch 
refs/heads/master from Simon Willnauer
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=59efe22 ]

LUCENE-8962: Allow waiting for all merges in a merge spec (#1585)

This change adds infrastructure to allow straight forward waiting
on one or more merges or an entire merge specification. This is
a basis for LUCENE-8962.

> Can we merge small segments during refresh, for faster searching?
> -
>
> Key: LUCENE-8962
> URL: https://issues.apache.org/jira/browse/LUCENE-8962
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Michael McCandless
>Priority: Major
> Fix For: 8.6
>
> Attachments: LUCENE-8962_demo.png, failed-tests.patch
>
>  Time Spent: 16h 50m
>  Remaining Estimate: 0h
>
> With near-real-time search we ask {{IndexWriter}} to write all in-memory 
> segments to disk and open an {{IndexReader}} to search them, and this is 
> typically a quick operation.
> However, when you use many threads for concurrent indexing, {{IndexWriter}} 
> will accumulate write many small segments during {{refresh}} and this then 
> adds search-time cost as searching must visit all of these tiny segments.
> The merge policy would normally quickly coalesce these small segments if 
> given a little time ... so, could we somehow improve {{IndexWriter'}}s 
> refresh to optionally kick off merge policy to merge segments below some 
> threshold before opening the near-real-time reader?  It'd be a bit tricky 
> because while we are waiting for merges, indexing may continue, and new 
> segments may be flushed, but those new segments shouldn't be included in the 
> point-in-time segments returned by refresh ...
> One could almost do this on top of Lucene today, with a custom merge policy, 
> and some hackity logic to have the merge policy target small segments just 
> written by refresh, but it's tricky to then open a near-real-time reader, 
> excluding newly flushed but including newly merged segments since the refresh 
> originally finished ...
> I'm not yet sure how best to solve this, so I wanted to open an issue for 
> discussion!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?

2020-06-17 Thread ASF subversion and git services (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17138821#comment-17138821
 ] 

ASF subversion and git services commented on LUCENE-8962:
-

Commit 59efe22ac29c95f9ba85b7214fcf5e30cc979222 in lucene-solr's branch 
refs/heads/master from Simon Willnauer
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=59efe22 ]

LUCENE-8962: Allow waiting for all merges in a merge spec (#1585)

This change adds infrastructure to allow straight forward waiting
on one or more merges or an entire merge specification. This is
a basis for LUCENE-8962.

> Can we merge small segments during refresh, for faster searching?
> -
>
> Key: LUCENE-8962
> URL: https://issues.apache.org/jira/browse/LUCENE-8962
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Michael McCandless
>Priority: Major
> Fix For: 8.6
>
> Attachments: LUCENE-8962_demo.png, failed-tests.patch
>
>  Time Spent: 16h 50m
>  Remaining Estimate: 0h
>
> With near-real-time search we ask {{IndexWriter}} to write all in-memory 
> segments to disk and open an {{IndexReader}} to search them, and this is 
> typically a quick operation.
> However, when you use many threads for concurrent indexing, {{IndexWriter}} 
> will accumulate write many small segments during {{refresh}} and this then 
> adds search-time cost as searching must visit all of these tiny segments.
> The merge policy would normally quickly coalesce these small segments if 
> given a little time ... so, could we somehow improve {{IndexWriter'}}s 
> refresh to optionally kick off merge policy to merge segments below some 
> threshold before opening the near-real-time reader?  It'd be a bit tricky 
> because while we are waiting for merges, indexing may continue, and new 
> segments may be flushed, but those new segments shouldn't be included in the 
> point-in-time segments returned by refresh ...
> One could almost do this on top of Lucene today, with a custom merge policy, 
> and some hackity logic to have the merge policy target small segments just 
> written by refresh, but it's tricky to then open a near-real-time reader, 
> excluding newly flushed but including newly merged segments since the refresh 
> originally finished ...
> I'm not yet sure how best to solve this, so I wanted to open an issue for 
> discussion!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?

2020-06-04 Thread Nhat Nguyen (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17126253#comment-17126253
 ] 

Nhat Nguyen commented on LUCENE-8962:
-

>  For some reason it's not linked above, and I'm not sure how to remedy that

I guess the title of the PR needs to be in the format of "LUCENE-: 
description".

> Can we merge small segments during refresh, for faster searching?
> -
>
> Key: LUCENE-8962
> URL: https://issues.apache.org/jira/browse/LUCENE-8962
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Michael McCandless
>Priority: Major
> Fix For: 8.6
>
> Attachments: LUCENE-8962_demo.png, failed-tests.patch
>
>  Time Spent: 10h
>  Remaining Estimate: 0h
>
> With near-real-time search we ask {{IndexWriter}} to write all in-memory 
> segments to disk and open an {{IndexReader}} to search them, and this is 
> typically a quick operation.
> However, when you use many threads for concurrent indexing, {{IndexWriter}} 
> will accumulate write many small segments during {{refresh}} and this then 
> adds search-time cost as searching must visit all of these tiny segments.
> The merge policy would normally quickly coalesce these small segments if 
> given a little time ... so, could we somehow improve {{IndexWriter'}}s 
> refresh to optionally kick off merge policy to merge segments below some 
> threshold before opening the near-real-time reader?  It'd be a bit tricky 
> because while we are waiting for merges, indexing may continue, and new 
> segments may be flushed, but those new segments shouldn't be included in the 
> point-in-time segments returned by refresh ...
> One could almost do this on top of Lucene today, with a custom merge policy, 
> and some hackity logic to have the merge policy target small segments just 
> written by refresh, but it's tricky to then open a near-real-time reader, 
> excluding newly flushed but including newly merged segments since the refresh 
> originally finished ...
> I'm not yet sure how best to solve this, so I wanted to open an issue for 
> discussion!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?

2020-06-03 Thread Michael Sokolov (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17125282#comment-17125282
 ] 

Michael Sokolov commented on LUCENE-8962:
-

Posted a new PR that fixes the test failures we were seeing: 
[https://github.com/apache/lucene-solr/pull/1552]  For some reason it's not 
linked above, and I'm not sure how to remedy that

> Can we merge small segments during refresh, for faster searching?
> -
>
> Key: LUCENE-8962
> URL: https://issues.apache.org/jira/browse/LUCENE-8962
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Michael McCandless
>Priority: Major
> Fix For: 8.6
>
> Attachments: LUCENE-8962_demo.png, failed-tests.patch
>
>  Time Spent: 9h 40m
>  Remaining Estimate: 0h
>
> With near-real-time search we ask {{IndexWriter}} to write all in-memory 
> segments to disk and open an {{IndexReader}} to search them, and this is 
> typically a quick operation.
> However, when you use many threads for concurrent indexing, {{IndexWriter}} 
> will accumulate write many small segments during {{refresh}} and this then 
> adds search-time cost as searching must visit all of these tiny segments.
> The merge policy would normally quickly coalesce these small segments if 
> given a little time ... so, could we somehow improve {{IndexWriter'}}s 
> refresh to optionally kick off merge policy to merge segments below some 
> threshold before opening the near-real-time reader?  It'd be a bit tricky 
> because while we are waiting for merges, indexing may continue, and new 
> segments may be flushed, but those new segments shouldn't be included in the 
> point-in-time segments returned by refresh ...
> One could almost do this on top of Lucene today, with a custom merge policy, 
> and some hackity logic to have the merge policy target small segments just 
> written by refresh, but it's tricky to then open a near-real-time reader, 
> excluding newly flushed but including newly merged segments since the refresh 
> originally finished ...
> I'm not yet sure how best to solve this, so I wanted to open an issue for 
> discussion!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?

2020-03-25 Thread ASF subversion and git services (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17066966#comment-17066966
 ] 

ASF subversion and git services commented on LUCENE-8962:
-

Commit 075adac59865b3277adcf86052f2fae3e6d11135 in lucene-solr's branch 
refs/heads/master from Michael Sokolov
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=075adac ]

remove LUCENE-8962 from CHANGES.txt


> Can we merge small segments during refresh, for faster searching?
> -
>
> Key: LUCENE-8962
> URL: https://issues.apache.org/jira/browse/LUCENE-8962
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Michael McCandless
>Priority: Major
> Fix For: 8.6
>
> Attachments: LUCENE-8962_demo.png, failed-tests.patch
>
>  Time Spent: 9.5h
>  Remaining Estimate: 0h
>
> With near-real-time search we ask {{IndexWriter}} to write all in-memory 
> segments to disk and open an {{IndexReader}} to search them, and this is 
> typically a quick operation.
> However, when you use many threads for concurrent indexing, {{IndexWriter}} 
> will accumulate write many small segments during {{refresh}} and this then 
> adds search-time cost as searching must visit all of these tiny segments.
> The merge policy would normally quickly coalesce these small segments if 
> given a little time ... so, could we somehow improve {{IndexWriter'}}s 
> refresh to optionally kick off merge policy to merge segments below some 
> threshold before opening the near-real-time reader?  It'd be a bit tricky 
> because while we are waiting for merges, indexing may continue, and new 
> segments may be flushed, but those new segments shouldn't be included in the 
> point-in-time segments returned by refresh ...
> One could almost do this on top of Lucene today, with a custom merge policy, 
> and some hackity logic to have the merge policy target small segments just 
> written by refresh, but it's tricky to then open a near-real-time reader, 
> excluding newly flushed but including newly merged segments since the refresh 
> originally finished ...
> I'm not yet sure how best to solve this, so I wanted to open an issue for 
> discussion!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?

2020-03-25 Thread Michael Sokolov (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17066968#comment-17066968
 ] 

Michael Sokolov commented on LUCENE-8962:
-

Yes, sorry, merging all those reverts was slightly hellacious and I missed a 
spot. Now removed.

> Can we merge small segments during refresh, for faster searching?
> -
>
> Key: LUCENE-8962
> URL: https://issues.apache.org/jira/browse/LUCENE-8962
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Michael McCandless
>Priority: Major
> Fix For: 8.6
>
> Attachments: LUCENE-8962_demo.png, failed-tests.patch
>
>  Time Spent: 9.5h
>  Remaining Estimate: 0h
>
> With near-real-time search we ask {{IndexWriter}} to write all in-memory 
> segments to disk and open an {{IndexReader}} to search them, and this is 
> typically a quick operation.
> However, when you use many threads for concurrent indexing, {{IndexWriter}} 
> will accumulate write many small segments during {{refresh}} and this then 
> adds search-time cost as searching must visit all of these tiny segments.
> The merge policy would normally quickly coalesce these small segments if 
> given a little time ... so, could we somehow improve {{IndexWriter'}}s 
> refresh to optionally kick off merge policy to merge segments below some 
> threshold before opening the near-real-time reader?  It'd be a bit tricky 
> because while we are waiting for merges, indexing may continue, and new 
> segments may be flushed, but those new segments shouldn't be included in the 
> point-in-time segments returned by refresh ...
> One could almost do this on top of Lucene today, with a custom merge policy, 
> and some hackity logic to have the merge policy target small segments just 
> written by refresh, but it's tricky to then open a near-real-time reader, 
> excluding newly flushed but including newly merged segments since the refresh 
> originally finished ...
> I'm not yet sure how best to solve this, so I wanted to open an issue for 
> discussion!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?

2020-03-25 Thread David Smiley (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17066742#comment-17066742
 ] 

David Smiley commented on LUCENE-8962:
--

ok; makes sense.

> Can we merge small segments during refresh, for faster searching?
> -
>
> Key: LUCENE-8962
> URL: https://issues.apache.org/jira/browse/LUCENE-8962
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Michael McCandless
>Priority: Major
> Fix For: 8.6
>
> Attachments: LUCENE-8962_demo.png, failed-tests.patch
>
>  Time Spent: 9.5h
>  Remaining Estimate: 0h
>
> With near-real-time search we ask {{IndexWriter}} to write all in-memory 
> segments to disk and open an {{IndexReader}} to search them, and this is 
> typically a quick operation.
> However, when you use many threads for concurrent indexing, {{IndexWriter}} 
> will accumulate write many small segments during {{refresh}} and this then 
> adds search-time cost as searching must visit all of these tiny segments.
> The merge policy would normally quickly coalesce these small segments if 
> given a little time ... so, could we somehow improve {{IndexWriter'}}s 
> refresh to optionally kick off merge policy to merge segments below some 
> threshold before opening the near-real-time reader?  It'd be a bit tricky 
> because while we are waiting for merges, indexing may continue, and new 
> segments may be flushed, but those new segments shouldn't be included in the 
> point-in-time segments returned by refresh ...
> One could almost do this on top of Lucene today, with a custom merge policy, 
> and some hackity logic to have the merge policy target small segments just 
> written by refresh, but it's tricky to then open a near-real-time reader, 
> excluding newly flushed but including newly merged segments since the refresh 
> originally finished ...
> I'm not yet sure how best to solve this, so I wanted to open an issue for 
> discussion!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?

2020-03-25 Thread Alan Woodward (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17066573#comment-17066573
 ] 

Alan Woodward commented on LUCENE-8962:
---

It's not in the released 8.5 CHANGES.txt on branch_8_5 so I think this is 
probably only incorrect in master?

> Can we merge small segments during refresh, for faster searching?
> -
>
> Key: LUCENE-8962
> URL: https://issues.apache.org/jira/browse/LUCENE-8962
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Michael McCandless
>Priority: Major
> Fix For: 8.6
>
> Attachments: LUCENE-8962_demo.png, failed-tests.patch
>
>  Time Spent: 9.5h
>  Remaining Estimate: 0h
>
> With near-real-time search we ask {{IndexWriter}} to write all in-memory 
> segments to disk and open an {{IndexReader}} to search them, and this is 
> typically a quick operation.
> However, when you use many threads for concurrent indexing, {{IndexWriter}} 
> will accumulate write many small segments during {{refresh}} and this then 
> adds search-time cost as searching must visit all of these tiny segments.
> The merge policy would normally quickly coalesce these small segments if 
> given a little time ... so, could we somehow improve {{IndexWriter'}}s 
> refresh to optionally kick off merge policy to merge segments below some 
> threshold before opening the near-real-time reader?  It'd be a bit tricky 
> because while we are waiting for merges, indexing may continue, and new 
> segments may be flushed, but those new segments shouldn't be included in the 
> point-in-time segments returned by refresh ...
> One could almost do this on top of Lucene today, with a custom merge policy, 
> and some hackity logic to have the merge policy target small segments just 
> written by refresh, but it's tricky to then open a near-real-time reader, 
> excluding newly flushed but including newly merged segments since the refresh 
> originally finished ...
> I'm not yet sure how best to solve this, so I wanted to open an issue for 
> discussion!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?

2020-03-24 Thread David Smiley (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17066150#comment-17066150
 ] 

David Smiley commented on LUCENE-8962:
--

I see this in 8.5 CHANGES.txt yet this issue isn't "Closed".  So in CHANGES.txt 
wrong or the Jira status?

> Can we merge small segments during refresh, for faster searching?
> -
>
> Key: LUCENE-8962
> URL: https://issues.apache.org/jira/browse/LUCENE-8962
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Michael McCandless
>Priority: Major
> Fix For: 8.6
>
> Attachments: LUCENE-8962_demo.png, failed-tests.patch
>
>  Time Spent: 9.5h
>  Remaining Estimate: 0h
>
> With near-real-time search we ask {{IndexWriter}} to write all in-memory 
> segments to disk and open an {{IndexReader}} to search them, and this is 
> typically a quick operation.
> However, when you use many threads for concurrent indexing, {{IndexWriter}} 
> will accumulate write many small segments during {{refresh}} and this then 
> adds search-time cost as searching must visit all of these tiny segments.
> The merge policy would normally quickly coalesce these small segments if 
> given a little time ... so, could we somehow improve {{IndexWriter'}}s 
> refresh to optionally kick off merge policy to merge segments below some 
> threshold before opening the near-real-time reader?  It'd be a bit tricky 
> because while we are waiting for merges, indexing may continue, and new 
> segments may be flushed, but those new segments shouldn't be included in the 
> point-in-time segments returned by refresh ...
> One could almost do this on top of Lucene today, with a custom merge policy, 
> and some hackity logic to have the merge policy target small segments just 
> written by refresh, but it's tricky to then open a near-real-time reader, 
> excluding newly flushed but including newly merged segments since the refresh 
> originally finished ...
> I'm not yet sure how best to solve this, so I wanted to open an issue for 
> discussion!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?

2020-03-08 Thread Michael McCandless (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17054545#comment-17054545
 ] 

Michael McCandless commented on LUCENE-8962:


Thanks @sokolovm!  This is a tricky feature, but high impact.

> Can we merge small segments during refresh, for faster searching?
> -
>
> Key: LUCENE-8962
> URL: https://issues.apache.org/jira/browse/LUCENE-8962
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Michael McCandless
>Priority: Major
> Fix For: 8.5
>
> Attachments: LUCENE-8962_demo.png, failed-tests.patch
>
>  Time Spent: 9.5h
>  Remaining Estimate: 0h
>
> With near-real-time search we ask {{IndexWriter}} to write all in-memory 
> segments to disk and open an {{IndexReader}} to search them, and this is 
> typically a quick operation.
> However, when you use many threads for concurrent indexing, {{IndexWriter}} 
> will accumulate write many small segments during {{refresh}} and this then 
> adds search-time cost as searching must visit all of these tiny segments.
> The merge policy would normally quickly coalesce these small segments if 
> given a little time ... so, could we somehow improve {{IndexWriter'}}s 
> refresh to optionally kick off merge policy to merge segments below some 
> threshold before opening the near-real-time reader?  It'd be a bit tricky 
> because while we are waiting for merges, indexing may continue, and new 
> segments may be flushed, but those new segments shouldn't be included in the 
> point-in-time segments returned by refresh ...
> One could almost do this on top of Lucene today, with a custom merge policy, 
> and some hackity logic to have the merge policy target small segments just 
> written by refresh, but it's tricky to then open a near-real-time reader, 
> excluding newly flushed but including newly merged segments since the refresh 
> originally finished ...
> I'm not yet sure how best to solve this, so I wanted to open an issue for 
> discussion!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?

2020-03-08 Thread Michael Sokolov (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17054541#comment-17054541
 ] 

Michael Sokolov commented on LUCENE-8962:
-

> Wouldn’t it be easier to sort it all out in a branch rather on master?

Fair enough - I reverted from master and pushed jira/lucene-8962

> Can we merge small segments during refresh, for faster searching?
> -
>
> Key: LUCENE-8962
> URL: https://issues.apache.org/jira/browse/LUCENE-8962
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Michael McCandless
>Priority: Major
> Fix For: 8.5
>
> Attachments: LUCENE-8962_demo.png, failed-tests.patch
>
>  Time Spent: 9.5h
>  Remaining Estimate: 0h
>
> With near-real-time search we ask {{IndexWriter}} to write all in-memory 
> segments to disk and open an {{IndexReader}} to search them, and this is 
> typically a quick operation.
> However, when you use many threads for concurrent indexing, {{IndexWriter}} 
> will accumulate write many small segments during {{refresh}} and this then 
> adds search-time cost as searching must visit all of these tiny segments.
> The merge policy would normally quickly coalesce these small segments if 
> given a little time ... so, could we somehow improve {{IndexWriter'}}s 
> refresh to optionally kick off merge policy to merge segments below some 
> threshold before opening the near-real-time reader?  It'd be a bit tricky 
> because while we are waiting for merges, indexing may continue, and new 
> segments may be flushed, but those new segments shouldn't be included in the 
> point-in-time segments returned by refresh ...
> One could almost do this on top of Lucene today, with a custom merge policy, 
> and some hackity logic to have the merge policy target small segments just 
> written by refresh, but it's tricky to then open a near-real-time reader, 
> excluding newly flushed but including newly merged segments since the refresh 
> originally finished ...
> I'm not yet sure how best to solve this, so I wanted to open an issue for 
> discussion!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?

2020-03-08 Thread ASF subversion and git services (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17054537#comment-17054537
 ] 

ASF subversion and git services commented on LUCENE-8962:
-

Commit 4501b3d3fdbc35af99bde6abe7432cfc5e8b5547 in lucene-solr's branch 
refs/heads/master from Michael Sokolov
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=4501b3d ]

Revert "LUCENE-8962: Split test case (#1313)"

This reverts commit 90aced5a51f92ffd6e97449eb7c44aacc643c8a3.

Revert "LUCENE-8962: woops, remove leftover accidental copyright (darned IDEs)"

This reverts commit 3dbfd102794419551f2ba4b43344cf9e6242a2b8.

Revert "LUCENE-8962: Fix intermittent test failures"

This reverts commit a5475de57fed6b339cd5565bd1bd2650f265a537.

Revert "LUCENE-8962: Add ability to selectively merge on commit (#1155)"

This reverts commit a1791e77143aa8087c0b5ee0e8eb57422e59a09a.


> Can we merge small segments during refresh, for faster searching?
> -
>
> Key: LUCENE-8962
> URL: https://issues.apache.org/jira/browse/LUCENE-8962
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Michael McCandless
>Priority: Major
> Fix For: 8.5
>
> Attachments: LUCENE-8962_demo.png, failed-tests.patch
>
>  Time Spent: 9.5h
>  Remaining Estimate: 0h
>
> With near-real-time search we ask {{IndexWriter}} to write all in-memory 
> segments to disk and open an {{IndexReader}} to search them, and this is 
> typically a quick operation.
> However, when you use many threads for concurrent indexing, {{IndexWriter}} 
> will accumulate write many small segments during {{refresh}} and this then 
> adds search-time cost as searching must visit all of these tiny segments.
> The merge policy would normally quickly coalesce these small segments if 
> given a little time ... so, could we somehow improve {{IndexWriter'}}s 
> refresh to optionally kick off merge policy to merge segments below some 
> threshold before opening the near-real-time reader?  It'd be a bit tricky 
> because while we are waiting for merges, indexing may continue, and new 
> segments may be flushed, but those new segments shouldn't be included in the 
> point-in-time segments returned by refresh ...
> One could almost do this on top of Lucene today, with a custom merge policy, 
> and some hackity logic to have the merge policy target small segments just 
> written by refresh, but it's tricky to then open a near-real-time reader, 
> excluding newly flushed but including newly merged segments since the refresh 
> originally finished ...
> I'm not yet sure how best to solve this, so I wanted to open an issue for 
> discussion!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?

2020-03-08 Thread ASF subversion and git services (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17054539#comment-17054539
 ] 

ASF subversion and git services commented on LUCENE-8962:
-

Commit 4501b3d3fdbc35af99bde6abe7432cfc5e8b5547 in lucene-solr's branch 
refs/heads/master from Michael Sokolov
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=4501b3d ]

Revert "LUCENE-8962: Split test case (#1313)"

This reverts commit 90aced5a51f92ffd6e97449eb7c44aacc643c8a3.

Revert "LUCENE-8962: woops, remove leftover accidental copyright (darned IDEs)"

This reverts commit 3dbfd102794419551f2ba4b43344cf9e6242a2b8.

Revert "LUCENE-8962: Fix intermittent test failures"

This reverts commit a5475de57fed6b339cd5565bd1bd2650f265a537.

Revert "LUCENE-8962: Add ability to selectively merge on commit (#1155)"

This reverts commit a1791e77143aa8087c0b5ee0e8eb57422e59a09a.


> Can we merge small segments during refresh, for faster searching?
> -
>
> Key: LUCENE-8962
> URL: https://issues.apache.org/jira/browse/LUCENE-8962
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Michael McCandless
>Priority: Major
> Fix For: 8.5
>
> Attachments: LUCENE-8962_demo.png, failed-tests.patch
>
>  Time Spent: 9.5h
>  Remaining Estimate: 0h
>
> With near-real-time search we ask {{IndexWriter}} to write all in-memory 
> segments to disk and open an {{IndexReader}} to search them, and this is 
> typically a quick operation.
> However, when you use many threads for concurrent indexing, {{IndexWriter}} 
> will accumulate write many small segments during {{refresh}} and this then 
> adds search-time cost as searching must visit all of these tiny segments.
> The merge policy would normally quickly coalesce these small segments if 
> given a little time ... so, could we somehow improve {{IndexWriter'}}s 
> refresh to optionally kick off merge policy to merge segments below some 
> threshold before opening the near-real-time reader?  It'd be a bit tricky 
> because while we are waiting for merges, indexing may continue, and new 
> segments may be flushed, but those new segments shouldn't be included in the 
> point-in-time segments returned by refresh ...
> One could almost do this on top of Lucene today, with a custom merge policy, 
> and some hackity logic to have the merge policy target small segments just 
> written by refresh, but it's tricky to then open a near-real-time reader, 
> excluding newly flushed but including newly merged segments since the refresh 
> originally finished ...
> I'm not yet sure how best to solve this, so I wanted to open an issue for 
> discussion!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?

2020-03-08 Thread ASF subversion and git services (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17054540#comment-17054540
 ] 

ASF subversion and git services commented on LUCENE-8962:
-

Commit 4501b3d3fdbc35af99bde6abe7432cfc5e8b5547 in lucene-solr's branch 
refs/heads/master from Michael Sokolov
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=4501b3d ]

Revert "LUCENE-8962: Split test case (#1313)"

This reverts commit 90aced5a51f92ffd6e97449eb7c44aacc643c8a3.

Revert "LUCENE-8962: woops, remove leftover accidental copyright (darned IDEs)"

This reverts commit 3dbfd102794419551f2ba4b43344cf9e6242a2b8.

Revert "LUCENE-8962: Fix intermittent test failures"

This reverts commit a5475de57fed6b339cd5565bd1bd2650f265a537.

Revert "LUCENE-8962: Add ability to selectively merge on commit (#1155)"

This reverts commit a1791e77143aa8087c0b5ee0e8eb57422e59a09a.


> Can we merge small segments during refresh, for faster searching?
> -
>
> Key: LUCENE-8962
> URL: https://issues.apache.org/jira/browse/LUCENE-8962
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Michael McCandless
>Priority: Major
> Fix For: 8.5
>
> Attachments: LUCENE-8962_demo.png, failed-tests.patch
>
>  Time Spent: 9.5h
>  Remaining Estimate: 0h
>
> With near-real-time search we ask {{IndexWriter}} to write all in-memory 
> segments to disk and open an {{IndexReader}} to search them, and this is 
> typically a quick operation.
> However, when you use many threads for concurrent indexing, {{IndexWriter}} 
> will accumulate write many small segments during {{refresh}} and this then 
> adds search-time cost as searching must visit all of these tiny segments.
> The merge policy would normally quickly coalesce these small segments if 
> given a little time ... so, could we somehow improve {{IndexWriter'}}s 
> refresh to optionally kick off merge policy to merge segments below some 
> threshold before opening the near-real-time reader?  It'd be a bit tricky 
> because while we are waiting for merges, indexing may continue, and new 
> segments may be flushed, but those new segments shouldn't be included in the 
> point-in-time segments returned by refresh ...
> One could almost do this on top of Lucene today, with a custom merge policy, 
> and some hackity logic to have the merge policy target small segments just 
> written by refresh, but it's tricky to then open a near-real-time reader, 
> excluding newly flushed but including newly merged segments since the refresh 
> originally finished ...
> I'm not yet sure how best to solve this, so I wanted to open an issue for 
> discussion!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?

2020-03-08 Thread ASF subversion and git services (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17054538#comment-17054538
 ] 

ASF subversion and git services commented on LUCENE-8962:
-

Commit 4501b3d3fdbc35af99bde6abe7432cfc5e8b5547 in lucene-solr's branch 
refs/heads/master from Michael Sokolov
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=4501b3d ]

Revert "LUCENE-8962: Split test case (#1313)"

This reverts commit 90aced5a51f92ffd6e97449eb7c44aacc643c8a3.

Revert "LUCENE-8962: woops, remove leftover accidental copyright (darned IDEs)"

This reverts commit 3dbfd102794419551f2ba4b43344cf9e6242a2b8.

Revert "LUCENE-8962: Fix intermittent test failures"

This reverts commit a5475de57fed6b339cd5565bd1bd2650f265a537.

Revert "LUCENE-8962: Add ability to selectively merge on commit (#1155)"

This reverts commit a1791e77143aa8087c0b5ee0e8eb57422e59a09a.


> Can we merge small segments during refresh, for faster searching?
> -
>
> Key: LUCENE-8962
> URL: https://issues.apache.org/jira/browse/LUCENE-8962
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Michael McCandless
>Priority: Major
> Fix For: 8.5
>
> Attachments: LUCENE-8962_demo.png, failed-tests.patch
>
>  Time Spent: 9.5h
>  Remaining Estimate: 0h
>
> With near-real-time search we ask {{IndexWriter}} to write all in-memory 
> segments to disk and open an {{IndexReader}} to search them, and this is 
> typically a quick operation.
> However, when you use many threads for concurrent indexing, {{IndexWriter}} 
> will accumulate write many small segments during {{refresh}} and this then 
> adds search-time cost as searching must visit all of these tiny segments.
> The merge policy would normally quickly coalesce these small segments if 
> given a little time ... so, could we somehow improve {{IndexWriter'}}s 
> refresh to optionally kick off merge policy to merge segments below some 
> threshold before opening the near-real-time reader?  It'd be a bit tricky 
> because while we are waiting for merges, indexing may continue, and new 
> segments may be flushed, but those new segments shouldn't be included in the 
> point-in-time segments returned by refresh ...
> One could almost do this on top of Lucene today, with a custom merge policy, 
> and some hackity logic to have the merge policy target small segments just 
> written by refresh, but it's tricky to then open a near-real-time reader, 
> excluding newly flushed but including newly merged segments since the refresh 
> originally finished ...
> I'm not yet sure how best to solve this, so I wanted to open an issue for 
> discussion!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?

2020-03-08 Thread Simon Willnauer (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17054506#comment-17054506
 ] 

Simon Willnauer commented on LUCENE-8962:
-

{quote}
I don't think we should do this – IndexWriter's purpose is making changes to 
the index, and IndexReader simply reads what IndexWriter created.  There are 
wildly diverse users of Lucene and if we now set down the path of 
expecting/allowing IndexReader to do it's own "little" optimizations on 
startup, that can add a lot of unexpected cost, and surprising bugs, to many 
use cases.  IndexWriter is indeed complex, but we should find ways to reduce 
that complexity so that we can implement features in the right classes, rather 
than shifting index-changing features out to IndexReader.
{quote}

This was really just an idea of how this could be done without adding much 
complexity to existing methods. I can see this being a method like this 
_IndexReader optimize(IndexReader reader, MergePolicy policy)_ than can even 
commit the changes. It can run in parallel to the normal indexing but is 
idempotent and much simpler to test. Again, just an idea to another approach. 

> Can we merge small segments during refresh, for faster searching?
> -
>
> Key: LUCENE-8962
> URL: https://issues.apache.org/jira/browse/LUCENE-8962
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Michael McCandless
>Priority: Major
> Fix For: 8.5
>
> Attachments: LUCENE-8962_demo.png, failed-tests.patch
>
>  Time Spent: 9.5h
>  Remaining Estimate: 0h
>
> With near-real-time search we ask {{IndexWriter}} to write all in-memory 
> segments to disk and open an {{IndexReader}} to search them, and this is 
> typically a quick operation.
> However, when you use many threads for concurrent indexing, {{IndexWriter}} 
> will accumulate write many small segments during {{refresh}} and this then 
> adds search-time cost as searching must visit all of these tiny segments.
> The merge policy would normally quickly coalesce these small segments if 
> given a little time ... so, could we somehow improve {{IndexWriter'}}s 
> refresh to optionally kick off merge policy to merge segments below some 
> threshold before opening the near-real-time reader?  It'd be a bit tricky 
> because while we are waiting for merges, indexing may continue, and new 
> segments may be flushed, but those new segments shouldn't be included in the 
> point-in-time segments returned by refresh ...
> One could almost do this on top of Lucene today, with a custom merge policy, 
> and some hackity logic to have the merge policy target small segments just 
> written by refresh, but it's tricky to then open a near-real-time reader, 
> excluding newly flushed but including newly merged segments since the refresh 
> originally finished ...
> I'm not yet sure how best to solve this, so I wanted to open an issue for 
> discussion!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?

2020-03-08 Thread Nhat Nguyen (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17054489#comment-17054489
 ] 

Nhat Nguyen commented on LUCENE-8962:
-

{quote}it's no good if ES tests are failing but Lucene's tests are not!  Are 
these ES test runs / logs publicly visible somewhere?{quote}

[~mikemccand] Thank you for responding. The failing Elasticsearch tests are 
tracked at 
[https://github.com/elastic/elasticsearch/issues/53195|https://github.com/elastic/elasticsearch/issues/53195].
 I've opened 
[https://github.com/apache/lucene-solr/pull/1331|https://github.com/apache/lucene-solr/pull/1331]
 to port those tests to Lucene.

> Can we merge small segments during refresh, for faster searching?
> -
>
> Key: LUCENE-8962
> URL: https://issues.apache.org/jira/browse/LUCENE-8962
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Michael McCandless
>Priority: Major
> Fix For: 8.5
>
> Attachments: LUCENE-8962_demo.png, failed-tests.patch
>
>  Time Spent: 9.5h
>  Remaining Estimate: 0h
>
> With near-real-time search we ask {{IndexWriter}} to write all in-memory 
> segments to disk and open an {{IndexReader}} to search them, and this is 
> typically a quick operation.
> However, when you use many threads for concurrent indexing, {{IndexWriter}} 
> will accumulate write many small segments during {{refresh}} and this then 
> adds search-time cost as searching must visit all of these tiny segments.
> The merge policy would normally quickly coalesce these small segments if 
> given a little time ... so, could we somehow improve {{IndexWriter'}}s 
> refresh to optionally kick off merge policy to merge segments below some 
> threshold before opening the near-real-time reader?  It'd be a bit tricky 
> because while we are waiting for merges, indexing may continue, and new 
> segments may be flushed, but those new segments shouldn't be included in the 
> point-in-time segments returned by refresh ...
> One could almost do this on top of Lucene today, with a custom merge policy, 
> and some hackity logic to have the merge policy target small segments just 
> written by refresh, but it's tricky to then open a near-real-time reader, 
> excluding newly flushed but including newly merged segments since the refresh 
> originally finished ...
> I'm not yet sure how best to solve this, so I wanted to open an issue for 
> discussion!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?

2020-03-08 Thread Michael McCandless (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17054483#comment-17054483
 ] 

Michael McCandless commented on LUCENE-8962:


{quote}I have translated some Elasticsearch tests (see [^failed-tests.patch]) 
to Lucene.
{quote}
Thanks [~dnhatn] – it's no good if ES tests are failing but Lucene's tests are 
not!  Are these ES test runs / logs publicly visible somewhere?
{quote}Wouldn’t it be easier to sort it all out in a branch rather on master?
{quote}
+1, let's revert on master too, until we can work through these new randomized 
testing failures.  This is certainly tricky code, and we should iterate outside 
master/8.x until it's working well with all tests.
{quote}I'm thinking of adding a {{boolean}} field to {{OneMerge}} that gets set 
once a merge is successfully committed (e.g. just before the call to 
{{closeMergeReaders}} in {{IndexWriter.commitMerge()}}), which the 
{{mergeFinished}} override can use to determine if the merge completed 
successfully or not.
{quote}
+1 – clearly {{mergeFinished}} needs some way to know if the merge was 
successful or not.  Maybe open a separate issue/PR to solve this first, as a 
precursor?
{quote}With a slightly refactored IW we can share the merge logic and let the 
reader re-write itself since we are talking about very small segments the 
overhead is very small. 
{quote}
I don't think we should do this – {{IndexWriter}}'s purpose is making changes 
to the index, and {{IndexReader}} simply reads what {{IndexWriter}} created.  
There are wildly diverse users of Lucene and if we now set down the path of 
expecting/allowing {{IndexReader}} to do it's own "little" optimizations on 
startup, that can add a lot of unexpected cost, and surprising bugs, to many 
use cases.  {{IndexWriter}} is indeed complex, but we should find ways to 
reduce that complexity so that we can implement features in the right classes, 
rather than shifting index-changing features out to {{IndexReader}}.

In our (Amazon customer facing product search) usage of Lucene, this 
optimization is impactful because a single index is replicated across deep 
replica count, and shifting work that could be done once by {{IndexWriter}} to 
every {{IndexReader}} that later opens that commit point would add quite a bit 
of cost to those readers.

That said, I'm definitely +1 to find a simpler way to achieve this powerful 
feature, just within {{IndexWriter}}.  If you look at the discussion in the 
original PR, we considered making it just another trigger under the existing 
{{MergePolicy}} methods, but there were downsides (discussed on the PR) to 
that.  Also, this feature/change is only targeting {{commit}} which is already 
a simplification (NOT trying to target NRT readers).  But maybe 
[~simonwillnauer] you see a simpler way to implement this during 
{{IndexWriter.commit}}?

> Can we merge small segments during refresh, for faster searching?
> -
>
> Key: LUCENE-8962
> URL: https://issues.apache.org/jira/browse/LUCENE-8962
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Michael McCandless
>Priority: Major
> Fix For: 8.5
>
> Attachments: LUCENE-8962_demo.png, failed-tests.patch
>
>  Time Spent: 9.5h
>  Remaining Estimate: 0h
>
> With near-real-time search we ask {{IndexWriter}} to write all in-memory 
> segments to disk and open an {{IndexReader}} to search them, and this is 
> typically a quick operation.
> However, when you use many threads for concurrent indexing, {{IndexWriter}} 
> will accumulate write many small segments during {{refresh}} and this then 
> adds search-time cost as searching must visit all of these tiny segments.
> The merge policy would normally quickly coalesce these small segments if 
> given a little time ... so, could we somehow improve {{IndexWriter'}}s 
> refresh to optionally kick off merge policy to merge segments below some 
> threshold before opening the near-real-time reader?  It'd be a bit tricky 
> because while we are waiting for merges, indexing may continue, and new 
> segments may be flushed, but those new segments shouldn't be included in the 
> point-in-time segments returned by refresh ...
> One could almost do this on top of Lucene today, with a custom merge policy, 
> and some hackity logic to have the merge policy target small segments just 
> written by refresh, but it's tricky to then open a near-real-time reader, 
> excluding newly flushed but including newly merged segments since the refresh 
> originally finished ...
> I'm not yet sure how best to solve this, so I wanted to open an issue for 
> discussion!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

--

[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?

2020-03-07 Thread Nhat Nguyen (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17054163#comment-17054163
 ] 

Nhat Nguyen commented on LUCENE-8962:
-

I have translated some Elasticsearch tests (see  [^failed-tests.patch]) to 
Lucene. The test _testMergeOnCommitKeepFullyDeletedSegments_ consistently fails 
because _mergeFinished_ was called twice. This happens because we forget to set 
_success_ to _true_ when the new segment does not have any document (i.e., 
_SegmentMerger#shouldMerge_ is false).

Even with that fix, both _testRandomOperations_ and 
_testRandomOperationsWithSoftDeletes_ still randomly fail after a few thousand 
iterations. Since both are multi-threaded tests, we will need more time to 
figure out the root cause.


> Can we merge small segments during refresh, for faster searching?
> -
>
> Key: LUCENE-8962
> URL: https://issues.apache.org/jira/browse/LUCENE-8962
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Michael McCandless
>Priority: Major
> Fix For: 8.5
>
> Attachments: LUCENE-8962_demo.png, failed-tests.patch
>
>  Time Spent: 9.5h
>  Remaining Estimate: 0h
>
> With near-real-time search we ask {{IndexWriter}} to write all in-memory 
> segments to disk and open an {{IndexReader}} to search them, and this is 
> typically a quick operation.
> However, when you use many threads for concurrent indexing, {{IndexWriter}} 
> will accumulate write many small segments during {{refresh}} and this then 
> adds search-time cost as searching must visit all of these tiny segments.
> The merge policy would normally quickly coalesce these small segments if 
> given a little time ... so, could we somehow improve {{IndexWriter'}}s 
> refresh to optionally kick off merge policy to merge segments below some 
> threshold before opening the near-real-time reader?  It'd be a bit tricky 
> because while we are waiting for merges, indexing may continue, and new 
> segments may be flushed, but those new segments shouldn't be included in the 
> point-in-time segments returned by refresh ...
> One could almost do this on top of Lucene today, with a custom merge policy, 
> and some hackity logic to have the merge policy target small segments just 
> written by refresh, but it's tricky to then open a near-real-time reader, 
> excluding newly flushed but including newly merged segments since the refresh 
> originally finished ...
> I'm not yet sure how best to solve this, so I wanted to open an issue for 
> discussion!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?

2020-03-07 Thread Simon Willnauer (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17054111#comment-17054111
 ] 

Simon Willnauer commented on LUCENE-8962:
-

Wouldn’t it be easier to sort it all out in a branch rather on master? 





> Can we merge small segments during refresh, for faster searching?
> -
>
> Key: LUCENE-8962
> URL: https://issues.apache.org/jira/browse/LUCENE-8962
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Michael McCandless
>Priority: Major
> Fix For: 8.5
>
> Attachments: LUCENE-8962_demo.png
>
>  Time Spent: 9.5h
>  Remaining Estimate: 0h
>
> With near-real-time search we ask {{IndexWriter}} to write all in-memory 
> segments to disk and open an {{IndexReader}} to search them, and this is 
> typically a quick operation.
> However, when you use many threads for concurrent indexing, {{IndexWriter}} 
> will accumulate write many small segments during {{refresh}} and this then 
> adds search-time cost as searching must visit all of these tiny segments.
> The merge policy would normally quickly coalesce these small segments if 
> given a little time ... so, could we somehow improve {{IndexWriter'}}s 
> refresh to optionally kick off merge policy to merge segments below some 
> threshold before opening the near-real-time reader?  It'd be a bit tricky 
> because while we are waiting for merges, indexing may continue, and new 
> segments may be flushed, but those new segments shouldn't be included in the 
> point-in-time segments returned by refresh ...
> One could almost do this on top of Lucene today, with a custom merge policy, 
> and some hackity logic to have the merge policy target small segments just 
> written by refresh, but it's tricky to then open a near-real-time reader, 
> excluding newly flushed but including newly merged segments since the refresh 
> originally finished ...
> I'm not yet sure how best to solve this, so I wanted to open an issue for 
> discussion!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?

2020-03-07 Thread Michael Sokolov (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17054078#comment-17054078
 ] 

Michael Sokolov commented on LUCENE-8962:
-

I'm leaving on master for now until we can sort out how to move forward

> Can we merge small segments during refresh, for faster searching?
> -
>
> Key: LUCENE-8962
> URL: https://issues.apache.org/jira/browse/LUCENE-8962
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Michael McCandless
>Priority: Major
> Fix For: 8.5
>
> Attachments: LUCENE-8962_demo.png
>
>  Time Spent: 9.5h
>  Remaining Estimate: 0h
>
> With near-real-time search we ask {{IndexWriter}} to write all in-memory 
> segments to disk and open an {{IndexReader}} to search them, and this is 
> typically a quick operation.
> However, when you use many threads for concurrent indexing, {{IndexWriter}} 
> will accumulate write many small segments during {{refresh}} and this then 
> adds search-time cost as searching must visit all of these tiny segments.
> The merge policy would normally quickly coalesce these small segments if 
> given a little time ... so, could we somehow improve {{IndexWriter'}}s 
> refresh to optionally kick off merge policy to merge segments below some 
> threshold before opening the near-real-time reader?  It'd be a bit tricky 
> because while we are waiting for merges, indexing may continue, and new 
> segments may be flushed, but those new segments shouldn't be included in the 
> point-in-time segments returned by refresh ...
> One could almost do this on top of Lucene today, with a custom merge policy, 
> and some hackity logic to have the merge policy target small segments just 
> written by refresh, but it's tricky to then open a near-real-time reader, 
> excluding newly flushed but including newly merged segments since the refresh 
> originally finished ...
> I'm not yet sure how best to solve this, so I wanted to open an issue for 
> discussion!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?

2020-03-07 Thread ASF subversion and git services (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17054076#comment-17054076
 ] 

ASF subversion and git services commented on LUCENE-8962:
-

Commit 89083d43ae6c884e498e40f92053c689925abd66 in lucene-solr's branch 
refs/heads/branch_8x from Michael Sokolov
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=89083d4 ]

Revert "LUCENE-8962: Split test case (#1313)"

This reverts commit 90aced5a51f92ffd6e97449eb7c44aacc643c8a3.

Revert "LUCENE-8962: woops, remove leftover accidental copyright (darned IDEs)"

This reverts commit 3dbfd102794419551f2ba4b43344cf9e6242a2b8.

Revert "LUCENE-8962: Fix intermittent test failures"

This reverts commit a5475de57fed6b339cd5565bd1bd2650f265a537.

Revert "LUCENE-8962: Add ability to selectively merge on commit (#1155)"

This reverts commit a1791e77143aa8087c0b5ee0e8eb57422e59a09a.


> Can we merge small segments during refresh, for faster searching?
> -
>
> Key: LUCENE-8962
> URL: https://issues.apache.org/jira/browse/LUCENE-8962
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Michael McCandless
>Priority: Major
> Fix For: 8.5
>
> Attachments: LUCENE-8962_demo.png
>
>  Time Spent: 9.5h
>  Remaining Estimate: 0h
>
> With near-real-time search we ask {{IndexWriter}} to write all in-memory 
> segments to disk and open an {{IndexReader}} to search them, and this is 
> typically a quick operation.
> However, when you use many threads for concurrent indexing, {{IndexWriter}} 
> will accumulate write many small segments during {{refresh}} and this then 
> adds search-time cost as searching must visit all of these tiny segments.
> The merge policy would normally quickly coalesce these small segments if 
> given a little time ... so, could we somehow improve {{IndexWriter'}}s 
> refresh to optionally kick off merge policy to merge segments below some 
> threshold before opening the near-real-time reader?  It'd be a bit tricky 
> because while we are waiting for merges, indexing may continue, and new 
> segments may be flushed, but those new segments shouldn't be included in the 
> point-in-time segments returned by refresh ...
> One could almost do this on top of Lucene today, with a custom merge policy, 
> and some hackity logic to have the merge policy target small segments just 
> written by refresh, but it's tricky to then open a near-real-time reader, 
> excluding newly flushed but including newly merged segments since the refresh 
> originally finished ...
> I'm not yet sure how best to solve this, so I wanted to open an issue for 
> discussion!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?

2020-03-07 Thread ASF subversion and git services (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17054075#comment-17054075
 ] 

ASF subversion and git services commented on LUCENE-8962:
-

Commit 89083d43ae6c884e498e40f92053c689925abd66 in lucene-solr's branch 
refs/heads/branch_8x from Michael Sokolov
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=89083d4 ]

Revert "LUCENE-8962: Split test case (#1313)"

This reverts commit 90aced5a51f92ffd6e97449eb7c44aacc643c8a3.

Revert "LUCENE-8962: woops, remove leftover accidental copyright (darned IDEs)"

This reverts commit 3dbfd102794419551f2ba4b43344cf9e6242a2b8.

Revert "LUCENE-8962: Fix intermittent test failures"

This reverts commit a5475de57fed6b339cd5565bd1bd2650f265a537.

Revert "LUCENE-8962: Add ability to selectively merge on commit (#1155)"

This reverts commit a1791e77143aa8087c0b5ee0e8eb57422e59a09a.


> Can we merge small segments during refresh, for faster searching?
> -
>
> Key: LUCENE-8962
> URL: https://issues.apache.org/jira/browse/LUCENE-8962
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Michael McCandless
>Priority: Major
> Fix For: 8.5
>
> Attachments: LUCENE-8962_demo.png
>
>  Time Spent: 9.5h
>  Remaining Estimate: 0h
>
> With near-real-time search we ask {{IndexWriter}} to write all in-memory 
> segments to disk and open an {{IndexReader}} to search them, and this is 
> typically a quick operation.
> However, when you use many threads for concurrent indexing, {{IndexWriter}} 
> will accumulate write many small segments during {{refresh}} and this then 
> adds search-time cost as searching must visit all of these tiny segments.
> The merge policy would normally quickly coalesce these small segments if 
> given a little time ... so, could we somehow improve {{IndexWriter'}}s 
> refresh to optionally kick off merge policy to merge segments below some 
> threshold before opening the near-real-time reader?  It'd be a bit tricky 
> because while we are waiting for merges, indexing may continue, and new 
> segments may be flushed, but those new segments shouldn't be included in the 
> point-in-time segments returned by refresh ...
> One could almost do this on top of Lucene today, with a custom merge policy, 
> and some hackity logic to have the merge policy target small segments just 
> written by refresh, but it's tricky to then open a near-real-time reader, 
> excluding newly flushed but including newly merged segments since the refresh 
> originally finished ...
> I'm not yet sure how best to solve this, so I wanted to open an issue for 
> discussion!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?

2020-03-07 Thread ASF subversion and git services (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17054077#comment-17054077
 ] 

ASF subversion and git services commented on LUCENE-8962:
-

Commit 89083d43ae6c884e498e40f92053c689925abd66 in lucene-solr's branch 
refs/heads/branch_8x from Michael Sokolov
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=89083d4 ]

Revert "LUCENE-8962: Split test case (#1313)"

This reverts commit 90aced5a51f92ffd6e97449eb7c44aacc643c8a3.

Revert "LUCENE-8962: woops, remove leftover accidental copyright (darned IDEs)"

This reverts commit 3dbfd102794419551f2ba4b43344cf9e6242a2b8.

Revert "LUCENE-8962: Fix intermittent test failures"

This reverts commit a5475de57fed6b339cd5565bd1bd2650f265a537.

Revert "LUCENE-8962: Add ability to selectively merge on commit (#1155)"

This reverts commit a1791e77143aa8087c0b5ee0e8eb57422e59a09a.


> Can we merge small segments during refresh, for faster searching?
> -
>
> Key: LUCENE-8962
> URL: https://issues.apache.org/jira/browse/LUCENE-8962
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Michael McCandless
>Priority: Major
> Fix For: 8.5
>
> Attachments: LUCENE-8962_demo.png
>
>  Time Spent: 9.5h
>  Remaining Estimate: 0h
>
> With near-real-time search we ask {{IndexWriter}} to write all in-memory 
> segments to disk and open an {{IndexReader}} to search them, and this is 
> typically a quick operation.
> However, when you use many threads for concurrent indexing, {{IndexWriter}} 
> will accumulate write many small segments during {{refresh}} and this then 
> adds search-time cost as searching must visit all of these tiny segments.
> The merge policy would normally quickly coalesce these small segments if 
> given a little time ... so, could we somehow improve {{IndexWriter'}}s 
> refresh to optionally kick off merge policy to merge segments below some 
> threshold before opening the near-real-time reader?  It'd be a bit tricky 
> because while we are waiting for merges, indexing may continue, and new 
> segments may be flushed, but those new segments shouldn't be included in the 
> point-in-time segments returned by refresh ...
> One could almost do this on top of Lucene today, with a custom merge policy, 
> and some hackity logic to have the merge policy target small segments just 
> written by refresh, but it's tricky to then open a near-real-time reader, 
> excluding newly flushed but including newly merged segments since the refresh 
> originally finished ...
> I'm not yet sure how best to solve this, so I wanted to open an issue for 
> discussion!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?

2020-03-07 Thread ASF subversion and git services (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17054074#comment-17054074
 ] 

ASF subversion and git services commented on LUCENE-8962:
-

Commit 89083d43ae6c884e498e40f92053c689925abd66 in lucene-solr's branch 
refs/heads/branch_8x from Michael Sokolov
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=89083d4 ]

Revert "LUCENE-8962: Split test case (#1313)"

This reverts commit 90aced5a51f92ffd6e97449eb7c44aacc643c8a3.

Revert "LUCENE-8962: woops, remove leftover accidental copyright (darned IDEs)"

This reverts commit 3dbfd102794419551f2ba4b43344cf9e6242a2b8.

Revert "LUCENE-8962: Fix intermittent test failures"

This reverts commit a5475de57fed6b339cd5565bd1bd2650f265a537.

Revert "LUCENE-8962: Add ability to selectively merge on commit (#1155)"

This reverts commit a1791e77143aa8087c0b5ee0e8eb57422e59a09a.


> Can we merge small segments during refresh, for faster searching?
> -
>
> Key: LUCENE-8962
> URL: https://issues.apache.org/jira/browse/LUCENE-8962
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Michael McCandless
>Priority: Major
> Fix For: 8.5
>
> Attachments: LUCENE-8962_demo.png
>
>  Time Spent: 9.5h
>  Remaining Estimate: 0h
>
> With near-real-time search we ask {{IndexWriter}} to write all in-memory 
> segments to disk and open an {{IndexReader}} to search them, and this is 
> typically a quick operation.
> However, when you use many threads for concurrent indexing, {{IndexWriter}} 
> will accumulate write many small segments during {{refresh}} and this then 
> adds search-time cost as searching must visit all of these tiny segments.
> The merge policy would normally quickly coalesce these small segments if 
> given a little time ... so, could we somehow improve {{IndexWriter'}}s 
> refresh to optionally kick off merge policy to merge segments below some 
> threshold before opening the near-real-time reader?  It'd be a bit tricky 
> because while we are waiting for merges, indexing may continue, and new 
> segments may be flushed, but those new segments shouldn't be included in the 
> point-in-time segments returned by refresh ...
> One could almost do this on top of Lucene today, with a custom merge policy, 
> and some hackity logic to have the merge policy target small segments just 
> written by refresh, but it's tricky to then open a near-real-time reader, 
> excluding newly flushed but including newly merged segments since the refresh 
> originally finished ...
> I'm not yet sure how best to solve this, so I wanted to open an issue for 
> discussion!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?

2020-03-07 Thread ASF subversion and git services (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17054068#comment-17054068
 ] 

ASF subversion and git services commented on LUCENE-8962:
-

Commit 27b4fee074437a892c511898e9e86d62c127f1d7 in lucene-solr's branch 
refs/heads/branch_8_5 from Michael Sokolov
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=27b4fee ]

Revert "LUCENE-8962: Fix intermittent test failures"

This reverts commit a5475de57fed6b339cd5565bd1bd2650f265a537.

Revert "Add CHANGES entry for LUCENE-8962"

This reverts commit fdac6d866344611290c45c164112277581328bc9.

Revert "LUCENE-8962: Add ability to selectively merge on commit (#1155)"

This reverts commit a1791e77143aa8087c0b5ee0e8eb57422e59a09a.


> Can we merge small segments during refresh, for faster searching?
> -
>
> Key: LUCENE-8962
> URL: https://issues.apache.org/jira/browse/LUCENE-8962
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Michael McCandless
>Priority: Major
> Fix For: 8.5
>
> Attachments: LUCENE-8962_demo.png
>
>  Time Spent: 9.5h
>  Remaining Estimate: 0h
>
> With near-real-time search we ask {{IndexWriter}} to write all in-memory 
> segments to disk and open an {{IndexReader}} to search them, and this is 
> typically a quick operation.
> However, when you use many threads for concurrent indexing, {{IndexWriter}} 
> will accumulate write many small segments during {{refresh}} and this then 
> adds search-time cost as searching must visit all of these tiny segments.
> The merge policy would normally quickly coalesce these small segments if 
> given a little time ... so, could we somehow improve {{IndexWriter'}}s 
> refresh to optionally kick off merge policy to merge segments below some 
> threshold before opening the near-real-time reader?  It'd be a bit tricky 
> because while we are waiting for merges, indexing may continue, and new 
> segments may be flushed, but those new segments shouldn't be included in the 
> point-in-time segments returned by refresh ...
> One could almost do this on top of Lucene today, with a custom merge policy, 
> and some hackity logic to have the merge policy target small segments just 
> written by refresh, but it's tricky to then open a near-real-time reader, 
> excluding newly flushed but including newly merged segments since the refresh 
> originally finished ...
> I'm not yet sure how best to solve this, so I wanted to open an issue for 
> discussion!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?

2020-03-07 Thread ASF subversion and git services (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17054066#comment-17054066
 ] 

ASF subversion and git services commented on LUCENE-8962:
-

Commit 27b4fee074437a892c511898e9e86d62c127f1d7 in lucene-solr's branch 
refs/heads/branch_8_5 from Michael Sokolov
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=27b4fee ]

Revert "LUCENE-8962: Fix intermittent test failures"

This reverts commit a5475de57fed6b339cd5565bd1bd2650f265a537.

Revert "Add CHANGES entry for LUCENE-8962"

This reverts commit fdac6d866344611290c45c164112277581328bc9.

Revert "LUCENE-8962: Add ability to selectively merge on commit (#1155)"

This reverts commit a1791e77143aa8087c0b5ee0e8eb57422e59a09a.


> Can we merge small segments during refresh, for faster searching?
> -
>
> Key: LUCENE-8962
> URL: https://issues.apache.org/jira/browse/LUCENE-8962
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Michael McCandless
>Priority: Major
> Fix For: 8.5
>
> Attachments: LUCENE-8962_demo.png
>
>  Time Spent: 9.5h
>  Remaining Estimate: 0h
>
> With near-real-time search we ask {{IndexWriter}} to write all in-memory 
> segments to disk and open an {{IndexReader}} to search them, and this is 
> typically a quick operation.
> However, when you use many threads for concurrent indexing, {{IndexWriter}} 
> will accumulate write many small segments during {{refresh}} and this then 
> adds search-time cost as searching must visit all of these tiny segments.
> The merge policy would normally quickly coalesce these small segments if 
> given a little time ... so, could we somehow improve {{IndexWriter'}}s 
> refresh to optionally kick off merge policy to merge segments below some 
> threshold before opening the near-real-time reader?  It'd be a bit tricky 
> because while we are waiting for merges, indexing may continue, and new 
> segments may be flushed, but those new segments shouldn't be included in the 
> point-in-time segments returned by refresh ...
> One could almost do this on top of Lucene today, with a custom merge policy, 
> and some hackity logic to have the merge policy target small segments just 
> written by refresh, but it's tricky to then open a near-real-time reader, 
> excluding newly flushed but including newly merged segments since the refresh 
> originally finished ...
> I'm not yet sure how best to solve this, so I wanted to open an issue for 
> discussion!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?

2020-03-07 Thread ASF subversion and git services (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17054067#comment-17054067
 ] 

ASF subversion and git services commented on LUCENE-8962:
-

Commit 27b4fee074437a892c511898e9e86d62c127f1d7 in lucene-solr's branch 
refs/heads/branch_8_5 from Michael Sokolov
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=27b4fee ]

Revert "LUCENE-8962: Fix intermittent test failures"

This reverts commit a5475de57fed6b339cd5565bd1bd2650f265a537.

Revert "Add CHANGES entry for LUCENE-8962"

This reverts commit fdac6d866344611290c45c164112277581328bc9.

Revert "LUCENE-8962: Add ability to selectively merge on commit (#1155)"

This reverts commit a1791e77143aa8087c0b5ee0e8eb57422e59a09a.


> Can we merge small segments during refresh, for faster searching?
> -
>
> Key: LUCENE-8962
> URL: https://issues.apache.org/jira/browse/LUCENE-8962
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Michael McCandless
>Priority: Major
> Fix For: 8.5
>
> Attachments: LUCENE-8962_demo.png
>
>  Time Spent: 9.5h
>  Remaining Estimate: 0h
>
> With near-real-time search we ask {{IndexWriter}} to write all in-memory 
> segments to disk and open an {{IndexReader}} to search them, and this is 
> typically a quick operation.
> However, when you use many threads for concurrent indexing, {{IndexWriter}} 
> will accumulate write many small segments during {{refresh}} and this then 
> adds search-time cost as searching must visit all of these tiny segments.
> The merge policy would normally quickly coalesce these small segments if 
> given a little time ... so, could we somehow improve {{IndexWriter'}}s 
> refresh to optionally kick off merge policy to merge segments below some 
> threshold before opening the near-real-time reader?  It'd be a bit tricky 
> because while we are waiting for merges, indexing may continue, and new 
> segments may be flushed, but those new segments shouldn't be included in the 
> point-in-time segments returned by refresh ...
> One could almost do this on top of Lucene today, with a custom merge policy, 
> and some hackity logic to have the merge policy target small segments just 
> written by refresh, but it's tricky to then open a near-real-time reader, 
> excluding newly flushed but including newly merged segments since the refresh 
> originally finished ...
> I'm not yet sure how best to solve this, so I wanted to open an issue for 
> discussion!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?

2020-03-06 Thread Michael Froh (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17053640#comment-17053640
 ] 

Michael Froh commented on LUCENE-8962:
--

bq. With a slightly refactored IW we can share the merge logic and let the 
reader re-write itself since we are talking about very small segments the 
overhead is very small. This would in turn mean that we are doing the work 
twice ie. the IW would do its normal work and might merge later etc.

Just to provide a bit more context, for the case where my team uses this 
change, we're replicating the index (think Solr master/slave) from "writers" to 
many "searchers", so we're avoiding doing the work many times.

An earlier (less invasive) approach I tried to address the small flushed 
segments problem was roughly: call commit on writer, hard link the commit files 
to another filesystem directory to "clone" the index, open an IW on that 
directory, merge small segments on the clone, let searchers replicate from the 
clone. That approach does mean that the merging work happens twice (since the 
"real" index doesn't benefit from the merge on the clone), but it doesn't 
involve any changes in Lucene.

Maybe that less-invasive approach is a better way to address this. It's 
certainly more consistent with [~simonw]'s suggestion above.

> Can we merge small segments during refresh, for faster searching?
> -
>
> Key: LUCENE-8962
> URL: https://issues.apache.org/jira/browse/LUCENE-8962
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Michael McCandless
>Priority: Major
> Fix For: 8.5
>
> Attachments: LUCENE-8962_demo.png
>
>  Time Spent: 9.5h
>  Remaining Estimate: 0h
>
> With near-real-time search we ask {{IndexWriter}} to write all in-memory 
> segments to disk and open an {{IndexReader}} to search them, and this is 
> typically a quick operation.
> However, when you use many threads for concurrent indexing, {{IndexWriter}} 
> will accumulate write many small segments during {{refresh}} and this then 
> adds search-time cost as searching must visit all of these tiny segments.
> The merge policy would normally quickly coalesce these small segments if 
> given a little time ... so, could we somehow improve {{IndexWriter'}}s 
> refresh to optionally kick off merge policy to merge segments below some 
> threshold before opening the near-real-time reader?  It'd be a bit tricky 
> because while we are waiting for merges, indexing may continue, and new 
> segments may be flushed, but those new segments shouldn't be included in the 
> point-in-time segments returned by refresh ...
> One could almost do this on top of Lucene today, with a custom merge policy, 
> and some hackity logic to have the merge policy target small segments just 
> written by refresh, but it's tricky to then open a near-real-time reader, 
> excluding newly flushed but including newly merged segments since the refresh 
> originally finished ...
> I'm not yet sure how best to solve this, so I wanted to open an issue for 
> discussion!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?

2020-03-06 Thread Michael Sokolov (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17053620#comment-17053620
 ] 

Michael Sokolov commented on LUCENE-8962:
-

Based on [~simonw]'s recent comments in github, plus difficulty getting tests 
to pass consistently (apparently there are more failing tests in Elasticland), 
we should probably revert for now, at least from 8.x and 8.5 branches. I am 
tied up for the moment, but will be able to do the revert this weekend.

> Can we merge small segments during refresh, for faster searching?
> -
>
> Key: LUCENE-8962
> URL: https://issues.apache.org/jira/browse/LUCENE-8962
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Michael McCandless
>Priority: Major
> Fix For: 8.5
>
> Attachments: LUCENE-8962_demo.png
>
>  Time Spent: 9.5h
>  Remaining Estimate: 0h
>
> With near-real-time search we ask {{IndexWriter}} to write all in-memory 
> segments to disk and open an {{IndexReader}} to search them, and this is 
> typically a quick operation.
> However, when you use many threads for concurrent indexing, {{IndexWriter}} 
> will accumulate write many small segments during {{refresh}} and this then 
> adds search-time cost as searching must visit all of these tiny segments.
> The merge policy would normally quickly coalesce these small segments if 
> given a little time ... so, could we somehow improve {{IndexWriter'}}s 
> refresh to optionally kick off merge policy to merge segments below some 
> threshold before opening the near-real-time reader?  It'd be a bit tricky 
> because while we are waiting for merges, indexing may continue, and new 
> segments may be flushed, but those new segments shouldn't be included in the 
> point-in-time segments returned by refresh ...
> One could almost do this on top of Lucene today, with a custom merge policy, 
> and some hackity logic to have the merge policy target small segments just 
> written by refresh, but it's tricky to then open a near-real-time reader, 
> excluding newly flushed but including newly merged segments since the refresh 
> originally finished ...
> I'm not yet sure how best to solve this, so I wanted to open an issue for 
> discussion!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?

2020-03-06 Thread Nhat Nguyen (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17053545#comment-17053545
 ] 

Nhat Nguyen commented on LUCENE-8962:
-

Some engine tests in Elasticsearch are failing because of this change. I am 
working to backport them to Lucene so that we can catch similar issues in 
Lucene.

> Can we merge small segments during refresh, for faster searching?
> -
>
> Key: LUCENE-8962
> URL: https://issues.apache.org/jira/browse/LUCENE-8962
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Michael McCandless
>Priority: Major
> Fix For: 8.5
>
> Attachments: LUCENE-8962_demo.png
>
>  Time Spent: 9.5h
>  Remaining Estimate: 0h
>
> With near-real-time search we ask {{IndexWriter}} to write all in-memory 
> segments to disk and open an {{IndexReader}} to search them, and this is 
> typically a quick operation.
> However, when you use many threads for concurrent indexing, {{IndexWriter}} 
> will accumulate write many small segments during {{refresh}} and this then 
> adds search-time cost as searching must visit all of these tiny segments.
> The merge policy would normally quickly coalesce these small segments if 
> given a little time ... so, could we somehow improve {{IndexWriter'}}s 
> refresh to optionally kick off merge policy to merge segments below some 
> threshold before opening the near-real-time reader?  It'd be a bit tricky 
> because while we are waiting for merges, indexing may continue, and new 
> segments may be flushed, but those new segments shouldn't be included in the 
> point-in-time segments returned by refresh ...
> One could almost do this on top of Lucene today, with a custom merge policy, 
> and some hackity logic to have the merge policy target small segments just 
> written by refresh, but it's tricky to then open a near-real-time reader, 
> excluding newly flushed but including newly merged segments since the refresh 
> originally finished ...
> I'm not yet sure how best to solve this, so I wanted to open an issue for 
> discussion!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?

2020-03-06 Thread David Smiley (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17053427#comment-17053427
 ] 

David Smiley commented on LUCENE-8962:
--

Thanks so much for your input Simon!  We need to fight the complexity here.

> Can we merge small segments during refresh, for faster searching?
> -
>
> Key: LUCENE-8962
> URL: https://issues.apache.org/jira/browse/LUCENE-8962
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Michael McCandless
>Priority: Major
> Fix For: 8.5
>
> Attachments: LUCENE-8962_demo.png
>
>  Time Spent: 9.5h
>  Remaining Estimate: 0h
>
> With near-real-time search we ask {{IndexWriter}} to write all in-memory 
> segments to disk and open an {{IndexReader}} to search them, and this is 
> typically a quick operation.
> However, when you use many threads for concurrent indexing, {{IndexWriter}} 
> will accumulate write many small segments during {{refresh}} and this then 
> adds search-time cost as searching must visit all of these tiny segments.
> The merge policy would normally quickly coalesce these small segments if 
> given a little time ... so, could we somehow improve {{IndexWriter'}}s 
> refresh to optionally kick off merge policy to merge segments below some 
> threshold before opening the near-real-time reader?  It'd be a bit tricky 
> because while we are waiting for merges, indexing may continue, and new 
> segments may be flushed, but those new segments shouldn't be included in the 
> point-in-time segments returned by refresh ...
> One could almost do this on top of Lucene today, with a custom merge policy, 
> and some hackity logic to have the merge policy target small segments just 
> written by refresh, but it's tricky to then open a near-real-time reader, 
> excluding newly flushed but including newly merged segments since the refresh 
> originally finished ...
> I'm not yet sure how best to solve this, so I wanted to open an issue for 
> discussion!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?

2020-03-06 Thread ASF subversion and git services (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17053381#comment-17053381
 ] 

ASF subversion and git services commented on LUCENE-8962:
-

Commit 90aced5a51f92ffd6e97449eb7c44aacc643c8a3 in lucene-solr's branch 
refs/heads/branch_8x from Michael Sokolov
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=90aced5 ]

LUCENE-8962: Split test case (#1313)

* LUCENE-8962: Simplify test case

The testMergeOnCommit test case was trying to verify too many things
at once: basic semantics of merge on commit and proper behavior when
a bunch of indexing threads are writing and committing all at once.

Now we just verify basic behavior, with strict assertions on invariants, while 
leaving it to MockRandomMergePolicy to enable merge on commit in existing
 test cases to verify that indexing generally works as expected and no new
unexpected exceptions are thrown.

* LUCENE-8962: Only update toCommit if merge was committed

The code was previously assuming that if mergeFinished() was called and
isAborted() was false, then the merge must have completed successfully.
Instead, we should know for sure if a given merge was committed, and
only then update our pending commit SegmentInfos.


> Can we merge small segments during refresh, for faster searching?
> -
>
> Key: LUCENE-8962
> URL: https://issues.apache.org/jira/browse/LUCENE-8962
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Michael McCandless
>Priority: Major
> Fix For: 8.5
>
> Attachments: LUCENE-8962_demo.png
>
>  Time Spent: 9h 20m
>  Remaining Estimate: 0h
>
> With near-real-time search we ask {{IndexWriter}} to write all in-memory 
> segments to disk and open an {{IndexReader}} to search them, and this is 
> typically a quick operation.
> However, when you use many threads for concurrent indexing, {{IndexWriter}} 
> will accumulate write many small segments during {{refresh}} and this then 
> adds search-time cost as searching must visit all of these tiny segments.
> The merge policy would normally quickly coalesce these small segments if 
> given a little time ... so, could we somehow improve {{IndexWriter'}}s 
> refresh to optionally kick off merge policy to merge segments below some 
> threshold before opening the near-real-time reader?  It'd be a bit tricky 
> because while we are waiting for merges, indexing may continue, and new 
> segments may be flushed, but those new segments shouldn't be included in the 
> point-in-time segments returned by refresh ...
> One could almost do this on top of Lucene today, with a custom merge policy, 
> and some hackity logic to have the merge policy target small segments just 
> written by refresh, but it's tricky to then open a near-real-time reader, 
> excluding newly flushed but including newly merged segments since the refresh 
> originally finished ...
> I'm not yet sure how best to solve this, so I wanted to open an issue for 
> discussion!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?

2020-03-06 Thread ASF subversion and git services (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17053380#comment-17053380
 ] 

ASF subversion and git services commented on LUCENE-8962:
-

Commit 90aced5a51f92ffd6e97449eb7c44aacc643c8a3 in lucene-solr's branch 
refs/heads/branch_8x from Michael Sokolov
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=90aced5 ]

LUCENE-8962: Split test case (#1313)

* LUCENE-8962: Simplify test case

The testMergeOnCommit test case was trying to verify too many things
at once: basic semantics of merge on commit and proper behavior when
a bunch of indexing threads are writing and committing all at once.

Now we just verify basic behavior, with strict assertions on invariants, while 
leaving it to MockRandomMergePolicy to enable merge on commit in existing
 test cases to verify that indexing generally works as expected and no new
unexpected exceptions are thrown.

* LUCENE-8962: Only update toCommit if merge was committed

The code was previously assuming that if mergeFinished() was called and
isAborted() was false, then the merge must have completed successfully.
Instead, we should know for sure if a given merge was committed, and
only then update our pending commit SegmentInfos.


> Can we merge small segments during refresh, for faster searching?
> -
>
> Key: LUCENE-8962
> URL: https://issues.apache.org/jira/browse/LUCENE-8962
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Michael McCandless
>Priority: Major
> Fix For: 8.5
>
> Attachments: LUCENE-8962_demo.png
>
>  Time Spent: 9h 20m
>  Remaining Estimate: 0h
>
> With near-real-time search we ask {{IndexWriter}} to write all in-memory 
> segments to disk and open an {{IndexReader}} to search them, and this is 
> typically a quick operation.
> However, when you use many threads for concurrent indexing, {{IndexWriter}} 
> will accumulate write many small segments during {{refresh}} and this then 
> adds search-time cost as searching must visit all of these tiny segments.
> The merge policy would normally quickly coalesce these small segments if 
> given a little time ... so, could we somehow improve {{IndexWriter'}}s 
> refresh to optionally kick off merge policy to merge segments below some 
> threshold before opening the near-real-time reader?  It'd be a bit tricky 
> because while we are waiting for merges, indexing may continue, and new 
> segments may be flushed, but those new segments shouldn't be included in the 
> point-in-time segments returned by refresh ...
> One could almost do this on top of Lucene today, with a custom merge policy, 
> and some hackity logic to have the merge policy target small segments just 
> written by refresh, but it's tricky to then open a near-real-time reader, 
> excluding newly flushed but including newly merged segments since the refresh 
> originally finished ...
> I'm not yet sure how best to solve this, so I wanted to open an issue for 
> discussion!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?

2020-03-06 Thread ASF subversion and git services (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17053382#comment-17053382
 ] 

ASF subversion and git services commented on LUCENE-8962:
-

Commit 90aced5a51f92ffd6e97449eb7c44aacc643c8a3 in lucene-solr's branch 
refs/heads/branch_8x from Michael Sokolov
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=90aced5 ]

LUCENE-8962: Split test case (#1313)

* LUCENE-8962: Simplify test case

The testMergeOnCommit test case was trying to verify too many things
at once: basic semantics of merge on commit and proper behavior when
a bunch of indexing threads are writing and committing all at once.

Now we just verify basic behavior, with strict assertions on invariants, while 
leaving it to MockRandomMergePolicy to enable merge on commit in existing
 test cases to verify that indexing generally works as expected and no new
unexpected exceptions are thrown.

* LUCENE-8962: Only update toCommit if merge was committed

The code was previously assuming that if mergeFinished() was called and
isAborted() was false, then the merge must have completed successfully.
Instead, we should know for sure if a given merge was committed, and
only then update our pending commit SegmentInfos.


> Can we merge small segments during refresh, for faster searching?
> -
>
> Key: LUCENE-8962
> URL: https://issues.apache.org/jira/browse/LUCENE-8962
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Michael McCandless
>Priority: Major
> Fix For: 8.5
>
> Attachments: LUCENE-8962_demo.png
>
>  Time Spent: 9h 20m
>  Remaining Estimate: 0h
>
> With near-real-time search we ask {{IndexWriter}} to write all in-memory 
> segments to disk and open an {{IndexReader}} to search them, and this is 
> typically a quick operation.
> However, when you use many threads for concurrent indexing, {{IndexWriter}} 
> will accumulate write many small segments during {{refresh}} and this then 
> adds search-time cost as searching must visit all of these tiny segments.
> The merge policy would normally quickly coalesce these small segments if 
> given a little time ... so, could we somehow improve {{IndexWriter'}}s 
> refresh to optionally kick off merge policy to merge segments below some 
> threshold before opening the near-real-time reader?  It'd be a bit tricky 
> because while we are waiting for merges, indexing may continue, and new 
> segments may be flushed, but those new segments shouldn't be included in the 
> point-in-time segments returned by refresh ...
> One could almost do this on top of Lucene today, with a custom merge policy, 
> and some hackity logic to have the merge policy target small segments just 
> written by refresh, but it's tricky to then open a near-real-time reader, 
> excluding newly flushed but including newly merged segments since the refresh 
> originally finished ...
> I'm not yet sure how best to solve this, so I wanted to open an issue for 
> discussion!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?

2020-03-06 Thread Simon Willnauer (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17053231#comment-17053231
 ] 

Simon Willnauer commented on LUCENE-8962:
-

I read through this issue and I want to share some of my thoughts. First, I 
understand the need for this and the motivation, yet every time we add 
something like this to the IndexWriter to do something _as part of_ another 
method it triggers an alarm on my end. I have spent hours and days thinking 
about how IW can be simpler and the biggest issues that I see is that the 
primitives on IW like commit or openReader are doing too much. Just look at 
openReader it's pretty involved and changing the bus factor or making it easier 
to understand is hard. Adding stuff like _wait for merge_ with something like a 
timeout is not what I think we should do neither to _openReader_ nor to 
_commit_.  
That said, I think we can make the same things happen but we should think in 
primitives rather than changing method behavior with configuration. Let me 
explain what I mean:

Lets say we keep _commit_ and _openReader_ the way it is and would instead 
allow to use an existing reader NRT or not and allow itself to _optimize_ 
itself (yeah I said that  - it might be a good name after all). With a slightly 
refactored IW we can share the merge logic and let the reader re-write itself 
since we are talking about very small segments the overhead is very small. This 
would in turn mean that we are doing the work twice ie. the IW would do its 
normal work and might merge later etc. We might even merge this stuff into 
heap-space or so if we have enough I haven't thought too much about that. This 
way we can clean up IW potentially and add a very nice optimization that works 
for commit as well as NRT. We should strive for making IW simpler not do more. 
I hope I wasn't too discouraging. 

> Can we merge small segments during refresh, for faster searching?
> -
>
> Key: LUCENE-8962
> URL: https://issues.apache.org/jira/browse/LUCENE-8962
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Michael McCandless
>Priority: Major
> Fix For: 8.5
>
> Attachments: LUCENE-8962_demo.png
>
>  Time Spent: 9h 20m
>  Remaining Estimate: 0h
>
> With near-real-time search we ask {{IndexWriter}} to write all in-memory 
> segments to disk and open an {{IndexReader}} to search them, and this is 
> typically a quick operation.
> However, when you use many threads for concurrent indexing, {{IndexWriter}} 
> will accumulate write many small segments during {{refresh}} and this then 
> adds search-time cost as searching must visit all of these tiny segments.
> The merge policy would normally quickly coalesce these small segments if 
> given a little time ... so, could we somehow improve {{IndexWriter'}}s 
> refresh to optionally kick off merge policy to merge segments below some 
> threshold before opening the near-real-time reader?  It'd be a bit tricky 
> because while we are waiting for merges, indexing may continue, and new 
> segments may be flushed, but those new segments shouldn't be included in the 
> point-in-time segments returned by refresh ...
> One could almost do this on top of Lucene today, with a custom merge policy, 
> and some hackity logic to have the merge policy target small segments just 
> written by refresh, but it's tricky to then open a near-real-time reader, 
> excluding newly flushed but including newly merged segments since the refresh 
> originally finished ...
> I'm not yet sure how best to solve this, so I wanted to open an issue for 
> discussion!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?

2020-03-05 Thread ASF subversion and git services (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17052608#comment-17052608
 ] 

ASF subversion and git services commented on LUCENE-8962:
-

Commit 3dbfd102794419551f2ba4b43344cf9e6242a2b8 in lucene-solr's branch 
refs/heads/branch_8x from Michael McCandless
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=3dbfd10 ]

LUCENE-8962: woops, remove leftover accidental copyright (darned IDEs)


> Can we merge small segments during refresh, for faster searching?
> -
>
> Key: LUCENE-8962
> URL: https://issues.apache.org/jira/browse/LUCENE-8962
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Michael McCandless
>Priority: Major
> Fix For: 8.5
>
> Attachments: LUCENE-8962_demo.png
>
>  Time Spent: 7h 10m
>  Remaining Estimate: 0h
>
> With near-real-time search we ask {{IndexWriter}} to write all in-memory 
> segments to disk and open an {{IndexReader}} to search them, and this is 
> typically a quick operation.
> However, when you use many threads for concurrent indexing, {{IndexWriter}} 
> will accumulate write many small segments during {{refresh}} and this then 
> adds search-time cost as searching must visit all of these tiny segments.
> The merge policy would normally quickly coalesce these small segments if 
> given a little time ... so, could we somehow improve {{IndexWriter'}}s 
> refresh to optionally kick off merge policy to merge segments below some 
> threshold before opening the near-real-time reader?  It'd be a bit tricky 
> because while we are waiting for merges, indexing may continue, and new 
> segments may be flushed, but those new segments shouldn't be included in the 
> point-in-time segments returned by refresh ...
> One could almost do this on top of Lucene today, with a custom merge policy, 
> and some hackity logic to have the merge policy target small segments just 
> written by refresh, but it's tricky to then open a near-real-time reader, 
> excluding newly flushed but including newly merged segments since the refresh 
> originally finished ...
> I'm not yet sure how best to solve this, so I wanted to open an issue for 
> discussion!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?

2020-03-05 Thread ASF subversion and git services (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17052607#comment-17052607
 ] 

ASF subversion and git services commented on LUCENE-8962:
-

Commit e5be034df2fc22f1b88e4d271b25c8fae1c3093f in lucene-solr's branch 
refs/heads/master from Michael McCandless
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=e5be034 ]

LUCENE-8962: woops, remove leftover accidental copyright (darned IDEs)


> Can we merge small segments during refresh, for faster searching?
> -
>
> Key: LUCENE-8962
> URL: https://issues.apache.org/jira/browse/LUCENE-8962
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Michael McCandless
>Priority: Major
> Fix For: 8.5
>
> Attachments: LUCENE-8962_demo.png
>
>  Time Spent: 7h 10m
>  Remaining Estimate: 0h
>
> With near-real-time search we ask {{IndexWriter}} to write all in-memory 
> segments to disk and open an {{IndexReader}} to search them, and this is 
> typically a quick operation.
> However, when you use many threads for concurrent indexing, {{IndexWriter}} 
> will accumulate write many small segments during {{refresh}} and this then 
> adds search-time cost as searching must visit all of these tiny segments.
> The merge policy would normally quickly coalesce these small segments if 
> given a little time ... so, could we somehow improve {{IndexWriter'}}s 
> refresh to optionally kick off merge policy to merge segments below some 
> threshold before opening the near-real-time reader?  It'd be a bit tricky 
> because while we are waiting for merges, indexing may continue, and new 
> segments may be flushed, but those new segments shouldn't be included in the 
> point-in-time segments returned by refresh ...
> One could almost do this on top of Lucene today, with a custom merge policy, 
> and some hackity logic to have the merge policy target small segments just 
> written by refresh, but it's tricky to then open a near-real-time reader, 
> excluding newly flushed but including newly merged segments since the refresh 
> originally finished ...
> I'm not yet sure how best to solve this, so I wanted to open an issue for 
> discussion!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?

2020-03-05 Thread ASF subversion and git services (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17052510#comment-17052510
 ] 

ASF subversion and git services commented on LUCENE-8962:
-

Commit a030207a5e547a70db01d72fe4bd1627814ea94c in lucene-solr's branch 
refs/heads/master from Michael Sokolov
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=a030207 ]

LUCENE-8962: Split test case (#1313)

* LUCENE-8962: Simplify test case

The testMergeOnCommit test case was trying to verify too many things
at once: basic semantics of merge on commit and proper behavior when
a bunch of indexing threads are writing and committing all at once.

Now we just verify basic behavior, with strict assertions on invariants, while 
leaving it to MockRandomMergePolicy to enable merge on commit in existing
 test cases to verify that indexing generally works as expected and no new
unexpected exceptions are thrown.

* LUCENE-8962: Only update toCommit if merge was committed

The code was previously assuming that if mergeFinished() was called and
isAborted() was false, then the merge must have completed successfully.
Instead, we should know for sure if a given merge was committed, and
only then update our pending commit SegmentInfos.


> Can we merge small segments during refresh, for faster searching?
> -
>
> Key: LUCENE-8962
> URL: https://issues.apache.org/jira/browse/LUCENE-8962
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Michael McCandless
>Priority: Major
> Fix For: 8.5
>
> Attachments: LUCENE-8962_demo.png
>
>  Time Spent: 7h
>  Remaining Estimate: 0h
>
> With near-real-time search we ask {{IndexWriter}} to write all in-memory 
> segments to disk and open an {{IndexReader}} to search them, and this is 
> typically a quick operation.
> However, when you use many threads for concurrent indexing, {{IndexWriter}} 
> will accumulate write many small segments during {{refresh}} and this then 
> adds search-time cost as searching must visit all of these tiny segments.
> The merge policy would normally quickly coalesce these small segments if 
> given a little time ... so, could we somehow improve {{IndexWriter'}}s 
> refresh to optionally kick off merge policy to merge segments below some 
> threshold before opening the near-real-time reader?  It'd be a bit tricky 
> because while we are waiting for merges, indexing may continue, and new 
> segments may be flushed, but those new segments shouldn't be included in the 
> point-in-time segments returned by refresh ...
> One could almost do this on top of Lucene today, with a custom merge policy, 
> and some hackity logic to have the merge policy target small segments just 
> written by refresh, but it's tricky to then open a near-real-time reader, 
> excluding newly flushed but including newly merged segments since the refresh 
> originally finished ...
> I'm not yet sure how best to solve this, so I wanted to open an issue for 
> discussion!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?

2020-03-05 Thread ASF subversion and git services (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17052512#comment-17052512
 ] 

ASF subversion and git services commented on LUCENE-8962:
-

Commit a030207a5e547a70db01d72fe4bd1627814ea94c in lucene-solr's branch 
refs/heads/master from Michael Sokolov
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=a030207 ]

LUCENE-8962: Split test case (#1313)

* LUCENE-8962: Simplify test case

The testMergeOnCommit test case was trying to verify too many things
at once: basic semantics of merge on commit and proper behavior when
a bunch of indexing threads are writing and committing all at once.

Now we just verify basic behavior, with strict assertions on invariants, while 
leaving it to MockRandomMergePolicy to enable merge on commit in existing
 test cases to verify that indexing generally works as expected and no new
unexpected exceptions are thrown.

* LUCENE-8962: Only update toCommit if merge was committed

The code was previously assuming that if mergeFinished() was called and
isAborted() was false, then the merge must have completed successfully.
Instead, we should know for sure if a given merge was committed, and
only then update our pending commit SegmentInfos.


> Can we merge small segments during refresh, for faster searching?
> -
>
> Key: LUCENE-8962
> URL: https://issues.apache.org/jira/browse/LUCENE-8962
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Michael McCandless
>Priority: Major
> Fix For: 8.5
>
> Attachments: LUCENE-8962_demo.png
>
>  Time Spent: 7h
>  Remaining Estimate: 0h
>
> With near-real-time search we ask {{IndexWriter}} to write all in-memory 
> segments to disk and open an {{IndexReader}} to search them, and this is 
> typically a quick operation.
> However, when you use many threads for concurrent indexing, {{IndexWriter}} 
> will accumulate write many small segments during {{refresh}} and this then 
> adds search-time cost as searching must visit all of these tiny segments.
> The merge policy would normally quickly coalesce these small segments if 
> given a little time ... so, could we somehow improve {{IndexWriter'}}s 
> refresh to optionally kick off merge policy to merge segments below some 
> threshold before opening the near-real-time reader?  It'd be a bit tricky 
> because while we are waiting for merges, indexing may continue, and new 
> segments may be flushed, but those new segments shouldn't be included in the 
> point-in-time segments returned by refresh ...
> One could almost do this on top of Lucene today, with a custom merge policy, 
> and some hackity logic to have the merge policy target small segments just 
> written by refresh, but it's tricky to then open a near-real-time reader, 
> excluding newly flushed but including newly merged segments since the refresh 
> originally finished ...
> I'm not yet sure how best to solve this, so I wanted to open an issue for 
> discussion!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?

2020-03-05 Thread ASF subversion and git services (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17052508#comment-17052508
 ] 

ASF subversion and git services commented on LUCENE-8962:
-

Commit a030207a5e547a70db01d72fe4bd1627814ea94c in lucene-solr's branch 
refs/heads/master from Michael Sokolov
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=a030207 ]

LUCENE-8962: Split test case (#1313)

* LUCENE-8962: Simplify test case

The testMergeOnCommit test case was trying to verify too many things
at once: basic semantics of merge on commit and proper behavior when
a bunch of indexing threads are writing and committing all at once.

Now we just verify basic behavior, with strict assertions on invariants, while 
leaving it to MockRandomMergePolicy to enable merge on commit in existing
 test cases to verify that indexing generally works as expected and no new
unexpected exceptions are thrown.

* LUCENE-8962: Only update toCommit if merge was committed

The code was previously assuming that if mergeFinished() was called and
isAborted() was false, then the merge must have completed successfully.
Instead, we should know for sure if a given merge was committed, and
only then update our pending commit SegmentInfos.


> Can we merge small segments during refresh, for faster searching?
> -
>
> Key: LUCENE-8962
> URL: https://issues.apache.org/jira/browse/LUCENE-8962
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Michael McCandless
>Priority: Major
> Fix For: 8.5
>
> Attachments: LUCENE-8962_demo.png
>
>  Time Spent: 7h
>  Remaining Estimate: 0h
>
> With near-real-time search we ask {{IndexWriter}} to write all in-memory 
> segments to disk and open an {{IndexReader}} to search them, and this is 
> typically a quick operation.
> However, when you use many threads for concurrent indexing, {{IndexWriter}} 
> will accumulate write many small segments during {{refresh}} and this then 
> adds search-time cost as searching must visit all of these tiny segments.
> The merge policy would normally quickly coalesce these small segments if 
> given a little time ... so, could we somehow improve {{IndexWriter'}}s 
> refresh to optionally kick off merge policy to merge segments below some 
> threshold before opening the near-real-time reader?  It'd be a bit tricky 
> because while we are waiting for merges, indexing may continue, and new 
> segments may be flushed, but those new segments shouldn't be included in the 
> point-in-time segments returned by refresh ...
> One could almost do this on top of Lucene today, with a custom merge policy, 
> and some hackity logic to have the merge policy target small segments just 
> written by refresh, but it's tricky to then open a near-real-time reader, 
> excluding newly flushed but including newly merged segments since the refresh 
> originally finished ...
> I'm not yet sure how best to solve this, so I wanted to open an issue for 
> discussion!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?

2020-03-04 Thread Michael Froh (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17051571#comment-17051571
 ] 

Michael Froh commented on LUCENE-8962:
--

I updated https://github.com/apache/lucene-solr/pull/1313 with that proposed 
fix (adding a {{boolean}} field to OneMerge that gets set once a merge is 
successfully committed).

> Can we merge small segments during refresh, for faster searching?
> -
>
> Key: LUCENE-8962
> URL: https://issues.apache.org/jira/browse/LUCENE-8962
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Michael McCandless
>Priority: Major
> Fix For: 8.5
>
> Attachments: LUCENE-8962_demo.png
>
>  Time Spent: 6.5h
>  Remaining Estimate: 0h
>
> With near-real-time search we ask {{IndexWriter}} to write all in-memory 
> segments to disk and open an {{IndexReader}} to search them, and this is 
> typically a quick operation.
> However, when you use many threads for concurrent indexing, {{IndexWriter}} 
> will accumulate write many small segments during {{refresh}} and this then 
> adds search-time cost as searching must visit all of these tiny segments.
> The merge policy would normally quickly coalesce these small segments if 
> given a little time ... so, could we somehow improve {{IndexWriter'}}s 
> refresh to optionally kick off merge policy to merge segments below some 
> threshold before opening the near-real-time reader?  It'd be a bit tricky 
> because while we are waiting for merges, indexing may continue, and new 
> segments may be flushed, but those new segments shouldn't be included in the 
> point-in-time segments returned by refresh ...
> One could almost do this on top of Lucene today, with a custom merge policy, 
> and some hackity logic to have the merge policy target small segments just 
> written by refresh, but it's tricky to then open a near-real-time reader, 
> excluding newly flushed but including newly merged segments since the refresh 
> originally finished ...
> I'm not yet sure how best to solve this, so I wanted to open an issue for 
> discussion!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?

2020-03-04 Thread Michael Froh (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17051558#comment-17051558
 ] 

Michael Froh commented on LUCENE-8962:
--

It's not immediately obvious to me how to fix the failure on 
{{TestIndexWriterExceptions2}}.

A merge on commit fails (because it's using {{CrankyCodec}}), closing the merge 
readers, which calls the custom {{mergeFinished}} override, which assumes the 
merge completed (since it wasn't aborted), and tries to reference the files for 
the merged segment (to increment their reference counts). That triggers an 
{{IllegalStateException}} because the files weren't set (because we didn't get 
that far in the merge).

Unfortunately, stepping through the debugger, I don't see a clear way of 
telling in {{mergeFinished}} that a merge failed. Obviously, I could wrap the 
call to {{SegmentCommitInfo.files()}} in a try-catch, and assume that the 
{{IllegalStateException}} means that the merge failed, but that would fail to 
catch an IOException when e.g. committing the merge.

I'm thinking of adding a {{boolean}} field to {{OneMerge}} that gets set once a 
merge is successfully committed (e.g. just before the call to 
{{closeMergeReaders}} in {{IndexWriter.commitMerge()}}), which the 
{{mergeFinished}} override can use to determine if the merge completed 
successfully or not.

> Can we merge small segments during refresh, for faster searching?
> -
>
> Key: LUCENE-8962
> URL: https://issues.apache.org/jira/browse/LUCENE-8962
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Michael McCandless
>Priority: Major
> Fix For: 8.5
>
> Attachments: LUCENE-8962_demo.png
>
>  Time Spent: 6h 20m
>  Remaining Estimate: 0h
>
> With near-real-time search we ask {{IndexWriter}} to write all in-memory 
> segments to disk and open an {{IndexReader}} to search them, and this is 
> typically a quick operation.
> However, when you use many threads for concurrent indexing, {{IndexWriter}} 
> will accumulate write many small segments during {{refresh}} and this then 
> adds search-time cost as searching must visit all of these tiny segments.
> The merge policy would normally quickly coalesce these small segments if 
> given a little time ... so, could we somehow improve {{IndexWriter'}}s 
> refresh to optionally kick off merge policy to merge segments below some 
> threshold before opening the near-real-time reader?  It'd be a bit tricky 
> because while we are waiting for merges, indexing may continue, and new 
> segments may be flushed, but those new segments shouldn't be included in the 
> point-in-time segments returned by refresh ...
> One could almost do this on top of Lucene today, with a custom merge policy, 
> and some hackity logic to have the merge policy target small segments just 
> written by refresh, but it's tricky to then open a near-real-time reader, 
> excluding newly flushed but including newly merged segments since the refresh 
> originally finished ...
> I'm not yet sure how best to solve this, so I wanted to open an issue for 
> discussion!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?

2020-03-04 Thread Michael Sokolov (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17051229#comment-17051229
 ] 

Michael Sokolov commented on LUCENE-8962:
-

I think I'll test and push [~msfroh]'s test fix since it seems to fail 
frequently, while we investigate this new failure. It looks as if we are not 
properly handling the case when an IOException is thrown at some inopportune 
time? 

> Can we merge small segments during refresh, for faster searching?
> -
>
> Key: LUCENE-8962
> URL: https://issues.apache.org/jira/browse/LUCENE-8962
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Michael McCandless
>Priority: Major
> Fix For: 8.5
>
> Attachments: LUCENE-8962_demo.png
>
>  Time Spent: 6h 10m
>  Remaining Estimate: 0h
>
> With near-real-time search we ask {{IndexWriter}} to write all in-memory 
> segments to disk and open an {{IndexReader}} to search them, and this is 
> typically a quick operation.
> However, when you use many threads for concurrent indexing, {{IndexWriter}} 
> will accumulate write many small segments during {{refresh}} and this then 
> adds search-time cost as searching must visit all of these tiny segments.
> The merge policy would normally quickly coalesce these small segments if 
> given a little time ... so, could we somehow improve {{IndexWriter'}}s 
> refresh to optionally kick off merge policy to merge segments below some 
> threshold before opening the near-real-time reader?  It'd be a bit tricky 
> because while we are waiting for merges, indexing may continue, and new 
> segments may be flushed, but those new segments shouldn't be included in the 
> point-in-time segments returned by refresh ...
> One could almost do this on top of Lucene today, with a custom merge policy, 
> and some hackity logic to have the merge policy target small segments just 
> written by refresh, but it's tricky to then open a near-real-time reader, 
> excluding newly flushed but including newly merged segments since the refresh 
> originally finished ...
> I'm not yet sure how best to solve this, so I wanted to open an issue for 
> discussion!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?

2020-03-03 Thread Adrien Grand (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17050967#comment-17050967
 ] 

Adrien Grand commented on LUCENE-8962:
--

FYI this test seems to fail too because of this change:

{noformat}
NOTE: reproduce with: ant test  -Dtestcase=TestIndexWriterExceptions2 
-Dtests.method=testBasics -Dtests.seed=1F5EDA9C353C018F -Dtests.slow=true 
-Dtests.badapples=true -Dtests.locale=kkj-CM -Dtests.timezone=Etc/UCT 
-Dtests.asserts=true -Dtests.file.encoding=ISO-8859-1
{noformat}

> Can we merge small segments during refresh, for faster searching?
> -
>
> Key: LUCENE-8962
> URL: https://issues.apache.org/jira/browse/LUCENE-8962
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Michael McCandless
>Priority: Major
> Fix For: 8.5
>
> Attachments: LUCENE-8962_demo.png
>
>  Time Spent: 6h 10m
>  Remaining Estimate: 0h
>
> With near-real-time search we ask {{IndexWriter}} to write all in-memory 
> segments to disk and open an {{IndexReader}} to search them, and this is 
> typically a quick operation.
> However, when you use many threads for concurrent indexing, {{IndexWriter}} 
> will accumulate write many small segments during {{refresh}} and this then 
> adds search-time cost as searching must visit all of these tiny segments.
> The merge policy would normally quickly coalesce these small segments if 
> given a little time ... so, could we somehow improve {{IndexWriter'}}s 
> refresh to optionally kick off merge policy to merge segments below some 
> threshold before opening the near-real-time reader?  It'd be a bit tricky 
> because while we are waiting for merges, indexing may continue, and new 
> segments may be flushed, but those new segments shouldn't be included in the 
> point-in-time segments returned by refresh ...
> One could almost do this on top of Lucene today, with a custom merge policy, 
> and some hackity logic to have the merge policy target small segments just 
> written by refresh, but it's tricky to then open a near-real-time reader, 
> excluding newly flushed but including newly merged segments since the refresh 
> originally finished ...
> I'm not yet sure how best to solve this, so I wanted to open an issue for 
> discussion!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?

2020-03-03 Thread Michael Froh (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17050607#comment-17050607
 ] 

Michael Froh commented on LUCENE-8962:
--

I ended up splitting testMergeOnCommit into two test cases.

One runs through the basic invariants on a single thread and confirms that 
everything behaves as expected.

The other tries indexing and committing from multiple threads, but doesn't 
really make any assumptions about the segment topology in the end (since 
randomness and concurrency can lead to all kinds of possible valid segment 
counts). Instead it just verifies that it doesn't fail and doesn't lose any 
documents.

https://github.com/apache/lucene-solr/pull/1313

> Can we merge small segments during refresh, for faster searching?
> -
>
> Key: LUCENE-8962
> URL: https://issues.apache.org/jira/browse/LUCENE-8962
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Michael McCandless
>Priority: Major
> Fix For: 8.5
>
> Attachments: LUCENE-8962_demo.png
>
>  Time Spent: 6h 10m
>  Remaining Estimate: 0h
>
> With near-real-time search we ask {{IndexWriter}} to write all in-memory 
> segments to disk and open an {{IndexReader}} to search them, and this is 
> typically a quick operation.
> However, when you use many threads for concurrent indexing, {{IndexWriter}} 
> will accumulate write many small segments during {{refresh}} and this then 
> adds search-time cost as searching must visit all of these tiny segments.
> The merge policy would normally quickly coalesce these small segments if 
> given a little time ... so, could we somehow improve {{IndexWriter'}}s 
> refresh to optionally kick off merge policy to merge segments below some 
> threshold before opening the near-real-time reader?  It'd be a bit tricky 
> because while we are waiting for merges, indexing may continue, and new 
> segments may be flushed, but those new segments shouldn't be included in the 
> point-in-time segments returned by refresh ...
> One could almost do this on top of Lucene today, with a custom merge policy, 
> and some hackity logic to have the merge policy target small segments just 
> written by refresh, but it's tricky to then open a near-real-time reader, 
> excluding newly flushed but including newly merged segments since the refresh 
> originally finished ...
> I'm not yet sure how best to solve this, so I wanted to open an issue for 
> discussion!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?

2020-03-02 Thread ASF subversion and git services (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17049754#comment-17049754
 ] 

ASF subversion and git services commented on LUCENE-8962:
-

Commit e308e538731f392eb81ba81cfb3ec5fc526fd383 in lucene-solr's branch 
refs/heads/master from Michael Sokolov
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=e308e53 ]

Add CHANGES entry for LUCENE-8962


> Can we merge small segments during refresh, for faster searching?
> -
>
> Key: LUCENE-8962
> URL: https://issues.apache.org/jira/browse/LUCENE-8962
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Michael McCandless
>Priority: Major
> Attachments: LUCENE-8962_demo.png
>
>  Time Spent: 6h
>  Remaining Estimate: 0h
>
> With near-real-time search we ask {{IndexWriter}} to write all in-memory 
> segments to disk and open an {{IndexReader}} to search them, and this is 
> typically a quick operation.
> However, when you use many threads for concurrent indexing, {{IndexWriter}} 
> will accumulate write many small segments during {{refresh}} and this then 
> adds search-time cost as searching must visit all of these tiny segments.
> The merge policy would normally quickly coalesce these small segments if 
> given a little time ... so, could we somehow improve {{IndexWriter'}}s 
> refresh to optionally kick off merge policy to merge segments below some 
> threshold before opening the near-real-time reader?  It'd be a bit tricky 
> because while we are waiting for merges, indexing may continue, and new 
> segments may be flushed, but those new segments shouldn't be included in the 
> point-in-time segments returned by refresh ...
> One could almost do this on top of Lucene today, with a custom merge policy, 
> and some hackity logic to have the merge policy target small segments just 
> written by refresh, but it's tricky to then open a near-real-time reader, 
> excluding newly flushed but including newly merged segments since the refresh 
> originally finished ...
> I'm not yet sure how best to solve this, so I wanted to open an issue for 
> discussion!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?

2020-03-02 Thread ASF subversion and git services (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17049750#comment-17049750
 ] 

ASF subversion and git services commented on LUCENE-8962:
-

Commit a1791e77143aa8087c0b5ee0e8eb57422e59a09a in lucene-solr's branch 
refs/heads/branch_8x from msfroh
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=a1791e7 ]

LUCENE-8962: Add ability to selectively merge on commit (#1155)

* LUCENE-8962: Add ability to selectively merge on commit

This adds a new "findCommitMerges" method to MergePolicy, which can
specify merges to be executed before the
IndexWriter.prepareCommitInternal method returns.

If we have many index writer threads, they will flush their DWPT buffers
on commit, resulting in many small segments, which can be merged before
the commit returns.

* Add missing Javadoc

* Fix incorrect comment

* Refactoring and fix intermittent test failure

1. Made some changes to the callback to update toCommit, leveraging
SegmentInfos.applyMergeChanges.
2. I realized that we'll never end up with 0 registered merges, because
we throw an exception if we fail to register a merge.
3. Moved the IndexWriterEvents.beginMergeOnCommit notification to before
we call MergeScheduler.merge, since we may not be merging on another
thread.
4. There was an intermittent test failure due to randomness in the time
it takes for merges to complete. Before doing the final commit, we wait
for pending merges to finish. We may still end up abandoning the final
merge, but we can detect that and assert that either the merge was
abandoned (and we have > 1 segment) or we did merge down to 1 segment.

* Fix typo

* Fix/improve comments based on PR feedback

* More comment improvements from PR feedback

* Rename method and add new MergeTrigger

1. Renamed findCommitMerges -> findFullFlushMerges.
2. Added MergeTrigger.COMMIT, passed to findFullFlushMerges and to
   MergeScheduler when merging on commit.

* Update renamed method name in strings and comments


> Can we merge small segments during refresh, for faster searching?
> -
>
> Key: LUCENE-8962
> URL: https://issues.apache.org/jira/browse/LUCENE-8962
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Michael McCandless
>Priority: Major
> Attachments: LUCENE-8962_demo.png
>
>  Time Spent: 6h
>  Remaining Estimate: 0h
>
> With near-real-time search we ask {{IndexWriter}} to write all in-memory 
> segments to disk and open an {{IndexReader}} to search them, and this is 
> typically a quick operation.
> However, when you use many threads for concurrent indexing, {{IndexWriter}} 
> will accumulate write many small segments during {{refresh}} and this then 
> adds search-time cost as searching must visit all of these tiny segments.
> The merge policy would normally quickly coalesce these small segments if 
> given a little time ... so, could we somehow improve {{IndexWriter'}}s 
> refresh to optionally kick off merge policy to merge segments below some 
> threshold before opening the near-real-time reader?  It'd be a bit tricky 
> because while we are waiting for merges, indexing may continue, and new 
> segments may be flushed, but those new segments shouldn't be included in the 
> point-in-time segments returned by refresh ...
> One could almost do this on top of Lucene today, with a custom merge policy, 
> and some hackity logic to have the merge policy target small segments just 
> written by refresh, but it's tricky to then open a near-real-time reader, 
> excluding newly flushed but including newly merged segments since the refresh 
> originally finished ...
> I'm not yet sure how best to solve this, so I wanted to open an issue for 
> discussion!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?

2020-03-02 Thread ASF subversion and git services (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17049751#comment-17049751
 ] 

ASF subversion and git services commented on LUCENE-8962:
-

Commit a1791e77143aa8087c0b5ee0e8eb57422e59a09a in lucene-solr's branch 
refs/heads/branch_8x from msfroh
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=a1791e7 ]

LUCENE-8962: Add ability to selectively merge on commit (#1155)

* LUCENE-8962: Add ability to selectively merge on commit

This adds a new "findCommitMerges" method to MergePolicy, which can
specify merges to be executed before the
IndexWriter.prepareCommitInternal method returns.

If we have many index writer threads, they will flush their DWPT buffers
on commit, resulting in many small segments, which can be merged before
the commit returns.

* Add missing Javadoc

* Fix incorrect comment

* Refactoring and fix intermittent test failure

1. Made some changes to the callback to update toCommit, leveraging
SegmentInfos.applyMergeChanges.
2. I realized that we'll never end up with 0 registered merges, because
we throw an exception if we fail to register a merge.
3. Moved the IndexWriterEvents.beginMergeOnCommit notification to before
we call MergeScheduler.merge, since we may not be merging on another
thread.
4. There was an intermittent test failure due to randomness in the time
it takes for merges to complete. Before doing the final commit, we wait
for pending merges to finish. We may still end up abandoning the final
merge, but we can detect that and assert that either the merge was
abandoned (and we have > 1 segment) or we did merge down to 1 segment.

* Fix typo

* Fix/improve comments based on PR feedback

* More comment improvements from PR feedback

* Rename method and add new MergeTrigger

1. Renamed findCommitMerges -> findFullFlushMerges.
2. Added MergeTrigger.COMMIT, passed to findFullFlushMerges and to
   MergeScheduler when merging on commit.

* Update renamed method name in strings and comments


> Can we merge small segments during refresh, for faster searching?
> -
>
> Key: LUCENE-8962
> URL: https://issues.apache.org/jira/browse/LUCENE-8962
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Michael McCandless
>Priority: Major
> Attachments: LUCENE-8962_demo.png
>
>  Time Spent: 6h
>  Remaining Estimate: 0h
>
> With near-real-time search we ask {{IndexWriter}} to write all in-memory 
> segments to disk and open an {{IndexReader}} to search them, and this is 
> typically a quick operation.
> However, when you use many threads for concurrent indexing, {{IndexWriter}} 
> will accumulate write many small segments during {{refresh}} and this then 
> adds search-time cost as searching must visit all of these tiny segments.
> The merge policy would normally quickly coalesce these small segments if 
> given a little time ... so, could we somehow improve {{IndexWriter'}}s 
> refresh to optionally kick off merge policy to merge segments below some 
> threshold before opening the near-real-time reader?  It'd be a bit tricky 
> because while we are waiting for merges, indexing may continue, and new 
> segments may be flushed, but those new segments shouldn't be included in the 
> point-in-time segments returned by refresh ...
> One could almost do this on top of Lucene today, with a custom merge policy, 
> and some hackity logic to have the merge policy target small segments just 
> written by refresh, but it's tricky to then open a near-real-time reader, 
> excluding newly flushed but including newly merged segments since the refresh 
> originally finished ...
> I'm not yet sure how best to solve this, so I wanted to open an issue for 
> discussion!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?

2020-03-02 Thread ASF subversion and git services (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17049753#comment-17049753
 ] 

ASF subversion and git services commented on LUCENE-8962:
-

Commit fdac6d866344611290c45c164112277581328bc9 in lucene-solr's branch 
refs/heads/branch_8x from Michael Sokolov
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=fdac6d8 ]

Add CHANGES entry for LUCENE-8962


> Can we merge small segments during refresh, for faster searching?
> -
>
> Key: LUCENE-8962
> URL: https://issues.apache.org/jira/browse/LUCENE-8962
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Michael McCandless
>Priority: Major
> Attachments: LUCENE-8962_demo.png
>
>  Time Spent: 6h
>  Remaining Estimate: 0h
>
> With near-real-time search we ask {{IndexWriter}} to write all in-memory 
> segments to disk and open an {{IndexReader}} to search them, and this is 
> typically a quick operation.
> However, when you use many threads for concurrent indexing, {{IndexWriter}} 
> will accumulate write many small segments during {{refresh}} and this then 
> adds search-time cost as searching must visit all of these tiny segments.
> The merge policy would normally quickly coalesce these small segments if 
> given a little time ... so, could we somehow improve {{IndexWriter'}}s 
> refresh to optionally kick off merge policy to merge segments below some 
> threshold before opening the near-real-time reader?  It'd be a bit tricky 
> because while we are waiting for merges, indexing may continue, and new 
> segments may be flushed, but those new segments shouldn't be included in the 
> point-in-time segments returned by refresh ...
> One could almost do this on top of Lucene today, with a custom merge policy, 
> and some hackity logic to have the merge policy target small segments just 
> written by refresh, but it's tricky to then open a near-real-time reader, 
> excluding newly flushed but including newly merged segments since the refresh 
> originally finished ...
> I'm not yet sure how best to solve this, so I wanted to open an issue for 
> discussion!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?

2020-03-02 Thread ASF subversion and git services (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17049752#comment-17049752
 ] 

ASF subversion and git services commented on LUCENE-8962:
-

Commit a5475de57fed6b339cd5565bd1bd2650f265a537 in lucene-solr's branch 
refs/heads/branch_8x from Michael Froh
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=a5475de ]

LUCENE-8962: Fix intermittent test failures

1. TestIndexWriterMergePolicy.testMergeOnCommit will fail if the last
   commit (the one that should trigger the full merge) doesn't have any
   pending changes (which could occur if the last indexing thread
   commits at the end). We can fix that by adding one more document
   before that commit.
2. The previous implementation was throwing IOException if the commit
   thread gets interrupted while waiting for merges to complete. This
   violates IndexWriter's documented behavior of throwing
   ThreadInterruptedException.


> Can we merge small segments during refresh, for faster searching?
> -
>
> Key: LUCENE-8962
> URL: https://issues.apache.org/jira/browse/LUCENE-8962
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Michael McCandless
>Priority: Major
> Attachments: LUCENE-8962_demo.png
>
>  Time Spent: 6h
>  Remaining Estimate: 0h
>
> With near-real-time search we ask {{IndexWriter}} to write all in-memory 
> segments to disk and open an {{IndexReader}} to search them, and this is 
> typically a quick operation.
> However, when you use many threads for concurrent indexing, {{IndexWriter}} 
> will accumulate write many small segments during {{refresh}} and this then 
> adds search-time cost as searching must visit all of these tiny segments.
> The merge policy would normally quickly coalesce these small segments if 
> given a little time ... so, could we somehow improve {{IndexWriter'}}s 
> refresh to optionally kick off merge policy to merge segments below some 
> threshold before opening the near-real-time reader?  It'd be a bit tricky 
> because while we are waiting for merges, indexing may continue, and new 
> segments may be flushed, but those new segments shouldn't be included in the 
> point-in-time segments returned by refresh ...
> One could almost do this on top of Lucene today, with a custom merge policy, 
> and some hackity logic to have the merge policy target small segments just 
> written by refresh, but it's tricky to then open a near-real-time reader, 
> excluding newly flushed but including newly merged segments since the refresh 
> originally finished ...
> I'm not yet sure how best to solve this, so I wanted to open an issue for 
> discussion!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?

2020-03-02 Thread ASF subversion and git services (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17049748#comment-17049748
 ] 

ASF subversion and git services commented on LUCENE-8962:
-

Commit f017ae465ec416b9cf5ac91f9aa12ff71abd7de0 in lucene-solr's branch 
refs/heads/master from msfroh
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=f017ae4 ]

LUCENE-8962: Fix intermittent test failures (#1307)

1. TestIndexWriterMergePolicy.testMergeOnCommit will fail if the last
   commit (the one that should trigger the full merge) doesn't have any
   pending changes (which could occur if the last indexing thread
   commits at the end). We can fix that by adding one more document
   before that commit.
2. The previous implementation was throwing IOException if the commit
   thread gets interrupted while waiting for merges to complete. This
   violates IndexWriter's documented behavior of throwing
   ThreadInterruptedException.

> Can we merge small segments during refresh, for faster searching?
> -
>
> Key: LUCENE-8962
> URL: https://issues.apache.org/jira/browse/LUCENE-8962
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Michael McCandless
>Priority: Major
> Attachments: LUCENE-8962_demo.png
>
>  Time Spent: 6h
>  Remaining Estimate: 0h
>
> With near-real-time search we ask {{IndexWriter}} to write all in-memory 
> segments to disk and open an {{IndexReader}} to search them, and this is 
> typically a quick operation.
> However, when you use many threads for concurrent indexing, {{IndexWriter}} 
> will accumulate write many small segments during {{refresh}} and this then 
> adds search-time cost as searching must visit all of these tiny segments.
> The merge policy would normally quickly coalesce these small segments if 
> given a little time ... so, could we somehow improve {{IndexWriter'}}s 
> refresh to optionally kick off merge policy to merge segments below some 
> threshold before opening the near-real-time reader?  It'd be a bit tricky 
> because while we are waiting for merges, indexing may continue, and new 
> segments may be flushed, but those new segments shouldn't be included in the 
> point-in-time segments returned by refresh ...
> One could almost do this on top of Lucene today, with a custom merge policy, 
> and some hackity logic to have the merge policy target small segments just 
> written by refresh, but it's tricky to then open a near-real-time reader, 
> excluding newly flushed but including newly merged segments since the refresh 
> originally finished ...
> I'm not yet sure how best to solve this, so I wanted to open an issue for 
> discussion!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

1 2 >

1 - 100 of 125 matches

Mail list logo