[jira] [Comment Edited] (LUCENE-8263) Add indexPctDeletedTarget as a parameter to TieredMergePolicy to control more aggressive merging

2018-07-16 Thread Marc Morissette (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16545788#comment-16545788
 ] 

Marc Morissette edited comment on LUCENE-8263 at 7/16/18 10:02 PM:
---

{quote}I've gone back and forth on this. Now that optimize and forceMerge 
respect maxSegmentSize I've been thinking that those operations would suffice 
for those real-world edge cases.

forceMergeDeletes (expungeDeletes) has a maximum percent of deletes allowed per 
segment for instance that must be between 0 and 100. 0 is roughly equivalent to 
forceMerge/optimize at this point. And will not create any segments over 
maxSegmentSizeMB.
{quote}
I hadn't considered using forceMergeDeletes to address these edge cases, but 
the more I think about it, the more I like it. Consider me convinced.

My only remaining concern with forceMergeDeletes as it is currently designed 
(if I'm reading the code correctly) is that if enough segments end up with a 
percentage of deleted documents above forceMergeDeletesPctAllowed, a single 
call can use a lot of disk space. Perhaps we could find a way to configure an 
upper limit on the number of merges that forceMergeDeletes can perform per 
call? Configured this way, each forceMergeDeletes call could only claim a 
bounded amount of disk space before returning. Repeated calls would be 
necessary to "clean" an entire index, but if each one were accompanied by a 
soft commit, the amount of free disk space required to perform the entire 
operation would be more predictable.
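
To make the idea of a per-call cap concrete, here is a minimal sketch in plain Java. It is a toy model and not Lucene code: the class, the `Segment` type, and the `maxMergesPerCall` parameter are all hypothetical, and it assumes each merge rewrites a single segment and that the old copy stays on disk until the next commit.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of a capped forceMergeDeletes call. This is not Lucene code. */
final class BoundedExpungeSketch {

    /** A segment reduced to the bytes held by live and by deleted documents. */
    static final class Segment {
        final long liveBytes;
        final long deletedBytes;

        Segment(long liveBytes, long deletedBytes) {
            this.liveBytes = liveBytes;
            this.deletedBytes = deletedBytes;
        }

        double pctDeleted() {
            long total = liveBytes + deletedBytes;
            return total == 0 ? 0.0 : 100.0 * deletedBytes / total;
        }
    }

    /**
     * Rewrites at most maxMergesPerCall segments whose deleted percentage is
     * above pctAllowed, most-deleted first. Returns the extra bytes written,
     * i.e. the free disk space this one call needs in the model.
     * Assumes maxMergesPerCall is not negative.
     */
    static long expungeOnce(List<Segment> index, double pctAllowed, int maxMergesPerCall) {
        List<Segment> candidates = new ArrayList<>();
        for (Segment s : index) {
            if (s.pctDeleted() > pctAllowed) {
                candidates.add(s);
            }
        }
        // Highest deleted percentage first.
        candidates.sort((a, b) -> Double.compare(b.pctDeleted(), a.pctDeleted()));

        long extraBytes = 0;
        int merges = Math.min(maxMergesPerCall, candidates.size());
        for (Segment s : candidates.subList(0, merges)) {
            extraBytes += s.liveBytes; // the rewritten copy holds only live data
            index.set(index.indexOf(s), new Segment(s.liveBytes, 0));
        }
        return extraBytes;
    }

    public static void main(String[] args) {
        List<Segment> index = new ArrayList<>();
        index.add(new Segment(100, 100));
        index.add(new Segment(300, 100));
        index.add(new Segment(200, 300));
        index.add(new Segment(500, 10));

        // One merge per call; a commit between calls would release the old copy.
        long extra;
        while ((extra = expungeOnce(index, 10.0, 1)) > 0) {
            System.out.println("call needed " + extra + " extra bytes");
        }
    }
}
```

With a cap of one merge per call, the free space needed at any moment in this model is bounded by the largest single rewrite rather than by the sum of all of them.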




[jira] [Comment Edited] (LUCENE-8263) Add indexPctDeletedTarget as a parameter to TieredMergePolicy to control more aggressive merging

2018-07-13 Thread Marc Morissette (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16543616#comment-16543616
 ] 

Marc Morissette edited comment on LUCENE-8263 at 7/13/18 7:37 PM:
--

I would like to argue against a 20% floor.

Some indexes contain documents of wildly different sizes with the larger 
documents experiencing much higher turnover. I have seen indexes with around 
20% deletions that were more than 2x their optimized size because of this 
phenomenon.

In such situations, a deletesPctAllowed of around 10-15% would make a lot of 
sense. I say keep the floor at 10%.

Or maybe simply issue a warning instead?
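
As a rough illustration of how around 20% deletions can more than double an index, here is a small worked example in plain Java. The document counts and sizes are made up for illustration only; the one assumption carried over from the comment above is that the deleted documents are the large, high-turnover ones.

```java
/** Worked arithmetic for size bloat caused by large deleted documents. */
final class DeleteBloatSketch {

    /** Percentage of documents in the index that are deleted. */
    static double pctDeleted(long liveDocs, long deletedDocs) {
        return 100.0 * deletedDocs / (liveDocs + deletedDocs);
    }

    /** On-disk size divided by the size after all deletes are merged away. */
    static double bloatRatio(long liveDocs, long avgLiveBytes,
                             long deletedDocs, long avgDeletedBytes) {
        long liveBytes = liveDocs * avgLiveBytes;
        long deadBytes = deletedDocs * avgDeletedBytes;
        return (double) (liveBytes + deadBytes) / liveBytes;
    }

    public static void main(String[] args) {
        // Hypothetical index: 80 small live docs of 1 KB each,
        // 20 deleted docs of 5 KB each (the large docs churn faster).
        System.out.println("deleted: " + pctDeleted(80, 20) + "%");
        System.out.println("size vs. optimized: " + bloatRatio(80, 1024, 20, 5 * 1024) + "x");
    }
}
```

In this made-up case the deleted percentage by document count is 20%, yet deleted documents hold more bytes than live ones, so the index is 2.25 times its optimized size.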



> Add indexPctDeletedTarget as a parameter to TieredMergePolicy to control more 
> aggressive merging
> 
>
> Key: LUCENE-8263
> URL: https://issues.apache.org/jira/browse/LUCENE-8263
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Erick Erickson
> Assignee: Erick Erickson
> Priority: Major
> Attachments: LUCENE-8263.patch
>
>
> Spinoff of LUCENE-7976 to keep the two issues separate.
> The current TMP allows up to 50% deleted docs, which can be wasteful on large 
> indexes. This parameter will do more aggressive merging of segments with 
> deleted documents when the _total_ percentage of deleted docs in the entire 
> index exceeds it.
> Setting this to 50% should approximate current behavior. Setting it to 20% 
> caused the first cut at this to increase I/O roughly 10%. Setting it to 10% 
> caused about a 50% increase in I/O.
> I was conflating the two issues, so I'll change 7976 and comment out the bits 
> that reference this new parameter. After it's checked in we can bring this 
> back. That should be less work than reconstructing this later.
> Among the questions to be answered:
> 1> what should the default be? I propose 20% as it results in significantly 
> less space wasted and helps control heap usage for a modest increase in I/O.
> 2> what should the floor be? I propose 10% with _strong_ documentation 
> warnings about not setting it below 20%.
> 3> should there be two parameters? I think this was discussed somewhat in 
> 7976. The first cut at this used this number for two purposes:
> 3a> the total percentage of deleted docs index-wide to trip this trigger
> 3b> the percentage of an _individual_ segment that had to be deleted if the 
> segment was over maxSegmentSize/2 bytes in order to be eligible for merging. 
> Empirically, using the same percentage for both caused the merging to hover 
> around the value specified for this parameter.
> My proposal for <3> would be to have the parameter do double-duty. Assuming 
> my preliminary results hold, you specify this parameter at, say, 20% and once 
> the index hits that % deleted docs it hovers right around there, even if 
> you've forceMerged earlier down to 1 segment. This seems in line with what 
> I'd expect and adding another parameter seems excessively complicated to no 
> good purpose. We could always add something like that later if we wanted.
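
The double duty described in 3a and 3b of the issue description can be sketched as two checks against one percentage. This is a toy model in plain Java and not the code in the attached patch: the class and method names are hypothetical, and treating "exceeds" as a strict greater-than comparison is an assumption.

```java
/** Toy model of one percentage doing double duty. This is not Lucene code. */
final class DeletedTargetSketch {

    /** 3a: index-wide trigger, tripped when total deleted docs exceed the target. */
    static boolean triggerTripped(long totalDocs, long deletedDocs, double targetPct) {
        return totalDocs > 0 && 100.0 * deletedDocs / totalDocs > targetPct;
    }

    /**
     * 3b: a segment over maxSegmentBytes / 2 is eligible only when its own
     * deleted percentage exceeds the same target. Returns false for smaller
     * segments, which this sketch leaves to the normal merge selection.
     */
    static boolean largeSegmentEligible(long segmentBytes, long maxSegmentBytes,
                                        double segmentPctDeleted, double targetPct) {
        boolean large = segmentBytes > maxSegmentBytes / 2;
        return large && segmentPctDeleted > targetPct;
    }

    public static void main(String[] args) {
        double target = 20.0;          // the proposed default
        long maxSegment = 5L << 30;    // 5 GB, chosen only for illustration

        System.out.println(triggerTripped(1_000, 250, target));
        System.out.println(largeSegmentEligible(3L << 30, maxSegment, 25.0, target));
        System.out.println(largeSegmentEligible(3L << 30, maxSegment, 15.0, target));
    }
}
```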


