Re: Pathological index condition

2017-08-28 Thread Erick Erickson
bq: I guess the alternative would be to occasionally roll the dice and
decide to merge that kind of segment.

That's what I was getting at with the "autoCompact" idea, just in a more
deterministic manner.




Re: Pathological index condition

2017-08-28 Thread Walter Underwood
That makes sense.

I guess the alternative would be to occasionally roll the dice and decide to 
merge that kind of segment.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)



Re: Pathological index condition

2017-08-28 Thread Erick Erickson
I don't think jitter would help. As long as a segment has more than 50% of the
max segment size in "live" docs, it's forever ineligible for merging (outside of
optimize or expungeDeletes commands). So the "zone" is anything over
50%.
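
(For anyone skimming the archive, the rule being described boils down to roughly
the check below. It is only a sketch of the condition, not the actual
TieredMergePolicy code, and the names are invented.)

    // Sketch of the eligibility rule under discussion -- not real TMP code; names invented.
    // A segment whose live (non-deleted) bytes already exceed half of the max merged
    // segment size can never be combined with anything without blowing the cap, so the
    // normal merge selection skips it until deletes push it back under 50%.
    static boolean eligibleForNormalMerging(long liveBytes, long maxMergedSegmentBytes) {
      return liveBytes <= maxMergedSegmentBytes / 2;
    }
    // Example: 4.75G live out of a 5G cap -> ineligible, no matter how many deletes pile up.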

Or I missed your point.

Erick


Re: Pathological index condition

2017-08-28 Thread Walter Underwood
If this happens in a precise zone, how about adding some random jitter to the 
threshold? That tends to get this kind of lock-up unstuck.
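
(A rough sketch of what "jitter the threshold" could mean in practice; purely
illustrative, the variable names are invented and this is not a proposed patch.)

    // Illustrative only: nudge the 50% cutoff a little each merge pass so segments
    // sitting just above the line occasionally become candidates again.
    java.util.Random rnd = new java.util.Random();
    double baseCutoff = 0.50;                                              // fraction of maxMergedSegmentMB
    double cutoffThisPass = baseCutoff + (rnd.nextDouble() - 0.5) * 0.10;  // somewhere in 0.45 .. 0.55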

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)



Re: Pathological index condition

2017-08-28 Thread Erick Erickson
And one more thought (not very well thought out).

A parameter on TMP (or whatever) that did <3> something like:
> a parameter 
> a parameter 
> On startup TMP takes the current timestamp
*> Every minute (or whatever) it checks the current timestamp and if
 is in between the last check time and now, do <2>.
> set the last checked time to the value from * above.

Taking the current timestamp on startup would keep from kicking off the
compaction right away, so we wouldn't need to keep any stateful information
across restarts and wouldn't go into a compaction cycle on every startup.
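
(Sketched out, the check could look something like the class below. All names are
placeholders, not a proposed API, and the bookkeeping is deliberately minimal.)

    // Placeholder names only -- a sketch of the "maintenance window" bookkeeping, not a proposed API.
    class CompactionWindow {
      // Taking "now" at construction means a restart never immediately fires a compaction,
      // so nothing has to be persisted across restarts.
      private long lastCheckMillis = System.currentTimeMillis();

      // Called every minute or so; windowStartMillis is the configured wall-clock moment
      // (say, 1:00 AM today) at which the <2>-style purge should run.
      synchronized boolean shouldCompactNow(long windowStartMillis) {
        long now = System.currentTimeMillis();
        boolean due = windowStartMillis > lastCheckMillis && windowStartMillis <= now;
        lastCheckMillis = now;               // advance so we fire at most once per window
        return due;
      }
    }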

Erick


Re: Pathological index condition

2017-08-27 Thread Erick Erickson
I've been thinking about this a little more. Since this is an outlier,
I'm loath to change the core TMP merge selection process. Say the max
segment size is 5G. You'd be doing an awful lot of I/O to merge a
segment with 4.75G of "live" docs with one holding 0.25G. Plus that doesn't
really allow users who issue the tempting "optimize" command to
recover; that one huge segment can hang around for a _very_ long time,
accumulating lots of deleted docs. Same with expungeDeletes.

I can think of several approaches:

1> despite my comment above, a flag that says something like "if a
segment has > X% deleted docs, merge it with a smaller segment anyway
respecting the max segment size. I know, I know this will affect
indexing throughput, do it anyway".

2> A special op (or perhaps a flag on expungeDeletes) that would
behave like <1> but on-demand rather than part of standard merging.

In both of these cases, if a segment had > X% deleted docs but the
live doc size for that segment was > the max seg size, rewrite it into
a single new segment removing deleted docs.

3> some way to do the above on a schedule. My notion is something like
a maintenance window at 1:00 AM. You'd still have a live collection,
but (presumably) a way to purge the day's accumulation of deleted
documents during off hours.

4> ???

I probably like <2> best so far. I don't see this condition in the
wild very often; it usually occurs during heavy re-indexing operations,
and often after an optimize or expungeDeletes has happened. <1> could
get horribly pathological if the threshold was 1% or something.
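
(To make <2> concrete, the selection step might look roughly like the sketch
below. The types and names are made up; it is not working MergePolicy code, just
the shape of the idea.)

    // Sketch of the selection step for <2>; made-up types, not real MergePolicy code.
    static class SegmentStats {                  // stand-in for per-segment accounting
      final long liveBytes; final int maxDoc; final int deletedDocs;
      SegmentStats(long liveBytes, int maxDoc, int deletedDocs) {
        this.liveBytes = liveBytes; this.maxDoc = maxDoc; this.deletedDocs = deletedDocs;
      }
    }

    // Pick every segment over the deleted-docs threshold, even the ones whose live size
    // is above the normal 50% cutoff; an oversized pick would simply be rewritten by
    // itself ("singleton merge") so its deleted docs get dropped.
    static java.util.List<SegmentStats> pickForPurge(java.util.List<SegmentStats> segments,
                                                     double maxAllowedDeletedPct) {
      java.util.List<SegmentStats> picked = new java.util.ArrayList<>();
      for (SegmentStats seg : segments) {
        double deletedPct = 100.0 * seg.deletedDocs / seg.maxDoc;
        if (deletedPct > maxAllowedDeletedPct) {
          picked.add(seg);
        }
      }
      return picked;
    }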

WDYT?



Re: Pathological index condition

2017-08-09 Thread Erick Erickson
Thanks Mike:

bq: Or are you saying that each segment's 20% of not-deleted docs is
still greater than 1/2 of the max segment size, and so TMP considers
them ineligible?

Exactly.

Hadn't seen the blog, thanks for that. Added to my list of things to refer to.

The problem we're seeing is that "in the wild" there are cases where
people can now get satisfactory performance from huge numbers of
documents, as in close to 2B (there was a question on the user's list
about that recently). So allowing up to 60% deleted documents is
dangerous in that situation.
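
(Back-of-the-envelope, since maxDoc counts deleted docs against the roughly 2^31
per-index document limit; numbers rounded:)

    ~2,147,000,000  hard per-index ceiling on document IDs (live and deleted both count)
          x 0.60    deleted fraction being tolerated
    --------------
    ~1,288,000,000  doc IDs (plus the disk and heap behind them) carrying nothing but deletes
      ~859,000,000  doc IDs left over for live documents

So an index already flirting with 2B documents has very little headroom before
deletes alone push it into the ceiling.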

And the situation is exacerbated by optimizing (I know, "don't do that").

Ah, well, the joys of people using this open source thing and pushing
its limits.

Thanks again,
Erick


Re: Pathological index condition

2017-08-08 Thread Michael McCandless
Hi Erick,

Some questions/answers below:

On Sun, Aug 6, 2017 at 8:22 PM, Erick Erickson 
wrote:

> Particularly interested if Mr. McCandless has any opinions here.
>
> I admit it took some work, but I can create an index that never merges
> and is 80% deleted documents using TieredMergePolicy.
>
> I'm trying to understand how indexes "in the wild" can have > 30%
> deleted documents. I think the root issue here is that
> TieredMergePolicy doesn't consider for merging any segments > 50% of
> maxMergedSegmentMB of non-deleted documents.
>
> Let's say I have segments at the default 5G max. For the sake of
> argument, it takes exactly 5,000,000 identically-sized documents to
> fill the segment to exactly 5G.
>
> IIUC, as long as the segment has more than 2,500,000 documents in it
> it'll never be eligible for merging.


That's right.


> The only way to force deleted
> docs to be purged is to expungeDeletes or optimize, neither of which
> is recommended.


+1

> The condition I created was highly artificial but illustrative:
> - I set my max segment size to 20M
> - Through experimentation I found that each segment would hold roughly
> 160K synthetic docs.
> - I set my ramBuffer to 1G.
> - Then I'd index 500K docs, then delete 400K of them, and commit. This
> produces a single segment occupying (roughly) 80M of disk space, 15M
> or so of it "live" documents, the rest deleted.
> - rinse, repeat with a disjoint set of doc IDs.
>
> The number of segments continues to grow forever, each one consisting
> of 80% deleted documents.
>

But wouldn't TMP at some point merge these segments?  Or are you saying
that each segment's 20% of not-deleted docs is still greater than 1/2 of the
max segment size, and so TMP considers them ineligible?

This is indeed a rather pathological case, and you're right TMP would never
merge them (if my logic above is right).  Maybe we could tweak TMP for
situations like this, though I'm not sure they happen in practice.
Normally the max segment size is quite a bit larger than the initially
flushed segment sizes.


> This artificial situation just allowed me to see how the segments
> merged. Without such artificial constraints I suspect the limit for
> deleted documents would be capped at 50% theoretically and in practice
> less than that although I have seen 35% or so deleted documents in the
> wild.
>

Yeah I think so too.  I wrote this blog post about deletions:
https://www.elastic.co/blog/lucenes-handling-of-deleted-documents

It has a fun chart showing how the percentage of deleted docs bounces around.


> So at the end of the day I have a couple of questions:
>
> 1> Is my understanding close to correct? This is really the first time
> I've had to dive into the guts of merging.
>

Yes!


> 2> Is there a way I've missed to slim down an index other than
> expungeDeletes or optimize/forceMerge?
>

No.

> It seems to me like eventually, with large indexes, every segment that
> is the max size allowed is going to have to go over 50% deletes before
> being merged and there will have to be at least two of them. I don't
> see a clean way to fix this, any algorithm would likely be far too
> expensive to be part of regular merging. I suppose we could merge
> segments of different sizes if the combined size was < max segment
> size. On a quick glance it doesn't seem like the log merge policies
> address this kind of case either, but haven't dug into them much.
>

TMP should be able to merge one max-sized segment (that has eked just over
50% deleted docs) with smaller sized segments.  It would not prefer this
merge, since merging substantially different segment sizes gives poor
performance vs. merging equally sized segments, but it does have a bias for
removing deleted docs that would offset that.
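
If I remember right, that bias is even tunable: TMP exposes it as
reclaimDeletesWeight (default 2.0 as of this writing), so something like the
sketch below, shown only as an illustration, would make delete-heavy merges
score better.

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.TieredMergePolicy;

    // Illustration only: favor merges that reclaim deletes more strongly than the default.
    TieredMergePolicy tmp = new TieredMergePolicy();
    tmp.setMaxMergedSegmentMB(5 * 1024);      // the default 5G cap, spelled out
    tmp.setReclaimDeletesWeight(3.0);         // > 2.0 biases selection further toward deletes

    IndexWriterConfig iwc = new IndexWriterConfig(new StandardAnalyzer());
    iwc.setMergePolicy(tmp);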


> Thanks!
>

You're welcome!

Mike McCandless

http://blog.mikemccandless.com


Pathological index condition

2017-08-06 Thread Erick Erickson
Particularly interested if Mr. McCandless has any opinions here.

I admit it took some work, but I can create an index that never merges
and is 80% deleted documents using TieredMergePolicy.

I'm trying to understand how indexes "in the wild" can have > 30%
deleted documents. I think the root issue here is that
TieredMergePolicy doesn't consider for merging any segment whose
non-deleted documents already take up more than 50% of maxMergedSegmentMB.

Let's say I have segments at the default 5G max. For the sake of
argument, it takes exactly 5,000,000 identically-sized documents to
fill the segment to exactly 5G.

IIUC, as long as the segment has more than 2,500,000 documents in it
it'll never be eligible for merging. The only way to force deleted
docs to be purged is to expungeDeletes or optimize, neither of which
is recommended.
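
(For reference, those two escape hatches at the IndexWriter level, sketched
below; both are discouraged for exactly the reasons in this thread, and writer
is just an assumed already-open IndexWriter.)

    // Sketch only; writer is an already-open org.apache.lucene.index.IndexWriter.
    writer.forceMergeDeletes();   // "expungeDeletes": rewrites segments whose deleted
                                  // percentage exceeds forceMergeDeletesPctAllowed (default 10%)
    writer.forceMerge(1);         // "optimize": forced merges ignore maxMergedSegmentMB,
                                  // which is how a single huge segment gets created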

The condition I created was highly artificial but illustrative:
- I set my max segment size to 20M
- Through experimentation I found that each segment would hold roughly
160K synthetic docs.
- I set my ramBuffer to 1G.
- Then I'd index 500K docs, then delete 400K of them, and commit. This
produces a single segment occupying (roughly) 80M of disk space, 15M
or so of it "live" documents, the rest deleted.
- rinse, repeat with a disjoint set of doc IDs.

The number of segments continues to grow forever, each one consisting
of 80% deleted documents.
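
(Roughly, the driver is the sketch below, reconstructed from the description
above; the field names, document contents, loop count and directory path are
placeholders rather than the actual test code.)

    // Reconstruction of the experiment described above -- placeholders throughout,
    // not the actual test code.
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TieredMergePolicy;
    import org.apache.lucene.store.FSDirectory;
    import java.nio.file.Paths;

    public class PathologicalIndex {
      public static void main(String[] args) throws Exception {
        TieredMergePolicy tmp = new TieredMergePolicy();
        tmp.setMaxMergedSegmentMB(20);                 // tiny cap so the effect shows up quickly

        IndexWriterConfig iwc = new IndexWriterConfig(new StandardAnalyzer());
        iwc.setMergePolicy(tmp);
        iwc.setRAMBufferSizeMB(1024);                  // 1G buffer -> one big flushed segment per pass

        try (IndexWriter writer = new IndexWriter(FSDirectory.open(Paths.get("/tmp/pathological")), iwc)) {
          for (int pass = 0; pass < 100; pass++) {     // rinse, repeat with disjoint ids
            int base = pass * 500_000;
            for (int i = 0; i < 500_000; i++) {
              Document doc = new Document();
              doc.add(new StringField("id", Integer.toString(base + i), Field.Store.NO));
              doc.add(new TextField("body", "synthetic doc " + (base + i), Field.Store.NO));
              writer.addDocument(doc);
            }
            for (int i = 0; i < 400_000; i++) {        // delete 80% of what was just added
              writer.deleteDocuments(new Term("id", Integer.toString(base + i)));
            }
            writer.commit();                           // each pass leaves a segment that is ~80% deleted
          }
        }
      }
    }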

This artificial situation just allowed me to see how the segments
merged. Without such artificial constraints I suspect the limit for
deleted documents would be capped at 50% theoretically, and in practice
less than that, although I have seen 35% or so deleted documents in the
wild.

So at the end of the day I have a couple of questions:

1> Is my understanding close to correct? This is really the first time
I've had to dive into the guts of merging.

2> Is there a way I've missed to slim down an index other than
expungeDeletes or optimize/forceMerge?

It seems to me like eventually, with large indexes, every segment that
is at the max allowed size is going to have to go over 50% deletes before
being merged, and there will have to be at least two of them. I don't
see a clean way to fix this; any algorithm would likely be far too
expensive to be part of regular merging. I suppose we could merge
segments of different sizes if the combined size was < max segment
size. On a quick glance it doesn't seem like the log merge policies
address this kind of case either, but I haven't dug into them much.

Thanks!
Erick

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org