Re: merge policy & autocommit

2019-10-28 Thread Shawn Heisey

On 10/28/2019 7:23 AM, Danilo Tomasoni wrote:

We have a Solr instance with around 40 million docs.

In the bulk import phase we noticed high IO and CPU load, and it looks 
like it's related to autocommit, because if I disable autocommit the 
load on the system is very low.


I know that disabling autocommit is not recommended, but I'm wondering 
if there is a minimum hardware requirement to make this suggestion 
effective.


What are your settings for autoCommit and autoSoftCommit?  If the 
settings are referring to system properties, have you defined those 
system properties?  Would you be able to restart Solr and then share a 
solr.log file that goes back to that start?


The settings that Solr has shipped with for quite a while are to enable 
autoCommit with a 15 second maxTime, no maxDoc, and openSearcher set to 
false.  The autoSoftCommit setting is not enabled by default.
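
For reference, that stanza in a stock solrconfig.xml looks roughly like 
this (recent versions read the interval from a system property, which is 
also why the question above about system properties matters):

  <autoCommit>
    <maxTime>${solr.autoCommit.maxTime:15000}</maxTime>
    <openSearcher>false</openSearcher>
  </autoCommit>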


These settings work well, though I personally think 15 seconds is 
perhaps too frequent, and I like to set it to something like one minute 
instead.


With openSearcher set to false, autoCommit will not affect document 
visibility.  If automatically making index changes visible is desired, 
it is better to configure autoSoftCommit in addition to autoCommit ... 
and super short intervals are not recommended.
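
A sketch of one reasonable combination, using the one-minute hard commit 
mentioned above (both intervals are illustrative, not prescriptive):

  <!-- hard commit: flushes index changes to stable storage,
       does not change visibility -->
  <autoCommit>
    <maxTime>60000</maxTime>
    <openSearcher>false</openSearcher>
  </autoCommit>
  <!-- soft commit: makes changes visible; set it as long as your
       freshness requirements allow -->
  <autoSoftCommit>
    <maxTime>120000</maxTime>
  </autoSoftCommit>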


Our system is not very powerful in terms of IO read/write speed (around 
100 MByte/s). Is it possible that this relatively low IO performance 
combined with


100MB/sec is not what I would call low I/O.  It's the minimum that you 
can expect from modern commodity SATA hard drives, and some of those can 
go even faster.  It's also roughly equivalent to the maximum real-world 
achievable throughput of a gigabit network connection with TCP-based 
protocols.


autocommit will slow our Solr instance down incredibly, to the point of 
making it unresponsive?


If it's configured correctly, autoCommit should have very little effect 
on performance.  Hard commits that do not open a new searcher should 
happen VERY quickly.  It seems very strange to me that disabling a 
correctly configured autoCommit would substantially affect indexing speeds.


The same can be true also for the merge policy? how the IO speed can 
affect the merge policy parameters?


I kept the default merge policy configuration, but it looks like it never 
merges segments. How can I know if a merge is happening?


If you have segments that are radically different sizes, then merging is 
happening.  With default settings, merges from the first level should 
produce segments roughly ten times the size of the ones created by 
indexing.  Second level merges will probably produce segments roughly 
100 times the size of the smallest ones.  Segment merging is a normal 
part of Lucene operation; it would be very unusual for it not to occur.
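
The roughly-ten-times behavior falls out of the TieredMergePolicy 
defaults.  Made explicit, they would look something like this sketch 
(Solr 6.x and later use the mergePolicyFactory element shown here; 
older versions use a mergePolicy element instead):

  <mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
    <!-- merge up to 10 segments at a time -->
    <int name="maxMergeAtOnce">10</int>
    <!-- allow up to 10 similar-sized segments per tier before merging -->
    <int name="segmentsPerTier">10</int>
  </mergePolicyFactory>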


Merging will affect I/O, but it is extremely rare for merging to happen 
super-quickly.  The fastest I have ever seen merging on a single Solr 
core proceed is about 30 megabytes per second, though usually that 
system achieved about 20 megabytes per second.  Merging involves 
considerable computational work; it's not just a straight data copy.


Thanks,
Shawn


Re: Merge policy

2016-10-28 Thread Walter Underwood
25% overhead is pretty good. It is easy for a merge to need almost double the 
space of a minimum-sized index. It is possible to use 3X the space.

Don’t try to use the least possible disk space. If there isn’t enough free space 
on the disk, Solr cannot merge the big indexes. Ever. That may be what has 
happened here.

Make sure the nodes have at least 100 GB of free space on the volumes, maybe 
150. That space is not “wasted” or “unused”. It is necessary for merges.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Oct 28, 2016, at 12:20 AM, Arkadi Colson wrote:
> 
> The index size of 1 shard is about 125GB and we are running 11 shards with 
> replication factor 2 so it's a lot of data. The deletions percentage at the 
> bottom of the segment page is around 25%. So it's quite some space which we 
> could recover. That's why I was looking for an optimize.
> 
> Do you have any idea why the merge policy does not merge away the deletions? 
> Should I tweak some parameters somehow? It's a default installation using the 
> default settings and parameters. If you need more info, just let me know...
> 
> Thx!
> 
> On 27-10-16 17:40, Erick Erickson wrote:
>> Why do you think you need to get rid of the deleted data? During normal
>> indexing, these will be "merged away". Optimizing has some downsides
>> for continually changing indexes, in particular since the default 
>> TieredMergePolicy tries to merge "like size" segments, deletions will
>> accumulate in your one large segment and the percentage of
>> deleted documents may get even higher.
>> 
>> Unless there's some measurable performance gain that the users
>> will notice, I'd just leave this alone.
>> 
>> The exception here is if you have, say, an index that changes rarely
>> in which case optimizing then makes more sense.
>> 
>> Best,
>> Erick
>> 
>> On Thu, Oct 27, 2016 at 6:56 AM, Arkadi Colson wrote:
>> Thanks for the answer!
>> Do you know if there is a way to trigger an optimize for only 1 shard and 
>> not the whole collection at once?
>> 
>> On 27-10-16 15:30, Pushkar Raste wrote:
>>> Try commit with expungeDeletes="true"
>>> 
>>> I am not sure if it will merge old segments that have deleted documents.
>>> 
>>> In the worst case you can 'optimize' your index, which should take care of 
>>> removing deleted documents
>>> 
>>> 
>>> On Oct 27, 2016 4:20 AM, "Arkadi Colson" wrote:
>>> Hi
>>> 
>>> As you can see in the screenshot above in the oldest segments there are a 
>>> lot of deletions. In total the shard has about 26% deletions. How can I get 
>>> rid of them so the index will be smaller again?
>>> Can this only be done with an optimize or does it also depend on the merge 
>>> policy? If it also depends on the merge policy, which one should I 
>>> choose then?
>>> 
>>> Thanks!
>>> 
>>> BR,
>>> Arkadi
>> 
>> 
> 



Re: Merge policy

2016-10-28 Thread Emir Arnautovic

I got a notification from the mailer, so I'm not sure if my reply reached you:

"If you are using TieredMergePolicy, you can try setting 
/*reclaimDeletesWeight*/."
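
A sketch of what that could look like in solrconfig.xml for a Solr of 
that era (the default weight is 2.0, and the 4.0 below is only an 
illustration; higher values make merging favor segments with many 
deletions, and Solr 6.x and later spell this with a mergePolicyFactory 
element instead):

  <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
    <double name="reclaimDeletesWeight">4.0</double>
  </mergePolicy>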


HTH,
Emir


On 28.10.2016 09:20, Arkadi Colson wrote:


The index size of 1 shard is about 125GB and we are running 11 shards 
with replication factor 2 so it's a lot of data. The deletions 
percentage at the bottom of the segment page is around 25%. So it's 
quite some space which we could recover. That's why I was looking for 
an optimize.


Do you have any idea why the merge policy does not merge away the 
deletions? Should I tweak some parameters somehow? It's a default 
installation using the default settings and parameters. If you need 
more info, just let me know...


Thx!


On 27-10-16 17:40, Erick Erickson wrote:

Why do you think you need to get rid of the deleted data? During normal
indexing, these will be "merged away". Optimizing has some downsides
for continually changing indexes, in particular since the default
TieredMergePolicy tries to merge "like size" segments, deletions will
accumulate in your one large segment and the percentage of
deleted documents may get even higher.

Unless there's some measurable performance gain that the users
will notice, I'd just leave this alone.

The exception here is if you have, say, an index that changes rarely
in which case optimizing then makes more sense.

Best,
Erick

On Thu, Oct 27, 2016 at 6:56 AM, Arkadi Colson wrote:


Thanks for the answer!
Do you know if there is a way to trigger an optimize for only 1
shard and not the whole collection at once?


On 27-10-16 15:30, Pushkar Raste wrote:


Try commit with expungeDeletes="true"

I am not sure if it will merge old segments that have deleted
documents.

In the worst case you can 'optimize' your index, which should
take care of removing deleted documents


On Oct 27, 2016 4:20 AM, "Arkadi Colson" wrote:

Hi

As you can see in the screenshot above in the oldest
segments there are a lot of deletions. In total the shard
has about 26% deletions. How can I get rid of them so the
index will be smaller again?
Can this only be done with an optimize or does it also
depend on the merge policy? If it also depends on the
merge policy, which one should I choose then?

Thanks!

BR,
Arkadi








--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/



Re: Merge policy

2016-10-28 Thread Arkadi Colson
It's a default installation using the default settings and parameters. 
Should I perhaps change the segment size or so? Is it possible to do 
this live, without re-indexing? If you need more info, just let me know...


Thx!


On 27-10-16 19:03, Walter Underwood wrote:

That distribution of segment sizes seems odd. Why so many medium-large segments?

Are there custom settings for merge policy? I think the default policy would 
avoid so many segments that are mostly deleted documents.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)



On Oct 27, 2016, at 9:40 AM, Shawn Heisey wrote:

On 10/27/2016 9:50 AM, Yonik Seeley wrote:

On Thu, Oct 27, 2016 at 9:56 AM, Arkadi Colson 
wrote:

Thanks for the answer! Do you know if there is a way to trigger an
optimize for only 1 shard and not the whole collection at once?

Adding a "distrib=false" parameter should work I think.

Last time I checked, which I admit has been a little while, optimize
ignored distrib and proceeded with a sequential optimize of every core
in the collection.

Thanks,
Shawn







Re: Merge policy

2016-10-28 Thread Arkadi Colson
The index size of 1 shard is about 125GB and we are running 11 shards 
with replication factor 2 so it's a lot of data. The deletions 
percentage at the bottom of the segment page is around 25%. So it's 
quite some space which we could recover. That's why I was looking for an 
optimize.


Do you have any idea why the merge policy does not merge away the 
deletions? Should I tweak some parameters somehow? It's a default 
installation using the default settings and parameters. If you need more 
info, just let me know...


Thx!


On 27-10-16 17:40, Erick Erickson wrote:

Why do you think you need to get rid of the deleted data? During normal
indexing, these will be "merged away". Optimizing has some downsides
for continually changing indexes, in particular since the default
TieredMergePolicy tries to merge "like size" segments, deletions will
accumulate in your one large segment and the percentage of
deleted documents may get even higher.

Unless there's some measurable performance gain that the users
will notice, I'd just leave this alone.

The exception here is if you have, say, an index that changes rarely
in which case optimizing then makes more sense.

Best,
Erick

On Thu, Oct 27, 2016 at 6:56 AM, Arkadi Colson wrote:


Thanks for the answer!
Do you know if there is a way to trigger an optimize for only 1
shard and not the whole collection at once?


On 27-10-16 15:30, Pushkar Raste wrote:


Try commit with expungeDeletes="true"

I am not sure if it will merge old segments that have deleted
documents.

In the worst case you can 'optimize' your index, which should take
care of removing deleted documents


On Oct 27, 2016 4:20 AM, "Arkadi Colson" wrote:

Hi

As you can see in the screenshot above in the oldest segments
there are a lot of deletions. In total the shard has about
26% deletions. How can I get rid of them so the index will be
smaller again?
Can this only be done with an optimize or does it also depend
on the merge policy? If it also depends on the merge
policy, which one should I choose then?

Thanks!

BR,
Arkadi








Re: Merge policy

2016-10-27 Thread Walter Underwood
That distribution of segment sizes seems odd. Why so many medium-large segments?

Are there custom settings for merge policy? I think the default policy would 
avoid so many segments that are mostly deleted documents.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Oct 27, 2016, at 9:40 AM, Shawn Heisey wrote:
> 
> On 10/27/2016 9:50 AM, Yonik Seeley wrote:
>> On Thu, Oct 27, 2016 at 9:56 AM, Arkadi Colson 
>> wrote:
>>> Thanks for the answer! Do you know if there is a way to trigger an
>>> optimize for only 1 shard and not the whole collection at once? 
>> Adding a "distrib=false" parameter should work I think. 
> 
> Last time I checked, which I admit has been a little while, optimize
> ignored distrib and proceeded with a sequential optimize of every core
> in the collection.
> 
> Thanks,
> Shawn
> 



Re: Merge policy

2016-10-27 Thread Shawn Heisey
On 10/27/2016 9:50 AM, Yonik Seeley wrote:
> On Thu, Oct 27, 2016 at 9:56 AM, Arkadi Colson 
> wrote:
>> Thanks for the answer! Do you know if there is a way to trigger an
>> optimize for only 1 shard and not the whole collection at once? 
> Adding a "distrib=false" parameter should work I think. 

Last time I checked, which I admit has been a little while, optimize
ignored distrib and proceeded with a sequential optimize of every core
in the collection.

Thanks,
Shawn



Re: Merge policy

2016-10-27 Thread Yonik Seeley
On Thu, Oct 27, 2016 at 9:56 AM, Arkadi Colson wrote:

> Thanks for the answer!
> Do you know if there is a way to trigger an optimize for only 1 shard and
> not the whole collection at once?
>

Adding a "distrib=false" parameter should work I think.

-Yonik


Re: Merge policy

2016-10-27 Thread Erick Erickson
Why do you think you need to get rid of the deleted data? During normal
indexing, these will be "merged away". Optimizing has some downsides
for continually changing indexes, in particular since the default
TieredMergePolicy tries to merge "like size" segments, deletions will
accumulate in your one large segment and the percentage of
deleted documents may get even higher.

Unless there's some measurable performance gain that the users
will notice, I'd just leave this alone.

The exception here is if you have, say, an index that changes rarely
in which case optimizing then makes more sense.

Best,
Erick

On Thu, Oct 27, 2016 at 6:56 AM, Arkadi Colson wrote:

> Thanks for the answer!
> Do you know if there is a way to trigger an optimize for only 1 shard and
> not the whole collection at once?
>
> On 27-10-16 15:30, Pushkar Raste wrote:
>
> Try commit with expungeDeletes="true"
>
> I am not sure if it will merge old segments that have deleted documents.
>
> In the worst case you can 'optimize' your index, which should take care of
> removing deleted documents
>
> On Oct 27, 2016 4:20 AM, "Arkadi Colson" wrote:
>
>> Hi
>>
>> As you can see in the screenshot above in the oldest segments there are a
>> lot of deletions. In total the shard has about 26% deletions. How can I get
>> rid of them so the index will be smaller again?
>> Can this only be done with an optimize or does it also depend on the
>> merge policy? If it also depends on the merge policy, which one should
>> I choose then?
>>
>> Thanks!
>>
>> BR,
>> Arkadi
>>
>
>


Re: Merge policy

2016-10-27 Thread Arkadi Colson

Thanks for the answer!
Do you know if there is a way to trigger an optimize for only 1 shard 
and not the whole collection at once?



On 27-10-16 15:30, Pushkar Raste wrote:


Try commit with expungeDeletes="true"

I am not sure if it will merge old segments that have deleted documents.

In the worst case you can 'optimize' your index, which should take care 
of removing deleted documents



On Oct 27, 2016 4:20 AM, "Arkadi Colson" wrote:


Hi

As you can see in the screenshot above in the oldest segments
there are a lot of deletions. In total the shard has about 26%
deletions. How can I get rid of them so the index will be smaller
again?
Can this only be done with an optimize or does it also depend on
the merge policy? If it also depends on the merge policy,
which one should I choose then?

Thanks!

BR,
Arkadi





Re: Merge policy

2016-10-27 Thread Pushkar Raste
Try commit with expungeDeletes="true"

I am not sure if it will merge old segments that have deleted documents.

In the worst case you can 'optimize' your index, which should take care of
removing deleted documents
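
As a sketch, that commit can be sent as a raw XML update message (the 
host and collection name are placeholders):

  <!-- POST this body to http://localhost:8983/solr/mycollection/update -->
  <commit expungeDeletes="true"/>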

On Oct 27, 2016 4:20 AM, "Arkadi Colson" wrote:

> Hi
>
> As you can see in the screenshot above in the oldest segments there are a
> lot of deletions. In total the shard has about 26% deletions. How can I get
> rid of them so the index will be smaller again?
> Can this only be done with an optimize or does it also depend on the merge
> policy? If it also depends on the merge policy, which one should I
> choose then?
>
> Thanks!
>
> BR,
> Arkadi
>


Re: Merge Policy Recommendation for 3.6.1

2012-09-29 Thread Sujatha Arun
Thanks Shawn, that helps a lot. Our current OS limit is set to 300,000+,
which I heard is the maximum for the OS. I'm not sure of the soft and
hard limits; will check this.

Regards,
Sujatha



On Fri, Sep 28, 2012 at 8:14 PM, Shawn Heisey s...@elyograg.org wrote:

 On 9/28/2012 12:43 AM, Sujatha Arun wrote:

 Hello,

 In the case where there are over 200+ cores on a single node, is it
 recommended to go with Tiered MP with a segment size of 4? Our index
 sizes vary from a few MB to 4 GB.

 Will there be any issue with "Too many open files" and the number of
 indexes with respect to MP? At the moment we are thinking of going with
 Tiered MP.

 OS file limit has been set to maximum.


 Whether or not to deviate from the standard TieredMergePolicy depends
 heavily on many factors which we do not know, but I can tell you that it's
 probably not a good idea.  That policy typically produces the best results
 in all scenarios.

 http://blog.mikemccandless.com/2011/02/visualizing-lucenes-segment-merges.html

 On the subject of open files:  With its default configuration, a Solr 3.x
 index will have either 8 or 11 files per segment, depending on whether you
 are using termvectors.  I am completely unsure about 4.0, because I've
 never used it, but it is probably similar.  The following calculations are
 based on my experience with 3.x.

 With a segment limit of 4, you might expect to have only six segments
 around at any one time - the four that are being merged, the new merged
 segment, and a segment where new data is being written.  If your system
 indexes data slow enough for merges to complete before another new segment
 is created, this is indeed the most you will ever see.  If your system
 indexes data fast enough, you might actually have short-lived moments with
 10 or 14 segments, and possibly more.

 Assuming some things, which lead to using the 13 segment figure:
 simultaneous indexing to multiple cores at once, with termvectors turned
 on.  With these assumptions, a 200 core Solr installation using 4 segments
 might potentially have nearly 37000 files open, but is more likely to have
 significantly less.  If you increase your merge policy segment limit, the
 numbers will go up from there.

 I have configured my Linux servers with a soft file limit of 49152 and a
 hard limit of 65536.  My segment limit is set to 35, and each server has a
 maximum of four active cores, which means that during heavy indexing, I can
 see over 8000 open files.

 What does "maximum" on the OS file limit actually mean?  Does your OS have
 a way to specify "unlimited"? My personal feeling is that it's a bad idea to
 run with no limits at all.  I would imagine that you need to go with a
 minimum soft limit of 65536.  Your segment limit of 4 is probably
 reasonable, unless you will be doing a lot of indexing in a very short
 amount of time.  If you are, you may want a larger limit, and a larger
 number of maximum open files.

 Thanks,
 Shawn




Re: Merge Policy Recommendation for 3.6.1

2012-09-28 Thread Shawn Heisey

On 9/28/2012 12:43 AM, Sujatha Arun wrote:

Hello,

In the case where there are over 200+ cores on a single node, is it
recommended to go with Tiered MP with a segment size of 4? Our index
sizes vary from a few MB to 4 GB.

Will there be any issue with "Too many open files" and the number of
indexes with respect to MP? At the moment we are thinking of going with
Tiered MP.

OS file limit has been set to maximum.


Whether or not to deviate from the standard TieredMergePolicy depends 
heavily on many factors which we do not know, but I can tell you that 
it's probably not a good idea.  That policy typically produces the best 
results in all scenarios.


http://blog.mikemccandless.com/2011/02/visualizing-lucenes-segment-merges.html

On the subject of open files:  With its default configuration, a Solr 
3.x index will have either 8 or 11 files per segment, depending on 
whether you are using termvectors.  I am completely unsure about 4.0, 
because I've never used it, but it is probably similar.  The following 
calculations are based on my experience with 3.x.


With a segment limit of 4, you might expect to have only six segments 
around at any one time - the four that are being merged, the new merged 
segment, and a segment where new data is being written.  If your system 
indexes data slow enough for merges to complete before another new 
segment is created, this is indeed the most you will ever see.  If your 
system indexes data fast enough, you might actually have short-lived 
moments with 10 or 14 segments, and possibly more.


Assuming some things, which lead to using the 13 segment figure: 
simultaneous indexing to multiple cores at once, with termvectors turned 
on.  With these assumptions, a 200 core Solr installation using 4 
segments might potentially have nearly 37000 files open, but is more 
likely to have significantly less.  If you increase your merge policy 
segment limit, the numbers will go up from there.


I have configured my Linux servers with a soft file limit of 49152 and a 
hard limit of 65536.  My segment limit is set to 35, and each server has 
a maximum of four active cores, which means that during heavy indexing, 
I can see over 8000 open files.


What does "maximum" on the OS file limit actually mean?  Does your OS 
have a way to specify "unlimited"? My personal feeling is that it's a bad 
idea to run with no limits at all.  I would imagine that you need to go 
with a minimum soft limit of 65536.  Your segment limit of 4 is probably 
reasonable, unless you will be doing a lot of indexing in a very short 
amount of time.  If you are, you may want a larger limit, and a larger 
number of maximum open files.
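
In Solr 3.x terms, a segment limit like the 4 under discussion is 
usually expressed via mergeFactor, which TieredMergePolicy maps onto 
both of its knobs.  A sketch of the two equivalent forms (placement 
inside indexDefaults/mainIndex varies between configs):

  <mergeFactor>4</mergeFactor>

or, spelled out explicitly:

  <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
    <int name="maxMergeAtOnce">4</int>
    <int name="segmentsPerTier">4</int>
  </mergePolicy>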


Thanks,
Shawn



Re: Merge Policy

2009-07-21 Thread Chris Hostetter

: SolrIndexConfig accepts a mergePolicy class name, however how does one
: inject properties into it?

At the moment you can't.  

If you look at the history of MergePolicy, users have never been 
encouraged to implement their own (the API actively discourages it, 
without going so far as to make it impossible).


-Hoss



Re: Merge Policy

2009-07-21 Thread Jason Rutherglen
I am referring to setting properties on the *existing* policy
available in Lucene such as LogByteSizeMergePolicy.setMaxMergeMB

On Tue, Jul 21, 2009 at 5:11 PM, Chris
Hostetter hossman_luc...@fucit.org wrote:

 : SolrIndexConfig accepts a mergePolicy class name, however how does one
 : inject properties into it?

 At the moment you can't.

 If you look at the history of MergePolicy, users have never been
 encouraged to implement their own (the API actively discourages it,
 without going so far as to make it impossible).


 -Hoss