You can always increase the maximum segment size; for large indexes
that should reduce the number of segments. But watch your indexing
stats, as I can't predict the consequences of bumping it to 100G, for
instance. I'd _expect_ bursty I/O when those large segments started
to be created or merged....
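
If you do raise it, the knob is maxMergedSegmentMB on the merge policy
in solrconfig.xml. Something like this (the 20480 value is only an
example; on older Solr versions the element is <mergePolicy> rather
than <mergePolicyFactory>):

    <mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
      <!-- default is 5120 (5G); 20480 allows ~20G segments -->
      <double name="maxMergedSegmentMB">20480</double>
    </mergePolicyFactory>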

You'll be interested in LUCENE-7976 (Solr 7.4?), especially (probably)
the idea of increasing the segment sizes, and/or a related JIRA that
allows you to tweak how aggressively Solr merges segments that have
deleted docs.

NOTE: that JIRA has the consequence that, _by default_, an optimize
with no parameters respects the maximum segment size, which is a
change from the current behavior.

Finally, expungeDeletes may be useful as that too will respect max
segment size, again after LUCENE-7976 is committed.
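
If you want to try expungeDeletes from SolrJ, something like this
should work (a sketch; the collection name is a placeholder, the
request classes are from org.apache.solr.client.solrj.request):

    UpdateRequest req = new UpdateRequest();
    // commit with waitFlush=true, waitSearcher=true
    req.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
    // only rewrites segments that actually contain deleted docs
    req.setParam("expungeDeletes", "true");
    req.process(solrClient, "my_collection");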

Best,
Erick

On Wed, May 2, 2018 at 9:22 AM, Michael Joyner <mich...@newsrx.com> wrote:
> The main reason we go this route is that after a while (with default
> settings) we end up with hundreds of segments, and performance of course
> drops abysmally as a result. By using a stepped optimize, a) we don't run
> into the 3x+ headroom issue, and b) the performance penalty during the
> optimize is less than the penalty of leaving hundreds of segments
> unoptimized.
>
> BTW, since we use a batched insert/update cycle [once daily], we only
> optimize down to a single segment after a complete batch has been run.
> During the batch, though, we reduce the segment count down to a max of 16
> every 250K inserts/updates to prevent the large-segment-count performance
> penalty (a sketch of that loop is below).
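>
> A rough sketch of what that per-batch loop looks like (inside a method
> that handles SolrServerException/IOException; names here are only
> illustrative):
>
>     int sinceLastMerge = 0;
>     for (SolrInputDocument doc : batch) {
>         client.add(doc);
>         if (++sinceLastMerge >= 250_000) {
>             client.commit();
>             // keep the segment count bounded while the batch runs
>             client.optimize(true, true, 16);
>             sinceLastMerge = 0;
>         }
>     }
>     client.commit();
>     // the full stepped optimize down to 1 segment runs only after the
>     // whole batch completes (see solrOptimize() below)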
>
>
> On 04/30/2018 07:10 PM, Erick Erickson wrote:
>>
>> There's really no good way to purge deleted documents from the index
>> other than to wait until merging happens.
>>
>> Optimize/forceMerge and expungeDeletes both suffer from the problem
>> that they create massive segments that then stick around for a very
>> long time, see:
>>
>> https://lucidworks.com/2017/10/13/segment-merging-deleted-documents-optimize-may-bad/
>>
>> Best,
>> Erick
>>
>> On Mon, Apr 30, 2018 at 1:56 PM, Michael Joyner <mich...@newsrx.com>
>> wrote:
>>>
>>> Based on experience, 2x headroom is not always enough, sometimes not
>>> even 3x, if you are optimizing from many segments down to 1 segment
>>> in a single go.
>>>
>>> We have, however, figured out a way that can work with as little as
>>> 51% free space via the following iteration cycle:
>>>
>>> public void solrOptimize() {
>>>     int initialMaxSegments = 256;
>>>     int finalMaxSegments = 1;
>>>     if (isShowSegmentCounter()) {
>>>         log.info("Optimizing ...");
>>>     }
>>>     try (SolrClient solrServerInstance = getSolrClientInstance()) {
>>>         // Step down one segment at a time so each pass merges only a
>>>         // little data, keeping the extra disk space needed small.
>>>         for (int segments = initialMaxSegments;
>>>                 segments >= finalMaxSegments; segments--) {
>>>             if (isShowSegmentCounter()) {
>>>                 System.out.println("Optimizing to a max of " + segments
>>>                         + " segments.");
>>>             }
>>>             // optimize(waitFlush, waitSearcher, maxSegments)
>>>             solrServerInstance.optimize(true, true, segments);
>>>         }
>>>     } catch (SolrServerException | IOException e) {
>>>         throw new RuntimeException(e);
>>>     }
>>> }
>>>
>>>
>>> On 04/30/2018 04:23 PM, Walter Underwood wrote:
>>>>
>>>> You need 2X the minimum index size in disk space anyway, so don’t worry
>>>> about keeping the indexes as small as possible. Worry about having
>>>> enough
>>>> headroom.
>>>>
>>>> If your indexes are 250 GB, you need 250 GB of free space.
>>>>
>>>> wunder
>>>> Walter Underwood
>>>> wun...@wunderwood.org
>>>> http://observer.wunderwood.org/  (my blog)
>>>>
>>>>> On Apr 30, 2018, at 1:13 PM, Antony A <antonyaugus...@gmail.com> wrote:
>>>>>
>>>>> Thanks Erick/Deepak.
>>>>>
>>>>> The cloud is running on baremetal (128 GB/24 cpu).
>>>>>
>>>>> Is there an option to run a compact on the data files to make the
>>>>> size equal on both the clouds? I am trying to find all the options
>>>>> before I add the new fields into the production cloud.
>>>>>
>>>>> Thanks
>>>>> AA
>>>>>
>>>>> On Mon, Apr 30, 2018 at 10:45 AM, Erick Erickson
>>>>> <erickerick...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Anthony:
>>>>>>
>>>>>> You are probably seeing the results of removing deleted documents from
>>>>>> the shards as they're merged. Even on replicas in the same _shard_,
>>>>>> the size of the index on disk won't necessarily be identical. This has
>>>>>> to do with which segments are selected for merging, which are not
>>>>>> necessarily coordinated across replicas.
>>>>>>
>>>>>> The test is if the number of docs on each collection is the same. If
>>>>>> it is, then don't worry about index sizes.
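>>>>>>
>>>>>> A quick way to compare is to look at numFound from a *:* query on
>>>>>> each cloud, e.g. with SolrJ (a sketch; the client and collection
>>>>>> names are placeholders):
>>>>>>
>>>>>>     SolrQuery q = new SolrQuery("*:*");
>>>>>>     q.setRows(0);  // we only need the count, not the documents
>>>>>>     long countA = clientA.query("collection1", q).getResults().getNumFound();
>>>>>>     long countB = clientB.query("collection1", q).getResults().getNumFound();
>>>>>>     // if the counts match, the size difference is just deleted docs
>>>>>>     // and merge timing, not missing documents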
>>>>>>
>>>>>> Best,
>>>>>> Erick
>>>>>>
>>>>>> On Mon, Apr 30, 2018 at 9:38 AM, Deepak Goel <deic...@gmail.com>
>>>>>> wrote:
>>>>>>>
>>>>>>> Could you please also give the machine details of the two clouds you
>>>>>>> are
>>>>>>> running?
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Deepak
>>>>>>> "The greatness of a nation can be judged by the way its animals are
>>>>>>> treated. Please stop cruelty to Animals, become a Vegan"
>>>>>>>
>>>>>>> +91 73500 12833
>>>>>>> deic...@gmail.com
>>>>>>>
>>>>>>> Facebook: https://www.facebook.com/deicool
>>>>>>> LinkedIn: www.linkedin.com/in/deicool
>>>>>>>
>>>>>>> "Plant a Tree, Go Green"
>>>>>>>
>>>>>>> Make In India : http://www.makeinindia.com/home
>>>>>>>
>>>>>>> On Mon, Apr 30, 2018 at 9:51 PM, Antony A <antonyaugus...@gmail.com>
>>>>>>
>>>>>> wrote:
>>>>>>>>
>>>>>>>> Hi Shawn,
>>>>>>>>
>>>>>>>> The cloud is running version 6.2.1 with ClassicIndexSchemaFactory.
>>>>>>>>
>>>>>>>> The sum of the index sizes from the admin UI across all the shards
>>>>>>>> is around 265 GB vs 224 GB between the two clouds.
>>>>>>>>
>>>>>>>> I created the collection using "numShards" so compositeId router.
>>>>>>>>
>>>>>>>> If you need more information, please let me know.
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>> AA
>>>>>>>>
>>>>>>>> On Mon, Apr 30, 2018 at 10:04 AM, Shawn Heisey <apa...@elyograg.org>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> On 4/30/2018 9:51 AM, Antony A wrote:
>>>>>>>>>
>>>>>>>>>> I am running two separate Solr clouds. I have 8 shards in each,
>>>>>>>>>> with a total of 300 million documents. Both clouds are indexing
>>>>>>>>>> documents from the same source/configuration.
>>>>>>>>>>
>>>>>>>>>> I am noticing there is a difference in the size of the collection
>>>>>>>>>> between them. I am planning to add more shards to see if that
>>>>>>>>>> helps solve the issue. Has anyone come across a similar issue?
>>>>>>>>>>
>>>>>>>>> There's no information here about exactly what you are seeing,
>>>>>>>>> what you are expecting to see, and why you believe that what you
>>>>>>>>> are seeing is wrong.
>>>>>>>>>
>>>>>>>>> You did say that there is "a difference in size".  That is a very
>>>>>>>>> vague problem description.
>>>>>>>>>
>>>>>>>>> FYI, unless a SolrCloud collection is using the implicit router,
>>>>>>>>> you cannot add shards.  And if it *IS* using the implicit router,
>>>>>>>>> then you are 100% in control of document routing -- Solr cannot
>>>>>>>>> influence that at all.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Shawn
>>>>>>>>>
>>>>>>>>>
>
