We have been continuing to monitor this issue and have experimented further with deactivation versus Ignition.Stop(). There is some evidence that graceful shutdown does not perform a checkpoint, in that startup times are not necessarily improved on restart. We'll continue to investigate this, but confirmation that a checkpoint is definitely performed on deactivation would be good.
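For clarity, the deactivate-then-stop sequence we are comparing against a plain Ignition.Stop() is roughly the following (a Java sketch for illustration only - we drive the equivalent calls through the Ignite.NET API; ClusterState assumes Ignite 2.9+ and ignite-core on the classpath):

```java
import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.cluster.ClusterState;

// Deactivate first so the persistent store and WAL are left consistent,
// then stop the local node gracefully (cancel = false).
Ignite ignite = Ignition.ignite();              // default instance
ignite.cluster().state(ClusterState.INACTIVE);  // graceful deactivation
Ignition.stop(false);                           // a plain stop skips the step above
```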
We see some instances where a node takes 15 minutes to restore WAL changes (reported by the 'Finished restoring partition state for local groups' log statement), despite reducing the checkpoint interval to 30 seconds. Our ingest rate is not changing significantly, which suggests a correlation between the size of a partition and the time taken to finish restoring partition state, rather than between the size of un-checkpointed changes and that time. Can someone confirm whether this is expected? The size of data in the persistent store is around 150 GB.

Thanks,
Raymond.

On Fri, Jan 22, 2021 at 4:21 AM andrei <[email protected]> wrote:

> Hi,
>
> I don't think there are any other options at the moment other than the
> ones you mentioned.
>
> However, you could also create your own application that checks the
> topology and activates the cluster when all nodes from the baseline are
> online - for example, additional Java code run when starting a server
> node.
>
> If you require changes to the current Ignite implementation, you can
> create a thread on the Ignite developer list:
>
> http://apache-ignite-developers.2346864.n4.nabble.com/
>
> BR,
> Andrei
>
> On 1/20/2021 9:16 PM, Raymond Wilson wrote:
>
> Hi Andrei,
>
> I would like to see Ignite support the graceful shutdown you get with
> deactivation, but without the need for manual reactivation.
>
> We run a pretty agile process and it is not uncommon to have multiple
> deploys to production in a week. This is a largely automated,
> push-button affair and it works well, except for the WAL rescan on
> startup.
>
> Today there are two approaches we can take for a deployment:
>
> 1. Stop the nodes (which is what we currently do), leaving the WAL and
> persistent store inconsistent. This requires a rescan of the WAL before
> the grid is auto re-activated on startup.
> The time to do this is increasing with the size of the persistent store
> - it does not appear to be related to the size of the WAL.
>
> 2. Deactivate the grid, which leaves the WAL and persistent store in a
> consistent state. This requires manual re-activation on restart, but
> does not incur the increasing WAL restart cost.
>
> Is an option like the one below possible?
>
> 3. Suspend the grid, which performs the same steps deactivation does to
> make the WAL and persistent store consistent, but which leaves the grid
> activated so manual activation is not required on restart.
>
> Thanks,
> Raymond.
>
> On Thu, Jan 21, 2021 at 4:02 AM andrei <[email protected]> wrote:
>
>> Hi,
>>
>> Yes, that is to be expected. The main auto-activation scenario is
>> cluster restart. If you use manual deactivation, you should also
>> manually activate your cluster.
>>
>> BR,
>> Andrei
>>
>> On 1/20/2021 5:50 AM, Raymond Wilson wrote:
>>
>> We have been experimenting with using deactivation to shut down the
>> grid to reduce the time for the grid to start up again.
>>
>> It appears there is a downside to this: once deactivated, the grid does
>> not auto-activate once baseline topology is achieved, which means we
>> will need to run through our bootstrapping protocol of ensuring the
>> grid has restarted correctly before activating it again.
>>
>> The baseline topology documentation at
>> https://ignite.apache.org/docs/latest/clustering/baseline-topology does
>> not cover this condition.
>>
>> Is this expected?
>>
>> Thanks,
>> Raymond.
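Andrei's suggestion above - a small helper that activates the cluster once all baseline nodes are back online - could look roughly like the sketch below. This is illustrative only: the class name is hypothetical, it assumes ignite-core on the classpath, and it matches online server nodes against the baseline by consistent ID (note `currentBaselineTopology()` can return null before a baseline is established):

```java
import java.util.Collection;
import java.util.Set;
import java.util.stream.Collectors;

import org.apache.ignite.Ignite;
import org.apache.ignite.cluster.BaselineNode;
import org.apache.ignite.cluster.ClusterNode;

// Sketch: poll until every baseline node is online, then activate.
// Intended to run on a server node after Ignition.start().
public class AutoActivator {
    public static void activateWhenBaselineOnline(Ignite ignite) throws InterruptedException {
        while (!ignite.cluster().active()) {
            Collection<BaselineNode> blt = ignite.cluster().currentBaselineTopology();

            if (blt != null) {
                Set<Object> baseline = blt.stream()
                    .map(BaselineNode::consistentId)
                    .collect(Collectors.toSet());

                Set<Object> online = ignite.cluster().forServers().nodes().stream()
                    .map(ClusterNode::consistentId)
                    .collect(Collectors.toSet());

                if (!baseline.isEmpty() && online.containsAll(baseline)) {
                    ignite.cluster().active(true); // ClusterState.ACTIVE on 2.9+
                    break;
                }
            }

            Thread.sleep(1_000); // poll until the baseline is complete
        }
    }
}
```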
>>
>> On Wed, Jan 13, 2021 at 11:49 PM Pavel Tupitsyn <[email protected]> wrote:
>>
>>> Raymond,
>>>
>>> Please use ICluster.SetActive [1] instead; the API linked above is
>>> obsolete.
>>>
>>> [1]
>>> https://ignite.apache.org/releases/latest/dotnetdoc/api/Apache.Ignite.Core.Cluster.ICluster.html?#Apache_Ignite_Core_Cluster_ICluster_SetActive_System_Boolean_
>>>
>>> On Wed, Jan 13, 2021 at 11:54 AM Raymond Wilson <[email protected]> wrote:
>>>
>>>> Of course. Obvious! :)
>>>>
>>>> Sent from my iPhone
>>>>
>>>> On 13/01/2021, at 9:15 PM, Zhenya Stanilovsky <[email protected]> wrote:
>>>>
>>>> Is there an API version of the cluster deactivation?
>>>>
>>>> https://github.com/apache/ignite/blob/master/modules/platforms/dotnet/Apache.Ignite.Core.Tests/Cache/PersistentStoreTestObsolete.cs#L131
>>>>
>>>> On Wed, Jan 13, 2021 at 8:28 PM Zhenya Stanilovsky <[email protected]> wrote:
>>>>
>>>> Hi Zhenya,
>>>>
>>>> Thanks for confirming that performing checkpoints more often will help
>>>> here.
>>>>
>>>> Hi Raymond!
>>>>
>>>> I have established this configuration so will experiment with the
>>>> settings a little.
>>>>
>>>> On a related note, is there any way to automatically trigger a
>>>> checkpoint, for instance as a pre-shutdown activity?
>>>>
>>>> If you shut down your cluster gracefully (i.e. with deactivation [1]),
>>>> a further start will not trigger WAL reading.
>>>>
>>>> [1]
>>>> https://www.gridgain.com/docs/latest/administrators-guide/control-script#deactivating-cluster
>>>>
>>>> Checkpoints seem to be much faster than the process of applying WAL
>>>> updates.
>>>>
>>>> Raymond.
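For reference, the checkpoint and WAL settings discussed in this thread map onto DataStorageConfiguration roughly as follows (a Java sketch, not our exact configuration; the values are the 512 MB segments, 10 working segments and 1-minute checkpoint interval mentioned in this thread):

```java
import org.apache.ignite.configuration.DataStorageConfiguration;
import org.apache.ignite.configuration.IgniteConfiguration;

// Sketch of the storage settings under discussion.
DataStorageConfiguration storage = new DataStorageConfiguration();
storage.setCheckpointFrequency(60_000);        // 1 minute, down from the 3-minute default
storage.setWalSegmentSize(512 * 1024 * 1024);  // 512 MB WAL segments
storage.setWalSegments(10);                    // 10 working segments
storage.getDefaultDataRegionConfiguration().setPersistenceEnabled(true);

IgniteConfiguration cfg = new IgniteConfiguration();
cfg.setDataStorageConfiguration(storage);
```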
>>>>
>>>> On Wed, Jan 13, 2021 at 8:07 PM Zhenya Stanilovsky <[email protected]> wrote:
>>>>
>>>> We have noticed that startup time for our server nodes has been slowly
>>>> increasing as the amount of data stored in the persistent store grows.
>>>>
>>>> This appears to be closely related to recovery of WAL changes that
>>>> were not checkpointed at the time the node was stopped.
>>>>
>>>> After enabling debug logging we see that the WAL file is scanned and,
>>>> for every cache, all partitions in the cache are examined; if there
>>>> are any uncommitted changes in the WAL file then the partition is
>>>> updated (I assume this requires reading the partition itself as part
>>>> of this process).
>>>>
>>>> We now have ~150 GB of data in our persistent store and we see WAL
>>>> update times of 5-10 minutes, during which the node is unavailable.
>>>>
>>>> We use fairly large WAL files (512 MB) and 10 segments, with WAL
>>>> archiving enabled.
>>>>
>>>> We anticipate the data in persistent storage growing to terabytes, and
>>>> if the startup time continues to grow with storage size then deploys
>>>> and restarts become difficult.
>>>>
>>>> Until now we have been using the default checkpoint timeout of 3
>>>> minutes, which may mean we have significant un-checkpointed data in
>>>> the WAL files. We are moving to a 1-minute checkpoint but don't yet
>>>> know if this improves startup times. We also use the default 1024
>>>> partitions per cache, though some partitions may be large.
>>>>
>>>> Can anyone confirm this is expected behaviour, and recommend ways of
>>>> resolving it?
>>>>
>>>> Will reducing checkpointing intervals help?
>>>>
>>>> Yes, it will help.
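As a rough sanity check on why a shorter checkpoint interval should bound replay work: the worst-case amount of un-checkpointed WAL data is roughly the ingest rate times the checkpoint interval (a back-of-the-envelope sketch; the 20 MB/s rate is an assumed illustrative figure, not a measured one):

```java
// Back-of-the-envelope bound on un-checkpointed WAL data:
// everything written since the last completed checkpoint may need replay.
public class WalReplayEstimate {
    /** Worst-case bytes to replay: ingest rate x checkpoint interval. */
    static long worstCaseReplayBytes(long ingestBytesPerSec, long checkpointIntervalSec) {
        return ingestBytesPerSec * checkpointIntervalSec;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long rate = 20 * mb; // assumed illustrative ingest rate of 20 MB/s

        // Default 3-minute checkpoint vs the 1-minute setting discussed here.
        System.out.println("3 min: " + worstCaseReplayBytes(rate, 180) / mb + " MB"); // 3600 MB
        System.out.println("1 min: " + worstCaseReplayBytes(rate, 60) / mb + " MB");  // 1200 MB
    }
}
```

If replay time instead tracks total partition size (as observed above), that suggests the cost is dominated by scanning partition state rather than by the volume of un-checkpointed changes.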
>>>> Check
>>>> https://cwiki.apache.org/confluence/display/IGNITE/Ignite+Persistent+Store+-+under+the+hood
>>>>
>>>> Is the entire content of a partition read while applying WAL changes?
>>>>
>>>> I don't think so - maybe someone else can comment here?
>>>>
>>>> Does anyone else have this issue?
>>>>
>>>> Thanks,
>>>> Raymond.

--
<http://www.trimble.com/>
Raymond Wilson
Solution Architect, Civil Construction Software Systems (CCSS)
11 Birmingham Drive | Christchurch, New Zealand
[email protected]
