Hi,

I don't think there are any options at the moment other than the ones you mentioned.

However, you can also create your own application that checks the topology and activates the cluster when all nodes from the baseline are online; for example, additional Java code that runs when a server node starts.
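As a rough illustration, the decision logic of such a helper could look like the sketch below. This is a minimal sketch, not Ignite-specific code: the Ignite calls named in the comments (e.g. ignite.cluster().currentBaselineTopology(), ignite.cluster().active(true)) are assumptions to verify against the Ignite version in use.

```java
import java.util.Set;

// Sketch of the activation decision such a startup helper could make.
public class BaselineActivationCheck {

    // Activate only when every baseline node id is present among the
    // currently online server node ids.
    public static boolean shouldActivate(Set<String> baselineIds, Set<String> onlineIds) {
        return !baselineIds.isEmpty() && onlineIds.containsAll(baselineIds);
    }

    public static void main(String[] args) {
        // In a real node-startup hook these sets would come from Ignite,
        // e.g. ignite.cluster().currentBaselineTopology() and
        // ignite.cluster().forServers().nodes(); when the check passes you
        // would call something like ignite.cluster().active(true).
        Set<String> baseline = Set.of("node-1", "node-2");
        Set<String> online = Set.of("node-1", "node-2", "node-3");
        System.out.println(shouldActivate(baseline, online)); // prints "true"
    }
}
```

The actual wiring (a discovery-event listener or a startup hook) depends on how your application embeds the server node.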

If you require changes to the current Ignite implementation, you can start a thread on the Ignite developer list:

http://apache-ignite-developers.2346864.n4.nabble.com/

BR,
Andrei


On 1/20/2021 9:16 PM, Raymond Wilson wrote:
Hi Andre,

I would like to see Ignite support a graceful shutdown like the one deactivation provides, but without the need for manual reactivation.

We run a pretty agile process and it is not uncommon to have multiple deploys to production throughout a week. This is a pretty automated affair (essentially push-button) and it works well, except for the WAL rescan on startup.

Today there are two approaches we can take for a deployment:

1. Stop the nodes (which is what we currently do), leaving the WAL and persistent store inconsistent. This requires a rescan of the WAL before the grid is auto-reactivated on startup. The time this takes increases with the size of the persistent store; it does not appear to be related to the size of the WAL.

2. Deactivate the grid, which leaves the WAL and persistent store in a consistent state. This requires manual re-activation on restart, but does not incur the increasing WAL restart cost.

Is an option like the one below possible?

3. Suspend the grid, which performs the same steps deactivation does to make the WAL and persistent store consistent, but which leaves the grid activated so the manual activation process is not required on restart.

Thanks,
Raymond.

On Thu, Jan 21, 2021 at 4:02 AM andrei <[email protected]> wrote:

    Hi,

    Yes, that was to be expected. The main auto-activation scenario
    is cluster restart. If you use manual deactivation, you must
    also manually activate your cluster.

    BR,
    Andrei

    On 1/20/2021 5:50 AM, Raymond Wilson wrote:
    We have been experimenting with using deactivation to shut down
    the grid to reduce the time for the grid to start up again.

    It appears there is a downside to this: once deactivated, the grid
    does not appear to auto-activate when the baseline topology is
    achieved, which means we will need to run through the
    bootstrapping protocol of ensuring the grid has restarted
    correctly before activating it once again.

    The baseline topology documentation at
    https://ignite.apache.org/docs/latest/clustering/baseline-topology
    does not cover this condition.

    Is this expected?

    Thanks,
    Raymond.


    On Wed, Jan 13, 2021 at 11:49 PM Pavel Tupitsyn
    <[email protected]> wrote:

        Raymond,

        Please use ICluster.SetActive [1] instead; the API linked
        above is obsolete.


        [1]
https://ignite.apache.org/releases/latest/dotnetdoc/api/Apache.Ignite.Core.Cluster.ICluster.html?#Apache_Ignite_Core_Cluster_ICluster_SetActive_System_Boolean_

        On Wed, Jan 13, 2021 at 11:54 AM Raymond Wilson
        <[email protected]> wrote:

            Of course. Obvious! :)

            Sent from my iPhone

            On 13/01/2021, at 9:15 PM, Zhenya Stanilovsky
            <[email protected]> wrote:

            



                Is there an API version of the cluster deactivation?

            
https://github.com/apache/ignite/blob/master/modules/platforms/dotnet/Apache.Ignite.Core.Tests/Cache/PersistentStoreTestObsolete.cs#L131

                On Wed, Jan 13, 2021 at 8:28 PM Zhenya Stanilovsky
                <[email protected]> wrote:



                        Hi Zhenya,
                        Thanks for confirming that checkpointing
                        more often will help here.

                    Hi Raymond!

                        I have established this configuration, so I
                        will experiment with the settings a little.
                        On a related note, is there any way to
                        automatically trigger a checkpoint, for
                        instance as a pre-shutdown activity?

                    If you shut down your cluster gracefully, i.e.
                    with deactivation [1], the subsequent start will
                    not trigger WAL reads.
                    [1]
https://www.gridgain.com/docs/latest/administrators-guide/control-script#deactivating-cluster
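For reference, the graceful sequence with the control script looks roughly like this. The flags shown are from the Ignite 2.x control script and may differ by version (newer versions use --set-state); check them against the version in use:

```
# make the WAL and persistent store consistent before stopping nodes
control.sh --deactivate --yes

# ... stop nodes, deploy, restart nodes ...

# re-activate once the baseline nodes are back online
control.sh --activate
```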

                        Checkpoints seem to be much faster than the
                        process of applying WAL updates.
                        Raymond.
                        On Wed, Jan 13, 2021 at 8:07 PM Zhenya
                        Stanilovsky <[email protected]> wrote:




                                We have noticed that startup time
                                for our server nodes has been slowly
                                increasing in time as the amount of
                                data stored in the persistent store
                                grows.
                                This appears to be closely related
                                to recovery of WAL changes that were
                                not checkpointed at the time the
                                node was stopped.
                                After enabling debug logging we see
                                that the WAL file is scanned, and
                                for every cache, all partitions in
                                the cache are examined, and if there
                                are any uncommitted changes in the
                                WAL file then the partition is
                                updated (I assume this requires
                                reading of the partition itself as a
                                part of this process).
                                We now have ~150 GB of data in our
                                persistent store, and WAL updates
                                take between 5 and 10 minutes to
                                complete, during which the node is
                                unavailable.
                                We use fairly large WAL files
                                (512 MB each) with 10 segments, and
                                WAL archiving enabled.
                                We anticipate data in persistent
                                storage to grow to Terabytes, and if
                                the startup time continues to grow
                                as storage grows then this makes
                                deploys and restarts difficult.
                                Until now we have been using the
                                default checkpoint frequency of 3
                                minutes, which may mean we have
                                significant uncheckpointed data in
                                the WAL files. We are moving to a
                                1-minute checkpoint but don't yet
                                know if this improves startup times.
                                We also use the default 1024
                                partitions per cache, though some
                                partitions may be large.
                                Can anyone confirm this is expected
                                behaviour, and recommend how to
                                resolve it?
                                Will reducing the checkpointing
                                interval help?

                            Yes, it will help. Check
https://cwiki.apache.org/confluence/display/IGNITE/Ignite+Persistent+Store+-+under+the+hood
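For illustration, the values discussed in this thread would look something like the fragment below in the node's Spring XML configuration. The property names (checkpointFrequency, walSegmentSize, walSegments) are from the Ignite 2.x DataStorageConfiguration and should be verified against the version in use:

```xml
<bean class="org.apache.ignite.configuration.IgniteConfiguration">
    <property name="dataStorageConfiguration">
        <bean class="org.apache.ignite.configuration.DataStorageConfiguration">
            <!-- Checkpoint every 60 s instead of the 180 s default, so less
                 un-checkpointed WAL data must be replayed on startup -->
            <property name="checkpointFrequency" value="60000"/>
            <!-- 512 MB WAL segments, 10 segments, as described above -->
            <property name="walSegmentSize" value="536870912"/>
            <property name="walSegments" value="10"/>
        </bean>
    </property>
</bean>
```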

                                Is the entire content of a partition
                                read while applying WAL changes?

                            I don't think so; maybe someone else can
                            suggest here?

                                Does anyone else have this issue?
                                Thanks,
                                Raymond.
--
                                Raymond Wilson
                                Solution Architect, Civil
                                Construction Software Systems (CCSS)
                                11 Birmingham Drive | Christchurch,
                                New Zealand
                                [email protected]








