We have noticed that startup time for our server nodes has been slowly
increasing over time as the amount of data stored in the persistent store
grows.

This appears to be closely related to recovery of WAL changes that were not
checkpointed at the time the node was stopped.

After enabling debug logging we see that the WAL file is scanned, and for
every cache, all partitions in the cache are examined; if there are any
uncommitted changes in the WAL file then the partition is updated (I assume
this requires reading the partition itself as part of the process).

We now have ~150 GB of data in our persistent store, and WAL recovery takes
between 5 and 10 minutes to complete, during which the node is
unavailable.

We use fairly large WAL segments (512 MB), with 10 segments and WAL
archiving enabled.
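For concreteness, here is roughly how those WAL settings map onto Ignite's DataStorageConfiguration (a sketch, not our full node configuration; the archive path shown is a hypothetical example, and archiving is on by default anyway):

```java
import org.apache.ignite.configuration.DataStorageConfiguration;
import org.apache.ignite.configuration.IgniteConfiguration;

public class WalConfigSketch {
    public static IgniteConfiguration configure() {
        DataStorageConfiguration storageCfg = new DataStorageConfiguration();

        storageCfg.setWalSegmentSize(512 * 1024 * 1024); // 512 MB per WAL segment
        storageCfg.setWalSegments(10);                   // 10 active segments

        // WAL archiving is enabled by default; an explicit archive path
        // (hypothetical location) could be set like this:
        // storageCfg.setWalArchivePath("/data/ignite/wal/archive");

        storageCfg.getDefaultDataRegionConfiguration().setPersistenceEnabled(true);

        IgniteConfiguration cfg = new IgniteConfiguration();
        cfg.setDataStorageConfiguration(storageCfg);
        return cfg;
    }
}
```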

We anticipate the data in persistent storage growing to terabytes, and if
startup time continues to grow with storage size, this will make deploys
and restarts difficult.

Until now we have been using the default checkpoint frequency of 3 minutes,
which may mean we have significant uncheckpointed data in the WAL files. We
are moving to a 1-minute checkpoint frequency but don't yet know whether
this improves startup times. We also use the default 1024 partitions per
cache, though some partitions may be large.
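The checkpoint change we are trying amounts to the following (a sketch; setCheckpointFrequency takes milliseconds, and the default is 180,000 ms, i.e. 3 minutes):

```java
import org.apache.ignite.configuration.DataStorageConfiguration;

public class CheckpointConfigSketch {
    public static DataStorageConfiguration configure() {
        DataStorageConfiguration storageCfg = new DataStorageConfiguration();

        // Checkpoint every 1 minute instead of the 3-minute default,
        // to limit how much uncheckpointed WAL data must be replayed
        // on restart.
        storageCfg.setCheckpointFrequency(60_000L);

        return storageCfg;
    }
}
```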

Can anyone confirm whether this is expected behaviour, and recommend ways
to resolve it?

Will reducing the checkpointing interval help?
Is the entire content of a partition read while applying WAL changes?
Does anyone else have this issue?

Thanks,
Raymond.


-- 
Raymond Wilson
Solution Architect, Civil Construction Software Systems (CCSS)
11 Birmingham Drive | Christchurch, New Zealand
[email protected]

