We have noticed that the startup time of our server nodes has been slowly increasing over time as the amount of data in the persistent store grows.
This appears to be closely related to the recovery of WAL changes that had not been checkpointed at the time the node was stopped. After enabling debug logging, we see that the WAL file is scanned, and for every cache, all partitions in that cache are examined; if there are any uncommitted changes in the WAL file, the partition is updated (I assume this requires reading the partition itself as part of the process).

We now have ~150 GB of data in our persistent store, and WAL recovery takes 5-10 minutes to complete, during which the node is unavailable. We use fairly large WAL segments (512 MB), 10 segments in total, with WAL archiving enabled.

We anticipate the data in persistent storage growing to terabytes, and if startup time continues to grow with storage size, deploys and restarts will become difficult.

Until now we have been using the default checkpoint frequency of 3 minutes, which may mean we have significant un-checkpointed data in the WAL files. We are moving to a 1-minute checkpoint frequency, but we don't yet know whether this will improve startup times. We also use the default 1024 partitions per cache, though some partitions may be large.

Can anyone confirm whether this is expected behaviour, and recommend ways to resolve it? Specifically:

- Will reducing the checkpointing interval help?
- Is the entire content of a partition read while applying WAL changes?
- Does anyone else see this issue?

Thanks,
Raymond.

--
<http://www.trimble.com/>
Raymond Wilson
Solution Architect, Civil Construction Software Systems (CCSS)
11 Birmingham Drive | Christchurch, New Zealand
[email protected]
<https://worksos.trimble.com/?utm_source=Trimble&utm_medium=emailsign&utm_campaign=Launch>
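For context, here is a sketch of how the settings described above map onto Ignite's `DataStorageConfiguration` API (assuming Ignite 2.x; the archive path and class name are illustrative, not our actual values):

```java
import org.apache.ignite.configuration.DataRegionConfiguration;
import org.apache.ignite.configuration.DataStorageConfiguration;
import org.apache.ignite.configuration.IgniteConfiguration;

public class StorageConfigSketch {
    public static IgniteConfiguration configure() {
        DataStorageConfiguration storageCfg = new DataStorageConfiguration();

        // WAL: 10 segments of 512 MB each, with archiving enabled.
        storageCfg.setWalSegments(10);
        storageCfg.setWalSegmentSize(512 * 1024 * 1024);
        storageCfg.setWalArchivePath("/opt/ignite/wal/archive"); // illustrative path

        // Checkpoint frequency reduced from the 3-minute default to 1 minute,
        // in the hope of shrinking the un-checkpointed WAL data replayed on restart.
        storageCfg.setCheckpointFrequency(60_000L);

        // Persistence enabled on the default data region.
        DataRegionConfiguration regionCfg = new DataRegionConfiguration();
        regionCfg.setPersistenceEnabled(true);
        storageCfg.setDefaultDataRegionConfiguration(regionCfg);

        IgniteConfiguration cfg = new IgniteConfiguration();
        cfg.setDataStorageConfiguration(storageCfg);
        return cfg;
    }
}
```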
