Hi,

We have been investigating some issues that appear to be related to checkpointing. We currently use Apache Ignite 2.8.1 with the C# client.

I have been trying to gain clarity on how certain aspects of the Ignite configuration relate to the checkpointing process:

1. Number of checkpointing threads. This defaults to 4, but I don't understand how it applies to the checkpointing process. Are more threads generally better (e.g. because they parallelise the disk IO across the threads), or do they only have a positive effect if you have many data storage regions? Or something else? If this could be clarified in the documentation (or if there is a pointer to it that Google has not yet found), that would be good.

2. Checkpoint frequency. This defaults to 180 seconds. I was thinking that reducing this time would result in smaller, less disruptive checkpoints. Setting it to 60 seconds seems pretty safe, but is there a practical lower limit for use cases where new data is constantly being added, e.g. 5 seconds or 10 seconds?

3. Write exclusivity constraints during checkpointing. I understand that while a checkpoint is occurring, ongoing writes are still supported into the caches being checkpointed, and that writes to existing pages are duplicated into the checkpoint buffer. If this buffer becomes full or stressed, Ignite will throttle, and perhaps block, writes until the checkpoint is complete. If this is the case, Ignite will emit logging (warning or informational?) stating that writes are being throttled.

   We have cases where simple puts to caches (a few requests per second) take up to 90 seconds to execute while a checkpoint is active, where the checkpoint was triggered by the checkpoint timer. When a checkpoint is not occurring, these puts usually take milliseconds. The checkpoints themselves can take 90 seconds or longer and update up to 30,000-40,000 pages across a pair of data storage regions: one with 4 GB of in-memory space allocated (which should be roughly 1,000,000 pages at the standard 4 KB page size), and one small region with 128 MB.
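
As a rough sanity check on the figures above, the arithmetic can be sketched as follows. This is a minimal sketch, assuming binary units (a 4 GiB region, 4 KiB pages) and taking the upper figures quoted (40,000 pages written in 90 seconds); the variable names are illustrative and are not part of any Ignite API.

```python
# Back-of-envelope figures for the checkpoint described above.
# Assumptions: binary units, 4 KiB pages, upper-bound numbers from the text.
PAGE_SIZE = 4 * 1024             # bytes per page
LARGE_REGION = 4 * 1024**3       # 4 GiB data region
SMALL_REGION = 128 * 1024**2     # 128 MiB data region

# Pages available in each region.
large_pages = LARGE_REGION // PAGE_SIZE
small_pages = SMALL_REGION // PAGE_SIZE

# Upper-bound checkpoint figures quoted in the text.
dirty_pages = 40_000
checkpoint_secs = 90

# Volume written by one checkpoint and the average write rate.
mib_written = dirty_pages * PAGE_SIZE / 1024**2
mib_per_sec = mib_written / checkpoint_secs

# Share of all allocated pages touched by one checkpoint.
dirty_fraction = dirty_pages / (large_pages + small_pages)

print(f"pages in 4 GiB region:   {large_pages:,}")
print(f"pages in 128 MiB region: {small_pages:,}")
print(f"checkpoint volume:       {mib_written:.2f} MiB")
print(f"average write rate:      {mib_per_sec:.2f} MiB/s")
print(f"dirty page fraction:     {dirty_fraction:.1%}")
```

Under these assumptions a checkpoint writes roughly 156 MiB, which is under 4% of the allocated pages, at an average rate of under 2 MiB/s.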

As far as we can tell, no 'throttling' logging is being emitted, so the checkpoint buffer (which should be 1 GB for the first data region and 256 MB for the second, smaller region in this case) does not appear to be filling up during the checkpoint. It seems that the checkpoint is affecting the put operations, but I don't understand why that would be, given the documented checkpointing process, and the checkpoint itself (at least via informational logging) is not advertising any restrictions.

Thanks,
Raymond.

--
<http://www.trimble.com/>
Raymond Wilson
Solution Architect, Civil Construction Software Systems (CCSS)
