I don't think so; checkpointing already worked perfectly well before this fix.
I need additional information to start digging into your problem. Can you share
your Ignite logs somewhere?
>I noticed an entry in the Ignite 2.9.1 changelog:
>* Improved checkpoint concurrent behaviour
>I am having trouble finding the relevant Jira ticket for this in the 2.9.1
>Jira area at
>https://issues.apache.org/jira/browse/IGNITE-13876?jql=project%20%3D%20IGNITE%20AND%20fixVersion%20%3D%202.9.1%20and%20status%20%3D%20Resolved
>
>Perhaps this change may improve the checkpointing issue we are seeing?
>
>Raymond.
>
>On Tue, Dec 29, 2020 at 8:35 PM Raymond Wilson < [email protected] >
>wrote:
>>Hi Zhenya,
>>
>>1. We currently use AWS EFS for primary storage, with provisioned IOPS to
>>provide sufficient IO. Our Ignite cluster currently tops out at ~10% usage
>>(with at least 5 nodes writing to it, including WAL and WAL archive), so we
>>are not saturating the EFS interface. We use the default page size
>>(experiments with larger page sizes showed instability when checkpointing due
>>to free page starvation, so we reverted to the default size).
>>
>>2. Thanks for the detail, we will look for that in thread dumps when we can
>>create them.
>>
>>3. We are using the default CP buffer size, which is max(256 MB,
>>DataRegionSize / 4) according to the Ignite documentation, so we should
>>have more than enough checkpoint buffer space to cope with writes. As
>>additional information, the cache which is displaying very slow writes is in
>>a data region with relatively low write traffic. There is a primary
>>(default) data region with large write traffic, and the vast majority of
>>pages being written in a checkpoint will be for that default data region.
>>
>>4. Yes, this is very surprising. Anecdotally from our logs it appears write
>>traffic into the low write traffic cache is blocked during checkpoints.
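As a rough check of point 3, here is a minimal Python sketch of the default checkpoint buffer rule exactly as quoted in this thread, max(256 MB, DataRegionSize / 4), applied to the 4 GB and 128 MB regions described in the original message further down. It also compares the total against the 10% guideline from the earlier reply. The function and region names are hypothetical, the rule is taken from this thread rather than verified against the Ignite source, and treating the 10% guideline as a total across all regions is an assumption.

```python
MB = 1024 * 1024
GB = 1024 * MB


def default_cp_buffer_bytes(region_bytes):
    """Default checkpoint buffer size, using the rule as quoted in this thread."""
    return max(256 * MB, region_bytes // 4)


# Region sizes from the original message; the names are made up for illustration.
regions = {"default": 4 * GB, "small": 128 * MB}
buffers = {name: default_cp_buffer_bytes(size) for name, size in regions.items()}

total_region = sum(regions.values())
total_buffer = sum(buffers.values())

# Guideline from the thread: buffer should exceed 10% of persistent region size.
meets_guideline = total_buffer > 0.10 * total_region

for name in regions:
    print(f"{name}: region={regions[name] // MB} MB, cp buffer={buffers[name] // MB} MB")
print(f"total cp buffer={total_buffer // MB} MB, "
      f"10% of regions={0.10 * total_region / MB:.1f} MB, ok={meets_guideline}")
```

Under these assumptions the buffers come out at 1 GB and 256 MB, the sizes mentioned in the original message, and their total is well above 10% of the combined region size.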
>>
>>Thanks,
>>Raymond.
>>
>>
>>On Tue, Dec 29, 2020 at 7:31 PM Zhenya Stanilovsky < [email protected] >
>>wrote:
>>>* In addition to Ilya's reply, you can check the vendor's page for more
>>>information; everything on that page applies to Ignite too [1]. Increasing
>>>the number of threads leads to concurrent IO usage, so if you have something
>>>like NVMe it's up to you, but with SAS it would possibly be better to reduce
>>>this parameter.
>>>* The log will show you something like:
>>>Parking thread=%Thread name% for timeout(ms)= %time%
>>>and the corresponding:
>>>Unparking thread=
>>>* No additional logging of CP buffer usage is provided. The CP buffer needs
>>>to be more than 10% of the overall persistent DataRegions size.
>>>* 90 seconds or longer: this looks like a problem with IO or system tuning.
>>>That is a very bad result, I think.
>>>[1]
>>>https://www.gridgain.com/docs/latest/perf-troubleshooting-guide/persistence-tuning
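To look for the throttling messages mentioned above, a small Python sketch like the following could scan log lines for "Parking thread=" entries and summarise them per thread. The pattern is modelled on the placeholder line quoted in this message; the exact wording and spacing in real Ignite logs may differ, so the regular expression, the function name and the sample lines are all assumptions.

```python
import re
from collections import defaultdict

# Modelled on "Parking thread=%Thread name% for timeout(ms)= %time%".
# Assumption: the timeout is a plain number, optionally preceded by spaces.
PARK_RE = re.compile(
    r"Parking thread=(?P<thread>.+?) for timeout\(ms\)=\s*(?P<ms>\d+(?:\.\d+)?)"
)


def summarise_parking(lines):
    """Return {thread name: (park count, total parked ms)} for matching lines."""
    counts = defaultdict(int)
    totals = defaultdict(float)
    for line in lines:
        match = PARK_RE.search(line)
        if match:
            counts[match.group("thread")] += 1
            totals[match.group("thread")] += float(match.group("ms"))
    return {thread: (counts[thread], totals[thread]) for thread in counts}


# Made-up sample lines, only to exercise the pattern.
sample = [
    "[12:00:01] Parking thread=sys-stripe-3 for timeout(ms)= 16",
    "[12:00:01] Unparking thread=sys-stripe-3",
    "[12:00:02] Parking thread=sys-stripe-3 for timeout(ms)= 32",
    "[12:00:02] Parking thread=pub-7 for timeout(ms)=8",
]
summary = summarise_parking(sample)
print(summary)
```

For a real log, an open file object can be passed in place of the sample list, since the function only iterates over lines. An empty result would suggest that no write throttling of this kind was logged.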
>>>
>>>
>>>
>>>>Hi,
>>>>
>>>>We have been investigating some issues which appear to be related to
>>>>checkpointing. We currently use Apache Ignite 2.8.1 with the C# client.
>>>>
>>>>I have been trying to gain clarity on how certain aspects of the Ignite
>>>>configuration relate to the checkpointing process:
>>>>
>>>>1. Number of checkpointing threads. This defaults to 4, but I don't
>>>>understand how it applies to the checkpointing process. Are more threads
>>>>generally better (e.g. because they make the disk IO parallel across the
>>>>threads), or does it only have a positive effect if you have many data
>>>>storage regions? Or something else? If this could be clarified in the
>>>>documentation (or a pointer given to it, which Google has not yet found),
>>>>that would be good.
>>>>
>>>>2. Checkpoint frequency. This defaults to 180 seconds. I was thinking
>>>>that reducing this time would result in smaller, less disruptive
>>>>checkpoints. Setting it to 60 seconds seems pretty safe, but is there a
>>>>practical lower limit that should be used for use cases with new data
>>>>constantly being added, e.g. 5 seconds, 10 seconds?
>>>>
>>>>3. Write exclusivity constraints during checkpointing. I understand that
>>>>while a checkpoint is occurring, ongoing writes will be supported into the
>>>>caches being checkpointed, and if those are writes to existing pages then
>>>>those will be duplicated into the checkpoint buffer. If this buffer becomes
>>>>full or stressed then Ignite will throttle, and perhaps block, writes until
>>>>the checkpoint is complete. If this is the case then Ignite will emit
>>>>logging (warning or informational?) that writes are being throttled.
>>>>
>>>>We have cases where simple puts to caches (a few requests per second) are
>>>>taking up to 90 seconds to execute when there is an active checkpoint
>>>>occurring, where the checkpoint has been triggered by the checkpoint
>>>>timer. When a checkpoint is not occurring, the time to do this is usually in
>>>>the milliseconds. The checkpoints themselves can take 90 seconds or longer,
>>>>and are updating up to 30,000-40,000 pages, across a pair of data storage
>>>>regions, one with 4 GB of in-memory space allocated (which should be roughly
>>>>1,000,000 pages at the standard 4 KB page size), and one small region with
>>>>128 MB. There is no 'throttling' logging being emitted that we can tell, so
>>>>the checkpoint buffer (which should be 1 GB for the first data region and
>>>>256 MB for the second smaller region in this case) does not look like it can
>>>>fill up during the checkpoint.
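The figures in the paragraph above can be sanity-checked with a little arithmetic. The Python sketch below assumes binary units (4 GB as 4 x 1024^3 bytes, 4 KB pages as 4096 bytes) and a checkpoint lasting exactly 90 seconds; the variable names are illustrative only.

```python
KB = 1024
MB = 1024 ** 2
GB = 1024 ** 3

page_size = 4 * KB
pages_large = (4 * GB) // page_size    # pages in the 4 GB region
pages_small = (128 * MB) // page_size  # pages in the 128 MB region

checkpoint_seconds = 90  # lower bound of the duration quoted in the message

# For each end of the quoted 30,000-40,000 page range, work out the volume
# of page data and the implied average write rate over the checkpoint.
results = []
for dirty_pages in (30_000, 40_000):
    mib = dirty_pages * page_size / MB
    results.append((dirty_pages, mib, mib / checkpoint_seconds))

print(f"large region: {pages_large} pages, small region: {pages_small} pages")
for dirty_pages, mib, rate in results:
    print(f"{dirty_pages} pages = {mib:.2f} MiB, about {rate:.2f} MiB/s over 90 s")
```

On these assumptions, a checkpoint of 40,000 pages is only about 156 MiB of page data, so a 90-second duration corresponds to less than 2 MiB/s of page writes on average. This counts page data alone and ignores any other work a checkpoint does.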
>>>>
>>>>It seems like the checkpoint is affecting the put operations, but I don't
>>>>understand why that may be given the documented checkpointing process, and
>>>>the checkpoint itself (at least via Informational logging) is not
>>>>advertising any restrictions.
>>>>
>>>>Thanks,
>>>>Raymond.
>>>> --
>>>>
>>>>Raymond Wilson
>>>>Solution Architect, Civil Construction Software Systems (CCSS)
>>>>
>>>
>>>
>>>
>>>
>>
>> --
>>
>>Raymond Wilson
>>Solution Architect, Civil Construction Software Systems (CCSS)
>>11 Birmingham Drive | Christchurch, New Zealand
>>+64-21-2013317 Mobile
>>[email protected]
>>
>>