I noticed an entry in the Ignite 2.9.1 changelog:

- Improved checkpoint concurrent behaviour
I am having trouble finding the relevant Jira ticket for this in the 2.9.1 Jira area at
https://issues.apache.org/jira/browse/IGNITE-13876?jql=project%20%3D%20IGNITE%20AND%20fixVersion%20%3D%202.9.1%20and%20status%20%3D%20Resolved

Perhaps this change may improve the checkpointing issue we are seeing? (I have appended a sketch of the checkpoint configuration discussed below as a P.S.)

Raymond.

On Tue, Dec 29, 2020 at 8:35 PM Raymond Wilson <[email protected]> wrote:

> Hi Zhenya,
>
> 1. We currently use AWS EFS for primary storage, with provisioned IOPS to
> provide sufficient IO. Our Ignite cluster currently tops out at ~10% usage
> (with at least 5 nodes writing to it, including WAL and WAL archive), so we
> are not saturating the EFS interface. We use the default page size
> (experiments with larger page sizes showed instability when checkpointing
> due to free page starvation, so we reverted to the default size).
>
> 2. Thanks for the detail; we will look for that in thread dumps when we
> can create them.
>
> 3. We are using the default checkpoint buffer size, which is max(256 MB,
> DataRegionSize / 4) according to the Ignite documentation, so this should
> give more than enough checkpoint buffer space to cope with writes. As
> additional information, the cache which is displaying very slow writes is
> in a data region with relatively low write traffic. There is a primary
> (default) data region with heavy write traffic, and the vast majority of
> pages written in a checkpoint will be for that default data region.
>
> 4. Yes, this is very surprising. Anecdotally, from our logs it appears
> that write traffic into the low-write-traffic cache is blocked during
> checkpoints.
>
> Thanks,
> Raymond.
>
> On Tue, Dec 29, 2020 at 7:31 PM Zhenya Stanilovsky <[email protected]>
> wrote:
>
>> 1. In addition to Ilya's reply, you can check the vendor's page for
>> more information; everything on that page is applicable to Ignite too [1].
>> Increasing the thread count leads to concurrent IO usage, so with
>> something like NVMe it is up to you, but with SAS it would possibly be
>> better to reduce this parameter.
>>
>> 2. The log will show you something like:
>>
>> Parking thread=%Thread name% for timeout(ms)=%time%
>>
>> and the corresponding:
>>
>> Unparking thread=
>>
>> 3. No additional logging of checkpoint buffer usage is provided. The
>> checkpoint buffer needs to be more than 10% of the overall persistent
>> data region size.
>>
>> 4. 90 seconds or longer: that looks like a problem with IO or system
>> tuning; it is a very bad score, I am afraid.
>>
>> [1]
>> https://www.gridgain.com/docs/latest/perf-troubleshooting-guide/persistence-tuning
>>
>> Hi,
>>
>> We have been investigating some issues which appear to be related to
>> checkpointing. We currently use Ignite 2.8.1 with the C# client.
>>
>> I have been trying to gain clarity on how certain aspects of the Ignite
>> configuration relate to the checkpointing process:
>>
>> 1. Number of checkpointing threads. This defaults to 4, but I don't
>> understand how it applies to the checkpointing process. Are more threads
>> generally better (e.g. because the disk IO is made parallel across the
>> threads), or does it only have a positive effect if you have many data
>> storage regions? Or something else? If this could be clarified in the
>> documentation (or a pointer to it which Google has not yet found), that
>> would be good.
>>
>> 2. Checkpoint frequency. This defaults to 180 seconds. I was thinking
>> that reducing this time would result in smaller, less disruptive
>> checkpoints.
>> Setting it to 60 seconds seems pretty safe, but is there a practical
>> lower limit that should be used for use cases with new data constantly
>> being added, e.g. 5 or 10 seconds?
>>
>> 3. Write exclusivity constraints during checkpointing. I understand that
>> while a checkpoint is occurring, ongoing writes will be supported into
>> the caches being checkpointed, and if those are writes to existing pages
>> then those pages will be duplicated into the checkpoint buffer. If this
>> buffer becomes full or stressed then Ignite will throttle, and perhaps
>> block, writes until the checkpoint is complete. If this is the case then
>> Ignite will emit logging (warning or informational?) that writes are
>> being throttled.
>>
>> We have cases where simple puts to caches (a few requests per second)
>> are taking up to 90 seconds to execute when there is an active
>> checkpoint occurring, where the checkpoint has been triggered by the
>> checkpoint timer. When a checkpoint is not occurring, the time to do
>> this is usually in the milliseconds. The checkpoints themselves can take
>> 90 seconds or longer, and are updating up to 30,000-40,000 pages across
>> a pair of data storage regions: one with 4 GB of in-memory space
>> allocated (which should be roughly 1,000,000 pages at the standard 4 KB
>> page size), and one small region with 128 MB. There is no 'throttling'
>> logging being emitted that we can tell, so the checkpoint buffer (which
>> should be 1 GB for the first data region and 256 MB for the second,
>> smaller region in this case) does not look like it can be filling up
>> during the checkpoint.
>>
>> It seems like the checkpoint is affecting the put operations, but I
>> don't understand why that may be, given the documented checkpointing
>> process, and the checkpoint itself (at least via informational logging)
>> is not advertising any restrictions.
>>
>> Thanks,
>> Raymond.
>>
>> --
>> Raymond Wilson
>> Solution Architect, Civil Construction Software Systems (CCSS)
>
> --
> Raymond Wilson
> Solution Architect, Civil Construction Software Systems (CCSS)
> 11 Birmingham Drive | Christchurch, New Zealand
> +64-21-2013317 Mobile
> [email protected]

--
Raymond Wilson
Solution Architect, Civil Construction Software Systems (CCSS)
11 Birmingham Drive | Christchurch, New Zealand
+64-21-2013317 Mobile
[email protected]
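P.S. For reference, here is a minimal sketch of how the checkpoint-related settings discussed in this thread map onto the configuration API. It uses the Java API (the C# DataStorageConfiguration exposes equivalent properties); the region names are made up for illustration, and the sizes are simply the ones mentioned above, not recommendations.

import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.DataRegionConfiguration;
import org.apache.ignite.configuration.DataStorageConfiguration;
import org.apache.ignite.configuration.IgniteConfiguration;

public class CheckpointConfigSketch {
    public static void main(String[] args) {
        DataStorageConfiguration storageCfg = new DataStorageConfiguration();

        // Question 1: the checkpoint thread pool size (defaults to 4).
        storageCfg.setCheckpointThreads(4);

        // Question 2: checkpoint frequency in milliseconds (defaults to
        // 180,000, i.e. 180 seconds); 60 seconds as floated above.
        storageCfg.setCheckpointFrequency(60_000);

        // Default 4 KB page size (larger pages caused free page starvation for us).
        storageCfg.setPageSize(DataStorageConfiguration.DFLT_PAGE_SIZE);

        // Question 3: throttle page modifications gradually rather than
        // parking writer threads outright when the checkpoint buffer fills.
        storageCfg.setWriteThrottlingEnabled(true);

        // The large default region: 4 GB with an explicit 1 GB checkpoint
        // buffer (per Zhenya's point 3, keep the buffer above ~10% of the
        // overall persistent region size). "Default_Region" is a made-up name.
        DataRegionConfiguration defaultRegion = new DataRegionConfiguration()
            .setName("Default_Region")
            .setPersistenceEnabled(true)
            .setMaxSize(4L * 1024 * 1024 * 1024)
            .setCheckpointPageBufferSize(1024L * 1024 * 1024);

        storageCfg.setDefaultDataRegionConfiguration(defaultRegion);

        // The small, low write traffic region holding the slow cache.
        DataRegionConfiguration smallRegion = new DataRegionConfiguration()
            .setName("Small_Region")
            .setPersistenceEnabled(true)
            .setMaxSize(128L * 1024 * 1024)
            .setCheckpointPageBufferSize(256L * 1024 * 1024);

        storageCfg.setDataRegionConfigurations(smallRegion);

        Ignition.start(new IgniteConfiguration()
            .setDataStorageConfiguration(storageCfg));
    }
}

With setWriteThrottlingEnabled(true), Ignite slows page modifications progressively as the buffer fills instead of parking writer threads, which should also surface throttling warnings in the log.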

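P.P.S. A quick back-of-the-envelope check of the numbers quoted above, as a small Java snippet (all inputs are taken directly from this thread):

public class CheckpointMath {
    public static void main(String[] args) {
        long regionSize = 4L * 1024 * 1024 * 1024; // 4 GB default region
        int pageSize = 4096;                       // standard 4 KB page

        // Pages the region can hold: 4 GB / 4 KB = 1,048,576 (~1,000,000, as stated).
        long totalPages = regionSize / pageSize;

        // Upper end of the dirty page count observed per checkpoint.
        long dirtyPages = 40_000;

        // Bytes actually written per checkpoint: 40,000 * 4 KB is about 156 MB.
        long checkpointBytes = dirtyPages * pageSize;

        System.out.printf("totalPages=%d, checkpointMB=%.0f%n",
            totalPages, checkpointBytes / (1024.0 * 1024.0));
    }
}

Roughly 156 MB spread over a 90+ second checkpoint is under 2 MB/s of page writes, which is consistent with the observation that the EFS volume is nowhere near saturated.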