Re[2]: Questions related to check pointing

Zhenya Stanilovsky Tue, 29 Dec 2020 22:48:38 -0800

I don't think so; checkpointing already worked perfectly well before this fix.
I need additional information to start digging into your problem. Can you share 
the Ignite logs somewhere?
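
Before sharing a large log, it may help to pull out the lines that indicate parked writer threads. The sketch below is only a rough illustration: it assumes the messages have the shape quoted further down this thread ("Parking thread=<name> for timeout(ms)= <time>"), which may differ between Ignite versions, and the name `find_parking_events` is hypothetical.

```python
import re

# Pattern based on the message shape quoted in this thread; the exact
# wording in a real Ignite log is an assumption and may vary by version.
PARKING_RE = re.compile(
    r"Parking thread=(?P<thread>.+?) for timeout\(ms\)=\s*(?P<ms>\d+)"
)


def find_parking_events(lines):
    """Return a list of (thread name, timeout in ms) for matching lines."""
    events = []
    for line in lines:
        match = PARKING_RE.search(line)
        if match:
            events.append((match.group("thread"), int(match.group("ms"))))
    return events
```

Any iterable of strings works as input, for example an open file handle on the node's log, so the whole file does not need to be loaded into memory first.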
 
>I noticed an entry in the Ignite 2.9.1 changelog:
>*  Improved checkpoint concurrent behaviour
>I am having trouble finding the relevant Jira ticket for this in the 2.9.1 
>Jira area at  
>https://issues.apache.org/jira/browse/IGNITE-13876?jql=project%20%3D%20IGNITE%20AND%20fixVersion%20%3D%202.9.1%20and%20status%20%3D%20Resolved
> 
>Perhaps this change may improve the checkpointing issue we are seeing?
> 
>Raymond.
>   
>On Tue, Dec 29, 2020 at 8:35 PM Raymond Wilson < [email protected] > 
>wrote:
>>Hi Zhenya,
>> 
>>1. We currently use AWS EFS for primary storage, with provisioned IOPS to 
>>provide sufficient IO. Our Ignite cluster currently tops out at ~10% usage 
>>(with at least 5 nodes writing to it, including WAL and WAL archive), so we 
>>are not saturating the EFS interface. We use the default page size 
>>(experiments with larger page sizes showed instability when checkpointing due 
>>to free page starvation, so we reverted to the default size). 
>> 
>>2. Thanks for the detail, we will look for that in thread dumps when we can 
>>create them.
>> 
>>3. We are using the default CP buffer size, which is max(256Mb, 
>>DataRegionSize / 4) according to the Ignite documentation, so this should 
>>have more than enough checkpoint buffer space to cope with writes. As 
>>additional information, the cache which is displaying very slow writes is in 
>>a data region with relatively low write traffic. There is a primary 
>>(default) data region with large write traffic, and the vast majority of 
>>pages being written in a checkpoint will be for that default data region.
>> 
>>4. Yes, this is very surprising. Anecdotally from our logs it appears write 
>>traffic into the low write traffic cache is blocked during checkpoints.
>> 
>>Thanks,
>>Raymond.
>>    
>>   
>>On Tue, Dec 29, 2020 at 7:31 PM Zhenya Stanilovsky < [email protected] > 
>>wrote:
>>>*  In addition to Ilya's reply, you can check the vendor's page for 
>>>additional info; everything on that page applies to Ignite too [1]. 
>>>Increasing the number of threads leads to concurrent IO usage, so if you 
>>>have something like NVMe it's up to you, but in the case of SAS it would 
>>>possibly be better to reduce this parameter.
>>>*  The log will show you something like:
>>>Parking thread=%Thread name% for timeout(ms)= %time%
>>>and the corresponding:
>>>Unparking thread=
>>>*  No additional logging of CP buffer usage is provided. The CP buffer needs 
>>>to be more than 10% of the overall persistent DataRegions size.
>>>*  90 seconds or longer: this looks like a problem with IO or system tuning; 
>>>it's a very bad score, I think.
>>>[1]  
>>>https://www.gridgain.com/docs/latest/perf-troubleshooting-guide/persistence-tuning
>>>
>>>
>>> 
>>>>Hi,
>>>> 
>>>>We have been investigating some issues which appear to be related to 
>>>>checkpointing. We currently use Apache Ignite 2.8.1 with the C# client.
>>>> 
>>>>I have been trying to gain clarity on how certain aspects of the Ignite 
>>>>configuration relate to the checkpointing process:
>>>> 
>>>>1. Number of check pointing threads. This defaults to 4, but I don't 
>>>>understand how it applies to the checkpointing process. Are more threads 
>>>>generally better (eg: because it makes the disk IO parallel across the 
>>>>threads), or does it only have a positive effect if you have many data 
>>>>storage regions? Or something else? If this could be clarified in the 
>>>>documentation (or a pointer to it which Google has not yet found), that 
>>>>would be good.
>>>> 
>>>>2. Checkpoint frequency. This is defaulted to 180 seconds. I was thinking 
>>>>that reducing this time would result in smaller, less disruptive 
>>>>checkpoints. Setting it to 60 seconds seems pretty safe, but is there a 
>>>>practical lower limit that should be used for use cases with new data 
>>>>constantly being added, eg: 5 seconds, 10 seconds?
>>>> 
>>>>3. Write exclusivity constraints during checkpointing. I understand that 
>>>>while a checkpoint is occurring ongoing writes will be supported into the 
>>>>caches being check pointed, and if those are writes to existing pages then 
>>>>those will be duplicated into the checkpoint buffer. If this buffer becomes 
>>>>full or stressed then Ignite will throttle, and perhaps block, writes until 
>>>>the checkpoint is complete. If this is the case then Ignite will emit 
>>>>logging (warning or informational?) that writes are being throttled.
>>>> 
>>>>We have cases where simple puts to caches (a few requests per second) are 
>>>>taking up to 90 seconds to execute when there is an active check point 
>>>>occurring, where the check point has been triggered by the checkpoint 
>>>>timer. When a checkpoint is not occurring the time to do this is usually in 
>>>>the milliseconds. The checkpoints themselves can take 90 seconds or longer, 
>>>>and are updating up to 30,000-40,000 pages, across a pair of data storage 
>>>>regions, one with 4Gb in-memory space allocated (which should be 1,000,000 
>>>>pages at the standard 4kb page size), and one small region with 128Mb. 
>>>>There is no 'throttling' logging being emitted that we can tell, so the 
>>>>checkpoint buffer (which should be 1Gb for the first data region and 256 Mb 
>>>>for the second smaller region in this case) does not look like it can fill 
>>>>up during the checkpoint.
>>>> 
>>>>It seems like the checkpoint is affecting the put operations, but I don't 
>>>>understand why that may be given the documented checkpointing process, and 
>>>>the checkpoint itself (at least via Informational logging) is not 
>>>>advertising any restrictions.
>>>> 
>>>>Thanks,
>>>>Raymond.
>>>>  --
>>>>
>>>>Raymond Wilson
>>>>Solution Architect, Civil Construction Software Systems (CCSS)
>>>>  
>>> 
>>> 
>>> 
>>>  
>> 
>>  --
>>
>>Raymond Wilson
>>Solution Architect, Civil Construction Software Systems (CCSS)
>>11 Birmingham Drive |  Christchurch, New Zealand
>>+64-21-2013317  Mobile
>>[email protected]
>>         
>> 
> 
>  --
>
>Raymond Wilson
>Solution Architect, Civil Construction Software Systems (CCSS)
>11 Birmingham Drive |  Christchurch, New Zealand
>+64-21-2013317  Mobile
>[email protected]
>         
>
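
The sizing figures quoted in the original question can be reproduced with a little arithmetic. This is a minimal sketch that uses the formula exactly as stated in this thread, max(256 MB, region size / 4); that formula should be checked against the documentation for the Ignite version in use, and the function names are illustrative only.

```python
MIB = 1024 * 1024
GIB = 1024 * MIB


def page_count(region_bytes, page_bytes=4096):
    """Number of pages that fit in a data region of the given size."""
    return region_bytes // page_bytes


def default_cp_buffer(region_bytes):
    """Checkpoint buffer size per the formula quoted in this thread:
    max(256 MB, region size / 4). Assumption: not verified against Ignite."""
    return max(256 * MIB, region_bytes // 4)


# The two data regions described in the original question.
for name, size in [("4 GB region", 4 * GIB), ("128 MB region", 128 * MIB)]:
    pages = page_count(size)
    buffer_mb = default_cp_buffer(size) // MIB
    print(f"{name}: {pages} pages, {buffer_mb} MB checkpoint buffer")
```

Under these assumptions the 4 GB region holds 1,048,576 pages (the "1,000,000" in the question is a rounding) with a 1 GB buffer, and the 128 MB region gets the 256 MB floor. At 4 KB per page, a checkpoint of 40,000 pages is roughly 156 MB, well below the buffer this formula gives for the larger region.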