fsync=37104ms is too long for that number of pages (pages=33421). Please check how you can improve fsync performance on your storage.
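Before tuning Ignite itself, it may be worth measuring what a raw write-plus-fsync costs on the underlying volume (the EFS mount discussed later in this thread). A rough standalone probe, assuming plain Java NIO and a placeholder path on the volume that hosts the Ignite work directory:

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.nio.file.StandardOpenOption;

    /** Rough probe of raw write+fsync latency on a storage volume. */
    public class FsyncProbe {
        public static void main(String[] args) throws IOException {
            // Placeholder: point this at the volume hosting the Ignite work directory.
            Path target = Paths.get(args.length > 0 ? args[0] : "/mnt/ignite-work/fsync-probe.bin");

            ByteBuffer page = ByteBuffer.allocate(4096); // one default-sized Ignite page

            try (FileChannel ch = FileChannel.open(target,
                    StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
                for (int i = 0; i < 100; i++) {
                    page.rewind();
                    ch.write(page, (long)i * page.capacity());

                    long start = System.nanoTime();
                    ch.force(true); // fsync: flush data and metadata to the device
                    long micros = (System.nanoTime() - start) / 1_000;

                    System.out.println("fsync #" + i + ": " + micros + " us");
                }
            }
        }
    }

If individual force(true) calls here already take hundreds of milliseconds or more, a 37-second checkpoint fsync is likely dominated by the storage layer rather than by checkpoint settings.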
>------- Forwarded message -------
>From: "Raymond Wilson" <raymond_wil...@trimble.com>
>To: user <user@ignite.apache.org>, "Zhenya Stanilovsky" <arzamas...@mail.ru>
>Cc:
>Subject: Re: Re[4]: Questions related to check pointing
>Date: Thu, 31 Dec 2020 01:46:20 +0300
>
>Hi Zhenya,
>
>The matching checkpoint finished log is this:
>
>2020-12-15 19:07:39,253 [106] INF [MutableCacheComputeServer] Checkpoint finished [cpId=e2c31b43-44df-43f1-b162-6b6cefa24e28, pages=33421, markPos=FileWALPointer [idx=6339, fileOff=243287334, len=196573], walSegmentsCleared=0, walSegmentsCovered=[], markDuration=218ms, pagesWrite=1150ms, fsync=37104ms, total=38571ms]
>
>Regarding your comment that 3/4 of the pages in the whole data region need to be dirty to trigger this, can you confirm whether that is 3/4 of the maximum size of the data region, or of the currently used size? (E.g. if Min is 1Gb, Max is 4Gb, and 2Gb is used, would 1.5Gb of dirty pages trigger this?)
>
>Are data regions independently checkpointed, or are they checkpointed as a whole, so that a 'too many dirty pages' condition affects all data regions in terms of write blocking?
>
>Can you comment on my query regarding whether we should set the Min and Max sizes of the data region to be the same? I.e. don't bother growing the data region memory use on demand, just allocate the maximum.
>
>In terms of the checkpoint lock hold time metric, of the checkpoints quoting 'too many dirty pages' there is one instance, apart from the one I provided earlier, violating this limit, i.e.:
>
>2020-12-17 18:56:39,086 [104] INF [MutableCacheComputeServer] Checkpoint started [checkpointId=e9ccf0ca-f813-4f91-ac93-5483350fdf66, startPtr=FileWALPointer [idx=7164, fileOff=389224517, len=196573], checkpointBeforeLockTime=276ms, checkpointLockWait=0ms, checkpointListenersExecuteTime=16ms, checkpointLockHoldTime=39ms, walCpRecordFsyncDuration=254ms, writeCheckpointEntryDuration=32ms, splitAndSortCpPagesDuration=276ms, pages=77774, reason='too many dirty pages']
>
>This is out of a population of 16 instances I can find. The remainder have lock times of 16-17ms.
>
>Regarding writes of pages to the persistent store, does the checkpointing system parallelise writes across partitions to maximise throughput?
>
>Thanks,
>Raymond.
>
>
>On Thu, Dec 31, 2020 at 1:17 AM Zhenya Stanilovsky <arzamas...@mail.ru> wrote:
>>
>>All write operations will be blocked for this duration: checkpointLockHoldTime=32ms (write lock holding). If you observe a huge number of such messages with reason='too many dirty pages', maybe you need to store some data in non-persisted regions, for example, or reduce indexes (if you use them). And please attach the other part of the cp message, the one starting with: Checkpoint finished.
>>
>>
>>>In (https://cwiki.apache.org/confluence/display/IGNITE/Ignite+Persistent+Store+-+under+the+hood), there is a mention of a dirty pages limit that is a factor that can trigger checkpoints.
>>>
>>>I also found this issue: http://apache-ignite-users.70518.x6.nabble.com/too-many-dirty-pages-td28572.html where "too many dirty pages" is a reason given for initiating a checkpoint.
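Since several of the questions above hinge on how close each region gets to its dirty-page threshold, the per-region counters can be watched directly. A small sketch, assuming the Ignite 2.8.x Java metrics API with metrics enabled on the data regions; the class and how it is scheduled are illustrative only:

    import java.util.Collection;
    import org.apache.ignite.DataRegionMetrics;
    import org.apache.ignite.Ignite;

    /** Logs dirty page counts per data region (requires metrics enabled on the regions). */
    public class DirtyPageWatcher implements Runnable {
        private final Ignite ignite;

        public DirtyPageWatcher(Ignite ignite) {
            this.ignite = ignite;
        }

        @Override public void run() {
            Collection<DataRegionMetrics> regions = ignite.dataRegionMetrics();

            for (DataRegionMetrics m : regions) {
                long total = m.getTotalAllocatedPages(); // pages currently allocated, not the configured max
                long dirty = m.getDirtyPages();

                double ratio = total == 0 ? 0.0 : (double)dirty / total;

                System.out.printf("region=%s dirtyPages=%d allocatedPages=%d dirtyRatio=%.2f%n",
                    m.getName(), dirty, total, ratio);
            }
        }
    }

Running something like this on a timer around the periods of slow puts would show which region is actually approaching its limit when the 'too many dirty pages' checkpoints fire.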
>>>
>>>After reviewing our logs I found this (one example):
>>>
>>>2020-12-15 19:07:00,999 [106] INF [MutableCacheComputeServer] Checkpoint started [checkpointId=e2c31b43-44df-43f1-b162-6b6cefa24e28, startPtr=FileWALPointer [idx=6339, fileOff=243287334, len=196573], checkpointBeforeLockTime=99ms, checkpointLockWait=0ms, checkpointListenersExecuteTime=16ms, checkpointLockHoldTime=32ms, walCpRecordFsyncDuration=113ms, writeCheckpointEntryDuration=27ms, splitAndSortCpPagesDuration=45ms, pages=33421, reason='too many dirty pages']
>>>
>>>This suggests we may have the issue where writes are frozen until the checkpoint is completed.
>>>
>>>Looking at the AI 2.8.1 source code, the dirty page limit fraction appears to be 0.1 (10%), via this entry in GridCacheDatabaseSharedManager.java:
>>>
>>>    /**
>>>     * Threshold to calculate limit for pages list on-heap caches.
>>>     * <p>
>>>     * Note: When a checkpoint is triggered, we need some amount of page memory to store pages list on-heap cache.
>>>     * If a checkpoint is triggered by "too many dirty pages" reason and pages list cache is rather big, we can get
>>>     * {@code IgniteOutOfMemoryException}. To prevent this, we can limit the total amount of cached page list buckets,
>>>     * assuming that checkpoint will be triggered if no more than 3/4 of pages will be marked as dirty (there will be
>>>     * at least 1/4 of clean pages) and each cached page list bucket can be stored to up to 2 pages (this value is not
>>>     * static, but depends on PagesCache.MAX_SIZE, so if PagesCache.MAX_SIZE > PagesListNodeIO#getCapacity it can take
>>>     * more than 2 pages). Also some amount of page memory is needed to store page list metadata.
>>>     */
>>>    private static final double PAGE_LIST_CACHE_LIMIT_THRESHOLD = 0.1;
>>>
>>>This raises two questions:
>>>
>>>1. The data region where most writes are occurring has 4Gb allocated to it, though it is permitted to start at a much lower level. 4Gb should be 1,000,000 pages, 10% of which should be 100,000 dirty pages.
>>>
>>>The 'limit holder' is calculated like this:
>>>
>>>    /**
>>>     * @return Holder for page list cache limit for given data region.
>>>     */
>>>    public AtomicLong pageListCacheLimitHolder(DataRegion dataRegion) {
>>>        if (dataRegion.config().isPersistenceEnabled()) {
>>>            return pageListCacheLimits.computeIfAbsent(dataRegion.config().getName(), name -> new AtomicLong(
>>>                (long)(((PageMemoryEx)dataRegion.pageMemory()).totalPages() * PAGE_LIST_CACHE_LIMIT_THRESHOLD)));
>>>        }
>>>
>>>        return null;
>>>    }
>>>
>>>... but I am unsure whether totalPages() refers to the current size of the data region, or to the size it is permitted to grow to. I.e. could the 'dirty page limit' be a sliding limit based on the growth of the data region? Is it better to set the initial and maximum sizes of data regions to the same number?
>>>
>>>2. We have two data regions, one supporting inbound arrival of data (with low numbers of writes), and one supporting storage of processed results from the arriving data (with many more writes).
>>>
>>>The block on writes due to the number of dirty pages appears to affect all data regions, not just the one which has violated the dirty page limit. Is that correct? If so, is this something that can be improved?
>>>
>>>Thanks,
>>>Raymond.
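On the question of pinning the region size: a minimal sketch of a fixed-size persistent region, assuming the Ignite 2.x Java API (the region name and sizes are placeholders, not the cluster's actual configuration). Setting the initial and maximum sizes to the same value means anything derived from the region's page count stays constant:

    import org.apache.ignite.Ignite;
    import org.apache.ignite.Ignition;
    import org.apache.ignite.configuration.DataRegionConfiguration;
    import org.apache.ignite.configuration.DataStorageConfiguration;
    import org.apache.ignite.configuration.IgniteConfiguration;

    public class FixedSizeRegionExample {
        public static void main(String[] args) {
            long fourGb = 4L * 1024 * 1024 * 1024;

            // Persistent region sized identically up front, so thresholds derived
            // from the region's page count do not move as the region grows.
            DataRegionConfiguration resultsRegion = new DataRegionConfiguration()
                .setName("ProcessedResults")   // hypothetical region name
                .setPersistenceEnabled(true)
                .setInitialSize(fourGb)
                .setMaxSize(fourGb)
                .setMetricsEnabled(true);      // needed for the dirty-page metrics sketch above

            DataStorageConfiguration storageCfg = new DataStorageConfiguration()
                .setDataRegionConfigurations(resultsRegion);

            IgniteConfiguration cfg = new IgniteConfiguration()
                .setDataStorageConfiguration(storageCfg);

            try (Ignite ignite = Ignition.start(cfg)) {
                // With persistence enabled the cluster starts inactive.
                ignite.cluster().active(true);
            }
        }
    }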
>>>
>>>On Wed, Dec 30, 2020 at 9:17 PM Raymond Wilson <raymond_wil...@trimble.com> wrote:
>>>>I'm working on getting automatic JVM thread stack dumps taken when we detect long delays in put (PutIfAbsent) operations. Hopefully this will provide more information.
>>>>
>>>>On Wed, Dec 30, 2020 at 7:48 PM Zhenya Stanilovsky <arzamas...@mail.ru> wrote:
>>>>>
>>>>>I don't think so; checkpointing already worked perfectly well before this fix. We need additional info to start digging into your problem. Can you share the Ignite logs somewhere?
>>>>>
>>>>>>I noticed an entry in the Ignite 2.9.1 changelog:
>>>>>>* Improved checkpoint concurrent behaviour
>>>>>>I am having trouble finding the relevant Jira ticket for this in the 2.9.1 Jira area at https://issues.apache.org/jira/browse/IGNITE-13876?jql=project%20%3D%20IGNITE%20AND%20fixVersion%20%3D%202.9.1%20and%20status%20%3D%20Resolved
>>>>>>
>>>>>>Perhaps this change may improve the checkpointing issue we are seeing?
>>>>>>
>>>>>>Raymond.
>>>>>>
>>>>>>On Tue, Dec 29, 2020 at 8:35 PM Raymond Wilson <raymond_wil...@trimble.com> wrote:
>>>>>>>Hi Zhenya,
>>>>>>>
>>>>>>>1. We currently use AWS EFS for primary storage, with provisioned IOPS to provide sufficient IO. Our Ignite cluster currently tops out at ~10% usage (with at least 5 nodes writing to it, including the WAL and WAL archive), so we are not saturating the EFS interface. We use the default page size (experiments with larger page sizes showed instability when checkpointing due to free page starvation, so we reverted to the default size).
>>>>>>>
>>>>>>>2. Thanks for the detail, we will look for that in thread dumps when we can create them.
>>>>>>>
>>>>>>>3. We are using the default CP buffer size, which is max(256Mb, DataRegionSize / 4) according to the Ignite documentation, so this should provide more than enough checkpoint buffer space to cope with writes. As additional information, the cache which is displaying very slow writes is in a data region with relatively low write traffic. There is a primary (default) data region with heavy write traffic, and the vast majority of pages being written in a checkpoint will be for that default data region.
>>>>>>>
>>>>>>>4. Yes, this is very surprising. Anecdotally, from our logs it appears write traffic into the low-write-traffic cache is blocked during checkpoints.
>>>>>>>
>>>>>>>Thanks,
>>>>>>>Raymond.
>>>>>>>
>>>>>>>
>>>>>>>On Tue, Dec 29, 2020 at 7:31 PM Zhenya Stanilovsky <arzamas...@mail.ru> wrote:
>>>>>>>>* In addition to Ilya's reply, you can check the vendor's page for additional info; everything on that page is applicable to Ignite too [1]. Increasing the number of threads leads to concurrent IO usage, so if you have something like NVMe it is up to you, but in the case of SAS it would possibly be better to reduce this parameter.
>>>>>>>>* The log will show you something like: Parking thread=%Thread name% for timeout(ms)=%time% and the corresponding: Unparking thread=
>>>>>>>>* No additional logging of cp buffer usage is provided. The cp buffer needs to be more than 10% of the overall persistent DataRegions size.
>>>>>>>>* 90 seconds or longer: that seems like a problem in IO or system tuning; it is a very bad result, I would say.
>>>>>>>>
>>>>>>>>[1] https://www.gridgain.com/docs/latest/perf-troubleshooting-guide/persistence-tuning
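Regarding point 3 above, the checkpoint page buffer does not have to be left at its size-derived default; it can be set explicitly per region. A minimal sketch, assuming the Ignite 2.x Java API; the region names reuse the 4Gb/128Mb and 1Gb/256Mb figures discussed in this thread purely for illustration:

    import org.apache.ignite.configuration.DataRegionConfiguration;
    import org.apache.ignite.configuration.DataStorageConfiguration;

    /** Sketch: pin the checkpoint page buffer size instead of relying on the size-derived default. */
    public class CheckpointBufferConfig {
        public static DataStorageConfiguration storageConfig() {
            DataRegionConfiguration busyRegion = new DataRegionConfiguration()
                .setName("Default_Region")                          // hypothetical region name
                .setPersistenceEnabled(true)
                .setMaxSize(4L * 1024 * 1024 * 1024)                // 4 GB region
                .setCheckpointPageBufferSize(1024L * 1024 * 1024);  // 1 GB checkpoint buffer

            DataRegionConfiguration quietRegion = new DataRegionConfiguration()
                .setName("Inbound_Region")                          // hypothetical region name
                .setPersistenceEnabled(true)
                .setMaxSize(128L * 1024 * 1024)                     // 128 MB region
                .setCheckpointPageBufferSize(256L * 1024 * 1024);   // 256 MB checkpoint buffer

            return new DataStorageConfiguration()
                .setDataRegionConfigurations(busyRegion, quietRegion);
        }
    }

Pinning the buffer removes any doubt about which default formula applies and makes it easier to reason about whether the buffer could fill during a long checkpoint.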
>>>>>>>>
>>>>>>>>>Hi,
>>>>>>>>>
>>>>>>>>>We have been investigating some issues which appear to be related to checkpointing. We currently use AI 2.8.1 with the C# client.
>>>>>>>>>
>>>>>>>>>I have been trying to gain clarity on how certain aspects of the Ignite configuration relate to the checkpointing process:
>>>>>>>>>
>>>>>>>>>1. Number of checkpointing threads. This defaults to 4, but I don't understand how it applies to the checkpointing process. Are more threads generally better (e.g. because it makes the disk IO parallel across the threads), or does it only have a positive effect if you have many data storage regions? Or something else? If this could be clarified in the documentation (or a pointer to it which Google has not yet found), that would be good.
>>>>>>>>>
>>>>>>>>>2. Checkpoint frequency. This defaults to 180 seconds. I was thinking that reducing this time would result in smaller, less disruptive checkpoints. Setting it to 60 seconds seems pretty safe, but is there a practical lower limit that should be used for use cases where new data is constantly being added, e.g. 5 seconds, 10 seconds?
>>>>>>>>>
>>>>>>>>>3. Write exclusivity constraints during checkpointing. I understand that while a checkpoint is occurring, ongoing writes will be supported into the caches being checkpointed, and if those are writes to existing pages then those will be duplicated into the checkpoint buffer. If this buffer becomes full or stressed then Ignite will throttle, and perhaps block, writes until the checkpoint is complete. If this is the case then Ignite will emit logging (warning or informational?) that writes are being throttled.
>>>>>>>>>
>>>>>>>>>We have cases where simple puts to caches (a few requests per second) are taking up to 90 seconds to execute when there is an active checkpoint occurring, where the checkpoint has been triggered by the checkpoint timer. When a checkpoint is not occurring the time to do this is usually in the milliseconds. The checkpoints themselves can take 90 seconds or longer, and are updating up to 30,000-40,000 pages across a pair of data storage regions, one with 4Gb in-memory space allocated (which should be 1,000,000 pages at the standard 4kb page size), and one small region with 128Mb. There is no 'throttling' logging being emitted that we can tell, so the checkpoint buffer (which should be 1Gb for the first data region and 256Mb for the second, smaller region in this case) does not look like it can fill up during the checkpoint.
>>>>>>>>>
>>>>>>>>>It seems like the checkpoint is affecting the put operations, but I don't understand why that may be, given the documented checkpointing process, and the checkpoint itself (at least via informational logging) is not advertising any restrictions.
>>>>>>>>>
>>>>>>>>>Thanks,
>>>>>>>>>Raymond.
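Questions 1-3 above all map onto settings of DataStorageConfiguration. A minimal sketch of those knobs, assuming the Ignite 2.x Java API (the C# client exposes equivalent properties on its DataStorageConfiguration); the chosen values are illustrative, not recommendations:

    import org.apache.ignite.configuration.CheckpointWriteOrder;
    import org.apache.ignite.configuration.DataStorageConfiguration;

    /** Sketch: storage-level checkpoint settings referred to in the questions above. */
    public class CheckpointTuning {
        public static DataStorageConfiguration storageConfig() {
            return new DataStorageConfiguration()
                .setCheckpointThreads(4)                   // default; parallelises page writes within a checkpoint
                .setCheckpointFrequency(60_000)            // 60 s instead of the 180 s default
                .setCheckpointWriteOrder(CheckpointWriteOrder.SEQUENTIAL) // default; sorts pages for more sequential IO
                .setWriteThrottlingEnabled(true);          // gradual slow-down instead of abrupt write stalls
        }
    }

Write throttling in particular is intended to slow writers down progressively as the checkpoint buffer fills or checkpoint progress lags, rather than letting the node hit a hard limit and block writes outright.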