Andrey,

Can you please describe the configuration of your nodes in greater detail
(specifically, the number of caches and the number of partitions)? Ignite
does not load all the partitions into memory on startup, simply because
there is no such logic. What it does do, however, is load the meta pages of
each partition in each cache group to determine the correct cluster state and
schedule rebalancing, if needed. If the number of caches multiplied by the
number of partitions is high, this may take a while.
If this is the case, you can either reduce the number of partitions or
group logical caches with the same affinity into a physical cache group, so
that those caches share the same partition files. See
CacheConfiguration#setGroupName(String) for more details.
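
As a rough illustration of both options, here is a minimal configuration
sketch. The cache names, the group name and the partition count of 128 are
made up for the example, and I am assuming a 2.3+ style
DataStorageConfiguration. Caches placed in one group are expected to use
compatible settings (same affinity function, cache mode, backups), so the
sketch builds them all through one helper.

```java
import org.apache.ignite.cache.affinity.rendezvous.RendezvousAffinityFunction;
import org.apache.ignite.configuration.CacheConfiguration;
import org.apache.ignite.configuration.DataStorageConfiguration;
import org.apache.ignite.configuration.IgniteConfiguration;

public class CacheGroupConfigSketch {
    // Hypothetical values, chosen only for illustration.
    private static final String GROUP = "sharedGroup";
    private static final int PARTITIONS = 128; // Ignite's default is 1024

    /** Builds a cache that joins the shared group with a reduced partition count. */
    static CacheConfiguration<Long, String> groupedCache(String cacheName) {
        CacheConfiguration<Long, String> ccfg = new CacheConfiguration<>(cacheName);
        // Caches with the same group name share one set of partition files.
        ccfg.setGroupName(GROUP);
        // Same affinity function and partition count for every cache in the group.
        ccfg.setAffinity(new RendezvousAffinityFunction(false, PARTITIONS));
        return ccfg;
    }

    public static IgniteConfiguration configuration() {
        // Enable native persistence on the default data region.
        DataStorageConfiguration storage = new DataStorageConfiguration();
        storage.getDefaultDataRegionConfiguration().setPersistenceEnabled(true);

        IgniteConfiguration cfg = new IgniteConfiguration();
        cfg.setDataStorageConfiguration(storage);
        cfg.setCacheConfiguration(groupedCache("orders"), groupedCache("customers"));
        return cfg;
    }
}
```

With this layout, the two logical caches are backed by 128 partitions in
total instead of 2 x 1024, so there are far fewer partition meta pages to
read on startup.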

Last but not least, it looks very suspicious that with 0 pending updates it
took almost 90 seconds to read the WAL. From the code, I see that this may
again be related to partition state recovery. I will need to re-check this
and get back to you later.

Thanks,
AG

2018-01-19 2:51 GMT+03:00 Andrey Kornev <andrewkor...@hotmail.com>:

> Hello,
>
> I'm wondering if there is a way to improve the startup time of an Ignite
> node when persistence is enabled?
>
> It seems the time is proportional to the size (and number) of the
> partition files. This is somewhat surprising as I expected the startup
> time to be the same (give or take some constant factor) regardless of the
> amount of data persisted.
>
> The delay looks to be due to Ignite loading *all* partition files for
> *all* persistence-enabled caches as part of a node's join. Here's an
> example of the startup log output:
>
> 2018-01-18 14:00:40,230 INFO  [exchange-worker-#42%ignite-1%]
> GridCacheDatabaseSharedManager - Read checkpoint status
> [startMarker=/tmp/storage/data/1/cp/1516311778910-d56f8ceb-2205-4bef-9ed3-a7446e34aa06-START.bin,
> endMarker=/tmp/storage/data/1/cp/1516311778910-d56f8ceb-2205-4bef-9ed3-a7446e34aa06-END.bin]
> 2018-01-18 14:00:40,230 INFO  [exchange-worker-#42%ignite-1%]
> GridCacheDatabaseSharedManager - Applying lost cache updates since last
> checkpoint record [lastMarked=FileWALPointer [idx=1693, fileOff=7970054,
> len=60339], lastCheckpointId=d56f8ceb-2205-4bef-9ed3-a7446e34aa06]
> 2018-01-18 14:00:57,114 WARN  [exchange-worker-#42%ignite-1%]
> PageMemoryImpl - Page evictions started, this will affect storage
> performance (consider increasing DataRegionConfiguration#setMaxSize).
> 2018-01-18 14:02:05,469 INFO  [exchange-worker-#42%ignite-1%]
> GridCacheDatabaseSharedManager - Finished applying WAL changes
> [updatesApplied=0, time=85234ms]
>
> It took ≈1.5 minutes to activate a node. To add insult to injury,
> eviction kicked in and most of the loaded pages got evicted (in this
> test, I had the caches sharing a 1GB memory region loading about 10GB of
> data and index). In general, I think it's not unreasonable to expect a
> 1-to-10 ratio of the data region size to the total persisted data size.
>
> Why load all that data in the first place? It seems like a huge waste of
> time. Can the data partitions be loaded lazily on demand while the index
> partition can still be loaded during node startup?
>
> Thanks
> Andrey
>
>
