Thanks, Gaurav.

Arseny Kovalchuk

Senior Software Engineer at Synesis
skype: arseny.kovalchuk
mobile: +375 (29) 666-16-16
LinkedIn Profile <http://www.linkedin.com/in/arsenykovalchuk/en>

On 17 March 2018 at 13:13, Gaurav Bajaj <gauravhba...@gmail.com> wrote:

> 1. Data piece size (like event or entity size in bytes)
> -> 1 KB
>
> 2. What is your write rate (like entities per second)
> -> 8K/sec
>
> 3. How do you evict (delete) data from the cache
> -> We don't evict/delete.
>
> 4. How many caches (differ by Ignite cache name) do you have
> -> 3 caches
>
> 5. What kind of storage do you have (network, HDD, SSD, etc.)
> -> SSD
>
> 6. If you can provide a solid reproducer, I'd like to investigate it.
> -> We read files containing data and stream it to the caches using the Ignite
> streamer. Not sure at this time about steps to consistently reproduce this.
>
> On 17-Mar-2018 7:36 AM, "Arseny Kovalchuk" <arseny.kovalc...@synesis.ru>
> wrote:
>
> Hi Gaurav.
>
> Could you please share your environment and some details?
> 1. Data piece size (like event or entity size in bytes)
> 2. What is your write rate (like entities per second)
> 3. How do you evict (delete) data from the cache
> 4. How many caches (differ by Ignite cache name) do you have
> 5. What kind of storage do you have (network, HDD, SSD, etc.)
> 6. If you can provide a solid reproducer, I'd like to investigate it.
>
> Sincerely
>
> Arseny Kovalchuk
>
> Senior Software Engineer at Synesis
> skype: arseny.kovalchuk
> mobile: +375 (29) 666-16-16
> LinkedIn Profile <http://www.linkedin.com/in/arsenykovalchuk/en>
>
> On 16 March 2018 at 22:40, Gaurav Bajaj <gauravhba...@gmail.com> wrote:
>
>> Hi,
>>
>> We also got the exact same error. Our setup is without Kubernetes. We are
>> using the Ignite data streamer to put data into caches. After streaming
>> around 500k records the streamer failed with the exception mentioned in
>> the original email.
>>
>> Thanks,
>> Gaurav
>>
>> On 16-Mar-2018 4:44 PM, "Arseny Kovalchuk" <arseny.kovalc...@synesis.ru>
>> wrote:
>>
>>> Hi Dmitry.
>>>
>>> Thanks for your attention to this issue.
>>>
>>> I changed the repository to jcenter and set the Ignite version to 2.4.
>>> Unfortunately the reproducer starts with the same error message in the
>>> log (see attached).
>>>
>>> I cannot say whether the behavior of the whole cluster will change on
>>> 2.4, i.e. whether the cluster can start on corrupted data on 2.4,
>>> because we wiped the data and restarted the cluster where the problem
>>> occurred. We'll move to 2.4 next week and continue testing our software.
>>> We are moving to production in April/May, and it would be good to get
>>> some clue how to deal with such a data situation in the future.
>>>
>>> Arseny Kovalchuk
>>>
>>> Senior Software Engineer at Synesis
>>> skype: arseny.kovalchuk
>>> mobile: +375 (29) 666-16-16
>>> LinkedIn Profile <http://www.linkedin.com/in/arsenykovalchuk/en>
>>>
>>> On 16 March 2018 at 17:03, Dmitry Pavlov <dpavlov....@gmail.com> wrote:
>>>
>>>> Hi Arseny,
>>>>
>>>> I've observed in the reproducer:
>>>> ignite_version=2.3.0
>>>>
>>>> Could you check if it is reproducible in our freshest release, 2.4.0?
>>>>
>>>> I'm not sure about the ticket number, but it is quite possible the
>>>> issue is already fixed.
>>>>
>>>> Sincerely,
>>>> Dmitriy Pavlov
>>>>
>>>> Thu, 15 Mar 2018 at 19:34, Dmitry Pavlov <dpavlov....@gmail.com>:
>>>>
>>>>> Hi Alexey,
>>>>>
>>>>> It may be a serious issue. Could you recommend an expert here who
>>>>> can pick this up?
>>>>>
>>>>> Sincerely,
>>>>> Dmitriy Pavlov
>>>>>
>>>>> Thu, 15 Mar 2018 at 19:25, Arseny Kovalchuk <
>>>>> arseny.kovalc...@synesis.ru>:
>>>>>
>>>>>> Hi, guys.
>>>>>>
>>>>>> I've got a reproducer for a problem which is generally reported as
>>>>>> "Caused by: java.lang.IllegalStateException: Failed to get page IO
>>>>>> instance (page content is corrupted)".
>>>>>> Strictly speaking, it reproduces the result: I don't know how the
>>>>>> data got corrupted, but the cluster node doesn't want to start with
>>>>>> this data.
>>>>>>
>>>>>> We got the issue again when some of the server nodes were restarted
>>>>>> several times by Kubernetes. I suspect that the data got corrupted
>>>>>> during such restarts. But the main functionality that we really want
>>>>>> is that the cluster DOESN'T HANG during the next restart even if the
>>>>>> data is corrupted! Anyway, there is no tool that can help to repair
>>>>>> such data, and as a result we wipe all data manually to start the
>>>>>> cluster. So, warnings about corrupted data in the logs plus a working
>>>>>> cluster is the expected behavior.
>>>>>>
>>>>>> How to reproduce:
>>>>>> 1. Download the data from
>>>>>> https://storage.googleapis.com/pub-data-0/data5.tar.gz (~200 MB)
>>>>>> 2. Download and import the Gradle project
>>>>>> https://storage.googleapis.com/pub-data-0/project.tar.gz (~100 KB)
>>>>>> 3. Unpack the data to the home folder, say /home/user1. You should
>>>>>> get a path like */home/user1/data5*. Inside data5 you should have
>>>>>> binary_meta, db, marshaller.
>>>>>> 4. Open *src/main/resources/data-test.xml* and put the absolute path
>>>>>> of the unpacked data into the *workDirectory* property of the
>>>>>> *igniteCfg5* bean. In this example it should be */home/user1/data5*.
>>>>>> Do not edit consistentId! The consistentId is ignite-instance-5, so
>>>>>> the real data is in the data5/db/ignite_instance_5 folder.
>>>>>> 5. Start the application from ru.synesis.kipod.DataTestBootApp
>>>>>> 6. Enjoy
>>>>>>
>>>>>> Hope it will help.
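[For readers without the archive: step 4 above corresponds to a Spring bean roughly like the following. This is a minimal sketch, not the actual data-test.xml from the project; only workDirectory and consistentId are taken from the instructions, the rest is a plausible skeleton for Ignite 2.3/2.4 native persistence.]

```xml
<bean id="igniteCfg5" class="org.apache.ignite.configuration.IgniteConfiguration">
    <!-- Absolute path to the unpacked data set (step 3). -->
    <property name="workDirectory" value="/home/user1/data5"/>
    <!-- Must match the folder name data5/db/ignite_instance_5. -->
    <property name="consistentId" value="ignite-instance-5"/>
    <!-- Native persistence must be enabled for the node to read the data. -->
    <property name="dataStorageConfiguration">
        <bean class="org.apache.ignite.configuration.DataStorageConfiguration">
            <property name="defaultDataRegionConfiguration">
                <bean class="org.apache.ignite.configuration.DataRegionConfiguration">
                    <property name="persistenceEnabled" value="true"/>
                </bean>
            </property>
        </bean>
    </property>
</bean>
```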
>>>>>>
>>>>>> Arseny Kovalchuk
>>>>>>
>>>>>> Senior Software Engineer at Synesis
>>>>>> skype: arseny.kovalchuk
>>>>>> mobile: +375 (29) 666-16-16
>>>>>> LinkedIn Profile <http://www.linkedin.com/in/arsenykovalchuk/en>
>>>>>>
>>>>>> On 26 December 2017 at 21:15, Denis Magda <dma...@apache.org> wrote:
>>>>>>
>>>>>>> Cross-posting to the dev list.
>>>>>>>
>>>>>>> Ignite persistence maintainers, please chime in.
>>>>>>>
>>>>>>> —
>>>>>>> Denis
>>>>>>>
>>>>>>> On Dec 26, 2017, at 2:17 AM, Arseny Kovalchuk <
>>>>>>> arseny.kovalc...@synesis.ru> wrote:
>>>>>>>
>>>>>>> Hi guys.
>>>>>>>
>>>>>>> Another issue when using Ignite 2.3 with native persistence
>>>>>>> enabled. See details below.
>>>>>>>
>>>>>>> We deploy Ignite along with our services in Kubernetes (v 1.8) on
>>>>>>> premises. The Ignite cluster is a StatefulSet of 5 Pods (5
>>>>>>> instances) of Ignite version 2.3. Each Pod mounts a
>>>>>>> PersistentVolume backed by CEPH RBD.
>>>>>>>
>>>>>>> We put about 230 events/second into Ignite; 70% of events are
>>>>>>> ~200 KB in size and 30% are 5000 KB. Smaller events have indexed
>>>>>>> fields and we query them via SQL.
>>>>>>>
>>>>>>> The cluster is activated from a client node which also streams
>>>>>>> events into Ignite from Kafka. We use a custom implementation of a
>>>>>>> streamer which uses the cache.putAll() API.
>>>>>>>
>>>>>>> We started the cluster from scratch without any persistent data.
>>>>>>> After a while we got corrupted data with the following error
>>>>>>> message.
>>>>>>>
>>>>>>> [2017-12-26 07:44:14,251] ERROR [sys-#127%ignite-instance-2%]
>>>>>>> org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPreloader:
>>>>>>> - Partition eviction failed, this can cause grid hang.
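[The "custom streamer which uses cache.putAll()" described above can be sketched, independent of Ignite APIs, as a small batching buffer. This is a hypothetical illustration: the class name, batch size, and the Consumer sink standing in for IgniteCache#putAll are all assumptions, not the poster's actual code.]

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Consumer;

/** Buffers entries and flushes them in fixed-size batches, e.g. into IgniteCache#putAll. */
public class BatchingStreamer<K, V> {
    private final int batchSize;
    private final Consumer<Map<K, V>> sink; // e.g. cache::putAll on the client node
    private final Map<K, V> buffer = new HashMap<>();

    public BatchingStreamer(int batchSize, Consumer<Map<K, V>> sink) {
        this.batchSize = batchSize;
        this.sink = sink;
    }

    /** Adds an entry; flushes automatically once the buffer reaches the batch size. */
    public void add(K key, V val) {
        buffer.put(key, val);
        if (buffer.size() >= batchSize)
            flush();
    }

    /** Pushes any buffered entries to the sink and clears the buffer. */
    public void flush() {
        if (!buffer.isEmpty()) {
            sink.accept(new HashMap<>(buffer));
            buffer.clear();
        }
    }
}
```

Usage on the client would be along the lines of `new BatchingStreamer<>(500, cache::putAll)`, calling `add` per Kafka record and `flush` on shutdown.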
>>>>>>> class org.apache.ignite.IgniteException: Runtime failure on search row:
>>>>>>> Row@5b1479d6[ key: 171:1513946618964:3008806055072854,
>>>>>>> val: ru.synesis.kipod.event.KipodEvent [idHash=510912646, hash=-387621419,
>>>>>>> face_last_name=null, face_list_id=null, channel=171, source=,
>>>>>>> face_similarity=null, license_plate_number=null, descriptors=null,
>>>>>>> cacheName=kipod_events, cacheKey=171:1513946618964:3008806055072854,
>>>>>>> stream=171, alarm=false, processed_at=0, face_id=null, id=3008806055072854,
>>>>>>> persistent=false, face_first_name=null, license_plate_first_name=null,
>>>>>>> face_full_name=null, level=0, module=Kpx.Synesis.Outdoor,
>>>>>>> end_time=1513946624379, params=null, commented_at=0,
>>>>>>> tags=[vehicle, 0, human, 0, truck, 0], start_time=1513946618964,
>>>>>>> processed=false, kafka_offset=111259, license_plate_last_name=null,
>>>>>>> armed=false, license_plate_country=null, topic=MovingObject, comment=,
>>>>>>> expiration=1514033024000, original_id=null, license_plate_lists=null],
>>>>>>> ver: GridCacheVersion [topVer=125430590, order=1513955001926, nodeOrder=3] ][
>>>>>>> 3008806055072854, MovingObject, Kpx.Synesis.Outdoor, 0, , 1513946618964,
>>>>>>> 1513946624379, 171, 171, FALSE, FALSE, , FALSE, FALSE, 0, 0, 111259,
>>>>>>> 1514033024000, (vehicle, 0, human, 0, truck, 0), null, null, null, null,
>>>>>>> null, null, null, null, null, null, null, null ]
>>>>>>> at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.doRemove(BPlusTree.java:1787)
>>>>>>> at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.remove(BPlusTree.java:1578)
>>>>>>> at org.apache.ignite.internal.processors.query.h2.database.H2TreeIndex.remove(H2TreeIndex.java:216)
>>>>>>> at org.apache.ignite.internal.processors.query.h2.opt.GridH2Table.doUpdate(GridH2Table.java:496)
>>>>>>> at org.apache.ignite.internal.processors.query.h2.opt.GridH2Table.update(GridH2Table.java:423)
>>>>>>> at org.apache.ignite.internal.processors.query.h2.IgniteH2Indexing.remove(IgniteH2Indexing.java:580)
>>>>>>> at org.apache.ignite.internal.processors.query.GridQueryProcessor.remove(GridQueryProcessor.java:2334)
>>>>>>> at org.apache.ignite.internal.processors.cache.query.GridCacheQueryManager.remove(GridCacheQueryManager.java:461)
>>>>>>> at org.apache.ignite.internal.processors.cache.IgniteCacheOffheapManagerImpl$CacheDataStoreImpl.finishRemove(IgniteCacheOffheapManagerImpl.java:1453)
>>>>>>> at org.apache.ignite.internal.processors.cache.IgniteCacheOffheapManagerImpl$CacheDataStoreImpl.remove(IgniteCacheOffheapManagerImpl.java:1416)
>>>>>>> at org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager$GridCacheDataStore.remove(GridCacheOffheapManager.java:1271)
>>>>>>> at org.apache.ignite.internal.processors.cache.IgniteCacheOffheapManagerImpl.remove(IgniteCacheOffheapManagerImpl.java:374)
>>>>>>> at org.apache.ignite.internal.processors.cache.GridCacheMapEntry.removeValue(GridCacheMapEntry.java:3233)
>>>>>>> at org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtCacheEntry.clearInternal(GridDhtCacheEntry.java:588)
>>>>>>> at org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtLocalPartition.clearAll(GridDhtLocalPartition.java:951)
>>>>>>> at org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtLocalPartition.tryEvict(GridDhtLocalPartition.java:809)
>>>>>>> at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPreloader$3.call(GridDhtPreloader.java:593)
>>>>>>> at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPreloader$3.call(GridDhtPreloader.java:580)
>>>>>>> at org.apache.ignite.internal.util.IgniteUtils.wrapThreadLoader(IgniteUtils.java:6631)
>>>>>>> at org.apache.ignite.internal.processors.closure.GridClosureProcessor$2.body(GridClosureProcessor.java:967)
>>>>>>> at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:110)
>>>>>>> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>>>>>>> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>>>>>>> at java.lang.Thread.run(Thread.java:748)
>>>>>>> Caused by: java.lang.IllegalStateException: Failed to get page IO instance (page content is corrupted)
>>>>>>> at org.apache.ignite.internal.processors.cache.persistence.tree.io.IOVersions.forVersion(IOVersions.java:83)
>>>>>>> at org.apache.ignite.internal.processors.cache.persistence.tree.io.IOVersions.forPage(IOVersions.java:95)
>>>>>>> at org.apache.ignite.internal.processors.cache.persistence.CacheDataRowAdapter.initFromLink(CacheDataRowAdapter.java:148)
>>>>>>> at org.apache.ignite.internal.processors.cache.persistence.CacheDataRowAdapter.initFromLink(CacheDataRowAdapter.java:102)
>>>>>>> at org.apache.ignite.internal.processors.query.h2.database.H2RowFactory.getRow(H2RowFactory.java:62)
>>>>>>> at org.apache.ignite.internal.processors.query.h2.database.io.H2ExtrasLeafIO.getLookupRow(H2ExtrasLeafIO.java:126)
>>>>>>> at org.apache.ignite.internal.processors.query.h2.database.io.H2ExtrasLeafIO.getLookupRow(H2ExtrasLeafIO.java:36)
>>>>>>> at org.apache.ignite.internal.processors.query.h2.database.H2Tree.getRow(H2Tree.java:123)
>>>>>>> at org.apache.ignite.internal.processors.query.h2.database.H2Tree.getRow(H2Tree.java:40)
>>>>>>> at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.getRow(BPlusTree.java:4372)
>>>>>>> at org.apache.ignite.internal.processors.query.h2.database.H2Tree.compare(H2Tree.java:200)
>>>>>>> at org.apache.ignite.internal.processors.query.h2.database.H2Tree.compare(H2Tree.java:40)
>>>>>>> at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.compare(BPlusTree.java:4359)
>>>>>>> at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.findInsertionPoint(BPlusTree.java:4279)
>>>>>>> at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.access$1500(BPlusTree.java:81)
>>>>>>> at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree$Search.run0(BPlusTree.java:261)
>>>>>>> at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree$GetPageHandler.run(BPlusTree.java:4697)
>>>>>>> at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree$GetPageHandler.run(BPlusTree.java:4682)
>>>>>>> at org.apache.ignite.internal.processors.cache.persistence.tree.util.PageHandler.readPage(PageHandler.java:158)
>>>>>>> at org.apache.ignite.internal.processors.cache.persistence.DataStructure.read(DataStructure.java:319)
>>>>>>> at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.removeDown(BPlusTree.java:1823)
>>>>>>> at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.removeDown(BPlusTree.java:1842)
>>>>>>> at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.removeDown(BPlusTree.java:1842)
>>>>>>> at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.removeDown(BPlusTree.java:1842)
>>>>>>> at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.doRemove(BPlusTree.java:1752)
>>>>>>> ... 23 more
>>>>>>>
>>>>>>> After restart we also get this error. See *ignite-instance-2.log*.
>>>>>>>
>>>>>>> The *cache-config.xml* is used for *server* instances.
>>>>>>> The *ignite-common-cache-conf.xml* is used for *client* instances,
>>>>>>> which activate the cluster and stream data from Kafka into Ignite.
>>>>>>>
>>>>>>> *Is it possible to tune (or implement) native persistence in such a
>>>>>>> way that it just reports the error or the corrupted data, then
>>>>>>> skips it and continues to work without that corrupted part? That
>>>>>>> would let the cluster continue operating regardless of errors on
>>>>>>> storage.*
>>>>>>>
>>>>>>> Arseny Kovalchuk
>>>>>>>
>>>>>>> Senior Software Engineer at Synesis
>>>>>>> skype: arseny.kovalchuk
>>>>>>> mobile: +375 (29) 666-16-16
>>>>>>> LinkedIn Profile <http://www.linkedin.com/in/arsenykovalchuk/en>
>>>>>>>
>>>>>>> <ignite-instance-0.log><ignite-instance-1.log><ignite-instance-2.log>
>>>>>>> <ignite-instance-3.log><ignite-instance-4.log><cache-config.xml>
>>>>>>> <ignite-discovery-kubernetes.xml><ignite-common.xml>
>>>>>>> <ignite-common-storage.xml><ignite-common-entity.xml>
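[Editor's note on the question above: Ignite 2.3/2.4 has no supported way to skip corrupted pages and keep going. Later releases (2.5 and newer) added a pluggable failure handler, which does not repair or skip corrupted data but at least makes a node fail fast and visibly instead of hanging. A sketch, assuming Ignite 2.5+; the timeout value is illustrative:]

```xml
<bean class="org.apache.ignite.configuration.IgniteConfiguration">
    <!-- Reaction to critical failures such as persistence corruption
         (Ignite 2.5+). Stops the node gracefully if possible, otherwise
         halts the JVM, so the cluster does not hang on a broken node. -->
    <property name="failureHandler">
        <bean class="org.apache.ignite.failure.StopNodeOrHaltFailureHandler">
            <!-- tryStop: attempt a graceful node stop before halting. -->
            <constructor-arg value="true"/>
            <!-- timeout in ms for the graceful stop attempt -->
            <constructor-arg value="60000"/>
        </bean>
    </property>
</bean>
```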