Thanks, Gaurav.

Arseny Kovalchuk

Senior Software Engineer at Synesis
skype: arseny.kovalchuk
mobile: +375 (29) 666-16-16
LinkedIn Profile <http://www.linkedin.com/in/arsenykovalchuk/en>

On 17 March 2018 at 13:13, Gaurav Bajaj <gauravhba...@gmail.com> wrote:

> 1. Data piece size (like event or entity size in bytes)
> -> 1 KB
>
> 2. What is your write rate (like entities per second)
> -> 8K/sec
>
> 3. How do you evict (delete) data from the cache
> -> We don't evict/delete.
>
> 4. How many caches (differ by Ignite cache name) do you have
> -> 3 caches
>
> 5. What kind of storage do you have (network, HDD, SSD, etc.)
> -> SSD
>
> 6. If you can provide a solid reproducer, I'd like to investigate it.
> -> We read files containing data and stream it to the caches using the Ignite
> streamer. Not sure at this time about steps to consistently reproduce this.
>
> On 17-Mar-2018 7:36 AM, "Arseny Kovalchuk" <arseny.kovalc...@synesis.ru>
> wrote:
>
> Hi Gaurav.
>
> Could you please share your environment and some details?
> 1. Data piece size (like event or entity size in bytes)
> 2. What is your write rate (like entities per second)
> 3. How do you evict (delete) data from the cache
> 4. How many caches (differ by Ignite cache name) do you have
> 5. What kind of storage do you have (network, HDD, SSD, etc.)
> 6. If you can provide a solid reproducer, I'd like to investigate it.
>
> Sincerely
>
> Arseny Kovalchuk
>
> Senior Software Engineer at Synesis
> skype: arseny.kovalchuk
> mobile: +375 (29) 666-16-16
> LinkedIn Profile <http://www.linkedin.com/in/arsenykovalchuk/en>
>
> On 16 March 2018 at 22:40, Gaurav Bajaj <gauravhba...@gmail.com> wrote:
>
>> Hi,
>>
>> We also got the exact same error. Our setup is without Kubernetes. We are
>> using the Ignite data streamer to put data into caches. After streaming
>> around 500k records the streamer failed with the exception mentioned in
>> the original email.
>>
>> Thanks,
>> Gaurav
>>
>> On 16-Mar-2018 4:44 PM, "Arseny Kovalchuk" <arseny.kovalc...@synesis.ru>
>> wrote:
>>
>>> Hi Dmitry.
>>>
>>> Thanks for your attention to this issue.
>>>
>>> I changed the repository to jcenter and set the Ignite version to 2.4.
>>> Unfortunately the reproducer starts with the same error message in the
>>> log (see attached).
>>>
>>> I cannot say whether the behavior of the whole cluster will change on
>>> 2.4, i.e. whether the cluster can start on corrupted data on 2.4,
>>> because we wiped the data and restarted the cluster where the problem
>>> occurred. We'll move to 2.4 next week and continue testing our software.
>>> We are moving to production in April/May, and it would be good to get
>>> some clue how to deal with such a data situation in the future.
>>>
>>> Arseny Kovalchuk
>>>
>>> Senior Software Engineer at Synesis
>>> skype: arseny.kovalchuk
>>> mobile: +375 (29) 666-16-16
>>> LinkedIn Profile <http://www.linkedin.com/in/arsenykovalchuk/en>
>>>
>>> On 16 March 2018 at 17:03, Dmitry Pavlov <dpavlov....@gmail.com> wrote:
>>>
>>>> Hi Arseny,
>>>>
>>>> I've observed in the reproducer:
>>>> ignite_version=2.3.0
>>>>
>>>> Could you check if it is reproducible in our freshest release, 2.4.0?
>>>>
>>>> I'm not sure about the ticket number, but it is quite possible the
>>>> issue is already fixed.
>>>>
>>>> Sincerely,
>>>> Dmitriy Pavlov
>>>>
>>>> Thu, 15 Mar 2018 at 19:34, Dmitry Pavlov <dpavlov....@gmail.com>:
>>>>
>>>>> Hi Alexey,
>>>>>
>>>>> It may be a serious issue. Could you recommend an expert here who
>>>>> can pick this up?
>>>>>
>>>>> Sincerely,
>>>>> Dmitriy Pavlov
>>>>>
>>>>> Thu, 15 Mar 2018 at 19:25, Arseny Kovalchuk <
>>>>> arseny.kovalc...@synesis.ru>:
>>>>>
>>>>>> Hi, guys.
>>>>>>
>>>>>> I've got a reproducer for a problem which is generally reported as
>>>>>> "Caused by: java.lang.IllegalStateException: Failed to get page IO
>>>>>> instance (page content is corrupted)".
>>>>>> Strictly speaking, it reproduces the result: I don't know how the
>>>>>> data got corrupted, but the cluster node doesn't want to start with
>>>>>> this data.
>>>>>>
>>>>>> We got the issue again when some of the server nodes were restarted
>>>>>> several times by Kubernetes. I suspect that the data got corrupted
>>>>>> during such restarts. But the main functionality that we really want
>>>>>> is that the cluster DOESN'T HANG during the next restart even if the
>>>>>> data is corrupted! Anyway, there is no tool that can help to repair
>>>>>> such data, and as a result we wipe all data manually to start the
>>>>>> cluster. So, warnings about corrupted data in the logs plus a working
>>>>>> cluster is the expected behavior.
>>>>>>
>>>>>> How to reproduce:
>>>>>> 1. Download the data from
>>>>>> https://storage.googleapis.com/pub-data-0/data5.tar.gz (~200 MB)
>>>>>> 2. Download and import the Gradle project
>>>>>> https://storage.googleapis.com/pub-data-0/project.tar.gz (~100 KB)
>>>>>> 3. Unpack the data to the home folder, say /home/user1. You should
>>>>>> get a path like */home/user1/data5*. Inside data5 you should have
>>>>>> binary_meta, db, marshaller.
>>>>>> 4. Open *src/main/resources/data-test.xml* and put the absolute path
>>>>>> of the unpacked data into the *workDirectory* property of the
>>>>>> *igniteCfg5* bean. In this example it should be */home/user1/data5*.
>>>>>> Do not edit consistentId! The consistentId is ignite-instance-5, so
>>>>>> the real data is in the data5/db/ignite_instance_5 folder.
>>>>>> 5. Start the application from ru.synesis.kipod.DataTestBootApp
>>>>>> 6. Enjoy
>>>>>>
>>>>>> Hope it will help.
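[For readers without the archive: step 4 above corresponds to a Spring bean roughly like the following. This is a minimal sketch, not the actual data-test.xml from the project; only workDirectory and consistentId are taken from the instructions, the rest is a plausible skeleton for Ignite 2.3/2.4 native persistence.]

```xml
<bean id="igniteCfg5" class="org.apache.ignite.configuration.IgniteConfiguration">
    <!-- Absolute path to the unpacked data set (step 3). -->
    <property name="workDirectory" value="/home/user1/data5"/>
    <!-- Must match the folder name data5/db/ignite_instance_5. -->
    <property name="consistentId" value="ignite-instance-5"/>
    <!-- Native persistence must be enabled for the node to read the data. -->
    <property name="dataStorageConfiguration">
        <bean class="org.apache.ignite.configuration.DataStorageConfiguration">
            <property name="defaultDataRegionConfiguration">
                <bean class="org.apache.ignite.configuration.DataRegionConfiguration">
                    <property name="persistenceEnabled" value="true"/>
                </bean>
            </property>
        </bean>
    </property>
</bean>
```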
>>>>>>
>>>>>> Arseny Kovalchuk
>>>>>>
>>>>>> Senior Software Engineer at Synesis
>>>>>> skype: arseny.kovalchuk
>>>>>> mobile: +375 (29) 666-16-16
>>>>>> LinkedIn Profile <http://www.linkedin.com/in/arsenykovalchuk/en>
>>>>>>
>>>>>> On 26 December 2017 at 21:15, Denis Magda <dma...@apache.org> wrote:
>>>>>>
>>>>>>> Cross-posting to the dev list.
>>>>>>>
>>>>>>> Ignite persistence maintainers, please chime in.
>>>>>>>
>>>>>>> —
>>>>>>> Denis
>>>>>>>
>>>>>>> On Dec 26, 2017, at 2:17 AM, Arseny Kovalchuk <
>>>>>>> arseny.kovalc...@synesis.ru> wrote:
>>>>>>>
>>>>>>> Hi guys.
>>>>>>>
>>>>>>> Another issue when using Ignite 2.3 with native persistence
>>>>>>> enabled. See details below.
>>>>>>>
>>>>>>> We deploy Ignite along with our services in Kubernetes (v 1.8) on
>>>>>>> premises. The Ignite cluster is a StatefulSet of 5 Pods (5
>>>>>>> instances) of Ignite version 2.3. Each Pod mounts a
>>>>>>> PersistentVolume backed by CEPH RBD.
>>>>>>>
>>>>>>> We put about 230 events/second into Ignite; 70% of events are
>>>>>>> ~200 KB in size and 30% are 5000 KB. Smaller events have indexed
>>>>>>> fields and we query them via SQL.
>>>>>>>
>>>>>>> The cluster is activated from a client node which also streams
>>>>>>> events into Ignite from Kafka. We use a custom implementation of a
>>>>>>> streamer which uses the cache.putAll() API.
>>>>>>>
>>>>>>> We started the cluster from scratch without any persistent data.
>>>>>>> After a while we got corrupted data with the following error
>>>>>>> message.
>>>>>>>
>>>>>>> [2017-12-26 07:44:14,251] ERROR [sys-#127%ignite-instance-2%]
>>>>>>> org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPreloader:
>>>>>>> - Partition eviction failed, this can cause grid hang.
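[The "custom streamer which uses cache.putAll()" described above can be sketched, independent of Ignite APIs, as a small batching buffer. This is a hypothetical illustration: the class name, batch size, and the Consumer sink standing in for IgniteCache#putAll are all assumptions, not the poster's actual code.]

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Consumer;

/** Buffers entries and flushes them in fixed-size batches, e.g. into IgniteCache#putAll. */
public class BatchingStreamer<K, V> {
    private final int batchSize;
    private final Consumer<Map<K, V>> sink; // e.g. cache::putAll on the client node
    private final Map<K, V> buffer = new HashMap<>();

    public BatchingStreamer(int batchSize, Consumer<Map<K, V>> sink) {
        this.batchSize = batchSize;
        this.sink = sink;
    }

    /** Adds an entry; flushes automatically once the buffer reaches the batch size. */
    public void add(K key, V val) {
        buffer.put(key, val);
        if (buffer.size() >= batchSize)
            flush();
    }

    /** Pushes any buffered entries to the sink and clears the buffer. */
    public void flush() {
        if (!buffer.isEmpty()) {
            sink.accept(new HashMap<>(buffer));
            buffer.clear();
        }
    }
}
```

Usage on the client would be along the lines of `new BatchingStreamer<>(500, cache::putAll)`, calling `add` per Kafka record and `flush` on shutdown.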
>>>>>>> class org.apache.ignite.IgniteException: Runtime failure on search row:
>>>>>>> Row@5b1479d6[ key: 171:1513946618964:3008806055072854,
>>>>>>> val: ru.synesis.kipod.event.KipodEvent [idHash=510912646, hash=-387621419,
>>>>>>> face_last_name=null, face_list_id=null, channel=171, source=,
>>>>>>> face_similarity=null, license_plate_number=null, descriptors=null,
>>>>>>> cacheName=kipod_events, cacheKey=171:1513946618964:3008806055072854,
>>>>>>> stream=171, alarm=false, processed_at=0, face_id=null, id=3008806055072854,
>>>>>>> persistent=false, face_first_name=null, license_plate_first_name=null,
>>>>>>> face_full_name=null, level=0, module=Kpx.Synesis.Outdoor,
>>>>>>> end_time=1513946624379, params=null, commented_at=0,
>>>>>>> tags=[vehicle, 0, human, 0, truck, 0], start_time=1513946618964,
>>>>>>> processed=false, kafka_offset=111259, license_plate_last_name=null,
>>>>>>> armed=false, license_plate_country=null, topic=MovingObject, comment=,
>>>>>>> expiration=1514033024000, original_id=null, license_plate_lists=null],
>>>>>>> ver: GridCacheVersion [topVer=125430590, order=1513955001926, nodeOrder=3] ][
>>>>>>> 3008806055072854, MovingObject, Kpx.Synesis.Outdoor, 0, , 1513946618964,
>>>>>>> 1513946624379, 171, 171, FALSE, FALSE, , FALSE, FALSE, 0, 0, 111259,
>>>>>>> 1514033024000, (vehicle, 0, human, 0, truck, 0), null, null, null, null,
>>>>>>> null, null, null, null, null, null, null, null ]
>>>>>>> at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.doRemove(BPlusTree.java:1787)
>>>>>>> at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.remove(BPlusTree.java:1578)
>>>>>>> at org.apache.ignite.internal.processors.query.h2.database.H2TreeIndex.remove(H2TreeIndex.java:216)
>>>>>>> at org.apache.ignite.internal.processors.query.h2.opt.GridH2Table.doUpdate(GridH2Table.java:496)
>>>>>>> at org.apache.ignite.internal.processors.query.h2.opt.GridH2Table.update(GridH2Table.java:423)
>>>>>>> at org.apache.ignite.internal.processors.query.h2.IgniteH2Indexing.remove(IgniteH2Indexing.java:580)
>>>>>>> at org.apache.ignite.internal.processors.query.GridQueryProcessor.remove(GridQueryProcessor.java:2334)
>>>>>>> at org.apache.ignite.internal.processors.cache.query.GridCacheQueryManager.remove(GridCacheQueryManager.java:461)
>>>>>>> at org.apache.ignite.internal.processors.cache.IgniteCacheOffheapManagerImpl$CacheDataStoreImpl.finishRemove(IgniteCacheOffheapManagerImpl.java:1453)
>>>>>>> at org.apache.ignite.internal.processors.cache.IgniteCacheOffheapManagerImpl$CacheDataStoreImpl.remove(IgniteCacheOffheapManagerImpl.java:1416)
>>>>>>> at org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager$GridCacheDataStore.remove(GridCacheOffheapManager.java:1271)
>>>>>>> at org.apache.ignite.internal.processors.cache.IgniteCacheOffheapManagerImpl.remove(IgniteCacheOffheapManagerImpl.java:374)
>>>>>>> at org.apache.ignite.internal.processors.cache.GridCacheMapEntry.removeValue(GridCacheMapEntry.java:3233)
>>>>>>> at org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtCacheEntry.clearInternal(GridDhtCacheEntry.java:588)
>>>>>>> at org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtLocalPartition.clearAll(GridDhtLocalPartition.java:951)
>>>>>>> at org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtLocalPartition.tryEvict(GridDhtLocalPartition.java:809)
>>>>>>> at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPreloader$3.call(GridDhtPreloader.java:593)
>>>>>>> at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPreloader$3.call(GridDhtPreloader.java:580)
>>>>>>> at org.apache.ignite.internal.util.IgniteUtils.wrapThreadLoader(IgniteUtils.java:6631)
>>>>>>> at org.apache.ignite.internal.processors.closure.GridClosureProcessor$2.body(GridClosureProcessor.java:967)
>>>>>>> at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:110)
>>>>>>> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>>>>>>> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>>>>>>> at java.lang.Thread.run(Thread.java:748)
>>>>>>> Caused by: java.lang.IllegalStateException: Failed to get page IO instance (page content is corrupted)
>>>>>>> at org.apache.ignite.internal.processors.cache.persistence.tree.io.IOVersions.forVersion(IOVersions.java:83)
>>>>>>> at org.apache.ignite.internal.processors.cache.persistence.tree.io.IOVersions.forPage(IOVersions.java:95)
>>>>>>> at org.apache.ignite.internal.processors.cache.persistence.CacheDataRowAdapter.initFromLink(CacheDataRowAdapter.java:148)
>>>>>>> at org.apache.ignite.internal.processors.cache.persistence.CacheDataRowAdapter.initFromLink(CacheDataRowAdapter.java:102)
>>>>>>> at org.apache.ignite.internal.processors.query.h2.database.H2RowFactory.getRow(H2RowFactory.java:62)
>>>>>>> at org.apache.ignite.internal.processors.query.h2.database.io.H2ExtrasLeafIO.getLookupRow(H2ExtrasLeafIO.java:126)
>>>>>>> at org.apache.ignite.internal.processors.query.h2.database.io.H2ExtrasLeafIO.getLookupRow(H2ExtrasLeafIO.java:36)
>>>>>>> at org.apache.ignite.internal.processors.query.h2.database.H2Tree.getRow(H2Tree.java:123)
>>>>>>> at org.apache.ignite.internal.processors.query.h2.database.H2Tree.getRow(H2Tree.java:40)
>>>>>>> at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.getRow(BPlusTree.java:4372)
>>>>>>> at org.apache.ignite.internal.processors.query.h2.database.H2Tree.compare(H2Tree.java:200)
>>>>>>> at org.apache.ignite.internal.processors.query.h2.database.H2Tree.compare(H2Tree.java:40)
>>>>>>> at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.compare(BPlusTree.java:4359)
>>>>>>> at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.findInsertionPoint(BPlusTree.java:4279)
>>>>>>> at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.access$1500(BPlusTree.java:81)
>>>>>>> at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree$Search.run0(BPlusTree.java:261)
>>>>>>> at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree$GetPageHandler.run(BPlusTree.java:4697)
>>>>>>> at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree$GetPageHandler.run(BPlusTree.java:4682)
>>>>>>> at org.apache.ignite.internal.processors.cache.persistence.tree.util.PageHandler.readPage(PageHandler.java:158)
>>>>>>> at org.apache.ignite.internal.processors.cache.persistence.DataStructure.read(DataStructure.java:319)
>>>>>>> at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.removeDown(BPlusTree.java:1823)
>>>>>>> at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.removeDown(BPlusTree.java:1842)
>>>>>>> at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.removeDown(BPlusTree.java:1842)
>>>>>>> at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.removeDown(BPlusTree.java:1842)
>>>>>>> at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.doRemove(BPlusTree.java:1752)
>>>>>>> ... 23 more
>>>>>>>
>>>>>>> After restart we also get this error. See *ignite-instance-2.log*.
>>>>>>>
>>>>>>> The *cache-config.xml* is used for *server* instances.
>>>>>>> The *ignite-common-cache-conf.xml* is used for *client* instances,
>>>>>>> which activate the cluster and stream data from Kafka into Ignite.
>>>>>>>
>>>>>>> *Is it possible to tune (or implement) native persistence in such a
>>>>>>> way that it just reports the error or the corrupted data, then
>>>>>>> skips it and continues to work without that corrupted part? That
>>>>>>> would let the cluster continue operating regardless of errors on
>>>>>>> storage.*
>>>>>>>
>>>>>>> Arseny Kovalchuk
>>>>>>>
>>>>>>> Senior Software Engineer at Synesis
>>>>>>> skype: arseny.kovalchuk
>>>>>>> mobile: +375 (29) 666-16-16
>>>>>>> LinkedIn Profile <http://www.linkedin.com/in/arsenykovalchuk/en>
>>>>>>>
>>>>>>> <ignite-instance-0.log><ignite-instance-1.log><ignite-instance-2.log>
>>>>>>> <ignite-instance-3.log><ignite-instance-4.log><cache-config.xml>
>>>>>>> <ignite-discovery-kubernetes.xml><ignite-common.xml>
>>>>>>> <ignite-common-storage.xml><ignite-common-entity.xml>
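[Editor's note on the question above: Ignite 2.3/2.4 has no supported way to skip corrupted pages and keep going. Later releases (2.5 and newer) added a pluggable failure handler, which does not repair or skip corrupted data but at least makes a node fail fast and visibly instead of hanging. A sketch, assuming Ignite 2.5+; the timeout value is illustrative:]

```xml
<bean class="org.apache.ignite.configuration.IgniteConfiguration">
    <!-- Reaction to critical failures such as persistence corruption
         (Ignite 2.5+). Stops the node gracefully if possible, otherwise
         halts the JVM, so the cluster does not hang on a broken node. -->
    <property name="failureHandler">
        <bean class="org.apache.ignite.failure.StopNodeOrHaltFailureHandler">
            <!-- tryStop: attempt a graceful node stop before halting. -->
            <constructor-arg value="true"/>
            <!-- timeout in ms for the graceful stop attempt -->
            <constructor-arg value="60000"/>
        </bean>
    </property>
</bean>
```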