Thanks Anthony. We have already enabled synchronous disk writes to minimize data loss in the event of a crash.

From: Anthony Baker <aba...@pivotal.io>
Reply-To: <user@geode.incubator.apache.org>
Date: Thursday, October 13, 2016 at 8:31 PM
To: <user@geode.incubator.apache.org>
Subject: Re: GemFire persisted data corruption - how to debug?

Hi Kapil,

Geode (by default) writes data synchronously to other cluster members. If a node crashes like in your test, the update is preserved by the cluster even in the absence of persistence. Synchronous disk writes can be turned on (see [1]) but many users prefer to avoid the fsync performance penalty.

Anthony

[1] https://cwiki.apache.org/confluence/display/GEODE/Native+Disk+Persistence

On Oct 13, 2016, at 6:46 PM, Kapil Goyal <goy...@vmware.com> wrote:

Hi Folks,

I am doing some crash testing with a single cache node of GemFire, where I power off the VM where cache is running and then bring it back up. Upon restart, GemFire refuses to come up with this error:

Caused by: java.lang.NullPointerException
    at com.gemstone.gemfire.internal.util.concurrent.CustomEntryConcurrentHashMap.keyHash(CustomEntryConcurrentHashMap.java:228) ~[gemfire-8.2.0.2.jar:?]
    at com.gemstone.gemfire.internal.cache.AbstractRegionEntry$HashRegionEntryCreator.keyHashCode(AbstractRegionEntry.java:934) ~[gemfire-8.2.0.2.jar:?]
    at com.gemstone.gemfire.internal.util.concurrent.CustomEntryConcurrentHashMap.get(CustomEntryConcurrentHashMap.java:1447) ~[gemfire-8.2.0.2.jar:?]
    at com.gemstone.gemfire.internal.cache.AbstractRegionMap.getEntry(AbstractRegionMap.java:368) ~[gemfire-8.2.0.2.jar:?]
    at com.gemstone.gemfire.internal.cache.AbstractLRURegionMap.getEntry(AbstractLRURegionMap.java:47) ~[gemfire-8.2.0.2.jar:?]
    at com.gemstone.gemfire.internal.cache.PlaceHolderDiskRegion.getDiskEntry(PlaceHolderDiskRegion.java:93) ~[gemfire-8.2.0.2.jar:?]
    at com.gemstone.gemfire.internal.cache.Oplog.readModifyEntry(Oplog.java:2779) ~[gemfire-8.2.0.2.jar:?]
    at com.gemstone.gemfire.internal.cache.Oplog.readCrf(Oplog.java:1957) ~[gemfire-8.2.0.2.jar:?]
    at com.gemstone.gemfire.internal.cache.Oplog.recoverCrf(Oplog.java:2270) ~[gemfire-8.2.0.2.jar:?]
    at com.gemstone.gemfire.internal.cache.PersistentOplogSet.recoverOplogs(PersistentOplogSet.java:459) ~[gemfire-8.2.0.2.jar:?]
    at com.gemstone.gemfire.internal.cache.PersistentOplogSet.recoverRegionsThatAreReady(PersistentOplogSet.java:367) ~[gemfire-8.2.0.2.jar:?]
    at com.gemstone.gemfire.internal.cache.DiskStoreImpl.recoverRegionsThatAreReady(DiskStoreImpl.java:2065) ~[gemfire-8.2.0.2.jar:?]
    at com.gemstone.gemfire.internal.cache.DiskStoreImpl.initializeIfNeeded(DiskStoreImpl.java:2052) ~[gemfire-8.2.0.2.jar:?]
    at com.gemstone.gemfire.internal.cache.DiskStoreImpl.doInitialRecovery(DiskStoreImpl.java:2057) ~[gemfire-8.2.0.2.jar:?]
    at com.gemstone.gemfire.internal.cache.DiskStoreFactoryImpl.create(DiskStoreFactoryImpl.java:135) ~[gemfire-8.2.0.2.jar:?]
    at com.gemstone.gemfire.internal.cache.xmlcache.CacheCreation.createDiskStore(CacheCreation.java:650) ~[gemfire-8.2.0.2.jar:?]
    at com.gemstone.gemfire.internal.cache.xmlcache.CacheCreation.create(CacheCreation.java:425) ~[gemfire-8.2.0.2.jar:?]
    at com.gemstone.gemfire.internal.cache.xmlcache.CacheXmlParser.create(CacheXmlParser.java:331) ~[gemfire-8.2.0.2.jar:?]
    at com.gemstone.gemfire.internal.cache.GemFireCacheImpl.loadCacheXml(GemFireCacheImpl.java:4248) ~[gemfire-8.2.0.2.jar:?]
    at org.springframework.data.gemfire.CacheFactoryBean.init(CacheFactoryBean.java:306) ~[spring-data-gemfire-1.5.2.RELEASE.jar:1.5.2.RELEASE]
    at org.springframework.data.gemfire.CacheFactoryBean.getObject(CacheFactoryBean.java:455) ~[spring-data-gemfire-1.5.2.RELEASE.jar:1.5.2.RELEASE]

It hints at GemFire data on disk being corrupted, so I used 'gfsh' to verify:

gfsh>validate offline-disk-store --name=nsxDiskStore --disk-dirs=/common/nsxapi/data/self
Validating nsxDiskStore
/nsx_sys/ArrayListIDPriorityModel: entryCount=0
/nsx_sys/Crl: entryCount=0
/nsx_sys/Certificate: entryCount=1
......
Error in validating disk store nsxDiskStore is : null

This confirms that the disk-store is corrupted, but doesn't give any more information to debug this further. How do I go about debugging this? Have you seen this before and are there any fixes/workarounds available?

Thanks
Kapil