Hi Michael,

Thanks for the suggestion, and I do agree with the FT and HA part. For now,
though, it's important for us to work with a single node and use GemFire as a
persistent store, somewhat as a replacement for a traditional database server
such as MySQL.

Regards
Kapil

From: Michael Stolz <mst...@pivotal.io>
Reply-To: "user@geode.incubator.apache.org" <user@geode.incubator.apache.org>
Date: Friday, October 14, 2016 at 2:00 PM
To: "user@geode.incubator.apache.org" <user@geode.incubator.apache.org>
Subject: Re: GemFire persisted data corruption - how to debug?

It is highly unusual to use Geode with just a single cache node. A big part of
the value of an In-Memory Data Grid is that it can provide fault tolerance and
high availability for your data. Please consider running at least 3 nodes in
your tests, as that would be the minimum real-world configuration in which
Geode would likely be used.
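
For reference, a minimal three-node test cluster can be brought up with gfsh
roughly as follows (just a sketch; the member names, ports, and locator
address here are hypothetical, not taken from this thread):

gfsh>start locator --name=locator1 --port=10334
gfsh>start server --name=server1 --locators=localhost[10334] --server-port=40401
gfsh>start server --name=server2 --locators=localhost[10334] --server-port=40402
gfsh>start server --name=server3 --locators=localhost[10334] --server-port=40403

With three members, replicated regions keep a full copy of the data on every
node and partitioned regions can be given redundant copies, so losing a single
VM does not take the data offline.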

--
Mike Stolz
Principal Engineer, GemFire Product Manager
Mobile: 631-835-4771

On Sat, Oct 15, 2016 at 4:59 AM, Kapil Goyal <goy...@vmware.com> wrote:
Thanks Anthony.

We have already enabled synchronous disk writes to minimize data loss in the
event of a crash.

From: Anthony Baker <aba...@pivotal.io>
Reply-To: <user@geode.incubator.apache.org>
Date: Thursday, October 13, 2016 at 8:31 PM
To: <user@geode.incubator.apache.org>
Subject: Re: GemFire persisted data corruption - how to debug?

Hi Kapil,

Geode (by default) writes data synchronously to other cluster members.  If a
node crashes, as in your test, the update is preserved by the cluster even in
the absence of persistence.  Synchronous disk writes can be turned on (see [1]),
but many users prefer to avoid the fsync performance penalty.

Anthony

[1] https://cwiki.apache.org/confluence/display/GEODE/Native+Disk+Persistence
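
As a rough sketch, one common way to turn on synchronous writes for a
persistent region is the disk-synchronous attribute in cache.xml; the
disk-store and region names below are hypothetical examples, and [1] covers
the details and any additional settings:

<!-- disk store backing the persistent region; adjust the directory as needed -->
<disk-store name="exampleDiskStore">
  <disk-dirs>
    <disk-dir>/path/to/disk/dir</disk-dir>
  </disk-dirs>
</disk-store>

<!-- disk-synchronous="true" makes each write go to the disk store in the
     calling thread before the operation returns, instead of being queued
     asynchronously -->
<region name="exampleRegion">
  <region-attributes data-policy="persistent-replicate"
                     disk-store-name="exampleDiskStore"
                     disk-synchronous="true"/>
</region>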

On Oct 13, 2016, at 6:46 PM, Kapil Goyal <goy...@vmware.com> wrote:

Hi Folks,

I am doing some crash testing with a single cache node of GemFire, where I
power off the VM where the cache is running and then bring it back up. Upon
restart, GemFire refuses to come up with this error:

Caused by: java.lang.NullPointerException
        at com.gemstone.gemfire.internal.util.concurrent.CustomEntryConcurrentHashMap.keyHash(CustomEntryConcurrentHashMap.java:228) ~[gemfire-8.2.0.2.jar:?]
        at com.gemstone.gemfire.internal.cache.AbstractRegionEntry$HashRegionEntryCreator.keyHashCode(AbstractRegionEntry.java:934) ~[gemfire-8.2.0.2.jar:?]
        at com.gemstone.gemfire.internal.util.concurrent.CustomEntryConcurrentHashMap.get(CustomEntryConcurrentHashMap.java:1447) ~[gemfire-8.2.0.2.jar:?]
        at com.gemstone.gemfire.internal.cache.AbstractRegionMap.getEntry(AbstractRegionMap.java:368) ~[gemfire-8.2.0.2.jar:?]
        at com.gemstone.gemfire.internal.cache.AbstractLRURegionMap.getEntry(AbstractLRURegionMap.java:47) ~[gemfire-8.2.0.2.jar:?]
        at com.gemstone.gemfire.internal.cache.PlaceHolderDiskRegion.getDiskEntry(PlaceHolderDiskRegion.java:93) ~[gemfire-8.2.0.2.jar:?]
        at com.gemstone.gemfire.internal.cache.Oplog.readModifyEntry(Oplog.java:2779) ~[gemfire-8.2.0.2.jar:?]
        at com.gemstone.gemfire.internal.cache.Oplog.readCrf(Oplog.java:1957) ~[gemfire-8.2.0.2.jar:?]
        at com.gemstone.gemfire.internal.cache.Oplog.recoverCrf(Oplog.java:2270) ~[gemfire-8.2.0.2.jar:?]
        at com.gemstone.gemfire.internal.cache.PersistentOplogSet.recoverOplogs(PersistentOplogSet.java:459) ~[gemfire-8.2.0.2.jar:?]
        at com.gemstone.gemfire.internal.cache.PersistentOplogSet.recoverRegionsThatAreReady(PersistentOplogSet.java:367) ~[gemfire-8.2.0.2.jar:?]
        at com.gemstone.gemfire.internal.cache.DiskStoreImpl.recoverRegionsThatAreReady(DiskStoreImpl.java:2065) ~[gemfire-8.2.0.2.jar:?]
        at com.gemstone.gemfire.internal.cache.DiskStoreImpl.initializeIfNeeded(DiskStoreImpl.java:2052) ~[gemfire-8.2.0.2.jar:?]
        at com.gemstone.gemfire.internal.cache.DiskStoreImpl.doInitialRecovery(DiskStoreImpl.java:2057) ~[gemfire-8.2.0.2.jar:?]
        at com.gemstone.gemfire.internal.cache.DiskStoreFactoryImpl.create(DiskStoreFactoryImpl.java:135) ~[gemfire-8.2.0.2.jar:?]
        at com.gemstone.gemfire.internal.cache.xmlcache.CacheCreation.createDiskStore(CacheCreation.java:650) ~[gemfire-8.2.0.2.jar:?]
        at com.gemstone.gemfire.internal.cache.xmlcache.CacheCreation.create(CacheCreation.java:425) ~[gemfire-8.2.0.2.jar:?]
        at com.gemstone.gemfire.internal.cache.xmlcache.CacheXmlParser.create(CacheXmlParser.java:331) ~[gemfire-8.2.0.2.jar:?]
        at com.gemstone.gemfire.internal.cache.GemFireCacheImpl.loadCacheXml(GemFireCacheImpl.java:4248) ~[gemfire-8.2.0.2.jar:?]
        at org.springframework.data.gemfire.CacheFactoryBean.init(CacheFactoryBean.java:306) ~[spring-data-gemfire-1.5.2.RELEASE.jar:1.5.2.RELEASE]
        at org.springframework.data.gemfire.CacheFactoryBean.getObject(CacheFactoryBean.java:455) ~[spring-data-gemfire-1.5.2.RELEASE.jar:1.5.2.RELEASE]

This hints at the GemFire data on disk being corrupted, so I used gfsh to verify:

gfsh>validate offline-disk-store --name=nsxDiskStore --disk-dirs=/common/nsxapi/data/self

Validating nsxDiskStore
/nsx_sys/ArrayListIDPriorityModel: entryCount=0
/nsx_sys/Crl: entryCount=0
/nsx_sys/Certificate: entryCount=1
……
Error in validating disk store nsxDiskStore is : null

This confirms that the disk store is corrupted, but it doesn't give any more
information to debug this further. How do I go about debugging this? Have you
seen this before, and are there any fixes or workarounds available?
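
For anyone hitting the same situation, a few other offline disk-store commands
might surface more detail (just a sketch; availability of each command can
vary by GemFire version, and the store name and directory are the same ones
used in the session above):

gfsh>describe offline-disk-store --name=nsxDiskStore --disk-dirs=/common/nsxapi/data/self
gfsh>export offline-disk-store --name=nsxDiskStore --disk-dirs=/common/nsxapi/data/self --dir=/tmp/nsxDiskStore-export
gfsh>compact offline-disk-store --name=nsxDiskStore --disk-dirs=/common/nsxapi/data/self

describe lists the regions recorded in the store, export writes the
recoverable data out to snapshot files, and compact rewrites the oplog files,
so any of them may either surface a clearer error or leave the store in a
state that starts cleanly.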

Thanks
Kapil

