Hi Andrey. Unfortunately, I couldn't copy all of the data off the file system to try to reproduce the issue locally or in our cluster. The crash was very likely caused by our underlying CEPH storage: we had problems with CEPH in the cluster at the same time, which could have corrupted the data. So, no results with OracleJDK.

On the other hand, we disabled backup copies of data (backups=0), taking into account the information from the JIRAs mentioned earlier, and we haven't had any severe issues with Ignite persistence so far.
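For reference, the programmatic equivalent of that setting is roughly the following (a minimal sketch; the cache name, key/value types and the rest of the configuration are illustrative, our real settings are in the configs attached earlier in the thread):

    import org.apache.ignite.Ignite;
    import org.apache.ignite.Ignition;
    import org.apache.ignite.configuration.CacheConfiguration;
    import org.apache.ignite.configuration.IgniteConfiguration;

    public class NoBackupsExample {
        public static void main(String[] args) {
            // Cache with no backup copies (backups=0); name and types are illustrative.
            CacheConfiguration<String, byte[]> eventsCfg = new CacheConfiguration<>("events");
            eventsCfg.setBackups(0);

            IgniteConfiguration cfg = new IgniteConfiguration();
            cfg.setCacheConfiguration(eventsCfg);

            Ignite ignite = Ignition.start(cfg);
        }
    }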
Arseny Kovalchuk

Senior Software Engineer at Synesis
skype: arseny.kovalchuk
mobile: +375 (29) 666-16-16
LinkedIn Profile <http://www.linkedin.com/in/arsenykovalchuk/en>

On 15 January 2018 at 17:50, Andrey Mashenkov <[email protected]> wrote:

> Hi Arseny,
>
> Have you had any success reproducing the issue and getting a stack trace?
> Do you observe the same behavior on OracleJDK?
>
> On Mon, Jan 15, 2018 at 5:50 PM, Andrey Mashenkov <[email protected]> wrote:
>
>> Hi Arseny,
>>
>> Have you had any success reproducing the issue and getting a stack trace?
>> Do you observe the same behavior on OracleJDK?
>>
>> On Tue, Dec 26, 2017 at 2:43 PM, Andrey Mashenkov <[email protected]> wrote:
>>
>>> Hi Arseny,
>>>
>>> This looks like a known issue that is still unresolved [1],
>>> but we can't be sure it is the same issue, as there is no stack trace in the attached logs.
>>>
>>> [1] https://issues.apache.org/jira/browse/IGNITE-7278
>>>
>>> On Tue, Dec 26, 2017 at 12:54 PM, Arseny Kovalchuk <[email protected]> wrote:
>>>
>>>> Hi guys.
>>>>
>>>> We've successfully tested Ignite as an in-memory solution, and it showed acceptable performance. But we cannot get an Ignite cluster to work stably with native persistence enabled. The first error we got was a segmentation fault (JVM crash) during memory restore on start:
>>>>
>>>> [2017-12-22 11:11:51,992] INFO [exchange-worker-#46%ignite-instance-0%] org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager: - Read checkpoint status [startMarker=/ignite-work-directory/db/ignite_instance_0/cp/1513938154201-8c574131-763d-4cfa-99b6-0ce0321d61ab-START.bin, endMarker=/ignite-work-directory/db/ignite_instance_0/cp/1513932413840-55ea1713-8e9e-44cd-b51a-fcad8fb94de1-END.bin]
>>>> [2017-12-22 11:11:51,993] INFO [exchange-worker-#46%ignite-instance-0%] org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager: - Checking memory state [lastValidPos=FileWALPointer [idx=391, fileOffset=220593830, len=19573, forceFlush=false], lastMarked=FileWALPointer [idx=394, fileOffset=38532201, len=19573, forceFlush=false], lastCheckpointId=8c574131-763d-4cfa-99b6-0ce0321d61ab]
>>>> [2017-12-22 11:11:51,993] WARN [exchange-worker-#46%ignite-instance-0%] org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager: - Ignite node stopped in the middle of checkpoint. Will restore memory state and finish checkpoint on node start.
>>>> [CodeBlob (0x00007f9b58f24110)]
>>>> Framesize: 0
>>>> BufferBlob (0x00007f9b58f24110) used for StubRoutines (2)
>>>> #
>>>> # A fatal error has been detected by the Java Runtime Environment:
>>>> #
>>>> # Internal Error (sharedRuntime.cpp:842), pid=221, tid=0x00007f9b473c1ae8
>>>> # fatal error: exception happened outside interpreter, nmethods and vtable stubs at pc 0x00007f9b58f248f6
>>>> #
>>>> # JRE version: OpenJDK Runtime Environment (8.0_151-b12) (build 1.8.0_151-b12)
>>>> # Java VM: OpenJDK 64-Bit Server VM (25.151-b12 mixed mode linux-amd64 compressed oops)
>>>> # Derivative: IcedTea 3.6.0
>>>> # Distribution: Custom build (Tue Nov 21 11:22:36 GMT 2017)
>>>> # Core dump written. Default location: /opt/ignite/core or core.221
>>>> #
>>>> # An error report file with more information is saved as:
>>>> # /ignite-work-directory/core_dump_221.log
>>>> #
>>>> # If you would like to submit a bug report, please include
>>>> # instructions on how to reproduce the bug and visit:
>>>> # http://icedtea.classpath.org/bugzilla
>>>> #
>>>>
>>>> Please find logs and configs attached.
>>>>
>>>> We deploy Ignite along with our services in Kubernetes (v1.8) on premises. The Ignite cluster is a StatefulSet of 5 Pods (5 instances) of Ignite version 2.3. Each Pod mounts a PersistentVolume backed by CEPH RBD.
>>>>
>>>> We put about 230 events/second into Ignite; 70% of the events are ~200KB in size and 30% are 5000KB. The smaller events have indexed fields, and we query them via SQL.
>>>>
>>>> The cluster is activated from a client node which also streams events into Ignite from Kafka. We use a custom streamer implementation based on the cache.putAll() API.
>>>>
>>>> We got the error when we stopped the cluster and restarted it again. It happened on only one instance.
>>>>
>>>> The general question is:
>>>>
>>>> *Is it possible to tune (or implement) native persistence so that, when it encounters a data error or corrupted data, it just reports the error, skips the corrupted part, and continues to work? That would let the cluster keep operating regardless of errors on storage.*
>>>>
>>>> Arseny Kovalchuk
>>>>
>>>> Senior Software Engineer at Synesis
>>>> skype: arseny.kovalchuk
>>>> mobile: +375 (29) 666-16-16
>>>> LinkedIn Profile <http://www.linkedin.com/in/arsenykovalchuk/en>
>>>
>>> --
>>> Best regards,
>>> Andrey V. Mashenkov
>>
>> --
>> Best regards,
>> Andrey V. Mashenkov
>
> --
> Best regards,
> Andrey V. Mashenkov
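P.S. For completeness, since the custom streamer keeps coming up: it is essentially a batched cache.putAll() loop, roughly like the sketch below (simplified; the batch size, cache name and the Iterator-based event source are illustrative stand-ins, the real implementation consumes from Kafka):

    import java.util.HashMap;
    import java.util.Iterator;
    import java.util.Map;
    import org.apache.ignite.Ignite;
    import org.apache.ignite.IgniteCache;

    public class PutAllStreamer {
        private static final int BATCH_SIZE = 500; // illustrative value

        // Drains key/value pairs from a source (in our case a Kafka consumer,
        // abstracted here as an Iterator) and writes them in bulk batches.
        static void stream(Ignite ignite, Iterator<Map.Entry<String, byte[]>> source) {
            IgniteCache<String, byte[]> cache = ignite.cache("events"); // illustrative name
            Map<String, byte[]> batch = new HashMap<>();
            while (source.hasNext()) {
                Map.Entry<String, byte[]> e = source.next();
                batch.put(e.getKey(), e.getValue());
                if (batch.size() >= BATCH_SIZE) {
                    cache.putAll(batch); // one bulk write instead of per-entry put()
                    batch.clear();
                }
            }
            if (!batch.isEmpty())
                cache.putAll(batch); // flush the remainder
        }
    }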
