Hi Andrey.

Unfortunately, I couldn't copy all the data from the file system to try
reproducing the issue locally or in our cluster. That was very likely due
to issues with the underlying CEPH storage; we were having problems with
CEPH in our cluster at the same time, which may have caused the data
corruption. So, no results with OracleJDK yet.

On the other hand, we disabled backup copies of data (backups=0, taking
into account the information from the JIRAs mentioned earlier), and we
haven't had any severe issues with Ignite persistence so far.



Arseny Kovalchuk

Senior Software Engineer at Synesis
skype: arseny.kovalchuk
mobile: +375 (29) 666-16-16
LinkedIn Profile <http://www.linkedin.com/in/arsenykovalchuk/en>

On 15 January 2018 at 17:50, Andrey Mashenkov <[email protected]>
wrote:

> Hi Arseny,
>
> Have you had any success reproducing the issue and getting a stacktrace?
> Do you observe the same behavior on OracleJDK?
>
> On Mon, Jan 15, 2018 at 5:50 PM, Andrey Mashenkov <
> [email protected]> wrote:
>
>> Hi Arseny,
>>
>> Have you had any success reproducing the issue and getting a stacktrace?
>> Do you observe the same behavior on OracleJDK?
>>
>> On Tue, Dec 26, 2017 at 2:43 PM, Andrey Mashenkov <
>> [email protected]> wrote:
>>
>>> Hi Arseny,
>>>
>>> This looks like a known issue that is still unresolved [1],
>>> but we can't be sure it is the same issue, as there is no stacktrace in
>>> the attached logs.
>>>
>>>
>>> [1] https://issues.apache.org/jira/browse/IGNITE-7278
>>>
>>> On Tue, Dec 26, 2017 at 12:54 PM, Arseny Kovalchuk <
>>> [email protected]> wrote:
>>>
>>>> Hi guys.
>>>>
>>>> We've successfully tested Ignite as an in-memory solution, and it showed
>>>> acceptable performance. But we cannot get the Ignite cluster to work
>>>> stably with native persistence enabled. The first error we got is a
>>>> segmentation fault (JVM crash) while restoring memory on start.
>>>>
>>>> [2017-12-22 11:11:51,992]  INFO [exchange-worker-#46%ignite-instance-0%]
>>>> org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager:
>>>> - Read checkpoint status [startMarker=/ignite-work-directory/db/ignite_instance_0/cp/1513938154201-8c574131-763d-4cfa-99b6-0ce0321d61ab-START.bin, endMarker=/ignite-work-directory/db/ignite_instance_0/cp/1513932413840-55ea1713-8e9e-44cd-b51a-fcad8fb94de1-END.bin]
>>>> [2017-12-22 11:11:51,993]  INFO [exchange-worker-#46%ignite-instance-0%]
>>>> org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager:
>>>> - Checking memory state [lastValidPos=FileWALPointer [idx=391, fileOffset=220593830, len=19573, forceFlush=false], lastMarked=FileWALPointer [idx=394, fileOffset=38532201, len=19573, forceFlush=false], lastCheckpointId=8c574131-763d-4cfa-99b6-0ce0321d61ab]
>>>> [2017-12-22 11:11:51,993]  WARN [exchange-worker-#46%ignite-instance-0%]
>>>> org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager:
>>>> - Ignite node stopped in the middle of checkpoint. Will restore memory
>>>> state and finish checkpoint on node start.
>>>> [CodeBlob (0x00007f9b58f24110)]
>>>> Framesize: 0
>>>> BufferBlob (0x00007f9b58f24110) used for StubRoutines (2)
>>>> #
>>>> # A fatal error has been detected by the Java Runtime Environment:
>>>> #
>>>> #  Internal Error (sharedRuntime.cpp:842), pid=221,
>>>> tid=0x00007f9b473c1ae8
>>>> #  fatal error: exception happened outside interpreter, nmethods and
>>>> vtable stubs at pc 0x00007f9b58f248f6
>>>> #
>>>> # JRE version: OpenJDK Runtime Environment (8.0_151-b12) (build
>>>> 1.8.0_151-b12)
>>>> # Java VM: OpenJDK 64-Bit Server VM (25.151-b12 mixed mode linux-amd64
>>>> compressed oops)
>>>> # Derivative: IcedTea 3.6.0
>>>> # Distribution: Custom build (Tue Nov 21 11:22:36 GMT 2017)
>>>> # Core dump written. Default location: /opt/ignite/core or core.221
>>>> #
>>>> # An error report file with more information is saved as:
>>>> # /ignite-work-directory/core_dump_221.log
>>>> #
>>>> # If you would like to submit a bug report, please include
>>>> # instructions on how to reproduce the bug and visit:
>>>> #   http://icedtea.classpath.org/bugzilla
>>>> #
>>>>
>>>>
>>>>
>>>> Please find logs and configs attached.
>>>>
>>>> We deploy Ignite along with our services in Kubernetes (v1.8) on
>>>> premises. The Ignite cluster is a StatefulSet of 5 Pods (5 instances) of
>>>> Ignite version 2.3. Each Pod mounts a PersistentVolume backed by CEPH RBD.
>>>>
>>>> We put about 230 events/second into Ignite; 70% of the events are ~200KB
>>>> in size and 30% are 5000KB. The smaller events have indexed fields, and
>>>> we query them via SQL.
>>>>
>>>> The cluster is activated from a client node, which also streams events
>>>> into Ignite from Kafka. We use a custom streamer implementation based on
>>>> the cache.putAll() API.
>>>>
>>>> We got the error when we stopped and then restarted the cluster. It
>>>> happened on only one instance.
>>>>
>>>> The general question is:
>>>>
>>>> *Is it possible to tune (or implement) native persistence so that it
>>>> just reports an error about corrupted data, skips the corrupted part,
>>>> and continues to work without it, thus allowing the cluster to keep
>>>> operating regardless of errors on storage?*
>>>>
>>>>
>>>> Arseny Kovalchuk
>>>>
>>>> Senior Software Engineer at Synesis
>>>> skype: arseny.kovalchuk
>>>> mobile: +375 (29) 666-16-16
>>>> LinkedIn Profile <http://www.linkedin.com/in/arsenykovalchuk/en>
>>>>
>>>
>>>
>>>
>>> --
>>> Best regards,
>>> Andrey V. Mashenkov
>>>
>>
>>
>>
>> --
>> Best regards,
>> Andrey V. Mashenkov
>>
>
>
>
> --
> Best regards,
> Andrey V. Mashenkov
>
