Hi all,

Can somebody please provide some pointers on what could be the issue or how to debug it? We have a fairly large Ignite use case, but cannot go ahead with a POC because of these crashes.
Cheers,
Eugene

On Fri, Aug 31, 2018 at 11:52 AM eugene miretsky <[email protected]> wrote:

> Also, I don't want to spam the mailing list with more threads, but I get the
> same stability issue when writing to Ignite from Spark. The log file from the
> crashed node (not the same node as before, probably random) is attached.
>
> I have also attached a GC log from another node (I have GC logging
> enabled on only one node).
>
> On Fri, Aug 31, 2018 at 11:23 AM eugene miretsky <[email protected]> wrote:
>
>> Thanks Denis,
>>
>> The execution plan and all logs from right after the crash are attached.
>>
>> Cheers,
>> Eugene
>> nohup.out
>> <https://drive.google.com/file/d/10TvQOYgOAJpedrTw3IQ5ABwlW5QpK1Bc/view?usp=drive_web>
>>
>> On Fri, Aug 31, 2018 at 1:53 AM Denis Magda <[email protected]> wrote:
>>
>>> Eugene,
>>>
>>> Please share full logs from all the nodes and the execution plan for the
>>> query; that's what the community usually needs to help with
>>> troubleshooting. Also attach GC logs. Use these settings to gather them:
>>> https://apacheignite.readme.io/docs/jvm-and-system-tuning#section-detailed-garbage-collection-stats
>>>
>>> --
>>> Denis
>>>
>>> On Thu, Aug 30, 2018 at 3:19 PM eugene miretsky <[email protected]> wrote:
>>>
>>>> Hello,
>>>>
>>>> I have a medium-sized cluster set up for testing: 3 x r4.8xlarge EC2 nodes.
>>>> It has persistence enabled and zero backups.
>>>> - Full configs are attached.
>>>> - JVM settings are: JVM_OPTS="-Xms16g -Xmx64g -server
>>>> -XX:+AggressiveOpts -XX:MaxMetaspaceSize=256m -XX:+AlwaysPreTouch
>>>> -XX:+UseG1GC -XX:+ScavengeBeforeFullGC -XX:+DisableExplicitGC"
>>>>
>>>> The table has 145M rows and takes up about 180 GB of memory.
>>>> I am testing two things:
>>>> 1) Writing SQL tables from Spark
>>>> 2) Performing large SQL queries (from the web console), for example:
>>>>
>>>>     SELECT COUNT(*) FROM (
>>>>       SELECT customer_id FROM MyTable
>>>>       WHERE dt > '2018-05-12'
>>>>       GROUP BY customer_id
>>>>       HAVING SUM(column1) > 2 AND MAX(column2) < 1)
>>>>
>>>> Most of the time the query fails after one of the nodes crashes (it has
>>>> finished a few times, then crashed on the next run). I have similar
>>>> stability issues when writing from Spark: at some point, one of the
>>>> nodes crashes. All I can see in the logs is:
>>>>
>>>> [21:51:58,548][SEVERE][disco-event-worker-#101%Server%][] Critical
>>>> system error detected. Will be handled accordingly to configured handler
>>>> [hnd=class o.a.i.failure.StopNodeFailureHandler, failureCtx=FailureContext
>>>> [type=SEGMENTATION, err=null]]
>>>>
>>>> [21:51:58,549][SEVERE][disco-event-worker-#101%Server%][FailureProcessor]
>>>> Ignite node is in invalid state due to a critical failure.
>>>>
>>>> [21:51:58,549][SEVERE][node-stopper][] Stopping local node on Ignite
>>>> failure: [failureCtx=FailureContext [type=SEGMENTATION, err=null]]
>>>>
>>>> [21:52:03] Ignite node stopped OK [name=Server, uptime=00:07:06.780]
>>>>
>>>> My questions are:
>>>> 1) What is causing the issue?
>>>> 2) How can I debug it better?
>>>>
>>>> The rate of crashes and our inability to debug them is becoming
>>>> quite a concern.
>>>>
>>>> Cheers,
>>>> Eugene
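For anyone following along: on JDK 8 (which the -XX:+AggressiveOpts flag above implies), the detailed GC stats that Denis's linked page asks for can be enabled with flags along these lines, appended to the same JVM_OPTS variable. The log path and rotation sizes here are illustrative assumptions, not values from the thread:

```shell
# Sketch only: standard HotSpot (JDK 8) GC-logging flags.
# /var/log/ignite/gc.log is an example path; adjust for your deployment.
JVM_OPTS="$JVM_OPTS \
  -Xloggc:/var/log/ignite/gc.log \
  -XX:+PrintGCDetails \
  -XX:+PrintGCDateStamps \
  -XX:+PrintGCApplicationStoppedTime \
  -XX:+UseGCLogFileRotation \
  -XX:NumberOfGCLogFiles=10 \
  -XX:GCLogFileSize=100M"
```

-XX:+PrintGCApplicationStoppedTime is particularly useful here, since it records total stop-the-world pause time, which is what typically matters for discovery timeouts.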

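One more note on the failure type above: SEGMENTATION means the node was dropped from the discovery ring, which commonly happens when a GC pause outlasts the cluster's failure-detection timeout. A minimal Spring XML sketch of raising that timeout via IgniteConfiguration.failureDetectionTimeout is below; the 30-second value is an example, not a recommendation:

```xml
<bean class="org.apache.ignite.configuration.IgniteConfiguration">
    <!-- Illustrative: allow longer pauses before a node is considered failed.
         The default is 10 000 ms; 30 000 ms here is an example value only. -->
    <property name="failureDetectionTimeout" value="30000"/>
</bean>
```

Raising the timeout only masks the symptom, of course; the GC logs should show whether long pauses are actually occurring.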