Hello! I have filed this ticket: https://issues.apache.org/jira/browse/IGNITE-9586
Hope that it eventually gets looked at by somebody in context.

Regards,
--
Ilya Kasnacheev


Wed, 12 Sep 2018 at 22:10, eugene miretsky <[email protected]>:

> Good question :)
> Yardstick does this, but I am not sure it is a valid prod solution:
>
> https://github.com/apache/ignite/blob/3307a8b26ccb5f0bb7e9c387c73fd221b98ab668/modules/yardstick/src/main/java/org/apache/ignite/yardstick/jdbc/AbstractJdbcBenchmark.java
>
> We have set preferIPv4Stack=true and provided localAddress in the config -
> it seems to have solved the problem. (We didn't run it enough to be 100% sure.)
>
> On Wed, Sep 12, 2018 at 10:59 AM Ilya Kasnacheev <[email protected]> wrote:
>
>> Hello!
>>
>> How would you distinguish the wrong interface (172.17.0.1) from the
>> right one if you were Ignite?
>>
>> I think this is not the first time I have seen this problem, but I have
>> positively no idea how to tackle it.
>> Maybe Docker experts could chime in?
>>
>> Regards,
>> --
>> Ilya Kasnacheev
>>
>>
>> Wed, 12 Sep 2018 at 3:29, eugene miretsky <[email protected]>:
>>
>>> Thanks Ilya,
>>>
>>> We are writing to Ignite from Spark running in EMR. We don't know the
>>> address of the node in advance. We have tried:
>>> 1) Setting localHost in the Ignite configuration to 127.0.0.1, as per
>>> the example online.
>>> 2) Leaving localHost unset and letting Ignite figure out the host.
>>>
>>> I have attached more logs at the end.
>>>
>>> My understanding is that Ignite should pick the first non-local address
>>> to publish; however, it seems to randomly pick one of (a) the proper
>>> address, (b) an IPv6 address, (c) 127.0.0.1, (d) 172.17.0.1.
>>>
>>> A few questions:
>>> 1) How do we force the Spark client to use the proper address?
>>> 2) Where is 172.17.0.1 coming from? It is usually the default Docker
>>> network host address, and it seems like Ignite creates a network interface
>>> for it on the instance (otherwise I have no idea where the interface is
>>> coming from).
>>> 3) If there are communication errors, shouldn't the ZooKeeper split-brain
>>> resolver kick in and shut down the dead node? Or shouldn't at least the
>>> initiating node mark the remote node as dead?
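>>>
>>> For reference, the localHost knob from (1) and (2) looks roughly like
>>> this programmatically (a sketch only - we actually provide it through
>>> the attached config, so take the Java form as illustrative):
>>>
>>> import org.apache.ignite.Ignite;
>>> import org.apache.ignite.Ignition;
>>> import org.apache.ignite.configuration.IgniteConfiguration;
>>>
>>> IgniteConfiguration cfg = new IgniteConfiguration();
>>> // Attempt (1): pin the published address to loopback, per the
>>> // example online.
>>> cfg.setLocalHost("127.0.0.1");
>>> // Attempt (2): simply leave localHost unset and let Ignite pick
>>> // an address on its own.
>>> Ignite ignite = Ignition.start(cfg);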
>>>
>>> [19:36:26,189][INFO][grid-nio-worker-tcp-comm-15-#88%Server%][TcpCommunicationSpi] Accepted incoming communication connection [locAddr=/172.17.0.1:47100, rmtAddr=/172.21.86.7:41648]
>>> [19:36:26,190][INFO][grid-nio-worker-tcp-comm-3-#76%Server%][TcpCommunicationSpi] Accepted incoming communication connection [locAddr=/0:0:0:0:0:0:0:1:47100, rmtAddr=/0:0:0:0:0:0:0:1:52484]
>>> [19:36:26,191][INFO][grid-nio-worker-tcp-comm-5-#78%Server%][TcpCommunicationSpi] Accepted incoming communication connection [locAddr=/127.0.0.1:47100, rmtAddr=/127.0.0.1:37656]
>>> [19:36:26,191][INFO][grid-nio-worker-tcp-comm-1-#74%Server%][TcpCommunicationSpi] Established outgoing communication connection [locAddr=/172.21.86.7:53272, rmtAddr=ip-172-21-86-175.ap-south-1.compute.internal/172.21.86.175:47100]
>>> [19:36:26,191][INFO][grid-nio-worker-tcp-comm-0-#73%Server%][TcpCommunicationSpi] Established outgoing communication connection [locAddr=/172.17.0.1:41648, rmtAddr=ip-172-17-0-1.ap-south-1.compute.internal/172.17.0.1:47100]
>>> [19:36:26,193][INFO][grid-nio-worker-tcp-comm-4-#77%Server%][TcpCommunicationSpi] Established outgoing communication connection [locAddr=/127.0.0.1:37656, rmtAddr=/127.0.0.1:47100]
>>> [19:36:26,193][INFO][grid-nio-worker-tcp-comm-2-#75%Server%][TcpCommunicationSpi] Established outgoing communication connection [locAddr=/0:0:0:0:0:0:0:1:52484, rmtAddr=/0:0:0:0:0:0:0:1%lo:47100]
>>> [19:36:26,195][INFO][grid-nio-worker-tcp-comm-8-#81%Server%][TcpCommunicationSpi] Accepted incoming communication connection [locAddr=/172.17.0.1:47100, rmtAddr=/172.21.86.7:41656]
>>> [19:36:26,195][INFO][grid-nio-worker-tcp-comm-10-#83%Server%][TcpCommunicationSpi] Accepted incoming communication connection [locAddr=/0:0:0:0:0:0:0:1:47100, rmtAddr=/0:0:0:0:0:0:0:1:52492]
>>> [19:36:26,195][INFO][grid-nio-worker-tcp-comm-12-#85%Server%][TcpCommunicationSpi] Accepted incoming communication connection [locAddr=/127.0.0.1:47100, rmtAddr=/127.0.0.1:37664]
>>> [19:36:26,196][INFO][grid-nio-worker-tcp-comm-7-#80%Server%][TcpCommunicationSpi] Established outgoing communication connection [locAddr=/172.21.86.7:41076, rmtAddr=ip-172-21-86-229.ap-south-1.compute.internal/172.21.86.229:47100]
>>>
>>> On Mon, Sep 10, 2018 at 12:04 PM Ilya Kasnacheev <[email protected]> wrote:
>>>
>>>> Hello!
>>>>
>>>> I can see a lot of errors like this one:
>>>>
>>>> [04:05:29,268][INFO][tcp-comm-worker-#1%Server%][ZookeeperDiscoveryImpl] Created new communication error process future [errNode=598e3ead-99b8-4c49-b7df-04d578dcbf5f, err=class org.apache.ignite.IgniteCheckedException: Failed to connect to node (is node still alive?). Make sure that each ComputeTask and cache Transaction has a timeout set in order to prevent parties from waiting forever in case of network issues [nodeId=598e3ead-99b8-4c49-b7df-04d578dcbf5f, addrs=[ip-172-17-0-1.ap-south-1.compute.internal/172.17.0.1:47100, ip-172-21-85-213.ap-south-1.compute.internal/172.21.85.213:47100, /0:0:0:0:0:0:0:1%lo:47100, /127.0.0.1:47100]]]
>>>>
>>>> I think the problem is that both of your nodes have a 172.17.0.1
>>>> address, but they are different addresses (totally unrelated private
>>>> nets).
>>>>
>>>> Try specifying your external address (such as 172.21.85.213) with
>>>> TcpCommunicationSpi.setLocalAddress() on each node.
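>>>>
>>>> A minimal sketch of what I mean (each node should set its own external
>>>> address; the value below is just the example from your logs):
>>>>
>>>> import org.apache.ignite.configuration.IgniteConfiguration;
>>>> import org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi;
>>>>
>>>> IgniteConfiguration cfg = new IgniteConfiguration();
>>>> TcpCommunicationSpi commSpi = new TcpCommunicationSpi();
>>>> // Publish only the external address for communication, so peers
>>>> // never try the unrelated 172.17.0.1 Docker bridge address.
>>>> commSpi.setLocalAddress("172.21.85.213");
>>>> cfg.setCommunicationSpi(commSpi);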
>>>>
>>>> Regards,
>>>> --
>>>> Ilya Kasnacheev
>>>>
>>>>
>>>> Fri, 7 Sep 2018 at 20:01, eugene miretsky <[email protected]>:
>>>>
>>>>> Hi all,
>>>>>
>>>>> Can somebody please provide some pointers on what could be the issue
>>>>> or how to debug it? We have a fairly large Ignite use case but cannot
>>>>> go ahead with a POC because of these crashes.
>>>>>
>>>>> Cheers,
>>>>> Eugene
>>>>>
>>>>> On Fri, Aug 31, 2018 at 11:52 AM eugene miretsky <[email protected]> wrote:
>>>>>
>>>>>> Also, I don't want to spam the mailing list with more threads, but I
>>>>>> get the same stability issue when writing to Ignite from Spark. The
>>>>>> logfile from the crashed node (not the same node as before; probably
>>>>>> random) is attached.
>>>>>>
>>>>>> I have also attached a GC log from another node (I have GC logging
>>>>>> enabled on only one node).
>>>>>>
>>>>>> On Fri, Aug 31, 2018 at 11:23 AM eugene miretsky <[email protected]> wrote:
>>>>>>
>>>>>>> Thanks Denis,
>>>>>>>
>>>>>>> The execution plan and all logs from right after the crash are
>>>>>>> attached.
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Eugene
>>>>>>>
>>>>>>> nohup.out
>>>>>>> <https://drive.google.com/file/d/10TvQOYgOAJpedrTw3IQ5ABwlW5QpK1Bc/view?usp=drive_web>
>>>>>>>
>>>>>>> On Fri, Aug 31, 2018 at 1:53 AM Denis Magda <[email protected]> wrote:
>>>>>>>
>>>>>>>> Eugene,
>>>>>>>>
>>>>>>>> Please share full logs from all the nodes and the execution plan
>>>>>>>> for the query. That's what the community usually needs to help with
>>>>>>>> troubleshooting. Also, attach GC logs. Use these settings to gather
>>>>>>>> them:
>>>>>>>> https://apacheignite.readme.io/docs/jvm-and-system-tuning#section-detailed-garbage-collection-stats
>>>>>>>>
>>>>>>>> --
>>>>>>>> Denis
>>>>>>>>
>>>>>>>> On Thu, Aug 30, 2018 at 3:19 PM eugene miretsky <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hello,
>>>>>>>>>
>>>>>>>>> I have a medium cluster set up for testing - 3 x r4.8xlarge EC2
>>>>>>>>> nodes. It has persistence enabled and zero backups.
>>>>>>>>> - Full configs are attached.
>>>>>>>>> - JVM settings are: JVM_OPTS="-Xms16g -Xmx64g -server
>>>>>>>>> -XX:+AggressiveOpts -XX:MaxMetaspaceSize=256m -XX:+AlwaysPreTouch
>>>>>>>>> -XX:+UseG1GC -XX:+ScavengeBeforeFullGC -XX:+DisableExplicitGC"
>>>>>>>>>
>>>>>>>>> The table has 145M rows and takes up about 180G of memory.
>>>>>>>>> I am testing 2 things:
>>>>>>>>> 1) Writing SQL tables from Spark.
>>>>>>>>> 2) Performing large SQL queries (from the web console), for example:
>>>>>>>>> SELECT COUNT(*) FROM (SELECT customer_id FROM MyTable
>>>>>>>>> WHERE dt > '2018-05-12' GROUP BY customer_id
>>>>>>>>> HAVING SUM(column1) > 2 AND MAX(column2) < 1)
>>>>>>>>>
>>>>>>>>> Most of the time the query fails after one of the nodes crashes
>>>>>>>>> (it has finished a few times, then crashed on the next run). I see
>>>>>>>>> similar stability issues when writing from Spark - at some point,
>>>>>>>>> one of the nodes crashes. All I can see in the logs is:
>>>>>>>>>
>>>>>>>>> [21:51:58,548][SEVERE][disco-event-worker-#101%Server%][] Critical system error detected. Will be handled accordingly to configured handler [hnd=class o.a.i.failure.StopNodeFailureHandler, failureCtx=FailureContext [type=SEGMENTATION, err=null]]
>>>>>>>>>
>>>>>>>>> [21:51:58,549][SEVERE][disco-event-worker-#101%Server%][FailureProcessor] Ignite node is in invalid state due to a critical failure.
>>>>>>>>>
>>>>>>>>> [21:51:58,549][SEVERE][node-stopper][] Stopping local node on Ignite failure: [failureCtx=FailureContext [type=SEGMENTATION, err=null]]
>>>>>>>>>
>>>>>>>>> [21:52:03] Ignite node stopped OK [name=Server, uptime=00:07:06.780]
>>>>>>>>>
>>>>>>>>> My questions are:
>>>>>>>>> 1) What is causing the issue?
>>>>>>>>> 2) How can I debug it better?
>>>>>>>>>
>>>>>>>>> The rate of crashes, and our lack of ability to debug them, is
>>>>>>>>> becoming quite a concern.
>>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>> Eugene
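P.S. Since the connection error quoted above says "Make sure that each
ComputeTask and cache Transaction has a timeout set", here is a rough
sketch of the relevant settings. This is a sketch, not a fix for the
segmentation itself, and the timeout values are arbitrary placeholders
worth tuning for your environment:

import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.configuration.TransactionConfiguration;
import org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi;

IgniteConfiguration cfg = new IgniteConfiguration();

// Fail fast when a peer is unreachable instead of waiting indefinitely.
TcpCommunicationSpi commSpi = new TcpCommunicationSpi();
commSpi.setConnectTimeout(10_000); // milliseconds
cfg.setCommunicationSpi(commSpi);

// Default transaction timeout, so transactions cannot block forever
// when a node drops out mid-operation.
TransactionConfiguration txCfg = new TransactionConfiguration();
txCfg.setDefaultTxTimeout(30_000); // milliseconds
cfg.setTransactionConfiguration(txCfg);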
