Hello! How would you distinguish the wrong interface (172.17.0.1) from the right one if you were Ignite?

I think this is not the first time I have seen this problem, but I have positively no idea how to tackle it. Maybe Docker experts could chime in?

Regards,
--
Ilya Kasnacheev

On Wed, Sep 12, 2018 at 3:29, eugene miretsky <[email protected]> wrote:

> Thanks Ilya,
>
> We are writing to Ignite from Spark running in EMR. We don't know the
> address of the node in advance. We have tried:
> 1) Setting localHost in the Ignite configuration to 127.0.0.1, as per the example online
> 2) Leaving localHost unset and letting Ignite figure out the host
>
> I have attached more logs at the end.
>
> My understanding is that Ignite should pick the first non-local address to
> publish; however, it seems to pick randomly among (a) the proper address,
> (b) an IPv6 address, (c) 127.0.0.1, and (d) 172.17.0.1.
>
> A few questions:
> 1) How do we force the Spark client to use the proper address?
> 2) Where is 172.17.0.1 coming from? It is usually the default Docker
> network host address, and it seems that Ignite creates a network interface
> for it on the instance (otherwise I have no idea where the interface is
> coming from).
> 3) If there are communication errors, shouldn't the ZooKeeper split-brain
> resolver kick in and shut down the dead node? Or shouldn't at least the
> initiating node mark the remote node as dead?
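[Editor's note: the four address kinds Eugene lists, (a) through (d), can be illustrated with a small sketch of a preference rule — prefer a routable IPv4 address, skip loopback and IPv6, and treat the default Docker bridge subnet (172.17.0.0/16) only as a last resort. `PickAddress` and `pick` are hypothetical names for illustration; this is NOT Ignite's actual selection logic.]

```java
import java.util.Arrays;
import java.util.List;

public class PickAddress {
    // Hypothetical helper: choose a publishable address from the candidates,
    // skipping IPv6 and loopback, and demoting the docker0 bridge subnet.
    static String pick(List<String> candidates) {
        String dockerFallback = null;
        for (String addr : candidates) {
            if (addr.contains(":"))        continue;  // skip IPv6 literals
            if (addr.startsWith("127."))   continue;  // skip loopback
            if (addr.startsWith("172.17.")) {         // default Docker bridge net
                dockerFallback = addr;
                continue;
            }
            return addr;  // first routable IPv4 address wins
        }
        return dockerFallback != null ? dockerFallback : "127.0.0.1";
    }

    public static void main(String[] args) {
        // The four address kinds observed in the logs below:
        List<String> seen = Arrays.asList(
            "127.0.0.1", "0:0:0:0:0:0:0:1", "172.17.0.1", "172.21.86.7");
        System.out.println(pick(seen));  // prints 172.21.86.7
    }
}
```

Under this rule the node would always publish 172.21.86.7 rather than randomly surfacing the loopback, IPv6, or docker0 addresses.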
> [19:36:26,189][INFO][grid-nio-worker-tcp-comm-15-#88%Server%][TcpCommunicationSpi] Accepted incoming communication connection [locAddr=/172.17.0.1:47100, rmtAddr=/172.21.86.7:41648]
> [19:36:26,190][INFO][grid-nio-worker-tcp-comm-3-#76%Server%][TcpCommunicationSpi] Accepted incoming communication connection [locAddr=/0:0:0:0:0:0:0:1:47100, rmtAddr=/0:0:0:0:0:0:0:1:52484]
> [19:36:26,191][INFO][grid-nio-worker-tcp-comm-5-#78%Server%][TcpCommunicationSpi] Accepted incoming communication connection [locAddr=/127.0.0.1:47100, rmtAddr=/127.0.0.1:37656]
> [19:36:26,191][INFO][grid-nio-worker-tcp-comm-1-#74%Server%][TcpCommunicationSpi] Established outgoing communication connection [locAddr=/172.21.86.7:53272, rmtAddr=ip-172-21-86-175.ap-south-1.compute.internal/172.21.86.175:47100]
> [19:36:26,191][INFO][grid-nio-worker-tcp-comm-0-#73%Server%][TcpCommunicationSpi] Established outgoing communication connection [locAddr=/172.17.0.1:41648, rmtAddr=ip-172-17-0-1.ap-south-1.compute.internal/172.17.0.1:47100]
> [19:36:26,193][INFO][grid-nio-worker-tcp-comm-4-#77%Server%][TcpCommunicationSpi] Established outgoing communication connection [locAddr=/127.0.0.1:37656, rmtAddr=/127.0.0.1:47100]
> [19:36:26,193][INFO][grid-nio-worker-tcp-comm-2-#75%Server%][TcpCommunicationSpi] Established outgoing communication connection [locAddr=/0:0:0:0:0:0:0:1:52484, rmtAddr=/0:0:0:0:0:0:0:1%lo:47100]
> [19:36:26,195][INFO][grid-nio-worker-tcp-comm-8-#81%Server%][TcpCommunicationSpi] Accepted incoming communication connection [locAddr=/172.17.0.1:47100, rmtAddr=/172.21.86.7:41656]
> [19:36:26,195][INFO][grid-nio-worker-tcp-comm-10-#83%Server%][TcpCommunicationSpi] Accepted incoming communication connection [locAddr=/0:0:0:0:0:0:0:1:47100, rmtAddr=/0:0:0:0:0:0:0:1:52492]
> [19:36:26,195][INFO][grid-nio-worker-tcp-comm-12-#85%Server%][TcpCommunicationSpi] Accepted incoming communication connection [locAddr=/127.0.0.1:47100, rmtAddr=/127.0.0.1:37664]
> [19:36:26,196][INFO][grid-nio-worker-tcp-comm-7-#80%Server%][TcpCommunicationSpi] Established outgoing communication connection [locAddr=/172.21.86.7:41076, rmtAddr=ip-172-21-86-229.ap-south-1.compute.internal/172.21.86.229:47100]
>
> On Mon, Sep 10, 2018 at 12:04 PM Ilya Kasnacheev <[email protected]> wrote:
>
>> Hello!
>>
>> I can see a lot of errors like this one:
>>
>> [04:05:29,268][INFO][tcp-comm-worker-#1%Server%][ZookeeperDiscoveryImpl] Created new communication error process future [errNode=598e3ead-99b8-4c49-b7df-04d578dcbf5f, err=class org.apache.ignite.IgniteCheckedException: Failed to connect to node (is node still alive?). Make sure that each ComputeTask and cache Transaction has a timeout set in order to prevent parties from waiting forever in case of network issues [nodeId=598e3ead-99b8-4c49-b7df-04d578dcbf5f, addrs=[ip-172-17-0-1.ap-south-1.compute.internal/172.17.0.1:47100, ip-172-21-85-213.ap-south-1.compute.internal/172.21.85.213:47100, /0:0:0:0:0:0:0:1%lo:47100, /127.0.0.1:47100]]]
>>
>> I think the problem is that you have two nodes, and they both have the
>> 172.17.0.1 address, but it is a different address on each node (totally
>> unrelated private networks).
>>
>> Try specifying your external address (such as 172.21.85.213) with
>> TcpCommunicationSpi.setLocalAddress() on each node.
>>
>> Regards,
>> --
>> Ilya Kasnacheev
>>
>> On Fri, Sep 7, 2018 at 20:01, eugene miretsky <[email protected]> wrote:
>>
>>> Hi all,
>>>
>>> Can somebody please provide some pointers on what could be the issue
>>> or how to debug it? We have a fairly large Ignite use case but cannot
>>> go ahead with a POC because of these crashes.
>>>
>>> Cheers,
>>> Eugene
>>>
>>> On Fri, Aug 31, 2018 at 11:52 AM eugene miretsky <[email protected]> wrote:
>>>
>>>> Also, I don't want to spam the mailing list with more threads, but I
>>>> get the same stability issue when writing to Ignite from Spark.
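[Editor's note: Ilya's `TcpCommunicationSpi.setLocalAddress()` suggestion above, expressed as the Spring XML that Ignite nodes are typically configured with, would look roughly like the fragment below. The address value is taken from the example in his email and is per-node; treat the bean layout as a sketch to merge into the existing config, not a drop-in file.]

```xml
<bean class="org.apache.ignite.configuration.IgniteConfiguration">
    <!-- Bind communication to the routable EC2 address, not the docker0 bridge. -->
    <property name="communicationSpi">
        <bean class="org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi">
            <!-- Corresponds to TcpCommunicationSpi.setLocalAddress();
                 use each node's own external address here. -->
            <property name="localAddress" value="172.21.85.213"/>
        </bean>
    </property>
</bean>
```

The same can be done programmatically: `new TcpCommunicationSpi().setLocalAddress("172.21.85.213")`, passed to `IgniteConfiguration.setCommunicationSpi(...)` before starting the node.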
>>>> Logfile from the crashed node (not the same node as before, probably
>>>> random) is attached.
>>>>
>>>> I have also attached a GC log from another node (I have GC logging
>>>> enabled only on one node).
>>>>
>>>> On Fri, Aug 31, 2018 at 11:23 AM eugene miretsky <[email protected]> wrote:
>>>>
>>>>> Thanks Denis,
>>>>>
>>>>> The execution plan + all logs right after the crash are attached.
>>>>>
>>>>> Cheers,
>>>>> Eugene
>>>>> nohup.out
>>>>> <https://drive.google.com/file/d/10TvQOYgOAJpedrTw3IQ5ABwlW5QpK1Bc/view?usp=drive_web>
>>>>>
>>>>> On Fri, Aug 31, 2018 at 1:53 AM Denis Magda <[email protected]> wrote:
>>>>>
>>>>>> Eugene,
>>>>>>
>>>>>> Please share full logs from all the nodes and the execution plan
>>>>>> for the query. That's what the community usually needs to help with
>>>>>> troubleshooting. Also, attach GC logs. Use these settings to gather them:
>>>>>> https://apacheignite.readme.io/docs/jvm-and-system-tuning#section-detailed-garbage-collection-stats
>>>>>>
>>>>>> --
>>>>>> Denis
>>>>>>
>>>>>> On Thu, Aug 30, 2018 at 3:19 PM eugene miretsky <[email protected]> wrote:
>>>>>>
>>>>>>> Hello,
>>>>>>>
>>>>>>> I have a medium cluster set up for testing: 3 x r4.8xlarge EC2
>>>>>>> nodes. It has persistence enabled and zero backups.
>>>>>>> - Full configs are attached.
>>>>>>> - JVM settings are: JVM_OPTS="-Xms16g -Xmx64g -server
>>>>>>> -XX:+AggressiveOpts -XX:MaxMetaspaceSize=256m -XX:+AlwaysPreTouch
>>>>>>> -XX:+UseG1GC -XX:+ScavengeBeforeFullGC -XX:+DisableExplicitGC"
>>>>>>>
>>>>>>> The table has 145M rows and takes up about 180G of memory.
>>>>>>> I am testing two things:
>>>>>>> 1) Writing SQL tables from Spark
>>>>>>> 2) Performing large SQL queries (from the web console), for example:
>>>>>>> SELECT COUNT(*) FROM (SELECT customer_id FROM MyTable WHERE dt > '2018-05-12' GROUP BY customer_id HAVING SUM(column1) > 2 AND MAX(column2) < 1)
>>>>>>>
>>>>>>> Most of the times I run the query, it fails after one of the nodes
>>>>>>> crashes (it has finished a few times and then crashed the next
>>>>>>> time). I have similar stability issues when writing from Spark: at
>>>>>>> some point, one of the nodes crashes. All I can see in the logs is:
>>>>>>>
>>>>>>> [21:51:58,548][SEVERE][disco-event-worker-#101%Server%][] Critical system error detected. Will be handled accordingly to configured handler [hnd=class o.a.i.failure.StopNodeFailureHandler, failureCtx=FailureContext [type=SEGMENTATION, err=null]]
>>>>>>> [21:51:58,549][SEVERE][disco-event-worker-#101%Server%][FailureProcessor] Ignite node is in invalid state due to a critical failure.
>>>>>>> [21:51:58,549][SEVERE][node-stopper][] Stopping local node on Ignite failure: [failureCtx=FailureContext [type=SEGMENTATION, err=null]]
>>>>>>> [21:52:03] Ignite node stopped OK [name=Server, uptime=00:07:06.780]
>>>>>>>
>>>>>>> My questions are:
>>>>>>> 1) What is causing the issue?
>>>>>>> 2) How can I debug it better?
>>>>>>>
>>>>>>> The rate of crashes and our lack of ability to debug them is
>>>>>>> becoming quite a concern.
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Eugene
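[Editor's note: the detailed GC statistics Denis asks for are typically enabled with HotSpot flags along these lines, appended to the JVM_OPTS shown in the original post. Flag names are JDK 8 syntax; the log path is a placeholder, and the linked Ignite tuning page is the authoritative reference.]

```
JVM_OPTS="$JVM_OPTS -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/var/log/ignite/gc.log"
```

With `-XX:+PrintGCDateStamps` the GC log entries carry wall-clock timestamps, which makes it possible to line them up against the node-crash timestamps in the Ignite logs above.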
