Good question :) yardstick does this, but I'm not sure it is a valid production solution: https://github.com/apache/ignite/blob/3307a8b26ccb5f0bb7e9c387c73fd221b98ab668/modules/yardstick/src/main/java/org/apache/ignite/yardstick/jdbc/AbstractJdbcBenchmark.java
We have set preferIPv4Stack=true and provided localAddress in the config; it seems to have solved the problem. (We didn't run it enough to be 100% sure.)

On Wed, Sep 12, 2018 at 10:59 AM Ilya Kasnacheev <[email protected]> wrote:

> Hello!
>
> How would you distinguish the wrong interface (172.17.0.1) from the right one if you were Ignite?
>
> I think it's not the first time I have seen this problem, but I have positively no idea how to tackle it. Maybe Docker experts could chime in?
>
> Regards,
> --
> Ilya Kasnacheev
>
> On Wed, Sep 12, 2018 at 3:29, eugene miretsky <[email protected]> wrote:
>
>> Thanks Ilya,
>>
>> We are writing to Ignite from Spark running in EMR. We don't know the address of the node in advance. We have tried:
>> 1) Setting localHost in the Ignite configuration to 127.0.0.1, as per the example online
>> 2) Leaving localHost unset and letting Ignite figure out the host
>>
>> I have attached more logs at the end.
>>
>> My understanding is that Ignite should pick the first non-local address to publish; however, it seems to pick randomly among (a) the proper address, (b) an IPv6 address, (c) 127.0.0.1, and (d) 172.17.0.1.
>>
>> A few questions:
>> 1) How do we force the Spark client to use the proper address?
>> 2) Where is 172.17.0.1 coming from? It is usually the default Docker network host address, and it seems like Ignite creates a network interface for it on the instance (otherwise I have no idea where the interface is coming from).
>> 3) If there are communication errors, shouldn't the ZooKeeper split-brain resolver kick in and shut down the dead node? Or shouldn't at least the initiating node mark the remote node as dead?
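[Editor's note: as a thought experiment on the question above (how to tell the Docker bridge address from the node's real one), here is a minimal pure-JDK sketch. The pickPublishAddress helper and its hard-coded 172.17.0.1 filter are illustrative assumptions, not Ignite's actual selection logic.]

```java
import java.net.Inet4Address;
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.List;
import java.util.Optional;

public class AddressPick {
    // Hypothetical helper: among the candidate addresses a node advertises,
    // prefer a non-loopback, non-link-local IPv4 address that is not the
    // default Docker bridge address (172.17.0.1).
    static Optional<String> pickPublishAddress(List<String> candidates) {
        return candidates.stream()
            .filter(a -> {
                try {
                    InetAddress addr = InetAddress.getByName(a);
                    return addr instanceof Inet4Address
                        && !addr.isLoopbackAddress()
                        && !addr.isLinkLocalAddress()
                        && !a.equals("172.17.0.1"); // assumed Docker bridge filter
                } catch (UnknownHostException e) {
                    return false;
                }
            })
            .findFirst();
    }

    public static void main(String[] args) {
        // The four address kinds observed in the logs below.
        List<String> seen = List.of("127.0.0.1", "0:0:0:0:0:0:0:1", "172.17.0.1", "172.21.86.7");
        System.out.println(pickPublishAddress(seen).orElse("none")); // prints 172.21.86.7
    }
}
```

The catch, as discussed in this thread, is that Ignite cannot know a priori that 172.17.0.1 is "wrong", which is why an explicit localAddress ends up being necessary.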
>> [19:36:26,189][INFO][grid-nio-worker-tcp-comm-15-#88%Server%][TcpCommunicationSpi] Accepted incoming communication connection [locAddr=/172.17.0.1:47100, rmtAddr=/172.21.86.7:41648]
>> [19:36:26,190][INFO][grid-nio-worker-tcp-comm-3-#76%Server%][TcpCommunicationSpi] Accepted incoming communication connection [locAddr=/0:0:0:0:0:0:0:1:47100, rmtAddr=/0:0:0:0:0:0:0:1:52484]
>> [19:36:26,191][INFO][grid-nio-worker-tcp-comm-5-#78%Server%][TcpCommunicationSpi] Accepted incoming communication connection [locAddr=/127.0.0.1:47100, rmtAddr=/127.0.0.1:37656]
>> [19:36:26,191][INFO][grid-nio-worker-tcp-comm-1-#74%Server%][TcpCommunicationSpi] Established outgoing communication connection [locAddr=/172.21.86.7:53272, rmtAddr=ip-172-21-86-175.ap-south-1.compute.internal/172.21.86.175:47100]
>> [19:36:26,191][INFO][grid-nio-worker-tcp-comm-0-#73%Server%][TcpCommunicationSpi] Established outgoing communication connection [locAddr=/172.17.0.1:41648, rmtAddr=ip-172-17-0-1.ap-south-1.compute.internal/172.17.0.1:47100]
>> [19:36:26,193][INFO][grid-nio-worker-tcp-comm-4-#77%Server%][TcpCommunicationSpi] Established outgoing communication connection [locAddr=/127.0.0.1:37656, rmtAddr=/127.0.0.1:47100]
>> [19:36:26,193][INFO][grid-nio-worker-tcp-comm-2-#75%Server%][TcpCommunicationSpi] Established outgoing communication connection [locAddr=/0:0:0:0:0:0:0:1:52484, rmtAddr=/0:0:0:0:0:0:0:1%lo:47100]
>> [19:36:26,195][INFO][grid-nio-worker-tcp-comm-8-#81%Server%][TcpCommunicationSpi] Accepted incoming communication connection [locAddr=/172.17.0.1:47100, rmtAddr=/172.21.86.7:41656]
>> [19:36:26,195][INFO][grid-nio-worker-tcp-comm-10-#83%Server%][TcpCommunicationSpi] Accepted incoming communication connection [locAddr=/0:0:0:0:0:0:0:1:47100, rmtAddr=/0:0:0:0:0:0:0:1:52492]
>> [19:36:26,195][INFO][grid-nio-worker-tcp-comm-12-#85%Server%][TcpCommunicationSpi] Accepted incoming communication connection [locAddr=/127.0.0.1:47100, rmtAddr=/127.0.0.1:37664]
>> [19:36:26,196][INFO][grid-nio-worker-tcp-comm-7-#80%Server%][TcpCommunicationSpi] Established outgoing communication connection [locAddr=/172.21.86.7:41076, rmtAddr=ip-172-21-86-229.ap-south-1.compute.internal/172.21.86.229:47100]
>>
>> On Mon, Sep 10, 2018 at 12:04 PM Ilya Kasnacheev <[email protected]> wrote:
>>
>>> Hello!
>>>
>>> I can see a lot of errors like this one:
>>>
>>> [04:05:29,268][INFO][tcp-comm-worker-#1%Server%][ZookeeperDiscoveryImpl] Created new communication error process future [errNode=598e3ead-99b8-4c49-b7df-04d578dcbf5f, err=class org.apache.ignite.IgniteCheckedException: Failed to connect to node (is node still alive?). Make sure that each ComputeTask and cache Transaction has a timeout set in order to prevent parties from waiting forever in case of network issues [nodeId=598e3ead-99b8-4c49-b7df-04d578dcbf5f, addrs=[ip-172-17-0-1.ap-south-1.compute.internal/172.17.0.1:47100, ip-172-21-85-213.ap-south-1.compute.internal/172.21.85.213:47100, /0:0:0:0:0:0:0:1%lo:47100, /127.0.0.1:47100]]]
>>>
>>> I think the problem is that you have two nodes that both have the 172.17.0.1 address, but it is a different address on each node (totally unrelated private nets).
>>>
>>> Try to specify your external address (such as 172.21.85.213) with TcpCommunicationSpi.setLocalAddress() on each node.
>>>
>>> Regards,
>>> --
>>> Ilya Kasnacheev
>>>
>>> On Fri, Sep 7, 2018 at 20:01, eugene miretsky <[email protected]> wrote:
>>>
>>>> Hi all,
>>>>
>>>> Can somebody please provide some pointers on what could be the issue or how to debug it? We have a fairly large Ignite use case, but cannot go ahead with a POC because of these crashes.
>>>> Cheers,
>>>> Eugene
>>>>
>>>> On Fri, Aug 31, 2018 at 11:52 AM eugene miretsky <[email protected]> wrote:
>>>>
>>>>> Also, I don't want to spam the mailing list with more threads, but I get the same stability issue when writing to Ignite from Spark. The log file from the crashed node (not the same node as before, probably random) is attached.
>>>>>
>>>>> I have also attached a GC log from another node (I have GC logging enabled only on one node).
>>>>>
>>>>> On Fri, Aug 31, 2018 at 11:23 AM eugene miretsky <[email protected]> wrote:
>>>>>
>>>>>> Thanks Denis,
>>>>>>
>>>>>> The execution plan and all logs from right after the crash are attached.
>>>>>>
>>>>>> Cheers,
>>>>>> Eugene
>>>>>> nohup.out <https://drive.google.com/file/d/10TvQOYgOAJpedrTw3IQ5ABwlW5QpK1Bc/view?usp=drive_web>
>>>>>>
>>>>>> On Fri, Aug 31, 2018 at 1:53 AM Denis Magda <[email protected]> wrote:
>>>>>>
>>>>>>> Eugene,
>>>>>>>
>>>>>>> Please share full logs from all the nodes and the execution plan for the query. That's what the community usually needs to help with troubleshooting. Also attach GC logs; use these settings to gather them:
>>>>>>> https://apacheignite.readme.io/docs/jvm-and-system-tuning#section-detailed-garbage-collection-stats
>>>>>>>
>>>>>>> --
>>>>>>> Denis
>>>>>>>
>>>>>>> On Thu, Aug 30, 2018 at 3:19 PM eugene miretsky <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hello,
>>>>>>>>
>>>>>>>> I have a medium cluster set up for testing: 3 x r4.8xlarge EC2 nodes. It has persistence enabled and zero backups.
>>>>>>>> - Full configs are attached.
>>>>>>>> - JVM settings are: JVM_OPTS="-Xms16g -Xmx64g -server -XX:+AggressiveOpts -XX:MaxMetaspaceSize=256m -XX:+AlwaysPreTouch -XX:+UseG1GC -XX:+ScavengeBeforeFullGC -XX:+DisableExplicitGC"
>>>>>>>>
>>>>>>>> The table has 145M rows and takes up about 180G of memory.
>>>>>>>> I am testing 2 things:
>>>>>>>> 1) Writing SQL tables from Spark
>>>>>>>> 2) Performing large SQL queries (from the web console), for example:
>>>>>>>> SELECT COUNT(*) FROM (SELECT customer_id FROM MyTable WHERE dt > '2018-05-12' GROUP BY customer_id HAVING SUM(column1) > 2 AND MAX(column2) < 1)
>>>>>>>>
>>>>>>>> Most of the times I run the query, it fails after one of the nodes crashes (it has finished a few times, and then crashed the next time). I have similar stability issues when writing from Spark: at some point, one of the nodes crashes. All I can see in the logs is:
>>>>>>>>
>>>>>>>> [21:51:58,548][SEVERE][disco-event-worker-#101%Server%][] Critical system error detected. Will be handled accordingly to configured handler [hnd=class o.a.i.failure.StopNodeFailureHandler, failureCtx=FailureContext [type=SEGMENTATION, err=null]]
>>>>>>>> [21:51:58,549][SEVERE][disco-event-worker-#101%Server%][FailureProcessor] Ignite node is in invalid state due to a critical failure.
>>>>>>>> [21:51:58,549][SEVERE][node-stopper][] Stopping local node on Ignite failure: [failureCtx=FailureContext [type=SEGMENTATION, err=null]]
>>>>>>>> [21:52:03] Ignite node stopped OK [name=Server, uptime=00:07:06.780]
>>>>>>>>
>>>>>>>> My questions are:
>>>>>>>> 1) What is causing the issue?
>>>>>>>> 2) How can I debug it better?
>>>>>>>>
>>>>>>>> The rate of crashes and our lack of ability to debug them is becoming quite a concern.
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>> Eugene
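[Editor's note: the fix described at the top of the thread (preferIPv4Stack plus an explicit localAddress, as Ilya suggested) would look roughly like this in a Spring XML node config. This is a sketch: the bean layout follows standard Ignite configs, and 172.21.86.7 is just one node's address taken from the logs above; each node must use its own external address.]

```xml
<!-- JVM side: also start each node with -Djava.net.preferIPv4Stack=true -->
<bean class="org.apache.ignite.configuration.IgniteConfiguration">
    <!-- Bind this node to its real EC2 address, not the Docker bridge (172.17.0.1). -->
    <property name="localHost" value="172.21.86.7"/>
    <property name="communicationSpi">
        <bean class="org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi">
            <property name="localAddress" value="172.21.86.7"/>
        </bean>
    </property>
</bean>
```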
