Thanks Ilya,

We are writing to Ignite from Spark running in EMR. We don't know the
address of the node in advance, so we have tried two things (roughly
sketched in the code below):
1) Setting localHost in the Ignite configuration to 127.0.0.1, as per the
example online
2) Leaving localHost unset and letting Ignite figure out the host
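
For clarity, this is roughly what the two attempts look like on our side (a
minimal sketch only - class names and values are illustrative, not our exact
config):

import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;

public class ClientConfigSketch {
    public static void main(String[] args) {
        // Attempt 1: pin localHost to loopback, as per the online example.
        IgniteConfiguration cfgLoopback = new IgniteConfiguration();
        cfgLoopback.setClientMode(true);
        cfgLoopback.setLocalHost("127.0.0.1");

        // Attempt 2: leave localHost unset and let Ignite resolve the
        // address itself.
        IgniteConfiguration cfgAuto = new IgniteConfiguration();
        cfgAuto.setClientMode(true);

        // We start with one of the two configurations.
        Ignite ignite = Ignition.start(cfgLoopback);
    }
}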

I have attached more logs at the end.

My understanding is that Ignite should pick the first non-local address to
publish; however, it seems to pick one of the following at random:
(a) the proper address, (b) an IPv6 address, (c) 127.0.0.1, (d) 172.17.0.1.

A few questions:
1) How do we force the Spark client to use the proper address? (Our reading
of the setLocalAddress() suggestion is sketched below.)
2) Where is 172.17.0.1 coming from? It is usually the host address on the
default Docker bridge network, and it seems like Ignite creates a network
interface for it on the instance (otherwise I have no idea where the
interface is coming from).
3) If there are communication errors, shouldn't the ZooKeeper split-brain
resolver kick in and shut down the dead node? Or shouldn't at least the
initiating node mark the remote node as dead?
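
For question 1, this is how we currently read the
TcpCommunicationSpi.setLocalAddress() suggestion, applied on each node (a
sketch only - the 172.21.x address is a placeholder for each node's own
EC2-private IP; please correct us if this is not what was meant):

import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi;

public class ServerCommConfigSketch {
    public static IgniteConfiguration configure() {
        // Bind communication explicitly to this node's EC2-private address
        // (placeholder value) so the 172.17.0.1 Docker bridge address is
        // never advertised to other nodes.
        TcpCommunicationSpi commSpi = new TcpCommunicationSpi();
        commSpi.setLocalAddress("172.21.86.7");

        IgniteConfiguration cfg = new IgniteConfiguration();
        cfg.setCommunicationSpi(commSpi);
        return cfg;
    }
}

The open issue for us is that the Spark executors on EMR don't know their
host's address ahead of time, so we would have to resolve it at runtime
before building the configuration.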

[19:36:26,189][INFO][grid-nio-worker-tcp-comm-15-#88%Server%][TcpCommunicationSpi]
Accepted incoming communication connection [locAddr=/172.17.0.1:47100,
rmtAddr=/172.21.86.7:41648]

[19:36:26,190][INFO][grid-nio-worker-tcp-comm-3-#76%Server%][TcpCommunicationSpi]
Accepted incoming communication connection [locAddr=/0:0:0:0:0:0:0:1:47100,
rmtAddr=/0:0:0:0:0:0:0:1:52484]

[19:36:26,191][INFO][grid-nio-worker-tcp-comm-5-#78%Server%][TcpCommunicationSpi]
Accepted incoming communication connection [locAddr=/127.0.0.1:47100,
rmtAddr=/127.0.0.1:37656]

[19:36:26,191][INFO][grid-nio-worker-tcp-comm-1-#74%Server%][TcpCommunicationSpi]
Established outgoing communication connection [locAddr=/172.21.86.7:53272,
rmtAddr=ip-172-21-86-175.ap-south-1.compute.internal/172.21.86.175:47100]

[19:36:26,191][INFO][grid-nio-worker-tcp-comm-0-#73%Server%][TcpCommunicationSpi]
Established outgoing communication connection [locAddr=/172.17.0.1:41648,
rmtAddr=ip-172-17-0-1.ap-south-1.compute.internal/172.17.0.1:47100]

[19:36:26,193][INFO][grid-nio-worker-tcp-comm-4-#77%Server%][TcpCommunicationSpi]
Established outgoing communication connection [locAddr=/127.0.0.1:37656,
rmtAddr=/127.0.0.1:47100]

[19:36:26,193][INFO][grid-nio-worker-tcp-comm-2-#75%Server%][TcpCommunicationSpi]
Established outgoing communication connection
[locAddr=/0:0:0:0:0:0:0:1:52484, rmtAddr=/0:0:0:0:0:0:0:1%lo:47100]

[19:36:26,195][INFO][grid-nio-worker-tcp-comm-8-#81%Server%][TcpCommunicationSpi]
Accepted incoming communication connection [locAddr=/172.17.0.1:47100,
rmtAddr=/172.21.86.7:41656]

[19:36:26,195][INFO][grid-nio-worker-tcp-comm-10-#83%Server%][TcpCommunicationSpi]
Accepted incoming communication connection [locAddr=/0:0:0:0:0:0:0:1:47100,
rmtAddr=/0:0:0:0:0:0:0:1:52492]

[19:36:26,195][INFO][grid-nio-worker-tcp-comm-12-#85%Server%][TcpCommunicationSpi]
Accepted incoming communication connection [locAddr=/127.0.0.1:47100,
rmtAddr=/127.0.0.1:37664]

[19:36:26,196][INFO][grid-nio-worker-tcp-comm-7-#80%Server%][TcpCommunicationSpi]
Established outgoing communication connection [locAddr=/172.21.86.7:41076,
rmtAddr=ip-172-21-86-229.ap-south-1.compute.internal/172.21.86.229:47100]




On Mon, Sep 10, 2018 at 12:04 PM Ilya Kasnacheev <ilya.kasnach...@gmail.com>
wrote:

> Hello!
>
> I can see a lot of errors like this one:
>
> [04:05:29,268][INFO][tcp-comm-worker-#1%Server%][ZookeeperDiscoveryImpl]
> Created new communication error process future
> [errNode=598e3ead-99b8-4c49-b7df-04d578dcbf5f, err=class
> org.apache.ignite.IgniteCheckedException: Failed to connect to node (is
> node still alive?). Make sure that each ComputeTask and cache Transaction
> has a timeout set in order to prevent parties from waiting forever in case
> of network issues [nodeId=598e3ead-99b8-4c49-b7df-04d578dcbf5f,
> addrs=[ip-172-17-0-1.ap-south-1.compute.internal/172.17.0.1:47100,
> ip-172-21-85-213.ap-south-1.compute.internal/172.21.85.213:47100,
> /0:0:0:0:0:0:0:1%lo:47100, /127.0.0.1:47100]]]
>
> I think the problem is that you have two nodes which both have a
> 172.17.0.1 address, but these are different addresses (totally unrelated
> private networks).
>
> Try to specify your external address (such as 172.21.85.213) with
> TcpCommunicationSpi.setLocalAddress() on each node.
>
> Regards,
> --
> Ilya Kasnacheev
>
>
> Fri, Sep 7, 2018 at 20:01, eugene miretsky <eugene.miret...@gmail.com>:
>
>> Hi all,
>>
>> Can somebody please provide some pointers on what could be the issue or
>> how to debug it? We have a fairly large Ignite use case, but cannot go
>> ahead with a POC because of these crashes.
>>
>> Cheers,
>> Eugene
>>
>>
>>
>> On Fri, Aug 31, 2018 at 11:52 AM eugene miretsky <
>> eugene.miret...@gmail.com> wrote:
>>
>>> Also, I don't want to spam the mailing list with more threads, but I get
>>> the same stability issue when writing to Ignite from Spark. The log file
>>> from the crashed node (not the same node as before, probably random) is
>>> attached.
>>>
>>> I have also attached a GC log from another node (I have GC logging
>>> enabled on only one node).
>>>
>>>
>>> On Fri, Aug 31, 2018 at 11:23 AM eugene miretsky <
>>> eugene.miret...@gmail.com> wrote:
>>>
>>>> Thanks Denis,
>>>>
>>>> The execution plan + all logs right after the crash are attached.
>>>>
>>>> Cheers,
>>>> Eugene
>>>>  nohup.out
>>>> <https://drive.google.com/file/d/10TvQOYgOAJpedrTw3IQ5ABwlW5QpK1Bc/view?usp=drive_web>
>>>>
>>>>
>>>>
>>>> On Fri, Aug 31, 2018 at 1:53 AM Denis Magda <dma...@apache.org> wrote:
>>>>
>>>>> Eugene,
>>>>>
>>>>> Please share full logs from all the nodes and execution plan for the
>>>>> query. That's what the community usually needs to help with
>>>>> troubleshooting. Also, attach GC logs. Use these settings to gather them:
>>>>> https://apacheignite.readme.io/docs/jvm-and-system-tuning#section-detailed-garbage-collection-stats
>>>>>
>>>>> --
>>>>> Denis
>>>>>
>>>>> On Thu, Aug 30, 2018 at 3:19 PM eugene miretsky <
>>>>> eugene.miret...@gmail.com> wrote:
>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> I have a medium cluster set up for testing - 3 x r4.8xlarge EC2
>>>>>> nodes. It has persistence enabled and zero backups.
>>>>>> - Full configs are attached.
>>>>>> - JVM settings are: JVM_OPTS="-Xms16g -Xmx64g -server
>>>>>> -XX:+AggressiveOpts -XX:MaxMetaspaceSize=256m  -XX:+AlwaysPreTouch
>>>>>> -XX:+UseG1GC -XX:+ScavengeBeforeFullGC -XX:+DisableExplicitGC"
>>>>>>
>>>>>> The table has 145M rows and takes up about 180G of memory.
>>>>>> I am testing 2 things:
>>>>>> 1) Writing SQL tables from Spark
>>>>>> 2) Performing large SQL queries (from the web console), for example:
>>>>>>
>>>>>> SELECT COUNT(*) FROM (SELECT customer_id FROM MyTable WHERE dt >
>>>>>> '2018-05-12' GROUP BY customer_id HAVING SUM(column1) > 2 AND
>>>>>> MAX(column2) < 1)
>>>>>>
>>>>>> Most of the time when I run the query, it fails after one of the nodes
>>>>>> crashes (it has finished a few times, and then crashed the next time). I
>>>>>> also have similar stability issues when writing from Spark - at some
>>>>>> point, one of the nodes crashes. All I can see in the logs is:
>>>>>>
>>>>>> [21:51:58,548][SEVERE][disco-event-worker-#101%Server%][] Critical
>>>>>> system error detected. Will be handled accordingly to configured handler
>>>>>> [hnd=class o.a.i.failure.StopNodeFailureHandler, 
>>>>>> failureCtx=FailureContext
>>>>>> [type=SEGMENTATION, err=null]]
>>>>>>
>>>>>> [21:51:58,549][SEVERE][disco-event-worker-#101%Server%][FailureProcessor]
>>>>>> Ignite node is in invalid state due to a critical failure.
>>>>>>
>>>>>> [21:51:58,549][SEVERE][node-stopper][] Stopping local node on Ignite
>>>>>> failure: [failureCtx=FailureContext [type=SEGMENTATION, err=null]]
>>>>>>
>>>>>> [21:52:03] Ignite node stopped OK [name=Server, uptime=00:07:06.780]
>>>>>>
>>>>>> My questions are:
>>>>>> 1) What is causing the issue?
>>>>>> 2) How can I debug it better?
>>>>>>
>>>>>> The rate of crashes and our inability to debug them are becoming
>>>>>> quite a concern.
>>>>>>
>>>>>> Cheers,
>>>>>> Eugene
>>>>>>
>>>>>>
>>>>>>
>>>>>>
