Hello! How would you distinguish the wrong interface (172.17.0.1) from the right one if you were Ignite?

I think this is not the first time I have seen this problem, but I have positively no idea how to tackle it. Maybe Docker experts could chime in?

Regards,
--
Ilya Kasnacheev

On Wed, Sep 12, 2018 at 3:29, eugene miretsky <[email protected]> wrote:

> Thanks Ilya,
>
> We are writing to Ignite from Spark running in EMR. We don't know the
> address of the node in advance. We have tried:
> 1) Setting localHost in the Ignite configuration to 127.0.0.1, as per the example online
> 2) Leaving localHost unset and letting Ignite figure out the host
>
> I have attached more logs at the end.
>
> My understanding is that Ignite should pick the first non-local address to
> publish; however, it seems to pick randomly among (a) the proper address,
> (b) an IPv6 address, (c) 127.0.0.1, and (d) 172.17.0.1.
>
> A few questions:
> 1) How do we force the Spark client to use the proper address?
> 2) Where is 172.17.0.1 coming from? It is usually the default Docker
> network host address, and it seems that Ignite creates a network interface
> for it on the instance (otherwise I have no idea where the interface is
> coming from).
> 3) If there are communication errors, shouldn't the ZooKeeper split-brain
> resolver kick in and shut down the dead node? Or shouldn't at least the
> initiating node mark the remote node as dead?
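[Editor's note: the four address kinds Eugene lists, (a) through (d), can be illustrated with a small sketch of a preference rule — prefer a routable IPv4 address, skip loopback and IPv6, and treat the default Docker bridge subnet (172.17.0.0/16) only as a last resort. `PickAddress` and `pick` are hypothetical names for illustration; this is NOT Ignite's actual selection logic.]

```java
import java.util.Arrays;
import java.util.List;

public class PickAddress {
    // Hypothetical helper: choose a publishable address from the candidates,
    // skipping IPv6 and loopback, and demoting the docker0 bridge subnet.
    static String pick(List<String> candidates) {
        String dockerFallback = null;
        for (String addr : candidates) {
            if (addr.contains(":"))        continue;  // skip IPv6 literals
            if (addr.startsWith("127."))   continue;  // skip loopback
            if (addr.startsWith("172.17.")) {         // default Docker bridge net
                dockerFallback = addr;
                continue;
            }
            return addr;  // first routable IPv4 address wins
        }
        return dockerFallback != null ? dockerFallback : "127.0.0.1";
    }

    public static void main(String[] args) {
        // The four address kinds observed in the logs below:
        List<String> seen = Arrays.asList(
            "127.0.0.1", "0:0:0:0:0:0:0:1", "172.17.0.1", "172.21.86.7");
        System.out.println(pick(seen));  // prints 172.21.86.7
    }
}
```

Under this rule the node would always publish 172.21.86.7 rather than randomly surfacing the loopback, IPv6, or docker0 addresses.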
> [19:36:26,189][INFO][grid-nio-worker-tcp-comm-15-#88%Server%][TcpCommunicationSpi] Accepted incoming communication connection [locAddr=/172.17.0.1:47100, rmtAddr=/172.21.86.7:41648]
> [19:36:26,190][INFO][grid-nio-worker-tcp-comm-3-#76%Server%][TcpCommunicationSpi] Accepted incoming communication connection [locAddr=/0:0:0:0:0:0:0:1:47100, rmtAddr=/0:0:0:0:0:0:0:1:52484]
> [19:36:26,191][INFO][grid-nio-worker-tcp-comm-5-#78%Server%][TcpCommunicationSpi] Accepted incoming communication connection [locAddr=/127.0.0.1:47100, rmtAddr=/127.0.0.1:37656]
> [19:36:26,191][INFO][grid-nio-worker-tcp-comm-1-#74%Server%][TcpCommunicationSpi] Established outgoing communication connection [locAddr=/172.21.86.7:53272, rmtAddr=ip-172-21-86-175.ap-south-1.compute.internal/172.21.86.175:47100]
> [19:36:26,191][INFO][grid-nio-worker-tcp-comm-0-#73%Server%][TcpCommunicationSpi] Established outgoing communication connection [locAddr=/172.17.0.1:41648, rmtAddr=ip-172-17-0-1.ap-south-1.compute.internal/172.17.0.1:47100]
> [19:36:26,193][INFO][grid-nio-worker-tcp-comm-4-#77%Server%][TcpCommunicationSpi] Established outgoing communication connection [locAddr=/127.0.0.1:37656, rmtAddr=/127.0.0.1:47100]
> [19:36:26,193][INFO][grid-nio-worker-tcp-comm-2-#75%Server%][TcpCommunicationSpi] Established outgoing communication connection [locAddr=/0:0:0:0:0:0:0:1:52484, rmtAddr=/0:0:0:0:0:0:0:1%lo:47100]
> [19:36:26,195][INFO][grid-nio-worker-tcp-comm-8-#81%Server%][TcpCommunicationSpi] Accepted incoming communication connection [locAddr=/172.17.0.1:47100, rmtAddr=/172.21.86.7:41656]
> [19:36:26,195][INFO][grid-nio-worker-tcp-comm-10-#83%Server%][TcpCommunicationSpi] Accepted incoming communication connection [locAddr=/0:0:0:0:0:0:0:1:47100, rmtAddr=/0:0:0:0:0:0:0:1:52492]
> [19:36:26,195][INFO][grid-nio-worker-tcp-comm-12-#85%Server%][TcpCommunicationSpi] Accepted incoming communication connection [locAddr=/127.0.0.1:47100, rmtAddr=/127.0.0.1:37664]
> [19:36:26,196][INFO][grid-nio-worker-tcp-comm-7-#80%Server%][TcpCommunicationSpi] Established outgoing communication connection [locAddr=/172.21.86.7:41076, rmtAddr=ip-172-21-86-229.ap-south-1.compute.internal/172.21.86.229:47100]
>
> On Mon, Sep 10, 2018 at 12:04 PM Ilya Kasnacheev <[email protected]> wrote:
>
>> Hello!
>>
>> I can see a lot of errors like this one:
>>
>> [04:05:29,268][INFO][tcp-comm-worker-#1%Server%][ZookeeperDiscoveryImpl] Created new communication error process future [errNode=598e3ead-99b8-4c49-b7df-04d578dcbf5f, err=class org.apache.ignite.IgniteCheckedException: Failed to connect to node (is node still alive?). Make sure that each ComputeTask and cache Transaction has a timeout set in order to prevent parties from waiting forever in case of network issues [nodeId=598e3ead-99b8-4c49-b7df-04d578dcbf5f, addrs=[ip-172-17-0-1.ap-south-1.compute.internal/172.17.0.1:47100, ip-172-21-85-213.ap-south-1.compute.internal/172.21.85.213:47100, /0:0:0:0:0:0:0:1%lo:47100, /127.0.0.1:47100]]]
>>
>> I think the problem is that you have two nodes, and they both have the
>> 172.17.0.1 address, but it is a different address on each node (totally
>> unrelated private networks).
>>
>> Try specifying your external address (such as 172.21.85.213) with
>> TcpCommunicationSpi.setLocalAddress() on each node.
>>
>> Regards,
>> --
>> Ilya Kasnacheev
>>
>> On Fri, Sep 7, 2018 at 20:01, eugene miretsky <[email protected]> wrote:
>>
>>> Hi all,
>>>
>>> Can somebody please provide some pointers on what could be the issue
>>> or how to debug it? We have a fairly large Ignite use case but cannot
>>> go ahead with a POC because of these crashes.
>>>
>>> Cheers,
>>> Eugene
>>>
>>> On Fri, Aug 31, 2018 at 11:52 AM eugene miretsky <[email protected]> wrote:
>>>
>>>> Also, I don't want to spam the mailing list with more threads, but I
>>>> get the same stability issue when writing to Ignite from Spark.
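[Editor's note: Ilya's `TcpCommunicationSpi.setLocalAddress()` suggestion above, expressed as the Spring XML that Ignite nodes are typically configured with, would look roughly like the fragment below. The address value is taken from the example in his email and is per-node; treat the bean layout as a sketch to merge into the existing config, not a drop-in file.]

```xml
<bean class="org.apache.ignite.configuration.IgniteConfiguration">
    <!-- Bind communication to the routable EC2 address, not the docker0 bridge. -->
    <property name="communicationSpi">
        <bean class="org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi">
            <!-- Corresponds to TcpCommunicationSpi.setLocalAddress();
                 use each node's own external address here. -->
            <property name="localAddress" value="172.21.85.213"/>
        </bean>
    </property>
</bean>
```

The same can be done programmatically: `new TcpCommunicationSpi().setLocalAddress("172.21.85.213")`, passed to `IgniteConfiguration.setCommunicationSpi(...)` before starting the node.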
>>>> Logfile from the crashed node (not the same node as before, probably
>>>> random) is attached.
>>>>
>>>> I have also attached a GC log from another node (I have GC logging
>>>> enabled only on one node).
>>>>
>>>> On Fri, Aug 31, 2018 at 11:23 AM eugene miretsky <[email protected]> wrote:
>>>>
>>>>> Thanks Denis,
>>>>>
>>>>> The execution plan + all logs right after the crash are attached.
>>>>>
>>>>> Cheers,
>>>>> Eugene
>>>>> nohup.out
>>>>> <https://drive.google.com/file/d/10TvQOYgOAJpedrTw3IQ5ABwlW5QpK1Bc/view?usp=drive_web>
>>>>>
>>>>> On Fri, Aug 31, 2018 at 1:53 AM Denis Magda <[email protected]> wrote:
>>>>>
>>>>>> Eugene,
>>>>>>
>>>>>> Please share full logs from all the nodes and the execution plan
>>>>>> for the query. That's what the community usually needs to help with
>>>>>> troubleshooting. Also, attach GC logs. Use these settings to gather them:
>>>>>> https://apacheignite.readme.io/docs/jvm-and-system-tuning#section-detailed-garbage-collection-stats
>>>>>>
>>>>>> --
>>>>>> Denis
>>>>>>
>>>>>> On Thu, Aug 30, 2018 at 3:19 PM eugene miretsky <[email protected]> wrote:
>>>>>>
>>>>>>> Hello,
>>>>>>>
>>>>>>> I have a medium cluster set up for testing: 3 x r4.8xlarge EC2
>>>>>>> nodes. It has persistence enabled and zero backups.
>>>>>>> - Full configs are attached.
>>>>>>> - JVM settings are: JVM_OPTS="-Xms16g -Xmx64g -server
>>>>>>> -XX:+AggressiveOpts -XX:MaxMetaspaceSize=256m -XX:+AlwaysPreTouch
>>>>>>> -XX:+UseG1GC -XX:+ScavengeBeforeFullGC -XX:+DisableExplicitGC"
>>>>>>>
>>>>>>> The table has 145M rows and takes up about 180G of memory.
>>>>>>> I am testing two things:
>>>>>>> 1) Writing SQL tables from Spark
>>>>>>> 2) Performing large SQL queries (from the web console), for example:
>>>>>>> SELECT COUNT(*) FROM (SELECT customer_id FROM MyTable WHERE dt > '2018-05-12' GROUP BY customer_id HAVING SUM(column1) > 2 AND MAX(column2) < 1)
>>>>>>>
>>>>>>> Most of the times I run the query, it fails after one of the nodes
>>>>>>> crashes (it has finished a few times and then crashed the next
>>>>>>> time). I have similar stability issues when writing from Spark: at
>>>>>>> some point, one of the nodes crashes. All I can see in the logs is:
>>>>>>>
>>>>>>> [21:51:58,548][SEVERE][disco-event-worker-#101%Server%][] Critical system error detected. Will be handled accordingly to configured handler [hnd=class o.a.i.failure.StopNodeFailureHandler, failureCtx=FailureContext [type=SEGMENTATION, err=null]]
>>>>>>> [21:51:58,549][SEVERE][disco-event-worker-#101%Server%][FailureProcessor] Ignite node is in invalid state due to a critical failure.
>>>>>>> [21:51:58,549][SEVERE][node-stopper][] Stopping local node on Ignite failure: [failureCtx=FailureContext [type=SEGMENTATION, err=null]]
>>>>>>> [21:52:03] Ignite node stopped OK [name=Server, uptime=00:07:06.780]
>>>>>>>
>>>>>>> My questions are:
>>>>>>> 1) What is causing the issue?
>>>>>>> 2) How can I debug it better?
>>>>>>>
>>>>>>> The rate of crashes and our lack of ability to debug them is
>>>>>>> becoming quite a concern.
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Eugene
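[Editor's note: the detailed GC statistics Denis asks for are typically enabled with HotSpot flags along these lines, appended to the JVM_OPTS shown in the original post. Flag names are JDK 8 syntax; the log path is a placeholder, and the linked Ignite tuning page is the authoritative reference.]

```
JVM_OPTS="$JVM_OPTS -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/var/log/ignite/gc.log"
```

With `-XX:+PrintGCDateStamps` the GC log entries carry wall-clock timestamps, which makes it possible to line them up against the node-crash timestamps in the Ignite logs above.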
