Good question :) yardstick does this, but I'm not sure it is a valid production solution: https://github.com/apache/ignite/blob/3307a8b26ccb5f0bb7e9c387c73fd221b98ab668/modules/yardstick/src/main/java/org/apache/ignite/yardstick/jdbc/AbstractJdbcBenchmark.java
We have set preferIPv4Stack=true and provided localAddress in the config; it seems to have solved the problem. (We didn't run it enough to be 100% sure.)

On Wed, Sep 12, 2018 at 10:59 AM Ilya Kasnacheev <[email protected]> wrote:

> Hello!
>
> How would you distinguish the wrong interface (172.17.0.1) from the right one if you were Ignite?
>
> I think it's not the first time I have seen this problem, but I have positively no idea how to tackle it. Maybe Docker experts could chime in?
>
> Regards,
> --
> Ilya Kasnacheev
>
> On Wed, Sep 12, 2018 at 3:29, eugene miretsky <[email protected]> wrote:
>
>> Thanks Ilya,
>>
>> We are writing to Ignite from Spark running in EMR. We don't know the address of the node in advance. We have tried:
>> 1) Setting localHost in the Ignite configuration to 127.0.0.1, as per the example online
>> 2) Leaving localHost unset and letting Ignite figure out the host
>>
>> I have attached more logs at the end.
>>
>> My understanding is that Ignite should pick the first non-local address to publish; however, it seems to pick randomly among (a) the proper address, (b) an IPv6 address, (c) 127.0.0.1, and (d) 172.17.0.1.
>>
>> A few questions:
>> 1) How do we force the Spark client to use the proper address?
>> 2) Where is 172.17.0.1 coming from? It is usually the default Docker network host address, and it seems like Ignite creates a network interface for it on the instance (otherwise I have no idea where the interface is coming from).
>> 3) If there are communication errors, shouldn't the ZooKeeper split-brain resolver kick in and shut down the dead node? Or shouldn't at least the initiating node mark the remote node as dead?
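[Editor's note: as a thought experiment on the question above (how to tell the Docker bridge address from the node's real one), here is a minimal pure-JDK sketch. The pickPublishAddress helper and its hard-coded 172.17.0.1 filter are illustrative assumptions, not Ignite's actual selection logic.]

```java
import java.net.Inet4Address;
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.List;
import java.util.Optional;

public class AddressPick {
    // Hypothetical helper: among the candidate addresses a node advertises,
    // prefer a non-loopback, non-link-local IPv4 address that is not the
    // default Docker bridge address (172.17.0.1).
    static Optional<String> pickPublishAddress(List<String> candidates) {
        return candidates.stream()
            .filter(a -> {
                try {
                    InetAddress addr = InetAddress.getByName(a);
                    return addr instanceof Inet4Address
                        && !addr.isLoopbackAddress()
                        && !addr.isLinkLocalAddress()
                        && !a.equals("172.17.0.1"); // assumed Docker bridge filter
                } catch (UnknownHostException e) {
                    return false;
                }
            })
            .findFirst();
    }

    public static void main(String[] args) {
        // The four address kinds observed in the logs below.
        List<String> seen = List.of("127.0.0.1", "0:0:0:0:0:0:0:1", "172.17.0.1", "172.21.86.7");
        System.out.println(pickPublishAddress(seen).orElse("none")); // prints 172.21.86.7
    }
}
```

The catch, as discussed in this thread, is that Ignite cannot know a priori that 172.17.0.1 is "wrong", which is why an explicit localAddress ends up being necessary.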
>> [19:36:26,189][INFO][grid-nio-worker-tcp-comm-15-#88%Server%][TcpCommunicationSpi] Accepted incoming communication connection [locAddr=/172.17.0.1:47100, rmtAddr=/172.21.86.7:41648]
>> [19:36:26,190][INFO][grid-nio-worker-tcp-comm-3-#76%Server%][TcpCommunicationSpi] Accepted incoming communication connection [locAddr=/0:0:0:0:0:0:0:1:47100, rmtAddr=/0:0:0:0:0:0:0:1:52484]
>> [19:36:26,191][INFO][grid-nio-worker-tcp-comm-5-#78%Server%][TcpCommunicationSpi] Accepted incoming communication connection [locAddr=/127.0.0.1:47100, rmtAddr=/127.0.0.1:37656]
>> [19:36:26,191][INFO][grid-nio-worker-tcp-comm-1-#74%Server%][TcpCommunicationSpi] Established outgoing communication connection [locAddr=/172.21.86.7:53272, rmtAddr=ip-172-21-86-175.ap-south-1.compute.internal/172.21.86.175:47100]
>> [19:36:26,191][INFO][grid-nio-worker-tcp-comm-0-#73%Server%][TcpCommunicationSpi] Established outgoing communication connection [locAddr=/172.17.0.1:41648, rmtAddr=ip-172-17-0-1.ap-south-1.compute.internal/172.17.0.1:47100]
>> [19:36:26,193][INFO][grid-nio-worker-tcp-comm-4-#77%Server%][TcpCommunicationSpi] Established outgoing communication connection [locAddr=/127.0.0.1:37656, rmtAddr=/127.0.0.1:47100]
>> [19:36:26,193][INFO][grid-nio-worker-tcp-comm-2-#75%Server%][TcpCommunicationSpi] Established outgoing communication connection [locAddr=/0:0:0:0:0:0:0:1:52484, rmtAddr=/0:0:0:0:0:0:0:1%lo:47100]
>> [19:36:26,195][INFO][grid-nio-worker-tcp-comm-8-#81%Server%][TcpCommunicationSpi] Accepted incoming communication connection [locAddr=/172.17.0.1:47100, rmtAddr=/172.21.86.7:41656]
>> [19:36:26,195][INFO][grid-nio-worker-tcp-comm-10-#83%Server%][TcpCommunicationSpi] Accepted incoming communication connection [locAddr=/0:0:0:0:0:0:0:1:47100, rmtAddr=/0:0:0:0:0:0:0:1:52492]
>> [19:36:26,195][INFO][grid-nio-worker-tcp-comm-12-#85%Server%][TcpCommunicationSpi] Accepted incoming communication connection [locAddr=/127.0.0.1:47100, rmtAddr=/127.0.0.1:37664]
>> [19:36:26,196][INFO][grid-nio-worker-tcp-comm-7-#80%Server%][TcpCommunicationSpi] Established outgoing communication connection [locAddr=/172.21.86.7:41076, rmtAddr=ip-172-21-86-229.ap-south-1.compute.internal/172.21.86.229:47100]
>>
>> On Mon, Sep 10, 2018 at 12:04 PM Ilya Kasnacheev <[email protected]> wrote:
>>
>>> Hello!
>>>
>>> I can see a lot of errors like this one:
>>>
>>> [04:05:29,268][INFO][tcp-comm-worker-#1%Server%][ZookeeperDiscoveryImpl] Created new communication error process future [errNode=598e3ead-99b8-4c49-b7df-04d578dcbf5f, err=class org.apache.ignite.IgniteCheckedException: Failed to connect to node (is node still alive?). Make sure that each ComputeTask and cache Transaction has a timeout set in order to prevent parties from waiting forever in case of network issues [nodeId=598e3ead-99b8-4c49-b7df-04d578dcbf5f, addrs=[ip-172-17-0-1.ap-south-1.compute.internal/172.17.0.1:47100, ip-172-21-85-213.ap-south-1.compute.internal/172.21.85.213:47100, /0:0:0:0:0:0:0:1%lo:47100, /127.0.0.1:47100]]]
>>>
>>> I think the problem is that you have two nodes that both have the 172.17.0.1 address, but it is a different address on each node (totally unrelated private nets).
>>>
>>> Try to specify your external address (such as 172.21.85.213) with TcpCommunicationSpi.setLocalAddress() on each node.
>>>
>>> Regards,
>>> --
>>> Ilya Kasnacheev
>>>
>>> On Fri, Sep 7, 2018 at 20:01, eugene miretsky <[email protected]> wrote:
>>>
>>>> Hi all,
>>>>
>>>> Can somebody please provide some pointers on what could be the issue or how to debug it? We have a fairly large Ignite use case, but cannot go ahead with a POC because of these crashes.
>>>> Cheers,
>>>> Eugene
>>>>
>>>> On Fri, Aug 31, 2018 at 11:52 AM eugene miretsky <[email protected]> wrote:
>>>>
>>>>> Also, I don't want to spam the mailing list with more threads, but I get the same stability issue when writing to Ignite from Spark. The log file from the crashed node (not the same node as before, probably random) is attached.
>>>>>
>>>>> I have also attached a GC log from another node (I have GC logging enabled only on one node).
>>>>>
>>>>> On Fri, Aug 31, 2018 at 11:23 AM eugene miretsky <[email protected]> wrote:
>>>>>
>>>>>> Thanks Denis,
>>>>>>
>>>>>> The execution plan and all logs from right after the crash are attached.
>>>>>>
>>>>>> Cheers,
>>>>>> Eugene
>>>>>> nohup.out <https://drive.google.com/file/d/10TvQOYgOAJpedrTw3IQ5ABwlW5QpK1Bc/view?usp=drive_web>
>>>>>>
>>>>>> On Fri, Aug 31, 2018 at 1:53 AM Denis Magda <[email protected]> wrote:
>>>>>>
>>>>>>> Eugene,
>>>>>>>
>>>>>>> Please share full logs from all the nodes and the execution plan for the query. That's what the community usually needs to help with troubleshooting. Also attach GC logs; use these settings to gather them:
>>>>>>> https://apacheignite.readme.io/docs/jvm-and-system-tuning#section-detailed-garbage-collection-stats
>>>>>>>
>>>>>>> --
>>>>>>> Denis
>>>>>>>
>>>>>>> On Thu, Aug 30, 2018 at 3:19 PM eugene miretsky <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hello,
>>>>>>>>
>>>>>>>> I have a medium cluster set up for testing: 3 x r4.8xlarge EC2 nodes. It has persistence enabled and zero backups.
>>>>>>>> - Full configs are attached.
>>>>>>>> - JVM settings are: JVM_OPTS="-Xms16g -Xmx64g -server -XX:+AggressiveOpts -XX:MaxMetaspaceSize=256m -XX:+AlwaysPreTouch -XX:+UseG1GC -XX:+ScavengeBeforeFullGC -XX:+DisableExplicitGC"
>>>>>>>>
>>>>>>>> The table has 145M rows and takes up about 180G of memory.
>>>>>>>> I am testing 2 things:
>>>>>>>> 1) Writing SQL tables from Spark
>>>>>>>> 2) Performing large SQL queries (from the web console), for example:
>>>>>>>> SELECT COUNT(*) FROM (SELECT customer_id FROM MyTable WHERE dt > '2018-05-12' GROUP BY customer_id HAVING SUM(column1) > 2 AND MAX(column2) < 1)
>>>>>>>>
>>>>>>>> Most of the times I run the query, it fails after one of the nodes crashes (it has finished a few times, and then crashed the next time). I have similar stability issues when writing from Spark: at some point, one of the nodes crashes. All I can see in the logs is:
>>>>>>>>
>>>>>>>> [21:51:58,548][SEVERE][disco-event-worker-#101%Server%][] Critical system error detected. Will be handled accordingly to configured handler [hnd=class o.a.i.failure.StopNodeFailureHandler, failureCtx=FailureContext [type=SEGMENTATION, err=null]]
>>>>>>>> [21:51:58,549][SEVERE][disco-event-worker-#101%Server%][FailureProcessor] Ignite node is in invalid state due to a critical failure.
>>>>>>>> [21:51:58,549][SEVERE][node-stopper][] Stopping local node on Ignite failure: [failureCtx=FailureContext [type=SEGMENTATION, err=null]]
>>>>>>>> [21:52:03] Ignite node stopped OK [name=Server, uptime=00:07:06.780]
>>>>>>>>
>>>>>>>> My questions are:
>>>>>>>> 1) What is causing the issue?
>>>>>>>> 2) How can I debug it better?
>>>>>>>>
>>>>>>>> The rate of crashes and our lack of ability to debug them is becoming quite a concern.
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>> Eugene
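[Editor's note: the fix described at the top of the thread (preferIPv4Stack plus an explicit localAddress, as Ilya suggested) would look roughly like this in a Spring XML node config. This is a sketch: the bean layout follows standard Ignite configs, and 172.21.86.7 is just one node's address taken from the logs above; each node must use its own external address.]

```xml
<!-- JVM side: also start each node with -Djava.net.preferIPv4Stack=true -->
<bean class="org.apache.ignite.configuration.IgniteConfiguration">
    <!-- Bind this node to its real EC2 address, not the Docker bridge (172.17.0.1). -->
    <property name="localHost" value="172.21.86.7"/>
    <property name="communicationSpi">
        <bean class="org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi">
            <property name="localAddress" value="172.21.86.7"/>
        </bean>
    </property>
</bean>
```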
