Hello! I have filed this ticket: https://issues.apache.org/jira/browse/IGNITE-9586
Hope that it eventually gets looked at by somebody in context.

Regards,
--
Ilya Kasnacheev


Wed, 12 Sep 2018 at 22:10, eugene miretsky <[email protected]>:

> Good question :)
> Yardstick does this, but I am not sure it is a valid prod solution:
>
> https://github.com/apache/ignite/blob/3307a8b26ccb5f0bb7e9c387c73fd221b98ab668/modules/yardstick/src/main/java/org/apache/ignite/yardstick/jdbc/AbstractJdbcBenchmark.java
>
> We have set preferIPv4Stack=true and provided localAddress in the config -
> it seems to have solved the problem. (We didn't run it enough to be 100% sure.)
>
> On Wed, Sep 12, 2018 at 10:59 AM Ilya Kasnacheev <[email protected]> wrote:
>
>> Hello!
>>
>> How would you distinguish the wrong interface (172.17.0.1) from the
>> right one if you were Ignite?
>>
>> I think this is not the first time I have seen this problem, but I have
>> positively no idea how to tackle it.
>> Maybe Docker experts could chime in?
>>
>> Regards,
>> --
>> Ilya Kasnacheev
>>
>>
>> Wed, 12 Sep 2018 at 3:29, eugene miretsky <[email protected]>:
>>
>>> Thanks Ilya,
>>>
>>> We are writing to Ignite from Spark running in EMR. We don't know the
>>> address of the node in advance. We have tried:
>>> 1) Setting localHost in the Ignite configuration to 127.0.0.1, as per
>>> the example online.
>>> 2) Leaving localHost unset and letting Ignite figure out the host.
>>>
>>> I have attached more logs at the end.
>>>
>>> My understanding is that Ignite should pick the first non-local address
>>> to publish; however, it seems to randomly pick one of (a) the proper
>>> address, (b) an IPv6 address, (c) 127.0.0.1, (d) 172.17.0.1.
>>>
>>> A few questions:
>>> 1) How do we force the Spark client to use the proper address?
>>> 2) Where is 172.17.0.1 coming from? It is usually the default Docker
>>> network host address, and it seems like Ignite creates a network interface
>>> for it on the instance (otherwise I have no idea where the interface is
>>> coming from).
>>> 3) If there are communication errors, shouldn't the ZooKeeper split-brain
>>> resolver kick in and shut down the dead node? Or shouldn't at least the
>>> initiating node mark the remote node as dead?
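>>>
>>> For reference, the localHost knob from (1) and (2) looks roughly like
>>> this programmatically (a sketch only - we actually provide it through
>>> the attached config, so take the Java form as illustrative):
>>>
>>> import org.apache.ignite.Ignite;
>>> import org.apache.ignite.Ignition;
>>> import org.apache.ignite.configuration.IgniteConfiguration;
>>>
>>> IgniteConfiguration cfg = new IgniteConfiguration();
>>> // Attempt (1): pin the published address to loopback, per the
>>> // example online.
>>> cfg.setLocalHost("127.0.0.1");
>>> // Attempt (2): simply leave localHost unset and let Ignite pick
>>> // an address on its own.
>>> Ignite ignite = Ignition.start(cfg);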
>>>
>>> [19:36:26,189][INFO][grid-nio-worker-tcp-comm-15-#88%Server%][TcpCommunicationSpi] Accepted incoming communication connection [locAddr=/172.17.0.1:47100, rmtAddr=/172.21.86.7:41648]
>>> [19:36:26,190][INFO][grid-nio-worker-tcp-comm-3-#76%Server%][TcpCommunicationSpi] Accepted incoming communication connection [locAddr=/0:0:0:0:0:0:0:1:47100, rmtAddr=/0:0:0:0:0:0:0:1:52484]
>>> [19:36:26,191][INFO][grid-nio-worker-tcp-comm-5-#78%Server%][TcpCommunicationSpi] Accepted incoming communication connection [locAddr=/127.0.0.1:47100, rmtAddr=/127.0.0.1:37656]
>>> [19:36:26,191][INFO][grid-nio-worker-tcp-comm-1-#74%Server%][TcpCommunicationSpi] Established outgoing communication connection [locAddr=/172.21.86.7:53272, rmtAddr=ip-172-21-86-175.ap-south-1.compute.internal/172.21.86.175:47100]
>>> [19:36:26,191][INFO][grid-nio-worker-tcp-comm-0-#73%Server%][TcpCommunicationSpi] Established outgoing communication connection [locAddr=/172.17.0.1:41648, rmtAddr=ip-172-17-0-1.ap-south-1.compute.internal/172.17.0.1:47100]
>>> [19:36:26,193][INFO][grid-nio-worker-tcp-comm-4-#77%Server%][TcpCommunicationSpi] Established outgoing communication connection [locAddr=/127.0.0.1:37656, rmtAddr=/127.0.0.1:47100]
>>> [19:36:26,193][INFO][grid-nio-worker-tcp-comm-2-#75%Server%][TcpCommunicationSpi] Established outgoing communication connection [locAddr=/0:0:0:0:0:0:0:1:52484, rmtAddr=/0:0:0:0:0:0:0:1%lo:47100]
>>> [19:36:26,195][INFO][grid-nio-worker-tcp-comm-8-#81%Server%][TcpCommunicationSpi] Accepted incoming communication connection [locAddr=/172.17.0.1:47100, rmtAddr=/172.21.86.7:41656]
>>> [19:36:26,195][INFO][grid-nio-worker-tcp-comm-10-#83%Server%][TcpCommunicationSpi] Accepted incoming communication connection [locAddr=/0:0:0:0:0:0:0:1:47100, rmtAddr=/0:0:0:0:0:0:0:1:52492]
>>> [19:36:26,195][INFO][grid-nio-worker-tcp-comm-12-#85%Server%][TcpCommunicationSpi] Accepted incoming communication connection [locAddr=/127.0.0.1:47100, rmtAddr=/127.0.0.1:37664]
>>> [19:36:26,196][INFO][grid-nio-worker-tcp-comm-7-#80%Server%][TcpCommunicationSpi] Established outgoing communication connection [locAddr=/172.21.86.7:41076, rmtAddr=ip-172-21-86-229.ap-south-1.compute.internal/172.21.86.229:47100]
>>>
>>> On Mon, Sep 10, 2018 at 12:04 PM Ilya Kasnacheev <[email protected]> wrote:
>>>
>>>> Hello!
>>>>
>>>> I can see a lot of errors like this one:
>>>>
>>>> [04:05:29,268][INFO][tcp-comm-worker-#1%Server%][ZookeeperDiscoveryImpl] Created new communication error process future [errNode=598e3ead-99b8-4c49-b7df-04d578dcbf5f, err=class org.apache.ignite.IgniteCheckedException: Failed to connect to node (is node still alive?). Make sure that each ComputeTask and cache Transaction has a timeout set in order to prevent parties from waiting forever in case of network issues [nodeId=598e3ead-99b8-4c49-b7df-04d578dcbf5f, addrs=[ip-172-17-0-1.ap-south-1.compute.internal/172.17.0.1:47100, ip-172-21-85-213.ap-south-1.compute.internal/172.21.85.213:47100, /0:0:0:0:0:0:0:1%lo:47100, /127.0.0.1:47100]]]
>>>>
>>>> I think the problem is that both of your nodes have a 172.17.0.1
>>>> address, but they are different addresses (totally unrelated private
>>>> nets).
>>>>
>>>> Try specifying your external address (such as 172.21.85.213) with
>>>> TcpCommunicationSpi.setLocalAddress() on each node.
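>>>>
>>>> A minimal sketch of what I mean (each node should set its own external
>>>> address; the value below is just the example from your logs):
>>>>
>>>> import org.apache.ignite.configuration.IgniteConfiguration;
>>>> import org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi;
>>>>
>>>> IgniteConfiguration cfg = new IgniteConfiguration();
>>>> TcpCommunicationSpi commSpi = new TcpCommunicationSpi();
>>>> // Publish only the external address for communication, so peers
>>>> // never try the unrelated 172.17.0.1 Docker bridge address.
>>>> commSpi.setLocalAddress("172.21.85.213");
>>>> cfg.setCommunicationSpi(commSpi);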
>>>>
>>>> Regards,
>>>> --
>>>> Ilya Kasnacheev
>>>>
>>>>
>>>> Fri, 7 Sep 2018 at 20:01, eugene miretsky <[email protected]>:
>>>>
>>>>> Hi all,
>>>>>
>>>>> Can somebody please provide some pointers on what could be the issue
>>>>> or how to debug it? We have a fairly large Ignite use case but cannot
>>>>> go ahead with a POC because of these crashes.
>>>>>
>>>>> Cheers,
>>>>> Eugene
>>>>>
>>>>> On Fri, Aug 31, 2018 at 11:52 AM eugene miretsky <[email protected]> wrote:
>>>>>
>>>>>> Also, I don't want to spam the mailing list with more threads, but I
>>>>>> get the same stability issue when writing to Ignite from Spark. The
>>>>>> logfile from the crashed node (not the same node as before; probably
>>>>>> random) is attached.
>>>>>>
>>>>>> I have also attached a GC log from another node (I have GC logging
>>>>>> enabled on only one node).
>>>>>>
>>>>>> On Fri, Aug 31, 2018 at 11:23 AM eugene miretsky <[email protected]> wrote:
>>>>>>
>>>>>>> Thanks Denis,
>>>>>>>
>>>>>>> The execution plan and all logs from right after the crash are
>>>>>>> attached.
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Eugene
>>>>>>>
>>>>>>> nohup.out
>>>>>>> <https://drive.google.com/file/d/10TvQOYgOAJpedrTw3IQ5ABwlW5QpK1Bc/view?usp=drive_web>
>>>>>>>
>>>>>>> On Fri, Aug 31, 2018 at 1:53 AM Denis Magda <[email protected]> wrote:
>>>>>>>
>>>>>>>> Eugene,
>>>>>>>>
>>>>>>>> Please share full logs from all the nodes and the execution plan
>>>>>>>> for the query. That's what the community usually needs to help with
>>>>>>>> troubleshooting. Also, attach GC logs. Use these settings to gather
>>>>>>>> them:
>>>>>>>> https://apacheignite.readme.io/docs/jvm-and-system-tuning#section-detailed-garbage-collection-stats
>>>>>>>>
>>>>>>>> --
>>>>>>>> Denis
>>>>>>>>
>>>>>>>> On Thu, Aug 30, 2018 at 3:19 PM eugene miretsky <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hello,
>>>>>>>>>
>>>>>>>>> I have a medium cluster set up for testing - 3 x r4.8xlarge EC2
>>>>>>>>> nodes. It has persistence enabled and zero backups.
>>>>>>>>> - Full configs are attached.
>>>>>>>>> - JVM settings are: JVM_OPTS="-Xms16g -Xmx64g -server
>>>>>>>>> -XX:+AggressiveOpts -XX:MaxMetaspaceSize=256m -XX:+AlwaysPreTouch
>>>>>>>>> -XX:+UseG1GC -XX:+ScavengeBeforeFullGC -XX:+DisableExplicitGC"
>>>>>>>>>
>>>>>>>>> The table has 145M rows and takes up about 180G of memory.
>>>>>>>>> I am testing 2 things:
>>>>>>>>> 1) Writing SQL tables from Spark.
>>>>>>>>> 2) Performing large SQL queries (from the web console), for example:
>>>>>>>>> SELECT COUNT(*) FROM (SELECT customer_id FROM MyTable
>>>>>>>>> WHERE dt > '2018-05-12' GROUP BY customer_id
>>>>>>>>> HAVING SUM(column1) > 2 AND MAX(column2) < 1)
>>>>>>>>>
>>>>>>>>> Most of the time the query fails after one of the nodes crashes
>>>>>>>>> (it has finished a few times, then crashed on the next run). I see
>>>>>>>>> similar stability issues when writing from Spark - at some point,
>>>>>>>>> one of the nodes crashes. All I can see in the logs is:
>>>>>>>>>
>>>>>>>>> [21:51:58,548][SEVERE][disco-event-worker-#101%Server%][] Critical system error detected. Will be handled accordingly to configured handler [hnd=class o.a.i.failure.StopNodeFailureHandler, failureCtx=FailureContext [type=SEGMENTATION, err=null]]
>>>>>>>>>
>>>>>>>>> [21:51:58,549][SEVERE][disco-event-worker-#101%Server%][FailureProcessor] Ignite node is in invalid state due to a critical failure.
>>>>>>>>>
>>>>>>>>> [21:51:58,549][SEVERE][node-stopper][] Stopping local node on Ignite failure: [failureCtx=FailureContext [type=SEGMENTATION, err=null]]
>>>>>>>>>
>>>>>>>>> [21:52:03] Ignite node stopped OK [name=Server, uptime=00:07:06.780]
>>>>>>>>>
>>>>>>>>> My questions are:
>>>>>>>>> 1) What is causing the issue?
>>>>>>>>> 2) How can I debug it better?
>>>>>>>>>
>>>>>>>>> The rate of crashes, and our lack of ability to debug them, is
>>>>>>>>> becoming quite a concern.
>>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>> Eugene
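P.S. Since the connection error quoted above says "Make sure that each
ComputeTask and cache Transaction has a timeout set", here is a rough
sketch of the relevant settings. This is a sketch, not a fix for the
segmentation itself, and the timeout values are arbitrary placeholders
worth tuning for your environment:

import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.configuration.TransactionConfiguration;
import org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi;

IgniteConfiguration cfg = new IgniteConfiguration();

// Fail fast when a peer is unreachable instead of waiting indefinitely.
TcpCommunicationSpi commSpi = new TcpCommunicationSpi();
commSpi.setConnectTimeout(10_000); // milliseconds
cfg.setCommunicationSpi(commSpi);

// Default transaction timeout, so transactions cannot block forever
// when a node drops out mid-operation.
TransactionConfiguration txCfg = new TransactionConfiguration();
txCfg.setDefaultTxTimeout(30_000); // milliseconds
cfg.setTransactionConfiguration(txCfg);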
