Re: Ignite in Kubernetes not working correctly

2019-01-14 Thread Alena Laas
failureDetectionTimeout - 6
joinTimeout - 12
Saw these recommendations in one of the answers on your forum.
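
For reference, this is roughly where those two settings live (a sketch,
assuming programmatic Java configuration; the values below are illustrative
milliseconds, not necessarily the exact ones we run with):

    import org.apache.ignite.Ignite;
    import org.apache.ignite.Ignition;
    import org.apache.ignite.configuration.IgniteConfiguration;
    import org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi;

    IgniteConfiguration cfg = new IgniteConfiguration();

    // Node-level failure detection; the Ignite 2.7 default is 10_000 ms.
    cfg.setFailureDetectionTimeout(60_000L);

    // joinTimeout lives on the discovery SPI (0, the default, means no limit).
    TcpDiscoverySpi discoverySpi = new TcpDiscoverySpi();
    discoverySpi.setJoinTimeout(120_000L);
    cfg.setDiscoverySpi(discoverySpi);

    Ignite ignite = Ignition.start(cfg);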

On Mon, Jan 14, 2019 at 2:21 PM Stephen Darlington <
stephen.darling...@gridgain.com> wrote:

> Glad you managed to resolve it. What did you have to increase the values
> to?
>
> Regards,
> Stephen

Re: Ignite in Kubernetes not working correctly

2019-01-14 Thread Alena Laas
It seems that increasing joinTimeout and failureDetectionTimeout solved the
problem.
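
If anyone wants to confirm what a node actually picked up, the effective
values are visible through the public configuration API (a small sketch;
it assumes the node uses TcpDiscoverySpi, hence the cast, and that "ignite"
is the already-started instance):

    import org.apache.ignite.Ignite;
    import org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi;

    // Print the effective timeouts on a running node.
    TcpDiscoverySpi spi = (TcpDiscoverySpi)ignite.configuration().getDiscoverySpi();
    System.out.println("failureDetectionTimeout = "
        + ignite.configuration().getFailureDetectionTimeout());
    System.out.println("joinTimeout = " + spi.getJoinTimeout());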

On Fri, Jan 11, 2019 at 5:24 PM Alena Laas wrote:

> I attached part of the log with "node failed" events (100.99.129.141 is
> the IP of the restarted node).
>
> These events repeat until, suddenly, after about 40 minutes to an hour,
> the node connects to the cluster.
>
> Could you explain why this is happening?

Re: Ignite in Kubernetes not working correctly

2019-01-10 Thread Alena Laas
We are using an Azure AKS cluster.

We kill the pod using the Kubernetes dashboard or through kubectl (kubectl
delete pods ); either way, the result is the same.

Do you need any more logs from us?
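
In case the discovery setup matters, here is a minimal sketch of the usual
Kubernetes-based discovery piece (assuming the standard
TcpDiscoveryKubernetesIpFinder from the ignite-kubernetes module; the
service and namespace names below are placeholders, not our real ones):

    import org.apache.ignite.configuration.IgniteConfiguration;
    import org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi;
    import org.apache.ignite.spi.discovery.tcp.ipfinder.kubernetes.TcpDiscoveryKubernetesIpFinder;

    // Nodes resolve each other via the endpoints of a Kubernetes service.
    TcpDiscoveryKubernetesIpFinder ipFinder = new TcpDiscoveryKubernetesIpFinder();
    ipFinder.setServiceName("ignite");   // placeholder service name
    ipFinder.setNamespace("default");    // placeholder namespace

    TcpDiscoverySpi discoverySpi = new TcpDiscoverySpi();
    discoverySpi.setIpFinder(ipFinder);

    IgniteConfiguration cfg = new IgniteConfiguration();
    cfg.setDiscoverySpi(discoverySpi);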

On Thu, Jan 10, 2019 at 7:28 PM Stephen Darlington <
stephen.darling...@gridgain.com> wrote:

> What kind of environment are you using? A public cloud? Your own data
> centre? And how are you killing the pod?
>
> I fired up a cluster using Minikube and your configuration and it worked
> as far as I could see. (I deleted the pod using the dashboard, for what
> that’s worth.)
>
> Regards,
> Stephen

Fwd: Ignite in Kubernetes not working correctly

2019-01-10 Thread Alena Laas
---------- Forwarded message ---------
From: Alena Laas 
Date: Thu, Jan 10, 2019 at 5:13 PM
Subject: Ignite in Kubernetes not working correctly
To: 
Cc: Vadim Shcherbakov 


Hello!
Could you please help with a problem with Ignite within a Kubernetes
cluster?

When we start 2 Ignite nodes at the same time, or scale the Deployment
(from 1 to 2), everything is fine: both nodes are visible inside the Ignite
cluster (we use the web console to see it).

But after we kill the pod with one node and it restarts, the node is no
longer seen in the Ignite cluster. Moreover, the logs from the restarted
node look sparse:
[13:32:57]    __________  ________________
[13:32:57]   /  _/ ___/ |/ /  _/_  __/ __/
[13:32:57]  _/ // (7 7    // /  / / / _/
[13:32:57] /___/\___/_/|_/___/ /_/ /___/
[13:32:57]
[13:32:57] ver. 2.7.0#20181130-sha1:256ae401
[13:32:57] 2018 Copyright(C) Apache Software Foundation
[13:32:57]
[13:32:57] Ignite documentation: http://ignite.apache.org
[13:32:57]
[13:32:57] Quiet mode.
[13:32:57] ^-- Logging to file
'/opt/ignite/apache-ignite/work/log/ignite-7d323675.0.log'
[13:32:57] ^-- Logging by 'JavaLogger [quiet=true, config=null]'
[13:32:57] ^-- To see **FULL** console log here add -DIGNITE_QUIET=false or
"-v" to ignite.{sh|bat}
[13:32:57]
[13:32:57] OS: Linux 4.15.0-1036-azure amd64
[13:32:57] VM information: OpenJDK Runtime Environment 1.8.0_181-b13 Oracle
Corporation OpenJDK 64-Bit Server VM 25.181-b13
[13:32:57] Please set system property '-Djava.net.preferIPv4Stack=true' to
avoid possible problems in mixed environments.
[13:32:57] Configured plugins:
[13:32:57] ^-- None
[13:32:57]
[13:32:57] Configured failure handler: [hnd=StopNodeOrHaltFailureHandler
[tryStop=false, timeout=0, super=AbstractFailureHandler
[ignoredFailureTypes=[SYSTEM_WORKER_BLOCKED
[13:32:58] Message queue limit is set to 0 which may lead to potential
OOMEs when running cache operations in FULL_ASYNC or PRIMARY_SYNC modes due
to message queues growth on sender and receiver sides.
[13:32:58] Security status [authentication=off, tls/ssl=off]

And the logs from the remaining node say that there are either 2 servers or
1, and this info keeps flipping back and forth:
[14:02:05] Joining node doesn't have encryption data
[node=7d323675-bc0b-4507-affb-672b25766201]
[14:02:15] Topology snapshot [ver=234, locNode=a5eb30e1, servers=2,
clients=0, state=ACTIVE, CPUs=16, offheap=40.0GB, heap=2.0GB]
[14:02:15] Topology snapshot [ver=235, locNode=a5eb30e1, servers=1,
clients=0, state=ACTIVE, CPUs=8, offheap=20.0GB, heap=1.0GB]
[14:02:20] Joining node doesn't have encryption data
[node=7d323675-bc0b-4507-affb-672b25766201]
[14:02:30] Topology snapshot [ver=236, locNode=a5eb30e1, servers=2,
clients=0, state=ACTIVE, CPUs=16, offheap=40.0GB, heap=2.0GB]
[14:02:30] Topology snapshot [ver=237, locNode=a5eb30e1, servers=1,
clients=0, state=ACTIVE, CPUs=8, offheap=20.0GB, heap=1.0GB]
[14:02:35] Joining node doesn't have encryption data
[node=7d323675-bc0b-4507-affb-672b25766201]
[14:02:45] Topology snapshot [ver=238, locNode=a5eb30e1, servers=2,
clients=0, state=ACTIVE, CPUs=16, offheap=40.0GB, heap=2.0GB]
[14:02:45] Topology snapshot [ver=239, locNode=a5eb30e1, servers=1,
clients=0, state=ACTIVE, CPUs=8, offheap=20.0GB, heap=1.0GB]
[14:02:50] Joining node doesn't have encryption data
[node=7d323675-bc0b-4507-affb-672b25766201]
[14:03:00] Topology snapshot [ver=240, locNode=a5eb30e1, servers=2,
clients=0, state=ACTIVE, CPUs=16, offheap=40.0GB, heap=2.0GB]
[14:03:00] Topology snapshot [ver=241, locNode=a5eb30e1, servers=1,
clients=0, state=ACTIVE, CPUs=8, offheap=20.0GB, heap=1.0GB]
[14:03:06] Joining node doesn't have encryption data
[node=7d323675-bc0b-4507-affb-672b25766201]
[14:03:16] Topology snapshot [ver=242, locNode=a5eb30e1, servers=2,
clients=0, state=ACTIVE, CPUs=16, offheap=40.0GB, heap=2.0GB]
[14:03:16] Topology snapshot [ver=243, locNode=a5eb30e1, servers=1,
clients=0, state=ACTIVE, CPUs=8, offheap=20.0GB, heap=1.0GB]
[14:03:21] Joining node doesn't have encryption data
[node=7d323675-bc0b-4507-affb-672b25766201]
[14:03:31] Topology snapshot [ver=244, locNode=a5eb30e1, servers=2,
clients=0, state=ACTIVE, CPUs=16, offheap=40.0GB, heap=2.0GB]
[14:03:31] Topology snapshot [ver=245, locNode=a5eb30e1, servers=1,
clients=0, state=ACTIVE, CPUs=8, offheap=20.0GB, heap=1.0GB]
[14:03:36] Joining node doesn't have encryption data
[node=7d323675-bc0b-4507-affb-672b25766201]
[14:03:46] Topology snapshot [ver=246, locNode=a5eb30e1, servers=2,
clients=0, state=ACTIVE, CPUs=16, offheap=40.0GB, heap=2.0GB]
[14:03:46] Topology snapshot [ver=247, locNode=a5eb30e1, servers=1,
clients=0, state=ACTIVE, CPUs=8, offheap=20.0GB, heap=1.0GB]
[14:03:51] Joining node doesn't have encryption data
[node=7d323675-bc0b-4507-affb-672b25766201]
[14:04:01] Topology snapshot [ver=248, locNode=a5eb30e1, servers=2,
clients=0, state=ACTIVE, CPUs=16, offheap=40.0GB, heap=2.0GB]
[14:04:01] Topology snapshot [ver=249, locNode=a5eb30e1, servers=1,
clients=0, state=ACTIVE, CPUs=8, offheap=20.0GB, heap=1.0GB]
[14:04:06] Jo