Re: node becoming inactive

Denis Mekhanikov Wed, 01 Aug 2018 08:03:02 -0700

Tim,

Server nodes do not try to reconnect to the cluster. Only clients do.
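
For reference, here is a minimal sketch of how the properties mentioned in this thread could be set if Ignite were configured directly from Java, which Nextflow does not seem to expose at the moment. The class name, the shared directory path and the timeout values are placeholder assumptions for illustration, not recommendations. As noted in the quoted replies, failure detection timeouts do not currently behave as intended, so treat this as a sketch rather than a verified fix.

```java
import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi;
import org.apache.ignite.spi.discovery.tcp.ipfinder.sharedfs.TcpDiscoverySharedFsIpFinder;

// Hypothetical helper class; the name is illustrative only.
public class TolerantNodeConfig {

    // Builds a configuration that uses shared-directory discovery
    // and more tolerant failure detection timeouts.
    public static IgniteConfiguration build(String sharedDir) {
        // Discovery through a shared directory, as in the setup described in this thread.
        TcpDiscoverySharedFsIpFinder ipFinder = new TcpDiscoverySharedFsIpFinder();
        ipFinder.setPath(sharedDir);

        TcpDiscoverySpi discovery = new TcpDiscoverySpi();
        discovery.setIpFinder(ipFinder);
        // How often unreachable addresses are cleaned from the IP finder, in milliseconds.
        // 60 seconds matches the "once per minute" behaviour described in the thread.
        discovery.setIpFinderCleanFrequency(60_000L);

        IgniteConfiguration cfg = new IgniteConfiguration();
        cfg.setDiscoverySpi(discovery);
        // Assumed, more tolerant values in milliseconds; tune them for your network.
        cfg.setFailureDetectionTimeout(30_000L);
        cfg.setClientFailureDetectionTimeout(60_000L);
        return cfg;
    }

    public static void main(String[] args) {
        // The path below is a placeholder; it must be a directory visible to all nodes.
        try (Ignite ignite = Ignition.start(build("/shared/ignite-discovery"))) {
            System.out.println("Nodes in topology: " + ignite.cluster().nodes().size());
        }
    }
}
```

Running this requires the ignite-core library on the classpath. Since a server node will not rejoin on its own, longer timeouts only reduce the chance of it being dropped in the first place.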


Denis

Wed, 1 Aug 2018 at 17:52, Tim Dudgeon <[email protected]>:

> OK, thanks. Will investigate the possibility.
>
> If this is caused by a network timeout or long GC pause as you suggested,
> shouldn't the node be expected to rejoin the cluster once the problem has
> gone away?
>
> On 01/08/18 15:46, Denis Mekhanikov wrote:
>
> Tim,
>
> This is strange, since *failureDetectionTimeout* and
> *clientFailureDetectionTimeout* are pretty important properties
> and Ignite maintainers encourage users to tune timeouts using them.
>
> I think you should send a feature request to Nextflow and ask them to add
> this property to the configuration.
>
> But once again, this functionality is currently broken in Ignite, and it's
> going to be fixed in 2.7.
>
> Denis
>
> Wed, 1 Aug 2018 at 17:25, Tim Dudgeon <[email protected]>:
>
>> Hi
>>
>> Yes, I saw that property, but:
>>
>> 1. I wasn't sure what the default was and what to use for a more tolerant
>> value
>>
>> 2. The Nextflow framework does not seem to allow this to be set. Only these
>> Ignite params seem to be controllable:
>> https://www.nextflow.io/docs/latest/ignite.html#advanced-options
>>
>> Tim
>>
>> On 01/08/18 15:17, Denis Mekhanikov wrote:
>>
>> As for the failing node: it looks like it failed due to a network issue or a
>> long GC pause.
>> Try increasing the value of the
>> IgniteConfiguration.html#clientFailureDetectionTimeout
>> <https://ignite.apache.org/releases/latest/javadoc/org/apache/ignite/configuration/IgniteConfiguration.html#setClientFailureDetectionTimeout-long->
>>  property.
>>
>> Denis
>>
>> Wed, 1 Aug 2018 at 17:00, Denis Mekhanikov <[email protected]>:
>>
>>> Tim,
>>>
>>> By default, the IP finder cleans unreachable addresses from the registry once
>>> per minute.
>>> You can change this frequency by setting a different value for the
>>> TcpDiscoverySpi.html#setIpFinderCleanFrequency
>>> <https://ignite.apache.org/releases/latest/javadoc/org/apache/ignite/spi/discovery/tcp/TcpDiscoverySpi.html#setIpFinderCleanFrequency-long->
>>>  configuration
>>> property.
>>>
>>> You shouldn't be concerned about these files too much. Unreachable
>>> addresses in the IP finder usually don't have any negative impact.
>>> IP finder addresses are used only during the initial node lookup. After that,
>>> nodes exchange their addresses and use only those that are actually
>>> connected to the cluster.
>>>
>>> Denis
>>>
>>> Wed, 1 Aug 2018 at 11:51, Tim Dudgeon <[email protected]>:
>>>
>>>> I'm hitting a strange problem when using an ignite cluster for
>>>> performing compute jobs.
>>>> Ignite is being managed by the underlying Nextflow tool
>>>> (https://www.nextflow.io/docs/latest/ignite.html) and I don't understand
>>>> the precise details of how this is set up, but I believe there's nothing
>>>> unusual going on.
>>>> Cluster discovery is being done using a shared directory.
>>>>
>>>> The cluster is set up fine, and jobs are executed on all the nodes as
>>>> expected. Then some random event happens on a node which results in it
>>>> leaving the cluster. The nextflow/ignite process is still running on the
>>>> node, and the tasks that are currently executing continue to execute to
>>>> completion, but no new tasks are started. And the node is still
>>>> registered in the shared cluster directory. But on the master the node
>>>> is seen to have left the cluster and no longer consumes jobs, and never
>>>> rejoins.
>>>>
>>>> We are seeing this on an OpenStack environment. When we do the same on
>>>> AWS the problem is not encountered. So presumably there is something
>>>> strange going on at the network level to cause this. Possibly changing
>>>> some of the timeouts might help. The ones that Nextflow allows to be
>>>> changed are listed here:
>>>> https://www.nextflow.io/docs/latest/ignite.html#advanced-options
>>>> But it's not clear to me which timeouts should be changed, and what new
>>>> values to try. Any advice here would be most welcome.
>>>>
>>>> For an example of this in action look here for the Nextflow log file on
>>>> the worker node, which includes various log output from Ignite:
>>>>
>>>> https://gist.github.com/tdudgeon/2940b8b1d1df03aecb7d13395cfb16a8#file-node-nextflow-log-L1109-L1265
>>>>
>>>>
>>>>
>>>>
>>
>
