Re: node becoming inactive

Denis Mekhanikov Wed, 01 Aug 2018 07:46:32 -0700

Tim,

This is strange, since *failureDetectionTimeout* and
*clientFailureDetectionTimeout* are pretty important properties
and Ignite maintainers encourage users to tune timeouts using them.


I think you should send a feature request to Nextflow and ask them to add
this property to the configuration.

But once again, this functionality is currently broken in Ignite, and it's
going to be fixed in 2.7.
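
For reference, when Ignite is configured directly rather than through
Nextflow, setting these two timeouts could look roughly like the sketch
below. It assumes the Ignite core library is on the classpath. The class name
and the values are illustrative assumptions, not recommendations; the
documented defaults are 10000 ms for failureDetectionTimeout and 30000 ms for
clientFailureDetectionTimeout.

```java
import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;

// Hypothetical class name, used only for this sketch.
public class TimeoutConfigSketch {
    public static void main(String[] args) {
        IgniteConfiguration cfg = new IgniteConfiguration();

        // Illustrative values (assumptions): more tolerant than the defaults.
        cfg.setFailureDetectionTimeout(30_000L);        // server nodes, in ms
        cfg.setClientFailureDetectionTimeout(60_000L);  // client nodes, in ms

        // Start a node with this configuration and print the effective value.
        try (Ignite ignite = Ignition.start(cfg)) {
            System.out.println("failureDetectionTimeout = "
                + ignite.configuration().getFailureDetectionTimeout());
        }
    }
}
```

Larger values make the cluster slower to notice a node that is really dead,
so this is a trade-off and not a free fix.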

Denis

Wed, 1 Aug 2018 at 17:25, Tim Dudgeon <[email protected]>:

> Hi
>
> Yes, I saw that property, but:
>
> 1. I wasn't sure what the default was and what to use for a more tolerant
> value
>
> 2. The Nextflow framework does not seem to allow setting this. Only these
> Ignite params seem to be controllable:
> https://www.nextflow.io/docs/latest/ignite.html#advanced-options
>
> Tim
>
> On 01/08/18 15:17, Denis Mekhanikov wrote:
>
> As for the failing node: it looks like it failed due to a network issue or
> a long GC pause.
> Try increasing the value of the
> IgniteConfiguration.html#clientFailureDetectionTimeout
> <https://ignite.apache.org/releases/latest/javadoc/org/apache/ignite/configuration/IgniteConfiguration.html#setClientFailureDetectionTimeout-long->
> property.
>
> Denis
>
> Wed, 1 Aug 2018 at 17:00, Denis Mekhanikov <[email protected]>:
>
>> Tim,
>>
>> By default IP finder cleans unreachable addresses from the registry once
>> per minute.
>> You can change this frequency by setting a different value for the
>> TcpDiscoverySpi.html#setIpFinderCleanFrequency
>> <https://ignite.apache.org/releases/latest/javadoc/org/apache/ignite/spi/discovery/tcp/TcpDiscoverySpi.html#setIpFinderCleanFrequency-long->
>> configuration property.
>>
>> You shouldn't be concerned about these files too much. Unreachable
>> addresses in IP finder usually don't have any negative impact.
>> IP finder addresses are used only during the initial node lookup. After
>> that, nodes exchange their addresses and use only those that are actually
>> connected to the cluster.
>>
>> Denis
>>
>> Wed, 1 Aug 2018 at 11:51, Tim Dudgeon <[email protected]>:
>>
>>> I'm hitting a strange problem when using an Ignite cluster for
>>> performing compute jobs.
>>> Ignite is being managed by the underlying Nextflow tool
>>> (https://www.nextflow.io/docs/latest/ignite.html) and I don't understand
>>> the precise details of how this is set up, but I believe there's nothing
>>> unusual going on.
>>> Cluster discovery is being done using a shared directory.
>>>
>>> The cluster is set up fine, and jobs are executed on all the nodes as
>>> expected. Then some random event happens on a node which results in it
>>> leaving the cluster. The nextflow/ignite process is still running on the
>>> node, and the tasks that are currently executing continue to execute to
>>> completion, but no new tasks are started. And the node is still
>>> registered in the shared cluster directory. But on the master the node
>>> is seen to have left the cluster and no longer consumes jobs, and never
>>> rejoins.
>>>
>>> We are seeing this on an OpenStack environment. When we do the same on
>>> AWS the problem is not encountered. So presumably there is something
>>> strange going on at the network level to cause this. Possibly changing
>>> some of the timeouts might help. The ones that Nextflow allows changing
>>> are listed here:
>>> https://www.nextflow.io/docs/latest/ignite.html#advanced-options
>>> But it's not clear to me which timeouts should be changed, and what new
>>> values to try. Any advice here would be most welcome.
>>>
>>> For an example of this in action look here for the Nextflow log file on
>>> the worker node, which includes various log output from Ignite:
>>>
>>> https://gist.github.com/tdudgeon/2940b8b1d1df03aecb7d13395cfb16a8#file-node-nextflow-log-L1109-L1265
>>>
>>>
>>>
>>>
>
