As for the failing node: it looks like it failed due to a network issue or
a long GC pause.
Try increasing the value of the
IgniteConfiguration#clientFailureDetectionTimeout
<https://ignite.apache.org/releases/latest/javadoc/org/apache/ignite/configuration/IgniteConfiguration.html#setClientFailureDetectionTimeout-long->
property.
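
A minimal sketch in Java of what that looks like (the 60-second value is
just an illustration; the default is 30 seconds, so pick a value that
covers your longest observed network hiccups and GC pauses):

import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;

public class LongerClientTimeout {
    public static void main(String[] args) {
        IgniteConfiguration cfg = new IgniteConfiguration();

        // How long (in milliseconds) the cluster tolerates an
        // unresponsive client node before dropping it from the topology.
        // The default is 30_000 ms; 60_000 here is illustrative.
        cfg.setClientFailureDetectionTimeout(60_000);

        Ignition.start(cfg);
    }
}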

Denis

Wed, Aug 1, 2018 at 17:00, Denis Mekhanikov <[email protected]>:

> Tim,
>
> By default, the IP finder cleans unreachable addresses from the registry
> once per minute.
> You can change this frequency by setting a different value for the
> TcpDiscoverySpi#setIpFinderCleanFrequency
> <https://ignite.apache.org/releases/latest/javadoc/org/apache/ignite/spi/discovery/tcp/TcpDiscoverySpi.html#setIpFinderCleanFrequency-long->
> configuration property.
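>
> For example, a minimal sketch in Java (the 5-minute value is
> illustrative; the default is 60 seconds):
>
> import org.apache.ignite.Ignition;
> import org.apache.ignite.configuration.IgniteConfiguration;
> import org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi;
>
> public class SlowerIpFinderCleanup {
>     public static void main(String[] args) {
>         TcpDiscoverySpi spi = new TcpDiscoverySpi();
>
>         // How often (in milliseconds) the IP finder registry is purged
>         // of unreachable addresses. The default is 60_000 (once per
>         // minute).
>         spi.setIpFinderCleanFrequency(5 * 60_000);
>
>         IgniteConfiguration cfg = new IgniteConfiguration();
>         cfg.setDiscoverySpi(spi);
>         Ignition.start(cfg);
>     }
> }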
>
> You shouldn't be too concerned about these files. Unreachable addresses
> in the IP finder usually don't have any negative impact.
> IP finder addresses are used only during the initial node lookup. After
> that, nodes exchange their addresses and use only those that are actually
> connected to the cluster.
>
> Denis
>
> Wed, Aug 1, 2018 at 11:51, Tim Dudgeon <[email protected]>:
>
>> I'm hitting a strange problem when using an Ignite cluster for
>> performing compute jobs.
>> Ignite is being managed by the underlying Nextflow tool
>> (https://www.nextflow.io/docs/latest/ignite.html). I don't understand
>> the precise details of how this is set up, but I believe there's nothing
>> unusual going on.
>> Cluster discovery is done using a shared directory.
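>>
>> For reference, I believe the discovery configuration Nextflow generates
>> is equivalent to something like the following sketch (the path is made
>> up; Nextflow manages the real one):
>>
>> import org.apache.ignite.Ignition;
>> import org.apache.ignite.configuration.IgniteConfiguration;
>> import org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi;
>> import org.apache.ignite.spi.discovery.tcp.ipfinder.sharedfs.TcpDiscoverySharedFsIpFinder;
>>
>> public class SharedDirDiscovery {
>>     public static void main(String[] args) {
>>         // Shared-filesystem IP finder: each node registers its address
>>         // as a file in a directory visible to all nodes.
>>         TcpDiscoverySharedFsIpFinder ipFinder = new TcpDiscoverySharedFsIpFinder();
>>         ipFinder.setPath("/cluster/shared/discovery"); // hypothetical path
>>
>>         TcpDiscoverySpi spi = new TcpDiscoverySpi();
>>         spi.setIpFinder(ipFinder);
>>
>>         IgniteConfiguration cfg = new IgniteConfiguration();
>>         cfg.setDiscoverySpi(spi);
>>         Ignition.start(cfg);
>>     }
>> }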
>>
>> The cluster is set up fine, and jobs are executed on all the nodes as
>> expected. Then some random event happens on a node that results in it
>> leaving the cluster. The Nextflow/Ignite process is still running on the
>> node, and the tasks that are currently executing run to completion, but
>> no new tasks are started. The node is also still registered in the
>> shared cluster directory. But on the master the node is seen to have
>> left the cluster; it no longer consumes jobs and never rejoins.
>>
>> We are seeing this in an OpenStack environment. When we do the same on
>> AWS the problem is not encountered, so presumably something strange is
>> going on at the network level that causes this. Possibly changing some
>> of the timeouts might help. The ones that Nextflow allows to be changed
>> are listed here:
>> https://www.nextflow.io/docs/latest/ignite.html#advanced-options
>> But it's not clear to me which timeouts should be changed, and what new
>> values to try. Any advice here would be most welcome.
>>
>> For an example of this in action, see the Nextflow log file from the
>> worker node, which includes various log output from Ignite:
>>
>> https://gist.github.com/tdudgeon/2940b8b1d1df03aecb7d13395cfb16a8#file-node-nextflow-log-L1109-L1265
