I'm hitting a strange problem when using an Ignite cluster to run compute jobs.
Ignite is being managed by Nextflow
(https://www.nextflow.io/docs/latest/ignite.html). I don't understand
the precise details of how this is set up, but I believe there's nothing
unusual going on.
Cluster discovery is being done using a shared directory.
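For reference, my understanding is that shared-directory discovery corresponds to Ignite's TcpDiscoverySharedFsIpFinder, i.e. something roughly like the sketch below (the path is a placeholder, and this is my assumption about what Nextflow sets up internally, not something I've verified in its source):

```java
import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi;
import org.apache.ignite.spi.discovery.tcp.ipfinder.sharedfs.TcpDiscoverySharedFsIpFinder;

public class SharedDirDiscoverySketch {
    public static void main(String[] args) {
        // Nodes advertise their addresses as files in a shared directory
        TcpDiscoverySharedFsIpFinder ipFinder = new TcpDiscoverySharedFsIpFinder();
        ipFinder.setPath("/shared/cluster-dir");   // placeholder path

        TcpDiscoverySpi discovery = new TcpDiscoverySpi();
        discovery.setIpFinder(ipFinder);

        IgniteConfiguration cfg = new IgniteConfiguration();
        cfg.setDiscoverySpi(discovery);

        Ignite ignite = Ignition.start(cfg);
    }
}
```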
The cluster starts up fine and jobs are executed on all the nodes as
expected. Then some random event happens on a node which causes it to
leave the cluster. The nextflow/ignite process is still running on that
node, and the tasks that are currently executing run to completion, but
no new tasks are started. The node also remains registered in the shared
cluster directory. On the master, however, the node is seen to have left
the cluster, no longer consumes jobs, and never rejoins.
We are seeing this on an OpenStack environment. When we do the same on
AWS the problem does not occur, so presumably something strange is going
on at the network level. Possibly changing some of the timeouts might
help. The ones that Nextflow allows to be changed are listed here:
https://www.nextflow.io/docs/latest/ignite.html#advanced-options
But it's not clear to me which timeouts should be changed, and what new
values to try. Any advice here would be most welcome.
For an example of this in action, see the Nextflow log file from the
worker node, which includes various log output from Ignite:
https://gist.github.com/tdudgeon/2940b8b1d1df03aecb7d13395cfb16a8#file-node-nextflow-log-L1109-L1265