I'm hitting a strange problem when using an Ignite cluster to run compute jobs. Ignite is being managed by the underlying Nextflow tool (https://www.nextflow.io/docs/latest/ignite.html), and while I don't understand the precise details of how this is set up, I believe there's nothing unusual going on.
Cluster discovery is being done using a shared directory.
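If I'm reading the Nextflow docs correctly, that corresponds to something like the following in nextflow.config (the path here is just a placeholder, not our actual setup):

    // nextflow.config - shared-filesystem discovery; the path is a placeholder
    cluster.join = 'path:/some/shared/dir'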

The cluster comes up fine and jobs are executed on all the nodes as expected. Then, at some point, an apparently random event happens on a node which results in it leaving the cluster. The nextflow/ignite process is still running on that node, and the tasks that are currently executing run to completion, but no new tasks are started. The node is also still registered in the shared cluster directory. On the master, however, the node is seen to have left the cluster, no longer consumes jobs, and never rejoins.

We are seeing this in an OpenStack environment; when we do the same on AWS the problem does not occur, so presumably something strange is going on at the network level. Possibly changing some of the timeouts might help. The ones that Nextflow allows you to change are listed here: https://www.nextflow.io/docs/latest/ignite.html#advanced-options But it's not clear to me which timeouts should be changed, or what new values to try. Any advice here would be most welcome.
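To make this concrete, this is the kind of thing I had in mind trying in nextflow.config. The property names are my reading of the advanced-options page (which, as I understand it, passes them through to Ignite's failure-detection and TCP discovery settings), and the values are guesses based on Ignite's documented defaults, so please correct me if any of this is wrong:

    // nextflow.config - speculative; names and values need checking against the docs
    cluster.failureDetectionTimeout       = 60000   // ms; Ignite's default is 10000
    cluster.clientFailureDetectionTimeout = 60000   // ms; Ignite's default is 30000
    cluster.tcp.ackTimeout                = 10000   // ms; discovery message acknowledgement
    cluster.tcp.socketTimeout             = 10000   // ms; discovery socket write/connect
    cluster.tcp.networkTimeout            = 30000   // ms; general discovery network timeout
    cluster.tcp.reconnectCount            = 20      // retries before a node gives up rejoining

Would bumping these (or some subset of them) be a sensible starting point, or is there a better knob to turn for a node that drops out and never rejoins?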

For an example of this in action, see the Nextflow log file from the worker node, which includes the relevant log output from Ignite:
https://gist.github.com/tdudgeon/2940b8b1d1df03aecb7d13395cfb16a8#file-node-nextflow-log-L1109-L1265


