As for the failing node: it looks like it failed due to a network issue or a long GC pause. Try increasing the value of the IgniteConfiguration#clientFailureDetectionTimeout <https://ignite.apache.org/releases/latest/javadoc/org/apache/ignite/configuration/IgniteConfiguration.html#setClientFailureDetectionTimeout-long-> property.
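
A minimal sketch of what that looks like in plain Java (Nextflow generates the Ignite configuration itself, so this only illustrates the property; the 60-second value is an assumption to tune, not a recommendation):

    import org.apache.ignite.Ignition;
    import org.apache.ignite.configuration.IgniteConfiguration;

    public class ClientTimeoutSketch {
        public static void main(String[] args) {
            IgniteConfiguration cfg = new IgniteConfiguration();

            // How long server nodes wait before dropping an unresponsive
            // client node. Default is 30 s; raising it (60 s here is an
            // illustrative value) tolerates longer GC pauses and network
            // hiccups.
            cfg.setClientFailureDetectionTimeout(60_000);

            // The equivalent for server-to-server connections; default 10 s.
            cfg.setFailureDetectionTimeout(30_000);

            Ignition.start(cfg);
        }
    }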
Denis

Wed, Aug 1, 2018 at 17:00, Denis Mekhanikov <[email protected]>:

> Tim,
>
> By default the IP finder cleans unreachable addresses from the registry
> once per minute. You can change this frequency by setting a different
> value via the TcpDiscoverySpi#setIpFinderCleanFrequency
> <https://ignite.apache.org/releases/latest/javadoc/org/apache/ignite/spi/discovery/tcp/TcpDiscoverySpi.html#setIpFinderCleanFrequency-long->
> configuration property.
>
> You shouldn't be too concerned about these files. Unreachable addresses
> in the IP finder usually don't have any negative impact.
> IP finder addresses are used only during the initial node lookup. After
> that, nodes exchange their addresses and use only those which are
> actually connected to the cluster.
>
> Denis
>
> Wed, Aug 1, 2018 at 11:51, Tim Dudgeon <[email protected]>:
>
>> I'm hitting a strange problem when using an Ignite cluster for
>> performing compute jobs.
>> Ignite is being managed by the underlying Nextflow tool
>> (https://www.nextflow.io/docs/latest/ignite.html). I don't understand
>> the precise details of how this is set up, but I believe there's
>> nothing unusual going on.
>> Cluster discovery is being done using a shared directory.
>>
>> The cluster is set up fine, and jobs are executed on all the nodes as
>> expected. Then some random event happens on a node which results in it
>> leaving the cluster. The Nextflow/Ignite process is still running on
>> the node, and the tasks that are currently executing continue to run
>> to completion, but no new tasks are started. The node is also still
>> registered in the shared cluster directory. But on the master the node
>> is seen to have left the cluster; it no longer consumes jobs and never
>> rejoins.
>>
>> We are seeing this in an OpenStack environment. When we do the same on
>> AWS the problem is not encountered, so presumably something strange is
>> going on at the network level. Possibly changing some of the timeouts
>> might help. The ones that Nextflow allows to be changed are listed
>> here:
>> https://www.nextflow.io/docs/latest/ignite.html#advanced-options
>> But it's not clear to me which timeouts should be changed, or what new
>> values to try. Any advice here would be most welcome.
>>
>> For an example of this in action, look here for the Nextflow log file
>> on the worker node, which includes various log output from Ignite:
>> https://gist.github.com/tdudgeon/2940b8b1d1df03aecb7d13395cfb16a8#file-node-nextflow-log-L1109-L1265
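
To make the IP finder advice above concrete, here is a sketch of a shared-directory discovery setup with a custom clean frequency (the path and the 30-second value are illustrative assumptions; under Nextflow this configuration is built for you):

    import org.apache.ignite.Ignition;
    import org.apache.ignite.configuration.IgniteConfiguration;
    import org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi;
    import org.apache.ignite.spi.discovery.tcp.ipfinder.sharedfs.TcpDiscoverySharedFsIpFinder;

    public class SharedFsDiscoverySketch {
        public static void main(String[] args) {
            // Discovery over a shared directory, as in Tim's setup.
            TcpDiscoverySharedFsIpFinder ipFinder = new TcpDiscoverySharedFsIpFinder();
            ipFinder.setPath("/shared/ignite/addrs"); // hypothetical path

            TcpDiscoverySpi discovery = new TcpDiscoverySpi();
            discovery.setIpFinder(ipFinder);

            // How often unreachable addresses are purged from the registry.
            // Default is 60 s (once per minute); 30 s is only illustrative.
            discovery.setIpFinderCleanFrequency(30_000);

            IgniteConfiguration cfg = new IgniteConfiguration();
            cfg.setDiscoverySpi(discovery);
            Ignition.start(cfg);
        }
    }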
