Tim,

Server nodes do not try to reconnect to the cluster. Only clients do.

Denis

Wed, 1 Aug 2018 at 17:52, Tim Dudgeon <[email protected]>:

> OK, thanks. Will investigate the possibility.
>
> If this is caused by a network timeout or long GC pause as you suggested,
> shouldn't the node be expected to rejoin the cluster once the problem has
> gone away?
>
> On 01/08/18 15:46, Denis Mekhanikov wrote:
>
> Tim,
>
> This is strange, since *failureDetectionTimeout* and
> *clientFailureDetectionTimeout* are pretty important properties
> and Ignite maintainers encourage users to tune timeouts using them.
>
> I think you should send a feature request to Nextflow and ask them to add
> this property to the configuration.
>
> But once again, this functionality is currently broken in Ignite, and it's
> going to be fixed in 2.7.
>
> Denis
>
> Wed, 1 Aug 2018 at 17:25, Tim Dudgeon <[email protected]>:
>
>> Hi
>>
>> Yes, I saw that property, but:
>>
>> 1. I wasn't sure what the default was and what to use for a more tolerant
>> value
>>
>> 2. The Nextflow framework does not seem to allow setting this. Only these
>> Ignite params seem to be controllable:
>> https://www.nextflow.io/docs/latest/ignite.html#advanced-options
>>
>> Tim
>>
>> On 01/08/18 15:17, Denis Mekhanikov wrote:
>>
>> As for the failing node: it looks like it failed due to a network issue or a
>> long GC pause.
>> Try increasing the value of the
>> IgniteConfiguration.html#clientFailureDetectionTimeout
>> <https://ignite.apache.org/releases/latest/javadoc/org/apache/ignite/configuration/IgniteConfiguration.html#setClientFailureDetectionTimeout-long->
>> property.
>>
>> Denis
>>
>> Wed, 1 Aug 2018 at 17:00, Denis Mekhanikov <[email protected]>:
>>
>>> Tim,
>>>
>>> By default IP finder cleans unreachable addresses from the registry once
>>> per minute.
>>> You can change this frequency by setting a different value to
>>> TcpDiscoverySpi.html#setIpFinderCleanFrequency
>>> <https://ignite.apache.org/releases/latest/javadoc/org/apache/ignite/spi/discovery/tcp/TcpDiscoverySpi.html#setIpFinderCleanFrequency-long->
>>> configuration property.
>>>
>>> You shouldn't be concerned about these files too much. Unreachable
>>> addresses in IP finder usually don't have any negative impact.
>>> IP finder addresses are used only during initial node lookup. After that
>>> nodes exchange their addresses, and use only those which are actually
>>> connected to the cluster.
>>>
>>> Denis
>>>
>>> Wed, 1 Aug 2018 at 11:51, Tim Dudgeon <[email protected]>:
>>>
>>>> I'm hitting a strange problem when using an ignite cluster for
>>>> performing compute jobs.
>>>> Ignite is being managed by the underlying Nextflow tool
>>>> (https://www.nextflow.io/docs/latest/ignite.html) and I don't
>>>> understand
>>>> the precise details of how this is set up, but I believe there's
>>>> nothing
>>>> unusual going on.
>>>> Cluster discovery is being done using a shared directory.
>>>>
>>>> The cluster is set up fine, and jobs are executed on all the nodes as
>>>> expected. Then some random event happens on a node which results in it
>>>> leaving the cluster. The nextflow/ignite process is still running on
>>>> the
>>>> node, and the tasks that are currently executing continue to execute to
>>>> completion, but no new tasks are started. And the node is still
>>>> registered in the shared cluster directory. But on the master the node
>>>> is seen to have left the cluster and no longer consumes jobs, and never
>>>> rejoins.
>>>>
>>>> We are seeing this on an OpenStack environment. When we do the same on
>>>> AWS the problem is not encountered. So presumably there is something
>>>> strange going on at the network level to cause this. Possibly changing
>>>> some of the timeouts might help.
>>>> The ones that Nextflow allows to change
>>>> are listed here:
>>>> https://www.nextflow.io/docs/latest/ignite.html#advanced-options
>>>> But it's not clear to me which timeouts should be changed, and what new
>>>> values to try. Any advice here would be most welcome.
>>>>
>>>> For an example of this in action look here for the Nextflow log file on
>>>> the worker node, which includes various log output from Ignite:
>>>>
>>>> https://gist.github.com/tdudgeon/2940b8b1d1df03aecb7d13395cfb16a8#file-node-nextflow-log-L1109-L1265
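
For reference, the settings discussed in this thread can be sketched with the plain Ignite Java API roughly as follows. This is an illustration only, not something Nextflow exposes: the class name, the shared directory path, and the chosen timeout values are assumptions made up for the example. As noted in the thread, the failure detection timeouts may not behave as expected before Ignite 2.7.

```java
import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi;
import org.apache.ignite.spi.discovery.tcp.ipfinder.sharedfs.TcpDiscoverySharedFsIpFinder;

// Hypothetical class name; requires ignite-core on the classpath.
public class TolerantNodeConfig {
    public static void main(String[] args) {
        // Shared-directory discovery, as described in the original question.
        TcpDiscoverySharedFsIpFinder ipFinder = new TcpDiscoverySharedFsIpFinder();
        ipFinder.setPath("/shared/ignite/addrs"); // assumed path, adjust to your setup

        TcpDiscoverySpi discovery = new TcpDiscoverySpi();
        discovery.setIpFinder(ipFinder);
        // How often unreachable addresses are cleaned from the IP finder.
        // The default is 60 000 ms (once per minute); 30 s is an arbitrary example.
        discovery.setIpFinderCleanFrequency(30_000L);

        IgniteConfiguration cfg = new IgniteConfiguration();
        cfg.setDiscoverySpi(discovery);
        // Timeout for detecting failed server nodes (default 10 000 ms).
        // The larger value is an assumption meant to tolerate long GC pauses.
        cfg.setFailureDetectionTimeout(30_000L);
        // Timeout for detecting failed client nodes (default 30 000 ms).
        cfg.setClientFailureDetectionTimeout(60_000L);

        // Ignite is AutoCloseable, so the node stops when the block exits.
        try (Ignite ignite = Ignition.start(cfg)) {
            System.out.println("Nodes in cluster: " + ignite.cluster().nodes().size());
        }
    }
}
```

Larger timeouts make a node more tolerant of network hiccups and GC pauses, at the cost of slower detection of nodes that have really failed.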
