Hi
Yes, I saw that property, but:
1. I wasn't sure what the default is, or what value to use to make it
more tolerant.
2. The Nextflow framework does not seem to allow this to be set. Only
these Ignite params seem to be controllable:
https://www.nextflow.io/docs/latest/ignite.html#advanced-options
Tim
On 01/08/18 15:17, Denis Mekhanikov wrote:
As for the failing node: it looks like it failed due to a network issue
or a long GC pause.
Try increasing the value of the
IgniteConfiguration.html#clientFailureDetectionTimeout
<https://ignite.apache.org/releases/latest/javadoc/org/apache/ignite/configuration/IgniteConfiguration.html#setClientFailureDetectionTimeout-long-> property.
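For anyone configuring Ignite directly in Java rather than through Nextflow, a minimal sketch of the suggestion above might look like the following. The class name and the 60-second value are hypothetical and chosen only for illustration; the setter and getter are the ones named in the linked Javadoc. In the Ignite 2.x releases I am aware of the default is 30 seconds, but the main method prints whatever the library on the classpath actually uses, so there is no need to rely on that figure.

```java
import org.apache.ignite.configuration.IgniteConfiguration;

public class ClientTimeoutSketch {
    // Hypothetical value for illustration only: 60 seconds.
    // Pick something longer than the worst network stall or GC pause you expect.
    static final long CLIENT_TIMEOUT_MS = 60_000L;

    // Builds a configuration with a more tolerant client failure detection timeout.
    static IgniteConfiguration buildConfig() {
        IgniteConfiguration cfg = new IgniteConfiguration();
        cfg.setClientFailureDetectionTimeout(CLIENT_TIMEOUT_MS);
        return cfg;
    }

    public static void main(String[] args) {
        // A fresh configuration reports the library default.
        System.out.println("default:    "
            + new IgniteConfiguration().getClientFailureDetectionTimeout() + " ms");
        System.out.println("configured: "
            + buildConfig().getClientFailureDetectionTimeout() + " ms");
        // The configuration would then be passed to Ignition.start(cfg).
    }
}
```

This property applies to client nodes. If the affected workers join the cluster as server nodes, the analogous setter is IgniteConfiguration.setFailureDetectionTimeout, which may be the one that matters instead.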
Denis
Wed, 1 Aug 2018 at 17:00, Denis Mekhanikov <[email protected]
<mailto:[email protected]>>:
Tim,
By default, the IP finder removes unreachable addresses from the
registry once per minute.
You can change this frequency by setting a different value for the
TcpDiscoverySpi.html#setIpFinderCleanFrequency
<https://ignite.apache.org/releases/latest/javadoc/org/apache/ignite/spi/discovery/tcp/TcpDiscoverySpi.html#setIpFinderCleanFrequency-long->
configuration property.
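As a rough sketch of what that looks like in plain Java configuration, assuming the shared-directory discovery described in the original message is backed by TcpDiscoverySharedFsIpFinder (an assumption, since the thread does not show how Nextflow wires it up). The class name, the directory path and the 30-second value are all hypothetical.

```java
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi;
import org.apache.ignite.spi.discovery.tcp.ipfinder.sharedfs.TcpDiscoverySharedFsIpFinder;

public class IpFinderCleanSketch {
    // Builds a configuration whose IP finder registry is cleaned every cleanFreqMs milliseconds.
    static IgniteConfiguration buildConfig(String sharedDir, long cleanFreqMs) {
        // Assumed IP finder: nodes register their addresses as files in a shared directory.
        TcpDiscoverySharedFsIpFinder ipFinder = new TcpDiscoverySharedFsIpFinder();
        ipFinder.setPath(sharedDir);

        TcpDiscoverySpi discovery = new TcpDiscoverySpi();
        discovery.setIpFinder(ipFinder);
        // How often unreachable addresses are removed from the registry.
        discovery.setIpFinderCleanFrequency(cleanFreqMs);

        IgniteConfiguration cfg = new IgniteConfiguration();
        cfg.setDiscoverySpi(discovery);
        return cfg;
    }

    public static void main(String[] args) {
        // Hypothetical path and frequency, for illustration only.
        IgniteConfiguration cfg = buildConfig("/shared/ignite-discovery", 30_000L);
        TcpDiscoverySpi spi = (TcpDiscoverySpi) cfg.getDiscoverySpi();
        System.out.println("clean frequency: " + spi.getIpFinderCleanFrequency() + " ms");
    }
}
```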
You shouldn't be too concerned about these files. Unreachable
addresses in the IP finder usually don't have any negative impact.
IP finder addresses are used only during the initial node lookup.
After that, nodes exchange their addresses and use only those that
are actually connected to the cluster.
Denis
Wed, 1 Aug 2018 at 11:51, Tim Dudgeon <[email protected]
<mailto:[email protected]>>:
I'm hitting a strange problem when using an Ignite cluster for
performing compute jobs.
Ignite is being managed by the underlying Nextflow tool
(https://www.nextflow.io/docs/latest/ignite.html) and I don't
understand the precise details of how this is set up, but I believe
there's nothing unusual going on.
Cluster discovery is being done using a shared directory.
The cluster is set up fine, and jobs are executed on all the nodes as
expected. Then some random event happens on a node which results in it
leaving the cluster. The nextflow/ignite process is still running on
the node, and the tasks that are currently executing continue to
execute to completion, but no new tasks are started. And the node is
still registered in the shared cluster directory. But on the master
the node is seen to have left the cluster and no longer consumes jobs,
and never rejoins.
We are seeing this in an OpenStack environment. When we do the same on
AWS the problem is not encountered, so presumably something strange is
going on at the network level to cause this. Changing some of the
timeouts might help. The ones that Nextflow allows to be changed are
listed here:
https://www.nextflow.io/docs/latest/ignite.html#advanced-options
But it's not clear to me which timeouts should be changed, and what new
values to try. Any advice here would be most welcome.
For an example of this in action, see the Nextflow log file from the
worker node, which includes various log output from Ignite:
https://gist.github.com/tdudgeon/2940b8b1d1df03aecb7d13395cfb16a8#file-node-nextflow-log-L1109-L1265