OK, thanks. Will investigate the possibility.

If this is caused by a network timeout or long GC pause as you suggested, shouldn't the node be expected to rejoin the cluster once the problem has gone away?


On 01/08/18 15:46, Denis Mekhanikov wrote:
Tim,

This is strange, since /failureDetectionTimeout/ and /clientFailureDetectionTimeout/ are pretty important properties
and Ignite maintainers encourage users to tune timeouts using them.

I think you should send a feature request to Nextflow asking them to expose this property in their configuration.

But once again, this functionality is currently broken in Ignite, and it's going to be fixed in 2.7.

Denis

On Wed, 1 Aug 2018 at 17:25, Tim Dudgeon <tdudgeon...@gmail.com> wrote:

    Hi

    Yes, I saw that property, but:

    1. I wasn't sure what the default was and what to use for a more
    tolerant value

    2. The Nextflow framework does not seem to allow setting this. Only
    these Ignite params seem to be controllable:
    https://www.nextflow.io/docs/latest/ignite.html#advanced-options

    Tim


    On 01/08/18 15:17, Denis Mekhanikov wrote:
    As for the failing node: it looks like it failed due to a network
    issue or a long GC pause.
    Try increasing the value of the clientFailureDetectionTimeout
    property of IgniteConfiguration:
    https://ignite.apache.org/releases/latest/javadoc/org/apache/ignite/configuration/IgniteConfiguration.html#setClientFailureDetectionTimeout-long-
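
    A minimal sketch of what raising these timeouts could look like when
    Ignite is configured directly from Java rather than through Nextflow.
    The class name and the timeout values are illustrative assumptions,
    not recommendations. As far as I know the Ignite 2.x defaults are
    10 000 ms for failureDetectionTimeout and 30 000 ms for
    clientFailureDetectionTimeout, but check the Javadoc for your version.

```java
import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;

// Hypothetical class name; requires ignite-core on the classpath.
public class TolerantTimeouts {
    public static void main(String[] args) {
        IgniteConfiguration cfg = new IgniteConfiguration();

        // Assumed example values (milliseconds); tune them for your network.
        cfg.setFailureDetectionTimeout(30_000L);        // server nodes
        cfg.setClientFailureDetectionTimeout(60_000L);  // client nodes

        // Start a node with the more tolerant timeouts and print its ID.
        try (Ignite ignite = Ignition.start(cfg)) {
            System.out.println("Local node: " + ignite.cluster().localNode().id());
        }
    }
}
```

    Larger values make the cluster more tolerant of GC pauses and network
    hiccups, at the cost of taking longer to notice a node that has really
    failed.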

    Denis

    On Wed, 1 Aug 2018 at 17:00, Denis Mekhanikov
    <dmekhani...@gmail.com> wrote:

        Tim,

        By default, the IP finder cleans unreachable addresses from the
        registry once per minute.
        You can change this frequency by setting a different value for
        the ipFinderCleanFrequency property of TcpDiscoverySpi:
        https://ignite.apache.org/releases/latest/javadoc/org/apache/ignite/spi/discovery/tcp/TcpDiscoverySpi.html#setIpFinderCleanFrequency-long-
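
        As a rough sketch, the clean frequency could be set like this
        when Ignite is configured directly from Java with shared
        directory discovery. The class name, method name, directory
        argument and the 15-second value are all illustrative
        assumptions; how Nextflow wires up its own discovery is not
        shown here.

```java
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi;
import org.apache.ignite.spi.discovery.tcp.ipfinder.sharedfs.TcpDiscoverySharedFsIpFinder;

// Hypothetical helper; requires ignite-core on the classpath.
public class IpFinderCleanup {
    static IgniteConfiguration configure(String sharedDir) {
        // IP finder that stores node addresses as files in a shared directory.
        TcpDiscoverySharedFsIpFinder ipFinder = new TcpDiscoverySharedFsIpFinder();
        ipFinder.setPath(sharedDir);

        TcpDiscoverySpi discovery = new TcpDiscoverySpi();
        discovery.setIpFinder(ipFinder);
        // Assumed example value: clean unreachable addresses every 15 s
        // instead of the once-per-minute default mentioned above.
        discovery.setIpFinderCleanFrequency(15_000L);

        IgniteConfiguration cfg = new IgniteConfiguration();
        cfg.setDiscoverySpi(discovery);
        return cfg;
    }
}
```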

        You shouldn't be too concerned about these files.
        Unreachable addresses in the IP finder usually don't have any
        negative impact.
        IP finder addresses are used only during the initial node lookup.
        After that, nodes exchange their addresses and use only those
        that are actually connected to the cluster.

        Denis

        On Wed, 1 Aug 2018 at 11:51, Tim Dudgeon
        <tdudgeon...@gmail.com> wrote:

            I'm hitting a strange problem when using an Ignite cluster
            for performing compute jobs.
            Ignite is being managed by the underlying Nextflow tool
            (https://www.nextflow.io/docs/latest/ignite.html) and I
            don't understand the precise details of how this is set up,
            but I believe there's nothing unusual going on.
            Cluster discovery is being done using a shared directory.

            The cluster is set up fine, and jobs are executed on all
            the nodes as expected. Then some random event happens on a
            node which results in it leaving the cluster. The
            nextflow/ignite process is still running on the node, and
            the tasks that are currently executing continue to execute
            to completion, but no new tasks are started. And the node
            is still registered in the shared cluster directory. But on
            the master the node is seen to have left the cluster and no
            longer consumes jobs, and never rejoins.

            We are seeing this in an OpenStack environment. When we do
            the same on AWS the problem is not encountered. So
            presumably there is something strange going on at the
            network level to cause this. Possibly changing some of the
            timeouts might help. The ones that Nextflow allows to be
            changed are listed here:
            https://www.nextflow.io/docs/latest/ignite.html#advanced-options
            But it's not clear to me which timeouts should be changed,
            and what new values to try. Any advice here would be most
            welcome.

            For an example of this in action look here for the Nextflow
            log file on the worker node, which includes various log
            output from Ignite:
            https://gist.github.com/tdudgeon/2940b8b1d1df03aecb7d13395cfb16a8#file-node-nextflow-log-L1109-L1265




