Re: node becoming inactive

Tim Dudgeon Wed, 01 Aug 2018 07:25:26 -0700

Hi

Yes, I saw that property, but:

1. I wasn't sure what the default was and what to use for a more tolerant value

2. The Nextflow framework does not seem to allow this to be set. Only these Ignite params seem to be controllable:

https://www.nextflow.io/docs/latest/ignite.html#advanced-options

Tim


On 01/08/18 15:17, Denis Mekhanikov wrote:

As for the failing node: it looks like it failed due to a network issue or a long GC pause. Try increasing the value of the IgniteConfiguration.html#clientFailureDetectionTimeout <https://ignite.apache.org/releases/latest/javadoc/org/apache/ignite/configuration/IgniteConfiguration.html#setClientFailureDetectionTimeout-long-> property.
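
For reference, a minimal sketch of how this property could be raised through the Java API. It assumes direct access to the IgniteConfiguration, which the Nextflow integration may not expose. The class name and the 60-second value are illustrative assumptions, not tuned recommendations; in Ignite 2.x the default for this property is 30000 ms.

```java
import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;

// Hypothetical launcher class, for illustration only.
public class TolerantClientTimeout {
    public static void main(String[] args) {
        IgniteConfiguration cfg = new IgniteConfiguration();

        // Default is 30_000 ms. 60_000 ms is an arbitrary, more tolerant
        // value chosen for this sketch; tune it against observed GC pauses
        // and network latency.
        cfg.setClientFailureDetectionTimeout(60_000L);

        // Start a node with the adjusted configuration and stop it on exit.
        try (Ignite ignite = Ignition.start(cfg)) {
            System.out.println("Local node id: " + ignite.cluster().localNode().id());
        }
    }
}
```

A larger timeout makes the cluster slower to notice nodes that really have failed, so it trades detection speed for tolerance of long pauses.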


Denis

On Wed, 1 Aug 2018 at 17:00, Denis Mekhanikov <[email protected]> wrote:


    Tim,

    By default IP finder cleans unreachable addresses from the
    registry once per minute.
    You can change this frequency by setting a different value to
    the TcpDiscoverySpi.html#setIpFinderCleanFrequency
    <https://ignite.apache.org/releases/latest/javadoc/org/apache/ignite/spi/discovery/tcp/TcpDiscoverySpi.html#setIpFinderCleanFrequency-long->
    configuration property.
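
    A minimal sketch of that setting in Java, assuming the cluster uses
    the shared-filesystem IP finder (as the shared discovery directory
    suggests) and that the configuration can be built by hand. The class
    name, method name, and the 10-second value are hypothetical; the
    default clean frequency is 60000 ms.

    ```java
    import org.apache.ignite.configuration.IgniteConfiguration;
    import org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi;
    import org.apache.ignite.spi.discovery.tcp.ipfinder.sharedfs.TcpDiscoverySharedFsIpFinder;

    // Hypothetical helper class, for illustration only.
    public class IpFinderCleanup {
        // sharedDir is the directory that all nodes use for discovery.
        static IgniteConfiguration configure(String sharedDir) {
            TcpDiscoverySharedFsIpFinder ipFinder = new TcpDiscoverySharedFsIpFinder();
            ipFinder.setPath(sharedDir);

            TcpDiscoverySpi spi = new TcpDiscoverySpi();
            spi.setIpFinder(ipFinder);
            // Clean unreachable addresses every 10 s instead of every 60 s
            // (illustrative value, not a recommendation).
            spi.setIpFinderCleanFrequency(10_000L);

            return new IgniteConfiguration().setDiscoverySpi(spi);
        }
    }
    ```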

    You shouldn't be concerned about these files too much. Unreachable
    addresses in IP finder usually don't have any negative impact.
    IP finder addresses are used only during initial node lookup.
    After that, nodes exchange their addresses and use only those
    that are actually connected to the cluster.

    Denis

    On Wed, 1 Aug 2018 at 11:51, Tim Dudgeon <[email protected]> wrote:

        I'm hitting a strange problem when using an Ignite cluster for
        performing compute jobs.
        Ignite is being managed by the underlying Nextflow tool
        (https://www.nextflow.io/docs/latest/ignite.html) and I don't
        understand the precise details of how this is set up, but I
        believe there's nothing unusual going on.
        Cluster discovery is being done using a shared directory.

        The cluster is set up fine, and jobs are executed on all the
        nodes as expected. Then some random event happens on a node
        which results in it leaving the cluster. The nextflow/ignite
        process is still running on the node, and the tasks that are
        currently executing continue to execute to completion, but no
        new tasks are started. And the node is still registered in the
        shared cluster directory. But on the master the node is seen to
        have left the cluster and no longer consumes jobs, and never
        rejoins.

        We are seeing this on an OpenStack environment. When we do the
        same on AWS the problem is not encountered. So presumably there
        is something strange going on at the network level to cause
        this. Possibly changing some of the timeouts might help. The
        ones that Nextflow allows to change are listed here:
        https://www.nextflow.io/docs/latest/ignite.html#advanced-options
        But it's not clear to me which timeouts should be changed, and
        what new values to try. Any advice here would be most welcome.

        For an example of this in action look here for the Nextflow
        log file on the worker node, which includes various log output
        from Ignite:
        https://gist.github.com/tdudgeon/2940b8b1d1df03aecb7d13395cfb16a8#file-node-nextflow-log-L1109-L1265
