Hey,

It has been solved. It was networking, as always (the master IP was changed
for a while, and I simply missed a couple of packets in tcpdump coming from
the wrong IP).
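
For anyone who hits the same symptom, here is a minimal Python sketch of a
check that makes stray packets like these easier to spot than reading the
capture by eye. It is only an illustration, not what was actually run here.
It assumes the capture was saved as text from `tcpdump -nn` (IPv4 only) and
that qmaster listens on port 6444, which is the usual default and may differ
on your installation. The function names and the IP addresses in the usage
note are hypothetical.

```python
import re
from collections import Counter

# Matches the address part of a `tcpdump -nn` IPv4 line, for example:
# "15:45:01.000001 IP 10.141.0.1.6444 > 10.141.0.5.51234: Flags [P.], ..."
PACKET_RE = re.compile(
    r"\bIP (\d+\.\d+\.\d+\.\d+)\.(\d+) > (\d+\.\d+\.\d+\.\d+)\.(\d+):"
)


def master_source_ips(lines, master_port=6444):
    """Count source IPs of packets whose source port is master_port.

    master_port=6444 is an assumption (the common qmaster default).
    """
    counts = Counter()
    for line in lines:
        match = PACKET_RE.search(line)
        if match and int(match.group(2)) == master_port:
            counts[match.group(1)] += 1
    return counts


def unexpected_ips(counts, expected_ip):
    """Return the source IPs that differ from the expected master IP."""
    return sorted(ip for ip in counts if ip != expected_ip)
```

To use it, feed it the lines of a saved capture, e.g.
`unexpected_ips(master_source_ips(open("capture.txt")), "10.141.0.1")`.
A non-empty result means some packets from the qmaster port arrived from an
address other than the one the node expects.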

Best regards,
Taras



On Fri, May 16, 2014 at 3:45 PM, Taras Shapovalov <
[email protected]> wrote:

> Hi guys,
>
> Recently I've run into quite weird behavior of sgeexecd (OGS 2011.11p1);
> maybe you can help me investigate the issue.
>
> I have a cluster with local and EC2 nodes; qmaster runs locally. On local
> nodes sgeexecd works fine as usual, but sgeexecd on cloud nodes registers
> with qmaster at startup and then, exactly 120 seconds later, tries to
> register with qmaster again (sgeexecd is not restarted at this point)! Of
> course qmaster rejects the registration with a message like this:
>
> commlib error: endpoint is not unique error (endpoint 
> "cnode001.cm.cluster/execd/1" is already connected)
>
>
> After that, jobs hang in the t state (although they have finished).
>
>
> Could you advise me on what I should check, or how I can debug this? I
> don't see any configuration parameter set to 2 minutes, so I don't understand
> what could trigger the re-registration after this period of time. Nothing
> useful is printed when I set SGE_ND and loglevel=log_info.
>
>
> The only difference I see between local and cloud nodes is that cloud nodes
> have 2 networks (local nodes have only one). But according to netstat (and
> tcpdump), sgeexecd on a cloud node connects to qmaster from the same IP both
> the first time and when it tries to re-register, so it seems the network
> configuration is not the cause.
>
>
> Any ideas are appreciated!
>
>
> Thanks,
>
> Taras
>
>
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
