Hi Lachlan,

Thanks for the reply. I am trying to find more ideas for this problem. Maybe it 
is some system issue or a strange communication problem.
As for your suggestions:
> Check that it's in /etc/hosts --> It is, and it answers ping by both IP and 
> host name every time I check.
> Check the slurmd logs --> In the node log there is no error. In the server 
> log there is the error I wrote ("agent/is_node_resp: node:myName1 
> RPC:REQUEST_PING : Can't find an address, check slurm.conf").
> Make sure there is enough disk space --> More than enough.
> Make sure that its datetime is synchronized with the others --> Same time 
> and date on all nodes and the Slurm server.

The problem is that I don't see any other error, and the node is up and running 
without any error.
Communication looks good, with ping succeeding every time, but it still looks 
like the server can't find it (and it happens every two minutes, always).
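
One further check that may be worth running on the Slurm server is sketched 
below. It rests on an assumption not confirmed in this thread: that the "Can't 
find an address" error means the controller failed to look up the node's name 
exactly as it is written in slurm.conf (NodeName, or NodeAddr if set). The 
helper name resolve() is made up for illustration, and 6818 is only the usual 
default slurmd port; the script simply asks the local resolver, so it is a 
rough sanity check rather than a reproduction of what slurmctld does.

```python
import socket
import sys


def resolve(name, port=6818):
    """Return the sorted addresses the local resolver gives for `name`.

    Returns an empty list when the lookup fails. `port` is only passed
    through to getaddrinfo; 6818 is assumed here as the slurmd port.
    """
    try:
        infos = socket.getaddrinfo(name, port, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        return []
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the address string for both IPv4 and IPv6.
    return sorted({info[4][0] for info in infos})


if __name__ == "__main__":
    # Pass the names exactly as spelled in slurm.conf, e.g. myName1.
    for name in sys.argv[1:]:
        addrs = resolve(name)
        print(name, "->", ", ".join(addrs) if addrs else "NO ADDRESS FOUND")
```

Running it with the name copied straight from slurm.conf, rather than typed by 
hand, would catch a spelling or case difference that a manual ping could miss.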

Thanks for your ideas,
Roy.

From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of 
Lachlan Musicman
Sent: Thursday, October 25, 2018 1:59 AM
To: Slurm User Community List
Subject: Re: [slurm-users] Can't find an address


On Wed, 24 Oct 2018 at 22:56, Zohar Roe MLM <rzoh...@iai.co.il> wrote:
Hello,
I have a node that for some reason changes state to "Down" every few minutes.
When I change it with scontrol to "resume" it is OK until it goes Down again.
In the slurm server log I can see error:
"agent/is_node_resp: node:myName1 RPC:REQUEST_PING : Can't find an address, 
check slurm.conf"

Now, the error message seems fairly straightforward, but I can't find the 
problem.
* The node is up and answers ping from the Slurm server.
* The Slurm daemon on the node is up and running.
* There isn't any error on the node itself.
* There are more nodes, configured the same (except for the IP address), that 
are OK.
* Running "scontrol update state=resume nodename=myNode" fixes the problem for 
a short time.
* Restarting the Slurm daemon on the node also fixes this for a short time.

Any idea what more I can check to resolve this?

Here's a quick off-the-top-of-my-head checklist:

Check that it's in /etc/hosts
Check the slurmd logs
Make sure there is enough disk space
Make sure that its datetime is synchronized with the others

cheers
L.

------
'...postwork futures are dismissed with the claim that "it is not in our nature 
to be idle", thereby demonstrating at once an essentialist view of labor and an 
impoverished imagination of the possibilities of nonwork.'

Kathi Weeks, The Problem with Work: Feminism, Marxism, Antiwork Politics and 
Postwork Imaginaries<https://www.dukeupress.edu/The-Problem-with-Work/>
