Re: [slurm-users] Random "sbatch" failure: "Socket timed out on send/recv operation"

2019-06-14 Thread Christopher Harrop - NOAA Affiliate
> Hi Chris > > You are right in pointing out that the job actually runs, despite the error in > the sbatch. The customer mentions that: > === start === > Problem had usual scenario - job script was submitted and executed, but > sbatch command returned non-zero exit status to ecflow, which thus

Re: [slurm-users] Random "sbatch" failure: "Socket timed out on send/recv operation"

2019-06-14 Thread Marcelo Garcia
lurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of Christopher Harrop - NOAA Affiliate Sent: Thursday, 13 June 2019 16:47 To: Slurm User Community List Subject: Re: [slurm-users] Random "sbatch" failure: "Socket timed out on send/recv operation" Hi, My grou

Re: [slurm-users] Random "sbatch" failure: "Socket timed out on send/recv operation"

2019-06-14 Thread Bjørn-Helge Mevik
Christopher Benjamin Coffey writes: > Hi, you may want to look into increasing the sssd cache length on the > nodes, We have thought about that, but it will not solve the problem, only make it less frequent, I think. > and improving the network connectivity to your ldap > directory. That is

Re: [slurm-users] Random "sbatch" failure: "Socket timed out on send/recv operation"

2019-06-13 Thread Christopher W. Harrop
> ... >> One way I'm using to work around this is to inject a long random string >> into the --comment option. Then, if I see the socket timeout, I use squeue >> to look for that job and retrieve its ID. It's not ideal, but it can work. > > I would have expected a different approach: use a
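
The comment-tag workaround quoted above could be sketched roughly as follows. This is only an illustration, not the poster's actual script: the helper names (make_tag, find_job_id, submit) are hypothetical, and it assumes sbatch's --parsable and --comment options and squeue's "%i %k" format (job ID and comment) behave as documented on your Slurm version.

```python
import subprocess
import uuid


def make_tag() -> str:
    """Return a long random string to pass to sbatch via --comment."""
    return "submit-" + uuid.uuid4().hex


def find_job_id(squeue_output: str, tag: str):
    """Return the job ID whose comment equals tag, or None if absent.

    Expects one job per line in the form "<jobid> <comment>".
    """
    for line in squeue_output.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2 and parts[1].strip() == tag:
            return parts[0]
    return None


def submit(script: str, user: str):
    """Submit script; if sbatch fails, check whether the job exists anyway."""
    tag = make_tag()
    result = subprocess.run(
        ["sbatch", "--parsable", f"--comment={tag}", script],
        capture_output=True, text=True,
    )
    if result.returncode == 0:
        # --parsable prints "jobid" or "jobid;cluster"
        return result.stdout.strip().split(";")[0]
    # sbatch reported an error (for example the socket timeout), but the
    # controller may still have accepted the job: look for our tag.
    listing = subprocess.run(
        ["squeue", "--noheader", "--user", user, "--format", "%i %k"],
        capture_output=True, text=True,
    )
    return find_job_id(listing.stdout, tag)
```

A None result from submit() does not prove the job was rejected: a job that is not yet visible, or one that has already finished, will not show up in squeue, so a caller would probably want to retry the lookup or consult accounting before resubmitting.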

Re: [slurm-users] Random "sbatch" failure: "Socket timed out on send/recv operation"

2019-06-13 Thread John Hearns
I agree with Christopher Coffey - look at the sssd caching. I have had experience with sssd and can help a bit. Also if you are seeing long waits could you have nested groups? sssd is notorious for not handling these well, and there are settings in the configuration file which you can experiment

Re: [slurm-users] Random "sbatch" failure: "Socket timed out on send/recv operation"

2019-06-13 Thread Mark Hahn
On Thu, 13 Jun 2019, Christopher Harrop - NOAA Affiliate wrote: ... One way I'm using to work around this is to inject a long random string into the --comment option. Then, if I see the socket timeout, I use squeue to look for that job and retrieve its ID. It's not ideal, but it can work. I

Re: [slurm-users] Random "sbatch" failure: "Socket timed out on send/recv operation"

2019-06-13 Thread Jeffrey Frey
The error message cited is associated with SLURM_PROTOCOL_SOCKET_IMPL_TIMEOUT, which is only ever raised by slurm_send_timeout() and slurm_recv_timeout(). Those functions raise that error when a generic socket-based send/receive operation exceeds an arbitrary time limit imposed by the caller.
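
To illustrate the general mechanism described above (a caller imposing a time limit on a socket receive), here is a minimal Python sketch. It is not Slurm's implementation; recv_with_timeout and demo are hypothetical names, and the 0.2 second limit is arbitrary. In Slurm itself the limit is, as far as I know, normally governed by MessageTimeout in slurm.conf.

```python
import socket


def recv_with_timeout(sock: socket.socket, nbytes: int, timeout_s: float):
    """Receive up to nbytes, or return None if timeout_s elapses first."""
    sock.settimeout(timeout_s)
    try:
        return sock.recv(nbytes)
    except socket.timeout:
        # The caller gave up waiting; the peer may still be working.
        return None


def demo():
    """Show one answered and one timed-out receive on a local socket pair."""
    # socketpair() gives two connected sockets without any network setup.
    a, b = socket.socketpair()
    try:
        b.sendall(b"ok")
        answered = recv_with_timeout(a, 16, 0.2)  # data is already waiting
        silent = recv_with_timeout(a, 16, 0.2)    # peer sends nothing more
        return answered, silent
    finally:
        a.close()
        b.close()
```

The point of the sketch is that a timeout only tells the caller that no reply arrived in time. It says nothing about whether the other side received and processed the request, which matches the reports in this thread of jobs being submitted even though sbatch returned an error.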

Re: [slurm-users] Random "sbatch" failure: "Socket timed out on send/recv operation"

2019-06-13 Thread Christopher Harrop - NOAA Affiliate
Hi, My group is struggling with this also. The worst part of this, which no one has brought up yet, is that the sbatch command does not necessarily fail to submit the job in this situation. In fact, most of the time (for us), it succeeds. There appears to be some sort of race condition or

Re: [slurm-users] Random "sbatch" failure: "Socket timed out on send/recv operation"

2019-06-12 Thread Marcus Wagner
Hi, we hit the same issue, with up to 30,000 entries per day in the slurmctld log. When we first used SL6 (Scientific Linux), we had massive problems with sssd, which often crashed. We therefore decided to get rid of sssd and fill /etc/passwd and /etc/group manually via a cron job. So, yes we

Re: [slurm-users] Random "sbatch" failure: "Socket timed out on send/recv operation"

2019-06-12 Thread Christopher Benjamin Coffey
Hi, you may want to look into increasing the sssd cache length on the nodes, and improving the network connectivity to your ldap directory. I recall when playing with sssd in the past that it wasn't actually caching. Verify with tcpdump, and "ls -l" through a directory. Once the uid/gid is

Re: [slurm-users] Random "sbatch" failure: "Socket timed out on send/recv operation"

2019-06-12 Thread Bjørn-Helge Mevik
Another possible cause (we currently see it on one of our clusters): delays in ldap lookups. We have sssd on the machines, and occasionally, when sssd contacts the ldap server, it takes 5 or 10 seconds (or even 15) before it gets an answer. If that happens because slurmctld is trying to look up
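
One way to check whether user lookups are occasionally slow on a given machine is to time them directly. The following is a rough sketch under the assumption that lookups go through the system's name service (and therefore through sssd where it is configured); time_lookup and slow_lookups are hypothetical helper names and the 1 second threshold is arbitrary.

```python
import os
import pwd
import time


def time_lookup(uid: int) -> float:
    """Return how long a single passwd lookup for uid takes, in seconds."""
    start = time.monotonic()
    try:
        pwd.getpwuid(uid)
    except KeyError:
        pass  # unknown uid; the lookup was still attempted and timed
    return time.monotonic() - start


def slow_lookups(uid: int, attempts: int = 5, threshold_s: float = 1.0):
    """Time several lookups and return the durations above threshold_s."""
    durations = [time_lookup(uid) for _ in range(attempts)]
    return [d for d in durations if d > threshold_s]


if __name__ == "__main__":
    # Replace os.getuid() with the uid of a directory-only account to
    # exercise the LDAP path rather than local files.
    print(slow_lookups(os.getuid()))
```

Accounts defined in the local /etc/passwd are usually answered without contacting sssd, so the test is only meaningful for a uid that exists solely in the directory. Running it periodically on the slurmctld host would be one way to see whether multi-second lookups coincide with the sbatch timeouts.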

Re: [slurm-users] Random "sbatch" failure: "Socket timed out on send/recv operation"

2019-06-12 Thread Marcelo Garcia
Message- From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of Steffen Grunewald Sent: Tuesday, 11 June 2019 16:28 To: Slurm User Community List Subject: Re: [slurm-users] Random "sbatch" failure: "Socket timed out on send/recv operation" On Tue, 20

Re: [slurm-users] Random "sbatch" failure: "Socket timed out on send/recv operation"

2019-06-11 Thread Daniel Letai
I had similar problems in the past. The 2 most common issues were: 1. Controller load - if the slurmctld was in heavy use, it sometimes didn't respond in a timely manner, exceeding the timeout limit. 2. Topology and msg forwarding and aggregation. For

Re: [slurm-users] Random "sbatch" failure: "Socket timed out on send/recv operation"

2019-06-11 Thread Steffen Grunewald
On Tue, 2019-06-11 at 13:56:34 +, Marcelo Garcia wrote: > Hi > > Since mid-March 2019 we have been having a strange problem with slurm. Sometimes, > the command "sbatch" fails: > > + sbatch -o /home2/mma002/ecf/home/Aos/Prod/Main/Postproc/Lfullpos/50.1 -p > operw

[slurm-users] Random "sbatch" failure: "Socket timed out on send/recv operation"

2019-06-11 Thread Marcelo Garcia
Hi, since mid-March 2019 we have been having a strange problem with slurm. Sometimes, the command "sbatch" fails: + sbatch -o /home2/mma002/ecf/home/Aos/Prod/Main/Postproc/Lfullpos/50.1 -p operw /home2/mma002/ecf/home/Aos/Prod/Main/Postproc/Lfullpos/50.job1 sbatch: error: Batch job submission