On Friday, 19 October 2018 4:58:37 AM AEDT Kirk Main wrote:

> I'm a new administrator to Slurm and I've just got my new cluster up and
> running. We started getting a lot of "Socket timed out on send/recv
> operation" errors when submitting jobs, and also if you try to "squeue"
> while others are submitting jobs. The job does eventually run after about a
> minute, but the entire system feels very sluggish and obviously this isn't
> normal. Not sure whats going on here...

Hmm, you're trying to do HA for Slurm with NFS. I suspect that's going to be killing you unless your NFS server is very, very fast.

From conversations I've had with folks in the past, if you want to do HA you need shared storage that can sustain a lot of IOPS for it to really be usable.

Try it without HA first *AND* use local disk for your state directory, to see if the problem goes away. If it does, then you know you're going to need to find a different way to do that storage in future if you really want to do HA.

If it doesn't go away then you'll know there's something more fundamental going on, but from what you describe it really does sound like NFS latencies are the problem here.

All the best,
Chris
-- 
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC
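
A rough way to check the NFS-latency theory before reconfiguring anything is to time small write-and-fsync cycles in the shared state directory (StateSaveLocation in slurm.conf) and compare them with the same test on local disk. The sketch below is not part of Slurm and only approximates how a controller might hit its state directory; the function names `time_small_syncs` and `summarize`, the write size and the iteration count are all assumptions chosen for illustration.

```python
import os
import statistics
import sys
import tempfile
import time


def time_small_syncs(directory, count=200, size=4096):
    """Time `count` cycles of create + write + fsync + delete in `directory`.

    Returns a list of per-cycle latencies in seconds. The defaults are
    arbitrary illustration values, not anything Slurm itself uses.
    """
    payload = b"x" * size
    latencies = []
    for _ in range(count):
        start = time.perf_counter()
        # mkstemp creates a uniquely named file, so the timing includes
        # the metadata operations (create, unlink) as well as the write.
        fd, path = tempfile.mkstemp(dir=directory)
        try:
            os.write(fd, payload)
            os.fsync(fd)  # force the data out to the storage backend
        finally:
            os.close(fd)
            os.unlink(path)
        latencies.append(time.perf_counter() - start)
    return latencies


def summarize(latencies):
    """Return median, approximate 95th percentile and max in milliseconds."""
    ordered = sorted(latencies)
    p95 = ordered[int(0.95 * (len(ordered) - 1))]
    return {
        "median_ms": statistics.median(ordered) * 1000,
        "p95_ms": p95 * 1000,
        "max_ms": ordered[-1] * 1000,
    }


if __name__ == "__main__":
    # Pass one or more directories to compare, e.g. the NFS-backed state
    # directory and a directory on local disk.
    for directory in sys.argv[1:]:
        stats = summarize(time_small_syncs(directory))
        print(directory, {k: round(v, 3) for k, v in stats.items()})
```

Saved under a name of your choosing and run with the NFS-backed state directory and a local directory as arguments, it prints one summary line per directory. If the NFS figures come out far higher than the local ones, that would be consistent with the storage being the bottleneck, though it is not proof on its own.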