[slurm-dev] Fwd: sbcast not working with Slurm run under a non-root user
Hi! I am running Slurm under a common (non-root) user. Everything works fine except sbcast, which fails with the following message:

sbcast: error: REQUEST_FILE_BCAST(A11): Operation not permitted

What can cause this problem? Is it possible to work around it?

Thanks,
Alex
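For context, a minimal sketch of the kind of session where this shows up (the file paths and node count are just placeholders; sbcast is run inside an existing job allocation):

$ salloc -N2
$ sbcast ./input.dat /tmp/input.dat
sbcast: error: REQUEST_FILE_BCAST(A11): Operation not permitted

Since sbcast hands the file to the slurmd on each allocated node, which then writes it out locally, my guess is that the refusal comes from the slurmd side (it is running as the same unprivileged user), but I have not been able to confirm that.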
[slurm-dev] Question concerning node reason Low RealMemory
Hello list!

I asked this question in #slurm yesterday but didn't receive a response, and I also wasn't able to find any insight via Google or the Slurm site. Anyways, to the point!

How does Slurm (14.03) determine when a node should be placed in a drain state with the reason "Low RealMemory"? I'm asking this question because I have three nodes, each having between 12-14 GB RAM total, with free reporting between 7-10 GB as free. I'll paste some scontrol output below and the corresponding entries from slurm.conf.

NodeName=sanitized_hostname[1] Arch=x86_64 CoresPerSocket=4 CPUAlloc=0 CPUErr=0 CPUTot=8 CPULoad=0.53
   Features=ib_ddr,ib_ofa,sse4,sse41,tpa,cpu_xeon Gres=(null)
   NodeAddr=sanitized_hostname[1] NodeHostName=sanitized_hostname[1] Version=(null)
   OS=Linux RealMemory=12929 AllocMem=0 Sockets=2 Boards=1
   State=IDLE+DRAIN ThreadsPerCore=1 TmpDisk=0 Weight=1
   BootTime=2014-03-08T20:15:30 SlurmdStartTime=2014-07-02T12:29:17
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Reason=Low RealMemory [root@2014-07-01T14:48:44]

NodeName=sanitized_hostname[2] Arch=x86_64 CoresPerSocket=4 CPUAlloc=0 CPUErr=0 CPUTot=8 CPULoad=0.54
   Features=ib_ddr,ib_ofa,sse4,sse41,tpa,cpu_xeon Gres=(null)
   NodeAddr=sanitized_hostname[2] NodeHostName=sanitized_hostname[2] Version=(null)
   OS=Linux RealMemory=10909 AllocMem=0 Sockets=2 Boards=1
   State=IDLE+DRAIN ThreadsPerCore=1 TmpDisk=0 Weight=1
   BootTime=2014-03-08T20:15:02 SlurmdStartTime=2014-07-02T12:29:17
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Reason=Low RealMemory [root@2014-07-01T14:48:44]

NodeName=sanitized_hostname[3] Arch=x86_64 CoresPerSocket=4 CPUAlloc=0 CPUErr=0 CPUTot=8 CPULoad=0.71
   Features=ib_ddr,ib_ofa,sse4,sse41,tpa,cpu_xeon Gres=(null)
   NodeAddr=sanitized_hostname[3] NodeHostName=sanitized_hostname[3] Version=(null)
   OS=Linux RealMemory=10909 AllocMem=0 Sockets=2 Boards=1
   State=IDLE+DRAIN ThreadsPerCore=1 TmpDisk=0 Weight=1
   BootTime=2014-03-08T20:14:55 SlurmdStartTime=2014-07-02T12:29:17
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Reason=Low RealMemory [root@2014-07-01T14:48:44]

And the corresponding slurm.conf entries:

NodeName=sanitized_hostname[1] CPUs=8 CoresPerSocket=4 Sockets=2 RealMemory=12929 Feature=ib_ddr,ib_ofa,sse4,sse41,tpa,cpu_xeon
NodeName=sanitized_hostname[2-3] CPUs=8 CoresPerSocket=4 Sockets=2 RealMemory=10909 Feature=ib_ddr,ib_ofa,sse4,sse41,tpa,cpu_xeon

Thanks for any help and/or insight!

John DeSantis
[slurm-dev] Re: Question concerning node reason Low RealMemory
Did you check the slurmd.log on the nodes and make sure the RealMemory they detect at startup isn't less than what's defined in slurm.conf?

On Wed, Jul 2, 2014 at 12:45 PM, John Desantis desan...@mail.usf.edu wrote:
> How does Slurm (14.03) determine when a node should be placed in a drain state with the reason "Low RealMemory"? [...]
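If it helps, something like the following on one of the affected nodes shows the comparison (a sketch; the log path is only a guess and depends on SlurmdLogFile in your slurm.conf):

$ slurmd -C | grep -o 'RealMemory=[0-9]*'                          # memory slurmd actually detects on this node
$ grep -i 'memory' /var/log/slurm/slurmd.log                        # any startup messages mentioning memory
$ scontrol show node $(hostname -s) | grep -o 'RealMemory=[0-9]*'   # value configured in slurm.conf / known to slurmctld

If the detected value is lower than the configured one, slurmctld drains the node with reason "Low RealMemory".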
[slurm-dev] Re: Question concerning node reason Low RealMemory
John,

Did you find and read this thread from 2011, which appears to discuss this issue?
http://comments.gmane.org/gmane.comp.distributed.slurm.devel/669

Do you have RealMemory set in your slurm.conf? If so, what is it set to?

Have you tried manually updating the node to Idle? Something like:

scontrol update NodeName=sanitized_hostname State=IDLE

Mike

On Jul 2, 2014, at 10:45 AM, John Desantis desan...@mail.usf.edu wrote:
> How does Slurm (14.03) determine when a node should be placed in a drain state with the reason "Low RealMemory"? [...]
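Once the underlying configuration issue is sorted out, State=RESUME is the option the scontrol man page describes for returning a drained or down node to service, e.g. (the hostname range here is just illustrative):

scontrol update NodeName=sanitized_hostname[1-3] State=RESUME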
[slurm-dev] Re: Question concerning node reason Low RealMemory
EV,

> Did you check the slurmd.log on the nodes and make sure the RealMemory they detect at startup isn't less than what's defined in slurm.conf?

I didn't do this, unfortunately! Feel free to jeer! What I had done was configure the nodes in question by looking at what was reported via 'free -m', subtracting a GB, and using that as the 'RealMemory' value in slurm.conf.

Thank you for pointing this out, and my apologies if this was a basic question. I've updated the configuration and all is well after changing the nodes' state to IDLE. I'll make sure to review the slurmd.log first before posting any more questions, should they arise!

John DeSantis

2014-07-02 14:09 GMT-04:00 E V eliven...@gmail.com:
> Did you check the slurmd.log on the nodes and make sure the RealMemory they detect at startup isn't less than what's defined in slurm.conf? [...]
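For the archives, the change amounted to something like the following (a sketch; the RealMemory figure of 10000 is purely a placeholder, and the real value has to be at or below what slurmd detects on the node):

NodeName=sanitized_hostname[2-3] CPUs=8 CoresPerSocket=4 Sockets=2 RealMemory=10000 Feature=ib_ddr,ib_ofa,sse4,sse41,tpa,cpu_xeon

followed by:

scontrol reconfigure
scontrol update NodeName=sanitized_hostname[2-3] State=IDLE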
[slurm-dev] 14.03 FlexLM
If memory serves, 14.03 was supposed to support hooking into FlexLM licensing. However, I can't find any documentation on that. Was it pushed off to a future release?

-Paul Edmon-
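For reference, the license support I can find documented is the static kind, where counts live in slurm.conf and jobs request them at submit time; the license names and counts below are made up:

Licenses=matlab:10,ansys:4          (in slurm.conf)
$ sbatch -L matlab:2 job.sh          (job stays pending until 2 matlab licenses are free)

What I'm after is the piece that would query a FlexLM server for the live license count rather than rely on a static number.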
[slurm-dev] Re: pbsdsh -u equivalent
I may be wrong about this, but I don't think that necessarily solves the problem. Say we have one task and two nodes allocated. In PBS, pbsdsh -u gives both nodes a copy of the task. According to the documentation, however, --ntasks-per-node=1 only means that each node can get a maximum of one task; it does not seem to mean that copies of the task will be produced and handed to all the nodes, only that at most one task runs on each node.

Hartley

On Mon, Jun 30, 2014 at 6:04 PM, Christopher Samuel sam...@unimelb.edu.au wrote:
> On 01/07/14 09:18, Hartley Greenwald wrote:
>> I may be wrong about this because I'm pretty new to all this stuff, but I think that I want to give a copy to every node allocated for the job.
>
> To emulate pbsdsh you are quite correct. According to the manual page the --ntasks-per-node=1 option for srun should do what you want.
>
> cheers,
> Chris
> --
> Christopher Samuel    Senior Systems Administrator
> VLSCI - Victorian Life Sciences Computation Initiative
> Email: sam...@unimelb.edu.au  Phone: +61 (0)3 903 55545
> http://www.vlsci.org.au/  http://twitter.com/vlsci
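In other words, to actually get a copy on every node it seems I would have to request one task per node up front, along the lines of the following sketch (the command and node count are placeholders):

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
srun hostname          # should run once on each of the two allocated nodes

rather than asking for a single task and expecting it to be replicated.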
[slurm-dev] Re: pbsdsh -u equivalent
On 03/07/14 05:31, Hartley Greenwald wrote:
> Let's say we have one task and two nodes allocated.

Er, how are you going to do that?

$ sbatch --nodes=2 --ntasks=1 --wrap /bin/true
sbatch: Warning: can't run 1 processes on 2 nodes, setting nnodes to 1
Submitted batch job 1856638

A distributed job (MPI for instance) must have at least one task on every node for this to make sense.

All the best,
Chris

--
Christopher Samuel    Senior Systems Administrator
VLSCI - Victorian Life Sciences Computation Initiative
Email: sam...@unimelb.edu.au  Phone: +61 (0)3 903 55545
http://www.vlsci.org.au/  http://twitter.com/vlsci
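By contrast, requesting one task per node behaves like pbsdsh -u; a quick sketch (the wrapped command is just an example):

$ sbatch --nodes=2 --ntasks-per-node=1 --wrap 'srun hostname'

Here srun inherits one task per node from the allocation and runs the command once on each allocated node.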