[slurm-dev] Fwd: sbcast not working with slurm ran under user

2014-07-02 Thread Alexander Frolov
Hi!

I am running Slurm as an ordinary (non-root) user. Everything works fine
except sbcast, which fails with the following message:

sbcast: error: REQUEST_FILE_BCAST(A11): Operation not permitted

What can cause this problem? Is it possible to work around it?

Thanks,
  Alex


[slurm-dev] Question concerning node reason Low RealMemory

2014-07-02 Thread John Desantis

Hello list!

I asked this question in #slurm yesterday but didn't receive a
response, and I also wasn't able to find any insight via Google or the
Slurm site.

Anyways, to the point!

How does Slurm (14.03) determine when a node should be placed in a
drain state with the reason Low RealMemory?  I'm asking because I
have three nodes, each with 12-14 GB of RAM in total, and free
reports 7-10 GB free on each of them.

I'll paste some scontrol output below and corresponding entries from slurm.conf.

NodeName=sanitized_hostname[1] Arch=x86_64 CoresPerSocket=4
   CPUAlloc=0 CPUErr=0 CPUTot=8 CPULoad=0.53 Features=ib_ddr,ib_ofa,sse4,sse41,tpa,cpu_xeon
   Gres=(null)
   NodeAddr=sanitized_hostname[1] NodeHostName=sanitized_hostname[1] Version=(null)
   OS=Linux RealMemory=12929 AllocMem=0 Sockets=2 Boards=1
   State=IDLE+DRAIN ThreadsPerCore=1 TmpDisk=0 Weight=1
   BootTime=2014-03-08T20:15:30 SlurmdStartTime=2014-07-02T12:29:17
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Reason=Low RealMemory [root@2014-07-01T14:48:44]

NodeName=sanitized_hostname[2] Arch=x86_64 CoresPerSocket=4
   CPUAlloc=0 CPUErr=0 CPUTot=8 CPULoad=0.54 Features=ib_ddr,ib_ofa,sse4,sse41,tpa,cpu_xeon
   Gres=(null)
   NodeAddr=sanitized_hostname[2] NodeHostName=sanitized_hostname[2] Version=(null)
   OS=Linux RealMemory=10909 AllocMem=0 Sockets=2 Boards=1
   State=IDLE+DRAIN ThreadsPerCore=1 TmpDisk=0 Weight=1
   BootTime=2014-03-08T20:15:02 SlurmdStartTime=2014-07-02T12:29:17
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Reason=Low RealMemory [root@2014-07-01T14:48:44]

NodeName=sanitized_hostname[3] Arch=x86_64 CoresPerSocket=4
   CPUAlloc=0 CPUErr=0 CPUTot=8 CPULoad=0.71 Features=ib_ddr,ib_ofa,sse4,sse41,tpa,cpu_xeon
   Gres=(null)
   NodeAddr=sanitized_hostname[3] NodeHostName=sanitized_hostname[3] Version=(null)
   OS=Linux RealMemory=10909 AllocMem=0 Sockets=2 Boards=1
   State=IDLE+DRAIN ThreadsPerCore=1 TmpDisk=0 Weight=1
   BootTime=2014-03-08T20:14:55 SlurmdStartTime=2014-07-02T12:29:17
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Reason=Low RealMemory [root@2014-07-01T14:48:44]

NodeName=sanitized_hostname[1] CPUs=8 CoresPerSocket=4 Sockets=2 RealMemory=12929 Feature=ib_ddr,ib_ofa,sse4,sse41,tpa,cpu_xeon
NodeName=sanitized_hostname[2-3] CPUs=8 CoresPerSocket=4 Sockets=2 RealMemory=10909 Feature=ib_ddr,ib_ofa,sse4,sse41,tpa,cpu_xeon

Thanks for any help and/or insight!

John DeSantis


[slurm-dev] Re: Question concerning node reason Low RealMemory

2014-07-02 Thread E V

Did you check the slurmd.log on the nodes to see whether the
RealMemory they report at startup is less than what's defined in
slurm.conf?
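
The check suggested here can be sketched as follows. This is only an illustration, and it assumes (as this reply implies, not verified against the Slurm source) that a node is drained with Low RealMemory when the memory slurmd reports at startup is below the RealMemory configured in slurm.conf. The function names are hypothetical, and the "reported" value is made up. The input lines use the NodeName=... Key=Value form found in slurm.conf, which should also be the form printed by slurmd -C.

```python
def parse_node_line(line):
    """Split a 'Key=Value Key=Value ...' line into a dict of strings."""
    fields = {}
    for token in line.split():
        key, sep, value = token.partition("=")
        if sep:  # ignore tokens that carry no '='
            fields[key] = value
    return fields


def real_memory_mb(line):
    """Return the RealMemory value (in MB) from a node definition line."""
    return int(parse_node_line(line)["RealMemory"])


def is_low_real_memory(configured_line, reported_line):
    """True when the reported memory is below the configured value."""
    return real_memory_mb(reported_line) < real_memory_mb(configured_line)


# Configured line taken from the slurm.conf excerpt in this thread.
configured = ("NodeName=sanitized_hostname[1] CPUs=8 CoresPerSocket=4 "
              "Sockets=2 RealMemory=12929")
# Hypothetical reported line; the value 12010 is invented for illustration.
reported = "NodeName=sanitized_hostname[1] CPUs=8 RealMemory=12010"

print(is_low_real_memory(configured, reported))
```

If the comparison is true for a node, lowering RealMemory in slurm.conf to at most the reported value is the fix described later in this thread.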


[slurm-dev] Re: Question concerning node reason Low RealMemory

2014-07-02 Thread Michael Robbert
John,
Did you find and read this thread from 2011 that appears to discuss this issue?

http://comments.gmane.org/gmane.comp.distributed.slurm.devel/669

Do you have RealMemory set in your slurm.conf? If so what is it set to?
Have you tried manually updating the node to Idle? Something like:
scontrol update NodeName=sanitized_hostname State=IDLE

Mike


[slurm-dev] Re: Question concerning node reason Low RealMemory

2014-07-02 Thread John Desantis

EV,

 Did you check the slurmd.log on the nodes to see whether the
 RealMemory they report at startup is less than what's defined in
 slurm.conf?

I didn't do this unfortunately!  Feel free to jeer!

What I had done is configure the nodes in question by looking at what
was reported via 'free -m' and then subtracting a GB and configuring
that as the 'RealMemory' value in slurm.conf.

Thank you for pointing this out, and my apologies if this was a basic
question.  I've updated the configuration and all is well after
changing the nodes' state to IDLE.  I'll make sure to review the
slurmd.log first before posting any more questions, should they arise!

John DeSantis
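
For reference, the "measure, then subtract a margin" approach described above can be sketched like this. It is a rough illustration only: it assumes that the figure slurmd reports is close to MemTotal from /proc/meminfo expressed in MB, which is not confirmed in this thread, so the value in slurmd.log remains the authoritative one. The function names and the sample text are hypothetical.

```python
def memtotal_mb(meminfo_text):
    """Return MemTotal from /proc/meminfo-style text, in whole MB."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            kb = int(line.split()[1])  # the value is given in kB
            return kb // 1024
    raise ValueError("MemTotal not found")


def suggested_real_memory(meminfo_text, margin_mb=1024):
    """Total memory minus a safety margin, never below 1 MB."""
    return max(1, memtotal_mb(meminfo_text) - margin_mb)


# Made-up sample in the format of /proc/meminfo.
sample = "MemTotal:       12345678 kB\nMemFree:         8000000 kB\n"
print("RealMemory=%d" % suggested_real_memory(sample))
```

On a real node the text would come from reading /proc/meminfo, and the result should still be checked against what slurmd logs at startup before it goes into slurm.conf.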


[slurm-dev] 14.03 FlexLM

2014-07-02 Thread Paul Edmon


If memory serves I thought that 14.03 was supposed to support hooking 
into FlexLM licensing.  However, I can't find any documentation on 
that.  Was that pushed off to a future release?


-Paul Edmon-


[slurm-dev] Re: pbsdsh -u equivalent

2014-07-02 Thread Hartley Greenwald
I may be wrong about this, but I don't think this necessarily solves the
problem.

Let's say we have one task and two nodes allocated.  In PBS, using
pbsdsh -u, both of the nodes will get a copy of the task.  However,
according to the documentation, --ntasks-per-node=1 only means that each
node can get a maximum of one task.  This does not seem to entail that
multiple copies of the task will be produced and given to all the nodes,
only that a maximum of one task is run on each node.

Hartley


On Mon, Jun 30, 2014 at 6:04 PM, Christopher Samuel sam...@unimelb.edu.au
wrote:


 -BEGIN PGP SIGNED MESSAGE-
 Hash: SHA1

 On 01/07/14 09:18, Hartley Greenwald wrote:

  I may be wrong about this because I'm pretty new to all this stuff,
  but I think that I want to give a copy to every node allocated for
  the job.

 To emulate pbsdsh you are quite correct.

 According to the manual page the --ntasks-per-node=1 option for srun
 should do what you want.

 cheers,
 Chris
 - --
  Christopher Samuel    Senior Systems Administrator
  VLSCI - Victorian Life Sciences Computation Initiative
  Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
  http://www.vlsci.org.au/  http://twitter.com/vlsci

 -BEGIN PGP SIGNATURE-
 Version: GnuPG v1
 Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/

 iEYEARECAAYFAlOx+JUACgkQO2KABBYQAh/+uQCdHWQEQ/H+aJMJ8ppeMD+C/r88
 jb0An2qJT4FZxloNNOqP2owAC2N3W7eZ
 =7BJX
 -END PGP SIGNATURE-



[slurm-dev] Re: pbsdsh -u equivalent

2014-07-02 Thread Christopher Samuel

-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

On 03/07/14 05:31, Hartley Greenwald wrote:

 Let's say we have one task and two nodes allocated. 

Er, how are you going to do that?

$ sbatch --nodes=2 --ntasks=1 --wrap /bin/true
sbatch: Warning: can't run 1 processes on 2 nodes, setting nnodes to 1
Submitted batch job 1856638

A distributed job (MPI for instance) must have at least
one task on every node for this to make sense.

All the best,
Chris
- -- 
 Christopher Samuel    Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci

-BEGIN PGP SIGNATURE-
Version: GnuPG v1
Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/

iEYEARECAAYFAlO0rFAACgkQO2KABBYQAh8jeQCdGbLpk/X8FOcc32TGuqyC/Hpy
ic8AoJHa1wO2ZN+vix1WfpEw3DCWtQSR
=yuDd
-END PGP SIGNATURE-