I would recommend looking at the slurmd log on node100.
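
A minimal sketch of that check, assuming the slurmd log on node100 is a plain-text file. The path in the comment is only a guess modelled on the slurmctld.log location quoted below, so take the real one from SlurmdLogFile in slurm.conf. The helper name lines_for_job is hypothetical.

```python
import re

# Assumed location, not confirmed in this thread:
#   /var/log/slurm-llnl/slurmd.log

# Matches the leading "[timestamp]" so its digits are not mistaken for a job id.
TIMESTAMP = re.compile(r'^\[[^\]]*\]\s*')

def lines_for_job(lines, job_id):
    """Return the log lines whose message mentions job_id as a whole number."""
    wanted = re.compile(r'(?<!\d)%d(?!\d)' % int(job_id))
    matches = []
    for line in lines:
        message = TIMESTAMP.sub('', line)
        if wanted.search(message):
            matches.append(line.rstrip('\n'))
    return matches
```

On node100 this could be used by opening the log file and passing the file object as lines, for example lines_for_job(open(path), 519), then printing each returned line.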

Quoting Julien Rey <[email protected]>:

Hello everyone,


I am currently having trouble getting python-drmaa to work with Slurm. Jobs
systematically end in a FAILED state (exit code 256) when I launch them
with python-drmaa as a non-root user. I have no problem if I run as
root. Here's the code sample I've been using for my tests:

#!/usr/bin/env python
import os

# Must be set before drmaa is imported so the library can be located.
os.environ['DRMAA_LIBRARY_PATH'] = '/usr/lib/slurm-drmaa/lib/libdrmaa.so.1.0.6'
import drmaa

def main():

    s = drmaa.Session()
    s.initialize()

    print 'Creating job template'
    jt = s.createJobTemplate()
    jt.nativeSpecification = ''
    jt.remoteCommand = 'sleep'
    jt.args = ['30']  # args is a list of strings, one per argument

    jobid = s.runJob(jt)
    print 'Your job has been submitted with id ' + jobid

    jinfo = s.wait(jobid, drmaa.Session.TIMEOUT_WAIT_FOREVER)
    print 'Job exited with', jinfo.exitStatus

    print 'Cleaning up'
    s.deleteJobTemplate(jt)
    s.exit()
if __name__=='__main__':
    main()
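
The script above prints only jinfo.exitStatus. As a sketch, the other fields of the tuple returned by s.wait() could be printed too, which may help tell a plain non-zero exit apart from a signal or an aborted job. The field names below are assumed to match python-drmaa's JobInfo tuple; the helper describe_job and its 'n/a' fallback are illustrative only.

```python
def describe_job(jinfo):
    """Build a one-line summary from a JobInfo-like object."""
    # Field names assumed from python-drmaa's JobInfo; missing ones show as n/a.
    fields = ('jobId', 'hasExited', 'exitStatus', 'hasSignal',
              'terminatedSignal', 'wasAborted', 'hasCoreDump')
    parts = ['%s=%s' % (name, getattr(jinfo, name, 'n/a')) for name in fields]
    return ', '.join(parts)
```

In the script it would replace the single print of jinfo.exitStatus, by printing describe_job(jinfo) instead.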

Here are the results of the sacct command after I ran the script as a
non-root user and then as root:

519          allocation      debug        mti          1     FAILED      1:0
519.batch         batch                   mti          1     FAILED      1:0
520          allocation      debug       root          1  COMPLETED      0:0
520.batch         batch                  root          1  COMPLETED      0:0

And here are the logs from /var/log/slurm-llnl/slurmctld.log

Run as user:

[2014-10-17T13:27:56.819] _slurm_rpc_submit_batch_job JobId=519 usec=455
[2014-10-17T13:27:56.823] sched: Allocate JobId=519 NodeList=node100 #CPUs=1
[2014-10-17T13:27:56.859] completing job 519
[2014-10-17T13:27:56.861] sched: job_complete for JobId=519 successful, exit code=256

Run as root:

[2014-10-17T13:28:39.879] _slurm_rpc_submit_batch_job JobId=520 usec=468
[2014-10-17T13:28:39.882] sched: Allocate JobId=520 NodeList=node100 #CPUs=1
[2014-10-17T13:28:42.963] completing job 520
[2014-10-17T13:28:42.965] sched: job_complete for JobId=520 successful, exit code=0
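
The "exit code=256" in the slurmctld log and the "1:0" in the sacct output appear to describe the same event: the logged value looks like a raw wait()-style status, with the exit code in the high byte and the terminating signal in the low bits. A small sketch of that decoding, assuming the usual POSIX layout (the helper name decode_status is hypothetical):

```python
def decode_status(status):
    """Split a wait()-style status into (exit_code, signal)."""
    exit_code = (status >> 8) & 0xFF  # high byte: value passed to exit()
    signal = status & 0x7F            # low 7 bits: terminating signal, 0 if none
    return (exit_code, signal)
```

Under that assumption, decode_status(256) gives (1, 0), matching the 1:0 shown for job 519, and decode_status(0) gives (0, 0), matching job 520.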

Also, I have no problem running jobs with the srun command as a non-root
user. For instance, if I run as www-data

srun sleep 30

and then

sacct -a

I get:

522               sleep      debug     mobyle          1  COMPLETED      0:0

Here are the packages that were installed:

   - slurm-llnl 2.6.7-2+b1
   - slurm-drmaa1 1.0.7-1
   - python-drmaa 0.5-1

I am completely new to Slurm and DRMAA, so I have no idea where to look.

Any help will be greatly appreciated.


--
Morris "Moe" Jette
CTO, SchedMD LLC
