I managed to get it working, but not in the way I intended. The only
way to get Slurm working for everyone was to run slurm*d as root.
Otherwise a non-root SlurmUser seemed to have insufficient privileges
on /var/spool, even though the directories were mode 755 and owned by
SlurmUser. Perhaps it's some Ubuntu intricacy, but at this point
everything is running as it should, with (hopefully minimal) security
issues.
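
In case it helps anyone hitting the same chown error: as far as I
understand, an unprivileged process on Linux cannot chown a file to a
different user, which would be consistent with the failure in the log
below when slurmd is not root. Here is a minimal Python sketch for
checking a spool directory. The function name and the returned keys are
my own invention, and the euid test is only an approximation (it
ignores capabilities such as CAP_CHOWN).

```python
import os
import pwd
import stat


def describe_spool_dir(path):
    """Report owner, mode, and whether this process could chown to other users.

    Hypothetical helper for illustration; not part of Slurm.
    """
    st = os.stat(path)
    try:
        owner = pwd.getpwuid(st.st_uid).pw_name
    except KeyError:
        # The UID has no passwd entry; fall back to the number.
        owner = str(st.st_uid)
    return {
        "owner": owner,
        "owner_uid": st.st_uid,
        # Permission bits only, e.g. "0o755".
        "mode": oct(stat.S_IMODE(st.st_mode)),
        # Assumption: only effective UID 0 may give files away to others.
        "can_chown_to_others": os.geteuid() == 0,
    }


if __name__ == "__main__":
    spool = "/var/spool/slurm/slurmd"  # path taken from the log below
    if os.path.isdir(spool):
        print(describe_spool_dir(spool))
    else:
        print(spool + " not found on this machine")
```

Run as the user that slurmd runs as, a False in can_chown_to_others
would suggest the per-job chown cannot succeed no matter who owns the
spool directory.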

Cheers,
Andrej

> 
> Hi Tray,
> 
> > May have to look at the logs for slurmd on the node that was
> > allocated the job.  The slurmctld logs may identify which node was
> > allocated and then can check that node's slurmd logs.  Usually a
> > requeue held job is a result of a node being unable to launch the
> > job, for example if UID/GID mapping is incorrect or non-existent.
> 
> I did find out which node the job was submitted to, and the only
> offending entry in the slurmd log seems to be this:
> 
> chown(/var/spool/slurm/slurmd/job00053): Operation not permitted
> batch script setup failed for job 53.4294967294
> _step_setup: no job returned
> 
> But /var/spool/slurm and everything under it is owned by SlurmUser. Is
> this not correct? The jobs seem to be created with 750 permissions.
> 
> Thanks,
> Andrej
