I finally found a time slot when I could do the job without
disrupting my users.
It turned out to be both the permissions issue discussed here and
the fact that slurm.conf needs the fully qualified path to the
prolog script.
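For anyone hitting the same thing, here is a rough sketch of a check for both conditions at once. The function name `check_prolog` is made up for illustration, and the location of slurm.conf passed to it is site-specific. It only handles a single plain path after `Prolog=`, so treat it as a starting point rather than a complete parser.

```shell
# Hypothetical helper: verify that the Prolog= entry in a given slurm.conf
# is a fully qualified path to a file the current user can execute.
# Run it as root on a compute node, since slurmd runs the prolog as root.
check_prolog() {
    conf=$1
    # Extract the value of the last Prolog= line (key matched in any case),
    # stopping at whitespace or a trailing comment.
    path=$(sed -n 's/^[[:space:]]*[Pp][Rr][Oo][Ll][Oo][Gg]=\([^[:space:]#]*\).*/\1/p' "$conf" | tail -n 1)
    case $path in
        '') echo "no Prolog entry found"; return 1 ;;
        /*) ;;  # absolute path, keep going
        *)  echo "not a fully qualified path: $path"; return 1 ;;
    esac
    if [ ! -x "$path" ]; then
        echo "missing or not executable: $path"
        return 1
    fi
    echo "ok: $path"
}
```

Called as `check_prolog /path/to/slurm.conf`, it prints one line and returns non-zero when something looks wrong.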
So that is solved, but sadly my problem is not solved as
Davide DelVento writes:
>> I'm curious: What kind of disruption did it cause for your production
>> jobs?
>
> All jobs failed and went into pending/held with "launch failed requeued
> held" status, and all nodes where the jobs were scheduled went into draining.
>
> The logs only said "error: validate_node_s
Thanks a lot.
> > Does it need the execute permission? Is it sufficient for root alone?
>
> slurmd runs as root, so it only needs exec perms for root.
Perfect. That must have been it, then, since my script (like the example
one) did not have the execute permission set.
> I'm curious: What kind of disrup
Davide DelVento writes:
> Does it need the execute permission? Is it sufficient for root alone?
slurmd runs as root, so it only needs exec perms for root.
>> > 2. How to debug the issue?
>> I'd try capturing all stdout and stderr from the script into a file on the
>> compute node, for instance l
Thanks to both of you.
> Permissions on the file itself (and the directories in the path to it)
Does it need the execute permission? Is it sufficient for root alone?
> Existence of the script on the nodes (prologue is run on the nodes, not the
> head)
Yes, it's in a shared filesystem.
> Not sure
Davide DelVento writes:
> 2. How to debug the issue?
I'd try capturing all stdout and stderr from the script into a file on the
compute node, for instance like this:
exec &> /root/prolog_slurmd.$$
set -x # To print out all commands
before any other commands in the script. The "prolog_slurmd.
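As a rough sketch, the same capture idea can be wrapped in a small helper so the log directory is easy to change while testing. The function name `start_prolog_trace` and its optional argument are made up for illustration. The default file name follows the `/root/prolog_slurmd.$$` suggestion, and the redirection is written in the portable form, which should behave like `exec &>` in bash.

```shell
# Hypothetical helper: call it as the first command of the prolog script.
# $1 is an optional log directory (defaults to /root, as suggested above).
start_prolog_trace() {
    trace_dir=${1:-/root}
    # Send stdout and stderr of everything that follows to a per-process file.
    exec >"$trace_dir/prolog_slurmd.$$" 2>&1
    # Print each command before it runs.
    set -x
}
```

After a failed job launch, the file on the compute node should then show the last command the prolog reached, together with any error messages.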
Davide,
Quick things to check:
* Permissions on the file itself (and the directories in the path to it)
* Existence of the script on the nodes (prologue is run on the nodes,
not the head)
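A quick way to look at both points on a compute node is to list the permissions of the script and of every directory above it. The sketch below assumes nothing about Slurm itself; the function name `show_path_perms` is made up for illustration.

```shell
# Hypothetical helper: print "ls -ld" for a file and each parent directory
# up to /. Fails as soon as one component does not exist on this node.
show_path_perms() {
    p=$1
    while :; do
        ls -ld "$p" || return 1
        # Stop at the root (or at "." if a relative path was given).
        if [ "$p" = / ] || [ "$p" = . ]; then
            break
        fi
        p=$(dirname "$p")
    done
}
```

Running it as root on one of the nodes (not on the head node) with the path to the prolog script shows at a glance whether the file exists there and which permission bits are set along the way.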
Not sure your error is the prologue script itself. Does everything run
fine with no prologue configur
I have a super simple prolog script, as follows (very similar to the
example one)
#!/bin/bash
if [[ $VAR == 1 ]]; then
    echo "True"
fi
exit 0
This fails (and obviously causes great disruption to my production
jobs). So I have two questions:
1. Why does it fail? It does so regardless of