I'm setting up Slurm, and it's pretty neat so far.  I have an issue, though, 
where I want to do some processing after a job completes successfully.  I'm 
running it in a FreeBSD 7.4 environment.

I was playing with the --mail-type and --mail-user options for sbatch.  I 
never receive the mail, but I do see a stuck mail process on the slurm 
controller.  It seems the child process never returns, and the parent stalls 
at waitpid() in slurmctld's agent.c.

esupport74> cat script-ok-notify
#!/bin/sh
#
#SBATCH --mail-type=ALL
#SBATCH [email protected]

sleep 30

exit 0
esupport74> sbatch ./script-ok-notify
Submitted batch job 143

From another window:
slurmctld -c -D -v -v -v

slurmctld: pidfile not locked, assuming no running daemon
slurmctld: Accounting storage NOT INVOKED plugin loaded
slurmctld: slurmctld version 2.5.6 started on cluster cluster
slurmctld: Munge cryptographic signature plugin loaded
slurmctld: preempt/none loaded
slurmctld: Checkpoint plugin loaded: checkpoint/none
slurmctld: Job accounting gather NOT_INVOKED plugin loaded
slurmctld: debug:  No backup controller to shutdown
slurmctld: switch NONE plugin loaded
slurmctld: debug:  Reading slurm.conf file: /etc/slurm.conf
slurmctld: topology NONE plugin loaded
slurmctld: debug:  No DownNodes
slurmctld: sched: Backfill scheduler plugin loaded
slurmctld: Purging files for defunct batch job 142
slurmctld: debug:  Updating partition uid access list
slurmctld: Recovered state of 0 reservations
slurmctld: read_slurm_conf: backup_controller not specified.
slurmctld: Running as primary controller
slurmctld: debug:  Priority BASIC plugin loaded
slurmctld: debug2: slurmctld listening on 0.0.0.0:6817
slurmctld: debug:  power_save module disabled, SuspendTime < 0
slurmctld: debug:  Spawning registration agent for esupport74,fbsd-worker00 2 
hosts
slurmctld: debug2: Spawning RPC agent for msg_type 1001
slurmctld: debug2: got 1 threads to send out
slurmctld: debug2: Tree head got back 0 looking for 2
slurmctld: auth plugin for Munge (http://code.google.com/p/munge/) loaded
slurmctld: debug2: Processing RPC: MESSAGE_NODE_REGISTRATION_STATUS from uid=0
slurmctld: debug:  validate_node_specs: node esupport74 registered with 0 jobs
slurmctld: debug2: _slurm_rpc_node_registration complete for esupport74 usec=529
slurmctld: debug2: Tree head got back 1
slurmctld: debug2: Tree head got back 2
slurmctld: debug2: Tree head got them all
slurmctld: debug2: node_did_resp esupport74
slurmctld: debug2: node_did_resp fbsd-worker00
slurmctld: debug2: agent maximum delay 1 seconds
slurmctld: debug2: Processing RPC: MESSAGE_NODE_REGISTRATION_STATUS from uid=0
slurmctld: debug2: _slurm_rpc_node_registration complete for fbsd-worker00 
usec=94
slurmctld: debug2: Processing RPC: REQUEST_COMPLETE_BATCH_SCRIPT from uid=0 
JobId=142
slurmctld: completing job 142
slurmctld: job_complete: invalid JobId=142
slurmctld: _slurm_rpc_complete_batch_script JobId=142: Invalid job id specified
slurmctld: debug:  backfill: no jobs to backfill
slurmctld: debug2: Testing job time limits and checkpoints
slurmctld: debug2: Testing job time limits and checkpoints
slurmctld: debug2: Performing purge of old job records
slurmctld: debug:  sched: Running job scheduler
slurmctld: debug2: Processing RPC: REQUEST_SUBMIT_BATCH_JOB from uid=2451
slurmctld: debug2: found 1 usable nodes from config containing esupport74
slurmctld: debug2: found 1 usable nodes from config containing fbsd-worker00
slurmctld: debug2: host esupport74 HW_ cpus 1 boards 1 sockets 1 cores 1 
threads 1
slurmctld: debug2: host esupport74 HW_ cpus 1 boards 1 sockets 1 cores 1 
threads 1
slurmctld: debug2: sched: JobId=143 allocated resources: NodeList=(null)
slurmctld: _slurm_rpc_submit_batch_job JobId=143 usec=4966
slurmctld: debug:  sched: Running job scheduler
slurmctld: debug2: found 1 usable nodes from config containing esupport74
slurmctld: debug2: found 1 usable nodes from config containing fbsd-worker00
slurmctld: debug2: host esupport74 HW_ cpus 1 boards 1 sockets 1 cores 1 
threads 1
slurmctld: debug2: host esupport74 HW_ cpus 1 boards 1 sockets 1 cores 1 
threads 1
slurmctld: debug:  email msg to [email protected]: SLURM Job_id=143 
Name=script-ok-notify Began, Queued time 00:00:00
slurmctld: sched: Allocate JobId=143 NodeList=esupport74 #CPUs=1
slurmctld: debug2: Spawning RPC agent for msg_type 4005
slurmctld: debug2: got 1 threads to send out
slurmctld: debug2: Tree head got back 0 looking for 1
slurmctld: debug2: Tree head got back 1
slurmctld: debug2: Tree head got them all
slurmctld: debug2: node_did_resp esupport74
slurmctld: debug2: agent maximum delay 1 seconds
slurmctld: debug:  email waitpid 36640 command was /usr/bin/mail
slurmctld: debug:  email waitpid 36640
slurmctld: debug:  child /usr/bin/mail msg to [email protected]: SLURM 
Job_id=143 Name=script-ok-notify Began, Queued time 00:00:00   closing 0-3 now
slurmctld: debug:  backfill: no jobs to backfill
slurmctld: debug2: Processing RPC: REQUEST_COMPLETE_BATCH_SCRIPT from uid=0 
JobId=143
slurmctld: completing job 143
slurmctld: debug:  email msg to [email protected]: SLURM Job_id=143 
Name=script-ok-notify Ended, Run time 00:00:33
slurmctld: debug2: Spawning RPC agent for msg_type 6011
slurmctld: sched: job_complete for JobId=143 successful
slurmctld: debug2: _slurm_rpc_complete_batch_script JobId=143 usec=385
slurmctld: debug2: got 1 threads to send out
slurmctld: debug2: Tree head got back 0 looking for 1
slurmctld: debug2: Tree head got back 1
slurmctld: debug2: Tree head got them all
slurmctld: debug2: node_did_resp esupport74
slurmctld: debug:  sched: Running job scheduler
slurmctld: debug:  backfill: no jobs to backfill
^Cslurmctld: Terminate signal (SIGINT or SIGTERM) received
slurmctld: debug:  sched: slurmctld terminating

(As you can see, I can't even kill slurmctld now without kill -9.)

This is in the process table:

36640  p3  I+     0:00.00 mail -s SLURM Job_id=143 Name=script-ok-notify Began, 
Queued time 00:00:00 jb

If I don't kill it (with kill -9), it keeps slurmctld's port open.

As you can see, I added a few debug lines.  I've also tested this type of 
code in a simple stand-alone C program, and it didn't stall in the same way; 
the child completed fine.

static void _mail_proc(mail_info_t *mi)
{
        pid_t pid;

        pid = fork();
        if (pid < 0) {          /* error */
                error("fork(): %m");
        } else if (pid == 0) {  /* child */
                int fd;
                debug("child %s msg to %s: %s   closing 0-3 now",
                      slurmctld_conf.mail_prog,
                      mi->user_name, mi->message);
                (void) close(0);
                (void) close(1);
                (void) close(2);
                fd = open("/dev/null", O_RDWR); /* becomes fd 0 */
                if (dup(fd) == -1)              /* fd 1 */
                        error("Couldn't do a dup for 1: %m");
                if (dup(fd) == -1)              /* fd 2 */
                        error("Couldn't do a dup for 2: %m");
                execle(slurmctld_conf.mail_prog, "mail",
                       "-s", mi->message, mi->user_name,
                       NULL, NULL);
                error("Failed to exec %s: %m",
                      slurmctld_conf.mail_prog);
                exit(1);
        } else {                /* parent */
                debug("email waitpid %d command was %s",
                      pid, slurmctld_conf.mail_prog);
                debug("email waitpid %d", pid);
                waitpid(pid, NULL, 0);
                debug("email waitpid %d complete", pid);
        }
        _mail_free(mi);
        return;
}


# kill -9 36640

slurmctld: debug:  email waitpid 36640 complete
slurmctld: Warning: Note very large processing time from _slurmctld_background: 
usec=755104369
slurmctld: Saving all slurm state
slurmctld: Unable to remove pidfile '/var/run/slurmctld.pid': Permission denied

It doesn't seem like other folks have issues with sbatch's mail - any ideas?

Also - ideally I'd run another script only if the processing I have to do 
returns 0.  Is the best way to do that just to write a single script that 
includes the processing, a return-code check, and the subsequent processing 
(e.g. a report notification)?
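
Something along these lines is what I had in mind (a sketch; do_processing 
and send_report are placeholders for the real steps, defined as shell 
functions here only so the example is self-contained):

```shell
#!/bin/sh
#
#SBATCH --mail-type=ALL
#SBATCH [email protected]

# Placeholders for the real processing and follow-up commands.
do_processing() { true; }
send_report()   { echo "report sent"; }

do_processing
rc=$?

if [ "$rc" -eq 0 ]; then
    send_report          # runs only when the processing step returned 0
else
    echo "processing failed, rc=$rc" >&2
fi
```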

Thanks

Jim








