I'm setting up slurm, and it's pretty neat so far. I have an issue, though, where I want to do some processing after a job completes successfully. I'm running it in a FreeBSD 7.4 environment.
I was playing with the --mail-type and --mail-user options for sbatch. I don't receive the mail; however, I do see a process on the slurm controller. It seems the child process never returns, and the parent stalls at waitpid() in slurmctld's agent.c.

esupport74> cat script-ok-notify
#!/bin/sh
#
#SBATCH --mail-type=ALL
#SBATCH [email protected]

sleep 30
exit 0

esupport74> sbatch ./script-ok-notify
Submitted batch job 143

From another window:

slurmctld -c -D -v -v -v
slurmctld: pidfile not locked, assuming no running daemon
slurmctld: Accounting storage NOT INVOKED plugin loaded
slurmctld: slurmctld version 2.5.6 started on cluster cluster
slurmctld: Munge cryptographic signature plugin loaded
slurmctld: preempt/none loaded
slurmctld: Checkpoint plugin loaded: checkpoint/none
slurmctld: Job accounting gather NOT_INVOKED plugin loaded
slurmctld: debug: No backup controller to shutdown
slurmctld: switch NONE plugin loaded
slurmctld: debug: Reading slurm.conf file: /etc/slurm.conf
slurmctld: topology NONE plugin loaded
slurmctld: debug: No DownNodes
slurmctld: sched: Backfill scheduler plugin loaded
slurmctld: Purging files for defunct batch job 142
slurmctld: debug: Updating partition uid access list
slurmctld: Recovered state of 0 reservations
slurmctld: read_slurm_conf: backup_controller not specified.
slurmctld: Running as primary controller
slurmctld: debug: Priority BASIC plugin loaded
slurmctld: debug2: slurmctld listening on 0.0.0.0:6817
slurmctld: debug: power_save module disabled, SuspendTime < 0
slurmctld: debug: Spawning registration agent for esupport74,fbsd-worker00 2 hosts
slurmctld: debug2: Spawning RPC agent for msg_type 1001
slurmctld: debug2: got 1 threads to send out
slurmctld: debug2: Tree head got back 0 looking for 2
slurmctld: auth plugin for Munge (http://code.google.com/p/munge/) loaded
slurmctld: debug2: Processing RPC: MESSAGE_NODE_REGISTRATION_STATUS from uid=0
slurmctld: debug: validate_node_specs: node esupport74 registered with 0 jobs
slurmctld: debug2: _slurm_rpc_node_registration complete for esupport74 usec=529
slurmctld: debug2: Tree head got back 1
slurmctld: debug2: Tree head got back 2
slurmctld: debug2: Tree head got them all
slurmctld: debug2: node_did_resp esupport74
slurmctld: debug2: node_did_resp fbsd-worker00
slurmctld: debug2: agent maximum delay 1 seconds
slurmctld: debug2: Processing RPC: MESSAGE_NODE_REGISTRATION_STATUS from uid=0
slurmctld: debug2: _slurm_rpc_node_registration complete for fbsd-worker00 usec=94
slurmctld: debug2: Processing RPC: REQUEST_COMPLETE_BATCH_SCRIPT from uid=0 JobId=142
slurmctld: completing job 142
slurmctld: job_complete: invalid JobId=142
slurmctld: _slurm_rpc_complete_batch_script JobId=142: Invalid job id specified
slurmctld: debug: backfill: no jobs to backfill
slurmctld: debug2: Testing job time limits and checkpoints
slurmctld: debug2: Testing job time limits and checkpoints
slurmctld: debug2: Performing purge of old job records
slurmctld: debug: sched: Running job scheduler
slurmctld: debug2: Processing RPC: REQUEST_SUBMIT_BATCH_JOB from uid=2451
slurmctld: debug2: found 1 usable nodes from config containing esupport74
slurmctld: debug2: found 1 usable nodes from config containing fbsd-worker00
slurmctld: debug2: host esupport74 HW_ cpus 1 boards 1 sockets 1 cores 1 threads 1
slurmctld: debug2: host esupport74 HW_ cpus 1 boards 1 sockets 1 cores 1 threads 1
slurmctld: debug2: sched: JobId=143 allocated resources: NodeList=(null)
slurmctld: _slurm_rpc_submit_batch_job JobId=143 usec=4966
slurmctld: debug: sched: Running job scheduler
slurmctld: debug2: found 1 usable nodes from config containing esupport74
slurmctld: debug2: found 1 usable nodes from config containing fbsd-worker00
slurmctld: debug2: host esupport74 HW_ cpus 1 boards 1 sockets 1 cores 1 threads 1
slurmctld: debug2: host esupport74 HW_ cpus 1 boards 1 sockets 1 cores 1 threads 1
slurmctld: debug: email msg to [email protected]: SLURM Job_id=143 Name=script-ok-notify Began, Queued time 00:00:00
slurmctld: sched: Allocate JobId=143 NodeList=esupport74 #CPUs=1
slurmctld: debug2: Spawning RPC agent for msg_type 4005
slurmctld: debug2: got 1 threads to send out
slurmctld: debug2: Tree head got back 0 looking for 1
slurmctld: debug2: Tree head got back 1
slurmctld: debug2: Tree head got them all
slurmctld: debug2: node_did_resp esupport74
slurmctld: debug2: agent maximum delay 1 seconds
slurmctld: debug: email waitpid 36640 command was /usr/bin/mail
slurmctld: debug: email waitpid 36640
slurmctld: debug: child /usr/bin/mail msg to [email protected]: SLURM Job_id=143 Name=script-ok-notify Began, Queued time 00:00:00 closing 0-3 now
slurmctld: debug: backfill: no jobs to backfill
slurmctld: debug2: Processing RPC: REQUEST_COMPLETE_BATCH_SCRIPT from uid=0 JobId=143
slurmctld: completing job 143
slurmctld: debug: email msg to [email protected]: SLURM Job_id=143 Name=script-ok-notify Ended, Run time 00:00:33
slurmctld: debug2: Spawning RPC agent for msg_type 6011
slurmctld: sched: job_complete for JobId=143 successful
slurmctld: debug2: _slurm_rpc_complete_batch_script JobId=143 usec=385
slurmctld: debug2: got 1 threads to send out
slurmctld: debug2: Tree head got back 0 looking for 1
slurmctld: debug2: Tree head got back 1
slurmctld: debug2: Tree head got them all
slurmctld: debug2: node_did_resp esupport74
slurmctld: debug: sched: Running job scheduler
slurmctld: debug: backfill: no jobs to backfill
^Cslurmctld: Terminate signal (SIGINT or SIGTERM) received
slurmctld: debug: sched: slurmctld terminating

(As you can see, I can't even kill slurmctld now without kill -9.) This is in the process table:

36640 p3 I+ 0:00.00 mail -s SLURM Job_id=143 Name=script-ok-notify Began, Queued time 00:00:00 jb

If I don't kill it (with kill -9), it keeps slurmctld's port open. As you can see, I added a few debug lines to _mail_proc(), and I've tested this type of code in a simple stand-alone C program; it didn't stall the same way there, and the child completed fine.

static void _mail_proc(mail_info_t *mi)
{
	pid_t pid;

	pid = fork();
	if (pid < 0) {		/* error */
		error("fork(): %m");
	} else if (pid == 0) {	/* child */
		debug("child %s msg to %s: %s closing 0-3 now",
		      slurmctld_conf.mail_prog, mi->user_name, mi->message);
		int fd;
		(void) close(0);
		(void) close(1);
		(void) close(2);
		fd = open("/dev/null", O_RDWR);	// 0
		if (dup(fd) == -1)		// 1
			error("Couldn't do a dup for 1: %m");
		if (dup(fd) == -1)		// 2
			error("Couldn't do a dup for 2: %m");
		execle(slurmctld_conf.mail_prog, "mail", "-s",
		       mi->message, mi->user_name, NULL, NULL);
		error("Failed to exec %s: %m", slurmctld_conf.mail_prog);
		exit(1);
	} else {		/* parent */
		debug("email waitpid %d command was %s",
		      pid, slurmctld_conf.mail_prog);
		debug("email waitpid %d", pid);
		waitpid(pid, NULL, 0);
		debug("email waitpid %d complete", pid);
	}
	_mail_free(mi);
	return;
}

# kill -9 36640
slurmctld: debug: email waitpid 36640 complete
slurmctld: Warning: Note very large processing time from _slurmctld_background: usec=755104369
slurmctld: Saving all slurm state
slurmctld: Unable to remove pidfile '/var/run/slurmctld.pid': Permission denied

It doesn't seem like other folks have issues with sbatch's mail. Any ideas?
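In case anyone wants to try the same experiment, my stand-alone test was essentially the sketch below. It's my own code, not slurm's: run_mail_proc_test() is a name I made up, and /bin/echo stands in for /usr/bin/mail so it can run anywhere without actually sending mail. Only stdin is redirected here (to /dev/null, as _mail_proc does); stdout and stderr are left alone so the exec'd program's output stays visible.

```c
/* Stand-alone version of the fork/exec/waitpid pattern from _mail_proc().
 * Pass the program to exec (e.g. "/bin/echo" as a stand-in for
 * slurmctld_conf.mail_prog).  Returns the child's exit status (0 on
 * success), or -1 on error -- i.e. it returns at all only if waitpid()
 * does not stall the way it does inside slurmctld for me. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int run_mail_proc_test(const char *prog)
{
	pid_t pid = fork();

	if (pid < 0) {			/* error */
		perror("fork");
		return -1;
	} else if (pid == 0) {		/* child: same stdin shuffle as _mail_proc */
		(void) close(0);
		if (open("/dev/null", O_RDWR) != 0)	/* becomes fd 0 */
			_exit(1);
		/* mimic _mail_proc's argument list: argv[0] "mail", then
		 * "-s" and a subject like the one slurmctld builds */
		execl(prog, "mail", "-s",
		      "SLURM Job_id=143 Name=script-ok-notify Began",
		      (char *) NULL);
		perror("execl");	/* only reached if exec fails */
		_exit(1);
	}

	/* parent: slurmctld stalls on this waitpid(); stand-alone,
	 * it returns promptly once the child exits */
	int status = 0;
	if (waitpid(pid, &status, 0) != pid)
		return -1;
	if (WIFEXITED(status))
		return WEXITSTATUS(status);
	return -1;
}
```

Run stand-alone (with /bin/echo, or with /usr/bin/mail and a throwaway address), this completes immediately for me, which is why I suspect something about slurmctld's environment rather than the pattern itself.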
Also, ideally I'd run another script if the processing I have to do returns 0. Is the best way to do that just to write a single batch script that includes the processing, the return-code check, and the subsequent processing (e.g. report notification)?

Thanks,
Jim
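Concretely, I mean something along these lines, where /bin/true and the echo lines are just placeholders for my real processing and report steps:

```shell
#!/bin/sh
#
#SBATCH --mail-type=ALL
#SBATCH [email protected]

# Placeholder for the real processing step.
/bin/true
rc=$?

# Only do the follow-up (e.g. report notification) when the processing
# returned 0; otherwise pass the failure code back to slurm so the job
# shows up as failed.
if [ "$rc" -eq 0 ]; then
    echo "processing ok - sending report"
else
    echo "processing failed with rc=$rc" >&2
    exit "$rc"
fi
```

Is a wrapper like this the usual approach, or is there a more slurm-native way to chain the steps?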
