We have a problem submitting interactive jobs to a new environment we're building out. We are upgrading our nodes from openSUSE 11.3 to 12.1, and something about that upgrade is breaking pty allocation.
All nodes have slurm 2.3.2-1.2chaos. Submitting "srun --pty" to an 11.3 node works fine, but if I submit to a 12.1 host I get:

slapshot: srun --pty -N1 -n1 -w puck1 /bin/bash -i
srun: Job is in held state, pending scheduler release
srun: job 286 queued and waiting for resources
srun: job 286 has been allocated resources
... about ten minutes passes here...
srun: error: timeout waiting for task launch
srun: Job step 286.0 aborted before step completely launched.
srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
srun: error: Timed out waiting for job step to complete
slapshot:

The host "puck1" is a 12.1 host. In its log I see:

[2012-01-24T10:30:36] debug: task_slurmd_launch_request: 286.0 0
[2012-01-24T10:30:36] launch task 286.0 request from 34152.34152@140.107.43.101 (port 51922)
[2012-01-24T10:30:36] debug: Checking credential with 256 bytes of sig data
[2012-01-24T10:30:36] switch NONE plugin loaded
[2012-01-24T10:30:36] [286.0] Message thread started pid = 22345
[2012-01-24T10:30:36] [286.0] task NONE plugin loaded
[2012-01-24T10:30:36] debug: task_slurmd_reserve_resources: 286 0
[2012-01-24T10:30:36] [286.0] Checkpoint plugin loaded: checkpoint/none
[2012-01-24T10:30:36] [286.0] mpi type = openmpi
[2012-01-24T10:30:36] [286.0] stdin uses a pty object
[2012-01-24T10:30:36] [286.0] stdin openpty: Permission denied
[2012-01-24T10:30:36] [286.0] IO setup failed: Permission denied
[2012-01-24T10:30:36] [286.0] auth plugin for Munge (http://home.gna.org/munge/) loaded
[2012-01-24T10:30:36] [286.0] Message thread exited
[2012-01-24T10:30:36] [286.0] done with job
[2012-01-24T10:40:38] debug: _rpc_terminate_job, uid = 6281
[2012-01-24T10:40:38] debug: task_slurmd_release_resources: 286
[2012-01-24T10:40:38] debug: credential for job 286 revoked

If I submit to an openSUSE 11.3 node this works just fine.
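(Side note on isolating this: util-linux's script(1) allocates a pseudo-terminal for the command it runs, via openpty() as far as I know, so running it directly on puck1 as my regular user should exercise the same call with SLURM out of the picture:

  puck1 $ script -q -c tty /dev/null

On a healthy node that should print the slave pty, e.g. /dev/pts/2; a "Permission denied" here would point at the OS rather than at slurmd.)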
It also works fine if I submit the job as root:

slapshot: sudo srun --pty -N1 -n1 -w puck1 /bin/bash -i
srun: Job is in held state, pending scheduler release
srun: job 288 queued and waiting for resources
srun: job 288 has been allocated resources
puck1 # exit
slapshot:

[2012-01-24T10:45:35] debug: task_slurmd_launch_request: 288.0 0
[2012-01-24T10:45:35] launch task 288.0 request from 0.0@140.107.43.101 (port 14291)
[2012-01-24T10:45:35] debug: Checking credential with 256 bytes of sig data
[2012-01-24T10:45:35] switch NONE plugin loaded
[2012-01-24T10:45:35] [288.0] Message thread started pid = 22676
[2012-01-24T10:45:35] debug: task_slurmd_reserve_resources: 288 0
[2012-01-24T10:45:35] [288.0] task NONE plugin loaded
[2012-01-24T10:45:35] [288.0] Checkpoint plugin loaded: checkpoint/none
[2012-01-24T10:45:35] [288.0] mpi type = openmpi
[2012-01-24T10:45:35] [288.0] stdin uses a pty object
[2012-01-24T10:45:35] [288.0] init pty size 42:80
[2012-01-24T10:45:35] [288.0] in _window_manager
[2012-01-24T10:45:35] [288.0] debug level = 2
[2012-01-24T10:45:35] [288.0] IO handler started pid=22676
[2012-01-24T10:45:35] [288.0] task 0 (22680) started 2012-01-24T10:45:35
[2012-01-24T10:45:35] [288.0] Sending launch resp rc=0
[2012-01-24T10:45:35] [288.0] auth plugin for Munge (http://home.gna.org/munge/) loaded
[2012-01-24T10:45:35] [288.0] mpi type = (null)
[2012-01-24T10:45:35] [288.0] Using mpi/openmpi
[2012-01-24T10:45:35] [288.0] task_pre_launch: 288.0, task 0
[2012-01-24T10:45:35] [288.0] Couldn't find SLURM_RLIMIT_CPU in environment
[2012-01-24T10:45:35] [288.0] Couldn't find SLURM_RLIMIT_FSIZE in environment
[2012-01-24T10:45:35] [288.0] Couldn't find SLURM_RLIMIT_DATA in environment
[2012-01-24T10:45:35] [288.0] Couldn't find SLURM_RLIMIT_STACK in environment
[2012-01-24T10:45:35] [288.0] Couldn't find SLURM_RLIMIT_CORE in environment
[2012-01-24T10:45:35] [288.0] Couldn't find SLURM_RLIMIT_RSS in environment
[2012-01-24T10:45:35] [288.0] Couldn't find SLURM_RLIMIT_NPROC in environment
[2012-01-24T10:45:35] [288.0] Couldn't find SLURM_RLIMIT_NOFILE in environment
[2012-01-24T10:45:35] [288.0] Couldn't find SLURM_RLIMIT_MEMLOCK in environment
[2012-01-24T10:45:35] [288.0] Couldn't find SLURM_RLIMIT_AS in environment
[2012-01-24T10:45:35] [288.0] Handling REQUEST_INFO
[2012-01-24T10:45:35] [288.0] Handling REQUEST_SIGNAL_CONTAINER
[2012-01-24T10:45:35] [288.0] _handle_signal_container for job 288.0
[2012-01-24T10:46:00] [288.0] task 0 (22680) exited with exit code 0.
[2012-01-24T10:46:00] [288.0] task_post_term: 288.0, task 0
[2012-01-24T10:46:00] [288.0] Waiting for IO
[2012-01-24T10:46:00] [288.0] Closing debug channel
[2012-01-24T10:46:00] [288.0] IO handler exited, rc=0
[2012-01-24T10:46:00] [288.0] _slurm_recv_timeout at 0 of 4, recv zero bytes
[2012-01-24T10:46:00] [288.0] read window size error: Zero Bytes were transmitted or received
[2012-01-24T10:46:00] debug: _rpc_terminate_job, uid = 6281
[2012-01-24T10:46:00] debug: task_slurmd_release_resources: 288
[2012-01-24T10:46:00] debug: credential for job 288 revoked
[2012-01-24T10:46:00] [288.0] Handling REQUEST_SIGNAL_CONTAINER
[2012-01-24T10:46:00] [288.0] _handle_signal_container for job 288.0
[2012-01-24T10:46:00] [288.0] Sent signal 18 to 288.0
[2012-01-24T10:46:00] [288.0] Handling REQUEST_SIGNAL_CONTAINER
[2012-01-24T10:46:00] [288.0] _handle_signal_container for job 288.0
[2012-01-24T10:46:00] [288.0] Sent signal 15 to 288.0
[2012-01-24T10:46:00] [288.0] Handling REQUEST_STATE
[2012-01-24T10:46:00] [288.0] Message thread exited
[2012-01-24T10:46:00] [288.0] done with job
[2012-01-24T10:46:01] debug: Waiting for job 288's prolog to complete
[2012-01-24T10:46:01] debug: Finished wait for job 288's prolog to complete
[2012-01-24T10:46:01] debug: completed epilog for jobid 288
[2012-01-24T10:46:01] debug: Job 288: sent epilog complete msg: rc = 0

I am using the Moab scheduler, but I get the same behaviour if I switch to the built-in SLURM scheduler. I see that 2.3.3 came out this morning; I didn't immediately see anything in it that I think would impact this problem, so I'll hold off on upgrading right now.

Can anyone provide some insight into how to further diagnose this problem? It's a permissions issue, but I can't fathom why 12.1 and 11.3 would be so different.

Thanks

Michael
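P.S. In case it helps anyone hazard a guess, my next step is to compare the pty plumbing on the two releases. My (possibly shaky) understanding is that an unprivileged openpty() needs /dev/ptmx to be world read/writable and devpts mounted so that grantpt() can hand the slave device to the user, conventionally gid=5 (tty) and mode=620. So, on both an 11.3 and a 12.1 node:

  $ mount | grep devpts
  $ ls -l /dev/ptmx
  $ ls -ld /dev/pts

On a typical Linux box I'd expect /dev/ptmx to be crw-rw-rw- root:tty and devpts to be mounted with gid=5,mode=620; if 12.1 changed either default, that would line up with the EACCES slurmd is seeing. Failing that, I'll run slurmd in the foreground with extra verbosity on puck1 (slurmd -D -vvvv) and see if anything more turns up around the openpty() call.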