We have a problem submitting interactive jobs to a new environment we're building out. We are upgrading our nodes from openSUSE 11.3 to 12.1, and something about that upgrade is breaking pty allocation.
All nodes have slurm 2.3.2-1.2chaos. Submitting "srun --pty" to an 11.3 node works fine, but if I submit to a 12.1 host I get:

slapshot: srun --pty -N1 -n1 -w puck1 /bin/bash -i
srun: Job is in held state, pending scheduler release
srun: job 286 queued and waiting for resources
srun: job 286 has been allocated resources
... about ten minutes passes here...
srun: error: timeout waiting for task launch
srun: Job step 286.0 aborted before step completely launched.
srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
srun: error: Timed out waiting for job step to complete
slapshot:

The host "puck1" is a 12.1 host. In its log I see:

[2012-01-24T10:30:36] debug: task_slurmd_launch_request: 286.0 0
[2012-01-24T10:30:36] launch task 286.0 request from 34152.34152@140.107.43.101 (port 51922)
[2012-01-24T10:30:36] debug: Checking credential with 256 bytes of sig data
[2012-01-24T10:30:36] switch NONE plugin loaded
[2012-01-24T10:30:36] [286.0] Message thread started pid = 22345
[2012-01-24T10:30:36] [286.0] task NONE plugin loaded
[2012-01-24T10:30:36] debug: task_slurmd_reserve_resources: 286 0
[2012-01-24T10:30:36] [286.0] Checkpoint plugin loaded: checkpoint/none
[2012-01-24T10:30:36] [286.0] mpi type = openmpi
[2012-01-24T10:30:36] [286.0] stdin uses a pty object
[2012-01-24T10:30:36] [286.0] stdin openpty: Permission denied
[2012-01-24T10:30:36] [286.0] IO setup failed: Permission denied
[2012-01-24T10:30:36] [286.0] auth plugin for Munge (http://home.gna.org/munge/) loaded
[2012-01-24T10:30:36] [286.0] Message thread exited
[2012-01-24T10:30:36] [286.0] done with job
[2012-01-24T10:40:38] debug: _rpc_terminate_job, uid = 6281
[2012-01-24T10:40:38] debug: task_slurmd_release_resources: 286
[2012-01-24T10:40:38] debug: credential for job 286 revoked

If I submit to an openSUSE 11.3 node this works just fine.
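(Side note on isolating this: util-linux's script(1) allocates a pseudo-terminal for the command it runs, via openpty() as far as I know, so running it directly on puck1 as my regular user should exercise the same call with SLURM out of the picture:

  puck1 $ script -q -c tty /dev/null

On a healthy node that should print the slave pty, e.g. /dev/pts/2; a "Permission denied" here would point at the OS rather than at slurmd.)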
It also works fine if I submit the job as root:

slapshot: sudo srun --pty -N1 -n1 -w puck1 /bin/bash -i
srun: Job is in held state, pending scheduler release
srun: job 288 queued and waiting for resources
srun: job 288 has been allocated resources
puck1 # exit
slapshot:

[2012-01-24T10:45:35] debug: task_slurmd_launch_request: 288.0 0
[2012-01-24T10:45:35] launch task 288.0 request from 0.0@140.107.43.101 (port 14291)
[2012-01-24T10:45:35] debug: Checking credential with 256 bytes of sig data
[2012-01-24T10:45:35] switch NONE plugin loaded
[2012-01-24T10:45:35] [288.0] Message thread started pid = 22676
[2012-01-24T10:45:35] debug: task_slurmd_reserve_resources: 288 0
[2012-01-24T10:45:35] [288.0] task NONE plugin loaded
[2012-01-24T10:45:35] [288.0] Checkpoint plugin loaded: checkpoint/none
[2012-01-24T10:45:35] [288.0] mpi type = openmpi
[2012-01-24T10:45:35] [288.0] stdin uses a pty object
[2012-01-24T10:45:35] [288.0] init pty size 42:80
[2012-01-24T10:45:35] [288.0] in _window_manager
[2012-01-24T10:45:35] [288.0] debug level = 2
[2012-01-24T10:45:35] [288.0] IO handler started pid=22676
[2012-01-24T10:45:35] [288.0] task 0 (22680) started 2012-01-24T10:45:35
[2012-01-24T10:45:35] [288.0] Sending launch resp rc=0
[2012-01-24T10:45:35] [288.0] auth plugin for Munge (http://home.gna.org/munge/) loaded
[2012-01-24T10:45:35] [288.0] mpi type = (null)
[2012-01-24T10:45:35] [288.0] Using mpi/openmpi
[2012-01-24T10:45:35] [288.0] task_pre_launch: 288.0, task 0
[2012-01-24T10:45:35] [288.0] Couldn't find SLURM_RLIMIT_CPU in environment
[2012-01-24T10:45:35] [288.0] Couldn't find SLURM_RLIMIT_FSIZE in environment
[2012-01-24T10:45:35] [288.0] Couldn't find SLURM_RLIMIT_DATA in environment
[2012-01-24T10:45:35] [288.0] Couldn't find SLURM_RLIMIT_STACK in environment
[2012-01-24T10:45:35] [288.0] Couldn't find SLURM_RLIMIT_CORE in environment
[2012-01-24T10:45:35] [288.0] Couldn't find SLURM_RLIMIT_RSS in environment
[2012-01-24T10:45:35] [288.0] Couldn't find SLURM_RLIMIT_NPROC in environment
[2012-01-24T10:45:35] [288.0] Couldn't find SLURM_RLIMIT_NOFILE in environment
[2012-01-24T10:45:35] [288.0] Couldn't find SLURM_RLIMIT_MEMLOCK in environment
[2012-01-24T10:45:35] [288.0] Couldn't find SLURM_RLIMIT_AS in environment
[2012-01-24T10:45:35] [288.0] Handling REQUEST_INFO
[2012-01-24T10:45:35] [288.0] Handling REQUEST_SIGNAL_CONTAINER
[2012-01-24T10:45:35] [288.0] _handle_signal_container for job 288.0
[2012-01-24T10:46:00] [288.0] task 0 (22680) exited with exit code 0.
[2012-01-24T10:46:00] [288.0] task_post_term: 288.0, task 0
[2012-01-24T10:46:00] [288.0] Waiting for IO
[2012-01-24T10:46:00] [288.0] Closing debug channel
[2012-01-24T10:46:00] [288.0] IO handler exited, rc=0
[2012-01-24T10:46:00] [288.0] _slurm_recv_timeout at 0 of 4, recv zero bytes
[2012-01-24T10:46:00] [288.0] read window size error: Zero Bytes were transmitted or received
[2012-01-24T10:46:00] debug: _rpc_terminate_job, uid = 6281
[2012-01-24T10:46:00] debug: task_slurmd_release_resources: 288
[2012-01-24T10:46:00] debug: credential for job 288 revoked
[2012-01-24T10:46:00] [288.0] Handling REQUEST_SIGNAL_CONTAINER
[2012-01-24T10:46:00] [288.0] _handle_signal_container for job 288.0
[2012-01-24T10:46:00] [288.0] Sent signal 18 to 288.0
[2012-01-24T10:46:00] [288.0] Handling REQUEST_SIGNAL_CONTAINER
[2012-01-24T10:46:00] [288.0] _handle_signal_container for job 288.0
[2012-01-24T10:46:00] [288.0] Sent signal 15 to 288.0
[2012-01-24T10:46:00] [288.0] Handling REQUEST_STATE
[2012-01-24T10:46:00] [288.0] Message thread exited
[2012-01-24T10:46:00] [288.0] done with job
[2012-01-24T10:46:01] debug: Waiting for job 288's prolog to complete
[2012-01-24T10:46:01] debug: Finished wait for job 288's prolog to complete
[2012-01-24T10:46:01] debug: completed epilog for jobid 288
[2012-01-24T10:46:01] debug: Job 288: sent epilog complete msg: rc = 0

I am using the Moab scheduler, but I get the same behaviour if I switch to the built-in SLURM scheduler. I see that 2.3.3 came out this morning; I didn't immediately see anything in it that I think would impact this problem, so I'll hold off on upgrading right now.

Can anyone provide some insight into how to further diagnose this problem? It's a permissions issue, but I can't fathom why 12.1 and 11.3 would be so different.

Thanks

Michael
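P.S. In case it helps anyone hazard a guess, my next step is to compare the pty plumbing on the two releases. My (possibly shaky) understanding is that an unprivileged openpty() needs /dev/ptmx to be world read/writable and devpts mounted so that grantpt() can hand the slave device to the user, conventionally gid=5 (tty) and mode=620. So, on both an 11.3 and a 12.1 node:

  $ mount | grep devpts
  $ ls -l /dev/ptmx
  $ ls -ld /dev/pts

On a typical Linux box I'd expect /dev/ptmx to be crw-rw-rw- root:tty and devpts to be mounted with gid=5,mode=620; if 12.1 changed either default, that would line up with the EACCES slurmd is seeing. Failing that, I'll run slurmd in the foreground with extra verbosity on puck1 (slurmd -D -vvvv) and see if anything more turns up around the openpty() call.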