If you really want to see what's going wrong with your shell startup, you will
need to make two changes.
1. Start your script with:
#!/bin/sh -xl
-x makes it echo every command and -l makes it run as a login shell.
2. Edit /etc/profile so that your /etc/profile.d scripts echo their commands
and show any errors. For some reason Red Hat decided to throw away all of the
output from these scripts when they are not run from an interactive shell. Of
course, with stdout and stderr both redirected to /dev/null, you'll never know
if something fails in one of these scripts.
In /etc/profile look for a line like:
for i in /etc/profile.d/*.sh; do
and change the command a few lines under it from:
. $i > /dev/null 2>&1
to
. $i
Hopefully, as you slog through the output you'll find the command that's
breaking everything.
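
If editing /etc/profile is not convenient, roughly the same information can be had by sourcing each profile script by hand with tracing on. This is only a sketch: check_profile_scripts is a made-up helper name, the directory defaults to /etc/profile.d, and a non-zero status is just a hint (a script whose last line is a failed test also returns non-zero).

```shell
#!/bin/sh
# Hypothetical helper, not part of any distribution.
# Sources every *.sh file in a directory (default: /etc/profile.d) in its
# own subshell with tracing enabled and reports the exit status of each.
check_profile_scripts() {
    dir=${1:-/etc/profile.d}
    for i in "$dir"/*.sh; do
        # Skip unreadable files and the unexpanded glob when nothing matches.
        [ -r "$i" ] || continue
        # The subshell keeps set -x and any exported variables from leaking
        # into the calling shell. The trace goes to stderr; the script's own
        # stdout is discarded so only the status lines remain on stdout.
        if ( set -x; . "$i" ) >/dev/null; then rc=0; else rc=$?; fi
        echo "$i: exit status $rc"
    done
}

# Example (assumed invocation): check_profile_scripts 2> trace.log
```

The status lines go to stdout and the command trace to stderr, so the two can be captured separately.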
-Phil
________________________________
From: Michael Colonno [[email protected]]
Sent: Friday, January 25, 2013 2:37 PM
To: slurm-dev
Subject: [slurm-dev] RE: not executing script(?)
Using the exact script below, ssh output:
cv-hpcf1
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 1032015
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 10240
cpu time (seconds, -t) unlimited
max user processes (-u) 1024
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
_=/bin/env
CVS_RSH=ssh
G_BROKEN_FILENAMES=1
HOME=/u/mcolonno
KRB5CCNAME=FILE:/tmp/krb5cc_1050163475_PhEns23756
LANG=en_US.UTF-8
LESSOPEN=|/usr/bin/lesspipe.sh %s
LOGNAME=mcolonno
MAIL=/var/mail/mcolonno
PATH=/usr/local/apps/NASTRAN/NX/bin:/usr/lib64/qt-3.3/bin:/usr/local/bin:/bin:/usr/bin
PWD=/u/mcolonno
QTDIR=/usr/lib64/qt-3.3
QTINC=/usr/lib64/qt-3.3/include
QTLIB=/usr/lib64/qt-3.3/lib
SHELL=/bin/bash
SHLVL=2
SSH_ASKPASS=/usr/libexec/openssh/gnome-ssh-askpass
SSH_CLIENT=192.168.101.220 35086 22
SSH_CONNECTION=192.168.101.220 35086 192.168.230.33 22
USER=mcolonno
done: Fri Jan 25 13:59:46 PST 2013
Using srun:
cv-hpcf1
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 1032015
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 10240
cpu time (seconds, -t) unlimited
max user processes (-u) 1024
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
}
_=/bin/env
CVS_RSH=ssh
G_BROKEN_FILENAMES=1
HISTCONTROL=ignoredups
HISTSIZE=1000
HOME=/u/mcolonno
HOSTNAME=cv-hpcq
LANG=en_US.UTF-8
LESSOPEN=|/usr/bin/lesspipe.sh %s
LOADEDMODULES=
LOGNAME=mcolonno
LS_COLORS=rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=01;05;37;41:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arj=01;31:*.taz=01;31:*.lzh=01;31:*.lzma=01;31:*.tlz=01;31:*.txz=01;31:*.zip=01;31:*.z=01;31:*.Z=01;31:*.dz=01;31:*.gz=01;31:*.lz=01;31:*.xz=01;31:*.bz2=01;31:*.tbz=01;31:*.tbz2=01;31:*.bz=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.rar=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.jpg=01;35:*.jpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.svg=01;35:*.svgz=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.flv=01;35:*.gl=01;35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.yuv=01;35:*.cgm=01;35:*.emf=01;35:*.axv=01;35:*.anx=01;35:*.ogv=01;35:*.ogx=01;35:*.aac=01;36:*.au=01;36:*.flac=01;36:*.mid=01;36:*.midi=01;36:*.mka=01;36:*.mp3=01;36:*.mpc=01;36:*.ogg=01;36:*.ra=01;36:*.wav=01;36:*.axa=01;36:*.oga=01;36:*.spx=01;36:*.xspf=01;36:
MAIL=/var/spool/mail/mcolonno
module=() { eval `/usr/bin/modulecmd bash $*`
MODULEPATH=/usr/share/Modules/modulefiles:/etc/modulefiles
MODULESHOME=/usr/share/Modules
PATH=/usr/local/apps/NASTRAN/NX/bin:/usr/lib64/qt-3.3/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/u/mcolonno/bin
PWD=/u/mcolonno/NXNASTRAN
QTDIR=/usr/lib64/qt-3.3
QTINC=/usr/lib64/qt-3.3/include
QTLIB=/usr/lib64/qt-3.3/lib
SHELL=/bin/bash
SHLVL=2
SLURM_CHECKPOINT_IMAGE_DIR=/u/mcolonno/NXNASTRAN
SLURM_CPUS_ON_NODE=16
SLURM_DISTRIBUTION=cyclic
SLURMD_NODENAME=cv-hpcf1
SLURM_GTIDS=0
SLURM_JOB_CPUS_PER_NODE=16
SLURM_JOB_ID=199
SLURM_JOBID=199
SLURM_JOB_NAME=/u/mcolonno/NXNASTRAN/test-env.sh
SLURM_LAUNCH_NODE_IPADDR=192.168.101.220
SLURM_LOCALID=0
SLURM_NNODES=1
SLURM_NODEID=0
SLURM_NODELIST=cv-hpcf1
SLURM_NPROCS=1
SLURM_NTASKS=1
SLURM_PRIO_PROCESS=0
SLURM_PROCID=0
SLURM_SRUN_COMM_HOST=192.168.101.220
SLURM_SRUN_COMM_PORT=33121
SLURM_STEP_ID=0
SLURM_STEPID=0
SLURM_STEP_LAUNCHER_PORT=33121
SLURM_STEP_NODELIST=cv-hpcf1
SLURM_STEP_NUM_NODES=1
SLURM_STEP_NUM_TASKS=1
SLURM_STEP_TASKS_PER_NODE=1
SLURM_SUBMIT_DIR=/u/mcolonno/NXNASTRAN
SLURM_TASK_PID=27327
SLURM_TASKS_PER_NODE=1
SLURM_TOPOLOGY_ADDR=cv-hpcf1
SLURM_TOPOLOGY_ADDR_PATTERN=node
SRUN_DEBUG=3
SSH_ASKPASS=/usr/libexec/openssh/gnome-ssh-askpass
SSH_CLIENT=127.0.0.1 50820 22
SSH_CONNECTION=127.0.0.1 50820 127.0.0.1 22
SSH_TTY=/dev/pts/1
TERM=xterm
TMPDIR=/tmp
USER=mcolonno
done: Fri Jan 25 14:02:22 PST 2013
Nothing jumps out at me that would change the behavior of a bash
script on the node.
Thanks,
~Mike C.
-----Original Message-----
From: David Bigagli [mailto:[email protected]]
Sent: Friday, January 25, 2013 6:04 AM
To: slurm-dev
Subject: [slurm-dev] RE: not executing script(?)
I think the idea is that given a script like this one:
---------------------------------
cat myenv
#!/bin/sh
hostname
ulimit -a
env|sort
echo "done: `date`"
---------------------------------
run it as:
ssh myhost myenv > LOG.ssh
and as
srun -p mypartition -w myhost myenv > LOG.srun
then compare the logs line by line.
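
One way to do the comparison, as a rough sketch: filter out the lines that are expected to differ and diff the rest. compare_env_logs is a made-up helper name, the filter pattern (SLURM*, SRUN*, SSH* variables and the "done:" timestamp) is an assumption about what counts as noise, and mktemp is assumed to be available.

```shell
#!/bin/sh
# Hypothetical helper: diff two logs produced by the myenv script above,
# ignoring variables that always differ between an ssh and an srun run.
compare_env_logs() {
    noise='^(SLURM|SRUN|SSH)[A-Z_]*=|^done:'
    a=$(mktemp)
    b=$(mktemp)
    # grep -v exits non-zero when it selects no lines; that is not an error here.
    grep -Ev "$noise" "$1" > "$a" || true
    grep -Ev "$noise" "$2" > "$b" || true
    # diff exits 0 when the filtered logs match, 1 when they differ.
    if diff "$a" "$b"; then rc=0; else rc=$?; fi
    rm -f "$a" "$b"
    return "$rc"
}

# Example (assumed file names from the commands above):
#   compare_env_logs LOG.ssh LOG.srun
```

Lines prefixed with "<" appear only in the first log and lines prefixed with ">" only in the second.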
/David
On 01/24/2013 01:55 AM, Michael Colonno wrote:
>
> Updating this thread: I ran additional experiments submitting the
> job from the node it executes on - same behavior so I think this rules out
> system config limits. It seems like the application runs scripts that run
> other scripts and somehow SLURM's mode of execution confuses this. Anything
> else I can test?
>
> Thanks,
> ~Mike C.
>
> -----Original Message-----
> From: Moe Jette [mailto:[email protected]]
> Sent: Tuesday, January 22, 2013 7:49 PM
> To: slurm-dev; Michael Colonno
> Subject: Re: [slurm-dev] not executing script(?)
>
> Compare limits and environment variables for the two different modes of
> operation.
>
> Quoting Michael Colonno <[email protected]>:
>
>>
>> Hi ~
>>
>> Getting some odd behavior with SLURM I haven't seen before (2.5.0 on
>> CentOS 6.3 x64 though I don't think any of that matters for this
>> issue). I'm trying to run a code which launches from a bash script
>> (commercial code, we didn't write it). If I ssh to a node and launch
>> the code, everything works fine. Syntax looks like this:
>>
>> >> launch_script input_file
>>
>> If I paste the exact same command at the end of an srun command the
>> job "runs" and I get a copy of the bash script that was supposed to
>> have been executed in the directory I launched from (even with
>> executable properties) in a file labeled input_file.[bunch of letters
>> and numbers]. Syntax looks like:
>>
>> >>srun -n1 -p whatever launch_script input_file
>>
>> Scratching my head on this one. Clearly it finds the correct script
>> to launch on the correct node but I can't explain the difference in
>> behavior between the interactive and SLURM versions. Test cases like
>> "hostname" all work fine. Probably not relevant but the parallel
>> codes I've compiled into SLURM also launch and run great.
>>
>> Thanks,
>> ~Mike C.
>>
>
>