Still having difficulty with this. I agree there must be a 
difference, but I have not been discerning enough in wading through this 
output to find it. 

 

            Let me try a different approach. The users of this particular 
deployment will mix and match interactive use of the cluster (salloc and then 
running things by hand) with batch use (the usual srun or sbatch). To that end 
it would be a good idea to make the “interactive” (and batch, for that matter) 
environment as similar as possible to the environment seen when just logging 
into a node outside of SLURM. That would preempt other such issues in the 
future. Any suggestions on the best way to do this? In previous deployments 
I’ve had to tinker with the SLURM init scripts to remove memory locks and other 
restrictions. If I modify the init scripts to run .bashrc (for example) on 
process startup, will that achieve the desired effect? 

 

            Thanks,

            ~Mike C.   

 

From: Phil Sharfstein [mailto:[email protected]] 
Sent: Friday, January 25, 2013 4:02 PM
To: slurm-dev
Subject: [slurm-dev] RE: not executing script(?)

 


If you really want to see what's going wrong with your shell startup, you will 
need to make two changes.  

1. Start your script with:
#!/bin/sh -xl
-x makes it echo every command and -l makes it run as a login shell.

2. Edit /etc/profile to make your /etc/profile.d scripts show any errors and 
echo the commands.  For some reason Red Hat decided to throw away all of the 
output from these scripts if they were not run from an interactive shell.  Of 
course, with all of the errors and stdout redirected to /dev/null if something 
fails in one of these scripts, you'll never know.

In /etc/profile look for a line like: 

   for i in /etc/profile.d/*.sh; do

and change the command a few lines under it from:

          . $i > /dev/null 2>&1
to
          . $i

Hopefully, as you slog through the output you'll find the command that's 
breaking everything.
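
As a rough illustration of the same idea without editing /etc/profile, the sketch below sources each /etc/profile.d script in a subshell and reports its exit status, leaving stdout and stderr visible. The function name check_profile_scripts is hypothetical, and a non-zero status is only a hint: a script whose last command is a failed test will also be flagged even if nothing is actually wrong.

```shell
# Sketch only: check_profile_scripts is a hypothetical helper, not part of
# any distribution. The directory defaults to /etc/profile.d.
check_profile_scripts() {
    dir=${1:-/etc/profile.d}
    for i in "$dir"/*.sh; do
        # Skip unreadable files (and the literal pattern if nothing matches).
        [ -r "$i" ] || continue
        # Source each script in a subshell so that a failure or an 'exit'
        # inside it cannot affect the calling shell. Output is not redirected.
        if ( . "$i" ); then
            echo "ok:     $i"
        else
            echo "FAILED: $i (exit status $?)"
        fi
    done
}
```

Calling check_profile_scripts from inside the job script (once under ssh and once under srun) should show whether any of the scripts behaves differently between the two.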


-Phil



  _____  

From: Michael Colonno [[email protected]]
Sent: Friday, January 25, 2013 2:37 PM
To: slurm-dev
Subject: [slurm-dev] RE: not executing script(?)

            Using the exact script below, ssh output:

 

cv-hpcf1

core file size          (blocks, -c) 0

data seg size           (kbytes, -d) unlimited

scheduling priority             (-e) 0

file size               (blocks, -f) unlimited

pending signals                 (-i) 1032015

max locked memory       (kbytes, -l) 64

max memory size         (kbytes, -m) unlimited

open files                      (-n) 1024

pipe size            (512 bytes, -p) 8

POSIX message queues     (bytes, -q) 819200

real-time priority              (-r) 0

stack size              (kbytes, -s) 10240

cpu time               (seconds, -t) unlimited

max user processes              (-u) 1024

virtual memory          (kbytes, -v) unlimited

file locks                      (-x) unlimited

_=/bin/env

CVS_RSH=ssh

G_BROKEN_FILENAMES=1

HOME=/u/mcolonno

KRB5CCNAME=FILE:/tmp/krb5cc_1050163475_PhEns23756

LANG=en_US.UTF-8

LESSOPEN=|/usr/bin/lesspipe.sh %s

LOGNAME=mcolonno

MAIL=/var/mail/mcolonno

PATH=/usr/local/apps/NASTRAN/NX/bin:/usr/lib64/qt-3.3/bin:/usr/local/bin:/bin:/usr/bin

PWD=/u/mcolonno

QTDIR=/usr/lib64/qt-3.3

QTINC=/usr/lib64/qt-3.3/include

QTLIB=/usr/lib64/qt-3.3/lib

SHELL=/bin/bash

SHLVL=2

SSH_ASKPASS=/usr/libexec/openssh/gnome-ssh-askpass

SSH_CLIENT=192.168.101.220 35086 22

SSH_CONNECTION=192.168.101.220 35086 192.168.230.33 22

USER=mcolonno

done: Fri Jan 25 13:59:46 PST 2013

 

            Using srun:

 

cv-hpcf1

core file size          (blocks, -c) 0

data seg size           (kbytes, -d) unlimited

scheduling priority             (-e) 0

file size               (blocks, -f) unlimited

pending signals                 (-i) 1032015

max locked memory       (kbytes, -l) 64

max memory size         (kbytes, -m) unlimited

open files                      (-n) 1024

pipe size            (512 bytes, -p) 8

POSIX message queues     (bytes, -q) 819200

real-time priority              (-r) 0

stack size              (kbytes, -s) 10240

cpu time               (seconds, -t) unlimited

max user processes              (-u) 1024

virtual memory          (kbytes, -v) unlimited

file locks                      (-x) unlimited

}

_=/bin/env

CVS_RSH=ssh

G_BROKEN_FILENAMES=1

HISTCONTROL=ignoredups

HISTSIZE=1000

HOME=/u/mcolonno

HOSTNAME=cv-hpcq

LANG=en_US.UTF-8

LESSOPEN=|/usr/bin/lesspipe.sh %s

LOADEDMODULES=

LOGNAME=mcolonno

LS_COLORS=rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=01;05;37;41:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arj=01;31:*.taz=01;31:*.lzh=01;31:*.lzma=01;31:*.tlz=01;31:*.txz=01;31:*.zip=01;31:*.z=01;31:*.Z=01;31:*.dz=01;31:*.gz=01;31:*.lz=01;31:*.xz=01;31:*.bz2=01;31:*.tbz=01;31:*.tbz2=01;31:*.bz=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.rar=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.jpg=01;35:*.jpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.svg=01;35:*.svgz=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.flv=01;35:*.gl=01;35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.yuv=01;35:*.cgm=01;35:*.emf=01;35:*.axv=01;35:*.anx=01;35:*.ogv=01;35:*.ogx=01;35:*.aac=01;36:*.au=01;36:*.flac=01;36:*.mid=01;36:*.midi=01;36:*.mka=01;36:*.mp3=01;36:*.mpc=01;36:*.ogg=01;36:*.ra=01;36:*.wav=01;36:*.axa=01;36:*.oga=01;36:*.spx=01;36:*.xspf=01;36:

MAIL=/var/spool/mail/mcolonno

module=() {  eval `/usr/bin/modulecmd bash $*`

MODULEPATH=/usr/share/Modules/modulefiles:/etc/modulefiles

MODULESHOME=/usr/share/Modules

PATH=/usr/local/apps/NASTRAN/NX/bin:/usr/lib64/qt-3.3/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/u/mcolonno/bin

PWD=/u/mcolonno/NXNASTRAN

QTDIR=/usr/lib64/qt-3.3

QTINC=/usr/lib64/qt-3.3/include

QTLIB=/usr/lib64/qt-3.3/lib

SHELL=/bin/bash

SHLVL=2

SLURM_CHECKPOINT_IMAGE_DIR=/u/mcolonno/NXNASTRAN

SLURM_CPUS_ON_NODE=16

SLURM_DISTRIBUTION=cyclic

SLURMD_NODENAME=cv-hpcf1

SLURM_GTIDS=0

SLURM_JOB_CPUS_PER_NODE=16

SLURM_JOB_ID=199

SLURM_JOBID=199

SLURM_JOB_NAME=/u/mcolonno/NXNASTRAN/test-env.sh

SLURM_LAUNCH_NODE_IPADDR=192.168.101.220

SLURM_LOCALID=0

SLURM_NNODES=1

SLURM_NODEID=0

SLURM_NODELIST=cv-hpcf1

SLURM_NPROCS=1

SLURM_NTASKS=1

SLURM_PRIO_PROCESS=0

SLURM_PROCID=0

SLURM_SRUN_COMM_HOST=192.168.101.220

SLURM_SRUN_COMM_PORT=33121

SLURM_STEP_ID=0

SLURM_STEPID=0

SLURM_STEP_LAUNCHER_PORT=33121

SLURM_STEP_NODELIST=cv-hpcf1

SLURM_STEP_NUM_NODES=1

SLURM_STEP_NUM_TASKS=1

SLURM_STEP_TASKS_PER_NODE=1

SLURM_SUBMIT_DIR=/u/mcolonno/NXNASTRAN

SLURM_TASK_PID=27327

SLURM_TASKS_PER_NODE=1

SLURM_TOPOLOGY_ADDR=cv-hpcf1

SLURM_TOPOLOGY_ADDR_PATTERN=node

SRUN_DEBUG=3

SSH_ASKPASS=/usr/libexec/openssh/gnome-ssh-askpass

SSH_CLIENT=127.0.0.1 50820 22

SSH_CONNECTION=127.0.0.1 50820 127.0.0.1 22

SSH_TTY=/dev/pts/1

TERM=xterm

TMPDIR=/tmp

USER=mcolonno

done: Fri Jan 25 14:02:22 PST 2013

 

            Nothing jumps out at me that would change the behavior of a bash 
script on the node. 

 

            Thanks,

            ~Mike C. 

 

-----Original Message-----
From: David Bigagli [mailto:[email protected]] 
Sent: Friday, January 25, 2013 6:04 AM
To: slurm-dev
Subject: [slurm-dev] RE: not executing script(?)

 

 

I think the idea is that given a script like this one:

 

---------------------------------

cat myenv

#!/bin/sh

 

hostname

ulimit -a

env|sort

echo "done: `date`"

---------------------------------

 

run it as:

 

ssh myhost myenv > LOG.ssh

 

and as

 

srun -p mypartition -w myhost myenv > LOG.srun

 

then compare the logs line by line.
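
A minimal sketch of that comparison, assuming the two files are named LOG.ssh and LOG.srun as above. The function name compare_logs and the list of prefixes it ignores are illustrative choices, not anything SLURM provides: variables that SLURM sets on purpose and the timestamp line always differ, so they are filtered out before diffing.

```shell
# Sketch only: compare_logs is a hypothetical helper, not a SLURM tool.
compare_logs() {
    # Lines that are expected to differ: SLURM's own variables and the
    # "done: <date>" line printed at the end of the script.
    skip='^(SLURM|SLURMD|SRUN)_|^done: '
    tmp_a=$(mktemp) || return 2
    tmp_b=$(mktemp) || return 2
    grep -Ev "$skip" "$1" | LC_ALL=C sort > "$tmp_a"
    grep -Ev "$skip" "$2" | LC_ALL=C sort > "$tmp_b"
    # In the diff output, '<' marks lines found only in the first log and
    # '>' marks lines found only in the second.
    status=0
    diff "$tmp_a" "$tmp_b" || status=$?
    rm -f "$tmp_a" "$tmp_b"
    return "$status"
}
```

Running compare_logs LOG.ssh LOG.srun then prints only the limits and variables that are missing or different in one of the two modes.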

 

/David

 

On 01/24/2013 01:55 AM, Michael Colonno wrote:

> 

>          Updating this thread: I ran additional experiments submitting the 
> job from the node it executes on - same behavior so I think this rules out 
> system config limits. It seems like the application runs scripts that run 
> other scripts and somehow SLURM's mode of execution confuses this. Anything 
> else I can test?

> 

>          Thanks,

>          ~Mike C.

> 

> -----Original Message-----

> From: Moe Jette [mailto:[email protected]]

> Sent: Tuesday, January 22, 2013 7:49 PM

> To: slurm-dev; Michael Colonno

> Subject: Re: [slurm-dev] not executing script(?)

> 

> Compare limits and environment variables for the two different modes of 
> operation.

> 

> Quoting Michael Colonno <[email protected]>:

> 

>> 

>>        Hi ~

>> 

>>        Getting some odd behavior with SLURM I haven't seen before (2.5.0 on 

>> CentOS 6.3 x64 though I don't think any of that matters for this 

>> issue). I'm trying to run a code which launches from a bash script 

>> (commercial code, we didn't write it). If I ssh to a node and launch 

>> the code, everything works fine. Syntax looks like this:

>> 

>>        >>  launch_script input_file

>> 

>>        If I paste the exact same command at the end of an srun command the 

>> job "runs" and I get a copy of the bash script that was supposed to 

>> have been executed in the directory I launched from (even with 

>> executable properties) in a file labeled input_file.[bunch of letters 

>> and numbers]. Syntax looks like:

>> 

>>        >>srun -n1 -p whatever launch_script input_file

>> 

>>        Scratching my head on this one. Clearly it finds the correct script 

>> to launch on the correct node but I can't explain the difference in 

>> behavior between the interactive and SLURM versions. Test cases like 

>> "hostname" all work fine. Probably not relevant but the parallel 

>> codes I've compiled into SLURM also launch and run great.

>> 

>>        Thanks,

>>        ~Mike C.

>> 

> 

> 
