Mark,

Thanks for all the explanations. Most of it I had already found by following the build logs for the linpack executable. I did get the "hello world" program running. I am now digging through the options to the srun and sbatch commands to determine exactly what they do and how they affect running jobs. I am also getting the programmers in my group to supply some other test programs.
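Following the Redbook advice quoted further down (keep the workflow shell on the front-end node and use srun/runjob only for the compute-node binary), one batch script can still drive several executables in a single job. This is a minimal sketch, not tested on a Blue Gene/Q system. The script name workflow.sh, the NODES and SRUN variables and the run_steps function are hypothetical; only srun and sbatch themselves come from SLURM.

```shell
#!/bin/sh
# workflow.sh: hypothetical batch script (sketch, untested on BG/Q).
# The shell itself runs on the front-end node; only the compute-node
# binaries passed as arguments are launched through srun.
# Example submission: sbatch -N 1 workflow.sh ./hello-gcc ./other-test

# Assumed knobs (not SLURM settings): node count per step and the launcher.
NODES="${NODES:-1}"
SRUN="${SRUN:-srun}"    # e.g. SRUN=echo for a dry run

# run_steps NODES EXE...: launch each executable in turn on NODES nodes,
# stopping at the first missing, non-executable or failing step.
run_steps() {
    nodes="$1"
    shift
    for exe in "$@"; do
        if [ ! -f "$exe" ] || [ ! -x "$exe" ]; then
            echo "run_steps: $exe is not an executable file" >&2
            return 1
        fi
        echo "launching $exe on $nodes node(s)"
        "$SRUN" -N "$nodes" "$exe" || return 1
    done
}

# Front-end logic (loops, tests, file staging) belongs here, around the
# srun calls, never inside the program handed to srun.
run_steps "$NODES" "$@"
```

The design choice is simply that the script is never itself passed to srun: it is the batch script, and each srun call starts exactly one compute-node binary. Setting SRUN=echo where the script runs turns it into a dry run that only prints the srun arguments.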
Carl

----- Original Message -----
> Hi Carl,
>
> On 02/08/12 23:25, Carl Schmidtmann wrote:
> >
> > I have now gotten slurm to the point of submitting jobs to my
> > Blue Gene/Q, but they fail with the following error:
> >
> > 2012-08-02 08:41:30.704 (FATAL) [0xfff801c8b70]
> > 10066:ibm.runjob.client.Job: Load failed on R00-IC-J07:
> > Application executable ELF header contains invalid value, errno 8
> > Exec format error
>
> This error occurs because the executable that's been passed to srun
> (and consequently to runjob) reaches the I/O node, doesn't look like an
> appropriate executable, and so doesn't get loaded. This is most likely
> because it hasn't been compiled using the right toolchain.
>
> >
> > I get this error when trying to run any of the following: a simple
> > "hello world" shell script; a simple "hello world" compiled C
> > program; a simple "hello world" C program compiled with the MPI
> > compiler. I am obviously missing something simple here. Below I
> > have included my slurm.conf, bluegene.conf, the shell script and
> > the C source code files.
>
> The shell script won't run on the compute nodes of the Blue Gene (so it
> is expected that runjob will complain about it). From the Blue Gene/Q
> Application Development Redbook (section 1.3.6, Application development
> and debugging):
>
>   Shell scripts
>   The CNK does not provide a mechanism for a command interpreter or
>   shell when applications start on the Blue Gene/Q system. Only the
>   executable program can be started.
>   Therefore, if the application includes shell scripts that control
>   workflow, the workflow must be adapted.
>   For example, an application workflow shell script cannot be started
>   with the runjob command.
>   Instead, run the application workflow scripts on the front end node
>   and start the runjob command only at the innermost shell script
>   level where the main application binary is called.
>
> But your simple C program should run fine.
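Since the failure is reported against the ELF header, one quick front-end sanity check before calling srun is to look at that header. This is a hedged sketch: the is_ppc64_elf function is hypothetical, and it assumes Blue Gene/Q compute-node executables are 64-bit, big-endian PowerPC ELF files (e_machine EM_PPC64 = 21). It rejects a shell script or a binary built for another architecture, but passing it is necessary rather than sufficient: a ppc64 binary built for a front-end Linux might pass and still be refused, so compiling with the BG/Q toolchain remains the real fix.

```shell
#!/bin/sh
# is_ppc64_elf FILE: succeed if FILE starts with a 64-bit, big-endian
# PowerPC ELF header (e_machine = EM_PPC64 = 21). Hypothetical helper;
# it assumes BG/Q compute binaries have this header. Passing the check
# does not prove the binary was built with the BG/Q toolchain.
is_ppc64_elf() {
    # First 20 bytes as one hex string: e_ident[16], e_type[2], e_machine[2].
    hdr=$(od -An -tx1 -N20 "$1" 2>/dev/null | tr -d ' \t\n')
    case "$hdr" in
        # "\177ELF", EI_CLASS = 2 (ELFCLASS64), EI_DATA = 2 (ELFDATA2MSB)
        7f454c460202*) ;;
        *) return 1 ;;
    esac
    # Bytes 18-19 are hex characters 37-40; big-endian 0x0015 = EM_PPC64.
    [ "$(printf '%s' "$hdr" | cut -c37-40)" = "0015" ]
}
```

For example, `is_ppc64_elf ./hello-gcc && srun ./hello-gcc` would skip obvious mismatches such as ./helloworld.sh before they ever reach runjob.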
> You should be able to compile it using the supplied gcc, here:
> /bgsys/drivers/ppcfloor/gnu-linux/bin/powerpc64-bgq-linux-gcc
>
> Here's what I get:
>
> $ /bgsys/drivers/ppcfloor/gnu-linux/bin/powerpc64-bgq-linux-gcc -Wall -o hello-gcc hello.c
> $ salloc -N 1 bash
> salloc: Pending job allocation 27133
> salloc: job 27133 queued and waiting for resources
> salloc: job 27133 has been allocated resources
> salloc: Granted job allocation 27133
> salloc: Block RMP19Jl181150422 is ready for job
> $ srun hello-gcc
> Hello world.
>
> Alternatively, if you wanted to use the xlc compiler, you would use
> /opt/ibmcmp/vacpp/bg/12.1/bin/bgxlc (although your BG xlc compiler may
> be located elsewhere; this is just the default install location).
>
> Seeing as you can launch the linpack executable, this looks like a
> compiler or perhaps linking issue. Your SLURM config is probably fine
> if you've gotten this far. I can e-mail you the hello world test
> program that I compiled if you want, to confirm that SLURM is working
> correctly (it is 3.1MB, statically linked).
>
> Hope that helps!
> Mark
>
> >
> > The command I use to run them is:
> >
> > srun -N 256 ./helloworld.sh
> >
> > (I know 256 processors to run a shell script is pretty silly, but I
> > will work on smaller blocks once I can run a job.)
> >
> > I am running slurm v2.4.2, BlueGene V1R1M1.
> >
> > The weird part is that the linpack executable, which I am able to
> > run with the IBM 'runjob' command, does execute from slurm, but it
> > doesn't see the extra processors and exits. I have tried
> > replicating how that is compiled by using the MPI compiler for the
> > hello.c program, but the makefiles are very convoluted and I have
> > probably missed some flags somewhere.
> >
> > Is this just an issue with compiler flags, or is there some slurm
> > setting that might affect this?
> > I would have expected a shell script to run, so that one could
> > control the execution of multiple executables in a job.
> >
> > Thanks for any pointers or suggestions,
> > Carl
> >

--
Carl Schmidtmann
Center for Integrated Research Computing
University of Rochester
