Hi Carl,

On 02/08/12 23:25, Carl Schmidtmann wrote:
>
> I have now gotten slurm to the point of submitting jobs to my BlueGeneQ but
> they fail with the following error:
>
> 2012-08-02 08:41:30.704 (FATAL) [0xfff801c8b70] 10066:ibm.runjob.client.Job:
> Load failed on R00-IC-J07: Application executable ELF header contains invalid
> value, errno 8 Exec format error

This error occurs because the executable passed to srun (and consequently
to runjob) reaches the IO node but doesn't look like a valid executable for
the compute nodes, so it doesn't get loaded. This is most likely because it
hasn't been compiled using the right toolchain.

>
> I get this error when trying to run - a simple "hello world" shell script; a
> simple "hello world" compiled C program; a simple "hello world" C program
> compiled with the mpi compiler. I am obviously missing something simple here.
> Below I have included my slurm.conf, bluegene.conf, the shell script and C
> source code files.

The shell script won't run on the compute nodes of the Blue Gene (so it is
expected that runjob will complain about it). From the Blue Gene/Q
Application Development Redbook (section 1.3.6, Application development and
debugging):

    Shell scripts

    The CNK does not provide a mechanism for a command interpreter or shell
    when applications start on the Blue Gene/Q system. Only the executable
    program can be started. Therefore, if the application includes shell
    scripts that control workflow, the workflow must be adapted. For example,
    an application workflow shell script cannot be started with the runjob
    command. Instead, run the application workflow scripts on the front end
    node and start the runjob command only at the innermost shell script
    level where the main application binary is called.

But your simple C program should run fine. You should be able to compile it
using the supplied gcc, here:

    /bgsys/drivers/ppcfloor/gnu-linux/bin/powerpc64-bgq-linux-gcc

Here's what I get:

    $ /bgsys/drivers/ppcfloor/gnu-linux/bin/powerpc64-bgq-linux-gcc -Wall -o hello-gcc hello.c
    $ salloc -N 1 bash
    salloc: Pending job allocation 27133
    salloc: job 27133 queued and waiting for resources
    salloc: job 27133 has been allocated resources
    salloc: Granted job allocation 27133
    salloc: Block RMP19Jl181150422 is ready for job
    $ srun hello-gcc
    Hello world.
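
To make the Redbook's advice about workflow scripts concrete, here is a rough sketch of a helper that a workflow script on the front end node could source. It is only an illustration, not what runjob does internally. The function names and the LAUNCHER variable are made up for this example, and the check assumes Blue Gene/Q compute binaries are 64-bit big-endian PowerPC ELF files (magic 7f 45 4c 46, class 2, data 2, e_machine 21).

```shell
#!/bin/sh
# Sketch only: keep the workflow logic on the front end node and hand
# nothing but compiled binaries to srun. Function names and LAUNCHER are
# hypothetical; the ELF values below are an assumption about BG/Q binaries.

looks_like_bgq_binary() {
    # Bytes 0-5 of the file: ELF magic, EI_CLASS (2 = 64-bit), EI_DATA (2 = big-endian).
    ident=$(od -An -tx1 -N6 "$1" 2>/dev/null | tr -d ' \n')
    # Bytes 18-19: e_machine, stored big-endian; 0x0015 is EM_PPC64.
    machine=$(od -An -tx1 -j18 -N2 "$1" 2>/dev/null | tr -d ' \n')
    [ "$ident" = "7f454c460202" ] && [ "$machine" = "0015" ]
}

launch() {
    # Pass the program to the launcher only if its header looks right.
    if looks_like_bgq_binary "$1"; then
        "${LAUNCHER:-srun}" "$@"
    else
        echo "refusing to launch $1: not a 64-bit big-endian PowerPC ELF file" >&2
        return 1
    fi
}

# Example use inside an allocation (e.g. after "salloc -N 1 bash"):
#   launch ./hello-gcc
```

A shell script or a binary built with the front end node's native compiler would be rejected here before it reaches the IO node. Setting LAUNCHER=echo gives a dry run that prints the command instead of starting it.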

Alternatively, if you wanted to use the xlc compiler you would use
/opt/ibmcmp/vacpp/bg/12.1/bin/bgxlc (although your BG xlc compiler may be
located elsewhere; this is just the default install location).

Seeing as you can launch the linpack executable, this looks like a compiler
or perhaps a linking issue. Your SLURM config is probably fine if you've
gotten this far.

I can e-mail you the hello world test program that I compiled if you want,
to check that SLURM is working correctly (it is 3.1MB, statically linked).

Hope that helps!

Mark

>
> The command I use to run them is:
>
> srun -N 256 ./helloworld.sh
>
> (I know 256 processors to run a shell script is pretty silly but I will work
> on smaller blocks once I can run a job.)
>
> I am running slurm v2.4.2, BlueGene V1R1M1.
>
> The weird part is that if I use the linpack executable that I am able to run
> with the IBM 'runjob' command does execute from slurm but it doesn't see
> the extra processors and exits. I have tried replicating how that is compiled
> by using the mpi compiler for the hello.c program but the makefiles are very
> convoluted and I have probably missed some flags somewhere.
>
> Is this just an issue with compiler flags or is there some slurm setting that
> might affect this? I would have thought a shell script would run to enable
> someone to control execution of multiple executables in a job.
>
> Thanks for any pointers or suggestions,
> Carl
>
