Hi Carl,

On 02/08/12 23:25, Carl Schmidtmann wrote:
>
> I have now gotten slurm to the point of submitting jobs to my BlueGeneQ but 
> they fail with the following error:
>
> 2012-08-02 08:41:30.704 (FATAL) [0xfff801c8b70] 10066:ibm.runjob.client.Job: 
> Load failed on R00-IC-J07: Application executable ELF header contains invalid 
> value, errno 8 Exec format error

This error means that the executable passed to srun (and consequently 
to runjob) reaches the I/O node but doesn't look like a valid Blue 
Gene/Q executable, so it doesn't get loaded. This is most likely 
because it hasn't been compiled using the right toolchain.
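
If you want a quick sanity check before submitting, you can look at the 
first bytes of the file yourself. The sketch below is only an 
illustration: classify_executable is a hypothetical helper name, and it 
assumes Blue Gene/Q binaries are 64-bit big-endian PowerPC ELF files 
(e_machine 21, EM_PPC64). It only catches the obvious cases (scripts, 
non-ELF files, other architectures); a binary built with the wrong 
compiler for the same architecture could still pass this check and fail 
to load.

```shell
# Hypothetical helper: classify a file by its first 20 bytes.
# Prints one of: script, elf-ppc64-be, elf-other, other.
classify_executable() {
    # Hex dump of the first 20 bytes with all whitespace removed,
    # so each byte occupies exactly two characters.
    hdr=$(head -c 20 "$1" | od -An -v -tx1 | tr -d ' \n')
    case "$hdr" in
        2321*)
            # "#!" means an interpreter script, which runjob cannot start.
            echo "script" ;;
        7f454c46*)
            # Bytes 4-5: EI_CLASS and EI_DATA (02 02 = 64-bit, big-endian).
            class_data=$(printf '%s' "$hdr" | cut -c9-12)
            # Bytes 18-19: e_machine (00 15 = EM_PPC64 when big-endian).
            machine=$(printf '%s' "$hdr" | cut -c37-40)
            if [ "$class_data" = "0202" ] && [ "$machine" = "0015" ]; then
                echo "elf-ppc64-be"
            else
                echo "elf-other"
            fi ;;
        *)
            echo "other" ;;
    esac
}
```

For example, "classify_executable ./helloworld.sh" would print "script", 
which is enough to tell you that runjob will refuse it.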

>
> I get this error when trying to run - a simple "hello world" shell script; a 
> simple "hello world" compiled C program; a simple "hello world" C program 
> compiled with the mpi compiler. I am obviously missing something simple here. 
> Below I have included my slurm.conf, bluegene.conf, the shell script and C 
> source code files.

The shell script won't run on the compute nodes of the Blue Gene (so it 
is expected that runjob will complain about it). From the Blue Gene/Q 
Application Development Redbook (section 1.3.6, Application development 
and debugging):

Shell scripts
The CNK does not provide a mechanism for a command interpreter or shell 
when applications start on the Blue Gene/Q system. Only the executable 
program can be started. Therefore, if the application includes shell 
scripts that control workflow, the workflow must be adapted. For 
example, an application workflow shell script cannot be started with 
the runjob command. Instead, run the application workflow scripts on 
the front end node and start the runjob command only at the innermost 
shell script level where the main application binary is called.
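
In SLURM terms, that means the script runs on the front end node inside 
your allocation and calls srun only for the compiled binary. A minimal 
sketch of that pattern follows; run_workflow, the LAUNCHER variable and 
the step names are all hypothetical, and the launcher defaults to a 
plain srun with no extra options.

```shell
# Hypothetical workflow wrapper, meant to run on the front end node
# inside an allocation (for example from salloc or sbatch).
# Each argument is the name of a compiled binary in the current directory.
run_workflow() {
    # LAUNCHER is an assumption for illustration; it defaults to srun.
    launcher=${LAUNCHER:-srun}
    for step in "$@"; do
        # Shell logic like this stays on the front end node.
        echo "front end: preparing $step"
        # Only the binary itself is handed to the compute nodes.
        $launcher "./$step" || return 1
    done
}
```

Calling "run_workflow hello-gcc" from within the allocation would then 
start the binary through srun, and the function stops at the first step 
that fails.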


But your simple C program should run fine. You should be able to compile 
it using the supplied gcc, here: 
/bgsys/drivers/ppcfloor/gnu-linux/bin/powerpc64-bgq-linux-gcc

Here's what I get:

$ /bgsys/drivers/ppcfloor/gnu-linux/bin/powerpc64-bgq-linux-gcc -Wall -o hello-gcc hello.c
$ salloc -N 1 bash
salloc: Pending job allocation 27133
salloc: job 27133 queued and waiting for resources
salloc: job 27133 has been allocated resources
salloc: Granted job allocation 27133
salloc: Block RMP19Jl181150422 is ready for job
$ srun hello-gcc
Hello world.


Alternatively, if you want to use the xlc compiler, use 
/opt/ibmcmp/vacpp/bg/12.1/bin/bgxlc (this is just the default install 
location, so your BG xlc compiler may be located elsewhere).
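
Since the install locations can differ between sites, a build script 
could try a list of candidate paths rather than hard-coding one. This is 
only a sketch: first_executable is a hypothetical helper, and the paths 
you pass it are whatever applies on your system.

```shell
# Hypothetical helper: print the first argument that is an executable
# file, or return non-zero if none of the candidates exists.
first_executable() {
    for candidate in "$@"; do
        if [ -x "$candidate" ] && [ ! -d "$candidate" ]; then
            echo "$candidate"
            return 0
        fi
    done
    return 1
}

# Example use (paths taken from this mail; adjust for your install):
#   CC=$(first_executable \
#       /opt/ibmcmp/vacpp/bg/12.1/bin/bgxlc \
#       /bgsys/drivers/ppcfloor/gnu-linux/bin/powerpc64-bgq-linux-gcc)
```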

Seeing as you can launch the linpack executable, this looks like a 
compiler or perhaps a linking issue. Your SLURM config is probably fine 
if you've gotten this far. If you want, I can e-mail you the hello world 
test program that I compiled, so you can confirm that SLURM is working 
correctly (it is 3.1MB, statically linked).

Hope that helps!
Mark

>
> The command I use to run them is:
>
> srun -N 256 ./helloworld.sh
>
> (I know 256 processors to run a shell script is pretty silly but I will work 
> on smaller blocks once I can run a job.)
>
> I am running slurm v2.4.2, BlueGene V1R1M1.
>
> The weird part is that the linpack executable that I am able to run 
> with the IBM 'runjob' command does execute from slurm, but it doesn't see 
> the extra processors and exits. I have tried replicating how that is compiled 
> by using the mpi compiler for the hello.c program but the makefiles are very 
> convoluted and I have probably missed some flags somewhere.
>
> Is this just an issue with compiler flags or is there some slurm setting that 
> might affect this? I would have thought a shell script would run to enable 
> someone to control execution of multiple executables in a job.
>
> Thanks for any pointers or suggestions,
> Carl
>
