Hi Carl,

That's excellent progress. Good work!

All the best with it and let us know if there's anything else you get 
stuck on.

Mark.

On 07/08/12 00:11, Carl Schmidtmann wrote:
> Mark,
>
> Thanks for all the explanations. Most of it I found through following the 
> build logs from the linpack executable. I did get the "hello world" program 
> running. I am now digging through the options to the srun and sbatch
> commands to determine exactly what they do and how they affect running
> jobs. I am also getting the programmers in my group to supply some other
> test programs.
>
> Carl
>
> ----- Original Message -----
>> Hi Carl,
>>
>> On 02/08/12 23:25, Carl Schmidtmann wrote:
>>>
>>> I have now gotten slurm to the point of submitting jobs to my
>>> BlueGeneQ but they fail with the following error:
>>>
>>> 2012-08-02 08:41:30.704 (FATAL) [0xfff801c8b70]
>>> 10066:ibm.runjob.client.Job: Load failed on R00-IC-J07:
>>> Application executable ELF header contains invalid value, errno 8
>>> Exec format error
>>
>> This error occurs because the executable passed to srun (and
>> consequently to runjob) reaches the IO node, does not look like a
>> valid executable for the compute nodes, and so is not loaded. This is
>> most likely because it hasn't been compiled using the right toolchain.
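
A quick way to look at the ELF header the error refers to, before
submitting anything, is sketched below. It is only a rough check under
stated assumptions: elf_summary is a hypothetical helper name, it reads
just the first 20 bytes of the file, and it does not reproduce whatever
validation runjob or the CNK actually perform.

```shell
#!/bin/sh
# elf_summary FILE (hypothetical helper): report whether FILE starts with
# an ELF header and, if so, its class, byte order and e_machine value.
elf_summary() {
    # od prints the first 20 bytes as unsigned decimals; word splitting
    # turns them into the positional parameters $1..${20}.
    set -- $(od -An -tu1 -N20 "$1")
    # Fewer than 20 bytes, or no 0x7f 'E' 'L' 'F' magic: not an ELF file.
    if [ $# -lt 20 ] || [ "$1 $2 $3 $4" != "127 69 76 70" ]; then
        echo "not ELF"
        return 1
    fi
    case $5 in 1) class=32 ;; 2) class=64 ;; *) class=unknown ;; esac
    # e_machine is a 2-byte field at offset 18, in the file's byte order.
    if [ "$6" = 2 ]; then
        order=MSB; machine=$(( ${19} * 256 + ${20} ))
    else
        order=LSB; machine=$(( ${20} * 256 + ${19} ))
    fi
    echo "class=$class order=$order machine=$machine"
}
```

A binary built for 64-bit big-endian PowerPC would be expected to report
class=64 order=MSB machine=21 (21 is EM_PPC64), while a shell script
reports "not ELF". Matching values are necessary but may not be
sufficient for the executable to load.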
>>
>>>
>>> I get this error when trying to run - a simple "hello world" shell
>>> script; a simple "hello world" compiled C program; a simple "hello
>>> world" C program compiled with the mpi compiler. I am obviously
>>> missing something simple here. Below I have included my
>>> slurm.conf, bluegene.conf, the shell script and C source code
>>> files.
>>
>> The shell script won't run on the compute nodes of the Blue Gene (so it
>> is expected that runjob will complain about it). From the Blue Gene/Q
>> Application Development Redbook (section 1.3.6, Application development
>> and debugging):
>>
>> Shell scripts
>> The CNK does not provide a mechanism for a command interpreter or shell
>> when applications start on the Blue Gene/Q system. Only the executable
>> program can be started. Therefore, if the application includes shell
>> scripts that control workflow, the workflow must be adapted. For
>> example, an application workflow shell script cannot be started with
>> the runjob command. Instead, run the application workflow scripts on
>> the front end node and start the runjob command only at the innermost
>> shell script level where the main application binary is called.
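
The pattern the Redbook describes might look roughly like the sketch
below. It is an illustration only: run_workflow and the LAUNCHER
variable are hypothetical names, and the script is assumed to run on
the front end node inside an existing allocation, so that only the
compiled executables are ever handed to srun.

```shell
#!/bin/sh
# Workflow logic stays in this script on the front end node; the
# launcher is invoked only for compiled executables.
# LAUNCHER is a hypothetical override that allows a dry run without srun.
LAUNCHER=${LAUNCHER:-srun}

# run_workflow NODES EXE... (hypothetical helper): launch each compiled
# executable in turn on NODES nodes, stopping at the first failure.
run_workflow() {
    nodes=$1
    shift
    for exe in "$@"; do
        echo "launching $exe on $nodes node(s)"
        $LAUNCHER -N "$nodes" "$exe" || return 1
    done
}
```

For example, "run_workflow 1 ./hello-gcc" would call
"srun -N 1 ./hello-gcc", whereas passing the shell script itself to
srun is what triggers the load failure.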
>>
>>
>> But your simple C program should run fine. You should be able to
>> compile it using the supplied gcc, here:
>> /bgsys/drivers/ppcfloor/gnu-linux/bin/powerpc64-bgq-linux-gcc
>>
>> Here's what I get:
>>
>> $ /bgsys/drivers/ppcfloor/gnu-linux/bin/powerpc64-bgq-linux-gcc -Wall -o hello-gcc hello.c
>> $ salloc -N 1 bash
>> salloc: Pending job allocation 27133
>> salloc: job 27133 queued and waiting for resources
>> salloc: job 27133 has been allocated resources
>> salloc: Granted job allocation 27133
>> salloc: Block RMP19Jl181150422 is ready for job
>> $ srun hello-gcc
>> Hello world.
>>
>>
>> Alternatively, if you wanted to use the xlc compiler you would use
>> /opt/ibmcmp/vacpp/bg/12.1/bin/bgxlc (although your BG xlc compiler may
>> be located elsewhere, this is just the default install location).
>>
>> Seeing as you can launch the linpack executable, this looks like a
>> compiler or perhaps linking issue. Your SLURM config is probably fine
>> if you've gotten this far. I can e-mail you the hello world test
>> program that I compiled if you want, to confirm that SLURM is working
>> correctly (it is 3.1MB - statically linked).
>>
>> Hope that helps!
>> Mark
>>
>>>
>>> The command I use to run them is:
>>>
>>> srun -N 256 ./helloworld.sh
>>>
>>> (I know 256 processors to run a shell script is pretty silly but I
>>> will work on smaller blocks once I can run a job.)
>>>
>>> I am running slurm v2.4.2, BlueGene V1R1M1.
>>>
>>> The weird part is that the linpack executable, which I am able to
>>> run with the IBM 'runjob' command, does execute from slurm, but it
>>> doesn't see the extra processors and exits. I have tried
>>> replicating how that is compiled by using the mpi compiler for the
>>> hello.c program but the makefiles are very convoluted and I have
>>> probably missed some flags somewhere.
>>>
>>> Is this just an issue with compiler flags or is there some slurm
>>> setting that might affect this? I would have thought a shell
>>> script would run to enable someone to control execution of
>>> multiple executables in a job.
>>>
>>> Thanks for any pointers or suggestions,
>>> Carl
>>>
>>
>
