Hi Carl,

That's excellent progress. Good work!

All the best with it and let us know if there's anything else you get stuck on.

Mark.

On 07/08/12 00:11, Carl Schmidtmann wrote:
> Mark,
>
> Thanks for all the explanations. Most of it I found through following the
> build logs from the linpack executable. I did get the "hello world" program
> running. I am now down to digging through options to the srun and sbatch
> commands to determine exactly what they do and how it affects the jobs
> running. I am also getting the programmers in my group to supply some other
> test programs.
>
> Carl
>
> ----- Original Message -----
>> Hi Carl,
>>
>> On 02/08/12 23:25, Carl Schmidtmann wrote:
>>>
>>> I have now gotten slurm to the point of submitting jobs to my
>>> BlueGeneQ but they fail with the following error:
>>>
>>> 2012-08-02 08:41:30.704 (FATAL) [0xfff801c8b70]
>>> 10066:ibm.runjob.client.Job: Load failed on R00-IC-J07:
>>> Application executable ELF header contains invalid value, errno 8
>>> Exec format error
>>
>> This error is because the executable that's been passed to srun (and
>> consequently runjob) gets to the IO node and doesn't look like an
>> appropriate executable, so doesn't get loaded. This is most likely
>> because it hasn't been compiled using the right toolchain.
>>
>>>
>>> I get this error when trying to run - a simple "hello world" shell
>>> script; a simple "hello world" compiled C program; a simple "hello
>>> world" C program compiled with the mpi compiler. I am obviously
>>> missing something simple here. Below I have included my
>>> slurm.conf, bluegene.conf, the shell script and C source code
>>> files.
>>
>> The shell script won't run on the compute nodes of the Blue Gene (so
>> it is expected that runjob will complain about it).
>> From the Blue Gene/Q Application Development Redbook (section 1.3.6,
>> Application development and debugging):
>>
>> Shell scripts
>> The CNK does not provide a mechanism for a command interpreter or
>> shell when applications start on the Blue Gene/Q system. Only the
>> executable program can be started. Therefore, if the application
>> includes shell scripts that control workflow, the workflow must be
>> adapted. For example, an application workflow shell script cannot be
>> started with the runjob command. Instead, run the application
>> workflow scripts on the front end node and start the runjob command
>> only at the innermost shell script level where the main application
>> binary is called.
>>
>>
>> But your simple C program should run fine. You should be able to
>> compile it using the supplied gcc, here:
>> /bgsys/drivers/ppcfloor/gnu-linux/bin/powerpc64-bgq-linux-gcc
>>
>> Here's what I get:
>>
>> $ /bgsys/drivers/ppcfloor/gnu-linux/bin/powerpc64-bgq-linux-gcc -Wall -o hello-gcc hello.c
>> $ salloc -N 1 bash
>> salloc: Pending job allocation 27133
>> salloc: job 27133 queued and waiting for resources
>> salloc: job 27133 has been allocated resources
>> salloc: Granted job allocation 27133
>> salloc: Block RMP19Jl181150422 is ready for job
>> $ srun hello-gcc
>> Hello world.
>>
>>
>> Alternatively if you wanted to use the xlc compiler you would use
>> /opt/ibmcmp/vacpp/bg/12.1/bin/bgxlc (although your BG xlc compiler
>> may be located elsewhere, this is just the default install location).
>>
>> Seeing as you can launch the linpack executable, this looks like a
>> compiler or perhaps linking issue. Your SLURM config is probably fine
>> if you've gotten this far. I can e-mail you the hello world test program
>> that I compiled if you want, to ensure that SLURM is working all
>> correctly (it is 3.1MB - statically linked).
>>
>> Hope that helps!
>> Mark
>>
>>>
>>> The command I use to run them is:
>>>
>>> srun -N 256 ./helloworld.sh
>>>
>>> (I know 256 processors to run a shell script is pretty silly but I
>>> will work on smaller blocks once I can run a job.)
>>>
>>> I am running slurm v2.4.2, BlueGene V1R1M1.
>>>
>>> The weird part is that the linpack executable that I am able to
>>> run with the IBM 'runjob' command does execute from slurm, but it
>>> doesn't see the extra processors and exits. I have tried
>>> replicating how that is compiled by using the mpi compiler for the
>>> hello.c program but the makefiles are very convoluted and I have
>>> probably missed some flags somewhere.
>>>
>>> Is this just an issue with compiler flags or is there some slurm
>>> setting that might affect this? I would have thought a shell
>>> script would run to enable someone to control execution of
>>> multiple executables in a job.
>>>
>>> Thanks for any pointers or suggestions,
>>> Carl
>>>
>>
>
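
A note for anyone who finds this thread after hitting the same "ELF header contains invalid value" error: it can help to look at a file's header on the front end node before passing it to srun. The Python sketch below is not part of SLURM or the IBM tooling. It assumes the compute nodes expect a 64-bit big-endian PowerPC ELF binary (which is what the powerpc64-bgq-linux toolchain name suggests), and the function names are made up for illustration. It only reads the first few header fields, so a file that passes can still be rejected for other reasons, such as how it was linked.

```python
import struct

# Constants from the generic ELF specification.
EM_PPC64 = 21      # e_machine value for 64-bit PowerPC
ELFCLASS64 = 2     # EI_CLASS value for 64-bit objects
ELFDATA2MSB = 2    # EI_DATA value for big-endian objects


def describe_executable(header: bytes) -> str:
    """Classify a file from its first 20 bytes (hypothetical helper)."""
    if header[:2] == b"#!":
        # Interpreter scripts have no ELF header at all.
        return "shell script"
    if len(header) < 20 or header[:4] != b"\x7fELF":
        return "not ELF"
    ei_class, ei_data = header[4], header[5]
    # e_machine is a 16-bit field at offset 18, stored in the file's byte order.
    byte_order = ">" if ei_data == ELFDATA2MSB else "<"
    (e_machine,) = struct.unpack_from(byte_order + "H", header, 18)
    if (ei_class, ei_data, e_machine) == (ELFCLASS64, ELFDATA2MSB, EM_PPC64):
        return "64-bit big-endian PowerPC ELF"
    return "ELF for another target"


def describe_file(path: str) -> str:
    """Read just enough of the file at `path` to classify it."""
    with open(path, "rb") as f:
        return describe_executable(f.read(20))


if __name__ == "__main__":
    import sys
    for name in sys.argv[1:]:
        print(f"{name}: {describe_file(name)}")
```

Run against the files from this thread, helloworld.sh would be reported as a shell script, and a hello.c built with the front end node's native compiler would most likely show up as an ELF for another target. Either result points at the toolchain rather than at slurm.conf or bluegene.conf.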
