I have now gotten slurm to the point of submitting jobs to my BlueGeneQ but
they fail with the following error:
2012-08-02 08:41:30.704 (FATAL) [0xfff801c8b70] 10066:ibm.runjob.client.Job:
Load failed on R00-IC-J07: Application executable ELF header contains invalid
value, errno 8 Exec format error
I get this error when trying to run any of the following: a simple "hello
world" shell script; a simple "hello world" compiled C program; and the same
"hello world" C program compiled with the mpi compiler. I am obviously missing
something simple here. Below I have included my slurm.conf, bluegene.conf, the
shell script and the C source code.
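Since the error complains about the ELF header, a quick sanity check is to see
which of these files are ELF binaries at all. The sketch below is only an
illustration: is_elf is a made-up helper name, and it assumes nothing beyond
the standard 4-byte ELF magic number (0x7f 'E' 'L' 'F') at the start of the
file. A shell script starts with "#!" instead, so it can never pass this check.

```shell
#!/bin/bash
# is_elf is a hypothetical helper: it succeeds if the file begins with the
# 4-byte ELF magic number (0x7f 'E' 'L' 'F').
is_elf() {
    local magic
    # Dump the first four bytes as hex and strip spaces and newlines.
    magic=$(head -c 4 "$1" | od -An -tx1 | tr -d ' \n')
    [ "$magic" = "7f454c46" ]
}

# Classify every file named on the command line.
for f in "$@"; do
    if is_elf "$f"; then
        echo "$f: ELF binary"
    else
        echo "$f: not ELF"
    fi
done
```

Saved under any name, it can be run with helloworld.sh and the compiled hello
binaries as arguments. It only says whether a file is ELF, not whether the
compute nodes will accept it.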
The command I use to run them is:
srun -N 256 ./helloworld.sh
(I know 256 nodes to run a shell script is pretty silly, but I will work on
smaller blocks once I can run a job.)
I am running slurm v2.4.2, BlueGene V1R1M1.
The weird part is that the linpack executable, which I am able to run with the
IBM 'runjob' command, does execute from slurm, but it doesn't see the extra
processors and exits. I have tried replicating how it is compiled by using the
mpi compiler for the hello.c program, but the makefiles are very convoluted and
I have probably missed some flags somewhere.
Is this just an issue with compiler flags, or is there some slurm setting that
might affect this? I would have thought a shell script would run, so that
someone could control the execution of multiple executables in a job.
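One way to narrow down the compiler-flags question might be to compare the ELF
header of the linpack binary (which loads) with the header of my hello binary
(which does not). The sketch below is an illustration under stated assumptions:
elf_summary is a made-up helper name, and the byte offsets follow the standard
ELF layout (class at offset 4, byte order at offset 5, e_type at offset 16,
e_machine at offset 18). It cannot say which value the loader rejects; it only
makes differences between two binaries visible.

```shell
#!/bin/bash
# elf_summary is a hypothetical helper: print the class, byte order, object
# type and machine fields from the ELF header of one file.
elf_summary() {
    local file=$1 class order etype machine name
    # The first 20 bytes hold e_ident (16 bytes), e_type (2) and e_machine (2).
    set -- $(head -c 20 "$file" | od -An -v -tu1)
    if [ "$#" -lt 20 ] || [ "$1 $2 $3 $4" != "127 69 76 70" ]; then
        echo "$file: not ELF"
        return 1
    fi
    case $5 in 1) class=32-bit ;; 2) class=64-bit ;; *) class=unknown ;; esac
    # e_ident[EI_DATA] gives the byte order of the multi-byte fields.
    if [ "$6" -eq 2 ]; then
        order=big-endian
        etype=$(( ${17} * 256 + ${18} ))
        machine=$(( ${19} * 256 + ${20} ))
    else
        order=little-endian
        etype=$(( ${17} + ${18} * 256 ))
        machine=$(( ${19} + ${20} * 256 ))
    fi
    # A few well-known e_machine values; anything else is reported as "other".
    case $machine in
        3) name=i386 ;; 20) name=PowerPC ;; 21) name=PowerPC64 ;;
        62) name=x86-64 ;; *) name=other ;;
    esac
    echo "$file: class=$class order=$order type=$etype machine=$machine ($name)"
}

# Summarize every file named on the command line.
for f in "$@"; do elf_summary "$f" || true; done
```

Running it on the linpack binary and on the hello binary side by side should
show whether they differ in any of these fields (type 2 is a plain executable,
type 3 a shared object or position-independent executable). If the headers
match, the difference is more likely in how the binaries were linked than in
these fields.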
Thanks for any pointers or suggestions,
Carl
--
Carl Schmidtmann
Center for Integrated Research Computing
University of Rochester
[cschmid7_local@bgqsn BlueGeneQ.HPL-base]$ grep -v '^#' /usr/local/slurm/2.4.2/etc/slurm.conf
ControlMachine=bgqsn
AuthType=auth/munge
CacheGroups=0
CryptoType=crypto/munge
Epilog=/usr/local/slurm/current/sbin/epilog.bash
MpiDefault=none
ProctrackType=proctrack/pgid
Prolog=/usr/local/slurm/current/sbin/prolog.bash
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/slurm/state/slurmd
SlurmUser=slurm
StateSaveLocation=/var/slurm/state
SwitchType=switch/none
TaskPlugin=task/none
InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0
FastSchedule=1
SchedulerType=sched/backfill
SchedulerPort=7321
SelectType=select/bluegene
AccountingStorageHost=localhost
AccountingStorageLoc=slurmacct
AccountingStorageType=accounting_storage/slurmdbd
AccountingStoreJobComment=YES
ClusterName=UR-BGQ
JobCompHost=localhost
JobCompLoc=slurmdb
JobCompType=jobcomp/slurmdbd
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdDebug=3
SlurmdLogFile=/var/log/slurm/slurmd.log
SlurmSchedLogFile=/var/log/slurm/slurmsched.log
SlurmSchedLogLevel=3
FrontendName=bgqsn State=UNKNOWN
NodeName=bg[0000x0001] CPUs=1024 State=UNKNOWN
PartitionName=debug Nodes=bg[0000x0001] Default=YES MaxTime=INFINITE State=UP Shared=force
[cschmid7_local@bgqsn BlueGeneQ.HPL-base]$ grep -v '^#' /usr/local/slurm/2.4.2/etc/bluegene.conf
MloaderImage=/bgsys/drivers/ppcfloor/boot/firmware
Numpsets=4 # io semi-poor
BridgeAPILogFile=/var/log/slurm/bridgeapi.log
BridgeAPIVerbose=2
BasePartitionNodeCnt=512
NodeCardNodeCnt=32
LayoutMode=STATIC
MPs=0000 Type=Torus,Torus,Torus,Torus 32CNBlocks=0 64CNBlocks=0 128CNBlocks=0 256CNBlocks=2
MPs=0001 Type=Torus,Torus,Torus,Torus 32CNBlocks=0 64CNBlocks=0 128CNBlocks=0 256CNBlocks=2
[cschmid7_local@bgqsn BlueGeneQ.HPL-base]$ cat helloworld.sh
#!/bin/bash
/bin/echo "Hello World"
exit 0
[cschmid7_local@bgqsn BlueGeneQ.HPL-base]$ cat hello.c
#include <stdio.h>

/* Print a greeting and exit with status 0 (success). */
int main( int argc, char** argv )
{
    printf( "Hello world.\n" );
    return 0;
}