Hello,
I've been trying for a couple of weeks now (not doing this exclusively,
of course) to get DAKOTA (http://dakota.sandia.gov) running in parallel,
with MPI, on a cluster that runs BULL's XBAS-5v3.1. Bull provides its own
version of MPI and also Intel MPI.
(These are actually a bit of a problem, since DAKOTA automatically detects
neither, but that's another problem...)
After compiling the system, when I submit the job to slurm, I get these
messages (from all processors; I edited the output to shorten it):
cpu_bind=MASK - super34, task 0 0 [6100]: mask 0x1 set
dakota: error: keyvalue regex compilation failed
dakota: error: Parsing error at unrecognized key:
dakota: error: Parse error in file /etc/slurm/slurm.conf line 1:
"ClusterName=super"
dakota: error: Parsing error at unrecognized key:
dakota: error: Parse error in file /etc/slurm/slurm.conf line 2:
"ControlMachine=super0"
dakota: error: Parsing error at unrecognized key:
dakota: error: Parse error in file /etc/slurm/slurm.conf line 3:
"Licenses=imex*100,stars*100,gem*100,br*100"
dakota: error: Parsing error at unrecognized key:
dakota: error: Parse error in file /etc/slurm/slurm.conf line 8:
"SlurmUser=slurm"
and on and on and on until the end of slurm.conf.
Due to an incredible bout of serial stupidity, it took me quite a while to
locate the source of the messages, and I blindly tried all possible
combinations of compilers and MPI libraries in the system. I must have
compiled dakota about 30 times :)
Finally I realized that these messages are being printed by libslurm and,
as far as I can tell, come from this code snippet in the file
./src/common/parse_config.c:
static void _keyvalue_regex_init(void)
{
    if (!keyvalue_initialized) {
        if (regcomp(&keyvalue_re, keyvalue_pattern,
                    REG_EXTENDED) != 0) {
            /* FIXME - should be fatal? */
            error("keyvalue regex compilation failed\n");
        }
        keyvalue_initialized = true;
    }
}
*Really*, for the life of me, I can't see what could possibly fail here,
and not fail in the many, many other MPI programs that run fine on the
cluster.
(I took the above code from the stock source distribution of slurm-2.0.5,
which is the version BULL uses in this system, and which I think is the
same as in RHEL5.3, but I can't swear they didn't modify it.)
The only thing I can think of (and I don't know if this is even possible;
at first I would think not at all) is that, when linked into dakota, the
slurm library ends up using a different regex library than the one it is
expecting. The regex patterns are constant, so how can the compilation
fail in this case?
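One way this suspicion could be checked is to ask the dynamic linker which
loaded object actually supplies regcomp(). The following is only a sketch,
assuming glibc on Linux (dladdr() and RTLD_DEFAULT are GNU extensions, and
older glibc versions need -ldl at link time). The helper name
symbol_provider is made up for illustration; it is not part of slurm or
dakota.

```c
#define _GNU_SOURCE   /* needed for dladdr() and RTLD_DEFAULT on glibc */
#include <assert.h>
#include <dlfcn.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: write into buf the path of the object that
 * supplies `symbol` under the default (global) lookup order.
 * Returns 0 on success, -1 if the symbol or its owner cannot be found.
 */
static int symbol_provider(const char *symbol, char *buf, size_t buflen)
{
    assert(symbol != NULL && buf != NULL && buflen > 0);

    /* Look the symbol up the same way an ordinary global reference would. */
    void *addr = dlsym(RTLD_DEFAULT, symbol);
    if (addr == NULL)
        return -1;

    /* Ask which loaded object contains that address. */
    Dl_info info;
    if (dladdr(addr, &info) == 0 || info.dli_fname == NULL)
        return -1;

    snprintf(buf, buflen, "%s", info.dli_fname);
    return 0;
}
```

Called as symbol_provider("regcomp", path, sizeof path) from a small test
program linked the same way as dakota, this would print either the C
library or some other object; the latter would suggest that libslurm is
not getting the regcomp() it was built against.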
It's kind of hard to debug this because it's a production cluster, and I'm
somewhat reluctant to install a different resource manager, recompiled
with debugging information on. That said, it has now occurred to me that I
don't need to install the version with debugging symbols; I can just link
this one executable against a copy of the library built with debugging
symbols, no? Of course, everything will probably work fine with the new
library :) Also, debugging this is a royal PITA, because it's a parallel
program, but it's doable.
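Short of a full debugging build, the regex library itself can say why a
compilation failed, since POSIX regerror() turns the regcomp() return code
into a message. A minimal sketch follows; the helper name try_regcomp is
invented for illustration, and the patterns used to exercise it would have
to be supplied by hand, since I am not reproducing slurm's actual
keyvalue_pattern here.

```c
#include <assert.h>
#include <regex.h>
#include <stdio.h>

/*
 * Hypothetical helper: try to compile `pattern` as a POSIX extended regex.
 * On failure, copy the library's explanation into errbuf.
 * Returns the regcomp() status (0 on success).
 */
static int try_regcomp(const char *pattern, char *errbuf, size_t errlen)
{
    assert(pattern != NULL && errbuf != NULL && errlen > 0);

    regex_t re;
    int rc = regcomp(&re, pattern, REG_EXTENDED);
    if (rc != 0) {
        /* Translate the error code into human-readable text. */
        regerror(rc, &re, errbuf, errlen);
    } else {
        errbuf[0] = '\0';
        regfree(&re);   /* release the compiled pattern on success */
    }
    return rc;
}
```

Run inside a program linked exactly like dakota, a pattern that compiles
under a plain libc build but fails here, together with the message left in
errbuf, would point at the regex implementation rather than at slurm.conf.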
Well, I'm off to try this now, but if anyone has any recommendation on
anything else I could try, I'd be more than grateful.
Thank you for your attention,
Ramiro.
--
Ramiro Brito Willmersdorf [email protected]
Departamento de Eng. Mecânica UFPE
tel: +81 2126-8231r239 fax: +81 2126-8232