Sorry for replying to myself, but I wanted to leave the solution to
this problem archived in the mailing list.

The "problem", so to speak, is that there is a function in the dakota
source code, in the file:

Dakota-5.1/src/regexp.c (well, obviously :)

also named regcomp, and that one ended up being called instead of the
one the slurm library expects (I don't know where the latter is
supposed to come from).

I just renamed all functions in this file (just to be safe :) and in
another file that calls them, CtelRegExp.C, and everything worked as
expected.

With hindsight everything is ridiculously simple, but this was actually
*REALLY HARD* for me to find :)

Thanks for slurm anyway. I was a user of PBS and relatives, but now I
think slurm is much better all around.

Ramiro

2011/5/19 Ramiro Willmersdorf <[email protected]>:
> Hello,
>
> I've been trying for a couple of weeks now (not doing this
> exclusively, of course) to get DAKOTA (http://dakota.sandia.gov)
> running in parallel, with MPI, on a cluster that runs BULL's
> XBAS-5v3.1. Bull provides its own version of MPI and also Intel MPI.
> (These are actually a bit of a problem, since dakota automatically
> detects neither, but that's another problem...)
>
> After compiling the system, when submitted to slurm, I get these
> messages (from all processors, I edited the output to shorten it)
>
> cpu_bind=MASK - super34, task  0  0 [6100]: mask 0x1 set
> dakota: error: keyvalue regex compilation failed
> dakota: error: Parsing error at unrecognized key:
> dakota: error: Parse error in file /etc/slurm/slurm.conf line 1:
> "ClusterName=super"
> dakota: error: Parsing error at unrecognized key:
> dakota: error: Parse error in file /etc/slurm/slurm.conf line 2:
> "ControlMachine=super0"
> dakota: error: Parsing error at unrecognized key:
> dakota: error: Parse error in file /etc/slurm/slurm.conf line 3:
> "Licenses=imex*100,stars*100,gem*100,br*100"
> dakota: error: Parsing error at unrecognized key:
> dakota: error: Parse error in file /etc/slurm/slurm.conf line 8:
> "SlurmUser=slurm"
>
> and on and on and on until the end of slurm.conf.
>
> Due to an incredible bout of serial stupidity, it took me quite a
> while to locate the
> source of the messages, and I blindly tried all possible combinations
> of compilers and mpi
> libraries in the system. I must have compiled dakota about 30 times :)
>
> Finally I realized that these messages are being printed by libslurm and
> come, as far as I can tell, from this code snippet in the file
> ./src/common/parse_config.c
>
> static void _keyvalue_regex_init(void)
> {
>        if (!keyvalue_initialized) {
>                if (regcomp(&keyvalue_re, keyvalue_pattern,
>                            REG_EXTENDED) != 0) {
>                        /* FIXME - should be fatal? */
>                        error("keyvalue regex compilation failed\n");
>                }
>                keyvalue_initialized = true;
>        }
> }
>
> *Really*, for the life of me, I can't see what could possibly fail here,
> and not fail in the many, many other mpi programs that run fine in the
> cluster.
>
> (I took the above code from the stock source distribution of
> slurm-2.0.5, which is the version BULL uses in this system (which I
> think is the same as in RHEL5.3), but I can't swear they didn't
> modify it.)
>
> The only thing I can think of, and I don't know if this is even possible, and
> would, at first, think not at all, is that when linked to dakota, the
> slurm library
> ends up using a different regex library than what it's expecting. The
> regex patterns
> are constant, how can the compilation fail in this case?
>
> It's kind of hard to debug this because it's a production cluster, and
> I'm somewhat reluctant to install a different resource manager,
> recompiled with debugging information on. That said, it now occurred
> to me that I don't need to install the version with debugging symbols,
> just link this one executable to the library with debugging symbols
> on, no? Of course, probably everything will work fine with the new
> library :) Also, debugging this is a royal PITA, because it's a
> parallel program, but it's doable.
>
> Well, I'm off to try this now, but, if anyone has any recommendation
> on anything else
> I could try, I'd be more than grateful.
>
> Thank you for your attention,
>
> Ramiro.
>
>
> --
> Ramiro Brito Willmersdorf            [email protected]
> Departamento de Eng. Mecânica        UFPE
> tel: +81 2126-8231r239               fax: +81 2126-8232
>



-- 
Ramiro Brito Willmersdorf            [email protected]
Departamento de Eng. Mecânica        UFPE
tel: +81 2126-8231r239               fax: +81 2126-8232
