Jeff,
Some limited testing shows that srun does seem to work where the
quote-y one did not. I'm working with our admins now to make sure it lets
the prolog work as expected as well.
I'll keep you informed,
Matt
On Thu, Sep 4, 2014 at 1:26 PM, Jeff Squyres (jsquyres)
Try this (typed in editor, not tested!):
#! /usr/bin/perl -w
use strict;
use warnings;
use FindBin;
# Specify the path to the prolog.
my $prolog = '--task-prolog=/gpfsm//.task.prolog';
# Build the path to the SLURM srun command.
my $srun_slurm = "${FindBin::Bin}/srun.slurm";
# Add
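The wrapper's job can be sketched in shell terms (the paths and function names below are placeholders for illustration, not the actual site script): to stay transparent to mpirun, it must forward every argument exactly as received, which in shell means `"$@"`.

```shell
#!/bin/sh
# Sketch of a transparent wrapper (paths are placeholders, not the real
# site script). The key point: "$@" forwards each argument exactly as
# received, so arguments containing spaces or quotes survive intact.

# Stand-in for the real srun.slurm: print each argument on its own line.
fake_srun() { for a in "$@"; do printf '<%s>\n' "$a"; done; }

# The wrapper body: prepend the prolog flag, then pass everything through.
wrapper() {
    prolog='--task-prolog=/path/to/task.prolog'   # placeholder path
    fake_srun "$prolog" "$@"
}

# An argument with embedded spaces is preserved as a single argument:
wrapper --mca 'oob_base_verbose 10'
```

Run directly, this prints the prolog flag and then the two forwarded arguments, each on its own line, with the space-containing value kept whole.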
Jeff,
Here is the script (with a bit of munging for safety's sake):
#! /usr/bin/perl -w
use strict;
use warnings;
use FindBin;
# Specify the path to the prolog.
my $prolog = '--task-prolog=/gpfsm//.task.prolog';
# Build the path to the SLURM srun command.
my $srun_slurm =
Still begs the bigger question, though, as others have used script wrappers
before - and I'm not sure we (OMPI) want to be in the business of dictating the
scripting language they can use. :-)
Jeff and I will argue that one out
On Sep 4, 2014, at 7:38 AM, Jeff Squyres (jsquyres)
Ah, if it's perl, it might be easy. It might just be the difference between
system("...string...") and system(@argv).
Sent from my phone. No type good.
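The distinction Jeff is pointing at has a direct shell analogue (the command below is illustrative, not from the thread): Perl's `system("...string...")` hands the whole string to `/bin/sh`, which re-parses it and destroys argument boundaries, while `system(@argv)` execs the program with the list intact.

```shell
#!/bin/sh
# Illustration (not from the thread): re-parsing a command string loses
# argument boundaries, the shell-level analogue of Perl's
# system("...string...") versus system(@argv).
show() { for a in "$@"; do printf '[%s]\n' "$a"; done; }

arg='oob_base_verbose 10'

# List-style call: one argument stays one argument.
show "$arg"

# String-style call: eval re-parses the rebuilt string and splits it
# on whitespace, so the single argument arrives as two.
eval "show $arg"
```

The list-style call prints one bracketed line; the string-style call prints two, because the value was split in transit.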
On Sep 4, 2014, at 8:35 AM, "Matt Thompson" wrote:
Jeff,
I actually misspoke earlier. It turns out our srun is a *Perl* wrapper script
around the SLURM srun. I'll speak with our admins to see if they can
massage the script to not interpret the arguments. If possible, I'll ask
them if I can share the script with you (privately or on the list) and
maybe
On Sep 3, 2014, at 9:27 AM, Matt Thompson wrote:
> Just saw this, sorry. Our srun is indeed a shell script. It seems to be a
> wrapper around the regular srun that runs a --task-prolog. What it
> does...that's beyond my ken, but I could ask. My guess is that it probably
>
Thanks Matt - that does indeed resolve the "how" question :-)
We'll talk internally about how best to resolve the issue. We could, of course,
add a flag to indicate "we are using a shellscript version of srun" so we know
to quote things, but it would mean another thing that the user would have
On Tue, Sep 2, 2014 at 8:38 PM, Jeff Squyres (jsquyres) wrote:
> Matt: Random thought -- is your "srun" a shell script, perchance? (it
> shouldn't be, but perhaps there's some kind of local override...?)
>
> Ralph's point on the call today is that it doesn't matter *how*
Jeff,
I tried your script and I saw:
(1027) $ /discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2/bin/mpirun
-np 8 ./script.sh
(1028) $
Now, the very first time I ran it, I think I might have noticed a blip of
orted on the nodes, but it disappeared fast. When I re-run the same
command, it
Ah, I see the "sh: tcp://10.1.25.142,172.31.1.254,10.12.25.142:41686: No such
file or directory" message now -- I was looking for something like that when I
replied before and missed it.
I really wish I understood why the heck that is happening; it doesn't seem to
make sense.
Matt: Random
I can answer that for you right now. The launch of the orted's is what is
failing, and they are "silently" failing at this time. The reason is simple:
1. we are failing due to truncation of the HNP uri at the first semicolon. This
causes the orted to emit an ORTE_ERROR_LOG message and then
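The truncation Ralph describes is easy to reproduce in shell (the URI below is made up for illustration; the real HNP URI likewise contains a semicolon). When the URI lands unquoted inside a command string, the shell treats `;` as a command separator and everything after it is lost:

```shell
#!/bin/sh
# Made-up URI in the "jobid;address" shape; the value itself is not
# taken from the thread.
uri='123456789.0;tcp://10.1.25.142:41686'

# Unquoted inside a rebuilt command string: cut at the ';', and the
# shell then tries to run the remainder as a command (errors silenced).
sh -c "printf '%s\n' $uri" 2>/dev/null

# Quoted: the full URI survives.
sh -c "printf '%s\n' '$uri'"
```

The first call prints only the portion before the semicolon; the second prints the whole URI, which is consistent with the "No such file or directory" noise seen earlier when the remainder was executed as a command.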
Thanks - I expect we'll have to release 1.8.3 soon to fix this in case others
have similar issues. Out of curiosity, what OS are you using?
On Sep 1, 2014, at 9:00 AM, Matt Thompson wrote:
> Ralph,
>
> Okay that seems to have done it here (well, minus the usual
>
Ralph,
Okay that seems to have done it here (well, minus the
usual shmem_mmap_enable_nfs_warning that our system always generates):
(1033) $ setenv OMPI_MCA_shmem_mmap_enable_nfs_warning 0
(1034) $
/discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2-debug-patch/bin/mpirun
-np 8
Hmmm... I may see the problem. Would you be so kind as to apply the attached patch to your 1.8.2 code, rebuild, and try again?

Much appreciate the help. Everyone's system is slightly different, and I think you've uncovered one of those differences.

Ralph
uri.diff
Description: Binary data
On Aug 31,
Ralph,
Sorry it took me a bit of time. Here you go:
(1002) $
/discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2-debug/bin/mpirun
--leave-session-attached --debug-daemons --mca oob_base_verbose 10 -mca
plm_base_verbose 5 -np 8 ./helloWorld.182-debug.x
[borg01w063:03815] mca:base:select:(
Ralph,
Here you go:
(1080) $
/discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2-debug/bin/mpirun
--leave-session-attached --debug-daemons --mca oob_base_verbose 10 -np 8
./helloWorld.182-debug.x
[borg01x142:29232] mca: base: components_register: registering oob
components
[borg01x142:29232]
Okay, something quite weird is happening here. I can't replicate using the
1.8.2 release tarball on a slurm machine, so my guess is that something else is
going on here.
Could you please rebuild the 1.8.2 code with --enable-debug on the configure
line (assuming you haven't already done so),
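A debug rebuild along the lines Ralph asks for would look roughly like this (the prefix and job count are placeholders; adjust to the local layout):

```shell
# Hypothetical rebuild of the 1.8.2 tree with debug support enabled:
./configure --prefix=$HOME/openmpi-1.8.2-debug --enable-debug
make -j8
make install
```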
I'm unaware of any changes to the Slurm integration between rc4 and final
release. It sounds like this might be something else going on - try adding
"--leave-session-attached --debug-daemons" to your 1.8.2 command line and let's
see if any errors get reported.
On Aug 28, 2014, at 12:20 PM,
Open MPI List,
I recently encountered an odd bug with Open MPI 1.8.1 and GCC 4.9.1 on our
cluster (reported on this list), and decided to try it with 1.8.2. However,
we seem to be having an issue with Open MPI 1.8.2 and SLURM. Even weirder,
Open MPI 1.8.2rc4 doesn't show the bug. And the bug is: