Re: [OMPI users] Issues with OpenMPI 1.8.2, GCC 4.9.1, and SLURM Interactive Jobs

2014-09-04 Thread Matt Thompson
Jeff, Some limited testing shows that srun does seem to work where the quote-y one did not. I'm working with our admins now to make sure it lets the prolog work as expected as well. I'll keep you informed, Matt On Thu, Sep 4, 2014 at 1:26 PM, Jeff Squyres (jsquyres)

Re: [OMPI users] Issues with OpenMPI 1.8.2, GCC 4.9.1, and SLURM Interactive Jobs

2014-09-04 Thread Jeff Squyres (jsquyres)
Try this (typed in editor, not tested!):

#! /usr/bin/perl -w
use strict;
use warnings;
use FindBin;

# Specify the path to the prolog.
my $prolog = '--task-prolog=/gpfsm//.task.prolog';

# Build the path to the SLURM srun command.
my $srun_slurm = "${FindBin::Bin}/srun.slurm";

# Add
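Jeff's (truncated) Perl sketch forwards srun's arguments without rebuilding them into a single string. The same idea can be sketched as a minimal shell wrapper; the target path and prolog location below are placeholders, not the site's real values:

```shell
#!/bin/sh
# Minimal sketch of an srun wrapper in the spirit of the thread's Perl
# script; WRAP_TARGET and PROLOG are hypothetical placeholders. The key
# point: forward "$@" verbatim so arguments containing semicolons
# (such as mpirun's HNP URI) are not re-split by a shell.
run_wrapped() {
    # In production this would be: exec /path/to/srun.slurm ...
    "$WRAP_TARGET" --task-prolog="$PROLOG" "$@"
}

# Demo with /bin/echo standing in for the real srun:
WRAP_TARGET=/bin/echo
PROLOG=/tmp/.task.prolog
run_wrapped -np 8 'uri;with;semicolons'
```

Because `"$@"` expands to one word per original argument, the semicolon-laden argument reaches the wrapped command intact.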

Re: [OMPI users] Issues with OpenMPI 1.8.2, GCC 4.9.1, and SLURM Interactive Jobs

2014-09-04 Thread Matt Thompson
Jeff, Here is the script (with a bit of munging for safety's sake):

#! /usr/bin/perl -w
use strict;
use warnings;
use FindBin;

# Specify the path to the prolog.
my $prolog = '--task-prolog=/gpfsm//.task.prolog';

# Build the path to the SLURM srun command.
my $srun_slurm =

Re: [OMPI users] Issues with OpenMPI 1.8.2, GCC 4.9.1, and SLURM Interactive Jobs

2014-09-04 Thread Ralph Castain
Still begs the bigger question, though, as others have used script wrappers before - and I'm not sure we (OMPI) want to be in the business of dictating the scripting language they can use. :-) Jeff and I will argue that one out On Sep 4, 2014, at 7:38 AM, Jeff Squyres (jsquyres)

Re: [OMPI users] Issues with OpenMPI 1.8.2, GCC 4.9.1, and SLURM Interactive Jobs

2014-09-04 Thread Jeff Squyres (jsquyres)
Ah, if it's perl, it might be easy. It might just be the difference between system("...string...") and system(@argv). Sent from my phone. No type good. On Sep 4, 2014, at 8:35 AM, "Matt Thompson" wrote: Jeff, I actually misspoke earlier. It turns
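The distinction Jeff names — Perl's single-string system(), which hands the command to a shell, versus the list form, which does not — can be demonstrated in plain sh. The URI below is a made-up stand-in for the HNP URI mpirun passes to the orteds, not a value from the thread:

```shell
#!/bin/sh
# Demonstrates why re-parsing arguments through a shell mangles
# mpirun's semicolon-separated HNP URI (a fabricated example value).
uri='1234;tcp://10.0.0.1:4000;tcp://10.0.0.2:4000'

# String form: the argument is re-parsed by a shell, which treats each
# ';' as a command separator, truncating the URI at the first one.
broken=$(sh -c "printf '%s' $uri" 2>/dev/null)

# List form: the argument vector is passed through untouched.
safe=$(printf '%s' "$uri")

echo "string form saw: $broken"
echo "list form saw:   $safe"
```

The string form sees only the leading "1234"; the list form sees the whole URI, which matches the truncation-at-first-semicolon failure described later in the thread.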

Re: [OMPI users] Issues with OpenMPI 1.8.2, GCC 4.9.1, and SLURM Interactive Jobs

2014-09-04 Thread Matt Thompson
Jeff, I actually misspoke earlier. It turns out our srun is a *Perl* script around the SLURM srun. I'll speak with our admins to see if they can massage the script to not interpret the arguments. If possible, I'll ask them if I can share the script with you (privately or on the list) and maybe

Re: [OMPI users] Issues with OpenMPI 1.8.2, GCC 4.9.1, and SLURM Interactive Jobs

2014-09-04 Thread Jeff Squyres (jsquyres)
On Sep 3, 2014, at 9:27 AM, Matt Thompson wrote: > Just saw this, sorry. Our srun is indeed a shell script. It seems to be a > wrapper around the regular srun that runs a --task-prolog. What it > does...that's beyond my ken, but I could ask. My guess is that it probably >

Re: [OMPI users] Issues with OpenMPI 1.8.2, GCC 4.9.1, and SLURM Interactive Jobs

2014-09-03 Thread Ralph Castain
Thanks Matt - that does indeed resolve the "how" question :-) We'll talk internally about how best to resolve the issue. We could, of course, add a flag to indicate "we are using a shellscript version of srun" so we know to quote things, but it would mean another thing that the user would have

Re: [OMPI users] Issues with OpenMPI 1.8.2, GCC 4.9.1, and SLURM Interactive Jobs

2014-09-03 Thread Matt Thompson
On Tue, Sep 2, 2014 at 8:38 PM, Jeff Squyres (jsquyres) wrote: > Matt: Random thought -- is your "srun" a shell script, perchance? (it > shouldn't be, but perhaps there's some kind of local override...?) > > Ralph's point on the call today is that it doesn't matter *how*

Re: [OMPI users] Issues with OpenMPI 1.8.2, GCC 4.9.1, and SLURM Interactive Jobs

2014-09-03 Thread Matt Thompson
Jeff, I tried your script and I saw: (1027) $ /discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2/bin/mpirun -np 8 ./script.sh (1028) $ Now, the very first time I ran it, I think I might have noticed a blip of orted on the nodes, but it disappeared fast. When I re-run the same command, it

Re: [OMPI users] Issues with OpenMPI 1.8.2, GCC 4.9.1, and SLURM Interactive Jobs

2014-09-02 Thread Jeff Squyres (jsquyres)
Ah, I see the "sh: tcp://10.1.25.142,172.31.1.254,10.12.25.142:41686: No such file or directory" message now -- I was looking for something like that when I replied before and missed it. I really wish I understood why the heck that is happening; it doesn't seem to make sense. Matt: Random

Re: [OMPI users] Issues with OpenMPI 1.8.2, GCC 4.9.1, and SLURM Interactive Jobs

2014-09-02 Thread Ralph Castain
I can answer that for you right now. The launch of the orteds is what is failing, and they are "silently" failing at this time. The reason is simple: 1. we are failing due to truncation of the HNP uri at the first semicolon. This causes the orted to emit an ORTE_ERROR_LOG message and then

Re: [OMPI users] Issues with OpenMPI 1.8.2, GCC 4.9.1, and SLURM Interactive Jobs

2014-09-01 Thread Ralph Castain
Thanks - I expect we'll have to release 1.8.3 soon to fix this in case others have similar issues. Out of curiosity, what OS are you using? On Sep 1, 2014, at 9:00 AM, Matt Thompson wrote: > Ralph, > > Okay that seems to have done it here (well, minus the usual >

Re: [OMPI users] Issues with OpenMPI 1.8.2, GCC 4.9.1, and SLURM Interactive Jobs

2014-09-01 Thread Matt Thompson
Ralph, Okay that seems to have done it here (well, minus the usual shmem_mmap_enable_nfs_warning that our system always generates): (1033) $ setenv OMPI_MCA_shmem_mmap_enable_nfs_warning 0 (1034) $ /discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2-debug-patch/bin/mpirun -np 8

Re: [OMPI users] Issues with OpenMPI 1.8.2, GCC 4.9.1, and SLURM Interactive Jobs

2014-08-31 Thread Ralph Castain
Hmmm... I may see the problem. Would you be so kind as to apply the attached patch to your 1.8.2 code, rebuild, and try again? Much appreciate the help. Everyone's system is slightly different, and I think you've uncovered one of those differences. Ralph [attachment: uri.diff] On Aug 31,

Re: [OMPI users] Issues with OpenMPI 1.8.2, GCC 4.9.1, and SLURM Interactive Jobs

2014-08-31 Thread Matt Thompson
Ralph, Sorry it took me a bit of time. Here you go: (1002) $ /discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2-debug/bin/mpirun --leave-session-attached --debug-daemons --mca oob_base_verbose 10 -mca plm_base_verbose 5 -np 8 ./helloWorld.182-debug.x [borg01w063:03815] mca:base:select:(

Re: [OMPI users] Issues with OpenMPI 1.8.2, GCC 4.9.1, and SLURM Interactive Jobs

2014-08-29 Thread Matt Thompson
Ralph, Here you go: (1080) $ /discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2-debug/bin/mpirun --leave-session-attached --debug-daemons --mca oob_base_verbose 10 -np 8 ./helloWorld.182-debug.x [borg01x142:29232] mca: base: components_register: registering oob components [borg01x142:29232]

Re: [OMPI users] Issues with OpenMPI 1.8.2, GCC 4.9.1, and SLURM Interactive Jobs

2014-08-29 Thread Ralph Castain
Okay, something quite weird is happening here. I can't replicate using the 1.8.2 release tarball on a slurm machine, so my guess is that something else is going on here. Could you please rebuild the 1.8.2 code with --enable-debug on the configure line (assuming you haven't already done so),
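The debug rebuild Ralph requests might look like the following; the install prefix and parallelism are placeholders, not the site's actual values:

```shell
# Sketch of rebuilding Open MPI 1.8.2 with debug support so the
# verbose MCA output is meaningful. --prefix is a placeholder path.
./configure --enable-debug --prefix=$HOME/openmpi-1.8.2-debug
make -j8
make install
```

A debug build of this vintage adds symbols and extra runtime checks, which is why the thread's later runs use a separate "-debug" install tree rather than overwriting the production one.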

Re: [OMPI users] Issues with OpenMPI 1.8.2, GCC 4.9.1, and SLURM Interactive Jobs

2014-08-28 Thread Ralph Castain
I'm unaware of any changes to the Slurm integration between rc4 and final release. It sounds like this might be something else going on - try adding "--leave-session-attached --debug-daemons" to your 1.8.2 command line and let's see if any errors get reported. On Aug 28, 2014, at 12:20 PM,

[OMPI users] Issues with OpenMPI 1.8.2, GCC 4.9.1, and SLURM Interactive Jobs

2014-08-28 Thread Matt Thompson
Open MPI List, I recently encountered an odd bug with Open MPI 1.8.1 and GCC 4.9.1 on our cluster (reported on this list), and decided to try it with 1.8.2. However, we seem to be having an issue with Open MPI 1.8.2 and SLURM. Even weirder, Open MPI 1.8.2rc4 doesn't show the bug. And the bug is: