Re: [OMPI users] Issues with OpenMPI 1.8.2, GCC 4.9.1, and SLURM Interactive Jobs

2014-09-04 Thread Matt Thompson
Jeff, Some limited testing shows that that srun does seem to work where the quote-y one did not. I'm working with our admins now to make sure it lets the prolog work as expected as well. I'll keep you informed, Matt On Thu, Sep 4, 2014 at 1:26 PM, Jeff Squyres (jsquyres) wrote: > Try this (t

Re: [OMPI users] Issues with OpenMPI 1.8.2, GCC 4.9.1, and SLURM Interactive Jobs

2014-09-04 Thread Jeff Squyres (jsquyres)
Try this (typed in editor, not tested!): #! /usr/bin/perl -w use strict; use warnings; use FindBin; # Specify the path to the prolog. my $prolog = '--task-prolog=/gpfsm//.task.prolog'; # Build the path to the SLURM srun command. my $srun_slurm = "${FindBin::Bin}/srun.slurm"; # Add the
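
A minimal sketch of the wrapper shape being proposed here, for context. The prolog path is the munged placeholder from the thread, and whether the real wrapper uses exec or system(LIST) is a detail; the point is that handing srun its arguments as a list keeps any shell out of the picture:

    #! /usr/bin/perl -w
    use strict;
    use warnings;
    use FindBin;

    # Specify the path to the prolog (path munged in the original post).
    my $prolog = '--task-prolog=/gpfsm//.task.prolog';

    # Build the path to the real SLURM srun, which sits next to this wrapper.
    my $srun_slurm = "${FindBin::Bin}/srun.slurm";

    # Prepend the prolog flag and pass every original argument through
    # untouched.  exec with a list never invokes /bin/sh, so arguments
    # containing ';' or other shell metacharacters (such as the orted
    # HNP URI) survive intact.
    exec($srun_slurm, $prolog, @ARGV)
        or die "Could not exec $srun_slurm: $!";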

Re: [OMPI users] Issues with OpenMPI 1.8.2, GCC 4.9.1, and SLURM Interactive Jobs

2014-09-04 Thread Matt Thompson
Jeff, Here is the script (with a bit of munging for safety's sake): #! /usr/bin/perl -w use strict; use warnings; use FindBin; # Specify the path to the prolog. my $prolog = '--task-prolog=/gpfsm//.task.prolog'; # Build the path to the SLURM srun command. my $srun_slurm = "${FindBin::

Re: [OMPI users] Issues with OpenMPI 1.8.2, GCC 4.9.1, and SLURM Interactive Jobs

2014-09-04 Thread Ralph Castain
Still begs the bigger question, though, as others have used script wrappers before - and I'm not sure we (OMPI) want to be in the business of dictating the scripting language they can use. :-) Jeff and I will argue that one out On Sep 4, 2014, at 7:38 AM, Jeff Squyres (jsquyres) wrote: > Ah,

Re: [OMPI users] Issues with OpenMPI 1.8.2, GCC 4.9.1, and SLURM Interactive Jobs

2014-09-04 Thread Jeff Squyres (jsquyres)
Ah, if it's Perl, it might be easy. It might just be the difference between system("...string...") and system(@argv). Sent from my phone. No type good. On Sep 4, 2014, at 8:35 AM, "Matt Thompson" <fort...@gmail.com> wrote: Jeff, I actually misspoke earlier. It turns out our srun is a *
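
The distinction Jeff is pointing at, sketched with an illustrative value (the real HNP URI is not reproduced here): in Perl, system() with a single string is handed to /bin/sh, which treats ';' as a command separator, while system() with a list execs the program directly and leaves the argument alone.

    #! /usr/bin/perl
    use strict;
    use warnings;

    # Illustrative value only -- stands in for an argument containing a ';',
    # like the orted --hnp-uri value discussed in this thread.
    my $uri = 'tcp://10.1.25.142:41686;tcp://172.31.1.254:41686';

    # String form: /bin/sh parses the line, splits it at the ';', and then
    # tries to run the remainder as its own command -- which is where a
    # message like "sh: tcp://...: No such file or directory" comes from.
    system("echo --hnp-uri $uri");

    # List form: no shell is involved, so echo receives the full argument.
    system('echo', '--hnp-uri', $uri);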

Re: [OMPI users] Issues with OpenMPI 1.8.2, GCC 4.9.1, and SLURM Interactive Jobs

2014-09-04 Thread Matt Thompson
Jeff, I actually misspoke earlier. It turns out our srun is a *Perl* script around the SLURM srun. I'll speak with our admins to see if they can massage the script to not interpret the arguments. If possible, I'll ask them if I can share the script with you (privately or on the list) and maybe you

Re: [OMPI users] Issues with OpenMPI 1.8.2, GCC 4.9.1, and SLURM Interactive Jobs

2014-09-04 Thread Jeff Squyres (jsquyres)
On Sep 3, 2014, at 9:27 AM, Matt Thompson wrote: > Just saw this, sorry. Our srun is indeed a shell script. It seems to be a > wrapper around the regular srun that runs a --task-prolog. What it > does...that's beyond my ken, but I could ask. My guess is that it probably > does something that h

Re: [OMPI users] Issues with OpenMPI 1.8.2, GCC 4.9.1, and SLURM Interactive Jobs

2014-09-03 Thread Ralph Castain
Thanks Matt - that does indeed resolve the "how" question :-) We'll talk internally about how best to resolve the issue. We could, of course, add a flag to indicate "we are using a shellscript version of srun" so we know to quote things, but it would mean another thing that the user would have t

Re: [OMPI users] Issues with OpenMPI 1.8.2, GCC 4.9.1, and SLURM Interactive Jobs

2014-09-03 Thread Matt Thompson
On Tue, Sep 2, 2014 at 8:38 PM, Jeff Squyres (jsquyres) wrote: > Matt: Random thought -- is your "srun" a shell script, perchance? (it > shouldn't be, but perhaps there's some kind of local override...?) > > Ralph's point on the call today is that it doesn't matter *how* this > problem is happen

Re: [OMPI users] Issues with OpenMPI 1.8.2, GCC 4.9.1, and SLURM Interactive Jobs

2014-09-03 Thread Matt Thompson
Jeff, I tried your script and I saw: (1027) $ /discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2/bin/mpirun -np 8 ./script.sh (1028) $ Now, the very first time I ran it, I think I might have noticed a blip of orted on the nodes, but it disappeared fast. When I re-run the same command, it ju

Re: [OMPI users] Issues with OpenMPI 1.8.2, GCC 4.9.1, and SLURM Interactive Jobs

2014-09-02 Thread Jeff Squyres (jsquyres)
Ah, I see the "sh: tcp://10.1.25.142,172.31.1.254,10.12.25.142:41686: No such file or directory" message now -- I was looking for something like that when I replied before and missed it. I really wish I understood why the heck that is happening; it doesn't seem to make sense. Matt: Random th

Re: [OMPI users] Issues with OpenMPI 1.8.2, GCC 4.9.1, and SLURM Interactive Jobs

2014-09-02 Thread Ralph Castain
I can answer that for you right now. The launch of the orteds is what is failing, and they are "silently" failing at this time. The reason is simple: 1. we are failing due to truncation of the HNP uri at the first semicolon. This causes the orted to emit an ORTE_ERROR_LOG message and then abort

Re: [OMPI users] Issues with OpenMPI 1.8.2, GCC 4.9.1, and SLURM Interactive Jobs

2014-09-02 Thread Jeff Squyres (jsquyres)
Matt -- We were discussing this issue on our weekly OMPI engineering call today. Can you check one thing for me? With the un-edited 1.8.2 tarball installation, I see that you're getting no output for commands that you run -- but also no errors. Can you verify and see if your commands are actu

Re: [OMPI users] Issues with OpenMPI 1.8.2, GCC 4.9.1, and SLURM Interactive Jobs

2014-09-02 Thread Matt Thompson
On that machine, it would be SLES 11 SP1. I think it's soon transitioning to SLES 11 SP3. I also use Open MPI on an RHEL 6.5 box (possibly soon to be RHEL 7). On Mon, Sep 1, 2014 at 8:41 PM, Ralph Castain wrote: > Thanks - I expect we'll have to release 1.8.3 soon to fix this in case > others

Re: [OMPI users] Issues with OpenMPI 1.8.2, GCC 4.9.1, and SLURM Interactive Jobs

2014-09-01 Thread Ralph Castain
Thanks - I expect we'll have to release 1.8.3 soon to fix this in case others have similar issues. Out of curiosity, what OS are you using? On Sep 1, 2014, at 9:00 AM, Matt Thompson wrote: > Ralph, > > Okay that seems to have done it here (well, minus the usual > shmem_mmap_enable_nfs_warnin

Re: [OMPI users] Issues with OpenMPI 1.8.2, GCC 4.9.1, and SLURM Interactive Jobs

2014-09-01 Thread Matt Thompson
Ralph, Okay that seems to have done it here (well, minus the usual shmem_mmap_enable_nfs_warning that our system always generates): (1033) $ setenv OMPI_MCA_shmem_mmap_enable_nfs_warning 0 (1034) $ /discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2-debug-patch/bin/mpirun -np 8 ./helloWorld.1

Re: [OMPI users] Issues with OpenMPI 1.8.2, GCC 4.9.1, and SLURM Interactive Jobs

2014-08-31 Thread Ralph Castain
Hmmm... I may see the problem. Would you be so kind as to apply the attached patch to your 1.8.2 code, rebuild, and try again? Much appreciate the help. Everyone's system is slightly different, and I think you've uncovered one of those differences. Ralph uri.diff Description: Binary data On Aug 31,

Re: [OMPI users] Issues with OpenMPI 1.8.2, GCC 4.9.1, and SLURM Interactive Jobs

2014-08-31 Thread Matt Thompson
Ralph, Sorry it took me a bit of time. Here you go: (1002) $ /discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2-debug/bin/mpirun --leave-session-attached --debug-daemons --mca oob_base_verbose 10 -mca plm_base_verbose 5 -np 8 ./helloWorld.182-debug.x [borg01w063:03815] mca:base:select:( plm

Re: [OMPI users] Issues with OpenMPI 1.8.2, GCC 4.9.1, and SLURM Interactive Jobs

2014-08-29 Thread Ralph Castain
Rats - I also need "-mca plm_base_verbose 5" on there so I can see the cmd line being executed. Can you add it? On Aug 29, 2014, at 11:16 AM, Matt Thompson wrote: > Ralph, > > Here you go: > > (1080) $ > /discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2-debug/bin/mpirun > --leave-ses

Re: [OMPI users] Issues with OpenMPI 1.8.2, GCC 4.9.1, and SLURM Interactive Jobs

2014-08-29 Thread Matt Thompson
Ralph, Here you go: (1080) $ /discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2-debug/bin/mpirun --leave-session-attached --debug-daemons --mca oob_base_verbose 10 -np 8 ./helloWorld.182-debug.x [borg01x142:29232] mca: base: components_register: registering oob components [borg01x142:29232]

Re: [OMPI users] Issues with OpenMPI 1.8.2, GCC 4.9.1, and SLURM Interactive Jobs

2014-08-29 Thread Ralph Castain
Okay, something quite weird is happening here. I can't replicate using the 1.8.2 release tarball on a slurm machine, so my guess is that something else is going on here. Could you please rebuild the 1.8.2 code with --enable-debug on the configure line (assuming you haven't already done so), and

Re: [OMPI users] Issues with OpenMPI 1.8.2, GCC 4.9.1, and SLURM Interactive Jobs

2014-08-29 Thread Matt Thompson
Ralph, For 1.8.2rc4 I get: (1003) $ /discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2rc4/bin/mpirun --leave-session-attached --debug-daemons -np 8 ./helloWorld.182.x srun.slurm: cluster configuration lacks support for cpu binding srun.slurm: cluster configuration lacks support for cpu bindi

Re: [OMPI users] Issues with OpenMPI 1.8.2, GCC 4.9.1, and SLURM Interactive Jobs

2014-08-28 Thread Ralph Castain
I'm unaware of any changes to the Slurm integration between rc4 and final release. It sounds like this might be something else going on - try adding "--leave-session-attached --debug-daemons" to your 1.8.2 command line and let's see if any errors get reported. On Aug 28, 2014, at 12:20 PM, Mat

[OMPI users] Issues with OpenMPI 1.8.2, GCC 4.9.1, and SLURM Interactive Jobs

2014-08-28 Thread Matt Thompson
Open MPI List, I recently encountered an odd bug with Open MPI 1.8.1 and GCC 4.9.1 on our cluster (reported on this list), and decided to try it with 1.8.2. However, we seem to be having an issue with Open MPI 1.8.2 and SLURM. Even weirder, Open MPI 1.8.2rc4 doesn't show the bug. And the bug is: I