Ah, I was misled by the subject.

Can you provide more information about "hangs", and your environment?
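
For a hang, stack traces from a few of the stuck ranks are usually the most 
useful data point.  A rough sketch -- the process name and PID below are just 
placeholders, and any equivalent tool works just as well:

    # on a compute node where the job appears stuck:
    pgrep IMB-MPI1        # find the PIDs of the (apparently) hung ranks
    gstack 12345          # replace 12345 with a PID from pgrep; repeat for a few ranks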

You previously cited:

- E5-2697A v4 CPUs and Mellanox ConnectX-3 FDR Infiniband
- SLURM
- Open MPI v3.0.0
- IMB-MPI1

Can you send the information listed here:

    https://www.open-mpi.org/community/help/
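
Off the top of my head, something along these lines captures most of it (the 
output filenames are just examples):

    ompi_info --all > ompi_info.out    # full Open MPI build/runtime configuration
    mpirun --version                   # confirm which mpirun is actually on your PATH
    ibv_devinfo                        # ConnectX-3 / verbs device and firmware details
    ulimit -l                          # locked-memory limit (relevant for InfiniBand)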

BTW, since you fixed the last error by growing the tmpdir size (admittedly, we 
should have a better error message here and shouldn't just segv like you were 
seeing -- I'll open a bug on that), you can probably remove "--mca btl ^vader" 
and other similar CLI options.  vader and sm were [probably?] failing because 
the memory-mapped files on that filesystem ran out of space and Open MPI didn't 
handle it well.  Meaning: in general, you don't want to turn off shared memory 
support, because it will likely always be the fastest transport for on-node 
communication.
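
Put differently: rather than disabling vader/sm, just make sure the session 
directory lands somewhere with enough space.  A hypothetical Slurm batch 
snippet (the /scratch path is only an example; use whatever large local 
filesystem you have):

    export TMPDIR=/scratch/$USER/ompi-tmp   # example path with plenty of room
    mkdir -p "$TMPDIR"
    mpirun IMB-MPI1                         # no "--mca btl ^vader" needed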

> On Nov 30, 2017, at 11:10 AM, Götz Waschk <goetz.was...@gmail.com> wrote:
> 
> Dear Jeff,
> 
> I'm using openmpi as shipped by OpenHPC, so I'll upgrade from 1.10 to
> 1.10.7 when they ship it. But it isn't 1.10 that is failing for me; it's
> openmpi 3.0.0.
> 
> Regards, Götz
> 
> On Thu, Nov 30, 2017 at 4:24 PM, Jeff Squyres (jsquyres)
> <jsquy...@cisco.com> wrote:
>> Can you upgrade to 1.10.7?  That's the last release in the v1.10 series, and 
>> has all the latest bug fixes.
>> 
>>> On Nov 30, 2017, at 9:53 AM, Götz Waschk <goetz.was...@gmail.com> wrote:
>>> 
>>> Hi everyone,
>>> 
>>> I have managed to solve the first part of this problem. It was caused
>>> by the quota on /tmp, which is where the openmpi session directory
>>> was stored. There's an XFS default quota of 100MB to prevent users from
>>> filling up /tmp. Instead of an over-quota message, the result was the
>>> openmpi crash with a bus error.
>>> 
>>> After setting TMPDIR in Slurm, I was finally able to run IMB-MPI1 with
>>> 1024 cores and openmpi 1.10.6.
>>> 
>>> But now for the new problem: with openmpi3, the same test (IMB-MPI1,
>>> 1024 cores, 32 nodes) hangs after about 30 minutes of runtime. Any
>>> idea on this?
>>> 
>>> Regards, Götz Waschk
>> 
>> 
>> --
>> Jeff Squyres
>> jsquy...@cisco.com
>> 


-- 
Jeff Squyres
jsquy...@cisco.com

_______________________________________________
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users
