Hello all,

We have had issues with a bug in OpenMPI's vader BTL on "slow" systems.

See:

https://bitbucket.org/einsteintoolkit/tickets/issues/2287/add-openmpi-env-vars-to-notebook-to-avoid

for the ET ticket explaining this (the slow system being the tutorial
VM) and the OpenMPI ticket here:

https://github.com/open-mpi/ompi/issues/6568
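
As a rough illustration (not taken from either ticket): Open MPI reads MCA
parameters from environment variables named OMPI_MCA_<param>, and a leading
"^" in the btl list excludes the named components. Below is a minimal Python
sketch, assuming the workaround is simply to keep vader out of the BTL list.
The helper name env_without_vader is hypothetical, and the exact variables
recommended in the tickets may differ.

```python
import os


def env_without_vader(base_env=None):
    """Return a copy of the environment that asks Open MPI to skip the vader BTL."""
    env = dict(os.environ if base_env is None else base_env)
    # Open MPI maps the environment variable OMPI_MCA_<name> to the MCA
    # parameter <name>; a leading "^" in the btl list means
    # "every component except the ones listed".
    env["OMPI_MCA_btl"] = "^vader"
    return env


if __name__ == "__main__":
    env = env_without_vader()
    print("OMPI_MCA_btl =", env["OMPI_MCA_btl"])
    # The result could then be handed to a launcher, for example
    # subprocess.run(["mpirun", "-np", "2", "./my_executable"], env=env)
    # (assumed command line; adjust to your setup).
```

Setting the variable in the calling environment rather than on the mpirun
command line means it also applies when the launch happens indirectly, e.g.
from a notebook.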

Yours,
Roland

> On Fri, Apr 9, 2021 at 12:43 PM Hee Il Kim <[email protected]> wrote:
> >
> > Thanks Erik.
> >
> >
> > On Fri, Apr 9, 2021, 23:45 Erik Schnetter <[email protected]> wrote:  
> >>
> >> Hee Il
> >>
> >> Yes, that has happened to me several times. Usually, the problem is
> >> either MPI or I/O.  
> >
> >
> > > Have you ever experienced this under UCX?
> 
> No, but I think UCX and MPI are about equivalent in this context.
> 
> -erik
> 
> >> It might be that there is a file system problem, and one process is
> >> trying to write to a file, but is blocked indefinitely. The other
> >> processes then also stop making progress since they wait on
> >> communication.
> >>
> >> It could also be that there is an MPI problem, either caused by a
> >> problem in the code, or by an error in the system, that makes MPI
> >> hang.  
> >
> >
> > > I don't think I have seen the issue when using the 'sm' BTL. At least,
> > > vader was used for all the problematic runs.
> >  
> >>
> >> In both cases, restarting from a checkpoint might solve the problem.
> >> If the problem is reproducible, then it would make sense to dig deeper
> >> to find out what's wrong, and whether there is a work-around (e.g.
> >> changing the grid structure a bit to avoid triggering the bug).
> >>
> >> -erik  
> >
> >
> > > Yes, restarting could solve the issue.
> >
> > Hee Il
> >
> >  
> >>
> >>
> >>
> >> On Fri, Apr 9, 2021 at 8:19 AM Hee Il Kim <[email protected]> wrote:  
> >> >
> >> > Hi,
> >> >
> >> > This might not be an ET issue, but have you ever seen ET runs stop
> >> > producing any output (even stdout) even though the processes are
> >> > still running?
> >> >
> >> > I have seen this issue on both new and old NVMe storage with various
> >> > versions of OpenMPI. It happened in runs of more than a day.
> >> >
> >> > Oh, not all the processes are running. One process is in Dl state, so
> >> > all output stopped. Do you have any hints on this issue? There are
> >> > no specific limits set for the files. Other read/write tasks on the
> >> > disks are OK.
> >> >
> >> > Thanks for your help in advance.
> >> >
> >> > Hee Il
> >> >
> >> > _______________________________________________
> >> > Users mailing list
> >> > [email protected]
> >> > https://urldefense.com/v3/__http://lists.einsteintoolkit.org/mailman/listinfo/users__;!!DZ3fjg!tgd_rDKIJitACUnLixHB2PND01Yf7MisM-hbW7PEzYIVhk3Rao1sdCz_tOPj3NlB$
> >> >    
> >>
> >>
> >>
> >> --
> >> Erik Schnetter <[email protected]>
> >> https://urldefense.com/v3/__http://www.perimeterinstitute.ca/personal/eschnetter/__;!!DZ3fjg!tgd_rDKIJitACUnLixHB2PND01Yf7MisM-hbW7PEzYIVhk3Rao1sdCz_tIt3aW2E$
> >>    
> 
> 
> 


-- 
My email is as private as my paper mail. I therefore support encrypting
and signing email messages. Get my PGP key from http://pgp.mit.edu .


