On Fri, Apr 9, 2021 at 12:43 PM Hee Il Kim <[email protected]> wrote:
>
> Thanks, Erik.
>
> On Fri, Apr 9, 2021, 23:45 Erik Schnetter <[email protected]> wrote:
>>
>> Hee Il
>>
>> Yes, that has happened to me several times. Usually, the problem is
>> either MPI or I/O.
>
> Ever experienced it under UCX?
No, but I think UCX and MPI are about equivalent in this context.

-erik

>> It might be that there is a file system problem, and one process is
>> trying to write to a file, but is blocked indefinitely. The other
>> processes then also stop making progress since they wait on
>> communication.
>>
>> It could also be that there is an MPI problem, either caused by a
>> problem in the code, or by an error in the system, that makes MPI
>> hang.
>
> I think I haven't seen the issue when I use the 'sm' btl. At least vader
> was used for all the problematic runs.
>
>> In both cases, restarting from a checkpoint might solve the problem.
>> If the problem is reproducible, then it would make sense to dig deeper
>> to find out what's wrong, and whether there is a work-around (e.g.
>> changing the grid structure a bit to avoid triggering the bug).
>>
>> -erik
>
> Yes, restarting could solve the issue.
>
> Hee Il
>
>> On Fri, Apr 9, 2021 at 8:19 AM Hee Il Kim <[email protected]> wrote:
>> >
>> > Hi,
>> >
>> > Though it might not be an issue with ET itself: have you ever seen ET
>> > runs stop producing any output (even to stdout), even though the
>> > processes are still running?
>> >
>> > I have seen this issue on both new and old NVMe storage with various
>> > versions of OpenMPI. It happened in runs lasting more than a day.
>> >
>> > Actually, not all the processes are running: one process is in the Dl
>> > state, so all output has stopped. Do you have any hints on this issue?
>> > There are no specific limits set for the files, and other read/write
>> > tasks on the same disks are fine.
>> >
>> > Thanks for your help in advance.
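Since the hangs above were only seen with the vader btl and not with 'sm', one work-around worth trying is to exclude vader explicitly. A hedged sketch, assuming OpenMPI 3.x/4.x (where the default shared-memory btl is named vader); the binary name, rank count, and parameter file below are placeholders, not from the thread:

```shell
# Exclude the vader shared-memory BTL for a single run
# (simulation binary and parameter file are placeholders):
mpirun --mca btl ^vader -np 16 ./cactus_sim my_run.par

# Or make the exclusion persistent for all of a user's runs via the
# per-user OpenMPI configuration file ~/.openmpi/mca-params.conf,
# by adding the line:
#   btl = ^vader
```

Whether this avoids the hang or merely changes its timing would itself be a useful data point for narrowing down the bug.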
>> >
>> > Hee Il
>> >
>> > _______________________________________________
>> > Users mailing list
>> > [email protected]
>> > http://lists.einsteintoolkit.org/mailman/listinfo/users
>>
>> --
>> Erik Schnetter <[email protected]>
>> http://www.perimeterinstitute.ca/personal/eschnetter/
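A process stuck in "D" (uninterruptible sleep, reported by ps as e.g. "Dl") usually points at a blocked I/O or driver call, which fits the file-system hypothesis above. A rough sketch of how one might confirm this on a Linux node, assuming the PID 12345 is a placeholder and that reading /proc/<pid>/stack typically requires root:

```shell
# List processes in uninterruptible sleep ("D" state),
# keeping the ps header line for readability:
ps -eo pid,stat,wchan:32,cmd | awk 'NR==1 || $2 ~ /^D/'

# For a stuck PID, show the kernel function it is blocked in
# (usually requires root):
sudo cat /proc/12345/stack

# If a rank is instead hung in user space (e.g. inside MPI), attach gdb
# and dump backtraces for all of its threads:
gdb -p 12345 -batch -ex 'thread apply all bt'
```

If the wchan or kernel stack shows a file-system or block-device function, the I/O path is the culprit; if the gdb backtraces end inside MPI progress routines on every rank, the hang is on the communication side.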
