Hee Il

Yes, that has happened to me several times. Usually, the problem is
either MPI or I/O.

It might be that there is a file system problem, and one process is
trying to write to a file, but is blocked indefinitely. The other
processes then also stop making progress since they wait on
communication.

It could also be that there is an MPI problem, caused either by a bug
in the code or by an error in the system, that makes MPI hang.
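
To tell the two cases apart, it helps to check whether any process on
the node is stuck in uninterruptible sleep (state D), which usually
points at I/O rather than MPI. Here is a rough sketch, assuming Linux
with a readable /proc; the function names parse_stat and
blocked_processes are made up for illustration and are not part of
the Einstein Toolkit.

```python
import os


def parse_stat(text):
    """Return (pid, command, state) from the contents of /proc/<pid>/stat."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or parentheses, so split around the first "(" and last ")".
    left = text.index("(")
    right = text.rindex(")")
    pid = int(text[:left])
    command = text[left + 1:right]
    state = text[right + 1:].split()[0]
    return pid, command, state


def blocked_processes(proc_root="/proc"):
    """List (pid, command, wchan) for all processes currently in state D."""
    blocked = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, command, state = parse_stat(f.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited, or the stat file was unreadable
        if state != "D":
            continue
        # wchan names the kernel function the process is sleeping in,
        # when the kernel exposes it; fall back to "?" otherwise.
        try:
            with open(os.path.join(proc_root, entry, "wchan")) as f:
                wchan = f.read().strip()
        except OSError:
            wchan = "?"
        blocked.append((pid, command, wchan))
    return blocked
```

Calling blocked_processes() on each compute node and printing the
result shows which process, if any, is blocked. A process that stays in
state D across repeated calls suggests a file system problem. If
nothing is in state D and all ranks are busy, an MPI hang is the more
likely cause.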

In both cases, restarting from a checkpoint might solve the problem.
If the problem is reproducible, then it would make sense to dig deeper
to find out what's wrong, and whether there is a work-around (e.g.
changing the grid structure a bit to avoid triggering the bug).

-erik



On Fri, Apr 9, 2021 at 8:19 AM Hee Il Kim <[email protected]> wrote:
>
> Hi,
>
> This might not be an ET issue, but have you ever seen ET runs stop producing
> any output (even stdout) while the processes are still running?
>
> I have seen this issue on both new and old NVMe storage with various versions
> of OpenMPI. It happened in runs lasting more than a day.
>
> Actually, not all the processes are running. One process is in the Dl state,
> so all output has stopped. Do you have any hints on this issue? There are no
> specific limits set for the files. Other read/write tasks on the disks
> are fine.
>
> Thanks for your help in advance.
>
> Hee Il
>
> _______________________________________________
> Users mailing list
> [email protected]
> http://lists.einsteintoolkit.org/mailman/listinfo/users



-- 
Erik Schnetter <[email protected]>
http://www.perimeterinstitute.ca/personal/eschnetter/