All,
I am trying to debug my MPI application using good ol' printf and I am
running into an issue with Open MPI's output redirection (using
--output-filename).
The system I'm running on is an IB cluster with the home directory
mounted through NFS.
1) Sometimes I get the following error message and the application hangs:
```
$ mpirun -n 2 -N 1 --output-filename output.log ls
[n121301:25598] [[60171,0],1] ORTE_ERROR_LOG: File open failure in file /path/to/mpi/openmpi/3.0.0-intel-17.0.4/openmpi-3.0.0/orte/mca/iof/base/iof_base_setup.c at line 314
[n121301:25598] [[60171,0],1] ORTE_ERROR_LOG: File open failure in file /path/to/mpi/openmpi/3.0.0-intel-17.0.4/openmpi-3.0.0/orte/mca/iof/orted/iof_orted.c at line 184
[n121301:25598] [[60171,0],1] ORTE_ERROR_LOG: File open failure in file /path/to/mpi/openmpi/3.0.0-intel-17.0.4/openmpi-3.0.0/orte/mca/iof/base/iof_base_setup.c at line 237
[n121301:25598] [[60171,0],1] ORTE_ERROR_LOG: File open failure in file /path/to/mpi/openmpi/3.0.0-intel-17.0.4/openmpi-3.0.0/orte/mca/odls/base/odls_base_default_fns.c at line 1147
```
So far I have only seen this error when running straight out of my home
directory, not when running from a subdirectory.
When this error does not appear, all log files are written correctly.
2) If I call mpirun from within a subdirectory, I only see output
files from processes running on the same node as rank 0. I have not seen
the above error messages in this case.
Example:
```
# two procs, one per node
~/test $ mpirun -n 2 -N 1 --output-filename output.log ls
output.log
output.log
~/test $ ls output.log/*
rank.0
# two procs, single node
~/test $ mpirun -n 2 -N 2 --output-filename output.log ls
output.log
output.log
~/test $ ls output.log/*
rank.0 rank.1
```
Using Open MPI 2.1.1, I can observe a similar effect:
```
# two procs, one per node
~/test $ mpirun --output-filename output.log -n 2 -N 1 ls
~/test $ ls
output.log.1.0
# two procs, single node
~/test $ mpirun --output-filename output.log -n 2 -N 2 ls
~/test $ ls
output.log.1.0 output.log.1.1
```
Any idea why this happens and/or how to debug this?
In case this helps, the NFS mount flags are:
(rw,nosuid,nodev,relatime,vers=3,rsize=32768,wsize=32768,namlen=255,hard,pro
to=tcp,timeo=600,retrans=2,sec=sys,mountaddr=<addr>,mountvers=3,mountport=<p
ort>,mountproto=udp,local_lock=none,addr=<addr>)
I also tested the above commands with MPICH, which gives me the expected
output files for all processes on all nodes.
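In the meantime, one workaround I'm considering is a small wrapper script
that lets each rank redirect its own stdout/stderr, bypassing mpirun's
output forwarding entirely. This is just a sketch: it assumes Open MPI's
OMPI_COMM_WORLD_RANK environment variable (which Open MPI exports into
each launched process), and the script/file names are arbitrary:

```shell
#!/bin/sh
# redirect.sh (name is arbitrary): run the given command with stdout and
# stderr sent to a per-rank file in the current working directory.
# OMPI_COMM_WORLD_RANK is set by Open MPI in each launched process;
# fall back to "unknown" if it is not set.
exec "$@" > "output.log.rank.${OMPI_COMM_WORLD_RANK:-unknown}" 2>&1
```

Invoked as `mpirun -n 2 -N 1 ./redirect.sh ls`, each rank should then
write its own output.log.rank.<N> from within the rank's own process,
which would at least tell me whether the problem is in the IOF forwarding
or in the filesystem itself.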
Any help would be much appreciated!
Cheers,
Joseph