All,
I am trying to debug my MPI application using good ol' printf and I am
running into an issue with Open MPI's output redirection (using
--output-filename).
The system I'm running on is an IB cluster with the home directory
mounted through NFS.
1) Sometimes I get the following error message and the application hangs:
```
$ mpirun -n 2 -N 1 --output-filename output.log ls
[n121301:25598] [[60171,0],1] ORTE_ERROR_LOG: File open failure in file /path/to/mpi/openmpi/3.0.0-intel-17.0.4/openmpi-3.0.0/orte/mca/iof/base/iof_base_setup.c at line 314
[n121301:25598] [[60171,0],1] ORTE_ERROR_LOG: File open failure in file /path/to/mpi/openmpi/3.0.0-intel-17.0.4/openmpi-3.0.0/orte/mca/iof/orted/iof_orted.c at line 184
[n121301:25598] [[60171,0],1] ORTE_ERROR_LOG: File open failure in file /path/to/mpi/openmpi/3.0.0-intel-17.0.4/openmpi-3.0.0/orte/mca/iof/base/iof_base_setup.c at line 237
[n121301:25598] [[60171,0],1] ORTE_ERROR_LOG: File open failure in file /path/to/mpi/openmpi/3.0.0-intel-17.0.4/openmpi-3.0.0/orte/mca/odls/base/odls_base_default_fns.c at line 1147
```
So far I have only seen this error when running straight out of my home
directory, not when running from a subdirectory.
When this error does not appear, all log files are written correctly.
2) If I call mpirun from within a subdirectory, I only see output
files from processes running on the same node as rank 0. I have not seen
the above error messages in this case.
Example:
```
# two procs, one per node
~/test $ mpirun -n 2 -N 1 --output-filename output.log ls
output.log
output.log
~/test $ ls output.log/*
rank.0
# two procs, single node
~/test $ mpirun -n 2 -N 2 --output-filename output.log ls
output.log
output.log
~/test $ ls output.log/*
rank.0 rank.1
```
Using Open MPI 2.1.1, I can observe a similar effect:
```
# two procs, one per node
~/test $ mpirun --output-filename output.log -n 2 -N 1 ls
~/test $ ls
output.log.1.0
# two procs, single node
~/test $ mpirun --output-filename output.log -n 2 -N 2 ls
~/test $ ls
output.log.1.0 output.log.1.1
```
Any idea why this happens and/or how to debug this?
In case this helps, the NFS mount flags are:
(rw,nosuid,nodev,relatime,vers=3,rsize=32768,wsize=32768,namlen=255,hard,pro
to=tcp,timeo=600,retrans=2,sec=sys,mountaddr=<addr>,mountvers=3,mountport=<p
ort>,mountproto=udp,local_lock=none,addr=<addr>)
I also tested the above commands with MPICH, which gives me the expected
output files for all processes on all nodes.
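In the meantime, one workaround I'm considering is a small wrapper script
that lets each rank redirect its own stdout/stderr, bypassing mpirun's
output forwarding entirely. This is just a sketch: it assumes Open MPI's
OMPI_COMM_WORLD_RANK environment variable (which Open MPI exports into
each launched process), and the script/file names are arbitrary:

```shell
#!/bin/sh
# redirect.sh (name is arbitrary): run the given command with stdout and
# stderr sent to a per-rank file in the current working directory.
# OMPI_COMM_WORLD_RANK is set by Open MPI in each launched process;
# fall back to "unknown" if it is not set.
exec "$@" > "output.log.rank.${OMPI_COMM_WORLD_RANK:-unknown}" 2>&1
```

Invoked as `mpirun -n 2 -N 1 ./redirect.sh ls`, each rank should then
write its own output.log.rank.<N> from within the rank's own process,
which would at least tell me whether the problem is in the IOF forwarding
or in the filesystem itself.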
Any help would be much appreciated!
Cheers,
Joseph