
I am trying to debug my MPI application using good ol' printf and I am running into an issue with Open MPI's output redirection (using --output-filename).

The system I'm running on is an IB cluster with the home directory mounted through NFS.

1) Sometimes I get the following error message and the application hangs:

$ mpirun -n 2 -N 1 --output-filename output.log ls
[n121301:25598] [[60171,0],1] ORTE_ERROR_LOG: File open failure in file /path/to/mpi/openmpi/3.0.0-intel-17.0.4/openmpi-3.0.0/orte/mca/iof/base/iof_base_setup.c at line 314
[n121301:25598] [[60171,0],1] ORTE_ERROR_LOG: File open failure in file /path/to/mpi/openmpi/3.0.0-intel-17.0.4/openmpi-3.0.0/orte/mca/iof/orted/iof_orted.c at line 184
[n121301:25598] [[60171,0],1] ORTE_ERROR_LOG: File open failure in file /path/to/mpi/openmpi/3.0.0-intel-17.0.4/openmpi-3.0.0/orte/mca/iof/base/iof_base_setup.c at line 237
[n121301:25598] [[60171,0],1] ORTE_ERROR_LOG: File open failure in file /path/to/mpi/openmpi/3.0.0-intel-17.0.4/openmpi-3.0.0/orte/mca/odls/base/odls_base_default_fns.c at line 1147

So far I have only seen this error when running straight out of my home directory, not when running from a subdirectory.

When this error does not appear, all log files are written correctly.

2) If I call mpirun from within a subdirectory, I only see output files from processes running on the same node as rank 0. I have not seen the above error messages in this case.


# two procs, one per node
~/test $ mpirun -n 2 -N 1 --output-filename output.log ls
~/test $ ls output.log/*
# two procs, single node
~/test $ mpirun -n 2 -N 2 --output-filename output.log ls
~/test $ ls output.log/*
rank.0  rank.1

Using Open MPI 2.1.1, I can observe a similar effect:
# two procs, one per node
~/test $ mpirun --output-filename output.log -n 2 -N 1 ls
~/test $ ls
# two procs, single node
~/test $ mpirun --output-filename output.log -n 2 -N 2 ls
~/test $ ls
output.log.1.0  output.log.1.1

Any idea why this happens and/or how to debug this?

In case this helps, the NFS mount flags are:

I also tested the above commands with MPICH, which gives me the expected output for all processes on all nodes.
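
As a possible stopgap for the printf debugging itself, each process could redirect its own output instead of relying on --output-filename. The following is only a sketch: the function name per_rank_log, the LOG_DIR variable, and the file naming are hypothetical, and the default directory assumes /tmp is node-local so that NFS is not involved. OMPI_COMM_WORLD_RANK is the environment variable Open MPI sets for each launched process.

```shell
# per_rank_log: run a command with stdout and stderr redirected to a
# per-rank, per-host log file.
# Assumptions (hypothetical, for illustration only):
#   LOG_DIR  target directory, defaults to /tmp/mpi-debug (assumed node-local)
#   file naming scheme rank.<rank>.<host>.log
per_rank_log() {
    # Set by Open MPI for each launched process; "unknown" outside mpirun.
    rank="${OMPI_COMM_WORLD_RANK:-unknown}"
    dir="${LOG_DIR:-/tmp/mpi-debug}"
    mkdir -p "$dir" || return 1
    # uname -n prints the node name, so files from different nodes cannot clash.
    "$@" > "$dir/rank.$rank.$(uname -n).log" 2>&1
}
```

Saved to a file (say per_rank_log.sh, also a hypothetical name) that is visible on all nodes, it could be used as `mpirun -n 2 -N 1 sh -c '. ./per_rank_log.sh; per_rank_log ls'`. The logs then end up on each node's local disk and have to be collected by hand, so this does not explain the behaviour above; it only sidesteps it.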

Any help would be much appreciated!

Dipl.-Inf. Joseph Schuchart
High Performance Computing Center Stuttgart (HLRS)
Nobelstr. 19
D-70569 Stuttgart

Tel.: +49(0)711-68565890
Fax: +49(0)711-6856832