All,

I am trying to debug my MPI application using good ol' printf and I am running into an issue with Open MPI's output redirection (using --output-filename).

The system I'm running on is an IB cluster with the home directory mounted through NFS.

1) Sometimes I get the following error message and the application hangs:

```
$ mpirun -n 2 -N 1 --output-filename output.log ls
[n121301:25598] [[60171,0],1] ORTE_ERROR_LOG: File open failure in file /path/to/mpi/openmpi/3.0.0-intel-17.0.4/openmpi-3.0.0/orte/mca/iof/base/iof_base_setup.c at line 314
[n121301:25598] [[60171,0],1] ORTE_ERROR_LOG: File open failure in file /path/to/mpi/openmpi/3.0.0-intel-17.0.4/openmpi-3.0.0/orte/mca/iof/orted/iof_orted.c at line 184
[n121301:25598] [[60171,0],1] ORTE_ERROR_LOG: File open failure in file /path/to/mpi/openmpi/3.0.0-intel-17.0.4/openmpi-3.0.0/orte/mca/iof/base/iof_base_setup.c at line 237
[n121301:25598] [[60171,0],1] ORTE_ERROR_LOG: File open failure in file /path/to/mpi/openmpi/3.0.0-intel-17.0.4/openmpi-3.0.0/orte/mca/odls/base/odls_base_default_fns.c at line 1147
```

So far I have only seen this error when running straight out of my home directory, not when running from a subdirectory.

When this error does not appear, all log files are written correctly.

2) If I call mpirun from within a subdirectory, I only see output files from processes running on the same node as rank 0. I have not seen the above error messages in this case.

Example:

```
# two procs, one per node
~/test $ mpirun -n 2 -N 1 --output-filename output.log ls
output.log
output.log
~/test $ ls output.log/*
rank.0
# two procs, single node
~/test $ mpirun -n 2 -N 2 --output-filename output.log ls
output.log
output.log
~/test $ ls output.log/*
rank.0  rank.1
```

Using Open MPI 2.1.1, I can observe a similar effect:
```
# two procs, one per node
~/test $ mpirun --output-filename output.log -n 2 -N 1 ls
~/test $ ls
output.log.1.0
# two procs, single node
~/test $ mpirun --output-filename output.log -n 2 -N 2 ls
~/test $ ls
output.log.1.0  output.log.1.1
```

Any idea why this happens and/or how to debug this?

In case this helps, the NFS mount flags are:
(rw,nosuid,nodev,relatime,vers=3,rsize=32768,wsize=32768,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=<addr>,mountvers=3,mountport=<port>,mountproto=udp,local_lock=none,addr=<addr>)

I also tested the above commands with MPICH, which gives me the expected output for all processes on all nodes.
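
Until the redirection issue is understood, one possible workaround is to let each process redirect its own output instead of relying on --output-filename. The following is only a minimal sketch, assuming a POSIX shell is available on the compute nodes. The function name `log_per_rank` and the file naming scheme are hypothetical; `OMPI_COMM_WORLD_RANK` is the environment variable Open MPI sets for each launched process, so this does not carry over to MPICH as-is. Since the files still land in the current working directory, it may or may not sidestep the NFS-related failure in 1).

```shell
# Hypothetical wrapper: run a command and send its stdout/stderr to a
# per-rank, per-host file in the current working directory.
# OMPI_COMM_WORLD_RANK is set by Open MPI for every launched process;
# the "unknown" fallback and the file name pattern are assumptions.
log_per_rank() {
    rank="${OMPI_COMM_WORLD_RANK:-unknown}"
    host="$(uname -n)"
    "$@" > "rank.${rank}.${host}.log" 2>&1
}

# When used as a script, forward the script's arguments to the wrapper.
if [ "$#" -gt 0 ]; then
    log_per_rank "$@"
fi
```

Saved as a script (say, a hypothetical `wrapper.sh`), this could be launched as `mpirun -n 2 -N 1 sh wrapper.sh ls`, which should leave one `rank.<rank>.<host>.log` file per process.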

Any help would be much appreciated!

Cheers,
Joseph
--
Dipl.-Inf. Joseph Schuchart
High Performance Computing Center Stuttgart (HLRS)
Nobelstr. 19
D-70569 Stuttgart

Tel.: +49(0)711-68565890
Fax: +49(0)711-6856832
E-Mail: schuch...@hlrs.de
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users
