Thanks a lot! It was indeed a permissions issue. I had not realized that
/tmp can differ from node to node, and the /tmp directory on the node in
question was read-only. That has since been changed, so presumably
everything will run smoothly now. Fingers crossed.
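
In case it helps anyone who finds this thread later: the quickest
confirmation was simply trying to write to /tmp on the suspect node. A
minimal sketch (assuming you can ssh to the node; the hostname is the one
from the error log below):

    # check that /tmp on the node is actually writable
    ssh bc11bl08.deac.wfu.edu \
        'touch /tmp/writetest.$$ && rm /tmp/writetest.$$ \
         && echo "/tmp writable" || echo "/tmp NOT writable"'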

-Brandon


On Tue, Dec 17, 2013 at 2:26 PM, Reuti <re...@staff.uni-marburg.de> wrote:

> Hi,
>
> Am 17.12.2013 um 22:32 schrieb Brandon Turner:
>
> > I've been struggling with this problem for a few days now and am out of
> ideas. I am submitting a job using TORQUE on a Beowulf cluster. One step
> involves running mpiexec, and that is where this error occurs. I've found
> some similar queries in past threads:
> >
> > http://www.open-mpi.org/community/lists/users/att-11378/attachment
> >
> > http://www.open-mpi.org/community/lists/users/2013/09/22608.php
> >
> > http://www.open-mpi.org/community/lists/users/2009/11/11129.php
> >
> > I'm new to Open MPI, so much of this is unfamiliar to me. However, my
> /tmp folder does not appear to be full as far as I can tell. I've tried
> reassigning the temporary directory using the MCA parameter (i.e. mpiexec
> --mca orte_tmpdir_base /home/pathA/pathB process argument1 argument2
> argument3), but that was unsuccessful as well. Similarly, if thousands of
> sub-directories are being created, I have no idea where they would be if
> this is some ext3 limit issue (e.g. the cap on subdirectories per
> directory). It's worth noting that when I submit this job, it works on
> some occasions and not on others. I suspect it has something to do with
> which nodes I am assigned and some property of certain nodes.
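> >
> > For reference, this is roughly how I checked /tmp (just a sketch; ext3
> > caps a directory at about 32000 subdirectories, so an entry count near
> > that would point to the limit):
> >
> >     # entry count in /tmp, plus any leftover Open MPI session dirs
> >     ls /tmp | wc -l
> >     ls -d /tmp/openmpi-sessions-* 2>/dev/null | wc -l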
> >
> > This problem never occurred until a few days ago, and now I mostly
> can't get the job to work except on a few occasions, which makes me think
> it is a node-specific issue. Any thoughts or suggestions would be much
> appreciated!
>
> a) As it's not your personal /tmp but a machine-wide one, it might be full
> on this particular node.
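>
> A quick check, for example (a sketch to run on that node):
>
>     df -h /tmp   # free space
>     df -i /tmp   # free inodes -- these can run out even with space left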
>
> b) Or the admin changed the permissions on /tmp so that only Torque can
> create temporary directories there, and anything a batch job creates
> should instead go to $TMPDIR, which Torque creates and removes for your
> particular job. It might also be that Open MPI is not tightly integrated
> with your Torque installation. Did you ever have a chance to check on a
> node whether your MPI processes are children of pbs_mom and not of any
> ssh connection?
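>
> Something like the following, run on a node while your job is active,
> would show it (a sketch; pstree -s prints a process's ancestors, and
> orted is the Open MPI per-node daemon):
>
>     # the ancestry of the Open MPI daemon should include pbs_mom, not sshd
>     pstree -s -p $(pgrep -u $USER -n orted)
>
> As a workaround you could also point Open MPI's session directory at
> Torque's per-job directory:
>
>     mpiexec --mca orte_tmpdir_base "$TMPDIR" process argument1 \
>         argument2 argument3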
>
> -- Reuti
>
>
> > Thanks,
> >
> > Brandon
> >
> > PS I've copied the full error output below:
> > [bc11bl08.deac.wfu.edu:31532] opal_os_dirpath_create: Error: Unable to
> create the sub-directory
> (/tmp/openmpi-sessions-turn...@bc11bl08.deac.wfu.edu_0) of
> (/tmp/openmpi-sessions-turn...@bc11bl08.deac.wfu.edu_0/2243/0/7), mkdir
> failed [1]
> > [bc11bl08.deac.wfu.edu:31532] [[2243,0],7] ORTE_ERROR_LOG: Error in
> file ../../orte/util/session_dir.c at line 106
> > [bc11bl08.deac.wfu.edu:31532] [[2243,0],7] ORTE_ERROR_LOG: Error in
> file ../../orte/util/session_dir.c at line 399
> > [bc11bl08.deac.wfu.edu:31532] [[2243,0],7] ORTE_ERROR_LOG: Error in
> file ../../../../orte/mca/ess/base/ess_base_std_orted.c at line 283
> > [bc11bl08.deac.wfu.edu:31532] [[2243,0],7] ORTE_ERROR_LOG: A message is
> attempting to be sent to a process whose contact information is unknown in
> file ../../../../../orte/mca/rml/oob/rml_oob_send.c at line 104
> > [bc11bl08.deac.wfu.edu:31532] [[2243,0],7] could not get route to
> [[INVALID],INVALID]
> > [bc11bl08.deac.wfu.edu:31532] [[2243,0],7] ORTE_ERROR_LOG: A message is
> attempting to be sent to a process whose contact information is unknown in
> file ../../orte/util/show_help.c at line 627
> > [bc11bl08.deac.wfu.edu:31532] [[2243,0],7] ORTE_ERROR_LOG: Error in
> file ../../../../../orte/mca/ess/tm/ess_tm_module.c at line 112
> > [bc11bl08.deac.wfu.edu:31532] [[2243,0],7] ORTE_ERROR_LOG: A message is
> attempting to be sent to a process whose contact information is unknown in
> file ../../../../../orte/mca/rml/oob/rml_oob_send.c at line 104
> > [bc11bl08.deac.wfu.edu:31532] [[2243,0],7] could not get route to
> [[INVALID],INVALID]
> > [bc11bl08.deac.wfu.edu:31532] [[2243,0],7] ORTE_ERROR_LOG: A message is
> attempting to be sent to a process whose contact information is unknown in
> file ../../orte/util/show_help.c at line 627
> > [bc11bl08.deac.wfu.edu:31532] [[2243,0],7] ORTE_ERROR_LOG: Error in
> file ../../orte/runtime/orte_init.c at line 128
> > [bc11bl08.deac.wfu.edu:31532] [[2243,0],7] ORTE_ERROR_LOG: A message is
> attempting to be sent to a process whose contact information is unknown in
> file ../../../../../orte/mca/rml/oob/rml_oob_send.c at line 104
> > [bc11bl08.deac.wfu.edu:31532] [[2243,0],7] could not get route to
> [[INVALID],INVALID]
> > [bc11bl08.deac.wfu.edu:31532] [[2243,0],7] ORTE_ERROR_LOG: A message is
> attempting to be sent to a process whose contact information is unknown in
> file ../../orte/util/show_help.c at line 627
> > [bc11bl08.deac.wfu.edu:31532] [[2243,0],7] ORTE_ERROR_LOG: Error in
> file ../../orte/orted/orted_main.c at line 357
> > =>> PBS: job killed: walltime 3626 exceeded limit 3600
> > Terminated
> > mpiexec: killing job...
> >
>