Thanks a lot! Indeed, it was a permissions issue. I hadn't appreciated that each node has its own /tmp, and the /tmp on the node in question turned out to be read-only. That has since been fixed, and presumably everything will run smoothly now. My fingers are crossed.
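For anyone hitting the same symptom, a quick per-node check is to attempt exactly what Open MPI does at startup: create a session directory under the temp base. A minimal sketch, where TMPBASE and the directory name are illustrative stand-ins (the real session directory name also encodes user, hostname, and job info):

```shell
# Probe whether the temp base is writable by attempting the same kind
# of mkdir that produced "opal_os_dirpath_create ... mkdir failed".
TMPBASE=${TMPBASE:-/tmp}
probe="$TMPBASE/openmpi-sessions-$USER.$$"
if mkdir -p "$probe" 2>/dev/null; then
    echo "writable: $TMPBASE"
    rmdir "$probe"
else
    echo "NOT writable: $TMPBASE"
fi
```

Running this under the batch system (e.g. in a short test job) on each suspect node would have shown immediately which /tmp was read-only.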
-Brandon

On Tue, Dec 17, 2013 at 2:26 PM, Reuti <re...@staff.uni-marburg.de> wrote:
> Hi,
>
> On 17.12.2013 at 22:32, Brandon Turner wrote:
>
> > I've been struggling with this problem for a few days now and am out of
> > ideas. I am submitting a job using TORQUE on a Beowulf cluster. One step
> > involves running mpiexec, and that is where this error occurs. I've found
> > some similar queries from the past:
> >
> > http://www.open-mpi.org/community/lists/users/att-11378/attachment
> > http://www.open-mpi.org/community/lists/users/2013/09/22608.php
> > http://www.open-mpi.org/community/lists/users/2009/11/11129.php
> >
> > I'm new to using Open MPI, so much of this is very new to me. However,
> > my /tmp folder does not seem to be full as far as I can tell. I've tried
> > reassigning the temporary directory using the MCA parameter (i.e. mpiexec
> > --mca orte_tmpdir_base /home/pathA/pathB process argument1 argument2
> > argument3), but that was unsuccessful as well. Similarly, if thousands of
> > sub-directories are being created, I have no idea where those would be if
> > this is some ext3 limit issue. It's worth noting that when I submit this
> > job, it works on some occasions and not on others. I suspect it has
> > something to do with the nodes that I am assigned and some property of
> > certain nodes.
> >
> > It never had this problem until a few days ago, and now I mostly can't
> > get it to work except on a few occasions, which makes me think that it
> > is a node-specific issue. Any thoughts or suggestions would be much
> > appreciated!
>
> a) As it's not your personal /tmp but a machine-wide one, it might be full
> on this particular node.
>
> b) Or the admin changed the permissions on /tmp so that only Torque can
> create temporary directories therein, and any additional one created by a
> batch job should go to $TMPDIR, which is created and removed by Torque for
> your particular job.
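Reuti's point (b) can be followed literally in the job script: instead of a hand-picked path, pass Torque's per-job $TMPDIR as Open MPI's session-directory base. A hedged sketch of such a script, reusing the mpiexec line from the original post (the resource request and the binary/argument names are placeholders):

```shell
#!/bin/sh
#PBS -l nodes=2:ppn=8,walltime=01:00:00
# $TMPDIR is created (and later removed) by Torque for this job, so
# Open MPI's session files land somewhere the job is allowed to write,
# and get cleaned up when the job ends.
cd "$PBS_O_WORKDIR"
mpiexec --mca orte_tmpdir_base "$TMPDIR" process argument1 argument2 argument3
```

This sidesteps both of Reuti's scenarios: a full node-wide /tmp and a /tmp writable only by Torque itself.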
It might be that Open MPI is not tightly integrated into your Torque
> installation. Did you ever have the chance to peek on a node whether your
> MPI processes are kids of pbs_mom and not of any ssh connection?
>
> -- Reuti
>
> > Thanks,
> >
> > Brandon
> >
> > PS: I've copied the full error output below:
> >
> > [bc11bl08.deac.wfu.edu:31532] opal_os_dirpath_create: Error: Unable to
> > create the sub-directory
> > (/tmp/openmpi-sessions-turn...@bc11bl08.deac.wfu.edu_0) of
> > (/tmp/openmpi-sessions-turn...@bc11bl08.deac.wfu.edu_0/2243/0/7),
> > mkdir failed [1]
> > [bc11bl08.deac.wfu.edu:31532] [[2243,0],7] ORTE_ERROR_LOG: Error in
> > file ../../orte/util/session_dir.c at line 106
> > [bc11bl08.deac.wfu.edu:31532] [[2243,0],7] ORTE_ERROR_LOG: Error in
> > file ../../orte/util/session_dir.c at line 399
> > [bc11bl08.deac.wfu.edu:31532] [[2243,0],7] ORTE_ERROR_LOG: Error in
> > file ../../../../orte/mca/ess/base/ess_base_std_orted.c at line 283
> > [bc11bl08.deac.wfu.edu:31532] [[2243,0],7] ORTE_ERROR_LOG: A message
> > is attempting to be sent to a process whose contact information is
> > unknown in file ../../../../../orte/mca/rml/oob/rml_oob_send.c at line 104
> > [bc11bl08.deac.wfu.edu:31532] [[2243,0],7] could not get route to
> > [[INVALID],INVALID]
> > [bc11bl08.deac.wfu.edu:31532] [[2243,0],7] ORTE_ERROR_LOG: A message
> > is attempting to be sent to a process whose contact information is
> > unknown in file ../../orte/util/show_help.c at line 627
> > [bc11bl08.deac.wfu.edu:31532] [[2243,0],7] ORTE_ERROR_LOG: Error in
> > file ../../../../../orte/mca/ess/tm/ess_tm_module.c at line 112
> > [bc11bl08.deac.wfu.edu:31532] [[2243,0],7] ORTE_ERROR_LOG: A message
> > is attempting to be sent to a process whose contact information is
> > unknown in file ../../../../../orte/mca/rml/oob/rml_oob_send.c at line 104
> > [bc11bl08.deac.wfu.edu:31532] [[2243,0],7] could not get route to
> > [[INVALID],INVALID]
> > [bc11bl08.deac.wfu.edu:31532] [[2243,0],7] ORTE_ERROR_LOG: A message
> > is attempting to be sent to a process whose contact information is
> > unknown in file ../../orte/util/show_help.c at line 627
> > [bc11bl08.deac.wfu.edu:31532] [[2243,0],7] ORTE_ERROR_LOG: Error in
> > file ../../orte/runtime/orte_init.c at line 128
> > [bc11bl08.deac.wfu.edu:31532] [[2243,0],7] ORTE_ERROR_LOG: A message
> > is attempting to be sent to a process whose contact information is
> > unknown in file ../../../../../orte/mca/rml/oob/rml_oob_send.c at line 104
> > [bc11bl08.deac.wfu.edu:31532] [[2243,0],7] could not get route to
> > [[INVALID],INVALID]
> > [bc11bl08.deac.wfu.edu:31532] [[2243,0],7] ORTE_ERROR_LOG: A message
> > is attempting to be sent to a process whose contact information is
> > unknown in file ../../orte/util/show_help.c at line 627
> > [bc11bl08.deac.wfu.edu:31532] [[2243,0],7] ORTE_ERROR_LOG: Error in
> > file ../../orte/orted/orted_main.c at line 357
> > =>> PBS: job killed: walltime 3626 exceeded limit 3600
> > Terminated
> > mpiexec: killing job...
> >
> > _______________________________________________
> > users mailing list
> > us...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
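On Reuti's tight-integration question: one way to "peek on a node" is to walk a rank's process ancestry while the job runs. A minimal sketch (the helper name and the demo on the current shell are mine; on a compute node you would pass the PID of one of your MPI processes):

```shell
# Print the ancestry chain of a PID, one "PID COMMAND" line per
# generation. Under a tight Torque integration, an MPI rank's chain
# should pass through pbs_mom rather than sshd.
walk_ancestry() {
    pid=$1
    while [ -n "$pid" ] && [ "$pid" -gt 1 ]; do
        ps -o pid=,comm= -p "$pid"
        pid=$(ps -o ppid= -p "$pid" | tr -d ' ')
    done
}

# Demo on the current shell; on a node you would use something like:
#   walk_ancestry "$(pgrep -u "$USER" -n your_mpi_binary)"
walk_ancestry $$
```

If sshd shows up in the chain instead of pbs_mom, Open MPI was built or run without Torque's TM support, and its daemons are launched over ssh rather than through the batch system.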