Hi Beichuan
If you are using the university cluster, chances are that /home is not
local, but on an NFS share, or perhaps Lustre (which you may have
mentioned before, I don't remember).
Maybe "df -h" will show what is local and what is not.
It works for NFS, where it prefixes the file system
with the server name, but I don't know about Lustre.
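
For what it's worth, here is a minimal sketch of a more direct check than reading "df -h" by eye. It assumes GNU coreutils "stat" is available on the nodes (its "-f -c %T" option prints the filesystem type by name). The helper names fs_type and fs_kind are made up for illustration, and the list of network filesystem types is only a guess at the common ones, not an exhaustive list.

```shell
#!/bin/sh
# Print the type of the filesystem that backs a directory.
# Assumes GNU coreutils stat; other stat implementations use different flags.
fs_type() {
    stat -f -c %T "$1"
}

# Classify a directory as "network" or "local" from its filesystem type.
# The pattern list is illustrative only; extend it for your site.
fs_kind() {
    case "$(fs_type "$1")" in
        nfs*|lustre|cifs|smb*|afs|gpfs|panfs) echo network ;;
        *) echo local ;;
    esac
}

# Report on the current TMPDIR (or /tmp if unset) and on the home directory.
for d in "${TMPDIR:-/tmp}" "${HOME:-/}"; do
    [ -d "$d" ] || continue
    printf '%s: %s (%s)\n' "$d" "$(fs_type "$d")" "$(fs_kind "$d")"
done
```

Running this inside the job script, rather than on the login node, should show what the compute node itself sees.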
Did you try not setting TMPDIR at all and letting it default?
If the default TMPDIR is on Lustre (I think you said so, but I don't
remember), you could try forcing it to /tmp:
export TMPDIR=/tmp
If the cluster nodes are diskful, /tmp is likely to exist and be
local to each node.
[But the cluster nodes may be diskless ... :( ]
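
A rough sketch of how the job script could pick a node-local TMPDIR defensively follows. /tmp comes from the suggestion above. /dev/shm is my own assumption as a RAM-backed fallback for diskless nodes and may not be appropriate at your site, and the helper name pick_tmpdir is invented for this example. Note that the check only runs on the node that executes the job script, not on every node in the job.

```shell
#!/bin/sh
# Print the first candidate directory that exists and is writable.
# Returns non-zero if none of the candidates qualifies.
pick_tmpdir() {
    for d in "$@"; do
        if [ -d "$d" ] && [ -w "$d" ]; then
            echo "$d"
            return 0
        fi
    done
    return 1
}

# /tmp first; /dev/shm is an assumed fallback, adjust for your cluster.
if chosen=$(pick_tmpdir /tmp /dev/shm); then
    export TMPDIR="$chosen"
    echo "Using TMPDIR=$TMPDIR"
else
    echo "No writable node-local directory found; leaving TMPDIR unchanged" >&2
fi
```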
I hope this helps,
Gus Correa
On 03/03/2014 07:10 PM, Beichuan Yan wrote:
How do I set TMPDIR to a local filesystem? Is /home/yanb/tmp a local filesystem?
I don't know how to tell whether a directory is on a local file system or a
network file system.
-----Original Message-----
From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Jeff Squyres
(jsquyres)
Sent: Monday, March 03, 2014 16:57
To: Open MPI Users
Subject: Re: [OMPI users] OpenMPI job initializing problem
How about setting TMPDIR to a local filesystem?
On Mar 3, 2014, at 3:43 PM, Beichuan Yan<beichuan....@colorado.edu> wrote:
I agree there are two cases for pure-MPI mode:
1. The job fails for no apparent reason.
2. The job complains that the shared-memory file is on a network file system, which can be
resolved by "export TMPDIR=/home/yanb/tmp"; /home/yanb/tmp is my local directory. The default
TMPDIR points to a Lustre directory.
There is no other output. I checked my job with "qstat -n" and found that the processes
were not actually started on the compute nodes, even though PBS Pro has "started" my job.
Beichuan
3. Then I tested pure-MPI mode: OPENMP is turned off, and each compute node runs 16 processes
(so the shared-memory transport of MPI is clearly used). Four combinations of "TMPDIR" and "TCP"
were tested:
case 1:
#export TMPDIR=/home/yanb/tmp
TCP="--mca btl_tcp_if_include 10.148.0.0/16"
mpirun $TCP -np 64 -npernode 16 -hostfile $PBS_NODEFILE ./paraEllip3d input.txt
output:
Start Prologue v2.5 Mon Mar 3 15:47:16 EST 2014
End Prologue v2.5 Mon Mar 3 15:47:16 EST 2014
-bash: line 1: 448597 Terminated /var/spool/PBS/mom_priv/jobs/602244.service12.SC
Start Epilogue v2.5 Mon Mar 3 15:50:51 EST 2014
Statistics cpupercent=0,cput=00:00:00,mem=7028kb,ncpus=128,vmem=495768kb,walltime=00:03:24
End Epilogue v2.5 Mon Mar 3 15:50:52 EST 2014
It looks like you have two general cases:
1. The job fails for no apparent reason (like above), or
2. The job complains that your TMPDIR is on a shared filesystem
Right?
I think the real issue, then, is to figure out why your jobs are failing with
no output.
Is there anything in the stderr output?
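
In case the scheduler is not returning stderr, one way to capture it yourself is sketched below. This is only an illustration: the wrapper name run_logged and the log file naming are made up here, and the mpirun line in the comment is copied from case 1.

```shell
#!/bin/sh
# Run a command, saving stdout to PREFIX.out and stderr to PREFIX.err,
# and append the command's exit status to PREFIX.err.
run_logged() {
    prefix=$1
    shift
    status=0
    "$@" >"$prefix.out" 2>"$prefix.err" || status=$?
    echo "exit status: $status" >>"$prefix.err"
    return "$status"
}

# Hypothetical use in the job script, with the command from case 1:
# run_logged "$HOME/case1" mpirun $TCP -np 64 -npernode 16 \
#     -hostfile $PBS_NODEFILE ./paraEllip3d input.txt
```

Writing the logs under a directory that is visible from the login node makes them easy to inspect after the job is killed.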
--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users