[OMPI users] Process is hanging

2014-09-21 Thread Lee-Ping Wang
Hi there, I'm running into an issue where mpirun isn't terminating when my executable has a nonzero exit status - instead it's hanging indefinitely. I'm attaching my process tree, the error message from the application, and the messages printed to stderr. Please let me know what I can do.

Re: [OMPI users] Process is hanging

2014-09-21 Thread Lee-Ping Wang
are using? On Sep 21, 2014, at 6:08 AM, Lee-Ping Wang <leep...@stanford.edu> wrote: Hi there, I'm running into an issue where mpirun isn't terminating when my executable has a nonzero exit status - instead it's hanging indefinitely. I'm attaching my process tree, the error messag

Re: [OMPI users] Process is hanging

2014-09-21 Thread Lee-Ping Wang
is hanging Just to be clear: is your program returning a non-zero status and then exiting, or is it segfaulting? On Sep 21, 2014, at 8:22 AM, Lee-Ping Wang <leep...@stanford.edu> wrote: I'm using version 1.8.1. Thanks, - Lee-Ping From: users [mailto:users-boun...@open-m

Re: [OMPI users] Process is hanging

2014-09-21 Thread Lee-Ping Wang
, at 10:02 AM, Lee-Ping Wang <leep...@stanford.edu> wrote: My program isn't segfaulting - it's returning a non-zero status and then exiting. Thanks, - Lee-Ping From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph Castain Sent: Sunday, September 21, 2014
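A quick way to separate mpirun's behavior from the application is to launch a trivial command that exits nonzero; a minimal sketch (the process count is arbitrary):

```shell
#!/bin/sh
# Minimal reproducer sketch: run a command that simply exits with
# status 1 under mpirun.  If mpirun handles nonzero exits correctly,
# this should terminate promptly and propagate a nonzero status;
# a hang here reproduces the reported problem without the application.
mpirun -np 4 /bin/sh -c 'exit 1'
echo "mpirun exit status: $?"
```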

Re: [OMPI users] Process is hanging

2014-09-22 Thread Lee-Ping Wang
in the 1.8 series, but I can't replicate it now with the nightly 1.8 tarball that is about to be released as 1.8.3. http://www.open-mpi.org/nightly/v1.8/ On Sep 21, 2014, at 12:25 PM, Lee-Ping Wang <leep...@stanford.edu> wrote: Hmm, I didn't know those were segfault reports. It could

[OMPI users] OpenMPI 1.8.3 build without BTL

2014-09-29 Thread Lee-Ping Wang
Hi there, I'm building OpenMPI 1.8.3 on a system where I explicitly don't want any of the BTL components (they tend to break my single node jobs). ./configure CC=gcc CXX=g++ F77=gfortran FC=gfortran --prefix=$QC_EXT_LIBS/openmpi --enable-static --enable-mca-no-build=btl Building gives me

Re: [OMPI users] OpenMPI 1.8.3 build without BTL

2014-09-29 Thread Lee-Ping Wang
returned 1 exit status Thanks, - Lee-Ping On Sep 29, 2014, at 5:27 AM, Lee-Ping Wang <leep...@stanford.edu> wrote: > Hi there, > > I'm building OpenMPI 1.8.3 on a system where I explicitly don't want any of > the BTL components (they tend to break my single node jobs). >

Re: [OMPI users] OpenMPI 1.8.3 build without BTL

2014-09-29 Thread Lee-Ping Wang
undefined reference to `ibv_get_device_list' >> ../../../ompi/.libs/libmpi.so: undefined reference to `ibv_get_device_name' >> ../../../ompi/.libs/libmpi.so: undefined reference to `ibv_destroy_cq' >> collect2: error: ld returned 1 exit status >> >> Thanks, >>

[OMPI users] General question about running single-node jobs.

2014-09-29 Thread Lee-Ping Wang
Hi there, My application uses MPI to run parallel jobs on a single node, so I have no need of any support for communication between nodes. However, when I use mpirun to launch my application I see strange errors such as:

Re: [OMPI users] General question about running single-node jobs.

2014-09-29 Thread Lee-Ping Wang
sn't freeing up some system resource as it should. Is there something that needs to be corrected in the code? Thanks, - Lee-Ping On Sep 29, 2014, at 5:12 PM, Lee-Ping Wang <leep...@stanford.edu> wrote: > Hi there, > > My application uses MPI to run parallel jobs on a single node, so

Re: [OMPI users] General question about running single-node jobs.

2014-09-29 Thread Lee-Ping Wang
Here's another data point that might be useful: The error message is much more rare if I run my application on 4 cores instead of 8. Thanks, - Lee-Ping On Sep 29, 2014, at 5:38 PM, Lee-Ping Wang <leep...@stanford.edu> wrote: > Sorry for my last email - I think I spoke too quickly. I

Re: [OMPI users] OpenMPI 1.8.3 build without BTL

2014-09-30 Thread Lee-Ping Wang
here we find missing library issues. It looks like >> someone has left incorrect configure logic in the system such that we always >> attempt to build Infiniband-related code, but without linking against the >> library. >> >> We'll have to try and track it down. >>
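Until the configure logic is fixed, one possible workaround is to disable verbs support explicitly at configure time; this is a sketch assuming the `--without-verbs` flag is available in this release:

```shell
# Same configure line as in the original report, plus --without-verbs
# so that no InfiniBand (ibv_*) symbols are referenced at link time.
# --without-verbs is an assumption here; check ./configure --help first.
./configure CC=gcc CXX=g++ F77=gfortran FC=gfortran \
    --prefix=$QC_EXT_LIBS/openmpi \
    --enable-static \
    --enable-mca-no-build=btl \
    --without-verbs
make -j4 && make install
```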

Re: [OMPI users] General question about running single-node jobs.

2014-09-30 Thread Lee-Ping Wang
ble that you are trying to open > statically defined ports, which means that running the job again too soon > could leave the OS thinking the socket is already busy. It takes a while for > the OS to release a socket resource. > > > On Sep 29, 2014, at 5:49 PM, Lee-Ping Wan
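One way to check the socket-release theory is to look for lingering sockets right after a run exits; a diagnostic sketch (the flags assume a Linux-style netstat):

```shell
# Count local TCP sockets still in TIME_WAIT; a large number right
# after a job exits would support the idea that the OS has not yet
# released the port the next run wants to reuse.
netstat -ant | grep -c TIME_WAIT
```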

Re: [OMPI users] General question about running single-node jobs.

2014-09-30 Thread Lee-Ping Wang
pute node - it's only subsequent executions that give the error messages. Thanks, - Lee-Ping On Sep 30, 2014, at 11:05 AM, Ralph Castain <r...@open-mpi.org> wrote: > > On Sep 30, 2014, at 10:49 AM, Lee-Ping Wang <leep...@stanford.edu> wrote: > >> Hi Ralph, &

Re: [OMPI users] General question about running single-node jobs.

2014-09-30 Thread Lee-Ping Wang
Hi Ralph, Thanks. I'll add some print statements to the code and try to figure out precisely where the failure is happening. - Lee-Ping On Sep 30, 2014, at 12:06 PM, Ralph Castain <r...@open-mpi.org> wrote: > > On Sep 30, 2014, at 11:19 AM, Lee-Ping Wang <leep...@stan

Re: [OMPI users] General question about running single-node jobs.

2014-10-02 Thread Lee-Ping Wang
with the fix. :) Thanks, - Lee-Ping From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Lee-Ping Wang Sent: Tuesday, September 30, 2014 1:15 PM To: Open MPI Users Subject: Re: [OMPI users] General question about running single-node jobs. Hi Ralph, Thanks. I'll add some

Re: [OMPI users] General question about running single-node jobs.

2014-10-02 Thread Lee-Ping Wang
clusters, it > is sometimes preferable to > write the temporary files to a disk local to the node. > QCLOCALSCR > specifies this directory. The temporary files will be copied to > QCSCRATCH > at > the end of the job, unless the job is terminated abnormally. I >
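Based on the manual excerpt above, the two scratch directories might be set up along these lines; the paths are illustrative, not from the original thread:

```shell
# Point Q-Chem's temporaries at a node-local disk (QCLOCALSCR) while
# keeping the final job files on shared scratch (QCSCRATCH), as the
# quoted manual text describes.
export QCSCRATCH=/shared/scratch/$USER
export QCLOCALSCR=/tmp/$USER/qchem_local
mkdir -p "$QCSCRATCH" "$QCLOCALSCR"
```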

[OMPI users] Error launching single-node tasks from multiple-node job.

2013-08-10 Thread Lee-Ping Wang
it always thinks it's running on a single node, regardless of the type of job I submit to the batch system? Thank you, - Lee-Ping Wang (Postdoc in Dept. of Chemistry, Stanford University) [compute-1-1.local:10910] [[42010,0],0] ORTE_ERROR_LOG: File open failure in file

Re: [OMPI users] Error launching single-node tasks from multiple-node job.

2013-08-10 Thread Lee-Ping Wang
np 16 ./my-Q-chem-executable I hope this helps, Gus Correa On Aug 10, 2013, at 1:51 PM, Lee-Ping Wang wrote: > Hi there, > > Recently, I've begun some calculations on a cluster where I submit a multiple node job to the Torque batch system, and the job executes multiple single-node par

Re: [OMPI users] Error launching single-node tasks from multiple-node job.

2013-08-10 Thread Lee-Ping Wang
-boun...@open-mpi.org] On Behalf Of Gustavo Correa Sent: Saturday, August 10, 2013 12:39 PM To: Open MPI Users Subject: Re: [OMPI users] Error launching single-node tasks from multiple-node job. Hi Lee-Ping On Aug 10, 2013, at 3:15 PM, Lee-Ping Wang wrote: > Hi Gus, > > Thank you for your

Re: [OMPI users] Error launching single-node tasks from multiple-node job.

2013-08-10 Thread Lee-Ping Wang
- From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Lee-Ping Wang Sent: Saturday, August 10, 2013 12:51 PM To: 'Open MPI Users' Subject: Re: [OMPI users] Error launching single-node tasks from multiple-node job. Hi Gus, Thank you. You gave me many helpful suggestions, which I

Re: [OMPI users] Error launching single-node tasks from multiple-node job.

2013-08-10 Thread Lee-Ping Wang
- From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Lee-Ping Wang Sent: Saturday, August 10, 2013 3:07 PM To: 'Open MPI Users' Subject: Re: [OMPI users] Error launching single-node tasks from multiple-node job. Hi Gus, I tried your suggestions. Here is the command line which executes

Re: [OMPI users] Error launching single-node tasks from multiple-node job.

2013-08-10 Thread Lee-Ping Wang
nvironment variables: echo $PBS_JOBID echo $PBS_NODEFILE ls -l $PBS_NODEFILE cat $PBS_NODEFILE cat $PBS_JOBID [this one should fail, because that is not a file, but may work if the PBS variables were messed up along the way] I hope this helps, Gus Correa On Aug 10, 2013, at 6:39 PM, Lee-Ping Wang wrote:
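The checks quoted above can be collected into one small script to run inside the batch job (the PBS_* variables are only defined there):

```shell
#!/bin/sh
# Sanity-check the Torque/PBS environment from inside a job.
echo "PBS_JOBID:    $PBS_JOBID"
echo "PBS_NODEFILE: $PBS_NODEFILE"
ls -l "$PBS_NODEFILE"   # should be a readable file
cat "$PBS_NODEFILE"     # one hostname per allocated slot
cat "$PBS_JOBID"        # expected to fail: a job ID is not a file
```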

Re: [OMPI users] Error launching single-node tasks from multiple-node job.

2013-08-10 Thread Lee-Ping Wang
Regardless, you can always deselect Torque support at runtime. Just put the following in your environment: OMPI_MCA_ras=^tm That will tell ORTE to ignore the Torque allocation module and it should then look at the machinefile. On Aug 10, 2013, at 4:18 PM, "Lee-Ping Wang" <leep.
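Putting the suggestion above together with a machinefile launch might look like this; the node file name and process count are illustrative:

```shell
# Disable the Torque (tm) allocation module so ORTE falls back to the
# machinefile instead of the batch allocation, per the advice above.
export OMPI_MCA_ras=^tm
mpirun -np 16 -machinefile ./my_nodefile ./my-Q-chem-executable
```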

Re: [OMPI users] Error launching single-node tasks from multiple-node job.

2013-08-10 Thread Lee-Ping Wang
ment: -machinefile /scratch/leeping/pbs_nodefile.$HOSTNAME This will leave the PBS_NODEFILE variable intact, and have the same net effect as your workflow. Anyway, congratulations for sorting things out and making it work! Gus Correa On Aug 10, 2013, at 7:40 PM, Lee-Ping Wang wrote: > H
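The per-host node file described above could be generated from the Torque allocation without modifying PBS_NODEFILE itself; a sketch (the /scratch path follows the thread, the grep-based filtering is an assumption):

```shell
# Keep only this host's entries from the allocation, then launch the
# single-node job against that private machine file, leaving the
# PBS_NODEFILE variable intact.
grep "$HOSTNAME" "$PBS_NODEFILE" > /scratch/leeping/pbs_nodefile.$HOSTNAME
mpirun -np 16 -machinefile /scratch/leeping/pbs_nodefile.$HOSTNAME ./my-Q-chem-executable
```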

[OMPI users] Changing directory from /tmp

2013-09-04 Thread Lee-Ping Wang
Hi there, On a few clusters I am running into an issue where a temporary directory cannot be created due to the root filesystem being full, causing mpirun to crash. Would it be possible to change the location where this directory is being created? [compute-109-4.local:12055]

Re: [OMPI users] Changing directory from /tmp

2013-09-04 Thread Lee-Ping Wang
euti On Sep 4, 2013, at 10:13 AM, Lee-Ping Wang <leep...@stanford.edu> wrote: Hi there, On a few clusters I am running into an issue where a temporary directory cannot be created due to the root filesystem being full, causing mpirun to crash. Would it be possible to change the locatio
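Two plausible ways to move the session directory off the full root filesystem, assuming the 1.x-series MCA parameter name `orte_tmpdir_base`; paths are illustrative:

```shell
# Option 1: point TMPDIR somewhere with free space (honored by Open MPI
# when choosing its session directory).
export TMPDIR=/scratch/$USER/tmp
mkdir -p "$TMPDIR"

# Option 2: set the session-directory base explicitly via MCA.
mpirun --mca orte_tmpdir_base "$TMPDIR" -np 8 ./my-executable
```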