Re: [OMPI users] Error launching single-node tasks from multiple-node job.

2013-08-10 Thread Lee-Ping Wang
Hi Gus, I think your suggestion sounds good. I'll leave the PBS_NODEFILE intact. Thank you again for your assistance! - Lee-Ping -Original Message- From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Gustavo Correa Sent: Saturday, August 10, 2013 5:36 PM To: Open MPI Users

Re: [OMPI users] Error launching single-node tasks from multiple-node job.

2013-08-10 Thread Gustavo Correa
Hi Lee-Ping Yes, configuring --without-tm, as Ralph told you to do, will make your OpenMPI independent from Torque, although as Ralph said, even with an Open MPI configured with Torque support you can override it at runtime. I don't know what Open MPI uses the PBS_JOBID for, maybe some

Re: [OMPI users] Error launching single-node tasks from multiple-node job.

2013-08-10 Thread Lee-Ping Wang
Hi Ralph, Thank you. I didn't know that "--without-tm" was the correct configure option. I built and reinstalled OpenMPI 1.4.2, and now I no longer need to set PBS_JOBID for it to recognize the correct machine file. My current workflow is: 1) Submit a multiple-node batch job. 2) Launch a

Re: [OMPI users] Error launching single-node tasks from multiple-node job.

2013-08-10 Thread Ralph Castain
It helps if you use the correct configure option: --without-tm Regardless, you can always deselect Torque support at runtime. Just put the following in your environment: OMPI_MCA_ras=^tm That will tell ORTE to ignore the Torque allocation module and it should then look at the machinefile.
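As a minimal sketch of the runtime override described above, the variable can be exported in the job script before mpirun is called. The machinefile name and application in the commented line are hypothetical placeholders, not taken from the thread.

```shell
# Tell ORTE to skip the Torque (tm) allocation module for this shell session.
# The value is quoted so that no shell treats the leading ^ specially.
export OMPI_MCA_ras='^tm'

# Show the setting so it appears in the job log.
echo "OMPI_MCA_ras=${OMPI_MCA_ras}"

# Hypothetical launch line (file and program names are assumptions):
# mpirun -np 4 -machinefile ./my_nodefile ./my_app
```

With the variable set, mpirun should fall back to the machinefile given on the command line rather than the Torque allocation.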

Re: [OMPI users] Error launching single-node tasks from multiple-node job.

2013-08-10 Thread Lee-Ping Wang
Hi Gus, I agree that $PBS_JOBID should not point to a file in normal situations, because it is the job identifier given by the scheduler. However, ras_tm_module.c actually does search for a file named $PBS_JOBID, and that seems to be why it was failing. You can see this in the source code as

Re: [OMPI users] Error launching single-node tasks from multiple-node job.

2013-08-10 Thread Gustavo Correa
Lee-Ping Something looks amiss. PBS_JOBID contains the job name. PBS_NODEFILE contains a list (with repetitions up to the number of cores) of the nodes that Torque assigned to the job. Why things get twisted is hard to tell; it may be something in the Q-Chem scripts (could it be mixing up
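To illustrate the node file layout described above (one line per core, so node names repeat), here is a small sketch that counts the slots per node. Outside a Torque job it falls back to a sample file whose node names are made up for illustration.

```shell
# Use the scheduler's node file if it exists; otherwise build a small sample
# (hypothetical node names) so the sketch also runs outside a Torque job.
nodefile="${PBS_NODEFILE:-}"
if [ -z "$nodefile" ] || [ ! -r "$nodefile" ]; then
    nodefile="$(mktemp)"
    printf '%s\n' node01 node01 node02 node02 > "$nodefile"
fi

# Each node appears once per assigned core, so uniq -c gives slots per node.
sort "$nodefile" | uniq -c

# Number of distinct nodes in the allocation.
node_count="$(sort -u "$nodefile" | wc -l | tr -d ' ')"
echo "distinct nodes: ${node_count}"
```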

Re: [OMPI users] Error launching single-node tasks from multiple-node job.

2013-08-10 Thread Gustavo Correa
Hi Lee-Ping Is /scratch/leeping/272055.certainty.stanford.edu the actual PBS_NODEFILE provided by Torque? You could check this by adding a few lines to the Q-Chem launching scripts: ls -l $PBS_NODEFILE cat $PBS_NODEFILE This can go right before the mpiexec line. If the OpenMPI with torque
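The check suggested above could look roughly like this when placed right before the mpiexec line in the launch script. The guard around it and the `nodefile_status` variable are additions for illustration, so the script does not fail when the variable is unset.

```shell
# Diagnostic: show which node file Torque handed to this job.
nodefile="${PBS_NODEFILE:-}"
if [ -n "$nodefile" ] && [ -r "$nodefile" ]; then
    ls -l "$nodefile"
    cat "$nodefile"
    nodefile_status="readable"
else
    # Report the problem on stderr but keep going.
    echo "PBS_NODEFILE is unset or not readable: '${nodefile}'" >&2
    nodefile_status="missing"
fi
```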

Re: [OMPI users] Error launching single-node tasks from multiple-node job.

2013-08-10 Thread Lee-Ping Wang
Hi Gus, It seems the calculation is now working, or at least it didn't crash. I set the PBS_JOBID environment variable to the name of my custom node file. That is to say, I set PBS_JOBID=pbs_nodefile.compute-3-3.local. It appears that ras_tm_module.c is trying to open the file located at

Re: [OMPI users] Error launching single-node tasks from multiple-node job.

2013-08-10 Thread Lee-Ping Wang
Hi Gus, I tried your suggestions. Here is the command line which executes mpirun. I was puzzled because it still reported a file open failure, so I inserted a print statement into ras_tm_module.c and recompiled. The results are below. As you can see, it tries to open a different file

Re: [OMPI users] Error launching single-node tasks from multiple-node job.

2013-08-10 Thread Lee-Ping Wang
Hi Gus, Thank you. You gave me many helpful suggestions, which I will try out and get back to you. I will provide more specifics (e.g. how my jobs were submitted) in a future email. As for the queue policy, that is a highly political issue because the cluster is a shared resource. My usual

Re: [OMPI users] Error launching single-node tasks from multiple-node job.

2013-08-10 Thread Gustavo Correa
Hi Lee-Ping On Aug 10, 2013, at 3:15 PM, Lee-Ping Wang wrote: > Hi Gus, > > Thank you for your reply. I want to run MPI jobs inside a single node, but > due to the resource allocation policies on the clusters, I could get many > more resources if I submit multiple-node "batch jobs". Once I

Re: [OMPI users] Error launching single-node tasks from multiple-node job.

2013-08-10 Thread Gustavo Correa
... from a (probably obsolete) Q-Chem user guide found on the Web: *** " To run parallel Q-Chem using a batch scheduler such as PBS, users may have to modify the mpirun command in $QC/bin/parallel.csh depending on whether or not the MPI implementation requires the -machinefile option to be

Re: [OMPI users] Error launching single-node tasks from multiple-node job.

2013-08-10 Thread Lee-Ping Wang
Hi Gus, Thank you for your reply. I want to run MPI jobs inside a single node, but due to the resource allocation policies on the clusters, I could get many more resources if I submit multiple-node "batch jobs". Once I have a multiple-node batch job, then I can use a command like "pbsdsh" to

Re: [OMPI users] Error launching single-node tasks from multiple-node job.

2013-08-10 Thread Gustavo Correa
Hi Lee-Ping I know nothing about Q-Chem, but I was confused by these sentences: "That is to say, these tasks are intended to use OpenMPI parallelism on each node, but no parallelism across nodes. " "I do not observe this error when submitting single-node jobs." "Since my jobs are only

[OMPI users] Fault Tolerant Features in OpenMPI

2013-08-10 Thread Edson Tavares de Camargo
Hi All, I was looking for posts about fault tolerance in MPI and I found the post below: http://www.open-mpi.org/community/lists/users/2012/06/19658.php I am trying to understand all the work on failure detection present in Open MPI. So, I began with a simple application, a ring application

[OMPI users] Error launching single-node tasks from multiple-node job.

2013-08-10 Thread Lee-Ping Wang
Hi there, Recently, I've begun some calculations on a cluster where I submit a multiple node job to the Torque batch system, and the job executes multiple single-node parallel tasks. That is to say, these tasks are intended to use OpenMPI parallelism on each node, but no parallelism across