Re: [OMPI users] Bind multiple cores to rank - OpenMPI 1.8.1

2014-06-06 Thread Ralph Castain
Hmmm... Tetsuya is quite correct. Afraid I got distracted by the segfault (still investigating that one). Our default policy for 2 processes is to map-by core, and that would indeed fail when cpus-per-proc > 1. However, that seems like a non-intuitive requirement, so let me see if I can make
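A minimal sketch of the scenario this thread is about, assuming the 8-cores-per-rank setup Dan describes further down (the exact commands are not shown in the excerpts):

    # fails under the default map-by core policy once more than one cpu is requested per rank
    mpirun -np 2 --cpus-per-proc 8 ./hello
    # workaround discussed in this thread: map by an object large enough to hold 8 cpus
    mpirun -np 2 --map-by slot:pe=8 ./hello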

Re: [OMPI users] --preload-binary does not work

2014-06-06 Thread Ralph Castain
Looks like there is some strange interaction there, but I doubt I'll get around to fixing it soon unless someone has a burning reason to not use tree spawn when preloading binaries. I'll mark it down as something to look at as time permits. On Jun 6, 2014, at 4:28 PM, Ralph Castain

Re: [OMPI users] --preload-binary does not work

2014-06-06 Thread Ralph Castain
Don't know - I haven't seen someone use that option in a while. Is there some reason to do so? On Jun 6, 2014, at 3:44 PM, E.O. wrote: > Thank you! > With the patch, the --preload-binary option is working fine. > However, if I add "--gmca plm_rsh_no_tree_spawn 1" as an

Re: [OMPI users] --preload-binary does not work

2014-06-06 Thread E.O.
Thank you! With the patch, the --preload-binary option is working fine. However, if I add "--gmca plm_rsh_no_tree_spawn 1" as an mpirun command line option, it hangs. # /mpi/bin/mpirun --allow-run-as-root --gmca plm_rsh_no_tree_spawn 1 --preload-binary --hostfile /root/.hosts --prefix /mpi --np 120

Re: [OMPI users] OPENIB unknown transport errors

2014-06-06 Thread Tim Miller
Hi Josh, I asked one of our more advanced users to add the "-mca btl_openib_if_include mlx4_0:1" argument to his job script. Unfortunately, the same error occurred as before. We'll keep digging on our end; if you have any other suggestions, please let us know. Tim On Thu, Jun 5, 2014 at 7:32

Re: [OMPI users] --preload-binary does not work

2014-06-06 Thread Ralph Castain
Okay, I found the problem and think I have a fix that I posted (copied EO on it). You are welcome to download the patch and try it. Scheduled for release in 1.8.2 Thanks Ralph On Jun 6, 2014, at 1:01 PM, Ralph Castain wrote: > Yeah, it doesn't require ssh any more - but I

Re: [OMPI users] Bind multiple cores to rank - OpenMPI 1.8.1

2014-06-06 Thread tmishima
Hi Dan, Please try: mpirun -np 2 --map-by socket:pe=8 ./hello or mpirun -np 2 --map-by slot:pe=8 ./hello You cannot bind 8 cpus to the object "core", which has only one cpu. This limitation started with the 1.8 series. The object "socket" has 8 cores in your case, so you can do it. And, the
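To verify where the 8 cpus of each rank actually land, the standard --report-bindings option can be added to either of the commands above (an addition for illustration, not part of this message):

    mpirun -np 2 --map-by socket:pe=8 --report-bindings ./hello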

Re: [OMPI users] --preload-binary does not work

2014-06-06 Thread Ralph Castain
Yeah, it doesn't require ssh any more - but I haven't tested it in a bit, and so it's possible something crept in there. On Jun 6, 2014, at 12:27 PM, Reuti wrote: > Am 06.06.2014 um 21:04 schrieb Ralph Castain: > >> Supposed to, yes - but I don't know how much

Re: [OMPI users] Bind multiple cores to rank - OpenMPI 1.8.1

2014-06-06 Thread Ralph Castain
Okay, I'll poke into this - thanks! On Jun 6, 2014, at 12:48 PM, Dan Dietz wrote: > No problem - > > These are model name : Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz chips. > 2 per node, 8 cores each. No threading enabled. > > $ lstopo > Machine (64GB) > NUMANode L#0 (P#0

Re: [OMPI users] Bind multiple cores to rank - OpenMPI 1.8.1

2014-06-06 Thread Dan Dietz
No problem - These are model name : Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz chips. 2 per node, 8 cores each. No threading enabled. $ lstopo Machine (64GB) NUMANode L#0 (P#0 32GB) Socket L#0 + L3 L#0 (20MB) L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0 + PU L#0 (P#0)

Re: [OMPI users] --preload-binary does not work

2014-06-06 Thread Reuti
Am 06.06.2014 um 21:04 schrieb Ralph Castain: > Supposed to, yes - but I don't know how much testing it has seen. I can try > to take a look Wasn't it on the list recently, that 1.8.1 should do it even without passphraseless SSH between the nodes? -- Reuti > On Jun 6, 2014, at 12:02 PM,

Re: [OMPI users] Determining what parameters a scheduler passes to OpenMPI

2014-06-06 Thread Ralph Castain
You might want to update to 1.6.5, if you can - I'll see what I can find On Jun 6, 2014, at 12:07 PM, Sasso, John (GE Power & Water, Non-GE) wrote: > Version 1.6 (i.e. prior to 1.6.1) > > -Original Message- > From: users [mailto:users-boun...@open-mpi.org] On

Re: [OMPI users] Determining what parameters a scheduler passes to OpenMPI

2014-06-06 Thread Sasso, John (GE Power & Water, Non-GE)
Version 1.6 (i.e. prior to 1.6.1) -Original Message- From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph Castain Sent: Friday, June 06, 2014 3:03 PM To: Open MPI Users Subject: Re: [OMPI users] Determining what parameters a scheduler passes to OpenMPI It's possible that

Re: [OMPI users] --preload-binary does not work

2014-06-06 Thread Ralph Castain
Supposed to, yes - but I don't know how much testing it has seen. I can try to take a look On Jun 6, 2014, at 12:02 PM, E.O. wrote: > Hello > I am using OpenMPI ver 1.8.1 on a cluster of 4 machines. > One Redhat 6.2 and three busybox machine. They are all 64bit

Re: [OMPI users] Determining what parameters a scheduler passes to OpenMPI

2014-06-06 Thread Ralph Castain
It's possible that you are hitting a bug - not sure how much the cpus-per-proc option has been exercised in 1.6. Is this 1.6.5, or some other member of that series? I don't have a Torque machine handy any more, but should be able to test this scenario on my boxes On Jun 6, 2014, at 10:51 AM,

[OMPI users] --preload-binary does not work

2014-06-06 Thread E.O.
Hello I am using OpenMPI ver 1.8.1 on a cluster of 4 machines. One Redhat 6.2 and three busybox machines. They are all 64-bit environments. I want to use the --preload-binary option to send the binary file to the hosts but it's not working. # /mpi/bin/mpirun --prefix /mpi --preload-files ./a.out
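For reference, a sketch of the intended --preload-binary usage, pieced together from the command shown in the follow-up messages above (the original command here is cut off):

    # stage ./a.out onto the remote hosts at launch time instead of relying on a shared filesystem
    mpirun --prefix /mpi --hostfile /root/.hosts --np 120 --preload-binary ./a.out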

Re: [OMPI users] Bind multiple cores to rank - OpenMPI 1.8.1

2014-06-06 Thread Ralph Castain
Sorry to pester with questions, but I'm trying to narrow down the issue. * What kind of chips are on these machines? * If they have h/w threads, are they enabled? * You might have lstopo on one of those machines - could you pass along its output? Otherwise, you can run a simple "mpirun -n 1
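Ralph's suggested check is truncated; one way to inspect the node layout, assuming hwloc's lstopo is installed on the compute node (a sketch, not necessarily the exact command he had in mind):

    # print the socket/core/thread layout of the local node
    lstopo --of console
    # or run it once through mpirun so it executes on an allocated node
    mpirun -n 1 lstopo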

Re: [OMPI users] Determining what parameters a scheduler passes to OpenMPI

2014-06-06 Thread Sasso, John (GE Power & Water, Non-GE)
Re: $PBS_NODEFILE, we use that to create the hostfile that is passed via --hostfile (i.e. the two are the same). To further debug this, I passed "--display-allocation --display-map" to orterun, which resulted in: == ALLOCATED NODES == Data for
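The two flags John mentions go directly on the launch line; a minimal sketch (the orterun path is taken from his other message in this thread, the process count and executable are placeholders):

    /usr/local/openmpi/bin/orterun --display-allocation --display-map -n 24 ./a.out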

Re: [OMPI users] spml_ikrit_np random values

2014-06-06 Thread Mike Dubman
fixed here: https://svn.open-mpi.org/trac/ompi/changeset/31962 Thanks for the report. On Thu, Jun 5, 2014 at 7:45 PM, Mike Dubman wrote: > seems oshmem_info uses uninitialized value. > we will check it, thanks for the report. > > > On Thu, Jun 5, 2014 at 6:56 PM, Timur

Re: [OMPI users] Bind multiple cores to rank - OpenMPI 1.8.1

2014-06-06 Thread Dan Dietz
Thanks for the reply. I tried out the --display-allocation option with several different combinations and have attached the output. I see this behavior on RHEL6.4, RHEL6.5, and RHEL5.10 clusters. Here's debugging info on the segfault. Does that help? FWIW this does not seem to crash on the

Re: [OMPI users] Determining what parameters a scheduler passes to OpenMPI

2014-06-06 Thread Ralph Castain
On Jun 6, 2014, at 10:24 AM, Gus Correa wrote: > On 06/06/2014 01:05 PM, Ralph Castain wrote: >> You can always add --display-allocation to the cmd line to see what we >> thought we received. >> >> If you configure OMPI with --enable-debug, you can set --mca >>

Re: [OMPI users] Determining what parameters a scheduler passes to OpenMPI

2014-06-06 Thread Gus Correa
On 06/06/2014 01:05 PM, Ralph Castain wrote: You can always add --display-allocation to the cmd line to see what we thought we received. If you configure OMPI with --enable-debug, you can set --mca ras_base_verbose 10 to see the details. Hi John, On the Torque side, you can put a line "cat
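Gus's suggestion is cut off; presumably it is the usual trick of dumping the Torque node file from inside the job script, along the lines of (assuming the standard $PBS_NODEFILE variable, which John also references in this thread):

    # inside the Torque/PBS job script, before the mpirun/orterun line
    echo "Nodes allocated by Torque:"
    cat $PBS_NODEFILE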

Re: [OMPI users] Determining what parameters a scheduler passes to OpenMPI

2014-06-06 Thread Ralph Castain
You can always add --display-allocation to the cmd line to see what we thought we received. If you configure OMPI with --enable-debug, you can set --mca ras_base_verbose 10 to see the details On Jun 6, 2014, at 10:01 AM, Reuti wrote: > Am 06.06.2014 um 18:58
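A sketch of Ralph's two suggestions combined on one launch line (the executable is a placeholder; the ras_base_verbose output only appears if OMPI was configured with --enable-debug):

    mpirun --display-allocation --mca ras_base_verbose 10 ./a.out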

Re: [OMPI users] Determining what parameters a scheduler passes to OpenMPI

2014-06-06 Thread Reuti
Am 06.06.2014 um 18:58 schrieb Sasso, John (GE Power & Water, Non-GE): > OK, so at the least, how can I get the node and slots/node info that is > passed from PBS? > > I ask because I’m trying to troubleshoot a problem w/ PBS and the build of > OpenMPI 1.6 I noted. If I submit a 24-process

Re: [OMPI users] Determining what parameters a scheduler passes to OpenMPI

2014-06-06 Thread Sasso, John (GE Power & Water, Non-GE)
OK, so at the least, how can I get the node and slots/node info that is passed from PBS? I ask because I'm trying to troubleshoot a problem w/ PBS and the build of OpenMPI 1.6 I noted. If I submit a 24-process simple job through PBS using a script which has: /usr/local/openmpi/bin/orterun -n
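A hypothetical minimal shape of such a job script, only to make the 24-process scenario concrete; the actual resource request and the rest of the orterun line are not shown in the excerpt, so these values are illustrative:

    #PBS -l nodes=2:ppn=12
    cd $PBS_O_WORKDIR
    /usr/local/openmpi/bin/orterun -n 24 ./a.out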

Re: [OMPI users] openib segfaults with Torque

2014-06-06 Thread Ralph Castain
Fascinating - I can only assume that Torque is setting something in the environment that is creating the confusion. Sadly, Nathan is at the MPI Forum this week, so we may have to wait until Mon to get his input on the problem as he wrote the udcm code. On Jun 6, 2014, at 8:51 AM, Fischer,

Re: [OMPI users] Determining what parameters a scheduler passes to OpenMPI

2014-06-06 Thread Ralph Castain
We currently only get the node and slots/node info from PBS - we don't get any task placement info at all. We then use the mpirun cmd options and built-in mappers to map the tasks to the nodes. I suppose we could do more integration in that regard, but haven't really seen a reason to do so -

[OMPI users] Determining what parameters a scheduler passes to OpenMPI

2014-06-06 Thread Sasso, John (GE Power & Water, Non-GE)
For the PBS scheduler and using a build of OpenMPI 1.6 built against PBS include files + libs, is there a way to determine (perhaps via some debugging flags passed to mpirun) what job placement parameters are passed from the PBS scheduler to OpenMPI? In particular, I am talking about task

Re: [OMPI users] openib segfaults with Torque

2014-06-06 Thread Fischer, Greg A.
Yep, TCP works fine when launched via Torque/qsub: [binf315:fischega] $ mpirun -np 2 -mca btl tcp,sm,self ring_c Process 0 sending 10 to 1, tag 201 (2 processes in ring) Process 0 sent to 1 Process 0 decremented value: 9 Process 0 decremented value: 8 Process 0 decremented value: 7 Process 0

Re: [OMPI users] openib segfaults with Torque

2014-06-06 Thread Ralph Castain
Huh - how strange. I can't imagine what it has to do with Torque vs rsh - this is failing when the openib BTL is trying to create the connection, which comes way after the launch is complete. Are you able to run this with btl tcp,sm,self? If so, that would confirm that everything else is

Re: [OMPI users] openib segfaults with Torque

2014-06-06 Thread Fischer, Greg A.
Here are the results when logging in to the compute node via ssh and running as you suggest: [binf102:fischega] $ mpirun -np 2 -mca btl openib,sm,self ring_c Process 0 sending 10 to 1, tag 201 (2 processes in ring) Process 0 sent to 1 Process 0 decremented value: 9 Process 0 decremented value: 8

Re: [OMPI users] [warn] Epoll ADD(1) on fd 0 failed

2014-06-06 Thread Ralph Castain
Possible - honestly don't know On Jun 6, 2014, at 12:16 AM, Timur Ismagilov wrote: > Sometimes, after termination of the program, launched with the command > "sbatch ... -o myprogram.out .", no file "myprogram.out" is being > produced. Could this be due to the above

Re: [OMPI users] Problem with yoda component in oshmem.

2014-06-06 Thread Mike Dubman
Could you please provide the command line? On Fri, Jun 6, 2014 at 10:56 AM, Timur Ismagilov wrote: > Hello! > > I am using Open MPI v1.8.1 in > example program hello_oshmem.cpp. > > When I put spml_ikrit_np = 1000 (more than 4) and run task on 4 (2,1) > nodes, I get an: > in

[OMPI users] Problem with yoda component in oshmem.

2014-06-06 Thread Timur Ismagilov
Hello! I am using Open MPI v1.8.1 in the example program hello_oshmem.cpp. When I put spml_ikrit_np = 1000 (more than 4) and run the task on 4 (2,1) nodes, I get, in the out file: No available spml components were found! This means that there are no components of this type installed on your system or
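A sketch of how such a parameter is typically passed when launching an OpenSHMEM program with OMPI 1.8.x (oshrun is the oshmem launcher wrapper; the value 1000 is the one from this report). Presumably setting spml_ikrit_np above the job size makes the ikrit component decline, so the yoda component has to be selected instead, which is where the failure shows up:

    oshrun -np 4 --mca spml_ikrit_np 1000 ./hello_oshmem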

Re: [OMPI users] [warn] Epoll ADD(1) on fd 0 failed

2014-06-06 Thread Timur Ismagilov
Sometimes, after termination of the program, launched with the command "sbatch ... -o myprogram.out .", no "myprogram.out" file is produced. Could this be due to the above-mentioned problem? Thu, 5 Jun 2014 07:45:01 -0700 from Ralph Castain : >FWIW: support for