Re: [OMPI users] openmpi-1.7.4a1r29646 with -hostfile option under Torque manager

2013-12-20 Thread Ralph Castain
Hooray! On Dec 19, 2013, at 10:14 PM, tmish...@jcity.maeda.co.jp wrote: > > > Hi Ralph, > > Thank you for your fix. It works for me. > > Tetsuya Mishima > > >> Actually, it looks like it would happen with hetero-nodes set - only > required that at least two nodes have the same

Re: [OMPI users] openmpi-1.7.4a1r29646 with -hostfile option under Torque manager

2013-12-20 Thread tmishima
Hi Ralph, Thank you for your fix. It works for me. Tetsuya Mishima > Actually, it looks like it would happen with hetero-nodes set - only required that at least two nodes have the same architecture. So you might want to give the trunk a shot as it may well now be > fixed. > > > On Dec 19,

Re: [OMPI users] openmpi-1.7.4a1r29646 with -hostfile option under Torque manager

2013-12-19 Thread Ralph Castain
Actually, it looks like it would happen with hetero-nodes set - only required that at least two nodes have the same architecture. So you might want to give the trunk a shot as it may well now be fixed. On Dec 19, 2013, at 8:35 AM, Ralph Castain wrote: > Hmmm...not having

Re: [OMPI users] openmpi-1.7.4a1r29646 with -hostfile option under Torque manager

2013-12-19 Thread Ralph Castain
Hmmm...not having any luck tracking this down yet. If anything, based on what I saw in the code, I would have expected it to fail when hetero-nodes was false, not the other way around. I'll keep poking around - just wanted to provide an update. On Dec 19, 2013, at 12:54 AM,

Re: [OMPI users] openmpi-1.7.4a1r29646 with -hostfile option under Torque manager

2013-12-18 Thread Ralph Castain
Very strange - I can't seem to replicate it. Is there any chance that you have < 8 actual cores on node12? On Dec 18, 2013, at 4:53 PM, tmish...@jcity.maeda.co.jp wrote: > > > Hi Ralph, sorry for confusing you. > > At that time, I cut and paste the part of "cat $PBS_NODEFILE". > I guess I
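A quick way to answer that question on node12 itself (a sketch; it assumes hwloc and a Linux /proc filesystem are available on the node):

  lstopo                              # print the socket/core layout detected by hwloc
  grep -c ^processor /proc/cpuinfo    # count the hardware threads seen by the kernel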

Re: [OMPI users] openmpi-1.7.4a1r29646 with -hostfile option under Torque manager

2013-12-18 Thread tmishima
Hi Ralph, sorry for confusing you. At that time, I cut and pasted the part of "cat $PBS_NODEFILE", and I guess I missed the last line by mistake. I retried the test, and what follows is exactly what I got. [mishima@manage ~]$ qsub -I -l nodes=node11:ppn=8+node12:ppn=8

Re: [OMPI users] openmpi-1.7.4a1r29646 with -hostfile option under Torque manager

2013-12-18 Thread Ralph Castain
I removed the debug in #2 - thanks for reporting it. For #1, it actually looks to me like this is correct. If you look at your allocation, there are only 7 slots being allocated on node12, yet you have asked for 8 cpus to be assigned (2 procs with 2 cpus/proc). So the warning is in fact correct
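The slot count in question can be read straight from the Torque node file (a sketch; the per-node numbers mirror what was pasted earlier in this thread):

  # each line of $PBS_NODEFILE is one slot; count the lines per node
  sort $PBS_NODEFILE | uniq -c
  #   8 node11
  #   7 node12   <- fewer slots than the cpus requested there, hence the warning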

Re: [OMPI users] openmpi-1.7.4a1r29646 with -hostfile option under Torque manager

2013-12-18 Thread Jeff Squyres (jsquyres)
On Dec 18, 2013, at 7:04 PM, wrote: > 3) I use the PGI compiler. It cannot accept the compiler switch > "-Wno-variadic-macros", which is > included in the configure script. > > btl_usnic_CFLAGS="-Wno-variadic-macros" Yoinks. I'll fix (that
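A possible interim workaround, assuming the usnic BTL is not needed on this cluster (untested with PGI; --enable-mca-no-build is a standard Open MPI configure option):

  ./configure --enable-mca-no-build=btl-usnic ...   # skip building the usnic BTL entirely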

Re: [OMPI users] openmpi-1.7.4a1r29646 with -hostfile option under Torque manager

2013-12-18 Thread tmishima
Hi Ralph, I found that openmpi-1.7.4rc1 was already uploaded, so I'd like to report 3 issues, mainly regarding -cpus-per-proc. 1) When I use 2 nodes (node11, node12), which have 8 cores each (= 2 sockets X 4 cores/socket), it starts to produce the error again as shown below. At least,

Re: [OMPI users] openmpi-1.7.4a1r29646 with -hostfile option under Torque manager

2013-12-11 Thread tmishima
Thank you, Ralph. I just hope that it helps you improve the quality of the openmpi-1.7 series. Tetsuya Mishima > Hmmm...okay, I understand the scenario. Must be something in the algo when it only has one node, so it shouldn't be too hard to track down. > > I'm off on travel for a few days,

Re: [OMPI users] openmpi-1.7.4a1r29646 with -hostfile option under Torque manager

2013-12-10 Thread Ralph Castain
Hmmm...okay, I understand the scenario. Must be something in the algo when it only has one node, so it shouldn't be too hard to track down. I'm off on travel for a few days, but will return to this when I get back. Sorry for the delay - will try to look at this while I'm gone, but can't promise

Re: [OMPI users] openmpi-1.7.4a1r29646 with -hostfile option under Torque manager

2013-12-10 Thread tmishima
Hi Ralph, sorry for the confusion. We usually log on to "manage", which is our control node. From manage, we submit jobs or enter a remote node such as node03 via Torque interactive mode (qsub -I). At that time, instead of Torque, I just did rsh to node03 from manage and ran myprog on the node. I

Re: [OMPI users] openmpi-1.7.4a1r29646 with -hostfile option under Torque manager

2013-12-10 Thread Ralph Castain
On Dec 10, 2013, at 6:05 PM, tmish...@jcity.maeda.co.jp wrote: > > > Hi Ralph, > > I tried again with -cpus-per-proc 2 as shown below. > Here, I found that "-map-by socket:span" worked well. > > [mishima@node03 demos]$ mpirun -np 8 -report-bindings -cpus-per-proc 2 > -map-by socket:span

Re: [OMPI users] openmpi-1.7.4a1r29646 with -hostfile option under Torque manager

2013-12-10 Thread tmishima
Hi Ralph, I tried again with -cpus-per-proc 2 as shown below. Here, I found that "-map-by socket:span" worked well. [mishima@node03 demos]$ mpirun -np 8 -report-bindings -cpus-per-proc 2 -map-by socket:span myprog [node03.cluster:10879] MCW rank 2 bound to socket 1[core 8[hwt 0]], socket

Re: [OMPI users] openmpi-1.7.4a1r29646 with -hostfile option under Torque manager

2013-12-10 Thread Ralph Castain
Hmmm...that's strange. I only have 2 sockets on my system, but let me poke around a bit and see what might be happening. On Dec 10, 2013, at 4:47 PM, tmish...@jcity.maeda.co.jp wrote: > > > Hi Ralph, > > Thanks. I didn't know the meaning of "socket:span". > > But it still causes the

Re: [OMPI users] openmpi-1.7.4a1r29646 with -hostfile option under Torque manager

2013-12-10 Thread tmishima
Hi Ralph, Thanks. I didn't know the meaning of "socket:span". But it still causes the problem; it seems socket:span doesn't work. [mishima@manage demos]$ qsub -I -l nodes=node03:ppn=32 qsub: waiting for job 8265.manage.cluster to start qsub: job 8265.manage.cluster ready [mishima@node03

Re: [OMPI users] openmpi-1.7.4a1r29646 with -hostfile option under Torque manager

2013-12-10 Thread Ralph Castain
No, that is actually correct. We map to a socket until it is full, then move to the next. What you want is --map-by socket:span. On Dec 10, 2013, at 3:42 PM, tmish...@jcity.maeda.co.jp wrote: > > > Hi Ralph, > > I had time to try your patch yesterday using openmpi-1.7.4a1r29646. > > It stopped the
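Spelled out as commands, the distinction described here looks like this (a sketch; -np and myprog follow the other examples in the thread):

  # without :span, ranks fill one socket before the next one is used
  mpirun -np 8 -report-bindings --map-by socket myprog
  # with :span, ranks are spread across all sockets in the allocation first
  mpirun -np 8 -report-bindings --map-by socket:span myprog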

Re: [OMPI users] openmpi-1.7.4a1r29646 with -hostfile option under Torque manager

2013-12-10 Thread tmishima
Hi Ralph, I had time to try your patch yesterday using openmpi-1.7.4a1r29646. It stopped the error, but unfortunately "mapping by socket" itself didn't work well, as shown below: [mishima@manage demos]$ qsub -I -l nodes=1:ppn=32 qsub: waiting for job 8260.manage.cluster to start qsub: job

Re: [OMPI users] openmpi-1.7.4a1r29646 with -hostfile option under Torque manager

2013-12-08 Thread tmishima
Hi Ralph, Thank you for providing the fix. I'll check it in 1.7.4. Regards, Tetsuya Mishima > I fixed this under the trunk (was an issue regardless of RM) and have scheduled it for 1.7.4. > > Thanks! > Ralph > > On Nov 25, 2013, at 4:22 PM, tmish...@jcity.maeda.co.jp wrote: > > > > > > > Hi

Re: [OMPI users] openmpi-1.7.4a1r29646 with -hostfile option under Torque manager

2013-12-08 Thread Ralph Castain
I fixed this under the trunk (was an issue regardless of RM) and have scheduled it for 1.7.4. Thanks! Ralph On Nov 25, 2013, at 4:22 PM, tmish...@jcity.maeda.co.jp wrote: > > > Hi Ralph, > > Thank you very much for your quick response. > > I'm afraid to say that I found one more issue...

Re: [OMPI users] openmpi-1.7.4a1r29646 with -hostfile option under Torque manager

2013-11-26 Thread tmishima
Hi, Here is the output of "printenv | grep PBS". It seems that all variables are set as I expected. [mishima@manage mpi_demo]$ qsub -I -l nodes=1:ppn=32 qsub: waiting for job 8120.manage.cluster to start qsub: job 8120.manage.cluster ready [mishima@node03 ~]$ printenv | grep PBS
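The listing is cut off above; the variables Torque typically sets in such a session look like this (the names are standard Torque variables, the values are illustrative only):

  PBS_JOBID=8120.manage.cluster
  PBS_NODEFILE=/var/spool/torque/aux/8120.manage.cluster
  PBS_NUM_NODES=1
  PBS_NUM_PPN=32
  PBS_ENVIRONMENT=PBS_INTERACTIVE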

Re: [OMPI users] openmpi-1.7.4a1r29646 with -hostfile option under Torque manager

2013-11-26 Thread tmishima
Hi, I used interactive mode just because it was easy to report the behavior. I'm sure that submitting a job gives the same result; therefore, I think the environment variables are also set in that session. Anyway, I'm away from the cluster now. Regarding "$ env | grep PBS", I'll send it later.
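A batch-mode equivalent of the interactive test, for reference (a sketch; the resource request and mpirun line mirror the interactive examples in this thread):

  #!/bin/sh
  #PBS -l nodes=1:ppn=32
  cd $PBS_O_WORKDIR
  mpirun -np 8 -report-bindings -cpus-per-proc 2 -map-by socket myprog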

Re: [OMPI users] openmpi-1.7.4a1r29646 with -hostfile option under Torque manager

2013-11-26 Thread Reuti
Hi, On 26.11.2013, at 01:22, tmish...@jcity.maeda.co.jp wrote: > Thank you very much for your quick response. > > I'm afraid to say that I found one more issue... > > It's not so serious. Please check it when you have some time. > > The problem is cpus-per-proc with the -map-by option under

Re: [OMPI users] openmpi-1.7.4a1r29646 with -hostfile option under Torque manager

2013-11-25 Thread tmishima
Hi Ralph, Thank you very much for your quick response. I'm afraid to say that I found one more issue... It's not so serious. Please check it when you have some time. The problem is cpus-per-proc with the -map-by option under the Torque manager. It doesn't work as shown below. I guess you can

Re: [OMPI users] openmpi-1.7.4a1r29646 with -hostfile option under Torque manager

2013-11-24 Thread Ralph Castain
Fixed and scheduled to move to 1.7.4. Thanks again! On Nov 17, 2013, at 6:11 PM, Ralph Castain wrote: > Thanks! That's precisely where I was going to look when I had time :-) > > I'll update tomorrow. > Ralph > > > > > On Sun, Nov 17, 2013 at 7:01 PM,

Re: [OMPI users] openmpi-1.7.4a1r29646 with -hostfile option under Torque manager

2013-11-17 Thread Ralph Castain
Thanks! That's precisely where I was going to look when I had time :-) I'll update tomorrow. Ralph On Sun, Nov 17, 2013 at 7:01 PM, wrote: > > > Hi Ralph, > > This is the continuous story of "Segmentation fault in oob_tcp.c of > openmpi-1.7.4a1r29646". > > I

[OMPI users] openmpi-1.7.4a1r29646 with -hostfile option under Torque manager

2013-11-17 Thread tmishima
Hi Ralph, This is the continuation of "Segmentation fault in oob_tcp.c of openmpi-1.7.4a1r29646". I found the cause. First, I noticed that your hostfile works and mine does not. Your hostfile: cat hosts bend001 slots=12 My hostfile: cat hosts node08 node08 ...(total 8 lines) I
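Laid out as files, the two hostfiles being compared read as follows (contents taken from the description above):

  # Ralph's hostfile: one line, with the slot count given explicitly
  bend001 slots=12

  # Tetsuya's hostfile: the node name repeated once per slot (8 lines in total)
  node08
  node08
  ...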