Re: [OMPI users] IMB-MPI broadcast test stalls for large core counts: debug ideas?
On Wed, Aug 25, 2010 at 12:14 PM, Jeff Squyres wrote:
> It would simplify testing if you could get all the eth0's to be of one type
> and on the same subnet, and the same for eth1.
>
> Once you do that, try using just one of the networks by telling OMPI to use
> only one of the devices, something like this:
>
> mpirun --mca btl_tcp_if_include eth0 ...

Thanks for all the suggestions, guys! We finally got this figured out. It was the result of two different (hardware-specific) bugs in the RDMA driver: the 10GigE card was advertising a wrong size for the CQ stack (as far as I understand it!). In case anyone wants to know more, the bugfixes are posted here:

http://www.mail-archive.com/linux-rdma@vger.kernel.org/msg05451.html
http://www.mail-archive.com/linux-rdma@vger.kernel.org/msg05246.html

Cheers!

-- Rahul
Re: [OMPI users] IMB-MPI broadcast test stalls for large core counts: debug ideas?
On Wed, Aug 25, 2010 at 12:14 PM, Jeff Squyres wrote:
> Once you do that, try using just one of the networks by telling OMPI to use
> only one of the devices, something like this:
>
> mpirun --mca btl_tcp_if_include eth0 ...

Thanks Jeff! Just tried the exact test that you suggested:

[rpnabar@eu001 ~]$ NP=64;time mpirun -np $NP --host eu001,eu003,eu004,eu005,eu006,eu007,eu008,eu012 --mca btl_tcp_if_include eth0 -mca btl openib,sm,self /opt/src/mpitests/imb/src/IMB-MPI1 -npmin $NP gather

Still the same problem. The NP64 gather stalls at 4096 bytes for about 7 minutes and then completes with a step-change increase in times. All 10GigE's are eth0 now and all on the 192.168.x.x subnet. The 7-minute stall time seems very reproducible each time around.

Once the test stalled I ran a padb stack trace from the master node. Posted here:

[rpnabar@eu001 root]$ /opt/sbin/bin/padb --all --stack-trace --tree --config-option rmgr=orte
http://dl.dropbox.com/u/118481/padb_Aug26_gather_NP64.txt

I ran top for the most CPU-intensive processes during the stall and they all seem to be the IMB-MPI1 ones. Memory usage seems minimal. (Each node has 16 Gigs of RAM.)

http://dl.dropbox.com/u/118481/top_Aug26.txt

Interestingly, the NP56 test runs just fine and finishes in less than a minute. It's only at NP64 that I hit this roadblock. On the other hand, even for the NP56 test there is almost a 10x degradation in transmit times going from 4096 to 8192 bytes.

Any other debug options or suggestions are most welcome!

# /opt/src/mpitests/imb/src/IMB-MPI1 -npmin 64 gather
# Minimum message length in bytes:   0
# Maximum message length in bytes:   4194304
#
# MPI_Datatype                   :   MPI_BYTE
# MPI_Datatype for reductions    :   MPI_FLOAT
# MPI_Op                         :   MPI_SUM
#
# List of Benchmarks to run:
# Gather

# Benchmarking Gather
# #processes = 64
#
   #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
        0         1000         0.02         0.03         0.02
        1         1000        84.25        84.55        84.40
        2         1000        84.16        84.45        84.31
        4         1000        84.48        84.78        84.64
        8         1000        84.58        84.92        84.77
       16         1000        86.51        86.79        86.66
       32         1000        88.60        88.93        88.78
       64         1000        90.88        91.22        91.06
      128         1000        92.44        92.76        92.60
      256         1000        95.79        96.14        95.98
      512         1000       104.90       105.25       105.07
     1024         1000       118.01       118.40       118.19
     2048         1000       154.42       154.94       154.67
     4096         1000       292.15       292.95       292.52
     8192           13      1436.77      1667.15      1581.73
    16384           13      1733.38      2004.77      1903.27
    32768           13      2082.55      2403.24      2282.68
    65536           13      3106.37      3546.15      3384.07
   131072           13      7812.54      9011.62      8572.76
   262144           13     10773.70     12358.30     11782.77
   524288           13     19377.23     22315.85     21238.98
  1048576           13     38661.61     44293.92     42280.09
  2097152           13    120665.00    140697.08    136576.54
  4194304           10    475155.12    567579.08    536037.92

# All processes entering MPI_Finalize

real    7m31.039s
user    58m58.321s
sys     0m21.633s

NP56 test:

# Benchmarking Gather
# #processes = 56
#
   #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
        0         1000         0.02         0.09         0.03
        1         1000        74.23        74.53        74.35
        2         1000        73.87        74.15        74.02
        4         1000        73.59        73.86        73.72
        8         1000        74.15        74.40        74.27
       16         1000        76.18        76.45        76.30
       32         1000        77.82        78.10        77.95
       64         1000        79.85        80.16        80.00
      128         1000        81.67        82.01        81.84
      256         1000        86.07        86.41        86.27
      512         1000        94.91        95.23        95.07
     1024          843        33.45        35.13        34.38
     2048          843       218.82       241.49       230.18
     4096          843       130.76
Re: [OMPI users] IMB-MPI broadcast test stalls for large core counts: debug ideas?
On Wed, Aug 25, 2010 at 6:41 AM, John Hearns wrote:
> You could sort that out with udev rules on each machine.

Sure. I'd always wanted consistent names for the eth interfaces when I set up the cluster, but I couldn't get udev to co-operate. Maybe this time! Let me try.

> Look in the directory /etc/udev/rules.d for the file
> NN-net_persistent_names.rules
> you'll need a script which looks for the HWaddr (MAC) address matching
> the 10gig cards
> and edit the SUBSYSTEM line for that interface.

I don't have the particular file you mention. I do have the following files:

05-udev-early.rules  51-hotplug.rules  60-raw.rules         90-hal.rules          bluetooth.rules
40-multipath.rules   60-net.rules      85-pcscd_ccid.rules  90-ib.rules
50-udev.rules        60-pcmcia.rules   90-dm.rules          95-pam-console.rules

Not sure how to proceed with udev, but maybe this is OT for this list.

-- Rahul
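[A minimal sketch of the kind of persistent-naming rule John describes, assuming one rule line per interface in a file under /etc/udev/rules.d (the filename, MAC addresses, and the ATTR vs. SYSFS key depend on the udev version and are illustrative, not taken from this thread):

# pin the 10GigE Chelsio port (hypothetical MAC) to eth0 on every boot
SUBSYSTEM=="net", ACTION=="add", ATTR{address}=="00:07:43:aa:bb:cc", NAME="eth0"
# pin the 1GigE port (hypothetical MAC) to eth1
SUBSYSTEM=="net", ACTION=="add", ATTR{address}=="00:1e:c9:dd:ee:ff", NAME="eth1"

Older udev releases use SYSFS{address} instead of ATTR{address}; each node needs its own MAC addresses filled in.]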
Re: [OMPI users] IMB-MPI broadcast test stalls for large core counts: debug ideas?
On Thu, Aug 19, 2010 at 9:03 PM, Rahul Nabar <rpna...@gmail.com> wrote:
> --
> gather:
> NP256 hangs
> NP128 hangs
> NP64 hangs
> NP32 OK
>
> Note: "gather" always hangs at the following line of the test:
> #bytes #repetitions t_min[usec] t_max[usec] t_avg[usec]
> [snip]
> 4096 1000 525.80 527.69 526.79
> --

What I thought was a permanent "hang" for the NP64 "gather" test was, in fact, an exceedingly long stall. After waiting for more than 7 minutes the test runs forward to completion. What is surprising is the _huge_ jump in times from the 4096- to 8192-byte packet size: a step change from 275 to 1380 usecs. Any ideas what could cause this, and whether it could be related to the other "hangs" I am seeing? We are using jumbo frames with an MTU of 9000, so that was one thought I had for this transition.

On the other hand, this doesn't seem to be the case with the "hang" for the NP256 bcast test. That one stayed hung for more than an hour, at which point I killed it.

Just to make sure this isn't some quirk or buggy implementation in the Intel IMB test suite, are there any alternative testing suites that I could run on my cluster? I was a bit iffy about the Intel IMB suite because I have found no active forums or mailing lists that focus on it, so I can't really get in touch with any users or developers who might have insight into how these benchmarks run.

7m22.972s

# /opt/src/mpitests/imb/src/IMB-MPI1 -npmin 64 gather
# Minimum message length in bytes:   0
# Maximum message length in bytes:   4194304
#
# MPI_Datatype                   :   MPI_BYTE
# MPI_Datatype for reductions    :   MPI_FLOAT
# MPI_Op                         :   MPI_SUM
#
# List of Benchmarks to run:
# Gather

# Benchmarking Gather
# #processes = 64
#
   #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
        0         1000         0.02         0.03         0.02
        1         1000        68.72        68.95        68.84
        2         1000        69.16        69.39        69.28
        4         1000        68.85        69.08        68.97
        8         1000        69.02        69.25        69.14
       16         1000        70.29        70.51        70.40
       32         1000        72.14        72.38        72.27
       64         1000        70.99        71.24        71.12
      128         1000        72.59        72.84        72.72
      256         1000        76.00        76.26        76.14
      512         1000        84.92        85.21        85.06
     1024         1000       101.69       102.01       101.84
     2048         1000       146.94       147.41       147.18
     4096         1000       275.61       276.45       276.04
     8192           13      1380.54      1607.84      1522.64
    16384           13      1497.09      1749.46      1656.61
    32768           13      2055.61      2380.37      2259.50
    65536           13      4553.46      5002.70      4837.14
   131072           13      7720.76      8926.69      8483.07
   262144           13     10423.99     12027.23     11440.07
   524288           13     19456.94     22369.62     21317.78
  1048576           13     38228.53     43892.99     41880.94
  2097152           13     99705.55    119614.62    115667.49
  4194304           10    425823.38    496396.78    468326.45
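[One quick sanity check on the jumbo-frames hypothesis, sketched here assuming the 10GigE interface is eth0 and using a hypothetical peer hostname:

# confirm the configured MTU on the 10GigE interface
ip link show eth0 | grep mtu
# send a full 9000-byte frame (8972 bytes of payload + 28 bytes of ICMP/IP headers)
# with the don't-fragment bit set; it should pass end to end through the switch
ping -M do -s 8972 -c 3 eu003

If the large ping fails while a normal ping works, some hop in the path is not actually passing jumbo frames.]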
Re: [OMPI users] IMB-MPI broadcast test stalls for large core counts: debug ideas?
On Mon, Aug 23, 2010 at 9:43 PM, Richard Treumann wrote:
> Bugs are always a possibility but unless there is something very unusual
> about the cluster and interconnect or this is an unstable version of MPI, it
> seems very unlikely this use of MPI_Bcast with so few tasks and only a 1/2
> MB message would trip on one. 80 tasks is a very small number in modern
> parallel computing. Thousands of tasks involved in an MPI collective has
> become pretty standard.

Here's something absolutely strange that I accidentally stumbled upon: I ran the test again but forgot to kill the user jobs already running on the test servers (via Torque and our usual queues). I was about to kick myself, but I couldn't believe that the test actually completes! I mean, the timings are horribly bad, but the test (for the first time) runs to completion. How could this be happening? It doesn't make sense to me that the test completes when the cards+servers+network are loaded but not otherwise! But I repeated the experiment many times and still got the same result.

# /opt/src/mpitests/imb/src/IMB-MPI1 -npmin 256 bcast
[snip]
# Bcast
   #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
        0         1000         0.02         0.02         0.02
        1           34    546807.94    626743.09    565196.07
        2           34     37159.11     52942.09     44910.73
        4           34     19777.97     40382.53     29656.53
        8           34     36060.21     53265.27     43909.68
       16           34     11765.59     31912.50     19611.75
       32           34     23530.79     41176.94     32532.89
       64           34     11735.91     23529.02     16552.16
      128           34     47998.44     59323.76     55164.14
      256           34     18121.96     30500.15     25528.95
      512           34     20072.76     33787.32     26786.55
     1024           34     39737.29     55589.97     45704.99
     2048            9     77787.56    150555.66    118741.83
     4096            9         4.67    118331.78     77201.40
     8192            9     80835.66        16.56    133781.08
    16384            9     77032.88    149890.66    119558.73
    32768            9    111819.45        18.99    149048.91
    65536            9    159304.67        98.99    195071.34
   131072            9    172941.13    262216.57    218351.14
   262144            9    161371.65    266703.79    223514.31
   524288            2       497.46   4402568.94   2183980.20
  1048576            2      5401.49   3519284.01   1947754.45
  2097152            2     75251.10   4137861.49   2220910.50
  4194304            2     33270.48   4601072.91   2173905.32

# All processes entering MPI_Finalize

Another observation is that if I replace the openib BTL with the tcp BTL, the tests run OK.

-- Rahul
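[For reference, the two invocations being compared are along these lines; this is a sketch based on the commands shown earlier in the thread, with the 32-host list abbreviated:

# stalls at large message sizes / large NP when run over the iWARP cards
mpirun -np 256 --host eu001,...,eu032 -mca btl openib,sm,self \
    /opt/src/mpitests/imb/src/IMB-MPI1 -npmin 256 bcast

# the same test forced onto plain TCP runs to completion
mpirun -np 256 --host eu001,...,eu032 -mca btl tcp,sm,self \
    /opt/src/mpitests/imb/src/IMB-MPI1 -npmin 256 bcast

The only change is the BTL list, which points the finger at the RDMA path rather than the collective algorithm itself.]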
Re: [OMPI users] IMB-MPI broadcast test stalls for large core counts: debug ideas?
On Mon, Aug 23, 2010 at 8:39 PM, Randolph Pullen wrote:
> I have had a similar load related problem with Bcast.

Thanks Randolph! That's interesting to know! What was the hardware you were using? Does your bcast fail at the exact same point too?

> I don't know what caused it though. With this one, what about the
> possibility of a buffer overrun or network saturation?

How can I test for a buffer overrun? For network saturation I guess I could use something like mrtg to monitor the bandwidth used. On the other hand, all 32 servers are connected to a single dedicated Nexus 5000 and the backplane carries no other traffic, so I am skeptical that just 41,943,040 bytes saturated what Cisco rates as a 10GigE fabric. But I might be wrong.

-- Rahul
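[A quick, low-tech way to look for saturation or drops on the hosts themselves, sketched here assuming the 10GigE interface is eth0 (exact counter names vary by driver):

# per-interface error/drop/overrun counters
ifconfig eth0 | grep -iE 'errors|dropped|overruns'
# NIC-level statistics straight from the driver
ethtool -S eth0 | grep -iE 'drop|discard|pause'
# kernel-wide TCP retransmission counters
netstat -s | grep -i retrans

Sampling these before and after a stalled run shows whether packets are being lost at the host, which a switch-level tool like mrtg would miss.]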
Re: [OMPI users] IMB-MPI broadcast test stalls for large core counts: debug ideas?
On Mon, Aug 23, 2010 at 6:39 PM, Richard Treumann wrote:
> It is hard to imagine how a total data load of 41,943,040 bytes could be a
> problem. That is really not much data. By the time the BCAST is done, each
> task (except root) will have received a single half meg message from one
> sender. That is not much.

Thanks very much for your comments Dick! I'm somewhat new to MPI, so I appreciate all the advice I can get. My main roadblock is that I'm not sure how to attack this problem further. How can I obtain more diagnostic output to help me trace the origin of this "broadcast stall"? So far I've obtained a stack trace via padb (http://dl.dropbox.com/u/118481/padb.log.new.new.txt), but that is about all. Any suggestions as to what else I could try? Would a full dump by something like tcpdump or wireshark of the packets passing over the network be of any relevance? Or is there something useful to be learned from the switch side?

The technology is fairly new for HPC (Chelsio 10GigE adapters + Cisco Nexus 5000 switches), so I wouldn't rule out some strange hardware or firmware bug that's tickled by this particular suite of tests. I'm grasping at straws here. [On the other hand, I'm fairly new at this, so I wouldn't rule out some silly setting on my part either.]

-- Rahul
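[If a packet capture does turn out to be worth trying, a minimal sketch would be something like the following, assuming the 10GigE interface is eth0 and picking one hypothetical peer. Note that traffic carried by the iWARP/RDMA offload on the Chelsio card bypasses the kernel stack, so it may not appear in the capture at all:

# capture headers only (first 128 bytes per packet) to keep the file small,
# limited to traffic to/from one peer node during the stall
tcpdump -i eth0 -s 128 -w /tmp/bcast_stall.pcap host eu003
# inspect later with "tcpdump -r /tmp/bcast_stall.pcap" or wireshark
]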
[OMPI users] IMB-MPI broadcast test stalls for large core counts: debug ideas?
My Intel IMB-MPI tests stall, but only in very specific cases: larger packet sizes + large core counts. It only happens for the bcast, gather and exchange tests, and only for the larger core counts (~256 cores). Other tests like pingpong and sendrecv run fine even with larger core counts.

e.g. This bcast test hangs consistently at the 524288-byte packet size when invoked on 256 cores. The same test runs fine on 128 cores.

NP=256;mpirun -np $NP --host [32_HOSTS_8_core_each] -mca btl openib,sm,self /mpitests/imb/src/IMB-MPI1 -npmin $NP bcast

   #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
        0         1000         0.02         0.02         0.02
        1          130        26.94        27.59        27.25
        2          130        26.44        27.09        26.77
        4          130        75.98        81.07        76.75
        8          130        28.41        29.06        28.74
       16          130        28.70        29.39        29.03
       32          130        28.48        29.15        28.85
       64          130        30.10        30.86        30.48
      128          130        31.62        32.41        32.01
      256          130        31.08        31.72        31.42
      512          130        31.79        32.58        32.13
     1024          130        33.22        34.06        33.65
     2048          130        66.21        67.61        67.21
     4096          130        79.14        80.86        80.37
     8192          130       103.38       105.21       104.70
    16384          130       160.82       163.67       162.97
    32768          130       516.11       541.75       533.46
    65536          130      1044.09      1063.63      1052.88
   131072          130      1740.09      1750.12      1746.78
   262144          130      3587.23      3598.52      3594.52
   524288           80      4000.99      6669.65      5737.78

It stalls for at least 5 minutes at this point, when I killed the test.

I did more extensive testing for various combinations of test type and core count (see below). I know exactly when the tests fail, but I still cannot see a trend in this data. Any pointers or further debug ideas? I do have padb installed and have collected core dumps if that is going to help. One example below:

http://dl.dropbox.com/u/118481/padb.log.new.new.txt

System details:
Intel Nehalem 2.2 GHz
10GigE Chelsio cards and Cisco Nexus switch, using the OFED drivers
CentOS 5.4
Open MPI: 1.4.1 / Open RTE: 1.4.1 / OPAL: 1.4.1

--
bcast:
NP256 hangs
NP128 OK

Note: "bcast" mostly hangs at:
#bytes #repetitions t_min[usec] t_max[usec] t_avg[usec]
524288 80 2682.61 4408.94 3880.68
--
sendrecv:
NP256 OK
--
gather:
NP256 hangs
NP128 hangs
NP64 hangs
NP32 OK

Note: "gather" always hangs at the following line of the test:
#bytes #repetitions t_min[usec] t_max[usec] t_avg[usec]
[snip]
4096 1000 525.80 527.69 526.79
--
exchange:
NP256 hangs
NP128 OK

Note: "exchange" always hangs at:
#bytes #repetitions t_min[usec] t_max[usec] t_avg[usec] Mbytes/sec
8192 1000 109.65 110.79 110.37 282.08
--

Note: I kept the --host string the same (all 32 servers) and just changed the NPMIN, just in case this matters for how the procs are mapped out.
[OMPI users] MPI broadcast test fails only when I run within a torque job
I'm not sure if this is a Torque issue or an MPI issue. If I log in to a compute node and run the standard MPI broadcast test it returns no error, but if I run it through PBS/Torque I get an error (see below). The nodes that return the error are fairly random; even the same set of nodes will run a test once and then fail the next time. In case it matters, these nodes have dual interfaces: 1GigE and 10GigE. All tests were run on the same group of 32 nodes.

If I log in to a node (just as a regular user, not as root) then the test runs fine. No errors at all. Is there a timeout somewhere? Or some such issue? I'm not at all sure why this is happening.

Things I've verified: ulimit seems OK. I explicitly set the ulimit within the pbs init script as well as in the ssh script that spawns it:

[root@eu013 ~]# grep ulimit /etc/init.d/pbs
ulimit -l unlimited
[root@eu013 ~]# grep ulimit /etc/init.d/sshd
ulimit -l unlimited
ssh eu013 ulimit -l unlimited

Even if I put a "ulimit -l" in a PBS job it does return unlimited.

"cat /sys/class/infiniband/cxgb3_0/proto_stats/tcpRetransSegs" returns a zero on all the nodes concerned. Even ifconfig does not report any error packets.

-- Rahul

#3 PBS command:
mpirun -mca btl openib,sm,self -mca orte_base_help_aggregate 0 /opt/src/mpitests/imb/src/IMB-MPI1 bcast

-through PBS-
The RDMA CM returned an event error while attempting to make a connection.
This type of error usually indicates a network configuration error.

  Local host:   eu013
  Local device: cxgb3_0
  Error name:   RDMA_CM_EVENT_UNREACHABLE
  Peer:         eu010

Your MPI job will now abort, sorry.
-

### Run physically from a compute node
mpirun -host eu001,eu002,eu003,eu004,eu005,eu006,eu007,eu008,eu009,eu010,eu011,eu012,eu013,eu014,eu015,eu016,eu017,eu018,eu019,eu010,eu011,eu012,eu013,eu014,eu015,eu016,eu017,eu018,eu019,eu020,eu021,eu022,eu023,eu024,eu025,eu026,eu027,eu028,eu029,eu030,eu031,eu032 -mca btl openib,sm,self -mca orte_base_help_aggregate 0 /opt/src/mpitests/imb/src/IMB-MPI1 bcast

# Benchmarking Bcast
# #processes = 42
#
   #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
        0         1000         0.02         0.03         0.02
        1         1000       170.70       170.76       170.74
        2         1000       171.04       171.10       171.08
        4         1000       171.09       171.15       171.13
        8         1000       171.05       171.13       171.10
       16         1000       171.03       171.10       171.07
       32         1000        31.93        32.00        31.98
       64         1000        28.86        29.02        28.99
      128         1000        29.34        29.40        29.38
      256         1000        29.90        29.98        29.95
      512         1000        30.39        30.47        30.44
     1024         1000        31.59        31.67        31.64
     2048         1000        38.15        38.26        38.23
     4096         1000       187.59       187.75       187.68
     8192         1000       208.26       208.41       208.37
    16384         1000       395.47       395.71       395.61
    32768         1000      9360.99      9441.36      9416.47
    65536          400     10522.09     11003.08     10781.73
   131072          299     16971.71     17647.29     17329.27
   262144          160     15404.01     17131.36     15816.46
   524288           80      2659.56      4258.90      3002.04
  1048576           40      4305.72      5305.33      5219.00
  2097152           20      2472.34     10711.80      8599.28
  4194304           10      6275.51     20791.20     13687.10

# All processes entering MPI_Finalize
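[For the record, a minimal sketch of the kind of check that can be run from inside a Torque job itself, so the limits and counters the MPI processes actually inherit are the ones being inspected (script name and resource request are hypothetical):

#!/bin/bash
#PBS -l nodes=2:ppn=8
# the locked-memory limit the RDMA code will see under pbs_mom
ulimit -l
# the iWARP retransmit counter on the Chelsio card
cat /sys/class/infiniband/cxgb3_0/proto_stats/tcpRetransSegs
# then run the benchmark exactly as before
mpirun -mca btl openib,sm,self /opt/src/mpitests/imb/src/IMB-MPI1 bcast

If "ulimit -l" prints a small number here but "unlimited" in an ssh login shell, the limit is being set in the wrong place for the Torque-launched daemons.]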
[OMPI users] subnet specification for MPI when multiple networks are present
I have compute nodes with twin eth interfaces: 1GigE and 10GigE. In the Open MPI docs I found the instruction: "It is therefore very important that if active ports on the same host are on physically separate fabrics, they must have different subnet IDs."

Is this the same "subnet" that is set via ifconfig (e.g. the 192.168.x.x and 10.0.x.x addressing I have for my 10Gig and 1Gig networks)? Or is this a different usage of the term "subnet"? The reason I am confused is that the docs subsequently discuss setting the right subnets by using the subnet managers "opensm" or the "Cisco High Performance Subnet Manager". I don't seem to have either of these on my system, but I have set the subnets via the usual eth and ifcfg framework. Is that sufficient?

[I am using Chelsio 10GigE cards with the OpenIB framework]

-- Rahul
Re: [OMPI users] MPI daemon error
On Sat, May 29, 2010 at 8:19 AM, Ralph Castain wrote:
> From your other note, it sounds like #3 might be the problem here. Do you
> have some nodes that are configured with "eth0" pointing to your 10.x
> network, and other nodes with "eth0" pointing to your 192.x network? I have
> found that having interfaces that share a name but are on different IP
> addresses sometimes causes OMPI to mis-connect.
>
> If you randomly got some of those nodes in your allocation, that might
> explain why your jobs sometimes work and sometimes don't.

That is exactly true. On some nodes eth0 is 1Gig and on others 10Gig, and vice versa. Is that going to be a problem, and is there a workaround? I mean, 192.168 is always the 10Gig and 10.0 is always the 1Gig, but the correspondence with eth0 vs. eth1 is not consistent. I'd have liked it to be, but couldn't figure out a way to guarantee the order of the eth interfaces.

-- Rahul
[OMPI users] which eth interface does mpi use by default when torque supplies it with a hostfile?
Each of our servers has twin eth cards: 1GigE and 10GigE. How does Open MPI decide which card to send messages on? One of the cards is on a 10.0.x.x subnet whereas the other is on a 192.168.x.x subnet. Can I select one or the other by specifying the --host option with the correct IP addresses? How does it select the default, though?

Frequently I call mpirun from within a PBS wrapper and then there is no explicit --host directive. (I think PBS somehow communicates the assigned hostfile to mpirun.) In such a case, which interface will mpirun use?

-- Rahul
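[A sketch of how the choice can be pinned explicitly rather than left to the default selection, assuming the TCP BTL is in use and the 10GigE port is named eth0 on every node; the same MCA parameter is suggested later in this thread, and "./a.out" is just a placeholder executable:

# restrict the TCP BTL to one named interface
mpirun -np 64 --mca btl tcp,sm,self --mca btl_tcp_if_include eth0 ./a.out
# or, equivalently, exclude the ones you don't want (loopback plus the 1GigE port)
mpirun -np 64 --mca btl tcp,sm,self --mca btl_tcp_if_exclude lo,eth1 ./a.out

This only works cleanly if every node agrees on which physical network each interface name maps to.]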
Re: [OMPI users] MPI daemon error
On Fri, May 28, 2010 at 3:53 PM, Ralph Castain wrote:
> What environment are you running on the cluster, and what version of OMPI?
> Not sure that error message is coming from us.

openmpi-1.4.1. The cluster runs PBS/Torque, so I guess that could be the other error source.

-- Rahul
[OMPI users] MPI daemon error
Often when I try to run larger jobs on our cluster I get an error of this sort from some of the compute servers:

eu260 - daemon did not report back when launched

It does not happen every time, but pretty often. Any ideas what could be wrong? The node seems pingable and I can log in to it successfully as well. /var/log/messages shows no errors, but maybe there is another log elsewhere?

-- Rahul
[OMPI users] Disabling irqbalance service for better performance of MPI jobs
I have already been using the processor- and memory-affinity options to bind the processes to specific cores. Does the presence of the irqbalance daemon matter? I saw a recommendation to disable it for a performance boost. Or is this irrelevant? I am running HPC jobs with no over- nor under-subscription. These are 8-core Nehalem servers.

-- Rahul
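[For what it's worth, on a CentOS 5-style system the experiment is easy to run both ways; this is just a sketch of the toggle, and whether it actually helps is exactly the open question here:

# stop irqbalance now and keep it from starting at boot
service irqbalance stop
chkconfig irqbalance off
# re-enable later if it makes no measurable difference
chkconfig irqbalance on
service irqbalance start
]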
Re: [OMPI users] profile the performance of a MPI code: how much traffic is being generated?
On Wed, Sep 30, 2009 at 3:16 PM, Peter Kjellstrom wrote:
> Not MPI aware, but, you could watch network traffic with a tool such as
> collectl in real-time.

collectl is a great idea. I am going to try that now.

-- Rahul
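[A minimal sketch of the kind of invocation this suggests, assuming collectl is installed on the compute nodes and run there while the MPI job is active:

# show network subsystem throughput once per second while the job runs
collectl -sn -i 1
# or record to a file for later playback/plotting
collectl -sn -i 1 -f /tmp/netload
]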
Re: [OMPI users] profile the performance of a MPI code: how much traffic is being generated?
On Tue, Sep 29, 2009 at 1:33 PM, Anthony Chan wrote:
> Rahul,
>
> What errors did you see when compiling MPE for OpenMPI ?
> Can you send me the configure and make outputs as seen on
> your terminal ? Also, what version of MPE are you using
> with OpenMPI ?

Version: mpe2-1.0.6p1

./configure FC=ifort CC=icc CXX=icpc F77=ifort CFLAGS="-g -O2 -mp" FFLAGS="-mp -recursive" CXXFLAGS="-g -O2" CPPFLAGS=-DpgiFortran MPI_CC=/usr/local/ompi-ifort/bin/mpiCC MPI_F77=/usr/local/ompi-ifort/bin/mpif77 MPI_LIBS=/usr/local/ompi-ifort/lib/

Configuring MPE Profiling System with 'FC=ifort' 'CC=icc' 'CXX=icpc' 'F77=ifort' 'CFLAGS=-g -O2 -mp' 'FFLAGS=-mp -recursive' 'CXXFLAGS=-g -O2' 'CPPFLAGS=-DpgiFortran' 'MPI_CC=/usr/local/ompi-ifort/bin/mpiCC' 'MPI_F77=/usr/local/ompi-ifort/bin/mpif77' 'MPI_LIBS=/usr/local/ompi-ifort/lib/'
checking for current directory name... /src/mpe2-1.0.6p1
checking gnumake... yes using --no-print-directory
checking BSD 4.4 make... no - whew
checking OSF V3 make... no
checking for virtual path format... VPATH
User supplied MPI implmentation (Good Luck!)
checking for gcc... icc
checking for C compiler default output file name... a.out
checking whether the C compiler works... yes
checking whether we are cross compiling... no
checking for suffix of executables...
checking for suffix of object files... o
checking whether we are using the GNU C compiler... yes
checking whether icc accepts -g... yes
checking for icc option to accept ANSI C... none needed
checking whether MPI_CC has been set ... /usr/local/ompi-ifort/bin/mpiCC
checking whether we are using the GNU Fortran 77 compiler... no
checking whether ifort accepts -g... yes
checking whether MPI_F77 has been set ... /usr/local/ompi-ifort/bin/mpif77
checking for the linkage of the supplied MPI C definitions ... no
configure: error:  Cannot link with basic MPI C program!
    Check your MPI include paths, MPI libraries and MPI CC compiler
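[Two things stand out in that invocation: MPI_CC points at the C++ wrapper (mpiCC) rather than the C wrapper, and MPI_LIBS is a bare directory rather than linker flags. A hedged sketch of a retry, assuming the Open MPI wrappers live under /usr/local/ompi-ifort/bin as shown above; this is a guess at a fix, not a verified build:

./configure CC=icc F77=ifort \
    MPI_CC=/usr/local/ompi-ifort/bin/mpicc \
    MPI_F77=/usr/local/ompi-ifort/bin/mpif77
# the wrapper compilers already carry the MPI include paths and libraries,
# so MPI_LIBS can usually be left unset when the wrappers are used
]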
Re: [OMPI users] profile the performance of a MPI code: how much traffic is being generated?
On Tue, Sep 29, 2009 at 10:40 AM, Eugene Loh wrote:
> to know. It sounds like you want to be able to watch some % utilization of
> a hardware interface as the program is running. I *think* these tools (the
> ones on the FAQ, including MPE, Vampir, and Sun Studio) are not of that
> class.

You are correct. A real-time tool that sniffs the MPI traffic would be best; post-mortem profilers would be the next-best option, I assume. I was trying to compile MPE but gave up; too many errors. Trying to decide whether I should press on or look at another tool.

-- Rahul
[OMPI users] profile the performance of a MPI code: how much traffic is being generated?
I have a code that seems to run about 40% faster when I bond together twin eth interfaces. The question, of course, arises: is it really producing so much traffic that it keeps twin 1Gig eth interfaces busy? I don't really believe this, but I need a way to check. What are good tools to monitor the MPI performance of a running job, i.e. what throughput load it is imposing on the eth interfaces? Any suggestions?

The code does not seem to produce much disk I/O, as profiled via strace (in case NFS I/O were the bottleneck).

-- Rahul
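[In the absence of an MPI-aware profiler, the raw byte counters on each node already answer the throughput half of the question; a sketch, assuming the bonded slave interfaces are eth0 and eth1:

# byte/packet counters for all interfaces; watch them tick over while the job runs
watch -d -n 1 cat /proc/net/dev
# or, with the sysstat package installed, per-interface rates once per second
sar -n DEV 1

If the per-second rates sit near the ~120 MB/s a single 1GigE link can carry, the bonding really is being used.]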
Re: [OMPI users] very bad parallel scaling of vasp using openmpi
On Tue, Aug 18, 2009 at 5:28 PM, Gerry Creager wrote:
> Most of that bandwidth is in marketing... Sorry, but it's not a high
> performance switch.

Well, how does one figure out what exactly is a "high performance switch"? I've found this an exceedingly hard task. As the OP posted, the Dell 6248 is rated to give more than a fully subscribed backbone capacity. I do not know of any good third-party test lab, nor of any switch load-testing benchmarks that would take a switch through its paces. So how does one go about selecting a good switch? "The more expensive the better" is a somewhat unsatisfying option!

-- Rahul
Re: [OMPI users] job runs with mpirun on a node but not if submitted via Torque.
On Wed, Apr 1, 2009 at 1:13 AM, Ralph Castain wrote:
> So I gather that by "direct" you mean that you don't get an allocation from
> Maui before running the job, but for the other you do? Otherwise, OMPI
> should detect that it is running under Torque and automatically use the
> Torque launcher unless directed to do otherwise.

I think I've figured out the sore point. It seems "ulimit" is needed. Things seem sensitive to where exactly I put the ulimit directive, though. Funnily, the nodes reported an unlimited stack before too, but putting this extra directive in there seems to have helped! I'm doing more testing to be sure that the problem has been solved.

Thanks for the leads, guys!

-- Rahul
Re: [OMPI users] job runs with mpirun on a node but not if submitted via Torque.
2009/3/31 Ralph Castain:
> I have no idea why your processes are crashing when run via Torque - are you
> sure that the processes themselves crash? Are they segfaulting - if so, can
> you use gdb to find out where?

I have to admit I'm a newbie with gdb. I am trying to recompile my code with "ifort -g ..." so that I can use gdb. But the code only crashes in this specific Torque+MPI case. How do I use gdb in conjunction with MPI? I'm not really sure. Any tutorials?

-- Rahul
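[Two common approaches, sketched here; the executable name dacapo.x is hypothetical, and this assumes core dumps are allowed in the job environment:

# 1) let the crashing rank drop a core file, then examine it post-mortem
ulimit -c unlimited            # put this in the job script before mpirun
mpirun -np 8 ./dacapo.x
gdb ./dacapo.x core.<pid>      # type "bt" at the gdb prompt for a backtrace

# 2) attach gdb from another shell on the node while the job is running
gdb -p <pid-of-one-MPI-rank>

Either way the executable should be built with -g (and ideally lower optimization) so the backtrace points at source lines.]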
Re: [OMPI users] job runs with mpirun on a node but not if submitted via Torque.
2009/3/31 Ralph Castain:
> It is very hard to debug the problem with so little information. We
> regularly run OMPI jobs on Torque without issue.

Another small thing that I noticed; not sure if it is relevant. When the job starts running there is an orted process. The args to this process are slightly different depending on whether the job was submitted through Torque or started directly on a node. Could this be an issue? Just a thought. The essential difference seems to be that the Torque run has the --no-daemonize option whereas the direct run has a --set-sid option. I got these via ps after I submitted an interactive Torque job. Do these matter at all? Full ps output snippets are reproduced below. Some other numbers also differ on closer inspection, but that might be by design.

## via Torque; segfaults ##
rpnabar 11287 0.1 0.0 24680 1828 ? Ss 21:04 0:00 orted --no-daemonize --bootproxy 1 --name 0.0.1 --num_procs 2 --vpid_start 0 --nodename node17 --universe rpnabar@node17:default-universe-11286 --nsreplica "0.0.0;tcp://10.0.0.17:45839" --gprreplica "0.0.0;tcp://10.0.0.17:45839"

## direct MPI run; this works OK ##
rpnabar 11026 0.0 0.0 24676 1712 ? Ss 20:52 0:00 orted --bootproxy 1 --name 0.0.1 --num_procs 2 --vpid_start 0 --nodename node17 --universe rpnabar@node17:default-universe-11024 --nsreplica "0.0.0;tcp://10.0.0.17:34716" --gprreplica "0.0.0;tcp://10.0.0.17:34716" --set-sid
Re: [OMPI users] job runs with mpirun on a node but not if submitted via Torque.
2009/3/31 Ralph Castain:
> Information would be most helpful - the information we really need is
> specified here: http://www.open-mpi.org/community/help/

Output of "ompi_info --all" is attached in a file.

echo $LD_LIBRARY_PATH
/usr/local/ompi-ifort/lib:/opt/intel/fce/10.1.018/lib:/opt/intel/mkl/10.0.4.023/lib/em64t:/opt/intel/cce/10.1.018/lib
which mpirun
/usr/local/ompi-ifort/bin/mpirun
which mpiexec
/usr/local/ompi-ifort/bin/mpiexec

These three things are invariant inside or outside Torque, so they are unlikely to be the issue. I am setting no MCA parameters explicitly (at least none that I consciously know of!). Is there any way of obtaining a dump of the environment of a running job? It is just a plain-old Gigabit Ethernet network. Maybe this helps a bit? Feel free to instruct me to run any more diagnostic commands. I'm essentially the "sys admin" on our tiny cluster here, so I do have root access and can try any tweaks or suggestions you guys might have. Thanks again!

-- Rahul

                Open MPI: 1.2.7
   Open MPI SVN revision: r19401
                Open RTE: 1.2.7
   Open RTE SVN revision: r19401
                    OPAL: 1.2.7
       OPAL SVN revision: r19401
MCA backtrace: execinfo (MCA v1.0, API v1.0, Component v1.2.7)
MCA paffinity: linux (MCA v1.0, API v1.0, Component v1.2.7)
MCA maffinity: first_use (MCA v1.0, API v1.0, Component v1.2.7)
MCA timer: linux (MCA v1.0, API v1.0, Component v1.2.7)
MCA installdirs: env (MCA v1.0, API v1.0, Component v1.2.7)
MCA installdirs: config (MCA v1.0, API v1.0, Component v1.2.7)
MCA allocator: basic (MCA v1.0, API v1.0, Component v1.0)
MCA allocator: bucket (MCA v1.0, API v1.0, Component v1.0)
MCA coll: basic (MCA v1.0, API v1.0, Component v1.2.7)
MCA coll: self (MCA v1.0, API v1.0, Component v1.2.7)
MCA coll: sm (MCA v1.0, API v1.0, Component v1.2.7)
MCA coll: tuned (MCA v1.0, API v1.0, Component v1.2.7)
MCA io: romio (MCA v1.0, API v1.0, Component v1.2.7)
MCA mpool: rdma (MCA v1.0, API v1.0, Component v1.2.7)
MCA mpool: sm (MCA v1.0, API v1.0, Component v1.2.7)
MCA pml: cm (MCA v1.0, API v1.0, Component v1.2.7)
MCA pml: ob1 (MCA v1.0, API v1.0, Component v1.2.7)
MCA bml: r2 (MCA v1.0, API v1.0, Component v1.2.7)
MCA rcache: vma (MCA v1.0, API v1.0, Component v1.2.7)
MCA btl: self (MCA v1.0, API v1.0.1, Component v1.2.7)
MCA btl: sm (MCA v1.0, API v1.0.1, Component v1.2.7)
MCA btl: tcp (MCA v1.0, API v1.0.1, Component v1.0)
MCA topo: unity (MCA v1.0, API v1.0, Component v1.2.7)
MCA osc: pt2pt (MCA v1.0, API v1.0, Component v1.2.7)
MCA errmgr: hnp (MCA v1.0, API v1.3, Component v1.2.7)
MCA errmgr: orted (MCA v1.0, API v1.3, Component v1.2.7)
MCA errmgr: proxy (MCA v1.0, API v1.3, Component v1.2.7)
MCA gpr: null (MCA v1.0, API v1.0, Component v1.2.7)
MCA gpr: proxy (MCA v1.0, API v1.0, Component v1.2.7)
MCA gpr: replica (MCA v1.0, API v1.0, Component v1.2.7)
MCA iof: proxy (MCA v1.0, API v1.0, Component v1.2.7)
MCA iof: svc (MCA v1.0, API v1.0, Component v1.2.7)
MCA ns: proxy (MCA v1.0, API v2.0, Component v1.2.7)
MCA ns: replica (MCA v1.0, API v2.0, Component v1.2.7)
MCA oob: tcp (MCA v1.0, API v1.0, Component v1.0)
MCA ras: dash_host (MCA v1.0, API v1.3, Component v1.2.7)
MCA ras: localhost (MCA v1.0, API v1.3, Component v1.2.7)
MCA ras: gridengine (MCA v1.0, API v1.3, Component v1.2.7)
MCA ras: slurm (MCA v1.0, API v1.3, Component v1.2.7)
MCA ras: tm (MCA v1.0, API v1.3, Component v1.2.7)
MCA rds: hostfile (MCA v1.0, API v1.3, Component v1.2.7)
MCA rds: proxy (MCA v1.0, API v1.3, Component v1.2.7)
MCA rds: resfile (MCA v1.0, API v1.3, Component v1.2.7)
MCA rmaps: round_robin (MCA v1.0, API v1.3, Component v1.2.7)
MCA rmgr: proxy (MCA v1.0, API v2.0, Component v1.2.7)
MCA rmgr: urm (MCA v1.0, API v2.0, Component v1.2.7)
MCA rml: oob (MCA v1.0, API v1.0, Component v1.2.7)
MCA pls: proxy (MCA v1.0, API v1.3, Component v1.2.7)
MCA pls: gridengine (MCA v1.0, API v1.3, Component v1.2.7)
MCA pls: rsh (MCA v1.0, API v1.3, Component v1.2.7)
MCA pls: slurm (MCA v1.0, API v1.3, Component v1.2.7)
MCA pls: tm (MCA v1.0, API v1.3, Component v1.2.7)
MCA sds: env (MCA v1.0, API v1.0, Component v1.2.7)
MCA sds: seed (MCA v1.0, API v1.0,
Re: [OMPI users] job runs with mpirun on a node but not if submitted via Torque.
2009/3/31 Ralph Castain:
> It is very hard to debug the problem with so little information. We

Thanks Ralph! I'm sorry my first post lacked specifics. I'll try my best to fill you guys in on as much debug info as I can.

> regularly run OMPI jobs on Torque without issue.

So do we. In fact, on this very same cluster other jobs using the same code run fine. It's only this one type of job on which I am seeing this strange behavior. For those curious, the code I am trying to run is a computational chemistry code called DACAPO, developed at CAMd at the Technical University of Denmark. Link: https://wiki.fysik.dtu.dk/dacapo

Hardware architecture: Dell PowerEdge SC1435 rack servers; 2.2GHz Opteron 1Ghz (AMD); 8 cpus per node.

> Are you getting an allocation from somewhere for the nodes?
> If so, are you using Moab to get it?

We are using Torque as the resource manager and Maui as the scheduler.

> Do you have a $PBS_NODEFILE in your environment?

Yes, I do. For a test case I was trying to run on a single node (which has 8 cpus). If I cat $PBS_NODEFILE I get the name "node17" 8 times. I also dumped the environment variables from a running job and I get:

PBS_NODEFILE="/var/spool/torque/aux//4609.uranus.che.foo.edu"

> I have no idea why your processes are crashing when run via Torque - are you
> sure that the processes themselves crash? Are they segfaulting - if so, can
> you use gdb to find out where?

Yes, they are indeed segfaulting, and only when I run them through Torque:

forrtl: error (78): process killed (SIGTERM)
mpirun noticed that job rank 5 with PID 10580 on node node17 exited on signal 11 (Segmentation fault).

The exact same job runs like a charm if I submit it via mpirun on the node outside of Torque.

I can try gdb. I haven't used gdb much before. In case it matters, the executable is Fortran source compiled with the Intel Fortran compiler (ifort). That executable runs fine for all other cases except this one.

Maybe this helps more?

-- Rahul