Re: [OMPI users] IMB-MPI broadcast test stalls for large core counts: debug ideas?

2010-09-01 Thread Rahul Nabar
On Wed, Aug 25, 2010 at 12:14 PM, Jeff Squyres  wrote:
> It would simplify testing if you could get all the eth0's to be of one type 
> and on the same subnet, and the same for eth1.
>
> Once you do that, try using just one of the networks by telling OMPI to use 
> only one of the devices, something like this:
>
>    mpirun --mca btl_tcp_if_include eth0 ...

Thanks for all the suggestions guys! We finally got this figured out.
It turned out to be the result of two different (hardware-specific)
bugs in the RDMA driver. The 10GigE card was advertising a wrong size
for its CQ (completion queue), as far as I understand!

In case anyone wants to know more, the bugfixes are posted here:

http://www.mail-archive.com/linux-rdma@vger.kernel.org/msg05451.html
http://www.mail-archive.com/linux-rdma@vger.kernel.org/msg05246.html

Cheers!

-- 
Rahul



Re: [OMPI users] IMB-MPI broadcast test stalls for large core counts: debug ideas?

2010-08-26 Thread Rahul Nabar
On Wed, Aug 25, 2010 at 12:14 PM, Jeff Squyres  wrote:
> Once you do that, try using just one of the networks by telling OMPI to use 
> only one of the devices, something like this:
>
>    mpirun --mca btl_tcp_if_include eth0 ...

Thanks Jeff! Just tried the exact test that you suggested.

[rpnabar@eu001 ~]$ NP=64;time mpirun  -np $NP --host
eu001,eu003,eu004,eu005,eu006,eu007,eu008,eu012 --mca
btl_tcp_if_include eth0  -mca btl openib,sm,self
/opt/src/mpitests/imb/src/IMB-MPI1 -npmin $NP  gather

Still the same problem. The NP64 gather stalls at 4096 bytes for about 7
minutes and then completes with a step-change increase in times. All
the 10GigE interfaces are eth0 now and all on the 192.168.x.x subnet.
The 7-minute stall seems quite reproducible each time around.
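
(Side note on the flags: since the run above selects only the openib
BTL, I believe btl_tcp_if_include has no effect there. A TCP-only
comparison run, as a sketch using the same hosts and path as above,
would look something like:

NP=64; time mpirun -np $NP --host \
    eu001,eu003,eu004,eu005,eu006,eu007,eu008,eu012 \
    --mca btl tcp,sm,self --mca btl_tcp_if_include eth0 \
    /opt/src/mpitests/imb/src/IMB-MPI1 -npmin $NP gather

That would at least tell us whether the stall is specific to the
openib/iWARP path.)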

Once the test stalled I ran a padb stack trace from the master node.
Posted here:

[rpnabar@eu001 root]$ /opt/sbin/bin/padb --all --stack-trace --tree
--config-option rmgr=orte
http://dl.dropbox.com/u/118481/padb_Aug26_gather_NP64.txt

I ran top to find the most CPU-intensive processes during the stall and
they all seem to be the IMB-MPI ones. Memory usage seems minimal (each
node has 16 GB of RAM).
http://dl.dropbox.com/u/118481/top_Aug26.txt

Interestingly, the NP56 test runs just fine and finishes in less than
a minute. It's only at NP64 that I hit this roadblock. On the other
hand, even for the NP56 test there is almost a 10x degradation in
times at the 4096 --> 8192 byte-size transition.
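
(One guess, and it is only a guess on my part: the 4096 --> 8192 jump
might coincide with the openib BTL's eager/rendezvous threshold. The
parameter names below are the standard Open MPI 1.4 ones; the 16384
value is just an experiment, not a recommendation:

ompi_info --param btl openib | grep -i eager
mpirun -np $NP --host $HOSTS --mca btl openib,sm,self \
    --mca btl_openib_eager_limit 16384 \
    /opt/src/mpitests/imb/src/IMB-MPI1 -npmin $NP gather

where $HOSTS is the same host list as above. If the step change moves
with that limit, that would narrow things down.)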

 Any other debug options or suggestions are most welcome!

# /opt/src/mpitests/imb/src/IMB-MPI1 -npmin 64 gather

# Minimum message length in bytes:   0
# Maximum message length in bytes:   4194304
#
# MPI_Datatype   :   MPI_BYTE
# MPI_Datatype for reductions:   MPI_FLOAT
# MPI_Op :   MPI_SUM
#
#

# List of Benchmarks to run:

# Gather

#
# Benchmarking Gather
# #processes = 64
#
       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
            0         1000         0.02         0.03         0.02
            1         1000        84.25        84.55        84.40
            2         1000        84.16        84.45        84.31
            4         1000        84.48        84.78        84.64
            8         1000        84.58        84.92        84.77
           16         1000        86.51        86.79        86.66
           32         1000        88.60        88.93        88.78
           64         1000        90.88        91.22        91.06
          128         1000        92.44        92.76        92.60
          256         1000        95.79        96.14        95.98
          512         1000       104.90       105.25       105.07
         1024         1000       118.01       118.40       118.19
         2048         1000       154.42       154.94       154.67
         4096         1000       292.15       292.95       292.52
         8192           13      1436.77      1667.15      1581.73
        16384           13      1733.38      2004.77      1903.27
        32768           13      2082.55      2403.24      2282.68
        65536           13      3106.37      3546.15      3384.07
       131072           13      7812.54      9011.62      8572.76
       262144           13     10773.70     12358.30     11782.77
       524288           13     19377.23     22315.85     21238.98
      1048576           13     38661.61     44293.92     42280.09
      2097152           13    120665.00    140697.08    136576.54
      4194304           10    475155.12    567579.08    536037.92


# All processes entering MPI_Finalize


real    7m31.039s
user    58m58.321s
sys     0m21.633s

---------------- NP56 test ----------------
#
# Benchmarking Gather
# #processes = 56
#
       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
            0         1000         0.02         0.09         0.03
            1         1000        74.23        74.53        74.35
            2         1000        73.87        74.15        74.02
            4         1000        73.59        73.86        73.72
            8         1000        74.15        74.40        74.27
           16         1000        76.18        76.45        76.30
           32         1000        77.82        78.10        77.95
           64         1000        79.85        80.16        80.00
          128         1000        81.67        82.01        81.84
          256         1000        86.07        86.41        86.27
          512         1000        94.91        95.23        95.07
         1024          843        33.45        35.13        34.38
         2048          843       218.82       241.49       230.18
         4096          843       130.76

Re: [OMPI users] IMB-MPI broadcast test stalls for large core counts: debug ideas?

2010-08-25 Thread Rahul Nabar
On Wed, Aug 25, 2010 at 6:41 AM, John Hearns  wrote:
> You could sort that out with udev rules on each machine.

Sure. I'd always wanted consistent names for the eth interfaces when I
set up the cluster but I couldn't get udev to co-operate. Maybe this
time! Let me try.

> Look in the directory /etc/udev/rules.d for the file
> NN-net_persistent_names.rules
> you'll need a script which looks for the HWaddr (MAC) address matching
> the 10gig cards
> and edit the SUBSYSTEM line for that interface.

I don't have the particular file you mention. I do have the following files:

05-udev-early.rules  40-multipath.rules  50-udev.rules    51-hotplug.rules
60-net.rules         60-pcmcia.rules     60-raw.rules     85-pcscd_ccid.rules
90-dm.rules          90-hal.rules        90-ib.rules      95-pam-console.rules
bluetooth.rules

Not sure how to proceed with udev, but maybe this is OT for this list.
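
The closest thing I could piece together from the docs is below, a
sketch only and untested here: I understand the match key is
SYSFS{address} on the older udev shipped with CentOS 5 (ATTR{address}
on newer versions), and the MAC addresses are placeholders. The other
route would be pinning HWADDR in /etc/sysconfig/network-scripts/ifcfg-eth*.

cat > /etc/udev/rules.d/70-persistent-net.rules <<'EOF'
# rename by MAC so the 10GigE card is always eth0 (placeholder MACs)
KERNEL=="eth*", SYSFS{address}=="00:07:43:aa:bb:cc", NAME="eth0"
KERNEL=="eth*", SYSFS{address}=="00:1e:c9:dd:ee:ff", NAME="eth1"
EOF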

-- 
Rahul


Re: [OMPI users] IMB-MPI broadcast test stalls for large core counts: debug ideas?

2010-08-25 Thread Rahul Nabar
On Thu, Aug 19, 2010 at 9:03 PM, Rahul Nabar <rpna...@gmail.com> wrote:
> --
> gather:
>    NP256    hangs
>    NP128    hangs
>    NP64    hangs
>    NP32    OK
>
> Note: "gather" always hangs at the following line of the test:
>       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
> [snip]
>         4096         1000       525.80       527.69       526.79
> --

What I thought was a permanent "hang" of the NP64 "gather" test was, in
fact, an exceedingly long stall. After waiting for more than 7 minutes
the test runs on to completion. What is surprising is the _huge_ jump
in times from the 4096 to the 8192 byte packet size: a step change from
about 275 to 1380 usecs. Any ideas what could cause this, and whether
it could be related to the other "hangs" I am seeing? We are using
jumbo frames with an MTU of 9000, so that was one thought I had for
this transition.
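
(A quick sanity check I can run, as a sketch: it assumes ssh access to
all the nodes, and the host names are just the ones from my other runs.
Remember too that the 10GigE interface name still differs per node
here, so I grep all interfaces:

for h in eu001 eu003 eu004 eu005 eu006 eu007 eu008 eu012; do
    echo -n "$h: "; ssh $h "/sbin/ifconfig | grep -i mtu"
done

That should confirm whether MTU 9000 is actually set everywhere along
the path.)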

On the other hand, this doesn't seem to be the case with the "hang"
for the NP256 bcast test. That one stayed hung for more than an hour,
at which point I killed it.

Just to make sure this isn't some quirk or buggy implementation in the
Intel IMB test suite: are there any alternative test suites that I
could run on my cluster? I was a bit iffy about the Intel IMB suite
because I have found no active forums or mailing lists focused on it,
so I can't really get in touch with any users or developers who might
have insight into how these benchmarks run.
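
(One alternative I am considering, sketched here but not tried yet: the
OSU micro-benchmarks from the MVAPICH group build against any MPI and,
in recent versions, include collective tests such as osu_bcast. The
exact build steps and benchmark names may differ by version:

cd osu-micro-benchmarks*/           # directory name varies by release
./configure CC=mpicc && make        # older releases may just need 'make CC=mpicc'
mpirun -np 64 --host <same 8 hosts as before> ./osu_bcast

If an independent benchmark shows the same stall, that would rule out
an IMB quirk.)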

7m22.972s
# /opt/src/mpitests/imb/src/IMB-MPI1 -npmin 64 gather

# Minimum message length in bytes:   0
# Maximum message length in bytes:   4194304
#
# MPI_Datatype   :   MPI_BYTE
# MPI_Datatype for reductions:   MPI_FLOAT
# MPI_Op :   MPI_SUM
#
#

# List of Benchmarks to run:

# Gather

#
# Benchmarking Gather
# #processes = 64
#
       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
            0         1000         0.02         0.03         0.02
            1         1000        68.72        68.95        68.84
            2         1000        69.16        69.39        69.28
            4         1000        68.85        69.08        68.97
            8         1000        69.02        69.25        69.14
           16         1000        70.29        70.51        70.40
           32         1000        72.14        72.38        72.27
           64         1000        70.99        71.24        71.12
          128         1000        72.59        72.84        72.72
          256         1000        76.00        76.26        76.14
          512         1000        84.92        85.21        85.06
         1024         1000       101.69       102.01       101.84
         2048         1000       146.94       147.41       147.18
         4096         1000       275.61       276.45       276.04
         8192           13      1380.54      1607.84      1522.64
        16384           13      1497.09      1749.46      1656.61
        32768           13      2055.61      2380.37      2259.50
        65536           13      4553.46      5002.70      4837.14
       131072           13      7720.76      8926.69      8483.07
       262144           13     10423.99     12027.23     11440.07
       524288           13     19456.94     22369.62     21317.78
      1048576           13     38228.53     43892.99     41880.94
      2097152           13     99705.55    119614.62    115667.49
      4194304           10    425823.38    496396.78    468326.45



Re: [OMPI users] IMB-MPI broadcast test stalls for large core counts: debug ideas?

2010-08-24 Thread Rahul Nabar
On Mon, Aug 23, 2010 at 9:43 PM, Richard Treumann  wrote:
> Bugs are always a possibility but unless there is something very unusual
> about the cluster and interconnect or this is an unstable version of MPI, it
> seems very unlikely this use of MPI_Bcast with so few tasks and only a 1/2
> MB message would trip on one.  80 tasks is a very small number in modern
> parallel computing.  Thousands of tasks involved in an MPI collective has
> become pretty standard.

Here's something absolutely strange that I stumbled upon by accident:

I ran the test again, but forgot to kill the user jobs already running
on the test servers (via Torque and our usual queues). I was about to
kick myself, but I couldn't believe that the test actually completes! I
mean the timings are horribly bad, but the test (for the first time)
runs to completion. How could this be happening? It makes no sense to
me that the test completes when the cards + servers + network are
loaded but not otherwise. Yet I repeated the experiment many times and
got the same result.

# /opt/src/mpitests/imb/src/IMB-MPI1 -npmin 256 bcast
[snip]
# Bcast
       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
            0         1000         0.02         0.02         0.02
            1           34    546807.94    626743.09    565196.07
            2           34     37159.11     52942.09     44910.73
            4           34     19777.97     40382.53     29656.53
            8           34     36060.21     53265.27     43909.68
           16           34     11765.59     31912.50     19611.75
           32           34     23530.79     41176.94     32532.89
           64           34     11735.91     23529.02     16552.16
          128           34     47998.44     59323.76     55164.14
          256           34     18121.96     30500.15     25528.95
          512           34     20072.76     33787.32     26786.55
         1024           34     39737.29     55589.97     45704.99
         2048            9     77787.56    150555.66    118741.83
         4096            9         4.67    118331.78     77201.40
         8192            9     80835.66        16.56    133781.08
        16384            9     77032.88    149890.66    119558.73
        32768            9    111819.45        18.99    149048.91
        65536            9    159304.67        98.99    195071.34
       131072            9    172941.13    262216.57    218351.14
       262144            9    161371.65    266703.79    223514.31
       524288            2       497.46   4402568.94   2183980.20
      1048576            2      5401.49   3519284.01   1947754.45
      2097152            2     75251.10   4137861.49   2220910.50
      4194304            2     33270.48   4601072.91   2173905.32
# All processes entering MPI_Finalize

Another observation is that if I replace the openib BTL with the tcp
BTL the tests run OK.


-- 
Rahul



Re: [OMPI users] IMB-MPI broadcast test stalls for large core counts: debug ideas?

2010-08-24 Thread Rahul Nabar
On Mon, Aug 23, 2010 at 8:39 PM, Randolph Pullen wrote:
>
> I have had a similar load related problem with Bcast.

Thanks Randolph! That's interesting to know! What was the hardware you
were using? Does your bcast fail at the exact same point too?

>
> I don't know what caused it though.  With this one, what about the 
> possibility of a buffer overrun or network saturation?

How can I test for a buffer overrun?

For network saturation I guess I could use something like mrtg to
monitor the bandwidth used. On the other hand, all 32 servers are
connected to a single dedicated Nexus 5000 and the backplane carries no
other traffic, so I am skeptical that just 41,943,040 bytes saturated
what Cisco rates as a 10GigE fabric. But I might be wrong.
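
(A couple of things I can check for drops without any MPI in the
picture, sketch commands to be run on each node:

ethtool -S eth0 | grep -iE 'drop|discard|pause'    # NIC-level counters
netstat -s | grep -i retrans                       # kernel TCP retransmits
cat /proc/net/dev                                  # per-interface errs/drop columns

If any of these counters climb during the bcast test, that would point
at saturation or flow-control trouble rather than an MPI-level
problem.)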

-- 
Rahul



Re: [OMPI users] IMB-MPI broadcast test stalls for large core counts: debug ideas?

2010-08-24 Thread Rahul Nabar
On Mon, Aug 23, 2010 at 6:39 PM, Richard Treumann  wrote:
> It is hard to imagine how a total data load of 41,943,040 bytes could be a
> problem. That is really not much data. By the time the BCAST is done, each
> task (except root) will have received a single half meg message from one
> sender. That is not much.

Thanks very much for your comments Dick! I'm somewhat new to MPI, so I
appreciate all the advice I can get. My main roadblock is that I'm not
sure how to attack this problem further. How can I obtain more
diagnostic output to help me trace the origin of this "broadcast
stall"? So far I've obtained a stack trace via padb
( http://dl.dropbox.com/u/118481/padb.log.new.new.txt ), but that is
about all.

Any suggestions as to what else I could try? Would a full dump via
something like tcpdump or wireshark of the packets passing over the
network be of any relevance? Or is there something useful to be learned
from the switch side? The technology is fairly new for HPC (Chelsio
10GigE adapters + Cisco Nexus 5000 switches), so I wouldn't rule out
some strange hardware or firmware bug that is tickled by this
particular suite of tests. I'm grasping at straws here.
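
(If a packet capture would help, something like this during the stall
window is easy enough, as a sketch; eu010 is just an example peer:

tcpdump -i eth0 -s 128 -w bcast_stall.pcap 'host eu010 and tcp'

One caveat I am unsure about: with the Chelsio offload doing iWARP, the
RDMA connections may bypass the kernel stack entirely, in which case
tcpdump on eth0 would see little of the interesting traffic.)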

[On the other hand, I'm fairly new at this, so I wouldn't rule out some
silly setting on my part either.]

-- 
Rahul


[OMPI users] IMB-MPI broadcast test stalls for large core counts: debug ideas?

2010-08-19 Thread Rahul Nabar
My Intel IMB-MPI tests stall, but only in very specific cases: larger
packet sizes + large core counts. It only happens for the bcast, gather
and exchange tests, and only for the larger core counts (~256 cores).
Other tests like pingpong and sendrecv run fine even with larger core
counts.

e.g. this bcast test hangs consistently at the 524288-byte packet size
when invoked on 256 cores. The same test runs fine on 128 cores.

NP=256; mpirun -np $NP --host [32_HOSTS_8_core_each] -mca btl
openib,sm,self /opt/src/mpitests/imb/src/IMB-MPI1 -npmin $NP bcast

       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
            0         1000         0.02         0.02         0.02
            1          130        26.94        27.59        27.25
            2          130        26.44        27.09        26.77
            4          130        75.98        81.07        76.75
            8          130        28.41        29.06        28.74
           16          130        28.70        29.39        29.03
           32          130        28.48        29.15        28.85
           64          130        30.10        30.86        30.48
          128          130        31.62        32.41        32.01
          256          130        31.08        31.72        31.42
          512          130        31.79        32.58        32.13
         1024          130        33.22        34.06        33.65
         2048          130        66.21        67.61        67.21
         4096          130        79.14        80.86        80.37
         8192          130       103.38       105.21       104.70
        16384          130       160.82       163.67       162.97
        32768          130       516.11       541.75       533.46
        65536          130      1044.09      1063.63      1052.88
       131072          130      1740.09      1750.12      1746.78
       262144          130      3587.23      3598.52      3594.52
       524288           80      4000.99      6669.65      5737.78
The test stalls for at least 5 minutes at this point, after which I
killed it.

I did more extensive testing for various combinations of test type and
core count (see below). I know exactly when the tests fail, but I still
cannot see a trend in this data. Any pointers or further debug ideas? I
do have padb installed and have collected core dumps if those would
help. One example is below:

http://dl.dropbox.com/u/118481/padb.log.new.new.txt
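
One more thing I can try, a sketch using the standard BTL verbosity
knob (I do not know how much the openib BTL will actually report at
this level):

mpirun -np 256 --host [32_HOSTS_8_core_each] -mca btl openib,sm,self \
    -mca btl_base_verbose 50 \
    /opt/src/mpitests/imb/src/IMB-MPI1 -npmin 256 bcast 2>&1 | tee bcast_verbose.log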

System Details:
Intel Nehalem 2.2 GHz
10Gig Ethernet Chelsio Cards and Cisco Nexus Switch. Using the OFED drivers.
CentOS 5.4
Open MPI: 1.4.1 / Open RTE: 1.4.1 / OPAL: 1.4.1


--
bcast:
NP256    hangs
NP128    OK

Note: "bcast" mostly hangs at:

   #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
   524288   80  2682.61  4408.94  3880.68
--
sendrecv:
NP256OK
--
gather:
NP256    hangs
NP128    hangs
NP64     hangs
NP32     OK

Note: "gather" always hangs at the following line of the test:
   #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
[snip]
 4096 1000   525.80   527.69   526.79
--
exchange:
NP256    hangs
NP128    OK

Note: "exchange" always hangs at:

#bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]   Mbytes/sec
8192 1000   109.65   110.79   110.37   282.08
--

Note: I kept the --host string the same (all 32 servers) and just
changed NP / -npmin, in case it matters for how the processes are
mapped out.
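
As a sanity check on that, a sketch of the sort of run I could use to
confirm the mapping (--display-map is in Open MPI 1.4, if I read the
man page right):

mpirun -np $NP --host [32_HOSTS_8_core_each] --display-map \
    -mca btl openib,sm,self /opt/src/mpitests/imb/src/IMB-MPI1 -npmin $NP gather

--display-map prints the rank-to-node layout before the job starts, so
the NP32 and NP64 runs can be compared directly.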


[OMPI users] MPI broadcast test fails only when I run within a torque job

2010-07-28 Thread Rahul Nabar
I'm not sure if this is a Torque issue or an MPI issue. If I log in to
a compute node and run the standard MPI broadcast test it returns no
error, but if I run it through PBS/Torque I get an error (see below).
The nodes that return the error are fairly random; even the same set of
nodes will run a test once and then fail the next time. In case it
matters, these nodes have dual interfaces: 1GigE and 10GigE. All tests
were run on the same group of 32 nodes.

If I log in to the node (just as a regular user, not as root) then the
test runs fine. No errors at all.

Is there a timeout somewhere, or some such issue? I'm not at all sure
why this is happening.

Things I've verified: ulimit seems OK. I have explicitly set the ulimit
within the pbs init script as well as in the sshd init script that
spawns it.

[root@eu013 ~]# grep ulimit /etc/init.d/pbs
ulimit -l unlimited
[root@eu013 ~]# grep ulimit /etc/init.d/sshd
ulimit -l unlimited


ssh eu013 ulimit -l
unlimited

Even if I put a "ulimit -l" in a PBS job it does return unlimited.
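
For completeness, one way to check the limit as seen by the actual job
environment on every allocated node, a sketch assuming Torque's pbsdsh
is available:

#PBS -l nodes=8:ppn=8
pbsdsh -u /bin/bash -c 'echo "$(hostname): $(ulimit -l)"'

pbsdsh -u runs the command once per allocated node, under the same
pbs_mom environment the MPI processes would inherit.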

"cat /sys/class/infiniband/cxgb3_0/proto_stats/tcpRetransSegs" returns
a zero on all nodes concerned. Even ifconfig does not return any Error
packets.

-- 
Rahul


PBS command:

mpirun -mca btl openib,sm,self -mca orte_base_help_aggregate 0
/opt/src/mpitests/imb/src/IMB-MPI1 bcast
---------- output when run through PBS ----------
The RDMA CM returned an event error while attempting to make a
connection.  This type of error usually indicates a network
configuration error.

  Local host:   eu013
  Local device: cxgb3_0
  Error name:   RDMA_CM_EVENT_UNREACHABLE
  Peer: eu010

Your MPI job will now abort, sorry.
-
###
Run  physically from a compute node

mpirun -host 
eu001,eu002,eu003,eu004,eu005,eu006,eu007,eu008,eu009,eu010,eu011,eu012,eu013,eu014,eu015,eu016,eu017,eu018,eu019,eu010,eu011,eu012,eu013,eu014,eu015,eu016,eu017,eu018,eu019,eu020,eu021,eu022,eu023,eu024,eu025,eu026,eu027,eu028,eu029,eu030,eu031,eu032
-mca btl openib,sm,self -mca orte_base_help_aggregate 0
/opt/src/mpitests/imb/src/IMB-MPI1 bcast

#
# Benchmarking Bcast
# #processes = 42
#
       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
            0         1000         0.02         0.03         0.02
            1         1000       170.70       170.76       170.74
            2         1000       171.04       171.10       171.08
            4         1000       171.09       171.15       171.13
            8         1000       171.05       171.13       171.10
           16         1000       171.03       171.10       171.07
           32         1000        31.93        32.00        31.98
           64         1000        28.86        29.02        28.99
          128         1000        29.34        29.40        29.38
          256         1000        29.90        29.98        29.95
          512         1000        30.39        30.47        30.44
         1024         1000        31.59        31.67        31.64
         2048         1000        38.15        38.26        38.23
         4096         1000       187.59       187.75       187.68
         8192         1000       208.26       208.41       208.37
        16384         1000       395.47       395.71       395.61
        32768         1000      9360.99      9441.36      9416.47
        65536          400     10522.09     11003.08     10781.73
       131072          299     16971.71     17647.29     17329.27
       262144          160     15404.01     17131.36     15816.46
       524288           80      2659.56      4258.90      3002.04
      1048576           40      4305.72      5305.33      5219.00
      2097152           20      2472.34     10711.80      8599.28
      4194304           10      6275.51     20791.20     13687.10


# All processes entering MPI_Finalize



[OMPI users] subnet specification for MPI when multiple networks are present

2010-06-22 Thread Rahul Nabar
I have compute nodes with twin eth interfaces: 1GigE and 10GigE. In the
Open MPI docs I found this instruction:

" It is therefore very important that if active ports on the same host
are on physically separate fabrics, they must have different subnet
IDs."

Is this the same "subnet" that is set via an ifconfig e.g. 192.168.x.x
or 10.0.x.x that I have for my 10Gig and 1Gig networks? Or is this a
different usage of the term "subnet"?

The reason I am confused is that the docs subsequently discuss setting
the right subnets by using the subnet managers "opensm" or the "Cisco
High Performance Subnet Manager". I don't seem to have either of these
on my system, but I have set the subnets via the usual eth and ifcfg
framework. Is that sufficient?

[I am using Chelsio 10GigE cards with the OpenIB framework]

-- 
Rahul


Re: [OMPI users] MPI daemon error

2010-05-29 Thread Rahul Nabar
On Sat, May 29, 2010 at 8:19 AM, Ralph Castain  wrote:

>
> From your other note, it sounds like #3 might be the problem here. Do you
> have some nodes that are configured with "eth0" pointing to your 10.x
> network, and other nodes with "eth0" pointing to your 192.x network? I have
> found that having interfaces that share a name but are on different IP
> addresses sometimes causes OMPI to miss-connect.
>
> If you randomly got some of those nodes in your allocation, that might
> explain why your jobs sometimes work and sometimes don't.

That is exactly true. On some nodes eth0 is the 1Gig interface and on
others it is the 10Gig, and vice versa for eth1. Is that going to be a
problem, and is there a workaround? I mean, 192.168.x.x is always the
10Gig and 10.0.x.x the 1Gig, but the correspondence with eth0 vs. eth1
is not consistent. I'd have liked it to be, but I couldn't figure out a
way to guarantee the order of the eth interfaces.
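
Ralph, would a per-node default be a sane workaround until the naming
is fixed? As a sketch, and assuming the openmpi-mca-params.conf under
my install prefix is the right file to touch:

# on nodes where the 10GigE happens to be eth0:
echo "btl_tcp_if_include = eth0" >> /usr/local/ompi-ifort/etc/openmpi-mca-params.conf
# on nodes where it happens to be eth1:
echo "btl_tcp_if_include = eth1" >> /usr/local/ompi-ifort/etc/openmpi-mca-params.conf

Since each process reads the file on its own node, I am hoping the
value only needs to be correct locally.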

-- 
Rahul



[OMPI users] which eth interface does mpi use by default when torque supplies it with a hostfile?

2010-05-28 Thread Rahul Nabar
Each of our servers has twin eth cards: 1GigE and 10GigE. How does
Open MPI decide which card to send messages on? One of the cards is on
a 10.0.x.x subnet whereas the other is on a 192.168.x.x subnet. Can I
select one or the other by specifying the --host option with the
correct IP addresses?

How does it select the default, though? Frequently I call mpirun from
within a PBS wrapper and then there is no explicit --host directive.
(I think PBS somehow communicates the assigned hostfile to mpirun.) In
such a case, which interface will mpirun use?

-- 
Rahul


Re: [OMPI users] MPI daemon error

2010-05-28 Thread Rahul Nabar
On Fri, May 28, 2010 at 3:53 PM, Ralph Castain  wrote:
> What environment are you running on the cluster, and what version of OMPI? 
> Not sure that error message is coming from us.

openmpi-1.4.1
The cluster runs PBS/Torque, so I guess that could be the other source
of the error.

-- 
Rahul


[OMPI users] MPI daemon error

2010-05-28 Thread Rahul Nabar
Often when I try to run larger jobs on our cluster I get an error of
this sort from some of the compute servers:

eu260 - daemon did not report back when launched

It does not happen every time, but pretty often. Any ideas what could
be wrong? The node seems pingable and I could log in to it successfully
as well. /var/log/messages shows no errors, but maybe there is another
log elsewhere?
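
A sketch of what I plan to try next to get more detail out of the
launch phase (the flags are the standard mpirun debugging options, and
I run a trivial app so only the launch itself is exercised):

mpirun -np 256 --host [32_HOSTS_8_core_each] \
    --debug-daemons --leave-session-attached \
    -mca plm_base_verbose 5 hostname

That should at least show which orted fails to phone home and why.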

-- 
Rahul


[OMPI users] Disabling irqbalance service for better performance of MPI jobs

2009-12-14 Thread Rahul Nabar
I have already been using the processor and memory affinity options to
bind the processes to specific cores. Does the presence of the
irqbalance daemon matter? I saw a recommendation to disable it for a
performance boost. Or is this irrelevant?

I am running HPC jobs with no over- or under-subscription. These are
8-core Nehalem servers.
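
In case the numbers help, this is roughly how I was planning to
compare, a sketch (CentOS service names assumed):

cat /proc/interrupts | grep -i eth    # see which cores take the NIC interrupts
service irqbalance stop               # then re-run the benchmark and compare
chkconfig irqbalance off              # only if the difference is worth keeping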

-- 
Rahul


Re: [OMPI users] profile the performance of a MPI code: how much traffic is being generated?

2009-10-02 Thread Rahul Nabar
On Wed, Sep 30, 2009 at 3:16 PM, Peter Kjellstrom  wrote:

> Not MPI aware, but, you could watch network traffic with a tool such as
> collectl in real-time.

collectl is a great idea. I am going to try that now.
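
Something along these lines is what I had in mind, a sketch from the
collectl docs as I understand them:

collectl -sn -i 1     # network subsystem, 1-second samples, while the job runs

where -s selects the subsystems (n = network) and -i the sampling
interval.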

-- 
Rahul


Re: [OMPI users] profile the performance of a MPI code: how much traffic is being generated?

2009-09-29 Thread Rahul Nabar
On Tue, Sep 29, 2009 at 1:33 PM, Anthony Chan  wrote:
>
> Rahul,
>

>
> What errors did you see when compiling MPE for OpenMPI ?
> Can you send me the configure and make outputs as seen on
> your terminal ?  ALso, what version of MPE are you using
> with OpenMPI ?

Version: mpe2-1.0.6p1

./configure FC=ifort CC=icc CXX=icpc F77=ifort CFLAGS="-g -O2 -mp"
FFLAGS="-mp -recursive" CXXFLAGS="-g -O2" CPPFLAGS=-DpgiFortran
MPI_CC=/usr/local/ompi-ifort/bin/mpiCC
MPI_F77=/usr/local/ompi-ifort/bin/mpif77
MPI_LIBS=/usr/local/ompi-ifort/lib/
Configuring MPE Profiling System with 'FC=ifort' 'CC=icc' 'CXX=icpc'
'F77=ifort' 'CFLAGS=-g -O2 -mp' 'FFLAGS=-mp -recursive' 'CXXFLAGS=-g
-O2' 'CPPFLAGS=-DpgiFortran' 'MPI_CC=/usr/local/ompi-ifort/bin/mpiCC'
'MPI_F77=/usr/local/ompi-ifort/bin/mpif77'
'MPI_LIBS=/usr/local/ompi-ifort/lib/'
checking for current directory name... /src/mpe2-1.0.6p1
checking gnumake... yes using --no-print-directory
checking BSD 4.4 make... no - whew
checking OSF V3 make... no
checking for virtual path format... VPATH
User supplied MPI implmentation (Good Luck!)
checking for gcc... icc
checking for C compiler default output file name... a.out
checking whether the C compiler works... yes
checking whether we are cross compiling... no
checking for suffix of executables...
checking for suffix of object files... o
checking whether we are using the GNU C compiler... yes
checking whether icc accepts -g... yes
checking for icc option to accept ANSI C... none needed
checking whether MPI_CC has been set ... /usr/local/ompi-ifort/bin/mpiCC
checking whether we are using the GNU Fortran 77 compiler... no
checking whether ifort accepts -g... yes
checking whether MPI_F77 has been set ... /usr/local/ompi-ifort/bin/mpif77
checking for the linkage of the supplied MPI C definitions ... no
configure: error:  Cannot link with basic MPI C program!
Check your MPI include paths, MPI libraries and MPI CC compiler



Re: [OMPI users] profile the performance of a MPI code: how much traffic is being generated?

2009-09-29 Thread Rahul Nabar
On Tue, Sep 29, 2009 at 10:40 AM, Eugene Loh  wrote:
> to know.  It sounds like you want to be able to watch some % utilization of
> a hardware interface as the program is running.  I *think* these tools (the
> ones on the FAQ, including MPE, Vampir, and Sun Studio) are not of that
> class.

You are correct. A real-time tool that sniffs the MPI traffic would be
best; a post-mortem profiler would be the next-best option, I assume.
I was trying to compile MPE but gave up: too many errors. I'm trying to
decide whether I should press on with it or look at another tool.

-- 
Rahul



[OMPI users] profile the performance of a MPI code: how much traffic is being generated?

2009-09-29 Thread Rahul Nabar
I have a code that seems to run about 40% faster when I bond together
twin eth interfaces. The question, of course, arises: is it really
producing enough traffic to keep twin 1Gig eth interfaces busy? I don't
really believe this, but I need a way to check.

What are good tools to monitor the MPI performance of a running job,
basically the throughput load it is imposing on the eth interfaces?
Any suggestions?

The code does not seem to produce much disk I/O when profiled via
strace (in case NFS I/O were the bottleneck).
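
What I can do already, as a crude sketch while the job runs (bond0,
eth0 and eth1 are simply the interface names I assume on my nodes):

sar -n DEV 1 | grep -E 'bond0|eth0|eth1'

That prints per-interface rx/tx rates every second, which should show
whether the bonded pair really gets anywhere near 2 x 1Gig.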

-- 
Rahul


Re: [OMPI users] very bad parallel scaling of vasp using openmpi

2009-09-23 Thread Rahul Nabar
On Tue, Aug 18, 2009 at 5:28 PM, Gerry Creager  wrote:
> Most of that bandwidth is in marketing...  Sorry, but it's not a high
> performance switch.

Well, how does one figure out what exactly is a "high performance
switch"? I've found this an exceedingly hard task. As the OP posted,
the Dell 6248 is rated to provide more than fully subscribed backplane
capacity. I don't know of any good third-party test lab, nor of any
switch load-testing benchmarks that would take a switch through its
paces.

So, how does one go about selecting a good switch? "The more expensive
the better" is a somewhat unsatisfying heuristic!

-- 
Rahul



Re: [OMPI users] job runs with mpirun on a node but not if submitted via Torque.

2009-04-01 Thread Rahul Nabar
On Wed, Apr 1, 2009 at 1:13 AM, Ralph Castain  wrote:
> So I gather that by "direct" you mean that you don't get an allocation from
> Maui before running the job, but for the other you do? Otherwise, OMPI
> should detect the that it is running under Torque and automatically use the
> Torque launcher unless directed to do otherwise.
>

I think I've figured out the sore point: it seems "ulimit" is needed.
Things seem sensitive to where exactly I put the ulimit directive,
though. Funnily enough, the nodes reported an unlimited stack before
too, but putting this extra directive in seems to have helped!

I'm doing more testing to be sure that the problem has been solved!

Thanks for the leads guys!

-- 
Rahul


Re: [OMPI users] job runs with mpirun on a node but not if submitted via Torque.

2009-04-01 Thread Rahul Nabar
2009/3/31 Ralph Castain :
> I have no idea why your processes are crashing when run via Torque - are you
> sure that the processes themselves crash? Are they segfaulting - if so, can
> you use gdb to find out where?

I have to admit I'm a newbie with gdb. I am trying to recompile my
code with "ifort -g ..." so that I can use gdb.

But the code only crashes in this specific Torque+MPI combination. How
do I use gdb in conjunction with MPI? I'm not really sure. Any
tutorials?
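
From what I have read so far, a couple of approaches, sketched here and
untested by me (./my_exe is a placeholder for my actual executable):

# 1. one xterm+gdb per rank (needs X forwarding, only sane for a few ranks)
mpirun -np 8 xterm -e gdb ./my_exe

# 2. let it crash and inspect the core file
ulimit -c unlimited          # in the Torque job script, before mpirun
gdb ./my_exe core.<pid>      # then 'bt' for a backtrace

# 3. attach to a hung or running rank on its node
gdb -p <pid-of-rank>         # then 'bt'

Does that sound like the right direction?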

-- 
Rahul


Re: [OMPI users] job runs with mpirun on a node but not if submitted via Torque.

2009-03-31 Thread Rahul Nabar
2009/3/31 Ralph Castain :
> It is very hard to debug the problem with so little information. We
> regularly run OMPI jobs on Torque without issue.

Another small thing that I noticed; not sure if it is relevant.

When the job starts running there is an orted process. The args to this
process are slightly different depending on whether the job was
submitted through Torque or started directly on a node. Could this be
an issue? Just a thought.

The essential difference seems to be that the Torque run has the
--no-daemonize option whereas the direct run has a --set-sid option. I
got these via ps after I submitted an interactive Torque job.

Do they matter at all? Full ps output snippets are reproduced below.
Some other numbers also look different on closer inspection, but that
might be by design.

###via Torque; segfaults. ##
rpnabar  11287  0.1  0.0  24680  1828 ?Ss   21:04   0:00 orted
--no-daemonize --bootproxy 1 --name 0.0.1 --num_procs 2 --vpid_start 0
--nodename node17 --universe rpnabar@node17:default-universe-11286
--nsreplica "0.0.0;tcp://10.0.0.17:45839" --gprreplica
"0.0.0;tcp://10.0.0.17:45839"
##


##direct MPI run; this works OK
rpnabar  11026  0.0  0.0  24676  1712 ?Ss   20:52   0:00 orted
--bootproxy 1 --name 0.0.1 --num_procs 2 --vpid_start 0 --nodename
node17 --universe rpnabar@node17:default-universe-11024 --nsreplica
"0.0.0;tcp://10.0.0.17:34716" --gprreplica
"0.0.0;tcp://10.0.0.17:34716" --set-sid
##


Re: [OMPI users] job runs with mpirun on a node but not if submitted via Torque.

2009-03-31 Thread Rahul Nabar
2009/3/31 Ralph Castain :
>
> Information would be most helpful - the information we really need is
> specified here: http://www.open-mpi.org/community/help/

Output of "ompi_info --all"  is attached in a file.


echo $LD_LIBRARY_PATH
/usr/local/ompi-ifort/lib:/opt/intel/fce/10.1.018/lib:/opt/intel/mkl/10.0.4.023/lib/em64t:/opt/intel/cce/10.1.018/lib

which mpirun
/usr/local/ompi-ifort/bin/mpirun

which mpiexec
/usr/local/ompi-ifort/bin/mpiexec

These three things are invariant inside and outside Torque, so they are
unlikely to be the issue.

I am setting no MCA parameters explicitly (at least none that I
consciously know of!). Is there any way of obtaining a dump of the
environment of a running job?
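
The simplest thing I could think of, a sketch, is to capture it from
inside an interactive job:

qsub -I -l nodes=1:ppn=8
env | sort > ~/env_inside_torque.txt                 # what the Torque shell sees
mpirun -np 1 env | sort > ~/env_under_mpirun.txt     # what a launched process sees
grep -iE 'OMPI|PBS|LD_LIBRARY' ~/env_under_mpirun.txt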

Just a plain-old Gigabit ethernet network.

Maybe this helps a bit? Feel free to instruct me to run any more
diagnostic commands. I'm essentially the "sys admin" on our tiny
cluster here, so I do have root access and can try any tweaks or
suggestions you guys might have. Thanks again!

-- 
Rahul
Open MPI: 1.2.7
   Open MPI SVN revision: r19401
Open RTE: 1.2.7
   Open RTE SVN revision: r19401
OPAL: 1.2.7
   OPAL SVN revision: r19401
   MCA backtrace: execinfo (MCA v1.0, API v1.0, Component v1.2.7)
   MCA paffinity: linux (MCA v1.0, API v1.0, Component v1.2.7)
   MCA maffinity: first_use (MCA v1.0, API v1.0, Component v1.2.7)
   MCA timer: linux (MCA v1.0, API v1.0, Component v1.2.7)
 MCA installdirs: env (MCA v1.0, API v1.0, Component v1.2.7)
 MCA installdirs: config (MCA v1.0, API v1.0, Component v1.2.7)
   MCA allocator: basic (MCA v1.0, API v1.0, Component v1.0)
   MCA allocator: bucket (MCA v1.0, API v1.0, Component v1.0)
MCA coll: basic (MCA v1.0, API v1.0, Component v1.2.7)
MCA coll: self (MCA v1.0, API v1.0, Component v1.2.7)
MCA coll: sm (MCA v1.0, API v1.0, Component v1.2.7)
MCA coll: tuned (MCA v1.0, API v1.0, Component v1.2.7)
  MCA io: romio (MCA v1.0, API v1.0, Component v1.2.7)
   MCA mpool: rdma (MCA v1.0, API v1.0, Component v1.2.7)
   MCA mpool: sm (MCA v1.0, API v1.0, Component v1.2.7)
 MCA pml: cm (MCA v1.0, API v1.0, Component v1.2.7)
 MCA pml: ob1 (MCA v1.0, API v1.0, Component v1.2.7)
 MCA bml: r2 (MCA v1.0, API v1.0, Component v1.2.7)
  MCA rcache: vma (MCA v1.0, API v1.0, Component v1.2.7)
 MCA btl: self (MCA v1.0, API v1.0.1, Component v1.2.7)
 MCA btl: sm (MCA v1.0, API v1.0.1, Component v1.2.7)
 MCA btl: tcp (MCA v1.0, API v1.0.1, Component v1.0)
MCA topo: unity (MCA v1.0, API v1.0, Component v1.2.7)
 MCA osc: pt2pt (MCA v1.0, API v1.0, Component v1.2.7)
  MCA errmgr: hnp (MCA v1.0, API v1.3, Component v1.2.7)
  MCA errmgr: orted (MCA v1.0, API v1.3, Component v1.2.7)
  MCA errmgr: proxy (MCA v1.0, API v1.3, Component v1.2.7)
 MCA gpr: null (MCA v1.0, API v1.0, Component v1.2.7)
 MCA gpr: proxy (MCA v1.0, API v1.0, Component v1.2.7)
 MCA gpr: replica (MCA v1.0, API v1.0, Component v1.2.7)
 MCA iof: proxy (MCA v1.0, API v1.0, Component v1.2.7)
 MCA iof: svc (MCA v1.0, API v1.0, Component v1.2.7)
  MCA ns: proxy (MCA v1.0, API v2.0, Component v1.2.7)
  MCA ns: replica (MCA v1.0, API v2.0, Component v1.2.7)
 MCA oob: tcp (MCA v1.0, API v1.0, Component v1.0)
 MCA ras: dash_host (MCA v1.0, API v1.3, Component v1.2.7)
 MCA ras: localhost (MCA v1.0, API v1.3, Component v1.2.7)
 MCA ras: gridengine (MCA v1.0, API v1.3, Component v1.2.7)
 MCA ras: slurm (MCA v1.0, API v1.3, Component v1.2.7)
 MCA ras: tm (MCA v1.0, API v1.3, Component v1.2.7)
 MCA rds: hostfile (MCA v1.0, API v1.3, Component v1.2.7)
 MCA rds: proxy (MCA v1.0, API v1.3, Component v1.2.7)
 MCA rds: resfile (MCA v1.0, API v1.3, Component v1.2.7)
   MCA rmaps: round_robin (MCA v1.0, API v1.3, Component v1.2.7)
MCA rmgr: proxy (MCA v1.0, API v2.0, Component v1.2.7)
MCA rmgr: urm (MCA v1.0, API v2.0, Component v1.2.7)
 MCA rml: oob (MCA v1.0, API v1.0, Component v1.2.7)
 MCA pls: proxy (MCA v1.0, API v1.3, Component v1.2.7)
 MCA pls: gridengine (MCA v1.0, API v1.3, Component v1.2.7)
 MCA pls: rsh (MCA v1.0, API v1.3, Component v1.2.7)
 MCA pls: slurm (MCA v1.0, API v1.3, Component v1.2.7)
 MCA pls: tm (MCA v1.0, API v1.3, Component v1.2.7)
 MCA sds: env (MCA v1.0, API v1.0, Component v1.2.7)
 MCA sds: seed (MCA v1.0, API v1.0, 

Re: [OMPI users] job runs with mpirun on a node but not if submitted via Torque.

2009-03-31 Thread Rahul Nabar
2009/3/31 Ralph Castain :

> It is very hard to debug the problem with so little information. We

Thanks Ralph! I'm sorry my first post lacked enough specifics. I'll
try my best to fill you guys in on as much debug info as I can.

> regularly run OMPI jobs on Torque without issue.

So do we. In fact, on the very same cluster, other jobs using the same
code run fine. It's only with this one type of job that I am seeing
this strange behavior. For those who are curious, the code I am trying
to run is a computational chemistry code called DACAPO, developed at
CAMd at the Technical University of Denmark. Link:
https://wiki.fysik.dtu.dk/dacapo

Hardware Architecture:
Dell rack servers: PowerEdge SC1435.
2.2GHz Opteron 1Ghz. (AMD)
8 cpus per node.

> Are you getting an allocation from somewhere for the nodes?
>If so, are you
> using Moab to get it?

We are using Torque as the scheduler and Maui as the master scheduler.

 >Do you have a $PBS_NODEFILE in your environment?

Yes, I do. For a test case I was trying to run on a single node (which
has 8 cpus).

If I cat $PBS_NODEFILE I get the name "node17" 8 times.

I did dump the environment variables from a running job. I get:
PBS_NODEFILE="/var/spool/torque/aux//4609.uranus.che.foo.edu"

> I have no idea why your processes are crashing when run via Torque - are you
> sure that the processes themselves crash?
>Are they segfaulting - if so, can

Yes, they are indeed segfaulting. And only when I run them through Torque.

forrtl: error (78): process killed (SIGTERM)
mpirun noticed that job rank 5 with PID 10580 on node node17 exited on
signal 11 (Segmentation fault).
#

The exact same job runs like a charm if I launch it via mpirun on the
node, outside of Torque.


> you use gdb to find out where?

I can try that. I haven't used gdb much before. In case it matters, the
executable is Fortran source compiled with the Intel Fortran Compiler
(ifort). That executable runs fine for all other cases except this one.

Maybe this helps more?

-- 
Rahul