Re: [OMPI users] IMB-MPI broadcast test stalls for large core counts: debug ideas?

2010-09-01 Thread Rahul Nabar
On Wed, Aug 25, 2010 at 12:14 PM, Jeff Squyres wrote: > It would simplify testing if you could get all the eth0's to be of one type > and on the same subnet, and the same for eth1. > > Once you do that, try using just one of the networks by telling OMPI to use > only one of

Re: [OMPI users] IMB-MPI broadcast test stalls for large core counts: debug ideas?

2010-08-26 Thread Rahul Nabar
On Wed, Aug 25, 2010 at 12:14 PM, Jeff Squyres wrote: > Once you do that, try using just one of the networks by telling OMPI to use > only one of the devices, something like this: > >    mpirun --mca btl_tcp_if_include eth0 ... Thanks Jeff! Just tried the exact test that you
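
The suggestion quoted above can be scripted when the same test has to be repeated for several interfaces. The following is a minimal Python sketch, not a command taken from the thread: only `--mca btl_tcp_if_include eth0` appears in the original message, while the helper name `mpirun_command`, the process count, and the benchmark path and argument are placeholders chosen for illustration.

```python
import shlex


def mpirun_command(interface, num_procs, program_args):
    """Build an mpirun argv that restricts the TCP BTL to one interface.

    Hypothetical helper: only the btl_tcp_if_include option comes from
    the quoted message; everything else is an illustrative assumption.
    """
    return [
        "mpirun",
        "--mca", "btl_tcp_if_include", interface,
        "-np", str(num_procs),
        *program_args,
    ]


# Placeholder benchmark binary and test name; substitute your own.
cmd = mpirun_command("eth0", 256, ["./IMB-MPI1", "Bcast"])
print(shlex.join(cmd))
```

Keeping the command as a list makes it easy to pass to `subprocess.run(cmd)` and to swap `eth0` for `eth1` when comparing the two networks.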

Re: [OMPI users] IMB-MPI broadcast test stalls for large core counts: debug ideas?

2010-08-25 Thread Rahul Nabar
On Wed, Aug 25, 2010 at 6:41 AM, John Hearns wrote: > You could sort that out with udev rules on each machine. Sure. I'd always wanted consistent names for the eth interfaces when I set up the cluster but I couldn't get udev to co-operate. Maybe this time! Let me try. >

Re: [OMPI users] IMB-MPI broadcast test stalls for large core counts: debug ideas?

2010-08-25 Thread Rahul Nabar
On Thu, Aug 19, 2010 at 9:03 PM, Rahul Nabar <rpna...@gmail.com> wrote: > -- > gather: >    NP256    hangs >    NP128    hangs >    NP64    hangs >    NP32    OK > > Note: "gather" always

Re: [OMPI users] IMB-MPI broadcast test stalls for large core counts: debug ideas?

2010-08-24 Thread Rahul Nabar
On Mon, Aug 23, 2010 at 9:43 PM, Richard Treumann wrote: > Bugs are always a possibility but unless there is something very unusual > about the cluster and interconnect or this is an unstable version of MPI, it > seems very unlikely this use of MPI_Bcast with so few tasks and

Re: [OMPI users] IMB-MPI broadcast test stalls for large core counts: debug ideas?

2010-08-24 Thread Rahul Nabar
On Mon, Aug 23, 2010 at 8:39 PM, Randolph Pullen wrote: > > I have had a similar load related problem with Bcast. Thanks Randolph! That's interesting to know! What was the hardware you were using? Does your bcast fail at the exact same point too? > > I don't know

Re: [OMPI users] IMB-MPI broadcast test stalls for large core counts: debug ideas?

2010-08-24 Thread Rahul Nabar
On Mon, Aug 23, 2010 at 6:39 PM, Richard Treumann wrote: > It is hard to imagine how a total data load of 41,943,040 bytes could be a > problem. That is really not much data. By the time the BCAST is done, each > task (except root) will have received a single half meg message
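
As a quick consistency check on the figure quoted above, the total can be divided by the per-task message size. This sketch assumes that "half meg" means 512 KiB (524,288 bytes); the quoted message does not spell that out, and the receiver count printed here is derived from that assumption rather than stated in the thread.

```python
TOTAL_BYTES = 41_943_040   # total data load quoted in the message
HALF_MEG = 512 * 1024      # assumption: "half meg" = 512 KiB

# How many half-meg messages add up to the quoted total?
receivers, remainder = divmod(TOTAL_BYTES, HALF_MEG)
total_mib = TOTAL_BYTES / 2**20

print(f"receiving tasks implied: {receivers} (remainder {remainder})")
print(f"total volume: {total_mib} MiB")
```

Under that assumption the total divides evenly, which supports the point being made in the quoted text: the aggregate volume is small for a modern interconnect.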

[OMPI users] IMB-MPI broadcast test stalls for large core counts: debug ideas?

2010-08-19 Thread Rahul Nabar
My Intel IMB-MPI tests stall, but only in very specific cases: larger packet sizes + large core counts. Only happens for bcast, gather and exchange tests. Only for the larger core counts (~256 cores). Other tests like pingpong and sendrecv run fine even with larger core counts. e.g. This bcast

[OMPI users] MPI broadcast test fails only when I run within a torque job

2010-07-28 Thread Rahul Nabar
I'm not sure if this is a torque issue or an MPI issue. If I log in to a compute-node and run the standard mpi broadcast test it returns no error, but if I run it through PBS/Torque I get an error (see below). The nodes that return the error are fairly random. Even the same set of nodes will run a

[OMPI users] subnet specification for MPI when multiple networks are present

2010-06-22 Thread Rahul Nabar
I have compute-nodes with twin eth interfaces 1GigE and 10GigE. In the OpenMPI docs I found an instruction: " It is therefore very important that if active ports on the same host are on physically separate fabrics, they must have different subnet IDs." Is this the same "subnet" that is set via
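
If the "subnet" in question is the ordinary IP subnet (the question above leaves this open), one way to check whether two interfaces fall in different subnets is Python's standard `ipaddress` module. This is only a sketch: the addresses and the /24 prefix lengths below are hypothetical and should be replaced with the values reported by the nodes themselves.

```python
import ipaddress


def subnet_of(address_with_prefix):
    """Return the IP network that an interface address belongs to."""
    return ipaddress.ip_interface(address_with_prefix).network


# Hypothetical addresses for the 1GigE and 10GigE ports of one node.
gige = subnet_of("192.168.1.17/24")
ten_gige = subnet_of("10.0.1.17/24")

print("1GigE subnet: ", gige)
print("10GigE subnet:", ten_gige)
print("overlapping:  ", gige.overlaps(ten_gige))
```

Two interfaces on physically separate fabrics should report non-overlapping networks here; if they overlap, the netmasks or addresses need another look.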

Re: [OMPI users] MPI daemon error

2010-05-29 Thread Rahul Nabar
On Sat, May 29, 2010 at 8:19 AM, Ralph Castain wrote: > From your other note, it sounds like #3 might be the problem here. Do you > have some nodes that are configured with "eth0" pointing to your 10.x > network, and other nodes with "eth0" pointing to your 192.x

[OMPI users] which eth interface does mpi use by default when torque supplies it with a hostfile?

2010-05-28 Thread Rahul Nabar
Each of our servers has twin eth cards: 1GigE and 10GigE. How does openmpi decide which card to send messages on? One of the cards is on a 10.0. IP address subnet whereas the other card is on a 192.168 address subnet. Can I select one or the other by specifying the --host option with

Re: [OMPI users] MPI daemon error

2010-05-28 Thread Rahul Nabar
On Fri, May 28, 2010 at 3:53 PM, Ralph Castain wrote: > What environment are you running on the cluster, and what version of OMPI? > Not sure that error message is coming from us. openmpi-1.4.1 The cluster runs PBS-Torque. So I guess, that could be the other error source. --

[OMPI users] MPI daemon error

2010-05-28 Thread Rahul Nabar
Often when I try to run larger jobs on our cluster I get an error of this sort from some of the compute-servers: "eu260 - daemon did not report back when launched". It does not happen every time, but pretty often. Any ideas what could be wrong? The node seems pingable and I could log in

[OMPI users] Disabling irqbalance service for better performance of MPI jobs

2009-12-14 Thread Rahul Nabar
I have already been using the processor and memory affinity options to bind the processes to specific cores. Does the presence of the irqbalance daemon matter? I saw some recommendation to disable this for a performance boost. Or is this irrelevant? I am running HPC jobs with no over- nor

Re: [OMPI users] profile the performance of a MPI code: how much traffic is being generated?

2009-10-02 Thread Rahul Nabar
On Wed, Sep 30, 2009 at 3:16 PM, Peter Kjellstrom wrote: > Not MPI aware, but, you could watch network traffic with a tool such as > collectl in real-time. collectl is a great idea. I am going to try that now. -- Rahul

Re: [OMPI users] profile the performance of a MPI code: how much traffic is being generated?

2009-09-29 Thread Rahul Nabar
On Tue, Sep 29, 2009 at 1:33 PM, Anthony Chan wrote: > > Rahul, > > > What errors did you see when compiling MPE for OpenMPI? > Can you send me the configure and make outputs as seen on > your terminal? Also, what version of MPE are you using > with OpenMPI? Version:

Re: [OMPI users] profile the performance of a MPI code: how much traffic is being generated?

2009-09-29 Thread Rahul Nabar
On Tue, Sep 29, 2009 at 10:40 AM, Eugene Loh wrote: > to know.  It sounds like you want to be able to watch some % utilization of > a hardware interface as the program is running.  I *think* these tools (the > ones on the FAQ, including MPE, Vampir, and Sun Studio) are not of

[OMPI users] profile the performance of a MPI code: how much traffic is being generated?

2009-09-29 Thread Rahul Nabar
I have a code that seems to run about 40% faster when I bond together twin eth interfaces. The question, of course, arises: is it really producing enough traffic to keep twin 1 Gig eth interfaces busy? I don't really believe this but need a way to check. What are good tools to monitor the MPI
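
One MPI-agnostic way to approach the question above on Linux is to sample the per-interface byte counters in /proc/net/dev before and after a run and compare the resulting rate with the link speed. The Python sketch below is an illustration under assumptions, not a tool mentioned in the post: the helper names are hypothetical, the sample text stands in for a real /proc/net/dev read, and the 1 Gbit/s link speed is taken from the "1 Gig" interfaces described.

```python
def parse_net_dev(text):
    """Return {interface: (rx_bytes, tx_bytes)} from /proc/net/dev text."""
    counters = {}
    for line in text.splitlines():
        if ":" not in line:
            continue  # the two header lines contain no colon
        name, fields = line.split(":", 1)
        values = fields.split()
        # Column 0 is received bytes, column 8 is transmitted bytes.
        counters[name.strip()] = (int(values[0]), int(values[8]))
    return counters


def utilization(before, after, seconds, link_bits_per_second):
    """Fraction of link capacity used (rx, tx) over the sample interval."""
    rx = (after[0] - before[0]) * 8 / seconds / link_bits_per_second
    tx = (after[1] - before[1]) * 8 / seconds / link_bits_per_second
    return rx, tx


# Stand-in for open("/proc/net/dev").read(); the numbers are made up.
SAMPLE = (
    "Inter-|   Receive                  |  Transmit\n"
    " face |bytes packets errs drop fifo frame compressed multicast"
    "|bytes packets errs drop fifo colls carrier compressed\n"
    "  eth0: 1000 10 0 0 0 0 0 0 2000 20 0 0 0 0 0 0\n"
)

print(parse_net_dev(SAMPLE))
# Two made-up samples taken one second apart on a 1 Gbit/s link.
print(utilization((0, 0), (125_000_000, 62_500_000), 1.0, 1e9))
```

Sampling each slave interface of the bond separately shows whether either link comes close to saturation during the run.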

Re: [OMPI users] very bad parallel scaling of vasp using openmpi

2009-09-23 Thread Rahul Nabar
On Tue, Aug 18, 2009 at 5:28 PM, Gerry Creager wrote: > Most of that bandwidth is in marketing... Sorry, but it's not a high > performance switch. Well, how does one figure out what exactly is a "high performance switch"? I've found this an exceedingly hard task. Like the

Re: [OMPI users] job runs with mpirun on a node but not if submitted via Torque.

2009-04-01 Thread Rahul Nabar
On Wed, Apr 1, 2009 at 1:13 AM, Ralph Castain wrote: > So I gather that by "direct" you mean that you don't get an allocation from > Maui before running the job, but for the other you do? Otherwise, OMPI > should detect that it is running under Torque and automatically use the

Re: [OMPI users] job runs with mpirun on a node but not if submitted via Torque.

2009-04-01 Thread Rahul Nabar
2009/3/31 Ralph Castain : > I have no idea why your processes are crashing when run via Torque - are you > sure that the processes themselves crash? Are they segfaulting - if so, can > you use gdb to find out where? I have to admit I'm a newbie with gdb. I am trying to recompile

Re: [OMPI users] job runs with mpirun on a node but not if submitted via Torque.

2009-03-31 Thread Rahul Nabar
2009/3/31 Ralph Castain : > It is very hard to debug the problem with so little information. We > regularly run OMPI jobs on Torque without issue. Another small thing that I noticed. Not sure if it is relevant. When the job starts running there is an orte process. The args to this

Re: [OMPI users] job runs with mpirun on a node but not if submitted via Torque.

2009-03-31 Thread Rahul Nabar
2009/3/31 Ralph Castain : > > Information would be most helpful - the information we really need is > specified here: http://www.open-mpi.org/community/help/ Output of "ompi_info --all" is attached in a file. echo $LD_LIBRARY_PATH

Re: [OMPI users] job runs with mpirun on a node but not if submitted via Torque.

2009-03-31 Thread Rahul Nabar
2009/3/31 Ralph Castain : > It is very hard to debug the problem with so little information. We Thanks Ralph! I'm sorry my first post lacked enough specifics. I'll try my best to fill you guys in on as much debug info as I can. > regularly run OMPI jobs on Torque without issue.