[OMPI users] IO issue with OpenMPI 1.4.1 and earlier versions

2011-09-12 Thread Steve Jones
Hi. We've run into an IO issue with 1.4.1 and earlier versions. We're able to reproduce the issue in around 120 lines of code, which should help. I'd like to find out whether there's something we're simply doing incorrectly with the build or whether it's in fact a known bug. I've included the following in order: 1.

Re: [OMPI users] mpiexec option for node failure

2011-09-12 Thread Ralph Castain
We don't have anything similar in OMPI. There are fault tolerance modes, but not like the one you describe. On Sep 12, 2011, at 5:52 PM, Rob Stewart wrote: > Hi, > > I have implemented a simple fault tolerant ping pong C program with MPI, > here: http://pastebin.com/7mtmQH2q > > MPICH2

Re: [OMPI users] Question on using rsh

2011-09-12 Thread Ralph Castain
The two are synonyms for each other - they resolve to the identical variable, so there isn't anything different about them. Not sure what the issue might be, but I would check for a typo - we don't check that mca params are spelled correctly, nor do we check for params that don't exist (e.g.,

[OMPI users] mpiexec option for node failure

2011-09-12 Thread Rob Stewart
Hi, I have implemented a simple fault tolerant ping pong C program with MPI, here: http://pastebin.com/7mtmQH2q MPICH2 offers a parameter with mpiexec: $ mpiexec -disable-auto-cleanup .. as described here: http://trac.mcs.anl.gov/projects/mpich2/ticket/1421 It is fault tolerant in the

Re: [OMPI users] EXTERNAL: Re: qp memory allocation problem

2011-09-12 Thread Nathan Hjelm
On Mon, 12 Sep 2011, Blosch, Edwin L wrote: It was set to 0 previously. We've set it to 4 and restarted some service and now it works. So both your and Samuel's suggestions worked. On another system, slightly older, it was defaulted to 3 instead of 0, and apparently that explains why the

Re: [OMPI users] EXTERNAL: Re: qp memory allocation problem

2011-09-12 Thread Blosch, Edwin L
It was set to 0 previously. We've set it to 4 and restarted some service and now it works. So both your and Samuel's suggestions worked. On another system, slightly older, it was defaulted to 3 instead of 0, and apparently that explains why the job always ran before and on this newer system

[OMPI users] Question on using rsh

2011-09-12 Thread Blosch, Edwin L
I have a hello world program that runs without prompting for a password with plm_rsh_agent but not with orte_rsh_agent; that is, it runs, but only after prompting for a password: /bin/mpirun --machinefile mpihosts.dat -np 16 -mca plm_rsh_agent /usr/bin/rsh ./test_setup Hello from process

[OMPI users] OpenMPI 1.5.4 with VS2008 and example code fails at orte_init

2011-09-12 Thread Riku
Hello all, I recently successfully compiled Open MPI 1.5.4 with Visual Studio 2008 for the 32-bit platform. Because of some adaptations (yet to be added in) I cannot use the provided binary release. For initial testing I also compiled the Hello World example code (hello_cxx.cc). The program

Re: [OMPI users] EXTERNAL: Re: qp memory allocation problem

2011-09-12 Thread Samuel K. Gutierrez
On Sep 12, 2011, at 10:16 AM, Blosch, Edwin L wrote: > Samuel, > > This worked. Great! > Did this magic line disable the use of per-peer queue pairs? Yes, it sure did. > I have seen a previous post by Jeff that explains what this line does > generally, but I didn’t study the post in

Re: [OMPI users] openmpi 1.5.4 paffinity with Magny-Cours

2011-09-12 Thread Jeff Squyres
Great! We'll get that in the next OMPI v1.5.x release. On Sep 12, 2011, at 2:23 PM, Kaizaad Bilimorya wrote: > > On Fri, 9 Sep 2011, Brice Goglin wrote: > >> This looks like the exact same issue. Did you try the patch(es) I sent >> earlier? >> See

Re: [OMPI users] openmpi 1.5.4 paffinity with Magny-Cours

2011-09-12 Thread Kaizaad Bilimorya
On Fri, 9 Sep 2011, Brice Goglin wrote: This looks like the exact same issue. Did you try the patch(es) I sent earlier? See http://www.open-mpi.org/community/lists/users/2011/09/17159.php If it's not enough, try adding the other patch from

Re: [OMPI users] qp memory allocation problem

2011-09-12 Thread Jeff Squyres
On Sep 12, 2011, at 12:39 PM, Shamis, Pavel wrote: > OMPI Developers: > > Maybe we should consider disabling the use of per-peer queue pairs by > default. Do they buy us anything? For what it is worth, we have stopped > using them on all of our large systems here at LANL. > > It is

Re: [OMPI users] EXTERNAL: Re: qp memory allocation problem

2011-09-12 Thread Nathan Hjelm
On Mon, 12 Sep 2011, Blosch, Edwin L wrote: Nathan, I found this parameter under /sys/module/mlx4_core/parameters. How do you incorporate a changed value? What to restart/rebuild? Forgot to say that you will need to reload the mlx4_core module by either rebooting or

Re: [OMPI users] qp memory allocation problem

2011-09-12 Thread Blosch, Edwin L
Actually we were already aware of this FAQ and already have the limits set to hard and soft unlimited in the PAM limits.conf as well as in the pbs_mom resource manager startup script. We encountered those issues a few years ago and definitely are aware of having process limits set too low. I

Re: [OMPI users] EXTERNAL: Re: qp memory allocation problem

2011-09-12 Thread Nathan Hjelm
On Mon, 12 Sep 2011, Blosch, Edwin L wrote: Nathan, I found this parameter under /sys/module/mlx4_core/parameters. How do you incorporate a changed value? What to restart/rebuild? Add the following line to /etc/modprobe (replace X with the appropriate value for log_mtts_per_seg):

Re: [OMPI users] EXTERNAL: Re: qp memory allocation problem

2011-09-12 Thread Blosch, Edwin L
Nathan, I found this parameter under /sys/module/mlx4_core/parameters. How do you incorporate a changed value? What to restart/rebuild? -Original Message- From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf Of Nathan Hjelm Sent: Monday, September 12,

Re: [OMPI users] qp memory allocation problem

2011-09-12 Thread Shamis, Pavel
An alternative solution for the problem is updating your memory limits. Please see below: http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages Apparently your memory limit is low and the driver fails to create QPs. What happens when you add the following to your mpirun command? -mca

Re: [OMPI users] EXTERNAL: Re: qp memory allocation problem

2011-09-12 Thread Blosch, Edwin L
Samuel, This worked. Did this magic line disable the use of per-peer queue pairs? I have seen a previous post by Jeff that explains what this line does generally, but I didn't study the post in detail, so if you could provide a little explanation I would appreciate it. Ed From:

Re: [OMPI users] OpenIB error messages: reporting the default or telling you what's happening?

2011-09-12 Thread Jeff Squyres
FWIW, the default for the ib_timeout is 20 in both v1.4.x and v1.5.x. As Ralph said, ompi_info will show the current value -- not the default value. Of course, the current value will be the default value, unless it has been overridden. In OMPI v1.5, ompi_info should indicate where the value

Re: [OMPI users] qp memory allocation problem

2011-09-12 Thread Teng Ma
I ran into a similar problem, possibly related to QP memory allocation. I ran an allgather across 768 processes with a 1 MB message size, bound by node (forcing every edge of Tuned's ring algorithm through IB links). The IMB test hung for more than 3 hours without any output. I don't know

Re: [OMPI users] qp memory allocation problem

2011-09-12 Thread Nathan Hjelm
I also recommend checking the log_mtts_per_seg parameter to the mlx4 module. This parameter controls how much memory can be registered for use by the mlx4 driver and it should be in the range 1-5 (or 0-7 depending on the version of the mlx4 driver). I recommend the parameter be set such that
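To make the effect of this parameter concrete, here is a rough sketch of how the registerable memory scales with it. It assumes the relationship given in the Open MPI FAQ on registered memory for mlx4 (number of MTT segments times MTTs per segment times page size). The log_num_mtt value of 20 and the 4 KiB page size are assumptions for illustration only and are not taken from this thread; check the actual values under /sys/module/mlx4_core/parameters.

```python
# Sketch: estimate how much memory the mlx4 driver can register.
# Assumed relationship (from the Open MPI FAQ, not from this thread):
#   max_reg_mem = 2^log_num_mtt * 2^log_mtts_per_seg * page_size

GIB = 1 << 30


def max_registerable_bytes(log_num_mtt, log_mtts_per_seg, page_size=4096):
    """Return the estimated upper bound on registerable memory in bytes."""
    return (1 << log_num_mtt) * (1 << log_mtts_per_seg) * page_size


if __name__ == "__main__":
    # log_num_mtt=20 is a hypothetical value chosen for illustration.
    for seg in (0, 3, 4):
        gib = max_registerable_bytes(20, seg) / GIB
        print(f"log_mtts_per_seg={seg}: about {gib:.0f} GiB registerable")
```

Each increment of log_mtts_per_seg doubles the estimate, which is consistent with a value of 0 being too small for a large job while 3 or 4 is enough.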

Re: [OMPI users] qp memory allocation problem

2011-09-12 Thread Samuel K. Gutierrez
Hi, This problem can be caused by a variety of things, but I suspect our default queue pair (QP) parameters aren't helping the situation :-). What happens when you add the following to your mpirun command? -mca btl_openib_receive_queues S,4096,128:S,12288,128:S,65536,12 OMPI Developers:
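For readers unfamiliar with the receive_queues syntax, the sketch below splits the value suggested above into its parts. It assumes the usual layout of this MCA parameter: colon-separated queues, each a comma-separated list whose first field is the queue type (P for per-peer, S for shared receive queue, X for XRC), followed by the buffer size in bytes and the number of buffers. Any further fields are kept unparsed. The helper names are made up for this example and are not part of Open MPI.

```python
# Sketch: parse a btl_openib_receive_queues value into its queues.
# Helper names are hypothetical; only the first three fields are interpreted.
from typing import NamedTuple, Tuple


class ReceiveQueue(NamedTuple):
    kind: str               # "P" (per-peer), "S" (shared), or "X" (XRC)
    buffer_bytes: int       # size of each receive buffer
    num_buffers: int        # number of buffers posted
    extra: Tuple[int, ...]  # remaining optional fields, left uninterpreted


def parse_receive_queues(spec):
    """Split 'S,4096,128:S,12288,128' style specs into ReceiveQueue tuples."""
    queues = []
    for entry in spec.split(":"):
        fields = entry.split(",")
        if len(fields) < 3:
            raise ValueError(f"queue spec too short: {entry!r}")
        kind = fields[0].strip().upper()
        if kind not in ("P", "S", "X"):
            raise ValueError(f"unknown queue type: {kind!r}")
        numbers = tuple(int(f) for f in fields[1:])
        queues.append(ReceiveQueue(kind, numbers[0], numbers[1], numbers[2:]))
    return queues


def uses_per_peer(spec):
    """True if any queue in the spec is a per-peer queue pair."""
    return any(q.kind == "P" for q in parse_receive_queues(spec))


if __name__ == "__main__":
    suggested = "S,4096,128:S,12288,128:S,65536,12"
    for q in parse_receive_queues(suggested):
        print(q.kind, q.buffer_bytes, q.num_buffers)
    print("per-peer queues in use:", uses_per_peer(suggested))
```

The suggested value contains only S entries, so no per-peer queue pairs are requested.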

[OMPI users] qp memory allocation problem

2011-09-12 Thread Blosch, Edwin L
I am getting this error message below and I don't know what it means or how to fix it. It only happens when I run on a large number of processes, e.g. 960. Things work fine on 480, and I don't think the application has a bug. Any help is appreciated...

Re: [OMPI users] OpenIB error messages: reporting the default or telling you what's happening?

2011-09-12 Thread Shamis, Pavel
> > * btl_openib_ib_retry_count - The number of times the sender will > attempt to retry (defaulted to 7, the maximum value). > * btl_openib_ib_timeout - The local ACK timeout parameter (defaulted > to 10). The actual timeout value used is calculated as: > Actually I'm surprised that

Re: [OMPI users] error on malloc

2011-09-12 Thread Jeff Squyres
This usually means that you have a memory error of some kind in your application. Have you tried running your application through a memory-checking debugger, such as valgrind? On Sep 5, 2011, at 3:48 AM, Jai Dayal wrote: > Hi all, > I've been beating my head on this for quite a while now.

Re: [hwloc-users] Process and thread binding

2011-09-12 Thread Samuel Thibault
Gabriele Fatigati wrote on Mon 12 Sep 2011 15:50:45 +0200: > thanks very much for your explanations. But I don't understand why a process > inherits the core binding of its threads On Linux, there is no such thing as "process binding", only "thread binding". hwloc emulates the former by using the

Re: [hwloc-users] Process and thread binding

2011-09-12 Thread Gabriele Fatigati
Ok Brice, thanks very much for your explanations. But I don't understand why a process inherits the core binding of its threads, according to your example: >It worked because you never mixed it with single thread binding. If you bind process X to >coreA and then thread Y of process X to coreB, what you

Re: [OMPI users] Infiniband Error

2011-09-12 Thread Yevgeny Kliteynik
This means that you have some problem on that node, and it's probably unrelated to Open MPI. Bad cable? Bad port? FW/driver in some bad state? Do other IB performance tests work OK on this node? Try rebooting the node. -- YK On 12-Sep-11 7:52 AM, Ahsan Ali wrote: > Hello all > > I am getting

Re: [OMPI users] OpenIB error messages: reporting the default or telling you what's happening?

2011-09-12 Thread Ralph Castain
I ask because those are set via MCA param. So ompi_info would show the "default" if the param isn't set in the environment or param file, but the app could see something different if you set the param on the mpirun cmd line. Those are the default values, but it looks like the MCA param is being

Re: [hwloc-users] Process and thread binding

2011-09-12 Thread Brice Goglin
On 12/09/2011 14:17, Gabriele Fatigati wrote: > Mm, and why? In a hybrid code (MPI + OpenMP), my idea is to bind a > single MPI process to one core, and its threads to other cores. > Otherwise all of my threads run on a single core. > The usual way to do that is to first bind the

Re: [OMPI users] Problem with MPI_BARRIER

2011-09-12 Thread Ghislain Lartigue
Thank you: this is very enlightening. I will try this and let you know... Ghislain. On 9 Sept. 2011, at 18:00, Eugene Loh wrote: > > > On 9/8/2011 11:47 AM, Ghislain Lartigue wrote: >> I guess you're perfectly right! >> I will try to test it tomorrow by putting a call system("wait(X)) befor

Re: [hwloc-users] Process and thread binding

2011-09-12 Thread Gabriele Fatigati
Ok, >A process is a container that contains one or several threads. The binding is where >something can run. It's normal that "where a process can run" is "where any of its >threads can run", which means it's the logical OR of their binding. Ok, now it's clear. I'm using
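The "logical OR" rule quoted above can be illustrated with plain bitmasks. This is not hwloc code: Python integers stand in for hwloc bitmaps, and the helper names are made up for the example.

```python
# Sketch: a process's binding is the OR of its threads' bindings.
# Plain integers stand in for hwloc bitmaps; bit i set means "core i allowed".
from functools import reduce
from operator import or_


def cpuset(*cores):
    """Build a bitmask with one bit set per allowed core index."""
    mask = 0
    for core in cores:
        mask |= 1 << core
    return mask


def process_binding(thread_bindings):
    """Where the process can run: the union (OR) of all thread bindings."""
    return reduce(or_, thread_bindings, 0)


def cores_in(mask):
    """List the core indexes set in a bitmask, in ascending order."""
    return [i for i in range(mask.bit_length()) if (mask >> i) & 1]


if __name__ == "__main__":
    main_thread = cpuset(0)  # main thread bound to core 0
    thread_y = cpuset(1)     # a second thread bound to core 1
    print(cores_in(process_binding([main_thread, thread_y])))
```

Once a second thread is bound to a different core, querying the process binding reports both cores, even though each thread is still bound to a single core.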

Re: [hwloc-users] Process and thread binding

2011-09-12 Thread Gabriele Fatigati
Hi Brice, but the manual does not say that get_cpubind() returns the logical OR of the bindings of all threads... I always understood that it returns the binding of the caller, where the caller can be a process or a thread. I'm mixing process and thread binding, and I've noticed that if the process and

Re: [hwloc-users] Process and thread binding

2011-09-12 Thread Brice Goglin
On 12/09/2011 13:29, Gabriele Fatigati wrote: > Hi Brice, > > I'm so confused. > > I'm binding MPI processes with set_cpu_bind and it works well. The > problem is when I try to bind both the process and its threads. > > It seems that binding a thread influences the binding of the main thread. > > And from the hwloc manual:

[OMPI users] Infiniband Error

2011-09-12 Thread Ahsan Ali
Hello all, I am getting the following error during an application run, which causes it to crash. *[[36944,1],41][btl_openib_component.c:3227:handle_wc] from compute-01-19.private.dns.zone to: compute-01-04 error polling LP CQ with status RETRY EXCEEDED ERROR status number 12 for wr_id 167703304 opcode