Hi.
We've run into an I/O issue with 1.4.1 and earlier versions. We're able to
reproduce the issue in around 120 lines of code. To help, I'd like to find out
whether there's something we're simply doing incorrectly with the build or
whether it's in fact a known bug. I've included the following, in order:
1.
We don't have anything similar in OMPI. There are fault tolerance modes, but
not like the one you describe.
On Sep 12, 2011, at 5:52 PM, Rob Stewart wrote:
> Hi,
>
> I have implemented a simple fault tolerant ping pong C program with MPI,
> here: http://pastebin.com/7mtmQH2q
>
> MPICH2
The two are synonyms for each other: they resolve to the identical variable,
so there isn't anything different between them.
Not sure what the issue might be, but I would check for a typo: we don't check
that MCA params are spelled correctly, nor do we check for params that don't
exist (e.g.,
Hi,
I have implemented a simple fault tolerant ping pong C program with MPI,
here: http://pastebin.com/7mtmQH2q
MPICH2 offers a parameter with mpiexec:
$ mpiexec -disable-auto-cleanup
.. as described here: http://trac.mcs.anl.gov/projects/mpich2/ticket/1421
It is fault tolerant in the
On Mon, 12 Sep 2011, Blosch, Edwin L wrote:
It was set to 0 previously. We've set it to 4 and restarted some service and
now it works. So both your and Samuel's suggestions worked.
On another system, slightly older, it was defaulted to 3 instead of 0, and
apparently that explains why the
It was set to 0 previously. We've set it to 4 and restarted some service and
now it works. So both your and Samuel's suggestions worked.
On another system, slightly older, it was defaulted to 3 instead of 0, and
apparently that explains why the job always ran before and on this newer system
I have a hello world program that runs without prompting for a password with
plm_rsh_agent, but not with orte_rsh_agent; I mean, it runs, but only after
prompting for a password:
/bin/mpirun --machinefile mpihosts.dat -np 16 -mca plm_rsh_agent
/usr/bin/rsh ./test_setup
Hello from process
Hello all,
I recently successfully compiled Open MPI 1.5.4 with Visual Studio
2008 for the 32-bit platform. Because of some adaptations (yet to be
added in) I cannot use the provided binary release.
For initial testing I also compiled the Hello World example code
(hello_cxx.cc). The program
On Sep 12, 2011, at 10:16 AM, Blosch, Edwin L wrote:
> Samuel,
>
> This worked.
Great!
> Did this magic line disable the use of per-peer queue pairs?
Yes, it sure did.
> I have seen a previous post by Jeff that explains what this line does
> generally, but I didn’t study the post in
Great! We'll get that in the next OMPI v1.5.x release.
On Sep 12, 2011, at 2:23 PM, Kaizaad Bilimorya wrote:
>
> On Fri, 9 Sep 2011, Brice Goglin wrote:
>
>> This looks like the exact same issue. Did you try the patch(es) I sent
>> earlier?
>> See
On Fri, 9 Sep 2011, Brice Goglin wrote:
This looks like the exact same issue. Did you try the patch(es) I sent
earlier?
See http://www.open-mpi.org/community/lists/users/2011/09/17159.php
If it's not enough, try adding the other patch from
On Sep 12, 2011, at 12:39 PM, Shamis, Pavel wrote:
> OMPI Developers:
>
> Maybe we should consider disabling the use of per-peer queue pairs by
> default. Do they buy us anything? For what it is worth, we have stopped
> using them on all of our large systems here at LANL.
>
> It is
On Mon, 12 Sep 2011, Blosch, Edwin L wrote:
Nathan, I found these parameters under /sys/module/mlx4_core/parameters. How
do I incorporate a changed value? What do I need to restart or rebuild?
Forgot to say that you will need to reload the mlx4_core module by either
rebooting or
Actually we were already aware of this FAQ and already have the limits set to
hard and soft unlimited in the PAM limits.conf as well as in the pbs_mom
resource manager startup script. We encountered those issues a few years ago
and definitely are aware of having process limits set too low. I
On Mon, 12 Sep 2011, Blosch, Edwin L wrote:
Nathan, I found these parameters under /sys/module/mlx4_core/parameters. How
do I incorporate a changed value? What do I need to restart or rebuild?
Add the following line to /etc/modprobe (replace X with the appropriate value
for log_mtts_per_seg):
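The option line itself was cut off above. As a hedged sketch, module options
of this form are normally written as follows (the exact file is
/etc/modprobe.conf or a file under /etc/modprobe.d/, depending on the
distribution; the X placeholder is from the original text, not a recommended
value):

```
options mlx4_core log_mtts_per_seg=X
```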
Nathan, I found these parameters under /sys/module/mlx4_core/parameters. How
do I incorporate a changed value? What do I need to restart or rebuild?
-----Original Message-----
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf
Of Nathan Hjelm
Sent: Monday, September 12,
Alternative solution for the problem is updating your memory limits
Please see below:
http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages
Apparently your memory limit is too low and the driver fails to create QPs
What happens when you add the following to your mpirun command?
-mca
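For reference, the FAQ entry above concerns the locked-memory (memlock)
ulimit; the usual fix it describes is raising the limits in
/etc/security/limits.conf. A sketch of the typical entries (not site-specific
values):

```
* soft memlock unlimited
* hard memlock unlimited
```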
Samuel,
This worked. Did this magic line disable the use of per-peer queue pairs? I
have seen a previous post by Jeff that explains what this line does generally,
but I didn't study the post in detail, so if you could provide a little
explanation, I would appreciate it.
Ed
From:
FWIW, the default for the ib_timeout is 20 in both v1.4.x and v1.5.x.
As Ralph said, ompi_info will show the current value -- not the default value.
Of course, the current value will be the default value, unless it has been
overridden. In OMPI v1.5, ompi_info should indicate where the value
I met a similar problem, possibly related to QP memory allocation. I ran a
768-process allgather with a 1 MB message size, but with binding by node
(forcing the edges of the tuned ring algorithm through IB links every time).
The IMB test hung there for more than 3 hours without any output. I don't know
I also recommend checking the log_mtts_per_seg parameter of the mlx4 module.
This parameter controls how much memory can be registered for use by the mlx4
driver, and it should be in the range 1-5 (or 0-7, depending on the version of
the mlx4 driver). I recommend the parameter be set such that
Hi,
This problem can be caused by a variety of things, but I suspect our default
queue pair (QP) parameters aren't helping the situation :-).
What happens when you add the following to your mpirun command?
-mca btl_openib_receive_queues S,4096,128:S,12288,128:S,65536,12
OMPI Developers:
I am getting this error message below and I don't know what it means or how to
fix it. It only happens when I run on a large number of processes, e.g. 960.
Things work fine on 480, and I don't think the application has a bug. Any help
is appreciated...
>
> * btl_openib_ib_retry_count - The number of times the sender will
> attempt to retry (defaulted to 7, the maximum value).
> * btl_openib_ib_timeout - The local ACK timeout parameter (defaulted
> to 10). The actual timeout value used is calculated as:
>
Actually I'm surprised that
This usually means that you have a memory error of some kind in your
application.
Have you tried running your application through a memory-checking debugger,
such as valgrind?
On Sep 5, 2011, at 3:48 AM, Jai Dayal wrote:
> Hi all,
> I've been beating my head on this for quite a while now.
Gabriele Fatigati, le Mon 12 Sep 2011 15:50:45 +0200, a écrit :
> thanks very much for your explanations. But I don't understand why a process
> inherits the core binding of its threads
On Linux, there is no such thing as "process binding", only "thread
binding". hwloc emulates the former by using the
Ok Brice,
thanks very much for your explanations. But I don't understand why a process
inherits the core binding of its threads, according to your example:
>It worked because you never mixed it with single thread binding. If you
bind process X to coreA and then thread Y of process X to coreB, what you
This means that you have some problem on that node,
and it's probably unrelated to Open MPI.
Bad cable? Bad port? FW/driver in some bad state?
Do other IB performance tests work OK on this node?
Try rebooting the node.
-- YK
On 12-Sep-11 7:52 AM, Ahsan Ali wrote:
> Hello all
>
> I am getting
I ask because those are set via MCA param. So ompi_info would show the
"default" if the param isn't set in the environment or param file, but the app
could see something different if you set the param on the mpirun cmd line.
Those are the default values, but it looks like the MCA param is being
Le 12/09/2011 14:17, Gabriele Fatigati a écrit :
> Mm, and why? In a hybrid code ( MPI + OpenMP), my idea is to bind a
> single MPI process in one core, and his threads in other cores.
> Otherwise I have all threads that runs on a single core.
>
The usual way to do that is to first bind the
Thank you: this is very enlightening.
I will try this and let you know...
Ghislain.
Le 9 sept. 2011 à 18:00, Eugene Loh a écrit :
>
>
> On 9/8/2011 11:47 AM, Ghislain Lartigue wrote:
>> I guess you're perfectly right!
>> I will try to test it tomorrow by putting a call system("wait(X)) befor
Ok,
>A process is a container that contains one or several threads. The binding
is where something can run. It's normal that "where a process can run" is
"where any of its threads can run", which means it's the logical OR of
their binding.
Ok, now it's clear.
I'm using
Hi Brice,
but the manual does not say that get_cpubind() returns the logical OR of the
bindings of all threads... I always understood that it returns the binding of
the caller, where the caller can be a process or a thread.
I'm mixing process and thread bindings, and I've noticed that if the process
and
Le 12/09/2011 13:29, Gabriele Fatigati a écrit :
> Hi Brice,
>
> I'm so confused..
>
> I'm binding MPI processes with set_cpu_bind and it works well. The
> problem is when I try to bind process and threads.
>
> It seems that binding a thread influences the binding of the main thread.
>
> And from hwloc manual:
Hello all
I am getting the following error during an application run, which causes it to
crash.
*[[36944,1],41][btl_openib_component.c:3227:handle_wc] from
compute-01-19.private.dns.zone to: compute-01-04 error polling LP CQ with
status RETRY EXCEEDED ERROR status number 12 for wr_id 167703304 opcode