This commit causes mpirun to segfault when running the IBM spawn tests
on our slurm platforms (it may affect others as well). The failures only
happen when mpirun is run in a batch script.
The backtrace I get is:
Program terminated with signal 11, Segmentation fault.
#0 0x002a969b9dbe in
To echo what Josh said, there are no special compile flags being used.
If you send me a patch with debug output, I'd be happy to run it for you.
Both odin and sif are fairly normal linux based clusters, with ethernet
and openib IP networks. The ethernet network has both ipv4 & ipv6, and
the
Hi Adrian,
After this change, I am getting a lot of errors of the form:
[sif2][[12854,1],9][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv]
mca_btl_tcp_frag_recv: readv failed: Connection reset by
peer (104)
See for instance: http://www.open-mpi.org/mtt/index.php?do_redir=615
I have found this
Hate to bring this up again, but I was thinking that an easy way to
reduce the size of the modex would be to reduce the length of the names
describing each piece of data.
More concretely, for a simple run I get the following names, each of
which are sent over the wire for every proc (note
Hi all,
I reported this before, but it seems that the report got lost. I have
found some situations where mpirun will return a '0' when there is an error.
An easy way to reproduce this is to edit the file
'orte/mca/plm/base/plm_base_launch_support.c' and on line 154 put in
'return
Thanks for the report. As Ralph indicated the threading support in Open
MPI is not good right now, but we are working to make it better.
I have filed a ticket (https://svn.open-mpi.org/trac/ompi/ticket/1267)
so we do not lose track of this issue, and attached a potential fix to
the ticket.
Is there a reason to rename ompi_modex_{send,recv} to
ompi_modex_proc_{send,recv}? It seems simpler (and no more confusing and
less work) to leave the names alone and add ompi_modex_node_{send,recv}.
Another question: Does the receiving process care that the information
received applies to a
Unfortunately now with r17988 I cannot run any mpi programs, they seem
to hang in the modex.
Tim
Ralph H Castain wrote:
Thanks Tim - I found the problem and will commit a fix shortly.
Appreciate your testing and reporting!
On 3/27/08 8:24 AM, "Tim Prins" <tpr...@cs.india
This commit breaks things for me. Running on 3 nodes of odin:
mpirun -mca btl tcp,sm,self examples/ring_c
causes a hang. All of the processes are stuck in
orte_grpcomm_base_barrier during MPI_Finalize. Not all programs hang,
and the ring program does not hang all the time, but fairly often.
try running with:
--mca opal_event_include select
and see if that fixes the problem for you?
On Mar 25, 2008, at 8:49 AM, Tim Prins wrote:
Hi everyone,
For the last couple nights ALL of our mtt runs have been failing
(although the failure is masked because mpirun is returning the wrong
Hi everyone,
For the last couple nights ALL of our mtt runs have been failing
(although the failure is masked because mpirun is returning the wrong
error code) with:
[odin005.cs.indiana.edu:28167] [[46567,0],0] ORTE_ERROR_LOG: Error in file
base/plm_base_launch_support.c at line 161
Hi,
Something went wrong last night and all our MTT tests had the following
output:
[odin005.cs.indiana.edu:28167] [[46567,0],0] ORTE_ERROR_LOG: Error in file
base/plm_base_launch_support.c at line 161
--
mpirun was unable
WHAT: Reduce the number of tests run by make check
WHY: Some of the tests will not work properly until Open MPI is
installed. Also, many of the tests do not really test anything.
WHERE: See below.
TIMEOUT: COB Friday March 14
DESCRIPTION:
We have been having many problems with make check
carto modules.
Is there some reason why carto absolutely must find a module? Can we
create a default "none available" module in the base?
On 3/4/08 7:39 AM, "Tim Prins" <tpr...@cs.indiana.edu> wrote:
Hi,
We have been having a problem lately with our MTT runs where mak
Hi,
We have been having a problem lately with our MTT runs where make check
would fail when mpi threads were enabled.
Turns out the problem is that opal_init now calls
opal_base_carto_select, which cannot find any carto modules since we
have not done an install yet. So it returns a failure.
We have used '^' elsewhere to indicate not, so maybe just have the
syntax be if you put '^' at the beginning of a line, that node is not used.
So we could have:
n0
n1
^headnode
n3
I understand the idea of having a flag to indicate that all nodes below
a certain point should be ignored, but I
WHAT: Removal of orte_proc_table
WHY: It is the last 'orte' class, its implementation is an abstraction
violation since it assumes certain things about how the opal_hash_table
is implemented, and it is not much code to remove it.
WHERE: This will necessitate minor changes in:
btl: tcp,
, 2008, at 9:19 AM, Tim Prins wrote:
Hi,
We are running into a problem with the IBM test cxx_call_errhandler
since the merge of the c++ bindings changes. Not sure if this is a
known
problem, but I did not see a bug or any traffic about this one.
MTT link: http://www.open-mpi.org/mtt/index.php
Hi,
We are running into a problem with the IBM test cxx_call_errhandler
since the merge of the c++ bindings changes. Not sure if this is a known
problem, but I did not see a bug or any traffic about this one.
MTT link: http://www.open-mpi.org/mtt/index.php?do_redir=532
Thanks,
Tim
Adrian Knoth wrote:
On Fri, Feb 01, 2008 at 11:40:20AM -0500, Tim Prins wrote:
Adrian,
Hi!
Sorry for the late reply and thanks for your testing.
1. There are some warnings when compiling:
I've fixed these issues.
Thanks.
2. If I exclude all my tcp interfaces, the connection fails
I just talked to Jeff about this. The problem was that on Sif we use
--enable-visibility, and apparently the new c++ bindings access
ompi_errhandler_create, which was not OMPI_DECLSPEC'd. Jeff will fix
this soon.
Tim
Jeff Squyres wrote:
I'm a little concerned about the C++ test build
Adrian,
For the most part this seems to work for me. But there are a few issues.
I'm not sure which are introduced by this patch, and whether some may be
expected behavior. But for completeness I will point them all out.
First, let me explain I am working on a machine with 3 tcp interfaces,
Hi Matthias,
I just noticed something else that seems odd. On a fresh checkout, I did
an autogen and configure. Then I type 'make clean'. Things seem to
progress normally, but once it gets to ompi/contrib/vt/vt/extlib/otf, a
new configure script gets run.
Specifically:
[tprins@sif test]$
Hi,
I am seeing some warnings on the trunk when compiling udapl in 32 bit
mode with OFED 1.2.5.1:
btl_udapl.c: In function 'udapl_reg_mr':
btl_udapl.c:95: warning: cast from pointer to integer of different size
btl_udapl.c: In function 'mca_btl_udapl_alloc':
btl_udapl.c:852: warning: cast
Jeff Squyres wrote:
I got a bunch of compiler warnings and errors with VT on the PGI
compiler last night -- my mail client won't paste it in nicely. :-(
See these MTT reports for details:
- On Absoft systems:
http://www.open-mpi.org/mtt/index.php?do_redir=516
- On Cisco systems:
With
On Wednesday 02 January 2008 08:52:08 am Jeff Squyres wrote:
> On Dec 31, 2007, at 11:42 PM, Paul H. Hargrove wrote:
> > I tried today to build the OMPI trunk on a system w/ GM libs installed
> > (I tried both GM-2.0.16 and GM-1.6.4) and found that the GM BTL won't
> > even compile, due to
Hi,
A couple of questions.
First, in opal_condition_wait (condition.h:97) we do not release the
passed mutex if opal_using_threads() is not set. Is there a reason for
this? I ask since this violates the way condition variables are supposed
to work, and it seems like there are situations
Hi,
Last night we had one of our threaded builds on the trunk hang when
running make check on the test opal_condition in test/threads/
After running the test about 30-40 times, I was only able to get it to
hang once. Looking at it in gdb, we get:
(gdb) info threads
3 Thread 1084229984
Well, I think it is pretty obvious that I am a fan of an attribute system :)
For completeness, I will point out that we also exchange architecture
and hostname info in the modex.
Do we really need a complete node map? As far as I can tell, it looks
like the MPI layer only needs a list of local
Hi,
The following files bother me about this commit:
trunk/ompi/mca/btl/sctp/sctp_writev.c
trunk/ompi/mca/btl/sctp/sctp_writev.h
They bother me for 2 reasons:
1. Their naming does not follow the prefix rule
2. They are LGPL licensed. While I personally like the LGPL, I do not
believe
an
> error in the way I was doing things, or could be a real characteristic of
> the parser. Anyway, we would have to ensure that the parser removes any
> surrounding "" before passing along the param value or this won't work.
>
> Ralph
>
> On 11/5/07 12:10 PM, &
Thanks for the clarification everyone.
Tim
On Monday 05 November 2007 05:41:00 pm Torsten Hoefler wrote:
> On Mon, Nov 05, 2007 at 05:32:04PM -0500, Brian W. Barrett wrote:
> > On Mon, 5 Nov 2007, Torsten Hoefler wrote:
> > > On Mon, Nov 05, 2007 at 04:57:19PM -0500, Brian W. Barrett wrote:
> >
Hi,
After talking with Torsten today I found something weird. When using the SLURM
pls we seem to forward a user's environment, but when using the rsh pls we do
not.
I.e.:
[tprins@odin ~]$ mpirun -np 1 printenv |grep foo
[tprins@odin ~]$ export foo=bar
[tprins@odin ~]$ mpirun -np 1 printenv
Hi,
Commit 16364 broke things when using multiword mca param values. For
instance:
mpirun --debug-daemons -mca orte_debug 1 -mca pls rsh -mca pls_rsh_agent
"ssh -Y" xterm
Will crash and burn, because the value "ssh -Y" is being stored into the
argv orted_cmd_line in orterun.c:1506. This
Hi,
The openib and udapl btls currently use the orte_pointer_array class.
This is a problem for me as I am trying to implement the RSL. So, as far
as I can tell, there are 3 options:
1. Move the orte_pointer_array class to opal. This would be quite simple
to do and makes sense in that there
Hi,
I am working on implementing the RSL. Part of this is changing the modex
to use the process attribute system in the RSL. I had designed this
system to include a non-blocking interface.
However, I have looked again and noticed that nobody is using the
non-blocking modex receive.
WHAT: Remove the opal message buffer code
WHY: It is not used
WHERE: Remove references from opal/mca/base/Makefile.am and
opal/mca/base/base.h
svn rm opal/mca/base/mca_base_msgbuf*
WHEN: After timeout
TIMEOUT: COB, Wednesday October 10, 2007
I ran into this code
ight help
> you. The option --enable-mem-debug add a unused space at the end of
> each memory allocation and make sure we don't write anything there. I
> think this is the simplest way to pinpoint this problem.
>
>Thanks,
> george.
>
> On Sep 21, 2007, at 10:07 AM,
Hi folks,
In our nightly runs with the trunk I have started seeing cases where we
appear to be segfaulting within/below malloc. Below is a typical output.
Note that this appears to only happen on the trunk, when we use openib,
and are in 32 bit mode. It seems to happen randomly at a very low
This is fixed in r16164.
Tim
Brian Barrett wrote:
On Sep 19, 2007, at 4:11 PM, Tim Prins wrote:
Here is where it gets nasty. On FreeBSD, /usr/include/string.h
includes
strings.h in some cases. But there is a strings.h in the ompi/mpi/f77
directory, so that is getting included instead
Jeff Squyres wrote:
That's fine, too. I don't really care -- /public already exists. We
can simply rename it to /tmp-public.
Let's do that. It should (more or less) address all concerns that have
been voiced.
Tim
On Aug 31, 2007, at 8:52 AM, Ralph Castain wrote:
Why not make
Why not make /tmp-public and /tmp-private?
Leave /tmp alone. Have all new branches made in one of the two new
directories, and as /tmp branches are slowly whacked, we can
(eventually) get rid of /tmp.
Tim
Jeff Squyres (jsquyres) wrote:
I thought about both of those (/tmp/private and
ns of ORTE, nor for supporting
ORTE development. It would be nice if we could re-evaluate this after the
next ORTE version becomes solidified to see how the cost/benefit analysis
has changed, and whether the RSL remains a desirable option.
Ralph
On 8/16/07 7:47 PM, "Tim Prins" <tp
which supports
every system ever imagined, and provides every possible fault-tolerant
feature, when all they want is a thin RTE.
Tim
george.
On Aug 16, 2007, at 9:47 PM, Tim Prins wrote:
WHAT: Solicitation of feedback on the possibility of adding a runtime
services layer to Open MPI
me concern as Jeff's).
Understood.
Tim
--td
Tim Prins wrote:
WHAT: Solicitation of feedback on the possibility of adding a runtime
services layer to Open MPI to abstract out the runtime.
WHY: To solidify the interface between OMPI and the runtime environment,
and to allow the use of
gainst moving too fast, and then having to redo things.
> I am very supportive if this, I do believe this is the right way to go,
> unless someone else can come up with a better idea, and time to implement.
Thanks for the comments,
Tim
>
> Thanks,
> Rich
>
> On 8/16/07 9
it is impossible to say right now, but wanted to throw it out
there for people to consider/think about.
Tim
>
> On Aug 16, 2007, at 9:47 PM, Tim Prins wrote:
> > WHAT: Solicitation of feedback on the possibility of adding a runtime
> > services layer to Open MPI to abstract
Jeff Squyres wrote:
On Aug 16, 2007, at 11:48 AM, Tim Prins wrote:
+#define ORTE_RML_TAG_UDAPL 25
+#define ORTE_RML_TAG_OPENIB 26
+#define ORTE_RML_TAG_MVAPI 27
I think that UDAPL, OPENIB, MVAPI should not appear anywhere in the
ORTE layer
Sorry, I pushed the wrong button and sent this before it was ready
Tim Prins wrote:
Hi folks,
I am running into a problem with the ibm test 'group'. I will try to
explain what I think is going on, but I do not really understand the
group code so please forgive me if it is wrong
Hi folks,
I am running into a problem with the ibm test 'group'. I will try to
explain what I think is going on, but I do not really understand the
group code so please forgive me if it is wrong...
The test creates a group based on MPI_COMM_WORLD (group1), and a group
that has half the
Hi folks,
I was looking at the rml usage in ompi, and noticed that several of the
btls (udapl, mvapi, and openib) use the same rml tag for their messages.
My guess is that this is a mistake, but just want to ask if there is a
reason for this before I correct it.
Thanks,
Tim
This might be breaking things on odin. All our 64 bit openib mtt tests
have the following output:
[odin003.cs.indiana.edu:30971] Wrong QP specification (QP 0
"P,128,256,128,16:S,1024,256,128,32:S,4096,256,128,32:S,65536,256,128,32").
Point-to-point QP get 1-5 parameters
However, on my debug
Scott Atchley wrote:
On Jul 10, 2007, at 3:24 PM, Tim Prins wrote:
On Tuesday 10 July 2007 03:11:45 pm Scott Atchley wrote:
On Jul 10, 2007, at 2:58 PM, Scott Atchley wrote:
Tim, starting with the recently released 1.2.1, it is the default.
To clarify, MX_RCACHE=1 is the default.
It would
Jeff Squyres wrote:
2. The "--enable-mca-no-build" option takes a comma-delimited list of
components that will then not be built. Granted, this option isn't
exactly intuitive, but it was the best that we could think of at the
time to present a general solution for inhibiting the build of a
Gleb Natapov wrote:
On Sun, Jul 08, 2007 at 12:41:58PM -0400, Tim Prins wrote:
On Sunday 08 July 2007 08:32:27 am Gleb Natapov wrote:
On Fri, Jul 06, 2007 at 06:36:13PM -0400, Tim Prins wrote:
While looking into another problem I ran into an issue which made ob1
segfault on me. Using gm
On Sunday 08 July 2007 08:32:27 am Gleb Natapov wrote:
> On Fri, Jul 06, 2007 at 06:36:13PM -0400, Tim Prins wrote:
> > While looking into another problem I ran into an issue which made ob1
> > segfault on me. Using gm, and running the test test_dan1 in the onesided
> > t
done those tests, then my apology - but your note only indicates
that you ran "hello_world" and are basing your recommendation *solely* on
that test.
On 6/6/07 7:51 AM, "Tim Prins" <tpr...@open-mpi.org> wrote:
I hate to go back to this, but...
The or
, but...sigh.
Anyway, it doesn't appear to have any bearing either way on George's
patch(es), so whoever wants to commit them is welcome to do so.
Thanks
Ralph
On 5/29/07 11:44 AM, "Ralph Castain" <r...@lanl.gov> wrote:
On 5/29/07 11:02 AM, "Tim Prins" <tpr...@open-
it, let me know and I'll push it in the
trunk asap.
Thanks,
george.
On May 29, 2007, at 10:56 AM, Tim Prins wrote:
I think both patches should be put in immediately. I have done some
simple testing, and with 128 nodes of odin, with 1024 processes
running mpi hello, these decrease our running
I think both patches should be put in immediately. I have done some
simple testing, and with 128 nodes of odin, with 1024 processes
running mpi hello, these decrease our running time from about 14.2
seconds to 10.9 seconds. This is a significant decrease, and as the
scale increases there
Hi everyone,
I have been playing around with Open-MPI, using it as a test bed for
another project I am working on, and have found that on the intel test
suite, ompi is failing the MPI_Allreduce_user_c,
MPI_Reduce_scatter_user_c, and MPI_Reduce_user_c tests (it prints
something like MPITEST error