Creating nightly hwloc snapshot git tarball was a success.
Snapshot: hwloc 1.11.3-43-g8f0e3cd
Start time: Wed Aug 10 08:50:41 PDT 2016
End time: Wed Aug 10 08:53:56 PDT 2016
Your friendly daemon,
hwloc-devel mailing list
Creating nightly hwloc snapshot git tarball was a success.
Snapshot: hwloc dev-1222-gdbe0cfd
Start time: Wed Aug 10 18:01:05 PDT 2016
End time: Wed Aug 10 18:04:43 PDT 2016
Your friendly daemon,
hwloc-devel mailing list
Looks okay to me Brian - I went ahead and filed the CMR and sent it on to
Brad for approval.
> On Tue, 3 Mar 2009, Brian W. Barrett wrote:
>> On Tue, 3 Mar 2009, Jeff Squyres wrote:
>>> 1.3.1rc3 had a race condition in the ORTE shutdown sequence. The only
>>> difference between rc3
Very interesting! Appreciate the info. My numbers are slightly better
- as I've indicated, there is a NxN message exchange currently in the
system that needs to be removed. With that commented out, the system
scales roughly linearly with the number of processes.
At 04:31 PM 7/28/2005, you wrote:
Per last week's discussions, I have created a set of new simplified
APIs for the registry. These include:
1. orte_gpr.put_1 and orte_gpr.put_N: these allow you to put data on
the registry without having to define your own value structures. They
take a segment name, a NULL-terminated
Hmmm...it was running for me last night and (I thought) this morning,
but I'll test it again and see if I can reproduce the problem. Could
be something crept in there.
At 06:28 PM 8/3/2005, you wrote:
I just noticed that mpirun hangs forever inside the
orte_rmgr.finalize() routine. AFAIK
Very interesting - it built fine for me (building static). However,
the ns_base_nds.c file is "stale", so I just committed a "delete" of
that file. It shouldn't have been building anyway as it isn't in the
Makefile. My guess, therefore, is that you are building dynamically
and are encountering
Several people have asked lately what I am planning to do next on
ORTE. Just to help maintain coordination, here is my current list of
planned activities (in priority order). Any requests/suggestions are
welcomed - this isn't in concrete by any means.
1. Add George's architecture
I have now completed the first three of these items. I believe this
brings ORTE to a stage that is - at the least - very close to release
quality. There are a few memory leaks left (oob and iof subsystems),
but I'm not as familiar with those and have asked for help.
Josh ran some tests for me on Odin earlier today - the results show a
major improvement in our startup/shutdown performance. As you may
recall, our times grew roughly exponentially before - as the attached
graph shows, they now grow roughly linearly. The data also shows that
While I generally find the new build methodology (i.e., reducing the
number of makefiles) has little impact on me, I have now encountered
one problem that causes a significant difficulty. In trying to work
on a revised data packing system for the orte part of the branch, I
Your proposed change would help a great deal - thanks! Can you steer
me through the change?
At 07:33 AM 11/15/2005, you wrote:
* Ralph H. Castain wrote on Tue, Nov 15, 2005 at 03:12:38PM CET:
> While I generally find the new build methodology (i.e., reducing the
ut we have
definitely made it harder to develop a subsystem. Is that really a
good trade? I wonder.
At 08:08 AM 11/15/2005, you wrote:
* Ralph H. Castain wrote on Tue, Nov 15, 2005 at 03:45:26PM CET:
> At 07:33 AM 11/15/2005, you wrote:
> >Would it help if onl
un "make" in a framework directory, it
just builds the stuff in base without recursing. Of course, you can't
run make in the base/ directory, but since running make in the
framework directory is essentially equivalent, it doesn't exactly
On Nov 20, 2005, at 10:04 PM, Ralph H. Casta
Appreciate the offer, but I think at this stage it isn't worth the
hassle. We either implement a long-term fix, or just pay the price.
At 01:37 AM 11/21/2005, you wrote:
* Ralph H. Castain wrote on Mon, Nov 21, 2005 at 04:04:34AM CET:
> Just as an
As you may have seen from earlier emails, I encountered some
difficulty in modifying existing APIs within the streamlined build
system. After some effort, I think I have defined a method for
modifying the API-level of a subsystem that gets around some of the
problems. I thought I
No problem with me - seems straightforward and resolves some confusion.
On the orted check for the fork pls, you will find that there is a
flag in the process info structure that indicates "I am a daemon".
You may just need to check that flag - gets set very early and so
should be available
I've just finished some stuff - will check it into the system
(hopefully) tomorrow. I'll be able to take a look at this next week.
My guess is that the launcher isn't setting that proc state at this
time since it isn't being used by the system internally and we didn't
know anyone else was
After several months of development, I have merged the new data
support subsystem for ORTE into the trunk. I must add one caveat:
I have made every effort to test the revised system, but
cannot guarantee its operation in every condition and under every
This should now be fixed on the trunk. Once it is checked out more
thoroughly, I'll ask that it be moved to the 1.0 branch. For now, you
might want to check out the trunk and verify it meets your needs.
At 03:05 PM 2/1/2006, you wrote:
This was happening on Alpha 1 as well but
As you'll see in my latest commit, I have made a slight modification
to the standard triggers that ensures we define them for ALL of the
process and job states. This will now allow users to subscribe to
triggers (for example) on all processes achieving INIT, LAUNCH, and
Hmmmyuck! I'll take a look - will set it back to what it was
before in the interim.
At 07:05 AM 2/9/2006, you wrote:
On Feb 8, 2006, at 12:46 PM, Ralph H. Castain wrote:
> In addition, I took advantage of the change to fix something Brian
> had flagged in the orte/mc
>> Nathan DeBardeleben, Ph.D.
>> Los Alamos National Laboratory
>> Parallel Tools Team
>> High Performance Computing Environments
>> phone: 505-667-3428
>> email: ndeb...@lanl.gov
I'm not entirely sure I understand your questions, but will try to answer
them below. If you can share what you are doing, we'd be happy to provide
On 6/30/06 5:45 AM, "amrita mathuria" wrote:
> I am working with open mpi
This has been around for a very long time (at least a year, if memory serves
correctly). The problem is that the system "hangs" while trying to flush the
io buffers through the RML because it loses connection to the head node
process (for 1.x, that's basically mpirun) - but the "flush" procedure
Could you tell us which version of the code you are using, and print out the
rc value that was returned by the "get" call? I see nothing obviously wrong
with the code, but much depends on what happened prior to this call too.
BTW: you might want to release the memory stored in the
> High Performance Computing Environments
> phone: 505-667-3428
> email: ndeb...@lanl.gov
> Ralph H Castain wrote:
>> Hi Nathan
>> Could you tell us which version of the code you a
On 8/21/06 1:14 AM, "Ralf Wildenhues" wrote:
>> Perhaps we should use int64_t instead.
> No, that would not help: int64_t is C99, so it should not be declared
> either in C89 mode. Also, the int64_t is required to have 64 bits, and
> could thus theoretically be
On 8/21/06 6:58 AM, "Ralf Wildenhues" <ralf.wildenh...@gmx.de> wrote:
> * Ralph H Castain wrote on Mon, Aug 21, 2006 at 02:39:51PM CEST:
>> It sounds, therefore, like we are now C99 compliant and no longer C90
>> compliant at all?
There has been a bit of discussion about this on the core developers list
and on telecons, but I felt that perhaps I should provide a more detailed
warning to the broader developer community.
In the next few weeks, there will be some major revisions submitted to the
Open MPI trunk on the
Actually, I was a part of that thread - see my comments beginning with
Perhaps I communicated poorly here. The issue in the prior thread was that
few systems nowadays don't offer at least some level of IPv6 compatibility,
On 9/6/06 9:44 AM, "Christian Kauhaus" wrote:
> Bogdan Costescu :
>> I don't know why you think that this (talking to different nodes via
>> different channels) is unusual - I think that it's quite probable,
>> especially in a
> I even volunteer for that. Next week I will be away, so I will come
> back with a design for the phone conference on ... well beginning of
> On Sep 7, 2006, at 12:22 PM, Ralph H Castain wrote:
>> Jeff and I talked about t
I need to do a little planning and it would help a bunch to have a
preliminary head count. Could you please let me know (a) if you plan to
participate in the tutorial, and (b) indicate if in-person or remote?
For an agenda, my thought is that we will start at 7am Mountain time (that's
round 10.30 pm on wednesday, and by the time I pick up
> the rental car and drive to White Rocks, it can become quite late)
> Could we maybe start a little later that day, e.g. 8am or 9am?
> Ralph H Castain wrote:
>> Yo folks
I have attached a tentative agenda for this week's tutorial, based on inputs
received so far from planned participants. I have adjusted things to try and
accommodate the needs of a geographically distributed audience, and the fact
that - as sole speaker - I cannot possibly talk for
The materials for Thursday's session of the ORTE tutorial are now complete
and stable. I have posted them on the OpenRTE web site at:
Both Powerpoint and PDF (printed two slides/page) formats are available.
I should have the
There was some discussion at yesterday's tutorial about ORTE scalability and
where bottlenecks might be occurring. I spent some time last night
identifying key information required to answer those questions. I'll be
presenting a slide today showing the key timing points that we would
I can't speak to the MPI layer, but for OpenRTE, each process holds one
socket open to the HNP. Each process *has* all the socket connection info
for all of the processes in its job, but I don't believe we actually open
those sockets until we attempt to communicate with that process (needs to be
I don't see any new component, Adrian. There have been a few updates to the
existing component, some of which might cause conflicts with the merge, but
those shouldn't be too hard to resolve.
As far as I know, the oob/tcp component is relatively stable. Brian is doing
some work on it to enable us
All of the schema keys are listed in orte/mca/schema/schema_types.h. The key
you are looking for is the ORTE_PROC_LOCAL_PID_KEY.
You will also see an ORTE_PROC_PID_KEY. This one refers to the pid assigned
by the launcher - the other refers to the pid reported by the process from
Thanks Ralf! Much appreciated.
On 11/30/06 8:33 AM, "Ralf Wildenhues" wrote:
> * Ralph Castain wrote on Thu, Nov 30, 2006 at 04:12:16PM CET:
>> That could be the problem. I had to update automake, and unfortunately
>> Darwin Ports hasn't reached that level yet. So I had
We aren't ignoring your situation, Adrian - Jeff and I are talking about how
best to deal with the situation and your offer to help. This revision will
indeed see some significant change in the oob/tcp component, mostly in the
init and connect procedures.
The concern is that we want to leave open
The changes we are planning to do will in no way preclude the use of
multicast for the xcast procedure. The changes in the OOB subsystem deal
specifically with how those connections are initialized, which is something
we would need to do for multicast anyway.
The routing method for the xcast is
Several of us were on a telecon yesterday and the topic of better
coordinating the activities on OpenRTE came up. While things have percolated
along reasonably well, the general feeling was that better, wider knowledge
of current OpenRTE development activities and directions would
on adding functionality to the code - I will note
those on the site as I am fixing them.
Again, I would like to note that people are always welcome to drop me a note
or call me on the phone if they have a question about what I'm doing or
planning to do.
On 1/4/07 7:41 AM, "
On 1/27/07 9:37 AM, "Greg Watson" wrote:
> There are two more interfaces that have changed:
> 1. orte_rds.query() now takes a job id, whereas in 1.2b1 it didn't
> take any arguments. I seem to remember that I call this to kick orted
> into action, but I'm not sure of the
On 1/29/07 10:20 AM, "Greg Watson" <gwat...@lanl.gov> wrote:
> On Jan 29, 2007, at 6:47 AM, Ralph H Castain wrote:
>> On 1/27/07 9:37 AM, "Greg Watson" <gwat...@lanl.gov> wrote:
>>> There a
>> other than a hostfile, we really don't have a way to do that right
>> now. The
>> ORTE 2.0 design allows for it, but we haven't implemented that yet -
>> probably a few months away.
>> Hope that helps
On 4/3/07 9:32 AM, "Li-Ta Lo" wrote:
> On Sun, 2007-04-01 at 13:12 -0600, Ralph Castain wrote:
>> 2. I'm not sure what you mean by mapping MPI processes to "physical"
>> processes, but I assume you mean how do we assign MPI ranks to processes on
>> specific nodes. You
I understand that several people are interested in the OpenRTE scalability
issues - this is great! However, it appears we haven't done a very good job
of circulating information about the identified causes of the current
issues. In the hope of helping people to be productive in their
Actually, I was aware of that and should have clarified that these tests did
*not* involve the IPv6 code.
On 4/17/07 1:31 AM, "Christian Kauhaus" <ckauh...@minet.uni-jena.de> wrote:
> Ralph H Castain <r...@lanl.gov>:
>> even tho
For the last several months, we have supported three modes of sending the
xcast messages used to release MPI processes from their various stage gates:
1. Direct - message sent directly to each process in a serial fashion
2. Linear - message sent serially to the daemon on each node, which then
e messages to each orted independently
(instead of via a binomial tree method).
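The contrast drawn here, between sending to each daemon independently and relaying through a binomial tree, can be illustrated with a small sketch. This is not ORTE's xcast code. The function names and the rank numbering (rank 0 as the sender) are assumptions made purely for illustration. The sketch only counts sequential rounds in each scheme.

```python
# Illustrative only: compares send schedules, not ORTE's actual xcast logic.
# Rank 0 is assumed to be the originator; ranks 1..n-1 are the receivers.

def linear_schedule(n):
    """Rank 0 sends to every other rank itself, one send per round."""
    return [[(0, dst)] for dst in range(1, n)]

def binomial_schedule(n):
    """Each round, every rank that already holds the message forwards it
    to one rank that does not, so the number of holders doubles."""
    rounds = []
    have = 1  # ranks 0..have-1 currently hold the message
    while have < n:
        sends = [(src, src + have) for src in range(have) if src + have < n]
        rounds.append(sends)
        have *= 2
    return rounds

if __name__ == "__main__":
    for n in (8, 64, 1024):
        print(n, len(linear_schedule(n)), len(binomial_schedule(n)))
```

In this toy model the linear scheme needs n-1 rounds, while the binomial tree needs about log2(n). The tree pays for that with relaying work on the intermediate ranks.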
> Ralph H Castain wrote:
>> For the last several months, we have supported three modes of sending the
>> xcast messages used to release MPI processes from their various stag
This came up in today's telecon and I promised to send this to George -
however, it occurred to me that others may also want to know.
If you want to dump info for debugging purposes, and if you can get into
orterun/mpirun (e.g., via gdb), you can dump info on anything with the
Just a quick glance (running out door) - it looks like Josh commented out a
critical piece of code in the rds hostfile component at line 442. It loads
the cell info into the name service so it can correctly respond to the query
you cite below.
You might try restoring that code - if you do, check
> I haven't looked at this at all, but that line changed in r6813 which
> was Aug. 2005 so I would guess the problem is elsewhere. However with
> the recent ORTE changes maybe this is a side effect.
> -- Josh
> On May 23, 2007, at 11:11 AM, Ralph H
Okay, this is now fixed as of r14732.
Thanks (and apologies) to George for spotting it.
On 5/23/07 9:57 AM, "Ralph H Castain" <r...@lanl.gov> wrote:
> Actually, I think that is true (got back earlier than expected). The problem
> really is that we had multip
Thanks - I'll take a look at this (and the prior ones!) in the next couple
of weeks when time permits and get back to you.
On 5/23/07 1:11 PM, "George Bosilca" wrote:
> Attached is another patch to the ORTE layer, more specifically the
> replica. The idea is to
Scaling tests over the last few months have all shown a behavior that has
elicited significant comment: namely, that the HNP is observed to grow to
multiple gigabytes in size for runs involving several thousand processes.
This represents a peak size that declines to a much smaller footprint once
s (you'll have to look at the tests to
>>>>>> which ones make sense in the latter case). This will ensure that we
>>>>>> have at
>>>>>> least some degree of coverage.
ou do, remember to also
> remove test/class/orte_bitmap.c
> Ralph H Castain wrote:
>> Sigh...is it really so much to ask that we at least run the tests in
>> orte/test/system and orte/test/mpi using both mpirun and singleton (where
I made a major commit to the trunk this morning (r15007) that merits general
notification and some explanation.
*** IMPORTANT NOTE ***
One major impact of the commit you *may* notice is that support for several
environments will be broken. This commit is known to break
Actually, I was talking specifically about configuration at build time. I
realize there are trade-offs here, and suspect we can find a common ground.
The problem with using the options Jeff described is that they require
knowledge on the part of the builder as to what environments have had their
As I understood our original discussions, this would move responsibility for
mapping rank to processor back into the orted - is that still true?
Reason I ask is to again clarify for people if we are doing so as it (a)
impacts those systems that don't use our orteds (e.g., will affinity still
I have upgraded the support for Bproc on the Open MPI trunk as of r15328.
We now support Bproc environments that do not utilize resource managers - in
these cases, we will allow the user to launch on all nodes upon which they
have execution authority. Please note that, if you log in to
al processes. Currently this component is the ODLS. Most of my
> work is in the ODLS component so if you decide to eliminate the orteds
> you must, somehow, preserve the ODLS functionality.
> -Original Message-
> From: devel-boun...@open-mpi.org
as multiple, related
> frameworks (e.g., RAS and PLS). E.g., "orte_base_launcher=tm", or
> On Jul 10, 2007, at 9:08 AM, Ralph H Castain wrote:
>> Actually, I was talking specifically about configuration at build
>> time. I
>> realize the
Interesting point - no reason why we couldn't use that functionality for
this purpose. Good idea!
On 7/11/07 5:38 AM, "Jeff Squyres" <jsquy...@cisco.com> wrote:
> On Jul 10, 2007, at 1:26 PM, Ralph H Castain wrote:
>>> 2. It may be useful to have some h
I have a fairly significant change coming to the orte part of the code base
that will require an autogen (sorry). I'll check it in late this afternoon
(can't do it at night as it is on my office desktop).
The commit will fix the singleton operations, including singleton
On 7/12/07 7:53 AM, "Ralph H Castain" <r...@lanl.gov> wrote:
> Yo all
> I have a fairly significant change coming to the orte part of the code base
> that will require an autogen (sorry). I'll check it in late this afternoon
> (can't do it
let me know of any problems.
On 7/12/07 1:45 PM, "Ralph H Castain" <r...@lanl.gov> wrote:
> Yo folks
> Several of us are stuck waiting for this commit to hit. Rather than wasting
> the next several hours, I'm going to make the commit now.
> So plea
ns ? This
> will solve the Windows problem, and will give us a more consistent
> On Jul 12, 2007, at 4:02 PM, Ralph H Castain wrote:
>> The commit has been made - it is r15390.
>> This commit restored the ability to e
y separate code paths.
That's why we wound up where we are. Remember, the ODLS fork/exec's
application procs, so it includes all kinds of stuff for that purpose. In
this case, we are fork/exec'ing an orted - totally different informational
On 7/12/07 2:17 PM, "Ralph H Castain"
As we are discussing functional requirements for the upcoming 1.3 release, I
was asked to provide a little info about what is going to be happening to
the ORTE part of the code base over the remainder of this year.
Short answer: there will be a major code revision to reduce ORTE to the
> On Thu, Jul 12, 2007 at 03:04:01PM -0600, Ralph H Castain wrote:
>> As always, any thoughts/suggestions are welcomed.
> I hope Sharon's work on process affinity will be merged into the trunk
> before this work begins and functionality will be preser
On 7/13/07 7:22 AM, "Sven Stork" <st...@hlrs.de> wrote:
> Hi Ralph,
> On Thursday 12 July 2007 15:53, Ralph H Castain wrote:
>> Yo all
>> I have a fairly significant change coming to the orte part of the code base
>> that will
Sigh - somehow, the fix slid out of that commit. I have now fixed it in
On 7/16/07 6:11 AM, "Sven Stork" <st...@hlrs.de> wrote:
> On Friday 13 July 2007 15:35, Ralph H Castain wrote:
>> On 7/13/07 7:22 AM, "Sven Stork"
Just to further clarify the clarification... ;-)
This condition has existed for the last several months. The root problem
dates at least back into the 1.1 series. We chased the problem down to the
iof_flush call in the odls when a process terminates in something like Jan
or Feb this year, at
I believe that was fixed in r15405 - are you at that rev level?
On 7/18/07 7:27 AM, "Gleb Natapov" wrote:
> With current trunk LD_LIBRARY_PATH is not set for ranks that are
> launched on the head node. This worked previously.
It works for me in both cases, provided I give the fully qualified host name
for your first example. In other words, these work:
pn1180961:~/openmpi/trunk rhc$ mpirun -n 1 -host localhost printenv | grep
[pn1180961.lanl.gov:22021] [0.0] test of print_name
So the question is: why do you not have LD_LIBRARY_PATH set in your
environment when you provide a different hostname?
On 7/19/07 7:45 AM, "Gleb Natapov" <gl...@voltaire.com> wrote:
> On Wed, Jul 18, 2007 at 09:08:38PM +0300, Gleb Natapov wrote:
> that works fine. The failing one is the first one, where
> LD_LIBRARY_PATH is not provided. As Gleb indicated, using localhost
> makes the problem vanish.
> On Jul 19, 2007, at 10:57 AM, Ralph H Castain wrote:
>> But it *does* provide an LD
Talked with Brian and we have identified the problem and a fix - will come
in later today.
On 7/19/07 9:24 AM, "Ralph H Castain" <r...@lanl.gov> wrote:
> You are correct - I misread the note. My bad.
> I'll look at how we might ensure the LD_LIBRAR
problem that will take some
discussion - to occur separately from this chain. So some of the behavior
you cited continues for the moment.
On 7/19/07 9:39 AM, "Ralph H Castain" <r...@lanl.gov> wrote:
> Talked with Brian and we have identified the problem and a fix - w
This change has finally been merged into the trunk as r15517. It will
unfortunately require an autogen (sorry).
Please let me know if you encounter any problems. As noted in the commit, I
tried to catch all the places that required change, but cannot guarantee
that I got all of them.
As you know, I am working on revamping the hostfile functionality to make it
work better with managed environments (at the moment, the two are
exclusive). The issue that we need to review is how we want the interaction
to work, both for the initial launch and for comm_spawn.
Perhaps some bad news on this subject - see below.
On 7/26/07 7:53 AM, "Ralph H Castain" <r...@lanl.gov> wrote:
> On 7/26/07 7:33 AM, "rolf.vandeva...@sun.com" <rolf.vandeva...@sun.com>
>> Aurelien Bout
On 7/26/07 2:24 PM, "Aurelien Bouteiller" <boute...@cs.utk.edu> wrote:
> Ralph H Castain wrote:
>> After some investigation, I'm afraid that I have to report that this - as
>> far as I understand what you are doing - may no longer work in Open MPI in
I've been playing with the trunk today and found it appears to be broken for
comm_spawn. I'm getting two types of errors, perhaps related:
1. if everything is being done on localhost, I do not see any of the IO from
the child process. Mpirun executes and completes cleanly, however.
On 8/6/07 1:51 PM, "Jeff Squyres" <jsquy...@cisco.com> wrote:
> On Aug 6, 2007, at 11:49 AM, Ralph H Castain wrote:
>> 1. if everything is being done on localhost, I do not see any of
>> the IO from
>> the child process. Mpirun executes an
This is just to let you know of a change in my status. I will be on vacation
all of next week (Aug 13-17), and possibly part of the following week as
well. I will not have my computer with me, so I will not be reading or
responding to email for up to two weeks.
When I return, I will
Just checked out a fresh copy of the trunk and tried to build it using my
./configure --prefix=/Users/rhc/openmpi --with-devel-headers
--disable-shared --enable-static --disable-mpi-f77 --disable-mpi-f90
--enable-mem-debug --without-memory-manager --enable-debug
On 8/27/07 7:30 AM, "Tim Prins" <tpr...@cs.indiana.edu> wrote:
> Ralph H Castain wrote:
>> Just returned from vacation...sorry for delayed response
> No Problem. Hope you had a good vacation :) And sorry for my super
> delayed respo
WHAT: Decide upon how to handle MPI applications where one or more
processes exit without calling MPI_Finalize
WHY: Some applications can abort via an exit call instead of
calling MPI_Abort when a library (or something else) calls
exit. This situation is outside a
Sorry for delay - wasn't ignoring the issue.
There are several fixes to this problem - ranging in order from least to
1. just alias "ssh" to be "ssh -Y" and run without setting the mca param. It
won't affect anything on the backend because the daemon/procs don't use ssh.
>>> I'm curious what changed to make this a problem. How were we passing mca
>>> from the base to the app before, and why did it change?
>>> I think that options 1 & 2 below are no good, since we, in general, allow
-- I'm just joining this conversation late: what's the problem
> with opal_cmd_line_parse?
> It should obey all quoting from shells, etc. I.e., it shouldn't care
> about tokens with special characters (to include spaces) because the
> shell divides all of that stuff up --
I'm not aware of any continuing discussion to totally remove the process
name from ORTE - I believe we coalesced to redefining how the jobid was
established to a procedure that doesn't require a name server. This hasn't
come over to the trunk yet, but will in the next couple of months.
odels are used. (i.e., you exec locally
> but it turns into a system-like invocation on the remote side). In
> this case, I think you'll need to quote extended strings (e.g., those
> containing spaces) for the non-local invocations and not quote it for
> local invocations.