RMetis call only...
Thanks for helping,
Eric
--
Eric Chamberland, ing., M. Ing
Professionnel de recherche
GIREF/Université Laval
(418) 656-2131 poste 41 22 42
--
Josh Hursey
IBM Spectrum MPI Developer
rankfile is specified?
[1]
https://stackoverflow.com/questions/32333785/how-to-provide-a-default-slot-list-in-openmpi-rankfile
> > wrote:
On 18/12/2020 23:04, Josh Hursey wrote:
Vincent,
Thanks for the details on the bug. Indeed this is a case that seems to have
been a problem for a little while now when you use static ports with ORTE (-mca
oob_tcp_static_ipv4_ports option). It must have crept in w
!= object) {
492 object->obj_class = cls;
493 object->obj_reference_count = 1;
494 opal_obj_run_constructors(object);
495 }
496 return object;
497 }
Could you first correct my understanding of which MCA option I could
us
sure there are no
> known issues between the hardware and software before we make a purchase.
> >
> > Any feedback will be greatly appreciated.
> >
> > Thanks,
> >
> > Prentice
> >
> > ___
> > users mailing list
> > users@lists.open-mpi.org
&
n in private/netloc.h. Am I missing something here?
> >>>>>
> >>>>> Thanks
> >>>>> Kavitha
> >>>>>
> >>>> ___
> >>>> hwloc-users mailing list
> >
eloper's meeting next week.
> > > I hope we can talk about it.
> > >
> > > Takahiro Kawashima,
> > > MPI development team,
> > > Fujitsu
>
> ___
> mtt-users mailing list
> mtt-users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/mtt-users
>
rank 1 with PID 0 on node cn15 exited on
> signal 7 (Bus error).
> ------
>
> Thanks in advance,
>
> Ender
>
> ___
> users mailing list
> users@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/users
>
rc1-Linux.x86_64.64_cc 129
>
>
> Gilles, I would be grateful if you could fix the problem for
> openmpi-2.1.0rc1 as well. Thank you very much for your help
> in advance.
>
>
> Kind regards
>
> Siegmar
und to socket 0[core 4[hwt 0]], socket
> 0[core 5[hwt 0]]: [././././B/B/./././././.][./././././././././././.]
> > [somehost:105601] MCW rank 5 bound to socket 1[core 16[hwt 0]], socket
> 1[core 17[hwt 0]]: [./././././././././././.][././././B/B/./././././.]
> >
> >
> > Any ideas, please?
> >
> > Thanks,
> >
>
This move requires -no- changes to any of your MTT client setups.
Let me know if you have any issues.
-- Josh
On Fri, Oct 21, 2016 at 9:53 PM, Josh Hursey <jjhur...@open-mpi.org> wrote:
> I have taken down the MTT Reporter at mtt.open-mpi.org while we finish up
> the migration. I'll send
I have taken down the MTT Reporter at mtt.open-mpi.org while we finish up
the migration. I'll send out another email when everything is up and
running again.
On Fri, Oct 21, 2016 at 10:17 AM, Josh Hursey <jjhur...@open-mpi.org> wrote:
> Reminder that the MTT will go offline starting at
19, 2016 at 10:14 AM, Josh Hursey <jjhur...@open-mpi.org> wrote:
> Based on current estimates we need to extend the window of downtime for
> MTT to 24 hours.
>
> *Start time*: *Fri., Oct. 21, 2016 at Noon US Eastern* (11 am US Central)
> *End time*: *Sat., Oct. 22, 20
any questions or concerns.
On Tue, Oct 18, 2016 at 10:59 AM, Josh Hursey <jjhur...@open-mpi.org> wrote:
> We are moving this downtime to *Friday, Oct. 21 from 2-5 pm US Eastern*.
>
> We hit a snag with the AWS configuration that we are working through.
>
> On Sun, Oct 16,
We are moving this downtime to *Friday, Oct. 21 from 2-5 pm US Eastern*.
We hit a snag with the AWS configuration that we are working through.
On Sun, Oct 16, 2016 at 9:53 AM, Josh Hursey <jjhur...@open-mpi.org> wrote:
> I will announce this on the Open MPI developer's teleconf o
to access MTT using
the mtt.open-mpi.org URL. No changes are needed in your MTT client setup,
and all permalinks are expected to still work after the move.
Let me know if you have any questions or concerns about the move.
-----
>
> Is the URL incorrect?
> Ralph
>
> ___
> mtt-users mailing list
> mtt-users@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/mtt-users
>
IBM will be helping to support the LSF functionality in Open MPI. We don't
have any detailed documentation just yet, other than the FAQ on the Open
MPI site. However, the LSF components in Open MPI should be functional in
the latest releases. I've tested recently with LSF 9.1.3 and 10.1.
I pushed
y (without the
LSB_PJL_TASK_GEOMETRY variable).
-- Josh
On Tue, Apr 19, 2016 at 8:57 AM, Josh Hursey <jjhur...@open-mpi.org> wrote:
> Farid,
>
> I have access to the same cluster inside IBM. I can try to help you track
> this down and maybe work up a patch with the LSF folks. I'll contac
Farid,
I have access to the same cluster inside IBM. I can try to help you track
this down and maybe work up a patch with the LSF folks. I'll contact you
off-list with my IBM address and we can work on this a bit.
I'll post back to the list with what we found.
-- Josh
On Tue, Apr 19, 2016 at
I think this is fine. If we do start to organize ourselves for a formal
release then we might want to move to pull requests to keep the branch
stable for a bit, but for now this is ok with me.
The Python client looks like it will be a nice addition. Hopefully, I will
have the REST submission
I think that would be good. I won't have any cycles to help until after the
first of the year. We started working towards a release way back when, but
I think we got stuck with the license to package up the graphing library
for the MTT Reporter. We could just remove that feature from the release
The C/R Debugging feature (the ability to do reversible debugging or
backward stepping with gdb and/or DDT) was added on 8/10/2010 in the commit
below:
https://svn.open-mpi.org/trac/ompi/changeset/23587
This feature never made it into a release so it was only ever available on
the trunk.
This is a bit late in the thread, but I wanted to add one more note.
The functionality that made it to v1.6 is fairly basic in terms of C/R
support in Open MPI. It supported a global checkpoint write, and (for a
time) a simple staged option (I think that is now broken).
In the trunk (about 3
> bash: ompi-migrate: command not found
>
> Please assist.
>
> Regards - Ifeanyi
>
>
>
> On Wed, Dec 12, 2012 at 3:19 AM, Josh Hursey <jjhur...@open-mpi.org>wrote:
>
>> Process migration was implemented in Open MPI and working in the trunk a
>> couple
With that configure string, Open MPI should fail in configure if it does
not find the BLCR libraries. Note that this does not check to make sure the
BLCR is loaded as a module in the kernel (you will need to check that
manually).
The ompi_info command will also show you if C/R is enabled and will
Process migration was implemented in Open MPI and working in the trunk a
couple of years ago. It has not been well maintained for a few years though
(hopefully that will change one day). So you can try it, but your results
may vary.
Some details are at the link below:
The openib BTL and BLCR support in Open MPI were working about a year ago
(when I last checked). The psm BTL is not supported at the moment though.
From the error, I suspect that we are not fully closing the openib btl
driver before the checkpoint thus when we try to restart it is looking for
a
Can you send the config.log and some of the other information described on:
http://www.open-mpi.org/community/help/
-- Josh
On Wed, Nov 14, 2012 at 6:01 PM, Ifeanyi wrote:
> Hi all,
>
> I got this message when I issued this command:
>
> root@node1:/home/abolap#
Pramoda,
That paper was exploring an application of a proposed extension to the MPI
standard for fault tolerance purposes. By default this proposed interface
is not provided by Open MPI. We have created a prototype version of Open
MPI that includes this extension, and it can be found at the
In your desired ordering you have rank 0 on (socket,core) (0,0) and
rank 1 on (0,2). Is there an architectural reason for that? Meaning
are cores 0 and 1 hardware threads in the same core, or is there a
cache level (say L2 or L3) connecting cores 0 and 1 separate from
cores 2 and 3?
hwloc's
experience any problems with the new server.
-- Josh
On Fri, Nov 2, 2012 at 9:26 AM, Josh Hursey <jjhur...@open-mpi.org> wrote:
> Reminder that we will be shutting down the MTT submission and reporter
> services this weekend to migrate it to another machine. The MTT
> services
Reminder that we will be shutting down the MTT submission and reporter
services this weekend to migrate it to another machine. The MTT
services will go offline at COB today, and be brought back by Monday
morning.
On Wed, Oct 31, 2012 at 7:54 AM, Jeff Squyres wrote:
> *** IF
Currently you have to do as Reuti mentioned (use the queuing system,
or create a script). We do have a feature request ticket open for this
feature if you are interested in following the progress:
https://svn.open-mpi.org/trac/ompi/ticket/1961
It has been open for a while, but the feature
The official support page for the C/R features is hosted by Indiana
University (linked from the Open MPI FAQs):
http://osl.iu.edu/research/ft/ompi-cr/
The instructions probably need to be cleaned up (some of the release
references are not quite correct any longer). But the following should
give
Ifeanyi,
I am usually the one that responds to checkpoint/restart questions,
but unfortunately I do not have time to look into this issue at the
moment (and probably won't for at least a few more months). There are
a few other developers that work on the checkpoint/restart
functionality that
You are correct that the Open MPI project combined the efforts of a
few preexisting MPI implementations towards building a single,
extensible MPI implementation with the best features of the prior MPI
implementations. From the beginning of the project the Open MPI
developer community has desired
(4) I installed Open MPI as root; should I move to
> a general user account?
>
>
> From: Josh Hursey <jjhur...@open-mpi.org>
> To: Open MPI Users <us...@open-mpi.org>
> Date: 2012/4/24 (Tue) 10:58 PM
>
> Subject: Re: [OMPI users]
heck/Restart Program contains DLL ?
I do not understand what you are trying to ask here. Please rephrase.
-- Josh
>
>
>
> From: Josh Hursey <jjhur...@open-mpi.org>
> To: Open MPI Users <us...@open-mpi.org>
I wonder if the LD_LIBRARY_PATH is not being set properly upon
restart. In your mpirun you pass the '-x LD_LIBRARY_PATH'.
ompi-restart will not pass that variable along for you, so if you are
using that to set the BLCR path this might be your problem.
A couple solutions:
- have the PATH and
The 1.5 series does not support process migration, so there is no
ompi-migrate option there. This was only contributed to the trunk (1.7
series). However, changes to the runtime environment over the past few
months have broken this functionality. It is currently unclear when
this will be repaired.
This is a bit of a non-answer, but can you try the 1.5 series (1.5.5
in the current release)? 1.4 is being phased out, and 1.5 will replace
it in the near future. 1.5 has a number of C/R related fixes that
might help.
-- Josh
On Thu, Mar 29, 2012 at 1:12 PM, Linton, Tom
When you receive that callback, the MPI library has been put in a quiescent
state. As such, it does not allow MPI communication until the checkpoint is
completely finished, so you cannot call barrier in the checkpoint callback.
Since Open MPI is doing a coordinated checkpoint, you can assume that all
It looks like Jeff beat me to it. The problem was with a missing 'test' in
the configure script. I'm not sure how it crept in there, but the fix is
in the pipeline for the next 1.5 release. Progress on this patch is
tracked in the following ticket:
Well that is awfully insistent. I have been able to reproduce the problem.
Upon initial inspection I don't see the bug, but I'll dig into it today and
hopefully have a patch in a bit. Below is a ticket for this bug:
https://svn.open-mpi.org/trac/ompi/ticket/2980
I'll let you know what I find
tensen <je...@fysik.dtu.dk>
> On 20-01-2012 15:26, Josh Hursey wrote:
>
> That behavior is permitted by the MPI 2.2 standard. It seems that our
> documentation is incorrect in this regard. I'll file a bug to fix it.
>
> Just to clarify, in the MPI 2.2 standard in Section 6.4.2 (
That behavior is permitted by the MPI 2.2 standard. It seems that our
documentation is incorrect in this regard. I'll file a bug to fix it.
Just to clarify, in the MPI 2.2 standard in Section 6.4.2 (Communicator
Constructors) under MPI_Comm_create it states:
"Each process must call with a group
library to annotate what's important to
>> store, and how to do so, etc.). But if you're writing the application,
>> you're better off to handle it internally, than externally.
>>
>> Lloyd Brown
>> Systems Administrator
>> Fulton Supercomputing Lab
>> Brigham Y
Currently Open MPI only supports the checkpointing of the whole
application. There has been some work on uncoordinated checkpointing with
message logging, though I do not know the state of that work with regards
to availability. That work has been undertaken by the University of
Tennessee
I have not tried to support a MTL with the checkpointing functionality, so
I do not have first hand experience with those - just the OB1/BML/BTL stack.
The difficulty in porting to a new transport is really a function of how
the transport interacts with the checkpointer (e.g., BLCR). The draining
Often this type of problem is due to the 'prelink' option in Linux.
BLCR has a FAQ item that discusses this issue and how to resolve it:
https://upc-bugs.lbl.gov/blcr/doc/html/FAQ.html#prelink
I would give that a try. If that does not help then you might want to
try checkpointing a single
For MPI_Comm_split, all processes in the input communicator (oldcomm
or MPI_COMM_WORLD in your case) must call the operation since it is
collective over the input communicator. In your program rank 0 is not
calling the operation, so MPI_Comm_split is waiting for it to
participate.
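The collective requirement can be sketched as follows (a hypothetical example, not the poster's program; the colors are arbitrary). A rank that does not want to join any new communicator still calls the operation, passing MPI_UNDEFINED as its color:

```c
/* Sketch: MPI_Comm_split is collective over the input communicator, so
 * EVERY rank of MPI_COMM_WORLD must call it.  A rank that should not
 * end up in a new communicator passes MPI_UNDEFINED as its color;
 * simply skipping the call deadlocks the other ranks.
 * Compile with mpicc, run with mpirun -np N ./a.out  */
#include <mpi.h>

int main(int argc, char **argv) {
    int rank;
    MPI_Comm newcomm;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Rank 0 opts out with MPI_UNDEFINED instead of not calling at all. */
    int color = (rank == 0) ? MPI_UNDEFINED : rank % 2;
    MPI_Comm_split(MPI_COMM_WORLD, color, rank, &newcomm);

    if (newcomm != MPI_COMM_NULL)   /* rank 0 gets MPI_COMM_NULL back */
        MPI_Comm_free(&newcomm);

    MPI_Finalize();
    return 0;
}
```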
If you want
Note that the "migrate me from my current node to node " scenario
is covered by the migration API exported by the C/R infrastructure, as
I noted earlier.
http://osl.iu.edu/research/ft/ompi-cr/api.php#api-cr_migrate
The "move rank N to node " scenario could probably be added as an
extension of
The MPI standard does not provide explicit support for process
migration. However, some MPI implementations (including Open MPI) have
integrated such support based on checkpoint/restart functionality. For
more information about the checkpoint/restart process migration
functionality in Open MPI see
I wonder if the try_compile step is failing. Can you send a compressed
copy of your config.log from this build?
-- Josh
On Mon, Oct 31, 2011 at 10:04 AM, wrote:
> Hi !
>
> I am trying to compile openmpi 1.4.4 with Torque, Infiniband and blcr
> checkpoint support on
n Wed, Oct 26, 2011 at 3:25 AM, Josh Hursey <jjhur...@open-mpi.org> wrote:
>>
>> Open MPI (trunk/1.7 - not 1.4 or 1.5) provides an application level
>> interface to request a checkpoint of an application. This API is
>> defined on the following website:
>>
Open MPI (trunk/1.7 - not 1.4 or 1.5) provides an application level
interface to request a checkpoint of an application. This API is
defined on the following website:
http://osl.iu.edu/research/ft/ompi-cr/api.php#api-cr_checkpoint
This will behave the same as if you requested the checkpoint of
That option is only available on the trunk at the moment. I filed a
ticket to move the functionality to the 1.5 branch:
https://svn.open-mpi.org/trac/ompi/ticket/2890
The work around would be to take the appfile generated from
"ompi-restart --apponly ompi_snapshot...", and then run mpirun with
That command line option may be only available on the trunk. What
version of Open MPI are you using?
-- Josh
On Tue, Oct 18, 2011 at 11:14 AM, Faisal Shahzad wrote:
> Hi,
> Thank you for your reply.
> I actually do not see option flag '--mpirun_opts' with 'ompi-restart
>
I'll preface my response with the note that I have not tried any of
those options with the C/R functionality. It should just work, but I
am not 100% certain. If it doesn't, let me know and I'll file a bug to
fix it.
You can pass any mpirun option through ompi-restart by using the
--mpirun_opts
It sounds like there is a race happening in the shutdown of the
processes. I wonder if the app is shutting down in a way that mpirun
does not quite like.
I have not tested the C/R functionality in the 1.4 series in a long
time. Can you give it a try with the 1.5 series, and see if there is
any
Though I do not share George's pessimism about acceptance to the Open
MPI community, it has been slightly difficult to add such a
non-standard feature to the code base for various reasons.
At ORNL, I have been developing a prototype for the MPI Forum Fault
Tolerance Working Group [1] of the
That seems like a bug to me.
What version of Open MPI are you using? How have you setup the C/R
functionality (what MCA options do you have set, what command line
options are you using)? Can you send a small reproducing application
that we can test against?
That should help us focus in on the
There are some great comments in this thread. Process migration (like
many topics in systems) can get complex fast.
The Open MPI process migration implementation is checkpoint/restart
based (currently using BLCR), and uses an 'eager' style of migration.
This style of migration stops a process
There should not be any issue in checkpointing a C++ vs C program
using the 'self' checkpointer. The self checkpointer just looks for a
particular function name to be present in the compiled program binary.
Something to try is to run 'nm' on the compiled C++ program and make
sure that the 'self'
I wonder if this is related to memory pinning. Can you try turning off
the leave pinned, and see if the problem persists (this may affect
performance, but should avoid the crash):
mpirun ... --mca mpi_leave_pinned 0 ...
Also it looks like Smoky has a slightly newer version of the 1.4
branch
When we started adding Checkpoint/Restart functionality to Open MPI,
we were hoping to provide a LAM/MPI-like interface to the C/R
functionality. So we added a configure option as a placeholder. The
'LAM' option was intended to help those transitioning from LAM/MPI to
Open MPI. However we never
gain,
Josh
On Thu, Jun 9, 2011 at 11:18 AM, Samuel Thibault
<samuel.thiba...@inria.fr> wrote:
> Hello,
>
> Josh Hursey, le Thu 09 Jun 2011 17:03:29 +0200, a écrit :
>> Program terminated with signal 4, Illegal instruction.
>> #0 0x0041d8d9 in hwloc_w
: leaveq
0x0041d8e0 <hwloc_weight_long+15>: retq
End of assembler dump.
-
On Thu, Jun 9, 2011 at 9:05 AM, Samuel Thibault
<samuel.thiba...@inria.fr> wrote:
> Josh Hursey, le Thu 09 Jun 2011 14:52:39 +0200, a écrit :
>> The odd thi
gure via sub-shell; we directly invoke its m4, so we don't have an
> opportunity to pass --disable-gcc-builtin. Unless you passed that to the
> top-level OMPI configure script...?
>
>
> On Jun 8, 2011, at 4:28 PM, Josh Hursey wrote:
>
>> (This should have gone to the
(This should have gone to the devel list)
The attached patch adds a configure option (--disable-gcc-builtin) to
disable the use of GCC __builtin_ operations, even if the GCC compiler
supports them. The patch is a diff from the r3509 revision of the
hwloc trunk.
I hit a problem when installing
(Sorry for the late reply)
On Jun 7, 2010, at 4:48 AM, Nguyen Kim Son wrote:
> Hello,
>
> I'm trying to get functions like orte-checkpoint, orte-restart, ... to work, but
> there are some errors that I don't have any clue about.
>
> Blcr (0.8.2) works fine apparently and I have installed openmpi
Open MPI can restart multi-threaded applications on any number of nodes (I do
this routinely in testing).
If you are still experiencing this problem (sorry for the late reply), can you
send me the MCA parameters that you are using, command line, and a backtrace
from the corefile generated by
On Jun 14, 2010, at 5:26 AM, Nguyen Toan wrote:
> Hi all,
> I have a MPI program as follows:
> ---
> int main(){
>MPI_Init();
>..
>for (i=0; i<1; i++) {
> my_atomic_func();
>}
>...
>MPI_Finalize();
>return 0;
> }
>
>
The amount of checkpoint overhead is application and system configuration
specific. So it is impossible to give you a good answer to how much checkpoint
overhead to expect for your application and system setup.
BLCR is only used to capture the single process image. The coordination of the
(Sorry for the delay, I missed the C/R question in the mail)
On May 25, 2010, at 9:35 AM, Jeff Squyres wrote:
On May 24, 2010, at 2:02 PM, Michael E. Thomadakis wrote:
| > 2) I have installed blcr V0.8.2 but when I try to build OMPI and I point to the
| > full installation it complains it
(Sorry for the delay in replying, more below)
On Apr 8, 2010, at 1:34 PM, Fernando Lemos wrote:
Hello,
I've noticed that ompi-restart doesn't support the --rankfile option.
It only supports --hostfile/--machinefile. Is there any reason
--rankfile isn't supported?
Suppose you have a cluster
(Sorry for the delay in replying, more below)
On Apr 12, 2010, at 6:36 AM, Hideyuki Jitsumoto wrote:
Hi Members,
I tried to use checkpoint/restart with Open MPI,
but I cannot get correct checkpoint data.
I prepared the execution environment as follows; the strings in ()
are the names of the output files.
The functionality of checkpoint operation is not tied to CPU
utilization. Are you running with the C/R thread enabled? If not then
the checkpoint might be waiting until the process enters the MPI
library.
Does the system emit an error message describing the error that it
encountered?
When you defined them in your environment did you prefix them with
'OMPI_MCA_'? Open MPI looks for this prefix to identify which
parameters are intended for it specifically.
-- Josh
On May 12, 2010, at 11:09 PM, wrote:
Ralph
Defining
So I recently hit this same problem while doing some scalability
testing. I experimented with adding the --no-restore-pid option, but
found the same problem as you mention. Unfortunately, the problem is
with BLCR, not Open MPI.
BLCR will restart the process with a new PID, but the value
So what you are looking for is checkpoint/restart support, which you
can find some details about at the link below:
http://osl.iu.edu/research/ft/ompi-cr/
Additionally, we relatively recently added the ability to checkpoint
and 'stop' the application. This generates a usable checkpoint of
I wonder if this is a bug with BLCR (since the segv stack is in the
BLCR thread). Can you try an non-MPI version of this application that
uses popen(), and see if BLCR properly checkpoints/restarts it?
If so, we can start to see what Open MPI might be doing to confuse
things, but I suspect
On Mon, Mar 29, 2010 at 11:42 AM, Josh Hursey <jjhursey@open-mpi.org> wrote:
On Mar 23, 2010, at 1:00 PM, Fernando Lemos wrote:
On Tue, Mar 23, 2010 at 12:55 PM, fengguang tian
<ferny...@gmail.com> wrote:
I use mpirun -np 50 -am ft-enable-cr --mca
snapc_base_global_snapshot_dir
Does this happen when you run without '-am ft-enable-cr' (so a no-C/R
run)?
This will help us determine if your problem is with the C/R work or
with the ORTE runtime. I suspect that there is something odd with your
system that is confusing the runtime (so not a C/R problem).
Have you
On Mar 23, 2010, at 1:00 PM, Fernando Lemos wrote:
On Tue, Mar 23, 2010 at 12:55 PM, fengguang tian
wrote:
I use mpirun -np 50 -am ft-enable-cr --mca
snapc_base_global_snapshot_dir
--hostfile .mpihostfile
to store the global checkpoint snapshot into the shared
So the MCA parameter that you mention is explained at the link below:
http://osl.iu.edu/research/ft/ompi-cr/api.php#mca-opal_cr_use_thread
This enables/disables the C/R thread a runtime if Open MPI was
configured with C/R thread support:
On Mar 20, 2010, at 11:14 PM, wrote:
I am observing a very strange performance issue with my openmpi
program.
I have compute intensive openmpi based application that keeps the
data in memory, process the data and then dumps it to GPFS
On Mar 22, 2010, at 4:41 PM, wrote:
Hi
If I run my compute-intensive Open MPI based program using a regular
invocation of mpirun (i.e., mpirun --host <host> -np <number of cores>),
it completes in a few seconds, but if I run the same
program with “-am
On Mar 21, 2010, at 12:58 PM, Addepalli, Srirangam V wrote:
Yes We have seen this behavior too.
Another behavior I have seen is that one MPI process starts to
show different elapsed time than its peers. Is it because
checkpoint happened on behalf of this process?
R
I have not been working with the integration of Open MPI and Torque
directly, so I cannot state how well this is supported. However, the
BLCR folks have been working on a Torque/Open MPI/BLCR project for a
while now, and have had some success. You might want to raise the
question on the
This type of failure is usually due to prelink'ing being left enabled
on one or more of the systems. This has come up multiple times on the
Open MPI list, but is actually a problem between BLCR and the Linux
kernel. BLCR has a FAQ entry on this that you will want to check out:
On Feb 10, 2010, at 9:45 AM, Addepalli, Srirangam V wrote:
> I am trying to test orte-checkpoint with an MPI job. It however hangs for all
> jobs. This is how the job is started:
> mpirun -np 8 -mca ft-enable cr /apps/nwchem-5.1.1/bin/LINUX64/nwchem
> siosi6.nw
This might be the
Thanks for the bug report. There are a couple of places in the code
that, in a sense, hard-code '/tmp' as the temporary directory. It
shouldn't be too hard to fix since there is a common function used in
the code to discover the 'true' temporary directory (which defaults
to /tmp). Of
to the v1.5
series if possible.
-- Josh
On Jan 25, 2010, at 3:33 PM, Josh Hursey wrote:
So while working on the error message, I noticed that the global
coordinator was using the wrong path to investigate the checkpoint
metadata. This particular section of code is not often used (which
reproduce it. Can you try the
trunk (either SVN checkout or nightly tarball from tonight) and check
if this solves your problem?
Cheers,
Josh
On Jan 25, 2010, at 12:14 PM, Josh Hursey wrote:
I am not able to reproduce this problem with the 1.4 branch using a
hostfile, and node configuration
to resolve
this problem.
Thank you
Jean
--- On Mon, 11/1/10, Josh Hursey <jjhur...@open-mpi.org> wrote:
From: Josh Hursey <jjhur...@open-mpi.org>
Subject: Re: [OMPI users] checkpointing multi node and multi process
applications
To: "Open MPI Users" <us...@open-
I tested the 1.4.1 release, and everything worked fine for me (tested
a few different configurations of nodes/environments).
The ompi-checkpoint error you cited is usually caused by one of two
things:
- The PID specified is wrong (which I don't think that is the case
here)
- The session