I am trying to compile openmpi (Revision: 29539) from svn
with '--with-ft=cr'. I get a compilation error and I do not know
how to solve it:
../../../../opal/mca/base/mca_base_components_open.c: In function
'open_components':
../../../../opal/mca/base/mca_base_components_open.c:144:9: error:
'mca_ba
release series at this
> time. We're looking to restore that support next year as part of the 1.9
> release series.
>
>
> On Oct 28, 2013, at 8:47 AM, Adrian Reber wrote:
>
> > I am trying to compile openmpi (Revision: 29539) from svn
> > with '--with-ft=cr
).
This first patch fixes wrong include directives when compiling with
OPAL_SETUP_FT_OPTIONS.
Adrian
From c417f21e5a720f8bfe9ee222948ae8c59d4a485b Mon Sep 17 00:00:00 2001
From: Adrian Reber
Date: Wed, 20 Nov 2013 14:50:12 +0100
Subject: [PA
using
"--with-ft=cr". This patchset only fixes existing compilation problems;
the code is not yet expected to work.
I used "make check" to verify that it does not break existing code.
Adrian Reber (4):
Trying to get the C/R code to compile again. (void value not ignored)
T
From: Adrian Reber
This patch changes all recv/recv_buffer occurrences in the C/R code
to recv_nb/recv_buffer_nb.
Signed-off-by: Adrian Reber
---
ompi/mca/crcp/bkmrk/crcp_bkmrk_pml.c| 49 +++-
orte/mca/errmgr/base/errmgr_base_tool.c | 12 +---
orte/mca/rml/ftrm
From: Adrian Reber
These are the remaining changes to get C/R to compile again. This patch
includes various fixes all over the C/R code that are hard to group
like the previous patches.
Signed-off-by: Adrian Reber
---
ompi/mca/bml/r2/bml_r2_ft.c| 10 +-
opal/mca/base
From: Adrian Reber
This patch fixes
error: void value not ignored as it ought to be
in the C/R code by ignoring the return value of functions which
no longer return a value (only void).
Signed-off-by: Adrian Reber
---
orte/mca/errmgr/base/errmgr_base_tool.c | 8 +-
orte/mca
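A minimal sketch of the kind of fix this patch describes, not the actual patch: when a function's return type changes from int to void, callers that still assign or test its result fail with "void value not ignored as it ought to be", and the fix is to call the function for its side effect only. The names demo_notify and demo_checkpoint_step are hypothetical stand-ins, not Open MPI functions.

```c
#include <stdio.h>

/* Counts calls so the side effect of demo_notify() is observable. */
static int demo_notify_calls = 0;

/* Hypothetical API that used to return int and now returns void. */
static void demo_notify(const char *what)
{
    demo_notify_calls++;
    printf("notified: %s\n", what);
}

/*
 * Old caller (no longer compiles once demo_notify() returns void):
 *
 *     int rc = demo_notify("checkpoint");
 *     if (rc != 0) { return rc; }
 *
 * New caller: invoke the function for its side effect and do not
 * use a return value that no longer exists.
 */
int demo_checkpoint_step(void)
{
    demo_notify("checkpoint");
    return 0;
}
```

If the caller still needs to learn about failures, that information has to come from somewhere else (for example a callback), which this sketch does not cover.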
From: Adrian Reber
This patch changes all send/send_buffer occurrences in the C/R code
to send_nb/send_buffer_nb.
Signed-off-by: Adrian Reber
---
ompi/mca/crcp/bkmrk/crcp_bkmrk_pml.c| 12 ++---
orte/mca/errmgr/base/errmgr_base_tool.c | 2 +-
orte/mca/rml/ftrm/rml_ftrm.h
On Wed, Dec 04, 2013 at 08:07:39PM +, Jeff Squyres (jsquyres) wrote:
> On Dec 4, 2013, at 11:29 AM, Ralph Castain wrote:
>
> > Jeff - you are jumping way ahead. I already said this needs further work to
> > resolve blocking. These patches (per Adrian's email) just make things
> > compile
>
9 AM, Adrian Reber wrote:
>
> > From: Adrian Reber
> >
> > This patch fixes
> >
> > error: void value not ignored as it ought to be
> >
> > in the C/R code by ignoring the return value of functions which
> > no longer return a value (only vo
e totally wrong about emulating
> > blocking. There might be (probably are?) rules/assumptions in the ORTE
> > layer (of which I am *not* an expert) that disallow you from [emulating]
> > blocking.
> >
> > If that's the case, then there's architectural issu
On Fri, Dec 06, 2013 at 08:43:39AM -0600, Josh Hursey wrote:
> Did the mca_base_component_distill_checkpoint_ready parameter go away? Its
> intention was to allow a user to have a build with C/R compiled in and then
> choose at runtime if they want to restrict their component section to just
> C/R e
From: Adrian Reber
These are the remaining changes to get C/R to compile again. This patch
includes various fixes all over the C/R code that are hard to group
like the previous patches.
Changes from V1:
* explain why mca_base_component_distill_checkpoint_ready no longer works
* compare return
9, 2013, at 5:38 AM, Adrian Reber wrote:
>
> > diff --git a/orte/mca/rml/oob/rml_oob_component.c
> > b/orte/mca/rml/oob/rml_oob_component.c
> > index dd539cd..b91f4a3 100644
> > --- a/orte/mca/rml/oob/rml_oob_component.c
> > +++ b/orte/mca/rml/oob/rml_oob_compon
eferencing the OOB, then we need to
> > > go directly to it. I'll have to check/correct the code, but the RML
> > > shouldn't even be storing a pointer to the OOB in it as there no longer
> > > is a direct linkage.
> > >
> > >
> > >
Is there a phone number I can use to join the meeting via phone from
Germany?
Adrian
On Thu, Dec 12, 2013 at 02:43:38PM -0600, Ralph Castain wrote:
> Sorry for delay - we just realized we hadn't addressed this yet.
>
> Plan is to still start at 9am as planned. Hope that is okay.
> >
> > orte_rml_oob_ft_event
> >
> > No need to reference thru the module unless you want to for some reason.
> >
> >
> > >
> > > This doesn't seem right - if we are referencing the OOB, then we need to
> > go directly to it. I'l
From: Adrian Reber
This is the second try to replace the usage of blocking send and
recv in the C/R code with the non-blocking versions. The new code
compiles (in contrast to the old code) but does not work yet.
This is the first step to get the C/R code working again. Right
now it only compiles
From: Adrian Reber
This patch changes all send/send_buffer occurrences in the C/R code
to send_nb/send_buffer_nb.
The old code is still there but disabled using ifdefs (ENABLE_FT_FIXED).
The new code compiles but does not work.
Changes from V1:
* #ifdef out the code (so it is preserved for
From: Adrian Reber
This patch changes all recv/recv_buffer occurrences in the C/R code
to recv_nb/recv_buffer_nb.
The old code is still there but disabled using ifdefs (ENABLE_FT_FIXED).
The new code compiles but does not work.
Changes from V1:
* #ifdef out the code (so it is preserved for
e return code isn't
> the number of bytes sent any more - it is just ORTE_SUCCESS or else an error
> code, so you should be testing for ORTE_SUCCESS ==
>
>
>
>
> On Dec 18, 2013, at 6:42 AM, Adrian Reber wrote:
>
> > From: Adrian Reber
> >
> >
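A small sketch of the return-code check described in the reply above, under the assumption that the send call now reports only success or an error code instead of a byte count. DEMO_SUCCESS, DEMO_ERROR and demo_send_nb are hypothetical stand-ins for illustration, not the real ORTE symbols.

```c
#include <stddef.h>

/* Hypothetical status codes standing in for a success/error convention. */
#define DEMO_SUCCESS 0
#define DEMO_ERROR   (-1)

/* Hypothetical non-blocking send: returns a status, not a byte count. */
static int demo_send_nb(const void *buf, size_t len)
{
    return (buf != NULL && len > 0) ? DEMO_SUCCESS : DEMO_ERROR;
}

/*
 * Old-style check that is now wrong, because it treats the return
 * value as the number of bytes sent:
 *
 *     if (demo_send_nb(buf, len) != (int)len) { ...error... }
 *
 * New-style check: compare the status against the success constant.
 * Returns 1 if the send was accepted, 0 otherwise.
 */
int demo_send_ok(const void *buf, size_t len)
{
    int ret = demo_send_nb(buf, len);
    return DEMO_SUCCESS == ret;
}
```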
From: Adrian Reber
This is the second try to replace the usage of blocking send and
recv in the C/R code with the non-blocking versions. The new code
compiles (in contrast to the old code) but does not work yet.
This is the first step to get the C/R code working again. Right
now it only compiles
From: Adrian Reber
This patch changes all recv/recv_buffer occurrences in the C/R code
to recv_nb/recv_buffer_nb.
The old code is still there but disabled using ifdefs (ENABLE_FT_FIXED).
The new code compiles but does not work.
Changes from V1:
* #ifdef out the code (so it is preserved for
From: Adrian Reber
This patch changes all send/send_buffer occurrences in the C/R code
to send_nb/send_buffer_nb.
The new code compiles but does not work.
Changes from V1:
* #ifdef out the code (so it is preserved for later re-design)
* marked the broken C/R code with ENABLE_FT_FIXED
Changes
On Thu, Dec 19, 2013 at 09:54:19PM +0100, Adrian Reber wrote:
> This is the second try to replace the usage of blocking send and
> recv in the C/R code with the non-blocking versions. The new code
> compiles (in contrast to the old code) but does not work yet.
> This is the first step
Trying to run Open MPI with C/R enabled I get the following error
with --enable-debug:
[dcbz:20360] orte_rml_base_select: initializing rml component oob
[dcbz:20360] orte_rml_base_select: initializing rml component ftrm
[dcbz:20360] orte_rml_base_select: module ftrm unloaded
orterun: ../../opal/cl
0045 and let me know if it fixes your issue.
>
> George.
>
>
> On Dec 21, 2013, at 22:05 , Adrian Reber wrote:
>
> > Trying to run Open MPI with C/R enabled I get the following error
> > with --enable-debug:
> >
> > [dcbz:20360] orte_rml_ba
Right now the C/R code fails because of a change introduced in
opal/mca/compress/base/compress_base_open.c in 2013 with commit
git 734c724ff76d9bf814f3ab0396bcd9ee6fddcd1b
svn r28239
Update OPAL frameworks to use the MCA framework system.
This commit changed a lot but also the return value o
> CR infrastructure (although the CR infrastructure can use the compress
> framework if the user chooses to). So I bet we can remove the protection
> altogether and be fine.
>
> So I think this patch is fine. I might also go as far as removing the 'if'
> block altoge
_ERR_NOT_AVAILABLE. This would still avoid opening the
> > components for no reason (thus saving some memory) while not causing
> > opal_init to abort.
> >
> >
> > On Jan 3, 2014, at 3:19 AM, Adrian Reber wrote:
> >
> > > So removing all output like
Continuing with the CR code I now get a crash which can be easily reproduced
using orte/test/system/orte_barrier.c
I get:
orte_barrier: ../../../../../opal/class/opal_list.h:547: _opal_list_append:
Assertion `0 == item->opal_list_item_refcount' failed.
[dcbz:05085] *** Process received signal **
For my CR work this can probably be ignored. I think I was looking in the
wrong place.
On Thu, Jan 09, 2014 at 05:28:01PM +0100, Adrian Reber wrote:
> Continuing with the CR code I now get a crash which can be easily reproduced
> using orte/test/system/orte_barrier.c
>
> I get:
>
I am currently trying to understand how callbacks are working. Right now
I am looking at orte/mca/rml/base/rml_base_receive.c
orte_rml_base_comm_start() which does
orte_rml.recv_buffer_nb(ORTE_NAME_WILDCARD,
ORTE_RML_TAG_RML_INFO_UPDATE,
On Fri, Jan 10, 2014 at 09:48:14AM -0800, Ralph Castain wrote:
>
> On Jan 10, 2014, at 8:02 AM, Adrian Reber wrote:
>
> > I am currently trying to understand how callbacks are working. Right now
> > I am looking at orte/mca/rml/base/rml_base_receive.c
> > orte_rml_b
s okay to
> block using ORTE_WAIT_FOR_COMPLETION. Look in
> orte/mca/routed/base/routed_base_fns.c starting at line 252 for an example.
>
> HTH
> Ralph
>
> On Jan 10, 2014, at 12:55 PM, Ralph Castain wrote:
>
> >
> > On Jan 10, 2014, at 12:45 PM, Adrian Rebe
, Jan 20, 2014 at 02:46:04PM -0800, Ralph Castain wrote:
> Is it orte-checkpoint that is hanging, or the app you are trying to
> checkpoint?
>
>
> On Jan 20, 2014, at 2:10 PM, Adrian Reber wrote:
>
> > Thanks for your help. I tried initializing the barrier correctly (see
>
n Mon, Jan 20, 2014 at 4:46 PM, Ralph Castain wrote:
>
> > Is it orte-checkpoint that is hanging, or the app you are trying to
> > checkpoint?
> >
> >
> > On Jan 20, 2014, at 2:10 PM, Adrian Reber wrote:
> >
> > Thanks for your help. I tried initializing
> However, like I said, it makes no sense for orte-checkpoint to do a barrier
> as it is a singleton - there is nothing for it to "barrier" with.
>
> On Jan 21, 2014, at 7:24 AM, Adrian Reber wrote:
>
> > I think I still do not really understand how it w
s that orte-checkpoint is a tool, and so it isn't a daemon -
> but it is also not an app.
>
>
> On Jan 21, 2014, at 11:56 AM, Adrian Reber wrote:
>
> > Good to know that it does not make any sense. So it is not just me.
> >
> > Looking at the call cha
The following patch makes orte-checkpoint communicate with orterun again:
diff --git a/orte/tools/orte-checkpoint/orte-checkpoint.c
b/orte/tools/orte-checkpoint/orte-checkpoint.c
index 7106342..8539f34 100644
--- a/orte/tools/orte-checkpoint/orte-checkpoint.c
+++ b/orte/tools/orte-checkpoint/orte-che
Selecting SNAPC requires knowing whether the caller is an app or not:
int orte_snapc_base_select(bool seed, bool app);
The following patch uses the correct define. Can I commit it like this:
t a/orte/mca/ess/base/ess_base_std_app.c b/orte/mca/ess/base/ess_base_std_app.c
index dbbb2f4..f3a38f0 100644
ata underneath to save
> > memory and time.
> >
> >
> > On Jan 23, 2014, at 6:51 AM, Adrian Reber wrote:
> >
> > > Following patch makes orte-checkpoint communicate with orterun again:
> > >
> > > diff --git a/orte/tools/orte-checkpoint/ort
removed with my 'getting-it-compiled-again'
patches. Instead of blocking recv() calls it now uses
ORTE_WAIT_FOR_COMPLETION(). I included gitweb links to the patches.
Please have a look at the patches.
Adrian
commit 6f10b44499b59c84d9032378c7f8c6b3526a029b
Author: Ad
> 2. be aware that ORTE_WAIT_FOR_COMPLETION will block if you are in an RML
> callback. I don't think that's an issue here, but just wanted to point it out.
>
> Ralph
>
> On Jan 27, 2014, at 8:12 AM, Adrian Reber wrote:
>
> > I have the following patc
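To illustrate the pattern discussed above (replacing a blocking receive with a non-blocking one plus a wait on a completion flag), here is a self-contained sketch. It does not use the real ORTE API: demo_recv_nb, demo_progress and the one-slot "queue" are hypothetical, and the real ORTE_WAIT_FOR_COMPLETION macro may be implemented differently.

```c
#include <stdbool.h>
#include <stddef.h>

typedef void (*demo_cb_t)(int value, void *cbdata);

/* One-slot pending receive, standing in for a real message layer. */
static demo_cb_t demo_pending_cb = NULL;
static void *demo_pending_data = NULL;
static int demo_ticks_left = 0;

/* Post a non-blocking receive: returns at once, cb fires later. */
static void demo_recv_nb(demo_cb_t cb, void *cbdata)
{
    demo_pending_cb = cb;
    demo_pending_data = cbdata;
    demo_ticks_left = 3;    /* pretend the message arrives after 3 ticks */
}

/* Drive the fake event loop; delivers the message when it "arrives". */
static void demo_progress(void)
{
    if (demo_pending_cb != NULL && --demo_ticks_left == 0) {
        demo_cb_t cb = demo_pending_cb;
        demo_pending_cb = NULL;
        cb(42, demo_pending_data);
    }
}

/* Completion state shared between the waiter and the callback. */
struct demo_wait {
    bool active;
    int value;
};

/* Callback: store the result and clear the flag the waiter spins on. */
static void demo_on_message(int value, void *cbdata)
{
    struct demo_wait *w = cbdata;
    w->value = value;
    w->active = false;
}

/* Emulated blocking receive: post, then progress until the flag clears. */
int demo_blocking_recv(void)
{
    struct demo_wait w = { .active = true, .value = -1 };
    demo_recv_nb(demo_on_message, &w);
    while (w.active) {
        demo_progress();
    }
    return w.value;
}
```

The caveat from the reply still applies to any design like this: if the wait is entered from inside a callback and the event that clears the flag can only be delivered after that callback returns, the loop never ends.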
This patch
https://lisas.de/git/?p=open-mpi.git;a=commitdiff;h=14ec7f42baab882e345948ff79c4f75f5084bbbf
introduces unique collective ids for the checkpoint/restart code and
with this applied it seems to work pretty well. As this patch also
touches non-CR code it would be good if someone could hav
a "printf" statement in
> plm_base_launch_support.c, so you might want to make that an
> opal_output_verbose or something.
>
> On Feb 3, 2014, at 12:19 PM, Adrian Reber wrote:
>
> > This patch
> >
> > https://lisas.de/git/?p=open-mpi.git;a=commitdiff;h=1
When I initially made the C/R code compile again I made the following
change:
diff --git a/orte/mca/rml/oob/rml_oob_component.c
b/orte/mca/rml/oob/rml_oob_component.c
index f0b22fc..90ed086 100644
--- a/orte/mca/rml/oob/rml_oob_component.c
+++ b/orte/mca/rml/oob/rml_oob_component.c
@@ -185,8 +185,7 @
e, once you did that, the OOB would no longer be available to, for
> example, tell the local daemon that the app is ready for checkpoint :-)
>
> Afraid I'll have to defer to Josh H for any further guidance.
>
>
> On Feb 6, 2014, at 8:15 AM, Adrian Reber wrote:
>
>
I have created a new CRS component using criu (criu.org) to support
checkpoint/restart in Open MPI. My current patch only provides the
framework and necessary configure scripts to detect and link against
criu. With this patch orte-checkpoint can request a checkpoint and the
new CRIU CRS component i
On Fri, Feb 07, 2014 at 10:08:48PM +, Jeff Squyres (jsquyres) wrote:
> Sweet -- +1 for CRIU support!
>
> FWIW, I see you modeled your configure.m4 off the blcr configure.m4, but I'd
> actually go with making it a bit simpler. For example, I typically structure
> my configure.m4's like this
On Tue, Feb 11, 2014 at 08:09:35PM +, Jeff Squyres (jsquyres) wrote:
> On Feb 8, 2014, at 4:49 PM, Adrian Reber wrote:
>
> >> I note you have a stray $3 at the end of your configure.m4, too (it might
> >> be supposed to be $2?).
> >
> > I think I do not re
I tried the nightly snapshot (openmpi-1.7.5a1r30692.tar.gz) on a system
with slurm and moab. I requested an interactive session using:
msub -I -l nodes=3:ppn=8
and started a simple test case which fails:
$ mpirun -np 2 ./mpi-test 1
:45AM -0800, Ralph Castain wrote:
> Seems rather odd - since this is managed by Moab, you shouldn't be seeing
> SLURM envars at all. What you should see are PBS_* envars, including a
> PBS_NODEFILE that actually contains the allocation.
>
>
> On Feb 12, 2014, at 4:42
9 AM, Ralph Castain wrote:
>
> > What is your SLURM_TASKS_PER_NODE?
> >
> > On Feb 12, 2014, at 6:58 AM, Adrian Reber wrote:
> >
> >> No, the system has only a few MOAB_* variables and many SLURM_*
> >> variables:
> >>
On Wed, Feb 12, 2014 at 07:47:53AM -0800, Ralph Castain wrote:
> >
> > $ msub -I -l nodes=3:ppn=8
> > salloc: Job is in held state, pending scheduler release
> > salloc: Pending job allocation 131828
> > salloc: job 131828 queued and waiting for resources
> > salloc: job 131828 has been allocated
2014, at 7:47 AM, Ralph Castain wrote:
>
> >
> > On Feb 12, 2014, at 7:32 AM, Adrian Reber wrote:
> >
> >>
> >> $ msub -I -l nodes=3:ppn=8
> >> salloc: Job is in held state, pending scheduler release
> >> salloc: Pending job allocation 131828
On Thu, Feb 06, 2014 at 02:45:07PM -0800, Ralph Castain wrote:
> On Feb 6, 2014, at 2:16 PM, Adrian Reber wrote:
>
> > Josh explained to me a few days ago that after a checkpoint has been
> > received, TCP should no longer be used so that no messages are lost. The
> > co
I am trying to find out how to deal with string variables. Do I have to
allocate the memory before calling mca_base_component_var_register() or
not? It seems it does a strdup() meaning it has to be free()'d while
closing the component. Looking at other occurrences of string variables
I see differen
Sure. I added the cloneurl information:
https://lisas.de/~adrian/open-mpi.git
On Fri, Feb 14, 2014 at 04:30:05PM +, Jeff Squyres (jsquyres) wrote:
> Can I clone your git tree and send you a patch?
>
> On Feb 11, 2014, at 4:45 PM, Adrian Reber wrote:
>
> > On Tue, Feb
tps://github.com/jsquyres/fork-from-adrian-ft/commit/f5962184f3ea6dffc182a18f7603c5e70e82ac99
>
>
>
> On Feb 14, 2014, at 11:35 AM, "Jeff Squyres (jsquyres)"
> wrote:
>
> > Perfect; cloning now. Thanks!
> >
> > On Feb 14, 2014, at 11:34 AM, Adrian Reber
> > w
This is probably for Josh. What is the meaning of the OPAL_CRS_* enums?
They are probably used to communicate the state of the CRS modules.
OPAL_CRS_ERROR seems to be used in case an error happened. What is the
CRS module supposed to set this to if the checkpoint was successful?
OPAL_CRS_CONTINUE
With the newly added oob/usock, checkpointing with CRIU stopped working.
Is there a way I can prefer oob/tcp on the command line?
Adrian
I have prepared a patch I would like to commit which adds the code to
actually checkpoint a process. Thanks for the pointers about the string
variables; I tried to implement it correctly.
CRIU currently has problems with the new OOB usock but I will contact
the CRIU developers about this error. U
ell from its return code (or some other mechanism)
> that it is being restarted versus continuing after checkpointing?
>
>
> On Mon, Feb 17, 2014 at 2:00 PM, Ralph Castain wrote:
>
> > Great - looks fine to me!!
> >
> >
> > On Feb 17, 2014, at 11:39 AM, Adria
> You can see it used in the opal_cr_inc_core_prep() function in
> opal/runtime/opal_cr.c
>
> -- Josh
>
>
>
> On Mon, Feb 17, 2014 at 9:28 AM, Adrian Reber wrote:
>
> > This is probably for Josh. What is the meaning of the OPAL_CRS_* enums?
> >
> >
On Fri, Feb 14, 2014 at 02:51:51PM -0800, Ralph Castain wrote:
> On Feb 13, 2014, at 11:26 AM, Adrian Reber wrote:
> > I tried to implement something like you described. It is not yet event
> > driven, but before continuing I wanted to get some feedback if it is at
> >
On Tue, Feb 18, 2014 at 06:39:12AM -0800, Ralph Castain wrote:
> On Feb 18, 2014, at 6:24 AM, Adrian Reber wrote:
>
> > On Fri, Feb 14, 2014 at 02:51:51PM -0800, Ralph Castain wrote:
> >> On Feb 13, 2014, at 11:26 AM, Adrian Reber wrote:
> >>> I tried to imple
e checkpoint only functionality in continue
mode the patch can be checked in?
Adrian
> On Tue, Feb 18, 2014 at 4:08 AM, Adrian Reber wrote:
>
> > I think I do not understand your question. So far I have only implemented
> > the
> > checkpoint part and no
To restart a process using orte-restart I need sstore initialized when
running as a tool. This is currently missing. The new code is
#if OPAL_ENABLE_FT_CR == 1
and should only affect --with-ft builds. The following is the change I
want to make:
diff --git a/orte/mca/ess/base/ess_base_std_tool.c
There is a variable in the FT code which is not defined and therefore
currently #ifdef'd out.
#if (OPAL_ENABLE_FT == 1) && (OPAL_ENABLE_FT_CR == 1)
#ifdef ENABLE_FT_FIXED
/* FIXME_FT
*
* the variable mca_base_component_distill_checkpoint_ready
* was removed by commit 8181c8273c4
On a Scientific Linux 5.5 system the nightly snapshot
openmpi-1.7.5a1r30797 fails to build with the following errors:
Making all in romio
make[3]: Entering directory
`/tmp/adrian/openmpi-compile/openmpi-1.7.5a1r30797/build/ompi/mca/io/romio/romio'
make[4]: Entering directory
`/tmp/adrian/openmpi-co
I have a simple patch which fixes the remaining compiler warnings when
running with '--with-ft':
https://lisas.de/git/?p=open-mpi.git;a=commitdiff;h=4dee703a0a2e64972b0c35b7693c11a09f1fbe5f
Does anybody see any problems with this patch?
Adrian
mewhere else? or do you have a different way to set those parameters?
>
> Other than that it looks good to me.
>
>
> On Mon, Mar 3, 2014 at 5:29 AM, Adrian Reber wrote:
>
> > I have a simple patch which fixes the remaining compiler warnings when
> > running with
r how to set it up at the moment.
>
>
>
>
> On Mon, Mar 3, 2014 at 7:25 AM, Adrian Reber wrote:
>
> > I removed a complete function because it was not used:
> >
> > ../../../../../orte/mca/sstore/stage/sstore_stage_component.c: At top
On Fri, Feb 21, 2014 at 10:12:54AM -0700, Nathan Hjelm wrote:
> On Fri, Feb 21, 2014 at 05:21:10PM +0100, Adrian Reber wrote:
> > There is a variable in the FT code which is not defined and therefore
> > currently #ifdef'd out.
> >
> > #if (OPAL_ENABLE_FT
Caching = Disabled
[dcbz:02880] sstore:stage: open: Compression = Disabled
[dcbz:02880] sstore:stage: open: Compression Delay= 0
[dcbz:02880] sstore:stage: open: Skip FileM (Debug Only) = False
On Mon, Mar 03, 2014 at 05:42:13PM +0100, Adrian Reber wrote:
> I w
On Tue, Feb 18, 2014 at 03:46:58PM +0100, Adrian Reber wrote:
> > >>> I tried to implement something like you described. It is not yet event
> > >>> driven, but before continuing I wanted to get some feedback if it is at
> > >>> least the right start:
On Thu, Mar 06, 2014 at 07:47:22PM -0800, Ralph Castain wrote:
> > Sorry for delay - yes, that looks like the right direction. I would
> > suggest doing it via the current state machine, though, by simply
> > defining another job or proc state in orte/mca/plm/plm_types.h, and
> >
On Fri, Mar 07, 2014 at 06:54:18AM -0800, Ralph Castain wrote:
> > If you like, I can define the required code in the trunk and let you
> > fill in the event functionality.
>
> That would be great.
> >>>
> >>> Thanks for your changes. When using --with-ft there are a few compil
I am using orte-restart without setting my PATH to my Open MPI
installation. I am running /full/path/to/orte-restart and orte-restart
tries to run mpirun to restart the process. This fails on my system
because I do not have any mpirun in my PATH. Is it expected for an Open
MPI installation to set u
I am now trying to run orte-restart. As far as I understand it
orte-restart analyzes the checkpoint metadata and then tries to exec()
mpirun which then starts opal-restart. During the startup of
opal-restart (during initialize()) detection of the best CRS module is
disabled:
/*
* Turn of
egistered.
>
> -Nathan
>
> Please excuse the horrible Outlook top-posting. OWA sucks.
>
>
> From: devel [devel-boun...@open-mpi.org] on behalf of Adrian Reber
> [adr...@lisas.de]
> Sent: Friday, March 14, 2014 3:05 PM
> To: de
On Fri, Mar 14, 2014 at 10:18:06PM +, Hjelm, Nathan T wrote:
> The preferred way is to use mca_base_var_find and then call
> mca_base_var_[set|get]_value. For performance sake we only look at the
> environment when the variable is registered.
I believe I found a bug in mca_base_var_set_value
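The find-then-set pattern recommended above can be sketched with a tiny mock registry. This is not the MCA variable system: demo_var_register, demo_var_find, demo_var_set_value and demo_var_get_value are hypothetical names with simplified signatures, chosen only to show that the environment is read once at registration and that later changes go through the variable's index.

```c
#include <stdlib.h>
#include <string.h>

#define DEMO_MAX_VARS 8

struct demo_var {
    const char *name;
    char value[64];
};

static struct demo_var demo_vars[DEMO_MAX_VARS];
static int demo_var_count = 0;

/* Copy src into a variable's value buffer, always NUL-terminated. */
static void demo_copy_value(struct demo_var *v, const char *src)
{
    strncpy(v->value, src, sizeof(v->value) - 1);
    v->value[sizeof(v->value) - 1] = '\0';
}

/* Register a variable; the environment is consulted here and only here. */
int demo_var_register(const char *name, const char *env_name, const char *dflt)
{
    if (demo_var_count >= DEMO_MAX_VARS) {
        return -1;
    }
    const char *env = getenv(env_name);
    struct demo_var *v = &demo_vars[demo_var_count];
    v->name = name;
    demo_copy_value(v, env != NULL ? env : dflt);
    return demo_var_count++;
}

/* Look up a variable's index by name; negative if it is not registered. */
int demo_var_find(const char *name)
{
    for (int i = 0; i < demo_var_count; i++) {
        if (strcmp(demo_vars[i].name, name) == 0) {
            return i;
        }
    }
    return -1;
}

/* Change the value through the index; setting the environment later
 * would have no effect because it was only read at registration. */
int demo_var_set_value(int index, const char *value)
{
    if (index < 0 || index >= demo_var_count) {
        return -1;
    }
    demo_copy_value(&demo_vars[index], value);
    return 0;
}

/* Read the current value through the index; NULL for a bad index. */
const char *demo_var_get_value(int index)
{
    if (index < 0 || index >= demo_var_count) {
        return NULL;
    }
    return demo_vars[index].value;
}
```

In this mock, selecting a preferred module would amount to calling demo_var_find("crs") and then demo_var_set_value() with the result, instead of setting an environment variable after registration.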
lue() to select the preferred crs module?
Adrian
On Mon, Mar 17, 2014 at 08:47:16AM -0600, Nathan Hjelm wrote:
> Good catch. Fixing now.
>
> -Nathan
>
> On Mon, Mar 17, 2014 at 02:50:02PM +0100, Adrian Reber wrote:
> > On Fri, Mar 14, 2014 at 10:18:06PM +, Hjelm, Nathan T w
Cross-posting to criu and openmpi devel mailinglists.
To get fault tolerance back into Open MPI I added code to use criu as
a checkpoint/restart tool. I can checkpoint a process successfully
but I have trouble restarting it. CRIU currently has problems restoring
the process which is probably rela
Trying to restart a process I see that orterun has three pipes connected
to the processes running under its control (-np 1).
orterun:
orterun 11562 adrian 15w FIFO 0,8 0t0 5304173 pipe
orterun 11562 adrian 16r FIFO 0,8 0t0 5304174 pipe
orterun
On Wed, Apr 16, 2014 at 10:32:10AM +, Jeff Squyres (jsquyres) wrote:
> What source code repository technology(ies) do you use for Open MPI
> development? (indicate all that apply)
>
> - SVN
> - Mercurial
> - Git
git
Adrian
On Fri, Apr 25, 2014 at 10:29:36AM +, Jeff Squyres (jsquyres) wrote:
> On Apr 25, 2014, at 6:13 AM, Gilles Gouaillardet
> wrote:
>
> > it is possible to use qemu in order to emulate unavailable hardware.
> > for what it's worth, i am now running a ppc64 qemu emulated virtual
> > machine on a
The fault tolerance code also needs additional changes because of this
commit. I have the changes prepared but not committed.
On Wed, Jun 18, 2014 at 03:45:11PM -0700, Ralph Castain wrote:
> Huh - thought I got that. Sorry I missed it. Let me take a look and ensure
> that the alps ras module is s
I have seen it before but it was not reproducible. I now have two
segfaults in opal_fifo in today's MTT run on master and 2.x:
https://mtt.open-mpi.org/index.php?do_redir=2270
https://mtt.open-mpi.org/index.php?do_redir=2271
The thing that is strange about the MTT output is that MTT does not det
Errors like that (Win::Get_attr: Got wrong value for disp unit) are from
my ppc64 machine: https://mtt.open-mpi.org/index.php?do_redir=2295
The MTT setup is checking out the tests from github directly:
[Test get: ibm]
module = SCM
scm_module = Git
scm_url = https://github.com/open-mpi/ompi-tests.
I can confirm this on Fedora 20 with gcc 4.8.3.
Running ./configure without any options gives me the same error.
On Mon, Aug 04, 2014 at 04:24:29PM +, Pritchard Jr., Howard wrote:
> Hi Ralph,
>
> Nope that doesn't fix the problem I'm hitting. I tried to build the ompi
> trunk
> on a syste
db5a4d0..efeaf98 100644
--- a/opal/util/malloc.h
+++ b/opal/util/malloc.h
@@ -21,7 +21,7 @@
#ifndef OPAL_MALLOC_H
#define OPAL_MALLOC_H
-#include "opal_config.h"
+#include
#include
/*
On Mon, Aug 04, 2014 at 06:39:13PM +0200, Adrian Reber wrote:
> I can confirm this on Fedora
formation go to:
> > http://www.cisco.com/web/about/doing_business/legal/cri/
> >
> > ___
> > devel mailing list
> > de...@open-mpi.org
> > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> > Link to this post:
> > http://www.open-mpi.org/community/lists/devel/2014/08/15587.php
> >
> > ___
> > devel mailing list
> > de...@open-mpi.org
> > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> > Link to this post:
> > http://www.open-mpi.org/community/lists/devel/2014/08/15607.php
>
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
Adrian
--
Adrian Reber http://lisas.de/~adrian/
Authentic:
Indubitably true, in somebody's opinion.
Running Open MPI 1.8.3 with PSM does not seem to work right now at all.
I am getting the same errors also on trunk from my newly set up MTT.
Before trying to debug this I just wanted to make sure this is not a
configuration error. I have the following PSM packages installed:
infinipath-devel-3.1.1-363
operations like comm_spawn or
> connect_accept, so if you are running those tests that just won’t work. Is
> that the heart of the problem here?
>
>
> > On Oct 27, 2014, at 1:40 AM, Adrian Reber wrote:
> >
> > Running Open MPI 1.8.3 with PSM does not seem to work ri
rs.
>
> Any chance you could try updating your infinipath libraries?
>
> Andrew
>
> > -Original Message-
> > From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Adrian
> > Reber
> > Sent: Monday, October 27, 2014 9:11 AM
> > To: Open MPI Developers
>
original case:
>
> $ mpirun -np 32 ./mpi_test_suite -t "All,^io,^one-sided"
>
> Runs for a while and eventually hits send cancellation errors.
>
> Any chance you could try updating your infinipath libraries?
>
> Andrew
>
> > -Original Message-
>
_dup
> > > (Rank:0) tst_test_array[3]:Get_version Number of failed tests:0
> > >
> > > Works with various np from 8 to 32. Your original case:
> > >
> > > $ mpirun -np 32 ./mpi_test_suite -t "All,^io,^one-sided"
> >