[OMPI devel] openmpi with FT enabled

2013-10-28 Thread Adrian Reber
I am trying to compile openmpi (Revision: 29539) from svn with '--with-ft=cr'. I get a compilation error and I do not know how to solve it: ../../../../opal/mca/base/mca_base_components_open.c: In function 'open_components': ../../../../opal/mca/base/mca_base_components_open.c:144:9: error: 'mca_ba

Re: [OMPI devel] openmpi with FT enabled

2013-10-28 Thread Adrian Reber
release series at this > time. We're looking to restore that support next year as part of the 1.9 > release series. > > > On Oct 28, 2013, at 8:47 AM, Adrian Reber wrote: > > > I am trying to compile openmpi (Revision: 29539) from svn > > with '--with-ft=cr

[OMPI devel] [PATCH] Trying to get the C/R code to compile again

2013-11-20 Thread Adrian Reber
). This first patch fixes wrong include directives when compiling with OPAL_SETUP_FT_OPTIONS. Adrian From c417f21e5a720f8bfe9ee222948ae8c59d4a485b Mon Sep 17 00:00:00 2001 From: Adrian Reber List-Post: devel@lists.open-mpi.org Date: Wed, 20 Nov 2013 14:50:12 +0100 Subject: [PA

[OMPI devel] [PATCH 0/4] Trying to get the C/R code to compile again

2013-11-25 Thread Adrian Reber
using "--with-ft=cr". This patchset only fixes existing compilation problems; the code is not yet expected to work. I used "make check" to verify that it does not break existing code. Adrian Reber (4): Trying to get the C/R code to compile again. (void value not ignored) T

[OMPI devel] [PATCH 3/4] Trying to get the C/R code to compile again. (recv_*_nb)

2013-11-25 Thread Adrian Reber
From: Adrian Reber This patch changes all recv/recv_buffer occurrences in the C/R code to recv_nb/recv_buffer_nb. Signed-off-by: Adrian Reber --- ompi/mca/crcp/bkmrk/crcp_bkmrk_pml.c| 49 +++- orte/mca/errmgr/base/errmgr_base_tool.c | 12 +--- orte/mca/rml/ftrm

[OMPI devel] [PATCH 4/4] Trying to get the C/R code to compile again. (last)

2013-11-25 Thread Adrian Reber
From: Adrian Reber These are the remaining changes to get C/R to compile again. This patch includes various fixes all over the C/R code that are hard to group like the previous patches. Signed-off-by: Adrian Reber --- ompi/mca/bml/r2/bml_r2_ft.c| 10 +- opal/mca/base

[OMPI devel] [PATCH 1/4] Trying to get the C/R code to compile again. (void value not ignored)

2013-11-25 Thread Adrian Reber
From: Adrian Reber This patch fixes error: void value not ignored as it ought to be in the C/R code by ignoring the return value of functions which no longer return a value (only void). Signed-off-by: Adrian Reber --- orte/mca/errmgr/base/errmgr_base_tool.c | 8 +- orte/mca

[OMPI devel] [PATCH 2/4] Trying to get the C/R code to compile again. (send_*_nb)

2013-11-25 Thread Adrian Reber
From: Adrian Reber This patch changes all send/send_buffer occurrences in the C/R code to send_nb/send_buffer_nb. Signed-off-by: Adrian Reber --- ompi/mca/crcp/bkmrk/crcp_bkmrk_pml.c| 12 ++--- orte/mca/errmgr/base/errmgr_base_tool.c | 2 +- orte/mca/rml/ftrm/rml_ftrm.h

Re: [OMPI devel] [PATCH 4/4] Trying to get the C/R code to compile again. (last)

2013-12-05 Thread Adrian Reber
On Wed, Dec 04, 2013 at 08:07:39PM +, Jeff Squyres (jsquyres) wrote: > On Dec 4, 2013, at 11:29 AM, Ralph Castain wrote: > > > Jeff - you are jumping way ahead. I already said this needs further work to > > resolve blocking. These patches (per Adrian's email) just makes things > > compile >

Re: [OMPI devel] [PATCH 1/4] Trying to get the C/R code to compile again. (void value not ignored)

2013-12-06 Thread Adrian Reber
9 AM, Adrian Reber wrote: > > > From: Adrian Reber > > > > This patch fixes > > > > error: void value not ignored as it ought to be > > > > in the C/R code by ignoring the return value of functions which > > no longer return a value (only vo

Re: [OMPI devel] [PATCH 2/4] Trying to get the C/R code to compile again. (send_*_nb)

2013-12-06 Thread Adrian Reber
e totally wrong about emulating > > blocking. There might be (probably are?) rules/assumptions in the ORTE > > layer (of which I am *not* an expert) that disallow you from [emulating] > > blocking. > > > > If that's the case, then there's architectural issu

Re: [OMPI devel] [PATCH 4/4] Trying to get the C/R code to compile again. (last)

2013-12-09 Thread Adrian Reber
On Fri, Dec 06, 2013 at 08:43:39AM -0600, Josh Hursey wrote: > Did the mca_base_component_distill_checkpoint_ready parameter go away? Its > intention was to allow a user to have a build with C/R compiled in and then > choose at runtime if they want to restrict their component section to just > C/R e

[OMPI devel] [PATCH v2] Trying to get the C/R code to compile again. (last)

2013-12-09 Thread Adrian Reber
From: Adrian Reber These are the remaining changes to get C/R to compile again. This patch includes various fixes all over the C/R code that are hard to group like the previous patches. Changes from V1: * explain why mca_base_component_distill_checkpoint_ready no longer works * compare return

Re: [OMPI devel] [PATCH v2] Trying to get the C/R code to compile again. (last)

2013-12-09 Thread Adrian Reber
9, 2013, at 5:38 AM, Adrian Reber wrote: > > > diff --git a/orte/mca/rml/oob/rml_oob_component.c > > b/orte/mca/rml/oob/rml_oob_component.c > > index dd539cd..b91f4a3 100644 > > --- a/orte/mca/rml/oob/rml_oob_component.c > > +++ b/orte/mca/rml/oob/rml_oob_compon

Re: [OMPI devel] [PATCH v2] Trying to get the C/R code to compile again. (last)

2013-12-10 Thread Adrian Reber
eferencing the OOB, then we need to > > > go directly to it. I'll have to check/correct the code, but the RML > > > shouldn't even be storing a pointer to the OOB in it as there no longer > > > is a direct linkage. > > > > > > > > >

Re: [OMPI devel] OMPI developer's meeting today

2013-12-13 Thread Adrian Reber
Is there a phone number I can use to join the meeting via phone from Germany? Adrian On Thu, Dec 12, 2013 at 02:43:38PM -0600, Ralph Castain wrote: > Sorry for delay - we just realized we hadn't addressed this yet. > > Plan is to still start at 9am as planned. Hope that is okay.

Re: [OMPI devel] [PATCH v2] Trying to get the C/R code to compile again. (last)

2013-12-16 Thread Adrian Reber
> > > > orte_rml_oob_ft_event > > > > No need to reference thru the module unless you want to for some reason. > > > > > > > > > > This doesn't seem right - if we are referencing the OOB, then we need to > > go directly to it. I'l

[OMPI devel] [PATCH v2 0/2] Trying to get the C/R code to compile again

2013-12-18 Thread Adrian Reber
From: Adrian Reber This is the second try to replace the usage of blocking send and recv in the C/R code with the non-blocking versions. The new code compiles (in contrast to the old code) but does not work yet. This is the first step to get the C/R code working again. Right now it only compiles

[OMPI devel] [PATCH v2 2/2] Trying to get the C/R code to compile again. (send_*_nb)

2013-12-18 Thread Adrian Reber
From: Adrian Reber This patch changes all send/send_buffer occurrences in the C/R code to send_nb/send_buffer_nb. The old code is still there but disabled using ifdefs (ENABLE_FT_FIXED). The new code compiles but does not work. Changes from V1: * #ifdef out the code (so it is preserved for

[OMPI devel] [PATCH v2 1/2] Trying to get the C/R code to compile again. (recv_*_nb)

2013-12-18 Thread Adrian Reber
From: Adrian Reber This patch changes all recv/recv_buffer occurrences in the C/R code to recv_nb/recv_buffer_nb. The old code is still there but disabled using ifdefs (ENABLE_FT_FIXED). The new code compiles but does not work. Changes from V1: * #ifdef out the code (so it is preserved for

Re: [OMPI devel] [PATCH v2 2/2] Trying to get the C/R code to compile again. (send_*_nb)

2013-12-19 Thread Adrian Reber
e return code isn't > the number of bytes sent any more - it is just ORTE_SUCCESS or else an error > code, so you should be testing for ORTE_SUCCESS == > > > > > On Dec 18, 2013, at 6:42 AM, Adrian Reber wrote: > > > From: Adrian Reber > > > >

[OMPI devel] [PATCH v3 0/2] Trying to get the C/R code to compile again

2013-12-19 Thread Adrian Reber
From: Adrian Reber This is the second try to replace the usage of blocking send and recv in the C/R code with the non-blocking versions. The new code compiles (in contrast to the old code) but does not work yet. This is the first step to get the C/R code working again. Right now it only compiles

[OMPI devel] [PATCH v3 1/2] Trying to get the C/R code to compile again. (recv_*_nb)

2013-12-19 Thread Adrian Reber
From: Adrian Reber This patch changes all recv/recv_buffer occurrences in the C/R code to recv_nb/recv_buffer_nb. The old code is still there but disabled using ifdefs (ENABLE_FT_FIXED). The new code compiles but does not work. Changes from V1: * #ifdef out the code (so it is preserved for

[OMPI devel] [PATCH v3 2/2] Trying to get the C/R code to compile again. (send_*_nb)

2013-12-19 Thread Adrian Reber
From: Adrian Reber This patch changes all send/send_buffer occurrences in the C/R code to send_nb/send_buffer_nb. The new code compiles but does not work. Changes from V1: * #ifdef out the code (so it is preserved for later re-design) * marked the broken C/R code with ENABLE_FT_FIXED Changes

Re: [OMPI devel] [PATCH v3 0/2] Trying to get the C/R code to compile again

2013-12-20 Thread Adrian Reber
On Thu, Dec 19, 2013 at 09:54:19PM +0100, Adrian Reber wrote: > This is the second try to replace the usage of blocking send and > recv in the C/R code with the non-blocking versions. The new code > compiles (in contrast to the old code) but does not work yet. > This is the first step

[OMPI devel] C/R code: opal_list_item_destruct: Assertion

2013-12-21 Thread Adrian Reber
Trying to run Open MPI with C/R enabled I get the following error with --enable-debug: [dcbz:20360] orte_rml_base_select: initializing rml component oob [dcbz:20360] orte_rml_base_select: initializing rml component ftrm [dcbz:20360] orte_rml_base_select: module ftrm unloaded orterun: ../../opal/cl

Re: [OMPI devel] C/R code: opal_list_item_destruct: Assertion

2013-12-22 Thread Adrian Reber
0045 and let me know if it fixes your issue. > > George. > > > On Dec 21, 2013, at 22:05 , Adrian Reber wrote: > > > Trying to run Open MPI with C/R enabled I get the following error > > with --enable-debug: > > > > [dcbz:20360] orte_rml_ba

[OMPI devel] return value of opal_compress_base_register() in opal/mca/compress/base/compress_base_open.c

2013-12-27 Thread Adrian Reber
Right now the C/R code fails because of a change introduced in opal/mca/compress/base/compress_base_open.c in 2013 with commit git 734c724ff76d9bf814f3ab0396bcd9ee6fddcd1b svn r28239 Update OPAL frameworks to use the MCA framework system. This commit changed a lot but also the return value o

Re: [OMPI devel] return value of opal_compress_base_register() in opal/mca/compress/base/compress_base_open.c

2014-01-03 Thread Adrian Reber
> CR infrastructure (although the CR infrastructure can use the compress > framework if the user chooses to). So I bet we can remove the protection > altogether and be fine. > > So I think this patch is fine. I might also go as far as removing the 'if' > block altoge

Re: [OMPI devel] return value of opal_compress_base_register() in opal/mca/compress/base/compress_base_open.c

2014-01-07 Thread Adrian Reber
_ERR_NOT_AVAILABLE. This would still avoid opening the > > components for no reason (thus saving some memory) while not causing > > opal_init to abort. > > > > > > On Jan 3, 2014, at 3:19 AM, Adrian Reber wrote: > > > > > So removing all output like

[OMPI devel] orte_barrier: Assertion `0 == item->opal_list_item_refcount' failed.

2014-01-09 Thread Adrian Reber
Continuing with the CR code I now get a crash which can be easily reproduced using orte/test/system/orte_barrier.c I get: orte_barrier: ../../../../../opal/class/opal_list.h:547: _opal_list_append: Assertion `0 == item->opal_list_item_refcount' failed. [dcbz:05085] *** Process received signal **

Re: [OMPI devel] orte_barrier: Assertion `0 == item->opal_list_item_refcount' failed.

2014-01-09 Thread Adrian Reber
For my CR work this can probably be ignored. I think I was looking at the wrong place. On Thu, Jan 09, 2014 at 05:28:01PM +0100, Adrian Reber wrote: > Continuing with the CR code I now get a crash which can be easily reproduced > using orte/test/system/orte_barrier.c > > I get: >

[OMPI devel] callback debugging

2014-01-10 Thread Adrian Reber
I am currently trying to understand how callbacks are working. Right now I am looking at orte/mca/rml/base/rml_base_receive.c orte_rml_base_comm_start() which does orte_rml.recv_buffer_nb(ORTE_NAME_WILDCARD, ORTE_RML_TAG_RML_INFO_UPDATE,

Re: [OMPI devel] callback debugging

2014-01-10 Thread Adrian Reber
On Fri, Jan 10, 2014 at 09:48:14AM -0800, Ralph Castain wrote: > > On Jan 10, 2014, at 8:02 AM, Adrian Reber wrote: > > > I am currently trying to understand how callbacks are working. Right now > > I am looking at orte/mca/rml/base/rml_base_receive.c > > orte_rml_b

Re: [OMPI devel] callback debugging

2014-01-20 Thread Adrian Reber
s okay to > block using ORTE_WAIT_FOR_COMPLETION. Look in > orte/mca/routed/base/routed_base_fns.c starting at line 252 for an example. > > HTH > Ralph > > On Jan 10, 2014, at 12:55 PM, Ralph Castain wrote: > > > > > On Jan 10, 2014, at 12:45 PM, Adrian Rebe

Re: [OMPI devel] callback debugging

2014-01-21 Thread Adrian Reber
, Jan 20, 2014 at 02:46:04PM -0800, Ralph Castain wrote: > Is it orte-checkpoint that is hanging, or the app you are trying to > checkpoint? > > > On Jan 20, 2014, at 2:10 PM, Adrian Reber wrote: > > > Thanks for your help. I tried initializing the barrier correctly (see >

Re: [OMPI devel] callback debugging

2014-01-21 Thread Adrian Reber
n Mon, Jan 20, 2014 at 4:46 PM, Ralph Castain wrote: > > > Is it orte-checkpoint that is hanging, or the app you are trying to > > checkpoint? > > > > > > On Jan 20, 2014, at 2:10 PM, Adrian Reber wrote: > > > > Thanks for your help. I tried initializing

Re: [OMPI devel] callback debugging

2014-01-21 Thread Adrian Reber
> However, like I said, it makes no sense for orte-checkpoint to do a barrier > as it is a singleton - there is nothing for it to "barrier" with. > > On Jan 21, 2014, at 7:24 AM, Adrian Reber wrote: > > > I think I still do not really understand how it w

Re: [OMPI devel] callback debugging

2014-01-21 Thread Adrian Reber
s that orte-checkpoint is a tool, and so it isn't a daemon - > but it is also not an app. > > > On Jan 21, 2014, at 11:56 AM, Adrian Reber wrote: > > > Good to know that it does not make any sense. So it is not just me. > > > > Looking at the call cha

[OMPI devel] [PATCH] make orte-checkpoint communicate with orterun again

2014-01-23 Thread Adrian Reber
The following patch makes orte-checkpoint communicate with orterun again: diff --git a/orte/tools/orte-checkpoint/orte-checkpoint.c b/orte/tools/orte-checkpoint/orte-checkpoint.c index 7106342..8539f34 100644 --- a/orte/tools/orte-checkpoint/orte-checkpoint.c +++ b/orte/tools/orte-checkpoint/orte-che

[OMPI devel] [PATCH] use ORTE_PROC_IS_APP

2014-01-23 Thread Adrian Reber
Selecting SNAPC requires knowing whether it is an app or not: int orte_snapc_base_select(bool seed, bool app); The following patch uses the correct define. Can I commit it like this: diff --git a/orte/mca/ess/base/ess_base_std_app.c b/orte/mca/ess/base/ess_base_std_app.c index dbbb2f4..f3a38f0 100644

Re: [OMPI devel] [PATCH] make orte-checkpoint communicate with orterun again

2014-01-24 Thread Adrian Reber
ata underneath to save > > memory and time. > > > > > > On Jan 23, 2014, at 6:51 AM, Adrian Reber wrote: > > > > > Following patch makes orte-checkpoint communicate with orterun again: > > > > > > diff --git a/orte/tools/orte-checkpoint/ort

[OMPI devel] SNAPC: dynamic send buffers

2014-01-27 Thread Adrian Reber
removed with my 'getting-it-compiled-again' patches. Instead of blocking recv() calls it now uses ORTE_WAIT_FOR_COMPLETION(). I included gitweb links to the patches. Please have a look at the patches. Adrian commit 6f10b44499b59c84d9032378c7f8c6b3526a029b Author: Ad

Re: [OMPI devel] SNAPC: dynamic send buffers

2014-01-29 Thread Adrian Reber
> 2. be aware that ORTE_WAIT_FOR_COMPLETION will block if you are in an RML > callback. I don't think that's an issue here, but just wanted to point it out. > > Ralph > > On Jan 27, 2014, at 8:12 AM, Adrian Reber wrote: > > > I have the following patc

[OMPI devel] Use unique collective ids for the checkpoint/restart code

2014-02-03 Thread Adrian Reber
This patch https://lisas.de/git/?p=open-mpi.git;a=commitdiff;h=14ec7f42baab882e345948ff79c4f75f5084bbbf introduces unique collective ids for the checkpoint/restart code and with this applied it seems to work pretty well. As this patch also touches non-CR code it would be good if someone could hav

Re: [OMPI devel] Use unique collective ids for the checkpoint/restart code

2014-02-04 Thread Adrian Reber
a "printf" statement in > plm_base_launch_support.c, so you might want to make that an > opal_output_verbose or something. > > On Feb 3, 2014, at 12:19 PM, Adrian Reber wrote: > > > This patch > > > > https://lisas.de/git/?p=open-mpi.git;a=commitdiff;h=1

[OMPI devel] C/R and orte_oob

2014-02-06 Thread Adrian Reber
When I initially made the C/R code compile again I made the following change: diff --git a/orte/mca/rml/oob/rml_oob_component.c b/orte/mca/rml/oob/rml_oob_component.c index f0b22fc..90ed086 100644 --- a/orte/mca/rml/oob/rml_oob_component.c +++ b/orte/mca/rml/oob/rml_oob_component.c @@ -185,8 +185,7 @

Re: [OMPI devel] C/R and orte_oob

2014-02-06 Thread Adrian Reber
e, once you did that, the OOB would no longer be available to, for > example, tell the local daemon that the app is ready for checkpoint :-) > > Afraid I'll have to defer to Josh H for any further guidance. > > > On Feb 6, 2014, at 8:15 AM, Adrian Reber wrote: > >

[OMPI devel] new CRS component added (criu)

2014-02-07 Thread Adrian Reber
I have created a new CRS component using criu (criu.org) to support checkpoint/restart in Open MPI. My current patch only provides the framework and necessary configure scripts to detect and link against criu. With this patch orte-checkpoint can request a checkpoint and the new CRIU CRS component i

Re: [OMPI devel] new CRS component added (criu)

2014-02-08 Thread Adrian Reber
On Fri, Feb 07, 2014 at 10:08:48PM +, Jeff Squyres (jsquyres) wrote: > Sweet -- +1 for CRIU support! > > FWIW, I see you modeled your configure.m4 off the blcr configure.m4, but I'd > actually go with making it a bit simpler. For example, I typically structure > my configure.m4's like this

Re: [OMPI devel] new CRS component added (criu)

2014-02-11 Thread Adrian Reber
On Tue, Feb 11, 2014 at 08:09:35PM +, Jeff Squyres (jsquyres) wrote: > On Feb 8, 2014, at 4:49 PM, Adrian Reber wrote: > > >> I note you have a stray $3 at the end of your configure.m4, too (it might be > >> supposed to be $2?). > > > > I think I do not re

[OMPI devel] openmpi-1.7.5a1r30692 and slurm problems

2014-02-12 Thread Adrian Reber
I tried the nightly snapshot (openmpi-1.7.5a1r30692.tar.gz) on a system with slurm and moab. I requested an interactive session using: msub -I -l nodes=3:ppn=8 and started a simple test case which fails: $ mpirun -np 2 ./mpi-test 1

Re: [OMPI devel] openmpi-1.7.5a1r30692 and slurm problems

2014-02-12 Thread Adrian Reber
:45AM -0800, Ralph Castain wrote: > Seems rather odd - since this is managed by Moab, you shouldn't be seeing > SLURM envars at all. What you should see are PBS_* envars, including a > PBS_NODEFILE that actually contains the allocation. > > > On Feb 12, 2014, at 4:42

Re: [OMPI devel] openmpi-1.7.5a1r30692 and slurm problems

2014-02-12 Thread Adrian Reber
9 AM, Ralph Castain wrote: > > > What is your SLURM_TASKS_PER_NODE? > > > > On Feb 12, 2014, at 6:58 AM, Adrian Reber wrote: > > > >> No, the system has only a few MOAB_* variables and many SLURM_* > >> variables: > >>

Re: [OMPI devel] openmpi-1.7.5a1r30692 and slurm problems

2014-02-12 Thread Adrian Reber
On Wed, Feb 12, 2014 at 07:47:53AM -0800, Ralph Castain wrote: > > > > $ msub -I -l nodes=3:ppn=8 > > salloc: Job is in held state, pending scheduler release > > salloc: Pending job allocation 131828 > > salloc: job 131828 queued and waiting for resources > > salloc: job 131828 has been allocated

Re: [OMPI devel] openmpi-1.7.5a1r30692 and slurm problems

2014-02-12 Thread Adrian Reber
2014, at 7:47 AM, Ralph Castain wrote: > > > > > On Feb 12, 2014, at 7:32 AM, Adrian Reber wrote: > > > >> > >> $ msub -I -l nodes=3:ppn=8 > >> salloc: Job is in held state, pending scheduler release > >> salloc: Pending job allocation 131828

Re: [OMPI devel] C/R and orte_oob

2014-02-13 Thread Adrian Reber
On Thu, Feb 06, 2014 at 02:45:07PM -0800, Ralph Castain wrote: > On Feb 6, 2014, at 2:16 PM, Adrian Reber wrote: > > > Josh explained it to me a few days ago, that after a checkpoint has been > > received TCP should no longer be used to not lose any messages. The > > co

[OMPI devel] mca_base_component_var_register() MCA_BASE_VAR_TYPE_STRING

2014-02-14 Thread Adrian Reber
I am trying to find out how to deal with string variables. Do I have to allocate the memory before calling mca_base_component_var_register() or not? It seems it does a strdup() meaning it has to be free()'d while closing the component. Looking at other occurrences of string variables I see differen

Re: [OMPI devel] new CRS component added (criu)

2014-02-14 Thread Adrian Reber
Sure. I added the cloneurl information: https://lisas.de/~adrian/open-mpi.git On Fri, Feb 14, 2014 at 04:30:05PM +, Jeff Squyres (jsquyres) wrote: > Can I clone your git tree and send you a patch? > > On Feb 11, 2014, at 4:45 PM, Adrian Reber wrote: > > > On Tue, Feb

Re: [OMPI devel] new CRS component added (criu)

2014-02-14 Thread Adrian Reber
tps://github.com/jsquyres/fork-from-adrian-ft/commit/f5962184f3ea6dffc182a18f7603c5e70e82ac99 > > > > On Feb 14, 2014, at 11:35 AM, "Jeff Squyres (jsquyres)" > wrote: > > > Perfect; cloning now. Thanks! > > > > On Feb 14, 2014, at 11:34 AM, Adrian Reber > > w

[OMPI devel] OPAL_CRS_* meaning

2014-02-17 Thread Adrian Reber
This is probably for Josh. What is the meaning of the OPAL_CRS_* enums? They are probably used to communicate the state of the CRS modules. OPAL_CRS_ERROR seems to be used in case an error happened. What is the CRS module supposed to set this to if the checkpoint was successful? OPAL_CRS_CONTINUE

[OMPI devel] How to prefer oob/tcp over oob/usock

2014-02-17 Thread Adrian Reber
With the newly added oob/usock, checkpointing with CRIU stopped working. Is there a way I can prefer oob/tcp on the command line? Adrian

[OMPI devel] CRS/CRIU: add code to actually checkpoint a process

2014-02-17 Thread Adrian Reber
I have prepared a patch I would like to commit which adds the code to actually checkpoint a process. Thanks for the pointers about the string variables; I tried to implement it correctly. CRIU currently has problems with the new OOB usock but I will contact the CRIU developers about this error. U

Re: [OMPI devel] CRS/CRIU: add code to actually checkpoint a process

2014-02-18 Thread Adrian Reber
ell from its return code (or some other mechanism) > that it is being restarted versus continuing after checkpointing? > > > On Mon, Feb 17, 2014 at 2:00 PM, Ralph Castain wrote: > > > Great - looks fine to me!! > > > > > > On Feb 17, 2014, at 11:39 AM, Adria

Re: [OMPI devel] OPAL_CRS_* meaning

2014-02-18 Thread Adrian Reber
> You can see it used in the opal_cr_inc_core_prep() function in > opal/runtime/opal_cr.c > > -- Josh > > > > On Mon, Feb 17, 2014 at 9:28 AM, Adrian Reber wrote: > > > This is probably for Josh. What is the meaning of the OPAL_CRS_* enums? > > > >

Re: [OMPI devel] C/R and orte_oob

2014-02-18 Thread Adrian Reber
On Fri, Feb 14, 2014 at 02:51:51PM -0800, Ralph Castain wrote: > On Feb 13, 2014, at 11:26 AM, Adrian Reber wrote: > > I tried to implement something like you described. It is not yet event > > driven, but before continuing I wanted to get some feedback if it is at > >

Re: [OMPI devel] C/R and orte_oob

2014-02-18 Thread Adrian Reber
On Tue, Feb 18, 2014 at 06:39:12AM -0800, Ralph Castain wrote: > On Feb 18, 2014, at 6:24 AM, Adrian Reber wrote: > > > On Fri, Feb 14, 2014 at 02:51:51PM -0800, Ralph Castain wrote: > >> On Feb 13, 2014, at 11:26 AM, Adrian Reber wrote: > >>> I tried to imple

Re: [OMPI devel] CRS/CRIU: add code to actually checkpoint a process

2014-02-18 Thread Adrian Reber
e checkpoint only functionality in continue mode the patch can be checked in? Adrian > On Tue, Feb 18, 2014 at 4:08 AM, Adrian Reber wrote: > > > I think I do not understand your question. So far I have only implemented > > the > > checkpoint part and no

[OMPI devel] startup sstore orte/mca/ess/base/ess_base_std_tool.c

2014-02-21 Thread Adrian Reber
To restart a process using orte-restart I need sstore initialized when running as a tool. This is currently missing. The new code is #if OPAL_ENABLE_FT_CR == 1 and should only affect --with-ft builds. The following is the change I want to make: diff --git a/orte/mca/ess/base/ess_base_std_tool.c

[OMPI devel] mca_base_component_distill_checkpoint_ready variable

2014-02-21 Thread Adrian Reber
There is a variable in the FT code which is not defined and therefore currently #ifdef'd out. #if (OPAL_ENABLE_FT == 1) && (OPAL_ENABLE_FT_CR == 1) #ifdef ENABLE_FT_FIXED /* FIXME_FT * * the variable mca_base_component_distill_checkpoint_ready * was removed by commit 8181c8273c4

[OMPI devel] openmpi-1.7.5a1r30797 fails building on SL 5.5

2014-02-22 Thread Adrian Reber
On a Scientific Linux 5.5 system the nightly snapshot openmpi-1.7.5a1r30797 fails to build with the following errors: Making all in romio make[3]: Entering directory `/tmp/adrian/openmpi-compile/openmpi-1.7.5a1r30797/build/ompi/mca/io/romio/romio' make[4]: Entering directory `/tmp/adrian/openmpi-co

[OMPI devel] Fix compiler warnings in FT code

2014-03-03 Thread Adrian Reber
I have a simple patch which fixes the remaining compiler warnings when running with '--with-ft': https://lisas.de/git/?p=open-mpi.git;a=commitdiff;h=4dee703a0a2e64972b0c35b7693c11a09f1fbe5f Does anybody see any problems with this patch? Adrian

Re: [OMPI devel] Fix compiler warnings in FT code

2014-03-03 Thread Adrian Reber
mewhere else? or do you have a different way to set those parameters? > > Other than that it looks good to me. > > > On Mon, Mar 3, 2014 at 5:29 AM, Adrian Reber wrote: > > > I have a simple patch which fixes the remaining compiler warnings when > > running with '

Re: [OMPI devel] Fix compiler warnings in FT code

2014-03-03 Thread Adrian Reber
r how to set it up at the moment. > > > > > On Mon, Mar 3, 2014 at 7:25 AM, Adrian Reber wrote: > > > I removed a complete function because it was not used: > > > > ../../../../../orte/mca/sstore/stage/sstore_stage_component.c: At top

Re: [OMPI devel] mca_base_component_distill_checkpoint_ready variable

2014-03-03 Thread Adrian Reber
On Fri, Feb 21, 2014 at 10:12:54AM -0700, Nathan Hjelm wrote: > On Fri, Feb 21, 2014 at 05:21:10PM +0100, Adrian Reber wrote: > > There is a variable in the FT code which is not defined and therefore > > currently #ifdef'd out. > > > > #if (OPAL_ENABLE_FT

Re: [OMPI devel] Fix compiler warnings in FT code

2014-03-05 Thread Adrian Reber
Caching = Disabled [dcbz:02880] sstore:stage: open: Compression = Disabled [dcbz:02880] sstore:stage: open: Compression Delay= 0 [dcbz:02880] sstore:stage: open: Skip FileM (Debug Only) = False On Mon, Mar 03, 2014 at 05:42:13PM +0100, Adrian Reber wrote: > I w

Re: [OMPI devel] C/R and orte_oob

2014-03-06 Thread Adrian Reber
On Tue, Feb 18, 2014 at 03:46:58PM +0100, Adrian Reber wrote: > > >>> I tried to implement something like you described. It is not yet event > > >>> driven, but before continuing I wanted to get some feedback if it is at > > >>> least the right start:

Re: [OMPI devel] C/R and orte_oob

2014-03-07 Thread Adrian Reber
On Thu, Mar 06, 2014 at 07:47:22PM -0800, Ralph Castain wrote: > > Sorry for delay - yes, that looks like the right direction. I would > > suggest doing it via the current state machine, though, by simply > > defining another job or proc state in orte/mca/plm/plm_types.h, and > >

Re: [OMPI devel] C/R and orte_oob

2014-03-10 Thread Adrian Reber
On Fri, Mar 07, 2014 at 06:54:18AM -0800, Ralph Castain wrote: > > If you like, I can define the required code in the trunk and let you > > fill in the event functionality. > > That would be great. > >>> > >>> Thanks for your changes. When using --with-ft there are a few compil

[OMPI devel] orte-restart and PATH

2014-03-12 Thread Adrian Reber
I am using orte-restart without setting my PATH to my Open MPI installation. I am running /full/path/to/orte-restart and orte-restart tries to run mpirun to restart the process. This fails on my system because I do not have any mpirun in my PATH. Is it expected for an Open MPI installation to set u

[OMPI devel] usage of mca variables in orte-restart

2014-03-14 Thread Adrian Reber
I am now trying to run orte-restart. As far as I understand it orte-restart analyzes the checkpoint metadata and then tries to exec() mpirun which then starts opal-restart. During the startup of opal-restart (during initialize()) detection of the best CRS module is disabled: /* * Turn of

Re: [OMPI devel] usage of mca variables in orte-restart

2014-03-15 Thread Adrian Reber
egistered. > > -Nathan > > Please excuse the horrible Outlook top-posting. OWA sucks. > > > From: devel [devel-boun...@open-mpi.org] on behalf of Adrian Reber > [adr...@lisas.de] > Sent: Friday, March 14, 2014 3:05 PM > To: de

Re: [OMPI devel] usage of mca variables in orte-restart

2014-03-17 Thread Adrian Reber
On Fri, Mar 14, 2014 at 10:18:06PM +, Hjelm, Nathan T wrote: > The preferred way is to use mca_base_var_find and then call > mca_base_var_[set|get]_value. For performance sake we only look at the > environment when the variable is registered. I believe I found a bug in mca_base_var_set_value

Re: [OMPI devel] usage of mca variables in orte-restart

2014-03-18 Thread Adrian Reber
lue() to select the preferred crs module? Adrian On Mon, Mar 17, 2014 at 08:47:16AM -0600, Nathan Hjelm wrote: > Good catch. Fixing now. > > -Nathan > > On Mon, Mar 17, 2014 at 02:50:02PM +0100, Adrian Reber wrote: > > On Fri, Mar 14, 2014 at 10:18:06PM +, Hjelm, Nathan T w

[OMPI devel] Open MPI and CRIU stdout/stderr

2014-03-19 Thread Adrian Reber
Cross-posting to criu and openmpi devel mailing lists. To get fault tolerance back into Open MPI I added code to use criu as a checkpoint/restart tool. I can checkpoint a process successfully but I have trouble restarting it. CRIU currently has problems restoring the process which is probably rela

[OMPI devel] Restarting and Pipes

2014-04-10 Thread Adrian Reber
Trying to restart a process I see that orterun has three pipes connected to the processes running under its control (-np 1). orterun: orterun 11562 adrian 15w FIFO 0,8 0t0 5304173 pipe orterun 11562 adrian 16r FIFO 0,8 0t0 5304174 pipe orterun

Re: [OMPI devel] 1-question developer poll

2014-04-16 Thread Adrian Reber
On Wed, Apr 16, 2014 at 10:32:10AM +, Jeff Squyres (jsquyres) wrote: > What source code repository technology(ies) do you use for Open MPI > development? (indicate all that apply) > > - SVN > - Mercurial > - Git git Adrian

Re: [OMPI devel] RFC: Remove heterogeneous support

2014-04-25 Thread Adrian Reber
On Fri, Apr 25, 2014 at 10:29:36AM +, Jeff Squyres (jsquyres) wrote: > On Apr 25, 2014, at 6:13 AM, Gilles Gouaillardet > wrote: > > > it is possible to use qemu in order to emulate unavailable hardware. > > for what it's worth, i am now running a ppc64 qemu emulated virtual > > machine on a

Re: [OMPI devel] r31916 question

2014-06-19 Thread Adrian Reber
The fault tolerance code also needs additional changes because of this commit. I have the changes prepared but not committed. On Wed, Jun 18, 2014 at 03:45:11PM -0700, Ralph Castain wrote: > Huh - thought I got that. Sorry I missed it. Let me take a look and ensure > that the alps ras module is s

[OMPI devel] Segmentation fault in opal_fifo (MTT)

2016-03-01 Thread Adrian Reber
I have seen it before, but it was not reproducible. I now have two segfaults in opal_fifo in today's MTT run on master and 2.x: https://mtt.open-mpi.org/index.php?do_redir=2270 https://mtt.open-mpi.org/index.php?do_redir=2271 The thing that is strange about the MTT output is that MTT does not det

Re: [OMPI devel] 1.10.3rc MTT failures

2016-04-25 Thread Adrian Reber
Errors like that (Win::Get_attr: Got wrong value for disp unit) are from my ppc64 machine: https://mtt.open-mpi.org/index.php?do_redir=2295 The MTT setup is checking out the tests from github directly: [Test get: ibm] module = SCM scm_module = Git scm_url = https://github.com/open-mpi/ompi-tests.

Re: [OMPI devel] opal_config_bottom.h question again

2014-08-04 Thread Adrian Reber
I can confirm this on Fedora 20 with gcc 4.8.3. Running ./configure without any options gives me the same error. On Mon, Aug 04, 2014 at 04:24:29PM +, Pritchard Jr., Howard wrote: > Hi Ralph, > > Nope that doesn't fix the problem I'm hitting. I tried to build the opmi > trunk > on a syste

Re: [OMPI devel] opal_config_bottom.h question again

2014-08-04 Thread Adrian Reber
db5a4d0..efeaf98 100644 --- a/opal/util/malloc.h +++ b/opal/util/malloc.h @@ -21,7 +21,7 @@ #ifndef OPAL_MALLOC_H #define OPAL_MALLOC_H -#include "opal_config.h" +#include #include /* On Mon, Aug 04, 2014 at 06:39:13PM +0200, Adrian Reber wrote: > I can confirm this on Fedora

Re: [OMPI devel] ORTE headers in OPAL source

2014-08-11 Thread Adrian Reber
formation go to: > > http://www.cisco.com/web/about/doing_business/legal/cri/ > > > > ___ > > devel mailing list > > de...@open-mpi.org > > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel > > Link to this post: > > http://www.open-mpi.org/community/lists/devel/2014/08/15587.php > > > -- > Jeff Squyres > jsquy...@cisco.com > For corporate legal information go to: > http://www.cisco.com/web/about/doing_business/legal/cri/ Adrian -- Adrian Reber http://lisas.de/~adrian/ Authentic: Indubitably true, in somebody's opinion.

Re: [OMPI devel] ORTE headers in OPAL source

2014-10-17 Thread Adrian Reber

[OMPI devel] 1.8.3 and PSM errors

2014-10-27 Thread Adrian Reber
Running Open MPI 1.8.3 with PSM does not seem to work at all right now. I am getting the same errors on trunk from my newly set up MTT. Before trying to debug this, I just wanted to make sure this is not a configuration error. I have the following PSM packages installed: infinipath-devel-3.1.1-363

Re: [OMPI devel] 1.8.3 and PSM errors

2014-10-27 Thread Adrian Reber
operations like comm_spawn or > connect_accept, so if you are running those tests that just won’t work. Is > that the heart of the problem here? > > > > On Oct 27, 2014, at 1:40 AM, Adrian Reber wrote: > > > > Running Open MPI 1.8.3 with PSM does not seem to work ri

Re: [OMPI devel] 1.8.3 and PSM errors

2014-10-28 Thread Adrian Reber
rs. > > Any chance you could try updating your infinipath libraries? > > Andrew > > > -Original Message- > > From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Adrian > > Reber > > Sent: Monday, October 27, 2014 9:11 AM > > To: Open MPI Developers &g

Re: [OMPI devel] 1.8.3 and PSM errors

2014-11-10 Thread Adrian Reber
original case: > > $ mpirun -np 32 ./mpi_test_suite -t "All,^io,^one-sided" > > Runs for a while and eventually hits send cancellation errors. > > Any chance you could try updating your infinipath libraries? > > Andrew > > > -Original Message- >

Re: [OMPI devel] 1.8.3 and PSM errors

2014-11-10 Thread Adrian Reber
_dup > > > (Rank:0) tst_test_array[3]:Get_version Number of failed tests:0 > > > > > > Works with various np from 8 to 32. Your original case: > > > > > > $ mpirun -np 32 ./mpi_test_suite -t "All,^io,^one-sided" > >
