Re: [OMPI devel] 1.10.3rc MTT failures

2016-04-25 Thread Adrian Reber
Errors like that (Win::Get_attr: Got wrong value for disp unit) are from my ppc64 machine: https://mtt.open-mpi.org/index.php?do_redir=2295 The MTT setup is checking out the tests from github directly: [Test get: ibm] module = SCM scm_module = Git scm_url = https://github.com/open-mpi/ompi-tests.

[OMPI devel] Segmentation fault in opal_fifo (MTT)

2016-03-01 Thread Adrian Reber
I have seen it before but it was not reproducible. I have now two segfaults in opal_fifo in today's MTT run on master and 2.x: https://mtt.open-mpi.org/index.php?do_redir=2270 https://mtt.open-mpi.org/index.php?do_redir=2271 The thing that is strange about the MTT output is that MTT does not det

[OMPI devel] MTT setup updated to gcc-6.0 (pre)

2016-02-25 Thread Adrian Reber
I installed a pre-release gcc 6.0 gcc version 6.0.0 20160221 (experimental) (GCC) on my MTT systems (ppc64 and x86_64) and I now get a test build failure: https://mtt.open-mpi.org/index.php?do_redir=2269 Just as a FYI. Adrian

Re: [OMPI devel] Checkpoint/restart + migration

2015-10-22 Thread Adrian Reber
On Thu, Oct 22, 2015 at 12:15:22PM +0200, Gianmario Pozzi wrote: > My team and I are working on the possibility to checkpoint a process and > restarting it on another node. We are using CRIU framework for the > checkpoint/restart part, but we are facing some issues related to migration. > > First

Re: [OMPI devel] MTT failures since the last few days on ppc64

2015-09-09 Thread Adrian Reber
ll_ml.la] Error 1 > > make[2]: Leaving directory '/home/adrian/ompi/build/ompi/mca/coll/ml' > > Makefile:3366: recipe for target 'all-recursive' failed > > > > > > > > > > On Tue, Sep 08, 2015 at 05:19:56PM +, Jeff Squyres (jsq

Re: [OMPI devel] MTT failures since the last few days on ppc64

2015-09-09 Thread Adrian Reber
pi/ompi/issues/874. > > > On Sep 8, 2015, at 9:56 AM, Adrian Reber wrote: > > > > Since a few days the MTT runs on my ppc64 systems are failing with: > > > > [bimini:11716] *** Process received signal *** > > [bimini:11716] Signal: Segmentation fault (11) >

[OMPI devel] MTT failures since the last few days on ppc64

2015-09-08 Thread Adrian Reber
For a few days now, the MTT runs on my ppc64 systems have been failing with: [bimini:11716] *** Process received signal *** [bimini:11716] Signal: Segmentation fault (11) [bimini:11716] Signal code: Address not mapped (1) [bimini:11716] Failing at address: (nil) [bimini:11716] [ 0] [0x3fffa2bb0448] [bimini:

Re: [OMPI devel] esslingen MTT?

2015-08-25 Thread Adrian Reber
On Mon, Aug 24, 2015 at 09:47:22PM +, Jeff Squyres (jsquyres) wrote: > Who runs the esslingen MTT? > > You're getting some build failures on master that I don't understand: > > - > make[3]: Entering directory > '/home/adrian/mtt-scratch/mpi-install/FDvh/src/openmpi-dev-2350-geb25c00/ompi/

Re: [OMPI devel] OBJ_RELEASE() question

2015-02-12 Thread Adrian Reber
leak. So it’s best > that we (a) identify where this was done (I personally don’t recall having > seen it), and (b) add comments to the code explaining why it explicitly sets > the param to NULL (e.g., the object is tracked elsewhere and will later be > free’d). > >

Re: [OMPI devel] OBJ_RELEASE() question

2015-02-12 Thread Adrian Reber
as for for the "unnecessary" part */ > > about the "wrong" part, why do you think the else branch is wrong ? > /* i mean setting a pointer to NULL is not necessarily wrong */ > > Cheers, > > Gilles > > > On 2015/02/12 16:41, Adrian Reber w

[OMPI devel] OBJ_RELEASE() question

2015-02-12 Thread Adrian Reber
At many places all over the code I see OBJ_RELEASE(buffer) buffer = NULL; Looking at the definition of OBJ_RELEASE() this seems unnecessary and wrong: #define OBJ_RELEASE(object) \ do {
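The question above can be illustrated with a small sketch. It is not the Open MPI definition (the macro quoted in the message is cut off), only a hypothetical `DEMO_RELEASE` on a hypothetical `demo_obj_t`, written under the assumption that a release macro resets the caller's pointer only when the last reference is dropped. Under that assumption, an extra `= NULL` after the release changes nothing when the count reaches zero, but it does discard a still-valid pointer when other references remain.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical reference-counted object; NOT Open MPI's opal_object_t. */
typedef struct {
    int refcount;
} demo_obj_t;

/* Hypothetical release macro (an assumption for illustration):
 * drops one reference and, only when the count reaches zero,
 * frees the object and sets the caller's pointer to NULL. */
#define DEMO_RELEASE(object)                 \
    do {                                     \
        assert(NULL != (object));            \
        if (0 == --(object)->refcount) {     \
            free(object);                    \
            (object) = NULL;                 \
        }                                    \
    } while (0)

/* Allocate an object that starts with `refs` references. */
demo_obj_t *demo_new(int refs)
{
    demo_obj_t *obj = malloc(sizeof(*obj));
    assert(NULL != obj);
    obj->refcount = refs;
    return obj;
}

/* Returns 1 if the pointer is NULL after a single release, 0 otherwise. */
int demo_null_after_release(int refs)
{
    demo_obj_t *obj = demo_new(refs);
    DEMO_RELEASE(obj);
    if (NULL == obj) {
        return 1;
    }
    /* In real code another owner would still hold this object;
     * free it here only so the sketch does not leak. */
    free(obj);
    return 0;
}
```

With one reference the pointer comes back as NULL by itself; with two it stays valid, which is the case where a trailing `buffer = NULL;` would actually alter behaviour.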

Re: [OMPI devel] Master hangs in opal_LIFO test

2015-02-03 Thread Adrian Reber
There is right now another bug report concerning opal_lifo and ppc64 here: https://github.com/open-mpi/ompi/issues/371 and there were hangs on ppc64 a few weeks ago in opal_lifo which Nathan fixed with additional barriers. On Mon, Feb 02, 2015 at 11:18:43PM -0800, Paul Hargrove wrote: > CORRECTI

Re: [OMPI devel] btl_openib.c:1200: mca_btl_openib_alloc: Assertion `qp != 255' failed

2015-02-02 Thread Adrian Reber
https://github.com/open-mpi/ompi/issues/372 On Sat, Jan 31, 2015 at 01:38:54PM +, Jeff Squyres (jsquyres) wrote: > Adrian -- > > Can you file this as a Github issue? Thanks. > > > > On Jan 17, 2015, at 12:58 PM, Adrian Reber wrote: > > > > This time

Re: [OMPI devel] RFC: Remove embedded libltdl

2015-02-02 Thread Adrian Reber
I have reported the same error a few days ago and submitted it now as a github issue: https://github.com/open-mpi/ompi/issues/371 On Mon, Feb 02, 2015 at 12:36:54PM +1100, Christopher Samuel wrote: > On 31/01/15 10:51, Jeff Squyres (jsquyres) wrote: > > > New tarball posted (same location). Now

[OMPI devel] make check failure on ppc64

2015-01-25 Thread Adrian Reber
Tonight's MTT on ppc64 has the following failure: http://mtt.open-mpi.org/index.php?do_redir=2229 This is not easily reproducible but I have the core file from that segfault: Core was generated by `/home/adrian/mtt-scratch/mpi-install/QMjb/src/openmpi-dev-750-gff7be58/test/cla'. Program terminated

Re: [OMPI devel] btl_openib.c:1200: mca_btl_openib_alloc: Assertion `qp != 255' failed

2015-01-20 Thread Adrian Reber
Using today's nightly snapshot (openmpi-dev-730-g06d3b57) both errors are gone. Thanks! On Mon, Jan 19, 2015 at 02:38:42PM +0900, Gilles Gouaillardet wrote: > Adrian, > > about the > "[n050409][[36216,1],1][btl_openib_xrc.c:58:mca_btl_openib_xrc_check_api] XRC > error: bad XRC API (require XRC fr

[OMPI devel] btl_openib.c:1200: mca_btl_openib_alloc: Assertion `qp != 255' failed

2015-01-17 Thread Adrian Reber
This time my bug report is not PSM related: I was able to reproduce the MTT error from http://mtt.open-mpi.org/index.php?do_redir=2228 on my system with openmpi-dev-720-gf4693c9: mpi_test_suite: btl_openib.c:1200: mca_btl_openib_alloc: Assertion `qp != 255' failed. [n050409:06796] *** Process r

Re: [OMPI devel] Another Open MPI <-> PSM question (MPI_Isend()/MPI_Cancel())

2015-01-16 Thread Adrian Reber
t be such a bad idea. > > Did you experience any trouble running with the version without the default > error handler registered? > > George. > > > On Thu, Jan 15, 2015 at 4:40 PM, Adrian Reber wrote: > > > It even says so in the code: > > > >

Re: [OMPI devel] Another Open MPI <-> PSM question (MPI_Isend()/MPI_Cancel())

2015-01-15 Thread Adrian Reber
makes MPI_Cancel() fail gracefully. But then no error is handled anymore. Adrian On Thu, Jan 15, 2015 at 10:21:05PM +0100, Adrian Reber wrote: > As PSM on master is still broken I applied it on 1.8.4. Unfortunately it > does not work. The error is the same as before. > > Look

Re: [OMPI devel] Another Open MPI <-> PSM question (MPI_Isend()/MPI_Cancel())

2015-01-15 Thread Adrian Reber
return OMPI_SUCCESS; >} else if(PSM_MQ_INCOMPLETE == err) { > return OMPI_SUCCESS; >} > > George. > > > On Thu, Jan 15, 2015 at 1:30 PM, Adrian Reber wrote: > > > Doing > > > > MPI_Isend() > > > > followed by a > > > >

[OMPI devel] Another Open MPI <-> PSM question (MPI_Isend()/MPI_Cancel())

2015-01-15 Thread Adrian Reber
Doing MPI_Isend() followed by a MPI_Cancel() fails on my PSM based system with 1.8.4 like this: n040108:0.1.Cannot cancel send requests (req=0x2b6279787f80) n040108:0.0.Cannot cancel send requests (req=0x2b3a3dc92f80) --- Primary job termin

Re: [OMPI devel] Changed behaviour with PSM on master

2015-01-10 Thread Adrian Reber
https://github.com/open-mpi/ompi/issues/340 On Fri, Jan 09, 2015 at 01:12:34PM +0100, Adrian Reber wrote: > Running the mpi_test_suite on master used to work with no problems. At > some point in time it stopped working however and now I get only error > messages from PSM: > >

Re: [OMPI devel] Changed behaviour with PSM on master

2015-01-09 Thread Adrian Reber
to be opened twice in each process? > > > > Andrew > > > > > -Original Message- > > > From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Adrian > > > Reber > > > Sent: Friday, January 9, 2015 4:13 AM > > > To: de...@o

[OMPI devel] Changed behaviour with PSM on master

2015-01-09 Thread Adrian Reber
Running the mpi_test_suite on master used to work with no problems. At some point in time it stopped working however and now I get only error messages from PSM: """ n050301:3.0.In PSM version 1.14, it is not possible to open more than one context per process [n050301:26526] Open MPI detected an

Re: [OMPI devel] test/class/opal_fifo failure on ppc64

2015-01-09 Thread Adrian Reber
Thanks. mtt on my ppc64 system is happy again. On Thu, Jan 08, 2015 at 09:16:43AM -0700, Nathan Hjelm wrote: > > Fixed on master. I forgot a write memory barrier in the 64-bit version > of opal_fifo_pop_atomic. > > -Nathan > > On Thu, Jan 08, 2015 at 02:29:05PM +0100, Adri

[OMPI devel] test/class/opal_fifo failure on ppc64

2015-01-08 Thread Adrian Reber
I am trying to build OMPI git master on ppc64 (PPC970MP) and test/class/opal_fifo fails during make check most of the time. [adrian@bimini class]$ ./opal_fifo Single thread test. Time: 0 s 99714 us 99 nsec/poppush Atomics thread finished. Time: 0 s 347577 us 347 nsec/poppush Atomics thread finishe

[OMPI devel] FT code (again)

2014-12-19 Thread Adrian Reber
Again I am trying to get the FT code working. This time I am unsure how to resolve the code changes from this commit: commit aec5cd08bd8c33677276612b899b48618d271efa Author: Ralph Castain Date: Thu Aug 21 18:56:47 2014 + Per the PMIx RFC: This incl

Re: [OMPI devel] 1.8.4rc4 now out for testing

2014-12-15 Thread Adrian Reber
1.8.4rc4 works without errors on my PSM based systems. Adrian On Sat, Dec 13, 2014 at 03:06:07PM -0800, Ralph Castain wrote: > Hi folks > > I’ve rolled up the bug fixes so far, including the thread-multiple > performance fix. So please give this one a whirl > > http://www.open-

Re: [OMPI devel] 1.8.3 and PSM errors

2014-11-13 Thread Adrian Reber
:24PM +0100, Adrian Reber wrote: > Using the intel test suite I can reproduce it for example with: > > $ mpirun --np 2 --map-by ppr:1:node `pwd`/src/MPI_Allgatherv_c > MPITEST info (0): Starting MPI_Allgatherv() test > MPITEST info (0): Node spec MPITEST_comm_sizes[6]=2 too

Re: [OMPI devel] 1.8.3 and PSM errors

2014-11-11 Thread Adrian Reber
e not supported by that transport. > > So could you find one test that doesn’t pass, and give us some info on that > one? > > > > On Nov 11, 2014, at 10:04 AM, Adrian Reber wrote: > > > > Some more information about our PSM troubles. > > > > Usi

Re: [OMPI devel] 1.8.3 and PSM errors

2014-11-11 Thread Adrian Reber
help support nail down the issue. > > Andrew > > > -Original Message- > > From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Adrian > > Reber > > Sent: Monday, November 10, 2014 12:39 PM > > To: Open MPI Developers > > Subject: Re:

Re: [OMPI devel] 1.8.3 and PSM errors

2014-11-10 Thread Adrian Reber
_dup > > > (Rank:0) tst_test_array[3]:Get_version Number of failed tests:0 > > > > > > Works with various np from 8 to 32. Your original case: > > > > > > $ mpirun -np 32 ./mpi_test_suite -t "All,^io,^one-sided" > >

Re: [OMPI devel] 1.8.3 and PSM errors

2014-11-10 Thread Adrian Reber
original case: > > $ mpirun -np 32 ./mpi_test_suite -t "All,^io,^one-sided" > > Runs for a while and eventually hits send cancellation errors. > > Any chance you could try updating your infinipath libraries? > > Andrew > > > -Original Message- >

Re: [OMPI devel] 1.8.3 and PSM errors

2014-10-28 Thread Adrian Reber
rs. > > Any chance you could try updating your infinipath libraries? > > Andrew > > > -Original Message- > > From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Adrian > > Reber > > Sent: Monday, October 27, 2014 9:11 AM > > To: Open MPI Developers

Re: [OMPI devel] 1.8.3 and PSM errors

2014-10-27 Thread Adrian Reber
operations like comm_spawn or > connect_accept, so if you are running those tests that just won’t work. Is > that the heart of the problem here? > > > > On Oct 27, 2014, at 1:40 AM, Adrian Reber wrote: > > > > Running Open MPI 1.8.3 with PSM does not seem to work ri

[OMPI devel] 1.8.3 and PSM errors

2014-10-27 Thread Adrian Reber
Running Open MPI 1.8.3 with PSM does not seem to work right now at all. I am getting the same errors also on trunk from my newly set up MTT. Before trying to debug this I just wanted to make sure this is not a configuration error. I have the following PSM packages installed: infinipath-devel-3.1.1-363

Re: [OMPI devel] ORTE headers in OPAL source

2014-10-17 Thread Adrian Reber
__ > > devel mailing list > > de...@open-mpi.org > > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel > > Link to this post: > > http://www.open-mpi.org/community/lists/devel/2014/08/15587.php > >

Re: [OMPI devel] ORTE headers in OPAL source

2014-08-11 Thread Adrian Reber
formation go to: > > http://www.cisco.com/web/about/doing_business/legal/cri/ > > > > ___ > > devel mailing list > > de...@open-mpi.org > > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel > > Link to this post: > > http://www.open-mpi.org/community/lists/devel/2014/08/15587.php > > > > ___ > > devel mailing list > > de...@open-mpi.org > > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel > > Link to this post: > > http://www.open-mpi.org/community/lists/devel/2014/08/15607.php > > > -- > Jeff Squyres > jsquy...@cisco.com > For corporate legal information go to: > http://www.cisco.com/web/about/doing_business/legal/cri/ Adrian -- Adrian Reber http://lisas.de/~adrian/ Authentic: Indubitably true, in somebody's opinion.

Re: [OMPI devel] opal_config_bottom.h question again

2014-08-04 Thread Adrian Reber
db5a4d0..efeaf98 100644 --- a/opal/util/malloc.h +++ b/opal/util/malloc.h @@ -21,7 +21,7 @@ #ifndef OPAL_MALLOC_H #define OPAL_MALLOC_H -#include "opal_config.h" +#include #include /* On Mon, Aug 04, 2014 at 06:39:13PM +0200, Adrian Reber wrote: > I can confirm this on Fedora

Re: [OMPI devel] opal_config_bottom.h question again

2014-08-04 Thread Adrian Reber
I can confirm this on Fedora 20 with gcc 4.8.3. Running ./configure without any options gives me the same error. On Mon, Aug 04, 2014 at 04:24:29PM +, Pritchard Jr., Howard wrote: > Hi Ralph, > > Nope that doesn't fix the problem I'm hitting. I tried to build the opmi > trunk > on a syste

Re: [OMPI devel] r31916 question

2014-06-19 Thread Adrian Reber
The fault tolerance code also needs additional changes because of this commit. I have the changes prepared but not committed. On Wed, Jun 18, 2014 at 03:45:11PM -0700, Ralph Castain wrote: > Huh - thought I got that. Sorry I missed it. Let me take a look and ensure > that the alps ras module is s

Re: [OMPI devel] RFC: Remove heterogeneous support

2014-04-25 Thread Adrian Reber
On Fri, Apr 25, 2014 at 10:29:36AM +, Jeff Squyres (jsquyres) wrote: > On Apr 25, 2014, at 6:13 AM, Gilles Gouaillardet > wrote: > > > it is possible to use qemu in order to emulate unavailable hardware. > > for what it's worth, i am now running a ppc64 qemu emulated virtual > > machine on a

Re: [OMPI devel] 1-question developer poll

2014-04-16 Thread Adrian Reber
On Wed, Apr 16, 2014 at 10:32:10AM +, Jeff Squyres (jsquyres) wrote: > What source code repository technology(ies) do you use for Open MPI > development? (indicate all that apply) > > - SVN > - Mercurial > - Git git Adrian

[OMPI devel] Restarting and Pipes

2014-04-10 Thread Adrian Reber
Trying to restart a process I see that orterun has three pipes connected to the processes running under its control (-np 1). orterun: orterun 11562 adrian 15w FIFO 0,8 0t0 5304173 pipe orterun 11562 adrian 16r FIFO 0,8 0t0 5304174 pipe orterun

[OMPI devel] Open MPI and CRIU stdout/stderr

2014-03-19 Thread Adrian Reber
Cross-posting to criu and openmpi devel mailing lists. To get fault tolerance back into Open MPI I added code to use criu as a checkpoint/restart tool. I can checkpoint a process successfully but I have trouble restarting it. CRIU currently has problems restoring the process which is probably rela

Re: [OMPI devel] usage of mca variables in orte-restart

2014-03-18 Thread Adrian Reber
lue() to select the preferred crs module? Adrian On Mon, Mar 17, 2014 at 08:47:16AM -0600, Nathan Hjelm wrote: > Good catch. Fixing now. > > -Nathan > > On Mon, Mar 17, 2014 at 02:50:02PM +0100, Adrian Reber wrote: > > On Fri, Mar 14, 2014 at 10:18:06PM +, Hjelm, Nathan T w

Re: [OMPI devel] usage of mca variables in orte-restart

2014-03-17 Thread Adrian Reber
On Fri, Mar 14, 2014 at 10:18:06PM +, Hjelm, Nathan T wrote: > The preferred way is to use mca_base_var_find and then call > mca_base_var_[set|get]_value. For performance sake we only look at the > environment when the variable is registered. I believe I found a bug in mca_base_var_set_value

Re: [OMPI devel] usage of mca variables in orte-restart

2014-03-15 Thread Adrian Reber
egistered. > > -Nathan > > Please excuse the horrible Outlook top-posting. OWA sucks. > > > From: devel [devel-boun...@open-mpi.org] on behalf of Adrian Reber > [adr...@lisas.de] > Sent: Friday, March 14, 2014 3:05 PM > To: de

[OMPI devel] usage of mca variables in orte-restart

2014-03-14 Thread Adrian Reber
I am now trying to run orte-restart. As far as I understand it orte-restart analyzes the checkpoint metadata and then tries to exec() mpirun which then starts opal-restart. During the startup of opal-restart (during initialize()) detection of the best CRS module is disabled: /* * Turn of

[OMPI devel] orte-restart and PATH

2014-03-12 Thread Adrian Reber
I am using orte-restart without setting my PATH to my Open MPI installation. I am running /full/path/to/orte-restart and orte-restart tries to run mpirun to restart the process. This fails on my system because I do not have any mpirun in my PATH. Is it expected for an Open MPI installation to set u

Re: [OMPI devel] C/R and orte_oob

2014-03-10 Thread Adrian Reber
On Fri, Mar 07, 2014 at 06:54:18AM -0800, Ralph Castain wrote: > > If you like, I can define the required code in the trunk and let you > > fill in the event functionality. > > That would be great. > >>> > >>> Thanks for your changes. When using --with-ft there are a few compil

Re: [OMPI devel] C/R and orte_oob

2014-03-07 Thread Adrian Reber
On Thu, Mar 06, 2014 at 07:47:22PM -0800, Ralph Castain wrote: > > Sorry for delay - yes, that looks like the right direction. I would > > suggest doing it via the current state machine, though, by simply > > defining another job or proc state in orte/mca/plm/plm_types.h, and > >

Re: [OMPI devel] C/R and orte_oob

2014-03-06 Thread Adrian Reber
On Tue, Feb 18, 2014 at 03:46:58PM +0100, Adrian Reber wrote: > > >>> I tried to implement something like you described. It is not yet event > > >>> driven, but before continuing I wanted to get some feedback if it is at > > >>> least the right start:

Re: [OMPI devel] Fix compiler warnings in FT code

2014-03-05 Thread Adrian Reber
Caching = Disabled [dcbz:02880] sstore:stage: open: Compression = Disabled [dcbz:02880] sstore:stage: open: Compression Delay= 0 [dcbz:02880] sstore:stage: open: Skip FileM (Debug Only) = False On Mon, Mar 03, 2014 at 05:42:13PM +0100, Adrian Reber wrote: > I w

Re: [OMPI devel] mca_base_component_distill_checkpoint_ready variable

2014-03-03 Thread Adrian Reber
On Fri, Feb 21, 2014 at 10:12:54AM -0700, Nathan Hjelm wrote: > On Fri, Feb 21, 2014 at 05:21:10PM +0100, Adrian Reber wrote: > > There is a variable in the FT code which is not defined and therefore > > currently #ifdef'd out. > > > > #if (OPAL_ENABLE_FT

Re: [OMPI devel] Fix compiler warnings in FT code

2014-03-03 Thread Adrian Reber
r how to set it up at the moment. > > > > > On Mon, Mar 3, 2014 at 7:25 AM, Adrian Reber wrote: > > > I removed a complete function because it was not used: > > > > ../../../../../orte/mca/sstore/stage/sstore_stage_component.c: At top

Re: [OMPI devel] Fix compiler warnings in FT code

2014-03-03 Thread Adrian Reber
mewhere else? or do you have a different way to set those parameters? > > Other than that it looks good to me. > > > On Mon, Mar 3, 2014 at 5:29 AM, Adrian Reber wrote: > > > I have a simple patch which fixes the remaining compiler warnings when > > running with

[OMPI devel] Fix compiler warnings in FT code

2014-03-03 Thread Adrian Reber
I have a simple patch which fixes the remaining compiler warnings when running with '--with-ft': https://lisas.de/git/?p=open-mpi.git;a=commitdiff;h=4dee703a0a2e64972b0c35b7693c11a09f1fbe5f Does anybody see any problems with this patch? Adrian

[OMPI devel] openmpi-1.7.5a1r30797 fails building on SL 5.5

2014-02-22 Thread Adrian Reber
On a Scientific Linux 5.5 system the nightly snapshot openmpi-1.7.5a1r30797 fails to build with the following errors: Making all in romio make[3]: Entering directory `/tmp/adrian/openmpi-compile/openmpi-1.7.5a1r30797/build/ompi/mca/io/romio/romio' make[4]: Entering directory `/tmp/adrian/openmpi-co

[OMPI devel] mca_base_component_distill_checkpoint_ready variable

2014-02-21 Thread Adrian Reber
There is a variable in the FT code which is not defined and therefore currently #ifdef'd out. #if (OPAL_ENABLE_FT == 1) && (OPAL_ENABLE_FT_CR == 1) #ifdef ENABLE_FT_FIXED /* FIXME_FT * * the variable mca_base_component_distill_checkpoint_ready * was removed by commit 8181c8273c4

[OMPI devel] startup sstore orte/mca/ess/base/ess_base_std_tool.c

2014-02-21 Thread Adrian Reber
To restart a process using orte-restart I need sstore initialized when running as a tool. This is currently missing. The new code is #if OPAL_ENABLE_FT_CR == 1 and should only affect --with-ft builds. The following is the change I want to make: diff --git a/orte/mca/ess/base/ess_base_std_tool.c

Re: [OMPI devel] CRS/CRIU: add code to actually checkpoint a process

2014-02-18 Thread Adrian Reber
e checkpoint only functionality in continue mode the patch can be checked in? Adrian > On Tue, Feb 18, 2014 at 4:08 AM, Adrian Reber wrote: > > > I think I do not understand your question. So far I have only implemented > > the > > checkpoint part and no

Re: [OMPI devel] C/R and orte_oob

2014-02-18 Thread Adrian Reber
On Tue, Feb 18, 2014 at 06:39:12AM -0800, Ralph Castain wrote: > On Feb 18, 2014, at 6:24 AM, Adrian Reber wrote: > > > On Fri, Feb 14, 2014 at 02:51:51PM -0800, Ralph Castain wrote: > >> On Feb 13, 2014, at 11:26 AM, Adrian Reber wrote: > >>> I tried to imple

Re: [OMPI devel] C/R and orte_oob

2014-02-18 Thread Adrian Reber
On Fri, Feb 14, 2014 at 02:51:51PM -0800, Ralph Castain wrote: > On Feb 13, 2014, at 11:26 AM, Adrian Reber wrote: > > I tried to implement something like you described. It is not yet event > > driven, but before continuing I wanted to get some feedback if it is at > >

Re: [OMPI devel] OPAL_CRS_* meaning

2014-02-18 Thread Adrian Reber
> You can see it used in the opal_cr_inc_core_prep() function in > opal/runtime/opal_cr.c > > -- Josh > > > > On Mon, Feb 17, 2014 at 9:28 AM, Adrian Reber wrote: > > > This is probably for Josh. What is the meaning of the OPAL_CRS_* enums? > > > >

Re: [OMPI devel] CRS/CRIU: add code to actually checkpoint a process

2014-02-18 Thread Adrian Reber
ell from its return code (or some other mechanism) > that it is being restarted versus continuing after checkpointing? > > > On Mon, Feb 17, 2014 at 2:00 PM, Ralph Castain wrote: > > > Great - looks fine to me!! > > > > > > On Feb 17, 2014, at 11:39 AM, Adria

[OMPI devel] CRS/CRIU: add code to actually checkpoint a process

2014-02-17 Thread Adrian Reber
I have prepared a patch I would like to commit which adds the code to actually checkpoint a process. Thanks for the pointers about the string variables; I tried to implement it correctly. CRIU currently has problems with the new OOB usock but I will contact the CRIU developers about this error. U

[OMPI devel] How to prefer oob/tcp over oob/usock

2014-02-17 Thread Adrian Reber
With the newly added oob/usock, checkpointing with CRIU stopped working. Is there a way I can prefer oob/tcp on the command line? Adrian

[OMPI devel] OPAL_CRS_* meaning

2014-02-17 Thread Adrian Reber
This is probably for Josh. What is the meaning of the OPAL_CRS_* enums? They are probably used to communicate the state of the CRS modules. OPAL_CRS_ERROR seems to be used in case an error happened. What is the CRS module supposed to set this to if the checkpoint was successful? OPAL_CRS_CONTINUE

Re: [OMPI devel] new CRS component added (criu)

2014-02-14 Thread Adrian Reber
tps://github.com/jsquyres/fork-from-adrian-ft/commit/f5962184f3ea6dffc182a18f7603c5e70e82ac99 > > > > On Feb 14, 2014, at 11:35 AM, "Jeff Squyres (jsquyres)" > wrote: > > > Perfect; cloning now. Thanks! > > > > On Feb 14, 2014, at 11:34 AM, Adrian Reber > > w

Re: [OMPI devel] new CRS component added (criu)

2014-02-14 Thread Adrian Reber
Sure. I added the cloneurl information: https://lisas.de/~adrian/open-mpi.git On Fri, Feb 14, 2014 at 04:30:05PM +, Jeff Squyres (jsquyres) wrote: > Can I clone your git tree and send you a patch? > > On Feb 11, 2014, at 4:45 PM, Adrian Reber wrote: > > > On Tue, Feb

[OMPI devel] mca_base_component_var_register() MCA_BASE_VAR_TYPE_STRING

2014-02-14 Thread Adrian Reber
I am trying to find out how to deal with string variables. Do I have to allocate the memory before calling mca_base_component_var_register() or not? It seems it does a strdup() meaning it has to be free()'d while closing the component. Looking at other occurrences of string variables I see differen
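The ownership question above can be sketched as follows. This is not the real `mca_base_component_var_register()` interface; `demo_var_t`, `demo_register_string()` and `demo_close_string()` are hypothetical names, and the sketch only assumes the behaviour the message suspects: the register step stores a heap copy of the default string, so the component has to release it when it closes.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical variable record; NOT an Open MPI type. */
typedef struct {
    char **storage;   /* address of the component's string variable */
} demo_var_t;

/* Hypothetical register step (assumed behaviour): replace the default
 * value with a heap-allocated copy, so the component now owns memory
 * that must be freed later. Returns 0 on success, -1 on failure. */
int demo_register_string(demo_var_t *var, char **storage)
{
    const char *initial = *storage;

    var->storage = storage;
    if (NULL != initial) {
        char *copy = malloc(strlen(initial) + 1);
        if (NULL == copy) {
            return -1;
        }
        strcpy(copy, initial);
        *storage = copy;
    }
    return 0;
}

/* Hypothetical close step: free the copy and reset the pointer so a
 * second close is harmless (free(NULL) is a no-op). */
void demo_close_string(demo_var_t *var)
{
    free(*var->storage);
    *var->storage = NULL;
}
```

In this model the caller does not allocate anything before registering: it passes a pointer that is NULL or points to a literal default, and the close step is the only place that frees.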

Re: [OMPI devel] C/R and orte_oob

2014-02-13 Thread Adrian Reber
On Thu, Feb 06, 2014 at 02:45:07PM -0800, Ralph Castain wrote: > On Feb 6, 2014, at 2:16 PM, Adrian Reber wrote: > > > Josh explained it to me a few days ago, that after a checkpoint has been > > received TCP should no longer be used to not lose any messages. The > > co

Re: [OMPI devel] openmpi-1.7.5a1r30692 and slurm problems

2014-02-12 Thread Adrian Reber
2014, at 7:47 AM, Ralph Castain wrote: > > > > > On Feb 12, 2014, at 7:32 AM, Adrian Reber wrote: > > > >> > >> $ msub -I -l nodes=3:ppn=8 > >> salloc: Job is in held state, pending scheduler release > >> salloc: Pending job allocation 131828

Re: [OMPI devel] openmpi-1.7.5a1r30692 and slurm problems

2014-02-12 Thread Adrian Reber
On Wed, Feb 12, 2014 at 07:47:53AM -0800, Ralph Castain wrote: > > > > $ msub -I -l nodes=3:ppn=8 > > salloc: Job is in held state, pending scheduler release > > salloc: Pending job allocation 131828 > > salloc: job 131828 queued and waiting for resources > > salloc: job 131828 has been allocated

Re: [OMPI devel] openmpi-1.7.5a1r30692 and slurm problems

2014-02-12 Thread Adrian Reber
9 AM, Ralph Castain wrote: > > > What is your SLURM_TASKS_PER_NODE? > > > > On Feb 12, 2014, at 6:58 AM, Adrian Reber wrote: > > > >> No, the system has only a few MOAB_* variables and many SLURM_* > >> variables: > >>

Re: [OMPI devel] openmpi-1.7.5a1r30692 and slurm problems

2014-02-12 Thread Adrian Reber
:45AM -0800, Ralph Castain wrote: > Seems rather odd - since this is managed by Moab, you shouldn't be seeing > SLURM envars at all. What you should see are PBS_* envars, including a > PBS_NODEFILE that actually contains the allocation. > > > On Feb 12, 2014, at 4:42

[OMPI devel] openmpi-1.7.5a1r30692 and slurm problems

2014-02-12 Thread Adrian Reber
I tried the nightly snapshot (openmpi-1.7.5a1r30692.tar.gz) on a system with slurm and moab. I requested an interactive session using: msub -I -l nodes=3:ppn=8 and started a simple test case which fails: $ mpirun -np 2 ./mpi-test 1

Re: [OMPI devel] new CRS component added (criu)

2014-02-11 Thread Adrian Reber
On Tue, Feb 11, 2014 at 08:09:35PM +, Jeff Squyres (jsquyres) wrote: > On Feb 8, 2014, at 4:49 PM, Adrian Reber wrote: > > >> I note you have a stray $3 at the end of your configure.m4, too (it might > >> supposed to be $2?). > > > > I think I do not re

Re: [OMPI devel] new CRS component added (criu)

2014-02-08 Thread Adrian Reber
On Fri, Feb 07, 2014 at 10:08:48PM +, Jeff Squyres (jsquyres) wrote: > Sweet -- +1 for CRIU support! > > FWIW, I see you modeled your configure.m4 off the blcr configure.m4, but I'd > actually go with making it a bit simpler. For example, I typically structure > my configure.m4's like this

[OMPI devel] new CRS component added (criu)

2014-02-07 Thread Adrian Reber
I have created a new CRS component using criu (criu.org) to support checkpoint/restart in Open MPI. My current patch only provides the framework and necessary configure scripts to detect and link against criu. With this patch orte-checkpoint can request a checkpoint and the new CRIU CRS component i

Re: [OMPI devel] C/R and orte_oob

2014-02-06 Thread Adrian Reber
e, once you did that, the OOB would no longer be available to, for > example, tell the local daemon that the app is ready for checkpoint :-) > > Afraid I'll have to defer to Josh H for any further guidance. > > > On Feb 6, 2014, at 8:15 AM, Adrian Reber wrote: > >

[OMPI devel] C/R and orte_oob

2014-02-06 Thread Adrian Reber
When I initially made the C/R code compile again I made the following change: diff --git a/orte/mca/rml/oob/rml_oob_component.c b/orte/mca/rml/oob/rml_oob_component.c index f0b22fc..90ed086 100644 --- a/orte/mca/rml/oob/rml_oob_component.c +++ b/orte/mca/rml/oob/rml_oob_component.c @@ -185,8 +185,7 @

Re: [OMPI devel] Use unique collective ids for the checkpoint/restart code

2014-02-04 Thread Adrian Reber
a "printf" statement in > plm_base_launch_support.c, so you might want to make that an > opal_output_verbose or something. > > On Feb 3, 2014, at 12:19 PM, Adrian Reber wrote: > > > This patch > > > > https://lisas.de/git/?p=open-mpi.git;a=commitdiff;h=1

[OMPI devel] Use unique collective ids for the checkpoint/restart code

2014-02-03 Thread Adrian Reber
This patch https://lisas.de/git/?p=open-mpi.git;a=commitdiff;h=14ec7f42baab882e345948ff79c4f75f5084bbbf introduces unique collective ids for the checkpoint/restart code and with this applied it seems to work pretty well. As this patch also touches non-CR code it would be good if someone could hav

Re: [OMPI devel] SNAPC: dynamic send buffers

2014-01-29 Thread Adrian Reber
> 2. be aware that ORTE_WAIT_FOR_COMPLETION will block if you are in an RML > callback. I don't think that's an issue here, but just wanted to point it out. > > Ralph > > On Jan 27, 2014, at 8:12 AM, Adrian Reber wrote: > > > I have the following patc

[OMPI devel] SNAPC: dynamic send buffers

2014-01-27 Thread Adrian Reber
removed with my 'getting-it-compiled-again' patches. Instead of blocking recv() calls it now uses ORTE_WAIT_FOR_COMPLETION(). I included gitweb links to the patches. Please have a look at the patches. Adrian commit 6f10b44499b59c84d9032378c7f8c6b3526a029b Author: Ad

Re: [OMPI devel] [PATCH] make orte-checkpoint communicate with orterun again

2014-01-24 Thread Adrian Reber
ata underneath to save > > memory and time. > > > > > > On Jan 23, 2014, at 6:51 AM, Adrian Reber wrote: > > > > > Following patch makes orte-checkpoint communicate with orterun again: > > > > > > diff --git a/orte/tools/orte-checkpoint/ort

[OMPI devel] [PATCH] use ORTE_PROC_IS_APP

2014-01-23 Thread Adrian Reber
Selecting SNAPC requires knowing whether it is an app or not: int orte_snapc_base_select(bool seed, bool app); The following patch uses the correct define. Can I commit it like this? diff --git a/orte/mca/ess/base/ess_base_std_app.c b/orte/mca/ess/base/ess_base_std_app.c index dbbb2f4..f3a38f0 100644

[OMPI devel] [PATCH] make orte-checkpoint communicate with orterun again

2014-01-23 Thread Adrian Reber
The following patch makes orte-checkpoint communicate with orterun again: diff --git a/orte/tools/orte-checkpoint/orte-checkpoint.c b/orte/tools/orte-checkpoint/orte-checkpoint.c index 7106342..8539f34 100644 --- a/orte/tools/orte-checkpoint/orte-checkpoint.c +++ b/orte/tools/orte-checkpoint/orte-che

Re: [OMPI devel] callback debugging

2014-01-21 Thread Adrian Reber
s that orte-checkpoint is a tool, and so it isn't a daemon - > but it is also not an app. > > > On Jan 21, 2014, at 11:56 AM, Adrian Reber wrote: > > > Good to know that it does not make any sense. So it not just me. > > > > Looking at the call cha

Re: [OMPI devel] callback debugging

2014-01-21 Thread Adrian Reber
> However, like I said, it makes no sense for orte-checkpoint to do a barrier > as it is a singleton - there is nothing for it to "barrier" with. > > On Jan 21, 2014, at 7:24 AM, Adrian Reber wrote: > > > I think I still do not really understand how it w

Re: [OMPI devel] callback debugging

2014-01-21 Thread Adrian Reber
n Mon, Jan 20, 2014 at 4:46 PM, Ralph Castain wrote: > > > Is it orte-checkpoint that is hanging, or the app you are trying to > > checkpoint? > > > > > > On Jan 20, 2014, at 2:10 PM, Adrian Reber wrote: > > > > Thanks for your help. I tried initializing

Re: [OMPI devel] callback debugging

2014-01-21 Thread Adrian Reber
, Jan 20, 2014 at 02:46:04PM -0800, Ralph Castain wrote: > Is it orte-checkpoint that is hanging, or the app you are trying to > checkpoint? > > > On Jan 20, 2014, at 2:10 PM, Adrian Reber wrote: > > > Thanks for your help. I tried initializing the barrier correctly (see >

Re: [OMPI devel] callback debugging

2014-01-20 Thread Adrian Reber
s okay to > block using ORTE_WAIT_FOR_COMPLETION. Look in > orte/mca/routed/base/routed_base_fns.c starting at line 252 for an example. > > HTH > Ralph > > On Jan 10, 2014, at 12:55 PM, Ralph Castain wrote: > > > > > On Jan 10, 2014, at 12:45 PM, Adrian Rebe

Re: [OMPI devel] callback debugging

2014-01-10 Thread Adrian Reber
On Fri, Jan 10, 2014 at 09:48:14AM -0800, Ralph Castain wrote: > > On Jan 10, 2014, at 8:02 AM, Adrian Reber wrote: > > > I am currently trying to understand how callbacks are working. Right now > > I am looking at orte/mca/rml/base/rml_base_receive.c > > orte_rml_b

[OMPI devel] callback debugging

2014-01-10 Thread Adrian Reber
I am currently trying to understand how callbacks are working. Right now I am looking at orte/mca/rml/base/rml_base_receive.c orte_rml_base_comm_start() which does orte_rml.recv_buffer_nb(ORTE_NAME_WILDCARD, ORTE_RML_TAG_RML_INFO_UPDATE,

Re: [OMPI devel] orte_barrier: Assertion `0 == item->opal_list_item_refcount' failed.

2014-01-09 Thread Adrian Reber
For my CR work this can probably be ignored. I think I was looking in the wrong place. On Thu, Jan 09, 2014 at 05:28:01PM +0100, Adrian Reber wrote: > Continuing with the CR code I now get a crash which can be easily reproduced > using orte/test/system/orte_barrier.c > > I get: >

[OMPI devel] orte_barrier: Assertion `0 == item->opal_list_item_refcount' failed.

2014-01-09 Thread Adrian Reber
Continuing with the CR code I now get a crash which can be easily reproduced using orte/test/system/orte_barrier.c I get: orte_barrier: ../../../../../opal/class/opal_list.h:547: _opal_list_append: Assertion `0 == item->opal_list_item_refcount' failed. [dcbz:05085] *** Process received signal **

Re: [OMPI devel] return value of opal_compress_base_register() in opal/mca/compress/base/compress_base_open.c

2014-01-07 Thread Adrian Reber
_ERR_NOT_AVAILABLE. This would still avoid opening the > > components for no reason (thus saving some memory) while not causing > > opal_init to abort. > > > > > > On Jan 3, 2014, at 3:19 AM, Adrian Reber wrote: > > > > > So removing all output like
