Re: [OMPI devel] callback debugging

2014-01-21 Thread Adrian Reber
Thanks, that helps. Now it actually starts to communicate with the orterun process. This still fails but I will try to fix it. On Tue, Jan 21, 2014 at 12:27:55PM -0800, Ralph Castain wrote: > That second argument is incorrect - it should be ORTE_PROC_IS_APP (note no > !). The problem is that orte

Re: [OMPI devel] callback debugging

2014-01-21 Thread Ralph Castain
That second argument is incorrect - it should be ORTE_PROC_IS_APP (note no !). The problem is that orte-checkpoint is a tool, and so it isn't a daemon - but it is also not an app. On Jan 21, 2014, at 11:56 AM, Adrian Reber wrote: > Good to know that it does not make any sense. So it not just
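
A minimal sketch of the flag logic discussed in this message, not ORTE code. The `role_t` enum and the helper names are hypothetical stand-ins for the `ORTE_PROC_IS_*` flags. It only illustrates why "not a daemon" is a poor proxy for "is an app" once a tool such as orte-checkpoint is involved.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical process roles; stand-ins for the ORTE_PROC_IS_* flags. */
typedef enum { ROLE_HNP, ROLE_DAEMON, ROLE_APP, ROLE_TOOL } role_t;

static bool is_daemon(role_t r) { return r == ROLE_DAEMON; }
static bool is_app(role_t r)    { return r == ROLE_APP; }

/* What the call site effectively passed: "app" inferred as "not a daemon". */
static bool app_flag_buggy(role_t r) { return !is_daemon(r); }

/* What the reply suggests: ask directly whether the process is an app. */
static bool app_flag_fixed(role_t r) { return is_app(r); }

/* The case from the thread: a tool is neither a daemon nor an app. */
static void check_tool_case(void) {
    assert(app_flag_buggy(ROLE_TOOL));   /* tool wrongly treated as an app */
    assert(!app_flag_fixed(ROLE_TOOL));  /* tool correctly excluded */
}
```

With the negated daemon test, a tool takes the application path, which would be consistent with the hang in app_coord_init() reported elsewhere in the thread.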

Re: [OMPI devel] callback debugging

2014-01-21 Thread Adrian Reber
Good to know that it does not make any sense. So it is not just me. Looking at the call chain I can see orte_snapc_base_select(ORTE_PROC_IS_HNP, !ORTE_PROC_IS_DAEMON); and the second parameter is used to decide if it is an app or not: int orte_snapc_base_select(bool seed, bool app) in orte/mca/sn

Re: [OMPI devel] callback debugging

2014-01-21 Thread Ralph Castain
That doesn't make any sense - I can't imagine a reason for orte-checkpoint itself to be running a barrier. I wonder if it is selecting the wrong component in snapc? As for the patch, that isn't going to work. The collective id has to be *globally* unique, which means that only orterun can issue

Re: [OMPI devel] callback debugging

2014-01-21 Thread Adrian Reber
I think I still do not really understand how it works. The barrier on which orte-checkpoint is currently hanging is in app_coord_init(). You are also saying that orte-checkpoint should not be calling a barrier. The backtrace of the point where it is hanging now looks like: #0 0x769befa0

Re: [OMPI devel] callback debugging

2014-01-21 Thread Adrian Reber
orte-checkpoint before communicating with orterun which runs the processes I am trying to checkpoint. The full backtrace: #0 0x769befa0 in __nanosleep_nocancel () at ../sysdeps/unix/syscall-template.S:81 #1 0x77b45712 in app_coord_init () at ../../../../../orte/mca/snapc/full/s

Re: [OMPI devel] callback debugging

2014-01-20 Thread Josh Hursey
If it is the application, then there is probably a barrier in app_coord_init() to make sure all the applications are up and running. After this point, the global coordinator knows that the application can be checkpointed. I don't think orte-checkpoint should be calling a barrier - from wha

Re: [OMPI devel] callback debugging

2014-01-20 Thread Ralph Castain
Is it orte-checkpoint that is hanging, or the app you are trying to checkpoint? On Jan 20, 2014, at 2:10 PM, Adrian Reber wrote: > Thanks for your help. I tried initializing the barrier correctly (see > attached patch) but now, instead of crashing, it just hangs on the > barrier while running o

Re: [OMPI devel] callback debugging

2014-01-20 Thread Adrian Reber
Thanks for your help. I tried initializing the barrier correctly (see attached patch) but now, instead of crashing, it just hangs on the barrier while running orte-checkpoint [dcbz:20150] [[41665,0],0] grpcomm:bad entering barrier [dcbz:20150] [[41665,0],0] ACTIVATING GRCPCOMM OP 0 at ../../../..

Re: [OMPI devel] callback debugging

2014-01-11 Thread Ralph Castain
I took a look at this, and I'm afraid you have some work to do in the orte/mca/snapc code base: 1. you must use dynamically allocated buffers for rml.send_buffer_nb. See r30261 for an example of the changes that need to be made - I did some, but can't swear to catching them all. It was enough t
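
A minimal sketch of why a non-blocking send generally needs a heap-allocated buffer. This is a generic simulation, not the ORTE RML API: `buffer_t`, `post_send_nb`, `progress`, and `release_buffer` are hypothetical names, and having the completion callback free the buffer is an assumption of this sketch. The point is that the send call returns before the data is consumed, so the buffer must outlive the calling function.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical message buffer and completion-callback types. */
typedef struct { char *data; size_t len; } buffer_t;
typedef void (*send_cb_t)(buffer_t *buf);

static buffer_t *pending_buf = NULL;  /* the single in-flight send */
static send_cb_t pending_cb = NULL;
static int completed = 0;             /* number of finished sends */

/* Non-blocking: only records the request and returns immediately. */
static void post_send_nb(buffer_t *buf, send_cb_t cb) {
    assert(pending_buf == NULL);  /* this sketch handles one send at a time */
    pending_buf = buf;
    pending_cb = cb;
}

/* Runs later, e.g. from an event loop; completes the pending send. */
static void progress(void) {
    if (pending_buf != NULL) {
        buffer_t *buf = pending_buf;
        send_cb_t cb = pending_cb;
        pending_buf = NULL;
        pending_cb = NULL;
        cb(buf);  /* the buffer is still valid here only because it is on the heap */
    }
}

/* Completion callback: in this sketch it owns and releases the buffer. */
static void release_buffer(buffer_t *buf) {
    free(buf->data);
    free(buf);
    completed++;
}

static void send_message(const char *msg) {
    buffer_t *buf = malloc(sizeof *buf);
    assert(buf != NULL);
    buf->len = strlen(msg) + 1;
    buf->data = malloc(buf->len);
    assert(buf->data != NULL);
    memcpy(buf->data, msg, buf->len);
    post_send_nb(buf, release_buffer);
    /* Returning here is safe: buf lives until the callback frees it.
       A stack-allocated buffer_t would be dangling by the time progress() runs. */
}
```

Calling `send_message("hello")` followed later by `progress()` completes the send and frees the buffer exactly once.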

Re: [OMPI devel] callback debugging

2014-01-10 Thread Ralph Castain
On Jan 10, 2014, at 12:45 PM, Adrian Reber wrote: > On Fri, Jan 10, 2014 at 09:48:14AM -0800, Ralph Castain wrote: >> >> On Jan 10, 2014, at 8:02 AM, Adrian Reber wrote: >> >>> I am currently trying to understand how callbacks are working. Right now >>> I am looking at orte/mca/rml/base/rml_b

Re: [OMPI devel] callback debugging

2014-01-10 Thread Adrian Reber
On Fri, Jan 10, 2014 at 09:48:14AM -0800, Ralph Castain wrote: > > On Jan 10, 2014, at 8:02 AM, Adrian Reber wrote: > > > I am currently trying to understand how callbacks are working. Right now > > I am looking at orte/mca/rml/base/rml_base_receive.c > > orte_rml_base_comm_start() which does >

Re: [OMPI devel] callback debugging

2014-01-10 Thread Ralph Castain
On Jan 10, 2014, at 8:02 AM, Adrian Reber wrote: > I am currently trying to understand how callbacks are working. Right now > I am looking at orte/mca/rml/base/rml_base_receive.c > orte_rml_base_comm_start() which does > >orte_rml.recv_buffer_nb(ORTE_NAME_WILDCARD, >

[OMPI devel] callback debugging

2014-01-10 Thread Adrian Reber
I am currently trying to understand how callbacks are working. Right now I am looking at orte/mca/rml/base/rml_base_receive.c orte_rml_base_comm_start() which does orte_rml.recv_buffer_nb(ORTE_NAME_WILDCARD, ORTE_RML_TAG_RML_INFO_UPDATE,