Thanks, that helps. Now it actually starts to communicate with the
orterun process. This still fails but I will try to fix it.
On Tue, Jan 21, 2014 at 12:27:55PM -0800, Ralph Castain wrote:
That second argument is incorrect - it should be ORTE_PROC_IS_APP (note no !).
The problem is that orte-checkpoint is a tool, and so it isn't a daemon - but
it is also not an app.
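To illustrate the point, here is a minimal sketch of why negating the daemon test is not the same as testing for an app once a third process type (a tool) exists. The enum and helper names are hypothetical stand-ins made up for this example; they are not the ORTE macros or their definitions.

```c
#include <stdbool.h>

/* Hypothetical process types, for illustration only. The real ORTE
 * macros test the type of the running process instead. */
typedef enum { PROC_DAEMON, PROC_APP, PROC_TOOL } proc_type_t;

static bool is_daemon(proc_type_t t) { return t == PROC_DAEMON; }
static bool is_app(proc_type_t t)    { return t == PROC_APP; }

/* Mirrors the current call: "app" is inferred by negating "daemon". */
static bool app_flag_by_negation(proc_type_t t) { return !is_daemon(t); }

/* Mirrors the suggested call: "app" is tested explicitly. */
static bool app_flag_explicit(proc_type_t t) { return is_app(t); }
```

For PROC_APP and PROC_DAEMON both helpers agree. For PROC_TOOL the negated form returns true and the explicit form returns false, so under these assumptions a tool would be treated as an app only by the negated form.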
On Jan 21, 2014, at 11:56 AM, Adrian Reber wrote:
Good to know that it does not make any sense. So it is not just me.
Looking at the call chain I can see
orte_snapc_base_select(ORTE_PROC_IS_HNP, !ORTE_PROC_IS_DAEMON);
and the second parameter is used to decide if it is an app or not:
int orte_snapc_base_select(bool seed, bool app) in
orte/mca/sn
That doesn't make any sense - I can't imagine a reason for orte-checkpoint
itself to be running a barrier. I wonder if it is selecting the wrong component
in snapc?
As for the patch, that isn't going to work. The collective id has to be
*globally* unique, which means that only orterun can issue
I think I still do not really understand how it works.
The barrier on which orte-checkpoint is currently hanging is in
app_coord_init(). You are also saying that orte-checkpoint
should not be calling a barrier. The backtrace of the point where it
is hanging now looks like:
#0 0x769befa0
orte-checkpoint before communicating with orterun which runs the
processes I am trying to checkpoint. The full backtrace:
#0 0x769befa0 in __nanosleep_nocancel () at
../sysdeps/unix/syscall-template.S:81
#1 0x77b45712 in app_coord_init () at
../../../../../orte/mca/snapc/full/s
If it is the application, then there is probably a barrier in the
app_coord_init() to make sure all the applications are up and running.
After this point the global coordinator knows that the application can
be checkpointed.
I don't think orte-checkpoint should be calling a barrier - from wha
Is it orte-checkpoint that is hanging, or the app you are trying to checkpoint?
On Jan 20, 2014, at 2:10 PM, Adrian Reber wrote:
Thanks for your help. I tried initializing the barrier correctly (see
attached patch) but now, instead of crashing, it just hangs on the
barrier while running orte-checkpoint
[dcbz:20150] [[41665,0],0] grpcomm:bad entering barrier
[dcbz:20150] [[41665,0],0] ACTIVATING GRCPCOMM OP 0 at
../../../..
I took a look at this, and I'm afraid you have some work to do in the
orte/mca/snapc code base:
1. you must use dynamically allocated buffers for rml.send_buffer_nb. See
r30261 for an example of the changes that need to be made - I did some, but
can't swear to catching them all. It was enough t
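As a rough illustration of the first point, the sketch below models a non-blocking send that only queues a pointer to the buffer and completes later. It assumes the reason for the rule is that the send returns before the data is consumed, so a buffer on the caller's stack could be gone by then. Every name here (toy_buffer_t, toy_send_nb, toy_progress, and so on) is hypothetical and is not part of the ORTE RML API.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical toy buffer; not an ORTE or OPAL type. */
typedef struct {
    char  *data;
    size_t len;
} toy_buffer_t;

typedef void (*toy_send_cb_t)(toy_buffer_t *buf, void *cbdata);

/* Single-slot "queue" to keep the sketch short. */
static toy_buffer_t *pending_buf;
static toy_send_cb_t pending_cb;
static void         *pending_cbdata;

/* Returns immediately; the buffer must stay valid until the callback runs. */
static int toy_send_nb(toy_buffer_t *buf, toy_send_cb_t cb, void *cbdata)
{
    if (pending_buf != NULL) {
        return -1;                     /* slot busy */
    }
    pending_buf = buf;
    pending_cb = cb;
    pending_cbdata = cbdata;
    return 0;
}

/* Simulates the event loop finishing the send; returns bytes "sent". */
static size_t toy_progress(void)
{
    size_t sent = 0;
    if (pending_buf != NULL) {
        toy_buffer_t *buf = pending_buf;
        sent = buf->len;
        pending_buf = NULL;
        pending_cb(buf, pending_cbdata);   /* ownership goes to the callback */
    }
    return sent;
}

/* Completion callback: the point where the heap buffer is released. */
static void release_buffer(toy_buffer_t *buf, void *cbdata)
{
    int *released = (int *)cbdata;
    free(buf->data);
    free(buf);
    if (released != NULL) {
        (*released)++;
    }
}

/* Caller: heap-allocate, hand off, and do not touch the buffer afterwards. */
static int send_message(const char *msg, int *released)
{
    toy_buffer_t *buf = malloc(sizeof(*buf));
    if (buf == NULL) {
        return -1;
    }
    buf->len = strlen(msg) + 1;
    buf->data = malloc(buf->len);
    if (buf->data == NULL) {
        free(buf);
        return -1;
    }
    memcpy(buf->data, msg, buf->len);
    if (toy_send_nb(buf, release_buffer, released) != 0) {
        free(buf->data);
        free(buf);
        return -1;
    }
    return 0;
}
```

In this model send_message() returns while the buffer is still queued, and the memory is released only when toy_progress() invokes the callback. A stack-allocated toy_buffer_t in send_message() would already be out of scope at that point.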
On Jan 10, 2014, at 12:45 PM, Adrian Reber wrote:
On Fri, Jan 10, 2014 at 09:48:14AM -0800, Ralph Castain wrote:
On Jan 10, 2014, at 8:02 AM, Adrian Reber wrote:
I am currently trying to understand how callbacks are working. Right now
I am looking at orte/mca/rml/base/rml_base_receive.c
orte_rml_base_comm_start() which does
orte_rml.recv_buffer_nb(ORTE_NAME_WILDCARD,
ORTE_RML_TAG_RML_INFO_UPDATE,