Re: [Linux-ha-dev] core dump (abort) in crmd: "untracked process" (HEAD)

2006-04-21 Thread Andrew Beekhof


On Apr 20, 2006, at 3:58 PM, Dejan Muhamedagic wrote:


Hi,

#4  0x0805add0 in build_operation_update (xml_rsc=0x8109430, op=0x8282b98,
    src=0x80692d9 "do_update_resource", lpc=0) at lrm.c:347

(gdb) print *0x8282b98
$1 = 136448640

If you want I can send you the core off list. I keep all the cores :)


that's ok - I think I understand the problem well enough now.
And since you're running against HEAD, if you update you'll get the fixes :)




Cheers,

Dejan

On Thu, Apr 20, 2006 at 09:15:28AM +0200, Andrew Beekhof wrote:

On 4/20/06, Dejan Muhamedagic <[EMAIL PROTECTED]> wrote:

Hello,

Running CTS with HEAD hung the cluster after crmd dumped core
(abort).  It happened after 53 tests with this curious message:

Apr 19 17:48:01 BadNews: Apr 19 17:42:48 sapcl01 crmd: [17937]:  
ERROR: mask(lrm.c:build_operation_update): Triggered non-fatal  
assert at lrm.c:349: fsa_our_dc_version != NULL


We have two kinds of asserts... neither are supposed to happen and
both create a core file so that we can diagnose how we got there.
However non-fatal ones call fork first (so the main process doesn't
die) and then take some recovery action.

Sometimes the non-fatal varieties are used in new pieces of code to
make sure they work as we expect and that is what has happened here.

Do you still have the core file?
I'd be interested to know the result of:
   print *op
from frame #4

In the meantime, I'll look at the logs and see what I can figure out.

Apr 19 17:48:01 BadNews: Apr 19 17:42:48 sapcl01 crmd: [17937]:  
ERROR: Exiting untracked process process 19654 dumped core
Apr 19 17:48:01 BadNews: Apr 19 17:45:49 sapcl01 crmd: [17937]:  
ERROR: mask(utils.c:crm_timer_popped): Finalization Timer  
(I_ELECTION) just popped!


The cluster looks like this, unchanged for several hours:


Last updated: Thu Apr 20 04:43:47 2006
Current DC: sapcl01 (85180fd0-70c9-4136-a5e0-90d89ea6079d)
3 Nodes configured.
3 Resources configured.


Node: sapcl03 (0bfb78a2-fcd2-4f52-8a06-2d17437a6750): online
Node: sapcl02 (09fa194c-d7e1-41fa-a0d0-afd79a139181): online
Node: sapcl01 (85180fd0-70c9-4136-a5e0-90d89ea6079d): online

Resource Group: group_1
    IPaddr_1        (heartbeat::ocf:IPaddr):        Started sapcl03
    LVM_2           (heartbeat::ocf:LVM):           Stopped
    Filesystem_3    (heartbeat::ocf:Filesystem):    Stopped
Resource Group: group_2
    IPaddr_2        (heartbeat::ocf:IPaddr):        Started sapcl02
    LVM_3           (heartbeat::ocf:LVM):           Started sapcl02
    Filesystem_4    (heartbeat::ocf:Filesystem):    Started sapcl02
Resource Group: group_3
    IPaddr_3        (heartbeat::ocf:IPaddr):        Started sapcl03
    LVM_4           (heartbeat::ocf:LVM):           Started sapcl03
    Filesystem_5    (heartbeat::ocf:Filesystem):    Started sapcl03


And:

sapcl01# crmadmin -S sapcl01
Status of [EMAIL PROTECTED]: S_TERMINATE (ok)

All processes are still running on this node, but heartbeat seems
to be in some kind of limbo.

Cheers,

Dejan









--
Andrew Beekhof

"Ooo Ahhh, Glenn McRath" - TISM




Re: [Linux-ha-dev] core dump (abort) in crmd: "untracked process" (HEAD)

2006-04-20 Thread Dejan Muhamedagic
Hi,

On Thu, Apr 20, 2006 at 10:58:05AM +0200, Andrew Beekhof wrote:
> On 4/20/06, Andrew Beekhof <[EMAIL PROTECTED]> wrote:
> > On 4/20/06, Andrew Beekhof <[EMAIL PROTECTED]> wrote:
> > > On 4/20/06, Dejan Muhamedagic <[EMAIL PROTECTED]> wrote:
> > > > Hello,
> > > >
> > > > Running CTS with HEAD hung the cluster after crmd dumped core
> > > > (abort).  It happened after 53 tests with this curious message:
> > > >
> > > > Apr 19 17:48:01 BadNews: Apr 19 17:42:48 sapcl01 crmd: [17937]: ERROR: 
> > > > mask(lrm.c:build_operation_update): Triggered non-fatal assert at 
> > > > lrm.c:349: fsa_our_dc_version != NULL
> > >
> > > We have two kinds of asserts... neither are supposed to happen and
> > > both create a core file so that we can diagnose how we got there.
> > > However non-fatal ones call fork first (so the main process doesn't
> > > die) and then take some recovery action.
> > >
> > > Sometimes the non-fatal varieties are used in new pieces of code to
> > > make sure they work as we expect and that is what has happened here.
> > >
> > > Do you still have the core file?
> > > I'd be interested to know the result of:
> > >print *op
> > > from frame #4
> > >
> > > In the meantime, I'll look at the logs and see what I can figure out.
> >
> > There is also a script Alan wrote to easily extract test data:
> >/usr/lib/heartbeat/cts/extracttests.py

Ha! This is cool. I was thinking about writing one myself :)

> two problems here:
> 1) one of the resource actions took longer than one of our internal timers

Hmm. All resources are fairly light: an IP address, a volume
on the failover storage, and the corresponding journaled fs. Strange
that the timer went off.

> 2) as a result of 1) the assert went off
> 
> to address 2) I've taken a slightly different approach to that part of
> the code and it will be in CVS shortly
> 
> We appear to recover ok from 1) so I'm leaving the timer there but
> doubling its interval.  This timer is not supposed to go off in the
> first place so increasing it should be safe.
> 
> > > > sapcl01# crmadmin -S sapcl01
> > > > Status of [EMAIL PROTECTED]: S_TERMINATE (ok)
> 
> The only node I see exiting in the logs is sapcl02, which was stopped by CTS.

Yes, the test was:

Apr 19 17:42:38 lingws CTS: Running test NearQuorumPoint (sapcl02) [53]
Apr 19 17:42:38 lingws CTS: start nodes:['sapcl01', 'sapcl03']
Apr 19 17:42:38 lingws CTS: stop nodes:['sapcl02']

However, the DC (sapcl01) went berserk.

> > > > All processes are still running on this node, but heartbeat seems
> > > > to be in some kind of limbo.
> 
> I see this in the logs:
> 
> Apr 19 17:48:01 sapcl01 crmd: [17937]: info:
> mask(fsa.c:do_state_transition): State transition S_TRANSITION_ENGINE
> -> S_IDLE [ input=I_TE_SUCCESS cause=C_IPC_MESSAGE origin=do_msg_route
> ]
> 
> So to me it looks like everything is back on track, no?

No. The status shown by crm_mon in one of my previous messages
remained unchanged for many hours (10 or so). The cluster was
basically stalled.

However, after shutting everything down and starting CTS from
scratch, the very same HEAD code ran perfectly OK:

Apr 20 05:18:31  BEGINNING 200 TESTS
...
Apr 20 11:01:50 Overall Results:{'failure': 0, 'success': 200, 'BadNews': 1876}

Apart from some problems with pengine before the first test (it looks like
some resources were left running after the shutdown, so they
eventually appeared to be running on two nodes), six monitor
operation failures with exit code OCF_NOT_RUNNING (why one per hour? ;-),
and tons of spurious messages from the LVM RA of this kind:

Apr 20 05:21:40 BadNews: Apr 20 05:20:22 sapcl02 LVM[7867]: [7967]: ERROR: LVM 
Volume /dev/data03vg is offline

everything else went fine.

Obviously, the core dump was triggered by some unusual circumstances.

Cheers,

Dejan


Re: [Linux-ha-dev] core dump (abort) in crmd: "untracked process" (HEAD)

2006-04-20 Thread Dejan Muhamedagic
Hi,

#4  0x0805add0 in build_operation_update (xml_rsc=0x8109430, op=0x8282b98, 
src=0x80692d9 "do_update_resource", lpc=0) at lrm.c:347

(gdb) print *0x8282b98
$1 = 136448640

If you want I can send you the core off list. I keep all the cores :)

Cheers,

Dejan

On Thu, Apr 20, 2006 at 09:15:28AM +0200, Andrew Beekhof wrote:
> On 4/20/06, Dejan Muhamedagic <[EMAIL PROTECTED]> wrote:
> > Hello,
> >
> > Running CTS with HEAD hung the cluster after crmd dumped core
> > (abort).  It happened after 53 tests with this curious message:
> >
> > Apr 19 17:48:01 BadNews: Apr 19 17:42:48 sapcl01 crmd: [17937]: ERROR: 
> > mask(lrm.c:build_operation_update): Triggered non-fatal assert at 
> > lrm.c:349: fsa_our_dc_version != NULL
> 
> We have two kinds of asserts... neither are supposed to happen and
> both create a core file so that we can diagnose how we got there.
> However non-fatal ones call fork first (so the main process doesn't
> die) and then take some recovery action.
> 
> Sometimes the non-fatal varieties are used in new pieces of code to
> make sure they work as we expect and that is what has happened here.
> 
> Do you still have the core file?
> I'd be interested to know the result of:
>print *op
> from frame #4
> 
> In the meantime, I'll look at the logs and see what I can figure out.
> 
> > Apr 19 17:48:01 BadNews: Apr 19 17:42:48 sapcl01 crmd: [17937]: ERROR: 
> > Exiting untracked process process 19654 dumped core
> > Apr 19 17:48:01 BadNews: Apr 19 17:45:49 sapcl01 crmd: [17937]: ERROR: 
> > mask(utils.c:crm_timer_popped): Finalization Timer (I_ELECTION) just popped!
> >
> > The cluster looks like this, unchanged for several hours:
> >
> > 
> > Last updated: Thu Apr 20 04:43:47 2006
> > Current DC: sapcl01 (85180fd0-70c9-4136-a5e0-90d89ea6079d)
> > 3 Nodes configured.
> > 3 Resources configured.
> > 
> >
> > Node: sapcl03 (0bfb78a2-fcd2-4f52-8a06-2d17437a6750): online
> > Node: sapcl02 (09fa194c-d7e1-41fa-a0d0-afd79a139181): online
> > Node: sapcl01 (85180fd0-70c9-4136-a5e0-90d89ea6079d): online
> >
> > Resource Group: group_1
> > IPaddr_1(heartbeat::ocf:IPaddr):Started sapcl03
> > LVM_2   (heartbeat::ocf:LVM):   Stopped
> > Filesystem_3(heartbeat::ocf:Filesystem):Stopped
> > Resource Group: group_2
> > IPaddr_2(heartbeat::ocf:IPaddr):Started sapcl02
> > LVM_3   (heartbeat::ocf:LVM):   Started sapcl02
> > Filesystem_4(heartbeat::ocf:Filesystem):Started sapcl02
> > Resource Group: group_3
> > IPaddr_3(heartbeat::ocf:IPaddr):Started sapcl03
> > LVM_4   (heartbeat::ocf:LVM):   Started sapcl03
> > Filesystem_5(heartbeat::ocf:Filesystem):Started sapcl03
> >
> > And:
> >
> > sapcl01# crmadmin -S sapcl01
> > Status of [EMAIL PROTECTED]: S_TERMINATE (ok)
> >
> > All processes are still running on this node, but heartbeat seems
> > to be in some kind of limbo.
> >
> > Cheers,
> >
> > Dejan
> >
> >


Re: [Linux-ha-dev] core dump (abort) in crmd: "untracked process" (HEAD)

2006-04-20 Thread Andrew Beekhof
On 4/20/06, Andrew Beekhof <[EMAIL PROTECTED]> wrote:
> On 4/20/06, Andrew Beekhof <[EMAIL PROTECTED]> wrote:
> > On 4/20/06, Dejan Muhamedagic <[EMAIL PROTECTED]> wrote:
> > > Hello,
> > >
> > > Running CTS with HEAD hung the cluster after crmd dumped core
> > > (abort).  It happened after 53 tests with this curious message:
> > >
> > > Apr 19 17:48:01 BadNews: Apr 19 17:42:48 sapcl01 crmd: [17937]: ERROR: 
> > > mask(lrm.c:build_operation_update): Triggered non-fatal assert at 
> > > lrm.c:349: fsa_our_dc_version != NULL
> >
> > We have two kinds of asserts... neither are supposed to happen and
> > both create a core file so that we can diagnose how we got there.
> > However non-fatal ones call fork first (so the main process doesn't
> > die) and then take some recovery action.
> >
> > Sometimes the non-fatal varieties are used in new pieces of code to
> > make sure they work as we expect and that is what has happened here.
> >
> > Do you still have the core file?
> > I'd be interested to know the result of:
> >print *op
> > from frame #4
> >
> > In the meantime, I'll look at the logs and see what I can figure out.
>
> There is also a script Alan wrote to easily extract test data:
>/usr/lib/heartbeat/cts/extracttests.py
>
> Can you tell me what test was being performed at the time you hit the assert?
> Do you have logs from further back?

ok, I "found" the logs... my log reader was trying to be helpful :-/

two problems here:
1) one of the resource actions took longer than one of our internal timers
2) as a result of 1) the assert went off

to address 2) I've taken a slightly different approach to that part of
the code and it will be in CVS shortly.

We appear to recover ok from 1) so I'm leaving the timer there but
doubling its interval.  This timer is not supposed to go off in the
first place so increasing it should be safe.
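
(For reference, "doubling the interval" just means re-arming the GLib
timeout with twice the period - an illustrative sketch only; the names
and the starting value below are made up, not the actual crmd timer code:)

    #include <glib.h>

    static guint timer_id = 0;
    static guint period_ms = 300000;        /* hypothetical starting value */

    static gboolean timer_popped(gpointer data)
    {
        /* in crmd this would feed an input such as I_ELECTION into the FSA */
        return FALSE;                       /* one-shot: don't reschedule */
    }

    static void rearm_timer_doubled(void)
    {
        if (timer_id != 0) {
            g_source_remove(timer_id);      /* cancel the old timeout */
        }
        period_ms *= 2;                     /* the "doubling" mentioned above */
        timer_id = g_timeout_add(period_ms, timer_popped, NULL);
    }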

> > > sapcl01# crmadmin -S sapcl01
> > > Status of [EMAIL PROTECTED]: S_TERMINATE (ok)

The only node I see exiting in the logs is sapcl02, which was stopped by CTS.

> > > All processes are still running on this node, but heartbeat seems
> > > to be in some kind of limbo.

I see this in the logs:

Apr 19 17:48:01 sapcl01 crmd: [17937]: info:
mask(fsa.c:do_state_transition): State transition S_TRANSITION_ENGINE
-> S_IDLE [ input=I_TE_SUCCESS cause=C_IPC_MESSAGE origin=do_msg_route
]

So to me it looks like everything is back on track, no?


Re: [Linux-ha-dev] core dump (abort) in crmd: "untracked process" (HEAD)

2006-04-20 Thread Andrew Beekhof
On 4/20/06, Andrew Beekhof <[EMAIL PROTECTED]> wrote:
> On 4/20/06, Dejan Muhamedagic <[EMAIL PROTECTED]> wrote:
> > Hello,
> >
> > Running CTS with HEAD hung the cluster after crmd dumped core
> > (abort).  It happened after 53 tests with this curious message:
> >
> > Apr 19 17:48:01 BadNews: Apr 19 17:42:48 sapcl01 crmd: [17937]: ERROR: 
> > mask(lrm.c:build_operation_update): Triggered non-fatal assert at 
> > lrm.c:349: fsa_our_dc_version != NULL
>
> We have two kinds of asserts... neither are supposed to happen and
> both create a core file so that we can diagnose how we got there.
> However non-fatal ones call fork first (so the main process doesn't
> die) and then take some recovery action.
>
> Sometimes the non-fatal varieties are used in new pieces of code to
> make sure they work as we expect and that is what has happened here.
>
> Do you still have the core file?
> I'd be interested to know the result of:
>print *op
> from frame #4
>
> In the meantime, I'll look at the logs and see what I can figure out.

There is also a script Alan wrote to easily extract test data:
   /usr/lib/heartbeat/cts/extracttests.py

Can you tell me what test was being performed at the time you hit the assert?
Do you have logs from further back?

>
> > Apr 19 17:48:01 BadNews: Apr 19 17:42:48 sapcl01 crmd: [17937]: ERROR: 
> > Exiting untracked process process 19654 dumped core
> > Apr 19 17:48:01 BadNews: Apr 19 17:45:49 sapcl01 crmd: [17937]: ERROR: 
> > mask(utils.c:crm_timer_popped): Finalization Timer (I_ELECTION) just popped!
> >
> > The cluster looks like this, unchanged for several hours:
> >
> > 
> > Last updated: Thu Apr 20 04:43:47 2006
> > Current DC: sapcl01 (85180fd0-70c9-4136-a5e0-90d89ea6079d)
> > 3 Nodes configured.
> > 3 Resources configured.
> > 
> >
> > Node: sapcl03 (0bfb78a2-fcd2-4f52-8a06-2d17437a6750): online
> > Node: sapcl02 (09fa194c-d7e1-41fa-a0d0-afd79a139181): online
> > Node: sapcl01 (85180fd0-70c9-4136-a5e0-90d89ea6079d): online
> >
> > Resource Group: group_1
> > IPaddr_1(heartbeat::ocf:IPaddr):Started sapcl03
> > LVM_2   (heartbeat::ocf:LVM):   Stopped
> > Filesystem_3(heartbeat::ocf:Filesystem):Stopped
> > Resource Group: group_2
> > IPaddr_2(heartbeat::ocf:IPaddr):Started sapcl02
> > LVM_3   (heartbeat::ocf:LVM):   Started sapcl02
> > Filesystem_4(heartbeat::ocf:Filesystem):Started sapcl02
> > Resource Group: group_3
> > IPaddr_3(heartbeat::ocf:IPaddr):Started sapcl03
> > LVM_4   (heartbeat::ocf:LVM):   Started sapcl03
> > Filesystem_5(heartbeat::ocf:Filesystem):Started sapcl03
> >
> > And:
> >
> > sapcl01# crmadmin -S sapcl01
> > Status of [EMAIL PROTECTED]: S_TERMINATE (ok)
> >
> > All processes are still running on this node, but heartbeat seems
> > to be in some kind of limbo.
> >
> > Cheers,
> >
> > Dejan
> >
> >


Re: [Linux-ha-dev] core dump (abort) in crmd: "untracked process" (HEAD)

2006-04-20 Thread Andrew Beekhof
On 4/20/06, Dejan Muhamedagic <[EMAIL PROTECTED]> wrote:
> Hello,
>
> Running CTS with HEAD hung the cluster after crmd dumped core
> (abort).  It happened after 53 tests with this curious message:
>
> Apr 19 17:48:01 BadNews: Apr 19 17:42:48 sapcl01 crmd: [17937]: ERROR: 
> mask(lrm.c:build_operation_update): Triggered non-fatal assert at lrm.c:349: 
> fsa_our_dc_version != NULL

We have two kinds of asserts... neither are supposed to happen and
both create a core file so that we can diagnose how we got there.
However non-fatal ones call fork first (so the main process doesn't
die) and then take some recovery action.

Sometimes the non-fatal varieties are used in new pieces of code to
make sure they work as we expect and that is what has happened here.
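
Roughly, the non-fatal path does something like this (a simplified sketch
only, modelled on the crm_abort() call visible in your backtrace; the real
code in utils.c differs in detail):

    /* Sketch: fork so the child can abort() and dump core while the
     * parent (the real crmd process) logs the error and keeps going. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/types.h>

    void crm_abort_sketch(const char *file, const char *function, int line,
                          const char *assert_condition, int do_fork)
    {
        fprintf(stderr, "%s: triggered assert at %s:%d : %s\n",
                function, file, line, assert_condition);
        if (do_fork) {
            pid_t pid = fork();
            if (pid != 0) {
                /* parent (or fork failure): keep the main process alive
                 * and let it take some recovery action */
                return;
            }
            /* child: fall through so the core is dumped by an otherwise
             * untracked process (hence the "untracked process" message) */
        }
        /* fatal variety, or the forked child: abort() writes the core */
        abort();
    }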

Do you still have the core file?
I'd be interested to know the result of:
   print *op
from frame #4
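
i.e. something along these lines, once gdb is pointed at the crmd binary
and the core file:

    (gdb) frame 4
    (gdb) print *op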

In the meantime, I'll look at the logs and see what I can figure out.

> Apr 19 17:48:01 BadNews: Apr 19 17:42:48 sapcl01 crmd: [17937]: ERROR: 
> Exiting untracked process process 19654 dumped core
> Apr 19 17:48:01 BadNews: Apr 19 17:45:49 sapcl01 crmd: [17937]: ERROR: 
> mask(utils.c:crm_timer_popped): Finalization Timer (I_ELECTION) just popped!
>
> The cluster looks like this, unchanged for several hours:
>
> 
> Last updated: Thu Apr 20 04:43:47 2006
> Current DC: sapcl01 (85180fd0-70c9-4136-a5e0-90d89ea6079d)
> 3 Nodes configured.
> 3 Resources configured.
> 
>
> Node: sapcl03 (0bfb78a2-fcd2-4f52-8a06-2d17437a6750): online
> Node: sapcl02 (09fa194c-d7e1-41fa-a0d0-afd79a139181): online
> Node: sapcl01 (85180fd0-70c9-4136-a5e0-90d89ea6079d): online
>
> Resource Group: group_1
> IPaddr_1(heartbeat::ocf:IPaddr):Started sapcl03
> LVM_2   (heartbeat::ocf:LVM):   Stopped
> Filesystem_3(heartbeat::ocf:Filesystem):Stopped
> Resource Group: group_2
> IPaddr_2(heartbeat::ocf:IPaddr):Started sapcl02
> LVM_3   (heartbeat::ocf:LVM):   Started sapcl02
> Filesystem_4(heartbeat::ocf:Filesystem):Started sapcl02
> Resource Group: group_3
> IPaddr_3(heartbeat::ocf:IPaddr):Started sapcl03
> LVM_4   (heartbeat::ocf:LVM):   Started sapcl03
> Filesystem_5(heartbeat::ocf:Filesystem):Started sapcl03
>
> And:
>
> sapcl01# crmadmin -S sapcl01
> Status of [EMAIL PROTECTED]: S_TERMINATE (ok)
>
> All processes are still running on this node, but heartbeat seems
> to be in some kind of limbo.
>
> Cheers,
>
> Dejan
>
>


[Linux-ha-dev] core dump (abort) in crmd: "untracked process" (HEAD)

2006-04-19 Thread Dejan Muhamedagic
Hello,

Running CTS with HEAD hung the cluster after crmd dumped core
(abort).  It happened after 53 tests with this curious message:

Apr 19 17:48:01 BadNews: Apr 19 17:42:48 sapcl01 crmd: [17937]: ERROR: 
mask(lrm.c:build_operation_update): Triggered non-fatal assert at lrm.c:349: 
fsa_our_dc_version != NULL
Apr 19 17:48:01 BadNews: Apr 19 17:42:48 sapcl01 crmd: [17937]: ERROR: Exiting 
untracked process process 19654 dumped core
Apr 19 17:48:01 BadNews: Apr 19 17:45:49 sapcl01 crmd: [17937]: ERROR: 
mask(utils.c:crm_timer_popped): Finalization Timer (I_ELECTION) just popped!

The cluster looks like this, unchanged for several hours:


Last updated: Thu Apr 20 04:43:47 2006
Current DC: sapcl01 (85180fd0-70c9-4136-a5e0-90d89ea6079d)
3 Nodes configured.
3 Resources configured.


Node: sapcl03 (0bfb78a2-fcd2-4f52-8a06-2d17437a6750): online
Node: sapcl02 (09fa194c-d7e1-41fa-a0d0-afd79a139181): online
Node: sapcl01 (85180fd0-70c9-4136-a5e0-90d89ea6079d): online

Resource Group: group_1
IPaddr_1(heartbeat::ocf:IPaddr):Started sapcl03
LVM_2   (heartbeat::ocf:LVM):   Stopped 
Filesystem_3(heartbeat::ocf:Filesystem):Stopped 
Resource Group: group_2
IPaddr_2(heartbeat::ocf:IPaddr):Started sapcl02
LVM_3   (heartbeat::ocf:LVM):   Started sapcl02
Filesystem_4(heartbeat::ocf:Filesystem):Started sapcl02
Resource Group: group_3
IPaddr_3(heartbeat::ocf:IPaddr):Started sapcl03
LVM_4   (heartbeat::ocf:LVM):   Started sapcl03
Filesystem_5(heartbeat::ocf:Filesystem):Started sapcl03

And:

sapcl01# crmadmin -S sapcl01
Status of [EMAIL PROTECTED]: S_TERMINATE (ok)

All processes are still running on this node, but heartbeat seems
to be in some kind of limbo.

Cheers,

Dejan
Using host libthread_db library "/lib/tls/libthread_db.so.1".
Core was generated by `/usr/lib/heartbeat/crmd'.
Program terminated with signal 6, Aborted.
#0  0xe410 in __kernel_vsyscall ()
#0  0xe410 in __kernel_vsyscall ()
#1  0x40284581 in raise () from /lib/tls/libc.so.6
#2  0x40285e65 in abort () from /lib/tls/libc.so.6
#3  0x40059488 in crm_abort (file=0x806859d "lrm.c", 
function=0x80687c6 "build_operation_update", line=349, 
assert_condition=0x806881d "fsa_our_dc_version != NULL", do_fork=1)
at utils.c:1201
#4  0x0805add0 in build_operation_update (xml_rsc=0x8109430, op=0x8282b98, 
src=0x80692d9 "do_update_resource", lpc=0) at lrm.c:347
#5  0x0805db31 in do_update_resource (op=0x8282b98) at lrm.c:1383
#6  0x0805e0f7 in do_lrm_event (action=576460752303423488, 
cause=C_LRM_OP_CALLBACK, cur_state=S_INTEGRATION, cur_input=I_LRM_EVENT, 
msg_data=0x8234d68) at lrm.c:1514
#7  0x0804b572 in do_fsa_action (fsa_data=0x8234d68, 
an_action=576460752303423488, function=0x805dc31 )
at fsa.c:178
#8  0x0804c805 in s_crmd_fsa_actions (fsa_data=0x8234d68) at fsa.c:512
#9  0x0804bb36 in s_crmd_fsa (cause=C_FSA_INTERNAL) at fsa.c:315
#10 0x08055264 in crm_fsa_trigger (user_data=0x0) at callbacks.c:647
#11 0x4002987c in G_TRIG_dispatch (source=0x8072de8, callback=0, user_data=0x0)
at GSource.c:1417
#12 0x400b29ca in g_main_context_dispatch ()
   from /opt/gnome/lib/libglib-2.0.so.0
#13 0x400b4adb in g_main_context_iterate ()
   from /opt/gnome/lib/libglib-2.0.so.0
#14 0x400b4d07 in g_main_loop_run () from /opt/gnome/lib/libglib-2.0.so.0
#15 0x0804af9b in init_start () at main.c:137
#16 0x0804aec6 in main (argc=1, argv=0xb9f4) at main.c:104


cib.xml.gz
Description: Binary data


log.gz
Description: Binary data
___
Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/