Re: [Openais] Heartbeat to Openais conversion. cib.xml verification errors

2013-02-05 Thread Lars Marowsky-Bree
On 2013-01-02T16:43:19, Nick Hoare wrote: Hi Nick, I'll be honest - we provided the conversion script with SLE HA 11 first customer shipment and back then did extensive QA on it, but it is conceivable (and, since you have problems, very likely) that it got broken. Please open a service request

[Openais] CFP: HA Mini-Conference in Prague on Oct 25th

2011-08-14 Thread Lars Marowsky-Bree
Hi all, thanks to the Linux Foundation, we have a chance to organize a High-Availability & Clustering mini-conference on Oct 25th in Prague, in the same venue as the LinuxCon and Kernel Summit. You can find out more about the exciting venue at: http://events.linuxfoundation.org/events/linuxcon-eu

Re: [Openais] Multycast & unicast as fall back

2011-07-27 Thread Lars Marowsky-Bree
On 2011-07-21T07:59:23, Steven Dake wrote: > There is no fallback. You can specify one transport or the other. > Thinking a moment how to implement this type of feature, it could not be > reasonably implemented. But the transport is per ring, is it not? Regards, Lars -- Architect Storage

[Openais] [PATCH] corosync IPC layer check for disconnected peer

2011-06-30 Thread Lars Marowsky-Bree
We experienced a situation where corosync was killed, and then all processes connected to it started to spin on their IPC. I traced this down to them not noticing that the peer is dead. However, I noticed that in my environment, one process _did_ properly disconnect (dlm_controld). The difference

Re: [Openais] [PATCH] Implementation of automatic redundant ring recovery

2011-05-20 Thread Lars Marowsky-Bree
On 2011-05-20T07:21:16, Steven Dake wrote: > Lars, > > Generally this is how network protocols are done, but for historic > reasons, we decided to do conversion on receipt of message rather than > origination (performance was better in a cross-endian system). openais > came from an embedded wor

Re: [Openais] [PATCH] Implementation of automatic redundant ring recovery

2011-05-20 Thread Lars Marowsky-Bree
On 2011-05-20T14:35:28, Jiaju Zhang wrote: Hi Jiaju, thanks for the good work! I can't comment on totem much, but some general code only: > +#define ENDIAN_LOCAL 0xff22 I am not sure about this one. Would it not make more sense to always convert to a known byte

[Openais] Announcement: Linux Foundation HA working group mailing lists

2011-03-03 Thread Lars Marowsky-Bree
Hi everyone, please excuse the long Cc list. Behind the scenes, some of the projects that make up the cluster stack on Linux have been working together to converge and integrate the various projects. We have been meeting on and off for the last decade, and made some amazing progress over the year

Re: [Openais] Announcing Corosync 1.3.0

2011-01-13 Thread Lars Marowsky-Bree
On 2010-12-01T14:18:25, Steven Dake wrote: > Corosync 1.3.0 is available for immediate download from our website. > This version brings many enhancements to the software. The two most > visible enhancements are UDPU transport mode and the > cpg_model_initialize api call. The UDPU transport mode

Re: [Openais] [Linux-ha-dev] CFP: Linux Plumbers Mini-Conf on High-Availability/Clustering

2010-08-10 Thread Lars Marowsky-Bree
On 2010-08-04T15:59:27, Lars Marowsky-Bree wrote: > Hi all, > > there will (hopefully!) be a mini-conference on HA/Clustering at this > year's LPC in Cambridge, MA, Nov 3-5th. Just a quick reminder, there've not been many proposals submitted yet. If the trend continu

[Openais] CFP: Linux Plumbers Mini-Conf on High-Availability/Clustering

2010-08-04 Thread Lars Marowsky-Bree
Hi all, there will (hopefully!) be a mini-conference on HA/Clustering at this year's LPC in Cambridge, MA, Nov 3-5th. This would be an informal summit for the HA folks to get together and discuss the various issues that would benefit from a face to face meeting; to facilitate progress faster than

Re: [Openais] whitetank cluster not reforming after 'if down'

2009-07-21 Thread Lars Marowsky-Bree
On 2009-06-30T12:27:33, Andrew Beekhof wrote: > I'm working with a cluster that's having trouble reforming. > Before I explain, here is the totem section (which is the same on both > nodes, except for the nodeid). Hi all, Steven, this problem persists. After a reboot, we sometimes see membersh

Re: [Openais] cpg join/leave lists

2009-07-20 Thread Lars Marowsky-Bree
On 2009-07-13T09:33:50, Steven Dake wrote: > Hmm I don't recall saying they are unreliable or cannot be believed. > The problem with the join and left list is they don't follow the > semantics of virtual synchrony, meaning there is no consistent view of > these join/left lists on each node. The

Re: [Openais] whitetank - regression introduced in r1959

2009-07-17 Thread Lars Marowsky-Bree
On 2009-07-16T09:41:01, Andrew Beekhof wrote: > looks good to me Confirmed, it fixes the regression and the cluster shuts down properly again. Regards, Lars -- Architect Storage/HA, OPS Engineering, Novell, Inc. SUSE LINUX Products GmbH, GF: Markus Rex, HRB 16746 (AG Nürnberg) "Experienc

Re: [Openais] whitetank - regression introduced in r1959

2009-07-15 Thread Lars Marowsky-Bree
On 2009-07-15T11:17:20, Steven Dake wrote: > patch attached to fix > > thanks for catching The // is supposed to be gone, right? > @@ -476,6 +476,8 @@ > > objdb->objdb_init (); > > +// openais_shutdown_objdb_register (objdb); > + > /* >* Bootstrap in the default confi

[Openais] whitetank: exec/main.c compilation error

2009-07-08 Thread Lars Marowsky-Bree
Hi, trivial patch - avoid an implicit declaration. Regards, Lars Index: exec/main.c === --- exec/main.c (revision 2034) +++ exec/main.c (working copy) @@ -76,6 +76,7 @@ #include "print.h" #include "util.h" #include "version.h

Re: [Openais] [Pacemaker] Pacemaker on OpenAIS, RRP, and link failure

2009-06-04 Thread Lars Marowsky-Bree
On 2009-06-04T11:38:08, Robert Wipfel wrote: > > I think that was actually discussed on the openais list and on IRC in > > the past and never completely explained why it wouldn't work ;-) > Link status can also be written to the other communication > medium: the shared disk (assuming different l

Re: [Openais] [Pacemaker] Pacemaker on OpenAIS, RRP, and link failure

2009-06-04 Thread Lars Marowsky-Bree
On 2009-06-04T19:07:41, Juha Heinanen wrote: > without that kind self healing there is no way that openais could > ever replace current heartbeat2+pacemaker setup. This "self-healing" works just fine with bonded NICs. Regards, Lars -- SuSE Labs, OPS Engineering, Novell, Inc. SUSE LINUX

Re: [Openais] [Pacemaker] Pacemaker on OpenAIS, RRP, and link failure

2009-06-04 Thread Lars Marowsky-Bree
On 2009-06-04T09:23:04, Steven Dake wrote: > The problem with checking the link status with the current code is that > the protocol blocks I/O waiting for a response from the failed ring. > This could of course be modified to behave differently. Right, so the rechecking could possibly be a separ

Re: [Openais] [Pacemaker] Pacemaker on OpenAIS, RRP, and link failure

2009-06-04 Thread Lars Marowsky-Bree
On 2009-05-26T12:50:34, Andrew Beekhof wrote: > >> try all the time also after failure like was done before failure. > > > > Complete Totem amateur behind the keyboard, but I'd second that. Since > > you're constantly checking the link status while it's up, why not keep > > doing so after it's go

Re: [Openais] [Ocfs2-devel] Linux Plumbers Conference mini-conf on clustering?

2009-04-14 Thread Lars Marowsky-Bree
On 2009-04-13T16:31:35, Joel Becker wrote: > I think it's a great idea. Ok, all respondents were in favor, so I will submit a miniconf proposal. > We actually have progress, with a lot > of work that we talked about in Prague starting to see the light of day > in STABLE3, so it would be go

[Openais] Linux Plumbers Conference mini-conf on clustering?

2009-04-13 Thread Lars Marowsky-Bree
Hi all, what do you think of a half-day / day long miniconference on clustering along LPC? (http://lwn.net/Articles/319215/) It's a bit late, but if no one objects, I'll submit a proposal tomorrow. What topics would you like to see discussed? We probably don't have enough mass to justify our com

Re: [Openais] service architecture question

2009-04-02 Thread Lars Marowsky-Bree
On 2009-04-02T10:18:33, Dietmar Maurer wrote: > I am playing around with corosync/openais and clvmd-openais. So far it > works. But when I stop the corosync process (or if it gets stopped by a > SIGSEGV), clvmd-openais is completely unusable. > > Like any openais client, it clvmd simple connect

Re: [Openais] openais restart issue

2009-04-01 Thread Lars Marowsky-Bree
On 2009-03-19T15:27:30, Dominik Klein wrote: > Hi > > I am using the latest whitetank code (1754 2009-03-13 20:47:05 +0100) > with pacemaker on a pair of x86_64 opensuse 11.1 boxes and I am seeing > an openais restart issue. > > When I use > /etc/init.d/openais restart > to restart the cluster,

Re: [Openais] [whitetank patch] prevent multiple threads from expiry callback in service engine

2009-03-31 Thread Lars Marowsky-Bree
On 2009-03-22T09:29:22, Lars Marowsky-Bree wrote: So that they don't drop off the radar: Just wanted to point out that the CPG crashes are still around, mostly the pi corruption manifesting itself. This is with r1761 whitetank. > Thread 1 (Thread 6988): > #0 0x7fc07fdfc667 in d

Re: [Openais] [whitetank patch] prevent multiple threads from expiry callback in service engine

2009-03-22 Thread Lars Marowsky-Bree
On 2009-03-22T00:49:08, Lars Marowsky-Bree wrote: > I'll let the test case run over night (until it crashes the test master > ;-), that might throw up a couple of coredumps by morning. 1. crash in openais response send: (gdb) bt #0 0x7fa9f048ae11 in memcpy () from /lib64/l

Re: [Openais] [whitetank patch] prevent multiple threads from expiry callback in service engine

2009-03-21 Thread Lars Marowsky-Bree
On 2009-03-21T16:42:55, Steven Dake wrote: > So what I'd like to know for sure is if the expiry_callback backtrace > can be reproduced with this patch. > > Then we can direct our energies towards coming up with a test case for > the remaining cpg issue. OK, thanks for the help. I'll let the te

Re: [Openais] [whitetank patch] prevent multiple threads from expiry callback in service engine

2009-03-21 Thread Lars Marowsky-Bree
On 2009-03-21T12:45:57, Steven Dake wrote: > see subject My current trace with this is back to square 1: (gdb) bt #0 0x7f8aba35b13c in notify_lib_joinlist (gi=0x763e10, conn=0x0, joined_list_entries=1, joined_list=0x7fffcb5ec490, left_list_entries=0, left_list=0x0, id=4) at cpg.c

Re: [Openais] [PATCH] fix cpg crash

2009-03-21 Thread Lars Marowsky-Bree
On 2009-03-21T04:35:35, Steven Dake wrote: > I merged this into whitetank and corosync. Hi Steve, Chrissie, thanks for investigating and fixing this particular crash. Alas, my test case now crashes elsewhere instead :-( From whitetank: Core was generated by `aisexec'. Program terminated with

Re: [Openais] [CRASH] corosync crash under load

2009-03-18 Thread Lars Marowsky-Bree
On 2009-03-18T14:46:42, Steven Dake wrote: > I have a patch which I believe fixes this problem in corosync. > > Please try the latest build after fabio builds it. Is that the r1866 you just posted? If so, that patch is already in our whitetank tree and does not prevent the crash, unfortunately

Re: [Openais] [CRASH] corosync crash under load

2009-03-18 Thread Lars Marowsky-Bree
On 2009-03-17T15:39:55, David Teigland wrote: > Here's another similar one while mounting/unmounting: > > Program terminated with signal 11, Segmentation fault. > #0 0x7fa0ccf34159 in do_proc_join (name=0x7fffd78743a0, pid=11227, > nodeid=2, reason=1) > at cpg.c:740 > 740

[Openais] Timing settings

2009-03-18 Thread Lars Marowsky-Bree
Hi all, I'm curious what a set of conservative settings for the timeouts would be. Bear with me, I am coming from heartbeat, which had exactly 3 tunables - keepalive and deadtime (and an initial deadtime for settling), so the various options confuse/scare me a bit ;-) With fast nodes, it seems th

Re: [Openais] [corosync trunk] use list_del properly in lib_exit_fn of cpg service

2009-03-06 Thread Lars Marowsky-Bree
On 2009-03-05T19:39:44, Steven Dake wrote: > Currently if the ipc connection is unjoined and then terminated causing > lib_exit_fn to be called, no list_del will be done on the process info > structure. This results in later processing by aisexec of the bad > process info data. (fixes segfault/

Re: [Openais] openais with >1 ring?

2009-02-24 Thread Lars Marowsky-Bree
On 2009-02-24T16:38:47, Dejan Muhamedagic wrote: > It is also tricky because as it seems to work right now, if a > node hears nothing on a particular interface it is deemed > unusable. So, the only way to recover a ring would be for at > least two nodes to start the recovery at about the same tim

Re: [Openais] openais with >1 ring?

2009-02-24 Thread Lars Marowsky-Bree
On 2009-02-23T18:45:41, Steven Dake wrote: > I can't really tell you what state it is in, other than it appears to be > broken for you :( Chrissie may be able to provide you some idea of its > state. She tests it occasionally with the full stack. > > Unfortunately with the fc11 deadlines I am

[Openais] openais with >1 ring?

2009-02-23 Thread Lars Marowsky-Bree
Hi, I configured a second ring over a different NIC and used rrp_mode active. (Again, all on whitetank.) After a reboot of one of the nodes, the rings didn't seem to properly reform, and aisexec got stuck somehow and refused to shutdown. I didn't investigate this much further then, because I was

[Openais] whitetank: question regarding saCkptCheckpointOpen()

2009-02-21 Thread Lars Marowsky-Bree
Hi all, I have a question regarding this call; possibly it applies to other CKPT functions too, but this is the one currently giving me worries. ocfs2_controld uses this service, and they get spawned by the cluster manager at essentially the same time everywhere. (At a time where all nodes are up

Re: [Openais] [PATCH] Allow cpg messages to be sent when the cluster is inquorate

2009-02-19 Thread Lars Marowsky-Bree
On 2009-02-19T15:33:04, Chrissie Caulfield wrote: > This seems to be biting a lot of people, so I propose that cpg is > allowed to send messages on an inquorate cluster I think this makes a lot of sense (and I know that, if pacemaker wants to be able to use cpg, we need this - otherwise we canno

Re: [Openais] patch for buffer overflow in clm.c:my_cluster_node_load

2009-02-18 Thread Lars Marowsky-Bree
On 2009-02-17T22:47:28, Steven Dake wrote: > IMO the bugzilla should never result in a buffer overflow and points at > a problem in totempg_ifaces_get. I put some data in the bugzilla which > I'd like collected if possible. > > Maybe it can help us get to the root cause of the problem instead o

Re: [Openais] [patch whitetank/trunk] expiry_checkpoints segfault

2009-02-06 Thread Lars Marowsky-Bree
On 2009-02-01T08:09:22, Steven Dake wrote: > The expiry list pointers are not list_init'ed after a synchronization of > checkpoints. I believe this is causing segfaults in some circumstances. > > Andrew can you verify the patch fixes the problem you reported? It turns out that it does not; is

Re: [Openais] [whitetank] ipc-20

2009-02-02 Thread Lars Marowsky-Bree
On 2009-02-01T20:31:57, Steven Dake wrote: > This is the latest iteration of this patch. > > Some duplicate data structures were moved from lib/util.c and exec/ipc.c > to include/ipc_gen.h. > > A feature was added to allow applications to use setuid/seteuid syscalls > without resulting in disco

Re: [Openais] Need your help.

2009-01-28 Thread Lars Marowsky-Bree
On 2009-01-28T15:02:30, Priyanka Ranjan wrote: > Hi all, > i am new to openais and pacemaker. i am working on sles11 . i have three > questions. If you are working on SLE11, please use the pacemaker list only. These questions have nothing to do with linux-ha nor openais. I have set the reply-to

Re: [Openais] [trunk patch] fix message rejection problem

2009-01-28 Thread Lars Marowsky-Bree
On 2009-01-25T14:14:12, Steven Dake wrote: > Same as before, but the trunk apparently doesn't have the > clear_high_node_bit flag feature. > > Andrew can you work up a patch for that feature for trunk? I think before we rework this, we may want to readdress the nodeid issue more completely. Ri

Re: [Openais] openais Whitetank Plans

2009-01-26 Thread Lars Marowsky-Bree
On 2009-01-26T16:03:22, Andrew Beekhof wrote: > > Andrew, is pacemaker/libdlm doing anything real weird with the IPC > > system? > not any more. there was a time when I didn't quite understand how to > use it properly, but i'm confident that everything is now used as it > was intended. A furthe

Re: [Openais] openais Whitetank Plans

2009-01-26 Thread Lars Marowsky-Bree
On 2009-01-26T07:39:43, Steven Dake wrote: > But I understand your desire to avoid risk. Unfortunately there is no > surgical way to fix the problems with the current ipc system without > many of these approaches used in this patch. I see your point, but are those issues present in corosync too

Re: [Openais] openais Whitetank Plans

2009-01-26 Thread Lars Marowsky-Bree
On 2009-01-23T08:17:53, Steven Dake wrote: > If you mean the apis for responding to requests or library services api, > then unfortunately, yes it is required to change these apis. No user > APIs should change, however. On the plus side these will be going into > corosync and shouldn't change i

Re: [Openais] openais Whitetank Plans

2009-01-23 Thread Lars Marowsky-Bree
On 2009-01-23T07:16:38, Steven Dake wrote: > You could try this patch but it is not yet complete. That also seems to change the externally visible API; that for sure isn't intended for a stable branch, is it? Mit freundlichen Grüßen, Lars -- Teamlead Kernel, SuSE Labs, Research and Devel

Re: [Openais] openais Whitetank Plans

2009-01-23 Thread Lars Marowsky-Bree
On 2009-01-20T06:35:03, Steven Dake wrote: > Unfortunately, even the planned 0.80.4 has some issues with the IPC > system which result in double frees, and various other badness. We're hitting exactly this on openAIS/Pacemaker - is there a workaround for whitetank? Core was generated by `aisex

Re: [Openais] secauth renders openais lame

2009-01-21 Thread Lars Marowsky-Bree
On 2009-01-21T22:47:03, Dejan Muhamedagic wrote: > > Without this patch, it would not set any ENDIAN_64BITWORD or _32BITWORD > > define on Linux, ever, and then default to using the 64bit operations > > down below. One wonders if this (correct) patch then implicitly breaks > > other archs. > Test

Re: [Openais] secauth renders openais lame

2009-01-21 Thread Lars Marowsky-Bree
On 2009-01-21T15:42:57, Dejan Muhamedagic wrote: > #if defined(OPENAIS_LINUX) > -#if __WORDIZE == 64 > +#if __WORDSIZE == 64 > #define ENDIAN_64BITWORD > #endif > -#if __WORDIZE == 32 > +#if __WORDSIZE == 32 > #define ENDIAN_32BITWORD > #endif > #else Without this patch, it would not set a
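
The typo fixed in the patch above fails silently because the C preprocessor treats any undefined identifier in an `#if` expression as 0. A minimal sketch of that effect follows. `__WORDIZE` is the misspelling from the quoted patch; `TYPO_BRANCH_TAKEN` and `demo_word_bits` are hypothetical names used only for illustration, and the `UINTPTR_MAX` check is one portable alternative, not what openais does.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/* __WORDIZE (the misspelling) is defined nowhere, so the preprocessor
 * replaces it with 0 and the comparison 0 == 64 is false. There is no
 * error or warning unless the compiler is asked for one (e.g. -Wundef). */
#if __WORDIZE == 64
#define TYPO_BRANCH_TAKEN 1
#else
#define TYPO_BRANCH_TAKEN 0
#endif

/* Compile-time proof that the misspelled test never selects its branch. */
static_assert(TYPO_BRANCH_TAKEN == 0, "misspelled macro evaluates to 0");

/* Hypothetical helper: derive the pointer width from standard macros
 * instead of a platform-specific one. Returns 0 if it is neither 32 nor 64. */
static int demo_word_bits(void)
{
#if UINTPTR_MAX == 0xffffffffffffffffULL
    return 64;
#elif UINTPTR_MAX == 0xffffffffUL
    return 32;
#else
    return 0;
#endif
}
```

Building with `-Wundef` on GCC or Clang makes the compiler report the undefined identifier, which would have flagged the original typo.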

Re: [Openais] Coverity scan

2008-12-03 Thread Lars Marowsky-Bree
On 2008-12-02T22:06:34, Andrew Beekhof wrote: > Definitely worth the effort to try. > Despite my ranting, I'm still trying to get it to scan Pacemaker (and > rescan Heartbeat) too. I wonder what it'd cost to just buy that service subscription. ;-) Regards, Lars -- Teaml

Re: [Openais] [RFC] simple blackbox

2008-10-13 Thread Lars Marowsky-Bree
On 2008-10-09T08:51:58, Steven Dake wrote: Hi Steven, > > The goal was to have a blackbox which we cannot just retroactively dump, > > but also easily recover from a core or kernel crashdump. > > the array is global and can easily be obtained from a core file with a > simple

Re: [Openais] [RFC] simple blackbox

2008-10-08 Thread Lars Marowsky-Bree
On 2008-10-08T21:08:16, Steven Dake wrote: > Attached is my version which is as of yet incomplete. The general > concept is to allow very high performance event tracing with minimal > formatting overhead (formatting is done in a separate program after a > crash or to debug cur

Re: [Openais] Split brain when using EVS library

2008-09-13 Thread Lars Marowsky-Bree
On 2008-09-13T20:58:26, Ruppert Koch wrote: > Fault detection as well as membership are managed by the Totem protocol. Right. > I assume the following happens: > > A node P experiences a token timeout. Ah, and this is the point which I'm curious about - why does this occur

Re: [Openais] Split brain when using EVS library

2008-09-13 Thread Lars Marowsky-Bree
On 2008-09-09T11:18:59, David Teigland wrote: > > For some reason our cluster splits up into two rings. > > Scenario is: > > node1(n1) n2 n3 n4 n5 n6 are in the ring. > > > > Suddenly the ring splits into two rings: > > n1 n2 n3 got leave msg from n4 n5 n6 > > n4 n5 n6 got lea

[Openais] Re: [Linux-ha-dev] ANNOUNCE: Heartbeat Cluster Resource Manager Ported to OpenAIS

2007-12-06 Thread Lars Marowsky-Bree
On 2007-12-05T21:06:38, Andrew Beekhof wrote: > Over the last few months, Red Hat and SUSE engineers have been working > together to port Heartbeat's powerful Cluster Resource Manager (CRM) to run > natively on top of OpenAIS. Credit where credit is due: this means you, Andr