On 2013-01-02T16:43:19, Nick Hoare wrote:
Hi Nick,
I'll be honest - we provided the conversion script with SLE HA 11 first
customer shipment and back then did extensive QA on it, but it is
conceivable (and, since you have problems, very likely) that it got
broken.
Please open a service request
Hi all,
thanks to the Linux Foundation, we have a chance to organize a
High-Availability & Clustering mini-conference on Oct 25th in Prague, in
the same venue as the LinuxCon and Kernel Summit.
You can find out more about the exciting venue at:
http://events.linuxfoundation.org/events/linuxcon-eu
On 2011-07-21T07:59:23, Steven Dake wrote:
> There is no fallback. You can specify one transport or the other.
> Thinking a moment how to implement this type of feature, it could not be
> reasonably implemented.
But the transport is per ring, is it not?
Regards,
Lars
--
Architect Storage
We experienced a situation where corosync was killed, and then all
processes connected to it started to spin on their IPC. I traced this
down to them not noticing that the peer is dead.
However, I noticed that in my environment, one process _did_ properly
disconnect (dlm_controld). The difference
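
For readers hitting the same symptom: a minimal sketch, not taken from corosync or dlm_controld, of how a client on a local stream socket can notice that its peer has died instead of spinning. The function name peer_is_gone is hypothetical, and MSG_DONTWAIT is assumed to be available (it is on Linux).

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <sys/socket.h>
#include <sys/types.h>

/*
 * Hypothetical helper, for illustration only.
 * Returns 1 if the peer has closed the connection, 0 if it still
 * looks alive, and -1 if the descriptor itself is unusable.
 */
int peer_is_gone(int fd)
{
    struct pollfd pfd;
    char byte;
    ssize_t n;

    assert(fd >= 0);

    pfd.fd = fd;
    pfd.events = POLLIN;
    pfd.revents = 0;

    /* Zero timeout: only look at the current state, never block. */
    if (poll(&pfd, 1, 0) < 0)
        return -1;
    if (pfd.revents & POLLNVAL)
        return -1;
    if (pfd.revents & POLLERR)
        return 1;

    if (pfd.revents & (POLLIN | POLLHUP)) {
        /* Peek so that pending data is not consumed by the check. */
        n = recv(fd, &byte, 1, MSG_PEEK | MSG_DONTWAIT);
        if (n == 0)
            return 1;   /* end of file: the peer closed its end */
        if (n < 0 && errno != EAGAIN && errno != EWOULDBLOCK && errno != EINTR)
            return 1;   /* hard error such as ECONNRESET */
    }
    return 0;
}
```

A caller would typically run such a check whenever poll() reports activity on the IPC descriptor, and tear the connection down rather than retry once it returns 1.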
On 2011-05-20T07:21:16, Steven Dake wrote:
> Lars,
>
> Generally this is how network protocols are done, but for historic
> reasons, we decided to do conversion on receipt of message rather than
> origination (performance was better in a cross-endian system). openais
> came from an embedded wor
On 2011-05-20T14:35:28, Jiaju Zhang wrote:
Hi Jiaju,
thanks for the good work!
I can't comment on totem much, but some general code only:
> +#define ENDIAN_LOCAL 0xff22
I am not sure about this one. Would it not make more sense to always
convert to a known byte
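
A common way to avoid a per-message endian marker is to fix one on-wire byte order and convert at both ends. The following is only a sketch of that general idea, assuming network (big-endian) order. The struct wire_header layout and the header_encode/header_decode names are hypothetical and are not totem's actual message format.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>   /* htonl, ntohl */

/* Hypothetical header, for illustration only. */
struct wire_header {
    uint32_t magic;
    uint32_t length;
};

/* Serialise the header into buf (at least 8 bytes) in network byte order. */
void header_encode(const struct wire_header *h, unsigned char *buf)
{
    uint32_t magic;
    uint32_t length;

    assert(h != NULL && buf != NULL);

    magic = htonl(h->magic);
    length = htonl(h->length);
    memcpy(buf, &magic, sizeof magic);
    memcpy(buf + sizeof magic, &length, sizeof length);
}

/* Parse a header from buf, converting back to host byte order. */
void header_decode(const unsigned char *buf, struct wire_header *h)
{
    uint32_t magic;
    uint32_t length;

    assert(h != NULL && buf != NULL);

    memcpy(&magic, buf, sizeof magic);
    memcpy(&length, buf + sizeof magic, sizeof length);
    h->magic = ntohl(magic);
    h->length = ntohl(length);
}
```

The trade-off is the one mentioned in the quoted mail above: converting at origination costs a swap on every little-endian sender even when all nodes share the same byte order, whereas converting on receipt only costs something in a mixed-endian cluster.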
Hi everyone,
please excuse the long Cc list.
Behind the scenes, some of the projects that make up the cluster stack
on Linux have been working together to converge and integrate the
various projects. We have been meeting on and off for the last decade,
and made some amazing progress over the year
On 2010-12-01T14:18:25, Steven Dake wrote:
> Corosync 1.3.0 is available for immediate download from our website.
> This version brings many enhancements to the software. The two most
> visible enhancements are UDPU transport mode and the
> cpg_model_initialize api call. The UDPU transport mode
On 2010-08-04T15:59:27, Lars Marowsky-Bree wrote:
> Hi all,
>
> there will (hopefully!) be a mini-conference on HA/Clustering at this
> year's LPC in Cambridge, MA, Nov 3-5th.
Just a quick reminder, there've not been many proposals submitted yet.
If the trend continu
Hi all,
there will (hopefully!) be a mini-conference on HA/Clustering at this
year's LPC in Cambridge, MA, Nov 3-5th.
This would be an informal summit for the HA folks to get together and
discuss the various issues that would benefit from a face to face
meeting; to facilitate progress faster than
On 2009-06-30T12:27:33, Andrew Beekhof wrote:
> I'm working with a cluster that's having trouble reforming.
> Before I explain, here is the totem section (which is the same on both
> nodes, except for the nodeid).
Hi all, Steven,
this problem persists. After a reboot, we sometimes see membersh
On 2009-07-13T09:33:50, Steven Dake wrote:
> Hmm I don't recall saying they are unreliable or cannot be believed.
> The problem with the join and left list is they don't follow the
> semantics of virtual synchrony, meaning there is no consistent view of
> these join/left lists on each node. The
On 2009-07-16T09:41:01, Andrew Beekhof wrote:
> looks good to me
Confirmed, it fixes the regression and the cluster shuts down properly
again.
Regards,
Lars
--
Architect Storage/HA, OPS Engineering, Novell, Inc.
SUSE LINUX Products GmbH, GF: Markus Rex, HRB 16746 (AG Nürnberg)
"Experienc
On 2009-07-15T11:17:20, Steven Dake wrote:
> patch attached to fix
>
> thanks for catching
The // is supposed to be gone, right?
> @@ -476,6 +476,8 @@
>
> objdb->objdb_init ();
>
> +// openais_shutdown_objdb_register (objdb);
> +
> /*
>* Bootstrap in the default confi
Hi,
trivial patch - avoid an implicit declaration.
Regards,
Lars
Index: exec/main.c
===
--- exec/main.c (revision 2034)
+++ exec/main.c (working copy)
@@ -76,6 +76,7 @@
#include "print.h"
#include "util.h"
#include "version.h
On 2009-06-04T11:38:08, Robert Wipfel wrote:
> > I think that was actually discussed on the openais list and on IRC in
> > the past and never completely explained why it wouldn't work ;-)
> Link status can also be written to the other communication
> medium: the shared disk (assuming different l
On 2009-06-04T19:07:41, Juha Heinanen wrote:
> without that kind of self-healing there is no way that openais could
> ever replace current heartbeat2+pacemaker setup.
This "self-healing" works just fine with bonded NICs.
Regards,
Lars
--
SuSE Labs, OPS Engineering, Novell, Inc.
SUSE LINUX
On 2009-06-04T09:23:04, Steven Dake wrote:
> The problem with checking the link status with the current code is that
> the protocol blocks I/O waiting for a response from the failed ring.
> This could of course be modified to behave differently.
Right, so the rechecking could possibly be a separ
On 2009-05-26T12:50:34, Andrew Beekhof wrote:
> >> try all the time also after failure like was done before failure.
> >
> > Complete Totem amateur behind the keyboard, but I'd second that. Since
> > you're constantly checking the link status while it's up, why not keep
> > doing so after it's go
On 2009-04-13T16:31:35, Joel Becker wrote:
> I think it's a great idea.
Ok, all respondents were in favor, so I will submit a miniconf proposal.
> We actually have progress, with a lot
> of work that we talked about in Prague starting to see the light of day
> in STABLE3, so it would be go
Hi all,
what do you think of a half-day / day long miniconference on clustering
along LPC? (http://lwn.net/Articles/319215/)
It's a bit late, but if no one objects, I'll submit a proposal tomorrow.
What topics would you like to see discussed?
We probably don't have enough mass to justify our com
On 2009-04-02T10:18:33, Dietmar Maurer wrote:
> I am playing around with corosync/openais and clvmd-openais. So far it
> works. But when I stop the corosync process (or if it gets stopped by a
> SIGSEGV), clvmd-openais is completely unusable.
>
> Like any openais client, it clvmd simple connect
On 2009-03-19T15:27:30, Dominik Klein wrote:
> Hi
>
> I am using the latest whitetank code (1754 2009-03-13 20:47:05 +0100)
> with pacemaker on a pair of x86_64 opensuse 11.1 boxes and I am seeing
> an openais restart issue.
>
> When I use
> /etc/init.d/openais restart
> to restart the cluster,
On 2009-03-22T09:29:22, Lars Marowsky-Bree wrote:
So that they don't drop off the radar: Just wanted to point out that the
CPG crashes are still around, mostly the pi corruption manifesting
itself.
This is with r1761 whitetank.
> Thread 1 (Thread 6988):
> #0 0x7fc07fdfc667 in d
On 2009-03-22T00:49:08, Lars Marowsky-Bree wrote:
> I'll let the test case run over night (until it crashes the test master
> ;-), that might throw up a couple of coredumps by morning.
1. crash in openais response send:
(gdb) bt
#0 0x7fa9f048ae11 in memcpy () from /lib64/l
On 2009-03-21T16:42:55, Steven Dake wrote:
> So what I'd like to know for sure is if the expiry_callback backtrace
> can be reproduced with this patch.
>
> Then we can direct our energies towards coming up with a test case for
> the remaining cpg issue.
OK, thanks for the help.
I'll let the te
On 2009-03-21T12:45:57, Steven Dake wrote:
> see subject
My current trace with this is back to square 1:
(gdb) bt
#0 0x7f8aba35b13c in notify_lib_joinlist (gi=0x763e10, conn=0x0,
joined_list_entries=1, joined_list=0x7fffcb5ec490, left_list_entries=0,
left_list=0x0, id=4) at cpg.c
On 2009-03-21T04:35:35, Steven Dake wrote:
> I merged this into whitetank and corosync.
Hi Steve, Chrissie,
thanks for investigating and fixing this particular crash.
Alas, my test case now crashes elsewhere instead :-( From whitetank:
Core was generated by `aisexec'.
Program terminated with
On 2009-03-18T14:46:42, Steven Dake wrote:
> I have a patch which I believe fixes this problem in corosync.
>
> Please try the latest build after fabio builds it.
Is that the r1866 you just posted?
If so, that patch is already in our whitetank tree and does not prevent
the crash, unfortunately
On 2009-03-17T15:39:55, David Teigland wrote:
> Here's another similar one while mounting/unmounting:
>
> Program terminated with signal 11, Segmentation fault.
> #0 0x7fa0ccf34159 in do_proc_join (name=0x7fffd78743a0, pid=11227,
> nodeid=2, reason=1)
> at cpg.c:740
> 740
Hi all,
I'm curious what a set of conservative settings for the timeouts would
be. Bear with me, I am coming from heartbeat, which had exactly 3
tunables - keepalive and deadtime (and an initial deadtime for
settling), so the various options confuse/scare me a bit ;-)
With fast nodes, it seems th
On 2009-03-05T19:39:44, Steven Dake wrote:
> Currently if the ipc connection is unjoined and then terminated causing
> lib_exit_fn to be called, no list_del will be done on the process info
> structure. This results in later processing by aisexec of the bad
> process info data. (fixes segfault/
On 2009-02-24T16:38:47, Dejan Muhamedagic wrote:
> It is also tricky because as it seems to work right now, if a
> node hears nothing on a particular interface it is deemed
> unusable. So, the only way to recover a ring would be for at
> least two nodes to start the recovery at about the same tim
On 2009-02-23T18:45:41, Steven Dake wrote:
> I can't really tell you what state it is in, other than it appears to be
> broken for you :( Chrissie may be able to provide you some idea of its
> state. She tests it occasionally with the full stack.
>
> Unfortunately with the fc11 deadlines I am
Hi,
I configured a second ring over a different NIC and used rrp_mode
active. (Again, all on whitetank.)
After a reboot of one of the nodes, the rings didn't seem to properly
reform, and aisexec got stuck somehow and refused to shutdown. I didn't
investigate this much further then, because I was
Hi all,
I have a question regarding this call; possibly it applies to other
CKPT functions too, but this is the one currently giving me worries.
ocfs2_controld uses this service, and they get spawned by the cluster
manager at essentially the same time everywhere. (At a time where all
nodes are up
On 2009-02-19T15:33:04, Chrissie Caulfield wrote:
> This seems to be biting a lot of people, so I propose that cpg is
> allowed to send messages on an inquorate cluster
I think this makes a lot of sense (and I know that, if pacemaker wants
to be able to use cpg, we need this - otherwise we canno
On 2009-02-17T22:47:28, Steven Dake wrote:
> IMO the bugzilla should never result in a buffer overflow and points at
> a problem in totempg_ifaces_get. I put some data in the bugzilla which
> I'd like collected if possible.
>
> Maybe it can help us get to the root cause of the problem instead o
On 2009-02-01T08:09:22, Steven Dake wrote:
> The expiry list pointers are not list_init'ed after a synchronization of
> checkpoints. I believe this is causing segfaults in some circumstances.
>
> Andrew can you verify the patch fixes the problem you reported?
It turns out that it does not; is
On 2009-02-01T20:31:57, Steven Dake wrote:
> This is the latest iteration of this patch.
>
> Some duplicate data structures were moved from lib/util.c and exec/ipc.c
> to include/ipc_gen.h.
>
> A feature was added to allow applications to use setuid/seteuid syscalls
> without resulting in disco
On 2009-01-28T15:02:30, Priyanka Ranjan wrote:
> Hi all,
> I am new to openais and pacemaker. I am working on SLES11. I have three
> questions.
If you are working on SLE11, please use the pacemaker list only. These
questions have nothing to do with linux-ha nor openais. I have set the
reply-to
On 2009-01-25T14:14:12, Steven Dake wrote:
> Same as before, but the trunk apparently doesn't have the
> clear_high_node_bit flag feature.
>
> Andrew can you work up a patch for that feature for trunk?
I think before we rework this, we may want to readdress the nodeid issue
more completely.
Ri
On 2009-01-26T16:03:22, Andrew Beekhof wrote:
> > Andrew, is pacemaker/libdlm doing anything real weird with the IPC
> > system?
> not any more. there was a time when I didn't quite understand how to
> use it properly, but i'm confident that everything is now used as it
> was intended.
A furthe
On 2009-01-26T07:39:43, Steven Dake wrote:
> But I understand your desire to avoid risk. Unfortunately there is no
> surgical way to fix the problems with the current ipc system without
> many of these approaches used in this patch.
I see your point, but are those issues present in corosync too
On 2009-01-23T08:17:53, Steven Dake wrote:
> If you mean the apis for responding to requests or library services api,
> then unfortunately, yes it is required to change these apis. No user
> APIs should change, however. On the plus side these will be going into
> corosync and shouldn't change i
On 2009-01-23T07:16:38, Steven Dake wrote:
> You could try this patch but it is not yet complete.
That also seems to change the externally visible API; that for sure
isn't intended for a stable branch, is it?
Mit freundlichen Grüßen,
Lars
--
Teamlead Kernel, SuSE Labs, Research and Devel
On 2009-01-20T06:35:03, Steven Dake wrote:
> Unfortunately, even the planned 0.80.4 has some issues with the IPC
> system which result in double frees, and various other badness.
We're hitting exactly this on openAIS/Pacemaker - is there a workaround
for whitetank?
Core was generated by `aisex
On 2009-01-21T22:47:03, Dejan Muhamedagic wrote:
> > Without this patch, it would not set any ENDIAN_64BITWORD or _32BITWORD
> > define on Linux, ever, and then default to using the 64bit operations
> > down below. One wonders if this (correct) patch then implicitly breaks
> > other archs.
> Test
On 2009-01-21T15:42:57, Dejan Muhamedagic wrote:
> #if defined(OPENAIS_LINUX)
> -#if __WORDIZE == 64
> +#if __WORDSIZE == 64
> #define ENDIAN_64BITWORD
> #endif
> -#if __WORDIZE == 32
> +#if __WORDSIZE == 32
> #define ENDIAN_32BITWORD
> #endif
> #else
Without this patch, it would not set a
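
The reason such a typo goes unnoticed is that, in C, an identifier that is not defined as a macro is replaced by 0 inside an #if expression, so the misspelled tests compile cleanly and are simply false. A small sketch of that behaviour follows. EXAMPLE_WORDIZE and the other EXAMPLE_ names are hypothetical stand-ins, not the identifiers from the patched header.

```c
#include <assert.h>

/*
 * EXAMPLE_WORDIZE stands in for the misspelled macro.
 * It is deliberately never defined anywhere.
 */
#if EXAMPLE_WORDIZE == 64
#define EXAMPLE_64BIT_BRANCH_TAKEN 1
#else
#define EXAMPLE_64BIT_BRANCH_TAKEN 0
#endif

#if EXAMPLE_WORDIZE == 32
#define EXAMPLE_32BIT_BRANCH_TAKEN 1
#else
#define EXAMPLE_32BIT_BRANCH_TAKEN 0
#endif

/* An undefined identifier in #if evaluates as 0, so this test is true. */
#if EXAMPLE_WORDIZE == 0
#define EXAMPLE_UNDEFINED_IS_ZERO 1
#else
#define EXAMPLE_UNDEFINED_IS_ZERO 0
#endif

/* Returns 1 if either word-size branch was selected, 0 otherwise. */
int any_wordsize_branch_taken(void)
{
    return EXAMPLE_64BIT_BRANCH_TAKEN || EXAMPLE_32BIT_BRANCH_TAKEN;
}

/* Returns 1 if the undefined macro compared equal to 0. */
int undefined_macro_is_zero(void)
{
    return EXAMPLE_UNDEFINED_IS_ZERO;
}

/* With the misspelling, neither branch fires and no error is raised. */
void self_check(void)
{
    assert(!any_wordsize_branch_taken());
    assert(undefined_macro_is_zero());
}
```

GCC's -Wundef option warns when an undefined identifier is evaluated in an #if directive, which would flag this class of mistake at build time.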
On 2008-12-02T22:06:34, Andrew Beekhof wrote:
> Definitely worth the effort to try.
> Despite my ranting, I'm still trying to get it to scan Pacemaker (and
> rescan Heartbeat) too.
I wonder what it'd cost to just buy that service subscription. ;-)
Regards,
Lars
--
Teaml
On 2008-10-09T08:51:58, Steven Dake wrote:
Hi Steven,
> > The goal was to have a blackbox which we cannot just retroactively dump,
> > but also easily recover from a core or kernel crashdump.
>
> the array is global and can easily be obtained from a core file with a
> simple
On 2008-10-08T21:08:16, Steven Dake wrote:
> Attached is my version which is as of yet incomplete. The general
> concept is to allow very high performance event tracing with minimal
> formatting overhead (formatting is done in a separate program after a
> crash or to debug cur
On 2008-09-13T20:58:26, Ruppert Koch wrote:
> Fault detection as well as membership are managed by the Totem protocol.
Right.
> I assume the following happens:
>
> A node P experiences a token timeout.
Ah, and this is the point which I'm curious about - why does this occur
On 2008-09-09T11:18:59, David Teigland wrote:
> > For some reason our cluster splits up into two rings.
> > Scenario is:
> > node1(n1) n2 n3 n4 n5 n6 are in the ring.
> >
> > Suddenly the ring splits into two rings:
> > n1 n2 n3 got leave msg from n4 n5 n6
> > n4 n5 n6 got lea
On 2007-12-05T21:06:38, Andrew Beekhof wrote:
> Over the last few months, Red Hat and SUSE engineers have been working
> together to port Heartbeat's powerful Cluster Resource Manager (CRM) to run
> natively on top of OpenAIS.
Credit where credit is due: this means you, Andr
55 matches