On Wed, Mar 26, 2008 at 11:57:59AM -0400, Lon Hohberger wrote:
On Wed, 2008-03-26 at 10:32 -0500, David Teigland wrote:
[1] Just to be clear, the meta-configuration idea is where a variety of
config files can be used to populate a central config-file-agnostic
repository. A single
On Wed, Mar 26, 2008 at 10:32:54AM -0500, David Teigland wrote:
A while back I drew this diagram to show what we were aiming to design, in
broad terms, for the next generation aisexec/cman config system:
http://people.redhat.com/teigland/cman3.jpg
I think perhaps that diagram attempts
On Tue, Jul 01, 2008 at 03:11:26PM -0700, Steven Dake wrote:
Dave,
Your patch looks reasonable but has a few issues which need to be
addressed.
It doesn't address the setting of logsys_subsys_id but defines it. I
want to avoid the situation where logsys_subsys_id is defined, but then
not
On Tue, Sep 09, 2008 at 12:27:34PM +0200, Arne Eriksson R wrote:
Hi,
We have a cluster with 6 processors using openais stable version 0.80.3.
For some reason our cluster splits up into two rings.
Scenario is:
node1(n1) n2 n3 n4 n5 n6 are in the ring.
Suddenly the ring splits into two
Wow, that is a complicated solution. I thought that simple and blackbox
went well together.
Completely agree, too complex. The logging code I copy into all the
daemons I write is at the opposite end of the spectrum; I doubt it's
possible to be much simpler. (I copy it everywhere because it's
On Thu, Oct 30, 2008 at 11:26:14PM -0700, Steven Dake wrote:
There are two types of messages. Those intended for users/admins and
those intended for developers.
Both of these message types should always be recorded *somewhere*. The
entire concept of LOG_LEVEL_DEBUG is dubious to me. If
On Tue, Nov 04, 2008 at 02:58:47PM -0600, David Teigland wrote:
the cluster.conf logging/ section? My suggestion is:
syslog_level=foo
logfile_level=bar
FWIW, I'm not set on this if someone has a better suggestion. I just want
something unambiguous. debug=on has been shown to mean
On Tue, Mar 10, 2009 at 01:41:57AM -0700, Steven Dake wrote:
./autogen.sh
./configure
make
make install DESTDIR=/
Any chance that install could default to DESTDIR=/ ?
Dave
_______________________________________________
Openais mailing list
Openais@lists.linux-foundation.org
On Tue, Mar 17, 2009 at 02:18:58PM +, Chrissie Caulfield wrote:
I had three GFS filesystems all mounted on 13 nodes. When I went to
umount them I got the following crash on 5 nodes of the system:
(gdb) bt
#0 0x7f21baeb0f05 in raise () from /lib64/libc.so.6
#1
On Thu, Apr 09, 2009 at 01:50:18PM +0200, Andrew Beekhof wrote:
For added fun, a node that restarts quickly enough (think a VM) won't
even appear to have left (or rejoined) the cluster.
At the next totem confchg event, it will simply be there again
with no indication that anything
On Thu, Apr 09, 2009 at 09:00:08PM +0200, Dietmar Maurer wrote:
If new, normal read/write messages to the replicated state continue while
the new node is syncing the pre-existing state, the new node needs to save
those operations to apply after it's synced.
Ah, that probably works. But
On Thu, Apr 09, 2009 at 10:12:43PM +0200, Andrew Beekhof wrote:
On Thu, Apr 9, 2009 at 20:49, Joel Becker joel.bec...@oracle.com wrote:
On Thu, Apr 09, 2009 at 01:50:18PM +0200, Andrew Beekhof wrote:
For added fun, a node that restarts quickly enough (think a VM) won't
even appear to have
On Thu, Apr 09, 2009 at 06:02:38PM -0700, Steven Dake wrote:
The issue that Dave is talking about I believe is described in the
following bugzilla:
https://bugzilla.redhat.com/show_bug.cgi?id=489451
No, not at all.
IMO you should get a leave event for any process that leaves the process
On Mon, Apr 13, 2009 at 12:10:33PM -0700, Steven Dake wrote:
On Mon, 2009-04-13 at 13:35 -0500, David Teigland wrote:
0. configure token timeout to some long time that is longer than all the
following steps take
1. cluster members are nodeid's: 1,2,3,4
2. cpg foo has
On Tue, Apr 14, 2009 at 02:05:10PM +0200, Dietmar Maurer wrote:
So CPG provides a framework to implement distributed finite state
machines (DFSM). But there is no standard way to get the initial state
of the DFSM. Almost all applications need to get the initial state, so I
wonder if it would
From one lone, biased, user's point of view, optimized malloc and memcpy are
uninteresting -- message throughput isn't what I'm looking for. Are there
others out there who see this as important? I *would* be interested in seeing
improvements in the following areas:
. message latency, if that's
On Tue, Apr 14, 2009 at 01:18:14PM -0700, Steven Dake wrote:
. message latency, if that's even possible
Reducing the time a token is held reduces latency. So the memcpy and malloc
specializations do reduce latency. I don't have measurements of how much,
however.
That would be interesting to measure
If I run 'cman_tool leave' on four nodes in parallel, node1 will leave right
away, but the other three nodes don't leave until the token timeout expires
for node1 causing a confchg for it, after which the other three all leave
right away. This has only been annoying me recently, so I think it
On Thu, Apr 16, 2009 at 12:38:19PM +0200, Dietmar Maurer wrote:
Let's assume the cluster is partitioned:
Part1: node1 node2 node3
Part2: node4 node5
After recovery, what join/leave messages do I receive with a CPG:
A.) JOIN: node4 node5
or
B.) JOIN: node1 node2 node3
or anything
On Fri, Apr 17, 2009 at 10:56:47PM -0700, Steven Dake wrote:
On Sat, 2009-04-18 at 07:49 +0200, Dietmar Maurer wrote:
like a 'merge' function? Seems the algorithm for checkpoint recovery
always uses the state from the node with the lowest processor id?
Yes that is right.
So if
On Sat, Apr 18, 2009 at 07:49:12AM +0200, Dietmar Maurer wrote:
like a 'merge' function? Seems the algorithm for checkpoint recovery
always uses the state from the node with the lowest processor id?
Yes that is right.
So if I have the following cluster:
Part1: node2 node3 node4
On Sat, Apr 18, 2009 at 09:37:26AM +0200, Dietmar Maurer wrote:
Yes, forcing the losers to reset and start from scratch is a must, but we
end up doing that a layer above corosync. That means the losers often
reappear again through corosync/cpg prior to being forced out.
Are you talking
On Sat, Apr 18, 2009 at 03:55:57AM -0700, Steven Dake wrote:
On Sat, 2009-04-18 at 12:47 +0200, Dietmar Maurer wrote:
At least the SA Forum does not mention such strange behavior. Isn't that
a serious bug?
Yes, I'd consider it a serious bug.
Consider 2 Partitions with one checkpoint:
On Tue, Apr 21, 2009 at 07:43:04PM +0200, Fabio M. Di Nitto wrote:
On Tue, 2009-04-21 at 08:51 -0500, Ryan O'Hara wrote:
On Tue, Apr 21, 2009 at 06:06:25AM +0200, Fabio M. Di Nitto wrote:
Hi guys,
in order to match the new logging config spec, 2 logging config keywords
had to be
On Wed, Apr 29, 2009 at 02:28:05PM +0200, Andrew Beekhof wrote:
At the moment, startup and shutdown ordering is controlled by the
plugin's position in an objdb list.
This is particularly problematic for cluster resource managers which
must be unloaded/stopped first.
The reason for this
On Thu, Apr 16, 2009 at 12:29:27PM -0500, David Teigland wrote:
VS guarantees that all cpg members will see the same sequence of messages
and configuration changes, i.e. history of events. If a cpg is partitioned,
that immediately violates VS. One part must be killed so that the remaining
On Mon, Apr 13, 2009 at 02:17:00PM -0500, David Teigland wrote:
On Mon, Apr 13, 2009 at 12:10:33PM -0700, Steven Dake wrote:
On Mon, 2009-04-13 at 13:35 -0500, David Teigland wrote:
0. configure token timeout to some long time that is longer than all the
following steps take
1
On Wed, May 06, 2009 at 02:10:27PM -0700, Steven Dake wrote:
On Wed, 2009-05-06 at 15:04 -0500, David Teigland wrote:
On Mon, Apr 13, 2009 at 02:17:00PM -0500, David Teigland wrote:
On Mon, Apr 13, 2009 at 12:10:33PM -0700, Steven Dake wrote:
On Mon, 2009-04-13 at 13:35 -0500, David
I think we may have lost something in transit between irc/email/svn,
Mar 26 16:10:20 dct confchg, node1 create ckpt, node2 open ckpt, node2
read ckpt - fail
Mar 26 16:10:46 dct nodeid 1 creates the ckpt
Mar 26 16:13:42 dct saCkptCheckpointOpen() works,
On Thu, May 07, 2009 at 12:46:33AM -0700, Steven Dake wrote:
On Wed, 2009-05-06 at 16:26 -0500, David Teigland wrote:
I think we may have lost something in transit between irc/email/svn,
Mar 26 16:10:20 dct confchg, node1 create ckpt, node2 open ckpt, node2
read
I recently started getting BAD_HANDLE errors from cpg_dispatch() when leaving
a cpg:
- cpg_leave()
- cpg_dispatch(handle, CPG_DISPATCH_ALL)
- dispatch executes a confchg for the leave
- dispatch returns 9
It doesn't break anything, but I'd like to avoid adding code to detect when I
should or
On Mon, May 18, 2009 at 01:44:50PM +0100, Chrissie Caulfield wrote:
Steven Dake wrote:
I don't think this will be backwards compatible with whitetank. IMO use
the memb_join_message_send function as outlined. If you can show it
works with whitetank then looks good for commit.
OK,
On Tue, May 19, 2009 at 03:40:53PM +0200, Jan Friesse wrote:
Hi,
attached is a proposed solution for the *dispatch* functions, which return
CS_ERR_BAD_HANDLE (AIS_ERR_BAD_HANDLE (9)).
David, can you please test them, and give results?
Thanks, I tried the corosync patch, and cpg_dispatch error 9
On Thu, May 21, 2009 at 07:36:28AM -0700, Steven Dake wrote:
It is possible, with 3+ nodes joining or leaving at the same time, for a
configuration change to be delivered to a user it is not meant
for.
This patch solves that problem.
ack, using this patch I can't reproduce the problem
On Wed, May 27, 2009 at 04:15:52PM +0200, Jan Friesse wrote:
Hi,
included is a patch for corosync's Makefile.am so that coroipcc.o is no
longer included in lib... directly; instead the *.so is a dependency, so
ipc_hdb is no longer present in multiple *.so files and multiple times in
the binary, which caused problems.
I wrote a new program cpgx to test the virtual synchrony guarantees of
corosync and cpg,
http://fedorapeople.org/gitweb?p=teigland/public_git/dct-stuff.git;a=summary
It joins a cpg, then randomly sends messages, leaves or exits, and repeats.
This all creates a random sequence of messages and
On Wed, Jun 03, 2009 at 04:28:27PM -0500, David Teigland wrote:
Running cpgx -d1 on four nodes, where -d1 causes the test to periodically
kill and restart corosync. When this kill/restart happens on one node, others
are typically exiting/joining the cpg at the same time. The result
On Mon, Jun 22, 2009 at 09:26:06AM -0700, Steven Dake wrote:
On Mon, 2009-06-22 at 10:59 -0500, David Teigland wrote:
On Sat, Jun 20, 2009 at 11:51:40AM -0700, Steven Dake wrote:
I invite all of our contributors to help define the X.Y roadmap of both
corosync and openais. Please submit
On Mon, Jun 22, 2009 at 10:48:18PM -0700, Steven Dake wrote:
While you're there, perhaps knock down the level of those messages so we don't
see it all in /var/log/messages every time?
Jun 22 14:58:12 bull-01 corosync[2343]: [MAIN ] Corosync Executive Service
RELEASE 'trunk'
Jun 22 14:58:12
On Wed, Jul 01, 2009 at 06:21:14PM +0200, Jan Friesse wrote:
Included patch should fix
https://bugzilla.redhat.com/show_bug.cgi?id=506255 .
David, I hope it will fix problem for you.
It's based on simple idea of adding node startup timestamp at the end of
cpg_join (and joinlist) calls. If
On Wed, Jul 01, 2009 at 01:46:03PM -0500, David Teigland wrote:
other nodes should immediately recognize it has
previously failed and process a complete failure for it.
i.e. the full equivalent to what apps (using any api's) would see if the node
had failed via normal token timeout.
Dave
On Thu, Jul 02, 2009 at 11:09:26AM -0700, Steven Dake wrote:
On Thu, 2009-07-02 at 09:27 -0500, David Teigland wrote:
On Thu, Jul 02, 2009 at 01:15:18PM +0200, Jan Friesse wrote:
David Teigland wrote:
On Wed, Jul 01, 2009 at 01:46:03PM -0500, David Teigland wrote:
other nodes should
On Mon, Jul 20, 2009 at 10:03:36AM +0200, Jan Friesse wrote:
Patch solves the problem of one process connecting multiple times to one
group, by disallowing this situation.
Please see the patch comment for more information.
David, do you agree, that this is how cpg should behave, or you would
rather
Here are two related and troublesome problems that would be nice to fix,
probably in future versions -- they probably can't be fixed maintaining
existing apis and protocols (although adding new api's to help with them might
be nice if possible).
1. correlating events from different services
On Mon, Aug 31, 2009 at 02:28:33PM -0700, Steven Dake wrote:
On Mon, 2009-08-31 at 15:44 -0500, David Teigland wrote:
Here are two related and troublesome problems that would be nice to fix,
probably in future versions -- they probably can't be fixed maintaining
existing apis and protocols
On Mon, Aug 31, 2009 at 02:28:33PM -0700, Steven Dake wrote:
On Mon, 2009-08-31 at 15:44 -0500, David Teigland wrote:
Here are two related and troublesome problems that would be nice to fix,
probably in future versions -- they probably can't be fixed maintaining
existing apis and protocols
On Thu, Sep 10, 2009 at 04:11:28PM -0700, Steven Dake wrote:
IMO the proper way to do this is to ensure whatever ringid was delivered in
a callback to the application is the current ring id returned by the api.
This gets rid of any races you describe above.
I can't really think of any races
On Mon, Sep 21, 2009 at 08:35:33AM -0700, Steven Dake wrote:
4) flatiron to trail trunk with bug resolution
It appears waiting months to cherrypick patches doesn't produce a high
quality flatiron that people can use continuously.
I'm open to suggestions. One option is to set some time
This puts multiple nodeids on each [QUORUM] Members line instead of
putting each nodeid on a separate line. With more than a few nodes the
excessive lines become a real nuisance, and anyone up around 32 nodes
may literally be scrolling through hundreds of those lines.
Index: vsf_quorum.c
corosync-objctl used to print a lot of useful information which now
appears only as **binary**. Is there a way to get that back?
Perhaps two output modes, one where it prints binary values in hex and
another where it makes a best effort to interpret and print the values
in a useful form?
Dave
On Wed, Jan 13, 2010 at 02:49:53PM +1100, Angus Salkeld wrote:
On Wed, Jan 13, 2010 at 6:06 AM, David Teigland teigl...@redhat.com wrote:
corosync-objctl used to print a lot of useful information which now
appears only as **binary**. Is there a way to get that back?
Perhaps two output
The corosync logs are so full of these messages that they end up being
unhelpful. I think they could be made very helpful, though, if they were
printed when the quorum state changed.
Dave
Index: exec/vsf_quorum.c
===================================================================
---
On Mon, Feb 22, 2010 at 06:00:21PM +0100, Jan Friesse wrote:
Related to https://bugzilla.redhat.com/show_bug.cgi?id=529424
Patch implements new callback with current totem ring id and members.
Included is modified testcpg using functionality. As required, callback
is delivered AFTER all
On Fri, Feb 19, 2010 at 03:31:10PM -0700, Steven Dake wrote:
There are millions of lines of C code involved in directing a power
fencing device to fence a node. Generally in this case, the system
directing the fencing is operating from a known good state.
There are several hundred lines of
On Mon, Feb 22, 2010 at 06:00:21PM +0100, Jan Friesse wrote:
+struct cpg_ring_id {
+ uint32_t nodeid;
+ uint64_t seq;
+};
What do you think about combining this patch with the other patch that
adds cpg_ringid_get()? It's troublesome to combine the two patches to
test.
+typedef void
On Tue, Mar 02, 2010 at 11:10:49AM +0100, Jan Friesse wrote:
I'll give you example.
Let's say, you have 3 nodes (a,b,c). B,C are joined in group EXAMPLE.
Now, A will fall ... you will not get normal confchg, because A was not
in group. Now on B, you will run new cpg process joined to group. If
On Tue, Apr 06, 2010 at 02:05:00PM +0200, Jan Friesse wrote:
Same patch but rebased on top of Steve's change (today trunk).
Thanks, this is mostly working well, but I've found one problem, and one
additional thing I need (mentioned on irc already):
1. When a node joins, I get the totem callback
On Thu, Apr 08, 2010 at 04:15:06PM +0100, Christine Caulfield wrote:
On 08/04/10 15:57, Jan Friesse wrote:
Included is patch solving 2nd problem.
In the first problem, I agree with Chrissie, and really don't have any
idea how to make the regular confchg precede totem_confchg.
We can't.
On Thu, Apr 08, 2010 at 04:57:22PM +0200, Jan Friesse wrote:
Included is patch solving 2nd problem.
Thanks, it works for me.
In the first problem, I agree with Chrissie, and really don't have any
idea how to make the regular confchg precede totem_confchg.
I've stepped through things and it
On Fri, Apr 09, 2010 at 09:33:30AM +0200, Jan Friesse wrote:
Dave,
Oh, and I may have just invented a time machine by merging partitioned
clusters!
1270661597 cluster node 1 added seq 2128
1270661597 fenced:daemon conf 3 1 0 memb 1 2 4 join 1 left
1270661597 cpg_mcast_joined retried 4
When corosync exits, my application (fenced) gets stuck.
# strace -p 2005
Process 2005 attached - interrupt to quit
restart_syscall(... resuming interrupted call ...) = -1 ETIMEDOUT (Connection
timed out)
poll([{fd=14, events=0}], 1, 0) = 1 ([{fd=14, revents=POLLNVAL}])
On Wed, Apr 14, 2010 at 12:57:14PM +0200, Jan Friesse wrote:
David,
in that case, corosync exits (so it is really not running) or not?
Yep, the corosync process is gone.
David Teigland wrote:
When corosync exits, my application (fenced) gets stuck.
# strace -p 2005
Process 2005
On Thu, Apr 08, 2010 at 04:57:22PM +0200, Jan Friesse wrote:
commit 0d509f4bf23f618c940c3bcdd7cf0e97faf64876
Author: Jan Friesse jfrie...@redhat.com
Date: Thu Apr 8 16:48:45 2010 +0200
CPG model_initialize and ringid + members callback
Patch adds new function to initialize
I'm using trunk svnversion 2770. I ran 'service cman start' on four nodes,
which I do all the time, and one segfaulted here,
Core was generated by `corosync -f'.
Program terminated with signal 11, Segmentation fault.
#0 0x7f1437774eb9 in object_find_next (
On Thu, Apr 22, 2010 at 11:06:19AM +1000, Angus Salkeld wrote:
Problem:
Under certain circumstances cpg does not send group leave messages.
With a big token timeout (tested with token == 5min).
1 start all nodes
2 start ./test/testcpg on all nodes
3 go to the node with the lowest nodeid
On Thu, Apr 22, 2010 at 04:35:08PM -0500, David Teigland wrote:
On Thu, Apr 22, 2010 at 11:06:19AM +1000, Angus Salkeld wrote:
Problem:
Under certain circumstances cpg does not send group leave messages.
With a big token timeout (tested with token == 5min).
1 start all nodes
2
I'm always looking for ways to make debugging/diagnosing corosync easier
since it's notoriously difficult. I've always just ignored the messages in
the subject line; they seem more or less equivalent to "something happened."
(The length of corosync messages tends to be inversely proportional to
On Thu, Jan 13, 2011 at 08:09:13AM -0700, Steven Dake wrote:
On 01/13/2011 08:03 AM, Lars Marowsky-Bree wrote:
On 2010-12-01T14:18:25, Steven Dake sd...@redhat.com wrote:
Corosync 1.3.0 is available for immediate download from our website.
This version brings many enhancements to the
On Fri, Sep 02, 2011 at 10:30:53AM -0700, Steven Dake wrote:
On 09/02/2011 12:59 AM, Vladislav Bogdanov wrote:
Hi all,
I'm trying to further investigate problem I described at
https://www.redhat.com/archives/cluster-devel/2011-August/msg00133.html
The main problem for me there is