Re: [Linux-HA] Antw: Re: pcs or crmsh?

2012-11-15 Thread Lars Marowsky-Bree
On 2012-11-14T15:11:05, Digimer wrote: > any reason at all, to try new things. Sometimes it is superior, often it > is not. In either case, users are free to go where they feel is best. Ah, but can they? How likely is it that the large distributions will offer both? Only those that don't supply

Re: [Linux-HA] Antw: Re: pcs or crmsh?

2012-11-15 Thread Lars Marowsky-Bree
On 2012-11-15T09:20:44, Andrew Beekhof wrote: > > LCMC and crmsh/hawk are at least conceptionally very very different; > Conceptually LCMC and hawk are both web based GUIs, it's the > implementation that makes them so different. Not quite. LCMC is pretty heavily different from a deployment perspe

Re: [Linux-HA] Heartbeat with Oracle's ASM

2012-11-15 Thread Lars Marowsky-Bree
On 2012-11-15T10:00:21, Hill Fang wrote: > Hi friend: > > I want to know: does heartbeat support Oracle ASM now? No - and yes. Oracle RAC (I assume that's the context for ASM?) does not tolerate any cluster solution except itself. This is not supported together with Pacemaker. Pacemaker with t

Re: [Linux-HA] Antw: Re: pcs or crmsh?

2012-11-14 Thread Lars Marowsky-Bree
On 2012-11-14T12:44:53, Digimer wrote: > Not really, to be honest. The way I see it is that Pacemaker is in tech > preview (on rhel, which is where I live). So almost by definition, > anything can change at any time. This is what happened here, so I don't > see a problem. That is a pretty limite

Re: [Linux-HA] pcs or crmsh?

2012-11-14 Thread Lars Marowsky-Bree
On 2012-11-14T09:24:45, Rasto Levrinc wrote: > What doesn't work? I think that at this point in time, it'd be easier to > get crmsh going/fixed with pcmk 1.1.8. It's probably just some path > somewhere. If really nothing works, you *must* use LCMC, Pacemaker GUI. :) crmsh's latest release is sup

Re: [Linux-HA] Antw: Re: pcs or crmsh?

2012-11-14 Thread Lars Marowsky-Bree
On 2012-11-14T09:33:22, Digimer wrote: > As it was told to me, pcs was going to be what was used "officially", > but that anyone and everyone was welcome to continue using and > developing crm or any other existing or new management tool. My > take-away was that the devs wanted pcs, for reasons

Re: [Linux-HA] Antw: Re: cib_replace failed?

2012-11-14 Thread Lars Marowsky-Bree
On 2012-11-14T08:46:25, Ulrich Windl wrote: > This recommendation is against best practices: The FQHN is usually the first > name in /etc/hosts, aliases (short names) following. Probably it's better to > fix the application rather than fiddling with /etc/hosts. Of course. But I was assuming Eric

Re: [Linux-HA] Antw: Re: pcs or crmsh?

2012-11-14 Thread Lars Marowsky-Bree
On 2012-11-14T09:08:58, Ulrich Windl wrote: > > The "official" management tool is/will be pcs. That said, crm has been > > around for a while, so it might be more complete/stable. > Is this wishful thinking? In SLES11 SP2 it's not available for installation, > so it's either very new or not con

Re: [Linux-HA] cib_replace failed?

2012-11-13 Thread Lars Marowsky-Bree
On 2012-11-13T17:06:31, "Robinson, Eric" wrote: > I'm not sure how to correct this. Here are the results of my name resolution > test on node ha09a... I'd probably strip everything except the short names out of /etc/HOSTNAME and /etc/hosts, though it may be sufficient to make sure the short nam
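
A rough sketch of the /etc/hosts layout being suggested (addresses are illustrative, not taken from this thread); the point is that the name uname -n returns resolves the same way on every node:

    # /etc/hosts - short names for the cluster nodes (sketch)
    127.0.0.1   localhost
    10.0.0.1    ha09a
    10.0.0.2    ha09b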

Re: [Linux-HA] cib_replace failed?

2012-11-13 Thread Lars Marowsky-Bree
On 2012-11-13T16:34:23, "Robinson, Eric" wrote: > bump. > > Could someone please review the logs in the links below and tell me what the > heck is going on with this cluster? I've never encountered anything like this > before. Basically, corosync thinks the cluster is healthy but Pacemaker won

Re: [Linux-HA] cib_replace failed?

2012-11-13 Thread Lars Marowsky-Bree
On 2012-11-12T10:07:50, Andrew Beekhof wrote: > Um, are you setting a nodeid in corosync.conf? > > Because I see this: > > Nov 09 09:07:25 [2609] ha09a.mycharts.md crmd: crit: > crm_get_peer: Node ha09a.mycharts.md and ha09a share the same cluster > node id '973777088'! This

Re: [Linux-HA] Bug around on-fail on op monitor ?

2012-11-12 Thread Lars Marowsky-Bree
On 2012-11-12T15:01:47, alain.mou...@bull.net wrote: > Thanks but no, in older releases, a failed monitor op led to > "fence" as required by "on-fail=fence". Yes, that's what should happen. You can file a crm_report with the PE inputs showing this for 1.1.7, or directly retest with 1.1
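
For reference, a minimal sketch of the kind of monitor definition being discussed (resource name and timeout are illustrative):

    primitive p_dummy ocf:heartbeat:Dummy \
        op monitor interval=30s timeout=20s on-fail=fence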

Re: [Linux-HA] Antw: Re: Pacemaker STONITH Config Check

2012-11-08 Thread Lars Marowsky-Bree
On 2012-11-07T12:51:25, Ulrich Windl wrote: > I agree that one shouldn't have to do it, but I've seen cases (two node > cluster with quorum-policy=ignore) where one node was down while the > "cluster" wanted to fence both nodes. So when the other node goes up, nodes > will shoot each other. >

Re: [Linux-HA] Q: "crmd: [12771]: info: handle_request: Current ping state: S_TRANSITION_ENGINE"

2012-11-06 Thread Lars Marowsky-Bree
On 2012-11-05T17:05:35, Dejan Muhamedagic wrote: > > It's a debug instrumentation message. But it is only triggered when > > someone runs crmadmin -S, -H to look up the DC or something, it isn't > > triggered by the stack internally. > If it's a debug message, why is it then at severity "info"?

Re: [Linux-HA] Q: "crmd: [12771]: info: handle_request: Current ping state: S_TRANSITION_ENGINE"

2012-11-05 Thread Lars Marowsky-Bree
On 2012-11-05T15:31:25, Ulrich Windl wrote: > I just experienced that the syslog message "crmd: [12771]: info: > handle_request: Current ping state: S_TRANSITION_ENGINE" is sent out several > times per second for an extended period of time. > > So I wonder: Is it a left-over debug message, and

Re: [Linux-HA] cib_replace failed?

2012-10-31 Thread Lars Marowsky-Bree
On 2012-10-31T15:59:05, "Robinson, Eric" wrote: > Nobody has any thoughts on why my 2-node cluster has no DC? As I mentioned, > corosync-cfgtool -s shows the ring active with no faults. That probably means that someone (i.e., you ;-) needs to dig more into the logs of corosync & pacemaker. Th

Re: [Linux-HA] Antw: Re: Limit for three Xen VMs in SLES11 SP2?

2012-10-25 Thread Lars Marowsky-Bree
On 2012-10-25T11:30:32, Ulrich Windl wrote: > I just wonder: If the reason is some kind of resource shortage in the Xen > Host that causes Xen guests to fail booting, it would be nice if that > situation could be detected. I was just asking for an already known effect, > before digging deeper.

Re: [Linux-HA] Limit for three Xen VMs in SLES11 SP2?

2012-10-25 Thread Lars Marowsky-Bree
On 2012-10-25T08:28:29, Ulrich Windl wrote: > The VM would not be able to boot due to lack of a boot disk. All three VMs on > a specific node had the very same problem after being rebooted (through OS, > not Xen RA). The Xen RA, by default, only monitors the existence of the VM on the hypervis

Re: [Linux-HA] resource monitor timeout, Killing with signal SIGTERM (15).

2012-10-24 Thread Lars Marowsky-Bree
On 2012-10-24T13:23:09, Dimitri Maziuk wrote: > PS. but for the most part, like you said: you *have* people stuck on > 2.1.4 and you keep supporting them much as you hate it. Yes, but on SLES10, that was an actually shipping version with full support. EPEL has different policies than RHEL. Thos

Re: [Linux-HA] resource monitor timeout, Killing with signal SIGTERM (15).

2012-10-24 Thread Lars Marowsky-Bree
On 2012-10-24T13:17:57, Dimitri Maziuk wrote: > I have e.g. mon script that greps 'lsof -i' to see if httpd is listening > on * or cluster ip. Which IMO is a way saner check than wget'ting > http://localhost/server-status -- and treating a [34]04 as a fail. Hence > the "plus" quip. ;) That the p

Re: [Linux-HA] Antw: Re: [Linux-ha-dev] glue 1.0.11 released

2012-10-24 Thread Lars Marowsky-Bree
On 2012-10-22T14:12:17, Ulrich Windl wrote: > Interesting formula: I'd use something like "number of CPUs" * 4, not divided > by. > > Reason: Today's workload is usually limited by I/O, not by CPU power. > > However with something crazy like 32 CPUs, 32 tasks can easily be run, but > most lik

Re: [Linux-HA] resource monitor timeout, Killing with signal SIGTERM (15).

2012-10-24 Thread Lars Marowsky-Bree
On 2012-10-24T11:15:14, Dimitri Maziuk wrote: > > I'm happy you have something that works for you. > > Although even if you're using it in haresources mode, your resource > > agents are still years out of date. > It doesn't have resource agents (that's one of its pluses in my book). It has them;

Re: [Linux-HA] Problems with quorum, no-quorum-policy and NMI messages

2012-10-17 Thread Lars Marowsky-Bree
On 2012-10-17T09:18:06, Michael Schwartzkopff wrote: > If you have errors in the network you eventually lose packets. > corosync/pacemaker doesn't like this and sometimes reacts to heavy packet > loss. It's not really pacemaker that is affected, but corosync's totem protocol implementation. T

Re: [Linux-HA] Problems with quorum, no-quorum-policy and NMI messages

2012-10-16 Thread Lars Marowsky-Bree
On 2012-10-16T23:04:59, RaSca wrote: > Hi all, > I hope that you can help me with this strange problem. I've got a nine > node cluster which is configured with no-quorum-policy to stop. > Two days ago I came across this error on one of the nodes: > > Oct 14 00:00:38 kvm06 kernel: Uhhuh. NMI rece

Re: [Linux-HA] Question about cib-xx.raw / cib-xx.raw.sig

2012-10-16 Thread Lars Marowsky-Bree
On 2012-10-16T11:20:21, alain.mou...@bull.net wrote: > OK thanks. > > And so what are the real consequences for Pacemaker on a HA cluster if we > remove all cib-xx.raw and cib-xx.raw.sig ? None. They are backup data for support. Regards, Lars -- Architect Storage/HA SUSE LINUX Products

Re: [Linux-HA] [Linux-ha-dev] glue 1.0.11 released

2012-10-16 Thread Lars Marowsky-Bree
On 2012-10-16T08:20:56, alain.mou...@bull.net wrote: > may I ask the formula or values of max number of children with regard to > the number of processors ? http://hg.linux-ha.org/glue/rev/1f36e9cdcc13 - number of CPUs divided by two or four, whatever is lower. Regards, Lars -- Architect

Re: [Linux-HA] Antw: Re: Q: Xen RA: node_ip_attribute

2012-09-28 Thread Lars Marowsky-Bree
On 2012-09-27T17:32:58, Ulrich Windl wrote: > Just a note: As it turned out, the Xen RA (SLES11 SP2, > resource-agents-3.9.3-0.7.1) is broken, because migrate will never look at > the node_ip_attribute you configured. > > It's line 369: > target_attr="$OCF_RESKEY_CRM_node_ip_attribute" > > I

Re: [Linux-HA] Q: crm shell's "migrate lifetime"

2012-09-27 Thread Lars Marowsky-Bree
On 2012-09-27T16:36:08, Ulrich Windl wrote: Hi Ulrich, we always appreciate your friendly, constructive and non-condescending feedback. > However if you specify a duration like "P2", the duration is not added to the > current time; instead the current time is used as lifetime (it seems): "P2"

Re: [Linux-HA] Antw: Re: Q: Xen RA: node_ip_attribute

2012-09-24 Thread Lars Marowsky-Bree
On 2012-09-24T08:45:39, Ulrich Windl wrote: > So I select one unique attribute name for Xen migration, specify that > in the Xen resource, and then define that attribute per node, using > one of the node's own IP addresses? Yes. The idea is that this allows you to override the IP address we'd pic
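
A sketch of how the pieces fit together (node names, attribute name and addresses are made up; the parameter name is the one discussed in this thread):

    # per-node attribute holding the address to use for live migration
    crm node attribute xen-host-1 set migration_ip 192.168.10.1
    crm node attribute xen-host-2 set migration_ip 192.168.10.2
    # Xen resource told which attribute to look up
    crm configure primitive vm1 ocf:heartbeat:Xen \
        params xmfile="/etc/xen/vm/vm1" node_ip_attribute="migration_ip" \
        op monitor interval=30s timeout=30s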

Re: [Linux-HA] OS System update in live cluster ?

2012-09-05 Thread Lars Marowsky-Bree
On 2012-09-05T06:26:50, Stefan Schloesser wrote: > Hi Lars, > > my problem with the rolling upgrade is the drbd partition. If you migrate the > service its data will move too. If you then restart the cluster and migrate > back the data will not be in an upgraded state and thus not match the bi

Re: [Linux-HA] Antw: Duplicate monitor operation on a multi state resource

2012-09-05 Thread Lars Marowsky-Bree
On 2012-09-05T07:54:46, Andrew Beekhof wrote: > > (Or rather, obscure enough to configure that it might well be a > > bug.) It'd be trivial to just append the role to the operation key > > too. (It'd cause a few monitors to be recreated on update, but > > that'd be harmless.) > Not really that tr

Re: [Linux-HA] OS System update in live cluster ?

2012-09-04 Thread Lars Marowsky-Bree
On 2012-09-04T15:56:14, Stefan Schloesser wrote: > Hi, > > I would like to know what the recommended way is to update a cluster. Every > week or so, bug fixes and security patches are released for various parts of > the software used. I prefer rolling upgrades; migrate service, stop cluster,
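
A rough per-node sketch of that rolling approach (commands assume the crm shell on a SLES-style stack; adapt to your distribution, and reboot where a new kernel requires it):

    crm node standby node1     # move resources off this node
    rcopenais stop             # stop the cluster stack on it
    zypper patch               # apply the pending updates
    rcopenais start
    crm node online node1      # rejoin and take resources back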

Re: [Linux-HA] Antw: Re: crmd: [31942]: WARN: decode_transition_key: Bad UUID (crm-resource-25438) in sscanf result (3) for 0:0:crm-resource-25438

2012-09-04 Thread Lars Marowsky-Bree
On 2012-09-04T10:50:11, "EXTERNAL Konold Martin (erfrakon, RtP2/TEF72)" wrote: > I was reporting a serious bug in _your_ product and instead of > thanking for the bugreport you simply closed it as invalid The bug was reported without a support contract. A support contract usually being the pre

Re: [Linux-HA] Antw: Re: Q: Debug clustered IP Adress

2012-08-31 Thread Lars Marowsky-Bree
On 2012-08-31T13:41:14, Ulrich Windl wrote: > Hi! > > There are things I don't understand: Even after > # /usr/lib64/heartbeat/send_arp -i 200 -r 5 br0 172.20.3.59 f1e991b1b951 > not_used not_used > > neither the local arp table (arp) nor the software bridge (brctl ... > showmacs) know anythi

Re: [Linux-HA] Time based resource stickiness example with crm configure ?

2012-08-30 Thread Lars Marowsky-Bree
On 2012-08-30T12:53:45, Stefan Schloesser wrote: > I would like to configure the resource-stickiness to "0" tuesdays between 2 > and 2:20 am local time. > > I could not find any examples on how to do this using crm configure ... but > only the XML snippets to accomplish this. I don't think th
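
For reference, the XML-level construct being referred to looks roughly like this (ids and the exact 20-minute window are illustrative; whether the crm shell of that era can express it directly is a separate question):

    <rsc_defaults>
      <meta_attributes id="tuesday-window" score="2">
        <rule id="tuesday-window-rule" score="0">
          <date_expression id="tuesday-early" operation="date_spec">
            <date_spec id="tuesday-early-spec" weekdays="2" hours="2" minutes="0-19"/>
          </date_expression>
        </rule>
        <nvpair id="tuesday-stickiness" name="resource-stickiness" value="0"/>
      </meta_attributes>
      <meta_attributes id="normal-stickiness-set" score="1">
        <nvpair id="normal-stickiness" name="resource-stickiness" value="100"/>
      </meta_attributes>
    </rsc_defaults>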

Re: [Linux-HA] Antw: Re: Q: Debug clustered IP Adress

2012-08-29 Thread Lars Marowsky-Bree
On 2012-08-29T13:31:05, Ulrich Windl wrote: > > Well, you should see the MAC/IP mapping in the arp table if the host > > is on the same ethernet segment, yes. Otherwise the host doesn't > > know where to send the packets to. > I checked the arp table of the host that is hosting the cluster IP > a

Re: [Linux-HA] Antw: Re: crmd: [31942]: WARN: decode_transition_key: Bad UUID (crm-resource-25438) in sscanf result (3) for 0:0:crm-resource-25438

2012-08-29 Thread Lars Marowsky-Bree
On 2012-08-20T11:31:07, Lars Marowsky-Bree wrote: > Okay, so there's a bug in the NFS agent, point taken. I'll investigate > why it took so long to release as a real maintenance update; you're > right, that shouldn't happen. (I can already see it in the update queue

Re: [Linux-HA] Antw: Re: Q: Debug clustered IP Adress

2012-08-29 Thread Lars Marowsky-Bree
On 2012-08-29T10:15:50, Ulrich Windl wrote: > The network guys say no. Should "arp" show the Cluster-IP? I cannot see it, > so I wonder if something's wrong. Well, you should see the MAC/IP mapping in the arp table if the host is on the same ethernet segment, yes. Otherwise the host doesn't kno
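
A few quick checks from another host on the same segment usually narrow this down (the interface name is illustrative; the address is the one from this thread):

    arping -I eth0 -c 3 172.20.3.59           # does anything answer ARP for the cluster IP?
    ip neigh show | grep 172.20.3.59          # did a MAC entry appear afterwards?
    tcpdump -ni eth0 arp or host 172.20.3.59  # watch ARP requests/replies on the wire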

Re: [Linux-HA] Q: Debug clustered IP Adress

2012-08-27 Thread Lars Marowsky-Bree
On 2012-08-27T12:14:46, Ulrich Windl wrote: > Hi! > > I set up a Clustered Samba Server with SLES11 SP2 according to the manual > "Chapter 18. Samba Clustering". Everything seems to run now, but I cannot > reach the configured clustered IP address from an outside host. Local pings > on the IP

Re: [Linux-HA] How HA can start systemd service.

2012-08-23 Thread Lars Marowsky-Bree
On 2012-08-23T09:35:51, Francis SOUYRI wrote: > Hello Dejan, > > On FC 16, heartbeat is 3.0.4, not a v1. > > I do not use crm because I was able to implement ipfail. Dejan was referring to the "v1 mode", namely the one that uses haresources. haresources can't drive systemd scripts. You

Re: [Linux-HA] Antw: Duplicate monitor operation on a multi state resource

2012-08-22 Thread Lars Marowsky-Bree
On 2012-08-22T10:32:57, RaSca wrote: > Thank you Lars, > In fact, this is what I've done and now everything is ok. But I want to > understand one last thing: if the ID is calculated with the value of > interval then why I don't have errors even if I've got two slaves, which > means that I've got

Re: [Linux-HA] Antw: Duplicate monitor operation on a multi state resource

2012-08-22 Thread Lars Marowsky-Bree
On 2012-08-22T10:08:14, RaSca wrote: > Thank you Ulrich, > As far as you know, Is there a way to override the ID for each cloned > instance of the mysql resource? How can I resolve the problem? Just make the intervals slightly different - 31s, 30s, 29s ... Regards, Lars -- Architect Stor
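
Applied to a master/slave mysql resource, that advice would look roughly like this (names and values are illustrative):

    primitive p_mysql ocf:heartbeat:mysql \
        op monitor interval=30s role=Master \
        op monitor interval=31s role=Slave
    ms ms_mysql p_mysql \
        meta master-max=1 clone-max=2 notify=true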

Re: [Linux-HA] Three clusters with common node

2012-08-22 Thread Lars Marowsky-Bree
On 2012-08-21T15:39:06, Carlos Pedro wrote: > Dear Sirs, > > I'm working on a project > and was asked to build three clusters using a common node, that > is: Nodes cannot be shared between clusters like this. You can either build a >2 node cluster (with all nodes in one), or use virtual i

Re: [Linux-HA] IP Clone

2012-08-21 Thread Lars Marowsky-Bree
On 2012-08-21T13:16:29, David Lang wrote: > with ldirectord you have an extra network hop, and you have all your > traffic going through one system. This is a scalability bottleneck as > well as being a separate system to configure. > > CLUSTERIP isn't the solution to every problem, but it works

Re: [Linux-HA] Many messages form clvmd in SLES11 SP2

2012-08-21 Thread Lars Marowsky-Bree
On 2012-08-21T14:32:53, Ulrich Windl wrote: > Maybe I'm expecting too much, but isn't it possible to simply log "Telling > other nodes that PV blabla is being created"? The problem is the error case, in which we want more logs. There is progress (libqb with the flight recorder/blackbox thingy w

Re: [Linux-HA] IP Clone

2012-08-21 Thread Lars Marowsky-Bree
On 2012-08-21T00:22:00, Dimitri Maziuk wrote: > CLUSTERIP which you presumably mean by "fun with iptables" is basically > "Jack gets all calls from even area codes and Jill: from odd area > codes". Yeah, you could do that, I just can't imagine why. > > Because the commonly given rationale for a

Re: [Linux-HA] Antw: Re: crmd: [31942]: WARN: decode_transition_key: Bad UUID (crm-resource-25438) in sscanf result (3) for 0:0:crm-resource-25438

2012-08-20 Thread Lars Marowsky-Bree
On 2012-08-17T18:14:18, "EXTERNAL Konold Martin (erfrakon, RtP2/TEF72)" wrote: > On the other hand you sofar did not provide any case where SLES11 SP2 runs > reliably unmodified in a mission critical environment (e.g. a HA NFS server) > without local bugfixes. Okay, so there's a bug in the NF

Re: [Linux-HA] Antw: Re: crmd: [31942]: WARN: decode_transition_key: Bad UUID (crm-resource-25438) in sscanf result (3) for 0:0:crm-resource-25438

2012-08-20 Thread Lars Marowsky-Bree
On 2012-08-17T16:42:42, Ulrich Windl wrote: > obviously not, because I have the latest updates installed. It happens > frequently enough to care about it: > > # zgrep sscan /var/log/messages-201208*.bz2 |wc -l > 76 > Here are some: > /var/log/messages-20120816.bz2:Aug 16 13:55:21 so3 crmd: [270

Re: [Linux-HA] Antw: Re: crmd: [31942]: WARN: decode_transition_key: Bad UUID (crm-resource-25438) in sscanf result (3) for 0:0:crm-resource-25438

2012-08-20 Thread Lars Marowsky-Bree
On 2012-08-17T16:38:01, "EXTERNAL Konold Martin (erfrakon, RtP2/TEF72)" wrote: > > I don't see an open bug for something like this right now. > Are you serious? > > It was you who resolved this bug as INVALID in bugzilla > https://bugzilla.novell.com/show_bug.cgi?id=769292. Uhm, yes, I was se

Re: [Linux-HA] Antw: Re: crmd: [31942]: WARN: decode_transition_key: Bad UUID (crm-resource-25438) in sscanf result (3) for 0:0:crm-resource-25438

2012-08-17 Thread Lars Marowsky-Bree
On 2012-08-17T11:43:13, Nikita Michalko wrote: > - e.g. the problem with SLES 11 SP2 kernels crash - the same as described by > Martin: > >> SP2 kernels crash seriously (when a node rejoins the cluster) when using SCTP as > >> recommended in the SLES HA documentation and offered via the wiza

Re: [Linux-HA] Antw: Re: crmd: [31942]: WARN: decode_transition_key: Bad UUID (crm-resource-25438) in sscanf result (3) for 0:0:crm-resource-25438

2012-08-17 Thread Lars Marowsky-Bree
On 2012-08-17T08:19:45, Ulrich Windl wrote: > Likewise if you use resource utilization on primitives in a group, the group > begins to start on one node, then stalls when the next primitive's > utilization cannot be fulfilled. That's bad especially when there are enough > resources for the wh

Re: [Linux-HA] Antw: Re: crmd: [31942]: WARN: decode_transition_key: Bad UUID (crm-resource-25438) in sscanf result (3) for 0:0:crm-resource-25438

2012-08-17 Thread Lars Marowsky-Bree
On 2012-08-17T08:41:15, Nikita Michalko wrote: > I am also testing SP2 - and yes, it's true: not yet ready for production ;-( What problems did you find? Regards, Lars -- Architect Storage/HA SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 21284 (AG Nürnb

Re: [Linux-HA] Antw: Re: crmd: [31942]: WARN: decode_transition_key: Bad UUID (crm-resource-25438) in sscanf result (3) for 0:0:crm-resource-25438

2012-08-17 Thread Lars Marowsky-Bree
On 2012-08-16T17:54:06, "EXTERNAL Konold Martin (erfrakon, RtP2/TEF72)" wrote: Hi Martin, > From my experience with SLES11 SP2 (with all current updates) I conclude that > actually nobody is seriously running SP2 without local bugfixes. That isn't quite true. > E.g. Even the most simple exam

Re: [Linux-HA] How do YOU count netmask bits?

2012-08-16 Thread Lars Marowsky-Bree
On 2012-08-16T09:51:52, Ulrich Windl wrote: > Hi! > > Can somebody explain (found in resource-agents-3.9.2-0.25.5 of SLES11 SP2): > # OCF_RESKEY_ip=172.20.3.99 OCF_RESKEY_cidr_netmask=26 > /usr/lib64/heartbeat/findif -C > eth0 netmask 26 broadcast 172.20.3.127 > > If I guess that 26 bi
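
The numbers do line up: a /26 prefix leaves 32 - 26 = 6 host bits, i.e. 64 addresses per subnet, so 172.20.3.99 falls into 172.20.3.64/26 and the broadcast address is 172.20.3.127 - exactly what findif printed. Spelled out:

    # /26 -> netmask 255.255.255.192 (11111111.11111111.11111111.11000000)
    # 172.20.3.99 AND 255.255.255.192 = 172.20.3.64   (network)
    # 172.20.3.64  +  63              = 172.20.3.127  (broadcast)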

Re: [Linux-HA] Antw: Re: crmd: [31942]: WARN: decode_transition_key: Bad UUID (crm-resource-25438) in sscanf result (3) for 0:0:crm-resource-25438

2012-08-14 Thread Lars Marowsky-Bree
On 2012-08-14T17:48:47, Ulrich Windl wrote: FWIW, if you can try to reproduce in 1.1.7, that may be interesting. I'm still not sure on the sequence of events to cause it, so I can't try locally. hb_report would be the minimum. > Your message arrived, BTW. ;-) It's not that we don't want to hel

Re: [Linux-HA] crmd: [31942]: WARN: decode_transition_key: Bad UUID (crm-resource-25438) in sscanf result (3) for 0:0:crm-resource-25438

2012-08-14 Thread Lars Marowsky-Bree
On 2012-08-14T16:59:02, Ulrich Windl wrote: > While starting a clone resource (mount OCFS2 filesystem), I see this message > in syslog: > > crmd: [31942]: notice: do_lrm_invoke: Not creating resource for a delete > event: (null) > info: notify_deleted: Notifying 25438_crm_resource on rkdvmso1

Re: [Linux-HA] Antw: Re: lrmd: [6136]: ERROR: crm_abort: crm_strdup_fn: Triggered assert at utils.c:1013 : src != NULL

2012-08-14 Thread Lars Marowsky-Bree
On 2012-08-14T12:44:43, Ulrich Windl wrote: > > The messages are coming from the stonith plugin (it's actually > > in pacemaker). But I think that that got fixed in the meantime. > > Do you have the latest maintenance update? > > Yes, "latest" on SLES is relative: > > # rpm -qf

Re: [Linux-HA] Antw: Bond mode for 2 node direct link

2012-07-18 Thread Lars Marowsky-Bree
On 2012-07-18T20:01:35, Arnold Krille wrote: > That would mean that your system runs the same whether one or two links are > present. That's not what I said. What I said (or at least meant ;-) is that, even in the degraded state, the performance must still be within acceptable range. Hence, th

Re: [Linux-HA] Antw: Bond mode for 2 node direct link

2012-07-17 Thread Lars Marowsky-Bree
On 2012-07-17T23:44:13, Arnold Krille wrote: > Additionally: If its two direct links dedicated to your storage network, > there is no reason going active/backup and discarding half of the > available bandwidth. Since the system must be designed for one link to have adequate bandwidth to provide

Re: [Linux-HA] Bond mode for 2 node direct link

2012-07-17 Thread Lars Marowsky-Bree
On 2012-07-16T11:53:55, Volker Poplawski wrote: > Hello everyone. > > Could you please tell me the recommended mode for a bonded network > interface, which is used as the direct link in a two machine cluster? > > There are 'balance-rr', 'active-backup', 'balance-xor', etc. > Which one to choose
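
One common choice for a dedicated back-to-back link is active-backup; a SLES-style sketch (device names and address are illustrative, and whether you can live with losing the extra bandwidth is your call):

    # /etc/sysconfig/network/ifcfg-bond0 (sketch)
    STARTMODE='auto'
    BONDING_MASTER='yes'
    BONDING_MODULE_OPTS='mode=active-backup miimon=100'
    BONDING_SLAVE0='eth1'
    BONDING_SLAVE1='eth2'
    IPADDR='192.168.100.1/24'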

Re: [Linux-HA] Pacemaker and software RAID using shared storage.

2012-07-12 Thread Lars Marowsky-Bree
On 2012-07-12T10:31:53, Caspar Smit wrote: > Now the interesting part. I would like to create a software raid6 set > (or multiple) with the disks in the JBOD and have the possibility to > use > the raid6 in an active/passive cluster. Sure. md RAID in a fail-over configuration is managed by the R
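
A sketch of what such a fail-over setup can look like (device paths, mount point and resource names are illustrative):

    primitive p_md0 ocf:heartbeat:Raid1 \
        params raidconf="/etc/mdadm.conf" raiddev="/dev/md0" \
        op monitor interval=60s timeout=60s
    primitive p_fs ocf:heartbeat:Filesystem \
        params device="/dev/md0" directory="/srv/data" fstype="xfs" \
        op monitor interval=60s timeout=60s
    group g_storage p_md0 p_fs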

Re: [Linux-HA] Antw: Re: mount.ocfs2 in D state

2012-07-03 Thread Lars Marowsky-Bree
On 2012-07-03T11:26:11, darren.mans...@opengi.co.uk wrote: > I'd like to second Lars' comments here. I was strong-armed into doing a > dual-primary DRBD + OCFS2 cluster and it's a nightmare to manage. There's no > reason for us to do it other than 'we could'. It just needed something simple > l

Re: [Linux-HA] Antw: Re: mount.ocfs2 in D state

2012-07-02 Thread Lars Marowsky-Bree
On 2012-07-02T12:37:52, Ulrich Windl wrote: > > I've seen very few scenarios where OCFS2 was worth it over just using a > > "regular" file system like XFS in a fail-over configuration in this kind > > of environment. > How would you fail over if your shared storage went toast? Or did you mean >

Re: [Linux-HA] Antw: Re: mount.ocfs2 in D state

2012-07-02 Thread Lars Marowsky-Bree
On 2012-07-02T12:05:33, Ulrich Windl wrote: > Unfortunately unless there's a real cluster filesystem that supports > mirroring with shared devices also, DRBD on some locally mirrored device on > each node seems to be the only alternative. (Talking about disasters) I've seen very few scenarios

Re: [Linux-HA] mount.ocfs2 in D state

2012-07-02 Thread Lars Marowsky-Bree
On 2012-07-02T10:42:33, "EXTERNAL Konold Martin (erfrakon, RtP2/TEF72)" wrote: > when a split brain (drbd) happens mount.ocfs2 remains hanging unkillable in > D-state. Unsurprising, since all IO is frozen during that time (depending on your drbd setup, but I'm assuming that's what you are seei

Re: [Linux-HA] OCFS2 - Renew node's IP address which has failed - Amazon EC2

2012-06-29 Thread Lars Marowsky-Bree
On 2012-06-28T11:37:37, Heitor Lessa wrote: > This issue happens because OCFS does not support changing (modifying/deleting) nodes > in a running cluster; such tasks require the cluster to be down. If driven by Pacemaker, OCFS2 does support adding/removing nodes at runtime. (Though if you run out of nod

Re: [Linux-HA] Antw: Re: cib_process_diff: ... Failed application of an update diff

2012-06-29 Thread Lars Marowsky-Bree
On 2012-06-29T08:19:41, Ulrich Windl wrote: > > For SLE HA 11 SP1, please report these issues to NTS and SUSE support. > As I'm sure they won't fix it in SP1 (that PTF is one year old now), SP1 is still supported by SUSE, and no one but our support folks know what exactly is in that PTF. I mean,

Re: [Linux-HA] cib_process_diff: ... Failed application of an update diff

2012-06-27 Thread Lars Marowsky-Bree
On 2012-06-27T14:18:26, Ulrich Windl wrote: > Hello, > > I see problems with applying configuration diffs so frequently that I suspect > there's a bug in the code. > > This is for SLES11 SP1 on x86_64 with corosync-1.4.1-0.3.3.3518.1.PTF.712037 > and libcorosync4-1.4.1-0.3.3.3518.1.PTF.712037

Re: [Linux-HA] Antw: Re: ocf:heartbeat:exportfs multiple exports, fsid, wait_for_leasetime_on_stop

2012-06-21 Thread Lars Marowsky-Bree
On 2012-06-21T08:02:25, Ulrich Windl wrote: > > See, it's simple. Any "partially" completed operation or state -> not > > successful, ergo failure must be reported. > Is it correct that the standard recovery procedure for this failure is node > fencing then? If so it makes things worse IMHO. Th

Re: [Linux-HA] What's the meaning of "... Failed application of an update diff"

2012-06-20 Thread Lars Marowsky-Bree
On 2012-06-20T17:46:19, Andreas Kurz wrote: > > hb_report does not work. > > how to do a report tarball ? > It has been renamed to "crm_report" There's still both around. Just that different distributions ship different implementations. Because. Well. Because. Regards, Lars -- Archite

Re: [Linux-HA] Antw: Re: ocf:heartbeat:exportfs multiple exports, fsid, wait_for_leasetime_on_stop

2012-06-20 Thread Lars Marowsky-Bree
On 2012-06-20T16:37:35, Ulrich Windl wrote: > so what exit code is failed? Then: With the standard logic of "stop" > only performing when the resource is up (i.e. monitor reports > "stopped"), a partially started resource that the monitor considers > "stopped" may fail to be cleanly stopped on "s

Re: [Linux-HA] Antw: Re: ocf:heartbeat:exportfs multiple exports, fsid, wait_for_leasetime_on_stop

2012-06-20 Thread Lars Marowsky-Bree
On 2012-06-20T08:44:33, Ulrich Windl wrote: > > > The problem is: What to do if "1 out of n" exports fails: Is the resource > > > "started" or "stopped" then. Likewise for unexporting and monitoring. > > If the operation partially failed, it is failed. > > But to have a "clean stopped", the res

Re: [Linux-HA] Antw: Re: ocf:heartbeat:exportfs multiple exports, fsid, wait_for_leasetime_on_stop

2012-06-19 Thread Lars Marowsky-Bree
On 2012-06-19T14:13:06, Ulrich Windl wrote: > The problem is: What to do if "1 out of n" exports fails: Is the resource > "started" or "stopped" then. Likewise for unexporting and monitoring. If the operation partially failed, it is failed. Regards, Lars -- Architect Storage/HA SUSE LIN

Re: [Linux-HA] What's the meaning of "... Failed application of an update diff"

2012-06-19 Thread Lars Marowsky-Bree
On 2012-06-19T08:38:11, alain.mou...@bull.net wrote: > So that means that my modifications by crm configure edit , even if they > are correct (I've re-checked them) , > have potentially corrupt the Pacemaker configuration ? No. The CIB automatically recovers from this by doing a full sync. The m

Re: [Linux-HA] Does globally-unique make sense on filesystems cloned resources?

2012-06-06 Thread Lars Marowsky-Bree
On 2012-06-06T17:26:41, RaSca wrote: > Thank you Florian, but how can one declare an anonymous clone? Is it > implicit with the globally-unique=false? You don't need to explicitly declare that. It is the default. (But yes, the default is globally-unique=false.) Regards, Lars -- Architec

Re: [Linux-HA] Question about stacks .

2012-06-04 Thread Lars Marowsky-Bree
On 2012-06-01T13:10:17, alain.mou...@bull.net wrote: > -does that mean that it will be this Pacemaker/cman on RH and SLES ? > -or will RH and SLES require a different stack under Pacemaker ? Right now, SLE HA is on the plugin version of pacemaker, and SLE HA 11 will likely remain on it - that'

Re: [Linux-HA] Can /var/lib/pengine files be deleted at boot?

2012-05-16 Thread Lars Marowsky-Bree
On 2012-05-15T13:17:11, William Seligman wrote: > I can post details and logs and whatnot, but I don't think I need to do > detailed > debugging. My question is: I don't think your rationale holds true, though. Like Andrew said, this is only ever just written, not read. > If I were to set up a

Re: [Linux-HA] Cluster node hanging upon access to ocfs2 fs when second cluster node dies ?

2012-04-04 Thread Lars Marowsky-Bree
On 2012-04-04T11:28:31, Rainer Krienke wrote: > There is one basic thing however I do not understand: My setup involves > only a clustered filesystem. What I do not understand is why a stonith > resource is needed at all in this case which after all causes freezes > of the cl-filesystem dependin

Re: [Linux-HA] Cluster node hanging upon access to ocfs2 fs when second cluster node dies ?

2012-04-03 Thread Lars Marowsky-Bree
On 2012-04-03T15:59:00, Rainer Krienke wrote: > Hi Lars, > > this was something I detected already. And I changed the timeout in the > cluster configuration to 200sec. So the log I posted was the result of > the configuration below (200sec). Is this still too small? > > $ crm configure show > ..

Re: [Linux-HA] Cluster node hanging upon access to ocfs2 fs when second cluster node dies ?

2012-04-03 Thread Lars Marowsky-Bree
On 2012-04-03T15:50:29, Rainer Krienke wrote: > rzinstal4:~ # sbd -d /dev/disk/by-id/scsi-259316a7265713551-part1 dump > ==Dumping header on disk /dev/disk/by-id/scsi-259316a7265713551-part1 > Header version : 2 > Number of slots: 255 > Sector size: 512 > Timeout (watchdog) : 90 >
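
The usual relationship between these numbers (as generally recommended for sbd; verify against your own dump output) is msgwait roughly twice the watchdog timeout, with the cluster's stonith-timeout comfortably above msgwait. A sketch with illustrative values:

    # re-create the header with explicit timeouts (destroys existing slot data!)
    sbd -d /dev/disk/by-id/scsi-... -1 90 -4 180 create
    # give the cluster more time than msgwait before it gives up on fencing
    crm configure property stonith-timeout=220s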

Re: [Linux-HA] Cluster node hanging upon access to ocfs2 fs when second cluster node dies ?

2012-04-03 Thread Lars Marowsky-Bree
On 2012-04-03T14:06:44, Rainer Krienke wrote: > thanks for the hint to enable the stonith resource. I did and checked > that it is set to true now, but after all the behaviour of the cluster > is still the same, if I do a halt -f on one node. > Access on the clusterfilesystem on the still running

Re: [Linux-HA] Cluster node hanging upon access to ocfs2 fs when second cluster node dies ?

2012-04-03 Thread Lars Marowsky-Bree
On 2012-04-03T10:32:48, Rainer Krienke wrote: Hi Rainer, > I am new to HA setup and my first try was to set up a HA cluster (using > SLES 11 SP2 and the SLES11 SP2 HA extension) that simply offers an > OCFS2 filesystem. I did the setup according to the SLES 11 SP2 HA > manual, that describes th

Re: [Linux-HA] ERROR: do_recover: Action A_RECOVER (0000000001000000) not supported

2012-03-29 Thread Lars Marowsky-Bree
On 2012-03-29T11:31:38, Ulrich Windl wrote: > pengine: [17043]: WARN: pe_fence_node: Node h07 will be fenced because it is > un-expectedly down > > The software being used is basically SLES11 SP1 with a newer corosync > (corosync-1.4.1-0.3.3.3518.1.PTF.712037). Were there any improvements since

Re: [Linux-HA] clvm/dlm/gfs2 hangs if a node crashes

2012-03-15 Thread Lars Marowsky-Bree
On 2012-03-15T15:59:21, William Seligman wrote: > Could this be an issue? I've noticed that my fencing agent always seems to be > called with "action=reboot" when a node is fenced. Why is it using 'reboot' > and > not 'off'? Is this the standard, or am I missing a definition somewhere? Make sur
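
'reboot' is the stock default for the cluster-wide stonith-action property, so seeing it is expected unless it was changed; if power-off semantics are wanted instead, the property can be set explicitly (sketch):

    crm configure property stonith-action=off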

Re: [Linux-HA] clvm/dlm/gfs2 hangs if a node crashes

2012-03-14 Thread Lars Marowsky-Bree
On 2012-03-14T18:22:42, William Seligman wrote: > Now consider a primary-primary cluster. Both run the same resource. > One fails. There's no failover here; the other box still runs the > resource. In my case, the only thing that has to work is cloned > cluster IP address, and that I've verified

Re: [Linux-HA] clvm/dlm/gfs2 hangs if a node crashes

2012-03-14 Thread Lars Marowsky-Bree
On 2012-03-14T11:41:53, William Seligman wrote: > I'm mindful of the issues involved, such as those Lars Ellenberg > brought up in his response. I need something that will failover with > a minimum of fuss. Although I'm encountering one problem after > another, I think I'm closing in on my goal.

Re: [Linux-HA] clvm/dlm/gfs2 hangs if a node crashes

2012-03-14 Thread Lars Marowsky-Bree
On 2012-03-14T09:02:59, William Seligman wrote: To ask a slightly different question - why? Does your workload require / benefit from a dual-primary architecture? Most don't. Regards, Lars -- Architect Storage/HA SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer,

Re: [Linux-HA] Antw: Re: FW: How DC is selected?

2012-02-06 Thread Lars Marowsky-Bree
On 2012-02-06T22:13:20, Mayank wrote: > with-rsc="pgsql9" with-rsc-role="Master"/> > > The intention behind defining such constraints is to make sure that the > postgre should always run in the master role on the node which is a DC. > > Is something wrong with this? There's nothing wrong with

Re: [Linux-HA] Antw: Re: FW: How DC is selected?

2012-02-06 Thread Lars Marowsky-Bree
On 2012-02-06T09:05:13, Ulrich Windl wrote: > but like with CPU affinity there should be no needless change of the DC. I > also wondered why after each configuration change the DC is newly elected (it > seems). It isn't (or shouldn't be). Still, the DC election is an internal detail that shoul

Re: [Linux-HA] Antw: Re: Q: "IPC Channel to 9858 is not connected"

2011-12-08 Thread Lars Marowsky-Bree
On 2011-12-08T12:08:06, Ulrich Windl wrote: > >>> Dejan Muhamedagic wrote on 08.12.2011 at 11:28 in > message <20111208102833.GA12338@walrus.homenet>: > > Hi, > > > > On Wed, Dec 07, 2011 at 02:26:52PM +0100, Ulrich Windl wrote: > > > Hi! > > > > > > While the openais cluster (SLES11 SP1)

Re: [Linux-HA] disconnecting network of any node cause both nodes fenced

2011-12-06 Thread Lars Marowsky-Bree
On 2011-12-05T22:37:03, Andreas Kurz wrote: > Did you clone the sbd resource? If yes, don't do that. Start it as a > primitive, so in case of a split brain at least one node needs to start > the stonith resource which should give the other node an advantage ... > adding a start-delay should furth
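
Something along the lines of the quoted advice (a sketch; the delay and monitor interval are arbitrary):

    primitive stonith-sbd stonith:external/sbd \
        op start interval=0 timeout=60s start-delay=10s \
        op monitor interval=3600s timeout=60s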

Re: [Linux-HA] Light Weight Quorum Arbitration

2011-12-06 Thread Lars Marowsky-Bree
On 2011-12-04T00:57:05, Andreas Kurz wrote: > the concept of an arbitrator for split-site clusters is already > implemented and should be available with Pacemaker 1.1.6, though it seems > to be "not directly documented" ... beside source code and this draft > document: Documentation is always a wor

Re: [Linux-HA] Antw: Re: Q: "cib-last-written"

2011-12-03 Thread Lars Marowsky-Bree
On 2011-12-01T13:48:56, Ulrich Windl wrote: > I wonder about that usefulness of that value, especially as any configuration > change seems to increase the epoch anyway. I never saw that CRM cares about > the cib-last-written string. It is for easy inspection by admins and for display by UIs.

Re: [Linux-HA] Antw: Re: Q: unmanaged MD-RAID & auto-recovery

2011-11-29 Thread Lars Marowsky-Bree
On 2011-11-29T12:36:39, Dimitri Maziuk wrote: > If you repeatedly try to re-sync with a dying disk, with each resync > interrupted by i/o error, you will get data corruption sooner or later. No, you shouldn't. (Unless the drive returns faulty data on read, which is actually a pretty rare failure

Re: [Linux-HA] is it good to create order constraint for sbd resource

2011-11-29 Thread Lars Marowsky-Bree
On 2011-11-29T22:10:10, Andreas Kurz wrote: > IIRC stonith resources are always started first and stopped last anyways > ... without extra constraints ... implicitly. Please someone correct me > if I'm wrong. Yes, but they are not mandatory. The configuration that was discussed here would actual

Re: [Linux-HA] Antw: Re: Q: unmanaged MD-RAID & auto-recovery

2011-11-29 Thread Lars Marowsky-Bree
On 2011-11-29T08:33:01, Ulrich Windl wrote: > The state of an unmanaged resource is the state when it left the managed > meta-state. That is not correct. An unmanaged resource is not *managed*, but its state is still relevant to other resources that possibly depend on it. The original design g

Re: [Linux-HA] Pacemaker : how to modify configuration ?

2011-11-29 Thread Lars Marowsky-Bree
On 2011-11-28T15:04:45, alain.mou...@bull.net wrote: > sorry but I forgot if there is another way than "crm configure edit" to > modify > all the values of on-fail="" for all resources in the configuration ? If they're explicitly set, you have to modify them all. Otherwise, look at op_defau
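
If a single global default is acceptable, op_defaults is the place to look rather than editing every resource (sketch; check that your Pacemaker version honours on-fail there):

    crm configure op_defaults on-fail=restart timeout=60s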

Re: [Linux-HA] Antw: Re: is it good to create order constraint for sbd resource

2011-11-29 Thread Lars Marowsky-Bree
On 2011-11-29T08:35:10, Ulrich Windl wrote: > While we're at it: Does specifying a priority implicitly create a start > order, or is that just a start preference? Maybe if the sbd is not handled > specially, it may be a good idea to give sbd a higher priority than anything > else, at least if

Re: [Linux-HA] Stonith SBD not fencing nodes

2011-11-23 Thread Lars Marowsky-Bree
On 2011-11-24T14:46:13, Andrew Beekhof wrote: > >> Looks like you forgot to specify the sbd_device parameter. > > That is no longer necessary. It'll inherit the settings from > > /etc/sysconfig/sbd. > Perhaps its not set there then? > Its the only difference I can see between the two types of cal

Re: [Linux-HA] Stonith SBD not fencing nodes

2011-11-23 Thread Lars Marowsky-Bree
On 2011-11-24T11:14:05, Andrew Beekhof wrote: > > Relevant portions of crm config: > > primitive stonith-sbd stonith:external/sbd \ > >        meta is-managed="true" target-role="Started" > > Looks like you forgot to specify the sbd_device parameter. That is no longer necessary. It'll inherit t
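
For reference, the sysconfig file it falls back to looks roughly like this (device path is illustrative):

    # /etc/sysconfig/sbd (sketch)
    SBD_DEVICE="/dev/disk/by-id/scsi-...-part1"
    SBD_OPTS="-W"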
