Re: [Linux-HA] Backing out of HA

2013-07-02 Thread Lars Marowsky-Bree
On 2013-07-01T16:31:13, William Seligman wrote: > a) people can exclaim "You fool!" and point out all the stupid things I did > wrong; > > b) sysadmins who are contemplating the switch to HA have additional points to > add to the pros and cons. I think you bring up an important point that I al

Re: [Linux-HA] disallowing concurrent configuration (CIB modifications)

2013-06-27 Thread Lars Marowsky-Bree
On 2013-06-27T12:40:49, Dejan Muhamedagic wrote: > > It would be nice to have an intelligent "patcher" which takes one CIB > > snapshot at the beginning of edit, then generates a diff and checks whether it > > applies to the current CIB cleanly (all except epoch). Then it would be > > possible to use cur
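The "patcher" wished for above can largely be approximated with the stock Pacemaker CLI tools; a sketch, assuming `crm_diff` and `cibadmin` are available in this form on the installed version (file names are illustrative):

```shell
# Snapshot the live CIB before editing (file names are illustrative)
cibadmin --query > cib-before.xml
cp cib-before.xml cib-edited.xml
${EDITOR:-vi} cib-edited.xml           # make changes offline

# Generate an XML diff between the snapshot and the edited copy
crm_diff --original cib-before.xml --new cib-edited.xml > cib.patch

# Apply the diff against the *current* CIB; if the CIB changed
# underneath you, applying the patch fails instead of clobbering
# concurrent edits -- which is the conflict detection asked for above
cibadmin --patch --xml-file cib.patch
```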

Re: [Linux-HA] How/Where does ocf_log log?

2013-06-26 Thread Lars Marowsky-Bree
On 2013-06-26T19:48:41, Tony Stocker wrote: > Where exactly do calls using 'ocf_log' log the information? For instance > with the call: > > ocf_log info "Starting $InstanceName" > > inside a start function, where (what file) should I be looking for this > output? /var/log/messages. Tho
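`ocf_log` is a helper from the shellfuncs library shipped with resource-agents; it forwards to the cluster's logging setup, which on most default configurations ends up in syslog (hence /var/log/messages). A sketch of how an agent typically pulls it in — the shellfuncs path varies by distribution, and `$InstanceName` is the poster's own variable:

```shell
#!/bin/sh
# Agent fragment: source the OCF shell function library first, otherwise
# ocf_log is undefined. The path below is common but distribution-specific.
: ${OCF_FUNCTIONS_DIR:=${OCF_ROOT:-/usr/lib/ocf}/lib/heartbeat}
. "${OCF_FUNCTIONS_DIR}/ocf-shellfuncs"

my_start() {
    # Goes to syslog via the cluster's configured log facility,
    # not to a file the agent chooses itself.
    ocf_log info "Starting $InstanceName"
}
```

Which file actually receives the message depends on the cluster's logging configuration (log facility in ha.cf/corosync.conf plus the syslog rules), not on the agent.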

Re: [Linux-HA] Antw: Re: Master/Slave status check using crm_mon

2013-06-13 Thread Lars Marowsky-Bree
On 2013-06-14T01:22:51, John M wrote: > Heartbeat is not restarting the failed process. In my > configuration default-resource-failure-stickiness is set to -INFINITY > and resource_failure_stickiness is set to -INFINITY at resource level. > If a Master resource fails, slave becomes Master and If

Re: [Linux-HA] "Use of uninitialized value in numeric gt (>) at /usr/sbin/ldirectord line 4037."

2013-06-13 Thread Lars Marowsky-Bree
On 2013-06-13T13:57:15, Ulrich Windl wrote: > Hi! > > In SLES11 SP2 ldirectord (ldirectord-3.9.4-0.26.84) in debug mode shows the > message: > Use of uninitialized value in numeric gt (>) at /usr/sbin/ldirectord line > 4037. Please file a bug. Thanks. -- Architect Storage/HA SUSE LINUX Pr

Re: [Linux-HA] Antw: Last call: removal of ocf:heartbeat:drbd in favor of ocf:linbit:heartbeat

2013-06-12 Thread Lars Marowsky-Bree
On 2013-06-12T10:45:35, David Vossel wrote: > > Besides "merge" or "remove" you could move it to "ocf:unsupported:drbd" ;-) > > I'd like it to disappear entirely. The drbd agent is supported, just not by > the heartbeat provider. If there wasn't a duplicate agent already, this > would make se

Re: [Linux-HA] Last call: removal of ocf:heartbeat:drbd in favor of ocf:linbit:heartbeat

2013-06-11 Thread Lars Marowsky-Bree
On 2013-06-11T11:15:09, David Vossel wrote: > Unless someone steps forward with a good argument against the removal of the > heartbeat:drbd agent, the following pull request is going to be merged a week > from today. (Tuesday 18th) > > https://github.com/ClusterLabs/resource-agents/pull/244 Y

Re: [Linux-HA] Master/Slave status check using crm_mon

2013-06-11 Thread Lars Marowsky-Bree
On 2013-06-11T15:05:11, John M wrote: > Unfortunately I cannot install pacemaker :( > > I just installed heartbeat 2.1.4 and in crm_mon I am getting > Master/Slave status. You seriously need to upgrade. Heartbeat 2.1.4 is ages old and has many, many known bugs. You'll not be able to sec

Re: [Linux-HA] Antw: Group/Set priority in low resource availability scenarios

2013-06-11 Thread Lars Marowsky-Bree
On 2013-06-11T08:53:30, Ulrich Windl wrote: > > This is no longer true. Utilization of resources within a group is now > > summed up, and that will also lead to the group being moved in the > > former case. > Interesting: Since when is this effective? We had some resource groups where > the "fatt

Re: [Linux-HA] Antw: Group/Set priority in low resource availability scenarios

2013-06-10 Thread Lars Marowsky-Bree
On 2013-06-05T15:36:47, Ulrich Windl wrote: > Also set placement-strategy="utilization" then. If you put > "utilization big_thing=1" into each fat primitive, your server won't > be overloaded, but groups aren't moved if only one primitive cannot be > run (a bug IMHO). I'm unsure whether you can

Re: [Linux-HA] Group/Set priority in low resource availability scenarios

2013-06-10 Thread Lars Marowsky-Bree
On 2013-06-05T12:24:28, Tony Stocker wrote: > Lars, can the priority be set by a resource group or does it have to be set > only at the primitive level? Yes. But that's mostly equivalent to setting them on the first resource in the group. > I don't think utilization will help simply because we
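`priority` is a per-resource meta attribute, which is why setting it on a group is mostly equivalent to setting it on the group's members; a crmsh sketch with illustrative names (exact syntax depends on the crmsh version of that era):

```shell
# crmsh fragments; "grp_master" is an illustrative group name
crm resource meta grp_master set priority 10
# equivalent at definition time:
#   group grp_master p_ip p_daemon meta priority="10"
```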

Re: [Linux-HA] Group/Set priority in low resource availability scenarios

2013-06-05 Thread Lars Marowsky-Bree
On 2013-06-05T12:09:31, Tony Stocker wrote: > Four (4) of my resource groups are related in that one of them is a process > master which farms out jobs to three (3) computing nodes (all of these are > running software designed in-house). Now, without that process master node > the computing node

Re: [Linux-HA] Manage a resource dependent on the location of other resources

2013-06-03 Thread Lars Marowsky-Bree
On 2013-06-03T16:07:33, Thomas Schulte wrote: > If the ldirectord and a managed service (let's say vsftpd) are on the > same node, everything is fine. > But if vsftpd is on a different node, I need a ocf:heartbeat:Route > resource to set a special gateway in a separate routing table > on both n

Re: [Linux-HA] custom script status)

2013-06-03 Thread Lars Marowsky-Bree
On 2013-06-01T01:14:19, Mitsuo Yazawa wrote: > Hi guys, > > I don't really understand how status work for custom scripts. > > My goal is to check many different status (for now I was just testing a > simple one), so I can make other node to take command when one-node fails. > I never see on th

Re: [Linux-HA] corosync crash (1.4.3) in memcpy

2013-05-31 Thread Lars Marowsky-Bree
On 2013-05-31T09:32:13, Ulrich Windl wrote: > Hi! > > I just discovered a 134MB core-dump of corosync on x86_64 (SLES11 SP2). The > backtrace looks like this: > Core was generated by `/usr/sbin/corosync'. That's not good. Please file a bugzilla. > Program terminated with signal 11, Segmentatio

Re: [Linux-HA] vm live migration without shared storage

2013-05-24 Thread Lars Marowsky-Bree
On 2013-05-24T12:41:18, David Vossel wrote: > I would think vm disk IO would be improved if the image didn't live on shared > storage. This I doubt. Writing to shared storage (like a SAN, for example) is typically quite fast; the OCFS2/GFS2/cLVM2 overhead tends to show up briefly when creating/

Re: [Linux-HA] LVM Resource agent, "exclusive" activation

2013-05-24 Thread Lars Marowsky-Bree
On 2013-05-15T13:50:45, Lars Ellenberg wrote: Are we, in this discussion, perhaps losing the focus on the base submission of the code merge? Can we separate that (IMHO rather worthwhile) patch set from the exclusive activation part? (Which I happen to have no strong opinion on, unless it is alre

Re: [Linux-HA] Problem with crm shadow CIB's

2013-05-24 Thread Lars Marowsky-Bree
On 2013-05-22T13:20:06, Tony Stocker wrote: > > Version Info: > OS: CentOS 6.4 > Kernel (current): 2.6.32-358.6.2.el6.x86_64 > Pacemaker: 1.1.8-7.el6 > Corosync: 1.4.1-15.el6_4 > CRMSH: 1.2.5-55.4 > > I was attempting to crea

Re: [Linux-HA] vm live migration without shared storage

2013-05-24 Thread Lars Marowsky-Bree
On 2013-05-23T15:00:55, David Vossel wrote: > So, do we want to use this in HA? It would be trivial to add this to > the VirtualDomain resource agent, but does it make sense to do this? > Migration time, depending on network speed and hardware, is much > longer than the shared storage option (min

Re: [Linux-HA] Unable to Configure Heartbeat on RHEL-6.4 64 Bit

2013-05-22 Thread Lars Marowsky-Bree
On 2013-05-22T00:23:35, Digimer wrote: > 2. Pacemaker is the future of HA clustering. It is expected that, in > RHEL 7, Pacemaker will replace RHCS. So there is a strong argument to > start with pacemaker now. > > The main problem that might push you away from Pacemaker (and what keeps > me p

Re: [Linux-HA] Antw: Re: is delayed fencing possible?

2013-05-15 Thread Lars Marowsky-Bree
On 2013-05-15T14:19:45, Ulrich Windl wrote: > > All resources are *required* to survive a fence, otherwise it wouldn't > > be reliable. So the fence doesn't actually hurt. > ...except killing innocent resources. Remember: High availability is not > about restarting resources frequently, but to k

Re: [Linux-HA] Antw: Re: is delayed fencing possible?

2013-05-15 Thread Lars Marowsky-Bree
On 2013-05-15T14:00:45, Ulrich Windl wrote: > If the node is going to be fenced, try to stop or migrate all local resources > until each has succeeded or timed out. THEN reset the node > > How could it be done? It sounds like a reasonable default to me... We did that once. A faulty node then

Re: [Linux-HA] adding a monitor operation online

2013-05-15 Thread Lars Marowsky-Bree
On 2013-05-15T11:29:59, Ferenc Wagner wrote: > > Hard to say without more info. > Got it: adding a monitor operation online apparently also requires an > implemented reload action. No. This is not true. You're supplying more, but the wrong information ;-) A crm_/hb_report covering the period of

Re: [Linux-HA] Possible bug in crm shell's rename

2013-05-15 Thread Lars Marowsky-Bree
On 2013-05-15T09:53:31, Ulrich Windl wrote: > Hi! > > I think there is a bug in crm shell's "rename": > I renamed a colocation constraint named "col_Xen_VMs_cfs2" to > "col_Xen_VMs_cfs_mir" > > After that I see: > > (...) > > I wonder: Shouldn't the resource_set id be renamed as well? N

Re: [Linux-HA] LVM Resource agent, "exclusive" activation

2013-05-14 Thread Lars Marowsky-Bree
On 2013-05-14T15:59:05, Dejan Muhamedagic wrote: > > Appart from a larger restructuring of the code, > > For which I'd still like to hear a good explanation. Mind, this > is not to say that this or that version is better, but that such > a huge change calls for a serious effort on part of the >

Re: [Linux-HA] LVM Resource agent, "exclusive" activation

2013-05-14 Thread Lars Marowsky-Bree
On 2013-05-14T09:54:55, David Vossel wrote: > Here's what it comes down to. You aren't guaranteed exclusive > activation just because pacemaker is in control. There are scenarios > with SAN disks where the node starts up and can potentially attempt to > activate a volume before pacemaker has ini

Re: [Linux-HA] help(no crm command)

2013-05-14 Thread Lars Marowsky-Bree
On 2013-05-12T12:33:05, zhangyan wrote: > Hello. My system OS is CentOS 6.4 x64. When I install corosync and pacemaker > using this command (as the official document guides): > yum -y install corosync* pacemaker* > but I can't find the crm command. Why? RHT has decided not to ship the crm shell, but to ba

Re: [Linux-HA] resource agent for multiple IPs?

2013-05-13 Thread Lars Marowsky-Bree
On 2013-05-13T15:46:40, Robert Sander wrote: > Is there a heartbeat cluster resource agent that is able to manage > multiple IP addresses at once? > > When we configure IP addresses as cluster resources with > ocf:heartbeat:IPaddr2 it takes about a second for every addresses to > switch to anoth
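The truncated answer aside, the usual pattern for several addresses is one IPaddr2 primitive per address, collected into a group — which serializes the roughly one-second-per-address cost the poster describes. A crmsh sketch with illustrative names and RFC 5737 documentation addresses:

```shell
# crmsh fragment; names and addresses are illustrative
crm configure primitive ip_a ocf:heartbeat:IPaddr2 \
    params ip="192.0.2.10" cidr_netmask="24"
crm configure primitive ip_b ocf:heartbeat:IPaddr2 \
    params ip="192.0.2.11" cidr_netmask="24"
crm configure group g_ips ip_a ip_b
```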

Re: [Linux-HA] Antw: Re: Q: sbd using libxslt in SLES11 SP2?

2013-05-03 Thread Lars Marowsky-Bree
On 2013-05-03T07:24:26, Ulrich Windl wrote: > I wonder how they could fly to the moon without XML ;-) Easier. ;-) But - so, sbd needs to subscribe to CIB updates and parse the CIB to figure out what state the cluster is in. That makes the XML dependency pretty obvious. Regards, Lars --

Re: [Linux-HA] Q: sbd using libxslt in SLES11 SP2?

2013-05-02 Thread Lars Marowsky-Bree
On 2013-05-02T09:36:58, Ulrich Windl wrote: > Hi! > > After installing (among others) a libxslt security fix for SLES11 SP2, > "zypper ps" tells me sbd is using libxslt. I'm surprised: I thought sbd was > just processing simple command line arguments... For the pacemaker integration, sbd lin

Re: [Linux-HA] Network failover and communication channel survival

2013-04-30 Thread Lars Marowsky-Bree
On 2013-04-30T13:58:02, Richard Comblen wrote: > Now, a new requirement shows up: the two nodes should be connected > using two physically separated networks, and should survive failure of > one of the two networks. > > The two nodes communicate together for PostgreSQL replication. > > Initiall

Re: [Linux-HA] Usage of SAPDatabase resource agent without SAPHostAgent is deprecated

2013-04-29 Thread Lars Marowsky-Bree
On 2013-04-26T23:56:04, Muhammad Sharfuddin wrote: > I just upgraded from SP1(SLES 11/HAE) to SP2, and getting following > messages when I start the SAPDBInstance resource: > > SAPDatabase(SAPDBInstance)[22587]: [22602]: WARNING: Usage of > SAPDatabase resource agent without SAPHostAgent is de

Re: [Linux-HA] Is CLVM really needed in an active/passive cluster?

2013-04-24 Thread Lars Marowsky-Bree
On 2013-04-23T17:57:18, "Angel L. Mateo" wrote: > In other configuration scenarios (2) using clvm, a dlm and sfex (3) > resource is also created. Should I configure it too? "sfex" is not a fencing device, but a resource agent that offers additional protection in the hierarchy - it is some

Re: [Linux-HA] Antw: Re: When mirroring is a CPU-intensive task, something must be wrong...

2013-04-22 Thread Lars Marowsky-Bree
On 2013-04-22T15:18:20, Ulrich Windl wrote: > That's where OCFS2 comes in: When hosting Xen-VM images, multiple (we just > use the trivial two-node cluster) hosts may access one LV (i.e.: one > filesystem), while access (if OCFS2 does allocate the data in big extents) to > specific areas is mos

Re: [Linux-HA] Antw: Re: Is CLVM really needed in an active/passive cluster?

2013-04-22 Thread Lars Marowsky-Bree
On 2013-04-22T15:30:34, Ulrich Windl wrote: > There's nothing wrong with Linux; I only have a problem if products of "beta" > quality are shipped as enterprise solutions. Enterprise customers want to USE > the software, not debug it... You can use it, but the constraints are not necessarily re

Re: [Linux-HA] Antw: Re: Is CLVM really needed in an active/passive cluster?

2013-04-22 Thread Lars Marowsky-Bree
On 2013-04-22T12:13:50, Ulrich Windl wrote: > > In that thread I can see that cLVM does not support snapshot. There is > Oops! I didn't know that. Maybe cLVM (Clustered LVM) should be renamed to > lLVM (less LVM)... ;-) cLVM2 doesn't support snapshots because the snapshot format is not we

Re: [Linux-HA] Antw: Re: When mirroring is a CPU-intensive task, something must be wrong...

2013-04-22 Thread Lars Marowsky-Bree
On 2013-04-22T08:47:26, Ulrich Windl wrote: > > They're not in "D" because they are not waiting on disk IO, but have > > a lot of network IO and data structure maintenance to handle. > Interesting: While flooding a Gb network, the achieved mirroring rate is only > about 60MB/s. But we are not mir

Re: [Linux-HA] Is CLVM really needed in an active/passive cluster?

2013-04-22 Thread Lars Marowsky-Bree
On 2013-04-22T11:14:14, "Angel L. Mateo" wrote: > The problem I have is that I have firstly configured the cluster with > CLVM, but with this I can't create snapshots of my volumes, which is > required for backups. > > But is this CLVM really necessary? Or it is enough to configure

Re: [Linux-HA] When mirroring is a CPU-intensive task, something must be wrong...

2013-04-19 Thread Lars Marowsky-Bree
On 2013-04-19T16:27:14, Ulrich Windl wrote: > Hello, > > Using OCFS2 on top of a cLVM-mirrored LV is an absolute no-go for SLES11 SP2: Note that this is unrelated to OCFS2; cLVM2 mirroring is rather slow, since it communicates over the network to keep the dirty bitmaps and locks in sync. > Fir

Re: [Linux-HA] Antw: Re: Problem in SLES11 SP2 (actions on removed resources)?

2013-04-19 Thread Lars Marowsky-Bree
On 2013-04-19T16:38:26, Ulrich Windl wrote: > > What did you monitor? And what do you mean by "went crazy"? > > > > (Besides, monitoring sbd is unnecessary anyway.) > > Amazingly it depends on what support person you talk to: I disabled it for > one, and the other complained we disabled it. M

Re: [Linux-HA] Antw: Resource moves

2013-04-19 Thread Lars Marowsky-Bree
On 2013-04-19T16:47:00, Marcus Bointon wrote: > > Let me comment: "crm resource migrate prm_yours PT2M" will make a > > constraint that will stay in your CIB forever also, but it's active only > > for 2 minutes. > Where does the 2 minutes come from? As far as I can see they stick around > unti
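For context on the exchange above: `PT2M` is an ISO 8601 duration passed as the lifetime of the location constraint that `migrate` generates; whether the expired constraint is also purged from the CIB is exactly what is being debated. A sketch, using the resource name from the quoted example:

```shell
# Move the resource away; the generated location constraint is only
# honoured for two minutes (ISO 8601 duration PT2M)
crm resource migrate prm_yours PT2M

# Remove the constraint explicitly rather than waiting for expiry
crm resource unmigrate prm_yours
```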

Re: [Linux-HA] Q: crmd states (IDLE, POLICY, TRANSITION)

2013-04-19 Thread Lars Marowsky-Bree
On 2013-04-19T08:43:14, Ulrich Windl wrote: > So some re-check timer (I_PE_CALC) got active, and crmd changed from S_IDLE > to S_POLICY_ENGINE. By default, the cluster reruns the PE every so often to make sure time-based rules have a chance of affecting the cluster. > That's OK, but if the pol
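The re-check timer mentioned above is controlled by a cluster property; a crmsh sketch (15 minutes is, to my understanding, the usual Pacemaker default):

```shell
# crmsh fragment: how often the policy engine is re-run even without
# cluster events, so that time-based rules can take effect
crm configure property cluster-recheck-interval="15min"
```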

Re: [Linux-HA] Problem in SLES11 SP2 (actions on removed resources)?

2013-04-19 Thread Lars Marowsky-Bree
On 2013-04-19T09:56:37, Ulrich Windl wrote: > sbd monitoring went crazy (reporting running sbds when there were none, > complaining about the inability to stop sbd when there was none), so I stopped it. What did you monitor? And what do you mean by "went crazy"? (Besides, monitoring sbd is unnecessar

Re: [Linux-HA] Antw: Re: Q: Restarting sbd with new parameters without taking cluster stack down?

2013-04-18 Thread Lars Marowsky-Bree
On 2013-04-18T10:57:12, Ulrich Windl wrote: > I had seen the sources once: I wouldn't patch; I'd rewrite it. The coding > style is just terrible (e.g. no systematic error checking). You're welcome. Regards, Lars -- Architect Storage/HA SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer G

Re: [Linux-HA] Q: Restarting sbd with new parameters without taking cluster stack down?

2013-04-18 Thread Lars Marowsky-Bree
On 2013-04-18T09:49:17, Ulrich Windl wrote: > It would be nice if sbd had a built-in restart/upgrade mechanism. Yes. Feel free to send a patch ;-) > I guess it wouldn't be too hard to update parameters while sbd is running... > > So is it possible with sbd? If not, why? ;-) Because it wasn't

Re: [Linux-HA] Antw: Re: Q: limiting parallel execution of resource actions

2013-04-16 Thread Lars Marowsky-Bree
On 2013-04-16T09:01:30, Ulrich Windl wrote: > Yes, but that's for every resource then. There are "independent > lightweight resources" (like adding IP addresses), and "interacting > heavyweight resources" (like Xen VMs live-migrating using one network > channel to sync gigabytes of RAM). "migra

Re: [Linux-HA] Antw: Re: Unusual messages on April 1st anyone?

2013-04-11 Thread Lars Marowsky-Bree
On 2013-04-11T11:48:32, Ulrich Windl wrote: > > Annoying but harmless underflow in pacemaker. Since fixed, and will be > > fixed before it can occur next time (winter time). Sorry. > While talking on it: Is it a known problem in SLES11 SP2 that hosts gain one > hour during boot even though NTP i

Re: [Linux-HA] Q: "crmd: [13080]: ERROR: do_recover: Action A_RECOVER (0000000001000000) not supported" anybody else?

2013-04-11 Thread Lars Marowsky-Bree
On 2013-04-11T09:05:21, Andrew Beekhof wrote: > > Apr 5 14:14:14 h01 crmd: [13080]: ERROR: tengine_stonith_notify: We were > > allegedly just fenced by h05 for h05! > The rest is pacemaker saying "holy heck" and trying to get out of there asap. > What agent are you using for fencing? Doesn't s

Re: [Linux-HA] R: Fwd: stonith with sbd not working

2013-04-11 Thread Lars Marowsky-Bree
On 2013-04-11T12:01:07, Guglielmo Abbruzzese wrote: > Hi, > This is about the tarball compiling. > I have installed pacemaker + corosync using the rpm provided by RHEL6.2 ( > Pacemaker 1.1.6-3 and Corosync 1.4.1-4). > > While compiling sbd the configure command does not succeed due to a suspect

Re: [Linux-HA] Fwd: stonith with sbd not working

2013-04-10 Thread Lars Marowsky-Bree
On 2013-04-10T17:47:56, Fredrik Hudner wrote: > Hi Lars, > I wouldn't mind to try to install one of the tar balls from > http://hg.linux-ha.org/sbd, only I'm not sure how to do it after I've > unzipped/tar it. I saw someone from a discussion group that wanted to do it > as well.. > If you only te

Re: [Linux-HA] Unusual messages on April 1st anyone?

2013-04-10 Thread Lars Marowsky-Bree
On 2013-04-10T15:58:38, Ulrich Windl wrote: > Anybody else? I don't feel very confident with these messages... Annoying but harmless underflow in pacemaker. Since fixed, and will be fixed before it can occur next time (winter time). Sorry. Regards, Lars -- Architect Storage/HA SUSE LINUX

Re: [Linux-HA] Fwd: stonith with sbd not working

2013-04-10 Thread Lars Marowsky-Bree
On 2013-04-10T15:25:38, Fredrik Hudner wrote: > and removed watchdog from the system but without success.. > Still can't see any references that sbd has started in the messages log It looks as if the init script of pacemaker (openais/corosync on SUSE) is not taking care to start the sbd daemon.

Re: [Linux-HA] Fwd: stonith with sbd not working

2013-04-10 Thread Lars Marowsky-Bree
On 2013-04-10T09:28:36, Fredrik Hudner wrote: > I have > pacemaker-1.1.7-6.el6.x86_64 > and sbd from cluster-glue-1.0.6-9.el6.x86_64 Ah, I'm not sure how recent those versions are. It's likely you still have to specify the sbd_device on the primitive in the CIB there. > About the watchdog: Shou

Re: [Linux-HA] Fwd: stonith with sbd not working

2013-04-09 Thread Lars Marowsky-Bree
On 2013-04-09T14:28:58, Fredrik Hudner wrote: > For one reason or another stonith won't start and messages log says at one > point: > > stonith-ng[20383]: notice: stonith_device_action: Device stonith_sbd not > found. > stonith-ng[30234]: info: stonith_command: Processed st_execute from >

Re: [Linux-HA] Q: corosync/TOTEM retransmit list

2013-04-03 Thread Lars Marowsky-Bree
On 2013-04-03T11:34:34, Ulrich Windl wrote: > Hi! > > I have a simple question: Is it possible that DLM or OCFS2 causes > corosync/TOTEM retransmit messages? I have the feeling that whenever OCFS2 is > busy, corosync/TOTEM sends out retransmit lists like this: Load makes hitting this issue mo

Re: [Linux-HA] sbd on VMware nodes

2013-03-28 Thread Lars Marowsky-Bree
On 2013-03-28T10:52:22, Fredrik Hudner wrote: > > So when you say, map the same 1 to 3 block devices my approach should be > > correct ? > 1/ add 1 block device on each of the 3 nodes, e.g /dev/sdd, /dev/sde > and /dev/sdf I'm not sure what you mean by this. Yes, on each of the three physica

Re: [Linux-HA] Antw: Re: manage/umanage

2013-03-27 Thread Lars Marowsky-Bree
On 2013-03-27T12:56:40, Moullé Alain wrote: > Hi Andrew, > that's fine for me even in two steps , but I don't recognize the command > to be used > to set > > rsc.managed=false + rsc.op.enabled=false > > is it a special crm syntax ? He means to say that you want to set the is-managed meta at
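In crmsh terms, the two settings being asked about can be applied roughly like this (resource name is illustrative; the exact syntax depends on the crmsh version):

```shell
# Stop managing the resource: the cluster observes but does not act
crm resource meta rsc_foo set is-managed false

# Disable its monitor operation by adding enabled="false" to the op,
# e.g. via "crm configure edit rsc_foo":
#   op monitor interval="10" enabled="false"
```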

Re: [Linux-HA] Antw: Re: manage/umanage

2013-03-27 Thread Lars Marowsky-Bree
On 2013-03-27T08:24:20, Ulrich Windl wrote: > I see little sense to run monitoring on an unmanaged resource, specifically as > some _monitoring_ operations are not strict read-only, but do change the state > of a resource (which may be quite unexpected). One example is the RAID RA, > which tries

Re: [Linux-HA] sbd on VMware nodes

2013-03-26 Thread Lars Marowsky-Bree
On 2013-03-26T08:10:06, Fredrik Hudner wrote: > Hi, > I have a question about setting up sbd, which I think belongs in this forum > > I have 3 nodes. Two active pacemaker nodes and one kind of quorum node. > I would like to set up sbd as a fencing device between these 3 nodes that are > running in VMw

Re: [Linux-HA] Multiple instances of heartbeat

2013-03-15 Thread Lars Marowsky-Bree
On 2013-03-15T11:43:56, Dimitri Maziuk wrote: > Yeah, I suppose. I meant going Open/CloudStack. > (We get to write buzzword-compliant funding proposals, or I don't get to > eat. So my perspective is skewed towards the hottest shiny du jour...) Yeah, I'd agree that today there are scenarios where

Re: [Linux-HA] Multiple instances of heartbeat

2013-03-15 Thread Lars Marowsky-Bree
On 2013-03-15T09:54:22, Dimitri Maziuk wrote: > I've always had difficulties with the concept: the way I see it if your > hardware fails you want *all* your 200+ services moved. If you want them > independently moved to different places, you're likely better off with a > full cloud solution. I

Re: [Linux-HA] Multiple instances of heartbeat

2013-03-14 Thread Lars Marowsky-Bree
On 2013-03-14T09:44:11, "GGS (linux ha)" wrote: > That's the problem. We do not run a cluster > of servers. Our logical unit is the software > stack. (see below) That's fine. But the cluster software really assumes that only one instance of it is running per server - said instance can then mana

Re: [Linux-HA] Multiple instances of heartbeat

2013-03-14 Thread Lars Marowsky-Bree
On 2013-03-13T20:13:30, "GGS (linux ha)" wrote: > Their configuration, resources, etc. are > not intermingled, so we prefer not to > configure them in a single setup. For > simplicity we are using the R1 built > into heartbeat and not CRM. What you want to do is very simple to do with a pacemake

Re: [Linux-HA] RA heartbeat/exportfs hangs sporadically

2013-03-08 Thread Lars Marowsky-Bree
On 2013-03-08T11:56:12, Roman Haefeli wrote: > Googling "TrackedProcTimeoutFunction exportfs" didn't reveal any > results, which makes me think we are alone with this specific problem. > Is it the RA that hangs or the command 'exportfs' which is executed by > this RA? What resource-agents versi

Re: [Linux-HA] Pacemaker - resource stopped, why?

2013-03-08 Thread Lars Marowsky-Bree
On 2013-03-08T09:50:10, Rafał Radecki wrote: > I have a pacemaker cluster in which I use also ocf:heartbeat:IPaddr2 > resources. > My IP resource has stopped with no reason. In corosync log I have: You need to check the logs on the DC. Something triggered the policy engine and it decided that t

Re: [Linux-HA] Pacemaker, LSB resource, how to run additional script?

2013-03-07 Thread Lars Marowsky-Bree
On 2013-03-07T13:13:22, Rafał Radecki wrote: > Hi All. > > I have a memcache service on two nodes (active/passive). I would like > to run a custom script when the service migrates from first to the > second node? Can it be done with pacemaker? > As I have read in OCF resources I can do this with
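One way to run such a script is to wrap it in a small OCF-style agent whose start/stop actions fire when the service arrives on or leaves a node. A minimal, self-contained sketch of the dispatch-and-exit-code shape such an agent takes — the pidfile path, names, and hook placement are illustrative, not the poster's setup (OCF exit codes: 0 = success, 7 = not running):

```shell
#!/bin/sh
# Minimal OCF-style dispatcher sketch; names and paths are illustrative.
OCF_SUCCESS=0
OCF_NOT_RUNNING=7
PIDFILE="${PIDFILE:-/tmp/demo_agent.pid}"

agent() {
    case "$1" in
        start)
            echo $$ > "$PIDFILE"
            # ...run the custom "service arrived on this node" hook here...
            return $OCF_SUCCESS ;;
        stop)
            rm -f "$PIDFILE"
            # ...and the "service left this node" hook here...
            return $OCF_SUCCESS ;;
        monitor)
            # Running if the pidfile exists, stopped otherwise
            [ -f "$PIDFILE" ] && return $OCF_SUCCESS
            return $OCF_NOT_RUNNING ;;
        *)
            return 3 ;;   # OCF_ERR_UNIMPLEMENTED
    esac
}
```

Pacemaker calls such an agent as `start`/`stop`/`monitor` on the node that takes the service over, which is exactly where the custom migration hook would run.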

Re: [Linux-HA] SAP Instance clustering

2013-02-27 Thread Lars Marowsky-Bree
On 2013-02-27T13:20:05, michele.do...@holcim.com wrote: > Now I have one last question : I would like to clone the SAP services so > that they run on both nodes...this is causing some kind of glibc dump : > > crm configure clone CLW01 W01 meta globally-unique="true" clone-max="2" > clone-node-ma

Re: [Linux-HA] Antw: Re: stonithd doesn't use "power off" action

2013-02-22 Thread Lars Marowsky-Bree
On 2013-02-22T14:38:16, Ulrich Windl wrote: > I know, but shouldn't wrong watchdog modules fail to load? Or at least > shouldn't successful modules leave some message in syslog? Some of the modules do, some don't. Some watchdogs can't even tell if the hardware is installed until you actually acc

Re: [Linux-HA] Antw: Re: stonithd doesn't use "power off" action

2013-02-22 Thread Lars Marowsky-Bree
On 2013-02-22T13:55:47, Ulrich Windl wrote: > a quick check in the reference manual and HA-manual showed nothing for that > topic. At least I found out that "wdt" is a better pattern to look for: The HA manual doesn't detail how to setup the hardware. > # lsmod |grep wdt > xen_wdt

Re: [Linux-HA] Antw: Re: stonithd doesn't use "power off" action

2013-02-22 Thread Lars Marowsky-Bree
On 2013-02-22T08:39:29, Ulrich Windl wrote: > It's hard to find out which module is actually used from the syslog; see > yourself: That looks like there wasn't a watchdog driver loaded based on the hardware, but when sbd opened the watchdog device, the autoload triggered and the kernel/modprobe

Re: [Linux-HA] Antw: Re: stonithd doesn't use "power off" action

2013-02-21 Thread Lars Marowsky-Bree
On 2013-02-21T08:39:52, Andrew Beekhof wrote: > I think what he's suggesting is that the agent might be trying to do a > reboot but some part of that process stalls/aborts/times-out and only > the off part happens. That'd be a kernel or firmware bug though. sbd reboots or powers off by writing t

Re: [Linux-HA] Antw: Re: stonithd doesn't use "power off" action

2013-02-20 Thread Lars Marowsky-Bree
On 2013-02-20T11:03:20, Ulrich Windl wrote: > Let me remark that I'm seeing the opposite for sbd-based fencing > occasionally: The action is set to reboot, but occasionally some servers are > powered down. The less-pleasing feature of that is that you can't power up a > server using sbd, but y

Re: [Linux-HA] Antw: Re: coredump without debug info (Was: stonith (SBD) monitor times out; a software bug?)

2013-02-06 Thread Lars Marowsky-Bree
On 2013-02-06T15:07:16, Ulrich Windl wrote: > It looks like a problem with Novell Update: The only debuginfo catalogs I see > are: > # zypper ca |grep ebug > 6 | nu_novell_com:SLE11-SP1-Debuginfo-Pool| SLE11-SP1-Debuginfo-Pool > | Yes | Yes > 7 | nu_novell_com:SLE11-SP1-De

Re: [Linux-HA] Antw: Re: Q: co-location clones

2013-02-06 Thread Lars Marowsky-Bree
On 2013-02-06T14:49:57, Ulrich Windl wrote: > OK, why not be concrete then: If I have, let's say, 10 filesystems for > virtual machines that should run independently on a selection of 7 nodes, how > would your configuration look like? 10 cloned groups? Where do you put DLM, > cLVM and O2CB? M

Re: [Linux-HA] Antw: Re: coredump without debug info (Was: stonith (SBD) monitor times out; a software bug?)

2013-02-06 Thread Lars Marowsky-Bree
On 2013-02-06T12:18:56, Ulrich Windl wrote: > > It tries that. But you need configured debuginfo channels enabled and > > zypper working. > It tried it with debuginfo repositories enabled. No debug info was installed > when creating a hb_report. Looking into hb_report I could not find any > tr

Re: [Linux-HA] Antw: Re: Q: co-location clones

2013-02-06 Thread Lars Marowsky-Bree
On 2013-02-06T12:22:37, Ulrich Windl wrote: > > > colocation col_OCFS_cVG inf: _rsc_set_ ( cln_CFS ) cln_cLVM > > > order ord_cVG_CFS inf: cln_cLVM ( cln_CFS ) > > > > Why not just: > > > > colocation col_OCFS_cVG inf: cln_CFS cln_cLVM > > order ord_cVG_CFS inf: cln_cLVM cln_CFS > > The reason

Re: [Linux-HA] Q: co-location clones

2013-02-06 Thread Lars Marowsky-Bree
On 2013-02-06T10:45:21, Ulrich Windl wrote: > colocation col_OCFS_cVG inf: _rsc_set_ ( cln_CFS ) cln_cLVM > order ord_cVG_CFS inf: cln_cLVM ( cln_CFS ) Why not just: colocation col_OCFS_cVG inf: cln_CFS cln_cLVM order ord_cVG_CFS inf: cln_cLVM cln_CFS That ought to work. Probably clones and re

Re: [Linux-HA] coredump without debug info (Was: stonith (SBD) monitor times out; a software bug?)

2013-02-06 Thread Lars Marowsky-Bree
On 2013-02-06T11:11:32, Ulrich Windl wrote: > > It looks like you need to install a couple of debuginfo packages for > > glib, libglue, glibc at least until these resolve. > I don't have much experience with debuginfo packages, but when "hb_report" is > intended for providing useful information

Re: [Linux-HA] Antw: Re: stonith (SBD) monitor times out; a software bug?

2013-02-06 Thread Lars Marowsky-Bree
On 2013-02-06T09:22:53, Ulrich Windl wrote: > If you say so, here it is: > #3 0x004082c0 in ?? () > No symbol table info available. It looks like you need to install a couple of debuginfo packages for glib, libglue, glibc at least until these resolve. Regards, Lars -- Architect

Re: [Linux-HA] Antw: crmd: [12372]: WARN: decode_transition_key: Bad UUID (crm-resource-22905) in sscanf result (3) for 0:0:crm-resource-22905

2013-02-06 Thread Lars Marowsky-Bree
On 2013-02-06T09:19:58, Ulrich Windl wrote: (The "L" button in mutt. I need to make better use of that ;-) > Let me remark that the "brief window" is long enough for a user to start > crm_mon after cleanup a few times and still see "Stopped" resources. Maybe > it's more obvious for non-trivial

Re: [Linux-HA] stonith (SBD) monitor times out; a software bug?

2013-02-05 Thread Lars Marowsky-Bree
On 2013-02-05T14:33:14, Ulrich Windl wrote: > I had an unexplainable failure of the stonith monitor for SBD. When examining > the syslog, I got the impression that RA configuration data got corrupted, > causing a RA failure. Interesting. Please file a bug report. And the easiest way is to jus

Re: [Linux-HA] Antw: crmd: [12372]: WARN: decode_transition_key: Bad UUID (crm-resource-22905) in sscanf result (3) for 0:0:crm-resource-22905

2013-02-05 Thread Lars Marowsky-Bree
On 2013-02-05T11:36:30, Ulrich Windl wrote: This looks like a support incident to me. Hard to diagnose without full logs. > Let me add: I'm not completely sure, but a side-effect of these messages seems > to be that resources (being cleaned up) that are running (e.g. Xen VMs) are > considered "

Re: [Linux-HA] Antw: Re: Q: configuring resource groups with parallel ordering

2013-01-30 Thread Lars Marowsky-Bree
On 2013-01-30T13:36:58, Ulrich Windl wrote: > I thought so ;-) But as groups are more or less just a shortcut for "ordering > & colocation" I wondered why the thing I wanted to have isn't available. Groups also provide a pseudo object and are alas slightly more than a short cut. I do agree we

Re: [Linux-HA] Trying to get Corosync to work with impaired three-node cluster.

2013-01-23 Thread Lars Marowsky-Bree
On 2013-01-23T13:22:10, Alex Sudakar wrote: > In such a situation I thought that, hopefully, all the nodes would > still 'see' each other, all three in the one cluster partition, with C > relaying knowledge of A & B to each other. That's why, I thought, > Corosync calls its circuits 'rings', aft

Re: [Linux-HA] Debugging Ressource Agent

2013-01-18 Thread Lars Marowsky-Bree
On 2013-01-17T15:03:38, Andreas Mock wrote: > But: Many many variables have to be set up properly so that the > RA runs in an environment similar to that which is built up by the > LRM. Is there a script out there which is a thin wrapper to build up > the correct environment? (environment variable

Re: [Linux-HA] Hi: Queries for using Linux-Ha and Pacemaker

2013-01-14 Thread Lars Marowsky-Bree
On 2013-01-14T10:24:33, Navneet Khanuja wrote: > http://www.linux-ha.org/doc/users-guide/_building_and_installing_from_source.html > > I have a few queries; please find them below: > 1) Please advise whether I am referring to the correct link. No. Please go to clusterlabs.org and follow the getting started gu

Re: [Linux-HA] corosync and network redundancy

2013-01-09 Thread Lars Marowsky-Bree
On 2013-01-08T10:03:20, Digimer wrote: > If you use two separate networks for RRP, and each network sits on top > of a mode=1 bond, then you will have the highest redundancy. I do this > with stacked switches where each leg of the bond is in a different > switch, so I can survive total switch fai
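A corosync.conf sketch for the passive-RRP-over-bonds setup Digimer describes might look like this (network addresses are placeholders, and exact options vary between corosync versions):

```
totem {
    version: 2
    rrp_mode: passive
    interface {
        ringnumber: 0
        bindnetaddr: 192.168.1.0   # first network, carried by bond0
        mcastport: 5405
    }
    interface {
        ringnumber: 1
        bindnetaddr: 192.168.2.0   # second network, carried by bond1
        mcastport: 5407
    }
}
```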

Re: [Linux-HA] Xen on iMac(Ubuntu 12.04)

2013-01-07 Thread Lars Marowsky-Bree
On 2013-01-07T15:02:17, Felipe Gutierrez wrote: > Hi everyone, > > I tried to install Xen on my Ubuntu 12.04 on an iMac and the new kernel doesn't > boot. My tip would be to ask on a Ubuntu or on a Xen-related mailing list. Regards, Lars -- Architect Storage/HA SUSE LINUX Products GmbH, GF:

Re: [Linux-HA] Antw: Re: Q: NFS cross mounting

2012-12-22 Thread Lars Marowsky-Bree
On 2012-12-21T08:14:42, Ulrich Windl wrote: > In contrast with the NFS solution, the client applications would be blocked > until the NFS server is running on the other node. Obviously this solution seems > preferable. It just won't work reliably ;-) We'll see how solutions like CephFS handle th

Re: [Linux-HA] Antw: Re: Q: NFS cross mounting

2012-12-20 Thread Lars Marowsky-Bree
On 2012-12-20T10:50:11, Dimitri Maziuk wrote: > Well, they'll be wrong unless there's a way nfs-export an unmounted > filesystem. A reasonable question would be why not symlink > -> ;) That'd miss the use case of being able to switch-over the server to another node. Regards, Lars --

Re: [Linux-HA] Antw: Re: Q: NFS cross mounting

2012-12-20 Thread Lars Marowsky-Bree
On 2012-12-20T08:55:06, Ulrich Windl wrote: > > > 11.5.3.2. Mounting NFS Volumes Locally on the Exporting Server > > > > > > Mounting NFS volumes locally on the exporting server is not supported on > > > SUSE Linux Enterprise systems, as it is the case on all Enterprise class > > > Linux systems

Re: [Linux-HA] Q: NFS cross mounting

2012-12-19 Thread Lars Marowsky-Bree
On 2012-12-19T10:59:12, Ulrich Windl wrote: > Unfortunately there was an update to SLES11 SP2 release notes recently that > forbid this kind of setup: > --- > 11.5.3.2. Mounting NFS Volumes Locally on the Exporting Server > > Mounting NFS volumes locally on the exporting server is not supported

Re: [Linux-HA] Heartbeat/Pacemaker resource agent incorrect

2012-12-12 Thread Lars Marowsky-Bree
On 2012-12-13T10:31:55, Andrew Beekhof wrote: > > We once moved the ocf-shellfuncs file, which didn't work out here when > I thought we never did this sort of thing because we don't know how > people are using our stuff externally. We did it in a backwards-compatible manner; or at least if the p

Re: [Linux-HA] Heartbeat/Pacemaker resource agent incorrect

2012-12-12 Thread Lars Marowsky-Bree
On 2012-12-12T13:58:25, Fabian Herschel wrote: > > I noticed when I tried to promote one of the servers that an error > > occurred stating that the ocf:heartbeat:mysql did not support the > > feature. I evaluated the script and realized it was an older > > version and did not contain any of the

Re: [Linux-HA] Q: Stop resources without quorum?

2012-12-06 Thread Lars Marowsky-Bree
On 2012-12-06T10:49:29, Ulrich Windl wrote: > We had a network-breakdown that isolated the nodes of a 5-node cluster > (SLES11 SP2). Interestingly each node declared itself as "DC without > quorum" and tried to stop resources. > > I am surprised: Shouldn't the CRM leave resources as is while th

Re: [Linux-HA] Linux High Availibility/Stonith in VMs

2012-12-06 Thread Lars Marowsky-Bree
On 2012-12-06T00:02:28, Hermes Flying wrote: > Hi, > I was wondering how does fencing/STONITH work in a VM environment? Obviously > the nodes running inside a VM should run in separate machines but can we > STONITH a node that is running inside a VM? I'd suggest to use something like SBD or fe
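For the SBD approach Lars mentions, the shared-disk fencing device is initialized and inspected with the sbd tool roughly like this (the device path is a placeholder; the shared disk must be visible to all VMs):

```
# Initialize the shared disk as an SBD device (destroys its contents)
sbd -d /dev/disk/by-id/example-shared-disk create
# Verify the message-slot layout afterwards
sbd -d /dev/disk/by-id/example-shared-disk list
```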

Re: [Linux-HA] master/slave drbd resource STILL will not failover

2012-12-04 Thread Lars Marowsky-Bree
On 2012-12-04T20:38:54, Fabian Herschel wrote: > I am not sure if that will really help you - but in my cluster (ok, > older pacemaker version) I have the following to define a master/slave > resource: > > primitive rsc_sap_HA0_ASCS00 ocf:heartbeat:SAPInstance \ > operations $id="rsc_sap_HA0_A

Re: [Linux-HA] Does stonith always succeed?

2012-12-03 Thread Lars Marowsky-Bree
On 2012-12-03T04:05:19, Hermes Flying wrote: > But this assumes that the servers are co-located, right? How is geo-separated > nodes supported? We support this via booth, a pacemaker extension (builds on top of pacemaker's ticket support); effectively building clusters-of-clusters. Services are
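The booth extension Lars describes is driven by a short configuration file shared by both sites and the arbitrator; a minimal sketch (IP addresses are placeholders) might be:

```
# /etc/booth/booth.conf -- one ticket arbitrated between two sites
transport = UDP
port = 9929
site = 192.168.10.1
site = 192.168.20.1
arbitrator = 192.168.30.1
ticket = "service-ticket"
```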

Re: [Linux-HA] New Resources on heartbeat can't start

2012-11-20 Thread Lars Marowsky-Bree
On 2012-11-20T12:12:12, Felipe Gutierrez wrote: > Hi everyone, > > I am trying to set up a new resource on my heartbeat, but for some reason > the resource doesn't come on. > Does anyone have a hint, please? Yes. Read the logfiles why the resource agent is returning 1 instead of 0. I'm pretty s

Re: [Linux-HA] Antw: Re: pcs or crmsh?

2012-11-16 Thread Lars Marowsky-Bree
On 2012-11-16T14:56:21, Andrew Beekhof wrote: > > Yes, the remote capability of pcs is different. Though crmsh has some > > ability in that regard too. And probably will have to grow them. > And a REST interface and a GUI that talks to it as well? hawk actually is based on a REST interface, in m
