On 11/20/2015 06:38 PM, Brian Campbell wrote:
> I've been trying to debug and do a root cause analysis for a cascading
> series of failures that a customer hit a couple of days ago, that
> caused their filesystem to be unavailable for a couple of hours.
>
> The original failure was in our own distributed filesystem backend, a
> fork of LizardFS, which is in turn a fork of MooseFS. This history is
> mostly only important in reading the logs, where "efs", "lizardfs",
> and "mfs" all generally refer to the same services, just different
> generations of naming them as not all daemons, scripts, and packages
> have been renamed.
>
> There are two master servers that handle metadata operations, running
> Pacemaker to elect which one is the current primary and which one is a
> replica, and a number of chunkservers that store file chunks and
> simply connect to the currently running master via a virtual IP. A bug
> in the chunkservers' checksum scan caused them to leak file
> descriptors and become unresponsive, so while the master server was up
> and healthy, no actual filesystem operations could occur. (This bug is
> now fixed, by the way, and the fix deployed to the customer, but we
> still want to understand the later failures that prolonged their
> downtime.)
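I assume the configuration is the usual multi-state resource plus a
co-located virtual IP -- roughly this crmsh shape, with placeholder
resource/agent names and address rather than your actual config:

    # Hypothetical sketch only -- names, agent, and IP are placeholders.
    primitive lizardfs-master ocf:editshare:lizardfs-master \
        op monitor interval=10s role=Master \
        op monitor interval=20s role=Slave
    ms lizardfs-master-ms lizardfs-master \
        meta master-max=1 clone-max=2 notify=true
    primitive cluster-ip ocf:heartbeat:IPaddr2 \
        params ip=192.0.2.10 cidr_netmask=24 \
        op monitor interval=10s
    # The IP follows the master role, and promotion happens first.
    colocation ip-with-master inf: cluster-ip lizardfs-master-ms:Master
    order promote-before-ip inf: lizardfs-master-ms:promote cluster-ip:start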
>
> The customer saw that things were unresponsive, and tried the
> simplest thing they could to resolve it: migrating the services to
> the other master. This succeeded, as the checksum scan had been
> initiated by the first master and so switching over to the replica
> caused all of the extra file descriptors to be closed and the
> chunkservers to become responsive again.
>
> However, because one backup service is not yet managed via Pacemaker,
> and thus runs only on the first master, they decided to migrate back
> to it. This was when they ran into a Pacemaker problem.
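Presumably the moves were done with something equivalent to the
following (the exact commands are a guess, and the resource name is the
hypothetical one from the sketch above):

    # Move the master role to the other node, then clear the location
    # constraint that the move leaves behind (crmsh syntax).
    crm resource migrate lizardfs-master-ms es-efs-master1
    crm resource unmigrate lizardfs-master-ms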
>
> At the time of the problem, es-efs-master1 is the server that was
> originally the master when the first problem happened, and which they
> are trying to migrate the services back to. es-efs-master2 is the one
> actively running the services, and also happens to be the DC at the
> time, so that's where to look for pengine messages.
>
> On master2, you can see the point when the user tried to migrate back
> to master1 based on the pengine decisions:
>
> (By the way, apologies for the long message with large log excerpts; I
> was trying to balance enough detail against not overwhelming you. It
> can be hard to keep it short when explaining these kinds of
> complicated failures across a number of machines.)
Indeed. In a case as complex as this, more is better -- a copy of the
configuration and the output of crm_report would be helpful (with any
sensitive info removed).
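For example, something like this, run after the fact, should gather the
logs, CIB, and PE inputs from all nodes for the incident window (adjust
the times and destination to taste):

    crm_report -f "2015-11-18 08:00:00" -t "2015-11-18 09:00:00" \
        /tmp/efs-incident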
> Nov 18 08:28:28 es-efs-master2 pengine[1923]: warning: unpack_rsc_op:
> Forcing editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.lizardfs-master.primitive:1
> to stop after a failed demote action
Any idea why the demote failed?
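The resource agent's exit code for the demote should be in the logs on
the node where it ran; something like this may turn it up (the log path
is a guess -- it varies by distro and syslog configuration):

    grep -i demote /var/log/syslog | grep -iE 'crmd|lrmd'

crm_mon -1 --failcounts will also show whether the failure is still
being counted against the resource.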
> Nov 18 08:28:28 es-efs-master2 pengine[1923]: notice: LogActions:
> Move editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.ip#011(Started
> es-efs-master2 -> es-efs-master1)
> Nov 18 08:28:28 es-efs-master2 pengine[1923]: notice: LogActions:
> Promote
> editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.lizardfs-master.primitive:0#011(Slave
> -> Master es-efs-master1)
> Nov 18 08:28:28 es-efs-master2 pengine[1923]: notice: LogActions:
> Demote
> editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.lizardfs-master.primitive:1#011(Master
> -> Slave es-efs-master2)
> Nov 18 08:28:28 es-efs-master2 pengine[1923]: notice:
> process_pe_message: Calculated Transition 1481601:
> /var/lib/pacemaker/pengine/pe-input-1355.bz2
> Nov 18 08:28:28 es-efs-master2 stonith-ng[1920]: warning:
> cib_process_diff: Diff 0.2754083.1 -> 0.2754083.2 from local not
> applied to 0.2754083.1: Failed application of an update diff
Normally failed diffs shouldn't cause any problems, other than extra
network/disk traffic for a full CIB sync.
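(For reference, the three numbers in a version like 0.2754083.1 are the
CIB's admin_epoch, epoch, and num_updates; the live values are visible
on the <cib> element itself:)

    # The first line of the query output carries the version attributes.
    cibadmin --query | head -n 1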
There have been many bugs fixed since 1.1.10, and I think some of them
relate to CIB diffs. Since you're building everything custom anyway, you
might as well use the latest upstream libqb/corosync/pacemaker releases
(by the way, the current pacemaker master branch has been stable and
will soon become the basis of 1.1.14rc1).
Have you been able to reproduce the issue on a test cluster? That would
be important for investigating further. You could set PCMK_debug=yes
(I'm guessing that would be in /etc/default/pacemaker on Ubuntu) to get
more logs.
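That is, something along these lines, followed by a Pacemaker restart so
the daemons pick it up:

    # /etc/default/pacemaker (path as guessed above; RPM-based distros
    # typically use /etc/sysconfig/pacemaker instead)
    PCMK_debug=yes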
> Nov 18 08:28:28 es-efs-master2 crmd[1924]: notice:
> process_lrm_event: LRM operation
> editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.lizardfs-master.primitive_notify_0
> (call=400, rc=0, cib-update=0, confirmed=true) ok
> Nov 18 08:28:28 es-efs-master2 crmd[1924]: notice: run_graph:
> Transition 1481601 (Complete=5, Pending=0, Fired=0, Skipped=15,
> Incomplete=10, Sour