Hi!
Please forgive me for asking AI; anyway here is what AI suggested:
Quick clarification (Alice)
I need the environment details to give precise guidance:
* Linux distro and kernel version
* Corosync and Pacemaker versions
* Network topology (dedicated heartbeat VLAN vs shared LAN) and link types
(1GbE/10GbE)
* Cluster size (nodes count) and the resources being failed over
* Do you control MTU and the switches, and can you run multicast/UDPU, or do
you need unicast only?
At minimum, please provide: software and versions, error frequency
(always/sometimes), and any recent changes.
Below I assume a generic production setup (modern Corosync 3.x and Pacemaker
2.x on Linux).
Short answers
1. Yes — sub‑second detection and initiation can be achieved in production
with modern Corosync/Pacemaker, but only with careful tuning, dedicated
low‑latency networking, and acceptance of trade‑offs.
2. Community reports and measurements vary: stable failovers of ~200–700 ms
have been reported for simple resources; 1s is a commonly cited reliable
minimum in real deployments. Expect full resource recovery often to be longer
(seconds) depending on service stop/start times.
3. Yes. Post‑2015 releases improved the messaging stack: Corosync 3.x moved
to the kronosnet (knet) transport with per‑link health monitoring and
multi‑link redundancy, quorum handling gained corosync‑qdevice, and more of
the timeouts that govern detection latency became tunable.
4. Key params and trade‑offs are listed below.
Quick Fix (try first, <2 minutes)
* Ensure the cluster network has low jitter and packet loss (ping <1 ms on a
dedicated link). Confirm with: ping -i 0.1 -c 50 <peer> and check the mdev
(jitter) value in the summary line.
* Verify corosync/pacemaker versions: corosync -v ; pcs --version (or
pacemakerd --version). Report the results.
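As a sketch of the jitter check, the avg and mdev fields can be pulled out of
the ping summary line like this (the summary string below is a canned example
with made-up values; in practice pipe `ping -c 50 -i 0.1 <peer> | tail -1`
into the same awk, and note that probe intervals below 0.2 s may require
root):

```shell
# Canned ping summary line (hypothetical values); replace with real output:
#   ping -c 50 -i 0.1 <peer-ip> | tail -1
summary='rtt min/avg/max/mdev = 0.031/0.052/0.118/0.014 ms'
# Split on '=', '/' and spaces: field 7 is avg, field 9 is mdev (jitter).
avg=$(printf '%s\n' "$summary" | awk -F'[=/ ]+' '{print $7}')
mdev=$(printf '%s\n' "$summary" | awk -F'[=/ ]+' '{print $9}')
echo "avg=${avg} ms jitter=${mdev} ms"
# For sub-second failover you want avg well under 1 ms and low jitter.
```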
Moderate steps (~5 minutes)
1. Use dedicated heartbeat network (separate VLAN/interface) and set proper
MTU.
2. Tune Corosync:
* transport: for Corosync 3.x the default knet transport is recommended;
udpu (unicast UDP) remains available for legacy setups.
* token and consensus timeouts (example conservative starting point for
sub‑second):
* token: 500 (ms)
* consensus: 600 (ms; corosync requires consensus >= 1.2 * token)
* join timeout: reduce modestly rather than drastically
* For Corosync 3.x, add redundant knet links instead of the old RRP
settings (exact parameter names depend on version).
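For illustration, a minimal totem section with these values might look like
the fragment below (the cluster name is a hypothetical placeholder, and the
values are starting points, not recommendations; consult corosync.conf(5) for
your exact version):

```
totem {
    version: 2
    cluster_name: hacluster   # hypothetical name
    transport: knet           # default in Corosync 3.x
    token: 500                # token loss timeout in ms (3.x default: 3000)
    consensus: 600            # must be >= 1.2 * token
}
```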
3. Pacemaker timeouts:
* cluster property stonith-enabled=true (ensure fencing is fast)
* set stonith-timeout and the resource meta attribute migration-threshold to
low but safe values
* resource agent timeouts: set op monitor intervals to 200–500 ms for
services whose agents support fast probes
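As a sketch, these settings map onto pcs commands roughly as follows (WebIP
is a hypothetical resource name, the values are illustrative starting points,
and the commands only make sense on an already configured cluster):

```
pcs property set stonith-enabled=true
pcs property set stonith-timeout=30s
pcs resource update WebIP op monitor interval=500ms timeout=2s
pcs resource meta WebIP migration-threshold=1
```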
4. Use lightweight health checks (fast monitors) rather than heavy scripts.
5. Test failovers under load and measure with timestamps in logs
(corosync/pacemaker logs include timing).
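To put numbers on step 5, here is a small sketch that diffs two log
timestamps (the HH:MM:SS.mmm values are hypothetical; in practice take them
from `journalctl -u corosync -u pacemaker -o short-precise`, e.g. the moment
a link is marked down vs. the moment the resource starts on the peer):

```shell
# Hypothetical timestamps: fault detected vs. resource started on the peer.
t_fail='16:41:00.120'
t_recover='16:41:00.870'
# Convert HH:MM:SS.mmm to seconds and subtract.
elapsed=$(awk -v a="$t_fail" -v b="$t_recover" '
  function s(t) { split(t, p, ":"); return p[1]*3600 + p[2]*60 + p[3] }
  BEGIN { printf "%.3f", s(b) - s(a) }')
echo "failover took ${elapsed} s"   # -> failover took 0.750 s
```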
Nuclear option
* If tuning cannot reach target, consider redesign: split services into
smaller failover units, use active‑active instead of active‑passive, or replace
with specialized low‑latency HA solutions. Back up configs and data before any
disruptive changes.
Concrete example (decisive, reasonable defaults)
* corosync.conf (illustrative):
* token: 500
* consensus: 600
* transport: knet, with each node's ring0_addr on the dedicated NIC
(bindnetaddr applies only to the legacy udp/udpu transports)
* Pacemaker resource monitor ops:
* op monitor interval="300ms" timeout="1500ms"
* Ensure fencing/stonith completes in <500 ms (fast power fencing or a
hypervisor/management API).
Trade-offs and risks
* False positives: aggressive timeouts increase split‑brain/failover storm
risk. Require rock‑solid network and reliable fencing.
* Network sensitivity: jitter/packet loss will cause spurious failovers.
Dedicated link and QoS recommended.
* CPU/network overhead: frequent heartbeats and fast monitors increase
CPU/network usage (generally small, but measurable on constrained nodes).
* Recovery time vs detection: detection can be sub‑second, but service
stop/start and client reconnection often dominate total outage.
Reliable minimum recommendation
* For production, target ~500 ms detection/initiation as an aggressive but
achievable baseline; 1 s is safer and widely reliable. Validate with end‑to‑end
tests for your services.
If you want, provide the environment details I asked for and I’ll produce a
tuned corosync.conf, pacemaker properties, concrete resource monitor examples,
and a test plan with log commands and expected timestamps.
As always: Take AI answers with a grain of salt 😉
Kind regards,
Ulrich Windl
From: Users <[email protected]> On Behalf Of Holger Haidinger <DE
ERL SWD EM> via Users
Sent: Friday, February 20, 2026 4:41 PM
To: [email protected]
Cc: Holger Haidinger <DE ERL SWD EM> <[email protected]>
Subject: [EXT] [EXT] [ClusterLabs] Sub-second failover detection in
Corosync/Pacemaker clusters - 2026 update?
Hi everyone,
I'm revisiting a thread from 2015
(https://www.mail-archive.com/[email protected]/msg00554.html) about
achieving sub-second failover detection in HA clusters, and I'm curious about
the current state of affairs nearly a decade later.
My Environment:
- Corosync 3.1.6
- Pacemaker 2.1.2
- Architecture: 2-node cluster + QDevice (also testing 3-node setups)
- Network: Dedicated physical NIC for cluster traffic (low-latency requirements)
Specific Questions:
1. With modern Corosync/Pacemaker versions, is sub-second fault detection and
failover initiation realistically achievable in production environments?
2. Are there any published measurements or community experiences showing the
fastest stable failover times you've achieved? What's considered a reliable
minimum time span?
3. Have there been significant enhancements in the newer versions of Corosync
and Pacemaker (post-2015) that specifically target detection speed and failover
latency?
4. If sub-second detection is possible, what are the key configuration
parameters and potential trade-offs (false positives, network sensitivity,
resource overhead)?
Thanks in advance!
Holger Haidinger
_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users
ClusterLabs home: https://www.clusterlabs.org/