Hi!
Please forgive me for asking AI; anyway here is what AI suggested:
Quick clarification (Alice)
I need the environment details to give precise guidance:
* Linux distro and kernel version
* Corosync and Pacemaker versions
* Network topology (dedicated heartbeat VLAN vs shared LAN) and link types
(1GbE/10GbE)
* Cluster size (nodes count) and the resources being failed over
* Do you control MTU and the switches, and can you run multicast/UDPU, or do
you need unicast only?
At minimum, please provide: software and versions, error frequency
(always/sometimes), and any recent changes.
Below I assume a generic production setup (modern Corosync 3.x and Pacemaker
2.x on Linux).
Short answers
1. Yes — sub‑second detection and initiation can be achieved in production
with modern Corosync/Pacemaker, but only with careful tuning, dedicated
low‑latency networking, and acceptance of trade‑offs.
2. Community reports and measurements vary: stable failovers of ~200–700 ms
have been reported for simple resources; 1s is a commonly cited reliable
minimum in real deployments. Expect full resource recovery often to be longer
(seconds) depending on service stop/start times.
3. Yes. Post‑2015 releases improved the messaging stack: Corosync 3.x moved
to the kronosnet (knet) transport with per‑link health monitoring and
multi‑link redundancy, quorum handling gained corosync‑qdevice, and more of
the timeouts that govern detection latency became tunable.
4. Key params and trade‑offs are listed below.
Quick Fix (try first, <2 minutes)
* Ensure the cluster network has low jitter and packet loss (ping <1 ms on a
dedicated link). Confirm with: ping -i 0.1 -c 50 <peer> and check the mdev
(jitter) value in the summary line.
* Verify corosync/pacemaker versions: corosync -v ; pcs --version (or
pacemakerd --version). Report the results.
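As a sketch of the jitter check, the avg and mdev fields can be pulled out of
the ping summary line like this (the summary string below is a canned example
with made-up values; in practice pipe `ping -c 50 -i 0.1 <peer> | tail -1`
into the same awk, and note that probe intervals below 0.2 s may require
root):

```shell
# Canned ping summary line (hypothetical values); replace with real output:
#   ping -c 50 -i 0.1 <peer-ip> | tail -1
summary='rtt min/avg/max/mdev = 0.031/0.052/0.118/0.014 ms'
# Split on '=', '/' and spaces: field 7 is avg, field 9 is mdev (jitter).
avg=$(printf '%s\n' "$summary" | awk -F'[=/ ]+' '{print $7}')
mdev=$(printf '%s\n' "$summary" | awk -F'[=/ ]+' '{print $9}')
echo "avg=${avg} ms jitter=${mdev} ms"
# For sub-second failover you want avg well under 1 ms and low jitter.
```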
Moderate steps (~5 minutes)
1. Use dedicated heartbeat network (separate VLAN/interface) and set proper
MTU.
2. Tune Corosync:
* transport: for Corosync 3.x the default knet transport is recommended;
udpu (unicast UDP) remains available for legacy setups.
* token and consensus timeouts (example conservative starting point for
sub‑second):
* token: 500 (ms)
* consensus: 600 (ms; corosync requires consensus >= 1.2 * token)
* join timeout: reduce modestly rather than drastically
* For Corosync 3.x, add redundant knet links instead of the old RRP
settings (exact parameter names depend on version).
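For illustration, a minimal totem section with these values might look like
the fragment below (the cluster name is a hypothetical placeholder, and the
values are starting points, not recommendations; consult corosync.conf(5) for
your exact version):

```
totem {
    version: 2
    cluster_name: hacluster   # hypothetical name
    transport: knet           # default in Corosync 3.x
    token: 500                # token loss timeout in ms (3.x default: 3000)
    consensus: 600            # must be >= 1.2 * token
}
```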
3. Pacemaker timeouts:
* cluster property stonith-enabled=true (ensure fencing is fast)
* set stonith-timeout and the resource meta attribute migration-threshold to
low but safe values
* resource agent timeouts: set op monitor intervals to 200–500 ms for
services whose agents support fast probes
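As a sketch, these settings map onto pcs commands roughly as follows (WebIP
is a hypothetical resource name, the values are illustrative starting points,
and the commands only make sense on an already configured cluster):

```
pcs property set stonith-enabled=true
pcs property set stonith-timeout=30s
pcs resource update WebIP op monitor interval=500ms timeout=2s
pcs resource meta WebIP migration-threshold=1
```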
4. Use lightweight health checks (fast monitors) rather than heavy scripts.
5. Test failovers under load and measure with timestamps in logs
(corosync/pacemaker logs include timing).
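To put numbers on step 5, here is a small sketch that diffs two log
timestamps (the HH:MM:SS.mmm values are hypothetical; in practice take them
from `journalctl -u corosync -u pacemaker -o short-precise`, e.g. the moment
a link is marked down vs. the moment the resource starts on the peer):

```shell
# Hypothetical timestamps: fault detected vs. resource started on the peer.
t_fail='16:41:00.120'
t_recover='16:41:00.870'
# Convert HH:MM:SS.mmm to seconds and subtract.
elapsed=$(awk -v a="$t_fail" -v b="$t_recover" '
  function s(t) { split(t, p, ":"); return p[1]*3600 + p[2]*60 + p[3] }
  BEGIN { printf "%.3f", s(b) - s(a) }')
echo "failover took ${elapsed} s"   # -> failover took 0.750 s
```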
Nuclear option
* If tuning cannot reach target, consider redesign: split services into
smaller failover units, use active‑active instead of active‑passive, or replace
with specialized low‑latency HA solutions. Back up configs and data before any
disruptive changes.
Concrete example (decisive, reasonable defaults)
* corosync.conf (illustrative):
* token: 500
* consensus: 600
* transport: knet, with each node's ring0_addr on the dedicated NIC
(bindnetaddr applies only to the legacy udp/udpu transports)
* Pacemaker resource monitor ops:
* op monitor interval="300ms" timeout="1500ms"
* Ensure fencing/stonith completes in <500 ms (fast power fencing or a
hypervisor/management API).
Trade-offs and risks
* False positives: aggressive timeouts increase split‑brain/failover storm
risk. Require rock‑solid network and reliable fencing.
* Network sensitivity: jitter/packet loss will cause spurious failovers.
Dedicated link and QoS recommended.
* CPU/network overhead: frequent heartbeats and fast monitors increase
CPU/network usage (generally small, but measurable on constrained nodes).
* Recovery time vs detection: detection can be sub‑second, but service
stop/start and client reconnection often dominate total outage.
Reliable minimum recommendation
* For production, target ~500 ms detection/initiation as an aggressive but
achievable baseline; 1 s is safer and widely reliable. Validate with end‑to‑end
tests for your services.
If you want, provide the environment details I asked for and I’ll produce a
tuned corosync.conf, pacemaker properties, concrete resource monitor examples,
and a test plan with log commands and expected timestamps.
As always: Take AI answers with a grain of salt 😉
Kind regards,
Ulrich Windl
From: Users <[email protected]> On Behalf Of Holger Haidinger <DE
ERL SWD EM> via Users
Sent: Friday, February 20, 2026 4:41 PM
To: [email protected]
Cc: Holger Haidinger <DE ERL SWD EM> <[email protected]>
Subject: [EXT] [EXT] [ClusterLabs] Sub-second failover detection in
Corosync/Pacemaker clusters - 2026 update?
Hi everyone,
I'm revisiting a thread from 2015
(https://www.mail-archive.com/[email protected]/msg00554.html) about
achieving sub-second failover detection in HA clusters, and I'm curious about
the current state of affairs nearly a decade later.
My Environment:
- Corosync 3.1.6
- Pacemaker 2.1.2
- Architecture: 2-node cluster + QDevice (also testing 3-node setups)
- Network: Dedicated physical NIC for cluster traffic (low-latency requirements)
Specific Questions:
1. With modern Corosync/Pacemaker versions, is sub-second fault detection and
failover initiation realistically achievable in production environments?
2. Are there any published measurements or community experiences showing the
fastest stable failover times you've achieved? What's considered a reliable
minimum time span?
3. Have there been significant enhancements in the newer versions of Corosync
and Pacemaker (post-2015) that specifically target detection speed and failover
latency?
4. If sub-second detection is possible, what are the key configuration
parameters and potential trade-offs (false positives, network sensitivity,
resource overhead)?
Thanks in advance!
Holger Haidinger
_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users
ClusterLabs home: https://www.clusterlabs.org/