Shawn, Lustre handles the largest filesystems in the world, hundreds of PB in size, so there are definitely Lustre filesystems with hundreds of servers.
In large storage clusters the servers fail over in pairs or quads, since the storage is typically not on a single global SAN for all nodes to access, so there is definitely not a single huge HA cluster for all of the servers in the filesystem.

Cheers, Andreas

On Jul 21, 2023, at 16:09, Shawn via lustre-discuss <lustre-discuss@lists.lustre.org> wrote:

Hi Laura,

thanks for your reply. It seems the OSSs will share the disks created from a shared SAN. So the OSS pairs can fail over in a pre-defined manner if one node is down, coordinated by an HA manager. This can certainly work on a limited scale. I'm curious whether this static scheme can scale to a large cluster with hundreds of OSS servers?

regards,
Shawn

On Tue, Jul 18, 2023 at 1:25 PM Laura Hild <l...@jlab.org> wrote:

I'm not familiar with using FLR to tolerate OSS failures. My site does the HA pairs with shared storage method. It's sort of described in the manual

https://doc.lustre.org/lustre_manual.xhtml#configuringfailover

but in more, Pacemaker-specific detail at

https://wiki.lustre.org/Creating_a_Framework_for_High_Availability_with_Pacemaker

and

https://wiki.lustre.org/Creating_Pacemaker_Resources_for_Lustre_Storage_Services

_______________________________________________
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
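As a rough illustration of the point made in this thread (many small, independent failover pairs rather than one large HA cluster), here is a minimal Python sketch. It is only a toy model: the node names, the `HAPair` class, and the `build_pairs` and `takeover_node` helpers are hypothetical and are not part of Lustre or Pacemaker. It assumes simple two-node pairs, although the thread also mentions quads.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HAPair:
    """Two OSS nodes sharing storage and backing each other up (hypothetical model)."""
    node_a: str
    node_b: str

    def partner_of(self, node: str) -> str:
        """Return the other member of this pair."""
        if node == self.node_a:
            return self.node_b
        if node == self.node_b:
            return self.node_a
        raise ValueError(f"{node} is not a member of this pair")


def build_pairs(nodes: list[str]) -> list[HAPair]:
    """Group consecutive nodes into independent two-node failover pairs."""
    if len(nodes) % 2 != 0:
        raise ValueError("this sketch assumes an even number of nodes")
    return [HAPair(nodes[i], nodes[i + 1]) for i in range(0, len(nodes), 2)]


def takeover_node(pairs: list[HAPair], failed: str) -> str:
    """Find which node takes over for a failed one; only its own pair is involved."""
    for pair in pairs:
        if failed in (pair.node_a, pair.node_b):
            return pair.partner_of(failed)
    raise KeyError(failed)


# Hypothetical cluster of 200 OSS nodes, i.e. 100 separate two-node HA domains.
nodes = [f"oss{i:03d}" for i in range(200)]
pairs = build_pairs(nodes)
print(len(pairs))
print(takeover_node(pairs, "oss042"))
```

In this model, adding more servers only adds more pairs; a failure is handled inside one pair and never requires coordination across the whole filesystem, which is the scaling argument given in the reply.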