Re: [lustre-discuss] how does lustre handle node failure
Shawn,

Lustre handles the largest filesystems in the world, hundreds of PB in size, so there are definitely Lustre filesystems with hundreds of servers. In large storage clusters the servers fail over in pairs or quads, since the storage is typically not on a single global SAN that all nodes can access. There is therefore no single huge HA cluster covering all of the servers in the filesystem; each pair or quad is its own small failover group.

Cheers, Andreas

On Jul 21, 2023, at 16:09, Shawn via lustre-discuss wrote:
> Hi Laura, thanks for your reply. It seems the OSSs share disks
> provisioned from a shared SAN, so OSS pairs can fail over in a
> predefined manner when one node goes down, coordinated by an HA
> manager. This can certainly work on a limited scale. I'm curious
> whether this static scheme can scale to a large cluster with hundreds
> of OSS servers.
>
> Regards, Shawn

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
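The scaling argument above can be sketched with a toy model. This is not Lustre or Pacemaker code: the class `FailoverGroup`, the helpers `build_pairs` and `fail`, and the server and target names are all hypothetical, and the "move the target to a surviving peer" rule is a simplifying assumption standing in for whatever the HA manager actually does. It only illustrates that a node failure is handled inside one small group, however many groups the filesystem has.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class FailoverGroup:
    """Hypothetical model of one small HA group (a pair or quad) on shared storage."""
    servers: list[str]
    targets: dict[str, str]  # target name -> preferred (primary) server
    down: set[str] = field(default_factory=set)

    def placement(self) -> dict[str, str | None]:
        """Map each target to the server that would host it right now."""
        alive = [s for s in self.servers if s not in self.down]
        result: dict[str, str | None] = {}
        for i, (target, primary) in enumerate(sorted(self.targets.items())):
            if primary not in self.down:
                result[target] = primary                # normal operation
            elif alive:
                result[target] = alive[i % len(alive)]  # moved to a surviving peer
            else:
                result[target] = None                   # whole group is down
        return result


def build_pairs(n_servers: int) -> list[FailoverGroup]:
    """Build n_servers / 2 independent pairs, one illustrative target per server."""
    groups = []
    for i in range(0, n_servers, 2):
        a, b = f"oss{i:03d}", f"oss{i + 1:03d}"
        groups.append(FailoverGroup(
            servers=[a, b],
            targets={f"OST{i:04x}": a, f"OST{i + 1:04x}": b},
        ))
    return groups


def fail(groups: list[FailoverGroup], server: str) -> FailoverGroup:
    """Mark one server as down and return the only group that is affected."""
    for group in groups:
        if server in group.servers:
            group.down.add(server)
            return group
    raise KeyError(server)


groups = build_pairs(200)          # 200 servers -> 100 independent pairs
affected = fail(groups, "oss004")  # only the oss004/oss005 pair reacts
print(affected.placement())
```

In this model the other 99 pairs never see the failure, which is the sense in which a static pairing does not need one cluster-wide HA configuration.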
Re: [lustre-discuss] how does lustre handle node failure
Hi Laura,

Thanks for your reply. It seems the OSSs share disks provisioned from a shared SAN, so OSS pairs can fail over in a predefined manner when one node goes down, coordinated by an HA manager. This can certainly work on a limited scale. I'm curious whether this static scheme can scale to a large cluster with hundreds of OSS servers.

Regards, Shawn

On Tue, Jul 18, 2023 at 1:25 PM Laura Hild wrote:
> I'm not familiar with using FLR to tolerate OSS failures. My site does
> the HA pairs with shared storage method. It's sort of described in the
> manual
>
> https://doc.lustre.org/lustre_manual.xhtml#configuringfailover
>
> but in more Pacemaker-specific detail at
>
> https://wiki.lustre.org/Creating_a_Framework_for_High_Availability_with_Pacemaker
>
> and
>
> https://wiki.lustre.org/Creating_Pacemaker_Resources_for_Lustre_Storage_Services
Re: [lustre-discuss] how does lustre handle node failure
I'm not familiar with using FLR to tolerate OSS failures. My site uses the HA-pairs-with-shared-storage method. It's sort of described in the manual

https://doc.lustre.org/lustre_manual.xhtml#configuringfailover

but in more Pacemaker-specific detail at

https://wiki.lustre.org/Creating_a_Framework_for_High_Availability_with_Pacemaker

and

https://wiki.lustre.org/Creating_Pacemaker_Resources_for_Lustre_Storage_Services
[lustre-discuss] how does lustre handle node failure
If I want to tolerate an OSS node failure (power cut, etc.), what configuration is needed in Lustre? Multiple replicas, two nodes in HA mode, or some other mechanism?

Thanks.
Shawn