Re: [lustre-discuss] how does lustre handle node failure

Andreas Dilger via lustre-discuss Sat, 22 Jul 2023 21:36:16 -0700

Shawn,
Lustre handles the largest filesystems in the world, hundreds of PB in size, so 
there are definitely Lustre filesystems with hundreds of servers.


In large storage clusters the servers fail over in pairs or quads. The 
storage is typically not on a single global SAN that every node can access, so 
there is no single huge HA cluster spanning all of the servers in the 
filesystem.
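
To make the scaling argument concrete, here is a minimal sketch in Python. It 
is only a toy model, not Lustre or Pacemaker code: the names HAGroup, 
build_groups, the ossNNN hostnames, and the "least loaded survivor" placement 
rule are all assumptions made for illustration. It shows that each small 
group handles failures on its own, so adding servers adds more independent 
groups rather than growing one cluster.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class HAGroup:
    """One small, independent failover group (a pair or a quad of servers)."""

    servers: list[str]
    # Target name -> server currently hosting it (None means unavailable).
    placement: dict[str, str | None] = field(default_factory=dict)
    down: set[str] = field(default_factory=set)

    def load(self, server: str) -> int:
        """Number of targets currently hosted on the given server."""
        return sum(1 for host in self.placement.values() if host == server)

    def fail(self, server: str) -> None:
        """Mark a server as failed and move its targets to surviving peers."""
        self.down.add(server)
        alive = [s for s in self.servers if s not in self.down]
        moved = [t for t, host in self.placement.items() if host == server]
        for target in moved:
            # Hypothetical policy: pick the least loaded survivor in the group.
            self.placement[target] = min(alive, key=self.load) if alive else None


def build_groups(n_servers: int, group_size: int = 2,
                 targets_per_server: int = 1) -> list[HAGroup]:
    """Split n_servers into independent groups, each with its own targets."""
    names = [f"oss{i:03d}" for i in range(n_servers)]
    groups = []
    ost = 0
    for start in range(0, n_servers, group_size):
        group = HAGroup(servers=names[start:start + group_size])
        for server in group.servers:
            for _ in range(targets_per_server):
                group.placement[f"OST{ost:04x}"] = server
                ost += 1
        groups.append(group)
    return groups


# 200 servers become 100 two-node groups; a failure touches only one of them.
groups = build_groups(200, group_size=2)
groups[0].fail("oss000")
print(len(groups), groups[0].placement)
```

In this model a failure in one group never involves the other groups, which 
is why the per-group HA configuration stays the same size no matter how many 
servers the filesystem has.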

Cheers, Andreas

On Jul 21, 2023, at 16:09, Shawn via lustre-discuss 
<lustre-discuss@lists.lustre.org> wrote:


Hi Laura,  thanks for your reply.
It seems the OSSs share disks provisioned from a shared SAN, so the OSS pairs 
can fail over in a predefined manner when one node goes down, coordinated by 
an HA manager.

This can certainly work on a limited scale.  I'm curious whether this static 
scheme can scale to a large cluster with hundreds of OSSs.


regards,
Shawn




On Tue, Jul 18, 2023 at 1:25 PM Laura Hild <l...@jlab.org> wrote:
I'm not familiar with using FLR to tolerate OSS failures.  My site uses HA 
pairs with shared storage.  That method is described briefly in the manual

  https://doc.lustre.org/lustre_manual.xhtml#configuringfailover

but in more, Pacemaker-specific detail at

  https://wiki.lustre.org/Creating_a_Framework_for_High_Availability_with_Pacemaker

and

  https://wiki.lustre.org/Creating_Pacemaker_Resources_for_Lustre_Storage_Services

_______________________________________________
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

