Re: [lustre-discuss] how does lustre handle node failure

2023-07-22 Thread Andreas Dilger via lustre-discuss
Shawn,
Lustre handles the largest filesystems in the world, hundreds of PB in size, so 
there are definitely Lustre filesystems with hundreds of servers.

In large storage clusters the servers fail over in pairs or quads, since the 
storage is typically not on a single global SAN that all nodes can access. 
There is definitely not a single huge HA cluster for all of the servers in 
the filesystem.
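
To make the scaling argument concrete, here is a small illustrative model (not Lustre or Pacemaker code). It assumes servers are grouped into independent pairs, each with two targets per server; the names HAGroup, build_groups, oss000 and OST0000 are made up for the sketch. A node failure only moves targets inside its own group, so the group count can grow without any cluster-wide coordination.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class HAGroup:
    """One small, independent failover group (a pair or quad of OSS nodes)."""
    nodes: list[str]
    # target name -> node currently serving it
    targets: dict[str, str] = field(default_factory=dict)

    def fail(self, node: str) -> None:
        """Move every target served by `node` to surviving peers in this group."""
        survivors = [n for n in self.nodes if n != node]
        if not survivors:
            raise RuntimeError("no surviving peer in this HA group")
        moved = sorted(t for t, owner in self.targets.items() if owner == node)
        for i, target in enumerate(moved):
            # Spread the failed node's targets round-robin over the survivors.
            self.targets[target] = survivors[i % len(survivors)]
        self.nodes = survivors


def build_groups(n_servers: int, group_size: int = 2,
                 targets_per_server: int = 2) -> list[HAGroup]:
    """Split n_servers into independent groups (hypothetical naming scheme)."""
    groups = []
    ost = 0
    for start in range(0, n_servers, group_size):
        end = min(start + group_size, n_servers)
        group = HAGroup(nodes=[f"oss{i:03d}" for i in range(start, end)])
        for node in group.nodes:
            for _ in range(targets_per_server):
                group.targets[f"OST{ost:04x}"] = node
                ost += 1
        groups.append(group)
    return groups


def placement(groups: list[HAGroup]) -> dict[str, str]:
    """Flatten all groups into one target -> node map."""
    return {t: owner for g in groups for t, owner in g.targets.items()}


groups = build_groups(n_servers=200, group_size=2)
before = placement(groups)
groups[0].fail("oss000")          # simulate one OSS losing power
after = placement(groups)
changed = sorted(t for t in before if before[t] != after[t])
print(len(groups), "HA groups;", len(changed), "targets moved:", changed)
```

With 200 servers in pairs, the failure touches only the two targets of the failed node; the other 99 groups are not involved.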

Cheers, Andreas

On Jul 21, 2023, at 16:09, Shawn via lustre-discuss wrote:


Hi Laura,  thanks for your reply.
It seems the OSSs will share the disks created from a shared SAN.  So the 
OSS pairs can fail over in a pre-defined manner if one node is down, 
coordinated by an HA manager.

This can certainly work on a limited scale.  I'm curious if this static scheme 
can scale to a large cluster with 100s of OSS servers?


regards,
Shawn




On Tue, Jul 18, 2023 at 1:25 PM Laura Hild <l...@jlab.org> wrote:
I'm not familiar with using FLR to tolerate OSS failures.  My site does the HA 
pairs with shared storage method.  It's sort of described in the manual

  https://doc.lustre.org/lustre_manual.xhtml#configuringfailover

but in more detail, specific to Pacemaker, at

  https://wiki.lustre.org/Creating_a_Framework_for_High_Availability_with_Pacemaker

and

  https://wiki.lustre.org/Creating_Pacemaker_Resources_for_Lustre_Storage_Services

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] how does lustre handle node failure

2023-07-21 Thread Shawn via lustre-discuss
Hi Laura,  thanks for your reply.
It seems the OSSs will share the disks created from a shared SAN.  So the
OSS pairs can fail over in a pre-defined manner if one node is down,
coordinated by an HA manager.

This can certainly work on a limited scale.  I'm curious if this static
scheme can scale to a large cluster with 100s of OSS servers?


regards,
Shawn




On Tue, Jul 18, 2023 at 1:25 PM Laura Hild wrote:

> I'm not familiar with using FLR to tolerate OSS failures.  My site does
> the HA pairs with shared storage method.  It's sort of described in the
> manual
>
>   https://doc.lustre.org/lustre_manual.xhtml#configuringfailover
>
> but in more detail, specific to Pacemaker, at
>
>
> https://wiki.lustre.org/Creating_a_Framework_for_High_Availability_with_Pacemaker
>
> and
>
>
> https://wiki.lustre.org/Creating_Pacemaker_Resources_for_Lustre_Storage_Services
>
>


Re: [lustre-discuss] how does lustre handle node failure

2023-07-18 Thread Laura Hild via lustre-discuss
I'm not familiar with using FLR to tolerate OSS failures.  My site does the HA 
pairs with shared storage method.  It's sort of described in the manual

  https://doc.lustre.org/lustre_manual.xhtml#configuringfailover

but in more detail, specific to Pacemaker, at

  https://wiki.lustre.org/Creating_a_Framework_for_High_Availability_with_Pacemaker

and

  https://wiki.lustre.org/Creating_Pacemaker_Resources_for_Lustre_Storage_Services
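
For orientation, the failover section of the manual formats each target with one --servicenode option per server that is allowed to mount it. Below is a minimal sketch that only assembles such an mkfs.lustre command line and prints it, without running anything. The helper name mkfs_ost_command, the file system name, the NIDs and the device path are hypothetical placeholders; check the manual for the options that apply to your release.

```python
import shlex


def mkfs_ost_command(fsname, index, mgs_nids, service_nids, device):
    """Assemble (not execute) an mkfs.lustre command for an OST with failover peers."""
    args = ["mkfs.lustre", f"--fsname={fsname}", "--ost", f"--index={index}"]
    # One --mgsnode per MGS NID, so the target can still find a failed-over MGS.
    args += [f"--mgsnode={nid}" for nid in mgs_nids]
    # One --servicenode per OSS in the HA pair that may mount this target.
    args += [f"--servicenode={nid}" for nid in service_nids]
    args.append(device)  # shared block device visible to both OSS nodes
    return args


# Hypothetical values for illustration only.
cmd = mkfs_ost_command(
    fsname="testfs",
    index=0,
    mgs_nids=["10.0.0.1@tcp"],
    service_nids=["10.0.0.11@tcp", "10.0.0.12@tcp"],
    device="/dev/mapper/ost0000",
)
print(shlex.join(cmd))
```

The HA manager (Pacemaker in the wiki pages above) is then responsible for making sure only one of the listed servers mounts the device at a time.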



[lustre-discuss] how does lustre handle node failure

2023-07-18 Thread Shawn via lustre-discuss
If I want to tolerate an OSS node failure (power cut, etc.), what config is
needed in Lustre?  Multiple replicas, two nodes in HA mode, or some
other mechanism?  Thanks.


Shawn