Re: [lustre-discuss] how does lustre handle node failure

2023-07-22 Thread Andreas Dilger via lustre-discuss
Shawn, Lustre handles the largest filesystems in the world, hundreds of PB in size, so there are definitely Lustre filesystems with hundreds of servers. In large storage clusters, the servers fail over in pairs or quads, since the storage is typically not on a single global SAN for all nodes to

Re: [lustre-discuss] how does lustre handle node failure

2023-07-21 Thread Shawn via lustre-discuss
Hi Laura, thanks for your reply. It seems the OSSs share disks provisioned from a shared SAN, so the OSS pairs can fail over in a predefined manner if one node goes down, coordinated by an HA manager. This can certainly work on a limited scale. I'm curious if this static scheme can scale to

Re: [lustre-discuss] how does lustre handle node failure

2023-07-18 Thread Laura Hild via lustre-discuss
I'm not familiar with using FLR to tolerate OSS failures. My site uses the HA-pairs-with-shared-storage method. It's sort of described in the manual https://doc.lustre.org/lustre_manual.xhtml#configuringfailover but in more Pacemaker-specific detail at

[lustre-discuss] how does lustre handle node failure

2023-07-18 Thread Shawn via lustre-discuss
If I want to tolerate an OSS node failure (power cut, etc.), what config is needed in Lustre? Multiple replicas, two nodes in HA mode, or some other mechanism? Thanks. Shawn