Re: [lustre-discuss] [EXTERNAL] [BULK] MDS hardware - NVME?

2024-01-10 Thread Cameron Harr via lustre-discuss

On 1/10/24 11:59, Thomas Roth via lustre-discuss wrote:
Actually we had MDTs on software RAID-1 *connecting two JBODs* for 
quite some time - it worked surprisingly well and stably.


I'm glad it's working for you!




Hmm, if you have your MDTs on a zpool of mirrors, aka RAID-10, wouldn't 
going towards raidz2 increase data safety - something you don't need if 
the SSDs never fail anyway? Doesn't raidz2 protect against failure of 
*any* two disks, whereas in a pool of mirrors the second failure could 
destroy one mirror?


With raidz2 you can replace any disk in the raid group, but there are also 
a lot more drives that can fail. With mirrors, there's a 1:1 replacement 
ratio with essentially no rebuild time. Of course, that assumes the 2 
drives you lost weren't the 2 drives in the same mirror, but we consider 
that low-probability. ZFS is also smart enough to (try to) suspend the 
pool if it loses too many devices. And the striped mirrors may see 
better performance than Z2.
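
As a rough illustration of the 1:1 replacement point (pool and device names below are made up, not taken from this thread): replacing a failed member of a 2-way mirror only resilvers that mirror's data from its surviving partner, whereas a raidz2 rebuild has to read across the whole raid group.

    # Check pool health and swap the failed mirror member for a spare NVMe device
    zpool status -x mdt0pool
    zpool replace mdt0pool /dev/nvme2n1 /dev/nvme6n1
    # Resilver progress is limited to the one affected mirror
    zpool status mdt0pool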




Re: [lustre-discuss] [EXTERNAL] [BULK] MDS hardware - NVME?

2024-01-10 Thread Thomas Roth via lustre-discuss

Actually we had MDTs on software RAID-1 *connecting two JBODs* for quite some 
time - it worked surprisingly well and stably.

Still, personally I would prefer ZFS anytime. Nowadays all our OSTs are on 
ZFS, very stable.
Of course, a look at all the possible ZFS parameters tells me that surely I 
have overlooked a crucial tuning tweak ;-)


Hmm, if you have your MDTs on a zpool of mirrors, aka RAID-10, wouldn't going 
towards raidz2 increase data safety - something you don't need if the SSDs 
never fail anyway? Doesn't raidz2 protect against failure of *any* two disks, 
whereas in a pool of mirrors the second failure could destroy one mirror?


Regards
Thomas


Re: [lustre-discuss] [EXTERNAL] [BULK] MDS hardware - NVME?

2024-01-09 Thread Cameron Harr via lustre-discuss

Thomas,

We value management over performance and have knowingly left performance 
on the floor in the name of standardization, robustness, management, 
etc., while still maintaining our performance targets. We are a heavy 
ZFS-on-Linux (ZoL) shop, so we never considered MD-RAID, which, IMO, is 
very far behind ZoL in enterprise storage features.


As Jeff mentioned, we have done some tuning (and if you haven't noticed, 
there are *a lot* of possible ZFS parameters) to further improve 
performance, and we are at a good place performance-wise.


Cameron
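
For readers wondering where those parameters live, a minimal sketch (the pool and dataset names are placeholders, and these particular properties are common starting points, not necessarily the ones tuned at this site):

    # Dataset-level properties on the MDT dataset
    zfs get recordsize,dnodesize,xattr,atime,compression mdt0pool/mdt0
    zfs set atime=off mdt0pool/mdt0
    zfs set compression=lz4 mdt0pool/mdt0

    # Kernel-module tunables are exposed under /sys/module/zfs/parameters
    grep . /sys/module/zfs/parameters/zfs_arc_max \
           /sys/module/zfs/parameters/zfs_dirty_data_max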


Re: [lustre-discuss] [EXTERNAL] [BULK] MDS hardware - NVME?

2024-01-08 Thread Jeff Johnson
Today nvme/mdraid/ldiskfs will beat nvme/zfs on MDS IOPS, but you can
close the gap somewhat with tuning: zfs ashift/recordsize and special
allocation class vdevs. While the IOPS performance favors
nvme/mdraid/ldiskfs, there are tradeoffs. The snapshot/backup abilities
of ZFS, and the security it provides to the most critical function in a
Lustre file system, shouldn't be undervalued. From personal experience,
I'd much rather deal with ZFS in the event of a seriously jackknifed
MDT than mdraid/ldiskfs, and both zfs and mdraid/ldiskfs are preferable
to trying to unscramble a vendor blackbox hwraid volume. ;-)

When zfs directio lands and is fully integrated into Lustre the
performance differences *should* be negligible.

Just my $.02 worth
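
A minimal sketch of the tuning knobs mentioned above, with made-up pool, dataset and device names (layout is illustrative only, not a recommendation):

    # Create the pool with explicit 4K sector alignment (ashift=12) and a
    # mirrored special allocation class vdev for metadata and small blocks
    zpool create -o ashift=12 tank \
        raidz2 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1 \
        special mirror /dev/nvme4n1 /dev/nvme5n1

    # Route blocks of 64K or smaller to the special vdev; recordsize is
    # workload-dependent, so check it before changing it
    zfs set special_small_blocks=64K tank/mdt0
    zfs get recordsize tank/mdt0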


Re: [lustre-discuss] [EXTERNAL] [BULK] MDS hardware - NVME?

2024-01-08 Thread Thomas Roth via lustre-discuss

Hi Cameron,

did you run a performance comparison between ZFS and mdadm-raid on the MDTs?
I'm currently doing some tests, and the results favor software raid, in 
particular when it comes to IOPS.

Regards
Thomas
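
For comparisons like this, mdtest (from the IOR suite) is one common way to measure metadata IOPS; a sketch with placeholder mount points, process counts and file counts - not necessarily the test used here:

    # Create/stat/remove many small files and directories on each file system
    mpirun -np 32 mdtest -n 20000 -i 3 -u -d /lustre/zfs_fs/mdtest
    mpirun -np 32 mdtest -n 20000 -i 3 -u -d /lustre/mdraid_fs/mdtest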


Re: [lustre-discuss] [EXTERNAL] [BULK] MDS hardware - NVME?

2024-01-08 Thread Vicker, Darby J. (JSC-EG111)[Jacobs Technology, Inc.] via lustre-discuss
Our setup has a single JBOD connected to 2 servers, but the JBOD has dual 
controllers.  Each server connects to both controllers for redundancy, so there 
are 4 connections to each server.  So we have a paired HA setup where one peer 
node can take over the OSTs/MDTs of its peer node.  Some specifics on our 
hardware:

Supermicro twin servers:
https://www.supermicro.com/products/archive/system/sys-6027tr-d71frf

JBOD:
https://www.supermicro.com/products/archive/chassis/sc946ed-r2kjbod

Each node of the pair can “zpool import” all of the pools from its peer.  Here 
is an excerpt from our ldev.conf file:

#local  foreign/-  label   [md|zfs:]device-path   [journal-path]/- [raidtab]

# primary hpfs-fsl (aka /nobackup) lustre file system
hpfs-fsl-mds0.fsl.jsc.nasa.gov  hpfs-fsl-mds1.fsl.jsc.nasa.gov  hpfs-fsl-MDT  zfs:mds0-0/meta-fsl

hpfs-fsl-oss00.fsl.jsc.nasa.gov hpfs-fsl-oss01.fsl.jsc.nasa.gov hpfs-fsl-OST  zfs:oss00-0/ost-fsl
hpfs-fsl-oss00.fsl.jsc.nasa.gov hpfs-fsl-oss01.fsl.jsc.nasa.gov hpfs-fsl-OST000c  zfs:oss00-1/ost-fsl

hpfs-fsl-oss01.fsl.jsc.nasa.gov hpfs-fsl-oss00.fsl.jsc.nasa.gov hpfs-fsl-OST0001  zfs:oss01-0/ost-fsl
hpfs-fsl-oss01.fsl.jsc.nasa.gov hpfs-fsl-oss00.fsl.jsc.nasa.gov hpfs-fsl-OST000d  zfs:oss01-1/ost-fsl



If you wanted to fail oss01’s OSTs over to oss00, you’d do a “service lustre 
stop” on oss01 followed by a “service lustre start foreign” on oss00.  This 
setup has been stable and has served us well for a long time.  Our servers are 
stable enough that we never set up automated failover via corosync or something 
similar.
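
Spelled out as commands, that manual failover looks roughly like this (pool name taken from the ldev.conf excerpt above; prompts added for clarity):

    # On the node being taken down: stop its Lustre targets
    [root@oss01]# service lustre stop

    # On the surviving peer: start the "foreign" targets it backs up
    [root@oss00]# service lustre start foreign

    # Verify the peer now owns its own pools plus oss01's
    [root@oss00]# zpool list
    [root@oss00]# zpool status oss01-0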




Re: [lustre-discuss] [EXTERNAL] [BULK] MDS hardware - NVME?

2024-01-07 Thread Vinícius Ferrão via lustre-discuss
Hi Vicker, may I ask if you have any kind of HA in this setup?

If yes, I’m interested in how the ZFS pools would migrate from one server to 
another in case of failure. I’m considering the typical Lustre deployment where 
you have two servers attached to two JBODs using a multipath SAS topology with 
crossed cables: |X|.

I can easily understand that when you have hardware RAID running on the JBOD 
and SAS HBAs on the servers, but for an all-software solution I’m unsure how 
that works effectively.

Thank you.


Re: [lustre-discuss] [EXTERNAL] [BULK] MDS hardware - NVME?

2024-01-05 Thread Cameron Harr via lustre-discuss
This doesn't answer your question about ldiskfs on zvols, but we've been 
running MDTs on ZFS on NVMe in production for a couple of years (and on SAS 
SSDs for many years prior). Our current production MDTs using NVMe 
consist of one zpool/node made up of 3x 2-drive mirrors, but we've been 
experimenting lately with using raidz3 and possibly even raidz2 for MDTs 
since SSDs have been pretty reliable for us.


Cameron
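
A minimal sketch of that layout - one pool per node built from three 2-drive mirrors - using placeholder device, pool and file system names rather than the actual production configuration:

    # Three 2-way mirrors striped into one pool, 4K sector alignment
    zpool create -o ashift=12 mdt0pool \
        mirror /dev/nvme0n1 /dev/nvme1n1 \
        mirror /dev/nvme2n1 /dev/nvme3n1 \
        mirror /dev/nvme4n1 /dev/nvme5n1

    # Format the MDT directly on a dataset in that pool
    mkfs.lustre --mdt --backfstype=zfs --fsname=testfs --index=0 \
        --mgsnode=mgs@o2ib mdt0pool/mdt0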


Re: [lustre-discuss] [EXTERNAL] [BULK] MDS hardware - NVME?

2024-01-05 Thread Vicker, Darby J. (JSC-EG111)[Jacobs Technology, Inc.] via lustre-discuss
We are in the process of retiring two long-standing LFS's (about 8 years old), 
which we built and managed ourselves.  Both use ZFS and have the MDTs on SSDs 
in a JBOD that require the kind of software-based management you describe, in 
our case ZFS pools built on multipath devices.  The MDT in one is ZFS and the 
MDT in the other LFS is ldiskfs but uses ZFS and a zvol as you describe - we 
build the ldiskfs MDT on top of the zvol.  Generally, this has worked well for 
us, with one big caveat.  If you look for my posts to this list and the ZFS 
list you'll find more details.

The short version is that we utilize ZFS snapshots and clones to do backups of 
the metadata.  We've run into situations where the backup process stalls, 
leaving a clone hanging around.  We've experienced a situation a couple of 
times where the clone and the primary zvol get swapped, effectively rolling 
back our metadata to the point when the clone was created.  I have tried, 
unsuccessfully, to recreate that in a test environment.  So if you do that 
kind of setup, make sure you have good monitoring in place to detect if your 
backups/clones stall.

We've kept up with Lustre and ZFS updates over the years and are currently on 
Lustre 2.14 and ZFS 2.1.  We've seen the gap between our ZFS MDT and ldiskfs 
performance shrink to the point where they are pretty much on par with each 
other now.  I think our ZFS MDT performance could be better with more hardware 
and software tuning, but our small team hasn't had the bandwidth to tackle that.

Our newest LFS is vendor provided and uses NVMe MDTs.  I'm not at liberty to 
talk about the proprietary way those devices are managed.  However, the 
metadata performance is SO much better than our older LFS's, for a lot of 
reasons, but I'd highly recommend NVMe for your MDTs.

-----Original Message-----
From: lustre-discuss <lustre-discuss-boun...@lists.lustre.org> on behalf of Thomas Roth via lustre-discuss <lustre-discuss@lists.lustre.org>
Reply-To: Thomas Roth <t.r...@gsi.de>
Date: Friday, January 5, 2024 at 9:03 AM
To: Lustre Diskussionsliste <lustre-discuss@lists.lustre.org>
Subject: [EXTERNAL] [BULK] [lustre-discuss] MDS hardware - NVME?


Dear all,


considering NVME storage for the next MDS.


As I understand it, NVMe disks are bundled in software, not by a hardware RAID 
controller.
This would be done using Linux software RAID, mdadm - correct?


We have some experience with ZFS, which we use on our OSTs.
But I would like to stick to ldiskfs for the MDTs, and a zpool with a zvol on 
top which is then formatted with ldiskfs - too much voodoo...


How is this handled elsewhere? Any experiences?




The available devices are quite large. If I create a raid-10 out of 4 disks, 
e.g. 7 TB each, my MDT will be 14 TB - already close to the 16 TB limit.
So no need for a box with lots of U.3 slots.
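
As a sketch of the mdadm variant being asked about (device names, array name and fsname are placeholders):

    # 4 NVMe namespaces as RAID-10, then an ldiskfs MDT on the md device
    mdadm --create /dev/md0 --level=10 --raid-devices=4 \
        /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1
    mkfs.lustre --mdt --fsname=testfs --index=0 --mgsnode=mgs@o2ib /dev/md0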


But for MDS operations, we will still need a powerful dual-CPU system with lots 
of RAM.
Should the NVMe devices then be distributed between the CPUs?
Is there a way to pinpoint this in a call for tender?
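
One way to check NUMA placement after delivery (rather than in the tender itself) is to read each controller's NUMA node from sysfs - a sketch, assuming a standard Linux sysfs layout:

    # Print the NUMA node each NVMe controller is attached to
    for d in /sys/class/nvme/nvme*; do
        echo "$(basename "$d"): numa_node=$(cat "$d/device/numa_node")"
    done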




Best regards,
Thomas



Thomas Roth


GSI Helmholtzzentrum für Schwerionenforschung GmbH
Planckstraße 1, 64291 Darmstadt, Germany, http://www.gsi.de/ 



Commercial Register / Handelsregister: Amtsgericht Darmstadt, HRB 1528
Managing Directors / Geschäftsführung:
Professor Dr. Paolo Giubellino, Dr. Ulrich Breuer, Jörg Blaurock
Chairman of the Supervisory Board / Vorsitzender des GSI-Aufsichtsrats:
State Secretary / Staatssekretär Dr. Volkmar Dietz



