[lustre-discuss] 2.15.4 o2iblnd on RoCEv2?

2024-01-09 Thread Jeff Johnson
Howdy intrepid Lustrefarians,

While starting down the debug rabbit hole I thought I'd raise my hand
and see if anyone has a few magic beans to spare.

I cannot get lnet (via lnetctl) to init a o2iblnd interface on a
RoCEv2 interface.

Running `lnetctl net add --net ib0 --if enp1s0np0` results in
 net:
  errno: -1
  descr: cannot parse net '<255:65535>'

Nothing in dmesg to indicate why. Search engines aren't coughing up
much here either.

Env: Rocky 8.9 x86_64, MOFED 5.8-4.1.5.0, Lustre 2.15.4

I'm able to run mpi over the RoCEv2 interface. Utils like ibstatus and
ibdev2netdev report it correctly. ibv_rc_pingpong works fine between
nodes.

Configuring as socklnd works fine. `lnetctl net add --net tcp0 --if
enp1s0np0 && lnetctl net show`
[root@r2u11n3 ~]# lnetctl net show
net:
- net type: lo
  local NI(s):
- nid: 0@lo
  status: up
- net type: tcp
  local NI(s):
- nid: 10.0.50.27@tcp
  status: up
  interfaces:
  0: enp1s0np0

I verified the RoCEv2 interface using nVidia's `cma_roce_mode` as well
as sysfs references

[root@r2u11n3 ~]# cma_roce_mode -d mlx5_0 -p 1
RoCE v2

Ideas? Suggestions? Incense?

Thanks,

--Jeff
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Extending Lustre file system

2024-01-09 Thread Backer via lustre-discuss
Thank you all for the valuable information. Are there any tools that I
could use to migrate (rebalance) OSTs? I know about lfs_migrate. Is there a
tool that walks and balance the OST usage?

Thank you!


On Mon, 8 Jan 2024 at 09:38, Backer  wrote:

> Hi,
>
> Good morning and happy new year!
>
> I have a quick question on extending a lustre file system. The extension
> is performed online. I am looking for any best practices or anything to
> watchout while doing the file system extension. The file system extension
> is done adding new OSS and many OSTs within these servers.
>
> Really appreciate your help on this.
>
> Regards,
>
>
>
>
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] [EXTERNAL] [BULK] MDS hardware - NVME?

2024-01-09 Thread Cameron Harr via lustre-discuss

Thomas,

We value management over performance and have knowingly left performance 
on the floor in the name of standardization, robustness, management, 
etc; while still maintaining our performance targets. We are a heavy 
ZFS-on-Linux (ZoL) shop so we never considered MD-RAID, which, IMO, is 
very far behind ZoL in enterprise storage features.


As Jeff mentioned, we have done some tuning (and if you haven't noticed 
there are *a lot* of possible ZFS parameters) to further improve 
performance and are at a good place performance-wise.


Cameron

On 1/8/24 10:33, Jeff Johnson wrote:

Today nvme/mdraid/ldiskfs will beat nvme/zfs on MDS IOPs but you can
close the gap somewhat with tuning, zfs ashift/recordsize and special
allocation class vdevs. While the IOPs performance favors
nvme/mdraid/ldiskfs there are tradeoffs. The snapshot/backup abilities
of ZFS and the security it provides to the most critical function in a
Lustre file system shouldn't be undervalued. From personal experience,
I'd much rather deal with zfs in the event of a seriously jackknifed
MDT than mdraid/ldiskfs and both zfs and mdraid/ldiskfs are preferable
to trying to unscramble a vendor blackbox hwraid volume. ;-)

When zfs directio lands and is fully integrated into Lustre the
performance differences *should* be negligible.

Just my $.02 worth

On Mon, Jan 8, 2024 at 8:23 AM Thomas Roth via lustre-discuss
 wrote:

Hi Cameron,

did you run a performance comparison between ZFS and mdadm-raid on the MDTs?
I'm currently doing some tests, and the results favor software raid, in 
particular when it comes to IOPS.

Regards
Thomas

On 1/5/24 19:55, Cameron Harr via lustre-discuss wrote:

This doesn't answer your question about ldiskfs on zvols, but we've been 
running MDTs on ZFS on NVMe in production for a couple years (and on SAS SSDs 
for many years prior). Our current production MDTs using NVMe consist of one 
zpool/node made up of 3x 2-drive mirrors, but we've been experimenting lately 
with using raidz3 and possibly even raidz2 for MDTs since SSDs have been pretty 
reliable for us.

Cameron

On 1/5/24 9:07 AM, Vicker, Darby J. (JSC-EG111)[Jacobs Technology, Inc.] via 
lustre-discuss wrote:

We are in the process of retiring two long standing LFS's (about 8 years old), 
which we built and managed ourselves.  Both use ZFS and have the MDT'S on ssd's 
in a JBOD that require the kind of software-based management you describe, in 
our case ZFS pools built on multipath devices.  The MDT in one is ZFS and the 
MDT in the other LFS is ldiskfs but uses ZFS and a zvol as you describe - we 
build the ldiskfs MDT on top of the zvol.  Generally, this has worked well for 
us, with one big caveat.  If you look for my posts to this list and the ZFS 
list you'll find more details.  The short version is that we utilize ZFS 
snapshots and clones to do backups of the metadata.  We've run into situations 
where the backup process stalls, leaving a clone hanging around.  We've 
experienced a situation a couple of times where the clone and the primary zvol 
get swapped, effectively rolling back our metadata to the point when the clone 
was created.  I have tried, unsuccessfully, to recreate
that in a test environment.  So if you do that kind of setup, make sure you 
have good monitoring in place to detect if your backups/clones stall.  We've 
kept up with lustre and ZFS updates over the years and are currently on lustre 
2.14 and ZFS 2.1.  We've seen the gap between our ZFS MDT and ldiskfs 
performance shrink to the point where they are pretty much on par to each now.  
I think our ZFS MDT performance could be better with more hardware and software 
tuning but our small team hasn't had the bandwidth to tackle that.

Our newest LFS is vendor provided and uses NVMe MDT's.  I'm not at liberty to 
talk about the proprietary way those devices are managed.  However, the 
metadata performance is SO much better than our older LFS's, for a lot of 
reasons, but I'd highly recommend NVMe's for your MDT's.

-Original Message-
From: lustre-discuss mailto:lustre-discuss-boun...@lists.lustre.org>> on behalf of Thomas Roth via lustre-discuss 
mailto:lustre-discuss@lists.lustre.org>>
Reply-To: Thomas Roth mailto:t.r...@gsi.de>>
Date: Friday, January 5, 2024 at 9:03 AM
To: Lustre Diskussionsliste mailto:lustre-discuss@lists.lustre.org>>
Subject: [EXTERNAL] [BULK] [lustre-discuss] MDS hardware - NVME?


CAUTION: This email originated from outside of NASA. Please take care when clicking links 
or opening attachments. Use the "Report Message" button to report suspicious 
messages to the NASA SOC.








Dear all,


considering NVME storage for the next MDS.


As I understand, NVME disks are bundled in software, not by a hardware raid 
controller.
This would be done using Linux software raid, mdadm, correct?


We have some experience with ZFS, which we use on our OSTs.
But I would like to stick to ldiskfs for the MDTs, and a zpool with a zvol on 
top which is then