[lustre-discuss] BCP for High Availability?

2023-01-15 Thread Andrew Elwell via lustre-discuss
Hi Folks, I'm just rebuilding my testbed and have got to the "sort out all the pacemaker stuff" part. What's the best current practice for the 2.15.x LTS release tree? I've always done this as multiple individual HA clusters covering each pair of servers with common dual connected
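For reference, a minimal sketch of one failover pair under pcs, assuming the ocf:lustre:Lustre agent from the lustre-resource-agents package and purely illustrative device/mount names:

  # one resource per target; pacemaker mounts/unmounts it on failover
  pcs resource create testfs-OST0000 ocf:lustre:Lustre \
      target=/dev/mapper/OST0000 mountpoint=/lustre/testfs-OST0000
  # keep the target on its preferred server when both nodes are healthy
  pcs constraint location testfs-OST0000 prefers oss1=100
  # fencing (STONITH) still needs to be configured for safe failover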

[lustre-discuss] 2.15.x with ConnectX-3 cards

2022-12-10 Thread Andrew Elwell via lustre-discuss
Hi Gang, I've just gone and reimaged a test system in prep for doing an upgrade to Rocky 8 + 2.15.1 (What's the bet 2.15.2 comes out the night I push to prod?) However, the 2.15.1-ib release uses MOFED 5.6 ... which no longer supports CX-3 cards. (yeah, it's olde hardware...) Having been badly
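One workaround, sketched here and untested on my part, is to rebuild 2.15.1 from source against the distro's in-kernel OFED (or an older MOFED that still carries mlx4) instead of the prebuilt -ib packages:

  git clone git://git.whamcloud.com/fs/lustre-release.git && cd lustre-release
  git checkout 2.15.1
  sh autogen.sh
  # omit --with-o2ib=/usr/src/ofa_kernel/default to build against in-kernel OFED
  ./configure
  make rpms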

[lustre-discuss] Version interoperability

2022-11-08 Thread Andrew Elwell via lustre-discuss
Hi folks, We're faced with a (short term measured in months, not years thankfully) seriously large gap in versions between our existing clients (2.7.5) and new hardware clients (2.15.0) that will be mounting the same file system. It's currently on 2.10.8-ib (ldiskfs) with connectx-5 cards, and I

[lustre-discuss] 2.12.9-ib release?

2022-06-24 Thread Andrew Elwell via lustre-discuss
Hi folks, I see the 2.12.9/ release tree on https://downloads.whamcloud.com/public/lustre/, but I don't see the accompanying 2.12.9-ib/ one. ISTR someone needed to poke a build process last time to get this public - can they do the same this time please? Many thanks Andrew

[lustre-discuss] unclear language in Operations manual

2022-06-15 Thread Andrew Elwell via lustre-discuss
Hi folks, I've recently come across this snippet in the ops manual (section 13.8. Running Multiple Lustre File Systems, page 111 in the current pdf) > Note > If a client(s) will be mounted on several file systems, add the following > line to /etc/xattr.conf file to avoid problems when files
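For context (from memory, so treat it as an assumption rather than a quote of the manual), the line in question tells the xattr tools to skip Lustre's own attributes:

  # /etc/xattr.conf -- skip lustre.* attributes when copying files between filesystems
  lustre.*  skip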

[lustre-discuss] jobstats

2022-05-27 Thread Andrew Elwell via lustre-discuss
Hi folks, I've finally started to re-investigate pushing jobstats to our central dashboards and realised there's a dearth of scripts / tooling to actually gather the job_stats files and push them to $whatever. I have seen the telegraf one, and the DDN fork of collectd seems somewhat abandonware.
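Absent a polished collector, the raw data is just lctl output; the aggregation into $whatever is the hard part:

  # per-job I/O counters on the OSSes and MDSes
  lctl get_param -n obdfilter.*.job_stats
  lctl get_param -n mdt.*.job_stats
  # optionally age out old entries after scraping (seconds)
  lctl set_param *.*.job_cleanup_interval=600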

Re: [lustre-discuss] Corrupted? MDT not mounting

2022-05-10 Thread Andrew Elwell via lustre-discuss
On Wed, 11 May 2022 at 04:37, Laura Hild wrote: > The non-dummy SRP module is in the kmod-srp package, which isn't included in > the Lustre repository... Thanks Laura, Yeah, I realised that earlier in the week, and have rebuilt the srp module from source via mlnxofedinstall, and sure enough

Re: [lustre-discuss] Corrupted? MDT not mounting

2022-05-08 Thread Andrew Elwell via lustre-discuss
On Fri, 6 May 2022 at 20:04, Andreas Dilger wrote: > MOFED is usually preferred over in-kernel OFED, it is just tested and fixed a > lot more. Fair enough. However, is the 2.12.8-ib tree built with all the features? Specifically

Re: [lustre-discuss] Corrupted? MDT not mounting

2022-05-05 Thread Andrew Elwell via lustre-discuss
> It's looking more like something filled up our space - I'm just > copying the files out as a backup (mounted as ldiskfs just now) - Ahem. Inode quotas are a good idea. Turns out that a user creating about 130 million directories rapidly is more than a small MDT volume can take. An update on
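For anyone following along, a minimal example of the sort of inode quota that would have caught this (user name and mount point are hypothetical):

  # cap a user at roughly 1M inodes on the filesystem
  lfs setquota -u someuser --inode-softlimit 900000 --inode-hardlimit 1000000 /lustre/astrofs
  lfs quota -u someuser /lustre/astrofs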

Re: [lustre-discuss] Corrupted? MDT not mounting

2022-04-20 Thread Andrew Elwell via lustre-discuss
Thanks Stéphane, It's looking more like something filled up our space - I'm just copying the files out as a backup (mounted as ldiskfs just now) - we're running DNE (MDT and this one MDT0001) but I don't understand why so much space is being taken up in REMOTE_PARENT_DIR - we seem to have
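For reference, a file-level copy-out along these lines is roughly the following (device and destination paths illustrative; the xattr flags matter, since trusted.* must be preserved for the backup to be restorable):

  mount -t ldiskfs -o ro /dev/mapper/MDT0001 /mnt/mdt_ro
  cd /mnt/mdt_ro
  tar czf /backup/mdt0001.tgz --xattrs --xattrs-include='trusted.*' --sparse .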

[lustre-discuss] Corrupted? MDT not mounting

2022-04-19 Thread Andrew Elwell via lustre-discuss
Hi Folks, One of our filesystems seemed to fail over the holiday weekend - we're running DNE and MDT0001 won't mount. At first it looked like we'd run out of space (rc = -28) but then we were seeing this mount.lustre: mount /dev/mapper/MDT0001 at /lustre/astrofs-MDT0001 failed: File exists
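(rc = -28 is ENOSPC; on the MDS the quick blocks-vs-inodes checks are along these lines, target name illustrative:)

  lctl get_param osd-*.astrofs-MDT0001.filesfree osd-*.astrofs-MDT0001.filestotal
  lctl get_param osd-*.astrofs-MDT0001.kbytesfree osd-*.astrofs-MDT0001.kbytesavail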

[lustre-discuss] Hardware advice for homelab

2021-07-19 Thread Andrew Elwell via lustre-discuss
Hi folks, Given my homelab testing for Lustre tends to be contained within VirtualBox on laptop ($work has a physical hardware test bed once mucking around gets serious), I'm considering expanding to some real hardware at home for testing. My MythTV days are over, but I'd ideally like an aarch64

[lustre-discuss] Determining server version from client

2021-01-18 Thread Andrew Elwell
Hi All, Is there a trivial command to determine the server-side version of Lustre (in my case, trying to confirm what types of quotas are allowed: project - 2.10+, default - 2.12+)? I was hoping there'd be something in lfs, such as lfs getname --version, which would ideally spit out something
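The closest thing I know of (an assumption on my part that it is reported by older servers too) is the target_version field in the client's import state:

  # per-MDT/OST server version as seen by this client's imports
  lctl get_param mdc.*.import osc.*.import | grep -E 'target|version'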

Re: [lustre-discuss] quotas not being enforced

2021-01-13 Thread Andrew Elwell
On Thu, 14 Jan 2021 at 17:12, Andrew Elwell wrote: > I'm struggling to debug quota enforcement (or more worryingly, lack > of) in recentish LTS releases. > > [root@pgfs-mds3 ~]# lctl conf_param testfs.quota.ost=g > ... time passes > [root@pgfs-mds4 ~]# lctl get_param osd-*.

[lustre-discuss] quotas not being enforced

2021-01-13 Thread Andrew Elwell
Hi folks, I'm struggling to debug quota enforcement (or more worryingly, lack of) in recentish LTS releases. Our test system (2 servers, shared SAS disks between them) is running lustre-2.12.6-1.el7.x86_64 e2fsprogs-1.45.6.wc3-0.el7.x86_64 kernel-3.10.0-1160.2.1.el7_lustre.x86_64 but the storage
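For comparison, the sequence I'd expect to turn enforcement on and verify it took (fsname illustrative; conf_param is run on the MGS):

  # enable user+group enforcement on both MDTs and OSTs
  lctl conf_param testfs.quota.mdt=ug
  lctl conf_param testfs.quota.ost=ug
  # on the servers, check the quota slaves report enforcement as enabled and up to date
  lctl get_param osd-*.*.quota_slave.info
  # from a client, sanity-check the limits themselves
  lfs quota -u someuser /mnt/testfs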

[lustre-discuss] CentOS / LTS plans

2020-12-10 Thread Andrew Elwell
Hi All, I'm guessing most of you have heard of the recent roadmap for CentOS (discussion of which isn't on topic for this list), but can we have a vague (happy for it to be "at this point we're thinking about X, but we haven't really decided" level) indication of what the plan for the upcoming

[lustre-discuss] status of HSM copytools?

2020-08-22 Thread Andrew Elwell
Hi folks, I'm looking round to see what's current / 'supported' / working in the state of copytools. Ideally one that can migrate to/from object stores (Ceph or S3). The github repo for Lemur (https://github.com/whamcloud/lemur/commits/master) doesn't seem to have had any substantial work since it
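For the record, the baseline that still works is the in-tree POSIX copytool; enabling HSM and starting it is roughly the below (archive path and mount point illustrative, and it won't talk to S3/Ceph by itself):

  # on the MDS: turn the HSM coordinator on
  lctl set_param mdt.*.hsm_control=enabled
  # on an agent node with the filesystem mounted
  lhsmtool_posix --daemon --hsm-root=/archive --archive=1 /mnt/lustre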

[lustre-discuss] Commvault lustre backup / archive

2020-08-10 Thread Andrew Elwell
Hi folks, I see from their release notes that Commvault should be able to act on changelogs for backup. Anyone here doing so? Any gotchas to worry about? Is it better than scanning (ugh) and making the MDS unhappy? Similarly, how good is the archive functionality? Does it play well with Lustre HSM
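Whatever the backup product, the changelog plumbing underneath it is the usual register/consume/clear cycle, roughly (MDT name and consumer id illustrative):

  # register a changelog consumer on the MDT (returns an id such as cl1)
  lctl --device astrofs-MDT0000 changelog_register
  # read records, then acknowledge them so the MDT can purge
  lfs changelog astrofs-MDT0000
  lfs changelog_clear astrofs-MDT0000 cl1 0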

Re: [lustre-discuss] Pacemaker resource Agents

2020-07-13 Thread Andrew Elwell
> I've been trying to locate the Lustre specific Pacemaker resource agents but > I've had no luck at github where they were meant to be hosted, maybe I am > looking at the wrong project? > Has anyone recently implemented a HA lustre cluster using pacemaker and did > you use lustre specific

Re: [lustre-discuss] Jobstats harvesting

2020-02-17 Thread Andrew Elwell
On Mon., 17 Feb. 2020, 18:06 Andreas Dilger, wrote: > You don't mention which Lustre release you are using, but newer > releases allow "complex JobIDs" that can contain both the SLURMJobID > as well as other constant strings (e.g. cluster name), hostname, UID, GID, > and process name. > Yeah, i

[lustre-discuss] Jobstats harvesting

2020-02-14 Thread Andrew Elwell
Hi folks, I've finally got round to enabling jobstats on a test system. As we're a Slurm shop, setting this to jobid_var=SLURM_JOB_ID works OK, but is it possible to use a combination of variables? ie ${PAWSEY_CLUSTER}-${SLURM_JOB_ID} (or even SLURM_CLUSTER_NAME which is the same as
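If I'm reading the complex JobID support (2.12-ish onwards) right, the cluster prefix goes into jobid_name, with %j expanding to the value of the variable named by jobid_var — something like the following, prefix illustrative:

  lctl set_param -P jobid_var=SLURM_JOB_ID
  lctl set_param -P jobid_name="clustername.%j"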

Re: [lustre-discuss] Slow mount on clients

2020-02-04 Thread Andrew Elwell
> HA / MGS running on second node in fstab :-) that was one of the first things we checked, and I've tried manually mounting it but no change 10.10.36.224@o2ib4:10.10.36.225@o2ib4:/askapfs1 3.7P 3.0P 507T 86% /askapbuffer hpc-admin2:~ # lctl ping 10.10.36.224@o2ib4 12345-0@lo

Re: [lustre-discuss] Running an older Lustre server (2.5) with a newer client (2.11)

2019-08-30 Thread Andrew Elwell
On Fri., 30 Aug. 2019, 09:01 Kirill Lozinskiy, wrote: > Is there anyone out there running Lustre server version 2.5.x with a > Lustre client version 2.11.x? I'm curious if you are running this > combination and whether or not you saw any gains or losses when you went to > the newer Lustre

Re: [lustre-discuss] Wanted: multipath.conf for dell ME4 series arrays

2019-08-21 Thread Andrew Elwell
Hi Jeff, On Wed, 21 Aug 2019 at 17:34, Jeff Johnson wrote: > What underlying Lustre target filesystem? (assuming ldiskfs with a hardware > RAID array) correct - ldiskfs, using 8* raid6 luns per ME4084 > What does your current multipath.conf look like? we just had blacklist, WWNs and mappings,

[lustre-discuss] Wanted: multipath.conf for dell ME4 series arrays

2019-08-21 Thread Andrew Elwell
Hi folks, we're seeing MMP reluctance to hand over the (unmounted) OSTs to the partner pair on our shiny new ME4084 arrays. Does anyone have the device {} settings they'd be willing to share? My gut feel is we've not defined path failover properly and some timeouts need tweaking (4* ME4084's
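For what it's worth, a starting-point stanza for an ALUA array of this type might look like the below — the vendor/product strings and timeouts are assumptions to check against your own multipath -ll output, not gospel:

  devices {
      device {
          vendor                "DellEMC"
          product               "ME4"
          path_grouping_policy  "group_by_prio"
          prio                  "alua"
          hardware_handler      "1 alua"
          path_checker          "tur"
          failback              "immediate"
          no_path_retry         30
      }
  }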

[lustre-discuss] State of arm client?

2019-04-24 Thread Andrew Elwell
Hi folks, I remember seeing a press release by DDN/Whamcloud last November that they were going to support ARM, but can anyone point me to the current state of client? I'd like to deploy it onto a raspberry pi cluster (only 4-5 nodes) ideally on raspbian for demo / training purposes. (Yes I know

[lustre-discuss] lfs check *, change of behaviour from 2.7 to 2.10?

2019-04-09 Thread Andrew Elwell
I've just noticed that 'lfs check mds' / 'lfs check servers' no longer works (2.10.0 or greater clients) for unprivileged users, yet it worked for 2.7.x clients. Is this by design? (lfs quota thankfully still works as a normal user though) Andrew

Re: [lustre-discuss] Suspended jobs and rebooting lustre servers

2019-02-28 Thread Andrew Elwell
On Tue, 26 Feb 2019 at 23:25, Andreas Dilger wrote: > I agree that having an option that creates the OSTs as inactive might be > helpful, though I wouldn't want that to be the default as I'd imagine it > would also cause problems for the majority users that wouldn't know that they > need to

Re: [lustre-discuss] Command line tool to monitor Lustre I/O ?

2018-12-21 Thread Andrew Elwell
On Fri., 21 Dec. 2018, 01:05 Laifer, Roland (SCC) wrote: > Dear Lustre administrators, > what is a good command line tool to monitor current Lustre metadata and > throughput operations on the local client or server? > I wrote a small python script to parse lctl get_param and inject it straight into our
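In the absence of a nicer tool, the raw counters are easy enough to watch directly; client-side llite stats shown here, with obdfilter.*.stats / mdt.*.md_stats being the server-side equivalents:

  # rough per-client view of metadata and read/write ops, refreshed every 5s
  watch -n 5 "lctl get_param llite.*.stats"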

[lustre-discuss] Openstack Manila + Lustre?

2018-11-20 Thread Andrew Elwell
Hi All, Is there anyone on list exporting Lustre filesystems to (private) cloud services - possibly via Manila? I can see https://www.openstack.org/assets/science/CrossroadofCloudandHPC-Print.pdf and Simon's talk from LUG2016

[lustre-discuss] rsync target for https://downloads.whamcloud.com/public/?

2018-09-26 Thread Andrew Elwell
Hi folks, Is there an rsync (or other easily mirrorable) target for downloads.whamcloud.com? I'm trying to pull e2fsprogs/latest/el7/ and lustre/latest-release/el7/server/ locally to reinstall a bunch of machines. Many thanks, Andrew
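In the meantime the trees can be mirrored over plain HTTPS with wget (paths as in the question; adjust --cut-dirs to taste):

  wget --mirror --no-parent -nH --cut-dirs=1 --reject "index.html*" \
      https://downloads.whamcloud.com/public/e2fsprogs/latest/el7/ \
      https://downloads.whamcloud.com/public/lustre/latest-release/el7/server/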

Re: [lustre-discuss] Does lustre 2.10 client support 2.5 server ?

2017-11-09 Thread Andrew Elwell
> My Lustre server is running the version 2.5 and I want to use 2.10 client. > Is this combination supported? Is there anything that I need to be aware of? 2 of our storage appliances (Sonexion 1600 based) run 2.5.1; I've mounted this fine on infiniband clients with 2.10.0 and 2.10.1, but

Re: [lustre-discuss] 1 MDS and 1 OSS

2017-10-30 Thread Andrew Elwell
On 31 Oct. 2017 07:20, "Dilger, Andreas" wrote: Having a larger MDT isn't bad if you plan future expansion. That said, you would get better performance over FDR if you used SSDs for the MDT rather than HDDs (if you aren't already planning this), and for a single OSS

[lustre-discuss] Point release updates

2017-09-08 Thread Andrew Elwell
Hi Folks, We currently have a couple of storage systems based on IEEEL 3.0: [root@pgfs-oss1 ~]# cat /proc/fs/lustre/version lustre: 2.7.16.8 kernel: patchless_client build: jenkins-arch=x86_64,build_type=client,distro=el7,ib_stack=inkernel-15--PRISTINE-3.10.0-327.36.1.el7_lustre.x86_64