[lustre-discuss] Zabbix Lustre template

2023-09-27 Thread David Cohen via lustre-discuss
Hi,
I'm looking for a Zabbix Lustre template, but couldn't find one.
Is anyone aware of such a template and can share a link?

Thanks,
David
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] [BULK] Re: [EXTERNAL] Re: Data recovery with lost MDT data

2023-09-27 Thread Vicker, Darby J. (JSC-EG111)[Jacobs Technology, Inc.] via lustre-discuss
The file order is the same for both methods.  I get a list of all files from 
“lfs find”.  In both cases, I pop a group of files off the list to process.  It’s 
very consistent that a direct mv is slow but an “xargs mv” is fast.  It would 
be nice to understand why, but, at this point, we’re happy we found a method 
that worked.

This filesystem’s MDT was originally on HDDs (2.5” 10k drives) but we migrated 
to SSDs.  They are old and fairly slow by today’s SSD/NVMe standards but still 
much better than even current HDDs for seek time.  They made a big difference 
in metadata performance when we did that upgrade (5+ years ago).

We are in pretty good shape at this point.  We got the “xargs mv” method going 
in parallel to move groups of 1000 files at once into a unique subdirectory.  
As of early this morning, all 13M files have been moved into subdirectories and 
we can operate on them quickly again with regular commands (stat, mv, etc.).  We 
are now looping through the files again (also in parallel) and moving each 
user’s files into that user’s directory (i.e. 
/mnt/$USER/lost+found/subdir).  It looks like this process should also 
complete in about a day.
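
For readers who want to try the batching described above, here is a minimal 
sketch.  It is not the script that was actually used.  It assumes GNU split, 
xargs and mv, and a newline-separated file list such as the output of 
“lfs find”.  The function name move_in_batches, the chunk.NNNNNN subdirectory 
names and the default batch and job counts are made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: move the files named in LIST into numbered subdirectories of
# DEST, BATCH files per subdirectory, running JOBS batches in parallel.
set -euo pipefail

move_in_batches() {
    local list=$1 dest=$2 batch=${3:-1000} jobs=${4:-4}
    local tmp
    tmp=$(mktemp -d)

    # Cut the list into chunk files of at most $batch lines each.
    split -l "$batch" -d -a 6 "$list" "$tmp/chunk."

    # For every chunk: create a subdirectory named after the chunk and feed
    # the chunk's file names to a single "xargs mv -t" invocation.
    find "$tmp" -name 'chunk.*' -print0 |
        xargs -0 -P "$jobs" -I{} sh -c '
            sub="$1/$(basename "$2")"
            mkdir -p "$sub"
            xargs -d "\n" mv -t "$sub" -- < "$2"
        ' sh "$dest" {}

    rm -rf "$tmp"
}
```

A call such as “move_in_batches lost+found.txt /path/to/lost+found 1000 8” 
would then mirror the 1000-files-per-subdirectory layout described here; the 
paths and the job count are placeholders.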

Thanks for your help with this.

From: Andreas Dilger
Date: Wednesday, September 27, 2023 at 1:10 AM
To: "Vicker, Darby J. (JSC-EG111)[Jacobs Technology, Inc.]"
Cc: "lustre-discuss@lists.lustre.org"
Subject: Re: [lustre-discuss] [BULK] Re: [EXTERNAL] Re: Data recovery with lost MDT data

CAUTION: This email originated from outside of NASA.  Please take care when 
clicking links or opening attachments.  Use the "Report Message" button to 
report suspicious messages to the NASA SOC.


On Sep 26, 2023, at 09:33, Vicker, Darby J. (JSC-EG111)[Jacobs Technology, 
Inc.] <darby.vicke...@nasa.gov> wrote:

I tried statx from lustre-tests, but apparently don’t have it built correctly:

[root@hpfs-fsl-lmon0 ~]# statx
Skip: system does not support statx syscall.
[root@hpfs-fsl-lmon0 ~]#

The stat isn’t too important right now – if I can get a smaller group of files 
separated into a subdirectory, everything is fast again.  So I’m

I also tried moving a group of files at once but that didn’t help either.  
Tested this by manually making a test script that does a “mv file1 file2 … 
fileN subdir1” on 1000 files.  That took about 7.5 hours – the same 30 seconds 
per file.

OK, looks like xargs is the answer.


subdir=/scratch-lustre/work/dvicker/lost+found/dir3
mkdir $subdir
head -1000 lost+found.txt | xargs mv -t $subdir


Moving 1000 files like this takes only about 80 seconds.  I would think this 
would be equivalent to the manual “mv file1 file2 … fileN subdir1” method but 
it’s not.  Any ideas why this is so much faster?

It could be the order in which the files are being processed.  "find ... | head 
-1000" is generating a list of files in "readdir" order, which is the hashed 
order that they appear in large directory blocks, and probably only modifies a 
dozen blocks in the filesystem.   Using "mv file1 file2 ..." is probably in 
"sorted by filename" order, which looks nice but causes all kinds of random IO 
operations when updating a thousand different directory blocks to move them.
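
One way to look at the two orderings side by side is sketched below.  The 
function names are hypothetical, and the only thing the sketch shows is the 
ordering itself: find without a sort prints entries in the order the 
filesystem returns them, while the second function sorts by name the way a 
hand-written “mv file1 file2 ...” list usually ends up.  Whether that ordering 
explains the timing difference remains the hypothesis stated above.

```shell
# Print up to N regular files of DIR in the order the directory returns them.
first_n_readdir_order() {
    local dir=$1 n=${2:-1000}
    find "$dir" -mindepth 1 -maxdepth 1 -type f -print | head -n "$n"
}

# Print up to N regular files of DIR sorted by name (byte order).
first_n_sorted() {
    local dir=$1 n=${2:-1000}
    find "$dir" -mindepth 1 -maxdepth 1 -type f -print | LC_ALL=C sort | head -n "$n"
}
```

Comparing the two outputs with diff on a large directory shows how far apart 
the orders are.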

You wouldn't happen to be running this old filesystem with the MDT on HDD 
devices?  That would make a *huge* difference in performance from "mostly 
linear" to "mostly random", otherwise I can't imagine there is a huge 
difference for an NVMe MDT.  Possibly, the extra layering of ldiskfs on top of 
ZFS is also causing it heartburn, especially if your zvol is not configured 
with 4KiB recordsize, since that will also cause a lot of write inflation.

Cheers, Andreas



From: Andreas Dilger <adil...@whamcloud.com>
Date: Monday, September 25, 2023 at 8:19 PM
To: "Vicker, Darby J. (JSC-EG111)[Jacobs Technology, Inc.]" <darby.vicke...@nasa.gov>
Cc: "lustre-discuss@lists.lustre.org" <lustre-discuss@lists.lustre.org>
Subject: Re: [lustre-discuss] [BULK] Re: [EXTERNAL] Re: Data recovery with lost MDT data


Probably using "stat" on each file is slow, since this is getting the file size 
from each OST object. You could try the "xstat" utility in the lustre-tests RPM 
(or build it directly) as it will only query the MDS for the requested 
attributes (owner at minimum).

Then you could split into per-date directories in a separate phase, if needed, 
run in parallel.

I can't suggest anything about the 13M entry directory, but it _should_ be much 
faster than 1 file per 30s even at that size. I suspect that the script is 
still doing something bad, since shell and GNU utilities are terrible for doing 
extra stat/cd/etc five times on each file that is accessed, renamed, etc.

You would be better off to use 

[lustre-discuss] Ongoing issues with quota

2023-09-27 Thread Daniel Szkola via lustre-discuss
We have a Lustre filesystem that we just upgraded to 2.15.3; however, this 
problem has been going on for some time.

The quota command shows this:

Disk quotas for grp somegroup (gid 9544):
     Filesystem    used   quota   limit   grace    files    quota    limit   grace
       /lustre1  13.38T     40T     45T       - 3136761*  2621440  3670016 expired

The group is not using nearly that many files. We have robinhood installed and 
it shows this:

Using config file '/etc/robinhood.d/lustre1.conf'.
     group,     type,    count,     volume,   spc_used,   avg_size
 somegroup,  symlink,    59071,    5.12 MB,  103.16 MB,         91
 somegroup,      dir,   426619,    5.24 GB,    5.24 GB,   12.87 KB
 somegroup,     file,  1310414,   16.24 TB,   13.37 TB,   13.00 MB

Total: 1796104 entries, volume: 17866508365925 bytes (16.25 TB), space used: 14704924899840 bytes (13.37 TB)

Any ideas what is wrong here?

—
Dan Szkola
FNAL
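
To put numbers on the mismatch, the figures quoted in this message can be 
compared directly.  The snippet is only arithmetic on the values shown above 
and says nothing about the cause; the variable names are made up.

```shell
# Values copied from the quota and robinhood output in this message.
quota_files=3136761      # "files" column of the quota report
soft_limit=2621440       # inode soft limit ("quota" column)
rbh_symlink=59071
rbh_dir=426619
rbh_file=1310414

rbh_total=$((rbh_symlink + rbh_dir + rbh_file))
excess=$((quota_files - rbh_total))
over_soft=$((quota_files - soft_limit))

echo "robinhood entries:            $rbh_total"   # 1796104
echo "quota minus robinhood:        $excess"      # 1340657
echo "quota count over soft limit:  $over_soft"   # 515321
```

By robinhood's count the group would be well under the soft limit, which is 
why the "expired" grace period looks wrong.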


Re: [lustre-discuss] [EXTERNAL EMAIL] Re: [EXTERNAL EMAIL] Re: [EXTERNAL] No port 988?

2023-09-27 Thread Jeff Johnson
Nothing better than sliding in at the last moment to steal all the glory ;-)

—Jeff



Re: [lustre-discuss] [EXTERNAL EMAIL] Re: [EXTERNAL EMAIL] Re: [EXTERNAL] No port 988?

2023-09-27 Thread Jan Andersen

Hi Jeff,

Yes, that was it! Things are working beautifully now - big thanks.

/jan



Re: [lustre-discuss] [EXTERNAL EMAIL] Re: [EXTERNAL] No port 988?

2023-09-27 Thread Jeff Johnson
Any chance the firewall is running?

You can use `lctl ping ipaddress@lnet` to check if you have functional lnet
between machines. Example `lctl ping 10.0.0.10@tcp`

—Jeff
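
The two checks suggested here can be wrapped in a small helper.  This is a 
sketch under assumptions: the host uses firewalld (as Rocky 8 does by 
default), and the function names lnet_nid and check_lnet_peer are invented 
for this example.  Only systemctl and lctl ping are real commands.

```shell
# Build an LNet NID such as 10.0.0.10@tcp from an address and a network name.
lnet_nid() {
    printf '%s@%s\n' "$1" "${2:-tcp}"
}

# Report whether firewalld is running, then try an LNet ping to the peer.
check_lnet_peer() {
    local nid
    nid=$(lnet_nid "$1" "${2:-tcp}")
    if command -v systemctl >/dev/null 2>&1 &&
       systemctl is-active --quiet firewalld; then
        echo "firewalld is active; TCP port 988 may be blocked"
    fi
    if command -v lctl >/dev/null 2>&1; then
        lctl ping "$nid"
    else
        echo "lctl not found; would run: lctl ping $nid"
    fi
}
```

If the firewall turns out to be the problem, opening the port with 
“firewall-cmd --permanent --add-port=988/tcp” followed by 
“firewall-cmd --reload” is one option; whether that fits a given site's 
policy is for the admin to decide.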



Re: [lustre-discuss] [EXTERNAL EMAIL] Re: [EXTERNAL] No port 988?

2023-09-27 Thread Jan Andersen
However, it is still timing out when I try to mount on the oss. This is 
the kernel module:


[root@mds ~]# lsmod | grep lnet
lnet                  704512  7 mgs,obdclass,osp,ptlrpc,mgc,ksocklnd,mdt
libcfs                266240  15 fld,lnet,fid,lod,mdd,mgs,obdclass,osp,ptlrpc,mgc,ksocklnd,mdt,osd_ldiskfs,lquota,lfsck
sunrpc                577536  2 lnet

But it only listens on tcp6, which I don't use - is there a way to force 
it to use tcp4?


[root@mds ~]# netstat -nap | grep 988
tcp6       0      0 :::988                  :::*                    LISTEN      -


/jan
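
A note on reading that netstat line: on Linux, a wildcard IPv6 listener shown 
as :::988 generally accepts IPv4 connections as well when the sysctl 
net.ipv6.bindv6only is 0 (the default), unless the socket itself opts out. 
Whether the LNet acceptor socket behaves this way is an assumption here, not 
something stated in the thread.  The sketch below, with made-up function 
names, pulls the listeners for a port out of netstat output and prints the 
sysctl value.

```shell
# Read "netstat -nap"-style lines on stdin and print protocol and local
# address for every listening socket bound to PORT.
listeners_on_port() {
    local port=$1
    awk -v p="$port" '$6 == "LISTEN" {
        n = split($4, a, ":")        # the port is the last ":"-separated field
        if (a[n] == p) print $1, $4
    }'
}

# 0 means IPv6 wildcard sockets also take IPv4 connections by default.
bindv6only() {
    cat /proc/sys/net/ipv6/bindv6only 2>/dev/null || echo unknown
}
```

Typical use would be “netstat -nap | listeners_on_port 988” followed by 
“bindv6only”.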



Re: [lustre-discuss] [EXTERNAL EMAIL] Re: [EXTERNAL] No port 988?

2023-09-27 Thread Jan Andersen

Hi Rick,

Very strange - when I started the vm this morning, 'modprobe lnet' 
didn't return an error - and it seems to have loaded the module:


[root@rocky8 ~]# lsmod | grep lnet
lnet                  704512  0
libcfs                266240  1 lnet
sunrpc                577536  2 lnet

Looking at the running kernel and the kernel source, they now seem to be 
the same version:


[root@rocky8 ~]# ll /usr/src/kernels
total 4
drwxr-xr-x. 23 root root 4096 Sep 26 12:34 4.18.0-477.27.1.el8_8.x86_64/
[root@rocky8 ~]# uname -r
4.18.0-477.27.1.el8_8.x86_64

- which would explain that it now works. Things were a bit hectic with 
other things yesterday afternoon, and I don't quite remember installing 
a new kernel, but it looks like I did. Hopefully this is my problem 
solved, then - sorry for jumping up and down and making noise!


/jan
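
The manual comparison of “uname -r” against /usr/src/kernels can be scripted. 
This is a small sketch with an invented function name; it only checks that a 
source tree directory named after the running kernel exists, which is the 
same comparison made by eye above.

```shell
# Succeed if a kernel tree for RUNNING (default: the running kernel)
# exists under SRC_ROOT (default: /usr/src/kernels).
kernel_tree_matches() {
    local running=${1:-$(uname -r)} src_root=${2:-/usr/src/kernels}
    [ -d "$src_root/$running" ]
}
```

Running “kernel_tree_matches || echo mismatch” before building or loading the 
modules would flag the situation early.  “modinfo -F vermagic lnet” is 
another way to see which kernel version an installed module was built for.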

On 26/09/2023 18:13, Mohr, Rick wrote:

What error do you get when you run "modprobe lnet"?

--Rick

