Re: [lustre-discuss] kernel threads for rpcs in flight

2024-05-02 Thread Andreas Dilger via lustre-discuss
 dd:e81f3122-b1bc-4ac4-afcb-f6629a81e5bd:17993:1797634358961088:0@lo:44:dd.0
0100:0010:2.0:1714503761.141556:0:17993:0:(client.c:2239:ptlrpc_check_set())
 Completed RPC req@90c9a6ad3040 pname:cluuid:pid:xid:nid:opc:job
 dd:e81f3122-b1bc-4ac4-afcb-f6629a81e5bd:17993:1797634358961088:0@lo:44:dd.0
0100:0010:2.0:1714503761.141885:0:23893:0:(client.c:1758:ptlrpc_send_new_req())
 Sending RPC req@90c9a6ad3040 pname:cluuid:pid:xid:nid:opc:job
 ptlrpcd_01_01:e81f3122-b1bc-4ac4-afcb-f6629a81e5bd:23893:1797634358961152:0@lo:16:dd.0
0100:0010:2.0:1714503761.144172:0:23893:0:(client.c:2239:ptlrpc_check_set())
 Completed RPC req@90c9a6ad3040 pname:cluuid:pid:xid:nid:opc:job
 ptlrpcd_01_01:e81f3122-b1bc-4ac4-afcb-f6629a81e5bd:23893:1797634358961152:0@lo:16:dd.0

There are no stats files that aggregate information about ptlrpcd thread 
utilization.

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud
















___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] kernel threads for rpcs in flight

2024-04-30 Thread Andreas Dilger via lustre-discuss
, I'll continue trying with perf, which 
seems to be more complex with kernel threads.

There are kernel debug logs available when "lctl set_param debug=+rpctrace" is
enabled that will show which ptlrpcd or application thread is handling each
RPC, and which core it ran on.  These can be found on the client by
searching for "Sending RPC|Completed RPC" in the debug logs, for example:

# lctl set_param debug=+rpctrace
# lctl set_param jobid_var=procname_uid
# cp -a /etc /mnt/testfs
# lctl dk /tmp/debug
# grep -E "Sending RPC|Completed RPC" /tmp/debug
:
:
0100:0010:2.0:1714502851.435000:0:23892:0:(client.c:1758:ptlrpc_send_new_req())
 Sending RPC req@90c9b2948640 pname:cluuid:pid:xid:nid:opc:job
 
ptlrpcd_01_00:e81f3122-b1bc-4ac4-afcb-f6629a81e5bd:23892:1797634353438336:0@lo:2:cp.0
0100:0010:2.0:1714502851.436117:0:23892:0:(client.c:2239:ptlrpc_check_set())
 Completed RPC req@90c9b2948640 pname:cluuid:pid:xid:nid:opc:job
 
ptlrpcd_01_00:e81f3122-b1bc-4ac4-afcb-f6629a81e5bd:23892:1797634353438336:0@lo:2:cp.0

This shows that thread "ptlrpcd_01_00" (CPT 01, thread 00, pid 23892) ran on
core 2.0 (no hyperthread) and sent an OST_SETATTR (opc = 2) RPC on behalf of
"cp" for root (uid=0), which completed in 1117 usec (the difference between the
two timestamps above).

Similarly, with a "dd" sync write workload it shows write RPCs by the ptlrpcd 
threads, and sync RPCs in the "dd" process context:
# dd if=/dev/zero of=/mnt/testfs/file bs=4k count=1 oflag=dsync
# lctl dk /tmp/debug
# grep -E "Sending RPC|Completed RPC" /tmp/debug
:
:
0100:0010:2.0:1714503761.136971:0:23892:0:(client.c:1758:ptlrpc_send_new_req())
 Sending RPC req@90c9a6ad6640 pname:cluuid:pid:xid:nid:opc:job
 
ptlrpcd_01_00:e81f3122-b1bc-4ac4-afcb-f6629a81e5bd:23892:1797634358961024:0@lo:4:dd.0
0100:0010:2.0:1714503761.140288:0:23892:0:(client.c:2239:ptlrpc_check_set())
 Completed RPC req@90c9a6ad6640 pname:cluuid:pid:xid:nid:opc:job
 
ptlrpcd_01_00:e81f3122-b1bc-4ac4-afcb-f6629a81e5bd:23892:1797634358961024:0@lo:4:dd.0
0100:0010:2.0:1714503761.140518:0:17993:0:(client.c:1758:ptlrpc_send_new_req())
 Sending RPC req@90c9a6ad3040 pname:cluuid:pid:xid:nid:opc:job
 dd:e81f3122-b1bc-4ac4-afcb-f6629a81e5bd:17993:1797634358961088:0@lo:44:dd.0
0100:0010:2.0:1714503761.141556:0:17993:0:(client.c:2239:ptlrpc_check_set())
 Completed RPC req@90c9a6ad3040 pname:cluuid:pid:xid:nid:opc:job
 dd:e81f3122-b1bc-4ac4-afcb-f6629a81e5bd:17993:1797634358961088:0@lo:44:dd.0
0100:0010:2.0:1714503761.141885:0:23893:0:(client.c:1758:ptlrpc_send_new_req())
 Sending RPC req@90c9a6ad3040 pname:cluuid:pid:xid:nid:opc:job
 
ptlrpcd_01_01:e81f3122-b1bc-4ac4-afcb-f6629a81e5bd:23893:1797634358961152:0@lo:16:dd.0
0100:0010:2.0:1714503761.144172:0:23893:0:(client.c:2239:ptlrpc_check_set())
 Completed RPC req@90c9a6ad3040 pname:cluuid:pid:xid:nid:opc:job
 
ptlrpcd_01_01:e81f3122-b1bc-4ac4-afcb-f6629a81e5bd:23893:1797634358961152:0@lo:16:dd.0

There are no stats files that aggregate information about ptlrpcd thread 
utilization.
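Since there is no aggregate stats file, per-RPC service times can be pulled
out of the rpctrace log directly.  Below is a minimal sketch, assuming the
one-line record format shown above (the fourth colon-separated field of the
debug prefix is the seconds.microseconds timestamp, and the "req@..." token
identifies the request); adjust the field positions if your Lustre version
formats the prefix differently:

# grep -E "Sending RPC|Completed RPC" /tmp/debug | awk -F: '
    { split($0, w, " "); req = w[4] }              # w[4] is the "req@..." token
    /Sending RPC/   { sent[req] = $4 }             # $4 is the timestamp field
    /Completed RPC/ { if (req in sent) {
                          printf "%s %8.0f usec\n", req, ($4 - sent[req]) * 1e6
                          delete sent[req] } }'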

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] [EXTERNAL] [BULK] Files created in append mode don't obey directory default stripe count

2024-04-29 Thread Andreas Dilger via lustre-discuss



As you see, file "bbb" is created with stripe count 1 instead of 4.
Observed in Lustre 2.12.x and Lustre 2.15.4.

Thanks,
Frank

--
Dr. Frank Otto
Senior Research Infrastructure Developer
UCL Centre for Advanced Research Computing
Tel: 020 7679 1506
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] kernel threads for rpcs in flight

2024-04-28 Thread Andreas Dilger via lustre-discuss
On Apr 28, 2024, at 16:54, Anna Fuchs via lustre-discuss
<lustre-discuss@lists.lustre.org> wrote:

The setting max_rpcs_in_flight affects, among other things, how many threads 
can be spawned simultaneously for processing the RPCs, right?

The {osc,mdc}.*.max_rpcs_in_flight parameters actually control the maximum
number of RPCs a *client* will have in flight to any MDT or OST, while the
number of MDS and OSS threads is controlled on the server with
mds.MDS.mdt*.threads_{min,max} and ost.OSS.ost*.threads_{min,max} for each of
the various service portals (which are selected by the client based on the RPC
type).  max_rpcs_in_flight allows concurrent operations from multiple client
threads, to hide network latency and improve server utilization without
allowing a single client to overwhelm the server.
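As a quick illustration, these can all be inspected and adjusted at runtime
with lctl (the filesystem and target names below are made up):

# lctl get_param osc.*.max_rpcs_in_flight
# lctl set_param osc.testfs-OST0000-osc-*.max_rpcs_in_flight=16
(and on the servers)
# lctl get_param ost.OSS.ost_io.threads_max mds.MDS.mdt.threads_max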

In tests where the network is clearly a bottleneck, this setting has almost no 
effect - the network cannot keep up with processing the data, there is not so 
much to do in parallel.
With a faster network, the stats show higher CPU utilization on different cores 
(at least on the client).

What is the exact mechanism by which it is decided that a kernel thread is 
spawned for processing a bulk? Is there an RPC queue with timings or something 
similar?
Is it in any way predictable or calculable how many threads a specific workload 
will require (spawn if possible) given the data rates from the network and 
storage devices?

The mechanism to start new threads is relatively simple.  Before a server
thread starts processing a new request, if it is the last thread available and
the maximum number of threads is not yet running, it will try to launch a new
thread; repeat as needed.  So the thread count will depend on the client RPC
load, the RPC processing rate, and lock contention on whatever resources
those RPCs are accessing.

With max_rpcs_in_flight = 1, multiple cores are loaded, presumably alternately, 
but the statistics are too inaccurate to capture this.  The distribution of 
threads to cores is regulated by the Linux kernel, right? Does anyone have 
experience with what happens when all CPUs are under full load with the 
application or something else?


Note that {osc,mdc}.*.max_rpcs_in_flight is a *per target* parameter, so a 
single client can still have tens or hundreds of RPCs in flight to different 
servers.  The client will send many RPC types directly from the process 
context, since they are waiting on the result anyway.  For asynchronous bulk 
RPCs, the ptlrpcd thread will try to process the bulk IO on the same CPT (= 
Lustre CPU Partition Table, roughly aligned to NUMA nodes) as the userspace 
application was running when the request was created.  This minimizes the 
cross-NUMA traffic when accessing pages for bulk RPCs, so long as those cores 
are not busy with userspace tasks.  Otherwise, the ptlrpcd thread on another 
CPT will steal RPCs from the queues.

Do the Lustre threads suffer? Is there a prioritization of the Lustre threads 
over other tasks?

Are you asking about the client or the server?  Many of the client RPCs are
generated by the client threads, but the ptlrpcd threads do not run at a
higher priority than client application threads.  If the application
threads are running on some cores but other cores are idle, then the ptlrpcd
threads on the other cores will try to process the RPCs to allow the
application threads to continue running there.  Otherwise, if all cores are
busy (as is typical for HPC applications) then they will be scheduled by the
kernel as needed.

Are there readily available statistics or tools for this scenario?

What statistics are you looking for?  There are "{osc,mdc}.*.stats" and
"{osc,mdc}.*.rpc_stats" that have aggregate information about RPC counts and
latency.
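For example, on the client (output abridged; the rpc_stats histograms show the
distribution of pages per RPC and RPCs in flight for reads and writes):

# lctl get_param osc.*.rpc_stats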

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] ko2iblnd.conf

2024-04-12 Thread Andreas Dilger via lustre-discuss
The ko2iblnd-opa settings are only used if you have Intel OPA instead of 
Mellanox cards (depends on the ko2iblnd-probe script).  You should still have 
ko2iblnd line in the server config that is used for MLX cards in order to set 
the values to match on both sides.

As for the actual settings, someone with more LNet IB experience should chime 
in on what is best to use.  All I know is that they have to be the same on both 
sides or they get unhappy, and the usable values depend on the card type and 
MOFED/OFED version.  As a starting point I would just copy the client ko2iblnd 
options to the server and see if it works.
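For example, a sketch of that "copy to the server" step would be a single
identical options line on both sides (the values here are just the client
settings quoted later in this thread, not a recommendation):

# /etc/modprobe.d/ko2iblnd.conf, identical on clients and servers
options ko2iblnd peer_credits=128 peer_credits_hiw=64 credits=1024 concurrent_sends=256 ntx=2048 map_on_demand=32 fmr_pool_size=2048 fmr_flush_trigger=512 fmr_cache=1 conns_per_peer=4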

Cheers, Andreas

On Apr 11, 2024, at 12:02, Daniel Szkola <dszk...@fnal.gov> wrote:

On the server node(s):

options ko2iblnd-opa peer_credits=32 peer_credits_hiw=16 credits=1024 
concurrent_sends=64 ntx=2048 map_on_demand=256 fmr_pool_size=2048 
fmr_flush_trigger=512 fmr_cache=1 conns_per_peer=4

On clients:

options ko2iblnd peer_credits=128 peer_credits_hiw=64 credits=1024 
concurrent_sends=256 ntx=2048 map_on_demand=32 fmr_pool_size=2048 
fmr_flush_trigger=512 fmr_cache=1 conns_per_peer=4

My concern isn’t so much the mismatch, because I know that’s an issue, but
rather what numbers we should settle on with a recent lustre build. I also see
the ko2iblnd-opa line in the server config, which means that because the server
is actually loading ko2iblnd, maybe defaults are being used?

What made me look was we were seeing lots of:
LNetError: 2961324:0:(o2iblnd_cb.c:2612:kiblnd_passive_connect()) Can't accept 
conn from xxx.xxx.xxx.xxx@o2ib2, queue depth too large:  42 (<=32 wanted)

—
Dan Szkola
FNAL


On Apr 11, 2024, at 12:36 PM, Andreas Dilger <adil...@whamcloud.com> wrote:

[EXTERNAL] – This message is from an external sender


On Apr 11, 2024, at 09:56, Daniel Szkola via lustre-discuss
<lustre-discuss@lists.lustre.org> wrote:

Hello all,

I recently discovered some mismatches in our /etc/modprobe.d/ko2iblnd.conf 
files between our clients and servers.

Is it now recommended to keep the defaults on this module and run without a 
config file or are there recommended numbers for lustre-2.15.X?

The only thing I’ve seen that provides any guidance is the Lustre wiki and an 
HP/Cray doc:

https://www.hpe.com/psnow/resources/ebooks/a00113867en_us_v2/Lustre_Server_Recommended_Tuning_Parameters_4.x.html

Does anyone have any sage advice on what the ko2iblnd.conf (and possibly
ko2iblnd-opa.conf and hfi1.conf as well) should contain on modern systems?

It would be useful to know what specific settings are mismatched.  Definitely 
some of them need to be consistent between peers, others depend on your system.










Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] ko2iblnd.conf

2024-04-11 Thread Andreas Dilger via lustre-discuss
On Apr 11, 2024, at 09:56, Daniel Szkola via lustre-discuss
<lustre-discuss@lists.lustre.org> wrote:

Hello all,

I recently discovered some mismatches in our /etc/modprobe.d/ko2iblnd.conf 
files between our clients and servers.

Is it now recommended to keep the defaults on this module and run without a 
config file or are there recommended numbers for lustre-2.15.X?

The only thing I’ve seen that provides any guidance is the Lustre wiki and an 
HP/Cray doc:

https://www.hpe.com/psnow/resources/ebooks/a00113867en_us_v2/Lustre_Server_Recommended_Tuning_Parameters_4.x.html

Does anyone have any sage advice on what the ko2iblnd.conf (and possibly
ko2iblnd-opa.conf and hfi1.conf as well) should contain on modern systems?

It would be useful to know what specific settings are mismatched.  Definitely 
some of them need to be consistent between peers, others depend on your system.

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Could not read from remote repository

2024-04-09 Thread Andreas Dilger via lustre-discuss
On Apr 9, 2024, at 04:16, Jannek Squar via lustre-discuss
<lustre-discuss@lists.lustre.org> wrote:

Hey,

I tried to clone the source code via `git clone 
git://git.whamcloud.com/fs/lustre-release.git` but got an error:

"""
fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.
"""

Is there something going on with the repository or is the error probably on my 
side?

The above command worked for me on a login with no SSH key configured:

$ ssh-add -l
Could not open a connection to your authentication agent.
$ git clone git://git.whamcloud.com/fs/lustre-release.git
Cloning into 'lustre-release'...
remote: Counting objects: 386206, done.
remote: Compressing objects: 100% (81406/81406), done.
Receiving objects:  26% (100414/386206), 27.02 MiB | 9.00 MiB/s ...

Do you have connectivity to git.whamcloud.com (e.g. ping/traceroute)?
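The git:// protocol uses TCP port 9418, so a quick connectivity check from
your client would be something like (assuming nc is installed):

$ ping -c 3 git.whamcloud.com
$ nc -vz git.whamcloud.com 9418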

A second option is to clone from the "lustre/lustre-release" repo on GitHub, 
which is itself a clone of git://git.whamcloud.com/

Otherwise, you could create a Gerrit account at https://review.whamcloud.com/ 
and register your SSH public key there and then use:

git clone ssh://review.whamcloud.com:29418/fs/lustre-release

which you would want to do anyway if you are planning to submit any patches.

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Building Lustre against Mellanox OFED

2024-03-16 Thread Andreas Dilger via lustre-discuss
On Mar 15, 2024, at 09:18, Paul Edmon via lustre-discuss
<lustre-discuss@lists.lustre.org> wrote:

I'm working on building Lustre 2.15.4 against recent versions of Mellanox OFED. 
I built OFED against the specific kernel and then install 
mlnx-ofa_kernel-modules for that specific kernel. After which I built lustre 
against that version of OFED using:

rpmbuild --rebuild --without servers --without mpi --with mofed --define 
"_topdir `pwd`" SRPMS/lustre-2.15.4-1.src.rpm

However once I finish building and install I get:

Error: Transaction test error:
  file /etc/depmod.d/zz01-mlnx-ofa_kernel-mlx_compat.conf from install of 
kmod-mlnx-ofa_kernel-23.10-OFED.23.10.2.1.3.1.rhel8u9.x86_64 conflicts with 
file from package 
mlnx-ofa_kernel-modules-23.10-OFED.23.10.2.1.3.1.kver.4.18.0_513.18.1.el8_9.x86_64.x86_64

I saw this earlier message which matches my case: 
http://lists.lustre.org/pipermail/lustre-discuss-lustre.org/2020-December/017407.html
 But no resolution.

Does anyone know the solution to this? Is there a work around?

You probably need to "rpm -Uvh" instead of "rpm -ivh" the new 
kmod-mlnx-ofa_kernel RPM?

Or, if you want to keep both RPMs installed (e.g. for different kernels) then 
you can probably just use "--force" since it looks like the .conf file would 
likely be the same from both packages.
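Concretely, that would be something like the following (package name taken
from the error message above):

# upgrade in place, replacing the old kmod package:
rpm -Uvh kmod-mlnx-ofa_kernel-23.10-OFED.23.10.2.1.3.1.rhel8u9.x86_64.rpm
# or keep both packages installed, overriding the .conf file conflict:
rpm -ivh --force kmod-mlnx-ofa_kernel-23.10-OFED.23.10.2.1.3.1.rhel8u9.x86_64.rpm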

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] The confusion for mds hardware requirement

2024-03-11 Thread Andreas Dilger via lustre-discuss
All of the numbers in this example are estimates/approximations to give an idea 
about the amount of memory that the MDS may need under normal operating 
circumstances.  However, the MDS will also continue to function with more or 
less memory.  The actual amount of memory in use will change very significantly 
based on application type, workload, etc. and the numbers "256" and "100,000" 
are purely examples of how many files might be in use.

I'm not sure you can "test" those numbers, because whatever number of files you 
test with will be the number of files actually in use.  You could potentially 
_measure_ the number of files/locks in use on a large cluster, but again this 
will be highly site and application dependent.
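For what it's worth, summing the example figures quoted below:

    4096 + 4096 + 16384 + 2400 + 30720 = 57696 MB, i.e. about 56 GiB

which is where the earlier "64GB+ of RAM" guidance comes from.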

Cheers, Andreas

On Mar 11, 2024, at 01:24, Amin Brick Mover <aminbrickmo...@gmail.com> wrote:

Hi,  Andreas.

Thank you for your reply.

Can I consider 256 files per core as an empirical parameter? And does the
parameter '256' need testing based on hardware conditions? Additionally, in the
calculation formula "12 interactive clients * 100,000 files * 2KB = 2400 MB",
is the number '100,000' files also an empirical parameter? Do I need to test
it, or can I directly use the values '256' and '100,000'?

Andreas Dilger <adil...@whamcloud.com> wrote on Mon, Mar 11, 2024 at 05:47:
These numbers are just estimates, you can use values more suitable to your 
workload.

Similarly, 32-core clients may be on the low side these days.  NVIDIA DGX nodes 
have 256 cores, though you may not have 1024 of them.

The net answer is that having 64GB+ of RAM is inexpensive these days and 
improves MDS performance, especially if you compare it to the cost of client 
nodes that would sit waiting for filesystem access if the MDS is short of RAM.  
Better to have too much RAM on the MDS than too little.

Cheers, Andreas

On Mar 4, 2024, at 00:56, Amin Brick Mover via lustre-discuss
<lustre-discuss@lists.lustre.org> wrote:

In the Lustre Manual 5.5.2.1 section, the examples mentioned:
For example, for a single MDT on an MDS with 1,024 compute nodes, 12 
interactive login nodes, and a
20 million file working set (of which 9 million files are cached on the clients 
at one time):
Operating system overhead = 4096 MB (RHEL8)
File system journal = 4096 MB
1024 * 32-core clients * 256 files/core * 2KB = 16384 MB
12 interactive clients * 100,000 files * 2KB = 2400 MB
20 million file working set * 1.5KB/file = 30720 MB
I'm curious, how were the two numbers, 256 files/core and 100,000 files, 
determined? Why?

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org









Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] The confusion for mds hardware requirement

2024-03-10 Thread Andreas Dilger via lustre-discuss
These numbers are just estimates, you can use values more suitable to your 
workload.

Similarly, 32-core clients may be on the low side these days.  NVIDIA DGX nodes 
have 256 cores, though you may not have 1024 of them.

The net answer is that having 64GB+ of RAM is inexpensive these days and 
improves MDS performance, especially if you compare it to the cost of client 
nodes that would sit waiting for filesystem access if the MDS is short of RAM.  
Better to have too much RAM on the MDS than too little.

Cheers, Andreas

On Mar 4, 2024, at 00:56, Amin Brick Mover via lustre-discuss
<lustre-discuss@lists.lustre.org> wrote:

In the Lustre Manual 5.5.2.1 section, the examples mentioned:
For example, for a single MDT on an MDS with 1,024 compute nodes, 12 
interactive login nodes, and a
20 million file working set (of which 9 million files are cached on the clients 
at one time):
Operating system overhead = 4096 MB (RHEL8)
File system journal = 4096 MB
1024 * 32-core clients * 256 files/core * 2KB = 16384 MB
12 interactive clients * 100,000 files * 2KB = 2400 MB
20 million file working set * 1.5KB/file = 30720 MB
I'm curious, how were the two numbers, 256 files/core and 100,000 files, 
determined? Why?

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Issues draining OSTs for decommissioning

2024-03-07 Thread Andreas Dilger via lustre-discuss
It's almost certainly just internal files. You could mount as ldiskfs and run 
"ls -lR" to check. 

Cheers, Andreas

> On Mar 6, 2024, at 22:23, Scott Wood via lustre-discuss
> <lustre-discuss@lists.lustre.org> wrote:
> 
> Hi folks,
> 
> Time to empty some OSTs to shut down some old arrays.  I've been following 
> the docs from 
> https://doc.lustre.org/lustre_manual.xhtml#lustremaint.remove_ost and am 
> emptying with "lfs find /mnt/lustre/ -obd lustre-OST0060 | lfs_migrate -y" 
> (for the various OSTs) and it's looking pretty good but I do have a few 
> questions:
> 
> Q1) I've dealt with a few edge cases, missed files, etc and now "lfs find" 
> and "rbh-find" both show that the OSTs have nothing left on them but they 
> pretty much all have 236 inodes still allocated.  Is this just overhead? 
> 
> Q2) Also, one OST shows 237 inodes (lustre-OST0074_UUID shown below) but, 
> again, "lfs find" says its empty.  Is that a concern?
> 
> Q3) Lastly, this file system is under load.  Am I safe to deactivate the OSTs 
> while we're running or should I wait till our next maintenance outage?
> 
> For reference:
> [root@hpcpbs02 ~]# lfs df -i |sed -e 's/qimrb/lustre/'
> UUID                  Inodes  IUsed     IFree IUse% Mounted on
> ...
> lustre-OST0060_UUID 61002112    236  61001876    1% /mnt/lustre[OST:96]
> lustre-OST0061_UUID 61002112    236  61001876    1% /mnt/lustre[OST:97]
> lustre-OST0062_UUID 61002112    236  61001876    1% /mnt/lustre[OST:98]
> lustre-OST0063_UUID 61002112    236  61001876    1% /mnt/lustre[OST:99]
> lustre-OST0064_UUID 61002112    236  61001876    1% /mnt/lustre[OST:100]
> lustre-OST0065_UUID 61002112    236  61001876    1% /mnt/lustre[OST:101]
> lustre-OST0066_UUID 61002112    236  61001876    1% /mnt/lustre[OST:102]
> lustre-OST0067_UUID 61002112    236  61001876    1% /mnt/lustre[OST:103]
> lustre-OST0068_UUID 61002112    236  61001876    1% /mnt/lustre[OST:104]
> lustre-OST0069_UUID 61002112    236  61001876    1% /mnt/lustre[OST:105]
> lustre-OST006a_UUID 61002112    236  61001876    1% /mnt/lustre[OST:106]
> lustre-OST006b_UUID 61002112    236  61001876    1% /mnt/lustre[OST:107]
> lustre-OST006c_UUID 61002112    236  61001876    1% /mnt/lustre[OST:108]
> lustre-OST006d_UUID 61002112    236  61001876    1% /mnt/lustre[OST:109]
> lustre-OST006e_UUID 61002112    236  61001876    1% /mnt/lustre[OST:110]
> lustre-OST006f_UUID 61002112    236  61001876    1% /mnt/lustre[OST:111]
> lustre-OST0070_UUID 61002112    236  61001876    1% /mnt/lustre[OST:112]
> lustre-OST0071_UUID 61002112    236  61001876    1% /mnt/lustre[OST:113]
> lustre-OST0072_UUID 61002112    236  61001876    1% /mnt/lustre[OST:114]
> lustre-OST0073_UUID 61002112    236  61001876    1% /mnt/lustre[OST:115]
> lustre-OST0074_UUID 61002112    237  61001875    1% /mnt/lustre[OST:116]
> lustre-OST0075_UUID 61002112    236  61001876    1% /mnt/lustre[OST:117]
> lustre-OST0076_UUID 61002112    236  61001876    1% /mnt/lustre[OST:118]
> lustre-OST0077_UUID 61002112    236  61001876    1% /mnt/lustre[OST:119]
> ...
> 
> Cheers!
> Scott
> ___
> lustre-discuss mailing list
> lustre-discuss@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] lustre-client-dkms-2.15.4 is still checking for python2

2024-02-06 Thread Andreas Dilger via lustre-discuss
I've cherry-picked patch https://review.whamcloud.com/53947 "LU-15655 contrib:
update branch_comm to python3" (https://jira.whamcloud.com/browse/LU-15655) to
b2_15 to avoid this issue in the future.  This script is for developers and
does not affect functionality of the filesystem at all.

Cheers, Andreas

On Jan 30, 2024, at 06:32, BALVERS Martin <martin.balv...@danone.com> wrote:

I found the file that still references python2.
The file contrib/scripts/branch_comm contains ‘#!/usr/bin/env python2’
After changing that to python3 and building the dkms-rpm I can install the 
generated lustre-client-dkms-2.15.4-1.el9.noarch.rpm on AlmaLinux 9.3

I have no idea what that script does, or if it functions with python3 instead
of python2 as env.

Gr,
Martin Balvers

From: Andreas Dilger <adil...@whamcloud.com>
Sent: Tuesday, January 23, 2024 23:32
To: BALVERS Martin <martin.balv...@danone.com>
Subject: Re: [lustre-discuss] lustre-client-dkms-2.15.4 is still checking for
python2

** Caution - this is an external email **
Installing the DKMS is fine.  You can ignore the python2 dependency.

If you can debug *why* it is depending on python2 then a patch would be 
welcome.  Please see:
https://wiki.whamcloud.com/display/PUB/Patch+Landing+Process+Summary


On Jan 23, 2024, at 01:37, BALVERS Martin <martin.balv...@danone.com> wrote:

I have always installed both… Hasn’t caused issues luckily.

The binary installs, but the dkms version insists it needs python2.
If I use the binary, it will break with every minor kernel version update 
right? I’ll have to wait with updating the kernel until the lustre client 
catches up?

Thanks,
Martin Balvers

From: Andreas Dilger <adil...@whamcloud.com>
Sent: Friday, January 19, 2024 20:00
To: BALVERS Martin <martin.balv...@danone.com>
Cc: lustre-discuss@lists.lustre.org
Subject: Re: [lustre-discuss] lustre-client-dkms-2.15.4 is still checking for
python2

** Caution - this is an external email **
It looks like there may be a couple of test tools that are referencing python2, 
but it definitely isn't needed
for normal operation.  Are you using the lustre-client binary or the 
lustre-client-dkms?  Only one is needed.

For the short term it would be possible to override this dependency, but it 
would be good to understand
why this dependency is actually being generated.

On Jan 19, 2024, at 04:06, BALVERS Martin via lustre-discuss
<lustre-discuss@lists.lustre.org> wrote:

FYI
It seems that lustre-client-dkms-2.15.4 is still checking for python2 and does 
not install on AlmaLinux 9.3

# dnf --enablerepo=lustre-client install lustre-client lustre-client-dkms
Last metadata expiration check: 0:04:50 ago on Fri Jan 19 11:43:54 2024.
Error:
Problem: conflicting requests
  - nothing provides /usr/bin/python2 needed by 
lustre-client-dkms-2.15.4-1.el9.noarch from lustre-client
(try to add '--skip-broken' to skip uninstallable packages or '--nobest' to use 
not only best candidate packages)

According to the changelog this should have been fixed
(https://wiki.lustre.org/Lustre_2.15.4_Changelog).

Regards,
Martin Balvers

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud













___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

Re: [lustre-discuss] ldiskfs / mdt size limits

2024-02-03 Thread Andreas Dilger via lustre-discuss
Thomas,
You are exactly correct that large MDTs can be useful for DoM if you have HDD 
OSTs. The benefit is relatively small if you have NVMe OSTs. 

If the MDT is larger than 16TB it must be formatted with the extents feature to 
address block numbers over 2^32. Unfortunately, this is _slightly_ less 
efficient than the (in)direct block addressing for very fragmented allocations, 
like those of directories, so this feature is not used for MDTs below 16TiB. 
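If you want to confirm how a given MDT was formatted, the feature list is
visible in the ldiskfs superblock (device name illustrative; look for "extent"
and "64bit" in the output):

# dumpe2fs -h /dev/mapper/mdt0 2>/dev/null | grep -i features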

Cheers, Andreas

> On Feb 2, 2024, at 06:35, Thomas Roth via lustre-discuss 
>  wrote:
> 
> Hi all,
> 
> confused  about size limits:
> 
> I distinctly remember trying to format a ~19 TB disk / LV for use as an MDT, 
> with ldiskfs, and failing to do so: the max size for the underlying ext4 is 
> 16 TB.
> Knew that, had ignored that, but not a problem back then - just adapted the 
> logical volume's size.
> 
> Now I have a 24T disk, and neither mkfs.lustre nor Lustre itself have shown 
> any issues with it.
> 'df -h' does show the 24T, 'df -ih' shows the expected 4G of inodes.
> I suppose this MDS has a lot of space for directories and stuff, or for DOM.
> But why does it work in the first place? ldiskfs extends beyond all limits 
> these days?
> 
> Regards,
> Thomas
> 
> --
> 
> Thomas Roth
> Department: Informationstechnologie
> Location: SB3 2.291
> 
> 
> GSI Helmholtzzentrum für Schwerionenforschung GmbH
> Planckstraße 1, 64291 Darmstadt, Germany, www.gsi.de
> 
> Commercial Register / Handelsregister: Amtsgericht Darmstadt, HRB 1528
> Managing Directors / Geschäftsführung:
> Professor Dr. Paolo Giubellino, Jörg Blaurock
> Chairman of the Supervisory Board / Vorsitzender des GSI-Aufsichtsrats:
> Ministerialdirigent Dr. Volkmar Dietz
> 
> ___
> lustre-discuss mailing list
> lustre-discuss@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Lustre github mirror out of sync

2024-01-26 Thread Andreas Dilger via lustre-discuss
No particular reason.  I normally sync the github tree manually after Oleg
lands patches to master, but forgot to do it the last couple of times.
It's been updated now.  Thanks for pointing it out.

On Jan 26, 2024, at 00:55, Tommi Tervo  wrote:
> 
> Is sync between https://git.whamcloud.com/fs/lustre-release.git and 
> https://github.com/lustre/lustre-release off on purpose?
> 
> BR,
> Tommi

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Odd behavior with tunefs.lustre and device index

2024-01-24 Thread Andreas Dilger via lustre-discuss
mgsnode=10.99.101.6@tcp1:10.99.101.7@tcp1 
failover.node=10.99.101.18@tcp1

exiting before disk write.
[root@OSS-2 opc]#



Going over to OSS-3 and trying to mount OST.


[root@OSS-3 opc]# lctl list_nids
10.99.101.19@tcp1
[root@OSS-3 opc]#

Parameters looks same as OSS-2

[root@OSS-3 opc]#  tunefs.lustre --dryrun /dev/sdd
checking for existing Lustre data: found

   Read previous values:
Target: testfs-OST0040
Index:  64
Lustre FS:  testfs
Mount type: ldiskfs
Flags:  0x1002
  (OST no_primnode )
Persistent mount opts: ,errors=remount-ro
Parameters:  mgsnode=10.99.101.6@tcp1:10.99.101.7@tcp1 
failover.node=10.99.101.18@tcp1


   Permanent disk data:
Target: testfs-OST0040
Index:  64
Lustre FS:  testfs
Mount type: ldiskfs
Flags:  0x1002
  (OST no_primnode )
Persistent mount opts: ,errors=remount-ro
Parameters:  mgsnode=10.99.101.6@tcp1:10.99.101.7@tcp1 
failover.node=10.99.101.18@tcp1

exiting before disk write.
[root@OSS-3 opc]#

Changing failover node to current node.

[root@OSS-3 opc]# tunefs.lustre --erase-param failover.node --servicenode 
10.99.101.19@tcp1 /dev/sdd
checking for existing Lustre data: found

   Read previous values:
Target: testfs-OST0040
Index:  64
Lustre FS:  testfs
Mount type: ldiskfs
Flags:  0x1002
  (OST no_primnode )
Persistent mount opts: ,errors=remount-ro
Parameters:  mgsnode=10.99.101.6@tcp1:10.99.101.7@tcp1 
failover.node=10.99.101.18@tcp1


   Permanent disk data:
Target: testfs-OST0040
Index:  64
Lustre FS:  testfs
Mount type: ldiskfs
Flags:  0x1042
  (OST update no_primnode )
Persistent mount opts: ,errors=remount-ro
Parameters:  mgsnode=10.99.101.6@tcp1:10.99.101.7@tcp1 
failover.node=10.99.101.19@tcp1






After it completes the write, for some reason this OST is being marked as 
'first_time' flag 0x1062 in next command.

[root@OSS-3 opc]#  tunefs.lustre --dryrun /dev/sdd
checking for existing Lustre data: found

   Read previous values:
Target: testfs-OST0040
Index:  64
Lustre FS:  testfs
Mount type: ldiskfs
Flags:  0x1062
  (OST first_time update no_primnode )
Persistent mount opts: ,errors=remount-ro
Parameters:  mgsnode=10.99.101.6@tcp1:10.99.101.7@tcp1 
failover.node=10.99.101.19@tcp1


   Permanent disk data:
Target: testfs:OST0040
Index:  64
Lustre FS:  testfs
Mount type: ldiskfs
Flags:  0x1062
  (OST first_time update no_primnode )
Persistent mount opts: ,errors=remount-ro
Parameters:  mgsnode=10.99.101.6@tcp1:10.99.101.7@tcp1 
failover.node=10.99.101.19@tcp1

exiting before disk write.
[root@OSS-3 opc]#





Mount doesn't work here because it is marked as first time and this OST is not 
first time as it was already mounted using OST-2 OSS, and MGS knows about it.

[root@OSS-3 opc]#  mkdir /testfs-OST0040
[root@OSS-3 opc]# mount -t lustre /dev/sdd  /testfs-OST0040
mount.lustre: mount /dev/sdd at /testfs-OST0040 failed: Address already in use
The target service's index is already in use. (/dev/sdd)
[root@OSS-3 opc]#

From here, if I do tunefs.lustre with --writeconf, it works. Once this is
done, repeating the above experiment any number of times on any servers works
fine as expected without using --writeconf. (FYI Note: --writeconf is
mentioned as a dangerous command)




___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] OST still has inodes and size after deleting all files

2024-01-19 Thread Andreas Dilger via lustre-discuss


On Jan 19, 2024, at 13:48, Pavlo Khmel via lustre-discuss
<lustre-discuss@lists.lustre.org> wrote:

Hi,

I'm trying to remove 4 OSTs.

# lfs osts
OBDS:
0: cluster-OST0000_UUID ACTIVE
1: cluster-OST0001_UUID ACTIVE
2: cluster-OST0002_UUID ACTIVE
3: cluster-OST0003_UUID ACTIVE
. . .

I moved all files to other OSTs. "lfs find" cannot find any files on these 4 
OSTs.

# time lfs find --ost 0 --ost 1 --ost 2 --ost 3 /cluster

real 936m8.528s
user 13m48.298s
sys 210m1.245s

But still: 2624 inods are in use and 14.5G total size.

# lfs df -i | grep -e OST0000 -e OST0001 -e OST0002 -e OST0003
cluster-OST0000_UUID  4293438576   644  4293437932   1% /cluster[OST:0]
cluster-OST0001_UUID  4293438576   640  4293437936   1% /cluster[OST:1]
cluster-OST0002_UUID  4293438576   671  4293437905   1% /cluster[OST:2]
cluster-OST0003_UUID  4293438576   669  4293437907   1% /cluster[OST:3]

# lfs df -h | grep -e OST0000 -e OST0001 -e OST0002 -e OST0003
cluster-OST0000_UUID   29.2T    3.8G   27.6T   1% /cluster[OST:0]
cluster-OST0001_UUID   29.2T    3.7G   27.6T   1% /cluster[OST:1]
cluster-OST0002_UUID   29.2T    3.3G   27.6T   1% /cluster[OST:2]
cluster-OST0003_UUID   29.2T    3.7G   27.6T   1% /cluster[OST:3]

I tried to check the file-system for errors:

# umount /lustre/ost01
# e2fsck -fy /dev/mapper/ost01

and

# lctl lfsck_start --device cluster-OST0001
# lctl get_param -n osd-ldiskfs.cluster-OST0001.oi_scrub
. . .
status: completed

I tried to mount OST as ldiskfs and there are several files in /O/0/d*/

# umount /lustre/ost01
# mount -t ldiskfs /dev/mapper/ost01 /mnt/
# ls -Rhl /mnt/O/0/d*/
. . .
/mnt/O/0/d11/:
-rw-rw-rw- 1 user1 group1 603K Nov  8 21:37 450605003
/mnt/O/0/d12/:
-rw-rw-rw- 1 user1 group1 110K Jun 16  2023 450322028
-rw-rw-rw- 1 user1 group1  21M Nov  8 22:17 450605484
. . .

Is it expected behavior? Is it safe to delete the OST even with those files?

You can run the debugfs "stat" command on those objects to print the "fid"
xattr, which contains the MDT parent FID; that can be used with "lfs fid2path"
on the client to see if there are any files related to these objects.  You
could also run "ll_decode_filter_fid" to do the same thing on the mounted
ldiskfs filesystem.
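A sketch of that check, using one of the object paths listed above (the FID
argument is whatever the tool prints for that object):

# ll_decode_filter_fid /mnt/O/0/d11/450605003
(then, on a client)
# lfs fid2path /cluster <parent FID printed above>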

It is likely that there are a few stray objects from deleted files, but hard to 
say for sure.

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] lustre-client-dkms-2.15.4 is still checking for python2

2024-01-19 Thread Andreas Dilger via lustre-discuss
It looks like there may be a couple of test tools that are referencing python2, 
but it definitely isn't needed
for normal operation.  Are you using the lustre-client binary or the 
lustre-client-dkms?  Only one is needed.

For the short term it would be possible to override this dependency, but it 
would be good to understand
why this dependency is actually being generated.

On Jan 19, 2024, at 04:06, BALVERS Martin via lustre-discuss
<lustre-discuss@lists.lustre.org> wrote:

FYI
It seems that lustre-client-dkms-2.15.4 is still checking for python2 and does 
not install on AlmaLinux 9.3

# dnf --enablerepo=lustre-client install lustre-client lustre-client-dkms
Last metadata expiration check: 0:04:50 ago on Fri Jan 19 11:43:54 2024.
Error:
Problem: conflicting requests
  - nothing provides /usr/bin/python2 needed by 
lustre-client-dkms-2.15.4-1.el9.noarch from lustre-client
(try to add '--skip-broken' to skip uninstallable packages or '--nobest' to use 
not only best candidate packages)

According to the changelog this should have been fixed 
(https://wiki.lustre.org/Lustre_2.15.4_Changelog).

Regards,
Martin Balvers

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Lustre errors asking for help

2024-01-17 Thread Andreas Dilger via lustre-discuss
 kernel: : LustreError: 
> 1196:0:(ofd_obd.c:1348:ofd_create()) Skipped 75 previous similar messages
> Jan 15 10:23:16 mds2 kernel: : LustreError: 
> 3400:0:(osp_precreate.c:484:osp_precreate_send()) 
> scratch-OST000f-osc-MDT: can't precreate: rc = -116
> 
> The messages concern the same 3 OSTs and appear both on the OSS servers 
> serving those OSTs and the mds server responsible for that filesystem 
> (/global/scratch).
> They appear continuously, about every 4 minutes, and appear as soon as the 
> filesystem is mounted even before any I/O occurs.  In other words, even 
> on an inactive filesystem, the messages appear continuously.
> 
> While everything seems to work, the performance is terrible.  Creating a 
> directory on the filesystem can take 1-2 minutes to complete.  The load on 
> the mds server climbs to incredibly high values (100-160) during normal I/O 
> operations and the filesystem overall is extremely slow.  The mds server 
> complains about slow connections (see messages above).
> 
> We think the error messages above indicate the problem but despite searching 
> many hours on the web, have not been able to find any documentation about 
> what may be causing them, or how to correct the issue.
> 
> Any help would be greatly appreciated. Thanks a million for any suggestions 
> and solutions
> 
> All the best
> Roman
> 
> 
> ___
> lustre-discuss mailing list
> lustre-discuss@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] LNet Multi-Rail config - with BODY!

2024-01-16 Thread Andreas Dilger via lustre-discuss
Hello Gwen,
I'm not a networking expert, but it seems entirely possible that the MR 
discovery in 2.12.9
isn't doing as well as what is in 2.15.3 (or 2.15.4 for that matter).  It would 
make more sense
to have both nodes running the same (newer) version before digging too deeply 
into this.

We have definitely seen performance > 1 IB interface from a single node in our 
testing,
though I can't say if that was done with lnet_selftest or with something else.

Cheers, Andreas

On Jan 16, 2024, at 08:14, Gwen Dawes via lustre-discuss
<lustre-discuss@lists.lustre.org> wrote:

Hi folks,

Let's try that again.

I'm in the luxury position of having four IB cards I'm trying to
squeeze the most performance out of for Lustre I can.

I have a small test setup - two machines - a client (2.12.9) and a
server (2.15.3) with four IB cards each. I'm able to set them up as
Multi-Rail and each one can discover the other as such. However, I
can't seem to get lnet_selftest to give me more speed than a single
interface, as reported by ib_send_bw.

Am I missing some config here? Is LNet just not capable of doing more
than one connection per NID?

Gwen
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Mixing ZFS and LDISKFS

2024-01-12 Thread Andreas Dilger via lustre-discuss
All of the OSTs and MDTs are "independently managed" (each has its own
connection state between every client and target), so this should be possible,
though I don't know of sites that are doing this.  Possibly it makes sense to
put NVMe flash OSTs on ldiskfs and HDD OSTs on ZFS, and then put them in OST
pools so that they are managed separately.
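A sketch of that pool setup (run on the MGS; fsname, pool names, and OST
indices are illustrative):

# lctl pool_new testfs.flash
# lctl pool_add testfs.flash testfs-OST0000 testfs-OST0001
# lctl pool_new testfs.capacity
# lctl pool_add testfs.capacity testfs-OST0002 testfs-OST0003
# lfs setstripe -p flash /mnt/testfs/scratch   # new files here use the NVMe OSTs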

On Jan 12, 2024, at 10:38, Backer <backer.k...@gmail.com> wrote:

Thank you Andreas! How about mixing OSTs?  The requirement is to do RAID with 
small volumes using ZFS and have a large OST. This is to reduce the number of 
OSTs overall as the cluster being extended.

On Fri, 12 Jan 2024 at 11:26, Andreas Dilger <adil...@whamcloud.com> wrote:
Yes, some systems use ldiskfs for the MDT (for performance) and ZFS for the 
OSTs (for low-cost RAID).  The IOPS performance of ZFS is low vs. ldiskfs, but 
the streaming bandwidth is fine.

Cheers, Andreas

> On Jan 12, 2024, at 08:40, Backer via lustre-discuss
> <lustre-discuss@lists.lustre.org> wrote:
>
> 
> Hi,
>
> Could we mix ZFS and LDISKFS together in a cluster?
>
> Thank you,
>
>
> ___
> lustre-discuss mailing list
> lustre-discuss@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Recommendation on number of OSTs

2024-01-12 Thread Andreas Dilger via lustre-discuss
I would recommend *not* to use too many OSTs as this causes fragmentation of 
the free space, and excess overhead in managing the connections.  Today, single 
OSTs can be up to 500TiB in size (or larger, though not necessarily optimal for 
performance). Depending on your cluster size and total capacity, it is typical 
for large systems to have a couple hundred OSTs, 2-4 per OSS balancing the 
storage and network bandwidth.

On Jan 12, 2024, at 07:37, Backer via lustre-discuss
<lustre-discuss@lists.lustre.org> wrote:

Hi All,

What is the recommendation on the total number of OSTs?

In order to maximize throughput, go for more number of OSS with small OSTs. 
This means that it will end up with 1000s of OSTs. Any suggestions or 
recommendations?

Thank you,

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Mixing ZFS and LDISKFS

2024-01-12 Thread Andreas Dilger via lustre-discuss
Yes, some systems use ldiskfs for the MDT (for performance) and ZFS for the 
OSTs (for low-cost RAID).  The IOPS performance of ZFS is low vs. ldiskfs, but 
the streaming bandwidth is fine. 

Cheers, Andreas

> On Jan 12, 2024, at 08:40, Backer via lustre-discuss 
>  wrote:
> 
> 
> Hi,
> 
> Could we mix ZFS and LDISKFS together in a cluster? 
> 
> Thank you,
> 
> 
> ___
> lustre-discuss mailing list
> lustre-discuss@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Symbols not found in newly built lustre?

2024-01-11 Thread Andreas Dilger via lustre-discuss
/modules/4.18.0-513.9.1.el8_9.x86_64/extra/lustre/fs/ptlrpc.ko
/lib/modules/4.18.0-513.9.1.el8_9.x86_64/extra/lustre/fs/mgc.ko needs 
"class_unregister_type": 
/lib/modules/4.18.0-513.9.1.el8_9.x86_64/extra/lustre/fs/obdclass.ko
/lib/modules/4.18.0-513.9.1.el8_9.x86_64/extra/lustre/fs/mgc.ko needs 
"cfs_fail_loc": 
/lib/modules/4.18.0-513.9.1.el8_9.x86_64/extra/lustre/net/libcfs.ko
/lib/modules/4.18.0-513.9.1.el8_9.x86_64/extra/lustre/fs/mgc.ko needs 
"libcfs_next_nidstring": 
/lib/modules/4.18.0-513.9.1.el8_9.x86_64/extra/lustre/net/lnet.ko
/lib/modules/4.18.0-513.9.1.el8_9.x86_64/extra/lustre/fs/ofd.ko needs 
"tgt_sec_ctx_handlers": 
/lib/modules/4.18.0-513.9.1.el8_9.x86_64/extra/lustre/fs/ptlrpc.ko
/lib/modules/4.18.0-513.9.1.el8_9.x86_64/extra/lustre/fs/ofd.ko needs 
"num_exports_show": 
/lib/modules/4.18.0-513.9.1.el8_9.x86_64/extra/lustre/fs/obdclass.ko
/lib/modules/4.18.0-513.9.1.el8_9.x86_64/extra/lustre/fs/ofd.ko needs 
"seq_server_fini": 
/lib/modules/4.18.0-513.9.1.el8_9.x86_64/extra/lustre/fs/fid.ko
/lib/modules/4.18.0-513.9.1.el8_9.x86_64/extra/lustre/fs/ofd.ko needs 
"lfsck_register_namespace": 
/lib/modules/4.18.0-513.9.1.el8_9.x86_64/extra/lustre/fs/lfsck.ko
/lib/modules/4.18.0-513.9.1.el8_9.x86_64/extra/lustre/fs/ofd.ko needs 
"cfs_fail_loc": 
/lib/modules/4.18.0-513.9.1.el8_9.x86_64/extra/lustre/net/libcfs.ko
/lib/modules/4.18.0-513.9.1.el8_9.x86_64/extra/lustre/fs/ofd.ko needs 
"fld_update_from_controller": 
/lib/modules/4.18.0-513.9.1.el8_9.x86_64/extra/lustre/fs/fld.ko
/lib/modules/4.18.0-513.9.1.el8_9.x86_64/extra/lustre/fs/ofd.ko needs 
"libcfs_next_nidstring": 
/lib/modules/4.18.0-513.9.1.el8_9.x86_64/extra/lustre/net/lnet.ko
/lib/modules/4.18.0-513.9.1.el8_9.x86_64/extra/lustre/fs/ofd.ko needs 
"lquotactl_slv": 
/lib/modules/4.18.0-513.9.1.el8_9.x86_64/extra/lustre/fs/lquota.ko
/lib/modules/4.18.0-513.9.1.el8_9.x86_64/extra/lustre/fs/lmv.ko needs 
"ptlrpc_set_destroy": 
/lib/modules/4.18.0-513.9.1.el8_9.x86_64/extra/lustre/fs/ptlrpc.ko
/lib/modules/4.18.0-513.9.1.el8_9.x86_64/extra/lustre/fs/lmv.ko needs 
"class_unregister_type": 
/lib/modules/4.18.0-513.9.1.el8_9.x86_64/extra/lustre/fs/obdclass.ko
/lib/modules/4.18.0-513.9.1.el8_9.x86_64/extra/lustre/fs/lmv.ko needs 
"fld_client_add_target": 
/lib/modules/4.18.0-513.9.1.el8_9.x86_64/extra/lustre/fs/fld.ko
/lib/modules/4.18.0-513.9.1.el8_9.x86_64/extra/lustre/fs/lmv.ko needs 
"cfs_fail_loc": 
/lib/modules/4.18.0-513.9.1.el8_9.x86_64/extra/lustre/net/libcfs.ko
/lib/modules/4.18.0-513.9.1.el8_9.x86_64/extra/lustre/fs/lmv.ko needs 
"LNetGetId": /lib/modules/4.18.0-513.9.1.el8_9.x86_64/extra/lustre/net/lnet.ko
/lib/modules/4.18.0-513.9.1.el8_9.x86_64/extra/lustre/net/ksocklnd.ko needs 
"lnet_inet_enumerate": 
/lib/modules/4.18.0-513.9.1.el8_9.x86_64/extra/lustre/net/lnet.ko
/lib/modules/4.18.0-513.9.1.el8_9.x86_64/extra/lustre/net/ksocklnd.ko needs 
"cfs_cpt_bind": 
/lib/modules/4.18.0-513.9.1.el8_9.x86_64/extra/lustre/net/libcfs.ko
/lib/modules/4.18.0-513.9.1.el8_9.x86_64/extra/lustre/net/lnet_selftest.ko 
needs "lnet_cpt_of_nid": 
/lib/modules/4.18.0-513.9.1.el8_9.x86_64/extra/lustre/net/lnet.ko
/lib/modules/4.18.0-513.9.1.el8_9.x86_64/extra/lustre/net/lnet_selftest.ko 
needs "cfs_wi_schedule": 
/lib/modules/4.18.0-513.9.1.el8_9.x86_64/extra/lustre/net/libcfs.ko
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] 2.15.4 o2iblnd on RoCEv2?

2024-01-10 Thread Andreas Dilger via lustre-discuss
Granted that I'm not an LNet expert, but "errno: -1 descr: cannot parse net
'<255:65535>'" doesn't immediately lead me to the same conclusion as if
"unknown interface 'ib0'" were printed for the error message.  Also "errno:
-1" is "-EPERM = Operation not permitted", and doesn't give the same
information as "-ENXIO = No such device or address" or even "-EINVAL = Invalid
argument" would.

That said, I can't even offer a patch for this myself, since that exact error 
message is used in a few different places, though I suspect it is coming from 
lustre_lnet_config_ni().

Looking further into this, now that I've found where (I think) the error 
message is generated, it seems that "errno: -1" is not "-EPERM" but rather 
"LUSTRE_CFG_RC_BAD_PARAM", which is IMHO a travesty to use different error 
numbers (and then print them after "errno:") instead of existing POSIX error 
codes that could fill the same role (with some creative mapping):

#define LUSTRE_CFG_RC_NO_ERR                 0  => fine
#define LUSTRE_CFG_RC_BAD_PARAM             -1  => -EINVAL
#define LUSTRE_CFG_RC_MISSING_PARAM         -2  => -EFAULT
#define LUSTRE_CFG_RC_OUT_OF_RANGE_PARAM    -3  => -ERANGE
#define LUSTRE_CFG_RC_OUT_OF_MEM            -4  => -ENOMEM
#define LUSTRE_CFG_RC_GENERIC_ERR           -5  => -ENODATA
#define LUSTRE_CFG_RC_NO_MATCH              -6  => -ENOMSG
#define LUSTRE_CFG_RC_MATCH                 -7  => -EXFULL
#define LUSTRE_CFG_RC_SKIP                  -8  => -EBADSLT
#define LUSTRE_CFG_RC_LAST_ELEM             -9  => -ECHRNG
#define LUSTRE_CFG_RC_MARSHAL_FAIL          -10 => -ENOSTR

I don't think "overloading" the POSIX error codes to mean something similar is 
worse than using random numbers to report errors.  Also, in some cases (even in 
lustre_lnet_config_ni()) it is using "rc = -errno" so the LUSTRE_CFG_RC_* 
errors are *already* conflicting with POSIX error numbers, and it impossible to 
distinguish between them...

The main question is whether changing these numbers will break a user->kernel
interface, or if these definitions are only in userspace?  It looks like
lnetctl.c is only ever checking "!= LUSTRE_CFG_RC_NO_ERR", so maybe it is fine?
None of the values currently overlap, so it would be possible to start
accepting either set of values in the user tools, and then at some point in the
future start actually returning the new ones...  Something for the LNet folks
to figure out.

Cheers, Andreas

On Jan 10, 2024, at 13:29, Jeff Johnson <jeff.john...@aeoncomputing.com> wrote:

A LU ticket and patch for lnetctl or for me being an under-caffeinated
idiot? ;-)

On Wed, Jan 10, 2024 at 12:06 PM Andreas Dilger <adil...@whamcloud.com> wrote:

It would seem that the error message could be improved in this case?  Could you 
file an LU ticket for that with the reproducer below, and ideally along with a 
patch?

Cheers, Andreas
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] 2.15.4 o2iblnd on RoCEv2?

2024-01-10 Thread Andreas Dilger via lustre-discuss
It would seem that the error message could be improved in this case?  Could you 
file an LU ticket for that with the reproducer below, and ideally along with a 
patch?

Cheers, Andreas

> On Jan 10, 2024, at 11:37, Jeff Johnson  
> wrote:
> 
> Man am I an idiot. Been up all night too many nights in a row and not
> enough coffee. It helps if you use the correct --net designation. I
> was typing ib0 instead of o2ib0. Declaring as o2ib0 works fine.
> 
> (cleanup from previous)
> lctl net down && lustre_rmmod
> 
> (new attempt)
> modprobe lnet -v
> lnetctl lnet configure
> lnetctl net add --if enp1s0np0 --net o2ib0
> lnetctl net show
> net:
>- net type: lo
>  local NI(s):
>- nid: 0@lo
>  status: up
>- net type: o2ib
>  local NI(s):
>- nid: 10.0.50.27@o2ib
>  status: up
>  interfaces:
>  0: enp1s0np0
> 
> Lots more to test and verify but the original mailing list submission
> was total pilot error on my part. Apologies to all who spent cycles
> pondering this nothingburger.
> 
> 
> 
> 
>> On Tue, Jan 9, 2024 at 7:45 PM Jeff Johnson
>>  wrote:
>> 
>> Howdy intrepid Lustrefarians,
>> 
>> While starting down the debug rabbit hole I thought I'd raise my hand
>> and see if anyone has a few magic beans to spare.
>> 
>> I cannot get lnet (via lnetctl) to init a o2iblnd interface on a
>> RoCEv2 interface.
>> 
>> Running `lnetctl net add --net ib0 --if enp1s0np0` results in
>> net:
>>  errno: -1
>>  descr: cannot parse net '<255:65535>'
>> 
>> Nothing in dmesg to indicate why. Search engines aren't coughing up
>> much here either.
>> 
>> Env: Rocky 8.9 x86_64, MOFED 5.8-4.1.5.0, Lustre 2.15.4
>> 
>> I'm able to run mpi over the RoCEv2 interface. Utils like ibstatus and
>> ibdev2netdev report it correctly. ibv_rc_pingpong works fine between
>> nodes.
>> 
>> Configuring as socklnd works fine. `lnetctl net add --net tcp0 --if
>> enp1s0np0 && lnetctl net show`
>> [root@r2u11n3 ~]# lnetctl net show
>> net:
>>- net type: lo
>>  local NI(s):
>>- nid: 0@lo
>>  status: up
>>- net type: tcp
>>  local NI(s):
>>- nid: 10.0.50.27@tcp
>>  status: up
>>  interfaces:
>>  0: enp1s0np0
>> 
>> I verified the RoCEv2 interface using nVidia's `cma_roce_mode` as well
>> as sysfs references
>> 
>> [root@r2u11n3 ~]# cma_roce_mode -d mlx5_0 -p 1
>> RoCE v2
>> 
>> Ideas? Suggestions? Incense?
>> 
>> Thanks,
>> 
>> --Jeff
> 
> 
> 
> --
> --
> Jeff Johnson
> Co-Founder
> Aeon Computing
> 
> jeff.john...@aeoncomputing.com
> www.aeoncomputing.com
> t: 858-412-3810 x1001   f: 858-412-3845
> m: 619-204-9061
> 
> 4170 Morena Boulevard, Suite C - San Diego, CA 92117
> 
> High-Performance Computing / Lustre Filesystems / Scale-out Storage
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Extending Lustre file system

2024-01-08 Thread Andreas Dilger via lustre-discuss
I would recommend *against* mounting all 175 OSTs at the same time.  There are 
(or at least were*) some issues with the MGS registration RPCs timing out when 
too many config changes happen at once.  Your "mount and wait 2 sec" is more 
robust and doesn't take very much time (a few minutes) vs. having to restart if 
some of the OSTs have problems registering.  Also, the config logs will have 
the OSTs in a nice order, which doesn't affect any functionality, but makes it 
easier for the admin to see if some device is connected in "lctl dl" output.
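
For example, a minimal sketch of the staggered mount, assuming pre-formatted
OSTs with fixed indices and example device paths:

# mount one OST at a time, pausing between mounts to let registration complete
for i in $(seq -w 0 174); do
    mount -t lustre "/dev/mapper/ost$i" "/mnt/ost$i" && sleep 2
done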

Cheers, Andreas


[*] some fixes have landed over time to improve registration RPC resend.

On Jan 8, 2024, at 11:57, Thomas Roth via lustre-discuss <lustre-discuss@lists.lustre.org> wrote:

Yes, sorry, I meant the actual procedure of mounting the OSTs for the first 
time.

Last year I did that with 175 OSTs - replacements for EOL hardware. All OSTs 
had been formatted with a specific index, so probably creating a suitable 
/etc/fstab everywhere and sending a 'mount -a -t lustre' to all OSTs 
simultaneously would have worked.

But why the hurry? Instead, I logged in to my new OSS, mounted the OSTs with 2 
sec between each mount command, watched the OSS log, watched the MDS log, saw 
the expected log messages, proceeded to the next OSS - all fine ;-)  Such a 
leisurely approach takes its time, of course.

Once all OSTs were happily incorporated, we raised the max_create_count (set to 
0 before) to some finite value and started file migration. As long as the 
migration runs faster than the users' file creation, the result should be 
evenly filled OSTs with a good mixture of files (sizes, ages, types).
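
A hedged sketch of that last step (fsname, OST index, and count are
placeholders; max_create_count is an osp parameter set on the MDS):

# re-enable object creation on the new OSTs
mds# lctl set_param 'osp.fs1-OST0000*.max_create_count=20000'
# then restripe files off the full OSTs from a client; the allocator
# will favour the emptier OSTs
client# lfs find /fs1 --ost fs1-OST0001_UUID -type f | lfs_migrate -y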


Cheers
Thomas

On 1/8/24 19:07, Andreas Dilger wrote:
The need to rebalance depends on how full the existing OSTs are.  My 
recommendation if you know that the data will continue to grow is to add new 
OSTs when the existing ones are at 60-70% full, and add them in larger groups 
rather than one at a time.
Cheers, Andreas
On Jan 8, 2024, at 09:29, Thomas Roth via lustre-discuss <lustre-discuss@lists.lustre.org> wrote:

Just mount the OSTs, one by one, and perhaps not while your system is heavily 
loaded. Follow what happens in the MDS log and the OSS log.
And try to rebalance the OSTs' fill levels afterwards - very empty OSTs will 
attract all new files, which might be hot and direct your users' fire to your 
new OSS only.

Regards,
Thomas

On 1/8/24 15:38, Backer via lustre-discuss wrote:
Hi,
Good morning and happy new year!
I have a quick question on extending a Lustre file system. The extension is 
performed online. I am looking for any best practices or anything to watch out 
for while doing the file system extension. The extension is done by adding new 
OSSes and many OSTs within these servers.
Really appreciate your help on this.
Regards,

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Extending Lustre file system

2024-01-08 Thread Andreas Dilger via lustre-discuss
The need to rebalance depends on how full the existing OSTs are.  My 
recommendation if you know that the data will continue to grow is to add new 
OSTs when the existing ones are at 60-70% full, and add them in larger groups 
rather than one at a time.

Cheers, Andreas

> On Jan 8, 2024, at 09:29, Thomas Roth via lustre-discuss 
>  wrote:
> 
> Just mount the OSTs, one by one and perhaps not if your system is heavily 
> loaded. Follow what happens in the MDS log and the OSS log.
> And try to rebalance the OSTs fill levels afterwards - very empty OSTs will 
> attract all new files, which might be hot and direct your users's fire to 
> your new OSS only.
> 
> Regards,
> Thomas
> 
>> On 1/8/24 15:38, Backer via lustre-discuss wrote:
>> Hi,
>> Good morning and happy new year!
>> I have a quick question on extending a lustre file system. The extension is 
>> performed online. I am looking for any best practices or anything to 
>> watchout while doing the file system extension. The file system extension is 
>> done adding new OSS and many OSTs within these servers.
>> Really appreciate your help on this.
>> Regards,
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Building lustre on rocky 8.8 fails?

2024-01-06 Thread Andreas Dilger via lustre-discuss
    CC:gcc
   LD:/usr/bin/ld -m elf_x86_64
   CPPFLAGS:  -include /root/lustre-release/undef.h -include 
/root/lustre-release/config.h -I/root/lustre-release/lnet/include/uapi 
-I/root/lustre-release/lustre/include/uapi 
-I/root/lustre-release/libcfs/include -I/root/lustre-release/lnet/utils/ 
-I/root/lustre-release/lustre/include
   CFLAGS:-g -O2 -Wall -Werror
   EXTRA_KCFLAGS: -include /root/lustre-release/undef.h -include 
/root/lustre-release/config.h  -g -I/root/lustre-release/libcfs/include 
-I/root/lustre-release/libcfs/include/libcfs 
-I/root/lustre-release/lnet/include/uapi -I/root/lustre-release/lnet/include 
-I/root/lustre-release/lustre/include/uapi 
-I/root/lustre-release/lustre/include -Wno-format-truncation 
-Wno-stringop-truncation -Wno-stringop-overflow
   Type 'make' to build Lustre.
   However, when I run make:
   [root@mds lustre-release]# make
   make  all-recursive
   make[1]: Entering directory '/root/lustre-release'
   Making all in ldiskfs
   make[2]: Entering directory '/root/lustre-release/ldiskfs'
   make[2]: *** No rule to make target 
'../ldiskfs/kernel_patches/series/ldiskfs-', needed by 'sources'.  Stop.
    This looks like it can't detect the ldiskfs-4.18-rhel8.8.series file.  
Most probably the ext4 kernel source rpm (kernel-debuginfo-common-x86_64) 
hasn't been installed yet.
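    A hedged sketch of installing it (debuginfo-install comes from
dnf-plugins-core; the package version must match the running kernel):

    dnf install dnf-plugins-core
    dnf debuginfo-install kernel-$(uname -r)   # pulls kernel-debuginfo-common-x86_64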
   And like Andreas said try latest 2.15.4.
   Best,
   Xinliang
   make[2]: Leaving directory '/root/lustre-release/ldiskfs'
   make[1]: *** [autoMakefile:649: all-recursive] Error 1
   make[1]: Leaving directory '/root/lustre-release'
   make: *** [autoMakefile:521: all] Error 2
   Alternatively, I tried make rpms which results in:
   ...
   rpmbuilddir=`mktemp -t -d rpmbuild-lustre-$USER-`; \
   make  \
 rpmbuilddir="$rpmbuilddir" rpm-local || exit 1; \
   cp ./rpm/* .; \
   /usr/bin/rpmbuild \
 --define "_tmppath $rpmbuilddir/TMP" \
 --define "_topdir $rpmbuilddir" \
 --define "dist %{nil}" \
 -ts lustre-2.15.3.tar.gz || exit 1; \
   cp $rpmbuilddir/SRPMS/lustre-2.15.3-*.src.rpm . || exit 1; \
   rm -rf $rpmbuilddir
   make[1]: Entering directory '/root/lustre-release'
   make[1]: Leaving directory '/root/lustre-release'
   error: line 239: Dependency tokens must begin with alpha-numeric, '_' or 
'/': BuildRequires: %kernel_module_package_buildreqs
And this might be caused by the missing kernel-rpm-macros RPM.
Usually, running "sudo dnf builddep -y lustre.spec" should install all of the 
required build RPMs.
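For example (builddep itself is provided by dnf-plugins-core):

sudo dnf install kernel-rpm-macros dnf-plugins-core
sudo dnf builddep -y lustre.spec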
   make: *** [autoMakefile:1237: srpm] Error 1
   So, I'm stuck - it seems this is something I do a lot; how do I move 
forward from here?

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Error: GPG check FAILED when trying to install e2fsprogs

2024-01-03 Thread Andreas Dilger via lustre-discuss
Sorry, those packages are not signed, you'll just have to install them without 
a signature. 
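
For example, dnf can be told to skip the GPG check for that one transaction:

dnf install --nogpgcheck ./*.rpm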

Cheers, Andreas

> On Jan 3, 2024, at 09:10, Jan Andersen  wrote:
> 
> I have finally managed to build the lustre rpms, but when I try to install 
> them with:
> 
> dnf install ./*.rpm
> 
> I get a list of errors like
> 
> ... nothing provides ldiskfsprogs >= 1.44.3.wc1 ...
> 
> In a previous communication I was advised that:
> 
> You may need to add ldiskfsprogs rpm repo and enable ha and powertools repo
> first.
> 
> sudo dnf config-manager --add-repo 
> https://downloads.whamcloud.com/public/e2fsprogs/latest/el8/
> sudo dnf config-manager --set-enabled ha
> sudo dnf config-manager --set-enabled powertools
> 
> However, when I try to install e2fsprogs:
> 
> Package e2fsprogs-1.47.0-wc6.el8.x86_64.rpm is not signed
> Package e2fsprogs-devel-1.47.0-wc6.el8.x86_64.rpm is not signed
> Package e2fsprogs-libs-1.47.0-wc6.el8.x86_64.rpm is not signed
> Package libcom_err-1.47.0-wc6.el8.x86_64.rpm is not signed
> Package libcom_err-devel-1.47.0-wc6.el8.x86_64.rpm is not signed
> Package libss-1.47.0-wc6.el8.x86_64.rpm is not signed
> The downloaded packages were saved in cache until the next successful 
> transaction.
> You can remove cached packages by executing 'dnf clean packages'.
> Error: GPG check FAILED
> 
> And now I'm stuck with that - I imagine I need to add some appropriate GPG 
> key; where can I find that?
> 
> /jan
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Building lustre on rocky 8.8 fails?

2024-01-02 Thread Andreas Dilger via lustre-discuss
Try 2.15.4, as it may fix the EL8.8 build issue. 

Cheers, Andreas

> On Jan 2, 2024, at 07:30, Jan Andersen  wrote:
> 
> I have installed Rocky 8.8 on a new server (Dell PowerEdge R640):
> 
> [root@mds 4.18.0-513.9.1.el8_9.x86_64]# cat /etc/*release*
> Rocky Linux release 8.8 (Green Obsidian)
> NAME="Rocky Linux"
> VERSION="8.8 (Green Obsidian)"
> ID="rocky"
> ID_LIKE="rhel centos fedora"
> VERSION_ID="8.8"
> PLATFORM_ID="platform:el8"
> PRETTY_NAME="Rocky Linux 8.8 (Green Obsidian)"
> ANSI_COLOR="0;32"
> LOGO="fedora-logo-icon"
> CPE_NAME="cpe:/o:rocky:rocky:8:GA"
> HOME_URL="https://rockylinux.org/"
> BUG_REPORT_URL="https://bugs.rockylinux.org/"
> SUPPORT_END="2029-05-31"
> ROCKY_SUPPORT_PRODUCT="Rocky-Linux-8"
> ROCKY_SUPPORT_PRODUCT_VERSION="8.8"
> REDHAT_SUPPORT_PRODUCT="Rocky Linux"
> REDHAT_SUPPORT_PRODUCT_VERSION="8.8"
> Rocky Linux release 8.8 (Green Obsidian)
> Rocky Linux release 8.8 (Green Obsidian)
> Derived from Red Hat Enterprise Linux 8.8
> Rocky Linux release 8.8 (Green Obsidian)
> cpe:/o:rocky:rocky:8:GA
> 
> I downloaded the kernel source (I don't remember the exact command):
> 
> [root@mds 4.18.0-513.9.1.el8_9.x86_64]# ll /usr/src/kernels
> total 8
> drwxr-xr-x. 24 root root 4096 Jan  2 13:49 4.18.0-513.9.1.el8_9.x86_64/
> drwxr-xr-x. 23 root root 4096 Jan  2 11:41 4.18.0-513.9.1.el8_9.x86_64+debug/
> 
> Copied the config from /boot and ran:
> 
> yes "" | make oldconfig
> 
> After that I cloned the Lustre source and configured (according to my notes):
> 
> git clone git://git.whamcloud.com/fs/lustre-release.git
> cd lustre-release
> git checkout 2.15.3
> 
> dnf install libtool
> dnf install flex
> dnf install bison
> dnf install openmpi-devel
> dnf install python3-devel
> dnf install python3
> dnf install kernel-devel kernel-headers
> dnf install elfutils-libelf-devel
> dnf install keyutils keyutils-libs-devel
> dnf install libmount
> dnf --enablerepo=powertools install libmount-devel
> dnf install libnl3 libnl3-devel
> dnf config-manager --set-enabled powertools
> dnf install libyaml-devel
> dnf install patch
> dnf install e2fsprogs-devel
> dnf install kernel-core
> dnf install kernel-modules
> dnf install rpm-build
> dnf config-manager --enable devel
> dnf config-manager --enable powertools
> dnf config-manager --set-enabled ha
> dnf install kernel-debuginfo
> 
> sh autogen.sh
> ./configure
> 
> This appeared to finish without errors:
> 
> ...
> config.status: executing libtool commands
> 
> CC:gcc
> LD:/usr/bin/ld -m elf_x86_64
> CPPFLAGS:  -include /root/lustre-release/undef.h -include 
> /root/lustre-release/config.h -I/root/lustre-release/lnet/include/uapi 
> -I/root/lustre-release/lustre/include/uapi 
> -I/root/lustre-release/libcfs/include -I/root/lustre-release/lnet/utils/ 
> -I/root/lustre-release/lustre/include
> CFLAGS:-g -O2 -Wall -Werror
> EXTRA_KCFLAGS: -include /root/lustre-release/undef.h -include 
> /root/lustre-release/config.h  -g -I/root/lustre-release/libcfs/include 
> -I/root/lustre-release/libcfs/include/libcfs 
> -I/root/lustre-release/lnet/include/uapi -I/root/lustre-release/lnet/include 
> -I/root/lustre-release/lustre/include/uapi 
> -I/root/lustre-release/lustre/include -Wno-format-truncation 
> -Wno-stringop-truncation -Wno-stringop-overflow
> 
> Type 'make' to build Lustre.
> 
> However, when I run make:
> 
> [root@mds lustre-release]# make
> make  all-recursive
> make[1]: Entering directory '/root/lustre-release'
> Making all in ldiskfs
> make[2]: Entering directory '/root/lustre-release/ldiskfs'
> make[2]: *** No rule to make target 
> '../ldiskfs/kernel_patches/series/ldiskfs-', needed by 'sources'.  Stop.
> make[2]: Leaving directory '/root/lustre-release/ldiskfs'
> make[1]: *** [autoMakefile:649: all-recursive] Error 1
> make[1]: Leaving directory '/root/lustre-release'
> make: *** [autoMakefile:521: all] Error 2
> 
> Alternatively, I tried make rpms which results in:
> 
> ...
> rpmbuilddir=`mktemp -t -d rpmbuild-lustre-$USER-`; \
> make  \
>rpmbuilddir="$rpmbuilddir" rpm-local || exit 1; \
> cp ./rpm/* .; \
> /usr/bin/rpmbuild \
>--define "_tmppath $rpmbuilddir/TMP" \
>--define "_topdir $rpmbuilddir" \
>--define "dist %{nil}" \
>-ts lustre-2.15.3.tar.gz || exit 1; \
> cp $rpmbuilddir/SRPMS/lustre-2.15.3-*.src.rpm . || exit 1; \
> rm -rf $rpmbuilddir
> make[1]: Entering directory '/root/lustre-release'
> make[1]: Leaving directory '/root/lustre-release'
> error: line 239: Dependency tokens must begin with alpha-numeric, '_' or '/': 
> BuildRequires: %kernel_module_package_buildreqs
> make: *** [autoMakefile:1237: srpm] Error 1
> 
> 
> So, I'm stuck - it seems this is something I do a lot; how do I move forward 
> from here?
> 

Re: [lustre-discuss] Lustre server still try to recover the lnet reply to the depreciated clients

2023-12-08 Thread Andreas Dilger via lustre-discuss
If you are evicting a client by NID, then use the "nid:" keyword:

lctl set_param mdt.*.evict_client=nid:10.68.178.25@tcp

Otherwise it is expecting the input to be in the form of a client UUID (to allow
evicting a single export from a client mounting the filesystem multiple times).
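
The export UUIDs can be listed per NID to find the right value, for example:

lctl get_param 'mdt.*.exports.*.uuid'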

That said, the client *should* be evicted by the server automatically, so it 
isn't
clear why this isn't happening.  Possibly this is something at the LNet level
(which unfortunately I don't know much about)? 

Cheers, Andreas

> On Dec 6, 2023, at 13:23, Huang, Qiulan via lustre-discuss 
>  wrote:
> 
> 
> 
> Hello all,
> 
> 
> We removed some clients two weeks ago but we see the Lustre server is still 
> trying to handle the lnet recovery reply to those clients (the error log is 
> posted as below). And they are still listed in the exports dir.
> 
> 
> I tried to run the command below to evict the clients, but it failed with the
> error "no exports found":
> 
> lctl set_param mdt.*.evict_client=10.68.178.25@tcp
> 
> 
> Do you know how to clean up the removed the depreciated clients? Any 
> suggestions would be greatly appreciated.
> 
> 
> 
> For example:
> 
> [root@mds2 ~]# ll /proc/fs/lustre/mdt/data-MDT/exports/10.67.178.25@tcp/
> total 0
> -r--r--r-- 1 root root 0 Dec  5 15:41 export
> -r--r--r-- 1 root root 0 Dec  5 15:41 fmd_count
> -r--r--r-- 1 root root 0 Dec  5 15:41 hash
> -rw-r--r-- 1 root root 0 Dec  5 15:41 ldlm_stats
> -r--r--r-- 1 root root 0 Dec  5 15:41 nodemap
> -r--r--r-- 1 root root 0 Dec  5 15:41 open_files
> -r--r--r-- 1 root root 0 Dec  5 15:41 reply_data
> -rw-r--r-- 1 root root 0 Aug 14 10:58 stats
> -r--r--r-- 1 root root 0 Dec  5 15:41 uuid
> 
> 
> 
> 
> 
> /var/log/messages:Dec  6 12:50:17 mds2 kernel: LNetError: 
> 11579:0:(lib-move.c:4005:lnet_handle_recovery_reply()) Skipped 1 previous 
> similar message
> /var/log/messages:Dec  6 13:05:17 mds2 kernel: LNetError: 
> 11579:0:(lib-move.c:4005:lnet_handle_recovery_reply()) peer NI 
> (10.67.178.25@tcp) recovery failed with -110
> /var/log/messages:Dec  6 13:05:17 mds2 kernel: LNetError: 
> 11579:0:(lib-move.c:4005:lnet_handle_recovery_reply()) Skipped 1 previous 
> similar message
> /var/log/messages:Dec  6 13:20:17 mds2 kernel: LNetError: 
> 11579:0:(lib-move.c:4005:lnet_handle_recovery_reply()) peer NI 
> (10.67.178.25@tcp) recovery failed with -110
> /var/log/messages:Dec  6 13:20:17 mds2 kernel: LNetError: 
> 11579:0:(lib-move.c:4005:lnet_handle_recovery_reply()) Skipped 1 previous 
> similar message
> /var/log/messages:Dec  6 13:35:17 mds2 kernel: LNetError: 
> 11579:0:(lib-move.c:4005:lnet_handle_recovery_reply()) peer NI 
> (10.67.178.25@tcp) recovery failed with -110
> /var/log/messages:Dec  6 13:35:17 mds2 kernel: LNetError: 
> 11579:0:(lib-move.c:4005:lnet_handle_recovery_reply()) Skipped 1 previous 
> similar message
> /var/log/messages:Dec  6 13:50:17 mds2 kernel: LNetError: 
> 11579:0:(lib-move.c:4005:lnet_handle_recovery_reply()) peer NI 
> (10.67.178.25@tcp) recovery failed with -110
> /var/log/messages:Dec  6 13:50:17 mds2 kernel: LNetError: 
> 11579:0:(lib-move.c:4005:lnet_handle_recovery_reply()) Skipped 1 previous 
> similar message
> /var/log/messages:Dec  6 14:05:17 mds2 kernel: LNetError: 
> 11579:0:(lib-move.c:4005:lnet_handle_recovery_reply()) peer NI 
> (10.67.178.25@tcp) recovery failed with -110
> /var/log/messages:Dec  6 14:05:17 mds2 kernel: LNetError: 
> 11579:0:(lib-move.c:4005:lnet_handle_recovery_reply()) Skipped 1 previous 
> similar message
> /var/log/messages:Dec  6 14:20:16 mds2 kernel: LNetError: 
> 11579:0:(lib-move.c:4005:lnet_handle_recovery_reply()) peer NI 
> (10.67.178.25@tcp) recovery failed with -110
> /var/log/messages:Dec  6 14:20:16 mds2 kernel: LNetError: 
> 11579:0:(lib-move.c:4005:lnet_handle_recovery_reply()) Skipped 1 previous 
> similar message
> /var/log/messages:Dec  6 14:30:17 mds2 kernel: LNetError: 
> 3806712:0:(lib-move.c:4005:lnet_handle_recovery_reply()) peer NI 
> (10.67.176.25@tcp) recovery failed with -111
> /var/log/messages:Dec  6 14:30:17 mds2 kernel: LNetError: 
> 3806712:0:(lib-move.c:4005:lnet_handle_recovery_reply()) Skipped 3 previous 
> similar messages
> /var/log/messages:Dec  6 14:47:14 mds2 kernel: LNetError: 
> 3812070:0:(lib-move.c:4005:lnet_handle_recovery_reply()) peer NI 
> (10.67.176.25@tcp) recovery failed with -111
> /var/log/messages:Dec  6 14:47:14 mds2 kernel: LNetError: 
> 3812070:0:(lib-move.c:4005:lnet_handle_recovery_reply()) Skipped 8 previous 
> similar messages
> /var/log/messages:Dec  6 15:02:14 mds2 kernel: LNetError: 
> 3817248:0:(lib-move.c:4005:lnet_handle_recovery_reply()) peer NI 
> (10.67.176.25@tcp) recovery failed with -111
> 
> 
> Regards,
>

Re: [lustre-discuss] Error messages (ex: not available for connect from 0@lo) on server boot with Lustre 2.15.3 and 2.15.4-RC1

2023-12-07 Thread Andreas Dilger via lustre-discuss
Aurelien,
there have been a number of questions about this message.

> Lustre: lustrevm-OST0001: deleting orphan objects from 0x0:227 to 0x0:513

This is not marked LustreError, so it is just an advisory message.

This can sometimes be useful for debugging issues related to MDT->OST 
connections.
It is already printed with D_INFO level, so the lowest printk level available.
Would rewording the message make it more clear that this is a normal situation
when the MDT and OST are establishing connections?

Cheers, Andreas

On Dec 5, 2023, at 02:13, Aurelien Degremont  wrote:
> 
> > Now what is the messages about "deleting orphaned objects" ? Is it normal 
> > also ?
> 
> Yeah, this is kind of normal, and I'm even thinking we should lower the 
> message verbosity...
> Andreas, do you agree that could become a simple CDEBUG(D_HA, ...) instead of 
> LCONSOLE(D_INFO, ...)?
> 
> 
> Aurélien
> 
Audet, Martin wrote on Monday, December 4, 2023 at 20:26:
>> Hello Andreas,
>> 
>> Thanks for your response. Happy to learn that the "errors" I was reporting 
>> aren't really errors.
>> 
>> I now understand that the 3 messages about LDISKFS were only normal messages 
>> resulting from mounting the file systems (I was fooled by vim showing this 
>> message in red, like important error messages, but this is simply a false 
>> positive result of its syntax highlight rules probably triggered by the 
>> "errors=" string which is only a mount option...).
>> 
>> Now what is the messages about "deleting orphaned objects" ? Is it normal 
>> also ? We boot the clients VMs always after the server is ready and we 
>> shutdown clients cleanly well before the vlmf Lustre server is (also 
>> cleanly) shutdown. It is a sign of corruption ? How come this happen if 
>> shutdowns are clean ?
>> 
>> Thanks (and sorry for the beginners questions),
>> 
>> Martin
>> 
>> Andreas Dilger  wrote on December 4, 2023 5:25 AM:
>>> It wasn't clear from your rail which message(s) are you concerned about?  
>>> These look like normal mount message(s) to me. 
>>> 
>>> The "error" is pretty normal, it just means there were multiple services 
>>> starting at once and one wasn't yet ready for the other. 
>>> 
>>>  LustreError: 137-5: lustrevm-MDT_UUID: not available for 
>>> connect
>>>  from 0@lo (no target). If you are running an HA pair check that 
>>> the target
>>> is mounted on the other server.
>>> 
>>> It probably makes sense to quiet this message right at mount time to avoid 
>>> this. 
>>> 
>>> Cheers, Andreas
>>> 
>>>> On Dec 1, 2023, at 10:24, Audet, Martin via lustre-discuss 
>>>>  wrote:
>>>> 
>>>> 
>>>> Hello Lustre community,
>>>> 
>>>> Have someone ever seen messages like these on in "/var/log/messages" on a 
>>>> Lustre server ?
>>>> 
>>>> Dec  1 11:26:30 vlfs kernel: Lustre: Lustre: Build Version: 2.15.4_RC1
>>>> Dec  1 11:26:30 vlfs kernel: LDISKFS-fs (sdd): mounted filesystem with 
>>>> ordered data mode. Opts: errors=remount-ro,no_mbcache,nodelalloc
>>>> Dec  1 11:26:30 vlfs kernel: LDISKFS-fs (sdc): mounted filesystem with 
>>>> ordered data mode. Opts: errors=remount-ro,no_mbcache,nodelalloc
>>>> Dec  1 11:26:30 vlfs kernel: LDISKFS-fs (sdb): mounted filesystem with 
>>>> ordered data mode. Opts: user_xattr,errors=remount-ro,no_mbcache,nodelalloc
>>>> Dec  1 11:26:36 vlfs kernel: LustreError: 137-5: lustrevm-MDT_UUID: 
>>>> not available for connect from 0@lo (no target). If you are running an HA 
>>>> pair check that the target is mounted on the other server.
>>>> Dec  1 11:26:36 vlfs kernel: Lustre: lustrevm-OST0001: Imperative Recovery 
>>>> not enabled, recovery window 300-900
>>>> Dec  1 11:26:36 vlfs kernel: Lustre: lustrevm-OST0001: deleting orphan 
>>>> objects from 0x0:227 to 0x0:513
>>>> 
>>>> This happens on every boot on a Lustre server named vlfs (a AlmaLinux 8.9 
>>>> VM hosted on a VMware) playing the role of both MGS and OSS (it hosts an 
>>>> MDT two OST using "virtual" disks). We chose LDISKFS and not ZFS. Note 
>>>> that this happens at every boot, well before the clients (AlmaLinux 9.3 or 
>>>> 8.9 VMs) connect and even when the clients are powered off. The network 
>>>> connecting the clients and the server is a "virtual" 10GbE network.

Re: [lustre-discuss] Lustre caching and NUMA nodes

2023-12-05 Thread Andreas Dilger via lustre-discuss

On Dec 4, 2023, at 15:06, John Bauer <bau...@iodoctors.com> wrote:

I have a an OSC caching question.  I am running a dd process which writes an 
8GB file.  The file is on lustre, striped 8x1M. This is run on a system that 
has 2 NUMA nodes (cpu sockets). All the data is apparently stored on one NUMA 
node (node1 in the plot below) until node1 runs out of free memory.  Then it 
appears that dd comes to a stop (no more writes complete) until lustre dumps 
the data from the node1.  Then dd continues writing, but now the data is stored 
on the second NUMA node, node0.  Why does lustre go to the trouble of dumping 
node1 and then not use node1's memory, when there was always plenty of free 
memory on node0?

I'll forego the explanation of the plot.  Hopefully it is clear enough.  If 
someone has questions about what the plot is depicting, please ask.

https://www.dropbox.com/scl/fi/pijgnnlb8iilkptbeekaz/dd.png?rlkey=3abonv5tx8w5w5m08bn24qb7x=0

Hi John,
thanks for your detailed analysis.  It would be good to include the client 
kernel and Lustre version in this case, as the page cache behaviour can vary 
dramatically between different versions.

The allocation of the page cache pages may actually be out of the control of 
Lustre, since they are typically being allocated by the kernel VM affine to the 
core where the process that is doing the IO is running.  It may be that the 
"dd" is rescheduled to run on node0 during the IO, since the ptlrpcd threads 
will be busy processing all of the RPCs during this time, and then dd will 
start allocating pages from node0.

That said, it isn't clear why the client doesn't start flushing the dirty data 
from cache earlier?  Is it actually sending the data to the OSTs, but then 
waiting for the OSTs to reply that the data has been committed to the storage 
before dropping the cache?

It would be interesting to plot the osc.*.rpc_stats::write_rpcs_in_flight and 
::pending_write_pages to see if the data is already in flight.  The 
osd-ldiskfs.*.brw_stats on the server would also be useful to graph over the 
same period, if possible.
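
A minimal sketch of sampling those client-side counters once per second during
the run (the field names match the rpc_stats output):

while sleep 1; do
    date +%T
    lctl get_param 'osc.*.rpc_stats' | grep -E 'write RPCs in flight|pending write pages'
done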

It *does* look like the "node1 dirty" is kept at a low value for the entire 
run, so it at least appears that RPCs are being sent, but there is no page 
reclaim triggered until memory is getting low.  Doing page reclaim is really 
the kernel's job, but it seems possible that the Lustre client may not be 
suitably notifying the kernel about the dirty pages and kicking it in the butt 
earlier to clean up the pages.

PS: my preference would be to just attach the image to the email instead of 
hosting it externally, since it is only 55 KB.  Is this blocked by the list 
server?

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Debian 11: configure fails

2023-12-04 Thread Andreas Dilger via lustre-discuss
Which version of Lustre are you trying to build?

On Dec 4, 2023, at 05:48, Jan Andersen <j...@comind.io> wrote:

My system:

root@debian11:~/lustre-release# uname -r
5.10.0-26-amd64

Lustre:

git clone git://git.whamcloud.com/fs/lustre-release.git

I'm building the client with:

./configure --config-cache --disable-server --enable-client 
--with-linux=/usr/src/linux-headers-5.10.0-26-amd64 --without-zfs 
--disable-ldiskfs --disable-gss --disable-gss-keyring --disable-snmp 
--enable-modules
...
checking for /usr/src/nvidia-fs-2.18.3/config-host.h... no
./configure: line 97149: syntax error near unexpected token `LIBNL3,'
./configure: line 97149: `  PKG_CHECK_MODULES(LIBNL3, libnl-genl-3.0 >= 
3.1)'

root@debian11:~/lustre-release# dpkg -l | grep libnl-genl
ii  libnl-genl-3-200:amd64   3.4.0-1+b1 amd64   
 library for dealing with netlink sockets - generic netlink
ii  libnl-genl-3-dev:amd64   3.4.0-1+b1 amd64   
 development library and headers for libnl-genl-3

What is going wrong here - does configure require libnl-genl-3.0?

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Error messages (ex: not available for connect from 0@lo) on server boot with Lustre 2.15.3 and 2.15.4-RC1

2023-12-04 Thread Andreas Dilger via lustre-discuss
It wasn't clear from your mail which message(s) you are concerned about?  These 
look like normal mount message(s) to me.

The "error" is pretty normal, it just means there were multiple services 
starting at once and one wasn't yet ready for the other.

  LustreError: 137-5: lustrevm-MDT_UUID: not available for connect
  from 0@lo (no target). If you are running an HA pair check that the target
  is mounted on the other server.

It probably makes sense to quiet this message right at mount time to avoid this.

Cheers, Andreas

On Dec 1, 2023, at 10:24, Audet, Martin via lustre-discuss 
 wrote:



Hello Lustre community,


Have someone ever seen messages like these on in "/var/log/messages" on a 
Lustre server ?


Dec  1 11:26:30 vlfs kernel: Lustre: Lustre: Build Version: 2.15.4_RC1
Dec  1 11:26:30 vlfs kernel: LDISKFS-fs (sdd): mounted filesystem with ordered 
data mode. Opts: errors=remount-ro,no_mbcache,nodelalloc
Dec  1 11:26:30 vlfs kernel: LDISKFS-fs (sdc): mounted filesystem with ordered 
data mode. Opts: errors=remount-ro,no_mbcache,nodelalloc
Dec  1 11:26:30 vlfs kernel: LDISKFS-fs (sdb): mounted filesystem with ordered 
data mode. Opts: user_xattr,errors=remount-ro,no_mbcache,nodelalloc
Dec  1 11:26:36 vlfs kernel: LustreError: 137-5: lustrevm-MDT_UUID: not 
available for connect from 0@lo (no target). If you are running an HA pair 
check that the target is mounted on the other server.
Dec  1 11:26:36 vlfs kernel: Lustre: lustrevm-OST0001: Imperative Recovery not 
enabled, recovery window 300-900
Dec  1 11:26:36 vlfs kernel: Lustre: lustrevm-OST0001: deleting orphan objects 
from 0x0:227 to 0x0:513


This happens on every boot on a Lustre server named vlfs (a AlmaLinux 8.9 VM 
hosted on a VMware) playing the role of both MGS and OSS (it hosts an MDT two 
OST using "virtual" disks). We chose LDISKFS and not ZFS. Note that this 
happens at every boot, well before the clients (AlmaLinux 9.3 or 8.9 VMs) 
connect and even when the clients are powered off. The network connecting the 
clients and the server is a "virtual" 10GbE network (of course there is no 
virtual IB). Also we had the same messages previously with Lustre 2.15.3 using 
an AlmaLinux 8.8 server and AlmaLinux 8.8 / 9.2 clients (also using VMs). Note 
also that we compile ourselves the Lustre RPMs from the sources from the git 
repository. We also chose to use a patched kernel. Our build procedure for RPMs 
seems to work well because our real cluster run fine on CentOS 7.9 with Lustre 
2.12.9 and IB (MOFED) networking.

So has anyone seen these messages ?

Are they problematic ? If yes, how do we avoid them ?

We would like to make sure our small test system using VMs works well before we 
upgrade our real cluster.

Thanks in advance !

Martin Audet

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] OST is not mounting

2023-11-07 Thread Andreas Dilger via lustre-discuss
The OST went read-only because that is what happens when the block device 
disappears underneath it. That is a behavior of ext4 and other local 
filesystems as well. 

If you look in the console logs you would see SCSI errors and the filesystem 
being remounted read-only. 

To have reliability in the face of such storage issues you need to use 
dm-multipath. 
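
For example, to confirm the cause and check the path state (a sketch; device
names and output omitted):

# look for the SCSI errors and the read-only remount on the console
dmesg | grep -iE 'i/o error|read-only'
# with dm-multipath configured, each OST LUN should show multiple active paths
multipath -ll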

Cheers, Andreas

> On Nov 5, 2023, at 09:13, Backer via lustre-discuss 
>  wrote:
> 
> - Why did OST become in this state after the write failure and was mounted 
> RO.  The write error was due to iSCSI target going offline and coming back 
> after a few seconds later. 
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


[lustre-discuss] Possible change to "lfs find -size" default units?

2023-11-05 Thread Andreas Dilger via lustre-discuss
I've recently realized that "lfs find -size N" defaults to looking for files of 
N *bytes* by default, unlike regular find(1) that is assuming 512-byte blocks 
by default if no units are given.

I'm wondering if it would be disruptive to users if the default unit for -size 
was changed to 512-byte blocks instead of bytes, or if it is most common to 
specify a unit suffix like "lfs find -size +1M" and the change would mostly be 
unnoticed?  I would add a 'c' suffix for compatibility with find(1) to allow 
specifying an exact number of chars (bytes).

On the other hand, possibly this would be *less* confusing for users that are 
already used to the behavior of regular "find"?
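
For comparison, a few concrete invocations (paths are examples):

lfs find /lustre -size +1048576    # today: larger than 1 MiB, in bytes
lfs find /lustre -size +1M         # explicit suffix, unaffected by any change
find /lustre -size +2048           # GNU find: 1 MiB in 512-byte blocks
find /lustre -size +1048576c       # GNU find: 'c' suffix selects bytes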

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Lustre-Manual on lfsck - non-existing entries?

2023-10-31 Thread Andreas Dilger via lustre-discuss
On Oct 31, 2023, at 13:12, Thomas Roth via lustre-discuss 
 wrote:
> 
> Hi all,
> 
> after starting an `lctl lfsck_start -A  -C -o` and the oi_scrub having 
> completed, I would check the layout scan as described in the Lustre manual, 
> "36.4.3.3. LFSCK status of layout via procfs", by
> 
> > lctl get_param -n mdd.FSNAME-MDT_target.lfsck_layout
> 
> Doesn't work, and inspection of 'ls /sys/fs/lustre/mdd/FSNAME-MDT/' shows:
> > ...
> > lfsck_async_windows
> > lfsck_speed_limit
> ...
> 
> as the only entries showing the string "lfsck".
> 
> > lctl lfsck_query -M FSNAME-MDT -t layout
> 
> does show some info, although it is not what the manual describes as output 
> of the `lctl get_param` command.
> 
> 
> Issue with the manual or issue with our Lustre?

Are you perhaps running the "lctl get_param" as a non-root user?  One of the 
wonderful quirks of the kernel is that they don't want new parameters stored in 
procfs, and they don't want "complex" parameters (more than one value) stored 
in sysfs, so by necessity this means anything "complex" needs to go into 
debugfs (/sys/kernel/debug) but that was changed at some point to only be 
accessible by root.

As such, you need to be root to access any of the "complex" parameters/stats:

  $ lctl get_param mdd.*.lfsck_layout
  error: get_param: param_path 'mdd/*/lfsck_layout': No such file or directory

  $ sudo lctl get_param mdd.*.lfsck_layout
  mdd.myth-MDT.lfsck_layout=
  name: lfsck_layout
  magic: 0xb1732fed
  version: 2
  status: completed
  flags:
  param: all_targets
  last_completed_time: 1694676243
  time_since_last_completed: 4111337 seconds
  latest_start_time: 1694675639
  time_since_latest_start: 4111941 seconds
  last_checkpoint_time: 1694676243
  time_since_last_checkpoint: 4111337 seconds
  latest_start_position: 12
  last_checkpoint_position: 4194304
  first_failure_position: 0
  success_count: 6
  repaired_dangling: 0
  repaired_unmatched_pair: 0
  repaired_multiple_referenced: 0
  repaired_orphan: 0
  repaired_inconsistent_owner: 0
  repaired_others: 0
  skipped: 0
  failed_phase1: 0
  failed_phase2: 0
  checked_phase1: 3791402
  checked_phase2: 0
  run_time_phase1: 595 seconds
  run_time_phase2: 8 seconds
  average_speed_phase1: 6372 items/sec
  average_speed_phase2: 0 objs/sec
  real_time_speed_phase1: N/A
  real_time_speed_phase2: N/A
  current_position: N/A

  $ sudo ls /sys/kernel/debug/lustre/mdd/myth-MDT/
  total 0
  0 changelog_current_mask  0 changelog_users  0 lfsck_namespace
  0 changelog_mask  0 lfsck_layout

Getting an update to the manual to clarify this requirement would be welcome.

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] very slow mounts with OSS node down and peer discovery enabled

2023-10-26 Thread Andreas Dilger via lustre-discuss
I can't comment on the LNet peer discovery part, but I would definitely not 
recommend leaving the lnet_transaction_timeout that low for normal usage. This 
can cause messages to be dropped while the server is processing them and 
introduce failures needlessly. 
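
For example, once the failed OSS is back, the value from the report below can
be restored:

lnetctl set transaction_timeout 50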

Cheers, Andreas

> On Oct 26, 2023, at 09:48, Bertschinger, Thomas Andrew Hjorth via 
> lustre-discuss  wrote:
> 
> Hello,
> 
> Recently we had an OSS node down for an extended period with hardware 
> problems. While the node was down, mounting lustre on a client took an 
> extremely long time to complete (20-30 minutes). Once the fs is mounted, all 
> operations are normal and there isn't any noticeable impact from the absent 
> node.
> 
> While the client is mounting, the client's debug log shows entries like this 
> slowly going by:
> 
> 0020:0080:87.0:1698333195.993098:0:3801046:0:(obd_config.c:1384:class_process_config())
>  processing cmd: cf005
> 0020:0080:87.0:1698333195.993099:0:3801046:0:(obd_config.c:1396:class_process_config())
>  adding mapping from uuid 10.1.2.3@o2ib to nid 0x50abcd123 (10.1.2.4@o2ib)
> 
> and there is a "llog_process_th" kernel thread hanging in 
> lnet_discover_peer_locked().
> 
> We have peer discovery enabled on our clients, but disabling peer discovery 
> on a client causes the mount to complete quickly. Also, once the down OSS was 
> fixed and powered back on, mounting completed normally again.
> 
> We also found that reducing the following timeout sped up the mount by a 
> factor of ~10:
> 
> $ lnetctl set transaction_timeout 5    # was 50 originally
> 
> Is such a dramatic slowdown normal in this situation? Is there any fix (aside 
> from disabling peer discovery or tuning down the timeout) that could speed up 
> mounts in case we have another OSS down in the future?
> 
> Lustre version (server and client): 2.15.3
> 
> Thanks, 
> Thomas Bertschinger
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] re-registration of MDTs and OSTs

2023-10-24 Thread Andreas Dilger via lustre-discuss
On Oct 18, 2023, at 13:04, Peter Grandi via lustre-discuss <lustre-discuss@lists.lustre.org> wrote:

So I have been upgrading my one and only MDT to a larger ZFS
pool, by the classic route of creating a new pool, new MDT, and
then 'zfs send'/zfs receive' for the copy over (BTW for those
who may not be aware 'zfs send' output can be put into a file to
do offline backups of a Lustre ZFS filesystem instance).
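
A minimal sketch of that offline-backup trick (pool, dataset, and snapshot
names are examples):

zfs snapshot mdtpool/mdt0@bak
zfs send mdtpool/mdt0@bak > /backup/mdt0.zsend
# and to restore later:
zfs receive newpool/mdt0 < /backup/mdt0.zsend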

At first I just created an empty MGT on the new devices (on a
server with the same NID as the old MGS), with the assumption
that given that MDTs and OSTs have unique (filesystem instance,
target type, index number) triples, with NIDs being just the
address to find the MGS, or where they can be found, they would
just register themselves with the MGT on startup.

But I found that there was a complaint that they were in a
registered state, and the MGT did not have their registration
entries. I am not sure that is the purpose of that check. So
I just copied over the old MGT where they were registered, and
all was fine.

* Is there a way to re-register MDTs and OSTs belonging to a
 given filesystem instance into a new different MGT?

If you run the "writeconf" process documented in the Lustre Manual, the MDT(s)
and OST(s) will re-register themselves with the MGT.
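
A hedged sketch of that procedure for ZFS targets (pool/dataset names are
examples); all clients and targets must be unmounted first:

mgs# tunefs.lustre --writeconf mgtpool/mgt
mds# tunefs.lustre --writeconf mdtpool/mdt0
oss# tunefs.lustre --writeconf ostpool/ost0    # repeat for every OST
# then remount in order: MGT first, then MDT(s), then OST(s)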

* Is there a purpose to check whether MDTs and OSTs are or not
 registered in a given MGT?

Yes, this prevents the MDTs/OSTs from accidentally becoming part of a
different filesystem that might have been incorrectly formatted with the
same fsname (e.g. "lustre" has been used as the fsname more than
once).

* Is there a downside to register MDTs and OSTs in a different
 MGT from that which they were registered with initially?

Not too much.  The new MGT will not have any of the old configuration
parameters, but running "writeconf" will also reset any "conf_param"
parameters so not much different (but it will not reset "set_param -P"
parameters).

My guess is that the MGT does not just contain the identities
and addresses of MDTs and OSTs of one or more filesystem
instance, but also a parameter list

If so, is there are way to dump the parameter for a filesystem
instance so it can be restored to a different MGT?

Yes, the "lctl --device MGS llog_print CONFIG_LOG" command
will dump all of the config commands for a particular MDT/OST
or the "params" log for "set_param -P".

The parameters can be restored from a file with "lctl set_param -F".

See the lctl-set_param.8 and lctl-llog_print.8 man pages for details.
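
Putting those together, a hedged sketch of carrying the "set_param -P" settings
over (the file name is an example, and whether the llog_print output can be fed
straight back may depend on the Lustre version; check the man pages above):

mgs# lctl --device MGS llog_print params > /tmp/params.yaml
mgs# lctl set_param -F /tmp/params.yaml    # on the new MGS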

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] setting quotas from within a container

2023-10-21 Thread Andreas Dilger via lustre-discuss
Hi Lisa,
The first question to ask is which Lustre version you are using?

Second, are you using subdirectory mounts or other UID/GID mapping for the 
container? That could happen at both the Lustre level or by the kernel itself.  
If you aren't sure, you could try creating a new file as root inside the 
container, then "ls -l" the file from outside the container to see if it is 
owned by root.

You could try running "strace lfs setquota" to see what operation the -EPERM = 
-1 error is coming from. 
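
For example (user name, limit, and mount point are placeholders; Lustre's
setquota goes through an ioctl on the mount point rather than quotactl(2)):

strace -f -e trace=ioctl,quotactl lfs setquota -u someuser -B 10G /mnt/lustre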

The other important question is whether you really want to allow root inside 
the container to be able to set the quota, or whether this should be reserved 
for root outside the container?

Cheers, Andreas

> On Oct 21, 2023, at 09:18, Lisa Gerhardt via lustre-discuss 
>  wrote:
> 
> 
> Hello,
> I'm trying to set user quotas from within a container run as root. I can 
> successfully do things like "lfs setstripe", but "lfs setquota" fails with 
> 
> lfs setquota: quotactl failed: Operation not permitted
> setquota failed: Operation not permitted
> 
> I suspect it might have something to do with how the file system is mounted 
> in the container. I'm wondering if anyone has any experience with this or if 
> someone could point me to some documentation to help me understand what 
> "setquota" is doing differently from "setstripe" to see where things are 
> going off the rails.
> 
> Thanks,
> Lisa
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] mount not possible: "no server support"

2023-10-19 Thread Andreas Dilger via lustre-discuss


On Oct 19, 2023, at 19:58, Benedikt Alexander Braunger via lustre-discuss <lustre-discuss@lists.Lustre.org> wrote:

Hi Lustrers,

I'm currently struggling with a unmountable Lustre filesystem. The client only 
says "no server support", no further logs on client or server.
I first thought this might be related to the usage of fscrypt but I already 
recreated the whole filesystem from scratch and the error still persists.
Now I have no more idea what to look for.

Here the full CLI log:

[root@dstorsec01vl]# uname -a
Linux dstorsec01vl 5.14.0-284.11.1.el9_2.x86_64 #1 SMP PREEMPT_DYNAMIC Wed Apr 
12 10:45:03 EDT 2023 x86_64 x86_64 x86_64 GNU/Linux

[root@dstorsec01vl]# modprobe lnet
[root@dstorsec01vl]# modprobe lustre
[root@dstorsec01vl]# lnetctl ping pstormgs01@tcp
ping:
- primary nid: 10.106.104.160@tcp
  Multi-Rail: False
  peer ni:
- nid: 10.106.104.160@tcp

[root@dstorsec01vl]# mount -t lustre pstormgs01@tcp:sif0 /mnt/
mount.lustre: cannot mount pstormgs01@tcp:sif0: no server support

It looks like this is failing because the mount device is missing ":/" in it, 
which mount.lustre uses to decide whether this is a client or a server 
mountpoint.  You should be using:

client# mount -t lustre pstormgs01@tcp:/sif0 /mnt/sif0

and this should work.  It probably makes sense to improve the error message to 
be more clear, like:

   mount.lustre: cannot mount block device 'pstormgs01@tcp:sif0': no server 
support

or similar

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] backup restore docs not quite accurate?

2023-10-18 Thread Andreas Dilger via lustre-discuss
Removing the OI files is for ldiskfs backup/restore (e.g. after tar/untar), when 
the inode numbers have changed. That is not needed for ZFS send/recv because the 
inode numbers stay the same after such an operation. 

If that isn't clear in the manual it should be fixed. 

Cheers, Andreas

> On Oct 18, 2023, at 23:42, Peter Grandi via lustre-discuss 
>  wrote:
> 
> I was asking this on the Slack channel but as I was typing it
> looks too complicated for chat, so here:
> 
> Lustre 2.15.2, EL8, MGT, MDT and OSTs on ZFS.
> 
> I am trying to copy the MGT and the MDT to larger zpools, so I
> created them on another server, and used 'zfs send' to copy them
> while the Lustre instance was frozen (for the last incremental).
> 
> The I have put the new zpool drives in the old server with same
> NID, and I following the "Operations Manual" here:
> 
> https://doc.lustre.org/lustre_manual.xhtml#backup_fs_level.restore
> "Remove old OI and LFSCK files.[oss]# rm -rf oi.16* lfsck_* LFSCK
> Remove old CATALOGS. [oss]# rm -f CATALOGS"
> 
> But I am getting a lot of error when removing "oi.16*", of the
> "directory not empty" sort. For example "cannot remove
> 'oi.16/0x200011b90:0xabe1:0x0': Directory not empty"
> 
> Please suggest some options.
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] OSS on compute node

2023-10-13 Thread Andreas Dilger via lustre-discuss
On Oct 13, 2023, at 20:58, Fedele Stabile <fedele.stab...@fis.unical.it> wrote:

Hello everyone,
We are in the process of integrating Lustre on our little HPC cluster, and we 
would like to know if it is possible to use the same node both as an OSS with 
disks and as a compute node, and then install a Lustre client on it.
I know that the OSS requires a modified kernel, so I suppose it can be 
installed in a virtual machine using KVM on a compute node.

There isn't really a problem with running a client + OSS on the same node 
anymore, nor is there a problem with an OSS running inside a VM (if you have 
SR-IOV and enough CPU+RAM to run the server).

*HOWEVER*, I don't think it would be good to have the client mounted on the *VM 
host* and then run the OSS in a *VM guest*.  That could lead to deadlocks and 
priority inversion if the client host becomes busy but depends on the local OSS 
to flush dirty data from RAM, while the OSS cannot run in the VM because it 
cannot get any RAM...

If the client and OSS are BOTH run in VMs, or neither run in VMs, or only the 
client run in a VM, then that should be OK, but may have reduced performance 
due to the server contending with the client application.

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Ongoing issues with quota

2023-10-10 Thread Andreas Dilger via lustre-discuss
There is a $ROOT/.lustre/lost+found that you could check. 

What does "lfs df -i" report for the used inode count?  Maybe it is RBH that is 
reporting the wrong count?

The other alternative would be to mount the MDT filesystem directly as type ZFS 
and see what df -i and find report?  

Cheers, Andreas

> On Oct 10, 2023, at 22:16, Daniel Szkola via lustre-discuss 
>  wrote:
> 
> OK, I disabled, waited for a while, then reenabled. I still get the same 
> numbers. The only thing I can think is somehow the count is correct, despite 
> the huge difference. Robinhood and find show about 1.7M files, dirs, and 
> links. The quota is showing a bit over 3.1M inodes used. We only have one MDS 
> and MGS. Any ideas where the discrepancy may lie? Orphans? Is there a 
> lost+found area in lustre?
> 
> —
> Dan Szkola
> FNAL
> 
> 
>> On Oct 10, 2023, at 8:24 AM, Daniel Szkola  wrote:
>> 
>> Hi Robert,
>> 
>> Thanks for the response. Do you remember exactly how you did it? Did you 
>> bring everything down at any point? I know you can do this:
>> 
>> lctl conf_param fsname.quota.mdt=none
>> 
>> but is that all you did? Did you wait or bring everything down before 
>> reenabling? I’m worried because that allegedly just enables/disables 
>> enforcement and space accounting is always on. Andreas stated that quotas 
>> are controlled by ZFS, but there has been no quota support enabled on any of 
>> the ZFS volumes in our lustre filesystem.
>> 
>> —
>> Dan Szkola
>> FNAL
>> 
>>>> On Oct 10, 2023, at 2:17 AM, Redl, Robert  wrote:
>>> 
>>> Dear Dan,
>>> 
>>> I had a similar problem some time ago. We are also using ZFS for MDT and 
>>> OSTs. For us, the used disk space was reported wrong. The problem was fixed 
>>> by switching quota support off on the MGS and then on again. 
>>> 
>>> Cheers,
>>> Robert
>>> 
>>>> Am 09.10.2023 um 17:55 schrieb Daniel Szkola via lustre-discuss 
>>>> :
>>>> 
>>>> Thanks, I will look into the ZFS quota since we are using ZFS for all 
>>>> storage, MDT and OSTs.
>>>> 
>>>> In our case, there is a single MDS/MDT. I have used Robinhood and lfs find 
>>>> (by group) commands to verify what the numbers should apparently be.
>>>> 
>>>> —
>>>> Dan Szkola
>>>> FNAL
>>>> 
>>>>> On Oct 9, 2023, at 10:13 AM, Andreas Dilger  wrote:
>>>>> 
>>>>> The quota accounting is controlled by the backing filesystem of the OSTs 
>>>>> and MDTs.
>>>>> 
>>>>> For ldiskfs/ext4 you could run e2fsck to re-count all of the inode and 
>>>>> block usage. 
>>>>> 
>>>>> For ZFS you would have to ask on the ZFS list to see if there is some way 
>>>>> to re-count the quota usage. 
>>>>> 
>>>>> The "inode" quota is accounted from the MDTs, while the "block" quota is 
>>>>> accounted from the OSTs. You might be able to see with "lfs quota -v -g 
>>>>> group" to see if there is one particular MDT that is returning too many 
>>>>> inodes. 
>>>>> 
>>>>> Possibly if you have directories that are striped across many MDTs it 
>>>>> would inflate the used inode count. For example, if every one of the 426k 
>>>>> directories reported by RBH was striped across 4 MDTs then you would see 
>>>>> the inode count add up to 3.6M. 
>>>>> 
>>>>> If that was the case, then I would really, really advise against striping 
>>>>> every directory in the filesystem.  That will cause problems far worse 
>>>>> than just inflating the inode quota accounting. 
>>>>> 
>>>>> Cheers, Andreas
>>>>> 
>>>>>> On Oct 9, 2023, at 22:33, Daniel Szkola via lustre-discuss 
>>>>>>  wrote:
>>>>>> 
>>>>>> Is there really no way to force a recount of files used by the quota? 
>>>>>> All indications are we have accounts where files were removed and this 
>>>>>> is not reflected in the used file count in the quota. The space used 
>>>>>> seems correct but the inodes used numbers are way high. There must be a 
>>>>>> way to clear these numbers and have a fresh count done.
>>>>>> 
>>>>>> —
>>>>>> Dan Szkola
>>>>>> FNAL

Re: [lustre-discuss] Ongoing issues with quota

2023-10-09 Thread Andreas Dilger via lustre-discuss
The quota accounting is controlled by the backing filesystem of the OSTs and 
MDTs.

For ldiskfs/ext4 you could run e2fsck to re-count all of the inode and block 
usage. 

For ZFS you would have to ask on the ZFS list to see if there is some way to 
re-count the quota usage. 

The "inode" quota is accounted from the MDTs, while the "block" quota is 
accounted from the OSTs. You might be able to see with "lfs quota -v -g group" 
to see if there is one particular MDT that is returning too many inodes. 

Possibly if you have directories that are striped across many MDTs it would 
inflate the used inode count. For example, if every one of the 426k directories 
reported by RBH was striped across 4 MDTs then you would see the inode count 
add up to 3.6M. 

If that was the case, then I would really, really advise against striping every 
directory in the filesystem.  That will cause problems far worse than just 
inflating the inode quota accounting. 

Cheers, Andreas

> On Oct 9, 2023, at 22:33, Daniel Szkola via lustre-discuss 
>  wrote:
> 
> Is there really no way to force a recount of files used by the quota? All 
> indications are we have accounts where files were removed and this is not 
> reflected in the used file count in the quota. The space used seems correct 
> but the inodes used numbers are way high. There must be a way to clear these 
> numbers and have a fresh count done.
> 
> —
> Dan Szkola
> FNAL
> 
>> On Oct 4, 2023, at 11:37 AM, Daniel Szkola via lustre-discuss 
>>  wrote:
>> 
>> Also, quotas on the OSTS don’t add up to near 3 million files either:
>> 
>> [root@lustreclient scratch]# ssh ossnode0 lfs quota -g somegroup -I 0 
>> /lustre1
>> Disk quotas for grp somegroup (gid 9544):
>>Filesystem  kbytes   quota   limit   grace   files   quota   limit   grace
>>   1394853459   0 1913344192   -  132863   0   0  
>>  -
>> [root@lustreclient scratch]# ssh ossnode0 lfs quota -g somegroup -I 1 
>> /lustre1
>> Disk quotas for grp somegroup (gid 9544):
>>Filesystem  kbytes   quota   limit   grace   files   quota   limit   grace
>>   1411579601   0 1963246413   -  120643   0   0  
>>  -
>> [root@lustreclient scratch]# ssh ossnode1 lfs quota -g somegroup -I 2 
>> /lustre1
>> Disk quotas for grp somegroup (gid 9544):
>>Filesystem  kbytes   quota   limit   grace   files   quota   limit   grace
>>   1416507527   0 1789950778   -  190687   0   0  
>>  -
>> [root@lustreclient scratch]# ssh ossnode1 lfs quota -g somegroup -I 3 
>> /lustre1
>> Disk quotas for grp somegroup (gid 9544):
>>Filesystem  kbytes   quota   limit   grace   files   quota   limit   grace
>>   1636465724   0 1926578117   -  195034   0   0       -
>> [root@lustreclient scratch]# ssh ossnode2 lfs quota -g somegroup -I 4 
>> /lustre1
>> Disk quotas for grp somegroup (gid 9544):
>>Filesystem  kbytes   quota   limit   grace   files   quota   limit   grace
>>   2202272244   0 3020159313   -  185097   0   0       -
>> [root@lustreclient scratch]# ssh ossnode2 lfs quota -g somegroup -I 5 
>> /lustre1
>> Disk quotas for grp somegroup (gid 9544):
>>Filesystem  kbytes   quota   limit   grace   files   quota   limit   grace
>>   1324770165   0 1371244768   -  145347   0   0       -
>> [root@lustreclient scratch]# ssh ossnode3 lfs quota -g somegroup -I 6 
>> /lustre1
>> Disk quotas for grp somegroup (gid 9544):
>>Filesystem  kbytes   quota   limit   grace   files   quota   limit   grace
>>   2892027349   0 3221225472   -  169386   0   0       -
>> [root@lustreclient scratch]# ssh ossnode3 lfs quota -g somegroup -I 7 
>> /lustre1
>> Disk quotas for grp somegroup (gid 9544):
>>Filesystem  kbytes   quota   limit   grace   files   quota   limit   grace
>>   2076201636   0 2474853207   -  171552   0   0       -
>> 
>> 
>> —
>> Dan Szkola
>> FNAL
>> 
 On Oct 4, 2023, at 8:45 AM, Daniel Szkola via lustre-discuss 
  wrote:
>>> 
>>> No combination of ossnodek runs has helped with this.
>>> 
>>> Again, robinhood shows 1796104 files for the group, an 'lfs find -G gid' 
>>> found 1796104 files as well.
>>> 
>>> So why is the quota command showing over 3 million inodes used?
>>> 
>>> There must be a way to force it to recount or clear all stale quota data 
>>> and have it regenerate it?
>>> 
>>> Anyone?
>>> 
>>> —
>>> Dan Szkola
>>> FNAL
>>> 
>>> 
 On Sep 27, 2023, at 9:42 AM, Daniel Szkola via lustre-discuss 
  wrote:
 
 We have a lustre filesystem that we just upgraded to 2.15.3, however this 
 problem has been going on for some time.
 
 The quota command shows this:
 
 Disk quotas for grp somegroup (gid 9544):
  Filesystemused   quota   limit   grace   files   quota   limit   grace
/lustre1  13.38T 40T 45T   - 

Re: [lustre-discuss] OST went back in time: no(?) hardware issue

2023-10-04 Thread Andreas Dilger via lustre-discuss
On Oct 3, 2023, at 16:22, Thomas Roth via lustre-discuss 
 wrote:
> 
> Hi all,
> 
> in our Lustre 2.12.5 system, we have "OST went back in time" after OST 
> hardware replacement:
> - hardware had reached EOL
> - we set `max_create_count=0` for these OSTs, searched for and migrated off 
> the files of these OSTs
> - formatted the new OSTs with `--replace` and the old indices
> - all OSTs are on ZFS
> - set the OSTs `active=0` on our 3 MDTs
> - moved in the new hardware, reused the old NIDs, old OST indices, mounted 
> the OSTs
> - set the OSTs `active=1`
> - ran `lfsck` on all servers
> - set `max_create_count=200` for these OSTs
> 
> Now the "OST went back in time" messages appeared in the MDS logs.
> 
> This doesn't quite fit the description in the manual. There were no crashes 
> or power losses. I cannot understand which cache might have been lost.
> The transaction numbers quoted in the error are both large, eg. `transno 
> 55841088879 was previously committed, server now claims 4294992012`
> 
> What should we do? Give `lfsck` another try?

Nothing really to see here I think?

Did you delete LAST_RCVD during the replacement and the OST didn't know what 
transno was assigned to the last RPCs it sent?  The still-mounted clients have 
a record of this transno and are surprised that it was reset.  If you unmount 
and remount the clients the error would go away.
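
That would be something like the following on each client (names are 
placeholders):

client# umount /mnt/lustre
client# mount -t lustre mgsnode@o2ib:/fsname /mnt/lustre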

I'm not sure if the clients might try to preserve the next 55B RPCs in memory 
until the committed transno on the OST catches up, or if they just accept the 
new transno and get on with life?

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Failing build of lustre client on Debian 12

2023-10-04 Thread Andreas Dilger via lustre-discuss
On Oct 4, 2023, at 16:26, Jan Andersen <j...@comind.io> wrote:

Hi,

I've just successfully built the lustre 2.15.3 client on Debian 11 and need to 
do the same on Debian 12; however, configure fails with:

checking if Linux kernel was built with CONFIG_FHANDLE in or as module... no
configure: error:

Lustre fid handling requires that CONFIG_FHANDLE is enabled in your kernel.



As far as I can see, CONFIG_FHANDLE is in fact enabled - eg:

root@debian12:~/lustre-release# grep CONFIG_FHANDLE /boot/config-6.1.38
CONFIG_FHANDLE=y

I've tried to figure out how configure checks for this, but the script is 
rather dense and I haven't penetrated it (yet). It seems to me there is an 
error in the way it checks. What is the best way forward, considering that I've 
already invested a lot of time and effort in setting up a slurm cluster with 
Debian 12?

You could change the AC_MSG_ERROR() to AC_MSG_WARN() or similar, if you 
think the check is wrong.  It would be worthwhile to check if a patch has 
already been submitted to fix this on the master branch.  Otherwise, getting a 
proper patch submitted to fix the check would be better than just ignoring the 
error and leaving it for the next person to fix.
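
As a starting point, a hedged sketch for locating the check (assuming it lives 
in the autoconf fragments under config/ in the source tree):

# grep -rn CONFIG_FHANDLE config/
# edit the matching .m4 test (or its AC_MSG_ERROR), then regenerate:
# sh autogen.sh && ./configure --disable-server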

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Cannot mount MDT after upgrading from Lustre 2.12.6 to 2.15.3

2023-10-01 Thread Andreas Dilger via lustre-discuss
>>> To upgrade Lustre 
>>> from 2.13.0 and before to 2.15.X, we should run
>>> tune2fs -O ea_inode /dev/mdtdev
>>> After that, as I have posted, we encountered a problem mounting the MDT. We 
>>> then cured this problem by following section 18 of the Lustre Operations Manual.
>>> 
>>> My personal suggestions are:
>>> 1. In the future, before doing a major revision upgrade of our production 
>>> systems (say, 2.12.X to 2.15.X, or 2.15.X to 2.16 or later), I will set up 
>>> a small testing system, installing exactly the same software as the 
>>> production system, and test the upgrade to make sure that every step is 
>>> correct. We did this when upgrading Lustre with the ZFS backend, but this 
>>> time, due to time pressure, we skipped this step for upgrading Lustre with 
>>> the ldiskfs backend. I think that in any situation it is still worth doing 
>>> this step in order to avoid risk.
>>> 
>>> 2. Currently, compiling Lustre with the ldiskfs backend is still a 
>>> nightmare. The ldiskfs code is not self-contained, standalone code: it 
>>> copies code from the kernel's ext4, applies a lot of patches, and then 
>>> compiles, all on the fly. So we have to be very careful to select a Linux 
>>> kernel that is compatible with both our hardware and the Lustre version. 
>>> The ZFS backend is much cleaner: it is standalone, self-contained code, 
>>> and no on-the-fly patching is needed. So I would like to suggest that the 
>>> Lustre developers consider making ldiskfs standalone and self-contained in 
>>> a future release. That would be a great convenience.
>>> 
>>> I hope the above experiences are useful to our community. 
>>> 
>>> ps. Lustre Operation Manual: 
>>> https://doc.lustre.org/lustre_manual.xhtml#Upgrading_2.x
>>> 
>>> Best Regards,
>>> 
>>> T.H.Hsieh
>>> 
>>> Audet, Martin wrote on Wed, Sep 27, 2023 at 3:44 AM:
>>>> Hello all,
>>>> 
>>>> 
>>>> 
>>>> I would appreciate it if the community would give more attention to this 
>>>> issue, because upgrading from 2.12.x to 2.15.x, two LTS versions, is 
>>>> something that we can expect many cluster admins will try to do in the 
>>>> next few months...
>>>> 
>>>> 
>>>> 
>>>> We ourselves plan to upgrade a small Lustre (production) system from 
>>>> 2.12.9 to 2.15.3 in the next couple of weeks...
>>>> 
>>>> 
>>>> 
>>>> After seeing problems reports like this we start feeling a bit nervous...
>>>> 
>>>> 
>>>> 
>>>> The documentation for doing this major update appears to me as not very 
>>>> specific...
>>>> 
>>>> 
>>>> 
>>>> In this document for example, 
>>>> https://doc.lustre.org/lustre_manual.xhtml#upgradinglustre , the update 
>>>> process appears not so difficult and there is no mention of using 
>>>> "tunefs.lustre --writeconf" for this kind of update.
>>>> 
>>>> 
>>>> 
>>>> Or am I missing something ?
>>>> 
>>>> 
>>>> 
>>>> Thanks in advance for providing more tips for this kind of update.
>>>> 
>>>> 
>>>> 
>>>> Martin Audet
>>>> 
>>>> From: lustre-discuss  on behalf 
>>>> of Tung-Han Hsieh via lustre-discuss 
>>>>> Sent: September 23, 2023 2:20 PM
>>>>> To: lustre-discuss@lists.lustre.org
>>>>> Subject: [lustre-discuss] Cannot mount MDT after upgrading from Lustre 
>>>>> 2.12.6 to 2.15.3
>>>>>  
>>>>> 
>>>>> Dear All,
>>>>> 
>>>>> Today we tried to upgrade Lustre file system from version 2.12.6 to 
>>>>> 2.15.3. But after the work, we cannot mount MDT successfully. Our MDT is 
>>>>> ldiskfs backend. The procedure of upgrade is
>>>>> 
>>>>> 1. Install the new version of e2fsprogs-1.47.0
>>>>> 2. Install Lustre-2.15.3
>>>>> 3. After reboot, run: tunefs.lustre --writeconf /dev/md0
>>>>> 
>>>>> Then when mounting MDT, we got the error message in dmesg:
>>>>> 
>>>>> ===
>>>>> [11662.434724] LDISKFS-fs (md0): mounted filesystem with ordered data 
>>>>> mode. Opts: user_xattr,errors=remount-ro,no_mbcache,nodelalloc
>>>>> [11662.584593] Lustre: 3440:0:(scrub.c:189:scrub_file_load()) 
>>>>> chome-MDT: reset scrub OI count for format change (LU-16655)
>>>>> [11666.036253] Lustre: MGS: Logs for fs chome were removed by user 
>>>>> request.  All servers must be restarted in order to regenerate the logs: 
>>>>> rc = 0
>>>>> [11666.523144] Lustre: chome-MDT: Imperative Recovery not enabled, 
>>>>> recovery window 300-900
>>>>> [11666.594098] LustreError: 3440:0:(mdd_device.c:1355:mdd_prepare()) 
>>>>> chome-MDD: get default LMV of root failed: rc = -2
>>>>> [11666.594291] LustreError: 
>>>>> 3440:0:(obd_mount_server.c:2027:server_fill_super()) Unable to start 
>>>>> targets: -2
>>>>> [11666.594951] Lustre: Failing over chome-MDT
>>>>> [11672.868438] Lustre: 3440:0:(client.c:2295:ptlrpc_expire_one_request()) 
>>>>> @@@ Request sent has timed out for slow reply: [sent 1695492248/real 
>>>>> 1695492248]  req@5dfd9b53 x1777852464760768/t0(0) 
>>>>> o251->MGC192.168.32.240@o2ib@0@lo:26/25 lens 224/224 e 0 to 1 dl 
>>>>> 1695492254 ref 2 fl Rpc:XNQr/0/ rc 0/-1 job:''
>>>>> [11672.925905] Lustre: server umount chome-MDT complete
>>>>> [11672.926036] LustreError: 3440:0:(super25.c:183:lustre_fill_super()) 
>>>>> llite: Unable to mount : rc = -2
>>>>> [11872.893970] LDISKFS-fs (md0): mounted filesystem with ordered data 
>>>>> mode. Opts: (null)
>>>>> 
>>>>> 
>>>>> Could anyone help to solve this problem ? Sorry that it is really urgent.
>>>>> 
>>>>> Thank you very much.
>>>>> 
>>>>> T.H.Hsieh
>>>>> 

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Adding lustre clients into the Debian

2023-10-01 Thread Andreas Dilger via lustre-discuss
On Oct 1, 2023, at 05:54, Arman Khalatyan via lustre-discuss 
 wrote:
> 
> Hello everyone,
> 
> We are in the process of integrating the Lustre client into Debian. Are there 
> any legal concerns or significant obstacles to this? We're curious why it 
> hasn't been included in the official Debian repository so far. There used to 
> be an old, unmaintained Lustre client branch dating back to around 2013.

I don't think there is any particular barrier to add a Debian Lustre package 
today.  As you wrote, there was some effort put into that in the past, but I 
think there wasn't someone with the time and Debian-specific experience to 
maintain it and get it included into a release.  Lustre is GPLv2 licensed, with 
some LGPL user library API code, so there aren't any legal issues.

The Lustre CI system builds Ubuntu clients, and I'd be happy to see patches to 
improve the Debian packaging in the Lustre tree.  Most Lustre users are using 
RHEL derivatives, and secondarily Ubuntu, so there hasn't been anyone to work 
on specific Debian packaging in some time.  Thomas (CC'd) was most active on 
the Debian front, and might be able to comment on this in more detail, and 
hopefully can also help review the patches.

> You can check our wishlist on Debian here: https://bugs.debian.org/1053214
> 
> At AIP, one of our colleagues is responsible for maintaining Astropy, so we 
> have some experience with Debian.
> 
> I've also set up a CI system in our GitLab, which includes a simple build and 
> push to a public S3 bucket. This is primarily for testing purposes to see if 
> it functions correctly for others...

There is an automated build farm for Lustre, with patches submitted via Gerrit 
(https://wiki.lustre.org/Using_Gerrit), and while there aren't Debian 
build/test nodes to test the correctness of changes for Debian, this will at 
least avoid regressions with Ubuntu or, though unlikely, with RHEL.  My 
assumption would be that any Debian-specific changes would continue to work 
with Ubuntu?

Depending on how actively you want to build/test Lustre, it is also possible to 
configure your CI system to asynchronously follow patches under development in 
Gerrit to provide build and/or test feedback on the patches before they land.  
The contrib/scripts/gerrit_checkpatch.py script follows new patch submissions 
and runs code style reviews on patches after they are pushed, and posts 
comments back to Gerrit.

A step beyond this, once your CI system is working reliably, it is possible to 
also post review comments directly into the patches in Gerrit, as the Gerrit 
Janitor does (https://github.com/verygreen/lustretester/).  The 
gerrit_build-and-test-new.py script is derived from gerrit_checkpatch.py, but 
implements a more complex set of operations on each patch - static code 
analysis, build, test.  It will add both general patch comments on 
success/failure of the build/test, or comments on specific lines in the patch.  
In some cases, Gerrit Janitor will also give negative code reviews in cases 
when a newly-added regression test added by a patch is failing regularly in its 
testing.

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud


___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org




Re: [lustre-discuss] Cannot mount MDT after upgrading from Lustre 2.12.6 to 2.15.3

2023-09-28 Thread Andreas Dilger via lustre-discuss
768/t0(0) 
o251->MGC192.168.32.240@o2ib@0@lo:26/25 lens 224/224 e 0 to 1 dl 1695492254 ref 
2 fl Rpc:XNQr/0/ rc 0/-1 job:''
[11672.925905] Lustre: server umount chome-MDT complete
[11672.926036] LustreError: 3440:0:(super25.c:183:lustre_fill_super()) llite: 
Unable to mount : rc = -2
[11872.893970] LDISKFS-fs (md0): mounted filesystem with ordered data mode. 
Opts: (null)


Could anyone help to solve this problem ? Sorry that it is really urgent.

Thank you very much.

T.H.Hsieh
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] No port 988?

2023-09-26 Thread Andreas Dilger via lustre-discuss
On Sep 26, 2023, at 06:12, Jan Andersen <j...@comind.io> wrote:

Hi,

I've built and installed lustre on two VirtualBoxes running Rocky 8.8 and 
formatted one as the MGS/MDS and the other as OSS, following a presentation 
from Oak Ridge National Laboratory: "Creating a Lustre Test System from Source 
with Virtual Machines" (sorry, no link; it was a while ago I downloaded them).

There are a number of such resources linked from the https://wiki.lustre.org/ 
front page.

I can mount the filesystems on the MDS, but when I try from the OSS, it just 
times out - from dmesg:

[root@oss1 log]# dmesg | grep -i lustre
[  564.028680] Lustre: Lustre: Build Version: 2.15.58_42_ga54a206
[  625.567672] LustreError: 15f-b: lustre-OST: cannot register this server 
with the MGS: rc = -110. Is the MGS running?
[  625.567767] LustreError: 1789:0:(tgt_mount.c:2216:server_fill_super()) 
Unable to start targets: -110
[  625.567851] LustreError: 1789:0:(tgt_mount.c:1752:server_put_super()) no obd 
lustre-OST
[  625.567894] LustreError: 1789:0:(tgt_mount.c:132:server_deregister_mount()) 
lustre-OST not registered
[  625.588244] Lustre: server umount lustre-OST complete
[  625.588251] LustreError: 1789:0:(tgt_mount.c:2365:lustre_tgt_fill_super()) 
Unable to mount  (-110)

Both 'nmap' and 'netstat -nap' show that there is nothing listening on port 988:

[root@mds ~]# netstat -nap | grep -i listen
tcp    0  0 0.0.0.0:111  0.0.0.0:*  LISTEN  1/systemd
tcp    0  0 0.0.0.0:22   0.0.0.0:*  LISTEN  806/sshd
tcp6   0  0 :::111       :::*       LISTEN  1/systemd
tcp6   0  0 :::22        :::*       LISTEN  806/sshd

What should be listening on 988?

The MGS should be listening on port 988, running on the "mgsnode" that was 
specified at format time for the OSTs and MDTs.

It is possible to have the MGS and MDS share the same storage device for simple 
configurations, but in production they are usually running on separate devices 
so they can be started/stopped independently, even if they are running on the 
same server.
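
To confirm the MGS side is actually up, something like this should work (the 
NID below is a placeholder):

mgs# modprobe lustre
mgs# lctl network up
mgs# lctl list_nids
192.168.56.10@tcp
oss1# lctl ping 192.168.56.10@tcp

Note that port 988 is opened by the LNet kernel module rather than a 
user-space daemon, so nothing will be listening until the modules are loaded 
and the network is started.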

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] [BULK] Re: [EXTERNAL] Re: Data recovery with lost MDT data

2023-09-25 Thread Andreas Dilger via lustre-discuss
Probably using "stat" on each file is slow, since this is getting the file size 
from each OST object. You could try the "xstat" utility in the lustre-tests RPM 
(or build it directly) as it will only query the MDS for the requested 
attributes (owner at minimum).

Then you could split into per-date directories in a separate phase, if needed, 
run in parallel.

I can't suggest anything about the 13M entry directory, but it _should_ be much 
faster than 1 file per 30s even at that size. I suspect that the script is 
still doing something bad, since shell and GNU utilities are terrible for doing 
extra stat/cd/etc five times on each file that is accessed, renamed, etc.

You would be better off to use "find -print" to generate the pathnames and then 
operate on those, maybe with "xargs -P" and/or run multiple scripts in parallel 
on chunks of the file?
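
For example, a sketch of that two-pass approach (paths as in your script, 
parallelism is a placeholder):

# cd /scratch-lustre/.lustre/lost+found/MDT
# find . -maxdepth 1 -type f -print0 | \
      xargs -0 -P 8 stat --format "%U %y %n" > /tmp/lostfound.list
# then mkdir the per-user/date directories once, and feed batches of
# pathnames from the list to "mv" via xargs -P in a second pass

This stats the files in parallel batches instead of forking one stat and one 
mv per file.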

Cheers, Andreas

On Sep 25, 2023, at 17:34, Vicker, Darby J. (JSC-EG111)[Jacobs Technology, 
Inc.]  wrote:


I’ve been attempting to move these lost+found files into subdirectories by user 
and date but I’m running into issues.

My initial attempt was to loop through each file in .lustre/lost+found, stat 
the file and then move it to a subdirectory.  However, both the stat and the 
move command are each taking about 30 seconds.  With 13 million files, this 
isn’t going to work as that would be about 25 years to organize the files. :)  
If it makes a difference, the bash script is simple:



source=/scratch-lustre/.lustre/lost+found/MDT
dest=/scratch-lustre/work

# Cd to the source
cd "$source"

# Loop through files
for file in * ; do
    echo "file: $file"

    echo "   stat-ing file"
    time read -r user date time <<< "$( stat --format "%U %y" "$file" )"
    echo "   user='$user' date='$date' time='$time'"

    # Build the new destination in the user's directory
    newdir=$dest/$user/lost+found/$date
    echo "   newdir=$newdir"

    echo "   checking/making directory"
    if [ ! -d "$newdir" ] ; then
        time mkdir -p "$newdir"
        time chown "${user}:" "$newdir"
    fi

    echo "   moving file"
    time mv "$file" "$newdir"
done


I’m pretty sure the time to operate on these files is due to the very large 
number of files in a single directory.  But confirmation of that would be good. 
 I say that because I know too many files in a single directory can cause 
filesystem performance issues.  But also, the few files I have moved out of 
lost+found, I can stat and otherwise operate on very quickly.

My next attempt was to move each file into pre-created subdirectories (this 
eliminates the stat).  But again, this was serial and each move is 
(unsurprisingly) taking 30 seconds.  “Only” 12.5 years to move the files.  :)

I’m currently attempting to speed this up by getting the entire file list (all 
13e6 files) and moving groups of (10,000) files into a subdirectory at once 
(i.e. mv file1 file2 … fileN subdir1).  I’m hoping a group move is faster and 
more efficient than moving each file individually.  Seems like it should be.  
That attempt is running now and I’m waiting for the first command to complete 
so I don’t know if this is faster yet or not.  (More accurately, I’ve written 
this script in perl and I think the output is buffered, so I'm waiting for the 
first output.)

Other suggestions welcome if you have ideas how to move these files into 
subdirectories more efficiently.


From: lustre-discuss  on behalf of 
"Vicker, Darby J. (JSC-EG111)[Jacobs Technology, Inc.] via lustre-discuss" 

Reply-To: "Vicker, Darby J. (JSC-EG111)[Jacobs Technology, Inc.]" 

Date: Monday, September 25, 2023 at 8:56 AM
To: Andreas Dilger 
Cc: "lustre-discuss@lists.lustre.org" 
Subject: Re: [lustre-discuss] [BULK] Re: [EXTERNAL] Re: Data recovery with lost 
MDT data

Our lfsck finished.  It repaired a lot and we have over 13 million files in 
lost+found to go through.  I'll be writing a script to move these to somewhere 
accessible by the users and grouped by owner and probably date too (trying not 
to get too many files in a single directory).  Thanks again for the help with 
this.


For the benefit of others, this is how we started our lfsck:

[root@hpfs-fsl-mds1 hpfs3-eg3]# lctl set_param printk=+lfsck
[root@hpfs-fsl-mds1 hpfs3-eg3]# lctl lfsck_start -M scratch-MDT0000 -o
Started LFSCK on the device scratch-MDT0000: scrub layout namespace
[root@hpfs-fsl-mds1 hpfs3-eg3]#



It took most of the weekend to run.  Here are the results.



[root@hpfs-fsl-mds1 ~]# lctl lfsck_query -M scratch-MDT0000

layout_mdts_init: 0

layout_mdts_scanning-phase1: 0

layout_mdts_scanning-phase2: 0

layout_mdts_completed: 1

layout_mdts_failed: 0

layout_mdts_stopped: 0

layout_mdts_paused: 0

layout_mdts_crashed: 0

layout_mdts_partial: 0

layout_mdts_co-failed: 0

layout_mdts_co-stopped: 0

layout_mdts_co-paused: 0

layout_mdts_unknown: 0

layout_o

Re: [lustre-discuss] [EXTERNAL EMAIL] Re: Lustre 2.15.3: patching the kernel fails

2023-09-22 Thread Andreas Dilger via lustre-discuss
On Sep 22, 2023, at 01:45, Jan Andersen <j...@comind.io> wrote:

Hi Andreas,

Thank you for your insightful reply. I didn't know Rocky; I see there's a 
version 9 as well - is ver 8 better, since it is more mature?

There is an el9.2 ldiskfs series that would likely also apply to the Rocky9.2 
kernel of the same version.  We are currently using el8.8 servers in production 
and I'm not sure how many people are using 9.2 yet.  On the client side, 
Debian/Ubuntu are widely used.

You mention zfs, which I really liked when I worked on Solaris, but when I 
tried it on Linux it seemed to perform poorly, but that was in Ubuntu; is it 
better in Redhat et al.?

I would think Ubuntu/Debian is working with ZFS better (and may even have ZFS 
.deb packages available in the distro, which RHEL will likely never have).  
It's true that ZFS performance is worse than ldiskfs, but it can be easier to 
use.  That is up to you.

Cheers, Andreas


/jan

On 21/09/2023 18:40, Andreas Dilger wrote:
The first question to ask is: what is your end goal?  If you just want to build 
only a client that mounts an existing server, then you can disable the 
server functionality:
./configure --disable-server
and it should build fine.
If you want to also build a server, and *really* want it to run Debian instead 
of eg. Rocky 8, then you could disable ldiskfs and use ZFS:
./configure --disable-ldiskfs
You need to have installed ZFS first (either pre-packaged or built yourself), 
but it is less kernel-specific than ldiskfs.
Cheers, Andreas
On Sep 21, 2023, at 10:35, Jan Andersen <j...@comind.io> wrote:

My system: Debian 11, kernel version 5.10.0-13-amd64; I have the following 
source code:

# ll /usr/src/
total 117916
drwxr-xr-x  2 root root  4096 Aug 21 09:19 linux-config-5.10/
drwxr-xr-x  4 root root  4096 Jul 25  2022 linux-headers-5.10.0-12-amd64/
drwxr-xr-x  4 root root  4096 Jul 25  2022 linux-headers-5.10.0-12-common/
drwxr-xr-x  4 root root  4096 Jul 25  2022 linux-headers-5.10.0-13-amd64/
drwxr-xr-x  4 root root  4096 Jul 25  2022 linux-headers-5.10.0-13-common/
drwxr-xr-x  4 root root  4096 Aug 11 09:59 linux-headers-5.10.0-24-amd64/
drwxr-xr-x  4 root root  4096 Aug 11 09:59 linux-headers-5.10.0-24-common/
drwxr-xr-x  4 root root  4096 Aug 21 09:19 linux-headers-5.10.0-25-amd64/
drwxr-xr-x  4 root root  4096 Aug 21 09:19 linux-headers-5.10.0-25-common/
lrwxrwxrwx  1 root root24 Jun 30  2022 linux-kbuild-5.10 -> 
../lib/linux-kbuild-5.10
-rw-r--r--  1 root root161868 Aug 16 21:52 linux-patch-5.10-rt.patch.xz
drwxr-xr-x 25 root root  4096 Jul 14 21:24 linux-source-5.10/
-rw-r--r--  1 root root 120529768 Aug 16 21:52 linux-source-5.10.tar.xz
drwxr-xr-x  2 root root  4096 Jan 30  2023 percona-server/
lrwxrwxrwx  1 root root28 Jul 29  2022 vboxhost-6.1.36 -> 
/opt/VirtualBox/src/vboxhost
lrwxrwxrwx  1 root root32 Apr 17 19:32 vboxhost-7.0.8 -> 
../share/virtualbox/src/vboxhost


I have downloaded the source code of lustre 2.15.3:

# git clone git://git.whamcloud.com/fs/lustre-release.git
# cd lustre-release
# git checkout 2.15.3

- and I'm trying to build it, following https://wiki.lustre.org/Compiling_Lustre

I've got through 'autogen.sh' and 'configure' and most of 'make debs', but when 
it comes to patching:

cd linux-stage && quilt push -a -q
Applying patch patches/rhel8/ext4-inode-version.patch
Applying patch patches/linux-5.4/ext4-lookup-dotdot.patch
Applying patch patches/suse15/ext4-print-inum-in-htree-warning.patch
Applying patch patches/linux-5.8/ext4-prealloc.patch
Applying patch patches/ubuntu18/ext4-osd-iop-common.patch
Applying patch patches/linux-5.10/ext4-misc.patch
1 out of 4 hunks FAILED
Patch patches/linux-5.10/ext4-misc.patch does not apply (enforce with -f)
make[2]: *** [autoMakefile:645: sources] Error 1
make[2]: Leaving directory '/root/repos/lustre-release/ldiskfs'
make[1]: *** [autoMakefile:652: all-recursive] Error 1
make[1]: Leaving directory '/root/repos/lustre-release'
make: *** [autoMakefile:524: all] Error 2


My best guess is that it is because the running kernel version doesn't exactly 
match the kernel source tree, but I can't seem to find that version. Am I right 
- and if so, where would I go to download the right kernel tree?

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] [EXTERNAL] Re: Data recovery with lost MDT data

2023-09-22 Thread Andreas Dilger via lustre-discuss
On Sep 21, 2023, at 16:06, Vicker, Darby J. (JSC-EG111)[Jacobs Technology, 
Inc.] <darby.vicke...@nasa.gov> wrote:

I knew an lfsck would identify the orphaned objects.  That’s great that it will 
move those objects to an area we can triage.  With ownership still intact (and 
I assume time stamps too), I think this will be helpful for at least some of 
the users to recover some of their data.  Thanks Andreas.

I do have another question.  Even with the MDT loss, the top level user 
directories on the file system are still showing current modification times.  I 
was a little surprised to see this – my expectation was that the most current 
time would be from the snapshot that we accidentally reverted to, 6/20/2023 in 
this case.  Does this make sense?

The timestamps of the directories are only stored on the MDT (unlike regular 
files, which keep timestamps on both the MDT and OST).  Is it possible 
that users (or possibly recovered clients with existing mountpoints) have 
started to access the filesystem in the past few days since it was recovered, 
or an admin was doing something that would have caused the directories to be 
modified?


Is it possible you have a newer copy of the MDT than you thought?

[dvicker@dvicker ~]$ ls -lrt /ephemeral/ | tail
   4 drwx------  2 abjuarez abjuarez 4096 Sep 12 13:24 abjuarez/
   4 drwxr-x---  2 ksmith29 ksmith29 4096 Sep 13 15:37 ksmith29/
   4 drwxr-xr-x 55 bjjohn10 bjjohn10 4096 Sep 13 16:36 bjjohn10/
   4 drwxrwx---  3 cbrownsc ccp_fast 4096 Sep 14 12:27 cbrownsc/
   4 drwx------  3 fgholiza fgholiza 4096 Sep 18 06:41 fgholiza/
   4 drwx------  5 mtfoste2 mtfoste2 4096 Sep 19 11:35 mtfoste2/
   4 drwx------  4 abenini  abenini  4096 Sep 19 15:33 abenini/
   4 drwx------  9 pdetremp pdetremp 4096 Sep 19 16:49 pdetremp/
[dvicker@dvicker ~]$



From: Andreas Dilger <adil...@whamcloud.com>
Date: Thursday, September 21, 2023 at 2:33 PM
To: "Vicker, Darby J. (JSC-EG111)[Jacobs Technology, Inc.]" 
<darby.vicke...@nasa.gov>
Cc: "lustre-discuss@lists.lustre.org" <lustre-discuss@lists.lustre.org>
Subject: [EXTERNAL] Re: [lustre-discuss] Data recovery with lost MDT data



In the absence of backups, you could try LFSCK to link all of the orphan OST 
objects into .lustre/lost+found (see lctl-lfsck_start.8 man page for details).

The data is still in the objects, and they should have UID/GID/PRJID assigned 
(if used) but they have no filenames.  It would be up to you to make e.g. 
per-user lost+found directories in their home directories and move the files 
where they could access them and see if they want to keep or delete the files.

How easy/hard this is to do depends on whether the files have any content that 
can help identify them.

There was a Lustre hackathon project to save the Lustre JobID in a "user.job" 
xattr on every object, exactly to help identify the provenance of files after 
the fact (regardless of whether there is corruption), but it only just landed 
to master and will be in 2.16. That is cold comfort, but would help in the 
future.
Cheers, Andreas


On Sep 20, 2023, at 15:34, Vicker, Darby J. (JSC-EG111)[Jacobs Technology, 
Inc.] via lustre-discuss 
<lustre-discuss@lists.lustre.org> wrote:
Hello,

We have recently accidentally deleted some of our MDT data.  I think its gone 
for good but looking for advice to see if there is any way to recover.  
Thoughts appreciated.

We run two LFS’s on the same set of hardware.  We didn’t set out to do this, 
but it kind of evolved.  The original setup was only a single filesystem and 
was all ZFS – MDT and OST’s.  Eventually, we had some small file workflows that 
we wanted to get better performance on.  To address this, we stood up another 
filesystem on the same hardware and used an ldiskfs MDT.  However, since we were 
already using ZFS, under the hood the storage device we built the ldiskfs MDT on 
comes from ZFS.  That gets presented to the OS as /dev/zd0.  We do a nightly 
backup of the MDT by cloning the ZFS dataset (this creates /dev/zd16, for 
whatever reason), snapshot the clone, mount that as ldiskfs, tar up the data 
and then destroy the snapshot and clone.  Well, occasionally this process gets 
interrupted, leaving the ZFS snapshot and clone hanging around.  This is where 
things go south.  Something happens that swaps the clone with the primary 
dataset.  ZFS says you're working with the primary but it's really the clone, 
and via

Re: [lustre-discuss] Data recovery with lost MDT data

2023-09-21 Thread Andreas Dilger via lustre-discuss
In the absence of backups, you could try LFSCK to link all of the orphan OST 
objects into .lustre/lost+found (see lctl-lfsck_start.8 man page for details).

The data is still in the objects, and they should have UID/GID/PRJID assigned 
(if used) but they have no filenames.  It would be up to you to make e.g. 
per-user lost+found directories in their home directories and move the files 
where they could access them and see if they want to keep or delete the files.

How easy/hard this is to do depends on whether the files have any content that 
can help identify them.

There was a Lustre hackathon project to save the Lustre JobID in a "user.job" 
xattr on every object, exactly to help identify the provenance of files after 
the fact (regardless of whether there is corruption), but it only just landed 
to master and will be in 2.16. That is cold comfort, but would help in the 
future.

Cheers, Andreas

On Sep 20, 2023, at 15:34, Vicker, Darby J. (JSC-EG111)[Jacobs Technology, 
Inc.] via lustre-discuss  wrote:


Hello,

We have recently accidentally deleted some of our MDT data.  I think its gone 
for good but looking for advice to see if there is any way to recover.  
Thoughts appreciated.

We run two LFS’s on the same set of hardware.  We didn’t set out to do this, 
but it kind of evolved.  The original setup was only a single filesystem and 
was all ZFS – MDT and OST’s.  Eventually, we had some small file workflows that 
we wanted to get better performance on.  To address this, we stood up another 
filesystem on the same hardware and used an ldiskfs MDT.  However, since we were 
already using ZFS, under the hood the storage device we built the ldiskfs MDT on 
comes from ZFS.  That gets presented to the OS as /dev/zd0.  We do a nightly 
backup of the MDT by cloning the ZFS dataset (this creates /dev/zd16, for 
whatever reason), snapshot the clone, mount that as ldiskfs, tar up the data 
and then destroy the snapshot and clone.  Well, occasionally this process gets 
interrupted, leaving the ZFS snapshot and clone hanging around.  This is where 
things go south.  Something happens that swaps the clone with the primary 
dataset.  ZFS says you're working with the primary but it's really the clone, 
and vice versa.  This happened about a year ago and we caught it, were able to 
“zfs promote” to swap them back and move on.  More details on the ZFS and this 
mailing list here.

https://zfsonlinux.topicbox.com/groups/zfs-discuss/Tcb8a3ef663db0031-M5a79e71768b20b2389efc4a4

http://lists.lustre.org/pipermail/lustre-discuss-lustre.org/2022-June/018154.html

It happened again earlier this week but we didn’t remember to check this and, 
in an effort to get the backups going again, destroyed what we thought were the 
snapshot and clone.  In reality, we destroyed the primary dataset.  Even more 
unfortunately, the stale “snapshot” was about 3 months old.  This stale 
snapshot was also preventing our MDT backups from running so we don’t have 
those to restore from either.  (I know, we need better monitoring and alerting 
on this, we learned that lesson the hard way.  We had it in place after the 
June 2022 incident, it just wasn’t working properly.)  So at the end of the 
day, the data lives on the OST’s we just can’t access it due to the lost 
metadata.  Is there any chance of data recovery?  I don't think so, but I want 
to explore any options.

Darby

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] File size discrepancy on lustre

2023-09-15 Thread Andreas Dilger via lustre-discuss
Are you using any file mirroring (FLR, "lfs mirror extend") on the files, 
perhaps before the "lfs getstripe" was run?
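
One hedged way to check from the client (path is a placeholder):

$ lfs getstripe -v /path/to/file | grep -i mirror

A mirrored (FLR) file shows a composite layout with a mirror count above 1, 
and "du" would then count the blocks of every mirror while the apparent size 
stays the same.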

On Sep 15, 2023, at 08:12, Kurt Strosahl via lustre-discuss 
<lustre-discuss@lists.lustre.org> wrote:

Good Morning,

We have encountered a very odd issue, where files are being created that 
show as double the size under du compared to ls or du --apparent-size.

under ls we see 119G
~> ls -lh \
> szscl21_24_128_b1p50_t_x4p300_um0p0840_sm0p0743_n1p265.genprop.n162.strange.t_0_22_26_28_31.sdb3160b
-rw-rw-r-- 1 edwards lattice 119G Sep 14 21:48 
szscl21_24_128_b1p50_t_x4p300_um0p0840_sm0p0743_n1p265.genprop.n162.strange.t_0_22_26_28_31.sdb3160b

which du --apparent-size agrees with
~> du -h --apparent-size \
> szscl21_24_128_b1p50_t_x4p300_um0p0840_sm0p0743_n1p265.genprop.n162.strange.t_0_22_26_28_31.sdb3160b
119G
szscl21_24_128_b1p50_t_x4p300_um0p0840_sm0p0743_n1p265.genprop.n162.strange.t_0_22_26_28_31.sdb3160b
under du we see 273G

However, du itself shows more than double (so we are beyond "padding out a 
block" size).
~> du -h \
> szscl21_24_128_b1p50_t_x4p300_um0p0840_sm0p0743_n1p265.genprop.n162.strange.t_0_22_26_28_31.sdb3160b
273G
szscl21_24_128_b1p50_t_x4p300_um0p0840_sm0p0743_n1p265.genprop.n162.strange.t_0_22_26_28_31.sdb3160b

There is nothing unusual going on via the file layout according to lfs 
getstripe:
~> lfs getstripe \
> szscl21_24_128_b1p50_t_x4p300_um0p0840_sm0p0743_n1p265.genprop.n162.strange.t_0_22_26_28_31.sdb3160b
szscl21_24_128_b1p50_t_x4p300_um0p0840_sm0p0743_n1p265.genprop.n162.strange.t_0_22_26_28_31.sdb3160b
lmm_stripe_count:  1
lmm_stripe_size:   1048576
lmm_pattern:   raid0
lmm_layout_gen:0
lmm_stripe_offset: 0
lmm_pool:  production
obdidx   objid    objid    group
     0 7431775 0x71665f        0

Client is running:
lustre-client-2.12.6-1.el7.centos.x86_64

lustre servers are:
lustre-osd-zfs-mount-2.12.9-1.el7.x86_64
kmod-lustre-osd-zfs-2.12.9-1.el7.x86_64
kernel-3.10.0-1127.8.2.el7_lustre.x86_64
lustre-2.12.9-1.el7.x86_64
kernel-devel-3.10.0-1127.8.2.el7_lustre.x86_64
kmod-lustre-2.12.9-1.el7.x86_64
kmod-zfs-0.7.13-1.el7.jlab.x86_64
libzfs2-0.7.13-1.el7.x86_64
zfs-0.7.13-1.el7.x86_64

w/r,
Kurt J. Strosahl (he/him)
System Administrator: Lustre, HPC
Scientific Computing Group, Thomas Jefferson National Accelerator Facility
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Getting started with Lustre on RHEL 8.8

2023-09-12 Thread Andreas Dilger via lustre-discuss
On Sep 12, 2023, at 22:31, Cyberxstudio cxs 
<cyberxstudio.cl...@gmail.com> wrote:

Hi I get this error while installing lustre and other packages

[root@localhost ~]# yum --nogpgcheck --enablerepo=lustre-server install \
> kmod-lustre-osd-ldiskfs \
> lustre-dkms \
> lustre-osd-ldiskfs-mount \
> lustre-osd-zfs-mount \
> lustre \
> lustre-resource-agents \
> zfs
Updating Subscription Management repositories.
Last metadata expiration check: 0:00:58 ago on Wed 13 Sep 2023 09:27:59 AM PKT.
Error:
 Problem: conflicting requests
  - nothing provides resource-agents needed by 
lustre-resource-agents-2.15.3-1.el8.x86_64
(try to add '--skip-broken' to skip uninstallable packages or '--nobest' to use 
not only best candidate packages)

You don't need this package to start. It is used for HA failover of storage 
between servers with Corosync/Pacemaker.

You also do not need the "lustre-dkms" package - that is for building Lustre 
clients from scratch.

You also only need one of ldiskfs or ZFS.  If you don't have RAID storage, then 
ZFS is probably more useful, 
while ldiskfs is more of a "traditional" filesystem (based on ext4).
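
So a minimal ldiskfs-only server install from your repo would look something 
like:

# yum --nogpgcheck --enablerepo=lustre-server install \
      lustre kmod-lustre-osd-ldiskfs lustre-osd-ldiskfs-mount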

Cheers, Andreas

[root@localhost ~]# dnf install resource-agents
Updating Subscription Management repositories.
Last metadata expiration check: 0:02:01 ago on Wed 13 Sep 2023 09:27:59 AM PKT.
No match for argument: resource-agents
Error: Unable to find a match: resource-agents
[root@localhost ~]#

On Wed, Sep 13, 2023 at 9:10 AM Cyberxstudio cxs 
<cyberxstudio.cl...@gmail.com> wrote:
Thank you for the information.

On Tue, Sep 12, 2023 at 8:40 PM Andreas Dilger 
<adil...@whamcloud.com> wrote:
Hello,
The preferred path to set up Lustre depends on what you are planning to do with 
it?  If for regular usage it is easiest to start with RPMs built for the distro 
from 
https://downloads.whamcloud.com/public/lustre/latest-release/<https://downloads.whamcloud.com/public/lustre/latest-release/el8.8/server/RPMS/x86_64/>
 (you can also use the server RPMs for a client if you want).

The various "client" packages for RHEL, SLES, Ubuntu can install directly on 
the vendor kernels, but the provided server RPMs also need the matching kernel. 
 You only need one of the ldiskfs (ext4) or ZFS packages, not both.

It isn't *necessary* to build/patch your kernel for the server, though the 
pre-built server download packages have patched the kernel to add integrated 
T10-PI support (which many users do not need).  You can get unpatched el8 
server RPMs directly from the builders:
https://build.whamcloud.com/job/lustre-b2_15-patchless/48/arch=x86_64,build_type=server,distro=el8.7,ib_stack=inkernel/artifact/artifacts/

If you plan to run on non-standard kernels, then you can build RPMs for your 
particular kernel. The easiest way is to just rebuild the SRPM package:
 https://wiki.whamcloud.com/pages/viewpage.action?pageId=8556211

If you want to do Lustre development you should learn how to build from a Git 
checkout:
https://wiki.whamcloud.com/display/PUB/Building+Lustre+from+Source

Cheers, Andreas

On Sep 12, 2023, at 03:25, Cyberxstudio cxs via lustre-discuss 
<lustre-discuss@lists.lustre.org> wrote:


Hi, I am setting up a lab environment for lustre. I have 3 VMs of RHEL 8.8, I 
have studied the documentation, but it does not provide detail for el8, only 
el7. Please guide me on how to start.
Thank You
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Getting started with Lustre on RHEL 8.8

2023-09-12 Thread Andreas Dilger via lustre-discuss
Hello,
The preferred path to set up Lustre depends on what you are planning to do with 
it?  If for regular usage it is easiest to start with RPMs built for the distro 
from 
https://downloads.whamcloud.com/public/lustre/latest-release/
 (you can also use the server RPMs for a client if you want).

The various "client" packages for RHEL, SLES, Ubuntu can install directly on 
the vendor kernels, but the provided server RPMs also need the matching kernel. 
 You only need one of the ldiskfs (ext4) or ZFS packages, not both.

It isn't *necessary* to build/patch your kernel for the server, though the 
pre-built server download packages have patched the kernel to add integrated 
T10-PI support (which many users do not need).  You can get unpatched el8 
server RPMs directly from the builders:
https://build.whamcloud.com/job/lustre-b2_15-patchless/48/arch=x86_64,build_type=server,distro=el8.7,ib_stack=inkernel/artifact/artifacts/

If you plan to run on non-standard kernels, then you can build RPMs for your 
particular kernel. The easiest way is to just rebuild the SRPM package:
 https://wiki.whamcloud.com/pages/viewpage.action?pageId=8556211

If you want to do Lustre development you should learn how to build from a Git 
checkout:
https://wiki.whamcloud.com/display/PUB/Building+Lustre+from+Source

Cheers, Andreas

On Sep 12, 2023, at 03:25, Cyberxstudio cxs via lustre-discuss 
 wrote:


Hi, I am setting up a lab environment for lustre. I have 3 VMs of RHEL 8.8, I 
have studied the documentation, but it does not provide detail for el8, only 
el7. Please guide me on how to start.
Thank You
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] questions about group locks / LDLM_FL_NO_TIMEOUT flag

2023-08-30 Thread Andreas Dilger via lustre-discuss
You can't directly dump the holders of a particular lock, but it is possible to 
dump the list of FIDs that each client has open. 

  mds# lctl get_param mdt.*.exports.*.open_files | egrep "=|FID" | grep -B1 FID

That should list all client NIDs that have the FID open. 
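
If you need to map the FID back to a pathname, "lfs fid2path" on any client 
should do it (the FID below is a placeholder):

client# lfs fid2path /mnt/lustre [0x200000401:0x1:0x0]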

It shouldn't be possible for clients to "leak" a group lock, since they are 
tied to an open file handle and are dropped as soon as the file is closed, or 
by the kernel when it closes the open fds when the process is killed.

Cheers, Andreas

> On Aug 30, 2023, at 07:42, Bertschinger, Thomas Andrew Hjorth via 
> lustre-discuss  wrote:
> 
> Hello, 
> 
> We have a few files created by a particular application where reads to those 
> files consistently hang. The debug log on a client attempting a read() has 
> messages like:
> 
>> ldlm_completion_ast(): waiting indefinitely because of NO_TIMEOUT ...
> 
> This is printed when the flag LDLM_FL_NO_TIMEOUT is true, and code comments 
> above that flag imply that it is set for group locks. So, we've been trying 
> to identify if the application in question uses group locks. (I have reached 
> out to the app's developers but do not have a response yet.)
> 
> If I open the file with O_NONBLOCK, any reads immediately return with error 
> 11 / EWOULDBLOCK. This behavior is documented to occur for Lustre group locks.
> 
> However, I would like to clarify whether the LDLM_FL_NO_TIMEOUT flag is true 
> *only* when a group lock is held, or are there other circumstances where the 
> behavior described above could occur?
> 
> If this is caused by a group lock is there an easy way to tell from server 
> side logs or data what client(s) have the group lock and are blocking access? 
> The motivation is that we believe any jobs accessing these files have long 
> since been killed, and no nodes from the job are expected to be holding the 
> files open. We would like to confirm or rule out that possibility by easily 
> identifying any such clients.
> 
> Advice on how to effectively debug ldlm issues could be useful beyond just 
> this issue. In general, if there is a reliable way to start from a log entry 
> for a lock like 
> 
>> ... ns: lustre-OST-osc-9a0942c79800 lock: 
>> 3f3a5950/0xe54ca8d2d7b66d03 lrc: 4/1,0 mode: --/PR  ...
> 
> and get information about the client(s) holding that lock and any contending 
> locks, that would be helpful in debugging situations like this.
> 
> server: 2.15.2
> client that application ran on: 2.15.0.4_rc2_cray_172_ge66844d
> client that I tested file access from: 2.15.2
> 
> Thanks!
> 
> - Thomas Bertschinger
> ___
> lustre-discuss mailing list
> lustre-discuss@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] question about rename operation ?

2023-08-16 Thread Andreas Dilger via lustre-discuss
For any directory rename where it is not just a simple name change (i.e. the 
parent directory is not the same for both source and target), the MDS thread 
doing the rename will take the LDLM "big filesystem lock" (BFL), which is a 
specific FID used for global rename serialization.

This ensures that there is only one thread in the whole filesystem doing a 
rename that may create directory loops, and the parent/child relationship is 
checked under this lock to ensure there are no loops.

For regular file renames, and directory renames within a single parent, it is 
possible to do parallel renames, and the MDS only locks the parent, source, 
and target FIDs to avoid multiple threads modifying the same file or directory 
at once.

The client will also take the VFS rename lock before sending the rename RPC, 
which serializes the changes on the client but does not help anything for the 
rest of the filesystem.  This unfortunately also serializes regular renames on 
a single client, but they can still be done in parallel on multiple clients.

Cheers, Andreas

On Aug 15, 2023, at 20:14, 宋慕晗 via lustre-discuss 
 wrote:


Dear lustre maintainers,
There seems to be a bug in lustre *ll_rename* function:
/* VFS has locked the inodes before calling this */
ll_set_inode_lock_owner(src);
ll_set_inode_lock_owner(tgt);
if (tgt_dchild->d_inode)
ll_set_inode_lock_owner(tgt_dchild->d_inode);

Here we lock the src directory and the target directory, and lock the target 
child if it exists. But we don't lock the src child, even though it is possible 
to change the ".." pointer of the src child.
see this in xfs: https://www.spinics.net/lists/linux-xfs/msg68693.html

And I am also wondering how lustre deals with concurrent renames? Specifically, 
my concern revolves around the potential for directory loops when two clients 
initiate renames simultaneously.
In the VFS, there's a filesystem-specific vfs_rename_mutex that serializes the 
rename operation. In Ceph, I noticed the presence of a global client lock. 
However, I'm uncertain if the MDS serializes rename requests.
Consider the following scenario:

        a
       / \
      b   c
     / \
    d   e
       / \
      f   g

If Client 1 attempts to rename "c" to "f" while Client 2 tries to rename "b" to 
"g" concurrently, and both succeed, we could end up with a loop in the 
directory structure.
Could you please provide clarity on how lustre handles such situations? Your 
insights would be invaluable.
Thank you in advance for your time and assistance.
Warm regards,
Muhan Song

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] getting without inodes

2023-08-11 Thread Andreas Dilger via lustre-discuss
The t0 filesystem OSTs are formatted for an average file size of 70TB / 300M 
inodes = 240KB/inode.

The t1 filesystem OSTs are formatted for an average file size of 500TB / 65M 
inodes = 7.7MB/inode.

So not only are the t1 OSTs larger, but they have fewer inodes (by a factor of 
32x). This must have been done with specific formatting options since the 
default inode ratio is 1MiB/inode for the OSTs.

There isn't any information about the actual space usage (eg. "lfs df"), so I 
can't calculate whether the default 1MiB/inode would be appropriate for your 
filesystem, but definitely it was formatted with the expectation that the 
average file size would become larger as they were copied to t1 (eg. combined 
in a tarfile or something).

Unfortunately, there is no way to "fix" this in place, since the inode ratio 
for ldiskfs/ext4 filesystems is decided at format time.

One option is to use "lfs find" to find files on an OST (eg. OST0003 which is 
the least used), disable creates on that OST, and use "lfs migrate" to migrate 
all of the files to other OSTs, and then reformat the OST with more inodes and 
repeat this process for all of the OSTs.

Unfortunately, the t1 filesystem only has 8.5M free inodes and there are 27M 
inodes on OST0003, so it couldn't be drained completely to perform this 
process. You would need to delete enough files from t1 to free up inodes to do 
the migration, or eg. tar them up into larger files to reduce the inode count.
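
A rough sketch of that per-OST cycle (index, device, and NID are placeholders, 
and the "-i 262144" bytes-per-inode ratio is only an example that you would 
size to your real average file size):

mds# lctl set_param osp.t1-OST0003-osc-MDT0000.max_create_count=0
client# lfs find /lustre/t1 --ost t1-OST0003_UUID -type f | lfs_migrate -y
oss# umount /dev/ostdev
oss# mkfs.lustre --ost --reformat --replace --index=3 --fsname=t1 \
         --mgsnode=<mgs-nid> --mkfsoptions="-i 262144" /dev/ostdev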

The OST migration/replacement process is described in the Lustre Operations 
Manual.

Cheers, Andreas

On Aug 11, 2023, at 01:17, Carlos Adean via lustre-discuss 
 wrote:


Hello experts,

We have a Lustre with two tiers T0(SSD) and T1(HDD), the first with 70TB and 
the second one with ~500TB.

I'm experiencing a problem: T1 has far fewer inodes than T0 and is running out 
of inodes on the OSTs, so I'd like to understand the source of this and how to 
fix it.


Thanks in advance.



=== T0

$ lfs df -i /lustre/t0
UUID                 Inodes    IUsed      IFree IUse% Mounted on
t0-MDT0000_UUID   390627328  1499300  389128028   1% /lustre/t0[MDT:0]
t0-OST0000_UUID    14651392  1097442   13553950   8% /lustre/t0[OST:0]
t0-OST0001_UUID    14651392  1097492   13553900   8% /lustre/t0[OST:1]
t0-OST0002_UUID    14651392  1097331   13554061   8% /lustre/t0[OST:2]
t0-OST0003_UUID    14651392  1097563   13553829   8% /lustre/t0[OST:3]
t0-OST0004_UUID    14651392  1097576   13553816   8% /lustre/t0[OST:4]
t0-OST0005_UUID    14651392  1097505   13553887   8% /lustre/t0[OST:5]
t0-OST0006_UUID    14651392  1097524   13553868   8% /lustre/t0[OST:6]
t0-OST0007_UUID    14651392  1097596   13553796   8% /lustre/t0[OST:7]
t0-OST0008_UUID    14651392  1097442   13553950   8% /lustre/t0[OST:8]
t0-OST0009_UUID    14651392  1097563   13553829   8% /lustre/t0[OST:9]
t0-OST000a_UUID    14651392  1097515   13553877   8% /lustre/t0[OST:10]
t0-OST000b_UUID    14651392  1096524   13554868   8% /lustre/t0[OST:11]
t0-OST000c_UUID    14651392  1096608   13554784   8% /lustre/t0[OST:12]
t0-OST000d_UUID    14651392  1096524   13554868   8% /lustre/t0[OST:13]
t0-OST000e_UUID    14651392  1096641   13554751   8% /lustre/t0[OST:14]
t0-OST000f_UUID    14651392  1096647   13554745   8% /lustre/t0[OST:15]
t0-OST0010_UUID    14651392  1096705   13554687   8% /lustre/t0[OST:16]
t0-OST0011_UUID    14651392  1096616   13554776   8% /lustre/t0[OST:17]
t0-OST0012_UUID    14651392  1096520   13554872   8% /lustre/t0[OST:18]
t0-OST0013_UUID    14651392  1096598   13554794   8% /lustre/t0[OST:19]
t0-OST0014_UUID    14651392  1096669   13554723   8% /lustre/t0[OST:20]
t0-OST0015_UUID    14651392  1096570   13554822   8% /lustre/t0[OST:21]

filesystem_summary: 299694753  1499300  298195453   1% /lustre/t0


=== T1

$  lfs df -i /lustre/t1
UUID                     Inodes      IUsed       IFree IUse% Mounted on
t1-MDT0000_UUID      1478721536   56448788  1422272748    4% /lustre/t1[MDT:0]
t1-OST0000_UUID        30492032   30491899         133  100% /lustre/t1[OST:0]
t1-OST0001_UUID        30492032   30491990          42  100% /lustre/t1[OST:1]
t1-OST0002_UUID        30492032   30491916         116  100% /lustre/t1[OST:2]
t1-OST0003_UUID        30492032   27471050     3020982   91% /lustre/t1[OST:3]
t1-OST0004_UUID        30492032   30491989          43  100% /lustre/t1[OST:4]
t1-OST0005_UUID        30492032   30491960          72  100% /lustre/t1[OST:5]
t1-OST0006_UUID        30492032   30491948          84  100% /lustre/t1[OST:6]
t1-OST0007_UUID        30492032   30491939          93  100% /lustre/t1[OST:7]
t1-OST0008_UUID        30492032   29811803      680229   98% /lustre/t1[OST:8]
t1-OST0009_UUID        30492032   29808261      683771   98% /lustre/t1[OST:9]

Re: [lustre-discuss] Pool_New Naming Error

2023-08-08 Thread Andreas Dilger via lustre-discuss


On Aug 8, 2023, at 18:41, Baucum, Rashun via lustre-discuss <lustre-discuss@lists.lustre.org> wrote:

Hello,

I am running into an issue when attempting to setup pooling. The commands are 
being run on a server that hosts the MDS and MGS:

# lctl dl
  0 UP osd-ldiskfs lfs1-MDT0000-osd lfs1-MDT0000-osd_UUID 12
  1 UP osd-ldiskfs MGS-osd MGS-osd_UUID 4
  2 UP mgs MGS MGS 18
  3 UP mgc MGC10.197.183.26@tcp 6a356911-e772-c2ac-20a3-2dade59f93bb 4
  4 UP mds MDS MDS_uuid 2
  5 UP lod lfs1-MDT0000-mdtlov lfs1-MDT0000-mdtlov_UUID 3
  6 UP mdt lfs1-MDT0000 lfs1-MDT0000_UUID 24
  7 UP mdd lfs1-MDD0000 lfs1-MDD0000_UUID 3
  8 UP qmt lfs1-QMT0000 lfs1-QMT0000_UUID 3
  9 UP osp lfs1-OST0000-osc-MDT0000 lfs1-MDT0000-mdtlov_UUID 4
 10 UP osp lfs1-OST0001-osc-MDT0000 lfs1-MDT0000-mdtlov_UUID 4
 11 UP osp lfs1-OST0004-osc-MDT0000 lfs1-MDT0000-mdtlov_UUID 4
 12 UP osp lfs1-OST0005-osc-MDT0000 lfs1-MDT0000-mdtlov_UUID 4
 13 UP osp lfs1-OST0002-osc-MDT0000 lfs1-MDT0000-mdtlov_UUID 4
 14 UP osp lfs1-OST0003-osc-MDT0000 lfs1-MDT0000-mdtlov_UUID 4
 15 UP lwp lfs1-MDT0000-lwp-MDT0000 lfs1-MDT0000-lwp-MDT0000_UUID 4

# lctl pool_new lustre.pool1
error: pool_new can contain only alphanumeric characters, underscores, and 
dashes besides the required '.'
pool_new: Invalid argument

dmesg logs:
LustreError: 19441:0:(llog.c:416:llog_init_handle()) llog type is not specified!
LustreError: 19441:0:(mgs_llog.c:5513:mgs_pool_cmd()) lustre is not defined

Is this an error anyone has seen or knows a solution to?

The problem is that the pool name should be "fsname.pool_name", and your 
filesystem is named "lfs1" and not "lustre".  The
last error message above is trying to say this, but it could be more clear, 
like "filesystem name 'lustre' is not defined" or similar.  A patch to fix this 
would be welcome.

So your command should be:

lctl pool_new lfs1.pool1

though I would suggest a more descriptive name than "pool1" (e.g. "flash" or
"new_osts" or whatever), but that is really up to you.

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] how does lustre handle node failure

2023-07-22 Thread Andreas Dilger via lustre-discuss
Shawn,
Lustre handles the largest filesystems in the world, hundreds of PB in size, so 
there are definitely Lustre filesystems with hundreds of servers.

In large storage clusters the servers failover in pairs or quads, since the 
storage is typically not on a single global SAN for all nodes to access, so 
there is definitely not a single huge HA cluster for all of the servers in the 
filesystem.

Cheers, Andreas

On Jul 21, 2023, at 16:09, Shawn via lustre-discuss wrote:


Hi Laura,  thanks for your reply.
It seems the OSSs will share the disks created from a shared SAN.  So the 
OSS-pairs can failover in a pre-defined manner if one node is down, coordinated 
by a HA manager.

This can certainly work at a limited scale.  I'm curious whether this static
scheme can scale to a large cluster with hundreds of OSS servers?


regards,
Shawn




On Tue, Jul 18, 2023 at 1:25 PM Laura Hild <l...@jlab.org> wrote:
I'm not familiar with using FLR to tolerate OSS failures.  My site does the HA 
pairs with shared storage method.  It's sort of described in the manual

  https://doc.lustre.org/lustre_manual.xhtml#configuringfailover

but in more, Pacemaker-specific detail at

  
https://wiki.lustre.org/Creating_a_Framework_for_High_Availability_with_Pacemaker

and

  
https://wiki.lustre.org/Creating_Pacemaker_Resources_for_Lustre_Storage_Services

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] File system global quota

2023-07-20 Thread Andreas Dilger via lustre-discuss
Probably the closest that could be achieved like this would be to set the 
ldiskfs reserved space on the OSTs like:

  tune2fs -m 10 /dev/sdX

That sets the root reserved space to 10% of the filesystem, and non-root users 
wouldn't be able to allocate blocks once the filesystem hits 90% full. This 
doesn't reserve specific blocks, just a percentage of the total capacity, so it 
should avoid fragmentation from the filesystem getting too full. 
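
A minimal sketch, assuming ldiskfs OSTs (the device path is a placeholder, and
this needs to be repeated for every OST device):

tune2fs -m 10 /dev/sdX
tune2fs -l /dev/sdX | grep -i "reserved block count"   # verify the result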

Cheers, Andreas

> On Jul 20, 2023, at 08:11, Sebastian Oeste via lustre-discuss wrote:
> 
> Hi there,
> 
> we operate multiple Lustre file systems at our site and wonder whether it is
> possible to have a filesystem-wide global quota in Lustre?
> For example, so that only 90 percent of the file system can be used. The idea is
> to avoid situations where a file system is 99% full.
> I was looking through mount options and lfs-quota, but there are just quotas for
> users, groups, or projects.
> Is there a way to achieve something like that?
> 
> Thanks,
> sebastian
> -- 
> Sebastian Oeste, M.Sc.
> Computer Scientist
> 
> Technische Universität Dresden
> Zentrum für Informationsdienste und Hochleistungsrechnen (ZIH)
> Tel. +49 (0)351 463-32405
> E-Mail: sebastian.oe...@tu-dresden.de
> ___
> lustre-discuss mailing list
> lustre-discuss@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Old Lustre Filesystem migrate to newer servers

2023-07-19 Thread Andreas Dilger via lustre-discuss
Wow,  Lustre 1.6 is really old, released in 2009.  Even Lustre 2.6 would be 
pretty old, released in 2014.

While there haven't been a *lot* of on-disk format changes over the years, 
there was a fairly significant change in Lustre 2.0 that would probably make 
upgrading the filesystem directly to a more recent Lustre release (e.g. 2.12.9) 
difficult.  We've long since removed compatibility for such old versions from 
the code.

My recommendation would be to install a VM with an old version of RHEL and 
Lustre (e.g. RHEL5 and Lustre 1.8.9 from 
https://downloads.whamcloud.com/public/lustre/lustre-1.8.9-wc1/el5/server/RPMS/x86_64/)
 and attach the storage to the node.  You would probably need to do the 
"writeconf" process to change the server NIDs from their current IB addresses 
to TCP.  Other than that, if the node can see the storage, Lustre should be 
able to mount it.
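
A sketch of that writeconf step, assuming ldiskfs targets (the NID and device
paths are placeholders; see the "Regenerating Lustre Configuration Logs"
section of the manual for the authoritative procedure):

# with all targets unmounted, on each server:
tunefs.lustre --writeconf --erase-params \
    --param="mgsnode=192.168.1.10@tcp" /dev/mdt_device   # on the MDS
tunefs.lustre --writeconf --erase-params \
    --param="mgsnode=192.168.1.10@tcp" /dev/ost_device   # on each OSS
# then mount the MGS/MDT first and the OSTs afterwards; the configuration
# logs are regenerated with the new TCP NIDs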

I would then strongly recommend to copy all of the data to new hardware, 
instead of running it like this.  *Even if* the storage is currently working, 
it is 10+ years old and likely to also fail soon.  Also, the storage is likely 
to be small and slow compared to any modern devices, and should be refreshed 
before the data is permanently lost.

Cheers, Andreas

On Jul 19, 2023, at 21:31, Richard Chang via lustre-discuss <lustre-discuss@lists.lustre.org> wrote:

Hi,

I have an existing, old Lustre filesystem; I don't remember the exact version,
but it is most likely 1.6. The Lustre servers have crashed and can't be fixed,
hardware-wise.

The MDT/MGT and OSTs are housed in a backend FC based DAS.

How easy or difficult would it be to set up a few new servers and attach this
backend storage to them to get the data back?

I am not saying it is straightforward, but there is no harm in trying. We can
even load the old versions of the OS and Lustre software.

All the user is concerned about is the data, which I am sure is still safe in 
the backend storage box.

One thing though. The old servers had Infiniband as the LNET, but the newer 
ones will use TCP.

Any help and advice will be highly appreciated.

Thanks & regards,

Richard.





___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] New client mounts fail after deactivating OSTs

2023-07-18 Thread Andreas Dilger via lustre-discuss
Brian,
Please file a ticket in LUDOC with details of how the manual should be updated. 
Ideally, including a patch. :-)

Cheers, Andreas

On Jul 11, 2023, at 15:39, Brad Merchant wrote:


We recreated the issue in a test cluster and it was definitely the llog_cancel 
steps that caused the issue. Clients couldn't process the llog properly on new 
mounts and would fail. We had to completely clear the llog and --writeconf 
every target to regenerate it from scratch.
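
For anyone hitting the same thing, one way to inspect the client config llog on
the MGS before resorting to a full writeconf (fsname as in the logs below):

lctl --device MGS llog_print hydra-client
# look for stale "osc.active=0" records referencing the removed OSTs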

The cluster is up and running now but I would certainly recommend at least 
revising that section of the manual.

On Mon, Jul 10, 2023 at 5:22 PM Brad Merchant <bmerch...@cambridgecomputer.com> wrote:
We deactivated half of 32 OSTs after draining them. We followed the steps in 
section 14.9.3 of the lustre manual

https://doc.lustre.org/lustre_manual.xhtml#lustremaint.remove_ost

After running the steps in subhead "3. Deactivate the OST." on OST0010-OST001f, 
new client mounts fail with the below log messages. Existing client mounts seem 
to function correctly but are on a bit of a ticking timebomb because they are 
configured with autofs.

The llog_cancel steps are new to me and the issues seemed to appear after those 
commands were issued (can't say that 100% definitively however). Servers are 
running 2.12.5 and clients are on 2.14.x


Jul 10 15:22:40 adm-sup1 kernel: LustreError: 
26814:0:(obd_config.c:1514:class_process_config()) no device for: 
hydra-OST0010-osc-8be5340c2000
Jul 10 15:22:40 adm-sup1 kernel: LustreError: 
26814:0:(obd_config.c:2038:class_config_llog_handler()) MGC172.16.100.101@o2ib: 
cfg command failed: rc = -22
Jul 10 15:22:40 adm-sup1 kernel: Lustre:cmd=cf00f 0:hydra-OST0010-osc  
1:osc.active=0
Jul 10 15:22:40 adm-sup1 kernel: LustreError: 15b-f: MGC172.16.100.101@o2ib: 
Configuration from log hydra-client failed from MGS -22. Check client and MGS 
are on compatible version.
Jul 10 15:22:40 adm-sup1 kernel: Lustre: hydra: root_squash is set to 99:99
Jul 10 15:22:40 adm-sup1 systemd-udevd[26823]: Process '/usr/sbin/lctl 
set_param 'llite.hydra-8be5340c2000.nosquash_nids=192.168.80.84@tcp 
192.168.80.122@tcp 192.168.80.21@tcp 172.16.90.11@o2ib 172.16.100.211@o2ib 
172.16.100.212@o2ib 172.16.100.213@o2ib 172.16.100.214@o2ib 172.16.100.215@o2ib 
172.16.90.51@o2ib'' failed with exit code 2.
Jul 10 15:22:40 adm-sup1 kernel: Lustre: Unmounted hydra-client
Jul 10 15:22:40 adm-sup1 kernel: LustreError: 
26803:0:(obd_mount.c:1680:lustre_fill_super()) Unable to mount  (-22)



___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Use of lazystatfs

2023-07-05 Thread Andreas Dilger via lustre-discuss
On Jul 5, 2023, at 07:14, Mike Mosley via lustre-discuss <lustre-discuss@lists.lustre.org> wrote:

Hello everyone,

We have drained some of our OSS/OSTs and plan to deactivate them soon.  The 
process ahead leads us to a couple of questions that we hope somebody can 
advise us on.

Scenario
We have fully drained the target OSTs using  'lfs find' to identify all files 
located on the targets and then feeding the list to 'lfs migrate. ' A final 
scan shows there are no files left on the targets.

Questions
1) Running 'lfs df -h' still shows some space being used even though we have 
drained all of the data.   Is that normal?  i.e.

UUID   bytesUsed   Available Use% Mounted on
hydra-OST0010_UUID 84.7T  583.8M   80.5T   1% /dfs/hydra[OST:16]
hydra-OST0011_UUID 84.7T  581.4M   80.5T   1% /dfs/hydra[OST:17]
hydra-OST0012_UUID 84.7T  581.7M   80.5T   1% /dfs/hydra[OST:18]
hydra-OST0013_UUID 84.7T  582.4M   80.5T   1% /dfs/hydra[OST:19]
hydra-OST0014_UUID 84.7T  584.1M   80.5T   1% /dfs/hydra[OST:20]
hydra-OST0015_UUID 84.7T  583.4M   80.5T   1% /dfs/hydra[OST:21]
hydra-OST0016_UUID 84.7T  583.6M   80.5T   1% /dfs/hydra[OST:22]
hydra-OST0017_UUID 84.7T  581.8M   80.5T   1% /dfs/hydra[OST:23]
hydra-OST0018_UUID 84.7T  582.6M   80.5T   1% /dfs/hydra[OST:24]
hydra-OST0019_UUID 84.7T  582.7M   80.5T   1% /dfs/hydra[OST:25]
hydra-OST001a_UUID 84.7T  580.0M   80.5T   1% /dfs/hydra[OST:26]
hydra-OST001b_UUID 84.7T  580.4M   80.5T   1% /dfs/hydra[OST:27]
hydra-OST001c_UUID 84.7T  582.1M   80.5T   1% /dfs/hydra[OST:28]
hydra-OST001d_UUID 84.7T  583.2M   80.5T   1% /dfs/hydra[OST:29]
hydra-OST001e_UUID 84.7T  583.7M   80.5T   1% /dfs/hydra[OST:30]
hydra-OST001f_UUID 84.7T  587.7M   80.5T   1% /dfs/hydra[OST:31]

I would suggest to unmount the OSTs from Lustre and mount via ldiskfs, then run 
"find $MOUNT/O -type f -ls" to find if there are any in-use files left.  It is 
likely that the 580M used by all of the OSTs is just residual logs and large 
directories under O/*.  There might be some hundreds or thousands of 
zero-length object files that were precreated but never used, that will 
typically have an unusual file access mode 07666 and can be ignored.
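
A sketch of that check (the OST mount point and device path are placeholders):

umount /mnt/lustre_ost16
mount -t ldiskfs /dev/ost16_device /mnt/snap
find /mnt/snap/O -type f -ls | head -20
# exclude the precreated mode-07666 objects from the count:
find /mnt/snap/O -type f ! -perm 7666 -size +0c -ls
umount /mnt/snap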

2) According to some comments, prior to deactivating the OSS/OSTs, we should
add the 'lazystatfs' option to all of our client mounts so that they do not
hang once we deactivate some of the OSTs.  Is that correct?  If so, why would
you not just always have that option set?  What are the ramifications of
doing it well in advance of the OST deactivations?

The lazystatfs feature has been enabled by default since Lustre 2.9 so I don't 
think you need to do anything with it anymore.  The "lfs df" command will 
automatically skip unconfigured OSTs.
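
To confirm on a client:

lctl get_param llite.*.lazystatfs    # expect ...lazystatfs=1 on 2.9+ clients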


Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Rocky 9.2/lustre 2.15.3 client questions

2023-06-23 Thread Andreas Dilger via lustre-discuss
Applying the LU-16626 patch locally should fix the issue, and has no risk since 
it is only fixing a build issue that affects an obscure diagnostic tool.

That said, I've cherry-picked that patch back to b2_15, so it should be 
included into 2.15.4.

https://review.whamcloud.com/51426

Cheers, Andreas

On Jun 23, 2023, at 05:04, Mountford, Christopher J. (Dr.) via lustre-discuss wrote:

Hi,

I'm building the lustre client/kernel modules for our new HPC cluster and have 
a couple of questions:

1) Are there any known issues running lustre 2.15.3 clients and lustre 2.12.9 
servers? I haven't seen anything showstopping on the mailing list or in JIRA 
but wondered if anyone had run into problems.

2) Is it possible to get the dkms kernel rpm to work with Rocky/RHEL 9.2? If I 
try to install the lustre-client-dkms rpm I get the following error:

error: Failed dependencies:
   /usr/bin/python2 is needed by lustre-client-dkms-2.15.3-1.el9.noarch

- Not surprisingly as I understand that python2 is not available for rocky/rhel 
9

I see there is a patch for 2.16 (from LU-16626). Not a major problem as I can 
build kmod-lustre-client rpms for our kernel/ofed, but I would prefer to use 
dkms if possible.

Kind Regards,
Christopher.


Dr. Christopher Mountford,
System Specialist,
RCS,
Digital Services.
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] CentOS Stream 8/9 support?

2023-06-22 Thread Andreas Dilger via lustre-discuss
On Jun 22, 2023, at 06:58, Will Furnass via lustre-discuss <lustre-discuss@lists.lustre.org> wrote:

Hi,

I imagine that many here might have seen RedHat's announcement
yesterday about ceasing to provide sources for EL8 and EL9 to those
who aren't paying customers (see [1] - CentOS 7 unaffected).  For many
HPC sites using or planning to adopt Alma/Rocky 8/9 this prompts a
change of tack:
- buy RHEL 8/9
- switch to CentOS 8/9 Stream for something EL-like
- switch to something else (SUSE or Ubuntu)

Those wanting to stick with EL-like will be interested in how well
Lustre works with Stream 8/9.  Seems it's not in the support matrix
[2].  Have others here used Lustre with Stream successfully?  If so,
anything folks would care to share about gotchas encountered?  Did you use
patched or unpatched kernels?

For clients I don't think it will matter much, since users often have to build
their own client RPMs (possibly via DKMS), or they use weak updates to
avoid rebuilding the RPMs at all for client updates.  The Lustre client code
itself works with a wide range of kernel versions (3.10-6.0 currently), and
I suspect that relatively few production systems want to be on the bleeding
edge of Linux kernels either, so the lack of 6.1-6.3 kernel support is likely
not affecting anyone, and even then patches are already in flight for them.

Definitely servers will be more tricky, since the baseline will always be
moving, and more quickly than EL kernels.

[1] https://www.redhat.com/en/blog/furthering-evolution-centos-stream
[2] https://wiki.whamcloud.com/display/PUB/Lustre+Support+Matrix

Cheers,

Will

--
Dr Will Furnass | Research Platforms Engineer
IT Services | University of Sheffield


Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] No space left on device MDT DoM but not full nor run out of inodes

2023-06-22 Thread Andreas Dilger via lustre-discuss
There is a bug in the grant accounting that leaks under certain operations 
(maybe O_DIRECT?).  It is resolved by unmounting and remounting the clients, 
and/or upgrading.  There was a thread about it on lustre-discuss a couple of 
years ago.

Cheers, Andreas

On Jun 20, 2023, at 09:32, Jane Liu via lustre-discuss <lustre-discuss@lists.lustre.org> wrote:

Sorry, typo in the version number - the version we are actually running is 
2.12.6

From: Jon Marshall
Sent: 20 June 2023 16:18
To: lustre-discuss@lists.lustre.org
Subject: No space left on device MDT DoM but not full nor run out of inodes

Hi,

We've been running lustre 2.15.1 in production for over a year and recently 
decided to enable PFL with DoM on our filesystem. Things have been fine up 
until last week, when users started reporting issues copying files, 
specifically "No space left on device". The MDT is running ldiskfs as the 
backend.

I've searched through the mailing list and found a couple of people reporting 
similar problems, which prompted me to check the inode allocation, which is 
currently:

UUID                      Inodes     IUsed      IFree IUse% Mounted on
scratchc-MDT0000_UUID  624492544  71144384  553348160   12% /mnt/scratchc[MDT:0]
scratchc-OST0000_UUID   57712579  24489934   33222645   43% /mnt/scratchc[OST:0]
scratchc-OST0001_UUID   57114064  24505876   32608188   43% /mnt/scratchc[OST:1]

filesystem_summary:    136975217  71144384   65830833   52% /mnt/scratchc

So, nowhere near full - the disk usage is a little higher:

UUID                     bytes    Used   Available Use% Mounted on
scratchc-MDT0000_UUID   882.1G  451.9G     355.8G   56% /mnt/scratchc[MDT:0]
scratchc-OST0000_UUID    53.6T   22.7T      31.0T   43% /mnt/scratchc[OST:0]
scratchc-OST0001_UUID    53.6T   23.0T      30.6T   43% /mnt/scratchc[OST:1]

filesystem_summary:     107.3T   45.7T      61.6T   43% /mnt/scratchc

But not full either! The errors are accompanied in the logs by:

LustreError: 15450:0:(tgt_grant.c:463:tgt_grant_space_left()) scratchc-MDT0000:
cli ba0195c7-1ab4-4f7c-9e28-8689478f5c17/9e331e231c00 left 82586337280 <
tot_grant 82586681321 unstable 0 pending 0 dirty 1044480
LustreError: 15450:0:(tgt_grant.c:463:tgt_grant_space_left()) Skipped 33050 
previous similar messages

For reference the DoM striping we're using is:

  lcm_layout_gen:    0
  lcm_mirror_count:  1
  lcm_entry_count:   3
    lcme_id:             N/A
    lcme_mirror_id:      N/A
    lcme_flags:          0
    lcme_extent.e_start: 0
    lcme_extent.e_end:   1048576
      stripe_count:  0   stripe_size: 1048576   pattern: mdt     stripe_offset: -1

    lcme_id:             N/A
    lcme_mirror_id:      N/A
    lcme_flags:          0
    lcme_extent.e_start: 1048576
    lcme_extent.e_end:   1073741824
      stripe_count:  1   stripe_size: 1048576   pattern: raid0   stripe_offset: -1

    lcme_id:             N/A
    lcme_mirror_id:      N/A
    lcme_flags:          0
    lcme_extent.e_start: 1073741824
    lcme_extent.e_end:   EOF
      stripe_count: -1   stripe_size: 1048576   pattern: raid0   stripe_offset: -1

So the first 1MB on the MDT.

My question is obviously what is causing these errors? I'm not massively 
familiar with Lustre internals, so any pointers on where to look would be 
greatly appreciated!

Cheers
Jon

Jon Marshall
High Performance Computing Specialist



IT and Scientific Computing Team



Cancer Research UK Cambridge Institute
Li Ka Shing Centre | Robinson Way | Cambridge | CB2 0RE
Web: http://www.cruk.cam.ac.uk/ | Facebook: http://www.facebook.com/cancerresearchuk | Twitter: http://twitter.com/CR_UK




___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org<mailto:lustre-discuss@lists.lustre.org>
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Data stored in OST

2023-05-22 Thread Andreas Dilger via lustre-discuss
Yes, the OSTs must provide internal redundancy - RAID-6 typically. 

There is File Level Redundancy (FLR = mirroring) possible in Lustre file 
layouts, but it is "unmanaged", so users or other system-level tools are 
required to resync FLR files if they are written after mirroring.
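
A brief sketch of that unmanaged resync cycle (the path is a placeholder):

lfs mirror extend -N -c 1 /mnt/lustre/important.dat   # add a second mirror
# ... after the file is written, one mirror becomes stale ...
lfs mirror resync /mnt/lustre/important.dat
lfs mirror verify /mnt/lustre/important.dat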

Cheers, Andreas

> On May 22, 2023, at 09:39, Nick dan via lustre-discuss wrote:
> 
> 
> Hi
> 
> I had one doubt.
> In lustre, data is divided into stripes and stored in multiple OSTs. So each 
> OST will have some part of data. 
> My question is if one OST fails, will there be data loss?
> 
> Please advise for the same.
> 
> Thanks and regards
> Nick
> ___
> lustre-discuss mailing list
> lustre-discuss@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] mlx5 errors on oss

2023-05-18 Thread Andreas Dilger via lustre-discuss
I can't comment on the specific network issue, but in general it is far better 
to use the MOFED drivers than the in-kernel ones. 

Cheers, Andreas

> On May 18, 2023, at 09:08, Nehring, Shane R [LAS] via lustre-discuss wrote:
> 
> Hello all,
> 
> We recently added infiniband to our cluster and are in the process of testing 
> it
> with lustre. We're running the distro provided drivers for the mellanox cards
> with the latest firmware. Overnight we started seeing the following errors on 
> a
> few oss:
> 
> infiniband mlx5_0: dump_cqe:272:(pid 40058): dump error cqe
> 0000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 0010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 0020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 0030: 00 00 00 00 00 00 88 13 08 00 00 a0 00 63 4d d2
> infiniband mlx5_0: dump_cqe:272:(pid 40057): dump error cqe
> 0000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 0010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 0020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 0030: 00 00 00 00 00 00 88 13 08 00 00 a1 00 c2 8e d2
> infiniband mlx5_0: dump_cqe:272:(pid 40057): dump error cqe
> 0000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 0010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 0020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 0030: 00 00 00 00 00 00 88 13 08 00 00 a2 00 1a 12 d2
> 
> I found a post suggesting this might be iommu related, disabling the iommu
> doesn't seem to help any.
> 
> We're running luster 2.15, more or less at the tip of b2_15
> (b74560d74a9f890838dbf2f0719e3d27c1e5eaf8)
> 
> Has anyone seen this before or have any pointers?
> 
> Thanks
> 
> Shane
> ___
> lustre-discuss mailing list
> lustre-discuss@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] [EXTERNAL] Re: storing Lustre jobid in file xattrs: seeking feedback

2023-05-15 Thread Andreas Dilger via lustre-discuss
Note that there have been some requests to increase the jobid size (LU-16765) 
so any tools that are accessing the xattr shouldn't assume the jobid is only 32 
bytes in size.

On May 14, 2023, at 13:11, Bertschinger, Thomas Andrew Hjorth <bertschin...@lanl.gov> wrote:

Thanks for the responses.

I like the idea of allowing the xattr name to be a parameter, because while it 
increases the complexity, it seems safer.

The main difficulty I can think of is that user tools that query the jobid will 
need to get the value of the parameter first in order to query the correct 
xattr. Additionally, if the parameter is changed, jobids from old files may be 
missed. This doesn't seem like a big risk however, because I imagine this value 
would be changed rarely if ever.
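
Purely as an illustration of what such a tool might do if the feature lands
with the mdt.*.job_xattr tunable proposed in this thread (both the parameter
and the "user.job" name are hypothetical at this point):

lctl get_param mdt.*.job_xattr         # discover the configured xattr name
getfattr -n user.job /mnt/lustre/output/file001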

As for limiting the name to 7 characters, I believe Andreas is referring to the 
xattr name itself, not the contents of the xattr, so there should be no problem 
with storing the full length of a jobid (32 characters) -- but let me know if I 
am not interpreting that correctly.

- Tom Bertschinger

From: Jeff Johnson <jeff.john...@aeoncomputing.com>
Sent: Friday, May 12, 2023 4:56 PM
To: Andreas Dilger
Cc: Bertschinger, Thomas Andrew Hjorth; lustre-discuss@lists.lustre.org
Subject: [EXTERNAL] Re: [lustre-discuss] storing Lustre jobid in file xattrs: 
seeking feedback

Just a thought, instead of embedding the jobname itself, perhaps just a least 
significant 7 character sha-1 hash of the jobname. Small chance of collision, 
easy to decode/cross reference to jobid when needed. Just a thought.

--Jeff


On Fri, May 12, 2023 at 3:08 PM Andreas Dilger via lustre-discuss <lustre-discuss@lists.lustre.org> wrote:
Hi Thomas,
thanks for working on this functionality and raising this question.

As you know, I'm inclined toward the user.job xattr, but I think it is never a 
good idea to unilaterally make policy decisions in the kernel that cannot be 
changed.

As such, it probably makes sense to have a tunable parameter like 
"mdt.*.job_xattr=user.job" and then this could be changed in the future if 
there is some conflict (e.g. some site already uses the "user.job" xattr for 
some other purpose).

I don't think the job_xattr should allow totally arbitrary values (e.g. 
overwriting trusted.lov or trusted.lma or security.* would be bad). One option 
is to only allow a limited selection of valid xattr namespaces, and possibly 
names:

 *   NONE to turn this feature off
 *   user, or trusted or system (if admin wants to restrict the ability of 
regular users to change this value?), with ".job" added automatically
 *   user.* (or trusted.* or system.*) to also allow specifying the xattr name

If we allow the xattr name portion to be specified (which I'm not sure about, 
but putting it out for completeness), it should have some reasonable limits:

 *   <= 7 characters long to avoid wasting valuable xattr space in the inode
 *   should not conflict with other known xattrs, which is tricky if we allow 
the name to be arbitrary. Possibly if in trusted (and system?) it should only 
allow trusted.job to avoid future conflicts?
 *   maybe restrict it to contain "job" (or maybe "pbs", "slurm", ...) to 
reduce the chance of namespace clashes in user or system? However, I'm 
reluctant to restrict names in user since this shouldn't have any fatal side 
effects (e.g. data corruption like in trusted or system), and the admin is 
supposed to know what they are doing...

On May 4, 2023, at 15:53, Bertschinger, Thomas Andrew Hjorth via lustre-discuss <lustre-discuss@lists.lustre.org> wrote:

Hello Lustre Users,

There has been interest in a proposed feature 
https://jira.whamcloud.com/browse/LU-13031 to store the jobid with each Lustre 
file at create time, in an extended attribute. An open question is which xattr 
namespace is to use between "user", the Lustre-specific namespace "lustre", 
"trusted", or even perhaps "system".

The correct namespace likely depends on how this xattr will be used. For 
example, will interoperability with other filesystems be important? Different 
namespaces have their own limitations so the correct choice depends on the use 
cases.

I'm looking for feedback on applications for this feature. If you have thoughts 
on how you could use this, please feel free to share them so that we design it 
in a way that meets your needs.

Thanks!

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] storing Lustre jobid in file xattrs: seeking feedback

2023-05-12 Thread Andreas Dilger via lustre-discuss
Hi Thomas,
thanks for working on this functionality and raising this question.

As you know, I'm inclined toward the user.job xattr, but I think it is never a 
good idea to unilaterally make policy decisions in the kernel that cannot be 
changed.

As such, it probably makes sense to have a tunable parameter like 
"mdt.*.job_xattr=user.job" and then this could be changed in the future if 
there is some conflict (e.g. some site already uses the "user.job" xattr for 
some other purpose).

I don't think the job_xattr should allow totally arbitrary values (e.g. 
overwriting trusted.lov or trusted.lma or security.* would be bad). One option 
is to only allow a limited selection of valid xattr namespaces, and possibly 
names:

  *   NONE to turn this feature off
  *   user, or trusted or system (if admin wants to restrict the ability of 
regular users to change this value?), with ".job" added automatically
  *   user.* (or trusted.* or system.*) to also allow specifying the xattr name

If we allow the xattr name portion to be specified (which I'm not sure about, 
but putting it out for completeness), it should have some reasonable limits:

  *   <= 7 characters long to avoid wasting valuable xattr space in the inode
  *   should not conflict with other known xattrs, which is tricky if we allow 
the name to be arbitrary. Possibly if in trusted (and system?) it should only 
allow trusted.job to avoid future conflicts?
  *   maybe restrict it to contain "job" (or maybe "pbs", "slurm", ...) to 
reduce the chance of namespace clashes in user or system? However, I'm 
reluctant to restrict names in user since this shouldn't have any fatal side 
effects (e.g. data corruption like in trusted or system), and the admin is 
supposed to know what they are doing...

On May 4, 2023, at 15:53, Bertschinger, Thomas Andrew Hjorth via lustre-discuss <lustre-discuss@lists.lustre.org> wrote:

Hello Lustre Users,

There has been interest in a proposed feature 
https://jira.whamcloud.com/browse/LU-13031 to store the jobid with each Lustre 
file at create time, in an extended attribute. An open question is which xattr 
namespace is to use between "user", the Lustre-specific namespace "lustre", 
"trusted", or even perhaps "system".

The correct namespace likely depends on how this xattr will be used. For 
example, will interoperability with other filesystems be important? Different 
namespaces have their own limitations so the correct choice depends on the use 
cases.

I'm looking for feedback on applications for this feature. If you have thoughts 
on how you could use this, please feel free to share them so that we design it 
in a way that meets your needs.

Thanks!

Tom Bertschinger
LANL
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Missing Files in /proc/fs/lustre after Upgrading to Lustre 2.15.X

2023-05-04 Thread Andreas Dilger via lustre-discuss


On May 4, 2023, at 16:43, Jane Liu via lustre-discuss <lustre-discuss@lists.lustre.org> wrote:

Hi,

We previously had a monitoring tool in Lustre 2.12.X that relied on files 
located under /proc/fs/lustre for gathering metrics. However, after upgrading 
our system to version 2.15.2, we noticed that at least five files previously 
found under /proc/fs/lustre are no longer present. Here is a list of these 
files as an example:

/proc/fs/lustre/osd-ldiskfs/fsname-OST0078/brw_stats
/proc/fs/lustre/osd-ldiskfs/fsname-OST0078/kbytestotal
/proc/fs/lustre/osd-ldiskfs/fsname-OST0078/kbytesfree
/proc/fs/lustre/osd-ldiskfs/fsname-OST0078/filestotal
/proc/fs/lustre/osd-ldiskfs/fsname-OST0078/filesfree

We have been unable to locate these files in the new version. We can still 
obtain size information using the following commands:

lctl get_param obdfilter.*.kbytestotal
lctl get_param obdfilter.*.kbytesfree
lctl get_param obdfilter.*.filestotal
lctl get_param obdfilter.*.filesfree

However, we are unsure how to access the information previously available in 
the brw_stats file. Any guidance or suggestions would be greatly appreciated.

You've already partially answered your own question - the parameters for "lctl 
get_param" are under "osd-ldiskfs.*.{brw_stats,kbytes*,files*}" and not 
"obdfilter.*.*", but they have (mostly) moved from /proc/fs/lustre/osd-ldiskfs 
to /sys/fs/lustre/osd-ldiskfs.  In the case of brw_stats they are under 
/sys/kernel/debug/lustre/osd-ldiskfs.

These stats actually moved from obdfilter to osd-ldiskfs back in Lustre 2.4 
when the ZFS backend was added, and a symlink has been kept until now for 
compatibility.  That means your monitoring tool should still work with any 
modern Lustre version if you change the path. The move of brw_stats to 
/sys/kernel/debug/lustre was mandated by the upstream kernel and only happened 
in 2.15.0.
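
Equivalent lookups under the new paths (using the fsname-OST0078 example above;
reading under /sys/kernel/debug typically requires root):

lctl get_param osd-ldiskfs.*.kbytesfree osd-ldiskfs.*.filesfree
lctl get_param osd-ldiskfs.*.brw_stats
cat /sys/kernel/debug/lustre/osd-ldiskfs/fsname-OST0078/brw_stats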

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] question mark when listing file after the upgrade

2023-05-03 Thread Andreas Dilger via lustre-discuss
This looks like https://jira.whamcloud.com/browse/LU-16655 causing problems 
after the upgrade from 2.12.x to 2.15.[012] breaking the Object Index files.

A patch for this has already been landed to b2_15 and will be included in 
2.15.3. If you've hit this issue, then you need to backup/delete the OI files 
(off of Lustre) and run OI Scrub to rebuild them.

I believe the OI Scrub/rebuild is described in the Lustre Manual.
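
A sketch of triggering and monitoring the rebuild, assuming an ldiskfs MDT
(the fsname is a placeholder):

lctl lfsck_start -M fsname-MDT0000 -t scrub
lctl get_param osd-ldiskfs.fsname-MDT0000.oi_scrub   # watch the status field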

Cheers, Andreas

On May 3, 2023, at 09:30, Colin Faber via lustre-discuss wrote:


Hi,

What does your client log indicate? (dmesg / syslog)

On Wed, May 3, 2023, 7:32 AM Jane Liu via lustre-discuss <lustre-discuss@lists.lustre.org> wrote:
Hello,

I'm writing to ask for your help on one issue we observed after a major
upgrade of a large Lustre system from RHEL7 + 2.12.9 to RHEL8 + 2.15.2.
Basically we preserved MDT disk (VDisk on a VM) and also all OST disk
(JBOD) in RHEL7 and then reinstalled RHEL8 OS and then attached those
preserved disks to RHEL8 OS. However, I met an issue after the OS
upgrade and lustre installation.

I believe the issue is related to metadata.

The old MDS was a virtual machine, and the MDT vdisk was preserved
during the upgrade. When a new VM was created with the same hostname and
IP, the preserved MDT vdisk was attached to it. Everything seemed fine
initially. However, after the client mount was completed, the file
listing displayed question marks, as shown below:

[root@experimds01 ~]# mount -t lustre 11.22.33.44@tcp:/experi01
/mntlustre/
[root@experimds01 ~]# cd /mntlustre/
[root@experimds01 mntlustre]# ls -l
ls: cannot access 'experipro': No such file or directory
ls: cannot access 'admin': No such file or directory
ls: cannot access 'test4': No such file or directory
ls: cannot access 'test3': No such file or directory
total 0
d? ? ? ? ?? admin
d? ? ? ? ?? experipro
-? ? ? ? ?? test3
-? ? ? ? ?? test4

I shut down the MDT and ran "e2fsck -p /dev/mapper/experimds01-experimds01".
It reported "primary superblock features different from backup, check forced."

[root@experimds01 ~]# e2fsck -p /dev/mapper/experimds01-experimds01
experi01-MDT0000 primary superblock features different from backup,
check forced.
experi01-MDT0000: 9493348/429444224 files (0.5% non-contiguous),
109369520/268428864 blocks

Running e2fsck again showed that the filesystem was clean.
[root@experimds01 /]# e2fsck -p /dev/mapper/experimds01-experimds01
experi01-MDT0000: clean, 9493378/429444224 files, 109369610/268428864 blocks

However, the issue persisted. The file listing continued to display
question marks.

Do you have any idea what could be causing this problem and how to fix
it? By the way, I have an e2image backup of the MDT from the
RHEL7 system just in case we need fix it using the backup. Also, after
the upgrade, the command "lfs df" shows that all OSTs and MDT
  are fine.

Thank you in advance for your assistance.

Best regards,
Jane
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Recovering MDT failure

2023-04-28 Thread Andreas Dilger via lustre-discuss
On Apr 27, 2023, at 02:12, Ramiro Alba Queipo <ramiro.a...@upc.edu> wrote:

Hi everybody,

I have Lustre 2.15.0 using Oracle Linux on the servers and Ubuntu 20.04 on the
clients. I have one MDT on a RAID-1 SSD pair, and both disks have failed, so
all the data is apparently lost.

- Is there any remote possibility of accessing data on the OSTs without the MDT?
- When I started this system I tried to back up the MDT data without success.
Is there any procedure to back up the data, and also to recover it, which I did
not manage to achieve?

Any help/suggestion is very welcome.
Thanks in advance.

Regards

Depending on the file layout used, the files on the OSTs are "just files", if 
you can mount the OSTs as type ldiskfs (or ZFS if that is the way you 
configured it).  The default file layout is one OST stripe per file, so you 
could read those files, and the UID/GID/timestamps should be correct, but there 
will not be any filenames associated with the files.
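
A sketch of what that looks like on an ldiskfs OST (the device and mount point
are placeholders; objects are grouped under O/<sequence>/d0..d31):

mount -t ldiskfs /dev/ost_device /mnt/ostfs
ls /mnt/ostfs/O/0/d0 | head
file /mnt/ostfs/O/0/d0/*     # single-striped files are readable as-is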

Regards, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] [EXTERNAL] Mounting lustre on block device

2023-04-05 Thread Andreas Dilger via lustre-discuss


On Mar 16, 2023, at 17:02, Jeff Johnson <jeff.john...@aeoncomputing.com> wrote:

If you *really* want a block device on a client that resides in Lustre you 
*could* create a file in Lustre and then make that file a loopback device with 
losetup. Of course, your mileage will vary *a lot* based on use case, access, 
underlying LFS configuration.

dd if=/dev/zero of=/my_lustre_mountpoint/some_subdir/big_raw_file bs=1048576 
count=10
losetup -f /my_lustre_mountpoint/some_subdir/big_raw_file
*assuming loop0 is created*
some_fun_command /dev/loop0

Note with ldiskfs backends you can use "fallocate -l 10M 
/my_lustre_mountpoint/some_subdir/big_raw_file" to reserve the space.

Alternately, if you have flash-based OSTs you could truncate a sparse file to
the full size ("truncate -s 10M ...") and format that, which will not
consume as much space but will generate more random allocation on the OSTs.

Disclaimer: Just because you *can* do this, doesn't necessarily mean it is a 
good idea

We saw a good performance boost with ext4 images on Lustre holding many small 
files (CCI).  Also, I recall some customers in the past using ext2 or ext4 
images effectively to aggregate many small files for read-only use on compute 
nodes.
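
A minimal sketch of that image workflow (the paths and the 100G size are
arbitrary examples):

truncate -s 100G /mnt/lustre/images/tools.img
mkfs.ext4 -F /mnt/lustre/images/tools.img
mount -o loop /mnt/lustre/images/tools.img /mnt/tools     # populate once
umount /mnt/tools
mount -o loop,ro /mnt/lustre/images/tools.img /mnt/tools  # read-only reuse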

Cheers, Andreas


On Thu, Mar 16, 2023 at 3:29 PM Mohr, Rick via lustre-discuss <lustre-discuss@lists.lustre.org> wrote:
Are you asking if you can mount Lustre on a client so that it shows up as a 
block device?  If so, the answer to that is you can't.  Lustre does not appear 
as a block device to the clients.

-Rick



On 3/16/23, 3:44 PM, "lustre-discuss on behalf of Shambhu Raje via lustre-discuss" <lustre-discuss-boun...@lists.lustre.org on behalf of lustre-discuss@lists.lustre.org> wrote:


When we mount a Lustre file system on a client, the Lustre file system does not
use a block device on the client side. Instead it uses a virtual file system
namespace. The mount point will not be shown when we do 'lsblk'; it only shows
up in 'df -hT'.

How can we mount the Lustre file system on a block device, such that when we
write something through Lustre it can be seen on the block device?
Can you share the commands?











___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


--
--
Jeff Johnson
Co-Founder
Aeon Computing

jeff.john...@aeoncomputing.com
www.aeoncomputing.com
t: 858-412-3810 x1001   f: 858-412-3845
m: 619-204-9061

4170 Morena Boulevard, Suite C - San Diego, CA 92117

High-Performance Computing / Lustre Filesystems / Scale-out Storage
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Joining files

2023-03-30 Thread Andreas Dilger via lustre-discuss
Based on your use case, I don't think file join will be a suitable solution. 
There is a limit on the number of files that can be joined (about 2000) and 
this would make for an unusual file format (something like a tar file, but 
would need special tools to access). It would also be very Lustre-specific. 

Instead, my recommendation would be to use an ext4 filesystem image to hold the 
many small files (during create, if from a single client, or aggregated after 
they are created).  Later, this filesystem image could be mounted read-only on 
multiple clients for access. Also, the whole image file can be archived to tape 
efficiently (taking all small files with it, instead of keeping a stub in 
Lustre for each file).

The use of loopback mounting image files from Lustre already works today, but 
needs userspace help to create and mount/unmount them. There was some proposal 
"Client Container Image (CCI)" on how this could be integrated directly into 
Lustre.  Please see my LUG presentation for details (maybe 2019 or so?)

Cheers, Andreas

> On Mar 30, 2023, at 00:47, Sven Willner wrote:
> 
> Dear Patrick and Anders,
> 
> Thank you very much for your quick and comprehensive replies.
> 
> My motivation behind this issues is the following:
> At my institute (research around a large earth system/climate model) we are 
> evaluating using zarr (https://zarr.readthedocs.io) for outputing large 
> multi-dimensional arrays. This currently results in a huge number of small 
> files as the responsibility of parallel writing is fully shifted to the file 
> system. However, after closing the respective datasets we could merge those 
> files again to reduce the metadata burden onto the file system and for easier 
> archival if needed at a later point. Ideally without copying the large amount 
> of data again. For read access I would simply create an appropriate 
> index/lookup table for the resulting large file - hence holes/gaps in the 
> file are not a problem as such.
> 
> As Patrick writes
>> Layout: 1 1 1 1 1 1 1 ... 20 MiB 2 2 2 2 2 2  35 MiB
>> 
>> With data from 0-10 MiB and 20 - 30 MiB.
> that would be the resulting layout (I guess, minimizing holes could be 
> achieved by appropriate striping of the original files and/or a layout 
> adjustment during the merge, if possible).
> 
>> My expectation is that "join" of two files would be handled at the file EOF 
>> and *not* at the layout boundary.  Based on the original description from 
>> Sven, I'd think that small gaps in the file (e.g. 4KB for page alignment, 
>> 64KB for minimum layout alignment, or 1MB for stripe alignment) would be OK, 
>> but tens or hundreds of MB holes would be inefficient for processing.
> (Andreas)
> 
> Apart from archival, the resulting file would only be accessed locally in the 
> boundaries of the orginial smaller files, so I would expect the performance 
> costs of the gaps to be not that critical.
> 
>> while I think it is possible to implement this in Lustre, I'd have to ask 
>> what requirements are driving your request?  Is this just something you want 
>> to test, or is there some real-world usage demand for this (e.g. specific 
>> application workload, usage in some popular library, etc)?
> (Andreas)
> 
> At this stage I am just looking into possibilites to handle this situation - 
> I am neither an expert in zarr nor in Lustre.
> 
> If such a merge on the file system level turns out to be route worth taking, 
> I would be happy to work on an implementation. However, yes, I would need 
> some guidance there. Also, at this point I cannot estimate the amount of work 
> needed even to test this approach.
> 
> Would the necessary layout manipulation be possible in userspace? (I will 
> have a look into the implementations of `lfs migrate` and `lfs mirror 
> extend`).
> 
> Thanks a lot!
> Best,
> Sven
> 
> On Wed, Mar 29, 2023 at 07:41:56PM +0000, Andreas Dilger wrote:
>> Patrick,
>> once upon a time there was "file join" functionality in Lustre that was 
>> ancient and complex, and was finally removed in 2009.  There are still a few 
>> remnants of this like "MDS_OPEN_JOIN_FILE" and "LOV_MAGIC_JOIN_V1" defined, 
>> but unused.   That functionality long predated composite file layouts (PFL, 
>> FLR), and used an external llog file *per file* to declare a series of other 
>> files that described the layout.  It was extremely fragile and complex and 
>> thankfully never got into widespread usage.
>> 
>> I think with the advent of composite file layout that it should be 
>> _possible_ to implement this kind of functionality pu

Re: [lustre-discuss] Joining files

2023-03-29 Thread Andreas Dilger via lustre-discuss
Subject: [lustre-discuss] Joining files


Dear all,

I am looking for a way to join/merge/concatenate several files into one, whose 
layout is just the concatenation of the layouts of the respective files - 
ideally without any copying/moving on the data side (even if this would result 
in "holes" in the joined file).

I would very much appreciate any hints about tools or ideas for how to achieve
such a join. As I understand it, there was once a `join` command for `lfs`,
which is now deprecated (however, I am not sure whether a use case like mine
was its purpose, or why it was deprecated).

Thanks a lot!
Best regards,
Sven

--
Dr. Sven Willner
Scientific Computing Lab (SCLab)
Max Planck Institute for Meteorology
Bundesstraße 53, D-20146 Hamburg, Germany
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] About Lustre small files performace(8k) improve

2023-03-27 Thread Andreas Dilger via lustre-discuss
Are your performance tests on NFS or on native Lustre clients?  Native Lustre 
clients will likely be faster, and with many clients they can create files in 
parallel, even in the same directory.  With a single NFS server they will be 
limited by the VFS locking for a single directory.

Are you using IB or TCP networking?  IB will be faster for low-latency requests.

Are you using the Data-on-MDT feature?  This can reduce overhead for very small 
files.
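
For reference, a PFL layout with a small DoM component might be set like this
(the 64KiB size and the path are only examples):

lfs setstripe -E 64K -L mdt -E -1 -c 1 /mnt/lfs/smallfiles
lfs getstripe -d /mnt/lfs/smallfiles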

Are you using NVMe storage or e.g. SATA SSDs?  Based on the OST size it looks 
like flash of some kind, unless you are using single-HDD OSTs?

Cheers, Andreas

On Mar 18, 2023, at 01:44, 王烁斌 via lustre-discuss <lustre-discuss@lists.lustre.org> wrote:

Hi all,

This is my Lustre FS.
UUID                  1K-blocks     Used    Available Use% Mounted on
ltfs-MDT0000_UUID     307826072    36904    281574768   1% /mnt/lfs[MDT:0]
ltfs-MDT0001_UUID     307826072    36452    281575220   1% /mnt/lfs[MDT:1]
ltfs-MDT0002_UUID     307826072    36600    281575072   1% /mnt/lfs[MDT:2]
ltfs-MDT0003_UUID     307826072    36300    281575372   1% /mnt/lfs[MDT:3]
ltfs-OST0000_UUID   15962575136  1027740  15156068868   1% /mnt/lfs[OST:0]
ltfs-OST0001_UUID   15962575136  1027780  15156067516   1% /mnt/lfs[OST:1]
ltfs-OST0002_UUID   15962575136  1027772  15156074212   1% /mnt/lfs[OST:2]
ltfs-OST0003_UUID   15962575136  1027756  15156067860   1% /mnt/lfs[OST:3]
ltfs-OST0004_UUID   15962575136  1027728  15156058224   1% /mnt/lfs[OST:4]
ltfs-OST0005_UUID   15962575136  1027772  15156057668   1% /mnt/lfs[OST:5]
ltfs-OST0006_UUID   15962575136  1027768  15156058568   1% /mnt/lfs[OST:6]
ltfs-OST0007_UUID   15962575136  1027792  15156056752   1% /mnt/lfs[OST:7]

filesystem_summary: 127700601088  8222108 121248509668   1% /mnt/lfs

The structure is as follows: [architecture diagram not preserved in the archive]

After testing, under the current structure, the write performance for 500,000
8KB small files is:
NFS client 1 - IOPS: 28,000; bandwidth: 230 MB/s
NFS client 2 - IOPS: 27,500; bandwidth: 220 MB/s

Now I want to improve small-file performance to a better level. May I ask if
there is a better way?

I have noticed a feature called "MIP-IO" that can improve small-file
performance, but I don't know how to deploy it. Is there any other way to
improve small-file performance?



___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] DNE v3 and directory inode changing

2023-03-24 Thread Andreas Dilger via lustre-discuss
On Mar 24, 2023, at 13:20, Bertschinger, Thomas Andrew Hjorth <bertschin...@lanl.gov> wrote:

Thanks, this is helpful. We certainly don't need the auto-split feature and 
were just experimenting with it, so this should be fine for us. And we have 
been satisfied with the round robin directory creation so far. Just out of 
curiosity, is the auto-split feature still being actively worked on and 
expected to be complete/production-ready within some defined period of time?

Nobody is currently working on directory split.  There are a number of other 
DNE optimizations that are underway (mostly internal code, locking, and 
recovery improvements without much "visible" to the outside world).

The main feature in progress on the metadata side is the metadata writeback 
cache (WBC) that will greatly improve single-client workloads in a single 
directory tree (e.g. untar and then build/process files in that tree).   That 
should help significantly with genomics and machine learning workloads that 
have this kind of usage pattern.

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] DNE v3 and directory inode changing

2023-03-23 Thread Andreas Dilger via lustre-discuss
The DNE auto-split functionality is disabled by default and not fully completed 
(e.g. preserve inode numbers) because it had issues with significant 
performance impact/latency while splitting a directory that was currently in 
use (which is exactly when you would want to use it), so I wouldn't recommend 
to use it at this time.

Instead, development efforts were focussed on DNE MDT space balancing.  This 
adds two different features that allow all of the MDTs in a filesystem to be 
used without user/admin intervention (though it is still possible to manually 
create directories on specific MDTs as before).
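
For reference, manual placement/striping looks like this (paths are examples):

lfs mkdir -i 2 /mnt/lustre/dir_on_mdt2    # pin a new directory to MDT0002
lfs mkdir -c 4 /mnt/lustre/wide_dir       # stripe a directory over 4 MDTs
lfs getdirstripe /mnt/lustre/wide_dir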

The "round-robin" MDT selection ("lfs setdirstripe -D --max-depth-rr=N -c 1 -i 
-1") for top-level directories (enabled for the top 3 levels of the filesystem 
by default) will, as the name suggests, round robin new directories across all 
of the available MDTs, when their space is evenly balanced (within 5% free 
space*inodes by default).  That is important to distribute *new* directories 
across MDTs in new filesystems when e.g. .../home/$user or .../project/$project 
or .../scratch/$user are being created.

The "space balance" MDT selection ("lctl set_param lmv.*.qos_threshold_rr=N" on 
the *CLIENT*) kicks in when MDT space usage becomes imbalanced (free 
space*inodes difference above 5% by default), and then starts selecting the MDT 
for *new* directories based on the ratio of free space*inodes.  That allows the 
MDTs to return toward balance over time, without causing a performance 
imbalance when it isn't necessary.
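
For example, to inspect or adjust the client-side threshold (the value 10 is
only an example):

lctl get_param lmv.*.qos_threshold_rr
lctl set_param lmv.*.qos_threshold_rr=10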

Note that both of these heuristics operate on *single-stripe directories* and 
not regular files, so the MDT balance will not be perfect if some directory 
tree has millions more files/subdirectories than another.  However, the main 
issue being avoided is the *very* common case of MDT0000 getting full and 
MDT0001..N being (almost) totally unused.  These features also make the MDT 
*usage* balance also pretty good as a result, so it is a win-win.   For most 
filesystems, the MDT capacity is not the limiting factor (it only makes up a 
few percent of the total storage).

Cheers, Andreas

On Mar 23, 2023, at 15:31, Bertschinger, Thomas Andrew Hjorth via 
lustre-discuss <lustre-discuss@lists.lustre.org> wrote:

Hello,

We've been experimenting with DNEv3 recently and have run into this issue: 
https://jira.whamcloud.com/browse/LU-7607 where the directory inode number 
changes after auto-split.

In addition to the problem noted with backups that track the inode number, we 
have found that file access through a previously open file descriptor is broken 
post migration. This can occur when a shell's CWD is the affected directory. 
For example:

mds0 # lctl get_param 
mdt.mylustre-MDT0000.{dir_split_count,enable_dir_auto_split}
mdt.mylustre-MDT0000.dir_split_count=100
mdt.mylustre-MDT0000.enable_dir_auto_split=1

client $ pwd
/mnt/mylustre/dnetest
client $ for i in {0..100}; do touch file$i; done
client $ ls
ls: cannot open directory '.': Operation not permitted
client $ ls file0
ls: cannot access 'file0': No such file or directory
client $ ls /mnt/mylustre/dnetest/file0
/mnt/mylustre/dnetest/file0

(This is from a build of the current master branch.)

We believe users will certainly encounter this, because users monitor output 
directories of jobs as they run. Therefore this issue is a dealbreaker with 
DNEv3 for us.

I wanted to ask about the status of the linked issue, since it looks like it 
hasn't been updated in a while. Would the resolution to LU-7607 be expected to 
fix the file access problem I've noted here or will this require additional 
changes to resolve?

Thanks!

- Thomas Bertschinger
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Lustre project quotas and project IDs

2023-03-22 Thread Andreas Dilger via lustre-discuss
Of course my preference would be a contribution to improving the name-projid 
mapping in the "lfs project" command under LU-13335 so this would also help 
other Lustre users manage their project IDs.

One proposal I had in LU-13335 that I would welcome feedback on was if a name 
or projid did not exist in /etc/projid that the lfs tool would fall back to 
doing a name/uid lookup in /etc/passwd (or other database as configured in 
/etc/nsswitch.conf).

This would avoid the need to duplicate the full UID database in /etc/projid for 
the common case of projid = uid, and allows using LDAP, NIS, AD, sssd, etc. for 
projid lookup without them having explicit support for a projid database.

This behavior could optionally be configured with a "commented-out" directive 
at the start of /etc/projid, like:

 #lfs fallback: passwd

or "group" or "none".  If all the projects are defined in the passwd database, 
then potentially just this one line is needed in /etc/projid, or not at all if 
"passwd" is the default fallback.

Would this meet your need for using an external database, while still allowing 
your development efforts to produce a solution that helps the Lustre community?

Of course at some point it would be desirable to have a dedicated projid 
database supported by glibc, but that would take much more time and effort 
to implement and deploy, while the passwd/group fallback can be handled 
internally by the lfs command.

Cheers, Andreas

On Mar 17, 2023, at 04:10, Passerini Marco  wrote:



Hi Andreas,


I'm talking about the order of ~10,000s of project IDs.

I've been thinking the same as you, that is, doing PROJID=1M + UID  etc. 
However, in our case, it might be better to rely on some scripting and an 
external DB, to keep track of the latest added ID, so that we could increment 
the highest value by 1 on new ID creation. The highest value could as well be 
looked up in:


/proc/fs/lustre/osd-ldiskfs/myfs-MDT0000/quota_slave_dt/acct_project

Regards,

Marco Passerini

____
From: Andreas Dilger 
Sent: Thursday, March 16, 2023 11:35:16 PM
To: Passerini Marco
Cc: lustre-discuss@lists.lustre.org
Subject: Re: [lustre-discuss] Lustre project quotas and project IDs

On Mar 16, 2023, at 04:50, Passerini Marco 
<marco.passer...@cscs.ch> wrote:

By trial and error, I found that, when using project quotas, the maximum ID 
available is 4294967294. Is this correct?

Yes, the "-1" ID is reserved for error conditions.

If I assign quota to a lot of project IDs, is the performance expected to go 
down more than having just a few or is it fixed?

Probably if you have millions or billions of different IDs there would be some 
performance loss, at a minimum just because the quota files will consume a lot 
of disk space and memory to manage.  I don't think we've done specific scaling 
testing for the number of project IDs, but it has worked well for the 
"expected" number of different IDs at production sites (in the 10,000s).

I've recommended to a few sites that want to have a "unified" quota to use e.g. 
PROJID=UID for user directories, PROJID=1M + UID for scratch, and PROJID=2M+N 
for independent projects, just to make the PROJIDs easily identified (at least 
until someone implements LU-13335 to do projid<->name mapping).
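
As a sketch of that convention (the IDs and paths here are only examples), 
tagging trees and setting a limit would look like:

# lfs project -p 1234 -r -s /mnt/testfs/home/user1234
# lfs project -p 2000001 -r -s /mnt/testfs/project/alpha
# lfs setquota -p 2000001 -B 10T /mnt/testfs

The "-s" flag sets the inherit bit so new files created below each directory 
pick up the same project ID.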

How many IDs were you thinking of using?

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Repeated ZFS panics on MDT

2023-03-17 Thread Andreas Dilger via lustre-discuss
It's been a while since I've worked with ZFS servers, but one old chestnut that 
caused problems with ZFS 0.7 on the MDTs was the variable dnode size feature. 

I believe there was a tunable, something like "dnodesize=auto" that caused 
problems, and this could be changed to "dnodesize=1024" or similar to avoid the 
issue. You'll have to check some Lustre-discuss archives and/or the ZFS docs to 
confirm, but that was the most common issue.
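
If that is the cause, checking and pinning the property would look something 
like this (pool/dataset names are placeholders; note that ZFS spells the 
fixed sizes as "1k".."16k" rather than "1024"):

# zfs get dnodesize mdtpool/mdt0
# zfs set dnodesize=1k mdtpool/mdt0

Only newly created dnodes are affected by the property change, so existing 
objects keep whatever size they were allocated with.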

Alternately, you could try updating to ZFS 0.8 or later to see if this avoids 
the issue. 

Cheers, Andreas

> On Mar 17, 2023, at 05:39, Mountford, Christopher J. (Dr.) via lustre-discuss 
>  wrote:
> 
> Unfortunately this problem seems to be getting worse, to the point where ZFS 
> panics immediately after Lustre recovery completes when the system is under 
> load.
> 
> Luckily this happened on our /home filesystem which is relatively small. We 
> are rebuilding onto spare hardware so we can return the system to production 
> whilst we investigate.
> 
> The panics seem to happen during writes - under low load we see 1 every few 
> hours, one trigger appears to be loading or reloading a page in firefox on a 
> login node, but this is definitely not the only trigger (we've also seen the 
> panic when login nodes were all down).
> 
> Stack trace seems fairly consistent across all panics we've seen:
> 
> Mar 17 09:44:46 amds01b kernel: PANIC at zfs_vfsops.c:584:zfs_space_delta_cb()
> Mar 17 09:44:46 amds01b kernel: Showing stack for process 10494
> Mar 17 09:44:46 amds01b kernel: CPU: 8 PID: 10494 Comm: mdt00_012 Tainted: P  
>  OE     3.10.0-1160.49.1.el7_lustre.x86_64 #1
> Mar 17 09:44:46 amds01b kernel: Hardware name: HPE ProLiant DL360 
> Gen10/ProLiant DL360 Gen10, BIOS U32 02/09/2023
> Mar 17 09:44:46 amds01b kernel: Call Trace:
> Mar 17 09:44:46 amds01b kernel: [] dump_stack+0x19/0x1b
> Mar 17 09:44:46 amds01b kernel: [] spl_dumpstack+0x44/0x50 
> [spl]
> Mar 17 09:44:46 amds01b kernel: [] spl_panic+0xc9/0x110 
> [spl]
> Mar 17 09:44:46 amds01b kernel: [] ? 
> dbuf_rele_and_unlock+0x34c/0x4c0 [zfs]
> Mar 17 09:44:46 amds01b kernel: [] ? 
> getrawmonotonic64+0x34/0xc0
> Mar 17 09:44:46 amds01b kernel: [] ? dmu_zfetch+0x393/0x520 
> [zfs]
> Mar 17 09:44:46 amds01b kernel: [] ? 
> dbuf_rele_and_unlock+0x283/0x4c0 [zfs]
> Mar 17 09:44:46 amds01b kernel: [] ? __cv_init+0x41/0x60 
> [spl]
> Mar 17 09:44:46 amds01b kernel: [] 
> zfs_space_delta_cb+0x9c/0x200 [zfs]
> Mar 17 09:44:46 amds01b kernel: [] 
> dmu_objset_userquota_get_ids+0x154/0x440 [zfs]
> Mar 17 09:44:46 amds01b kernel: [] dnode_setdirty+0x38/0xf0 
> [zfs]
> Mar 17 09:44:46 amds01b kernel: [] 
> dnode_allocate+0x18c/0x230 [zfs]
> Mar 17 09:44:46 amds01b kernel: [] 
> dmu_object_alloc_dnsize+0x34b/0x3e0 [zfs]
> Mar 17 09:44:46 amds01b kernel: [] 
> __osd_object_create+0x82/0x170 [osd_zfs]
> Mar 17 09:44:46 amds01b kernel: [] ? 
> spl_kmem_zalloc+0xd8/0x180 [spl]
> Mar 17 09:44:46 amds01b kernel: [] osd_mkreg+0x7d/0x210 
> [osd_zfs]
> Mar 17 09:44:46 amds01b kernel: [] osd_create+0x336/0xb10 
> [osd_zfs]
> Mar 17 09:44:46 amds01b kernel: [] 
> lod_sub_create+0x1f5/0x480 [lod]
> Mar 17 09:44:46 amds01b kernel: [] lod_create+0x69/0x340 
> [lod]
> Mar 17 09:44:46 amds01b kernel: [] ? 
> osd_trans_create+0x410/0x410 [osd_zfs]
> Mar 17 09:44:46 amds01b kernel: [] 
> mdd_create_object_internal+0xc3/0x300 [mdd]
> Mar 17 09:44:46 amds01b kernel: [] 
> mdd_create_object+0x7b/0x820 [mdd]
> Mar 17 09:44:46 amds01b kernel: [] mdd_create+0xdd8/0x14a0 
> [mdd]
> Mar 17 09:44:46 amds01b kernel: [] 
> mdt_reint_open+0x2588/0x3970 [mdt]
> Mar 17 09:44:46 amds01b kernel: [] ? 
> check_unlink_entry+0x19/0xd0 [obdclass]
> Mar 17 09:44:46 amds01b kernel: [] ? 
> ucred_set_audit_enabled.isra.15+0x22/0x60 [mdt]
> Mar 17 09:44:46 amds01b kernel: [] mdt_reint_rec+0x83/0x210 
> [mdt]
> Mar 17 09:44:46 amds01b kernel: [] 
> mdt_reint_internal+0x6e3/0xaf0 [mdt]
> Mar 17 09:44:46 amds01b kernel: [] ? 
> mdt_intent_fixup_resent+0x36/0x220 [mdt]
> Mar 17 09:44:46 amds01b kernel: [] 
> mdt_intent_open+0x82/0x3a0 [mdt]
> Mar 17 09:44:46 amds01b kernel: [] 
> mdt_intent_opc+0x1ba/0xb50 [mdt]
> Mar 17 09:44:46 amds01b kernel: [] ? 
> lustre_swab_ldlm_policy_data+0x30/0x30 [ptlrpc]
> Mar 17 09:44:46 amds01b kernel: [] ? 
> mdt_intent_fixup_resent+0x220/0x220 [mdt]
> Mar 17 09:44:46 amds01b kernel: [] 
> mdt_intent_policy+0x1a4/0x360 [mdt]
> Mar 17 09:44:46 amds01b kernel: [] 
> ldlm_lock_enqueue+0x376/0x9b0 [ptlrpc]
> Mar 17 09:44:46 amds01b kernel: [] ? 
> cfs_hash_bd_add_locked+0x67/0x90 [libcfs]
> Mar 17 09:44:46 amds01b kernel: [] ? 
> cfs_hash_add+0xbe/0x1a0 [libcfs]
> Mar 17 09:44:46 amds01b kernel: [] 
> ldlm_handle_enqueue0+0xa86/0x1620 [ptlrpc]
> Mar 17 09:44:46 amds01b kernel: [] ? 
> lustre_swab_ldlm_lock_desc+0x30/0x30 [ptlrpc]
> Mar 17 09:44:46 amds01b kernel: [] tgt_enqueue+0x62/0x210 
> [ptlrpc]
> Mar 17 09:44:46 amds01b kernel: [] 
> tgt_request_handle+0xada/0x1570 [ptlrpc]
> Mar 17 09:44:46 amds01b kernel: 

Re: [lustre-discuss] Lustre project quotas and project IDs

2023-03-16 Thread Andreas Dilger via lustre-discuss
On Mar 16, 2023, at 04:50, Passerini Marco 
<marco.passer...@cscs.ch> wrote:

By trial and error, I found that, when using project quotas, the maximum ID 
available is 4294967294. Is this correct?

Yes, the "-1" ID is reserved for error conditions.

If I assign quota to a lot of project IDs, is the performance expected to go 
down more than having just a few or is it fixed?

Probably if you have millions or billions of different IDs there would be some 
performance loss, at a minimum just because the quota files will consume a lot 
of disk space and memory to manage.  I don't think we've done specific scaling 
testing for the number of project IDs, but it has worked well for the 
"expected" number of different IDs at production sites (in the 10,000s).

I've recommended to a few sites that want to have a "unified" quota to use e.g. 
PROJID=UID for user directories, PROJID=1M + UID for scratch, and PROJID=2M+N 
for independent projects, just to make the PROJIDs easily identified (at least 
until someone implements LU-13335 to do projid<->name mapping).

How many IDs were you thinking of using?

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Node Failure in Lustre

2023-03-15 Thread Andreas Dilger via lustre-discuss
No, because the remote-attached SSDs are part of the ZFS pool, and any drive 
failures at that level are the responsibility of ZFS to manage (e.g. with 
RAID); it is up to you to have system monitors in place to detect this case 
and alert you to the drive failures.  This is no different than if the drives 
inside a RAID enclosure fail.

Lustre cannot magically know when drives below the filesystem layer have 
problems.  It only cares about being able to access the whole filesystem, and 
that the filesystem is intact even in the case of drive failures.

Cheers, Andreas

> On Mar 15, 2023, at 01:26, Nick dan via lustre-discuss 
>  wrote:
> 
> 
> Hi
> 
> There is a situation where disks from multiple servers are sent to a main 
> server (Lustre storage). A zpool is created from the SSDs and mkfs.lustre is 
> done using zfs as a backend file system. Lustre client is also connected. If 
> one of the nodes from where the SSDs are sent goes down, will the node 
> failure be handled?
> 
> Thanks and regards,
> Nick Dan
> ___
> lustre-discuss mailing list
> lustre-discuss@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Slow Lustre traffic failover issue

2023-03-10 Thread Andreas Dilger via lustre-discuss
On Mar 4, 2023, at 02:50, 覃江龙 via lustre-discuss 
 wrote:
> 
> Dear Developer,
> I hope this message finds you well. I am currently working with a Lustre file 
> system installed on two nodes, with a mounted client and NFS connection to 
> the Lustre client directory. When I generated traffic into the Lustre 
> directory and one of the nodes failed, the MGS and OST services switched to 
> the second node and it took five to six minutes for the traffic to resume. 
> However, when I switched to using an ext3 file system, the traffic resumed in 
> only one to two minutes.
> 
> I was wondering if you could shed some light on why the Lustre switch is 
> taking longer, and how I could potentially address this issue. Thank you for 
> your time and expertise.

Not that I want to discourage Lustre usage, but if you can run your workload 
from a single-node ext3 (really should be ext4) filesystem then that will be 
much less complex than using Lustre.

Lustre is a scalable distributed filesystem and needs to deal with a far more 
difficult environment than a single-node ext3 filesystem.  It works with up to 
thousands of servers, and up to low tens of thousands of clients sharing a 
single namespace.

Using a 2-node Lustre filesystem to re-export NFS is mostly missing the value 
of Lustre, which is scalability and high performance.  If you want to use 
Lustre, you would be much better off to mount the filesystem directly with 
Lustre on the clients.  This will give you better performance, better data 
consistency vs. NFS, and less complexity compared to Lustre + NFS re-export.  
The main reason for NFS re-export of Lustre is to allow a few clients (e.g. 
data capture hardware, non-Linux clients) to access a Lustre filesystem that is 
mostly used by native Lustre clients.
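
For comparison, mounting natively on a client is a single command (the MGS 
NID and fsname here are placeholders):

client# mount -t lustre 192.168.2.1@tcp:/lustre /mnt/lustre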

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Renaming or Moving directories on Lustre?

2023-02-27 Thread Andreas Dilger via lustre-discuss
On Feb 27, 2023, at 11:57, Grigory Shamov 
<grigory.sha...@umanitoba.ca> wrote:

Hi All,

What happens if a directory on Lustre FS gets moved with a regular CentOS7 mv 
command, within the same filesystem? On CentOS 7, using mv from the distro, 
like this, as root:

mv /project/TEMP/user  /project/XYZ/user

It looks like the content gets copied entirely, which for large data takes a 
large amount of time.
Is there a way to rename the Lustre directories (changing the name of the top 
directory, only without moving every object in these directories)?  Thanks!

Renaming a file or subdirectory tree between "regular" directories in Lustre 
works as you would expect for a local filesystem, even if the directories are 
on different MDTs.  What you are seeing (full copy of contents between 
directories) is really a result of the implementation/design of project quotas, 
and not directly a Lustre problem.  The same would happen if you have two 
directories using two different project IDs and the "PROJINHERIT" flag set with 
ext4 or XFS, since they also return "-EXDEV" if trying to move (rename) a file 
between directories that do not have the same project ID, and that causes "mv" 
to copy the whole directory tree.

Running the ext4 "mv" under strace shows this:

# df -T /mnt/tmp
Filesystem Type 1K-blocks  Used Available Use% Mounted on
/dev/mapper/vg_test-lvtest ext4  1633778852  15482492   1% /mnt/tmp
# mkdir /mnt/tmp/{dir1,dir2}
# chattr +P -p 1000 /mnt/tmp/dir1
# chattr +P -p 2000 /mnt/tmp/dir2
# cp /etc/hosts /mnt/tmp/dir1
# lsattr /mnt/tmp/dir1
--eP-- /mnt/tmp/dir1/hosts
# ls -li /mnt/tmp/dir1
total 8
655365 8 -rw-r--r--. 1 root root 7424 Oct 18 22:42 hosts
# strace mv /mnt/tmp/dir1/hosts /mnt/tmp/dir2/hosts
:
renameat2(AT_FDCWD, "/mnt/tmp/dir1/hosts", AT_FDCWD, "/mnt/tmp/dir2/hosts", 
RENAME_NOREPLACE) = -1 EXDEV (Invalid cross-device link)
stat("/mnt/tmp/dir2/hosts", 0x78a6c2b0) = -1 ENOENT (No such file or 
directory)
lstat("/mnt/tmp/dir1/hosts", {st_mode=S_IFREG|0644, st_size=7424, ...}) = 0
newfstatat(AT_FDCWD, "/mnt/tmp/dir2/hosts", 0x78a6bf90, 
AT_SYMLINK_NOFOLLOW) = -1 ENOENT (No such file or directory)
unlink("/mnt/tmp/dir2/hosts")   = -1 ENOENT (No such file or 
directory)
openat(AT_FDCWD, "/mnt/tmp/dir1/hosts", O_RDONLY|O_NOFOLLOW) = 3
openat(AT_FDCWD, "/mnt/tmp/dir2/hosts", O_WRONLY|O_CREAT|O_EXCL, 0600) = 4
read(3, "##\n# Host Database\n#\n# Do not re"..., 131072) = 7424
write(4, "##\n# Host Database\n#\n# Do not re"..., 7424) = 7424
:
# lsattr -p /mnt/tmp/dir2
2000 --eP-- /mnt/tmp/dir2/hosts
# ls -li /mnt/tmp/dir2
total 8
786435 8 -rw-r--r--. 1 root root 7424 Oct 18 22:42 hosts

The reason for this limitation is that there is no way to atomically update the 
quota between the two project IDs when a whole subdirectory tree is being moved 
between projects.  There might be thousands of subdirectories and millions of 
files that are being moved, and the project ID needs to be updated on all of 
those files and directories.  This is too large to do atomically in a single 
filesystem transaction.  Rather than try to solve this directly in the kernel, 
the decision of the XFS developers (copied by ext4) is that cross-project 
renames will not be done by the kernel and instead be handled in userspace by 
the "mv" utility, the same way that renames across different filesystems are 
handled.


In Lustre 2.15.0 and later, this cross-project rename constraint has been 
removed for *regular file* renames between directories with different project 
IDs.  This means the file is moved between directories and the project ID and 
associated quota accounting is updated in a single transaction without doing a 
data copy.  However, *directory* renames with PROJINHERIT still have this issue.

To work around this behavior, it is possible to use "chattr -p" (or "lfs 
project -p", they do the same thing) to change the project ID of the source 
files and directories *before* they are renamed so that the file data copy does 
not need to be done, and just the filenames can be moved.
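
A minimal sketch of that workaround, reusing the project IDs from the example 
above (run on Lustre; "lfs project -r -s" retags the whole source tree first):

# lfs project -p 2000 -r -s /mnt/testfs/dir1
# mv /mnt/testfs/dir1/hosts /mnt/testfs/dir2/hosts

Once the source matches the target's project ID, rename() no longer returns 
EXDEV and "mv" does a pure rename instead of a data copy.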

It might be possible to patch "mv" so that instead of bailing on "rename()" 
after the first EXDEV return, it creates the target directory and then tries to 
rename the files within the source directory to the target, before it does the 
file copy.  It is likely that ext4 could also be patched to allow regular file 
renames without returning EXDEV.

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Question about lustre deduplication?

2023-02-27 Thread Andreas Dilger via lustre-discuss
On Feb 27, 2023, at 05:59, yuehui gan via lustre-discuss 
<lustre-discuss@lists.lustre.org> wrote:

hello all

Does Lustre have deduplication? Is this in the development plan? Thanks.

Lustre on ZFS has deduplication at the ZFS level.  There is no deduplication 
across OSTs.

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Access times for file (file heat)

2023-02-25 Thread Andreas Dilger via lustre-discuss
Anna,
for monitoring server storage access, the client-side file heat is not actually 
very useful, because (a) it is spread across all of the clients, and (b) it 
shows the client-side access patterns (which may be largely from cache) and not 
the actual server storage access which is what is important for tiering 
decisions.

Consider if a client reads a file once from the storage, and then accesses it 
1000x from cache, the file may be considered "hot" by the client but it doesn't 
really matter what kind of storage the file is on since the storage only sees a 
single read.

Instead of the client-side file heat there is a different mechanism on the OST 
(lustre/utils/ofd_access_log_reader.c, ALR), which is a lightweight mechanism to 
aggregate all storage access into a producer/consumer circular log that is 
consumed by a userspace process. It is up to the userspace process to aggregate 
these log records across all of the OSTs to make decisions about which files 
are "hot" and which are "cold".  The ALR records are described in 
lustre/include/uapi/linux/lustre/lustre_access_log.h:

struct ofd_access_entry_v1 {
        struct lu_fid   oae_parent_fid;    /* 16 - parent FID of the object */
        __u64           oae_begin;         /* 24 - start offset of the extent */
        __u64           oae_end;           /* 32 - end offset of the extent */
        __u64           oae_time;          /* 40 - time of the access */
        __u32           oae_size;          /* 44 - size of the IO */
        __u32           oae_segment_count; /* 48 - number of IO segments */
        __u32           oae_flags;         /* 52 - enum ofd_access_flags (read/write) */
        __u32           oae_reserved1;     /* 56 */
        __u32           oae_reserved2;     /* 60 */
        __u32           oae_reserved3;     /* 64 */
};

The records contain the parent MDT FID (to allow aggregation of IOs across 
multiple OST objects of the same file), flags that describe if it is a read 
or write, and the start/end of the extent read/written, so that the consumer 
can decide if the IOs are well-formed for the storage (e.g. large or 
sequential reads/writes on an HDD are OK vs. small/random reads/writes on an 
HDD are not OK).  The ALR queue is transient, so it is up to the consumer to 
make any decisions about the current IO patterns on files.
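
A rough sketch of consuming this on an OSS (the reader binary ships in 
lustre/utils; exact options vary by release, so check its --help, and the FID 
below is only a placeholder):

oss# ofd_access_log_reader > /tmp/ost_access.log
client# lfs fid2path /mnt/testfs [0x200000401:0x1:0x0]

The first command drains the circular log into a text file; "lfs fid2path" 
then maps a hot parent FID from the records back to a pathname so tiering 
decisions can be made per file.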

Cheers, Andreas

On Feb 22, 2023, at 14:16, Anna Fuchs 
mailto:anna.fu...@uni-hamburg.de>> wrote:

Thank you.

Is there any documentation for the values?
Client-side means only statistics for the file remaining in the client cache? 
Not lifetime statistics?
Are there any plans to work further on this feature?

I think of several use cases when knowing these stats.
Cold data could be moved to archive like slow tape without relying on access 
time.
Hot blocks could be replicated or moved to faster caches and lot more 
optimizations.

Best regards
Anna



Am 18.02.2023 um 21:57 schrieb Andreas Dilger:

Anna, there was a client-side file heat mechanism added a few years ago, but I 
don't know if it is fully functional today.

lctl get_param llite.*.*heat*
llite.myth-979380fc1800.file_heat=1
llite.myth-979380fc1800.heat_decay_percentage=80
llite.myth-979380fc1800.heat_period_second=60

And then "lfs heat_get " to dump the file heat,  it there haven't been 
any good tools developed yet to list top heat files.

Cheers, Andreas



On Feb 7, 2023, at 08:56, Anna Fuchs via lustre-discuss 
<lustre-discuss@lists.lustre.org> wrote:

Hello,

is there a way to see how many times a file has been accessed ever (like a heat 
map)?

Thanks
Anna

--
Anna Fuchs
Universität Hamburg
https://wr.informatik.uni-hamburg.de

anna.fu...@informatik.uni-hamburg.de
https://wr.informatik.uni-hamburg.de/people/anna_fuchs



___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org



--
Anna Fuchs
Universität Hamburg
https://wr.informatik.uni-hamburg.de

anna.fu...@informatik.uni-hamburg.de
https://wr.informatik.uni-hamburg.de/people/anna_fuchs

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] lfs setstripe with stripe_count=0

2023-02-24 Thread Andreas Dilger via lustre-discuss
On Feb 21, 2023, at 10:26, John Bauer 
<bau...@iodoctors.com> wrote:

Something doesn't make sense to me when using lfs setstripe when specifying 0 
for the stripe_count .  This first command works as expected.  The pool is the 
one specified, 2_hdd, and the -c 0 results in a stripe_count of 1 which I 
believe is the default for the file-system default ( per the lfs setstripe 
manpage ).

pfe24.jbauer2 1224> lfs setstripe -c 0 -p 2_hdd /nobackupp18/jbauer2/testing/
pfe24.jbauer2 1225> lfs getstripe -d /nobackupp18/jbauer2/testing/
stripe_count:  1 stripe_size:   1048576 pattern:   raid0 stripe_offset: -1 
pool:  2_hdd


If I do not specify the pool and only specify the stripe_count as 0, the 
resulting striping is what I believe is the pfl striping from the root 
directory of the file system.  Is this what is expected?  I would expect a 
stripe_count of 1, as above, with the pool from the parent directory's striping.

pfe24.jbauer2 1226> lfs setstripe -c 0 /nobackupp18/jbauer2/testing/

If you specify "-c 0" the "specified" file layout ends up being the same "all 
default values" as if nothing was specified for the layout at al (i.e. "-c 0 -i 
-1 -S 0")l, which results in the default filesystem layout being used.  If any 
non-default value is specified then the rest of the layout is filled in from 
the default values.

pfe24.jbauer2 1227> lfs getstripe -d /nobackupp18/jbauer2/testing/
  lcm_layout_gen:0
  lcm_mirror_count:  1
  lcm_entry_count:   3
lcme_id: N/A
lcme_mirror_id:  N/A
lcme_flags:  prefer
lcme_extent.e_start: 0
lcme_extent.e_end:   268435456
  stripe_count:  1   stripe_size:   16777216 pattern:   raid0   
stripe_offset: -1 pool:  ssd-pool

lcme_id: N/A
lcme_mirror_id:  N/A
lcme_flags:  prefer
lcme_extent.e_start: 268435456
lcme_extent.e_end:   5368709120
  stripe_count:  -1   stripe_size:   16777216 pattern:   raid0  
 stripe_offset: -1 pool:  ssd-pool

lcme_id: N/A
lcme_mirror_id:  N/A
lcme_flags:  0
lcme_extent.e_start: 5368709120
lcme_extent.e_end:   EOF
  stripe_count:  16   stripe_size:   16777216 pattern:   raid0  
 stripe_offset: -1 pool:  hdd-pool

pfe24.jbauer2 1228>
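
Incidentally, as a sketch of the second case above: specifying any single 
non-default value, e.g. just the pool, should give the same stripe_count=1 
plus pool result as the first command, without needing "-c 0" at all:

pfe24.jbauer2> lfs setstripe -p 2_hdd /nobackupp18/jbauer2/testing/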

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Access times for file (file heat)

2023-02-18 Thread Andreas Dilger via lustre-discuss
Anna, there was a client-side file heat mechanism added a few years ago, but I 
don't know if it is fully functional today.

lctl get_param llite.*.*heat*
llite.myth-979380fc1800.file_heat=1
llite.myth-979380fc1800.heat_decay_percentage=80
llite.myth-979380fc1800.heat_period_second=60

And then "lfs heat_get " to dump the file heat,  it there haven't been 
any good tools developed yet to list top heat files. 
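
Until then, a crude shell loop can approximate a "top heat" listing (this 
assumes heat_get prints one "name: value" pair per line, which may differ 
between releases, so treat it as a sketch only):

client$ find /mnt/testfs/dir -type f | while read f; do
            echo "$(lfs heat_get "$f" | awk '{s+=$2} END {print s}') $f"
        done | sort -rn | head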

Cheers, Andreas

> On Feb 7, 2023, at 08:56, Anna Fuchs via lustre-discuss 
>  wrote:
> 
> Hello,
> 
> is there a way to see how many times a file has been accessed ever (like a 
> heat map)?
> 
> Thanks
> Anna
> 
> -- 
> Anna Fuchs
> Universität Hamburg
> https://wr.informatik.uni-hamburg.de
> 
> anna.fu...@informatik.uni-hamburg.de
> https://wr.informatik.uni-hamburg.de/people/anna_fuchs
> 
> 
> 
> ___
> lustre-discuss mailing list
> lustre-discuss@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Full List of Required Open Lustre Ports?

2023-02-02 Thread Andreas Dilger via lustre-discuss
Ellis, the addition of dynamic conns_per_peer for TCP connections is relatively 
new.  There would be no performance "regression" against earlier Lustre 
releases (which always had the equivalent of conns_per_peer=1), just not 
additional performance gains for high-speed Ethernet interfaces.

The port selection is somewhat dependent on the software running on the server. 
 There is no guarantee that 1023/1022/... will be available for Lustre to use, 
if other TCP listeners have already been opened on those ports, unless you use 
something like "portreserve" (IIRC), but this is tricky to get right with 
kernel-side listeners (I don't recall if we ever got that right).

Whether you open those additional ports or just set conns_per_peer=1 is up to 
you.  You would definitely want to limit the accept rule so that the source or 
target is port 988.
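
As a sketch of the two options (the module option name is from the 2.15 
ksocklnd, and the UFW rule is illustrative, so adjust the subnet to taste):

# pin one connection per peer so only the port-988 pairs are used:
echo 'options ksocklnd conns_per_peer=1' > /etc/modprobe.d/ksocklnd.conf
# or allow the reserved-port range only when the far end is port 988:
ufw allow proto tcp from 10.0.0.0/8 port 988 to any port 512:1023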

Cheers, Andreas

On Feb 1, 2023, at 15:18, Ellis Wilson via lustre-discuss 
<lustre-discuss@lists.lustre.org> wrote:

Hi folks,

We've seen some weird stuff recently with UFW/iptables dropping packets on our 
OSS and MDS nodes.  We are running 2.15.1.  Example:

[   69.472030] [UFW BLOCK] IN=eth0 OUT= MAC= SRC= DST= LEN=52 
TOS=0x00 PREC=0x00 TTL=64 ID=58224 DF PROTO=TCP SPT=1022 DPT=988 WINDOW=510 
RES=0x00 ACK FIN URGP=0

[11777.280724] [UFW BLOCK] IN=eth0 OUT= MAC= SRC= DST= LEN=64 
TOS=0x00 PREC=0x00 TTL=64 ID=44206 DF PROTO=TCP SPT=988 DPT=1023 WINDOW=509 
RES=0x00 ACK URGP=0

Previously, we were only allowing 988 bidirectionally on BOTH clients and 
servers.  This was based on guidance from the Lustre manual.  From the above 
messages it appears we may need to expand that range.  This thread discusses it:
https://www.mail-archive.com/lustre-discuss@lists.lustre.org/msg17229.html

Based on that thread and some code reading, it appears that without explicit 
configuration of conns_per_peer, the number of extra ports potentially 
required is autotuned (ksocklnd_speed2cpp).  E.g., if we have a node with a 
50Gbps interface, we may need up to 3 ports open to accommodate the extra 
connections.  These appear to be selected beginning at 1023 and going down as 
far as 512.

Questions:
1. If we do not open up more than 988, are there known performance issues for 
machines at or below say, 50Gbps?  It does seem that with these closed we don't 
have correctness or visible performance problems, so there must be some 
fallback mechanism at play.
2. Can we just open 1023 to 1021 for a 50GigE machine?  Or are there situations 
where binding might fail and the algorithm could potentially attempt to create 
sockets all the way down to 512?
3. Regardless of the answer to #2, do we need to open these ports on all client 
and server nodes, or can we get away with just server nodes?
4. Do these need to be opened just for egress from the node in question, or 
bidirectionally?

Thanks in advance!

Best,

ellis
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Mistake while removing an OST

2023-02-02 Thread Andreas Dilger via lustre-discuss
You should follow the documented process, that's why it is documented.  All 
targets need to be unmounted to make it work properly.
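
Abbreviated, the documented sequence looks like this (device names are 
placeholders; see "Regenerating Lustre Configuration Logs" in the manual):

# unmount clients, then all targets, then on each server:
mgs# tunefs.lustre --writeconf /dev/mgtdev
mds# tunefs.lustre --writeconf /dev/mdtdev
oss# tunefs.lustre --writeconf /dev/ostdev
# remount in order: MGT, then MDT(s), then OSTs, then clients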

On Feb 2, 2023, at 01:08, BALVERS Martin 
<martin.balv...@danone.com> wrote:

Hi Andreas,

Thank you for answering.

Can I just run the ‘tunefs.lustre --writeconf /dev/ost_device’ on the affected 
server (serving OST0002 in my case) on a live filesystem? So unmount OST0002, 
issue writeconf command and mount OST0002 again?
Or should I follow the procedure as described in 
https://doc.lustre.org/lustre_manual.xhtml#lustremaint.regenerateConfigLogs, 
and take the whole filesystem offline and run the writeconf command on all 
servers?

To be clear, my situation is this.
I had a server with OST0003 that needed to be removed. That worked, but in the 
process I deleted the add_uuid and attach indexes for OST0002. OST0002 is the 
one I need to keep.

Regards,
Martin

From: Andreas Dilger <adil...@whamcloud.com>
Sent: Wednesday, February 1, 2023 18:16
To: BALVERS Martin <martin.balv...@danone.com>
Cc: lustre-discuss@lists.lustre.org
Subject: Re: [lustre-discuss] Mistake while removing an OST

** Caution - this is an external email **
You should just be able to run the "writeconf" process to regenerate the config 
logs. The removed OST will not re-register with the MGS, but all of the other 
servers will, so it should be fine.
Cheers, Andreas


On Feb 1, 2023, at 03:48, BALVERS Martin via lustre-discuss 
<lustre-discuss@lists.lustre.org> wrote:

Hi,

I have a defective OSS with a single OST that I was trying to remove from the 
lustre filesystem completely (2.15.1). I was following 
https://doc.lustre.org/lustre_manual.xhtml#lustremaint.remove_ost
I had drained the OST and was now using the commands with llog_cancel to remove 
the config. This is where it went wrong. I first deleted attach, setup, add_osc 
indexes for the ‘client’ and then needed to also delete those for MDT and 
MDT0001, but I accidentally removed two more indexes from the ‘client’.
Now I have incomplete client llogs for one OST, I am missing the add_uuid and 
attach lines for OST0002.

[root@mds ~]# lctl --device MGS llog_print lustre-client
- { index: 34, event: add_uuid, nid: 192.168.2.3@tcp(0x2c0a80203), 
node: 192.168.2.3@tcp }
- { index: 35, event: attach, device: lustre-OST0001-osc, type: osc, UUID: 
lustre-clilov_UUID }
- { index: 36, event: setup, device: lustre-OST0001-osc, UUID: 
lustre-OST0001_UUID, node: 192.168.2.3@tcp }
- { index: 37, event: add_osc, device: lustre-clilov, ost: lustre-OST0001_UUID, 
index: 1, gen: 1 }

- { index: 42, event: setup, device: lustre-OST0002-osc, UUID: 
lustre-OST0002_UUID, node: 192.168.2.4@tcp }
- { index: 43, event: add_osc, device: lustre-clilov, ost: lustre-OST0002_UUID, 
index: 2, gen: 1 }

Is there a way to recover from this ?

I hope someone can help.

Regards,
Martin Balvers
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

Re: [lustre-discuss] Mistake while removing an OST

2023-02-01 Thread Andreas Dilger via lustre-discuss
You should just be able to run the "writeconf" process to regenerate the config 
logs. The removed OST will not re-register with the MGS, but all of the other 
servers will, so it should be fine.

Cheers, Andreas

On Feb 1, 2023, at 03:48, BALVERS Martin via lustre-discuss 
 wrote:


Hi,

I have a defective OSS with a single OST that I was trying to remove from the 
lustre filesystem completely (2.15.1). I was following 
https://doc.lustre.org/lustre_manual.xhtml#lustremaint.remove_ost
I had drained the OST and was now using the commands with llog_cancel to remove 
the config. This is where it went wrong. I first deleted attach, setup, add_osc 
indexes for the ‘client’ and then needed to also delete those for MDT0000 and 
MDT0001, but I accidentally removed two more indexes from the ‘client’.
Now I have incomplete client llogs for one OST, I am missing the add_uuid and 
attach lines for OST0002.

[root@mds ~]# lctl --device MGS llog_print lustre-client
- { index: 34, event: add_uuid, nid: 192.168.2.3@tcp(0x2c0a80203), node: 
192.168.2.3@tcp }
- { index: 35, event: attach, device: lustre-OST0001-osc, type: osc, UUID: 
lustre-clilov_UUID }
- { index: 36, event: setup, device: lustre-OST0001-osc, UUID: 
lustre-OST0001_UUID, node: 192.168.2.3@tcp }
- { index: 37, event: add_osc, device: lustre-clilov, ost: lustre-OST0001_UUID, 
index: 1, gen: 1 }

- { index: 42, event: setup, device: lustre-OST0002-osc, UUID: 
lustre-OST0002_UUID, node: 192.168.2.4@tcp }
- { index: 43, event: add_osc, device: lustre-clilov, ost: lustre-OST0002_UUID, 
index: 2, gen: 1 }

Is there a way to recover from this ?

I hope someone can help.

Regards,
Martin Balvers
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Monitoring Lustre IOPS on OSTs

2023-01-24 Thread Andreas Dilger via lustre-discuss
Yes, each RPC will increment these stats counters by one. Traditional "IOPS" 
are measured with 4KB read or write, but in this case the IO sizes are variable.

Also, the client may aggregate multiple disjoint writes into a single RPC. This 
can be seen in the osd-ldiskfs.*.brw_stats as "discontiguous pages" so this 
might be considered multiple "IOs" in a single RPC.

Cheers, Andreas

On Jan 23, 2023, at 09:31, Passerini Marco  wrote:

I'd like to monitor the IOPS on the Lustre OSTs.


I have stats like this:


[root@xxx04 ~]# lctl get_param obdfilter.xxx-OST0000.stats
obdfilter.xxx-OST0000.stats=
snapshot_time 348287.096066602 secs.nsecs
start_time0.0 secs.nsecs
elapsed_time  348287.096066602 secs.nsecs
read_bytes3075 samples [bytes] 0 1134592 2891776 2568826650624
write_bytes   1312266 samples [bytes] 1 4194304 5424381521966 
4303559271585613940
read  3075 samples [usecs] 0 489 31040 3417458
write 1312266 samples [usecs] 1 1630262 6193336309 
3686004830368845
setattr   20373 samples [usecs] 1 61 133314 1090612
punch 4600 samples [usecs] 3 65 51155 732959
sync  4 samples [usecs] 443 997 2853 2250847
destroy   30777 samples [usecs] 11 42634 20978596 33571893182
create993 samples [usecs] 1 29995 534579 1518818913
statfs135519 samples [usecs] 0 24 610182 4231650
get_info  108 samples [usecs] 1 3092 6218 17971060
set_info  10047 samples [usecs] 1 22 83418 770166

From the docs https://wiki.lustre.org/Lustre_Monitoring_and_Statistics_Guide I 
see:

"""
For read_bytes and write_bytes:

First number = number of times (samples) the OST has handled a read or 
write.

"""


I guess this can be considered as OST IOPS?


Regards,

Marco Passerini

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] User find out OST configuration

2023-01-23 Thread Andreas Dilger via lustre-discuss
On Jan 23, 2023, at 10:01, Anna Fuchs 
<anna.fu...@uni-hamburg.de> wrote:

Thanks!

Is it planned to introduce some metric propagation to the user?
For advanced users who are benchmarking on remote systems, it remains unclear 
what performance to expect if they cannot access the underlying hardware 
metrics.  Sure, they can ask the admin to share the config, but it might be 
more convenient to be able to look it up.

Yes,  there is a longstanding open ticket to export some server statistics to 
the clients - https://jira.whamcloud.com/browse/LU-7880 "add performance 
statistics to obd_statfs".  As the summary describes, this would export basic 
performance stats for each OST/MDT device to the client for current/peak x 
read/write x IOPS/bandwidth.  There are a number of potential uses for this, 
such as client/MDS selection of OSTs based on bandwidth/IOPS (beyond just 
"rotational" or "non-rotational"), userspace applications/libraries using it to 
determine if the OSTs are less busy (e.g. when to checkpoint), etc.

There aren't any plans to be able to export the storage "config" (e.g. RAID 
geometry) via Lustre since this is often opaque even on the server, and doesn't 
have any use on the client.  There was a discussion in the context of IO500 to 
write a script for collecting storage system configuration metadata for Lustre 
and other filesystems (e.g. list of OST/MDT devices, list of SCSI devices, PCI 
devices, CPU, RAM, network interfaces, etc.)

Additionally: if I try to find out the stripe location (lfs getstripe) and map 
this information to OST specs (lctl get_param osc.*.*ost_conn_uuid) to find 
out how many different servers and networks are involved, the obdidx seems to 
be in decimal, but the OST index in the connections list is hex, which is not 
always obvious.  Is there a way to display both in decimal or both in hex?

There isn't currently an option for "lfs getstripe" to print in hex, but it 
would be possible to add a "--hex" option to print the fields in hex.  I've 
filed https://jira.whamcloud.com/browse/LU-16503 for tracking this issue.  I 
don't think it would be terribly complex to implement, but I also don't know if 
anyone is available to do this work right now.

Are there generally any tools for doing similar things?
We plan a student project for building kind of GUI for visualizing stripings 
and mappings, so I would try to avoid reinventing the wheel.

Depending on how complex your tool is, it may be better to use 
llapi_layout_from_file() or llapi_layout_from_xattr() to parse the binary 
layout directly from the file, rather than printing it out and then parsing the 
text again in userspace.
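
Alternatively, if a shell prototype is enough, newer clients can emit the 
layout as YAML, which is much easier to parse than the default text output 
(assuming a release with the --yaml option):

client$ lfs getstripe --yaml /mnt/testfs/file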

Cheers, Andreas

Am 21.01.2023 um 17:08 schrieb Andreas Dilger:
Hi Anna,
Beyond the number and size of OSTs and MDTs there isn't much information about 
the underlying storage available on the client.

The "lfs df -v" command will print a "f" at the end for flash (non-rotational) 
devices, if the storage is properly configured.  The "osc*.imports " parameter 
file will contain some information about the grant_block_size that can be used 
to distinguish ldiskfs (4096) vs. zfs backends (131072 or 1048576).
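
Both are visible from an unprivileged client, e.g. (the mount point is a 
placeholder, and the exact import fields vary by release):

client$ lfs df -v /mnt/testfs
client$ lctl get_param osc.*.import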

The size of the disks can often be inferred from 1/8 of the total OST size for 
standard 8+2 RAID configs, but this may vary and no actual device-level metrics 
are available on the client.

Even on the server, Lustre itself doesn't know or care much about the 
underlying storage devices beyond (non-)rotational state, so we don't track any 
of that.

Cheers, Andreas

On Jan 21, 2023, at 01:16, Anna Fuchs via lustre-discuss 
<lustre-discuss@lists.lustre.org> wrote:

 Hi,

is it possible for a user (no root, no ssh to the server) to find out the 
configuration of an OST?
How many devices are there in one OST 'pool' (for both ldiskfs and ZFS) and 
even which type of devices they are (nvme, ssd, hdd)? Maybe even speeds and 
raid-levels?

Additionally, how can a user find out the mapping of all available OSTs to OSSs 
easily?

Thanks
Anna

--
Anna Fuchs
Universität Hamburg
https://wr.informatik.uni-hamburg.de

anna.fu...@informatik.uni-hamburg.de
https://wr.informatik.uni-hamburg.de/people/anna_fuchs

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


--
Anna Fuchs
Universität Hamburg
https://wr.informatik.uni-hamburg.de

anna.fu...@informatik.uni-hamburg.de
https://wr.informatik.uni-hamburg.de/people/anna_fuchs

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud






