Re: [lustre-discuss] NRS TBF by UID and congestion

2021-10-18 Thread Moreno Diego (ID SIS)
Salut Stephane!

Thanks a lot for this. This is the kind of helpful answer I was looking for when
I posted. All in all, it seems we will need to find the right value that works
for us. I have the impression that changing the settings in the middle of a very
high load might not be the best idea, since the queues are already filled: we
see the filesystem more or less blocked for some minutes after we enable the
policy, but afterwards it seems to work better. Have you also tried to enable it
on the LDLM services? I was advised in the past never to enable any kind of
throttling on LDLM, so that locks are cancelled as fast as possible; otherwise
we would see high CPU and memory usage on the MDS side.

I agree that it would be very useful to know which users have long waiting
queues; this could eventually help to create dynamic and more complex throttling
rules.

Regards,

Diego
 

On 15.10.21, 09:13, "Stephane Thiell"  wrote:

Salut Diego!

Yes, we have been using NRS TBF by UID on our Oak storage system for months 
now with Lustre 2.12. It’s a capacity-oriented, global filesystem, not designed 
for heavy workloads (unlike our scratch filesystem) but with many users and as 
such, a great candidate for NRS TBF UID. Since NRS, we have seen WAY fewer 
occurrences of single users abusing the system (which is always by mistake so 
we’re helping them too!). We use NRS TBF UID for all Lustre services on MDS and 
OSS.

We have an "exemption" rule for "root {0}" at 1, and a default rule
"default {*}" at a certain value. This value is per user and per CPT (and per
Lustre service on the MDS; e.g. mdt_readpage is a separate service). If you have
large servers with many CPTs and set the value to 500, that's 500 req/s per CPT
per user, so it may still be too high to be useful. The ideal value also
probably depends on your default striping and other specifics.
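For reference, setting this up looks roughly like the following (a minimal
sketch only; the rule names, rates and exact nrs_tbf_rule syntax are
illustrative and can differ between Lustre versions):

# enable the TBF-by-UID policy on the ost_io service of an OSS
lctl set_param ost.OSS.ost_io.nrs_policies="tbf uid"
# exemption rule for root (uid 0), effectively unthrottled via a very high rate
lctl set_param ost.OSS.ost_io.nrs_tbf_rule="start root_exempt uid={0} rate=10000"
# lower the rate of the built-in default rule; it applies per user and per CPT
lctl set_param ost.OSS.ost_io.nrs_tbf_rule="change default rate=500"
# the same pattern applies to the MDS services, e.g. mds.MDS.mdt.nrs_policies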

To set the NRS rate values right for the system, our approach is to monitor
the active/queued values reported for the 'tbf uid' policy on each OSS with
lctl get_param ost.OSS.ost_io.nrs_tbf_rule (and the same on the MDS for each MDT
service). We record these instant, gauge-like values every minute, which seems
to be enough to see trends. The 'queued' number is the most useful to me, as I
can easily see the impact of a rule by looking at the graph. Graphing these
metrics over time allows us to adjust the rates so that queueing is the
exception rather than the norm, while still limiting heavy workloads.
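A once-a-minute sampler can be as simple as the sketch below (illustrative only;
the exact output format of nrs_tbf_rule differs between Lustre versions, so
adjust the grep to whatever your servers actually print):

while sleep 60; do
    ts=$(date +%s)
    # keep only the lines carrying the active/queued gauges, prefixed with a timestamp
    lctl get_param ost.OSS.ost_io.nrs_tbf_rule 2>/dev/null |
        grep -E 'queued|active' |
        sed "s/^/$ts /"
done >> /var/log/nrs_tbf_gauges.log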

So it's working for us on this system; the only thing now is that we would
love to have a way to get additional NRS stats from Lustre, for example the
UIDs that have reached the rate limit over a given period.

Lastly, we tried to implement it on our scratch filesystem, but it's more
difficult. If a user has heavy-duty jobs running on compute nodes and hits the
rate limit, the user basically cannot transfer anything from a DTN or a login
node (and will complain). I've opened LU-14567 to discuss wildcard support for
"uid" in NRS TBF policy ('tbf' and not 'tbf uid') rules so that we could mix
other, non-UID TBF rules with UID TBF rules. I don't know how hard it is to
implement.

Hope that helps,

Stephane


    > On Oct 14, 2021, at 12:33 PM, Moreno Diego (ID SIS) 
 wrote:
> 
> Hi Lustre friends,
> 
> I'm wondering if someone has experience setting NRS TBF (by UID) on the OSTs
(ost_io and ost services) in order to avoid congestion of the filesystem's IOPS
or bandwidth. All my attempts over the last months have failed to produce
anything that looks like QoS when the system is under high load. Once the system
is under high load, not even the TBF UID policy saves us from slow response
times for every user. So far I have only tried setting it by UID, so that every
user has their fair share of bandwidth. I tried different rate values for the
default rule (5'000, 1'000 or 500). We have Lustre 2.12 in our cluster.
> 
> Maybe there's another setting that needs tuning (I see an undocumented
parameter, /sys/module/ptlrpc/parameters/tbf_rate, set to 10'000); is there
anything I'm missing about this feature?
> 
> Regards,
> 
> Diego
> 
> 




[lustre-discuss] NRS TBF by UID and congestion

2021-10-14 Thread Moreno Diego (ID SIS)
 Hi Lustre friends,

I'm wondering if someone has experience setting NRS TBF (by UID) on the OSTs
(ost_io and ost services) in order to avoid congestion of the filesystem's IOPS
or bandwidth. All my attempts over the last months have failed to produce
anything that looks like QoS when the system is under high load. Once the system
is under high load, not even the TBF UID policy saves us from slow response
times for every user. So far I have only tried setting it by UID, so that every
user has their fair share of bandwidth. I tried different rate values for the
default rule (5'000, 1'000 or 500). We have Lustre 2.12 in our cluster.

Maybe there's another setting that needs tuning (I see an undocumented
parameter, /sys/module/ptlrpc/parameters/tbf_rate, set to 10'000); is there
anything I'm missing about this feature?

Regards,

Diego




Re: [lustre-discuss] Elegant way to dump quota/usage database?

2021-02-12 Thread Moreno Diego (ID SIS)
Hi Steve,

If you have access to the servers you could aggregate the information given by:

lctl get_param osd-ldiskfs.*.quota_slave_dt.acct_{user,group,project}

For each Lustre device, this basically returns the quota accounting recorded
for user, group or project IDs (inodes and blocks used per ID).
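If a per-ID total across the devices on a server is what you're after, a rough
aggregation sketch could look like this (assuming the YAML-style "id:" /
"usage: { inodes: ..., kbytes: ... }" records this parameter prints; adjust the
field positions to your version, and run it on every MDS/OSS to cover the whole
filesystem):

lctl get_param -n osd-ldiskfs.*.quota_slave_dt.acct_user |
  awk '/- id:/  { id = $3 }
       /usage:/ { gsub(",", ""); inodes[id] += $4; kbytes[id] += $6 }
       END      { for (u in inodes)
                      printf "uid %-10s inodes %-10d kbytes %d\n", u, inodes[u], kbytes[u] }'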

Regards,

Diego
 

On 11.02.21, 20:07, "lustre-discuss on behalf of Steve Barnet" 
 
wrote:

Hey all,

   I would like to be able to dump the usage tracking and quota
information for my Lustre filesystems. I am currently running Lustre 2.12.

lfs quota -u $user $filesystem

works well enough for a single user. But I have been looking for a way 
to get that information for all users of the filesystem. So far, I have 
not stumbled across anything more elegant than a brute force iteration 
over my known users.

While that works (mostly), it is clearly not great. Is there a better 
way to do this? Hoping I just missed something in the docs ...

Thanks in advance for any pointers in this area.

Best,

---Steve



Re: [lustre-discuss] Do old clients ever go away?

2020-06-05 Thread Moreno Diego (ID SIS)
I don't see a way to clear the exports on the MGS side, so it seems to keep
every single NID that ever connected to the system. You can, however, clear
this on the MDSes/OSSes:

[root@mds01 ~]# ls /proc/fs/lustre/mdt/fs1-MDT0001/exports/ | wc -l
5182
[root@mds01 ~]# echo 1 > /proc/fs/lustre/mdt/fs1-MDT0001/exports/clear
[root@mds01 ~]# ls /proc/fs/lustre/mdt/fs1-MDT0001/exports/ | wc -l
349

Regards,

Diego
 

On 05.06.20, 16:39, "lustre-discuss on behalf of William D. Colburn" 
 wrote:

I was looking in /proc/fs/lustre/mgs/MGS/exports/, and I see IP
addresses in there that don't go anywhere anymore.  I'm pretty sure they
have been gone so long that they predate the uptime of the MDS.  Does a lost
client linger forever, or am I just wrong about when the machines went
offline in relation to the uptime of the MDS?

--Schlake


Re: [lustre-discuss] Slow mount on clients

2020-02-03 Thread Moreno Diego (ID SIS)
Not sure if it's your case, but the order of the MGS NIDs in the mount command matters:

[root@my-ms-01xx-yy ~]# time mount -t lustre 
10.210.1.101@tcp:10.210.1.102@tcp:/fs2 /scratch

real0m0.215s
user0m0.007s
sys 0m0.059s

[root@my-ms-01xx-yy ~]# time mount -t lustre 
10.210.1.102@tcp:10.210.1.101@tcp:/fs2 /scratch

real0m25.196s
user0m0.009s
sys 0m0.033s

Since the MGS is running on the node with IP 10.210.1.101, mounting with the
other NID listed first seems to hit a 25 s timeout before falling back.

Diego
 

On 03.02.20, 23:17, "lustre-discuss on behalf of Andrew Elwell" 
 
wrote:

Hi Folks,

One of our (recently built) 2.10.x filesystems is slow to mount on
clients (~20 seconds) whereas the others are nigh on instantaneous.

We saw this before with a 2.7 filesystem; the problem went away after we
changed something, but we've no idea what.

Nothing obvious in the logs.

Does anyone have suggestions for what causes this, and how to make it
faster? It's annoying me as "something" isn't right but I can't
identify what.


Many thanks

Andrew


Re: [lustre-discuss] LDLM locks not expiring/cancelling

2020-01-06 Thread Moreno Diego (ID SIS)
Hi Steve,

I was having a similar problem in the past months where the MDS servers would
go OOM because of SlabUnreclaim. The root cause has not yet been found, but we
stopped seeing this the day we disabled NRS TBF (QoS) for the LDLM services
(just in case you have it enabled). It would also be good to check what's being
consumed in the slab cache. In our case it was mostly kernel objects and not
ldlm.
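A quick way to look at the biggest slab consumers (illustrative; slabtop ships
with procps, and the awk fallback just multiplies object count by object size
from /proc/slabinfo):

slabtop -o -s c | head -20
# or, without slabtop:
awk 'NR > 2 { printf "%-30s %12d\n", $1, $3 * $4 }' /proc/slabinfo | sort -k2 -nr | head -20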

Diego


From: lustre-discuss  on behalf of 
Steve Crusan 
Date: Thursday, 2 January 2020 at 20:25
To: "lustre-discuss@lists.lustre.org" 
Subject: [lustre-discuss] LDLM locks not expiring/cancelling

Hi all,

We are running into a bizarre situation where we aren't having stale locks 
cancel themselves, and even worse, it seems as if ldlm.namespaces.*.lru_size is 
being ignored.

For instance, I unmount our Lustre file systems on a client machine, then
remount. Next, I'll run "lctl set_param ldlm.namespaces.*.lru_max_age=60s" and
"lctl set_param ldlm.namespaces.*.lru_size=1024". This (I believe) theoretically
would only allow 1024 ldlm locks per osc, and then I'd see a lot of lock
cancels (via ldlm.namespaces.${ost}.pool.stats). We also should see cancels if
the grant time > lru_max_age.

We can trigger this simply by running 'find' on the root of our Lustre file
system and waiting for a while. Eventually the client's SUnreclaim value bloats
to 60-70GB (!!!), and each of our OSTs has 30-40k LRU locks (via lock_count).
This is early in the process:

"""
ldlm.namespaces.h5-OST003f-osc-8802d8559000.lock_count=2090
ldlm.namespaces.h5-OST0040-osc-8802d8559000.lock_count=2127
ldlm.namespaces.h5-OST0047-osc-8802d8559000.lock_count=52
ldlm.namespaces.h5-OST0048-osc-8802d8559000.lock_count=1962
ldlm.namespaces.h5-OST0049-osc-8802d8559000.lock_count=1247
ldlm.namespaces.h5-OST004a-osc-8802d8559000.lock_count=1642
ldlm.namespaces.h5-OST004b-osc-8802d8559000.lock_count=1340
ldlm.namespaces.h5-OST004c-osc-8802d8559000.lock_count=1208
ldlm.namespaces.h5-OST004d-osc-8802d8559000.lock_count=1422
ldlm.namespaces.h5-OST004e-osc-8802d8559000.lock_count=1244
ldlm.namespaces.h5-OST004f-osc-8802d8559000.lock_count=1117
ldlm.namespaces.h5-OST0050-osc-8802d8559000.lock_count=1165
"""

But this will grow over time, and eventually this compute node gets evicted 
from the MDS (after 10 minutes of cancelling locks/hanging). The only way we 
have been able to reduce the slab usage is to drop caches and set 
LRU=clear...but the problem just comes back depending on the workload.

We are running 2.10.3 client side, 2.10.1 server side. Have there been any 
fixes added into the codebase for 2.10 that we need to apply? This seems to be 
the closest to what we are experiencing:

https://jira.whamcloud.com/browse/LU-11518


PS: I've checked other systems across our cluster, and some of them have as 
many as 50k locks per OST. I am kind of wondering if these locks are staying 
around much longer than the lru_max_age default (65 minutes), but I cannot 
prove that. Is there a good way to translate held locks to fids? I have been 
messing around with lctl set_param debug="XXX" and lctl set_param 
ldlm.namespaces.*.dump_namespace, but I don't feel like I'm getting *all* of 
the locks.

~Steve


Re: [lustre-discuss] Degraded read performance with Large Bulk IO (16MB RPC)

2019-12-13 Thread Moreno Diego (ID SIS)
From what I can see, I think you just ran the wrong command (lctl list_param -R
*) or it doesn't work as you expected on 2.12.3. The llite params are definitely
there on a *mounted* Lustre client.

This will give you the parameters you're looking for and, most likely, need to
modify for better read performance:

lctl list_param -R llite | grep max_read_ahead


From: Pinkesh Valdria 
Date: Friday, 13 December 2019 at 17:33
To: "Moreno Diego (ID SIS)" , 
"lustre-discuss@lists.lustre.org" 
Subject: Re: [lustre-discuss] Degraded read performance with Large Bulk IO 
(16MB RPC)

This is how I installed lustre clients (only showing packages installed steps).


cat > /etc/yum.repos.d/lustre.repo << EOF
[hpddLustreserver]
name=CentOS- - Lustre
baseurl=https://downloads.whamcloud.com/public/lustre/latest-release/el7/server/
gpgcheck=0

[e2fsprogs]
name=CentOS- - Ldiskfs
baseurl=https://downloads.whamcloud.com/public/e2fsprogs/latest/el7/
gpgcheck=0

[hpddLustreclient]
name=CentOS- - Lustre
baseurl=https://downloads.whamcloud.com/public/lustre/latest-release/el7/client/
gpgcheck=0
EOF

yum  install  lustre-client  -y

reboot



From: "Moreno Diego (ID SIS)" 
Date: Friday, December 13, 2019 at 2:55 AM
To: Pinkesh Valdria , 
"lustre-discuss@lists.lustre.org" 
Subject: Re: [lustre-discuss] Degraded read performance with Large Bulk IO 
(16MB RPC)

From what I can see they exist on my 2.12.3 client node:

[root@rufus4 ~]# lctl list_param -R llite | grep max_read_ahead
llite.reprofs-9f7c3b4a8800.max_read_ahead_mb
llite.reprofs-9f7c3b4a8800.max_read_ahead_per_file_mb
llite.reprofs-9f7c3b4a8800.max_read_ahead_whole_mb

Regards,

Diego


From: Pinkesh Valdria 
Date: Wednesday, 11 December 2019 at 17:46
To: "Moreno Diego (ID SIS)" , 
"lustre-discuss@lists.lustre.org" 
Subject: Re: [lustre-discuss] Degraded read performance with Large Bulk IO 
(16MB RPC)

I was not able to find those parameters on my client nodes,  OSS or MGS nodes.  
 Here is how I was extracting all parameters .

mkdir -p lctl_list_param_R/
cd lctl_list_param_R/
lctl list_param -R *  > lctl_list_param_R

[opc@lustre-client-1 lctl_list_param_R]$ less lctl_list_param_R  | grep ahead
llite.lfsbv-98231c3bc000.statahead_agl
llite.lfsbv-98231c3bc000.statahead_max
llite.lfsbv-98231c3bc000.statahead_running_max
llite.lfsnvme-98232c30e000.statahead_agl
llite.lfsnvme-98232c30e000.statahead_max
llite.lfsnvme-98232c30e000.statahead_running_max
[opc@lustre-client-1 lctl_list_param_R]$

I also tried these commands:

Not working:
On client nodes
lctl get_param llite.lfsbv-*.max_read_ahead_mb
error: get_param: param_path 'llite/lfsbv-*/max_read_ahead_mb': No such file or 
directory
[opc@lustre-client-1 lctl_list_param_R]$

Works
On client nodes
lctl get_param llite.*.statahead_agl
llite.lfsbv-98231c3bc000.statahead_agl=1
llite.lfsnvme-98232c30e000.statahead_agl=1
[opc@lustre-client-1 lctl_list_param_R]$



From: "Moreno Diego (ID SIS)" 
Date: Tuesday, December 10, 2019 at 2:06 AM
To: Pinkesh Valdria , 
"lustre-discuss@lists.lustre.org" 
Subject: Re: [lustre-discuss] Degraded read performance with Large Bulk IO 
(16MB RPC)

With that kind of read performance degradation I would immediately think of
llite's max_read_ahead parameters on the client. Specifically these two:

max_read_ahead_mb: the total amount of MB allocated for read ahead, usually
quite low for bandwidth benchmarking purposes and when there are several files
per client
max_read_ahead_per_file_mb: the default is quite low for 16MB RPCs (only a few
RPCs per file)

You probably need to check the effect of increasing both of them.

Regards,

Diego


From: lustre-discuss  on behalf of 
Pinkesh Valdria 
Date: Tuesday, 10 December 2019 at 09:40
To: "lustre-discuss@lists.lustre.org" 
Subject: [lustre-discuss] Degraded read performance with Large Bulk IO (16MB 
RPC)

I was expecting better or same read performance with Large Bulk IO (16MB RPC),  
but I see degradation in performance.   Do I need to tune any other parameter 
to benefit from Large Bulk IO?   Appreciate if I can get any pointers to 
troubleshoot further.

Throughput before:
-  Read:  2563 MB/s
-  Write: 2585 MB/s

Throughput after:
-  Read:  1527 MB/s (down by ~1025 MB/s)
-  Write: 2859 MB/s


Changes I did are:

On the OSS:
-  lctl set_param obdfilter.lfsbv-*.brw_size=16

On the clients:
-  unmounted and remounted
-  lctl set_param osc.lfsbv-OST*.max_pages_per_rpc=4096  (got auto-updated after re-mount)
-  lctl set_param osc.*.max_rpcs_in_flight=64  (had to manually increase this to 64, since after re-mount it was auto-set to 8, but read/write performance was poor)
-  lctl set_param osc.*.max_dirty_mb=2040  (setting the value to 2048 was failing with a "Numerical result out of range" error; previously it was set to 2000 when I got good performance)

Re: [lustre-discuss] Degraded read performance with Large Bulk IO (16MB RPC)

2019-12-10 Thread Moreno Diego (ID SIS)
With that kind of read performance degradation I would immediately think of
llite's max_read_ahead parameters on the client. Specifically these two:

max_read_ahead_mb: the total amount of MB allocated for read ahead, usually
quite low for bandwidth benchmarking purposes and when there are several files
per client
max_read_ahead_per_file_mb: the default is quite low for 16MB RPCs (only a few
RPCs per file)

You probably need to check the effect of increasing both of them.
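For reference, checking and raising them could look like this on the client
(illustrative values only; lctl set_param changes do not survive a remount
unless pushed from the MGS with "lctl set_param -P"):

lctl get_param llite.*.max_read_ahead_mb llite.*.max_read_ahead_per_file_mb
lctl set_param llite.*.max_read_ahead_mb=1024
lctl set_param llite.*.max_read_ahead_per_file_mb=256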

Regards,

Diego


From: lustre-discuss  on behalf of 
Pinkesh Valdria 
Date: Tuesday, 10 December 2019 at 09:40
To: "lustre-discuss@lists.lustre.org" 
Subject: [lustre-discuss] Degraded read performance with Large Bulk IO (16MB 
RPC)

I was expecting better or same read performance with Large Bulk IO (16MB RPC),  
but I see degradation in performance.   Do I need to tune any other parameter 
to benefit from Large Bulk IO?   Appreciate if I can get any pointers to 
troubleshoot further.

Throughput before:
-  Read:  2563 MB/s
-  Write: 2585 MB/s

Throughput after:
-  Read:  1527 MB/s (down by ~1025 MB/s)
-  Write: 2859 MB/s


Changes I did are:

On the OSS:
-  lctl set_param obdfilter.lfsbv-*.brw_size=16

On the clients:
-  unmounted and remounted
-  lctl set_param osc.lfsbv-OST*.max_pages_per_rpc=4096  (got auto-updated after re-mount)
-  lctl set_param osc.*.max_rpcs_in_flight=64  (had to manually increase this to 64, since after re-mount it was auto-set to 8, but read/write performance was poor)
-  lctl set_param osc.*.max_dirty_mb=2040  (setting the value to 2048 was failing with a "Numerical result out of range" error; previously it was set to 2000 when I got good performance)


My other settings:

-  lnetctl net add --net tcp1 --if $interface --peer-timeout 180 --peer-credits 128 --credits 1024
-  echo "options ksocklnd nscheds=10 sock_timeout=100 credits=2560 peer_credits=63 enable_irq_affinity=0" > /etc/modprobe.d/ksocklnd.conf
-  lfs setstripe -c 1 -S 1M /mnt/mdt_bv/test1



Re: [lustre-discuss] Lnet Self Test

2019-12-04 Thread Moreno Diego (ID SIS)
I recently did some work on 40Gb and 100Gb ethernet interfaces and these are a 
few of the things that helped me during lnet_selftest:


  *   On lnet: credits set higher than the default (e.g. 1024 or more), and
peer_credits to at least 128 for network testing (the default is just 8, which
is fine for a big cluster but maybe not for lnet_selftest with 2 clients).
  *   On the ksocklnd module options: more schedulers (10; the default of 6 was
not enough for my server), and some of the buffers changed (tx_buffer_size and
rx_buffer_size set to 1073741824), but you need to be very careful with these.
  *   sysctl.conf: increase the buffers (tcp_rmem, tcp_wmem, net.core.rmem_max/
wmem_max and their defaults), check window scaling, and check whether you can
afford to disable TCP timestamps.
  *   Other: the cpupower governor (set to performance, at least for testing)
and BIOS settings (e.g. on my AMD routers it was better to disable HT, disable a
few virtualization-oriented features and set the PCI config to performance).
Basically, be aware that Lustre's ethernet performance will take CPU resources,
so optimize for it. A rough sketch of these settings follows after this list.
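A minimal sketch of those settings, assuming a tcp LNet on eth0 (all values are
examples from a test setup rather than recommendations; the buffer sizes in
particular need care):

# ksocklnd module options (take effect when the module is loaded)
echo "options ksocklnd nscheds=10 tx_buffer_size=1073741824 rx_buffer_size=1073741824" > /etc/modprobe.d/ksocklnd.conf
# credits / peer_credits can be set per NI with lnetctl
lnetctl net add --net tcp --if eth0 --peer-credits 128 --credits 1024
# TCP buffers and related sysctls
cat >> /etc/sysctl.conf <<'EOF'
net.core.rmem_max = 268435456
net.core.wmem_max = 268435456
net.ipv4.tcp_rmem = 4096 87380 134217728
net.ipv4.tcp_wmem = 4096 65536 134217728
net.ipv4.tcp_window_scaling = 1
net.ipv4.tcp_timestamps = 0
EOF
sysctl -p
# CPU frequency governor for testing
cpupower frequency-set -g performance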

Last but not least, be aware that Lustre's ethernet driver (ksocklnd) does not
load balance as well as Infiniband's (ko2iblnd). I have sometimes seen several
Lustre peers using the same socklnd thread on the destination while the other
socklnd threads stayed idle, which means your entire load depends on just one
core. The best approach is to try with more clients and check the per-thread CPU
load on your node with top; 2 clients do not seem enough to me. With the proper
configuration you should be perfectly able to saturate a 25Gb link in
lnet_selftest.

Regards,

Diego


From: lustre-discuss  on behalf of 
Pinkesh Valdria 
Date: Thursday, 5 December 2019 at 06:14
To: Jongwoo Han 
Cc: "lustre-discuss@lists.lustre.org" 
Subject: Re: [lustre-discuss] Lnet Self Test

Thanks Jongwoo.

I have the MTU set for 9000 and also ring buffer setting set to max.


ip link set dev $primaryNICInterface mtu 9000

ethtool -G $primaryNICInterface rx 2047 tx 2047 rx-jumbo 8191

I read about changing  Interrupt Coalesce, but unable to find what values 
should be changed and also if it really helps or not.
# Several packets in a rapid sequence can be coalesced into one interrupt 
passed up to the CPU, providing more CPU time for application processing.

Thanks,
Pinkesh valdria
Oracle Cloud



From: Jongwoo Han 
Date: Wednesday, December 4, 2019 at 8:07 PM
To: Pinkesh Valdria 
Cc: Andreas Dilger , "lustre-discuss@lists.lustre.org" 

Subject: Re: [lustre-discuss] Lnet Self Test

Have you tried MTU >= 9000 bytes (AKA jumbo frames) on the 25G ethernet and the
switch?
If it is set to 1500 bytes, the ethernet + IP + TCP headers take up quite a
large share of each packet, reducing the bandwidth available for data.

Jongwoo Han

On Thu, 28 Nov 2019 at 03:44, Pinkesh Valdria <pinkesh.vald...@oracle.com> wrote:
Thanks Andreas for your response.

I ran another Lnet self test with 48 concurrent processes, since the nodes have
52 physical cores, and I was able to achieve the same throughput (2052.71 MiB/s
= 2152 MB/s).

Is it expected to lose almost 600 MB/s (2750 - 2150 = 600) due to overheads on
ethernet with Lnet?


Thanks,
Pinkesh Valdria
Oracle Cloud Infrastructure




From: Andreas Dilger <adil...@whamcloud.com>
Date: Wednesday, November 27, 2019 at 1:25 AM
To: Pinkesh Valdria <pinkesh.vald...@oracle.com>
Cc: "lustre-discuss@lists.lustre.org" <lustre-discuss@lists.lustre.org>
Subject: Re: [lustre-discuss] Lnet Self Test

The first thing to note is that lst reports results in binary units
(MiB/s) while iperf reports results in decimal units (Gbps).  If you do the
conversion you get 2055.31 MiB/s = 2155 MB/s.

The other thing to check is the CPU usage. For TCP the CPU usage can
be high. You should try RoCE+o2iblnd instead.

Cheers, Andreas

On Nov 26, 2019, at 21:26, Pinkesh Valdria <pinkesh.vald...@oracle.com> wrote:
Hello All,

I created a new Lustre cluster on CentOS7.6 and I am running 
lnet_selftest_wrapper.sh to measure throughput on the network.  The nodes are 
connected to each other using 25Gbps ethernet, so theoretical max is 25 Gbps * 
125 = 3125 MB/s.Using iperf3,  I get 22Gbps (2750 MB/s) between the nodes.


[root@lustre-client-2 ~]# for c in 1 2 4 8 12 16 20 24 ;  do echo $c ; 
ST=lst-output-$(date +%Y-%m-%d-%H:%M:%S)  CN=$c  SZ=1M  TM=30 BRW=write 
CKSUM=simple LFROM="10.0.3.7@tcp1" LTO="10.0.3.6@tcp1" 
/root/lnet_selftest_wrapper.sh; done ;

When I run lnet_selftest_wrapper.sh (from Lustre 
wiki)
 between 2 nodes,  I get a max of  2055.31  MiB/s,  Is that expected at the 
Lnet level?  Or can I 

Re: [lustre-discuss] one ost down

2019-11-15 Thread Moreno Diego (ID SIS)
Hi Einar,

As for the OST in bad shape: if you have not cleared the bad blocks on the
storage system, you'll keep getting IO errors whenever your server tries to
access these blocks; that's kind of a protection mechanism, and lots of IO
errors might give you many issues. The procedure to clean them up is a bit of
storage and filesystem surgery. I would suggest this high-level plan:


  *   Obtain the bad blocks from the storage system (offset, size, etc…)
  *   Map them to filesystem blocks: watch out, the storage system (at least on
old systems) probably speaks in 512-byte sectors while the filesystem blocks are
4KB, so you need to map storage sectors to filesystem blocks (see the sketch
below)
  *   Clear the bad blocks on the storage system; each storage system has its
own commands for this. You'll probably no longer get IO errors accessing these
sectors after clearing the bad blocks
  *   Optionally, zero the bad storage blocks with dd (and just these bad
blocks, of course) to get rid of the "trash" there might be on these blocks
  *   Find out with debugfs which files are affected
  *   Run e2fsck on the device

As I said, surgery; so if you really care about what you have on that device,
try to do a block-level backup first… But at the very minimum you need to clear
the bad blocks, otherwise you will keep getting IO errors on the device.
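A hypothetical sketch of the mapping and lookup steps (OST_DEV, BAD_LBA and the
inode number are placeholders, and this ignores any partition or LVM offset
between the LUN and the ldiskfs device):

OST_DEV=/dev/mapper/ost0
BAD_LBA=123456789                      # 512-byte sector reported by the array
FS_BLOCK=$(( BAD_LBA / 8 ))            # 4096 / 512 = 8 sectors per fs block

# which inode owns that block, and which file is that?
debugfs -c -R "icheck $FS_BLOCK" $OST_DEV
debugfs -c -R "ncheck <inode_from_icheck>" $OST_DEV

# optional and destructive: zero just that block, after clearing it on the array
dd if=/dev/zero of=$OST_DEV bs=4096 seek=$FS_BLOCK count=1 conv=notrunc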

Regards,

Diego


From: lustre-discuss  on behalf of 
Einar Næss Jensen 
Date: Friday, 15 November 2019 at 10:01
To: "lustre-discuss@lists.lustre.org" 
Subject: [lustre-discuss] one ost down




Hello dear lustre community.



We have a Lustre file system where one OST is having problems.

The underlying disk array, an old SFA10K from DDN (without support), has one
raidset with ca. 1300 bad blocks. The bad blocks came about when one disk in the
raid failed while another drive in another raidset was rebuilding.



Now.

The OST is offline, and the file system seems usable for new files, while old
files on the corresponding OST generate lots of kernel messages on the OSS.

Quota information is not available, though.



Questions:

May I assume that for new files everything is fine, since they are not using
the inactive device anyway?

I tried to run e2fsck on the unmounted OST while jobs were still running on the
filesystem, and for a few minutes it seemed like this was working, as the
filesystem appeared to come back complete afterwards. After a few minutes the
OST failed again, though.



Any pointers on how to rebuild/fix the OST and get it back are very much
appreciated.

Knowing how to regenerate the quota information, which is currently unavailable,
would also help, with or without the troublesome OST.







Best Regards

Einar Næss Jensen (on flight to Denver)




--
Einar Næss Jensen
NTNU HPC Section
Norwegian University of Science and Technoloy
Address: Høgskoleringen 7i
 N-7491 Trondheim, NORWAY
tlf: +47 90990249
email:   einar.nass.jen...@ntnu.no


Re: [lustre-discuss] [EXTERNAL] Re: Lustre Timeouts/Filesystem Hanging

2019-10-29 Thread Moreno Diego (ID SIS)
Hi Louis,

If you don't have any particular monitoring on the servers (Prometheus, Ganglia,
etc.) you could also use sar (sysstat) or a similar tool to confirm that the CPU
is waiting for IO, and to check device saturation with sar or iostat. For
instance:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.19    0.00    6.09    0.10    0.06   93.55

Device: rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda       0.00     1.20    0.20    0.60     0.00     0.01    20.00     0.00    0.75    1.00    0.67   0.75   0.06
sdb       0.00   136.80    2.80   96.60     0.81     9.21   206.42     0.19    1.91   26.29    1.20   0.55   5.46
sdc       0.00   144.20   58.80  128.00     2.34    16.82   210.08     0.24    1.31    2.68    0.68   0.66  12.40

Then, if you enable Lustre job stats, you can check which job is doing the most
IO on that specific device. Last but not least, you could also parse which
specific NID is doing the intensive IO on that OST
(/proc/fs/lustre/obdfilter/-OST0007/exports/*/stats).
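A rough sketch of both checks, with "fsname" and the OST index as placeholders:

# enable job stats once on the MGS; procname_uid tags RPCs with process name + UID
lctl conf_param fsname.sys.jobid_var=procname_uid
# per-job I/O counters for the busy OST (run on the OSS)
lctl get_param obdfilter.fsname-OST0007.job_stats
# rank client exports (NIDs) by write activity on that OST
grep -H write_bytes /proc/fs/lustre/obdfilter/fsname-OST0007/exports/*/stats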

Regards,

Diego


From: lustre-discuss  on behalf of 
Louis Allen 
Date: Tuesday, 29 October 2019 at 17:43
To: "Oral, H." , "Carlson, Timothy S" 
, "lustre-discuss@lists.lustre.org" 

Subject: Re: [lustre-discuss] [EXTERNAL] Re: Lustre Timeouts/Filesystem Hanging

Thanks, will take a look.

Any other areas I should be looking at? Should I be applying any Lustre tuning?

Thanks


From: Oral, H. 
Sent: Monday, October 28, 2019 7:06:41 PM
To: Louis Allen ; Carlson, Timothy S 
; lustre-discuss@lists.lustre.org 

Subject: Re: [EXTERNAL] Re: [lustre-discuss] Lustre Timeouts/Filesystem Hanging

For inspecting client side I/O, you can use Darshan.

Thanks,

Sarp

--
Sarp Oral, PhD

National Center for Computational Sciences
Oak Ridge National Laboratory
ora...@ornl.gov
865-574-2173


On 10/28/19, 1:58 PM, "lustre-discuss on behalf of Louis Allen" 
 
wrote:


Thanks for the reply, Tim.


Are there any tools I can use to see if that is the cause?


Could any tuning possibly help the situation?


Thanks






From: Carlson, Timothy S 
Sent: Monday, 28 October 2019, 17:24
To: Louis Allen; lustre-discuss@lists.lustre.org
Subject: RE: Lustre Timeouts/Filesystem Hanging


In my experience, this is almost always related to some code doing really 
bad I/O. Let’s say you have a 1000 rank MPI code doing open/read 4k/close on a 
few specific files on that OST.  That will make for a  bad day.

The other place you can see this, and this isn't your case, is when ZFS refuses
to give up on a disk that is failing and your overall I/O suffers from ZFS
continuing to try to read from a disk that it should just kick out.

Tim


From: lustre-discuss 
On Behalf Of Louis Allen
Sent: Monday, October 28, 2019 10:16 AM
To: lustre-discuss@lists.lustre.org
Subject: [lustre-discuss] Lustre Timeouts/Filesystem Hanging



Hello,



Lustre (2.12) seems to be hanging quite frequently (5+ times a day) for us,
and one of the OSS servers (out of 4) is reporting an extremely high load
average (150+) while the CPU usage of that server is actually very low - so it
must be related to something else - possibly CPU_IO_WAIT.



The OSS server we are seeing the high load averages we can also see 
multiple LustreError messages in /var/log/messages:



Oct 28 11:22:23 pazlustreoss001 kernel: LNet: Service thread pid 2403 was 
inactive for 200.08s. The thread might be hung, or it might only be slow and 
will resume later. Dumping the stack trace
 for debugging purposes:
Oct 28 11:22:23 pazlustreoss001 kernel: LNet: Skipped 4 previous similar 
messages

Oct 28 11:22:23 pazlustreoss001 kernel: Pid: 2403, comm: ll_ost00_068 
3.10.0-957.10.1.el7_lustre.x86_64 #1 SMP Sun May 26 21:48:35 UTC 2019

Oct 28 11:22:23 pazlustreoss001 kernel: Call Trace:

Oct 28 11:22:23 pazlustreoss001 kernel: [] 
jbd2_log_wait_commit+0xc5/0x140 [jbd2]

Oct 28 11:22:23 pazlustreoss001 kernel: [] 
jbd2_complete_transaction+0x52/0xa0 [jbd2]

Oct 28 11:22:23 pazlustreoss001 kernel: [] 
ldiskfs_sync_file+0x2e2/0x320 [ldiskfs]

Oct 28 11:22:23 pazlustreoss001 kernel: [] 
vfs_fsync_range+0x20/0x30

Oct 28 11:22:23 pazlustreoss001 kernel: [] 
osd_object_sync+0xb1/0x160 [osd_ldiskfs]

Oct 28 11:22:23 pazlustreoss001 kernel: [] 
tgt_sync+0xb7/0x270 [ptlrpc]

Oct 28 11:22:23 pazlustreoss001 kernel: [] 
ofd_sync_hdl+0x111/0x530 [ofd]

Oct 28 11:22:23 pazlustreoss001 kernel: [] 
tgt_request_handle+0xaea/0x1580 [ptlrpc]

Oct 28 11:22:23 pazlustreoss001 kernel: [] 
ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]

Oct 28 11:22:23 pazlustreoss001 kernel: [] 
ptlrpc_main+0xafc/0x1fc0 [ptlrpc]

Oct 28 11:22:23 pazlustreoss001 kernel: [] 

Re: [lustre-discuss] RV: Lustre quota issues

2019-07-11 Thread Moreno Diego (ID SIS)
Hi Thomas,

I think one way to get more reliable quota values might be to reduce a couple of
tunables (see the sketch below):

- osc.*.max_dirty_mb: this reduces the amount of uncommitted data in the client
cache and thus the potential for quota inconsistencies. I've recently been
having quota issues on a filesystem with many OSTs where max_dirty_mb is set to
128MB. That allows a lot of dirty data depending on the number of clients you
have, hence quota issues or, better said, false positives. That's just my
understanding of quota vs. cache.

- qmt.*.*.soft_least_qunit (on the MDS, e.g. "lctl get_param
qmt.*.*.soft_least_qunit"): a tunable to reduce the qunit size between the soft
and hard quota, which allows finer tuning of the quota allowance when a user is
in this "danger zone". That should help to get more reliable values; that's at
least the case on Lustre 2.10.
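The sketch below shows what adjusting both might look like (the values are
illustrative, and whether soft_least_qunit is writable at runtime may depend on
your Lustre version; check the current value first and keep its units):

# on the clients (or pushed from the MGS with "lctl set_param -P"):
lctl set_param osc.*.max_dirty_mb=32
# on the MDS:
lctl get_param qmt.*.*.soft_least_qunit
lctl set_param qmt.*.*.soft_least_qunit=<smaller_value_in_same_units>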

I had the same problems you have w.r.t. qunit. The documentation seems outdated,
or at least not accurate for Lustre 2.10.

Regards,

Diego
 

On 10.07.19, 13:15, "lustre-discuss on behalf of Thomas Roth" 
 wrote:

Yes, I have seen this pattern before.

My guess:

- cetafs-OST0017 - 21 are at the limit, hence the * there and hence the overall *.
The manual warns you that somebody might write more than their allotted amount
because the quota is distributed over the OSTs; this might work the other way
around, too.

- 8k or 20k is far too low.

However, I have no idea how to influence these values.


Regards
Thomas

On 09/07/2019 08.46, Alfonso Pardo wrote:
> Hi,
> 
> If I set the quota to "-b 0 -B 0" and immediately set the quota to 20G again,
I get the same result: quota exceeded.
> When I run "lfs quota -v", this is the output:
> 
> Disk quotas for group XXX (gid 694):
>      Filesystem    used   quota   limit       grace   files   quota   limit   grace
>       /mnt/data  2.307G*    20G     20G 6d23h59m58s   39997      10      10       -
> cetafs-MDT_UUID
>                  25.29M       -      0k           -   39997       -   65536       -
> cetafs-OST0014_UUID
>                  451.1M       -  452.1M           -       -       -       -       -
> cetafs-OST0015_UUID
>                  409.7M       -  410.7M           -       -       -       -       -
> cetafs-OST0016_UUID
>                  429.4M       -  430.4M           -       -       -       -       -
> cetafs-OST0017_UUID
>                  1.022G*      -  1.022G           -       -       -       -       -
> cetafs-OST0018_UUID
>                      8k*      -      8k           -       -       -       -       -
> cetafs-OST0019_UUID
>                     20k*      -     20k           -       -       -       -       -
> cetafs-OST001a_UUID
>                      8k*      -      8k           -       -       -       -       -
> cetafs-OST001b_UUID
>                     24k*      -     24k           -       -       -       -       -
> quotactl ost28 failed.
> quotactl ost29 failed.
> quotactl ost30 failed.
> quotactl ost31 failed.
> cetafs-OST0020_UUID
>                    116k*      -    116k           -       -       -       -       -
> cetafs-OST0021_UUID
>                    136k*      -    136k           -       -       -       -       -
> Total allocated inode limit: 65536, total allocated block limit: 2.285G
> Some errors happened when getting quota info. Some devices may be not
> working or deactivated. The data in "[]" is inaccurate.
> 
> 
> As you can see, I have some OSTs deactivated, because I will remove them.
> 
> I have set quotas without "quota" (soft, with -b), only setting "limit" (hard,
with -B), and it works fine, no quota exceeded; but if I set "quota" (-b) the
error appears.
> 
> 
> 
> -Original Message-
> From: lustre-discuss [mailto:lustre-discuss-boun...@lists.lustre.org] On
behalf of Thomas Roth
> Sent: Monday, 8 July 2019 15:14
> To: lustre-discuss@lists.lustre.org
> Subject: Re: [lustre-discuss] RV: Lustre quota issues
> 
> Perhaps the same issue that we see from time to time.
> 
> What happens if you remove the quota altogether (-b 0 -B 0) and set them to
20G immediately afterwards?
> That made Lustre reconsider and repent in our case.
> That made Lustre reconsider and repent in our case.
> 
> Still, I suspect it is connected to the parts of the quota-total 
attributed
> to each OST:
> Try 'lfs quota -v' to see these for each OST.
> 
> The manual talks of the 

Re: [lustre-discuss] LFS Quota

2019-01-09 Thread Moreno Diego (ID SIS)
Hi ANS,

About the soft limits and not receiving any warning or notification when the 
soft quota is reached, this would be the expected behavior. The soft quota is 
used together with the grace period to give some “extra” time to the user to 
remove inodes/blocks, as per the Lustre Operations Manual:

Soft limit -- The grace timer is started once the soft limit is exceeded. At 
this point, the user/group/project can still allocate block/inode. When the 
grace time expires and if the user is still above the soft limit, the soft 
limit becomes a hard limit and the user/group/project can't allocate any new 
block/inode any more. The user/group/project should then delete files to be 
under the soft limit. The soft limit MUST be smaller than the hard limit. If 
the soft limit is not needed, it should be set to zero (0).
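As a hypothetical example (user name, limits and filesystem path are
placeholders), setting soft and hard limits plus the grace periods looks like
this:

lfs setquota -u bob -b 10G -B 11G -i 1000000 -I 1100000 /mnt/lustre
lfs setquota -t -u -b 1w -i 1w /mnt/lustre     # block and inode grace periods
lfs quota -u bob /mnt/lustre                   # a '*' marks an exceeded limit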

I’m not aware of any warnings triggered by Lustre when the soft quota is 
reached, though that would be interesting to have. I know some people using 
external tools to monitor Lustre quotas and trigger warnings or similar for 
users exceeding their soft quota.

Regards,

Diego


From: lustre-discuss  on behalf of ANS 

Date: Wednesday, 9 January 2019 at 10:00
To: "lustre-discuss@lists.lustre.org" 
Subject: Re: [lustre-discuss] LFS Quota

Dear All,

Can anyone look into it.

Thanks,
ANS

On Mon, Jan 7, 2019 at 6:38 PM ANS <ans3...@gmail.com> wrote:
Dear All,

I am trying to set quota on Lustre, but unfortunately I have issued the below
commands:

tunefs.lustre --param ost.quota_type=ug /dev/mapper/mds1
checking for existing Lustre data: found
Reading CONFIGS/mountdata

   Read previous values:
Target: data-MDT
Index:  0
Lustre FS:  data
Mount type: ldiskfs
Flags:  0x1001
  (MDT no_primnode )
Persistent mount opts: user_xattr,errors=remount-ro
Parameters:  mgsnode=192.168.2.9@o2ib:192.168.2.10@o2ib  
failover.node=192.168.2.9@o2ib:192.168.2.10@o2ib

   Permanent disk data:
Target: data-MDT
Index:  0
Lustre FS:  data
Mount type: ldiskfs
Flags:  0x1041
  (MDT update no_primnode )
Persistent mount opts: user_xattr,errors=remount-ro
Parameters:  mgsnode=192.168.2.9@o2ib:192.168.2.10@o2ib  
failover.node=192.168.2.9@o2ib:192.168.2.10@o2ib ost.quota_type=ug

After this I issued lctl conf_param home.quota.mdt=ugp and now I am able to see
quota enabled on the MDT, but will there be any effect from the above, as I
issued that command on 2 MDTs while I have 4 MDTs in total? Do I need to issue
any further commands to revert those?

After enabling quota, the used space crossed the soft limit, but I did not get
any warning until it reached near the hard limit.

Can anyone help me out.

Thanks,
ANS


--
Thanks,
ANS.


Re: [lustre-discuss] space usage is not limited when using project quota

2019-01-07 Thread Moreno Diego (ID SIS)
Hi Zhang Di,

Hope it's not too late to jump into this one. You're only providing the quota
settings on MDT0, but did you also enable project quotas on the OSTs?

oss1$> lctl get_param osd-*.*.quota_slave.info | grep space
space acct: ugp
space acct: ugp
space acct: ugp
space acct: ugp
space acct: ugp
space acct: ugp
space acct: ugp
space acct: ugp
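If they are not enabled yet, the usual steps look roughly like this (a sketch
only; the device path is a placeholder and the exact procedure depends on your
Lustre/e2fsprogs versions, so check the manual for yours):

# on the MGS: enforce user/group/project quotas on the OSTs
lctl conf_param lustrefs.quota.ost=ugp
# for ldiskfs OSTs, the backing filesystem also needs the project feature
# (needs a recent e2fsprogs; run with the target unmounted)
tune2fs -O project -Q prjquota /dev/vdX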

Regards,

Diego


From: lustre-discuss  on behalf of 
zhang di 
Date: Tuesday, 25 December 2018 at 09:27
To: "lustre-discuss@lists.lustre.org" 
Subject: [lustre-discuss] space usage is not limited when using project quota

Hi,
I’m trying to use lustre’s project quota feature, my quota configuration is:

[root@mds1 dc2-user]# lctl get_param 
osd-*.*.quota_slave.info
osd-ldiskfs.lustrefs-MDT.quota_slave.info=
target name:lustrefs-MDT
pool ID:0
type:   md
quota enabled:  ugp
conn to master: setup
space acct: ugp
user uptodate:  glb[1],slv[1],reint[0]
group uptodate: glb[1],slv[1],reint[0]
project uptodate: glb[1],slv[1],reint[0]

And have enable filesystem project feature:

[root@mds1 dc2-user]# dumpe2fs -h /dev/vdc | grep 'Filesystem features'
dumpe2fs 1.42.13.wc6 (05-Feb-2017)
Filesystem features:  has_journal ext_attr resize_inode dir_index filetype 
mmp flex_bg dirdata sparse_super large_file huge_file uninit_bg dir_nlink quota 
project

Then I set quota on client:
[root@client2 test]# lfs quota -p 123 /mnt/test
Disk quotas for prj 123 (pid 123):
   Filesystem  kbytes   quota   limit   grace   files   quota   limit   grace
/mnt/test   4   10240   10240   -   1   1   1   -

Although the hard limit is 10M, the Lustre quota doesn't limit the size of the file I write:

dd if=/dev/zero of=hello bs=30M count=1

[root@client2 test]# lsattr -p
  123 -P ./hello

[root@client2 test]# lfs quota -p 123 /mnt/test
Disk quotas for prj 123 (pid 123):
   Filesystem  kbytes   quota   limit   grace   files   quota   limit   grace
/mnt/test   28676*  10240   10240   -   2   1   1   -

My Lustre version is 2.10.3, so is this a Lustre quota bug?
Thank you very much.


[lustre-discuss] New accounts in Jira?

2018-06-28 Thread Moreno Diego (ID SIS)
Hello,

It doesn't seem possible to create a new account on
https://jira.whamcloud.com/ unless I'm missing something obvious…

On the login screen it says “Not a member? To request an account, please 
contact your JIRA 
administrators.”
 Unfortunately, that link leads to a dead end: 
https://jira.whamcloud.com/secure/ContactAdministrators!default.jspa

Regards,

---
Diego Moreno
HPC - Scientific IT Services
ETH Zurich
