[lustre-discuss] Missing OST's from 1 node only

2021-10-07 Thread Sid Young via lustre-discuss
G'Day all,

I have an odd situation where one compute node mounts /home and /lustre but
only half of the OSTs are present, while all the other nodes are fine. Not
sure where to start on this one?

Good node:
[root@n02 ~]# lfs df
UUID                   1K-blocks         Used    Available Use% Mounted on
home-MDT0000_UUID     4473970688     30695424   4443273216   1% /home[MDT:0]
home-OST0000_UUID    51097721856  39839794176  11257662464  78% /home[OST:0]
home-OST0001_UUID    51097897984  40967138304  10130627584  81% /home[OST:1]
home-OST0002_UUID    51097705472  37731089408  13366449152  74% /home[OST:2]
home-OST0003_UUID    51097773056  41447411712   9650104320  82% /home[OST:3]

filesystem_summary: 204391098368 159985433600  44404843520  79% /home

UUID                   1K-blocks         Used    Available Use% Mounted on
lustre-MDT0000_UUID   5368816128     28246656   5340567424   1% /lustre[MDT:0]
lustre-OST0000_UUID  51098352640  10144093184  40954257408  20% /lustre[OST:0]
lustre-OST0001_UUID  51098497024   9584398336  41514096640  19% /lustre[OST:1]
lustre-OST0002_UUID  51098414080  11683002368  39415409664  23% /lustre[OST:2]
lustre-OST0003_UUID  51098514432  10475310080  40623202304  21% /lustre[OST:3]
lustre-OST0004_UUID  51098506240  11505326080  39593178112  23% /lustre[OST:4]
lustre-OST0005_UUID  51098429440   9272059904  41826367488  19% /lustre[OST:5]

filesystem_summary: 306590713856  62664189952 243926511616  21% /lustre

[root@n02 ~]#



The bad node:

[root@n04 ~]# lfs df
UUID                   1K-blocks         Used    Available Use% Mounted on
home-MDT0000_UUID     4473970688     30726400   4443242240   1% /home[MDT:0]
home-OST0002_UUID    51097703424  37732352000  13363446784  74% /home[OST:2]
home-OST0003_UUID    51097778176  41449634816   9646617600  82% /home[OST:3]

filesystem_summary: 102195481600  79181986816  23010064384  78% /home

UUID                   1K-blocks         Used    Available Use% Mounted on
lustre-MDT0000_UUID   5368816128     28246656   5340567424   1% /lustre[MDT:0]
lustre-OST0003_UUID  51098514432  10475310080  40623202304  21% /lustre[OST:3]
lustre-OST0004_UUID  51098511360  11505326080  39593183232  23% /lustre[OST:4]
lustre-OST0005_UUID  51098429440   9272059904  41826367488  19% /lustre[OST:5]

filesystem_summary: 153295455232  31252696064 122042753024  21% /lustre

[root@n04 ~]#
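
The only things I can think to compare between n02 and n04 so far are the
client's configured devices and its LNet connectivity, roughly along these
lines (the NID in the ping is just a placeholder, not from our config):

[root@n04 ~]# lctl dl                              # devices this client has configured; missing OSCs should show up here
[root@n04 ~]# lctl get_param osc.*.state           # import state of each OSC (FULL vs DISCONN etc.)
[root@n04 ~]# lctl get_param osc.*.ost_conn_uuid   # which OSS each OSC is pointed at
[root@n04 ~]# lnetctl net show                     # LNet NIDs configured on this client
[root@n04 ~]# lctl ping 10.0.0.1@o2ib              # placeholder NID -- is the OSS serving the missing OSTs reachable?
[root@n04 ~]# dmesg | grep -iE 'lustre|lnet'       # connect/eviction errors at mount time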



Sid Young
Translational Research Institute


Re: [lustre-discuss] Fwd: RPCs in Flight are more than the max_rpcs_in_flight value

2021-10-07 Thread Andreas Dilger via lustre-discuss
On Oct 7, 2021, at 13:19, Md Hasanur Rashid via lustre-discuss
<lustre-discuss@lists.lustre.org> wrote:

Hello Everyone,

I am running the Filebench benchmark in my Lustre cluster. I set the
max_rpcs_in_flight value to 1. Before and after executing, I verified that the
value of max_rpcs_in_flight is indeed 1. However, when I check rpc_stats, the
stats show a much higher value for RPCs in flight.

If by "much higher than 1" you mean "2", then yes it appears there are mostly 
(95%) 2 RPCs being processed concurrently on this OST.

That might happen if you have 2 clients/mountpoints writing to the same OST; it
might be an off-by-one logic error allowing an extra RPC in flight; it might be
intentional for some reason (e.g. to avoid deadlock, memory pressure, etc.); or
it might be an accounting error in the statistics (e.g. counting the next RPC to
be sent before the first one is marked finished).

Following is the value shown for one OSC just for reference:

osc.hasanfs-OST0000-osc-882fcf777000.rpc_stats=
snapshot_time:         1632483604.967291 (secs.usecs)
read RPCs in flight:   0
write RPCs in flight:  0
pending write pages:   0
pending read pages:    0

                       read                     write
pages per rpc         rpcs   % cum % |       rpcs   % cum %
1:                       1 100 100   |          0   0   0
2:                       0   0 100   |          0   0   0
4:                       0   0 100   |          0   0   0
8:                       0   0 100   |          0   0   0
16:                      0   0 100   |          0   0   0
32:                      0   0 100   |          0   0   0
64:                      0   0 100   |          0   0   0
128:                     0   0 100   |          0   0   0
256:                     0   0 100   |       9508 100 100

                       read                     write
rpcs in flight        rpcs   % cum % |       rpcs   % cum %
0:                       0   0   0   |          0   0   0
1:                       1 100 100   |         10   0   0
2:                       0   0 100   |       9033  95  95
3:                       0   0 100   |        465   4 100

                       read                     write
offset                rpcs   % cum % |       rpcs   % cum %
0:                       1 100 100   |        725   7   7
1:                       0   0 100   |          0   0   7
2:                       0   0 100   |          0   0   7
4:                       0   0 100   |          0   0   7
8:                       0   0 100   |          0   0   7
16:                      0   0 100   |          0   0   7
32:                      0   0 100   |          0   0   7
64:                      0   0 100   |          0   0   7
128:                     0   0 100   |          0   0   7
256:                     0   0 100   |        718   7  15
512:                     0   0 100   |       1386  14  29
1024:                    0   0 100   |       2205  23  52
2048:                    0   0 100   |       1429  15  67
4096:                    0   0 100   |       1103  11  79
8192:                    0   0 100   |       1942  20 100

Can anyone please explain to me why the RPCs in flight shown in the rpc_stats 
could be higher than the max_rpcs_in_flight?

I do see a similar behavior with the statistics of my home system, which has 
the default osc.*.max_rpcs_in_flight=8, but shows many cases of 9 RPCs in 
flight in the statistics for both read and write, and in a few cases 10 or 11:

                       read                     write
rpcs in flight        rpcs   % cum % |       rpcs   % cum %
1:                     121   2   2   |      27831  93  93
2:                      23   0   3   |        108   0  93
3:                      22   0   3   |         19   0  93
4:                      24   0   4   |         15   0  93
5:                      19   0   5   |         10   0  93
6:                      26   0   5   |         13   0  93
7:                     176   4   9   |         39   0  93
8:                     933  22  32   |         75   0  94
9:                    2802  67  99   |       1207   4  98
10:                     10   0 100   |        543   1  99
11:                      0   0 100   |          1   0 100


The good news is that Lustre is open source, so you can look into the 
lustre/osc code to see why this is happening.  The limit is set by 
cli->cl_max_rpcs_in_flight, and the stats are accounted by 
cli->cl_write_rpc_hist.
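
If you want a clean data point first, something along these lines on the client
isolates the statistics to a single run (the device wildcard is illustrative,
so adjust it to your OSC names; on the versions I have handy, writing anything
to rpc_stats resets it):

client# lctl get_param osc.hasanfs-OST*.max_rpcs_in_flight   # confirm the limit really is 1
client# lctl set_param osc.hasanfs-OST*.rpc_stats=clear      # reset the histograms before the run
  ... run the benchmark ...
client# lctl get_param osc.hasanfs-OST*.rpc_stats            # stats now cover only that run

With a single client and mountpoint active, any 2s and 3s left in the "rpcs in
flight" histogram would point at the accounting rather than at a second writer.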

Out of curiosity, you don't say _why_ this off-by-one error is of interest?  
Definitely it seems like a bug that could be fixed, but it doesn't seem too 
critical to correct functionality.

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud









[lustre-discuss] Fwd: RPCs in Flight are more than the max_rpcs_in_flight value

2021-10-07 Thread Md Hasanur Rashid via lustre-discuss
Hello Everyone,

I am running the Filebench benchmark in my Lustre cluster. I set the
max_rpcs_in_flight value to 1. Before and after executing, I verified that the
value of max_rpcs_in_flight is indeed 1. However, when I check rpc_stats, the
stats show a much higher value for RPCs in flight. Following is the value shown
for one OSC, just for reference:

osc.hasanfs-OST0000-osc-882fcf777000.rpc_stats=
snapshot_time:         1632483604.967291 (secs.usecs)
read RPCs in flight:   0
write RPCs in flight:  0
pending write pages:   0
pending read pages:    0

                       read                     write
pages per rpc         rpcs   % cum % |       rpcs   % cum %
1:                       1 100 100   |          0   0   0
2:                       0   0 100   |          0   0   0
4:                       0   0 100   |          0   0   0
8:                       0   0 100   |          0   0   0
16:                      0   0 100   |          0   0   0
32:                      0   0 100   |          0   0   0
64:                      0   0 100   |          0   0   0
128:                     0   0 100   |          0   0   0
256:                     0   0 100   |       9508 100 100

                       read                     write
rpcs in flight        rpcs   % cum % |       rpcs   % cum %
0:                       0   0   0   |          0   0   0
1:                       1 100 100   |         10   0   0
2:                       0   0 100   |       9033  95  95
3:                       0   0 100   |        465   4 100

                       read                     write
offset                rpcs   % cum % |       rpcs   % cum %
0:                       1 100 100   |        725   7   7
1:                       0   0 100   |          0   0   7
2:                       0   0 100   |          0   0   7
4:                       0   0 100   |          0   0   7
8:                       0   0 100   |          0   0   7
16:                      0   0 100   |          0   0   7
32:                      0   0 100   |          0   0   7
64:                      0   0 100   |          0   0   7
128:                     0   0 100   |          0   0   7
256:                     0   0 100   |        718   7  15
512:                     0   0 100   |       1386  14  29
1024:                    0   0 100   |       2205  23  52
2048:                    0   0 100   |       1429  15  67
4096:                    0   0 100   |       1103  11  79
8192:                    0   0 100   |       1942  20 100

Can anyone please explain to me why the RPCs in flight shown in the
rpc_stats could be higher than the max_rpcs_in_flight?
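
For completeness, this is how I set and verified the value before and after the
run (the exact parameter wildcard may differ on other setups):

client# lctl set_param osc.hasanfs-OST*.max_rpcs_in_flight=1
client# lctl get_param osc.hasanfs-OST*.max_rpcs_in_flight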

Thanks,
Md. Hasanur Rashid


[lustre-discuss] No space left on device Error

2021-10-07 Thread Dilip Sathaye via lustre-discuss
Dear All,
We are running Lustre 2.7.x in our cluster, which had been running stable for a
few years. Suddenly a bug was triggered, and even a basic copy command now fails
with a "no space left on device" error.
We were told to upgrade to Lustre 2.10. Could we know:

1. Is there any simpler workaround available? We would like to avoid an upgrade,
since an upgrade can be a complex process.

2. If an upgrade is unavoidable, what is the recommended way to move from 2.7.x
to 2.10?
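
For reference, the first thing we plan to check is whether a single OST or the
MDT is out of blocks or inodes, since (as far as we understand) one full target
is enough to produce this error even when the filesystem as a whole has space:

client# lfs df -h                            # per-target block usage; one full OST can return ENOSPC
client# lfs df -i                            # per-target inode usage; a full MDT shows the same symptom
client# lfs getstripe /path/to/failing/file  # hypothetical path -- shows which OSTs the file is striped over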

Thanks
Dilip G Sathayr