[lustre-discuss] Missing OSTs from 1 node only
G'Day all,

I have an odd situation where one compute node mounts /home and /lustre, but only half the OSTs are present; all the other nodes are fine. Not sure where to start on this one?

Good node:

[root@n02 ~]# lfs df
UUID                   1K-blocks         Used    Available Use% Mounted on
home-MDT0000_UUID     4473970688     30695424   4443273216   1% /home[MDT:0]
home-OST0000_UUID    51097721856  39839794176  11257662464  78% /home[OST:0]
home-OST0001_UUID    51097897984  40967138304  10130627584  81% /home[OST:1]
home-OST0002_UUID    51097705472  37731089408  13366449152  74% /home[OST:2]
home-OST0003_UUID    51097773056  41447411712   9650104320  82% /home[OST:3]

filesystem_summary: 204391098368 159985433600  44404843520  79% /home

UUID                   1K-blocks         Used    Available Use% Mounted on
lustre-MDT0000_UUID   5368816128     28246656   5340567424   1% /lustre[MDT:0]
lustre-OST0000_UUID  51098352640  10144093184  40954257408  20% /lustre[OST:0]
lustre-OST0001_UUID  51098497024   9584398336  41514096640  19% /lustre[OST:1]
lustre-OST0002_UUID  51098414080  11683002368  39415409664  23% /lustre[OST:2]
lustre-OST0003_UUID  51098514432  10475310080  40623202304  21% /lustre[OST:3]
lustre-OST0004_UUID  51098506240  11505326080  39593178112  23% /lustre[OST:4]
lustre-OST0005_UUID  51098429440   9272059904  41826367488  19% /lustre[OST:5]

filesystem_summary: 306590713856  62664189952 243926511616  21% /lustre

[root@n02 ~]#

The bad node:

[root@n04 ~]# lfs df
UUID                   1K-blocks         Used    Available Use% Mounted on
home-MDT0000_UUID     4473970688     30726400   4443242240   1% /home[MDT:0]
home-OST0002_UUID    51097703424  37732352000  13363446784  74% /home[OST:2]
home-OST0003_UUID    51097778176  41449634816   9646617600  82% /home[OST:3]

filesystem_summary: 102195481600  79181986816  23010064384  78% /home

UUID                   1K-blocks         Used    Available Use% Mounted on
lustre-MDT0000_UUID   5368816128     28246656   5340567424   1% /lustre[MDT:0]
lustre-OST0003_UUID  51098514432  10475310080  40623202304  21% /lustre[OST:3]
lustre-OST0004_UUID  51098511360  11505326080  39593183232  23% /lustre[OST:4]
lustre-OST0005_UUID  51098429440   9272059904  41826367488  19% /lustre[OST:5]

filesystem_summary: 153295455232  31252696064 122042753024  21% /lustre

[root@n04 ~]#

Sid Young
Translational Research Institute
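When one client sees only a subset of the OSTs, it usually means that client's imports for the missing OSTs are not in the FULL (connected) state. A minimal first-pass check on the bad node, as a sketch: the grep patterns are illustrative, and <oss-nid> is a placeholder for a real server NID.

    # On n04: list the configured Lustre devices and their status
    lctl dl

    # Show the import state for every OSC; healthy connections report "state: FULL"
    lctl get_param osc.*.import | grep -E 'import=|state:'

    # Verify LNet connectivity to the OSS serving the missing OSTs
    lctl ping <oss-nid>

    # Look for eviction/reconnect messages around the time the OSTs disappeared
    dmesg | grep -iE 'lustre|lnet'

If the imports are stuck, unmounting and remounting the filesystems on the bad node is a common way to force a clean reconnect.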
Re: [lustre-discuss] Fwd: RPCs in Flight are more than the max_rpcs_in_flight value
On Oct 7, 2021, at 13:19, Md Hasanur Rashid via lustre-discuss <lustre-discuss@lists.lustre.org> wrote:

> Hello Everyone,
>
> I am running the Filebench benchmark in my Lustre cluster. I set the
> max_rpcs_in_flight value to 1. Before executing and after executing, I
> verified that the value of max_rpcs_in_flight is indeed 1. However, when I
> check the rpc_stats, the stats show a much higher value for RPCs in flight.

If by "much higher than 1" you mean "2", then yes, it appears there are mostly (95%) 2 RPCs being processed concurrently on this OST. That might happen if you have 2 clients/mountpoints writing to the same OST, it might be an off-by-one logic error allowing an extra RPC in flight, it might be intentional for some reason (e.g. to avoid deadlock, memory pressure, etc.), or it might be an accounting error in the statistics (e.g. counting the next RPC to be sent before the first one is marked finished).

> Following is the value shown for one OSC just for reference:
>
> osc.hasanfs-OST0000-osc-882fcf777000.rpc_stats=
> snapshot_time:         1632483604.967291 (secs.usecs)
> read RPCs in flight:   0
> write RPCs in flight:  0
> pending write pages:   0
> pending read pages:    0
>
>                      read                      write
> pages per rpc   rpcs   %  cum % |   rpcs   %  cum %
> 1:                 1 100    100 |      0   0      0
> 2:                 0   0    100 |      0   0      0
> 4:                 0   0    100 |      0   0      0
> 8:                 0   0    100 |      0   0      0
> 16:                0   0    100 |      0   0      0
> 32:                0   0    100 |      0   0      0
> 64:                0   0    100 |      0   0      0
> 128:               0   0    100 |      0   0      0
> 256:               0   0    100 |   9508 100    100
>
>                      read                      write
> rpcs in flight  rpcs   %  cum % |   rpcs   %  cum %
> 0:                 0   0      0 |      0   0      0
> 1:                 1 100    100 |     10   0      0
> 2:                 0   0    100 |   9033  95     95
> 3:                 0   0    100 |    465   4    100
>
>                      read                      write
> offset          rpcs   %  cum % |   rpcs   %  cum %
> 0:                 1 100    100 |    725   7      7
> 1:                 0   0    100 |      0   0      7
> 2:                 0   0    100 |      0   0      7
> 4:                 0   0    100 |      0   0      7
> 8:                 0   0    100 |      0   0      7
> 16:                0   0    100 |      0   0      7
> 32:                0   0    100 |      0   0      7
> 64:                0   0    100 |      0   0      7
> 128:               0   0    100 |      0   0      7
> 256:               0   0    100 |    718   7     15
> 512:               0   0    100 |   1386  14     29
> 1024:              0   0    100 |   2205  23     52
> 2048:              0   0    100 |   1429  15     67
> 4096:              0   0    100 |   1103  11     79
> 8192:              0   0    100 |   1942  20    100
>
> Can anyone please explain to me why the RPCs in flight shown in the
> rpc_stats could be higher than the max_rpcs_in_flight?

I do see a similar behavior with the statistics on my home system, which has the default osc.*.max_rpcs_in_flight=8, but shows many cases of 9 RPCs in flight in the statistics for both read and write, and in a few cases 10 or 11:

                     read                      write
rpcs in flight  rpcs   %  cum % |   rpcs   %  cum %
1:               121   2      2 |  27831  93     93
2:                23   0      3 |    108   0     93
3:                22   0      3 |     19   0     93
4:                24   0      4 |     15   0     93
5:                19   0      5 |     10   0     93
6:                26   0      5 |     13   0     93
7:               176   4      9 |     39   0     93
8:               933  22     32 |     75   0     94
9:              2802  67     99 |   1207   4     98
10:               10   0    100 |    543   1     99
11:                0   0    100 |      1   0    100

The good news is that Lustre is open source, so you can look into the lustre/osc code to see why this is happening. The limit is set by cli->cl_max_rpcs_in_flight, and the stats are accounted in cli->cl_write_rpc_hist.

Out of curiosity, you don't say _why_ this off-by-one error is of interest. It definitely seems like a bug that could be fixed, but it doesn't seem too critical to correct functionality.

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud
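For anyone who wants to reproduce this comparison, both the tunable and the histogram are exposed through lctl. A sketch (device names differ per system, and on typical builds writing any value to rpc_stats resets the counters):

    # Confirm the current limit on every OSC
    lctl get_param osc.*.max_rpcs_in_flight

    # Reset the histograms, run the workload, then re-read them
    lctl set_param osc.*.rpc_stats=0
    # ... run the benchmark ...
    lctl get_param osc.*.rpc_stats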
[lustre-discuss] Fwd: RPCs in Flight are more than the max_rpcs_in_flight value
Hello Everyone,

I am running the Filebench benchmark in my Lustre cluster. I set the max_rpcs_in_flight value to 1. Before executing and after executing, I verified that the value of max_rpcs_in_flight is indeed 1. However, when I check the rpc_stats, the stats show a much higher value for RPCs in flight. Following is the value shown for one OSC just for reference:

osc.hasanfs-OST0000-osc-882fcf777000.rpc_stats=
snapshot_time:         1632483604.967291 (secs.usecs)
read RPCs in flight:   0
write RPCs in flight:  0
pending write pages:   0
pending read pages:    0

                     read                      write
pages per rpc   rpcs   %  cum % |   rpcs   %  cum %
1:                 1 100    100 |      0   0      0
2:                 0   0    100 |      0   0      0
4:                 0   0    100 |      0   0      0
8:                 0   0    100 |      0   0      0
16:                0   0    100 |      0   0      0
32:                0   0    100 |      0   0      0
64:                0   0    100 |      0   0      0
128:               0   0    100 |      0   0      0
256:               0   0    100 |   9508 100    100

                     read                      write
rpcs in flight  rpcs   %  cum % |   rpcs   %  cum %
0:                 0   0      0 |      0   0      0
1:                 1 100    100 |     10   0      0
2:                 0   0    100 |   9033  95     95
3:                 0   0    100 |    465   4    100

                     read                      write
offset          rpcs   %  cum % |   rpcs   %  cum %
0:                 1 100    100 |    725   7      7
1:                 0   0    100 |      0   0      7
2:                 0   0    100 |      0   0      7
4:                 0   0    100 |      0   0      7
8:                 0   0    100 |      0   0      7
16:                0   0    100 |      0   0      7
32:                0   0    100 |      0   0      7
64:                0   0    100 |      0   0      7
128:               0   0    100 |      0   0      7
256:               0   0    100 |    718   7     15
512:               0   0    100 |   1386  14     29
1024:              0   0    100 |   2205  23     52
2048:              0   0    100 |   1429  15     67
4096:              0   0    100 |   1103  11     79
8192:              0   0    100 |   1942  20    100

Can anyone please explain to me why the RPCs in flight shown in the rpc_stats could be higher than the max_rpcs_in_flight?

Thanks,
Md. Hasanur Rashid
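For reference, the limit discussed above is normally set and verified with lctl. A sketch, where the filesystem name hasanfs is taken from the output above:

    # Limit every OSC of this filesystem to a single RPC in flight
    lctl set_param osc.hasanfs-*.max_rpcs_in_flight=1

    # Verify before and after the benchmark run
    lctl get_param osc.hasanfs-*.max_rpcs_in_flight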
[lustre-discuss] No space left on device error
Dear All,

We are running Lustre 2.7.x in our cluster, which had been running stably for a few years. Suddenly a bug was triggered, and even a basic copy command fails with a "No space left on device" error. We were told to upgrade to Lustre 2.10. Could we know:

1. Is there any simpler workaround available? We want to avoid an upgrade, since an upgrade can be a complex process.
2. If an upgrade is unavoidable, what is the way to move from 2.7.x to 2.10?

Thanks,
Dilip G Sathayr
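Before planning an upgrade it is worth confirming what is actually exhausted, since Lustre can return ENOSPC when a single OST or the MDT fills up even though the filesystem as a whole has free space. A minimal check with standard lfs commands:

    # Block usage per MDT/OST -- one OST at 100% can be enough to trigger ENOSPC
    lfs df -h

    # Inode usage -- a full MDT returns ENOSPC even when blocks are free
    lfs df -i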