[lustre-discuss] Random drop off OST from clients
Hi,

Recently we have frequently seen OSTs being randomly dropped by some client nodes. We have 4 Lustre filesystems with 126 OSTs in total. All clients run the 2.15.3 client on CentOS 7. The servers run Lustre 2.12.8 on CentOS 7 (3 filesystems) and 2.15.3 on Alma 8.8. Failures happen with both server versions. LNet uses the OPA interface.

One example of the failure:

# lctl dl | grep ' IN '
126 IN osc cedar_sc-OST000a-osc-980c76944800 52e66575-6443-4be9-a7ce-348b526a0836 4

In syslog, we see:

Oct 4 23:24:30 cedar5 kernel: LustreError: 11-0: cedar_sc-OST000a-osc-980c76944800: operation ldlm_enqueue to node 172.19.128.33@o2ib failed: rc = -107
Oct 4 23:24:30 cedar5 kernel: Lustre: cedar_sc-OST000a-osc-980c76944800: Connection to cedar_sc-OST000a (at 172.19.128.33@o2ib) was lost; in progress operations using this service will wait for recovery to complete
Oct 4 23:24:30 cedar5 kernel: LustreError: 5195:0:(osc_request.c:1037:osc_init_grant()) cedar_sc-OST000a-osc-980c76944800: granted 3407872 but already consumed 519700480
Oct 4 23:24:30 cedar5 kernel: LustreError: 167-0: cedar_sc-OST000a-osc-980c76944800: This client was evicted by cedar_sc-OST000a; in progress operations using this service will fail.
Oct 4 23:24:31 cedar5 kernel: LustreError: 62880:0:(ldlm_resource.c:1126:ldlm_resource_complain()) cedar_sc-OST000a-osc-980c76944800: namespace resource [0x73fbbe2:0x0:0x0].0x0 (97fe127e3080) refcount nonzero (1) after lock cleanup; forcing cleanup.
Oct 4 23:24:31 cedar5 kernel: LustreError: 5218:0:(osc_request.c:711:osc_announce_cached()) cedar_sc-OST000a-osc-980c76944800: dirty 131074 > system dirty_max 131072
Oct 4 23:24:36 cedar5 kernel: LustreError: 5209:0:(osc_request.c:711:osc_announce_cached()) cedar_sc-OST000a-osc-980c76944800: dirty 131074 > system dirty_max 131072
Oct 4 23:24:47 cedar5 kernel: LustreError: 5220:0:(osc_request.c:711:osc_announce_cached()) cedar_sc-OST000a-osc-980c76944800: dirty 131072 > system dirty_max 131072
Oct 4 23:25:36 cedar5 kernel: LustreError: 5242:0:(osc_request.c:711:osc_announce_cached()) cedar_sc-OST000a-osc-980c76944800: dirty 131074 > system dirty_max 131072

This one in particular is a 2.15.3 server. Once this happens, the only way to recover appears to be rebooting the client, after which the issue goes away.

Any ideas where we should check?

Thank you very much.

Lixin.

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
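For anyone who wants to watch for this condition across many clients, the `lctl dl | grep ' IN '` check above can be scripted. The following is only a minimal Python sketch. It assumes each device line has the six whitespace-separated fields seen in the example (index, state, type, name, UUID, refcount). The names `Device`, `parse_lctl_dl` and `inactive_devices` are made up for illustration, and the first sample line (an `UP` device) is hypothetical.

```python
from typing import List, NamedTuple


class Device(NamedTuple):
    index: int
    state: str
    dev_type: str
    name: str
    uuid: str
    refcount: int


def parse_lctl_dl(text: str) -> List[Device]:
    """Parse `lctl dl`-style output, assuming six fields per device line."""
    devices = []
    for line in text.splitlines():
        fields = line.split()
        # Skip anything that does not look like a device line.
        if len(fields) != 6 or not fields[0].isdigit() or not fields[5].isdigit():
            continue
        devices.append(
            Device(int(fields[0]), fields[1], fields[2],
                   fields[3], fields[4], int(fields[5]))
        )
    return devices


def inactive_devices(text: str) -> List[Device]:
    """Return the devices whose state column is IN (inactive)."""
    return [dev for dev in parse_lctl_dl(text) if dev.state == "IN"]


# The first line is a hypothetical healthy device; the second is from the post.
SAMPLE = """\
125 UP osc cedar_sc-OST0009-osc-980c76944800 52e66575-6443-4be9-a7ce-348b526a0836 4
126 IN osc cedar_sc-OST000a-osc-980c76944800 52e66575-6443-4be9-a7ce-348b526a0836 4
"""

if __name__ == "__main__":
    for dev in inactive_devices(SAMPLE):
        print(f"inactive: {dev.name} (device {dev.index})")
```

In practice the text would come from running `lctl dl` on each client (for example via `subprocess.run`), which is left out here so the sketch stays self-contained.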
Re: [lustre-discuss] Lnet errors
I couldn't say exactly, but:

- Your net is o2ib1. Is there an o2ib0?
- Are you routing? If so, is it LNet routing or IB routing? Are there any issues with the routers or the routing?
- Verify the stability of LNet and of the fabric path between the client and server named in the messages above, using a tool like lnet_selftest.
- Verify the fabric: check the error counters on the switch and HCA ports involved, and use non-Lustre IB tools (ib_send_bw, etc.) to test the fabric.

Lustre can and will tell you when LNet issues arise, but it cannot tell you anything about the actual network layer it is riding on, so it is usually a good idea to certify the function of the network layer first before delving into "what LBUG is ruining my weekend plans?"

I hope that helps,

--Jeff

(resent to list in hopes of being beneficial to others)

On Thu, Oct 5, 2023 at 9:34 AM Alastair Basden via lustre-discuss <lustre-discuss@lists.lustre.org> wrote:

> Hi,
>
> Lustre 2.12.2.
>
> We are seeing lots of errors on the servers such as:
> Oct 5 11:16:48 oss04 kernel: LNetError: 6414:0:(lib-move.c:2955:lnet_resend_pending_msgs_locked()) Error sending PUT to 12345-172.19.171.15@o2ib1: -125
> Oct 5 11:16:48 oss04 kernel: LustreError: 6414:0:(events.c:450:server_bulk_callback()) event type 5, status -125, desc 8fe066bb9400
>
> and
> Oct 4 14:59:48 oss04 kernel: LustreError: 6383:0:(events.c:305:request_in_callback()) event type 2, status -103, service ost_io
>
> and
> Oct 5 11:18:06 oss04 kernel: LustreError: 6388:0:(events.c:305:request_in_callback()) event type 2, status -5, service ost_io
> Oct 5 11:18:06 oss04 kernel: LNet: 6412:0:(o2iblnd_cb.c:413:kiblnd_handle_rx()) PUT_NACK from 172.19.171.15@o2ib1
>
> and on the clients:
> m7: Oct 5 14:46:59 m7132 kernel: LustreError: 2466:0:(events.c:200:client_bulk_callback()) event type 2, status -103, desc 9a251fc14400
>
> and
> m7: Oct 5 11:18:34 m7086 kernel: LustreError: 2495:0:(events.c:200:client_bulk_callback()) event type 2, status -5, desc 9a39ad668000
>
> Does anyone have any ideas about what could be causing this?
>
> Thanks,
> Alastair.

--
Jeff Johnson
Co-Founder
Aeon Computing

jeff.john...@aeoncomputing.com
www.aeoncomputing.com
t: 858-412-3810 x1001 f: 858-412-3845
m: 619-204-9061

4170 Morena Boulevard, Suite C - San Diego, CA 92117

High-Performance Computing / Lustre Filesystems / Scale-out Storage
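When deciding which client/server path to test first, it can help to count how often each peer NID shows up in the error messages. The following is a rough Python sketch, not a Lustre tool. It assumes syslog lines shaped like the ones quoted above and only recognises IPv4-style NIDs such as 172.19.171.15@o2ib1. The function name `count_peer_nids` and the regular expression are illustrative.

```python
import re
from collections import Counter

# Matches IPv4-style NIDs such as 172.19.171.15@o2ib1 (assumed format).
NID_RE = re.compile(r"(\d{1,3}(?:\.\d{1,3}){3})@(\w+)")


def count_peer_nids(log_text: str) -> Counter:
    """Count occurrences of each peer NID in LNet/Lustre log lines."""
    counts = Counter()
    for line in log_text.splitlines():
        for addr, net in NID_RE.findall(line):
            counts[f"{addr}@{net}"] += 1
    return counts


# Server-side lines taken from the quoted message.
SAMPLE_LOG = """\
Oct 5 11:16:48 oss04 kernel: LNetError: 6414:0:(lib-move.c:2955:lnet_resend_pending_msgs_locked()) Error sending PUT to 12345-172.19.171.15@o2ib1: -125
Oct 5 11:18:06 oss04 kernel: LustreError: 6388:0:(events.c:305:request_in_callback()) event type 2, status -5, service ost_io
Oct 5 11:18:06 oss04 kernel: LNet: 6412:0:(o2iblnd_cb.c:413:kiblnd_handle_rx()) PUT_NACK from 172.19.171.15@o2ib1
"""

if __name__ == "__main__":
    for nid, hits in count_peer_nids(SAMPLE_LOG).most_common():
        print(f"{nid}: {hits}")
```

If a few NIDs account for most of the errors, those hosts and the switch ports between them are reasonable first candidates for the counter checks and bandwidth tests suggested above.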
[lustre-discuss] Lnet errors
Hi,

Lustre 2.12.2.

We are seeing lots of errors on the servers such as:

Oct 5 11:16:48 oss04 kernel: LNetError: 6414:0:(lib-move.c:2955:lnet_resend_pending_msgs_locked()) Error sending PUT to 12345-172.19.171.15@o2ib1: -125
Oct 5 11:16:48 oss04 kernel: LustreError: 6414:0:(events.c:450:server_bulk_callback()) event type 5, status -125, desc 8fe066bb9400

and

Oct 4 14:59:48 oss04 kernel: LustreError: 6383:0:(events.c:305:request_in_callback()) event type 2, status -103, service ost_io

and

Oct 5 11:18:06 oss04 kernel: LustreError: 6388:0:(events.c:305:request_in_callback()) event type 2, status -5, service ost_io
Oct 5 11:18:06 oss04 kernel: LNet: 6412:0:(o2iblnd_cb.c:413:kiblnd_handle_rx()) PUT_NACK from 172.19.171.15@o2ib1

and on the clients:

m7: Oct 5 14:46:59 m7132 kernel: LustreError: 2466:0:(events.c:200:client_bulk_callback()) event type 2, status -103, desc 9a251fc14400

and

m7: Oct 5 11:18:34 m7086 kernel: LustreError: 2495:0:(events.c:200:client_bulk_callback()) event type 2, status -5, desc 9a39ad668000

Does anyone have any ideas about what could be causing this?

Thanks,
Alastair.
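The status values in these messages are negative Linux errno numbers, so naming them can make the logs easier to read. Below is a small Python sketch that does this for the codes appearing in this thread. The table is hard-coded with the standard Linux numbering rather than taken from the running system. The name `decode_statuses` and the regular expression are illustrative and only cover the three message shapes seen here ("status -N", "rc = -N" and a trailing ": -N").

```python
import re

# Standard Linux errno numbers for the codes that appear in this thread.
LINUX_ERRNO = {
    5: "EIO",             # Input/output error
    103: "ECONNABORTED",  # Software caused connection abort
    107: "ENOTCONN",      # Transport endpoint is not connected
    125: "ECANCELED",     # Operation canceled
}

# Assumed message shapes: "status -125", "rc = -107", "...@o2ib1: -125".
STATUS_RE = re.compile(r"(?:status |rc = |: )(-\d+)\b")


def decode_statuses(line: str):
    """Return (code, errno name) pairs for each status found in a log line."""
    results = []
    for match in STATUS_RE.finditer(line):
        code = int(match.group(1))
        results.append((code, LINUX_ERRNO.get(-code, "unknown")))
    return results


if __name__ == "__main__":
    line = ("Oct 4 14:59:48 oss04 kernel: LustreError: "
            "6383:0:(events.c:305:request_in_callback()) "
            "event type 2, status -103, service ost_io")
    for code, name in decode_statuses(line):
        print(f"{code} -> {name}")
```

Codes outside the table are reported as "unknown" rather than guessed. On a Linux host, Python's `errno.errorcode` could replace the hard-coded table.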