[lustre-discuss] Random drop off OST from clients

2023-10-05 Thread Lixin Liu
 Hi,

Recently, we have frequently seen OSTs being randomly dropped by some client nodes.

We have 4 Lustre filesystems with a total of 126 OSTs. All clients run the
2.15.3 client on CentOS 7. The servers run Lustre 2.12.8 on CentOS 7
(3 filesystems) and 2.15.3 on Alma 8.8. Failures occur with both server
versions. LNet uses the OPA interface.

One example of the failure:

# lctl dl | grep ' IN '
126 IN osc cedar_sc-OST000a-osc-980c76944800 52e66575-6443-4be9-a7ce-348b526a0836 4
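
When many clients need checking, the same filter can be scripted. The following is a minimal sketch, assuming the `lctl dl` column layout shown above (index, state, type, name, UUID, refcount) with one device per line. The function name `inactive_devices` is hypothetical and not part of any Lustre tool.

```python
def inactive_devices(lctl_dl_output):
    """Return (type, name) pairs for devices whose state column is 'IN'.

    Assumes each device occupies one line with whitespace-separated
    columns: index, state, type, name, uuid, refcount.
    """
    found = []
    for line in lctl_dl_output.splitlines():
        fields = line.split()
        # Skip blank or truncated lines rather than guessing at their meaning.
        if len(fields) < 4:
            continue
        if fields[1] == "IN":
            found.append((fields[2], fields[3]))
    return found


# Sample taken from the output above, rejoined onto a single line.
sample = (
    "126 IN osc cedar_sc-OST000a-osc-980c76944800 "
    "52e66575-6443-4be9-a7ce-348b526a0836 4\n"
)
print(inactive_devices(sample))
```

Feeding it the captured output of `lctl dl` from each client gives a list that can be compared across nodes to see whether the same OSTs are affected everywhere.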

In syslog, we see

Oct  4 23:24:30 cedar5 kernel: LustreError: 11-0: cedar_sc-OST000a-osc-980c76944800: operation ldlm_enqueue to node 172.19.128.33@o2ib failed: rc = -107
Oct  4 23:24:30 cedar5 kernel: Lustre: cedar_sc-OST000a-osc-980c76944800: Connection to cedar_sc-OST000a (at 172.19.128.33@o2ib) was lost; in progress operations using this service will wait for recovery to complete
Oct  4 23:24:30 cedar5 kernel: LustreError: 5195:0:(osc_request.c:1037:osc_init_grant()) cedar_sc-OST000a-osc-980c76944800: granted 3407872 but already consumed 519700480
Oct  4 23:24:30 cedar5 kernel: LustreError: 167-0: cedar_sc-OST000a-osc-980c76944800: This client was evicted by cedar_sc-OST000a; in progress operations using this service will fail.
Oct  4 23:24:31 cedar5 kernel: LustreError: 62880:0:(ldlm_resource.c:1126:ldlm_resource_complain()) cedar_sc-OST000a-osc-980c76944800: namespace resource [0x73fbbe2:0x0:0x0].0x0 (97fe127e3080) refcount nonzero (1) after lock cleanup; forcing cleanup.
Oct  4 23:24:31 cedar5 kernel: LustreError: 5218:0:(osc_request.c:711:osc_announce_cached()) cedar_sc-OST000a-osc-980c76944800: dirty 131074 > system dirty_max 131072
Oct  4 23:24:36 cedar5 kernel: LustreError: 5209:0:(osc_request.c:711:osc_announce_cached()) cedar_sc-OST000a-osc-980c76944800: dirty 131074 > system dirty_max 131072
Oct  4 23:24:47 cedar5 kernel: LustreError: 5220:0:(osc_request.c:711:osc_announce_cached()) cedar_sc-OST000a-osc-980c76944800: dirty 131072 > system dirty_max 131072
Oct  4 23:25:36 cedar5 kernel: LustreError: 5242:0:(osc_request.c:711:osc_announce_cached()) cedar_sc-OST000a-osc-980c76944800: dirty 131074 > system dirty_max 131072
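
The `rc` and `status` values in these kernel messages are negated Linux errno numbers, so `rc = -107` is ENOTCONN. A small lookup makes them easier to read when scanning logs. This is a sketch only: the table holds just the codes that appear in this thread (-107 above, and the -125, -103, and -5 values reported in the LNet errors thread), uses the standard Linux numbering, and the helper name `describe_rc` is hypothetical.

```python
# Linux errno numbers (asm-generic values) for the codes seen in this thread.
LINUX_ERRNO = {
    5: ("EIO", "Input/output error"),
    103: ("ECONNABORTED", "Software caused connection abort"),
    107: ("ENOTCONN", "Transport endpoint is not connected"),
    125: ("ECANCELED", "Operation canceled"),
}


def describe_rc(rc):
    """Translate a negative kernel return code into its errno name and text."""
    name, text = LINUX_ERRNO.get(abs(rc), ("UNKNOWN", "not in this table"))
    return f"{rc}: {name} ({text})"


print(describe_rc(-107))  # -107: ENOTCONN (Transport endpoint is not connected)
```

The table is hard-coded rather than taken from Python's `errno` module so that the result does not depend on the platform where the script runs.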


This particular example is from a 2.15.3 server. Once this happens, the only
fix appears to be rebooting the client, after which the issue goes away.

Any ideas where we should check?

Thank you very much.

Lixin.



___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Lnet errors

2023-10-05 Thread Jeff Johnson
I couldn't say exactly, but...

   - Your net is o2ib1. Is there an o2ib0?
   - Are you routing? If so, lnet routing or IB routing? Any issues with
   the routers or routing?
   - Verify the stability of lnet and the fabric path between client and
   server in the messages above using a tool like lnet_selftest?
   - Verify the fabric: Check error counters on the switch and HCA ports
   involved. Use non-Lustre IB tools (ib_send_bw, etc) to test the fabric.

Lustre can, and will, tell you when LNet issues arise, but it cannot tell you
anything about the actual network layer it is riding on, so it is usually a
good idea to certify the function of the network layer first before delving
into "what LBUG is ruining my weekend plans?"
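
To decide which client/server paths to test first, it can help to tally the peer NIDs named in the error messages. The following is a minimal sketch, assuming IPv4-style NIDs of the form `address@network` as they appear in the quoted logs. The regular expression and the function name `count_peer_nids` are illustrative assumptions, not part of Lustre.

```python
import re
from collections import Counter

# Matches IPv4-style NIDs such as 172.19.171.15@o2ib1 (assumed pattern).
NID_RE = re.compile(r"\b(\d{1,3}(?:\.\d{1,3}){3}@[a-z0-9]+)")


def count_peer_nids(log_text):
    """Count how often each peer NID appears in a block of kernel log text."""
    return Counter(NID_RE.findall(log_text))


# Two of the server-side messages from the quoted report, one per line.
sample = (
    "Oct  5 11:16:48 oss04 kernel: LNetError: "
    "6414:0:(lib-move.c:2955:lnet_resend_pending_msgs_locked()) "
    "Error sending PUT to 12345-172.19.171.15@o2ib1: -125\n"
    "Oct  5 11:18:06 oss04 kernel: LNet: "
    "6412:0:(o2iblnd_cb.c:413:kiblnd_handle_rx()) "
    "PUT_NACK from 172.19.171.15@o2ib1\n"
)

for nid, hits in count_peer_nids(sample).most_common():
    print(nid, hits)
```

Peers that dominate the tally are reasonable first candidates for lnet_selftest runs and for checking switch and HCA port counters.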

I hope that helps,

--Jeff

(resent to list in hopes of being beneficial to others)

On Thu, Oct 5, 2023 at 9:34 AM Alastair Basden via lustre-discuss <
lustre-discuss@lists.lustre.org> wrote:

> Hi,
>
> Lustre 2.12.2.
>
> We are seeing lots of errors on the servers such as:
> Oct  5 11:16:48 oss04 kernel: LNetError:
> 6414:0:(lib-move.c:2955:lnet_resend_pending_msgs_locked()) Error sending
> PUT to 12345-172.19.171.15@o2ib1: -125
> Oct  5 11:16:48 oss04 kernel: LustreError:
> 6414:0:(events.c:450:server_bulk_callback()) event type 5, status -125,
> desc 8fe066bb9400
>
> and
> Oct  4 14:59:48 oss04 kernel: LustreError:
> 6383:0:(events.c:305:request_in_callback()) event type 2, status -103,
> service ost_io
>
> and
> Oct  5 11:18:06 oss04 kernel: LustreError:
> 6388:0:(events.c:305:request_in_callback()) event type 2, status -5,
> service ost_io
> Oct  5 11:18:06 oss04 kernel: LNet:
> 6412:0:(o2iblnd_cb.c:413:kiblnd_handle_rx()) PUT_NACK from
> 172.19.171.15@o2ib1
>
> and on the clients:
> m7: Oct  5 14:46:59 m7132 kernel: LustreError:
> 2466:0:(events.c:200:client_bulk_callback()) event type 2, status -103,
> desc 9a251fc14400
>
> and
> m7: Oct  5 11:18:34 m7086 kernel: LustreError:
> 2495:0:(events.c:200:client_bulk_callback()) event type 2, status -5, desc
> 9a39ad668000
>
> Does anyone have any ideas about what could be causing this?
>
> Thanks,
> Alastair.




[lustre-discuss] Lnet errors

2023-10-05 Thread Alastair Basden via lustre-discuss

Hi,

Lustre 2.12.2.

We are seeing lots of errors on the servers such as:
Oct  5 11:16:48 oss04 kernel: LNetError: 6414:0:(lib-move.c:2955:lnet_resend_pending_msgs_locked()) Error sending PUT to 12345-172.19.171.15@o2ib1: -125
Oct  5 11:16:48 oss04 kernel: LustreError: 6414:0:(events.c:450:server_bulk_callback()) event type 5, status -125, desc 8fe066bb9400

and
Oct  4 14:59:48 oss04 kernel: LustreError: 6383:0:(events.c:305:request_in_callback()) event type 2, status -103, service ost_io

and
Oct  5 11:18:06 oss04 kernel: LustreError: 6388:0:(events.c:305:request_in_callback()) event type 2, status -5, service ost_io
Oct  5 11:18:06 oss04 kernel: LNet: 6412:0:(o2iblnd_cb.c:413:kiblnd_handle_rx()) PUT_NACK from 172.19.171.15@o2ib1

and on the clients:
m7: Oct  5 14:46:59 m7132 kernel: LustreError: 2466:0:(events.c:200:client_bulk_callback()) event type 2, status -103, desc 9a251fc14400

and
m7: Oct  5 11:18:34 m7086 kernel: LustreError: 2495:0:(events.c:200:client_bulk_callback()) event type 2, status -5, desc 9a39ad668000

Does anyone have any ideas about what could be causing this?

Thanks,
Alastair.