Re: [lustre-discuss] Lnet errors

2023-10-05 Thread Jeff Johnson
I couldn't say exactly, but...

   - Your net is o2ib1. Is there an o2ib0?
   - Are you routing? If so, lnet routing or IB routing? Any issues with
   the routers or routing?
   - Have you verified the stability of lnet and the fabric path between the
   client and server named in the messages above, using a tool like
   lnet_selftest?
   - Verify the fabric: Check error counters on the switch and HCA ports
   involved. Use non-Lustre IB tools (ib_send_bw, etc) to test the fabric.
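
As a rough sketch of the lnet_selftest step above, the following only builds the usual `lst` command sequence as strings so it can be reviewed before anything is run. The client NID comes from the log messages in this thread; the server NID, session name, group names and batch name are placeholders I made up, and the `lst` syntax is as I recall it from the Lustre manual, so check it against your installed version.

```python
def lst_commands(client_nid, server_nid, batch="bulk_rw", size="1M"):
    """Return shell command lines for a simple lnet_selftest bulk write test.

    Nothing is executed here. The lnet_selftest module has to be loaded on
    every node taking part (console, client and server).
    """
    return [
        "modprobe lnet_selftest",
        "export LST_SESSION=$$",
        "lst new_session rw_check",            # session name is arbitrary
        f"lst add_group clients {client_nid}",
        f"lst add_group servers {server_nid}",
        f"lst add_batch {batch}",
        f"lst add_test --batch {batch} --from clients --to servers "
        f"brw write check=simple size={size}",
        f"lst run {batch}",
        # lst stat keeps reporting until interrupted, so stop it after 30 s
        "lst stat clients servers & sleep 30; kill $!",
        "lst end_session",
    ]


if __name__ == "__main__":
    # 172.19.171.15@o2ib1 is the peer from the log messages in this thread;
    # the server NID below is a hypothetical placeholder.
    for cmd in lst_commands("172.19.171.15@o2ib1", "192.0.2.4@o2ib1"):
        print(cmd)
```

If the selftest itself shows drops or stalls between those two NIDs, that points at lnet or the fabric rather than at the filesystem layers above it.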

Lustre can, and will, tell you when lnet issues arise, but it cannot tell you
anything about the actual network layer it is riding on, so it is usually a
good idea to certify the function of the network layer first before delving
into "what LBUG is ruining my weekend plans?"
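
To see what Lustre is actually reporting, it can help to decode the status values, which are negated Linux errno codes. The sketch below tallies the messages from this thread by reporting function and errno name. The regular expression and the `tally` helper are my own illustration, not a Lustre tool, and the errno table only covers the three codes that appear here.

```python
import re
from collections import Counter

# Linux errno values; the kernel messages report them negated.
LINUX_ERRNO = {
    5: "EIO",              # I/O error
    103: "ECONNABORTED",   # software caused connection abort
    125: "ECANCELED",      # operation canceled
}

# Matches "(file.c:line:func())" followed later by "status -N" or ": -N".
STATUS_RE = re.compile(
    r"\((?P<file>[\w.-]+):(?P<line>\d+):(?P<func>\w+)\(\)\)"
    r".*?(?:status |: )(?P<status>-\d+)"
)


def tally(lines):
    """Count log messages per (reporting function, errno name)."""
    counts = Counter()
    for line in lines:
        m = STATUS_RE.search(line)
        if m is None:
            continue  # e.g. the PUT_NACK line carries no status code
        code = -int(m.group("status"))
        counts[(m.group("func"), LINUX_ERRNO.get(code, f"errno {code}"))] += 1
    return counts


# Log lines from the original post, with the mail wrapping undone.
SAMPLE = [
    "Oct  5 11:16:48 oss04 kernel: LNetError: 6414:0:(lib-move.c:2955:lnet_resend_pending_msgs_locked()) Error sending PUT to 12345-172.19.171.15@o2ib1: -125",
    "Oct  5 11:16:48 oss04 kernel: LustreError: 6414:0:(events.c:450:server_bulk_callback()) event type 5, status -125, desc 8fe066bb9400",
    "Oct  4 14:59:48 oss04 kernel: LustreError: 6383:0:(events.c:305:request_in_callback()) event type 2, status -103, service ost_io",
    "Oct  5 11:18:06 oss04 kernel: LustreError: 6388:0:(events.c:305:request_in_callback()) event type 2, status -5, service ost_io",
    "Oct  5 11:18:06 oss04 kernel: LNet: 6412:0:(o2iblnd_cb.c:413:kiblnd_handle_rx()) PUT_NACK from 172.19.171.15@o2ib1",
]

if __name__ == "__main__":
    for (func, name), n in sorted(tally(SAMPLE).items()):
        print(f"{func}: {name} x{n}")
```

Canceled sends, aborted connections and I/O errors all describe a transfer that did not complete, which is consistent with looking at the network layer first, though none of them identifies the cause on its own.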

I hope that helps,

--Jeff

(resent to list in hopes of being beneficial to others)

On Thu, Oct 5, 2023 at 9:34 AM Alastair Basden via lustre-discuss <
lustre-discuss@lists.lustre.org> wrote:

> Hi,
>
> Lustre 2.12.2.
>
> We are seeing lots of errors on the servers such as:
> Oct  5 11:16:48 oss04 kernel: LNetError:
> 6414:0:(lib-move.c:2955:lnet_resend_pending_msgs_locked()) Error sending
> PUT to 12345-172.19.171.15@o2ib1: -125
> Oct  5 11:16:48 oss04 kernel: LustreError:
> 6414:0:(events.c:450:server_bulk_callback()) event type 5, status -125,
> desc 8fe066bb9400
>
> and
> Oct  4 14:59:48 oss04 kernel: LustreError:
> 6383:0:(events.c:305:request_in_callback()) event type 2, status -103,
> service ost_io
>
> and
> Oct  5 11:18:06 oss04 kernel: LustreError:
> 6388:0:(events.c:305:request_in_callback()) event type 2, status -5,
> service ost_io
> Oct  5 11:18:06 oss04 kernel: LNet:
> 6412:0:(o2iblnd_cb.c:413:kiblnd_handle_rx()) PUT_NACK from
> 172.19.171.15@o2ib1
>
> and on the clients:
> m7: Oct  5 14:46:59 m7132 kernel: LustreError:
> 2466:0:(events.c:200:client_bulk_callback()) event type 2, status -103,
> desc 9a251fc14400
>
> and
> m7: Oct  5 11:18:34 m7086 kernel: LustreError:
> 2495:0:(events.c:200:client_bulk_callback()) event type 2, status -5, desc
> 9a39ad668000
>
> Does anyone have any ideas about what could be causing this?
>
> Thanks,
> Alastair.
> ___
> lustre-discuss mailing list
> lustre-discuss@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
>


-- 
--
Jeff Johnson
Co-Founder
Aeon Computing

jeff.john...@aeoncomputing.com
www.aeoncomputing.com
t: 858-412-3810 x1001   f: 858-412-3845
m: 619-204-9061

4170 Morena Boulevard, Suite C - San Diego, CA 92117

High-Performance Computing / Lustre Filesystems / Scale-out Storage


[lustre-discuss] Lnet errors

2023-10-05 Thread Alastair Basden via lustre-discuss

Hi,

Lustre 2.12.2.

We are seeing lots of errors on the servers such as:
Oct  5 11:16:48 oss04 kernel: LNetError: 6414:0:(lib-move.c:2955:lnet_resend_pending_msgs_locked()) Error sending PUT to 12345-172.19.171.15@o2ib1: -125
Oct  5 11:16:48 oss04 kernel: LustreError: 6414:0:(events.c:450:server_bulk_callback()) event type 5, status -125, desc 8fe066bb9400

and
Oct  4 14:59:48 oss04 kernel: LustreError: 6383:0:(events.c:305:request_in_callback()) event type 2, status -103, service ost_io

and
Oct  5 11:18:06 oss04 kernel: LustreError: 6388:0:(events.c:305:request_in_callback()) event type 2, status -5, service ost_io
Oct  5 11:18:06 oss04 kernel: LNet: 6412:0:(o2iblnd_cb.c:413:kiblnd_handle_rx()) PUT_NACK from 172.19.171.15@o2ib1

and on the clients:
m7: Oct  5 14:46:59 m7132 kernel: LustreError: 2466:0:(events.c:200:client_bulk_callback()) event type 2, status -103, desc 9a251fc14400

and
m7: Oct  5 11:18:34 m7086 kernel: LustreError: 2495:0:(events.c:200:client_bulk_callback()) event type 2, status -5, desc 9a39ad668000

Does anyone have any ideas about what could be causing this?

Thanks,
Alastair.