Re: [lustre-discuss] Lustre stuck in ldlm_lockd (lock on destroyed export, lock timed out)

2021-03-10 Thread Thomas Roth via lustre-discuss

In addition, I noticed that those clients that do reconnect are logged as:

Mar 10 13:12:24 lxmds19.gsi.de kernel: Lustre: hebe-MDT: Connection restored to  (at 10.20.0.41@o2ib5)

Both MDS and MDT list this client (/proc/fs/lustre/.../exports/), and a UUID 
is present there for the client.


Regards
Thomas

On 10.03.21 12:33, Thomas Roth via lustre-discuss wrote:

Hi all,

we are in a critical situation: our Lustre file system has become completely 
inaccessible.

We are running Lustre 2.12.5 on CentOS 7.8, built from Whamcloud sources, with 
MDTs on ldiskfs, OSTs on ZFS, and 3 MDSs.

The first MDS, running MGS + MDT0, is logging
### lock callback timer expired
while evicting clients, and
### lock on destroyed export
for the same client, as in:


Mar 10 09:51:54 lxmds19.gsi.de kernel: LustreError: 4779:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 450s: evicting client at 10.20.4.68@o2ib5  ns: mdt-hebe-MDT_UUID lock: 8f1ef6681b00/0xdba5480d76a73ab6 lrc: 3/0,0 mode: PR/PR res: [0x20002db4c:0x14:0x0].0x0 bits 0x13/0x0 rrc: 3 type: IBT flags: 0x6020040020 nid: 10.20.4.68@o2ib5 remote: 0x5360294b0558b867 expref: 31 pid: 6649 timeout: 4849 lvb_type: 0


Mar 10 09:51:54 lxmds19.gsi.de kernel: LustreError: 6570:0:(ldlm_lockd.c:1348:ldlm_handle_enqueue0()) ### lock on destroyed export 8f1eede9 ns: mdt-hebe-MDT_UUID lock: 8f1efbded8c0/0xdba5480d76a9e456 lrc: 3/0,0 mode: PR/PR res: [0x20002c52b:0xd92b:0x0].0x0 bits 0x13/0x0 rrc: 175 type: IBT flags: 0x5020040020 nid: 10.20.4.68@o2ib5 remote: 0x5360294b0558b875 expref: 4 pid: 6570 timeout: 0 lvb_type: 0
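
For anyone triaging similar logs, here is a minimal Python sketch that pulls the reporting function, the "###" message, and the client NID out of records like the two above, and groups them per NID. The regular expression is an assumption fitted only to these two sample lines, and the names parse_ldlm_record and messages_by_nid are hypothetical helpers, not part of any Lustre tooling.

```python
import re
from collections import defaultdict

# Sketch only: this pattern is fitted to the two sample records quoted above.
# Other LDLM messages may use a different field layout.
LDLM_RE = re.compile(
    r"LustreError: \d+:\d+:\((?P<file>[\w.]+):(?P<line>\d+):(?P<func>\w+)\(\)\)"
    r" ### (?P<msg>.*?) ns: (?P<ns>\S+)"
    r".*?\bnid: (?P<nid>\S+)"
    r".*?\bexpref: (?P<expref>\d+)"
    r".*?\btimeout: (?P<timeout>\d+)"
)


def parse_ldlm_record(text):
    """Return a dict of fields for one LDLM record, or None if it does not match."""
    # Collapse line wraps and repeated spaces so wrapped records still match.
    flat = " ".join(text.split())
    match = LDLM_RE.search(flat)
    if match is None:
        return None
    record = match.groupdict()
    for key in ("line", "expref", "timeout"):
        record[key] = int(record[key])
    return record


def messages_by_nid(records):
    """Group (function, message) pairs by client NID, skipping unmatched text."""
    grouped = defaultdict(list)
    for text in records:
        record = parse_ldlm_record(text)
        if record is not None:
            grouped[record["nid"]].append((record["func"], record["msg"]))
    return dict(grouped)


# The two records from this post, copied as they were wrapped in the mail.
SAMPLES = [
    "Mar 10 09:51:54 lxmds19.gsi.de kernel: LustreError: "
    "4779:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer "
    "expired after 450s: evicting \nclient at 10.20.4.68@o2ib5  "
    "ns: mdt-hebe-MDT_UUID lock: 8f1ef6681b00/0xdba5480d76a73ab6 lrc: 3/0,0 "
    "mode: PR/PR res: [0x20002db4c:0x14:0x0].0x0 \nbits 0x13/0x0 rrc: 3 "
    "type: IBT flags: 0x6020040020 nid: 10.20.4.68@o2ib5 "
    "remote: 0x5360294b0558b867 expref: 31 pid: 6649 timeout: 4849 lvb_type: 0",
    "Mar 10 09:51:54 lxmds19.gsi.de kernel: LustreError: "
    "6570:0:(ldlm_lockd.c:1348:ldlm_handle_enqueue0()) ### lock on destroyed "
    "export 8f1eede9 \nns: mdt-hebe-MDT_UUID lock: "
    "8f1efbded8c0/0xdba5480d76a9e456 lrc: 3/0,0 mode: PR/PR "
    "res: [0x20002c52b:0xd92b:0x0].0x0 bits 0x13/0x0 rrc: 175 \n"
    "type: IBT flags: 0x5020040020 nid: 10.20.4.68@o2ib5 "
    "remote: 0x5360294b0558b875 expref: 4 pid: 6570 timeout: 0 lvb_type: 0",
]

if __name__ == "__main__":
    for nid, events in messages_by_nid(SAMPLES).items():
        print(nid)
        for func, msg in events:
            print(f"  {func}: {msg}")
```

Run against a full syslog, this kind of grouping makes it easier to see whether each eviction is followed by "lock on destroyed export" messages for the same NID, which is the pattern described here.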




Eventually, the log shows:
### lock timed out ; not entering recovery in server code, just going back to sleep


Restarting the server does not help.
Recovery completes and clients show the MDS in 'lfs check mds', but any kind of 
access (e.g. 'ls') hangs.


Any help is much appreciated.

Regards
Thomas




--

Thomas Roth
Department: IT
Location: SB3 2.291
Phone: +49-6159-71 1453  Fax: +49-6159-71 2986

GSI Helmholtzzentrum für Schwerionenforschung GmbH
Planckstraße 1, 64291 Darmstadt, Germany, www.gsi.de

Commercial Register / Handelsregister: Amtsgericht Darmstadt, HRB 1528
Managing Directors / Geschäftsführung:
Professor Dr. Paolo Giubellino, Dr. Ulrich Breuer, Jörg Blaurock
Chairman of the Supervisory Board / Vorsitzender des GSI-Aufsichtsrats:
State Secretary / Staatssekretär Dr. Volkmar Dietz
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

