Re: [Lustre-discuss] Lustre locking up on login/interactive nodes
On Mon, 2008-07-21 at 12:04 -0400, Brock Palen wrote: Ok will keep in mind. Looks the same though, Indeed, in most cases it is the same, with the one exception that from syslog, we get time context. Its odd though, if I login to the same machine I can move to that directory list the files etc. read files on those OST's and notice this was eviction by the MDS, I see no lost network connections or network errors. Strange not good not good at all. The syslog data is the same, its below: Right. There is nothing terribly useful in it though. It's just the notification of an eviction. The real question is why did it get evicted. The evictor will know more about that than the evictee. If the evictor doesn't log anything more than a didn't hear from client ... in ... seconds, evicting then something is preventing the client from sending traffic to the evictor. b. signature.asc Description: This is a digitally signed message part ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
[Lustre-discuss] Lustre locking up on login/interactive nodes
Every so often lustre locks up. It will recover eventually. The process show this self's in 'D' Uninterruptible IO Wait. This case it was 'ar' making an archive. Dmesg then shows: Lustre: nobackup-MDT-mdc-0101fc467800: Connection to service nobackup-MDT via nid [EMAIL PROTECTED] was lost; in progress operations using this service will wait for recovery to complete. LustreError: 167-0: This client was evicted by nobackup-MDT; in progress operations using this service will fail. LustreError: 17575:0:(client.c:519:ptlrpc_import_delay_req()) @@@ IMP_INVALID [EMAIL PROTECTED] x912452/t0 o101-[EMAIL PROTECTED]@tcp:12 lens 488/768 ref 1 fl Rpc:P/0/0 rc 0/0 LustreError: 17575:0:(mdc_locks.c:423:mdc_finish_enqueue()) ldlm_cli_enqueue: -108 LustreError: 27076:0:(client.c:519:ptlrpc_import_delay_req()) @@@ IMP_INVALID [EMAIL PROTECTED] x912464/t0 o101-[EMAIL PROTECTED]@tcp:12 lens 440/768 ref 1 fl Rpc:/0/0 rc 0/0 LustreError: 27076:0:(mdc_locks.c:423:mdc_finish_enqueue()) ldlm_cli_enqueue: -108 LustreError: 27489:0:(file.c:97:ll_close_inode_openhandle()) inode 12653753 mdc close failed: rc = -108 LustreError: 27489:0:(file.c:97:ll_close_inode_openhandle()) inode 12195682 mdc close failed: rc = -108 LustreError: 27489:0:(file.c:97:ll_close_inode_openhandle()) Skipped 46 previous similar messages Lustre: nobackup-MDT-mdc-0101fc467800: Connection restored to service nobackup-MDT using nid [EMAIL PROTECTED] LustreError: 11-0: an error occurred while communicating with [EMAIL PROTECTED] The mds_close operation failed with -116 LustreError: 11-0: an error occurred while communicating with [EMAIL PROTECTED] The mds_close operation failed with -116 LustreError: 26930:0:(file.c:97:ll_close_inode_openhandle()) inode 11441446 mdc close failed: rc = -116 LustreError: 26930:0:(file.c:97:ll_close_inode_openhandle()) Skipped 113 previous similar messages Is there special options that should be done on interactive/login nodes? I remember something about how much memory should be available on login vs batch nodes. But I don't know how to change that, I just assumed lustre would use it. Login nodes have 8GB. __ www.palen.serveftp.net Center for Advanced Computing http://cac.engin.umich.edu [EMAIL PROTECTED] ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
Re: [Lustre-discuss] Lustre locking up on login/interactive nodes
On Mon, 2008-07-21 at 11:43 -0400, Brock Palen wrote: Every so often lustre locks up. It will recover eventually. The process show this self's in 'D' Uninterruptible IO Wait. This case it was 'ar' making an archive. Dmesg then shows: Syslog is usually a better place to get messages from as it gives some context as to the time of the messages. Lustre: nobackup-MDT-mdc-0101fc467800: Connection to service nobackup-MDT via nid [EMAIL PROTECTED] was lost; in progress operations using this service will wait for recovery to complete. LustreError: 167-0: This client was evicted by nobackup-MDT; in progress operations using this service will fail. LustreError: 17575:0:(client.c:519:ptlrpc_import_delay_req()) @@@ IMP_INVALID [EMAIL PROTECTED] x912452/t0 o101-[EMAIL PROTECTED]@tcp:12 lens 488/768 ref 1 fl Rpc:P/0/0 rc 0/0 LustreError: 17575:0:(mdc_locks.c:423:mdc_finish_enqueue()) ldlm_cli_enqueue: -108 LustreError: 27076:0:(client.c:519:ptlrpc_import_delay_req()) @@@ IMP_INVALID [EMAIL PROTECTED] x912464/t0 o101-[EMAIL PROTECTED]@tcp:12 lens 440/768 ref 1 fl Rpc:/0/0 rc 0/0 LustreError: 27076:0:(mdc_locks.c:423:mdc_finish_enqueue()) ldlm_cli_enqueue: -108 LustreError: 27489:0:(file.c:97:ll_close_inode_openhandle()) inode 12653753 mdc close failed: rc = -108 LustreError: 27489:0:(file.c:97:ll_close_inode_openhandle()) inode 12195682 mdc close failed: rc = -108 LustreError: 27489:0:(file.c:97:ll_close_inode_openhandle()) Skipped 46 previous similar messages Lustre: nobackup-MDT-mdc-0101fc467800: Connection restored to service nobackup-MDT using nid [EMAIL PROTECTED] LustreError: 11-0: an error occurred while communicating with [EMAIL PROTECTED] The mds_close operation failed with -116 LustreError: 11-0: an error occurred while communicating with [EMAIL PROTECTED] The mds_close operation failed with -116 LustreError: 26930:0:(file.c:97:ll_close_inode_openhandle()) inode 11441446 mdc close failed: rc = -116 LustreError: 26930:0:(file.c:97:ll_close_inode_openhandle()) Skipped 113 previous similar messages This looks like a pretty standard eviction. Probably the most interesting information is on the node that did the evicting. If it doesn't contain much other than a have not heard from, then you have node that is either disappearing from the network or getting wedged enough to stop sending pings (or any other traffic in lieu of). b. signature.asc Description: This is a digitally signed message part ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
Re: [Lustre-discuss] Lustre locking up on login/interactive nodes
-BEGIN PGP SIGNED MESSAGE- Hash: SHA1 On Jul 21, 2008, at 11:51 AM, Brian J. Murrell wrote: On Mon, 2008-07-21 at 11:43 -0400, Brock Palen wrote: Every so often lustre locks up. It will recover eventually. The process show this self's in 'D' Uninterruptible IO Wait. This case it was 'ar' making an archive. Dmesg then shows: Syslog is usually a better place to get messages from as it gives some context as to the time of the messages. Ok will keep in mind. Looks the same though, Its odd though, if I login to the same machine I can move to that directory list the files etc. read files on those OST's and notice this was eviction by the MDS, I see no lost network connections or network errors. Strange not good not good at all. The syslog data is the same, its below: Brock Jul 21 11:38:39 nyx-login1 kernel: Lustre: nobackup-MDT- mdc-0101fc467800: Connection to service nobackup-MDT via nid [EMAIL PROTECTED] was lost; in progress operations using this service will wait for recovery to complete.Jul 21 11:38:39 nyx-login1 kernel: LustreError: 167-0: This client was evicted by nobackup- MDT; in progress operations using this service will fail.Jul 21 11:38:39 nyx-login1 kernel: LustreError: 17575:0:(client.c: 519:ptlrpc_import_delay_req()) @@@ IMP_INVALID [EMAIL PROTECTED] x912452/t0 o101-[EMAIL PROTECTED]@tcp:12 lens 488/768 ref 1 fl Rpc:P/0/0 rc 0/0Jul 21 11:38:39 nyx-login1 kernel: LustreError: 17575:0:(mdc_locks.c:423:mdc_finish_enqueue()) ldlm_cli_enqueue: -108Jul 21 11:38:39 nyx-login1 kernel: LustreError: 27076:0:(client.c:519:ptlrpc_import_delay_req()) @@@ IMP_INVALID [EMAIL PROTECTED] x912464/t0 o101-nobackup- [EMAIL PROTECTED]@tcp:12 lens 440/768 ref 1 fl Rpc:/0/0 rc 0/0Jul 21 11:38:39 nyx-login1 kernel: LustreError: 27076:0: (mdc_locks.c:423:mdc_finish_enqueue()) ldlm_cli_enqueue: -108Jul 21 11:38:39 nyx-login1 kernel: LustreError: 27489:0:(file.c: 97:ll_close_inode_openhandle()) inode 12653753 mdc close failed: rc = - -108Jul 21 11:38:39 nyx-login1 kernel: LustreError: 27489:0:(file.c: 97:ll_close_inode_openhandle()) inode 12195682 mdc close failed: rc = - -108Jul 21 11:38:39 nyx-login1 kernel: LustreError: 27489:0:(file.c: 97:ll_close_inode_openhandle()) Skipped 46 previous similar messagesJul 21 11:38:39 nyx-login1 kernel: Lustre: nobackup-MDT- mdc-0101fc467800: Connection restored to service nobackup-MDT using nid [EMAIL PROTECTED] 21 11:38:39 nyx-login1 kernel: LustreError: 11-0: an error occurred while communicating with [EMAIL PROTECTED] The mds_close operation failed with -116Jul 21 11:38:39 nyx-login1 kernel: LustreError: 11-0: an error occurred while communicating with [EMAIL PROTECTED] The mds_close operation failed with -116Jul 21 11:38:39 nyx-login1 kernel: LustreError: 26930:0:(file.c:97:ll_close_inode_openhandle()) inode 11441446 mdc close failed: rc = -116Jul 21 11:38:39 nyx-login1 kernel: LustreError: 26930:0:(file.c:97:ll_close_inode_openhandle()) Skipped 113 previous similar messages -BEGIN PGP SIGNATURE- Version: GnuPG v1.4.6 (Darwin) iD8DBQFIhLOqMFCQB4Bvz5QRAgWvAJ9HhQAo9JZdcS2iyMFb19HzcgkwcQCdGosB sHaligENGxnJHdMu5116D5U= =GOlg -END PGP SIGNATURE- ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss