Re: [Lustre-discuss] Lustre locking up on login/interactive nodes

2008-07-22 Thread Brian J. Murrell
On Mon, 2008-07-21 at 12:04 -0400, Brock Palen wrote:
 
 OK, will keep that in mind. It looks the same, though.

Indeed, in most cases it is the same, with the one exception that
syslog gives us time context.

 What's odd is that if I log in to the same machine, I can move to that
 directory, list the files, etc., and read files on those OSTs. Also note
 this was an eviction by the MDS.
 
 I see no lost network connections or network errors. Strange, not
 good, not good at all.
 The syslog data is the same; it's below:

Right.  There is nothing terribly useful in it, though.  It's just the
notification of an eviction.  The real question is why it got
evicted.  The evictor will know more about that than the evictee.

If the evictor doesn't log anything more than a "didn't hear from
client ... in ... seconds, evicting" message, then something is
preventing the client from sending traffic to the evictor.
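The "didn't hear from client ... in ... seconds" case can be pictured with a toy model: the server remembers when each client last sent it anything (a ping or a normal request) and evicts any client that stays silent past a timeout. This is a hypothetical Python sketch, not Lustre's ping evictor; the class name, the `timeout` value, and the printed message are illustrative assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class PingEvictor:
    """Toy timeout-based evictor (hypothetical, not Lustre code)."""
    timeout: float  # seconds of silence tolerated before eviction (illustrative value)
    last_heard: dict = field(default_factory=dict)  # client id -> time of last message

    def heard_from(self, client: str, now: float) -> None:
        # Any traffic, a ping or a normal RPC, refreshes the client's timestamp.
        self.last_heard[client] = now

    def sweep(self, now: float) -> list:
        # Evict every client whose last message is older than the timeout.
        evicted = []
        for client, seen in list(self.last_heard.items()):
            silent = now - seen
            if silent > self.timeout:
                # Illustrative message modeled on the paraphrase in this thread,
                # not Lustre's exact log text.
                print(f"didn't hear from client {client} in {silent:.0f} seconds, evicting")
                del self.last_heard[client]
                evicted.append(client)
        return evicted


if __name__ == "__main__":
    ev = PingEvictor(timeout=100.0)
    ev.heard_from("login1", now=0.0)
    ev.heard_from("compute7", now=0.0)
    ev.heard_from("compute7", now=90.0)  # compute7 keeps talking; login1 goes quiet
    print(ev.sweep(now=150.0))  # -> ['login1']
```

Because any traffic refreshes the timestamp, a client that is merely busy stays connected, while one that is cut off from the network, or wedged hard enough to stop sending anything, is evicted.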

b.



___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


[Lustre-discuss] Lustre locking up on login/interactive nodes

2008-07-21 Thread Brock Palen
Every so often Lustre locks up. It recovers eventually. The
process shows up in 'D' state (uninterruptible I/O wait). In this
case it was 'ar' making an archive.

Dmesg then shows:

Lustre: nobackup-MDT-mdc-0101fc467800: Connection to service  
nobackup-MDT via nid [EMAIL PROTECTED] was lost; in progress  
operations using this service will wait for recovery to complete.
LustreError: 167-0: This client was evicted by nobackup-MDT; in  
progress operations using this service will fail.
LustreError: 17575:0:(client.c:519:ptlrpc_import_delay_req()) @@@  
IMP_INVALID  [EMAIL PROTECTED] x912452/t0  
o101-[EMAIL PROTECTED]@tcp:12 lens 488/768 ref 1  
fl Rpc:P/0/0 rc 0/0
LustreError: 17575:0:(mdc_locks.c:423:mdc_finish_enqueue())  
ldlm_cli_enqueue: -108
LustreError: 27076:0:(client.c:519:ptlrpc_import_delay_req()) @@@  
IMP_INVALID  [EMAIL PROTECTED] x912464/t0  
o101-[EMAIL PROTECTED]@tcp:12 lens 440/768 ref 1  
fl Rpc:/0/0 rc 0/0
LustreError: 27076:0:(mdc_locks.c:423:mdc_finish_enqueue())  
ldlm_cli_enqueue: -108
LustreError: 27489:0:(file.c:97:ll_close_inode_openhandle()) inode  
12653753 mdc close failed: rc = -108
LustreError: 27489:0:(file.c:97:ll_close_inode_openhandle()) inode  
12195682 mdc close failed: rc = -108
LustreError: 27489:0:(file.c:97:ll_close_inode_openhandle()) Skipped  
46 previous similar messages
Lustre: nobackup-MDT-mdc-0101fc467800: Connection restored to  
service nobackup-MDT using nid [EMAIL PROTECTED]
LustreError: 11-0: an error occurred while communicating with  
[EMAIL PROTECTED] The mds_close operation failed with -116
LustreError: 11-0: an error occurred while communicating with  
[EMAIL PROTECTED] The mds_close operation failed with -116
LustreError: 26930:0:(file.c:97:ll_close_inode_openhandle()) inode  
11441446 mdc close failed: rc = -116
LustreError: 26930:0:(file.c:97:ll_close_inode_openhandle()) Skipped  
113 previous similar messages
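The negative return codes in these messages look like negated Linux errno values. A quick way to decode them, assuming a Linux host (errno numbering differs on other platforms), is Python's standard `errno` and `os` modules; `decode_rc` is a hypothetical helper name.

```python
import errno
import os


def decode_rc(rc: int) -> str:
    """Turn a negative kernel-style return code into 'NAME (description)'."""
    code = -rc  # kernel code returns -errno; flip the sign
    name = errno.errorcode.get(code, "UNKNOWN")
    return f"{name} ({os.strerror(code)})"


if __name__ == "__main__":
    # The two codes seen in the client log above.
    for rc in (-108, -116):
        print(f"rc = {rc}: {decode_rc(rc)}")
```

On Linux, -108 decodes to ESHUTDOWN and -116 to ESTALE. That appears to fit the sequence in the log: requests fail while the import is invalid after the eviction, and closes of files opened before the eviction come back as stale once the connection is restored.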


Are there special options that should be set on interactive/login
nodes?  I remember something about how much memory should be available
on login vs. batch nodes, but I don't know how to change that; I just
assumed Lustre would use it.  Login nodes have 8GB.
__
www.palen.serveftp.net
Center for Advanced Computing
http://cac.engin.umich.edu
[EMAIL PROTECTED]



Re: [Lustre-discuss] Lustre locking up on login/interactive nodes

2008-07-21 Thread Brian J. Murrell
On Mon, 2008-07-21 at 11:43 -0400, Brock Palen wrote:
 Every so often Lustre locks up. It recovers eventually. The
 process shows up in 'D' state (uninterruptible I/O wait). In this
 case it was 'ar' making an archive.
 
 Dmesg then shows:

Syslog is usually a better place to get messages from as it gives some
context as to the time of the messages.
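To make use of that time context, it helps to pull the timestamp, host, and message out of each line. A minimal Python sketch, assuming the traditional syslog prefix seen in this thread (`Jul 21 11:38:39 nyx-login1 kernel: ...`), which carries no year (so the caller supplies it), and a C/English locale for the month abbreviation; `parse_line` and `between` are hypothetical helper names.

```python
import re
from datetime import datetime

# Traditional syslog prefix: "Mon DD HH:MM:SS host tag: message".
SYSLOG_RE = re.compile(
    r"^(?P<stamp>[A-Z][a-z]{2} [ \d]\d \d{2}:\d{2}:\d{2}) "
    r"(?P<host>\S+) (?P<tag>[^:]+): (?P<msg>.*)$"
)


def parse_line(line: str, year: int):
    """Return (timestamp, host, tag, message), or None if the line doesn't match."""
    m = SYSLOG_RE.match(line)
    if m is None:
        return None
    # The syslog prefix has no year, so it is supplied by the caller.
    stamp = datetime.strptime(f"{year} {m['stamp']}", "%Y %b %d %H:%M:%S")
    return stamp, m["host"], m["tag"], m["msg"]


def between(lines, year: int, start: datetime, end: datetime):
    """Yield parsed records whose timestamp falls within [start, end]."""
    for line in lines:
        rec = parse_line(line, year)
        if rec is not None and start <= rec[0] <= end:
            yield rec
```

Running `between()` over the client's syslog and the MDS's syslog with the same window (a few minutes either side of the eviction) lines up the eviction notice on the client with whatever the evictor logged at that moment, provided the nodes' clocks agree.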

This looks like a pretty standard eviction.  Probably the most
interesting information is on the node that did the evicting.  If its
log doesn't contain much other than a "have not heard from" message,
then you have a node that is either disappearing from the network or
getting wedged enough to stop sending pings (or any other traffic in
lieu of them).
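One way to apply this check on the evicting node is to split its eviction-related log lines into timeout evictions and everything else. A minimal Python sketch; the regular expressions are assumptions built from the wording paraphrased in this thread ("have not heard from", "didn't hear from client"), not Lustre's exact message text, which varies between versions, so adjust them to match your server logs. `classify_evictions` is a hypothetical helper name.

```python
import re

# ASSUMPTION: timeout evictions are logged with wording like
# "haven't heard from" / "didn't hear from"; adjust to your logs.
TIMEOUT_RE = re.compile(r"(?:haven't|have not|didn't|did not) hear(?:d)? from", re.IGNORECASE)
# Anything else mentioning an eviction.
EVICT_RE = re.compile(r"evict", re.IGNORECASE)


def classify_evictions(lines):
    """Split eviction-related lines into (timeout evictions, other evictions)."""
    timeouts, other = [], []
    for line in lines:
        if TIMEOUT_RE.search(line):
            timeouts.append(line)
        elif EVICT_RE.search(line):
            other.append(line)
    return timeouts, other
```

If every eviction line lands in the first list, the client was simply silent, which points at the network or a wedged node; anything in the second list deserves a closer read.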

b.



Re: [Lustre-discuss] Lustre locking up on login/interactive nodes

2008-07-21 Thread Brock Palen
On Jul 21, 2008, at 11:51 AM, Brian J. Murrell wrote:
 On Mon, 2008-07-21 at 11:43 -0400, Brock Palen wrote:
 Every so often Lustre locks up. It recovers eventually. The
 process shows up in 'D' state (uninterruptible I/O wait). In this
 case it was 'ar' making an archive.

 Dmesg then shows:

 Syslog is usually a better place to get messages from as it gives some
 context as to the time of the messages.

OK, will keep that in mind. It looks the same, though. What's odd is
that if I log in to the same machine, I can move to that directory, list
the files, etc., and read files on those OSTs. Also note this was an
eviction by the MDS.

I see no lost network connections or network errors. Strange, not
good, not good at all.
The syslog data is the same; it's below:

Brock


Jul 21 11:38:39 nyx-login1 kernel: Lustre: nobackup-MDT-mdc-0101fc467800: Connection to service nobackup-MDT via nid [EMAIL PROTECTED] was lost; in progress operations using this service will wait for recovery to complete.
Jul 21 11:38:39 nyx-login1 kernel: LustreError: 167-0: This client was evicted by nobackup-MDT; in progress operations using this service will fail.
Jul 21 11:38:39 nyx-login1 kernel: LustreError: 17575:0:(client.c:519:ptlrpc_import_delay_req()) @@@ IMP_INVALID [EMAIL PROTECTED] x912452/t0 o101-[EMAIL PROTECTED]@tcp:12 lens 488/768 ref 1 fl Rpc:P/0/0 rc 0/0
Jul 21 11:38:39 nyx-login1 kernel: LustreError: 17575:0:(mdc_locks.c:423:mdc_finish_enqueue()) ldlm_cli_enqueue: -108
Jul 21 11:38:39 nyx-login1 kernel: LustreError: 27076:0:(client.c:519:ptlrpc_import_delay_req()) @@@ IMP_INVALID [EMAIL PROTECTED] x912464/t0 o101-nobackup-[EMAIL PROTECTED]@tcp:12 lens 440/768 ref 1 fl Rpc:/0/0 rc 0/0
Jul 21 11:38:39 nyx-login1 kernel: LustreError: 27076:0:(mdc_locks.c:423:mdc_finish_enqueue()) ldlm_cli_enqueue: -108
Jul 21 11:38:39 nyx-login1 kernel: LustreError: 27489:0:(file.c:97:ll_close_inode_openhandle()) inode 12653753 mdc close failed: rc = -108
Jul 21 11:38:39 nyx-login1 kernel: LustreError: 27489:0:(file.c:97:ll_close_inode_openhandle()) inode 12195682 mdc close failed: rc = -108
Jul 21 11:38:39 nyx-login1 kernel: LustreError: 27489:0:(file.c:97:ll_close_inode_openhandle()) Skipped 46 previous similar messages
Jul 21 11:38:39 nyx-login1 kernel: Lustre: nobackup-MDT-mdc-0101fc467800: Connection restored to service nobackup-MDT using nid [EMAIL PROTECTED]
Jul 21 11:38:39 nyx-login1 kernel: LustreError: 11-0: an error occurred while communicating with [EMAIL PROTECTED] The mds_close operation failed with -116
Jul 21 11:38:39 nyx-login1 kernel: LustreError: 11-0: an error occurred while communicating with [EMAIL PROTECTED] The mds_close operation failed with -116
Jul 21 11:38:39 nyx-login1 kernel: LustreError: 26930:0:(file.c:97:ll_close_inode_openhandle()) inode 11441446 mdc close failed: rc = -116
Jul 21 11:38:39 nyx-login1 kernel: LustreError: 26930:0:(file.c:97:ll_close_inode_openhandle()) Skipped 113 previous similar messages
