[Lustre-discuss] lock timeouts and OST evictions on 1.4 server - 1.6 client system.

2009-02-10 Thread Simon Kelley
We're seeing problems which seem to be lock-related and result in data
loss. This is a fairly low-probability event: it's happening on a ~1000
core compute farm which is heavily loaded, and the frequency is of the
order of tens of failures an hour.

The clients are running kernel 2.6.22 and Lustre 1.6.6 from Sun. The
servers are HP SFS, running Lustre 1.4.11. The mixed versions are
deliberate: we're in the process of adding servers running 1.6.6, but
none are currently in use: all servers are 1.4.11. The 1.4.11 clients
don't see these problems, and some old 1.4.11 clients are still using
these filesystems at the same time as the new 1.6.6 ones.

On the server (OST), an event typically looks like this:

2009/02/09 14:05:10 kerni   sf2-sfs33: LustreError:
0:0:(ldlm_lockd.c:227:waiting_locks_callback()) ### lock callback timer
expired: evicting client
4cdd29e6-8b9e-7e81-5d72-2a055f0e2...@net_0x2ac1f6eee_uuid nid
172.31.110@tcp ns: filter-sf2-sfs-ost495_UUID lock:
010048ae0100/0x5fd2cf129ee3e5bb lrc: 1/0 , 0 mode: PW/PW res:
1542868/0 rrc: 4 type: EXT [0-18446744073709551615] (req
0-18446744073709551615) flags: 20 remote: 0xc8fcb3ab0d8d expref:
386 pid: 16626
2009/02/09 14:05:30 kerni   sf2-sfs33: LustreError:
16577:0:(ldlm_lib.c:1364:target_send_reply_msg()) @@@ processing error
(-107) r...@010073bbce00 x2138818/t0 o400-?@?:-1 lens 64/0 ref 0
fl Interpret:/0/0 rc -107/0

and on the client at the same time:

Feb  9 14:05:30 sf-2-3-10 kernel: LustreError: 11-0: an error occurred
while communicating with 172.31.96...@tcp. The obd_ping operation failed
with -107
Feb  9 14:05:30 sf-2-3-10 kernel: LustreError: Skipped 12 previous
similar messages
Feb  9 14:05:30 sf-2-3-10 kernel: Lustre:
OSC_sf2-sfs2.internal.sanger.ac.uk_sf2-sfs-ost495_MNT_client_tcp-81021f897800:
Connection to service sf2-sfs-ost495 via nid 172.31.96...@tcp was lost;
in progress operations using this service will wait for recovery to
complete.
Feb  9 14:05:30 sf-2-3-10 kernel: Lustre: Skipped 4 previous similar
messages
Feb  9 14:05:30 sf-2-3-10 kernel: LustreError: 167-0: This client was
evicted by sf2-sfs-ost495; in progress operations using this service
will fail.



It gets worse: the same client a few minutes later saw this on another
OST in the same filesystem.

2009/02/09 14:08:30 kerni   sf2-sfs34: LustreError:
0:0:(ldlm_lockd.c:227:waiting_locks_callback()) ### lock callback timer
expired: evicting client
4cdd29e6-8b9e-7e81-5d72-2a055f0e2...@net_0x2ac1f6eee_uuid nid
172.31.110@tcp ns: filter-sf2-sfs-ost496_UUID lock:
01006b8bfa40/0x900da6275f064529 lrc: 1/0 , 0 mode: PW/PW res:
1542328/0 rrc: 2 type: EXT [0-18446744073709551615] (req
0-18446744073709551615) flags: 20 remote: 0xc8fcb3ab1478 expref:
172 pid: 16089
2009/02/09 14:08:30 kerni   sf2-sfs33: LustreError:
0:0:(ldlm_lockd.c:227:waiting_locks_callback()) ### lock callback timer
expired: evicting client
f3275d1e-1360-e139-e71c-44b283d7e...@net_0x2ac1f665d_uuid nid
172.31.102...@tcp ns: filter-sf2-sfs-ost495_UUID lock:
01005a810040/0x5fd2cf129ee3e9f1 lrc: 1/0 , 0 mode: PW/PW res:
1542868/0 rrc: 3 type: EXT [0-18446744073709551615] (req
0-18446744073709551615) flags: 20 remote: 0xa899b6e20a06a83e expref:
316 pid: 16454
2009/02/09 14:08:30 kerni   sf2-sfs33: LustreError:
16335:0:(ldlm_lockd.c:606:ldlm_server_completion_ast()) ### enqueue wait
took 399491082us from 1234188111 ns: filter-sf2-sfs-ost495_UUID lock:
010064a638c0/0x5fd2cf129ee3ea68 lrc: 2/0 , 0 mode: PW/PW res:
1542868/0 rrc: 3 type: EXT [0-18446744073709551615] (req
0-18446744073709551615) flags: 20 remote: 0xc8423e042b195e36 expref:
501 pid: 16493
2009/02/09 14:08:32 storage w   sf2-sfs1 : Warning: event log
(/var/hpls/evlog/eventlog) size is 77760824
2009/02/09 14:08:50 kerni   sf2-sfs34: LustreError:
16253:0:(ldlm_lib.c:1364:target_send_reply_msg()) @@@ processing error
(-107) r...@0100b3ea4a00 x2150055/t0 o400-?@?:-1 lens 64/0 ref 0
fl Interpret:/0/0 rc -107/0
2009/02/09 14:09:02 kerni   sf2-sfs33: LustreError:
16566:0:(ldlm_lib.c:1364:target_send_reply_msg()) @@@ processing error
(-107) r...@01006b7b3200 x5180156/t0 o400-?@?:-1 lens 64/0 ref 0
fl Interpret:/0/0 rc -107/0

whilst the client suffered an assertion failure and LBUG.

Feb  9 14:08:50 sf-2-3-10 kernel: LustreError: 11-0: an error occurred
while communicating with 172.31.96...@tcp. The obd_ping operation failed
with -107
Feb  9 14:08:50 sf-2-3-10 kernel: Lustre:
OSC_sf2-sfs2.internal.sanger.ac.uk_sf2-sfs-ost496_MNT_client_tcp-81021f897800:
Connection to service sf2-sfs-ost496 via nid 172.31.96...@tcp was lost;
in progress operations using this service will wait for recovery to
complete.
Feb  9 14:08:50 sf-2-3-10 kernel: LustreError: 167-0: This client was
evicted by sf2-sfs-ost496; in progress operations using this service
will fail.
Feb  9 14:08:50 sf-2-3-10 kernel: LustreError:
29564:0:(import.c:241:ptlrpc_invalidate_import()) sf2-sfs-ost495_UUID:
rc = -110 waiting 


Re: [Lustre-discuss] problem mounting two different lustre instances

2009-02-10 Thread Götz Waschk
On Tue, Feb 10, 2009 at 1:44 AM, Isaac Huang he.hu...@sun.com wrote:
> On Mon, Feb 09, 2009 at 04:52:20PM +0100, Götz Waschk wrote:
>> My client has this in modprobe.conf:
>> options lnet networks=o2ib,tcp
>> I'm trying to mount the remote network with
>> mount -t lustre 141.34.228...@tcp0:/atlas /scratch/lustre-1.6/atlas
>> and the command just hangs, the error is this:
>> LustreError: 2887:0:(events.c:66:request_out_callback()) @@@ type 4,
>> status -113  r...@0100dfc2ac00 x7/t0

Hi Isaac,

> The outgoing message failed with -113 (EHOSTUNREACH). What does lctl
> list_nids say on the client?
On that client, the output is:
192.168.224...@o2ib
141.34.216...@tcp

> Also, please:
> echo +neterror > /proc/sys/lnet/printk
> So that more network errors would go onto console.
OK, after the next mount attempt I have this in the log now:

Lustre: OBD class driver, http://www.lustre.org/
Lustre Version: 1.6.6
Build Version:
1.6.6-1970010101-PRISTINE-.usr.src.redhat.BUILD.lustre-1.6.6.kernel-2.6.9-78.0.13.ELsmp
Lustre: Added LNI 192.168.224...@o2ib [8/64]
Lustre: Added LNI 141.34.216...@tcp [8/256]
Lustre: Accept secure, port 988
Lustre: Lustre Client File System; http://www.lustre.org/
Lustre: 2887:0:(o2iblnd_cb.c:2704:kiblnd_cm_callback())
192.168.22...@o2ib: ROUTE ERROR -22
Lustre: 2887:0:(o2iblnd_cb.c:2118:kiblnd_peer_connect_failed())
Deleting messages for 192.168.22...@o2ib: connection failed
LustreError: 2887:0:(events.c:66:request_out_callback()) @@@ type 4,
status -113  r...@010037eeac00 x7/t0
o38-atlas-mdt_u...@192.168.22.32@o2ib:12/10 lens 240/400 e 0 to 5
dl 1234255029 ref 2 fl Rpc:/0/0 rc 0/0
Lustre: 9263:0:(client.c:1199:ptlrpc_expire_one_request()) @@@ network
error (sent at 1234255024, 0s ago)  r...@010037eeac00 x7/t0
o38-atlas-mdt_u...@192.168.22.32@o2ib:12/10 lens 240/400 e 0 to 5
dl 1234255029 ref 1 fl Rpc:/0/0 rc 0/0
Lustre: Request x7 sent from atlas-MDT-mdc-0107fc2ee400 to NID
192.168.22...@o2ib 0s ago has timed out (limit 5s).
Lustre: 9264:0:(import.c:410:import_select_connection())
atlas-MDT-mdc-0107fc2ee400: tried all connections, increasing
latency to 5s
Lustre: 2887:0:(o2iblnd_cb.c:2704:kiblnd_cm_callback())
192.168.22...@o2ib: ROUTE ERROR -22
Lustre: 2887:0:(o2iblnd_cb.c:2118:kiblnd_peer_connect_failed())
Deleting messages for 192.168.22...@o2ib: connection failed
LustreError: 2887:0:(events.c:66:request_out_callback()) @@@ type 4,
status -113  r...@01080325a400 x10/t0
o38-atlas-mdt_u...@192.168.22.32@o2ib:12/10 lens 240/400 e 0 to 5
dl 1234255054 ref 2 fl Rpc:/0/0 rc 0/0
Lustre: 9263:0:(client.c:1199:ptlrpc_expire_one_request()) @@@ network
error (sent at 1234255049, 0s ago)  r...@01080325a400 x10/t0
o38-atlas-mdt_u...@192.168.22.32@o2ib:12/10 lens 240/400 e 0 to 5
dl 1234255054 ref 1 fl Rpc:/0/0 rc 0/0
Lustre: Request x10 sent from atlas-MDT-mdc-0107fc2ee400 to
NID 192.168.22...@o2ib 0s ago has timed out (limit 5s).


Regards, Götz

-- 
AL I:40: Do what thou wilt shall be the whole of the Law.


Re: [Lustre-discuss] problem mounting two different lustre instances

2009-02-10 Thread Isaac Huang
On Mon, Feb 09, 2009 at 04:52:20PM +0100, Götz Waschk wrote:
> Hello everyone,
> ...
> My client has this in modprobe.conf:
> options lnet networks=o2ib,tcp
> I'm trying to mount the remote network with
> mount -t lustre 141.34.228...@tcp0:/atlas /scratch/lustre-1.6/atlas
> and the command just hangs, the error is this:
> LustreError: 2887:0:(events.c:66:request_out_callback()) @@@ type 4,
> status -113  r...@0100dfc2ac00 x7/t0

The outgoing message failed with -113 (EHOSTUNREACH). What does lctl
list_nids say on the client?

Also, please:
echo +neterror > /proc/sys/lnet/printk

So that more network errors would go onto console.

Isaac


Re: [Lustre-discuss] problem mounting two different lustre instances

2009-02-10 Thread Götz Waschk
On Tue, Feb 10, 2009 at 1:45 PM, Johann Lombardi joh...@sun.com wrote:
> 38 = MDS_CONNECT. The client tries to reach the MDT via 192.168.22...@o2ib,
> whereas I think it should use tcp to access the lustre filesystem of the
> remote cluster, is my understanding of your configuration correct?
That's right, it should use 141.34.228...@tcp0 instead.
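
A quick way to double-check from the client (a sketch only; this assumes
lctl ping is available in this 1.6.6 build, and the NIDs are abbreviated
here, so substitute the real addresses):

    # is the remote MGS reachable over tcp from this client?
    lctl ping 141.34.228.xx@tcp0
    # and is the remote MDS o2ib NID really unreachable from here?
    lctl ping 192.168.22.xx@o2ib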

Regards, Götz


-- 
AL I:40: Do what thou wilt shall be the whole of the Law.


Re: [Lustre-discuss] lock timeouts and OST evictions on 1.4 server - 1.6 client system.

2009-02-10 Thread Simon Kelley
Oleg Drokin wrote:

> What would be useful here is if you can enable dlm tracing (echo
> +dlm_trace > /proc/sys/lnet/debug) on some of those 1.6 nodes (also, if
> you are running with no debug enabled at all, also enable the rpc_trace
> and info levels) and also enable the dump-on-eviction feature
> (echo 1 > /proc/sys/lustre/dump_on_eviction).
> Then, when the next eviction happens, there would be some useful debug
> data dumped on the client, that you can attach to a bugzilla bug along
> with the server-side eviction message (processed with the lctl dl
> command first).

OK, will do. The main problem is reproducing the error: our users have 
unreasonably insisted that we run their jobs using known-good 1.4 
clients, and even if I grab their code to run on isolated test nodes, 
_most_ runs are fine.
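
For the record, this is roughly what I intend to set on a couple of
1.6.6 test clients, following the instructions above (a sketch only: I'm
assuming the mask names are spelt dlmtrace/rpctrace in the 1.6 libcfs,
and cat /proc/sys/lnet/debug should confirm what actually got set):

    # enable DLM, RPC and info-level debug tracing on a 1.6.6 client
    echo +dlmtrace > /proc/sys/lnet/debug
    echo +rpctrace > /proc/sys/lnet/debug
    echo +info > /proc/sys/lnet/debug
    # dump the client debug log automatically at eviction time
    echo 1 > /proc/sys/lustre/dump_on_eviction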

 
>> We are also seeing some userspace file operations fail with the error
>> No locks available. These don't generate any logging on the client so
>> I don't have exact timing. It's possible that they are associated with
>> further ### lock callback timer expired server logs.
>
> This error code typically means an application attempting to do some
> i/o and Lustre has no lock for the i/o area for some reason anymore (it
> is normally obtained once the read or write path is entered), and that
> could be related to evictions too (locks are revoked at eviction time).

I should have mentioned that we are also seeing many errors of the form 
LustreError: 19842:0:(ldlm_lockd.c:1078:ldlm_handle_cancel()) received 
cancel for unknown lock cookie. Checking back, these would seem to 
pre-date the introduction of 1.6 clients and even after we upgraded 
clients I can see them associated with both 1.4 and 1.6 clients. They 
may indicate something else relevant about the filesystems or workload.
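
For anyone who wants to repeat that check, a rough per-day tally can be
pulled with something like this (a sketch; I'm assuming the messages
land in the syslog-style event log shown above, where the date is the
first field):

    # count "unknown lock cookie" cancels per day in the SFS event log
    grep 'received cancel for unknown lock cookie' /var/hpls/evlog/eventlog |
        awk '{print $1}' | sort | uniq -c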

Cheers,

Simon.


Re: [Lustre-discuss] lock timeouts and OST evictions on 1.4 server - 1.6 client system.

2009-02-10 Thread Oleg Drokin
Hello!

On Feb 10, 2009, at 12:11 PM, Simon Kelley wrote:
>>> We are also seeing some userspace file operations fail with the error
>>> No locks available. These don't generate any logging on the client so
>>> I don't have exact timing. It's possible that they are associated with
>>> further ### lock callback timer expired server logs.
>> This error code typically means an application attempting to do some
>> i/o and Lustre has no lock for the i/o area for some reason anymore (it
>> is normally obtained once the read or write path is entered), and that
>> could be related to evictions too (locks are revoked at eviction time).
> I should have mentioned that we are also seeing many errors of the form
> LustreError: 19842:0:(ldlm_lockd.c:1078:ldlm_handle_cancel()) received
> cancel for unknown lock cookie. Checking back, these would seem to
> pre-date the introduction of 1.6 clients and even after we upgraded
> clients I can see them associated with both 1.4 and 1.6 clients. They
> may indicate something else relevant about the filesystems or workload.

Hm, that means clients hold some locks that the server does not believe
they have, which is pretty strange.
Or it just does not recognize the lock released by the client and later
evicts the client.
If you have a complete kernel log of the event, that might be useful
to see the sequence of events.
I assume you do not have a flaky network and your clients do not
reconnect all the time to the servers with messages in the logs like
'changed handle from X to Y; copying, but this may foreshadow disaster',
which would be a different bug, also no longer present in 1.6.6.
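
A quick check on a few clients would tell, something like this (just a
sketch; it assumes kernel messages end up in /var/log/messages on your
nodes):

    # count reconnects that changed the connection handle
    grep -c 'changed handle from' /var/log/messages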

Bye,
 Oleg


Re: [Lustre-discuss] lock timeouts and OST evictions on 1.4 server - 1.6 client system.

2009-02-10 Thread Oleg Drokin
Hello!

On Feb 10, 2009, at 12:46 PM, Simon Kelley wrote:
> If, by the complete event, you mean the received cancel for unknown
> cookie, there's not much more to tell. Grepping through the last
> month's server logs shows that there are bursts of typically between 3
> and 7 messages, at the same time and from the same client. After a gap,
> the same thing but from a different client. The number can be as low as
> one, and up to ten. They look to be related to client workload, at a
> guess.

OK, so you do not see a pattern of this unknown cookie message being
followed by an eviction within something like 100 seconds? That's what
my question was about.
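
If it helps, something along these lines against the server log would
pick out such pairs (just a sketch; it assumes gawk and the
YYYY/MM/DD HH:MM:SS timestamps shown in your logs, and it only matches
on timing, not on which client is involved):

    gawk '
      # turn "2009/02/09 14:05:10" into epoch seconds
      function ts(d, t,    a, b) {
          split(d, a, "/"); split(t, b, ":")
          return mktime(a[1] " " a[2] " " a[3] " " b[1] " " b[2] " " b[3])
      }
      /received cancel for unknown lock cookie/ { last = ts($1, $2); saved = $0; next }
      /evicting client/ && last && ts($1, $2) - last <= 100 {
          print "cancel:   " saved
          print "eviction: " $0
          print ""
          last = 0
      }
    ' /path/to/server/eventlog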

Bye,
 Oleg