[Lustre-discuss] OST errors caused by residual client info?

2010-12-06 Thread Jeff Johnson
Greetings..

Is it possible that the error below comes from a client that has not 
been rebooted, or had its lustre kernel mods reloaded, since a few test 
file systems were built and mounted?

LustreError: 12967:0:(ldlm_lib.c:1914:target_send_reply_msg()) @@@ processing error (-19)  r...@81032dd2d000 x1348952525350751/t0 o8-?@?:0/0 lens 368/0 e 0 to 0 dl 1291669076 ref 1 fl Interpret:/0/0 rc -19/0
LustreError: 12967:0:(ldlm_lib.c:1914:target_send_reply_msg()) Skipped 55 previous similar messages
LustreError: 137-5: UUID 'fs-OST0058_UUID' is not available  for connect (no target)


Normally this would be a back end storage issue. In this case, the oss 
where this error is logged doesn't have an ost OST0058. It has an ost 
OST006d. Regardless of the ost name, the backend raid is healthy with 
no hardware errors. No other h/w errors present on the oss node (e.g.: 
mce, panic, ib/enet failures, etc).

Previous test incarnations of this filesystem were built where ost name 
was not assigned (e.g.: OST) and was assigned upon first mount and 
connection to the mds. Is it possible that some clients have residual 
pointers or config data about the previously built file systems?

Thanks!

--Jeff

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] OST errors caused by residual client info?

2010-12-06 Thread Oleg Drokin
Hello!

On Dec 6, 2010, at 6:50 PM, Jeff Johnson wrote:
> Previous test incarnations of this filesystem were built where ost name 
> was not assigned (e.g.: OST) and was assigned upon first mount and 
> connection to the mds. Is it possible that some clients have residual 
> pointers or config data about the previously built file systems?

If you did not unmount clients from the previous incarnation of the
filesystem, those clients would continue to try to contact the servers
they know about, even after the servers themselves go away and are
repurposed (since there is no way for the client to know about this).

Bye,
Oleg


Re: [Lustre-discuss] OST errors caused by residual client info?

2010-12-06 Thread Jeff Johnson
On 12/6/10 3:55 PM, Oleg Drokin wrote:
> Hello!
>
> On Dec 6, 2010, at 6:50 PM, Jeff Johnson wrote:
>> Previous test incarnations of this filesystem were built where ost name
>> was not assigned (e.g.: OST) and was assigned upon first mount and
>> connection to the mds. Is it possible that some clients have residual
>> pointers or config data about the previously built file systems?
> If you did not unmount clients from the previous incarnation of the
> filesystem, those clients would still continue to try to contact the
> servers they know about even after the servers themselves go away and
> are repurposed (since there is no way for the client to know about
> this).
All clients were unmounted but the lustre kernel mods were never 
removed/reloaded nor were the clients rebooted.

Is it odd that this error would occur naming an ost that is not present 
on that oss? Should an oss only report this error about its own ost 
devices? As I said, this particular oss where the error came from only 
has an OST006c and OST006d. It does not have an OST0058 although it may 
have back when the filesystem was made with a simple test csv that did 
not specifically give index numbers as part of the mkfs.lustre process. 
They were named later, randomly, when the osts were first mounted and 
connected to the mds.

Do you think it is possible for a client to retain this information even 
though a umount/mount of the filesystem took place?

--Jeff


Re: [Lustre-discuss] OST errors caused by residual client info?

2010-12-06 Thread Oleg Drokin
Hello!

On Dec 6, 2010, at 7:05 PM, Jeff Johnson wrote:
>>> Previous test incarnations of this filesystem were built where ost name
>>> was not assigned (e.g.: OST) and was assigned upon first mount and
>>> connection to the mds. Is it possible that some clients have residual
>>> pointers or config data about the previously built file systems?
>> If you did not unmount clients from the previous incarnation of the
>> filesystem, those clients would still continue to try to contact the
>> servers they know about even after the servers themselves go away and
>> are repurposed (since there is no way for the client to know about
>> this).
> All clients were unmounted but the lustre kernel mods were never 
> removed/reloaded nor were the clients rebooted.

If the clients were unmounted, then there is no information left in the kernel 
about those now vanished mountpoints.

> Is it odd that this error would occur naming an ost that is not present on 
> that oss? Should an oss only report this error about its own ost devices? As 
> I said, 

An OSS would report such an error if a client contacted it trying to
access an OST not present on this OSS.
This could be because a client holds stale information about services,
having never been unmounted from a previous incarnation of the
filesystem, or it could be because there is a failover pair setup that
names this OSS as a possible nid for a failover target.
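
A minimal sketch of how the OSS log could be checked for connect attempts aimed at targets the OSS does not serve. It assumes the "137-5" message format quoted earlier in this thread; the function name `foreign_targets` and the UUID spellings for OST006c and OST006d are hypothetical, extrapolated from the 'fs-OST0058_UUID' example.

```python
import re
from collections import Counter

# Pattern taken from the "137-5" console message quoted in this thread.
UNAVAILABLE_RE = re.compile(r"137-5: UUID '([^']+)' is not available")


def foreign_targets(log_lines, local_uuids):
    """Count connect attempts that name a target this OSS does not serve."""
    hits = Counter()
    for line in log_lines:
        match = UNAVAILABLE_RE.search(line)
        if match and match.group(1) not in local_uuids:
            hits[match.group(1)] += 1
    return hits


# Sample line copied from the original report.
sample = [
    "LustreError: 137-5: UUID 'fs-OST0058_UUID' is not available  "
    "for connect (no target)",
]

# Assumed UUIDs for the two OSTs this OSS is said to host.
local = {"fs-OST006c_UUID", "fs-OST006d_UUID"}

print(foreign_targets(sample, local))
```

Every UUID this reports is one that some client, or some failover configuration, still associates with this OSS.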

> Do you think it is possible for a client to retain this information even 
> though a umount/mount of the filesystem took place?

If the clients unmounted cleanly, I don't think there is anywhere such info 
could be stored.

You could go back to the clients sending these requests (identify them
by the error messages in their logs; they'd complain about error -19
when connecting to OSTs) and see what's wrong with them: what they have
mounted and so on.
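
As a rough aid for reading the -19 in those messages, here is a sketch that decodes the request line from the original report. The field meanings are assumptions: `o8` is taken to be the request opcode (8 being OST_CONNECT, as far as I recall from lustre_idl.h), `rc` a negated errno, and `dl` a deadline in Unix seconds. The `decode` helper is hypothetical.

```python
import errno
import re
from datetime import datetime, timezone

# Assumed mapping; only the opcode seen in this thread is listed.
OPCODES = {8: "OST_CONNECT"}

# Request line copied from the original report, joined onto one line.
LINE = (
    "LustreError: 12967:0:(ldlm_lib.c:1914:target_send_reply_msg()) @@@ "
    "processing error (-19)  r...@81032dd2d000 x1348952525350751/t0 "
    "o8-?@?:0/0 lens 368/0 e 0 to 0 dl 1291669076 ref 1 "
    "fl Interpret:/0/0 rc -19/0"
)


def decode(line):
    """Extract opcode, return code and deadline from a request log line."""
    opcode = int(re.search(r"\bo(\d+)-", line).group(1))
    rc = int(re.search(r"\brc (-?\d+)/", line).group(1))
    deadline = int(re.search(r"\bdl (\d+)\b", line).group(1))
    return {
        "opcode": OPCODES.get(opcode, f"unknown({opcode})"),
        # rc is negative in the log, errno names are keyed by positive value.
        "errno": errno.errorcode.get(-rc, f"unknown({rc})"),
        "deadline_utc": datetime.fromtimestamp(
            deadline, tz=timezone.utc
        ).isoformat(),
    }


print(decode(LINE))
```

Under those assumptions the line reads as a connect request that failed with ENODEV (no such device), which is consistent with the "no target" message that follows it in the log.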

Bye,
Oleg