Re: [Lustre-discuss] Newbie w/issues

2010-04-28 Thread Oleg Drokin
Hello!

On Apr 27, 2010, at 7:29 PM, Brian Andrus wrote:
 Apr 27 16:15:19 nas-0-1 kernel: LustreError: 
 4133:0:(ldlm_lib.c:1848:target_send_reply_msg()) @@@ processing error (-107)  
 r...@810669d35c50 x1334203739385128/t0 o400-?@?:0/0 lens 192/0 e 0 
 to 0 dl 1272410135 ref 1 fl Interpret:H/0/0 rc -107/0
 
 Any direction/insight would be most helpful.

That's way too late in the logs to see what happened, other than that the server
decided to evict some clients for some reason.
The interesting parts should be around where evictions or timeouts were first mentioned.
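
If it helps, a quick way to find that point in the syslog on the servers
(paths and patterns are just an example for a stock setup; adjust for yours):

  # first mentions of evictions or timeouts on the MDS/OSS
  grep -inE 'evict|timed out' /var/log/messages | head
  # include rotated logs if the problem started earlier
  zgrep -inE 'evict|timed out' /var/log/messages* | head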

Bye,
Oleg
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] Newbie w/issues

2010-04-28 Thread Cliff White
Brian Andrus wrote:
 Ok, I inherited a lustre filesystem used on a cluster. 
 
 I am seeing an issue where on the frontend, I see all of /work.
 On the nodes, however, I only see SOME of the users' directories.

That's rather odd. The directory structure is all on the MDS, so
it's usually either all there, or not there. Are any of the user errors
permission-related? That's the only thing I can think of that would change
which directories one node sees vs. another.
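
One quick comparison that might narrow it down (run it on the frontend and on a
compute node and diff the output; the path and username are placeholders):

  # does the directory exist and carry the same ownership/permissions on both?
  ls -ld /work /work/<username>
  id <username>
  # confirm both machines mount the same filesystem from the same MGS
  mount -t lustre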
 
 Work consists of one MDT/MGS and 3 osts
 The osts are LVMs served from a DDN via infiniband
 
 Running the kernel modules/client on the nodes/frontend
 lustre-client-1.8.2-2.6.18_164.11.1.el5_lustre.1.8.2
 lustre-client-modules-1.8.2-2.6.18_164.11.1.el5_lustre.1.8.2
 
 on the ost/mdt
 lustre-modules-1.8.2-2.6.18_164.11.1.el5_lustre.1.8.2
 kernel-2.6.18-164.11.1.el5_lustre.1.8.2
 lustre-1.8.2-2.6.18_164.11.1.el5_lustre.1.8.2
 lustre-ldiskfs-3.0.9-2.6.18_164.11.1.el5_lustre.1.8.2
 
 I have so many error messages in the logs, I am not sure which to sift 
 through for this issue.
 A quick tail on the MDT:
 =
 Apr 27 16:15:19 nas-0-1 kernel: LustreError: 
 4133:0:(ldlm_lib.c:1848:target_send_reply_msg()) @@@ processing error 
 (-107)  r...@810669d35c50 x1334203739385128/t0 o400-?@?:0/0 lens 
 192/0 e 0 to 0 dl 1272410135 ref 1 fl Interpret:H/0/0 rc -107/0
 Apr 27 16:15:19 nas-0-1 kernel: LustreError: 
 4133:0:(ldlm_lib.c:1848:target_send_reply_msg()) Skipped 419 previous 
 similar messages
 Apr 27 16:16:38 nas-0-1 kernel: LustreError: 
 4155:0:(handler.c:1518:mds_handle()) operation 400 on unconnected MDS 
 from 12345-10.1.255...@tcp
 Apr 27 16:16:38 nas-0-1 kernel: LustreError: 
 4155:0:(handler.c:1518:mds_handle()) Skipped 177 previous similar messages
 Apr 27 16:25:21 nas-0-1 kernel: LustreError: 
 6789:0:(mgs_handler.c:573:mgs_handle()) lustre_mgs: operation 400 on 
 unconnected MGS
 Apr 27 16:25:21 nas-0-1 kernel: LustreError: 
 6789:0:(mgs_handler.c:573:mgs_handle()) Skipped 229 previous similar 
 messages
 Apr 27 16:25:21 nas-0-1 kernel: LustreError: 
 6789:0:(ldlm_lib.c:1848:target_send_reply_msg()) @@@ processing error 
 (-107)  r...@810673a78050 x1334009404220652/t0 o400-?@?:0/0 lens 
 192/0 e 0 to 0 dl 1272410737 ref 1 fl Interpret:H/0/0 rc -107/0
 Apr 27 16:25:21 nas-0-1 kernel: LustreError: 
 6789:0:(ldlm_lib.c:1848:target_send_reply_msg()) Skipped 404 previous 
 similar messages
 Apr 27 16:26:41 nas-0-1 kernel: LustreError: 
 4173:0:(handler.c:1518:mds_handle()) operation 400 on unconnected MDS 
 from 12345-10.1.255...@tcp
 Apr 27 16:26:41 nas-0-1 kernel: LustreError: 
 4173:0:(handler.c:1518:mds_handle()) Skipped 181 previous similar messages
 =
 

The ENOTCONN (-107) points at server/network health. I would umount the 
clients and verify server health, then verify LNET connectivity. 
However, this would not relate to missing directories - in the absence 
of other explanations, check the MDT with fsck. That's more of a 
generically useful thing to do rather than something indicated by your data.
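
Roughly along these lines (NIDs and device names are placeholders for your setup):

  # on a client: local NIDs, and can it reach the MDS/OSS over LNET?
  lctl list_nids
  lctl ping <mds-nid>            # e.g. 10.1.255.x@tcp
  # on the servers: are the targets mounted and registered?
  mount -t lustre
  lctl dl
  # with the MDT unmounted: read-only check of the backing ldiskfs
  e2fsck -fn /dev/<mdt-device>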

I would also look through older logs, if available, and see if you can
find the point in time where things went bad. The first error is always the 
most useful.
 Any direction/insight would be most helpful.

Hope this helps
cliffw

 
 Brian Andrus
 
 
 
 
 ___
 Lustre-discuss mailing list
 Lustre-discuss@lists.lustre.org
 http://lists.lustre.org/mailman/listinfo/lustre-discuss

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] Newbie w/issues

2010-04-28 Thread Oleg Drokin
Hello!

On Apr 27, 2010, at 9:38 PM, Brian Andrus wrote:

 Odd, I just went through the log on the MDT and basically it has been 
 repeating those errors for over 24 hours (not spewing, but often enough). 
 There is only ONE other line, on an OST:

Each such message means there was an attempt to send a ping to this server from 
a client that the server does not recognize.

 Apr 26 06:59:45 nas-0-4 kernel: LustreError: 137-5: UUID 'work-OST_UUID' 
 is not available  for connect (no target)

This one tells you that a client tried to contact OST0, but that service is not 
hosted on that node (or has not yet started up).
This might be a somewhat valid message if you have failover configured and this 
node is a currently passive failover target for the service.
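
To see which targets that node actually hosts, and which failover NIDs the target
was formatted with, something like this should do (the device path is a placeholder):

  # on nas-0-4: list local Lustre devices; a started OST would show up here
  lctl dl
  # print the target's stored configuration (target name, failover NIDs) without changing it
  tunefs.lustre --print /dev/<ost-device>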

Bye,
Oleg
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] Newbie w/issues

2010-04-28 Thread Andreas Dilger
This means that your OST is not available. Maybe it is not mounted?
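
A couple of quick checks (device and mount point are placeholders for your setup):

  # on the OSS that should serve this OST: is it mounted?
  mount -t lustre
  # if not, mount it
  mount -t lustre /dev/<ost-device> /mnt/<ost>
  # from a client: per-target status; a missing or inactive OST shows up here
  lfs df
  lfs check servers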

Cheers, Andreas

On 2010-04-27, at 19:38, Brian Andrus toomuc...@gmail.com wrote:

 On 4/27/2010 6:10 PM, Oleg Drokin wrote:
 Hello!

 On Apr 27, 2010, at 7:29 PM, Brian Andrus wrote:

 Apr 27 16:15:19 nas-0-1 kernel: LustreError: 4133:0:(ldlm_lib.c: 
 1848:target_send_reply_msg()) @@@ processing error (-107)   
 r...@810669d35c50 x1334203739385128/t0 o400-?@?:0/0 lens  
 192/0 e 0 to 0 dl 1272410135 ref 1 fl Interpret:H/0/0 rc -107/0

  Any direction/insight would be most helpful.

  That's way too late in the logs to see what happened, other than that
  the server decided to evict some clients for some reason.
  The interesting parts should be around where evictions or timeouts were
  first mentioned.

 Bye,
 Oleg
  Odd, I just went through the log on the MDT and basically it has been
  repeating those errors for over 24 hours (not spewing, but often
  enough). There is only ONE other line, on an OST:

 Apr 26 06:59:45 nas-0-4 kernel: LustreError: 137-5: UUID
 'work-OST_UUID' is not available  for connect (no target)


 Brian

 ___
 Lustre-discuss mailing list
 Lustre-discuss@lists.lustre.org
 http://lists.lustre.org/mailman/listinfo/lustre-discuss
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


[Lustre-discuss] Newbie w/issues

2010-04-27 Thread Brian Andrus
Ok, I inherited a lustre filesystem used on a cluster.

I am seeing an issue where on the frontend, I see all of /work.
On the nodes, however, I only see SOME of the users' directories.

Work consists of one MDT/MGS and 3 osts
The osts are LVMs served from a DDN via infiniband

Running the kernel modules/client on the nodes/frontend
lustre-client-1.8.2-2.6.18_164.11.1.el5_lustre.1.8.2
lustre-client-modules-1.8.2-2.6.18_164.11.1.el5_lustre.1.8.2

on the ost/mdt
lustre-modules-1.8.2-2.6.18_164.11.1.el5_lustre.1.8.2
kernel-2.6.18-164.11.1.el5_lustre.1.8.2
lustre-1.8.2-2.6.18_164.11.1.el5_lustre.1.8.2
lustre-ldiskfs-3.0.9-2.6.18_164.11.1.el5_lustre.1.8.2

I have so many error messages in the logs, I am not sure which to sift
through for this issue.
A quick tail on the MDT:
=
Apr 27 16:15:19 nas-0-1 kernel: LustreError:
4133:0:(ldlm_lib.c:1848:target_send_reply_msg()) @@@ processing error (-107)
 r...@810669d35c50 x1334203739385128/t0 o400-?@?:0/0 lens 192/0 e 0
to 0 dl 1272410135 ref 1 fl Interpret:H/0/0 rc -107/0
Apr 27 16:15:19 nas-0-1 kernel: LustreError:
4133:0:(ldlm_lib.c:1848:target_send_reply_msg()) Skipped 419 previous
similar messages
Apr 27 16:16:38 nas-0-1 kernel: LustreError:
4155:0:(handler.c:1518:mds_handle()) operation 400 on unconnected MDS from
12345-10.1.255...@tcp
Apr 27 16:16:38 nas-0-1 kernel: LustreError:
4155:0:(handler.c:1518:mds_handle()) Skipped 177 previous similar messages
Apr 27 16:25:21 nas-0-1 kernel: LustreError:
6789:0:(mgs_handler.c:573:mgs_handle()) lustre_mgs: operation 400 on
unconnected MGS
Apr 27 16:25:21 nas-0-1 kernel: LustreError:
6789:0:(mgs_handler.c:573:mgs_handle()) Skipped 229 previous similar
messages
Apr 27 16:25:21 nas-0-1 kernel: LustreError:
6789:0:(ldlm_lib.c:1848:target_send_reply_msg()) @@@ processing error (-107)
 r...@810673a78050 x1334009404220652/t0 o400-?@?:0/0 lens 192/0 e 0
to 0 dl 1272410737 ref 1 fl Interpret:H/0/0 rc -107/0
Apr 27 16:25:21 nas-0-1 kernel: LustreError:
6789:0:(ldlm_lib.c:1848:target_send_reply_msg()) Skipped 404 previous
similar messages
Apr 27 16:26:41 nas-0-1 kernel: LustreError:
4173:0:(handler.c:1518:mds_handle()) operation 400 on unconnected MDS from
12345-10.1.255...@tcp
Apr 27 16:26:41 nas-0-1 kernel: LustreError:
4173:0:(handler.c:1518:mds_handle()) Skipped 181 previous similar messages
=

Any direction/insight would be most helpful.

Brian Andrus
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] Newbie w/issues

2010-04-27 Thread Brian Andrus
On 4/27/2010 6:10 PM, Oleg Drokin wrote:
 Hello!

 On Apr 27, 2010, at 7:29 PM, Brian Andrus wrote:

 Apr 27 16:15:19 nas-0-1 kernel: LustreError: 
 4133:0:(ldlm_lib.c:1848:target_send_reply_msg()) @@@ processing error (-107) 
  r...@810669d35c50 x1334203739385128/t0 o400-?@?:0/0 lens 192/0 e 0 
 to 0 dl 1272410135 ref 1 fl Interpret:H/0/0 rc -107/0

 Any direction/insight would be most helpful.
  
 That's way too late in the logs to see what happened, other than that the server 
 decided to evict some clients for some reason.
 The interesting parts should be around where evictions or timeouts were first 
 mentioned.

 Bye,
  Oleg
Odd, I just went through the log on the MDT and basically it has been 
repeating those errors for over 24 hours (not spewing, but often 
enough). There is only ONE other line, on an OST:

Apr 26 06:59:45 nas-0-4 kernel: LustreError: 137-5: UUID 
'work-OST_UUID' is not available  for connect (no target)


Brian

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss