Re: [Lustre-discuss] Newbie w/issues
Hello!

On Apr 27, 2010, at 7:29 PM, Brian Andrus wrote:
> Apr 27 16:15:19 nas-0-1 kernel: LustreError: 4133:0:(ldlm_lib.c:1848:target_send_reply_msg()) @@@ processing error (-107) r...@810669d35c50 x1334203739385128/t0 o400-?@?:0/0 lens 192/0 e 0 to 0 dl 1272410135 ref 1 fl Interpret:H/0/0 rc -107/0
>
> Any direction/insight would be most helpful.

That's way too late in the logs to see what happened, aside from the fact that the server decided to evict some clients for some reason. The interesting parts should be around where evictions or timeouts were first mentioned.

Bye,
Oleg
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss
Re: [Lustre-discuss] Newbie w/issues
Brian Andrus wrote:
> Ok, I inherited a Lustre filesystem used on a cluster. I am seeing an issue where on the frontend I see all of /work. On nodes, however, I only see SOME of the users' directories.

That's rather odd. The directory structure is all on the MDS, so it's usually either all there or not there. Are any of the user errors permission-related? That's the only thing I can think of that would change which directories one node sees vs. another.

> Work consists of one MDT/MGS and 3 OSTs. The OSTs are LVM volumes served from a DDN via InfiniBand.
>
> Running the kernel modules/client on the nodes/frontend:
>   lustre-client-1.8.2-2.6.18_164.11.1.el5_lustre.1.8.2
>   lustre-client-modules-1.8.2-2.6.18_164.11.1.el5_lustre.1.8.2
>
> On the OST/MDT:
>   lustre-modules-1.8.2-2.6.18_164.11.1.el5_lustre.1.8.2
>   kernel-2.6.18-164.11.1.el5_lustre.1.8.2
>   lustre-1.8.2-2.6.18_164.11.1.el5_lustre.1.8.2
>   lustre-ldiskfs-3.0.9-2.6.18_164.11.1.el5_lustre.1.8.2
>
> I have so many error messages in the logs, I am not sure which to sift through for this issue.
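One quick way to test the permission theory is to compare ownership and mode for the same /work subdirectory as seen from the frontend and from a node (the path would be site-specific; the self-contained demo below just shows the `stat` fields worth comparing, using a scratch directory):

```shell
# On both frontend and node you'd run something like:
#   stat -c '%U %G %a' /work/<user>      (placeholder path)
# and compare the output. Minimal local demo of the mode field:
d=$(mktemp -d)        # scratch directory standing in for a /work subdir
chmod 750 "$d"        # no "other" access: other users can't list it
stat -c '%a' "$d"     # prints 750
```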
> A quick tail on the MDT:
> =
> Apr 27 16:15:19 nas-0-1 kernel: LustreError: 4133:0:(ldlm_lib.c:1848:target_send_reply_msg()) @@@ processing error (-107) r...@810669d35c50 x1334203739385128/t0 o400-?@?:0/0 lens 192/0 e 0 to 0 dl 1272410135 ref 1 fl Interpret:H/0/0 rc -107/0
> Apr 27 16:15:19 nas-0-1 kernel: LustreError: 4133:0:(ldlm_lib.c:1848:target_send_reply_msg()) Skipped 419 previous similar messages
> Apr 27 16:16:38 nas-0-1 kernel: LustreError: 4155:0:(handler.c:1518:mds_handle()) operation 400 on unconnected MDS from 12345-10.1.255...@tcp
> Apr 27 16:16:38 nas-0-1 kernel: LustreError: 4155:0:(handler.c:1518:mds_handle()) Skipped 177 previous similar messages
> Apr 27 16:25:21 nas-0-1 kernel: LustreError: 6789:0:(mgs_handler.c:573:mgs_handle()) lustre_mgs: operation 400 on unconnected MGS
> Apr 27 16:25:21 nas-0-1 kernel: LustreError: 6789:0:(mgs_handler.c:573:mgs_handle()) Skipped 229 previous similar messages
> Apr 27 16:25:21 nas-0-1 kernel: LustreError: 6789:0:(ldlm_lib.c:1848:target_send_reply_msg()) @@@ processing error (-107) r...@810673a78050 x1334009404220652/t0 o400-?@?:0/0 lens 192/0 e 0 to 0 dl 1272410737 ref 1 fl Interpret:H/0/0 rc -107/0
> Apr 27 16:25:21 nas-0-1 kernel: LustreError: 6789:0:(ldlm_lib.c:1848:target_send_reply_msg()) Skipped 404 previous similar messages
> Apr 27 16:26:41 nas-0-1 kernel: LustreError: 4173:0:(handler.c:1518:mds_handle()) operation 400 on unconnected MDS from 12345-10.1.255...@tcp
> Apr 27 16:26:41 nas-0-1 kernel: LustreError: 4173:0:(handler.c:1518:mds_handle()) Skipped 181 previous similar messages
> =

The ENOTCONN (-107) points at server/network health. I would umount the clients and verify server health, then verify LNET connectivity. However, this would not relate to missing directories. In the absence of other explanations, check the MDT with fsck; that's more of a generically useful step than something indicated by your data.
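Roughly, that sequence could look like the sketch below. The NID and device path are placeholders I invented, not values from this thread; the script only prints the commands unless you set DRY_RUN=0, since both need the real cluster (and the e2fsck must only run with the MDT unmounted):

```shell
#!/bin/sh
# Placeholders -- substitute your site's values:
MDS_NID="${MDS_NID:-10.1.255.1@tcp}"   # hypothetical MDS NID
MDT_DEV="${MDT_DEV:-/dev/vg0/mdt}"     # hypothetical MDT block device

DRY_RUN="${DRY_RUN:-1}"
run() { echo "+ $*"; [ "$DRY_RUN" = "1" ] || "$@"; }

run lctl ping "$MDS_NID"     # from a client: can LNET reach the MDS at all?
# With the MDT service stopped and the device unmounted:
run e2fsck -fn "$MDT_DEV"    # read-only pass first; review output before fixing
```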
I would also look through older logs, if available, and see if you can find the point in time where things went bad. The first error is always the most useful.

> Any direction/insight would be most helpful.

Hope this helps,
cliffw
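One way to hunt for that first error: grep the server syslogs for the earliest eviction/timeout mention. The demo below runs against a synthetic two-line log I made up; on the real servers you would point LOG at /var/log/messages and its rotated siblings:

```shell
# Synthetic stand-in for /var/log/messages (contents invented for the demo):
LOG=$(mktemp)
cat > "$LOG" <<'EOF'
Apr 25 02:11:09 nas-0-1 kernel: Lustre: work-MDT0000: client ... evicting
Apr 27 16:15:19 nas-0-1 kernel: LustreError: processing error (-107)
EOF
# The earliest eviction/timeout line is usually closest to the root cause:
grep -inE 'evict|timed out|timeout' "$LOG" | head -n 1   # matches line 1 here
```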
Re: [Lustre-discuss] Newbie w/issues
Hello!

On Apr 27, 2010, at 9:38 PM, Brian Andrus wrote:
> Odd, I just went through the log on the MDT and basically it has been repeating those errors for over 24 hours (not spewing, but often enough). Only ONE other line on an OST:

Each such message means there was an attempt to send a ping to this server from a client that the server does not recognize.

> Apr 26 06:59:45 nas-0-4 kernel: LustreError: 137-5: UUID 'work-OST_UUID' is not available for connect (no target)

This one tells you that a client tried to contact OST0, but that service is not hosted on this node (or has not yet started up). This might be a somewhat valid message if you have failover configured and this node is currently the passive failover target for the service.

Bye,
Oleg
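To see which targets a server node actually hosts and has started, the device list is a good first look. A dry-run sketch (prints the commands unless DRY_RUN=0, since they need a real Lustre 1.8 server):

```shell
#!/bin/sh
DRY_RUN="${DRY_RUN:-1}"
run() { echo "+ $*"; [ "$DRY_RUN" = "1" ] || "$@"; }

run lctl dl                      # device list: is the OST present and UP here?
run cat /proc/fs/lustre/devices  # same information via procfs on 1.8
```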
Re: [Lustre-discuss] Newbie w/issues
This means that your OST is not available. Maybe it is not mounted?

Cheers,
Andreas

On 2010-04-27, at 19:38, Brian Andrus toomuc...@gmail.com wrote:
> On 4/27/2010 6:10 PM, Oleg Drokin wrote:
>> That's way too late in the logs to see what happened, aside from the fact that the server decided to evict some clients for some reason. The interesting parts should be around where evictions or timeouts were first mentioned.
>
> Odd, I just went through the log on the MDT and basically it has been repeating those errors for over 24 hours (not spewing, but often enough). Only ONE other line on an OST:
>
> Apr 26 06:59:45 nas-0-4 kernel: LustreError: 137-5: UUID 'work-OST_UUID' is not available for connect (no target)
>
> Brian
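A sketch for confirming that on the OSS; the device and mount point are placeholders I invented (in Lustre 1.8 an OST is started simply by mounting its backing device with `-t lustre`). Dry-run unless DRY_RUN=0:

```shell
#!/bin/sh
DRY_RUN="${DRY_RUN:-1}"
run() { echo "+ $*"; [ "$DRY_RUN" = "1" ] || "$@"; }

OST_DEV="${OST_DEV:-/dev/vg0/ost0}"      # placeholder OST block device
OST_MNT="${OST_MNT:-/mnt/lustre/ost0}"   # placeholder mount point

run mount -t lustre                       # lists currently mounted Lustre targets
run mount -t lustre "$OST_DEV" "$OST_MNT" # mounting the device starts the OST
```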
[Lustre-discuss] Newbie w/issues
Ok, I inherited a Lustre filesystem used on a cluster. I am seeing an issue where on the frontend I see all of /work. On nodes, however, I only see SOME of the users' directories.

Work consists of one MDT/MGS and 3 OSTs. The OSTs are LVM volumes served from a DDN via InfiniBand.

Running the kernel modules/client on the nodes/frontend:
  lustre-client-1.8.2-2.6.18_164.11.1.el5_lustre.1.8.2
  lustre-client-modules-1.8.2-2.6.18_164.11.1.el5_lustre.1.8.2

On the OST/MDT:
  lustre-modules-1.8.2-2.6.18_164.11.1.el5_lustre.1.8.2
  kernel-2.6.18-164.11.1.el5_lustre.1.8.2
  lustre-1.8.2-2.6.18_164.11.1.el5_lustre.1.8.2
  lustre-ldiskfs-3.0.9-2.6.18_164.11.1.el5_lustre.1.8.2

I have so many error messages in the logs, I am not sure which to sift through for this issue. A quick tail on the MDT:

=
Apr 27 16:15:19 nas-0-1 kernel: LustreError: 4133:0:(ldlm_lib.c:1848:target_send_reply_msg()) @@@ processing error (-107) r...@810669d35c50 x1334203739385128/t0 o400-?@?:0/0 lens 192/0 e 0 to 0 dl 1272410135 ref 1 fl Interpret:H/0/0 rc -107/0
Apr 27 16:15:19 nas-0-1 kernel: LustreError: 4133:0:(ldlm_lib.c:1848:target_send_reply_msg()) Skipped 419 previous similar messages
Apr 27 16:16:38 nas-0-1 kernel: LustreError: 4155:0:(handler.c:1518:mds_handle()) operation 400 on unconnected MDS from 12345-10.1.255...@tcp
Apr 27 16:16:38 nas-0-1 kernel: LustreError: 4155:0:(handler.c:1518:mds_handle()) Skipped 177 previous similar messages
Apr 27 16:25:21 nas-0-1 kernel: LustreError: 6789:0:(mgs_handler.c:573:mgs_handle()) lustre_mgs: operation 400 on unconnected MGS
Apr 27 16:25:21 nas-0-1 kernel: LustreError: 6789:0:(mgs_handler.c:573:mgs_handle()) Skipped 229 previous similar messages
Apr 27 16:25:21 nas-0-1 kernel: LustreError: 6789:0:(ldlm_lib.c:1848:target_send_reply_msg()) @@@ processing error (-107) r...@810673a78050 x1334009404220652/t0 o400-?@?:0/0 lens 192/0 e 0 to 0 dl 1272410737 ref 1 fl Interpret:H/0/0 rc -107/0
Apr 27 16:25:21 nas-0-1 kernel: LustreError: 6789:0:(ldlm_lib.c:1848:target_send_reply_msg()) Skipped 404 previous similar messages
Apr 27 16:26:41 nas-0-1 kernel: LustreError: 4173:0:(handler.c:1518:mds_handle()) operation 400 on unconnected MDS from 12345-10.1.255...@tcp
Apr 27 16:26:41 nas-0-1 kernel: LustreError: 4173:0:(handler.c:1518:mds_handle()) Skipped 181 previous similar messages
=

Any direction/insight would be most helpful.

Brian Andrus
Re: [Lustre-discuss] Newbie w/issues
On 4/27/2010 6:10 PM, Oleg Drokin wrote:
> That's way too late in the logs to see what happened, aside from the fact that the server decided to evict some clients for some reason. The interesting parts should be around where evictions or timeouts were first mentioned.
>
> Bye,
> Oleg

Odd, I just went through the log on the MDT and basically it has been repeating those errors for over 24 hours (not spewing, but often enough). Only ONE other line, on an OST:

Apr 26 06:59:45 nas-0-4 kernel: LustreError: 137-5: UUID 'work-OST_UUID' is not available for connect (no target)

Brian