Re: [Lustre-discuss] lnet_try_match_md()) Matching packet from 12345-10.5.203....@tcp, match 19154486 length 728 too big
Alexey Lyashkov wrote:
> Hi Michael,
>
>>> On Fri, 2009-05-22 at 16:38 -0400, Michael D. Seymour wrote:
>>>> Hi all,
>>>>
>>>> One client running CentOS 5.2 re-exports the Lustre filesystem via NFS
>>>> on a different network.
>>>>
>>>> We get the following messages on a particular client:
>>>>
>>>> May 22 15:07:45 trinity kernel: LustreError:
>>>> 5111:0:(lib-move.c:110:lnet_try_match_md()) Matching packet from
>>>> 12345-10.5.203@tcp, match 19154486 length 728 too big: 704 left, 704
>>>> allowed
>>>
>>> what frequently for this bug?
>>
>> Sets of entries (about 20) happen a few times per day, each entry spaced
>> about ten minutes apart.
>
> can you please show syslog messages around this time - should be exist
> lines with errors related to 'match X' (in this example match
> 19154486 -- should be something about request x19154486).

I've upgraded the MDS to 1.6.7.1. So far no issues. I will probably upgrade to
1.8 very soon. I will write back if there are still problems.

Mike

--
Michael D. Seymour              Phone: 416-978-8497
Scientific Computing Support    Fax:   416-978-3921
Canadian Institute for Theoretical Astrophysics, University of Toronto

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss
Re: [Lustre-discuss] lnet_try_match_md()) Matching packet from 12345-10.5.203....@tcp, match 19154486 length 728 too big
Michael D. Seymour wrote:
> Hi all,
>
> I hope you can help us with some connection problems we are having with our
> Lustre file system. The filesystem roc consists of 6 OSSs with one OST per
> OSS. Each OSS uses the 1.6.7 RHEL 5 kernel on CentOS 5.1 (one unit uses
> CentOS 5.3). The MDS uses CentOS 5.1 and Lustre 1.6.7. 203 RHEL-based
> clients mount the filesystem and all use Lustre 1.6.7. All are connected
> via a Gb Ethernet switch stack.
>
> One client running CentOS 5.2 re-exports the Lustre filesystem via NFS on a
> different network.

Also got this earlier today, before more verbose debug logging was enabled.

On the client trinity:

May 29 10:35:47 trinity kernel: LustreError: 5111:0:(lib-move.c:110:lnet_try_match_md()) Matching packet from 12345-10.5.203@tcp, match 20177453 length 728 too big: 704 left, 704 allowed
May 29 10:40:47 trinity kernel: LustreError: 11-0: an error occurred while communicating with 10.5.203@tcp. The mds_close operation failed with -116
May 29 10:40:47 trinity kernel: LustreError: 26783:0:(file.c:113:ll_close_inode_openhandle()) inode 37609433 mdc close failed: rc = -116
May 29 10:40:47 trinity kernel: LustreError: 26783:0:(file.c:113:ll_close_inode_openhandle()) Skipped 1 previous similar message

On the MDS rocpile:

May 29 10:35:47 rocpile kernel: LustreError: 10227:0:(mds_open.c:1561:mds_close()) @@@ no handle for file close ino 37609433: cookie 0xa00c7cf9e763396b r...@8101274e3400 x20177453/t0 o35->84adb9a1-8959-fcf5-cc72-81c6a1e17...@net_0x20a05cc02_uuid:0/0 lens 296/728 e 0 to 0 dl 1243608047 ref 1 fl Interpret:/0/0 rc 0/0
May 29 10:35:47 rocpile kernel: LustreError: 10227:0:(ldlm_lib.c:1619:target_send_reply_msg()) @@@ processing error (-116) r...@8101274e3400 x20177453/t0 o35->84adb9a1-8959-fcf5-cc72-81c6a1e17...@net_0x20a05cc02_uuid:0/0 lens 296/728 e 0 to 0 dl 1243608047 ref 1 fl Interpret:/0/0 rc -116/0
May 29 10:35:47 rocpile kernel: LustreError: 10227:0:(ldlm_lib.c:1619:target_send_reply_msg()) Skipped 1 previous similar message
May 29 10:40:47 rocpile kernel: LustreError: 3611:0:(mds_open.c:1561:mds_close()) @@@ no handle for file close ino 37609433: cookie 0xa00c7cf9e763396b r...@81011f0cda00 x20177453/t0 o35->84adb9a1-8959-fcf5-cc72-81c6a1e17...@net_0x20a05cc02_uuid:0/0 lens 296/728 e 0 to 0 dl 1243608347 ref 1 fl Interpret:/2/0 rc 0/0
May 29 10:40:47 rocpile kernel: LustreError: 3611:0:(ldlm_lib.c:1619:target_send_reply_msg()) @@@ processing error (-116) r...@81011f0cda00 x20177453/t0 o35->84adb9a1-8959-fcf5-cc72-81c6a1e17...@net_0x20a05cc02_uuid:0/0 lens 296/728 e 0 to 0 dl 1243608347 ref 1 fl Interpret:/2/0 rc -116/0

I've already extended /proc/sys/lustre/timeout to 300s.

Thanks again,
Mike

--
Michael D. Seymour              Phone: 416-978-8497
Scientific Computing Support    Fax:   416-978-3921
Canadian Institute for Theoretical Astrophysics, University of Toronto
Re: [Lustre-discuss] lnet_try_match_md()) Matching packet from 12345-10.5.203....@tcp, match 19154486 length 728 too big
Hi Alexey,

Alexey Lyashkov wrote:
> Hi Michael,
>
> On Fri, 2009-05-22 at 16:38 -0400, Michael D. Seymour wrote:
>> Hi all,
>>
>> One client running CentOS 5.2 re-exports the Lustre filesystem via NFS on
>> a different network.
>>
>> We get the following messages on a particular client:
>>
>> May 22 15:07:45 trinity kernel: LustreError:
>> 5111:0:(lib-move.c:110:lnet_try_match_md()) Matching packet from
>> 12345-10.5.203@tcp, match 19154486 length 728 too big: 704 left, 704
>> allowed
>
> what frequently for this bug?

Sets of entries (about 20) happen a few times per day, each entry spaced about
ten minutes apart.

> if this quickly replicated - please set
> lnet.debug=-1, lnet.debug_subsystem=-1 lnet.debug_mb=100, on mds and
> client, replicate and save logs with lctl dk > $logfile.

Debugging has been enabled. I haven't been able to catch it in the act yet.
Will leaving debug logging enabled until I can catch the bug overflow anything?

> after it - please fill a bug and attach log from MDS and client to bug.

A bug will be filed as soon as the problem can be caught with logging enabled.

> this message say - client want for reply less data when mds is send.

So trinity cannot accept a reply as large as the one the MDS is sending?

Thanks for your help,
Mike

--
Michael D. Seymour              Phone: 416-978-8497
Scientific Computing Support    Fax:   416-978-3921
Canadian Institute for Theoretical Astrophysics, University of Toronto
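The debug-capture procedure Alexey describes can be sketched as a short shell sequence. This is a sketch only: the sysctl variable names follow his mail and Lustre 1.6 conventions (his mail writes "debug_subsystem"; the installed name may differ), and the log path is an arbitrary example, so verify against your installed version before relying on it.

```shell
#!/bin/sh
# Run on both the MDS and the affected client.
# Values per Alexey's suggestion: all debug types, all subsystems,
# and a 100 MB in-kernel debug buffer so messages are not lost.
sysctl -w lnet.debug=-1
sysctl -w lnet.subsystem_debug=-1
sysctl -w lnet.debug_mb=100

# ... reproduce the problem, then dump the kernel debug buffer to a
# file to attach to the bug report:
lctl dk > /tmp/lustre-debug.$(hostname).log
```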
[Lustre-discuss] lnet_try_match_md()) Matching packet from 12345-10.5.203....@tcp, match 19154486 length 728 too big
Hi all,

I hope you can help us with some connection problems we are having with our
Lustre file system. The filesystem roc consists of 6 OSSs with one OST per
OSS. Each OSS uses the 1.6.7 RHEL 5 kernel on CentOS 5.1 (one unit uses
CentOS 5.3). The MDS uses CentOS 5.1 and Lustre 1.6.7. 203 RHEL-based clients
mount the filesystem and all use Lustre 1.6.7. All are connected via a Gb
Ethernet switch stack.

One client running CentOS 5.2 re-exports the Lustre filesystem via NFS on a
different network.

We get the following messages on a particular client:

May 22 15:07:45 trinity kernel: LustreError: 5111:0:(lib-move.c:110:lnet_try_match_md()) Matching packet from 12345-10.5.203@tcp, match 19154486 length 728 too big: 704 left, 704 allowed
May 22 15:07:45 trinity kernel: LustreError: 5111:0:(lib-move.c:110:lnet_try_match_md()) Skipped 3 previous similar messages
May 22 15:12:45 trinity kernel: Lustre: Request x19154486 sent from roc-MDT-mdc-01044e1d4c00 to NID 10.5.203@tcp 300s ago has timed out (limit 300s).
May 22 15:12:45 trinity kernel: Lustre: Skipped 3 previous similar messages
May 22 15:12:45 trinity kernel: Lustre: roc-MDT-mdc-01044e1d4c00: Connection to service roc-MDT via nid 10.5.203@tcp was lost; in progress operations using this service will wait for recovery to complete.
May 22 15:12:45 trinity kernel: Lustre: Skipped 3 previous similar messages
May 22 15:12:45 trinity kernel: Lustre: roc-MDT-mdc-01044e1d4c00: Connection restored to service roc-MDT using nid 10.5.203@tcp.
May 22 15:12:45 trinity kernel: Lustre: Skipped 4 previous similar messages

[r...@trinity ~]# cat /proc/fs/lustre/lov/roc-clilov-01044e1d4c00/uuid
84adb9a1-8959-fcf5-cc72-81c6a1e171b8

On the MDS containing roc-MDT:

May 22 15:12:45 rocpile kernel: Lustre: 19236:0:(ldlm_lib.c:538:target_handle_reconnect()) roc-MDT: 84adb9a1-8959-fcf5-cc72-81c6a1e171b8 reconnecting
May 22 15:12:45 rocpile kernel: Lustre: 19236:0:(ldlm_lib.c:538:target_handle_reconnect()) Skipped 4 previous similar messages

Any idea what could be causing this? Bug 11332 looked similar, but it was
closed after other related bugs were fixed.

Thanks,
Mike

--
Michael D. Seymour              Phone: 416-978-8497
Scientific Computing Support    Fax:   416-978-3921
Canadian Institute for Theoretical Astrophysics, University of Toronto
[Lustre-discuss] Problems adding new OSS to existing Lustre filesystem -- Refusing connection, No matching NI
Hi,

We are having a problem adding a new OSS (roc06, 10.5.203.6) to an existing
Lustre file system (raid-cita) on the 10.5 network. selinux and iptables are
disabled. The new OSS is multi-homed on the 10.4 and 10.5 networks. When the
filesystem is mounted, clients try to connect to it via the 10.4 network,
even though everything is set up to use the 10.5 network. The clients do not
see the new space on the file system either: it shows 23T as opposed to the
>27T it should show. lfs quota hangs as well.

We did suffer some problems with the MDS filesystem, which was fscked; the
kernel was downgraded to 1.6.6 and the filesystem remounted.

Many messages like this appear in /var/log/messages on the new OSS:

Apr 24 10:01:07 roc06 kernel: LustreError: 120-3: Refusing connection from 10.4.1.52 for 10.4.20...@tcp: No matching NI

On the multi-homed client 10.4.1.52:

[r...@tpb52-chroot ~]# uname -a; cat /etc/redhat-release
Linux tpb52 2.6.18-92.1.17.el5_lustre.1.6.7smp #1 SMP Mon Feb 9 19:56:55 MST 2009 x86_64 x86_64 x86_64 GNU/Linux
CentOS release 5 (Final)
[r...@tpb52-chroot ~]# df -h /mnt/raid-cita/
Filesystem           Size  Used Avail Use% Mounted on
10.5.203@tcp:/roc     23T   11T   12T  47% /mnt/raid-cita
[r...@tpb52-chroot ~]# lctl list_nids
10.5.2...@tcp
[r...@tpb52-chroot ~]# grep lnet /etc/modprobe.conf
options lnet networks=tcp0(eth1)
[r...@tpb52-chroot ~]# ifconfig eth1
eth1      Link encap:Ethernet  HWaddr 00:15:C5:EC:FA:8C
          inet addr:10.5.2.12  Bcast:10.5.255.255  Mask:255.255.0.0

On the OSS roc06:

[r...@roc06 lustre]# uname -a; cat /etc/redhat-release
Linux roc06 2.6.18-92.1.17.el5_lustre.1.6.7.1smp #1 SMP Mon Apr 13 16:13:00 MDT 2009 x86_64 x86_64 x86_64 GNU/Linux
CentOS release 5.3 (Final)
[r...@roc06 lustre]# lctl list_nids
10.5.20...@tcp
[r...@roc06 ~]# grep lnet /etc/modprobe.conf
options lnet networks=tcp0(eth1)
[r...@roc06 ~]# ifconfig eth1
eth1      Link encap:Ethernet  HWaddr 00:22:19:05:90:F2
          inet addr:10.5.203.6  Bcast:10.5.255.255  Mask:255.255.0.0

The OST on the new OSS was formatted with the following:

mkfs.lustre --verbose --reformat --fsname=roc --ost --mgsnode=10.5.203@tcp0 --mkfsoptions="-m 0 -E stride=32" /dev/md2

I believe this was done before "options lnet networks=tcp0(eth1)" was
included in modprobe.conf.

[r...@roc06 ~]# tunefs.lustre --print /dev/md2
Permanent disk data:
Target:     roc-OST0005
Index:      5
Lustre FS:  roc
Mount type: ldiskfs
Flags:      0x402
            (OST )
Persistent mount opts: errors=remount-ro,extents,mballoc
Parameters: mgsnode=10.5.203@tcp ost.quota_type=u

For comparison, the OSS roc05:

[r...@roc05 ~]# uname -a; cat /etc/redhat-release
Linux roc05 2.6.18-92.1.17.el5_lustre.1.6.7smp #1 SMP Mon Feb 9 19:56:55 MST 2009 x86_64 x86_64 x86_64 GNU/Linux
CentOS release 5 (Final)
[r...@roc05 ~]# lctl list_nids
10.5.20...@tcp
[r...@roc05 ~]# grep lnet /etc/modprobe.conf
options lnet networks=tcp0(eth1)
[r...@roc05 ~]# ifconfig eth1
eth1      Link encap:Ethernet  HWaddr 00:1C:23:D5:F5:4F
          inet addr:10.5.203.5  Bcast:10.5.255.255  Mask:255.255.0.0
[r...@roc05 ~]# tunefs.lustre --print /dev/md2
Permanent disk data:
Target:     roc-OST0004
Index:      4
Lustre FS:  roc
Mount type: ldiskfs
Flags:      0x402
            (OST )
Persistent mount opts: errors=remount-ro,extents,mballoc
Parameters: mgsnode=10.5.203@tcp ost.quota_type=u

On the MDS (rocpile):

[r...@rocpile ~]# uname -a; cat /etc/redhat-release
Linux rocpile 2.6.18-92.1.10.el5_lustre.1.6.6smp #1 SMP Tue Aug 26 12:16:17 EDT 2008 x86_64 x86_64 x86_64 GNU/Linux
CentOS release 5 (Final)
[r...@rocpile ~]# lctl list_nids
10.5.203@tcp
[r...@rocpile ~]# grep lnet /etc/modprobe.conf
options lnet networks=tcp(eth1)
[r...@rocpile ~]# ifconfig eth1
eth1      Link encap:Ethernet  HWaddr 00:15:C5:EC:F6:88
          inet addr:10.5.203.250  Bcast:10.5.255.255  Mask:255.255.0.0

Any suggestions?

Thanks,
Mike

--
Michael D. Seymour              Phone: 416-978-1776
Scientific Computing Support    Fax:   416-978-3921
Canadian Institute for Theoretical Astrophysics, University of Toronto
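Since the OST was formatted before the "options lnet networks=tcp0(eth1)" line existed, one hypothetical line of attack (an assumption about the cause, not a confirmed fix) is that a stale 10.4.x NID was recorded in the filesystem's configuration logs at registration time, which a writeconf would regenerate:

```shell
# Hypothetical repair sketch for the "No matching NI" symptom, assuming
# the configuration logs captured a 10.4.x NID at mkfs/first-mount time.
# With the target unmounted:
tunefs.lustre --print /dev/md2        # inspect the stored parameters first

# Regenerate the configuration logs so the target re-registers with its
# current (10.5.x) NID on next mount:
tunefs.lustre --writeconf /dev/md2

# Caution: a writeconf normally has to be coordinated across the whole
# filesystem (unmount everything, writeconf the MGS/MDT first, then each
# OST) -- consult the Lustre operations manual before running this live.
```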
Re: [Lustre-discuss] WARNING: Potential directory corruptions on the MDS with 1.6.7
Andreas Dilger wrote:
> On Apr 09, 2009 17:11 -0400, Michael D. Seymour wrote:
>> Is there an accepted procedure for recovering from any errors introduced
>> by this bug? i.e. performing e2fsck with the --mdsdb option on the MDT,
>> lfsck on the OSTs? Or simply an e2fsck on the unmounted MDT, then
>> downgrade and remount?
>
> No, there is no lustre-specific mechanism for recovery for this
> problem. This may result in files being put into the underlying
> lost+found directory, which you might consider moving into a
> newly-created ROOT/lost+found directory by mounting the MDS as
> "-t ldiskfs". You shouldn't just move the filesystem lost+found
> directory, as that can cause trouble at a later time.

So this would be the course of action:

> umount /lustre/mdt
> e2fsck /dev/md2   # mdt device
> # Say yes to all repair queries

Here then one would:

mkdir /root/MDT-lost+found
mount -t ldiskfs /dev/md2 /mnt/tmp
rsync -a /mnt/tmp/lost+found/ /root/MDT-lost+found

> downgrade to 1.6.6
> mount -t lustre /dev/md2 /lustre/mdt

I am unclear what use the leftover files placed in lost+found on the MDT
filesystem could be.

Thanks, Andreas, for your help,
Mike
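Gathered into one place, the recovery procedure discussed in this thread reads roughly as follows. It is a sketch using this site's device and mount paths (/dev/md2, /lustre/mdt, /mnt/tmp); the `-y` flag simply automates the "yes to all repair queries" step.

```shell
#!/bin/sh
# With all Lustre clients disconnected and the MDT unmounted:
umount /lustre/mdt
e2fsck -y /dev/md2                  # MDT device; answer yes to all repairs

# Preserve anything e2fsck dropped into the filesystem's own lost+found
# by copying it out via a plain ldiskfs mount:
mkdir -p /root/MDT-lost+found /mnt/tmp
mount -t ldiskfs /dev/md2 /mnt/tmp
rsync -a /mnt/tmp/lost+found/ /root/MDT-lost+found/
umount /mnt/tmp

# Downgrade Lustre to 1.6.6 (or apply the bug 18695 patch), then remount:
mount -t lustre /dev/md2 /lustre/mdt
```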
Re: [Lustre-discuss] WARNING: Potential directory corruptions on the MDS with 1.6.7
Peter Jones wrote:
> A bug has been identified in 1.6.7 that can cause directory corruptions
> on the MDT. A patch and full details are in bug 18695 -
> https://bugzilla.lustre.org/show_bug.cgi?id=18695
>
> We recommend that anyone running 1.6.7 on the MDS unmount the MDT, run
> e2fsck against the MDT device, and apply the patch from bug 18695 as soon
> as possible.
>
> Please note that the landing that caused the regression was that for
> 11063, so anyone running with that patch on an earlier 1.6.x release
> should also follow the above procedure.
>
> This fix will be included in 1.8.0 and we will also create an ad hoc
> 1.6.7.1 release to provide this fix as soon as possible. 1.6.7 will be
> withdrawn from the Sun Download Center.

Hi Peter, all,

Is there an accepted procedure for recovering from any errors introduced by
this bug? i.e. performing e2fsck with the --mdsdb option on the MDT, lfsck on
the OSTs? Or simply an e2fsck on the unmounted MDT, then downgrade and
remount?

I performed the following on one of our 17 TB Lustre filesystems, which
contains disposable data:

umount mdt
e2fsck /dev/md2   # mdt device
# Say yes to all repair queries
# downgrade to 1.6.6
mount mdt

This resulted in <100 files out of 587k having "? ? ?" directory entries, but
everything else seems fine. I have not performed any checks of file
consistency.

We have a second Lustre file system that stores permanent data, and I don't
want to risk any lost or corrupt files.

Thanks for any help,
Mike S.

--
Michael D. Seymour              Phone: 416-978-1776
Scientific Computing Support    Fax:   416-978-3921
Canadian Institute for Theoretical Astrophysics, University of Toronto