Re: [Lustre-discuss] lnet_try_match_md()) Matching packet from 12345-10.5.203....@tcp, match 19154486 length 728 too big

2009-06-08 Thread Michael D. Seymour
Alexey Lyashkov wrote:
> Hi Michael,
>  
>>> On Fri, 2009-05-22 at 16:38 -0400, Michael D. Seymour wrote:
>>>> Hi all,
>>>>
>>>> One client running CentOS 5.2 re-exports the Lustre filesystem via NFS on a
>>>> different network.
>>>>
>>>> We get the following messages on a particular client:
>>>>
>>>> May 22 15:07:45 trinity kernel: LustreError: 
>>>> 5111:0:(lib-move.c:110:lnet_try_match_md()) Matching packet from 
>>>> 12345-10.5.203@tcp, match 19154486 length 728 too big: 704 left, 704 
>>>> allowed
>>> How frequently does this bug occur?
>> Sets of entries (about 20) happen a few times per day, each entry spaced
>> about ten minutes apart.
> can you please show the syslog messages from around this time - there should
> be lines with errors related to 'match X' (in this example match 19154486,
> so something about request x19154486).

I've upgraded the MDS to 1.6.7.1. So far no issues. I will probably upgrade to 
1.8 very soon. Will write back if there are still problems.
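
For reference, the running Lustre version on each node can be confirmed with
something like the following (a trivial check; /proc path as in other 1.6.x
installs):

cat /proc/fs/lustre/version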

Mike


-- 
Michael D. Seymour Phone: 416-978-8497
Scientific Computing Support   Fax: 416-978-3921
Canadian Institute for Theoretical Astrophysics, University of Toronto
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] lnet_try_match_md()) Matching packet from 12345-10.5.203....@tcp, match 19154486 length 728 too big

2009-05-29 Thread Michael D. Seymour
Michael D. Seymour wrote:
> Hi all,
> 
> I hope you can help us with some connection problems we are having with our
> Lustre file system. The filesystem roc consists of 6 OSSs with one OST per
> OSS. Each OSS uses the 1.6.7 RHEL 5 kernel on CentOS 5.1 (one unit uses
> CentOS 5.3). The MDS uses CentOS 5.1 and Lustre 1.6.7. 203 RHEL-based clients
> mount the filesystem, all using Lustre 1.6.7. All are connected via a Gb
> Ethernet switch stack.
> 
> One client running CentOS 5.2 re-exports the Lustre filesystem via NFS on a 
> different network.
> 

Also got this earlier today before more verbose debug logging was enabled:

On client trinity:

May 29 10:35:47 trinity kernel: LustreError: 
5111:0:(lib-move.c:110:lnet_try_match_md()) Matching packet from 
12345-10.5.203@tcp, match 20177453 length 728 too big: 704 left, 704 allowed
May 29 10:40:47 trinity kernel: LustreError: 11-0: an error occurred while 
communicating with 10.5.203@tcp. The mds_close operation failed with -116
May 29 10:40:47 trinity kernel: LustreError: 
26783:0:(file.c:113:ll_close_inode_openhandle()) inode 37609433 mdc close 
failed: rc = -116
May 29 10:40:47 trinity kernel: LustreError: 
26783:0:(file.c:113:ll_close_inode_openhandle()) Skipped 1 previous similar 
message

On MDS rocpile:

May 29 10:35:47 rocpile kernel: LustreError: 
10227:0:(mds_open.c:1561:mds_close()) @@@ no handle for file close ino 
37609433: 
cookie 0xa00c7cf9e763396b  r...@8101274e3400 x20177453/t0 
o35->84adb9a1-8959-fcf5-cc72-81c6a1e17...@net_0x20a05cc02_uuid:0/0 lens 
296/728 e 0 to 0 dl 1243608047 ref 1 fl Interpret:/0/0 rc 0/0
May 29 10:35:47 rocpile kernel: LustreError: 
10227:0:(ldlm_lib.c:1619:target_send_reply_msg()) @@@ processing error (-116) 
r...@8101274e3400 x20177453/t0 
o35->84adb9a1-8959-fcf5-cc72-81c6a1e17...@net_0x20a05cc02_uuid:0/0 lens 
296/728 e 0 to 0 dl 1243608047 ref 1 fl Interpret:/0/0 rc -116/0
May 29 10:35:47 rocpile kernel: LustreError: 
10227:0:(ldlm_lib.c:1619:target_send_reply_msg()) Skipped 1 previous similar 
message
May 29 10:40:47 rocpile kernel: LustreError: 
3611:0:(mds_open.c:1561:mds_close()) @@@ no handle for file close ino 37609433: 
cookie 0xa00c7cf9e763396b  r...@81011f0cda00 x20177453/t0 
o35->84adb9a1-8959-fcf5-cc72-81c6a1e17...@net_0x20a05cc02_uuid:0/0 lens 
296/728 e 0 to 0 dl 1243608347 ref 1 fl Interpret:/2/0 rc 0/0
May 29 10:40:47 rocpile kernel: LustreError: 
3611:0:(ldlm_lib.c:1619:target_send_reply_msg()) @@@ processing error (-116) 
r...@81011f0cda00 x20177453/t0 
o35->84adb9a1-8959-fcf5-cc72-81c6a1e17...@net_0x20a05cc02_uuid:0/0 lens 
296/728 e 0 to 0 dl 1243608347 ref 1 fl Interpret:/2/0 rc -116/0

I've already extended /proc/sys/lustre/timeout to 300s.
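
For reference, that was done with something along these lines on each node (a
minimal sketch; the /proc path is the one mentioned above):

echo 300 > /proc/sys/lustre/timeout
cat /proc/sys/lustre/timeout   # verify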

Thanks again,
Mike

-- 
Michael D. Seymour Phone: 416-978-8497
Scientific Computing Support   Fax: 416-978-3921
Canadian Institute for Theoretical Astrophysics, University of Toronto
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] lnet_try_match_md()) Matching packet from 12345-10.5.203....@tcp, match 19154486 length 728 too big

2009-05-29 Thread Michael D. Seymour
Hi Alexey,

Alexey Lyashkov wrote:
> Hi Michael,
> 
> 
> On Fri, 2009-05-22 at 16:38 -0400, Michael D. Seymour wrote:
>> Hi all,
>>
>> One client running CentOS 5.2 re-exports the Lustre filesystem via NFS on a 
>> different network.
>>
>> We get the following messages on a particular client:
>>
>> May 22 15:07:45 trinity kernel: LustreError: 
>> 5111:0:(lib-move.c:110:lnet_try_match_md()) Matching packet from 
>> 12345-10.5.203@tcp, match 19154486 length 728 too big: 704 left, 704 
>> allowed
> 
> How frequently does this bug occur?

Sets of entries (about 20) happen a few times per day, each entry spaced about 
ten minutes apart.

> if this is quickly reproducible - please set
> lnet.debug=-1, lnet.debug_subsystem=-1, lnet.debug_mb=100 on the MDS and
> client, reproduce it, and save the logs with lctl dk > $logfile.

Debugging has been enabled. I haven't been able to catch it in the act yet. Will 
leaving debug logging enabled until I can catch the bug overflow anything?
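
For the record, the settings were applied roughly as follows (a sketch of the
sysctl form Alexey suggested; the /proc/sys/lnet files can be written directly
as well):

sysctl -w lnet.debug=-1
sysctl -w lnet.debug_subsystem=-1
sysctl -w lnet.debug_mb=100
# once the error recurs, dump the debug buffer on the affected node:
lctl dk > /tmp/lustre-debug.log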

> after that - please file a bug and attach the logs from the MDS and client to it.

A bug will be filed as soon as it can be caught with logging enabled.

> this message says the client expected a smaller reply than the MDS sent.

So trinity cannot accept a reply as large as the one the MDS is sending?

Thanks for your help,
Mike



-- 
Michael D. Seymour Phone: 416-978-8497
Scientific Computing Support   Fax: 416-978-3921
Canadian Institute for Theoretical Astrophysics, University of Toronto
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


[Lustre-discuss] lnet_try_match_md()) Matching packet from 12345-10.5.203....@tcp, match 19154486 length 728 too big

2009-05-22 Thread Michael D. Seymour
Hi all,

I hope you can help us with some connection problems we are having with our 
Lustre file system. The filesystem roc consists of 6 OSSs with one OST per OSS. 
Each OSS uses the 1.6.7 RHEL 5 kernel on CentOS 5.1 (one unit uses CentOS 5.3). 
The MDS uses CentOS 5.1 and Lustre 1.6.7. 203 RHEL-based clients mount the 
filesystem, all using Lustre 1.6.7. All are connected via a Gb Ethernet switch 
stack.

One client running CentOS 5.2 re-exports the Lustre filesystem via NFS on a 
different network.

We get the following messages on a particular client:

May 22 15:07:45 trinity kernel: LustreError: 
5111:0:(lib-move.c:110:lnet_try_match_md()) Matching packet from 
12345-10.5.203@tcp, match 19154486 length 728 too big: 704 left, 704 allowed
May 22 15:07:45 trinity kernel: LustreError: 
5111:0:(lib-move.c:110:lnet_try_match_md()) Skipped 3 previous similar messages
May 22 15:12:45 trinity kernel: Lustre: Request x19154486 sent from 
roc-MDT-mdc-01044e1d4c00 to NID 10.5.203@tcp 300s ago has timed out 
(limit 300s).
May 22 15:12:45 trinity kernel: Lustre: Skipped 3 previous similar messages
May 22 15:12:45 trinity kernel: Lustre: roc-MDT-mdc-01044e1d4c00: 
Connection to service roc-MDT via nid 10.5.203@tcp was lost; in 
progress 
operations using this service will wait for recovery to complete.
May 22 15:12:45 trinity kernel: Lustre: Skipped 3 previous similar messages
May 22 15:12:45 trinity kernel: Lustre: roc-MDT-mdc-01044e1d4c00: 
Connection restored to service roc-MDT using nid 10.5.203@tcp.
May 22 15:12:45 trinity kernel: Lustre: Skipped 4 previous similar messages

[r...@trinity ~]# cat /proc/fs/lustre/lov/roc-clilov-01044e1d4c00/uuid
84adb9a1-8959-fcf5-cc72-81c6a1e171b8

On the MDS containing roc-MDT:

May 22 15:12:45 rocpile kernel: Lustre: 
19236:0:(ldlm_lib.c:538:target_handle_reconnect()) roc-MDT: 
84adb9a1-8959-fcf5-cc72-81c6a1e171b8 reconnecting
May 22 15:12:45 rocpile kernel: Lustre: 
19236:0:(ldlm_lib.c:538:target_handle_reconnect()) Skipped 4 previous similar 
messages

Any idea what could be causing this? Bug 11332 looked similar, but it has been 
closed because other related bugs were fixed.

Thanks,
Mike

-- 
Michael D. Seymour Phone: 416-978-8497
Scientific Computing Support   Fax: 416-978-3921
Canadian Institute for Theoretical Astrophysics, University of Toronto
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


[Lustre-discuss] Problems adding new OSS to existing Lustre filesystem -- Refusing connection, No matching NI

2009-04-24 Thread Michael D. Seymour
Hi,

We are having a problem adding a new OSS (roc06, 10.5.203.6) to an existing 
Lustre file system (raid-cita) on the 10.5 network. SELinux and iptables are 
disabled. It is a multi-homed OSS on the 10.4 and 10.5 networks.

When mounted, clients try to connect to the Lustre file system via the 
10.4 network, even though everything is set up to use the 10.5 network. The 
clients do not see the new space on the file system either: it shows 23T as 
opposed to the >27T it should show.

lfs quota hangs as well.

We did suffer some problems with the MDS filesystem, which was fscked; the 
kernel was downgraded to 1.6.6 and the filesystem remounted.

Many messages like this exist in /var/log/messages on the new OSS:

Apr 24 10:01:07 roc06 kernel: LustreError: 120-3: Refusing connection from 
10.4.1.52 for 10.4.20...@tcp: No matching NI

On the multi-homed client 10.4.1.52:

[r...@tpb52-chroot ~]# uname -a; cat /etc/redhat-release
Linux tpb52 2.6.18-92.1.17.el5_lustre.1.6.7smp #1 SMP Mon Feb 9 19:56:55 MST 
2009 x86_64 x86_64 x86_64 GNU/Linux
CentOS release 5 (Final)

[r...@tpb52-chroot ~]# df -h /mnt/raid-cita/
FilesystemSize  Used Avail Use% Mounted on
10.5.203@tcp:/roc
23T   11T   12T  47% /mnt/raid-cita

[r...@tpb52-chroot ~]# lctl list_nids
10.5.2...@tcp

[r...@tpb52-chroot ~]# grep lnet /etc/modprobe.conf
options lnet networks=tcp0(eth1)

[r...@tpb52-chroot ~]# ifconfig eth1
eth1  Link encap:Ethernet  HWaddr 00:15:C5:EC:FA:8C
   inet addr:10.5.2.12  Bcast:10.5.255.255  Mask:255.255.0.0

On the OSS roc06:

[r...@roc06 lustre]# uname -a; cat /etc/redhat-release
Linux roc06 2.6.18-92.1.17.el5_lustre.1.6.7.1smp #1 SMP Mon Apr 13 16:13:00 MDT 
2009 x86_64 x86_64 x86_64 GNU/Linux
CentOS release 5.3 (Final)

[r...@roc06 lustre]# lctl list_nids
10.5.20...@tcp

[r...@roc06 ~]# grep lnet /etc/modprobe.conf
options lnet networks=tcp0(eth1)

[r...@roc06 ~]# ifconfig eth1
eth1  Link encap:Ethernet  HWaddr 00:22:19:05:90:F2
   inet addr:10.5.203.6  Bcast:10.5.255.255  Mask:255.255.0.0

The OSS was formatted with the following:

mkfs.lustre --verbose --reformat --fsname=roc --ost --mgsnode=10.5.203@tcp0 
--mkfsoptions="-m 0 -E  stride=32" /dev/md2

I believe this was done before "options lnet networks=tcp0(eth1)" was included 
in modprobe.conf.
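
In case it helps, a minimal way to re-check the LNET side once modprobe.conf is
correct (a sketch only; assumes no targets are mounted on the OSS while the
modules are reloaded):

lustre_rmmod                  # unload lustre/lnet so the networks= option is re-read
modprobe lustre
lctl list_nids                # should now show only 10.5.203.6@tcp
lctl ping 10.5.203.250@tcp    # verify the MDS/MGS is reachable over tcp0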

[r...@roc06 ~]# tunefs.lustre --print /dev/md2

Permanent disk data:
Target: roc-OST0005
Index:  5
Lustre FS:  roc
Mount type: ldiskfs
Flags:  0x402
   (OST )
Persistent mount opts: errors=remount-ro,extents,mballoc
Parameters: mgsnode=10.5.203@tcp ost.quota_type=u


For comparison, the OSS roc05:

[r...@roc05 ~]# uname -a; cat /etc/redhat-release
Linux roc05 2.6.18-92.1.17.el5_lustre.1.6.7smp #1 SMP Mon Feb 9 19:56:55 MST 
2009 x86_64 x86_64 x86_64 GNU/Linux
CentOS release 5 (Final)

[r...@roc05 ~]# lctl list_nids
10.5.20...@tcp

[r...@roc05 ~]# grep lnet /etc/modprobe.conf
options lnet networks=tcp0(eth1)

[r...@roc05 ~]# ifconfig eth1
eth1  Link encap:Ethernet  HWaddr 00:1C:23:D5:F5:4F
   inet addr:10.5.203.5  Bcast:10.5.255.255  Mask:255.255.0.0

[r...@roc05 ~]# tunefs.lustre --print /dev/md2

Permanent disk data:
Target: roc-OST0004
Index:  4
Lustre FS:  roc
Mount type: ldiskfs
Flags:  0x402
   (OST )
Persistent mount opts: errors=remount-ro,extents,mballoc
Parameters: mgsnode=10.5.203@tcp ost.quota_type=u


On the MDS (rocpile):

[r...@rocpile ~]#  uname -a; cat /etc/redhat-release
Linux rocpile 2.6.18-92.1.10.el5_lustre.1.6.6smp #1 SMP Tue Aug 26 12:16:17 EDT 
2008 x86_64 x86_64 x86_64 GNU/Linux
CentOS release 5 (Final)

[r...@rocpile ~]# lctl list_nids
10.5.203@tcp

[r...@rocpile ~]# grep lnet /etc/modprobe.conf
options lnet networks=tcp(eth1)

[r...@rocpile ~]# ifconfig eth1
eth1  Link encap:Ethernet  HWaddr 00:15:C5:EC:F6:88
   inet addr:10.5.203.250  Bcast:10.5.255.255  Mask:255.255.0.0


Any suggestions?

Thanks,
Mike

-- 
Michael D. Seymour Phone: 416-978-1776
Scientific Computing Support   Fax: 416-978-3921
Canadian Institute for Theoretical Astrophysics, University of Toronto
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] WARNING: Potential directory corruptions on the MDS with 1.6.7

2009-04-09 Thread Michael D. Seymour
Andreas Dilger wrote:
> On Apr 09, 2009  17:11 -0400, Michael D. Seymour wrote:
>> Is there an accepted procedure for recovering from any introduced errors 
>> from 
>> this bug? i.e. performing e2fsck with the --mdsdb option on the MDT, lfsck 
>> on 
>> the OSTs? Or simply do an e2fsck on the unmounted MDT, downgrade and remount?
> 
> No, there is no lustre-specific mechanism for recovery for this
> problem.  This may result in files being put into the underlying
> lost+found directory, which you might consider moving into a
> newly-created ROOT/lost+found directory by mounting the MDS as
> "-t ldiskfs".  You shouldn't just move the filesystem lost+found
> directory, as that can cause trouble at a later time.
> 

So this would be the course of action:

umount /lustre/mdt
e2fsck /dev/md2                  # mdt device; say yes to all repair queries
# then move any recovered files out of the ldiskfs-level lost+found:
mkdir /root/MDT-lost+found
mount -t ldiskfs /dev/md2 /mnt/tmp
rsync -a /mnt/tmp/lost+found/ /root/MDT-lost+found
umount /mnt/tmp
# downgrade to 1.6.6
mount -t lustre /dev/md2 /lustre/mdt

I am unclear what use the leftover files placed in lost+found from the 
MDT fs could be.

Thanks Andreas for your help,
Mike
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] WARNING: Potential directory corruptions on the MDS with 1.6.7

2009-04-09 Thread Michael D. Seymour
Peter Jones wrote:
> A bug has been identified in 1.6.7 that can cause directory corruptions 
> on the MDT. A patch and full details are in bug 18695 - 
> https://bugzilla.lustre.org/show_bug.cgi?id=18695
> 
> We recommend to anyone running 1.6.7 on the MDS to unmount the MDT, run 
> e2fsck against the MDT device and apply the patch from bug 18695 as soon 
> as possible.
> 
> Please note that the landing that caused the regression was that for 
> 11063, so anyone running with that patch on an earlier 1.6.x release 
> should also follow the above procedure.
> 
> This fix will be included in 1.8.0 and we will also create an ad hoc 
> 1.6.7.1 release to provide this fix as soon as possible. 1.6.7 will be 
> withdrawn from the Sun Download Center
> ___
> Lustre-discuss mailing list
> Lustre-discuss@lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-discuss

Hi Peter, all,

Is there an accepted procedure for recovering from any errors introduced by 
this bug? E.g., performing e2fsck with the --mdsdb option on the MDT and lfsck 
on the OSTs? Or simply doing an e2fsck on the unmounted MDT, downgrading, and 
remounting?
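
For context, the --mdsdb/lfsck route would look roughly like this, as I
understand the 1.6 manual (a sketch; the database paths are placeholders and
-n keeps everything read-only):

# on the MDS, with the MDT unmounted:
e2fsck -n -v --mdsdb /tmp/mdsdb /dev/md2
# on each OSS, with its OST unmounted:
e2fsck -n -v --mdsdb /tmp/mdsdb --ostdb /tmp/ostNdb /dev/ostdev
# on a client, with the databases copied over and the fs mounted:
lfsck -n -v --mdsdb /tmp/mdsdb --ostdb /tmp/ost1db /tmp/ost2db /mnt/lustre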

On one of our 17 TB Lustre filesystems, containing disposable data, I performed 
the following:

umount mdt
e2fsck /dev/md2   # mdt device; say yes to all repair queries
# downgrade to 1.6.6
mount mdt

This resulted in <100 files out of 587k with "? ? ?" directory entries, but 
everything else seems fine. I have not performed any checks of file consistency.

We have a second Lustre file system that stores permanent data, and I don't want 
to risk any lost or corrupt files there.

Thanks for any help,
Mike S.

-- 
Michael D. Seymour Phone: 416-978-1776
Scientific Computing Support   Fax: 416-978-3921
Canadian Institute for Theoretical Astrophysics, University of Toronto
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss