Re: e1000, sshd, and the infamous "Corrupted MAC on input"

2005-02-04 Thread Ethan Weinstein
Matt Mackall wrote:
> Ok, reproducible without ssh makes narrowing this down much easier.
> Are you seeing errors on the interface? If not, that would indicate
> problems after CRC checking on the receive side. Do errors happen in both
> directions? If not, it may be CPU speed-related or specific to a given
> NIC - swap them if they're not onboard.
>
> The next test is to send patterns. Try sending yourself a gigabyte of:
>
> #include <stdio.h>
>
> int main(void)
> {
> 	int i;
>
> 	for (i = 0; i < 0x1000; i++) {
> 		fwrite(&i, 4, 1, stdout);
> 	}
> }
>
> If there's some sort of partial DMA transfer going on, this should
> make it evident.

No errors reported on either interface.
Interesting results: the errors show up in one direction only.  It seems
highly likely the problem is only with the 82545EM, as I couldn't get a
botched transfer FROM it to the 82547EI after 20 or so attempts (both of
these are onboard, unfortunately, so no swapping).  Several transfers TO it
did yield bad files, though (using my big 1.6G gzipped tarball).

Now, on to the patterns.  I didn't get a _single_ failure in either
direction using what that code snippet generated in over 20 attempts.
Perhaps we're failing on larger amounts of more complex data?
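
One way to test with "more complex" data while still knowing what every byte should be is to send a seeded pseudo-random stream and regenerate it on the receiving side. The following is only a sketch of that idea, not something proposed in the thread: the xorshift32 generator and the helper names fill_random() and first_mismatch() are assumptions chosen for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Marsaglia's xorshift32: small, fast, and fully determined by the seed.
 * The seed must be nonzero. */
static uint32_t xorshift32(uint32_t *state)
{
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *state = x;
    return x;
}

/* Sender side: fill buf with the next len bytes of the stream.
 * One byte is produced per generator step, so the stream is the same
 * no matter how it is split into chunks. */
static void fill_random(uint32_t *state, unsigned char *buf, size_t len)
{
    for (size_t i = 0; i < len; i++)
        buf[i] = (unsigned char)(xorshift32(state) >> 24);
}

/* Receiver side: compare received bytes against the expected stream.
 * The state is always advanced by len steps, so the next chunk can be
 * checked afterwards. Returns the index of the first mismatching byte
 * in buf, or -1 if the whole chunk matches. */
static long first_mismatch(uint32_t *state, const unsigned char *buf, size_t len)
{
    long bad = -1;
    for (size_t i = 0; i < len; i++) {
        unsigned char expect = (unsigned char)(xorshift32(state) >> 24);
        if (bad < 0 && buf[i] != expect)
            bad = (long)i;
    }
    return bad;
}
```

A sender would loop over fill_random() and write each chunk to stdout (piped into nc); the receiver would start from the same seed and run first_mismatch() over each chunk it reads, adding the chunk's starting offset to get the position in the stream.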

-Ethan
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: e1000, sshd, and the infamous "Corrupted MAC on input"

2005-02-03 Thread Willy Tarreau
Hi,

On Thu, Feb 03, 2005 at 11:16:37PM -0500, Ethan Weinstein wrote:
(...) 
> Excellent tip, thanks.  I was able to reproduce the problem several times
> using this technique with nc, however the problem was intermittent (as
> nasty problems like this often are).  I used a 1.3G gzipped tarball and
> experienced several botched transfers along with a few good ones.  To
> be fair, I also switched back to 100Fdx and repeated; I didn't get a 
> single failure at this speed over 25 or so runs.
> 
> The results of two cmp's are here:
> 
> http://www.stinkfoot.org/e1000tests.out
> 
> What next?

I would disable rx/tx checksum offloading on the cards to rule out a bug
in that part. One explanation for what you're seeing is that some frames
are corrupted at gigabit speed (possibly on one of the cards themselves),
and the cards either don't compute the checksum correctly on the receive
side or ignore it when it's bad.

IIRC, you can do this with ethtool:

  # ethtool -K ethX rx off tx off
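
For reference, the checksum in question is the 16-bit one's-complement Internet checksum described in RFC 1071. With offloading turned off, the kernel computes and verifies it in software instead of trusting the NIC. Below is a minimal sketch of the calculation; the function name inet_checksum() is made up for illustration, and the sketch leaves out the pseudo-header that TCP and UDP add in front of the data.

```c
#include <stddef.h>
#include <stdint.h>

/* RFC 1071 Internet checksum over a byte buffer.
 * Bytes are paired into big-endian 16-bit words; an odd trailing byte
 * is padded with zero. The 32-bit accumulator is large enough for
 * packet-sized inputs (up to 64 KiB). */
static uint16_t inet_checksum(const unsigned char *data, size_t len)
{
    uint32_t sum = 0;

    while (len > 1) {
        sum += ((uint32_t)data[0] << 8) | data[1];
        data += 2;
        len -= 2;
    }
    if (len)
        sum += (uint32_t)data[0] << 8;

    /* Fold the carries back into the low 16 bits. */
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);

    return (uint16_t)~sum;
}
```

Because this is a plain sum, it is not a strong check: swapping two 16-bit words in a packet, for example, leaves the result unchanged. That is one more reason to compare md5sums of the transferred files rather than rely on the absence of checksum errors.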

Willy



Re: e1000, sshd, and the infamous "Corrupted MAC on input"

2005-02-03 Thread Matt Mackall
On Thu, Feb 03, 2005 at 11:16:37PM -0500, Ethan Weinstein wrote:
> Matt Mackall wrote:
> >On Wed, Feb 02, 2005 at 10:44:14PM -0500, Ethan Weinstein wrote:
> ...
> >>Finally, I used a crossover cable between the two boxes, which resulted 
> >>in the same error from sshd again.
> >
> >
> >Well ssh isn't an especially good test as it's hard to debug.
> >
> >Try transferring large compressed files via netcat and comparing the
> >results. eg:
> >
> >host1# nc -l -p 2000 > foo.bz2
> >
> >host2# nc host1 2000 < foo.bz2
> >
> >If the md5sums differ, follow up with a cmp -bl to see what changed.
> >
> >Then we can look at the failure patterns and determine if there's some
> >data or alignment dependence.
> >
> 
> Excellent tip, thanks.  I was able to reproduce the problem several times
> using this technique with nc, however the problem was intermittent (as
> nasty problems like this often are).  I used a 1.3G gzipped tarball and
> experienced several botched transfers along with a few good ones.  To
> be fair, I also switched back to 100Fdx and repeated; I didn't get a 
> single failure at this speed over 25 or so runs.
> 
> The results of two cmp's are here:
> 
> http://www.stinkfoot.org/e1000tests.out
> 
> What next?

Ok, reproducible without ssh makes narrowing this down much easier.
Are you seeing errors on the interface? If not, that would indicate
problems after CRC checking on the receive side. Do errors happen in both
directions? If not, it may be CPU speed-related or specific to a given
NIC - swap them if they're not onboard.

The next test is to send patterns. Try sending yourself a gigabyte of:

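#include <stdio.h>

int main(void)
{
	int i;

	/* Emit the loop counter as raw 4-byte words: 0, 1, 2, ...
	 * 0x1000 words is only 16 KiB; raise the bound for a longer
	 * stream (0x10000000 words gives 1 GiB). */
	for (i = 0; i < 0x1000; i++) {
		fwrite(&i, 4, 1, stdout);
	}
}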

If there's some sort of partial DMA transfer going on, this should
make it evident.
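
On the receiving end the counter stream can be checked word by word instead of diffing against a saved copy. The following is a rough sketch under a few assumptions: both machines use the same byte order, int is 4 bytes, and the names struct damage and scan_counter() are invented here for illustration.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct damage {
    long first_bad;    /* index of the first wrong word, -1 if none */
    long last_bad;     /* index of the last wrong word, -1 if none */
    size_t bad_words;  /* total number of wrong words */
};

/* Check nwords 32-bit words in buf against the counter sequence
 * first, first + 1, ... in native byte order, which is what
 * fwrite(&i, 4, 1, stdout) emits on the sending side. */
static struct damage scan_counter(const unsigned char *buf, size_t nwords,
                                  uint32_t first)
{
    struct damage d = { -1, -1, 0 };

    for (size_t w = 0; w < nwords; w++) {
        uint32_t got;

        /* memcpy avoids unaligned loads from the byte buffer. */
        memcpy(&got, buf + 4 * w, sizeof got);
        if (got != first + (uint32_t)w) {
            if (d.first_bad < 0)
                d.first_bad = (long)w;
            d.last_bad = (long)w;
            d.bad_words++;
        }
    }
    return d;
}
```

Multiplying first_bad by 4 gives the byte offset of the damage, and the distance from first_bad to last_bad gives its extent. A block of stale or repeated data whose start or length lines up with a buffer or cache-line size would be the kind of signature a partial DMA transfer might leave.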

-- 
Mathematics is the supreme nostalgia of our time.


Re: e1000, sshd, and the infamous "Corrupted MAC on input"

2005-02-03 Thread Ethan Weinstein
Matt Mackall wrote:
> On Wed, Feb 02, 2005 at 10:44:14PM -0500, Ethan Weinstein wrote:
> ...
>> Finally, I used a crossover cable between the two boxes, which resulted
>> in the same error from sshd again.
>
> Well ssh isn't an especially good test as it's hard to debug.
>
> Try transferring large compressed files via netcat and comparing the
> results. eg:
>
> host1# nc -l -p 2000 > foo.bz2
>
> host2# nc host1 2000 < foo.bz2
>
> If the md5sums differ, follow up with a cmp -bl to see what changed.
>
> Then we can look at the failure patterns and determine if there's some
> data or alignment dependence.

Excellent tip, thanks.  I was able to reproduce the problem several times
using this technique with nc, however the problem was intermittent (as
nasty problems like this often are).  I used a 1.3G gzipped tarball and
experienced several botched transfers along with a few good ones.  To
be fair, I also switched back to 100Fdx and repeated; I didn't get a
single failure at this speed over 25 or so runs.

The results of two cmp's are here:

http://www.stinkfoot.org/e1000tests.out

What next?

-Ethan


Re: e1000, sshd, and the infamous "Corrupted MAC on input"

2005-02-02 Thread Matt Mackall
On Wed, Feb 02, 2005 at 10:44:14PM -0500, Ethan Weinstein wrote:
> Hey all,
> 
> I've been having quite a time with the e1000 driver running at gigabit 
> speeds.  Running it at 100Fdx has never been a problem, which I've
> done for a long time. Last week I picked up a gigabit switch, and that's
> when the trouble began.  I find that transferring large amounts of data 
> using scp invariably ends up with sshd spitting out "Disconnecting: 
> Corrupted MAC on input."  After deciding I must have purchased a bum 
> switch, I grabbed another model.. only to get the same error.
> Finally, I used a crossover cable between the two boxes, which resulted 
> in the same error from sshd again.

Well ssh isn't an especially good test as it's hard to debug.

Try transferring large compressed files via netcat and comparing the
results. eg:

host1# nc -l -p 2000 > foo.bz2

host2# nc host1 2000 < foo.bz2

If the md5sums differ, follow up with a cmp -bl to see what changed.

Then we can look at the failure patterns and determine if there's some
data or alignment dependence.
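
To look for alignment dependence it can help to collapse the byte-by-byte cmp output into runs of differing bytes. Here is a small sketch of that step, working on two in-memory buffers; struct diff_run and find_diff_runs() are hypothetical names, and reading the two files into memory (or in matching chunks) is left out.

```c
#include <stddef.h>

struct diff_run {
    size_t start;  /* offset of the first differing byte in the run */
    size_t len;    /* number of consecutive differing bytes */
};

/* Find runs of consecutive differing bytes between a and b (both n
 * bytes long). Up to max_runs runs are stored in runs[]; the return
 * value is the total number of runs found, which may be larger. */
static size_t find_diff_runs(const unsigned char *a, const unsigned char *b,
                             size_t n, struct diff_run *runs, size_t max_runs)
{
    size_t count = 0;
    size_t i = 0;

    while (i < n) {
        if (a[i] == b[i]) {
            i++;
            continue;
        }
        size_t start = i;
        while (i < n && a[i] != b[i])
            i++;
        if (count < max_runs) {
            runs[count].start = start;
            runs[count].len = i - start;
        }
        count++;
    }
    return count;
}
```

Printing start modulo a few candidate sizes (say 64, 2048 or 4096) for each run would show quickly whether the damage tends to begin on a particular boundary. Note that a corrupted region can contain bytes that happen to match the original, so one damaged block may show up as several short runs close together.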

-- 
Mathematics is the supreme nostalgia of our time.

