Re: e1000, sshd, and the infamous "Corrupted MAC on input"
Matt Mackall wrote:
> Ok, reproducible without ssh makes narrowing this down much easier.
>
> Are you seeing errors on the interface? A "no" would indicate problems
> after CRC checking on the receive side.
>
> Do errors happen in both directions? If not, it may be CPU
> speed-related or specific to a given NIC - swap them if they're not
> onboard.
>
> The next test is to send patterns. Try sending yourself a gigabyte of:
>
> #include <stdio.h>
>
> int main(void)
> {
>         int i;
>
>         /* write each counter value as a raw 4-byte word */
>         for (i = 0; i < 0x1000; i++) {
>                 fwrite(&i, 4, 1, stdout);
>         }
> }
>
> If there's some sort of partial DMA transfer going on, this should
> make it evident.

No errors reported on either interface.

Interesting results in one direction, though. It seems highly likely the problem is only with the 82545EM, as I couldn't get a botched transfer FROM it to the 82547EI after 20 or so attempts (both of these are onboard, unfortunately, so no swapping). Several transfers TO it did yield bad files, though (using my big 1.6G gzipped tarball).

Now, on to the patterns. I didn't get a _single_ failure in either direction using what that code snippet generated in over 20 attempts. Perhaps we're failing on larger amounts of more complex data?

-Ethan
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: e1000, sshd, and the infamous "Corrupted MAC on input"
Hi,

On Thu, Feb 03, 2005 at 11:16:37PM -0500, Ethan Weinstein wrote:
(...)
> Excellent tip, thanks. I was able to reproduce the problem several times
> using this technique with nc, however the problem was intermittent (as
> nasty problems like this often are). I used a 1.3G gzipped tarball and
> experienced several botched transfers along with a few good ones. To
> be fair, I also switched back to 100Fdx and repeated; I didn't get a
> single failure at this speed over 25 or so runs.
>
> The results of two cmp's are here:
>
> http://www.stinkfoot.org/e1000tests.out
>
> What next?

I would disable rx/tx checksumming on the cards to make sure the bug is not in that part. One explanation for what you are seeing would be that some frames are corrupted at gigabit speed (possibly on one of the cards themselves), and the card either does not compute the checksum correctly on the receive side or ignores it when it is bad. IIRC, you can do this with ethtool:

# ethtool -K <interface> rx off tx off

Willy
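As background to the checksum suggestion above, here is a rough sketch of the 16-bit ones' complement sum (in the style of RFC 1071) that the software path verifies once offload is turned off. It is not taken from the thread or from the kernel; the name inet_checksum is hypothetical and the code is only meant to illustrate the idea.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Illustrative sketch only (hypothetical name, not the kernel's code):
 * 16-bit ones' complement checksum over a byte buffer, RFC 1071 style.
 * Bytes are paired into big-endian 16-bit words; an odd trailing byte
 * is padded with a zero byte on the right.
 */
uint16_t inet_checksum(const unsigned char *data, size_t len)
{
    uint64_t sum = 0;
    size_t i;

    for (i = 0; i + 1 < len; i += 2)
        sum += ((uint32_t)data[i] << 8) | data[i + 1];
    if (len & 1)
        sum += (uint32_t)data[len - 1] << 8;

    /* fold the carries back into the low 16 bits */
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);

    return (uint16_t)~sum;
}
```

Because the sum is commutative, two 16-bit words that trade places produce the same checksum, so this check cannot catch every kind of corruption. That may be one more reason to rely on the end-to-end md5sum/cmp comparison when judging a transfer.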
Re: e1000, sshd, and the infamous "Corrupted MAC on input"
On Thu, Feb 03, 2005 at 11:16:37PM -0500, Ethan Weinstein wrote:
> Matt Mackall wrote:
> >On Wed, Feb 02, 2005 at 10:44:14PM -0500, Ethan Weinstein wrote:
> ...
> >>Finally, I used a crossover cable between the two boxes, which resulted
> >>in the same error from sshd again.
> >
> >
> >Well ssh isn't an especially good test as it's hard to debug.
> >
> >Try transferring large compressed files via netcat and comparing the
> >results. eg:
> >
> >host1# nc -l -p 2000 > foo.bz2
> >
> >host2# nc host1 2000 < foo.bz2
> >
> >If the md5sums differ, follow up with a cmp -bl to see what changed.
> >
> >Then we can look at the failure patterns and determine if there's some
> >data or alignment dependence.
> >
>
> Excellent tip, thanks. I was able to reproduce the problem several times
> using this technique with nc, however the problem was intermittent (as
> nasty problems like this often are). I used a 1.3G gzipped tarball and
> experienced several botched transfers along with a few good ones. To
> be fair, I also switched back to 100Fdx and repeated; I didn't get a
> single failure at this speed over 25 or so runs.
>
> The results of two cmp's are here:
>
> http://www.stinkfoot.org/e1000tests.out
>
> What next?

Ok, reproducible without ssh makes narrowing this down much easier.

Are you seeing errors on the interface? A "no" would indicate problems after CRC checking on the receive side.

Do errors happen in both directions? If not, it may be CPU speed-related or specific to a given NIC - swap them if they're not onboard.

The next test is to send patterns. Try sending yourself a gigabyte of:

#include <stdio.h>

int main(void)
{
        int i;

        /* write each counter value as a raw 4-byte word */
        for (i = 0; i < 0x1000; i++) {
                fwrite(&i, 4, 1, stdout);
        }
}

If there's some sort of partial DMA transfer going on, this should make it evident.

--
Mathematics is the supreme nostalgia of our time.
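A counter pattern like the one above has the advantage that the receiving side can check it without keeping a reference copy. The following is a minimal sketch of such a checker, not something posted in the thread. The names first_bad_word and check_stream are hypothetical, and the code assumes both hosts use the same byte order and a 32-bit int, since the generator writes the counter in native format.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: check that buf holds nwords consecutive 32-bit
 * counters starting at first_word, in native byte order. Returns the
 * index of the first wrong word, or nwords if every word matches.
 */
size_t first_bad_word(const unsigned char *buf, size_t nwords,
                      uint32_t first_word)
{
    size_t k;

    for (k = 0; k < nwords; k++) {
        uint32_t got;

        memcpy(&got, buf + 4 * k, sizeof got);
        if (got != first_word + (uint32_t)k)
            return k;
    }
    return nwords;
}

/*
 * Read 4-byte words from `in` until EOF. Returns the byte offset of the
 * first word that does not match its position in the stream, or -1 if
 * all complete words match. A trailing partial word is ignored.
 */
long long check_stream(FILE *in)
{
    unsigned char chunk[4096];
    uint32_t next = 0;
    long long offset = 0;
    size_t n;

    while ((n = fread(chunk, 4, sizeof chunk / 4, in)) > 0) {
        size_t bad = first_bad_word(chunk, n, next);

        if (bad < n)
            return offset + 4 * (long long)bad;
        next += (uint32_t)n;
        offset += 4 * (long long)n;
    }
    return -1;
}
```

Calling check_stream(stdin) from a small main on the receiving end of the nc pipe would report where the pattern first breaks. The offset of that first break is the interesting number when looking for a partial DMA transfer.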
Re: e1000, sshd, and the infamous "Corrupted MAC on input"
Matt Mackall wrote:
>On Wed, Feb 02, 2005 at 10:44:14PM -0500, Ethan Weinstein wrote:
...
>>Finally, I used a crossover cable between the two boxes, which resulted
>>in the same error from sshd again.
>
>
>Well ssh isn't an especially good test as it's hard to debug.
>
>Try transferring large compressed files via netcat and comparing the
>results. eg:
>
>host1# nc -l -p 2000 > foo.bz2
>
>host2# nc host1 2000 < foo.bz2
>
>If the md5sums differ, follow up with a cmp -bl to see what changed.
>
>Then we can look at the failure patterns and determine if there's some
>data or alignment dependence.
>

Excellent tip, thanks. I was able to reproduce the problem several times using this technique with nc, however the problem was intermittent (as nasty problems like this often are). I used a 1.3G gzipped tarball and experienced several botched transfers along with a few good ones. To be fair, I also switched back to 100Fdx and repeated; I didn't get a single failure at this speed over 25 or so runs.

The results of two cmp's are here:

http://www.stinkfoot.org/e1000tests.out

What next?

-Ethan
Re: e1000, sshd, and the infamous "Corrupted MAC on input"
On Wed, Feb 02, 2005 at 10:44:14PM -0500, Ethan Weinstein wrote:
> Hey all,
>
> I've been having quite a time with the e1000 driver running at gigabit
> speeds. Running it at 100Fdx has never been a problem, which I've
> done for a long time. Last week I picked up a gigabit switch, and that's
> when the trouble began. I find that transferring large amounts of data
> using scp invariably ends up with sshd spitting out "Disconnecting:
> Corrupted MAC on input." After deciding I must have purchased a bum
> switch, I grabbed another model.. only to get the same error.
> Finally, I used a crossover cable between the two boxes, which resulted
> in the same error from sshd again.

Well ssh isn't an especially good test as it's hard to debug.

Try transferring large compressed files via netcat and comparing the results. eg:

host1# nc -l -p 2000 > foo.bz2

host2# nc host1 2000 < foo.bz2

If the md5sums differ, follow up with a cmp -bl to see what changed.

Then we can look at the failure patterns and determine if there's some data or alignment dependence.

--
Mathematics is the supreme nostalgia of our time.
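To look for the alignment dependence mentioned above, one option beyond reading cmp -bl output by eye is to bucket the mismatching offsets modulo a fixed size and see whether the errors cluster. The sketch below is an illustration under assumptions, not a tool from the thread: ALIGN, diff_chunk and diff_files are hypothetical names, and 64 is an arbitrary bucket size that could just as well be a page or descriptor size.

```c
#include <stddef.h>
#include <stdio.h>

#define ALIGN 64 /* assumed bucket size; adjust to the boundary of interest */

/*
 * Compare len bytes of a and b, which start at file offset base. For
 * each differing byte, increment hist[(base + i) % ALIGN]. Returns the
 * number of differing bytes.
 */
size_t diff_chunk(const unsigned char *a, const unsigned char *b,
                  size_t len, unsigned long long base,
                  unsigned long hist[ALIGN])
{
    size_t i, bad = 0;

    for (i = 0; i < len; i++) {
        if (a[i] != b[i]) {
            hist[(base + i) % ALIGN]++;
            bad++;
        }
    }
    return bad;
}

/*
 * Compare two files chunk by chunk. Returns the number of differing
 * bytes in their common prefix, or -1 if either file cannot be opened.
 * A difference in length is not reported by this sketch.
 */
long long diff_files(const char *path_a, const char *path_b,
                     unsigned long hist[ALIGN])
{
    unsigned char buf_a[65536], buf_b[65536];
    unsigned long long base = 0;
    long long bad = 0;
    FILE *fa = fopen(path_a, "rb");
    FILE *fb = fopen(path_b, "rb");

    if (!fa || !fb) {
        if (fa)
            fclose(fa);
        if (fb)
            fclose(fb);
        return -1;
    }
    for (;;) {
        size_t na = fread(buf_a, 1, sizeof buf_a, fa);
        size_t nb = fread(buf_b, 1, sizeof buf_b, fb);
        size_t n = na < nb ? na : nb;

        if (n == 0)
            break;
        bad += (long long)diff_chunk(buf_a, buf_b, n, base, hist);
        base += n;
        if (na != nb)
            break; /* one file ended before the other */
    }
    fclose(fa);
    fclose(fb);
    return bad;
}
```

A histogram that is roughly flat would suggest the corruption does not depend on alignment. A few dominant buckets would point toward a boundary-related problem worth a closer look.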