Andrew,

Thank you for the excellent write-up.  There was some really good
Sherlock Holmes stuff going on here:
1.  the problem hit multiple machines at once; this eliminates certain
local issues
2.  all network links good except the one we can't test easily; smells
like that's the problem
3.  hash of 5-tuple used for link aggregation; something "every
network engineer knows" but is non-obvious elsewhere... and evidently
non-obvious to the network people at times, too.
4.  Weird error messages that hint at what it is but not really.
5.  Fixing a minor problem that shouldn't be the cause fixes the
larger problem; 0.4% packet loss is really bad for TCP.
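
For anyone on the list who hasn't run into point 3 before, here is a
minimal Python sketch of the idea.  It is an illustration only: real
switches use vendor-specific hash functions, and the names pick_link
and links_used, the addresses, and the ports are all made up for this
example.  It shows that one flow always lands on the same member link,
while many flows between the same two hosts spread across the links,
which is why a single bad link hurts some connections and not others.

```python
import hashlib


def pick_link(src_ip, dst_ip, proto, src_port, dst_port, n_links):
    """Map a flow's 5-tuple to one member link of an aggregate.

    Hypothetical stand-in for a switch's hash; SHA-256 is used here
    only because it is convenient, not because hardware uses it.
    """
    key = f"{src_ip}|{dst_ip}|{proto}|{src_port}|{dst_port}".encode()
    digest = hashlib.sha256(key).digest()
    # Reduce the first 4 bytes of the digest to a link index.
    return int.from_bytes(digest[:4], "big") % n_links


def links_used(n_flows, n_links):
    """Return the set of links hit by n_flows that differ only in source port."""
    return {
        pick_link("10.0.0.1", "10.0.0.2", 6, 40000 + i, 443, n_links)
        for i in range(n_flows)
    }


if __name__ == "__main__":
    # The same 5-tuple always maps to the same link.
    print(pick_link("10.0.0.1", "10.0.0.2", 6, 40000, 443, 4))
    # Many flows between the same hosts spread over several links.
    print(sorted(links_used(200, 4)))
```

A ping or a single test connection exercises only one link of the
bundle, so it can look clean even when another member link is bad.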

Good story!  Sorry it had to happen to you!  Thanks for sharing it so
we can all learn!

Tom
_______________________________________________
Tech mailing list
Tech@lists.lopsa.org
https://lists.lopsa.org/cgi-bin/mailman/listinfo/tech
This list provided by the League of Professional System Administrators
 http://lopsa.org/
