Hi Everyone -

This issue appears to be patched up. Please let me know immediately if
you see any more network issues.

Longer explanation - the root cause of today's issues was a
supposedly "fixed" router bug (our code version should not have been
affected). When a firewall filter rejects a packet, it sends an ICMP
rejection notice back to the sender; under heavy load, the routing
engine receives too many of these requests and "chokes" on its
backlog. That backlog caused packets destined for the routing engine
to be dropped, which broke several things at once: VRRP, BFD, and BGP
all stopped processing. For a currently unknown reason, OSPF was
unaffected.
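For those curious about the mechanism, here is a minimal sketch of the difference in Junos-style configuration (a hypothetical example, not our actual config - the filter, term, and prefix names are invented): a term that ends in "reject" makes the routing engine generate an ICMP unreachable for every matching packet, while "discard" drops silently with no routing-engine involvement.

```
/* Hypothetical Junos-style firewall filter -- all names invented.
 * "then reject" punts each matching packet to the routing engine so
 * it can generate an ICMP notice; under a flood, that work backs up
 * and starves other RE-bound traffic (VRRP, BFD, BGP hellos).
 */
firewall {
    filter edge-in {
        term block-noisily {
            from {
                source-address {
                    192.0.2.0/24;
                }
            }
            then reject;    /* RE generates an ICMP reply per packet */
        }
        term block-quietly {
            from {
                source-address {
                    198.51.100.0/24;
                }
            }
            then discard;   /* silent drop, no RE involvement */
        }
    }
}
```

In a working router the reject path is rate-limited internally; the bug we hit let the ICMP generation queue starve other routing-engine traffic.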

After correcting this, one vlan - for a reason still unknown - was not
processing packets destined for the routing engine, while the other
vlans were processing them properly.  As a result, both of our main
routers on that vlan claimed VRRP mastership - in effect, two routers
each claiming to be the default gateway for the subnet containing the
LVS servers.  Even after we disabled VRRP, the router was still not
passing traffic destined for this vlan.  Turning the vlan down and
back up, then adding and removing an arp policer (yes, turning it off
and on again), fixed the situation.  This vlan issue caused a
public-facing outage.
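For the record, the "off and on again" remediation was along these lines (a hypothetical Junos-style CLI sketch - the unit number and policer name are invented, and this is not a transcript of the actual session):

```
# Hypothetical Junos-style remediation -- unit number and policer
# name are invented for illustration.
deactivate interfaces vlan unit 100    # take the vlan interface down
commit
activate interfaces vlan unit 100      # bring it back up
commit
# then toggle an arp policer on the same unit
set interfaces vlan unit 100 family inet policer arp arp-limit
commit
delete interfaces vlan unit 100 family inet policer arp arp-limit
commit
```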

The current status is that everything is working and cr2-pmtpa is the
VRRP master for all of Tampa.  We were lucky that this bug hit
cr1-sdtpa much harder than cr2-pmtpa.  Eqiad was not affected, and
while we cannot yet say so definitively, I believe that is due to the
more powerful routing engines and more robust network design of the
eqiad datacenter and routers.  Software upgrades and configuration
changes should fix this issue in Tampa.  Another possible fix would be
hardware upgrades of the core routers, though that may be
prohibitively expensive and would require some downtime for important
machines in pmtpa.


Leslie

On Mon, Jul 2, 2012 at 3:03 PM, Ct Woo <[email protected]> wrote:
> All,
>
> The Technical Operations team noticed abnormal network packet loss
> sometime after yesterday's 'leap second' switch (midnight UTC).  While it
> does not seem to impact site availability at this moment, it is a
> concern. We are still not sure if it is even related to the 'leap second'
> switch yet.
>
> Leslie has opened a ticket with our network equipment provider and,
> together with Mark, has been working with them to pinpoint the problem
> since this morning. It is possible that they might introduce some
> latency or issues during the troubleshooting process.
>
> If you do experience anything abnormal, please let us know (email to
> [email protected] or find us at the #wikimedia-operations IRC channel).
>
> Thanks,
> CT
> _______________________________________________
> Wikitech-l mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l



-- 
Leslie Carr
Wikimedia Foundation
AS 14907, 43821
http://as14907.peeringdb.com/

