https://bugzilla.wikimedia.org/show_bug.cgi?id=67694
--- Comment #8 from [email protected] ---

TL;DR:
* analytics1003 alarms were harmless.
* oxygen alarms point to real packet loss, which affected all udp2log multicast consumers during two ~4 hour periods.
* The issue is expected to recur for the rare traffic spikes we see.
* The issue is /not/ expected to recur anytime soon for our usual day-to-day traffic.
* Let's not invest in adjusting the udp2log setup and instead move to kafka :-)

-----------------------

* analytics1003 alarms:

I could not find a consumer of udp2log data on analytics1003 other than sqstat, and puppet also says [1] that udp2log is only there for sqstat. Hence, I assume the alarms we saw around analytics1003 did not cause issues in real data pipelines.

* oxygen alarms:

While we saw the alarms on oxygen, they were really caused by a bottleneck on gadolinium (see below), which caused message loss for all udp2log multicast consumers; so for example webstatscollector (stats.grok.se, ...), the mobile-sampled-100 stream, and the zero stream [2].

The losses happened during two periods:

* 2014-07-08 19:00 -- 2014-07-08 22:00
  * eqiad: ~3% loss
  * esams: ~7% loss
  * ulsfo: ~10% loss
* 2014-07-13 19:00 -- 2014-07-13 23:00
  * eqiad: ~12% loss
  * esams: ~25% loss
  * ulsfo: ~25% loss

Those two periods map to events of the FIFA World Cup 2014 [3]. Requested pages and referers during that time also support that the increase in traffic was indeed soccer related.

-----------------------

* Bottleneck on gadolinium

For gadolinium, inbound and outbound traffic are typically close to each other in volume (the difference is typically less than 2 MB/s). But around both of the above periods:

* inbound and outbound traffic together grew up to 70 MB/s. Then
* inbound traffic grew further (max ~95 MB/s on 2014-07-13), while
* outbound traffic stayed close to 70 MB/s. After some time
* inbound traffic came down to 70 MB/s, and
* inbound and outbound traffic decreased again to their usual daily pattern.
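As an aside on where per-site loss figures like the above come from: loss can be estimated from gaps in per-sender sequence numbers in the udp2log stream. A minimal Python sketch with made-up sequence data (this is an illustration of the arithmetic, not the actual monitoring code):

```python
# Estimate packet loss from udp2log-style per-sender sequence numbers.
# Loss = (expected - received) / expected, where "expected" is the span
# of sequence numbers seen from a given sender. Data below is made up.

def loss_percent(seqs):
    """Loss estimate from a list of received sequence numbers."""
    expected = max(seqs) - min(seqs) + 1
    received = len(set(seqs))
    return 100.0 * (expected - received) / expected

# e.g. a sender emitted seqs 100..199, but packets ending in 3 were lost
received = [n for n in range(100, 200) if n % 10 != 3]
print(round(loss_percent(received), 1))  # → 10.0
```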
So the bottleneck looks a bit like either
* a limit of outbound network bandwidth, or
* a lack of resources to produce enough outbound packets.

Since the issue is some days back, it's hard to get more logs and rule either of the two out. However, gadolinium's network card should be able to bring more data to the wire, and SNMP for gadolinium does not show sending errors. On the other hand, it seems the socat process that is feeding the multicast (by far the biggest part of outbound traffic on gadolinium) is indeed a bottleneck. This process continuously uses between 70% and 95% of a core. That percentage changes over time and closely follows the amount of inbound network traffic. Extrapolating from this relation, we should expect issues somewhere around >65 MB/s inbound traffic. This extrapolation matches the above periods, as during those periods inbound traffic jumped above 70 MB/s, while we're typically below 60 MB/s on normal days.

If we do not take action, I'd expect
* any reasonable traffic spike to cause a similar udp2log outage, and
* normal traffic to not cause similar udp2log outages on a regular basis.

We still have some tiny room for growth, and that room will keep us covered for usual day-to-day traffic for some time (at least a few months if there is no considerable change in the way our traffic grows).

Of the many paths forward, only two seem viable to me. We could
* ignore the udp2log issues, as they only hit us for rare spikes, and put more effort into moving to kafka. The kafka infrastructure protects us against this kind of failure.
* ask ops to split gadolinium's incoming udp2log traffic into two parts, keep one of the two parts on gadolinium, and feed the other part to a separate socat process that produces to the same multicast address. This change would be transparent for multicast consumers and would distribute the socat load among two processes, hence removing the socat bottleneck.
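The extrapolation above amounts to a simple linear fit of socat CPU share against inbound traffic, solved for the rate at which the core saturates. A Python sketch with illustrative sample points (the 70%/95% CPU range is from the observation above; the MB/s values attached to them here are assumptions, not measurements):

```python
# Extrapolate at what inbound rate the single socat core saturates,
# assuming CPU share grows linearly with inbound MB/s (as observed).
# The sample points below are illustrative, not measured values.

def saturation_point(samples):
    """samples: [(inbound_MBps, cpu_percent), ...] -> MB/s at 100% CPU."""
    (x1, y1), (x2, y2) = samples[0], samples[-1]
    slope = (y2 - y1) / (x2 - x1)     # CPU% gained per extra MB/s inbound
    return x1 + (100.0 - y1) / slope  # inbound rate where CPU hits 100%

# e.g. 70% CPU at 45 MB/s and 95% CPU at 60 MB/s inbound
print(round(saturation_point([(45, 70), (60, 95)]), 1))  # → 63.0
```

With these assumed points the core saturates in the low-to-mid 60s of MB/s, consistent with the ">65 MB/s inbound" danger zone estimated above.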
Once the bottleneck on gadolinium is removed, consumers of this data pipeline should be able to handle the spikes, as udp2log and its grep-based filters, as well as udpfilter and the webstatscollector filter, are all using <50% of a CPU. So the downstream consumers have enough resources to handle spikes.

Given how seldom we see spikes, I'd vote for focusing on kafka.

(I am leaving closing the bug to management, as they need to decide on this.)

[1] https://git.wikimedia.org/blob/operations%2Fpuppet.git/fce2b1c036d503723fbea865273f2d8a27004546/manifests%2Fsite.pp#L114
[2] Note that the sampled-1000 stream is /not/ affected, as it is generated by a separate udp2log pipeline.
[3] On 2014-07-08, a semi-final took place. On 2014-07-13, the final took place.

--
You are receiving this mail because:
You are on the CC list for the bug.
_______________________________________________
Wikibugs-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikibugs-l
