https://bugzilla.wikimedia.org/show_bug.cgi?id=67694
--- Comment #8 from [email protected] ---

TL;DR:
* analytics1003 alarms were harmless.
* oxygen alarms point to real packet loss, which affected all udp2log multicast consumers during two ~4 hour periods.
* The issue is expected to recur for the rare traffic spikes we see.
* The issue is /not/ expected to recur anytime soon for our usual day-to-day traffic.
* Let's not invest in adjusting the udp2log setup and instead move to kafka :-)

-----------------------

* analytics1003 alarms:

I could not find a consumer of udp2log data on analytics1003 other than sqstat, and puppet also says [1] that udp2log is only there for sqstat. Hence, I assume the alarms we saw around analytics1003 did not cause issues in real data pipelines.

* oxygen alarms:

While we saw the alarms on oxygen, they were really caused by a bottleneck on gadolinium (see below), which caused message loss for all udp2log multicast consumers; so for example webstatscollector (stats.grok.se, ...), the mobile-sampled-100 stream, and the zero stream [2].

The losses happened during two periods:

* 2014-07-08 19:00 -- 2014-07-08 22:00
  * eqiad: ~3% loss
  * esams: ~7% loss
  * ulsfo: ~10% loss
* 2014-07-13 19:00 -- 2014-07-13 23:00
  * eqiad: ~12% loss
  * esams: ~25% loss
  * ulsfo: ~25% loss

Those two periods map to events of the FIFA World Cup 2014 [3]. Requested pages and referers during that time also support that the increase in traffic was indeed soccer related.

-----------------------

* Bottleneck on gadolinium

For gadolinium, inbound and outbound traffic are typically close to each other in volume (the difference is typically less than 2 MB/s). But around both of the above periods:

* inbound and outbound traffic together grew up to 70 MB/s. Then
* inbound traffic grew further (max ~95 MB/s on 2014-07-13), while
* outbound traffic stayed close to 70 MB/s. After some time
* inbound traffic came down to 70 MB/s, and
* inbound and outbound traffic decreased again to their usual daily pattern.
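As an aside on where per-site loss figures like the above come from: loss can be estimated from gaps in per-sender sequence numbers in the udp2log stream. A minimal Python sketch with made-up sequence data (this is an illustration of the arithmetic, not the actual monitoring code):

```python
# Estimate packet loss from udp2log-style per-sender sequence numbers.
# Loss = (expected - received) / expected, where "expected" is the span
# of sequence numbers seen from a given sender. Data below is made up.

def loss_percent(seqs):
    """Loss estimate from a list of received sequence numbers."""
    expected = max(seqs) - min(seqs) + 1
    received = len(set(seqs))
    return 100.0 * (expected - received) / expected

# e.g. a sender emitted seqs 100..199, but packets ending in 3 were lost
received = [n for n in range(100, 200) if n % 10 != 3]
print(round(loss_percent(received), 1))  # → 10.0
```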
So the bottleneck looks a bit like either
* a limit of outbound network bandwidth, or
* a lack of resources to produce enough outbound packets.

Since the issue is some days back, it's hard to get more logs and rule either of the two out. However, gadolinium's network card should be able to bring more data to the wire, and SNMP for gadolinium does not show sending errors. On the other hand, it seems the socat process that is feeding the multicast (by far the biggest part of outbound traffic on gadolinium) is indeed a bottleneck. This process continuously uses between 70% and 95% of a core. That percentage changes over time and closely follows the amount of inbound network traffic. Extrapolating from this relation, we should expect issues somewhere around >65 MB/s inbound traffic. This extrapolation matches the above periods, as during those periods inbound traffic jumped above 70 MB/s, while we're typically below 60 MB/s on normal days.

If we do not take action, I'd expect
* any reasonable traffic spike to cause a similar udp2log outage, and
* normal traffic to not cause similar udp2log outages on a regular basis.

We still have some tiny room for growth, and that room will keep us covered for usual day-to-day traffic for some time (at least a few months if there is no considerable change in the way our traffic grows).

Of the many paths forward, only two seem viable to me. We could
* ignore the udp2log issues, as they only hit us for rare spikes, and put more effort into moving to kafka. The kafka infrastructure protects us against this kind of failure.
* ask ops to split gadolinium's incoming udp2log traffic into two parts, keep one of the two parts on gadolinium, and feed the other part to a separate socat process that produces to the same multicast address. This change would be transparent for multicast consumers and would distribute the socat load among two processes, hence removing the socat bottleneck.
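The extrapolation above amounts to a simple linear fit of socat CPU share against inbound traffic, solved for the rate at which the core saturates. A Python sketch with illustrative sample points (the 70%/95% CPU range is from the observation above; the MB/s values attached to them here are assumptions, not measurements):

```python
# Extrapolate at what inbound rate the single socat core saturates,
# assuming CPU share grows linearly with inbound MB/s (as observed).
# The sample points below are illustrative, not measured values.

def saturation_point(samples):
    """samples: [(inbound_MBps, cpu_percent), ...] -> MB/s at 100% CPU."""
    (x1, y1), (x2, y2) = samples[0], samples[-1]
    slope = (y2 - y1) / (x2 - x1)     # CPU% gained per extra MB/s inbound
    return x1 + (100.0 - y1) / slope  # inbound rate where CPU hits 100%

# e.g. 70% CPU at 45 MB/s and 95% CPU at 60 MB/s inbound
print(round(saturation_point([(45, 70), (60, 95)]), 1))  # → 63.0
```

With these assumed points the core saturates in the low-to-mid 60s of MB/s, consistent with the ">65 MB/s inbound" danger zone estimated above.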
Once the bottleneck on gadolinium is removed, consumers of this data pipeline should be able to handle the spikes, as udp2log and its grep-based filters, as well as udpfilter and the webstatscollector filter, are all using <50% of a CPU. So the downstream consumers have enough resources to handle spikes.

Given how seldom we see spikes, I'd vote for focusing on kafka.

(I am leaving closing the bug to management, as they need to decide on this.)

[1] https://git.wikimedia.org/blob/operations%2Fpuppet.git/fce2b1c036d503723fbea865273f2d8a27004546/manifests%2Fsite.pp#L114
[2] Note that the sampled-1000 stream is /not/ affected, as it is generated by a separate udp2log pipeline.
[3] On 2014-07-08, a semi-final took place. On 2014-07-13, the final took place.

--
You are receiving this mail because:
You are on the CC list for the bug.
_______________________________________________
Wikibugs-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikibugs-l
