https://bugzilla.wikimedia.org/show_bug.cgi?id=67694

--- Comment #7 from Dan Andreescu ---
Just writing some notes for myself on how to troubleshoot this.  For now, I'm
putting it on hold until I get two more things:

1. Kafka broker data is available to query again (right now we have one week of
rolling data on the brokers, but they're not writing to HDFS because the
cluster is being reinstalled with CDH5).
2. data from tonight's game to see how it compares

So, Christian was nice enough to walk me through puppet and I will put this all
on a wiki when I'm done, but for now:

Oxygen and what it does are defined in site.pp (which is where you can go to
start searching for most host names; beware that some are matched by regular
expression, analytics10.. for example).  Note that comments are NOT
trustworthy; the puppet code is:

http://git.wikimedia.org/blob/operations%2Fpuppet.git/9b9d473fa4bf02a63e35c340ec971907aab1da1d/manifests%2Fsite.pp#L2181

The roles it uses are to be found in manifests/role/, and in this case
logging.pp:

http://git.wikimedia.org/blob/operations%2Fpuppet.git/9b9d473fa4bf02a63e35c340ec971907aab1da1d/manifests%2Frole%2Flogging.pp#L272

Notice that this role refers to other roles in the same file.  In general,
track down all the roles and parent classes that your instance's definition
mentions.  Oxygen also uses misc::udp2log:

http://git.wikimedia.org/blob/operations%2Fpuppet.git/9b9d473fa4bf02a63e35c340ec971907aab1da1d/manifests%2Fmisc%2Fudp2log.pp

We notice the filter template is referenced here:

http://git.wikimedia.org/blob/operations%2Fpuppet.git/9b9d473fa4bf02a63e35c340ec971907aab1da1d/manifests%2Fmisc%2Fudp2log.pp#L116

And found here:

http://git.wikimedia.org/blob/operations%2Fpuppet.git/9b9d473fa4bf02a63e35c340ec971907aab1da1d/templates%2Fudp2log%2Ffilters.oxygen.erb#L46

And that line says the udp2log output is being piped to gadolinium.  So, in
conclusion, if Oxygen lost UDP packets, gadolinium would not see them, which
would affect webstatscollector.  One way to check this will be to compare
pageviews from the sampled log and the output of webstatscollector, something I
will do soon.
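
A rough sketch of that comparison in Python, for when the data is available.
Everything here is an assumption for illustration: the function name, the
sampling factor of 1000, and the idea that both inputs have already been
reduced to a single pageview total for the same time window using the same
pageview definition.

```python
def compare_counts(sampled_lines, sampling_factor, collector_total):
    """Scale a sampled count up and compare it to a full count.

    sampled_lines:   pageview lines counted in the sampled log (hypothetical input)
    sampling_factor: 1-in-N sampling rate of that log (assumed, e.g. 1000)
    collector_total: pageviews reported by webstatscollector for the same window

    Returns (estimated_total, relative_difference), where the relative
    difference is measured against collector_total.
    """
    if collector_total == 0:
        raise ValueError("collector_total must be non-zero")
    estimated_total = sampled_lines * sampling_factor
    relative_difference = (estimated_total - collector_total) / collector_total
    return estimated_total, relative_difference


if __name__ == "__main__":
    # Made-up numbers, only to show the call.
    estimate, diff = compare_counts(1200, 1000, 1000000)
    print(estimate, diff)
```

A small difference is expected from sampling noise alone, so only a gap that
is large and lines up with the incident window would point at lost packets.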

In general, other theories can be informed by the Server Admin Log:
https://wikitech.wikimedia.org/wiki/Server_Admin_Log which for this time period
says that a couple of varnish hosts were restarted and an apache was restarted.
It's possible that the udp packet loss script just got confused by those
restarts: it saw sequence numbers it wasn't expecting and so assumed loss.
This can be tested by looking at the sequence numbers in the sampled logs.
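
A minimal sketch of that check in Python.  It assumes each log line is
whitespace-separated with the sending host name in the first field and that
host's sequence number in the second; the function name and the host names in
the example are made up.  Because the log is sampled, gaps between
consecutive sequence numbers are normal, so the sketch only looks for a
host's sequence number going backwards, which is what a restart would look
like.

```python
from collections import defaultdict


def find_sequence_resets(lines):
    """Return {host: [(previous_seq, new_seq), ...]} for each backwards jump.

    Assumes field 0 is the host name and field 1 is its sequence number.
    """
    last_seen = {}
    resets = defaultdict(list)
    for line in lines:
        fields = line.split()
        if len(fields) < 2 or not fields[1].isdigit():
            continue  # skip lines that don't match the assumed layout
        host, seq = fields[0], int(fields[1])
        previous = last_seen.get(host)
        if previous is not None and seq < previous:
            resets[host].append((previous, seq))
        last_seen[host] = seq
    return dict(resets)


if __name__ == "__main__":
    # Hypothetical sampled lines: host, sequence number, rest of the line.
    sample = [
        "cp-a 100 ...",
        "cp-b 50 ...",
        "cp-a 1100 ...",
        "cp-a 7 ...",      # cp-a's counter went backwards: possible restart
        "cp-b 1050 ...",
    ]
    print(find_sequence_resets(sample))
```

UDP can reorder packets, so a small backwards step is not necessarily a
restart; a drop to a value near zero at about the time of a restart in the
Server Admin Log would be the interesting case.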
