https://bugzilla.wikimedia.org/show_bug.cgi?id=67694
--- Comment #7 from Dan Andreescu <[email protected]> ---
Just writing some notes for myself on how to troubleshoot this. For now, I'm putting it on hold until I get two more things:

1. Kafka broker data is available to query again (right now we have 1 week of rolling data on the brokers, but they're not writing to HDFS because the cluster is being reinstalled with CDH5).
2. Data from tonight's game, to see how it compares.

So, Christian was nice enough to walk me through puppet, and I will put this all on a wiki when I'm done, but for now:

Oxygen and what it does are defined in site.pp (which is where you can start searching for most host names; beware that some are matched by regular expression, analytics10.. for example). Note: comments are NOT trustworthy, puppet code is:

http://git.wikimedia.org/blob/operations%2Fpuppet.git/9b9d473fa4bf02a63e35c340ec971907aab1da1d/manifests%2Fsite.pp#L2181

The roles it uses are found in manifests/role/, in this case logging.pp:

http://git.wikimedia.org/blob/operations%2Fpuppet.git/9b9d473fa4bf02a63e35c340ec971907aab1da1d/manifests%2Frole%2Flogging.pp#L272

Notice that this role refers to other roles in the same file. In general, track down all the roles and parent classes that your instance's definition mentions. Oxygen also uses misc::udp2log:

http://git.wikimedia.org/blob/operations%2Fpuppet.git/9b9d473fa4bf02a63e35c340ec971907aab1da1d/manifests%2Fmisc%2Fudp2log.pp

The filter template is referenced here:

http://git.wikimedia.org/blob/operations%2Fpuppet.git/9b9d473fa4bf02a63e35c340ec971907aab1da1d/manifests%2Fmisc%2Fudp2log.pp#L116

And found here:

http://git.wikimedia.org/blob/operations%2Fpuppet.git/9b9d473fa4bf02a63e35c340ec971907aab1da1d/templates%2Fudp2log%2Ffilters.oxygen.erb#L46

That line says the udp2log output is being piped to gadolinium. So, in conclusion, if Oxygen lost UDP packets, gadolinium would not see them, which would affect webstatscollector.
One way to check this is to compare pageviews from the sampled log with the output of webstatscollector, something I will do soon. More generally, other theories can be informed by the Server Admin Log:

https://wikitech.wikimedia.org/wiki/Server_Admin_Log

For this time period it says that a couple of varnish hosts were restarted and an apache was restarted. It's possible that the UDP packet loss script just got confused by those restarts: it saw sequence numbers it wasn't expecting and so assumed loss. This can be tested by looking at the sequence numbers in the sampled logs.
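
A minimal sketch of that sequence number check, not the actual packet loss script. It assumes each sampled log line is whitespace-separated, with the sending host in the first field and that host's sequence number in the second; the field positions, the function name scan_sequence_numbers and the host names in the usage note are all assumptions made for illustration. The idea is that a restarted sender starts counting again from a low number, so a sequence number that goes backwards for a host points to a restart rather than to lost packets.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple


class SeqReport(NamedTuple):
    resets: int    # times the host's sequence number went backwards
    observed: int  # sampled lines seen for the host


def scan_sequence_numbers(lines: Iterable[str]) -> Dict[str, SeqReport]:
    """Count per-host sequence number resets in sampled log lines.

    Assumed line layout (hypothetical): "<host> <sequence> <rest of line>".
    Lines that do not fit this layout are skipped.
    """
    last_seq: Dict[str, int] = {}
    resets: Dict[str, int] = defaultdict(int)
    observed: Dict[str, int] = defaultdict(int)

    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue
        host, raw_seq = fields[0], fields[1]
        try:
            seq = int(raw_seq)
        except ValueError:
            continue

        observed[host] += 1
        # In a sampled log, forward gaps are expected, so only a backwards
        # step is counted here as a sign of a restart.
        if host in last_seq and seq < last_seq[host]:
            resets[host] += 1
        last_seq[host] = seq

    return {host: SeqReport(resets[host], observed[host]) for host in observed}
```

For example, with made-up lines such as "cp-a 1000 ...", "cp-a 2000 ...", "cp-a 15 ...", the host cp-a would be reported with one reset out of three observed lines. UDP packets can also arrive slightly out of order, so a small backwards step is weaker evidence of a restart than a drop to a value near zero; hosts with resets would still need to be matched against the restart times in the Server Admin Log.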
