Yesterday morning I received an alert that the gadolinium udp2log process was 
experiencing packet loss.  In addition to being the webstats-collector host 
(which generates the pagecounts files), gadolinium is a socat relay.  It is 
responsible for feeding all of the webrequest log traffic to about 5 udp2log 
instances.

Upon investigating the packet loss issue on gadolinium, I noticed that the 
socat relay process itself was dropping packets whenever the udp2log process 
was also up.  I believe this is because when both socat and udp2log are 
running, the NIC must process twice as much data as when only one is running.  
I went into emergency mode to move as many of the udp2log filters as possible 
to other existing udp2log boxes.  Opsen and I set up a new box (erbium) so that 
we could still have a box on which to run some of the gadolinium udp2log 
filters (including the webstatscollector one).

Fundraising gets their webrequest data from gadolinium, so I spent much of 
the day working with them.  It turned out that this wasn't so much of an 
emergency for them, since they had a scheduled downtime during this time anyway.

Erbium was almost fully ready yesterday evening.  When I was about to finish 
setting up erbium, other opsen had started restructuring the production 
puppetmaster setup, which caused puppet to stop working for a short period.  I 
was pressed for time to finish this, but couldn't until the puppetmaster was 
back up.  I had urgent personal business to take care of (had to put an 
application in on an apartment before someone else did), so I ran out for the 
evening, leaving things in this state.  I was thinking mostly of Fundraising, 
and they didn't seem worried, so I forgot that webstatscollector was an issue 
too.

Erbium is online as of a few minutes ago and the webstatscollector processes 
should be trucking along, so pageview data should be fine starting now.  The 
webstatscollector processes are not currently monitored.  I plan to add process 
monitoring for both of these, as well as UDP dropped packet statistics for both 
the socat relay process and the webstatscollector process.
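
For the dropped packet statistics, one possible starting point is the 
per-socket "drops" column that Linux exposes in /proc/net/udp.  The snippet 
below is only a rough sketch under that assumption, not what is deployed: the 
function name and the sample port are hypothetical, and it assumes a kernel 
whose /proc/net/udp has the drops counter as the last column.

```python
def parse_udp_drops(proc_net_udp_text):
    """Return {local_port: drops} from text in /proc/net/udp format.

    Assumes the kernel prints the per-socket drop counter as the last
    column; drops are summed when several sockets share a local port.
    """
    drops_by_port = {}
    lines = proc_net_udp_text.strip().splitlines()
    for line in lines[1:]:  # first line is the column header
        fields = line.split()
        if len(fields) < 13:
            continue  # not a socket row in the expected layout
        # local_address looks like "00000000:20E4" (hex IP, hex port)
        local_port = int(fields[1].rsplit(":", 1)[1], 16)
        drops = int(fields[-1])
        drops_by_port[local_port] = drops_by_port.get(local_port, 0) + drops
    return drops_by_port


# Made-up sample rows for illustration; the ports are placeholders.
SAMPLE = """\
  sl  local_address rem_address   st tx_queue rx_queue tr tm->when retrnsmt   uid  timeout inode ref pointer drops
  10: 00000000:20E4 00000000:0000 07 00000000:00000000 00:00000000 00000000     0        0 11111 2 0000000000000000 42
  11: 00000000:0035 00000000:0000 07 00000000:00000000 00:00000000 00000000     0        0 22222 2 0000000000000000 0
"""

if __name__ == "__main__":
    # On a live Linux host, read the real file instead of SAMPLE:
    #   with open("/proc/net/udp") as f:
    #       print(parse_udp_drops(f.read()))
    print(parse_udp_drops(SAMPLE))
```

A monitoring check could sample this periodically and alert when the counter 
for the relay or collector port increases between samples.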

On Jul 24, 2013, at 1:37 AM, Jeremy Baron <[email protected]> wrote:

> On Jul 24, 2013 12:43 AM, "Ikuya Yamada" <[email protected]> wrote:
> > It seems that the page view statistics data does not contain the
> > actual data for the last few hours.
> >
> > http://dumps.wikimedia.org/other/pagecounts-raw/2013/2013-07/
> >
> > Are there any failures on the server-side?
> 
> Just looking at file sizes I can see 15, 16, and 20-05(the current hour) UTC 
> all look smaller than normal. (yes, something's broken)
> 
> -Jeremy
> 
> _______________________________________________
> Analytics mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/analytics

_______________________________________________
Wikitech-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
