Hi everyone,

It's been a while since I sent out an operations and engineering update 
letting you know what we've been doing, so here we go.

Over the last year we've made the site more reliable and faster... but 
as the site has been growing we also had to start making architectural 
changes earlier this year and bring in new equipment. That only got us 
so far, and adding more equipment on the back-end wasn't really speeding 
things up. So for the last several months we've been focused on 
improving the site layer by layer. In order: the database, memcache, the 
image server, the apaches, and the cache machines.

We ordered a bigger and faster database machine and got that going at 
the beginning of July (in addition to the existing db)... it has roughly 
10x the capacity of our normal daily load. Adding this machine meant 
that the individual data spikes (i.e., not sustained) which happen 
randomly could be absorbed without the db becoming a bottleneck. We 
didn't notice an overall speed-up from this box so much as the absence 
of a slowdown when we'd get those spikes. For memcache,
we've been analyzing the requests from the apaches and optimizing both 
the way we were handling requests and the queries being used. For the 
image server we tried httpd servers other than apache, such as lighttpd, 
which resulted in more capacity and therefore more speed (many small 
page elements that are common to all wikis are served from this machine).
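The memcache work above is essentially the classic cache-aside pattern: 
check the cache first, and only fall through to the database on a miss. 
Here's a minimal sketch in Python (the key scheme and helper names are 
made up for illustration, and a plain dict stands in for a real memcached 
client so the example runs anywhere):

```python
# Cache-aside sketch. A real deployment would use a memcached client
# library; a dict stands in here so the example is self-contained.
cache = {}
db_queries = 0  # count how often we fall through to the database

def query_db(article_id):
    """Stand-in for the expensive database query."""
    global db_queries
    db_queries += 1
    return "article body for %s" % article_id

def get_article(article_id):
    key = "article:%s" % article_id  # key scheme is illustrative only
    body = cache.get(key)
    if body is None:            # cache miss: hit the db, then populate
        body = query_db(article_id)
        cache[key] = body
    return body                 # cache hit: db never touched

get_article(42)    # miss -> one db query
get_article(42)    # hit  -> no extra db query
print(db_queries)  # prints: 1
```

The win is that repeated reads of the same hot object cost one db query 
instead of many, which is why tuning which requests and queries go 
through this path pays off.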

Last, but certainly not least, the last month has been spent looking at 
the cache servers. Our squid hit ratio has been about 60-65% (meaning 
they can give you a page element without having to go back to the 
apaches and db and ask for it, which takes longer) and as the site grew 
those machines were getting CPU-bound, so something obviously needed
adjusting. We added a pretty hefty new cache server at the beginning of 
September and it was pretty much immediately loaded up... so just 
throwing machines at the problem wasn't the answer to this one. Instead, 
we made several significant improvements here. Emil (head of the Polish 
office) has been working on a patch to the squid code that lets us 
ignore unique IDs on common elements so the squids can operate more 
efficiently... that code is already in the Wikimedia SVN... 
and we let Jack & Travis at WikiHow know since they share a lot of stuff 
with us as well. Artur (more about Artur in a minute) has been tweaking 
the configuration on the squids in some fashion every day for the last 3 
weeks. The result is that the squid CPU usage went from 60% to 6% while 
the hit ratio went from 65% to about 80% ... Artur's hoping to get that 
up to 90% shortly.
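To see why the hit-ratio jump matters: every request the squids can't 
serve from cache falls through to the apaches and db. A quick 
back-of-envelope in Python (the request rate is invented purely for 
illustration):

```python
def backend_load(requests_per_sec, hit_ratio):
    """Requests per second that fall through to the apaches/db."""
    return requests_per_sec * (1.0 - hit_ratio)

# Hypothetical 1000 req/s of cacheable traffic.
before = backend_load(1000, 0.65)  # back-end load at a 65% hit ratio
after = backend_load(1000, 0.80)   # back-end load at an 80% hit ratio
print(round(before), round(after))  # prints: 350 200
```

So going from 65% to 80% cuts back-end traffic by roughly 43%, and 
getting to 90% would cut it by more than two-thirds compared to where we 
started.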

After all of this work, just in the last week we are starting to see 
some definite improvements. You should be seeing pages serve more 
quickly overall (aside from the occasional ddos attack, a machine going 
off-line for a hardware fault, or 3rd party network problems). The next 
steps for us are to add more equipment to certain layers and put in more 
redundancy. We have two additional databases waiting to come on-line... 
and various other equipment at other layers.

In other news... we have a new member of the technical team in the San 
Mateo office: Artur Bergman (found here: 
http://radar.oreilly.com/artur/). I hired Artur to spearhead operations 
and engineering for Wikia and the first thing he started focusing on was 
the stack as described above. You should be hearing more directly from 
Artur in future emails.

If you have any questions, please feel free to ask. :)

Thanks,
John Q.


_______________________________________________
Wikia-l mailing list
[email protected]
http://lists.wikia.com/mailman/listinfo/wikia-l
