Hi everyone,

It's been a while since I sent out an operations and engineering update letting you know what we've been doing, so here we go.
Over the last year we've made the site more reliable and faster, but as the site has grown we also had to start making architectural changes earlier this year and bring in new equipment. That only got us so far, and adding more equipment on the back-end wasn't really speeding things up. So for the last several months we've been focused on improving the site in layers. In order: the database, memcache, the image server, the apaches, and the cache machines.

We ordered a bigger and faster database machine and got it going at the beginning of July (in addition to the existing db)... it has roughly 10x the capacity of our normal daily load. Adding this machine meant that the individual data spikes (i.e., not sustained) which happen randomly could be absorbed and the db wouldn't be a bottleneck. We didn't notice an overall speed-up from this box so much as the absence of a slowdown when those spikes hit.

For memcache, we've been analyzing the requests coming from the apaches and optimizing both the way we handle those requests and the queries being used (there's a rough sketch of the general pattern in the P.S. below).

For the image server, we tried httpd servers other than apache, such as lighttpd, which gave us more capacity and therefore more speed (many small page elements that are common to all wikis are served from this machine).

Last, but certainly not least, the past month has been spent looking at the cache servers. Our squid hit ratio had been about 60%-65% (meaning roughly two out of three requests can be answered without going back to the apaches & db, which takes longer), and as the site grew those machines were getting CPU bound, so something obviously needed adjusting. We added a pretty hefty new cache server at the beginning of September and it was pretty much immediately loaded up... so just throwing machines at the problem wasn't the answer to this one. Instead, we made several significant improvements here. Emil (head of the Polish office) has been working on a patch to the squid code that allows us to ignore unique ids on common elements, so identical elements share a single cache entry and the squids operate more efficiently (the P.S. has an illustration of the idea)... that code is already in the Wikimedia SVN... and we let Jack & Travis at WikiHow know, since they share a lot of stuff with us as well. Artur (more about Artur in a minute) has been tweaking the configuration on the squids in some fashion every day for the last 3 weeks. The result is that squid CPU usage went from 60% to 6% while the hit ratio went from 65% to about 80%, and Artur's hoping to get that up to 90% shortly (the P.S. also has the back-of-the-envelope math on why that matters).

After all of this work, just in the last week we're starting to see some definite improvements. You should be seeing pages served more quickly overall (aside from the occasional ddos attack, a machine going off-line with a hardware fault, or 3rd-party network problems). The next steps for us are to add more equipment to certain layers and put in more redundancy. We have an additional 2 databases waiting to come on-line... and various other equipment at other layers.

In other news... we have a new member of the technical team in the San Mateo office: Artur Bergman (found here: http://radar.oreilly.com/artur/). I hired Artur to spearhead operations and engineering for Wikia, and the first thing he started focusing on was the stack described above. You should be hearing more directly from Artur in future emails.

If you have any questions, please feel free to ask. :)

Thanks,
John Q.
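P.S. A few rough sketches for the technically curious. First, the memcache layer: the basic pattern there is a read-through cache, where the apaches ask memcache first and only fall back to the database on a miss. Here's a minimal sketch in Python using the python-memcached client... our production code is MediaWiki/PHP, not Python, and the key name and fetch_from_db function here are made up, but the shape is the same:

    import memcache

    mc = memcache.Client(['127.0.0.1:11211'])

    def get_article_count(wiki_id, fetch_from_db):
        # Ask the cache first; only hit the db on a miss.
        key = 'wiki:%s:article_count' % wiki_id
        value = mc.get(key)
        if value is None:
            value = fetch_from_db(wiki_id)   # the expensive db query
            mc.set(key, value, time=300)     # cache it for 5 minutes
        return value

The win is that most requests never touch the db at all, which is why tuning what gets asked of memcache (and how) matters as much as the db itself.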
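Second, Emil's squid patch. The real change is in Squid's C code, but the idea is easy to show in a few lines: strip the per-request unique id from the query string before using the URL as a cache key, so identical common elements all map to one cache entry instead of thousands. A sketch of the idea (the hostname and the 'uniq' parameter name are made up for illustration):

    from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

    def cache_key(url, ignored_params=('uniq',)):
        # Drop per-request unique ids so identical elements share one key.
        scheme, host, path, query, _ = urlsplit(url)
        kept = [(k, v) for k, v in parse_qsl(query) if k not in ignored_params]
        return urlunsplit((scheme, host, path, urlencode(kept), ''))

    # Both of these now map to the same cache entry:
    #   cache_key('http://images.example.com/common/main.css?uniq=12345')
    #   cache_key('http://images.example.com/common/main.css?uniq=67890')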
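Finally, why the hit ratio numbers matter more than they might look: the load on the apaches and db is driven by the miss rate, which is (1 - hit ratio). A quick back-of-the-envelope using the numbers above:

    # Fraction of requests that fall through the squids to the backend.
    for hit_ratio in (0.65, 0.80, 0.90):
        misses = 1.0 - hit_ratio
        print('hit ratio %.0f%% -> %.0f%% of requests reach the backend'
              % (hit_ratio * 100, misses * 100))

That prints 35%, 20%, and 10%... so going from 65% to 80% cut the traffic reaching the backend by roughly 43%, and getting to 90% would cut it by about 71% relative to where we started.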
