AO3 performance issues
Published: 2012-06-01 02:38:41 -0400
As pretty much all of our users have no doubt noticed, we've been experiencing some problems with Archive loads: slowdowns and the appearance of the dreaded 502 page have become a regular occurrence. We're working on addressing these issues, but it's taking longer than we'd like, so we wanted to update you on what's going on.
Why the slowdowns?
Mostly because there's so much demand! The number of people reading and posting now is overwhelming - we're glad so many people want to be here, but sorry that the rapid expansion of the site is making it less functional than it should be.
We now get over a million and a half pageviews on an average day, often clustered at peak times in the evening (particularly when folks in the Western Hemisphere are home from work and school) - we were using a self-hosted analytics system to monitor site traffic, and we had to disable it because it was too overloaded to keep up. The traffic places high demands on our servers, and you see the 502 errors when the systems are getting more requests than they can handle. Ultimately we'll need to buy more servers to cope with rising demand, but there's ongoing work that we've done and need to continue to do to make our code more efficient. We've been working on long-term plans to improve our work and bookmark searching and browsing, since those are the pages that get the most traffic; right now, they present some challenges because they were designed and built when the site was much smaller. We've learned a lot about scaling over the years, but rewriting different areas of the code takes some time!
What are you doing to fix it?
Our Systems team are making some adjustments to our server setup and databases. Their first action was to increase the amount of tmp space for our MySQL database on the server - this has alleviated some of the worst problems, but doesn't really get at the underlying issues. They're continuing to investigate to see if there are additional adjustments we can make to the servers to help with the problems.
We're also actively working on the searching and browsing code: that's been a big project, and it will hopefully make a significant impact. Because it affects a lot of crucial areas of the site, we want to make sure we get everything right and do as much testing as we can to ensure that performance is where it needs to be before we release it. We're switching from the Sphinx search engine to elasticsearch, which can index new records more rapidly, allowing us to use that for filtering. That will offer us more flexibility, get rid of some of our slower SQL queries, and take some pressure off our main database, and it also has some nice sharding/scaling capabilities built in.
We also try to cache as much data as we can, and that's something we're always looking to improve on. Systems and AD&T have discussed different options there, and we'll be continuing to work on small improvements and see what larger ones we may be able to incorporate.
When will it be fixed?
It's going to take us a few weeks to get through all the changes that we need to make. Our next code deploy will probably be within the next week - that will include email bundling of subscription and kudos notifications, so that we can scale our sending of emails better as well. After that, we'll be able to dedicate our resources to testing the search and browsing changes, and we're hoping to have that out to everyone by the end of June. We rely on volunteer time for coding and testing, so we need to schedule our work for evenings and weekends for the most part, but we're highly motivated to resolve the current problems, and we'll do our best to get the changes out to you as soon as we can.
Improving the Archive is an ongoing task, and after we’ve made the changes to search and browse we’ll be continuing to work on other areas of the site to enable better scalability. We’re currently investigating the best options for developing the site going forward, including the possibility of paying for some training and/or expert advice to cover areas our existing volunteers don’t have much experience with. (If you have experience in these areas and time to work closely with our teams, we’d also welcome more volunteers!)
Thanks for your patience!
We know it's really annoying and frustrating when the site isn't working properly. We are working hard to fix it! We really appreciate the support of all our users. ♥