Site performance issues (AO3, why the sad face?)
Published: 2011-09-29 03:34:38 -0400
As most people are sadly only too aware, the Archive of Our Own has been experiencing some performance issues recently. The sad 502 page is showing up increasingly frequently, to the frustration of everyone concerned! We have also had a couple of instances of downtime. We are working on the problem, but we know that our users are wondering what is happening, so we want to give a bit more information on what is going on behind the scenes.
Why is this happening?!
The main reason for 502 errors is the sheer load on the site. We've seen a massive increase in user numbers over the past few months - on Sunday 25th September we had 105,000 visits, of which 46,000 were unique visitors. These visitors racked up an impressive 575,000 page views! This is now a pretty average day for us - we're thrilled that so many people are enjoying and using the site, but we've expanded a bit more rapidly than we expected, so it's a little bit challenging for our servers.
Our recent server outages were caused by a problem with the servers themselves; our Systems team are tracking this to its source and making some changes to fix it.
Why do I sometimes get a 502 as soon as I click on a page?
Since 502 errors are associated with site slowness, it can be a bit unexpected to get a 502 as soon as you click. The reason this happens is that our servers are set up to keep an eye on how many requests are in the queue for them to handle. If there are more than a certain number, then the chances that the page will time out before it can be delivered are high. So, the server gives the 502 page right away instead so that you don't wait a long time only to be disappointed.
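For the technically curious, the "fail fast" behaviour described above can be sketched in a few lines of Python. This is an illustration only, not the Archive's actual server configuration: the `Frontend` class, the threshold of 3, and the 202 "queued" status are all assumptions made for the example.

```python
from collections import deque

# Hypothetical threshold; the real limit used on the site is not stated.
MAX_QUEUE_DEPTH = 3


class Frontend:
    """Toy front-end server that refuses new work when its backlog is full."""

    def __init__(self, max_queue_depth=MAX_QUEUE_DEPTH):
        self.max_queue_depth = max_queue_depth
        self.queue = deque()

    def accept(self, request_path):
        # If the backlog has already reached the limit, answer with a 502
        # straight away instead of letting the request wait and time out.
        if len(self.queue) >= self.max_queue_depth:
            return 502
        self.queue.append(request_path)
        return 202  # illustrative "queued" status, not what a real page returns

    def process_one(self):
        # Serve the oldest waiting request, freeing a slot in the queue.
        if self.queue:
            return self.queue.popleft()
        return None
```

With a limit of 2, the third request in a row gets an immediate 502, and a new request is accepted again as soon as one of the waiting requests has been served.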
Didn't you just buy new servers recently?
Yes, pretty recently! Thanks to the generosity of fans who donated to our parent Organization for Transformative Works, we were able to purchase 5 new servers at the beginning of the year. These are actually doing a great job (if we were still on our original two servers, the site would have keeled over completely by now). However, the speed of our expansion means that we will need to purchase more servers sooner rather than later. When we do this, we'll also have to make some big changes to the underlying infrastructure - one option is to shard the database, which means we'd split it up into separate chunks so each server only has to deal with a bit of it (if you're interested in the problem of scalability and how sites deal with lots of users, this post on how LiveJournal handled it is a good read). We're actively researching now to figure out the best way of doing this and the kind of hardware we'll need.
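To give a rough idea of what sharding means in practice, here is a minimal Python sketch. It assumes records are split by a numeric user ID using a simple modulo rule; the function names and the choice of four shards are hypothetical, and a real sharding scheme would be decided by the research mentioned above.

```python
# Hypothetical number of database chunks; chosen only for the example.
NUM_SHARDS = 4


def shard_for(user_id, num_shards=NUM_SHARDS):
    """Pick which chunk of the database holds this user's data."""
    # The same user ID always maps to the same shard number.
    return user_id % num_shards


def group_by_shard(user_ids, num_shards=NUM_SHARDS):
    """Show how a list of users would be spread across the shards."""
    shards = {i: [] for i in range(num_shards)}
    for user_id in user_ids:
        shards[shard_for(user_id, num_shards)].append(user_id)
    return shards
```

Each server then only needs to store and search the users in its own shard. The trade-off is that a modulo rule like this one moves most users to a different shard whenever the number of shards changes, which is one reason the choice of scheme needs careful thought.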
What are you doing to fix the problem?
One reason the high loads on the site are having such a drastic effect is that some of our code is really optimised for fewer users (since we weren't expecting to have so many this soon!). So we're currently making a number of changes in the code to make it more efficient and reduce the number of database reads/writes (these are the things that tend to cause slowdowns), and updating the application software to a more efficient one. Our Systems team are also investigating updating the server operating system, which might be necessary to take care of the issue which caused the server outages (it's hard to reproduce the conditions on the real site in our test site). The current set of new code has required intensive testing, which means we haven't been able to roll out the changes as quickly as we would like, but we hope to make these updates in early October. Longer term, we'll be doing some sustained work on scalability so that we can restructure things and buy more servers.
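One common way to cut down on database reads is to remember the answer to a query and reuse it. The Python sketch below shows that idea with a simple read-through cache. It is an assumption for illustration, not a description of the specific code changes being made to the Archive, and the class names are made up for the example.

```python
class CountingDatabase:
    """Stand-in for a real database that counts how often it is read."""

    def __init__(self, rows):
        self.rows = dict(rows)
        self.reads = 0

    def fetch(self, key):
        self.reads += 1
        return self.rows.get(key)


class ReadThroughCache:
    """Keeps results in memory so repeat requests skip the database."""

    def __init__(self, database):
        self.database = database
        self.cache = {}

    def get(self, key):
        # Only go to the database the first time a key is requested.
        if key not in self.cache:
            self.cache[key] = self.database.fetch(key)
        return self.cache[key]

    def invalidate(self, key):
        # Forget a stored result, e.g. after the underlying row changes.
        self.cache.pop(key, None)
```

If the same page is requested a hundred times, the database is read once rather than a hundred times. The cost is that stored results can go stale, so anything that changes the data has to call `invalidate` for the affected key.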
What can I do to help?
We really appreciate how understanding users have been about the problems - it helps our team a lot! If you want to do more and you have some financial resources, then a donation to the OTW will help ensure that we are able to continue expanding our servers (money donated to the OTW supports all OTW projects, but the vast majority is spent on servers and other running costs for the AO3). Finally, if you happen to be an experienced sys-admin or database expert - especially one with experience in performance tuning and scalability - and you would be willing to donate some time, we would welcome the additional expertise and support. (If you're interested, get in touch with our Volunteers and Recruitment Committee.)
We'd like to say thanks to everyone who has supported the site in various ways - while the 502 errors are HUGELY annoying, they do show that people enjoy using the AO3. Thanks also to all our users for your patience while we get to grips with our new success and deal with the performance problems. Finally, a HUGE thank you to our Systems team, who bear the brunt of the work on these issues - the wonderful Sidra has been woken in the night more than once to deal with server issues, and we appreciate her hard work and dedication more than we can say. ♥
We are working hard to resolve the performance issues, and we'll keep updating users as we have more news. The latest site status updates can be found on our Twitter account, AO3_Status.