Performance issues on the AO3

Published: 2012-01-17 17:05:30 -0500

As many users will no doubt have noticed, the AO3 has been experiencing some performance issues since the start of the year. When we posted on 5th January, we were expecting those problems to ease once the holiday rush was over. However, that hasn't turned out to be the case. We're working on ways of dealing with the performance issues, but we wanted to keep you updated with what's going on while we do that.

Why the slowdowns?

In the past month, over 2000 new users have created accounts on the Archive. At the same time, the number of people reading on the Archive - with or without accounts - has been steadily growing. This has been part of a general trend, as you can see if you look at the graph showing number of visits to the Archive since November:

Line graph showing number of visits to the AO3, November to January. The line gradually goes up (with spikes on Sundays) before peaking dramatically on Jan 2.

We're always much busier on Sundays, but the number of visits has been gradually going up each week since November (and the same holds true for the preceding months). Before December, we were hovering around 135,000 visits at peak times. Visits began to climb more dramatically in December, peaking on 2nd January, when we had 182,958. Crucially, after that spike the numbers didn't drop back down to anything like their previous levels: we're now at more than 150,000 visits on a regular day, and more than 165,000 on Sundays, our busiest day. Wow!

We were expecting a big spike over the holidays, when there are lots of challenges and lots of people with a little spare time for reading and creating. However, we hadn't expected site usage to remain quite so high after the holidays were over! The increases mean that the site is now under a holiday load every day, which is one reason things have been running a little slowly.

The other reason for the slowdowns is that the increase in our number of registered users, together with the holiday challenge season, has produced a big increase in the number of works. In fact, 11,516 new works have been posted since the end of December already! More data in our databases means more work for things like sorting and searching. As a result, sometimes the database just doesn't serve up the result you need in time, and the unicorn which is waiting to get that result gives up and goes away (yes, really - our servers are assisted by unicorns :D).
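For the curious, here is a rough Python sketch of what "giving up" looks like. It is an analogy only, not the Archive's actual code: the function names `run_query` and `serve_request` and the timings are made up for illustration, and a real server handles timeouts differently from this small example.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as QueryTimeout


def run_query(seconds: float) -> str:
    """Stand-in for a database query that takes `seconds` to finish (hypothetical)."""
    time.sleep(seconds)
    return "results"


def serve_request(query_seconds: float, patience_seconds: float) -> str:
    """Wait for the query, but only for as long as the worker is willing to wait."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(run_query, query_seconds)
        try:
            return future.result(timeout=patience_seconds)
        except QueryTimeout:
            # The query is still running, but the reader gets an error page.
            # (This sketch lets the background query finish before returning.)
            return "timed out"


fast = serve_request(query_seconds=0.01, patience_seconds=0.5)
slow = serve_request(query_seconds=0.3, patience_seconds=0.05)
```

In this sketch the first request gets its results, while the second gives up because the query takes longer than the worker is prepared to wait. The more data the query has to sort through, the more often requests fall into the second group.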

We've been expecting this general effect for a while now, and we've been working towards implementing things to deal with it; however, we weren't expecting quite such a big jump in site usage in the past month!

What are you doing about this?

The Accessibility, Design & Technology and Systems Committees had a special meeting on Saturday to discuss ways of dealing with the immediate problem, as well as longer term plans. It can be tricky to test for high load situations before they actually occur, but once they do occur there's lots of data we can gather to help us address the most crucial issues. (We're also working on implementing more tools which will help us test this stuff before it comes up.)

Short term

More caching: We already cache pages (or sections of pages) across the site - this means we store a copy which we can serve up directly, instead of creating the page every time someone wants to use it. If something changes, the cache expires and a new, updated copy is created. Until now, we've focused on caching chunks of information which are unlikely to change rapidly: for example, on any works index the 'blurbs' which show the information about each work are cached. However, some of the heaviest load is caused by rapidly changing pages like the works index. We're moving towards more caching of whole pages, so that a new copy of the works index (for example) will be created every five minutes rather than generated each time someone asks for it. This means things like works indexes will be a little slower to update - when you add a new work, it won't appear on the list until the cache expires - but that five minute delay will massively reduce the load on our servers.
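To make the trade-off concrete, here is a minimal Python sketch of whole-page caching with a five minute expiry. It is an illustration under our own assumptions, not the Archive's real code: the class `TimedPageCache`, the function `render_works_index` and the key `"works_index"` are all hypothetical names, and the clock is faked so the example doesn't have to wait five minutes.

```python
import time
from typing import Callable, Dict, List, Tuple


class TimedPageCache:
    """Keep rendered pages for a fixed time before rebuilding them (hypothetical)."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.ttl_seconds = ttl_seconds
        self.clock = clock  # injectable so expiry can be shown without waiting
        self._pages: Dict[str, Tuple[float, str]] = {}

    def fetch(self, key: str, render: Callable[[], str]) -> str:
        now = self.clock()
        cached = self._pages.get(key)
        if cached is not None and now - cached[0] < self.ttl_seconds:
            return cached[1]      # still fresh: serve the stored copy
        page = render()           # missing or expired: build a new copy
        self._pages[key] = (now, page)
        return page


works: List[str] = ["Work A"]


def render_works_index() -> str:
    """Stand-in for the expensive job of building the works index page."""
    return "Works: " + ", ".join(works)


fake_now = [0.0]  # fake clock, in seconds
cache = TimedPageCache(ttl_seconds=300, clock=lambda: fake_now[0])

first = cache.fetch("works_index", render_works_index)

works.append("Work B")   # someone posts a new work
fake_now[0] = 120.0      # two minutes later: the cached copy is still served
second = cache.fetch("works_index", render_works_index)

fake_now[0] = 301.0      # just over five minutes later: the page is rebuilt
third = cache.fetch("works_index", render_works_index)
```

Here `second` is the same page as `first`, so the new work is missing from it, while `third` includes the new work. However many people ask for the page in between, it is only built once per five minutes.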

More indexes: We have a few places in our databases - for example the tables for the skins - which could use more indexes. Indexes speed things up because the server can just search through those rather than the whole table. So, we're hunting out places where more indexes are needed, and implementing them. :)
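As a small illustration of the difference an index makes, the Python sketch below uses the SQLite database that ships with Python (not the database the Archive runs on). The `skins` table layout, the `author_id` column and the index name `idx_skins_author_id` are assumptions invented for the example. It asks the database how it plans to run the same query before and after the index is added.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table layout, for illustration only.
conn.execute(
    "CREATE TABLE skins (id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT)"
)
conn.executemany(
    "INSERT INTO skins (author_id, title) VALUES (?, ?)",
    [(n % 100, f"Skin {n}") for n in range(1000)],
)

QUERY = "SELECT title FROM skins WHERE author_id = ?"


def query_plan(connection: sqlite3.Connection) -> str:
    """Return SQLite's description of how it intends to run QUERY."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + QUERY, (7,)).fetchall()
    return "; ".join(row[3] for row in rows)  # the fourth column is the detail text


plan_before = query_plan(conn)   # no index yet: the whole table is scanned
conn.execute("CREATE INDEX idx_skins_author_id ON skins (author_id)")
plan_after = query_plan(conn)    # the plan now names the new index

matches = conn.execute(QUERY, (7,)).fetchall()
```

The query returns the same rows either way; what changes is how much of the table the database has to read to find them. The exact wording of the plan text varies between SQLite versions, so print `plan_before` and `plan_after` to see what yours reports.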

Medium term

Bad queries must die: We have a few queries which are very long and complicated, and take a long time to run. We need to rewrite these bits of the code to make them simpler and faster! In many cases this will be quite complicated (or else we would have done it already), but it's a priority to help us speed things up.

New filters for great justice: The filters that are implemented on our index pages are not really optimal considering the size of the site now - the limitations of that code are the reason we have to have a 1000-work cap on the number of works returned. We have been working on this for a long time - we need to completely throw out what we have and implement a system which works better for the site as it is now. Again, this is really complicated, which is why it's taken us a long time to achieve it even though we knew it was important. The good news is that we have now done quite a lot of work on this area and the first round of changes should be out in the next few months.

Long term

Long term, we're going to be moving to a setup which allows us to distribute our site load across more servers. This will involve database sharding - putting different bits of the database on separate servers - so it will take quite a lot of planning and expertise. If you're a user of LiveJournal or Dreamwidth, you might be aware that your journal is hosted on a certain 'cluster' - we'd be moving to a similar system. We want to make sure we do this right, but based on the way the site is growing we think this is now a high priority, and our Systems team are working to figure out the right ways forward.
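As a very rough picture of the idea, the Python sketch below sends each user's data to one of three clusters based on their user ID. Everything in it is an assumption made for illustration: the cluster names, the `ShardedStore` class and the simple "remainder" rule are not a description of the design we will end up with, which is still being worked out.

```python
from typing import Dict, List

# Hypothetical cluster names; each would be a separate database server.
SHARDS: List[str] = ["cluster-1", "cluster-2", "cluster-3"]


def shard_for(user_id: int) -> str:
    """Pick a cluster from the user ID (one simple rule among many possible)."""
    return SHARDS[user_id % len(SHARDS)]


class ShardedStore:
    """Toy stand-in for several databases, each holding only some users' works."""

    def __init__(self) -> None:
        self._data: Dict[str, Dict[int, List[str]]] = {name: {} for name in SHARDS}

    def add_work(self, user_id: int, title: str) -> None:
        self._data[shard_for(user_id)].setdefault(user_id, []).append(title)

    def works_for(self, user_id: int) -> List[str]:
        # Only one cluster has to be asked, however many clusters exist.
        return list(self._data[shard_for(user_id)].get(user_id, []))

    def shard_sizes(self) -> Dict[str, int]:
        return {
            name: sum(len(titles) for titles in users.values())
            for name, users in self._data.items()
        }


store = ShardedStore()
for user_id, title in [(1, "A"), (2, "B"), (3, "C"), (4, "D")]:
    store.add_work(user_id, title)

sizes = store.shard_sizes()
```

Each cluster ends up holding only part of the data, so no single server has to do all the work. The remainder rule used here is easy to read but makes it awkward to add clusters later, since most users would then map to a different cluster; keeping a lookup table of which user lives on which cluster is one alternative. Choices like this are part of why the move needs careful planning.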

Summary

We know it's really frustrating when the site runs slow or is timing out on you: many apologies. We really appreciate users' patience while we deal with the issues. As you'll see from the above, there are some immediate things we can do to ease the problems, and we also have a good sense of where we need to go from here. So, while these changes need to be implemented as a matter of urgency, we feel confident we will be able to tackle the problems. If you have expertise in the areas of performance, scalability and database management, we would very much welcome additional volunteers.

As we move forward on dealing with problem spots on the site, we may implement some changes which are visible to users: the caching on the index pages and the changes to browsing and searching are two of the most obvious. We'll let you know about this as we go along - we think the effect will be beneficial for everyone, but do be prepared for a few changes! You can keep up with status and deploy news on our Twitter @AO3_Status.

While the growth in the site means we're facing some problems a little sooner than we expected, we're really excited about the fact so many people want to read and post to the AO3. Thanks to everyone for your fannish energy - and apologies for the fact we sometimes slow you down a little.