Everyone at the Archive of Our Own has been working hard to deal with the recent site expansion and performance problems. Now that we've been able to deal with the immediate issues, we wanted to give everyone a bit more detail on what's happening and what we're working on.
Our recent performance problems hit when a big increase in users happened, putting pressure on all the bits of the site not optimised for lots of users. We were able to make some emergency fixes which targeted the most problematic points and thus fixed the performance problems for now. However, we know we need to do quite a bit more work to make sure the site is scalable. The good news is there are lots of things we know we can work on, and we have resources to help us do it.
Some users have been concerned that the recent performance problems mean that the site is in serious trouble. However, we've got lots of plans in place to tackle the growth of the site, and we're also currently comfortable about our financial prospects (we'll be posting about this separately). As long as we are careful and don't rush to increase the number of users too fast, the site should remain stable.
The tl;dr details
What level of growth are we experiencing?
The easiest aspect of site growth for us to measure is the number of user accounts. This has definitely grown significantly: since May 1 almost 12,000 new user accounts have been created, which means a 25% increase in user numbers in the past two months. However, the number of new accounts created is only a small proportion of the overall increase in traffic.
We know that lots more people are using the site without an account. There are currently almost 30,000 people waiting for an invitation, but even that is a very, very partial picture of how many people are actually visiting the site. In fact, we now have approximately one and a half million unique visitors per month. That's a lot of users (even if we assume that some of those visitors represent the same users accessing the site from different locations)!
A bit about scalability
The recent problems we've been experiencing were related to the increase in the number of people accessing the site. This is a problem of scalability: the requirements of a site serving a small number of users can be quite different to those of a site with a large userbase. When more users are accessing a site, any weak points in the code will also become more of a problem: something which is just a little bit slow when you have 20,000 users may grind to a halt entirely by the time you hit 60,000.
The slightly counterintuitive thing about scalability is that the difference between a happy site and an overwhelmed one can be one user. Problems tend to arise when the site hits a particular break point - for example, a database table getting one more record than it can handle - and so performance problems can appear suddenly and dramatically.
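To make that break-point arithmetic concrete, here's a deliberately simplified sketch (hypothetical code, not anything from our codebase): a routine that compares every item against every other item does roughly nine times the work when the userbase merely triples - which is how "a little bit slow" turns into "grinds to a halt".

```python
# Illustrative only: how superlinear costs blow up as a site grows.
# A hypothetical routine that compares every item against every other
# item (an O(n^2) pattern) does ~9x the work for 3x the users.

def comparisons_needed(n):
    """Number of pairwise comparisons for n items: n * (n - 1) / 2."""
    return n * (n - 1) // 2

small = comparisons_needed(20_000)   # ~200 million comparisons
large = comparisons_needed(60_000)   # ~1.8 billion comparisons
print(large // small)                # 3x the users, roughly 9x the work
```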
When coding and designing a site, you try to ensure it is scalable: that is, you set up the hardware so that it's easy to add more capacity, you design the code so it will work for more users than you have right now, etc. However, this is always a balancing act: you want to ensure the site can grow, but you also need to ensure there's not too much redundancy and you're not paying for more things than you need. Some solutions simply don't make any sense when you have a smaller number of users, even if you think you'll need them one day in the future. In addition, there are lots of factors which can result in code which isn't very scalable: sometimes it makes sense to implement code which works now and revise it when you see how people are using the site, sometimes things progress in unexpected ways (and testing for scalability can be tricky), sometimes you simply don't know enough to detect problem areas in the code. All of these factors have been at work for the AO3 at one time or another (as for most other sites).
Emergency fixes for scalability
When lots and lots of new users arrived at the Archive at once, all the bits of the site which were not very scalable began to creak. This happened more suddenly than we were anticipating, largely because changes at the biggest multifandom archive, Fanfiction.net, meant that lots of users from there were coming over to us en masse. So, we had to make some emergency fixes to make the site more able to cope with lots more users.
In our case, we already knew we had one bit of code that was extremely UNscalable - the tag filters used to browse lists of works. These were fine and dandy when we had a very small number of works on the Archive, but they had a big flaw - they were built on demand from the list of works returned when a user accessed a particular page. This made them up-to-the-minute and detailed, but it became a big problem once the works returned for a given fandom numbered in the thousands - while we designed a new system, we worked around this by limiting the number of returned works to 1000. It was also a problem because building the filters on demand meant that our servers had to redo the work every time someone hit a page with filters on it. When thousands of people were hitting the site every minute, that put the servers under a lot of strain. Fortunately, the filters happen to be a bit of code that's relatively easy to disable without affecting anything else, so we were able to remove them as an emergency measure to deal with the performance problems. Because they were such a big part of the problem, doing this had a dramatic effect on the many 502s and slowdowns.
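As a rough sketch of why on-demand filters hurt (hypothetical Python, not the Archive's actual Rails code): tallying tag counts from the result list on every page hit repeats the same work for every single visitor, while caching the tally means it's done once per query.

```python
from collections import Counter

def build_filters(works):
    """Tally how many works carry each tag.
    In the old on-demand design, this ran afresh on every page hit."""
    counts = Counter()
    for work in works:
        counts.update(work["tags"])
    return counts

# Caching the tally per query key means thousands of visitors to the
# same listing share one computation instead of triggering thousands.
_filter_cache = {}

def cached_filters(query_key, works):
    if query_key not in _filter_cache:
        _filter_cache[query_key] = build_filters(works)
    return _filter_cache[query_key]
```

The trade-off is freshness: a cached tally lags behind newly posted works until it's refreshed, which is exactly why the old on-demand filters were so up-to-the-minute.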
We also did some other work to help the site cope with more users: largely this involved implementing a lot more caching and tuning our servers so they manage their workload slightly differently. All these changes were enough to deal with the short-term issues, but we need to do further, more sustained work to ensure that the site can grow and meet the demands of its users.
Scalability work we're doing right now
We've got a bunch of plans for things which will help scalability and thus ensure good site performance. In the short term (approximate timescales included below) we are:
- Installing more RAM - within the next week. This will allow us to run more server processes at once so we can serve more users at the same time. This is a priority right now because our servers are running out of memory: they're regularly going over 95% of usage, which is not ideal! We have purchased new RAM and it will be installed as soon as we can book a maintenance slot with our server hosts.
- Changing our version of MySQL to Percona - within the next week. Percona is an open source version of MySQL with additional facilities that will give us more information about what our server is doing, helping us identify problem spots in the site which we need to work on. It should also work a bit faster. We've installed Percona on our Test Archive and have been checking that it doesn't cause any unexpected problems - we'll be putting it on the main site in the next week or so. In addition, we hope to draw on the support of the company who produce it (also called Percona).
- Completing the work on our new tag filters - within the next month. These will (we hope!) be much, much more scalable than the old ones. They'll use a system called Elasticsearch, which is built on Lucene (the same search library that powers Solr). These solutions don't use the MySQL database, so they cut down on a lot of database calls.
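For the curious, here's roughly the shape of the kind of request the new filters will send to Elasticsearch - a single query that both fetches a page of works and counts how many matching works carry each tag, with no MySQL involved. This is an illustrative sketch only: the field names (`fandom_id`, `tag_ids`) are invented for the example, not our real schema.

```python
import json

# Hypothetical sketch of an Elasticsearch request body for tag filters:
# fetch one page of works matching a fandom and, in the same query,
# count how many matching works carry each tag.
filter_query = {
    "query": {"term": {"fandom_id": 42}},   # works in one (invented) fandom
    "facets": {                             # terms facets count docs per tag
        "tags": {"terms": {"field": "tag_ids", "size": 50}},
    },
    "size": 20,                             # one page of work listings
}

print(json.dumps(filter_query, indent=2))
```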
Scalability stuff we're doing going forward
We want to continue working on scalability going forward. We've reached a point where the site is only going to get bigger, so we need to be ready to accommodate that. This involves some complex work, so there are a bunch of conversations ongoing. However, this will involve some of the following:
- Analysis of our systems and code to identify problem spots. We've installed a system called New Relic which can be used to analyse what's going on in the site, how scalable it is, and where problems are occurring. Percona also provides more tools to help us analyse the site. In addition, Mark from Dreamwidth has kindly offered to work with us to take a look at our Systems setup - Mark runs the Systems side of things at Dreamwidth and has lots of experience in scalability issues, so having his fresh eyes on the site's performance will help us figure out the work we need to do.
- Caching, caching and more caching. We've been working on implementing more caching for some time, and we added a lot more caching as part of our emergency fixes. However, there is still a LOT more caching we can do. Caching essentially saves a copy of a page and delivers it up to the next person who wants to see the page, instead of creating it fresh each time. Obviously, this is really helpful if you have a lot of page views: we now have over 16 million page views per week, so caching is essential. We'll be looking to implement three types:
- Whole page caching. This is the type we implemented as an emergency fix during the recent performance issues. It uses something called Squid, and it's the best performance saver because it can just grab the whole page with no extra processing. Unfortunately, this can also cause some problems, since we have a lot of personalised pages on the site - for example, when we first implemented it, some people were getting cached pages with skins applied that they hadn't chosen to use. There are ways around this, however, which allow you to serve a cached page and then personalise it, so we'll be working on implementing those.
- Partial page caching. This is something we already do a lot of - if there are bits that repeat a lot, you can cache them so that everything isn't generated fresh each time. For example, the 'work blurbs' (the information about individual works in a list of search results) are all cached. This uses a system called memcached. We'll be looking to do more, and better, partial caching.
- Database caching. This would mean we use a secondary server to do complex queries and then put the results on the primary server, so that all the primary server has to do is grab them.
- Adding more servers. We’re definitely going to need more database servers to manage site growth, and we’re currently finalising some decisions on that. At the moment, it looks like the way we’re going to go is to add a new machine which would be dedicated to read requests (which is most of our traffic – people looking at works rather than posting them) while one of our older machines will be dedicated to write requests (posting, commenting, etc). Once we've confirmed the finer details (hopefully this week), we expect it to take about two months for the new server to be purchased and installed.
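Two of the ideas above can be sketched in miniature - a read-through cache in the spirit of how memcached gets used, and a read/write split that sends SELECTs to one server and everything else to another. This is hypothetical toy code, not our production setup:

```python
import time

class ReadThroughCache:
    """Toy read-through cache with expiry, in the spirit of memcached use."""

    def __init__(self, ttl_seconds=300):
        self.ttl = ttl_seconds
        self._store = {}  # key -> (value, expires_at)

    def fetch(self, key, compute):
        entry = self._store.get(key)
        if entry is not None and entry[1] > time.monotonic():
            return entry[0]               # cache hit: no recomputation
        value = compute()                 # cache miss: build it once
        self._store[key] = (value, time.monotonic() + self.ttl)
        return value

class ConnectionRouter:
    """Toy read/write split: SELECTs go to the read server, everything
    else (INSERT, UPDATE, DELETE) goes to the write server."""

    def __init__(self, write_server, read_server):
        self.write_server = write_server
        self.read_server = read_server

    def server_for(self, sql):
        verb = sql.lstrip().split(None, 1)[0].upper()
        return self.read_server if verb == "SELECT" else self.write_server
```

In real deployments the read/write split also has to cope with replication lag: someone who just posted a work expects to see it on their very next page load, so their reads may need to be pinned to the write server for a little while.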
We'll be posting separately about the financial setup for the AO3, but the key thing to say is that we're currently in a healthy financial state. :D However, as the site gets bigger its financial needs will also get bigger, and we always welcome donations - if you want to donate and you can afford to do so, then donating to the OTW will help us stay on good financial footing. We really appreciate the immense generosity of the fannish community and the support you've already shown us. <3
A lot of supporting the site and dealing with scalability is down to the people. As we grow, we need to ensure we have the people and expertise to keep things running. We are a volunteer-run site and as such our staff have varying levels of time, expertise, and so on. One important part of expanding slowly is ensuring that we don't get into crisis situations which not only suck for our users (like when the 502s were making the site inaccessible) but also cause massive stress for the people working to fix the problems. So, we're proceeding cautiously to try to avoid those situations.
We've been working hard over the last year or so to make it easier for people to get involved with coding and working on the site. We're happy to say this is definitely paying off: we've had eight new coders come on board during the last few months who have already started contributing code. Our code is public on GitHub, and we welcome 'drive by' code contributions: one thing we'd like to do is make that a bit easier by providing more extensive setup instructions so people who want to try running the code on their own machines can do so.
If you'd like to get more involved in our coding teams, then you can volunteer via our technical recruitment form. Please note that at the moment, we're only taking on fairly experienced people - normally we very much welcome absolute beginners as well, but we're taking a brief break while our established team get some of the performance problems under control so that we don't wind up taking on more people than we can support. We love helping people to acquire brand-new skills, but we want to be sure we can mentor and train them when they join us.
Lots of people have asked whether we'd consider having paid employees. It's unlikely that we'll have permanent employees in the foreseeable future, for a number of reasons (taxes, insurance, etc), but we are considering areas where we would benefit from paid expertise for particular tasks. Ideally, this would enable us to offer more training to our volunteers while targeting particularly sticky sections of code. Paying for help has a lot of implications (most obviously, it would add to our financial burden) and we want to think carefully about what makes sense for us. However, the OTW Board are discussing those options.
We're incredibly grateful to the hard-working volunteers who give their time and energy to all aspects of running the AO3. They are our most precious resource and we would like to take the opportunity to say thanks to all our volunteers, past, present and future. <3