AO3 News

Post Header

2015-11-04 17:53:27 -0500

At approximately 23:00 UTC on October 24, the Archive of Our Own began experiencing significant slowdowns. We suspected these slowdowns were being caused by the database, but four days of sleuthing led by volunteer SysAdmin james_ revealed a different culprit: the server hosting all our tag feeds, Archive skins, and work downloads. While a permanent solution is still to come, we were able to put a temporary fix in place and restore the Archive to full service at 21:00 UTC on October 29.

Read on for the full details of the investigation and our plans to avoid a repeat in the future.

Incident Summary

On October 24, we started to see very strange load graphs on our firewalls, and reports started coming in via Twitter that a significant number of users were getting 503 errors. There had been sporadic reports of issues earlier in the week as well, but we attributed these to the fact that one of our two front-end web servers had a hardware issue and had to be returned to its supplier for repair. (The server was returned to us and put back into business today, November 4.)

Over the next few days, we logged tickets with our MySQL database support vendor and tried adjusting our configuration of various parts of the system to handle the large spikes in load we were seeing. However, we still were unable to identify the cause of the issue.

We gradually identified a cycle in the spikes, and began, one-by-one, to turn off internal loads that were periodic in nature (e.g., hourly database updates). Unfortunately, this did not reveal the problem either.

On the 29th of October, james_ logged in to a virtual machine that runs on one of our servers and noticed it felt sluggish. We then ran a small disc performance check, which showed severely degraded performance on this server. At this point, we realised that our assumption that the application was being delayed by a database problem was wrong. Instead, our web server was being held up by slow performance from the NAS, which is a different virtual machine that runs on the same server.
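
To illustrate the kind of small disc check described above, here is a minimal Python sketch. It is not the tool that was actually used; the function name, the file size, and the block size are all assumptions chosen for the example. It writes a temporary file, forces it to disc, and reports the write throughput.

```python
import os
import tempfile
import time


def write_throughput_mb_s(directory, total_mb=16, block_kb=1024):
    """Write roughly total_mb of data to a temp file in `directory`; return MB/s."""
    block = os.urandom(block_kb * 1024)
    blocks = max(1, (total_mb * 1024) // block_kb)
    written_mb = blocks * block_kb / 1024
    start = time.perf_counter()
    with tempfile.NamedTemporaryFile(dir=directory) as handle:
        for _ in range(blocks):
            handle.write(block)
        handle.flush()
        # Force the data out of the page cache so the disc itself is measured.
        os.fsync(handle.fileno())
        elapsed = time.perf_counter() - start
    return written_mb / elapsed


if __name__ == "__main__":
    print(f"{write_throughput_mb_s(tempfile.gettempdir()):.1f} MB/s")
```

Running the same check on a healthy machine and on the suspect one gives a rough comparison; a much lower figure on one of them points at storage rather than the application.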

The NAS holds a relatively small number of static files, including skin assets (such as background images), tag feeds, and work downloads (PDF, .epub, and .mobi files of works). Most page requests made to the Archive load some of these files, which are normally delivered very quickly. But because the system serving those assets was having problems, the requests backed up until a cascade of them timed out (causing the spikes and temporarily clearing the backed-up requests).

To fix the issue, we had to get the NAS out of the system. The skin assets were immediately copied to local disc instead, and we put up a banner warning users that tag feeds and potentially downloads would need to be disabled. After tag feeds were disabled, the service became more stable, but there were further spikes. These were caused by our configuration management system erroneously returning the NAS to service after we disabled it.

After a brief discussion, AD&T and Systems agreed to temporarily move the shared drive to one of the front-end servers. This shared drive represents a single point of failure, however, which is undesirable, so we also agreed to reconfigure the Archive to remove this single point of failure within a few months.

Once the feeds and downloads were moved to the front-end server, the system became stable, and full functionality returned at 21:00 (UTC) on the 29th of October.

We still do not know the cause of the slowdown on the NAS. Because it is a virtual machine, our best guess is that the problem is with a broken disc on the underlying hardware, but we do not have access to the server itself to test. We do have a ticket open with the virtual machine server vendor.

The site was significantly affected for 118 hours. Our analytics platform shows a drop of about 8% in page views for the duration of the issue, and it shows that pages took significantly longer to deliver, meaning we were providing a reduced service the whole time.
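
The 118-hour figure follows directly from the two timestamps given earlier in this post (slowdowns from 23:00 UTC on October 24, full service at 21:00 UTC on October 29). A short Python check, with variable names chosen only for this example:

```python
from datetime import datetime, timezone

start = datetime(2015, 10, 24, 23, 0, tzinfo=timezone.utc)  # slowdowns begin
end = datetime(2015, 10, 29, 21, 0, tzinfo=timezone.utc)    # full service restored

affected_hours = (end - start).total_seconds() / 3600
print(affected_hours)  # 118.0
```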

Lessons Learnt

  • Any single point of failure that still remains in our architecture must be redesigned.
  • We need enough spare server capacity that, in the case of hardware failure, we can pull a server from a different function and be sure it has sufficient hardware to perform in its new role. For instance, we hadn't moved a system from its role as a resque worker into being a front-end machine because of worries about the machine's lack of an SSD. We have ordered SSD upgrades for the two worker systems so that this worry should not recur. Database servers are more problematic because of their cost. However, when the current systems are replaced, the old systems will become app servers, and could be returned to their old role in an emergency at reduced usage.
  • We are lacking centralised logging for servers. This would have sped up the diagnostic time.
  • Systems should have access to a small budget for miscellaneous items such as extra data center smart hands, above the two hours we already have in our contract. For example, a US$75 expense for this was only approved on October 26 at 9:30 UTC, 49 hours after it was requested. Delays like this force Systems to work around such restrictions and waste time.
  • We need to be able to occasionally consult specialized support. For instance, at one point we attempted to limit the number of requests by IP address, but this limit was applied to our firewall's IP address on one server but not the other. We would recommend buying support for nginx at US$1,500 per server per year.
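
The rate-limiting pitfall in the last item (a per-IP limit landing on the firewall's own address) can be sketched in Python. This is an illustration only, not our nginx configuration: it assumes the firewall passes the original client address in an X-Forwarded-For header, which this post does not state, and the address ranges and function name are hypothetical.

```python
import ipaddress

# Hypothetical range for the firewall tier (an RFC 5737 documentation range).
TRUSTED_PROXIES = [ipaddress.ip_network("192.0.2.0/24")]


def is_trusted(addr):
    return any(addr in net for net in TRUSTED_PROXIES)


def client_ip(peer_addr, forwarded_for=None):
    """Return the address a per-IP request limit should be keyed on."""
    peer = ipaddress.ip_address(peer_addr)
    if is_trusted(peer) and forwarded_for:
        # Walk the header right to left; the first untrusted hop is the client.
        hops = [h.strip() for h in forwarded_for.split(",") if h.strip()]
        for hop in reversed(hops):
            addr = ipaddress.ip_address(hop)
            if not is_trusted(addr):
                return str(addr)
    # Direct connection, or no usable header: fall back to the peer address.
    return str(peer)


print(client_ip("192.0.2.10", "203.0.113.7"))   # 203.0.113.7
print(client_ip("198.51.100.4", "203.0.113.7"))  # 198.51.100.4
```

If one server keys its limit on the peer address and another on the value this function returns, the first ends up throttling the firewall itself, which matches the mismatch described above.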

Technologies to Consider

  • Investigate maxscale rather than haproxy for load balancing MySQL.
  • Investigate RabbitMQ as an alternative to resque. This makes sense when multiple servers need to take an action, e.g. cache invalidation.
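
The difference that matters for cache invalidation can be shown with a toy in-memory model in Python. This is neither the resque nor the RabbitMQ API; the class names and the server names are invented for the example. A work queue hands each job to exactly one worker, while a fan-out copies each message to every subscriber.

```python
from collections import deque


class WorkQueue:
    """Each job is handled by exactly one worker."""

    def __init__(self):
        self.jobs = deque()

    def publish(self, job):
        self.jobs.append(job)

    def deliver(self, workers):
        handled = []
        turn = 0
        while self.jobs:
            job = self.jobs.popleft()
            handled.append((workers[turn % len(workers)], job))
            turn += 1
        return handled


class Fanout:
    """Each message is copied to every subscriber's own queue."""

    def __init__(self):
        self.queues = {}

    def subscribe(self, name):
        self.queues[name] = deque()

    def publish(self, message):
        for queue in self.queues.values():
            queue.append(message)


servers = ["app1", "app2", "app3"]  # hypothetical server names

work_queue = WorkQueue()
work_queue.publish("expire /works/1")
print(work_queue.deliver(servers))  # only app1 sees the job

fanout = Fanout()
for name in servers:
    fanout.subscribe(name)
fanout.publish("expire /works/1")
print({name: list(q) for name, q in fanout.queues.items()})  # every server sees it
```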


Post Header

To combat an influx of spam works, we are temporarily suspending the issuing of invitations from our automated queue. This will prevent spammers from getting invitations to create new accounts and give our all-volunteer teams time to clean up existing spam accounts and works. We will keep you updated about further developments on our Twitter account. Please read on for details.

The problem

We have been dealing with two issues affecting the Archive, both in terms of server health and user experience.

  • Spammers who sign up for accounts only to post thousands of fake "works" (various kinds of advertisements) with the help of automated scripts.
  • People who use bots to download works in bulk, to the point where it affects site speed and server uptime for everyone else.

Measures we've taken so far

We have been trying several things to keep both problems in check:

  • The Abuse team has been manually banning accounts that post spam.
  • We are also keeping an eye on the invitation queue for email addresses that follow discernible patterns and removing them from the queue. This is getting trickier as the spammers adjust.
  • We delete the bulk of spam works from the database directly, as individual work deletion would clearly be an overwhelming task for the Abuse team; however, this requires people with the necessary skills and access to be available.
  • Our volunteer sysadmin has been setting up various server scripts and settings aimed at catching spammers and download bots before they can do too much damage. This requires a lot of tweaking to adjust to new bots and prevent real users from being banned.
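
As a rough illustration of what "discernible patterns" in the invitation queue could mean, the Python sketch below groups addresses that differ only in their digits. The actual patterns we look for are not described here, and the function names, threshold, and sample addresses are all made up for the example.

```python
import re
from collections import Counter


def address_shape(email):
    """Replace each run of digits in the local part with '#'."""
    local, _, domain = email.lower().partition("@")
    return f"{re.sub(r'[0-9]+', '#', local)}@{domain}"


def suspicious_shapes(queue, threshold=3):
    """Return shapes shared by at least `threshold` queued addresses."""
    counts = Counter(address_shape(email) for email in queue)
    return {shape: n for shape, n in counts.items() if n >= threshold}


queue = [
    "deal001@mail.example",
    "deal002@mail.example",
    "deal317@mail.example",
    "reader@fans.example",
]
print(suspicious_shapes(queue))  # {'deal#@mail.example': 3}
```

A check this simple is easy for spammers to sidestep, which is why such rules need constant adjustment.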

Much of this has cut into our volunteers' holiday time, and we extend heartfelt thanks to everyone who's been chipping in to keep the Archive going through our busiest days.

What we're doing now

Our Abuse team needs a chance to catch up on all reported spamming accounts and make sure that all spam works are deleted. Currently the spammers are creating new accounts faster than we can ban them. Our sysadmins and coders need some time to come up with a sustainable solution to prevent further bot attacks.

To that end, we're temporarily suspending issuing invites from our automated queue. Existing account holders can still request invite codes and share them with friends. You can use existing invites to sign up for an account; account creation itself will not be affected. (Please note: Requests for invite codes have to be manually approved by a site admin, so there might be a delay of two to three days before you receive them; challenge moderators can contact Support for invites if their project is about to open.)

We are working hard to get these problems under control, so the invite queue should be back in business soon! Thank you for your patience as we work through the issues.

What you can do

There are some things you can do to help:

  • When downloading multiple works, wait a few moments between each download. If you're downloading too many works at once, you will be taken to an error page warning you to slow down or risk being blocked from accessing the Archive for 24 hours.
  • Please don't report spam works. While we appreciate all the reports we've received so far, we now have a system in place that allows us to find spam quickly. Responding to reports of spam takes time away from dealing with it.
  • Keep an eye on our Twitter account, @AO3_Status, for updates!

Known problems with the automated download limit

We have been getting reports of users who run into a message about excessive downloads even if they were downloading only a few works, or none at all. This may happen for several reasons that are unfortunately beyond our control:

  • They pressed the download button once, but their device went on a rampage trying to download the file many times. A possible cause for this might be a download accelerator, so try disabling any relevant browser extensions or software, or try downloading works in another browser or device.
  • They share an IP address with a group of people, one of whom hit the current download limit and got everyone else with the same IP address banned as well. This can be caused by VPNs, Tor software, or an ISP that assigns the same IP address to a group of customers (more likely to happen on phones). Please try using a different device, if you can.
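
To show why a shared IP address causes this, here is a simplified Python sketch of a per-address download limit. It is not our actual server configuration: the window and the limit are hypothetical numbers, and the sketch blocks immediately rather than warning first. Only the 24-hour block length comes from this post.

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 60        # hypothetical; the real window is not stated here
MAX_DOWNLOADS = 5          # hypothetical; the real limit is not stated here
BLOCK_SECONDS = 24 * 3600  # the 24-hour block mentioned above


class DownloadLimiter:
    def __init__(self):
        self.recent = defaultdict(deque)  # ip -> timestamps of recent downloads
        self.blocked_until = {}           # ip -> time the block ends

    def allow(self, ip, now):
        if self.blocked_until.get(ip, 0) > now:
            return False
        window = self.recent[ip]
        # Forget downloads that have fallen out of the window.
        while window and now - window[0] >= WINDOW_SECONDS:
            window.popleft()
        if len(window) >= MAX_DOWNLOADS:
            self.blocked_until[ip] = now + BLOCK_SECONDS
            return False
        window.append(now)
        return True


limiter = DownloadLimiter()
shared_ip = "203.0.113.9"  # documentation address standing in for a shared VPN exit
bulk = [limiter.allow(shared_ip, now=t) for t in range(6)]
print(bulk)                               # the sixth request is refused
print(limiter.allow(shared_ip, now=100))  # a different person, same address: refused
```

The limiter only ever sees the address, so it cannot tell the bulk downloader from anyone else behind it.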

We apologize if you have to deal with any of these and we'll do our best to restore proper access for all users as soon as possible!


Post Header

2014-09-09 16:43:05 -0400


  • Coder: Elz
  • Code reviewers: Enigel, james_
  • Testers: Ariana, Lady Oscar, mumble, Ridicully, sarken


With today's deploy we're making some changes to our search index code, which we hope will solve some ongoing problems with suddenly "missing" works or bookmarks and inaccurate work counts.

In order to improve consistency and reduce the load on our search engine, we'll be sending updates to it on a more controlled schedule. The trade-off is that it may take a couple of minutes for new works, chapters, and bookmarks to appear on listing pages (e.g. for a fandom tag or in a collection), but those pages will ultimately be more consistent and our systems should function more reliably.

You can read on for technical details!

The Problem

We use a software package called Elasticsearch for most of our search and filtering needs. It's a powerful system for organizing and presenting all the information in our database and allows for all sorts of custom searches and tag combinations. To keep our search results up to date for everyone using the Archive, we need to ensure that freshly-posted works, new comments and kudos, edited bookmarks, new tags, etc. all make it into our search index practically in real time.

As the volume of updates has grown considerably over the last couple of years, however, that's increased the time it takes to process those updates and slowed down the general functioning of the underlying system. That slowness has interacted badly with the way we cache data in our current code: works and bookmarks seem to occasionally appear and disappear from site listings and the counts you see on different pages and sidebars may be significantly different from one another.

That's understandably alarming to anyone who encounters it, and fixing it has been our top priority.

The First Step

We are making some major changes to our various "re-indexing" processes, which take every relevant change that happens to works/bookmarks/tags and update our massive search index accordingly:

  • Instead of going directly into Elasticsearch, all indexing tasks will now be added to a queue that can be processed in a more orderly fashion. (We were queueing some updates before, but not all of them.)
  • The queued updates will then be sent to the search engine in batches to reduce the number of requests, which should help with performance.
  • Cached pages get expired (i.e., updated to reflect new data) not when the database says so, but when Elasticsearch is ready.
  • Updates concerning hit counts, kudos, comments, and bookmarks on a work (i.e. "stats" data) will be processed more efficiently but less frequently.

As a result, work updates will take a minute to affect search results and work listings, and background changes to tags (e.g. two tags being linked together) will take a few minutes longer to be reflected in listings. Stats data (hits, kudos, etc.) will be added to the search index only once an hour. The upside of this is that listings should be more consistent across the site!
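
The queue-then-batch approach described above can be sketched in Python as follows. This is a simplified model, not the Archive's code: the class name, the batch size, and the two callables standing in for the search engine and the page cache are assumptions made for the example.

```python
class IndexQueue:
    """Collects ids of changed records and sends them to the search engine in batches."""

    def __init__(self, bulk_index, expire_cache, batch_size=100):
        self.bulk_index = bulk_index      # callable taking a list of ids
        self.expire_cache = expire_cache  # callable taking a list of ids
        self.batch_size = batch_size
        self.pending = {}                 # dict keeps order and drops duplicate ids

    def enqueue(self, record_id):
        self.pending[record_id] = True

    def flush(self):
        """Send everything queued so far; return the number of requests made."""
        ids = list(self.pending)
        self.pending.clear()
        requests = 0
        for start in range(0, len(ids), self.batch_size):
            batch = ids[start:start + self.batch_size]
            self.bulk_index(batch)    # one request per batch, not one per record
            self.expire_cache(batch)  # expire cached pages only once the index has the data
            requests += 1
        return requests


log = []
queue = IndexQueue(
    bulk_index=lambda batch: log.append(("index", batch)),
    expire_cache=lambda batch: log.append(("expire", batch)),
    batch_size=2,
)
for work_id in [1, 2, 2, 3]:
    queue.enqueue(work_id)
print(queue.flush())  # 2
print(log)
```

Calling flush on a timer rather than on every change is what produces the delay of a minute or so; in exchange, the cache is never expired before the search index can serve the new data.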

(Please note that this affects only searching, sorting, and filtering! The kudos count in a work blurb, for example, is based on the database total, so you may notice slight inconsistencies between those numbers and the order you see when sorting by kudos.)

The Next Step

We're hoping that these changes will help to solve the immediate problems that we're facing, but we're also continuing to work on long-term plans and improvements. We're currently preparing to upgrade our Elasticsearch cluster from version 0.90 to 1.3 (which has better performance and backup tools), switch our code to a better client, and make some changes to the way we index data to continue to make the system more efficient.

One big improvement will be in the way we index bookmarks. When we set up our current system, we had a much smaller number of bookmarks relative to other content on the site. The old Elasticsearch client we were using also had some limitations on its functionality, so we ended up indexing the data for bookmarked works together with each of their individual bookmarks, which meant that updates to the work meant updates to dozens or hundreds of bookmark records. That's been a serious problem when changes are made to tags, in particular, where a small change can potentially kick off a large cascade of re-indexes. It's also made it more difficult to keep up with regular changes to works, which led to problems with bookmark sorting by date. We're reorganizing that, using Elasticsearch's parent-child index structure, and we hope that this will also have positive long-term effects on performance.
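
The cost difference between the two layouts can be illustrated with plain Python dictionaries standing in for the search index. This is not Elasticsearch code, and the field and function names are invented; it only counts how many documents have to be rewritten when a single work changes.

```python
def reindex_embedded(index, work, bookmarks):
    """Old layout: the work's data is copied into every bookmark document."""
    written = 0
    for bookmark in bookmarks:
        index[("bookmark", bookmark["id"])] = {
            **bookmark,
            "work_title": work["title"],
            "work_tags": work["tags"],
        }
        written += 1
    return written


def reindex_parent_child(index, work):
    """New layout: the work is one parent document; bookmarks just point to it."""
    index[("work", work["id"])] = {"title": work["title"], "tags": work["tags"]}
    return 1


work = {"id": 42, "title": "Example Work", "tags": ["Fluff"]}
bookmarks = [{"id": n, "work_id": 42} for n in range(300)]

print(reindex_embedded({}, work, bookmarks))  # 300 documents rewritten
print(reindex_parent_child({}, work))         # 1 document rewritten
```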

Overall, we're continuing to learn and look for better solutions as the Archive grows. We apologize for the bumpy ride lately, and we hope that the latest set of changes will make things run more smoothly. We should have more improvements for you in the coming months, and in the meantime, we thank you for your patience!


Post Header

2014-01-23 16:26:51 -0500

If you're a regular Archive visitor or if you follow our AO3_Status Twitter account, you may have noticed that we've experienced a number of short downtime incidents over the last few weeks. Here's a brief explanation of what's happening and what we're doing to fix things.

The issue

Every now and then, the volume of traffic we get and the amount of data we're hosting starts to hit the ceiling of what our existing infrastructure can support. We try to plan ahead and start making improvements in advance, but sometimes things simply catch up to us a little too quickly, which is what's happening now.

The good news is that we do have fixes in the works: we've ordered some new servers, and we hope to have them up and running soon. We're making plans to upgrade our database system to a cluster setup that will handle failures better and support more traffic; however, this will take a little longer. And we're working on a number of significant code fixes to relieve bottlenecks and reduce server load - we hope to have the first of those out within the next two weeks.

One area that's affected is the number of hits, kudos, comments, and bookmarks on works, so you may see delays in those updating, which will also result in slightly inaccurate search and sort results. Issues with the "Date Updated" sorting on bookmark pages will persist until a larger code rewrite has been deployed.

Behind the scenes

We apologize to everyone who's been affected by these sudden outages, and we'll do our best to minimize the disruption as we work on making things better! We do have an all-volunteer staff, so while we try to respond to server problems quickly, sometimes they happen when we're all either at work or asleep, so we can't always fix things as soon as we'd like to.

While we appreciate how patient and supportive most Archive users are, please keep in mind that tweets and support requests go to real people who may find threats of violence or repeated expletives aimed at them upsetting. Definitely let us know about problems, but try to keep it to language you wouldn't mind seeing in your own inbox, and please understand if we can't predict immediately how long a sudden downtime might take.

The future

Ultimately, we need to keep growing and making things work better because more and more people are using AO3 each year, and that's something to be excited about. December and January tend to bring a lot of activity to the site - holiday gift exchanges are posted or revealed, people are on vacation, and a number of fandoms have new source material.

We're looking forward to seeing all the new fanworks that people create this year, and we'll do our best to keep up with you! And if you're able to donate or volunteer your time, that's a huge help, and we're always thrilled to hear from you.


Post Header

2013-11-20 14:45:01 -0500

Update December 14, 18:00 UTC: As of this week, all systems should be back to normal. We're still working on optimizing our server settings, so very brief downtimes for maintenance should be expected. If bookmarks still won't sort correctly for you, please bear with us: we're working on a more permanent fix to the underlying issue, but it might be a short while yet. As always, we're keeping an eye on Support tickets and messages to our Twitter account, and will react as quickly as possible if anything seems off. Thank you all for your patience.

Update December 3, 16:00 UTC: We have re-enabled the sort and filter sidebar on work listings only. Bookmark filtering and sorting is still turned off and will likely be off for a few more days. (The filters are the sidebar that allows you to narrow down a list of works or bookmarks by character, rating, etc.) We will continue to work on the underlying issue. In the meantime, we suggest using the Works Search to help find what you’re looking for.

All works and bookmarks should be showing up normally. Work re-indexing is complete, so we hope to be able to turn on filtering for works again in the next day or two.

Bookmark re-indexing is still ongoing, so it will be several days before we can turn bookmark filtering back on.

Please follow the @AO3_Status Twitter feed or check back here for further updates.

Update 2 Dec: Listings for works, bookmarks, tags, and pseuds are unavailable due to issues with our search index. Our coding and systems volunteers are currently looking into it, and we will keep you updated on our progress. Our Support team is working on a backlog, so there might be delays in getting back to users individually. Please consider checking the @AO3_Status Twitter feed or our banner alerts instead.

Update 30 Nov: All bookmarks have been re-indexed and should show up correctly again. Any issues that might still be lingering will be sorted out when we upgrade Elasticsearch, which we're planning for mid-December. Downloads should be working without the need for any workarounds now. Thank you for your patience!

The Good

We recently deployed new code, which fixed a couple of very old bugs and introduced improvements to the kudos feature. Behind the scenes, we've been working on setting up new servers and tweaking server settings to make everything run a little more smoothly during peak times. The end of the year (holiday season in many parts of the world) usually means more people with more free time to participate in more challenges, read more fic, or post more fanart, resulting in more site usage.

One way to measure site usage is looking at page views. This number tells us how many pages (a single work, a list of search results, a set of bookmarks in a collection, a user profile, etc.) were served to users during a certain time frame. Some of these pages can contain a lot of information that has to be retrieved from the database - and a lot of information being retrieved from the database at the same time can result in site slowness and server woes. During the first week of January we had 27.6 million page views. As of November 17 we registered 42.9 million page views for the preceding week.
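
Put as a growth rate, those two weekly figures work out to an increase of roughly 55%. A quick Python check, with variable names chosen only for this example:

```python
january_week = 27.6   # million page views, first week of January
november_week = 42.9  # million page views, week reported as of November 17

growth = (november_week - january_week) / january_week
print(f"{growth:.1%}")  # 55.4%
```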

We've watched our traffic stats grow dramatically over the years, and we've been doing our best to keep up with our users! Buying and installing more servers is one part of the solution, and we can't thank our all-volunteer Systems team enough for all their hard work behind the scenes. On the other hand, our code needs to be constantly reviewed and updated to match new demands.

Writing code that "scales" - that works well even as the site grows - is a complicated and never-ending task that requires a thorough understanding of how all parts of the Archive work together, not just right now, but in six months, or a year, or two years. As we're all volunteers who work on the Archive in our free time (or during lunch breaks), and there are only a handful of us with the experience to really dig deep into the code, this is less straightforward than a server acquisition and will take a little more time.

The Bad

As such, we've been battling some site slowness, sudden downtimes (thankfully brief due to our awesome Systems team) and an uptick in error pages. We can only ask for your patience as we investigate likely causes and discuss possible fixes.

For the time being, we have asked our intrepid tag wranglers to refrain from wrangling on Sundays, as this is our busiest day and moving a lot of tags around sadly adds to the strain on the current servers. We sincerely apologize to all wrangling volunteers who have to catch up with new tags on Monday, and to users who might notice delays (e.g. a new fandom tag that's not marked as canonical right away). From what we've seen so far, this move has helped in keeping the site stable on weekends.

The Ugly

We are aware of an issue with seemingly "vanishing" bookmarks, in which the correct number of bookmarks is displayed in the sidebar, but not all are actually shown. The most likely culprit is our search index, powered by a framework called elasticsearch. All our information (work content, tags, bookmarks, users, kudos, etc.) is stored in a database, and elasticsearch provides quicker, neater access to some of this data. This allows for fast searches, and lets us build lists of works and bookmarks (e.g. by tag) without having to ask the database to give us every single scrap of info each time.

It appears now that elasticsearch has become slightly out of sync with the database. We are looking into possible fixes and are planning an elasticsearch software upgrade; however, we must carefully test it first to ensure data safety.
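
In the simplest terms, "out of sync" means the set of records in the database and the set in the search index no longer match. The Python sketch below shows one way such drift could be spotted. It is illustrative only: the function name is invented, and it assumes both sets of ids fit in memory, which would not be true for a site this size without further work.

```python
def find_drift(database_ids, index_ids):
    """Compare record ids held by the database with those held by the search index."""
    in_db, in_index = set(database_ids), set(index_ids)
    return {
        "missing_from_index": sorted(in_db - in_index),  # exist, but never listed
        "stale_in_index": sorted(in_index - in_db),      # listed, but no longer exist
    }


print(find_drift(database_ids=[1, 2, 3, 4], index_ids=[1, 2, 4, 9]))
# {'missing_from_index': [3], 'stale_in_index': [9]}
```

A bookmark in the first group would be counted by the database total in the sidebar while never appearing in a listing built from the index.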

This problem also affects bookmark sorting, which has been broken for several weeks now. We are very sorry! If you want to know if a particular work has been updated, please consider subscribing to the work (look for the "Subscribe" button at the top of the page). This will send you a notification when a new chapter has been posted.

(Note: Since we're sending out a lot of notifications about kudos, comments and subscriptions every day, some email providers are shoving our messages into the junk folder or outright denying them passage to your account. Please add our email address to your contacts, create a filter to never send our emails to spam, or check the new "Social" tab in Gmail if you're waiting for notifications.)

A problem with file downloads only cropped up fairly recently. We don't think this is related to the most recent deploy, and will investigate possible causes. In the meantime, if a .pdf or .mobi file gives you an error 500, try downloading the HTML version first, then give it another shot. This should help until we've fixed the underlying problem.

What You Can Do

If you have not already done so, consider subscribing to our Twitter feed @AO3_Status or following us on Tumblr. You can also visit the AO3 News page for updates in the coming weeks or subscribe to the feed.

We thank everyone who has written in about their experiences, and will keep you all updated on our progress. Thank you for your patience as we work on this!


Post Header

2013-04-05 06:55:00 -0400

We're currently dealing with a few issues relating to our last deploy (Release 0.9.6), and we wanted to keep you in the loop on how we're handling them (and let you know about workarounds for a few problems):

Jumbled looking header

We launched a new design for our header, which should look like this. If it's looking jumbled, buttons are overlapping, or it otherwise looks broken, please refresh the page. If that doesn't help, you may need to clear your browser cache and then refresh again. If you're using a mobile browser on your phone or tablet, and clearing the cache alone doesn't help, try completely closing the app and opening it again.

Share box always open

Some users have reported that the 'Share' button that usually pops up a pre-formatted block of work information was replaced by the text it's supposed to hold. This severely messes up all work pages. It's due to a conflict with some userscripts people have installed in their browsers, which are interacting badly with a jQuery plugin we started using for our help boxes.

The quick and dirty way of solving this problem is to disable AO3-related userscripts.

The problem is caused by scripts which contain a @require line, which tells the script to grab a copy of jQuery from Google's servers to work with. That fresh copy then overrides our own JavaScript stack, causing wonkiness. If you want to keep using the script, you (or the original creator of the script) need to edit it so it doesn't use this line. A helpful Tumblr user has written up instructions on how you can edit a userscript yourself, although this might not work in all browsers. Especially if you're using Chrome, try this little script change instead.
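
If you're not sure whether a script you have installed contains such a line, a small Python check like the one below can find it. The sample script and its jQuery URL are made up for illustration; the function simply lists whatever each @require line points to.

```python
import re

# Matches userscript metadata lines such as: // @require <url>
REQUIRE_LINE = re.compile(r"^\s*//\s*@require\s+(\S+)", re.MULTILINE)

SAMPLE = """\
// ==UserScript==
// @name        Example AO3 helper (hypothetical)
// @require     https://ajax.googleapis.com/ajax/libs/jquery/1.9.1/jquery.min.js
// ==/UserScript==
"""


def required_urls(userscript_source):
    """Return the URL from every @require line in the script."""
    return REQUIRE_LINE.findall(userscript_source)


print(required_urls(SAMPLE))
```

An empty list means the script has no @require line and is unlikely to be the cause of this particular problem.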

Subscription emails are missing information

We tested all possible Archive emails in various browsers and clients, which amounts to a lot of emails. In all this testing we missed that some information had vanished from some kinds of subscription notifications, for which we apologize.

Many, many users wrote into Support about this and commented on our Release Notes, and we are rolling out a very quick fix to bring back total chapter count, work summaries, and additional tags to all notifications.

New styling of emails

We launched new multipart emails which take advantage of HTML styling. This was to give us more control over the layout of emails, and to help provide a more consistent look. Our old mailer templates were mostly a mix of text and some HTML mark-up (for paragraph breaks, links, and some text styling) that did not actually declare itself properly to email clients. This raised our spam score, looked broken in some text-based email programs, and made it harder to add emails for new features, as there was no consistent style to base them on.
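
For readers unfamiliar with the term, a multipart email carries a plain-text version and an HTML version of the same message and declares both properly, so the email client can choose which to show. The Python sketch below uses the standard library to build one. It is not the Archive's mailer code, and the function name, subject, and bodies are placeholders.

```python
from email.message import EmailMessage


def build_notification(subject, text_body, html_body):
    """Build a message with a plain-text part and an HTML alternative."""
    msg = EmailMessage()
    msg["Subject"] = subject
    msg.set_content(text_body)                      # the text/plain part
    msg.add_alternative(html_body, subtype="html")  # the text/html part
    return msg


msg = build_notification(
    "New chapter posted",
    "A work you subscribe to has a new chapter.",
    "<p>A work you subscribe to has a <strong>new chapter</strong>.</p>",
)
print(msg.get_content_type())                         # multipart/alternative
print([p.get_content_type() for p in msg.iter_parts()])
```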

We assumed that users who preferred text-only emails would select this option in their email clients; however, it's become clear that this isn't meeting our users' needs, for various reasons, and lots of people would prefer to have a text-only option.

We've taken note of all the (strongly-felt!) responses to the new emails, and we're looking at solutions, including adding a user preference for plain text emails. We need some time for our coders to look at the issue and figure out the best way forward: when we've been able to do this we'll update users on what we are planning to do and when we will do it. Please rest assured we are taking user feedback very seriously; we appreciate your patience while we work on this.

Some users also raised specific concerns about the styling, notably the size of the font in certain email clients / devices. We'll also be looking to address these issues and tweaking the HTML emails themselves.


Post Header

2012-08-23 15:06:52 -0400

We've been receiving a small number of reports of people unable to access the Archive of Our Own - if you've been affected by this issue, this post will give you a bit more information about what's going on. We'd also like to appeal for your help as we work to fix it!

What's wrong

We recently upgraded our firewall to improve the security of our servers. Unfortunately, it seems that we haven't got the configuration absolutely right and it's causing connection problems for some users. This problem is only affecting a small number of users, and it's not completely consistent. However, if you've received an Error 404, a warning saying 'Secure Connection Failed', or you've been redirected to a url with 8080 in it, then this is what's causing it.

While we worked on the issue, we temporarily disabled https on the site, as that was causing some additional problems. This means that if you have a browser extension such as HTTPS Everywhere enabled, or you use a browser which enforces https by default, then the site will not load - apologies for this. If the site has been consistently timing out for you, it's worth checking if this applies in your case - if the url defaults to an https address then you have been affected by this issue.

How you can work around it

If you're being affected by the https issue, you can work around it by adding an exception to HTTPS Everywhere, or using a different browser.

If you're getting errors at random, then clearing your browser cache and refreshing should help. You may also find it helps to use another browser.

How you can help us

We're working to get to the bottom of this problem, and we know we've already reduced the number of errors which are occurring. However, it would be enormously helpful for our Systems team to have a little more information. If you encounter an error, please submit a Support request giving the following information:

  • Your IP address: You can find this out by going to
  • The url of the page where you got the error
  • The exact error you got: You may find it easiest to copy and paste the error. If you didn't get an error but the page just never loads, tell us that.
  • What time (UTC) you got the error: Please check the current time in UTC when you get the error - this will make it easier for us to keep track, since we're dealing with users in lots of different timezones.
  • Is the error intermittent or constant?
  • What browser are you using? It would be extra helpful if you can tell us your user agent string, which you can find out by going to

If you know how to view the source of a page in the browser, it would also be very helpful if you could copy and paste the source code of the page that throws up the problem.

If you're comfortable working on the command line, then it would also be helpful if you could provide us with some additional information (if you're already wondering what we're talking about, don't worry, you can ignore this bit). Open up a command line window and type nslookup Copy whatever pops up in your console and paste it into your Support message.
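For anyone who would rather script this, here is a minimal Python sketch that gathers the same details into one block of text. It is only an illustration: the function names and the report layout are our own invention, not an official Support format, and the hostname you pass to `resolve_addresses` is whatever address you were trying to reach.

```python
import socket
from datetime import datetime, timezone


def resolve_addresses(hostname):
    """Return the sorted, de-duplicated IP addresses that DNS reports for hostname."""
    infos = socket.getaddrinfo(hostname, None)
    return sorted({info[4][0] for info in infos})


def build_report(page_url, error_text, addresses, intermittent, when=None):
    """Format the details requested above as plain text for a Support message."""
    # Default to the current time in UTC, since that is what the Systems team asks for.
    when = when or datetime.now(timezone.utc)
    lines = [
        "URL of the page: " + page_url,
        "Exact error: " + error_text,
        "Time (UTC): " + when.strftime("%Y-%m-%d %H:%M"),
        "Intermittent or constant: " + ("intermittent" if intermittent else "constant"),
        "DNS lookup result: " + (", ".join(addresses) or "no addresses returned"),
    ]
    return "\n".join(lines)
```

You would call `resolve_addresses` with the site's hostname, pass the result to `build_report` along with the page url and error text, and paste the returned text into your Support message.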

If you can't access the Archive at all (and thus can't submit a Support request there) you can send us this information via our backup Support form.

A note on https

We know that many people prefer to use an https connection to provide additional security on the web, and we will be re-enabling this option as soon as we can. Because the AO3 doesn't handle data such as credit card information or similar, browsing without https doesn't expose our users to any significant security risks. However, it is always a good policy to use a unique password (i.e. don't use the same username and password combo for the AO3 and your email account), so that if for any reason someone else obtains your AO3 credentials, they can't use them to access other data. Apologies for the inconvenience to users while this option is disabled.

ETA for a fix

We're hoping to resolve these lingering problems asap; however, our Systems team have limited time, so we may not be able to track down the root of the problem as fast as we'd like. We'll keep you updated, and in the meantime apologise for the inconvenience.


2012-01-17 17:05:30 -0500

As many users will no doubt have noticed, the AO3 has been experiencing some performance issues since the start of the year. When we posted on 5th January, we were expecting those problems to ease once the holiday rush was over. However, that hasn't turned out to be the case. We're working on ways of dealing with the performance issues, but we wanted to keep you updated with what's going on while we do that.

Why the slowdowns?

In the past month, over 2000 new users have created accounts on the Archive. At the same time, the number of people reading on the Archive - with or without accounts - has been steadily growing. This has been part of a general trend, as you can see if you look at the graph showing number of visits to the Archive since November:

Line graph showing number of visits to the AO3, November to January. The line gradually goes up (with spikes on Sundays) before peaking dramatically on Jan 2.

We're always much busier on Sundays, but the number of visits has been gradually going up each week since November (and the same holds true for the preceding months). However, before December we were hovering around the 135,000 level for visitor numbers at peak times. You can see that the visitor numbers began to climb more dramatically in December, peaking on 2nd January when we had 182,958 visitors. Crucially, after that spike it didn't drop back down to anything like the levels it had been at previously: we're now at more than 150,000 visits on a regular day, and more than 165,000 on Sundays, our busiest day. Wow!

We were expecting a big spike over the holidays, when there are lots of challenges and lots of people with a little spare time for reading and creating. However, we hadn't expected site usage to remain quite so high after the holidays were over! The increases mean that the site is now under a holiday load every day, which is one reason things have been running a little slowly.

The other reason for the slowdowns is that the increase in our number of registered users, and the holiday challenge season, has produced a big increase in the number of works. In fact, 11,516 new works have been posted since the end of December already! More data in our databases means more work for things like sorting, searching, etc - this means that sometimes the database just doesn't serve up the result you need in time, and the unicorn which is waiting to get that result gives up and goes away (yes, really - our servers are assisted by unicorns :D).

We've been expecting this general effect for a while now, and we've been working towards implementing things to deal with it; however, we weren't expecting quite such a big jump in site usage in the past month!

What are you doing about this?

The Accessibility, Design & Technology and Systems Committees had a special meeting on Saturday to discuss ways of dealing with the immediate problem, as well as longer term plans. It can be tricky to test for high load situations before they actually occur, but once they do occur there's lots of data we can gather to help us address the most crucial issues. (We're also working on implementing more tools which will help us test this stuff before it comes up.)

Short term

More caching: We already cache pages (or sections of pages) across the site - this means we store a copy which we can serve up directly, instead of creating the page every time someone wants to use it. If something changes, then the cache is expired and a new, updated copy is created. Hitherto, we've focused on caching chunks of information which are unlikely to change rapidly: for example, on any works index the 'blurbs' which show the information about each work are cached. However, some of the heaviest load is caused by rapidly changing pages like the works index. We're moving towards more caching of whole pages, so that a new copy of the works index (for example) will be created every five minutes rather than generated each time someone asks for it. This means things like works indexes will be a little slower to update - when you add a new work, it won't appear on the list until the cache expires - but that five minute delay will massively reduce the weight on our servers.
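To make the five minute idea concrete, here is a minimal sketch of a time-based page cache in Python. It is not the Archive's actual code (which isn't shown here); the class name, the `render` callback and the injectable `clock` are all assumptions made for illustration.

```python
import time


class TimedPageCache:
    """Keep rendered pages for a fixed number of seconds before rebuilding them."""

    def __init__(self, ttl_seconds=300, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds  # 300 seconds = the five minutes described above
        self.clock = clock              # injectable so the expiry logic can be tested
        self._entries = {}              # key -> (expiry time, rendered page)

    def fetch(self, key, render):
        """Return the cached page for key, calling render() only if it is missing or expired."""
        now = self.clock()
        entry = self._entries.get(key)
        if entry is not None and now < entry[0]:
            return entry[1]  # still fresh: serve the stored copy
        page = render()      # expensive step: build the page from scratch
        self._entries[key] = (now + self.ttl_seconds, page)
        return page
```

However many people request the works index within a five minute window, `render` runs once; the trade-off is that a newly posted work only shows up after the stored copy expires.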

More indexes: We have a few places in our databases - for example the tables for the skins - which could use more indexes. Indexes speed things up because the server can just search through those rather than the whole table. So, we're hunting out places where more indexes are needed, and implementing them. :)
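Here is a small illustration of the effect, using Python's built-in SQLite module as a stand-in for our real database. The `skins` table layout, the `author_id` column and the index name are made up for the example; the point is only that the database stops scanning the whole table once an index exists.

```python
import sqlite3


def query_plan(conn, sql, params=()):
    """Return SQLite's description of how it would execute the query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The human-readable detail is the last column of each plan row.
    return " | ".join(row[-1] for row in rows)


def demo_index():
    """Compare the query plan for the same lookup before and after adding an index."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE skins (id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT)")
    conn.executemany(
        "INSERT INTO skins (author_id, title) VALUES (?, ?)",
        [(n % 50, "skin %d" % n) for n in range(1000)],
    )
    sql = "SELECT title FROM skins WHERE author_id = ?"
    before = query_plan(conn, sql, (7,))  # no index yet: a full table scan
    conn.execute("CREATE INDEX idx_skins_author ON skins (author_id)")
    after = query_plan(conn, sql, (7,))   # now a search using the new index
    conn.close()
    return before, after
```

Printing the two values returned by `demo_index()` shows the plan changing from a scan of the table to a search that names `idx_skins_author`.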

Medium term

Bad queries must die: We have a few queries which are very long and complicated, and take a long time to run. We need to rewrite these bits of the code to make them simpler and faster! In many cases this will be quite complicated (or else we would have done it already), but it's a priority to help us speed things up.

New filters for great justice: The filters that are implemented on our index pages are not really optimal considering the size of the site now - the limitations of that code are the reason we have to have a 1000 work cap on the number of works returned. We have been working on this for a long time - we need to completely throw out what we have and implement a system which works better for the site as it is now. Again, this is really complicated, which is why it's taken us a long time to achieve it even though we knew it was important - the good news is that we have now done quite a lot of work on this area and the first round of changes should be out in the next few months.

Long term

Long term, we're going to be moving to a setup which allows us to distribute our site load across more servers. This will involve database sharding - putting different bits of the database on separate servers - so it will take quite a lot of planning and expertise. If you're a user of Livejournal or Dreamwidth, you might be aware that your journal is hosted on a certain 'cluster' - we'd be moving to a similar system. We want to make sure we do this right, but based on the way the site is growing we think this is now high priority, and our Systems team are working to figure out the right ways forward.
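As a rough picture of what sharding means, the sketch below assigns each user to one of a fixed number of clusters by hashing their user name. This is just one simple scheme, chosen for illustration; we haven't decided how the Archive will split its data, and the function names here are hypothetical.

```python
import hashlib


def cluster_for(user_name, cluster_count):
    """Map a user name to a cluster number in range(cluster_count), stably across runs."""
    # A cryptographic hash is used only because it gives the same answer every time,
    # unlike Python's built-in hash(), which is randomised per process for strings.
    digest = hashlib.sha256(user_name.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % cluster_count


def group_by_cluster(user_names, cluster_count):
    """Group user names by the cluster that would hold their data."""
    clusters = {n: [] for n in range(cluster_count)}
    for name in user_names:
        clusters[cluster_for(name, cluster_count)].append(name)
    return clusters
```

A scheme like this has a known drawback: changing the number of clusters moves most users to a different one, which is part of why this kind of change needs careful planning.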


We know it's really frustrating when the site runs slow or is timing out on you: many apologies. We really appreciate users' patience while we deal with the issues. As you'll see from the above, there are some immediate things we can do to ease the problems, and we also have a good sense of where we need to go from here. So, while these changes need to be implemented as a matter of urgency, we feel confident we will be able to tackle the problems. If you have expertise in the areas of performance, scalability and database management, we would very much welcome additional volunteers.

As we move forward on dealing with problem spots on the site, we may implement some changes which are visible to users: the caching on the index pages and the changes to browsing and searching are two of the most obvious. We'll let you know about this as we go along - we think the effect will be beneficial for everyone, but do be prepared for a few changes! You can keep up with status and deploy news on our Twitter @AO3_Status.

While the growth in the site means we're facing some problems a little sooner than we expected, we're really excited about the fact so many people want to read and post to the AO3. Thanks to everyone for your fannish energy - and apologies for the fact we sometimes slow you down a little.

