AO3 News

Post Header

Published:
2018-05-18 16:19:38 -0400

Following our May 10 deploy, the Archive experienced a number of issues stemming primarily from increased load during the Elasticsearch upgrade process.

As we noted in our March downtime post, the Archive hasn't been running at full strength due to this upgrade. Compounding the issue, it has taken significantly longer than planned to get the new code deployed to production, and we are now entering one of the more active times of the year. (Our daily page views for a Sunday -- our busiest day -- are over 29 million, and the normal load on our database servers is over a million queries per minute.)

You can find more details on the current state of the Archive's servers below, along with a rough timeline of the issues we experienced between Thursday, May 10, and Monday, May 14. However, the main takeaway is that these issues are likely to continue until the Elasticsearch upgrade is completed and our network capacity is increased. We're very grateful for the support and patience you've shown, and we look forward to finishing our upgrades so we can provide you with a stable Archive once more.

Background: Server state

We normally have five Elasticsearch servers, but late last year we turned one of our front end machines into an Elasticsearch server, allowing us to divide these six machines into two groups: one three-machine cluster for the production site, and another for testing the upgraded code.

Having only three Elasticsearch servers meant the site experienced significant issues, so on April 11, we reprovisioned one of our old database servers, which had been producing web pages, as an Elasticsearch server in the production cluster.

In addition to the ongoing Elasticsearch upgrade, our Systems team recently completed a major overhaul intended to help with our long-term stability and sustainability. Between November 2017 and March 2018, they reinstalled all the application servers, web front ends, and new Elasticsearch systems with a new version of the Debian operating system (Stretch) using FAI and Ansible. This meant rewriting the configuration from the ground up, since we had previously used FAI and CFEngine. They also upgraded various other packages during this process, and now all that's left to upgrade for the Archive are the database servers.

Timeline

May 10

16:25 UTC: We deploy the code update that will allow us to run the old and new Elasticsearch code simultaneously. (We know the new version still has a few kinks, and we expect to find more, so we're using a Redis-based system called rollout to make sure internal volunteers get the new code while everyone else gets the old version.) Because this is our first deploy since the application servers have been reinstalled, the deploy has to be done by hand.
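As a rough illustration of the gating idea, the sketch below shows a group-based feature flag in Python. It is not the rollout library's API (rollout is a Ruby library that keeps its state in Redis); the `FeatureFlags` class, the in-memory store, and the feature and group names are all hypothetical.

```python
class FeatureFlags:
    """Toy group-based feature flag store (in-memory stand-in for Redis)."""

    def __init__(self):
        # feature name -> set of group names the feature is active for
        self._active_groups = {}
        # group name -> predicate deciding whether a user belongs to the group
        self._groups = {}

    def define_group(self, name, predicate):
        self._groups[name] = predicate

    def activate_group(self, feature, group):
        self._active_groups.setdefault(feature, set()).add(group)

    def is_active(self, feature, user):
        # Every check is a lookup against the store; with Redis behind it,
        # each call would be a network round trip.
        for group in self._active_groups.get(feature, ()):
            predicate = self._groups.get(group)
            if predicate is not None and predicate(user):
                return True
        return False


def search_backend_for(flags, user):
    """Pick the code path for a user (feature name is hypothetical)."""
    return "new" if flags.is_active("new_search", user) else "old"


flags = FeatureFlags()
flags.define_group("internal_volunteers", lambda user: user.get("volunteer", False))
flags.activate_group("new_search", "internal_volunteers")
```

Because a check like `is_active` runs on every relevant request, a store-backed flag system can generate a steady stream of small lookups, which is consistent with it being an early suspect for the extra traffic.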

16:56 UTC: We turn on the new Elasticsearch indexing.

21:03 UTC: We notice -- and fix -- some issues with site skins that resulted from doing a manual deploy.

May 11

05:00 UTC: We see large amounts of traffic on ao3-db06, which is both the Redis server we use for Resque and the MySQL server responsible for writes. We mistakenly believe the traffic is caused by the number of calls to rollout to check if users should see the new filters.

05:36 UTC: We increase the number of Resque workers.

10:06 UTC: The Resque queue is still high, so we increase the number of workers again.

21:00 UTC: We no longer believe the increased traffic is due to rollout, so we turn the new indexing off and schedule 45 minutes of downtime for 06:15 UTC the following morning.

May 12

06:15 UTC: In order to mitigate the extra traffic, we move Redis onto a second network interface on ao3-db06. However, routing means the replies return on the first interface, so it is still overwhelmed.
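One common way replies end up on a different interface than the one the request arrived on is destination-based routing: the reply's outgoing interface is chosen from the client's address, not from where the request came in. The sketch below models that with a longest-prefix match. It is only an illustration; the addresses, interface names, and routing table are hypothetical and are not the Archive's actual configuration.

```python
import ipaddress

# Hypothetical routing table: (destination network, outgoing interface).
ROUTES = [
    (ipaddress.ip_network("10.0.1.0/24"), "eth0"),  # first interface
    (ipaddress.ip_network("10.0.2.0/24"), "eth1"),  # second interface
    (ipaddress.ip_network("0.0.0.0/0"), "eth0"),    # default route
]


def outgoing_interface(destination):
    """Return the interface chosen by longest-prefix match on the destination."""
    addr = ipaddress.ip_address(destination)
    matches = [(net, iface) for net, iface in ROUTES if addr in net]
    # The most specific route (largest prefix length) wins.
    return max(matches, key=lambda match: match[0].prefixlen)[1]


# A client on the first network sends its request to the server's eth1 address...
client = "10.0.1.50"
# ...but the reply is routed by the client's address, so it leaves on eth0.
reply_interface = outgoing_interface(client)
```

In this model, moving a service to a second interface only relieves the first one if the routing for replies is changed as well.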

06:42 UTC: We extend the downtime by 30 minutes so we can change the new interface to a different network, but replies still return on the wrong interface.

07:26 UTC: Since we've used up our downtime window, we roll the change back.

After that, we spend large parts of the day trying to figure out what caused the increased traffic on ao3-db06. With the help of packet dumps and Redis monitoring, we learn that indexing bookmarks on new Elasticsearch is producing a large number of error messages which are stored in Redis and overwhelming the network interface.

May 13

Our coders spend most of Sunday trying to determine the cause of the Elasticsearch errors. We look at logs and try a number of solutions until we conclude that Elasticsearch doesn’t appear to support a particular code shortcut when under load, although it's not clear from the documentation why that would be.

20:45 UTC: We change the code to avoid using this shortcut and confirm that it solves the issue, but we do not resume the indexing process.

23:45 UTC: The Resque Redis instance on ao3-db06 freezes, likely due to load. As a result, some users run into errors when trying to leave comments, post works, or submit other forms.

May 14

06:30 UTC: We restart Redis, resolving the form submission errors. However, we begin to receive reports of two other issues: downloads not working and new works and bookmarks not appearing on tag pages.

16:25 UTC: To help with the download issues, we re-save our admin settings, ensuring the correct settings are in the cache.

16:34 UTC: Now we look into why works and bookmarks aren't appearing. Investigating the state of the system, we discover a huge InnoDB history length (16 million rather than our more normal 2,000-5,000) on ao3-db06 (our write-related MySQL server). We kill old sleeping connections and the queue returns to normal. The server also returns to normal once the resultant IO has completed.
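For readers unfamiliar with the metric, the InnoDB history length is reported as a "History list length" line in the output of MySQL's `SHOW ENGINE INNODB STATUS`. The sketch below shows one way a monitoring script might pull that number out of the status text and compare it with the normal range quoted above. The function names and the sample text are hypothetical, and the exact output layout may vary between MySQL versions.

```python
import re

# Upper end of the "more normal" range quoted above.
NORMAL_MAX = 5_000


def history_list_length(status_text):
    """Extract the history list length from InnoDB status text, or None."""
    match = re.search(r"History list length\s+(\d+)", status_text)
    return int(match.group(1)) if match else None


def is_backlogged(status_text, threshold=NORMAL_MAX):
    """True when the history list length is above the given threshold."""
    length = history_list_length(status_text)
    return length is not None and length > threshold


# Made-up excerpt in the shape of the TRANSACTIONS section of the status output.
sample = """
------------
TRANSACTIONS
------------
Trx id counter 123456789
History list length 16000000
"""
```

A long history list generally means old row versions cannot be purged yet, often because some connection is holding a transaction open, which fits with the backlog clearing once the old sleeping connections were killed.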

16:55 UTC: Bookmarks and works are still refusing to appear, so we clear Memcached in case caching is to blame. (It's always -- or at least frequently -- caching!)

17:32 UTC: It is not caching. We conclude Elasticsearch indexing is to blame and start reindexing bookmarks created in the last 21 hours.

17:43 UTC: New bookmarks still aren't being added to tag listings.

17:54 UTC: We notice a large number of Resque workers have died and not been restarted, indicating an issue in this area.

18:03 UTC: We apply the patch that prevents the bookmark indexing errors that previously overwhelmed ao3-db06 and then restart all the unicorns and Resque workers.

18:43 UTC: Once everything is restarted, new bookmarks and works begin appearing on the tag pages as expected.

19:05 UTC: The site goes down. We investigate and determine the downtime is related to the number of reindexing workers we restarted. Because we believed we had hotfixed the issue with the reindexing code, we started more reindexing workers than usual to help with the indexing process. However, when we started reindexing, we went above 80% of our 1 Gbit/sec of ethernet to our two MySQL read systems (ao3-db01 and ao3-db05).
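To put the 80% figure in more familiar units, the short calculation below converts it to megabytes per second and shows how utilization could be computed from a byte counter. The sample traffic figure is made up for illustration; only the 1 Gbit/sec capacity and the 80% mark come from the timeline above.

```python
LINK_CAPACITY_BITS_PER_SEC = 1_000_000_000  # 1 Gbit/sec, as stated above


def utilization(bytes_transferred, seconds, capacity_bps=LINK_CAPACITY_BITS_PER_SEC):
    """Fraction of link capacity used over a sampling interval."""
    bits_per_sec = bytes_transferred * 8 / seconds
    return bits_per_sec / capacity_bps


# 80% of 1 Gbit/sec expressed in megabytes per second (about 100 MB/sec).
threshold_mb_per_sec = 0.80 * LINK_CAPACITY_BITS_PER_SEC / 8 / 1_000_000

# Hypothetical sample: 6.3 GB moved in 60 seconds is 84% of the link.
sample = utilization(6_300_000_000, 60)
```

In other words, a sustained transfer of roughly 100 megabytes per second to a single server is enough to reach the level described here.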

19:58 UTC: After rebalancing the traffic over the two read MySQL instances and clearing the queues on the front end, the indexers have stopped, the long queues for pages have dissipated, and the site is back.

Takeaways

  • We will either need multiple bonded ethernet or 10 Gbit/sec ethernet in the very near future. While we were already expecting to purchase 10 Gbit networking in September, this purchase may need to happen sooner.
  • Although it has not been budgeted for, we should consider moving Redis on to a separate new dedicated server.

While we are running with reduced capacity in our Elasticsearch cluster and near the capacity of our networking, the reliability of the Archive will be adversely affected.

Post Header

Published:
2018-03-31 05:52:48 -0400

For the past several weeks, the Archive has been experiencing brief, but somewhat frequent, periods of downtime due to our search engine becoming overwhelmed. This is because we're working on upgrading from Elasticsearch 0.90 to 6.0, so only half of the servers we'd typically use to power the Archive's search engine are available -- the others are running the upgraded version so we can test it.

The good news is the downtime should stop once our upgrade is complete and all servers are running Elasticsearch 6.0. While we can't estimate when that will be, we're working hard to wrap up our testing and fix any remaining bugs as quickly as possible, and we'll have more information on the upgrade coming soon.

We've made some minor adjustments to minimize the downtime, although you may notice some slowness, and downtime may still occur during particularly busy periods. Please rest assured we have systems in place to alert our volunteers of the issue, and it will generally be resolved within 30 minutes.

For now, we offer our sincerest apologies, and we'll continue to monitor the situation and make whatever adjustments we can to improve it. As always, you can follow @AO3_Status on Twitter for updates.

Post Header

Published:
2015-11-04 17:53:27 -0500

At approximately 23:00 UTC on October 24, the Archive of Our Own began experiencing significant slowdowns. We suspected these slowdowns were being caused by the database, but four days of sleuthing led by volunteer SysAdmin james_ revealed a different culprit: the server hosting all our tag feeds, Archive skins, and work downloads. While a permanent solution is still to come, we were able to put a temporary fix in place and restore the Archive to full service at 21:00 UTC on October 29.

Read on for the full details of the investigation and our plans to avoid a repeat in the future.

Incident Summary

On October 24, we started to see very strange load graphs on our firewalls, and reports started coming in via Twitter that a significant number of users were getting 503 errors. There had been sporadic reports of issues earlier in the week as well, but we attributed these to the fact that one of our two front-end web servers had a hardware issue and had to be returned to its supplier for repair. (The server was returned to us and put back into business today, November 4.)

Over the next few days, we logged tickets with our MySQL database support vendor and tried adjusting our configuration of various parts of the system to handle the large spikes in load we were seeing. However, we still were unable to identify the cause of the issue.

We gradually identified a cycle in the spikes, and began, one-by-one, to turn off internal loads that were periodic in nature (e.g., hourly database updates). Unfortunately, this did not reveal the problem either.

On the 29th of October, james_ logged in to a virtual machine that runs on one of our servers and noticed it felt sluggish. We then ran a small disc performance check, which showed severely degraded performance on this server. At this point, we realised that our assumption that the application was being delayed by a database problem was wrong. Instead, our web server was being held up by slow performance from the NAS, which is a different virtual machine that runs on the same server.
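We do not record here which tool was used for the disc check. As one hypothetical example of such a check, the sketch below times a small sequential write that is forced out to the disc and reports the throughput. The function name and the default size are assumptions made for illustration.

```python
import os
import tempfile
import time


def write_throughput_mb_per_sec(directory, size_mb=8):
    """Time a sequential write of size_mb megabytes, flushed to disc."""
    chunk = b"\0" * (1024 * 1024)
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        start = time.perf_counter()
        with os.fdopen(fd, "wb") as handle:
            for _ in range(size_mb):
                handle.write(chunk)
            handle.flush()
            os.fsync(handle.fileno())  # force the data out of the page cache
        elapsed = time.perf_counter() - start
    finally:
        os.remove(path)  # clean up the temporary file
    return size_mb / elapsed
```

Running a check like this on the suspect machine and on a healthy one gives a quick comparison, although a single small write is only a rough signal and not a full benchmark.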

The NAS holds a relatively small number of static files, including skin assets (such as background images), tag feeds, and work downloads (PDF, .epub, and .mobi files of works). Most page requests made to the Archive load some of these files, which are normally delivered very quickly. But because the system serving those assets was having problems, the requests backed up until a cascade of them timed out (causing the spikes and temporarily clearing the backed-up results).

To fix the issue, we had to get the NAS out of the system. The skin assets were immediately copied to local disc instead, and we put up a banner warning users that tag feeds and potentially downloads would need to be disabled. After tag feeds were disabled, the service became more stable, but there were further spikes. These were caused by our configuration management system erroneously returning the NAS to service after we disabled it.

After a brief discussion, AD&T and Systems agreed to temporarily move the shared drive to one of the front-end servers. This shared drive represents a single point of failure, however, which is undesirable, so we also agreed to reconfigure the Archive to remove this single point of failure within a few months.

Once the feeds and downloads were moved to the front-end server, the system became stable, and full functionality returned at 21:00 (UTC) on the 29th of October.

We still do not know the cause of the slowdown on the NAS. Because it is a virtual machine, our best guess is that the problem is with a broken disc on the underlying hardware, but we do not have access to the server itself to test. We do have a ticket open with the virtual machine server vendor.

The site was significantly affected for 118 hours. Our analytics platform shows a drop of about 8% in page views for the duration of the issue, and pages took significantly longer to deliver, meaning we were providing a reduced service the whole time.
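The 118-hour figure follows from the start and end times given in this post (approximately 23:00 UTC on October 24 to 21:00 UTC on October 29), as this small check shows. The year is taken from the post's publication date.

```python
from datetime import datetime, timezone

# Start and end of the incident, as stated in this post.
start = datetime(2015, 10, 24, 23, 0, tzinfo=timezone.utc)
end = datetime(2015, 10, 29, 21, 0, tzinfo=timezone.utc)

# Duration in hours: 4 days and 22 hours.
affected_hours = (end - start).total_seconds() / 3600
```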

Lessons Learnt

  • Any single point of failure that still remains in our architecture must be redesigned.
  • We need enough spare server capacity that, in the case of hardware failure, we can pull a server from a different function and be confident its hardware is sufficient for the new role. For instance, we hadn't moved a system from its role as a resque worker into being a front-end machine because of worries about the machine's lack of SSD. We have ordered SSD upgrades for the two worker systems so that this worry should not recur. Database servers are more problematic because of their cost. However, when the current systems are replaced, the old systems will become app servers, and they could be returned to their old role in an emergency at reduced usage.
  • We are lacking centralised logging for servers. This would have sped up the diagnostic time.
  • Systems should have access to a small budget for miscellaneous items such as extra data center smart hands, above the two hours we already have in our contract. For example, a US$75 expense for this was only approved on October 26 at 9:30 UTC, 49 hours after it was requested. Delays like this force Systems to work around such restrictions and waste time.
  • We need to be able to occasionally consult specialized support. For instance, at one point we attempted to limit the number of requests by IP address, but this limit was applied to our firewall's IP address on one server but not the other. We would recommend buying support for nginx at US$1,500 per server per year.
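The rate-limiting pitfall in the last item can be sketched in a few lines. When requests pass through a firewall or proxy, the web server may see the firewall's address as the peer for every request, so a per-IP limit keyed on that address throttles all users together. The example below is a toy model and not nginx configuration; the class, the function, and the addresses are hypothetical.

```python
from collections import defaultdict


class FixedWindowLimiter:
    """Allow at most `limit` requests per key per window."""

    def __init__(self, limit):
        self.limit = limit
        self.counts = defaultdict(int)

    def allow(self, key):
        self.counts[key] += 1
        return self.counts[key] <= self.limit

    def reset(self):
        """Call at the start of each new window."""
        self.counts.clear()


def client_key(peer_ip, forwarded_for, trusted_proxies):
    """Use the forwarded client address only when the peer is a trusted proxy."""
    if peer_ip in trusted_proxies and forwarded_for:
        return forwarded_for
    return peer_ip


FIREWALL = "10.0.0.1"  # hypothetical internal address of the firewall
```

Keying the limiter on `client_key(...)` rather than on the raw peer address keeps one busy client from exhausting the allowance shared by everyone behind the firewall, provided the forwarded address is only trusted when it comes from a known proxy.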

Technologies to Consider

  • Investigate maxscale rather than haproxy for load balancing MySQL.
  • Investigate RabbitMQ as an alternative to resque. This makes sense when multiple servers need to take an action, e.g. cache invalidation.
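The distinction behind the last item is between a work queue, where each job goes to exactly one worker, and a fanout, where every subscriber receives its own copy of a message. The toy in-memory classes below illustrate the two delivery patterns. They are not the Resque or RabbitMQ APIs, and the server names in the usage example are hypothetical.

```python
from collections import deque


class WorkQueue:
    """Each job is handed to exactly one worker (work-queue delivery)."""

    def __init__(self):
        self.jobs = deque()

    def publish(self, job):
        self.jobs.append(job)

    def consume(self):
        # Whichever worker asks first gets the job; nobody else sees it.
        return self.jobs.popleft() if self.jobs else None


class FanoutExchange:
    """Each message is copied to every subscriber's own queue."""

    def __init__(self):
        self.queues = {}

    def subscribe(self, name):
        self.queues[name] = deque()
        return self.queues[name]

    def publish(self, message):
        for queue in self.queues.values():
            queue.append(message)


# Usage: a cache invalidation must reach every app server, not just one.
exchange = FanoutExchange()
app01 = exchange.subscribe("app01")
app02 = exchange.subscribe("app02")
exchange.publish("invalidate:tag_feed")
```

With a plain work queue, only one server would process the invalidation, which is why a fanout-style broker is attractive when multiple servers need to take the same action.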

Post Header

Published:
2014-01-23 16:26:51 -0500

If you're a regular Archive visitor or if you follow our AO3_Status Twitter account, you may have noticed that we've experienced a number of short downtime incidents over the last few weeks. Here's a brief explanation of what's happening and what we're doing to fix things.

The issue

Every now and then, the volume of traffic we get and the amount of data we're hosting start to hit the ceiling of what our existing infrastructure can support. We try to plan ahead and start making improvements in advance, but sometimes things simply catch up to us a little too quickly, which is what's happening now.

The good news is that we do have fixes in the works: we've ordered some new servers, and we hope to have them up and running soon. We're making plans to upgrade our database system to a cluster setup that will handle failures better and support more traffic; however, this will take a little longer. And we're working on a number of significant code fixes to improve bottlenecks and reduce server load - we hope to have the first of those out within the next two weeks.

One area that's affected is the number of hits, kudos, comments, and bookmarks on works, so you may see delays in those updating, which will also result in slightly inaccurate search and sort results. Issues with the "Date Updated" sorting on bookmark pages will persist until a larger code rewrite has been deployed.

Behind the scenes

We apologize to everyone who's been affected by these sudden outages, and we'll do our best to minimize the disruption as we work on making things better! We do have an all-volunteer staff, so while we try to respond to server problems quickly, sometimes they happen when we're all either at work or asleep, so we can't always fix things as soon as we'd like to.

While we appreciate how patient and supportive most Archive users are, please keep in mind that tweets and support requests go to real people who may find threats of violence or repeated expletives aimed at them upsetting. Definitely let us know about problems, but try to keep it to language you wouldn't mind seeing in your own inbox, and please understand if we can't predict immediately how long a sudden downtime might take.

The future

Ultimately, we need to keep growing and making things work better because more and more people are using AO3 each year, and that's something to be excited about. December and January tend to bring a lot of activity to the site - holiday gift exchanges are posted or revealed, people are on vacation, and a number of fandoms have new source material.

We're looking forward to seeing all the new fanworks that people create this year, and we'll do our best to keep up with you! And if you're able to donate or volunteer your time, that's a huge help, and we're always thrilled to hear from you.
