AO3 News

Published: 2015-11-04 17:53:27 -0500

At approximately 23:00 UTC on October 24, the Archive of Our Own began experiencing significant slowdowns. We suspected these slowdowns were being caused by the database, but four days of sleuthing led by volunteer SysAdmin james_ revealed a different culprit: the server hosting all our tag feeds, Archive skins, and work downloads. While a permanent solution is still to come, we were able to put a temporary fix in place and restore the Archive to full service at 21:00 UTC on October 29.

Read on for the full details of the investigation and our plans to avoid a repeat in the future.

Incident Summary

On October 24, we started to see very strange load graphs on our firewalls, and reports started coming in via Twitter that a significant number of users were getting 503 errors. There had been sporadic reports of issues earlier in the week as well, but we attributed these to the fact that one of our two front-end web servers had a hardware issue and had to be returned to its supplier for repair. (The server was returned to us and put back into business today, November 4.)

Over the next few days, we logged tickets with our MySQL database support vendor and tried adjusting the configuration of various parts of the system to handle the large spikes in load we were seeing. However, we were still unable to identify the cause of the issue.

We gradually identified a cycle in the spikes, and began, one-by-one, to turn off internal loads that were periodic in nature (e.g., hourly database updates). Unfortunately, this did not reveal the problem either.

On the 29th of October, james_ logged in to a virtual machine that runs on one of our servers and noticed it felt sluggish. We then ran a small disc performance check, which showed severely degraded performance on this server. At this point, we realised that our assumption that the application was being delayed by a database problem was wrong. Instead, our web server was being held up by slow performance from the NAS, which is a different virtual machine that runs on the same server.
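
We haven't included the exact commands we ran, but as an illustration of the kind of quick disc check we mean, here is a minimal sketch in Python. The file path and sizes are arbitrary and purely for illustration, not our actual test:

    import os
    import time

    # Rough sequential-write check: time how long it takes to write and fsync
    # a modest amount of data. On a healthy disc this finishes quickly;
    # taking many times longer than usual suggests severely degraded I/O.
    TEST_FILE = "/tmp/io_check.tmp"   # hypothetical path, illustration only
    BLOCK = b"\0" * (1024 * 1024)     # 1 MiB of zeroes
    BLOCKS = 256                      # 256 MiB total

    start = time.monotonic()
    with open(TEST_FILE, "wb") as f:
        for _ in range(BLOCKS):
            f.write(BLOCK)
        f.flush()
        os.fsync(f.fileno())          # make sure the data actually reaches the disc
    elapsed = time.monotonic() - start

    os.remove(TEST_FILE)
    print(f"Wrote {BLOCKS} MiB in {elapsed:.1f}s ({BLOCKS / elapsed:.1f} MiB/s)")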

The NAS holds a relatively small number of static files, including skin assets (such as background images), tag feeds, and work downloads (PDF, .epub, and .mobi files of works). Most page requests made to the Archive load some of these files, which are normally delivered very quickly. But because the system serving those assets was having problems, requests were backing up until a cascade of them timed out, causing the spikes and temporarily clearing the backlog.

To fix the issue, we had to get the NAS out of the system. The skin assets were immediately copied to local disc instead, and we put up a banner warning users that tag feeds and potentially downloads would need to be disabled. After tag feeds were disabled, the service became more stable, but there were further spikes. These were caused by our configuration management system erroneously returning the NAS to service after we disabled it.

After a brief discussion, AD&T and Systems agreed to temporarily move the shared drive to one of the front-end servers. This shared drive represents a single point of failure, however, which is undesirable, so we also agreed to reconfigure the Archive to remove this single point of failure within a few months.

Once the feeds and downloads were moved to the front-end server, the system became stable, and full functionality returned at 21:00 (UTC) on the 29th of October.

We still do not know the cause of the slowdown on the NAS. Because it is a virtual machine, our best guess is that the problem is with a broken disc on the underlying hardware, but we do not have access to the server itself to test. We do have a ticket open with the virtual machine server vendor.

The site was significantly affected for 118 hours. Our analytics platform shows a drop of about 8% in page views for the duration of the issue, as well as significantly longer page delivery times, meaning we were providing a reduced service for the whole period.

Lessons Learnt

  • Any single point of failure that still remains in our architecture must be redesigned.
  • We need enough spare capacity in our servers that, in the case of a hardware failure, we can pull a server from a different function and know it has sufficient hardware to perform in its new role. For instance, we hadn't moved a system from its role as a resque worker to being a front-end machine because of worries about its lack of an SSD. We have ordered SSD upgrades for the two worker systems so that this concern does not recur. Database servers are more problematic because of their cost; however, when the current systems are replaced, the old ones will become app servers and could be returned to their old role, at reduced capacity, in an emergency.
  • We lack centralised logging for our servers; having it would have sped up diagnosis.
  • Systems should have access to a small budget for miscellaneous items, such as extra data center smart hands beyond the two hours already included in our contract. For example, a US$75 expense for this was only approved at 9:30 UTC on October 26, 49 hours after it was requested. Waiting like this forces Systems to work around such restrictions and wastes time.
  • We need to be able to consult specialised support occasionally. For instance, at one point we attempted to limit the number of requests by IP address, but the limit was applied to our firewall's IP address on one server and not on the other. We would recommend buying support for nginx at US$1,500 per server per year. (A sketch of the underlying pitfall follows this list.)
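
To illustrate the rate-limiting pitfall mentioned in the last point: when requests arrive through a firewall or load balancer, the web server sees the proxy's address as the connecting peer, so a per-IP limit keyed on that address throttles all users at once. The sketch below is illustrative Python only (the trusted proxy address is made up), not our actual nginx configuration:

    # Illustrative sketch only: shows why rate limiting keyed on the wrong
    # address throttles everyone at once. When traffic arrives via a proxy,
    # the TCP peer address is the proxy's, and the real client IP has to be
    # taken from a forwarded header -- but only if the peer is a proxy we trust.
    TRUSTED_PROXIES = {"10.0.0.1"}  # hypothetical internal firewall address

    def client_ip(peer_addr: str, x_forwarded_for: str | None) -> str:
        """Return the address rate limiting should be keyed on."""
        if peer_addr in TRUSTED_PROXIES and x_forwarded_for:
            # X-Forwarded-For is a comma-separated chain; the first entry is
            # the original client (as reported by our own proxy).
            return x_forwarded_for.split(",")[0].strip()
        # Direct connection (or untrusted proxy): use the peer address itself.
        return peer_addr

    # Keyed on peer_addr alone, every request would count against 10.0.0.1
    # and the limit would trip for all users at once.
    print(client_ip("10.0.0.1", "203.0.113.7, 10.0.0.1"))  # -> 203.0.113.7
    print(client_ip("198.51.100.9", None))                 # -> 198.51.100.9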

Technologies to Consider

  • Investigate maxscale rather than haproxy for load balancing MySQL.
  • Investigate RabbitMQ as an alternative to resque. This makes sense when multiple servers need to take an action, e.g. cache invalidation (see the sketch after this list).
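
For the cache-invalidation example in the second item, a RabbitMQ fanout exchange delivers each message to every subscribed server. The following is only an illustrative Python sketch using the pika client, with made-up host, exchange, and queue names; it is not a design we have committed to:

    import pika  # RabbitMQ client library

    # One connection per process; "localhost" and the exchange name are
    # placeholders for illustration only.
    connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    channel = connection.channel()

    # A fanout exchange copies every message to all bound queues, so every
    # app server that declares its own queue receives every invalidation.
    channel.exchange_declare(exchange="cache_invalidation", exchange_type="fanout")

    def publish_invalidation(cache_key: str) -> None:
        """Broadcast a cache key that every front-end should expire."""
        channel.basic_publish(exchange="cache_invalidation",
                              routing_key="",          # ignored by fanout exchanges
                              body=cache_key.encode())

    def start_consumer() -> None:
        """Each server runs a consumer like this with its own exclusive queue."""
        result = channel.queue_declare(queue="", exclusive=True)  # server-local queue
        channel.queue_bind(exchange="cache_invalidation", queue=result.method.queue)
        channel.basic_consume(queue=result.method.queue,
                              on_message_callback=lambda ch, method, props, body:
                                  print("expire", body.decode()),
                              auto_ack=True)
        channel.start_consuming()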

Published: 2013-02-07 17:25:55 -0500

Site security is one of our top priorities in developing the Archive of Our Own. In the last couple of weeks, we've been reviewing our 'emergency plan', and we wanted to give users a bit more information about how we work to protect the site. In particular, we wanted to make users aware that in the event of a security concern, we may opt to shut the site down in order to protect user data.

Background

Last week we were alerted to a critical security issue in Ruby on Rails, the framework the Archive is built on. We (and the rest of the Rails community) had to work quickly to patch this hole: we did an emergency deploy to upgrade Rails and fix the issue.

As the recent security breach at Twitter demonstrated, all web frameworks are vulnerable to security breaches. As technology develops, new security weaknesses are discovered and exploited. This was a major factor in the Rails security issue we just patched, and it means that once a problem is identified, it's important to act fast.

Our security plans

If the potential for a security breach is identified on the site and we cannot fix it immediately, we will perform an emergency shutdown until we are able to address the problem. In some cases, completely shutting down the site is the only way to guarantee that site security can be maintained and user data is protected.

We have also taken steps for 'damage limitation' in the event that the site is compromised. We perform regular offsite backups of site data. These are kept isolated from the main servers and application (where any security breach could take place).

In order to ensure the site remains as secure as possible, we also adhere to the following:

  • Developers are subscribed to the Rails mailing list and stay abreast of security announcements
  • We regularly update Rails and the software we use on our servers, so that we don't fall behind the main development cycle and potentially fall afoul of old security problems
  • All new code is reviewed before being merged into our codebase, to help prevent us introducing security holes ourselves
  • All our servers are behind firewalls
  • All password data is encrypted (a general illustration of what this means in practice follows this list)
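
The Archive itself is a Ruby on Rails application, so the code below is not ours; it is only a general illustration of how sites avoid storing plain passwords, here using one-way hashing with the Python bcrypt library:

    import bcrypt  # third-party library: pip install bcrypt

    # Illustrative only -- this shows what "never storing the plain password"
    # looks like in practice; only a salted, one-way hash is kept.
    def hash_password(plain: str) -> bytes:
        # gensalt() embeds a random salt and a work factor in the stored value.
        return bcrypt.hashpw(plain.encode("utf-8"), bcrypt.gensalt())

    def check_password(plain: str, stored_hash: bytes) -> bool:
        # checkpw re-hashes the candidate with the stored salt and compares.
        return bcrypt.checkpw(plain.encode("utf-8"), stored_hash)

    stored = hash_password("correct horse battery staple")
    assert check_password("correct horse battery staple", stored)
    assert not check_password("wrong guess", stored)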

What you can do

The main purpose of this post is to let you know that security is a priority, and to give you a heads up that we may take the site down in an emergency. Because security problems tend to be discovered in batches, we anticipate an increased risk of needing to do this over the next month. If that happens, we'll keep users informed on our AO3_Status Twitter, the OTW website and our other news outlets.

Overall site security is our responsibility and there is no immediate cause for concern. However, we recommend that you always use a unique username / password combination on each site you use. Using the same login details across many sites increases the chance that a security breach in one will give hackers access to your details on other sites (which may have more sensitive data).

We'd like to thank all the users who contacted us about the latest Rails issue. If you ever have questions or concerns, do contact Support.

Published: 2013-01-16 06:24:06 -0500

The Archive of Our Own will have some scheduled downtime on Thursday January 17 at 22:00 UTC (rescheduled from 18.30 UTC; see what time this is in your timezone). We expect the downtime to last about 15 minutes.

This downtime is to allow us to make some changes to our firewall which will make it better able to cope under heavy loads, and will help with the kinds of connection issues we experienced last week. Our colocation host has generously offered to help us out with this (thanks, Randy!).

As usual, we'll tweet from AO3_Status before we start and when we go back up, and we'll update there if anything unexpected happens.

Published: 2013-01-07 12:27:00 -0500

The Archive will be down for maintenance for short periods on 8, 10 and 11 January. The maintenance is scheduled to start at approximately 05.15 UTC on each day (see what time that is in your timezone), and will last less than an hour each time. We'll put out a notice on our Twitter AO3_Status when we're about to start.

Downtime details

8 January 05.15 UTC: c. 15 minutes downtime.

10 January 05.15 UTC: c. 25 minutes downtime.

11 January 05.15 UTC: c. 50 minutes downtime.

What we're up to

The Archive has grown massively over the past year - during the first week of 2013 we had over 27.6 million pageviews! To cope with the continuing growth of the site, we're adding three more servers. We're also reorganising the way our servers are set up to ensure that they're working as efficiently as possible, and to make it easy for us to add more machines in future.

Our colocation host installed the new machines in late December. We're now moving over to using them, and reorganising our setup. We're doing the work of moving over to our new database server in several small chunks, which will keep downtimes short and make it easier for us to identify the source of any problems which may arise.

What's next?

Once this has been done we'll deploy the Archive code on the new servers and test it out. We'll be looking for some help with this - stay tuned for another post.

When we're happy that everything is working right, we'll make the switch to using the new servers. No downtime expected at present, but we'll keep you posted if that changes.

Thanks!

Thanks for your patience while we work.

We're able to continue expanding the Archive and buying new hardware thanks to the generosity of our volunteers, who give a great deal of time to coding and systems administration, and of OTW members, whose donations pay for the Archive's running costs. If you enjoy using the Archive, please consider making a donation to the OTW. We also very much welcome volunteers, but we're currently holding off on recruiting while our lovely Volunteers Committee improve our support for new volunteers (we'll let you know when we reopen). Thank you to everyone who supports us!

Published: 2013-01-04 11:26:49 -0500

A number of users have reported receiving malware warnings from Avast when accessing the Archive of Our Own. We haven't been hacked, and there is no cause for concern - the warning was a false positive.

Avast is erroneously flagging a file used by New Relic, which we use to monitor our servers (you can see more details in this thread). New Relic are working with Avast to resolve the issue, and we expect things to be back to normal very shortly (we have had only a small number of reports today).

Thank you to everyone who alerted us to this! If you see something unexpected on the site, we always appreciate hearing about it right away. You can keep track of the latest site status via our Twitter AO3_Status, and contact our Support team via the Support form.

Published: 2012-12-17 06:34:52 -0500

The Archive of Our Own will be undergoing some maintenance today at approximately 18.00 UTC (what time is this in my timezone?). During the maintenance period, which will last approximately two hours, downloads will not work. You will still be able to browse and read on the Archive, but will not be able to download any works. If the work proves complicated, we may also have a period of downtime (although we hope to avoid this).

What's going on?

In the next few weeks, we'll be adding some new servers to the OTW server family. The new servers will add some extra capacity to the Archive of Our Own, and will also create extra room for Fanlore, which is growing rapidly thanks to the amazing work of thousands of fannish editors (as Fanlore users are well aware, this expansion has been putting the existing Fanlore server under increasing strain).

In preparation for these new servers, we need to first reorganise the setup of the existing servers in order to free some more physical space at our colocation host without buying more rack space (rack space costs money, so it’s nice not to use more than we need). In order to do this, we’ll have to take some of the servers offline for a little while today. Doing this now will minimize the disruption caused when the servers arrive during the holiday period, which is typically one of the busiest times of year for the Archive.

The Archive is set up so it can function without all servers running at once, so today, we will only have to take the server which hosts downloads offline. This means that attempts to download any work will fail while we reorganize our data, though the rest of the site will work as usual (pending any unexpected problems). If you prefer to read downloaded works, you may wish to stock up now! Downloads will be restored as soon as we finish our maintenance. We’ll keep you posted about further maintenance when the new servers arrive!

Thanks for your patience while we do this work. You can keep track of current site status via our Twitter account AO3_Status.

Published: 2012-11-07 15:07:26 -0500

The Archive of Our Own will have approximately two hours of planned downtime on 8 November 2012, starting c. 05.30 UTC (see what time that is in your timezone).

During this time we will be installing new discs in our servers, giving us more space to accommodate the demands of serving lots of data to lots of users!

If all goes well with the hardware installation, we will also be deploying new code during this downtime. The new release will include the long-awaited return of the tag filters! We're very excited (and a bit nervous).

Please follow AO3_Status for updates on the downtime and maintenance - we'll tweet before we take the site down and again when the work has been completed. If our Twitter says we're up but you're still seeing the maintenance page, you may need to clear your browser cache and refresh.

Published: 2012-08-17 09:21:15 -0400

Our Systems team have been doing some behind-the-scenes maintenance over the past week or so to improve the Archive of Our Own's firewalls. This has mostly been invisible to users, but last night it briefly gave everyone a fright when a typo introduced during maintenance caused some people to be redirected to some weird pages when trying to access the AO3. We also had a few additional problems today which caused a bit of site downtime. We've fixed the problems and the site should now be back to normal, but we wanted to give you all an explanation of what we've been working on and what caused the issues.

Please note: We will be doing some more maintenance relating to these issues at c. 22:00 UTC today (see when this is in your timezone). The site should remain up, but will run slowly for a while.

Upgrading our firewall

The AO3's servers have some built-in firewalls which stop outside traffic from accessing parts of the servers it shouldn't, in the same way that the firewall on your home computer protects you from malicious programmes modifying your computer. Until recently, we were relying on these firewalls, which meant that each server was behind its own firewall and data passed between servers was unencrypted. However, now that we have a lot more machines (with different levels of firewall), this setup is not as secure as it could be. It also makes it difficult for us to do some of the Systems work we need to, since the firewalls get in the way. We've therefore been upgrading our firewall setup: it's better to put all the machines behind the same firewall, so that data passing between different servers is always protected by it.

We've been slowly moving all our servers behind the new firewall. We're almost done with this work, which will put all the main servers for the Archive (that is, the ones located together at the same site) behind the firewall. In addition, our remote servers (which can't go behind the firewall) will be connected to it so that they can be sure they're talking to the right machine, and all the data sent to them is properly encrypted. (The remote servers are used for data backups - they are at a different location so that if one site is hit by a meteor, we'll still have our data.) This means that everything is more secure and that we can do further Systems maintenance without our own firewalls getting in the way.

What went wrong - redirects

Last night, some users started getting redirected to a different site when trying to access the AO3. The redirect site was serving up various types of spammy content, so we know this was very alarming for everyone who experienced it. The problem was caused by an error introduced during our maintenance. It was fixed very quickly, but we're very sorry to everyone who was affected.

In order to understand what caused the bug, it's necessary to understand a little bit about DNS. Every address on the internet is actually a string of numbers (an IP address), but you usually access it via a much friendlier address like http://archiveofourown.org. DNS is a bit like a phonebook for the internet: when you go to http://archiveofourown.org, your Domain Name Service goes to look and see what number is listed for that address, then sends you to the right place. In the case of the AO3, we actually have several servers, so there are several 'phone numbers' listed and you can get sent to any one of those.
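
If you'd like to see what that phonebook lookup amounts to, here's a tiny illustrative Python sketch using the standard library resolver; it simply asks your own DNS which addresses are currently listed for our name, and there may be more than one:

    import socket

    # Ask the system resolver ("the phonebook") for the addresses behind the
    # name. A name can map to several IP addresses, and clients may be sent
    # to any one of them.
    infos = socket.getaddrinfo("archiveofourown.org", 443, proto=socket.IPPROTO_TCP)
    addresses = sorted({info[4][0] for info in infos})
    for addr in addresses:
        print(addr)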

As part of our maintenance, we had to make changes to our DNS configuration. Unfortunately, during one of those changes, we accidentally introduced a typo into one of our names (actually into the delegation of the domain, for those of you who are systems savvy). This meant that some people were being sent to the wrong place when they tried to access our address - it's as if the phone book had a misprint and you suddenly found yourself calling the laundry instead of a taxi service. Initially this was just sending people to a non-existent place, but a spammer noticed the error and registered that IP address so they would get the redirected traffic. (In the phone book analogy, the laundry noticed the misprint and quickly registered to use that phone number so they could take advantage of it.) It didn't affect everyone since some people were still being sent to the other, valid IP addresses.

We fixed the typo as soon as the problem was reported. However, Domain Name Services don't update immediately, so some users were still getting sent to the wrong address for a few hours after we introduced the fix. To continue the phone book analogy, it's as if the misprinted phone book was still in circulation at the same time as the new, updated one.

If you were affected by this issue, then it should be completely resolved now. Provided you didn't click any links on the site you were redirected to, you shouldn't have anything to worry about. However, it's a good idea to run your antivirus programme just to be absolutely sure.

Downtime today

It turned out that one bit of the firewall configuration was a little overenthusiastic and was blocking some users from getting to the site at all. We rolled back part of the changes, which caused a little bit of downtime. Because this involved changing our DNS configuration again, the change took a while to take effect, and the downtime was different for different users (effectively we changed our phone number, and the phonebook had to update).

The site should be back up for everyone now. We'll be completing the last bits of work on the firewall upgrade today at roughly 22:00 UTC. At present we don't expect any downtime, but the site will be running more slowly than usual.

Thank you

We'd like to say a massive thank you to James_, who has done almost all of the work upgrading the firewall. He's done a sterling job and the site is much more secure because of his work. This glitch reminds us just how high-pressure Systems' work is - for most of us, a tiny typo does not have such noticeable effects! We really appreciate all the work James_ has put in, and the speed with which he identified and fixed the problem when things went wrong.

We'd also like to thank our other staff who swung into action to keep people informed on Twitter, our news sites, and via Support, and who provided moral support while the issues were being dealt with.

Finally, thanks to all our users: you guys were super understanding while we were dealing with these problems and gave us lots of useful info which helped us track down the source of the bug.

Reminder: site status information

The first place to be updated when we have problems with the site is our Twitter AO3_Status. We try to answer questions addressed to us there as well as putting out general tweets, but it can be hard for us to keep up with direct conversations in busy periods, so apologies if you sent us a message and we didn't respond directly. If you see a problem, it's a good idea to check our timeline first to see if we already tweeted about it. For problems other than site status issues, the best place to go for help is AO3 Support.
