Update: Site problems and our firewall upgrade
Published: 2012-08-17 09:21:15 -0400
Our Systems team have been doing some behind-the-scenes maintenance over the past week or so to improve the Archive of Our Own's firewalls. This has mostly been invisible to users, but last night it briefly gave everyone a fright when a typo introduced during maintenance caused some people to be redirected to some weird pages when trying to access the AO3. We also had a few additional problems today which caused a bit of site downtime. We've fixed the problems and the site should now be back to normal, but we wanted to give you all an explanation of what we've been working on and what caused the issues.
Please note: We will be doing some more maintenance relating to these issues at around 22:00 UTC today. The site should remain up, but will run slowly for a while.
Upgrading our firewall
The AO3's servers have some built-in firewalls which stop outside IP services accessing bits of the servers they shouldn't, in the same way that the firewall on your home computer protects you from malicious programmes modifying your computer. Until recently, we were using these firewalls, which meant that each server was behind its own firewall, and data passed between servers was unencrypted. However, now that we have a lot more machines (with different levels of firewall), this setup is not as secure as it could be. It also makes it difficult for us to do some of the Systems work we need to, since the firewalls get in the way. We've therefore been upgrading our firewall setup: it's better to put all the machines behind the same firewall so that data passing between different servers is always protected by the firewall.
We've been slowly moving all our servers behind the new firewall. We're almost done with this work, which will put all the main servers for the Archive (that is, the ones located together at the same site) behind the firewall. In addition, our remote servers (which can't go behind the firewall) will be connected to the firewall so that they can be sure they're talking to the right machine, and all the data sent to them is properly encrypted. (The remote servers are used for data backups - they are at a different location so that if one site is hit by a meteor, we'll still have our data.) This means that everything is more secure and that we can do further Systems maintenance without our own firewalls getting in the way.
What went wrong - redirects
Last night, some users started getting redirected to a different site when trying to access the AO3. The redirect site was serving up various types of spammy content, so we know this was very alarming for everyone who experienced it. The problem was caused by an error introduced during our maintenance. It was fixed very quickly, but we're very sorry to everyone who was affected.
In order to understand what caused the bug, it's necessary to understand a little bit about DNS. Every address on the internet is actually a string of numbers (an IP address), but you usually access it via a much friendlier address like http://archiveofourown.org. DNS is a bit like a phonebook for the internet: when you go to http://archiveofourown.org, your Domain Name Service goes to look and see what number is listed for that address, then sends you to the right place. In the case of the AO3, we actually have several servers, so there are several 'phone numbers' listed and you can get sent to any one of those.
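For the technically curious, here is a minimal sketch of the phonebook idea in Python. It is a toy model rather than real DNS: the PHONEBOOK table and the resolve function are made up for illustration, and the addresses are reserved documentation addresses, not the AO3's actual servers. Picking one listed address at random is also a simplification of how real lookups share traffic between servers.

```python
import random

# Hypothetical phonebook: the addresses are reserved documentation
# addresses (RFC 5737), NOT the AO3's real servers.
PHONEBOOK = {
    "archiveofourown.org": ["192.0.2.10", "192.0.2.11", "192.0.2.12"],
}


def resolve(name, phonebook=PHONEBOOK, rng=random):
    """Return one of the addresses listed for `name`, chosen at random."""
    addresses = phonebook.get(name)
    if not addresses:
        raise LookupError(f"no address listed for {name!r}")
    return rng.choice(addresses)


if __name__ == "__main__":
    # Different visitors may be handed different servers for the same name.
    print(resolve("archiveofourown.org"))
```

Because several 'phone numbers' are listed, a mistake in just one entry only affects the visitors who happen to be handed that entry.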
As part of our maintenance, we had to make changes to our DNS configuration. Unfortunately, during one of those changes, we accidentally introduced a typo into one of our names (actually into the delegation of the domain, for those of you who are systems savvy). This meant that some people were being sent to the wrong place when they tried to access our address - it's as if the phone book had a misprint and you suddenly found yourself calling the laundry instead of a taxi service. Initially this was just sending people to a non-existent place, but a spammer noticed the error and registered that address so they would get the redirected traffic. (In the phone book analogy, the laundry noticed the misprint and quickly registered to use that phone number so they could take advantage of it.) It didn't affect everyone since some people were still being sent to the other, valid IP addresses.
We fixed the typo as soon as the problem was reported. However, Domain Name Services don't update immediately, so some users were still getting sent to the wrong address for a few hours after we introduced the fix. To continue the phone book analogy, it's as if the misprinted phone book was still in circulation at the same time as the new, updated one.
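The delay can be illustrated with a small Python sketch of a lookup service that remembers answers for a fixed time. Everything here is an assumption made for illustration: the CachingResolver class, the one-hour lifetime, and the addresses (reserved documentation addresses, not real servers). It also simplifies the actual problem, which was in the delegation of the domain rather than in a single address.

```python
class CachingResolver:
    """Toy resolver that reuses an answer until its time-to-live expires."""

    def __init__(self, lookup, ttl_seconds):
        self.lookup = lookup            # function: name -> address (the "phone book")
        self.ttl_seconds = ttl_seconds  # how long an answer may be reused
        self.cache = {}                 # name -> (address, expiry time)

    def resolve(self, name, now):
        cached = self.cache.get(name)
        if cached is not None and now < cached[1]:
            return cached[0]            # still within its lifetime: reuse it
        address = self.lookup(name)     # expired or unknown: ask the phone book
        self.cache[name] = (address, now + self.ttl_seconds)
        return address


# Hypothetical data: reserved documentation addresses, not real servers.
records = {"archiveofourown.org": "203.0.113.66"}            # the misprinted entry
resolver = CachingResolver(records.__getitem__, ttl_seconds=3600)

first = resolver.resolve("archiveofourown.org", now=0)       # misprint gets cached
records["archiveofourown.org"] = "192.0.2.10"                # typo fixed at the source
stale = resolver.resolve("archiveofourown.org", now=1800)    # old copy still in use
fresh = resolver.resolve("archiveofourown.org", now=3600)    # copy expired, fix is seen

print(first, stale, fresh)
```

In this sketch the middle lookup still returns the misprinted address even though the source has already been corrected, which is the 'old phone book still in circulation' effect.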
If you were affected by this issue, then it should be completely resolved now. Provided you didn't click any links on the site you were redirected to, you shouldn't have anything to worry about. However, it's a good idea to run your antivirus programme just to be absolutely sure.
What went wrong - downtime
It turned out one bit of the firewall configuration was a little overenthusiastic and was blocking some users from getting to the site at all. We rolled back part of the changes, which caused a little bit of downtime. Because this involved changing our DNS configuration again, the change took a while to take effect and the downtime was different for different users (effectively we changed our phone number, and the phonebook had to update).
The site should be back up for everyone now. We'll be completing the last bits of work on the firewall upgrade today at roughly 22:00 UTC. At present we don't expect any downtime, but the site will be running more slowly than usual.
We'd like to say a massive thank you to James_, who has done almost all of the work upgrading the firewall. He's done a sterling job and the site is much more secure because of his work. This glitch reminds us just how high-pressure Systems work is - for most of us, a tiny typo does not have such noticeable effects! We really appreciate all the work James_ has put in, and the speed at which he identified and fixed the problem when it went wrong.
We'd also like to thank our other staff who swung into action to keep people informed on Twitter, our news sites, and via Support, and who provided moral support while the issues were being dealt with.
Finally, thanks to all our users: you guys were super understanding while we were dealing with these problems and gave us lots of useful info which helped us track down the source of the bug.
Reminder: site status information
The first place to be updated when we have problems with the site is our Twitter account, AO3_Status. We try to answer questions addressed to us there as well as putting out general tweets, but it can be hard for us to keep up with direct conversations in busy periods, so apologies if you sent us a message and we didn't respond directly. If you see a problem, it's a good idea to check our timeline first to see if we have already tweeted about it. For problems other than site status issues, the best place to go for help is AO3 Support.