AO3 News

Published: 2017-03-21 15:17:26 -0400

Update, April 4: On March 29, we successfully deployed an improved version of the code referenced in this post. Adding a work to the database now takes considerably less time.

-

You may have noticed the Archive has been slow or giving 502 errors when posting or editing works, particularly on weekends and during other popular posting times. Our Development and Systems teams have been working to address this issue, but our March 17 attempt at a fix failed, leading to several hours of downtime and site-wide slowness.

Overview

Whenever a user posts or edits a work, the Archive updates how many times each tag on the work has been used across the site. While a tag's count is being updated, its record is locked and the database cannot process other changes to that tag. This can result in slowness or even 502 errors when multiple people are trying to post using the same tag. Because all works are required to use rating and warning tags, works' tags frequently overlap during busy posting times.
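
As a rough illustration of where that contention comes from (a sketch only, not the Archive's actual code -- the model name, column name, and posting flow here are assumptions), each tag on a newly posted work gets its use count incremented inside the posting transaction, and MySQL holds a lock on each updated tag row until the transaction commits:

    # Sketch only: illustrates the locking behaviour described above.
    # "Tag", "taggings_count", and the surrounding flow are assumptions.
    ActiveRecord::Base.transaction do
      work.save!
      work.tags.each do |tag|
        # Issues "UPDATE tags SET taggings_count = taggings_count + 1 WHERE id = ?"
        # and holds a row lock on that tag until the transaction commits.
        Tag.update_counters(tag.id, taggings_count: 1)
      end
    end

If two users post works that share a tag -- and every work shares at least a rating and a warning tag with many others -- the second transaction has to wait for the first one's lock, which is where the slowness and 502 errors come from.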

Unfortunately, the only workaround currently available is to avoid posting, editing, or adding chapters to works at peak times, particularly Saturdays and Sundays (UTC). We strongly recommend saving your work elsewhere so changes won’t be lost if you receive a 502.

For several weeks, we’ve had temporary measures in place to decrease the number of 502 errors. However, posting is still slow and errors are still occurring, so we’ve been looking for more ways to use hardware and software to speed up the posting process.

Our Friday, March 17, downtime was scheduled so we could deploy a code change we hoped would help. The change would have allowed us to cache tag counts for large tags (e.g. ratings, common genres, and popular fandoms), updating them only periodically rather than every time a work was posted or edited. (We chose to cache only large tags because the difference between 1,456 and 1,464 is less significant than the difference between one and nine.) However, the change led to roughly nine hours of instability and slowness and had to be rolled back.
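
Very roughly, the caching approach looked like this (an illustrative sketch under assumed names, not the deployed code): skip the per-post count update for large tags, and let a periodic task refresh their counts instead.

    # Sketch of the idea; LARGE_TAG_THRESHOLD, taggings_count, and the method
    # names are assumptions for illustration.
    def increment_use_count(tag)
      return if tag.taggings_count >= LARGE_TAG_THRESHOLD  # large tags wait for the periodic refresh
      Tag.update_counters(tag.id, taggings_count: 1)
    end

    # Periodic task (run every so often, not on every post) that refreshes
    # the counts of large tags in the background.
    def refresh_large_tag_counts
      Tag.where("taggings_count >= ?", LARGE_TAG_THRESHOLD).find_each do |tag|
        tag.update_column(:taggings_count, tag.taggings.count)
      end
    end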

Fixing this is our top priority, and we are continuing to look for solutions. Meanwhile, we’re updating our version of the Rails framework, which is responsible for the slow counting process. While we don’t believe this upgrade will be a solution by itself, we are optimistic it will give us a slight performance boost.

March 17 incident report

The code deployed on March 17 allowed us to set a caching period for a tag’s use count based on the size of the tag. Although we adjusted the caching periods and tag size thresholds throughout the day, the code was deployed with the following settings (sketched in code after the list):

  • Small tags with fewer than 1,000 uses would not be cached.
  • Medium tags with 1,000-39,999 uses would be cached for 3-40 minutes, depending on the tag’s size.
  • Large tags with at least 40,000 uses would be cached for 40-60 minutes, but the cache would be refreshed every 30 minutes. Unlike small and medium tags, the counts for large tags would not update when a work was posted -- they would only update during browsing. Refreshing the cache every 30 minutes would prevent pages from loading slowly.
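
In code, the tiering above might look something like the following sketch. The thresholds come from the list, but the method name and the way the period scales within a band are our own illustration, not the deployed formula.

    # Sketch of the caching tiers described above; scaling within each band
    # is illustrative only.
    def cache_period_for(uses)
      if uses < 1_000
        nil                                           # small tags: count not cached
      elsif uses < 40_000
        (3 + 37.0 * (uses - 1_000) / 39_000).minutes  # medium tags: 3-40 minutes
      else
        60.minutes                                    # large tags: 40-60 minutes, with a
      end                                             # background refresh every 30 minutes
    end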

We chose to deploy at a time of light system load so we would be able to fine-tune these settings before the heaviest weekend load. The deploy process itself went smoothly, beginning at 12:00 UTC and ending at 12:14 -- well within the 30 minutes we allotted for downtime.

By 12:40, we were under heavy load and had to restart one of our databases. We also updated the settings for the new code so tags with 250 or more uses would fall into the “medium” range and be cached. We increased the minimum caching period for medium tags from three minutes to ten.

At 12:50, we could see we had too many writes going to the database. To stabilize the site, we made it so only two out of seven servers were writing cache counts to the database.

However, at 13:15, the number of writes overwhelmed MySQL. It was writing constantly, which made the service unavailable and eventually caused it to crash. We put the Archive into maintenance mode and began a full MySQL cluster restart. Because the writes had exceeded the databases' capabilities, the databases had become out of sync with each other. Resynchronizing the first two servers with the built-in method took about 65 minutes, starting at 13:25 and completing at 14:30. Using a different method to bring the third, recalcitrant server back into line allowed us to return the system to service sooner.

By 14:57, we had a working set of two out of three MySQL servers in a cluster and were able to bring the Archive back online. Before bringing the site back, we also updated the code for the tag autocomplete, replacing a call that could write to the database with a simple read instead.
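
For illustration, the autocomplete change had roughly this shape (the actual method isn't named in this post; find_or_create_by and find_by are standard ActiveRecord calls used here as stand-ins):

    # Before: a lookup that could INSERT a missing row -- a write to the
    # already overloaded database.
    tag = Tag.find_or_create_by(name: params[:term])

    # After: a pure read; if the tag doesn't exist, nothing is written.
    tag = Tag.find_by(name: params[:term])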

At 17:48, we were able to bring the last MySQL server back and rebalance the load across all three servers. However, the database dealing with writes was sitting at 91% load rather than the more normal 4-6%.

At 18:07, we made it so only one app server wrote tags’ cache values to the database. This dropped the load on the write database to about 50%.

At 19:40, we began implementing a hotfix that significantly reduced writes to the database server, but having all seven app servers writing to the database once more pushed the load up to about 89%.

At 20:30, approximately half an hour after the hotfix was finished, we removed the writes from three of the seven machines. While this reduced the load, the reduction was not significant enough to resolve the issues the Archive was experiencing. Nevertheless, we let the system run for 30 minutes so we could monitor its performance.

Finally, at 21:07, we decided to take the Archive offline and revert the release. The Archive was back up and running the old code by 21:25.

We believe the issues with this caching change were caused by underestimating the number of small tags on the Archive and overestimating the accuracy of their existing counts. With the new code in place, the Archive began correcting the inaccurate counts for small tags, leading to many more writes than we anticipated. If we're able to get these writes under control, we believe this code might still be a viable solution. Unfortunately, this is made difficult by the fact that we can’t simulate production-level load in our testing environment.
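
One way to get such writes under control (and the kind of check option 1 below refers to) is to write a corrected count only when it actually differs from the stored value. A rough sketch, with assumed names:

    # Sketch of a guard against redundant writes; "taggings_count" and the
    # sweeper flow are assumptions for illustration.
    fresh_count = tag.taggings.count
    if fresh_count != tag.taggings_count
      tag.update_column(:taggings_count, fresh_count)  # write only when the value changed
    end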

Going forward

We are currently considering five possible ways to improve posting speed, although other options may present themselves as we continue to study the situation.

  1. Continue with the caching approach from our March 17 deploy. Although we chose to revert the code due to the downtime it had already caused, we believe we were close to resolving the issue with database writes. We discovered that the writes overwhelming our database were largely secondary writes caused by our tag sweeper. These secondary writes could likely be reduced by putting checks in the sweeper to prevent unnecessary updates to tag counts.
  2. Use the rollout gem to alternate between the current code and the code from our March 17 deploy. This would allow us to deploy and troubleshoot the new caching code with minimal interruption to normal Archive function. We would be able to study the load caused by the new code while retaining the ability to switch back to the old code before problems arose. However, it would also make the new code much more complex, meaning it would not only be more error-prone, but would also take longer to write, and users would have to put up with the 502 errors for longer.
  3. Monkey patch the Rails code that updates tag counts. We could modify the default Rails code so it would still update the count for small tags, but not even try to update the count on large tags. We could then add a task that would periodically update the count on larger tags.
  4. Break work posting into smaller transactions. The current slowness comes from large transactions that are live for too long. Breaking the posting process into smaller parts would resolve that, but we would then run the risk of creating inconsistencies in the database. In other words, if something went wrong while a user was updating their work, only some of their changes might be saved.
  5. Completely redesign work posting. We currently have about 19,000 drafts, and roughly 95,000 works are created in a month; moving drafts to a separate table would allow us to update the tag counts only when a work was finally posted. We could then make posting from a draft the only option. Pressing the "Post" button on a draft would set a flag on the entry in the draft table and add a Resque job to post the work, allowing us to serialize updates to tag counts. Because the button press would only make a minor change in the database, the web page would return instantly; however, there would be a wait before the work was actually posted. (A rough sketch of such a job follows this list.)
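
To make option 5 a little more concrete, here is a rough sketch of the kind of Resque job it would involve. The Draft model, the flag, and the publish! method are hypothetical; only the Resque conventions (a @queue name, a self.perform method, and Resque.enqueue) are the real API.

    # Hypothetical sketch of option 5, not working Archive code.
    class PostWorkFromDraft
      @queue = :work_posting    # a single queue serializes tag count updates

      def self.perform(draft_id)
        draft = Draft.find(draft_id)
        draft.publish!          # creates the work and updates tag counts here,
      end                       # in the background rather than in the web request
    end

    # In the controller, pressing "Post" only flips a flag and enqueues the job,
    # so the page can return immediately.
    draft.update_column(:post_requested, true)
    Resque.enqueue(PostWorkFromDraft, draft.id)
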
Note: The unexpected downtime that occurred around noon UTC on Tuesday, March 21, was caused by an unusually high number of requests to Elasticsearch and is unrelated to the issues discussed in this post. A temporary fix is currently in place and we are looking for long-term solutions.


Published: 2015-11-04 17:53:27 -0500

At approximately 23:00 UTC on October 24, the Archive of Our Own began experiencing significant slowdowns. We suspected these slowdowns were being caused by the database, but four days of sleuthing led by volunteer SysAdmin james_ revealed a different culprit: the server hosting all our tag feeds, Archive skins, and work downloads. While a permanent solution is still to come, we were able to put a temporary fix in place and restore the Archive to full service at 21:00 UTC on October 29.

Read on for the full details of the investigation and our plans to avoid a repeat in the future.

Incident Summary

On October 24, we started to see very strange load graphs on our firewalls, and reports started coming in via Twitter that a significant number of users were getting 503 errors. There had been sporadic reports of issues earlier in the week as well, but we attributed these to the fact that one of our two front-end web servers had a hardware issue and had to be returned to its supplier for repair. (The server was returned to us and put back into business today, November 4.)

Over the next few days, we logged tickets with our MySQL database support vendor and tried adjusting our configuration of various parts of the system to handle the large spikes in load we were seeing. However, we still were unable to identify the cause of the issue.

We gradually identified a cycle in the spikes, and began, one-by-one, to turn off internal loads that were periodic in nature (e.g., hourly database updates). Unfortunately, this did not reveal the problem either.

On the 29th of October, james_ logged in to a virtual machine that runs on one of our servers and noticed it felt sluggish. We then ran a small disc performance check, which showed severely degraded performance on this server. At this point, we realised that our assumption that the application was being delayed by a database problem was wrong. Instead, our web server was being held up by slow performance from the NAS, which is a different virtual machine that runs on the same server.

The NAS holds a relatively small amount of static files, including skin assets (such as background images), tag feeds, and work downloads (PDF, .epub, and .mobi files of works). Most page requests made to the Archive load some of these files, which are normally delivered very quickly. But because the system serving those assets was having problems, the requests were getting backed up to the point where a cascade of them timed out, causing the spikes and temporarily clearing the backlog.

To fix the issue, we had to get the NAS out of the system. The skin assets were immediately copied to local disc instead, and we put up a banner warning users that tag feeds and potentially downloads would need to be disabled. After tag feeds were disabled, the service became more stable, but there were further spikes. These were caused by our configuration management system erroneously returning the NAS to service after we disabled it.

After a brief discussion, AD&T and Systems agreed to temporarily move the shared drive to one of the front-end servers. However, this shared drive represents a single point of failure, which is undesirable, so we also agreed to reconfigure the Archive to remove it within a few months.

Once the feeds and downloads were moved to the front-end server, the system became stable, and full functionality returned at 21:00 (UTC) on the 29th of October.

We still do not know the cause of the slowdown on the NAS. Because it is a virtual machine, our best guess is that the problem is with a broken disc on the underlying hardware, but we do not have access to the server itself to test. We do have a ticket open with the virtual machine server vendor.

The site was significantly affected for 118 hours. Our analytics platform shows a drop of about 8% in page views for the duration of the issue, as well as significantly longer page delivery times, meaning we were providing a reduced service the whole time.

Lessons Learnt

  • Any single point of failure that still remains in our architecture must be redesigned.
  • We need enough spare capacity in our servers so that, in the case of a hardware failure, we can pull a server from a different function and have it perform adequately in its new role. For instance, we hadn't moved a system from its role as a Resque worker to being a front-end machine because of worries about its lack of an SSD. We have ordered SSD upgrades for the two worker systems, so this concern should not recur. Database servers are more problematic because of their cost; however, when the current systems are replaced, the old systems will become app servers and could be returned to their old role in an emergency at reduced capacity.
  • We lack centralised logging for our servers; having it would have sped up diagnosis.
  • Systems should have access to a small budget for miscellaneous items, such as extra data center smart hands beyond the two hours already in our contract. For example, a US$75 expense for this was only approved on October 26 at 9:30 UTC, 49 hours after it was requested. Having to work around such restrictions wastes Systems' time.
  • We need to be able to occasionally consult specialized support. For instance, at one point we attempted to limit the number of requests by IP address, but this limit was applied to our firewall's IP address on one server but not the other. We would recommend buying support for nginx at US$1,500 per server per year.

Technologies to Consider

  • Investigate maxscale rather than haproxy for load balancing MySQL.
  • Investigate RabbitMQ as an alternative to resque. This makes sense when multiple servers need to take an action, e.g. cache invalidation.


Post Header

To combat an influx of spam works, we are temporarily suspending the issuing of invitations from our automated queue. This will prevent spammers from getting invitations to create new accounts and give our all-volunteer teams time to clean up existing spam accounts and works. We will keep you updated about further developments on our Twitter account. Please read on for details.

The problem

We have been dealing with two issues affecting the Archive, both in terms of server health and user experience.

  • Spammers who sign up for accounts only to post thousands of fake "works" (various kinds of advertisements) with the help of automated scripts.
  • People who use bots to download works in bulk, to the point where it affects site speed and server uptime for everyone else.

Measures we've taken so far

We have been trying several things to keep both problems in check:

  • The Abuse team has been manually banning accounts that post spam.
  • We are also keeping an eye on the invitation queue for email addresses that follow discernible patterns and removing them from the queue. This is getting trickier as the spammers adjust.
  • We delete the bulk of spam works from the database directly, as individual work deletion would clearly be an overwhelming task for the Abuse team; however, this requires people with the necessary skills and access to be available.
  • Our volunteer sysadmin has been setting up various server scripts and settings aimed at catching spammers and download bots before they can do too much damage. This requires a lot of tweaking to adjust to new bots and prevent real users from being banned.

Much of this has cut into our volunteers' holiday time, and we extend heartfelt thanks to everyone who's been chipping in to keep the Archive going through our busiest days.

What we're doing now

Our Abuse team needs a chance to catch up on all reported spamming accounts and make sure that all spam works are deleted. Currently the spammers are creating new accounts faster than we can ban them. Our sysadmins and coders need some time to come up with a sustainable solution to prevent further bot attacks.

To that end, we're temporarily suspending issuing invites from our automated queue. Existing account holders can still request invite codes and share them with friends. You can use existing invites to sign up for an account; account creation itself will not be affected. (Please note: Requests for invite codes have to be manually approved by a site admin, so there might be a delay of two to three days before you receive them; challenge moderators can contact Support for invites if their project is about to open.)

We are working hard to get these problems under control, so the invite queue should be back in business soon! Thank you for your patience as we work through the issues.

What you can do

There are some things you can do to help:

  • When downloading multiple works, wait a few moments between each download. If you're downloading too many works at once, you will be taken to an error page warning you to slow down or risk being blocked from accessing the Archive for 24 hours.
  • Please don't report spam works. While we appreciate all the reports we've received so far, we now have a system in place that allows us to find spam quickly. Responding to reports of spam takes time away from dealing with it.
  • Keep an eye on our Twitter account, @AO3_Status, for updates!

Known problems with the automated download limit

We have been getting reports of users who run into a message about excessive downloads even if they were downloading only a few works, or none at all. This may happen for several reasons that are unfortunately beyond our control:

  • They pressed the download button once, but their device went on a rampage trying to download the file many times. A possible cause for this is a download accelerator, so try disabling any relevant browser extensions or software, or try downloading works in another browser or on another device.
  • They share an IP address with a group of people, one of whom hit the current download limit and got everyone else with the same IP address banned as well. This can be caused by VPNs, Tor software, or an ISP that assigns the same IP address to a group of customers (more likely to happen on phones). Please try using a different device, if you can. (A rough sketch of how such per-IP limits typically work follows this list.)
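
For the technically curious, limits like this are usually keyed on the client IP address, which is why everyone behind a shared IP is affected together. Here is an illustrative example using the Rack::Attack middleware -- not our actual setup, and the path, limit, and period are made up:

    # Illustrative only: a per-IP download throttle. The middleware, path,
    # limit, and period are assumptions, not the Archive's real settings.
    Rack::Attack.throttle("downloads/ip", limit: 60, period: 1.hour) do |req|
      req.ip if req.path.start_with?("/downloads/")
    end

Because the key is the IP address, a VPN exit node or an ISP that pools customers behind one address can push a whole group of users over the limit together.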

We apologize if you have to deal with any of these and we'll do our best to restore proper access for all users as soon as possible!


Published: 2014-09-09 16:43:05 -0400

Credits

  • Coder: Elz
  • Code reviewers: Enigel, james_
  • Testers: Ariana, Lady Oscar, mumble, Ridicully, sarken

Overview

With today's deploy we're making some changes to our search index code, which we hope will solve some ongoing problems with suddenly "missing" works or bookmarks and inaccurate work counts.

In order to improve consistency and reduce the load on our search engine, we'll be sending updates to it on a more controlled schedule. The trade-off is that it may take a couple of minutes for new works, chapters, and bookmarks to appear on listing pages (e.g. for a fandom tag or in a collection), but those pages will ultimately be more consistent and our systems should function more reliably.

You can read on for technical details!

The Problem

We use a software package called Elasticsearch for most of our search and filtering needs. It's a powerful system for organizing and presenting all the information in our database and allows for all sorts of custom searches and tag combinations. To keep our search results up to date for everyone using the Archive, we need to ensure that freshly-posted works, new comments and kudos, edited bookmarks, new tags, etc. all make it into our search index practically in real time.

However, as the volume of updates has grown considerably over the last couple of years, processing them has taken longer and slowed down the general functioning of the underlying system. That slowness has interacted badly with the way we cache data in our current code: works and bookmarks occasionally seem to appear and disappear from site listings, and the counts you see on different pages and sidebars may be significantly different from one another.

That's understandably alarming to anyone who encounters it, and fixing it has been our top priority.

The First Step

We are making some major changes to our various "re-indexing" processes, which take every relevant change that happens to works/bookmarks/tags and update our massive search index accordingly (a code sketch of the queue-and-batch approach follows the list):

  • Instead of going directly into Elasticsearch, all indexing tasks will now be added to a queue that can be processed in a more orderly fashion. (We were queueing some updates before, but not all of them.)
  • The queued updates will then be sent to the search engine in batches to reduce the number of requests, which should help with performance.
  • Cached pages will be expired (i.e., updated to reflect new data) not when the database says so, but when Elasticsearch is ready.
  • Updates concerning hit counts, kudos, comments, and bookmarks on a work (i.e. "stats" data) will be processed more efficiently but less frequently.
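
As a sketch of the queue-and-batch idea (assuming the elasticsearch-ruby client; the queue class and index layout here are illustrative, not our actual code):

    require "elasticsearch"

    # Illustrative only: drain a batch of queued work ids and send them to
    # Elasticsearch in one bulk request instead of one request per update.
    # IndexQueue is a hypothetical queue; the client calls are the
    # elasticsearch-ruby API.
    ids    = IndexQueue.pop_batch(:works, 1_000)
    client = Elasticsearch::Client.new

    body = Work.where(id: ids).map do |work|
      { index: { _index: "works", _type: "work", _id: work.id, data: work.as_json } }
    end

    client.bulk(body: body) unless body.empty?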

As a result, work updates will take a minute to affect search results and work listings, and background changes to tags (e.g. two tags being linked together) will take a few minutes longer to be reflected in listings. Stats data (hits, kudos, etc.) will be added to the search index only once an hour. The upside of this is that listings should be more consistent across the site!

(Please note that this affects only searching, sorting, and filtering! The kudos count in a work blurb, for example, is based on the database total, so you may notice slight inconsistencies between those numbers and the order you see when sorting by kudos.)

The Next Step

We're hoping that these changes will help to solve the immediate problems that we're facing, but we're also continuing to work on long-term plans and improvements. We're currently preparing to upgrade our Elasticsearch cluster from version 0.90 to 1.3 (which has better performance and backup tools), switch our code to a better client, and make some changes to the way we index data to continue to make the system more efficient.

One big improvement will be in the way we index bookmarks. When we set up our current system, we had a much smaller number of bookmarks relative to other content on the site. The old Elasticsearch client we were using also had some limitations, so we ended up indexing the data for bookmarked works together with each of their individual bookmarks, which meant that a single update to a work required updates to dozens or hundreds of bookmark records. That's been a serious problem when changes are made to tags in particular, where a small change can potentially kick off a large cascade of re-indexes. It's also made it more difficult to keep up with regular changes to works, which led to problems with bookmark sorting by date. We're reorganizing this using Elasticsearch's parent-child index structure, and we hope it will also have positive long-term effects on performance.
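
Roughly, a parent-child setup looks like this in Elasticsearch 1.x (the index, type, and field names are illustrative; our real mappings are more involved):

    # Sketch of a parent-child mapping; each bookmark document points at its
    # parent work document instead of embedding a copy of the work's data.
    client.indices.create index: "bookmarks_index", body: {
      mappings: {
        work:     { properties: { title: { type: "string" } } },
        bookmark: {
          _parent:    { type: "work" },
          properties: { bookmarker_notes: { type: "string" } }
        }
      }
    }

    # Indexing a bookmark only needs its own data plus the parent work's id,
    # so editing a work no longer forces every one of its bookmarks to be re-indexed.
    client.index index: "bookmarks_index", type: "bookmark", id: 1,
                 parent: 42, body: { bookmarker_notes: "great fic" }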

Overall, we're continuing to learn and look for better solutions as the Archive grows. We apologize for the bumpy ride lately, and we hope that the latest set of changes will make things run more smoothly. We should have more improvements for you in the coming months, and in the meantime, we thank you for your patience!


Published: 2014-01-23 16:26:51 -0500

If you're a regular Archive visitor or if you follow our AO3_Status Twitter account, you may have noticed that we've experienced a number of short downtime incidents over the last few weeks. Here's a brief explanation of what's happening and what we're doing to fix things.

The issue

Every now and then, the volume of traffic we get and the amount of data we're hosting starts to hit the ceiling of what our existing infrastructure can support. We try to plan ahead and start making improvements in advance, but sometimes things simply catch up to us a little too quickly, which is what's happening now.

The good news is that we do have fixes in the works: we've ordered some new servers, and we hope to have them up and running soon. We're making plans to upgrade our database system to a cluster setup that will handle failures better and support more traffic; however, this will take a little longer. And we're working on a number of significant code fixes to improve bottlenecks and reduce server load - we hope to have the first of those out within the next two weeks.

One area that's affected is the counts of hits, kudos, comments, and bookmarks on works, so you may see delays in those updating, which will also result in slightly inaccurate search and sort results. Issues with the "Date Updated" sorting on bookmark pages will persist until a larger code rewrite has been deployed.

Behind the scenes

We apologize to everyone who's been affected by these sudden outages, and we'll do our best to minimize the disruption as we work on making things better! We have an all-volunteer staff, so while we try to respond to server problems quickly, they sometimes happen when we're all either at work or asleep, and we can't always fix things as soon as we'd like.

While we appreciate how patient and supportive most Archive users are, please keep in mind that tweets and support requests go to real people who may find threats of violence or repeated expletives aimed at them upsetting. Definitely let us know about problems, but try to keep it to language you wouldn't mind seeing in your own inbox, and please understand if we can't predict immediately how long a sudden downtime might take.

The future

Ultimately, we need to keep growing and making things work better because more and more people are using AO3 each year, and that's something to be excited about. December and January tend to bring a lot of activity to the site - holiday gift exchanges are posted or revealed, people are on vacation, and a number of fandoms have new source material.

We're looking forward to seeing all the new fanworks that people create this year, and we'll do our best to keep up with you! And if you're able to donate or volunteer your time, that's a huge help, and we're always thrilled to hear from you.


Published: 2013-11-20 14:45:01 -0500

Update December 14, 18:00 UTC: As of this week, all systems should be back to normal. We're still working on optimizing our server settings, so very brief downtimes for maintenance should be expected. If bookmarks still won't sort correctly for you, we're working on a more permanent fix to the underlying issue, but it might be a short while yet. As always, we're keeping an eye on Support tickets and messages to our Twitter account, and we will react as quickly as possible if anything seems off. Thank you all for your patience.

Update December 3, 16:00 UTC: We have re-enabled the sort and filter sidebar on work listings only. Bookmark filtering and sorting is still turned off and will likely be off for a few more days. (The filters are the sidebar that allows you to narrow down a list of works or bookmarks by character, rating, etc.) We will continue to work on the underlying issue. In the meantime, we suggest using the Works Search to help find what you’re looking for.

All works and bookmarks should be showing up normally. Work re-indexing is complete, so we hope to be able to turn on filtering for works again in the next day or two.

Bookmark re-indexing is still ongoing, so it will be several days before we can turn bookmark filtering back on.

Please follow the @AO3_Status Twitter feed or check back here for further updates.

Update 2 Dec: Listings for works, bookmarks, tags, and pseuds are unavailable due to issues with our search index. Our coding and systems volunteers are currently looking into it, and we will keep you updated on our progress. Our Support team is working through a backlog, so there might be delays in getting back to users individually. Please consider checking the @AO3_Status Twitter feed or our banner alerts instead.

Update 30 Nov: All bookmarks have been re-indexed and should show up correctly again. Any issues that might still be lingering will be sorted out when we upgrade Elasticsearch, which we're planning for mid-December. Downloads should be working without the need for any workarounds now. Thank you for your patience!

The Good

We recently deployed new code, which fixed a couple of very old bugs and introduced improvements to the kudos feature. Behind the scenes, we've been working on setting up new servers and tweaking server settings to make everything run a little more smoothly during peak times. The end of the year (holiday season in many parts of the world) usually means more people with more free time to participate in more challenges, read more fic, or post more fanart, resulting in more site usage.

One way to measure site usage is to look at page views. This number tells us how many pages (a single work, a list of search results, a set of bookmarks in a collection, a user profile, etc.) were served to users during a certain time frame. Some of these pages can contain a lot of information that has to be retrieved from the database - and a lot of information being retrieved from the database at the same time can result in site slowness and server woes. During the first week of January we had 27.6 million page views. As of November 17, we registered 42.9 million page views for the preceding week.

We've watched our traffic stats grow dramatically over the years, and we've been doing our best to keep up with our users! Buying and installing more servers is one part of the solution, and we can't thank our all-volunteer Systems team enough for all their hard work behind the scenes. On the other hand, our code needs to be constantly reviewed and updated to match new demands.

Writing code that "scales" - that works well even as the site grows - is a complicated and neverending task that requires a thorough understanding of how all parts of the Archive work together, not just right now, but in six months, or a year, or two years. As we're all volunteers who work on the Archive in our free time (or during lunch breaks), and there are only a handful of us with the experience to really dig deep into the code, this is less straightforward than a server acquisition and will take a little more time.

The Bad

As such, we've been battling some site slowness, sudden downtimes (thankfully brief due to our awesome Systems team) and an uptick in error pages. We can only ask for your patience as we investigate likely causes and discuss possible fixes.

For the time being, we have asked our intrepid tag wranglers to refrain from wrangling on Sundays, as this is our busiest day and moving a lot of tags around sadly adds to the strain on the current servers. We sincerely apologize to all wrangling volunteers who have to catch up with new tags on Monday, and to users who might notice delays (e.g. a new fandom tag that's not marked as canonical right away). From what we've seen so far, this move has helped in keeping the site stable on weekends.

The Ugly

We are aware of an issue with seemingly "vanishing" bookmarks, in which the correct number of bookmarks is displayed in the sidebar, but not all are actually shown. The most likely culprit is our search index, powered by a framework called elasticsearch. All our information (work content, tags, bookmarks, users, kudos, etc. etc.) is stored in a database, and elasticsearch provides a quicker, neater access to some of this data. This allows for fast searches, and lets us build lists of works and bookmarks (e.g. by tag) without having to ask the database to give us every single scrap of info each time.

It now appears that elasticsearch has become slightly out of sync with the database. We are looking into possible fixes and are planning an elasticsearch software upgrade; however, we must carefully test it first to ensure data safety.

This problem also affects bookmark sorting, which has been broken for several weeks now. We are very sorry! If you want to know if a particular work has been updated, please consider subscribing to the work (look for the "Subscribe" button at the top of the page). This will send you a notification when a new chapter has been posted.

(Note: Since we're sending out a lot of notifications about kudos, comments, and subscriptions every day, some email providers are shoving our messages into the junk folder, or denying them passage to your account outright. Please add our email address do-not-reply@archiveofourown.org to your contacts, create a filter to never send our emails to spam, or check the new "Social" tab in Gmail if you're waiting for notifications.)

A problem with file downloads only cropped up fairly recently. We don't think this is related to the most recent deploy, and will investigate possible causes. In the meantime, if a .pdf or .mobi file gives you an error 500, try downloading the HTML version first, then give it another shot. This should help until we've fixed the underlying problem.

What You Can Do

If you have not already done so, consider subscribing to our twitter feed @AO3_Status or following us on Tumblr. You can also visit the AO3 News page for updates in the coming weeks or subscribe to the feed.

We thank everyone who has written in about their experiences, and will keep you all updated on our progress. Thank you for your patience as we work on this!


Published: 2013-04-05 06:55:00 -0400

We're currently dealing with a few issues relating to our last deploy (Release 0.9.6), and we wanted to keep you in the loop on how we're handling them (and let you know about workarounds for a few problems):

Jumbled looking header

We launched a new design for our header, which should look like this. If it's looking jumbled, buttons are overlapping, or it otherwise looks broken, please refresh the page. If that doesn't help, you may need to clear your browser cache and then refresh again. If you're using a mobile browser on your phone or tablet, and clearing the cache alone doesn't help, try completely closing the app and opening it again.

Share box always open

Some users have reported that the 'Share' button that usually pops up a pre-formatted block of work information was replaced by the text it's supposed to hold. This severely messes up all work pages. It's due to a conflict with some userscripts people have installed in their browsers, which are interacting badly with a jQuery plugin we started using for our help boxes.

The quick and dirty way of solving this problem is to disable AO3-related userscripts.

The problem is caused by scripts which contain a @require line, which tells the script to grab a copy of jQuery from Google's servers to work with. That fresh copy then overrides our own Javascript stack, causing wonkiness. If you want to keep using the script, you (or the original creator of the script) need to edit it so it doesn't use this line. A helpful Tumblr user has written up instructions on how you can edit a userscript yourself, although this might not work in all browsers. Especially if you're using Chrome, try this little script change instead.

Subscription emails are missing information

We tested all possible Archive emails in various browsers and clients, which amounts to a lot of emails. In all this testing we missed that some information had vanished from some kinds of subscription notifications, for which we apologize.

Many, many users wrote in to Support about this and commented on our Release Notes, and we are rolling out a very quick fix to bring back total chapter count, work summaries, and additional tags to all notifications.

New styling of emails

We launched new multipart emails which take advantage of HTML styling. This was to give us more control over the layout of emails, and to help provide a more consistent look. Our old mailer templates were mostly a mix of text and some HTML mark-up (for paragraph breaks, links, and some text styling) that did not actually declare itself properly to email clients. This raised our spam score, looked broken in some text-based email programs, and made it harder to add emails for new features, as there was no consistent style to base them on.

We assumed that users who preferred text-only emails would select this option in their email clients; however, it's become clear that this isn't meeting our users' needs, for various reasons, and lots of people would prefer to have a text-only option.

We've taken note of all the (strongly felt!) responses to the new emails, and we're looking at solutions, including adding a user preference for plain text emails. We need some time for our coders to look at the issue and figure out the best way forward; once we've done that, we'll update users on what we're planning to do and when we will do it. Please rest assured we are taking user feedback very seriously; we appreciate your patience while we work on this.

Some users also raised specific concerns about the styling, notably the size of the font in certain email clients and devices. We'll also be looking to address these issues and tweak the HTML emails themselves.


Published: 2012-08-23 15:06:52 -0400

We've been receiving a small number of reports of people unable to access the Archive of Our Own - if you've been affected by this issue, this post will give you a bit more information about what's going on. We'd also like to appeal for your help as we work to fix it!

What's wrong

We recently upgraded our firewall to improve the security of our servers. Unfortunately, it seems that we haven't got the configuration absolutely right and it's causing connection problems for some users. This problem is only affecting a small number of users, and it's not completely consistent. However, if you've received an Error 404, a warning saying 'Secure Connection Failed', or you've been redirected to a url with 8080 in it, then this is what's causing it.

While we work on the issue, we have temporarily disabled https on the site, as it was causing some additional problems. This means that if you have a browser extension such as HTTPS Everywhere enabled, or you use a browser which enforces https by default, the site will not load - apologies for this. If the site has been consistently timing out for you, it's worth checking whether this applies in your case - if the url defaults to https://archiveofourown.org, then you have been affected by this issue.

How you can work around the problem

If you're being affected by the https issue, you can work around it by adding an exception to HTTPS Everywhere or by using a different browser.

If you're getting errors at random, then clearing your browser cache and refreshing should help. You may also find it helps to use another browser.

How you can help us

We're working to get to the bottom of this problem, and we know we've already reduced the number of errors which are occurring. However, it would be enormously helpful for our Systems team to have a little more information. If you encounter an error, please submit a Support request giving the following information:

  • Your IP address: You can find this out by going to http://www.whatsmyip.org/.
  • The url of the page where you got the error.
  • The exact error you got: You may find it easiest to copy and paste the error. If you didn't get an error but the page just never loads, tell us that.
  • What time (UTC) you got the error: Please check the current time in UTC when you get the error - this will make it easier for us to keep track, since we're dealing with users in lots of different timezones.
  • Is the error intermittent or constant?
  • What browser are you using? It would be extra helpful if you can tell us your user agent string, which you can find out by going to http://whatsmyuseragent.com/.

If you know how to view the source of a page in your browser, it would also be very helpful if you could copy and paste the source code of the page that throws up the problem.

If you're comfortable working on the command line, then it would also be helpful if you could provide us with some additional information (if you're already wondering what we're talking about, don't worry, you can ignore this bit). Open up a command line window and type nslookup ao3.org. Copy whatever pops up in your console and paste it into your Support message.

If you can't access the Archive at all (and thus can't submit a Support request there) you can send us this information via our backup Support form.

A note on https

We know that many people prefer to use an https connection to provide additional security on the web, and we will be re-enabling this option as soon as we can. Because the AO3 doesn't handle data such as credit card information, browsing without https doesn't expose our users to any significant security risks. However, it is always good policy to use a unique password (i.e. don't use the same username and password combo for the AO3 and your email account) to ensure that if someone else obtains your AO3 credentials for any reason, they can't use them to access other data. Apologies for the inconvenience to users while this option is disabled.

ETA for a fix

We're hoping to resolve these lingering problems asap; however, our Systems team have limited time, so we may not be able to track down the root of the problem as fast as we'd like. We'll keep you updated, and in the meantime apologise for the inconvenience.


