AO3 News

Post Header

Published:
2017-03-21 15:17:26 -0400
Tags:

Update, April 4: We successfully deployed an improved version of the code referenced in this post on March 29. It now takes considerably less time to add a work to the database.

-

You may have noticed the Archive has been slow or giving 502 errors when posting or editing works, particularly on weekends and during other popular posting times. Our development and Systems teams have been working to address this issue, but our March 17 attempt failed, leading to several hours of downtime and site-wide slowness.

Overview

Whenever a user posts or edits a work, the Archive updates how many times each tag on the work has been used across the site. During this time, the record is locked and the database cannot process other changes to those tags. This can result in slowness or even 502 errors when multiple people are trying to post using the same tag. Because all works are required to use rating and warning tags, works' tags frequently overlap during busy posting times.

Unfortunately, the only workaround currently available is to avoid posting, editing, or adding chapters to works at peak times, particularly Saturdays and Sundays (UTC). We strongly recommend saving your work elsewhere so changes won’t be lost if you receive a 502.

For several weeks, we’ve had temporary measures in place to decrease the number of 502 errors. However, posting is still slow and errors are still occurring, so we’ve been looking for more ways to use hardware and software to speed up the posting process.

Our Friday, March 17, downtime was scheduled so we could deploy a code change we hoped would help. The change would have allowed us to cache tag counts for large tags (e.g. ratings, common genres, and popular fandoms), updating them only periodically rather than every time a work was posted or edited. (We chose to cache only large tags because the difference between 1,456 and 1,464 is less significant than the difference between one and nine.) However, the change led to roughly nine hours of instability and slowness and had to be rolled back.

Fixing this is our top priority, and we are continuing to look for solutions. Meanwhile, we’re updating our version of the Rails framework, which is responsible for the slow counting process. While we don’t believe this upgrade will be a solution by itself, we are optimistic it will give us a slight performance boost.

March 17 incident report

The code deployed on March 17 allowed us to set a caching period for a tag’s use count based on the size of the tag. While the caching period and tag sizes were adjusted throughout the day, the code used the following settings when it was deployed:

  • Small tags with less than 1,000 uses would not be cached.
  • Medium tags with 1,000-39,999 uses would be cached for 3-40 minutes, depending on the tag’s size.
  • Large tags with at least 40,000 uses would be cached for 40-60 minutes, but the cache would be refreshed every 30 minutes. Unlike small and medium tags, the counts for large tags would not update when a work was posted -- they would only update during browsing. Refreshing the cache every 30 minutes would prevent pages from loading slowly.

We chose to deploy at a time of light system load so we would be able to fine tune these settings before the heaviest weekend load. The deploy process itself went smoothly, beginning at 12:00 UTC and ending at 12:14 -- well within the 30 minutes we allotted for downtime.

By 12:40, we were under heavy load and had to restart one of our databases. We also updated the settings for the new code so tags with 250 or more uses would fall into the “medium” range and be cached. We increased the minimum caching period for medium tags from three minutes to 10.

At 12:50, we could see we had too many writes going to the database. To stabilize the site, we made it so only two out of seven servers were writing cache counts to the database.

However, at 13:15, the number of writes overwhelmed MySQL. It was constantly writing, making the service unavailable and eventually crashing. We put the Archive into maintenance mode and began a full MySQL cluster restart. Because the writes had exceeded the databases' capabilities, the databases had become out of sync with each other. Resynchronizing the first two servers by the built-in method took about 65 minutes, starting at 13:25 and completing at 14:30. Using a different method to bring the third recalcitrant server into line allowed us to return the system to use sooner.

By 14:57, we had a working set of two out of three MySQL servers in a cluster and were able to bring the Archive back online. Before bringing the site back, we also updated the code for the tag autocomplete, replacing a call that could write to the database with a simple read instead.

At 17:48, we were able to bring the last MySQL server back and rebalance the load across all three servers. However, the database dealing with writes was sitting at 91% load rather than the more normal 4-6%.

At 18:07, we made it so only one app server wrote tags’ cache values to the database. This dropped the load on the write database to about 50%.

At 19:40, we began implementing a hotfix that significantly reduced writes to the database server, but having all seven systems writing to the database once more put the load up to about 89%.

At 20:30, approximately half an hour after the hotfix was finished, we removed the writes from three of the seven machines. While this reduced the load, the reduction was not significant enough to resolve the issues the Archive was experiencing. Nevertheless, we let the system run for 30 minutes so we could monitor its performance.

Finally, at 21:07, we decided to take the Archive offline and revert the release. The Archive was back up and running the old code by 21:25.

We believe the issues with this caching change were caused by underestimating the number of small tags on the Archive and overestimating the accuracy of their existing counts. With the new code in place, the Archive began correcting the inaccurate counts for small tags, leading to many more writes than we anticipated. If we're able to get these writes under control, we believe this code might still be a viable solution. Unfortunately, this is made difficult by the fact we can’t simulate production-level load on our testing environment.

Going forward

We are currently considering five possible ways to improve posting speed going forward, although other options might present themselves as we continue to study the situation.

  1. Continue with the caching approach from our March 17 deploy. Although we chose to revert the code due to the downtime it had already caused, we believe we were close to resolving the issue with database writes. We discovered that the writes overwhelming our database were largely secondary writes caused by our tag sweeper. These secondary writes could likely be reduced by putting checks in the sweeper to prevent unnecessary updates to tag counts.
  2. Use the rollout gem to alternate between the current code and the code from our March 17 deploy. This would allow us to deploy and troubleshoot the new caching code with minimal interruption to normal Archive function. We would be able to study the load caused by the new code while being able to switch back to the old code before problems arose. However, it would also make the new code much more complex. This means the code would not only be more error-prone, but would also take a while to write, and users would have to put up with the 502 errors longer.
  3. Monkey patch the Rails code that updates tag counts. We could modify the default Rails code so it would still update the count for small tags, but not even try to update the count on large tags. We could then add a task that would periodically update the count on larger tags.
  4. Break work posting into smaller transactions. The current slowness comes from large transactions that are live for too long. Breaking the posting process into smaller parts would resolve that, but we would then run the risk of creating inconsistencies in the database. In other words, if something went wrong while a user was updating their work, only some of their changes might be saved.
  5. Completely redesign work posting. We currently have about 19,000 drafts and 95,000 works created in a month, and moving drafts to a separate table would allow us to only update the tag counts when a work was finally posted. We could then make posting from a draft the only option. Pressing the "Post" button on a draft would set a flag on the entry in the draft table and add a Resque job to post the work, allowing us to serialize updates to tag counts. Because the user would only be making a minor change in the database, the web page would return instantly. However, there would be a wait before the work was actually posted.
  6. The unexpected downtime that occurred around noon UTC on Tuesday, March 21, was caused by an unusually high number of requests to Elasticsearch and is unrelated to the issues discussed in this post. A temporary fix is in currently in place and we are looking for long term solutions.