The Systems team is responsible for all the ‘behind the scenes’ parts of the OTW’s technical work; in particular, we maintain the servers which run all the OTW’s projects, including the Archive of Our Own and Fanlore. As the OTW has grown, so has our job, and we’ve been very busy over the past twelve months!
This update gives an overview of some of the key changes we’ve made this year. While it doesn’t cover every detail, we hope it will give our users a sense of the work we’ve done and (importantly) where we’ve spent money. We’ve included quite a few technical details for those users who are curious, but hope that non-technical users will be able to get the gist.
At the start of January 2012, we were maintaining 12 servers: 6 physical machines and 6 virtual ones. You can see more details in January 2012 - our server setup.
The Archive of Our Own was suffering performance problems as more users joined the site. We spent time working to make things more reliable and balancing unicorns. We had to disable our online web tracking system (piwik), as it caused slow responses with the Archive. Although our work helped performance, server OTW2 (running Archive-related services) started collapsing under the load.
We implemented a system which killed off runaway processes that were created when users were downloading works from the Archive of Our Own.
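For curious readers, here is a minimal sketch of how a watchdog of this kind could work. It is an illustration under stated assumptions, not the script we actually run: the command name `download-worker` and the ten-minute limit are hypothetical, and the sketch assumes a Linux `ps` that supports the `etimes` (elapsed seconds) column.

```python
import os
import signal
import subprocess

# Hypothetical values for illustration only; the real names and limits
# are not given in this post.
TARGET_COMMANDS = {"download-worker"}
MAX_RUNTIME_SECONDS = 600


def parse_ps(output):
    """Parse lines of 'PID ELAPSED_SECONDS COMMAND' into tuples."""
    rows = []
    for line in output.splitlines():
        parts = line.split(None, 2)
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        pid, elapsed, command = parts
        rows.append((int(pid), int(elapsed), command.strip()))
    return rows


def find_runaways(rows, targets=TARGET_COMMANDS, limit=MAX_RUNTIME_SECONDS):
    """Return PIDs of target commands that have run longer than the limit."""
    return [pid for pid, elapsed, command in rows
            if command in targets and elapsed > limit]


def kill_runaways():
    """List all processes and send SIGTERM to any runaway ones."""
    output = subprocess.run(
        ["ps", "-e", "-o", "pid=", "-o", "etimes=", "-o", "comm="],
        capture_output=True, text=True, check=True,
    ).stdout
    for pid in find_runaways(parse_ps(output)):
        try:
            os.kill(pid, signal.SIGTERM)
        except ProcessLookupError:
            pass  # the process exited on its own in the meantime
```

A script like this would typically be run every few minutes from a scheduler such as cron. Sending SIGTERM rather than SIGKILL gives the process a chance to clean up its temporary files before exiting.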
A bug caused Linux systems to have performance issues when their uptime reached 200 days. As our servers all run Linux, we were affected. A new kernel and a reboot on our Linux-based servers fixed the problem very quickly \0/.
June - a month of many happenings!
Our long-serving staffer Sidra stepped down as Technical lead and joint Chair of the Systems group. We have missed her and hope to see her rejoin us in the future.
In response to the rising numbers of visitors to the AO3, we upgraded our colocation bandwidth (the amount of network traffic) to an unmetered 100 megabits per second, which cost an additional $100 per month.
Demands on our servers were also increasing behind the scenes: the growing number of coders and the complexity of the Archive meant that the webdevs (used by our coders to develop new code) and the Test Archive (where we test out new code before releasing it onto the live site) had become unusable. We upgraded the servers these were hosted on, which increased our virtual server bill by an additional $200 per month.
We decided that we had reached a size where it would be worth buying our own servers rather than using virtual servers for the webdevs. We investigated the costs of buying new servers, but happily later in the month, two servers were donated to OTW. We then started the long task of finding a suitable hosting provider, as the servers were located a long way from our main colocation host and shipping costs were high.
Performance issues on the Archive of Our Own were at their height during June, and we spent lots of time working to address these issues. Some parts of the site were unable to cope with the number of users who were now accessing the site: in particular, we had significant problems with server OTW5 and the demands created by the tag filters, which required a lot of space for temporary files.
In order to reduce the demands on the servers, we implemented Squid caching on the Archive, which alleviated some of the problems. On the 13th of June we decided to disable the tag filters and the Archive became significantly more stable. This reduced the amount of hour-by-hour hand-holding the servers needed, giving our teams more time to work on longer-term solutions, including the code for the new version of the filters.
The first of July brought a leap second which caused servers around the globe to slow down. We fixed the issue by patching the servers as needed and then rebooting - with just half an hour turnaround!
We consulted with Mark from Dreamwidth about the systems architecture of the Archive. We got a couple of very useful pointers (thanks, Mark!) as well as concrete advice, such as increasing the amount of memory available for our memcache caching.
A disk died in server OTW2 and a replacement disk was donated by a member of the Systems group.
We started to use a large cloud-hosted server space to develop the new system that would replace the old tagging system. This machine was not turned on at all times, only when the developers were coding or when the new system was being tested. Hiring this server space allowed us to develop the code on a full copy of the Archive’s data and do more effective testing, which more closely replicated the conditions of the real site. Since the filters are such an important part of the AO3, and have such big performance implications, this was very important.
We upgraded the RAM on servers OTW3, OTW4 and OTW5. We replaced all of the RAM in OTW5 and put its old RAM in OTW3 and OTW4. This cost approximately $2,200 and gave us some noticeable performance improvements on the Archive.
And lastly, it was SysAdmin Day. There was cake. \0/
We started using a managed firewall at our main colocation facility. This provides both a much simpler configuration of the main network connection to the servers, and allows secure remote access for systems administrators and senior coders. It costs an additional $50 per month.
A typo in our DNS configuration, made while switching over to the managed firewall, allowed a spammer to redirect some of our traffic to their site. Happily we were able to fix this as soon as the problem was reported, although the fix took a while to show for all users. The firewall changes also caused a few lingering issues for users connecting via certain methods; these took a little while to fix.
We purchased battery backup devices for the RAID controllers on OTW1 and OTW2, meaning their disk systems are much more performant and reliable. The batteries and installation cost a total of $250.
A hardware-based firewall (Mikrotik RB1100AHx2) was purchased and configured for the new colocation facility, costing around $600.
Systems supported the coders in getting the new embedded media player to work on the Archive.
The donated, dedicated hardware for Dev and Stage (our webdev and test servers) was installed in its new colocation site, after long and hard hours spent investigating options for hosting companies and insurance. After installation, the initial configuration required to run the Archive code was completed. These machines support a larger number of coders than was previously possible, giving them access to a hosted development environment to run the Archive. The hosting cost is approximately $400 per month.
We were able to decommission the virtual machine that was the Dev server (for webdevs) immediately, saving $319 per month - so the new hosted servers are only costing us about $80 more than the old setup. Considerable work was done to get Elastic Search working in our dev, test and production environments (production is the live Archive).
We were running out of disk space on OTW5, which is critical to the operation of the Archive. We purchased a pair of 200GB Intel 710s and adapters, which were installed in OTW5, for a total cost of $1,700. These disks are expensive, but they are fast and are enterprise grade (meant for heavy production use) rather than home grade, which is significant on a site such as ours. The lifespan of a solid state drive (SSD) depends on how much data is written to it, and the 710s are rated at an endurance of 1.5PB with 20 percent over provisioning (meaning they will last us far longer than a home grade SSD).
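To make the endurance figure a little more concrete, here is a rough back-of-the-envelope sketch. The 1.5PB rating comes from the paragraph above; the daily write volume of 100GB is purely a hypothetical number chosen for illustration, not a measurement from OTW5, and the calculation ignores real-world factors such as write amplification.

```python
PETABYTE = 10**15   # decimal units, as drive vendors usually quote them
GIGABYTE = 10**9


def years_of_endurance(rated_bytes, bytes_written_per_day):
    """Estimate how many years a drive lasts at a constant daily write rate."""
    if bytes_written_per_day <= 0:
        raise ValueError("daily write volume must be positive")
    return rated_bytes / bytes_written_per_day / 365


rated = 1.5 * PETABYTE          # endurance rating quoted above
daily_writes = 100 * GIGABYTE   # hypothetical workload, for illustration only

print(f"{years_of_endurance(rated, daily_writes):.1f} years")  # prints "41.1 years"
```

Even if the real write volume were several times higher than this made-up figure, the drives would still be expected to outlast the servers they sit in.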
At roughly the same time, the tag filters were returned to the Archive using Elastic Search. There was much rejoicing.
We were waiting until the tag filters were back in place before deciding what servers we would need to buy to provide the Archive with enough performance to continue to grow in the following year. After discussing budgets with Finance and Board, we put a proposal through for three servers for a total price of $28,200. We arrived at this price after checking with a number of vendors; we went for the cheapest vendor we were happy with. The difference in price between the cheapest and most expensive vendor was $2,600. The servers will be described in January 2013 - server setup.
Having bought the servers, we needed to host them. We had to decide whether to rent a whole 19-inch rack to ourselves or to try to squeeze the servers into existing space in our shared facility. In the long term we will likely require a 19-inch rack to ourselves, but as this will cost about $2,100 per month, we worked hard to find a way of splitting our servers into two sections so that we could fit them into existing space.
We did this by moving all the Archive-related functions from OTW1 and OTW2, then moving the machines and the QNAP to another location in the facility. At this point we discovered that the QNAP did not reboot cleanly and we had to have a KVM installed before we could get it working. We are renting a KVM (at $25 per month) until we can reduce the reliance on the QNAP to a minimum.
January and February 2013
So far in 2013, we’ve been working to set up the new servers. You can see the details of our new servers and their setup in January 2013 - server setup, and find out more about our plans in Going Forward: our server setup and plans.
These are only the major items: there are many pieces of work which are done on a regular basis by all the members of the team. The Systems team averages between 30 and 50 hours a week on the organization’s business. Most of the team are professional systems administrators or IT professionals, and between us we have over 90 years of experience.
Systems are proud to support the OTW and its projects. We are all volunteers, but as you can see from the details here, providing the service is not free. Servers and hosting are expensive! We will never place advertising on the Archive or any of our other sites, so please do consider donating to the Organization for Transformative Works. Donating at least $10 will gain you membership to the OTW and allow you to vote in our elections. (Plus you will get warm fuzzies in your tummy and know you are doing good things for all of fandom-kind!)