AO3 News

Published:
2010-11-11 23:11:40 -0500

Along with the upgrade to Rails 3, there have been significant changes and improvements to our HTML sanitizing and parsing in Release 0.8.2. These changes should make things clearer for authors and much faster for readers!

Here is a quick breakdown for those who just want the highlights, followed by a more detailed explanation of what was changed and how it all works.

Highlights

  • Blank lines and carriage returns will now be converted to paragraph (<p></p>) and line-break (<br />) tags in the text editor.

  • The text will automatically be parsed and "cleaned up" -- any tags that were left open get closed, any mis-nested tags get fixed, etc.

  • The text will be sanitized, to remove any elements that are potentially harmful to our server.

  • This change fixes the known bug where switching from HTML mode to Rich Text mode causes all your paragraphs to disappear. (Yay!)

  • This change will also allow users to embed video from YouTube, Vimeo, blip.tv, Dailymotion, Viddler, Metacafe, and 4shared. (Yay!)

What's Behind the Scenes

The new back end for content works in three steps.

  1. There is now a paragraph-adder that converts blank lines and carriage returns into paragraph tags (<p></p>) and break tags (<br />) based on a few simple rules:
  • A blank line between two pieces of text will split them into separate paragraphs:
    Here is paragraph one.

    Here is paragraph two.

    will become:

    <p>Here is paragraph one.</p>

    <p>Here is paragraph two.</p>

  • A carriage return or newline in the middle of text will add a break tag:
    Here is a line
    with a carriage return in the middle.

    will become:

    Here is a line <br />
    with a carriage return in the middle.

  • We will also preserve extra blank lines: if you have TWO blank lines in a row, we will add an empty paragraph:
    Here is paragraph one, and I want extra space after it.


    Here is paragraph two.

    will become:

    <p>Here is paragraph one, and I want extra space after it.</p>

    <p> </p>

    <p>Here is paragraph two.</p>

  • Note: The paragraph-adder will put <br /> tags at the end of each line whenever there is a carriage return, even inside things like lists. For example, say you have a nice chunk of HTML in your story that you coded up by hand like this:
    <ul>
    <li>Item one.</li>
    <li>Item two.</li>
    </ul>

    You can avoid having <br /> tags added by putting the list into a single line with no carriage returns instead:

    <ul><li>Item one.</li><li>Item two.</li></ul>
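The rules above can be sketched in a few lines of plain Ruby. This is only a toy illustration, not the Archive's actual paragraph-adder: the method name `add_paragraphs` is made up for this example, and wrapping every block of text in <p> tags is an assumption.

```ruby
# Toy paragraph-adder (hypothetical helper, not the Archive's real code).
def add_paragraphs(text)
  # Normalise Windows/old-Mac line endings and treat whitespace-only lines as blank.
  normalized = text.gsub(/\r\n?/, "\n").gsub(/^[ \t]+$/, "").strip
  return "" if normalized.empty?

  # Split on runs of two or more newlines, keeping each run so it can be measured.
  pieces = normalized.split(/(\n{2,})/)

  html = pieces.flat_map do |piece|
    if piece.match?(/\A\n+\z/)
      # n newlines in a row = n - 1 blank lines; each blank line
      # beyond the first becomes an empty paragraph.
      Array.new(piece.length - 2, "<p> </p>")
    else
      # A single newline inside a block of text becomes a break tag.
      "<p>" + piece.gsub("\n", "<br />\n") + "</p>"
    end
  end

  html.join("\n\n")
end

puts add_paragraphs("Here is paragraph one.\n\n\nHere is paragraph two.")
```

Because the sketch treats every newline the same way, a hand-coded list spread over several lines would pick up <br /> tags here too, which is the behaviour the note above warns about.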

  2. The next step is a Ruby gem (basically a kind of plugin) called Nokogiri, which parses the text and gives it back to us as a well-formed chunk of XHTML. Among other things, this means that:

    • any tags that were left open get closed

    • any mis-nested tags get fixed (e.g., if you do <strong><em>foo!</strong></em>, Nokogiri will turn that into the correct version, <strong><em>foo!</em></strong>)

    • any attribute values that aren't properly in quotes get fixed
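To give a feel for what "closing open tags" and "fixing mis-nested tags" involve, here is a small stack-based sketch in plain Ruby. It is a stand-in for illustration only: Nokogiri uses a real HTML parser rather than anything like this, and `balance_tags` and `VOID_TAGS` are names invented for the example. The sketch ignores edge cases such as a ">" inside an attribute value or a stray "<" in the text.

```ruby
# Tags that never take a closing tag (a deliberately short, illustrative list).
VOID_TAGS = %w[br hr img].freeze

# Toy tag balancer (hypothetical helper; NOT how Nokogiri works internally).
def balance_tags(html)
  open_tags = []        # stack of currently open tag names
  out = String.new

  html.scan(%r{<(/?)([a-zA-Z][a-zA-Z0-9]*)([^>]*)>|([^<]+)}) do |slash, name, rest, text|
    if text
      out << text
    elsif slash.empty?
      # Opening tag: emit it and remember it, unless it is void or self-closing.
      out << "<#{name}#{rest}>"
      tag = name.downcase
      open_tags.push(tag) unless VOID_TAGS.include?(tag) || rest.end_with?("/")
    elsif open_tags.include?(name.downcase)
      # Closing tag: first close anything opened after it, then the tag itself.
      loop do
        top = open_tags.pop
        out << "</#{top}>"
        break if top == name.downcase
      end
    end
    # A closing tag with no matching open tag is silently dropped.
  end

  # Close whatever is still open at the end of the text.
  out << open_tags.reverse.map { |tag| "</#{tag}>" }.join
end

puts balance_tags("<strong><em>foo!</strong></em>")
```

The stack is what makes the nesting come out right: a tag can only be closed after everything opened inside it has been closed.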

     

  3. Finally, we use the gem Sanitize to clean up this XHTML and take out anything that is legal but not necessarily safe. Sanitize uses a whitelist, meaning that only the tags and attributes we specifically tell it to allow are let through. It's very customizable, and we have been able to write special rules for Sanitize to safely allow embeds of videos from specific sites (currently: YouTube, Vimeo, blip.tv, Dailymotion, Viddler, Metacafe, and 4shared). Once Sanitize is done, the final version is saved into the database.

There is lots of documentation available for Nokogiri and Sanitize on their respective sites.
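As a rough picture of the whitelist idea applied to video embeds, the sketch below checks whether an embed's source URL points at one of the allowed sites. It uses only Ruby's standard library and is not the Archive's actual Sanitize rule. The domain names in `ALLOWED_EMBED_HOSTS` are assumptions (this post names the sites, not their exact domains), and `allowed_embed_source?` is a hypothetical helper.

```ruby
require "uri"

# Assumed domains for the sites named in this post (not taken from the real config).
ALLOWED_EMBED_HOSTS = %w[
  youtube.com vimeo.com blip.tv dailymotion.com
  viddler.com metacafe.com 4shared.com
].freeze

# Returns true only for http(s) URLs whose host is an allowed site or a subdomain of one.
def allowed_embed_source?(url)
  uri = URI.parse(url)
  # URI::HTTPS is a subclass of URI::HTTP, so this accepts both schemes.
  return false unless uri.is_a?(URI::HTTP) && uri.host

  host = uri.host.downcase
  ALLOWED_EMBED_HOSTS.any? do |allowed|
    host == allowed || host.end_with?(".#{allowed}")
  end
rescue URI::InvalidURIError
  false
end

puts allowed_embed_source?("http://www.youtube.com/embed/abc123")
puts allowed_embed_source?("http://youtube.com.evil.example/embed/abc123")
```

Comparing the parsed host, rather than searching the whole URL for a site name, is what keeps look-alike addresses from slipping through. Anything not explicitly on the list is rejected.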

What you see when editing

  • If you are working in a field that offers the Rich Text Editor (like content in the Post New Work form), the <p> and <br /> tags will show. Otherwise, switching to the Rich Text Editor would do that horrible thing where your whitespace disappears and your text all runs together into one giant blob!
  • If you manually put in some <p> tags with extra attributes on them, like "<p align=center>", the tags will show.
  • However, the <p> and <br /> tags will not show when you edit fields like notes and summary, where there is no option to use the Rich Text Editor.

Here's an example of how the tags will look on content in the Post New Work form: