Coercing an HTML DOM into an infoset - HTML 5

When an application uses an HTML parser in conjunction with an XML pipeline, it is possible that the constructed DOM is not compatible with the XML toolchain in certain subtle ways. For example, an XML toolchain might not be able to represent attributes with the name xmlns, since they conflict with the Namespaces in XML syntax. There is also some data that the HTML parser generates that isn't included in the DOM itself. This section specifies some rules for handling these issues.

If the XML API being used doesn't support DOCTYPEs, the tool may drop DOCTYPEs altogether.

If the XML API doesn't support attributes in no namespace that are named "xmlns", attributes whose names start with "xmlns:", or attributes in the XMLNS namespace, then the tool may drop such attributes.

The tool may annotate the output with any namespace declarations required for proper operation.

If the XML API being used restricts the allowable characters in the local names of elements and attributes, then the tool may map all element and attribute local names that the API wouldn't support to a set of names that are allowed, by replacing any character that isn't supported with the uppercase letter U and the six digits of the character's Unicode code point when expressed in hexadecimal, using digits 0-9 and capital letters A-F as the symbols, in increasing numeric order.

For example, the element name foo<bar, which can be output by the HTML parser, though it is neither a legal HTML element name nor a well-formed XML element name, would be converted into fooU00003Cbar, which is a well-formed XML element name (though it's still not legal in HTML by any means).

As another example, consider the attribute xlink:href. Used on a MathML element, it becomes, after being adjusted, an attribute with a prefix "xlink" and a local name "href". However, used on an HTML element, it becomes an attribute with no prefix and the local name "xlink:href", which is not a valid NCName, and thus might not be accepted by an XML API. It could thus get converted, becoming "xlinkU00003Ahref".

The resulting names from this conversion conveniently can't clash with any attribute generated by the HTML parser, since those are all either lowercase or those listed in the adjust foreign attributes algorithm's table.
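
The name-mapping rule above can be sketched as follows. This is a minimal illustration, not a definitive implementation: it assumes that the XML API accepts exactly the XML 1.0 (Fifth Edition) name characters minus the colon, and the function name coerce_local_name is hypothetical.

```python
# Assumption: the XML API accepts NCName characters as defined by
# XML 1.0 (Fifth Edition), i.e. NameStartChar / NameChar without ":".
_START_RANGES = [
    (0x41, 0x5A), (0x5F, 0x5F), (0x61, 0x7A), (0xC0, 0xD6), (0xD8, 0xF6),
    (0xF8, 0x2FF), (0x370, 0x37D), (0x37F, 0x1FFF), (0x200C, 0x200D),
    (0x2070, 0x218F), (0x2C00, 0x2FEF), (0x3001, 0xD7FF), (0xF900, 0xFDCF),
    (0xFDF0, 0xFFFD), (0x10000, 0xEFFFF),
]
# Characters allowed everywhere except in the first position.
_EXTRA_RANGES = [
    (0x2D, 0x2E), (0x30, 0x39), (0xB7, 0xB7), (0x300, 0x36F), (0x203F, 0x2040),
]


def _in_ranges(code_point, ranges):
    return any(low <= code_point <= high for low, high in ranges)


def coerce_local_name(name):
    """Replace each unsupported character with 'U' plus six uppercase hex digits."""
    out = []
    for index, char in enumerate(name):
        code_point = ord(char)
        supported = _in_ranges(code_point, _START_RANGES) or (
            index > 0 and _in_ranges(code_point, _EXTRA_RANGES)
        )
        out.append(char if supported else "U%06X" % code_point)
    return "".join(out)


print(coerce_local_name("foo<bar"))     # fooU00003Cbar
print(coerce_local_name("xlink:href"))  # xlinkU00003Ahref
```

An API with a different set of supported characters would only need different range tables; the replacement format stays the same.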

If the XML API restricts comments from having two consecutive "-" (U+002D) characters, the tool may insert a single U+0020 SPACE character between any such offending characters.

If the XML API restricts comments from ending in a "-" (U+002D) character, the tool may insert a single U+0020 SPACE character at the end of such comments.

If the XML API restricts allowed characters in character data, attribute values, or comments, the tool may replace any "FF" (U+000C) character with a U+0020 SPACE character, and any other literal non-XML character with a U+FFFD REPLACEMENT CHARACTER.
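
The three rules for comments and character data can be sketched as below. This is an illustrative sketch only: it assumes "non-XML character" means anything outside the XML 1.0 Char production, and the function names are hypothetical.

```python
def coerce_comment(data):
    """Separate consecutive hyphens and avoid a trailing hyphen."""
    out = []
    for char in data:
        # A hyphen directly after another hyphen gets a space in between.
        if char == "-" and out and out[-1] == "-":
            out.append(" ")
        out.append(char)
    # A comment must not end in a hyphen either.
    if out and out[-1] == "-":
        out.append(" ")
    return "".join(out)


def _is_xml_char(code_point):
    # Assumption: XML 1.0 Char production.
    return (
        code_point in (0x9, 0xA, 0xD)
        or 0x20 <= code_point <= 0xD7FF
        or 0xE000 <= code_point <= 0xFFFD
        or 0x10000 <= code_point <= 0x10FFFF
    )


def coerce_chars(text):
    """Map U+000C to a space and other non-XML characters to U+FFFD."""
    out = []
    for char in text:
        if char == "\x0c":
            out.append(" ")
        elif _is_xml_char(ord(char)):
            out.append(char)
        else:
            out.append("\ufffd")
    return "".join(out)
```

Both functions leave already-acceptable input unchanged, so they can be applied unconditionally before handing data to the XML API.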

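If the tool has no way to convey out-of-band information, then the tool may drop the following information:

Whether the document is set to no-quirks mode, limited-quirks mode, or quirks mode
The association between form controls and forms that aren't their nearest form element ancestor (use of the form element pointer in the parser)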

The mutations allowed by this section apply after the HTML parser's rules have been applied. For example, a <a::> start tag will be closed by a </a::> end tag, and never by a </aU00003AU00003A> end tag, even if the user agent is using the rules above to then generate an actual element in the DOM with the name aU00003AU00003A for that start tag.

8.2.8 An introduction to error handling and strange cases in the parser

This section is non-normative.

This section examines some erroneous markup and discusses how the HTML parser handles these cases.

8.2.8.1 Misnested tags: <b><i></b></i>

This section is non-normative.

The most-often discussed example of erroneous markup is as follows:

<p>1<b>2<i>3</b>4</i>5</p>

The parsing of this markup is straightforward up to the "3". At this point, the DOM looks like this:

html
- head
- body
  - p
    - #text: 1
    - b
      - #text: 2
      - i
        - #text: 3

Here, the stack of open elements has five elements on it: html, body, p, b, and i. The list of active formatting elements just has two: b and i. The insertion mode is "in body".

Upon receiving the end tag token with the tag name "b", the "adoption agency algorithm" is invoked. This is a simple case, in that the formatting element is the b element, and there is no furthest block. Thus, the stack of open elements ends up with just three elements: html, body, and p, while the list of active formatting elements has just one: i. The DOM tree is unmodified at this point.

The next token is a character ("4"), which triggers the reconstruction of the active formatting elements, in this case just the i element. A new i element is thus created for the "4" Text node. After the end tag token for the "i" is also received, and the "5" Text node is inserted, the DOM looks as follows:

html
- head
- body
  - p
    - #text: 1
    - b
      - #text: 2
      - i
        - #text: 3
    - i
      - #text: 4
    - #text: 5
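
The walk-through above can be reproduced with a toy model of the stack of open elements and the list of active formatting elements. This is a deliberately simplified sketch: it only covers the adoption agency case with no furthest block, it has no markers or scoping rules, and all names (Node, parse, the token tuples) are hypothetical.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.children = []  # child Nodes, or plain strings for text

    def dump(self):
        return (self.name,
                [c.dump() if isinstance(c, Node) else c for c in self.children])


FORMATTING = {"b", "i"}


def parse(tokens):
    body = Node("body")
    stack = [body]   # stack of open elements
    active = []      # list of active formatting elements

    def insert(name):
        node = Node(name)
        stack[-1].children.append(node)
        stack.append(node)
        return node

    def reconstruct():
        # Rewind past trailing entries that are no longer open...
        i = len(active)
        while i > 0 and not any(active[i - 1] is n for n in stack):
            i -= 1
        # ...then re-create each of them, in order, under the current node.
        for j in range(i, len(active)):
            active[j] = insert(active[j].name)

    for kind, value in tokens:
        if kind == "text":
            reconstruct()
            stack[-1].children.append(value)
        elif kind == "start":
            reconstruct()
            node = insert(value)
            if value in FORMATTING:
                active.append(node)
        else:  # "end"
            matches = [i for i, n in enumerate(stack) if n.name == value]
            if not matches:
                continue  # stray end tag: ignored in this sketch
            target = stack[matches[-1]]
            del stack[matches[-1]:]  # pop up to and including the element
            active[:] = [n for n in active if n is not target]
    return body.dump()


# <p>1<b>2<i>3</b>4</i>5</p>
tokens = [("start", "p"), ("text", "1"), ("start", "b"), ("text", "2"),
          ("start", "i"), ("text", "3"), ("end", "b"), ("text", "4"),
          ("end", "i"), ("text", "5"), ("end", "p")]
result = parse(tokens)
```

The nested tuples in result mirror the final tree shown above: the second i element is created by reconstruct() when the "4" character arrives, because the original i is still in the active list but no longer on the stack.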

8.2.8.2 Misnested tags: <b><p></b></p>

This section is non-normative.

A case similar to the previous one is the following:

<b>1<p>2</b>3</p>

Up to the "2" the parsing here is straightforward:

html
- head
- body
  - b
    - #text: 1
    - p
      - #text: 2

The interesting part is when the end tag token with the tag name "b" is parsed.

Before that token is seen, the stack of open elements has four elements on it: html, body, b, and p. The list of active formatting elements just has the one: b. The insertion mode is "in body".

Upon receiving the end tag token with the tag name "b", the "adoption agency algorithm" is invoked, as in the previous example. However, in this case, there is a furthest block, namely the p element. Thus, this time the adoption agency algorithm isn't skipped over.

The common ancestor is the body element. A conceptual "bookmark" marks the position of the b in the list of active formatting elements, but since that list has only one element in it, the bookmark won't have much effect.

As the algorithm progresses, node ends up set to the formatting element (b), and last node ends up set to the furthest block (p).

The last node gets appended (moved) to the common ancestor, so that the DOM looks like:

html
- head
- body
  - b
    - #text: 1
  - p
    - #text: 2

A new b element is created, and the children of the p element are moved to it:

html
- head
- body
  - b
    - #text: 1
  - p

b
- #text: 2

Finally, the new b element is appended to the p element, so that the DOM looks like:

html
- head
- body
  - b
    - #text: 1
  - p
    - b
      - #text: 2

The b element is removed from the list of active formatting elements and the stack of open elements, so that when the "3" is parsed, it is appended to the p element:

html
- head
- body
  - b
    - #text: 1
  - p
    - b
      - #text: 2
    - #text: 3
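
The DOM moves in this example can be sketched as three list operations. The sketch assumes the special case shown here, where the furthest block is a direct child of the formatting element, so the algorithm's inner loop has nothing to do; Node and adopt are hypothetical names.

```python
class Node:
    def __init__(self, name, *children):
        self.name = name
        self.children = list(children)  # child Nodes, or strings for text

    def dump(self):
        return (self.name,
                [c.dump() if isinstance(c, Node) else c for c in self.children])


def adopt(common_ancestor, formatting, furthest_block):
    """Apply the moves described above; returns the new formatting element."""
    # 1. Move the furthest block (last node) to the common ancestor.
    formatting.children.remove(furthest_block)
    common_ancestor.children.append(furthest_block)
    # 2. Create a new element like the formatting element and give it
    #    all the children of the furthest block.
    replacement = Node(formatting.name)
    replacement.children, furthest_block.children = furthest_block.children, []
    # 3. Append the new element to the furthest block.
    furthest_block.children.append(replacement)
    return replacement


# State just before the </b> end tag in <b>1<p>2</b>3</p>
p = Node("p", "2")
b = Node("b", "1", p)
body = Node("body", b)

adopt(body, b, p)
p.children.append("3")  # the "3" lands in p, since b is no longer open
```

After these calls, body.dump() has the same shape as the final tree in this section.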

8.2.8.3 Unexpected markup in tables

This section is non-normative.

Error handling in tables is, for historical reasons, especially strange. For example, consider the following markup:

<table><b><tr><td>aaa</td></tr>bbb</table>ccc

The b element start tag is not allowed directly inside a table like that, and the parser handles this case by placing the element before the table. (This is called foster parenting.) This can be seen by examining the DOM tree as it stands just after the table element's start tag has been seen:

html
- head
- body
  - table

...and then immediately after the b element start tag has been seen:

html
- head
- body
  - b
  - table

At this point, the stack of open elements has on it the elements html, body, table, and b (in that order, despite the resulting DOM tree); the list of active formatting elements just has the b element in it; and the insertion mode is "in table".

The tr start tag causes the b element to be popped off the stack and a tbody start tag to be implied; the tbody and tr elements are then handled in a rather straightforward manner, taking the parser through the "in table body" and "in row" insertion modes, after which the DOM looks as follows:

html
- head
- body
  - b
  - table
    - tbody
      - tr

Here, the stack of open elements has on it the elements html, body, table, tbody, and tr; the list of active formatting elements still has the b element in it; and the insertion mode is "in row".

The td element start tag token, after putting a td element on the tree, puts a marker on the list of active formatting elements (it also switches to the "in cell" insertion mode).

html
- head
- body
  - b
  - table
    - tbody
      - tr
        - td

The marker means that when the "aaa" character tokens are seen, no b element is created to hold the resulting Text node:

html
- head
- body
  - b
  - table
    - tbody
      - tr
        - td
          - #text: aaa

The end tags are handled in a straightforward manner; after handling them, the stack of open elements has on it the elements html, body, table, and tbody; the list of active formatting elements still has the b element in it (the marker having been removed by the "td" end tag token); and the insertion mode is "in table body".

Thus it is that the "bbb" character tokens are found. These trigger the "in table text" insertion mode to be used (with the original insertion mode set to "in table body"). The character tokens are collected, and when the next token (the table element end tag) is seen, they are processed as a group. Since they are not all spaces, they are handled as per the "anything else" rules in the "in table" insertion mode, which defer to the "in body" insertion mode but with foster parenting.

When the active formatting elements are reconstructed, a b element is created and foster parented, and then the "bbb" Text node is appended to it:

html
- head
- body
  - b
  - b
    - #text: bbb
  - table
    - tbody
      - tr
        - td
          - #text: aaa

The stack of open elements has on it the elements html, body, table, tbody, and the new b (again, note that this doesn't match the resulting tree!); the list of active formatting elements has the new b element in it; and the insertion mode is still "in table body".

Had the character tokens been only space characters instead of "bbb", then those space characters would just be appended to the tbody element.

Finally, the table is closed by a "table" end tag. This pops all the nodes from the stack of open elements up to and including the table element, but it doesn't affect the list of active formatting elements, so the "ccc" character tokens after the table result in yet another b element being created, this time after the table:

html
- head
- body
  - b
  - b
    - #text: bbb
  - table
    - tbody
      - tr
        - td
          - #text: aaa
  - b
    - #text: ccc
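
The foster parenting step used twice in this example can be sketched as a single insertion helper. This is a simplified sketch with hypothetical names (Node, foster_parent); it only models where the node ends up, not the insertion modes that decide when foster parenting applies.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.parent = None
        self.children = []

    def append(self, child):
        child.parent = self
        self.children.append(child)


def foster_parent(stack, node):
    """Insert node next to the last table on the stack of open elements."""
    tables = [n for n in stack if n.name == "table"]
    if not tables:
        # No table on the stack: fall back to the first element (html).
        stack[0].append(node)
        return
    table = tables[-1]
    if table.parent is not None:
        # Usual case: insert immediately before the table in its parent.
        siblings = table.parent.children
        index = next(i for i, c in enumerate(siblings) if c is table)
        siblings.insert(index, node)
        node.parent = table.parent
    else:
        # Table was detached: use the element before it on the stack.
        stack[stack.index(table) - 1].append(node)


html, body, table = Node("html"), Node("body"), Node("table")
html.append(body)
body.append(table)
stack = [html, body, table]

first_b, second_b = Node("b"), Node("b")
foster_parent(stack, first_b)   # the <b> seen directly inside <table>
foster_parent(stack, second_b)  # the b reconstructed for "bbb"
names = [child.name for child in body.children]
```

Because each node is inserted immediately before the table, later foster-parented nodes follow earlier ones, which is why the empty b precedes the b holding "bbb" in the tree above.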

8.2.8.4 Scripts that modify the page as it is being parsed

This section is non-normative.

Consider the following markup, which for this example we will assume is the document with URL http://example.com/inner, being rendered as the content of an iframe in another document with the URL http://example.com/outer:

<div id=a>
 <script>
  var div = document.getElementById('a');
  parent.document.body.appendChild(div);
 </script>
 <script>
  alert(document.URL);
 </script>
</div>
<script>
 alert(document.URL);
</script>

Up to the first "script" end tag, before the script is parsed, the result is relatively straightforward:

html
- head
- body
  - div id="a"
    - #text:
    - script
      - #text: var div = document.getElementById('a'); ⏎ parent.document.body.appendChild(div);

After the script is parsed, though, the div element and its child script element are gone:

html
- head
- body

They are, at this point, in the Document of the aforementioned outer browsing context. However, the stack of open elements still contains the div element.

Thus, when the second script element is parsed, it is inserted into the outer Document object.

Scripts parsed into a different Document than the one the parser was created for do not execute, so the first alert does not show.

Once the div element's end tag is parsed, the div element is popped off the stack, and so the next script element is in the inner Document:

html
- head
- body
  - script
    - #text: alert(document.URL);

This script does execute, resulting in an alert that says "http://example.com/inner".

8.2.8.5 The execution of scripts that are moving across multiple documents

This section is non-normative.

Elaborating on the example in the previous section, consider the case where the second script element is an external script (i.e. one with a src attribute). Since the element was not in the parser's Document when it was created, that external script is not even downloaded.

In a case where a script element with a src attribute is parsed normally into its parser's Document, but while the external script is being downloaded, the element is moved to another document, the script continues to download, but does not execute.

In general, moving script elements between Documents is considered a bad practice.

8.2.8.6 Unclosed formatting elements

This section is non-normative.

The following markup shows how nested formatting elements (such as b) get collected and continue to be applied even as the elements they are contained in are closed, but that excessive duplicates are thrown away.

<!DOCTYPE html><p><b class=x><b class=x><b><b class=x><b class=x><b>X<p>X<p><b><b class=x><b>X<p></b></b></b></b></b></b>X

The resulting DOM tree is as follows:

DOCTYPE: html
html
- head
- body
  - p
    - b class="x"
      - b class="x"
        - b
          - b class="x"
            - b class="x"
              - b
                - #text: X⏎
  - p
    - b class="x"
      - b
        - b class="x"
          - b class="x"
            - b
              - #text: X⏎
  - p
    - b class="x"
      - b
        - b class="x"
          - b class="x"
            - b
              - b
                - b class="x"
                  - b
                    - #text: X⏎
  - p
    - #text: X⏎

Note how the second p element in the markup has no explicit b elements, but in the resulting DOM, up to three of each kind of formatting element (in this case three b elements with the class attribute, and two unadorned b elements) get reconstructed before the element's "X".

Also note how this means that in the final paragraph only six b end tags are needed to completely clear the list of formatting elements, even though nine b start tags have been seen up to this point.
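
The "up to three of each kind" behavior can be sketched as a rule applied whenever an element is pushed onto the list of active formatting elements. In this sketch entries are plain (name, attributes) tuples and markers are None; a real parser stores element nodes and also compares namespaces. The function name is hypothetical.

```python
MARKER = None


def push_active_formatting(active, name, attrs):
    """Append an entry, keeping at most three identical ones after the last marker."""
    entry = (name, dict(attrs))
    # Only entries after the last marker are considered.
    start = 0
    for index, existing in enumerate(active):
        if existing is MARKER:
            start = index + 1
    matches = [i for i in range(start, len(active)) if active[i] == entry]
    if len(matches) >= 3:
        del active[matches[0]]  # drop the earliest duplicate
    active.append(entry)


# The six b start tags of the first paragraph in the markup above.
active = []
for attrs in [{"class": "x"}, {"class": "x"}, {}, {"class": "x"}, {"class": "x"}, {}]:
    push_active_formatting(active, "b", attrs)
summary = [attrs.get("class") for _, attrs in active]
```

Here summary lists the class of each surviving entry, with None for an unadorned b. Only five entries remain, in the same order as the b elements reconstructed for the second paragraph.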

8.3 Serializing HTML fragments

The following steps form the HTML fragment serialization algorithm. The algorithm takes as input a DOM Element, Document, or DocumentFragment referred to as the node, and either returns a string or throws an exception.

This algorithm serializes the children of the node being serialized, not the node itself.

Let s be a string, and initialize it to the empty string.
For each child node of the node, in tree order, run the following steps:
1. Let current node be the child node being processed.
2. Append the appropriate string from the following list to s:
  
  If current node is an Element
  
  If current node is an element in the HTML namespace, the MathML namespace, or the SVG namespace, then let tagname be current node's local name. Otherwise, let tagname be current node's qualified name.
  
  Append a U+003C LESS-THAN SIGN character (<), followed by tagname.
  
  For HTML elements created by the HTML parser or Document.createElement(), tagname will be lowercase.
  
  For each attribute that the element has, append a U+0020 SPACE character, the attribute's serialized name as described below, a "=" (U+003D) character, a U+0022 QUOTATION MARK character ("), the attribute's value, escaped as described below in attribute mode, and a second U+0022 QUOTATION MARK character (").
  
  An attribute's serialized name for the purposes of the previous paragraph must be determined as follows:
  
  If the attribute has no namespace
  
  The attribute's serialized name is the attribute's local name.
  
  For attributes on HTML elements set by the HTML parser or by Element.setAttribute(), the local name will be lowercase.
  
  If the attribute is in the XML namespace
  
  The attribute's serialized name is the string "xml:" followed by the attribute's local name.
  
  If the attribute is in the XMLNS namespace and the attribute's local name is xmlns
  
  The attribute's serialized name is the string "xmlns".
  
  If the attribute is in the XMLNS namespace and the attribute's local name is not xmlns
  
  The attribute's serialized name is the string "xmlns:" followed by the attribute's local name.
  
  If the attribute is in the XLink namespace
  
  The attribute's serialized name is the string "xlink:" followed by the attribute's local name.
  
  If the attribute is in some other namespace
  
  The attribute's serialized name is the attribute's qualified name.
  
  While the exact order of attributes is UA-defined, and may depend on factors such as the order that the attributes were given in the original markup, the sort order must be stable, such that consecutive invocations of this algorithm serialize an element's attributes in the same order.
  
  Append a U+003E GREATER-THAN SIGN character (>).
  
  If current node is an area, base, basefont, bgsound, br, col, command, embed, frame, hr, img, input, keygen, link, meta, param, source, track or wbr element, then continue on to the next child node at this point.
  
  If current node is a pre, textarea, or listing element, and the first child node of the element, if any, is a Text node whose character data has as its first character a "LF" (U+000A) character, then append a "LF" (U+000A) character.
  
  Append the value of running the HTML fragment serialization algorithm on the current node element (thus recursing into this algorithm for that element), followed by a U+003C LESS-THAN SIGN character (<), a "/" (U+002F) character, tagname again, and finally a U+003E GREATER-THAN SIGN character (>).
  
  If current node is a Text node
  
  If the parent of current node is a style, script, xmp, iframe, noembed, noframes, or plaintext element, or if the parent of current node is a noscript element and scripting is enabled for the node, then append the value of current node's data IDL attribute literally.
  
  Otherwise, append the value of current node's data IDL attribute, escaped as described below.
  
  If current node is a Comment
  
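  Append the literal string <!-- (U+003C LESS-THAN SIGN, U+0021 EXCLAMATION MARK, U+002D HYPHEN-MINUS, U+002D HYPHEN-MINUS), followed by the value of current node's data IDL attribute, followed by the literal string --> (U+002D HYPHEN-MINUS, U+002D HYPHEN-MINUS, U+003E GREATER-THAN SIGN).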
  
  If current node is a ProcessingInstruction
  
  Append the literal string <? (U+003C LESS-THAN SIGN, U+003F QUESTION MARK), followed by the value of current node's target IDL attribute, followed by a single U+0020 SPACE character, followed by the value of current node's data IDL attribute, followed by a single ">" (U+003E) character.
  
  If current node is a DocumentType
  
  Append the literal string <!DOCTYPE (U+003C LESS-THAN SIGN, U+0021 EXCLAMATION MARK, U+0044 LATIN CAPITAL LETTER D, U+004F LATIN CAPITAL LETTER O, U+0043 LATIN CAPITAL LETTER C, U+0054 LATIN CAPITAL LETTER T, U+0059 LATIN CAPITAL LETTER Y, U+0050 LATIN CAPITAL LETTER P, U+0045 LATIN CAPITAL LETTER E), followed by a space (U+0020 SPACE), followed by the value of current node's name IDL attribute, followed by the literal string > (U+003E GREATER-THAN SIGN).
The result of the algorithm is the string s.

It is possible that the output of this algorithm, if parsed with an HTML parser, will not return the original tree structure.

For instance, if a textarea element to which a Comment node has been appended is serialized and the output is then reparsed, the comment will end up being displayed in the text field. Similarly, if, as a result of DOM manipulation, an element contains a comment that contains the literal string "-->", then when the result of serializing the element is parsed, the comment will be truncated at that point and the rest of the comment will be interpreted as markup. More examples would be making a script element contain a Text node with the text string "</script>", or having a p element that contains a ul element (as the ul element's start tag would imply the end tag for the p).

This can enable cross-site scripting attacks. An example of this would be a page that lets the user enter some font names that are then inserted into a CSS style block via the DOM and which then uses the innerHTML IDL attribute to get the HTML serialization of that style element: if the user enters "</style><script>attack</script>" as a font name, innerHTML will return markup that, if parsed in a different context, would contain a script node, even though no script node existed in the original DOM.

Escaping a string (for the purposes of the algorithm above) consists of running the following steps:

Replace any occurrence of the "&" character by the string "&amp;".
Replace any occurrences of the U+00A0 NO-BREAK SPACE character by the string "&nbsp;".
If the algorithm was invoked in the attribute mode, replace any occurrences of the """ character by the string "&quot;".
If the algorithm was not invoked in the attribute mode, replace any occurrences of the "<" character by the string "&lt;", and any occurrences of the ">" character by the string "&gt;".
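
The serialization and escaping steps can be sketched together over a minimal node model. This is a simplified sketch, not the full algorithm: it assumes every element is in the HTML namespace with no namespaced attributes, and it leaves out the noscript, ProcessingInstruction, and DocumentType cases. The Element, Text, and Comment classes are hypothetical stand-ins for DOM nodes.

```python
from dataclasses import dataclass, field

VOID = {"area", "base", "basefont", "bgsound", "br", "col", "command",
        "embed", "frame", "hr", "img", "input", "keygen", "link", "meta",
        "param", "source", "track", "wbr"}
RAW_TEXT = {"style", "script", "xmp", "iframe", "noembed", "noframes",
            "plaintext"}
LEADING_LF = {"pre", "textarea", "listing"}


@dataclass
class Text:
    data: str


@dataclass
class Comment:
    data: str


@dataclass
class Element:
    name: str
    attrs: dict = field(default_factory=dict)
    children: list = field(default_factory=list)


def escape(s, attribute_mode=False):
    # "&" must be replaced first so later entities are not double-escaped.
    s = s.replace("&", "&amp;").replace("\u00a0", "&nbsp;")
    if attribute_mode:
        return s.replace('"', "&quot;")
    return s.replace("<", "&lt;").replace(">", "&gt;")


def serialize_children(node):
    """Serialize the children of node (an Element), not node itself."""
    out = []
    for child in node.children:
        if isinstance(child, Element):
            out.append("<" + child.name)
            for name, value in child.attrs.items():
                out.append(' %s="%s"' % (name, escape(value, True)))
            out.append(">")
            if child.name in VOID:
                continue  # void elements have no content and no end tag
            first = child.children[0] if child.children else None
            if (child.name in LEADING_LF and isinstance(first, Text)
                    and first.data.startswith("\n")):
                out.append("\n")  # keep the newline the parser would drop
            out.append(serialize_children(child))
            out.append("</%s>" % child.name)
        elif isinstance(child, Text):
            raw = node.name in RAW_TEXT
            out.append(child.data if raw else escape(child.data))
        elif isinstance(child, Comment):
            out.append("<!--" + child.data + "-->")
    return "".join(out)
```

Note that text inside the raw text elements is emitted literally, which is exactly what makes the style-based cross-site scripting example above possible.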

8.4 Parsing HTML fragments

The following steps form the HTML fragment parsing algorithm. The algorithm optionally takes as input an Element node, referred to as the context element, which gives the context for the parser, as well as input, a string to parse, and returns a list of zero or more nodes.

Parts marked fragment case in algorithms in the parser section are parts that only occur if the parser was created for the purposes of this algorithm (and with a context element). The algorithms have been annotated with such markings for informational purposes only; such markings have no normative weight. If it is possible for a condition described as a fragment case to occur even when the parser wasn't created for the purposes of handling this algorithm, then that is an error in the specification.

Create a new Document node, and mark it as being an HTML document.
If there is a context element, and the Document of the context element is in quirks mode, then let the Document be in quirks mode. Otherwise, if there is a context element, and the Document of the context element is in limited-quirks mode, then let the Document be in limited-quirks mode. Otherwise, leave the Document in no-quirks mode.
Create a new HTML parser, and associate it with the just created Document node.
If there is a context element, run these substeps:
1. Set the state of the HTML parser's tokenization stage as follows:
  
  If it is a title or textarea element
  
  Switch the tokenizer to the RCDATA state.
  
  If it is a style, xmp, iframe, noembed, or noframes element
  
  Switch the tokenizer to the RAWTEXT state.
  
  If it is a script element
  
  Switch the tokenizer to the script data state.
  
  If it is a noscript element
  
  If the scripting flag is enabled, switch the tokenizer to the RAWTEXT state. Otherwise, leave the tokenizer in the data state.
  
  If it is a plaintext element
  
  Switch the tokenizer to the PLAINTEXT state.
  
  Otherwise
  
  Leave the tokenizer in the data state.
  
  For performance reasons, an implementation that does not report errors and that uses the actual state machine described in this specification directly could use the PLAINTEXT state instead of the RAWTEXT and script data states where those are mentioned in the list above. Except for rules regarding parse errors, they are equivalent, since there is no appropriate end tag token in the fragment case, yet they involve far fewer state transitions.
2. Let root be a new html element with no attributes.
3. Append the element root to the Document node created above.
4. Set up the parser's stack of open elements so that it contains just the single element root.
5. Reset the parser's insertion mode appropriately.
  
  The parser will reference the context element as part of that algorithm.
6. Set the parser's form element pointer to the nearest node to the context element that is a form element (going straight up the ancestor chain, and including the element itself, if it is a form element), or, if there is no such form element, to null.
Place into the input stream for the HTML parser just created the input. The encoding confidence is irrelevant.
Start the parser and let it run until it has consumed all the characters just inserted into the input stream.
If there is a context element, return the child nodes of root, in tree order.

Otherwise, return the children of the Document object, in tree order.

This algorithm is invoked without a context element in the case of Document.innerHTML.
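
The tokenizer-state selection in the context element substeps can be sketched as a small lookup. The function name and the use of plain strings for states are assumptions made for illustration; a missing context element is represented by None.

```python
RCDATA_CONTEXTS = {"title", "textarea"}
RAWTEXT_CONTEXTS = {"style", "xmp", "iframe", "noembed", "noframes"}


def initial_tokenizer_state(context_name, scripting_enabled=True):
    """Return the tokenizer state to start in for a given context element."""
    if context_name in RCDATA_CONTEXTS:
        return "RCDATA"
    if context_name in RAWTEXT_CONTEXTS:
        return "RAWTEXT"
    if context_name == "script":
        return "script data"
    if context_name == "noscript":
        # noscript content is raw text only when scripting is enabled.
        return "RAWTEXT" if scripting_enabled else "data"
    if context_name == "plaintext":
        return "PLAINTEXT"
    # Any other context element, or no context element at all.
    return "data"
```

For example, parsing a fragment in the context of a textarea element starts in the RCDATA state, so markup-like text in the input becomes character data rather than elements.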
