If the XML API being used doesn't support DOCTYPEs, the tool may drop DOCTYPEs altogether.
The tool may annotate the output with any namespace declarations required for proper operation.
If the XML API being used restricts the allowable characters in the local names of elements and attributes, then the tool may map all element and attribute local names that the API wouldn't support to a set of names that are allowed, by replacing any character that isn't supported with the uppercase letter U and the six digits of the character's Unicode code point when expressed in hexadecimal, using digits 0-9 and capital letters A-F as the symbols, in increasing numeric order.
The resulting names from this conversion conveniently can't clash with any attribute generated by the HTML parser, since those are all either lowercase or those listed in the adjust foreign attributes algorithm's table.
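The name mapping described above can be sketched as follows. This is a minimal illustration, not part of any real XML API: the function names and the `ascii_name_char` predicate are hypothetical, and the predicate stands in for whatever restriction the API actually imposes.

```python
def coerce_local_name(name, is_supported):
    """Replace each unsupported character with 'U' plus six uppercase hex digits."""
    return "".join(
        ch if is_supported(ch) else "U%06X" % ord(ch)
        for ch in name
    )


def ascii_name_char(ch):
    # Hypothetical API restriction (an assumption for this sketch):
    # only ASCII letters, digits, "_", "-" and "." are allowed.
    return ch.isascii() and (ch.isalnum() or ch in "_-.")


print(coerce_local_name("foo<bar", ascii_name_char))  # fooU00003Cbar
```

Because the replacement always starts with an uppercase "U", a coerced name cannot collide with the lowercase names the HTML parser generates.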
If the XML API restricts comments from having two consecutive "-" (U+002D) characters, the tool may insert a single U+0020 SPACE character between any such offending characters.
If the XML API restricts comments from ending in a "-" (U+002D) character, the tool may insert a single U+0020 SPACE character at the end of such comments.
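The two comment adjustments can be sketched together. The function name `coerce_comment` is hypothetical; the sketch assumes the comment data is available as a plain string.

```python
def coerce_comment(data):
    """Make comment data acceptable to an API that forbids '--' and a trailing '-'."""
    # Repeat until no adjacent hyphens remain: a run such as "---" needs two passes.
    while "--" in data:
        data = data.replace("--", "- -")
    # A trailing hyphen would otherwise sit next to the closing delimiter.
    if data.endswith("-"):
        data += " "
    return data


print(repr(coerce_comment("a---b")))  # 'a- - -b'
```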
If the XML API restricts allowed characters in character data, attribute values, or comments, the tool may replace any "FF" (U+000C) character with a U+0020 SPACE character, and any other literal non-XML character with a U+FFFD REPLACEMENT CHARACTER.
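A sketch of this character replacement follows. It assumes that "non-XML character" means anything outside the Char production of XML 1.0; the function names are hypothetical.

```python
def is_xml_char(cp):
    """True if the code point matches the XML 1.0 Char production (assumed here)."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def coerce_characters(text):
    out = []
    for ch in text:
        if ch == "\f":                  # U+000C FORM FEED becomes a space
            out.append(" ")
        elif is_xml_char(ord(ch)):
            out.append(ch)
        else:                           # any other non-XML character
            out.append("\ufffd")
    return "".join(out)
```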
If the tool has no way to convey out-of-band information, then the tool may drop the following information:
8.2.8 An introduction to error handling and strange cases in the parser
This section is non-normative.
This section examines some erroneous markup and discusses how the HTML parser handles these cases.
8.2.8.1 Misnested tags: <b><i></b></i>
This section is non-normative.
The most-often discussed example of erroneous markup is as follows:
<p>1<b>2<i>3</b>4</i>5</p>
The parsing of this markup is straightforward up to the "3". At this point, the DOM looks like this:
Here, the stack of open elements has five elements on it: html, body, p, b, and i. The list of active formatting elements just has two: b and i. The insertion mode is "in body".
Upon receiving the end tag token with the tag name "b", the "adoption agency algorithm" is invoked. This is a simple case, in that the formatting element is the b element, and there is no furthest block. Thus, the stack of open elements ends up with just three elements: html, body, and p, while the list of active formatting elements has just one: i. The DOM tree is unmodified at this point.
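The bookkeeping for this simple case can be replayed with a toy model. This is only a sketch: tag names stand in for element objects, the function name is hypothetical, and it covers only the branch where there is no furthest block.

```python
def end_formatting_tag_simple(stack, active, tag):
    """End tag for a formatting element when there is no furthest block."""
    # Formatting element: the most recent entry with this tag in the active list.
    index = len(active) - 1 - active[::-1].index(tag)
    # Pop the stack of open elements up to and including that element.
    while stack.pop() != tag:
        pass
    # Remove it from the list of active formatting elements.
    del active[index]


stack = ["html", "body", "p", "b", "i"]
active = ["b", "i"]
end_formatting_tag_simple(stack, active, "b")
print(stack, active)  # ['html', 'body', 'p'] ['i']
```

Note that i is popped from the stack but stays in the list of active formatting elements, which is what allows it to be reconstructed later.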
The next token is a character ("4"), which triggers the reconstruction of the active formatting elements, in this case just the i element. A new i element is thus created for the "4" Text node. After the end tag token for the "i" is also received, and the "5" Text node is inserted, the DOM looks as follows:
8.2.8.2 Misnested tags: <b><p></b></p>
This section is non-normative.
A case similar to the previous one is the following:
<b>1<p>2</b>3</p>
Up to the "2" the parsing here is straightforward:
The interesting part is when the end tag token with the tag name "b" is parsed.
Before that token is seen, the stack of open elements has four elements on it: html, body, b, and p. The list of active formatting elements just has the one: b. The insertion mode is "in body".
Upon receiving the end tag token with the tag name "b", the "adoption agency algorithm" is invoked, as in the previous example. However, in this case, there is a furthest block, namely the p element. Thus, this time the adoption agency algorithm isn't skipped over.
The common ancestor is the body element. A conceptual "bookmark" marks the position of the b in the list of active formatting elements, but since that list has only one element in it, the bookmark won't have much effect.
As the algorithm progresses, node ends up set to the formatting element (b), and last node ends up set to the furthest block (p).
The last node gets appended (moved) to the common ancestor, so that the DOM looks like:
A new b element is created, and the children of the p element are moved to it:
Finally, the new b element is appended to the p element, so that the DOM looks like:
The b element is removed from the list of active formatting elements and the stack of open elements, so that when the "3" is parsed, it is appended to the p element:
8.2.8.3 Unexpected markup in tables
This section is non-normative.
Error handling in tables is, for historical reasons, especially strange. For example, consider the following markup:
<table><b><tr><td>aaa</td></tr>bbb</table>ccc
The b element start tag is not allowed directly inside a table like that, and the parser handles this case by placing the element before the table. (This is called foster parenting.) This can be seen by examining the DOM tree as it stands just after the table element's start tag has been seen:
...and then immediately after the b element start tag has been seen:
At this point, the stack of open elements has on it the elements html, body, table, and b (in that order, despite the resulting DOM tree); the list of active formatting elements just has the b element in it; and the insertion mode is "in table".
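Foster parenting, and the mismatch between the stack and the tree that it produces, can be sketched with a toy tree. The `Node` class and `foster_parent` function are hypothetical simplifications.

```python
class Node:
    def __init__(self, tag):
        self.tag = tag
        self.parent = None
        self.children = []

    def insert(self, child, before=None):
        index = len(self.children) if before is None else self.children.index(before)
        self.children.insert(index, child)
        child.parent = self


def foster_parent(stack, node):
    """Insert node where a misplaced child of a table belongs."""
    tables = [el for el in stack if el.tag == "table"]
    if not tables:
        stack[0].insert(node)            # no table open: use the root html element
        return
    table = tables[-1]
    if table.parent is not None:
        table.parent.insert(node, before=table)     # immediately before the table
    else:
        stack[stack.index(table) - 1].insert(node)  # element above the table on the stack


html, body, table, b = Node("html"), Node("body"), Node("table"), Node("b")
html.insert(body)
body.insert(table)
stack = [html, body, table]
foster_parent(stack, b)
stack.append(b)  # the stack order no longer mirrors the tree
print([c.tag for c in body.children])  # ['b', 'table']
```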
The tr start tag causes the b element to be popped off the stack and a tbody start tag to be implied; the tbody and tr elements are then handled in a rather straight-forward manner, taking the parser through the "in table body" and "in row" insertion modes, after which the DOM looks as follows:
Here, the stack of open elements has on it the elements html, body, table, tbody, and tr; the list of active formatting elements still has the b element in it; and the insertion mode is "in row".
The td element start tag token, after putting a td element on the tree, puts a marker on the list of active formatting elements (it also switches to the "in cell" insertion mode).
The marker means that when the "aaa" character tokens are seen, no b element is created to hold the resulting Text node:
The end tags are handled in a straight-forward manner; after handling them, the stack of open elements has on it the elements html, body, table, and tbody; the list of active formatting elements still has the b element in it (the marker having been removed by the "td" end tag token); and the insertion mode is "in table body".
Thus it is that the "bbb" character tokens are found. These trigger the "in table text" insertion mode to be used (with the original insertion mode set to "in table body"). The character tokens are collected, and when the next token (the table element end tag) is seen, they are processed as a group. Since they are not all spaces, they are handled as per the "anything else" rules in the "in table" insertion mode, which defer to the "in body" insertion mode but with foster parenting.
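The decision made when the collected characters are processed can be sketched as a small predicate. The function name and the returned labels are hypothetical; the space characters are assumed to be tab, line feed, form feed, carriage return, and space.

```python
HTML_SPACE_CHARACTERS = "\t\n\f\r "


def pending_table_text_action(characters):
    """Decide how the characters collected in 'in table text' are handled."""
    if all(ch in HTML_SPACE_CHARACTERS for ch in characters):
        return "insert normally"
    return "in-body rules with foster parenting"
```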
When the active formatting elements are reconstructed, a b element is created and foster parented, and then the "bbb" Text node is appended to it:
The stack of open elements has on it the elements html, body, table, tbody, and the new b (again, note that this doesn't match the resulting tree!); the list of active formatting elements has the new b element in it; and the insertion mode is still "in table body".
Had the character tokens been only space characters instead of "bbb", then those space characters would just be appended to the tbody element.
Finally, the table is closed by a "table" end tag. This pops all the nodes from the stack of open elements up to and including the table element, but it doesn't affect the list of active formatting elements, so the "ccc" character tokens after the table result in yet another b element being created, this time after the table:
8.2.8.4 Scripts that modify the page as it is being parsed
This section is non-normative.
Consider the following markup, which for this example we will assume is the document with URL http://example.com/inner, being rendered as the content of an iframe in another document with the URL http://example.com/outer:
<div id=a>
 <script>
  var div = document.getElementById('a');
  parent.document.body.appendChild(div);
 </script>
 <script>
  alert(document.URL);
 </script>
</div>
<script>
 alert(document.URL);
</script>
Up to the first "script" end tag, before the script is parsed, the result is relatively straightforward:
After the script is parsed, though, the div element and its child script element are gone:
They are, at this point, in the Document of the aforementioned outer browsing context. However, the stack of open elements still contains the div element.
Thus, when the second script element is parsed, it is inserted into the outer Document object.
Scripts parsed into a Document other than the one the parser was created for do not execute, so the first alert does not show.
Once the div element's end tag is parsed, the div element is popped off the stack, and so the next script element is in the inner Document:
This script does execute, resulting in an alert that says "http://example.com/inner".
8.2.8.5 The execution of scripts that are moving across multiple documents
This section is non-normative.
Elaborating on the example in the previous section, consider the case where the second script element is an external script (i.e. one with a src attribute). Since the element was not in the parser's Document when it was created, that external script is not even downloaded.
In a case where a script element with a src attribute is parsed normally into its parser's Document, but while the external script is being downloaded, the element is moved to another document, the script continues to download, but does not execute.
In general, moving script elements between Documents is considered a bad practice.
8.2.8.6 Unclosed formatting elements
This section is non-normative.
The following markup shows how nested formatting elements (such as b) get collected and continue to be applied even as the elements they are contained in are closed, but that excessive duplicates are thrown away.
<!DOCTYPE html><p><b class=x><b class=x><b><b class=x><b class=x><b>X<p>X<p><b><b class=x><b>X<p></b></b></b></b></b></b>X
The resulting DOM tree is as follows:
Note how the second p element in the markup has no explicit b elements, but in the resulting DOM, up to three of each kind of formatting element (in this case three b elements with the class attribute, and two unadorned b elements) get reconstructed before the element's "X".
Also note how this means that in the final paragraph only six b end tags are needed to completely clear the list of formatting elements, even though nine b start tags have been seen up to this point.
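The counts in this example follow from the limit of three identical entries that the parser applies when pushing onto the list of active formatting elements. A minimal sketch, assuming each entry can be modelled as a (tag name, attributes) pair and a marker as None; the function and constant names are hypothetical:

```python
MARKER = None


def push_active_formatting(active, entry):
    """Push entry, keeping at most three identical entries after the last marker."""
    start = 0
    for i, item in enumerate(active):
        if item is MARKER:
            start = i + 1
    matches = [i for i in range(start, len(active)) if active[i] == entry]
    if len(matches) >= 3:
        del active[matches[0]]  # discard the earliest identical entry
    active.append(entry)


B = ("b", frozenset())
BX = ("b", frozenset({("class", "x")}))

active = []
for entry in [BX, BX, B, BX, BX, B]:   # b start tags in the first paragraph
    push_active_formatting(active, entry)
after_first_paragraph = list(active)

for entry in [B, BX, B]:               # b start tags in the third paragraph
    push_active_formatting(active, entry)

print(len(after_first_paragraph), len(active))  # 5 6
```

Nine entries are pushed in total, but only six remain, which is why six end tags are enough to empty the list.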
8.3 Serializing HTML fragments
The following steps form the HTML fragment serialization algorithm. The algorithm takes as input a DOM Element, Document, or DocumentFragment referred to as the node, and either returns a string or throws an exception.
This algorithm serializes the children of the node being serialized, not the node itself.
- Let s be a string, and initialize it to the empty string.
- For each child node of the node, in tree order, run the following steps:
- Let current node be the child node being processed.
- Append the appropriate string from the following list to s:
- If current node is an Element
If current node is an element in the HTML namespace, the MathML namespace, or the SVG namespace, then let tagname be current node's local name. Otherwise, let tagname be current node's qualified name.
Append a U+003C LESS-THAN SIGN character (<), followed by tagname.
For HTML elements created by the HTML parser or Document.createElement(), tagname will be lowercase.
For each attribute that the element has, append a U+0020 SPACE character, the attribute's serialized name as described below, a "=" (U+003D) character, a U+0022 QUOTATION MARK character ("), the attribute's value, escaped as described below in attribute mode, and a second U+0022 QUOTATION MARK character (").
An attribute's serialized name for the purposes of the previous paragraph must be determined as follows:
- If the attribute has no namespace
The attribute's serialized name is the attribute's local name.
For attributes on HTML elements set by the HTML parser or by Element.setAttribute(), the local name will be lowercase.
- If the attribute is in the XML namespace
The attribute's serialized name is the string "xml:" followed by the attribute's local name.
- If the attribute is in the XMLNS namespace and the attribute's local name is xmlns
The attribute's serialized name is the string "xmlns".
- If the attribute is in the XMLNS namespace and the attribute's local name is not xmlns
The attribute's serialized name is the string "xmlns:" followed by the attribute's local name.
- If the attribute is in the XLink namespace
The attribute's serialized name is the string "xlink:" followed by the attribute's local name.
- If the attribute is in some other namespace
The attribute's serialized name is the attribute's qualified name.
While the exact order of attributes is UA-defined, and may depend on factors such as the order that the attributes were given in the original markup, the sort order must be stable, such that consecutive invocations of this algorithm serialize an element's attributes in the same order.
Append a U+003E GREATER-THAN SIGN character (>).
If current node is an area, base, basefont, bgsound, br, col, command, embed, frame, hr, img, input, keygen, link, meta, param, source, track or wbr element, then continue on to the next child node at this point.
If current node is a pre, textarea, or listing element, and the first child node of the element, if any, is a Text node whose character data has as its first character a "LF" (U+000A) character, then append a "LF" (U+000A) character.
Append the value of running the HTML fragment serialization algorithm on the current node element (thus recursing into this algorithm for that element), followed by a U+003C LESS-THAN SIGN character (<), a "/" (U+002F) character, tagname again, and finally a U+003E GREATER-THAN SIGN character (>).
- If current node is a Text node
If the parent of current node is a style, script, xmp, iframe, noembed, noframes, or plaintext element, or if the parent of current node is a noscript element and scripting is enabled for the node, then append the value of current node's data IDL attribute literally.
Otherwise, append the value of current node's data IDL attribute, escaped as described below.
- If current node is a Comment
Append the literal string <!-- (U+003C LESS-THAN SIGN, U+0021 EXCLAMATION MARK, U+002D HYPHEN-MINUS, U+002D HYPHEN-MINUS), followed by the value of current node's data IDL attribute, followed by the literal string --> (U+002D HYPHEN-MINUS, U+002D HYPHEN-MINUS, U+003E GREATER-THAN SIGN).
- If current node is a ProcessingInstruction
Append the literal string <? (U+003C LESS-THAN SIGN, U+003F QUESTION MARK), followed by the value of current node's target IDL attribute, followed by a single U+0020 SPACE character, followed by the value of current node's data IDL attribute, followed by a single ">" (U+003E) character.
- If current node is a DocumentType
Append the literal string <!DOCTYPE (U+003C LESS-THAN SIGN, U+0021 EXCLAMATION MARK, U+0044 LATIN CAPITAL LETTER D, U+004F LATIN CAPITAL LETTER O, U+0043 LATIN CAPITAL LETTER C, U+0054 LATIN CAPITAL LETTER T, U+0059 LATIN CAPITAL LETTER Y, U+0050 LATIN CAPITAL LETTER P, U+0045 LATIN CAPITAL LETTER E), followed by a space (U+0020 SPACE), followed by the value of current node's name IDL attribute, followed by the literal string > (U+003E GREATER-THAN SIGN).
The result of the algorithm is the string s.
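The steps above can be sketched over a toy node model. Everything here is an assumption made for illustration: the dataclasses are not the DOM interfaces, attribute order is simply list order, the void-element check is limited to the HTML namespace, and ProcessingInstruction and DocumentType nodes are left out for brevity. The `escape` helper follows the escaping steps given later in this section.

```python
from dataclasses import dataclass, field

HTML_NS = "http://www.w3.org/1999/xhtml"
MATHML_NS = "http://www.w3.org/1998/Math/MathML"
SVG_NS = "http://www.w3.org/2000/svg"
XML_NS = "http://www.w3.org/XML/1998/namespace"
XMLNS_NS = "http://www.w3.org/2000/xmlns/"
XLINK_NS = "http://www.w3.org/1999/xlink"
VOID = set("area base basefont bgsound br col command embed frame hr img input "
           "keygen link meta param source track wbr".split())
RAW_TEXT = {"style", "script", "xmp", "iframe", "noembed", "noframes", "plaintext"}

@dataclass
class Attr:
    local_name: str
    value: str
    namespace: str = None
    prefix: str = None

@dataclass
class Element:
    local_name: str
    attributes: list = field(default_factory=list)
    children: list = field(default_factory=list)
    namespace: str = HTML_NS
    prefix: str = None

@dataclass
class Text:
    data: str

@dataclass
class Comment:
    data: str

def qualified(prefix, local_name):
    return f"{prefix}:{local_name}" if prefix else local_name

def escape(s, attribute_mode=False):
    s = s.replace("&", "&amp;").replace("\u00a0", "&nbsp;")
    if attribute_mode:
        return s.replace('"', "&quot;")
    return s.replace("<", "&lt;").replace(">", "&gt;")

def attr_name(a):
    if a.namespace is None:
        return a.local_name
    if a.namespace == XML_NS:
        return "xml:" + a.local_name
    if a.namespace == XMLNS_NS:
        return "xmlns" if a.local_name == "xmlns" else "xmlns:" + a.local_name
    if a.namespace == XLINK_NS:
        return "xlink:" + a.local_name
    return qualified(a.prefix, a.local_name)

def serialize(node, scripting=True):
    """Serialize the children of node, not node itself."""
    s = ""
    for child in node.children:
        if isinstance(child, Element):
            if child.namespace in (HTML_NS, MATHML_NS, SVG_NS):
                tagname = child.local_name
            else:
                tagname = qualified(child.prefix, child.local_name)
            s += "<" + tagname
            for a in child.attributes:
                s += f' {attr_name(a)}="{escape(a.value, True)}"'
            s += ">"
            if child.namespace == HTML_NS and child.local_name in VOID:
                continue  # void elements have no content and no end tag
            first = child.children[0] if child.children else None
            if (child.local_name in ("pre", "textarea", "listing")
                    and isinstance(first, Text) and first.data.startswith("\n")):
                s += "\n"  # preserve the newline the parser would otherwise drop
            s += serialize(child, scripting) + "</" + tagname + ">"
        elif isinstance(child, Text):
            parent = node.local_name if isinstance(node, Element) else None
            if parent in RAW_TEXT or (parent == "noscript" and scripting):
                s += child.data  # appended literally
            else:
                s += escape(child.data)
        elif isinstance(child, Comment):
            s += "<!--" + child.data + "-->"
    return s
```

For example, serializing a div that contains a p with a title attribute, a br, a comment, and a style element yields markup in which only the p's attribute value and text are escaped; the style text is emitted literally.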
It is possible that the output of this algorithm, if parsed with an HTML parser, will not return the original tree structure.
For instance, if a textarea element to which a Comment node has been appended is serialized and the output is then reparsed, the comment will end up being displayed in the text field. Similarly, if, as a result of DOM manipulation, an element contains a comment that contains the literal string "-->", then when the result of serializing the element is parsed, the comment will be truncated at that point and the rest of the comment will be interpreted as markup. More examples would be making a script element contain a Text node with the text string "</script>", or having a p element that contains a ul element (as the ul element's start tag would imply the end tag for the p).
This can enable cross-site scripting attacks. An example of this would be a page that lets the user enter some font names that are then inserted into a CSS style block via the DOM and which then uses the innerHTML IDL attribute to get the HTML serialization of that style element: if the user enters "</style><script>attack</script>" as a font name, innerHTML will return markup that, if parsed in a different context, would contain a script node, even though no script node existed in the original DOM.
Escaping a string (for the purposes of the algorithm above) consists of running the following steps:
Replace any occurrence of the "&" character by the string "&amp;".
Replace any occurrences of the U+00A0 NO-BREAK SPACE character by the string "&nbsp;".
If the algorithm was invoked in the attribute mode, replace any occurrences of the """ character by the string "&quot;".
If the algorithm was not invoked in the attribute mode, replace any occurrences of the "<" character by the string "&lt;", and any occurrences of the ">" character by the string "&gt;".
8.4 Parsing HTML fragments
The following steps form the HTML fragment parsing algorithm. The algorithm optionally takes as input an Element node, referred to as the context element, which gives the context for the parser, as well as input, a string to parse, and returns a list of zero or more nodes.
Parts marked fragment case in algorithms in the parser section are parts that only occur if the parser was created for the purposes of this algorithm (and with a context element). The algorithms have been annotated with such markings for informational purposes only; such markings have no normative weight. If it is possible for a condition described as a fragment case to occur even when the parser wasn't created for the purposes of handling this algorithm, then that is an error in the specification.
- Create a new Document node, and mark it as being an HTML document.
- If there is a context element, and the Document of the context element is in quirks mode, then let the Document be in quirks mode. Otherwise, if there is a context element, and the Document of the context element is in limited-quirks mode, then let the Document be in limited-quirks mode. Otherwise, leave the Document in no-quirks mode.
- Create a new HTML parser, and associate it with the just created Document node.
- If there is a context element, run these substeps:
- Set the state of the HTML parser's tokenization stage as follows:
- If it is a title or textarea element
Switch the tokenizer to the RCDATA state.
- If it is a style, xmp, iframe, noembed, or noframes element
Switch the tokenizer to the RAWTEXT state.
- If it is a script element
Switch the tokenizer to the script data state.
- If it is a noscript element
If the scripting flag is enabled, switch the tokenizer to the RAWTEXT state. Otherwise, leave the tokenizer in the data state.
- If it is a plaintext element
Switch the tokenizer to the PLAINTEXT state.
- Otherwise
Leave the tokenizer in the data state.
For performance reasons, an implementation that does not report errors and that uses the actual state machine described in this specification directly could use the PLAINTEXT state instead of the RAWTEXT and script data states where those are mentioned in the list above. Except for rules regarding parse errors, they are equivalent, since there is no appropriate end tag token in the fragment case, yet they involve far fewer state transitions.
- Let root be a new html element with no attributes.
- Append the element root to the Document node created above.
- Set up the parser's stack of open elements so that it contains just the single element root.
- Reset the parser's insertion mode appropriately.
The parser will reference the context element as part of that algorithm.
- Set the parser's form element pointer to the nearest node to the context element that is a form element (going straight up the ancestor chain, and including the element itself, if it is a form element), or, if there is no such form element, to null.
- Place into the input stream for the HTML parser just created the input. The encoding confidence is irrelevant.
- Start the parser and let it run until it has consumed all the characters just inserted into the input stream.
- If there is a context element, return the child nodes of root, in tree order.
Otherwise, return the children of the Document object, in tree order.
This algorithm is invoked without a context element in the case of Document.innerHTML.
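The context-dependent parts of the setup (document mode, initial tokenizer state, and form element pointer) can be sketched as three small functions. The `Element` class is a hypothetical stand-in for a DOM element; storing the owning Document's mode on the element is a shortcut taken only for this sketch.

```python
from dataclasses import dataclass


@dataclass
class Element:
    name: str
    parent: "Element" = None
    document_mode: str = "no-quirks"  # mode of the element's Document (toy shortcut)


RCDATA = {"title", "textarea"}
RAWTEXT = {"style", "xmp", "iframe", "noembed", "noframes"}


def fragment_document_mode(context):
    """Mode for the new Document: inherit quirks or limited-quirks from the context."""
    if context is not None and context.document_mode in ("quirks", "limited-quirks"):
        return context.document_mode
    return "no-quirks"


def initial_tokenizer_state(context, scripting=False):
    if context is None:
        return "data"
    if context.name in RCDATA:
        return "RCDATA"
    if context.name in RAWTEXT:
        return "RAWTEXT"
    if context.name == "script":
        return "script data"
    if context.name == "noscript":
        return "RAWTEXT" if scripting else "data"
    if context.name == "plaintext":
        return "PLAINTEXT"
    return "data"


def form_element_pointer(context):
    """Nearest form element, starting at the context element itself, else None."""
    node = context
    while node is not None:
        if node.name == "form":
            return node
        node = node.parent
    return None
```

With a form that contains a div that contains an input, `form_element_pointer` called on the input returns the form; called on an element with no form ancestor, it returns None.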