8.2.4.45 Markup declaration open state
If the next two characters are both "-" (U+002D) characters, consume those two characters, create a comment token whose data is the empty string, and switch to the comment start state.
Otherwise, if the next seven characters are an ASCII case-insensitive match for the word "DOCTYPE", then consume those characters and switch to the DOCTYPE state.
Otherwise, if there is a current node and it is not an element in the HTML namespace and the next seven characters are a case-sensitive match for the string "[CDATA[" (the five uppercase letters "CDATA" with a U+005B LEFT SQUARE BRACKET character before and after), then consume those characters and switch to the CDATA section state.
Otherwise, this is a parse error. Switch to the bogus comment state. The next character that is consumed, if any, is the first character that will be in the comment.
8.2.4.46 Comment start state
Consume the next input character:
- "-" (U+002D)
- Switch to the comment start dash state.
- U+0000 NULL
- Parse error. Append a U+FFFD REPLACEMENT CHARACTER character to the comment token's data. Switch to the comment state.
- U+003E GREATER-THAN SIGN (>)
- Parse error. Switch to the data state. Emit the comment token.
- EOF
- Parse error. Switch to the data state. Emit the comment token. Reconsume the EOF character.
- Anything else
- Append the current input character to the comment token's data. Switch to the comment state.
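The lookahead checks of the markup declaration open state (8.2.4.45) can be sketched as follows. This is a minimal illustration in Python, not a full tokenizer: the function names, the `in_foreign_content` flag (standing in for "there is a current node and it is not an element in the HTML namespace"), and the returned state-name strings are all assumptions made for this example.

```python
def ascii_lower(s):
    """Lowercase only A-Z (add 0x20), leaving every other character untouched."""
    return "".join(chr(ord(c) + 0x20) if "A" <= c <= "Z" else c for c in s)


def markup_declaration_open(text, pos, in_foreign_content=False):
    """Return (next_state, new_pos, parse_error) for input just after "<!".

    Hypothetical helper: `in_foreign_content` is supplied by the caller and
    models the "current node is not in the HTML namespace" condition.
    """
    if text.startswith("--", pos):
        # Two hyphens: the caller creates a comment token with empty data.
        return "comment start", pos + 2, False
    if ascii_lower(text[pos:pos + 7]) == "doctype":
        # ASCII case-insensitive match for "DOCTYPE".
        return "DOCTYPE", pos + 7, False
    if in_foreign_content and text.startswith("[CDATA[", pos):
        # Case-sensitive match, only allowed outside the HTML namespace.
        return "CDATA section", pos + 7, False
    # Nothing matched: parse error, and no characters are consumed.
    return "bogus comment", pos, True
```

The order of the checks matters only in that each test is tried when the previous one fails; the three prefixes cannot match at the same position.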
8.2.4.47 Comment start dash state
Consume the next input character:
- "-" (U+002D)
- Switch to the comment end state.
- U+0000 NULL
- Parse error. Append a "-" (U+002D) character and a U+FFFD REPLACEMENT CHARACTER character to the comment token's data. Switch to the comment state.
- U+003E GREATER-THAN SIGN (>)
- Parse error. Switch to the data state. Emit the comment token.
- EOF
- Parse error. Switch to the data state. Emit the comment token. Reconsume the EOF character.
- Anything else
- Append a "-" (U+002D) character and the current input character to the comment token's data. Switch to the comment state.
8.2.4.48 Comment state
Consume the next input character:
- "-" (U+002D)
- Switch to the comment end dash state.
- U+0000 NULL
- Parse error. Append a U+FFFD REPLACEMENT CHARACTER character to the comment token's data.
- EOF
- Parse error. Switch to the data state. Emit the comment token. Reconsume the EOF character.
- Anything else
- Append the current input character to the comment token's data.
8.2.4.49 Comment end dash state
Consume the next input character:
- "-" (U+002D)
- Switch to the comment end state.
- U+0000 NULL
- Parse error. Append a "-" (U+002D) character and a U+FFFD REPLACEMENT CHARACTER character to the comment token's data. Switch to the comment state.
- EOF
- Parse error. Switch to the data state. Emit the comment token. Reconsume the EOF character.
- Anything else
- Append a "-" (U+002D) character and the current input character to the comment token's data. Switch to the comment state.
8.2.4.50 Comment end state
Consume the next input character:
- U+003E GREATER-THAN SIGN (>)
- Switch to the data state. Emit the comment token.
- U+0000 NULL
- Parse error. Append two "-" (U+002D) characters and a U+FFFD REPLACEMENT CHARACTER character to the comment token's data. Switch to the comment state.
- "!" (U+0021)
- Parse error. Switch to the comment end bang state.
- "-" (U+002D)
- Parse error. Append a "-" (U+002D) character to the comment token's data.
- EOF
- Parse error. Switch to the data state. Emit the comment token. Reconsume the EOF character.
- Anything else
- Parse error. Append two "-" (U+002D) characters and the current input character to the comment token's data. Switch to the comment state.
8.2.4.51 Comment end bang state
Consume the next input character:
- "-" (U+002D)
- Append two "-" (U+002D) characters and a "!" (U+0021) character to the comment token's data. Switch to the comment end dash state.
- U+003E GREATER-THAN SIGN (>)
- Switch to the data state. Emit the comment token.
- U+0000 NULL
- Parse error. Append two "-" (U+002D) characters, a "!" (U+0021) character, and a U+FFFD REPLACEMENT CHARACTER character to the comment token's data. Switch to the comment state.
- EOF
- Parse error. Switch to the data state. Emit the comment token. Reconsume the EOF character.
- Anything else
- Append two "-" (U+002D) characters, a "!" (U+0021) character, and the current input character to the comment token's data. Switch to the comment state.
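The six comment states above share a pattern: each state "holds" a few characters (nothing, "-", "--", or "--!") that are flushed into the comment data if the comment does not end there. The sketch below exploits that to run all six states in one loop. It is an illustrative condensation rather than a literal transcription; the names `PENDING`, `ON_DASH`, and `tokenize_comment`, the short state labels, and the tuple return value are assumptions of this example.

```python
REPLACEMENT = "\ufffd"

# Characters each state is holding back from the comment data.
PENDING = {"start": "", "start dash": "-", "comment": "",
           "end dash": "-", "end": "--", "end bang": "--!"}

# States in which "-" simply switches to another state.
ON_DASH = {"start": "start dash", "start dash": "end",
           "comment": "end dash", "end dash": "end"}


def tokenize_comment(text):
    """Run the comment states over `text`, which starts just after "<!--".

    Returns (comment_data, index_after_comment, parse_error_count).
    """
    data, errors, state = "", 0, "start"
    for i, c in enumerate(text):
        if c == "-" and state in ON_DASH:
            state = ON_DASH[state]
        elif c == "-" and state == "end":
            errors += 1              # "--" followed by yet another "-"
            data += "-"
        elif c == "-":               # only the comment end bang state is left
            data += "--!"
            state = "end dash"
        elif c == ">" and state in ("start", "start dash"):
            return data, i + 1, errors + 1   # abrupt close, e.g. "<!-->"
        elif c == ">" and state in ("end", "end bang"):
            return data, i + 1, errors       # "-->" (or "--!>", already flagged)
        elif c == "!" and state == "end":
            errors += 1
            state = "end bang"
        else:
            # NULL and "anything else": flush held characters, then this one.
            if c == "\0" or state == "end":
                errors += 1
            data += PENDING[state] + (REPLACEMENT if c == "\0" else c)
            state = "comment"
    # EOF in any comment state: parse error, emit the data collected so far.
    return data, len(text), errors + 1
```

For example, `tokenize_comment(" hi -->rest")` yields the data `" hi "` with no parse errors, while `tokenize_comment("a--b-->")` keeps the inner `--` in the data and records one parse error.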
8.2.4.52 DOCTYPE state
Consume the next input character:
- "tab" (U+0009)
- "LF" (U+000A)
- "FF" (U+000C)
- U+0020 SPACE
- Switch to the before DOCTYPE name state.
- EOF
- Parse error. Switch to the data state. Create a new DOCTYPE token. Set its force-quirks flag to on. Emit the token. Reconsume the EOF character.
- Anything else
- Parse error. Switch to the before DOCTYPE name state. Reconsume the character.
8.2.4.53 Before DOCTYPE name state
Consume the next input character:
- "tab" (U+0009)
- "LF" (U+000A)
- "FF" (U+000C)
- U+0020 SPACE
- Ignore the character.
- Uppercase ASCII letter
- Create a new DOCTYPE token. Set the token's name to the lowercase version of the current input character (add 0x0020 to the character's code point). Switch to the DOCTYPE name state.
- U+0000 NULL
- Parse error. Create a new DOCTYPE token. Set the token's name to a U+FFFD REPLACEMENT CHARACTER character. Switch to the DOCTYPE name state.
- U+003E GREATER-THAN SIGN (>)
- Parse error. Create a new DOCTYPE token. Set its force-quirks flag to on. Switch to the data state. Emit the token.
- EOF
- Parse error. Switch to the data state. Create a new DOCTYPE token. Set its force-quirks flag to on. Emit the token. Reconsume the EOF character.
- Anything else
- Create a new DOCTYPE token. Set the token's name to the current input character. Switch to the DOCTYPE name state.
8.2.4.54 DOCTYPE name state
Consume the next input character:
- "tab" (U+0009)
- "LF" (U+000A)
- "FF" (U+000C)
- U+0020 SPACE
- Switch to the after DOCTYPE name state.
- U+003E GREATER-THAN SIGN (>)
- Switch to the data state. Emit the current DOCTYPE token.
- Uppercase ASCII letter
- Append the lowercase version of the current input character (add 0x0020 to the character's code point) to the current DOCTYPE token's name.
- U+0000 NULL
- Parse error. Append a U+FFFD REPLACEMENT CHARACTER character to the current DOCTYPE token's name.
- EOF
- Parse error. Switch to the data state. Set the DOCTYPE token's force-quirks flag to on. Emit that DOCTYPE token. Reconsume the EOF character.
- Anything else
- Append the current input character to the current DOCTYPE token's name.
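The before DOCTYPE name and DOCTYPE name states together read one name, lowercasing ASCII letters and replacing NULL along the way. A minimal sketch follows, assuming the input is a Python string positioned just after the "DOCTYPE" keyword; the function name `read_doctype_name` and its tuple return value are hypothetical and chosen only for this example.

```python
WHITESPACE = "\t\n\f "


def read_doctype_name(text, pos=0):
    """Return (name, force_quirks, next_state, new_pos, parse_errors).

    `name` is None when the token is created without a name.
    """
    errors = 0
    # Before DOCTYPE name state: whitespace is ignored.
    while pos < len(text) and text[pos] in WHITESPACE:
        pos += 1
    if pos == len(text):
        return None, True, "data", pos, 1        # EOF before any name
    if text[pos] == ">":
        return None, True, "data", pos + 1, 1    # "<!DOCTYPE>"
    # DOCTYPE name state. The first character follows the same rules here,
    # because whitespace and ">" were already handled above.
    name = ""
    while pos < len(text):
        c = text[pos]
        pos += 1
        if c in WHITESPACE:
            return name, False, "after DOCTYPE name", pos, errors
        if c == ">":
            return name, False, "data", pos, errors
        if c == "\0":
            errors += 1
            name += "\ufffd"
        elif "A" <= c <= "Z":
            name += chr(ord(c) + 0x20)           # lowercase by adding 0x0020
        else:
            name += c
    # EOF inside the name: emit with force-quirks on.
    return name, True, "data", pos, errors + 1
```

With this sketch, `read_doctype_name(" HTML>")` returns the name `"html"` with force-quirks off and no parse errors.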
8.2.4.55 After DOCTYPE name state
Consume the next input character:
- "tab" (U+0009)
- "LF" (U+000A)
- "FF" (U+000C)
- U+0020 SPACE
- Ignore the character.
- U+003E GREATER-THAN SIGN (>)
- Switch to the data state. Emit the current DOCTYPE token.
- EOF
- Parse error. Switch to the data state. Set the DOCTYPE token's force-quirks flag to on. Emit that DOCTYPE token. Reconsume the EOF character.
- Anything else
- If the six characters starting from the current input character are an ASCII case-insensitive match for the word "PUBLIC", then consume those characters and switch to the after DOCTYPE public keyword state. Otherwise, if the six characters starting from the current input character are an ASCII case-insensitive match for the word "SYSTEM", then consume those characters and switch to the after DOCTYPE system keyword state. Otherwise, this is a parse error. Set the DOCTYPE token's force-quirks flag to on. Switch to the bogus DOCTYPE state.
8.2.4.56 After DOCTYPE public keyword state
Consume the next input character:
- "tab" (U+0009)
- "LF" (U+000A)
- "FF" (U+000C)
- U+0020 SPACE
- Switch to the before DOCTYPE public identifier state.
- U+0022 QUOTATION MARK (")
- Parse error. Set the DOCTYPE token's public identifier to the empty string (not missing), then switch to the DOCTYPE public identifier (double-quoted) state.
- "'" (U+0027)
- Parse error. Set the DOCTYPE token's public identifier to the empty string (not missing), then switch to the DOCTYPE public identifier (single-quoted) state.
- U+003E GREATER-THAN SIGN (>)
- Parse error. Set the DOCTYPE token's force-quirks flag to on. Switch to the data state. Emit that DOCTYPE token.
- EOF
- Parse error. Switch to the data state. Set the DOCTYPE token's force-quirks flag to on. Emit that DOCTYPE token. Reconsume the EOF character.
- Anything else
- Parse error. Set the DOCTYPE token's force-quirks flag to on. Switch to the bogus DOCTYPE state.
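The keyword test in the "anything else" branch of the after DOCTYPE name state, which is how the tokenizer reaches the state just described, can be sketched as below. The function names and the tuple return value are assumptions of this example; `pos` is taken to be the index of the current input character.

```python
def ascii_lower(s):
    """Lowercase only A-Z, so non-ASCII letters can never match a keyword."""
    return "".join(chr(ord(c) + 0x20) if "A" <= c <= "Z" else c for c in s)


def match_doctype_keyword(text, pos):
    """Return (next_state, new_pos, force_quirks, parse_error).

    `text[pos]` is the current input character in the after DOCTYPE name state.
    """
    six = ascii_lower(text[pos:pos + 6])
    if six == "public":
        return "after DOCTYPE public keyword", pos + 6, False, False
    if six == "system":
        return "after DOCTYPE system keyword", pos + 6, False, False
    # No keyword: only the current input character has been consumed.
    return "bogus DOCTYPE", pos + 1, True, True
```

An explicit ASCII-only lowercase is used instead of `str.lower()` so that the comparison matches the "ASCII case-insensitive" wording regardless of how Unicode case mapping treats other characters.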
8.2.4.57 Before DOCTYPE public identifier state
Consume the next input character:
- "tab" (U+0009)
- "LF" (U+000A)
- "FF" (U+000C)
- U+0020 SPACE
- Ignore the character.
- U+0022 QUOTATION MARK (")
- Set the DOCTYPE token's public identifier to the empty string (not missing), then switch to the DOCTYPE public identifier (double-quoted) state.
- "'" (U+0027)
- Set the DOCTYPE token's public identifier to the empty string (not missing), then switch to the DOCTYPE public identifier (single-quoted) state.
- U+003E GREATER-THAN SIGN (>)
- Parse error. Set the DOCTYPE token's force-quirks flag to on. Switch to the data state. Emit that DOCTYPE token.
- EOF
- Parse error. Switch to the data state. Set the DOCTYPE token's force-quirks flag to on. Emit that DOCTYPE token. Reconsume the EOF character.
- Anything else
- Parse error. Set the DOCTYPE token's force-quirks flag to on. Switch to the bogus DOCTYPE state.
8.2.4.58 DOCTYPE public identifier (double-quoted) state
Consume the next input character:
- U+0022 QUOTATION MARK (")
- Switch to the after DOCTYPE public identifier state.
- U+0000 NULL
- Parse error. Append a U+FFFD REPLACEMENT CHARACTER character to the current DOCTYPE token's public identifier.
- U+003E GREATER-THAN SIGN (>)
- Parse error. Set the DOCTYPE token's force-quirks flag to on. Switch to the data state. Emit that DOCTYPE token.
- EOF
- Parse error. Switch to the data state. Set the DOCTYPE token's force-quirks flag to on. Emit that DOCTYPE token. Reconsume the EOF character.
- Anything else
- Append the current input character to the current DOCTYPE token's public identifier.
8.2.4.59 DOCTYPE public identifier (single-quoted) state
Consume the next input character:
- "'" (U+0027)
- Switch to the after DOCTYPE public identifier state.
- U+0000 NULL
- Parse error. Append a U+FFFD REPLACEMENT CHARACTER character to the current DOCTYPE token's public identifier.
- U+003E GREATER-THAN SIGN (>)
- Parse error. Set the DOCTYPE token's force-quirks flag to on. Switch to the data state. Emit that DOCTYPE token.
- EOF
- Parse error. Switch to the data state. Set the DOCTYPE token's force-quirks flag to on. Emit that DOCTYPE token. Reconsume the EOF character.
- Anything else
- Append the current input character to the current DOCTYPE token's public identifier.
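The double-quoted and single-quoted identifier states differ only in which quote character ends the identifier, so one routine parameterized on the quote can illustrate both. This is a sketch under that observation; the name `read_quoted_identifier` and its return tuple are hypothetical.

```python
def read_quoted_identifier(text, pos, quote):
    """Read a DOCTYPE identifier; `pos` is just past the opening quote.

    Returns (identifier, closed, new_pos, force_quirks, parse_errors).
    """
    ident, errors = "", 0
    while pos < len(text):
        c = text[pos]
        pos += 1
        if c == quote:
            # Matching quote: the identifier is complete.
            return ident, True, pos, False, errors
        if c == ">":
            # ">" inside the quotes ends the DOCTYPE early with force-quirks on.
            return ident, False, pos, True, errors + 1
        if c == "\0":
            errors += 1
            ident += "\ufffd"
        else:
            ident += c
    # EOF inside the identifier: parse error, force-quirks on.
    return ident, False, pos, True, errors + 1
```

The identifier starts as the empty string rather than "missing", which mirrors the states that set it to the empty string before switching into these states.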
8.2.4.60 After DOCTYPE public identifier state
Consume the next input character:
- "tab" (U+0009)
- "LF" (U+000A)
- "FF" (U+000C)
- U+0020 SPACE
- Switch to the between DOCTYPE public and system identifiers state.
- U+003E GREATER-THAN SIGN (>)
- Switch to the data state. Emit the current DOCTYPE token.
- U+0022 QUOTATION MARK (")
- Parse error. Set the DOCTYPE token's system identifier to the empty string (not missing), then switch to the DOCTYPE system identifier (double-quoted) state.
- "'" (U+0027)
- Parse error. Set the DOCTYPE token's system identifier to the empty string (not missing), then switch to the DOCTYPE system identifier (single-quoted) state.
- EOF
- Parse error. Switch to the data state. Set the DOCTYPE token's force-quirks flag to on. Emit that DOCTYPE token. Reconsume the EOF character.
- Anything else
- Parse error. Set the DOCTYPE token's force-quirks flag to on. Switch to the bogus DOCTYPE state.
8.2.4.61 Between DOCTYPE public and system identifiers state
Consume the next input character:
- "tab" (U+0009)
- "LF" (U+000A)
- "FF" (U+000C)
- U+0020 SPACE
- Ignore the character.
- U+003E GREATER-THAN SIGN (>)
- Switch to the data state. Emit the current DOCTYPE token.
- U+0022 QUOTATION MARK (")
- Set the DOCTYPE token's system identifier to the empty string (not missing), then switch to the DOCTYPE system identifier (double-quoted) state.
- "'" (U+0027)
- Set the DOCTYPE token's system identifier to the empty string (not missing), then switch to the DOCTYPE system identifier (single-quoted) state.
- EOF
- Parse error. Switch to the data state. Set the DOCTYPE token's force-quirks flag to on. Emit that DOCTYPE token. Reconsume the EOF character.
- Anything else
- Parse error. Set the DOCTYPE token's force-quirks flag to on. Switch to the bogus DOCTYPE state.
8.2.4.62 After DOCTYPE system keyword state
Consume the next input character:
- "tab" (U+0009)
- "LF" (U+000A)
- "FF" (U+000C)
- U+0020 SPACE
- Switch to the before DOCTYPE system identifier state.
- U+0022 QUOTATION MARK (")
- Parse error. Set the DOCTYPE token's system identifier to the empty string (not missing), then switch to the DOCTYPE system identifier (double-quoted) state.
- "'" (U+0027)
- Parse error. Set the DOCTYPE token's system identifier to the empty string (not missing), then switch to the DOCTYPE system identifier (single-quoted) state.
- U+003E GREATER-THAN SIGN (>)
- Parse error. Set the DOCTYPE token's force-quirks flag to on. Switch to the data state. Emit that DOCTYPE token.
- EOF
- Parse error. Switch to the data state. Set the DOCTYPE token's force-quirks flag to on. Emit that DOCTYPE token. Reconsume the EOF character.
- Anything else
- Parse error. Set the DOCTYPE token's force-quirks flag to on. Switch to the bogus DOCTYPE state.
8.2.4.63 Before DOCTYPE system identifier state
Consume the next input character:
- "tab" (U+0009)
- "LF" (U+000A)
- "FF" (U+000C)
- U+0020 SPACE
- Ignore the character.
- U+0022 QUOTATION MARK (")
- Set the DOCTYPE token's system identifier to the empty string (not missing), then switch to the DOCTYPE system identifier (double-quoted) state.
- "'" (U+0027)
- Set the DOCTYPE token's system identifier to the empty string (not missing), then switch to the DOCTYPE system identifier (single-quoted) state.
- U+003E GREATER-THAN SIGN (>)
- Parse error. Set the DOCTYPE token's force-quirks flag to on. Switch to the data state. Emit that DOCTYPE token.
- EOF
- Parse error. Switch to the data state. Set the DOCTYPE token's force-quirks flag to on. Emit that DOCTYPE token. Reconsume the EOF character.
- Anything else
- Parse error. Set the DOCTYPE token's force-quirks flag to on. Switch to the bogus DOCTYPE state.
8.2.4.64 DOCTYPE system identifier (double-quoted) state
Consume the next input character:
- U+0022 QUOTATION MARK (")
- Switch to the after DOCTYPE system identifier state.
- U+0000 NULL
- Parse error. Append a U+FFFD REPLACEMENT CHARACTER character to the current DOCTYPE token's system identifier.
- U+003E GREATER-THAN SIGN (>)
- Parse error. Set the DOCTYPE token's force-quirks flag to on. Switch to the data state. Emit that DOCTYPE token.
- EOF
- Parse error. Switch to the data state. Set the DOCTYPE token's force-quirks flag to on. Emit that DOCTYPE token. Reconsume the EOF character.
- Anything else
- Append the current input character to the current DOCTYPE token's system identifier.
8.2.4.65 DOCTYPE system identifier (single-quoted) state
Consume the next input character:
- "'" (U+0027)
- Switch to the after DOCTYPE system identifier state.
- U+0000 NULL
- Parse error. Append a U+FFFD REPLACEMENT CHARACTER character to the current DOCTYPE token's system identifier.
- U+003E GREATER-THAN SIGN (>)
- Parse error. Set the DOCTYPE token's force-quirks flag to on. Switch to the data state. Emit that DOCTYPE token.
- EOF
- Parse error. Switch to the data state. Set the DOCTYPE token's force-quirks flag to on. Emit that DOCTYPE token. Reconsume the EOF character.
- Anything else
- Append the current input character to the current DOCTYPE token's system identifier.
8.2.4.66 After DOCTYPE system identifier state
Consume the next input character:
- "tab" (U+0009)
- "LF" (U+000A)
- "FF" (U+000C)
- U+0020 SPACE
- Ignore the character.
- U+003E GREATER-THAN SIGN (>)
- Switch to the data state. Emit the current DOCTYPE token.
- EOF
- Parse error. Switch to the data state. Set the DOCTYPE token's force-quirks flag to on. Emit that DOCTYPE token. Reconsume the EOF character.
- Anything else
- Parse error. Switch to the bogus DOCTYPE state. (This does not set the DOCTYPE token's force-quirks flag to on.)
8.2.4.67 Bogus DOCTYPE state
Consume the next input character:
- U+003E GREATER-THAN SIGN (>)
- Switch to the data state. Emit the DOCTYPE token.
- EOF
- Switch to the data state. Emit the DOCTYPE token. Reconsume the EOF character.
- Anything else
- Ignore the character.
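Because the bogus DOCTYPE state ignores everything except ">" and EOF, it amounts to a forward scan. A small sketch, assuming string input; the name `skip_bogus_doctype` and the `(new_pos, hit_eof)` return value are hypothetical.

```python
def skip_bogus_doctype(text, pos):
    """Skip to just past the next ">" or to EOF, whichever comes first.

    Returns (new_pos, hit_eof). The caller emits the DOCTYPE token as it
    stands in both cases; this state raises no parse error of its own.
    """
    end = text.find(">", pos)
    if end == -1:
        return len(text), True
    return end + 1, False
```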
8.2.4.68 CDATA section state
Switch to the data state.
Consume every character up to the next occurrence of the three character sequence U+005D RIGHT SQUARE BRACKET U+005D RIGHT SQUARE BRACKET U+003E GREATER-THAN SIGN (]]>), or the end of the file (EOF), whichever comes first. Emit a series of character tokens consisting of all the characters consumed except the matching three character sequence at the end (if one was found before the end of the file).
If the end of the file was reached, reconsume the EOF character.
8.2.4.69 Tokenizing character references
This section defines how to consume a character reference. This definition is used when parsing character references in text and in attributes.
The behavior depends on the identity of the next character (the one immediately after the U+0026 AMPERSAND character):
- "tab" (U+0009)
- "LF" (U+000A)
- "FF" (U+000C)
- U+0020 SPACE
- U+003C LESS-THAN SIGN
- U+0026 AMPERSAND
- EOF
- The additional allowed character, if there is one
- Not a character reference. No characters are consumed, and nothing is returned. (This is not an error, either.)
- "#" (U+0023)
- Consume the U+0023 NUMBER SIGN. The behavior further depends on the character after the U+0023 NUMBER SIGN:
  - U+0078 LATIN SMALL LETTER X
  - U+0058 LATIN CAPITAL LETTER X
  - Consume the X. Follow the steps below, but using ASCII hex digits. When it comes to interpreting the number, interpret it as a hexadecimal number.
  - Anything else
  - Follow the steps below, but using ASCII digits. When it comes to interpreting the number, interpret it as a decimal number.
Consume as many characters as match the range of characters given above (ASCII hex digits or ASCII digits).
If no characters match the range, then don't consume any characters (and unconsume the U+0023 NUMBER SIGN character and, if appropriate, the X character). This is a parse error; nothing is returned.
Otherwise, if the next character is a U+003B SEMICOLON, consume that too. If it isn't, there is a parse error.
If one or more characters match the range, then take them all and interpret the string of characters as a number (either hexadecimal or decimal as appropriate).
If that number is one of the numbers in the first column of the following table, then this is a parse error. Find the row with that number in the first column, and return a character token for the Unicode character given in the second column of that row.
Number | Unicode character
0x00 | U+FFFD | REPLACEMENT CHARACTER
0x0D | U+000D | CARRIAGE RETURN (CR)
0x80 | U+20AC | EURO SIGN (€)
0x81 | U+0081 | <control>
0x82 | U+201A | SINGLE LOW-9 QUOTATION MARK (‚)
0x83 | U+0192 | LATIN SMALL LETTER F WITH HOOK (ƒ)
0x84 | U+201E | DOUBLE LOW-9 QUOTATION MARK („)
0x85 | U+2026 | HORIZONTAL ELLIPSIS (…)
0x86 | U+2020 | DAGGER (†)
0x87 | U+2021 | DOUBLE DAGGER (‡)
0x88 | U+02C6 | MODIFIER LETTER CIRCUMFLEX ACCENT (ˆ)
0x89 | U+2030 | PER MILLE SIGN (‰)
0x8A | U+0160 | LATIN CAPITAL LETTER S WITH CARON (Š)
0x8B | U+2039 | SINGLE LEFT-POINTING ANGLE QUOTATION MARK (‹)
0x8C | U+0152 | LATIN CAPITAL LIGATURE OE (Œ)
0x8D | U+008D | <control>
0x8E | U+017D | LATIN CAPITAL LETTER Z WITH CARON (Ž)
0x8F | U+008F | <control>
0x90 | U+0090 | <control>
0x91 | U+2018 | LEFT SINGLE QUOTATION MARK (‘)
0x92 | U+2019 | RIGHT SINGLE QUOTATION MARK (’)
0x93 | U+201C | LEFT DOUBLE QUOTATION MARK (“)
0x94 | U+201D | RIGHT DOUBLE QUOTATION MARK (”)
0x95 | U+2022 | BULLET (•)
0x96 | U+2013 | EN DASH (–)
0x97 | U+2014 | EM DASH (—)
0x98 | U+02DC | SMALL TILDE (˜)
0x99 | U+2122 | TRADE MARK SIGN (™)
0x9A | U+0161 | LATIN SMALL LETTER S WITH CARON (š)
0x9B | U+203A | SINGLE RIGHT-POINTING ANGLE QUOTATION MARK (›)
0x9C | U+0153 | LATIN SMALL LIGATURE OE (œ)
0x9D | U+009D | <control>
0x9E | U+017E | LATIN SMALL LETTER Z WITH CARON (ž)
0x9F | U+0178 | LATIN CAPITAL LETTER Y WITH DIAERESIS (Ÿ)
Otherwise, if the number is in the range 0xD800 to 0xDFFF or is greater than 0x10FFFF, then this is a parse error. Return a U+FFFD REPLACEMENT CHARACTER.
Otherwise, return a character token for the Unicode character whose code point is that number. Additionally, if the number is in the range 0x0001 to 0x0008, 0x000E to 0x001F, 0x007F to 0x009F, 0xFDD0 to 0xFDEF, or is one of 0x000B, 0xFFFE, 0xFFFF, 0x1FFFE, 0x1FFFF, 0x2FFFE, 0x2FFFF, 0x3FFFE, 0x3FFFF, 0x4FFFE, 0x4FFFF, 0x5FFFE, 0x5FFFF, 0x6FFFE, 0x6FFFF, 0x7FFFE, 0x7FFFF, 0x8FFFE, 0x8FFFF, 0x9FFFE, 0x9FFFF, 0xAFFFE, 0xAFFFF, 0xBFFFE, 0xBFFFF, 0xCFFFE, 0xCFFFF, 0xDFFFE, 0xDFFFF, 0xEFFFE, 0xEFFFF, 0xFFFFE, 0xFFFFF, 0x10FFFE, or 0x10FFFF, then this is a parse error.
- Anything else
- Consume the maximum number of characters possible, with the consumed characters matching one of the identifiers in the first column of the named character references table (in a case-sensitive manner).
If no match can be made, then no characters are consumed, and nothing is returned. In this case, if the characters after the U+0026 AMPERSAND character (&) consist of a sequence of one or more alphanumeric ASCII characters followed by a U+003B SEMICOLON character (;), then this is a parse error.
If the character reference is being consumed as part of an attribute, and the last character matched is not a ";" (U+003B) character, and the next character is either a "=" (U+003D) character or an alphanumeric ASCII character, then, for historical reasons, all the characters that were matched after the U+0026 AMPERSAND character (&) must be unconsumed, and nothing is returned.
Otherwise, a character reference is parsed. If the last character matched is not a ";" (U+003B) character, there is a parse error.
Return one or two character tokens for the character(s) corresponding to the character reference name (as given by the second column of the named character references table).
If the markup contains (not in an attribute) the string "I'm &notit; I tell you", the character reference is parsed as "not", as in, "I'm ¬it; I tell you" (and this is a parse error). But if the markup was "I'm &notin; I tell you", the character reference would be parsed as "notin;", resulting in "I'm ∉ I tell you" (and no parse error).
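The numeric ("#") branch of the algorithm above can be sketched as follows. This is an illustration only, assuming the input is a Python string and that `text[pos]` is the "#" following the ampersand; the names `TABLE`, `resolve_numeric_reference`, and `consume_numeric_reference`, and the tuple return values, are assumptions of this example. Named references are not covered because they need the full named character references table.

```python
# Number -> code point, transcribed from the table in this section.
TABLE = {
    0x00: 0xFFFD, 0x0D: 0x000D, 0x80: 0x20AC, 0x81: 0x0081, 0x82: 0x201A,
    0x83: 0x0192, 0x84: 0x201E, 0x85: 0x2026, 0x86: 0x2020, 0x87: 0x2021,
    0x88: 0x02C6, 0x89: 0x2030, 0x8A: 0x0160, 0x8B: 0x2039, 0x8C: 0x0152,
    0x8D: 0x008D, 0x8E: 0x017D, 0x8F: 0x008F, 0x90: 0x0090, 0x91: 0x2018,
    0x92: 0x2019, 0x93: 0x201C, 0x94: 0x201D, 0x95: 0x2022, 0x96: 0x2013,
    0x97: 0x2014, 0x98: 0x02DC, 0x99: 0x2122, 0x9A: 0x0161, 0x9B: 0x203A,
    0x9C: 0x0153, 0x9D: 0x009D, 0x9E: 0x017E, 0x9F: 0x0178,
}


def resolve_numeric_reference(number):
    """Map an already-parsed number to (character, is_parse_error)."""
    if number in TABLE:
        return chr(TABLE[number]), True
    if 0xD800 <= number <= 0xDFFF or number > 0x10FFFF:
        return "\ufffd", True
    flagged = (
        0x0001 <= number <= 0x0008 or 0x000E <= number <= 0x001F
        or 0x007F <= number <= 0x009F or 0xFDD0 <= number <= 0xFDEF
        or number == 0x000B
        or (number & 0xFFFE) == 0xFFFE   # 0xFFFE/0xFFFF in every plane
    )
    return chr(number), flagged


def consume_numeric_reference(text, pos):
    """`text[pos]` must be the "#" that follows "&".

    Returns (character or None, new_pos, parse_error).
    """
    i = pos + 1
    if text[i:i + 1] in ("x", "X"):
        i += 1
        digits, base = "0123456789abcdefABCDEF", 16
    else:
        digits, base = "0123456789", 10
    start = i
    while i < len(text) and text[i] in digits:
        i += 1
    if i == start:
        # No digits: unconsume "#" (and the X); nothing is returned.
        return None, pos, True
    number = int(text[start:i], base)
    missing_semicolon = text[i:i + 1] != ";"
    if not missing_semicolon:
        i += 1
    char, bad_number = resolve_numeric_reference(number)
    return char, i, missing_semicolon or bad_number
```

The single bit-mask test replaces the long list of 0xFFFE and 0xFFFF values: once numbers above 0x10FFFF have been rejected, a number has low 16 bits of 0xFFFE or 0xFFFF exactly when it is one of the listed values.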