Tokenization - HTML 5

Implementations must act as if they used the following state machine to tokenize HTML. The state machine must start in the data state. Most states consume a single character, which may have various side effects, and then do one of three things: switch the state machine to a new state to reconsume the same character, switch it to a new state to consume the next character, or stay in the same state to consume the next character. Some states have more complicated behavior and can consume several characters before switching to another state. In some cases, the tokenizer state is also changed by the tree construction stage.
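The three kinds of transition described above can be sketched with a toy tokenizer. This is a minimal illustration, not the state machine defined in this section: it models only three states, the function name `tokenize_sketch` and the tuple-shaped tokens are assumptions made for the example, and it ignores end tags, attributes, comments, and end-of-file handling.

```python
from enum import Enum, auto


class State(Enum):
    DATA = auto()
    TAG_OPEN = auto()
    TAG_NAME = auto()


def tokenize_sketch(text):
    """Toy three-state tokenizer returning a list of (kind, value) tuples."""
    state = State.DATA  # the machine starts in the data state
    pos = 0
    name = ""
    tokens = []
    while pos < len(text):
        ch = text[pos]
        if state is State.DATA:
            if ch == "<":
                state = State.TAG_OPEN  # new state, consume the next character
            else:
                tokens.append(("character", ch))  # same state, next character
            pos += 1
        elif state is State.TAG_OPEN:
            if ch.isalpha():
                name = ""
                state = State.TAG_NAME  # new state, reconsume: pos is not advanced
            else:
                tokens.append(("character", "<"))
                state = State.DATA  # new state, reconsume: pos is not advanced
        else:  # State.TAG_NAME
            if ch == ">":
                tokens.append(("start tag", name))
                state = State.DATA
            else:
                name += ch.lower()
            pos += 1
    return tokens
```

Reconsuming is modeled here by simply not advancing `pos`, so the next loop iteration sees the same character in the new state.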

The exact behavior of certain states depends on the insertion mode and the stack of open elements. Certain states also use a temporary buffer to track progress.

The output of the tokenization step is a series of zero or more of the following tokens: DOCTYPE, start tag, end tag, comment, character, end-of-file.

DOCTYPE tokens have a name, a public identifier, a system identifier, and a force-quirks flag. When a DOCTYPE token is created, its name, public identifier, and system identifier must be marked as missing (which is a distinct state from the empty string), and the force-quirks flag must be set to off (its other state is on).

Start and end tag tokens have a tag name, a self-closing flag, and a list of attributes, each of which has a name and a value. When a start or end tag token is created, its self-closing flag must be unset (its other state is that it be set), and its attributes list must be empty.

Comment and character tokens have data.
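One possible way to model these tokens and their initial values is with Python dataclasses. The class and field names below are assumptions chosen for the example; the choice of `None` to represent "missing" is also an assumption, made so that a missing value stays distinguishable from the empty string.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Doctype:
    # None models "missing", which is distinct from the empty string "".
    name: Optional[str] = None
    public_id: Optional[str] = None
    system_id: Optional[str] = None
    force_quirks: bool = False  # off on creation


@dataclass
class Attribute:
    name: str
    value: str


@dataclass
class Tag:
    kind: str  # "start" or "end"
    tag_name: str = ""
    self_closing: bool = False  # unset on creation
    # default_factory gives every token its own empty list
    attributes: List[Attribute] = field(default_factory=list)


@dataclass
class Comment:
    data: str = ""


@dataclass
class Character:
    data: str


@dataclass
class EndOfFile:
    pass
```

Using `field(default_factory=list)` rather than a shared default avoids two tag tokens accidentally sharing one attribute list.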

When a token is emitted, it must immediately be handled by the tree construction stage. The tree construction stage can affect the state of the tokenization stage, and can insert additional characters into the stream. (For example, the script element can result in scripts executing and using the dynamic markup insertion APIs to insert characters into the stream being tokenized.)
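The coupling between the two stages can be sketched as a synchronous callback. In this illustration the `Tokenizer` class, the `tree_construction` function, and the `STATE_AFTER_START_TAG` table are hypothetical names; the table is a simplified subset, and the real rules also depend on the insertion mode and other parser state.

```python
class Tokenizer:
    """Holds only the piece of tokenizer state this sketch needs."""

    def __init__(self, handler):
        self.state = "data"
        self.handler = handler

    def emit(self, token):
        # The token is handed over immediately, before any further input
        # is consumed, so the handler can change self.state.
        self.handler(self, token)


# Simplified, illustrative subset of start tags after which the tree
# construction stage switches the tokenizer to another state.
STATE_AFTER_START_TAG = {
    "title": "RCDATA",
    "textarea": "RCDATA",
    "style": "RAWTEXT",
    "script": "script data",
    "plaintext": "PLAINTEXT",
}


def tree_construction(tokenizer, token):
    """Stand-in for the tree construction stage; token is (kind, name)."""
    kind, name = token
    if kind == "start tag" and name in STATE_AFTER_START_TAG:
        tokenizer.state = STATE_AFTER_START_TAG[name]
```

Because `emit` calls the handler before returning, the tokenizer already sees the new state when it consumes the next character.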

When a start tag token is emitted with its self-closing flag set, if the flag is not acknowledged when it is processed by the tree construction stage, that is a parse error.

When an end tag token is emitted with attributes, that is a parse error.

When an end tag token is emitted with its self-closing flag set, that is a parse error.
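The three parse-error conditions above can be collected into a single check. This is a sketch under assumptions: the function name, its parameters, and the error strings are made up for the example, and `acknowledged` stands for whatever the tree construction stage reports after processing the token.

```python
def tag_parse_errors(kind, self_closing, attributes, acknowledged=False):
    """Return the parse errors raised by emitting a tag token.

    kind is "start" or "end"; attributes is any sequence of attributes.
    """
    errors = []
    if kind == "start":
        # Only an unacknowledged self-closing flag is an error.
        if self_closing and not acknowledged:
            errors.append("self-closing flag not acknowledged")
    elif kind == "end":
        if attributes:
            errors.append("end tag with attributes")
        if self_closing:
            errors.append("end tag with self-closing flag")
    return errors
```

For instance, an end tag that carries both attributes and a set self-closing flag yields two errors in this sketch, one for each rule.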

An appropriate end tag token is an end tag token whose tag name matches the tag name of the last start tag to have been emitted from this tokenizer, if any. If no start tag has been emitted from this tokenizer, then no end tag token is appropriate.
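A tokenizer can implement this definition by remembering the name of the last start tag it emitted. The class and method names in the following sketch are assumptions for illustration.

```python
class LastStartTagTracker:
    """Remembers the most recently emitted start tag name."""

    def __init__(self):
        self.last_start_tag_name = None  # no start tag emitted yet

    def note_start_tag(self, tag_name):
        self.last_start_tag_name = tag_name

    def is_appropriate_end_tag(self, tag_name):
        # With no start tag emitted so far, no end tag is appropriate.
        if self.last_start_tag_name is None:
            return False
        return tag_name == self.last_start_tag_name
```

Only the most recent start tag counts: after `note_start_tag("title")` followed by `note_start_tag("script")`, an end tag named `title` is no longer appropriate.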

Before each step of the tokenizer, the user agent must first check the parser pause flag. If it is true, then the tokenizer must abort the processing of any nested invocations of the tokenizer, yielding control back to the caller.
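The pause check can be sketched as a guard at the top of the tokenizer loop. Everything here is hypothetical scaffolding: the class name, the `run` and `step` methods, and in particular the use of `!` as the character that sets the pause flag, which merely stands in for whatever real event pauses the parser. Nested invocations are not modeled.

```python
class PausableTokenizer:
    """Toy tokenizer that checks a pause flag before every step."""

    def __init__(self, text):
        self.text = text
        self.pos = 0
        self.paused = False  # stands in for the parser pause flag
        self.consumed = []

    def step(self):
        ch = self.text[self.pos]
        self.pos += 1
        self.consumed.append(ch)
        if ch == "!":  # hypothetical trigger, for illustration only
            self.paused = True

    def run(self):
        """Return True when input is exhausted, False when paused."""
        while self.pos < len(self.text):
            if self.paused:
                return False  # abort and yield control back to the caller
            self.step()
        return True
```

The caller can clear `paused` and call `run()` again; tokenization then resumes from the saved position.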

The tokenizer state machine consists of the states defined in the following subsections.

8.2.4 Tokenization

8.2.4.1 Data state

8.2.4.2 Character reference in data state

8.2.4.3 RCDATA state

8.2.4.4 Character reference in RCDATA state

8.2.4.5 RAWTEXT state

8.2.4.6 Script data state

8.2.4.7 PLAINTEXT state

8.2.4.8 Tag open state

8.2.4.9 End tag open state

8.2.4.10 Tag name state

8.2.4.11 RCDATA less-than sign state

8.2.4.12 RCDATA end tag open state

8.2.4.13 RCDATA end tag name state

8.2.4.14 RAWTEXT less-than sign state

8.2.4.15 RAWTEXT end tag open state

8.2.4.16 RAWTEXT end tag name state

8.2.4.17 Script data less-than sign state

8.2.4.18 Script data end tag open state

8.2.4.19 Script data end tag name state

8.2.4.20 Script data escape start state

8.2.4.21 Script data escape start dash state

8.2.4.22 Script data escaped state

8.2.4.23 Script data escaped dash state

8.2.4.24 Script data escaped dash dash state

8.2.4.25 Script data escaped less-than sign state

8.2.4.26 Script data escaped end tag open state

8.2.4.27 Script data escaped end tag name state

8.2.4.28 Script data double escape start state

8.2.4.29 Script data double escaped state

8.2.4.30 Script data double escaped dash state

8.2.4.31 Script data double escaped dash dash state

8.2.4.32 Script data double escaped less-than sign state

8.2.4.33 Script data double escape end state

8.2.4.34 Before attribute name state

8.2.4.35 Attribute name state

8.2.4.36 After attribute name state

8.2.4.37 Before attribute value state

8.2.4.38 Attribute value (double-quoted) state

8.2.4.39 Attribute value (single-quoted) state

8.2.4.40 Attribute value (unquoted) state

8.2.4.41 Character reference in attribute value state

8.2.4.42 After attribute value (quoted) state

8.2.4.43 Self-closing start tag state

8.2.4.44 Bogus comment state