
5. HTML Document Representation

In this chapter, we discuss how HTML documents are represented on a computer and over the Internet.

The section on the document character set addresses the issue of what abstract characters may be part of an HTML document. Characters include the Latin letter "A", the Cyrillic letter "I", the Chinese character meaning "water", etc.

The section on character encodings addresses the issue of how those characters may be represented in a file or when transferred over the Internet. As some character encodings cannot directly represent all characters an author may want to include in a document, HTML offers other mechanisms, called character references, for referring to any character.

Since there are a great number of characters throughout human languages, and a great variety of ways to represent those characters, proper care must be taken so that documents may be understood by user agents around the world.

5.1 The Document Character Set

To promote interoperability, SGML requires that each application (including HTML) specify its document character set. A document character set consists of:

  • A Repertoire: A set of abstract characters, such as the Latin letter "A", the Cyrillic letter "I", the Chinese character meaning "water", etc.
  • Code positions: A set of integer references to characters in the repertoire.

Each SGML document (including each HTML document) is a sequence of characters from the repertoire. Computer systems identify each character by its code position; for example, in the ASCII character set, code positions 65, 66, and 67 refer to the characters 'A', 'B', and 'C', respectively.
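
The relationship between characters and code positions can be illustrated with a minimal Python sketch. It assumes Python's built-in ord() and chr(), which operate on Unicode code points; the variable names are illustrative only and are not part of this specification.

```python
# Code positions are plain integers; chr() and ord() convert between
# a code position and the character it identifies.
ascii_positions = [65, 66, 67]
letters = [chr(position) for position in ascii_positions]

# Characters outside ASCII are identified the same way.
cyrillic_i = ord("\u0418")  # CYRILLIC CAPITAL LETTER I
water = ord("\u6c34")       # Chinese character meaning "water"

print(letters)             # ['A', 'B', 'C']
print(cyrillic_i)          # 1048
print(hex(water))          # 0x6c34
```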

The ASCII character set is not sufficient for a global information system such as the Web, so HTML uses the much more complete character set called the Universal Character Set (UCS), defined in [ISO10646]. This standard defines a repertoire of thousands of characters used by communities all over the world.

The character set defined in [ISO10646] is character-by-character equivalent to Unicode ([UNICODE]). Both of these standards are updated from time to time with new characters, and the amendments should be consulted at the respective Web sites. In the current specification, "[ISO10646]" is used to refer to the document character set while "[UNICODE]" is reserved for references to the Unicode bidirectional text algorithm.

The document character set, however, does not suffice to allow user agents to correctly interpret HTML documents as they are typically exchanged -- encoded as a sequence of bytes in a file or during a network transmission. User agents must also know the specific character encoding that was used to transform the document character stream into a byte stream.

5.2 Character encodings

What this specification calls a character encoding is known by different names in other specifications (which may cause some confusion). However, the concept is largely the same across the Internet. Also, protocol headers, attributes, and parameters referring to character encodings share the same name -- "charset" -- and use the same values from the [IANA] registry (see [CHARSETS] for a complete list).

The "charset" parameter identifies a character encoding, which is a method of converting a sequence of bytes into a sequence of characters. This conversion fits naturally with the scheme of Web activity: servers send HTML documents to user agents as a stream of bytes; user agents interpret them as a sequence of characters. The conversion method can range from simple one-to-one correspondence to complex switching schemes or algorithms.
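
As an informal illustration of why the byte stream alone is not enough, the following Python sketch encodes one character in two encodings and then decodes the bytes with a correct and an incorrect label. It assumes Python's built-in codecs; the variable names are illustrative.

```python
text = "\u00e5"  # LATIN SMALL LETTER A WITH RING ABOVE

# The same character becomes different byte sequences in different encodings.
latin1_bytes = text.encode("iso-8859-1")  # one byte:  b'\xe5'
utf8_bytes = text.encode("utf-8")         # two bytes: b'\xc3\xa5'

# Decoding with the correct label recovers the character.
correct = utf8_bytes.decode("utf-8")

# Decoding with the wrong label silently yields different characters.
mislabeled = utf8_bytes.decode("iso-8859-1")

print(latin1_bytes, utf8_bytes)
print(correct == text)     # True
print(mislabeled == text)  # False
```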

A simple one-byte-per-character encoding technique is not sufficient for text strings over a character repertoire as large as [ISO10646]. There are several different encodings of parts of [ISO10646] in addition to encodings of the entire character set (such as UCS-4).
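
The following Python sketch compares how many bytes three sample characters occupy in UTF-8 and in a fixed four-byte form. It assumes that Python's "utf-32-be" codec is an acceptable stand-in for UCS-4 for characters in this range; that substitution is an assumption of the example, not a statement of this specification.

```python
samples = {
    "Latin A": "A",
    "Cyrillic I": "\u0418",
    "water": "\u6c34",
}

# UTF-8 uses a different number of bytes for different characters.
utf8_lengths = {name: len(ch.encode("utf-8")) for name, ch in samples.items()}

# A four-byte form (approximated here by UTF-32, big-endian, no BOM)
# uses the same number of bytes for every character.
four_byte_lengths = {
    name: len(ch.encode("utf-32-be")) for name, ch in samples.items()
}

print(utf8_lengths)       # {'Latin A': 1, 'Cyrillic I': 2, 'water': 3}
print(four_byte_lengths)  # {'Latin A': 4, 'Cyrillic I': 4, 'water': 4}
```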

5.2.1 Choosing an encoding

Authoring tools (e.g., text editors) may encode HTML documents in the character encoding of their choice, and the choice largely depends on the conventions used by the system software. These tools may employ any convenient encoding that covers most of the characters contained in the document, provided the encoding is correctly labeled. Occasional characters that fall outside this encoding may still be represented by character references. These always refer to the document character set, not the character encoding.

Servers and proxies may change a character encoding (called transcoding) on the fly to meet the requests of user agents (see section 14.2 of [RFC2616], the "Accept-Charset" HTTP request header). Servers and proxies do not have to serve a document in a character encoding that covers the entire document character set.
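
Transcoding can be sketched as a decode followed by an encode. The function below is a hypothetical helper, not part of any server; it assumes the source encoding is known and, as a simplification, replaces characters that the target encoding cannot represent with numeric character references. A real implementation would also have to update the "charset" label and treat markup and script content with more care.

```python
def transcode(data: bytes, source: str, target: str) -> bytes:
    """Re-encode a byte stream from one character encoding to another."""
    characters = data.decode(source)
    # Characters missing from the target encoding become &#D; references.
    return characters.encode(target, errors="xmlcharrefreplace")


euc_jp_water = "\u6c34".encode("euc-jp")
as_utf8 = transcode(euc_jp_water, "euc-jp", "utf-8")

cyrillic = "\u0418".encode("iso-8859-5")
as_latin1 = transcode(cyrillic, "iso-8859-5", "iso-8859-1")

print(as_utf8)    # b'\xe6\xb0\xb4'
print(as_latin1)  # b'&#1048;'
```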

Commonly used character encodings on the Web include ISO-8859-1 (also referred to as "Latin-1"; usable for most Western European languages), ISO-8859-5 (which supports Cyrillic), SHIFT_JIS (a Japanese encoding), EUC-JP (another Japanese encoding), and UTF-8 (an encoding of ISO 10646 using a different number of bytes for different characters). Names for character encodings are case-insensitive, so that for example "SHIFT_JIS", "Shift_JIS", and "shift_jis" are equivalent.

This specification does not mandate which character encodings a user agent must support.

Conforming user agents must correctly map to ISO 10646 all characters in any character encodings that they recognize (or they must behave as if they did).
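
Case-insensitive matching of encoding names can be observed with Python's codecs.lookup(), used here purely as an illustration; the canonical name a given library reports is an implementation detail and is not defined by this specification.

```python
import codecs

labels = ["SHIFT_JIS", "Shift_JIS", "shift_jis"]

# Each spelling resolves to the same codec, so the set of canonical
# names contains a single entry.
canonical_names = {codecs.lookup(label).name for label in labels}

print(len(canonical_names))  # 1
```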

Notes on specific encodings 

When HTML text is transmitted in UTF-16 (charset=UTF-16), text data should be transmitted in network byte order ("big-endian", high-order byte first) in accordance with [ISO10646], Section 6.3 and [UNICODE], clause C3, page 3-1.

Furthermore, to maximize chances of proper interpretation, it is recommended that documents transmitted as UTF-16 always begin with a ZERO-WIDTH NON-BREAKING SPACE character (hexadecimal FEFF, also called Byte Order Mark (BOM)) which, when byte-reversed, becomes hexadecimal FFFE, a character guaranteed never to be assigned. Thus, a user-agent receiving a hexadecimal FFFE as the first bytes of a text would know that bytes have to be reversed for the remainder of the text.
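
The byte order check described above might be sketched as follows. The function names are hypothetical, and the choice to fall back to big-endian when no BOM is present is an assumption based on the network byte order recommendation, not a rule stated by this specification.

```python
def detect_utf16_byte_order(data: bytes) -> str:
    """Inspect the first two bytes for a Byte Order Mark."""
    if data[:2] == b"\xfe\xff":
        return "big-endian"
    if data[:2] == b"\xff\xfe":
        # FEFF arrived byte-reversed, so the rest must be reversed too.
        return "little-endian"
    return "unknown"


def decode_utf16(data: bytes) -> str:
    """Decode UTF-16 text, stripping the BOM when one is present."""
    order = detect_utf16_byte_order(data)
    if order == "little-endian":
        return data[2:].decode("utf-16-le")
    if order == "big-endian":
        return data[2:].decode("utf-16-be")
    # Assumption: without a BOM, treat the data as network byte order.
    return data.decode("utf-16-be")


print(decode_utf16(b"\xfe\xff\x00A"))  # A
print(decode_utf16(b"\xff\xfeA\x00"))  # A
```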

The UTF-1 transformation format of [ISO10646] (registered by IANA as ISO-10646-UTF-1) should not be used. For information about ISO 8859-8 and the bidirectional algorithm, please consult the section on bidirectionality and character encoding.

5.2.2 Specifying the character encoding

How does a server determine which character encoding applies for a document it serves? Some servers examine the first few bytes of the document, or check against a database of known files and encodings. Many modern servers give Webmasters more control over charset configuration than old servers do. Webmasters should use these mechanisms to send out a "charset" parameter whenever possible, but should take care not to identify a document with the wrong "charset" parameter value.

How does a user agent know which character encoding has been used? The server should provide this information. The most straightforward way for a server to inform the user agent about the character encoding of the document is to use the "charset" parameter of the "Content-Type" header field of the HTTP protocol ([RFC2616], sections 3.4 and 14.17). For example, the following HTTP header announces that the character encoding is EUC-JP:

Content-Type: text/html; charset=EUC-JP

Please consult the section on conformance for the definition of text/html.
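
A receiver could extract the "charset" parameter from a Content-Type value along the lines of the sketch below. It uses the header parsing of email.message.Message from the Python standard library; the function name is hypothetical. Note that this library reports the charset in lowercase, which is harmless since encoding names are case-insensitive.

```python
from email.message import Message


def charset_from_content_type(header_value: str):
    """Return the charset parameter of a Content-Type value, or None."""
    message = Message()
    message["Content-Type"] = header_value
    return message.get_content_charset()


print(charset_from_content_type("text/html; charset=EUC-JP"))  # euc-jp
print(charset_from_content_type("text/html"))                  # None
```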

The HTTP protocol ([RFC2616], section 3.7.1) mentions ISO-8859-1 as a default character encoding when the "charset" parameter is absent from the "Content-Type" header field. In practice, this recommendation has proved useless because some servers don't allow a "charset" parameter to be sent, and others may not be configured to send the parameter. Therefore, user agents must not assume any default value for the "charset" parameter.

To address server or configuration limitations, HTML documents may include explicit information about the document's character encoding; the META element can be used to provide user agents with this information.

For example, to specify that the character encoding of the current document is "EUC-JP", a document should include the following META declaration:

<META http-equiv="Content-Type" content="text/html; charset=EUC-JP">

The META declaration must only be used when the character encoding is organized such that ASCII-valued bytes stand for ASCII characters (at least until the META element is parsed). META declarations should appear as early as possible in the HEAD element.
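
The following Python sketch shows one way a user agent might look for such a META declaration before the encoding is known. It reads the start of the document as ASCII, which is only sound under the ASCII-compatibility condition stated above. The class and function names are hypothetical, and html.parser.HTMLParser is used only as a convenient tokenizer for the example.

```python
from email.message import Message
from html.parser import HTMLParser


class MetaCharsetFinder(HTMLParser):
    """Record the charset of the first META http-equiv="Content-Type"."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag and attribute names in lowercase.
        if tag != "meta" or self.charset is not None:
            return
        attributes = dict(attrs)
        http_equiv = (attributes.get("http-equiv") or "").lower()
        content = attributes.get("content")
        if http_equiv == "content-type" and content:
            message = Message()
            message["Content-Type"] = content
            self.charset = message.get_content_charset()


def meta_charset(document_start: bytes):
    """Scan the leading bytes of a document for a META charset."""
    finder = MetaCharsetFinder()
    # Non-ASCII bytes are replaced; only ASCII markup matters here.
    finder.feed(document_start.decode("ascii", errors="replace"))
    return finder.charset


sample = (
    b'<HTML><HEAD><META http-equiv="Content-Type" '
    b'content="text/html; charset=EUC-JP"></HEAD>'
)
print(meta_charset(sample))  # euc-jp
```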

For cases where neither the HTTP protocol nor the META element provides information about the character encoding of a document, HTML also provides the charset attribute on several elements. By combining these mechanisms, an author can greatly improve the chances that, when the user retrieves a resource, the user agent will recognize the character encoding.

To sum up, conforming user agents must observe the following priorities when determining a document's character encoding (from highest priority to lowest):

  1. An HTTP "charset" parameter in a "Content-Type" field.
  2. A META declaration with "http-equiv" set to "Content-Type" and a value set for "charset".
  3. The charset attribute set on an element that designates an external resource.

In addition to this list of priorities, the user agent may use heuristics and user settings. For example, many user agents use a heuristic to distinguish the various encodings used for Japanese text. Also, user agents typically have a user-definable, local default character encoding which they apply in the absence of other indicators.
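
The priority order above, together with a user-defined local default, might be combined as in the following sketch. The function and parameter names are hypothetical, and heuristics are left out entirely; each argument is assumed to be either an encoding name or None when that source gave no information.

```python
def choose_encoding(http_charset=None, meta_charset=None,
                    link_charset=None, local_default=None):
    """Pick an encoding using the priorities listed above."""
    # Highest priority first: HTTP header, META declaration, then the
    # charset attribute on the element that referred to the resource.
    for candidate in (http_charset, meta_charset, link_charset):
        if candidate:
            return candidate
    # No indicator was found: fall back to the user's local default.
    return local_default


print(choose_encoding(http_charset="EUC-JP", meta_charset="SHIFT_JIS"))
print(choose_encoding(meta_charset="SHIFT_JIS", local_default="ISO-8859-1"))
print(choose_encoding(local_default="ISO-8859-1"))
```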

User agents may provide a mechanism that allows users to override incorrect "charset" information. However, if a user agent offers such a mechanism, it should only offer it for browsing and not for editing, to avoid the creation of Web pages marked with an incorrect "charset" parameter.

Note. If, for a specific application, it becomes necessary to refer to characters outside [ISO10646], characters should be assigned to a private zone to avoid conflicts with present or future versions of the standard. This is highly discouraged, however, for reasons of portability.

5.3 Character references

A given character encoding may not be able to express all characters of the document character set. For such encodings, or when hardware or software configurations do not allow users to input some document characters directly, authors may use SGML character references. Character references are a character encoding-independent mechanism for entering any character from the document character set.

Character references in HTML may appear in two forms:

  • Numeric character references (either decimal or hexadecimal).
  • Character entity references.

Character references within comments have no special meaning; they are comment data only.

Note. HTML provides other ways to present character data, in particular inline images.

Note. In SGML, it is possible to eliminate the final ";" after a character reference in some cases (e.g., at a line break or immediately before a tag). In other circumstances it may not be eliminated (e.g., in the middle of a word). We strongly suggest using the ";" in all cases to avoid problems with user agents that require this character to be present.

5.3.1 Numeric character references

Numeric character references specify the code position of a character in the document character set. Numeric character references may take two forms:

  • The syntax "&#D;", where D is a decimal number, refers to the ISO 10646 decimal character number D.
  • The syntax "&#xH;" or "&#XH;", where H is a hexadecimal number, refers to the ISO 10646 hexadecimal character number H. Hexadecimal numbers in numeric character references are case-insensitive.

Here are some examples of numeric character references:

  • &#229; (in decimal) represents the letter "a" with a small circle above it (used, for example, in Norwegian).
  • &#xE5; (in hexadecimal) represents the same character.
  • &#Xe5; (in hexadecimal) represents the same character as well.
  • &#1048; (in decimal) represents the Cyrillic capital letter "I".
  • &#x6C34; (in hexadecimal) represents the Chinese character for water.
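
The examples above can be checked with html.unescape() from the Python standard library, used here only as an illustrative decoder of character references; its behavior follows a later HTML revision, but it agrees with this specification for these well-formed references.

```python
import html

references = ["&#229;", "&#xE5;", "&#Xe5;", "&#1048;", "&#x6C34;"]

# Resolve each numeric character reference to the character it names.
decoded = [html.unescape(reference) for reference in references]
code_positions = [ord(character) for character in decoded]

print(code_positions)  # [229, 229, 229, 1048, 27700]
```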

Note. Although the hexadecimal representation is not defined in [ISO8879], it is expected to be in the revision, as described in [WEBSGML]. This convention is particularly useful since character standards generally use hexadecimal representations.

5.3.2 Character entity references

In order to give authors a more intuitive way of referring to characters in the document character set, HTML offers a set of character entity references. Character entity references use symbolic names so that authors need not remember code positions. For example, the character entity reference &aring; refers to the lowercase "a" character topped with a ring; "&aring;" is easier to remember than &#229;.

HTML 4 does not define a character entity reference for every character in the document character set. For instance, there is no character entity reference for the Cyrillic capital letter "I". Please consult the full list of character references defined in HTML 4.

Character entity references are case-sensitive. Thus, &Aring; refers to a different character (uppercase A, ring) than &aring; (lowercase a, ring).
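
These properties can be observed in the html.entities.name2codepoint table of the Python standard library, which maps HTML 4 entity names to code positions. The table is used here as a convenient stand-in for the full list of character references; the variable names are illustrative.

```python
from html.entities import name2codepoint

# Entity names are case-sensitive: the two names map to different
# code positions.
lowercase_ring = name2codepoint["aring"]  # 229, lowercase a with ring
uppercase_ring = name2codepoint["Aring"]  # 197, uppercase A with ring

# Not every character has an entity name: Cyrillic capital "I" (1048)
# does not appear among the defined code positions.
cyrillic_i_has_name = 1048 in name2codepoint.values()

print(lowercase_ring, uppercase_ring, cyrillic_i_has_name)
```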

Four character entity references deserve special mention since they arefrequently used to escape special characters:

  • "&lt;" represents the < sign.
  • "&gt;" represents the > sign.
  • "&amp;" represents the & sign.
  • "&quot;" represents the " mark.

Authors wishing to put the "<" character in text should use "&lt;" (ASCII decimal 60) to avoid possible confusion with the beginning of a tag (start tag open delimiter). Similarly, authors should use "&gt;" (ASCII decimal 62) in text instead of ">" to avoid problems with older user agents that incorrectly perceive this as the end of a tag (tag close delimiter) when it appears in quoted attribute values.

Authors should use "&amp;" (ASCII decimal 38) instead of "&" to avoid confusion with the beginning of a character reference (entity reference open delimiter). Authors should also use "&amp;" in attribute values since character references are allowed within CDATA attribute values.

Some authors use the character entity reference "&quot;" to encode instances of the double quote mark (") since that character may be used to delimit attribute values.
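
Escaping of these four characters can be sketched with html.escape() from the Python standard library. With its default quote=True setting it also escapes the single quote, which goes beyond the four references discussed here; the sample string is illustrative.

```python
import html

raw = '<a href="x?y=1&z=2">'

# Replace <, >, & and " with the corresponding entity references.
escaped = html.escape(raw)

# Resolving the references again recovers the original text.
round_trip = html.unescape(escaped)

print(escaped)  # &lt;a href=&quot;x?y=1&amp;z=2&quot;&gt;
print(round_trip == raw)  # True
```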

5.4 Undisplayable characters

A user agent may not be able to render all characters in a document meaningfully, for instance, because the user agent lacks a suitable font, a character has a value that may not be expressed in the user agent's internal character encoding, etc.

Because there are many different things that may be done in such cases, this document does not prescribe any specific behavior. Depending on the implementation, undisplayable characters may also be handled by the underlying display system and not the application itself. In the absence of more sophisticated behavior, for example tailored to the needs of a particular script or language, we recommend the following behavior for user agents:

  1. Adopt a clearly visible, but unobtrusive mechanism to alert the user of missing resources.
  2. If missing characters are presented using their numeric representation, use the hexadecimal (not decimal) form since this is the form used in character set standards.
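
A possible reading of the second recommendation is sketched below. The function name, the bracketed "U+" placeholder style, and the can_display callback are all assumptions made for this example; this specification does not prescribe any particular presentation.

```python
def render_with_fallback(text: str, can_display) -> str:
    """Replace undisplayable characters with a hexadecimal placeholder."""
    pieces = []
    for character in text:
        if can_display(character):
            pieces.append(character)
        else:
            # Hexadecimal, not decimal, matching character set standards.
            pieces.append(f"[U+{ord(character):04X}]")
    return "".join(pieces)


# Hypothetical display system that can only render ASCII.
ascii_only = lambda character: ord(character) < 128

print(render_with_fallback("A\u6c34", ascii_only))  # A[U+6C34]
```
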
Copyright © 1997-1999 W3C® (MIT, INRIA, Keio), All Rights Reserved.