8. Language information and text direction

This section of the document discusses two important issues that affect the internationalization of HTML: specifying the language (the lang attribute) and direction (the dir attribute) of text in a document.

8.1 Specifying the language of content: the lang attribute

Attribute definitions
lang = language-code [CI]
This attribute specifies the base language of an element's attribute values and text content. The default value of this attribute is unknown.

Language information specified via the lang attribute may be used by a user agent to control rendering in a variety of ways. Some situations where author-supplied language information may be helpful include:

  • Assisting search engines
  • Assisting speech synthesizers
  • Helping a user agent select glyph variants for high quality typography
  • Helping a user agent choose a set of quotation marks
  • Helping a user agent make decisions about hyphenation, ligatures, and spacing
  • Assisting spell checkers and grammar checkers

The lang attribute specifies the language of element content and attribute values; whether it is relevant for a given attribute depends on the syntax and semantics of the attribute and the operation involved.

The intent of the lang attribute is to allow user agents to render content more meaningfully based on accepted cultural practice for a given language. This does not imply that user agents should render characters that are atypical for a particular language in less meaningful ways; user agents must make a best attempt to render all characters, regardless of the value specified by lang.

For instance, if characters from the Greek alphabet appear in the midst of English text:

<P><Q lang="en">Her super-powers were the result of &gamma;-radiation,</Q> he explained.</P>

a user agent (1) should try to render the English content in an appropriate manner (e.g., in its handling of the quotation marks) and (2) must make a best attempt to render γ even though it is not an English character.

Please consult the section on undisplayable characters for related information.

8.1.1 Language codes

The lang attribute's value is a language code that identifies a natural language spoken, written, or otherwise used for the communication of information among people. Computer languages are explicitly excluded from language codes.

[RFC1766] defines and explains the language codes that must be used in HTML documents.

Briefly, language codes consist of a primary code and a possibly empty series of subcodes:

 language-code = primary-code ( "-" subcode )*

Here are some sample language codes:

  • "en": English
  • "en-US": the U.S. version of English.
  • "en-cockney": the Cockney version of English.
  • "i-navajo": the Navajo language spoken by some Native Americans.
  • "x-klingon": The primary tag "x" indicates an experimental language tag

Two-letter primary codes are reserved for [ISO639] language abbreviations. Two-letter codes include fr (French), de (German), it (Italian), nl (Dutch), el (Greek), es (Spanish), pt (Portuguese), ar (Arabic), he (Hebrew), ru (Russian), zh (Chinese), ja (Japanese), hi (Hindi), ur (Urdu), and sa (Sanskrit).

Any two-letter subcode is understood to be an [ISO3166] country code.
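
As an illustration only, the grammar above can be checked mechanically. The following Python sketch is not part of the specification: the function names are hypothetical, and the limit of one to eight ASCII letters per code is taken from [RFC1766] rather than from the text above. Codes are lowercased because the lang attribute is case-insensitive.

```python
import re

# language-code = primary-code ( "-" subcode )*
# Assumption: each code is 1 to 8 ASCII letters, as in [RFC1766].
_LANGUAGE_CODE = re.compile(r"[A-Za-z]{1,8}(-[A-Za-z]{1,8})*")


def parse_language_code(code):
    """Split a language code into (primary_code, [subcodes]), lowercased."""
    if not _LANGUAGE_CODE.fullmatch(code):
        raise ValueError(f"not a valid language code: {code!r}")
    primary, *subcodes = code.lower().split("-")
    return primary, subcodes


def country_subcodes(code):
    """Return the two-letter subcodes, which are understood as country codes."""
    _, subcodes = parse_language_code(code)
    return [subcode for subcode in subcodes if len(subcode) == 2]
```

For example, parse_language_code("en-US") separates the primary code "en" from the subcode "us", and country_subcodes("en-cockney") returns an empty list because "cockney" is not a two-letter subcode.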

8.1.2 Inheritance of language codes

An element inherits language code information according to the following order of precedence (highest to lowest):

  • The lang attribute set for the element itself.
  • The closest parent element that has the lang attribute set (i.e., the lang attribute is inherited).
  • The HTTP "Content-Language" header (which may be configured in a server). For example:
    Content-Language: en-cockney
  • User agent default values and user preferences.

In this example, the primary language of the document is French ("fr"). One paragraph is declared to be in Spanish ("es"), after which the primary language returns to French. The following paragraph includes an embedded Japanese ("ja") phrase, after which the primary language returns to French.

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
   "http://www.w3.org/TR/html4/strict.dtd">
<HTML lang="fr">
<HEAD>
<TITLE>Un document multilingue</TITLE>
</HEAD>
<BODY>
...Interpreted as French...
<P lang="es">...Interpreted as Spanish...
<P>...Interpreted as French again...
<P>...French text interrupted by
   <EM lang="ja">some Japanese</EM>
   French begins here again...
</BODY>
</HTML>

Note. Table cells may inherit lang values not from their parent but from the first cell in a span. Please consult the section on alignment inheritance for details.
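
The order of precedence listed in this section can be sketched in Python. This is a simplified model, not a user agent implementation: the function name and its parameters are hypothetical, the element chain is supplied by hand, and the table-cell exception mentioned in the note is ignored.

```python
def effective_lang(lang_chain, content_language=None, user_agent_default=None):
    """Resolve the language code that applies to an element.

    lang_chain lists the lang values from the element itself outward to
    the root element, with None wherever the attribute is not set.
    """
    # 1 and 2: the element's own lang, then the closest parent that sets it.
    for value in lang_chain:
        if value:
            return value
    # 3: the HTTP "Content-Language" header; 4: user agent defaults.
    return content_language or user_agent_default
```

Applied to the French document above, effective_lang(["ja", None, None, "fr"]) models the EM element and yields "ja", while effective_lang([None, None, "fr"]) models a paragraph without its own lang attribute and yields "fr".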

8.1.3 Interpretation of language codes

In the context of HTML, a language code should be interpreted by user agents as a hierarchy of tokens rather than a single token. When a user agent adjusts rendering according to language information (say, by comparing style sheet language codes and lang values), it should always favor an exact match, but should also consider matching primary codes to be sufficient. Thus, if the lang attribute value of "en-US" is set for the HTML element, a user agent should prefer style information that matches "en-US" first, then the more general value "en".

Note. Language code hierarchies do not guarantee that all languages with a common prefix will be understood by those fluent in one or more of those languages. They do allow a user to request this commonality when it is true for that user.
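
One possible reading of this matching rule, sketched in Python: try the exact code first, then drop trailing subcodes until only the primary code remains. The function name best_match and the idea of a flat list of available codes are assumptions made for illustration; a real user agent would apply the rule inside its style system.

```python
def best_match(lang, available):
    """Pick the most specific available code that matches lang, or None."""
    by_lowercase = {code.lower(): code for code in available}
    tokens = lang.lower().split("-")
    while tokens:
        candidate = "-".join(tokens)
        if candidate in by_lowercase:
            return by_lowercase[candidate]
        tokens.pop()  # fall back to the more general code
    return None
```

With style information available for both "en" and "en-US", a lang value of "en-US" selects "en-US"; if only "en" is available, the same value falls back to "en".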

8.2 Specifying the direction of text and tables: the dir attribute

Attribute definitions

dir = LTR | RTL [CI]
This attribute specifies the base direction of directionally neutral text (i.e., text that doesn't have inherent directionality as defined in [UNICODE]) in an element's content and attribute values. It also specifies the directionality of tables. Possible values:
  • LTR: Left-to-right text or table.
  • RTL: Right-to-left text or table.

In addition to specifying the language of a document with the lang attribute, authors may need to specify the base directionality (left-to-right or right-to-left) of portions of a document's text, of table structure, etc. This is done with the dir attribute.

The [UNICODE] specification assigns directionality to characters and defines a (complex) algorithm for determining the proper directionality of text. If a document does not contain a displayable right-to-left character, a conforming user agent is not required to apply the [UNICODE] bidirectional algorithm. If a document contains right-to-left characters, and if the user agent displays these characters, the user agent must use the bidirectional algorithm.

Although Unicode specifies special characters that deal with text direction, HTML offers higher-level markup constructs that do the same thing: the dir attribute (do not confuse with the DIR element) and the BDO element. Thus, to express a Hebrew quotation, it is more intuitive to write
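
To make the condition above concrete, the following Python sketch tests whether a string contains any character with strong right-to-left directionality. It relies on the standard unicodedata module, which reports each character's Unicode bidirectional class; treating the classes "R" and "AL" as the right-to-left ones is the assumption here, and explicit formatting characters are not considered. The function name contains_rtl is hypothetical.

```python
import unicodedata

# Bidirectional classes of strongly right-to-left characters:
# "R" (e.g. Hebrew letters) and "AL" (Arabic letters).
RTL_CLASSES = {"R", "AL"}


def contains_rtl(text):
    """Return True if any character in text is strongly right-to-left."""
    return any(unicodedata.bidirectional(ch) in RTL_CLASSES for ch in text)
```

A document for which contains_rtl is false everywhere falls under the first case described above, where applying the bidirectional algorithm is not required.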

<Q lang="he" dir="rtl">...a Hebrew quotation...</Q>

than the equivalent with Unicode references:

&#x202B;&#x05F4;...a Hebrew quotation...&#x05F4;&#x202C;

User agents must not use the lang attribute to determine text directionality.

The dir attribute is inherited and may be overridden. Please consult the section on the inheritance of text direction information for details.

8.2.1 Introduction to the bidirectional algorithm

The following example illustrates the expected behavior of the bidirectional algorithm. It involves English, a left-to-right script, and Hebrew, a right-to-left script.

Consider the following example text:

  english1 HEBREW2 english3 HEBREW4 english5 HEBREW6

The characters in this example (and in all related examples) are stored in the computer the way they are displayed here: the first character in the file is "e", the second is "n", and the last is "6".

Suppose the predominant language of the document containing this paragraph is English. This means that the base direction is left-to-right. The correct presentation of this line would be:

english1 2WERBEH english3 4WERBEH english5 6WERBEH
         <------          <------          <------
            H                H                H
------------------------------------------------->
                        E

The dotted lines indicate the structure of the sentence: English predominates and some Hebrew text is embedded. Achieving the correct presentation requires no additional markup since the Hebrew fragments are reversed correctly by user agents applying the bidirectional algorithm.

If, on the other hand, the predominant language of the document is Hebrew, the base direction is right-to-left. The correct presentation is therefore:

6WERBEH english5 4WERBEH english3 2WERBEH english1
        ------->         ------->         ------->
           E                E                E
<-------------------------------------------------
                        H

In this case, the whole sentence has been presented as right-to-left and the embedded English sequences have been properly reversed by the bidirectional algorithm.
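
The two presentations above can be reproduced with a toy model in Python. This is not the [UNICODE] bidirectional algorithm: it borrows this section's convention that uppercase words stand for Hebrew, classifies whole words rather than characters, and implements only the final reordering step, in which runs are reversed from the highest embedding level down to level 1. The name visual_order and the level numbers are assumptions made for the sketch.

```python
SAMPLE = "english1 HEBREW2 english3 HEBREW4 english5 HEBREW6"


def visual_order(text, base="ltr"):
    """Toy reordering: uppercase words stand in for right-to-left text."""
    base_level = 0 if base.lower() == "ltr" else 1

    def word_level(word):
        if word[:1].isupper():
            return 1  # right-to-left run: odd level
        return 0 if base_level == 0 else 2  # left-to-right run: even level

    chars, levels = [], []
    words = text.split(" ")
    for index, word in enumerate(words):
        level = word_level(word)
        if index > 0:
            # A space joins two runs of the same level; otherwise it
            # takes the base level of the paragraph.
            previous = word_level(words[index - 1])
            chars.append(" ")
            levels.append(level if level == previous else base_level)
        chars.extend(word)
        levels.extend([level] * len(word))

    # Reverse every maximal run at or above each level, highest first.
    for level in range(max(levels, default=0), 0, -1):
        start = 0
        while start < len(chars):
            if levels[start] < level:
                start += 1
                continue
            end = start
            while end < len(chars) and levels[end] >= level:
                end += 1
            chars[start:end] = chars[start:end][::-1]
            start = end
    return "".join(chars)
```

Calling visual_order(SAMPLE, "ltr") gives the first presentation and visual_order(SAMPLE, "rtl") gives the second, with no markup involved in either case.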

8.2.2 Inheritance of text direction information

The Unicode bidirectional algorithm requires a base text direction for text blocks. To specify the base direction of a block-level element, set the element's dir attribute. The default value of the dir attribute is "ltr" (left-to-right text).

When the dir attribute is set for a block-level element, it remains in effect for the duration of the element and any nested block-level elements. Setting the dir attribute on a nested element overrides the inherited value.

To set the base text direction for an entire document, set the dir attribute on the HTML element.

For example:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
   "http://www.w3.org/TR/html4/strict.dtd">
<HTML dir="RTL">
<HEAD>
<TITLE>...a right-to-left title...</TITLE>
</HEAD>
...right-to-left text...
<P dir="ltr">...left-to-right text...</P>
<P>...right-to-left text again...</P>
</HTML>

Inline elements, on the other hand, do not inherit the dir attribute. This means that an inline element without a dir attribute does not open an additional level of embedding with respect to the bidirectional algorithm. (Here, an element is considered to be block-level or inline based on its default presentation. Note that the INS and DEL elements can be block-level or inline depending on their context.)
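
The inheritance rule for block-level elements can be summarized in a few lines of Python. This is a sketch under simplifying assumptions: the function name base_direction is hypothetical, only block-level elements are modeled, and the chain of dir values is supplied by hand.

```python
def base_direction(dir_chain):
    """Resolve the base direction of a block-level element.

    dir_chain lists the dir values from the element itself outward
    through its block-level ancestors to the HTML element, with None
    wherever the attribute is not set.
    """
    for value in dir_chain:
        if value:
            return value.lower()  # dir values are case-insensitive
    return "ltr"  # default value of the dir attribute
```

For the example document, base_direction(["ltr", "RTL"]) models the first paragraph and base_direction([None, "RTL"]) models the second, which inherits right-to-left from the HTML element.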

8.2.3 Setting the direction of embedded text

The [UNICODE] bidirectional algorithm automatically reverses embedded character sequences according to their inherent directionality (as illustrated by the previous examples). However, in general only one level of embedding can be accounted for. To achieve additional levels of embedded direction changes, you must make use of the dir attribute on an inline element.

Consider the same example text as before:

english1 HEBREW2 english3 HEBREW4 english5 HEBREW6

Suppose the predominant language of the document containing this paragraph is English. Furthermore, the above English sentence contains a Hebrew section extending from HEBREW2 through HEBREW4 and the Hebrew section contains an English quotation (english3). The desired presentation of the text is thus:

english1 4WERBEH english3 2WERBEH english5 6WERBEH
                 ------->
                    E
         <-----------------------
                    H
------------------------------------------------->
                        E

To achieve two embedded direction changes, we must supply additional information, which we do by delimiting the second embedding explicitly. In this example, we use the SPAN element and the dir attribute to mark up the text:

english1 <SPAN dir="RTL">HEBREW2 english3 HEBREW4</SPAN> english5 HEBREW6
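
The effect of the SPAN can be illustrated with a small Python sketch of the reordering step alone. The embedding levels below are assigned by hand and are an assumption of this illustration: level 0 for the English base text, level 1 for Hebrew text and for spaces inside the SPAN, and level 2 for the English quotation nested in the Hebrew section. The function name reorder is hypothetical, and the sketch does not compute levels itself, which is the part of the [UNICODE] algorithm that the dir attribute influences.

```python
def reorder(segments):
    """Reverse runs of text from the highest embedding level down to 1.

    segments is a list of (text, level) pairs in logical (stored) order.
    """
    chars, levels = [], []
    for text, level in segments:
        chars.extend(text)
        levels.extend([level] * len(text))

    for level in range(max(levels, default=0), 0, -1):
        start = 0
        while start < len(chars):
            if levels[start] < level:
                start += 1
                continue
            end = start
            while end < len(chars) and levels[end] >= level:
                end += 1
            chars[start:end] = chars[start:end][::-1]
            start = end
    return "".join(chars)


# Hand-assigned levels for the marked-up example (an assumption).
MARKED_UP = [
    ("english1 ", 0),
    ("HEBREW2 ", 1),   # inside the SPAN with dir="RTL"
    ("english3", 2),   # English quotation nested in the Hebrew section
    (" HEBREW4", 1),   # end of the SPAN
    (" english5 ", 0),
    ("HEBREW6", 1),
]
```

With these levels, reorder(MARKED_UP) yields the desired presentation: the whole Hebrew section is laid out right-to-left as one unit while english3 remains readable left-to-right inside it.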

Authors may also use special Unicode characters to achieve multiple embedded direction changes. To achieve left-to-right embedding, surround embedded text with the characters LEFT-TO-RIGHT EMBEDDING ("LRE", hexadecimal 202A) and POP DIRECTIONAL FORMATTING ("PDF", hexadecimal 202C). To achieve right-to-left embedding, surround embedded text with the characters RIGHT-TO-LEFT EMBEDDING ("RLE", hexadecimal 202B) and PDF.
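
As a minimal sketch of the character-based approach, the following Python helper wraps a string in the embedding characters named above. The helper name embed is hypothetical; the code points are the ones given in the text.

```python
LRE = "\u202a"  # LEFT-TO-RIGHT EMBEDDING
RLE = "\u202b"  # RIGHT-TO-LEFT EMBEDDING
PDF = "\u202c"  # POP DIRECTIONAL FORMATTING


def embed(text, direction):
    """Surround text with an embedding character and the closing PDF."""
    opener = {"ltr": LRE, "rtl": RLE}[direction.lower()]
    return opener + text + PDF
```

Note that embed only adds the formatting characters; how the result is displayed still depends on a user agent that applies the bidirectional algorithm.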

Using HTML directionality markup with Unicode characters. Authors and designers of authoring software should be aware that conflicts can arise if the dir attribute is used on inline elements (including BDO) concurrently with the corresponding [UNICODE] formatting characters. Preferably one or the other should be used exclusively. The markup method offers a better guarantee of document structural integrity and alleviates some problems when editing bidirectional HTML text with a simple text editor, but some software may be more apt at using the [UNICODE] characters. If both methods are used, great care should be exercised to ensure proper nesting of markup and directional embedding or override; otherwise, rendering results are undefined.

8.2.4 Overriding the bidirectional algorithm: the BDO element

<!ELEMENT BDO - - (%inline;)*          -- I18N BiDi over-ride -->
<!ATTLIST BDO
  %coreattrs;                          -- id, class, style, title --
  lang        %LanguageCode; #IMPLIED  -- language code --
  dir         (ltr|rtl)      #REQUIRED -- directionality --
  >

Start tag: required, End tag: required

Attribute definitions

dir = LTR | RTL [CI]
This mandatory attribute specifies the base direction of the element's text content. This direction overrides the inherent directionality of characters as defined in [UNICODE]. Possible values:
  • LTR: Left-to-right text.
  • RTL: Right-to-left text.

Attributes defined elsewhere

The bidirectional algorithm and the dir attribute generally suffice to manage embedded direction changes. However, some situations may arise when the bidirectional algorithm results in incorrect presentation. The BDO element allows authors to turn off the bidirectional algorithm for selected fragments of text.

Consider a document containing the same text as before:

english1 HEBREW2 english3 HEBREW4 english5 HEBREW6

but assume that this text has already been put in visual order. One reason for this may be that the MIME standard ([RFC2045], [RFC1556]) favors visual order, i.e., that right-to-left character sequences are inserted right-to-left in the byte stream. In an email, the above might be formatted, including line breaks, as:

english1 2WERBEH english3
4WERBEH english5 6WERBEH

This conflicts with the [UNICODE] bidirectional algorithm, because that algorithm would invert 2WERBEH, 4WERBEH, and 6WERBEH a second time, displaying the Hebrew words left-to-right instead of right-to-left.

The solution in this case is to override the bidirectional algorithm by putting the email excerpt in a PRE element (to conserve line breaks) and each line in a BDO element, whose dir attribute is set to LTR:

<PRE>
<BDO dir="LTR">english1 2WERBEH english3</BDO>
<BDO dir="LTR">4WERBEH english5 6WERBEH</BDO>
</PRE>

This tells the bidirectional algorithm "Leave me left-to-right!" and would produce the desired presentation:

english1 2WERBEH english3
4WERBEH english5 6WERBEH

The BDO element should be used in scenarios where absolute control over sequence order is required (e.g., multi-language part numbers). The dir attribute is mandatory for this element.

Authors may also use special Unicode characters to override the bidirectional algorithm: LEFT-TO-RIGHT OVERRIDE (hexadecimal 202D) or RIGHT-TO-LEFT OVERRIDE (hexadecimal 202E). The POP DIRECTIONAL FORMATTING (hexadecimal 202C) character ends either bidirectional override.

Note. Recall that conflicts can arise if the dir attribute is used on inline elements (including BDO) concurrently with the corresponding [UNICODE] formatting characters.

Bidirectionality and character encoding. According to [RFC1555] and [RFC1556], there are special conventions for the use of "charset" parameter values to indicate bidirectional treatment in MIME mail, in particular to distinguish between visual, implicit, and explicit directionality. The parameter value "ISO-8859-8" (for Hebrew) denotes visual encoding, "ISO-8859-8-i" denotes implicit bidirectionality, and "ISO-8859-8-e" denotes explicit directionality.

Because HTML uses the Unicode bidirectionality algorithm, conforming documents encoded using ISO 8859-8 must be labeled as "ISO-8859-8-i". Explicit directional control is also possible with HTML, but cannot be expressed with ISO 8859-8, so "ISO-8859-8-e" should not be used.

The value "ISO-8859-8" implies that the document is formatted visually, misusing some markup (such as TABLE with right alignment and no line wrapping) to ensure reasonable display on older user agents that do not handle bidirectionality. Such documents do not conform to the present specification. If necessary, they can be made to conform to the current specification (and at the same time will be displayed correctly on older user agents) by adding BDO markup where necessary. Contrary to what is said in [RFC1555] and [RFC1556], ISO-8859-6 (Arabic) is not visual ordering.
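
The three Hebrew "charset" labels and the conformance rule can be written down as a small lookup in Python. This is only a restatement of the paragraphs above in code; the names are hypothetical, and the labels are compared case-insensitively on the assumption that charset names are not case-sensitive.

```python
# Directionality conventions for the Hebrew "charset" labels described above.
HEBREW_CHARSET_DIRECTIONALITY = {
    "iso-8859-8": "visual",
    "iso-8859-8-i": "implicit",
    "iso-8859-8-e": "explicit",
}


def hebrew_directionality(charset):
    """Return 'visual', 'implicit' or 'explicit', or None for other labels."""
    return HEBREW_CHARSET_DIRECTIONALITY.get(charset.strip().lower())


def is_conforming_hebrew_label(charset):
    """True only for the label that conforming ISO 8859-8 documents use."""
    return hebrew_directionality(charset) == "implicit"
```

A checking tool could use is_conforming_hebrew_label to flag documents labeled "ISO-8859-8" or "ISO-8859-8-e" for review.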

8.2.5 Character references for directionality and joining control

Since ambiguities sometimes arise as to the directionality of certain characters (e.g., punctuation), the [UNICODE] specification includes characters to enable their proper resolution. Also, Unicode includes some characters to control joining behavior where this is necessary (e.g., some situations with Arabic letters). HTML 4 includes character references for these characters.

The following DTD excerpt presents some of the directional entities:

   <!ENTITY zwnj CDATA "&#8204;"--=zero width non-joiner-->
   <!ENTITY zwj  CDATA "&#8205;"--=zero width joiner-->
   <!ENTITY lrm  CDATA "&#8206;"--=left-to-right mark-->
   <!ENTITY rlm  CDATA "&#8207;"--=right-to-left mark-->

The zwnj entity is used to block joining behavior in contexts where joining will occur but shouldn't. The zwj entity does the opposite; it forces joining when it wouldn't occur but should. For example, the Arabic letter "HEH" is used to abbreviate "Hijri", the name of the Islamic calendar system. Since the isolated form of "HEH" looks like the digit five as employed in Arabic script (based on Indic digits), the initial form of "HEH" is used to prevent it from being mistaken for a final digit five in a year. However, there is no following context (i.e., a joining letter) to which the "HEH" can join. The zwj character provides that context.

Similarly, in Persian texts, there are cases where a letter that normally would join a subsequent letter in a cursive connection should not. The character zwnj is used to block joining in such cases.

The other characters, lrm and rlm, are used to force directionality of directionally neutral characters. For example, if a double quotation mark comes between an Arabic (right-to-left) and a Latin (left-to-right) letter, the direction of the quotation mark is not clear (is it quoting the Arabic text or the Latin text?). The lrm and rlm characters have a directional property but no width and no word/line break property. Please consult [UNICODE] for more details.
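
The four entities can be inspected with Python's standard library, which is one convenient way to confirm the code points in the DTD excerpt and the directional property of lrm and rlm. The helper name describe_entity is hypothetical; html.entities.name2codepoint and the unicodedata functions are standard modules.

```python
import unicodedata
from html.entities import name2codepoint


def describe_entity(name):
    """Return (code point, Unicode name, bidirectional class) for an entity."""
    code_point = name2codepoint[name]
    char = chr(code_point)
    return code_point, unicodedata.name(char), unicodedata.bidirectional(char)
```

For instance, describe_entity("lrm") reports code point 8206 with bidirectional class "L", and describe_entity("rlm") reports 8207 with class "R", which is the strong directionality these otherwise invisible characters contribute.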

Mirrored character glyphs. In general, the bidirectional algorithm does not mirror character glyphs but leaves them unaffected. Exceptions are characters such as parentheses (see [UNICODE], table 4-7). In cases where mirroring is desired, for example for Egyptian Hieroglyphs, Greek Boustrophedon, or special design effects, this should be controlled with styles.
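
Whether a given character belongs to the mirrored exceptions can be looked up in the Unicode character database. The short Python sketch below uses the standard unicodedata.mirrored function; the wrapper name is_mirrored is hypothetical, and the result reflects the Unicode version bundled with the Python interpreter rather than the table cited above.

```python
import unicodedata


def is_mirrored(char):
    """Return True if char is mirrored in right-to-left text."""
    return unicodedata.mirrored(char) == 1
```

Parentheses are reported as mirrored, while ordinary letters such as "a" or the Hebrew letter alef are not.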

8.2.6 The effect of style sheets on bidirectionality

In general, using style sheets to change an element's visual rendering from block-level to inline or vice-versa is straightforward. However, because the bidirectional algorithm relies on the inline/block-level distinction, special care must be taken during the transformation.

When an inline element that does not have a dir attribute is transformed to the style of a block-level element by a style sheet, it inherits the dir attribute from its closest parent block element to define the base direction of the block.

When a block element that does not have a dir attribute is transformed to the style of an inline element by a style sheet, the resulting presentation should be equivalent, in terms of bidirectional formatting, to the formatting obtained by explicitly adding a dir attribute (assigned the inherited value) to the transformed element.

Copyright © 1997-1999 W3C® (MIT, INRIA, Keio), All Rights Reserved.