Unicode support in Perl

NAME

perlunicode - Unicode support in Perl

DESCRIPTION

Important Caveats

Unicode support is an extensive requirement. While Perl does not implement the Unicode standard or the accompanying technical reports from cover to cover, Perl does support many Unicode features.

People who want to learn to use Unicode in Perl should probably read the Perl Unicode tutorials, perlunitut and perluniintro, before reading this reference document.

Also, the use of Unicode may present security issues that aren't obvious. Read Unicode Security Considerations.

  • Safest if you "use feature 'unicode_strings'"

    In order to preserve backward compatibility, Perl does not turn on full internal Unicode support unless the pragma use feature 'unicode_strings' is specified. (This is selected automatically if you specify use 5.012 or higher.) Failure to do this can trigger unexpected surprises. See The Unicode Bug below.

    This pragma doesn't affect I/O, and there are still several places where Unicode isn't fully supported, such as in filenames.

  • Input and Output Layers

    Perl knows when a filehandle uses Perl's internal Unicode encodings (UTF-8, or UTF-EBCDIC if in EBCDIC) if the filehandle is opened with the ":encoding(utf8)" layer. Other encodings can be converted to Perl's encoding on input or from Perl's encoding on output by use of the ":encoding(...)" layer. See open.

    To indicate that the Perl source itself is in UTF-8, specify use utf8;.

  • use utf8 still needed to enable UTF-8/UTF-EBCDIC in scripts

    As a compatibility measure, the use utf8 pragma must be explicitly included to enable recognition of UTF-8 in the Perl scripts themselves (in string or regular expression literals, or in identifier names) on ASCII-based machines or to recognize UTF-EBCDIC on EBCDIC-based machines. These are the only times when an explicit use utf8 is needed. See utf8.

  • BOM-marked scripts and UTF-16 scripts autodetected

    If a Perl script begins marked with the Unicode BOM (UTF-16LE, UTF-16BE, or UTF-8), or if the script looks like non-BOM-marked UTF-16 of either endianness, Perl will correctly read in the script as Unicode. (BOMless UTF-8 cannot be effectively recognized or differentiated from ISO 8859-1 or other eight-bit encodings.)

  • use encoding needed to upgrade non-Latin-1 byte strings

    By default, there is a fundamental asymmetry in Perl's Unicode model: implicit upgrading from byte strings to Unicode strings assumes that they were encoded in ISO 8859-1 (Latin-1), but Unicode strings are downgraded with UTF-8 encoding. This happens because the first 256 code points in Unicode happen to agree with Latin-1.

    See Byte and Character Semantics for more details.
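As a minimal sketch of the input and output layers described above, the following reads UTF-8 bytes through a decoding layer and writes them back through an encoding layer. An in-memory filehandle stands in for a real file, the variable names are illustrative only, and ":encoding(UTF-8)" (the strict variant of the layer) is used here as an assumption in place of ":encoding(utf8)".

```perl
use strict;
use warnings;

# Raw UTF-8 bytes for "cafe" with an acute accent on the e, plus a newline,
# as they might sit in a file on disk.
my $bytes = "caf\xC3\xA9\n";

# Read through a decoding layer; an in-memory filehandle stands in for a file.
open(my $in, '<:encoding(UTF-8)', \$bytes) or die "open failed: $!";
my $line = <$in>;
close $in;
chomp $line;

my $byte_count = length($bytes) - 1;   # bytes before the newline
my $char_count = length($line);        # characters after decoding

# Write the characters back out through an encoding layer.
my $out_bytes = '';
open(my $out, '>:encoding(UTF-8)', \$out_bytes) or die "open failed: $!";
print {$out} $line, "\n";
close $out;

printf "%d bytes decoded to %d characters\n", $byte_count, $char_count;
```

The two-byte UTF-8 sequence becomes a single character on input, and the encoding layer restores the original bytes on output.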

Byte and Character Semantics

Beginning with version 5.6, Perl uses logically-wide characters to represent strings internally.

Starting in Perl 5.14, Perl-level operations work with characters rather than bytes within the scope of a use feature 'unicode_strings' (or equivalently use 5.012 or higher). (This is not true if bytes have been explicitly requested by use bytes, nor necessarily true for interactions with the platform's operating system.)

For earlier Perls, and when unicode_strings is not in effect, Perl provides a fairly safe environment that can handle both types of semantics in programs. For operations where Perl can unambiguously decide that the input data are characters, Perl switches to character semantics. For operations where this determination cannot be made without additional information from the user, Perl decides in favor of compatibility and chooses to use byte semantics.

When use locale (but not use locale ':not_characters') is in effect, Perl uses the semantics associated with the current locale. (use locale overrides use feature 'unicode_strings' in the same scope; while use locale ':not_characters' effectively also selects use feature 'unicode_strings' in its scope; see perllocale.) Otherwise, Perl uses the platform's native byte semantics for characters whose code points are less than 256, and Unicode semantics for those greater than 255. On EBCDIC platforms, this is almost seamless, as the EBCDIC code pages that Perl handles are equivalent to Unicode's first 256 code points. (The exception is that EBCDIC regular expression case-insensitive matching rules are not as robust as Unicode's.) But on ASCII platforms, Perl uses US-ASCII (or Basic Latin in Unicode terminology) byte semantics, meaning that characters whose ordinal numbers are in the range 128 - 255 are undefined except for their ordinal numbers. This means that none have case (upper and lower), nor are any a member of character classes, like [:alpha:] or \w. (But all do belong to the \W class or the Perl regular expression extension [:^alpha:].)

This behavior preserves compatibility with earlier versions of Perl, which allowed byte semantics in Perl operations only if none of the program's inputs were marked as being a source of Unicode character data. Such data may come from filehandles, from calls to external programs, from information provided by the system (such as %ENV), or from literals and constants in the source text.

The utf8 pragma is primarily a compatibility device that enables recognition of UTF-(8|EBCDIC) in literals encountered by the parser. Note that this pragma is only required while Perl defaults to byte semantics; when character semantics become the default, this pragma may become a no-op. See utf8.

If strings operating under byte semantics and strings with Unicode character data are concatenated, the new string will have character semantics. This can cause surprises: See BUGS, below. You can choose to be warned when this happens. See encoding::warnings.

Under character semantics, many operations that formerly operated on bytes now operate on characters. A character in Perl is logically just a number ranging from 0 to 2**31 or so. Larger characters may encode into longer sequences of bytes internally, but this internal detail is mostly hidden for Perl code. See perluniintro for more.
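The difference between native byte semantics and Unicode semantics for code points 128 - 255 can be sketched as follows. The example assumes an ASCII platform, Perl 5.14 or later, and no use locale in effect; the variable names are illustrative only.

```perl
use strict;
use warnings;

# A one-character string holding code point 0xE9 (e with acute in Latin-1).
# It is stored as a single byte, so the rules described above apply to it.
my $e_acute = "\xE9";

my ($word_native, $uc_native);
{
    no feature 'unicode_strings';      # make the legacy behavior explicit
    $word_native = $e_acute =~ /\w/ ? 1 : 0;
    $uc_native   = uc $e_acute;
}

my ($word_unicode, $uc_unicode);
{
    use feature 'unicode_strings';     # Unicode rules for all code points
    $word_unicode = $e_acute =~ /\w/ ? 1 : 0;
    $uc_unicode   = uc $e_acute;
}

printf "native:  \\w=%d uc=U+%04X\n", $word_native,  ord $uc_native;
printf "unicode: \\w=%d uc=U+%04X\n", $word_unicode, ord $uc_unicode;
```

In the first block the character has no case and is not a word character; in the second it is a letter whose uppercase form is U+00C9.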

Effects of Character Semantics

Character semantics have the following effects:

  • Strings--including hash keys--and regular expression patterns may contain characters that have an ordinal value larger than 255.

    If you use a Unicode editor to edit your program, Unicode characters may occur directly within the literal strings in UTF-8 encoding, or UTF-16. (The former requires a BOM or use utf8, the latter requires a BOM.)

    Unicode characters can also be added to a string by using the \N{U+...} notation. The Unicode code for the desired character, in hexadecimal, should be placed in the braces, after the U. For instance, a smiley face is \N{U+263A}.

    Alternatively, you can use the \x{...} notation for characters 0x100 and above. For characters below 0x100 you may get byte semantics instead of character semantics; see The Unicode Bug. On EBCDIC machines there is the additional problem that the value for such characters gives the EBCDIC character rather than the Unicode one, thus it is more portable to use \N{U+...} instead.

    Additionally, you can use the \N{...} notation and put the official Unicode character name within the braces, such as \N{WHITE SMILING FACE}. This automatically loads the charnames module with the :full and :short options. If you prefer different options for this module, you can instead, before the \N{...}, explicitly load it with your desired options; for example,

        use charnames ':loose';

  • If an appropriate encoding is specified, identifiers within the Perl script may contain Unicode alphanumeric characters, including ideographs. Perl does not currently attempt to canonicalize variable names.

  • Regular expressions match characters instead of bytes. "." matches a character instead of a byte.

  • Bracketed character classes in regular expressions match characters instead of bytes and match against the character properties specified in the Unicode properties database. \w can be used to match a Japanese ideograph, for instance.

  • Named Unicode properties, scripts, and block ranges may be used (like bracketed character classes) by using the \p{} "matches property" construct and the \P{} negation, "doesn't match property". See Unicode Character Properties for more details.

    You can define your own character properties and use them in the regular expression with the \p{} or \P{} construct. See User-Defined Character Properties for more details.

  • The special pattern \X matches a logical character, an "extended grapheme cluster" in Standardese. In Unicode what appears to the user to be a single character, for example an accented G, may in fact be composed of a sequence of characters, in this case a G followed by an accent character. \X will match the entire sequence.

  • The tr/// operator translates characters instead of bytes. Note that the tr///CU functionality has been removed. For similar functionality see pack('U0', ...) and pack('C0', ...).

  • Case translation operators use the Unicode case translation tables when character input is provided. Note that uc(), or \U in interpolated strings, translates to uppercase, while ucfirst, or \u in interpolated strings, translates to titlecase in languages that make the distinction (which is equivalent to uppercase in languages without the distinction).

  • Most operators that deal with positions or lengths in a string will automatically switch to using character positions, including chop(), chomp(), substr(), pos(), index(), rindex(), sprintf(), write(), and length(). An operator that specifically does not switch is vec(). Operators that really don't care include operators that treat strings as a bucket of bits such as sort(), and operators dealing with filenames.

  • The pack()/unpack() letter C does not change, since it is often used for byte-oriented formats. Again, think char in the C language.

    There is a new U specifier that converts between Unicode characters and code points. There is also a W specifier that is the equivalent of chr/ord and properly handles character values even if they are above 255.

  • The chr() and ord() functions work on characters, similar to pack("W") and unpack("W"), not pack("C") and unpack("C"). pack("C") and unpack("C") are methods for emulating byte-oriented chr() and ord() on Unicode strings. While these methods reveal the internal encoding of Unicode strings, that is not something one normally needs to care about at all.

  • The bit string operators, & | ^ ~, can operate on character data. However, for backward compatibility, such as when using bit string operations when characters are all less than 256 in ordinal value, one should not use ~ (the bit complement) with characters of both values less than 256 and values greater than 255. Most importantly, DeMorgan's laws (~($x|$y) eq ~$x&~$y and ~($x&$y) eq ~$x|~$y) will not hold. The reason for this mathematical faux pas is that the complement cannot return both the 8-bit (byte-wide) bit complement and the full character-wide bit complement.

  • There is a CPAN module, Unicode::Casing, which allows you to define your own mappings to be used in lc(), lcfirst(), uc(), ucfirst(), and fc (or their double-quoted string inlined versions such as \U). (Prior to Perl 5.16, this functionality was partially provided in the Perl core, but suffered from a number of insurmountable drawbacks, so the CPAN module was written instead.)

  • And finally, scalar reverse() reverses by character rather than by byte.
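Several of the effects listed above can be seen in a short sketch. It uses only \N{U+...} escapes so that no source encoding needs to be declared; the variable names are illustrative and not part of any API.

```perl
use strict;
use warnings;

my $smiley = "\N{U+263A}";             # one character, ordinal above 255
my $text   = "a${smiley}b";

my $len      = length $text;           # counted in characters, not bytes
my $pos      = index $text, 'b';       # a character offset
my $reversed = scalar reverse $text;   # reversed by character

# A base letter plus a combining mark is two code points but one
# extended grapheme cluster, which \X matches as a unit.
my $g_acute  = "G\N{U+0301}";
my @clusters = $g_acute =~ /\X/g;

# chr() and pack("W") both deal in character values.
my $same = chr(0x263A) eq pack('W', 0x263A) ? 1 : 0;

# tr/// translates characters, including ones above 255.
(my $swapped = $text) =~ tr/\x{263A}/*/;

printf "length=%d index=%d clusters=%d swapped=%s\n",
    $len, $pos, scalar @clusters, $swapped;
```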

Unicode Character Properties

(The only time that Perl considers a sequence of individual code points as a single logical character is in the \X construct, already mentioned above. Therefore "character" in this discussion means a single Unicode code point.)

Very nearly all Unicode character properties are accessible through regular expressions by using the \p{} "matches property" construct and the \P{} "doesn't match property" for its negation.

For instance, \p{Uppercase} matches any single character with the Unicode "Uppercase" property, while \p{L} matches any character with a General_Category of "L" (letter) property. Brackets are not required for single letter property names, so \p{L} is equivalent to \pL.

More formally, \p{Uppercase} matches any single character whose Unicode Uppercase property value is True, and \P{Uppercase} matches any character whose Uppercase property value is False, and they could have been written as \p{Uppercase=True} and \p{Uppercase=False}, respectively.

This formality is needed when properties are not binary; that is, if they can take on more values than just True and False. For example, the Bidi_Class (see Bidirectional Character Types below), can take on several different values, such as Left, Right, Whitespace, and others. To match these, one needs to specify both the property name (Bidi_Class), AND the value being matched against (Left, Right, etc.). This is done, as in the examples above, by having the two components separated by an equal sign (or interchangeably, a colon), like \p{Bidi_Class: Left}.

All Unicode-defined character properties may be written in these compound forms of \p{property=value} or \p{property:value}, but Perl provides some additional properties that are written only in the single form, as well as single-form short-cuts for all binary properties and certain others described below, in which you may omit the property name and the equals or colon separator.

Most Unicode character properties have at least two synonyms (or aliases if you prefer): a short one that is easier to type and a longer one that is more descriptive and hence easier to understand. Thus the "L" and "Letter" properties above are equivalent and can be used interchangeably. Likewise, "Upper" is a synonym for "Uppercase", and we could have written \p{Uppercase} equivalently as \p{Upper}. Also, there are typically various synonyms for the values the property can be. For binary properties, "True" has 3 synonyms: "T", "Yes", and "Y"; and "False" has correspondingly "F", "No", and "N". But be careful. A short form of a value for one property may not mean the same thing as the same short form for another. Thus, for the General_Category property, "L" means "Letter", but for the Bidi_Class property, "L" means "Left". A complete list of properties and synonyms is in perluniprops.

Upper/lower case differences in property names and values are irrelevant; thus \p{Upper} means the same thing as \p{upper} or even \p{UpPeR}. Similarly, you can add or subtract underscores anywhere in the middle of a word, so that these are also equivalent to \p{U_p_p_e_r}. And white space is irrelevant adjacent to non-word characters, such as the braces and the equals or colon separators, so \p{ Upper } and \p{ Upper_case : Y } are equivalent to these as well. In fact, white space and even hyphens can usually be added or deleted anywhere. So even \p{ Up-per case = Yes} is equivalent. All this is called "loose-matching" by Unicode. The few places where stricter matching is used are in the middle of numbers, and in the Perl extension properties that begin or end with an underscore. Stricter matching cares about white space (except adjacent to non-word characters), hyphens, and non-interior underscores.

You can also use negation in both \p{} and \P{} by introducing a caret (^) between the first brace and the property name: \p{^Tamil} is equal to \P{Tamil}.

Almost all properties are immune to case-insensitive matching. That is, adding a /i regular expression modifier does not change what they match. There are two sets that are affected. The first set is Uppercase_Letter, Lowercase_Letter, and Titlecase_Letter, all of which match Cased_Letter under /i matching. And the second set is Uppercase, Lowercase, and Titlecase, all of which match Cased under /i matching. This set also includes its subsets PosixUpper and PosixLower both of which under /i matching match PosixAlpha. (The difference between these sets is that some things, such as Roman numerals, come in both upper and lower case so they are Cased, but aren't considered letters, so they aren't Cased_Letters.)

The result is undefined if you try to match a non-Unicode code point (that is, one above 0x10FFFF) against a Unicode property. Currently, a warning is raised, and the match will fail. In some cases, this is counterintuitive, as both these fail:

    chr(0x110000) =~ /\p{ASCII_Hex_Digit=True}/     # Fails.
    chr(0x110000) =~ /\p{ASCII_Hex_Digit=False}/    # Fails!
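The single form, the compound form, negation, case-insensitive property names, and the effect of /i described in this section can be tried out with a few matches. This is only a sketch; each variable simply records whether one match succeeded, and the names are illustrative.

```perl
use strict;
use warnings;

my $single   = 'A' =~ /\p{Uppercase}/       ? 1 : 0;   # single form
my $compound = 'A' =~ /\p{Uppercase=True}/  ? 1 : 0;   # compound form
my $negated  = 'a' =~ /\P{Uppercase}/       ? 1 : 0;   # \P negation
my $false    = 'a' =~ /\p{Uppercase=False}/ ? 1 : 0;   # same, compound
my $short    = 'A' =~ /\pL/                 ? 1 : 0;   # no braces needed
my $mixed    = 'A' =~ /\p{UpPeR}/           ? 1 : 0;   # case is irrelevant
my $caret    = 'A' =~ /\p{^Tamil}/          ? 1 : 0;   # same as \P{Tamil}

# Uppercase_Letter (Lu) is one of the few properties affected by /i:
# under /i it matches everything in Cased_Letter.
my $lu   = 'a' =~ /\p{Lu}/  ? 1 : 0;
my $lu_i = 'a' =~ /\p{Lu}/i ? 1 : 0;

print "Lu without /i: $lu, with /i: $lu_i\n";
```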

General_Category

Every Unicode character is assigned a general category, which is the "most usual categorization of a character" (from http://www.unicode.org/reports/tr44).

The compound way of writing these is like \p{General_Category=Number} (short, \p{gc:n}). But Perl furnishes shortcuts in which everything up through the equal or colon separator is omitted. So you can instead just write \pN.

Here are the short and long forms of the General Category properties:

    Short       Long

    L           Letter
    LC, L&      Cased_Letter (that is: [\p{Ll}\p{Lu}\p{Lt}])
    Lu          Uppercase_Letter
    Ll          Lowercase_Letter
    Lt          Titlecase_Letter
    Lm          Modifier_Letter
    Lo          Other_Letter

    M           Mark
    Mn          Nonspacing_Mark
    Mc          Spacing_Mark
    Me          Enclosing_Mark

    N           Number
    Nd          Decimal_Number (also Digit)
    Nl          Letter_Number
    No          Other_Number

    P           Punctuation (also Punct)
    Pc          Connector_Punctuation
    Pd          Dash_Punctuation
    Ps          Open_Punctuation
    Pe          Close_Punctuation
    Pi          Initial_Punctuation
                (may behave like Ps or Pe depending on usage)
    Pf          Final_Punctuation
                (may behave like Ps or Pe depending on usage)
    Po          Other_Punctuation

    S           Symbol
    Sm          Math_Symbol
    Sc          Currency_Symbol
    Sk          Modifier_Symbol
    So          Other_Symbol

    Z           Separator
    Zs          Space_Separator
    Zl          Line_Separator
    Zp          Paragraph_Separator

    C           Other
    Cc          Control (also Cntrl)
    Cf          Format
    Cs          Surrogate
    Co          Private_Use
    Cn          Unassigned

Single-letter properties match all characters in any of the two-letter sub-properties starting with the same letter. LC and L& are special: both are aliases for the set consisting of everything matched by Ll, Lu, and Lt.
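To illustrate how single-letter categories cover their two-letter sub-categories, the following sketch counts characters in a sample string. The sample string and variable names are made up for illustration.

```perl
use strict;
use warnings;

# U+2167 is ROMAN NUMERAL EIGHT, whose General_Category is Nl.
my $string = "Perl 5, \N{U+2167}!";

# Assigning a list-context match to () and then to a scalar counts matches.
my $letters = () = $string =~ /\pL/g;       # any letter
my $numbers = () = $string =~ /\pN/g;       # Nd, Nl, and No together
my $digits  = () = $string =~ /\p{Nd}/g;    # decimal digits only
my $punct   = () = $string =~ /\p{General_Category=Punctuation}/g;
my $short   = () = $string =~ /\p{gc:n}/g;  # short compound form of \pN

# U+01C5 is a titlecase letter (Lt), so it is in LC and its alias L&.
my $in_lc     = "\N{U+01C5}" =~ /\p{LC}/ ? 1 : 0;
my $in_l_amp  = "\N{U+01C5}" =~ /\p{L&}/ ? 1 : 0;

printf "letters=%d numbers=%d digits=%d punct=%d\n",
    $letters, $numbers, $digits, $punct;
```

The Roman numeral is counted by \pN but not by \p{Nd}, since N covers all of Nd, Nl, and No.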

Bidirectional Character Types

Because scripts differ in their directionality (Hebrew and Arabic are written right to left, for example) Unicode supplies these properties in the Bidi_Class class:

    Property    Meaning

    L           Left-to-Right
    LRE         Left-to-Right Embedding
    LRO         Left-to-Right Override
    R           Right-to-Left
    AL          Arabic Letter
    RLE         Right-to-Left Embedding
    RLO         Right-to-Left Override
    PDF         Pop Directional Format
    EN          European Number
    ES          European Separator
    ET          European Terminator
    AN          Arabic Number
    CS          Common Separator
    NSM         Non-Spacing Mark
    BN          Boundary Neutral
    B           Paragraph Separator
    S           Segment Separator
    WS          Whitespace
    ON          Other Neutrals

This property is always written in the compound form. For example, \p{Bidi_Class:R} matches characters that are normally written right to left.
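A small sketch of the compound form in use: the hypothetical helper bidi_class() below (not part of Perl) checks a character against a handful of the values from the table and returns the first that matches. Only a subset of values is tested, purely for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return the first of a few Bidi_Class values that
# the given character matches, or 'other' if none of them does.
sub bidi_class {
    my ($char) = @_;
    return 'L'  if $char =~ /\p{Bidi_Class:L}/;
    return 'R'  if $char =~ /\p{Bidi_Class:R}/;
    return 'AL' if $char =~ /\p{Bidi_Class:AL}/;
    return 'EN' if $char =~ /\p{Bidi_Class:EN}/;
    return 'WS' if $char =~ /\p{Bidi_Class:WS}/;
    return 'other';
}

my $latin  = bidi_class('A');            # a Latin letter
my $hebrew = bidi_class("\N{U+05D0}");   # HEBREW LETTER ALEF
my $arabic = bidi_class("\N{U+0627}");   # ARABIC LETTER ALEF
my $digit  = bidi_class('7');            # a European number
my $space  = bidi_class(' ');            # whitespace

print join(' ', $latin, $hebrew, $arabic, $digit, $space), "\n";
```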

Scripts

The world's languages are written in many different scripts. This sentence (unless you're reading it in translation) is written in Latin, while Russian is written in Cyrillic, and Greek is written in, well, Greek; Japanese mainly in Hiragana or Katakana. There are many more.

The Unicode Script and Script_Extensions properties give what script a given character is in. Either property can be specified with the compound form like \p{Script=Hebrew} (short: \p{sc=hebr}), or \p{Script_Extensions=Javanese} (short: \p{scx=java}). In addition, Perl furnishes shortcuts for all Script property names. You can omit everything up through the equals (or colon), and simply write \p{Latin} or \P{Cyrillic}. (This is not true for Script_Extensions, which is required to be written in the compound form.)

The difference between these two properties involves characters that are used in multiple scripts. For example the digits '0' through '9' are used in many parts of the world. These are placed in a script named Common. Other characters are used in just a few scripts. For example, the "KATAKANA-HIRAGANA DOUBLE HYPHEN" is used in both Japanese scripts, Katakana and Hiragana, but nowhere else. The Script property places all characters that are used in multiple scripts in the Common script, while the Script_Extensions property places those that are used in only a few scripts into each of those scripts; while still using Common for those used in many scripts. Thus both these match:

    "0" =~ /\p{sc=Common}/     # Matches
    "0" =~ /\p{scx=Common}/    # Matches

and only the first of these matches:

    "\N{KATAKANA-HIRAGANA DOUBLE HYPHEN}" =~ /\p{sc=Common}/   # Matches
    "\N{KATAKANA-HIRAGANA DOUBLE HYPHEN}" =~ /\p{scx=Common}/  # No match

And only the last two of these match:

    "\N{KATAKANA-HIRAGANA DOUBLE HYPHEN}" =~ /\p{sc=Hiragana}/   # No match
    "\N{KATAKANA-HIRAGANA DOUBLE HYPHEN}" =~ /\p{sc=Katakana}/   # No match
    "\N{KATAKANA-HIRAGANA DOUBLE HYPHEN}" =~ /\p{scx=Hiragana}/  # Matches
    "\N{KATAKANA-HIRAGANA DOUBLE HYPHEN}" =~ /\p{scx=Katakana}/  # Matches

Script_Extensions is thus an improved Script, in which there are fewer characters in the Common script, and correspondingly more in other scripts. It is new in Unicode version 6.0, and its data are likely to change significantly in later releases, as things get sorted out.

(Actually, besides Common, the Inherited script contains characters that are used in multiple scripts. These are modifier characters which modify other characters, and inherit the script value of the controlling character. Some of these are used in many scripts, and so go into Inherited in both Script and Script_Extensions. Others are used in just a few scripts, so are in Inherited in Script, but not in Script_Extensions.)

It is worth stressing that there are several different sets of digits in Unicode that are equivalent to 0-9 and are matchable by \d in a regular expression. If they are used in a single language only, they are in that language's Script and Script_Extensions. If they are used in more than one script, they will be in sc=Common, but only if they are used in many scripts should they be in scx=Common.

A complete list of scripts and their shortcuts is in perluniprops.
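As a rough sketch of the Script property in use, the hypothetical helper script_of() below (not part of Perl) reports which of a few scripts a character belongs to, using the compound form, its short aliases, and a single-form shortcut. The set of scripts tested is an arbitrary illustrative subset.

```perl
use strict;
use warnings;

# Hypothetical helper: name the Script of a character, choosing among a
# handful of scripts only.
sub script_of {
    my ($char) = @_;
    return 'Latin'    if $char =~ /\p{Script=Latin}/;
    return 'Cyrillic' if $char =~ /\p{Script=Cyrillic}/;
    return 'Greek'    if $char =~ /\p{Greek}/;       # single-form shortcut
    return 'Hebrew'   if $char =~ /\p{sc=hebr}/;     # short aliases
    return 'Common'   if $char =~ /\p{Script=Common}/;
    return 'other';
}

my $latin    = script_of('a');
my $cyrillic = script_of("\N{U+0414}");   # CYRILLIC CAPITAL LETTER DE
my $greek    = script_of("\N{U+03A9}");   # GREEK CAPITAL LETTER OMEGA
my $hebrew   = script_of("\N{U+05D0}");   # HEBREW LETTER ALEF
my $digit    = script_of('7');            # digits are shared: Common

print join(' ', $latin, $cyrillic, $greek, $hebrew, $digit), "\n";
```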

Use of "Is" Prefix

For backward compatibility (with Perl 5.6), all properties mentioned so far may have Is or Is_ prepended to their name, so \P{Is_Lu}, for example, is equal to \P{Lu}, and \p{IsScript:Arabic} is equal to \p{Arabic}.

Blocks

In addition to scripts, Unicode also defines blocks of characters. The difference between scripts and blocks is that the concept of scripts is closer to natural languages, while the concept of blocks is more of an artificial grouping based on groups of Unicode characters with consecutive ordinal values. For example, the "Basic Latin" block is all characters whose ordinals are between 0 and 127, inclusive; in other words, the ASCII characters. The "Latin" script contains some letters from this as well as several other blocks, like "Latin-1 Supplement", "Latin Extended-A", etc., but it does not contain all the characters from those blocks. It does not, for example, contain the digits 0-9, because those digits are shared across many scripts, and hence are in the Common script.

For more about scripts versus blocks, see UAX#24 "Unicode Script Property": http://www.unicode.org/reports/tr24

The Script or Script_Extensions properties are likely to be the ones you want to use when processing natural language; the Block property may occasionally be useful in working with the nuts and bolts of Unicode.

Block names are matched in the compound form, like \p{Block: Arrows} or \p{Blk=Hebrew}. Unlike most other properties, only a few block names have a Unicode-defined short name. But Perl does provide a (slight) shortcut: You can say, for example \p{In_Arrows} or \p{In_Hebrew}. For backwards compatibility, the In prefix may be omitted if there is no naming conflict with a script or any other property, and you can even use an Is prefix instead in those cases. But it is not a good idea to do this, for a couple of reasons:

  1. It is confusing. There are many naming conflicts, and you may forget some. For example, \p{Hebrew} means the script Hebrew, and NOT the block Hebrew. But would you remember that 6 months from now?

  2. It is unstable. A new version of Unicode may pre-empt the current meaning by creating a property with the same name. There was a time in very early Unicode releases when \p{Hebrew} would have matched the block Hebrew; now it doesn't.

Some people prefer to always use \p{Block: foo} and \p{Script: bar} instead of the shortcuts, whether for clarity, because they can't remember the difference between 'In' and 'Is' anyway, or they aren't confident that those who eventually will read their code will know that difference.

A complete list of blocks and their shortcuts is in perluniprops.
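The contrast between blocks and scripts can be sketched with a few matches. Each variable records whether one match succeeded; the names are illustrative only.

```perl
use strict;
use warnings;

my $arrow = "\N{U+2190}";   # LEFTWARDS ARROW
my $alef  = "\N{U+05D0}";   # HEBREW LETTER ALEF

# Compound form and the In_ shortcut name the same block.
my $arrow_compound = $arrow =~ /\p{Block: Arrows}/ ? 1 : 0;
my $arrow_shortcut = $arrow =~ /\p{In_Arrows}/     ? 1 : 0;
my $alef_block     = $alef  =~ /\p{Blk=Hebrew}/    ? 1 : 0;

# The digit 7 lies in the Basic Latin block, but its script is Common,
# not Latin.
my $digit_block  = '7' =~ /\p{Block: Basic_Latin}/ ? 1 : 0;
my $digit_script = '7' =~ /\p{Script: Latin}/      ? 1 : 0;

print "7: Basic Latin block=$digit_block, Latin script=$digit_script\n";
```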

Other Properties

There are many more properties than the very basic ones described here. A complete list is in perluniprops.

Unicode defines all its properties in the compound form, so all single-form properties are Perl extensions. Most of these are just synonyms for the Unicode ones, but some are genuine extensions, including several that are in the compound form. And quite a few of these are actually recommended by Unicode (in http://www.unicode.org/reports/tr18).

This section gives some details on all extensions that aren't just synonyms for compound-form Unicode properties (for those properties, you'll have to refer to the Unicode Standard).

  • \p{All}

    This matches any of the 1_114_112 Unicode code points. It is a synonym for \p{Any}.

  • \p{Alnum}

    This matches any \p{Alphabetic} or \p{Decimal_Number} character.

  • \p{Any}

    This matches any of the 1_114_112 Unicode code points. It is a synonym for \p{All}.

  • \p{ASCII}

    This matches any of the 128 characters in the US-ASCII character set, which is a subset of Unicode.

  • \p{Assigned}

    This matches any assigned code point; that is, any code point whose general category is not Unassigned (or equivalently, not Cn).

  • \p{Blank}

    This is the same as \h and \p{HorizSpace}: A character that changes the spacing horizontally.

  • \p{Decomposition_Type: Non_Canonical} (Short: \p{Dt=NonCanon})

    Matches a character that has a non-canonical decomposition.

    To understand the use of this rarely used property=value combination, it is necessary to know some basics about decomposition. Consider a character, say H. It could appear with various marks around it, such as an acute accent, or a circumflex, or various hooks, circles, arrows, etc., above, below, to one side or the other, etc. There are many possibilities among the world's languages. The number of combinations is astronomical, and if there were a character for each combination, it would soon exhaust Unicode's more than a million possible characters. So Unicode took a different approach: there is a character for the base H, and a character for each of the possible marks, and these can be variously combined to get a final logical character. So a logical character--what appears to be a single character--can be a sequence of more than one individual characters. This is called an "extended grapheme cluster"; Perl furnishes the \X regular expression construct to match such sequences.

    But Unicode's intent is to unify the existing character set standards and practices, and several pre-existing standards have single characters that mean the same thing as some of these combinations. An example is ISO-8859-1, which has quite a few of these in the Latin-1 range, an example being "LATIN CAPITAL LETTER E WITH ACUTE". Because this character was in this pre-existing standard, Unicode added it to its repertoire. But this character is considered by Unicode to be equivalent to the sequence consisting of the character "LATIN CAPITAL LETTER E" followed by the character "COMBINING ACUTE ACCENT".

    "LATIN CAPITAL LETTER E WITH ACUTE" is called a "pre-composed" character, and its equivalence with the sequence is called canonical equivalence. All pre-composed characters are said to have a decomposition (into the equivalent sequence), and the decomposition type is also called canonical.

    However, many more characters have a different type of decomposition, a "compatible" or "non-canonical" decomposition. The sequences that form these decompositions are not considered canonically equivalent to the pre-composed character. An example, again in the Latin-1 range, is the "SUPERSCRIPT ONE". It is somewhat like a regular digit 1, but not exactly; its decomposition into the digit 1 is called a "compatible" decomposition, specifically a "super" decomposition. There are several such compatibility decompositions (see http://www.unicode.org/reports/tr44), including one called "compat", which means some miscellaneous type of decomposition that doesn't fit into the decomposition categories that Unicode has chosen.

    Note that most Unicode characters don't have a decomposition, so their decomposition type is "None".

    For your convenience, Perl has added the Non_Canonical decomposition type to mean any of the several compatibility decompositions.

  • \p{Graph}

    Matches any character that is graphic. Theoretically, this means a character that on a printer would cause ink to be used.

  • \p{HorizSpace}

    This is the same as \h and \p{Blank}: a character that changes the spacing horizontally.

  • \p{In=*}

    This is a synonym for \p{Present_In=*}

  • \p{PerlSpace}

    This is the same as \s, restricted to ASCII, namely [ \f\n\r\t].

    Mnemonic: Perl's (original) space

  • \p{PerlWord}

    This is the same as \w, restricted to ASCII, namely [A-Za-z0-9_]

    Mnemonic: Perl's (original) word.

  • \p{Posix...}

    There are several of these, which are equivalents using the \p notation for Posix classes and are described in POSIX Character Classes in perlrecharclass.

  • \p{Present_In: *} (Short: \p{In=*})

    This property is used when you need to know in what Unicode version(s) a character is.

    The "*" above stands for some two digit Unicode version number, such as 1.1 or 4.0; or the "*" can also be Unassigned. This property will match the code points whose final disposition has been settled as of the Unicode release given by the version number; \p{Present_In: Unassigned} will match those code points whose meaning has yet to be assigned.

    For example, U+0041 "LATIN CAPITAL LETTER A" was present in the very first Unicode release available, which is 1.1, so this property is true for all valid "*" versions. On the other hand, U+1EFF was not assigned until version 5.1 when it became "LATIN SMALL LETTER Y WITH LOOP", so the only "*" that would match it are 5.1, 5.2, and later.

    Unicode furnishes the Age property from which this is derived. The problem with Age is that a strict interpretation of it (which Perl takes) has it matching the precise release a code point's meaning is introduced in. Thus U+0041 would match only 1.1; and U+1EFF only 5.1. This is not usually what you want.

    Some non-Perl implementations of the Age property may change its meaning to be the same as the Perl Present_In property; just be aware of that.

    Another confusion with both these properties is that the definition is not that the code point has been assigned, but that the meaning of the code point has been determined. This is because 66 code points will always be unassigned, and so the Age for them is the Unicode version in which the decision to make them so was made. For example, U+FDD0 is to be permanently unassigned to a character, and the decision to do that was made in version 3.1, so \p{Age=3.1} matches this character, as also does \p{Present_In: 3.1} and up.

  • \p{Print}

    This matches any character that is graphical or blank, except controls.

  • \p{SpacePerl}

    This is the same as \s, including beyond ASCII.

    Mnemonic: Space, as modified by Perl. (It doesn't include the vertical tab which both the Posix standard and Unicode consider white space.)

  • \p{Title} and \p{Titlecase}

    Under case-sensitive matching, these both match the same code points as \p{General Category=Titlecase_Letter} (\p{gc=lt}). The difference is that under /i caseless matching, these match the same as \p{Cased}, whereas \p{gc=lt} matches \p{Cased_Letter}.

  • \p{VertSpace}

    This is the same as \v: A character that changes the spacing vertically.

  • \p{Word}

    This is the same as \w, including over 100_000 characters beyond ASCII.

  • \p{XPosix...}

    There are several of these, which are the standard Posix classes extended to the full Unicode range. They are described in POSIX Character Classes in perlrecharclass.
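The difference between \p{Present_In: *} and \p{Age=*} described in the list above can be checked directly. The following is a minimal sketch, assuming a Perl whose bundled Unicode version is at least 5.1; the variable names are illustrative only.

```perl
use strict;
use warnings;

# U+0041 dates from Unicode 1.1; U+1EFF was first assigned in 5.1.
my $cap_a  = "A";
my $y_loop = "\x{1EFF}";

# Present_In is cumulative: true for the named release and every later one.
my $a_present_in_51 = $cap_a  =~ /\p{Present_In: 5.1}/ ? 1 : 0;
my $y_present_in_50 = $y_loop =~ /\p{Present_In: 5.0}/ ? 1 : 0;
my $y_present_in_51 = $y_loop =~ /\p{Present_In: 5.1}/ ? 1 : 0;

# Age is exact: true only for the release that introduced the code point.
my $a_age_51 = $cap_a  =~ /\p{Age=5.1}/ ? 1 : 0;
my $y_age_51 = $y_loop =~ /\p{Age=5.1}/ ? 1 : 0;

print "A:      Present_In 5.1 = $a_present_in_51, Age 5.1 = $a_age_51\n";
print "U+1EFF: Present_In 5.0 = $y_present_in_50, ",
      "Present_In 5.1 = $y_present_in_51, Age 5.1 = $y_age_51\n";
```

In most code Present_In is the property you want, since it answers "did this character exist as of release N?".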

User-Defined Character Properties

You can define your own binary character properties by defining subroutines whose names begin with "In" or "Is". The subroutines can be defined in any package. The user-defined properties can be used in the regular expression \p and \P constructs; if you are using a user-defined property from a package other than the one you are in, you must specify its package in the \p or \P construct.

    # assuming property IsForeign defined in Lang::
    package main;  # property package name required
    if ($txt =~ /\p{Lang::IsForeign}+/) { ... }

    package Lang;  # property package name not required
    if ($txt =~ /\p{IsForeign}+/) { ... }

Note that the effect is compile-time and immutable once defined. However, the subroutines are passed a single parameter, which is 0 if case-sensitive matching is in effect and non-zero if caseless matching is in effect. The subroutine may return different values depending on the value of the flag, and one set of values will immutably be in effect for all case-sensitive matches, and the other set for all case-insensitive matches.

Note that if the regular expression is tainted, then Perl will die rather than calling the subroutine, where the name of the subroutine is determined by the tainted data.

The subroutines must return a specially-formatted string, with one or more newline-separated lines. Each line must be one of the following:

  • A single hexadecimal number denoting a Unicode code point to include.

  • Two hexadecimal numbers separated by horizontal whitespace (space or tab characters) denoting a range of Unicode code points to include.

  • Something to include, prefixed by "+": a built-in character property (prefixed by "utf8::") or a fully qualified (including package name) user-defined character property, to represent all the characters in that property; two hexadecimal code points for a range; or a single hexadecimal code point.

  • Something to exclude, prefixed by "-": an existing character property (prefixed by "utf8::") or a fully qualified (including package name) user-defined character property, to represent all the characters in that property; two hexadecimal code points for a range; or a single hexadecimal code point.

  • Something to negate, prefixed by "!": an existing character property (prefixed by "utf8::") or a fully qualified (including package name) user-defined character property, to represent all the characters in that property; two hexadecimal code points for a range; or a single hexadecimal code point.

  • Something to intersect with, prefixed by "&": an existing character property (prefixed by "utf8::") or a fully qualified (including package name) user-defined character property, so that only the characters that are also in that property are kept; two hexadecimal code points for a range; or a single hexadecimal code point.

For example, to define a property that covers both the Japanese syllabaries (hiragana and katakana), you can define

    sub InKana {
        return <<END;
    3040\t309F
    30A0\t30FF
    END
    }

Imagine that the here-doc end marker is at the beginning of the line. Now you can use \p{InKana} and \P{InKana}.
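As a usage sketch, the InKana definition above can be exercised as follows. Here the return string is written with explicit "\t" and "\n" escapes rather than a here-doc; the sample text and the counter variables are illustrative assumptions.

```perl
use strict;
use warnings;

# Same ranges as the here-doc version: two "start<TAB>end" lines.
sub InKana {
    return "3040\t309F\n"     # Hiragana block
         . "30A0\t30FF\n";    # Katakana block
}

my $hiragana_a  = "\x{3042}";    # HIRAGANA LETTER A
my $katakana_ka = "\x{30AB}";    # KATAKANA LETTER KA
my $text        = $hiragana_a . $katakana_ka . "-abc";

# Count the characters inside and outside the user-defined property.
my $kana_count     = () = $text =~ /\p{InKana}/g;
my $non_kana_count = () = $text =~ /\P{InKana}/g;

print "kana: $kana_count, other: $non_kana_count\n";
```

Because the subroutine lives in package main and the patterns are compiled there too, no package qualifier is needed inside \p{...}.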

You could also have used the existing block property names:

    sub InKana {
        return <<'END';
    +utf8::InHiragana
    +utf8::InKatakana
    END
    }

Suppose you wanted to match only the allocated characters, not the raw block ranges: in other words, you want to remove the non-characters:

    sub InKana {
        return <<'END';
    +utf8::InHiragana
    +utf8::InKatakana
    -utf8::IsCn
    END
    }

The negation is useful for defining (surprise!) negated classes.

    sub InNotKana {
        return <<'END';
    !utf8::InHiragana
    -utf8::InKatakana
    +utf8::IsCn
    END
    }

This will match all non-Unicode code points, since every one of them is not in Kana. You can use intersection to exclude these, if desired, as this modified example shows:

    sub InNotKana {
        return <<'END';
    !utf8::InHiragana
    -utf8::InKatakana
    +utf8::IsCn
    &utf8::Any
    END
    }

&utf8::Any must be the last line in the definition.

Intersection is used generally for getting the common characters matched by two (or more) classes. It's important to remember not to use "&" for the first set; that would be intersecting with nothing, resulting in an empty set.
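To illustrate intersection on its own, here is a minimal sketch that keeps only the unassigned code points inside the two kana blocks. The property name InKanaUnassigned is hypothetical; the "utf8::" names are the ones already used in the examples above.

```perl
use strict;
use warnings;

# Union of the two blocks first, then intersect with Unassigned (Cn).
# The "&" line comes last, never first.
sub InKanaUnassigned {
    return <<'END';
+utf8::InHiragana
+utf8::InKatakana
&utf8::IsCn
END
}

my $reserved = "\x{3040}";    # inside the Hiragana block, but unassigned
my $letter   = "\x{3042}";    # HIRAGANA LETTER A, assigned

my $reserved_matches = $reserved =~ /\p{InKanaUnassigned}/ ? 1 : 0;
my $letter_matches   = $letter   =~ /\p{InKanaUnassigned}/ ? 1 : 0;
my $ascii_matches    = "A"       =~ /\p{InKanaUnassigned}/ ? 1 : 0;

print "U+3040: $reserved_matches, U+3042: $letter_matches, A: $ascii_matches\n";
```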

(Note that official Unicode properties differ from these in that they automatically exclude non-Unicode code points and a warning is raised if a match is attempted on one of those.)

User-Defined Case Mappings (for serious hackers only)

This feature has been removed as of Perl 5.16. The CPAN module Unicode::Casing provides better functionality without the drawbacks that this feature had. If you are using a Perl earlier than 5.16, this feature was most fully documented in the 5.14 version of this pod: http://perldoc.perl.org/5.14.0/perlunicode.html#User-Defined-Case-Mappings-%28for-serious-hackers-only%29

Character Encodings for Input and Output

See Encode.

Unicode Regular Expression Support Level

The following list of Unicode supported features for regular expressions describes all features currently directly supported by core Perl. The references to "Level N" and the section numbers refer to the Unicode Technical Standard #18, "Unicode Regular Expressions", version 13, from August 2008.

  • Level 1 - Basic Unicode Support

        RL1.1   Hex Notation                     - done          [1]
        RL1.2   Properties                       - done          [2][3]
        RL1.2a  Compatibility Properties         - done          [4]
        RL1.3   Subtraction and Intersection     - MISSING       [5]
        RL1.4   Simple Word Boundaries           - done          [6]
        RL1.5   Simple Loose Matches             - done          [7]
        RL1.6   Line Boundaries                  - MISSING       [8][9]
        RL1.7   Supplementary Code Points        - done          [10]

        [1]  \x{...}
        [2]  \p{...} \P{...}
        [3]  supports not only minimal list, but all Unicode character
             properties (see Unicode Character Properties above)
        [4]  \d \D \s \S \w \W \X [:prop:] [:^prop:]
        [5]  can use regular expression look-ahead [a] or
             user-defined character properties [b] to emulate set
             operations
        [6]  \b \B
        [7]  note that Perl does Full case-folding in matching (but with
             bugs), not Simple: for example U+1F88 is equivalent to
             U+1F00 U+03B9, instead of just U+1F80.  This difference
             matters mainly for certain Greek capital letters with certain
             modifiers: the Full case-folding decomposes the letter,
             while the Simple case-folding would map it to a single
             character.
        [8]  should do ^ and $ also on U+000B (\v in C), FF (\f), CR
             (\r), CRLF (\r\n), NEL (U+0085), LS (U+2028), and PS
             (U+2029); should also affect <>, $., and script line
             numbers; should not split lines within CRLF [c] (i.e. there
             is no empty line between \r and \n)
        [9]  Linebreaking conformant with UAX#14 "Unicode Line Breaking
             Algorithm" is available through the Unicode::LineBreak
             module.
        [10] UTF-8/UTF-EBCDIC used in Perl allows not only U+10000 to
             U+10FFFF but also beyond U+10FFFF

    [a] You can mimic class subtraction using lookahead. For example, what UTS#18 might write as

        [{Greek}-[{UNASSIGNED}]]

    in Perl can be written as:

        (?!\p{Unassigned})\p{InGreekAndCoptic}
        (?=\p{Assigned})\p{InGreekAndCoptic}

    But in this particular example, you probably really want

        \p{Greek}

    which will match assigned characters known to be part of the Greek script.

    Also see the Unicode::Regex::Set module; it does implement the full UTS#18 grouping, intersection, union, and removal (subtraction) syntax.

    [b] '+' for union, '-' for removal (set-difference), '&' for intersection (see User-Defined Character Properties)

    [c] Try the :crlf layer (see PerlIO).

  • Level 2 - Extended Unicode Support

        RL2.1   Canonical Equivalents           - MISSING       [10][11]
        RL2.2   Default Grapheme Clusters       - MISSING       [12]
        RL2.3   Default Word Boundaries         - MISSING       [14]
        RL2.4   Default Loose Matches           - MISSING       [15]
        RL2.5   Name Properties                 - DONE
        RL2.6   Wildcard Properties             - MISSING

        [10] see UAX#15 "Unicode Normalization Forms"
        [11] have Unicode::Normalize but not integrated to regexes
        [12] have \X but we don't have a "Grapheme Cluster Mode"
        [14] see UAX#29, Word Boundaries
        [15] This is covered in Chapter 3.13 (in Unicode 6.0)

  • Level 3 - Tailored Support

        RL3.1   Tailored Punctuation            - MISSING
        RL3.2   Tailored Grapheme Clusters      - MISSING       [17][18]
        RL3.3   Tailored Word Boundaries        - MISSING
        RL3.4   Tailored Loose Matches          - MISSING
        RL3.5   Tailored Ranges                 - MISSING
        RL3.6   Context Matching                - MISSING       [19]
        RL3.7   Incremental Matches             - MISSING
      ( RL3.8   Unicode Set Sharing )
        RL3.9   Possible Match Sets             - MISSING
        RL3.10  Folded Matching                 - MISSING       [20]
        RL3.11  Submatchers                     - MISSING

        [17] see UTS#10 "Unicode Collation Algorithm"
        [18] have Unicode::Collate but not integrated to regexes
        [19] have (?<=x) and (?=x), but look-aheads or look-behinds
             should see outside of the target substring
        [20] need insensitive matching for linguistic features other
             than case; for example, hiragana to katakana, wide and
             narrow, simplified Han to traditional Han (see UTR#30
             "Character Foldings")
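The look-ahead emulation of class subtraction from note [a] under Level 1 can be tried out as below. This is a sketch only: it spells the block with the long form \p{Block: Greek_And_Coptic}, and uses U+03A2, an unassigned code point inside that block, as the test case.

```perl
use strict;
use warnings;

# "Greek and Coptic block, minus unassigned code points", via look-ahead.
my $block          = qr/\p{Block: Greek_And_Coptic}/;
my $assigned_greek = qr/(?!\p{Unassigned})\p{Block: Greek_And_Coptic}/;

my $alpha    = "\x{03B1}";    # GREEK SMALL LETTER ALPHA
my $reserved = "\x{03A2}";    # unassigned gap in the block

my $alpha_in_set     = $alpha    =~ $assigned_greek ? 1 : 0;
my $reserved_in_set  = $reserved =~ $assigned_greek ? 1 : 0;
my $reserved_in_blk  = $reserved =~ $block          ? 1 : 0;

print "alpha: $alpha_in_set, U+03A2: $reserved_in_set ",
      "(in block: $reserved_in_blk)\n";
```

The negative look-ahead consumes nothing, so the property that follows it still matches exactly one character.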

Unicode Encodings

Unicode characters are assigned to code points, which are abstract numbers. To use these numbers, various encodings are needed.

  • UTF-8

    UTF-8 is a variable-length (1 to 4 bytes), byte-order independent encoding. For ASCII (and we really do mean 7-bit ASCII, not another 8-bit encoding), UTF-8 is transparent.

    The following table is from Unicode 3.2.

         Code Points            1st Byte  2nd Byte  3rd Byte  4th Byte

           U+0000..U+007F       00..7F
           U+0080..U+07FF     * C2..DF    80..BF
           U+0800..U+0FFF       E0      * A0..BF    80..BF
           U+1000..U+CFFF       E1..EC    80..BF    80..BF
           U+D000..U+D7FF       ED        80..9F    80..BF
           U+D800..U+DFFF       +++++ utf16 surrogates, not legal utf8 +++++
           U+E000..U+FFFF       EE..EF    80..BF    80..BF
          U+10000..U+3FFFF      F0      * 90..BF    80..BF    80..BF
          U+40000..U+FFFFF      F1..F3    80..BF    80..BF    80..BF
         U+100000..U+10FFFF     F4        80..8F    80..BF    80..BF

    Note the gaps marked by "*" before several of the byte entries above. These are caused by legal UTF-8 avoiding non-shortest encodings: it is technically possible to UTF-8-encode a single code point in different ways, but that is explicitly forbidden, and the shortest possible encoding should always be used (and that is what Perl does).

    Another way to look at it is via bits:

                        Code Points  1st Byte  2nd Byte  3rd Byte  4th Byte

                           0aaaaaaa  0aaaaaaa
                   00000bbbbbaaaaaa  110bbbbb  10aaaaaa
                   ccccbbbbbbaaaaaa  1110cccc  10bbbbbb  10aaaaaa
         00000dddccccccbbbbbbaaaaaa  11110ddd  10cccccc  10bbbbbb  10aaaaaa

    As you can see, the continuation bytes all begin with "10", and the leading bits of the start byte tell how many bytes there are in the encoded character.

    The original UTF-8 specification allowed up to 6 bytes, to allow encoding of numbers up to 0x7FFF_FFFF. Perl continues to allow those, and has extended that up to 13 bytes to encode code points up to what can fit in a 64-bit word. However, Perl will warn if you output any of these as being non-portable; and under strict UTF-8 input protocols, they are forbidden.

    The Unicode non-character code points are also disallowed in UTF-8 in "open interchange". See Non-character code points.

  • UTF-EBCDIC

    Like UTF-8 but EBCDIC-safe, in the way that UTF-8 is ASCII-safe.

  • UTF-16, UTF-16BE, UTF-16LE, Surrogates, and BOMs (Byte Order Marks)

    The following items are mostly for reference and general Unicode knowledge; Perl doesn't use these constructs internally.

    Like UTF-8, UTF-16 is a variable-width encoding, but where UTF-8 uses 8-bit code units, UTF-16 uses 16-bit code units. All code points occupy either 2 or 4 bytes in UTF-16: code points U+0000..U+FFFF are stored in a single 16-bit unit, and code points U+10000..U+10FFFF in two 16-bit units. The latter case is using surrogates, the first 16-bit unit being the high surrogate, and the second being the low surrogate.

    Surrogates are code points set aside to encode the U+10000..U+10FFFF range of Unicode code points in pairs of 16-bit units. The high surrogates are the range U+D800..U+DBFF and the low surrogates are the range U+DC00..U+DFFF. The surrogate encoding is

        # int() is needed because Perl's "/" is not integer division
        $hi = int(($uni - 0x10000) / 0x400) + 0xD800;
        $lo = ($uni - 0x10000) % 0x400 + 0xDC00;

    and the decoding is

        $uni = 0x10000 + ($hi - 0xD800) * 0x400 + ($lo - 0xDC00);

    Because of the 16-bitness, UTF-16 is byte-order dependent. UTF-16 itself can be used for in-memory computations, but if storage or transfer is required either UTF-16BE (big-endian) or UTF-16LE (little-endian) encodings must be chosen.

    This introduces another problem: what if you just know that your data is UTF-16, but you don't know which endianness? Byte Order Marks, or BOMs, are a solution to this. A special character has been reserved in Unicode to function as a byte order marker: the character with the code point U+FEFF is the BOM.

    The trick is that if you read a BOM, you will know the byte order, since if it was written on a big-endian platform, you will read the bytes 0xFE 0xFF, but if it was written on a little-endian platform, you will read the bytes 0xFF 0xFE. (And if the originating platform was writing in UTF-8, you will read the bytes 0xEF 0xBB 0xBF.)

    The way this trick works is that the character with the code point U+FFFE is not supposed to be in input streams, so the sequence of bytes 0xFF 0xFE is unambiguously "BOM, represented in little-endian format" and cannot be "U+FFFE, represented in big-endian format".

    Surrogates have no meaning in Unicode outside their use in pairs to represent other code points. However, Perl allows them to be represented individually internally, for example by saying chr(0xD801), so that all code points, not just those valid for open interchange, are representable. Unicode does define semantics for them, such as their General Category is "Cs". But because their use is somewhat dangerous, Perl will warn (using the warning category "surrogate", which is a sub-category of "utf8") if an attempt is made to do things like take the lower case of one, or match case-insensitively, or to output them. (But don't try this on Perls before 5.14.)

  • UTF-32, UTF-32BE, UTF-32LE

    The UTF-32 family is pretty much like the UTF-16 family, except that the units are 32-bit, and therefore the surrogate scheme is not needed. UTF-32 is a fixed-width encoding. The BOM signatures are 0x00 0x00 0xFE 0xFF for BE and 0xFF 0xFE 0x00 0x00 for LE.

  • UCS-2, UCS-4

    Legacy, fixed-width encodings defined by the ISO 10646 standard. UCS-2 is a 16-bit encoding. Unlike UTF-16, UCS-2 is not extensible beyond U+FFFF, because it does not use surrogates. UCS-4 is a 32-bit encoding, functionally identical to UTF-32 (the difference being that UCS-4 forbids neither surrogates nor code points larger than 0x10_FFFF).

  • UTF-7

    A seven-bit safe (non-eight-bit) encoding, which is useful if the transport or storage is not eight-bit safe. Defined by RFC 2152.
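The surrogate arithmetic and the UTF-8 bit layout described in the list above can be cross-checked against Perl's own encoders. This is a minimal sketch: the helper names to_surrogates and from_surrogates are hypothetical, the shifts and masks are equivalent to the division and modulus by 0x400 in the formulas, and U+1F600 is just a convenient supplementary code point.

```perl
use strict;
use warnings;
use Encode ();

# Split a supplementary code point (U+10000..U+10FFFF) into a surrogate pair.
sub to_surrogates {
    my ($uni) = @_;
    my $offset = $uni - 0x10000;
    my $hi = ($offset >> 10)   + 0xD800;    # top 10 bits
    my $lo = ($offset & 0x3FF) + 0xDC00;    # bottom 10 bits
    return ($hi, $lo);
}

# Recombine a surrogate pair into the original code point.
sub from_surrogates {
    my ($hi, $lo) = @_;
    return 0x10000 + (($hi - 0xD800) << 10) + ($lo - 0xDC00);
}

my $code_point = 0x1F600;
my ($hi, $lo)  = to_surrogates($code_point);
my $round_trip = from_surrogates($hi, $lo);

# What Encode produces for the same character in UTF-16BE.
my $utf16be = Encode::encode("UTF-16BE", chr($code_point));

# The UTF-8 bytes of the same character, as hex.
my $utf8_bytes = chr($code_point);
utf8::encode($utf8_bytes);
my $utf8_hex = join " ", map { sprintf "%02X", ord } split //, $utf8_bytes;

printf "U+%X -> %04X %04X -> U+%X\n", $code_point, $hi, $lo, $round_trip;
print  "UTF-8 bytes: $utf8_hex\n";
```

The four UTF-8 bytes follow the last row of the bit table: a start byte beginning with 11110, then three continuation bytes beginning with 10.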

Non-character code points

66 code points are set aside in Unicode as "non-character code points". These all have the Unassigned (Cn) General Category, and they never will be assigned. These are never supposed to be in legal Unicode input streams, so that code can use them as sentinels that can be mixed in with character data, and they always will be distinguishable from that data. To keep them out of Perl input streams, strict UTF-8 should be specified, such as by using the layer :encoding('UTF-8'). The non-character code points are the 32 between U+FDD0 and U+FDEF, and the 34 code points U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, ... U+10FFFE, U+10FFFF. Some people are under the mistaken impression that these are "illegal", but that is not true. An application or cooperating set of applications can legally use them at will internally; but these code points are "illegal for open interchange". Therefore, Perl will not accept these from input streams unless lax rules are being used, and will warn (using the warning category "nonchar", which is a sub-category of "utf8") if an attempt is made to output them.
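The count of 66 follows directly from the two ranges given above. The sketch below enumerates them with plain integer arithmetic; the helper is_nonchar is a hypothetical name, not part of any Perl API.

```perl
use strict;
use warnings;

# The 32 contiguous non-characters U+FDD0..U+FDEF ...
my @nonchars = (0xFDD0 .. 0xFDEF);

# ... plus the last two code points of each of the 17 planes.
for my $plane (0 .. 16) {
    push @nonchars, ($plane << 16) | 0xFFFE, ($plane << 16) | 0xFFFF;
}

# Arithmetic test for a single code point (given as an integer).
sub is_nonchar {
    my ($cp) = @_;
    return (   ($cp >= 0xFDD0 && $cp <= 0xFDEF)
            || ($cp <= 0x10FFFF && ($cp & 0xFFFE) == 0xFFFE)) ? 1 : 0;
}

my $total = scalar @nonchars;
print "non-character code points: $total\n";
```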

Beyond Unicode code points

The maximum Unicode code point is U+10FFFF. But Perl accepts code points up to the maximum permissible unsigned number available on the platform. However, Perl will not accept these from input streams unless lax rules are being used, and will warn (using the warning category "non_unicode", which is a sub-category of "utf8") if an attempt is made to operate on or output them. For example, uc(chr(0x11_0000)) will generate this warning, returning the input parameter as its result, as the upper case of every non-Unicode code point is the code point itself.

Security Implications of Unicode

Read Unicode Security Considerations. Also, note the following:

  • Malformed UTF-8

    Unfortunately, the original specification of UTF-8 leaves some room for interpretation of how many bytes of encoded output one should generate from one input Unicode character. Strictly speaking, the shortest possible sequence of UTF-8 bytes should be generated, because otherwise there is potential for an input buffer overflow at the receiving end of a UTF-8 connection. Perl always generates the shortest length UTF-8, and with warnings on, Perl will warn about non-shortest length UTF-8 along with other malformations, such as the surrogates, which are not Unicode code points valid for interchange.

  • Regular expression pattern matching may surprise you if you're not accustomed to Unicode. Starting in Perl 5.14, several pattern modifiers are available to control this, called the character set modifiers. Details are given in Character set modifiers in perlre.

As discussed elsewhere, Perl has one foot (two hooves?) planted in each of two worlds: the old world of bytes and the new world of characters, upgrading from bytes to characters when necessary. If your legacy code does not explicitly use Unicode, no automatic switch-over to characters should happen. Characters shouldn't get downgraded to bytes, either. It is possible to accidentally mix bytes and characters, however (see perluniintro), in which case \w in regular expressions might start behaving differently (unless the /a modifier is in effect). Review your code. Use warnings and the strict pragma.

Unicode in Perl on EBCDIC

The way Unicode is handled on EBCDIC platforms is still experimental. On such platforms, references to UTF-8 encoding in this document and elsewhere should be read as meaning the UTF-EBCDIC specified in Unicode Technical Report 16, unless ASCII vs. EBCDIC issues are specifically discussed. There is no utfebcdic pragma or ":utfebcdic" layer; rather, "utf8" and ":utf8" are reused to mean the platform's "natural" 8-bit encoding of Unicode. See perlebcdic for more discussion of the issues.

Locales

See Unicode and UTF-8 in perllocale

When Unicode Does Not Happen

While Perl does have extensive ways to input and output in Unicode, and a few other "entry points" like the @ARGV array (which can sometimes be interpreted as UTF-8), there are still many places where Unicode (in some encoding or another) could be given as arguments or received as results, or both, but it is not.

The following are such interfaces. Also, see The Unicode Bug. For all of these interfaces Perl currently (as of 5.8.3) simply assumes byte strings both as arguments and results, or UTF-8 strings if the (problematic) encoding pragma has been used.

One reason that Perl does not attempt to resolve the role of Unicode in these situations is that the answers are highly dependent on the operating system and the file system(s). For example, whether filenames can be in Unicode and in exactly what kind of encoding, is not exactly a portable concept. Similarly for qx and system: how well will the "command-line interface" (and which of them?) handle Unicode?

  • chdir, chmod, chown, chroot, exec, link, lstat, mkdir, rename, rmdir, stat, symlink, truncate, unlink, utime, -X

  • %ENV

  • glob (aka the <*>)

  • open, opendir, sysopen

  • qx (aka the backtick operator), system

  • readdir, readlink

The "Unicode Bug"

The term, "Unicode bug" has been applied to an inconsistency on ASCII platforms with the Unicode code points in the Latin-1 Supplement block, that is, between 128 and 255. Without a locale specified, unlike all other characters or code points, these characters have very different semantics in byte semantics versus character semantics, unless use feature 'unicode_strings' is specified, directly or indirectly. (It is indirectly specified by a use v5.12 or higher.)

In character semantics these upper-Latin1 characters are interpreted as Unicode code points, which means they have the same semantics as Latin-1 (ISO-8859-1).

In byte semantics (without unicode_strings), they are considered to be unassigned characters, meaning that the only semantics they have is their ordinal numbers, and that they are not members of various character classes. None are considered to match \w for example, but all match \W.

Perl 5.12.0 added unicode_strings to force character semantics on these code points in some circumstances, which fixed portions of the bug; Perl 5.14.0 fixed almost all of it; and Perl 5.16.0 fixed the remainder (so far as we know, anyway). The lesson here is to enable unicode_strings to avoid the headaches described below.

The old, problematic behavior affects these areas:

  • Changing the case of a scalar, that is, using uc(), ucfirst(), lc(), and lcfirst(), or \L, \U, \u and \l in double-quotish contexts, such as regular expression substitutions. Under unicode_strings starting in Perl 5.12.0, character semantics are generally used. See lc for details on how this works in combination with various other pragmas.

  • Using caseless (/i) regular expression matching. Starting in Perl 5.14.0, regular expressions compiled within the scope of unicode_strings use character semantics even when executed or compiled into larger regular expressions outside the scope.

  • Matching any of several properties in regular expressions, namely \b, \B, \s, \S, \w, \W, and all the Posix character classes except [[:ascii:]]. Starting in Perl 5.14.0, regular expressions compiled within the scope of unicode_strings use character semantics even when executed or compiled into larger regular expressions outside the scope.

  • In quotemeta or its inline equivalent \Q, no code points above 127 are quoted in UTF-8 encoded strings, but in byte encoded strings, code points between 128-255 are always quoted. Starting in Perl 5.16.0, consistent quoting rules are used within the scope of unicode_strings, as described in quotemeta.

This behavior can lead to unexpected results in which a string's semantics suddenly change if a code point above 255 is appended to or removed from it, which changes the string's semantics from byte to character or vice versa. As an example, consider the following program and its output:

    $ perl -le'
        no feature "unicode_strings";
        $s1 = "\xC2";
        $s2 = "\x{2660}";
        for ($s1, $s2, $s1.$s2) {
            print /\w/ || 0;
        }
    '
    0
    0
    1

If there's no \w in s1 or in s2, why does their concatenation have one?

This anomaly stems from Perl's attempt to not disturb older programs that didn't use Unicode, and hence had no semantics for characters outside of the ASCII range (except in a locale), along with Perl's desire to add Unicode support seamlessly. The result wasn't seamless: these characters were orphaned.

For Perls earlier than those described above, or when a string is passed to a function outside the subpragma's scope, a workaround is to always call utf8::upgrade($string), or to use the standard module Encode. Also, a scalar that has any characters whose ordinal is 0x100 or above, or which were specified using either of the \N{...} notations, will automatically have character semantics.
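The utf8::upgrade() workaround can be seen in a few lines. This is a minimal sketch that deliberately turns unicode_strings off to reproduce the old behavior; the variable names are illustrative.

```perl
use strict;
use warnings;
no feature 'unicode_strings';    # reproduce the legacy byte semantics

# U+00E9 (e with acute), stored as a single byte.
my $e_acute = "\xE9";

# Byte semantics: the character is treated as unassigned, so \w fails.
my $before = $e_acute =~ /\w/ ? 1 : 0;

# Same character, but now stored internally as UTF-8.
utf8::upgrade($e_acute);

# Character semantics: U+00E9 is a letter, so \w succeeds.
my $after = $e_acute =~ /\w/ ? 1 : 0;

print "before upgrade: $before, after upgrade: $after\n";
```

The string compares equal to itself before and after the call; only the internal representation, and with it the matching rules, has changed.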

Forcing Unicode in Perl (Or Unforcing Unicode in Perl)

Sometimes (see When Unicode Does Not Happen or The Unicode Bug) there are situations where you simply need to force a byte string into UTF-8, or vice versa. The low-level calls utf8::upgrade($bytestring) and utf8::downgrade($utf8string[, FAIL_OK]) are the answers.

Note that utf8::downgrade() can fail if the string contains characters that don't fit into a byte.

Calling either function on a string that already is in the desired state is a no-op.
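A short round trip shows both calls and the failure case. This sketch uses utf8::is_utf8() only to inspect the internal flag; the sample strings are arbitrary.

```perl
use strict;
use warnings;

my $s = "caf\xE9";    # four characters, stored as four bytes

utf8::upgrade($s);
my $flag_after_upgrade   = utf8::is_utf8($s) ? 1 : 0;
my $length_after_upgrade = length $s;    # still counted in characters

utf8::downgrade($s);
my $flag_after_downgrade = utf8::is_utf8($s) ? 1 : 0;

# A character above 0xFF cannot be stored in one byte.  With a true
# FAIL_OK argument, downgrade returns false instead of dying.
my $wide          = "\x{263A}";
my $downgraded_ok = utf8::downgrade($wide, 1) ? 1 : 0;

print "upgraded: flag=$flag_after_upgrade length=$length_after_upgrade\n";
print "downgraded: flag=$flag_after_downgrade\n";
print "wide string downgraded: $downgraded_ok\n";
```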

Using Unicode in XS

If you want to handle Perl Unicode in XS extensions, you may find the following C APIs useful. See also Unicode Support in perlguts for an explanation about Unicode at the XS level, and perlapi for the API details.

  • DO_UTF8(sv) returns true if the UTF8 flag is on and the bytes pragma is not in effect. SvUTF8(sv) returns true if the UTF8 flag is on; the bytes pragma is ignored. The UTF8 flag being on does not mean that there are any characters of code points greater than 255 (or 127) in the scalar or that there are even any characters in the scalar. What the UTF8 flag means is that the sequence of octets in the representation of the scalar is the sequence of UTF-8 encoded code points of the characters of a string. The UTF8 flag being off means that each octet in this representation encodes a single character with code point 0..255 within the string. Perl's Unicode model is not to use UTF-8 until it is absolutely necessary.

  • uvchr_to_utf8(buf, chr) writes a Unicode character code point into a buffer encoding the code point as UTF-8, and returns a pointer pointing after the UTF-8 bytes. It works appropriately on EBCDIC machines.

  • utf8_to_uvchr_buf(buf, bufend, lenp) reads UTF-8 encoded bytes from a buffer and returns the Unicode character code point and, optionally, the length of the UTF-8 byte sequence. It works appropriately on EBCDIC machines.

  • utf8_length(start, end) returns the length of the UTF-8 encoded buffer in characters. sv_len_utf8(sv) returns the length of the UTF-8 encoded scalar.

  • sv_utf8_upgrade(sv) converts the string of the scalar to its UTF-8 encoded form. sv_utf8_downgrade(sv) does the opposite, if possible. sv_utf8_encode(sv) is like sv_utf8_upgrade except that it does not set the UTF8 flag. sv_utf8_decode() does the opposite of sv_utf8_encode(). Note that none of these are to be used as general-purpose encoding or decoding interfaces: use Encode for that. sv_utf8_upgrade() is affected by the encoding pragma but sv_utf8_downgrade() is not (since the encoding pragma is designed to be a one-way street).

  • is_utf8_string(buf, len) returns true if len bytes of the buffer are valid UTF-8.

  • is_utf8_char(s) returns true if the pointer points to a valid UTF-8 character. However, this function should not be used because of security concerns. Instead, use is_utf8_string().

  • UTF8SKIP(buf) will return the number of bytes in the UTF-8 encoded character in the buffer. UNISKIP(chr) will return the number of bytes required to UTF-8-encode the Unicode character code point. UTF8SKIP() is useful, for example, for iterating over the characters of a UTF-8 encoded buffer; UNISKIP() is useful, for example, in computing the size required for a UTF-8 encoded buffer.

  • utf8_distance(a, b) will tell the distance in characters between the two pointers pointing to the same UTF-8 encoded buffer.

  • utf8_hop(s, off) will return a pointer to a UTF-8 encoded buffer that is off (positive or negative) Unicode characters displaced from the UTF-8 buffer s. Be careful not to overstep the buffer: utf8_hop() will merrily run off the end or the beginning of the buffer if told to do so.

  • pv_uni_display(dsv, spv, len, pvlim, flags) and sv_uni_display(dsv, ssv, pvlim, flags) are useful for debugging the output of Unicode strings and scalars. By default they are useful only for debugging--they display all characters as hexadecimal code points--but with the flags UNI_DISPLAY_ISPRINT, UNI_DISPLAY_BACKSLASH, and UNI_DISPLAY_QQ you can make the output more readable.

  • foldEQ_utf8(s1, pe1, l1, u1, s2, pe2, l2, u2) can be used to compare two strings case-insensitively in Unicode. For case-sensitive comparisons you can just use memEQ() and memNE() as usual, except if one string is in utf8 and the other isn't.

For more information, see perlapi, and utf8.c and utf8.h in the Perl source code distribution.

Hacking Perl to work on earlier Unicode versions (for very serious hackers only)

Perl by default comes with the latest supported Unicode version built in, but you can change to use any earlier one.

Download the files in the desired version of Unicode from the Unicode web site (http://www.unicode.org). These should replace the existing files in lib/unicore in the Perl source tree. Follow the instructions in README.perl in that directory to change some of their names, and then build perl (see INSTALL).

BUGS

Interaction with Locales

See Unicode and UTF-8 in perllocale

Problems with characters in the Latin-1 Supplement range

See The Unicode Bug

Interaction with Extensions

When Perl exchanges data with an extension, the extension should be able to understand the UTF8 flag and act accordingly. If the extension doesn't recognize that flag, it's likely that the extension will return incorrectly-flagged data.

So if you're working with Unicode data, consult the documentation of every module you're using if there are any issues with Unicode data exchange. If the documentation does not talk about Unicode at all, suspect the worst and probably look at the source to learn how the module is implemented. Modules written completely in Perl shouldn't cause problems. Modules that directly or indirectly access code written in other programming languages are at risk.

For affected functions, the simple strategy to avoid data corruption is to always make the encoding of the exchanged data explicit. Choose an encoding that you know the extension can handle. Convert arguments passed to the extensions to that encoding and convert results back from that encoding. Write wrapper functions that do the conversions for you, so you can later change the functions when the extension catches up.

To provide an example, let's say the popular Foo::Bar::escape_html function doesn't deal with Unicode data yet. The wrapper function would convert the argument to raw UTF-8 and convert the result back to Perl's internal representation like so:

    sub my_escape_html ($) {
        my($what) = shift;
        return unless defined $what;
        Encode::decode_utf8(Foo::Bar::escape_html(
                                Encode::encode_utf8($what)));
    }

Sometimes, when the extension does not convert data but just stores and retrieves them, you will be able to use the otherwise dangerous Encode::_utf8_on() function. Let's say the popular Foo::Bar extension, written in C, provides a param method that lets you store and retrieve data according to these prototypes:

    $self->param($name, $value);            # set a scalar
    $value = $self->param($name);           # retrieve a scalar

If it does not yet provide support for any encoding, one could write a derived class with such a param method:

    sub param {
        my($self,$name,$value) = @_;
        utf8::upgrade($name);     # make sure it is UTF-8 encoded
        if (defined $value) {
            utf8::upgrade($value);  # make sure it is UTF-8 encoded
            return $self->SUPER::param($name,$value);
        } else {
            my $ret = $self->SUPER::param($name);
            Encode::_utf8_on($ret); # we know, it is UTF-8 encoded
            return $ret;
        }
    }
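
To see the effect of such a derived class, the C extension can be imitated in pure Perl. This is a sketch under stated assumptions: Mock::Store and Mock::Store::Unicode are hypothetical names, and the base class merely mimics an extension that copies the internal bytes of a scalar and forgets the UTF8 flag (here by calling Encode::_utf8_off on a copy).

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical stand-in for a C extension: it keeps the raw internal bytes
# of whatever it is given and returns them without the UTF8 flag.
package Mock::Store;
sub new { return bless { data => {} }, shift }
sub param {
    my ($self, $name, $value) = @_;
    my $key = $name;
    Encode::_utf8_off($key);
    if (defined $value) {
        my $raw = $value;
        Encode::_utf8_off($raw);        # drop the flag, keep the bytes
        $self->{data}{$key} = $raw;
        return;
    }
    return $self->{data}{$key};
}

# Derived class following the pattern shown above.
package Mock::Store::Unicode;
our @ISA = ('Mock::Store');
sub param {
    my ($self, $name, $value) = @_;
    utf8::upgrade($name);               # make sure it is UTF-8 encoded
    if (defined $value) {
        utf8::upgrade($value);          # make sure it is UTF-8 encoded
        return $self->SUPER::param($name, $value);
    }
    my $ret = $self->SUPER::param($name);
    Encode::_utf8_on($ret) if defined $ret;   # we stored UTF-8, so this is safe
    return $ret;
}

package main;

my $store = Mock::Store::Unicode->new;
$store->param(cafe   => "caf\x{e9}");   # Latin-1 range, initially not flagged
$store->param(smiley => "\x{263A}");    # needs the UTF8 flag from the start
my $cafe   = $store->param('cafe');
my $smiley = $store->param('smiley');
```

The derived class is only safe because every value passes through utf8::upgrade before it is stored; data written through the base class directly may not be valid UTF-8, and Encode::_utf8_on performs no validation.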

Some extensions provide filters on data entry/exit points, such as DB_File::filter_store_key and family. Look out for such filters in the documentation of your extensions; they can make the transition to Unicode data much easier.

Speed

Some functions are slower when working on UTF-8 encoded strings than on byte encoded strings. All functions that need to hop over characters such as length(), substr() or index(), or matching regular expressions can work much faster when the underlying data are byte-encoded.

In Perl 5.8.0 the slowness was often quite spectacular; in Perl 5.8.1 a caching scheme was introduced which will hopefully make the slowness somewhat less spectacular, at least for some operations. In general, operations with UTF-8 encoded strings are still slower. As an example, the Unicode properties (character classes) like \p{Nd} are known to be quite a bit slower (5-20 times) than their simpler counterparts like \d (then again, there are hundreds of Unicode characters matching Nd compared with the 10 ASCII characters matching \d).
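
How large the difference is on a given perl can be measured with the core Benchmark module. The following is a minimal sketch, not a rigorous benchmark: the test string, the offsets and the half-second time limit are arbitrary assumptions, and the figures will vary between Perl versions and machines.

```perl
use strict;
use warnings;
use Benchmark qw(timethese cmpthese);

# The same 12,001 characters stored two ways. The strings compare equal;
# only the internal representation differs.
my $bytes = ("caf\x{e9} " x 2400) . "z";
my $chars = $bytes;
utf8::downgrade($bytes);    # one byte per character
utf8::upgrade($chars);      # UTF-8 internally, variable width

# Run each snippet for at least half a CPU second, without per-test output.
my $results = timethese(-0.5, {
    substr_bytes => sub { my $s = substr($bytes, 11_000, 20) },
    substr_chars => sub { my $s = substr($chars, 11_000, 20) },
    index_bytes  => sub { my $i = index($bytes, "z") },
    index_chars  => sub { my $i = index($chars, "z") },
}, 'none');

# Print a comparison table of the measured rates.
cmpthese($results);
```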

Problems on EBCDIC platforms

There are several known problems with Perl on EBCDIC platforms. If you want to use Perl there, send email to [email protected].

In earlier versions, when byte and character data were concatenated, the new string was sometimes created by decoding the byte strings as ISO 8859-1 (Latin-1), even if the old Unicode string used EBCDIC.

If you find any of these, please report them as bugs.

Porting code from perl-5.6.X

Perl 5.8 has a different Unicode model from 5.6. In 5.6 the programmer was required to use the utf8 pragma to declare that a given scope expected to deal with Unicode data and had to make sure that only Unicode data were reaching that scope. If you have code that is working with 5.6, you will need some of the following adjustments to your code. The examples are written such that the code will continue to work under 5.6, so you should be safe to try them out.

  • A filehandle that should read or write UTF-8

        if ($] > 5.007) {
            binmode $fh, ":encoding(utf8)";
        }
  • A scalar that is going to be passed to some extension

    Be it Compress::Zlib, Apache::Request or any extension that has no mention of Unicode in the manpage, you need to make sure that the UTF8 flag is stripped off. Note that at the time of this writing (October 2002) the mentioned modules are not UTF-8-aware. Please check the documentation to verify if this is still true.

        if ($] > 5.007) {
            require Encode;
            $val = Encode::encode_utf8($val); # make octets
        }
  • A scalar we got back from an extension

    If you believe the scalar comes back as UTF-8, you will most likely want the UTF8 flag restored:

        if ($] > 5.007) {
            require Encode;
            $val = Encode::decode_utf8($val);
        }
  • Same thing, if you are really sure it is UTF-8

        if ($] > 5.007) {
            require Encode;
            Encode::_utf8_on($val);
        }
  • A wrapper for fetchrow_array and fetchrow_hashref

    When the database contains only UTF-8, a wrapper function or method is a convenient way to replace all your fetchrow_array and fetchrow_hashref calls. A wrapper function will also make it easier to adapt to future enhancements in your database driver. Note that at the time of this writing (October 2002), the DBI has no standardized way to deal with UTF-8 data. Please check the documentation to verify if that is still true.

        sub fetchrow {
            # $what is one of fetchrow_{array,hashref}
            my($self, $sth, $what) = @_;
            if ($] < 5.007) {
                return $sth->$what;
            } else {
                require Encode;
                if (wantarray) {
                    my @arr = $sth->$what;
                    for (@arr) {
                        defined && /[^\000-\177]/ && Encode::_utf8_on($_);
                    }
                    return @arr;
                } else {
                    my $ret = $sth->$what;
                    if (ref $ret) {
                        for my $k (keys %$ret) {
                            defined
                            && /[^\000-\177]/
                            && Encode::_utf8_on($_) for $ret->{$k};
                        }
                        return $ret;
                    } else {
                        defined && /[^\000-\177]/ && Encode::_utf8_on($_) for $ret;
                        return $ret;
                    }
                }
            }
        }
  • A large scalar that you know can only contain ASCII

    Scalars that contain only ASCII and are marked as UTF-8 are sometimes a drag to your program. If you recognize such a situation, just remove the UTF8 flag:

        utf8::downgrade($val) if $] > 5.007;
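
The adjustments listed above can be tried together in one short script. This is a sketch under a few assumptions: it targets a perl where Encode and in-memory filehandles are available (so the $] guards are left out), the sample strings are arbitrary, and it uses the strict ":encoding(UTF-8)" layer name rather than the lax ":encoding(utf8)" form shown above.

```perl
use strict;
use warnings;
use Encode ();

my $text = "na\x{ef}ve \x{263A}";       # a character string, 7 characters

# A filehandle that should write UTF-8 (an in-memory file, for illustration).
my $buffer = '';
open my $out, '>', \$buffer or die "open: $!";
binmode $out, ':encoding(UTF-8)';
print {$out} $text;
close $out or die "close: $!";

# A scalar that is going to be passed to some extension: make octets.
my $octets = Encode::encode_utf8($text);

# A scalar we got back from an extension: decode it into characters again.
my $chars = Encode::decode_utf8($octets);

# A scalar that can only contain ASCII: remove the UTF8 flag.
my $ascii = "plain ASCII";
utf8::upgrade($ascii);                  # simulate a needlessly flagged scalar
my $was_flagged = utf8::is_utf8($ascii);
utf8::downgrade($ascii);

printf "%d characters, %d octets\n", length($chars), length($octets);
```

Encode::decode_utf8 validates its input, which makes it the safer default; Encode::_utf8_on only flips the flag and is appropriate only when the octets are already known to be well-formed UTF-8.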

SEE ALSO

perlunitut, perluniintro, perluniprops, Encode, open, utf8, bytes, perlretut, ${^UNICODE} in perlvar, http://www.unicode.org/reports/tr44

 
Source : perldoc.perl.org - Official documentation for the Perl programming language
Site maintained by Jon Allen (JJ)     See the project page for more details
Documentation maintained by the Perl 5 Porters