Perl regular expression plugin interface

NAME

perlreapi - perl regular expression plugin interface

DESCRIPTION

As of Perl 5.9.5 there is a new interface for plugging in and using regular expression engines other than the default one.

Each engine is supposed to provide access to a constant structure of the following format:

    typedef struct regexp_engine {
        REGEXP* (*comp) (pTHX_ const SV * const pattern, const U32 flags);
        I32     (*exec) (pTHX_ REGEXP * const rx, char* stringarg, char* strend,
                         char* strbeg, I32 minend, SV* screamer,
                         void* data, U32 flags);
        char*   (*intuit) (pTHX_ REGEXP * const rx, SV *sv, char *strpos,
                           char *strend, U32 flags,
                           struct re_scream_pos_data_s *data);
        SV*     (*checkstr) (pTHX_ REGEXP * const rx);
        void    (*free) (pTHX_ REGEXP * const rx);
        void    (*numbered_buff_FETCH) (pTHX_ REGEXP * const rx, const I32 paren,
                                        SV * const sv);
        void    (*numbered_buff_STORE) (pTHX_ REGEXP * const rx, const I32 paren,
                                        SV const * const value);
        I32     (*numbered_buff_LENGTH) (pTHX_ REGEXP * const rx, const SV * const sv,
                                         const I32 paren);
        SV*     (*named_buff) (pTHX_ REGEXP * const rx, SV * const key,
                               SV * const value, U32 flags);
        SV*     (*named_buff_iter) (pTHX_ REGEXP * const rx, const SV * const lastkey,
                                    const U32 flags);
        SV*     (*qr_package)(pTHX_ REGEXP * const rx);
    #ifdef USE_ITHREADS
        void*   (*dupe) (pTHX_ REGEXP * const rx, CLONE_PARAMS *param);
    #endif
    } regexp_engine;

When a regexp is compiled, its engine field is then set to point at the appropriate structure, so that when it needs to be used Perl can find the right routines to do so.

In order to install a new regexp handler, $^H{regcomp} is set to an integer which (when cast appropriately) resolves to one of these structures. When compiling, the comp method is executed, and the resulting regexp structure's engine field is expected to point back at the same structure.

The pTHX_ symbol in the definition is a macro used by perl under threading to provide an extra argument to the routine holding a pointer back to the interpreter that is executing the regexp. So under threading all routines get an extra argument.
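
The relationship between the hint integer, the engine structure and the engine field can be sketched in plain C without any perl headers. Everything below is a hypothetical stand-in: mini_engine, mini_regexp, example_comp, example_exec, engine_as_integer and match_via_hint are illustrative names, not part of perl's API, and the "matcher" is a plain string comparison.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified stand-ins for REGEXP and regexp_engine. */
struct mini_regexp;

typedef struct mini_engine {
    struct mini_regexp *(*comp)(const char *pattern);
    int (*exec)(struct mini_regexp *rx, const char *subject);
} mini_engine;

typedef struct mini_regexp {
    const mini_engine *engine;   /* must point back at the compiling engine */
    const char *pattern;
} mini_regexp;

mini_regexp *example_comp(const char *pattern);
int example_exec(mini_regexp *rx, const char *subject);

/* The constant structure an engine exposes. */
static const mini_engine example_engine = { example_comp, example_exec };

mini_regexp *example_comp(const char *pattern)
{
    static mini_regexp rx;          /* one static slot keeps the sketch short */
    rx.engine  = &example_engine;   /* point back at the same structure */
    rx.pattern = pattern;
    return &rx;
}

int example_exec(mini_regexp *rx, const char *subject)
{
    /* Toy matcher: succeeds only when subject equals the pattern. */
    return strcmp(rx->pattern, subject) == 0;
}

/* The integer an engine would publish, analogous to $^H{regcomp}. */
uintptr_t engine_as_integer(void)
{
    return (uintptr_t)&example_engine;
}

/* Cast the integer back, compile, then dispatch through rx->engine. */
int match_via_hint(uintptr_t hint, const char *pattern, const char *subject)
{
    const mini_engine *eng = (const mini_engine *)hint;
    mini_regexp *rx = eng->comp(pattern);
    assert(rx->engine == eng);
    return rx->engine->exec(rx, subject);
}
```

The caller never names example_exec directly; it only follows the pointer stored in the compiled object, which is why comp has to set that field.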

Callbacks

comp

    REGEXP* comp(pTHX_ const SV * const pattern, const U32 flags);

Compile the pattern stored in pattern using the given flags and return a pointer to a prepared REGEXP structure that can perform the match. See The REGEXP structure below for an explanation of the individual fields in the REGEXP struct.

The pattern parameter is the scalar that was used as the pattern. Previous versions of perl would pass two char* indicating the start and end of the stringified pattern; the following snippet can be used to get the old parameters:

    STRLEN plen;
    char*  exp = SvPV(pattern, plen);
    char* xend = exp + plen;

Since any scalar can be passed as a pattern it's possible to implement an engine that does something with an array ("ook" =~ [ qw/ eek hlagh / ]) or with the non-stringified form of a compiled regular expression ("ook" =~ qr/eek/). perl's own engine will always stringify everything using the snippet above but that doesn't mean other engines have to.

The flags parameter is a bitfield which indicates which of the msixp flags the regex was compiled with. It also contains additional info such as whether use locale is in effect.

The eogc flags are stripped out before being passed to the comp routine. The regex engine does not need to know whether any of these are set as those flags should only affect what perl does with the pattern and its match variables, not how it gets compiled and executed.

By the time the comp callback is called, some of these flags have already had effect (noted below where applicable). However most of their effect occurs after the comp callback has run in routines that read the rx->extflags field which it populates.

In general the flags should be preserved in rx->extflags after compilation, although the regex engine might want to add or delete some of them to invoke or disable some special behavior in perl. The flags along with any special behavior they cause are documented below:

The pattern modifiers:

  • /m - RXf_PMf_MULTILINE

    If this is in rx->extflags it will be passed to Perl_fbm_instr by pp_split which will treat the subject string as a multi-line string.

  • /s - RXf_PMf_SINGLELINE
  • /i - RXf_PMf_FOLD
  • /x - RXf_PMf_EXTENDED

    If present on a regex, # comments will be handled differently by the tokenizer in some cases.

    TODO: Document those cases.

  • /p - RXf_PMf_KEEPCOPY

    TODO: Document this

  • Character set

    The character set semantics are determined by an enum that is contained in this field. This is still experimental and subject to change, but the current interface returns the rules by use of the in-line function get_regex_charset(const U32 flags). The only currently documented value returned from it is REGEX_LOCALE_CHARSET, which is set if use locale is in effect. If present in rx->extflags, split will use the locale dependent definition of whitespace when RXf_SKIPWHITE or RXf_WHITE is in effect. ASCII whitespace is defined as per isSPACE, and by the internal macros is_utf8_space under UTF-8, and isSPACE_LC under use locale.

Additional flags:

  • RXf_UTF8

    Set if the pattern is SvUTF8(), set by Perl_pmruntime.

    A regex engine may want to set or disable this flag during compilation. The perl engine for instance may upgrade non-UTF-8 strings to UTF-8 if the pattern includes constructs such as \x{...} that can only match Unicode values.

  • RXf_SPLIT

    If split is invoked as split ' ' or with no arguments (which really means split(' ', $_), see split), perl will set this flag. The regex engine can then check for it and set the SKIPWHITE and WHITE extflags. To do this the perl engine does:

        if (flags & RXf_SPLIT && r->prelen == 1 && r->precomp[0] == ' ')
            r->extflags |= (RXf_SKIPWHITE|RXf_WHITE);

These flags can be set during compilation to enable optimizations in the split operator.

  • RXf_SKIPWHITE

    If the flag is present in rx->extflags split will delete whitespace from the start of the subject string before it's operated on. What is considered whitespace depends on whether the subject is a UTF-8 string and whether the RXf_PMf_LOCALE flag is set.

    If RXf_WHITE is set in addition to this flag split will behave like split " " under the perl engine.

  • RXf_START_ONLY

    Tells the split operator to split the target string on newlines (\n) without invoking the regex engine.

    Perl's engine sets this if the pattern is /^/ (plen == 1 && *exp == '^'), even under /^/s, see split. Of course a different regex engine might want to use the same optimizations with a different syntax.

  • RXf_WHITE

    Tells the split operator to split the target string on whitespace without invoking the regex engine. The definition of whitespace varies depending on whether the target string is a UTF-8 string and on whether RXf_PMf_LOCALE is set.

    Perl's engine sets this flag if the pattern is \s+.

  • RXf_NULL

    Tells the split operator to split the target string on characters. The definition of character varies depending on whether the target string is a UTF-8 string.

    Perl's engine sets this flag on empty patterns. This optimization makes split // much faster than it would otherwise be. It's even faster than unpack.
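
The pattern checks described for the split-related flags can be gathered into one routine. The following is a standalone sketch, not perl's code: the EX_* bit values are made up for illustration (the real RXf_* constants come from regexp.h), and example_split_extflags is a hypothetical helper name.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical bit values; the real RXf_* constants live in regexp.h. */
enum {
    EX_SPLIT      = 1 << 0,
    EX_SKIPWHITE  = 1 << 1,
    EX_START_ONLY = 1 << 2,
    EX_WHITE      = 1 << 3,
    EX_NULL       = 1 << 4
};

/* Return the incoming flags plus whichever split optimization applies. */
unsigned example_split_extflags(unsigned flags, const char *pat, size_t plen)
{
    unsigned ext = flags;   /* flags are generally preserved in extflags */

    assert(pat != NULL);
    if ((flags & EX_SPLIT) && plen == 1 && pat[0] == ' ')
        ext |= EX_SKIPWHITE | EX_WHITE;      /* split ' ' */
    else if (plen == 1 && pat[0] == '^')
        ext |= EX_START_ONLY;                /* split /^/ */
    else if (plen == 3 && memcmp(pat, "\\s+", 3) == 0)
        ext |= EX_WHITE;                     /* split /\s+/ */
    else if (plen == 0)
        ext |= EX_NULL;                      /* split // */
    return ext;
}
```

An engine with a different pattern syntax would keep the same flags but change the comparisons, as the RXf_START_ONLY entry suggests.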

exec

    I32 exec(pTHX_ REGEXP * const rx,
             char *stringarg, char* strend, char* strbeg,
             I32 minend, SV* screamer,
             void* data, U32 flags);

Execute a regexp.

intuit

    char* intuit(pTHX_ REGEXP * const rx,
                 SV *sv, char *strpos, char *strend,
                 const U32 flags, struct re_scream_pos_data_s *data);

Find the start position where a regex match should be attempted, or possibly whether the regex engine should not be run because the pattern can't match. This is called as appropriate by the core depending on the values of the extflags member of the regexp structure.

checkstr

    SV* checkstr(pTHX_ REGEXP * const rx);

Return a SV containing a string that must appear in the pattern. Used by split for optimising matches.

free

    void free(pTHX_ REGEXP * const rx);

Called by perl when it is freeing a regexp pattern so that the engine can release any resources pointed to by the pprivate member of the regexp structure. This is only responsible for freeing private data; perl will handle releasing anything else contained in the regexp structure.

Numbered capture callbacks

Called to get/set the value of $`, $', $& and their named equivalents, ${^PREMATCH}, ${^POSTMATCH} and ${^MATCH}, as well as the numbered capture groups ($1, $2, ...).

The paren parameter will be -2 for $`, -1 for $', 0 for $&, 1 for $1 and so forth.

The names have been chosen by analogy with Tie::Scalar method names, with an additional LENGTH callback for efficiency. However, named capture variables are currently not tied internally but implemented via magic.

numbered_buff_FETCH

    void numbered_buff_FETCH(pTHX_ REGEXP * const rx, const I32 paren,
                             SV * const sv);

Fetch a specified numbered capture. sv should be set to the scalar to return. The scalar is passed as an argument rather than being returned from the function because when it's called perl already has a scalar to store the value; creating another one would be redundant. The scalar can be set with sv_setsv, sv_setpvn and friends, see perlapi.

This callback is where perl untaints its own capture variables under taint mode (see perlsec). See the Perl_reg_numbered_buff_fetch function in regcomp.c for how to untaint capture variables if that's something you'd like your engine to do as well.

numbered_buff_STORE

    void (*numbered_buff_STORE) (pTHX_ REGEXP * const rx, const I32 paren,
                                 SV const * const value);

Set the value of a numbered capture variable. value is the scalar that is to be used as the new value. It's up to the engine to make sure this is used as the new value (or reject it).

Example:

    if ("ook" =~ /(o*)/) {
        # 'paren' will be '1' and 'value' will be 'ee'
        $1 =~ tr/o/e/;
    }

Perl's own engine will croak on any attempt to modify the capture variables. To do this in another engine use the following callback (copied from Perl_reg_numbered_buff_store):

    void
    Example_reg_numbered_buff_store(pTHX_ REGEXP * const rx, const I32 paren,
                                    SV const * const value)
    {
        PERL_UNUSED_ARG(rx);
        PERL_UNUSED_ARG(paren);
        PERL_UNUSED_ARG(value);

        if (!PL_localizing)
            Perl_croak(aTHX_ PL_no_modify);
    }

Actually perl will not always croak in a statement that looks like it would modify a numbered capture variable. This is because the STORE callback will not be called if perl can determine that it doesn't have to modify the value. This is exactly how tied variables behave in the same situation:

    package CaptureVar;
    use base 'Tie::Scalar';

    sub TIESCALAR { bless [] }
    sub FETCH { undef }
    sub STORE { die "This doesn't get called" }

    package main;

    tie my $sv => "CaptureVar";
    $sv =~ y/a/b/;

Because $sv is undef when the y/// operator is applied to it, the transliteration won't actually execute and the program won't die. This is different from how 5.8 and earlier versions behaved, since the capture variables were READONLY variables then; now they'll just die when assigned to in the default engine.

numbered_buff_LENGTH

    I32 numbered_buff_LENGTH (pTHX_ REGEXP * const rx, const SV * const sv,
                              const I32 paren);

Get the length of a capture variable. There's a special callback for this so that perl doesn't have to do a FETCH and run length on the result. Since the length is (in perl's case) known from an offset stored in rx->offs, this is much more efficient:

    I32 s1  = rx->offs[paren].start;
    I32 t1  = rx->offs[paren].end;
    I32 len = t1 - s1;

This is a little bit more complex in the case of UTF-8, see what Perl_reg_numbered_buff_length does with is_utf8_string_loclen.
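
The offset arithmetic can be tried outside perl with a small stand-in. This is a byte-length sketch only: example_paren_pair and example_capture_length are hypothetical names, the UTF-8 case is ignored, the special indexes -2 and -1 (for $` and $') are not handled, and returning 0 for a group that did not match is a choice made for this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for regexp_paren_pair. */
typedef struct {
    int start;
    int end;
} example_paren_pair;

/*
 * Byte length of capture `paren`, computed from offsets alone.
 * Returns 0 when the index is out of range or when either offset is -1,
 * which marks a group that did not match.
 */
int example_capture_length(const example_paren_pair *offs, int nparens, int paren)
{
    assert(offs != NULL);
    if (paren < 0 || paren > nparens)
        return 0;
    if (offs[paren].start == -1 || offs[paren].end == -1)
        return 0;
    return offs[paren].end - offs[paren].start;
}
```

For "ook" =~ /(o*)/ both offs[0] and offs[1] would hold start 0 and end 2, so the length of $1 is 2 without ever building the string "oo".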

Named capture callbacks

Called to get/set the value of %+ and %- as well as by some utility functions in re.

There are two callbacks: named_buff is called in all the cases the FETCH, STORE, DELETE, CLEAR, EXISTS and SCALAR Tie::Hash callbacks would be on changes to %+ and %-, and named_buff_iter in the same cases as FIRSTKEY and NEXTKEY.

The flags parameter can be used to determine which of these operations the callbacks should respond to. The following flags are currently defined:

Which Tie::Hash operation is being performed from the Perl level on %+ or %-, if any:

    RXapif_FETCH
    RXapif_STORE
    RXapif_DELETE
    RXapif_CLEAR
    RXapif_EXISTS
    RXapif_SCALAR
    RXapif_FIRSTKEY
    RXapif_NEXTKEY

Whether %+ or %- is being operated on, if any.

    RXapif_ONE /* %+ */
    RXapif_ALL /* %- */

Whether this is being called as re::regname, re::regnames or re::regnames_count, if any. The first two will be combined with RXapif_ONE or RXapif_ALL.

    RXapif_REGNAME
    RXapif_REGNAMES
    RXapif_REGNAMES_COUNT

Internally %+ and %- are implemented with a real tied interface via Tie::Hash::NamedCapture. The methods in that package will call back into these functions. However the usage of Tie::Hash::NamedCapture for this purpose might change in future releases. For instance this might be implemented by magic instead (would need an extension to mgvtbl).

named_buff

    SV* (*named_buff) (pTHX_ REGEXP * const rx, SV * const key,
                       SV * const value, U32 flags);
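
Because a single callback serves several hash operations, an implementation typically begins by decoding flags. The sketch below shows only that decoding step in standalone C; the EX_* values are invented for illustration (the real RXapif_* constants come from perl's headers), and example_op_name and example_target_hash are hypothetical helpers.

```c
#include <assert.h>
#include <string.h>   /* strcmp, handy when comparing the returned names */

/* Hypothetical bit values standing in for the RXapif_* constants. */
enum {
    EX_FETCH  = 1 << 0,
    EX_STORE  = 1 << 1,
    EX_DELETE = 1 << 2,
    EX_CLEAR  = 1 << 3,
    EX_EXISTS = 1 << 4,
    EX_SCALAR = 1 << 5,
    EX_ONE    = 1 << 6,   /* %+ */
    EX_ALL    = 1 << 7    /* %- */
};

/* Name of the Tie::Hash-style operation requested, or "none". */
const char *example_op_name(unsigned flags)
{
    if (flags & EX_FETCH)  return "FETCH";
    if (flags & EX_STORE)  return "STORE";
    if (flags & EX_DELETE) return "DELETE";
    if (flags & EX_CLEAR)  return "CLEAR";
    if (flags & EX_EXISTS) return "EXISTS";
    if (flags & EX_SCALAR) return "SCALAR";
    return "none";
}

/* Which hash is being operated on, or "none". */
const char *example_target_hash(unsigned flags)
{
    /* This sketch treats the two hash bits as mutually exclusive. */
    assert(!((flags & EX_ONE) && (flags & EX_ALL)));
    if (flags & EX_ONE) return "%+";
    if (flags & EX_ALL) return "%-";
    return "none";
}
```

A real callback would branch on the decoded operation and then consult key and value as needed.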

named_buff_iter

    SV* (*named_buff_iter) (pTHX_ REGEXP * const rx, const SV * const lastkey,
                            const U32 flags);

qr_package

    SV* qr_package(pTHX_ REGEXP * const rx);

The package the qr// magic object is blessed into (as seen by ref qr//). It is recommended that engines change this to their package name for identification regardless of whether they implement methods on the object.

The package this method returns should also have the internal Regexp package in its @ISA. qr//->isa("Regexp") should always be true regardless of what engine is being used.

An example implementation might be:

    SV*
    Example_qr_package(pTHX_ REGEXP * const rx)
    {
        PERL_UNUSED_ARG(rx);
        return newSVpvs("re::engine::Example");
    }

Any method calls on an object created with qr// will be dispatched to the package as a normal object.

    use re::engine::Example;
    my $re = qr//;
    $re->meth; # dispatched to re::engine::Example::meth()

To retrieve the REGEXP object from the scalar in an XS function use the SvRX macro, see REGEXP Functions in perlapi.

    void meth(SV * rv)
    PPCODE:
        REGEXP * re = SvRX(rv);

dupe

    void* dupe(pTHX_ REGEXP * const rx, CLONE_PARAMS *param);

On threaded builds a regexp may need to be duplicated so that the pattern can be used by multiple threads. This routine is expected to handle the duplication of any private data pointed to by the pprivate member of the regexp structure. It will be called with the preconstructed new regexp structure as an argument, the pprivate member will point at the old private structure, and it is this routine's responsibility to construct a copy and return a pointer to it (which perl will then use to overwrite the field as passed to this routine).

This allows the engine to dupe its private data but also if necessary modify the final structure if it really must.

On unthreaded builds this field doesn't exist.
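
The contract for dupe (receive a structure whose pprivate still points at the old data, return a fresh copy) can be illustrated with a standalone deep copy. The types and the function below are hypothetical stand-ins that use malloc directly; a real engine would work on REGEXP, receive pTHX_ and CLONE_PARAMS, and would more likely use perl's own allocation routines.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical engine-private data: a compiled program and its length. */
typedef struct {
    char  *program;
    size_t len;
} example_private;

/* Hypothetical stand-in for the part of REGEXP that matters here. */
typedef struct {
    void *pprivate;
} example_regexp;

/*
 * Deep-copy the private data that rx->pprivate points at and return the
 * copy. The caller (perl, in the real interface) stores the returned
 * pointer back into pprivate. Returns NULL if allocation fails.
 */
void *example_dupe(const example_regexp *rx)
{
    const example_private *old;
    example_private *copy;

    assert(rx != NULL && rx->pprivate != NULL);
    old = rx->pprivate;

    copy = malloc(sizeof *copy);
    if (copy == NULL)
        return NULL;

    copy->program = malloc(old->len ? old->len : 1);
    if (copy->program == NULL) {
        free(copy);
        return NULL;
    }
    memcpy(copy->program, old->program, old->len);
    copy->len = old->len;
    return copy;
}
```

Copying the buffer instead of sharing the pointer means each thread's free callback can release its own private data independently.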

The REGEXP structure

The REGEXP struct is defined in regexp.h. All regex engines must be able to correctly build such a structure in their comp routine.

The REGEXP structure contains all the data that perl needs to be aware of to properly work with the regular expression. It includes data about optimisations that perl can use to determine if the regex engine should really be used, and various other control info that is needed to properly execute patterns in various contexts, such as whether the pattern is anchored in some way, what flags were used during the compile, or whether the program contains special constructs that perl needs to be aware of.

In addition it contains two fields that are intended for the private use of the regex engine that compiled the pattern. These are the intflags and pprivate members. pprivate is a void pointer to an arbitrary structure whose use and management is the responsibility of the compiling engine. perl will never modify either of these values.

    typedef struct regexp {
        /* what engine created this regexp? */
        const struct regexp_engine* engine;

        /* what re is this a lightweight copy of? */
        struct regexp* mother_re;

        /* Information about the match that the perl core uses to manage things */
        U32 extflags;   /* Flags used both externally and internally */
        I32 minlen;     /* minimum possible length of string to match */
        I32 minlenret;  /* minimum possible length of $& */
        U32 gofs;       /* chars left of pos that we search from */

        /* substring data about strings that must appear
           in the final match, used for optimisations */
        struct reg_substr_data *substrs;

        U32 nparens;    /* number of capture groups */

        /* private engine specific data */
        U32 intflags;   /* Engine Specific Internal flags */
        void *pprivate; /* Data private to the regex engine which
                           created this object. */

        /* Data about the last/current match. These are modified during matching */
        U32 lastparen;            /* last open paren matched */
        U32 lastcloseparen;       /* last close paren matched */
        regexp_paren_pair *swap;  /* Swap copy of *offs */
        regexp_paren_pair *offs;  /* Array of offsets for (@-) and (@+) */

        char *subbeg;   /* saved or original string so \digit works forever. */
        SV_SAVED_COPY   /* If non-NULL, SV which is COW from original */
        I32 sublen;     /* Length of string pointed by subbeg */

        /* Information about the match that isn't often used */
        I32 prelen;           /* length of precomp */
        const char *precomp;  /* pre-compilation regular expression */

        char *wrapped;  /* wrapped version of the pattern */
        I32 wraplen;    /* length of wrapped */

        I32 seen_evals;   /* number of eval groups in the pattern - for security checks */
        HV *paren_names;  /* Optional hash of paren names */

        /* Refcount of this regexp */
        I32 refcnt;       /* Refcount of this regexp */
    } regexp;

The fields are discussed in more detail below:

engine

This field points at a regexp_engine structure which contains pointers to the subroutines that are to be used for performing a match. It is the compiling routine's responsibility to populate this field before returning the regexp object.

Internally this is set to NULL unless a custom engine is specified in $^H{regcomp}; perl's own set of callbacks can be accessed in the struct pointed to by RE_ENGINE_PTR.

mother_re

TODO, see http://www.mail-archive.com/perl5-changes@perl.org/msg17328.html

extflags

This will be used by perl to see what flags the regexp was compiled with. This will normally be set to the value of the flags parameter by the comp callback. See the comp documentation for valid flags.

minlen minlenret

The minimum string length required for the pattern to match. This is used to prune the search space by not bothering to match any closer to the end of a string than would allow a match. For instance there is no point in even starting the regex engine if the minlen is 10 but the string is only 5 characters long. There is no way that the pattern can match.

minlenret is the minimum length of the string that would be found in $& after a match.

The difference between minlen and minlenret can be seen in the following pattern:

    /ns(?=\d)/

where the minlen would be 3 but minlenret would only be 2 as the \d is required to match but is not actually included in the matched content. This distinction is particularly important as the substitution logic uses the minlenret to tell whether it can do in-place substitution which can result in considerable speedup.
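
The pruning that minlen allows amounts to a single subtraction. The helper below is a hypothetical standalone sketch (example_last_start is not a perl function) that counts in characters and ignores anchoring and every other optimisation.

```c
#include <assert.h>

/*
 * Given the pattern's minimum match length and the subject length,
 * return the last offset at which a match attempt can still succeed,
 * or -1 when the subject is too short for any match.
 */
long example_last_start(long minlen, long subject_len)
{
    assert(minlen >= 0 && subject_len >= 0);
    if (subject_len < minlen)
        return -1;                 /* e.g. minlen 10, subject length 5 */
    return subject_len - minlen;   /* nearer the end, no match can fit */
}
```

With minlen 3 and a 5-character subject, only start offsets 0 through 2 are worth trying.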

gofs

Left offset from pos() to start match at.

substrs

Substring data about strings that must appear in the final match. This is currently only used internally by perl's engine, but might be used in the future for all engines for optimisations.

nparens, lastparen, and lastcloseparen

These fields are used to keep track of how many paren groups could be matched in the pattern, which was the last open paren to be entered, and which was the last close paren to be entered.

intflags

The engine's private copy of the flags the pattern was compiled with. Usually this is the same as extflags unless the engine chose to modify one of them.

pprivate

A void* pointing to an engine-defined data structure. The perl engine uses the regexp_internal structure (see Base Structures in perlreguts) but a custom engine should use something else.

swap

Unused. Left in for compatibility with perl 5.10.0.

offs

A regexp_paren_pair structure which defines offsets into the string being matched which correspond to the $& and $1, $2 etc. captures. The regexp_paren_pair struct is defined as follows:

    typedef struct regexp_paren_pair {
        I32 start;
        I32 end;
    } regexp_paren_pair;

If ->offs[num].start or ->offs[num].end is -1 then that capture group did not match. ->offs[0].start/end represents $& (or ${^MATCH} under //p) and ->offs[paren].end matches $$paren where $paren >= 1.

precomp prelen

Used for optimisations. precomp holds a copy of the pattern that was compiled and prelen its length. When a new pattern is to be compiled (such as inside a loop) the internal regcomp operator checks whether the last compiled REGEXP's precomp and prelen are equivalent to the new one, and if so uses the old pattern instead of compiling a new one.

The relevant snippet from Perl_pp_regcomp:

    if (!re || !re->precomp || re->prelen != (I32)len ||
        memNE(re->precomp, t, len))
    /* Compile a new pattern */
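
The same test can be run outside perl with memcmp in place of memNE. The struct and function below are hypothetical stand-ins (example_cached, example_needs_recompile) that keep only the two fields involved.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in holding only the fields the check reads. */
typedef struct {
    const char *precomp;   /* pattern text that was compiled */
    int         prelen;    /* its length */
} example_cached;

/*
 * Nonzero when the new pattern text t (of length len) differs from the
 * cached one, or when there is nothing cached, so a compile is needed.
 */
int example_needs_recompile(const example_cached *re, const char *t, size_t len)
{
    assert(t != NULL);
    return !re || !re->precomp || re->prelen != (int)len
        || memcmp(re->precomp, t, len) != 0;
}
```

The length comparison comes first, so memcmp only ever runs on buffers already known to be the same size.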

paren_names

This is a hash used internally to track named capture groups and their offsets. The keys are the names of the buffers; the values are dualvars, with the IV slot holding the number of buffers with the given name and the pv being an embedded array of I32. The values may also be contained independently in the data array in cases where named backreferences are used.

substrs

Holds information on the longest string that must occur at a fixed offset from the start of the pattern, and the longest string that must occur at a floating offset from the start of the pattern. Used to do Fast-Boyer-Moore searches on the string to find out if it's worth using the regex engine at all, and if so where in the string to search.

subbeg sublen saved_copy

Used during execution phase for managing search and replace patterns.

wrapped wraplen

Stores the string qr// stringifies to. The perl engine for example stores (?^:eek) in the case of qr/eek/.

When using a custom engine that doesn't support the (?:) construct for inline modifiers, it's probably best to have qr// stringify to the supplied pattern. Note that this will create undesired patterns in cases such as:

    my $x = qr/a|b/;  # "a|b"
    my $y = qr/c/i;   # "c"
    my $z = qr/$x$y/; # "a|bc"

There's no solution for this problem other than making the custom engine understand a construct like (?:).

seen_evals

This stores the number of eval groups in the pattern. This is used for security purposes when embedding compiled regexes into larger patterns with qr//.

refcnt

The number of times the structure is referenced. When this falls to 0 the regexp is automatically freed by a call to pregfree. This should be set to 1 in each engine's comp routine.

HISTORY

Originally part of perlreguts.

AUTHORS

Originally written by Yves Orton, expanded by Ævar Arnfjörð Bjarmason.

LICENSE

Copyright 2006 Yves Orton and 2007 Ævar Arnfjörð Bjarmason.

This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

 
Source : perldoc.perl.org - Official documentation for the Perl programming language
Documentation maintained by the Perl 5 Porters