Diffstat (limited to 'doc/pcre2pattern.3')
-rw-r--r-- doc/pcre2pattern.3 102
1 files changed, 83 insertions, 19 deletions
diff --git a/doc/pcre2pattern.3 b/doc/pcre2pattern.3
index 192859d..8d0e9df 100644
--- a/doc/pcre2pattern.3
+++ b/doc/pcre2pattern.3
@@ -1,4 +1,4 @@
-.TH PCRE2PATTERN 3 "13 June 2015" "PCRE2 10.20"
+.TH PCRE2PATTERN 3 "13 November 2015" "PCRE2 10.21"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH "PCRE2 REGULAR EXPRESSION DETAILS"
@@ -671,8 +671,8 @@ below.
This particular group matches either the two-character sequence CR followed by
LF, or one of the single characters LF (linefeed, U+000A), VT (vertical tab,
U+000B), FF (form feed, U+000C), CR (carriage return, U+000D), or NEL (next
-line, U+0085). The two-character sequence is treated as a single unit that
-cannot be split.
+line, U+0085). Because this is an atomic group, the two-character sequence is
+treated as a single unit that cannot be split.
.P
In other modes, two additional characters whose codepoints are greater than 255
are added: LS (line separator, U+2028) and PS (paragraph separator, U+2029).
@@ -738,6 +738,8 @@ example:
Those that are not part of an identified script are lumped together as
"Common". The current list of scripts is:
.P
+Ahom,
+Anatolian_Hieroglyphs,
Arabic,
Armenian,
Avestan,
@@ -778,6 +780,7 @@ Gurmukhi,
Han,
Hangul,
Hanunoo,
+Hatran,
Hebrew,
Hiragana,
Imperial_Aramaic,
@@ -814,12 +817,14 @@ Miao,
Modi,
Mongolian,
Mro,
+Multani,
Myanmar,
Nabataean,
New_Tai_Lue,
Nko,
Ogham,
Ol_Chiki,
+Old_Hungarian,
Old_Italic,
Old_North_Arabian,
Old_Permic,
@@ -841,6 +846,7 @@ Saurashtra,
Sharada,
Shavian,
Siddham,
+SignWriting,
Sinhala,
Sora_Sompeng,
Sundanese,
@@ -1177,6 +1183,18 @@ patterns that are anchored in single line mode because all branches start with
when the \fIstartoffset\fP argument of \fBpcre2_match()\fP is non-zero. The
PCRE2_DOLLAR_ENDONLY option is ignored if PCRE2_MULTILINE is set.
.P
+When the newline convention (see
+.\" HTML <a href="#newlines">
+.\" </a>
+"Newline conventions"
+.\"
+below) recognizes the two-character sequence CRLF as a newline, this is
+preferred, even if the single characters CR and LF are also recognized as
+newlines. For example, if the newline convention is "any", a multiline mode
+circumflex matches before "xyz" in the string "abc\er\enxyz" rather than after
+CR, even though CR on its own is a valid newline. (It also matches at the very
+start of the string, of course.)
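
The CRLF preference described above can be modelled outside PCRE2. The following Python sketch is only an illustration under stated assumptions: the function name is hypothetical, the newline set is the one the "any" convention is documented to recognize, and nothing here calls the PCRE2 library. It lists the offsets at which a multiline circumflex would be expected to match.

```python
# Single-character newlines recognized by the "any" convention:
# LF, VT, FF, CR, NEL, LS, PS. CRLF is handled separately below.
NEWLINE_CHARS = "\n\x0b\x0c\r\x85\u2028\u2029"


def multiline_circumflex_positions(subject):
    """Offsets where a multiline-mode circumflex is expected to match.

    Hypothetical helper, not part of PCRE2. CRLF is tried first, so a
    position between CR and LF is never reported.
    """
    positions = [0]  # the very start of the subject always qualifies
    i = 0
    while i < len(subject):
        if subject.startswith("\r\n", i):
            i += 2  # CRLF is consumed as one unit
        elif subject[i] in NEWLINE_CHARS:
            i += 1
        else:
            i += 1
            continue
        # Only internal newlines count: nothing is reported after a
        # newline that ends the subject.
        if i < len(subject):
            positions.append(i)
    return positions


print(multiline_circumflex_positions("abc\r\nxyz"))  # [0, 5]
print(multiline_circumflex_positions("abc\rxyz"))    # [0, 4]
```

In the first call the only internal match point is offset 5, before "xyz"; offset 4, between CR and LF, is skipped even though CR alone is a valid newline.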
+.P
Note that the sequences \eA, \eZ, and \ez can be used to match the start and
end of the subject in both modes, and if all branches of a pattern start with
\eA it is always anchored, whether or not PCRE2_MULTILINE is set.
@@ -1227,8 +1245,11 @@ with \eC in UTF-8 or UTF-16 mode means that the rest of the string may start
with a malformed UTF character. This has undefined results, because PCRE2
assumes that it is matching character by character in a valid UTF string (by
default it checks the subject string's validity at the start of processing
-unless the PCRE2_NO_UTF_CHECK option is used). An application can lock out the
-use of \eC by setting the PCRE2_NEVER_BACKSLASH_C option.
+unless the PCRE2_NO_UTF_CHECK option is used).
+.P
+An application can lock out the use of \eC by setting the
+PCRE2_NEVER_BACKSLASH_C option when compiling a pattern. It is also possible to
+build PCRE2 with the use of \eC permanently disabled.
.P
PCRE2 does not allow \eC to appear in lookbehind assertions
.\" HTML <a href="#lookbehind">
@@ -1236,7 +1257,10 @@ PCRE2 does not allow \eC to appear in lookbehind assertions
(described below)
.\"
in a UTF mode, because this would make it impossible to calculate the length of
-the lookbehind.
+the lookbehind. Neither the alternative matching function
+\fBpcre2_dfa_match()\fP nor the JIT optimizer supports \eC in a UTF mode. The
+former gives a match-time error; the latter fails to optimize and so the match
+is always run using the interpreter.
.P
In general, the \eC escape sequence is best avoided. However, one way of using
it that avoids the problem of malformed UTF characters is to use a lookahead to
@@ -1318,9 +1342,18 @@ sequence other than one that defines a single character appears at a point
where a range ending character is expected. For example, [z-\exff] is valid,
but [A-\ed] and [A-[:digit:]] are not.
.P
-Ranges operate in the collating sequence of character values. They can also be
-used for characters specified numerically, for example [\e000-\e037]. Ranges
-can include any characters that are valid for the current mode.
+Ranges normally include all code points between the start and end characters,
+inclusive. They can also be used for code points specified numerically, for
+example [\e000-\e037]. Ranges can include any characters that are valid for the
+current mode.
+.P
+There is a special case in EBCDIC environments for ranges whose end points are
+both specified as literal letters in the same case. For compatibility with
+Perl, EBCDIC code points within the range that are not letters are omitted. For
+example, [h-k] matches only four characters, even though the codes for h and k
+are 0x88 and 0x92, a range of 11 code points. However, if the range is
+specified numerically, for example, [\ex88-\ex92] or [h-\ex92], all code points
+are included.
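
The arithmetic in the [h-k] example can be checked with a short Python sketch. It assumes that Python's cp037 codec is a fair stand-in for the EBCDIC code page in use, and the helper name is hypothetical; it models the rule stated above and does not reproduce PCRE2's implementation.

```python
import string


def ebcdic_letter_range(first, last, codec="cp037"):
    """Letters matched by a literal same-case range such as [h-k].

    Hypothetical helper: cp037 is assumed as the EBCDIC code page.
    Code points inside the range that are not basic letters of the
    same case are dropped, as the text describes.
    """
    lo = first.encode(codec)[0]
    hi = last.encode(codec)[0]
    letters = (string.ascii_lowercase if first.islower()
               else string.ascii_uppercase)
    candidates = bytes(range(lo, hi + 1)).decode(codec)
    return [ch for ch in candidates if ch in letters]


# h and k encode to 0x88 and 0x92 in cp037.
print(hex("h".encode("cp037")[0]), hex("k".encode("cp037")[0]))

# The literal range keeps only the four letters.
print(ebcdic_letter_range("h", "k"))  # ['h', 'i', 'j', 'k']

# A numerically specified range covers every code point.
print(len(range(0x88, 0x92 + 1)))     # 11
```

The seven code points between i (0x89) and j (0x91) are not basic letters, which is why the literal range matches four characters while the numeric range covers eleven.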
.P
If a range that includes letters is used when caseless matching is set, it
matches the letters in either case. For example, [W-c] is equivalent to
@@ -1650,6 +1683,9 @@ first one in the pattern with the given number. The following pattern matches
.sp
/(?|(abc)|(def))(?1)/
.sp
+A relative reference such as (?-1) is no different: it is just a convenient way
+of computing an absolute group number.
+.P
If a
.\" HTML <a href="#conditions">
.\" </a>
@@ -2513,7 +2549,8 @@ For example:
(?(VERSION>=10.4)yes|no)
.sp
This pattern matches "yes" if the PCRE2 version is greater than or equal to 10.4, or
-"no" otherwise.
+"no" otherwise. The fractional part of the version number may not contain more
+than two digits.
.
.
.SS "Assertion conditions"
@@ -2630,6 +2667,23 @@ pattern above you can write (?-2) to refer to the second most recently opened
parentheses preceding the recursion. In other words, a negative number counts
capturing parentheses leftwards from the point at which it is encountered.
.P
+Be aware, however, that if
+.\" HTML <a href="#dupsubpatternnumber">
+.\" </a>
+duplicate subpattern numbers
+.\"
+are in use, relative references refer to the earliest subpattern with the
+appropriate number. Consider, for example:
+.sp
+ (?|(a)|(b)) (c) (?-2)
+.sp
+The first two capturing groups (a) and (b) are both numbered 1, and group (c)
+is number 2. When the reference (?-2) is encountered, the second most recently
+opened parenthesis has the number 1, but it is the first such group (the (a)
+group) to which the recursion refers. This would be the same if an absolute
+reference (?1) was used. In other words, relative references are just a
+shorthand for computing a group number.
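
The resolution rule above can be sketched in Python. This is a simplified model and not PCRE2 code: the helper name is hypothetical, the groups opened so far are supplied by hand as (number, text) pairs, and the arithmetic (highest number so far, minus the offset, plus one) is an assumption about how the relative offset becomes an absolute number.

```python
def resolve_relative(groups, offset):
    """Resolve (?-offset) against the groups opened so far.

    groups: list of (number, text) pairs in the order the groups were
    opened before the reference. Hypothetical helper for illustration.
    """
    highest = max(number for number, _ in groups)
    number = highest - offset + 1  # assumed relative-to-absolute rule
    if number < 1:
        raise ValueError("reference to a non-existent group")
    # With duplicate numbers, the earliest group with that number wins.
    for candidate, text in groups:
        if candidate == number:
            return number, text
    raise ValueError("reference to a non-existent group")


# (?|(a)|(b)) (c) (?-2): (a) and (b) are both group 1, (c) is group 2.
groups = [(1, "(a)"), (1, "(b)"), (2, "(c)")]
print(resolve_relative(groups, 2))  # (1, '(a)')
print(resolve_relative(groups, 1))  # (2, '(c)')
```

The result for (?-2) is the same pair that an absolute reference (?1) would select, which is the point the paragraph makes.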
+.P
It is also possible to refer to subsequently opened parentheses, by writing
references such as (?+2). However, these cannot be recursive because the
reference is not inside the parentheses that are referenced. They are always
@@ -2929,14 +2983,24 @@ in production code should be noted to avoid problems during upgrades." The same
remarks apply to the PCRE2 features described in this section.
.P
The new verbs make use of what was previously invalid syntax: an opening
-parenthesis followed by an asterisk. They are generally of the form
-(*VERB) or (*VERB:NAME). Some may take either form, possibly behaving
-differently depending on whether or not a name is present. A name is any
-sequence of characters that does not include a closing parenthesis. The maximum
-length of name is 255 in the 8-bit library and 65535 in the 16-bit and 32-bit
-libraries. If the name is empty, that is, if the closing parenthesis
-immediately follows the colon, the effect is as if the colon were not there.
-Any number of these verbs may occur in a pattern.
+parenthesis followed by an asterisk. They are generally of the form (*VERB) or
+(*VERB:NAME). Some verbs take either form, possibly behaving differently
+depending on whether or not a name is present.
+.P
+By default, for compatibility with Perl, a name is any sequence of characters
+that does not include a closing parenthesis. The name is not processed in
+any way, and it is not possible to include a closing parenthesis in the name.
+However, if the PCRE2_ALT_VERBNAMES option is set, normal backslash processing
+is applied to verb names and only an unescaped closing parenthesis terminates
+the name. A closing parenthesis can be included in a name either as \e) or
+between \eQ and \eE. If the PCRE2_EXTENDED option is set, unescaped whitespace
+in verb names is skipped and #-comments are recognized, exactly as in the rest
+of the pattern.
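
The difference between the two ways of reading a verb name can be sketched as follows. This Python model is deliberately simplified and is not PCRE2's parser: the function name is hypothetical, a backslash merely makes the next character literal (real backslash processing handles many more escapes), and PCRE2_EXTENDED handling is left out.

```python
def read_verb_name(text, alt_verbnames=False):
    """Read a verb name from the text that follows "(*VERB:".

    Returns (name, offset just past the terminating parenthesis).
    Hypothetical helper; alt_verbnames stands in for the
    PCRE2_ALT_VERBNAMES option.
    """
    if not alt_verbnames:
        # Default: the first closing parenthesis ends the name and the
        # name itself is taken verbatim.
        end = text.index(")")
        return text[:end], end + 1

    name = []
    quoting = False  # True between \Q and \E
    i = 0
    while i < len(text):
        if quoting:
            if text.startswith("\\E", i):
                quoting = False
                i += 2
            else:
                name.append(text[i])
                i += 1
        elif text.startswith("\\Q", i):
            quoting = True
            i += 2
        elif text[i] == "\\" and i + 1 < len(text):
            name.append(text[i + 1])  # simplified escape handling
            i += 2
        elif text[i] == ")":
            return "".join(name), i + 1  # unescaped ")" ends the name
        else:
            name.append(text[i])
            i += 1
    raise ValueError("unterminated verb name")


print(read_verb_name(r"A\)B)"))               # ('A\\', 3)
print(read_verb_name(r"A\)B)", True))         # ('A)B', 5)
print(read_verb_name(r"\Qa)b\E)", True))      # ('a)b', 8)
```

In the default reading the backslash is part of the name and the first parenthesis ends it; with the alternative reading the escaped or quoted parenthesis becomes part of the name.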
+.P
+The maximum length of a name is 255 in the 8-bit library and 65535 in the
+16-bit and 32-bit libraries. If the name is empty, that is, if the closing
+parenthesis immediately follows the colon, the effect is as if the colon were
+not there. Any number of these verbs may occur in a pattern.
.P
Since these verbs are specifically related to backtracking, most of them can be
used only when the pattern is to be matched using the traditional matching
@@ -3361,6 +3425,6 @@ Cambridge, England.
.rs
.sp
.nf
-Last updated: 13 June 2015
+Last updated: 13 November 2015
Copyright (c) 1997-2015 University of Cambridge.
.fi