Diffstat (limited to 'doc/pcre2test.1')
-rw-r--r-- doc/pcre2test.1 522
1 files changed, 401 insertions, 121 deletions
diff --git a/doc/pcre2test.1 b/doc/pcre2test.1
index 2fbf794..ee78792 100644
--- a/doc/pcre2test.1
+++ b/doc/pcre2test.1
@@ -1,4 +1,4 @@
-.TH PCRE2TEST 1 "06 July 2016" "PCRE 10.22"
+.TH PCRE2TEST 1 "21 December 2017" "PCRE 10.31"
.SH NAME
pcre2test - a program for testing Perl-compatible regular expressions.
.SH SYNOPSIS
@@ -29,7 +29,7 @@ subject is processed, and what output is produced.
.P
As the original fairly simple PCRE library evolved, it acquired many different
features, and as a result, the original \fBpcretest\fP program ended up with a
-lot of options in a messy, arcane syntax, for testing all the features. The
+lot of options in a messy, arcane syntax for testing all the features. The
move to the new PCRE2 API provided an opportunity to re-implement the test
program as \fBpcre2test\fP, with a cleaner modifier syntax. Nevertheless, there
are still many obscure modifiers, some of which are specifically designed for
@@ -47,32 +47,64 @@ strings that are encoded in 8-bit, 16-bit, or 32-bit code units. One, two, or
all three of these libraries may be simultaneously installed. The
\fBpcre2test\fP program can be used to test all the libraries. However, its own
input and output are always in 8-bit format. When testing the 16-bit or 32-bit
-libraries, patterns and subject strings are converted to 16- or 32-bit format
-before being passed to the library functions. Results are converted back to
-8-bit code units for output.
+libraries, patterns and subject strings are converted to 16-bit or 32-bit
+format before being passed to the library functions. Results are converted back
+to 8-bit code units for output.
.P
In the rest of this document, the names of library functions and structures
are given in generic form, for example, \fBpcre2_compile()\fP. The actual
names used in the libraries have a suffix _8, _16, or _32, as appropriate.
.
.
+.\" HTML <a name="inputencoding"></a>
.SH "INPUT ENCODING"
.rs
.sp
Input to \fBpcre2test\fP is processed line by line, either by calling the C
-library's \fBfgets()\fP function, or via the \fBlibreadline\fP library (see
-below). The input is processed using using C's string functions, so must not
-contain binary zeroes, even though in Unix-like environments, \fBfgets()\fP
-treats any bytes other than newline as data characters. In some Windows
-environments character 26 (hex 1A) causes an immediate end of file, and no
-further data is read.
+library's \fBfgets()\fP function, or via the \fBlibreadline\fP library. In some
+Windows environments character 26 (hex 1A) causes an immediate end of file, and
+no further data is read, so this character should be avoided unless you really
+want that action.
.P
-For maximum portability, therefore, it is safest to avoid non-printing
-characters in \fBpcre2test\fP input files. There is a facility for specifying
-some or all of a pattern's characters as hexadecimal pairs, thus making it
-possible to include binary zeroes in a pattern for testing purposes. Subject
-lines are processed for backslash escapes, which makes it possible to include
-any data value.
+The input is processed using C's string functions, so must not
+contain binary zeros, even though in Unix-like environments, \fBfgets()\fP
+treats any bytes other than newline as data characters. An error is generated
+if a binary zero is encountered. By default subject lines are processed for
+backslash escapes, which makes it possible to include any data value in strings
+that are passed to the library for matching. For patterns, there is a facility
+for specifying some or all of the 8-bit input characters as hexadecimal pairs,
+which makes it possible to include binary zeros.
+.
+.
+.SS "Input for the 16-bit and 32-bit libraries"
+.rs
+.sp
+When testing the 16-bit or 32-bit libraries, there is a need to be able to
+generate character code points greater than 255 in the strings that are passed
+to the library. For subject lines, backslash escapes can be used. In addition,
+when the \fButf\fP modifier (see
+.\" HTML <a href="#optionmodifiers">
+.\" </a>
+"Setting compilation options"
+.\"
+below) is set, the pattern and any following subject lines are interpreted as
+UTF-8 strings and translated to UTF-16 or UTF-32 as appropriate.
+.P
+For non-UTF testing of wide characters, the \fButf8_input\fP modifier can be
+used. This is mutually exclusive with \fButf\fP, and is allowed only in 16-bit
+or 32-bit mode. It causes the pattern and following subject lines to be treated
+as UTF-8 according to the original definition (RFC 2279), which allows for
+character values up to 0x7fffffff. Each character is placed in one 16-bit or
+32-bit code unit (in the 16-bit case, values greater than 0xffff cause an error
+to occur).
+.P
+UTF-8 (in its original definition) is not capable of encoding values greater
+than 0x7fffffff, but such values can be handled by the 32-bit library. When
+testing this library in non-UTF mode with \fButf8_input\fP set, if any
+character is preceded by the byte 0xff (which is an illegal byte in UTF-8)
+0x80000000 is added to the character's value. This is the only way of passing
+such code points in a pattern string. For subject strings, using an escape
+sequence is preferable.
.
.
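The "Input for the 16-bit and 32-bit libraries" text above describes how \fButf8_input\fP maps original-definition (RFC 2279) UTF-8 to one value per code unit, including the 0xff prefix that adds 0x80000000 in 32-bit mode. The following is a minimal Python sketch of that mapping, not the code pcre2test uses; the function name `decode_utf8_input` and its `code_unit_bits` parameter are hypothetical, and the strictness of the error checks is an assumption.

```python
def decode_utf8_input(data: bytes, code_unit_bits: int = 32) -> list[int]:
    """Decode RFC 2279 UTF-8 (sequences of 1 to 6 bytes) into code unit values.

    Hypothetical illustration only: a 0xff byte before a character adds
    0x80000000 to its value, which is meaningful only for 32-bit code units.
    """
    values = []
    i = 0
    while i < len(data):
        add = 0
        if data[i] == 0xFF:
            if code_unit_bits != 32:
                raise ValueError("0xff prefix applies only to 32-bit mode")
            add = 0x80000000
            i += 1
            if i >= len(data):
                raise ValueError("0xff prefix at end of input")
        lead = data[i]
        if lead < 0x80:
            length, value = 1, lead
        elif 0xC0 <= lead <= 0xDF:
            length, value = 2, lead & 0x1F
        elif 0xE0 <= lead <= 0xEF:
            length, value = 3, lead & 0x0F
        elif 0xF0 <= lead <= 0xF7:
            length, value = 4, lead & 0x07
        elif 0xF8 <= lead <= 0xFB:
            length, value = 5, lead & 0x03
        elif 0xFC <= lead <= 0xFD:
            length, value = 6, lead & 0x01
        else:
            raise ValueError(f"invalid lead byte 0x{lead:02x}")
        tail = data[i + 1:i + length]
        if len(tail) != length - 1 or any((b & 0xC0) != 0x80 for b in tail):
            raise ValueError("truncated or malformed sequence")
        for b in tail:
            value = (value << 6) | (b & 0x3F)
        value += add
        if code_unit_bits == 16 and value > 0xFFFF:
            raise ValueError("value does not fit in one 16-bit code unit")
        values.append(value)
        i += length
    return values
```

For example, the six-byte sequence fd bf bf bf bf bf decodes to 0x7fffffff, the largest value the original definition can encode, while the bytes ff 41 yield 0x80000041 in 32-bit mode.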
.SH "COMMAND LINE OPTIONS"
@@ -93,14 +125,24 @@ If the 32-bit library has been built, this option causes it to be used. If only
the 32-bit library has been built, this is the default. If the 32-bit library
has not been built, this option causes an error.
.TP 10
+\fB-ac\fP
+Behave as if each pattern has the \fBauto_callout\fP modifier, that is, insert
+automatic callouts into every pattern that is compiled.
+.TP 10
+\fB-AC\fP
+As for \fB-ac\fP, but in addition behave as if each subject line has the
+\fBcallout_extra\fP modifier, that is, show additional information from
+callouts.
+.TP 10
\fB-b\fP
-Behave as if each pattern has the \fB/fullbincode\fP modifier; the full
+Behave as if each pattern has the \fBfullbincode\fP modifier; the full
internal binary form of the pattern is output after compilation.
.TP 10
\fB-C\fP
Output the version number of the PCRE2 library, and all available information
about the optional features that are included, and then exit with zero exit
-code. All other options are ignored.
+code. All other options are ignored. If both -C and -LM are present, whichever
+is first is recognized.
.TP 10
\fB-C\fP \fIoption\fP
Output information about a specific build-time option, then exit. This
@@ -114,7 +156,7 @@ following options output the value and set the exit code as indicated:
linksize the configured internal link size (2, 3, or 4)
exit code is set to the link size
newline the default newline setting:
- CR, LF, CRLF, ANYCRLF, or ANY
+ CR, LF, CRLF, ANYCRLF, ANY, or NUL
exit code is always 0
bsr the default setting for what \eR matches:
ANYCRLF or ANY
@@ -153,13 +195,23 @@ a convenience facility for PCRE2 maintainers.
Output a brief summary of these options and then exit.
.TP 10
\fB-i\fP
-Behave as if each pattern has the \fB/info\fP modifier; information about the
+Behave as if each pattern has the \fBinfo\fP modifier; information about the
compiled pattern is given after compilation.
.TP 10
\fB-jit\fP
Behave as if each pattern line has the \fBjit\fP modifier; after successful
compilation, each pattern is passed to the just-in-time compiler, if available.
.TP 10
+\fB-jitverify\fP
+Behave as if each pattern line has the \fBjitverify\fP modifier; after
+successful compilation, each pattern is passed to the just-in-time compiler, if
+available, and the use of JIT is verified.
+.TP 10
+\fB-LM\fP
+List modifiers: write a list of available pattern and subject modifiers to the
+standard output, then exit with zero exit code. All other options are ignored.
+If both -C and -LM are present, whichever is first is recognized.
+.TP 10
\fB-pattern\fP \fImodifier-list\fP
Behave as if each pattern line contains the given modifiers.
.TP 10
@@ -279,8 +331,8 @@ recognized as a newline by default. Without special action the tests would fail
when PCRE2 is compiled with either CR or CRLF as the default newline.
.P
The #newline_default command specifies a list of newline types that are
-acceptable as the default. The types must be one of CR, LF, CRLF, ANYCRLF, or
-ANY (in upper or lower case), for example:
+acceptable as the default. The types must be one of CR, LF, CRLF, ANYCRLF,
+ANY, or NUL (in upper or lower case), for example:
.sp
#newline_default LF Any anyCRLF
.sp
@@ -293,8 +345,9 @@ of the standard test input files.
.P
When the POSIX API is being tested there is no way to override the default
newline convention, though it is possible to set the newline convention from
-within the pattern. A warning is given if the \fBposix\fP modifier is used when
-\fB#newline_default\fP would set a default for the non-POSIX API.
+within the pattern. A warning is given if the \fBposix\fP or \fBposix_nosub\fP
+modifier is used when \fB#newline_default\fP would set a default for the
+non-POSIX API.
.sp
#pattern <modifier-list>
.sp
@@ -400,8 +453,9 @@ A pattern can be followed by a modifier list (details below).
.sp
Before each subject line is passed to \fBpcre2_match()\fP or
\fBpcre2_dfa_match()\fP, leading and trailing white space is removed, and the
-line is scanned for backslash escapes. The following provide a means of
-encoding non-printing characters in a visible way:
+line is scanned for backslash escapes, unless the \fBsubject_literal\fP
+modifier was set for the pattern. The following provide a means of encoding
+non-printing characters in a visible way:
.sp
\ea alarm (BEL, \ex07)
\eb backspace (\ex08)
@@ -463,6 +517,11 @@ character. A backslash followed by anything else causes an error. However, if
the very last character in the line is a backslash (and there is no modifier
list), it is ignored. This gives a way of passing an empty line as data, since
a real empty line terminates the data input.
+.P
+If the \fBsubject_literal\fP modifier is set for a pattern, all subject lines
+that follow are treated as literals, with no special treatment of backslashes.
+No replication is possible, and any subject modifiers must be set as defaults
+by a \fB#subject\fP command.
.
.
.SH "PATTERN MODIFIERS"
@@ -478,31 +537,44 @@ by a previous \fB#pattern\fP command.
.SS "Setting compilation options"
.rs
.sp
-The following modifiers set options for \fBpcre2_compile()\fP. The most common
-ones have single-letter abbreviations. See
+The following modifiers set options for \fBpcre2_compile()\fP. Most of them set
+bits in the options argument of that function, but those whose names start with
+PCRE2_EXTRA are additional options that are set in the compile context. For the
+main options, there are some single-letter abbreviations that are the same as
+Perl options. There is special handling for /x: if a second x is present,
+PCRE2_EXTENDED is converted into PCRE2_EXTENDED_MORE as in Perl. A third
+appearance adds PCRE2_EXTENDED as well, though this makes no difference to the
+way \fBpcre2_compile()\fP behaves. See
.\" HREF
\fBpcre2api\fP
.\"
-for a description of their effects.
+for a description of the effects of these options.
.sp
allow_empty_class set PCRE2_ALLOW_EMPTY_CLASS
+ allow_surrogate_escapes set PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES
alt_bsux set PCRE2_ALT_BSUX
alt_circumflex set PCRE2_ALT_CIRCUMFLEX
alt_verbnames set PCRE2_ALT_VERBNAMES
anchored set PCRE2_ANCHORED
auto_callout set PCRE2_AUTO_CALLOUT
+ bad_escape_is_literal set PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL
/i caseless set PCRE2_CASELESS
dollar_endonly set PCRE2_DOLLAR_ENDONLY
/s dotall set PCRE2_DOTALL
dupnames set PCRE2_DUPNAMES
+ endanchored set PCRE2_ENDANCHORED
/x extended set PCRE2_EXTENDED
+ /xx extended_more set PCRE2_EXTENDED_MORE
firstline set PCRE2_FIRSTLINE
+ literal set PCRE2_LITERAL
+ match_line set PCRE2_EXTRA_MATCH_LINE
match_unset_backref set PCRE2_MATCH_UNSET_BACKREF
+ match_word set PCRE2_EXTRA_MATCH_WORD
/m multiline set PCRE2_MULTILINE
never_backslash_c set PCRE2_NEVER_BACKSLASH_C
never_ucp set PCRE2_NEVER_UCP
never_utf set PCRE2_NEVER_UTF
- no_auto_capture set PCRE2_NO_AUTO_CAPTURE
+ /n no_auto_capture set PCRE2_NO_AUTO_CAPTURE
no_auto_possess set PCRE2_NO_AUTO_POSSESS
no_dotstar_anchor set PCRE2_NO_DOTSTAR_ANCHOR
no_start_optimize set PCRE2_NO_START_OPTIMIZE
@@ -515,7 +587,9 @@ for a description of their effects.
As well as turning on the PCRE2_UTF option, the \fButf\fP modifier causes all
non-printing characters in output strings to be printed using the \ex{hh...}
notation. Otherwise, those less than 0x100 are output in hex without the curly
-brackets.
+brackets. Setting \fButf\fP in 16-bit or 32-bit mode also causes pattern and
+subject strings to be translated to UTF-16 or UTF-32, respectively, before
+being passed to library functions.
.
.
.\" HTML <a name="controlmodifiers"></a>
@@ -523,12 +597,18 @@ brackets.
.rs
.sp
The following modifiers affect the compilation process or request information
-about the pattern:
+about the pattern. There are single-letter abbreviations for some that are
+heavily used in the test files.
.sp
bsr=[anycrlf|unicode] specify \eR handling
/B bincode show binary code without lengths
callout_info show callout information
+ convert=<options> request foreign pattern conversion
+ convert_glob_escape=c set glob escape character
+ convert_glob_separator=c set glob separator character
+ convert_length set convert buffer length
debug same as info,fullbincode
+ framesize show matching frame size
fullbincode show binary code with lengths
/I info show info about compiled pattern
hex unquoted characters are hexadecimal
@@ -546,7 +626,10 @@ about the pattern:
push push compiled pattern onto the stack
pushcopy push a copy onto the stack
stackguard=<number> test the stackguard feature
+ subject_literal treat all subject lines as literal
tables=[0|1|2] select internal tables
+ use_length do not zero-terminate the pattern
+ utf8_input treat input as UTF-8
.sp
The effects of these modifiers are described in the following sections.
.
@@ -561,7 +644,7 @@ is built, with the default default being Unicode.
.P
The \fBnewline\fP modifier specifies which characters are to be interpreted as
newlines, both in the pattern and in subject lines. The type must be one of CR,
-LF, CRLF, ANYCRLF, or ANY (in upper or lower case).
+LF, CRLF, ANYCRLF, ANY, or NUL (in upper or lower case).
.
.
.SS "Information about a pattern"
@@ -609,6 +692,10 @@ unit" is the last literal code unit that must be present in any match. This is
not necessarily the last character. These lines are omitted if no starting or
ending code units are recorded.
.P
+The \fBframesize\fP modifier shows the size, in bytes, of the storage frames
+used by \fBpcre2_match()\fP for handling backtracking. The size depends on the
+number of capturing parentheses in the pattern.
+.P
The \fBcallout_info\fP modifier requests information about all the callouts in
the pattern. A list of them is output at the end of any other information that
is requested. For each callout, either its number or string is given, followed
@@ -642,12 +729,41 @@ nine characters, only two of which are specified in hexadecimal:
/ab "literal" 32/hex
.sp
Either single or double quotes may be used. There is no way of including
-the delimiter within a substring.
+the delimiter within a substring. The \fBhex\fP and \fBexpand\fP modifiers are
+mutually exclusive.
+.
+.
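The example `/ab "literal" 32/hex` above can be read as one hexadecimal pair (ab), a seven-character quoted literal, and a second pair (32), giving nine characters in total. The Python sketch below illustrates that reading. The helper name `parse_hex_pattern` is hypothetical, and the handling of white space and of errors is an assumption rather than a description of the pcre2test source.

```python
import string


def parse_hex_pattern(text: str) -> bytes:
    """Expand a hex-modifier pattern body into the bytes it denotes.

    Hypothetical illustration: unquoted characters are taken as pairs of
    hexadecimal digits, white space between items is skipped, and text in
    single or double quotes is copied literally. A quoted substring cannot
    contain its own delimiter.
    """
    out = bytearray()
    i = 0
    while i < len(text):
        ch = text[i]
        if ch.isspace():
            i += 1
        elif ch in "'\"":
            end = text.find(ch, i + 1)
            if end < 0:
                raise ValueError("unterminated quoted substring")
            out += text[i + 1:end].encode("latin-1")
            i = end + 1
        else:
            pair = text[i:i + 2]
            if len(pair) != 2 or any(c not in string.hexdigits for c in pair):
                raise ValueError(f"expected a pair of hex digits at offset {i}")
            out.append(int(pair, 16))
            i += 2
    return bytes(out)
```

Because the result may contain binary zeros (for example from the pair 00), a pattern built this way has to be passed by length, which is why \fBhex\fP implies the behaviour of \fBuse_length\fP.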
+.SS "Specifying the pattern's length"
+.rs
+.sp
+By default, patterns are passed to the compiling functions as zero-terminated
+strings but can be passed by length instead of being zero-terminated. The
+\fBuse_length\fP modifier causes this to happen. Using a length happens
+automatically (whether or not \fBuse_length\fP is set) when \fBhex\fP is set,
+because patterns specified in hexadecimal may contain binary zeros.
.P
-By default, \fBpcre2test\fP passes patterns as zero-terminated strings to
-\fBpcre2_compile()\fP, giving the length as PCRE2_ZERO_TERMINATED. However, for
-patterns specified with the \fBhex\fP modifier, the actual length of the
-pattern is passed.
+If \fBhex\fP or \fBuse_length\fP is used with the POSIX wrapper API (see
+.\" HTML <a href="#posixwrapper">
+.\" </a>
+"Using the POSIX wrapper API"
+.\"
+below), the REG_PEND extension is used to pass the pattern's length.
+.
+.
+.SS "Specifying wide characters in 16-bit and 32-bit modes"
+.rs
+.sp
+In 16-bit and 32-bit modes, all input is automatically treated as UTF-8 and
+translated to UTF-16 or UTF-32 when the \fButf\fP modifier is set. For testing
+the 16-bit and 32-bit libraries in non-UTF mode, the \fButf8_input\fP modifier
+can be used. It is mutually exclusive with \fButf\fP. Input lines are
+interpreted as UTF-8 as a means of specifying wide characters. More details are
+given in
+.\" HTML <a href="#inputencoding">
+.\" </a>
+"Input encoding"
+.\"
+above.
.
.
.SS "Generating long repetitive patterns"
@@ -665,7 +781,8 @@ are expanded before the pattern is passed to \fBpcre2_compile()\fP. For
example, \e[AB]{6000} is expanded to "ABAB..." 6000 times. This construction
cannot be nested. An initial "\e[" sequence is recognized only if "]{" followed
by decimal digits and "}" is found later in the pattern. If not, the characters
-remain in the pattern unaltered.
+remain in the pattern unaltered. The \fBexpand\fP and \fBhex\fP modifiers are
+mutually exclusive.
.P
If part of an expanded pattern looks like an expansion, but is really part of
the actual pattern, unwanted expansion can be avoided by giving two values in
@@ -696,7 +813,7 @@ below
.\"
for details of how these options are specified for each match attempt.
.P
-JIT compilation is requested by the \fB/jit\fP pattern modifier, which may
+JIT compilation is requested by the \fBjit\fP pattern modifier, which may
optionally be followed by an equals sign and a number in the range 0 to 7.
The three bits that make up the number specify which of the three JIT operating
modes are to be compiled:
@@ -705,7 +822,7 @@ modes are to be compiled:
2 compile JIT code for soft partial matching
4 compile JIT code for hard partial matching
.sp
-The possible values for the \fB/jit\fP modifier are therefore:
+The possible values for the \fBjit\fP modifier are therefore:
.sp
0 disable JIT
1 normal matching only
@@ -720,7 +837,7 @@ to \fBpcre2_match()\fP with either the PCRE2_PARTIAL_SOFT or the
PCRE2_PARTIAL_HARD option set. Note that such a call may return a complete
match; the options enable the possibility of a partial match, but do not
require it. Note also that if you request JIT compilation only for partial
-matching (for example, /jit=2) but do not set the \fBpartial\fP modifier on a
+matching (for example, jit=2) but do not set the \fBpartial\fP modifier on a
subject line, that match will not use JIT code because none was compiled for
non-partial matching.
.P
@@ -750,14 +867,14 @@ code was actually used in the match.
.SS "Setting a locale"
.rs
.sp
-The \fB/locale\fP modifier must specify the name of a locale, for example:
+The \fBlocale\fP modifier must specify the name of a locale, for example:
.sp
/pattern/locale=fr_FR
.sp
The given locale is set, \fBpcre2_maketables()\fP is called to build a set of
character tables for the locale, and this is then passed to
\fBpcre2_compile()\fP when compiling the regular expression. The same tables
-are used when matching the following subject lines. The \fB/locale\fP modifier
+are used when matching the following subject lines. The \fBlocale\fP modifier
applies only to the pattern on which it appears, but can be given in a
\fB#pattern\fP command if a default is needed. Setting a locale and alternate
character tables are mutually exclusive.
@@ -766,7 +883,7 @@ character tables are mutually exclusive.
.SS "Showing pattern memory"
.rs
.sp
-The \fB/memory\fP modifier causes the size in bytes of the memory used to hold
+The \fBmemory\fP modifier causes the size in bytes of the memory used to hold
the compiled pattern to be output. This does not include the size of the
\fBpcre2_code\fP block; it is just the actual compiled data. If the pattern is
subsequently passed to the JIT compiler, the size of the JIT compiled code is
@@ -797,10 +914,11 @@ causes a compilation error. The default is the largest number a PCRE2_SIZE
variable can hold (essentially unlimited).
.
.
+.\" HTML <a name="posixwrapper"></a>
.SS "Using the POSIX wrapper API"
.rs
.sp
-The \fB/posix\fP and \fBposix_nosub\fP modifiers cause \fBpcre2test\fP to call
+The \fBposix\fP and \fBposix_nosub\fP modifiers cause \fBpcre2test\fP to call
PCRE2 via the POSIX wrapper API rather than its native API. When
\fBposix_nosub\fP is used, the POSIX option REG_NOSUB is passed to
\fBregcomp()\fP. The POSIX wrapper supports only the 8-bit library. Note that
@@ -830,12 +948,16 @@ large buffer is used.
The \fBaftertext\fP and \fBallaftertext\fP subject modifiers work as described
below. All other modifiers are either ignored, with a warning message, or cause
an error.
+.P
+The pattern is passed to \fBregcomp()\fP as a zero-terminated string by
+default, but if the \fBuse_length\fP or \fBhex\fP modifiers are set, the
+REG_PEND extension is used to pass it by length.
.
.
.SS "Testing the stack guard feature"
.rs
.sp
-The \fB/stackguard\fP modifier is used to test the use of
+The \fBstackguard\fP modifier is used to test the use of
\fBpcre2_set_compile_recursion_guard()\fP, a function that is provided to
enable stack availability to be checked during compilation (see the
.\" HREF
@@ -852,7 +974,7 @@ be aborted.
.SS "Using alternative character tables"
.rs
.sp
-The value specified for the \fB/tables\fP modifier must be one of the digits 0,
+The value specified for the \fBtables\fP modifier must be one of the digits 0,
1, or 2. It causes a specific set of built-in character tables to be passed to
\fBpcre2_compile()\fP. This is used in the PCRE2 tests to check behaviour with
different character tables. The digit specifies the tables as follows:
@@ -870,17 +992,19 @@ are mutually exclusive.
.SS "Setting certain match controls"
.rs
.sp
-The following modifiers are really subject modifiers, and are described below.
-However, they may be included in a pattern's modifier list, in which case they
-are applied to every subject line that is processed with that pattern. They may
-not appear in \fB#pattern\fP commands. These modifiers do not affect the
-compilation process.
+The following modifiers are really subject modifiers, and are described under
+"Subject Modifiers" below. However, they may be included in a pattern's
+modifier list, in which case they are applied to every subject line that is
+processed with that pattern. These modifiers do not affect the compilation
+process.
.sp
aftertext show text after match
allaftertext show text after captures
allcaptures show all captures
allusedtext show all consulted text
+ altglobal alternative global matching
/g global global matching
+ jitstack=<n> set size of JIT stack
mark show mark values
replace=<string> specify a replacement string
startchar show starting character when relevant
@@ -893,6 +1017,15 @@ These modifiers may not appear in a \fB#pattern\fP command. If you want them as
defaults, set them in a \fB#subject\fP command.
.
.
+.SS "Specifying literal subject lines"
+.rs
+.sp
+If the \fBsubject_literal\fP modifier is present on a pattern, all the subject
+lines that are matched against it are taken as literal strings, with no
+interpretation of backslashes. It is not possible to set subject modifiers on
+such lines, but any that are set as defaults by a \fB#subject\fP command are
+recognized.
+.
+.
.SS "Saving a compiled pattern"
.rs
.sp
@@ -903,7 +1036,9 @@ facility is used when saving compiled patterns to a file, as described in the
section entitled "Saving and restoring compiled patterns"
.\" HTML <a href="#saverestore">
.\" </a>
-below. If \fBpushcopy\fP is used instead of \fBpush\fP, a copy of the compiled
+below.
+.\"
+If \fBpushcopy\fP is used instead of \fBpush\fP, a copy of the compiled
pattern is stacked, leaving the original as current, ready to match the
following input lines. This provides a way of testing the
\fBpcre2_code_copy()\fP function.
@@ -916,6 +1051,39 @@ allowed, does not carry through to any subsequent matching that uses a stacked
pattern.
.
.
+.SS "Testing foreign pattern conversion"
+.rs
+.sp
+The experimental foreign pattern conversion functions in PCRE2 can be tested by
+setting the \fBconvert\fP modifier. Its argument is a colon-separated list of
+options, which set the equivalent option for the \fBpcre2_pattern_convert()\fP
+function:
+.sp
+ glob PCRE2_CONVERT_GLOB
+ glob_no_starstar PCRE2_CONVERT_GLOB_NO_STARSTAR
+ glob_no_wild_separator PCRE2_CONVERT_GLOB_NO_WILD_SEPARATOR
+ posix_basic PCRE2_CONVERT_POSIX_BASIC
+ posix_extended PCRE2_CONVERT_POSIX_EXTENDED
+ unset Unset all options
+.sp
+The "unset" value is useful for turning off a default that has been set by a
+\fB#pattern\fP command. When one of these options is set, the input pattern is
+passed to \fBpcre2_pattern_convert()\fP. If the conversion is successful, the
+result is reflected in the output and then passed to \fBpcre2_compile()\fP. The
+normal \fButf\fP and \fBno_utf_check\fP options, if set, cause the
+PCRE2_CONVERT_UTF and PCRE2_CONVERT_NO_UTF_CHECK options to be passed to
+\fBpcre2_pattern_convert()\fP.
+.P
+By default, the conversion function is allowed to allocate a buffer for its
+output. However, if the \fBconvert_length\fP modifier is set to a value greater
+than zero, \fBpcre2test\fP passes a buffer of the given length. This makes it
+possible to test the length check.
+.P
+The \fBconvert_glob_escape\fP and \fBconvert_glob_separator\fP modifiers can be
+used to specify the escape and separator characters for glob processing,
+overriding the defaults, which are operating-system dependent.
+.
+.
.\" HTML <a name="subjectmodifiers"></a>
.SH "SUBJECT MODIFIERS"
.rs
@@ -935,6 +1103,7 @@ The following modifiers set options for \fBpcre2_match()\fP or
for a description of their effects.
.sp
anchored set PCRE2_ANCHORED
+ endanchored set PCRE2_ENDANCHORED
dfa_restart set PCRE2_DFA_RESTART
dfa_shortest set PCRE2_DFA_SHORTEST
no_jit set PCRE2_NO_JIT
@@ -949,11 +1118,27 @@ for a description of their effects.
The partial matching modifiers are provided with abbreviations because they
appear frequently in tests.
.P
-If the \fB/posix\fP modifier was present on the pattern, causing the POSIX
-wrapper API to be used, the only option-setting modifiers that have any effect
-are \fBnotbol\fP, \fBnotempty\fP, and \fBnoteol\fP, causing REG_NOTBOL,
-REG_NOTEMPTY, and REG_NOTEOL, respectively, to be passed to \fBregexec()\fP.
-The other modifiers are ignored, with a warning message.
+If the \fBposix\fP or \fBposix_nosub\fP modifier was present on the pattern,
+causing the POSIX wrapper API to be used, the only option-setting modifiers
+that have any effect are \fBnotbol\fP, \fBnotempty\fP, and \fBnoteol\fP,
+causing REG_NOTBOL, REG_NOTEMPTY, and REG_NOTEOL, respectively, to be passed to
+\fBregexec()\fP. The other modifiers are ignored, with a warning message.
+.P
+There is one additional modifier that can be used with the POSIX wrapper. It is
+ignored (with a warning) if used for non-POSIX matching.
+.sp
+ posix_startend=<n>[:<m>]
+.sp
+This causes the subject string to be passed to \fBregexec()\fP using the
+REG_STARTEND option, which uses offsets to specify which part of the string is
+searched. If only one number is given, the end offset is passed as the end of
+the subject string. For more detail of REG_STARTEND, see the
+.\" HREF
+\fBpcre2posix\fP
+.\"
+documentation. If the subject string contains binary zeros (coded as escapes
+such as \ex{00} because \fBpcre2test\fP does not support actual binary zeros in
+its input), you must use \fBposix_startend\fP to specify its length.
.
.
.SS "Setting match controls"
@@ -971,23 +1156,28 @@ pattern.
altglobal alternative global matching
callout_capture show captures at callout time
callout_data=<n> set a value to pass via callouts
+ callout_error=<n>[:<m>] control callout error
+ callout_extra show extra callout information
callout_fail=<n>[:<m>] control callout failure
+ callout_no_where do not show position of a callout
callout_none do not supply a callout function
copy=<number or name> copy captured substring
+ depth_limit=<n> set a depth limit
dfa use \fBpcre2_dfa_match()\fP
- find_limits find match and recursion limits
+ find_limits find match and depth limits
get=<number or name> extract captured substring
getall extract all captured substrings
/g global global matching
+ heap_limit=<n> set a limit on heap memory
jitstack=<n> set size of JIT stack
mark show mark values
match_limit=<n> set a match limit
- memory show memory usage
+ memory show heap memory usage
null_context match with a NULL context
offset=<n> set starting offset
offset_limit=<n> set offset limit
ovector=<n> set size of output vector
- recursion_limit=<n> set a recursion limit
+ recursion_limit=<n> obsolete synonym for depth_limit
replace=<string> specify a replacement string
startchar show startchar when relevant
startoffset=<n> same as offset=<n>
@@ -1063,27 +1253,20 @@ does no capturing); it is ignored, with a warning message, if present.
.rs
.sp
A callout function is supplied when \fBpcre2test\fP calls the library matching
-functions, unless \fBcallout_none\fP is specified. If \fBcallout_capture\fP is
-set, the current captured groups are output when a callout occurs.
-.P
-The \fBcallout_fail\fP modifier can be given one or two numbers. If there is
-only one number, 1 is returned instead of 0 when a callout of that number is
-reached. If two numbers are given, 1 is returned when callout <n> is reached
-for the <m>th time. Note that callouts with string arguments are always given
-the number zero. See "Callouts" below for a description of the output when a
-callout it taken.
-.P
-The \fBcallout_data\fP modifier can be given an unsigned or a negative number.
-This is set as the "user data" that is passed to the matching function, and
-passed back when the callout function is invoked. Any value other than zero is
-used as a return from \fBpcre2test\fP's callout function.
+functions, unless \fBcallout_none\fP is specified. Its behaviour can be
+controlled by various modifiers listed above whose names begin with
+\fBcallout_\fP. Details are given in the section entitled "Callouts"
+.\" HTML <a href="#callouts">
+.\" </a>
+below.
+.\"
.
.
.SS "Finding all matches in a string"
.rs
.sp
Searching for all possible matches within a subject can be requested by the
-\fBglobal\fP or \fB/altglobal\fP modifier. After finding a match, the matching
+\fBglobal\fP or \fBaltglobal\fP modifier. After finding a match, the matching
function is called again to search the remainder of the subject. The difference
between \fBglobal\fP and \fBaltglobal\fP is that the former uses the
\fIstart_offset\fP argument to \fBpcre2_match()\fP or \fBpcre2_dfa_match()\fP
@@ -1198,39 +1381,44 @@ matching provokes an error return ("bad option value") from
.sp
The \fBjitstack\fP modifier provides a way of setting the maximum stack size
that is used by the just-in-time optimization code. It is ignored if JIT
-optimization is not being used. The value is a number of kilobytes. Providing a
-stack that is larger than the default 32K is necessary only for very
-complicated patterns.
+optimization is not being used. The value is a number of kilobytes. Setting
+zero reverts to the default of 32K. Providing a stack that is larger than the
+default is necessary only for very complicated patterns. If \fBjitstack\fP is
+set non-zero on a subject line it overrides any value that was set on the
+pattern.
.
.
-.SS "Setting match and recursion limits"
+.SS "Setting heap, match, and depth limits"
.rs
.sp
-The \fBmatch_limit\fP and \fBrecursion_limit\fP modifiers set the appropriate
-limits in the match context. These values are ignored when the
+The \fBheap_limit\fP, \fBmatch_limit\fP, and \fBdepth_limit\fP modifiers set
+the appropriate limits in the match context. These values are ignored when the
\fBfind_limits\fP modifier is specified.
.
.
.SS "Finding minimum limits"
.rs
.sp
-If the \fBfind_limits\fP modifier is present, \fBpcre2test\fP calls
-\fBpcre2_match()\fP several times, setting different values in the match
-context via \fBpcre2_set_match_limit()\fP and \fBpcre2_set_recursion_limit()\fP
-until it finds the minimum values for each parameter that allow
-\fBpcre2_match()\fP to complete without error.
+If the \fBfind_limits\fP modifier is present on a subject line, \fBpcre2test\fP
+calls the relevant matching function several times, setting different values in
+the match context via \fBpcre2_set_heap_limit()\fP, \fBpcre2_set_match_limit()\fP,
+or \fBpcre2_set_depth_limit()\fP until it finds the minimum values for each
+parameter that allow the match to complete without error.
.P
If JIT is being used, only the match limit is relevant. If DFA matching is
-being used, neither limit is relevant, and this modifier is ignored (with a
-warning message).
+being used, only the depth limit is relevant.
.P
The \fImatch_limit\fP number is a measure of the amount of backtracking
that takes place, and learning the minimum value can be instructive. For most
simple matches, the number is quite small, but for patterns with very large
numbers of matching possibilities, it can become large very quickly with
-increasing length of subject string. The \fImatch_limit_recursion\fP number is
-a measure of how much stack (or, if PCRE2 is compiled with NO_RECURSE, how much
-heap) memory is needed to complete the match attempt.
+increasing length of subject string.
+.P
+For non-DFA matching, the minimum \fIdepth_limit\fP number is a measure of how
+much nested backtracking happens (that is, how deeply the pattern's tree is
+searched). In the case of DFA matching, \fIdepth_limit\fP controls the depth of
+recursive calls of the internal function that is used for handling pattern
+recursion, lookaround assertions, and atomic groups.
.
.
.SS "Showing MARK names"
@@ -1247,8 +1435,15 @@ is added to the non-match message.
.SS "Showing memory usage"
.rs
.sp
-The \fBmemory\fP modifier causes \fBpcre2test\fP to log all memory allocation
-and freeing calls that occur during a match operation.
+The \fBmemory\fP modifier causes \fBpcre2test\fP to log the sizes of all heap
+memory allocation and freeing calls that occur during a call to
+\fBpcre2_match()\fP. These occur only when a match requires a bigger vector
+than the default for remembering backtracking points. In many cases there will
+be no heap memory used and therefore no additional output. No heap memory is
+allocated during matching with \fBpcre2_dfa_match()\fP or with JIT, so in those
+cases the \fBmemory\fP modifier never has any effect. For this modifier to
+work, the \fBnull_context\fP modifier must not be set on both the pattern and
+the subject, though it can be set on one or the other.
.
.
.SS "Setting a starting offset"
@@ -1291,8 +1486,8 @@ pair of offsets.)
By default, the subject string is passed to a native API matching function with
its correct length. In order to test the facility for passing a zero-terminated
string, the \fBzero_terminate\fP modifier is provided. It causes the length to
-be passed as PCRE2_ZERO_TERMINATED. (When matching via the POSIX interface,
-this modifier has no effect, as there is no facility for passing a length.)
+be passed as PCRE2_ZERO_TERMINATED. When matching via the POSIX interface,
+this modifier is ignored, with a warning.
.P
When testing \fBpcre2_substitute()\fP, this modifier also has the effect of
passing the replacement string as zero-terminated.
@@ -1349,7 +1544,7 @@ code unit offset of the start of the failing character is also output. Here is
an example of an interactive \fBpcre2test\fP run.
.sp
$ pcre2test
- PCRE2 version 9.00 2014-05-10
+ PCRE2 version 10.22 2016-07-29
.sp
re> /^abc(\ed+)/
data> abc123
@@ -1376,7 +1571,7 @@ unset substring is shown as "<unset>", as for the second data line.
If the strings contain any non-printing characters, they are output as \exhh
escapes if the value is less than 256 and UTF mode is not set. Otherwise they
are output as \ex{hh...} escapes. See below for the definition of non-printing
-characters. If the \fB/aftertext\fP modifier is set, the output for substring
+characters. If the \fBaftertext\fP modifier is set, the output for substring
0 is followed by the rest of the subject string, identified by "0+" like
this:
.sp
@@ -1470,27 +1665,15 @@ For further information about partial matching, see the
documentation.
.
.
+.\" HTML <a name="callouts"></a>
.SH CALLOUTS
.rs
.sp
If the pattern contains any callout requests, \fBpcre2test\fP's callout
-function is called during matching unless \fBcallout_none\fP is specified.
-This works with both matching functions.
-.P
-The callout function in \fBpcre2test\fP returns zero (carry on matching) by
-default, but you can use a \fBcallout_fail\fP modifier in a subject line (as
-described above) to change this and other parameters of the callout.
-.P
-Inserting callouts can be helpful when using \fBpcre2test\fP to check
-complicated regular expressions. For further information about callouts, see
-the
-.\" HREF
-\fBpcre2callout\fP
-.\"
-documentation.
-.P
-The output for callouts with numerical arguments and those with string
-arguments is slightly different.
+function is called during matching unless \fBcallout_none\fP is specified. This
+works with both matching functions, and with JIT, though there are some
+differences in behaviour. The output for callouts with numerical arguments and
+those with string arguments is slightly different.
.
.
.SS "Callouts with numerical arguments"
@@ -1511,7 +1694,7 @@ the current position precedes the start position, which can happen if the
callout is in a lookbehind assertion.
.P
Callouts numbered 255 are assumed to be automatic callouts, inserted as a
-result of the \fB/auto_callout\fP pattern modifier. In this case, instead of
+result of the \fBauto_callout\fP pattern modifier. In this case, instead of
showing the callout number, the offset in the pattern, preceded by a plus, is
output. For example:
.sp
@@ -1564,6 +1747,103 @@ example:
.sp
.
.
+.SS "Callout modifiers"
+.rs
+.sp
+The callout function in \fBpcre2test\fP returns zero (carry on matching) by
+default, but you can use a \fBcallout_fail\fP modifier in a subject line to
+change this and other parameters of the callout (see below).
+.P
+If the \fBcallout_capture\fP modifier is set, the current captured groups are
+output when a callout occurs. This is useful only for non-DFA matching, as
+\fBpcre2_dfa_match()\fP does not support capturing, so no captures are ever
+shown.
+.P
+The normal callout output, showing the callout number or pattern offset (as
+described above), is suppressed if the \fBcallout_no_where\fP modifier is set.
+.P
+When using the interpretive matching function \fBpcre2_match()\fP without JIT,
+setting the \fBcallout_extra\fP modifier causes additional output from
+\fBpcre2test\fP's callout function to be generated. For the first callout in a
+match attempt at a new starting position in the subject, "New match attempt" is
+output. If there has been a backtrack since the last callout (or start of
+matching if this is the first callout), "Backtrack" is output, followed by "No
+other matching paths" if the backtrack ended the previous match attempt. For
+example:
+.sp
+ re> /(a+)b/auto_callout,no_start_optimize,no_auto_possess
+ data> aac\e=callout_extra
+ New match attempt
+ --->aac
+ +0 ^ (
+ +1 ^ a+
+ +3 ^ ^ )
+ +4 ^ ^ b
+ Backtrack
+ --->aac
+ +3 ^^ )
+ +4 ^^ b
+ Backtrack
+ No other matching paths
+ New match attempt
+ --->aac
+ +0 ^ (
+ +1 ^ a+
+ +3 ^^ )
+ +4 ^^ b
+ Backtrack
+ No other matching paths
+ New match attempt
+ --->aac
+ +0 ^ (
+ +1 ^ a+
+ Backtrack
+ No other matching paths
+ New match attempt
+ --->aac
+ +0 ^ (
+ +1 ^ a+
+ No match
+.sp
+Notice that various optimizations must be turned off if you want all possible
+matching paths to be scanned. If \fBno_start_optimize\fP is not used, there is
+an immediate "no match", without any callouts, because the starting
+optimization fails to find "b" in the subject, which it knows must be present
+for any match. If \fBno_auto_possess\fP is not used, the "a+" item is turned
+into "a++", which reduces the number of backtracks.
+.P
+The \fBcallout_extra\fP modifier has no effect if used with the DFA matching
+function, or with JIT.
+.
+.
+.SS "Return values from callouts"
+.rs
+.sp
+The default return from the callout function is zero, which allows matching to
+continue. The \fBcallout_fail\fP modifier can be given one or two numbers. If
+there is only one number, 1 is returned instead of 0 (causing matching to
+backtrack) when a callout of that number is reached. If two numbers (<n>:<m>)
+are given, 1 is returned when callout <n> is reached and there have been at
+least <m> callouts. The \fBcallout_error\fP modifier is similar, except that
+PCRE2_ERROR_CALLOUT is returned, causing the entire matching process to be
+aborted. If both these modifiers are set for the same callout number,
+\fBcallout_error\fP takes precedence. Note that callouts with string arguments
+are always given the number zero.
+.P
+The \fBcallout_data\fP modifier can be given an unsigned or a negative number.
+This is set as the "user data" that is passed to the matching function, and
+passed back when the callout function is invoked. Any value other than zero is
+used as a return from \fBpcre2test\fP's callout function.
+.P
+Inserting callouts can be helpful when using \fBpcre2test\fP to check
+complicated regular expressions. For further information about callouts, see
+the
+.\" HREF
+\fBpcre2callout\fP
+.\"
+documentation.
+.
+.
.
.SH "NON-PRINTING CHARACTERS"
.rs
@@ -1574,7 +1854,7 @@ therefore shown as hex escapes.
.P
When \fBpcre2test\fP is outputting text that is a matched part of a subject
string, it behaves in the same way, unless a different locale has been set for
-the pattern (using the \fB/locale\fP modifier). In this case, the
+the pattern (using the \fBlocale\fP modifier). In this case, the
\fBisprint()\fP function is used to distinguish printing and non-printing
characters.
.
@@ -1682,6 +1962,6 @@ Cambridge, England.
.rs
.sp
.nf
-Last updated: 06 July 2016
-Copyright (c) 1997-2016 University of Cambridge.
+Last updated: 21 December 2017
+Copyright (c) 1997-2017 University of Cambridge.
.fi
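
Reviewer's sketch (not part of the patch): the rules added under "Return values from callouts" can be modelled in a few lines. This is a minimal illustration and is not taken from the pcre2test source. The names Trigger and callout_return are hypothetical, ERROR_CALLOUT is a placeholder standing in for PCRE2_ERROR_CALLOUT, and the sketch assumes that the callout count includes the callout currently being handled. The callout_data modifier is deliberately left out.

```python
from dataclasses import dataclass
from typing import Optional

CONTINUE = 0        # default return: carry on matching
BACKTRACK = 1       # return used by callout_fail: force a backtrack
ERROR_CALLOUT = -1  # placeholder for PCRE2_ERROR_CALLOUT (real value not assumed)


@dataclass
class Trigger:
    """Hypothetical model of a callout_fail or callout_error setting.

    number    -- the callout number <n> the modifier refers to
    min_count -- the optional <m>; 0 models the single-number form,
                 which fires whenever callout <n> is reached
    """
    number: int
    min_count: int = 0

    def fires(self, callout_number: int, callouts_so_far: int) -> bool:
        # Fires when callout <n> is reached and at least <m> callouts
        # have happened (assumed to include the current one).
        return (callout_number == self.number
                and callouts_so_far >= self.min_count)


def callout_return(callout_number: int,
                   callouts_so_far: int,
                   fail: Optional[Trigger] = None,
                   error: Optional[Trigger] = None) -> int:
    """Return the value the modelled callout function would give back."""
    # callout_error takes precedence when both apply to the same callout.
    if error is not None and error.fires(callout_number, callouts_so_far):
        return ERROR_CALLOUT
    if fail is not None and fail.fires(callout_number, callouts_so_far):
        return BACKTRACK
    return CONTINUE


if __name__ == "__main__":
    # Models callout_fail=2:3 over five successive hits of callout 2.
    setting = Trigger(number=2, min_count=3)
    for count in range(1, 6):
        print(count, callout_return(2, count, fail=setting))
```

Because callouts with string arguments are always given the number zero, a setting such as Trigger(0) in this model applies to every string-argument callout.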