Diffstat (limited to 'doc/pcre2test.1')
-rw-r--r-- | doc/pcre2test.1 | 365 |
1 files changed, 277 insertions, 88 deletions
diff --git a/doc/pcre2test.1 b/doc/pcre2test.1 index 857adc3..b8eef93 100644 --- a/doc/pcre2test.1 +++ b/doc/pcre2test.1 @@ -1,4 +1,4 @@ -.TH PCRE2TEST 1 "20 May 2015" "PCRE 10.20" +.TH PCRE2TEST 1 "12 December 2015" "PCRE 10.21" .SH NAME pcre2test - a program for testing Perl-compatible regular expressions. .SH SYNOPSIS @@ -122,12 +122,13 @@ following options output the value and set the exit code as indicated: The following options output 1 for true or 0 for false, and set the exit code to the same value: .sp - ebcdic compiled for an EBCDIC environment - jit just-in-time support is available - pcre2-16 the 16-bit library was built - pcre2-32 the 32-bit library was built - pcre2-8 the 8-bit library was built - unicode Unicode support is available + backslash-C \eC is supported (not locked out) + ebcdic compiled for an EBCDIC environment + jit just-in-time support is available + pcre2-16 the 16-bit library was built + pcre2-32 the 32-bit library was built + pcre2-8 the 8-bit library was built + unicode Unicode support is available .sp If an unknown option is given, an error message is output; the exit code is 0. .TP 10 @@ -217,9 +218,9 @@ Each subject line is matched separately and independently. If you want to do multi-line matches, you have to use the \en escape sequence (or \er or \er\en, etc., depending on the newline setting) in a single line of input to encode the newline sequences. There is no limit on the length of subject lines; the input -buffer is automatically extended if it is too small. There is a replication -feature that makes it possible to generate long subject lines without having to -supply them explicitly. +buffer is automatically extended if it is too small. There are replication +features that makes it possible to generate long repetitive pattern or subject +lines without having to supply them explicitly. 
.P An empty line or the end of the file signals the end of the subject lines for a test, at which point a new pattern or command line is expected if there is @@ -260,6 +261,34 @@ described in the section entitled "Saving and restoring compiled patterns" below. .\" .sp + #newline_default [<newline-list>] +.sp +When PCRE2 is built, a default newline convention can be specified. This +determines which characters and/or character pairs are recognized as indicating +a newline in a pattern or subject string. The default can be overridden when a +pattern is compiled. The standard test files contain tests of various newline +conventions, but the majority of the tests expect a single linefeed to be +recognized as a newline by default. Without special action the tests would fail +when PCRE2 is compiled with either CR or CRLF as the default newline. +.P +The #newline_default command specifies a list of newline types that are +acceptable as the default. The types must be one of CR, LF, CRLF, ANYCRLF, or +ANY (in upper or lower case), for example: +.sp + #newline_default LF Any anyCRLF +.sp +If the default newline is in the list, this command has no effect. Otherwise, +except when testing the POSIX API, a \fBnewline\fP modifier that specifies the +first newline convention in the list (LF in the above example) is added to any +pattern that does not already have a \fBnewline\fP modifier. If the newline +list is empty, the feature is turned off. This command is present in a number +of the standard test input files. +.P +When the POSIX API is being tested there is no way to override the default +newline convention, though it is possible to set the newline convention from +within the pattern. A warning is given if the \fBposix\fP modifier is used when +\fB#newline_default\fP would set a default for the non-POSIX API. +.sp #pattern <modifier-list> .sp This command sets a default modifier list that applies to all subsequent @@ -303,12 +332,13 @@ subject lines. 
Modifiers on a subject line can change these settings. .rs .sp Modifier lists are used with both pattern and subject lines. Items in a list -are separated by commas and optional white space. Some modifiers may be given -for both patterns and subject lines, whereas others are valid for one or the -other only. Each modifier has a long name, for example "anchored", and some of -them must be followed by an equals sign and a value, for example, "offset=12". -Modifiers that do not take values may be preceded by a minus sign to turn off a -previous setting. +are separated by commas followed by optional white space. Trailing whitespace +in a modifier list is ignored. Some modifiers may be given for both patterns +and subject lines, whereas others are valid only for one or the other. Each +modifier has a long name, for example "anchored", and some of them must be +followed by an equals sign and a value, for example, "offset=12". Values cannot +contain comma characters, but may contain spaces. Modifiers that do not take +values may be preceded by a minus sign to turn off a previous setting. .P A few of the more common modifiers can also be specified as single letters, for example "i" for "caseless". In documentation, following the Perl convention, @@ -414,6 +444,12 @@ the start of a modifier list. For example: .sp abc\e=notbol,notempty .sp +If the subject string is empty and \e= is followed by whitespace, the line is +treated as a comment line, and is not used for matching. For example: +.sp + \e= This is a comment. + abc\e= This is an invalid modifier list. +.sp A backslash followed by any other non-alphanumeric character just escapes that character. A backslash followed by anything else causes an error. However, if the very last character in the line is a backslash (and there is no modifier @@ -424,10 +460,10 @@ a real empty line terminates the data input. 
.SH "PATTERN MODIFIERS" .rs .sp -There are three types of modifier that can appear in pattern lines, two of -which may also be used in a \fB#pattern\fP command. A pattern's modifier list -can add to or override default modifiers that were set by a previous -\fB#pattern\fP command. +There are several types of modifier that can appear in pattern lines. Except +where noted below, they may also be used in \fB#pattern\fP commands. A +pattern's modifier list can add to or override default modifiers that were set +by a previous \fB#pattern\fP command. . . .\" HTML <a name="optionmodifiers"></a> @@ -437,13 +473,14 @@ can add to or override default modifiers that were set by a previous The following modifiers set options for \fBpcre2_compile()\fP. The most common ones have single-letter abbreviations. See .\" HREF -\fBpcreapi\fP +\fBpcre2api\fP .\" for a description of their effects. .sp allow_empty_class set PCRE2_ALLOW_EMPTY_CLASS alt_bsux set PCRE2_ALT_BSUX alt_circumflex set PCRE2_ALT_CIRCUMFLEX + alt_verbnames set PCRE2_ALT_VERBNAMES anchored set PCRE2_ANCHORED auto_callout set PCRE2_AUTO_CALLOUT /i caseless set PCRE2_CASELESS @@ -464,6 +501,7 @@ for a description of their effects. no_utf_check set PCRE2_NO_UTF_CHECK ucp set PCRE2_UCP ungreedy set PCRE2_UNGREEDY + use_offset_limit set PCRE2_USE_OFFSET_LIMIT utf set PCRE2_UTF .sp As well as turning on the PCRE2_UTF option, the \fButf\fP modifier causes all @@ -490,8 +528,10 @@ about the pattern: jitfast use JIT fast path jitverify verify JIT use locale=<name> use this locale + max_pattern_length=<n> set the maximum pattern length memory show memory used newline=<type> set newline type + null_context compile with a NULL context parens_nest_limit=<n> set maximum parentheses depth posix use the POSIX API push push compiled pattern onto the stack @@ -565,6 +605,15 @@ is requested. For each callout, either its number or string is given, followed by the item that follows it in the pattern. . . 
+.SS "Passing a NULL context" +.rs +.sp +Normally, \fBpcre2test\fP passes a context block to \fBpcre2_compile()\fP. If +the \fBnull_context\fP modifier is set, however, NULL is passed. This is for +testing that \fBpcre2_compile()\fP behaves correctly in this case (it uses +default values). +. +. .SS "Specifying a pattern in hex" .rs .sp @@ -581,24 +630,83 @@ PCRE2_ZERO_TERMINATED. However, for patterns specified in hexadecimal, the actual length of the pattern is passed. . . +.SS "Generating long repetitive patterns" +.rs +.sp +Some tests use long patterns that are very repetitive. Instead of creating a +very long input line for such a pattern, you can use a special repetition +feature, similar to the one described for subject lines above. If the +\fBexpand\fP modifier is present on a pattern, parts of the pattern that have +the form +.sp + \e[<characters>]{<count>} +.sp +are expanded before the pattern is passed to \fBpcre2_compile()\fP. For +example, \e[AB]{6000} is expanded to "ABAB..." 6000 times. This construction +cannot be nested. An initial "\e[" sequence is recognized only if "]{" followed +by decimal digits and "}" is found later in the pattern. If not, the characters +remain in the pattern unaltered. +.P +If part of an expanded pattern looks like an expansion, but is really part of +the actual pattern, unwanted expansion can be avoided by giving two values in +the quantifier. For example, \e[AB]{6000,6000} is not recognized as an +expansion item. +.P +If the \fBinfo\fP modifier is set on an expanded pattern, the result of the +expansion is included in the information that is output. +. +. .SS "JIT compilation" .rs .sp -The \fB/jit\fP modifier may optionally be followed by an equals sign and a -number in the range 0 to 7: +Just-in-time (JIT) compiling is a heavyweight optimization that can greatly +speed up pattern matching. See the +.\" HREF +\fBpcre2jit\fP +.\" +documentation for details. 
JIT compiling happens, optionally, after a pattern +has been successfully compiled into an internal form. The JIT compiler converts +this to optimized machine code. It needs to know whether the match-time options +PCRE2_PARTIAL_HARD and PCRE2_PARTIAL_SOFT are going to be used, because +different code is generated for the different cases. See the \fBpartial\fP +modifier in "Subject Modifiers" +.\" HTML <a href="#subjectmodifiers"> +.\" </a> +below +.\" +for details of how these options are specified for each match attempt. +.P +JIT compilation is requested by the \fB/jit\fP pattern modifier, which may +optionally be followed by an equals sign and a number in the range 0 to 7. +The three bits that make up the number specify which of the three JIT operating +modes are to be compiled: +.sp + 1 compile JIT code for non-partial matching + 2 compile JIT code for soft partial matching + 4 compile JIT code for hard partial matching +.sp +The possible values for the \fB/jit\fP modifier are therefore: .sp 0 disable JIT - 1 use JIT for normal match only - 2 use JIT for soft partial match only - 3 use JIT for normal match and soft partial match - 4 use JIT for hard partial match only - 6 use JIT for soft and hard partial match + 1 normal matching only + 2 soft partial matching only + 3 normal and soft partial matching + 4 hard partial matching only + 6 soft and hard partial matching only 7 all three modes .sp -If no number is given, 7 is assumed. If JIT compilation is successful, the -compiled JIT code will automatically be used when \fBpcre2_match()\fP is run -for the appropriate type of match, except when incompatible run-time options -are specified. For more details, see the +If no number is given, 7 is assumed. The phrase "partial matching" means a call +to \fBpcre2_match()\fP with either the PCRE2_PARTIAL_SOFT or the +PCRE2_PARTIAL_HARD option set. 
Note that such a call may return a complete +match; the options enable the possibility of a partial match, but do not +require it. Note also that if you request JIT compilation only for partial +matching (for example, /jit=2) but do not set the \fBpartial\fP modifier on a +subject line, that match will not use JIT code because none was compiled for +non-partial matching. +.P +If JIT compilation is successful, the compiled JIT code will automatically be +used when an appropriate type of match is run, except when incompatible +run-time options are specified. For more details, see the .\" HREF \fBpcre2jit\fP .\" @@ -660,13 +768,26 @@ sets its own default of 220, which is required for running the standard test suite. . . +.SS "Limiting the pattern length" +.rs +.sp +The \fBmax_pattern_length\fP modifier sets a limit, in code units, to the +length of pattern that \fBpcre2_compile()\fP will accept. Breaching the limit +causes a compilation error. The default is the largest number a PCRE2_SIZE +variable can hold (essentially unlimited). +. +. .SS "Using the POSIX wrapper API" .rs .sp The \fB/posix\fP modifier causes \fBpcre2test\fP to call PCRE2 via the POSIX wrapper API rather than its native API. This supports only the 8-bit library. -When the POSIX API is being used, the following pattern modifiers set options -for the \fBregcomp()\fP function: +Note that it does not imply POSIX matching semantics; for more detail see the +.\" HREF +\fBpcre2posix\fP +.\" +documentation. When the POSIX API is being used, the following pattern +modifiers set options for the \fBregcomp()\fP function: .sp caseless REG_ICASE multiline REG_NEWLINE @@ -676,6 +797,15 @@ for the \fBregcomp()\fP function: ucp REG_UCP ) the POSIX standard utf REG_UTF8 ) .sp +The \fBregerror_buffsize\fP modifier specifies a size for the error buffer that +is passed to \fBregerror()\fP in the event of a compilation error. 
For example: +.sp + /abc/posix,regerror_buffsize=20 +.sp +This provides a means of testing the behaviour of \fBregerror()\fP when the +buffer is too small for the error message. If this modifier has not been set, a +large buffer is used. +.P The \fBaftertext\fP and \fBallaftertext\fP subject modifiers work as described below. All other modifiers cause an error. . @@ -720,17 +850,22 @@ are mutually exclusive. .sp The following modifiers are really subject modifiers, and are described below. However, they may be included in a pattern's modifier list, in which case they -are applied to every subject line that is processed with that pattern. They do -not affect the compilation process. -.sp - aftertext show text after match - allaftertext show text after captures - allcaptures show all captures - allusedtext show all consulted text - /g global global matching - mark show mark values - replace=<string> specify a replacement string - startchar show starting character when relevant +are applied to every subject line that is processed with that pattern. They may +not appear in \fB#pattern\fP commands. These modifiers do not affect the +compilation process. +.sp + aftertext show text after match + allaftertext show text after captures + allcaptures show all captures + allusedtext show all consulted text + /g global global matching + mark show mark values + replace=<string> specify a replacement string + startchar show starting character when relevant + substitute_extended use PCRE2_SUBSTITUTE_EXTENDED + substitute_overflow_length use PCRE2_SUBSTITUTE_OVERFLOW_LENGTH + substitute_unknown_unset use PCRE2_SUBSTITUTE_UNKNOWN_UNSET + substitute_unset_empty use PCRE2_SUBSTITUTE_UNSET_EMPTY .sp These modifiers may not appear in a \fB#pattern\fP command. If you want them as defaults, set them in a \fB#subject\fP command. @@ -755,6 +890,7 @@ warning message, except for \fBreplace\fP, which causes an error. Note that, matching that uses this pattern. . . 
+.\" HTML <a name="subjectmodifiers"></a> .SH "SUBJECT MODIFIERS" .rs .sp @@ -801,31 +937,38 @@ information. Some of them may also be specified on a pattern line (see above), in which case they apply to every subject line that is matched against that pattern. .sp - aftertext show text after match - allaftertext show text after captures - allcaptures show all captures - allusedtext show all consulted text (non-JIT only) - altglobal alternative global matching - callout_capture show captures at callout time - callout_data=<n> set a value to pass via callouts - callout_fail=<n>[:<m>] control callout failure - callout_none do not supply a callout function - copy=<number or name> copy captured substring - dfa use \fBpcre2_dfa_match()\fP - find_limits find match and recursion limits - get=<number or name> extract captured substring - getall extract all captured substrings - /g global global matching - jitstack=<n> set size of JIT stack - mark show mark values - match_limit=>n> set a match limit - memory show memory usage - offset=<n> set starting offset - ovector=<n> set size of output vector - recursion_limit=<n> set a recursion limit - replace=<string> specify a replacement string - startchar show startchar when relevant - zero_terminate pass the subject as zero-terminated + aftertext show text after match + allaftertext show text after captures + allcaptures show all captures + allusedtext show all consulted text (non-JIT only) + altglobal alternative global matching + callout_capture show captures at callout time + callout_data=<n> set a value to pass via callouts + callout_fail=<n>[:<m>] control callout failure + callout_none do not supply a callout function + copy=<number or name> copy captured substring + dfa use \fBpcre2_dfa_match()\fP + find_limits find match and recursion limits + get=<number or name> extract captured substring + getall extract all captured substrings + /g global global matching + jitstack=<n> set size of JIT stack + mark show mark values + 
match_limit=<n> set a match limit + memory show memory usage + null_context match with a NULL context + offset=<n> set starting offset + offset_limit=<n> set offset limit + ovector=<n> set size of output vector + recursion_limit=<n> set a recursion limit + replace=<string> specify a replacement string + startchar show startchar when relevant + startoffset=<n> same as offset=<n> + substitute_extedded use PCRE2_SUBSTITUTE_EXTENDED + substitute_overflow_length use PCRE2_SUBSTITUTE_OVERFLOW_LENGTH + substitute_unknown_unset use PCRE2_SUBSTITUTE_UNKNOWN_UNSET + substitute_unset_empty use PCRE2_SUBSTITUTE_UNSET_EMPTY + zero_terminate pass the subject as zero-terminated .sp The effects of these modifiers are described in the following sections. . @@ -957,18 +1100,30 @@ by name. .rs .sp If the \fBreplace\fP modifier is set, the \fBpcre2_substitute()\fP function is -called instead of one of the matching functions. Unlike subject strings, -\fBpcre2test\fP does not process replacement strings for escape sequences. In -UTF mode, a replacement string is checked to see if it is a valid UTF-8 string. -If so, it is correctly converted to a UTF string of the appropriate code unit -width. If it is not a valid UTF-8 string, the individual code units are copied -directly. This provides a means of passing an invalid UTF-8 string for testing -purposes. +called instead of one of the matching functions. Note that replacement strings +cannot contain commas, because a comma signifies the end of a modifier. This is +not thought to be an issue in a test program. +.P +Unlike subject strings, \fBpcre2test\fP does not process replacement strings +for escape sequences. In UTF mode, a replacement string is checked to see if it +is a valid UTF-8 string. If so, it is correctly converted to a UTF string of +the appropriate code unit width. If it is not a valid UTF-8 string, the +individual code units are copied directly. This provides a means of passing an +invalid UTF-8 string for testing purposes. 
.P -If the \fBglobal\fP modifier is set, PCRE2_SUBSTITUTE_GLOBAL is passed to -\fBpcre2_substitute()\fP. After a successful substitution, the modified string -is output, preceded by the number of replacements. This may be zero if there -were no matches. Here is a simple example of a substitution test: +The following modifiers set options (in additional to the normal match options) +for \fBpcre2_substitute()\fP: +.sp + global PCRE2_SUBSTITUTE_GLOBAL + substitute_extended PCRE2_SUBSTITUTE_EXTENDED + substitute_overflow_length PCRE2_SUBSTITUTE_OVERFLOW_LENGTH + substitute_unknown_unset PCRE2_SUBSTITUTE_UNKNOWN_UNSET + substitute_unset_empty PCRE2_SUBSTITUTE_UNSET_EMPTY +.sp +.P +After a successful substitution, the modified string is output, preceded by the +number of replacements. This may be zero if there were no matches. Here is a +simple example of a substitution test: .sp /abc/replace=xxx =abc=abc= @@ -976,12 +1131,12 @@ were no matches. Here is a simple example of a substitution test: =abc=abc=\e=global 2: =xxx=xxx= .sp -Subject and replacement strings should be kept relatively short for -substitution tests, as fixed-size buffers are used. To make it easy to test for -buffer overflow, if the replacement string starts with a number in square -brackets, that number is passed to \fBpcre2_substitute()\fP as the size of the -output buffer, with the replacement string starting at the next character. Here -is an example that tests the edge case: +Subject and replacement strings should be kept relatively short (fewer than 256 +characters) for substitution tests, as fixed-size buffers are used. To make it +easy to test for buffer overflow, if the replacement string starts with a +number in square brackets, that number is passed to \fBpcre2_substitute()\fP as +the size of the output buffer, with the replacement string starting at the next +character. 
Here is an example that tests the edge case: .sp /abc/ 123abc123\e=replace=[10]XYZ @@ -989,6 +1144,19 @@ is an example that tests the edge case: 123abc123\e=replace=[9]XYZ Failed: error -47: no more memory .sp +The default action of \fBpcre2_substitute()\fP is to return +PCRE2_ERROR_NOMEMORY when the output buffer is too small. However, if the +PCRE2_SUBSTITUTE_OVERFLOW_LENGTH option is set (by using the +\fBsubstitute_overflow_length\fP modifier), \fBpcre2_substitute()\fP continues +to go through the motions of matching and substituting, in order to compute the +size of buffer that is required. When this happens, \fBpcre2test\fP shows the +required buffer length (which includes space for the trailing zero) as part of +the error message. For example: +.sp + /abc/substitute_overflow_length + 123abc123\e=replace=[9]XYZ + Failed: error -47: no more memory: 10 code units are needed +.sp A replacement string is ignored with POSIX and DFA matching. Specifying partial matching provokes an error return ("bad option value") from \fBpcre2_substitute()\fP. @@ -1059,6 +1227,16 @@ The \fBoffset\fP modifier sets an offset in the subject string at which matching starts. Its value is a number of code units, not characters. . . +.SS "Setting an offset limit" +.rs +.sp +The \fBoffset_limit\fP modifier sets a limit for unanchored matches. If a match +cannot be found starting at or before this offset in the subject, a "no match" +return is given. The data value is a number of code units, not characters. When +this modifier is used, the \fBuse_offset_limit\fP modifier must have been set +for the pattern; if not, an error is generated. +. +. .SS "Setting the size of the output vector" .rs .sp @@ -1089,6 +1267,17 @@ When testing \fBpcre2_substitute()\fP, this modifier also has the effect of passing the replacement string as zero-terminated. . . 
+.SS "Passing a NULL context" +.rs +.sp +Normally, \fBpcre2test\fP passes a context block to \fBpcre2_match()\fP, +\fBpcre2_dfa_match()\fP or \fBpcre2_jit_match()\fP. If the \fBnull_context\fP +modifier is set, however, NULL is passed. This is for testing that the matching +functions behave correctly in this case (they use default values). This +modifier cannot be used with the \fBfind_limits\fP modifier or when testing the +substitution function. +. +. .SH "THE ALTERNATIVE MATCHING FUNCTION" .rs .sp @@ -1451,6 +1640,6 @@ Cambridge, England. .rs .sp .nf -Last updated: 20 May 2015 +Last updated: 12 December 2015 Copyright (c) 1997-2015 University of Cambridge. .fi
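Reviewer note on the "Generating long repetitive patterns" hunk: the expansion rule it documents for the expand modifier can be approximated in a few lines. The sketch below is only an illustration of the documented behaviour, not the code pcre2test uses; the names expand_pattern and _EXPANSION are hypothetical, and the handling of nested items (which the man page says is unsupported) is not modelled.

```python
import re

# Approximation of the documented form \[<characters>]{<count>}.
# The non-greedy group stops at the first "]{" that is followed by
# decimal digits and "}", so an item with two quantifier values such
# as \[AB]{6000,6000} is not recognized and is left unaltered.
_EXPANSION = re.compile(r"\\\[(.*?)\]\{(\d+)\}")


def expand_pattern(pattern: str) -> str:
    # Replace each recognized item with its characters repeated <count> times.
    return _EXPANSION.sub(lambda m: m.group(1) * int(m.group(2)), pattern)


if __name__ == "__main__":
    print(expand_pattern(r"x\[AB]{3}y"))   # expanded item
    print(expand_pattern(r"\[AB]{2,2}"))   # two values: left as written
```

Under these assumptions a single-count item is expanded, while the two-value form described in the patch passes through unchanged.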