Diffstat (limited to 'doc/pcre2perform.3')
-rw-r--r-- doc/pcre2perform.3 131
1 files changed, 93 insertions, 38 deletions
diff --git a/doc/pcre2perform.3 b/doc/pcre2perform.3
index ec86fe7..8b49a2a 100644
--- a/doc/pcre2perform.3
+++ b/doc/pcre2perform.3
@@ -1,4 +1,4 @@
-.TH PCRE2PERFORM 3 "02 January 2015" "PCRE2 10.00"
+.TH PCRE2PERFORM 3 "08 April 2017" "PCRE2 10.30"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH "PCRE2 PERFORMANCE"
@@ -12,11 +12,11 @@ of them.
.rs
.sp
Patterns are compiled by PCRE2 into a reasonably efficient interpretive code,
-so that most simple patterns do not use much memory. However, there is one case
-where the memory usage of a compiled pattern can be unexpectedly large. If a
-parenthesized subpattern has a quantifier with a minimum greater than 1 and/or
-a limited maximum, the whole subpattern is repeated in the compiled code. For
-example, the pattern
+so that most simple patterns do not use much memory for storing the compiled
+version. However, there is one case where the memory usage of a compiled
+pattern can be unexpectedly large. If a parenthesized subpattern has a
+quantifier with a minimum greater than 1 and/or a limited maximum, the whole
+subpattern is repeated in the compiled code. For example, the pattern
.sp
(abc|def){2,4}
.sp
@@ -34,13 +34,13 @@ example, the very simple pattern
.sp
((ab){1,1000}c){1,3}
.sp
-uses 51K bytes when compiled using the 8-bit library. When PCRE2 is compiled
-with its default internal pointer size of two bytes, the size limit on a
-compiled pattern is 64K code units in the 8-bit and 16-bit libraries, and this
-is reached with the above pattern if the outer repetition is increased from 3
-to 4. PCRE2 can be compiled to use larger internal pointers and thus handle
-larger compiled patterns, but it is better to try to rewrite your pattern to
-use less memory if you can.
+uses over 50K bytes when compiled using the 8-bit library. When PCRE2 is
+compiled with its default internal pointer size of two bytes, the size limit on
+a compiled pattern is 64K code units in the 8-bit and 16-bit libraries, and
+this is reached with the above pattern if the outer repetition is increased
+from 3 to 4. PCRE2 can be compiled to use larger internal pointers and thus
+handle larger compiled patterns, but it is better to try to rewrite your
+pattern to use less memory if you can.
.P
One way of reducing the memory usage for such patterns is to make use of
PCRE2's
@@ -52,32 +52,35 @@ facility. Re-writing the above pattern as
.sp
((ab)(?2){0,999}c)(?1){0,2}
.sp
-reduces the memory requirements to 18K, and indeed it remains under 20K even
-with the outer repetition increased to 100. However, this pattern is not
-exactly equivalent, because the "subroutine" calls are treated as
-.\" HTML <a href="pcre2pattern.html#atomicgroup">
-.\" </a>
-atomic groups
-.\"
-into which there can be no backtracking if there is a subsequent matching
-failure. Therefore, PCRE2 cannot do this kind of rewriting automatically.
-Furthermore, there is a noticeable loss of speed when executing the modified
-pattern. Nevertheless, if the atomic grouping is not a problem and the loss of
-speed is acceptable, this kind of rewriting will allow you to process patterns
-that PCRE2 cannot otherwise handle.
+reduces the memory requirements to around 16K, and indeed it remains under 20K
+even with the outer repetition increased to 100. However, this kind of pattern
+is not always exactly equivalent, because any captures within subroutine calls
+are lost when the subroutine completes. If this is not a problem, this kind of
+rewriting will allow you to process patterns that PCRE2 cannot otherwise
+handle. The matching performance of the two different versions of the pattern
+is roughly the same. (This applies from release 10.30 - things were different
+in earlier releases.)
.
.
-.SH "STACK USAGE AT RUN TIME"
+.SH "STACK AND HEAP USAGE AT RUN TIME"
.rs
.sp
-When \fBpcre2_match()\fP is used for matching, certain kinds of pattern can
-cause it to use large amounts of the process stack. In some environments the
-default process stack is quite small, and if it runs out the result is often
-SIGSEGV. Rewriting your pattern can often help. The
-.\" HREF
-\fBpcre2stack\fP
-.\"
-documentation discusses this issue in detail.
+From release 10.30, the interpretive (non-JIT) version of \fBpcre2_match()\fP
+uses very little system stack at run time. In earlier releases recursive
+function calls could use a great deal of stack, and this could cause problems,
+but this usage has been eliminated. Backtracking positions are now explicitly
+remembered in memory frames controlled by the code. An initial 20K vector of
+frames is allocated on the system stack (enough for about 100 frames for small
+patterns), but if this is insufficient, heap memory is used. The amount of heap
+memory can be limited; if the limit is set to zero, only the initial stack
+vector is used. Rewriting patterns to be time-efficient, as described below,
+may also reduce the memory requirements.
+.P
+In contrast to \fBpcre2_match()\fP, \fBpcre2_dfa_match()\fP does use recursive
+function calls, but only for processing atomic groups, lookaround assertions,
+and recursion within the pattern. Too much nested recursion may cause stack
+issues. The "match depth" parameter can be used to limit the depth of function
+recursion in \fBpcre2_dfa_match()\fP.
.
.
.SH "PROCESSING TIME"
@@ -160,7 +163,59 @@ applied to a whole line of "a" characters, whereas the latter takes an
appreciable time with strings longer than about 20 characters.
.P
In many cases, the solution to this kind of performance issue is to use an
-atomic group or a possessive quantifier.
+atomic group or a possessive quantifier. This can often reduce memory
+requirements as well. As another example, consider this pattern:
+.sp
+ ([^<]|<(?!inet))+
+.sp
+It matches from wherever it starts until it encounters "<inet" or the end of
+the data, and is the kind of pattern that might be used when processing an XML
+file. Each iteration of the outer parentheses matches either one character that
+is not "<" or a "<" that is not followed by "inet". However, each time a
+parenthesis is processed, a backtracking position is passed, so this
+formulation uses a memory frame for each matched character. For a long string,
+a lot of memory is required. Consider now this rewritten pattern, which matches
+exactly the same strings:
+.sp
+ ([^<]++|<(?!inet))+
+.sp
+This runs much faster, because sequences of characters that do not contain "<"
+are "swallowed" in one item inside the parentheses, and a possessive quantifier
+is used to stop any backtracking into the runs of non-"<" characters. This
+version also uses a lot less memory because entry to a new set of parentheses
+happens only when a "<" character that is not followed by "inet" is encountered
+(and we assume this is relatively rare).
+.P
+This example shows that one way of optimizing performance when matching long
+subject strings is to write repeated parenthesized subpatterns to match more
+than one character whenever possible.
+.
+.
+.SS "SETTING RESOURCE LIMITS"
+.rs
+.sp
+You can set limits on the amount of processing that takes place when matching,
+and on the amount of heap memory that is used. The default values of the limits
+are very large, and unlikely ever to operate. They can be changed when PCRE2 is
+built, and they can also be set when \fBpcre2_match()\fP or
+\fBpcre2_dfa_match()\fP is called. For details of these interfaces, see the
+.\" HREF
+\fBpcre2build\fP
+.\"
+documentation and the section entitled
+.\" HTML <a href="pcre2api.html#matchcontext">
+.\" </a>
+"The match context"
+.\"
+in the
+.\" HREF
+\fBpcre2api\fP
+.\"
+documentation.
+.P
+The \fBpcre2test\fP test program has a modifier called "find_limits" which, if
+applied to a subject line, causes it to find the smallest limits that allow a
+pattern to match. This is done by repeatedly matching with different limits.
.
.
.SH AUTHOR
@@ -177,6 +232,6 @@ Cambridge, England.
.rs
.sp
.nf
-Last updated: 02 January 2015
-Copyright (c) 1997-2015 University of Cambridge.
+Last updated: 08 April 2017
+Copyright (c) 1997-2017 University of Cambridge.
.fi
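
The PROCESSING TIME hunk above adds two patterns, ([^<]|<(?!inet))+ and ([^<]++|<(?!inet))+, and states that they match exactly the same strings. The sketch below is one way to spot-check that claim. It uses Python's re module rather than PCRE2, so it says nothing about the memory frames or timing the man page describes; it only compares what each pattern matches. The helper name matched_prefix and the sample subjects are made up for illustration. Possessive quantifiers are assumed to need Python 3.11 or later, so older interpreters fall back to a lookahead-plus-backreference emulation of the atomic run.

```python
import re
import sys

# Pattern from the diff: one group iteration per matched character.
PER_CHAR = re.compile(r"([^<]|<(?!inet))+")

# Rewritten pattern from the diff: runs of non-"<" characters are taken in
# one possessive step. Python's re accepts "++" only from 3.11 onwards, so
# older versions emulate the atomic run with (?=(...))\2.
if sys.version_info >= (3, 11):
    PER_RUN = re.compile(r"([^<]++|<(?!inet))+")
else:
    PER_RUN = re.compile(r"((?=([^<]+))\2|<(?!inet))+")


def matched_prefix(pattern, subject):
    """Return the text matched at the start of subject, or None if no match."""
    m = pattern.match(subject)
    return m.group(0) if m else None


# Hypothetical sample subjects, chosen to cover the interesting cases.
SUBJECTS = [
    "plain text with no angle brackets",
    "<a>some <b>markup</b></a><inet>rest",
    "<inet>starts with the stop tag",
    "trailing <",
    "x<ine<inet",
]

for subject in SUBJECTS:
    # Both formulations should stop at the same place (or both fail).
    assert matched_prefix(PER_CHAR, subject) == matched_prefix(PER_RUN, subject)
```

Agreement on a handful of subjects is only a sanity check, not a proof of equivalence. To observe the memory and speed difference itself, the patterns would have to be run under PCRE2, for example with pcre2test.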