author Matthew Vernon <matthew@debian.org> 2018-02-24 12:07:04 +0000
committer Matthew Vernon <matthew@debian.org> 2018-02-24 12:07:04 +0000
commit e98c3314cf9e05aa99f5e192862ec37f29b7dbb5
tree b69bb3feb63a4fd79ad8a6e55865228f6fde04eb /doc
parent 92b17f0eb8fddd7117c5344a1e1177daec21995a
New upstream version 10.31
Diffstat (limited to 'doc')
-rw-r--r-- doc/html/NON-AUTOTOOLS-BUILD.txt 67
-rw-r--r-- doc/html/README.txt 212
-rw-r--r-- doc/html/index.html 43
-rw-r--r-- doc/html/pcre2.html 10
-rw-r--r-- doc/html/pcre2_callout_enumerate.html 25
-rw-r--r-- doc/html/pcre2_code_copy.html 5
-rw-r--r-- doc/html/pcre2_code_copy_with_tables.html 44
-rw-r--r-- doc/html/pcre2_code_free.html 4
-rw-r--r-- doc/html/pcre2_compile.html 32
-rw-r--r-- doc/html/pcre2_config.html 19
-rw-r--r-- doc/html/pcre2_convert_context_copy.html 40
-rw-r--r-- doc/html/pcre2_convert_context_create.html 41
-rw-r--r-- doc/html/pcre2_convert_context_free.html 39
-rw-r--r-- doc/html/pcre2_converted_pattern_free.html 39
-rw-r--r-- doc/html/pcre2_dfa_match.html 23
-rw-r--r-- doc/html/pcre2_get_error_message.html 10
-rw-r--r-- doc/html/pcre2_get_mark.html 14
-rw-r--r-- doc/html/pcre2_jit_stack_create.html 7
-rw-r--r-- doc/html/pcre2_maketables.html 10
-rw-r--r-- doc/html/pcre2_match.html 33
-rw-r--r-- doc/html/pcre2_match_data_free.html 4
-rw-r--r-- doc/html/pcre2_pattern_convert.html 70
-rw-r--r-- doc/html/pcre2_pattern_info.html 23
-rw-r--r-- doc/html/pcre2_set_callout.html 2
-rw-r--r-- doc/html/pcre2_set_compile_extra_options.html 45
-rw-r--r-- doc/html/pcre2_set_depth_limit.html 40
-rw-r--r-- doc/html/pcre2_set_glob_escape.html 43
-rw-r--r-- doc/html/pcre2_set_glob_separator.html 42
-rw-r--r-- doc/html/pcre2_set_heap_limit.html 40
-rw-r--r-- doc/html/pcre2_set_max_pattern_length.html 43
-rw-r--r-- doc/html/pcre2_set_newline.html 1
-rw-r--r-- doc/html/pcre2_set_recursion_limit.html 4
-rw-r--r-- doc/html/pcre2_set_recursion_memory_management.html 9
-rw-r--r-- doc/html/pcre2_substitute.html 22
-rw-r--r-- doc/html/pcre2api.html 1155
-rw-r--r-- doc/html/pcre2build.html 241
-rw-r--r-- doc/html/pcre2callout.html 182
-rw-r--r-- doc/html/pcre2compat.html 121
-rw-r--r-- doc/html/pcre2convert.html 190
-rw-r--r-- doc/html/pcre2demo.html 53
-rw-r--r-- doc/html/pcre2grep.html 418
-rw-r--r-- doc/html/pcre2jit.html 14
-rw-r--r-- doc/html/pcre2limits.html 31
-rw-r--r-- doc/html/pcre2pattern.html 576
-rw-r--r-- doc/html/pcre2perform.html 126
-rw-r--r-- doc/html/pcre2posix.html 69
-rw-r--r-- doc/html/pcre2serialize.html 9
-rw-r--r-- doc/html/pcre2stack.html 207
-rw-r--r-- doc/html/pcre2syntax.html 41
-rw-r--r-- doc/html/pcre2test.html 528
-rw-r--r-- doc/html/pcre2unicode.html 26
-rw-r--r-- doc/index.html.src 43
-rw-r--r-- doc/pcre2.3 12
-rw-r--r-- doc/pcre2.txt 5664
-rw-r--r-- doc/pcre2_callout_enumerate.3 27
-rw-r--r-- doc/pcre2_code_copy.3 7
-rw-r--r-- doc/pcre2_code_copy_with_tables.3 32
-rw-r--r-- doc/pcre2_code_free.3 6
-rw-r--r-- doc/pcre2_compile.3 34
-rw-r--r-- doc/pcre2_config.3 19
-rw-r--r-- doc/pcre2_convert_context_copy.3 26
-rw-r--r-- doc/pcre2_convert_context_create.3 27
-rw-r--r-- doc/pcre2_convert_context_free.3 25
-rw-r--r-- doc/pcre2_converted_pattern_free.3 25
-rw-r--r-- doc/pcre2_dfa_match.3 23
-rw-r--r-- doc/pcre2_get_error_message.3 12
-rw-r--r-- doc/pcre2_get_mark.3 15
-rw-r--r-- doc/pcre2_jit_stack_create.3 9
-rw-r--r-- doc/pcre2_maketables.3 12
-rw-r--r-- doc/pcre2_match.3 31
-rw-r--r-- doc/pcre2_match_data_free.3 6
-rw-r--r-- doc/pcre2_pattern_convert.3 55
-rw-r--r-- doc/pcre2_pattern_info.3 21
-rw-r--r-- doc/pcre2_set_callout.3 4
-rw-r--r-- doc/pcre2_set_compile_extra_options.3 38
-rw-r--r-- doc/pcre2_set_depth_limit.3 28
-rw-r--r-- doc/pcre2_set_glob_escape.3 29
-rw-r--r-- doc/pcre2_set_glob_separator.3 28
-rw-r--r-- doc/pcre2_set_heap_limit.3 28
-rw-r--r-- doc/pcre2_set_max_pattern_length.3 31
-rw-r--r-- doc/pcre2_set_newline.3 3
-rw-r--r-- doc/pcre2_set_recursion_limit.3 6
-rw-r--r-- doc/pcre2_set_recursion_memory_management.3 11
-rw-r--r-- doc/pcre2_substitute.3 20
-rw-r--r-- doc/pcre2api.3 1017
-rw-r--r-- doc/pcre2build.3 196
-rw-r--r-- doc/pcre2callout.3 167
-rw-r--r-- doc/pcre2compat.3 127
-rw-r--r-- doc/pcre2convert.3 163
-rw-r--r-- doc/pcre2demo.3 53
-rw-r--r-- doc/pcre2grep.1 405
-rw-r--r-- doc/pcre2grep.txt 889
-rw-r--r-- doc/pcre2jit.3 15
-rw-r--r-- doc/pcre2limits.3 32
-rw-r--r-- doc/pcre2pattern.3 572
-rw-r--r-- doc/pcre2perform.3 131
-rw-r--r-- doc/pcre2posix.3 69
-rw-r--r-- doc/pcre2serialize.3 11
-rw-r--r-- doc/pcre2stack.3 202
-rw-r--r-- doc/pcre2syntax.3 42
-rw-r--r-- doc/pcre2test.1 522
-rw-r--r-- doc/pcre2test.txt 973
-rw-r--r-- doc/pcre2unicode.3 30
103 files changed, 10542 insertions, 6523 deletions
diff --git a/doc/html/NON-AUTOTOOLS-BUILD.txt b/doc/html/NON-AUTOTOOLS-BUILD.txt
index ceb9245..0775794 100644
--- a/doc/html/NON-AUTOTOOLS-BUILD.txt
+++ b/doc/html/NON-AUTOTOOLS-BUILD.txt
@@ -1,10 +1,6 @@
Building PCRE2 without using autotools
--------------------------------------
-This document has been converted from the PCRE1 document. I have removed a
-number of sections about building in various environments, as they applied only
-to PCRE1 and are probably out of date.
-
This document contains the following sections:
General
@@ -49,7 +45,7 @@ can skip ahead to the CMake section.
macro settings that it contains to whatever is appropriate for your
environment. In particular, you can alter the definition of the NEWLINE
macro to specify what character(s) you want to be interpreted as line
- terminators.
+ terminators by default.
When you compile any of the PCRE2 modules, you must specify
-DHAVE_CONFIG_H to your compiler so that src/config.h is included in the
@@ -95,8 +91,10 @@ can skip ahead to the CMake section.
pcre2_compile.c
pcre2_config.c
pcre2_context.c
+ pcre2_convert.c
pcre2_dfa_match.c
pcre2_error.c
+ pcre2_extuni.c
pcre2_find_bracket.c
pcre2_jit_compile.c
pcre2_maketables.c
@@ -123,10 +121,14 @@ can skip ahead to the CMake section.
Note that you must compile pcre2_jit_compile.c, even if you have not
defined SUPPORT_JIT in src/config.h, because when JIT support is not
configured, dummy functions are compiled. When JIT support IS configured,
- pcre2_compile.c #includes other files from the sljit subdirectory, where
- there should be 16 files, all of whose names begin with "sljit". It also
- #includes src/pcre2_jit_match.c and src/pcre2_jit_misc.c, so you should
- not compile these yourself.
+ pcre2_jit_compile.c #includes other files from the sljit subdirectory,
+ all of whose names begin with "sljit". It also #includes
+ src/pcre2_jit_match.c and src/pcre2_jit_misc.c, so you should not compile
+ these yourself.
+
+ Note also that the pcre2_fuzzsupport.c file contains special code that is
+ useful to those who want to run fuzzing tests on the PCRE2 library. Unless
+ you are doing that, you can ignore it.
(5) Now link all the compiled code into an object library in whichever form
your system keeps such libraries. This is the basic PCRE2 C 8-bit library.
@@ -174,26 +176,18 @@ can skip ahead to the CMake section.
(11) If you want to use the pcre2grep command, compile and link
src/pcre2grep.c; it uses only the basic 8-bit PCRE2 library (it does not
- need the pcre2posix library).
+ need the pcre2posix library). If you have built the PCRE2 library with JIT
+ support by defining SUPPORT_JIT in src/config.h, you can also define
+ SUPPORT_PCRE2GREP_JIT, which causes pcre2grep to make use of JIT (unless
+ it is run with --no-jit). If you define SUPPORT_PCRE2GREP_JIT without
+ defining SUPPORT_JIT, pcre2grep does not try to make use of JIT.
STACK SIZE IN WINDOWS ENVIRONMENTS
-The default processor stack size of 1Mb in some Windows environments is too
-small for matching patterns that need much recursion. In particular, test 2 may
-fail because of this. Normally, running out of stack causes a crash, but there
-have been cases where the test program has just died silently. See your linker
-documentation for how to increase stack size if you experience problems. If you
-are using CMake (see "BUILDING PCRE2 ON WINDOWS WITH CMAKE" below) and the gcc
-compiler, you can increase the stack size for pcre2test and pcre2grep by
-setting the CMAKE_EXE_LINKER_FLAGS variable to "-Wl,--stack,8388608" (for
-example). The Linux default of 8Mb is a reasonable choice for the stack, though
-even that can be too small for some pattern/subject combinations.
-
-PCRE2 has a compile configuration option to disable the use of stack for
-recursion so that heap is used instead. However, pattern matching is
-significantly slower when this is done. There is more about stack usage in the
-"pcre2stack" documentation.
+Prior to release 10.30 the default system stack size of 1Mb in some Windows
+environments caused issues with some tests. This should no longer be the case
+for 10.30 and later releases.
LINKING PROGRAMS IN WINDOWS ENVIRONMENTS
@@ -375,18 +369,19 @@ BUILDING PCRE2 ON NATIVE Z/OS AND Z/VM
z/OS and z/VM are operating systems for mainframe computers, produced by IBM.
The character code used is EBCDIC, not ASCII or Unicode. In z/OS, UNIX APIs and
applications can be supported through UNIX System Services, and in such an
-environment PCRE2 can be built in the same way as in other systems. However, in
-native z/OS (without UNIX System Services) and in z/VM, special ports are
-required. For details, please see this web site:
+environment it should be possible to build PCRE2 in the same way as in other
+systems, with the EBCDIC related configuration settings, but it is not known if
+anybody has tried this.
- http://www.zaconsultants.net
+In native z/OS (without UNIX System Services) and in z/VM, special ports are
+required. For details, please see file 939 on this web site:
-The site currently has ports for PCRE1 releases, but PCRE2 should follow in due
-course.
+ http://www.cbttape.org
-You may also download PCRE1 from WWW.CBTTAPE.ORG, file 882. Everything, source
-and executable, is in EBCDIC and native z/OS file formats and this is the
-recommended download site.
+Everything in that location, source and executable, is in EBCDIC and native
+z/OS file formats. The port provides an API for LE languages such as COBOL and
+for the z/OS and z/VM versions of the Rexx languages.
-=============================
-Last Updated: 16 July 2015
+===============================
+Last Updated: 13 September 2017
+===============================
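
Reviewer note on step (11) above: the rule for when pcre2grep uses JIT can be read as a small compile-time decision combined with a run-time flag. The following is only a sketch of that reading, not code from src/pcre2grep.c. The names GREP_JIT_ENABLED and grep_should_use_jit() are hypothetical and exist purely for illustration; only SUPPORT_JIT, SUPPORT_PCRE2GREP_JIT and --no-jit come from the text.

```c
/* Hypothetical sketch: GREP_JIT_ENABLED and grep_should_use_jit() are
   illustrative names, not identifiers taken from pcre2grep.c. */

/* JIT is considered only when BOTH macros are defined. Defining
   SUPPORT_PCRE2GREP_JIT without SUPPORT_JIT leaves it switched off. */
#if defined(SUPPORT_JIT) && defined(SUPPORT_PCRE2GREP_JIT)
#define GREP_JIT_ENABLED 1
#else
#define GREP_JIT_ENABLED 0
#endif

/* no_jit is nonzero when the user ran pcre2grep with --no-jit. */
static int grep_should_use_jit(int no_jit)
{
    return GREP_JIT_ENABLED && !no_jit;
}
```

Compiled with neither macro defined, grep_should_use_jit() returns 0 for any argument. Compiled with both macros defined, it returns 1 unless --no-jit was given.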
diff --git a/doc/html/README.txt b/doc/html/README.txt
index 03d67f6..52859a9 100644
--- a/doc/html/README.txt
+++ b/doc/html/README.txt
@@ -15,8 +15,8 @@ subscribe or manage your subscription here:
https://lists.exim.org/mailman/listinfo/pcre-dev
-Please read the NEWS file if you are upgrading from a previous release.
-The contents of this README file are:
+Please read the NEWS file if you are upgrading from a previous release. The
+contents of this README file are:
The PCRE2 APIs
Documentation for PCRE2
@@ -44,8 +44,8 @@ wrappers.
The distribution does contain a set of C wrapper functions for the 8-bit
library that are based on the POSIX regular expression API (see the pcre2posix
-man page). These can be found in a library called libpcre2posix. Note that this
-just provides a POSIX calling interface to PCRE2; the regular expressions
+man page). These can be found in a library called libpcre2-posix. Note that
+this just provides a POSIX calling interface to PCRE2; the regular expressions
themselves still follow Perl syntax and semantics. The POSIX API is restricted,
and does not give full access to all of PCRE2's facilities.
@@ -58,8 +58,8 @@ renamed or pointed at by a link.
If you are using the POSIX interface to PCRE2 and there is already a POSIX
regex library installed on your system, as well as worrying about the regex.h
header file (as mentioned above), you must also take care when linking programs
-to ensure that they link with PCRE2's libpcre2posix library. Otherwise they may
-pick up the POSIX functions of the same name from the other library.
+to ensure that they link with PCRE2's libpcre2-posix library. Otherwise they
+may pick up the POSIX functions of the same name from the other library.
One way of avoiding this confusion is to compile PCRE2 with the addition of
-Dregcomp=PCRE2regcomp (and similarly for the other POSIX functions) to the
@@ -95,10 +95,9 @@ PCRE2 documentation is supplied in two other forms:
Building PCRE2 on non-Unix-like systems
---------------------------------------
-For a non-Unix-like system, please read the comments in the file
-NON-AUTOTOOLS-BUILD, though if your system supports the use of "configure" and
-"make" you may be able to build PCRE2 using autotools in the same way as for
-many Unix-like systems.
+For a non-Unix-like system, please read the file NON-AUTOTOOLS-BUILD, though if
+your system supports the use of "configure" and "make" you may be able to build
+PCRE2 using autotools in the same way as for many Unix-like systems.
PCRE2 can also be configured using CMake, which can be run in various ways
(command line, GUI, etc). This creates Makefiles, solution files, etc. The file
@@ -172,21 +171,24 @@ library. They are also documented in the pcre2build man page.
give large performance improvements on certain platforms, add --enable-jit to
the "configure" command. This support is available only for certain hardware
architectures. If you try to enable it on an unsupported architecture, there
- will be a compile time error.
-
-. If you do not want to make use of the support for UTF-8 Unicode character
- strings in the 8-bit library, UTF-16 Unicode character strings in the 16-bit
- library, or UTF-32 Unicode character strings in the 32-bit library, you can
- add --disable-unicode to the "configure" command. This reduces the size of
- the libraries. It is not possible to configure one library with Unicode
- support, and another without, in the same configuration.
+ will be a compile time error. If you are running under SELinux you may also
+ want to add --enable-jit-sealloc, which enables the use of an execmem
+ allocator in JIT that is compatible with SELinux. This has no effect if JIT
+ is not enabled.
+
+. If you do not want to make use of the default support for UTF-8 Unicode
+ character strings in the 8-bit library, UTF-16 Unicode character strings in
+ the 16-bit library, or UTF-32 Unicode character strings in the 32-bit
+ library, you can add --disable-unicode to the "configure" command. This
+ reduces the size of the libraries. It is not possible to configure one
+ library with Unicode support, and another without, in the same configuration.
+ It is also not possible to use --enable-ebcdic (see below) with Unicode
+ support, so if this option is set, you must also use --disable-unicode.
When Unicode support is available, the use of a UTF encoding still has to be
enabled by setting the PCRE2_UTF option at run time or starting a pattern
with (*UTF). When PCRE2 is compiled with Unicode support, its input can only
- either be ASCII or UTF-8/16/32, even when running on EBCDIC platforms. It is
- not possible to use both --enable-unicode and --enable-ebcdic at the same
- time.
+ either be ASCII or UTF-8/16/32, even when running on EBCDIC platforms.
As well as supporting UTF strings, Unicode support includes support for the
\P, \p, and \X sequences that recognize Unicode character properties.
@@ -196,20 +198,14 @@ library. They are also documented in the pcre2build man page.
or starting a pattern with (*UCP).
. You can build PCRE2 to recognize either CR or LF or the sequence CRLF, or any
- of the preceding, or any of the Unicode newline sequences, as indicating the
- end of a line. Whatever you specify at build time is the default; the caller
- of PCRE2 can change the selection at run time. The default newline indicator
- is a single LF character (the Unix standard). You can specify the default
- newline indicator by adding --enable-newline-is-cr, --enable-newline-is-lf,
- --enable-newline-is-crlf, --enable-newline-is-anycrlf, or
- --enable-newline-is-any to the "configure" command, respectively.
-
- If you specify --enable-newline-is-cr or --enable-newline-is-crlf, some of
- the standard tests will fail, because the lines in the test files end with
- LF. Even if the files are edited to change the line endings, there are likely
- to be some failures. With --enable-newline-is-anycrlf or
- --enable-newline-is-any, many tests should succeed, but there may be some
- failures.
+ of the preceding, or any of the Unicode newline sequences, or the NUL (zero)
+ character as indicating the end of a line. Whatever you specify at build time
+ is the default; the caller of PCRE2 can change the selection at run time. The
+ default newline indicator is a single LF character (the Unix standard). You
+ can specify the default newline indicator by adding --enable-newline-is-cr,
+ --enable-newline-is-lf, --enable-newline-is-crlf,
+ --enable-newline-is-anycrlf, --enable-newline-is-any, or
+ --enable-newline-is-nul to the "configure" command, respectively.
. By default, the sequence \R in a pattern matches any Unicode line ending
sequence. This is independent of the option specifying what PCRE2 considers
@@ -231,49 +227,44 @@ library. They are also documented in the pcre2build man page.
--with-parens-nest-limit=500
-. PCRE2 has a counter that can be set to limit the amount of resources it uses
- when matching a pattern. If the limit is exceeded during a match, the match
- fails. The default is ten million. You can change the default by setting, for
- example,
+. PCRE2 has a counter that can be set to limit the amount of computing resource
+ it uses when matching a pattern. If the limit is exceeded during a match, the
+ match fails. The default is ten million. You can change the default by
+ setting, for example,
--with-match-limit=500000
on the "configure" command. This is just the default; individual calls to
- pcre2_match() can supply their own value. There is more discussion on the
- pcre2api man page.
+ pcre2_match() or pcre2_dfa_match() can supply their own value. There is more
+ discussion in the pcre2api man page (search for pcre2_set_match_limit).
+
+. There is a separate counter that limits the depth of nested backtracking
+ during a matching process, which indirectly limits the amount of heap memory
+ that is used. This also has a default of ten million, which is essentially
+ "unlimited". You can change the default by setting, for example,
+
+ --with-match-limit-depth=5000
-. There is a separate counter that limits the depth of recursive function calls
- during a matching process. This also has a default of ten million, which is
- essentially "unlimited". You can change the default by setting, for example,
+ There is more discussion in the pcre2api man page (search for
+ pcre2_set_depth_limit).
- --with-match-limit-recursion=500000
+. You can also set an explicit limit on the amount of heap memory used by
+ the pcre2_match() interpreter:
- Recursive function calls use up the runtime stack; running out of stack can
- cause programs to crash in strange ways. There is a discussion about stack
- sizes in the pcre2stack man page.
+ --with-heap-limit=500
+
+ The units are kilobytes. This limit does not apply when the JIT optimization
+ (which has its own memory control features) is used. There is more discussion
+ on the pcre2api man page (search for pcre2_set_heap_limit).
. In the 8-bit library, the default maximum compiled pattern size is around
- 64K. You can increase this by adding --with-link-size=3 to the "configure"
- command. PCRE2 then uses three bytes instead of two for offsets to different
- parts of the compiled pattern. In the 16-bit library, --with-link-size=3 is
- the same as --with-link-size=4, which (in both libraries) uses four-byte
- offsets. Increasing the internal link size reduces performance in the 8-bit
- and 16-bit libraries. In the 32-bit library, the link size setting is
- ignored, as 4-byte offsets are always used.
-
-. You can build PCRE2 so that its internal match() function that is called from
- pcre2_match() does not call itself recursively. Instead, it uses memory
- blocks obtained from the heap to save data that would otherwise be saved on
- the stack. To build PCRE2 like this, use
-
- --disable-stack-for-recursion
-
- on the "configure" command. PCRE2 runs more slowly in this mode, but it may
- be necessary in environments with limited stack sizes. This applies only to
- the normal execution of the pcre2_match() function; if JIT support is being
- successfully used, it is not relevant. Equally, it does not apply to
- pcre2_dfa_match(), which does not use deeply nested recursion. There is a
- discussion about stack sizes in the pcre2stack man page.
+ 64K bytes. You can increase this by adding --with-link-size=3 to the
+ "configure" command. PCRE2 then uses three bytes instead of two for offsets
+ to different parts of the compiled pattern. In the 16-bit library,
+ --with-link-size=3 is the same as --with-link-size=4, which (in both
+ libraries) uses four-byte offsets. Increasing the internal link size reduces
+ performance in the 8-bit and 16-bit libraries. In the 32-bit library, the
+ link size setting is ignored, as 4-byte offsets are always used.
. For speed, PCRE2 uses four tables for manipulating and identifying characters
whose code point values are less than 256. By default, it uses a set of
@@ -339,12 +330,23 @@ library. They are also documented in the pcre2build man page.
Of course, the relevant libraries must be installed on your system.
-. The default size (in bytes) of the internal buffer used by pcre2grep can be
- set by, for example:
+. The default starting size (in bytes) of the internal buffer used by pcre2grep
+ can be set by, for example:
--with-pcre2grep-bufsize=51200
- The value must be a plain integer. The default is 20480.
+ The value must be a plain integer. The default is 20480. The amount of memory
+ used by pcre2grep is actually three times this number, to allow for "before"
+ and "after" lines. If very long lines are encountered, the buffer is
+ automatically enlarged, up to a fixed maximum size.
+
+. The default maximum size of pcre2grep's internal buffer can be set by, for
+ example:
+
+ --with-pcre2grep-max-bufsize=2097152
+
+ The default is either 1048576 or the value of --with-pcre2grep-bufsize,
+ whichever is the larger.
. It is possible to compile pcre2test so that it links with the libreadline
or libedit libraries, by specifying, respectively,
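
Reviewer note on the pcre2grep buffer options in the hunk above: the memory arithmetic it describes can be sketched as follows. This is an illustration under stated assumptions, not pcre2grep's actual code. The function and macro names are hypothetical; the factor of three and the figures 20480 and 1048576 are taken from the README text.

```c
#include <stddef.h>

/* Hypothetical names; only the numeric values come from the README. */
#define GREP_DEFAULT_BUFSIZE    20480u   /* default --with-pcre2grep-bufsize */
#define GREP_DEFAULT_MAX_FLOOR  1048576u /* lower bound of the default maximum */

/* Memory used for a given buffer size: three times the size, which
   leaves room for "before" and "after" context lines. */
static size_t grep_buffer_memory(size_t bufsize)
{
    return 3 * bufsize;
}

/* Default for --with-pcre2grep-max-bufsize: 1048576 or the starting
   buffer size, whichever is larger. */
static size_t grep_default_max_bufsize(size_t bufsize)
{
    return bufsize > GREP_DEFAULT_MAX_FLOOR ? bufsize : GREP_DEFAULT_MAX_FLOOR;
}
```

With the default starting size, grep_buffer_memory(GREP_DEFAULT_BUFSIZE) gives 61440 bytes, and a starting size of 51200 still yields a default maximum of 1048576.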
@@ -369,6 +371,29 @@ library. They are also documented in the pcre2build man page.
tgetflag, or tgoto, this is the problem, and linking with the ncurses library
should fix it.
+. There is a special option called --enable-fuzz-support for use by people who
+ want to run fuzzing tests on PCRE2. At present this applies only to the 8-bit
+ library. If set, it causes an extra library called libpcre2-fuzzsupport.a to
+ be built, but not installed. This contains a single function called
+ LLVMFuzzerTestOneInput() whose arguments are a pointer to a string and the
+ length of the string. When called, this function tries to compile the string
+ as a pattern, and if that succeeds, to match it. This is done both with no
+ options and with some random options bits that are generated from the string.
+ Setting --enable-fuzz-support also causes a binary called pcre2fuzzcheck to
+ be created. This is normally run under valgrind or used when PCRE2 is
+ compiled with address sanitizing enabled. It calls the fuzzing function and
+ outputs information about what it is doing. The input strings are specified by
+ arguments: if an argument starts with "=" the rest of it is a literal input
+ string. Otherwise, it is assumed to be a file name, and the contents of the
+ file are the test string.
+
+. Releases before 10.30 could be compiled with --disable-stack-for-recursion,
+ which caused pcre2_match() to use individual blocks on the heap for
+ backtracking instead of recursive function calls (which use the stack). This
+ is now obsolete since pcre2_match() was refactored always to use the heap (in
+ a much more efficient way than before). This option is retained for backwards
+ compatibility, but has no effect other than to output a warning.
+
The "configure" script builds the following files for the basic C library:
. Makefile the makefile that builds the library
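
Reviewer note on the pcre2fuzzcheck paragraph in the hunk above: the argument convention it describes (a leading "=" marks a literal input string, anything else is a file name) can be sketched as below. This is a minimal illustration of that rule only. The type input_kind and the function classify_arg() are hypothetical names and do not come from the PCRE2 sources.

```c
#include <stddef.h>

/* How one command-line argument is to be interpreted (hypothetical type). */
typedef enum { INPUT_LITERAL, INPUT_FILE } input_kind;

/* Classify an argument as the README describes: if it starts with "=",
   the rest of it is a literal test string; otherwise it is a file name
   whose contents are the test string.
   On return, *value points at the literal string or the file name. */
static input_kind classify_arg(const char *arg, const char **value)
{
    if (arg[0] == '=') {
        *value = arg + 1;   /* skip the "=" marker */
        return INPUT_LITERAL;
    }
    *value = arg;
    return INPUT_FILE;
}
```

A driver would pass each INPUT_LITERAL value, or the contents of each INPUT_FILE, together with its length to LLVMFuzzerTestOneInput().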
@@ -543,7 +568,7 @@ script creates the .txt and HTML forms of the documentation from the man pages.
Testing PCRE2
-------------
+-------------
To test the basic PCRE2 library on a Unix-like system, run the RunTest script.
There is another script called RunGrepTest that tests the pcre2grep command.
@@ -635,32 +660,43 @@ with the perltest.sh script, and test 5 checking PCRE2-specific things.
Tests 6 and 7 check the pcre2_dfa_match() alternative matching function, in
non-UTF mode and UTF-mode with Unicode property support, respectively.
-Test 8 checks some internal offsets and code size features; it is run only when
-the default "link size" of 2 is set (in other cases the sizes change) and when
-Unicode support is enabled.
+Test 8 checks some internal offsets and code size features, but it is run only
+when Unicode support is enabled. The output is different in 8-bit, 16-bit, and
+32-bit modes and for different link sizes, so there are different output files
+for each mode and link size.
Tests 9 and 10 are run only in 8-bit mode, and tests 11 and 12 are run only in
16-bit and 32-bit modes. These are tests that generate different output in
8-bit mode. Each pair are for general cases and Unicode support, respectively.
+
Test 13 checks the handling of non-UTF characters greater than 255 by
pcre2_dfa_match() in 16-bit and 32-bit modes.
-Test 14 contains a number of tests that must not be run with JIT. They check,
+Test 14 contains some special UTF and UCP tests that give different output for
+different code unit widths.
+
+Test 15 contains a number of tests that must not be run with JIT. They check,
among other non-JIT things, the match-limiting features of the interpretive
matcher.
-Test 15 is run only when JIT support is not available. It checks that an
+Test 16 is run only when JIT support is not available. It checks that an
attempt to use JIT has the expected behaviour.
-Test 16 is run only when JIT support is available. It checks JIT complete and
+Test 17 is run only when JIT support is available. It checks JIT complete and
partial modes, match-limiting under JIT, and other JIT-specific features.
-Tests 17 and 18 are run only in 8-bit mode. They check the POSIX interface to
+Tests 18 and 19 are run only in 8-bit mode. They check the POSIX interface to
the 8-bit library, without and with Unicode support, respectively.
-Test 19 checks the serialization functions by writing a set of compiled
+Test 20 checks the serialization functions by writing a set of compiled
patterns to a file, and then reloading and checking them.
+Tests 21 and 22 test \C support when the use of \C is not locked out, without
+and with UTF support, respectively. Test 23 tests \C when it is locked out.
+
+Tests 24 and 25 test the experimental pattern conversion functions, without and
+with UTF support, respectively.
+
Character tables
----------------
@@ -679,7 +715,7 @@ specified for ./configure, a different version of pcre2_chartables.c is built
by the program dftables (compiled from dftables.c), which uses the ANSI C
character handling functions such as isalnum(), isalpha(), isupper(),
islower(), etc. to build the table sources. This means that the default C
-locale which is set for your system will control the contents of these default
+locale that is set for your system will control the contents of these default
tables. You can change the default tables by editing pcre2_chartables.c and
then re-building PCRE2. If you do this, you should take care to ensure that the
file does not get automatically re-generated. The best way to do this is to
@@ -734,8 +770,10 @@ The distribution should contain the files listed below.
src/pcre2_compile.c )
src/pcre2_config.c )
src/pcre2_context.c )
+ src/pcre2_convert.c )
src/pcre2_dfa_match.c )
src/pcre2_error.c )
+ src/pcre2_extuni.c )
src/pcre2_find_bracket.c )
src/pcre2_jit_compile.c )
src/pcre2_jit_match.c ) sources for the functions in the library,
@@ -757,6 +795,7 @@ The distribution should contain the files listed below.
src/pcre2_xclass.c )
src/pcre2_printint.c debugging function that is used by pcre2test,
+ src/pcre2_fuzzsupport.c function for (optional) fuzzing support
src/config.h.in template for config.h, when built by "configure"
src/pcre2.h.in template for pcre2.h when built by "configure"
@@ -772,7 +811,6 @@ The distribution should contain the files listed below.
src/pcre2demo.c simple demonstration of coding calls to PCRE2
src/pcre2grep.c source of a grep utility that uses PCRE2
src/pcre2test.c comprehensive test program
- src/pcre2_printint.c part of pcre2test
src/pcre2_jit_test.c JIT test program
(C) Auxiliary files:
@@ -814,7 +852,7 @@ The distribution should contain the files listed below.
libpcre2-8.pc.in template for libpcre2-8.pc for pkg-config
libpcre2-16.pc.in template for libpcre2-16.pc for pkg-config
libpcre2-32.pc.in template for libpcre2-32.pc for pkg-config
- libpcre2posix.pc.in template for libpcre2posix.pc for pkg-config
+ libpcre2-posix.pc.in template for libpcre2-posix.pc for pkg-config
ltmain.sh file used to build a libtool script
missing ) common stub for a few missing GNU programs while
) installing, generated by automake
@@ -837,12 +875,12 @@ The distribution should contain the files listed below.
(E) Auxiliary files for building PCRE2 "by hand"
- pcre2.h.generic ) a version of the public PCRE2 header file
+ src/pcre2.h.generic ) a version of the public PCRE2 header file
) for use in non-"configure" environments
- config.h.generic ) a version of config.h for use in non-"configure"
+ src/config.h.generic ) a version of config.h for use in non-"configure"
) environments
Philip Hazel
Email local part: ph10
Email domain: cam.ac.uk
-Last updated: 01 April 2016
+Last updated: 12 September 2017
diff --git a/doc/html/index.html b/doc/html/index.html
index 703c298..b9393d9 100644
--- a/doc/html/index.html
+++ b/doc/html/index.html
@@ -35,6 +35,9 @@ first.
<tr><td><a href="pcre2compat.html">pcre2compat</a></td>
<td>&nbsp;&nbsp;Compatibility with Perl</td></tr>
+<tr><td><a href="pcre2convert.html">pcre2convert</a></td>
+ <td>&nbsp;&nbsp;Experimental foreign pattern conversion functions</td></tr>
+
<tr><td><a href="pcre2demo.html">pcre2demo</a></td>
<td>&nbsp;&nbsp;A demonstration C program that uses the PCRE2 library</td></tr>
@@ -68,9 +71,6 @@ first.
<tr><td><a href="pcre2serialize.html">pcre2serialize</a></td>
<td>&nbsp;&nbsp;Serializing functions for saving precompiled patterns</td></tr>
-<tr><td><a href="pcre2stack.html">pcre2stack</a></td>
- <td>&nbsp;&nbsp;Discussion of PCRE2's stack usage</td></tr>
-
<tr><td><a href="pcre2syntax.html">pcre2syntax</a></td>
<td>&nbsp;&nbsp;Syntax quick-reference summary</td></tr>
@@ -94,6 +94,9 @@ in the library.
<tr><td><a href="pcre2_code_copy.html">pcre2_code_copy</a></td>
<td>&nbsp;&nbsp;Copy a compiled pattern</td></tr>
+<tr><td><a href="pcre2_code_copy_with_tables.html">pcre2_code_copy_with_tables</a></td>
+ <td>&nbsp;&nbsp;Copy a compiled pattern and its character tables</td></tr>
+
<tr><td><a href="pcre2_code_free.html">pcre2_code_free</a></td>
<td>&nbsp;&nbsp;Free a compiled pattern</td></tr>
@@ -112,6 +115,18 @@ in the library.
<tr><td><a href="pcre2_config.html">pcre2_config</a></td>
<td>&nbsp;&nbsp;Show build-time configuration options</td></tr>
+<tr><td><a href="pcre2_convert_context_copy.html">pcre2_convert_context_copy</a></td>
+ <td>&nbsp;&nbsp;Copy a convert context</td></tr>
+
+<tr><td><a href="pcre2_convert_context_create.html">pcre2_convert_context_create</a></td>
+ <td>&nbsp;&nbsp;Create a convert context</td></tr>
+
+<tr><td><a href="pcre2_convert_context_free.html">pcre2_convert_context_free</a></td>
+ <td>&nbsp;&nbsp;Free a convert context</td></tr>
+
+<tr><td><a href="pcre2_converted_pattern_free.html">pcre2_converted_pattern_free</a></td>
+ <td>&nbsp;&nbsp;Free converted foreign pattern</td></tr>
+
<tr><td><a href="pcre2_dfa_match.html">pcre2_dfa_match</a></td>
<td>&nbsp;&nbsp;Match a compiled pattern to a subject string
(DFA algorithm; <i>not</i> Perl compatible)</td></tr>
@@ -183,6 +198,9 @@ in the library.
<tr><td><a href="pcre2_match_data_free.html">pcre2_match_data_free</a></td>
<td>&nbsp;&nbsp;Free a match data block</td></tr>
+<tr><td><a href="pcre2_pattern_convert.html">pcre2_pattern_convert</a></td>
+ <td>&nbsp;&nbsp;Experimental foreign pattern converter</td></tr>
+
<tr><td><a href="pcre2_pattern_info.html">pcre2_pattern_info</a></td>
<td>&nbsp;&nbsp;Extract information about a pattern</td></tr>
@@ -207,9 +225,24 @@ in the library.
<tr><td><a href="pcre2_set_character_tables.html">pcre2_set_character_tables</a></td>
<td>&nbsp;&nbsp;Set character tables</td></tr>
+<tr><td><a href="pcre2_set_compile_extra_options.html">pcre2_set_compile_extra_options</a></td>
+ <td>&nbsp;&nbsp;Set compile time extra options</td></tr>
+
<tr><td><a href="pcre2_set_compile_recursion_guard.html">pcre2_set_compile_recursion_guard</a></td>
<td>&nbsp;&nbsp;Set up a compile recursion guard function</td></tr>
+<tr><td><a href="pcre2_set_depth_limit.html">pcre2_set_depth_limit</a></td>
+ <td>&nbsp;&nbsp;Set the match backtracking depth limit</td></tr>
+
+<tr><td><a href="pcre2_set_glob_escape.html">pcre2_set_glob_escape</a></td>
+ <td>&nbsp;&nbsp;Set glob escape character</td></tr>
+
+<tr><td><a href="pcre2_set_glob_separator.html">pcre2_set_glob_separator</a></td>
+ <td>&nbsp;&nbsp;Set glob separator character</td></tr>
+
+<tr><td><a href="pcre2_set_heap_limit.html">pcre2_set_heap_limit</a></td>
+ <td>&nbsp;&nbsp;Set the match backtracking heap limit</td></tr>
+
<tr><td><a href="pcre2_set_match_limit.html">pcre2_set_match_limit</a></td>
<td>&nbsp;&nbsp;Set the match limit</td></tr>
@@ -226,10 +259,10 @@ in the library.
<td>&nbsp;&nbsp;Set the parentheses nesting limit</td></tr>
<tr><td><a href="pcre2_set_recursion_limit.html">pcre2_set_recursion_limit</a></td>
- <td>&nbsp;&nbsp;Set the match recursion limit</td></tr>
+ <td>&nbsp;&nbsp;Obsolete: use pcre2_set_depth_limit</td></tr>
<tr><td><a href="pcre2_set_recursion_memory_management.html">pcre2_set_recursion_memory_management</a></td>
- <td>&nbsp;&nbsp;Set match recursion memory management</td></tr>
+ <td>&nbsp;&nbsp;Obsolete function that (from 10.30 onwards) does nothing</td></tr>
<tr><td><a href="pcre2_substitute.html">pcre2_substitute</a></td>
<td>&nbsp;&nbsp;Match a compiled pattern to a subject string and do
diff --git a/doc/html/pcre2.html b/doc/html/pcre2.html
index 07ab8e9..b61c579 100644
--- a/doc/html/pcre2.html
+++ b/doc/html/pcre2.html
@@ -109,7 +109,7 @@ lose performance.
One way of guarding against this possibility is to use the
<b>pcre2_pattern_info()</b> function to check the compiled pattern's options for
PCRE2_UTF. Alternatively, you can set the PCRE2_NEVER_UTF option when calling
-<b>pcre2_compile()</b>. This causes an compile time error if a pattern contains
+<b>pcre2_compile()</b>. This causes a compile time error if the pattern contains
a UTF-setting sequence.
</P>
<P>
@@ -137,7 +137,8 @@ large search tree against a string that will never match. Nested unlimited
repeats in a pattern are a common example. PCRE2 provides some protection
against this: see the <b>pcre2_set_match_limit()</b> function in the
<a href="pcre2api.html"><b>pcre2api</b></a>
-page.
+page. There is a similar function called <b>pcre2_set_depth_limit()</b> that can
+be used to restrict the amount of memory that is used.
</P>
<br><a name="SEC3" href="#TOC1">USER DOCUMENTATION</a><br>
<P>
@@ -166,7 +167,6 @@ listing), and the short pages for individual functions, are concatenated in
pcre2perform discussion of performance issues
pcre2posix the POSIX-compatible C API for the 8-bit library
pcre2sample discussion of the pcre2demo program
- pcre2stack discussion of stack usage
pcre2syntax quick syntax reference
pcre2test description of the <b>pcre2test</b> command
pcre2unicode discussion of Unicode and UTF support
@@ -189,9 +189,9 @@ use my two initials, followed by the two digits 10, at the domain cam.ac.uk.
</P>
<br><a name="SEC5" href="#TOC1">REVISION</a><br>
<P>
-Last updated: 16 October 2015
+Last updated: 01 April 2017
<br>
-Copyright &copy; 1997-2015 University of Cambridge.
+Copyright &copy; 1997-2017 University of Cambridge.
<br>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
diff --git a/doc/html/pcre2_callout_enumerate.html b/doc/html/pcre2_callout_enumerate.html
index 6c2cdb8..505ea7b 100644
--- a/doc/html/pcre2_callout_enumerate.html
+++ b/doc/html/pcre2_callout_enumerate.html
@@ -36,20 +36,21 @@ for success and non-zero otherwise. The arguments are:
<i>callout_data</i> User data that is passed to the callback
</pre>
The <i>callback()</i> function is passed a pointer to a data block containing
-the following fields:
+the following fields (not necessarily in this order):
<pre>
- <i>version</i> Block version number
- <i>pattern_position</i> Offset to next item in pattern
- <i>next_item_length</i> Length of next item in pattern
- <i>callout_number</i> Number for numbered callouts
- <i>callout_string_offset</i> Offset to string within pattern
- <i>callout_string_length</i> Length of callout string
- <i>callout_string</i> Points to callout string or is NULL
+ uint32_t <i>version</i> Block version number
+ uint32_t <i>callout_number</i> Number for numbered callouts
+ PCRE2_SIZE <i>pattern_position</i> Offset to next item in pattern
+ PCRE2_SIZE <i>next_item_length</i> Length of next item in pattern
+ PCRE2_SIZE <i>callout_string_offset</i> Offset to string within pattern
+ PCRE2_SIZE <i>callout_string_length</i> Length of callout string
+ PCRE2_SPTR <i>callout_string</i> Points to callout string or is NULL
</pre>
-The second argument is the callout data that was passed to
-<b>pcre2_callout_enumerate()</b>. The <b>callback()</b> function must return zero
-for success. Any other value causes the pattern scan to stop, with the value
-being passed back as the result of <b>pcre2_callout_enumerate()</b>.
+The second argument passed to the <b>callback()</b> function is the callout data
+that was passed to <b>pcre2_callout_enumerate()</b>. The <b>callback()</b>
+function must return zero for success. Any other value causes the pattern scan
+to stop, with the value being passed back as the result of
+<b>pcre2_callout_enumerate()</b>.
</P>
<P>
There is a complete description of the PCRE2 native API in the
diff --git a/doc/html/pcre2_code_copy.html b/doc/html/pcre2_code_copy.html
index 5b68282..667d7b7 100644
--- a/doc/html/pcre2_code_copy.html
+++ b/doc/html/pcre2_code_copy.html
@@ -28,8 +28,9 @@ DESCRIPTION
This function makes a copy of the memory used for a compiled pattern, excluding
any memory used by the JIT compiler. Without a subsequent call to
<b>pcre2_jit_compile()</b>, the copy can be used only for non-JIT matching. The
-yield of the function is NULL if <i>code</i> is NULL or if sufficient memory
-cannot be obtained.
+pointer to the character tables is copied, not the tables themselves (see
+<b>pcre2_code_copy_with_tables()</b>). The yield of the function is NULL if
+<i>code</i> is NULL or if sufficient memory cannot be obtained.
</P>
<P>
There is a complete description of the PCRE2 native API in the
diff --git a/doc/html/pcre2_code_copy_with_tables.html b/doc/html/pcre2_code_copy_with_tables.html
new file mode 100644
index 0000000..67b2e1f
--- /dev/null
+++ b/doc/html/pcre2_code_copy_with_tables.html
@@ -0,0 +1,44 @@
+<html>
+<head>
+<title>pcre2_code_copy_with_tables specification</title>
+</head>
+<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
+<h1>pcre2_code_copy_with_tables man page</h1>
+<p>
+Return to the <a href="index.html">PCRE2 index page</a>.
+</p>
+<p>
+This page is part of the PCRE2 HTML documentation. It was generated
+automatically from the original man page. If there is any nonsense in it,
+please consult the man page, in case the conversion went wrong.
+<br>
+<br><b>
+SYNOPSIS
+</b><br>
+<P>
+<b>#include &#60;pcre2.h&#62;</b>
+</P>
+<P>
+<b>pcre2_code *pcre2_code_copy_with_tables(const pcre2_code *<i>code</i>);</b>
+</P>
+<br><b>
+DESCRIPTION
+</b><br>
+<P>
+This function makes a copy of the memory used for a compiled pattern, excluding
+any memory used by the JIT compiler. Without a subsequent call to
+<b>pcre2_jit_compile()</b>, the copy can be used only for non-JIT matching.
+Unlike <b>pcre2_code_copy()</b>, a separate copy of the character tables is also
+made, with the new code pointing to it. This memory will be automatically freed
+when <b>pcre2_code_free()</b> is called. The yield of the function is NULL if
+<i>code</i> is NULL or if sufficient memory cannot be obtained.
+</P>
+<P>
+There is a complete description of the PCRE2 native API in the
+<a href="pcre2api.html"><b>pcre2api</b></a>
+page and a description of the POSIX API in the
+<a href="pcre2posix.html"><b>pcre2posix</b></a>
+page.
+<p>
+Return to the <a href="index.html">PCRE2 index page</a>.
+</p>
diff --git a/doc/html/pcre2_code_free.html b/doc/html/pcre2_code_free.html
index 0477abe..5fce3c5 100644
--- a/doc/html/pcre2_code_free.html
+++ b/doc/html/pcre2_code_free.html
@@ -26,7 +26,9 @@ DESCRIPTION
</b><br>
<P>
This function frees the memory used for a compiled pattern, including any
-memory used by the JIT compiler.
+memory used by the JIT compiler. If the compiled pattern was created by a call
+to <b>pcre2_code_copy_with_tables()</b>, the memory for the character tables is
+also freed.
</P>
<P>
There is a complete description of the PCRE2 native API in the
diff --git a/doc/html/pcre2_compile.html b/doc/html/pcre2_compile.html
index 544f4fe..0a9eafa 100644
--- a/doc/html/pcre2_compile.html
+++ b/doc/html/pcre2_compile.html
@@ -37,26 +37,34 @@ arguments are:
<i>erroffset</i> Where to put an error offset
<i>ccontext</i> Pointer to a compile context or NULL
</pre>
-The length of the string and any error offset that is returned are in code
-units, not characters. A compile context is needed only if you want to change
+The length of the pattern and any error offset that is returned are in code
+units, not characters. A compile context is needed only if you want to provide
+custom memory allocation functions, or to provide an external function for
+system stack size checking, or to change one or more of these parameters:
<pre>
- What \R matches (Unicode newlines or CR, LF, CRLF only)
- PCRE2's character tables
- The newline character sequence
- The compile time nested parentheses limit
+ What \R matches (Unicode newlines, or CR, LF, CRLF only);
+ PCRE2's character tables;
+ The newline character sequence;
+ The compile time nested parentheses limit;
+ The maximum pattern length (in code units) that is allowed;
+ The additional options bits (see pcre2_set_compile_extra_options()).
</pre>
-or provide an external function for stack size checking. The option bits are:
+The option bits are:
<pre>
PCRE2_ANCHORED Force pattern anchoring
+ PCRE2_ALLOW_EMPTY_CLASS Allow empty classes
PCRE2_ALT_BSUX Alternative handling of \u, \U, and \x
PCRE2_ALT_CIRCUMFLEX Alternative handling of ^ in multiline mode
+ PCRE2_ALT_VERBNAMES Process backslashes in verb names
PCRE2_AUTO_CALLOUT Compile automatic callouts
PCRE2_CASELESS Do caseless matching
PCRE2_DOLLAR_ENDONLY $ not to match newline at end
PCRE2_DOTALL . matches anything including NL
PCRE2_DUPNAMES Allow duplicate names for subpatterns
+ PCRE2_ENDANCHORED Pattern can match only at end of subject
PCRE2_EXTENDED Ignore white space and # comments
PCRE2_FIRSTLINE Force matching to be before newline
+ PCRE2_LITERAL Pattern characters are all literal
PCRE2_MATCH_UNSET_BACKREF Match unset back references
PCRE2_MULTILINE ^ and $ match newlines within data
PCRE2_NEVER_BACKSLASH_C Lock out the use of \C in patterns
@@ -71,19 +79,21 @@ or provide an external function for stack size checking. The option bits are:
(only relevant if PCRE2_UTF is set)
PCRE2_UCP Use Unicode properties for \d, \w, etc.
PCRE2_UNGREEDY Invert greediness of quantifiers
+ PCRE2_USE_OFFSET_LIMIT Enable offset limit for unanchored matching
PCRE2_UTF Treat pattern and subjects as UTF strings
</pre>
-PCRE2 must be built with Unicode support in order to use PCRE2_UTF, PCRE2_UCP
-and related options.
+PCRE2 must be built with Unicode support (the default) in order to use
+PCRE2_UTF, PCRE2_UCP and related options.
</P>
<P>
The yield of the function is a pointer to a private data structure that
contains the compiled pattern, or NULL if an error was detected.
</P>
<P>
-There is a complete description of the PCRE2 native API in the
+There is a complete description of the PCRE2 native API, with more detail on
+each option, in the
<a href="pcre2api.html"><b>pcre2api</b></a>
-page and a description of the POSIX API in the
+page, and a description of the POSIX API in the
<a href="pcre2posix.html"><b>pcre2posix</b></a>
page.
<p>
diff --git a/doc/html/pcre2_config.html b/doc/html/pcre2_config.html
index a51b0c7..f05bd06 100644
--- a/doc/html/pcre2_config.html
+++ b/doc/html/pcre2_config.html
@@ -45,24 +45,25 @@ point to a uint32_t integer variable. The available codes are:
PCRE2_CONFIG_BSR Indicates what \R matches by default:
PCRE2_BSR_UNICODE
PCRE2_BSR_ANYCRLF
- PCRE2_CONFIG_JIT Availability of just-in-time compiler
- support (1=yes 0=no)
- PCRE2_CONFIG_JITTARGET Information about the target archi-
- tecture for the JIT compiler
+ PCRE2_CONFIG_COMPILED_WIDTHS Which of 8/16/32 support was compiled
+ PCRE2_CONFIG_DEPTHLIMIT Default backtracking depth limit
+ PCRE2_CONFIG_HEAPLIMIT Default heap memory limit
+ PCRE2_CONFIG_JIT Availability of just-in-time compiler support (1=yes 0=no)
+ PCRE2_CONFIG_JITTARGET Information (a string) about the target architecture for the JIT compiler
PCRE2_CONFIG_LINKSIZE Configured internal link size (2, 3, 4)
PCRE2_CONFIG_MATCHLIMIT Default internal resource limit
+ PCRE2_CONFIG_NEVER_BACKSLASH_C Whether or not \C is disabled
PCRE2_CONFIG_NEWLINE Code for the default newline sequence:
PCRE2_NEWLINE_CR
PCRE2_NEWLINE_LF
PCRE2_NEWLINE_CRLF
PCRE2_NEWLINE_ANY
PCRE2_NEWLINE_ANYCRLF
+ PCRE2_NEWLINE_NUL
PCRE2_CONFIG_PARENSLIMIT Default parentheses nesting limit
- PCRE2_CONFIG_RECURSIONLIMIT Internal recursion depth limit
- PCRE2_CONFIG_STACKRECURSE Recursion implementation (1=stack
- 0=heap)
- PCRE2_CONFIG_UNICODE Availability of Unicode support (1=yes
- 0=no)
+ PCRE2_CONFIG_RECURSIONLIMIT Obsolete: use PCRE2_CONFIG_DEPTHLIMIT
+ PCRE2_CONFIG_STACKRECURSE Obsolete: always returns 0
+ PCRE2_CONFIG_UNICODE Availability of Unicode support (1=yes 0=no)
PCRE2_CONFIG_UNICODE_VERSION The Unicode version (a string)
PCRE2_CONFIG_VERSION The PCRE2 version (a string)
</pre>
diff --git a/doc/html/pcre2_convert_context_copy.html b/doc/html/pcre2_convert_context_copy.html
new file mode 100644
index 0000000..3c44ac6
--- /dev/null
+++ b/doc/html/pcre2_convert_context_copy.html
@@ -0,0 +1,40 @@
+<html>
+<head>
+<title>pcre2_convert_context_copy specification</title>
+</head>
+<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
+<h1>pcre2_convert_context_copy man page</h1>
+<p>
+Return to the <a href="index.html">PCRE2 index page</a>.
+</p>
+<p>
+This page is part of the PCRE2 HTML documentation. It was generated
+automatically from the original man page. If there is any nonsense in it,
+please consult the man page, in case the conversion went wrong.
+<br>
+<br><b>
+SYNOPSIS
+</b><br>
+<P>
+<b>#include &#60;pcre2.h&#62;</b>
+</P>
+<P>
+<b>pcre2_convert_context *pcre2_convert_context_copy(</b>
+<b> pcre2_convert_context *<i>cvcontext</i>);</b>
+</P>
+<br><b>
+DESCRIPTION
+</b><br>
+<P>
+This function is part of an experimental set of pattern conversion functions.
+It makes a new copy of a convert context, using the memory allocation function
+that was used for the original context. The result is NULL if the memory cannot
+be obtained.
+</P>
+<P>
+The pattern conversion functions are described in the
+<a href="pcre2convert.html"><b>pcre2convert</b></a>
+documentation.
+<p>
+Return to the <a href="index.html">PCRE2 index page</a>.
+</p>
diff --git a/doc/html/pcre2_convert_context_create.html b/doc/html/pcre2_convert_context_create.html
new file mode 100644
index 0000000..2564780
--- /dev/null
+++ b/doc/html/pcre2_convert_context_create.html
@@ -0,0 +1,41 @@
+<html>
+<head>
+<title>pcre2_convert_context_create specification</title>
+</head>
+<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
+<h1>pcre2_convert_context_create man page</h1>
+<p>
+Return to the <a href="index.html">PCRE2 index page</a>.
+</p>
+<p>
+This page is part of the PCRE2 HTML documentation. It was generated
+automatically from the original man page. If there is any nonsense in it,
+please consult the man page, in case the conversion went wrong.
+<br>
+<br><b>
+SYNOPSIS
+</b><br>
+<P>
+<b>#include &#60;pcre2.h&#62;</b>
+</P>
+<P>
+<b>pcre2_convert_context *pcre2_convert_context_create(</b>
+<b> pcre2_general_context *<i>gcontext</i>);</b>
+</P>
+<br><b>
+DESCRIPTION
+</b><br>
+<P>
+This function is part of an experimental set of pattern conversion functions.
+It creates and initializes a new convert context. If its argument is
+NULL, <b>malloc()</b> is used to get the necessary memory; otherwise the memory
+allocation function within the general context is used. The result is NULL if
+the memory could not be obtained.
+</P>
+<P>
+The pattern conversion functions are described in the
+<a href="pcre2convert.html"><b>pcre2convert</b></a>
+documentation.
+<p>
+Return to the <a href="index.html">PCRE2 index page</a>.
+</p>
diff --git a/doc/html/pcre2_convert_context_free.html b/doc/html/pcre2_convert_context_free.html
new file mode 100644
index 0000000..ab6db6c
--- /dev/null
+++ b/doc/html/pcre2_convert_context_free.html
@@ -0,0 +1,39 @@
+<html>
+<head>
+<title>pcre2_convert_context_free specification</title>
+</head>
+<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
+<h1>pcre2_convert_context_free man page</h1>
+<p>
+Return to the <a href="index.html">PCRE2 index page</a>.
+</p>
+<p>
+This page is part of the PCRE2 HTML documentation. It was generated
+automatically from the original man page. If there is any nonsense in it,
+please consult the man page, in case the conversion went wrong.
+<br>
+<br><b>
+SYNOPSIS
+</b><br>
+<P>
+<b>#include &#60;pcre2.h&#62;</b>
+</P>
+<P>
+<b>void pcre2_convert_context_free(pcre2_convert_context *<i>cvcontext</i>);</b>
+</P>
+<br><b>
+DESCRIPTION
+</b><br>
+<P>
+This function is part of an experimental set of pattern conversion functions.
+It frees the memory occupied by a convert context, using the memory
+freeing function from the general context with which it was created, or
+<b>free()</b> if that was not set.
+</P>
+<P>
+The pattern conversion functions are described in the
+<a href="pcre2convert.html"><b>pcre2convert</b></a>
+documentation.
+<p>
+Return to the <a href="index.html">PCRE2 index page</a>.
+</p>
diff --git a/doc/html/pcre2_converted_pattern_free.html b/doc/html/pcre2_converted_pattern_free.html
new file mode 100644
index 0000000..11adefd
--- /dev/null
+++ b/doc/html/pcre2_converted_pattern_free.html
@@ -0,0 +1,39 @@
+<html>
+<head>
+<title>pcre2_converted_pattern_free specification</title>
+</head>
+<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
+<h1>pcre2_converted_pattern_free man page</h1>
+<p>
+Return to the <a href="index.html">PCRE2 index page</a>.
+</p>
+<p>
+This page is part of the PCRE2 HTML documentation. It was generated
+automatically from the original man page. If there is any nonsense in it,
+please consult the man page, in case the conversion went wrong.
+<br>
+<br><b>
+SYNOPSIS
+</b><br>
+<P>
+<b>#include &#60;pcre2.h&#62;</b>
+</P>
+<P>
+<b>void pcre2_converted_pattern_free(PCRE2_UCHAR *<i>converted_pattern</i>);</b>
+</P>
+<br><b>
+DESCRIPTION
+</b><br>
+<P>
+This function is part of an experimental set of pattern conversion functions.
+It frees the memory occupied by a converted pattern that was obtained by
+calling <b>pcre2_pattern_convert()</b> with arguments that caused it to place
+the converted pattern into newly obtained heap memory.
+</P>
+<P>
+The pattern conversion functions are described in the
+<a href="pcre2convert.html"><b>pcre2convert</b></a>
+documentation.
+<p>
+Return to the <a href="index.html">PCRE2 index page</a>.
+</p>
diff --git a/doc/html/pcre2_dfa_match.html b/doc/html/pcre2_dfa_match.html
index e137a14..36d7976 100644
--- a/doc/html/pcre2_dfa_match.html
+++ b/doc/html/pcre2_dfa_match.html
@@ -31,8 +31,9 @@ DESCRIPTION
<P>
This function matches a compiled regular expression against a given subject
string, using an alternative matching algorithm that scans the subject string
-just once (<i>not</i> Perl-compatible). (The Perl-compatible matching function
-is <b>pcre2_match()</b>.) The arguments for this function are:
+just once (except when processing lookaround assertions). This function is
+<i>not</i> Perl-compatible (the Perl-compatible matching function is
+<b>pcre2_match()</b>). The arguments for this function are:
<pre>
<i>code</i> Points to the compiled pattern
<i>subject</i> Points to the subject string
@@ -45,22 +46,20 @@ is <b>pcre2_match()</b>.) The arguments for this function are:
<i>wscount</i> Number of elements in the vector
</pre>
For <b>pcre2_dfa_match()</b>, a match context is needed only if you want to set
-up a callout function. The <i>length</i> and <i>startoffset</i> values are code
-units, not characters. The options are:
+up a callout function or specify the match and/or the recursion depth limits.
+The <i>length</i> and <i>startoffset</i> values are code units, not characters.
+The options are:
<pre>
PCRE2_ANCHORED Match only at the first position
+ PCRE2_ENDANCHORED Pattern can match only at end of subject
PCRE2_NOTBOL Subject is not the beginning of a line
PCRE2_NOTEOL Subject is not the end of a line
PCRE2_NOTEMPTY An empty string is not a valid match
- PCRE2_NOTEMPTY_ATSTART An empty string at the start of the subject
- is not a valid match
- PCRE2_NO_UTF_CHECK Do not check the subject for UTF
- validity (only relevant if PCRE2_UTF
+ PCRE2_NOTEMPTY_ATSTART An empty string at the start of the subject is not a valid match
+ PCRE2_NO_UTF_CHECK Do not check the subject for UTF validity (only relevant if PCRE2_UTF
was set at compile time)
- PCRE2_PARTIAL_SOFT Return PCRE2_ERROR_PARTIAL for a partial
- match if no full matches are found
- PCRE2_PARTIAL_HARD Return PCRE2_ERROR_PARTIAL for a partial match
- even if there is a full match as well
+ PCRE2_PARTIAL_HARD Return PCRE2_ERROR_PARTIAL for a partial match even if there is a full match
+ PCRE2_PARTIAL_SOFT Return PCRE2_ERROR_PARTIAL for a partial match if no full matches are found
PCRE2_DFA_RESTART Restart after a partial match
PCRE2_DFA_SHORTEST Return only the shortest match
</pre>
diff --git a/doc/html/pcre2_get_error_message.html b/doc/html/pcre2_get_error_message.html
index 26c80fe..7005760 100644
--- a/doc/html/pcre2_get_error_message.html
+++ b/doc/html/pcre2_get_error_message.html
@@ -34,11 +34,11 @@ errors are negative numbers. The arguments are:
<i>buffer</i> where to put the message
<i>bufflen</i> the length of the buffer (code units)
</pre>
-The function returns the length of the message, excluding the trailing zero, or
-the negative error code PCRE2_ERROR_NOMEMORY if the buffer is too small. In
-this case, the returned message is truncated (but still with a trailing zero).
-If <i>errorcode</i> does not contain a recognized error code number, the
-negative value PCRE2_ERROR_BADDATA is returned.
+The function returns the length of the message in code units, excluding the
+trailing zero, or the negative error code PCRE2_ERROR_NOMEMORY if the buffer is
+too small. In this case, the returned message is truncated (but still with a
+trailing zero). If <i>errorcode</i> does not contain a recognized error code
+number, the negative value PCRE2_ERROR_BADDATA is returned.
</P>
<P>
There is a complete description of the PCRE2 native API in the
diff --git a/doc/html/pcre2_get_mark.html b/doc/html/pcre2_get_mark.html
index f8e50e3..88e6326 100644
--- a/doc/html/pcre2_get_mark.html
+++ b/doc/html/pcre2_get_mark.html
@@ -26,11 +26,15 @@ DESCRIPTION
</b><br>
<P>
After a call of <b>pcre2_match()</b> that was passed the match block that is
-this function's argument, this function returns a pointer to the last (*MARK)
-name that was encountered. The name is zero-terminated, and is within the
-compiled pattern. If no (*MARK) name is available, NULL is returned. A (*MARK)
-name may be available after a failed match or a partial match, as well as after
-a successful one.
+this function's argument, this function returns a pointer to the last (*MARK),
+(*PRUNE), or (*THEN) name that was encountered during the matching process. The
+name is zero-terminated, and is within the compiled pattern. The length of the
+name is in the preceding code unit. If no name is available, NULL is returned.
+</P>
+<P>
+After a successful match, the name that is returned is the last one on the
+matching path. After a failed match or a partial match, the last encountered
+name is returned.
</P>
<P>
There is a complete description of the PCRE2 native API in the
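Note on the pcre2_get_mark() hunk above: the new text says the length of the returned name is stored in the code unit immediately before the name. A minimal sketch of reading that length follows, assuming the 8-bit library (one byte per code unit). The names `sketch_sptr` and `mark_length()` are hypothetical and not part of PCRE2; the sketch does not call the library.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for PCRE2_SPTR in the 8-bit library (assumption: 8-bit code units). */
typedef const unsigned char *sketch_sptr;

/* Hypothetical helper: length of a name returned by pcre2_get_mark(),
   read from the code unit just before the name, as the text describes.
   The caller must already have ruled out a NULL return. */
static size_t mark_length(sketch_sptr mark)
{
    assert(mark != NULL);
    return (size_t)mark[-1];
}
```

With a hand-built buffer `{ 3, 'a', 'b', 'c', 0 }`, passing the address of `'a'` gives 3. With the real library the pointer would come from `pcre2_get_mark(match_data)`, which returns NULL when no name is available.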
diff --git a/doc/html/pcre2_jit_stack_create.html b/doc/html/pcre2_jit_stack_create.html
index a668e34..7c89c31 100644
--- a/doc/html/pcre2_jit_stack_create.html
+++ b/doc/html/pcre2_jit_stack_create.html
@@ -32,10 +32,9 @@ maximum size to which it is allowed to grow. The final argument is a general
context, for memory allocation functions, or NULL for standard memory
allocation. The result can be passed to the JIT run-time code by calling
<b>pcre2_jit_stack_assign()</b> to associate the stack with a compiled pattern,
-which can then be processed by <b>pcre2_match()</b>. If the "fast path" JIT
-matcher, <b>pcre2_jit_match()</b> is used, the stack can be passed directly as
-an argument. A maximum stack size of 512K to 1M should be more than enough for
-any pattern. For more details, see the
+which can then be processed by <b>pcre2_match()</b> or <b>pcre2_jit_match()</b>.
+A maximum stack size of 512K to 1M should be more than enough for any pattern.
+For more details, see the
<a href="pcre2jit.html"><b>pcre2jit</b></a>
page.
</P>
diff --git a/doc/html/pcre2_maketables.html b/doc/html/pcre2_maketables.html
index 068e6d4..6d240e3 100644
--- a/doc/html/pcre2_maketables.html
+++ b/doc/html/pcre2_maketables.html
@@ -19,16 +19,16 @@ SYNOPSIS
<b>#include &#60;pcre2.h&#62;</b>
</P>
<P>
-<b>const unsigned char *pcre2_maketables(pcre22_general_context *<i>gcontext</i>);</b>
+<b>const unsigned char *pcre2_maketables(pcre2_general_context *<i>gcontext</i>);</b>
</P>
<br><b>
DESCRIPTION
</b><br>
<P>
-This function builds a set of character tables for character values less than
-256. These can be passed to <b>pcre2_compile()</b> in a compile context in order
-to override the internal, built-in tables (which were either defaulted or made
-by <b>pcre2_maketables()</b> when PCRE2 was compiled). See the
+This function builds a set of character tables for character code points that
+are less than 256. These can be passed to <b>pcre2_compile()</b> in a compile
+context in order to override the internal, built-in tables (which were either
+defaulted or made by <b>pcre2_maketables()</b> when PCRE2 was compiled). See the
<a href="pcre2_set_character_tables.html"><b>pcre2_set_character_tables()</b></a>
page. You might want to do this if you are using a non-standard locale.
</P>
diff --git a/doc/html/pcre2_match.html b/doc/html/pcre2_match.html
index 0e389eb..ced70bb 100644
--- a/doc/html/pcre2_match.html
+++ b/doc/html/pcre2_match.html
@@ -30,7 +30,13 @@ DESCRIPTION
<P>
This function matches a compiled regular expression against a given subject
string, using a matching algorithm that is similar to Perl's. It returns
-offsets to captured substrings. Its arguments are:
+offsets to what it has matched and to captured substrings via the
+<b>match_data</b> block, which can be processed by functions with names that
+start with <b>pcre2_get_ovector_...()</b> or <b>pcre2_substring_...()</b>. The
+return from <b>pcre2_match()</b> is one more than the highest numbered capturing
+pair that has been set (for example, 1 if there are no captures), zero if the
+vector of offsets is too small, or a negative error code for no match and other
+errors. The function arguments are:
<pre>
<i>code</i> Points to the compiled pattern
<i>subject</i> Points to the subject string
@@ -43,26 +49,27 @@ offsets to captured substrings. Its arguments are:
A match context is needed only if you want to:
<pre>
Set up a callout function
- Change the limit for calling the internal function <i>match()</i>
- Change the limit for calling <i>match()</i> recursively
- Set custom memory management when the heap is used for recursion
+ Set a matching offset limit
+ Change the heap memory limit
+ Change the backtracking match limit
+ Change the backtracking depth limit
+ Set custom memory management specifically for the match
</pre>
The <i>length</i> and <i>startoffset</i> values are code
-units, not characters. The options are:
+units, not characters. The length may be given as PCRE2_ZERO_TERMINATED for a
+subject that is terminated by a binary zero code unit. The options are:
<pre>
PCRE2_ANCHORED Match only at the first position
+ PCRE2_ENDANCHORED Pattern can match only at end of subject
PCRE2_NOTBOL Subject string is not the beginning of a line
PCRE2_NOTEOL Subject string is not the end of a line
PCRE2_NOTEMPTY An empty string is not a valid match
- PCRE2_NOTEMPTY_ATSTART An empty string at the start of the subject
- is not a valid match
- PCRE2_NO_UTF_CHECK Do not check the subject for UTF
- validity (only relevant if PCRE2_UTF
+ PCRE2_NOTEMPTY_ATSTART An empty string at the start of the subject is not a valid match
+ PCRE2_NO_JIT Do not use JIT matching
+ PCRE2_NO_UTF_CHECK Do not check the subject for UTF validity (only relevant if PCRE2_UTF
was set at compile time)
- PCRE2_PARTIAL_SOFT Return PCRE2_ERROR_PARTIAL for a partial
- match if no full matches are found
- PCRE2_PARTIAL_HARD Return PCRE2_ERROR_PARTIAL for a partial match
- if that is found before a full match
+ PCRE2_PARTIAL_HARD Return PCRE2_ERROR_PARTIAL for a partial match even if there is a full match
+ PCRE2_PARTIAL_SOFT Return PCRE2_ERROR_PARTIAL for a partial match if no full matches are found
</pre>
For details of partial matching, see the
<a href="pcre2partial.html"><b>pcre2partial</b></a>
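Note on the pcre2_match() hunk above: the return-value rules added to the description can be sketched as a small caller-side helper. This is an illustration under stated assumptions, not library code. `describe_match_rc()` and `highest_pair_set()` are hypothetical names, and the two `SKETCH_` constants repeat the values of PCRE2_ERROR_NOMATCH and PCRE2_ERROR_PARTIAL from pcre2.h so that the sketch compiles without the library headers.

```c
#include <assert.h>
#include <string.h>

/* Same values as PCRE2_ERROR_NOMATCH and PCRE2_ERROR_PARTIAL in pcre2.h,
   repeated here (assumption) so the sketch needs no library headers. */
#define SKETCH_ERROR_NOMATCH (-1)
#define SKETCH_ERROR_PARTIAL (-2)

/* Hypothetical helper: say what a pcre2_match() return value means. */
static const char *describe_match_rc(int rc)
{
    if (rc > 0)  return "match";              /* offsets are in the match data */
    if (rc == 0) return "ovector too small";  /* matched, but not all pairs fit */
    if (rc == SKETCH_ERROR_NOMATCH) return "no match";
    if (rc == SKETCH_ERROR_PARTIAL) return "partial match";
    return "error";                           /* any other negative code */
}

/* Highest-numbered capturing pair that was set; pair 0 is the whole match.
   Only meaningful for a positive return value. */
static int highest_pair_set(int rc)
{
    assert(rc > 0);
    return rc - 1;
}
```

A return of 1 therefore means that only pair 0 (the overall match) was set, which is the "no captures" case mentioned in the text.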
diff --git a/doc/html/pcre2_match_data_free.html b/doc/html/pcre2_match_data_free.html
index 70e107e..840067f 100644
--- a/doc/html/pcre2_match_data_free.html
+++ b/doc/html/pcre2_match_data_free.html
@@ -26,8 +26,8 @@ DESCRIPTION
</b><br>
<P>
This function frees the memory occupied by a match data block, using the memory
-freeing function from the general context with which it was created, or
-<b>free()</b> if that was not set.
+freeing function from the general context or compiled pattern with which it was
+created, or <b>free()</b> if that was not set.
</P>
<P>
There is a complete description of the PCRE2 native API in the
diff --git a/doc/html/pcre2_pattern_convert.html b/doc/html/pcre2_pattern_convert.html
new file mode 100644
index 0000000..2fcd7cc
--- /dev/null
+++ b/doc/html/pcre2_pattern_convert.html
@@ -0,0 +1,70 @@
+<html>
+<head>
+<title>pcre2_pattern_convert specification</title>
+</head>
+<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
+<h1>pcre2_pattern_convert man page</h1>
+<p>
+Return to the <a href="index.html">PCRE2 index page</a>.
+</p>
+<p>
+This page is part of the PCRE2 HTML documentation. It was generated
+automatically from the original man page. If there is any nonsense in it,
+please consult the man page, in case the conversion went wrong.
+<br>
+<br><b>
+SYNOPSIS
+</b><br>
+<P>
+<b>#include &#60;pcre2.h&#62;</b>
+</P>
+<P>
+<b>int pcre2_pattern_convert(PCRE2_SPTR <i>pattern</i>, PCRE2_SIZE <i>length</i>,</b>
+<b> uint32_t <i>options</i>, PCRE2_UCHAR **<i>buffer</i>,</b>
+<b> PCRE2_SIZE *<i>blength</i>, pcre2_convert_context *<i>cvcontext</i>);</b>
+</P>
+<br><b>
+DESCRIPTION
+</b><br>
+<P>
+This function is part of an experimental set of pattern conversion functions.
+It converts a foreign pattern (for example, a glob) into a PCRE2 regular
+expression pattern. Its arguments are:
+<pre>
+ <i>pattern</i> The foreign pattern
+ <i>length</i> The length of the input pattern or PCRE2_ZERO_TERMINATED
+ <i>options</i> Option bits
+ <i>buffer</i> Pointer to pointer to output buffer, or NULL
+ <i>blength</i> Pointer to output length field
+ <i>cvcontext</i> Pointer to a convert context or NULL
+</pre>
+The length of the converted pattern (excluding the terminating zero) is
+returned via <i>blength</i>. If <i>buffer</i> is NULL, the function just returns
+the output length. If <i>buffer</i> points to a NULL pointer, heap memory is
+obtained for the converted pattern, using the allocator in the context if
+present (or else <b>malloc()</b>), and the field pointed to by <i>buffer</i> is
+updated. If <i>buffer</i> points to a non-NULL field, that field must point to a
+buffer whose size is in the variable pointed to by <i>blength</i>. On return, this
+value is updated to the length of the converted pattern.
+</P>
+<P>
+The option bits are:
+<pre>
+ PCRE2_CONVERT_UTF Input is UTF
+ PCRE2_CONVERT_NO_UTF_CHECK Do not check UTF validity
+ PCRE2_CONVERT_POSIX_BASIC Convert POSIX basic pattern
+ PCRE2_CONVERT_POSIX_EXTENDED Convert POSIX extended pattern
+ PCRE2_CONVERT_GLOB ) Convert
+ PCRE2_CONVERT_GLOB_NO_WILD_SEPARATOR ) various types
+ PCRE2_CONVERT_GLOB_NO_STARSTAR ) of glob
+</pre>
+The return value from <b>pcre2_pattern_convert()</b> is zero on success or a
+non-zero PCRE2 error code.
+</P>
+<P>
+The pattern conversion functions are described in the
+<a href="pcre2convert.html"><b>pcre2convert</b></a>
+documentation.
+<p>
+Return to the <a href="index.html">PCRE2 index page</a>.
+</p>
diff --git a/doc/html/pcre2_pattern_info.html b/doc/html/pcre2_pattern_info.html
index b4cd6f5..1ebf90b 100644
--- a/doc/html/pcre2_pattern_info.html
+++ b/doc/html/pcre2_pattern_info.html
@@ -27,7 +27,7 @@ DESCRIPTION
<P>
This function returns information about a compiled pattern. Its arguments are:
<pre>
- <i>code</i> Pointer to a compiled regular expression
+ <i>code</i> Pointer to a compiled regular expression pattern
<i>what</i> What information is required
<i>where</i> Where to put the information
</pre>
@@ -41,27 +41,28 @@ request are as follows:
PCRE2_BSR_UNICODE: Unicode line endings
PCRE2_BSR_ANYCRLF: CR, LF, or CRLF only
PCRE2_INFO_CAPTURECOUNT Number of capturing subpatterns
+ PCRE2_INFO_DEPTHLIMIT Backtracking depth limit if set, otherwise PCRE2_ERROR_UNSET
+ PCRE2_INFO_EXTRAOPTIONS Extra options that were passed in the
+ compile context
PCRE2_INFO_FIRSTBITMAP Bitmap of first code units, or NULL
PCRE2_INFO_FIRSTCODETYPE Type of start-of-match information
0 nothing set
1 first code unit is set
2 start of string or after newline
PCRE2_INFO_FIRSTCODEUNIT First code unit when type is 1
+ PCRE2_INFO_FRAMESIZE Size of backtracking frame
PCRE2_INFO_HASBACKSLASHC Return 1 if pattern contains \C
- PCRE2_INFO_HASCRORLF Return 1 if explicit CR or LF matches
- exist in the pattern
+ PCRE2_INFO_HASCRORLF Return 1 if explicit CR or LF matches exist in the pattern
+ PCRE2_INFO_HEAPLIMIT Heap memory limit if set, otherwise PCRE2_ERROR_UNSET
PCRE2_INFO_JCHANGED Return 1 if (?J) or (?-J) was used
PCRE2_INFO_JITSIZE Size of JIT compiled code, or 0
PCRE2_INFO_LASTCODETYPE Type of must-be-present information
0 nothing set
1 code unit is set
PCRE2_INFO_LASTCODEUNIT Last code unit when type is 1
- PCRE2_INFO_MATCHEMPTY 1 if the pattern can match an
- empty string, 0 otherwise
- PCRE2_INFO_MATCHLIMIT Match limit if set,
- otherwise PCRE2_ERROR_UNSET
- PCRE2_INFO_MAXLOOKBEHIND Length (in characters) of the longest
- lookbehind assertion
+ PCRE2_INFO_MATCHEMPTY 1 if the pattern can match an empty string, 0 otherwise
+ PCRE2_INFO_MATCHLIMIT Match limit if set, otherwise PCRE2_ERROR_UNSET
+ PCRE2_INFO_MAXLOOKBEHIND Length (in characters) of the longest lookbehind assertion
PCRE2_INFO_MINLENGTH Lower bound length of matching strings
PCRE2_INFO_NAMECOUNT Number of named subpatterns
PCRE2_INFO_NAMEENTRYSIZE Size of name table entries
@@ -72,8 +73,8 @@ request are as follows:
PCRE2_NEWLINE_CRLF
PCRE2_NEWLINE_ANY
PCRE2_NEWLINE_ANYCRLF
- PCRE2_INFO_RECURSIONLIMIT Recursion limit if set,
- otherwise PCRE2_ERROR_UNSET
+ PCRE2_NEWLINE_NUL
+ PCRE2_INFO_RECURSIONLIMIT Obsolete synonym for PCRE2_INFO_DEPTHLIMIT
PCRE2_INFO_SIZE Size of compiled pattern
</pre>
If <i>where</i> is NULL, the function returns the amount of memory needed for
diff --git a/doc/html/pcre2_set_callout.html b/doc/html/pcre2_set_callout.html
index 635e0c2..4e7aca6 100644
--- a/doc/html/pcre2_set_callout.html
+++ b/doc/html/pcre2_set_callout.html
@@ -29,7 +29,7 @@ DESCRIPTION
<P>
This function sets the callout fields in a match context (the first argument).
The second argument specifies a callout function, and the third argument is an
-opaque data time that is passed to it. The result of this function is always
+opaque data item that is passed to it. The result of this function is always
zero.
</P>
<P>
diff --git a/doc/html/pcre2_set_compile_extra_options.html b/doc/html/pcre2_set_compile_extra_options.html
new file mode 100644
index 0000000..7374931
--- /dev/null
+++ b/doc/html/pcre2_set_compile_extra_options.html
@@ -0,0 +1,45 @@
+<html>
+<head>
+<title>pcre2_set_compile_extra_options specification</title>
+</head>
+<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
+<h1>pcre2_set_compile_extra_options man page</h1>
+<p>
+Return to the <a href="index.html">PCRE2 index page</a>.
+</p>
+<p>
+This page is part of the PCRE2 HTML documentation. It was generated
+automatically from the original man page. If there is any nonsense in it,
+please consult the man page, in case the conversion went wrong.
+<br>
+<br><b>
+SYNOPSIS
+</b><br>
+<P>
+<b>#include &#60;pcre2.h&#62;</b>
+</P>
+<P>
+<b>int pcre2_set_compile_extra_options(pcre2_compile_context *<i>ccontext</i>,</b>
+<b> uint32_t <i>extra_options</i>);</b>
+</P>
+<br><b>
+DESCRIPTION
+</b><br>
+<P>
+This function sets additional option bits for <b>pcre2_compile()</b> that are
+housed in a compile context. It completely replaces all the bits. The extra
+options are:
+<pre>
+ PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES Allow \x{d800} to \x{dfff} in UTF-8 and UTF-32 modes
+ PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL Treat an invalid escape as the literal character that follows the backslash
+ PCRE2_EXTRA_MATCH_LINE Pattern matches whole lines
+ PCRE2_EXTRA_MATCH_WORD Pattern matches "words"
+</pre>
+There is a complete description of the PCRE2 native API in the
+<a href="pcre2api.html"><b>pcre2api</b></a>
+page and a description of the POSIX API in the
+<a href="pcre2posix.html"><b>pcre2posix</b></a>
+page.
+<p>
+Return to the <a href="index.html">PCRE2 index page</a>.
+</p>
diff --git a/doc/html/pcre2_set_depth_limit.html b/doc/html/pcre2_set_depth_limit.html
new file mode 100644
index 0000000..a1cf706
--- /dev/null
+++ b/doc/html/pcre2_set_depth_limit.html
@@ -0,0 +1,40 @@
+<html>
+<head>
+<title>pcre2_set_depth_limit specification</title>
+</head>
+<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
+<h1>pcre2_set_depth_limit man page</h1>
+<p>
+Return to the <a href="index.html">PCRE2 index page</a>.
+</p>
+<p>
+This page is part of the PCRE2 HTML documentation. It was generated
+automatically from the original man page. If there is any nonsense in it,
+please consult the man page, in case the conversion went wrong.
+<br>
+<br><b>
+SYNOPSIS
+</b><br>
+<P>
+<b>#include &#60;pcre2.h&#62;</b>
+</P>
+<P>
+<b>int pcre2_set_depth_limit(pcre2_match_context *<i>mcontext</i>,</b>
+<b> uint32_t <i>value</i>);</b>
+</P>
+<br><b>
+DESCRIPTION
+</b><br>
+<P>
+This function sets the backtracking depth limit field in a match context. The
+result is always zero.
+</P>
+<P>
+There is a complete description of the PCRE2 native API in the
+<a href="pcre2api.html"><b>pcre2api</b></a>
+page and a description of the POSIX API in the
+<a href="pcre2posix.html"><b>pcre2posix</b></a>
+page.
+<p>
+Return to the <a href="index.html">PCRE2 index page</a>.
+</p>
diff --git a/doc/html/pcre2_set_glob_escape.html b/doc/html/pcre2_set_glob_escape.html
new file mode 100644
index 0000000..2b55627
--- /dev/null
+++ b/doc/html/pcre2_set_glob_escape.html
@@ -0,0 +1,43 @@
+<html>
+<head>
+<title>pcre2_set_glob_escape specification</title>
+</head>
+<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
+<h1>pcre2_set_glob_escape man page</h1>
+<p>
+Return to the <a href="index.html">PCRE2 index page</a>.
+</p>
+<p>
+This page is part of the PCRE2 HTML documentation. It was generated
+automatically from the original man page. If there is any nonsense in it,
+please consult the man page, in case the conversion went wrong.
+<br>
+<br><b>
+SYNOPSIS
+</b><br>
+<P>
+<b>#include &#60;pcre2.h&#62;</b>
+</P>
+<P>
+<b>int pcre2_set_glob_escape(pcre2_convert_context *<i>cvcontext</i>,</b>
+<b> uint32_t <i>escape_char</i>);</b>
+</P>
+<br><b>
+DESCRIPTION
+</b><br>
+<P>
+This function is part of an experimental set of pattern conversion functions.
+It sets the escape character that is used when converting globs. The second
+argument must either be zero (meaning there is no escape character) or a
+punctuation character whose code point is less than 256. The default is grave
+accent if running under Windows, otherwise backslash. The result of the
+function is zero for success or PCRE2_ERROR_BADDATA if the second argument is
+invalid.
+</P>
+<P>
+The pattern conversion functions are described in the
+<a href="pcre2convert.html"><b>pcre2convert</b></a>
+documentation.
+<p>
+Return to the <a href="index.html">PCRE2 index page</a>.
+</p>
diff --git a/doc/html/pcre2_set_glob_separator.html b/doc/html/pcre2_set_glob_separator.html
new file mode 100644
index 0000000..538748d
--- /dev/null
+++ b/doc/html/pcre2_set_glob_separator.html
@@ -0,0 +1,42 @@
+<html>
+<head>
+<title>pcre2_set_glob_separator specification</title>
+</head>
+<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
+<h1>pcre2_set_glob_separator man page</h1>
+<p>
+Return to the <a href="index.html">PCRE2 index page</a>.
+</p>
+<p>
+This page is part of the PCRE2 HTML documentation. It was generated
+automatically from the original man page. If there is any nonsense in it,
+please consult the man page, in case the conversion went wrong.
+<br>
+<br><b>
+SYNOPSIS
+</b><br>
+<P>
+<b>#include &#60;pcre2.h&#62;</b>
+</P>
+<P>
+<b>int pcre2_set_glob_separator(pcre2_convert_context *<i>cvcontext</i>,</b>
+<b> uint32_t <i>separator_char</i>);</b>
+</P>
+<br><b>
+DESCRIPTION
+</b><br>
+<P>
+This function is part of an experimental set of pattern conversion functions.
+It sets the component separator character that is used when converting globs.
+The second argument must be one of the characters forward slash, backslash, or
+dot. The default is backslash when running under Windows, otherwise forward
+slash. The result of the function is zero for success or PCRE2_ERROR_BADDATA if
+the second argument is invalid.
+</P>
+<P>
+The pattern conversion functions are described in the
+<a href="pcre2convert.html"><b>pcre2convert</b></a>
+documentation.
+<p>
+Return to the <a href="index.html">PCRE2 index page</a>.
+</p>
diff --git a/doc/html/pcre2_set_heap_limit.html b/doc/html/pcre2_set_heap_limit.html
new file mode 100644
index 0000000..3631ef6
--- /dev/null
+++ b/doc/html/pcre2_set_heap_limit.html
@@ -0,0 +1,40 @@
+<html>
+<head>
+<title>pcre2_set_heap_limit specification</title>
+</head>
+<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
+<h1>pcre2_set_heap_limit man page</h1>
+<p>
+Return to the <a href="index.html">PCRE2 index page</a>.
+</p>
+<p>
+This page is part of the PCRE2 HTML documentation. It was generated
+automatically from the original man page. If there is any nonsense in it,
+please consult the man page, in case the conversion went wrong.
+<br>
+<br><b>
+SYNOPSIS
+</b><br>
+<P>
+<b>#include &#60;pcre2.h&#62;</b>
+</P>
+<P>
+<b>int pcre2_set_heap_limit(pcre2_match_context *<i>mcontext</i>,</b>
+<b> uint32_t <i>value</i>);</b>
+</P>
+<br><b>
+DESCRIPTION
+</b><br>
+<P>
+This function sets the backtracking heap limit field in a match context. The
+result is always zero.
+</P>
+<P>
+There is a complete description of the PCRE2 native API in the
+<a href="pcre2api.html"><b>pcre2api</b></a>
+page and a description of the POSIX API in the
+<a href="pcre2posix.html"><b>pcre2posix</b></a>
+page.
+<p>
+Return to the <a href="index.html">PCRE2 index page</a>.
+</p>
diff --git a/doc/html/pcre2_set_max_pattern_length.html b/doc/html/pcre2_set_max_pattern_length.html
new file mode 100644
index 0000000..f6e422a
--- /dev/null
+++ b/doc/html/pcre2_set_max_pattern_length.html
@@ -0,0 +1,43 @@
+<html>
+<head>
+<title>pcre2_set_max_pattern_length specification</title>
+</head>
+<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
+<h1>pcre2_set_max_pattern_length man page</h1>
+<p>
+Return to the <a href="index.html">PCRE2 index page</a>.
+</p>
+<p>
+This page is part of the PCRE2 HTML documentation. It was generated
+automatically from the original man page. If there is any nonsense in it,
+please consult the man page, in case the conversion went wrong.
+<br>
+<br><b>
+SYNOPSIS
+</b><br>
+<P>
+<b>#include &#60;pcre2.h&#62;</b>
+</P>
+<P>
+<b>int pcre2_set_max_pattern_length(pcre2_compile_context *<i>ccontext</i>,</b>
+<b> PCRE2_SIZE <i>value</i>);</b>
+</P>
+<br><b>
+DESCRIPTION
+</b><br>
+<P>
+This function sets, in a compile context, the maximum text length (in code
+units) of the pattern that can be compiled. The result is always zero. If a
+longer pattern is passed to <b>pcre2_compile()</b> there is an immediate error
+return. The default is effectively unlimited, being the largest value a
+PCRE2_SIZE variable can hold.
+</P>
+<P>
+There is a complete description of the PCRE2 native API in the
+<a href="pcre2api.html"><b>pcre2api</b></a>
+page and a description of the POSIX API in the
+<a href="pcre2posix.html"><b>pcre2posix</b></a>
+page.
+<p>
+Return to the <a href="index.html">PCRE2 index page</a>.
+</p>
diff --git a/doc/html/pcre2_set_newline.html b/doc/html/pcre2_set_newline.html
index ae6332a..ba81300 100644
--- a/doc/html/pcre2_set_newline.html
+++ b/doc/html/pcre2_set_newline.html
@@ -35,6 +35,7 @@ matching patterns. The second argument must be one of:
PCRE2_NEWLINE_CRLF CR followed by LF only
PCRE2_NEWLINE_ANYCRLF Any of the above
PCRE2_NEWLINE_ANY Any Unicode newline sequence
+ PCRE2_NEWLINE_NUL The NUL character (binary zero)
</pre>
The result is zero for success or PCRE2_ERROR_BADDATA if the second argument is
invalid.
diff --git a/doc/html/pcre2_set_recursion_limit.html b/doc/html/pcre2_set_recursion_limit.html
index 5adcc99..9ff68c2 100644
--- a/doc/html/pcre2_set_recursion_limit.html
+++ b/doc/html/pcre2_set_recursion_limit.html
@@ -26,8 +26,8 @@ SYNOPSIS
DESCRIPTION
</b><br>
<P>
-This function sets the recursion limit field in a match context. The result is
-always zero.
+This function is obsolete and should not be used in new code. Use
+<b>pcre2_set_depth_limit()</b> instead.
</P>
<P>
There is a complete description of the PCRE2 native API in the
diff --git a/doc/html/pcre2_set_recursion_memory_management.html b/doc/html/pcre2_set_recursion_memory_management.html
index ec18947..1e057b9 100644
--- a/doc/html/pcre2_set_recursion_memory_management.html
+++ b/doc/html/pcre2_set_recursion_memory_management.html
@@ -28,13 +28,8 @@ SYNOPSIS
DESCRIPTION
</b><br>
<P>
-This function sets the match context fields for custom memory management when
-PCRE2 is compiled to use the heap instead of the system stack for recursive
-function calls while matching. When PCRE2 is compiled to use the stack (the
-default) this function does nothing. The first argument is a match context, the
-second and third specify the memory allocation and freeing functions, and the
-final argument is an opaque value that is passed to them whenever they are
-called. The result of this function is always zero.
+From release 10.30 onwards, this function is obsolete and does nothing. The
+result is always zero.
</P>
<P>
There is a complete description of the PCRE2 native API in the
diff --git a/doc/html/pcre2_substitute.html b/doc/html/pcre2_substitute.html
index 2dfd094..2215ce9 100644
--- a/doc/html/pcre2_substitute.html
+++ b/doc/html/pcre2_substitute.html
@@ -47,26 +47,30 @@ Its arguments are:
<i>outputbuffer</i> Points to the output buffer
<i>outlengthptr</i> Points to the length of the output buffer
</pre>
-A match context is needed only if you want to:
+A match data block is needed only if you want to inspect the data from the
+match that is returned in that block. A match context is needed only if you
+want to:
<pre>
Set up a callout function
- Change the limit for calling the internal function <i>match()</i>
- Change the limit for calling <i>match()</i> recursively
- Set custom memory management when the heap is used for recursion
+ Set a matching offset limit
+ Change the backtracking match limit
+ Change the backtracking depth limit
+ Set custom memory management in the match context
</pre>
The <i>length</i>, <i>startoffset</i> and <i>rlength</i> values are code
units, not characters, as is the contents of the variable pointed at by
<i>outlengthptr</i>, which is updated to the actual length of the new string.
-The options are:
+The subject and replacement lengths can be given as PCRE2_ZERO_TERMINATED for
+zero-terminated strings. The options are:
<pre>
PCRE2_ANCHORED Match only at the first position
+ PCRE2_ENDANCHORED Pattern can match only at end of subject
PCRE2_NOTBOL Subject is not the beginning of a line
PCRE2_NOTEOL Subject is not the end of a line
PCRE2_NOTEMPTY An empty string is not a valid match
- PCRE2_NOTEMPTY_ATSTART An empty string at the start of the
- subject is not a valid match
- PCRE2_NO_UTF_CHECK Do not check the subject or replacement
- for UTF validity (only relevant if
+ PCRE2_NOTEMPTY_ATSTART An empty string at the start of the subject is not a valid match
+ PCRE2_NO_JIT Do not use JIT matching
+ PCRE2_NO_UTF_CHECK Do not check the subject or replacement for UTF validity (only relevant if
PCRE2_UTF was set at compile time)
PCRE2_SUBSTITUTE_EXTENDED Do extended replacement processing
PCRE2_SUBSTITUTE_GLOBAL Replace all occurrences in the subject
diff --git a/doc/html/pcre2api.html b/doc/html/pcre2api.html
index fa9f342..ba3b2ca 100644
--- a/doc/html/pcre2api.html
+++ b/doc/html/pcre2api.html
@@ -23,44 +23,45 @@ please consult the man page, in case the conversion went wrong.
<li><a name="TOC8" href="#SEC8">PCRE2 NATIVE API JIT FUNCTIONS</a>
<li><a name="TOC9" href="#SEC9">PCRE2 NATIVE API SERIALIZATION FUNCTIONS</a>
<li><a name="TOC10" href="#SEC10">PCRE2 NATIVE API AUXILIARY FUNCTIONS</a>
-<li><a name="TOC11" href="#SEC11">PCRE2 8-BIT, 16-BIT, AND 32-BIT LIBRARIES</a>
-<li><a name="TOC12" href="#SEC12">PCRE2 API OVERVIEW</a>
-<li><a name="TOC13" href="#SEC13">STRING LENGTHS AND OFFSETS</a>
-<li><a name="TOC14" href="#SEC14">NEWLINES</a>
-<li><a name="TOC15" href="#SEC15">MULTITHREADING</a>
-<li><a name="TOC16" href="#SEC16">PCRE2 CONTEXTS</a>
-<li><a name="TOC17" href="#SEC17">CHECKING BUILD-TIME OPTIONS</a>
-<li><a name="TOC18" href="#SEC18">COMPILING A PATTERN</a>
-<li><a name="TOC19" href="#SEC19">COMPILATION ERROR CODES</a>
-<li><a name="TOC20" href="#SEC20">JUST-IN-TIME (JIT) COMPILATION</a>
-<li><a name="TOC21" href="#SEC21">LOCALE SUPPORT</a>
-<li><a name="TOC22" href="#SEC22">INFORMATION ABOUT A COMPILED PATTERN</a>
-<li><a name="TOC23" href="#SEC23">INFORMATION ABOUT A PATTERN'S CALLOUTS</a>
-<li><a name="TOC24" href="#SEC24">SERIALIZATION AND PRECOMPILING</a>
-<li><a name="TOC25" href="#SEC25">THE MATCH DATA BLOCK</a>
-<li><a name="TOC26" href="#SEC26">MATCHING A PATTERN: THE TRADITIONAL FUNCTION</a>
-<li><a name="TOC27" href="#SEC27">NEWLINE HANDLING WHEN MATCHING</a>
-<li><a name="TOC28" href="#SEC28">HOW PCRE2_MATCH() RETURNS A STRING AND CAPTURED SUBSTRINGS</a>
-<li><a name="TOC29" href="#SEC29">OTHER INFORMATION ABOUT A MATCH</a>
-<li><a name="TOC30" href="#SEC30">ERROR RETURNS FROM <b>pcre2_match()</b></a>
-<li><a name="TOC31" href="#SEC31">OBTAINING A TEXTUAL ERROR MESSAGE</a>
-<li><a name="TOC32" href="#SEC32">EXTRACTING CAPTURED SUBSTRINGS BY NUMBER</a>
-<li><a name="TOC33" href="#SEC33">EXTRACTING A LIST OF ALL CAPTURED SUBSTRINGS</a>
-<li><a name="TOC34" href="#SEC34">EXTRACTING CAPTURED SUBSTRINGS BY NAME</a>
-<li><a name="TOC35" href="#SEC35">CREATING A NEW STRING WITH SUBSTITUTIONS</a>
-<li><a name="TOC36" href="#SEC36">DUPLICATE SUBPATTERN NAMES</a>
-<li><a name="TOC37" href="#SEC37">FINDING ALL POSSIBLE MATCHES AT ONE POSITION</a>
-<li><a name="TOC38" href="#SEC38">MATCHING A PATTERN: THE ALTERNATIVE FUNCTION</a>
-<li><a name="TOC39" href="#SEC39">SEE ALSO</a>
-<li><a name="TOC40" href="#SEC40">AUTHOR</a>
-<li><a name="TOC41" href="#SEC41">REVISION</a>
+<li><a name="TOC11" href="#SEC11">PCRE2 NATIVE API OBSOLETE FUNCTIONS</a>
+<li><a name="TOC12" href="#SEC12">PCRE2 EXPERIMENTAL PATTERN CONVERSION FUNCTIONS</a>
+<li><a name="TOC13" href="#SEC13">PCRE2 8-BIT, 16-BIT, AND 32-BIT LIBRARIES</a>
+<li><a name="TOC14" href="#SEC14">PCRE2 API OVERVIEW</a>
+<li><a name="TOC15" href="#SEC15">STRING LENGTHS AND OFFSETS</a>
+<li><a name="TOC16" href="#SEC16">NEWLINES</a>
+<li><a name="TOC17" href="#SEC17">MULTITHREADING</a>
+<li><a name="TOC18" href="#SEC18">PCRE2 CONTEXTS</a>
+<li><a name="TOC19" href="#SEC19">CHECKING BUILD-TIME OPTIONS</a>
+<li><a name="TOC20" href="#SEC20">COMPILING A PATTERN</a>
+<li><a name="TOC21" href="#SEC21">JUST-IN-TIME (JIT) COMPILATION</a>
+<li><a name="TOC22" href="#SEC22">LOCALE SUPPORT</a>
+<li><a name="TOC23" href="#SEC23">INFORMATION ABOUT A COMPILED PATTERN</a>
+<li><a name="TOC24" href="#SEC24">INFORMATION ABOUT A PATTERN'S CALLOUTS</a>
+<li><a name="TOC25" href="#SEC25">SERIALIZATION AND PRECOMPILING</a>
+<li><a name="TOC26" href="#SEC26">THE MATCH DATA BLOCK</a>
+<li><a name="TOC27" href="#SEC27">MATCHING A PATTERN: THE TRADITIONAL FUNCTION</a>
+<li><a name="TOC28" href="#SEC28">NEWLINE HANDLING WHEN MATCHING</a>
+<li><a name="TOC29" href="#SEC29">HOW PCRE2_MATCH() RETURNS A STRING AND CAPTURED SUBSTRINGS</a>
+<li><a name="TOC30" href="#SEC30">OTHER INFORMATION ABOUT A MATCH</a>
+<li><a name="TOC31" href="#SEC31">ERROR RETURNS FROM <b>pcre2_match()</b></a>
+<li><a name="TOC32" href="#SEC32">OBTAINING A TEXTUAL ERROR MESSAGE</a>
+<li><a name="TOC33" href="#SEC33">EXTRACTING CAPTURED SUBSTRINGS BY NUMBER</a>
+<li><a name="TOC34" href="#SEC34">EXTRACTING A LIST OF ALL CAPTURED SUBSTRINGS</a>
+<li><a name="TOC35" href="#SEC35">EXTRACTING CAPTURED SUBSTRINGS BY NAME</a>
+<li><a name="TOC36" href="#SEC36">CREATING A NEW STRING WITH SUBSTITUTIONS</a>
+<li><a name="TOC37" href="#SEC37">DUPLICATE SUBPATTERN NAMES</a>
+<li><a name="TOC38" href="#SEC38">FINDING ALL POSSIBLE MATCHES AT ONE POSITION</a>
+<li><a name="TOC39" href="#SEC39">MATCHING A PATTERN: THE ALTERNATIVE FUNCTION</a>
+<li><a name="TOC40" href="#SEC40">SEE ALSO</a>
+<li><a name="TOC41" href="#SEC41">AUTHOR</a>
+<li><a name="TOC42" href="#SEC42">REVISION</a>
</ul>
<P>
<b>#include &#60;pcre2.h&#62;</b>
<br>
<br>
-PCRE2 is a new API for PCRE. This document contains a description of all its
-functions. See the
+PCRE2 is a new API for PCRE, starting at release 10.0. This document contains a
+description of all its native functions. See the
<a href="pcre2.html"><b>pcre2</b></a>
document for an overview of all the PCRE2 documentation.
</P>
@@ -144,6 +145,10 @@ document for an overview of all the PCRE2 documentation.
<b> const unsigned char *<i>tables</i>);</b>
<br>
<br>
+<b>int pcre2_set_compile_extra_options(pcre2_compile_context *<i>ccontext</i>,</b>
+<b> uint32_t <i>extra_options</i>);</b>
+<br>
+<br>
<b>int pcre2_set_max_pattern_length(pcre2_compile_context *<i>ccontext</i>,</b>
<b> PCRE2_SIZE <i>value</i>);</b>
<br>
@@ -177,22 +182,20 @@ document for an overview of all the PCRE2 documentation.
<b> void *<i>callout_data</i>);</b>
<br>
<br>
-<b>int pcre2_set_match_limit(pcre2_match_context *<i>mcontext</i>,</b>
-<b> uint32_t <i>value</i>);</b>
-<br>
-<br>
<b>int pcre2_set_offset_limit(pcre2_match_context *<i>mcontext</i>,</b>
<b> PCRE2_SIZE <i>value</i>);</b>
<br>
<br>
-<b>int pcre2_set_recursion_limit(pcre2_match_context *<i>mcontext</i>,</b>
+<b>int pcre2_set_heap_limit(pcre2_match_context *<i>mcontext</i>,</b>
<b> uint32_t <i>value</i>);</b>
<br>
<br>
-<b>int pcre2_set_recursion_memory_management(</b>
-<b> pcre2_match_context *<i>mcontext</i>,</b>
-<b> void *(*<i>private_malloc</i>)(PCRE2_SIZE, void *),</b>
-<b> void (*<i>private_free</i>)(void *, void *), void *<i>memory_data</i>);</b>
+<b>int pcre2_set_match_limit(pcre2_match_context *<i>mcontext</i>,</b>
+<b> uint32_t <i>value</i>);</b>
+<br>
+<br>
+<b>int pcre2_set_depth_limit(pcre2_match_context *<i>mcontext</i>,</b>
+<b> uint32_t <i>value</i>);</b>
</P>
<br><a name="SEC6" href="#TOC1">PCRE2 NATIVE API STRING EXTRACTION FUNCTIONS</a><br>
<P>
@@ -294,6 +297,9 @@ document for an overview of all the PCRE2 documentation.
<b>pcre2_code *pcre2_code_copy(const pcre2_code *<i>code</i>);</b>
<br>
<br>
+<b>pcre2_code *pcre2_code_copy_with_tables(const pcre2_code *<i>code</i>);</b>
+<br>
+<br>
<b>int pcre2_get_error_message(int <i>errorcode</i>, PCRE2_UCHAR *<i>buffer</i>,</b>
<b> PCRE2_SIZE <i>bufflen</i>);</b>
<br>
@@ -311,7 +317,60 @@ document for an overview of all the PCRE2 documentation.
<br>
<b>int pcre2_config(uint32_t <i>what</i>, void *<i>where</i>);</b>
</P>
-<br><a name="SEC11" href="#TOC1">PCRE2 8-BIT, 16-BIT, AND 32-BIT LIBRARIES</a><br>
+<br><a name="SEC11" href="#TOC1">PCRE2 NATIVE API OBSOLETE FUNCTIONS</a><br>
+<P>
+<b>int pcre2_set_recursion_limit(pcre2_match_context *<i>mcontext</i>,</b>
+<b> uint32_t <i>value</i>);</b>
+<br>
+<br>
+<b>int pcre2_set_recursion_memory_management(</b>
+<b> pcre2_match_context *<i>mcontext</i>,</b>
+<b> void *(*<i>private_malloc</i>)(PCRE2_SIZE, void *),</b>
+<b> void (*<i>private_free</i>)(void *, void *), void *<i>memory_data</i>);</b>
+<br>
+<br>
+These functions became obsolete at release 10.30 and are retained only for
+backward compatibility. They should not be used in new code. The first is
+replaced by <b>pcre2_set_depth_limit()</b>; the second is no longer needed and
+has no effect (it always returns zero).
+</P>
+<br><a name="SEC12" href="#TOC1">PCRE2 EXPERIMENTAL PATTERN CONVERSION FUNCTIONS</a><br>
+<P>
+<b>pcre2_convert_context *pcre2_convert_context_create(</b>
+<b> pcre2_general_context *<i>gcontext</i>);</b>
+<br>
+<br>
+<b>pcre2_convert_context *pcre2_convert_context_copy(</b>
+<b> pcre2_convert_context *<i>cvcontext</i>);</b>
+<br>
+<br>
+<b>void pcre2_convert_context_free(pcre2_convert_context *<i>cvcontext</i>);</b>
+<br>
+<br>
+<b>int pcre2_set_glob_escape(pcre2_convert_context *<i>cvcontext</i>,</b>
+<b> uint32_t <i>escape_char</i>);</b>
+<br>
+<br>
+<b>int pcre2_set_glob_separator(pcre2_convert_context *<i>cvcontext</i>,</b>
+<b> uint32_t <i>separator_char</i>);</b>
+<br>
+<br>
+<b>int pcre2_pattern_convert(PCRE2_SPTR <i>pattern</i>, PCRE2_SIZE <i>length</i>,</b>
+<b> uint32_t <i>options</i>, PCRE2_UCHAR **<i>buffer</i>,</b>
+<b> PCRE2_SIZE *<i>blength</i>, pcre2_convert_context *<i>cvcontext</i>);</b>
+<br>
+<br>
+<b>void pcre2_converted_pattern_free(PCRE2_UCHAR *<i>converted_pattern</i>);</b>
+<br>
+<br>
+These functions provide a way of converting non-PCRE2 patterns into
+patterns that can be processed by <b>pcre2_compile()</b>. This facility is
+experimental and may be changed in future releases. At present, "globs" and
+POSIX basic and extended patterns can be converted. Details are given in the
+<a href="pcre2convert.html"><b>pcre2convert</b></a>
+documentation.
+</P>
+<br><a name="SEC13" href="#TOC1">PCRE2 8-BIT, 16-BIT, AND 32-BIT LIBRARIES</a><br>
<P>
There are three PCRE2 libraries, supporting 8-bit, 16-bit, and 32-bit code
units, respectively. However, there is just one header file, <b>pcre2.h</b>.
@@ -365,28 +424,28 @@ When using multiple libraries in an application, you must take care when
processing any particular pattern to use only functions from a single library.
For example, if you want to run a match using a pattern that was compiled with
<b>pcre2_compile_16()</b>, you must do so with <b>pcre2_match_16()</b>, not
-<b>pcre2_match_8()</b>.
+<b>pcre2_match_8()</b> or <b>pcre2_match_32()</b>.
</P>
<P>
In the function summaries above, and in the rest of this document and other
PCRE2 documents, functions and data types are described using their generic
-names, without the 8, 16, or 32 suffix.
+names, without the _8, _16, or _32 suffix.
</P>
-<br><a name="SEC12" href="#TOC1">PCRE2 API OVERVIEW</a><br>
+<br><a name="SEC14" href="#TOC1">PCRE2 API OVERVIEW</a><br>
<P>
PCRE2 has its own native API, which is described in this document. There are
also some wrapper functions for the 8-bit library that correspond to the
POSIX regular expression API, but they do not give access to all the
-functionality. They are described in the
+functionality of PCRE2. They are described in the
<a href="pcre2posix.html"><b>pcre2posix</b></a>
documentation. Both these APIs define a set of C function calls.
</P>
<P>
The native API C data types, function prototypes, option values, and error
-codes are defined in the header file <b>pcre2.h</b>, which contains definitions
-of PCRE2_MAJOR and PCRE2_MINOR, the major and minor release numbers for the
-library. Applications can use these to include support for different releases
-of PCRE2.
+codes are defined in the header file <b>pcre2.h</b>, which also contains
+definitions of PCRE2_MAJOR and PCRE2_MINOR, the major and minor release numbers
+for the library. Applications can use these to include support for different
+releases of PCRE2.
</P>
<P>
In a Windows environment, if you want to statically link an application program
@@ -394,7 +453,7 @@ against a non-dll PCRE2 library, you must define PCRE2_STATIC before including
<b>pcre2.h</b>.
</P>
<P>
-The functions <b>pcre2_compile()</b>, and <b>pcre2_match()</b> are used for
+The functions <b>pcre2_compile()</b> and <b>pcre2_match()</b> are used for
compiling and matching regular expressions in a Perl-compatible manner. A
sample program that demonstrates the simplest way of using them is provided in
the file called <i>pcre2demo.c</i> in the PCRE2 source distribution. A listing
@@ -405,10 +464,17 @@ documentation, and the
documentation describes how to compile and run it.
</P>
<P>
-Just-in-time compiler support is an optional feature of PCRE2 that can be built
-in appropriate hardware environments. It greatly speeds up the matching
+The compiling and matching functions recognize various options that are passed
+as bits in an options argument. There are also some more complicated parameters
+such as custom memory management functions and resource limits that are passed
+in "contexts" (which are just memory blocks, described below). Simple
+applications do not need to make use of contexts.
+</P>
+<P>
+Just-in-time (JIT) compiler support is an optional feature of PCRE2 that can be
+built in appropriate hardware environments. It greatly speeds up the matching
performance of many patterns. Programs can request that it be used if
-available, by calling <b>pcre2_jit_compile()</b> after a pattern has been
+available by calling <b>pcre2_jit_compile()</b> after a pattern has been
successfully compiled by <b>pcre2_compile()</b>. This does nothing if JIT
support is not available.
</P>
@@ -420,8 +486,8 @@ More complicated programs might need to make use of the specialist functions
<P>
JIT matching is automatically used by <b>pcre2_match()</b> if it is available,
unless the PCRE2_NO_JIT option is set. There is also a direct interface for JIT
-matching, which gives improved performance. The JIT-specific functions are
-discussed in the
+matching, which gives improved performance at the expense of less sanity
+checking. The JIT-specific functions are discussed in the
<a href="pcre2jit.html"><b>pcre2jit</b></a>
documentation.
</P>
@@ -430,7 +496,7 @@ A second matching function, <b>pcre2_dfa_match()</b>, which is not
Perl-compatible, is also provided. This uses a different algorithm for the
matching. The alternative algorithm finds all possible matches (at a given
point in the subject), and scans the subject just once (unless there are
-lookbehind assertions). However, this algorithm does not return captured
+lookaround assertions). However, this algorithm does not return captured
substrings. A description of the two matching algorithms and their advantages
and disadvantages is given in the
<a href="pcre2matching.html"><b>pcre2matching</b></a>
@@ -452,7 +518,7 @@ been matched by <b>pcre2_match()</b>. They are:
<b>pcre2_substring_number_from_name()</b>
</pre>
<b>pcre2_substring_free()</b> and <b>pcre2_substring_list_free()</b> are also
-provided, to free the memory used for extracted strings.
+provided, to free memory used for extracted strings.
</P>
<P>
The function <b>pcre2_substitute()</b> can be called to match a pattern and
@@ -473,7 +539,7 @@ Functions with names ending with <b>_free()</b> are used for freeing memory
blocks of various sorts. In all cases, if one of these functions is called with
a NULL argument, it does nothing.
</P>
-<br><a name="SEC13" href="#TOC1">STRING LENGTHS AND OFFSETS</a><br>
+<br><a name="SEC15" href="#TOC1">STRING LENGTHS AND OFFSETS</a><br>
<P>
The PCRE2 API uses string lengths and offsets into strings of code units in
several places. These values are always of type PCRE2_SIZE, which is an
@@ -483,7 +549,7 @@ as a special indicator for zero-terminated strings and unset offsets.
Therefore, the longest string that can be handled is one less than this
maximum.
<a name="newlines"></a></P>
-<br><a name="SEC14" href="#TOC1">NEWLINES</a><br>
+<br><a name="SEC16" href="#TOC1">NEWLINES</a><br>
<P>
PCRE2 supports five different conventions for indicating line breaks in
strings: a single CR (carriage return) character, a single LF (linefeed)
@@ -518,7 +584,7 @@ The choice of newline convention does not affect the interpretation of
the \n or \r escape sequences, nor does it affect what \R matches; this has
its own separate convention.
</P>
-<br><a name="SEC15" href="#TOC1">MULTITHREADING</a><br>
+<br><a name="SEC17" href="#TOC1">MULTITHREADING</a><br>
<P>
In a multithreaded application it is important to keep thread-specific data
separate from data that can be shared between threads. The PCRE2 library code
@@ -540,8 +606,8 @@ and does not change when the pattern is matched. Therefore, it is thread-safe,
that is, the same compiled pattern can be used by more than one thread
simultaneously. For example, an application can compile all its patterns at the
start, before forking off multiple threads that use them. However, if the
-just-in-time optimization feature is being used, it needs separate memory stack
-areas for each thread. See the
+just-in-time (JIT) optimization feature is being used, it needs separate memory
+stack areas for each thread. See the
<a href="pcre2jit.html"><b>pcre2jit</b></a>
documentation for more details.
</P>
@@ -567,8 +633,9 @@ If JIT is being used, but the JIT compilation is not being done immediately,
(perhaps waiting to see if the pattern is used often enough) similar logic is
required. JIT compilation updates a pointer within the compiled code block, so
a thread must gain unique write access to the pointer before calling
-<b>pcre2_jit_compile()</b>. Alternatively, <b>pcre2_code_copy()</b> can be used
-to obtain a private copy of the compiled code.
+<b>pcre2_jit_compile()</b>. Alternatively, <b>pcre2_code_copy()</b> or
+<b>pcre2_code_copy_with_tables()</b> can be used to obtain a private copy of the
+compiled code before calling the JIT compiler.
</P>
<br><b>
Context blocks
@@ -592,12 +659,12 @@ thread-specific copy.
Match blocks
</b><br>
<P>
-The matching functions need a block of memory for working space and for storing
-the results of a match. This includes details of what was matched, as well as
-additional information such as the name of a (*MARK) setting. Each thread must
-provide its own copy of this memory.
+The matching functions need a block of memory for storing the results of a
+match. This includes details of what was matched, as well as additional
+information such as the name of a (*MARK) setting. Each thread must provide its
+own copy of this memory.
</P>
-<br><a name="SEC16" href="#TOC1">PCRE2 CONTEXTS</a><br>
+<br><a name="SEC18" href="#TOC1">PCRE2 CONTEXTS</a><br>
<P>
Some PCRE2 functions have a lot of parameters, many of which are used only by
specialist applications, for example, those that use custom memory management
@@ -622,6 +689,8 @@ library. The context is named `general' rather than specifically `memory'
because in future other fields may be added. If you do not want to supply your
own custom memory management functions, you do not need to bother with a
general context. A general context is created by:
+<br>
+<br>
<b>pcre2_general_context *pcre2_general_context_create(</b>
<b> void *(*<i>private_malloc</i>)(PCRE2_SIZE, void *),</b>
<b> void (*<i>private_free</i>)(void *, void *), void *<i>memory_data</i>);</b>
@@ -648,26 +717,31 @@ used. When the time comes to free the block, this function is called.
</P>
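<P>
As an illustration of the two memory management callbacks, here is a minimal
sketch of a pair that counts allocations. The names alloc_stats,
counting_malloc, and counting_free are hypothetical, and the sketch assumes that
PCRE2_SIZE is the same type as size_t, so that it can be compiled without
<b>pcre2.h</b>.
</P>

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical bookkeeping structure, passed as the memory_data argument. */
typedef struct {
    size_t blocks_in_use;    /* blocks obtained and not yet freed */
    size_t total_requested;  /* sum of all requested sizes, in bytes */
} alloc_stats;

/* Same shape as private_malloc: (size, memory_data) -> block or NULL. */
void *counting_malloc(size_t size, void *memory_data)
{
    alloc_stats *stats = memory_data;
    void *block;

    assert(stats != NULL);
    block = malloc(size);
    if (block != NULL) {
        stats->blocks_in_use++;
        stats->total_requested += size;
    }
    return block;
}

/* Same shape as private_free: (block, memory_data). */
void counting_free(void *block, void *memory_data)
{
    alloc_stats *stats = memory_data;

    assert(stats != NULL);
    if (block != NULL) {
        stats->blocks_in_use--;
        free(block);
    }
}
```

<P>
With these definitions, a call such as
<b>pcre2_general_context_create(counting_malloc, counting_free, &amp;stats)</b>
would route the library's allocations through the counters, where <i>stats</i>
is an alloc_stats variable that outlives the context.
</P>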
<P>
A general context can be copied by calling:
+<br>
+<br>
<b>pcre2_general_context *pcre2_general_context_copy(</b>
<b> pcre2_general_context *<i>gcontext</i>);</b>
<br>
<br>
The memory used for a general context should be freed by calling:
+<br>
+<br>
<b>void pcre2_general_context_free(pcre2_general_context *<i>gcontext</i>);</b>
<a name="compilecontext"></a></P>
<br><b>
The compile context
</b><br>
<P>
-A compile context is required if you want to change the default values of any
-of the following compile-time parameters:
+A compile context is required if you want to provide an external function for
+stack checking during compilation or to change the default values of any of the
+following compile-time parameters:
<pre>
What \R matches (Unicode newlines or CR, LF, CRLF only)
PCRE2's character tables
The newline character sequence
The compile time nested parentheses limit
The maximum length of the pattern string
- An external function for stack checking
+ The extra options bits (none set by default)
</pre>
A compile context is also required if you are using custom memory management.
If none of these apply, just pass NULL as the context argument of
@@ -675,6 +749,8 @@ If none of these apply, just pass NULL as the context argument of
</P>
<P>
A compile context is created, copied, and freed by the following functions:
+<br>
+<br>
<b>pcre2_compile_context *pcre2_compile_context_create(</b>
<b> pcre2_general_context *<i>gcontext</i>);</b>
<br>
@@ -689,6 +765,8 @@ A compile context is created, copied, and freed by the following functions:
A compile context is created with default values for its parameters. These can
be changed by calling the following functions, which return 0 on success, or
PCRE2_ERROR_BADDATA if invalid data is detected.
+<br>
+<br>
<b>int pcre2_set_bsr(pcre2_compile_context *<i>ccontext</i>,</b>
<b> uint32_t <i>value</i>);</b>
<br>
@@ -698,6 +776,8 @@ or CRLF, or PCRE2_BSR_UNICODE, to specify that \R matches any Unicode line
ending sequence. The value is used by the JIT compiler and by the two
interpreted matching functions, <i>pcre2_match()</i> and
<i>pcre2_dfa_match()</i>.
+<br>
+<br>
<b>int pcre2_set_character_tables(pcre2_compile_context *<i>ccontext</i>,</b>
<b> const unsigned char *<i>tables</i>);</b>
<br>
@@ -705,15 +785,33 @@ interpreted matching functions, <i>pcre2_match()</i> and
The value must be the result of a call to <i>pcre2_maketables()</i>, whose only
argument is a general context. This function builds a set of character tables
in the current locale.
+<br>
+<br>
+<b>int pcre2_set_compile_extra_options(pcre2_compile_context *<i>ccontext</i>,</b>
+<b> uint32_t <i>extra_options</i>);</b>
+<br>
+<br>
+As PCRE2 has developed, almost all the 32 option bits that are available in
+the <i>options</i> argument of <b>pcre2_compile()</b> have been used up. To avoid
+running out, the compile context contains a set of extra option bits which are
+used for some newer, assumed rarer, options. This function sets those bits. It
+always sets all the bits (either on or off), that is, the new value replaces any
+existing setting rather than being combined with it. The available options are
+defined in the section entitled "Extra
+compile options"
+<a href="#extracompileoptions">below.</a>
+<br>
+<br>
<b>int pcre2_set_max_pattern_length(pcre2_compile_context *<i>ccontext</i>,</b>
<b> PCRE2_SIZE <i>value</i>);</b>
<br>
<br>
-This sets a maximum length, in code units, for the pattern string that is to be
-compiled. If the pattern is longer, an error is generated. This facility is
-provided so that applications that accept patterns from external sources can
-limit their size. The default is the largest number that a PCRE2_SIZE variable
-can hold, which is effectively unlimited.
+This sets a maximum length, in code units, for any pattern string that is
+compiled with this context. If the pattern is longer, an error is generated.
+This facility is provided so that applications that accept patterns from
+external sources can limit their size. The default is the largest number that a
+PCRE2_SIZE variable can hold, which is effectively unlimited.
+<br>
+<br>
<b>int pcre2_set_newline(pcre2_compile_context *<i>ccontext</i>,</b>
<b> uint32_t <i>value</i>);</b>
<br>
@@ -721,22 +819,34 @@ can hold, which is effectively unlimited.
This specifies which characters or character sequences are to be recognized as
newlines. The value must be one of PCRE2_NEWLINE_CR (carriage return only),
PCRE2_NEWLINE_LF (linefeed only), PCRE2_NEWLINE_CRLF (the two-character
-sequence CR followed by LF), PCRE2_NEWLINE_ANYCRLF (any of the above), or
-PCRE2_NEWLINE_ANY (any Unicode newline sequence).
+sequence CR followed by LF), PCRE2_NEWLINE_ANYCRLF (any of the above),
+PCRE2_NEWLINE_ANY (any Unicode newline sequence), or PCRE2_NEWLINE_NUL (the
+NUL character, that is a binary zero).
+</P>
+<P>
+A pattern can override the value set in the compile context by starting with a
+sequence such as (*CRLF). See the
+<a href="pcre2pattern.html"><b>pcre2pattern</b></a>
+page for details.
</P>
<P>
-When a pattern is compiled with the PCRE2_EXTENDED option, the value of this
-parameter affects the recognition of white space and the end of internal
-comments starting with #. The value is saved with the compiled pattern for
-subsequent use by the JIT compiler and by the two interpreted matching
-functions, <i>pcre2_match()</i> and <i>pcre2_dfa_match()</i>.
+When a pattern is compiled with the PCRE2_EXTENDED or PCRE2_EXTENDED_MORE
+option, the newline convention affects the recognition of white space and the
+end of internal comments starting with #. The value is saved with the compiled
+pattern for subsequent use by the JIT compiler and by the two interpreted
+matching functions, <i>pcre2_match()</i> and <i>pcre2_dfa_match()</i>.
+<br>
+<br>
<b>int pcre2_set_parens_nest_limit(pcre2_compile_context *<i>ccontext</i>,</b>
<b> uint32_t <i>value</i>);</b>
<br>
<br>
This parameter adjusts the limit, set when PCRE2 is built (default 250), on the
depth of parenthesis nesting in a pattern. This limit stops rogue patterns
-using up too much system stack when being compiled.
+using up too much system stack when being compiled. The limit applies to
+parentheses of all kinds, not just capturing parentheses.
+<br>
+<br>
<b>int pcre2_set_compile_recursion_guard(pcre2_compile_context *<i>ccontext</i>,</b>
<b> int (*<i>guard_function</i>)(uint32_t, void *), void *<i>user_data</i>);</b>
<br>
@@ -744,10 +854,10 @@ using up too much system stack when being compiled.
There is at least one application that runs PCRE2 in threads with very limited
system stack, where running out of stack is to be avoided at all costs. The
parenthesis limit above cannot take account of how much stack is actually
-available. For a finer control, you can supply a function that is called
-whenever <b>pcre2_compile()</b> starts to compile a parenthesized part of a
-pattern. This function can check the actual stack size (or anything else that
-it wants to, of course).
+available during compilation. For finer control, you can supply a function
+that is called whenever <b>pcre2_compile()</b> starts to compile a parenthesized
+part of a pattern. This function can check the actual stack size (or anything
+else that it wants to, of course).
</P>
<P>
The first argument to the guard function gives the current depth of
@@ -759,20 +869,22 @@ zero if all is well, or non-zero to force an error.
The match context
</b><br>
<P>
-A match context is required if you want to change the default values of any
-of the following match-time parameters:
+A match context is required if you want to:
<pre>
- A callout function
- The offset limit for matching an unanchored pattern
- The limit for calling <b>match()</b> (see below)
- The limit for calling <b>match()</b> recursively
+ Set up a callout function
+ Set an offset limit for matching an unanchored pattern
+ Change the limit on the amount of heap used when matching
+ Change the backtracking match limit
+ Change the backtracking depth limit
+ Set custom memory management specifically for the match
</pre>
-A match context is also required if you are using custom memory management.
If none of these apply, just pass NULL as the context argument of
<b>pcre2_match()</b>, <b>pcre2_dfa_match()</b>, or <b>pcre2_jit_match()</b>.
</P>
<P>
A match context is created, copied, and freed by the following functions:
+<br>
+<br>
<b>pcre2_match_context *pcre2_match_context_create(</b>
<b> pcre2_general_context *<i>gcontext</i>);</b>
<br>
@@ -787,15 +899,19 @@ A match context is created, copied, and freed by the following functions:
A match context is created with default values for its parameters. These can
be changed by calling the following functions, which return 0 on success, or
PCRE2_ERROR_BADDATA if invalid data is detected.
+<br>
+<br>
<b>int pcre2_set_callout(pcre2_match_context *<i>mcontext</i>,</b>
<b> int (*<i>callout_function</i>)(pcre2_callout_block *, void *),</b>
<b> void *<i>callout_data</i>);</b>
<br>
<br>
-This sets up a "callout" function, which PCRE2 will call at specified points
+This sets up a "callout" function for PCRE2 to call at specified points
during a matching operation. Details are given in the
<a href="pcre2callout.html"><b>pcre2callout</b></a>
documentation.
+<br>
+<br>
<b>int pcre2_set_offset_limit(pcre2_match_context *<i>mcontext</i>,</b>
<b> PCRE2_SIZE <i>value</i>);</b>
<br>
@@ -804,41 +920,83 @@ The <i>offset_limit</i> parameter limits how far an unanchored search can
advance in the subject string. The default value is PCRE2_UNSET. The
<b>pcre2_match()</b> and <b>pcre2_dfa_match()</b> functions return
PCRE2_ERROR_NOMATCH if a match with a starting point before or at the given
-offset is not found. For example, if the pattern /abc/ is matched against
-"123abc" with an offset limit less than 3, the result is PCRE2_ERROR_NO_MATCH.
-A match can never be found if the <i>startoffset</i> argument of
-<b>pcre2_match()</b> or <b>pcre2_dfa_match()</b> is greater than the offset
-limit.
+offset is not found. The <b>pcre2_substitute()</b> function makes no more
+substitutions.
+</P>
+<P>
+For example, if the pattern /abc/ is matched against "123abc" with an offset
+limit less than 3, the result is PCRE2_ERROR_NOMATCH. A match can never be
+found if the <i>startoffset</i> argument of <b>pcre2_match()</b>,
+<b>pcre2_dfa_match()</b>, or <b>pcre2_substitute()</b> is greater than the offset
+limit set in the match context.
</P>
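<P>
The "before or at the given offset" rule can be modelled without PCRE2 at all.
The sketch below is only a model: it searches for a literal string with
<b>strstr()</b> rather than matching a pattern, and the function name
find_within_offset_limit is hypothetical.
</P>

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Model only: finds a literal needle, not a regular expression. Returns the
 * offset at which the first occurrence starts, or -1 if there is none or if
 * it starts beyond offset_limit. */
long find_within_offset_limit(const char *subject, const char *needle,
                              size_t offset_limit)
{
    const char *hit;

    assert(subject != NULL && needle != NULL);
    hit = strstr(subject, needle);
    if (hit == NULL || (size_t)(hit - subject) > offset_limit)
        return -1;
    return (long)(hit - subject);
}
```

<P>
In this model, searching "123abc" for "abc" fails with a limit of 2 and
succeeds with a limit of 3, because the only occurrence starts at offset 3.
</P>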
<P>
-When using this facility, you must set PCRE2_USE_OFFSET_LIMIT when calling
-<b>pcre2_compile()</b> so that when JIT is in use, different code can be
+When using this facility, you must set the PCRE2_USE_OFFSET_LIMIT option when
+calling <b>pcre2_compile()</b> so that when JIT is in use, different code can be
compiled. If a match is started with a non-default offset limit when
PCRE2_USE_OFFSET_LIMIT is not set, an error is generated.
</P>
<P>
The offset limit facility can be used to track progress when searching large
-subject strings. See also the PCRE2_FIRSTLINE option, which requires a match to
-start within the first line of the subject. If this is set with an offset
-limit, a match must occur in the first line and also within the offset limit.
-In other words, whichever limit comes first is used.
+subject strings or to limit the extent of global substitutions. See also the
+PCRE2_FIRSTLINE option, which requires a match to start before or at the first
+newline that follows the start of matching in the subject. If this is set with
+an offset limit, a match must occur in the first line and also within the
+offset limit. In other words, whichever limit comes first is used.
+<br>
+<br>
+<b>int pcre2_set_heap_limit(pcre2_match_context *<i>mcontext</i>,</b>
+<b> uint32_t <i>value</i>);</b>
+<br>
+<br>
+The <i>heap_limit</i> parameter specifies, in units of kilobytes, the maximum
+amount of heap memory that <b>pcre2_match()</b> may use to hold backtracking
+information when running an interpretive match. This limit does not apply to
+matching with the JIT optimization, which has its own memory control
+arrangements (see the
+<a href="pcre2jit.html"><b>pcre2jit</b></a>
+documentation for more details), nor does it apply to <b>pcre2_dfa_match()</b>.
+If the limit is reached, the negative error code PCRE2_ERROR_HEAPLIMIT is
+returned. The default limit is set when PCRE2 is built; the default default is
+very large and is essentially "unlimited".
+</P>
+<P>
+A value for the heap limit may also be supplied by an item at the start of a
+pattern of the form
+<pre>
+ (*LIMIT_HEAP=ddd)
+</pre>
+where ddd is a decimal number. However, such a setting is ignored unless ddd is
+less than the limit set by the caller of <b>pcre2_match()</b> or, if no such
+limit is set, less than the default.
+</P>
+<P>
+The <b>pcre2_match()</b> function starts out using a 20K vector on the system
+stack for recording backtracking points. The more nested backtracking points
+there are (that is, the deeper the search tree), the more memory is needed.
+Heap memory is used only if the initial vector is too small. If the heap limit
+is set to a value less than 21 (in particular, zero) no heap memory will be
+used. In this case, only patterns that do not have a lot of nested backtracking
+can be successfully processed.
+<br>
+<br>
<b>int pcre2_set_match_limit(pcre2_match_context *<i>mcontext</i>,</b>
<b> uint32_t <i>value</i>);</b>
<br>
<br>
The <i>match_limit</i> parameter provides a means of preventing PCRE2 from using
-up too many resources when processing patterns that are not going to match, but
-which have a very large number of possibilities in their search trees. The
-classic example is a pattern that uses nested unlimited repeats.
+up too many computing resources when processing patterns that are not going to
+match, but which have a very large number of possibilities in their search
+trees. The classic example is a pattern that uses nested unlimited repeats.
</P>
<P>
-Internally, <b>pcre2_match()</b> uses a function called <b>match()</b>, which it
-calls repeatedly (sometimes recursively). The limit set by <i>match_limit</i> is
-imposed on the number of times this function is called during a match, which
-has the effect of limiting the amount of backtracking that can take place. For
+There is an internal counter in <b>pcre2_match()</b> that is incremented each
+time round its main matching loop. If this value reaches the match limit,
+<b>pcre2_match()</b> returns the negative value PCRE2_ERROR_MATCHLIMIT. This has
+the effect of limiting the amount of backtracking that can take place. For
patterns that are not anchored, the count restarts from zero for each position
-in the subject string. This limit is not relevant to <b>pcre2_dfa_match()</b>,
-which ignores it.
+in the subject string. This limit also applies to <b>pcre2_dfa_match()</b>,
+though the counting is done in a different way.
</P>
<P>
When <b>pcre2_match()</b> is called with a pattern that was successfully
@@ -850,72 +1008,53 @@ matching can continue.
</P>
<P>
The default value for the limit can be set when PCRE2 is built; the default
-default is 10 million, which handles all but the most extreme cases. If the
-limit is exceeded, <b>pcre2_match()</b> returns PCRE2_ERROR_MATCHLIMIT. A value
+default is 10 million, which handles all but the most extreme cases. A value
for the match limit may also be supplied by an item at the start of a pattern
of the form
<pre>
(*LIMIT_MATCH=ddd)
</pre>
where ddd is a decimal number. However, such a setting is ignored unless ddd is
-less than the limit set by the caller of <b>pcre2_match()</b> or, if no such
-limit is set, less than the default.
-<b>int pcre2_set_recursion_limit(pcre2_match_context *<i>mcontext</i>,</b>
+less than the limit set by the caller of <b>pcre2_match()</b> or
+<b>pcre2_dfa_match()</b> or, if no such limit is set, less than the default.
+<br>
+<br>
+<b>int pcre2_set_depth_limit(pcre2_match_context *<i>mcontext</i>,</b>
<b> uint32_t <i>value</i>);</b>
<br>
<br>
-The <i>recursion_limit</i> parameter is similar to <i>match_limit</i>, but
-instead of limiting the total number of times that <b>match()</b> is called, it
-limits the depth of recursion. The recursion depth is a smaller number than the
-total number of calls, because not all calls to <b>match()</b> are recursive.
-This limit is of use only if it is set smaller than <i>match_limit</i>.
+This parameter limits the depth of nested backtracking in <b>pcre2_match()</b>.
+Each time a nested backtracking point is passed, a new memory "frame" is used
+to remember the state of matching at that point. Thus, this parameter
+indirectly limits the amount of memory that is used in a match. However,
+because the size of each memory "frame" depends on the number of capturing
+parentheses, the actual memory limit varies from pattern to pattern. This limit
+was more useful in versions before 10.30, where function recursion was used for
+backtracking.
</P>
<P>
-Limiting the recursion depth limits the amount of system stack that can be
-used, or, when PCRE2 has been compiled to use memory on the heap instead of the
-stack, the amount of heap memory that can be used. This limit is not relevant,
-and is ignored, when matching is done using JIT compiled code or by the
-<b>pcre2_dfa_match()</b> function.
+The depth limit is not relevant, and is ignored, when matching is done using
+JIT compiled code. However, it is supported by <b>pcre2_dfa_match()</b>, which
+uses it to limit the depth of internal recursive function calls that implement
+atomic groups, lookaround assertions, and pattern recursions. This is,
+therefore, an indirect limit on the amount of system stack that is used. A
+recursive pattern such as /(.)(?1)/, when matched to a very long string using
+<b>pcre2_dfa_match()</b>, can use a great deal of stack.
</P>
<P>
-The default value for <i>recursion_limit</i> can be set when PCRE2 is built; the
-default default is the same value as the default for <i>match_limit</i>. If the
-limit is exceeded, <b>pcre2_match()</b> returns PCRE2_ERROR_RECURSIONLIMIT. A
-value for the recursion limit may also be supplied by an item at the start of a
-pattern of the form
+The default value for the depth limit can be set when PCRE2 is built; the
+default default is the same value as the default for the match limit. If the
+limit is exceeded, <b>pcre2_match()</b> or <b>pcre2_dfa_match()</b> returns
+PCRE2_ERROR_DEPTHLIMIT. A value for the depth limit may also be supplied by an
+item at the start of a pattern of the form
<pre>
- (*LIMIT_RECURSION=ddd)
+ (*LIMIT_DEPTH=ddd)
</pre>
where ddd is a decimal number. However, such a setting is ignored unless ddd is
-less than the limit set by the caller of <b>pcre2_match()</b> or, if no such
-limit is set, less than the default.
-<b>int pcre2_set_recursion_memory_management(</b>
-<b> pcre2_match_context *<i>mcontext</i>,</b>
-<b> void *(*<i>private_malloc</i>)(PCRE2_SIZE, void *),</b>
-<b> void (*<i>private_free</i>)(void *, void *), void *<i>memory_data</i>);</b>
-<br>
-<br>
-This function sets up two additional custom memory management functions for use
-by <b>pcre2_match()</b> when PCRE2 is compiled to use the heap for remembering
-backtracking data, instead of recursive function calls that use the system
-stack. There is a discussion about PCRE2's stack usage in the
-<a href="pcre2stack.html"><b>pcre2stack</b></a>
-documentation. See the
-<a href="pcre2build.html"><b>pcre2build</b></a>
-documentation for details of how to build PCRE2.
-</P>
-<P>
-Using the heap for recursion is a non-standard way of building PCRE2, for use
-in environments that have limited stacks. Because of the greater use of memory
-management, <b>pcre2_match()</b> runs more slowly. Functions that are different
-to the general custom memory functions are provided so that special-purpose
-external code can be used for this case, because the memory blocks are all the
-same size. The blocks are retained by <b>pcre2_match()</b> until it is about to
-exit so that they can be re-used when possible during the match. In the absence
-of these functions, the normal custom memory management functions are used, if
-supplied, otherwise the system functions.
+less than the limit set by the caller of <b>pcre2_match()</b> or
+<b>pcre2_dfa_match()</b> or, if no such limit is set, less than the default.
</P>
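<P>
The rule for limits given at the start of a pattern (a setting is honoured only
if it lowers the limit already in force) amounts to taking a minimum. The
following sketch models that rule; the function name effective_limit is
hypothetical and the code is not taken from the PCRE2 sources.
</P>

```c
#include <assert.h>
#include <stdint.h>

/* limit_in_force is the value set by the caller in the match context or, if
 * the caller set none, the build-time default. pattern_limit is the ddd from
 * an item such as (*LIMIT_DEPTH=ddd). The pattern can lower the limit but
 * can never raise it. */
uint32_t effective_limit(uint32_t limit_in_force, uint32_t pattern_limit)
{
    return pattern_limit < limit_in_force ? pattern_limit : limit_in_force;
}
```

<P>
The same reasoning applies to the heap, match, and depth limits, since each of
them is described with the same wording.
</P>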
-<br><a name="SEC17" href="#TOC1">CHECKING BUILD-TIME OPTIONS</a><br>
+<br><a name="SEC19" href="#TOC1">CHECKING BUILD-TIME OPTIONS</a><br>
<P>
<b>int pcre2_config(uint32_t <i>what</i>, void *<i>where</i>);</b>
</P>
@@ -948,6 +1087,25 @@ PCRE2_BSR_UNICODE means that \R matches any Unicode line ending sequence; a
value of PCRE2_BSR_ANYCRLF means that \R matches only CR, LF, or CRLF. The
default can be overridden when a pattern is compiled.
<pre>
+ PCRE2_CONFIG_COMPILED_WIDTHS
+</pre>
+The output is a uint32_t integer whose lower bits indicate which code unit
+widths were selected when PCRE2 was built. The 1-bit indicates 8-bit support,
+and the 2-bit and 4-bit indicate 16-bit and 32-bit support, respectively.
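The bit layout described above can be decoded with a small helper. This is a
minimal sketch; the function name width_supported is hypothetical, and the
value passed to it is assumed to have been obtained by calling
<b>pcre2_config()</b> with PCRE2_CONFIG_COMPILED_WIDTHS.

```c
#include <assert.h>
#include <stdint.h>

/* Returns nonzero if the bit mask "widths" indicates support for code units
 * of "bits" bits, where bits is 8, 16, or 32. Dividing by 8 maps those
 * widths to the mask bits 1, 2, and 4 respectively. */
int width_supported(uint32_t widths, int bits)
{
    assert(bits == 8 || bits == 16 || bits == 32);
    return (widths & (uint32_t)(bits / 8)) != 0;
}
```

A library built with all three widths reports the value 7, and one built with
only the 8-bit library reports 1.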
+<pre>
+ PCRE2_CONFIG_DEPTHLIMIT
+</pre>
+The output is a uint32_t integer that gives the default limit for the depth of
+nested backtracking in <b>pcre2_match()</b> or the depth of nested recursions
+and lookarounds in <b>pcre2_dfa_match()</b>. Further details are given with
+<b>pcre2_set_depth_limit()</b> above.
+<pre>
+ PCRE2_CONFIG_HEAPLIMIT
+</pre>
+The output is a uint32_t integer that gives, in kilobytes, the default limit
+for the amount of heap memory used by <b>pcre2_match()</b>. Further details are
+given with <b>pcre2_set_heap_limit()</b> above.
+<pre>
PCRE2_CONFIG_JIT
</pre>
The output is a uint32_t integer that is set to one if support for just-in-time
@@ -982,9 +1140,9 @@ be compiled by those two libraries, but at the expense of slower matching.
<pre>
PCRE2_CONFIG_MATCHLIMIT
</pre>
-The output is a uint32_t integer that gives the default limit for the number of
-internal matching function calls in a <b>pcre2_match()</b> execution. Further
-details are given with <b>pcre2_match()</b> below.
+The output is a uint32_t integer that gives the default match limit for
+<b>pcre2_match()</b>. Further details are given with
+<b>pcre2_set_match_limit()</b> above.
<pre>
PCRE2_CONFIG_NEWLINE
</pre>
@@ -996,10 +1154,16 @@ sequence that is recognized as meaning "newline". The values are:
PCRE2_NEWLINE_CRLF Carriage return, linefeed (CRLF)
PCRE2_NEWLINE_ANY Any Unicode line ending
PCRE2_NEWLINE_ANYCRLF Any of CR, LF, or CRLF
+ PCRE2_NEWLINE_NUL The NUL character (binary zero)
</pre>
The default should normally correspond to the standard sequence for your
operating system.
<pre>
+ PCRE2_CONFIG_NEVER_BACKSLASH_C
+</pre>
+The output is a uint32_t integer that is set to one if the use of \C was
+permanently disabled when PCRE2 was built; otherwise it is set to zero.
+<pre>
PCRE2_CONFIG_PARENSLIMIT
</pre>
The output is a uint32_t integer that gives the maximum depth of nesting
@@ -1009,19 +1173,10 @@ PCRE2 is built; the default is 250. This limit does not take into account the
stack that may already be used by the calling application. For finer control
over compilation stack usage, see <b>pcre2_set_compile_recursion_guard()</b>.
<pre>
- PCRE2_CONFIG_RECURSIONLIMIT
-</pre>
-The output is a uint32_t integer that gives the default limit for the depth of
-recursion when calling the internal matching function in a <b>pcre2_match()</b>
-execution. Further details are given with <b>pcre2_match()</b> below.
-<pre>
PCRE2_CONFIG_STACKRECURSE
</pre>
-The output is a uint32_t integer that is set to one if internal recursion when
-running <b>pcre2_match()</b> is implemented by recursive function calls that use
-the system stack to remember their state. This is the usual way that PCRE2 is
-compiled. The output is zero if PCRE2 was compiled to use blocks of data on the
-heap instead of recursive function calls.
+This parameter is obsolete and should not be used in new code. The output is a
+uint32_t integer that is always set to zero.
<pre>
PCRE2_CONFIG_UNICODE_VERSION
</pre>
@@ -1040,14 +1195,14 @@ available; otherwise it is set to zero. Unicode support implies UTF support.
<pre>
PCRE2_CONFIG_VERSION
</pre>
-The <i>where</i> argument should point to a buffer that is at least 12 code
+The <i>where</i> argument should point to a buffer that is at least 24 code
units long. (The exact length required can be found by calling
<b>pcre2_config()</b> with <b>where</b> set to NULL.) The buffer is filled with
the PCRE2 version string, zero-terminated. The number of code units used is
returned. This is the length of the string plus one unit for the terminating
zero.
<a name="compiling"></a></P>
-<br><a name="SEC18" href="#TOC1">COMPILING A PATTERN</a><br>
+<br><a name="SEC20" href="#TOC1">COMPILING A PATTERN</a><br>
<P>
<b>pcre2_code *pcre2_compile(PCRE2_SPTR <i>pattern</i>, PCRE2_SIZE <i>length</i>,</b>
<b> uint32_t <i>options</i>, int *<i>errorcode</i>, PCRE2_SIZE *<i>erroroffset,</i></b>
@@ -1058,11 +1213,14 @@ zero.
<br>
<br>
<b>pcre2_code *pcre2_code_copy(const pcre2_code *<i>code</i>);</b>
+<br>
+<br>
+<b>pcre2_code *pcre2_code_copy_with_tables(const pcre2_code *<i>code</i>);</b>
</P>
<P>
The <b>pcre2_compile()</b> function compiles a pattern into an internal form.
-The pattern is defined by a pointer to a string of code units and a length. If
-the pattern is zero-terminated, the length can be specified as
+The pattern is defined by a pointer to a string of code units and a length (in
+code units). If the pattern is zero-terminated, the length can be specified as
PCRE2_ZERO_TERMINATED. The function returns a pointer to a block of memory that
contains the compiled pattern and related data, or NULL if an error occurred.
</P>
@@ -1079,9 +1237,22 @@ if the code has been processed by the JIT compiler (see
<a href="#jitcompiling">below),</a>
the JIT information cannot be copied (because it is position-dependent).
The new copy can initially be used only for non-JIT matching, though it can be
-passed to <b>pcre2_jit_compile()</b> if required. The <b>pcre2_code_copy()</b>
-function provides a way for individual threads in a multithreaded application
-to acquire a private copy of shared compiled code.
+passed to <b>pcre2_jit_compile()</b> if required.
+</P>
+<P>
+The <b>pcre2_code_copy()</b> function provides a way for individual threads in a
+multithreaded application to acquire a private copy of shared compiled code.
+However, it does not make a copy of the character tables used by the compiled
+pattern; the new pattern code points to the same tables as the original code.
+(See
+<a href="#localesupport">"Locale Support"</a>
+below for details of these character tables.) In many applications the same
+tables are used throughout, so this behaviour is appropriate. Nevertheless,
+there are occasions when copies of a compiled pattern and the relevant tables
+are needed. The <b>pcre2_code_copy_with_tables()</b> function provides this facility.
+Copies of both the code and the tables are made, with the new code pointing to
+the new tables. The memory for the new tables is automatically freed when
+<b>pcre2_code_free()</b> is called for the new copy of the compiled code.
</P>
<P>
NOTE: When one of the matching functions is called, pointers to the compiled
@@ -1105,8 +1276,8 @@ documentation).
<P>
For those options that can be different in different parts of the pattern, the
contents of the <i>options</i> argument specifies their settings at the start of
-compilation. The PCRE2_ANCHORED and PCRE2_NO_UTF_CHECK options can be set at
-the time of matching as well as at compile time.
+compilation. The PCRE2_ANCHORED, PCRE2_ENDANCHORED, and PCRE2_NO_UTF_CHECK
+options can be set at the time of matching as well as at compile time.
</P>
<P>
Other, less frequently required compile-time parameters (for example, the
@@ -1122,13 +1293,26 @@ error has occurred. The values are not defined when compilation is successful
and <b>pcre2_compile()</b> returns a non-NULL value.
</P>
<P>
-The <b>pcre2_get_error_message()</b> function (see "Obtaining a textual error
+There are nearly 100 positive error codes that <b>pcre2_compile()</b> may return
+if it finds an error in the pattern. There are also some negative error codes
+that are used for invalid UTF strings. These are the same as given by
+<b>pcre2_match()</b> and <b>pcre2_dfa_match()</b>, and are described in the
+<a href="pcre2unicode.html"><b>pcre2unicode</b></a>
+page. There is no separate documentation for the positive error codes, because
+the textual error messages that are obtained by calling the
+<b>pcre2_get_error_message()</b> function (see "Obtaining a textual error
message"
<a href="#geterrormessage">below)</a>
-provides a textual message for each error code. Compilation errors have
-positive error codes; UTF formatting error codes are negative. For an invalid
-UTF-8 or UTF-16 string, the offset is that of the first code unit of the
-failing character.
+should be self-explanatory. Macro names starting with PCRE2_ERROR_ are defined
+for both positive and negative error codes in <b>pcre2.h</b>.
+</P>
+<P>
+The value returned in <i>erroroffset</i> is an indication of where in the
+pattern the error occurred. It is not necessarily the furthest point in the
+pattern that was read. For example, after the error "lookbehind assertion is
+not fixed length", the error offset points to the start of the failing
+assertion. For an invalid UTF-8 or UTF-16 string, the offset is that of the
+first code unit of the failing character.
</P>
<P>
Some errors are not detected until the whole pattern has been scanned; in these
@@ -1209,13 +1393,15 @@ include a closing parenthesis in the name. However, if the PCRE2_ALT_VERBNAMES
option is set, normal backslash processing is applied to verb names and only an
unescaped closing parenthesis terminates the name. A closing parenthesis can be
included in a name either as \) or between \Q and \E. If the PCRE2_EXTENDED
-option is set, unescaped whitespace in verb names is skipped and #-comments are
-recognized, exactly as in the rest of the pattern.
+or PCRE2_EXTENDED_MORE option is set, unescaped whitespace in verb names is
+skipped and #-comments are recognized in this mode, exactly as in the rest of
+the pattern.
<pre>
PCRE2_AUTO_CALLOUT
</pre>
If this bit is set, <b>pcre2_compile()</b> automatically inserts callout items,
-all with number 255, before each pattern item. For discussion of the callout
+all with number 255, before each pattern item, except immediately before or
+after an explicit callout in the pattern. For discussion of the callout
facility, see the
<a href="pcre2callout.html"><b>pcre2callout</b></a>
documentation.
@@ -1224,7 +1410,13 @@ documentation.
</pre>
If this bit is set, letters in the pattern match both upper and lower case
letters in the subject. It is equivalent to Perl's /i option, and it can be
-changed within a pattern by a (?i) option setting.
+changed within a pattern by a (?i) option setting. If PCRE2_UTF is set, Unicode
+properties are used for all characters with more than one other case, and for
+all characters whose code points are greater than U+007f. For lower valued
+characters with only one other case, a lookup table is used for speed. When
+PCRE2_UTF is not set, a lookup table is used for all code points less than 256,
+and higher code points (available only in 16-bit or 32-bit mode) are treated as
+not having another case.
<pre>
PCRE2_DOLLAR_ENDONLY
</pre>
@@ -1254,6 +1446,30 @@ details of named subpatterns below; see also the
<a href="pcre2pattern.html"><b>pcre2pattern</b></a>
documentation.
<pre>
+ PCRE2_ENDANCHORED
+</pre>
+If this bit is set, the end of any pattern match must be right at the end of
+the string being searched (the "subject string"). If the pattern match
+succeeds by reaching (*ACCEPT), but does not reach the end of the subject, the
+match fails at the current starting point. For unanchored patterns, a new match
+is then tried at the next starting point. However, if the match succeeds by
+reaching the end of the pattern, but not the end of the subject, backtracking
+occurs and an alternative match may be found. Consider these two patterns:
+<pre>
+ .(*ACCEPT)|..
+ .|..
+</pre>
+If matched against "abc" with PCRE2_ENDANCHORED set, the first matches "c"
+whereas the second matches "bc". The effect of PCRE2_ENDANCHORED can also be
+achieved by appropriate constructs in the pattern itself, which is the only way
+to do it in Perl.
+</P>
+<P>
+For DFA matching with <b>pcre2_dfa_match()</b>, PCRE2_ENDANCHORED applies only
+to the first (that is, the longest) matched string. Other parallel matches,
+which are necessarily substrings of the first one, must obviously end before
+the end of the subject.
+<pre>
PCRE2_EXTENDED
</pre>
If this bit is set, most white space characters in the pattern are totally
@@ -1280,14 +1496,39 @@ sequence at the start of the pattern, as described in the section entitled
in the <b>pcre2pattern</b> documentation. A default is defined when PCRE2 is
built.
<pre>
+ PCRE2_EXTENDED_MORE
+</pre>
+This option has the effect of PCRE2_EXTENDED, but, in addition, unescaped space
+and horizontal tab characters are ignored inside a character class.
+PCRE2_EXTENDED_MORE is equivalent to Perl 5.26's /xx option, and it can be
+changed within a pattern by a (?xx) option setting.
+<pre>
PCRE2_FIRSTLINE
</pre>
-If this option is set, an unanchored pattern is required to match before or at
-the first newline in the subject string, though the matched text may continue
-over the newline. See also PCRE2_USE_OFFSET_LIMIT, which provides a more
-general limiting facility. If PCRE2_FIRSTLINE is set with an offset limit, a
-match must occur in the first line and also within the offset limit. In other
-words, whichever limit comes first is used.
+If this option is set, the start of an unanchored pattern match must be before
+or at the first newline in the subject string following the start of matching,
+though the matched text may continue over the newline. If <i>startoffset</i> is
+non-zero, the limiting newline is not necessarily the first newline in the
+subject. For example, if the subject string is "abc\nxyz" (where \n
+represents a single-character newline) a pattern match for "yz" succeeds with
+PCRE2_FIRSTLINE if <i>startoffset</i> is greater than 3. See also
+PCRE2_USE_OFFSET_LIMIT, which provides a more general limiting facility. If
+PCRE2_FIRSTLINE is set with an offset limit, a match must occur in the first
+line and also within the offset limit. In other words, whichever limit comes
+first is used.
+<pre>
+ PCRE2_LITERAL
+</pre>
+If this option is set, all meta-characters in the pattern are disabled, and it
+is treated as a literal string. Matching literal strings with a regular
+expression engine is not the most efficient way of doing it. If you are doing a
+lot of literal matching and are worried about efficiency, you should consider
+using other approaches. The only other main options that are allowed with
+PCRE2_LITERAL are: PCRE2_ANCHORED, PCRE2_ENDANCHORED, PCRE2_AUTO_CALLOUT,
+PCRE2_CASELESS, PCRE2_FIRSTLINE, PCRE2_NO_START_OPTIMIZE, PCRE2_NO_UTF_CHECK,
+PCRE2_UTF, and PCRE2_USE_OFFSET_LIMIT. The extra options PCRE2_EXTRA_MATCH_LINE
+and PCRE2_EXTRA_MATCH_WORD are also supported. Any other options cause an
+error.
<pre>
PCRE2_MATCH_UNSET_BACKREF
</pre>
@@ -1352,8 +1593,8 @@ PCRE2_NEVER_UTF causes an error.
If this option is set, it disables the use of numbered capturing parentheses in
the pattern. Any opening parenthesis that is not followed by ? behaves as if it
were followed by ?: but named parentheses can still be used for capturing (and
-they acquire numbers in the usual way). There is no equivalent of this option
-in Perl. Note that, if this option is set, references to capturing groups (back
+they acquire numbers in the usual way). This is the same as Perl's /n option.
+Note that, when this option is set, references to capturing groups (back
references or recursion/subroutine calls) may only refer to named groups,
though the reference can be by name or by number.
<pre>
@@ -1389,8 +1630,8 @@ compiler.
<P>
There are a number of optimizations that may occur at the start of a match, in
order to speed up the process. For example, if it is known that an unanchored
-match must start with a specific character, the matching code searches the
-subject for that character, and fails immediately if it cannot find it, without
+match must start with a specific code unit value, the matching code searches
+the subject for that value, and fails immediately if it cannot find it, without
actually running the main matching function. This means that a special item
such as (*COMMIT) at the start of a pattern is not considered until after a
suitable starting point for the match has been found. Also, when callouts or
@@ -1419,9 +1660,11 @@ current starting position, which in this case, it does. However, if the same
match is run with PCRE2_NO_START_OPTIMIZE set, the initial scan along the
subject string does not happen. The first match attempt is run starting from
"D" and when this fails, (*COMMIT) prevents any further matches being tried, so
-the overall result is "no match". There are also other start-up optimizations.
-For example, a minimum length for the subject may be recorded. Consider the
-pattern
+the overall result is "no match".
+</P>
+<P>
+There are also other start-up optimizations. For example, a minimum length for
+the subject may be recorded. Consider the pattern
<pre>
(*MARK:A)(X|Y)
</pre>
@@ -1442,17 +1685,30 @@ and
<a href="pcre2unicode.html#utf32strings">UTF-32 strings</a>
in the
<a href="pcre2unicode.html"><b>pcre2unicode</b></a>
-document.
-If an invalid UTF sequence is found, <b>pcre2_compile()</b> returns a negative
-error code.
+document. If an invalid UTF sequence is found, <b>pcre2_compile()</b> returns a
+negative error code.
</P>
<P>
-If you know that your pattern is valid, and you want to skip this check for
-performance reasons, you can set the PCRE2_NO_UTF_CHECK option. When it is set,
-the effect of passing an invalid UTF string as a pattern is undefined. It may
-cause your program to crash or loop. Note that this option can also be passed
-to <b>pcre2_match()</b> and <b>pcre_dfa_match()</b>, to suppress validity
-checking of the subject string.
+If you know that your pattern is a valid UTF string, and you want to skip this
+check for performance reasons, you can set the PCRE2_NO_UTF_CHECK option. When
+it is set, the effect of passing an invalid UTF string as a pattern is
+undefined. It may cause your program to crash or loop.
+</P>
+<P>
+Note that this option can also be passed to <b>pcre2_match()</b> and
+<b>pcre2_dfa_match()</b>, to suppress UTF validity checking of the subject
+string.
+</P>
+<P>
+Note also that setting PCRE2_NO_UTF_CHECK at compile time does not disable the
+error that is given if an escape sequence for an invalid Unicode code point is
+encountered in the pattern. In particular, the so-called "surrogate" code
+points (0xd800 to 0xdfff) are invalid. If you want to allow escape sequences
+such as \x{d800} you can set the PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES extra
+option, as described in the section entitled "Extra compile options"
+<a href="#extracompileoptions">below.</a>
+However, this is possible only in UTF-8 and UTF-32 modes, because these values
+are not representable in UTF-16.
<pre>
PCRE2_UCP
</pre>
@@ -1465,7 +1721,7 @@ in the
<a href="pcre2pattern.html"><b>pcre2pattern</b></a>
page. If you set PCRE2_UCP, matching one of the items it affects takes much
longer. The option is available only if PCRE2 has been compiled with Unicode
-support.
+support (which is the default).
<pre>
PCRE2_UNGREEDY
</pre>
@@ -1490,25 +1746,80 @@ This option causes PCRE2 to regard both the pattern and the subject strings
that are subsequently processed as strings of UTF characters instead of
single-code-unit strings. It is available when PCRE2 is built to include
Unicode support (which is the default). If Unicode support is not available,
-the use of this option provokes an error. Details of how this option changes
-the behaviour of PCRE2 are given in the
+the use of this option provokes an error. Details of how PCRE2_UTF changes the
+behaviour of PCRE2 are given in the
<a href="pcre2unicode.html"><b>pcre2unicode</b></a>
page.
-</P>
-<br><a name="SEC19" href="#TOC1">COMPILATION ERROR CODES</a><br>
+<a name="extracompileoptions"></a></P>
+<br><b>
+Extra compile options
+</b><br>
<P>
-There are over 80 positive error codes that <b>pcre2_compile()</b> may return
-(via <i>errorcode</i>) if it finds an error in the pattern. There are also some
-negative error codes that are used for invalid UTF strings. These are the same
-as given by <b>pcre2_match()</b> and <b>pcre2_dfa_match()</b>, and are described
-in the
-<a href="pcre2unicode.html"><b>pcre2unicode</b></a>
-page. The <b>pcre2_get_error_message()</b> function (see "Obtaining a textual
-error message"
-<a href="#geterrormessage">below)</a>
-can be called to obtain a textual error message from any error code.
+Unlike the main compile-time options, the extra options are not saved with the
+compiled pattern. The option bits that can be set in a compile context by
+calling the <b>pcre2_set_compile_extra_options()</b> function are as follows:
+<pre>
+ PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES
+</pre>
+This option applies when compiling a pattern in UTF-8 or UTF-32 mode. It is
+forbidden in UTF-16 mode, and ignored in non-UTF modes. Unicode "surrogate"
+code points in the range 0xd800 to 0xdfff are used in pairs in UTF-16 to encode
+code points with values in the range 0x10000 to 0x10ffff. The surrogates cannot
+therefore be represented in UTF-16. They can be represented in UTF-8 and
+UTF-32, but are defined as invalid code points, and cause errors if encountered
+in a UTF-8 or UTF-32 string that is being checked for validity by PCRE2.
+</P>
+<P>
+These values also cause errors if encountered in escape sequences such as
+\x{d912} within a pattern. However, it seems that some applications, when
+using PCRE2 to check for unwanted characters in UTF-8 strings, explicitly test
+for the surrogates using escape sequences. The PCRE2_NO_UTF_CHECK option does
+not disable the error that occurs, because it applies only to the testing of
+input strings for UTF validity.
+</P>
+<P>
+If the extra option PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES is set, surrogate code
+point values in UTF-8 and UTF-32 patterns no longer provoke errors and are
+incorporated in the compiled pattern. However, they can only match subject
+characters if the matching function is called with PCRE2_NO_UTF_CHECK set.
+<pre>
+ PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL
+</pre>
+This is a dangerous option. Use with care. By default, an unrecognized escape
+such as \j or a malformed one such as \x{2z} causes a compile-time error when
+detected by <b>pcre2_compile()</b>. Perl is somewhat inconsistent in handling
+such items: for example, \j is treated as a literal "j", and non-hexadecimal
+digits in \x{} are just ignored, though warnings are given in both cases if
+Perl's warning switch is enabled. However, a malformed octal number after \o{
+always causes an error in Perl.
+</P>
+<P>
+If the PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL extra option is passed to
+<b>pcre2_compile()</b>, all unrecognized or erroneous escape sequences are
+treated as single-character escapes. For example, \j is a literal "j" and
+\x{2z} is treated as the literal string "x{2z}". Setting this option means
+that typos in patterns may go undetected and have unexpected results. This is a
+dangerous option. Use with care.
+<pre>
+ PCRE2_EXTRA_MATCH_LINE
+</pre>
+This option is provided for use by the <b>-x</b> option of <b>pcre2grep</b>. It
+causes the pattern only to match complete lines. This is achieved by
+automatically inserting the code for "^(?:" at the start of the compiled
+pattern and ")$" at the end. Thus, when PCRE2_MULTILINE is set, the matched
+line may be in the middle of the subject string. This option can be used with
+PCRE2_LITERAL.
+<pre>
+ PCRE2_EXTRA_MATCH_WORD
+</pre>
+This option is provided for use by the <b>-w</b> option of <b>pcre2grep</b>. It
+causes the pattern only to match strings that have a word boundary at the start
+and the end. This is achieved by automatically inserting the code for "\b(?:"
+at the start of the compiled pattern and ")\b" at the end. The option may be
+used with PCRE2_LITERAL. However, it is ignored if PCRE2_EXTRA_MATCH_LINE is
+also set.
<a name="jitcompiling"></a></P>
-<br><a name="SEC20" href="#TOC1">JUST-IN-TIME (JIT) COMPILATION</a><br>
+<br><a name="SEC21" href="#TOC1">JUST-IN-TIME (JIT) COMPILATION</a><br>
<P>
<b>int pcre2_jit_compile(pcre2_code *<i>code</i>, uint32_t <i>options</i>);</b>
<br>
@@ -1544,18 +1855,18 @@ documentation.
JIT compilation is a heavyweight optimization. It can take some time for
patterns to be analyzed, and for one-off matches and simple patterns the
benefit of faster execution might be offset by a much slower compilation time.
-Most, but not all patterns can be optimized by the JIT compiler.
+Most (but not all) patterns can be optimized by the JIT compiler.
<a name="localesupport"></a></P>
-<br><a name="SEC21" href="#TOC1">LOCALE SUPPORT</a><br>
+<br><a name="SEC22" href="#TOC1">LOCALE SUPPORT</a><br>
<P>
PCRE2 handles caseless matching, and determines whether characters are letters,
digits, or whatever, by reference to a set of tables, indexed by character code
point. This applies only to characters whose code points are less than 256. By
default, higher-valued code points never match escapes such as \w or \d.
-However, if PCRE2 is built with UTF support, all characters can be tested with
-\p and \P, or, alternatively, the PCRE2_UCP option can be set when a pattern
-is compiled; this causes \w and friends to use Unicode property support
-instead of the built-in tables.
+However, if PCRE2 is built with Unicode support, all characters can be tested
+with \p and \P, or, alternatively, the PCRE2_UCP option can be set when a
+pattern is compiled; this causes \w and friends to use Unicode property
+support instead of the built-in tables.
</P>
<P>
The use of locales with Unicode is discouraged. If you are handling characters
@@ -1599,10 +1910,10 @@ available for as long as it is needed.
The pointer that is passed (via the compile context) to <b>pcre2_compile()</b>
is saved with the compiled pattern, and the same tables are used by
<b>pcre2_match()</b> and <b>pcre2_dfa_match()</b>. Thus, for any single pattern,
-compilation, and matching all happen in the same locale, but different patterns
+compilation and matching both happen in the same locale, but different patterns
can be processed in different locales.
<a name="infoaboutpattern"></a></P>
-<br><a name="SEC22" href="#TOC1">INFORMATION ABOUT A COMPILED PATTERN</a><br>
+<br><a name="SEC23" href="#TOC1">INFORMATION ABOUT A COMPILED PATTERN</a><br>
<P>
<b>int pcre2_pattern_info(const pcre2_code *<i>code</i>, uint32_t <i>what</i>, void *<i>where</i>);</b>
</P>
@@ -1615,7 +1926,7 @@ pattern. The second argument specifies which piece of information is required,
and the third argument is a pointer to a variable to receive the data. If the
third argument is NULL, the first argument is ignored, and the function returns
the size in bytes of the variable that is required for the information
-requested. Otherwise, The yield of the function is zero for success, or one of
+requested. Otherwise, the yield of the function is zero for success, or one of
the following negative numbers:
<pre>
PCRE2_ERROR_NULL the argument <i>code</i> was NULL
@@ -1639,12 +1950,15 @@ are as follows:
<pre>
PCRE2_INFO_ALLOPTIONS
PCRE2_INFO_ARGOPTIONS
+ PCRE2_INFO_EXTRAOPTIONS
</pre>
-Return a copy of the pattern's options. The third argument should point to a
+Return copies of the pattern's options. The third argument should point to a
<b>uint32_t</b> variable. PCRE2_INFO_ARGOPTIONS returns exactly the options that
were passed to <b>pcre2_compile()</b>, whereas PCRE2_INFO_ALLOPTIONS returns
the compile options as modified by any top-level (*XXX) option settings such as
-(*UTF) at the start of the pattern itself.
+(*UTF) at the start of the pattern itself. PCRE2_INFO_EXTRAOPTIONS returns the
+extra options that were set in the compile context by calling the
+<b>pcre2_set_compile_extra_options()</b> function.
</P>
<P>
For example, if the pattern /(*UTF)abc/ is compiled with the PCRE2_EXTENDED
@@ -1668,8 +1982,8 @@ following are true:
.* is not in an atomic group
.* is not in a capturing group that is the subject of a back reference
PCRE2_DOTALL is in force for .*
- Neither (*PRUNE) nor (*SKIP) appears in the pattern.
- PCRE2_NO_DOTSTAR_ANCHOR is not set.
+ Neither (*PRUNE) nor (*SKIP) appears in the pattern
+ PCRE2_NO_DOTSTAR_ANCHOR is not set
</pre>
For patterns that are auto-anchored, the PCRE2_ANCHORED bit is set in the
options returned for PCRE2_INFO_ALLOPTIONS.
@@ -1697,6 +2011,15 @@ Return the highest capturing subpattern number in the pattern. In patterns
where (?| is not used, this is also the total number of capturing subpatterns.
The third argument should point to a <b>uint32_t</b> variable.
<pre>
+ PCRE2_INFO_DEPTHLIMIT
+</pre>
+If the pattern set a backtracking depth limit by including an item of the form
+(*LIMIT_DEPTH=nnnn) at the start, the value is returned. The third argument
+should point to an unsigned 32-bit integer. If no such value has been set, the
+call to <b>pcre2_pattern_info()</b> returns the error PCRE2_ERROR_UNSET. Note
+that this limit will only be used during matching if it is less than the limit
+set or defaulted by the caller of the match function.
+<pre>
PCRE2_INFO_FIRSTBITMAP
</pre>
In the absence of a single first code unit for a non-anchored pattern,
@@ -1713,21 +2036,29 @@ returned. Otherwise NULL is returned. The third argument should point to an
Return information about the first code unit of any matched string, for a
non-anchored pattern. The third argument should point to a <b>uint32_t</b>
variable. If there is a fixed first value, for example, the letter "c" from a
-pattern such as (cat|cow|coyote), 1 is returned, and the character value can be
-retrieved using PCRE2_INFO_FIRSTCODEUNIT. If there is no fixed first value, but
-it is known that a match can occur only at the start of the subject or
-following a newline in the subject, 2 is returned. Otherwise, and for anchored
-patterns, 0 is returned.
+pattern such as (cat|cow|coyote), 1 is returned, and the value can be retrieved
+using PCRE2_INFO_FIRSTCODEUNIT. If there is no fixed first value, but it is
+known that a match can occur only at the start of the subject or following a
+newline in the subject, 2 is returned. Otherwise, and for anchored patterns, 0
+is returned.
<pre>
PCRE2_INFO_FIRSTCODEUNIT
</pre>
-Return the value of the first code unit of any matched string in the situation
+Return the value of the first code unit of any matched string for a pattern
where PCRE2_INFO_FIRSTCODETYPE returns 1; otherwise return 0. The third
argument should point to a <b>uint32_t</b> variable. In the 8-bit library, the
value is always less than 256. In the 16-bit library the value can be up to
0xffff. In the 32-bit library in UTF-32 mode the value can be up to 0x10ffff,
and up to 0xffffffff when not using UTF-32 mode.
<pre>
+ PCRE2_INFO_FRAMESIZE
+</pre>
+Return the size (in bytes) of the data frames that are used to remember
+backtracking positions when the pattern is processed by <b>pcre2_match()</b>
+without the use of JIT. The third argument should point to a <b>size_t</b>
+variable. The frame size depends on the number of capturing parentheses in the
+pattern. Each additional capturing group adds two PCRE2_SIZE variables.
+<pre>
PCRE2_INFO_HASBACKSLASHC
</pre>
Return 1 if the pattern contains any instances of \C, otherwise 0. The third
@@ -1737,7 +2068,17 @@ argument should point to an <b>uint32_t</b> variable.
</pre>
Return 1 if the pattern contains any explicit matches for CR or LF characters,
otherwise 0. The third argument should point to a <b>uint32_t</b> variable. An
-explicit match is either a literal CR or LF character, or \r or \n.
+explicit match is either a literal CR or LF character, or \r or \n or one of
+the equivalent hexadecimal or octal escape sequences.
+<pre>
+ PCRE2_INFO_HEAPLIMIT
+</pre>
+If the pattern set a heap memory limit by including an item of the form
+(*LIMIT_HEAP=nnnn) at the start, the value is returned. The third argument
+should point to an unsigned 32-bit integer. If no such value has been set, the
+call to <b>pcre2_pattern_info()</b> returns the error PCRE2_ERROR_UNSET. Note
+that this limit will only be used during matching if it is less than the limit
+set or defaulted by the caller of the match function.
<pre>
PCRE2_INFO_JCHANGED
</pre>
@@ -1764,10 +2105,10 @@ PCRE2_INFO_LASTCODEUNIT), but for /^a\dz\d/ the returned value is 0.
<pre>
PCRE2_INFO_LASTCODEUNIT
</pre>
-Return the value of the rightmost literal data unit that must exist in any
-matched string, other than at its start, if such a value has been recorded. The
-third argument should point to an <b>uint32_t</b> variable. If there is no such
-value, 0 is returned.
+Return the value of the rightmost literal code unit that must exist in any
+matched string, other than at its start, for a pattern where
+PCRE2_INFO_LASTCODETYPE returns 1. Otherwise, return 0. The third argument
+should point to a <b>uint32_t</b> variable.
<pre>
PCRE2_INFO_MATCHEMPTY
</pre>
@@ -1782,7 +2123,9 @@ in such cases.
If the pattern set a match limit by including an item of the form
(*LIMIT_MATCH=nnnn) at the start, the value is returned. The third argument
should point to an unsigned 32-bit integer. If no such value has been set, the
-call to <b>pcre2_pattern_info()</b> returns the error PCRE2_ERROR_UNSET.
+call to <b>pcre2_pattern_info()</b> returns the error PCRE2_ERROR_UNSET. Note
+that this limit will only be used during matching if it is less than the limit
+set or defaulted by the caller of the match function.
<pre>
PCRE2_INFO_MAXLOOKBEHIND
</pre>
@@ -1794,7 +2137,8 @@ require a one-character lookbehind. \A also registers a one-character
lookbehind, though it does not actually inspect the previous character. This is
to ensure that at least one character from the old segment is retained when a
new segment is processed. Otherwise, if there are no lookbehinds in the
-pattern, \A might match incorrectly at the start of a new segment.
+pattern, \A might match incorrectly at the start of a second or subsequent
+segment.
<pre>
PCRE2_INFO_MINLENGTH
</pre>
@@ -1874,23 +2218,17 @@ different for each compiled pattern.
<pre>
PCRE2_INFO_NEWLINE
</pre>
-The output is a <b>uint32_t</b> with one of the following values:
+The output is one of the following <b>uint32_t</b> values:
<pre>
PCRE2_NEWLINE_CR Carriage return (CR)
PCRE2_NEWLINE_LF Linefeed (LF)
PCRE2_NEWLINE_CRLF Carriage return, linefeed (CRLF)
PCRE2_NEWLINE_ANY Any Unicode line ending
PCRE2_NEWLINE_ANYCRLF Any of CR, LF, or CRLF
+ PCRE2_NEWLINE_NUL The NUL character (binary zero)
</pre>
-This specifies the default character sequence that will be recognized as
-meaning "newline" while matching.
-<pre>
- PCRE2_INFO_RECURSIONLIMIT
-</pre>
-If the pattern set a recursion limit by including an item of the form
-(*LIMIT_RECURSION=nnnn) at the start, the value is returned. The third
-argument should point to an unsigned 32-bit integer. If no such value has been
-set, the call to <b>pcre2_pattern_info()</b> returns the error PCRE2_ERROR_UNSET.
+This identifies the character sequence that will be recognized as meaning
+"newline" while matching.
<pre>
PCRE2_INFO_SIZE
</pre>
@@ -1903,7 +2241,7 @@ value returned by this option, because there are cases where the code that
calculates the size has to over-estimate. Processing a pattern with the JIT
compiler does not alter the value returned by this option.
<a name="infoaboutcallouts"></a></P>
-<br><a name="SEC23" href="#TOC1">INFORMATION ABOUT A PATTERN'S CALLOUTS</a><br>
+<br><a name="SEC24" href="#TOC1">INFORMATION ABOUT A PATTERN'S CALLOUTS</a><br>
<P>
<b>int pcre2_callout_enumerate(const pcre2_code *<i>code</i>,</b>
<b> int (*<i>callback</i>)(pcre2_callout_enumerate_block *, void *),</b>
@@ -1922,7 +2260,7 @@ contents of the callout enumeration block are described in the
<a href="pcre2callout.html"><b>pcre2callout</b></a>
documentation, which also gives further details about callouts.
</P>
-<br><a name="SEC24" href="#TOC1">SERIALIZATION AND PRECOMPILING</a><br>
+<br><a name="SEC25" href="#TOC1">SERIALIZATION AND PRECOMPILING</a><br>
<P>
It is possible to save compiled patterns on disc or elsewhere, and reload them
later, subject to a number of restrictions. The functions whose names begin
@@ -1931,7 +2269,7 @@ the
<a href="pcre2serialize.html"><b>pcre2serialize</b></a>
documentation.
<a name="matchdatablock"></a></P>
-<br><a name="SEC25" href="#TOC1">THE MATCH DATA BLOCK</a><br>
+<br><a name="SEC26" href="#TOC1">THE MATCH DATA BLOCK</a><br>
<P>
<b>pcre2_match_data *pcre2_match_data_create(uint32_t <i>ovecsize</i>,</b>
<b> pcre2_general_context *<i>gcontext</i>);</b>
@@ -1948,7 +2286,7 @@ Information about a successful or unsuccessful match is placed in a match
data block, which is an opaque structure that is accessed by function calls. In
particular, the match data block contains a vector of offsets into the subject
string that define the matched part of the subject and any substrings that were
-captured. This is know as the <i>ovector</i>.
+captured. This is known as the <i>ovector</i>.
</P>
<P>
Before calling <b>pcre2_match()</b>, <b>pcre2_dfa_match()</b>, or
@@ -1956,9 +2294,9 @@ Before calling <b>pcre2_match()</b>, <b>pcre2_dfa_match()</b>, or
the creation functions above. For <b>pcre2_match_data_create()</b>, the first
argument is the number of pairs of offsets in the <i>ovector</i>. One pair of
offsets is required to identify the string that matched the whole pattern, with
-another pair for each captured substring. For example, a value of 4 creates
-enough space to record the matched portion of the subject plus three captured
-substrings. A minimum of at least 1 pair is imposed by
+an additional pair for each captured substring. For example, a value of 4
+creates enough space to record the matched portion of the subject plus three
+captured substrings. A minimum of 1 pair is imposed by
<b>pcre2_match_data_create()</b>, so it is always possible to return the overall
matched string.
</P>
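The sizing rule above can be sketched as a small helper. This is only an illustration: `ovector_pairs_needed` is a hypothetical name, not a PCRE2 function, and its result is the value one would pass as the first argument of pcre2_match_data_create().

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not part of PCRE2): number of ovector pairs needed
   to record the overall match plus `capture_count` captured substrings. */
uint32_t ovector_pairs_needed(uint32_t capture_count)
{
    assert(capture_count < UINT32_MAX);  /* guard against wrap-around on +1 */
    return capture_count + 1;            /* pair 0 holds the whole match */
}
```

For a pattern with three capturing groups this yields 4, matching the example in the text; with no groups it yields the minimum of 1.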
@@ -2002,7 +2340,7 @@ match data block (for that match) have taken place.
When a match data block itself is no longer needed, it should be freed by
calling <b>pcre2_match_data_free()</b>.
</P>
-<br><a name="SEC26" href="#TOC1">MATCHING A PATTERN: THE TRADITIONAL FUNCTION</a><br>
+<br><a name="SEC27" href="#TOC1">MATCHING A PATTERN: THE TRADITIONAL FUNCTION</a><br>
<P>
<b>int pcre2_match(const pcre2_code *<i>code</i>, PCRE2_SPTR <i>subject</i>,</b>
<b> PCRE2_SIZE <i>length</i>, PCRE2_SIZE <i>startoffset</i>,</b>
@@ -2033,7 +2371,7 @@ Here is an example of a simple call to <b>pcre2_match()</b>:
11, /* the length of the subject string */
0, /* start at offset 0 in the subject */
0, /* default options */
- match_data, /* the match data block */
+ md, /* the match data block */
NULL); /* a match context; NULL means use defaults */
</pre>
If the subject string is zero-terminated, the length can be given as
@@ -2096,25 +2434,27 @@ character is CR followed by LF, advance the starting offset by two characters
instead of one.
</P>
<P>
-If a non-zero starting offset is passed when the pattern is anchored, one
+If a non-zero starting offset is passed when the pattern is anchored, a single
attempt to match at the given offset is made. This can only succeed if the
-pattern does not require the match to be at the start of the subject.
+pattern does not require the match to be at the start of the subject. In other
+words, the anchoring must be the result of setting the PCRE2_ANCHORED option or
+the use of .* with PCRE2_DOTALL, not by starting the pattern with ^ or \A.
<a name="matchoptions"></a></P>
<br><b>
Option bits for <b>pcre2_match()</b>
</b><br>
<P>
The unused bits of the <i>options</i> argument for <b>pcre2_match()</b> must be
-zero. The only bits that may be set are PCRE2_ANCHORED, PCRE2_NOTBOL,
-PCRE2_NOTEOL, PCRE2_NOTEMPTY, PCRE2_NOTEMPTY_ATSTART, PCRE2_NO_JIT,
-PCRE2_NO_UTF_CHECK, PCRE2_PARTIAL_HARD, and PCRE2_PARTIAL_SOFT. Their action is
-described below.
+zero. The only bits that may be set are PCRE2_ANCHORED, PCRE2_ENDANCHORED,
+PCRE2_NOTBOL, PCRE2_NOTEOL, PCRE2_NOTEMPTY, PCRE2_NOTEMPTY_ATSTART,
+PCRE2_NO_JIT, PCRE2_NO_UTF_CHECK, PCRE2_PARTIAL_HARD, and PCRE2_PARTIAL_SOFT.
+Their action is described below.
</P>
<P>
-Setting PCRE2_ANCHORED at match time is not supported by the just-in-time (JIT)
-compiler. If it is set, JIT matching is disabled and the normal interpretive
-code in <b>pcre2_match()</b> is run. Apart from PCRE2_NO_JIT (obviously), the
-remaining options are supported for JIT matching.
+Setting PCRE2_ANCHORED or PCRE2_ENDANCHORED at match time is not supported by
+the just-in-time (JIT) compiler. If it is set, JIT matching is disabled and the
+interpretive code in <b>pcre2_match()</b> is run. Apart from PCRE2_NO_JIT
+(obviously), the remaining options are supported for JIT matching.
<pre>
PCRE2_ANCHORED
</pre>
@@ -2124,6 +2464,12 @@ to be anchored by virtue of its contents, it cannot be made unanchored at
matching time. Note that setting the option at match time disables JIT
matching.
<pre>
+ PCRE2_ENDANCHORED
+</pre>
+If the PCRE2_ENDANCHORED option is set, any string that <b>pcre2_match()</b>
+matches must be right at the end of the subject string. Note that setting the
+option at match time disables JIT matching.
+<pre>
PCRE2_NOTBOL
</pre>
This option specifies that the first character of the subject string is not the
@@ -2199,13 +2545,13 @@ page.
If you know that your subject is valid, and you want to skip these checks for
performance reasons, you can set the PCRE2_NO_UTF_CHECK option when calling
<b>pcre2_match()</b>. You might want to do this for the second and subsequent
-calls to <b>pcre2_match()</b> if you are making repeated calls to find all the
-matches in a single subject string.
+calls to <b>pcre2_match()</b> if you are making repeated calls to find other
+matches in the same subject string.
</P>
<P>
-NOTE: When PCRE2_NO_UTF_CHECK is set, the effect of passing an invalid string
-as a subject, or an invalid value of <i>startoffset</i>, is undefined. Your
-program may crash or loop indefinitely.
+WARNING: When PCRE2_NO_UTF_CHECK is set, the effect of passing an invalid
+string as a subject, or an invalid value of <i>startoffset</i>, is undefined.
+Your program may crash or loop indefinitely.
<pre>
PCRE2_PARTIAL_HARD
PCRE2_PARTIAL_SOFT
@@ -2232,7 +2578,7 @@ examples, in the
<a href="pcre2partial.html"><b>pcre2partial</b></a>
documentation.
</P>
-<br><a name="SEC27" href="#TOC1">NEWLINE HANDLING WHEN MATCHING</a><br>
+<br><a name="SEC28" href="#TOC1">NEWLINE HANDLING WHEN MATCHING</a><br>
<P>
When PCRE2 is built, a default newline convention is set; this is usually the
standard convention for the operating system. The default can be overridden in
@@ -2264,15 +2610,15 @@ reference, and so advances only by one character after the first failure.
</P>
<P>
An explicit match for CR or LF is either a literal appearance of one of those
-characters in the pattern, or one of the \r or \n escape sequences. Implicit
-matches such as [^X] do not count, nor does \s, even though it includes CR and
-LF in the characters that it matches.
+characters in the pattern, or one of the \r or \n or equivalent octal or
+hexadecimal escape sequences. Implicit matches such as [^X] do not count, nor
+does \s, even though it includes CR and LF in the characters that it matches.
</P>
<P>
Notwithstanding the above, anomalous effects may still occur when CRLF is a
valid newline sequence and explicit \r or \n escapes appear in the pattern.
<a name="matchedstrings"></a></P>
-<br><a name="SEC28" href="#TOC1">HOW PCRE2_MATCH() RETURNS A STRING AND CAPTURED SUBSTRINGS</a><br>
+<br><a name="SEC29" href="#TOC1">HOW PCRE2_MATCH() RETURNS A STRING AND CAPTURED SUBSTRINGS</a><br>
<P>
<b>uint32_t pcre2_get_ovector_count(pcre2_match_data *<i>match_data</i>);</b>
<br>
@@ -2322,12 +2668,12 @@ identify the part of the subject that was partially matched. See the
documentation for details of partial matching.
</P>
<P>
-After a successful match, the first pair of offsets identifies the portion of
-the subject string that was matched by the entire pattern. The next pair is
-used for the first capturing subpattern, and so on. The value returned by
+After a fully successful match, the first pair of offsets identifies the
+portion of the subject string that was matched by the entire pattern. The next
+pair is used for the first captured substring, and so on. The value returned by
<b>pcre2_match()</b> is one more than the highest numbered pair that has been
set. For example, if two substrings have been captured, the returned value is
-3. If there are no capturing subpatterns, the return value from a successful
+3. If there are no captured substrings, the return value from a successful
match is 1, indicating that just the first pair of offsets has been set.
</P>
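A minimal sketch of how the return value relates to the ovector, assuming the ovector is viewed as a plain array of size_t offsets (PCRE2_SIZE is a size_t). The `span` type and `ovector_pair` function are hypothetical names introduced for illustration; they are not PCRE2 calls.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical type: one matched span, [start, end) in code units. */
typedef struct { size_t start; size_t end; } span;

/* Return pair n from an ovector. rc is the positive value returned by a
   successful match, that is, one more than the highest numbered pair set.
   A pair below rc may still be unset if that group did not participate. */
span ovector_pair(const size_t *ovector, int rc, int n)
{
    assert(rc > 0 && n >= 0 && n < rc);
    span s = { ovector[2 * n], ovector[2 * n + 1] };
    return s;
}
```

With two captured substrings rc is 3, so pairs 0, 1 and 2 may be read; pair 0 is the whole match and its length is end minus start.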
<P>
@@ -2345,11 +2691,7 @@ returned.
If the ovector is too small to hold all the captured substring offsets, as much
as possible is filled in, and the function returns a value of zero. If captured
substrings are not of interest, <b>pcre2_match()</b> may be called with a match
-data block whose ovector is of minimum length (that is, one pair). However, if
-the pattern contains back references and the <i>ovector</i> is not big enough to
-remember the related substrings, PCRE2 has to get additional memory for use
-during matching. Thus it is usually advisable to set up a match data block
-containing an ovector of reasonable size.
+data block whose ovector is of minimum length (that is, one pair).
</P>
<P>
It is possible for capturing subpattern number <i>n+1</i> to match some part of
@@ -2375,7 +2717,7 @@ parentheses, no more than <i>ovector[0]</i> to <i>ovector[2n+1]</i> are set by
<b>pcre2_match()</b>. The other elements retain whatever values they previously
had.
<a name="matchotherdata"></a></P>
-<br><a name="SEC29" href="#TOC1">OTHER INFORMATION ABOUT A MATCH</a><br>
+<br><a name="SEC30" href="#TOC1">OTHER INFORMATION ABOUT A MATCH</a><br>
<P>
<b>PCRE2_SPTR pcre2_get_mark(pcre2_match_data *<i>match_data</i>);</b>
<br>
@@ -2390,25 +2732,28 @@ undefined.
</P>
<P>
After a successful match, a partial match (PCRE2_ERROR_PARTIAL), or a failure
-to match (PCRE2_ERROR_NOMATCH), a (*MARK) name may be available, and
-<b>pcre2_get_mark()</b> can be called. It returns a pointer to the
-zero-terminated name, which is within the compiled pattern. Otherwise NULL is
-returned. The length of the (*MARK) name (excluding the terminating zero) is
-stored in the code unit that preceeds the name. You should use this instead of
-relying on the terminating zero if the (*MARK) name might contain a binary
-zero.
-</P>
-<P>
-After a successful match, the (*MARK) name that is returned is the
-last one encountered on the matching path through the pattern. After a "no
-match" or a partial match, the last encountered (*MARK) name is returned. For
-example, consider this pattern:
+to match (PCRE2_ERROR_NOMATCH), a (*MARK), (*PRUNE), or (*THEN) name may be
+available. The function <b>pcre2_get_mark()</b> can be called to access this
+name. The same function applies to all three verbs. It returns a pointer to the
+zero-terminated name, which is within the compiled pattern. If no name is
+available, NULL is returned. The length of the name (excluding the terminating
+zero) is stored in the code unit that precedes the name. You should use this
+length instead of relying on the terminating zero if the name might contain a
+binary zero.
+</P>
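The stored length described above can be read as follows. This sketch assumes the 8-bit library, where a code unit is one byte; `mark_name_length` is a hypothetical helper, and its argument stands for the pointer returned by pcre2_get_mark().

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: length of a (*MARK), (*PRUNE), or (*THEN) name,
   taken from the code unit that precedes the name. Assumes 8-bit code
   units; the name itself may contain binary zeros. */
size_t mark_name_length(const unsigned char *mark)
{
    assert(mark != NULL);
    return (size_t)mark[-1];
}
```

Reading the length this way, rather than calling strlen(), gives the right answer for a name such as "a\0b", which strlen() would report as one code unit long.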
+<P>
+After a successful match, the name that is returned is the last (*MARK),
+(*PRUNE), or (*THEN) name encountered on the matching path through the pattern.
+Instances of (*PRUNE) and (*THEN) without names are ignored. Thus, for example,
+if the matching path contains (*MARK:A)(*PRUNE), the name "A" is returned.
+After a "no match" or a partial match, the last encountered name is returned.
+For example, consider this pattern:
<pre>
^(*MARK:A)((*MARK:B)a|b)c
</pre>
-When it matches "bc", the returned mark is A. The B mark is "seen" in the first
+When it matches "bc", the returned name is A. The B mark is "seen" in the first
branch of the group, but it is not on the matching path. On the other hand,
-when this pattern fails to match "bx", the returned mark is B.
+when this pattern fails to match "bx", the returned name is B.
</P>
<P>
After a successful match, a partial match, or one of the invalid UTF errors
@@ -2425,7 +2770,7 @@ the code unit offset of the invalid UTF character. Details are given in the
<a href="pcre2unicode.html"><b>pcre2unicode</b></a>
page.
<a name="errorlist"></a></P>
-<br><a name="SEC30" href="#TOC1">ERROR RETURNS FROM <b>pcre2_match()</b></a><br>
+<br><a name="SEC31" href="#TOC1">ERROR RETURNS FROM <b>pcre2_match()</b></a><br>
<P>
If <b>pcre2_match()</b> fails, it returns a negative number. This can be
converted to a text string by calling the <b>pcre2_get_error_message()</b>
@@ -2457,8 +2802,9 @@ returned when the magic number is not present.
<pre>
PCRE2_ERROR_BADMODE
</pre>
-This error is given when a pattern that was compiled by the 8-bit library is
-passed to a 16-bit or 32-bit library function, or vice versa.
+This error is given when a compiled pattern is passed to a function in a
+library of a different code unit width, for example, a pattern compiled by
+the 8-bit library is passed to a 16-bit or 32-bit library function.
<pre>
PCRE2_ERROR_BADOFFSET
</pre>
@@ -2483,20 +2829,19 @@ use by callout functions that want to cause <b>pcre2_match()</b> or
<a href="pcre2callout.html"><b>pcre2callout</b></a>
documentation for details.
<pre>
+ PCRE2_ERROR_DEPTHLIMIT
+</pre>
+The nested backtracking depth limit was reached.
+<pre>
+ PCRE2_ERROR_HEAPLIMIT
+</pre>
+The heap limit was reached.
+<pre>
PCRE2_ERROR_INTERNAL
</pre>
An unexpected internal error has occurred. This error could be caused by a bug
in PCRE2 or by overwriting of the compiled pattern.
<pre>
- PCRE2_ERROR_JIT_BADOPTION
-</pre>
-This error is returned when a pattern that was successfully studied using JIT
-is being matched, but the matching mode (partial or complete match) does not
-correspond to any JIT compilation mode. When the JIT fast path function is
-used, this error may be also given for invalid options. See the
-<a href="pcre2jit.html"><b>pcre2jit</b></a>
-documentation for more details.
-<pre>
PCRE2_ERROR_JIT_STACKLIMIT
</pre>
This error is returned when a pattern that was successfully studied using JIT
@@ -2507,15 +2852,14 @@ documentation for more details.
<pre>
PCRE2_ERROR_MATCHLIMIT
</pre>
-The backtracking limit was reached.
+The backtracking match limit was reached.
<pre>
PCRE2_ERROR_NOMEMORY
</pre>
-If a pattern contains back references, but the ovector is not big enough to
-remember the referenced substrings, PCRE2 gets a block of memory at the start
-of matching to use for this purpose. There are some other special cases where
-extra memory is needed during matching. This error is given when memory cannot
-be obtained.
+If a pattern contains many nested backtracking points, heap memory is used to
+remember them. This error is given when the memory allocation function (default
+or custom) fails. Note that a different error, PCRE2_ERROR_HEAPLIMIT, is given
+if the amount of memory needed exceeds the heap limit.
<pre>
PCRE2_ERROR_NULL
</pre>
@@ -2531,12 +2875,8 @@ in the subject string. Some simple patterns that might do this are detected and
faulted at compile time, but more complicated cases, in particular mutual
recursions between two different subpatterns, cannot be detected until matching
is attempted.
-<pre>
- PCRE2_ERROR_RECURSIONLIMIT
-</pre>
-The internal recursion limit was reached.
<a name="geterrormessage"></a></P>
-<br><a name="SEC31" href="#TOC1">OBTAINING A TEXTUAL ERROR MESSAGE</a><br>
+<br><a name="SEC32" href="#TOC1">OBTAINING A TEXTUAL ERROR MESSAGE</a><br>
<P>
<b>int pcre2_get_error_message(int <i>errorcode</i>, PCRE2_UCHAR *<i>buffer</i>,</b>
<b> PCRE2_SIZE <i>bufflen</i>);</b>
@@ -2545,8 +2885,8 @@ The internal recursion limit was reached.
A text message for an error code from any PCRE2 function (compile, match, or
auxiliary) can be obtained by calling <b>pcre2_get_error_message()</b>. The code
is passed as the first argument, with the remaining two arguments specifying a
-code unit buffer and its length, into which the text message is placed. Note
-that the message is returned in code units of the appropriate width for the
+code unit buffer and its length in code units, into which the text message is
+placed. The message is returned in code units of the appropriate width for the
library that is being used.
</P>
<P>
@@ -2557,7 +2897,7 @@ returned. If the buffer is too small, the message is truncated (but still with
a trailing zero), and the negative error code PCRE2_ERROR_NOMEMORY is returned.
None of the messages are very long; a buffer size of 120 code units is ample.
<a name="extractbynumber"></a></P>
-<br><a name="SEC32" href="#TOC1">EXTRACTING CAPTURED SUBSTRINGS BY NUMBER</a><br>
+<br><a name="SEC33" href="#TOC1">EXTRACTING CAPTURED SUBSTRINGS BY NUMBER</a><br>
<P>
<b>int pcre2_substring_length_bynumber(pcre2_match_data *<i>match_data</i>,</b>
<b> uint32_t <i>number</i>, PCRE2_SIZE *<i>length</i>);</b>
@@ -2654,7 +2994,7 @@ The substring did not participate in the match. For example, if the pattern is
(abc)|(def) and the subject is "def", and the ovector contains at least two
capturing slots, substring number 1 is unset.
</P>
-<br><a name="SEC33" href="#TOC1">EXTRACTING A LIST OF ALL CAPTURED SUBSTRINGS</a><br>
+<br><a name="SEC34" href="#TOC1">EXTRACTING A LIST OF ALL CAPTURED SUBSTRINGS</a><br>
<P>
<b>int pcre2_substring_list_get(pcre2_match_data *<i>match_data</i>,</b>
<b> PCRE2_UCHAR ***<i>listptr</i>, PCRE2_SIZE **<i>lengthsptr</i>);</b>
@@ -2693,7 +3033,7 @@ can be distinguished from a genuine zero-length substring by inspecting the
appropriate offset in the ovector, which contains PCRE2_UNSET for unset
substrings, or by calling <b>pcre2_substring_length_bynumber()</b>.
<a name="extractbyname"></a></P>
-<br><a name="SEC34" href="#TOC1">EXTRACTING CAPTURED SUBSTRINGS BY NAME</a><br>
+<br><a name="SEC35" href="#TOC1">EXTRACTING CAPTURED SUBSTRINGS BY NAME</a><br>
<P>
<b>int pcre2_substring_number_from_name(const pcre2_code *<i>code</i>,</b>
<b> PCRE2_SPTR <i>name</i>);</b>
@@ -2725,8 +3065,8 @@ calling <b>pcre2_substring_number_from_name()</b>. The first argument is the
compiled pattern, and the second is the name. The yield of the function is the
subpattern number, PCRE2_ERROR_NOSUBSTRING if there is no subpattern of that
name, or PCRE2_ERROR_NOUNIQUESUBSTRING if there is more than one subpattern of
-that name. Given the number, you can extract the substring directly, or use one
-of the functions described above.
+that name. Given the number, you can extract the substring directly from the
+ovector, or use one of the "bynumber" functions described above.
</P>
<P>
For convenience, there are also "byname" functions that correspond to the
@@ -2753,7 +3093,7 @@ names are not included in the compiled code. The matching process uses only
numbers. For this reason, the use of different names for subpatterns of the
same number causes an error at compile time.
</P>
-<br><a name="SEC35" href="#TOC1">CREATING A NEW STRING WITH SUBSTITUTIONS</a><br>
+<br><a name="SEC36" href="#TOC1">CREATING A NEW STRING WITH SUBSTITUTIONS</a><br>
<P>
<b>int pcre2_substitute(const pcre2_code *<i>code</i>, PCRE2_SPTR <i>subject</i>,</b>
<b> PCRE2_SIZE <i>length</i>, PCRE2_SIZE <i>startoffset</i>,</b>
@@ -2800,12 +3140,12 @@ length is in code units, not bytes.
In the replacement string, which is interpreted as a UTF string in UTF mode,
and is checked for UTF validity unless the PCRE2_NO_UTF_CHECK option is set, a
dollar character is an escape character that can specify the insertion of
-characters from capturing groups or (*MARK) items in the pattern. The following
-forms are always recognized:
+characters from capturing groups or (*MARK), (*PRUNE), or (*THEN) items in the
+pattern. The following forms are always recognized:
<pre>
$$ insert a dollar character
$&#60;n&#62; or ${&#60;n&#62;} insert the contents of group &#60;n&#62;
- $*MARK or ${*MARK} insert the name of the last (*MARK) encountered
+ $*MARK or ${*MARK} insert a (*MARK), (*PRUNE), or (*THEN) name
</pre>
Either a group number or a group name can be given for &#60;n&#62;. Curly brackets are
required only if the following character would be interpreted as part of the
@@ -2814,25 +3154,43 @@ For example, if the pattern a(b)c is matched with "=abc=" and the replacement
string "+$1$0$1+", the result is "=+babcb+=".
</P>
<P>
-The facility for inserting a (*MARK) name can be used to perform simple
-simultaneous substitutions, as this <b>pcre2test</b> example shows:
+$*MARK inserts the name from the last encountered (*MARK), (*PRUNE), or (*THEN)
+on the matching path that has a name. (*MARK) must always include a name, but
+(*PRUNE) and (*THEN) need not. For example, in the case of (*MARK:A)(*PRUNE)
+the name inserted is "A", but for (*MARK:A)(*PRUNE:B) the relevant name is "B".
+This facility can be used to perform simple simultaneous substitutions, as this
+<b>pcre2test</b> example shows:
<pre>
- /(*:pear)apple|(*:orange)lemon/g,replace=${*MARK}
+ /(*MARK:pear)apple|(*MARK:orange)lemon/g,replace=${*MARK}
apple lemon
2: pear orange
</pre>
As well as the usual options for <b>pcre2_match()</b>, a number of additional
-options can be set in the <i>options</i> argument.
+options can be set in the <i>options</i> argument of <b>pcre2_substitute()</b>.
</P>
<P>
PCRE2_SUBSTITUTE_GLOBAL causes the function to iterate over the subject string,
-replacing every matching substring. If this is not set, only the first matching
-substring is replaced. If any matched substring has zero length, after the
-substitution has happened, an attempt to find a non-empty match at the same
-position is performed. If this is not successful, the current position is
-advanced by one character except when CRLF is a valid newline sequence and the
-next two characters are CR, LF. In this case, the current position is advanced
-by two characters.
+replacing every matching substring. If this option is not set, only the first
+matching substring is replaced. The search for matches takes place in the
+original subject string (that is, previous replacements do not affect it).
+Iteration is implemented by advancing the <i>startoffset</i> value for each
+search, which is always passed the entire subject string. If an offset limit is
+set in the match context, searching stops when that limit is reached.
+</P>
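The effect of a start offset and an offset limit on a global substitution can be modelled without PCRE2 at all. The sketch below is a stand-in, not the library's implementation: it replaces a single character instead of matching a pattern, `replace_char_in_window` is a hypothetical name, and it assumes a match may start at the limit offset itself but not beyond it.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical model of a windowed global substitution: replace every
   `from` whose offset lies in [startoffset, offset_limit] with `to`.
   `out` must have room for strlen(subject) + 1 bytes. */
void replace_char_in_window(const char *subject, char from, char to,
                            size_t startoffset, size_t offset_limit,
                            char *out)
{
    size_t length = strlen(subject);
    assert(startoffset <= length);
    memcpy(out, subject, length + 1);   /* text outside the window is kept */
    for (size_t i = startoffset; i < length && i <= offset_limit; i++)
        if (subject[i] == from)         /* search the original subject only */
            out[i] = to;
}
```

Called on "ABC ABC ABC ABC" with a start offset of 3 and a limit of 12, only the B characters at offsets 5 and 9 are replaced, which agrees with the pcre2test output shown in this section.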
+<P>
+You can restrict the effect of a global substitution to a portion of the
+subject string by setting either or both of <i>startoffset</i> and an offset
+limit. Here is a <b>pcre2test</b> example:
+<pre>
+ /B/g,replace=!,use_offset_limit
+ ABC ABC ABC ABC\=offset=3,offset_limit=12
+ 2: ABC A!C A!C ABC
+</pre>
+When continuing with global substitutions after matching a substring with zero
+length, an attempt to find a non-empty match at the same offset is performed.
+If this is not successful, the offset is advanced by one character except when
+CRLF is a valid newline sequence and the next two characters are CR, LF. In
+this case, the offset is advanced by two characters.
</P>
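The advance rule after an empty match can be sketched as below. This assumes a non-UTF subject in the 8-bit library, so that one character is one code unit; in UTF mode a character may span several code units. `advance_after_empty_match` is a hypothetical helper, not a PCRE2 function.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: next search offset after an empty match at
   `offset` when no non-empty match was found there. Advances by two
   code units over CR LF when CRLF is a valid newline, otherwise by one. */
size_t advance_after_empty_match(const char *subject, size_t length,
                                 size_t offset, int crlf_is_newline)
{
    assert(offset < length);
    if (crlf_is_newline && offset + 1 < length &&
        subject[offset] == '\r' && subject[offset + 1] == '\n')
        return offset + 2;
    return offset + 1;
}
```

Stepping over CR LF as a unit avoids restarting the search between the two characters of a single newline.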
<P>
PCRE2_SUBSTITUTE_OVERFLOW_LENGTH changes what happens when the output buffer is
@@ -2949,10 +3307,10 @@ default.
<P>
PCRE2_ERROR_BADREPLACEMENT is used for miscellaneous syntax errors in the
replacement string, with more particular errors being PCRE2_ERROR_BADREPESCAPE
-(invalid escape sequence), PCRE2_ERROR_REPMISSING_BRACE (closing curly bracket
-not found), PCRE2_BADSUBSTITUTION (syntax error in extended group
-substitution), and PCRE2_BADSUBPATTERN (the pattern match ended before it
-started, which can happen if \K is used in an assertion).
+(invalid escape sequence), PCRE2_ERROR_REPMISSINGBRACE (closing curly bracket
+not found), PCRE2_ERROR_BADSUBSTITUTION (syntax error in extended group
+substitution), and PCRE2_ERROR_BADSUBSPATTERN (the pattern match ended before
+it started, which can happen if \K is used in an assertion).
</P>
<P>
As for all PCRE2 errors, a text message that describes the error can be
@@ -2960,7 +3318,7 @@ obtained by calling the <b>pcre2_get_error_message()</b> function (see
"Obtaining a textual error message"
<a href="#geterrormessage">above).</a>
</P>
-<br><a name="SEC36" href="#TOC1">DUPLICATE SUBPATTERN NAMES</a><br>
+<br><a name="SEC37" href="#TOC1">DUPLICATE SUBPATTERN NAMES</a><br>
<P>
<b>int pcre2_substring_nametable_scan(const pcre2_code *<i>code</i>,</b>
<b> PCRE2_SPTR <i>name</i>, PCRE2_SPTR *<i>first</i>, PCRE2_SPTR *<i>last</i>);</b>
@@ -3005,7 +3363,7 @@ in the section entitled <i>Information about a pattern</i>. Given all the
relevant entries for the name, you can extract each of their numbers, and hence
the captured data.
</P>
-<br><a name="SEC37" href="#TOC1">FINDING ALL POSSIBLE MATCHES AT ONE POSITION</a><br>
+<br><a name="SEC38" href="#TOC1">FINDING ALL POSSIBLE MATCHES AT ONE POSITION</a><br>
<P>
The traditional matching function uses a similar algorithm to Perl, which stops
when it finds the first match at a given point in the subject. If you want to
@@ -3023,7 +3381,7 @@ substring. Then return 1, which forces <b>pcre2_match()</b> to backtrack and try
other alternatives. Ultimately, when it runs out of matches,
<b>pcre2_match()</b> will yield PCRE2_ERROR_NOMATCH.
<a name="dfamatch"></a></P>
-<br><a name="SEC38" href="#TOC1">MATCHING A PATTERN: THE ALTERNATIVE FUNCTION</a><br>
+<br><a name="SEC39" href="#TOC1">MATCHING A PATTERN: THE ALTERNATIVE FUNCTION</a><br>
<P>
<b>int pcre2_dfa_match(const pcre2_code *<i>code</i>, PCRE2_SPTR <i>subject</i>,</b>
<b> PCRE2_SIZE <i>length</i>, PCRE2_SIZE <i>startoffset</i>,</b>
@@ -3034,11 +3392,12 @@ other alternatives. Ultimately, when it runs out of matches,
<P>
The function <b>pcre2_dfa_match()</b> is called to match a subject string
against a compiled pattern, using a matching algorithm that scans the subject
-string just once, and does not backtrack. This has different characteristics to
-the normal algorithm, and is not compatible with Perl. Some of the features of
-PCRE2 patterns are not supported. Nevertheless, there are times when this kind
-of matching can be useful. For a discussion of the two matching algorithms, and
-a list of features that <b>pcre2_dfa_match()</b> does not support, see the
+string just once (not counting lookaround assertions), and does not backtrack.
+This has different characteristics to the normal algorithm, and is not
+compatible with Perl. Some of the features of PCRE2 patterns are not supported.
+Nevertheless, there are times when this kind of matching can be useful. For a
+discussion of the two matching algorithms, and a list of features that
+<b>pcre2_dfa_match()</b> does not support, see the
<a href="pcre2matching.html"><b>pcre2matching</b></a>
documentation.
</P>
@@ -3066,7 +3425,7 @@ Here is an example of a simple call to <b>pcre2_dfa_match()</b>:
11, /* the length of the subject string */
0, /* start at offset 0 in the subject */
0, /* default options */
- match_data, /* the match data block */
+ md, /* the match data block */
NULL, /* a match context; NULL means use defaults */
wspace, /* working space vector */
20); /* number of elements (NOT size in bytes) */
@@ -3077,11 +3436,11 @@ Option bits for <b>pcre2_dfa_match()</b>
</b><br>
<P>
The unused bits of the <i>options</i> argument for <b>pcre2_dfa_match()</b> must
-be zero. The only bits that may be set are PCRE2_ANCHORED, PCRE2_NOTBOL,
-PCRE2_NOTEOL, PCRE2_NOTEMPTY, PCRE2_NOTEMPTY_ATSTART, PCRE2_NO_UTF_CHECK,
-PCRE2_PARTIAL_HARD, PCRE2_PARTIAL_SOFT, PCRE2_DFA_SHORTEST, and
-PCRE2_DFA_RESTART. All but the last four of these are exactly the same as for
-<b>pcre2_match()</b>, so their description is not repeated here.
+be zero. The only bits that may be set are PCRE2_ANCHORED, PCRE2_ENDANCHORED,
+PCRE2_NOTBOL, PCRE2_NOTEOL, PCRE2_NOTEMPTY, PCRE2_NOTEMPTY_ATSTART,
+PCRE2_NO_UTF_CHECK, PCRE2_PARTIAL_HARD, PCRE2_PARTIAL_SOFT, PCRE2_DFA_SHORTEST,
+and PCRE2_DFA_RESTART. All but the last four of these are exactly the same as
+for <b>pcre2_match()</b>, so their description is not repeated here.
<pre>
PCRE2_PARTIAL_HARD
PCRE2_PARTIAL_SOFT
@@ -3174,7 +3533,7 @@ NOTE: PCRE2's "auto-possessification" optimization usually applies to character
repeats at the end of a pattern (as well as internally). For example, the
pattern "a\d+" is compiled as if it were "a\d++". For DFA matching, this
means that only one possible match is found. If you really do want multiple
-matches in such cases, either use an ungreedy repeat auch as "a\d+?" or set
+matches in such cases, either use an ungreedy repeat such as "a\d+?" or set
the PCRE2_NO_AUTO_POSSESS option when compiling.
</P>
<br><b>
@@ -3218,13 +3577,13 @@ some plausibility checks are made on the contents of the workspace, which
should contain data about the previous partial match. If any of these checks
fail, this error is given.
</P>
-<br><a name="SEC39" href="#TOC1">SEE ALSO</a><br>
+<br><a name="SEC40" href="#TOC1">SEE ALSO</a><br>
<P>
<b>pcre2build</b>(3), <b>pcre2callout</b>(3), <b>pcre2demo(3)</b>,
<b>pcre2matching</b>(3), <b>pcre2partial</b>(3), <b>pcre2posix</b>(3),
-<b>pcre2sample</b>(3), <b>pcre2stack</b>(3), <b>pcre2unicode</b>(3).
+<b>pcre2sample</b>(3), <b>pcre2unicode</b>(3).
</P>
-<br><a name="SEC40" href="#TOC1">AUTHOR</a><br>
+<br><a name="SEC41" href="#TOC1">AUTHOR</a><br>
<P>
Philip Hazel
<br>
@@ -3233,11 +3592,11 @@ University Computing Service
Cambridge, England.
<br>
</P>
-<br><a name="SEC41" href="#TOC1">REVISION</a><br>
+<br><a name="SEC42" href="#TOC1">REVISION</a><br>
<P>
-Last updated: 17 June 2016
+Last updated: 31 December 2017
<br>
-Copyright &copy; 1997-2016 University of Cambridge.
+Copyright &copy; 1997-2017 University of Cambridge.
<br>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
diff --git a/doc/html/pcre2build.html b/doc/html/pcre2build.html
index 6c8e1de..823e605 100644
--- a/doc/html/pcre2build.html
+++ b/doc/html/pcre2build.html
@@ -23,20 +23,21 @@ please consult the man page, in case the conversion went wrong.
<li><a name="TOC8" href="#SEC8">NEWLINE RECOGNITION</a>
<li><a name="TOC9" href="#SEC9">WHAT \R MATCHES</a>
<li><a name="TOC10" href="#SEC10">HANDLING VERY LARGE PATTERNS</a>
-<li><a name="TOC11" href="#SEC11">AVOIDING EXCESSIVE STACK USAGE</a>
-<li><a name="TOC12" href="#SEC12">LIMITING PCRE2 RESOURCE USAGE</a>
-<li><a name="TOC13" href="#SEC13">CREATING CHARACTER TABLES AT BUILD TIME</a>
-<li><a name="TOC14" href="#SEC14">USING EBCDIC CODE</a>
-<li><a name="TOC15" href="#SEC15">PCRE2GREP SUPPORT FOR EXTERNAL SCRIPTS</a>
-<li><a name="TOC16" href="#SEC16">PCRE2GREP OPTIONS FOR COMPRESSED FILE SUPPORT</a>
-<li><a name="TOC17" href="#SEC17">PCRE2GREP BUFFER SIZE</a>
-<li><a name="TOC18" href="#SEC18">PCRE2TEST OPTION FOR LIBREADLINE SUPPORT</a>
-<li><a name="TOC19" href="#SEC19">INCLUDING DEBUGGING CODE</a>
-<li><a name="TOC20" href="#SEC20">DEBUGGING WITH VALGRIND SUPPORT</a>
-<li><a name="TOC21" href="#SEC21">CODE COVERAGE REPORTING</a>
-<li><a name="TOC22" href="#SEC22">SEE ALSO</a>
-<li><a name="TOC23" href="#SEC23">AUTHOR</a>
-<li><a name="TOC24" href="#SEC24">REVISION</a>
+<li><a name="TOC11" href="#SEC11">LIMITING PCRE2 RESOURCE USAGE</a>
+<li><a name="TOC12" href="#SEC12">CREATING CHARACTER TABLES AT BUILD TIME</a>
+<li><a name="TOC13" href="#SEC13">USING EBCDIC CODE</a>
+<li><a name="TOC14" href="#SEC14">PCRE2GREP SUPPORT FOR EXTERNAL SCRIPTS</a>
+<li><a name="TOC15" href="#SEC15">PCRE2GREP OPTIONS FOR COMPRESSED FILE SUPPORT</a>
+<li><a name="TOC16" href="#SEC16">PCRE2GREP BUFFER SIZE</a>
+<li><a name="TOC17" href="#SEC17">PCRE2TEST OPTION FOR LIBREADLINE SUPPORT</a>
+<li><a name="TOC18" href="#SEC18">INCLUDING DEBUGGING CODE</a>
+<li><a name="TOC19" href="#SEC19">DEBUGGING WITH VALGRIND SUPPORT</a>
+<li><a name="TOC20" href="#SEC20">CODE COVERAGE REPORTING</a>
+<li><a name="TOC21" href="#SEC21">SUPPORT FOR FUZZERS</a>
+<li><a name="TOC22" href="#SEC22">OBSOLETE OPTION</a>
+<li><a name="TOC23" href="#SEC23">SEE ALSO</a>
+<li><a name="TOC24" href="#SEC24">AUTHOR</a>
+<li><a name="TOC25" href="#SEC25">REVISION</a>
</ul>
<br><a name="SEC1" href="#TOC1">BUILDING PCRE2</a><br>
<P>
@@ -77,19 +78,19 @@ running
<pre>
./configure --help
</pre>
-The following sections include descriptions of options whose names begin with
---enable or --disable. These settings specify changes to the defaults for the
-<b>configure</b> command. Because of the way that <b>configure</b> works,
---enable and --disable always come in pairs, so the complementary option always
-exists as well, but as it specifies the default, it is not described.
+The following sections include descriptions of "on/off" options whose names
+begin with --enable or --disable. Because of the way that <b>configure</b>
+works, --enable and --disable always come in pairs, so the complementary option
+always exists as well, but as it specifies the default, it is not described.
+Options that specify values have names that start with --with.
</P>
<br><a name="SEC3" href="#TOC1">BUILDING 8-BIT, 16-BIT AND 32-BIT LIBRARIES</a><br>
<P>
By default, a library called <b>libpcre2-8</b> is built, containing functions
-that take string arguments contained in vectors of bytes, interpreted either as
+that take string arguments contained in arrays of bytes, interpreted either as
single-byte characters, or UTF-8 strings. You can also build two other
libraries, called <b>libpcre2-16</b> and <b>libpcre2-32</b>, which process
-strings that are contained in vectors of 16-bit and 32-bit code units,
+strings that are contained in arrays of 16-bit and 32-bit code units,
respectively. These can be interpreted either as single-unit characters or
UTF-16/UTF-32 strings. To build these additional libraries, add one or both of
the following to the <b>configure</b> command:
@@ -137,10 +138,10 @@ locked this out by setting PCRE2_NEVER_UTF.
</P>
<P>
UTF support allows the libraries to process character code points up to
-0x10ffff in the strings that they handle. It also provides support for
-accessing the Unicode properties of such characters, using pattern escapes such
-as \P, \p, and \X. Only the general category properties such as <i>Lu</i> and
-<i>Nd</i> are supported. Details are given in the
+0x10ffff in the strings that they handle. Unicode support also gives access to
+the Unicode properties of characters, using pattern escapes such as \P, \p,
+and \X. Only the general category properties such as <i>Lu</i> and <i>Nd</i> are
+supported. Details are given in the
<a href="pcre2pattern.html"><b>pcre2pattern</b></a>
documentation.
</P>
@@ -164,13 +165,18 @@ out by setting the PCRE2_NEVER_BACKSLASH_C option when calling
</P>
<br><a name="SEC7" href="#TOC1">JUST-IN-TIME COMPILER SUPPORT</a><br>
<P>
-Just-in-time compiler support is included in the build by specifying
+Just-in-time (JIT) compiler support is included in the build by specifying
<pre>
--enable-jit
</pre>
This support is available only for certain hardware architectures. If this
-option is set for an unsupported architecture, a building error occurs.
-See the
+option is set for an unsupported architecture, a build error occurs. If you
+are running under SELinux you may also want to add
+<pre>
+ --enable-jit-sealloc
+</pre>
+which enables the use of an execmem allocator in JIT that is compatible with
+SELinux. This has no effect if JIT is not enabled. See the
<a href="pcre2jit.html"><b>pcre2jit</b></a>
documentation for a discussion of JIT usage. When JIT support is enabled,
pcre2grep automatically makes use of it, unless you add
@@ -202,19 +208,23 @@ to the <b>configure</b> command. There is a fourth option, specified by
--enable-newline-is-anycrlf
</pre>
which causes PCRE2 to recognize any of the three sequences CR, LF, or CRLF as
-indicating a line ending. Finally, a fifth option, specified by
+indicating a line ending. A fifth option, specified by
<pre>
--enable-newline-is-any
</pre>
causes PCRE2 to recognize any Unicode newline sequence. The Unicode newline
sequences are the three just mentioned, plus the single characters VT (vertical
tab, U+000B), FF (form feed, U+000C), NEL (next line, U+0085), LS (line
-separator, U+2028), and PS (paragraph separator, U+2029).
+separator, U+2028), and PS (paragraph separator, U+2029). The final option is
+<pre>
+ --enable-newline-is-nul
+</pre>
+which causes NUL (binary zero) to be set as the default line-ending character.
</P>
<P>
Whatever default line ending convention is selected when PCRE2 is built can be
overridden by applications that use the library. At build time it is
-conventional to use the standard for your operating system.
+recommended to use the standard for your operating system.
</P>
<br><a name="SEC9" href="#TOC1">WHAT \R MATCHES</a><br>
<P>
@@ -226,7 +236,7 @@ specify
</pre>
the default is changed so that \R matches only CR, LF, or CRLF. Whatever is
selected when PCRE2 is built can be overridden by applications that use the
-called.
+library.
</P>
<br><a name="SEC10" href="#TOC1">HANDLING VERY LARGE PATTERNS</a><br>
<P>
@@ -247,58 +257,62 @@ longer offsets slows down the operation of PCRE2 because it has to load
additional data when handling them. For the 32-bit library the value is always
4 and cannot be overridden; the value of --with-link-size is ignored.
</P>
-<br><a name="SEC11" href="#TOC1">AVOIDING EXCESSIVE STACK USAGE</a><br>
+<br><a name="SEC11" href="#TOC1">LIMITING PCRE2 RESOURCE USAGE</a><br>
<P>
-When matching with the <b>pcre2_match()</b> function, PCRE2 implements
-backtracking by making recursive calls to an internal function called
-<b>match()</b>. In environments where the size of the stack is limited, this can
-severely limit PCRE2's operation. (The Unix environment does not usually suffer
-from this problem, but it may sometimes be necessary to increase the maximum
-stack size. There is a discussion in the
-<a href="pcre2stack.html"><b>pcre2stack</b></a>
-documentation.) An alternative approach to recursion that uses memory from the
-heap to remember data, instead of using recursive function calls, has been
-implemented to work round the problem of limited stack size. If you want to
-build a version of PCRE2 that works this way, add
+The <b>pcre2_match()</b> function increments a counter each time it goes round
+its main loop. Putting a limit on this counter controls the amount of computing
+resource used by a single call to <b>pcre2_match()</b>. The limit can be changed
+at run time, as described in the
+<a href="pcre2api.html"><b>pcre2api</b></a>
+documentation. The default is 10 million, but this can be changed by adding a
+setting such as
<pre>
- --disable-stack-for-recursion
+ --with-match-limit=500000
</pre>
-to the <b>configure</b> command. By default, the system functions <b>malloc()</b>
-and <b>free()</b> are called to manage the heap memory that is required, but
-custom memory management functions can be called instead. PCRE2 runs noticeably
-more slowly when built in this way. This option affects only the
-<b>pcre2_match()</b> function; it is not relevant for <b>pcre2_dfa_match()</b>.
+to the <b>configure</b> command. This setting also applies to the
+<b>pcre2_dfa_match()</b> matching function, and to JIT matching (though the
+counting is done differently).
</P>
-<br><a name="SEC12" href="#TOC1">LIMITING PCRE2 RESOURCE USAGE</a><br>
<P>
-Internally, PCRE2 has a function called <b>match()</b>, which it calls
-repeatedly (sometimes recursively) when matching a pattern with the
-<b>pcre2_match()</b> function. By controlling the maximum number of times this
-function may be called during a single matching operation, a limit can be
-placed on the resources used by a single call to <b>pcre2_match()</b>. The limit
-can be changed at run time, as described in the
+The <b>pcre2_match()</b> function starts out using a 20K vector on the system
+stack to record backtracking points. The more nested backtracking points there
+are (that is, the deeper the search tree), the more memory is needed. If the
+initial vector is not large enough, heap memory is used, up to a certain limit,
+which is specified in kilobytes. The limit can be changed at run time, as
+described in the
<a href="pcre2api.html"><b>pcre2api</b></a>
-documentation. The default is 10 million, but this can be changed by adding a
-setting such as
+documentation. The default limit (in effect unlimited) is 20 million. You can
+change this by a setting such as
<pre>
- --with-match-limit=500000
+ --with-heap-limit=500
</pre>
-to the <b>configure</b> command. This setting has no effect on the
-<b>pcre2_dfa_match()</b> matching function.
+which limits the amount of heap to 500 kilobytes. This limit applies only to
+interpretive matching in pcre2_match(). It does not apply when JIT (which has
+its own memory arrangements) is used, nor does it apply to
+<b>pcre2_dfa_match()</b>.
</P>
<P>
-In some environments it is desirable to limit the depth of recursive calls of
-<b>match()</b> more strictly than the total number of calls, in order to
-restrict the maximum amount of stack (or heap, if --disable-stack-for-recursion
-is specified) that is used. A second limit controls this; it defaults to the
-value that is set for --with-match-limit, which imposes no additional
-constraints. However, you can set a lower limit by adding, for example,
+You can also explicitly limit the depth of nested backtracking in the
+<b>pcre2_match()</b> interpreter. This limit defaults to the value that is set
+for --with-match-limit. You can set a lower default limit by adding, for
+example,
<pre>
- --with-match-limit-recursion=10000
+ --with-match-limit-depth=10000
</pre>
-to the <b>configure</b> command. This value can also be overridden at run time.
+to the <b>configure</b> command. This value can be overridden at run time. This
+depth limit indirectly limits the amount of heap memory that is used, but
+because the size of each backtracking "frame" depends on the number of
+capturing parentheses in a pattern, the amount of heap that is used before the
+limit is reached varies from pattern to pattern. This limit was more useful in
+versions before 10.30, where function recursion was used for backtracking.
</P>
-<br><a name="SEC13" href="#TOC1">CREATING CHARACTER TABLES AT BUILD TIME</a><br>
+<P>
+As well as applying to <b>pcre2_match()</b>, the depth limit also controls
+the depth of recursive function calls in <b>pcre2_dfa_match()</b>. These are
+used for lookaround assertions, atomic groups, and recursion within patterns.
+The limit does not apply to JIT matching.
+</P>
+<br><a name="SEC12" href="#TOC1">CREATING CHARACTER TABLES AT BUILD TIME</a><br>
<P>
PCRE2 uses fixed tables for processing characters whose code points are less
than 256. By default, PCRE2 is built with a set of tables that are distributed
@@ -310,12 +324,12 @@ only. If you add
to the <b>configure</b> command, the distributed tables are no longer used.
Instead, a program called <b>dftables</b> is compiled and run. This outputs the
source for a new set of tables, created in the default locale of your C run-time
-system. (This method of replacing the tables does not work if you are cross
+system. This method of replacing the tables does not work if you are cross
compiling, because <b>dftables</b> is run on the local host. If you need to
create alternative tables when cross compiling, you will have to do so "by
-hand".)
+hand".
</P>
-<br><a name="SEC14" href="#TOC1">USING EBCDIC CODE</a><br>
+<br><a name="SEC13" href="#TOC1">USING EBCDIC CODE</a><br>
<P>
PCRE2 assumes by default that it will run in an environment where the character
code is ASCII or Unicode, which is a superset of ASCII. This is the case for
@@ -350,7 +364,7 @@ The options that select newline behaviour, such as --enable-newline-is-cr,
and equivalent run-time options, refer to these character values in an EBCDIC
environment.
</P>
-<br><a name="SEC15" href="#TOC1">PCRE2GREP SUPPORT FOR EXTERNAL SCRIPTS</a><br>
+<br><a name="SEC14" href="#TOC1">PCRE2GREP SUPPORT FOR EXTERNAL SCRIPTS</a><br>
<P>
By default, on non-Windows systems, <b>pcre2grep</b> supports the use of
callouts with string arguments within the patterns it is matching, in order to
@@ -359,7 +373,7 @@ run external scripts. For details, see the
documentation. This support can be disabled by adding
--disable-pcre2grep-callout to the <b>configure</b> command.
</P>
-<br><a name="SEC16" href="#TOC1">PCRE2GREP OPTIONS FOR COMPRESSED FILE SUPPORT</a><br>
+<br><a name="SEC15" href="#TOC1">PCRE2GREP OPTIONS FOR COMPRESSED FILE SUPPORT</a><br>
<P>
By default, <b>pcre2grep</b> reads all files as plain text. You can build it so
that it recognizes files whose names end in <b>.gz</b> or <b>.bz2</b>, and reads
@@ -372,22 +386,25 @@ to the <b>configure</b> command. These options naturally require that the
relevant libraries are installed on your system. Configuration will fail if
they are not.
</P>
-<br><a name="SEC17" href="#TOC1">PCRE2GREP BUFFER SIZE</a><br>
+<br><a name="SEC16" href="#TOC1">PCRE2GREP BUFFER SIZE</a><br>
<P>
<b>pcre2grep</b> uses an internal buffer to hold a "window" on the file it is
scanning, in order to be able to output "before" and "after" lines when it
-finds a match. The size of the buffer is controlled by a parameter whose
-default value is 20K. The buffer itself is three times this size, but because
-of the way it is used for holding "before" lines, the longest line that is
-guaranteed to be processable is the parameter size. You can change the default
-parameter value by adding, for example,
+finds a match. The starting size of the buffer is controlled by a parameter
+whose default value is 20K. The buffer itself is three times this size, but
+because of the way it is used for holding "before" lines, the longest line that
+is guaranteed to be processable is the parameter size. If a longer line is
+encountered, <b>pcre2grep</b> automatically expands the buffer, up to a
+specified maximum size, whose default is 1M or the starting size, whichever is
+the larger. You can change the default parameter values by adding, for example,
<pre>
- --with-pcre2grep-bufsize=50K
+ --with-pcre2grep-bufsize=51200
+ --with-pcre2grep-max-bufsize=2097152
</pre>
-to the <b>configure</b> command. The caller of \fPpcre2grep\fP can override this
-value by using --buffer-size on the command line.
+to the <b>configure</b> command. The caller of <b>pcre2grep</b> can override
+these values by using --buffer-size and --max-buffer-size on the command line.
</P>
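The buffer arithmetic described above can be sketched in C as follows. This is only an illustration of the stated rules (the buffer is three times the parameter; the default maximum is the larger of 1M and the starting size), not code taken from pcre2grep. The names demo_initial_buffer_bytes, demo_default_max_bufsize and DEMO_DEFAULT_MAX are hypothetical, and reading "1M" as 1048576 bytes is an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: "1M" in the text means 1024 * 1024 bytes. */
#define DEMO_DEFAULT_MAX ((size_t)1024 * 1024)

/* The text says the buffer itself is three times the parameter value. */
static size_t demo_initial_buffer_bytes(size_t bufsize)
{
  assert(bufsize <= (size_t)-1 / 3);   /* guard against overflow */
  return 3 * bufsize;
}

/* Default maximum: 1M or the starting size, whichever is the larger. */
static size_t demo_default_max_bufsize(size_t bufsize)
{
  return bufsize > DEMO_DEFAULT_MAX ? bufsize : DEMO_DEFAULT_MAX;
}
```

With the example setting of 51200, the longest line guaranteed to be processable without expansion is 51200 bytes, while the initial buffer allocated under this reading is 153600 bytes.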
-<br><a name="SEC18" href="#TOC1">PCRE2TEST OPTION FOR LIBREADLINE SUPPORT</a><br>
+<br><a name="SEC17" href="#TOC1">PCRE2TEST OPTION FOR LIBREADLINE SUPPORT</a><br>
<P>
If you add one of
<pre>
@@ -421,7 +438,7 @@ automatically included, you may need to add something like
</pre>
immediately before the <b>configure</b> command.
</P>
-<br><a name="SEC19" href="#TOC1">INCLUDING DEBUGGING CODE</a><br>
+<br><a name="SEC18" href="#TOC1">INCLUDING DEBUGGING CODE</a><br>
<P>
If you add
<pre>
@@ -430,7 +447,7 @@ If you add
to the <b>configure</b> command, additional debugging code is included in the
build. This feature is intended for use by the PCRE2 maintainers.
</P>
-<br><a name="SEC20" href="#TOC1">DEBUGGING WITH VALGRIND SUPPORT</a><br>
+<br><a name="SEC19" href="#TOC1">DEBUGGING WITH VALGRIND SUPPORT</a><br>
<P>
If you add
<pre>
@@ -440,7 +457,7 @@ to the <b>configure</b> command, PCRE2 will use valgrind annotations to mark
certain memory regions as unaddressable. This allows it to detect invalid
memory accesses, and is mostly useful for debugging PCRE2 itself.
</P>
-<br><a name="SEC21" href="#TOC1">CODE COVERAGE REPORTING</a><br>
+<br><a name="SEC20" href="#TOC1">CODE COVERAGE REPORTING</a><br>
<P>
If your C compiler is gcc, you can build a version of PCRE2 that can generate a
code coverage report for its test suite. To enable this, you must install
@@ -497,11 +514,47 @@ This cleans all coverage data including the generated coverage report. For more
information about code coverage, see the <b>gcov</b> and <b>lcov</b>
documentation.
</P>
-<br><a name="SEC22" href="#TOC1">SEE ALSO</a><br>
+<br><a name="SEC21" href="#TOC1">SUPPORT FOR FUZZERS</a><br>
+<P>
+There is a special option for use by people who want to run fuzzing tests on
+PCRE2:
+<pre>
+ --enable-fuzz-support
+</pre>
+At present this applies only to the 8-bit library. If set, it causes an extra
+library called libpcre2-fuzzsupport.a to be built, but not installed. This
+contains a single function called LLVMFuzzerTestOneInput() whose arguments are
+a pointer to a string and the length of the string. When called, this function
+tries to compile the string as a pattern, and if that succeeds, to match it.
+This is done both with no options and with some random option bits that are
+generated from the string.
+</P>
+<P>
+Setting --enable-fuzz-support also causes a binary called <b>pcre2fuzzcheck</b>
+to be created. This is normally run under valgrind or used when PCRE2 is
+compiled with address sanitizing enabled. It calls the fuzzing function and
+outputs information about what it is doing. The input strings are specified by
+arguments: if an argument starts with "=" the rest of it is a literal input
+string. Otherwise, it is assumed to be a file name, and the contents of the
+file are the test string.
+</P>
+<br><a name="SEC22" href="#TOC1">OBSOLETE OPTION</a><br>
+<P>
+In versions of PCRE2 prior to 10.30, there were two ways of handling
+backtracking in the <b>pcre2_match()</b> function. The default was to use the
+system stack, but if
+<pre>
+ --disable-stack-for-recursion
+</pre>
+was set, memory on the heap was used. From release 10.30 onwards this has
+changed (the stack is no longer used) and this option now does nothing except
+give a warning.
+</P>
+<br><a name="SEC23" href="#TOC1">SEE ALSO</a><br>
<P>
<b>pcre2api</b>(3), <b>pcre2-config</b>(3).
</P>
-<br><a name="SEC23" href="#TOC1">AUTHOR</a><br>
+<br><a name="SEC24" href="#TOC1">AUTHOR</a><br>
<P>
Philip Hazel
<br>
@@ -510,11 +563,11 @@ University Computing Service
Cambridge, England.
<br>
</P>
-<br><a name="SEC24" href="#TOC1">REVISION</a><br>
+<br><a name="SEC25" href="#TOC1">REVISION</a><br>
<P>
-Last updated: 01 April 2016
+Last updated: 18 July 2017
<br>
-Copyright &copy; 1997-2016 University of Cambridge.
+Copyright &copy; 1997-2017 University of Cambridge.
<br>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
diff --git a/doc/html/pcre2callout.html b/doc/html/pcre2callout.html
index 7e85c9a..2adf21a 100644
--- a/doc/html/pcre2callout.html
+++ b/doc/html/pcre2callout.html
@@ -57,16 +57,23 @@ two callout points:
</pre>
If the PCRE2_AUTO_CALLOUT option bit is set when a pattern is compiled, PCRE2
automatically inserts callouts, all with number 255, before each item in the
-pattern. For example, if PCRE2_AUTO_CALLOUT is used with the pattern
+pattern except for immediately before or after an explicit callout. For
+example, if PCRE2_AUTO_CALLOUT is used with the pattern
<pre>
- A(\d{2}|--)
+ A(?C3)B
</pre>
it is processed as if it were
-<br>
-<br>
-(?C255)A(?C255)((?C255)\d{2}(?C255)|(?C255)-(?C255)-(?C255))(?C255)
-<br>
-<br>
+<pre>
+ (?C255)A(?C3)B(?C255)
+</pre>
+Here is a more complicated example:
+<pre>
+ A(\d{2}|--)
+</pre>
+With PCRE2_AUTO_CALLOUT, this pattern is processed as if it were
+<pre>
+ (?C255)A(?C255)((?C255)\d{2}(?C255)|(?C255)-(?C255)-(?C255))(?C255)
+</pre>
Notice that there is a callout before and after each parenthesis and
alternation bar. If the pattern contains a conditional group whose condition is
an assertion, an automatic callout is inserted immediately before the
@@ -107,10 +114,10 @@ with PCRE2_ANCHORED and PCRE2_AUTO_CALLOUT and then applied to the string
No match
</pre>
This indicates that when matching [bc] fails, there is no backtracking into a+
-and therefore the callouts that would be taken for the backtracks do not occur.
-You can disable the auto-possessify feature by passing PCRE2_NO_AUTO_POSSESS to
-<b>pcre2_compile()</b>, or starting the pattern with (*NO_AUTO_POSSESS). In this
-case, the output changes to this:
+(because it is being treated as a++) and therefore the callouts that would be
+taken for the backtracks do not occur. You can disable the auto-possessify
+feature by passing PCRE2_NO_AUTO_POSSESS to <b>pcre2_compile()</b>, or starting
+the pattern with (*NO_AUTO_POSSESS). In this case, the output changes to this:
<pre>
---&#62;aaaa
+0 ^ a+
@@ -131,10 +138,14 @@ By default, an optimization is applied when .* is the first significant item in
a pattern. If PCRE2_DOTALL is set, so that the dot can match any character, the
pattern is automatically anchored. If PCRE2_DOTALL is not set, a match can
start only after an internal newline or at the beginning of the subject, and
-<b>pcre2_compile()</b> remembers this. This optimization is disabled, however,
-if .* is in an atomic group or if there is a back reference to the capturing
-group in which it appears. It is also disabled if the pattern contains (*PRUNE)
-or (*SKIP). However, the presence of callouts does not affect it.
+<b>pcre2_compile()</b> remembers this. If a pattern has more than one top-level
+branch, automatic anchoring occurs if all branches are anchorable.
+</P>
+<P>
+This optimization is disabled, however, if .* is in an atomic group or if there
+is a back reference to the capturing group in which it appears. It is also
+disabled if the pattern contains (*PRUNE) or (*SKIP). However, the presence of
+callouts does not affect it.
</P>
<P>
For example, if the pattern .*\d is compiled with PCRE2_AUTO_CALLOUT and
@@ -166,10 +177,6 @@ This shows more match attempts, starting at the second subject character.
Another optimization, described in the next section, means that there is no
subsequent attempt to match with an empty subject.
</P>
-<P>
-If a pattern has more than one top-level branch, automatic anchoring occurs if
-all branches are anchorable.
-</P>
<br><b>
Other optimizations
</b><br>
@@ -185,9 +192,10 @@ start, and the callout is never reached. However, with "abyd", though the
result is still no match, the callout is obeyed.
</P>
<P>
-PCRE2 also knows the minimum length of a matching string, and will immediately
-give a "no match" return without actually running a match if the subject is not
-long enough, or, for unanchored patterns, if it has been scanned far enough.
+For most patterns PCRE2 also knows the minimum length of a matching string, and
+will immediately give a "no match" return without actually running a match if
+the subject is not long enough, or, for unanchored patterns, if it has been
+scanned far enough.
</P>
<P>
You can disable these optimizations by passing the PCRE2_NO_START_OPTIMIZE
@@ -198,18 +206,20 @@ callouts such as the example above are obeyed.
<br><a name="SEC4" href="#TOC1">THE CALLOUT INTERFACE</a><br>
<P>
During matching, when PCRE2 reaches a callout point, if an external function is
-set in the match context, it is called. This applies to both normal and DFA
-matching. The first argument to the callout function is a pointer to a
-<b>pcre2_callout</b> block. The second argument is the void * callout data that
-was supplied when the callout was set up by calling <b>pcre2_set_callout()</b>
-(see the
+provided in the match context, it is called. This applies to normal,
+DFA, and JIT matching. The first argument to the callout function is a pointer
+to a <b>pcre2_callout</b> block. The second argument is the void * callout data
+that was supplied when the callout was set up by calling
+<b>pcre2_set_callout()</b> (see the
<a href="pcre2api.html"><b>pcre2api</b></a>
-documentation). The callout block structure contains the following fields:
+documentation). The callout block structure contains the following fields, not
+necessarily in this order:
<pre>
uint32_t <i>version</i>;
uint32_t <i>callout_number</i>;
uint32_t <i>capture_top</i>;
uint32_t <i>capture_last</i>;
+ uint32_t <i>callout_flags</i>;
PCRE2_SIZE *<i>offset_vector</i>;
PCRE2_SPTR <i>mark</i>;
PCRE2_SPTR <i>subject</i>;
@@ -223,11 +233,12 @@ documentation). The callout block structure contains the following fields:
PCRE2_SPTR <i>callout_string</i>;
</pre>
The <i>version</i> field contains the version number of the block format. The
-current version is 1; the three callout string fields were added for this
-version. If you are writing an application that might use an earlier release of
-PCRE2, you should check the version number before accessing any of these
-fields. The version number will increase in future if more fields are added,
-but the intention is never to remove any of the existing fields.
+current version is 2; the three callout string fields were added for version 1,
+and the <i>callout_flags</i> field for version 2. If you are writing an
+application that might use an earlier release of PCRE2, you should check the
+version number before accessing any of these fields. The version number will
+increase in future if more fields are added, but the intention is never to
+remove any of the existing fields.
</P>
<br><b>
Fields for numerical callouts
@@ -235,8 +246,8 @@ Fields for numerical callouts
<P>
For a numerical callout, <i>callout_string</i> is NULL, and <i>callout_number</i>
contains the number of the callout, in the range 0-255. This is the number
-that follows (?C for manual callouts; it is 255 for automatically generated
-callouts.
+that follows (?C for callouts that are part of the pattern; it is 255 for
+automatically generated callouts.
</P>
<br><b>
Fields for string callouts
@@ -267,12 +278,42 @@ The remaining fields in the callout block are the same for both kinds of
callout.
</P>
<P>
-The <i>offset_vector</i> field is a pointer to the vector of capturing offsets
-(the "ovector") that was passed to the matching function in the match data
-block. When <b>pcre2_match()</b> is used, the contents can be inspected in
+The <i>offset_vector</i> field is a pointer to a vector of capturing offsets
+(the "ovector"). You may read the elements in this vector, but you must not
+change any of them.
+</P>
+<P>
+For calls to <b>pcre2_match()</b>, the <i>offset_vector</i> field is not (since
+release 10.30) a pointer to the actual ovector that was passed to the matching
+function in the match data block. Instead it points to an internal ovector of a
+size large enough to hold all possible captured substrings in the pattern. Note
+that whenever a recursion or subroutine call within a pattern completes, the
+capturing state is reset to what it was before.
+</P>
+<P>
+The <i>capture_last</i> field contains the number of the most recently captured
+substring, and the <i>capture_top</i> field contains one more than the number of
+the highest numbered captured substring so far. If no substrings have yet been
+captured, the value of <i>capture_last</i> is 0 and the value of
+<i>capture_top</i> is 1. The values of these fields do not always differ by one;
+for example, when the callout in the pattern ((a)(b))(?C2) is taken,
+<i>capture_last</i> is 1 but <i>capture_top</i> is 4.
+</P>
+<P>
+The contents of ovector[2] to ovector[&#60;capture_top&#62;*2-1] can be inspected in
order to extract substrings that have been matched so far, in the same way as
-for extracting substrings after a match has completed. For the DFA matching
-function, this field is not useful.
+extracting substrings after a match has completed. The values in ovector[0] and
+ovector[1] are always PCRE2_UNSET because the match is by definition not
+complete. Substrings that have not been captured but whose numbers are less
+than <i>capture_top</i> also have both of their ovector slots set to
+PCRE2_UNSET.
+</P>
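As an illustration of the paragraph above, the following sketch walks the offset pairs from ovector[2] up to ovector[capture_top*2-1] and skips any group whose slots are unset. It assumes the 8-bit library, so the subject is treated as a plain char string. DEMO_UNSET and demo_show_captures are hypothetical names; DEMO_UNSET is defined the same way as PCRE2_UNSET (all bits set in a size_t) so that the sketch compiles without pcre2.h.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Defined like PCRE2_UNSET in pcre2.h: all bits set in a size_t. */
#define DEMO_UNSET (~(size_t)0)

/* Print each group captured so far and return how many were set.
   ovector holds (start, end) offset pairs. Pair 0 is skipped because
   the overall match is not complete at callout time. */
static uint32_t demo_show_captures(const char *subject,
                                   const size_t *ovector,
                                   uint32_t capture_top)
{
  uint32_t set_count = 0;
  assert(subject != NULL && ovector != NULL);
  for (uint32_t n = 1; n < capture_top; n++) {
    size_t start = ovector[2 * n];
    size_t end = ovector[2 * n + 1];
    if (start == DEMO_UNSET || end == DEMO_UNSET) continue;
    if (end < start) continue;   /* defensive: ignore odd pairs */
    printf("group %u: \"%.*s\"\n", (unsigned)n,
           (int)(end - start), subject + start);
    set_count++;
  }
  return set_count;
}
```

In a real callout function the three arguments would come from the subject, offset_vector and capture_top fields of the callout block.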
+<P>
+For DFA matching, the <i>offset_vector</i> field points to the ovector that was
+passed to the matching function in the match data block, but it holds no useful
+information at callout time because <b>pcre2_dfa_match()</b> does not support
+substring capturing. The value of <i>capture_top</i> is always 1 and the value
+of <i>capture_last</i> is always 0 for DFA matching.
</P>
<P>
The <i>subject</i> and <i>subject_length</i> fields contain copies of the values
@@ -291,29 +332,20 @@ The <i>current_position</i> field contains the offset within the subject of the
current match pointer.
</P>
<P>
-When the <b>pcre2_match()</b> is used, the <i>capture_top</i> field contains one
-more than the number of the highest numbered captured substring so far. If no
-substrings have been captured, the value of <i>capture_top</i> is one. This is
-always the case when the DFA functions are used, because they do not support
-captured substrings.
-</P>
-<P>
-The <i>capture_last</i> field contains the number of the most recently captured
-substring. However, when a recursion exits, the value reverts to what it was
-outside the recursion, as do the values of all captured substrings. If no
-substrings have been captured, the value of <i>capture_last</i> is 0. This is
-always the case for the DFA matching functions.
-</P>
-<P>
The <i>pattern_position</i> field contains the offset in the pattern string to
the next item to be matched.
</P>
<P>
The <i>next_item_length</i> field contains the length of the next item to be
-matched in the pattern string. When the callout immediately precedes an
-alternation bar, a closing parenthesis, or the end of the pattern, the length
-is zero. When the callout precedes an opening parenthesis, the length is that
-of the entire subpattern.
+processed in the pattern string. When the callout is at the end of the pattern,
+the length is zero. When the callout precedes an opening parenthesis, the
+length includes meta characters that follow the parenthesis. For example, in a
+callout before an assertion such as (?=ab) the length is 3. For an
+alternation bar or a closing parenthesis, the length is one, unless a closing
+parenthesis is followed by a quantifier, in which case its length is included.
+(This changed in release 10.23. In earlier releases, before an opening
+parenthesis the length was that of the entire subpattern, and before an
+alternation bar or a closing parenthesis the length was zero.)
</P>
<P>
The <i>pattern_position</i> and <i>next_item_length</i> fields are intended to
@@ -329,6 +361,36 @@ the zero-terminated name of the most recently passed (*MARK), (*PRUNE), or
of (*PRUNE) or (*THEN) without a name do not obliterate a previous (*MARK). In
callouts from the DFA matching function this field always contains NULL.
</P>
+<P>
+The <i>callout_flags</i> field is always zero in callouts from
+<b>pcre2_dfa_match()</b> or when JIT is being used. When <b>pcre2_match()</b>
+without JIT is used, the following bits may be set:
+<pre>
+ PCRE2_CALLOUT_STARTMATCH
+</pre>
+This is set for the first callout after the start of matching for each new
+starting position in the subject.
+<pre>
+ PCRE2_CALLOUT_BACKTRACK
+</pre>
+This is set if there has been a matching backtrack since the previous callout,
+or since the start of matching if this is the first callout from a
+<b>pcre2_match()</b> run.
+</P>
+<P>
+Both bits are set when a backtrack has caused a "bumpalong" to a new starting
+position in the subject. Output from <b>pcre2test</b> does not indicate the
+presence of these bits unless the <b>callout_extra</b> modifier is set.
+</P>
+<P>
+The information in the <b>callout_flags</b> field is provided so that
+applications can track and tell their users how matching with backtracking is
+done. This can be useful when trying to optimize patterns, or just to
+understand how PCRE2 works. There is no support in <b>pcre2_dfa_match()</b>
+because there is no backtracking in DFA matching, and there is no support in
+JIT because JIT is all about maximizing matching performance. In both these
+cases the <b>callout_flags</b> field is always zero.
+</P>
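A callout function could classify the two flag bits along these lines. This is a sketch only: the DEMO_ constants are local stand-ins whose values (1 and 2) are assumed to match PCRE2_CALLOUT_STARTMATCH and PCRE2_CALLOUT_BACKTRACK in pcre2.h, and the enum and function names are hypothetical. A real program should include pcre2.h and use the library's own macros.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed to mirror the values in pcre2.h; check your own header. */
#define DEMO_CALLOUT_STARTMATCH 0x00000001u
#define DEMO_CALLOUT_BACKTRACK  0x00000002u

typedef enum {
  DEMO_NO_EVENT,      /* neither bit set */
  DEMO_NEW_START,     /* first callout for a new starting position */
  DEMO_BACKTRACKED,   /* a backtrack happened since the last callout */
  DEMO_BUMPALONG      /* both bits: a backtrack caused a bumpalong */
} demo_callout_event;

static demo_callout_event demo_classify_flags(uint32_t flags)
{
  int start = (flags & DEMO_CALLOUT_STARTMATCH) != 0;
  int back = (flags & DEMO_CALLOUT_BACKTRACK) != 0;
  /* Only the two documented bits are expected in this sketch. */
  assert((flags & ~(DEMO_CALLOUT_STARTMATCH | DEMO_CALLOUT_BACKTRACK)) == 0);
  if (start && back) return DEMO_BUMPALONG;
  if (start) return DEMO_NEW_START;
  if (back) return DEMO_BACKTRACKED;
  return DEMO_NO_EVENT;
}
```

Because the field is always zero for DFA and JIT matching, such a function would report DEMO_NO_EVENT for every callout in those modes.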
<br><a name="SEC5" href="#TOC1">RETURN VALUES FROM CALLOUTS</a><br>
<P>
The external callout function returns an integer to PCRE2. If the value is
@@ -399,9 +461,9 @@ Cambridge, England.
</P>
<br><a name="SEC8" href="#TOC1">REVISION</a><br>
<P>
-Last updated: 23 March 2015
+Last updated: 22 December 2017
<br>
-Copyright &copy; 1997-2015 University of Cambridge.
+Copyright &copy; 1997-2017 University of Cambridge.
<br>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
diff --git a/doc/html/pcre2compat.html b/doc/html/pcre2compat.html
index 3b29e6f..e6d2e7e 100644
--- a/doc/html/pcre2compat.html
+++ b/doc/html/pcre2compat.html
@@ -18,7 +18,8 @@ DIFFERENCES BETWEEN PCRE2 AND PERL
<P>
This document describes the differences in the ways that PCRE2 and Perl handle
regular expressions. The differences described here are with respect to Perl
-versions 5.10 and above.
+version 5.26, but as both Perl and PCRE2 are continually changing, the
+information may sometimes be out of date.
</P>
<P>
1. PCRE2 has only a subset of Perl's Unicode support. Details of what it does
@@ -27,17 +28,18 @@ have are given in the
page.
</P>
<P>
-2. PCRE2 allows repeat quantifiers only on parenthesized assertions, but they
-do not mean what you might think. For example, (?!a){3} does not assert that
-the next three characters are not "a". It just asserts that the next character
-is not "a" three times (in principle: PCRE2 optimizes this to run the assertion
-just once). Perl allows repeat quantifiers on other assertions such as \b, but
-these do not seem to have any use.
+2. Like Perl, PCRE2 allows repeat quantifiers on parenthesized assertions, but
+they do not mean what you might think. For example, (?!a){3} does not assert
+that the next three characters are not "a". It just asserts that the next
+character is not "a" three times (in principle: PCRE2 optimizes this to run the
+assertion just once). Perl allows some repeat quantifiers on other assertions,
+for example, \b* (but not \b{3}), but these do not seem to have any use.
</P>
<P>
-3. Capturing subpatterns that occur inside negative lookahead assertions are
-counted, but their entries in the offsets vector are never set. Perl sometimes
-(but not always) sets its numerical variables from inside negative assertions.
+3. Capturing subpatterns that occur inside negative lookaround assertions are
+counted, but their entries in the offsets vector are set only when a negative
+assertion is a condition that has a matching branch (that is, the condition is
+false).
</P>
<P>
4. The following Perl escape sequences are not supported: \l, \u, \L,
@@ -50,13 +52,13 @@ generated by default. However, if the PCRE2_ALT_BSUX option is set,
</P>
<P>
5. The Perl escape sequences \p, \P, and \X are supported only if PCRE2 is
-built with Unicode support. The properties that can be tested with \p and \P
-are limited to the general category properties such as Lu and Nd, script names
-such as Greek or Han, and the derived properties Any and L&. PCRE2 does support
-the Cs (surrogate) property, which Perl does not; the Perl documentation says
-"Because Perl hides the need for the user to understand the internal
-representation of Unicode characters, there is no need to implement the
-somewhat messy concept of surrogates."
+built with Unicode support (the default). The properties that can be tested
+with \p and \P are limited to the general category properties such as Lu and
+Nd, script names such as Greek or Han, and the derived properties Any and L&.
+PCRE2 does support the Cs (surrogate) property, which Perl does not; the Perl
+documentation says "Because Perl hides the need for the user to understand the
+internal representation of Unicode characters, there is no need to implement
+the somewhat messy concept of surrogates."
</P>
<P>
6. PCRE2 does support the \Q...\E escape for quoting substrings. Characters
@@ -75,23 +77,15 @@ The \Q...\E sequence is recognized both inside and outside character classes.
</P>
<P>
7. Fairly obviously, PCRE2 does not support the (?{code}) and (??{code})
-constructions. However, there is support for recursive patterns. This is not
-available in Perl 5.8, but it is in Perl 5.10. Also, the PCRE2 "callout"
-feature allows an external function to be called during pattern matching. See
-the
+constructions. However, PCRE2 has a "callout" feature, which
+allows an external function to be called during pattern matching. See the
<a href="pcre2callout.html"><b>pcre2callout</b></a>
documentation for details.
</P>
<P>
-8. Subroutine calls (whether recursive or not) are treated as atomic groups.
-Atomic recursion is like Python, but unlike Perl. Captured values that are set
-outside a subroutine call can be referenced from inside in PCRE2, but not in
-Perl. There is a discussion that explains these differences in more detail in
-the
-<a href="pcre2pattern.html#recursiondifference">section on recursion differences from Perl</a>
-in the
-<a href="pcre2pattern.html"><b>pcre2pattern</b></a>
-page.
+8. Subroutine calls (whether recursive or not) were treated as atomic groups up
+to PCRE2 release 10.23, but from release 10.30 this changed, and backtracking
+into subroutine calls is now supported, as in Perl.
</P>
<P>
9. If any of the backtracking control verbs are used in a subpattern that is
@@ -107,7 +101,7 @@ processed as anchored at the point where they are tested.
one that is backtracked onto acts. For example, in the pattern
A(*COMMIT)B(*PRUNE)C a failure in B triggers (*COMMIT), but a failure in C
triggers (*PRUNE). Perl's behaviour is more complex; in many cases it is the
-same as PCRE2, but there are examples where it differs.
+same as PCRE2, but there are cases where it differs.
</P>
<P>
11. Most backtracking verbs in assertions have their normal actions. They are
@@ -123,7 +117,7 @@ the pattern /^(a(b)?)+$/ in Perl leaves $2 unset, but in PCRE2 it is set to
13. PCRE2's handling of duplicate subpattern numbers and duplicate subpattern
names is not as general as Perl's. This is a consequence of the fact the PCRE2
works internally just with numbers, using an external table to translate
-between numbers and names. In particular, a pattern such as (?|(?&#60;a&#62;A)|(?&#60;b)B),
+between numbers and names. In particular, a pattern such as (?|(?&#60;a&#62;A)|(?&#60;b&#62;B)),
where the two capturing parentheses have the same number but different names,
is not supported, and causes an error at compile time. If it were allowed, it
would not be possible to distinguish which parentheses matched, because both
@@ -131,10 +125,11 @@ names map to capturing subpattern number 1. To avoid this confusing situation,
an error is given at compile time.
</P>
<P>
-14. Perl recognizes comments in some places that PCRE2 does not, for example,
-between the ( and ? at the start of a subpattern. If the /x modifier is set,
-Perl allows white space between ( and ? (though current Perls warn that this is
-deprecated) but PCRE2 never does, even if the PCRE2_EXTENDED option is set.
+14. Perl used to recognize comments in some places that PCRE2 does not, for
+example, between the ( and ? at the start of a subpattern. If the /x modifier
+is set, Perl allowed white space between ( and ? though the latest Perls give
+an error (for a while it was just deprecated). There may still be some cases
+where Perl behaves differently.
</P>
<P>
15. Perl, when in warning mode, gives warnings for character classes such as
@@ -146,14 +141,14 @@ certainly user mistakes.
16. In PCRE2, the upper/lower case character properties Lu and Ll are not
affected when case-independent matching is specified. For example, \p{Lu}
always matches an upper case letter. I think Perl has changed in this respect;
-in the release at the time of writing (5.16), \p{Lu} and \p{Ll} match all
+in the release at the time of writing (5.24), \p{Lu} and \p{Ll} match all
letters, regardless of case, when case independence is specified.
</P>
<P>
17. PCRE2 provides some extensions to the Perl regular expression facilities.
Perl 5.10 includes new features that are not in earlier versions of Perl, some
-of which (such as named parentheses) have been in PCRE2 for some time. This
-list is with respect to Perl 5.10:
+of which (such as named parentheses) were in PCRE2 for some time before. This
+list is with respect to Perl 5.26:
<br>
<br>
(a) Although lookbehind assertions in PCRE2 must match fixed length strings,
@@ -161,43 +156,63 @@ each alternative branch of a lookbehind assertion can match a different length
of string. Perl requires them all to have the same length.
<br>
<br>
-(b) If PCRE2_DOLLAR_ENDONLY is set and PCRE2_MULTILINE is not set, the $
+(b) From PCRE2 10.23, back references to groups of fixed length are supported
+in lookbehinds, provided that there is no possibility of referencing a
+non-unique number or name. Perl does not support backreferences in lookbehinds.
+<br>
+<br>
+(c) If PCRE2_DOLLAR_ENDONLY is set and PCRE2_MULTILINE is not set, the $
meta-character matches only at the very end of the string.
<br>
<br>
-(c) A backslash followed by a letter with no special meaning is faulted. (Perl
+(d) A backslash followed by a letter with no special meaning is faulted. (Perl
can be made to issue a warning.)
<br>
<br>
-(d) If PCRE2_UNGREEDY is set, the greediness of the repetition quantifiers is
+(e) If PCRE2_UNGREEDY is set, the greediness of the repetition quantifiers is
inverted, that is, by default they are not greedy, but if followed by a
question mark they are.
<br>
<br>
-(e) PCRE2_ANCHORED can be used at matching time to force a pattern to be tried
+(f) PCRE2_ANCHORED can be used at matching time to force a pattern to be tried
only at the first matching position in the subject string.
<br>
<br>
-(f) The PCRE2_NOTBOL, PCRE2_NOTEOL, PCRE2_NOTEMPTY, PCRE2_NOTEMPTY_ATSTART, and
-PCRE2_NO_AUTO_CAPTURE options have no Perl equivalents.
+(g) The PCRE2_NOTBOL, PCRE2_NOTEOL, PCRE2_NOTEMPTY and PCRE2_NOTEMPTY_ATSTART
+options have no Perl equivalents.
<br>
<br>
-(g) The \R escape sequence can be restricted to match only CR, LF, or CRLF
+(h) The \R escape sequence can be restricted to match only CR, LF, or CRLF
by the PCRE2_BSR_ANYCRLF option.
<br>
<br>
-(h) The callout facility is PCRE2-specific.
+(i) The callout facility is PCRE2-specific. Perl supports code blocks and
+variable interpolation, but not general hooks on every match.
<br>
<br>
-(i) The partial matching facility is PCRE2-specific.
+(j) The partial matching facility is PCRE2-specific.
<br>
<br>
-(j) The alternative matching function (<b>pcre2_dfa_match()</b> matches in a
+(k) The alternative matching function (<b>pcre2_dfa_match()</b>) matches in a
different way and is not Perl-compatible.
<br>
<br>
-(k) PCRE2 recognizes some special sequences such as (*CR) at the start of
-a pattern that set overall options that cannot be changed within the pattern.
+(l) PCRE2 recognizes some special sequences such as (*CR) or (*NO_JIT) at
+the start of a pattern that set overall options that cannot be changed within
+the pattern.
+</P>
+<P>
+18. The Perl /a modifier restricts \d numbers to pure ASCII, and the /aa
+modifier restricts /i case-insensitive matching to pure ASCII, ignoring Unicode
+rules. This separation cannot be represented with PCRE2_UCP.
+</P>
+<P>
+19. Perl has different limits than PCRE2. See the
+<a href="pcre2limit.html"><b>pcre2limit</b></a>
+documentation for details. In release 5.10, Perl changed from recursion to
+iteration, keeping the intermediate matches on the heap, which is about 10%
+slower but avoids any stack-overflow limit. PCRE2 made a similar change at release
+10.30, and also has many build-time and run-time customizable limits.
</P>
<br><b>
AUTHOR
@@ -214,9 +229,9 @@ Cambridge, England.
REVISION
</b><br>
<P>
-Last updated: 15 March 2015
+Last updated: 18 April 2017
<br>
-Copyright &copy; 1997-2015 University of Cambridge.
+Copyright &copy; 1997-2017 University of Cambridge.
<br>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
diff --git a/doc/html/pcre2convert.html b/doc/html/pcre2convert.html
new file mode 100644
index 0000000..8b4d87f
--- /dev/null
+++ b/doc/html/pcre2convert.html
@@ -0,0 +1,190 @@
+<html>
+<head>
+<title>pcre2convert specification</title>
+</head>
+<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
+<h1>pcre2convert man page</h1>
+<p>
+Return to the <a href="index.html">PCRE2 index page</a>.
+</p>
+<p>
+This page is part of the PCRE2 HTML documentation. It was generated
+automatically from the original man page. If there is any nonsense in it,
+please consult the man page, in case the conversion went wrong.
+<br>
+<ul>
+<li><a name="TOC1" href="#SEC1">EXPERIMENTAL PATTERN CONVERSION FUNCTIONS</a>
+<li><a name="TOC2" href="#SEC2">THE CONVERT CONTEXT</a>
+<li><a name="TOC3" href="#SEC3">THE CONVERSION FUNCTION</a>
+<li><a name="TOC4" href="#SEC4">CONVERTING GLOBS</a>
+<li><a name="TOC5" href="#SEC5">CONVERTING POSIX PATTERNS</a>
+<li><a name="TOC6" href="#SEC6">AUTHOR</a>
+<li><a name="TOC7" href="#SEC7">REVISION</a>
+</ul>
+<br><a name="SEC1" href="#TOC1">EXPERIMENTAL PATTERN CONVERSION FUNCTIONS</a><br>
+<P>
+This document describes a set of functions that can be used to convert
+"foreign" patterns into PCRE2 regular expressions. This facility is currently
+experimental, and may be changed in future releases. Two kinds of pattern,
+globs and POSIX patterns, are supported.
+</P>
+<br><a name="SEC2" href="#TOC1">THE CONVERT CONTEXT</a><br>
+<P>
+<b>pcre2_convert_context *pcre2_convert_context_create(</b>
+<b> pcre2_general_context *<i>gcontext</i>);</b>
+<br>
+<br>
+<b>pcre2_convert_context *pcre2_convert_context_copy(</b>
+<b> pcre2_convert_context *<i>cvcontext</i>);</b>
+<br>
+<br>
+<b>void pcre2_convert_context_free(pcre2_convert_context *<i>cvcontext</i>);</b>
+<br>
+<br>
+<b>int pcre2_set_glob_escape(pcre2_convert_context *<i>cvcontext</i>,</b>
+<b> uint32_t <i>escape_char</i>);</b>
+<br>
+<br>
+<b>int pcre2_set_glob_separator(pcre2_convert_context *<i>cvcontext</i>,</b>
+<b> uint32_t <i>separator_char</i>);</b>
+<br>
+<br>
+A convert context is used to hold parameters that affect the way that pattern
+conversion works. Like all PCRE2 contexts, you need to use a context only if
+you want to override the defaults. There are the usual create, copy, and free
+functions. If custom memory management functions are set in a general context
+that is passed to <b>pcre2_convert_context_create()</b>, they are used for all
+memory management within the conversion functions.
+</P>
+<P>
+There are only two parameters in the convert context at present. Both apply
+only to glob conversions. The escape character defaults to grave accent under
+Windows, otherwise backslash. It can be set to zero, meaning no escape
+character, or to any punctuation character with a code point less than 256.
+The separator character defaults to backslash under Windows, otherwise forward
+slash. It can be set to forward slash, backslash, or dot.
+</P>
+<P>
+The two setting functions return zero on success, or PCRE2_ERROR_BADDATA if
+their second argument is invalid.
+</P>
+<br><a name="SEC3" href="#TOC1">THE CONVERSION FUNCTION</a><br>
+<P>
+<b>int pcre2_pattern_convert(PCRE2_SPTR <i>pattern</i>, PCRE2_SIZE <i>length</i>,</b>
+<b> uint32_t <i>options</i>, PCRE2_UCHAR **<i>buffer</i>,</b>
+<b> PCRE2_SIZE *<i>blength</i>, pcre2_convert_context *<i>cvcontext</i>);</b>
+<br>
+<br>
+<b>void pcre2_converted_pattern_free(PCRE2_UCHAR *<i>converted_pattern</i>);</b>
+<br>
+<br>
+The first two arguments of <b>pcre2_pattern_convert()</b> define the foreign
+pattern that is to be converted. The length may be given as
+PCRE2_ZERO_TERMINATED. The <b>options</b> argument defines how the pattern is to
+be processed. If the input is UTF, the PCRE2_CONVERT_UTF option should be set.
+PCRE2_CONVERT_NO_UTF_CHECK may also be set if you are sure the input is valid.
+One or more of the following glob options, or one of the following POSIX options, must be
+set to define the type of conversion that is required:
+<pre>
+ PCRE2_CONVERT_GLOB
+ PCRE2_CONVERT_GLOB_NO_WILD_SEPARATOR
+ PCRE2_CONVERT_GLOB_NO_STARSTAR
+ PCRE2_CONVERT_POSIX_BASIC
+ PCRE2_CONVERT_POSIX_EXTENDED
+</pre>
+Details of the conversions are given below. The <b>buffer</b> and <b>blength</b>
+arguments define how the output is handled:
+</P>
+<P>
+If <b>buffer</b> is NULL, the function just returns the length of the converted
+pattern via <b>blength</b>. This is one less than the length of buffer needed,
+because a terminating zero is always added to the output.
+</P>
+<P>
+If <b>buffer</b> points to a NULL pointer, an output buffer is obtained using
+the allocator in the context or <b>malloc()</b> if no context is supplied. A
+pointer to this buffer is placed in the variable to which <b>buffer</b> points.
+When no longer needed the output buffer must be freed by calling
+<b>pcre2_converted_pattern_free()</b>.
+</P>
+<P>
+If <b>buffer</b> points to a non-NULL pointer, <b>blength</b> must be set to the
+actual length of the buffer provided (in code units).
+</P>
+<P>
+In all cases, after successful conversion, the variable pointed to by
+<b>blength</b> is updated to the length actually used (in code units), excluding
+the terminating zero that is always added.
+</P>
+<P>
+If an error occurs, the length (via <b>blength</b>) is set to the offset
+within the input pattern where the error was detected. Only gross syntax errors
+are caught; there are plenty of errors that will get passed on for
+<b>pcre2_compile()</b> to discover.
+</P>
+<P>
+The return from <b>pcre2_pattern_convert()</b> is zero on success or a non-zero
+PCRE2 error code. Note that PCRE2 error codes may be positive or negative:
+<b>pcre2_compile()</b> uses mostly positive codes and <b>pcre2_match()</b>
+negative ones; <b>pcre2_pattern_convert()</b> uses existing codes of both kinds. A
+textual error message can be obtained by calling
+<b>pcre2_get_error_message()</b>.
+</P>
+<br><a name="SEC4" href="#TOC1">CONVERTING GLOBS</a><br>
+<P>
+Globs are used to match file names, and consequently have the concept of a
+"path separator", which defaults to backslash under Windows and forward slash
+otherwise. If PCRE2_CONVERT_GLOB is set, the wildcards * and ? are not
+permitted to match separator characters, but the double-star (**) feature
+(which does match separators) is supported.
+</P>
+<P>
+PCRE2_CONVERT_GLOB_NO_WILD_SEPARATOR matches globs with wildcards allowed to
+match separator characters. PCRE2_CONVERT_GLOB_NO_STARSTAR matches globs with the
+double-star feature disabled. These options may be given together.
+</P>
+<br><a name="SEC5" href="#TOC1">CONVERTING POSIX PATTERNS</a><br>
+<P>
+POSIX defines two kinds of regular expression pattern: basic and extended.
+These can be processed by setting PCRE2_CONVERT_POSIX_BASIC or
+PCRE2_CONVERT_POSIX_EXTENDED, respectively.
+</P>
+<P>
+In POSIX patterns, backslash is not special in a character class. Unmatched
+closing parentheses are treated as literals.
+</P>
+<P>
+In basic patterns, ? + | {} and () must be escaped to be recognized
+as metacharacters outside a character class. If the first character in the
+pattern is * it is treated as a literal. ^ is a metacharacter only at the start
+of a branch.
+</P>
+<P>
+In extended patterns, a backslash not in a character class always
+makes the next character literal, whatever it is. There are no backreferences.
+</P>
+<P>
+Note: POSIX mandates that the longest possible match at the first matching
+position must be found. This is not what <b>pcre2_match()</b> does; it yields
+the first match that is found. An application can use <b>pcre2_dfa_match()</b>
+to find the longest match, but that does not support backreferences (but then
+neither do POSIX extended patterns).
+</P>
+<br><a name="SEC6" href="#TOC1">AUTHOR</a><br>
+<P>
+Philip Hazel
+<br>
+University Computing Service
+<br>
+Cambridge, England.
+<br>
+</P>
+<br><a name="SEC7" href="#TOC1">REVISION</a><br>
+<P>
+Last updated: 12 July 2017
+<br>
+Copyright &copy; 1997-2017 University of Cambridge.
+<br>
+<p>
+Return to the <a href="index.html">PCRE2 index page</a>.
+</p>
diff --git a/doc/html/pcre2demo.html b/doc/html/pcre2demo.html
index d64e16b..72754d3 100644
--- a/doc/html/pcre2demo.html
+++ b/doc/html/pcre2demo.html
@@ -228,6 +228,21 @@ pcre2_match_data_create_from_pattern() above. */
if (rc == 0)
printf("ovector was not big enough for all the captured substrings\n");
+/* We must guard against patterns such as /(?=.\K)/ that use \K in an assertion
+to set the start of a match later than its end. In this demonstration program,
+we just detect this case and give up. */
+
+if (ovector[0] &gt; ovector[1])
+ {
+ printf("\\K was used in an assertion to set the match start after its end.\n"
+ "From end to start the match was: %.*s\n", (int)(ovector[0] - ovector[1]),
+ (char *)(subject + ovector[1]));
+ printf("Run abandoned\n");
+ pcre2_match_data_free(match_data);
+ pcre2_code_free(re);
+ return 1;
+ }
+
/* Show substrings stored in the output vector by number. Obviously, in a real
application you might want to do things other than print them. */
@@ -355,6 +370,29 @@ for (;;)
options = PCRE2_NOTEMPTY_ATSTART | PCRE2_ANCHORED;
}
+ /* If the previous match was not an empty string, there is one tricky case to
+ consider. If a pattern contains \K within a lookbehind assertion at the
+ start, the end of the matched string can be at the offset where the match
+ started. Without special action, this leads to a loop that keeps on matching
+ the same substring. We must detect this case and arrange to move the start on
+ by one character. The pcre2_get_startchar() function returns the starting
+ offset that was passed to pcre2_match(). */
+
+ else
+ {
+ PCRE2_SIZE startchar = pcre2_get_startchar(match_data);
+ if (start_offset &lt;= startchar)
+ {
+ if (startchar &gt;= subject_length) break; /* Reached end of subject. */
+ start_offset = startchar + 1; /* Advance by one character. */
+ if (utf8) /* If UTF-8, it may be more */
+ { /* than one code unit. */
+ for (; start_offset &lt; subject_length; start_offset++)
+ if ((subject[start_offset] &amp; 0xc0) != 0x80) break;
+ }
+ }
+ }
+
/* Run the next matching operation */
rc = pcre2_match(
@@ -419,6 +457,21 @@ for (;;)
if (rc == 0)
printf("ovector was not big enough for all the captured substrings\n");
+ /* We must guard against patterns such as /(?=.\K)/ that use \K in an
+ assertion to set the start of a match later than its end. In this
+ demonstration program, we just detect this case and give up. */
+
+ if (ovector[0] &gt; ovector[1])
+ {
+ printf("\\K was used in an assertion to set the match start after its end.\n"
+ "From end to start the match was: %.*s\n", (int)(ovector[0] - ovector[1]),
+ (char *)(subject + ovector[1]));
+ printf("Run abandoned\n");
+ pcre2_match_data_free(match_data);
+ pcre2_code_free(re);
+ return 1;
+ }
+
/* As before, show substrings stored in the output vector by number, and then
also any named substrings. */
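The "advance by one character" step that the pcre2demo change adds to its matching loop can be looked at in isolation. The following is a minimal sketch in plain C, not code from PCRE2 itself: the helper name next_char_offset and its parameters are hypothetical, and it assumes the subject is valid UTF-8 held as unsigned bytes. It steps over one code unit and then skips any continuation bytes (those of the form 10xxxxxx), which is the same test the demo applies with (subject[start_offset] & 0xc0) != 0x80.

```c
#include <stddef.h>

/* Hypothetical helper (not part of the PCRE2 API): given an offset into a
   UTF-8 subject of `length` code units, return the offset at which the next
   character starts. Returns `length` when there is no further character. */
size_t next_char_offset(const unsigned char *subject, size_t length,
                        size_t offset)
{
    if (offset >= length) return length;   /* Already at the end of subject. */

    offset++;                              /* Step over at least one unit.   */

    /* UTF-8 continuation bytes have the bit pattern 10xxxxxx; keep moving
       until the next byte that starts a character, or the end of subject. */
    while (offset < length && (subject[offset] & 0xc0) == 0x80)
        offset++;

    return offset;
}
```

In a non-UTF-8 subject the while loop is simply omitted, because every code unit is a character; the demo makes the same distinction with its utf8 flag.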
diff --git a/doc/html/pcre2grep.html b/doc/html/pcre2grep.html
index d02d365..625a467 100644
--- a/doc/html/pcre2grep.html
+++ b/doc/html/pcre2grep.html
@@ -22,7 +22,7 @@ please consult the man page, in case the conversion went wrong.
<li><a name="TOC7" href="#SEC7">NEWLINES</a>
<li><a name="TOC8" href="#SEC8">OPTIONS COMPATIBILITY</a>
<li><a name="TOC9" href="#SEC9">OPTIONS WITH DATA</a>
-<li><a name="TOC10" href="#SEC10">CALLING EXTERNAL SCRIPTS</a>
+<li><a name="TOC10" href="#SEC10">USING PCRE2'S CALLOUT FACILITY</a>
<li><a name="TOC11" href="#SEC11">MATCHING ERRORS</a>
<li><a name="TOC12" href="#SEC12">DIAGNOSTICS</a>
<li><a name="TOC13" href="#SEC13">SEE ALSO</a>
@@ -80,11 +80,19 @@ span line boundaries. What defines a line boundary is controlled by the
</P>
<P>
The amount of memory used for buffering files that are being scanned is
-controlled by a parameter that can be set by the <b>--buffer-size</b> option.
-The default value for this parameter is specified when <b>pcre2grep</b> is
-built, with the default default being 20K. A block of memory three times this
-size is used (to allow for buffering "before" and "after" lines). An error
-occurs if a line overflows the buffer.
+controlled by parameters that can be set by the <b>--buffer-size</b> and
+<b>--max-buffer-size</b> options. The first of these sets the size of buffer
+that is obtained at the start of processing. If an input file contains very
+long lines, a larger buffer may be needed; this is handled by automatically
+extending the buffer, up to the limit specified by <b>--max-buffer-size</b>. The
+default values for these parameters are specified when <b>pcre2grep</b> is
+built, with the default defaults being 20K and 1M respectively. An error occurs
+if a line is too long and the buffer can no longer be expanded.
+</P>
+<P>
+The block of memory that is actually used is three times the "buffer size", to
+allow for buffering "before" and "after" lines. If the buffer size is too
+small, fewer than requested "before" and "after" lines may be output.
</P>
<P>
Patterns can be no longer than 8K or BUFSIZ bytes, whichever is the greater.
@@ -125,23 +133,27 @@ The <b>--locale</b> option can be used to override this.
<br><a name="SEC3" href="#TOC1">SUPPORT FOR COMPRESSED FILES</a><br>
<P>
It is possible to compile <b>pcre2grep</b> so that it uses <b>libz</b> or
-<b>libbz2</b> to read files whose names end in <b>.gz</b> or <b>.bz2</b>,
-respectively. You can find out whether your binary has support for one or both
-of these file types by running it with the <b>--help</b> option. If the
-appropriate support is not present, files are treated as plain text. The
-standard input is always so treated.
+<b>libbz2</b> to read compressed files whose names end in <b>.gz</b> or
+<b>.bz2</b>, respectively. You can find out whether your <b>pcre2grep</b> binary
+has support for one or both of these file types by running it with the
+<b>--help</b> option. If the appropriate support is not present, all files are
+treated as plain text. The standard input is always so treated. When input is
+from a compressed .gz or .bz2 file, the <b>--line-buffered</b> option is
+ignored.
</P>
<br><a name="SEC4" href="#TOC1">BINARY FILES</a><br>
<P>
By default, a file that contains a binary zero byte within the first 1024 bytes
-is identified as a binary file, and is processed specially. (GNU grep also
-identifies binary files in this manner.) See the <b>--binary-files</b> option
-for a means of changing the way binary files are handled.
+is identified as a binary file, and is processed specially. (GNU grep
+identifies binary files in this manner.) However, if the newline type is
+specified as "nul", that is, the line terminator is a binary zero, the test for
+a binary file is not applied. See the <b>--binary-files</b> option for a means
+of changing the way binary files are handled.
</P>
<br><a name="SEC5" href="#TOC1">OPTIONS</a><br>
<P>
The order in which some of the options appear can affect the output. For
-example, both the <b>-h</b> and <b>-l</b> options affect the printing of file
+example, both the <b>-H</b> and <b>-l</b> options affect the printing of file
names. Whichever comes later in the command line will be the one that takes
effect. Similarly, except where noted below, if an option is given twice, the
later setting is used. Numerical values for options may be followed by K or M,
@@ -155,12 +167,13 @@ processing of patterns and file names that start with hyphens.
</P>
<P>
<b>-A</b> <i>number</i>, <b>--after-context=</b><i>number</i>
-Output <i>number</i> lines of context after each matching line. If file names
-and/or line numbers are being output, a hyphen separator is used instead of a
-colon for the context lines. A line containing "--" is output between each
-group of lines, unless they are in fact contiguous in the input file. The value
-of <i>number</i> is expected to be relatively small. However, <b>pcre2grep</b>
-guarantees to have up to 8K of following text available for context output.
+Output up to <i>number</i> lines of context after each matching line. Fewer
+lines are output if the next match or the end of the file is reached, or if the
+processing buffer size has been set too small. If file names and/or line
+numbers are being output, a hyphen separator is used instead of a colon for the
+context lines. A line containing "--" is output between each group of lines,
+unless they are in fact contiguous in the input file. The value of <i>number</i>
+is expected to be relatively small. When <b>-c</b> is used, <b>-A</b> is ignored.
</P>
<P>
<b>-a</b>, <b>--text</b>
@@ -169,12 +182,14 @@ Treat binary files as text. This is equivalent to
</P>
<P>
<b>-B</b> <i>number</i>, <b>--before-context=</b><i>number</i>
-Output <i>number</i> lines of context before each matching line. If file names
-and/or line numbers are being output, a hyphen separator is used instead of a
-colon for the context lines. A line containing "--" is output between each
-group of lines, unless they are in fact contiguous in the input file. The value
-of <i>number</i> is expected to be relatively small. However, <b>pcre2grep</b>
-guarantees to have up to 8K of preceding text available for context output.
+Output up to <i>number</i> lines of context before each matching line. Fewer
+lines are output if the previous match or the start of the file is within
+<i>number</i> lines, or if the processing buffer size has been set too small. If
+file names and/or line numbers are being output, a hyphen separator is used
+instead of a colon for the context lines. A line containing "--" is output
+between each group of lines, unless they are in fact contiguous in the input
+file. The value of <i>number</i> is expected to be relatively small. When
+<b>-c</b> is used, <b>-B</b> is ignored.
</P>
<P>
<b>--binary-files=</b><i>word</i>
@@ -191,8 +206,9 @@ return code.
</P>
<P>
<b>--buffer-size=</b><i>number</i>
-Set the parameter that controls how much memory is used for buffering files
-that are being scanned.
+Set the parameter that controls how much memory is obtained at the start of
+processing for buffering files that are being scanned. See also
+<b>--max-buffer-size</b> below.
</P>
<P>
<b>-C</b> <i>number</i>, <b>--context=</b><i>number</i>
@@ -202,14 +218,16 @@ This is equivalent to setting both <b>-A</b> and <b>-B</b> to the same value.
<P>
<b>-c</b>, <b>--count</b>
Do not output lines from the files that are being scanned; instead output the
-number of matches (or non-matches if <b>-v</b> is used) that would otherwise
-have caused lines to be shown. By default, this count is the same as the number
-of suppressed lines, but if the <b>-M</b> (multiline) option is used (without
-<b>-v</b>), there may be more suppressed lines than the number of matches.
+number of lines that would have been shown, either because they matched, or, if
+<b>-v</b> is set, because they failed to match. By default, this count is
+exactly the same as the number of lines that would have been output, but if the
+<b>-M</b> (multiline) option is used (without <b>-v</b>), there may be more
+suppressed lines than the count (that is, the number of matches).
<br>
<br>
If no lines are selected, the number zero is output. If several files are are
-being scanned, a count is output for each of them. However, if the
+being scanned, a count is output for each of them and the <b>-t</b> option can
+be used to cause a total to be output at the end. However, if the
<b>--files-with-matches</b> option is also used, only those files whose counts
are greater than zero are listed. When <b>-c</b> is used, the <b>-A</b>,
<b>-B</b>, and <b>-C</b> options are ignored.
@@ -231,12 +249,23 @@ because <b>pcre2grep</b> has to search for all possible matches in a line, not
just one, in order to colour them all.
<br>
<br>
-The colour that is used can be specified by setting the environment variable
-PCRE2GREP_COLOUR or PCRE2GREP_COLOR. The value of this variable should be a
-string of two numbers, separated by a semicolon. They are copied directly into
-the control string for setting colour on a terminal, so it is your
-responsibility to ensure that they make sense. If neither of the environment
-variables is set, the default is "1;31", which gives red.
+The colour that is used can be specified by setting one of the environment
+variables PCRE2GREP_COLOUR, PCRE2GREP_COLOR, PCREGREP_COLOUR, or
+PCREGREP_COLOR, which are checked in that order. If none of these are set,
+<b>pcre2grep</b> looks for GREP_COLORS or GREP_COLOR (in that order). The value
+of the variable should be a string of two numbers, separated by a semicolon,
+except in the case of GREP_COLORS, which must start with "ms=" or "mt="
+followed by two semicolon-separated colours, terminated by the end of the
+string or by a colon. If GREP_COLORS does not start with "ms=" or "mt=" it is
+ignored, and GREP_COLOR is checked.
+<br>
+<br>
+If the string obtained from one of the above variables contains any characters
+other than semicolon or digits, the setting is ignored and the default colour
+is used. The string is copied directly into the control string for setting
+colour on a terminal, so it is your responsibility to ensure that the values
+make sense. If no relevant environment variable is set, the default is "1;31",
+which gives red.
</P>
<P>
<b>-D</b> <i>action</i>, <b>--devices=</b><i>action</i>
@@ -255,6 +284,10 @@ operating systems the effect of reading a directory like this is an immediate
end-of-file; in others it may provoke an error.
</P>
<P>
+<b>--depth-limit</b>=<i>number</i>
+See <b>--match-limit</b> below.
+</P>
+<P>
<b>-e</b> <i>pattern</i>, <b>--regex=</b><i>pattern</i>, <b>--regexp=</b><i>pattern</i>
Specify a pattern to be matched. This option can be used multiple times in
order to specify several patterns. It can also be used as a way of specifying a
@@ -321,18 +354,18 @@ files; it does not apply to patterns specified by any of the <b>--include</b> or
</P>
<P>
<b>-f</b> <i>filename</i>, <b>--file=</b><i>filename</i>
-Read patterns from the file, one per line, and match them against
-each line of input. What constitutes a newline when reading the file is the
-operating system's default. The <b>--newline</b> option has no effect on this
-option. Trailing white space is removed from each line, and blank lines are
-ignored. An empty file contains no patterns and therefore matches nothing. See
-also the comments about multiple patterns versus a single pattern with
-alternatives in the description of <b>-e</b> above.
-<br>
-<br>
-If this option is given more than once, all the specified files are
-read. A data line is output if any of the patterns match it. A file name can
-be given as "-" to refer to the standard input. When <b>-f</b> is used, patterns
+Read patterns from the file, one per line, and match them against each line of
+input. What constitutes a newline when reading the file is the operating
+system's default. The <b>--newline</b> option has no effect on this option.
+Trailing white space is removed from each line, and blank lines are ignored. An
+empty file contains no patterns and therefore matches nothing. See also the
+comments about multiple patterns versus a single pattern with alternatives in
+the description of <b>-e</b> above.
+<br>
+<br>
+If this option is given more than once, all the specified files are read. A
+data line is output if any of the patterns match it. A file name can be given
+as "-" to refer to the standard input. When <b>-f</b> is used, patterns
specified on the command line using <b>-e</b> may also be present; they are
tested before the file's patterns. However, no other pattern is taken from the
command line; all arguments are treated as the names of paths to be searched.
@@ -355,8 +388,8 @@ Instead of showing lines or parts of lines that match, show each match as an
offset from the start of the file and a length, separated by a comma. In this
mode, no context is shown. That is, the <b>-A</b>, <b>-B</b>, and <b>-C</b>
options are ignored. If there is more than one match in a line, each of them is
-shown separately. This option is mutually exclusive with <b>--line-offsets</b>
-and <b>--only-matching</b>.
+shown separately. This option is mutually exclusive with <b>--output</b>,
+<b>--line-offsets</b>, and <b>--only-matching</b>.
</P>
<P>
<b>-H</b>, <b>--with-filename</b>
@@ -365,14 +398,20 @@ searching a single file. By default, the file name is not shown in this case.
For matching lines, the file name is followed by a colon; for context lines, a
hyphen separator is used. If a line number is also being output, it follows the
file name. When the <b>-M</b> option causes a pattern to match more than one
-line, only the first is preceded by the file name.
+line, only the first is preceded by the file name. This option overrides any
+previous <b>-h</b>, <b>-l</b>, or <b>-L</b> options.
</P>
<P>
<b>-h</b>, <b>--no-filename</b>
Suppress the output file names when searching multiple files. By default,
file names are shown when multiple files are searched. For matching lines, the
file name is followed by a colon; for context lines, a hyphen separator is used.
-If a line number is also being output, it follows the file name.
+If a line number is also being output, it follows the file name. This option
+overrides any previous <b>-H</b>, <b>-L</b>, or <b>-l</b> options.
+</P>
+<P>
+<b>--heap-limit</b>=<i>number</i>
+See <b>--match-limit</b> below.
</P>
<P>
<b>--help</b>
@@ -425,17 +464,19 @@ given any number of times. If a directory matches both <b>--include-dir</b> and
<b>-L</b>, <b>--files-without-match</b>
Instead of outputting lines from the files, just output the names of the files
that do not contain any lines that would have been output. Each file name is
-output once, on a separate line.
+output once, on a separate line. This option overrides any previous <b>-H</b>,
+<b>-h</b>, or <b>-l</b> options.
</P>
<P>
<b>-l</b>, <b>--files-with-matches</b>
Instead of outputting lines from the files, just output the names of the files
-containing lines that would have been output. Each file name is output
-once, on a separate line. Searching normally stops as soon as a matching line
-is found in a file. However, if the <b>-c</b> (count) option is also used,
-matching continues in order to obtain the correct count, and those files that
-have at least one match are listed along with their counts. Using this option
-with <b>-c</b> is a way of suppressing the listing of files with no matches.
+containing lines that would have been output. Each file name is output once, on
+a separate line. Searching normally stops as soon as a matching line is found
+in a file. However, if the <b>-c</b> (count) option is also used, matching
+continues in order to obtain the correct count, and those files that have at
+least one match are listed along with their counts. Using this option with
+<b>-c</b> is a way of suppressing the listing of files with no matches. This
+option overrides any previous <b>-H</b>, <b>-h</b>, or <b>-L</b> options.
</P>
<P>
<b>--label</b>=<i>name</i>
@@ -445,14 +486,16 @@ short form for this option.
</P>
<P>
<b>--line-buffered</b>
-When this option is given, input is read and processed line by line, and the
-output is flushed after each write. By default, input is read in large chunks,
-unless <b>pcre2grep</b> can determine that it is reading from a terminal (which
-is currently possible only in Unix-like environments). Output to terminal is
-normally automatically flushed by the operating system. This option can be
-useful when the input or output is attached to a pipe and you do not want
-<b>pcre2grep</b> to buffer up large amounts of data. However, its use will
-affect performance, and the <b>-M</b> (multiline) option ceases to work.
+When this option is given, non-compressed input is read and processed line by
+line, and the output is flushed after each write. By default, input is read in
+large chunks, unless <b>pcre2grep</b> can determine that it is reading from a
+terminal (which is currently possible only in Unix-like environments). Output
+to terminal is normally automatically flushed by the operating system. This
+option can be useful when the input or output is attached to a pipe and you do
+not want <b>pcre2grep</b> to buffer up large amounts of data. However, its use
+will affect performance, and the <b>-M</b> (multiline) option ceases to work.
+When input is from a compressed .gz or .bz2 file, <b>--line-buffered</b> is
+ignored.
</P>
<P>
<b>--line-offsets</b>
@@ -462,7 +505,8 @@ number is terminated by a colon (as usual; see the <b>-n</b> option), and the
offset and length are separated by a comma. In this mode, no context is shown.
That is, the <b>-A</b>, <b>-B</b>, and <b>-C</b> options are ignored. If there is
more than one match in a line, each of them is shown separately. This option is
-mutually exclusive with <b>--file-offsets</b> and <b>--only-matching</b>.
+mutually exclusive with <b>--output</b>, <b>--file-offsets</b>, and
+<b>--only-matching</b>.
</P>
<P>
<b>--locale</b>=<i>locale-name</i>
@@ -473,51 +517,57 @@ used. There is no short form for this option.
</P>
<P>
<b>--match-limit</b>=<i>number</i>
-Processing some regular expression patterns can require a very large amount of
-memory, leading in some cases to a program crash if not enough is available.
-Other patterns may take a very long time to search for all possible matching
-strings. The <b>pcre2_match()</b> function that is called by <b>pcre2grep</b> to
-do the matching has two parameters that can limit the resources that it uses.
+Some regular expression patterns may take a very long time to search for all
+possible matching strings. Others may require a very large amount of memory.
+There are three options that set resource limits for matching.
+<br>
+<br>
+The <b>--match-limit</b> option provides a means of limiting computing resource
+usage when processing patterns that are not going to match, but which have a
+very large number of possibilities in their search trees. The classic example
+is a pattern that uses nested unlimited repeats. Internally, PCRE2 has a
+counter that is incremented each time around its main processing loop. If the
+value set by <b>--match-limit</b> is reached, an error occurs.
<br>
<br>
-The <b>--match-limit</b> option provides a means of limiting resource usage
-when processing patterns that are not going to match, but which have a very
-large number of possibilities in their search trees. The classic example is a
-pattern that uses nested unlimited repeats. Internally, PCRE2 uses a function
-called <b>match()</b> which it calls repeatedly (sometimes recursively). The
-limit set by <b>--match-limit</b> is imposed on the number of times this
-function is called during a match, which has the effect of limiting the amount
-of backtracking that can take place.
+The <b>--heap-limit</b> option specifies, as a number of kilobytes, the amount
+of heap memory that may be used for matching. Heap memory is needed only if
+matching the pattern requires a significant number of nested backtracking
+points to be remembered. This parameter can be set to zero to forbid the use of
+heap memory altogether.
<br>
<br>
-The <b>--recursion-limit</b> option is similar to <b>--match-limit</b>, but
-instead of limiting the total number of times that <b>match()</b> is called, it
-limits the depth of recursive calls, which in turn limits the amount of memory
-that can be used. The recursion depth is a smaller number than the total number
-of calls, because not all calls to <b>match()</b> are recursive. This limit is
-of use only if it is set smaller than <b>--match-limit</b>.
+The <b>--depth-limit</b> option limits the depth of nested backtracking points,
+which indirectly limits the amount of memory that is used. The amount of memory
+needed for each backtracking point depends on the number of capturing
+parentheses in the pattern, so the amount of memory that is used before this
+limit acts varies from pattern to pattern. This limit is of use only if it is
+set smaller than <b>--match-limit</b>.
<br>
<br>
There are no short forms for these options. The default settings are specified
-when the PCRE2 library is compiled, with the default default being 10 million.
+when the PCRE2 library is compiled; unless they are changed at build time, the
+defaults are very large and so effectively unlimited.
+</P>
+<P>
+<b>--max-buffer-size</b>=<i>number</i>
+This limits the expansion of the processing buffer, whose initial size can be
+set by <b>--buffer-size</b>. The maximum buffer size is silently forced to be no
+smaller than the starting buffer size.
</P>
<P>
<b>-M</b>, <b>--multiline</b>
-Allow patterns to match more than one line. When this option is given, patterns
-may usefully contain literal newline characters and internal occurrences of ^
-and $ characters. The output for a successful match may consist of more than
-one line. The first is the line in which the match started, and the last is the
-line in which the match ended. If the matched string ends with a newline
-sequence the output ends at the end of that line.
-<br>
-<br>
-When this option is set, the PCRE2 library is called in "multiline" mode. This
-allows a matched string to extend past the end of a line and continue on one or
-more subsequent lines. However, <b>pcre2grep</b> still processes the input line
-by line. Once a match has been handled, scanning restarts at the beginning of
-the next line, just as it does when <b>-M</b> is not present. This means that it
-is possible for the second or subsequent lines in a multiline match to be
-output again as part of another match.
+Allow patterns to match more than one line. When this option is set, the PCRE2
+library is called in "multiline" mode. This allows a matched string to extend
+past the end of a line and continue on one or more subsequent lines. Patterns
+used with <b>-M</b> may usefully contain literal newline characters and internal
+occurrences of ^ and $ characters. The output for a successful match may
+consist of more than one line. The first line is the line in which the match
+started, and the last line is the line in which the match ended. If the matched
+string ends with a newline sequence, the output ends at the end of that line.
+If <b>-v</b> is set, none of the lines in a multi-line match are output. Once a
+match has been handled, scanning restarts at the beginning of the line after
+the one in which the match ended.
<br>
<br>
The newline sequence that separates multiple lines must be matched as part of
@@ -533,11 +583,8 @@ well as possibly handling a two-character newline sequence.
<br>
<br>
There is a limit to the number of lines that can be matched, imposed by the way
-that <b>pcre2grep</b> buffers the input file as it scans it. However,
-<b>pcre2grep</b> ensures that at least 8K characters or the rest of the file
-(whichever is the shorter) are available for forward matching, and similarly
-the previous 8K characters (or all the previous characters, if fewer than 8K)
-are guaranteed to be available for lookbehind assertions. The <b>-M</b> option
+that <b>pcre2grep</b> buffers the input file as it scans it. With a sufficiently
+large processing buffer, this should not be a problem, but the <b>-M</b> option
does not work when input is read line by line (see <b>--line-buffered</b>).
</P>
<P>
@@ -581,16 +628,47 @@ use of JIT at run time. It is provided for testing and working round problems.
It should never be needed in normal use.
</P>
<P>
+<b>-O</b> <i>text</i>, <b>--output</b>=<i>text</i>
+When there is a match, instead of outputting the whole line that matched,
+output just the given text. This option is mutually exclusive with
+<b>--only-matching</b>, <b>--file-offsets</b>, and <b>--line-offsets</b>. Escape
+sequences starting with a dollar character may be used to insert the contents
+of the matched part of the line and/or captured substrings into the text.
+<br>
+<br>
+$&#60;digits&#62; or ${&#60;digits&#62;} is replaced by the captured
+substring of the given decimal number; zero substitutes the whole match. If
+the number is greater than the number of capturing substrings, or if the
+capture is unset, the replacement is empty.
+<br>
+<br>
+$a is replaced by bell; $b by backspace; $e by escape; $f by form feed; $n by
+newline; $r by carriage return; $t by tab; $v by vertical tab.
+<br>
+<br>
+$o&#60;digits&#62; is replaced by the character represented by the given octal
+number; up to three digits are processed.
+<br>
+<br>
+$x&#60;digits&#62; is replaced by the character represented by the given hexadecimal
+number; up to two digits are processed.
+<br>
+<br>
+Any other character is substituted by itself. In particular, $$ is replaced by
+a single dollar.
+</P>
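The escape rules for the --output text can be illustrated with a small sketch. It uses Python's `re` module as a stand-in for PCRE2 and is written only from the description above; the name `expand_output` is hypothetical. The handling of `$o` or `$x` with no valid digits, and of a lone trailing `$`, is an assumption, since the text does not cover those cases.

```python
import re

# Single-letter escapes listed in the documentation.
SIMPLE = {"a": "\a", "b": "\b", "e": "\x1b", "f": "\f",
          "n": "\n", "r": "\r", "t": "\t", "v": "\v"}

TOKEN = re.compile(
    r"\$(?:\{(?P<braced>[0-9]+)\}"      # ${digits}
    r"|(?P<num>[0-9]+)"                 # $digits
    r"|o(?P<oct>[0-7]{1,3})"            # $o, up to three octal digits
    r"|x(?P<hex>[0-9A-Fa-f]{1,2})"      # $x, up to two hexadecimal digits
    r"|(?P<other>.))",                  # $a, $n, $$, and so on
    re.DOTALL,
)


def expand_output(text, match):
    """Expand dollar escapes in text using the groups of a re.Match object."""
    def repl(tok):
        number = tok.group("braced") or tok.group("num")
        if number is not None:
            index = int(number)
            if index > match.re.groups:
                return ""                     # no such capture group
            return match.group(index) or ""   # unset group gives empty text
        if tok.group("oct") is not None:
            return chr(int(tok.group("oct"), 8))
        if tok.group("hex") is not None:
            return chr(int(tok.group("hex"), 16))
        other = tok.group("other")
        return SIMPLE.get(other, other)       # any other character is itself
    return TOKEN.sub(repl, text)
```

For example, with a match of `(\w+)@(\w+)` against `bob@example`, the text `$2:$1$n` would expand to `example:bob` followed by a newline.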
+<P>
<b>-o</b>, <b>--only-matching</b>
Show only the part of the line that matched a pattern instead of the whole
line. In this mode, no context is shown. That is, the <b>-A</b>, <b>-B</b>, and
<b>-C</b> options are ignored. If there is more than one match in a line, each
-of them is shown separately. If <b>-o</b> is combined with <b>-v</b> (invert the
-sense of the match to find non-matching lines), no output is generated, but the
-return code is set appropriately. If the matched portion of the line is empty,
-nothing is output unless the file name or line number are being printed, in
-which case they are shown on an otherwise empty line. This option is mutually
-exclusive with <b>--file-offsets</b> and <b>--line-offsets</b>.
+of them is shown separately, on a separate line of output. If <b>-o</b> is
+combined with <b>-v</b> (invert the sense of the match to find non-matching
+lines), no output is generated, but the return code is set appropriately. If
+the matched portion of the line is empty, nothing is output unless the file
+name or line number are being printed, in which case they are shown on an
+otherwise empty line. This option is mutually exclusive with <b>--output</b>,
+<b>--file-offsets</b> and <b>--line-offsets</b>.
</P>
<P>
<b>-o</b><i>number</i>, <b>--only-matching</b>=<i>number</i>
@@ -599,15 +677,16 @@ given number. Up to 32 capturing parentheses are supported, and -o0 is
equivalent to <b>-o</b> without a number. Because these options can be given
without an argument (see above), if an argument is present, it must be given in
the same shell item, for example, -o3 or --only-matching=2. The comments given
-for the non-argument case above also apply to this case. If the specified
+for the non-argument case above also apply to this option. If the specified
capturing parentheses do not exist in the pattern, or were not set in the
match, nothing is output unless the file name or line number are being output.
<br>
<br>
-If this option is given multiple times, multiple substrings are output, in the
-order the options are given. For example, -o3 -o1 -o3 causes the substrings
-matched by capturing parentheses 3 and 1 and then 3 again to be output. By
-default, there is no separator (but see the next option).
+If this option is given multiple times, multiple substrings are output for each
+match, in the order the options are given, and all on one line. For example,
+-o3 -o1 -o3 causes the substrings matched by capturing parentheses 3 and 1 and
+then 3 again to be output. By default, there is no separator (but see the next
+option).
</P>
<P>
<b>--om-separator</b>=<i>text</i>
@@ -638,6 +717,18 @@ quietly skipped. However, the return code is still 2, even if matches were
found in other files.
</P>
<P>
+<b>-t</b>, <b>--total-count</b>
+This option is useful when scanning more than one file. If used on its own,
+<b>-t</b> suppresses all output except for a grand total number of matching
+lines (or non-matching lines if <b>-v</b> is used) in all the files. If <b>-t</b>
+is used with <b>-c</b>, a grand total is output except when the previous output
+is just one line. In other words, it is not output when just one file's count
+is listed. If file names are being output, the grand total is preceded by
+"TOTAL:". Otherwise, it appears as just another number. The <b>-t</b> option is
+ignored when used with <b>-L</b> (list files without matches), because the grand
+total would always be zero.
+</P>
+<P>
<b>-u</b>, <b>--utf-8</b>
Operate in UTF-8 mode. This option is available only if PCRE2 has been compiled
with UTF-8 support. All patterns (including those for any <b>--exclude</b> and
@@ -657,17 +748,19 @@ the patterns are the ones that are found.
</P>
<P>
<b>-w</b>, <b>--word-regex</b>, <b>--word-regexp</b>
-Force the patterns to match only whole words. This is equivalent to having \b
-at the start and end of the pattern. This option applies only to the patterns
-that are matched against the contents of files; it does not apply to patterns
-specified by any of the <b>--include</b> or <b>--exclude</b> options.
+Force the patterns only to match "words". That is, there must be a word
+boundary at the start and end of each matched string. This is equivalent to
+having "\b(?:" at the start of each pattern, and ")\b" at the end. This
+option applies only to the patterns that are matched against the contents of
+files; it does not apply to patterns specified by any of the <b>--include</b> or
+<b>--exclude</b> options.
</P>
<P>
<b>-x</b>, <b>--line-regex</b>, <b>--line-regexp</b>
-Force the patterns to be anchored (each must start matching at the beginning of
-a line) and in addition, require them to match entire lines. This is equivalent
-to having ^ and $ characters at the start and end of each alternative top-level
-branch in every pattern. This option applies only to the patterns that are
+Force the patterns to start matching only at the beginnings of lines, and in
+addition, require them to match entire lines. In multiline mode the match may
+be more than one line. This is equivalent to having "^(?:" at the start of each
+pattern and ")$" at the end. This option applies only to the patterns that are
matched against the contents of files; it does not apply to patterns specified
by any of the <b>--include</b> or <b>--exclude</b> options.
</P>
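The non-capturing group in the -w and -x equivalents matters when a pattern has top-level alternatives. The sketch below shows the difference, using Python's `re` module as a stand-in for PCRE2 (both treat `\b`, `^`, `$`, and `(?:...)` the same way in this case). The helper names `word_wrap` and `line_wrap` are hypothetical and only mirror the equivalences stated above.

```python
import re


def word_wrap(pattern):
    """Equivalent of -w as described: a word boundary around the whole pattern."""
    return r"\b(?:" + pattern + r")\b"


def line_wrap(pattern):
    """Equivalent of -x as described: the whole pattern must span the line."""
    return "^(?:" + pattern + ")$"


# Without the group, each anchor binds to only one alternative.
naive_word = re.compile(r"\bcat|dog\b")
wrapped_word = re.compile(word_wrap("cat|dog"))

# "catalog" begins with "cat", so the naive form matches and the wrapped form does not.
naive_hit = naive_word.search("catalog") is not None
wrapped_hit = wrapped_word.search("catalog") is not None
```

The same reasoning applies to `^ab|cd$`, which accepts any line starting with `ab`, whereas `^(?:ab|cd)$` accepts only the lines `ab` and `cd`.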
@@ -696,10 +789,11 @@ relying on the C I/O library to convert this to an appropriate sequence.
Many of the short and long forms of <b>pcre2grep</b>'s options are the same
as in the GNU <b>grep</b> program. Any long option of the form
<b>--xxx-regexp</b> (GNU terminology) is also available as <b>--xxx-regex</b>
-(PCRE2 terminology). However, the <b>--file-list</b>, <b>--file-offsets</b>,
-<b>--include-dir</b>, <b>--line-offsets</b>, <b>--locale</b>, <b>--match-limit</b>,
-<b>-M</b>, <b>--multiline</b>, <b>-N</b>, <b>--newline</b>, <b>--om-separator</b>,
-<b>--recursion-limit</b>, <b>-u</b>, and <b>--utf-8</b> options are specific to
+(PCRE2 terminology). However, the <b>--depth-limit</b>, <b>--file-list</b>,
+<b>--file-offsets</b>, <b>--heap-limit</b>, <b>--include-dir</b>,
+<b>--line-offsets</b>, <b>--locale</b>, <b>--match-limit</b>, <b>-M</b>,
+<b>--multiline</b>, <b>-N</b>, <b>--newline</b>, <b>--om-separator</b>,
+<b>--output</b>, <b>-u</b>, and <b>--utf-8</b> options are specific to
<b>pcre2grep</b>, as is the use of the <b>--only-matching</b> option with a
capturing parentheses number.
</P>
@@ -742,23 +836,30 @@ The exceptions to the above are the <b>--colour</b> (or <b>--color</b>) and
options does have data, it must be given in the first form, using an equals
character. Otherwise <b>pcre2grep</b> will assume that it has no data.
</P>
-<br><a name="SEC10" href="#TOC1">CALLING EXTERNAL SCRIPTS</a><br>
+<br><a name="SEC10" href="#TOC1">USING PCRE2'S CALLOUT FACILITY</a><br>
<P>
-On non-Windows systems, <b>pcre2grep</b> has, by default, support for calling
-external programs or scripts during matching by making use of PCRE2's callout
-facility. However, this support can be disabled when <b>pcre2grep</b> is built.
-You can find out whether your binary has support for callouts by running it
-with the <b>--help</b> option. If the support is not enabled, all callouts in
+<b>pcre2grep</b> has, by default, support for calling external programs or
+scripts or echoing specific strings during matching by making use of PCRE2's
+callout facility. However, this support can be disabled when <b>pcre2grep</b> is
+built. You can find out whether your binary has support for callouts by running
+it with the <b>--help</b> option. If the support is not enabled, all callouts in
patterns are ignored by <b>pcre2grep</b>.
</P>
<P>
A callout in a PCRE2 pattern is of the form (?C&#60;arg&#62;) where the argument is
either a number or a quoted string (see the
<a href="pcre2callout.html"><b>pcre2callout</b></a>
-documentation for details). Numbered callouts are ignored by <b>pcre2grep</b>.
-String arguments are parsed as a list of substrings separated by pipe (vertical
-bar) characters. The first substring must be an executable name, with the
-following substrings specifying arguments:
+documentation for details). Numbered callouts are ignored by <b>pcre2grep</b>;
+only callouts with string arguments are useful.
+</P>
+<br><b>
+Calling external programs or scripts
+</b><br>
+<P>
+If the callout string does not start with a pipe (vertical bar) character, it
+is parsed into a list of substrings separated by pipe characters. The first
+substring must be an executable name, with the following substrings specifying
+arguments:
<pre>
executable_name|arg1|arg2|...
</pre>
@@ -792,6 +893,19 @@ callout to be ignored. If running the program fails for any reason (including
the non-existence of the executable), a local matching failure occurs and the
matcher backtracks in the normal way.
</P>
+<br><b>
+Echoing a specific string
+</b><br>
+<P>
+If the callout string starts with a pipe (vertical bar) character, the rest of
+the string is written to the output, having been passed through the same escape
+processing as text from the --output option. This provides a simple echoing
+facility that avoids calling an external program or script. No terminator is
+added to the string, so if you want a newline, you must include it explicitly.
+Matching continues normally after the string is output. If you want to see only
+the callout output but not any output from an actual match, you should end the
+relevant pattern with (*FAIL).
+</P>
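The two kinds of string callout can be told apart by the first character alone. The following sketch shows that dispatch as described in this section; `classify_callout` is a hypothetical helper, and it does not model the escape processing or the argument substitutions that pcre2grep applies afterwards.

```python
def classify_callout(arg):
    """Split a callout string argument into its kind and its parts."""
    if arg.startswith("|"):
        # Leading pipe: the rest of the string is echoed to the output.
        return ("echo", arg[1:])
    # Otherwise: executable name followed by pipe-separated arguments.
    parts = arg.split("|")
    return ("exec", parts[0], parts[1:])
```

A callout such as `(?C"|found$n")` would therefore be classified as an echo of `found$n`, and `(?C"/bin/echo|hello")` as a program call with one argument.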
<br><a name="SEC11" href="#TOC1">MATCHING ERRORS</a><br>
<P>
It is possible to supply a regular expression that takes a very long time to
@@ -804,9 +918,9 @@ there are more than 20 such errors, <b>pcre2grep</b> gives up.
</P>
<P>
The <b>--match-limit</b> option of <b>pcre2grep</b> can be used to set the
-overall resource limit; there is a second option called <b>--recursion-limit</b>
-that sets a limit on the amount of memory (usually stack) that is used (see the
-discussion of these options above).
+overall resource limit. There are also other limits that affect the amount of
+memory used during matching; see the discussion of <b>--heap-limit</b> and
+<b>--depth-limit</b> above.
</P>
<br><a name="SEC12" href="#TOC1">DIAGNOSTICS</a><br>
<P>
@@ -816,6 +930,10 @@ matches were found in other files) or too many matching errors. Using the
<b>-s</b> option to suppress error messages about inaccessible files does not
affect the return code.
</P>
+<P>
+When run under VMS, the return code is placed in the symbol PCRE2GREP_RC
+because VMS does not distinguish between exit(0) and exit(1).
+</P>
<br><a name="SEC13" href="#TOC1">SEE ALSO</a><br>
<P>
<b>pcre2pattern</b>(3), <b>pcre2syntax</b>(3), <b>pcre2callout</b>(3).
@@ -831,9 +949,9 @@ Cambridge, England.
</P>
<br><a name="SEC15" href="#TOC1">REVISION</a><br>
<P>
-Last updated: 19 June 2016
+Last updated: 13 November 2017
<br>
-Copyright &copy; 1997-2016 University of Cambridge.
+Copyright &copy; 1997-2017 University of Cambridge.
<br>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
diff --git a/doc/html/pcre2jit.html b/doc/html/pcre2jit.html
index 4a6d4ff..c53d3d9 100644
--- a/doc/html/pcre2jit.html
+++ b/doc/html/pcre2jit.html
@@ -173,7 +173,7 @@ below for a discussion of JIT stack usage.
The error code PCRE2_ERROR_MATCHLIMIT is returned by the JIT code if searching
a very large pattern tree goes on for too long, as it is in the same
circumstance when JIT is not used, but the details of exactly what is counted
-are not the same. The PCRE2_ERROR_RECURSIONLIMIT error code is never returned
+are not the same. The PCRE2_ERROR_DEPTHLIMIT error code is never returned
when JIT matching is used.
<a name="stackcontrol"></a></P>
<br><a name="SEC6" href="#TOC1">CONTROLLING THE JIT STACK</a><br>
@@ -194,12 +194,8 @@ allocation functions, or NULL for standard memory allocation). It returns a
pointer to an opaque structure of type <b>pcre2_jit_stack</b>, or NULL if there
is an error. The <b>pcre2_jit_stack_free()</b> function is used to free a stack
that is no longer needed. (For the technically minded: the address space is
-allocated by mmap or VirtualAlloc.)
-</P>
-<P>
-JIT uses far less memory for recursion than the interpretive code,
-and a maximum stack size of 512K to 1M should be more than enough for any
-pattern.
+allocated by mmap or VirtualAlloc.) A maximum stack size of 512K to 1M should
+be more than enough for any pattern.
</P>
<P>
The <b>pcre2_jit_stack_assign()</b> function specifies which stack JIT code
@@ -436,9 +432,9 @@ Cambridge, England.
</P>
<br><a name="SEC13" href="#TOC1">REVISION</a><br>
<P>
-Last updated: 05 June 2016
+Last updated: 31 March 2017
<br>
-Copyright &copy; 1997-2016 University of Cambridge.
+Copyright &copy; 1997-2017 University of Cambridge.
<br>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
diff --git a/doc/html/pcre2limits.html b/doc/html/pcre2limits.html
index e227a30..640fe3d 100644
--- a/doc/html/pcre2limits.html
+++ b/doc/html/pcre2limits.html
@@ -44,14 +44,6 @@ integer type, usually defined as size_t. Its maximum value (that is
and unset offsets.
</P>
<P>
-Note that when using the traditional matching function, PCRE2 uses recursion to
-handle subpatterns and indefinite repetition. This means that the available
-stack space may limit the size of a subject string that can be processed by
-certain patterns. For a discussion of stack issues, see the
-<a href="pcre2stack.html"><b>pcre2stack</b></a>
-documentation.
-</P>
-<P>
All values in repeating quantifiers must be less than 65536.
</P>
<P>
@@ -61,14 +53,10 @@ The maximum length of a lookbehind assertion is 65535 characters.
There is no limit to the number of parenthesized subpatterns, but there can be
no more than 65535 capturing subpatterns. There is, however, a limit to the
depth of nesting of parenthesized subpatterns of all kinds. This is imposed in
-order to limit the amount of system stack used at compile time. The limit can
-be specified when PCRE2 is built; the default is 250.
-</P>
-<P>
-There is a limit to the number of forward references to subsequent subpatterns
-of around 200,000. Repeated forward references with fixed upper limits, for
-example, (?2){0,100} when subpattern number 2 is to the right, are included in
-the count. There is no limit to the number of backward references.
+order to limit the amount of system stack used at compile time. The default
+limit can be specified when PCRE2 is built; if it is not, it is 250. An
+application can change this limit by calling <b>pcre2_set_parens_nest_limit()</b>
+to set the limit in a compile context.
</P>
<P>
The maximum length of name for a named subpattern is 32 code units, and the
@@ -76,7 +64,12 @@ maximum number of named subpatterns is 10000.
</P>
<P>
The maximum length of a name in a (*MARK), (*PRUNE), (*SKIP), or (*THEN) verb
-is 255 for the 8-bit library and 65535 for the 16-bit and 32-bit libraries.
+is 255 code units for the 8-bit library and 65535 code units for the 16-bit and
+32-bit libraries.
+</P>
+<P>
+The maximum length of a string argument to a callout is the largest number a
+32-bit unsigned integer can hold.
</P>
<br><b>
AUTHOR
@@ -93,9 +86,9 @@ Cambridge, England.
REVISION
</b><br>
<P>
-Last updated: 05 November 2015
+Last updated: 30 March 2017
<br>
-Copyright &copy; 1997-2015 University of Cambridge.
+Copyright &copy; 1997-2017 University of Cambridge.
<br>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
diff --git a/doc/html/pcre2pattern.html b/doc/html/pcre2pattern.html
index 797690a..c495cba 100644
--- a/doc/html/pcre2pattern.html
+++ b/doc/html/pcre2pattern.html
@@ -170,35 +170,54 @@ the application to apply the JIT optimization by calling
<b>pcre2_jit_compile()</b> is ignored.
</P>
<br><b>
-Setting match and recursion limits
+Setting match resource limits
</b><br>
<P>
-The caller of <b>pcre2_match()</b> can set a limit on the number of times the
-internal <b>match()</b> function is called and on the maximum depth of
-recursive calls. These facilities are provided to catch runaway matches that
-are provoked by patterns with huge matching trees (a typical example is a
-pattern with nested unlimited repeats) and to avoid running out of system stack
-by too much recursion. When one of these limits is reached, <b>pcre2_match()</b>
-gives an error return. The limits can also be set by items at the start of the
-pattern of the form
+The <b>pcre2_match()</b> function contains a counter that is incremented every time it
+goes round its main loop. The caller of <b>pcre2_match()</b> can set a limit on
+this counter, which therefore limits the amount of computing resource used for
+a match. The maximum depth of nested backtracking can also be limited; this
+indirectly restricts the amount of heap memory that is used, but there is also
+an explicit memory limit that can be set.
+</P>
+<P>
+These facilities are provided to catch runaway matches that are provoked by
+patterns with huge matching trees (a typical example is a pattern with nested
+unlimited repeats applied to a long string that does not match). When one of
+these limits is reached, <b>pcre2_match()</b> gives an error return. The limits
+can also be set by items at the start of the pattern of the form
<pre>
+ (*LIMIT_HEAP=d)
(*LIMIT_MATCH=d)
- (*LIMIT_RECURSION=d)
+ (*LIMIT_DEPTH=d)
</pre>
where d is any number of decimal digits. However, the value of the setting must
be less than the value set (or defaulted) by the caller of <b>pcre2_match()</b>
for it to have any effect. In other words, the pattern writer can lower the
limits set by the programmer, but not raise them. If there is more than one
setting of one of these limits, the lower value is used.
+</P>
+<P>
+Prior to release 10.30, LIMIT_DEPTH was called LIMIT_RECURSION. This name is
+still recognized for backwards compatibility.
+</P>
+<P>
+The heap limit applies only when the <b>pcre2_match()</b> interpreter is used
+for matching. It does not apply to JIT or DFA matching. The match limit is used
+(but in a different way) when JIT is being used, or when
+<b>pcre2_dfa_match()</b> is called, to limit computing resource usage by those
+matching functions. The depth limit is ignored by JIT but is relevant for DFA
+matching, which uses function recursion for recursions within the pattern. In
+this case, the depth limit controls the amount of system stack that is used.
<a name="newlines"></a></P>
<br><b>
Newline conventions
</b><br>
<P>
-PCRE2 supports five different conventions for indicating line breaks in
+PCRE2 supports six different conventions for indicating line breaks in
strings: a single CR (carriage return) character, a single LF (linefeed)
-character, the two-character sequence CRLF, any of the three preceding, or any
-Unicode newline sequence. The
+character, the two-character sequence CRLF, any of the three preceding, any
+Unicode newline sequence, or the NUL character (binary zero). The
<a href="pcre2api.html"><b>pcre2api</b></a>
page has
<a href="pcre2api.html#newlines">further discussion</a>
@@ -207,13 +226,14 @@ about newlines, and shows how to set the newline convention when calling
</P>
<P>
It is also possible to specify a newline convention by starting a pattern
-string with one of the following five sequences:
+string with one of the following sequences:
<pre>
(*CR) carriage return
(*LF) linefeed
(*CRLF) carriage return, followed by linefeed
(*ANYCRLF) any of the three above
(*ANY) all Unicode newline sequences
+ (*NUL) the NUL character (binary zero)
</pre>
These override the default and the options given to the compiling function. For
example, on a Unix system where LF is the default newline sequence, the pattern
@@ -229,8 +249,8 @@ The newline convention affects where the circumflex and dollar assertions are
true. It also affects the interpretation of the dot metacharacter when
PCRE2_DOTALL is not set, and the behaviour of \N. However, it does not affect
what the \R escape sequence matches. By default, this is any Unicode newline
-sequence, for Perl compatibility. However, this can be changed; see the
-description of \R in the section entitled
+sequence, for Perl compatibility. However, this can be changed; see the next
+section and the description of \R in the section entitled
<a href="#newlineseq">"Newline sequences"</a>
below. A change of \R setting can be combined with a change of newline
convention.
@@ -248,7 +268,7 @@ corresponding to PCRE2_BSR_UNICODE.
<br><a name="SEC3" href="#TOC1">EBCDIC CHARACTER CODES</a><br>
<P>
PCRE2 can be compiled to run in an environment that uses EBCDIC as its
-character code rather than ASCII or Unicode (typically a mainframe system). In
+character code instead of ASCII or Unicode (typically a mainframe system). In
the sections below, character code values are ASCII or Unicode; in an EBCDIC
environment these characters may have different code values, and there are no
code points greater than 255.
@@ -312,11 +332,11 @@ that character may have. This use of backslash as an escape character applies
both inside and outside character classes.
</P>
<P>
-For example, if you want to match a * character, you write \* in the pattern.
-This escaping action applies whether or not the following character would
-otherwise be interpreted as a metacharacter, so it is always safe to precede a
-non-alphanumeric with backslash to specify that it stands for itself. In
-particular, if you want to match a backslash, you write \\.
+For example, if you want to match a * character, you must write \* in the
+pattern. This escaping action applies whether or not the following character
+would otherwise be interpreted as a metacharacter, so it is always safe to
+precede a non-alphanumeric with backslash to specify that it stands for itself.
+In particular, if you want to match a backslash, you write \\.
</P>
<P>
In a UTF mode, only ASCII numbers and letters have any special meaning after a
@@ -347,7 +367,7 @@ An isolated \E that is not preceded by \Q is ignored. If \Q is not followed
by \E later in the pattern, the literal interpretation continues to the end of
the pattern (that is, \E is assumed at the end). If the isolated \Q is inside
a character class, this causes an error, because the character class is not
-terminated.
+terminated by a closing square bracket.
<a name="digitsafterbackslash"></a></P>
<br><b>
Non-printing characters
@@ -379,32 +399,31 @@ case letter, it is converted to upper case. Then bit 6 of the character (hex
40) is inverted. Thus \cA to \cZ become hex 01 to hex 1A (A is 41, Z is 5A),
but \c{ becomes hex 3B ({ is 7B), and \c; becomes hex 7B (; is 3B). If the
code unit following \c has a value less than 32 or greater than 126, a
-compile-time error occurs. This locks out non-printable ASCII characters in all
-modes.
+compile-time error occurs.
</P>
<P>
When PCRE2 is compiled in EBCDIC mode, \a, \e, \f, \n, \r, and \t
generate the appropriate EBCDIC code values. The \c escape is processed
as specified for Perl in the <b>perlebcdic</b> document. The only characters
that are allowed after \c are A-Z, a-z, or one of @, [, \, ], ^, _, or ?. Any
-other character provokes a compile-time error. The sequence \@ encodes
-character code 0; the letters (in either case) encode characters 1-26 (hex 01
-to hex 1A); [, \, ], ^, and _ encode characters 27-31 (hex 1B to hex 1F), and
-\? becomes either 255 (hex FF) or 95 (hex 5F).
+other character provokes a compile-time error. The sequence \c@ encodes
+character code 0; after \c the letters (in either case) encode characters 1-26
+(hex 01 to hex 1A); [, \, ], ^, and _ encode characters 27-31 (hex 1B to hex
+1F), and \c? becomes either 255 (hex FF) or 95 (hex 5F).
</P>
<P>
-Thus, apart from \?, these escapes generate the same character code values as
+Thus, apart from \c?, these escapes generate the same character code values as
they do in an ASCII environment, though the meanings of the values mostly
-differ. For example, \G always generates code value 7, which is BEL in ASCII
+differ. For example, \cG always generates code value 7, which is BEL in ASCII
but DEL in EBCDIC.
</P>
<P>
-The sequence \? generates DEL (127, hex 7F) in an ASCII environment, but
+The sequence \c? generates DEL (127, hex 7F) in an ASCII environment, but
because 127 is not a control character in EBCDIC, Perl makes it generate the
APC character. Unfortunately, there are several variants of EBCDIC. In most of
them the APC character has the value 255 (hex FF), but in the one Perl calls
POSIX-BC its value is 95 (hex 5F). If certain other characters have POSIX-BC
-values, PCRE2 makes \? generate 95; otherwise it generates 255.
+values, PCRE2 makes \c? generate 95; otherwise it generates 255.
</P>
<P>
After \0 up to two further octal digits are read. If there are fewer than two
@@ -471,9 +490,9 @@ a hexadecimal digit appears between \x{ and }, or if there is no terminating
<P>
If the PCRE2_ALT_BSUX option is set, the interpretation of \x is as just
described only when it is followed by two hexadecimal digits. Otherwise, it
-matches a literal "x" character. In this mode mode, support for code points
-greater than 256 is provided by \u, which must be followed by four hexadecimal
-digits; otherwise it matches a literal "u" character.
+matches a literal "x" character. In this mode, support for code points greater
+than 256 is provided by \u, which must be followed by four hexadecimal digits;
+otherwise it matches a literal "u" character.
</P>
<P>
Characters whose value is less than 256 can be defined by either of the two
@@ -488,15 +507,15 @@ Constraints on character values
Characters that are specified using octal or hexadecimal numbers are
limited to certain values, as follows:
<pre>
- 8-bit non-UTF mode less than 0x100
- 8-bit UTF-8 mode less than 0x10ffff and a valid codepoint
- 16-bit non-UTF mode less than 0x10000
- 16-bit UTF-16 mode less than 0x10ffff and a valid codepoint
- 32-bit non-UTF mode less than 0x100000000
- 32-bit UTF-32 mode less than 0x10ffff and a valid codepoint
+ 8-bit non-UTF mode no greater than 0xff
+ 16-bit non-UTF mode no greater than 0xffff
+ 32-bit non-UTF mode no greater than 0xffffffff
+ All UTF modes no greater than 0x10ffff and a valid codepoint
</pre>
-Invalid Unicode codepoints are the range 0xd800 to 0xdfff (the so-called
-"surrogate" codepoints), and 0xffef.
+Invalid Unicode codepoints are all those in the range 0xd800 to 0xdfff (the
+so-called "surrogate" codepoints). The check for these can be disabled by the
+caller of <b>pcre2_compile()</b> by setting the option
+PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES.
</P>
<br><b>
Escape sequences in character classes
@@ -520,15 +539,15 @@ In Perl, the sequences \l, \L, \u, and \U are recognized by its string
handler and used to modify the case of following characters. By default, PCRE2
does not support these escape sequences. However, if the PCRE2_ALT_BSUX option
is set, \U matches a "U" character, and \u can be used to define a character
-by code point, as described in the previous section.
+by code point, as described above.
</P>
<br><b>
Absolute and relative back references
</b><br>
<P>
-The sequence \g followed by an unsigned or a negative number, optionally
-enclosed in braces, is an absolute or relative back reference. A named back
-reference can be coded as \g{name}. Back references are discussed
+The sequence \g followed by a signed or unsigned number, optionally enclosed
+in braces, is an absolute or relative back reference. A named back reference
+can be coded as \g{name}. Back references are discussed
<a href="#backreferences">later,</a>
following the discussion of
<a href="#subpattern">parenthesized subpatterns.</a>
@@ -709,7 +728,9 @@ When PCRE2 is built with Unicode support (the default), three additional escape
sequences that match characters with specific properties are available. In
8-bit non-UTF-8 mode, these sequences are of course limited to testing
characters whose codepoints are less than 256, but they do work in this mode.
-The extra escape sequences are:
+In 32-bit non-UTF mode, codepoints greater than 0x10ffff (the Unicode limit)
+may be encountered. These are all treated as being in the Common script and
+with an unassigned type. The extra escape sequences are:
<pre>
\p{<i>xx</i>} a character with the <i>xx</i> property
\P{<i>xx</i>} a character without the <i>xx</i> property
@@ -736,6 +757,7 @@ Those that are not part of an identified script are lumped together as
"Common". The current list of scripts is:
</P>
<P>
+Adlam,
Ahom,
Anatolian_Hieroglyphs,
Arabic,
@@ -746,6 +768,7 @@ Bamum,
Bassa_Vah,
Batak,
Bengali,
+Bhaiksuki,
Bopomofo,
Brahmi,
Braille,
@@ -807,6 +830,8 @@ Mahajani,
Malayalam,
Mandaic,
Manichaean,
+Marchen,
+Masaram_Gondi,
Meetei_Mayek,
Mende_Kikakui,
Meroitic_Cursive,
@@ -819,7 +844,9 @@ Multani,
Myanmar,
Nabataean,
New_Tai_Lue,
+Newa,
Nko,
+Nushu,
Ogham,
Ol_Chiki,
Old_Hungarian,
@@ -830,6 +857,7 @@ Old_Persian,
Old_South_Arabian,
Old_Turkic,
Oriya,
+Osage,
Osmanya,
Pahawh_Hmong,
Palmyrene,
@@ -847,6 +875,7 @@ Siddham,
SignWriting,
Sinhala,
Sora_Sompeng,
+Soyombo,
Sundanese,
Syloti_Nagri,
Syriac,
@@ -857,6 +886,7 @@ Tai_Tham,
Tai_Viet,
Takri,
Tamil,
+Tangut,
Telugu,
Thaana,
Thai,
@@ -866,7 +896,8 @@ Tirhuta,
Ugaritic,
Vai,
Warang_Citi,
-Yi.
+Yi,
+Zanabazar_Square.
</P>
<P>
Each character has exactly one Unicode general category property, specified by
@@ -972,9 +1003,12 @@ grapheme cluster", and treats the sequence as an atomic group
<a href="#atomicgroup">(see below).</a>
Unicode supports various kinds of composite character by giving each character
a grapheme breaking property, and having rules that use these properties to
-define the boundaries of extended grapheme clusters. \X always matches at
-least one character. Then it decides whether to add additional characters
-according to the following rules for ending a cluster:
+define the boundaries of extended grapheme clusters. The rules are defined in
+Unicode Standard Annex 29, "Unicode Text Segmentation".
+</P>
+<P>
+\X always matches at least one character. Then it decides whether to add
+additional characters according to the following rules for ending a cluster:
</P>
<P>
1. End at the end of the subject string.
@@ -989,13 +1023,27 @@ L, V, LV, or LVT character; an LV or V character may be followed by a V or T
 character; an LVT or T character may be followed only by a T character.
</P>
<P>
-4. Do not end before extending characters or spacing marks. Characters with
-the "mark" property always have the "extend" grapheme breaking property.
+4. Do not end before extending characters or spacing marks or the "zero-width
+joiner" characters. Characters with the "mark" property always have the
+"extend" grapheme breaking property.
</P>
<P>
5. Do not end after prepend characters.
</P>
<P>
+6. Do not break within emoji modifier sequences (a base character followed by a
+modifier). Extending characters are allowed before the modifier.
+</P>
+<P>
+7. Do not break within emoji zwj sequences (zero-width joiner followed by
+"glue after ZWJ" or "base glue after ZWJ").
+</P>
+<P>
+8. Do not break within emoji flag sequences. That is, do not break between
+regional indicator (RI) characters if there are an odd number of RI characters
+before the break point.
+</P>
+<P>
 9. Otherwise, end the cluster.
<a name="extraprops"></a></P>
<br><b>
@@ -1326,13 +1374,33 @@ whatever setting of the PCRE2_DOTALL and PCRE2_MULTILINE options is used. A
class such as [^a] always matches one of these characters.
</P>
<P>
+The character escape sequences \d, \D, \h, \H, \p, \P, \s, \S, \v,
+\V, \w, and \W may appear in a character class, and add the characters that
+they match to the class. For example, [\dABCDEF] matches any hexadecimal
+digit. In UTF modes, the PCRE2_UCP option affects the meanings of \d, \s, \w
+and their upper case partners, just as it does when they appear outside a
+character class, as described in the section entitled
+<a href="#genericchartypes">"Generic character types"</a>
+above. The escape sequence \b has a different meaning inside a character
+class; it matches the backspace character. The sequences \B, \N, \R, and \X
+are not special inside a character class. Like any other unrecognized escape
+sequences, they cause an error.
+</P>
+<P>
The minus (hyphen) character can be used to specify a range of characters in a
character class. For example, [d-m] matches any letter between d and m,
inclusive. If a minus character is required in a class, it must be escaped with
a backslash or appear in a position where it cannot be interpreted as
-indicating a range, typically as the first or last character in the class, or
-immediately after a range. For example, [b-d-z] matches letters in the range b
-to d, a hyphen character, or z.
+indicating a range, typically as the first or last character in the class,
+or immediately after a range. For example, [b-d-z] matches letters in the range
+b to d, a hyphen character, or z.
+</P>
+<P>
+Perl treats a hyphen as a literal if it appears before or after a POSIX class
+(see below) or before or after a character type escape such as \d or \H.
+However, unless the hyphen is the last character in the class, Perl outputs a
+warning in its warning mode, as this is most likely a user error. As PCRE2 has
+no facility for warning, an error is given in these cases.
</P>
<P>
It is not possible to have the literal character "]" as the end character of a
@@ -1344,16 +1412,14 @@ followed by two other characters. The octal or hexadecimal representation of
"]" can also be used to end a range.
</P>
<P>
-An error is generated if a POSIX character class (see below) or an escape
-sequence other than one that defines a single character appears at a point
-where a range ending character is expected. For example, [z-\xff] is valid,
-but [A-\d] and [A-[:digit:]] are not.
-</P>
-<P>
Ranges normally include all code points between the start and end characters,
inclusive. They can also be used for code points specified numerically, for
example [\000-\037]. Ranges can include any characters that are valid for the
-current mode.
+current mode. In any UTF mode, the so-called "surrogate" characters (those
+whose code points lie between 0xd800 and 0xdfff inclusive) may not be specified
+explicitly by default (the PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES option disables
+this check). However, ranges such as [\x{d7ff}-\x{e000}], which include the
+surrogates, are always permitted.
</P>
<P>
There is a special case in EBCDIC environments for ranges whose end points are
@@ -1372,19 +1438,6 @@ tables for a French locale are in use, [\xc8-\xcb] matches accented E
characters in both cases.
</P>
<P>
-The character escape sequences \d, \D, \h, \H, \p, \P, \s, \S, \v,
-\V, \w, and \W may appear in a character class, and add the characters that
-they match to the class. For example, [\dABCDEF] matches any hexadecimal
-digit. In UTF modes, the PCRE2_UCP option affects the meanings of \d, \s, \w
-and their upper case partners, just as it does when they appear outside a
-character class, as described in the section entitled
-<a href="#genericchartypes">"Generic character types"</a>
-above. The escape sequence \b has a different meaning inside a character
-class; it matches the backspace character. The sequences \B, \N, \R, and \X
-are not special inside a character class. Like any other unrecognized escape
-sequences, they cause an error.
-</P>
-<P>
A circumflex can conveniently be used with the upper case character types to
specify a more restricted set of characters than the matching lower case type.
For example, the class [^\W_] matches any letter or digit, but not underscore,
@@ -1526,20 +1579,26 @@ alternative in the subpattern.
</P>
<br><a name="SEC13" href="#TOC1">INTERNAL OPTION SETTING</a><br>
<P>
-The settings of the PCRE2_CASELESS, PCRE2_MULTILINE, PCRE2_DOTALL, and
-PCRE2_EXTENDED options (which are Perl-compatible) can be changed from within
-the pattern by a sequence of Perl option letters enclosed between "(?" and ")".
-The option letters are
+The settings of the PCRE2_CASELESS, PCRE2_MULTILINE, PCRE2_DOTALL,
+PCRE2_EXTENDED, PCRE2_EXTENDED_MORE, and PCRE2_NO_AUTO_CAPTURE options (which
+are Perl-compatible) can be changed from within the pattern by a sequence of
+Perl option letters enclosed between "(?" and ")". The option letters are
<pre>
i for PCRE2_CASELESS
m for PCRE2_MULTILINE
+ n for PCRE2_NO_AUTO_CAPTURE
s for PCRE2_DOTALL
x for PCRE2_EXTENDED
+ xx for PCRE2_EXTENDED_MORE
</pre>
For example, (?im) sets caseless, multiline matching. It is also possible to
-unset these options by preceding the letter with a hyphen, and a combined
-setting and unsetting such as (?im-sx), which sets PCRE2_CASELESS and
-PCRE2_MULTILINE while unsetting PCRE2_DOTALL and PCRE2_EXTENDED, is also
+unset these options by preceding the letter with a hyphen. The two "extended"
+options are not independent; unsetting either one cancels the effects of both
+of them.
+</P>
+<P>
+A combined setting and unsetting such as (?im-sx), which sets PCRE2_CASELESS
+and PCRE2_MULTILINE while unsetting PCRE2_DOTALL and PCRE2_EXTENDED, is also
permitted. If a letter appears both before and after the hyphen, the option is
unset. An empty options setting "(?)" is allowed. Needless to say, it has no
effect.
@@ -1552,13 +1611,8 @@ respectively.
<P>
When one of these option changes occurs at top level (that is, not inside
subpattern parentheses), the change applies to the remainder of the pattern
-that follows. If the change is placed right at the start of a pattern, PCRE2
-extracts it into the global options (and it will therefore show up in data
-extracted by the <b>pcre2_pattern_info()</b> function).
-</P>
-<P>
-An option change within a subpattern (see below for a description of
-subpatterns) affects only that part of the subpattern that follows it, so
+that follows. An option change within a subpattern (see below for a description
+of subpatterns) affects only that part of the subpattern that follows it, so
<pre>
(a(?i)b)c
</pre>
@@ -2093,9 +2147,9 @@ subpattern is possible using named parentheses (see below).
</P>
<P>
Another way of avoiding the ambiguity inherent in the use of digits following a
-backslash is to use the \g escape sequence. This escape must be followed by an
-unsigned number or a negative number, optionally enclosed in braces. These
-examples are all identical:
+backslash is to use the \g escape sequence. This escape must be followed by a
+signed or unsigned number, optionally enclosed in braces. These examples are
+all identical:
<pre>
(ring), \1
(ring), \g1
@@ -2103,8 +2157,7 @@ examples are all identical:
</pre>
An unsigned number specifies an absolute reference without the ambiguity that
is present in the older syntax. It is also useful when literal digits follow
-the reference. A negative number is a relative reference. Consider this
-example:
+the reference. A signed number is a relative reference. Consider this example:
<pre>
(abc(def)ghi)\g{-1}
</pre>
@@ -2115,6 +2168,11 @@ can be helpful in long patterns, and also in patterns that are created by
joining together fragments that contain references within themselves.
</P>
<P>
+The sequence \g{+1} is a reference to the next capturing subpattern. This kind
+of forward reference can be useful in patterns that repeat. Perl does not
+support the use of + in this way.
+</P>
+<P>
A back reference matches whatever actually matched the capturing subpattern in
the current subject string, rather than anything matching the subpattern
itself (see
@@ -2203,15 +2261,27 @@ coded as \b, \B, \A, \G, \Z, \z, ^ and $ are described
<P>
More complicated assertions are coded as subpatterns. There are two kinds:
those that look ahead of the current position in the subject string, and those
-that look behind it. An assertion subpattern is matched in the normal way,
-except that it does not cause the current matching position to be changed.
+that look behind it, and in each case an assertion may be positive (must
+succeed for matching to continue) or negative (must not succeed for matching to
+continue). An assertion subpattern is matched in the normal way, except that,
+when matching continues afterwards, the matching position in the subject string
+is as it was at the start of the assertion.
</P>
<P>
-Assertion subpatterns are not capturing subpatterns. If such an assertion
-contains capturing subpatterns within it, these are counted for the purposes of
+Assertion subpatterns are not capturing subpatterns. If an assertion contains
+capturing subpatterns within it, these are counted for the purposes of
numbering the capturing subpatterns in the whole pattern. However, substring
-capturing is carried out only for positive assertions. (Perl sometimes, but not
-always, does do capturing in negative assertions.)
+capturing is carried out only for positive assertions that succeed, that is,
+one of their branches matches, so matching continues after the assertion. If
+all branches of a positive assertion fail to match, nothing is captured, and
+control is passed to the previous backtracking point.
+</P>
+<P>
+No capturing is done for a negative assertion unless it is being used as a
+condition in a
+<a href="#subpatternsassubroutines">conditional subpattern</a>
+(see the discussion below). Matching continues after a non-conditional negative
+assertion only if all its branches fail to match.
</P>
<P>
For compatibility with Perl, most assertion subpatterns may be repeated; though
@@ -2310,18 +2380,31 @@ match. If there are insufficient characters before the current position, the
assertion fails.
</P>
<P>
-In a UTF mode, PCRE2 does not allow the \C escape (which matches a single code
-unit even in a UTF mode) to appear in lookbehind assertions, because it makes
-it impossible to calculate the length of the lookbehind. The \X and \R
-escapes, which can match different numbers of code units, are also not
-permitted.
+In UTF-8 and UTF-16 modes, PCRE2 does not allow the \C escape (which matches a
+single code unit even in a UTF mode) to appear in lookbehind assertions,
+because it makes it impossible to calculate the length of the lookbehind. The
+\X and \R escapes, which can match different numbers of code units, are never
+permitted in lookbehinds.
</P>
<P>
<a href="#subpatternsassubroutines">"Subroutine"</a>
calls (see below) such as (?2) or (?&X) are permitted in lookbehinds, as long
-as the subpattern matches a fixed-length string.
-<a href="#recursion">Recursion,</a>
-however, is not supported.
+as the subpattern matches a fixed-length string. However,
+<a href="#recursion">recursion,</a>
+that is, a "subroutine" call into a group that is already active,
+is not supported.
+</P>
+<P>
+Perl does not support back references in lookbehinds. PCRE2 does support them,
+but only if certain conditions are met. The PCRE2_MATCH_UNSET_BACKREF option
+must not be set, there must be no use of (?| in the pattern (it creates
+duplicate subpattern numbers), and if the back reference is by name, the name
+must be unique. Of course, the referenced subpattern must itself be of fixed
+length. The following pattern matches words containing at least two characters
+that begin and end with the same character:
+<pre>
+ \b(\w)\w++(?&#60;=\1)
+</PRE>
</P>
<P>
Possessive quantifiers can be used in conjunction with lookbehind assertions to
@@ -2459,7 +2542,9 @@ Checking for a used subpattern by name
<P>
Perl uses the syntax (?(&#60;name&#62;)...) or (?('name')...) to test for a used
subpattern by name. For compatibility with earlier versions of PCRE1, which had
-this facility before Perl, the syntax (?(name)...) is also recognized.
+this facility before Perl, the syntax (?(name)...) is also recognized. Note,
+however, that undelimited names consisting of the letter R followed by digits
+are ambiguous (see the following section).
</P>
<P>
Rewriting the above example to use a named subpattern gives this:
@@ -2474,30 +2559,52 @@ matched.
Checking for pattern recursion
</b><br>
<P>
-If the condition is the string (R), and there is no subpattern with the name R,
-the condition is true if a recursive call to the whole pattern or any
-subpattern has been made. If digits or a name preceded by ampersand follow the
-letter R, for example:
+"Recursion" in this sense refers to any subroutine-like call from one part of
+the pattern to another, whether or not it is actually recursive. See the
+sections entitled
+<a href="#recursion">"Recursive patterns"</a>
+and
+<a href="#subpatternsassubroutines">"Subpatterns as subroutines"</a>
+below for details of recursion and subpattern calls.
+</P>
+<P>
+If a condition is the string (R), and there is no subpattern with the name R,
+the condition is true if matching is currently in a recursion or subroutine
+call to the whole pattern or any subpattern. If digits follow the letter R, and
+there is no subpattern with that name, the condition is true if the most recent
+call is into a subpattern with the given number, which must exist somewhere in
+the overall pattern. This is a contrived example that is equivalent to a+b:
<pre>
- (?(R3)...) or (?(R&name)...)
+ ((?(R1)a+|(?1)b))
</pre>
-the condition is true if the most recent recursion is into a subpattern whose
-number or name is given. This condition does not check the entire recursion
-stack. If the name used in a condition of this kind is a duplicate, the test is
-applied to all subpatterns of the same name, and is true if any one of them is
-the most recent recursion.
+However, in both cases, if there is a subpattern with a matching name, the
+condition tests for its being set, as described in the section above, instead
+of testing for recursion. For example, creating a group with the name R1 by
+adding (?&#60;R1&#62;) to the above pattern completely changes its meaning.
+</P>
+<P>
+If a name preceded by ampersand follows the letter R, for example:
+<pre>
+ (?(R&name)...)
+</pre>
+the condition is true if the most recent recursion is into a subpattern of that
+name (which must exist within the pattern).
+</P>
+<P>
+This condition does not check the entire recursion stack. It tests only the
+current level. If the name used in a condition of this kind is a duplicate, the
+test is applied to all subpatterns of the same name, and is true if any one of
+them is the most recent recursion.
</P>
<P>
At "top level", all these recursion test conditions are false.
-<a href="#recursion">The syntax for recursive patterns</a>
-is described below.
<a name="subdefine"></a></P>
<br><b>
Defining subpatterns for use by reference only
</b><br>
<P>
-If the condition is the string (DEFINE), and there is no subpattern with the
-name DEFINE, the condition is always false. In this case, there may be only one
+If the condition is the string (DEFINE), the condition is always false, even if
+there is a group with the name DEFINE. In this case, there may be only one
alternative in the subpattern. It is always skipped if control reaches this
point in the pattern; the idea of DEFINE is that it can be used to define
subroutines that can be referenced from elsewhere. (The use of
@@ -2552,6 +2659,13 @@ presence of at least one letter in the subject. If a letter is found, the
subject is matched against the first alternative; otherwise it is matched
against the second. This pattern matches strings in one of the two forms
dd-aaa-dd or dd-dd-dd, where aaa are letters and dd are digits.
+</P>
+<P>
+When an assertion that is a condition contains capturing subpatterns, any
+capturing that occurs in a matching branch is retained afterwards, for both
+positive and negative assertions, because matching always continues after the
+assertion, whether it succeeds or fails. (Compare non-conditional assertions,
+when captures are retained only for positive assertions that succeed.)
<a name="comments"></a></P>
<br><a name="SEC22" href="#TOC1">COMMENTS</a><br>
<P>
@@ -2724,93 +2838,57 @@ is the actual recursive call.
Differences in recursion processing between PCRE2 and Perl
</b><br>
<P>
-Recursion processing in PCRE2 differs from Perl in two important ways. In PCRE2
-(like Python, but unlike Perl), a recursive subpattern call is always treated
-as an atomic group. That is, once it has matched some of the subject string, it
-is never re-entered, even if it contains untried alternatives and there is a
-subsequent matching failure. This can be illustrated by the following pattern,
-which purports to match a palindromic string that contains an odd number of
-characters (for example, "a", "aba", "abcba", "abcdcba"):
-<pre>
- ^(.|(.)(?1)\2)$
-</pre>
-The idea is that it either matches a single character, or two identical
-characters surrounding a sub-palindrome. In Perl, this pattern works; in PCRE2
-it does not if the pattern is longer than three characters. Consider the
-subject string "abcba":
+Some former differences between PCRE2 and Perl no longer exist.
</P>
<P>
-At the top level, the first character is matched, but as it is not at the end
-of the string, the first alternative fails; the second alternative is taken
-and the recursion kicks in. The recursive call to subpattern 1 successfully
-matches the next character ("b"). (Note that the beginning and end of line
-tests are not part of the recursion).
+Before release 10.30, recursion processing in PCRE2 differed from Perl in that
+a recursive subpattern call was always treated as an atomic group. That is,
+once it had matched some of the subject string, it was never re-entered, even
+if it contained untried alternatives and there was a subsequent matching
+failure. (Historical note: PCRE implemented recursion before Perl did.)
</P>
<P>
-Back at the top level, the next character ("c") is compared with what
-subpattern 2 matched, which was "a". This fails. Because the recursion is
-treated as an atomic group, there are now no backtracking points, and so the
-entire match fails. (Perl is able, at this point, to re-enter the recursion and
-try the second alternative.) However, if the pattern is written with the
-alternatives in the other order, things are different:
-<pre>
- ^((.)(?1)\2|.)$
-</pre>
-This time, the recursing alternative is tried first, and continues to recurse
-until it runs out of characters, at which point the recursion fails. But this
-time we do have another alternative to try at the higher level. That is the big
-difference: in the previous case the remaining alternative is at a deeper
-recursion level, which PCRE2 cannot use.
+Starting with release 10.30, recursive subroutine calls are no longer treated
+as atomic. That is, they can be re-entered to try unused alternatives if there
+is a matching failure later in the pattern. This is now compatible with the way
+Perl works. If you want a subroutine call to be atomic, you must explicitly
+enclose it in an atomic group.
</P>
<P>
-To change the pattern so that it matches all palindromic strings, not just
-those with an odd number of characters, it is tempting to change the pattern to
-this:
+Supporting backtracking into recursions simplifies certain types of recursive
+pattern. For example, this pattern matches palindromic strings:
<pre>
^((.)(?1)\2|.?)$
</pre>
-Again, this works in Perl, but not in PCRE2, and for the same reason. When a
-deeper recursion has matched a single character, it cannot be entered again in
-order to match an empty string. The solution is to separate the two cases, and
-write out the odd and even cases as alternatives at the higher level:
+The second branch in the group matches a single central character in the
+palindrome when there are an odd number of characters, or nothing when there
+are an even number of characters, but in order to work it has to be able to try
+the second case when the rest of the pattern match fails. If you want to match
+typical palindromic phrases, the pattern has to ignore all non-word characters,
+which can be done like this:
<pre>
- ^(?:((.)(?1)\2|)|((.)(?3)\4|.))
-</pre>
-If you want to match typical palindromic phrases, the pattern has to ignore all
-non-word characters, which can be done like this:
-<pre>
- ^\W*+(?:((.)\W*+(?1)\W*+\2|)|((.)\W*+(?3)\W*+\4|\W*+.\W*+))\W*+$
+ ^\W*+((.)\W*+(?1)\W*+\2|\W*+.?)\W*+$
</pre>
If run with the PCRE2_CASELESS option, this pattern matches phrases such as "A
-man, a plan, a canal: Panama!" and it works in both PCRE2 and Perl. Note the
-use of the possessive quantifier *+ to avoid backtracking into sequences of
-non-word characters. Without this, PCRE2 takes a great deal longer (ten times
-or more) to match typical phrases, and Perl takes so long that you think it has
-gone into a loop.
+man, a plan, a canal: Panama!". Note the use of the possessive quantifier *+ to
+avoid backtracking into sequences of non-word characters. Without this, PCRE2
+takes a great deal longer (ten times or more) to match typical phrases, and
+Perl takes so long that you think it has gone into a loop.
</P>
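<P>
As a cross-check when testing the phrase pattern above, the set of strings it
is intended to accept can be described without any regular expression
recursion. The following is a minimal sketch in plain Python, not PCRE2 code:
the function name is hypothetical, lower-casing only approximates
PCRE2_CASELESS, and Python's \w is Unicode-aware, so results may differ from
PCRE2 for non-ASCII input.
</P>

```python
import re


def is_palindromic_phrase(text: str) -> bool:
    """Reference check for palindromic phrases (hypothetical helper, not PCRE2)."""
    # Keep only word characters, mirroring how \W*+ skips runs of non-word characters.
    letters = re.findall(r"\w", text)
    # Lower-casing approximates caseless matching for ASCII subjects.
    folded = [ch.lower() for ch in letters]
    # A palindrome reads the same in both directions.
    return folded == folded[::-1]
```

<P>
For example, is_palindromic_phrase("A man, a plan, a canal: Panama!") is true,
because the word characters reduce to "amanaplanacanalpanama".
</P>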
<P>
-<b>WARNING</b>: The palindrome-matching patterns above work only if the subject
-string does not start with a palindrome that is shorter than the entire string.
-For example, although "abcba" is correctly matched, if the subject is "ababa",
-PCRE2 finds the palindrome "aba" at the start, then fails at top level because
-the end of the string does not follow. Once again, it cannot jump back into the
-recursion to try other alternatives, so the entire match fails.
-</P>
-<P>
-The second way in which PCRE2 and Perl differ in their recursion processing is
-in the handling of captured values. In Perl, when a subpattern is called
-recursively or as a subpattern (see the next section), it has no access to any
-values that were captured outside the recursion, whereas in PCRE2 these values
-can be referenced. Consider this pattern:
+Another way in which PCRE2 and Perl used to differ in their recursion
+processing is in the handling of captured values. Formerly in Perl, when a
+subpattern was called recursively or as a subroutine (see the next section), it
+had no access to any values that were captured outside the recursion, whereas
+in PCRE2 these values can be referenced. Consider this pattern:
<pre>
^(.)(\1|a(?2))
</pre>
-In PCRE2, this pattern matches "bab". The first capturing parentheses match "b",
-then in the second group, when the back reference \1 fails to match "b", the
-second alternative matches "a" and then recurses. In the recursion, \1 does
-now match "b" and so the whole match succeeds. In Perl, the pattern fails to
-match because inside the recursive call \1 cannot access the externally set
-value.
+This pattern matches "bab". The first capturing parentheses match "b", then in
+the second group, when the back reference \1 fails to match "b", the second
+alternative matches "a" and then recurses. In the recursion, \1 does now match
+"b" and so the whole match succeeds. This match used to fail in Perl, but in
+later versions (I tried 5.024) it now works.
<a name="subpatternsassubroutines"></a></P>
<br><a name="SEC24" href="#TOC1">SUBPATTERNS AS SUBROUTINES</a><br>
<P>
@@ -2837,11 +2915,10 @@ is used, it does match "sense and responsibility" as well as the other two
strings. Another example is given in the discussion of DEFINE above.
</P>
<P>
-All subroutine calls, whether recursive or not, are always treated as atomic
-groups. That is, once a subroutine has matched some of the subject string, it
-is never re-entered, even if it contains untried alternatives and there is a
-subsequent matching failure. Any capturing parentheses that are set during the
-subroutine call revert to their previous values afterwards.
+Like recursions, subroutine calls used to be treated as atomic, but this
+changed at PCRE2 release 10.30, so backtracking into subroutine calls can now
+occur. However, any capturing parentheses that are set during the subroutine
+call revert to their previous values afterwards.
</P>
<P>
Processing options such as case-independence are fixed when a subpattern is
@@ -2949,28 +3026,31 @@ The doubling is removed before the string is passed to the callout function.
<a name="backtrackcontrol"></a></P>
<br><a name="SEC27" href="#TOC1">BACKTRACKING CONTROL</a><br>
<P>
-Perl 5.10 introduced a number of "Special Backtracking Control Verbs", which
-are still described in the Perl documentation as "experimental and subject to
-change or removal in a future version of Perl". It goes on to say: "Their usage
-in production code should be noted to avoid problems during upgrades." The same
-remarks apply to the PCRE2 features described in this section.
-</P>
-<P>
-The new verbs make use of what was previously invalid syntax: an opening
-parenthesis followed by an asterisk. They are generally of the form (*VERB) or
-(*VERB:NAME). Some verbs take either form, possibly behaving differently
-depending on whether or not a name is present.
+There are a number of special "Backtracking Control Verbs" (to use Perl's
+terminology) that modify the behaviour of backtracking during matching. They
+are generally of the form (*VERB) or (*VERB:NAME). Some verbs take either form,
+possibly behaving differently depending on whether or not a name is present.
</P>
<P>
By default, for compatibility with Perl, a name is any sequence of characters
that does not include a closing parenthesis. The name is not processed in
any way, and it is not possible to include a closing parenthesis in the name.
-However, if the PCRE2_ALT_VERBNAMES option is set, normal backslash processing
-is applied to verb names and only an unescaped closing parenthesis terminates
-the name. A closing parenthesis can be included in a name either as \) or
-between \Q and \E. If the PCRE2_EXTENDED option is set, unescaped whitespace
-in verb names is skipped and #-comments are recognized, exactly as in the rest
-of the pattern.
+This can be changed by setting the PCRE2_ALT_VERBNAMES option, but the result
+is no longer Perl-compatible.
+</P>
+<P>
+When PCRE2_ALT_VERBNAMES is set, backslash processing is applied to verb names
+and only an unescaped closing parenthesis terminates the name. However, the
+only backslash items that are permitted are \Q, \E, and sequences such as
+\x{100} that define character code points. Character type escapes such as \d
+are faulted.
+</P>
+<P>
+A closing parenthesis can be included in a name either as \) or between \Q
+and \E. In addition to backslash processing, if the PCRE2_EXTENDED option is
+also set, unescaped whitespace in verb names is skipped, and #-comments are
+recognized, exactly as in the rest of the pattern. PCRE2_EXTENDED does not
+affect verb names unless PCRE2_ALT_VERBNAMES is also set.
</P>
<P>
The maximum length of a name is 255 in the 8-bit library and 65535 in the
@@ -2981,7 +3061,7 @@ not there. Any number of these verbs may occur in a pattern.
<P>
Since these verbs are specifically related to backtracking, most of them can be
used only when the pattern is to be matched using the traditional matching
-function, because these use a backtracking algorithm. With the exception of
+function, because that uses a backtracking algorithm. With the exception of
(*FAIL), which behaves like a failing negative assertion, the backtracking
control verbs cause an error if encountered by the DFA matching function.
</P>
@@ -3119,11 +3199,11 @@ Verbs that act after backtracking
The following verbs do nothing when they are encountered. Matching continues
with what follows, but if there is no subsequent match, causing a backtrack to
the verb, a failure is forced. That is, backtracking cannot pass to the left of
-the verb. However, when one of these verbs appears inside an atomic group
-(which includes any group that is called as a subroutine) or in an assertion
-that is true, its effect is confined to that group, because once the group has
-been matched, there is never any backtracking into it. In this situation,
-backtracking has to jump to the left of the entire atomic group or assertion.
+the verb. However, when one of these verbs appears inside an atomic group or in
+an assertion that is true, its effect is confined to that group, because once
+the group has been matched, there is never any backtracking into it. In this
+situation, backtracking has to jump to the left of the entire atomic group or
+assertion.
</P>
<P>
These verbs differ in exactly what kind of failure occurs when backtracking
@@ -3187,8 +3267,8 @@ expressed in any other way. In an anchored pattern (*PRUNE) has the same effect
as (*COMMIT).
</P>
<P>
-The behaviour of (*PRUNE:NAME) is the not the same as (*MARK:NAME)(*PRUNE).
-It is like (*MARK:NAME) in that the name is remembered for passing back to the
+The behaviour of (*PRUNE:NAME) is not the same as (*MARK:NAME)(*PRUNE). It is
+like (*MARK:NAME) in that the name is remembered for passing back to the
caller. However, (*SKIP:NAME) searches only for names set with (*MARK),
ignoring those set by (*PRUNE) or (*THEN).
<pre>
@@ -3329,28 +3409,34 @@ in the second repeat of the group acts.
Backtracking verbs in assertions
</b><br>
<P>
-(*FAIL) in an assertion has its normal effect: it forces an immediate
-backtrack.
+(*FAIL) in any assertion has its normal effect: it forces an immediate
+backtrack. The behaviour of the other backtracking verbs depends on whether
+the assertion is standalone or acting as the condition in a conditional
+subpattern.
+</P>
+<P>
+(*ACCEPT) in a standalone positive assertion causes the assertion to succeed
+without any further processing; captured substrings are retained. In a standalone
+negative assertion, (*ACCEPT) causes the assertion to fail without any further
+processing; captured substrings are discarded.
+</P>
+<P>
+If the assertion is a condition, (*ACCEPT) causes the condition to be true for
+a positive assertion and false for a negative one; captured substrings are
+retained in both cases.
</P>
<P>
-(*ACCEPT) in a positive assertion causes the assertion to succeed without any
-further processing. In a negative assertion, (*ACCEPT) causes the assertion to
-fail without any further processing.
+The effect of (*THEN) is not allowed to escape beyond an assertion. If there
+are no more branches to try, (*THEN) causes a positive assertion to be false,
+and a negative assertion to be true.
</P>
<P>
The other backtracking verbs are not treated specially if they appear in a
-positive assertion. In particular, (*THEN) skips to the next alternative in the
-innermost enclosing group that has alternations, whether or not this is within
-the assertion.
-</P>
-<P>
-Negative assertions are, however, different, in order to ensure that changing a
-positive assertion into a negative assertion changes its result. Backtracking
-into (*COMMIT), (*SKIP), or (*PRUNE) causes a negative assertion to be true,
-without considering any further alternative branches in the assertion.
-Backtracking into (*THEN) causes it to skip to the next enclosing alternative
-within the assertion (the normal behaviour), but if the assertion does not have
-such an alternative, (*THEN) behaves like (*PRUNE).
+standalone positive assertion. In a conditional positive assertion,
+backtracking into (*COMMIT), (*SKIP), or (*PRUNE) causes the condition to be
+false. However, for both standalone and conditional negative assertions,
+backtracking into (*COMMIT), (*SKIP), or (*PRUNE) causes the assertion to be
+true, without considering any further alternative branches.
<a name="btsub"></a></P>
<br><b>
Backtracking verbs in subroutines
@@ -3393,9 +3479,9 @@ Cambridge, England.
</P>
<br><a name="SEC30" href="#TOC1">REVISION</a><br>
<P>
-Last updated: 20 June 2016
+Last updated: 12 September 2017
<br>
-Copyright &copy; 1997-2016 University of Cambridge.
+Copyright &copy; 1997-2017 University of Cambridge.
<br>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
diff --git a/doc/html/pcre2perform.html b/doc/html/pcre2perform.html
index ac9d23c..28f4f73 100644
--- a/doc/html/pcre2perform.html
+++ b/doc/html/pcre2perform.html
@@ -15,7 +15,7 @@ please consult the man page, in case the conversion went wrong.
<ul>
<li><a name="TOC1" href="#SEC1">PCRE2 PERFORMANCE</a>
<li><a name="TOC2" href="#SEC2">COMPILED PATTERN MEMORY USAGE</a>
-<li><a name="TOC3" href="#SEC3">STACK USAGE AT RUN TIME</a>
+<li><a name="TOC3" href="#SEC3">STACK AND HEAP USAGE AT RUN TIME</a>
<li><a name="TOC4" href="#SEC4">PROCESSING TIME</a>
<li><a name="TOC5" href="#SEC5">AUTHOR</a>
<li><a name="TOC6" href="#SEC6">REVISION</a>
@@ -29,11 +29,11 @@ of them.
<br><a name="SEC2" href="#TOC1">COMPILED PATTERN MEMORY USAGE</a><br>
<P>
Patterns are compiled by PCRE2 into a reasonably efficient interpretive code,
-so that most simple patterns do not use much memory. However, there is one case
-where the memory usage of a compiled pattern can be unexpectedly large. If a
-parenthesized subpattern has a quantifier with a minimum greater than 1 and/or
-a limited maximum, the whole subpattern is repeated in the compiled code. For
-example, the pattern
+so that most simple patterns do not use much memory for storing the compiled
+version. However, there is one case where the memory usage of a compiled
+pattern can be unexpectedly large. If a parenthesized subpattern has a
+quantifier with a minimum greater than 1 and/or a limited maximum, the whole
+subpattern is repeated in the compiled code. For example, the pattern
<pre>
(abc|def){2,4}
</pre>
@@ -52,13 +52,13 @@ example, the very simple pattern
<pre>
((ab){1,1000}c){1,3}
</pre>
-uses 51K bytes when compiled using the 8-bit library. When PCRE2 is compiled
-with its default internal pointer size of two bytes, the size limit on a
-compiled pattern is 64K code units in the 8-bit and 16-bit libraries, and this
-is reached with the above pattern if the outer repetition is increased from 3
-to 4. PCRE2 can be compiled to use larger internal pointers and thus handle
-larger compiled patterns, but it is better to try to rewrite your pattern to
-use less memory if you can.
+uses over 50K bytes when compiled using the 8-bit library. When PCRE2 is
+compiled with its default internal pointer size of two bytes, the size limit on
+a compiled pattern is 64K code units in the 8-bit and 16-bit libraries, and
+this is reached with the above pattern if the outer repetition is increased
+from 3 to 4. PCRE2 can be compiled to use larger internal pointers and thus
+handle larger compiled patterns, but it is better to try to rewrite your
+pattern to use less memory if you can.
</P>
<P>
One way of reducing the memory usage for such patterns is to make use of
@@ -68,25 +68,34 @@ facility. Re-writing the above pattern as
<pre>
((ab)(?2){0,999}c)(?1){0,2}
</pre>
-reduces the memory requirements to 18K, and indeed it remains under 20K even
-with the outer repetition increased to 100. However, this pattern is not
-exactly equivalent, because the "subroutine" calls are treated as
-<a href="pcre2pattern.html#atomicgroup">atomic groups</a>
-into which there can be no backtracking if there is a subsequent matching
-failure. Therefore, PCRE2 cannot do this kind of rewriting automatically.
-Furthermore, there is a noticeable loss of speed when executing the modified
-pattern. Nevertheless, if the atomic grouping is not a problem and the loss of
-speed is acceptable, this kind of rewriting will allow you to process patterns
-that PCRE2 cannot otherwise handle.
-</P>
-<br><a name="SEC3" href="#TOC1">STACK USAGE AT RUN TIME</a><br>
-<P>
-When <b>pcre2_match()</b> is used for matching, certain kinds of pattern can
-cause it to use large amounts of the process stack. In some environments the
-default process stack is quite small, and if it runs out the result is often
-SIGSEGV. Rewriting your pattern can often help. The
-<a href="pcre2stack.html"><b>pcre2stack</b></a>
-documentation discusses this issue in detail.
+reduces the memory requirements to around 16K, and indeed it remains under 20K
+even with the outer repetition increased to 100. However, this kind of pattern
+is not always exactly equivalent, because any captures within subroutine calls
+are lost when the subroutine completes. If this is not a problem, this kind of
+rewriting will allow you to process patterns that PCRE2 cannot otherwise
+handle. The matching performance of the two different versions of the pattern
+is roughly the same. (This applies from release 10.30; things were different
+in earlier releases.)
+</P>
+<br><a name="SEC3" href="#TOC1">STACK AND HEAP USAGE AT RUN TIME</a><br>
+<P>
+From release 10.30, the interpretive (non-JIT) version of <b>pcre2_match()</b>
+uses very little system stack at run time. In earlier releases recursive
+function calls could use a great deal of stack, and this could cause problems,
+but this usage has been eliminated. Backtracking positions are now explicitly
+remembered in memory frames controlled by the code. An initial 20K vector of
+frames is allocated on the system stack (enough for about 100 frames for small
+patterns), but if this is insufficient, heap memory is used. The amount of heap
+memory can be limited; if the limit is set to zero, only the initial stack
+vector is used. Rewriting patterns to be time-efficient, as described below,
+may also reduce the memory requirements.
+</P>
+<P>
+In contrast to <b>pcre2_match()</b>, <b>pcre2_dfa_match()</b> does use recursive
+function calls, but only for processing atomic groups, lookaround assertions,
+and recursion within the pattern. Too much nested recursion may cause stack
+issues. The "match depth" parameter can be used to limit the depth of function
+recursion in <b>pcre2_dfa_match()</b>.
</P>
<br><a name="SEC4" href="#TOC1">PROCESSING TIME</a><br>
<P>
@@ -175,7 +184,54 @@ appreciable time with strings longer than about 20 characters.
</P>
<P>
In many cases, the solution to this kind of performance issue is to use an
-atomic group or a possessive quantifier.
+atomic group or a possessive quantifier. This can often reduce memory
+requirements as well. As another example, consider this pattern:
+<pre>
+ ([^&#60;]|&#60;(?!inet))+
+</pre>
+It matches from wherever it starts until it encounters "&#60;inet" or the end of
+the data, and is the kind of pattern that might be used when processing an XML
+file. Each iteration of the outer parentheses matches either one character that
+is not "&#60;" or a "&#60;" that is not followed by "inet". However, each time a
+parenthesis is processed, a backtracking position is passed, so this
+formulation uses a memory frame for each matched character. For a long string,
+a lot of memory is required. Consider now this rewritten pattern, which matches
+exactly the same strings:
+<pre>
+ ([^&#60;]++|&#60;(?!inet))+
+</pre>
+This runs much faster, because sequences of characters that do not contain "&#60;"
+are "swallowed" in one item inside the parentheses, and a possessive quantifier
+is used to stop any backtracking into the runs of non-"&#60;" characters. This
+version also uses a lot less memory because entry to a new set of parentheses
+happens only when a "&#60;" character that is not followed by "inet" is encountered
+(and we assume this is relatively rare).
+</P>
+<P>
+This example shows that one way of optimizing performance when matching long
+subject strings is to write repeated parenthesized subpatterns to match more
+than one character whenever possible.
+</P>
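<P>
The claim that the two formulations match exactly the same text can be checked
with a short script. This is only a sketch: it uses Python's re module, which is
a different engine from PCRE2, so it says nothing about PCRE2's memory frames
or speed. Possessive quantifiers need Python 3.11 or later; the fallback for
older versions is an assumption of this example, emulating the possessive run
with a lookahead and a back reference. The names PLAIN, POSSESSIVE and
matched_text are hypothetical.
</P>

```python
import re

# Original formulation: one group iteration per matched character.
PLAIN = re.compile(r"([^<]|<(?!inet))+")

try:
    # Rewritten formulation with a possessive quantifier (Python 3.11+).
    POSSESSIVE = re.compile(r"([^<]++|<(?!inet))+")
except re.error:
    # Older Python: a lookahead captures the whole run, and the back
    # reference then consumes it without allowing backtracking into it.
    POSSESSIVE = re.compile(r"((?=([^<]+))\2|<(?!inet))+")


def matched_text(pattern, subject):
    """Return the text matched at the start of subject, or None."""
    m = pattern.match(subject)
    return m.group(0) if m else None


subject = "<a>text <b>more</b> text<inet>rest"
plain_result = matched_text(PLAIN, subject)
possessive_result = matched_text(POSSESSIVE, subject)
```

<P>
Both results stop immediately before the "inet" element, which is the behaviour
described above.
</P>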
+<br><b>
+SETTING RESOURCE LIMITS
+</b><br>
+<P>
+You can set limits on the amount of processing that takes place when matching,
+and on the amount of heap memory that is used. The default values of the limits
+are very large, and unlikely ever to operate. They can be changed when PCRE2 is
+built, and they can also be set when <b>pcre2_match()</b> or
+<b>pcre2_dfa_match()</b> is called. For details of these interfaces, see the
+<a href="pcre2build.html"><b>pcre2build</b></a>
+documentation and the section entitled
+<a href="pcre2api.html#matchcontext">"The match context"</a>
+in the
+<a href="pcre2api.html"><b>pcre2api</b></a>
+documentation.
+</P>
+<P>
+The <b>pcre2test</b> test program has a modifier called "find_limits" which, if
+applied to a subject line, causes it to find the smallest limits that allow a
+pattern to match. This is done by repeatedly matching with different limits.
</P>
<br><a name="SEC5" href="#TOC1">AUTHOR</a><br>
<P>
@@ -188,9 +244,9 @@ Cambridge, England.
</P>
<br><a name="SEC6" href="#TOC1">REVISION</a><br>
<P>
-Last updated: 02 January 2015
+Last updated: 08 April 2017
<br>
-Copyright &copy; 1997-2015 University of Cambridge.
+Copyright &copy; 1997-2017 University of Cambridge.
<br>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
diff --git a/doc/html/pcre2posix.html b/doc/html/pcre2posix.html
index 1d5fe63..8a4431c 100644
--- a/doc/html/pcre2posix.html
+++ b/doc/html/pcre2posix.html
@@ -69,7 +69,7 @@ replacement library. Other POSIX options are not even defined.
<P>
There are also some options that are not defined by POSIX. These have been
added at the request of users who want to make use of certain PCRE2-specific
-features via the POSIX calling interface.
+features via the POSIX calling interface or to add BSD or GNU functionality.
</P>
<P>
When PCRE2 is called via these functions, it is only the API that is POSIX-like
@@ -91,10 +91,11 @@ identifying error codes.
<br><a name="SEC3" href="#TOC1">COMPILING A PATTERN</a><br>
<P>
The function <b>regcomp()</b> is called to compile a pattern into an
-internal form. The pattern is a C string terminated by a binary zero, and
-is passed in the argument <i>pattern</i>. The <i>preg</i> argument is a pointer
-to a <b>regex_t</b> structure that is used as a base for storing information
-about the compiled regular expression.
+internal form. By default, the pattern is a C string terminated by a binary
+zero (but see REG_PEND below). The <i>preg</i> argument is a pointer to a
+<b>regex_t</b> structure that is used as a base for storing information about
+the compiled regular expression. (It is also used for input when REG_PEND is
+set.)
</P>
<P>
The argument <i>cflags</i> is either zero, or contains one or more of the bits
@@ -117,6 +118,14 @@ The PCRE2_MULTILINE option is set when the regular expression is passed for
compilation to the native function. Note that this does <i>not</i> mimic the
defined POSIX behaviour for REG_NEWLINE (see the following section).
<pre>
+ REG_NOSPEC
+</pre>
+The PCRE2_LITERAL option is set when the regular expression is passed for
+compilation to the native function. This disables all meta characters in the
+pattern, causing it to be treated as a literal string. The only other options
+that are allowed with REG_NOSPEC are REG_ICASE, REG_NOSUB, REG_PEND, and
+REG_UTF. Note that REG_NOSPEC is not part of the POSIX standard.
+<pre>
REG_NOSUB
</pre>
When a pattern that is compiled with this flag is passed to <b>regexec()</b> for
captured strings are returned. Versions of the PCRE2 library prior to 10.22 used
to set the PCRE2_NO_AUTO_CAPTURE compile option, but this no longer happens
because it disables the use of back references.
<pre>
+ REG_PEND
+</pre>
+If this option is set, the <b>re_endp</b> field in the <i>preg</i> structure
+(which has the type const char *) must be set to point to the character beyond
+the end of the pattern before calling <b>regcomp()</b>. The pattern itself may
+now contain binary zeroes, which are treated as data characters. Without
+REG_PEND, a binary zero terminates the pattern and the <b>re_endp</b> field is
+ignored. This is a GNU extension to the POSIX standard and should be used with
+caution in software intended to be portable to other systems.
+<pre>
REG_UCP
</pre>
The PCRE2_UCP option is set when the regular expression is passed for
@@ -156,9 +175,10 @@ class such as [^a] (they are).
</P>
<P>
The yield of <b>regcomp()</b> is zero on success, and non-zero otherwise. The
-<i>preg</i> structure is filled in on success, and one member of the structure
-is public: <i>re_nsub</i> contains the number of capturing subpatterns in
-the regular expression. Various error codes are defined in the header file.
+<i>preg</i> structure is filled in on success, and one other member of the
+structure (as well as <i>re_endp</i>) is public: <i>re_nsub</i> contains the
+number of capturing subpatterns in the regular expression. Various error codes
+are defined in the header file.
</P>
<P>
NOTE: If the yield of <b>regcomp()</b> is non-zero, you must not attempt to
@@ -228,15 +248,26 @@ function.
<pre>
REG_STARTEND
</pre>
-The string is considered to start at <i>string</i> + <i>pmatch[0].rm_so</i> and
-to have a terminating NUL located at <i>string</i> + <i>pmatch[0].rm_eo</i>
-(there need not actually be a NUL at that location), regardless of the value of
-<i>nmatch</i>. This is a BSD extension, compatible with but not specified by
-IEEE Standard 1003.2 (POSIX.2), and should be used with caution in software
-intended to be portable to other systems. Note that a non-zero <i>rm_so</i> does
-not imply REG_NOTBOL; REG_STARTEND affects only the location of the string, not
-how it is matched. Setting REG_STARTEND and passing <i>pmatch</i> as NULL are
-mutually exclusive; the error REG_INVARG is returned.
+When this option is set, the subject string starts at <i>string</i> +
+<i>pmatch[0].rm_so</i> and ends at <i>string</i> + <i>pmatch[0].rm_eo</i>, which
+should point to the first character beyond the string. There may be binary
+zeroes within the subject string, and indeed, using REG_STARTEND is the only
+way to pass a subject string that contains a binary zero.
+</P>
+<P>
+Whatever the value of <i>pmatch[0].rm_so</i>, the offsets of the matched string
+and any captured substrings are still given relative to the start of
+<i>string</i> itself. (Before PCRE2 release 10.30 these were given relative to
+<i>string</i> + <i>pmatch[0].rm_so</i>, but this differs from other
+implementations.)
+</P>
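<P>
The change in offset base amounts to adding <i>rm_so</i> to every offset that
was set. The arithmetic can be sketched as follows, for anyone comparing results
from before and after release 10.30. This is an illustration in Python, not part
of the POSIX wrapper: the function name is hypothetical, and it assumes the
usual convention that unset entries hold -1.
</P>

```python
def to_string_relative(offsets, rm_so):
    """Convert (start, end) pairs measured from string + rm_so
    into pairs measured from the start of string itself."""
    converted = []
    for start, end in offsets:
        if start < 0:
            # Unset entries stay as they are.
            converted.append((start, end))
        else:
            # Shift both ends by the starting offset of the subject.
            converted.append((start + rm_so, end + rm_so))
    return converted
```

<P>
With <i>rm_so</i> equal to 10, an old-style pair (0, 3) becomes (10, 13), which
is what releases from 10.30 onwards report directly.
</P>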
+<P>
+This is a BSD extension, compatible with but not specified by IEEE Standard
+1003.2 (POSIX.2), and should be used with caution in software intended to be
+portable to other systems. Note that a non-zero <i>rm_so</i> does not imply
+REG_NOTBOL; REG_STARTEND affects only the location and length of the string,
+not how it is matched. Setting REG_STARTEND and passing <i>pmatch</i> as NULL
+are mutually exclusive; the error REG_INVARG is returned.
</P>
<P>
If the pattern was compiled with the REG_NOSUB flag, no data about any matched
@@ -291,9 +322,9 @@ Cambridge, England.
</P>
<br><a name="SEC9" href="#TOC1">REVISION</a><br>
<P>
-Last updated: 31 January 2016
+Last updated: 15 June 2017
<br>
-Copyright &copy; 1997-2016 University of Cambridge.
+Copyright &copy; 1997-2017 University of Cambridge.
<br>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
diff --git a/doc/html/pcre2serialize.html b/doc/html/pcre2serialize.html
index edf415a..813b25a 100644
--- a/doc/html/pcre2serialize.html
+++ b/doc/html/pcre2serialize.html
@@ -55,7 +55,10 @@ The facility for saving and restoring compiled patterns is intended for use
within individual applications. As such, the data supplied to
<b>pcre2_serialize_decode()</b> is expected to be trusted data, not data from
arbitrary external sources. There is only some simple consistency checking, not
-complete validation of what is being re-loaded.
+complete validation of what is being re-loaded. Corrupted data may cause
+undefined results. For example, if the length field of a pattern in the
+serialized data is corrupted, the deserializing code may read beyond the end of
+the byte stream that is passed to it.
</P>
<br><a name="SEC3" href="#TOC1">SAVING COMPILED PATTERNS</a><br>
<P>
@@ -190,9 +193,9 @@ Cambridge, England.
</P>
<br><a name="SEC6" href="#TOC1">REVISION</a><br>
<P>
-Last updated: 24 May 2016
+Last updated: 21 March 2017
<br>
-Copyright &copy; 1997-2016 University of Cambridge.
+Copyright &copy; 1997-2017 University of Cambridge.
<br>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
diff --git a/doc/html/pcre2stack.html b/doc/html/pcre2stack.html
deleted file mode 100644
index 2942c7a..0000000
--- a/doc/html/pcre2stack.html
+++ /dev/null
@@ -1,207 +0,0 @@
-<html>
-<head>
-<title>pcre2stack specification</title>
-</head>
-<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
-<h1>pcre2stack man page</h1>
-<p>
-Return to the <a href="index.html">PCRE2 index page</a>.
-</p>
-<p>
-This page is part of the PCRE2 HTML documentation. It was generated
-automatically from the original man page. If there is any nonsense in it,
-please consult the man page, in case the conversion went wrong.
-<br>
-<br><b>
-PCRE2 DISCUSSION OF STACK USAGE
-</b><br>
-<P>
-When you call <b>pcre2_match()</b>, it makes use of an internal function called
-<b>match()</b>. This calls itself recursively at branch points in the pattern,
-in order to remember the state of the match so that it can back up and try a
-different alternative after a failure. As matching proceeds deeper and deeper
-into the tree of possibilities, the recursion depth increases. The
-<b>match()</b> function is also called in other circumstances, for example,
-whenever a parenthesized sub-pattern is entered, and in certain cases of
-repetition.
-</P>
-<P>
-Not all calls of <b>match()</b> increase the recursion depth; for an item such
-as a* it may be called several times at the same level, after matching
-different numbers of a's. Furthermore, in a number of cases where the result of
-the recursive call would immediately be passed back as the result of the
-current call (a "tail recursion"), the function is just restarted instead.
-</P>
-<P>
-Each time the internal <b>match()</b> function is called recursively, it uses
-memory from the process stack. For certain kinds of pattern and data, very
-large amounts of stack may be needed, despite the recognition of "tail
-recursion". Note that if PCRE2 is compiled with the -fsanitize=address option
-of the GCC compiler, the stack requirements are greatly increased.
-</P>
-<P>
-The above comments apply when <b>pcre2_match()</b> is run in its normal
-interpretive manner. If the compiled pattern was processed by
-<b>pcre2_jit_compile()</b>, and just-in-time compiling was successful, and the
-options passed to <b>pcre2_match()</b> were not incompatible, the matching
-process uses the JIT-compiled code instead of the <b>match()</b> function. In
-this case, the memory requirements are handled entirely differently. See the
-<a href="pcre2jit.html"><b>pcre2jit</b></a>
-documentation for details.
-</P>
-<P>
-The <b>pcre2_dfa_match()</b> function operates in a different way to
-<b>pcre2_match()</b>, and uses recursion only when there is a regular expression
-recursion or subroutine call in the pattern. This includes the processing of
-assertion and "once-only" subpatterns, which are handled like subroutine calls.
-Normally, these are never very deep, and the limit on the complexity of
-<b>pcre2_dfa_match()</b> is controlled by the amount of workspace it is given.
-However, it is possible to write patterns with runaway infinite recursions;
-such patterns will cause <b>pcre2_dfa_match()</b> to run out of stack. At
-present, there is no protection against this.
-</P>
-<P>
-The comments that follow do NOT apply to <b>pcre2_dfa_match()</b>; they are
-relevant only for <b>pcre2_match()</b> without the JIT optimization.
-</P>
-<br><b>
-Reducing <b>pcre2_match()</b>'s stack usage
-</b><br>
-<P>
-You can often reduce the amount of recursion, and therefore the
-amount of stack used, by modifying the pattern that is being matched. Consider,
-for example, this pattern:
-<pre>
- ([^&#60;]|&#60;(?!inet))+
-</pre>
-It matches from wherever it starts until it encounters "&#60;inet" or the end of
-the data, and is the kind of pattern that might be used when processing an XML
-file. Each iteration of the outer parentheses matches either one character that
-is not "&#60;" or a "&#60;" that is not followed by "inet". However, each time a
-parenthesis is processed, a recursion occurs, so this formulation uses a stack
-frame for each matched character. For a long string, a lot of stack is
-required. Consider now this rewritten pattern, which matches exactly the same
-strings:
-<pre>
- ([^&#60;]++|&#60;(?!inet))+
-</pre>
-This uses very much less stack, because runs of characters that do not contain
-"&#60;" are "swallowed" in one item inside the parentheses. Recursion happens only
-when a "&#60;" character that is not followed by "inet" is encountered (and we
-assume this is relatively rare). A possessive quantifier is used to stop any
-backtracking into the runs of non-"&#60;" characters, but that is not related to
-stack usage.
-</P>
-<P>
-This example shows that one way of avoiding stack problems when matching long
-subject strings is to write repeated parenthesized subpatterns to match more
-than one character whenever possible.
-</P>
-<br><b>
-Compiling PCRE2 to use heap instead of stack for <b>pcre2_match()</b>
-</b><br>
-<P>
-In environments where stack memory is constrained, you might want to compile
-PCRE2 to use heap memory instead of stack for remembering back-up points when
-<b>pcre2_match()</b> is running. This makes it run more slowly, however. Details
-of how to do this are given in the
-<a href="pcre2build.html"><b>pcre2build</b></a>
-documentation. When built in this way, instead of using the stack, PCRE2
-gets memory for remembering backup points from the heap. By default, the memory
-is obtained by calling the system <b>malloc()</b> function, but you can arrange
-to supply your own memory management function. For details, see the section
-entitled
-<a href="pcre2api.html#matchcontext">"The match context"</a>
-in the
-<a href="pcre2api.html"><b>pcre2api</b></a>
-documentation. Since the block sizes are always the same, it may be possible to
-implement a customized memory handler that is more efficient than the standard
-function. The memory blocks obtained for this purpose are retained and re-used
-if possible while <b>pcre2_match()</b> is running. They are all freed just
-before it exits.
-</P>
-<br><b>
-Limiting <b>pcre2_match()</b>'s stack usage
-</b><br>
-<P>
-You can set limits on the number of times the internal <b>match()</b> function
-is called, both in total and recursively. If a limit is exceeded,
-<b>pcre2_match()</b> returns an error code. Setting suitable limits should
-prevent it from running out of stack. The default values of the limits are very
-large, and unlikely ever to operate. They can be changed when PCRE2 is built,
-and they can also be set when <b>pcre2_match()</b> is called. For details of
-these interfaces, see the
-<a href="pcre2build.html"><b>pcre2build</b></a>
-documentation and the section entitled
-<a href="pcre2api.html#matchcontext">"The match context"</a>
-in the
-<a href="pcre2api.html"><b>pcre2api</b></a>
-documentation.
-</P>
-<P>
-As a very rough rule of thumb, you should reckon on about 500 bytes per
-recursion. Thus, if you want to limit your stack usage to 8Mb, you should set
-the limit at 16000 recursions. A 64Mb stack, on the other hand, can support
-around 128000 recursions.
-</P>
-<P>
-The <b>pcre2test</b> test program has a modifier called "find_limits" which, if
-applied to a subject line, causes it to find the smallest limits that allow a
-pattern to match. This is done by calling <b>pcre2_match()</b> repeatedly with
-different limits.
-</P>
-<br><b>
-Changing stack size in Unix-like systems
-</b><br>
-<P>
-In Unix-like environments, there is not often a problem with the stack unless
-very long strings are involved, though the default limit on stack size varies
-from system to system. Values from 8Mb to 64Mb are common. You can find your
-default limit by running the command:
-<pre>
- ulimit -s
-</pre>
-Unfortunately, the effect of running out of stack is often SIGSEGV, though
-sometimes a more explicit error message is given. You can normally increase the
-limit on stack size by code such as this:
-<pre>
- struct rlimit rlim;
- getrlimit(RLIMIT_STACK, &rlim);
- rlim.rlim_cur = 100*1024*1024;
- setrlimit(RLIMIT_STACK, &rlim);
-</pre>
-This reads the current limits (soft and hard) using <b>getrlimit()</b>, then
-attempts to increase the soft limit to 100Mb using <b>setrlimit()</b>. You must
-do this before calling <b>pcre2_match()</b>.
-</P>
-<br><b>
-Changing stack size in Mac OS X
-</b><br>
-<P>
-Using <b>setrlimit()</b>, as described above, should also work on Mac OS X. It
-is also possible to set a stack size when linking a program. There is a
-discussion about stack sizes in Mac OS X at this web site:
-<a href="http://developer.apple.com/qa/qa2005/qa1419.html">http://developer.apple.com/qa/qa2005/qa1419.html.</a>
-</P>
-<br><b>
-AUTHOR
-</b><br>
-<P>
-Philip Hazel
-<br>
-University Computing Service
-<br>
-Cambridge, England.
-<br>
-</P>
-<br><b>
-REVISION
-</b><br>
-<P>
-Last updated: 21 November 2014
-<br>
-Copyright &copy; 1997-2014 University of Cambridge.
-<br>
-<p>
-Return to the <a href="index.html">PCRE2 index page</a>.
-</p>
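The rule of thumb and the setrlimit() fragment in the removed pcre2stack page can be sketched as a pair of small C helpers. This is an illustrative sketch, not PCRE2 code: the names recursion_budget() and raise_stack_soft_limit() are hypothetical, the 500-byte frame size is the page's own rough estimate, and the page's "8Mb" and "64Mb" are treated here as round decimal figures so that the arithmetic reproduces its 16000 and 128000.

```c
#include <assert.h>
#include <sys/resource.h>

/* Hypothetical helper: rough number of match() recursions that fit in
   stack_bytes, given an estimated cost per recursion (the removed page
   suggests about 500 bytes as a very rough figure). */
unsigned long recursion_budget(unsigned long stack_bytes,
                               unsigned long bytes_per_recursion)
{
    assert(bytes_per_recursion > 0);
    return stack_bytes / bytes_per_recursion;
}

/* Hypothetical helper: make sure the soft stack limit is at least `wanted`
   bytes, clamping the request to the hard limit. Unlike the fragment in the
   page, the return values of getrlimit() and setrlimit() are checked.
   Returns 0 on success (including when no change was needed), -1 on error. */
int raise_stack_soft_limit(rlim_t wanted)
{
    struct rlimit rlim;

    if (getrlimit(RLIMIT_STACK, &rlim) != 0)
        return -1;

    /* Nothing to do if the soft limit is already large enough. */
    if (rlim.rlim_cur == RLIM_INFINITY || rlim.rlim_cur >= wanted)
        return 0;

    /* An unprivileged process cannot exceed the hard limit. */
    if (rlim.rlim_max != RLIM_INFINITY && wanted > rlim.rlim_max)
        wanted = rlim.rlim_max;

    rlim.rlim_cur = wanted;
    return setrlimit(RLIMIT_STACK, &rlim) == 0 ? 0 : -1;
}
```

As the page notes, a call such as raise_stack_soft_limit(100UL * 1024 * 1024) would have to be made before pcre2_match() is called. The result of recursion_budget() is only an estimate to feed into whatever limit the application sets.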
diff --git a/doc/html/pcre2syntax.html b/doc/html/pcre2syntax.html
index 7fdc0dc..9098f47 100644
--- a/doc/html/pcre2syntax.html
+++ b/doc/html/pcre2syntax.html
@@ -430,18 +430,21 @@ but some of them use Unicode properties if PCRE2_UCP is set. You can use
(?i) caseless
(?J) allow duplicate names
(?m) multiline
+ (?n) no auto capture
(?s) single line (dotall)
(?U) default ungreedy (lazy)
- (?x) extended (ignore white space)
+ (?x) extended: ignore white space except in classes
+ (?xx) as (?x) but also ignore space and tab in classes
(?-...) unset option(s)
</pre>
The following are recognized only at the very start of a pattern or after one
of the newline or \R options with similar syntax. More than one of them may
-appear.
+appear. For the first three, d is a decimal number.
<pre>
- (*LIMIT_MATCH=d) set the match limit to d (decimal number)
- (*LIMIT_RECURSION=d) set the recursion limit to d (decimal number)
- (*NOTEMPTY) set PCRE2_NOTEMPTY when matching
+ (*LIMIT_DEPTH=d) set the backtracking limit to d
+ (*LIMIT_HEAP=d) set the heap size limit to d kilobytes
+ (*LIMIT_MATCH=d) set the match limit to d
+ (*NOTEMPTY) set PCRE2_NOTEMPTY when matching
(*NOTEMPTY_ATSTART) set PCRE2_NOTEMPTY_ATSTART when matching
(*NO_AUTO_POSSESS) no auto-possessification (PCRE2_NO_AUTO_POSSESS)
(*NO_DOTSTAR_ANCHOR) no .* anchoring (PCRE2_NO_DOTSTAR_ANCHOR)
@@ -450,10 +453,11 @@ appear.
(*UTF) set appropriate UTF mode for the library in use
(*UCP) set PCRE2_UCP (use Unicode properties for \d etc)
</pre>
-Note that LIMIT_MATCH and LIMIT_RECURSION can only reduce the value of the
-limits set by the caller of pcre2_match(), not increase them. The application
-can lock out the use of (*UTF) and (*UCP) by setting the PCRE2_NEVER_UTF or
-PCRE2_NEVER_UCP options, respectively, at compile time.
+Note that LIMIT_DEPTH, LIMIT_HEAP, and LIMIT_MATCH can only reduce the value of
+the limits set by the caller of <b>pcre2_match()</b> or <b>pcre2_dfa_match()</b>,
+not increase them. LIMIT_RECURSION is an obsolete synonym for LIMIT_DEPTH. The
+application can lock out the use of (*UTF) and (*UCP) by setting the
+PCRE2_NEVER_UTF or PCRE2_NEVER_UCP options, respectively, at compile time.
</P>
<br><a name="SEC17" href="#TOC1">NEWLINE CONVENTION</a><br>
<P>
@@ -465,6 +469,7 @@ settings with a similar syntax.
(*CRLF) carriage return followed by linefeed
(*ANYCRLF) all three of the above
(*ANY) any Unicode newline sequence
+ (*NUL) the NUL character (binary zero)
</PRE>
</P>
<br><a name="SEC18" href="#TOC1">WHAT \R MATCHES</a><br>
@@ -492,6 +497,9 @@ Each top-level branch of a look behind must be of a fixed length.
\n reference by number (can be ambiguous)
\gn reference by number
\g{n} reference by number
+ \g+n relative reference by number (PCRE2 extension)
+ \g-n relative reference by number
+ \g{+n} relative reference by number (PCRE2 extension)
\g{-n} relative reference by number
\k&#60;name&#62; reference by name (Perl)
\k'name' reference by name (Perl)
@@ -530,14 +538,17 @@ Each top-level branch of a look behind must be of a fixed length.
(?(-n) relative reference condition
(?(&#60;name&#62;) named reference condition (Perl)
(?('name') named reference condition (Perl)
- (?(name) named reference condition (PCRE2)
+ (?(name) named reference condition (PCRE2, deprecated)
(?(R) overall recursion condition
- (?(Rn) specific group recursion condition
- (?(R&name) specific recursion condition
+ (?(Rn) specific numbered group recursion condition
+ (?(R&name) specific named group recursion condition
(?(DEFINE) define subpattern for reference
(?(VERSION[&#62;]=n.m) test PCRE2 version
(?(assert) assertion condition
-</PRE>
+</pre>
+Note the ambiguity of (?(R) and (?(Rn), which might be named reference
+conditions or recursion tests. Such a condition is interpreted as a reference
+condition if the relevant named group exists.
</P>
<br><a name="SEC23" href="#TOC1">BACKTRACKING CONTROL</a><br>
<P>
@@ -589,9 +600,9 @@ Cambridge, England.
</P>
<br><a name="SEC27" href="#TOC1">REVISION</a><br>
<P>
-Last updated: 16 October 2015
+Last updated: 17 June 2017
<br>
-Copyright &copy; 1997-2015 University of Cambridge.
+Copyright &copy; 1997-2017 University of Cambridge.
<br>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
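The rule stated in the pcre2syntax changes above, that (*LIMIT_DEPTH=d), (*LIMIT_HEAP=d), and (*LIMIT_MATCH=d) can only reduce the caller's limits, amounts to taking the smaller of two values. The following is a minimal C sketch under stated assumptions: leading_limit() and effective_limit() are hypothetical names, and the helper looks at a single item at the very start of the pattern only. The library's own parsing, which accepts several such items, is not reproduced here.

```c
#include <assert.h>
#include <ctype.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: if `pattern` begins with "(*<name>=d)", where d is a
   decimal number, store d in *value and return 1; otherwise return 0. */
int leading_limit(const char *pattern, const char *name, uint32_t *value)
{
    assert(pattern != NULL && name != NULL && value != NULL);

    size_t n = strlen(name);
    if (strncmp(pattern, "(*", 2) != 0)
        return 0;
    if (strncmp(pattern + 2, name, n) != 0)
        return 0;

    const char *p = pattern + 2 + n;
    if (*p != '=')
        return 0;
    p++;
    if (!isdigit((unsigned char)*p))
        return 0;

    char *end;
    unsigned long d = strtoul(p, &end, 10);
    if (*end != ')' || d > UINT32_MAX)
        return 0;

    *value = (uint32_t)d;
    return 1;
}

/* The pattern may lower the caller's limit but never raise it. */
uint32_t effective_limit(uint32_t caller_limit, const char *pattern,
                         const char *name)
{
    uint32_t requested;
    if (leading_limit(pattern, name, &requested) && requested < caller_limit)
        return requested;
    return caller_limit;
}
```

For example, with a caller limit of 1000, a pattern starting with (*LIMIT_MATCH=50) yields an effective limit of 50, while (*LIMIT_MATCH=5000) leaves the limit at 1000.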
diff --git a/doc/html/pcre2test.html b/doc/html/pcre2test.html
index 17b308e..7d98d90 100644
--- a/doc/html/pcre2test.html
+++ b/doc/html/pcre2test.html
@@ -61,7 +61,7 @@ subject is processed, and what output is produced.
<P>
As the original fairly simple PCRE library evolved, it acquired many different
features, and as a result, the original <b>pcretest</b> program ended up with a
-lot of options in a messy, arcane syntax, for testing all the features. The
+lot of options in a messy, arcane syntax for testing all the features. The
move to the new PCRE2 API provided an opportunity to re-implement the test
program as <b>pcre2test</b>, with a cleaner modifier syntax. Nevertheless, there
are still many obscure modifiers, some of which are specifically designed for
@@ -77,32 +77,62 @@ strings that are encoded in 8-bit, 16-bit, or 32-bit code units. One, two, or
all three of these libraries may be simultaneously installed. The
<b>pcre2test</b> program can be used to test all the libraries. However, its own
input and output are always in 8-bit format. When testing the 16-bit or 32-bit
-libraries, patterns and subject strings are converted to 16- or 32-bit format
-before being passed to the library functions. Results are converted back to
-8-bit code units for output.
+libraries, patterns and subject strings are converted to 16-bit or 32-bit
+format before being passed to the library functions. Results are converted back
+to 8-bit code units for output.
</P>
<P>
In the rest of this document, the names of library functions and structures
are given in generic form, for example, <b>pcre2_compile()</b>. The actual
names used in the libraries have a suffix _8, _16, or _32, as appropriate.
-</P>
+<a name="inputencoding"></a></P>
<br><a name="SEC3" href="#TOC1">INPUT ENCODING</a><br>
<P>
Input to <b>pcre2test</b> is processed line by line, either by calling the C
-library's <b>fgets()</b> function, or via the <b>libreadline</b> library (see
-below). The input is processed using C's string functions, so must not
-contain binary zeroes, even though in Unix-like environments, <b>fgets()</b>
-treats any bytes other than newline as data characters. In some Windows
-environments character 26 (hex 1A) causes an immediate end of file, and no
-further data is read.
+library's <b>fgets()</b> function, or via the <b>libreadline</b> library. In some
+Windows environments character 26 (hex 1A) causes an immediate end of file, and
+no further data is read, so this character should be avoided unless you really
+want that action.
+</P>
+<P>
+The input is processed using C's string functions, so it must not
+contain binary zeros, even though in Unix-like environments, <b>fgets()</b>
+treats any bytes other than newline as data characters. An error is generated
+if a binary zero is encountered. By default, subject lines are processed for
+backslash escapes, which makes it possible to include any data value in strings
+that are passed to the library for matching. For patterns, there is a facility
+for specifying some or all of the 8-bit input characters as hexadecimal pairs,
+which makes it possible to include binary zeros.
+</P>
+<br><b>
+Input for the 16-bit and 32-bit libraries
+</b><br>
+<P>
+When testing the 16-bit or 32-bit libraries, there is a need to be able to
+generate character code points greater than 255 in the strings that are passed
+to the library. For subject lines, backslash escapes can be used. In addition,
+when the <b>utf</b> modifier (see
+<a href="#optionmodifiers">"Setting compilation options"</a>
+below) is set, the pattern and any following subject lines are interpreted as
+UTF-8 strings and translated to UTF-16 or UTF-32 as appropriate.
</P>
<P>
-For maximum portability, therefore, it is safest to avoid non-printing
-characters in <b>pcre2test</b> input files. There is a facility for specifying
-some or all of a pattern's characters as hexadecimal pairs, thus making it
-possible to include binary zeroes in a pattern for testing purposes. Subject
-lines are processed for backslash escapes, which makes it possible to include
-any data value.
+For non-UTF testing of wide characters, the <b>utf8_input</b> modifier can be
+used. This is mutually exclusive with <b>utf</b>, and is allowed only in 16-bit
+or 32-bit mode. It causes the pattern and following subject lines to be treated
+as UTF-8 according to the original definition (RFC 2279), which allows for
+character values up to 0x7fffffff. Each character is placed in one 16-bit or
+32-bit code unit (in the 16-bit case, values greater than 0xffff cause an error
+to occur).
+</P>
+<P>
+UTF-8 (in its original definition) is not capable of encoding values greater
+than 0x7fffffff, but such values can be handled by the 32-bit library. When
+testing this library in non-UTF mode with <b>utf8_input</b> set, if any
+character is preceded by the byte 0xff (which is an illegal byte in UTF-8),
+0x80000000 is added to the character's value. This is the only way of passing
+such code points in a pattern string. For subject strings, using an escape
+sequence is preferable.
</P>
<br><a name="SEC4" href="#TOC1">COMMAND LINE OPTIONS</a><br>
<P>
@@ -124,15 +154,27 @@ the 32-bit library has been built, this is the default. If the 32-bit library
has not been built, this option causes an error.
</P>
<P>
+<b>-ac</b>
+Behave as if each pattern has the <b>auto_callout</b> modifier, that is, insert
+automatic callouts into every pattern that is compiled.
+</P>
+<P>
+<b>-AC</b>
+As for <b>-ac</b>, but in addition behave as if each subject line has the
+<b>callout_extra</b> modifier, that is, show additional information from
+callouts.
+</P>
+<P>
<b>-b</b>
-Behave as if each pattern has the <b>/fullbincode</b> modifier; the full
+Behave as if each pattern has the <b>fullbincode</b> modifier; the full
internal binary form of the pattern is output after compilation.
</P>
<P>
<b>-C</b>
Output the version number of the PCRE2 library, and all available information
about the optional features that are included, and then exit with zero exit
-code. All other options are ignored.
+code. All other options are ignored. If both -C and -LM are present, whichever
+is first is recognized.
</P>
<P>
<b>-C</b> <i>option</i>
@@ -147,7 +189,7 @@ following options output the value and set the exit code as indicated:
linksize the configured internal link size (2, 3, or 4)
exit code is set to the link size
newline the default newline setting:
- CR, LF, CRLF, ANYCRLF, or ANY
+ CR, LF, CRLF, ANYCRLF, ANY, or NUL
exit code is always 0
bsr the default setting for what \R matches:
ANYCRLF or ANY
@@ -191,7 +233,7 @@ Output a brief summary these options and then exit.
</P>
<P>
<b>-i</b>
-Behave as if each pattern has the <b>/info</b> modifier; information about the
+Behave as if each pattern has the <b>info</b> modifier; information about the
compiled pattern is given after compilation.
</P>
<P>
@@ -200,6 +242,18 @@ Behave as if each pattern line has the <b>jit</b> modifier; after successful
compilation, each pattern is passed to the just-in-time compiler, if available.
</P>
<P>
+<b>-jitverify</b>
+Behave as if each pattern line has the <b>jitverify</b> modifier; after
+successful compilation, each pattern is passed to the just-in-time compiler, if
+available, and the use of JIT is verified.
+</P>
+<P>
+<b>-LM</b>
+List modifiers: write a list of available pattern and subject modifiers to the
+standard output, then exit with zero exit code. All other options are ignored.
+If both -C and -LM are present, whichever is first is recognized.
+</P>
+<P>
<b>-pattern</b> <i>modifier-list</i>
Behave as if each pattern line contains the given modifiers.
</P>
@@ -326,8 +380,8 @@ when PCRE2 is compiled with either CR or CRLF as the default newline.
</P>
<P>
The #newline_default command specifies a list of newline types that are
-acceptable as the default. The types must be one of CR, LF, CRLF, ANYCRLF, or
-ANY (in upper or lower case), for example:
+acceptable as the default. The types must be one of CR, LF, CRLF, ANYCRLF,
+ANY, or NUL (in upper or lower case), for example:
<pre>
#newline_default LF Any anyCRLF
</pre>
@@ -341,8 +395,9 @@ of the standard test input files.
<P>
When the POSIX API is being tested there is no way to override the default
newline convention, though it is possible to set the newline convention from
-within the pattern. A warning is given if the <b>posix</b> modifier is used when
-<b>#newline_default</b> would set a default for the non-POSIX API.
+within the pattern. A warning is given if the <b>posix</b> or <b>posix_nosub</b>
+modifier is used when <b>#newline_default</b> would set a default for the
+non-POSIX API.
<pre>
#pattern &#60;modifier-list&#62;
</pre>
@@ -438,8 +493,9 @@ A pattern can be followed by a modifier list (details below).
<P>
Before each subject line is passed to <b>pcre2_match()</b> or
<b>pcre2_dfa_match()</b>, leading and trailing white space is removed, and the
-line is scanned for backslash escapes. The following provide a means of
-encoding non-printing characters in a visible way:
+line is scanned for backslash escapes, unless the <b>subject_literal</b>
+modifier was set for the pattern. The following provide a means of encoding
+non-printing characters in a visible way:
<pre>
\a alarm (BEL, \x07)
\b backspace (\x08)
@@ -507,6 +563,12 @@ the very last character in the line is a backslash (and there is no modifier
list), it is ignored. This gives a way of passing an empty line as data, since
a real empty line terminates the data input.
</P>
+<P>
+If the <b>subject_literal</b> modifier is set for a pattern, all subject lines
+that follow are treated as literals, with no special treatment of backslashes.
+No replication is possible, and any subject modifiers must be set as defaults
+by a <b>#subject</b> command.
+</P>
<br><a name="SEC10" href="#TOC1">PATTERN MODIFIERS</a><br>
<P>
There are several types of modifier that can appear in pattern lines. Except
@@ -518,29 +580,42 @@ by a previous <b>#pattern</b> command.
Setting compilation options
</b><br>
<P>
-The following modifiers set options for <b>pcre2_compile()</b>. The most common
-ones have single-letter abbreviations. See
+The following modifiers set options for <b>pcre2_compile()</b>. Most of them set
+bits in the options argument of that function, but those whose names start with
+PCRE2_EXTRA are additional options that are set in the compile context. For the
+main options, there are some single-letter abbreviations that are the same as
+Perl options. There is special handling for /x: if a second x is present,
+PCRE2_EXTENDED is converted into PCRE2_EXTENDED_MORE as in Perl. A third
+appearance adds PCRE2_EXTENDED as well, though this makes no difference to the
+way <b>pcre2_compile()</b> behaves. See
<a href="pcre2api.html"><b>pcre2api</b></a>
-for a description of their effects.
+for a description of the effects of these options.
<pre>
allow_empty_class set PCRE2_ALLOW_EMPTY_CLASS
+ allow_surrogate_escapes set PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES
alt_bsux set PCRE2_ALT_BSUX
alt_circumflex set PCRE2_ALT_CIRCUMFLEX
alt_verbnames set PCRE2_ALT_VERBNAMES
anchored set PCRE2_ANCHORED
auto_callout set PCRE2_AUTO_CALLOUT
+ bad_escape_is_literal set PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL
/i caseless set PCRE2_CASELESS
dollar_endonly set PCRE2_DOLLAR_ENDONLY
/s dotall set PCRE2_DOTALL
dupnames set PCRE2_DUPNAMES
+ endanchored set PCRE2_ENDANCHORED
/x extended set PCRE2_EXTENDED
+ /xx extended_more set PCRE2_EXTENDED_MORE
firstline set PCRE2_FIRSTLINE
+ literal set PCRE2_LITERAL
+ match_line set PCRE2_EXTRA_MATCH_LINE
match_unset_backref set PCRE2_MATCH_UNSET_BACKREF
+ match_word set PCRE2_EXTRA_MATCH_WORD
/m multiline set PCRE2_MULTILINE
never_backslash_c set PCRE2_NEVER_BACKSLASH_C
never_ucp set PCRE2_NEVER_UCP
never_utf set PCRE2_NEVER_UTF
- no_auto_capture set PCRE2_NO_AUTO_CAPTURE
+ /n no_auto_capture set PCRE2_NO_AUTO_CAPTURE
no_auto_possess set PCRE2_NO_AUTO_POSSESS
no_dotstar_anchor set PCRE2_NO_DOTSTAR_ANCHOR
no_start_optimize set PCRE2_NO_START_OPTIMIZE
@@ -553,19 +628,27 @@ for a description of their effects.
As well as turning on the PCRE2_UTF option, the <b>utf</b> modifier causes all
non-printing characters in output strings to be printed using the \x{hh...}
notation. Otherwise, those less than 0x100 are output in hex without the curly
-brackets.
+brackets. Setting <b>utf</b> in 16-bit or 32-bit mode also causes pattern and
+subject strings to be translated to UTF-16 or UTF-32, respectively, before
+being passed to library functions.
<a name="controlmodifiers"></a></P>
<br><b>
Setting compilation controls
</b><br>
<P>
The following modifiers affect the compilation process or request information
-about the pattern:
+about the pattern. There are single-letter abbreviations for some that are
+heavily used in the test files.
<pre>
bsr=[anycrlf|unicode] specify \R handling
/B bincode show binary code without lengths
callout_info show callout information
+ convert=&#60;options&#62; request foreign pattern conversion
+ convert_glob_escape=c set glob escape character
+ convert_glob_separator=c set glob separator character
+ convert_length set convert buffer length
debug same as info,fullbincode
+ framesize show matching frame size
fullbincode show binary code with lengths
/I info show info about compiled pattern
hex unquoted characters are hexadecimal
@@ -583,7 +666,10 @@ about the pattern:
push push compiled pattern onto the stack
pushcopy push a copy onto the stack
stackguard=&#60;number&#62; test the stackguard feature
+ subject_literal treat all subject lines as literal
tables=[0|1|2] select internal tables
+ use_length do not zero-terminate the pattern
+ utf8_input treat input as UTF-8
</pre>
The effects of these modifiers are described in the following sections.
</P>
@@ -599,7 +685,7 @@ is built, with the default default being Unicode.
<P>
The <b>newline</b> modifier specifies which characters are to be interpreted as
newlines, both in the pattern and in subject lines. The type must be one of CR,
-LF, CRLF, ANYCRLF, or ANY (in upper or lower case).
+LF, CRLF, ANYCRLF, ANY, or NUL (in upper or lower case).
</P>
<br><b>
Information about a pattern
@@ -651,6 +737,11 @@ not necessarily the last character. These lines are omitted if no starting or
ending code units are recorded.
</P>
<P>
+The <b>framesize</b> modifier shows the size, in bytes, of the storage frames
+used by <b>pcre2_match()</b> for handling backtracking. The size depends on the
+number of capturing parentheses in the pattern.
+</P>
+<P>
The <b>callout_info</b> modifier requests information about all the callouts in
the pattern. A list of them is output at the end of any other information that
is requested. For each callout, either its number or string is given, followed
@@ -684,13 +775,36 @@ nine characters, only two of which are specified in hexadecimal:
/ab "literal" 32/hex
</pre>
Either single or double quotes may be used. There is no way of including
-the delimiter within a substring.
+the delimiter within a substring. The <b>hex</b> and <b>expand</b> modifiers are
+mutually exclusive.
+</P>
+<br><b>
+Specifying the pattern's length
+</b><br>
+<P>
+By default, patterns are passed to the compiling functions as zero-terminated
+strings, but they can instead be passed with an explicit length. The
+<b>use_length</b> modifier causes this to happen. Using a length happens
+automatically (whether or not <b>use_length</b> is set) when <b>hex</b> is set,
+because patterns specified in hexadecimal may contain binary zeros.
</P>
<P>
-By default, <b>pcre2test</b> passes patterns as zero-terminated strings to
-<b>pcre2_compile()</b>, giving the length as PCRE2_ZERO_TERMINATED. However, for
-patterns specified with the <b>hex</b> modifier, the actual length of the
-pattern is passed.
+If <b>hex</b> or <b>use_length</b> is used with the POSIX wrapper API (see
+<a href="#posixwrapper">"Using the POSIX wrapper API"</a>
+below), the REG_PEND extension is used to pass the pattern's length.
+</P>
+<br><b>
+Specifying wide characters in 16-bit and 32-bit modes
+</b><br>
+<P>
+In 16-bit and 32-bit modes, all input is automatically treated as UTF-8 and
+translated to UTF-16 or UTF-32 when the <b>utf</b> modifier is set. For testing
+the 16-bit and 32-bit libraries in non-UTF mode, the <b>utf8_input</b> modifier
+can be used. It is mutually exclusive with <b>utf</b>. Input lines are
+interpreted as UTF-8 as a means of specifying wide characters. More details are
+given in
+<a href="#inputencoding">"Input encoding"</a>
+above.
</P>
<br><b>
Generating long repetitive patterns
@@ -708,7 +822,8 @@ are expanded before the pattern is passed to <b>pcre2_compile()</b>. For
example, \[AB]{6000} is expanded to "ABAB..." 6000 times. This construction
cannot be nested. An initial "\[" sequence is recognized only if "]{" followed
by decimal digits and "}" is found later in the pattern. If not, the characters
-remain in the pattern unaltered.
+remain in the pattern unaltered. The <b>expand</b> and <b>hex</b> modifiers are
+mutually exclusive.
</P>
<P>
If part of an expanded pattern looks like an expansion, but is really part of
@@ -737,7 +852,7 @@ modifier in "Subject Modifiers"
for details of how these options are specified for each match attempt.
</P>
<P>
-JIT compilation is requested by the <b>/jit</b> pattern modifier, which may
+JIT compilation is requested by the <b>jit</b> pattern modifier, which may
optionally be followed by an equals sign and a number in the range 0 to 7.
The three bits that make up the number specify which of the three JIT operating
modes are to be compiled:
@@ -746,7 +861,7 @@ modes are to be compiled:
2 compile JIT code for soft partial matching
4 compile JIT code for hard partial matching
</pre>
-The possible values for the <b>/jit</b> modifier are therefore:
+The possible values for the <b>jit</b> modifier are therefore:
<pre>
0 disable JIT
1 normal matching only
@@ -761,7 +876,7 @@ to <b>pcre2_match()</b> with either the PCRE2_PARTIAL_SOFT or the
PCRE2_PARTIAL_HARD option set. Note that such a call may return a complete
match; the options enable the possibility of a partial match, but do not
require it. Note also that if you request JIT compilation only for partial
-matching (for example, /jit=2) but do not set the <b>partial</b> modifier on a
+matching (for example, jit=2) but do not set the <b>partial</b> modifier on a
subject line, that match will not use JIT code because none was compiled for
non-partial matching.
</P>
@@ -792,14 +907,14 @@ code was actually used in the match.
Setting a locale
</b><br>
<P>
-The <b>/locale</b> modifier must specify the name of a locale, for example:
+The <b>locale</b> modifier must specify the name of a locale, for example:
<pre>
/pattern/locale=fr_FR
</pre>
The given locale is set, <b>pcre2_maketables()</b> is called to build a set of
character tables for the locale, and this is then passed to
<b>pcre2_compile()</b> when compiling the regular expression. The same tables
-are used when matching the following subject lines. The <b>/locale</b> modifier
+are used when matching the following subject lines. The <b>locale</b> modifier
applies only to the pattern on which it appears, but can be given in a
<b>#pattern</b> command if a default is needed. Setting a locale and alternate
character tables are mutually exclusive.
@@ -808,7 +923,7 @@ character tables are mutually exclusive.
Showing pattern memory
</b><br>
<P>
-The <b>/memory</b> modifier causes the size in bytes of the memory used to hold
+The <b>memory</b> modifier causes the size in bytes of the memory used to hold
the compiled pattern to be output. This does not include the size of the
<b>pcre2_code</b> block; it is just the actual compiled data. If the pattern is
subsequently passed to the JIT compiler, the size of the JIT compiled code is
@@ -838,12 +953,12 @@ The <b>max_pattern_length</b> modifier sets a limit, in code units, to the
length of pattern that <b>pcre2_compile()</b> will accept. Breaching the limit
causes a compilation error. The default is the largest number a PCRE2_SIZE
variable can hold (essentially unlimited).
-</P>
+<a name="posixwrapper"></a></P>
<br><b>
Using the POSIX wrapper API
</b><br>
<P>
-The <b>/posix</b> and <b>posix_nosub</b> modifiers cause <b>pcre2test</b> to call
+The <b>posix</b> and <b>posix_nosub</b> modifiers cause <b>pcre2test</b> to call
PCRE2 via the POSIX wrapper API rather than its native API. When
<b>posix_nosub</b> is used, the POSIX option REG_NOSUB is passed to
<b>regcomp()</b>. The POSIX wrapper supports only the 8-bit library. Note that
@@ -873,11 +988,16 @@ The <b>aftertext</b> and <b>allaftertext</b> subject modifiers work as described
below. All other modifiers are either ignored, with a warning message, or cause
an error.
</P>
+<P>
+The pattern is passed to <b>regcomp()</b> as a zero-terminated string by
+default, but if the <b>use_length</b> or <b>hex</b> modifiers are set, the
+REG_PEND extension is used to pass it by length.
+</P>
<br><b>
Testing the stack guard feature
</b><br>
<P>
-The <b>/stackguard</b> modifier is used to test the use of
+The <b>stackguard</b> modifier is used to test the use of
<b>pcre2_set_compile_recursion_guard()</b>, a function that is provided to
enable stack availability to be checked during compilation (see the
<a href="pcre2api.html"><b>pcre2api</b></a>
@@ -892,7 +1012,7 @@ be aborted.
Using alternative character tables
</b><br>
<P>
-The value specified for the <b>/tables</b> modifier must be one of the digits 0,
+The value specified for the <b>tables</b> modifier must be one of the digits 0,
1, or 2. It causes a specific set of built-in character tables to be passed to
<b>pcre2_compile()</b>. This is used in the PCRE2 tests to check behaviour with
different character tables. The digit specifies the tables as follows:
@@ -910,17 +1030,19 @@ are mutually exclusive.
Setting certain match controls
</b><br>
<P>
-The following modifiers are really subject modifiers, and are described below.
-However, they may be included in a pattern's modifier list, in which case they
-are applied to every subject line that is processed with that pattern. They may
-not appear in <b>#pattern</b> commands. These modifiers do not affect the
-compilation process.
+The following modifiers are really subject modifiers, and are described under
+"Subject Modifiers" below. However, they may be included in a pattern's
+modifier list, in which case they are applied to every subject line that is
+processed with that pattern. These modifiers do not affect the compilation
+process.
<pre>
aftertext show text after match
allaftertext show text after captures
allcaptures show all captures
allusedtext show all consulted text
+ altglobal alternative global matching
/g global global matching
+ jitstack=&#60;n&#62; set size of JIT stack
mark show mark values
replace=&#60;string&#62; specify a replacement string
startchar show starting character when relevant
@@ -933,6 +1055,15 @@ These modifiers may not appear in a <b>#pattern</b> command. If you want them as
defaults, set them in a <b>#subject</b> command.
</P>
<br><b>
+Specifying literal subject lines
+</b><br>
+<P>
+If the <b>subject_literal</b> modifier is present on a pattern, all the subject
+lines that it matches are taken as literal strings, with no interpretation of
+backslashes. It is not possible to set subject modifiers on such lines, but any
+that are set as defaults by a <b>#subject</b> command are recognized.
+</P>
+<br><b>
Saving a compiled pattern
</b><br>
<P>
@@ -941,7 +1072,8 @@ pushed onto a stack of compiled patterns, and <b>pcre2test</b> expects the next
line to contain a new pattern (or a command) instead of a subject line. This
facility is used when saving compiled patterns to a file, as described in the
section entitled "Saving and restoring compiled patterns"
-<a href="#saverestore">below. If <b>pushcopy</b> is used instead of <b>push</b>, a copy of the compiled</a>
+<a href="#saverestore">below.</a>
+If <b>pushcopy</b> is used instead of <b>push</b>, a copy of the compiled
pattern is stacked, leaving the original as current, ready to match the
following input lines. This provides a way of testing the
<b>pcre2_code_copy()</b> function.
@@ -951,6 +1083,41 @@ are ignored (for the stacked copy), with a warning message, except for
<b>replace</b>, which causes an error. Note that <b>jitverify</b>, which is
allowed, does not carry through to any subsequent matching that uses a stacked
pattern.
+</P>
+<br><b>
+Testing foreign pattern conversion
+</b><br>
+<P>
+The experimental foreign pattern conversion functions in PCRE2 can be tested by
+setting the <b>convert</b> modifier. Its argument is a colon-separated list of
+options, which set the equivalent option for the <b>pcre2_pattern_convert()</b>
+function:
+<pre>
+ glob PCRE2_CONVERT_GLOB
+ glob_no_starstar PCRE2_CONVERT_GLOB_NO_STARSTAR
+ glob_no_wild_separator PCRE2_CONVERT_GLOB_NO_WILD_SEPARATOR
+ posix_basic PCRE2_CONVERT_POSIX_BASIC
+ posix_extended PCRE2_CONVERT_POSIX_EXTENDED
+ unset Unset all options
+</pre>
+The "unset" value is useful for turning off a default that has been set by a
+<b>#pattern</b> command. When one of these options is set, the input pattern is
+passed to <b>pcre2_pattern_convert()</b>. If the conversion is successful, the
+result is reflected in the output and then passed to <b>pcre2_compile()</b>. The
+normal <b>utf</b> and <b>no_utf_check</b> options, if set, cause the
+PCRE2_CONVERT_UTF and PCRE2_CONVERT_NO_UTF_CHECK options to be passed to
+<b>pcre2_pattern_convert()</b>.
+</P>
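As a rough illustration of how a colon-separated convert argument could map onto the option names listed above, here is a minimal Python sketch. It is not pcre2test's own code: parse_convert and CONVERT_OPTIONS are hypothetical names, only the constant names from the table are used (not their numeric values), and the treatment of "unset" in the middle of a list is an assumption.

```python
# Modifier keywords from the table above, mapped to the *names* of the
# PCRE2 constants they select. Numeric flag values are deliberately omitted.
CONVERT_OPTIONS = {
    "glob": "PCRE2_CONVERT_GLOB",
    "glob_no_starstar": "PCRE2_CONVERT_GLOB_NO_STARSTAR",
    "glob_no_wild_separator": "PCRE2_CONVERT_GLOB_NO_WILD_SEPARATOR",
    "posix_basic": "PCRE2_CONVERT_POSIX_BASIC",
    "posix_extended": "PCRE2_CONVERT_POSIX_EXTENDED",
}


def parse_convert(argument, current=frozenset()):
    """Return the set of option names selected by a convert= argument.

    `current` stands for options inherited from a #pattern default.
    Assumption: "unset" clears everything selected so far, and any
    keywords that follow it are added afresh.
    """
    selected = set(current)
    for keyword in argument.split(":"):
        if keyword == "unset":
            selected.clear()
        elif keyword in CONVERT_OPTIONS:
            selected.add(CONVERT_OPTIONS[keyword])
        else:
            raise ValueError(f"unknown convert option: {keyword!r}")
    return selected
```

For example, parse_convert("glob:glob_no_starstar") selects the two glob constants, while parse_convert("unset", current) discards whatever a #pattern default had set.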
+<P>
+By default, the conversion function is allowed to allocate a buffer for its
+output. However, if the <b>convert_length</b> modifier is set to a value greater
+than zero, <b>pcre2test</b> passes a buffer of the given length. This makes it
+possible to test the length check.
+</P>
+<P>
+The <b>convert_glob_escape</b> and <b>convert_glob_separator</b> modifiers can be
+used to specify the escape and separator characters for glob processing,
+overriding the defaults, which are operating-system dependent.
<a name="subjectmodifiers"></a></P>
<br><a name="SEC11" href="#TOC1">SUBJECT MODIFIERS</a><br>
<P>
@@ -967,6 +1134,7 @@ The following modifiers set options for <b>pcre2_match()</b> or
for a description of their effects.
<pre>
anchored set PCRE2_ANCHORED
+ endanchored set PCRE2_ENDANCHORED
dfa_restart set PCRE2_DFA_RESTART
dfa_shortest set PCRE2_DFA_SHORTEST
no_jit set PCRE2_NO_JIT
@@ -982,11 +1150,26 @@ The partial matching modifiers are provided with abbreviations because they
appear frequently in tests.
</P>
<P>
-If the <b>/posix</b> modifier was present on the pattern, causing the POSIX
-wrapper API to be used, the only option-setting modifiers that have any effect
-are <b>notbol</b>, <b>notempty</b>, and <b>noteol</b>, causing REG_NOTBOL,
-REG_NOTEMPTY, and REG_NOTEOL, respectively, to be passed to <b>regexec()</b>.
-The other modifiers are ignored, with a warning message.
+If the <b>posix</b> or <b>posix_nosub</b> modifier was present on the pattern,
+causing the POSIX wrapper API to be used, the only option-setting modifiers
+that have any effect are <b>notbol</b>, <b>notempty</b>, and <b>noteol</b>,
+causing REG_NOTBOL, REG_NOTEMPTY, and REG_NOTEOL, respectively, to be passed to
+<b>regexec()</b>. The other modifiers are ignored, with a warning message.
+</P>
+<P>
+There is one additional modifier that can be used with the POSIX wrapper. It is
+ignored (with a warning) if used for non-POSIX matching.
+<pre>
+ posix_startend=&#60;n&#62;[:&#60;m&#62;]
+</pre>
+This causes the subject string to be passed to <b>regexec()</b> using the
+REG_STARTEND option, which uses offsets to specify which part of the string is
+searched. If only one number is given, the end offset is passed as the end of
+the subject string. For more details of REG_STARTEND, see the
+<a href="pcre2posix.html"><b>pcre2posix</b></a>
+documentation. If the subject string contains binary zeros (coded as escapes
+such as \x{00} because <b>pcre2test</b> does not support actual binary zeros in
+its input), you must use <b>posix_startend</b> to specify its length.
</P>
<br><b>
Setting match controls
@@ -1004,23 +1187,28 @@ pattern.
altglobal alternative global matching
callout_capture show captures at callout time
callout_data=&#60;n&#62; set a value to pass via callouts
+ callout_error=&#60;n&#62;[:&#60;m&#62;] control callout error
+ callout_extra show extra callout information
callout_fail=&#60;n&#62;[:&#60;m&#62;] control callout failure
+ callout_no_where do not show position of a callout
callout_none do not supply a callout function
copy=&#60;number or name&#62; copy captured substring
+ depth_limit=&#60;n&#62; set a depth limit
dfa use <b>pcre2_dfa_match()</b>
- find_limits find match and recursion limits
+ find_limits find match and depth limits
get=&#60;number or name&#62; extract captured substring
getall extract all captured substrings
/g global global matching
+ heap_limit=&#60;n&#62; set a limit on heap memory
jitstack=&#60;n&#62; set size of JIT stack
mark show mark values
match_limit=&#60;n&#62; set a match limit
- memory show memory usage
+ memory show heap memory usage
null_context match with a NULL context
offset=&#60;n&#62; set starting offset
offset_limit=&#60;n&#62; set offset limit
ovector=&#60;n&#62; set size of output vector
- recursion_limit=&#60;n&#62; set a recursion limit
+ recursion_limit=&#60;n&#62; obsolete synonym for depth_limit
replace=&#60;string&#62; specify a replacement string
startchar show startchar when relevant
startoffset=&#60;n&#62; same as offset=&#60;n&#62;
@@ -1098,29 +1286,17 @@ Testing callouts
</b><br>
<P>
A callout function is supplied when <b>pcre2test</b> calls the library matching
-functions, unless <b>callout_none</b> is specified. If <b>callout_capture</b> is
-set, the current captured groups are output when a callout occurs.
-</P>
-<P>
-The <b>callout_fail</b> modifier can be given one or two numbers. If there is
-only one number, 1 is returned instead of 0 when a callout of that number is
-reached. If two numbers are given, 1 is returned when callout &#60;n&#62; is reached
-for the &#60;m&#62;th time. Note that callouts with string arguments are always given
-the number zero. See "Callouts" below for a description of the output when a
-callout it taken.
-</P>
-<P>
-The <b>callout_data</b> modifier can be given an unsigned or a negative number.
-This is set as the "user data" that is passed to the matching function, and
-passed back when the callout function is invoked. Any value other than zero is
-used as a return from <b>pcre2test</b>'s callout function.
+functions, unless <b>callout_none</b> is specified. Its behaviour can be
+controlled by various modifiers listed above whose names begin with
+<b>callout_</b>. Details are given in the section entitled "Callouts"
+<a href="#callouts">below.</a>
</P>
<br><b>
Finding all matches in a string
</b><br>
<P>
Searching for all possible matches within a subject can be requested by the
-<b>global</b> or <b>/altglobal</b> modifier. After finding a match, the matching
+<b>global</b> or <b>altglobal</b> modifier. After finding a match, the matching
function is called again to search the remainder of the subject. The difference
between <b>global</b> and <b>altglobal</b> is that the former uses the
<i>start_offset</i> argument to <b>pcre2_match()</b> or <b>pcre2_dfa_match()</b>
@@ -1242,41 +1418,47 @@ Setting the JIT stack size
<P>
The <b>jitstack</b> modifier provides a way of setting the maximum stack size
that is used by the just-in-time optimization code. It is ignored if JIT
-optimization is not being used. The value is a number of kilobytes. Providing a
-stack that is larger than the default 32K is necessary only for very
-complicated patterns.
+optimization is not being used. The value is a number of kilobytes. Setting
+zero reverts to the default of 32K. Providing a stack that is larger than the
+default is necessary only for very complicated patterns. If <b>jitstack</b> is
+set non-zero on a subject line it overrides any value that was set on the
+pattern.
</P>
<br><b>
-Setting match and recursion limits
+Setting heap, match, and depth limits
</b><br>
<P>
-The <b>match_limit</b> and <b>recursion_limit</b> modifiers set the appropriate
-limits in the match context. These values are ignored when the
+The <b>heap_limit</b>, <b>match_limit</b>, and <b>depth_limit</b> modifiers set
+the appropriate limits in the match context. These values are ignored when the
<b>find_limits</b> modifier is specified.
</P>
<br><b>
Finding minimum limits
</b><br>
<P>
-If the <b>find_limits</b> modifier is present, <b>pcre2test</b> calls
-<b>pcre2_match()</b> several times, setting different values in the match
-context via <b>pcre2_set_match_limit()</b> and <b>pcre2_set_recursion_limit()</b>
-until it finds the minimum values for each parameter that allow
-<b>pcre2_match()</b> to complete without error.
+If the <b>find_limits</b> modifier is present on a subject line, <b>pcre2test</b>
+calls the relevant matching function several times, setting different values in
+the match context via <b>pcre2_set_heap_limit()</b>, <b>pcre2_set_match_limit()</b>,
+or <b>pcre2_set_depth_limit()</b> until it finds the minimum values for each
+parameter that allow the match to complete without error.
</P>
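The repeated-call search described above can be sketched as a doubling phase followed by bisection. This is a minimal Python illustration, not pcre2test's implementation: find_min_limit and match_ok are hypothetical names, match_ok(limit) stands in for "run the match with this limit and report whether it completed without a limit error", and the sketch assumes that a match which completes at one limit also completes at every larger limit.

```python
def find_min_limit(match_ok, upper=2**32 - 1):
    """Smallest limit in [1, upper] for which match_ok(limit) is true.

    Returns None if the match does not complete even at `upper`.
    Assumes match_ok is monotonic: once it succeeds, larger limits succeed.
    """
    if not match_ok(upper):
        return None

    # Doubling phase: find some limit that works, remembering the largest
    # limit known to fail (0 means nothing has failed yet).
    lo, hi = 0, 1
    while not match_ok(hi):
        lo = hi
        hi = min(hi * 2, upper)

    # Bisection phase: match_ok(hi) holds and match_ok(lo) does not.
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if match_ok(mid):
            hi = mid
        else:
            lo = mid
    return hi
```

The same search would be run once per parameter (heap, match, depth), each time with the other limits left at their defaults; that arrangement is also an assumption of this sketch.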
<P>
If JIT is being used, only the match limit is relevant. If DFA matching is
-being used, neither limit is relevant, and this modifier is ignored (with a
-warning message).
+being used, only the depth limit is relevant.
</P>
<P>
The <i>match_limit</i> number is a measure of the amount of backtracking
that takes place, and learning the minimum value can be instructive. For most
simple matches, the number is quite small, but for patterns with very large
numbers of matching possibilities, it can become large very quickly with
-increasing length of subject string. The <i>match_limit_recursion</i> number is
-a measure of how much stack (or, if PCRE2 is compiled with NO_RECURSE, how much
-heap) memory is needed to complete the match attempt.
+increasing length of subject string.
+</P>
+<P>
+For non-DFA matching, the minimum <i>depth_limit</i> number is a measure of how
+much nested backtracking happens (that is, how deeply the pattern's tree is
+searched). In the case of DFA matching, <i>depth_limit</i> controls the depth of
+recursive calls of the internal function that is used for handling pattern
+recursion, lookaround assertions, and atomic groups.
</P>
<br><b>
Showing MARK names
@@ -1292,8 +1474,15 @@ is added to the non-match message.
Showing memory usage
</b><br>
<P>
-The <b>memory</b> modifier causes <b>pcre2test</b> to log all memory allocation
-and freeing calls that occur during a match operation.
+The <b>memory</b> modifier causes <b>pcre2test</b> to log the sizes of all heap
+memory allocation and freeing calls that occur during a call to
+<b>pcre2_match()</b>. These occur only when a match requires a bigger vector
+than the default for remembering backtracking points. In many cases there will
+be no heap memory used and therefore no additional output. No heap memory is
+allocated during matching with <b>pcre2_dfa_match()</b> or with JIT, so in those
+cases the <b>memory</b> modifier never has any effect. For this modifier to
+work, the <b>null_context</b> modifier must not be set on both the pattern and
+the subject, though it can be set on one or the other.
</P>
<br><b>
Setting a starting offset
@@ -1337,8 +1526,8 @@ Passing the subject as zero-terminated
By default, the subject string is passed to a native API matching function with
its correct length. In order to test the facility for passing a zero-terminated
string, the <b>zero_terminate</b> modifier is provided. It causes the length to
-be passed as PCRE2_ZERO_TERMINATED. (When matching via the POSIX interface,
-this modifier has no effect, as there is no facility for passing a length.)
+be passed as PCRE2_ZERO_TERMINATED. When matching via the POSIX interface,
+this modifier is ignored, with a warning.
</P>
<P>
When testing <b>pcre2_substitute()</b>, this modifier also has the effect of
@@ -1393,7 +1582,7 @@ code unit offset of the start of the failing character is also output. Here is
an example of an interactive <b>pcre2test</b> run.
<pre>
$ pcre2test
- PCRE2 version 9.00 2014-05-10
+ PCRE2 version 10.22 2016-07-29
re&#62; /^abc(\d+)/
data&#62; abc123
@@ -1420,7 +1609,7 @@ unset substring is shown as "&#60;unset&#62;", as for the second data line.
If the strings contain any non-printing characters, they are output as \xhh
escapes if the value is less than 256 and UTF mode is not set. Otherwise they
are output as \x{hh...} escapes. See below for the definition of non-printing
-characters. If the <b>/aftertext</b> modifier is set, the output for substring
+characters. If the <b>aftertext</b> modifier is set, the output for substring
0 is followed by the rest of the subject string, identified by "0+" like
this:
<pre>
@@ -1508,28 +1697,14 @@ restart the match with additional subject data by means of the
For further information about partial matching, see the
<a href="pcre2partial.html"><b>pcre2partial</b></a>
documentation.
-</P>
+<a name="callouts"></a></P>
<br><a name="SEC16" href="#TOC1">CALLOUTS</a><br>
<P>
If the pattern contains any callout requests, <b>pcre2test</b>'s callout
-function is called during matching unless <b>callout_none</b> is specified.
-This works with both matching functions.
-</P>
-<P>
-The callout function in <b>pcre2test</b> returns zero (carry on matching) by
-default, but you can use a <b>callout_fail</b> modifier in a subject line (as
-described above) to change this and other parameters of the callout.
-</P>
-<P>
-Inserting callouts can be helpful when using <b>pcre2test</b> to check
-complicated regular expressions. For further information about callouts, see
-the
-<a href="pcre2callout.html"><b>pcre2callout</b></a>
-documentation.
-</P>
-<P>
-The output for callouts with numerical arguments and those with string
-arguments is slightly different.
+function is called during matching unless <b>callout_none</b> is specified. This
+works with both matching functions, and with JIT, though there are some
+differences in behaviour. The output for callouts with numerical arguments and
+those with string arguments is slightly different.
</P>
<br><b>
Callouts with numerical arguments
@@ -1551,7 +1726,7 @@ callout is in a lookbehind assertion.
</P>
<P>
Callouts numbered 255 are assumed to be automatic callouts, inserted as a
-result of the <b>/auto_callout</b> pattern modifier. In this case, instead of
+result of the <b>auto_callout</b> pattern modifier. In this case, instead of
showing the callout number, the offset in the pattern, preceded by a plus, is
output. For example:
<pre>
@@ -1604,6 +1779,107 @@ example:
</PRE>
</P>
+<br><b>
+Callout modifiers
+</b><br>
+<P>
+The callout function in <b>pcre2test</b> returns zero (carry on matching) by
+default, but you can use a <b>callout_fail</b> modifier in a subject line to
+change this and other parameters of the callout (see below).
+</P>
+<P>
+If the <b>callout_capture</b> modifier is set, the current captured groups are
+output when a callout occurs. This is useful only for non-DFA matching, as
+<b>pcre2_dfa_match()</b> does not support capturing, so no captures are ever
+shown.
+</P>
+<P>
+The normal callout output, showing the callout number or pattern offset (as
+described above), is suppressed if the <b>callout_no_where</b> modifier is set.
+</P>
+<P>
+When using the interpretive matching function <b>pcre2_match()</b> without JIT,
+setting the <b>callout_extra</b> modifier causes additional output from
+<b>pcre2test</b>'s callout function to be generated. For the first callout in a
+match attempt at a new starting position in the subject, "New match attempt" is
+output. If there has been a backtrack since the last callout (or start of
+matching if this is the first callout), "Backtrack" is output, followed by "No
+other matching paths" if the backtrack ended the previous match attempt. For
+example:
+<pre>
+ re&#62; /(a+)b/auto_callout,no_start_optimize,no_auto_possess
+ data&#62; aac\=callout_extra
+ New match attempt
+ ---&#62;aac
+ +0 ^ (
+ +1 ^ a+
+ +3 ^ ^ )
+ +4 ^ ^ b
+ Backtrack
+ ---&#62;aac
+ +3 ^^ )
+ +4 ^^ b
+ Backtrack
+ No other matching paths
+ New match attempt
+ ---&#62;aac
+ +0 ^ (
+ +1 ^ a+
+ +3 ^^ )
+ +4 ^^ b
+ Backtrack
+ No other matching paths
+ New match attempt
+ ---&#62;aac
+ +0 ^ (
+ +1 ^ a+
+ Backtrack
+ No other matching paths
+ New match attempt
+ ---&#62;aac
+ +0 ^ (
+ +1 ^ a+
+ No match
+</pre>
+Notice that various optimizations must be turned off if you want all possible
+matching paths to be scanned. If <b>no_start_optimize</b> is not used, there is
+an immediate "no match", without any callouts, because the starting
+optimization fails to find "b" in the subject, which it knows must be present
+for any match. If <b>no_auto_possess</b> is not used, the "a+" item is turned
+into "a++", which reduces the number of backtracks.
+</P>
+<P>
+The <b>callout_extra</b> modifier has no effect if used with the DFA matching
+function, or with JIT.
+</P>
+<br><b>
+Return values from callouts
+</b><br>
+<P>
+The default return from the callout function is zero, which allows matching to
+continue. The <b>callout_fail</b> modifier can be given one or two numbers. If
+there is only one number, 1 is returned instead of 0 (causing matching to
+backtrack) when a callout of that number is reached. If two numbers (&#60;n&#62;:&#60;m&#62;)
+are given, 1 is returned when callout &#60;n&#62; is reached and there have been at
+least &#60;m&#62; callouts. The <b>callout_error</b> modifier is similar, except that
+PCRE2_ERROR_CALLOUT is returned, causing the entire matching process to be
+aborted. If both these modifiers are set for the same callout number,
+<b>callout_error</b> takes precedence. Note that callouts with string arguments
+are always given the number zero.
+</P>
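The return-value rules in the paragraph above can be restated as a small decision function. The Python sketch below is only a model of the documented behaviour, not the callout function in pcre2test: callout_return and its parameters are hypothetical, the numeric value given for PCRE2_ERROR_CALLOUT is assumed from pcre2.h, "count" is taken to mean the number of callouts so far including the current one, and the callout_data modifier is left out because its interaction with the other two is not stated here.

```python
PCRE2_ERROR_CALLOUT = -37  # assumed to match the definition in pcre2.h


def callout_return(number, count, fail=None, error=None):
    """Model of the value returned for one callout.

    number: the callout number (zero for callouts with string arguments).
    count:  how many callouts have happened so far, including this one
            (an interpretation of "at least <m> callouts").
    fail, error: None, or an (n, m) pair where m is None when only <n>
            was given on the modifier.
    """
    def triggered(spec):
        if spec is None:
            return False
        n, m = spec
        return number == n and (m is None or count >= m)

    if triggered(error):      # callout_error takes precedence
        return PCRE2_ERROR_CALLOUT
    if triggered(fail):       # callout_fail forces a backtrack
        return 1
    return 0                  # default: carry on matching
```

With fail=(2, 5), for instance, callout 2 returns 0 until the fifth callout has been reached and 1 from then on.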
+<P>
+The <b>callout_data</b> modifier can be given an unsigned or a negative number.
+This is set as the "user data" that is passed to the matching function, and
+passed back when the callout function is invoked. Any value other than zero is
+used as a return from <b>pcre2test</b>'s callout function.
+</P>
+<P>
+Inserting callouts can be helpful when using <b>pcre2test</b> to check
+complicated regular expressions. For further information about callouts, see
+the
+<a href="pcre2callout.html"><b>pcre2callout</b></a>
+documentation.
+</P>
<br><a name="SEC17" href="#TOC1">NON-PRINTING CHARACTERS</a><br>
<P>
When <b>pcre2test</b> is outputting text in the compiled version of a pattern,
@@ -1613,7 +1889,7 @@ therefore shown as hex escapes.
<P>
When <b>pcre2test</b> is outputting text that is a matched part of a subject
string, it behaves in the same way, unless a different locale has been set for
-the pattern (using the <b>/locale</b> modifier). In this case, the
+the pattern (using the <b>locale</b> modifier). In this case, the
<b>isprint()</b> function is used to distinguish printing and non-printing
characters.
<a name="saverestore"></a></P>
@@ -1706,9 +1982,9 @@ Cambridge, England.
</P>
<br><a name="SEC21" href="#TOC1">REVISION</a><br>
<P>
-Last updated: 06 July 2016
+Last updated: 21 December 2017
<br>
-Copyright &copy; 1997-2016 University of Cambridge.
+Copyright &copy; 1997-2017 University of Cambridge.
<br>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
diff --git a/doc/html/pcre2unicode.html b/doc/html/pcre2unicode.html
index 6ca367f..448a221 100644
--- a/doc/html/pcre2unicode.html
+++ b/doc/html/pcre2unicode.html
@@ -47,7 +47,7 @@ and
documentation. Only the short names for properties are supported. For example,
\p{L} matches a letter. Its Perl synonym, \p{Letter}, is not supported.
Furthermore, in Perl, many properties may optionally be prefixed by "Is", for
-compatibility with Perl 5.6. PCRE does not support this.
+compatibility with Perl 5.6. PCRE2 does not support this.
</P>
<br><b>
WIDE CHARACTERS AND UTF MODES
@@ -109,10 +109,15 @@ However, the special horizontal and vertical white space matching escapes (\h,
\H, \v, and \V) do match all the appropriate Unicode characters, whether or
not PCRE2_UCP is set.
</P>
+<br><b>
+CASE-EQUIVALENCE IN UTF MODES
+</b><br>
<P>
-Case-insensitive matching in UTF mode makes use of Unicode properties. A few
-Unicode characters such as Greek sigma have more than two codepoints that are
-case-equivalent, and these are treated as such.
+Case-insensitive matching in a UTF mode makes use of Unicode properties except
+for characters whose code points are less than 128 and that have at most two
+case-equivalent values. For these, a direct table lookup is used for speed. A
+few Unicode characters such as Greek sigma have more than two codepoints that
+are case-equivalent, and these are treated as such.
</P>
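To make "more than two case-equivalent code points" concrete, the following Python sketch uses Python's own Unicode case folding (not PCRE2's tables, so it is an analogy rather than a description of PCRE2 internals). The name folds_together is hypothetical. The second set shows an ASCII letter with a third, non-ASCII equivalent, which appears to be the kind of case that the "at most two case-equivalent values" qualifier excludes from the direct table lookup.

```python
def folds_together(chars):
    """True if every character in chars has the same Unicode case folding."""
    return len({ch.casefold() for ch in chars}) == 1


# Three distinct code points that are all case-equivalent to one another.
SIGMAS = ["\u03a3", "\u03c3", "\u03c2"]   # capital, small, and final sigma

# An ASCII letter with more than two case-equivalent code points.
K_FORMS = ["K", "k", "\u212a"]            # U+212A is KELVIN SIGN

print(folds_together(SIGMAS))
print(folds_together(K_FORMS))
```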
<br><b>
VALIDITY OF UTF STRINGS
@@ -173,6 +178,15 @@ or <b>pcre2_dfa_match()</b>.
<P>
If you pass an invalid UTF string when PCRE2_NO_UTF_CHECK is set, the result
is undefined and your program may crash or loop indefinitely.
+</P>
+<P>
+Note that setting PCRE2_NO_UTF_CHECK at compile time does not disable the error
+that is given if an escape sequence for an invalid Unicode code point is
+encountered in the pattern. If you want to allow escape sequences such as
+\x{d800} (a surrogate code point) you can set the
+PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES extra option. However, this is possible
+only in UTF-8 and UTF-32 modes, because these values are not representable in
+UTF-16.
<a name="utf8strings"></a></P>
<br><b>
Errors in UTF-8 strings
@@ -280,9 +294,9 @@ Cambridge, England.
REVISION
</b><br>
<P>
-Last updated: 03 July 2016
+Last updated: 17 May 2017
<br>
-Copyright &copy; 1997-2016 University of Cambridge.
+Copyright &copy; 1997-2017 University of Cambridge.
<br>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
diff --git a/doc/index.html.src b/doc/index.html.src
index 703c298..b9393d9 100644
--- a/doc/index.html.src
+++ b/doc/index.html.src
@@ -35,6 +35,9 @@ first.
<tr><td><a href="pcre2compat.html">pcre2compat</a></td>
<td>&nbsp;&nbsp;Compatibility with Perl</td></tr>
+<tr><td><a href="pcre2convert.html">pcre2convert</a></td>
+ <td>&nbsp;&nbsp;Experimental foreign pattern conversion functions</td></tr>
+
<tr><td><a href="pcre2demo.html">pcre2demo</a></td>
<td>&nbsp;&nbsp;A demonstration C program that uses the PCRE2 library</td></tr>
@@ -68,9 +71,6 @@ first.
<tr><td><a href="pcre2serialize.html">pcre2serialize</a></td>
<td>&nbsp;&nbsp;Serializing functions for saving precompiled patterns</td></tr>
-<tr><td><a href="pcre2stack.html">pcre2stack</a></td>
- <td>&nbsp;&nbsp;Discussion of PCRE2's stack usage</td></tr>
-
<tr><td><a href="pcre2syntax.html">pcre2syntax</a></td>
<td>&nbsp;&nbsp;Syntax quick-reference summary</td></tr>
@@ -94,6 +94,9 @@ in the library.
<tr><td><a href="pcre2_code_copy.html">pcre2_code_copy</a></td>
<td>&nbsp;&nbsp;Copy a compiled pattern</td></tr>
+<tr><td><a href="pcre2_code_copy_with_tables.html">pcre2_code_copy_with_tables</a></td>
+ <td>&nbsp;&nbsp;Copy a compiled pattern and its character tables</td></tr>
+
<tr><td><a href="pcre2_code_free.html">pcre2_code_free</a></td>
<td>&nbsp;&nbsp;Free a compiled pattern</td></tr>
@@ -112,6 +115,18 @@ in the library.
<tr><td><a href="pcre2_config.html">pcre2_config</a></td>
<td>&nbsp;&nbsp;Show build-time configuration options</td></tr>
+<tr><td><a href="pcre2_convert_context_copy.html">pcre2_convert_context_copy</a></td>
+ <td>&nbsp;&nbsp;Copy a convert context</td></tr>
+
+<tr><td><a href="pcre2_convert_context_create.html">pcre2_convert_context_create</a></td>
+ <td>&nbsp;&nbsp;Create a convert context</td></tr>
+
+<tr><td><a href="pcre2_convert_context_free.html">pcre2_convert_context_free</a></td>
+ <td>&nbsp;&nbsp;Free a convert context</td></tr>
+
+<tr><td><a href="pcre2_converted_pattern_free.html">pcre2_converted_pattern_free</a></td>
+ <td>&nbsp;&nbsp;Free converted foreign pattern</td></tr>
+
<tr><td><a href="pcre2_dfa_match.html">pcre2_dfa_match</a></td>
<td>&nbsp;&nbsp;Match a compiled pattern to a subject string
(DFA algorithm; <i>not</i> Perl compatible)</td></tr>
@@ -183,6 +198,9 @@ in the library.
<tr><td><a href="pcre2_match_data_free.html">pcre2_match_data_free</a></td>
<td>&nbsp;&nbsp;Free a match data block</td></tr>
+<tr><td><a href="pcre2_pattern_convert.html">pcre2_pattern_convert</a></td>
+ <td>&nbsp;&nbsp;Experimental foreign pattern converter</td></tr>
+
<tr><td><a href="pcre2_pattern_info.html">pcre2_pattern_info</a></td>
<td>&nbsp;&nbsp;Extract information about a pattern</td></tr>
@@ -207,9 +225,24 @@ in the library.
<tr><td><a href="pcre2_set_character_tables.html">pcre2_set_character_tables</a></td>
<td>&nbsp;&nbsp;Set character tables</td></tr>
+<tr><td><a href="pcre2_set_compile_extra_options.html">pcre2_set_compile_extra_options</a></td>
+ <td>&nbsp;&nbsp;Set compile time extra options</td></tr>
+
<tr><td><a href="pcre2_set_compile_recursion_guard.html">pcre2_set_compile_recursion_guard</a></td>
<td>&nbsp;&nbsp;Set up a compile recursion guard function</td></tr>
+<tr><td><a href="pcre2_set_depth_limit.html">pcre2_set_depth_limit</a></td>
+ <td>&nbsp;&nbsp;Set the match backtracking depth limit</td></tr>
+
+<tr><td><a href="pcre2_set_glob_escape.html">pcre2_set_glob_escape</a></td>
+ <td>&nbsp;&nbsp;Set glob escape character</td></tr>
+
+<tr><td><a href="pcre2_set_glob_separator.html">pcre2_set_glob_separator</a></td>
+ <td>&nbsp;&nbsp;Set glob separator character</td></tr>
+
+<tr><td><a href="pcre2_set_heap_limit.html">pcre2_set_heap_limit</a></td>
+ <td>&nbsp;&nbsp;Set the match backtracking heap limit</td></tr>
+
<tr><td><a href="pcre2_set_match_limit.html">pcre2_set_match_limit</a></td>
<td>&nbsp;&nbsp;Set the match limit</td></tr>
@@ -226,10 +259,10 @@ in the library.
<td>&nbsp;&nbsp;Set the parentheses nesting limit</td></tr>
<tr><td><a href="pcre2_set_recursion_limit.html">pcre2_set_recursion_limit</a></td>
- <td>&nbsp;&nbsp;Set the match recursion limit</td></tr>
+ <td>&nbsp;&nbsp;Obsolete: use pcre2_set_depth_limit</td></tr>
<tr><td><a href="pcre2_set_recursion_memory_management.html">pcre2_set_recursion_memory_management</a></td>
- <td>&nbsp;&nbsp;Set match recursion memory management</td></tr>
+ <td>&nbsp;&nbsp;Obsolete function that (from 10.30 onwards) does nothing</td></tr>
<tr><td><a href="pcre2_substitute.html">pcre2_substitute</a></td>
<td>&nbsp;&nbsp;Match a compiled pattern to a subject string and do
diff --git a/doc/pcre2.3 b/doc/pcre2.3
index 9a84ce3..83a7655 100644
--- a/doc/pcre2.3
+++ b/doc/pcre2.3
@@ -1,4 +1,4 @@
-.TH PCRE2 3 "16 October 2015" "PCRE2 10.21"
+.TH PCRE2 3 "01 April 2017" "PCRE2 10.30"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH INTRODUCTION
@@ -104,7 +104,7 @@ lose performance.
One way of guarding against this possibility is to use the
\fBpcre2_pattern_info()\fP function to check the compiled pattern's options for
PCRE2_UTF. Alternatively, you can set the PCRE2_NEVER_UTF option when calling
-\fBpcre2_compile()\fP. This causes an compile time error if a pattern contains
+\fBpcre2_compile()\fP. This causes a compile time error if the pattern contains
a UTF-setting sequence.
.P
The use of Unicode properties for character types such as \ed can also be
@@ -130,7 +130,8 @@ against this: see the \fBpcre2_set_match_limit()\fP function in the
.\" HREF
\fBpcre2api\fP
.\"
-page.
+page. There is a similar function called \fBpcre2_set_depth_limit()\fP that can
+be used to restrict the amount of memory that is used.
.
.
.SH "USER DOCUMENTATION"
@@ -163,7 +164,6 @@ listing), and the short pages for individual functions, are concatenated in
pcre2perform discussion of performance issues
pcre2posix the POSIX-compatible C API for the 8-bit library
pcre2sample discussion of the pcre2demo program
- pcre2stack discussion of stack usage
pcre2syntax quick syntax reference
pcre2test description of the \fBpcre2test\fP command
pcre2unicode discussion of Unicode and UTF support
@@ -189,6 +189,6 @@ use my two initials, followed by the two digits 10, at the domain cam.ac.uk.
.rs
.sp
.nf
-Last updated: 16 October 2015
-Copyright (c) 1997-2015 University of Cambridge.
+Last updated: 01 April 2017
+Copyright (c) 1997-2017 University of Cambridge.
.fi
diff --git a/doc/pcre2.txt b/doc/pcre2.txt
index 8f4e8a1..79d94e3 100644
--- a/doc/pcre2.txt
+++ b/doc/pcre2.txt
@@ -89,8 +89,8 @@ SECURITY CONSIDERATIONS
One way of guarding against this possibility is to use the pcre2_pat-
tern_info() function to check the compiled pattern's options for
PCRE2_UTF. Alternatively, you can set the PCRE2_NEVER_UTF option when
- calling pcre2_compile(). This causes an compile time error if a pattern
- contains a UTF-setting sequence.
+ calling pcre2_compile(). This causes a compile time error if the pat-
+ tern contains a UTF-setting sequence.
The use of Unicode properties for character types such as \d can also
be enabled from within the pattern, by specifying "(*UCP)". This fea-
@@ -112,7 +112,9 @@ SECURITY CONSIDERATIONS
has a very large search tree against a string that will never match.
Nested unlimited repeats in a pattern are a common example. PCRE2 pro-
vides some protection against this: see the pcre2_set_match_limit()
- function in the pcre2api page.
+ function in the pcre2api page. There is a similar function called
+ pcre2_set_depth_limit() that can be used to restrict the amount of mem-
+ ory that is used.
USER DOCUMENTATION
@@ -144,7 +146,6 @@ USER DOCUMENTATION
pcre2perform discussion of performance issues
pcre2posix the POSIX-compatible C API for the 8-bit library
pcre2sample discussion of the pcre2demo program
- pcre2stack discussion of stack usage
pcre2syntax quick syntax reference
pcre2test description of the pcre2test command
pcre2unicode discussion of Unicode and UTF support
@@ -166,8 +167,8 @@ AUTHOR
REVISION
- Last updated: 16 October 2015
- Copyright (c) 1997-2015 University of Cambridge.
+ Last updated: 01 April 2017
+ Copyright (c) 1997-2017 University of Cambridge.
------------------------------------------------------------------------------
@@ -180,9 +181,9 @@ NAME
#include <pcre2.h>
- PCRE2 is a new API for PCRE. This document contains a description of
- all its functions. See the pcre2 document for an overview of all the
- PCRE2 documentation.
+ PCRE2 is a new API for PCRE, starting at release 10.0. This document
+ contains a description of all its native functions. See the pcre2 docu-
+ ment for an overview of all the PCRE2 documentation.
PCRE2 NATIVE API BASIC FUNCTIONS
@@ -252,6 +253,9 @@ PCRE2 NATIVE API COMPILE CONTEXT FUNCTIONS
int pcre2_set_character_tables(pcre2_compile_context *ccontext,
const unsigned char *tables);
+ int pcre2_set_compile_extra_options(pcre2_compile_context *ccontext,
+ uint32_t extra_options);
+
int pcre2_set_max_pattern_length(pcre2_compile_context *ccontext,
PCRE2_SIZE value);
@@ -279,19 +283,17 @@ PCRE2 NATIVE API MATCH CONTEXT FUNCTIONS
int (*callout_function)(pcre2_callout_block *, void *),
void *callout_data);
- int pcre2_set_match_limit(pcre2_match_context *mcontext,
- uint32_t value);
-
int pcre2_set_offset_limit(pcre2_match_context *mcontext,
PCRE2_SIZE value);
- int pcre2_set_recursion_limit(pcre2_match_context *mcontext,
+ int pcre2_set_heap_limit(pcre2_match_context *mcontext,
uint32_t value);
- int pcre2_set_recursion_memory_management(
- pcre2_match_context *mcontext,
- void *(*private_malloc)(PCRE2_SIZE, void *),
- void (*private_free)(void *, void *), void *memory_data);
+ int pcre2_set_match_limit(pcre2_match_context *mcontext,
+ uint32_t value);
+
+ int pcre2_set_depth_limit(pcre2_match_context *mcontext,
+ uint32_t value);
PCRE2 NATIVE API STRING EXTRACTION FUNCTIONS
@@ -379,6 +381,8 @@ PCRE2 NATIVE API AUXILIARY FUNCTIONS
pcre2_code *pcre2_code_copy(const pcre2_code *code);
+ pcre2_code *pcre2_code_copy_with_tables(const pcre2_code *code);
+
int pcre2_get_error_message(int errorcode, PCRE2_UCHAR *buffer,
PCRE2_SIZE bufflen);
@@ -393,19 +397,64 @@ PCRE2 NATIVE API AUXILIARY FUNCTIONS
int pcre2_config(uint32_t what, void *where);
+PCRE2 NATIVE API OBSOLETE FUNCTIONS
+
+ int pcre2_set_recursion_limit(pcre2_match_context *mcontext,
+ uint32_t value);
+
+ int pcre2_set_recursion_memory_management(
+ pcre2_match_context *mcontext,
+ void *(*private_malloc)(PCRE2_SIZE, void *),
+ void (*private_free)(void *, void *), void *memory_data);
+
+ These functions became obsolete at release 10.30 and are retained only
+ for backward compatibility. They should not be used in new code. The
+ first is replaced by pcre2_set_depth_limit(); the second is no longer
+ needed and has no effect (it always returns zero).
+
+
+PCRE2 EXPERIMENTAL PATTERN CONVERSION FUNCTIONS
+
+ pcre2_convert_context *pcre2_convert_context_create(
+ pcre2_general_context *gcontext);
+
+ pcre2_convert_context *pcre2_convert_context_copy(
+ pcre2_convert_context *cvcontext);
+
+ void pcre2_convert_context_free(pcre2_convert_context *cvcontext);
+
+ int pcre2_set_glob_escape(pcre2_convert_context *cvcontext,
+ uint32_t escape_char);
+
+ int pcre2_set_glob_separator(pcre2_convert_context *cvcontext,
+ uint32_t separator_char);
+
+ int pcre2_pattern_convert(PCRE2_SPTR pattern, PCRE2_SIZE length,
+ uint32_t options, PCRE2_UCHAR **buffer,
+ PCRE2_SIZE *blength, pcre2_convert_context *cvcontext);
+
+ void pcre2_converted_pattern_free(PCRE2_UCHAR *converted_pattern);
+
+ These functions provide a way of converting non-PCRE2 patterns into
+ patterns that can be processed by pcre2_compile(). This facility is
+ experimental and may be changed in future releases. At present, "globs"
+ and POSIX basic and extended patterns can be converted. Details are
+ given in the pcre2convert documentation.
+
+
PCRE2 8-BIT, 16-BIT, AND 32-BIT LIBRARIES
- There are three PCRE2 libraries, supporting 8-bit, 16-bit, and 32-bit
- code units, respectively. However, there is just one header file,
- pcre2.h. This contains the function prototypes and other definitions
+ There are three PCRE2 libraries, supporting 8-bit, 16-bit, and 32-bit
+ code units, respectively. However, there is just one header file,
+ pcre2.h. This contains the function prototypes and other definitions
for all three libraries. One, two, or all three can be installed simul-
- taneously. On Unix-like systems the libraries are called libpcre2-8,
+ taneously. On Unix-like systems the libraries are called libpcre2-8,
libpcre2-16, and libpcre2-32, and they can also co-exist with the orig-
inal PCRE libraries.
- Character strings are passed to and from a PCRE2 library as a sequence
- of unsigned integers in code units of the appropriate width. Every
- PCRE2 function comes in three different forms, one for each library,
+ Character strings are passed to and from a PCRE2 library as a sequence
+ of unsigned integers in code units of the appropriate width. Every
+ PCRE2 function comes in three different forms, one for each library,
for example:
pcre2_compile_8()
@@ -417,72 +466,79 @@ PCRE2 8-BIT, 16-BIT, AND 32-BIT LIBRARIES
PCRE2_UCHAR8, PCRE2_UCHAR16, PCRE2_UCHAR32
PCRE2_SPTR8, PCRE2_SPTR16, PCRE2_SPTR32
- The UCHAR types define unsigned code units of the appropriate widths.
- For example, PCRE2_UCHAR16 is usually defined as `uint16_t'. The SPTR
- types are constant pointers to the equivalent UCHAR types, that is,
+ The UCHAR types define unsigned code units of the appropriate widths.
+ For example, PCRE2_UCHAR16 is usually defined as `uint16_t'. The SPTR
+ types are constant pointers to the equivalent UCHAR types, that is,
they are pointers to vectors of unsigned code units.
- Many applications use only one code unit width. For their convenience,
+ Many applications use only one code unit width. For their convenience,
macros are defined whose names are the generic forms such as pcre2_com-
- pile() and PCRE2_SPTR. These macros use the value of the macro
- PCRE2_CODE_UNIT_WIDTH to generate the appropriate width-specific func-
+ pile() and PCRE2_SPTR. These macros use the value of the macro
+ PCRE2_CODE_UNIT_WIDTH to generate the appropriate width-specific func-
tion and macro names. PCRE2_CODE_UNIT_WIDTH is not defined by default.
- An application must define it to be 8, 16, or 32 before including
+ An application must define it to be 8, 16, or 32 before including
pcre2.h in order to make use of the generic names.
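The way a single width macro can select width-specific names may be easier to
see in a small token-pasting sketch. Every name below (DEMO_CODE_UNIT_WIDTH,
DEMO_SUFFIX, demo_unit_bytes_8 and so on) is hypothetical and only stands in
for the corresponding pcre2.h machinery; it is an illustration of the idea, not
a copy of the real header.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for PCRE2_CODE_UNIT_WIDTH: the application chooses
   the width before any generic name is used. */
#define DEMO_CODE_UNIT_WIDTH 16

/* Two-level pasting, so that DEMO_CODE_UNIT_WIDTH is replaced by its value
   before it is glued onto the end of the name. */
#define DEMO_PASTE(a, b) a##b
#define DEMO_JOIN(a, b) DEMO_PASTE(a, b)
#define DEMO_SUFFIX(name) DEMO_JOIN(name, DEMO_CODE_UNIT_WIDTH)

/* Width-specific "real" functions, one per imaginary library. */
size_t demo_unit_bytes_8(void)  { return sizeof(uint8_t); }
size_t demo_unit_bytes_16(void) { return sizeof(uint16_t); }
size_t demo_unit_bytes_32(void) { return sizeof(uint32_t); }

/* Generic name: expands to demo_unit_bytes_16 because the width is 16. */
#define demo_unit_bytes DEMO_SUFFIX(demo_unit_bytes_)

/* Self-check that the generic name resolved to the 16-bit variant. */
void demo_check_generic_name(void) {
  assert(demo_unit_bytes() == demo_unit_bytes_16());
}
```

Changing DEMO_CODE_UNIT_WIDTH to 8 or 32 redirects the generic name to a
different function without touching the calling code, which is the convenience
that the generic PCRE2 names are described as providing.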
- Applications that use more than one code unit width can be linked with
- more than one PCRE2 library, but must define PCRE2_CODE_UNIT_WIDTH to
- be 0 before including pcre2.h, and then use the real function names.
- Any code that is to be included in an environment where the value of
- PCRE2_CODE_UNIT_WIDTH is unknown should also use the real function
+ Applications that use more than one code unit width can be linked with
+ more than one PCRE2 library, but must define PCRE2_CODE_UNIT_WIDTH to
+ be 0 before including pcre2.h, and then use the real function names.
+ Any code that is to be included in an environment where the value of
+ PCRE2_CODE_UNIT_WIDTH is unknown should also use the real function
names. (Unfortunately, it is not possible in C code to save and restore
the value of a macro.)
- If PCRE2_CODE_UNIT_WIDTH is not defined before including pcre2.h, a
+ If PCRE2_CODE_UNIT_WIDTH is not defined before including pcre2.h, a
compiler error occurs.
- When using multiple libraries in an application, you must take care
- when processing any particular pattern to use only functions from a
- single library. For example, if you want to run a match using a pat-
- tern that was compiled with pcre2_compile_16(), you must do so with
- pcre2_match_16(), not pcre2_match_8().
+ When using multiple libraries in an application, you must take care
+ when processing any particular pattern to use only functions from a
+ single library. For example, if you want to run a match using a pat-
+ tern that was compiled with pcre2_compile_16(), you must do so with
+ pcre2_match_16(), not pcre2_match_8() or pcre2_match_32().
- In the function summaries above, and in the rest of this document and
- other PCRE2 documents, functions and data types are described using
- their generic names, without the 8, 16, or 32 suffix.
+ In the function summaries above, and in the rest of this document and
+ other PCRE2 documents, functions and data types are described using
+ their generic names, without the _8, _16, or _32 suffix.
PCRE2 API OVERVIEW
- PCRE2 has its own native API, which is described in this document.
+ PCRE2 has its own native API, which is described in this document.
There are also some wrapper functions for the 8-bit library that corre-
- spond to the POSIX regular expression API, but they do not give access
- to all the functionality. They are described in the pcre2posix documen-
- tation. Both these APIs define a set of C function calls.
-
- The native API C data types, function prototypes, option values, and
- error codes are defined in the header file pcre2.h, which contains def-
- initions of PCRE2_MAJOR and PCRE2_MINOR, the major and minor release
- numbers for the library. Applications can use these to include support
+ spond to the POSIX regular expression API, but they do not give access
+ to all the functionality of PCRE2. They are described in the pcre2posix
+ documentation. Both these APIs define a set of C function calls.
+
+ The native API C data types, function prototypes, option values, and
+ error codes are defined in the header file pcre2.h, which also contains
+ definitions of PCRE2_MAJOR and PCRE2_MINOR, the major and minor release
+ numbers for the library. Applications can use these to include support
for different releases of PCRE2.
In a Windows environment, if you want to statically link an application
- program against a non-dll PCRE2 library, you must define PCRE2_STATIC
+ program against a non-dll PCRE2 library, you must define PCRE2_STATIC
before including pcre2.h.
- The functions pcre2_compile(), and pcre2_match() are used for compiling
- and matching regular expressions in a Perl-compatible manner. A sample
+ The functions pcre2_compile() and pcre2_match() are used for compiling
+ and matching regular expressions in a Perl-compatible manner. A sample
program that demonstrates the simplest way of using them is provided in
the file called pcre2demo.c in the PCRE2 source distribution. A listing
- of this program is given in the pcre2demo documentation, and the
+ of this program is given in the pcre2demo documentation, and the
pcre2sample documentation describes how to compile and run it.
- Just-in-time compiler support is an optional feature of PCRE2 that can
- be built in appropriate hardware environments. It greatly speeds up the
- matching performance of many patterns. Programs can request that it be
- used if available, by calling pcre2_jit_compile() after a pattern has
- been successfully compiled by pcre2_compile(). This does nothing if JIT
- support is not available.
+ The compiling and matching functions recognize various options that are
+ passed as bits in an options argument. There are also some more compli-
+ cated parameters such as custom memory management functions and
+ resource limits that are passed in "contexts" (which are just memory
+ blocks, described below). Simple applications do not need to make use
+ of contexts.
+
+ Just-in-time (JIT) compiler support is an optional feature of PCRE2
+ that can be built in appropriate hardware environments. It greatly
+ speeds up the matching performance of many patterns. Programs can
+ request that it be used if available by calling pcre2_jit_compile()
+ after a pattern has been successfully compiled by pcre2_compile(). This
+ does nothing if JIT support is not available.
More complicated programs might need to make use of the specialist
functions pcre2_jit_stack_create(), pcre2_jit_stack_free(), and
@@ -491,20 +547,21 @@ PCRE2 API OVERVIEW
JIT matching is automatically used by pcre2_match() if it is available,
unless the PCRE2_NO_JIT option is set. There is also a direct interface
- for JIT matching, which gives improved performance. The JIT-specific
- functions are discussed in the pcre2jit documentation.
-
- A second matching function, pcre2_dfa_match(), which is not Perl-com-
- patible, is also provided. This uses a different algorithm for the
- matching. The alternative algorithm finds all possible matches (at a
- given point in the subject), and scans the subject just once (unless
- there are lookbehind assertions). However, this algorithm does not
- return captured substrings. A description of the two matching algo-
- rithms and their advantages and disadvantages is given in the
- pcre2matching documentation. There is no JIT support for
+ for JIT matching, which gives improved performance at the expense of
+ less sanity checking. The JIT-specific functions are discussed in the
+ pcre2jit documentation.
+
+ A second matching function, pcre2_dfa_match(), which is not Perl-com-
+ patible, is also provided. This uses a different algorithm for the
+ matching. The alternative algorithm finds all possible matches (at a
+ given point in the subject), and scans the subject just once (unless
+ there are lookaround assertions). However, this algorithm does not
+ return captured substrings. A description of the two matching algo-
+ rithms and their advantages and disadvantages is given in the
+ pcre2matching documentation. There is no JIT support for
pcre2_dfa_match().
- In addition to the main compiling and matching functions, there are
+ In addition to the main compiling and matching functions, there are
convenience functions for extracting captured substrings from a subject
string that has been matched by pcre2_match(). They are:
@@ -518,74 +575,74 @@ PCRE2 API OVERVIEW
pcre2_substring_nametable_scan()
pcre2_substring_number_from_name()
- pcre2_substring_free() and pcre2_substring_list_free() are also pro-
- vided, to free the memory used for extracted strings.
+ pcre2_substring_free() and pcre2_substring_list_free() are also pro-
+ vided, to free memory used for extracted strings.
- The function pcre2_substitute() can be called to match a pattern and
- return a copy of the subject string with substitutions for parts that
+ The function pcre2_substitute() can be called to match a pattern and
+ return a copy of the subject string with substitutions for parts that
were matched.
- Functions whose names begin with pcre2_serialize_ are used for saving
+ Functions whose names begin with pcre2_serialize_ are used for saving
compiled patterns on disc or elsewhere, and reloading them later.
- Finally, there are functions for finding out information about a com-
- piled pattern (pcre2_pattern_info()) and about the configuration with
+ Finally, there are functions for finding out information about a com-
+ piled pattern (pcre2_pattern_info()) and about the configuration with
which PCRE2 was built (pcre2_config()).
- Functions with names ending with _free() are used for freeing memory
- blocks of various sorts. In all cases, if one of these functions is
+ Functions with names ending with _free() are used for freeing memory
+ blocks of various sorts. In all cases, if one of these functions is
called with a NULL argument, it does nothing.
STRING LENGTHS AND OFFSETS
- The PCRE2 API uses string lengths and offsets into strings of code
- units in several places. These values are always of type PCRE2_SIZE,
- which is an unsigned integer type, currently always defined as size_t.
- The largest value that can be stored in such a type (that is
- ~(PCRE2_SIZE)0) is reserved as a special indicator for zero-terminated
- strings and unset offsets. Therefore, the longest string that can be
+ The PCRE2 API uses string lengths and offsets into strings of code
+ units in several places. These values are always of type PCRE2_SIZE,
+ which is an unsigned integer type, currently always defined as size_t.
+ The largest value that can be stored in such a type (that is
+ ~(PCRE2_SIZE)0) is reserved as a special indicator for zero-terminated
+ strings and unset offsets. Therefore, the longest string that can be
handled is one less than this maximum.
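As a minimal sketch of the arithmetic, the following uses stand-in names
(DEMO_SIZE, DEMO_RESERVED) so that it compiles without the PCRE2 headers; they
are assumptions made for illustration. In pcre2.h itself the reserved value is
available as PCRE2_ZERO_TERMINATED and PCRE2_UNSET.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-ins assumed for this sketch only. */
typedef size_t DEMO_SIZE;               /* plays the role of PCRE2_SIZE */
#define DEMO_RESERVED (~(DEMO_SIZE)0)   /* the reserved special indicator */

/* The longest length that can be passed as an explicit length. */
DEMO_SIZE demo_max_length(void) { return DEMO_RESERVED - 1; }

/* Returns 1 for an ordinary length or offset, 0 for the reserved value
   (which means "zero-terminated" or "unset", depending on context). */
int demo_is_real_length(DEMO_SIZE value) { return value != DEMO_RESERVED; }

/* All bits set in an unsigned type is that type's maximum value. */
void demo_check_reserved(void) { assert(DEMO_RESERVED == SIZE_MAX); }
```

An application that computes lengths itself can use a test like
demo_is_real_length() to avoid passing the reserved value by accident.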
NEWLINES
PCRE2 supports five different conventions for indicating line breaks in
- strings: a single CR (carriage return) character, a single LF (line-
+ strings: a single CR (carriage return) character, a single LF (line-
feed) character, the two-character sequence CRLF, any of the three pre-
- ceding, or any Unicode newline sequence. The Unicode newline sequences
- are the three just mentioned, plus the single characters VT (vertical
+ ceding, or any Unicode newline sequence. The Unicode newline sequences
+ are the three just mentioned, plus the single characters VT (vertical
tab, U+000B), FF (form feed, U+000C), NEL (next line, U+0085), LS (line
separator, U+2028), and PS (paragraph separator, U+2029).
- Each of the first three conventions is used by at least one operating
+ Each of the first three conventions is used by at least one operating
system as its standard newline sequence. When PCRE2 is built, a default
- can be specified. The default default is LF, which is the Unix stan-
- dard. However, the newline convention can be changed by an application
+ can be specified. The default default is LF, which is the Unix stan-
+ dard. However, the newline convention can be changed by an application
when calling pcre2_compile(), or it can be specified by special text at
the start of the pattern itself; this overrides any other settings. See
the pcre2pattern page for details of the special character sequences.
- In the PCRE2 documentation the word "newline" is used to mean "the
+ In the PCRE2 documentation the word "newline" is used to mean "the
character or pair of characters that indicate a line break". The choice
- of newline convention affects the handling of the dot, circumflex, and
+ of newline convention affects the handling of the dot, circumflex, and
dollar metacharacters, the handling of #-comments in /x mode, and, when
- CRLF is a recognized line ending sequence, the match position advance-
+ CRLF is a recognized line ending sequence, the match position advance-
ment for a non-anchored pattern. There is more detail about this in the
section on pcre2_match() options below.
- The choice of newline convention does not affect the interpretation of
+ The choice of newline convention does not affect the interpretation of
the \n or \r escape sequences, nor does it affect what \R matches; this
has its own separate convention.
MULTITHREADING
- In a multithreaded application it is important to keep thread-specific
- data separate from data that can be shared between threads. The PCRE2
- library code itself is thread-safe: it contains no static or global
- variables. The API is designed to be fairly simple for non-threaded
- applications while at the same time ensuring that multithreaded appli-
+ In a multithreaded application it is important to keep thread-specific
+ data separate from data that can be shared between threads. The PCRE2
+ library code itself is thread-safe: it contains no static or global
+ variables. The API is designed to be fairly simple for non-threaded
+ applications while at the same time ensuring that multithreaded appli-
cations can use it.
There are several different blocks of data that are used to pass infor-
@@ -593,19 +650,19 @@ MULTITHREADING
The compiled pattern
- A pointer to the compiled form of a pattern is returned to the user
+ A pointer to the compiled form of a pattern is returned to the user
when pcre2_compile() is successful. The data in the compiled pattern is
- fixed, and does not change when the pattern is matched. Therefore, it
- is thread-safe, that is, the same compiled pattern can be used by more
+ fixed, and does not change when the pattern is matched. Therefore, it
+ is thread-safe, that is, the same compiled pattern can be used by more
than one thread simultaneously. For example, an application can compile
all its patterns at the start, before forking off multiple threads that
- use them. However, if the just-in-time optimization feature is being
- used, it needs separate memory stack areas for each thread. See the
- pcre2jit documentation for more details.
+ use them. However, if the just-in-time (JIT) optimization feature is
+ being used, it needs separate memory stack areas for each thread. See
+ the pcre2jit documentation for more details.
- In a more complicated situation, where patterns are compiled only when
- they are first needed, but are still shared between threads, pointers
- to compiled patterns must be protected from simultaneous writing by
+ In a more complicated situation, where patterns are compiled only when
+ they are first needed, but are still shared between threads, pointers
+ to compiled patterns must be protected from simultaneous writing by
multiple threads, at least until a pattern has been compiled. The logic
can be something like this:
@@ -618,16 +675,17 @@ MULTITHREADING
Release the lock
Use pointer in pcre2_match()
- Of course, testing for compilation errors should also be included in
+ Of course, testing for compilation errors should also be included in
the code.
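One possible shape for this logic, using POSIX threads, is sketched here. It
is a simplification under stated assumptions: a single mutex takes the place of
the shared and unique locks, and demo_code and demo_compile() are hypothetical
stand-ins for pcre2_code and pcre2_compile() so that the sketch builds without
the PCRE2 library.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-ins for pcre2_code and pcre2_compile(). */
typedef struct { int compiled; } demo_code;
static demo_code demo_storage;
static int demo_compile_calls = 0;

static demo_code *demo_compile(void) {
  demo_compile_calls++;
  demo_storage.compiled = 1;
  return &demo_storage;     /* the real function returns NULL on error */
}

static pthread_mutex_t demo_lock = PTHREAD_MUTEX_INITIALIZER;
static demo_code *demo_pointer = NULL;    /* shared between threads */

/* Compile on first use; later callers reuse the stored pointer. */
demo_code *demo_get_pattern(void) {
  demo_code *result;
  int rc = pthread_mutex_lock(&demo_lock);
  assert(rc == 0);
  (void)rc;
  if (demo_pointer == NULL)
    demo_pointer = demo_compile();
  result = demo_pointer;
  pthread_mutex_unlock(&demo_lock);
  return result;            /* callers should still test for NULL */
}

/* Number of times the stand-in compiler has run. */
int demo_compile_count(void) { return demo_compile_calls; }
```

Because the test and the assignment both happen while the lock is held, two
threads cannot both find the pointer unset and compile the pattern twice. The
returned pointer can then be used for matching without holding the lock.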
If JIT is being used, but the JIT compilation is not being done immedi-
- ately, (perhaps waiting to see if the pattern is used often enough)
+ ately, (perhaps waiting to see if the pattern is used often enough)
similar logic is required. JIT compilation updates a pointer within the
- compiled code block, so a thread must gain unique write access to the
- pointer before calling pcre2_jit_compile(). Alternatively,
- pcre2_code_copy() can be used to obtain a private copy of the compiled
- code.
+ compiled code block, so a thread must gain unique write access to the
+ pointer before calling pcre2_jit_compile(). Alternatively,
+ pcre2_code_copy() or pcre2_code_copy_with_tables() can be used to
+ obtain a private copy of the compiled code before calling the JIT com-
+ piler.
Context blocks
@@ -646,10 +704,10 @@ MULTITHREADING
Match blocks
- The matching functions need a block of memory for working space and for
- storing the results of a match. This includes details of what was
- matched, as well as additional information such as the name of a
- (*MARK) setting. Each thread must provide its own copy of this memory.
+ The matching functions need a block of memory for storing the results
+ of a match. This includes details of what was matched, as well as addi-
+ tional information such as the name of a (*MARK) setting. Each thread
+ must provide its own copy of this memory.
PCRE2 CONTEXTS
@@ -714,21 +772,22 @@ PCRE2 CONTEXTS
The compile context
- A compile context is required if you want to change the default values
- of any of the following compile-time parameters:
+ A compile context is required if you want to provide an external func-
+ tion for stack checking during compilation or to change the default
+ values of any of the following compile-time parameters:
What \R matches (Unicode newlines or CR, LF, CRLF only)
PCRE2's character tables
The newline character sequence
The compile time nested parentheses limit
The maximum length of the pattern string
- An external function for stack checking
+ The extra options bits (none set by default)
- A compile context is also required if you are using custom memory man-
- agement. If none of these apply, just pass NULL as the context argu-
+ A compile context is also required if you are using custom memory man-
+ agement. If none of these apply, just pass NULL as the context argu-
ment of pcre2_compile().
- A compile context is created, copied, and freed by the following func-
+ A compile context is created, copied, and freed by the following func-
tions:
pcre2_compile_context *pcre2_compile_context_create(
@@ -739,57 +798,75 @@ PCRE2 CONTEXTS
void pcre2_compile_context_free(pcre2_compile_context *ccontext);
- A compile context is created with default values for its parameters.
+ A compile context is created with default values for its parameters.
These can be changed by calling the following functions, which return 0
on success, or PCRE2_ERROR_BADDATA if invalid data is detected.
int pcre2_set_bsr(pcre2_compile_context *ccontext,
uint32_t value);
- The value must be PCRE2_BSR_ANYCRLF, to specify that \R matches only
- CR, LF, or CRLF, or PCRE2_BSR_UNICODE, to specify that \R matches any
+ The value must be PCRE2_BSR_ANYCRLF, to specify that \R matches only
+ CR, LF, or CRLF, or PCRE2_BSR_UNICODE, to specify that \R matches any
Unicode line ending sequence. The value is used by the JIT compiler and
- by the two interpreted matching functions, pcre2_match() and
+ by the two interpreted matching functions, pcre2_match() and
pcre2_dfa_match().
int pcre2_set_character_tables(pcre2_compile_context *ccontext,
const unsigned char *tables);
- The value must be the result of a call to pcre2_maketables(), whose
+ The value must be the result of a call to pcre2_maketables(), whose
only argument is a general context. This function builds a set of char-
acter tables in the current locale.
+ int pcre2_set_compile_extra_options(pcre2_compile_context *ccontext,
+ uint32_t extra_options);
+
+ As PCRE2 has developed, almost all the 32 option bits that are avail-
+ able in the options argument of pcre2_compile() have been used up. To
+ avoid running out, the compile context contains a set of extra option
+ bits which are used for some newer, assumed rarer, options. This func-
+ tion sets those bits. It always sets all the bits (either on or off).
+ It does not modify any existing setting. The available options are
+ defined in the section entitled "Extra compile options" below.
+
int pcre2_set_max_pattern_length(pcre2_compile_context *ccontext,
PCRE2_SIZE value);
- This sets a maximum length, in code units, for the pattern string that
- is to be compiled. If the pattern is longer, an error is generated.
- This facility is provided so that applications that accept patterns
- from external sources can limit their size. The default is the largest
- number that a PCRE2_SIZE variable can hold, which is effectively unlim-
- ited.
+ This sets a maximum length, in code units, for any pattern string that
+ is compiled with this context. If the pattern is longer, an error is
+ generated. This facility is provided so that applications that accept
+ patterns from external sources can limit their size. The default is the
+ largest number that a PCRE2_SIZE variable can hold, which is effec-
+ tively unlimited.
int pcre2_set_newline(pcre2_compile_context *ccontext,
uint32_t value);
This specifies which characters or character sequences are to be recog-
- nized as newlines. The value must be one of PCRE2_NEWLINE_CR (carriage
+ nized as newlines. The value must be one of PCRE2_NEWLINE_CR (carriage
return only), PCRE2_NEWLINE_LF (linefeed only), PCRE2_NEWLINE_CRLF (the
- two-character sequence CR followed by LF), PCRE2_NEWLINE_ANYCRLF (any
- of the above), or PCRE2_NEWLINE_ANY (any Unicode newline sequence).
+ two-character sequence CR followed by LF), PCRE2_NEWLINE_ANYCRLF (any
+ of the above), PCRE2_NEWLINE_ANY (any Unicode newline sequence), or
+ PCRE2_NEWLINE_NUL (the NUL character, that is a binary zero).
- When a pattern is compiled with the PCRE2_EXTENDED option, the value of
- this parameter affects the recognition of white space and the end of
- internal comments starting with #. The value is saved with the compiled
- pattern for subsequent use by the JIT compiler and by the two inter-
- preted matching functions, pcre2_match() and pcre2_dfa_match().
+ A pattern can override the value set in the compile context by starting
+ with a sequence such as (*CRLF). See the pcre2pattern page for details.
+
+ When a pattern is compiled with the PCRE2_EXTENDED or
+ PCRE2_EXTENDED_MORE option, the newline convention affects the recogni-
+ tion of white space and the end of internal comments starting with #.
+ The value is saved with the compiled pattern for subsequent use by the
+ JIT compiler and by the two interpreted matching functions,
+ pcre2_match() and pcre2_dfa_match().
int pcre2_set_parens_nest_limit(pcre2_compile_context *ccontext,
uint32_t value);
This parameter adjusts the limit, set when PCRE2 is built (default 250),
on the depth of parenthesis nesting in a pattern. This limit stops
- rogue patterns using up too much system stack when being compiled.
+ rogue patterns using up too much system stack when being compiled. The
+ limit applies to parentheses of all kinds, not just capturing parenthe-
+ ses.
int pcre2_set_compile_recursion_guard(pcre2_compile_context *ccontext,
int (*guard_function)(uint32_t, void *), void *user_data);
@@ -797,31 +874,32 @@ PCRE2 CONTEXTS
There is at least one application that runs PCRE2 in threads with very
limited system stack, where running out of stack is to be avoided at
all costs. The parenthesis limit above cannot take account of how much
- stack is actually available. For a finer control, you can supply a
- function that is called whenever pcre2_compile() starts to compile a
- parenthesized part of a pattern. This function can check the actual
- stack size (or anything else that it wants to, of course).
-
- The first argument to the callout function gives the current depth of
- nesting, and the second is user data that is set up by the last argu-
- ment of pcre2_set_compile_recursion_guard(). The callout function
+ stack is actually available during compilation. For a finer control,
+ you can supply a function that is called whenever pcre2_compile()
+ starts to compile a parenthesized part of a pattern. This function can
+ check the actual stack size (or anything else that it wants to, of
+ course).
+
+ The first argument to the callout function gives the current depth of
+ nesting, and the second is user data that is set up by the last argu-
+ ment of pcre2_set_compile_recursion_guard(). The callout function
should return zero if all is well, or non-zero to force an error.
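A guard function with the signature shown above might look like the following
sketch. The demo_guard_data structure and the fixed depth ceiling are
assumptions made for illustration; a real application would more likely compare
the current stack usage against whatever it knows to be available.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical user data: the deepest nesting the application allows. */
typedef struct { uint32_t max_depth; } demo_guard_data;

/* Has the form int (*)(uint32_t, void *). Returns zero to let compilation
   continue, or non-zero to force an error. */
int demo_guard(uint32_t depth, void *user_data) {
  const demo_guard_data *data = user_data;
  assert(data != NULL);
  return depth > data->max_depth;
}

/* Registration, given a compile context and a demo_guard_data object that
   outlives the compilation:
     pcre2_set_compile_recursion_guard(ccontext, demo_guard, &data);      */
```

The user data pointer is passed through unchanged, so the same guard function
can serve several compile contexts that have different limits.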
The match context
- A match context is required if you want to change the default values of
- any of the following match-time parameters:
+ A match context is required if you want to:
- A callout function
- The offset limit for matching an unanchored pattern
- The limit for calling match() (see below)
- The limit for calling match() recursively
+ Set up a callout function
+ Set an offset limit for matching an unanchored pattern
+ Change the limit on the amount of heap used when matching
+ Change the backtracking match limit
+ Change the backtracking depth limit
+ Set custom memory management specifically for the match
- A match context is also required if you are using custom memory manage-
- ment. If none of these apply, just pass NULL as the context argument
- of pcre2_match(), pcre2_dfa_match(), or pcre2_jit_match().
+ If none of these apply, just pass NULL as the context argument of
+ pcre2_match(), pcre2_dfa_match(), or pcre2_jit_match().
- A match context is created, copied, and freed by the following func-
+ A match context is created, copied, and freed by the following func-
tions:
pcre2_match_context *pcre2_match_context_create(
@@ -832,7 +910,7 @@ PCRE2 CONTEXTS
void pcre2_match_context_free(pcre2_match_context *mcontext);
- A match context is created with default values for its parameters.
+ A match context is created with default values for its parameters.
These can be changed by calling the following functions, which return 0
on success, or PCRE2_ERROR_BADDATA if invalid data is detected.
@@ -840,120 +918,137 @@ PCRE2 CONTEXTS
int (*callout_function)(pcre2_callout_block *, void *),
void *callout_data);
- This sets up a "callout" function, which PCRE2 will call at specified
- points during a matching operation. Details are given in the pcre2call-
- out documentation.
+ This sets up a "callout" function for PCRE2 to call at specified points
+ during a matching operation. Details are given in the pcre2callout doc-
+ umentation.
int pcre2_set_offset_limit(pcre2_match_context *mcontext,
PCRE2_SIZE value);
- The offset_limit parameter limits how far an unanchored search can
- advance in the subject string. The default value is PCRE2_UNSET. The
- pcre2_match() and pcre2_dfa_match() functions return
- PCRE2_ERROR_NOMATCH if a match with a starting point before or at the
- given offset is not found. For example, if the pattern /abc/ is matched
- against "123abc" with an offset limit less than 3, the result is
- PCRE2_ERROR_NO_MATCH. A match can never be found if the startoffset
- argument of pcre2_match() or pcre2_dfa_match() is greater than the off-
- set limit.
-
- When using this facility, you must set PCRE2_USE_OFFSET_LIMIT when
- calling pcre2_compile() so that when JIT is in use, different code can
- be compiled. If a match is started with a non-default match limit when
- PCRE2_USE_OFFSET_LIMIT is not set, an error is generated.
-
- The offset limit facility can be used to track progress when searching
- large subject strings. See also the PCRE2_FIRSTLINE option, which
- requires a match to start within the first line of the subject. If this
- is set with an offset limit, a match must occur in the first line and
- also within the offset limit. In other words, whichever limit comes
- first is used.
-
- int pcre2_set_match_limit(pcre2_match_context *mcontext,
+ The offset_limit parameter limits how far an unanchored search can
+ advance in the subject string. The default value is PCRE2_UNSET. The
+ pcre2_match() and pcre2_dfa_match() functions return
+ PCRE2_ERROR_NOMATCH if a match with a starting point before or at the
+ given offset is not found. The pcre2_substitute() function makes no
+ more substitutions.
+
+ For example, if the pattern /abc/ is matched against "123abc" with an
+ offset limit less than 3, the result is PCRE2_ERROR_NOMATCH. A match
+ can never be found if the startoffset argument of pcre2_match(),
+ pcre2_dfa_match(), or pcre2_substitute() is greater than the offset
+ limit set in the match context.
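
The rule can be modelled without PCRE2 at all. The sketch below is an assumption-laden toy, not the library's implementation: it searches for a literal string and only tries starting points that are at or before the offset limit. The names model_search and MODEL_NOMATCH are hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for "no match"; not a PCRE2 error code. */
#define MODEL_NOMATCH ((ptrdiff_t)-1)

/* Toy model of an unanchored search for a literal needle. A starting
   point is tried only if it is at or before offset_limit. Returns the
   offset of the match start, or MODEL_NOMATCH. */
ptrdiff_t model_search(const char *subject, const char *needle,
                       size_t start_offset, size_t offset_limit)
{
    size_t slen = strlen(subject);
    size_t nlen = strlen(needle);

    for (size_t pos = start_offset; pos <= slen && pos <= offset_limit; pos++) {
        if (nlen <= slen - pos && memcmp(subject + pos, needle, nlen) == 0)
            return (ptrdiff_t)pos;
    }
    return MODEL_NOMATCH;
}
```

With subject "123abc" and needle "abc", a limit of 2 gives no match while a limit of 3 finds the match at offset 3. A start offset greater than the limit never enters the loop, which mirrors the last sentence of the paragraph above.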
+
+ When using this facility, you must set the PCRE2_USE_OFFSET_LIMIT
+ option when calling pcre2_compile() so that when JIT is in use,
+ different code can be compiled. If a match is started with a
+ non-default offset limit when PCRE2_USE_OFFSET_LIMIT is not set, an
+ error is generated.
+
+ The offset limit facility can be used to track progress when searching
+ large subject strings or to limit the extent of global substitutions.
+ See also the PCRE2_FIRSTLINE option, which requires a match to start
+ before or at the first newline that follows the start of matching in
+ the subject. If this is set with an offset limit, a match must occur in
+ the first line and also within the offset limit. In other words, which-
+ ever limit comes first is used.
+
+ int pcre2_set_heap_limit(pcre2_match_context *mcontext,
uint32_t value);
- The match_limit parameter provides a means of preventing PCRE2 from
- using up too many resources when processing patterns that are not going
- to match, but which have a very large number of possibilities in their
- search trees. The classic example is a pattern that uses nested unlim-
- ited repeats.
-
- Internally, pcre2_match() uses a function called match(), which it
- calls repeatedly (sometimes recursively). The limit set by match_limit
- is imposed on the number of times this function is called during a
- match, which has the effect of limiting the amount of backtracking that
- can take place. For patterns that are not anchored, the count restarts
- from zero for each position in the subject string. This limit is not
- relevant to pcre2_dfa_match(), which ignores it.
-
- When pcre2_match() is called with a pattern that was successfully pro-
- cessed by pcre2_jit_compile(), the way in which matching is executed is
- entirely different. However, there is still the possibility of runaway
- matching that goes on for a very long time, and so the match_limit
- value is also used in this case (but in a different way) to limit how
- long the matching can continue.
+ The heap_limit parameter specifies, in units of kilobytes, the maximum
+ amount of heap memory that pcre2_match() may use to hold backtracking
+ information when running an interpretive match. This limit does not
+ apply to matching with the JIT optimization, which has its own memory
+ control arrangements (see the pcre2jit documentation for more details),
+ nor does it apply to pcre2_dfa_match(). If the limit is reached, the
+ negative error code PCRE2_ERROR_HEAPLIMIT is returned. The default
+ limit is set when PCRE2 is built; the default default is very large and
+ is essentially "unlimited".
- The default value for the limit can be set when PCRE2 is built; the
- default default is 10 million, which handles all but the most extreme
- cases. If the limit is exceeded, pcre2_match() returns
- PCRE2_ERROR_MATCHLIMIT. A value for the match limit may also be sup-
- plied by an item at the start of a pattern of the form
+ A value for the heap limit may also be supplied by an item at the start
+ of a pattern of the form
- (*LIMIT_MATCH=ddd)
+ (*LIMIT_HEAP=ddd)
where ddd is a decimal number. However, such a setting is ignored
unless ddd is less than the limit set by the caller of pcre2_match()
or, if no such limit is set, less than the default.
- int pcre2_set_recursion_limit(pcre2_match_context *mcontext,
- uint32_t value);
+ The pcre2_match() function starts out using a 20K vector on the system
+ stack for recording backtracking points. The more nested backtracking
+ points there are (that is, the deeper the search tree), the more memory
+ is needed. Heap memory is used only if the initial vector is too
+ small. If the heap limit is set to a value less than 21 (in particular,
+ zero) no heap memory will be used. In this case, only patterns that do
+ not have a lot of nested backtracking can be successfully processed.
- The recursion_limit parameter is similar to match_limit, but instead of
- limiting the total number of times that match() is called, it limits
- the depth of recursion. The recursion depth is a smaller number than
- the total number of calls, because not all calls to match() are recur-
- sive. This limit is of use only if it is set smaller than match_limit.
+ int pcre2_set_match_limit(pcre2_match_context *mcontext,
+ uint32_t value);
- Limiting the recursion depth limits the amount of system stack that can
- be used, or, when PCRE2 has been compiled to use memory on the heap
- instead of the stack, the amount of heap memory that can be used. This
- limit is not relevant, and is ignored, when matching is done using JIT
- compiled code or by the pcre2_dfa_match() function.
+ The match_limit parameter provides a means of preventing PCRE2 from
+ using up too many computing resources when processing patterns that are
+ not going to match, but which have a very large number of possibilities
+ in their search trees. The classic example is a pattern that uses
+ nested unlimited repeats.
+
+ There is an internal counter in pcre2_match() that is incremented each
+ time round its main matching loop. If this value reaches the match
+ limit, pcre2_match() returns the negative value PCRE2_ERROR_MATCHLIMIT.
+ This has the effect of limiting the amount of backtracking that can
+ take place. For patterns that are not anchored, the count restarts from
+ zero for each position in the subject string. This limit also applies
+ to pcre2_dfa_match(), though the counting is done in a different way.
+
+ When pcre2_match() is called with a pattern that was successfully pro-
+ cessed by pcre2_jit_compile(), the way in which matching is executed is
+ entirely different. However, there is still the possibility of runaway
+ matching that goes on for a very long time, and so the match_limit
+ value is also used in this case (but in a different way) to limit how
+ long the matching can continue.
- The default value for recursion_limit can be set when PCRE2 is built;
- the default default is the same value as the default for match_limit.
- If the limit is exceeded, pcre2_match() returns PCRE2_ERROR_RECURSION-
- LIMIT. A value for the recursion limit may also be supplied by an item
- at the start of a pattern of the form
+ The default value for the limit can be set when PCRE2 is built; the
+ default default is 10 million, which handles all but the most extreme
+ cases. A value for the match limit may also be supplied by an item at
+ the start of a pattern of the form
- (*LIMIT_RECURSION=ddd)
+ (*LIMIT_MATCH=ddd)
where ddd is a decimal number. However, such a setting is ignored
- unless ddd is less than the limit set by the caller of pcre2_match()
- or, if no such limit is set, less than the default.
+ unless ddd is less than the limit set by the caller of pcre2_match() or
+ pcre2_dfa_match() or, if no such limit is set, less than the default.
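
The interaction between a limit item in the pattern and the limit set by the caller (or the build-time default) amounts to "the pattern can only lower the limit". A minimal sketch of that rule, using the hypothetical name effective_limit; it describes the documented behaviour and is not PCRE2's internal code:

```c
#include <stdint.h>

/* caller_or_default: the limit set in the match context, or the
   build-time default if the caller set none.
   has_pattern_item:  non-zero if the pattern starts with (*LIMIT_...=ddd).
   ddd:               the value given in the pattern item. */
uint32_t effective_limit(uint32_t caller_or_default,
                         int has_pattern_item, uint32_t ddd)
{
    if (has_pattern_item && ddd < caller_or_default)
        return ddd;               /* the pattern may lower the limit */
    return caller_or_default;     /* otherwise the setting is ignored */
}
```

The same rule is stated for the heap, match, and depth limits, so one helper covers all three in this model.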
- int pcre2_set_recursion_memory_management(
- pcre2_match_context *mcontext,
- void *(*private_malloc)(PCRE2_SIZE, void *),
- void (*private_free)(void *, void *), void *memory_data);
+ int pcre2_set_depth_limit(pcre2_match_context *mcontext,
+ uint32_t value);
- This function sets up two additional custom memory management functions
- for use by pcre2_match() when PCRE2 is compiled to use the heap for
- remembering backtracking data, instead of recursive function calls that
- use the system stack. There is a discussion about PCRE2's stack usage
- in the pcre2stack documentation. See the pcre2build documentation for
- details of how to build PCRE2.
-
- Using the heap for recursion is a non-standard way of building PCRE2,
- for use in environments that have limited stacks. Because of the
- greater use of memory management, pcre2_match() runs more slowly. Func-
- tions that are different to the general custom memory functions are
- provided so that special-purpose external code can be used for this
- case, because the memory blocks are all the same size. The blocks are
- retained by pcre2_match() until it is about to exit so that they can be
- re-used when possible during the match. In the absence of these func-
- tions, the normal custom memory management functions are used, if sup-
- plied, otherwise the system functions.
+ This parameter limits the depth of nested backtracking in
+ pcre2_match(). Each time a nested backtracking point is passed, a new
+ memory "frame" is used to remember the state of matching at that point.
+ Thus, this parameter indirectly limits the amount of memory that is
+ used in a match. However, because the size of each memory "frame"
+ depends on the number of capturing parentheses, the actual memory limit
+ varies from pattern to pattern. This limit was more useful in versions
+ before 10.30, where function recursion was used for backtracking.
+
+ The depth limit is not relevant, and is ignored, when matching is done
+ using JIT compiled code. However, it is supported by pcre2_dfa_match(),
+ which uses it to limit the depth of internal recursive function calls
+ that implement atomic groups, lookaround assertions, and pattern recur-
+ sions. This is, therefore, an indirect limit on the amount of system
+ stack that is used. A recursive pattern such as /(.)(?1)/, when matched
+ to a very long string using pcre2_dfa_match(), can use a great deal of
+ stack.
+
+ The default value for the depth limit can be set when PCRE2 is built;
+ the default default is the same value as the default for the match
+ limit. If the limit is exceeded, pcre2_match() or pcre2_dfa_match()
+ returns PCRE2_ERROR_DEPTHLIMIT. A value for the depth limit may also be
+ supplied by an item at the start of a pattern of the form
+
+ (*LIMIT_DEPTH=ddd)
+
+ where ddd is a decimal number. However, such a setting is ignored
+ unless ddd is less than the limit set by the caller of pcre2_match() or
+ pcre2_dfa_match() or, if no such limit is set, less than the default.
CHECKING BUILD-TIME OPTIONS
@@ -987,6 +1082,26 @@ CHECKING BUILD-TIME OPTIONS
sequence; a value of PCRE2_BSR_ANYCRLF means that \R matches only CR,
LF, or CRLF. The default can be overridden when a pattern is compiled.
+ PCRE2_CONFIG_COMPILED_WIDTHS
+
+ The output is a uint32_t integer whose lower bits indicate which code
+ unit widths were selected when PCRE2 was built. The 1-bit indicates
+ 8-bit support, and the 2-bit and 4-bit indicate 16-bit and 32-bit sup-
+ port, respectively.
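
A short sketch of decoding that bit mask follows. The struct and function names (width_support, decode_widths) are hypothetical; only the bit assignments come from the description above.

```c
#include <stdint.h>

/* Hypothetical result type: one flag per code unit width. */
typedef struct {
    int has8;
    int has16;
    int has32;
} width_support;

/* Decode the value reported for PCRE2_CONFIG_COMPILED_WIDTHS:
   the 1-bit means 8-bit support, the 2-bit 16-bit, the 4-bit 32-bit. */
width_support decode_widths(uint32_t widths)
{
    width_support w;
    w.has8  = (widths & 1u) != 0;
    w.has16 = (widths & 2u) != 0;
    w.has32 = (widths & 4u) != 0;
    return w;
}
```

The input would normally be the uint32_t filled in by pcre2_config() when called with PCRE2_CONFIG_COMPILED_WIDTHS.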
+
+ PCRE2_CONFIG_DEPTHLIMIT
+
+ The output is a uint32_t integer that gives the default limit for the
+ depth of nested backtracking in pcre2_match() or the depth of nested
+ recursions and lookarounds in pcre2_dfa_match(). Further details are
+ given with pcre2_set_depth_limit() above.
+
+ PCRE2_CONFIG_HEAPLIMIT
+
+ The output is a uint32_t integer that gives, in kilobytes, the default
+ limit for the amount of heap memory used by pcre2_match(). Further
+ details are given with pcre2_set_heap_limit() above.
+
PCRE2_CONFIG_JIT
The output is a uint32_t integer that is set to one if support for
@@ -1021,9 +1136,9 @@ CHECKING BUILD-TIME OPTIONS
PCRE2_CONFIG_MATCHLIMIT
- The output is a uint32_t integer that gives the default limit for the
- number of internal matching function calls in a pcre2_match() execu-
- tion. Further details are given with pcre2_match() below.
+ The output is a uint32_t integer that gives the default match limit for
+ pcre2_match(). Further details are given with pcre2_set_match_limit()
+ above.
PCRE2_CONFIG_NEWLINE
@@ -1036,10 +1151,17 @@ CHECKING BUILD-TIME OPTIONS
PCRE2_NEWLINE_CRLF Carriage return, linefeed (CRLF)
PCRE2_NEWLINE_ANY Any Unicode line ending
PCRE2_NEWLINE_ANYCRLF Any of CR, LF, or CRLF
+ PCRE2_NEWLINE_NUL The NUL character (binary zero)
The default should normally correspond to the standard sequence for
your operating system.
+ PCRE2_CONFIG_NEVER_BACKSLASH_C
+
+ The output is a uint32_t integer that is set to one if the use of \C
+ was permanently disabled when PCRE2 was built; otherwise it is set to
+ zero.
+
PCRE2_CONFIG_PARENSLIMIT
The output is a uint32_t integer that gives the maximum depth of nest-
@@ -1050,43 +1172,32 @@ CHECKING BUILD-TIME OPTIONS
application. For finer control over compilation stack usage, see
pcre2_set_compile_recursion_guard().
- PCRE2_CONFIG_RECURSIONLIMIT
-
- The output is a uint32_t integer that gives the default limit for the
- depth of recursion when calling the internal matching function in a
- pcre2_match() execution. Further details are given with pcre2_match()
- below.
-
PCRE2_CONFIG_STACKRECURSE
- The output is a uint32_t integer that is set to one if internal recur-
- sion when running pcre2_match() is implemented by recursive function
- calls that use the system stack to remember their state. This is the
- usual way that PCRE2 is compiled. The output is zero if PCRE2 was com-
- piled to use blocks of data on the heap instead of recursive function
- calls.
+ This parameter is obsolete and should not be used in new code. The out-
+ put is a uint32_t integer that is always set to zero.
PCRE2_CONFIG_UNICODE_VERSION
- The where argument should point to a buffer that is at least 24 code
- units long. (The exact length required can be found by calling
- pcre2_config() with where set to NULL.) If PCRE2 has been compiled
- without Unicode support, the buffer is filled with the text "Unicode
- not supported". Otherwise, the Unicode version string (for example,
- "8.0.0") is inserted. The number of code units used is returned. This
+ The where argument should point to a buffer that is at least 24 code
+ units long. (The exact length required can be found by calling
+ pcre2_config() with where set to NULL.) If PCRE2 has been compiled
+ without Unicode support, the buffer is filled with the text "Unicode
+ not supported". Otherwise, the Unicode version string (for example,
+ "8.0.0") is inserted. The number of code units used is returned. This
is the length of the string plus one unit for the terminating zero.
PCRE2_CONFIG_UNICODE
- The output is a uint32_t integer that is set to one if Unicode support
- is available; otherwise it is set to zero. Unicode support implies UTF
+ The output is a uint32_t integer that is set to one if Unicode support
+ is available; otherwise it is set to zero. Unicode support implies UTF
support.
PCRE2_CONFIG_VERSION
- The where argument should point to a buffer that is at least 12 code
- units long. (The exact length required can be found by calling
- pcre2_config() with where set to NULL.) The buffer is filled with the
+ The where argument should point to a buffer that is at least 24 code
+ units long. (The exact length required can be found by calling
+ pcre2_config() with where set to NULL.) The buffer is filled with the
PCRE2 version string, zero-terminated. The number of code units used is
returned. This is the length of the string plus one unit for the termi-
nating zero.
@@ -1102,28 +1213,41 @@ COMPILING A PATTERN
pcre2_code *pcre2_code_copy(const pcre2_code *code);
- The pcre2_compile() function compiles a pattern into an internal form.
- The pattern is defined by a pointer to a string of code units and a
- length. If the pattern is zero-terminated, the length can be specified
- as PCRE2_ZERO_TERMINATED. The function returns a pointer to a block of
- memory that contains the compiled pattern and related data, or NULL if
- an error occurred.
-
- If the compile context argument ccontext is NULL, memory for the com-
- piled pattern is obtained by calling malloc(). Otherwise, it is
- obtained from the same memory function that was used for the compile
- context. The caller must free the memory by calling pcre2_code_free()
+ pcre2_code *pcre2_code_copy_with_tables(const pcre2_code *code);
+
+ The pcre2_compile() function compiles a pattern into an internal form.
+ The pattern is defined by a pointer to a string of code units and a
+ length (in code units). If the pattern is zero-terminated, the length
+ can be specified as PCRE2_ZERO_TERMINATED. The function returns a
+ pointer to a block of memory that contains the compiled pattern and
+ related data, or NULL if an error occurred.
+
+ If the compile context argument ccontext is NULL, memory for the com-
+ piled pattern is obtained by calling malloc(). Otherwise, it is
+ obtained from the same memory function that was used for the compile
+ context. The caller must free the memory by calling pcre2_code_free()
when it is no longer needed.
The function pcre2_code_copy() makes a copy of the compiled code in new
- memory, using the same memory allocator as was used for the original.
- However, if the code has been processed by the JIT compiler (see
- below), the JIT information cannot be copied (because it is position-
+ memory, using the same memory allocator as was used for the original.
+ However, if the code has been processed by the JIT compiler (see
+ below), the JIT information cannot be copied (because it is position-
dependent). The new copy can initially be used only for non-JIT match-
- ing, though it can be passed to pcre2_jit_compile() if required. The
- pcre2_code_copy() function provides a way for individual threads in a
- multithreaded application to acquire a private copy of shared compiled
- code.
+ ing, though it can be passed to pcre2_jit_compile() if required.
+
+ The pcre2_code_copy() function provides a way for individual threads in
+ a multithreaded application to acquire a private copy of shared com-
+ piled code. However, it does not make a copy of the character tables
+ used by the compiled pattern; the new pattern code points to the same
+ tables as the original code. (See "Locale Support" below for details
+ of these character tables.) In many applications the same tables are
+ used throughout, so this behaviour is appropriate. Nevertheless, there
+ are occasions when a copy of a compiled pattern and the relevant tables
+ are needed. The pcre2_code_copy_with_tables() function provides this facility.
+ Copies of both the code and the tables are made, with the new code
+ pointing to the new tables. The memory for the new tables is automati-
+ cally freed when pcre2_code_free() is called for the new copy of the
+ compiled code.
NOTE: When one of the matching functions is called, pointers to the
compiled pattern and the subject string are set in the match data block
@@ -1141,33 +1265,46 @@ COMPILING A PATTERN
For those options that can be different in different parts of the pat-
tern, the contents of the options argument specifies their settings at
- the start of compilation. The PCRE2_ANCHORED and PCRE2_NO_UTF_CHECK
- options can be set at the time of matching as well as at compile time.
+ the start of compilation. The PCRE2_ANCHORED, PCRE2_ENDANCHORED, and
+ PCRE2_NO_UTF_CHECK options can be set at the time of matching as well
+ as at compile time.
- Other, less frequently required compile-time parameters (for example,
+ Other, less frequently required compile-time parameters (for example,
the newline setting) can be provided in a compile context (as described
above).
If errorcode or erroroffset is NULL, pcre2_compile() returns NULL imme-
- diately. Otherwise, the variables to which these point are set to an
- error code and an offset (number of code units) within the pattern,
- respectively, when pcre2_compile() returns NULL because a compilation
+ diately. Otherwise, the variables to which these point are set to an
+ error code and an offset (number of code units) within the pattern,
+ respectively, when pcre2_compile() returns NULL because a compilation
error has occurred. The values are not defined when compilation is suc-
cessful and pcre2_compile() returns a non-NULL value.
- The pcre2_get_error_message() function (see "Obtaining a textual error
- message" below) provides a textual message for each error code. Compi-
- lation errors have positive error codes; UTF formatting error codes are
- negative. For an invalid UTF-8 or UTF-16 string, the offset is that of
- the first code unit of the failing character.
-
- Some errors are not detected until the whole pattern has been scanned;
- in these cases, the offset passed back is the length of the pattern.
- Note that the offset is in code units, not characters, even in a UTF
+ There are nearly 100 positive error codes that pcre2_compile() may
+ return if it finds an error in the pattern. There are also some nega-
+ tive error codes that are used for invalid UTF strings. These are the
+ same as given by pcre2_match() and pcre2_dfa_match(), and are described
+ in the pcre2unicode page. There is no separate documentation for the
+ positive error codes, because the textual error messages that are
+ obtained by calling the pcre2_get_error_message() function (see
+ "Obtaining a textual error message" below) should be self-explanatory.
+ Macro names starting with PCRE2_ERROR_ are defined for both positive
+ and negative error codes in pcre2.h.
+
+ The value returned in erroroffset is an indication of where in the pat-
+ tern the error occurred. It is not necessarily the furthest point in
+ the pattern that was read. For example, after the error "lookbehind
+ assertion is not fixed length", the error offset points to the start of
+ the failing assertion. For an invalid UTF-8 or UTF-16 string, the off-
+ set is that of the first code unit of the failing character.
+
+ Some errors are not detected until the whole pattern has been scanned;
+ in these cases, the offset passed back is the length of the pattern.
+ Note that the offset is in code units, not characters, even in a UTF
mode. It may sometimes point into the middle of a UTF-8 or UTF-16 char-
acter.
- This code fragment shows a typical straightforward call to pcre2_com-
+ This code fragment shows a typical straightforward call to pcre2_com-
pile():
pcre2_code *re;
@@ -1181,77 +1318,86 @@ COMPILING A PATTERN
&erroffset, /* for error offset */
NULL); /* no compile context */
- The following names for option bits are defined in the pcre2.h header
+ The following names for option bits are defined in the pcre2.h header
file:
PCRE2_ANCHORED
If this bit is set, the pattern is forced to be "anchored", that is, it
- is constrained to match only at the first matching point in the string
- that is being searched (the "subject string"). This effect can also be
- achieved by appropriate constructs in the pattern itself, which is the
+ is constrained to match only at the first matching point in the string
+ that is being searched (the "subject string"). This effect can also be
+ achieved by appropriate constructs in the pattern itself, which is the
only way to do it in Perl.
PCRE2_ALLOW_EMPTY_CLASS
- By default, for compatibility with Perl, a closing square bracket that
- immediately follows an opening one is treated as a data character for
- the class. When PCRE2_ALLOW_EMPTY_CLASS is set, it terminates the
+ By default, for compatibility with Perl, a closing square bracket that
+ immediately follows an opening one is treated as a data character for
+ the class. When PCRE2_ALLOW_EMPTY_CLASS is set, it terminates the
class, which therefore contains no characters and so can never match.
PCRE2_ALT_BSUX
- This option request alternative handling of three escape sequences,
- which makes PCRE2's behaviour more like ECMAscript (aka JavaScript).
+ This option requests alternative handling of three escape sequences,
+ which makes PCRE2's behaviour more like ECMAscript (aka JavaScript).
When it is set:
(1) \U matches an upper case "U" character; by default \U causes a com-
pile time error (Perl uses \U to upper case subsequent characters).
(2) \u matches a lower case "u" character unless it is followed by four
- hexadecimal digits, in which case the hexadecimal number defines the
- code point to match. By default, \u causes a compile time error (Perl
+ hexadecimal digits, in which case the hexadecimal number defines the
+ code point to match. By default, \u causes a compile time error (Perl
uses it to upper case the following character).
- (3) \x matches a lower case "x" character unless it is followed by two
- hexadecimal digits, in which case the hexadecimal number defines the
- code point to match. By default, as in Perl, a hexadecimal number is
+ (3) \x matches a lower case "x" character unless it is followed by two
+ hexadecimal digits, in which case the hexadecimal number defines the
+ code point to match. By default, as in Perl, a hexadecimal number is
always expected after \x, but it may have zero, one, or two digits (so,
for example, \xz matches a binary zero character followed by z).
PCRE2_ALT_CIRCUMFLEX
In multiline mode (when PCRE2_MULTILINE is set), the circumflex
- metacharacter matches at the start of the subject (unless PCRE2_NOTBOL
- is set), and also after any internal newline. However, it does not
+ metacharacter matches at the start of the subject (unless PCRE2_NOTBOL
+ is set), and also after any internal newline. However, it does not
match after a newline at the end of the subject, for compatibility with
- Perl. If you want a multiline circumflex also to match after a termi-
+ Perl. If you want a multiline circumflex also to match after a termi-
nating newline, you must set PCRE2_ALT_CIRCUMFLEX.
PCRE2_ALT_VERBNAMES
- By default, for compatibility with Perl, the name in any verb sequence
- such as (*MARK:NAME) is any sequence of characters that does not
- include a closing parenthesis. The name is not processed in any way,
- and it is not possible to include a closing parenthesis in the name.
- However, if the PCRE2_ALT_VERBNAMES option is set, normal backslash
- processing is applied to verb names and only an unescaped closing
- parenthesis terminates the name. A closing parenthesis can be included
- in a name either as \) or between \Q and \E. If the PCRE2_EXTENDED
- option is set, unescaped whitespace in verb names is skipped and #-com-
- ments are recognized, exactly as in the rest of the pattern.
+ By default, for compatibility with Perl, the name in any verb sequence
+ such as (*MARK:NAME) is any sequence of characters that does not
+ include a closing parenthesis. The name is not processed in any way,
+ and it is not possible to include a closing parenthesis in the name.
+ However, if the PCRE2_ALT_VERBNAMES option is set, normal backslash
+ processing is applied to verb names and only an unescaped closing
+ parenthesis terminates the name. A closing parenthesis can be included
+ in a name either as \) or between \Q and \E. If the PCRE2_EXTENDED or
+ PCRE2_EXTENDED_MORE option is set, unescaped whitespace in verb names
+ is skipped and #-comments are recognized in this mode, exactly as in
+ the rest of the pattern.
PCRE2_AUTO_CALLOUT
If this bit is set, pcre2_compile() automatically inserts callout
- items, all with number 255, before each pattern item. For discussion of
- the callout facility, see the pcre2callout documentation.
+ items, all with number 255, before each pattern item, except immedi-
+ ately before or after an explicit callout in the pattern. For discus-
+ sion of the callout facility, see the pcre2callout documentation.
PCRE2_CASELESS
- If this bit is set, letters in the pattern match both upper and lower
- case letters in the subject. It is equivalent to Perl's /i option, and
- it can be changed within a pattern by a (?i) option setting.
+ If this bit is set, letters in the pattern match both upper and lower
+ case letters in the subject. It is equivalent to Perl's /i option, and
+ it can be changed within a pattern by a (?i) option setting. If
+ PCRE2_UTF is set, Unicode properties are used for all characters with
+ more than one other case, and for all characters whose code points are
+ greater than U+007f. For lower valued characters with only one other
+ case, a lookup table is used for speed. When PCRE2_UTF is not set, a
+ lookup table is used for all code points less than 256, and higher code
+ points (available only in 16-bit or 32-bit mode) are treated as not
+ having another case.
PCRE2_DOLLAR_ENDONLY
@@ -1281,178 +1427,229 @@ COMPILING A PATTERN
matched. There are more details of named subpatterns below; see also
the pcre2pattern documentation.
+ PCRE2_ENDANCHORED
+
+ If this bit is set, the end of any pattern match must be right at the
+ end of the string being searched (the "subject string"). If the pattern
+ match succeeds by reaching (*ACCEPT), but does not reach the end of the
+ subject, the match fails at the current starting point. For unanchored
+ patterns, a new match is then tried at the next starting point. How-
+ ever, if the match succeeds by reaching the end of the pattern, but not
+ the end of the subject, backtracking occurs and an alternative match
+ may be found. Consider these two patterns:
+
+ .(*ACCEPT)|..
+ .|..
+
+ If matched against "abc" with PCRE2_ENDANCHORED set, the first matches
+ "c" whereas the second matches "bc". The effect of PCRE2_ENDANCHORED
+ can also be achieved by appropriate constructs in the pattern itself,
+ which is the only way to do it in Perl.
+
+ For DFA matching with pcre2_dfa_match(), PCRE2_ENDANCHORED applies only
+ to the first (that is, the longest) matched string. Other parallel
+ matches, which are necessarily substrings of the first one, must obvi-
+ ously end before the end of the subject.
+
PCRE2_EXTENDED
- If this bit is set, most white space characters in the pattern are
- totally ignored except when escaped or inside a character class. How-
- ever, white space is not allowed within sequences such as (?> that
+ If this bit is set, most white space characters in the pattern are
+ totally ignored except when escaped or inside a character class. How-
+ ever, white space is not allowed within sequences such as (?> that
introduce various parenthesized subpatterns, nor within numerical quan-
- tifiers such as {1,3}. Ignorable white space is permitted between an
- item and a following quantifier and between a quantifier and a follow-
+ tifiers such as {1,3}. Ignorable white space is permitted between an
+ item and a following quantifier and between a quantifier and a follow-
ing + that indicates possessiveness.
- PCRE2_EXTENDED also causes characters between an unescaped # outside a
- character class and the next newline, inclusive, to be ignored, which
+ PCRE2_EXTENDED also causes characters between an unescaped # outside a
+ character class and the next newline, inclusive, to be ignored, which
makes it possible to include comments inside complicated patterns. Note
- that the end of this type of comment is a literal newline sequence in
+ that the end of this type of comment is a literal newline sequence in
the pattern; escape sequences that happen to represent a newline do not
- count. PCRE2_EXTENDED is equivalent to Perl's /x option, and it can be
+ count. PCRE2_EXTENDED is equivalent to Perl's /x option, and it can be
changed within a pattern by a (?x) option setting.
Which characters are interpreted as newlines can be specified by a set-
- ting in the compile context that is passed to pcre2_compile() or by a
- special sequence at the start of the pattern, as described in the sec-
- tion entitled "Newline conventions" in the pcre2pattern documentation.
+ ting in the compile context that is passed to pcre2_compile() or by a
+ special sequence at the start of the pattern, as described in the sec-
+ tion entitled "Newline conventions" in the pcre2pattern documentation.
A default is defined when PCRE2 is built.
+ PCRE2_EXTENDED_MORE
+
+ This option has the effect of PCRE2_EXTENDED, but, in addition,
+ unescaped space and horizontal tab characters are ignored inside a
+ character class. PCRE2_EXTENDED_MORE is equivalent to Perl 5.26's /xx
+ option, and it can be changed within a pattern by a (?xx) option set-
+ ting.
+
PCRE2_FIRSTLINE
- If this option is set, an unanchored pattern is required to match
- before or at the first newline in the subject string, though the
- matched text may continue over the newline. See also PCRE2_USE_OFF-
- SET_LIMIT, which provides a more general limiting facility. If
- PCRE2_FIRSTLINE is set with an offset limit, a match must occur in the
- first line and also within the offset limit. In other words, whichever
- limit comes first is used.
+ If this option is set, the start of an unanchored pattern match must be
+ before or at the first newline in the subject string following the
+ start of matching, though the matched text may continue over the new-
+ line. If startoffset is non-zero, the limiting newline is not necessar-
+ ily the first newline in the subject. For example, if the subject
+ string is "abc\nxyz" (where \n represents a single-character newline) a
+ pattern match for "yz" succeeds with PCRE2_FIRSTLINE if startoffset is
+ greater than 3. See also PCRE2_USE_OFFSET_LIMIT, which provides a more
+ general limiting facility. If PCRE2_FIRSTLINE is set with an offset
+ limit, a match must occur in the first line and also within the offset
+ limit. In other words, whichever limit comes first is used.
+
+ PCRE2_LITERAL
+
+ If this option is set, all meta-characters in the pattern are disabled,
+ and it is treated as a literal string. Matching literal strings with a
+ regular expression engine is not the most efficient way of doing it. If
+ you are doing a lot of literal matching and are worried about effi-
+ ciency, you should consider using other approaches. The only other main
+ options that are allowed with PCRE2_LITERAL are: PCRE2_ANCHORED,
+ PCRE2_ENDANCHORED, PCRE2_AUTO_CALLOUT, PCRE2_CASELESS, PCRE2_FIRSTLINE,
+ PCRE2_NO_START_OPTIMIZE, PCRE2_NO_UTF_CHECK, PCRE2_UTF, and
+ PCRE2_USE_OFFSET_LIMIT. The extra options PCRE2_EXTRA_MATCH_LINE and
+ PCRE2_EXTRA_MATCH_WORD are also supported. Any other options cause an
+ error.
PCRE2_MATCH_UNSET_BACKREF
- If this option is set, a back reference to an unset subpattern group
- matches an empty string (by default this causes the current matching
- alternative to fail). A pattern such as (\1)(a) succeeds when this
- option is set (assuming it can find an "a" in the subject), whereas it
- fails by default, for Perl compatibility. Setting this option makes
+ If this option is set, a back reference to an unset subpattern group
+ matches an empty string (by default this causes the current matching
+ alternative to fail). A pattern such as (\1)(a) succeeds when this
+ option is set (assuming it can find an "a" in the subject), whereas it
+ fails by default, for Perl compatibility. Setting this option makes
PCRE2 behave more like ECMAscript (aka JavaScript).
PCRE2_MULTILINE
- By default, for the purposes of matching "start of line" and "end of
- line", PCRE2 treats the subject string as consisting of a single line
- of characters, even if it actually contains newlines. The "start of
- line" metacharacter (^) matches only at the start of the string, and
- the "end of line" metacharacter ($) matches only at the end of the
+ By default, for the purposes of matching "start of line" and "end of
+ line", PCRE2 treats the subject string as consisting of a single line
+ of characters, even if it actually contains newlines. The "start of
+ line" metacharacter (^) matches only at the start of the string, and
+ the "end of line" metacharacter ($) matches only at the end of the
string, or before a terminating newline (except when PCRE2_DOL-
- LAR_ENDONLY is set). Note, however, that unless PCRE2_DOTALL is set,
+ LAR_ENDONLY is set). Note, however, that unless PCRE2_DOTALL is set,
the "any character" metacharacter (.) does not match at a newline. This
behaviour (for ^, $, and dot) is the same as Perl.
- When PCRE2_MULTILINE it is set, the "start of line" and "end of line"
- constructs match immediately following or immediately before internal
- newlines in the subject string, respectively, as well as at the very
- start and end. This is equivalent to Perl's /m option, and it can be
+ When PCRE2_MULTILINE is set, the "start of line" and "end of line"
+ constructs match immediately following or immediately before internal
+ newlines in the subject string, respectively, as well as at the very
+ start and end. This is equivalent to Perl's /m option, and it can be
changed within a pattern by a (?m) option setting. Note that the "start
of line" metacharacter does not match after a newline at the end of the
- subject, for compatibility with Perl. However, you can change this by
- setting the PCRE2_ALT_CIRCUMFLEX option. If there are no newlines in a
- subject string, or no occurrences of ^ or $ in a pattern, setting
+ subject, for compatibility with Perl. However, you can change this by
+ setting the PCRE2_ALT_CIRCUMFLEX option. If there are no newlines in a
+ subject string, or no occurrences of ^ or $ in a pattern, setting
PCRE2_MULTILINE has no effect.
PCRE2_NEVER_BACKSLASH_C
- This option locks out the use of \C in the pattern that is being com-
- piled. This escape can cause unpredictable behaviour in UTF-8 or
- UTF-16 modes, because it may leave the current matching point in the
- middle of a multi-code-unit character. This option may be useful in
- applications that process patterns from external sources. Note that
+ This option locks out the use of \C in the pattern that is being com-
+ piled. This escape can cause unpredictable behaviour in UTF-8 or
+ UTF-16 modes, because it may leave the current matching point in the
+ middle of a multi-code-unit character. This option may be useful in
+ applications that process patterns from external sources. Note that
there is also a build-time option that permanently locks out the use of
\C.
PCRE2_NEVER_UCP
- This option locks out the use of Unicode properties for handling \B,
+ This option locks out the use of Unicode properties for handling \B,
\b, \D, \d, \S, \s, \W, \w, and some of the POSIX character classes, as
- described for the PCRE2_UCP option below. In particular, it prevents
- the creator of the pattern from enabling this facility by starting the
- pattern with (*UCP). This option may be useful in applications that
+ described for the PCRE2_UCP option below. In particular, it prevents
+ the creator of the pattern from enabling this facility by starting the
+ pattern with (*UCP). This option may be useful in applications that
process patterns from external sources. The option combination PCRE2_UCP
and PCRE2_NEVER_UCP causes an error.
PCRE2_NEVER_UTF
- This option locks out interpretation of the pattern as UTF-8, UTF-16,
+ This option locks out interpretation of the pattern as UTF-8, UTF-16,
or UTF-32, depending on which library is in use. In particular, it pre-
- vents the creator of the pattern from switching to UTF interpretation
- by starting the pattern with (*UTF). This option may be useful in
- applications that process patterns from external sources. The combina-
+ vents the creator of the pattern from switching to UTF interpretation
+ by starting the pattern with (*UTF). This option may be useful in
+ applications that process patterns from external sources. The combina-
tion of PCRE2_UTF and PCRE2_NEVER_UTF causes an error.
PCRE2_NO_AUTO_CAPTURE
If this option is set, it disables the use of numbered capturing paren-
- theses in the pattern. Any opening parenthesis that is not followed by
- ? behaves as if it were followed by ?: but named parentheses can still
- be used for capturing (and they acquire numbers in the usual way).
- There is no equivalent of this option in Perl. Note that, if this
- option is set, references to capturing groups (back references or
- recursion/subroutine calls) may only refer to named groups, though the
- reference can be by name or by number.
+ theses in the pattern. Any opening parenthesis that is not followed by
+ ? behaves as if it were followed by ?: but named parentheses can still
+ be used for capturing (and they acquire numbers in the usual way). This
+ is the same as Perl's /n option. Note that, when this option is set,
+ references to capturing groups (back references or recursion/subroutine
+ calls) may only refer to named groups, though the reference can be by
+ name or by number.
PCRE2_NO_AUTO_POSSESS
If this option is set, it disables "auto-possessification", which is an
- optimization that, for example, turns a+b into a++b in order to avoid
- backtracks into a+ that can never be successful. However, if callouts
- are in use, auto-possessification means that some callouts are never
+ optimization that, for example, turns a+b into a++b in order to avoid
+ backtracks into a+ that can never be successful. However, if callouts
+ are in use, auto-possessification means that some callouts are never
taken. You can set this option if you want the matching functions to do
- a full unoptimized search and run all the callouts, but it is mainly
+ a full unoptimized search and run all the callouts, but it is mainly
provided for testing purposes.
PCRE2_NO_DOTSTAR_ANCHOR
If this option is set, it disables an optimization that is applied when
- .* is the first significant item in a top-level branch of a pattern,
- and all the other branches also start with .* or with \A or \G or ^.
- The optimization is automatically disabled for .* if it is inside an
- atomic group or a capturing group that is the subject of a back refer-
- ence, or if the pattern contains (*PRUNE) or (*SKIP). When the opti-
- mization is not disabled, such a pattern is automatically anchored if
+ .* is the first significant item in a top-level branch of a pattern,
+ and all the other branches also start with .* or with \A or \G or ^.
+ The optimization is automatically disabled for .* if it is inside an
+ atomic group or a capturing group that is the subject of a back refer-
+ ence, or if the pattern contains (*PRUNE) or (*SKIP). When the opti-
+ mization is not disabled, such a pattern is automatically anchored if
PCRE2_DOTALL is set for all the .* items and PCRE2_MULTILINE is not set
- for any ^ items. Otherwise, the fact that any match must start either
- at the start of the subject or following a newline is remembered. Like
+ for any ^ items. Otherwise, the fact that any match must start either
+ at the start of the subject or following a newline is remembered. Like
other optimizations, this can cause callouts to be skipped.
PCRE2_NO_START_OPTIMIZE
- This is an option whose main effect is at matching time. It does not
+ This is an option whose main effect is at matching time. It does not
change what pcre2_compile() generates, but it does affect the output of
the JIT compiler.
- There are a number of optimizations that may occur at the start of a
- match, in order to speed up the process. For example, if it is known
- that an unanchored match must start with a specific character, the
- matching code searches the subject for that character, and fails imme-
- diately if it cannot find it, without actually running the main match-
- ing function. This means that a special item such as (*COMMIT) at the
- start of a pattern is not considered until after a suitable starting
- point for the match has been found. Also, when callouts or (*MARK)
- items are in use, these "start-up" optimizations can cause them to be
- skipped if the pattern is never actually used. The start-up optimiza-
- tions are in effect a pre-scan of the subject that takes place before
+ There are a number of optimizations that may occur at the start of a
+ match, in order to speed up the process. For example, if it is known
+ that an unanchored match must start with a specific code unit value,
+ the matching code searches the subject for that value, and fails imme-
+ diately if it cannot find it, without actually running the main match-
+ ing function. This means that a special item such as (*COMMIT) at the
+ start of a pattern is not considered until after a suitable starting
+ point for the match has been found. Also, when callouts or (*MARK)
+ items are in use, these "start-up" optimizations can cause them to be
+ skipped if the pattern is never actually used. The start-up optimiza-
+ tions are in effect a pre-scan of the subject that takes place before
the pattern is run.
The PCRE2_NO_START_OPTIMIZE option disables the start-up optimizations,
- possibly causing performance to suffer, but ensuring that in cases
- where the result is "no match", the callouts do occur, and that items
+ possibly causing performance to suffer, but ensuring that in cases
+ where the result is "no match", the callouts do occur, and that items
such as (*COMMIT) and (*MARK) are considered at every possible starting
position in the subject string.
- Setting PCRE2_NO_START_OPTIMIZE may change the outcome of a matching
+ Setting PCRE2_NO_START_OPTIMIZE may change the outcome of a matching
operation. Consider the pattern
(*COMMIT)ABC
- When this is compiled, PCRE2 records the fact that a match must start
- with the character "A". Suppose the subject string is "DEFABC". The
- start-up optimization scans along the subject, finds "A" and runs the
- first match attempt from there. The (*COMMIT) item means that the pat-
- tern must match the current starting position, which in this case, it
- does. However, if the same match is run with PCRE2_NO_START_OPTIMIZE
- set, the initial scan along the subject string does not happen. The
- first match attempt is run starting from "D" and when this fails,
- (*COMMIT) prevents any further matches being tried, so the overall
- result is "no match". There are also other start-up optimizations. For
- example, a minimum length for the subject may be recorded. Consider the
- pattern
+ When this is compiled, PCRE2 records the fact that a match must start
+ with the character "A". Suppose the subject string is "DEFABC". The
+ start-up optimization scans along the subject, finds "A" and runs the
+ first match attempt from there. The (*COMMIT) item means that the pat-
+ tern must match the current starting position, which in this case, it
+ does. However, if the same match is run with PCRE2_NO_START_OPTIMIZE
+ set, the initial scan along the subject string does not happen. The
+ first match attempt is run starting from "D" and when this fails,
+ (*COMMIT) prevents any further matches being tried, so the overall
+ result is "no match".
+
+ There are also other start-up optimizations. For example, a minimum
+ length for the subject may be recorded. Consider the pattern
(*MARK:A)(X|Y)
@@ -1469,63 +1666,133 @@ COMPILING A PATTERN
When PCRE2_UTF is set, the validity of the pattern as a UTF string is
automatically checked. There are discussions about the validity of
UTF-8 strings, UTF-16 strings, and UTF-32 strings in the pcre2unicode
- document. If an invalid UTF sequence is found, pcre2_compile() returns
+ document. If an invalid UTF sequence is found, pcre2_compile() returns
a negative error code.
- If you know that your pattern is valid, and you want to skip this check
- for performance reasons, you can set the PCRE2_NO_UTF_CHECK option.
- When it is set, the effect of passing an invalid UTF string as a pat-
- tern is undefined. It may cause your program to crash or loop. Note
- that this option can also be passed to pcre2_match() and
- pcre_dfa_match(), to suppress validity checking of the subject string.
+ If you know that your pattern is a valid UTF string, and you want to
+ skip this check for performance reasons, you can set the
+ PCRE2_NO_UTF_CHECK option. When it is set, the effect of passing an
+ invalid UTF string as a pattern is undefined. It may cause your program
+ to crash or loop.
+
+ Note that this option can also be passed to pcre2_match() and
+ pcre2_dfa_match(), to suppress UTF validity checking of the subject
+ string.
+
+ Note also that setting PCRE2_NO_UTF_CHECK at compile time does not dis-
+ able the error that is given if an escape sequence for an invalid Uni-
+ code code point is encountered in the pattern. In particular, the so-
+ called "surrogate" code points (0xd800 to 0xdfff) are invalid. If you
+ want to allow escape sequences such as \x{d800} you can set the
+ PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES extra option, as described in the
+ section entitled "Extra compile options" below. However, this is pos-
+ sible only in UTF-8 and UTF-32 modes, because these values are not rep-
+ resentable in UTF-16.
PCRE2_UCP
This option changes the way PCRE2 processes \B, \b, \D, \d, \S, \s, \W,
- \w, and some of the POSIX character classes. By default, only ASCII
- characters are recognized, but if PCRE2_UCP is set, Unicode properties
- are used instead to classify characters. More details are given in the
+ \w, and some of the POSIX character classes. By default, only ASCII
+ characters are recognized, but if PCRE2_UCP is set, Unicode properties
+ are used instead to classify characters. More details are given in the
section on generic character types in the pcre2pattern page. If you set
- PCRE2_UCP, matching one of the items it affects takes much longer. The
- option is available only if PCRE2 has been compiled with Unicode sup-
- port.
+ PCRE2_UCP, matching one of the items it affects takes much longer. The
+ option is available only if PCRE2 has been compiled with Unicode sup-
+ port (which is the default).
PCRE2_UNGREEDY
- This option inverts the "greediness" of the quantifiers so that they
- are not greedy by default, but become greedy if followed by "?". It is
- not compatible with Perl. It can also be set by a (?U) option setting
+ This option inverts the "greediness" of the quantifiers so that they
+ are not greedy by default, but become greedy if followed by "?". It is
+ not compatible with Perl. It can also be set by a (?U) option setting
within the pattern.
PCRE2_USE_OFFSET_LIMIT
This option must be set for pcre2_compile() if pcre2_set_offset_limit()
- is going to be used to set a non-default offset limit in a match con-
- text for matches that use this pattern. An error is generated if an
- offset limit is set without this option. For more details, see the
- description of pcre2_set_offset_limit() in the section that describes
+ is going to be used to set a non-default offset limit in a match con-
+ text for matches that use this pattern. An error is generated if an
+ offset limit is set without this option. For more details, see the
+ description of pcre2_set_offset_limit() in the section that describes
match contexts. See also the PCRE2_FIRSTLINE option above.
PCRE2_UTF
- This option causes PCRE2 to regard both the pattern and the subject
- strings that are subsequently processed as strings of UTF characters
- instead of single-code-unit strings. It is available when PCRE2 is
- built to include Unicode support (which is the default). If Unicode
- support is not available, the use of this option provokes an error.
- Details of how this option changes the behaviour of PCRE2 are given in
+ This option causes PCRE2 to regard both the pattern and the subject
+ strings that are subsequently processed as strings of UTF characters
+ instead of single-code-unit strings. It is available when PCRE2 is
+ built to include Unicode support (which is the default). If Unicode
+ support is not available, the use of this option provokes an error.
+ Details of how PCRE2_UTF changes the behaviour of PCRE2 are given in
the pcre2unicode page.
-
-COMPILATION ERROR CODES
-
- There are over 80 positive error codes that pcre2_compile() may return
- (via errorcode) if it finds an error in the pattern. There are also
- some negative error codes that are used for invalid UTF strings. These
- are the same as given by pcre2_match() and pcre2_dfa_match(), and are
- described in the pcre2unicode page. The pcre2_get_error_message() func-
- tion (see "Obtaining a textual error message" below) can be called to
- obtain a textual error message from any error code.
+ Extra compile options
+
+ Unlike the main compile-time options, the extra options are not saved
+ with the compiled pattern. The option bits that can be set in a compile
+ context by calling the pcre2_set_compile_extra_options() function are
+ as follows:
+
+ PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES
+
+ This option applies when compiling a pattern in UTF-8 or UTF-32 mode.
+ It is forbidden in UTF-16 mode, and ignored in non-UTF modes. Unicode
+ "surrogate" code points in the range 0xd800 to 0xdfff are used in pairs
+ in UTF-16 to encode code points with values in the range 0x10000 to
+ 0x10ffff. The surrogates cannot therefore be represented in UTF-16.
+ They can be represented in UTF-8 and UTF-32, but are defined as invalid
+ code points, and cause errors if encountered in a UTF-8 or UTF-32
+ string that is being checked for validity by PCRE2.
+
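The arithmetic behind surrogate pairs explains why the values 0xd800 to 0xdfff cannot stand for characters of their own in UTF-16. The sketch below is plain C and does not use PCRE2; both function names are hypothetical.

```c
#include <stdint.h>

/* Returns 1 if cp is a surrogate code point (0xd800 to 0xdfff). */
int is_surrogate(uint32_t cp)
{
    return cp >= 0xd800 && cp <= 0xdfff;
}

/* Encode a code point in the range 0x10000 to 0x10ffff as a UTF-16
   surrogate pair. Returns 1 on success, 0 if cp is outside that range. */
int utf16_encode_pair(uint32_t cp, uint16_t *high, uint16_t *low)
{
    if (cp < 0x10000 || cp > 0x10ffff) return 0;
    uint32_t v = cp - 0x10000;                /* 20 significant bits */
    *high = (uint16_t)(0xd800 + (v >> 10));   /* top 10 bits    */
    *low  = (uint16_t)(0xdc00 + (v & 0x3ff)); /* bottom 10 bits */
    return 1;
}
```

For example, U+1F600 is encoded as the pair 0xd83d, 0xde00. Because every value in the surrogate range is reserved for one half of such a pair, a lone surrogate is not a valid character in any UTF encoding.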
+ These values also cause errors if encountered in escape sequences such
+ as \x{d912} within a pattern. However, it seems that some applications,
+ when using PCRE2 to check for unwanted characters in UTF-8 strings,
+ explicitly test for the surrogates using escape sequences. The
+ PCRE2_NO_UTF_CHECK option does not disable the error that occurs,
+ because it applies only to the testing of input strings for UTF valid-
+ ity.
+
+ If the extra option PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES is set, surro-
+ gate code point values in UTF-8 and UTF-32 patterns no longer provoke
+ errors and are incorporated in the compiled pattern. However, they can
+ only match subject characters if the matching function is called with
+ PCRE2_NO_UTF_CHECK set.
+
+ PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL
+
+ This is a dangerous option. Use with care. By default, an unrecognized
+ escape such as \j or a malformed one such as \x{2z} causes a compile-
+ time error when detected by pcre2_compile(). Perl is somewhat inconsis-
+ tent in handling such items: for example, \j is treated as a literal
+ "j", and non-hexadecimal digits in \x{} are just ignored, though warn-
+ ings are given in both cases if Perl's warning switch is enabled. How-
+ ever, a malformed octal number after \o{ always causes an error in
+ Perl.
+
+ If the PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL extra option is passed to
+ pcre2_compile(), all unrecognized or erroneous escape sequences are
+ treated as single-character escapes. For example, \j is a literal "j"
+ and \x{2z} is treated as the literal string "x{2z}". Setting this
+ option means that typos in patterns may go undetected and have unex-
+ pected results. This is a dangerous option. Use with care.
+
+ PCRE2_EXTRA_MATCH_LINE
+
+ This option is provided for use by the -x option of pcre2grep. It
+ causes the pattern only to match complete lines. This is achieved by
+ automatically inserting the code for "^(?:" at the start of the com-
+ piled pattern and ")$" at the end. Thus, when PCRE2_MULTILINE is set,
+ the matched line may be in the middle of the subject string. This
+ option can be used with PCRE2_LITERAL.
+
+ PCRE2_EXTRA_MATCH_WORD
+
+ This option is provided for use by the -w option of pcre2grep. It
+ causes the pattern only to match strings that have a word boundary at
+ the start and the end. This is achieved by automatically inserting the
+ code for "\b(?:" at the start of the compiled pattern and ")\b" at the
+ end. The option may be used with PCRE2_LITERAL. However, it is ignored
+ if PCRE2_EXTRA_MATCH_LINE is also set.
JUST-IN-TIME (JIT) COMPILATION
@@ -1547,53 +1814,53 @@ JUST-IN-TIME (JIT) COMPILATION
void pcre2_jit_stack_free(pcre2_jit_stack *jit_stack);
- These functions provide support for JIT compilation, which, if the
- just-in-time compiler is available, further processes a compiled pat-
+ These functions provide support for JIT compilation, which, if the
+ just-in-time compiler is available, further processes a compiled pat-
tern into machine code that executes much faster than the pcre2_match()
- interpretive matching function. Full details are given in the pcre2jit
+ interpretive matching function. Full details are given in the pcre2jit
documentation.
- JIT compilation is a heavyweight optimization. It can take some time
- for patterns to be analyzed, and for one-off matches and simple pat-
- terns the benefit of faster execution might be offset by a much slower
- compilation time. Most, but not all patterns can be optimized by the
+ JIT compilation is a heavyweight optimization. It can take some time
+ for patterns to be analyzed, and for one-off matches and simple pat-
+ terns the benefit of faster execution might be offset by a much slower
+ compilation time. Most (but not all) patterns can be optimized by the
JIT compiler.
LOCALE SUPPORT
- PCRE2 handles caseless matching, and determines whether characters are
- letters, digits, or whatever, by reference to a set of tables, indexed
- by character code point. This applies only to characters whose code
- points are less than 256. By default, higher-valued code points never
- match escapes such as \w or \d. However, if PCRE2 is built with UTF
- support, all characters can be tested with \p and \P, or, alterna-
- tively, the PCRE2_UCP option can be set when a pattern is compiled;
- this causes \w and friends to use Unicode property support instead of
+ PCRE2 handles caseless matching, and determines whether characters are
+ letters, digits, or whatever, by reference to a set of tables, indexed
+ by character code point. This applies only to characters whose code
+ points are less than 256. By default, higher-valued code points never
+ match escapes such as \w or \d. However, if PCRE2 is built with Uni-
+ code support, all characters can be tested with \p and \P, or, alterna-
+ tively, the PCRE2_UCP option can be set when a pattern is compiled;
+ this causes \w and friends to use Unicode property support instead of
the built-in tables.
- The use of locales with Unicode is discouraged. If you are handling
- characters with code points greater than 128, you should either use
+ The use of locales with Unicode is discouraged. If you are handling
+ characters with code points greater than 128, you should either use
Unicode support, or use locales, but not try to mix the two.
- PCRE2 contains an internal set of character tables that are used by
- default. These are sufficient for many applications. Normally, the
+ PCRE2 contains an internal set of character tables that are used by
+ default. These are sufficient for many applications. Normally, the
internal tables recognize only ASCII characters. However, when PCRE2 is
built, it is possible to cause the internal tables to be rebuilt in the
default "C" locale of the local system, which may cause them to be dif-
ferent.
- The internal tables can be overridden by tables supplied by the appli-
- cation that calls PCRE2. These may be created in a different locale
- from the default. As more and more applications change to using Uni-
+ The internal tables can be overridden by tables supplied by the appli-
+ cation that calls PCRE2. These may be created in a different locale
+ from the default. As more and more applications change to using Uni-
code, the need for this locale support is expected to die away.
- External tables are built by calling the pcre2_maketables() function,
- in the relevant locale. The result can be passed to pcre2_compile() as
- often as necessary, by creating a compile context and calling
- pcre2_set_character_tables() to set the tables pointer therein. For
- example, to build and use tables that are appropriate for the French
- locale (where accented characters with values greater than 128 are
+ External tables are built by calling the pcre2_maketables() function,
+ in the relevant locale. The result can be passed to pcre2_compile() as
+ often as necessary, by creating a compile context and calling
+ pcre2_set_character_tables() to set the tables pointer therein. For
+ example, to build and use tables that are appropriate for the French
+ locale (where accented characters with values greater than 128 are
treated as letters), the following code could be used:
setlocale(LC_CTYPE, "fr_FR");
@@ -1602,15 +1869,15 @@ LOCALE SUPPORT
pcre2_set_character_tables(ccontext, tables);
re = pcre2_compile(..., ccontext);
- The locale name "fr_FR" is used on Linux and other Unix-like systems;
- if you are using Windows, the name for the French locale is "french".
- It is the caller's responsibility to ensure that the memory containing
+ The locale name "fr_FR" is used on Linux and other Unix-like systems;
+ if you are using Windows, the name for the French locale is "french".
+ It is the caller's responsibility to ensure that the memory containing
the tables remains available for as long as it is needed.
The pointer that is passed (via the compile context) to pcre2_compile()
- is saved with the compiled pattern, and the same tables are used by
- pcre2_match() and pcre_dfa_match(). Thus, for any single pattern, com-
- pilation, and matching all happen in the same locale, but different
+ is saved with the compiled pattern, and the same tables are used by
+ pcre2_match() and pcre2_dfa_match(). Thus, for any single pattern,
+ compilation and matching both happen in the same locale, but different
patterns can be processed in different locales.
@@ -1618,14 +1885,14 @@ INFORMATION ABOUT A COMPILED PATTERN
int pcre2_pattern_info(const pcre2_code *code, uint32_t what, void *where);
- The pcre2_pattern_info() function returns general information about a
+ The pcre2_pattern_info() function returns general information about a
compiled pattern. For information about callouts, see the next section.
- The first argument for pcre2_pattern_info() is a pointer to the com-
+ The first argument for pcre2_pattern_info() is a pointer to the com-
piled pattern. The second argument specifies which piece of information
- is required, and the third argument is a pointer to a variable to
- receive the data. If the third argument is NULL, the first argument is
- ignored, and the function returns the size in bytes of the variable
- that is required for the information requested. Otherwise, The yield of
+ is required, and the third argument is a pointer to a variable to
+ receive the data. If the third argument is NULL, the first argument is
+ ignored, and the function returns the size in bytes of the variable
+ that is required for the information requested. Otherwise, the yield of
the function is zero for success, or one of the following negative num-
bers:
@@ -1634,9 +1901,9 @@ INFORMATION ABOUT A COMPILED PATTERN
PCRE2_ERROR_BADOPTION the value of what was invalid
PCRE2_ERROR_UNSET the requested field is not set
- The "magic number" is placed at the start of each compiled pattern as
- an simple check against passing an arbitrary memory pointer. Here is a
- typical call of pcre2_pattern_info(), to obtain the length of the com-
+ The "magic number" is placed at the start of each compiled pattern as
+ a simple check against passing an arbitrary memory pointer. Here is a
+ typical call of pcre2_pattern_info(), to obtain the length of the com-
piled pattern:
int rc;
@@ -1651,12 +1918,16 @@ INFORMATION ABOUT A COMPILED PATTERN
PCRE2_INFO_ALLOPTIONS
PCRE2_INFO_ARGOPTIONS
-
- Return a copy of the pattern's options. The third argument should point
- to a uint32_t variable. PCRE2_INFO_ARGOPTIONS returns exactly the
- options that were passed to pcre2_compile(), whereas PCRE2_INFO_ALLOP-
- TIONS returns the compile options as modified by any top-level (*XXX)
- option settings such as (*UTF) at the start of the pattern itself.
+ PCRE2_INFO_EXTRAOPTIONS
+
+ Return copies of the pattern's options. The third argument should point
+ to a uint32_t variable. PCRE2_INFO_ARGOPTIONS returns exactly the
+ options that were passed to pcre2_compile(), whereas PCRE2_INFO_ALLOP-
+ TIONS returns the compile options as modified by any top-level (*XXX)
+ option settings such as (*UTF) at the start of the pattern itself.
+ PCRE2_INFO_EXTRAOPTIONS returns the extra options that were set in the
+ compile context by calling the pcre2_set_compile_extra_options() func-
+ tion.
For example, if the pattern /(*UTF)abc/ is compiled with the
PCRE2_EXTENDED option, the result for PCRE2_INFO_ALLOPTIONS is
@@ -1681,8 +1952,8 @@ INFORMATION ABOUT A COMPILED PATTERN
.* is not in a capturing group that is the subject
of a back reference
PCRE2_DOTALL is in force for .*
- Neither (*PRUNE) nor (*SKIP) appears in the pattern.
- PCRE2_NO_DOTSTAR_ANCHOR is not set.
+ Neither (*PRUNE) nor (*SKIP) appears in the pattern
+ PCRE2_NO_DOTSTAR_ANCHOR is not set
For patterns that are auto-anchored, the PCRE2_ANCHORED bit is set in
the options returned for PCRE2_INFO_ALLOPTIONS.
@@ -1711,6 +1982,16 @@ INFORMATION ABOUT A COMPILED PATTERN
terns where (?| is not used, this is also the total number of capturing
subpatterns. The third argument should point to a uint32_t variable.
+ PCRE2_INFO_DEPTHLIMIT
+
+ If the pattern set a backtracking depth limit by including an item of
+ the form (*LIMIT_DEPTH=nnnn) at the start, the value is returned. The
+ third argument should point to an unsigned 32-bit integer. If no such
+ value has been set, the call to pcre2_pattern_info() returns the error
+ PCRE2_ERROR_UNSET. Note that this limit will only be used during match-
+ ing if it is less than the limit set or defaulted by the caller of the
+ match function.
+
PCRE2_INFO_FIRSTBITMAP
In the absence of a single first code unit for a non-anchored pattern,
@@ -1727,33 +2008,53 @@ INFORMATION ABOUT A COMPILED PATTERN
Return information about the first code unit of any matched string, for
a non-anchored pattern. The third argument should point to a uint32_t
variable. If there is a fixed first value, for example, the letter "c"
- from a pattern such as (cat|cow|coyote), 1 is returned, and the charac-
- ter value can be retrieved using PCRE2_INFO_FIRSTCODEUNIT. If there is
- no fixed first value, but it is known that a match can occur only at
- the start of the subject or following a newline in the subject, 2 is
- returned. Otherwise, and for anchored patterns, 0 is returned.
+ from a pattern such as (cat|cow|coyote), 1 is returned, and the value
+ can be retrieved using PCRE2_INFO_FIRSTCODEUNIT. If there is no fixed
+ first value, but it is known that a match can occur only at the start
+ of the subject or following a newline in the subject, 2 is returned.
+ Otherwise, and for anchored patterns, 0 is returned.
PCRE2_INFO_FIRSTCODEUNIT
- Return the value of the first code unit of any matched string in the
- situation where PCRE2_INFO_FIRSTCODETYPE returns 1; otherwise return 0.
+ Return the value of the first code unit of any matched string for a
+ pattern where PCRE2_INFO_FIRSTCODETYPE returns 1; otherwise return 0.
The third argument should point to a uint32_t variable. In the 8-bit
library, the value is always less than 256. In the 16-bit library the
value can be up to 0xffff. In the 32-bit library in UTF-32 mode the
value can be up to 0x10ffff, and up to 0xffffffff when not using UTF-32
mode.
+ PCRE2_INFO_FRAMESIZE
+
+ Return the size (in bytes) of the data frames that are used to remember
+ backtracking positions when the pattern is processed by pcre2_match()
+ without the use of JIT. The third argument should point to a size_t
+ variable. The frame size depends on the number of capturing parentheses
+ in the pattern. Each additional capturing group adds two PCRE2_SIZE
+ variables.
+
PCRE2_INFO_HASBACKSLASHC
- Return 1 if the pattern contains any instances of \C, otherwise 0. The
+ Return 1 if the pattern contains any instances of \C, otherwise 0. The
third argument should point to a uint32_t variable.
PCRE2_INFO_HASCRORLF
- Return 1 if the pattern contains any explicit matches for CR or LF
+ Return 1 if the pattern contains any explicit matches for CR or LF
characters, otherwise 0. The third argument should point to a uint32_t
- variable. An explicit match is either a literal CR or LF character, or
- \r or \n.
+ variable. An explicit match is either a literal CR or LF character, or
+ \r or \n or one of the equivalent hexadecimal or octal escape
+ sequences.
+
+ PCRE2_INFO_HEAPLIMIT
+
+ If the pattern set a heap memory limit by including an item of the form
+ (*LIMIT_HEAP=nnnn) at the start, the value is returned. The third argu-
+ ment should point to an unsigned 32-bit integer. If no such value has
+ been set, the call to pcre2_pattern_info() returns the error
+ PCRE2_ERROR_UNSET. Note that this limit will only be used during match-
+ ing if it is less than the limit set or defaulted by the caller of the
+ match function.
PCRE2_INFO_JCHANGED
@@ -1782,10 +2083,10 @@ INFORMATION ABOUT A COMPILED PATTERN
PCRE2_INFO_LASTCODEUNIT
- Return the value of the rightmost literal data unit that must exist in
- any matched string, other than at its start, if such a value has been
- recorded. The third argument should point to an uint32_t variable. If
- there is no such value, 0 is returned.
+ Return the value of the rightmost literal code unit that must exist in
+ any matched string, other than at its start, for a pattern where
+ PCRE2_INFO_LASTCODETYPE returns 1. Otherwise, return 0. The third
+ argument should point to a uint32_t variable.
PCRE2_INFO_MATCHEMPTY
@@ -1801,7 +2102,9 @@ INFORMATION ABOUT A COMPILED PATTERN
(*LIMIT_MATCH=nnnn) at the start, the value is returned. The third
argument should point to an unsigned 32-bit integer. If no such value
has been set, the call to pcre2_pattern_info() returns the error
- PCRE2_ERROR_UNSET.
+ PCRE2_ERROR_UNSET. Note that this limit will only be used during match-
+ ing if it is less than the limit set or defaulted by the caller of the
+ match function.
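
The rule stated for the DEPTHLIMIT, HEAPLIMIT, and MATCHLIMIT items can
be sketched as follows. This is an illustration of the documented rule
only, not code taken from PCRE2; effective_limit() and its parameters
are hypothetical names, and pattern_limit_set is assumed to be zero
when pcre2_pattern_info() returned PCRE2_ERROR_UNSET for the item.

```c
#include <stdint.h>

/* Hypothetical helper: pick the limit that applies during matching.
   caller_limit      - limit set or defaulted by the caller of the
                       match function
   pattern_limit_set - nonzero if the pattern started with an item such
                       as (*LIMIT_MATCH=nnnn)
   pattern_limit     - the value from that item (ignored if not set)   */
uint32_t effective_limit(uint32_t caller_limit, int pattern_limit_set,
                         uint32_t pattern_limit)
{
    /* A limit in the pattern is used only if it is less than the
       caller's limit; it can never raise the limit. */
    if (pattern_limit_set && pattern_limit < caller_limit)
        return pattern_limit;
    return caller_limit;
}
```

With a caller limit of 1000, a pattern limit of 500 gives 500, whereas
a pattern limit of 5000 leaves the limit at 1000.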
PCRE2_INFO_MAXLOOKBEHIND
@@ -1814,7 +2117,7 @@ INFORMATION ABOUT A COMPILED PATTERN
inspect the previous character. This is to ensure that at least one
character from the old segment is retained when a new segment is pro-
cessed. Otherwise, if there are no lookbehinds in the pattern, \A might
- match incorrectly at the start of a new segment.
+ match incorrectly at the start of a second or subsequent segment.
PCRE2_INFO_MINLENGTH
@@ -1889,24 +2192,17 @@ INFORMATION ABOUT A COMPILED PATTERN
PCRE2_INFO_NEWLINE
- The output is a uint32_t with one of the following values:
+ The output is one of the following uint32_t values:
PCRE2_NEWLINE_CR Carriage return (CR)
PCRE2_NEWLINE_LF Linefeed (LF)
PCRE2_NEWLINE_CRLF Carriage return, linefeed (CRLF)
PCRE2_NEWLINE_ANY Any Unicode line ending
PCRE2_NEWLINE_ANYCRLF Any of CR, LF, or CRLF
+ PCRE2_NEWLINE_NUL The NUL character (binary zero)
- This specifies the default character sequence that will be recognized
- as meaning "newline" while matching.
-
- PCRE2_INFO_RECURSIONLIMIT
-
- If the pattern set a recursion limit by including an item of the form
- (*LIMIT_RECURSION=nnnn) at the start, the value is returned. The third
- argument should point to an unsigned 32-bit integer. If no such value
- has been set, the call to pcre2_pattern_info() returns the error
- PCRE2_ERROR_UNSET.
+ This identifies the character sequence that will be recognized as mean-
+ ing "newline" while matching.
PCRE2_INFO_SIZE
@@ -1962,15 +2258,15 @@ THE MATCH DATA BLOCK
match data block, which is an opaque structure that is accessed by
function calls. In particular, the match data block contains a vector
of offsets into the subject string that define the matched part of the
- subject and any substrings that were captured. This is know as the
+ subject and any substrings that were captured. This is known as the
ovector.
Before calling pcre2_match(), pcre2_dfa_match(), or pcre2_jit_match()
you must create a match data block by calling one of the creation func-
tions above. For pcre2_match_data_create(), the first argument is the
number of pairs of offsets in the ovector. One pair of offsets is
- required to identify the string that matched the whole pattern, with
- another pair for each captured substring. For example, a value of 4
+ required to identify the string that matched the whole pattern, with an
+ additional pair for each captured substring. For example, a value of 4
creates enough space to record the matched portion of the subject plus
three captured substrings. A minimum of 1 pair is imposed by
pcre2_match_data_create(), so it is always possible to return the over-
@@ -2038,7 +2334,7 @@ MATCHING A PATTERN: THE TRADITIONAL FUNCTION
11, /* the length of the subject string */
0, /* start at offset 0 in the subject */
0, /* default options */
- match_data, /* the match data block */
+ md, /* the match data block */
NULL); /* a match context; NULL means use defaults */
If the subject string is zero-terminated, the length can be given as
@@ -2094,24 +2390,26 @@ MATCHING A PATTERN: THE TRADITIONAL FUNCTION
so, and the current character is CR followed by LF, advance the start-
ing offset by two characters instead of one.
- If a non-zero starting offset is passed when the pattern is anchored,
- one attempt to match at the given offset is made. This can only succeed
- if the pattern does not require the match to be at the start of the
- subject.
+ If a non-zero starting offset is passed when the pattern is anchored, a
+ single attempt to match at the given offset is made. This can only suc-
+ ceed if the pattern does not require the match to be at the start of
+ the subject. In other words, the anchoring must be the result of set-
+ ting the PCRE2_ANCHORED option or the use of .* with PCRE2_DOTALL, not
+ of starting the pattern with ^ or \A.
Option bits for pcre2_match()
The unused bits of the options argument for pcre2_match() must be zero.
- The only bits that may be set are PCRE2_ANCHORED, PCRE2_NOTBOL,
- PCRE2_NOTEOL, PCRE2_NOTEMPTY, PCRE2_NOTEMPTY_ATSTART, PCRE2_NO_JIT,
- PCRE2_NO_UTF_CHECK, PCRE2_PARTIAL_HARD, and PCRE2_PARTIAL_SOFT. Their
- action is described below.
-
- Setting PCRE2_ANCHORED at match time is not supported by the just-in-
- time (JIT) compiler. If it is set, JIT matching is disabled and the
- normal interpretive code in pcre2_match() is run. Apart from
- PCRE2_NO_JIT (obviously), the remaining options are supported for JIT
- matching.
+ The only bits that may be set are PCRE2_ANCHORED, PCRE2_ENDANCHORED,
+ PCRE2_NOTBOL, PCRE2_NOTEOL, PCRE2_NOTEMPTY, PCRE2_NOTEMPTY_ATSTART,
+ PCRE2_NO_JIT, PCRE2_NO_UTF_CHECK, PCRE2_PARTIAL_HARD, and PCRE2_PAR-
+ TIAL_SOFT. Their action is described below.
+
+ Setting PCRE2_ANCHORED or PCRE2_ENDANCHORED at match time is not
+ supported by the just-in-time (JIT) compiler. If either is set, JIT
+ matching is disabled and the interpretive code in pcre2_match() is run.
+ Apart from PCRE2_NO_JIT (obviously), the remaining options are
+ supported for JIT matching.
PCRE2_ANCHORED
@@ -2121,6 +2419,12 @@ MATCHING A PATTERN: THE TRADITIONAL FUNCTION
unanchored at matching time. Note that setting the option at match time
disables JIT matching.
+ PCRE2_ENDANCHORED
+
+ If the PCRE2_ENDANCHORED option is set, any string that pcre2_match()
+ matches must be right at the end of the subject string. Note that set-
+ ting the option at match time disables JIT matching.
+
PCRE2_NOTBOL
This option specifies that the first character of the subject string is not
@@ -2192,11 +2496,11 @@ MATCHING A PATTERN: THE TRADITIONAL FUNCTION
checks for performance reasons, you can set the PCRE2_NO_UTF_CHECK
option when calling pcre2_match(). You might want to do this for the
second and subsequent calls to pcre2_match() if you are making repeated
- calls to find all the matches in a single subject string.
+ calls to find other matches in the same subject string.
- NOTE: When PCRE2_NO_UTF_CHECK is set, the effect of passing an invalid
- string as a subject, or an invalid value of startoffset, is undefined.
- Your program may crash or loop indefinitely.
+ WARNING: When PCRE2_NO_UTF_CHECK is set, the effect of passing an
+ invalid string as a subject, or an invalid value of startoffset, is
+ undefined. Your program may crash or loop indefinitely.
PCRE2_PARTIAL_HARD
PCRE2_PARTIAL_SOFT
@@ -2249,11 +2553,12 @@ NEWLINE HANDLING WHEN MATCHING
acter after the first failure.
An explicit match for CR or LF is either a literal appearance of one of
- those characters in the pattern, or one of the \r or \n escape
- sequences. Implicit matches such as [^X] do not count, nor does \s,
- even though it includes CR and LF in the characters that it matches.
+ those characters in the pattern, or one of the \r or \n or equivalent
+ octal or hexadecimal escape sequences. Implicit matches such as [^X] do
+ not count, nor does \s, even though it includes CR and LF in the char-
+ acters that it matches.
- Notwithstanding the above, anomalous effects may still occur when CRLF
+ Notwithstanding the above, anomalous effects may still occur when CRLF
is a valid newline sequence and explicit \r or \n escapes appear in the
pattern.
@@ -2264,85 +2569,81 @@ HOW PCRE2_MATCH() RETURNS A STRING AND CAPTURED SUBSTRINGS
PCRE2_SIZE *pcre2_get_ovector_pointer(pcre2_match_data *match_data);
- In general, a pattern matches a certain portion of the subject, and in
- addition, further substrings from the subject may be picked out by
- parenthesized parts of the pattern. Following the usage in Jeffrey
- Friedl's book, this is called "capturing" in what follows, and the
- phrase "capturing subpattern" or "capturing group" is used for a frag-
- ment of a pattern that picks out a substring. PCRE2 supports several
+ In general, a pattern matches a certain portion of the subject, and in
+ addition, further substrings from the subject may be picked out by
+ parenthesized parts of the pattern. Following the usage in Jeffrey
+ Friedl's book, this is called "capturing" in what follows, and the
+ phrase "capturing subpattern" or "capturing group" is used for a frag-
+ ment of a pattern that picks out a substring. PCRE2 supports several
other kinds of parenthesized subpattern that do not cause substrings to
- be captured. The pcre2_pattern_info() function can be used to find out
+ be captured. The pcre2_pattern_info() function can be used to find out
how many capturing subpatterns there are in a compiled pattern.
- You can use auxiliary functions for accessing captured substrings by
+ You can use auxiliary functions for accessing captured substrings by
number or by name, as described in sections below.
Alternatively, you can make direct use of the vector of PCRE2_SIZE val-
- ues, called the ovector, which contains the offsets of captured
- strings. It is part of the match data block. The function
- pcre2_get_ovector_pointer() returns the address of the ovector, and
+ ues, called the ovector, which contains the offsets of captured
+ strings. It is part of the match data block. The function
+ pcre2_get_ovector_pointer() returns the address of the ovector, and
pcre2_get_ovector_count() returns the number of pairs of values it con-
tains.
Within the ovector, the first in each pair of values is set to the off-
set of the first code unit of a substring, and the second is set to the
- offset of the first code unit after the end of a substring. These val-
- ues are always code unit offsets, not character offsets. That is, they
- are byte offsets in the 8-bit library, 16-bit offsets in the 16-bit
+ offset of the first code unit after the end of a substring. These val-
+ ues are always code unit offsets, not character offsets. That is, they
+ are byte offsets in the 8-bit library, 16-bit offsets in the 16-bit
library, and 32-bit offsets in the 32-bit library.
- After a partial match (error return PCRE2_ERROR_PARTIAL), only the
- first pair of offsets (that is, ovector[0] and ovector[1]) are set.
- They identify the part of the subject that was partially matched. See
+ After a partial match (error return PCRE2_ERROR_PARTIAL), only the
+ first pair of offsets (that is, ovector[0] and ovector[1]) are set.
+ They identify the part of the subject that was partially matched. See
the pcre2partial documentation for details of partial matching.
- After a successful match, the first pair of offsets identifies the por-
- tion of the subject string that was matched by the entire pattern. The
- next pair is used for the first capturing subpattern, and so on. The
- value returned by pcre2_match() is one more than the highest numbered
- pair that has been set. For example, if two substrings have been cap-
- tured, the returned value is 3. If there are no capturing subpatterns,
- the return value from a successful match is 1, indicating that just the
- first pair of offsets has been set.
+ After a fully successful match, the first pair of offsets identifies
+ the portion of the subject string that was matched by the entire pat-
+ tern. The next pair is used for the first captured substring, and so
+ on. The value returned by pcre2_match() is one more than the highest
+ numbered pair that has been set. For example, if two substrings have
+ been captured, the returned value is 3. If there are no captured sub-
+ strings, the return value from a successful match is 1, indicating that
+ just the first pair of offsets has been set.
- If a pattern uses the \K escape sequence within a positive assertion,
+ If a pattern uses the \K escape sequence within a positive assertion,
the reported start of a successful match can be greater than the end of
- the match. For example, if the pattern (?=ab\K) is matched against
+ the match. For example, if the pattern (?=ab\K) is matched against
"ab", the start and end offset values for the match are 2 and 0.
- If a capturing subpattern group is matched repeatedly within a single
- match operation, it is the last portion of the subject that it matched
+ If a capturing subpattern group is matched repeatedly within a single
+ match operation, it is the last portion of the subject that it matched
that is returned.
If the ovector is too small to hold all the captured substring offsets,
- as much as possible is filled in, and the function returns a value of
- zero. If captured substrings are not of interest, pcre2_match() may be
+ as much as possible is filled in, and the function returns a value of
+ zero. If captured substrings are not of interest, pcre2_match() may be
called with a match data block whose ovector is of minimum length (that
- is, one pair). However, if the pattern contains back references and the
- ovector is not big enough to remember the related substrings, PCRE2 has
- to get additional memory for use during matching. Thus it is usually
- advisable to set up a match data block containing an ovector of reason-
- able size.
+ is, one pair).
- It is possible for capturing subpattern number n+1 to match some part
+ It is possible for capturing subpattern number n+1 to match some part
of the subject when subpattern n has not been used at all. For example,
- if the string "abc" is matched against the pattern (a|(z))(bc) the
+ if the string "abc" is matched against the pattern (a|(z))(bc) the
return from the function is 4, and subpatterns 1 and 3 are matched, but
- 2 is not. When this happens, both values in the offset pairs corre-
+ 2 is not. When this happens, both values in the offset pairs corre-
sponding to unused subpatterns are set to PCRE2_UNSET.
- Offset values that correspond to unused subpatterns at the end of the
- expression are also set to PCRE2_UNSET. For example, if the string
+ Offset values that correspond to unused subpatterns at the end of the
+ expression are also set to PCRE2_UNSET. For example, if the string
"abc" is matched against the pattern (abc)(x(yz)?)? subpatterns 2 and 3
- are not matched. The return from the function is 2, because the high-
+ are not matched. The return from the function is 2, because the high-
est used capturing subpattern number is 1. The offsets for the sec-
- ond and third capturing subpatterns (assuming the vector is large
+ ond and third capturing subpatterns (assuming the vector is large
enough, of course) are set to PCRE2_UNSET.
Elements in the ovector that do not correspond to capturing parentheses
in the pattern are never changed. That is, if a pattern contains n cap-
turing parentheses, no more than ovector[0] to ovector[2n+1] are set by
- pcre2_match(). The other elements retain whatever values they previ-
+ pcre2_match(). The other elements retain whatever values they previ-
ously had.
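
A minimal sketch of reading the ovector, assuming a hand-filled array
in place of a real match data block. SKETCH_UNSET, group_is_set(), and
group_length() are hypothetical names used only for illustration and
are not part of PCRE2. SKETCH_UNSET stands in for PCRE2_UNSET, and
size_t stands in for PCRE2_SIZE.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for PCRE2_UNSET (assumed here to be the all-ones value). */
#define SKETCH_UNSET (~(size_t)0)

/* Return 1 if group n took part in the match, otherwise 0.
   rc is the positive value returned by the match function, that is,
   one more than the highest numbered pair that has been set.
   Group 0 is the string matched by the whole pattern. */
int group_is_set(const size_t *ovector, int rc, int n)
{
    assert(ovector != NULL);
    if (n < 0 || n >= rc) return 0;      /* beyond the last pair set */
    return ovector[2*n] != SKETCH_UNSET; /* unused groups are unset  */
}

/* Return the length in code units of group n, or 0 if it is unset.
   If \K inside an assertion made the start greater than the end,
   0 is returned rather than a wrapped-around value. */
size_t group_length(const size_t *ovector, int rc, int n)
{
    size_t start, end;
    if (!group_is_set(ovector, rc, n)) return 0;
    start = ovector[2*n];
    end = ovector[2*n + 1];
    return (end >= start) ? end - start : 0;
}
```

For the (a|(z))(bc) example above, matched against "abc", the return
value is 4 and the pairs are (0,3), (0,1), (unset,unset), and (1,3).
With those values group_is_set() reports groups 1 and 3 as set and
group 2 as unset, and group_length() gives 2 for group 3.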
@@ -2352,56 +2653,60 @@ OTHER INFORMATION ABOUT A MATCH
PCRE2_SIZE pcre2_get_startchar(pcre2_match_data *match_data);
- As well as the offsets in the ovector, other information about a match
- is retained in the match data block and can be retrieved by the above
- functions in appropriate circumstances. If they are called at other
+ As well as the offsets in the ovector, other information about a match
+ is retained in the match data block and can be retrieved by the above
+ functions in appropriate circumstances. If they are called at other
times, the result is undefined.
- After a successful match, a partial match (PCRE2_ERROR_PARTIAL), or a
- failure to match (PCRE2_ERROR_NOMATCH), a (*MARK) name may be avail-
- able, and pcre2_get_mark() can be called. It returns a pointer to the
- zero-terminated name, which is within the compiled pattern. Otherwise
- NULL is returned. The length of the (*MARK) name (excluding the termi-
- nating zero) is stored in the code unit that preceeds the name. You
- should use this instead of relying on the terminating zero if the
- (*MARK) name might contain a binary zero.
-
- After a successful match, the (*MARK) name that is returned is the last
- one encountered on the matching path through the pattern. After a "no
- match" or a partial match, the last encountered (*MARK) name is
- returned. For example, consider this pattern:
+ After a successful match, a partial match (PCRE2_ERROR_PARTIAL), or a
+ failure to match (PCRE2_ERROR_NOMATCH), a (*MARK), (*PRUNE), or (*THEN)
+ name may be available. The function pcre2_get_mark() can be called to
+ access this name. The same function applies to all three verbs. It
+ returns a pointer to the zero-terminated name, which is within the com-
+ piled pattern. If no name is available, NULL is returned. The length of
+ the name (excluding the terminating zero) is stored in the code unit
+ that precedes the name. You should use this length instead of relying
+ on the terminating zero if the name might contain a binary zero.
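
As a sketch of the layout just described, assuming the 8-bit library
(one byte per code unit): the helper below reads the length from the
code unit in front of the name. mark_name_length() is a hypothetical
name, and the pointer is assumed to be one that pcre2_get_mark() would
return; here it can equally be a pointer into a hand-built buffer.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: mark points at the first code unit of the name.
   The code unit immediately before it holds the name's length,
   excluding the terminating zero. */
size_t mark_name_length(const uint8_t *mark)
{
    return (size_t)mark[-1];
}
```

For a buffer holding the code units 3, 'A', 0, 'B', 0, a pointer to
'A' yields a length of 3, whereas scanning for the terminating zero
would stop after one code unit.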
+
+ After a successful match, the name that is returned is the last
+ (*MARK), (*PRUNE), or (*THEN) name encountered on the matching path
+ through the pattern. Instances of (*PRUNE) and (*THEN) without names
+ are ignored. Thus, for example, if the matching path contains
+ (*MARK:A)(*PRUNE), the name "A" is returned. After a "no match" or a
+ partial match, the last encountered name is returned. For example,
+ consider this pattern:
^(*MARK:A)((*MARK:B)a|b)c
- When it matches "bc", the returned mark is A. The B mark is "seen" in
- the first branch of the group, but it is not on the matching path. On
- the other hand, when this pattern fails to match "bx", the returned
- mark is B.
+ When it matches "bc", the returned name is A. The B mark is "seen" in
+ the first branch of the group, but it is not on the matching path. On
+ the other hand, when this pattern fails to match "bx", the returned
+ name is B.
- After a successful match, a partial match, or one of the invalid UTF
- errors (for example, PCRE2_ERROR_UTF8_ERR5), pcre2_get_startchar() can
+ After a successful match, a partial match, or one of the invalid UTF
+ errors (for example, PCRE2_ERROR_UTF8_ERR5), pcre2_get_startchar() can
be called. After a successful or partial match it returns the code unit
- offset of the character at which the match started. For a non-partial
- match, this can be different to the value of ovector[0] if the pattern
- contains the \K escape sequence. After a partial match, however, this
- value is always the same as ovector[0] because \K does not affect the
+ offset of the character at which the match started. For a non-partial
+ match, this can be different to the value of ovector[0] if the pattern
+ contains the \K escape sequence. After a partial match, however, this
+ value is always the same as ovector[0] because \K does not affect the
result of a partial match.
- After a UTF check failure, pcre2_get_startchar() can be used to obtain
+ After a UTF check failure, pcre2_get_startchar() can be used to obtain
the code unit offset of the invalid UTF character. Details are given in
the pcre2unicode page.
ERROR RETURNS FROM pcre2_match()
- If pcre2_match() fails, it returns a negative number. This can be con-
- verted to a text string by calling the pcre2_get_error_message() func-
- tion (see "Obtaining a textual error message" below). Negative error
- codes are also returned by other functions, and are documented with
- them. The codes are given names in the header file. If UTF checking is
+ If pcre2_match() fails, it returns a negative number. This can be con-
+ verted to a text string by calling the pcre2_get_error_message() func-
+ tion (see "Obtaining a textual error message" below). Negative error
+ codes are also returned by other functions, and are documented with
+ them. The codes are given names in the header file. If UTF checking is
in force and an invalid UTF subject string is detected, one of a number
- of UTF-specific negative error codes is returned. Details are given in
- the pcre2unicode page. The following are the other errors that may be
+ of UTF-specific negative error codes is returned. Details are given in
+ the pcre2unicode page. The following are the other errors that may be
returned by pcre2_match():
PCRE2_ERROR_NOMATCH
@@ -2410,20 +2715,21 @@ ERROR RETURNS FROM pcre2_match()
PCRE2_ERROR_PARTIAL
- The subject string did not match, but it did match partially. See the
+ The subject string did not match, but it did match partially. See the
pcre2partial documentation for details of partial matching.
PCRE2_ERROR_BADMAGIC
PCRE2 stores a 4-byte "magic number" at the start of the compiled code,
- to catch the case when it is passed a junk pointer. This is the error
+ to catch the case when it is passed a junk pointer. This is the error
that is returned when the magic number is not present.
PCRE2_ERROR_BADMODE
- This error is given when a pattern that was compiled by the 8-bit
- library is passed to a 16-bit or 32-bit library function, or vice
- versa.
+ This error is given when a compiled pattern is passed to a function in
+ a library of a different code unit width, for example, a pattern com-
+ piled by the 8-bit library is passed to a 16-bit or 32-bit library
+ function.
PCRE2_ERROR_BADOFFSET
@@ -2447,19 +2753,19 @@ ERROR RETURNS FROM pcre2_match()
pcre2_callout_enumerate() to return a distinctive error code. See the
pcre2callout documentation for details.
+ PCRE2_ERROR_DEPTHLIMIT
+
+ The nested backtracking depth limit was reached.
+
+ PCRE2_ERROR_HEAPLIMIT
+
+ The heap limit was reached.
+
PCRE2_ERROR_INTERNAL
An unexpected internal error has occurred. This error could be caused
by a bug in PCRE2 or by overwriting of the compiled pattern.
- PCRE2_ERROR_JIT_BADOPTION
-
- This error is returned when a pattern that was successfully studied
- using JIT is being matched, but the matching mode (partial or complete
- match) does not correspond to any JIT compilation mode. When the JIT
- fast path function is used, this error may be also given for invalid
- options. See the pcre2jit documentation for more details.
-
PCRE2_ERROR_JIT_STACKLIMIT
This error is returned when a pattern that was successfully studied
@@ -2469,15 +2775,15 @@ ERROR RETURNS FROM pcre2_match()
PCRE2_ERROR_MATCHLIMIT
- The backtracking limit was reached.
+ The backtracking match limit was reached.
PCRE2_ERROR_NOMEMORY
- If a pattern contains back references, but the ovector is not big
- enough to remember the referenced substrings, PCRE2 gets a block of
- memory at the start of matching to use for this purpose. There are some
- other special cases where extra memory is needed during matching. This
- error is given when memory cannot be obtained.
+ If a pattern contains many nested backtracking points, heap memory is
+ used to remember them. This error is given when the memory allocation
+ function (default or custom) fails. Note that a different error,
+ PCRE2_ERROR_HEAPLIMIT, is given if the amount of memory needed exceeds
+ the heap limit.
PCRE2_ERROR_NULL
@@ -2493,10 +2799,6 @@ ERROR RETURNS FROM pcre2_match()
plicated cases, in particular mutual recursions between two different
subpatterns, cannot be detected until matching is attempted.
- PCRE2_ERROR_RECURSIONLIMIT
-
- The internal recursion limit was reached.
-
OBTAINING A TEXTUAL ERROR MESSAGE
@@ -2506,16 +2808,17 @@ OBTAINING A TEXTUAL ERROR MESSAGE
A text message for an error code from any PCRE2 function (compile,
match, or auxiliary) can be obtained by calling pcre2_get_error_mes-
sage(). The code is passed as the first argument, with the remaining
- two arguments specifying a code unit buffer and its length, into which
- the text message is placed. Note that the message is returned in code
- units of the appropriate width for the library that is being used.
+ two arguments specifying a code unit buffer and its length in code
+ units, into which the text message is placed. The message is returned
+ in code units of the appropriate width for the library that is being
+ used.
- The returned message is terminated with a trailing zero, and the func-
- tion returns the number of code units used, excluding the trailing
+ The returned message is terminated with a trailing zero, and the func-
+ tion returns the number of code units used, excluding the trailing
zero. If the error number is unknown, the negative error code
- PCRE2_ERROR_BADDATA is returned. If the buffer is too small, the mes-
- sage is truncated (but still with a trailing zero), and the negative
- error code PCRE2_ERROR_NOMEMORY is returned. None of the messages are
+ PCRE2_ERROR_BADDATA is returned. If the buffer is too small, the mes-
+ sage is truncated (but still with a trailing zero), and the negative
+ error code PCRE2_ERROR_NOMEMORY is returned. None of the messages are
very long; a buffer size of 120 code units is ample.
@@ -2534,39 +2837,39 @@ EXTRACTING CAPTURED SUBSTRINGS BY NUMBER
void pcre2_substring_free(PCRE2_UCHAR *buffer);
- Captured substrings can be accessed directly by using the ovector as
+ Captured substrings can be accessed directly by using the ovector as
described above. For convenience, auxiliary functions are provided for
- extracting captured substrings as new, separate, zero-terminated
+ extracting captured substrings as new, separate, zero-terminated
strings. A substring that contains a binary zero is correctly extracted
- and has a further zero added on the end, but the result is not, of
+ and has a further zero added on the end, but the result is not, of
course, a C string.
The functions in this section identify substrings by number. The number
zero refers to the entire matched substring, with higher numbers refer-
- ring to substrings captured by parenthesized groups. After a partial
- match, only substring zero is available. An attempt to extract any
- other substring gives the error PCRE2_ERROR_PARTIAL. The next section
+ ring to substrings captured by parenthesized groups. After a partial
+ match, only substring zero is available. An attempt to extract any
+ other substring gives the error PCRE2_ERROR_PARTIAL. The next section
describes similar functions for extracting captured substrings by name.
- If a pattern uses the \K escape sequence within a positive assertion,
+ If a pattern uses the \K escape sequence within a positive assertion,
the reported start of a successful match can be greater than the end of
- the match. For example, if the pattern (?=ab\K) is matched against
- "ab", the start and end offset values for the match are 2 and 0. In
- this situation, calling these functions with a zero substring number
+ the match. For example, if the pattern (?=ab\K) is matched against
+ "ab", the start and end offset values for the match are 2 and 0. In
+ this situation, calling these functions with a zero substring number
extracts a zero-length empty string.
- You can find the length in code units of a captured substring without
- extracting it by calling pcre2_substring_length_bynumber(). The first
- argument is a pointer to the match data block, the second is the group
- number, and the third is a pointer to a variable into which the length
- is placed. If you just want to know whether or not the substring has
+ You can find the length in code units of a captured substring without
+ extracting it by calling pcre2_substring_length_bynumber(). The first
+ argument is a pointer to the match data block, the second is the group
+ number, and the third is a pointer to a variable into which the length
+ is placed. If you just want to know whether or not the substring has
been captured, you can pass the third argument as NULL.
- The pcre2_substring_copy_bynumber() function copies a captured sub-
- string into a supplied buffer, whereas pcre2_substring_get_bynumber()
- copies it into new memory, obtained using the same memory allocation
- function that was used for the match data block. The first two argu-
- ments of these functions are a pointer to the match data block and a
+ The pcre2_substring_copy_bynumber() function copies a captured sub-
+ string into a supplied buffer, whereas pcre2_substring_get_bynumber()
+ copies it into new memory, obtained using the same memory allocation
+ function that was used for the match data block. The first two argu-
+ ments of these functions are a pointer to the match data block and a
capturing group number.
The final arguments of pcre2_substring_copy_bynumber() are a pointer to
@@ -2575,25 +2878,25 @@ EXTRACTING CAPTURED SUBSTRINGS BY NUMBER
for the extracted substring, excluding the terminating zero.
For pcre2_substring_get_bynumber() the third and fourth arguments point
- to variables that are updated with a pointer to the new memory and the
- number of code units that comprise the substring, again excluding the
- terminating zero. When the substring is no longer needed, the memory
+ to variables that are updated with a pointer to the new memory and the
+ number of code units that comprise the substring, again excluding the
+ terminating zero. When the substring is no longer needed, the memory
should be freed by calling pcre2_substring_free().
- The return value from all these functions is zero for success, or a
- negative error code. If the pattern match failed, the match failure
- code is returned. If a substring number greater than zero is used
- after a partial match, PCRE2_ERROR_PARTIAL is returned. Other possible
+ The return value from all these functions is zero for success, or a
+ negative error code. If the pattern match failed, the match failure
+ code is returned. If a substring number greater than zero is used
+ after a partial match, PCRE2_ERROR_PARTIAL is returned. Other possible
error codes are:
PCRE2_ERROR_NOMEMORY
- The buffer was too small for pcre2_substring_copy_bynumber(), or the
+ The buffer was too small for pcre2_substring_copy_bynumber(), or the
attempt to get memory failed for pcre2_substring_get_bynumber().
PCRE2_ERROR_NOSUBSTRING
- There is no substring with that number in the pattern, that is, the
+ There is no substring with that number in the pattern, that is, the
number is greater than the number of capturing parentheses.
PCRE2_ERROR_UNAVAILABLE
@@ -2604,8 +2907,8 @@ EXTRACTING CAPTURED SUBSTRINGS BY NUMBER
PCRE2_ERROR_UNSET
- The substring did not participate in the match. For example, if the
- pattern is (abc)|(def) and the subject is "def", and the ovector con-
+ The substring did not participate in the match. For example, if the
+ pattern is (abc)|(def) and the subject is "def", and the ovector con-
tains at least two capturing slots, substring number 1 is unset.
@@ -2616,32 +2919,32 @@ EXTRACTING A LIST OF ALL CAPTURED SUBSTRINGS
void pcre2_substring_list_free(PCRE2_SPTR *list);
- The pcre2_substring_list_get() function extracts all available sub-
- strings and builds a list of pointers to them. It also (optionally)
- builds a second list that contains their lengths (in code units),
+ The pcre2_substring_list_get() function extracts all available sub-
+ strings and builds a list of pointers to them. It also (optionally)
+ builds a second list that contains their lengths (in code units),
excluding a terminating zero that is added to each of them. All this is
done in a single block of memory that is obtained using the same memory
allocation function that was used to get the match data block.
- This function must be called only after a successful match. If called
+ This function must be called only after a successful match. If called
after a partial match, the error code PCRE2_ERROR_PARTIAL is returned.
- The address of the memory block is returned via listptr, which is also
+ The address of the memory block is returned via listptr, which is also
the start of the list of string pointers. The end of the list is marked
- by a NULL pointer. The address of the list of lengths is returned via
- lengthsptr. If your strings do not contain binary zeros and you do not
+ by a NULL pointer. The address of the list of lengths is returned via
+ lengthsptr. If your strings do not contain binary zeros and you do not
therefore need the lengths, you may supply NULL as the lengthsptr argu-
- ment to disable the creation of a list of lengths. The yield of the
- function is zero if all went well, or PCRE2_ERROR_NOMEMORY if the mem-
- ory block could not be obtained. When the list is no longer needed, it
+ ment to disable the creation of a list of lengths. The yield of the
+ function is zero if all went well, or PCRE2_ERROR_NOMEMORY if the mem-
+ ory block could not be obtained. When the list is no longer needed, it
should be freed by calling pcre2_substring_list_free().
If this function encounters a substring that is unset, which can happen
- when capturing subpattern number n+1 matches some part of the subject,
- but subpattern n has not been used at all, it returns an empty string.
- This can be distinguished from a genuine zero-length substring by
+ when capturing subpattern number n+1 matches some part of the subject,
+ but subpattern n has not been used at all, it returns an empty string.
+ This can be distinguished from a genuine zero-length substring by
inspecting the appropriate offset in the ovector, which contains
- PCRE2_UNSET for unset substrings, or by calling pcre2_sub-
+ PCRE2_UNSET for unset substrings, or by calling pcre2_sub-
string_length_bynumber().
@@ -2661,39 +2964,39 @@ EXTRACTING CAPTURED SUBSTRINGS BY NAME
void pcre2_substring_free(PCRE2_UCHAR *buffer);
- To extract a substring by name, you first have to find associated num-
+ To extract a substring by name, you first have to find the associated
number. For example, for this pattern:
(a+)b(?<xxx>\d+)...
the number of the subpattern called "xxx" is 2. If the name is known to
- be unique (PCRE2_DUPNAMES was not set), you can find the number from
+ be unique (PCRE2_DUPNAMES was not set), you can find the number from
the name by calling pcre2_substring_number_from_name(). The first argu-
- ment is the compiled pattern, and the second is the name. The yield of
+ ment is the compiled pattern, and the second is the name. The yield of
the function is the subpattern number, PCRE2_ERROR_NOSUBSTRING if there
- is no subpattern of that name, or PCRE2_ERROR_NOUNIQUESUBSTRING if
- there is more than one subpattern of that name. Given the number, you
- can extract the substring directly, or use one of the functions
- described above.
-
- For convenience, there are also "byname" functions that correspond to
- the "bynumber" functions, the only difference being that the second
- argument is a name instead of a number. If PCRE2_DUPNAMES is set and
+ is no subpattern of that name, or PCRE2_ERROR_NOUNIQUESUBSTRING if
+ there is more than one subpattern of that name. Given the number, you
+ can extract the substring directly from the ovector, or use one of the
+ "bynumber" functions described above.
+
+ For convenience, there are also "byname" functions that correspond to
+ the "bynumber" functions, the only difference being that the second
+ argument is a name instead of a number. If PCRE2_DUPNAMES is set and
there are duplicate names, these functions scan all the groups with the
given name, and return the first named string that is set.
- If there are no groups with the given name, PCRE2_ERROR_NOSUBSTRING is
- returned. If all groups with the name have numbers that are greater
- than the number of slots in the ovector, PCRE2_ERROR_UNAVAILABLE is
- returned. If there is at least one group with a slot in the ovector,
+ If there are no groups with the given name, PCRE2_ERROR_NOSUBSTRING is
+ returned. If all groups with the name have numbers that are greater
+ than the number of slots in the ovector, PCRE2_ERROR_UNAVAILABLE is
+ returned. If there is at least one group with a slot in the ovector,
but no group is found to be set, PCRE2_ERROR_UNSET is returned.
Warning: If the pattern uses the (?| feature to set up multiple subpat-
- terns with the same number, as described in the section on duplicate
- subpattern numbers in the pcre2pattern page, you cannot use names to
- distinguish the different subpatterns, because names are not included
- in the compiled code. The matching process uses only numbers. For this
- reason, the use of different names for subpatterns of the same number
+ terns with the same number, as described in the section on duplicate
+ subpattern numbers in the pcre2pattern page, you cannot use names to
+ distinguish the different subpatterns, because names are not included
+ in the compiled code. The matching process uses only numbers. For this
+ reason, the use of different names for subpatterns of the same number
causes an error at compile time.
@@ -2706,46 +3009,47 @@ CREATING A NEW STRING WITH SUBSTITUTIONS
PCRE2_SIZE rlength, PCRE2_UCHAR *outputbuffer,
PCRE2_SIZE *outlengthptr);
- This function calls pcre2_match() and then makes a copy of the subject
- string in outputbuffer, replacing the part that was matched with the
- replacement string, whose length is supplied in rlength. This can be
+ This function calls pcre2_match() and then makes a copy of the subject
+ string in outputbuffer, replacing the part that was matched with the
+ replacement string, whose length is supplied in rlength. This can be
given as PCRE2_ZERO_TERMINATED for a zero-terminated string. Matches in
- which a \K item in a lookahead in the pattern causes the match to end
+ which a \K item in a lookahead in the pattern causes the match to end
before it starts are not supported, and give rise to an error return.
- The first seven arguments of pcre2_substitute() are the same as for
+ The first seven arguments of pcre2_substitute() are the same as for
pcre2_match(), except that the partial matching options are not permit-
- ted, and match_data may be passed as NULL, in which case a match data
- block is obtained and freed within this function, using memory manage-
- ment functions from the match context, if provided, or else those that
+ ted, and match_data may be passed as NULL, in which case a match data
+ block is obtained and freed within this function, using memory manage-
+ ment functions from the match context, if provided, or else those that
were used to allocate memory for the compiled code.
- The outlengthptr argument must point to a variable that contains the
- length, in code units, of the output buffer. If the function is suc-
- cessful, the value is updated to contain the length of the new string,
+ The outlengthptr argument must point to a variable that contains the
+ length, in code units, of the output buffer. If the function is suc-
+ cessful, the value is updated to contain the length of the new string,
excluding the trailing zero that is automatically added.
- If the function is not successful, the value set via outlengthptr
- depends on the type of error. For syntax errors in the replacement
- string, the value is the offset in the replacement string where the
- error was detected. For other errors, the value is PCRE2_UNSET by
- default. This includes the case of the output buffer being too small,
- unless PCRE2_SUBSTITUTE_OVERFLOW_LENGTH is set (see below), in which
- case the value is the minimum length needed, including space for the
- trailing zero. Note that in order to compute the required length,
- pcre2_substitute() has to simulate all the matching and copying,
+ If the function is not successful, the value set via outlengthptr
+ depends on the type of error. For syntax errors in the replacement
+ string, the value is the offset in the replacement string where the
+ error was detected. For other errors, the value is PCRE2_UNSET by
+ default. This includes the case of the output buffer being too small,
+ unless PCRE2_SUBSTITUTE_OVERFLOW_LENGTH is set (see below), in which
+ case the value is the minimum length needed, including space for the
+ trailing zero. Note that in order to compute the required length,
+ pcre2_substitute() has to simulate all the matching and copying,
instead of giving an error return as soon as the buffer overflows. Note
also that the length is in code units, not bytes.
- In the replacement string, which is interpreted as a UTF string in UTF
- mode, and is checked for UTF validity unless the PCRE2_NO_UTF_CHECK
+ In the replacement string, which is interpreted as a UTF string in UTF
+ mode, and is checked for UTF validity unless the PCRE2_NO_UTF_CHECK
option is set, a dollar character is an escape character that can spec-
- ify the insertion of characters from capturing groups or (*MARK) items
- in the pattern. The following forms are always recognized:
+ ify the insertion of characters from capturing groups or (*MARK),
+ (*PRUNE), or (*THEN) items in the pattern. The following forms are
+ always recognized:
$$ insert a dollar character
$<n> or ${<n>} insert the contents of group <n>
- $*MARK or ${*MARK} insert the name of the last (*MARK) encountered
+ $*MARK or ${*MARK} insert a (*MARK), (*PRUNE), or (*THEN) name
Either a group number or a group name can be given for <n>. Curly
brackets are required only if the following character would be inter-
@@ -2754,24 +3058,44 @@ CREATING A NEW STRING WITH SUBSTITUTIONS
matched with "=abc=" and the replacement string "+$1$0$1+", the result
is "=+babcb+=".
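The expansion in the example above can be sketched in plain C. This is an illustration only, not PCRE2's implementation: the function names are hypothetical, only "$$" and single-digit "$<n>" forms are handled, every referenced group is assumed to be set, and the offsets in the usage note assume a pattern such as a(b)c, which is not shown in this hunk.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: expands "$$" and "$<digit>" from repl into out.
   ovector holds start/end offset pairs, group 0 first.
   Returns the expanded length, or -1 on overflow or an unsupported form. */
int expand_replacement(const char *subject, const size_t *ovector,
                       int pair_count, const char *repl,
                       char *out, size_t out_size)
{
    size_t n = 0;
    for (const char *p = repl; *p != '\0'; p++) {
        const char *src = p;   /* default: copy one literal character */
        size_t len = 1;
        if (*p == '$') {
            p++;               /* look at the character after the dollar */
            if (*p == '$') {
                src = p;       /* "$$" inserts a single dollar */
            } else if (*p >= '0' && *p <= '9' && *p - '0' < pair_count) {
                int group = *p - '0';
                src = subject + ovector[2 * group];
                len = ovector[2 * group + 1] - ovector[2 * group];
            } else {
                return -1;     /* unsupported form or unknown group */
            }
        }
        if (n + len >= out_size) return -1;   /* keep room for the zero */
        memcpy(out + n, src, len);
        n += len;
    }
    if (n >= out_size) return -1;
    out[n] = '\0';
    return (int)n;
}

/* Hypothetical helper: copies the text before the match, the expanded
   replacement, and the text after the match. Returns the new length. */
int substitute_once(const char *subject, const size_t *ovector,
                    int pair_count, const char *repl,
                    char *out, size_t out_size)
{
    size_t n = ovector[0];
    size_t tail = strlen(subject + ovector[1]);
    if (n >= out_size) return -1;
    memcpy(out, subject, n);
    int added = expand_replacement(subject, ovector, pair_count, repl,
                                   out + n, out_size - n);
    if (added < 0) return -1;
    n += (size_t)added;
    if (n + tail >= out_size) return -1;
    memcpy(out + n, subject + ovector[1], tail + 1);   /* includes the zero */
    return (int)(n + tail);
}
```

With the subject "=abc=" and the offsets {1, 4, 2, 3} (group 0 is "abc", group 1 is "b"), the replacement "+$1$0$1+" expands to "+babcb+" and the whole output is "=+babcb+=".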
- The facility for inserting a (*MARK) name can be used to perform simple
- simultaneous substitutions, as this pcre2test example shows:
+ $*MARK inserts the name from the last encountered (*MARK), (*PRUNE), or
+ (*THEN) on the matching path that has a name. (*MARK) must always
+ include a name, but (*PRUNE) and (*THEN) need not. For example, in the
+ case of (*MARK:A)(*PRUNE) the name inserted is "A", but for
+ (*MARK:A)(*PRUNE:B) the relevant name is "B". This facility can be
+ used to perform simple simultaneous substitutions, as this pcre2test
+ example shows:
- /(*:pear)apple|(*:orange)lemon/g,replace=${*MARK}
+ /(*MARK:pear)apple|(*MARK:orange)lemon/g,replace=${*MARK}
apple lemon
2: pear orange
- As well as the usual options for pcre2_match(), a number of additional
- options can be set in the options argument.
+ As well as the usual options for pcre2_match(), a number of additional
+ options can be set in the options argument of pcre2_substitute().
PCRE2_SUBSTITUTE_GLOBAL causes the function to iterate over the subject
- string, replacing every matching substring. If this is not set, only
- the first matching substring is replaced. If any matched substring has
- zero length, after the substitution has happened, an attempt to find a
- non-empty match at the same position is performed. If this is not suc-
- cessful, the current position is advanced by one character except when
- CRLF is a valid newline sequence and the next two characters are CR,
- LF. In this case, the current position is advanced by two characters.
+ string, replacing every matching substring. If this option is not set,
+ only the first matching substring is replaced. The search for matches
+ takes place in the original subject string (that is, previous replace-
+ ments do not affect it). Iteration is implemented by advancing the
+ startoffset value for each search, which is always passed the entire
+ subject string. If an offset limit is set in the match context, search-
+ ing stops when that limit is reached.
+
+ You can restrict the effect of a global substitution to a portion of
+ the subject string by setting either or both of startoffset and an off-
+ set limit. Here is a pcre2test example:
+
+ /B/g,replace=!,use_offset_limit
+ ABC ABC ABC ABC\=offset=3,offset_limit=12
+ 2: ABC A!C A!C ABC
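The effect of the window in this pcre2test example can be imitated without the library for the special case of a one-character literal pattern. The helper below is hypothetical and only models the outcome; it assumes that a match may start at the offset limit itself but not beyond it.

```c
#include <stddef.h>

/* Hypothetical helper, not part of PCRE2: replaces each occurrence of the
   character `find` whose position lies in [start_offset, offset_limit].
   Text outside that window is copied unchanged.
   `out` must have room for length + 1 characters. */
void replace_char_in_window(const char *subject, size_t length,
                            size_t start_offset, size_t offset_limit,
                            char find, char replacement, char *out)
{
    for (size_t i = 0; i < length; i++) {
        int in_window = (i >= start_offset && i <= offset_limit);
        out[i] = (in_window && subject[i] == find) ? replacement : subject[i];
    }
    out[length] = '\0';
}
```

For "ABC ABC ABC ABC" with a start offset of 3 and a limit of 12, the B characters sit at offsets 1, 5, 9, and 13, so only those at 5 and 9 are replaced.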
+
+ When continuing with global substitutions after matching a substring
+ with zero length, an attempt to find a non-empty match at the same off-
+ set is performed. If this is not successful, the offset is advanced by
+ one character except when CRLF is a valid newline sequence and the next
+ two characters are CR, LF. In this case, the offset is advanced by two
+ characters.
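The advance rule for the case where no non-empty match is found at the same offset can be restated as a small C function. This is a sketch under the assumption that every character is one code unit (no UTF handling); the function name is hypothetical and is not a PCRE2 API.

```c
#include <stddef.h>

/* Hypothetical helper: returns the next search offset after an empty match
   at `offset` when no non-empty match was found there. The offset moves by
   two only if CRLF is a valid newline and the next two characters are CR, LF. */
size_t advance_after_empty_match(const char *subject, size_t length,
                                 size_t offset, int crlf_is_newline)
{
    if (crlf_is_newline && offset + 1 < length &&
        subject[offset] == '\r' && subject[offset + 1] == '\n')
        return offset + 2;
    return offset + 1;
}
```

Stepping over CR and LF together avoids restarting the search between the two halves of a single newline sequence.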
PCRE2_SUBSTITUTE_OVERFLOW_LENGTH changes what happens when the output
buffer is too small. The default action is to return PCRE2_ERROR_NOMEM-
@@ -2883,10 +3207,10 @@ CREATING A NEW STRING WITH SUBSTITUTIONS
PCRE2_ERROR_BADREPLACEMENT is used for miscellaneous syntax errors in
the replacement string, with more particular errors being
PCRE2_ERROR_BADREPESCAPE (invalid escape sequence), PCRE2_ERROR_REP-
- MISSING_BRACE (closing curly bracket not found), PCRE2_BADSUBSTITUTION
- (syntax error in extended group substitution), and PCRE2_BADSUBPATTERN
- (the pattern match ended before it started, which can happen if \K is
- used in an assertion).
+ MISSINGBRACE (closing curly bracket not found), PCRE2_ERROR_BADSUBSTI-
+ TUTION (syntax error in extended group substitution), and
+ PCRE2_ERROR_BADSUBSPATTERN (the pattern match ended before it started,
+ which can happen if \K is used in an assertion).
As for all PCRE2 errors, a text message that describes the error can be
obtained by calling the pcre2_get_error_message() function (see
@@ -2961,13 +3285,13 @@ MATCHING A PATTERN: THE ALTERNATIVE FUNCTION
The function pcre2_dfa_match() is called to match a subject string
against a compiled pattern, using a matching algorithm that scans the
- subject string just once, and does not backtrack. This has different
- characteristics to the normal algorithm, and is not compatible with
- Perl. Some of the features of PCRE2 patterns are not supported. Never-
- theless, there are times when this kind of matching can be useful. For
- a discussion of the two matching algorithms, and a list of features
- that pcre2_dfa_match() does not support, see the pcre2matching documen-
- tation.
+ subject string just once (not counting lookaround assertions), and does
+ not backtrack. This has different characteristics to the normal algo-
+ rithm, and is not compatible with Perl. Some of the features of PCRE2
+ patterns are not supported. Nevertheless, there are times when this
+ kind of matching can be useful. For a discussion of the two matching
+ algorithms, and a list of features that pcre2_dfa_match() does not sup-
+ port, see the pcre2matching documentation.
The arguments for the pcre2_dfa_match() function are the same as for
pcre2_match(), plus two extras. The ovector within the match data block
@@ -2991,7 +3315,7 @@ MATCHING A PATTERN: THE ALTERNATIVE FUNCTION
11, /* the length of the subject string */
0, /* start at offset 0 in the subject */
0, /* default options */
- match_data, /* the match data block */
+ md, /* the match data block */
NULL, /* a match context; NULL means use defaults */
wspace, /* working space vector */
20); /* number of elements (NOT size in bytes) */
@@ -2999,12 +3323,12 @@ MATCHING A PATTERN: THE ALTERNATIVE FUNCTION
Option bits for pcre2_dfa_match()
The unused bits of the options argument for pcre2_dfa_match() must be
- zero. The only bits that may be set are PCRE2_ANCHORED, PCRE2_NOTBOL,
- PCRE2_NOTEOL, PCRE2_NOTEMPTY, PCRE2_NOTEMPTY_ATSTART,
- PCRE2_NO_UTF_CHECK, PCRE2_PARTIAL_HARD, PCRE2_PARTIAL_SOFT,
- PCRE2_DFA_SHORTEST, and PCRE2_DFA_RESTART. All but the last four of
- these are exactly the same as for pcre2_match(), so their description
- is not repeated here.
+ zero. The only bits that may be set are PCRE2_ANCHORED, PCRE2_ENDAN-
+ CHORED, PCRE2_NOTBOL, PCRE2_NOTEOL, PCRE2_NOTEMPTY,
+ PCRE2_NOTEMPTY_ATSTART, PCRE2_NO_UTF_CHECK, PCRE2_PARTIAL_HARD,
+ PCRE2_PARTIAL_SOFT, PCRE2_DFA_SHORTEST, and PCRE2_DFA_RESTART. All but
+ the last four of these are exactly the same as for pcre2_match(), so
+ their description is not repeated here.
PCRE2_PARTIAL_HARD
PCRE2_PARTIAL_SOFT
@@ -3093,7 +3417,7 @@ MATCHING A PATTERN: THE ALTERNATIVE FUNCTION
example, the pattern "a\d+" is compiled as if it were "a\d++". For DFA
matching, this means that only one possible match is found. If you
really do want multiple matches in such cases, either use an ungreedy
- repeat auch as "a\d+?" or set the PCRE2_NO_AUTO_POSSESS option when
+ repeat such as "a\d+?" or set the PCRE2_NO_AUTO_POSSESS option when
compiling.
Error returns from pcre2_dfa_match()
@@ -3138,8 +3462,7 @@ MATCHING A PATTERN: THE ALTERNATIVE FUNCTION
SEE ALSO
pcre2build(3), pcre2callout(3), pcre2demo(3), pcre2matching(3),
- pcre2partial(3), pcre2posix(3), pcre2sample(3), pcre2stack(3),
- pcre2unicode(3).
+ pcre2partial(3), pcre2posix(3), pcre2sample(3), pcre2unicode(3).
AUTHOR
@@ -3151,8 +3474,8 @@ AUTHOR
REVISION
- Last updated: 17 June 2016
- Copyright (c) 1997-2016 University of Cambridge.
+ Last updated: 31 December 2017
+ Copyright (c) 1997-2017 University of Cambridge.
------------------------------------------------------------------------------
@@ -3198,21 +3521,21 @@ PCRE2 BUILD-TIME OPTIONS
./configure --help
- The following sections include descriptions of options whose names
- begin with --enable or --disable. These settings specify changes to the
- defaults for the configure command. Because of the way that configure
- works, --enable and --disable always come in pairs, so the complemen-
- tary option always exists as well, but as it specifies the default, it
- is not described.
+ The following sections include descriptions of "on/off" options whose
+ names begin with --enable or --disable. Because of the way that config-
+ ure works, --enable and --disable always come in pairs, so the comple-
+ mentary option always exists as well, but as it specifies the default,
+ it is not described. Options that specify values have names that start
+ with --with.
BUILDING 8-BIT, 16-BIT AND 32-BIT LIBRARIES
By default, a library called libpcre2-8 is built, containing functions
- that take string arguments contained in vectors of bytes, interpreted
+ that take string arguments contained in arrays of bytes, interpreted
either as single-byte characters, or UTF-8 strings. You can also build
two other libraries, called libpcre2-16 and libpcre2-32, which process
- strings that are contained in vectors of 16-bit and 32-bit code units,
+ strings that are contained in arrays of 16-bit and 32-bit code units,
respectively. These can be interpreted either as single-unit characters
or UTF-16/UTF-32 strings. To build these additional libraries, add one
or both of the following to the configure command:
@@ -3260,11 +3583,11 @@ UNICODE AND UTF SUPPORT
application has locked this out by setting PCRE2_NEVER_UTF.
UTF support allows the libraries to process character code points up to
- 0x10ffff in the strings that they handle. It also provides support for
- accessing the Unicode properties of such characters, using pattern
- escapes such as \P, \p, and \X. Only the general category properties
- such as Lu and Nd are supported. Details are given in the pcre2pattern
- documentation.
+ 0x10ffff in the strings that they handle. Unicode support also gives
+ access to the Unicode properties of characters, using pattern escapes
+ such as \P, \p, and \X. Only the general category properties such as Lu
+ and Nd are supported. Details are given in the pcre2pattern documenta-
+ tion.
Pattern escapes such as \d and \w do not by default make use of Unicode
properties. The application can request that they do by setting the
@@ -3287,15 +3610,21 @@ DISABLING THE USE OF \C
JUST-IN-TIME COMPILER SUPPORT
- Just-in-time compiler support is included in the build by specifying
+ Just-in-time (JIT) compiler support is included in the build by speci-
+ fying
--enable-jit
- This support is available only for certain hardware architectures. If
- this option is set for an unsupported architecture, a building error
- occurs. See the pcre2jit documentation for a discussion of JIT usage.
- When JIT support is enabled, pcre2grep automatically makes use of it,
- unless you add
+ This support is available only for certain hardware architectures. If
+ this option is set for an unsupported architecture, a build error
+ occurs. If you are running under SELinux you may also want to add
+
+ --enable-jit-sealloc
+
+ which enables the use of an execmem allocator in JIT that is compatible
+ with SELinux. This has no effect if JIT is not enabled. See the
+ pcre2jit documentation for a discussion of JIT usage. When JIT support
+ is enabled, pcre2grep automatically makes use of it, unless you add
--disable-pcre2grep-jit
@@ -3325,7 +3654,7 @@ NEWLINE RECOGNITION
--enable-newline-is-anycrlf
which causes PCRE2 to recognize any of the three sequences CR, LF, or
- CRLF as indicating a line ending. Finally, a fifth option, specified by
+ CRLF as indicating a line ending. A fifth option, specified by
--enable-newline-is-any
@@ -3333,144 +3662,148 @@ NEWLINE RECOGNITION
newline sequences are the three just mentioned, plus the single charac-
ters VT (vertical tab, U+000B), FF (form feed, U+000C), NEL (next line,
U+0085), LS (line separator, U+2028), and PS (paragraph separator,
- U+2029).
+ U+2029). The final option is
+
+ --enable-newline-is-nul
+
+ which causes NUL (binary zero) to be set as the default line-ending
+ character.
Whatever default line ending convention is selected when PCRE2 is built
- can be overridden by applications that use the library. At build time
- it is conventional to use the standard for your operating system.
+ can be overridden by applications that use the library. At build time
+ it is recommended to use the standard for your operating system.
WHAT \R MATCHES
- By default, the sequence \R in a pattern matches any Unicode newline
- sequence, independently of what has been selected as the line ending
+ By default, the sequence \R in a pattern matches any Unicode newline
+ sequence, independently of what has been selected as the line ending
sequence. If you specify
--enable-bsr-anycrlf
- the default is changed so that \R matches only CR, LF, or CRLF. What-
- ever is selected when PCRE2 is built can be overridden by applications
- that use the called.
+ the default is changed so that \R matches only CR, LF, or CRLF. What-
+ ever is selected when PCRE2 is built can be overridden by applications
+ that use the library.
HANDLING VERY LARGE PATTERNS
- Within a compiled pattern, offset values are used to point from one
- part to another (for example, from an opening parenthesis to an alter-
- nation metacharacter). By default, in the 8-bit and 16-bit libraries,
- two-byte values are used for these offsets, leading to a maximum size
- for a compiled pattern of around 64K code units. This is sufficient to
+ Within a compiled pattern, offset values are used to point from one
+ part to another (for example, from an opening parenthesis to an alter-
+ nation metacharacter). By default, in the 8-bit and 16-bit libraries,
+ two-byte values are used for these offsets, leading to a maximum size
+ for a compiled pattern of around 64K code units. This is sufficient to
handle all but the most gigantic patterns. Nevertheless, some people do
- want to process truly enormous patterns, so it is possible to compile
- PCRE2 to use three-byte or four-byte offsets by adding a setting such
+ want to process truly enormous patterns, so it is possible to compile
+ PCRE2 to use three-byte or four-byte offsets by adding a setting such
as
--with-link-size=3
- to the configure command. The value given must be 2, 3, or 4. For the
- 16-bit library, a value of 3 is rounded up to 4. In these libraries,
- using longer offsets slows down the operation of PCRE2 because it has
- to load additional data when handling them. For the 32-bit library the
- value is always 4 and cannot be overridden; the value of --with-link-
+ to the configure command. The value given must be 2, 3, or 4. For the
+ 16-bit library, a value of 3 is rounded up to 4. In these libraries,
+ using longer offsets slows down the operation of PCRE2 because it has
+ to load additional data when handling them. For the 32-bit library the
+ value is always 4 and cannot be overridden; the value of --with-link-
size is ignored.
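The link-size rules above can be sketched as a small C function. This is only an illustration of the text, not code from PCRE2: the name effective_link_size and the convention of returning 0 for an invalid request are assumptions made for this sketch.

```c
#include <assert.h>

/* Hypothetical helper (not part of PCRE2): given the code unit width of the
   library (8, 16, or 32) and the value passed to --with-link-size, return the
   link size that is actually used, or 0 if the request is not 2, 3, or 4. */
int effective_link_size(int code_unit_width, int requested)
{
    assert(code_unit_width == 8 || code_unit_width == 16 ||
           code_unit_width == 32);

    /* The value given must be 2, 3, or 4. */
    if (requested < 2 || requested > 4) return 0;

    /* For the 32-bit library the value is always 4; the setting is ignored. */
    if (code_unit_width == 32) return 4;

    /* For the 16-bit library, a value of 3 is rounded up to 4. */
    if (code_unit_width == 16 && requested == 3) return 4;

    return requested;
}
```

With two-byte offsets there are 2^16 = 65536 distinct offset values, which is where the figure of around 64K code units comes from.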
-AVOIDING EXCESSIVE STACK USAGE
-
- When matching with the pcre2_match() function, PCRE2 implements back-
- tracking by making recursive calls to an internal function called
- match(). In environments where the size of the stack is limited, this
- can severely limit PCRE2's operation. (The Unix environment does not
- usually suffer from this problem, but it may sometimes be necessary to
- increase the maximum stack size. There is a discussion in the
- pcre2stack documentation.) An alternative approach to recursion that
- uses memory from the heap to remember data, instead of using recursive
- function calls, has been implemented to work round the problem of lim-
- ited stack size. If you want to build a version of PCRE2 that works
- this way, add
+LIMITING PCRE2 RESOURCE USAGE
- --disable-stack-for-recursion
+ The pcre2_match() function increments a counter each time it goes round
+ its main loop. Putting a limit on this counter controls the amount of
+ computing resource used by a single call to pcre2_match(). The limit
+ can be changed at run time, as described in the pcre2api documentation.
+ The default is 10 million, but this can be changed by adding a setting
+ such as
- to the configure command. By default, the system functions malloc() and
- free() are called to manage the heap memory that is required, but cus-
- tom memory management functions can be called instead. PCRE2 runs
- noticeably more slowly when built in this way. This option affects only
- the pcre2_match() function; it is not relevant for pcre2_dfa_match().
+ --with-match-limit=500000
+ to the configure command. This setting also applies to the
+ pcre2_dfa_match() matching function, and to JIT matching (though the
+ counting is done differently).
-LIMITING PCRE2 RESOURCE USAGE
+ The pcre2_match() function starts out using a 20K vector on the system
+ stack to record backtracking points. The more nested backtracking
+ points there are (that is, the deeper the search tree), the more memory
+ is needed. If the initial vector is not large enough, heap memory is
+ used, up to a certain limit, which is specified in kilobytes. The limit
+ can be changed at run time, as described in the pcre2api documentation.
+ The default limit (in effect unlimited) is 20 million. You can change
+ this by a setting such as
- Internally, PCRE2 has a function called match(), which it calls repeat-
- edly (sometimes recursively) when matching a pattern with the
- pcre2_match() function. By controlling the maximum number of times this
- function may be called during a single matching operation, a limit can
- be placed on the resources used by a single call to pcre2_match(). The
- limit can be changed at run time, as described in the pcre2api documen-
- tation. The default is 10 million, but this can be changed by adding a
- setting such as
+ --with-heap-limit=500
- --with-match-limit=500000
+ which limits the amount of heap to 500 kilobytes. This limit applies
+ only to interpretive matching in pcre2_match(). It does not apply when
+ JIT (which has its own memory arrangements) is used, nor does it apply
+ to pcre2_dfa_match().
- to the configure command. This setting has no effect on the
- pcre2_dfa_match() matching function.
+ You can also explicitly limit the depth of nested backtracking in the
+ pcre2_match() interpreter. This limit defaults to the value that is set
+ for --with-match-limit. You can set a lower default limit by adding,
+ for example,
- In some environments it is desirable to limit the depth of recursive
- calls of match() more strictly than the total number of calls, in order
- to restrict the maximum amount of stack (or heap, if --disable-stack-
- for-recursion is specified) that is used. A second limit controls this;
- it defaults to the value that is set for --with-match-limit, which
- imposes no additional constraints. However, you can set a lower limit
- by adding, for example,
+ --with-match-limit-depth=10000
- --with-match-limit-recursion=10000
+ to the configure command. This value can be overridden at run time.
+ This depth limit indirectly limits the amount of heap memory that is
+ used, but because the size of each backtracking "frame" depends on the
+ number of capturing parentheses in a pattern, the amount of heap that
+ is used before the limit is reached varies from pattern to pattern.
+ This limit was more useful in versions before 10.30, where function
+ recursion was used for backtracking.
- to the configure command. This value can also be overridden at run
- time.
+ As well as applying to pcre2_match(), the depth limit also controls the
+ depth of recursive function calls in pcre2_dfa_match(). These are used
+ for lookaround assertions, atomic groups, and recursion within pat-
+ terns. The limit does not apply to JIT matching.
CREATING CHARACTER TABLES AT BUILD TIME
PCRE2 uses fixed tables for processing characters whose code points are
less than 256. By default, PCRE2 is built with a set of tables that are
- distributed in the file src/pcre2_chartables.c.dist. These tables are
+ distributed in the file src/pcre2_chartables.c.dist. These tables are
for ASCII codes only. If you add
--enable-rebuild-chartables
- to the configure command, the distributed tables are no longer used.
- Instead, a program called dftables is compiled and run. This outputs
+ to the configure command, the distributed tables are no longer used.
+ Instead, a program called dftables is compiled and run. This outputs
the source for a new set of tables, created in the default locale of your
- C run-time system. (This method of replacing the tables does not work
- if you are cross compiling, because dftables is run on the local host.
- If you need to create alternative tables when cross compiling, you will
- have to do so "by hand".)
+ C run-time system. This method of replacing the tables does not work if
+ you are cross compiling, because dftables is run on the local host. If
+ you need to create alternative tables when cross compiling, you will
+ have to do so "by hand".
USING EBCDIC CODE
- PCRE2 assumes by default that it will run in an environment where the
- character code is ASCII or Unicode, which is a superset of ASCII. This
+ PCRE2 assumes by default that it will run in an environment where the
+ character code is ASCII or Unicode, which is a superset of ASCII. This
is the case for most computer operating systems. PCRE2 can, however, be
compiled to run in an 8-bit EBCDIC environment by adding
--enable-ebcdic --disable-unicode
to the configure command. This setting implies --enable-rebuild-charta-
- bles. You should only use it if you know that you are in an EBCDIC
+ bles. You should only use it if you know that you are in an EBCDIC
environment (for example, an IBM mainframe operating system).
- It is not possible to support both EBCDIC and UTF-8 codes in the same
- version of the library. Consequently, --enable-unicode and --enable-
+ It is not possible to support both EBCDIC and UTF-8 codes in the same
+ version of the library. Consequently, --enable-unicode and --enable-
ebcdic are mutually exclusive.
The EBCDIC character that corresponds to an ASCII LF is assumed to have
- the value 0x15 by default. However, in some EBCDIC environments, 0x25
+ the value 0x15 by default. However, in some EBCDIC environments, 0x25
is used. In such an environment you should use
--enable-ebcdic-nl25
as well as, or instead of, --enable-ebcdic. The EBCDIC character for CR
- has the same value as in ASCII, namely, 0x0d. Whichever of 0x15 and
+ has the same value as in ASCII, namely, 0x0d. Whichever of 0x15 and
0x25 is not chosen as LF is made to correspond to the Unicode NEL char-
acter (which, in Unicode, is 0x85).
@@ -3483,39 +3816,44 @@ PCRE2GREP SUPPORT FOR EXTERNAL SCRIPTS
By default, on non-Windows systems, pcre2grep supports the use of call-
outs with string arguments within the patterns it is matching, in order
- to run external scripts. For details, see the pcre2grep documentation.
- This support can be disabled by adding --disable-pcre2grep-callout to
+ to run external scripts. For details, see the pcre2grep documentation.
+ This support can be disabled by adding --disable-pcre2grep-callout to
the configure command.
PCRE2GREP OPTIONS FOR COMPRESSED FILE SUPPORT
- By default, pcre2grep reads all files as plain text. You can build it
- so that it recognizes files whose names end in .gz or .bz2, and reads
+ By default, pcre2grep reads all files as plain text. You can build it
+ so that it recognizes files whose names end in .gz or .bz2, and reads
them with libz or libbz2, respectively, by adding one or both of
--enable-pcre2grep-libz
--enable-pcre2grep-libbz2
to the configure command. These options naturally require that the rel-
- evant libraries are installed on your system. Configuration will fail
+ evant libraries are installed on your system. Configuration will fail
if they are not.
PCRE2GREP BUFFER SIZE
- pcre2grep uses an internal buffer to hold a "window" on the file it is
+ pcre2grep uses an internal buffer to hold a "window" on the file it is
scanning, in order to be able to output "before" and "after" lines when
- it finds a match. The size of the buffer is controlled by a parameter
- whose default value is 20K. The buffer itself is three times this size,
- but because of the way it is used for holding "before" lines, the long-
- est line that is guaranteed to be processable is the parameter size.
- You can change the default parameter value by adding, for example,
+ it finds a match. The starting size of the buffer is controlled by a
+ parameter whose default value is 20K. The buffer itself is three times
+ this size, but because of the way it is used for holding "before"
+ lines, the longest line that is guaranteed to be processable is the
+ parameter size. If a longer line is encountered, pcre2grep automati-
+ cally expands the buffer, up to a specified maximum size, whose default
+ is 1M or the starting size, whichever is the larger. You can change the
+ default parameter values by adding, for example,
- --with-pcre2grep-bufsize=50K
+ --with-pcre2grep-bufsize=51200
+ --with-pcre2grep-max-bufsize=2097152
- to the configure command. The caller of pcre2grep can override this
- value by using --buffer-size on the command line.
+ to the configure command. The caller of pcre2grep can override these
+ values by using --buffer-size and --max-buffer-size on the command
+ line.
PCRE2TEST OPTION FOR LIBREADLINE SUPPORT
@@ -3525,26 +3863,26 @@ PCRE2TEST OPTION FOR LIBREADLINE SUPPORT
--enable-pcre2test-libreadline
--enable-pcre2test-libedit
- to the configure command, pcre2test is linked with the libreadline
+ to the configure command, pcre2test is linked with the libreadline
or libedit library, respectively, and when its input is from a terminal,
- it reads it using the readline() function. This provides line-editing
- and history facilities. Note that libreadline is GPL-licensed, so if
- you distribute a binary of pcre2test linked in this way, there may be
+ it reads it using the readline() function. This provides line-editing
+ and history facilities. Note that libreadline is GPL-licensed, so if
+ you distribute a binary of pcre2test linked in this way, there may be
licensing issues. These can be avoided by linking instead with libedit,
which has a BSD licence.
- Setting --enable-pcre2test-libreadline causes the -lreadline option to
- be added to the pcre2test build. In many operating environments with a
- sytem-installed readline library this is sufficient. However, in some
+ Setting --enable-pcre2test-libreadline causes the -lreadline option to
+ be added to the pcre2test build. In many operating environments with a
+ system-installed readline library this is sufficient. However, in some
environments (e.g. if an unmodified distribution version of readline is
- in use), some extra configuration may be necessary. The INSTALL file
+ in use), some extra configuration may be necessary. The INSTALL file
for libreadline says this:
"Readline uses the termcap functions, but does not link with
the termcap or curses library itself, allowing applications
which link with readline the to choose an appropriate library."
- If your environment has not been set up so that an appropriate library
+ If your environment has not been set up so that an appropriate library
is automatically included, you may need to add something like
LIBS="-lncurses"
@@ -3558,7 +3896,7 @@ INCLUDING DEBUGGING CODE
--enable-debug
- to the configure command, additional debugging code is included in the
+ to the configure command, additional debugging code is included in the
build. This feature is intended for use by the PCRE2 maintainers.
@@ -3568,15 +3906,15 @@ DEBUGGING WITH VALGRIND SUPPORT
--enable-valgrind
- to the configure command, PCRE2 will use valgrind annotations to mark
- certain memory regions as unaddressable. This allows it to detect
- invalid memory accesses, and is mostly useful for debugging PCRE2
+ to the configure command, PCRE2 will use valgrind annotations to mark
+ certain memory regions as unaddressable. This allows it to detect
+ invalid memory accesses, and is mostly useful for debugging PCRE2
itself.
CODE COVERAGE REPORTING
- If your C compiler is gcc, you can build a version of PCRE2 that can
+ If your C compiler is gcc, you can build a version of PCRE2 that can
generate a code coverage report for its test suite. To enable this, you
must install lcov version 1.6 or above. Then specify
@@ -3585,20 +3923,20 @@ CODE COVERAGE REPORTING
to the configure command and build PCRE2 in the usual way.
Note that using ccache (a caching C compiler) is incompatible with code
- coverage reporting. If you have configured ccache to run automatically
+ coverage reporting. If you have configured ccache to run automatically
on your system, you must set the environment variable
CCACHE_DISABLE=1
before running make to build PCRE2, so that ccache is not used.
- When --enable-coverage is used, the following addition targets are
+ When --enable-coverage is used, the following additional targets are
added to the Makefile:
make coverage
- This creates a fresh coverage report for the PCRE2 test suite. It is
- equivalent to running "make coverage-reset", "make coverage-baseline",
+ This creates a fresh coverage report for the PCRE2 test suite. It is
+ equivalent to running "make coverage-reset", "make coverage-baseline",
"make check", and then "make coverage-report".
make coverage-reset
@@ -3615,21 +3953,59 @@ CODE COVERAGE REPORTING
make coverage-clean-report
- This removes the generated coverage report without cleaning the cover-
+ This removes the generated coverage report without cleaning the cover-
age data itself.
make coverage-clean-data
- This removes the captured coverage data without removing the coverage
+ This removes the captured coverage data without removing the coverage
files created at compile time (*.gcno).
make coverage-clean
- This cleans all coverage data including the generated coverage report.
- For more information about code coverage, see the gcov and lcov docu-
+ This cleans all coverage data including the generated coverage report.
+ For more information about code coverage, see the gcov and lcov docu-
mentation.
+SUPPORT FOR FUZZERS
+
+ There is a special option for use by people who want to run fuzzing
+ tests on PCRE2:
+
+ --enable-fuzz-support
+
+ At present this applies only to the 8-bit library. If set, it causes an
+ extra library called libpcre2-fuzzsupport.a to be built, but not
+ installed. This contains a single function called LLVMFuzzerTestOneIn-
+ put() whose arguments are a pointer to a string and the length of the
+ string. When called, this function tries to compile the string as a
+ pattern, and if that succeeds, to match it. This is done both with no
+ options and with some random options bits that are generated from the
+ string.
+
+ Setting --enable-fuzz-support also causes a binary called pcre2fuz-
+ zcheck to be created. This is normally run under valgrind or used when
+ PCRE2 is compiled with address sanitizing enabled. It calls the fuzzing
+ function and outputs information about what it is doing. The input strings
+ are specified by arguments: if an argument starts with "=" the rest of
+ it is a literal input string. Otherwise, it is assumed to be a file
+ name, and the contents of the file are the test string.
+
+
+OBSOLETE OPTION
+
+ In versions of PCRE2 prior to 10.30, there were two ways of handling
+ backtracking in the pcre2_match() function. The default was to use the
+ system stack, but if
+
+ --disable-stack-for-recursion
+
+ was set, memory on the heap was used. From release 10.30 onwards this
+ has changed (the stack is no longer used) and this option now does
+ nothing except give a warning.
+
+
SEE ALSO
pcre2api(3), pcre2-config(3).
@@ -3644,8 +4020,8 @@ AUTHOR
REVISION
- Last updated: 01 April 2016
- Copyright (c) 1997-2016 University of Cambridge.
+ Last updated: 18 July 2017
+ Copyright (c) 1997-2017 University of Cambridge.
------------------------------------------------------------------------------
@@ -3689,14 +4065,22 @@ DESCRIPTION
If the PCRE2_AUTO_CALLOUT option bit is set when a pattern is compiled,
PCRE2 automatically inserts callouts, all with number 255, before each
- item in the pattern. For example, if PCRE2_AUTO_CALLOUT is used with
- the pattern
+ item in the pattern except for immediately before or after an explicit
+ callout. For example, if PCRE2_AUTO_CALLOUT is used with the pattern
- A(\d{2}|--)
+ A(?C3)B
it is processed as if it were
- (?C255)A(?C255)((?C255)\d{2}(?C255)|(?C255)-(?C255)-(?C255))(?C255)
+ (?C255)A(?C3)B(?C255)
+
+ Here is a more complicated example:
+
+ A(\d{2}|--)
+
+ With PCRE2_AUTO_CALLOUT, this pattern is processed as if it were
+
+ (?C255)A(?C255)((?C255)\d{2}(?C255)|(?C255)-(?C255)-(?C255))(?C255)
Notice that there is a callout before and after each parenthesis and
alternation bar. If the pattern contains a conditional group whose con-
@@ -3737,10 +4121,11 @@ MISSING CALLOUTS
No match
This indicates that when matching [bc] fails, there is no backtracking
- into a+ and therefore the callouts that would be taken for the back-
- tracks do not occur. You can disable the auto-possessify feature by
- passing PCRE2_NO_AUTO_POSSESS to pcre2_compile(), or starting the pat-
- tern with (*NO_AUTO_POSSESS). In this case, the output changes to this:
+ into a+ (because it is being treated as a++) and therefore the callouts
+ that would be taken for the backtracks do not occur. You can disable
+ the auto-possessify feature by passing PCRE2_NO_AUTO_POSSESS to
+ pcre2_compile(), or starting the pattern with (*NO_AUTO_POSSESS). In
+ this case, the output changes to this:
--->aaaa
+0 ^ a+
@@ -3756,14 +4141,17 @@ MISSING CALLOUTS
Automatic .* anchoring
By default, an optimization is applied when .* is the first significant
- item in a pattern. If PCRE2_DOTALL is set, so that the dot can match
- any character, the pattern is automatically anchored. If PCRE2_DOTALL
- is not set, a match can start only after an internal newline or at the
- beginning of the subject, and pcre2_compile() remembers this. This
- optimization is disabled, however, if .* is in an atomic group or if
- there is a back reference to the capturing group in which it appears.
- It is also disabled if the pattern contains (*PRUNE) or (*SKIP). How-
- ever, the presence of callouts does not affect it.
+ item in a pattern. If PCRE2_DOTALL is set, so that the dot can match
+ any character, the pattern is automatically anchored. If PCRE2_DOTALL
+ is not set, a match can start only after an internal newline or at the
+ beginning of the subject, and pcre2_compile() remembers this. If a pat-
+ tern has more than one top-level branch, automatic anchoring occurs if
+ all branches are anchorable.
+
+ This optimization is disabled, however, if .* is in an atomic group or
+ if there is a back reference to the capturing group in which it
+ appears. It is also disabled if the pattern contains (*PRUNE) or
+ (*SKIP). However, the presence of callouts does not affect it.
For example, if the pattern .*\d is compiled with PCRE2_AUTO_CALLOUT
and applied to the string "aa", the pcre2test output is:
@@ -3795,46 +4183,45 @@ MISSING CALLOUTS
ter. Another optimization, described in the next section, means that
there is no subsequent attempt to match with an empty subject.
- If a pattern has more than one top-level branch, automatic anchoring
- occurs if all branches are anchorable.
-
Other optimizations
- Other optimizations that provide fast "no match" results also affect
+ Other optimizations that provide fast "no match" results also affect
callouts. For example, if the pattern is
ab(?C4)cd
- PCRE2 knows that any matching string must contain the letter "d". If
- the subject string is "abyz", the lack of "d" means that matching
- doesn't ever start, and the callout is never reached. However, with
+ PCRE2 knows that any matching string must contain the letter "d". If
+ the subject string is "abyz", the lack of "d" means that matching
+ doesn't ever start, and the callout is never reached. However, with
"abyd", though the result is still no match, the callout is obeyed.
- PCRE2 also knows the minimum length of a matching string, and will
- immediately give a "no match" return without actually running a match
- if the subject is not long enough, or, for unanchored patterns, if it
- has been scanned far enough.
+ For most patterns PCRE2 also knows the minimum length of a matching
+ string, and will immediately give a "no match" return without actually
+ running a match if the subject is not long enough, or, for unanchored
+ patterns, if it has been scanned far enough.
You can disable these optimizations by passing the PCRE2_NO_START_OPTI-
- MIZE option to pcre2_compile(), or by starting the pattern with
- (*NO_START_OPT). This slows down the matching process, but does ensure
+ MIZE option to pcre2_compile(), or by starting the pattern with
+ (*NO_START_OPT). This slows down the matching process, but does ensure
that callouts such as the example above are obeyed.
THE CALLOUT INTERFACE
- During matching, when PCRE2 reaches a callout point, if an external
- function is set in the match context, it is called. This applies to
- both normal and DFA matching. The first argument to the callout func-
- tion is a pointer to a pcre2_callout block. The second argument is the
- void * callout data that was supplied when the callout was set up by
- calling pcre2_set_callout() (see the pcre2api documentation). The call-
- out block structure contains the following fields:
+ During matching, when PCRE2 reaches a callout point, if an external
+ function is provided in the match context, it is called. This applies
+ to normal, DFA, and JIT matching. The first argument to the callout
+ function is a pointer to a pcre2_callout block. The second argument
+ is the void * callout data that was supplied when the callout was set
+ up by calling pcre2_set_callout() (see the pcre2api documentation). The
+ callout block structure contains the following fields, not necessarily
+ in this order:
uint32_t version;
uint32_t callout_number;
uint32_t capture_top;
uint32_t capture_last;
+ uint32_t callout_flags;
PCRE2_SIZE *offset_vector;
PCRE2_SPTR mark;
PCRE2_SPTR subject;
@@ -3848,19 +4235,19 @@ THE CALLOUT INTERFACE
PCRE2_SPTR callout_string;
The version field contains the version number of the block format. The
- current version is 1; the three callout string fields were added for
- this version. If you are writing an application that might use an ear-
- lier release of PCRE2, you should check the version number before
- accessing any of these fields. The version number will increase in
- future if more fields are added, but the intention is never to remove
- any of the existing fields.
+ current version is 2; the three callout string fields were added for
+ version 1, and the callout_flags field for version 2. If you are writ-
+ ing an application that might use an earlier release of PCRE2, you
+ should check the version number before accessing any of these fields.
+ The version number will increase in future if more fields are added,
+ but the intention is never to remove any of the existing fields.
Fields for numerical callouts
For a numerical callout, callout_string is NULL, and callout_number
contains the number of the callout, in the range 0-255. This is the
- number that follows (?C for manual callouts; it is 255 for automati-
- cally generated callouts.
+ number that follows (?C for callouts that are part of the pattern; it is
+ 255 for automatically generated callouts.
Fields for string callouts
@@ -3885,74 +4272,123 @@ THE CALLOUT INTERFACE
The remaining fields in the callout block are the same for both kinds
of callout.
- The offset_vector field is a pointer to the vector of capturing offsets
- (the "ovector") that was passed to the matching function in the match
- data block. When pcre2_match() is used, the contents can be inspected
- in order to extract substrings that have been matched so far, in the
- same way as for extracting substrings after a match has completed. For
- the DFA matching function, this field is not useful.
+ The offset_vector field is a pointer to a vector of capturing offsets
+ (the "ovector"). You may read the elements in this vector, but you must
+ not change any of them.
+
+ For calls to pcre2_match(), the offset_vector field is not (since
+ release 10.30) a pointer to the actual ovector that was passed to the
+ matching function in the match data block. Instead it points to an
+ internal ovector of a size large enough to hold all possible captured
+ substrings in the pattern. Note that whenever a recursion or subroutine
+ call within a pattern completes, the capturing state is reset to what
+ it was before.
+
+ The capture_last field contains the number of the most recently cap-
+ tured substring, and the capture_top field contains one more than the
+ number of the highest numbered captured substring so far. If no sub-
+ strings have yet been captured, the value of capture_last is 0 and the
+ value of capture_top is 1. The values of these fields do not always
+ differ by one; for example, when the callout in the pattern
+ ((a)(b))(?C2) is taken, capture_last is 1 but capture_top is 4.
+
+ The contents of ovector[2] to ovector[<capture_top>*2-1] can be
+ inspected in order to extract substrings that have been matched so far,
+ in the same way as extracting substrings after a match has completed.
+ The values in ovector[0] and ovector[1] are always PCRE2_UNSET because
+ the match is by definition not complete. Substrings that have not been
+ captured but whose numbers are less than capture_top also have both of
+ their ovector slots set to PCRE2_UNSET.
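As a rough illustration of the ovector layout described above, the following C sketch walks the pairs for groups 1 to capture_top-1 and skips unset ones. It uses a stand-in structure and a stand-in SKETCH_UNSET value so that it compiles without pcre2.h; real code would use pcre2_callout_block and PCRE2_UNSET. The function names are invented for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for PCRE2_UNSET; an assumption of this sketch. */
#define SKETCH_UNSET (~(size_t)0)

/* Stand-in holding only the callout block fields used below. */
typedef struct {
    unsigned int capture_top;     /* one more than highest captured group */
    unsigned int capture_last;    /* most recently captured group */
    const size_t *offset_vector;  /* read-only view of the ovector */
} sketch_callout;

/* Count the groups that hold a captured substring at callout time.
   Pair 0 is skipped because the overall match is not yet complete. */
unsigned int sketch_count_set_groups(const sketch_callout *cb)
{
    unsigned int i, n = 0;
    assert(cb != NULL && cb->offset_vector != NULL);
    for (i = 1; i < cb->capture_top; i++) {
        if (cb->offset_vector[2 * i] != SKETCH_UNSET &&
            cb->offset_vector[2 * i + 1] != SKETCH_UNSET)
            n++;
    }
    return n;
}

/* Length of group i in code units, or SKETCH_UNSET if it is not set
   or its number is not below capture_top. */
size_t sketch_group_length(const sketch_callout *cb, unsigned int i)
{
    size_t start, end;
    assert(cb != NULL && cb->offset_vector != NULL);
    if (i == 0 || i >= cb->capture_top) return SKETCH_UNSET;
    start = cb->offset_vector[2 * i];
    end = cb->offset_vector[2 * i + 1];
    if (start == SKETCH_UNSET || end == SKETCH_UNSET) return SKETCH_UNSET;
    return end - start;
}
```

For the pattern ((a)(b))(?C2) applied to "ab", the callout would see capture_top equal to 4, and all three groups would be counted as set.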
+
+ For DFA matching, the offset_vector field points to the ovector that
+ was passed to the matching function in the match data block, but it
+ holds no useful information at callout time because pcre2_dfa_match()
+ does not support substring capturing. The value of capture_top is
+ always 1 and the value of capture_last is always 0 for DFA matching.
The subject and subject_length fields contain copies of the values that
were passed to the matching function.
- The start_match field normally contains the offset within the subject
- at which the current match attempt started. However, if the escape
- sequence \K has been encountered, this value is changed to reflect the
- modified starting point. If the pattern is not anchored, the callout
+ The start_match field normally contains the offset within the subject
+ at which the current match attempt started. However, if the escape
+ sequence \K has been encountered, this value is changed to reflect the
+ modified starting point. If the pattern is not anchored, the callout
function may be called several times from the same point in the pattern
for different starting points in the subject.
- The current_position field contains the offset within the subject of
+ The current_position field contains the offset within the subject of
the current match pointer.
- When the pcre2_match() is used, the capture_top field contains one more
- than the number of the highest numbered captured substring so far. If
- no substrings have been captured, the value of capture_top is one. This
- is always the case when the DFA functions are used, because they do not
- support captured substrings.
-
- The capture_last field contains the number of the most recently cap-
- tured substring. However, when a recursion exits, the value reverts to
- what it was outside the recursion, as do the values of all captured
- substrings. If no substrings have been captured, the value of cap-
- ture_last is 0. This is always the case for the DFA matching functions.
-
The pattern_position field contains the offset in the pattern string to
the next item to be matched.
- The next_item_length field contains the length of the next item to be
- matched in the pattern string. When the callout immediately precedes an
- alternation bar, a closing parenthesis, or the end of the pattern, the
- length is zero. When the callout precedes an opening parenthesis, the
- length is that of the entire subpattern.
-
- The pattern_position and next_item_length fields are intended to help
- in distinguishing between different automatic callouts, which all have
- the same callout number. However, they are set for all callouts, and
+ The next_item_length field contains the length of the next item to be
+ processed in the pattern string. When the callout is at the end of the
+ pattern, the length is zero. When the callout precedes an opening
+ parenthesis, the length includes meta characters that follow the paren-
+ thesis. For example, in a callout before an assertion such as (?=ab)
+ the length is 3. For an alternation bar or a closing parenthesis,
+ the length is one, unless a closing parenthesis is followed by a quan-
+ tifier, in which case its length is included. (This changed in release
+ 10.23. In earlier releases, before an opening parenthesis the length
+ was that of the entire subpattern, and before an alternation bar or a
+ closing parenthesis the length was zero.)
+
+ The pattern_position and next_item_length fields are intended to help
+ in distinguishing between different automatic callouts, which all have
+ the same callout number. However, they are set for all callouts, and
are used by pcre2test to show the next item to be matched when display-
ing callout information.
In callouts from pcre2_match() the mark field contains a pointer to the
- zero-terminated name of the most recently passed (*MARK), (*PRUNE), or
- (*THEN) item in the match, or NULL if no such items have been passed.
- Instances of (*PRUNE) or (*THEN) without a name do not obliterate a
+ zero-terminated name of the most recently passed (*MARK), (*PRUNE), or
+ (*THEN) item in the match, or NULL if no such items have been passed.
+ Instances of (*PRUNE) or (*THEN) without a name do not obliterate a
previous (*MARK). In callouts from the DFA matching function this field
always contains NULL.
+ The callout_flags field is always zero in callouts from
+ pcre2_dfa_match() or when JIT is being used. When pcre2_match() without
+ JIT is used, the following bits may be set:
+
+ PCRE2_CALLOUT_STARTMATCH
+
+ This is set for the first callout after the start of matching for each
+ new starting position in the subject.
+
+ PCRE2_CALLOUT_BACKTRACK
+
+ This is set if there has been a matching backtrack since the previous
+ callout, or since the start of matching if this is the first callout
+ from a pcre2_match() run.
+
+ Both bits are set when a backtrack has caused a "bumpalong" to a new
+ starting position in the subject. Output from pcre2test does not indi-
+ cate the presence of these bits unless the callout_extra modifier is
+ set.
+
+ The information in the callout_flags field is provided so that applica-
+ tions can track and tell their users how matching with backtracking is
+ done. This can be useful when trying to optimize patterns, or just to
+ understand how PCRE2 works. There is no support in pcre2_dfa_match()
+ because there is no backtracking in DFA matching, and there is no sup-
+ port in JIT because JIT is all about maximizing matching performance.
+ In both these cases the callout_flags field is always zero.
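
The three combinations of bits described above can be told apart with a small decoding function. This is only a sketch: the SKETCH_ macros are stand-ins defined locally so that the fragment compiles on its own, and their values are assumptions. Real code should include pcre2.h and test the callout_flags field against PCRE2_CALLOUT_STARTMATCH and PCRE2_CALLOUT_BACKTRACK.

```c
#include <stdint.h>

/* Stand-in bit values (assumed for this sketch only); real code should use
   PCRE2_CALLOUT_STARTMATCH and PCRE2_CALLOUT_BACKTRACK from pcre2.h. */
#define SKETCH_CALLOUT_STARTMATCH 0x1u
#define SKETCH_CALLOUT_BACKTRACK  0x2u

/* Hypothetical helper: turn a callout_flags value into a short description
   that an application could show to its users. */
const char *describe_callout_flags(uint32_t callout_flags)
{
    int startmatch = (callout_flags & SKETCH_CALLOUT_STARTMATCH) != 0;
    int backtrack  = (callout_flags & SKETCH_CALLOUT_BACKTRACK) != 0;

    /* Both bits: a backtrack caused a "bumpalong" to a new start position. */
    if (startmatch && backtrack) return "bumpalong after backtrack";
    if (startmatch) return "first callout at a new starting position";
    if (backtrack) return "backtrack since the previous callout";

    /* Always the case for pcre2_dfa_match() and for JIT matching. */
    return "no flags set";
}
```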
+
RETURN VALUES FROM CALLOUTS
The external callout function returns an integer to PCRE2. If the value
- is zero, matching proceeds as normal. If the value is greater than
- zero, matching fails at the current point, but the testing of other
+ is zero, matching proceeds as normal. If the value is greater than
+ zero, matching fails at the current point, but the testing of other
matching possibilities goes ahead, just as if a lookahead assertion had
failed. If the value is less than zero, the match is abandoned, and the
matching function returns the negative value.
- Negative values should normally be chosen from the set of
- PCRE2_ERROR_xxx values. In particular, PCRE2_ERROR_NOMATCH forces a
- standard "no match" failure. The error number PCRE2_ERROR_CALLOUT is
- reserved for use by callout functions; it will never be used by PCRE2
+ Negative values should normally be chosen from the set of
+ PCRE2_ERROR_xxx values. In particular, PCRE2_ERROR_NOMATCH forces a
+ standard "no match" failure. The error number PCRE2_ERROR_CALLOUT is
+ reserved for use by callout functions; it will never be used by PCRE2
itself.
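
The three classes of return value can be summarized in code. The enum and function names below are hypothetical and exist only for this sketch; PCRE2 itself simply inspects the sign of the integer that the callout function returns.

```c
/* Hypothetical names for the three outcomes described in the text. */
typedef enum {
    CALLOUT_PROCEED,    /* zero: matching proceeds as normal */
    CALLOUT_FAIL_HERE,  /* positive: fail at this point, try alternatives */
    CALLOUT_ABANDON     /* negative: abandon the match, return the value */
} callout_action;

/* Classify a callout function's return value by its sign. */
callout_action classify_callout_return(int rc)
{
    if (rc == 0) return CALLOUT_PROCEED;
    if (rc > 0) return CALLOUT_FAIL_HERE;
    return CALLOUT_ABANDON;
}
```

A callout function that wants to veto only the current path would therefore return a positive value, and one that wants to stop the whole match would return a negative PCRE2_ERROR_xxx value.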
@@ -3963,14 +4399,14 @@ CALLOUT ENUMERATION
void *user_data);
A script language that supports the use of string arguments in callouts
- might like to scan all the callouts in a pattern before running the
+ might like to scan all the callouts in a pattern before running the
match. This can be done by calling pcre2_callout_enumerate(). The first
- argument is a pointer to a compiled pattern, the second points to a
- callback function, and the third is arbitrary user data. The callback
- function is called for every callout in the pattern in the order in
+ argument is a pointer to a compiled pattern, the second points to a
+ callback function, and the third is arbitrary user data. The callback
+ function is called for every callout in the pattern in the order in
which they appear. Its first argument is a pointer to a callout enumer-
- ation block, and its second argument is the user_data value that was
- passed to pcre2_callout_enumerate(). The data block contains the fol-
+ ation block, and its second argument is the user_data value that was
+ passed to pcre2_callout_enumerate(). The data block contains the fol-
lowing fields:
version Block version number
@@ -3981,17 +4417,17 @@ CALLOUT ENUMERATION
callout_string_length Length of callout string
callout_string Points to callout string or is NULL
- The version number is currently 0. It will increase if new fields are
- ever added to the block. The remaining fields are the same as their
- namesakes in the pcre2_callout block that is used for callouts during
+ The version number is currently 0. It will increase if new fields are
+ ever added to the block. The remaining fields are the same as their
+ namesakes in the pcre2_callout block that is used for callouts during
matching, as described above.
- Note that the value of pattern_position is unique for each callout.
- However, if a callout occurs inside a group that is quantified with a
+ Note that the value of pattern_position is unique for each callout.
+ However, if a callout occurs inside a group that is quantified with a
non-zero minimum or a fixed maximum, the group is replicated inside the
- compiled pattern. For example, a pattern such as /(a){2}/ is compiled
- as if it were /(a)(a)/. This means that the callout will be enumerated
- more than once, but with the same value for pattern_position in each
+ compiled pattern. For example, a pattern such as /(a){2}/ is compiled
+ as if it were /(a)(a)/. This means that the callout will be enumerated
+ more than once, but with the same value for pattern_position in each
case.
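
Because a replicated group makes the same callout appear more than once during enumeration, an application that wants one entry per callout in the source pattern can deduplicate on pattern_position. The sketch below assumes that the enumeration callback has already stored each pattern_position value in an array; the function name is hypothetical.

```c
#include <stddef.h>

/* Hypothetical helper: count the distinct pattern_position values among the
   n values recorded during enumeration. A quadratic scan is used because
   the number of callouts in a pattern is normally small. */
size_t count_distinct_positions(const size_t *positions, size_t n)
{
    size_t distinct = 0;
    size_t i, j;

    for (i = 0; i < n; i++) {
        int seen_before = 0;
        for (j = 0; j < i; j++) {
            if (positions[j] == positions[i]) {
                seen_before = 1;
                break;
            }
        }
        if (!seen_before) distinct++;
    }
    return distinct;
}
```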
The callback function should normally return zero. If it returns a non-
@@ -4008,8 +4444,8 @@ AUTHOR
REVISION
- Last updated: 23 March 2015
- Copyright (c) 1997-2015 University of Cambridge.
+ Last updated: 22 December 2017
+ Copyright (c) 1997-2017 University of Cambridge.
------------------------------------------------------------------------------
@@ -4024,45 +4460,46 @@ DIFFERENCES BETWEEN PCRE2 AND PERL
This document describes the differences in the ways that PCRE2 and Perl
handle regular expressions. The differences described here are with
- respect to Perl versions 5.10 and above.
+ respect to Perl versions 5.26, but as both Perl and PCRE2 are continu-
+ ally changing, the information may sometimes be out of date.
- 1. PCRE2 has only a subset of Perl's Unicode support. Details of what
+ 1. PCRE2 has only a subset of Perl's Unicode support. Details of what
it does have are given in the pcre2unicode page.
- 2. PCRE2 allows repeat quantifiers only on parenthesized assertions,
- but they do not mean what you might think. For example, (?!a){3} does
- not assert that the next three characters are not "a". It just asserts
- that the next character is not "a" three times (in principle: PCRE2
- optimizes this to run the assertion just once). Perl allows repeat
- quantifiers on other assertions such as \b, but these do not seem to
- have any use.
-
- 3. Capturing subpatterns that occur inside negative lookahead asser-
- tions are counted, but their entries in the offsets vector are never
- set. Perl sometimes (but not always) sets its numerical variables from
- inside negative assertions.
-
- 4. The following Perl escape sequences are not supported: \l, \u, \L,
- \U, and \N when followed by a character name or Unicode value. (\N on
+ 2. Like Perl, PCRE2 allows repeat quantifiers on parenthesized asser-
+ tions, but they do not mean what you might think. For example, (?!a){3}
+ does not assert that the next three characters are not "a". It just
+ asserts that the next character is not "a" three times (in principle:
+ PCRE2 optimizes this to run the assertion just once). Perl allows some
+ repeat quantifiers on other assertions, for example, \b* (but not
+ \b{3}), but these do not seem to have any use.
+
+ 3. Capturing subpatterns that occur inside negative lookaround asser-
+ tions are counted, but their entries in the offsets vector are set only
+ when a negative assertion is a condition that has a matching branch
+ (that is, the condition is false).
+
+ 4. The following Perl escape sequences are not supported: \l, \u, \L,
+ \U, and \N when followed by a character name or Unicode value. (\N on
its own, matching a non-newline character, is supported.) In fact these
- are implemented by Perl's general string-handling and are not part of
- its pattern matching engine. If any of these are encountered by PCRE2,
+ are implemented by Perl's general string-handling and are not part of
+ its pattern matching engine. If any of these are encountered by PCRE2,
an error is generated by default. However, if the PCRE2_ALT_BSUX option
is set, \U and \u are interpreted as ECMAScript interprets them.
5. The Perl escape sequences \p, \P, and \X are supported only if PCRE2
- is built with Unicode support. The properties that can be tested with
- \p and \P are limited to the general category properties such as Lu and
- Nd, script names such as Greek or Han, and the derived properties Any
- and L&. PCRE2 does support the Cs (surrogate) property, which Perl does
- not; the Perl documentation says "Because Perl hides the need for the
- user to understand the internal representation of Unicode characters,
- there is no need to implement the somewhat messy concept of surro-
- gates."
-
- 6. PCRE2 does support the \Q...\E escape for quoting substrings. Char-
- acters in between are treated as literals. This is slightly different
- from Perl in that $ and @ are also handled as literals inside the
+ is built with Unicode support (the default). The properties that can be
+ tested with \p and \P are limited to the general category properties
+ such as Lu and Nd, script names such as Greek or Han, and the derived
+ properties Any and L&. PCRE2 does support the Cs (surrogate) property,
+ which Perl does not; the Perl documentation says "Because Perl hides
+ the need for the user to understand the internal representation of Uni-
+ code characters, there is no need to implement the somewhat messy con-
+ cept of surrogates."
+
+ 6. PCRE2 does support the \Q...\E escape for quoting substrings. Char-
+ acters in between are treated as literals. This is slightly different
+ from Perl in that $ and @ are also handled as literals inside the
quotes. In Perl, they cause variable interpolation (but of course PCRE2
does not have variables). Note the following examples:
@@ -4073,22 +4510,17 @@ DIFFERENCES BETWEEN PCRE2 AND PERL
\Qabc\$xyz\E abc\$xyz abc\$xyz
\Qabc\E\$\Qxyz\E abc$xyz abc$xyz
- The \Q...\E sequence is recognized both inside and outside character
+ The \Q...\E sequence is recognized both inside and outside character
classes.
- 7. Fairly obviously, PCRE2 does not support the (?{code}) and
- (??{code}) constructions. However, there is support for recursive pat-
- terns. This is not available in Perl 5.8, but it is in Perl 5.10. Also,
- the PCRE2 "callout" feature allows an external function to be called
- during pattern matching. See the pcre2callout documentation for
- details.
+ 7. Fairly obviously, PCRE2 does not support the (?{code}) and
+ (??{code}) constructions. However, PCRE2 has a "callout" feature,
+ which allows an external function to be called during pattern
+ matching. See the pcre2callout documentation for details.
- 8. Subroutine calls (whether recursive or not) are treated as atomic
- groups. Atomic recursion is like Python, but unlike Perl. Captured
- values that are set outside a subroutine call can be referenced from
- inside in PCRE2, but not in Perl. There is a discussion that explains
- these differences in more detail in the section on recursion differ-
- ences from Perl in the pcre2pattern page.
+ 8. Subroutine calls (whether recursive or not) were treated as atomic
+ groups up to PCRE2 release 10.23, but from release 10.30 this changed,
+ and backtracking into subroutine calls is now supported, as in Perl.
9. If any of the backtracking control verbs are used in a subpattern
that is called as a subroutine (whether or not recursively), their
@@ -4103,7 +4535,7 @@ DIFFERENCES BETWEEN PCRE2 AND PERL
first one that is backtracked onto acts. For example, in the pattern
A(*COMMIT)B(*PRUNE)C a failure in B triggers (*COMMIT), but a failure
in C triggers (*PRUNE). Perl's behaviour is more complex; in many cases
- it is the same as PCRE2, but there are examples where it differs.
+ it is the same as PCRE2, but there are cases where it differs.
11. Most backtracking verbs in assertions have their normal actions.
They are not confined to the assertion.
@@ -4117,18 +4549,18 @@ DIFFERENCES BETWEEN PCRE2 AND PERL
pattern names is not as general as Perl's. This is a consequence of the
fact that PCRE2 works internally just with numbers, using an external
table to translate between numbers and names. In particular, a pattern
- such as (?|(?<a>A)|(?<b)B), where the two capturing parentheses have
+ such as (?|(?<a>A)|(?<b>B)), where the two capturing parentheses have
the same number but different names, is not supported, and causes an
error at compile time. If it were allowed, it would not be possible to
distinguish which parentheses matched, because both names map to cap-
turing subpattern number 1. To avoid this confusing situation, an error
is given at compile time.
- 14. Perl recognizes comments in some places that PCRE2 does not, for
- example, between the ( and ? at the start of a subpattern. If the /x
- modifier is set, Perl allows white space between ( and ? (though cur-
- rent Perls warn that this is deprecated) but PCRE2 never does, even if
- the PCRE2_EXTENDED option is set.
+ 14. Perl used to recognize comments in some places that PCRE2 does not,
+ for example, between the ( and ? at the start of a subpattern. If the
+ /x modifier is set, Perl allowed white space between ( and ? though the
+ latest Perls give an error (for a while it was just deprecated). There
+ may still be some cases where Perl behaves differently.
15. Perl, when in warning mode, gives warnings for character classes
such as [A-\d] or [a-[:digit:]]. It then treats the hyphens as liter-
@@ -4138,50 +4570,67 @@ DIFFERENCES BETWEEN PCRE2 AND PERL
16. In PCRE2, the upper/lower case character properties Lu and Ll are
not affected when case-independent matching is specified. For example,
\p{Lu} always matches an upper case letter. I think Perl has changed in
- this respect; in the release at the time of writing (5.16), \p{Lu} and
+ this respect; in the release at the time of writing (5.24), \p{Lu} and
\p{Ll} match all letters, regardless of case, when case independence is
specified.
17. PCRE2 provides some extensions to the Perl regular expression
facilities. Perl 5.10 includes new features that are not in earlier
- versions of Perl, some of which (such as named parentheses) have been
- in PCRE2 for some time. This list is with respect to Perl 5.10:
+ versions of Perl, some of which (such as named parentheses) were in
+ PCRE2 for some time before. This list is with respect to Perl 5.26:
(a) Although lookbehind assertions in PCRE2 must match fixed length
strings, each alternative branch of a lookbehind assertion can match a
different length of string. Perl requires them all to have the same
length.
- (b) If PCRE2_DOLLAR_ENDONLY is set and PCRE2_MULTILINE is not set, the
+ (b) From PCRE2 10.23, back references to groups of fixed length are
+ supported in lookbehinds, provided that there is no possibility of ref-
+ erencing a non-unique number or name. Perl does not support backrefer-
+ ences in lookbehinds.
+
+ (c) If PCRE2_DOLLAR_ENDONLY is set and PCRE2_MULTILINE is not set, the
$ meta-character matches only at the very end of the string.
- (c) A backslash followed by a letter with no special meaning is
+ (d) A backslash followed by a letter with no special meaning is
faulted. (Perl can be made to issue a warning.)
- (d) If PCRE2_UNGREEDY is set, the greediness of the repetition quanti-
+ (e) If PCRE2_UNGREEDY is set, the greediness of the repetition quanti-
fiers is inverted, that is, by default they are not greedy, but if fol-
lowed by a question mark they are.
- (e) PCRE2_ANCHORED can be used at matching time to force a pattern to
+ (f) PCRE2_ANCHORED can be used at matching time to force a pattern to
be tried only at the first matching position in the subject string.
- (f) The PCRE2_NOTBOL, PCRE2_NOTEOL, PCRE2_NOTEMPTY,
- PCRE2_NOTEMPTY_ATSTART, and PCRE2_NO_AUTO_CAPTURE options have no Perl
- equivalents.
+ (g) The PCRE2_NOTBOL, PCRE2_NOTEOL, PCRE2_NOTEMPTY and
+ PCRE2_NOTEMPTY_ATSTART options have no Perl equivalents.
- (g) The \R escape sequence can be restricted to match only CR, LF, or
+ (h) The \R escape sequence can be restricted to match only CR, LF, or
CRLF by the PCRE2_BSR_ANYCRLF option.
- (h) The callout facility is PCRE2-specific.
+ (i) The callout facility is PCRE2-specific. Perl supports codeblocks
+ and variable interpolation, but not general hooks on every match.
- (i) The partial matching facility is PCRE2-specific.
+ (j) The partial matching facility is PCRE2-specific.
- (j) The alternative matching function (pcre2_dfa_match() matches in a
+ (k) The alternative matching function (pcre2_dfa_match()) matches in a
different way and is not Perl-compatible.
- (k) PCRE2 recognizes some special sequences such as (*CR) at the start
- of a pattern that set overall options that cannot be changed within the
- pattern.
+ (l) PCRE2 recognizes some special sequences such as (*CR) or (*NO_JIT)
+ at the start of a pattern that set overall options that cannot be
+ changed within the pattern.
+
+ 18. The Perl /a modifier restricts \d to matching only ASCII digits,
+ and the /aa modifier restricts /i case-insensitive matching to pure
+ ASCII, ignoring Unicode rules. This separation cannot be represented
+ with PCRE2_UCP.
+
+ 19. Perl has different limits than PCRE2. See the pcre2limit documenta-
+ tion for details. In release 5.10, Perl changed from recursion to itera-
+ tion, keeping the intermediate matches on the heap, which is ~10% slower
+ but is not subject to any stack-overflow limit. PCRE2 made a similar
+ change at release 10.30, and also has many build-time and run-time cus-
+ tomizable limits.
AUTHOR
@@ -4193,8 +4642,8 @@ AUTHOR
REVISION
- Last updated: 15 March 2015
- Copyright (c) 1997-2015 University of Cambridge.
+ Last updated: 18 April 2017
+ Copyright (c) 1997-2017 University of Cambridge.
------------------------------------------------------------------------------
@@ -4342,8 +4791,8 @@ RETURN VALUES FROM JIT MATCHING
The error code PCRE2_ERROR_MATCHLIMIT is returned by the JIT code if
searching a very large pattern tree goes on for too long, as it is in
the same circumstance when JIT is not used, but the details of exactly
- what is counted are not the same. The PCRE2_ERROR_RECURSIONLIMIT error
- code is never returned when JIT matching is used.
+ what is counted are not the same. The PCRE2_ERROR_DEPTHLIMIT error code
+ is never returned when JIT matching is used.
CONTROLLING THE JIT STACK
@@ -4362,13 +4811,10 @@ CONTROLLING THE JIT STACK
It returns a pointer to an opaque structure of type pcre2_jit_stack, or
NULL if there is an error. The pcre2_jit_stack_free() function is used
to free a stack that is no longer needed. (For the technically minded:
- the address space is allocated by mmap or VirtualAlloc.)
+ the address space is allocated by mmap or VirtualAlloc.) A maximum
+ stack size of 512K to 1M should be more than enough for any pattern.
- JIT uses far less memory for recursion than the interpretive code, and
- a maximum stack size of 512K to 1M should be more than enough for any
- pattern.
-
- The pcre2_jit_stack_assign() function specifies which stack JIT code
+ The pcre2_jit_stack_assign() function specifies which stack JIT code
should use. Its arguments are as follows:
pcre2_match_context *mcontext
@@ -4377,7 +4823,7 @@ CONTROLLING THE JIT STACK
The first argument is a pointer to a match context. When this is subse-
quently passed to a matching function, its information determines which
- JIT stack is used. There are three cases for the values of the other
+ JIT stack is used. There are three cases for the values of the other
two options:
(1) If callback is NULL and data is NULL, an internal 32K block
@@ -4395,34 +4841,34 @@ CONTROLLING THE JIT STACK
return value must be a valid JIT stack, the result of calling
pcre2_jit_stack_create().
- A callback function is obeyed whenever JIT code is about to be run; it
+ A callback function is obeyed whenever JIT code is about to be run; it
is not obeyed when pcre2_match() is called with options that are incom-
- patible for JIT matching. A callback function can therefore be used to
- determine whether a match operation was executed by JIT or by the
+ patible for JIT matching. A callback function can therefore be used to
+ determine whether a match operation was executed by JIT or by the
interpreter.
You may safely use the same JIT stack for more than one pattern (either
- by assigning directly or by callback), as long as the patterns are
+ by assigning directly or by callback), as long as the patterns are
matched sequentially in the same thread. Currently, the only way to set
- up non-sequential matches in one thread is to use callouts: if a call-
- out function starts another match, that match must use a different JIT
+ up non-sequential matches in one thread is to use callouts: if a call-
+ out function starts another match, that match must use a different JIT
stack to the one used for currently suspended match(es).
- In a multithread application, if you do not specify a JIT stack, or if
- you assign or pass back NULL from a callback, that is thread-safe,
- because each thread has its own machine stack. However, if you assign
- or pass back a non-NULL JIT stack, this must be a different stack for
+ In a multithread application, if you do not specify a JIT stack, or if
+ you assign or pass back NULL from a callback, that is thread-safe,
+ because each thread has its own machine stack. However, if you assign
+ or pass back a non-NULL JIT stack, this must be a different stack for
each thread so that the application is thread-safe.
- Strictly speaking, even more is allowed. You can assign the same non-
- NULL stack to a match context that is used by any number of patterns,
- as long as they are not used for matching by multiple threads at the
- same time. For example, you could use the same stack in all compiled
- patterns, with a global mutex in the callback to wait until the stack
+ Strictly speaking, even more is allowed. You can assign the same non-
+ NULL stack to a match context that is used by any number of patterns,
+ as long as they are not used for matching by multiple threads at the
+ same time. For example, you could use the same stack in all compiled
+ patterns, with a global mutex in the callback to wait until the stack
is available for use. However, this is an inefficient solution, and not
recommended.
- This is a suggestion for how a multithreaded program that needs to set
+ This is a suggestion for how a multithreaded program that needs to set
up non-default JIT stacks might operate:
During thread initialization
@@ -4434,7 +4880,7 @@ CONTROLLING THE JIT STACK
Use a one-line callback function
return thread_local_var
- All the functions described in this section do nothing if JIT is not
+ All the functions described in this section do nothing if JIT is not
available.
@@ -4443,20 +4889,20 @@ JIT STACK FAQ
(1) Why do we need JIT stacks?
PCRE2 (and JIT) is a recursive, depth-first engine, so it needs a stack
- where the local data of the current node is pushed before checking its
+ where the local data of the current node is pushed before checking its
child nodes. Allocating real machine stack on some platforms is diffi-
cult. For example, the stack chain needs to be updated every time if we
- extend the stack on PowerPC. Although it is possible, its updating
+ extend the stack on PowerPC. Although it is possible, its updating
time overhead decreases performance. So we do the recursion in memory.
(2) Why don't we simply allocate blocks of memory with malloc()?
- Modern operating systems have a nice feature: they can reserve an
+ Modern operating systems have a nice feature: they can reserve an
address space instead of allocating memory. We can safely allocate mem-
- ory pages inside this address space, so the stack could grow without
+ ory pages inside this address space, so the stack could grow without
moving memory data (this is important because of pointers). Thus we can
- allocate 1M address space, and use only a single memory page (usually
- 4K) if that is enough. However, we can still grow up to 1M anytime if
+ allocate 1M address space, and use only a single memory page (usually
+ 4K) if that is enough. However, we can still grow up to 1M anytime if
needed.
(3) Who "owns" a JIT stack?
@@ -4464,8 +4910,8 @@ JIT STACK FAQ
The owner of the stack is the user program, not the JIT studied pattern
or anything else. The user program must ensure that if a stack is being
used by pcre2_match(), (that is, it is assigned to a match context that
- is passed to the pattern currently running), that stack must not be
- used by any other threads (to avoid overwriting the same memory area).
+ is passed to the pattern currently running), that stack must not be
+ used by any other threads (to avoid overwriting the same memory area).
The best practice for multithreaded programs is to allocate a stack for
each thread, and return this stack through the JIT callback function.
@@ -4473,36 +4919,36 @@ JIT STACK FAQ
You can free a JIT stack at any time, as long as it will not be used by
pcre2_match() again. When you assign the stack to a match context, only
- a pointer is set. There is no reference counting or any other magic.
+ a pointer is set. There is no reference counting or any other magic.
You can free compiled patterns, contexts, and stacks in any order, any-
- time. Just do not call pcre2_match() with a match context pointing to
+ time. Just do not call pcre2_match() with a match context pointing to
an already freed stack, as that will cause SEGFAULT. (Also, do not free
- a stack currently used by pcre2_match() in another thread). You can
- also replace the stack in a context at any time when it is not in use.
+ a stack currently used by pcre2_match() in another thread). You can
+ also replace the stack in a context at any time when it is not in use.
You should free the previous stack before assigning a replacement.
- (5) Should I allocate/free a stack every time before/after calling
+ (5) Should I allocate/free a stack every time before/after calling
pcre2_match()?
- No, because this is too costly in terms of resources. However, you
- could implement some clever idea which release the stack if it is not
- used in let's say two minutes. The JIT callback can help to achieve
+ No, because this is too costly in terms of resources. However, you
+ could implement some clever idea which releases the stack if it is not
+ used in let's say two minutes. The JIT callback can help to achieve
this without keeping a list of patterns.
- (6) OK, the stack is for long term memory allocation. But what happens
- if a pattern causes stack overflow with a stack of 1M? Is that 1M kept
+ (6) OK, the stack is for long term memory allocation. But what happens
+ if a pattern causes stack overflow with a stack of 1M? Is that 1M kept
until the stack is freed?
- Especially on embedded sytems, it might be a good idea to release mem-
- ory sometimes without freeing the stack. There is no API for this at
- the moment. Probably a function call which returns with the currently
- allocated memory for any stack and another which allows releasing mem-
+ Especially on embedded systems, it might be a good idea to release mem-
+ ory sometimes without freeing the stack. There is no API for this at
+ the moment. Probably a function call which returns with the currently
+ allocated memory for any stack and another which allows releasing mem-
ory (shrinking the stack) would be a good idea if someone needs this.
(7) This is too much of a headache. Isn't there any better solution for
JIT stack handling?
- No, thanks to Windows. If POSIX threads were used everywhere, we could
+ No, thanks to Windows. If POSIX threads were used everywhere, we could
throw out this complicated API.
@@ -4511,18 +4957,18 @@ FREEING JIT SPECULATIVE MEMORY
void pcre2_jit_free_unused_memory(pcre2_general_context *gcontext);
The JIT executable allocator does not free all memory when it is possi-
- ble. It expects new allocations, and keeps some free memory around to
- improve allocation speed. However, in low memory conditions, it might
- be better to free all possible memory. You can cause this to happen by
- calling pcre2_jit_free_unused_memory(). Its argument is a general con-
+ ble. It expects new allocations, and keeps some free memory around to
+ improve allocation speed. However, in low memory conditions, it might
+ be better to free all possible memory. You can cause this to happen by
+ calling pcre2_jit_free_unused_memory(). Its argument is a general con-
text, for custom memory management, or NULL for standard memory manage-
ment.
EXAMPLE CODE
- This is a single-threaded example that specifies a JIT stack without
- using a callback. A real program should include error checking after
+ This is a single-threaded example that specifies a JIT stack without
+ using a callback. A real program should include error checking after
all the function calls.
int rc;
@@ -4550,29 +4996,29 @@ EXAMPLE CODE
JIT FAST PATH API
Because the API described above falls back to interpreted matching when
- JIT is not available, it is convenient for programs that are written
+ JIT is not available, it is convenient for programs that are written
for general use in many environments. However, calling JIT via
pcre2_match() does have a performance impact. Programs that are written
- for use where JIT is known to be available, and which need the best
- possible performance, can instead use a "fast path" API to call JIT
- matching directly instead of calling pcre2_match() (obviously only for
+ for use where JIT is known to be available, and which need the best
+ possible performance, can instead use a "fast path" API to call JIT
+ matching directly instead of calling pcre2_match() (obviously only for
patterns that have been successfully processed by pcre2_jit_compile()).
- The fast path function is called pcre2_jit_match(), and it takes
+ The fast path function is called pcre2_jit_match(), and it takes
exactly the same arguments as pcre2_match(). The return values are also
the same, plus PCRE2_ERROR_JIT_BADOPTION if a matching mode (partial or
- complete) is requested that was not compiled. Unsupported option bits
- (for example, PCRE2_ANCHORED) are ignored, as is the PCRE2_NO_JIT
+ complete) is requested that was not compiled. Unsupported option bits
+ (for example, PCRE2_ANCHORED) are ignored, as is the PCRE2_NO_JIT
option.
- When you call pcre2_match(), as well as testing for invalid options, a
+ When you call pcre2_match(), as well as testing for invalid options, a
number of other sanity checks are performed on the arguments. For exam-
ple, if the subject pointer is NULL, an immediate error is given. Also,
- unless PCRE2_NO_UTF_CHECK is set, a UTF subject string is tested for
- validity. In the interests of speed, these checks do not happen on the
+ unless PCRE2_NO_UTF_CHECK is set, a UTF subject string is tested for
+ validity. In the interests of speed, these checks do not happen on the
JIT fast path, and if invalid data is passed, the result is undefined.
- Bypassing the sanity checks and the pcre2_match() wrapping can give
+ Bypassing the sanity checks and the pcre2_match() wrapping can give
speedups of more than 10%.
@@ -4590,8 +5036,8 @@ AUTHOR
REVISION
- Last updated: 05 June 2016
- Copyright (c) 1997-2016 University of Cambridge.
+ Last updated: 31 March 2017
+ Copyright (c) 1997-2017 University of Cambridge.
------------------------------------------------------------------------------
@@ -4628,12 +5074,6 @@ SIZE AND OTHER LIMITATIONS
(that is ~(PCRE2_SIZE)0) is reserved as a special indicator for zero-
terminated strings and unset offsets.
- Note that when using the traditional matching function, PCRE2 uses
- recursion to handle subpatterns and indefinite repetition. This means
- that the available stack space may limit the size of a subject string
- that can be processed by certain patterns. For a discussion of stack
- issues, see the pcre2stack documentation.
-
All values in repeating quantifiers must be less than 65536.
The maximum length of a lookbehind assertion is 65535 characters.
@@ -4642,21 +5082,20 @@ SIZE AND OTHER LIMITATIONS
can be no more than 65535 capturing subpatterns. There is, however, a
limit to the depth of nesting of parenthesized subpatterns of all
kinds. This is imposed in order to limit the amount of system stack
- used at compile time. The limit can be specified when PCRE2 is built;
- the default is 250.
-
- There is a limit to the number of forward references to subsequent sub-
- patterns of around 200,000. Repeated forward references with fixed
- upper limits, for example, (?2){0,100} when subpattern number 2 is to
- the right, are included in the count. There is no limit to the number
- of backward references.
+ used at compile time. The default limit can be specified when PCRE2 is
+ built; if it is not, the limit defaults to 250. An application can
+ change this limit by calling pcre2_set_parens_nest_limit() to set the
+ limit in a compile context.
The maximum length of name for a named subpattern is 32 code units, and
the maximum number of named subpatterns is 10000.
The maximum length of a name in a (*MARK), (*PRUNE), (*SKIP), or
- (*THEN) verb is 255 for the 8-bit library and 65535 for the 16-bit and
- 32-bit libraries.
+ (*THEN) verb is 255 code units for the 8-bit library and 65535 code
+ units for the 16-bit and 32-bit libraries.
+
+ The maximum length of a string argument to a callout is the largest
+ number a 32-bit unsigned integer can hold.
AUTHOR
@@ -4668,8 +5107,8 @@ AUTHOR
REVISION
- Last updated: 05 November 2015
- Copyright (c) 1997-2015 University of Cambridge.
+ Last updated: 30 March 2017
+ Copyright (c) 1997-2017 University of Cambridge.
------------------------------------------------------------------------------
@@ -5444,19 +5883,26 @@ SPECIAL START-OF-PATTERN ITEMS
attempt by the application to apply the JIT optimization by calling
pcre2_jit_compile() is ignored.
- Setting match and recursion limits
+ Setting match resource limits
- The caller of pcre2_match() can set a limit on the number of times the
- internal match() function is called and on the maximum depth of recur-
- sive calls. These facilities are provided to catch runaway matches that
- are provoked by patterns with huge matching trees (a typical example is
- a pattern with nested unlimited repeats) and to avoid running out of
- system stack by too much recursion. When one of these limits is
- reached, pcre2_match() gives an error return. The limits can also be
- set by items at the start of the pattern of the form
+ The pcre2_match() function contains a counter that is incremented every
+ time it goes round its main loop. The caller of pcre2_match() can set a
+ limit on this counter, which therefore limits the amount of computing
+ resource used for a match. The maximum depth of nested backtracking can
+ also be limited; this indirectly restricts the amount of heap memory
+ that is used, but there is also an explicit memory limit that can be
+ set.
+
+ These facilities are provided to catch runaway matches that are pro-
+ voked by patterns with huge matching trees (a typical example is a pat-
+ tern with nested unlimited repeats applied to a long string that does
+ not match). When one of these limits is reached, pcre2_match() gives an
+ error return. The limits can also be set by items at the start of the
+ pattern of the form
+ (*LIMIT_HEAP=d)
(*LIMIT_MATCH=d)
- (*LIMIT_RECURSION=d)
+ (*LIMIT_DEPTH=d)
where d is any number of decimal digits. However, the value of the set-
ting must be less than the value set (or defaulted) by the caller of
@@ -5465,23 +5911,36 @@ SPECIAL START-OF-PATTERN ITEMS
If there is more than one setting of one of these limits, the lower
value is used.
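
The way a caller's limit and start-of-pattern settings combine can be
sketched as follows. This is an illustration in Python, not PCRE2 code;
effective_limit() is a hypothetical helper, and it assumes only the rules
stated above: a pattern setting has an effect only when it is lower than the
caller's value, and the lowest of several settings wins.

```python
def effective_limit(caller_limit, pattern_settings):
    # Start from the value set (or defaulted) by the caller.
    limit = caller_limit
    for value in pattern_settings:
        # A setting such as (*LIMIT_MATCH=d) can lower the limit,
        # but it can never raise it.
        if value < limit:
            limit = value
    return limit
```

For example, with a caller limit of 1000, the settings 500 and 200 give an
effective limit of 200, whereas a single setting of 5000 leaves the limit at
1000.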
+ Prior to release 10.30, LIMIT_DEPTH was called LIMIT_RECURSION. This
+ name is still recognized for backwards compatibility.
+
+ The heap limit applies only when the pcre2_match() interpreter is used
+ for matching. It does not apply to JIT or DFA matching. The match limit
+ is used (but in a different way) when JIT is being used, or when
+ pcre2_dfa_match() is called, to limit computing resource usage by those
+ matching functions. The depth limit is ignored by JIT but is relevant
+ for DFA matching, which uses function recursion for recursions within
+ the pattern. In this case, the depth limit controls the amount of sys-
+ tem stack that is used.
+
Newline conventions
- PCRE2 supports five different conventions for indicating line breaks in
+ PCRE2 supports six different conventions for indicating line breaks in
strings: a single CR (carriage return) character, a single LF (line-
feed) character, the two-character sequence CRLF, any of the three pre-
- ceding, or any Unicode newline sequence. The pcre2api page has further
- discussion about newlines, and shows how to set the newline convention
- when calling pcre2_compile().
+ ceding, any Unicode newline sequence, or the NUL character (binary
+ zero). The pcre2api page has further discussion about newlines, and
+ shows how to set the newline convention when calling pcre2_compile().
It is also possible to specify a newline convention by starting a pat-
- tern string with one of the following five sequences:
+ tern string with one of the following sequences:
(*CR) carriage return
(*LF) linefeed
(*CRLF) carriage return, followed by linefeed
(*ANYCRLF) any of the three above
(*ANY) all Unicode newline sequences
+ (*NUL) the NUL character (binary zero)
These override the default and the options given to the compiling func-
tion. For example, on a Unix system where LF is the default newline
@@ -5498,9 +5957,9 @@ SPECIAL START-OF-PATTERN ITEMS
acter when PCRE2_DOTALL is not set, and the behaviour of \N. However,
it does not affect what the \R escape sequence matches. By default,
this is any Unicode newline sequence, for Perl compatibility. However,
- this can be changed; see the description of \R in the section entitled
- "Newline sequences" below. A change of \R setting can be combined with
- a change of newline convention.
+ this can be changed; see the next section and the description of \R in
+ the section entitled "Newline sequences" below. A change of \R setting
+ can be combined with a change of newline convention.
Specifying what \R matches
@@ -5514,7 +5973,7 @@ SPECIAL START-OF-PATTERN ITEMS
EBCDIC CHARACTER CODES
PCRE2 can be compiled to run in an environment that uses EBCDIC as its
- character code rather than ASCII or Unicode (typically a mainframe sys-
+ character code instead of ASCII or Unicode (typically a mainframe sys-
tem). In the sections below, character code values are ASCII or Uni-
code; in an EBCDIC environment these characters may have different code
values, and there are no code points greater than 255.
@@ -5579,11 +6038,11 @@ BACKSLASH
meaning that character may have. This use of backslash as an escape
character applies both inside and outside character classes.
- For example, if you want to match a * character, you write \* in the
- pattern. This escaping action applies whether or not the following
+ For example, if you want to match a * character, you must write \* in
+ the pattern. This escaping action applies whether or not the following
character would otherwise be interpreted as a metacharacter, so it is
always safe to precede a non-alphanumeric with backslash to specify
- that it stands for itself. In particular, if you want to match a back-
+ that it stands for itself. In particular, if you want to match a back-
slash, you write \\.
In a UTF mode, only ASCII numbers and letters have any special meaning
@@ -5614,15 +6073,16 @@ BACKSLASH
is not followed by \E later in the pattern, the literal interpretation
continues to the end of the pattern (that is, \E is assumed at the
end). If the isolated \Q is inside a character class, this causes an
- error, because the character class is not terminated.
+ error, because the character class is not terminated by a closing
+ square bracket.
Non-printing characters
A second use of backslash provides a way of encoding non-printing char-
- acters in patterns in a visible manner. There is no restriction on the
- appearance of non-printing characters in a pattern, but when a pattern
+ acters in patterns in a visible manner. There is no restriction on the
+ appearance of non-printing characters in a pattern, but when a pattern
is being prepared by text editing, it is often easier to use one of the
- following escape sequences than the binary character it represents. In
+ following escape sequences than the binary character it represents. In
an ASCII or Unicode environment, these escapes are as follows:
\a alarm, that is, the BEL character (hex 07)
@@ -5639,51 +6099,51 @@ BACKSLASH
\x{hhh..} character with hex code hhh.. (default mode)
\uhhhh character with hex code hhhh (when PCRE2_ALT_BSUX is set)
- The precise effect of \cx on ASCII characters is as follows: if x is a
- lower case letter, it is converted to upper case. Then bit 6 of the
+ The precise effect of \cx on ASCII characters is as follows: if x is a
+ lower case letter, it is converted to upper case. Then bit 6 of the
character (hex 40) is inverted. Thus \cA to \cZ become hex 01 to hex 1A
- (A is 41, Z is 5A), but \c{ becomes hex 3B ({ is 7B), and \c; becomes
- hex 7B (; is 3B). If the code unit following \c has a value less than
- 32 or greater than 126, a compile-time error occurs. This locks out
- non-printable ASCII characters in all modes.
+ (A is 41, Z is 5A), but \c{ becomes hex 3B ({ is 7B), and \c; becomes
+ hex 7B (; is 3B). If the code unit following \c has a value less than
+ 32 or greater than 126, a compile-time error occurs.
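
The arithmetic for ASCII characters described above can be sketched as
follows. This is an illustration in Python, not PCRE2 code;
control_escape() is a hypothetical helper name, and the EBCDIC rules
described next are not modelled.

```python
def control_escape(ch):
    # Value generated by \c followed by the ASCII character ch.
    code = ord(ch)
    if code < 32 or code > 126:
        # PCRE2 reports a compile-time error in this case.
        raise ValueError("character after the escape must be printable ASCII")
    if "a" <= ch <= "z":
        # Lower case letters are converted to upper case first.
        code = ord(ch.upper())
    # Invert bit 6 (hex 40).
    return code ^ 0x40
```

With this sketch, control_escape("A") is 0x01, control_escape("z") is 0x1A,
control_escape("{") is 0x3B, and control_escape(";") is 0x7B, matching the
values given above.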
When PCRE2 is compiled in EBCDIC mode, \a, \e, \f, \n, \r, and \t gen-
erate the appropriate EBCDIC code values. The \c escape is processed as
specified for Perl in the perlebcdic document. The only characters that
are allowed after \c are A-Z, a-z, or one of @, [, \, ], ^, _, or ?.
- Any other character provokes a compile-time error. The sequence \@
- encodes character code 0; the letters (in either case) encode charac-
- ters 1-26 (hex 01 to hex 1A); [, \, ], ^, and _ encode characters 27-31
- (hex 1B to hex 1F), and \? becomes either 255 (hex FF) or 95 (hex 5F).
-
- Thus, apart from \?, these escapes generate the same character code
- values as they do in an ASCII environment, though the meanings of the
- values mostly differ. For example, \G always generates code value 7,
+ Any other character provokes a compile-time error. The sequence \c@
+ encodes character code 0; after \c the letters (in either case) encode
+ characters 1-26 (hex 01 to hex 1A); [, \, ], ^, and _ encode characters
+ 27-31 (hex 1B to hex 1F), and \c? becomes either 255 (hex FF) or 95
+ (hex 5F).
+
+ Thus, apart from \c?, these escapes generate the same character code
+ values as they do in an ASCII environment, though the meanings of the
+ values mostly differ. For example, \cG always generates code value 7,
which is BEL in ASCII but DEL in EBCDIC.
- The sequence \? generates DEL (127, hex 7F) in an ASCII environment,
- but because 127 is not a control character in EBCDIC, Perl makes it
- generate the APC character. Unfortunately, there are several variants
- of EBCDIC. In most of them the APC character has the value 255 (hex
- FF), but in the one Perl calls POSIX-BC its value is 95 (hex 5F). If
- certain other characters have POSIX-BC values, PCRE2 makes \? generate
+ The sequence \c? generates DEL (127, hex 7F) in an ASCII environment,
+ but because 127 is not a control character in EBCDIC, Perl makes it
+ generate the APC character. Unfortunately, there are several variants
+ of EBCDIC. In most of them the APC character has the value 255 (hex
+ FF), but in the one Perl calls POSIX-BC its value is 95 (hex 5F). If
+ certain other characters have POSIX-BC values, PCRE2 makes \c? generate
95; otherwise it generates 255.
- After \0 up to two further octal digits are read. If there are fewer
- than two digits, just those that are present are used. Thus the
+ After \0 up to two further octal digits are read. If there are fewer
+ than two digits, just those that are present are used. Thus the
sequence \0\x\015 specifies two binary zeros followed by a CR character
(code value 13). Make sure you supply two digits after the initial zero
if the pattern character that follows is itself an octal digit.
- The escape \o must be followed by a sequence of octal digits, enclosed
- in braces. An error occurs if this is not the case. This escape is a
- recent addition to Perl; it provides way of specifying character code
- points as octal numbers greater than 0777, and it also allows octal
+ The escape \o must be followed by a sequence of octal digits, enclosed
+ in braces. An error occurs if this is not the case. This escape is a
+ recent addition to Perl; it provides a way of specifying character code
+ points as octal numbers greater than 0777, and it also allows octal
numbers and back references to be unambiguously specified.
For greater clarity and unambiguity, it is best to avoid following \ by
a digit greater than zero. Instead, use \o{} or \x{} to specify charac-
- ter numbers, and \g{} to specify back references. The following para-
+ ter numbers, and \g{} to specify back references. The following para-
graphs describe the old, ambiguous syntax.
The handling of a backslash followed by a digit other than 0 is compli-
@@ -5691,16 +6151,16 @@ BACKSLASH
Outside a character class, PCRE2 reads the digit and any following dig-
its as a decimal number. If the number is less than 10, begins with the
- digit 8 or 9, or if there are at least that many previous capturing
- left parentheses in the expression, the entire sequence is taken as a
+ digit 8 or 9, or if there are at least that many previous capturing
+ left parentheses in the expression, the entire sequence is taken as a
back reference. A description of how this works is given later, follow-
- ing the discussion of parenthesized subpatterns. Otherwise, up to
+ ing the discussion of parenthesized subpatterns. Otherwise, up to
three octal digits are read to form a character code.
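
The decision just described, for a backslash followed by a digit other than
0 outside a character class, can be sketched as follows. This is an
illustration in Python, not PCRE2 code. The helper name and its return
convention are hypothetical, and the sketch assumes that any digits left
over after the octal value stand for themselves; it does not check the
resulting value against the limits of the current mode.

```python
def interpret_backslash_digits(digits, capture_count):
    # digits: the run of decimal digits after the backslash (first is 1-9).
    # capture_count: number of capturing left parentheses seen so far.
    number = int(digits)
    if number < 10 or digits[0] in "89" or number <= capture_count:
        return ("backref", number)
    # Otherwise read up to three octal digits to form a character code.
    octal = ""
    for d in digits[:3]:
        if d not in "01234567":
            break
        octal += d
    # Digits that were not consumed are returned as literal text.
    return ("char", int(octal, 8), digits[len(octal):])
```

For example, interpret_backslash_digits("11", 0) yields character code 9 (a
tab), whereas with at least 11 previous capturing parentheses the same digits
yield back reference 11.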
- Inside a character class, PCRE2 handles \8 and \9 as the literal char-
- acters "8" and "9", and otherwise reads up to three octal digits fol-
+ Inside a character class, PCRE2 handles \8 and \9 as the literal char-
+ acters "8" and "9", and otherwise reads up to three octal digits fol-
lowing the backslash, using them to generate a data character. Any sub-
- sequent digits stand for themselves. For example, outside a character
+ sequent digits stand for themselves. For example, outside a character
class:
\040 is another way of writing an ASCII space
@@ -5717,69 +6177,68 @@ BACKSLASH
the value 255 (decimal)
\81 is always a back reference
- Note that octal values of 100 or greater that are specified using this
- syntax must not be introduced by a leading zero, because no more than
+ Note that octal values of 100 or greater that are specified using this
+ syntax must not be introduced by a leading zero, because no more than
three octal digits are ever read.
- By default, after \x that is not followed by {, from zero to two hexa-
- decimal digits are read (letters can be in upper or lower case). Any
+ By default, after \x that is not followed by {, from zero to two hexa-
+ decimal digits are read (letters can be in upper or lower case). Any
number of hexadecimal digits may appear between \x{ and }. If a charac-
- ter other than a hexadecimal digit appears between \x{ and }, or if
+ ter other than a hexadecimal digit appears between \x{ and }, or if
there is no terminating }, an error occurs.
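
The default reading of \x can be sketched as follows. This is an
illustration in Python, not PCRE2 code; parse_x_escape() is a hypothetical
helper that receives the text following \x and returns the character code
together with the number of characters consumed. Treating empty braces as an
error is an assumption of the sketch, since the text above does not cover
that case.

```python
HEX_DIGITS = "0123456789abcdefABCDEF"

def parse_x_escape(text):
    if text.startswith("{"):
        end = text.find("}")
        if end == -1:
            raise ValueError("missing terminating brace")
        body = text[1:end]
        # Assumption: at least one digit is required between the braces.
        if not body or any(c not in HEX_DIGITS for c in body):
            raise ValueError("non-hexadecimal character between braces")
        return int(body, 16), end + 1
    # Without braces, from zero to two hexadecimal digits are read.
    digits = ""
    for c in text[:2]:
        if c not in HEX_DIGITS:
            break
        digits += c
    # Zero digits gives a binary zero.
    return (int(digits, 16) if digits else 0), len(digits)
```

In this sketch both parse_x_escape("dc") and parse_x_escape("{dc}") yield the
code 0xdc, consuming 2 and 4 characters respectively.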
- If the PCRE2_ALT_BSUX option is set, the interpretation of \x is as
+ If the PCRE2_ALT_BSUX option is set, the interpretation of \x is as
just described only when it is followed by two hexadecimal digits. Oth-
- erwise, it matches a literal "x" character. In this mode mode, support
- for code points greater than 256 is provided by \u, which must be fol-
- lowed by four hexadecimal digits; otherwise it matches a literal "u"
- character.
+ erwise, it matches a literal "x" character. In this mode, support for
+ code points greater than 255 is provided by \u, which must be followed
+ by four hexadecimal digits; otherwise it matches a literal "u" charac-
+ ter.
Characters whose value is less than 256 can be defined by either of the
two syntaxes for \x (or by \u in PCRE2_ALT_BSUX mode). There is no dif-
- ference in the way they are handled. For example, \xdc is exactly the
+ ference in the way they are handled. For example, \xdc is exactly the
same as \x{dc} (or \u00dc in PCRE2_ALT_BSUX mode).
Constraints on character values
- Characters that are specified using octal or hexadecimal numbers are
+ Characters that are specified using octal or hexadecimal numbers are
limited to certain values, as follows:
- 8-bit non-UTF mode less than 0x100
- 8-bit UTF-8 mode less than 0x10ffff and a valid codepoint
- 16-bit non-UTF mode less than 0x10000
- 16-bit UTF-16 mode less than 0x10ffff and a valid codepoint
- 32-bit non-UTF mode less than 0x100000000
- 32-bit UTF-32 mode less than 0x10ffff and a valid codepoint
+ 8-bit non-UTF mode no greater than 0xff
+ 16-bit non-UTF mode no greater than 0xffff
+ 32-bit non-UTF mode no greater than 0xffffffff
+ All UTF modes no greater than 0x10ffff and a valid codepoint
- Invalid Unicode codepoints are the range 0xd800 to 0xdfff (the so-
- called "surrogate" codepoints), and 0xffef.
+ Invalid Unicode codepoints are all those in the range 0xd800 to 0xdfff
+ (the so-called "surrogate" codepoints). The check for these can be dis-
+ abled by the caller of pcre2_compile() by setting the option
+ PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES.
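
The limits in the table above can be sketched as a single check. This is an
illustration in Python, not PCRE2 code; is_valid_escape_value() and its
parameters are hypothetical, with allow_surrogates standing in for the
effect of PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES.

```python
def is_valid_escape_value(value, width, utf, allow_surrogates=False):
    # width is the code unit width of the library: 8, 16, or 32.
    if utf:
        if value > 0x10FFFF:
            return False
        if 0xD800 <= value <= 0xDFFF and not allow_surrogates:
            # Surrogate codepoints are rejected unless the check is disabled.
            return False
        return True
    # In non-UTF modes the value must fit in one code unit.
    return value <= (1 << width) - 1
```

For example, 0x100 is rejected in 8-bit non-UTF mode but accepted in 16-bit
non-UTF mode, and 0xd800 is rejected in any UTF mode unless surrogate escapes
are allowed.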
Escape sequences in character classes
All the sequences that define a single character value can be used both
- inside and outside character classes. In addition, inside a character
+ inside and outside character classes. In addition, inside a character
class, \b is interpreted as the backspace character (hex 08).
- \N is not allowed in a character class. \B, \R, and \X are not special
- inside a character class. Like other unrecognized alphabetic escape
- sequences, they cause an error. Outside a character class, these
+ \N is not allowed in a character class. \B, \R, and \X are not special
+ inside a character class. Like other unrecognized alphabetic escape
+ sequences, they cause an error. Outside a character class, these
sequences have different meanings.
Unsupported escape sequences
- In Perl, the sequences \l, \L, \u, and \U are recognized by its string
- handler and used to modify the case of following characters. By
+ In Perl, the sequences \l, \L, \u, and \U are recognized by its string
+ handler and used to modify the case of following characters. By
default, PCRE2 does not support these escape sequences. However, if the
PCRE2_ALT_BSUX option is set, \U matches a "U" character, and \u can be
- used to define a character by code point, as described in the previous
- section.
+ used to define a character by code point, as described above.
Absolute and relative back references
- The sequence \g followed by an unsigned or a negative number, option-
- ally enclosed in braces, is an absolute or relative back reference. A
- named back reference can be coded as \g{name}. Back references are dis-
- cussed later, following the discussion of parenthesized subpatterns.
+ The sequence \g followed by a signed or unsigned number, optionally
+ enclosed in braces, is an absolute or relative back reference. A named
+ back reference can be coded as \g{name}. Back references are discussed
+ later, following the discussion of parenthesized subpatterns.
Absolute and relative subroutine calls
@@ -5941,59 +6400,64 @@ BACKSLASH
tional escape sequences that match characters with specific properties
are available. In 8-bit non-UTF-8 mode, these sequences are of course
limited to testing characters whose codepoints are less than 256, but
- they do work in this mode. The extra escape sequences are:
+ they do work in this mode. In 32-bit non-UTF mode, codepoints greater
+ than 0x10ffff (the Unicode limit) may be encountered. These are all
+ treated as being in the Common script and with an unassigned type. The
+ extra escape sequences are:
\p{xx} a character with the xx property
\P{xx} a character without the xx property
\X a Unicode extended grapheme cluster
- The property names represented by xx above are limited to the Unicode
+ The property names represented by xx above are limited to the Unicode
script names, the general category properties, "Any", which matches any
character (including newline), and some special PCRE2 properties
- (described in the next section). Other Perl properties such as "InMu-
- sicalSymbols" are not supported by PCRE2. Note that \P{Any} does not
+ (described in the next section). Other Perl properties such as "InMu-
+ sicalSymbols" are not supported by PCRE2. Note that \P{Any} does not
match any characters, so always causes a match failure.
Sets of Unicode characters are defined as belonging to certain scripts.
- A character from one of these sets can be matched using a script name.
+ A character from one of these sets can be matched using a script name.
For example:
\p{Greek}
\P{Han}
- Those that are not part of an identified script are lumped together as
+ Those that are not part of an identified script are lumped together as
"Common". The current list of scripts is:
- Ahom, Anatolian_Hieroglyphs, Arabic, Armenian, Avestan, Balinese,
- Bamum, Bassa_Vah, Batak, Bengali, Bopomofo, Brahmi, Braille, Buginese,
- Buhid, Canadian_Aboriginal, Carian, Caucasian_Albanian, Chakma, Cham,
- Cherokee, Common, Coptic, Cuneiform, Cypriot, Cyrillic, Deseret,
- Devanagari, Duployan, Egyptian_Hieroglyphs, Elbasan, Ethiopic, Geor-
- gian, Glagolitic, Gothic, Grantha, Greek, Gujarati, Gurmukhi, Han,
- Hangul, Hanunoo, Hatran, Hebrew, Hiragana, Imperial_Aramaic, Inherited,
- Inscriptional_Pahlavi, Inscriptional_Parthian, Javanese, Kaithi, Kan-
- nada, Katakana, Kayah_Li, Kharoshthi, Khmer, Khojki, Khudawadi, Lao,
- Latin, Lepcha, Limbu, Linear_A, Linear_B, Lisu, Lycian, Lydian, Maha-
- jani, Malayalam, Mandaic, Manichaean, Meetei_Mayek, Mende_Kikakui,
- Meroitic_Cursive, Meroitic_Hieroglyphs, Miao, Modi, Mongolian, Mro,
- Multani, Myanmar, Nabataean, New_Tai_Lue, Nko, Ogham, Ol_Chiki,
- Old_Hungarian, Old_Italic, Old_North_Arabian, Old_Permic, Old_Persian,
- Old_South_Arabian, Old_Turkic, Oriya, Osmanya, Pahawh_Hmong, Palmyrene,
- Pau_Cin_Hau, Phags_Pa, Phoenician, Psalter_Pahlavi, Rejang, Runic,
- Samaritan, Saurashtra, Sharada, Shavian, Siddham, SignWriting, Sinhala,
- Sora_Sompeng, Sundanese, Syloti_Nagri, Syriac, Tagalog, Tagbanwa,
- Tai_Le, Tai_Tham, Tai_Viet, Takri, Tamil, Telugu, Thaana, Thai,
- Tibetan, Tifinagh, Tirhuta, Ugaritic, Vai, Warang_Citi, Yi.
+ Adlam, Ahom, Anatolian_Hieroglyphs, Arabic, Armenian, Avestan, Bali-
+ nese, Bamum, Bassa_Vah, Batak, Bengali, Bhaiksuki, Bopomofo, Brahmi,
+ Braille, Buginese, Buhid, Canadian_Aboriginal, Carian, Caucasian_Alba-
+ nian, Chakma, Cham, Cherokee, Common, Coptic, Cuneiform, Cypriot,
+ Cyrillic, Deseret, Devanagari, Duployan, Egyptian_Hieroglyphs, Elbasan,
+ Ethiopic, Georgian, Glagolitic, Gothic, Grantha, Greek, Gujarati, Gur-
+ mukhi, Han, Hangul, Hanunoo, Hatran, Hebrew, Hiragana, Imperial_Ara-
+ maic, Inherited, Inscriptional_Pahlavi, Inscriptional_Parthian,
+ Javanese, Kaithi, Kannada, Katakana, Kayah_Li, Kharoshthi, Khmer, Kho-
+ jki, Khudawadi, Lao, Latin, Lepcha, Limbu, Linear_A, Linear_B, Lisu,
+ Lycian, Lydian, Mahajani, Malayalam, Mandaic, Manichaean, Marchen,
+ Masaram_Gondi, Meetei_Mayek, Mende_Kikakui, Meroitic_Cursive,
+ Meroitic_Hieroglyphs, Miao, Modi, Mongolian, Mro, Multani, Myanmar,
+ Nabataean, New_Tai_Lue, Newa, Nko, Nushu, Ogham, Ol_Chiki, Old_Hungar-
+ ian, Old_Italic, Old_North_Arabian, Old_Permic, Old_Persian,
+ Old_South_Arabian, Old_Turkic, Oriya, Osage, Osmanya, Pahawh_Hmong,
+ Palmyrene, Pau_Cin_Hau, Phags_Pa, Phoenician, Psalter_Pahlavi, Rejang,
+ Runic, Samaritan, Saurashtra, Sharada, Shavian, Siddham, SignWriting,
+ Sinhala, Sora_Sompeng, Soyombo, Sundanese, Syloti_Nagri, Syriac, Taga-
+ log, Tagbanwa, Tai_Le, Tai_Tham, Tai_Viet, Takri, Tamil, Tangut, Tel-
+ ugu, Thaana, Thai, Tibetan, Tifinagh, Tirhuta, Ugaritic, Vai,
+ Warang_Citi, Yi, Zanabazar_Square.
Each character has exactly one Unicode general category property, spec-
- ified by a two-letter abbreviation. For compatibility with Perl, nega-
- tion can be specified by including a circumflex between the opening
- brace and the property name. For example, \p{^Lu} is the same as
+ ified by a two-letter abbreviation. For compatibility with Perl, nega-
+ tion can be specified by including a circumflex between the opening
+ brace and the property name. For example, \p{^Lu} is the same as
\P{Lu}.
If only one letter is specified with \p or \P, it includes all the gen-
- eral category properties that start with that letter. In this case, in
- the absence of negation, the curly brackets in the escape sequence are
+ eral category properties that start with that letter. In this case, in
+ the absence of negation, the curly brackets in the escape sequence are
optional; these two examples have the same effect:
\p{L}
@@ -6045,45 +6509,48 @@ BACKSLASH
Zp Paragraph separator
Zs Space separator
- The special property L& is also supported: it matches a character that
- has the Lu, Ll, or Lt property, in other words, a letter that is not
+ The special property L& is also supported: it matches a character that
+ has the Lu, Ll, or Lt property, in other words, a letter that is not
classified as a modifier or "other".
- The Cs (Surrogate) property applies only to characters in the range
- U+D800 to U+DFFF. Such characters are not valid in Unicode strings and
- so cannot be tested by PCRE2, unless UTF validity checking has been
- turned off (see the discussion of PCRE2_NO_UTF_CHECK in the pcre2api
+ The Cs (Surrogate) property applies only to characters in the range
+ U+D800 to U+DFFF. Such characters are not valid in Unicode strings and
+ so cannot be tested by PCRE2, unless UTF validity checking has been
+ turned off (see the discussion of PCRE2_NO_UTF_CHECK in the pcre2api
page). Perl does not support the Cs property.
- The long synonyms for property names that Perl supports (such as
- \p{Letter}) are not supported by PCRE2, nor is it permitted to prefix
+ The long synonyms for property names that Perl supports (such as
+ \p{Letter}) are not supported by PCRE2, nor is it permitted to prefix
any of these properties with "Is".
No character that is in the Unicode table has the Cn (unassigned) prop-
erty. Instead, this property is assumed for any code point that is not
in the Unicode table.
- Specifying caseless matching does not affect these escape sequences.
- For example, \p{Lu} always matches only upper case letters. This is
+ Specifying caseless matching does not affect these escape sequences.
+ For example, \p{Lu} always matches only upper case letters. This is
different from the behaviour of current versions of Perl.
- Matching characters by Unicode property is not fast, because PCRE2 has
- to do a multistage table lookup in order to find a character's prop-
+ Matching characters by Unicode property is not fast, because PCRE2 has
+ to do a multistage table lookup in order to find a character's prop-
erty. That is why the traditional escape sequences such as \d and \w do
- not use Unicode properties in PCRE2 by default, though you can make
- them do so by setting the PCRE2_UCP option or by starting the pattern
+ not use Unicode properties in PCRE2 by default, though you can make
+ them do so by setting the PCRE2_UCP option or by starting the pattern
with (*UCP).
Extended grapheme clusters
- The \X escape matches any number of Unicode characters that form an
+ The \X escape matches any number of Unicode characters that form an
"extended grapheme cluster", and treats the sequence as an atomic group
- (see below). Unicode supports various kinds of composite character by
- giving each character a grapheme breaking property, and having rules
+ (see below). Unicode supports various kinds of composite character by
+ giving each character a grapheme breaking property, and having rules
that use these properties to define the boundaries of extended grapheme
- clusters. \X always matches at least one character. Then it decides
- whether to add additional characters according to the following rules
- for ending a cluster:
+ clusters. The rules are defined in Unicode Standard Annex 29, "Unicode
+ Text Segmentation".
+
+ \X always matches at least one character. Then it decides whether to
+ add additional characters according to the following rules for ending a
+ cluster:
1. End at the end of the subject string.
@@ -6096,20 +6563,31 @@ BACKSLASH
be followed by a V or T character; an LVT or T character may be followed
only by a T character.
- 4. Do not end before extending characters or spacing marks. Characters
- with the "mark" property always have the "extend" grapheme breaking
- property.
+ 4. Do not end before extending characters or spacing marks or the
+ "zero-width joiner" characters. Characters with the "mark" property
+ always have the "extend" grapheme breaking property.
5. Do not end after prepend characters.
+ 6. Do not break within emoji modifier sequences (a base character fol-
+ lowed by a modifier). Extending characters are allowed before the modi-
+ fier.
+
+ 7. Do not break within emoji zwj sequences (zero-width joiner followed
+ by "glue after ZWJ" or "base glue after ZWJ").
+
+ 8. Do not break within emoji flag sequences. That is, do not break
+ between regional indicator (RI) characters if there are an odd number
+ of RI characters before the break point.
+
9. Otherwise, end the cluster.
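
The effect of rule 8 on a run of regional indicator (RI) characters can be
sketched as follows. This is an illustration in Python, not PCRE2 code; both
helper names are hypothetical, and the sketch considers rule 8 in isolation,
assuming the run is not extended by whatever follows it.

```python
def may_break_between_ri(ri_count_before):
    # A break is forbidden when an odd number of RI characters precede it.
    return ri_count_before % 2 == 0

def split_ri_run(length):
    # Return the sizes of the clusters formed by a run of RI characters.
    clusters = []
    current = 0
    for seen in range(1, length + 1):
        current += 1
        # The end of the run always ends the current cluster.
        if seen == length or may_break_between_ri(seen):
            clusters.append(current)
            current = 0
    return clusters
```

A run of five RI characters is therefore split into clusters of 2, 2, and 1,
so that flag pairs stay together.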
PCRE2's additional properties
- As well as the standard Unicode properties described above, PCRE2 sup-
- ports four more that make it possible to convert traditional escape
+ As well as the standard Unicode properties described above, PCRE2 sup-
+ ports four more that make it possible to convert traditional escape
sequences such as \w and \s to use Unicode properties. PCRE2 uses these
- non-standard, non-Perl properties internally when PCRE2_UCP is set.
+ non-standard, non-Perl properties internally when PCRE2_UCP is set.
However, they may also be used explicitly. These properties are:
Xan Any alphanumeric character
@@ -6117,53 +6595,53 @@ BACKSLASH
Xsp Any Perl space character
Xwd Any Perl "word" character
- Xan matches characters that have either the L (letter) or the N (num-
- ber) property. Xps matches the characters tab, linefeed, vertical tab,
- form feed, or carriage return, and any other character that has the Z
- (separator) property. Xsp is the same as Xps; in PCRE1 it used to
- exclude vertical tab, for Perl compatibility, but Perl changed. Xwd
+ Xan matches characters that have either the L (letter) or the N (num-
+ ber) property. Xps matches the characters tab, linefeed, vertical tab,
+ form feed, or carriage return, and any other character that has the Z
+ (separator) property. Xsp is the same as Xps; in PCRE1 it used to
+ exclude vertical tab, for Perl compatibility, but Perl changed. Xwd
matches the same characters as Xan, plus underscore.
- There is another non-standard property, Xuc, which matches any charac-
- ter that can be represented by a Universal Character Name in C++ and
- other programming languages. These are the characters $, @, ` (grave
- accent), and all characters with Unicode code points greater than or
- equal to U+00A0, except for the surrogates U+D800 to U+DFFF. Note that
- most base (ASCII) characters are excluded. (Universal Character Names
- are of the form \uHHHH or \UHHHHHHHH where H is a hexadecimal digit.
+ There is another non-standard property, Xuc, which matches any charac-
+ ter that can be represented by a Universal Character Name in C++ and
+ other programming languages. These are the characters $, @, ` (grave
+ accent), and all characters with Unicode code points greater than or
+ equal to U+00A0, except for the surrogates U+D800 to U+DFFF. Note that
+ most base (ASCII) characters are excluded. (Universal Character Names
+ are of the form \uHHHH or \UHHHHHHHH where H is a hexadecimal digit.
Note that the Xuc property does not match these sequences but the char-
acters that they represent.)
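
The four special properties described above can be approximated with a few lines of Python. This is only a sketch: it assumes that Python's unicodedata general categories stand in for PCRE2's own Unicode tables (the two may be built from different Unicode versions), and the function names are invented for illustration.

```python
import unicodedata

# The five control characters that Xps/Xsp match in addition to the
# Z (separator) property: tab, linefeed, vertical tab, form feed, CR.
PERL_SPACE_CONTROLS = {"\t", "\n", "\x0b", "\x0c", "\r"}


def is_xan(ch):
    """Xan: general category starts with L (letter) or N (number)."""
    return unicodedata.category(ch)[0] in ("L", "N")


def is_xps(ch):
    """Xps (and Xsp): the five controls above, or any Z (separator)."""
    return ch in PERL_SPACE_CONTROLS or unicodedata.category(ch)[0] == "Z"


def is_xwd(ch):
    """Xwd: everything Xan matches, plus underscore."""
    return is_xan(ch) or ch == "_"


def is_xuc(ch):
    """Xuc: $, @, grave accent, or any code point >= U+00A0 that is
    not a surrogate (U+D800 to U+DFFF)."""
    if ch in "$@`":
        return True
    cp = ord(ch)
    return cp >= 0xA0 and not (0xD800 <= cp <= 0xDFFF)
```

For instance, is_xan("_") is false while is_xwd("_") is true, and is_xuc("A") is false because most ASCII characters are excluded.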
Resetting the match start
- The escape sequence \K causes any previously matched characters not to
+ The escape sequence \K causes any previously matched characters not to
be included in the final matched sequence. For example, the pattern:
foo\Kbar
- matches "foobar", but reports that it has matched "bar". This feature
- is similar to a lookbehind assertion (described below). However, in
- this case, the part of the subject before the real match does not have
- to be of fixed length, as lookbehind assertions do. The use of \K does
- not interfere with the setting of captured substrings. For example,
+ matches "foobar", but reports that it has matched "bar". This feature
+ is similar to a lookbehind assertion (described below). However, in
+ this case, the part of the subject before the real match does not have
+ to be of fixed length, as lookbehind assertions do. The use of \K does
+ not interfere with the setting of captured substrings. For example,
when the pattern
(foo)\Kbar
matches "foobar", the first substring is still set to "foo".
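
The effect of \K on the reported match can be emulated in an engine that lacks it. The sketch below uses Python's re module, which has no \K; it is an emulation under that assumption, not PCRE2 behaviour. The helper name match_with_reset and the group name "kept" are hypothetical.

```python
import re


def match_with_reset(prefix, rest, subject):
    """Emulate prefix\\Krest: the whole of prefix+rest must match, but
    only the part matched by rest is reported. Returns a (start, end)
    pair, or None if there is no match."""
    m = re.search("(?:%s)(?P<kept>%s)" % (prefix, rest), subject)
    return None if m is None else m.span("kept")


# foo\Kbar against "foobar": reports only "bar".
fixed = match_with_reset("foo", "bar", "foobar")

# Unlike a lookbehind, the part before \K need not be of fixed length.
variable = match_with_reset("fo+", "bar", "fooobar")
```

Capturing groups placed inside prefix would still be set as usual, which mirrors the statement that \K does not interfere with captured substrings.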
- Perl documents that the use of \K within assertions is "not well
- defined". In PCRE2, \K is acted upon when it occurs inside positive
- assertions, but is ignored in negative assertions. Note that when a
- pattern such as (?=ab\K) matches, the reported start of the match can
+ Perl documents that the use of \K within assertions is "not well
+ defined". In PCRE2, \K is acted upon when it occurs inside positive
+ assertions, but is ignored in negative assertions. Note that when a
+ pattern such as (?=ab\K) matches, the reported start of the match can
be greater than the end of the match.
Simple assertions
- The final use of backslash is for certain simple assertions. An asser-
- tion specifies a condition that has to be met at a particular point in
- a match, without consuming any characters from the subject string. The
- use of subpatterns for more complicated assertions is described below.
+ The final use of backslash is for certain simple assertions. An asser-
+ tion specifies a condition that has to be met at a particular point in
+ a match, without consuming any characters from the subject string. The
+ use of subpatterns for more complicated assertions is described below.
The backslashed assertions are:
\b matches at a word boundary
@@ -6174,184 +6652,184 @@ BACKSLASH
\z matches only at the end of the subject
\G matches at the first matching position in the subject
- Inside a character class, \b has a different meaning; it matches the
- backspace character. If any other of these assertions appears in a
+ Inside a character class, \b has a different meaning; it matches the
+ backspace character. If any other of these assertions appears in a
character class, an "invalid escape sequence" error is generated.
- A word boundary is a position in the subject string where the current
- character and the previous character do not both match \w or \W (i.e.
- one matches \w and the other matches \W), or the start or end of the
- string if the first or last character matches \w, respectively. In a
- UTF mode, the meanings of \w and \W can be changed by setting the
+ A word boundary is a position in the subject string where the current
+ character and the previous character do not both match \w or \W (i.e.
+ one matches \w and the other matches \W), or the start or end of the
+ string if the first or last character matches \w, respectively. In a
+ UTF mode, the meanings of \w and \W can be changed by setting the
PCRE2_UCP option. When this is done, it also affects \b and \B. Neither
- PCRE2 nor Perl has a separate "start of word" or "end of word" metase-
- quence. However, whatever follows \b normally determines which it is.
+ PCRE2 nor Perl has a separate "start of word" or "end of word" metase-
+ quence. However, whatever follows \b normally determines which it is.
For example, the fragment \ba matches "a" at the start of a word.
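
The word boundary definition above translates directly into code. The following Python sketch assumes the default (non-UCP) meaning of \w, that is, ASCII letters, digits, and underscore; the function names are illustrative only.

```python
def is_word_char(ch):
    """ASCII-only \\w: letter, digit, or underscore."""
    return ch.isascii() and (ch.isalnum() or ch == "_")


def is_word_boundary(subject, pos):
    """True if position pos (0..len(subject)) is a word boundary.

    Outside the string there is no character, which is treated as a
    non-word character. A boundary exists exactly when the characters
    on the two sides of pos differ in "wordness".
    """
    before = pos > 0 and is_word_char(subject[pos - 1])
    after = pos < len(subject) and is_word_char(subject[pos])
    return before != after


# All boundary positions in a short subject.
boundaries = [p for p in range(len("ab, c") + 1)
              if is_word_boundary("ab, c", p)]
```

Treating the positions outside the string as non-word characters covers the "start or end of the string" clause without a special case.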
- The \A, \Z, and \z assertions differ from the traditional circumflex
+ The \A, \Z, and \z assertions differ from the traditional circumflex
and dollar (described in the next section) in that they only ever match
- at the very start and end of the subject string, whatever options are
- set. Thus, they are independent of multiline mode. These three asser-
- tions are not affected by the PCRE2_NOTBOL or PCRE2_NOTEOL options,
- which affect only the behaviour of the circumflex and dollar metachar-
- acters. However, if the startoffset argument of pcre2_match() is non-
- zero, indicating that matching is to start at a point other than the
- beginning of the subject, \A can never match. The difference between
- \Z and \z is that \Z matches before a newline at the end of the string
+ at the very start and end of the subject string, whatever options are
+ set. Thus, they are independent of multiline mode. These three asser-
+ tions are not affected by the PCRE2_NOTBOL or PCRE2_NOTEOL options,
+ which affect only the behaviour of the circumflex and dollar metachar-
+ acters. However, if the startoffset argument of pcre2_match() is non-
+ zero, indicating that matching is to start at a point other than the
+ beginning of the subject, \A can never match. The difference between
+ \Z and \z is that \Z matches before a newline at the end of the string
as well as at the very end, whereas \z matches only at the end.
- The \G assertion is true only when the current matching position is at
- the start point of the match, as specified by the startoffset argument
- of pcre2_match(). It differs from \A when the value of startoffset is
- non-zero. By calling pcre2_match() multiple times with appropriate
- arguments, you can mimic Perl's /g option, and it is in this kind of
+ The \G assertion is true only when the current matching position is at
+ the start point of the match, as specified by the startoffset argument
+ of pcre2_match(). It differs from \A when the value of startoffset is
+ non-zero. By calling pcre2_match() multiple times with appropriate
+ arguments, you can mimic Perl's /g option, and it is in this kind of
implementation where \G can be useful.
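
The repeated-call technique can be sketched as follows. This uses Python's re module as a stand-in, on the assumption that Pattern.match(string, pos), which only matches starting exactly at pos, plays the role of a \G-prefixed pattern passed to pcre2_match() with a non-zero startoffset. The TOKEN pattern and the tokenize function are hypothetical.

```python
import re

# Each alternative must match exactly at the current offset, as if
# every branch of the pattern began with \G.
TOKEN = re.compile(r"\d+|[a-z]+|\s+")


def tokenize(subject):
    """Return (tokens, offset), where offset is where scanning stopped."""
    tokens = []
    offset = 0
    while offset < len(subject):
        m = TOKEN.match(subject, offset)
        # Stop on failure, and also on an empty match so that the
        # loop cannot spin without making progress.
        if m is None or m.end() == offset:
            break
        tokens.append(m.group(0))
        offset = m.end()
    return tokens, offset


tokens, stopped_at = tokenize("abc 123 x!y")
```

Scanning stops at the "!" because no alternative matches at that exact position; an unanchored search would instead skip ahead and find "y".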
- Note, however, that PCRE2's interpretation of \G, as the start of the
+ Note, however, that PCRE2's interpretation of \G, as the start of the
current match, is subtly different from Perl's, which defines it as the
- end of the previous match. In Perl, these can be different when the
- previously matched string was empty. Because PCRE2 does just one match
+ end of the previous match. In Perl, these can be different when the
+ previously matched string was empty. Because PCRE2 does just one match
at a time, it cannot reproduce this behaviour.
- If all the alternatives of a pattern begin with \G, the expression is
+ If all the alternatives of a pattern begin with \G, the expression is
anchored to the starting match position, and the "anchored" flag is set
in the compiled regular expression.
CIRCUMFLEX AND DOLLAR
- The circumflex and dollar metacharacters are zero-width assertions.
- That is, they test for a particular condition being true without con-
+ The circumflex and dollar metacharacters are zero-width assertions.
+ That is, they test for a particular condition being true without con-
suming any characters from the subject string. These two metacharacters
- are concerned with matching the starts and ends of lines. If the new-
- line convention is set so that only the two-character sequence CRLF is
- recognized as a newline, isolated CR and LF characters are treated as
+ are concerned with matching the starts and ends of lines. If the new-
+ line convention is set so that only the two-character sequence CRLF is
+ recognized as a newline, isolated CR and LF characters are treated as
ordinary data characters, and are not recognized as newlines.
Outside a character class, in the default matching mode, the circumflex
- character is an assertion that is true only if the current matching
- point is at the start of the subject string. If the startoffset argu-
- ment of pcre2_match() is non-zero, or if PCRE2_NOTBOL is set, circum-
- flex can never match if the PCRE2_MULTILINE option is unset. Inside a
- character class, circumflex has an entirely different meaning (see
+ character is an assertion that is true only if the current matching
+ point is at the start of the subject string. If the startoffset argu-
+ ment of pcre2_match() is non-zero, or if PCRE2_NOTBOL is set, circum-
+ flex can never match if the PCRE2_MULTILINE option is unset. Inside a
+ character class, circumflex has an entirely different meaning (see
below).
- Circumflex need not be the first character of the pattern if a number
- of alternatives are involved, but it should be the first thing in each
- alternative in which it appears if the pattern is ever to match that
- branch. If all possible alternatives start with a circumflex, that is,
- if the pattern is constrained to match only at the start of the sub-
- ject, it is said to be an "anchored" pattern. (There are also other
+ Circumflex need not be the first character of the pattern if a number
+ of alternatives are involved, but it should be the first thing in each
+ alternative in which it appears if the pattern is ever to match that
+ branch. If all possible alternatives start with a circumflex, that is,
+ if the pattern is constrained to match only at the start of the sub-
+ ject, it is said to be an "anchored" pattern. (There are also other
constructs that can cause a pattern to be anchored.)
- The dollar character is an assertion that is true only if the current
- matching point is at the end of the subject string, or immediately
- before a newline at the end of the string (by default), unless
+ The dollar character is an assertion that is true only if the current
+ matching point is at the end of the subject string, or immediately
+ before a newline at the end of the string (by default), unless
PCRE2_NOTEOL is set. Note, however, that it does not actually match the
newline. Dollar need not be the last character of the pattern if a num-
ber of alternatives are involved, but it should be the last item in any
- branch in which it appears. Dollar has no special meaning in a charac-
+ branch in which it appears. Dollar has no special meaning in a charac-
ter class.
- The meaning of dollar can be changed so that it matches only at the
- very end of the string, by setting the PCRE2_DOLLAR_ENDONLY option at
+ The meaning of dollar can be changed so that it matches only at the
+ very end of the string, by setting the PCRE2_DOLLAR_ENDONLY option at
compile time. This does not affect the \Z assertion.
The meanings of the circumflex and dollar metacharacters are changed if
- the PCRE2_MULTILINE option is set. When this is the case, a dollar
- character matches before any newlines in the string, as well as at the
- very end, and a circumflex matches immediately after internal newlines
- as well as at the start of the subject string. It does not match after
- a newline that ends the string, for compatibility with Perl. However,
+ the PCRE2_MULTILINE option is set. When this is the case, a dollar
+ character matches before any newlines in the string, as well as at the
+ very end, and a circumflex matches immediately after internal newlines
+ as well as at the start of the subject string. It does not match after
+ a newline that ends the string, for compatibility with Perl. However,
this can be changed by setting the PCRE2_ALT_CIRCUMFLEX option.
- For example, the pattern /^abc$/ matches the subject string "def\nabc"
- (where \n represents a newline) in multiline mode, but not otherwise.
- Consequently, patterns that are anchored in single line mode because
- all branches start with ^ are not anchored in multiline mode, and a
- match for circumflex is possible when the startoffset argument of
- pcre2_match() is non-zero. The PCRE2_DOLLAR_ENDONLY option is ignored
+ For example, the pattern /^abc$/ matches the subject string "def\nabc"
+ (where \n represents a newline) in multiline mode, but not otherwise.
+ Consequently, patterns that are anchored in single line mode because
+ all branches start with ^ are not anchored in multiline mode, and a
+ match for circumflex is possible when the startoffset argument of
+ pcre2_match() is non-zero. The PCRE2_DOLLAR_ENDONLY option is ignored
if PCRE2_MULTILINE is set.
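
The /^abc$/ example can be tried in any engine with a comparable multiline option. The sketch below uses Python's re module, assuming that re.MULTILINE corresponds to PCRE2_MULTILINE for these particular cases; Python differs from PCRE2 in other details (for example, its \Z has a different meaning), so this illustrates the idea rather than testing PCRE2 itself.

```python
import re

subject = "def\nabc"

# Default mode: circumflex matches only at the start of the subject,
# so "abc" on the second line is not found.
single_line = re.search(r"^abc$", subject)

# Multiline mode: circumflex also matches after the internal newline.
multi_line = re.search(r"^abc$", subject, re.MULTILINE)

# Default mode: dollar also matches immediately before a newline that
# ends the string, without consuming that newline.
before_final_newline = re.search(r"abc$", "abc\n")
```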
- When the newline convention (see "Newline conventions" below) recog-
- nizes the two-character sequence CRLF as a newline, this is preferred,
- even if the single characters CR and LF are also recognized as new-
- lines. For example, if the newline convention is "any", a multiline
- mode circumflex matches before "xyz" in the string "abc\r\nxyz" rather
- than after CR, even though CR on its own is a valid newline. (It also
+ When the newline convention (see "Newline conventions" below) recog-
+ nizes the two-character sequence CRLF as a newline, this is preferred,
+ even if the single characters CR and LF are also recognized as new-
+ lines. For example, if the newline convention is "any", a multiline
+ mode circumflex matches before "xyz" in the string "abc\r\nxyz" rather
+ than after CR, even though CR on its own is a valid newline. (It also
matches at the very start of the string, of course.)
- Note that the sequences \A, \Z, and \z can be used to match the start
- and end of the subject in both modes, and if all branches of a pattern
- start with \A it is always anchored, whether or not PCRE2_MULTILINE is
+ Note that the sequences \A, \Z, and \z can be used to match the start
+ and end of the subject in both modes, and if all branches of a pattern
+ start with \A it is always anchored, whether or not PCRE2_MULTILINE is
set.
FULL STOP (PERIOD, DOT) AND \N
Outside a character class, a dot in the pattern matches any one charac-
- ter in the subject string except (by default) a character that signi-
+ ter in the subject string except (by default) a character that signi-
fies the end of a line.
- When a line ending is defined as a single character, dot never matches
- that character; when the two-character sequence CRLF is used, dot does
- not match CR if it is immediately followed by LF, but otherwise it
- matches all characters (including isolated CRs and LFs). When any Uni-
- code line endings are being recognized, dot does not match CR or LF or
+ When a line ending is defined as a single character, dot never matches
+ that character; when the two-character sequence CRLF is used, dot does
+ not match CR if it is immediately followed by LF, but otherwise it
+ matches all characters (including isolated CRs and LFs). When any Uni-
+ code line endings are being recognized, dot does not match CR or LF or
any of the other line ending characters.
- The behaviour of dot with regard to newlines can be changed. If the
- PCRE2_DOTALL option is set, a dot matches any one character, without
- exception. If the two-character sequence CRLF is present in the sub-
+ The behaviour of dot with regard to newlines can be changed. If the
+ PCRE2_DOTALL option is set, a dot matches any one character, without
+ exception. If the two-character sequence CRLF is present in the sub-
ject string, it takes two dots to match it.
- The handling of dot is entirely independent of the handling of circum-
- flex and dollar, the only relationship being that they both involve
+ The handling of dot is entirely independent of the handling of circum-
+ flex and dollar, the only relationship being that they both involve
newlines. Dot has no special meaning in a character class.
- The escape sequence \N behaves like a dot, except that it is not
- affected by the PCRE2_DOTALL option. In other words, it matches any
- character except one that signifies the end of a line. Perl also uses
+ The escape sequence \N behaves like a dot, except that it is not
+ affected by the PCRE2_DOTALL option. In other words, it matches any
+ character except one that signifies the end of a line. Perl also uses
\N to match characters by name; PCRE2 does not support this.
MATCHING A SINGLE CODE UNIT
- Outside a character class, the escape sequence \C matches any one code
- unit, whether or not a UTF mode is set. In the 8-bit library, one code
- unit is one byte; in the 16-bit library it is a 16-bit unit; in the
- 32-bit library it is a 32-bit unit. Unlike a dot, \C always matches
- line-ending characters. The feature is provided in Perl in order to
+ Outside a character class, the escape sequence \C matches any one code
+ unit, whether or not a UTF mode is set. In the 8-bit library, one code
+ unit is one byte; in the 16-bit library it is a 16-bit unit; in the
+ 32-bit library it is a 32-bit unit. Unlike a dot, \C always matches
+ line-ending characters. The feature is provided in Perl in order to
match individual bytes in UTF-8 mode, but it is unclear how it can use-
fully be used.
- Because \C breaks up characters into individual code units, matching
- one unit with \C in UTF-8 or UTF-16 mode means that the rest of the
- string may start with a malformed UTF character. This has undefined
+ Because \C breaks up characters into individual code units, matching
+ one unit with \C in UTF-8 or UTF-16 mode means that the rest of the
+ string may start with a malformed UTF character. This has undefined
results, because PCRE2 assumes that it is matching character by charac-
- ter in a valid UTF string (by default it checks the subject string's
- validity at the start of processing unless the PCRE2_NO_UTF_CHECK
+ ter in a valid UTF string (by default it checks the subject string's
+ validity at the start of processing unless the PCRE2_NO_UTF_CHECK
option is used).
- An application can lock out the use of \C by setting the
- PCRE2_NEVER_BACKSLASH_C option when compiling a pattern. It is also
+ An application can lock out the use of \C by setting the
+ PCRE2_NEVER_BACKSLASH_C option when compiling a pattern. It is also
possible to build PCRE2 with the use of \C permanently disabled.
- PCRE2 does not allow \C to appear in lookbehind assertions (described
- below) in UTF-8 or UTF-16 modes, because this would make it impossible
- to calculate the length of the lookbehind. Neither the alternative
+ PCRE2 does not allow \C to appear in lookbehind assertions (described
+ below) in UTF-8 or UTF-16 modes, because this would make it impossible
+ to calculate the length of the lookbehind. Neither the alternative
matching function pcre2_dfa_match() nor the JIT optimizer supports \C in
these UTF modes. The former gives a match-time error; the latter fails
to optimize and so the match is always run using the interpreter.
- In the 32-bit library, however, \C is always supported (when not
- explicitly locked out) because it always matches a single code unit,
+ In the 32-bit library, however, \C is always supported (when not
+ explicitly locked out) because it always matches a single code unit,
whether or not UTF-32 is specified.
In general, the \C escape sequence is best avoided. However, one way of
- using it that avoids the problem of malformed UTF-8 or UTF-16 charac-
- ters is to use a lookahead to check the length of the next character,
- as in this pattern, which could be used with a UTF-8 string (ignore
+ using it that avoids the problem of malformed UTF-8 or UTF-16 charac-
+ ters is to use a lookahead to check the length of the next character,
+ as in this pattern, which could be used with a UTF-8 string (ignore
white space and line breaks):
(?| (?=[\x00-\x7f])(\C) |
@@ -6359,10 +6837,10 @@ MATCHING A SINGLE CODE UNIT
(?=[\x{800}-\x{ffff}])(\C)(\C)(\C) |
(?=[\x{10000}-\x{1fffff}])(\C)(\C)(\C)(\C))
- In this example, a group that starts with (?| resets the capturing
+ In this example, a group that starts with (?| resets the capturing
parentheses numbers in each alternative (see "Duplicate Subpattern Num-
bers" below). The assertions at the start of each branch check the next
- UTF-8 character for values whose encoding uses 1, 2, 3, or 4 bytes,
+ UTF-8 character for values whose encoding uses 1, 2, 3, or 4 bytes,
respectively. The character's individual bytes are then captured by the
appropriate number of \C groups.
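
The lookaheads in that pattern select a branch according to how many bytes the next character occupies in UTF-8. The same decision can be written as a small Python function. This is a sketch: the function name is invented, and the two-byte range (U+0080 to U+07FF) comes from the UTF-8 encoding rules rather than from the pattern text shown above.

```python
def utf8_length(code_point):
    """Number of bytes used to encode code_point in UTF-8, using the
    same upper bounds as the ranges in the lookahead assertions."""
    if code_point < 0:
        raise ValueError("negative code point")
    if code_point <= 0x7F:
        return 1          # one (\C) group
    if code_point <= 0x7FF:
        return 2          # two (\C) groups
    if code_point <= 0xFFFF:
        return 3          # three (\C) groups
    if code_point <= 0x1FFFFF:
        return 4          # four (\C) groups
    raise ValueError("code point out of range")


# Cross-check against Python's own encoder for a few characters.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]
lengths = [utf8_length(ord(ch)) for ch in samples]
```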
@@ -6371,48 +6849,67 @@ SQUARE BRACKETS AND CHARACTER CLASSES
An opening square bracket introduces a character class, terminated by a
closing square bracket. A closing square bracket on its own is not spe-
- cial by default. If a closing square bracket is required as a member
+ cial by default. If a closing square bracket is required as a member
of the class, it should be the first data character in the class (after
- an initial circumflex, if present) or escaped with a backslash. This
- means that, by default, an empty class cannot be defined. However, if
- the PCRE2_ALLOW_EMPTY_CLASS option is set, a closing square bracket at
+ an initial circumflex, if present) or escaped with a backslash. This
+ means that, by default, an empty class cannot be defined. However, if
+ the PCRE2_ALLOW_EMPTY_CLASS option is set, a closing square bracket at
the start does end the (empty) class.
- A character class matches a single character in the subject. A matched
+ A character class matches a single character in the subject. A matched
character must be in the set of characters defined by the class, unless
- the first character in the class definition is a circumflex, in which
+ the first character in the class definition is a circumflex, in which
case the subject character must not be in the set defined by the class.
- If a circumflex is actually required as a member of the class, ensure
+ If a circumflex is actually required as a member of the class, ensure
it is not the first character, or escape it with a backslash.
- For example, the character class [aeiou] matches any lower case vowel,
- while [^aeiou] matches any character that is not a lower case vowel.
+ For example, the character class [aeiou] matches any lower case vowel,
+ while [^aeiou] matches any character that is not a lower case vowel.
Note that a circumflex is just a convenient notation for specifying the
- characters that are in the class by enumerating those that are not. A
- class that starts with a circumflex is not an assertion; it still con-
- sumes a character from the subject string, and therefore it fails if
+ characters that are in the class by enumerating those that are not. A
+ class that starts with a circumflex is not an assertion; it still con-
+ sumes a character from the subject string, and therefore it fails if
the current pointer is at the end of the string.
- When caseless matching is set, any letters in a class represent both
- their upper case and lower case versions, so for example, a caseless
- [aeiou] matches "A" as well as "a", and a caseless [^aeiou] does not
+ When caseless matching is set, any letters in a class represent both
+ their upper case and lower case versions, so for example, a caseless
+ [aeiou] matches "A" as well as "a", and a caseless [^aeiou] does not
match "A", whereas a caseful version would.
- Characters that might indicate line breaks are never treated in any
- special way when matching character classes, whatever line-ending
- sequence is in use, and whatever setting of the PCRE2_DOTALL and
- PCRE2_MULTILINE options is used. A class such as [^a] always matches
+ Characters that might indicate line breaks are never treated in any
+ special way when matching character classes, whatever line-ending
+ sequence is in use, and whatever setting of the PCRE2_DOTALL and
+ PCRE2_MULTILINE options is used. A class such as [^a] always matches
one of these characters.
- The minus (hyphen) character can be used to specify a range of charac-
- ters in a character class. For example, [d-m] matches any letter
- between d and m, inclusive. If a minus character is required in a
- class, it must be escaped with a backslash or appear in a position
- where it cannot be interpreted as indicating a range, typically as the
+ The character escape sequences \d, \D, \h, \H, \p, \P, \s, \S, \v, \V,
+ \w, and \W may appear in a character class, and add the characters that
+ they match to the class. For example, [\dABCDEF] matches any hexadeci-
+ mal digit. In UTF modes, the PCRE2_UCP option affects the meanings of
+ \d, \s, \w and their upper case partners, just as it does when they
+ appear outside a character class, as described in the section entitled
+ "Generic character types" above. The escape sequence \b has a different
+ meaning inside a character class; it matches the backspace character.
+ The sequences \B, \N, \R, and \X are not special inside a character
+ class. Like any other unrecognized escape sequences, they cause an
+ error.
+
+ The minus (hyphen) character can be used to specify a range of charac-
+ ters in a character class. For example, [d-m] matches any letter
+ between d and m, inclusive. If a minus character is required in a
+ class, it must be escaped with a backslash or appear in a position
+ where it cannot be interpreted as indicating a range, typically as the
first or last character in the class, or immediately after a range. For
- example, [b-d-z] matches letters in the range b to d, a hyphen charac-
+ example, [b-d-z] matches letters in the range b to d, a hyphen charac-
ter, or z.
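
The [b-d-z] example can be checked in another engine that treats a hyphen after a range as a literal. The sketch below uses Python's re module on the assumption that it parses these two classes the same way as PCRE2; it does not demonstrate PCRE2's error cases.

```python
import re

# Range b-d, then a literal hyphen (it follows a range), then z.
after_range = re.compile(r"[b-d-z]")
matched = [ch for ch in "abcdez-" if after_range.fullmatch(ch)]

# An escaped hyphen is always a literal, wherever it appears.
escaped = re.compile(r"[b\-d]")
matched_escaped = [ch for ch in "abcd-" if escaped.fullmatch(ch)]
```

In the first class "e" is not matched, which confirms that "d-z" was not read as a second range.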
+ Perl treats a hyphen as a literal if it appears before or after a POSIX
+ class (see below) or before or after a character type escape such as
+ \d or \H. However, unless the hyphen is the last character in the
+ class, Perl outputs a warning in its warning mode, as this is most
+ likely a user error. As PCRE2 has no facility for warning, an error is
+ given in these cases.
+
It is not possible to have the literal character "]" as the end charac-
ter of a range. A pattern such as [W-]46] is interpreted as a class of
two characters ("W" and "-") followed by a literal string "46]", so it
@@ -6422,15 +6919,15 @@ SQUARE BRACKETS AND CHARACTER CLASSES
The octal or hexadecimal representation of "]" can also be used to end
a range.
- An error is generated if a POSIX character class (see below) or an
- escape sequence other than one that defines a single character appears
- at a point where a range ending character is expected. For example,
- [z-\xff] is valid, but [A-\d] and [A-[:digit:]] are not.
-
Ranges normally include all code points between the start and end char-
- acters, inclusive. They can also be used for code points specified
+ acters, inclusive. They can also be used for code points specified
numerically, for example [\000-\037]. Ranges can include any characters
- that are valid for the current mode.
+ that are valid for the current mode. In any UTF mode, the so-called
+ "surrogate" characters (those whose code points lie between 0xd800 and
+ 0xdfff inclusive) may not be specified explicitly by default (the
+ PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES option disables this check). How-
+ ever, ranges such as [\x{d7ff}-\x{e000}], which include the surrogates,
+ are always permitted.
There is a special case in EBCDIC environments for ranges whose end
points are both specified as literal letters in the same case. For com-
@@ -6446,18 +6943,6 @@ SQUARE BRACKETS AND CHARACTER CLASSES
character tables for a French locale are in use, [\xc8-\xcb] matches
accented E characters in both cases.
- The character escape sequences \d, \D, \h, \H, \p, \P, \s, \S, \v, \V,
- \w, and \W may appear in a character class, and add the characters that
- they match to the class. For example, [\dABCDEF] matches any hexadeci-
- mal digit. In UTF modes, the PCRE2_UCP option affects the meanings of
- \d, \s, \w and their upper case partners, just as it does when they
- appear outside a character class, as described in the section entitled
- "Generic character types" above. The escape sequence \b has a different
- meaning inside a character class; it matches the backspace character.
- The sequences \B, \N, \R, and \X are not special inside a character
- class. Like any other unrecognized escape sequences, they cause an
- error.
-
A circumflex can conveniently be used with the upper case character
types to specify a more restricted set of characters than the matching
lower case type. For example, the class [^\W_] matches any letter or
@@ -6594,20 +7079,26 @@ VERTICAL BAR
INTERNAL OPTION SETTING
- The settings of the PCRE2_CASELESS, PCRE2_MULTILINE, PCRE2_DOTALL, and
- PCRE2_EXTENDED options (which are Perl-compatible) can be changed from
- within the pattern by a sequence of Perl option letters enclosed
- between "(?" and ")". The option letters are
+ The settings of the PCRE2_CASELESS, PCRE2_MULTILINE, PCRE2_DOTALL,
+ PCRE2_EXTENDED, PCRE2_EXTENDED_MORE, and PCRE2_NO_AUTO_CAPTURE options
+ (which are Perl-compatible) can be changed from within the pattern by a
+ sequence of Perl option letters enclosed between "(?" and ")". The
+ option letters are
i for PCRE2_CASELESS
m for PCRE2_MULTILINE
+ n for PCRE2_NO_AUTO_CAPTURE
s for PCRE2_DOTALL
x for PCRE2_EXTENDED
+ xx for PCRE2_EXTENDED_MORE
For example, (?im) sets caseless, multiline matching. It is also possi-
- ble to unset these options by preceding the letter with a hyphen, and a
- combined setting and unsetting such as (?im-sx), which sets PCRE2_CASE-
- LESS and PCRE2_MULTILINE while unsetting PCRE2_DOTALL and
+ ble to unset these options by preceding the letter with a hyphen. The
+ two "extended" options are not independent; unsetting either one can-
+ cels the effects of both of them.
+
+ A combined setting and unsetting such as (?im-sx), which sets
+ PCRE2_CASELESS and PCRE2_MULTILINE while unsetting PCRE2_DOTALL and
PCRE2_EXTENDED, is also permitted. If a letter appears both before and
after the hyphen, the option is unset. An empty options setting "(?)"
is allowed. Needless to say, it has no effect.
@@ -6618,32 +7109,27 @@ INTERNAL OPTION SETTING
When one of these option changes occurs at top level (that is, not
inside subpattern parentheses), the change applies to the remainder of
- the pattern that follows. If the change is placed right at the start of
- a pattern, PCRE2 extracts it into the global options (and it will
- therefore show up in data extracted by the pcre2_pattern_info() func-
- tion).
-
- An option change within a subpattern (see below for a description of
- subpatterns) affects only that part of the subpattern that follows it,
- so
+ the pattern that follows. An option change within a subpattern (see
+ below for a description of subpatterns) affects only that part of the
+ subpattern that follows it, so
(a(?i)b)c
- matches abc and aBc and no other strings (assuming PCRE2_CASELESS is
- not used). By this means, options can be made to have different set-
+ matches abc and aBc and no other strings (assuming PCRE2_CASELESS is
+ not used). By this means, options can be made to have different set-
tings in different parts of the pattern. Any changes made in one alter-
native do carry on into subsequent branches within the same subpattern.
For example,
(a(?i)b|c)
- matches "ab", "aB", "c", and "C", even though when matching "C" the
- first branch is abandoned before the option setting. This is because
- the effects of option settings happen at compile time. There would be
+ matches "ab", "aB", "c", and "C", even though when matching "C" the
+ first branch is abandoned before the option setting. This is because
+ the effects of option settings happen at compile time. There would be
some very weird behaviour otherwise.
- As a convenient shorthand, if any option settings are required at the
- start of a non-capturing subpattern (see the next section), the option
+ As a convenient shorthand, if any option settings are required at the
+ start of a non-capturing subpattern (see the next section), the option
letters may appear between the "?" and the ":". Thus the two patterns
(?i:saturday|sunday)
@@ -6651,14 +7137,14 @@ INTERNAL OPTION SETTING
match exactly the same set of strings.
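
Option letters scoped to a non-capturing group can be tried with Python's re module, which accepts the (?i:...) form. This is an approximation under stated assumptions: recent Python versions reject an unscoped (?i) in the middle of a pattern, so (a(?i)b)c is rewritten here with a scoped group that has the same effect for this case.

```python
import re

# Equivalent in effect to (a(?i)b)c: only the "b" is caseless.
partly_caseless = re.compile(r"(a(?i:b))c")
accepted = [s for s in ["abc", "aBc", "Abc", "abC"]
            if partly_caseless.fullmatch(s)]

# Option letters between "?" and ":" apply to the whole group.
weekend = re.compile(r"(?i:saturday|sunday)")
is_weekend = [bool(weekend.fullmatch(s))
              for s in ["Saturday", "SUNDAY", "monday"]]
```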
- Note: There are other PCRE2-specific options that can be set by the
+ Note: There are other PCRE2-specific options that can be set by the
application when the compiling function is called. The pattern can con-
- tain special leading sequences such as (*CRLF) to override what the
- application has set or what has been defaulted. Details are given in
- the section entitled "Newline sequences" above. There are also the
- (*UTF) and (*UCP) leading sequences that can be used to set UTF and
- Unicode property modes; they are equivalent to setting the PCRE2_UTF
- and PCRE2_UCP options, respectively. However, the application can set
+ tain special leading sequences such as (*CRLF) to override what the
+ application has set or what has been defaulted. Details are given in
+ the section entitled "Newline sequences" above. There are also the
+ (*UTF) and (*UCP) leading sequences that can be used to set UTF and
+ Unicode property modes; they are equivalent to setting the PCRE2_UTF
+ and PCRE2_UCP options, respectively. However, the application can set
the PCRE2_NEVER_UTF and PCRE2_NEVER_UCP options, which lock out the use
of the (*UTF) and (*UCP) sequences.
@@ -6672,18 +7158,18 @@ SUBPATTERNS
cat(aract|erpillar|)
- matches "cataract", "caterpillar", or "cat". Without the parentheses,
+ matches "cataract", "caterpillar", or "cat". Without the parentheses,
it would match "cataract", "erpillar" or an empty string.
- 2. It sets up the subpattern as a capturing subpattern. This means
+ 2. It sets up the subpattern as a capturing subpattern. This means
that, when the whole pattern matches, the portion of the subject string
- that matched the subpattern is passed back to the caller, separately
- from the portion that matched the whole pattern. (This applies only to
- the traditional matching function; the DFA matching function does not
+ that matched the subpattern is passed back to the caller, separately
+ from the portion that matched the whole pattern. (This applies only to
+ the traditional matching function; the DFA matching function does not
support capturing.)
Opening parentheses are counted from left to right (starting from 1) to
- obtain numbers for the capturing subpatterns. For example, if the
+ obtain numbers for the capturing subpatterns. For example, if the
string "the red king" is matched against the pattern
the ((red|white) (king|queen))
@@ -6691,12 +7177,12 @@ SUBPATTERNS
the captured substrings are "red king", "red", and "king", and are num-
bered 1, 2, and 3, respectively.
- The fact that plain parentheses fulfil two functions is not always
- helpful. There are often times when a grouping subpattern is required
- without a capturing requirement. If an opening parenthesis is followed
- by a question mark and a colon, the subpattern does not do any captur-
- ing, and is not counted when computing the number of any subsequent
- capturing subpatterns. For example, if the string "the white queen" is
+ The fact that plain parentheses fulfil two functions is not always
+ helpful. There are often times when a grouping subpattern is required
+ without a capturing requirement. If an opening parenthesis is followed
+ by a question mark and a colon, the subpattern does not do any captur-
+ ing, and is not counted when computing the number of any subsequent
+ capturing subpatterns. For example, if the string "the white queen" is
matched against the pattern
the ((?:red|white) (king|queen))
@@ -6704,37 +7190,37 @@ SUBPATTERNS
the captured substrings are "white queen" and "queen", and are numbered
1 and 2. The maximum number of capturing subpatterns is 65535.
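The numbering rules above can be sketched with Python's standard re module. This is only a stand-in, assumed here because it numbers groups the same way for these two patterns; it is a different engine from PCRE2, and the variable names are illustrative.

```python
import re

# Capturing groups are numbered by the position of their opening parenthesis,
# counted from left to right starting at 1.
capturing = re.match(r"the ((red|white) (king|queen))", "the red king")
print(capturing.groups())

# A (?:...) group still groups the alternatives but does not capture, so it is
# skipped when the following groups are numbered.
non_capturing = re.match(r"the ((?:red|white) (king|queen))", "the white queen")
print(non_capturing.groups())
```

The first call yields three captured substrings and the second only two, because the (?:red|white) group takes no number.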
- As a convenient shorthand, if any option settings are required at the
- start of a non-capturing subpattern, the option letters may appear
+ As a convenient shorthand, if any option settings are required at the
+ start of a non-capturing subpattern, the option letters may appear
between the "?" and the ":". Thus the two patterns
(?i:saturday|sunday)
(?:(?i)saturday|sunday)
match exactly the same set of strings. Because alternative branches are
- tried from left to right, and options are not reset until the end of
- the subpattern is reached, an option setting in one branch does affect
- subsequent branches, so the above patterns match "SUNDAY" as well as
+ tried from left to right, and options are not reset until the end of
+ the subpattern is reached, an option setting in one branch does affect
+ subsequent branches, so the above patterns match "SUNDAY" as well as
"Saturday".
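A minimal sketch of the shorthand form, again using Python's re module as an assumed stand-in for PCRE2. Only the (?i:...) form is shown, because Python treats an option setting in the middle of a pattern differently from PCRE2.

```python
import re

# The option letter between "?" and ":" applies to every branch of the
# non-capturing group, so both alternatives are matched caselessly.
pattern = re.compile(r"(?i:saturday|sunday)")

subjects = ("Saturday", "SUNDAY", "monday")
results = {s: pattern.fullmatch(s) is not None for s in subjects}
print(results)
```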
DUPLICATE SUBPATTERN NUMBERS
Perl 5.10 introduced a feature whereby each alternative in a subpattern
- uses the same numbers for its capturing parentheses. Such a subpattern
- starts with (?| and is itself a non-capturing subpattern. For example,
+ uses the same numbers for its capturing parentheses. Such a subpattern
+ starts with (?| and is itself a non-capturing subpattern. For example,
consider this pattern:
(?|(Sat)ur|(Sun))day
- Because the two alternatives are inside a (?| group, both sets of cap-
- turing parentheses are numbered one. Thus, when the pattern matches,
- you can look at captured substring number one, whichever alternative
- matched. This construct is useful when you want to capture part, but
+ Because the two alternatives are inside a (?| group, both sets of cap-
+ turing parentheses are numbered one. Thus, when the pattern matches,
+ you can look at captured substring number one, whichever alternative
+ matched. This construct is useful when you want to capture part, but
not all, of one of a number of alternatives. Inside a (?| group, paren-
- theses are numbered as usual, but the number is reset at the start of
- each branch. The numbers of any capturing parentheses that follow the
- subpattern start after the highest number used in any branch. The fol-
+ theses are numbered as usual, but the number is reset at the start of
+ each branch. The numbers of any capturing parentheses that follow the
+ subpattern start after the highest number used in any branch. The fol-
lowing example is taken from the Perl documentation. The numbers under-
neath show in which buffer the captured content will be stored.
@@ -6742,14 +7228,14 @@ DUPLICATE SUBPATTERN NUMBERS
/ ( a ) (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x
# 1 2 2 3 2 3 4
- A back reference to a numbered subpattern uses the most recent value
- that is set for that number by any subpattern. The following pattern
+ A back reference to a numbered subpattern uses the most recent value
+ that is set for that number by any subpattern. The following pattern
matches "abcabc" or "defdef":
/(?|(abc)|(def))\1/
- In contrast, a subroutine call to a numbered subpattern always refers
- to the first one in the pattern with the given number. The following
+ In contrast, a subroutine call to a numbered subpattern always refers
+ to the first one in the pattern with the given number. The following
pattern matches "abcabc" or "defabc":
/(?|(abc)|(def))(?1)/
@@ -6757,47 +7243,47 @@ DUPLICATE SUBPATTERN NUMBERS
A relative reference such as (?-1) is no different: it is just a conve-
nient way of computing an absolute group number.
- If a condition test for a subpattern's having matched refers to a non-
- unique number, the test is true if any of the subpatterns of that num-
+ If a condition test for a subpattern's having matched refers to a non-
+ unique number, the test is true if any of the subpatterns of that num-
ber have matched.
- An alternative approach to using this "branch reset" feature is to use
+ An alternative approach to using this "branch reset" feature is to use
duplicate named subpatterns, as described in the next section.
NAMED SUBPATTERNS
- Identifying capturing parentheses by number is simple, but it can be
- very hard to keep track of the numbers in complicated regular expres-
- sions. Furthermore, if an expression is modified, the numbers may
+ Identifying capturing parentheses by number is simple, but it can be
+ very hard to keep track of the numbers in complicated regular expres-
+ sions. Furthermore, if an expression is modified, the numbers may
change. To help with this difficulty, PCRE2 supports the naming of sub-
patterns. This feature was not added to Perl until release 5.10. Python
- had the feature earlier, and PCRE1 introduced it at release 4.0, using
- the Python syntax. PCRE2 supports both the Perl and the Python syntax.
- Perl allows identically numbered subpatterns to have different names,
+ had the feature earlier, and PCRE1 introduced it at release 4.0, using
+ the Python syntax. PCRE2 supports both the Perl and the Python syntax.
+ Perl allows identically numbered subpatterns to have different names,
but PCRE2 does not.
- In PCRE2, a subpattern can be named in one of three ways: (?<name>...)
- or (?'name'...) as in Perl, or (?P<name>...) as in Python. References
- to capturing parentheses from other parts of the pattern, such as back
- references, recursion, and conditions, can be made by name as well as
+ In PCRE2, a subpattern can be named in one of three ways: (?<name>...)
+ or (?'name'...) as in Perl, or (?P<name>...) as in Python. References
+ to capturing parentheses from other parts of the pattern, such as back
+ references, recursion, and conditions, can be made by name as well as
by number.
- Names consist of up to 32 alphanumeric characters and underscores, but
- must start with a non-digit. Named capturing parentheses are still
- allocated numbers as well as names, exactly as if the names were not
+ Names consist of up to 32 alphanumeric characters and underscores, but
+ must start with a non-digit. Named capturing parentheses are still
+ allocated numbers as well as names, exactly as if the names were not
present. The PCRE2 API provides function calls for extracting the name-
- to-number translation table from a compiled pattern. There are also
+ to-number translation table from a compiled pattern. There are also
convenience functions for extracting a captured substring by name.
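The relationship between names and numbers can be sketched as follows. The example uses the Python syntax for named groups with Python's re module, which is an assumption made for illustration; groupindex and group() here merely play the role of the PCRE2 name-to-number table and the extract-by-name convenience functions, which are documented in pcre2api.

```python
import re

# Named groups are still allocated numbers, exactly as if they were unnamed.
pattern = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})")

# The name-to-number translation table of the compiled pattern.
name_to_number = dict(pattern.groupindex)

m = pattern.match("2018-02")
by_name = m.group("month")                      # extract by name
by_number = m.group(name_to_number["month"])    # extract by number
print(name_to_number, by_name, by_number)
```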
- By default, a name must be unique within a pattern, but it is possible
- to relax this constraint by setting the PCRE2_DUPNAMES option at com-
- pile time. (Duplicate names are also always permitted for subpatterns
- with the same number, set up as described in the previous section.)
- Duplicate names can be useful for patterns where only one instance of
+ By default, a name must be unique within a pattern, but it is possible
+ to relax this constraint by setting the PCRE2_DUPNAMES option at com-
+ pile time. (Duplicate names are also always permitted for subpatterns
+ with the same number, set up as described in the previous section.)
+ Duplicate names can be useful for patterns where only one instance of
the named parentheses can match. Suppose you want to match the name of
- a weekday, either as a 3-letter abbreviation or as the full name, and
- in both cases you want to extract the abbreviation. This pattern
+ a weekday, either as a 3-letter abbreviation or as the full name, and
+ in both cases you want to extract the abbreviation. This pattern
(ignoring the line breaks) does the job:
(?<DN>Mon|Fri|Sun)(?:day)?|
@@ -6806,18 +7292,18 @@ NAMED SUBPATTERNS
(?<DN>Thu)(?:rsday)?|
(?<DN>Sat)(?:urday)?
- There are five capturing substrings, but only one is ever set after a
+ There are five capturing substrings, but only one is ever set after a
match. (An alternative way of solving this problem is to use a "branch
reset" subpattern, as described in the previous section.)
- The convenience functions for extracting the data by name return the
- substring for the first (and in this example, the only) subpattern of
- that name that matched. This saves searching to find which numbered
+ The convenience functions for extracting the data by name return the
+ substring for the first (and in this example, the only) subpattern of
+ that name that matched. This saves searching to find which numbered
subpattern it was.
- If you make a back reference to a non-unique named subpattern from
- elsewhere in the pattern, the subpatterns to which the name refers are
- checked in the order in which they appear in the overall pattern. The
+ If you make a back reference to a non-unique named subpattern from
+ elsewhere in the pattern, the subpatterns to which the name refers are
+ checked in the order in which they appear in the overall pattern. The
first one that is set is used for the reference. For example, this pat-
tern matches both "foofoo" and "barbar" but not "foobar" or "barfoo":
@@ -6825,29 +7311,29 @@ NAMED SUBPATTERNS
If you make a subroutine call to a non-unique named subpattern, the one
- that corresponds to the first occurrence of the name is used. In the
+ that corresponds to the first occurrence of the name is used. In the
absence of duplicate numbers (see the previous section) this is the one
with the lowest number.
If you use a named reference in a condition test (see the section about
conditions below), either to check whether a subpattern has matched, or
- to check for recursion, all subpatterns with the same name are tested.
- If the condition is true for any one of them, the overall condition is
- true. This is the same behaviour as testing by number. For further
- details of the interfaces for handling named subpatterns, see the
+ to check for recursion, all subpatterns with the same name are tested.
+ If the condition is true for any one of them, the overall condition is
+ true. This is the same behaviour as testing by number. For further
+ details of the interfaces for handling named subpatterns, see the
pcre2api documentation.
Warning: You cannot use different names to distinguish between two sub-
- patterns with the same number because PCRE2 uses only the numbers when
+ patterns with the same number because PCRE2 uses only the numbers when
matching. For this reason, an error is given at compile time if differ-
- ent names are given to subpatterns with the same number. However, you
+ ent names are given to subpatterns with the same number. However, you
can always give the same name to subpatterns with the same number, even
when PCRE2_DUPNAMES is not set.
REPETITION
- Repetition is specified by quantifiers, which can follow any of the
+ Repetition is specified by quantifiers, which can follow any of the
following items:
a literal data character
@@ -6861,17 +7347,17 @@ REPETITION
a parenthesized subpattern (including most assertions)
a subroutine call to a subpattern (recursive or otherwise)
- The general repetition quantifier specifies a minimum and maximum num-
- ber of permitted matches, by giving the two numbers in curly brackets
- (braces), separated by a comma. The numbers must be less than 65536,
+ The general repetition quantifier specifies a minimum and maximum num-
+ ber of permitted matches, by giving the two numbers in curly brackets
+ (braces), separated by a comma. The numbers must be less than 65536,
and the first must be less than or equal to the second. For example:
z{2,4}
- matches "zz", "zzz", or "zzzz". A closing brace on its own is not a
- special character. If the second number is omitted, but the comma is
- present, there is no upper limit; if the second number and the comma
- are both omitted, the quantifier specifies an exact number of required
+ matches "zz", "zzz", or "zzzz". A closing brace on its own is not a
+ special character. If the second number is omitted, but the comma is
+ present, there is no upper limit; if the second number and the comma
+ are both omitted, the quantifier specifies an exact number of required
matches. Thus
[aeiou]{3,}
@@ -6880,50 +7366,50 @@ REPETITION
\d{8}
- matches exactly 8 digits. An opening curly bracket that appears in a
- position where a quantifier is not allowed, or one that does not match
- the syntax of a quantifier, is taken as a literal character. For exam-
+ matches exactly 8 digits. An opening curly bracket that appears in a
+ position where a quantifier is not allowed, or one that does not match
+ the syntax of a quantifier, is taken as a literal character. For exam-
ple, {,6} is not a quantifier, but a literal string of four characters.
In UTF modes, quantifiers apply to characters rather than to individual
- code units. Thus, for example, \x{100}{2} matches two characters, each
+ code units. Thus, for example, \x{100}{2} matches two characters, each
of which is represented by a two-byte sequence in a UTF-8 string. Simi-
- larly, \X{3} matches three Unicode extended grapheme clusters, each of
- which may be several code units long (and they may be of different
+ larly, \X{3} matches three Unicode extended grapheme clusters, each of
+ which may be several code units long (and they may be of different
lengths).
The quantifier {0} is permitted, causing the expression to behave as if
the previous item and the quantifier were not present. This may be use-
- ful for subpatterns that are referenced as subroutines from elsewhere
+ ful for subpatterns that are referenced as subroutines from elsewhere
in the pattern (but see also the section entitled "Defining subpatterns
- for use by reference only" below). Items other than subpatterns that
+ for use by reference only" below). Items other than subpatterns that
have a {0} quantifier are omitted from the compiled pattern.
- For convenience, the three most common quantifiers have single-charac-
+ For convenience, the three most common quantifiers have single-charac-
ter abbreviations:
* is equivalent to {0,}
+ is equivalent to {1,}
? is equivalent to {0,1}
- It is possible to construct infinite loops by following a subpattern
+ It is possible to construct infinite loops by following a subpattern
that can match no characters with a quantifier that has no upper limit,
for example:
(a?)*
- Earlier versions of Perl and PCRE1 used to give an error at compile
+ Earlier versions of Perl and PCRE1 used to give an error at compile
time for such patterns. However, because there are cases where this can
be useful, such patterns are now accepted, but if any repetition of the
- subpattern does in fact match no characters, the loop is forcibly bro-
+ subpattern does in fact match no characters, the loop is forcibly bro-
ken.
- By default, the quantifiers are "greedy", that is, they match as much
- as possible (up to the maximum number of permitted times), without
- causing the rest of the pattern to fail. The classic example of where
+ By default, the quantifiers are "greedy", that is, they match as much
+ as possible (up to the maximum number of permitted times), without
+ causing the rest of the pattern to fail. The classic example of where
this gives problems is in trying to match comments in C programs. These
- appear between /* and */ and within the comment, individual * and /
- characters may appear. An attempt to match C comments by applying the
+ appear between /* and */ and within the comment, individual * and /
+ characters may appear. An attempt to match C comments by applying the
pattern
/\*.*\*/
@@ -6932,19 +7418,19 @@ REPETITION
/* first comment */ not comment /* second comment */
- fails, because it matches the entire string owing to the greediness of
+ fails, because it matches the entire string owing to the greediness of
the .* item.
If a quantifier is followed by a question mark, it ceases to be greedy,
- and instead matches the minimum number of times possible, so the pat-
+ and instead matches the minimum number of times possible, so the pat-
tern
/\*.*?\*/
- does the right thing with the C comments. The meaning of the various
- quantifiers is not otherwise changed, just the preferred number of
- matches. Do not confuse this use of question mark with its use as a
- quantifier in its own right. Because it has two uses, it can sometimes
+ does the right thing with the C comments. The meaning of the various
+ quantifiers is not otherwise changed, just the preferred number of
+ matches. Do not confuse this use of question mark with its use as a
+ quantifier in its own right. Because it has two uses, it can sometimes
appear doubled, as in
\d??\d
@@ -6953,45 +7439,45 @@ REPETITION
only way the rest of the pattern matches.
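The difference between greedy and minimizing quantifiers can be sketched with the C comment example. Python's re module is used as an assumed stand-in for PCRE2; the two engines agree on these patterns, but the variable names are illustrative only.

```python
import re

subject = "/* first comment */ not comment /* second comment */"

# Greedy: .* runs to the end of the subject and backs off only as far as the
# last "*/", so the whole string is returned as a single match.
greedy = re.findall(r"/\*.*\*/", subject)

# Minimizing: .*? stops at the first "*/" it can, giving one match per comment.
lazy = re.findall(r"/\*.*?\*/", subject)

# A doubled question mark: \d?? prefers to match no digit, but matches one
# when that is the only way for the rest of the pattern to succeed.
one_digit = re.match(r"\d??\d", "12").group()
two_digits = re.match(r"\d??\d$", "12").group()

print(greedy, lazy, one_digit, two_digits)
```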
If the PCRE2_UNGREEDY option is set (an option that is not available in
- Perl), the quantifiers are not greedy by default, but individual ones
- can be made greedy by following them with a question mark. In other
+ Perl), the quantifiers are not greedy by default, but individual ones
+ can be made greedy by following them with a question mark. In other
words, it inverts the default behaviour.
- When a parenthesized subpattern is quantified with a minimum repeat
- count that is greater than 1 or with a limited maximum, more memory is
- required for the compiled pattern, in proportion to the size of the
+ When a parenthesized subpattern is quantified with a minimum repeat
+ count that is greater than 1 or with a limited maximum, more memory is
+ required for the compiled pattern, in proportion to the size of the
minimum or maximum.
- If a pattern starts with .* or .{0,} and the PCRE2_DOTALL option
- (equivalent to Perl's /s) is set, thus allowing the dot to match new-
- lines, the pattern is implicitly anchored, because whatever follows
- will be tried against every character position in the subject string,
- so there is no point in retrying the overall match at any position
+ If a pattern starts with .* or .{0,} and the PCRE2_DOTALL option
+ (equivalent to Perl's /s) is set, thus allowing the dot to match new-
+ lines, the pattern is implicitly anchored, because whatever follows
+ will be tried against every character position in the subject string,
+ so there is no point in retrying the overall match at any position
after the first. PCRE2 normally treats such a pattern as though it were
preceded by \A.
- In cases where it is known that the subject string contains no new-
- lines, it is worth setting PCRE2_DOTALL in order to obtain this opti-
+ In cases where it is known that the subject string contains no new-
+ lines, it is worth setting PCRE2_DOTALL in order to obtain this opti-
mization, or alternatively, using ^ to indicate anchoring explicitly.
- However, there are some cases where the optimization cannot be used.
+ However, there are some cases where the optimization cannot be used.
When .* is inside capturing parentheses that are the subject of a back
reference elsewhere in the pattern, a match at the start may fail where
a later one succeeds. Consider, for example:
(.*)abc\1
- If the subject is "xyz123abc123" the match point is the fourth charac-
+ If the subject is "xyz123abc123" the match point is the fourth charac-
ter. For this reason, such a pattern is not implicitly anchored.
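The following sketch shows why anchoring would give the wrong answer for this pattern. It uses Python's re module, an assumed stand-in that performs no implicit anchoring of its own, so it simply reports where an unanchored search succeeds.

```python
import re

# Every attempt that starts at offsets 0, 1 or 2 fails, because the text
# captured by (.*) is not repeated after "abc". The search succeeds only
# when it starts at offset 3, the fourth character.
m = re.search(r"(.*)abc\1", "xyz123abc123")
print(m.start(), m.group(1), m.group(0))
```

Had the pattern been treated as if preceded by \A, only offset 0 would have been tried and no match would have been found.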
- Another case where implicit anchoring is not applied is when the lead-
- ing .* is inside an atomic group. Once again, a match at the start may
+ Another case where implicit anchoring is not applied is when the lead-
+ ing .* is inside an atomic group. Once again, a match at the start may
fail where a later one succeeds. Consider this pattern:
(?>.*?a)b
- It matches "ab" in the subject "aab". The use of the backtracking con-
- trol verbs (*PRUNE) and (*SKIP) also disables this optimization, and
+ It matches "ab" in the subject "aab". The use of the backtracking con-
+ trol verbs (*PRUNE) and (*SKIP) also disables this optimization, and
there is an option, PCRE2_NO_DOTSTAR_ANCHOR, to do so explicitly.
When a capturing subpattern is repeated, the value captured is the sub-
@@ -7000,8 +7486,8 @@ REPETITION
(tweedle[dume]{3}\s*)+
has matched "tweedledum tweedledee" the value of the captured substring
- is "tweedledee". However, if there are nested capturing subpatterns,
- the corresponding captured values may have been set in previous itera-
+ is "tweedledee". However, if there are nested capturing subpatterns,
+ the corresponding captured values may have been set in previous itera-
tions. For example, after
(a|(b))+
@@ -7011,53 +7497,53 @@ REPETITION
ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS
- With both maximizing ("greedy") and minimizing ("ungreedy" or "lazy")
- repetition, failure of what follows normally causes the repeated item
- to be re-evaluated to see if a different number of repeats allows the
- rest of the pattern to match. Sometimes it is useful to prevent this,
- either to change the nature of the match, or to cause it to fail earlier
- than it otherwise might, when the author of the pattern knows there is
+ With both maximizing ("greedy") and minimizing ("ungreedy" or "lazy")
+ repetition, failure of what follows normally causes the repeated item
+ to be re-evaluated to see if a different number of repeats allows the
+ rest of the pattern to match. Sometimes it is useful to prevent this,
+ either to change the nature of the match, or to cause it to fail earlier
+ than it otherwise might, when the author of the pattern knows there is
no point in carrying on.
- Consider, for example, the pattern \d+foo when applied to the subject
+ Consider, for example, the pattern \d+foo when applied to the subject
line
123456bar
After matching all 6 digits and then failing to match "foo", the normal
- action of the matcher is to try again with only 5 digits matching the
- \d+ item, and then with 4, and so on, before ultimately failing.
- "Atomic grouping" (a term taken from Jeffrey Friedl's book) provides
- the means for specifying that once a subpattern has matched, it is not
+ action of the matcher is to try again with only 5 digits matching the
+ \d+ item, and then with 4, and so on, before ultimately failing.
+ "Atomic grouping" (a term taken from Jeffrey Friedl's book) provides
+ the means for specifying that once a subpattern has matched, it is not
to be re-evaluated in this way.
- If we use atomic grouping for the previous example, the matcher gives
- up immediately on failing to match "foo" the first time. The notation
+ If we use atomic grouping for the previous example, the matcher gives
+ up immediately on failing to match "foo" the first time. The notation
is a kind of special parenthesis, starting with (?> as in this example:
(?>\d+)foo
- This kind of parenthesis "locks up" the part of the pattern it con-
- tains once it has matched, and a failure further into the pattern is
- prevented from backtracking into it. Backtracking past it to previous
+ This kind of parenthesis "locks up" the part of the pattern it con-
+ tains once it has matched, and a failure further into the pattern is
+ prevented from backtracking into it. Backtracking past it to previous
items, however, works as normal.
- An alternative description is that a subpattern of this type matches
- exactly the string of characters that an identical standalone pattern
+ An alternative description is that a subpattern of this type matches
+ exactly the string of characters that an identical standalone pattern
would match, if anchored at the current point in the subject string.
Atomic grouping subpatterns are not capturing subpatterns. Simple cases
such as the above example can be thought of as a maximizing repeat that
- must swallow everything it can. So, while both \d+ and \d+? are pre-
- pared to adjust the number of digits they match in order to make the
+ must swallow everything it can. So, while both \d+ and \d+? are pre-
+ pared to adjust the number of digits they match in order to make the
rest of the pattern match, (?>\d+) can only match an entire sequence of
digits.
- Atomic groups in general can of course contain arbitrarily complicated
- subpatterns, and can be nested. However, when the subpattern for an
+ Atomic groups in general can of course contain arbitrarily complicated
+ subpatterns, and can be nested. However, when the subpattern for an
atomic group is just a single repeated item, as in the example above, a
- simpler notation, called a "possessive quantifier" can be used. This
- consists of an additional + character following a quantifier. Using
+ simpler notation, called a "possessive quantifier" can be used. This
+ consists of an additional + character following a quantifier. Using
this notation, the previous example can be rewritten as
\d++foo
@@ -7067,46 +7553,46 @@ ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS
(abc|xyz){2,3}+
- Possessive quantifiers are always greedy; the setting of the
- PCRE2_UNGREEDY option is ignored. They are a convenient notation for
- the simpler forms of atomic group. However, there is no difference in
+ Possessive quantifiers are always greedy; the setting of the
+ PCRE2_UNGREEDY option is ignored. They are a convenient notation for
+ the simpler forms of atomic group. However, there is no difference in
the meaning of a possessive quantifier and the equivalent atomic group,
- though there may be a performance difference; possessive quantifiers
+ though there may be a performance difference; possessive quantifiers
should be slightly faster.
- The possessive quantifier syntax is an extension to the Perl 5.8 syn-
- tax. Jeffrey Friedl originated the idea (and the name) in the first
+ The possessive quantifier syntax is an extension to the Perl 5.8 syn-
+ tax. Jeffrey Friedl originated the idea (and the name) in the first
edition of his book. Mike McCloskey liked it, so implemented it when he
built Sun's Java package, and PCRE1 copied it from there. It ultimately
found its way into Perl at release 5.10.
- PCRE2 has an optimization that automatically "possessifies" certain
- simple pattern constructs. For example, the sequence A+B is treated as
- A++B because there is no point in backtracking into a sequence of A's
+ PCRE2 has an optimization that automatically "possessifies" certain
+ simple pattern constructs. For example, the sequence A+B is treated as
+ A++B because there is no point in backtracking into a sequence of A's
when B must follow. This feature can be disabled by the
PCRE2_NO_AUTO_POSSESS option, or by starting the pattern with
(*NO_AUTO_POSSESS).
- When a pattern contains an unlimited repeat inside a subpattern that
- can itself be repeated an unlimited number of times, the use of an
- atomic group is the only way to avoid some failing matches taking a
+ When a pattern contains an unlimited repeat inside a subpattern that
+ can itself be repeated an unlimited number of times, the use of an
+ atomic group is the only way to avoid some failing matches taking a
very long time indeed. The pattern
(\D+|<\d+>)*[!?]
- matches an unlimited number of substrings that either consist of non-
- digits, or digits enclosed in <>, followed by either ! or ?. When it
+ matches an unlimited number of substrings that either consist of non-
+ digits, or digits enclosed in <>, followed by either ! or ?. When it
matches, it runs quickly. However, if it is applied to
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
- it takes a long time before reporting failure. This is because the
- string can be divided between the internal \D+ repeat and the external
- * repeat in a large number of ways, and all have to be tried. (The
- example uses [!?] rather than a single character at the end, because
- both PCRE2 and Perl have an optimization that allows for fast failure
- when a single character is used. They remember the last single charac-
- ter that is required for a match, and fail early if it is not present
- in the string.) If the pattern is changed so that it uses an atomic
+ it takes a long time before reporting failure. This is because the
+ string can be divided between the internal \D+ repeat and the external
+ * repeat in a large number of ways, and all have to be tried. (The
+ example uses [!?] rather than a single character at the end, because
+ both PCRE2 and Perl have an optimization that allows for fast failure
+ when a single character is used. They remember the last single charac-
+ ter that is required for a match, and fail early if it is not present
+ in the string.) If the pattern is changed so that it uses an atomic
group, like this:
((?>\D+)|<\d+>)*[!?]
@@ -7118,71 +7604,75 @@ BACK REFERENCES
Outside a character class, a backslash followed by a digit greater than
0 (and possibly further digits) is a back reference to a capturing sub-
- pattern earlier (that is, to its left) in the pattern, provided there
+ pattern earlier (that is, to its left) in the pattern, provided there
have been that many previous capturing left parentheses.
- However, if the decimal number following the backslash is less than 8,
- it is always taken as a back reference, and causes an error only if
- there are not that many capturing left parentheses in the entire pat-
- tern. In other words, the parentheses that are referenced need not be
- to the left of the reference for numbers less than 8. A "forward back
- reference" of this type can make sense when a repetition is involved
- and the subpattern to the right has participated in an earlier itera-
+ However, if the decimal number following the backslash is less than 8,
+ it is always taken as a back reference, and causes an error only if
+ there are not that many capturing left parentheses in the entire pat-
+ tern. In other words, the parentheses that are referenced need not be
+ to the left of the reference for numbers less than 8. A "forward back
+ reference" of this type can make sense when a repetition is involved
+ and the subpattern to the right has participated in an earlier itera-
tion.
- It is not possible to have a numerical "forward back reference" to a
- subpattern whose number is 8 or more using this syntax because a
- sequence such as \50 is interpreted as a character defined in octal.
+ It is not possible to have a numerical "forward back reference" to a
+ subpattern whose number is 8 or more using this syntax because a
+ sequence such as \50 is interpreted as a character defined in octal.
See the subsection entitled "Non-printing characters" above for further
- details of the handling of digits following a backslash. There is no
- such problem when named parentheses are used. A back reference to any
+ details of the handling of digits following a backslash. There is no
+ such problem when named parentheses are used. A back reference to any
subpattern is possible using named parentheses (see below).
- Another way of avoiding the ambiguity inherent in the use of digits
- following a backslash is to use the \g escape sequence. This escape
- must be followed by an unsigned number or a negative number, optionally
- enclosed in braces. These examples are all identical:
+ Another way of avoiding the ambiguity inherent in the use of digits
+ following a backslash is to use the \g escape sequence. This escape
+ must be followed by a signed or unsigned number, optionally enclosed in
+ braces. These examples are all identical:
(ring), \1
(ring), \g1
(ring), \g{1}
- An unsigned number specifies an absolute reference without the ambigu-
+ An unsigned number specifies an absolute reference without the ambigu-
ity that is present in the older syntax. It is also useful when literal
- digits follow the reference. A negative number is a relative reference.
+ digits follow the reference. A signed number is a relative reference.
Consider this example:
(abc(def)ghi)\g{-1}
The sequence \g{-1} is a reference to the most recently started captur-
ing subpattern before \g, that is, it is equivalent to \2 in this exam-
- ple. Similarly, \g{-2} would be equivalent to \1. The use of relative
- references can be helpful in long patterns, and also in patterns that
- are created by joining together fragments that contain references
+ ple. Similarly, \g{-2} would be equivalent to \1. The use of relative
+ references can be helpful in long patterns, and also in patterns that
+ are created by joining together fragments that contain references
within themselves.
- A back reference matches whatever actually matched the capturing sub-
- pattern in the current subject string, rather than anything matching
+ The sequence \g{+1} is a reference to the next capturing subpattern.
+ This kind of forward reference can be useful in patterns that repeat.
+ Perl does not support the use of + in this way.
+
+ A back reference matches whatever actually matched the capturing sub-
+ pattern in the current subject string, rather than anything matching
the subpattern itself (see "Subpatterns as subroutines" below for a way
of doing that). So the pattern
(sens|respons)e and \1ibility
- matches "sense and sensibility" and "response and responsibility", but
- not "sense and responsibility". If caseful matching is in force at the
- time of the back reference, the case of letters is relevant. For exam-
+ matches "sense and sensibility" and "response and responsibility", but
+ not "sense and responsibility". If caseful matching is in force at the
+ time of the back reference, the case of letters is relevant. For exam-
ple,
((?i)rah)\s+\1
- matches "rah rah" and "RAH RAH", but not "RAH rah", even though the
+ matches "rah rah" and "RAH RAH", but not "RAH rah", even though the
original capturing subpattern is matched caselessly.
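
As a rough illustration of the first example above, the following sketch uses Python's re module as a stand-in for PCRE2 (an assumption: the two engines are different, but they agree on this basic numbered back reference). The helper name matches_whole is hypothetical.

```python
import re

# Python's re is used only as a stand-in here; it is not PCRE2.
# \1 must repeat the text that group 1 actually matched.
pattern = re.compile(r"(sens|respons)e and \1ibility")


def matches_whole(subject):
    """Return True if the entire subject matches the pattern."""
    return pattern.fullmatch(subject) is not None


same_word_short = matches_whole("sense and sensibility")
same_word_long = matches_whole("response and responsibility")
mixed_words = matches_whole("sense and responsibility")
```

The third subject fails because group 1 captured "sens", so \1 can only match "sens" again, not "respons".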
- There are several different ways of writing back references to named
- subpatterns. The .NET syntax \k{name} and the Perl syntax \k<name> or
- \k'name' are supported, as is the Python syntax (?P=name). Perl 5.10's
+ There are several different ways of writing back references to named
+ subpatterns. The .NET syntax \k{name} and the Perl syntax \k<name> or
+ \k'name' are supported, as is the Python syntax (?P=name). Perl 5.10's
unified back reference syntax, in which \g can be used for both numeric
- and named references, is also supported. We could rewrite the above
+ and named references, is also supported. We could rewrite the above
example in any of the following ways:
(?<p1>(?i)rah)\s+\k<p1>
@@ -7190,86 +7680,96 @@ BACK REFERENCES
(?P<p1>(?i)rah)\s+(?P=p1)
(?<p1>(?i)rah)\s+\g{p1}
- A subpattern that is referenced by name may appear in the pattern
+ A subpattern that is referenced by name may appear in the pattern
before or after the reference.
- There may be more than one back reference to the same subpattern. If a
- subpattern has not actually been used in a particular match, any back
+ There may be more than one back reference to the same subpattern. If a
+ subpattern has not actually been used in a particular match, any back
references to it always fail by default. For example, the pattern
(a|(bc))\2
- always fails if it starts to match "a" rather than "bc". However, if
- the PCRE2_MATCH_UNSET_BACKREF option is set at compile time, a back
+ always fails if it starts to match "a" rather than "bc". However, if
+ the PCRE2_MATCH_UNSET_BACKREF option is set at compile time, a back
reference to an unset value matches an empty string.
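
The default behaviour described above can be sketched with Python's re module, which (as far as this example goes) also treats a back reference to an unset group as a failure. This is an assumption-laden stand-in, not PCRE2 itself, and Python has no counterpart to PCRE2_MATCH_UNSET_BACKREF, so only the default case is shown.

```python
import re

# Stand-in engine: Python's re, not PCRE2.
# Group 2 is set only when the "bc" alternative is taken.
pattern = re.compile(r"(a|(bc))\2")

# The "bc" branch sets group 2, so \2 can match a second "bc".
took_bc_branch = pattern.fullmatch("bcbc") is not None

# The "a" branch leaves group 2 unset, so \2 fails whatever follows.
took_a_branch = pattern.search("aa") is not None
```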
- Because there may be many capturing parentheses in a pattern, all dig-
- its following a backslash are taken as part of a potential back refer-
- ence number. If the pattern continues with a digit character, some
- delimiter must be used to terminate the back reference. If the
- PCRE2_EXTENDED option is set, this can be white space. Otherwise, the
+ Because there may be many capturing parentheses in a pattern, all dig-
+ its following a backslash are taken as part of a potential back refer-
+ ence number. If the pattern continues with a digit character, some
+ delimiter must be used to terminate the back reference. If the
+ PCRE2_EXTENDED option is set, this can be white space. Otherwise, the
\g{ syntax or an empty comment (see "Comments" below) can be used.
Recursive back references
- A back reference that occurs inside the parentheses to which it refers
- fails when the subpattern is first used, so, for example, (a\1) never
- matches. However, such references can be useful inside repeated sub-
+ A back reference that occurs inside the parentheses to which it refers
+ fails when the subpattern is first used, so, for example, (a\1) never
+ matches. However, such references can be useful inside repeated sub-
patterns. For example, the pattern
(a|b\1)+
matches any number of "a"s and also "aba", "ababbaa" etc. At each iter-
- ation of the subpattern, the back reference matches the character
- string corresponding to the previous iteration. In order for this to
- work, the pattern must be such that the first iteration does not need
- to match the back reference. This can be done using alternation, as in
+ ation of the subpattern, the back reference matches the character
+ string corresponding to the previous iteration. In order for this to
+ work, the pattern must be such that the first iteration does not need
+ to match the back reference. This can be done using alternation, as in
the example above, or by a quantifier with a minimum of zero.
- Back references of this type cause the group that they reference to be
- treated as an atomic group. Once the whole group has been matched, a
- subsequent matching failure cannot cause backtracking into the middle
+ Back references of this type cause the group that they reference to be
+ treated as an atomic group. Once the whole group has been matched, a
+ subsequent matching failure cannot cause backtracking into the middle
of the group.
ASSERTIONS
- An assertion is a test on the characters following or preceding the
+ An assertion is a test on the characters following or preceding the
current matching point that does not consume any characters. The simple
- assertions coded as \b, \B, \A, \G, \Z, \z, ^ and $ are described
+ assertions coded as \b, \B, \A, \G, \Z, \z, ^ and $ are described
above.
- More complicated assertions are coded as subpatterns. There are two
- kinds: those that look ahead of the current position in the subject
- string, and those that look behind it. An assertion subpattern is
- matched in the normal way, except that it does not cause the current
- matching position to be changed.
-
- Assertion subpatterns are not capturing subpatterns. If such an asser-
- tion contains capturing subpatterns within it, these are counted for
- the purposes of numbering the capturing subpatterns in the whole pat-
- tern. However, substring capturing is carried out only for positive
- assertions. (Perl sometimes, but not always, does do capturing in nega-
- tive assertions.)
-
- For compatibility with Perl, most assertion subpatterns may be
- repeated; though it makes no sense to assert the same thing several
- times, the side effect of capturing parentheses may occasionally be
- useful. However, an assertion that forms the condition for a condi-
- tional subpattern may not be quantified. In practice, for other asser-
+ More complicated assertions are coded as subpatterns. There are two
+ kinds: those that look ahead of the current position in the subject
+ string, and those that look behind it, and in each case an assertion
+ may be positive (must succeed for matching to continue) or negative
+ (must not succeed for matching to continue). An assertion subpattern is
+ matched in the normal way, except that, when matching continues after-
+ wards, the matching position in the subject string is as it was at the
+ start of the assertion.
+
+ Assertion subpatterns are not capturing subpatterns. If an assertion
+ contains capturing subpatterns within it, these are counted for the
+ purposes of numbering the capturing subpatterns in the whole pattern.
+ However, substring capturing is carried out only for positive asser-
+ tions that succeed, that is, one of their branches matches, so matching
+ continues after the assertion. If all branches of a positive assertion
+ fail to match, nothing is captured, and control is passed to the previ-
+ ous backtracking point.
+
+ No capturing is done for a negative assertion unless it is being used
+ as a condition in a conditional subpattern (see the discussion below).
+ Matching continues after a non-conditional negative assertion only if
+ all its branches fail to match.
+
+ For compatibility with Perl, most assertion subpatterns may be
+ repeated; though it makes no sense to assert the same thing several
+ times, the side effect of capturing parentheses may occasionally be
+ useful. However, an assertion that forms the condition for a condi-
+ tional subpattern may not be quantified. In practice, for other asser-
tions, there are only three cases:
- (1) If the quantifier is {0}, the assertion is never obeyed during
- matching. However, it may contain internal capturing parenthesized
+ (1) If the quantifier is {0}, the assertion is never obeyed during
+ matching. However, it may contain internal capturing parenthesized
groups that are called from elsewhere via the subroutine mechanism.
- (2) If quantifier is {0,n} where n is greater than zero, it is treated
- as if it were {0,1}. At run time, the rest of the pattern match is
+ (2) If the quantifier is {0,n} where n is greater than zero, it is treated
+ as if it were {0,1}. At run time, the rest of the pattern match is
tried with and without the assertion, the order depending on the greed-
iness of the quantifier.
- (3) If the minimum repetition is greater than zero, the quantifier is
- ignored. The assertion is obeyed just once when encountered during
+ (3) If the minimum repetition is greater than zero, the quantifier is
+ ignored. The assertion is obeyed just once when encountered during
matching.
Lookahead assertions
@@ -7279,38 +7779,38 @@ ASSERTIONS
\w+(?=;)
- matches a word followed by a semicolon, but does not include the semi-
+ matches a word followed by a semicolon, but does not include the semi-
colon in the match, and
foo(?!bar)
- matches any occurrence of "foo" that is not followed by "bar". Note
+ matches any occurrence of "foo" that is not followed by "bar". Note
that the apparently similar pattern
(?!foo)bar
- does not find an occurrence of "bar" that is preceded by something
- other than "foo"; it finds any occurrence of "bar" whatsoever, because
+ does not find an occurrence of "bar" that is preceded by something
+ other than "foo"; it finds any occurrence of "bar" whatsoever, because
the assertion (?!foo) is always true when the next three characters are
"bar". A lookbehind assertion is needed to achieve the other effect.
If you want to force a matching failure at some point in a pattern, the
- most convenient way to do it is with (?!) because an empty string
- always matches, so an assertion that requires there not to be an empty
+ most convenient way to do it is with (?!) because an empty string
+ always matches, so an assertion that requires there not to be an empty
string must always fail. The backtracking control verb (*FAIL) or (*F)
is a synonym for (?!).
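
The lookahead examples above can be tried out with a short sketch. It uses Python's re module as a stand-in (an assumption: Python is not PCRE2, but it handles these simple lookaheads in the same way), and the subject strings are invented for illustration.

```python
import re

# Stand-in engine: Python's re, not PCRE2.

# Positive lookahead: the semicolon must follow but is not part of the match.
word = re.search(r"\w+(?=;)", "value;").group(0)

# Negative lookahead: only the "foo" that is not followed by "bar" matches.
foo_starts = [m.start() for m in re.finditer(r"foo(?!bar)", "foobar foobaz")]

# (?!foo)bar still finds "bar" after "foo": at the point where "bar"
# begins, the next three characters are "bar", not "foo".
bar_start = re.search(r"(?!foo)bar", "foobar").start()
```

Here word is "value", foo_starts is [7] and bar_start is 3.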
Lookbehind assertions
- Lookbehind assertions start with (?<= for positive assertions and (?<!
+ Lookbehind assertions start with (?<= for positive assertions and (?<!
for negative assertions. For example,
(?<!foo)bar
- does find an occurrence of "bar" that is not preceded by "foo". The
- contents of a lookbehind assertion are restricted such that all the
+ does find an occurrence of "bar" that is not preceded by "foo". The
+ contents of a lookbehind assertion are restricted such that all the
strings it matches must have a fixed length. However, if there are sev-
- eral top-level alternatives, they do not all have to have the same
+ eral top-level alternatives, they do not all have to have the same
fixed length. Thus
(?<=bullock|donkey)
@@ -7319,62 +7819,74 @@ ASSERTIONS
(?<!dogs?|cats?)
- causes an error at compile time. Branches that match different length
- strings are permitted only at the top level of a lookbehind assertion.
+ causes an error at compile time. Branches that match different length
+ strings are permitted only at the top level of a lookbehind assertion.
This is an extension compared with Perl, which requires all branches to
match the same length of string. An assertion such as
(?<=ab(c|de))
- is not permitted, because its single top-level branch can match two
- different lengths, but it is acceptable to PCRE2 if rewritten to use
+ is not permitted, because its single top-level branch can match two
+ different lengths, but it is acceptable to PCRE2 if rewritten to use
two top-level branches:
(?<=abc|abde)
- In some cases, the escape sequence \K (see above) can be used instead
+ In some cases, the escape sequence \K (see above) can be used instead
of a lookbehind assertion to get round the fixed-length restriction.
- The implementation of lookbehind assertions is, for each alternative,
- to temporarily move the current position back by the fixed length and
+ The implementation of lookbehind assertions is, for each alternative,
+ to temporarily move the current position back by the fixed length and
then try to match. If there are insufficient characters before the cur-
rent position, the assertion fails.
- In a UTF mode, PCRE2 does not allow the \C escape (which matches a sin-
- gle code unit even in a UTF mode) to appear in lookbehind assertions,
- because it makes it impossible to calculate the length of the lookbe-
- hind. The \X and \R escapes, which can match different numbers of code
- units, are also not permitted.
-
- "Subroutine" calls (see below) such as (?2) or (?&X) are permitted in
- lookbehinds, as long as the subpattern matches a fixed-length string.
- Recursion, however, is not supported.
-
- Possessive quantifiers can be used in conjunction with lookbehind
+ In UTF-8 and UTF-16 modes, PCRE2 does not allow the \C escape (which
+ matches a single code unit even in a UTF mode) to appear in lookbehind
+ assertions, because it makes it impossible to calculate the length of
+ the lookbehind. The \X and \R escapes, which can match different num-
+ bers of code units, are never permitted in lookbehinds.
+
+ "Subroutine" calls (see below) such as (?2) or (?&X) are permitted in
+ lookbehinds, as long as the subpattern matches a fixed-length string.
+ However, recursion, that is, a "subroutine" call into a group that is
+ already active, is not supported.
+
+ Perl does not support back references in lookbehinds. PCRE2 does sup-
+ port them, but only if certain conditions are met. The
+ PCRE2_MATCH_UNSET_BACKREF option must not be set, there must be no use
+ of (?| in the pattern (it creates duplicate subpattern numbers), and if
+ the back reference is by name, the name must be unique. Of course, the
+ referenced subpattern must itself be of fixed length. The following
+ pattern matches words containing at least two characters that begin and
+ end with the same character:
+
+ \b(\w)\w++(?<=\1)
+
+ Possessive quantifiers can be used in conjunction with lookbehind
assertions to specify efficient matching of fixed-length strings at the
end of subject strings. Consider a simple pattern such as
abcd$
- when applied to a long string that does not match. Because matching
- proceeds from left to right, PCRE2 will look for each "a" in the sub-
- ject and then see if what follows matches the rest of the pattern. If
+ when applied to a long string that does not match. Because matching
+ proceeds from left to right, PCRE2 will look for each "a" in the sub-
+ ject and then see if what follows matches the rest of the pattern. If
the pattern is specified as
^.*abcd$
- the initial .* matches the entire string at first, but when this fails
+ the initial .* matches the entire string at first, but when this fails
(because there is no following "a"), it backtracks to match all but the
- last character, then all but the last two characters, and so on. Once
- again the search for "a" covers the entire string, from right to left,
+ last character, then all but the last two characters, and so on. Once
+ again the search for "a" covers the entire string, from right to left,
so we are no better off. However, if the pattern is written as
^.*+(?<=abcd)
there can be no backtracking for the .*+ item because of the possessive
quantifier; it can match only the entire string. The subsequent lookbe-
- hind assertion does a single test on the last four characters. If it
- fails, the match fails immediately. For long strings, this approach
+ hind assertion does a single test on the last four characters. If it
+ fails, the match fails immediately. For long strings, this approach
makes a significant difference to the processing time.
Using multiple assertions
@@ -7383,18 +7895,18 @@ ASSERTIONS
(?<=\d{3})(?<!999)foo
- matches "foo" preceded by three digits that are not "999". Notice that
- each of the assertions is applied independently at the same point in
- the subject string. First there is a check that the previous three
- characters are all digits, and then there is a check that the same
+ matches "foo" preceded by three digits that are not "999". Notice that
+ each of the assertions is applied independently at the same point in
+ the subject string. First there is a check that the previous three
+ characters are all digits, and then there is a check that the same
three characters are not "999". This pattern does not match "foo" pre-
- ceded by six characters, the first of which are digits and the last
- three of which are not "999". For example, it doesn't match "123abc-
+ ceded by six characters, the first of which are digits and the last
+ three of which are not "999". For example, it doesn't match "123abc-
foo". A pattern to do that is
(?<=\d{3}...)(?<!999)foo
- This time the first assertion looks at the preceding six characters,
+ This time the first assertion looks at the preceding six characters,
checking that the first three are digits, and then the second assertion
checks that the preceding three characters are not "999".
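
The difference between the two patterns above can be checked with a small sketch. Python's re module is used as a stand-in for PCRE2 (an assumption; both lookbehinds here are fixed-length, which both engines accept), and the variable names are hypothetical.

```python
import re

# Stand-in engine: Python's re, not PCRE2.

# Both assertions inspect the same three characters before "foo".
three_before = re.compile(r"(?<=\d{3})(?<!999)foo")

# The first assertion inspects six characters, the second only the last three.
six_before = re.compile(r"(?<=\d{3}...)(?<!999)foo")

subjects = ["123foo", "999foo", "123abcfoo", "123999foo"]
three_results = {s: three_before.search(s) is not None for s in subjects}
six_results = {s: six_before.search(s) is not None for s in subjects}
```

With these subjects, the first pattern matches only "123foo" and the second matches only "123abcfoo".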
@@ -7402,29 +7914,29 @@ ASSERTIONS
(?<=(?<!foo)bar)baz
- matches an occurrence of "baz" that is preceded by "bar" which in turn
+ matches an occurrence of "baz" that is preceded by "bar" which in turn
is not preceded by "foo", while
(?<=\d{3}(?!999)...)foo
- is another pattern that matches "foo" preceded by three digits and any
+ is another pattern that matches "foo" preceded by three digits and any
three characters that are not "999".
CONDITIONAL SUBPATTERNS
- It is possible to cause the matching process to obey a subpattern con-
- ditionally or to choose between two alternative subpatterns, depending
- on the result of an assertion, or whether a specific capturing subpat-
- tern has already been matched. The two possible forms of conditional
+ It is possible to cause the matching process to obey a subpattern con-
+ ditionally or to choose between two alternative subpatterns, depending
+ on the result of an assertion, or whether a specific capturing subpat-
+ tern has already been matched. The two possible forms of conditional
subpattern are:
(?(condition)yes-pattern)
(?(condition)yes-pattern|no-pattern)
- If the condition is satisfied, the yes-pattern is used; otherwise the
- no-pattern (if present) is used. If there are more than two alterna-
- tives in the subpattern, a compile-time error occurs. Each of the two
+ If the condition is satisfied, the yes-pattern is used; otherwise the
+ no-pattern (if present) is used. If there are more than two alterna-
+ tives in the subpattern, a compile-time error occurs. Each of the two
alternatives may itself contain nested subpatterns of any form, includ-
ing conditional subpatterns; the restriction to two alternatives
applies only at the level of the condition. This pattern fragment is an
@@ -7433,93 +7945,114 @@ CONDITIONAL SUBPATTERNS
(?(1) (A|B|C) | (D | (?(2)E|F) | E) )
- There are five kinds of condition: references to subpatterns, refer-
- ences to recursion, two pseudo-conditions called DEFINE and VERSION,
+ There are five kinds of condition: references to subpatterns, refer-
+ ences to recursion, two pseudo-conditions called DEFINE and VERSION,
and assertions.
Checking for a used subpattern by number
- If the text between the parentheses consists of a sequence of digits,
+ If the text between the parentheses consists of a sequence of digits,
the condition is true if a capturing subpattern of that number has pre-
- viously matched. If there is more than one capturing subpattern with
- the same number (see the earlier section about duplicate subpattern
- numbers), the condition is true if any of them have matched. An alter-
- native notation is to precede the digits with a plus or minus sign. In
- this case, the subpattern number is relative rather than absolute. The
- most recently opened parentheses can be referenced by (?(-1), the next
- most recent by (?(-2), and so on. Inside loops it can also make sense
+ viously matched. If there is more than one capturing subpattern with
+ the same number (see the earlier section about duplicate subpattern
+ numbers), the condition is true if any of them have matched. An alter-
+ native notation is to precede the digits with a plus or minus sign. In
+ this case, the subpattern number is relative rather than absolute. The
+ most recently opened parentheses can be referenced by (?(-1), the next
+ most recent by (?(-2), and so on. Inside loops it can also make sense
to refer to subsequent groups. The next parentheses to be opened can be
- referenced as (?(+1), and so on. (The value zero in any of these forms
+ referenced as (?(+1), and so on. (The value zero in any of these forms
is not used; it provokes a compile-time error.)
- Consider the following pattern, which contains non-significant white
- space to make it more readable (assume the PCRE2_EXTENDED option) and
+ Consider the following pattern, which contains non-significant white
+ space to make it more readable (assume the PCRE2_EXTENDED option) and
to divide it into three parts for ease of discussion:
( \( )? [^()]+ (?(1) \) )
- The first part matches an optional opening parenthesis, and if that
+ The first part matches an optional opening parenthesis, and if that
character is present, sets it as the first captured substring. The sec-
- ond part matches one or more characters that are not parentheses. The
- third part is a conditional subpattern that tests whether or not the
- first set of parentheses matched. If they did, that is, if subject
- started with an opening parenthesis, the condition is true, and so the
- yes-pattern is executed and a closing parenthesis is required. Other-
- wise, since no-pattern is not present, the subpattern matches nothing.
- In other words, this pattern matches a sequence of non-parentheses,
+ ond part matches one or more characters that are not parentheses. The
+ third part is a conditional subpattern that tests whether or not the
+ first set of parentheses matched. If they did, that is, if the subject
+ started with an opening parenthesis, the condition is true, and so the
+ yes-pattern is executed and a closing parenthesis is required. Other-
+ wise, since no-pattern is not present, the subpattern matches nothing.
+ In other words, this pattern matches a sequence of non-parentheses,
optionally enclosed in parentheses.
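
A minimal sketch of this conditional, using Python's re module as a stand-in for PCRE2 (an assumption: Python supports the (?(1)...) form, and its re.VERBOSE flag plays the role that PCRE2_EXTENDED plays here). The helper name accepts is hypothetical.

```python
import re

# Stand-in engine: Python's re, not PCRE2.
# re.VERBOSE ignores the white space, as PCRE2_EXTENDED would.
pattern = re.compile(r"( \( )? [^()]+ (?(1) \) )", re.VERBOSE)


def accepts(subject):
    """Return True if the entire subject matches the pattern."""
    return pattern.fullmatch(subject) is not None


with_parens = accepts("(abc)")
without_parens = accepts("abc")
missing_close = accepts("(abc")
stray_close = accepts("abc)")
```

The closing parenthesis is required only when group 1 matched an opening one, so the last two subjects are rejected as whole-string matches.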
- If you were embedding this pattern in a larger one, you could use a
+ If you were embedding this pattern in a larger one, you could use a
relative reference:
...other stuff... ( \( )? [^()]+ (?(-1) \) ) ...
- This makes the fragment independent of the parentheses in the larger
+ This makes the fragment independent of the parentheses in the larger
pattern.
Checking for a used subpattern by name
- Perl uses the syntax (?(<name>)...) or (?('name')...) to test for a
- used subpattern by name. For compatibility with earlier versions of
- PCRE1, which had this facility before Perl, the syntax (?(name)...) is
- also recognized.
+ Perl uses the syntax (?(<name>)...) or (?('name')...) to test for a
+ used subpattern by name. For compatibility with earlier versions of
+ PCRE1, which had this facility before Perl, the syntax (?(name)...) is
+ also recognized. Note, however, that undelimited names consisting of
+ the letter R followed by digits are ambiguous (see the following sec-
+ tion).
Rewriting the above example to use a named subpattern gives this:
(?<OPEN> \( )? [^()]+ (?(<OPEN>) \) )
- If the name used in a condition of this kind is a duplicate, the test
- is applied to all subpatterns of the same name, and is true if any one
+ If the name used in a condition of this kind is a duplicate, the test
+ is applied to all subpatterns of the same name, and is true if any one
of them has matched.
Checking for pattern recursion
- If the condition is the string (R), and there is no subpattern with the
- name R, the condition is true if a recursive call to the whole pattern
- or any subpattern has been made. If digits or a name preceded by amper-
- sand follow the letter R, for example:
+ "Recursion" in this sense refers to any subroutine-like call from one
+ part of the pattern to another, whether or not it is actually recur-
+ sive. See the sections entitled "Recursive patterns" and "Subpatterns
+ as subroutines" below for details of recursion and subpattern calls.
- (?(R3)...) or (?(R&name)...)
+ If a condition is the string (R), and there is no subpattern with the
+ name R, the condition is true if matching is currently in a recursion
+ or subroutine call to the whole pattern or any subpattern. If digits
+ follow the letter R, and there is no subpattern with that name, the
+ condition is true if the most recent call is into a subpattern with the
+ given number, which must exist somewhere in the overall pattern. This
+ is a contrived example that is equivalent to a+b:
+
+ ((?(R1)a+|(?1)b))
+
+ However, in both cases, if there is a subpattern with a matching name,
+ the condition tests for its being set, as described in the section
+ above, instead of testing for recursion. For example, creating a group
+ with the name R1 by adding (?<R1>) to the above pattern completely
+ changes its meaning.
+
+ If a name preceded by ampersand follows the letter R, for example:
+
+ (?(R&name)...)
the condition is true if the most recent recursion is into a subpattern
- whose number or name is given. This condition does not check the entire
- recursion stack. If the name used in a condition of this kind is a
+ of that name (which must exist within the pattern).
+
+ This condition does not check the entire recursion stack. It tests only
+ the current level. If the name used in a condition of this kind is a
duplicate, the test is applied to all subpatterns of the same name, and
is true if any one of them is the most recent recursion.
- At "top level", all these recursion test conditions are false. The
- syntax for recursive patterns is described below.
+ At "top level", all these recursion test conditions are false.
Defining subpatterns for use by reference only
- If the condition is the string (DEFINE), and there is no subpattern
- with the name DEFINE, the condition is always false. In this case,
- there may be only one alternative in the subpattern. It is always
- skipped if control reaches this point in the pattern; the idea of
- DEFINE is that it can be used to define subroutines that can be refer-
- enced from elsewhere. (The use of subroutines is described below.) For
- example, a pattern to match an IPv4 address such as "192.168.23.245"
- could be written like this (ignore white space and line breaks):
+ If the condition is the string (DEFINE), the condition is always false,
+ even if there is a group with the name DEFINE. In this case, there may
+ be only one alternative in the subpattern. It is always skipped if con-
+ trol reaches this point in the pattern; the idea of DEFINE is that it
+ can be used to define subroutines that can be referenced from else-
+ where. (The use of subroutines is described below.) For example, a pat-
+ tern to match an IPv4 address such as "192.168.23.245" could be written
+ like this (ignore white space and line breaks):
(?(DEFINE) (?<byte> 2[0-4]\d | 25[0-5] | 1\d\d | [1-9]?\d) )
\b (?&byte) (\.(?&byte)){3} \b
@@ -7566,48 +8099,55 @@ CONDITIONAL SUBPATTERNS
strings in one of the two forms dd-aaa-dd or dd-dd-dd, where aaa are
letters and dd are digits.
+ When an assertion that is a condition contains capturing subpatterns,
+ any capturing that occurs in a matching branch is retained afterwards,
+ for both positive and negative assertions, because matching always con-
+ tinues after the assertion, whether it succeeds or fails. (Compare non-
+ conditional assertions, when captures are retained only for positive
+ assertions that succeed.)
+
COMMENTS
There are two ways of including comments in patterns that are processed
- by PCRE2. In both cases, the start of the comment must not be in a
- character class, nor in the middle of any other sequence of related
- characters such as (?: or a subpattern name or number. The characters
+ by PCRE2. In both cases, the start of the comment must not be in a
+ character class, nor in the middle of any other sequence of related
+ characters such as (?: or a subpattern name or number. The characters
that make up a comment play no part in the pattern matching.
- The sequence (?# marks the start of a comment that continues up to the
- next closing parenthesis. Nested parentheses are not permitted. If the
- PCRE2_EXTENDED option is set, an unescaped # character also introduces
- a comment, which in this case continues to immediately after the next
- newline character or character sequence in the pattern. Which charac-
- ters are interpreted as newlines is controlled by an option passed to
- the compiling function or by a special sequence at the start of the
- pattern, as described in the section entitled "Newline conventions"
- above. Note that the end of this type of comment is a literal newline
- sequence in the pattern; escape sequences that happen to represent a
- newline do not count. For example, consider this pattern when
- PCRE2_EXTENDED is set, and the default newline convention (a single
+ The sequence (?# marks the start of a comment that continues up to the
+ next closing parenthesis. Nested parentheses are not permitted. If the
+ PCRE2_EXTENDED option is set, an unescaped # character also introduces
+ a comment, which in this case continues to immediately after the next
+ newline character or character sequence in the pattern. Which charac-
+ ters are interpreted as newlines is controlled by an option passed to
+ the compiling function or by a special sequence at the start of the
+ pattern, as described in the section entitled "Newline conventions"
+ above. Note that the end of this type of comment is a literal newline
+ sequence in the pattern; escape sequences that happen to represent a
+ newline do not count. For example, consider this pattern when
+ PCRE2_EXTENDED is set, and the default newline convention (a single
linefeed character) is in force:
abc #comment \n still comment
- On encountering the # character, pcre2_compile() skips along, looking
- for a newline in the pattern. The sequence \n is still literal at this
- stage, so it does not terminate the comment. Only an actual character
+ On encountering the # character, pcre2_compile() skips along, looking
+ for a newline in the pattern. The sequence \n is still literal at this
+ stage, so it does not terminate the comment. Only an actual character
with the code value 0x0a (the default newline) does so.
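The same distinction between an escape sequence and a literal newline can be observed in Python's re module, whose re.VERBOSE flag plays a role similar to PCRE2_EXTENDED. This is only an analogy using a different engine; the variable names are invented for the sketch.

```python
import re

# Raw string: the pattern holds a backslash followed by "n", not a newline,
# so the # comment runs to the end of the pattern.
escaped = re.compile(r"abc #comment \n still comment", re.VERBOSE)

# Ordinary string: Python converts \n to a real linefeed (0x0a) before the
# regex compiler sees it, so the comment ends there and the remaining text
# becomes part of the pattern (with white space ignored).
literal = re.compile("abc #comment \n still comment", re.VERBOSE)

escaped_matches_abc = escaped.fullmatch("abc") is not None
literal_matches_abc = literal.fullmatch("abc") is not None
literal_matches_rest = literal.fullmatch("abcstillcomment") is not None
```

In the first case the pattern is just "abc"; in the second it is "abcstillcomment".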
RECURSIVE PATTERNS
- Consider the problem of matching a string in parentheses, allowing for
- unlimited nested parentheses. Without the use of recursion, the best
- that can be done is to use a pattern that matches up to some fixed
- depth of nesting. It is not possible to handle an arbitrary nesting
+ Consider the problem of matching a string in parentheses, allowing for
+ unlimited nested parentheses. Without the use of recursion, the best
+ that can be done is to use a pattern that matches up to some fixed
+ depth of nesting. It is not possible to handle an arbitrary nesting
depth.
For some time, Perl has provided a facility that allows regular expres-
- sions to recurse (amongst other things). It does this by interpolating
- Perl code in the expression at run time, and the code can refer to the
+ sions to recurse (amongst other things). It does this by interpolating
+ Perl code in the expression at run time, and the code can refer to the
expression itself. A Perl pattern using code interpolation to solve the
parentheses problem can be created like this:
@@ -7617,206 +8157,171 @@ RECURSIVE PATTERNS
refers recursively to the pattern in which it appears.
Obviously, PCRE2 cannot support the interpolation of Perl code.
- Instead, it supports special syntax for recursion of the entire pat-
+ Instead, it supports special syntax for recursion of the entire pat-
tern, and also for individual subpattern recursion. After its introduc-
- tion in PCRE1 and Python, this kind of recursion was subsequently
+ tion in PCRE1 and Python, this kind of recursion was subsequently
introduced into Perl at release 5.10.
- A special item that consists of (? followed by a number greater than
- zero and a closing parenthesis is a recursive subroutine call of the
- subpattern of the given number, provided that it occurs inside that
- subpattern. (If not, it is a non-recursive subroutine call, which is
- described in the next section.) The special item (?R) or (?0) is a
+ A special item that consists of (? followed by a number greater than
+ zero and a closing parenthesis is a recursive subroutine call of the
+ subpattern of the given number, provided that it occurs inside that
+ subpattern. (If not, it is a non-recursive subroutine call, which is
+ described in the next section.) The special item (?R) or (?0) is a
recursive call of the entire regular expression.
- This PCRE2 pattern solves the nested parentheses problem (assume the
+ This PCRE2 pattern solves the nested parentheses problem (assume the
PCRE2_EXTENDED option is set so that white space is ignored):
\( ( [^()]++ | (?R) )* \)
- First it matches an opening parenthesis. Then it matches any number of
- substrings which can either be a sequence of non-parentheses, or a
- recursive match of the pattern itself (that is, a correctly parenthe-
+ First it matches an opening parenthesis. Then it matches any number of
+ substrings which can either be a sequence of non-parentheses, or a
+ recursive match of the pattern itself (that is, a correctly parenthe-
sized substring). Finally there is a closing parenthesis. Note the use
of a possessive quantifier to avoid backtracking into sequences of non-
parentheses.
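The structure of this pattern can be mirrored by a small recursive function. The sketch below is a hand-written model in Python, not a use of the PCRE2 API, and the name match_parens is invented for illustration. The recursive call corresponds to (?R), and stepping over ordinary characters corresponds to [^()]++.

```python
def match_parens(subject, pos=0):
    """Model of  \\( ( [^()]++ | (?R) )* \\)  anchored at pos.

    Returns the index just past the matching ")" or -1 if there is no match.
    """
    if pos >= len(subject) or subject[pos] != "(":
        return -1                      # \( fails
    i = pos + 1
    while i < len(subject):
        ch = subject[i]
        if ch == ")":
            return i + 1               # closing \) found
        if ch == "(":
            end = match_parens(subject, i)   # the (?R) branch
            if end == -1:
                return -1
            i = end
        else:
            i += 1                     # one more character of [^()]++
    return -1                          # ran out of subject before \)
```

For example, match_parens("(ab(cd)ef)") consumes the whole string, while match_parens("(a(b)") returns -1 because the outer parenthesis is never closed.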
- If this were part of a larger pattern, you would not want to recurse
+ If this were part of a larger pattern, you would not want to recurse
the entire pattern, so instead you could use this:
( \( ( [^()]++ | (?1) )* \) )
- We have put the pattern into parentheses, and caused the recursion to
+ We have put the pattern into parentheses, and caused the recursion to
refer to them instead of the whole pattern.
- In a larger pattern, keeping track of parenthesis numbers can be
- tricky. This is made easier by the use of relative references. Instead
+ In a larger pattern, keeping track of parenthesis numbers can be
+ tricky. This is made easier by the use of relative references. Instead
of (?1) in the pattern above you can write (?-2) to refer to the second
- most recently opened parentheses preceding the recursion. In other
- words, a negative number counts capturing parentheses leftwards from
+ most recently opened parentheses preceding the recursion. In other
+ words, a negative number counts capturing parentheses leftwards from
the point at which it is encountered.
Be aware however, that if duplicate subpattern numbers are in use, rel-
- ative references refer to the earliest subpattern with the appropriate
+ ative references refer to the earliest subpattern with the appropriate
number. Consider, for example:
(?|(a)|(b)) (c) (?-2)
- The first two capturing groups (a) and (b) are both numbered 1, and
- group (c) is number 2. When the reference (?-2) is encountered, the
+ The first two capturing groups (a) and (b) are both numbered 1, and
+ group (c) is number 2. When the reference (?-2) is encountered, the
second most recently opened parentheses has the number 1, but it is the
- first such group (the (a) group) to which the recursion refers. This
- would be the same if an absolute reference (?1) was used. In other
- words, relative references are just a shorthand for computing a group
+ first such group (the (a) group) to which the recursion refers. This
+ would be the same if an absolute reference (?1) was used. In other
+ words, relative references are just a shorthand for computing a group
number.
- It is also possible to refer to subsequently opened parentheses, by
- writing references such as (?+2). However, these cannot be recursive
- because the reference is not inside the parentheses that are refer-
- enced. They are always non-recursive subroutine calls, as described in
+ It is also possible to refer to subsequently opened parentheses, by
+ writing references such as (?+2). However, these cannot be recursive
+ because the reference is not inside the parentheses that are refer-
+ enced. They are always non-recursive subroutine calls, as described in
the next section.
- An alternative approach is to use named parentheses. The Perl syntax
- for this is (?&name); PCRE1's earlier syntax (?P>name) is also sup-
+ An alternative approach is to use named parentheses. The Perl syntax
+ for this is (?&name); PCRE1's earlier syntax (?P>name) is also sup-
ported. We could rewrite the above example as follows:
(?<pn> \( ( [^()]++ | (?&pn) )* \) )
- If there is more than one subpattern with the same name, the earliest
+ If there is more than one subpattern with the same name, the earliest
one is used.
The example pattern that we have been looking at contains nested unlim-
- ited repeats, and so the use of a possessive quantifier for matching
- strings of non-parentheses is important when applying the pattern to
+ ited repeats, and so the use of a possessive quantifier for matching
+ strings of non-parentheses is important when applying the pattern to
strings that do not match. For example, when this pattern is applied to
(aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()
- it yields "no match" quickly. However, if a possessive quantifier is
- not used, the match runs for a very long time indeed because there are
- so many different ways the + and * repeats can carve up the subject,
+ it yields "no match" quickly. However, if a possessive quantifier is
+ not used, the match runs for a very long time indeed because there are
+ so many different ways the + and * repeats can carve up the subject,
and all have to be tested before failure can be reported.
- At the end of a match, the values of capturing parentheses are those
- from the outermost level. If you want to obtain intermediate values, a
+ At the end of a match, the values of capturing parentheses are those
+ from the outermost level. If you want to obtain intermediate values, a
callout function can be used (see below and the pcre2callout documenta-
tion). If the pattern above is matched against
(ab(cd)ef)
- the value for the inner capturing parentheses (numbered 2) is "ef",
- which is the last value taken on at the top level. If a capturing sub-
- pattern is not matched at the top level, its final captured value is
- unset, even if it was (temporarily) set at a deeper level during the
+ the value for the inner capturing parentheses (numbered 2) is "ef",
+ which is the last value taken on at the top level. If a capturing sub-
+ pattern is not matched at the top level, its final captured value is
+ unset, even if it was (temporarily) set at a deeper level during the
matching process.
If there are more than 15 capturing parentheses in a pattern, PCRE2 has
- to obtain extra memory from the heap to store data during a recursion.
- If no memory can be obtained, the match fails with the
+ to obtain extra memory from the heap to store data during a recursion.
+ If no memory can be obtained, the match fails with the
PCRE2_ERROR_NOMEMORY error.
- Do not confuse the (?R) item with the condition (R), which tests for
- recursion. Consider this pattern, which matches text in angle brack-
- ets, allowing for arbitrary nesting. Only digits are allowed in nested
- brackets (that is, when recursing), whereas any characters are permit-
+ Do not confuse the (?R) item with the condition (R), which tests for
+ recursion. Consider this pattern, which matches text in angle brack-
+ ets, allowing for arbitrary nesting. Only digits are allowed in nested
+ brackets (that is, when recursing), whereas any characters are permit-
ted at the outer level.
< (?: (?(R) \d++ | [^<>]*+) | (?R)) * >
- In this pattern, (?(R) is the start of a conditional subpattern, with
- two different alternatives for the recursive and non-recursive cases.
+ In this pattern, (?(R) is the start of a conditional subpattern, with
+ two different alternatives for the recursive and non-recursive cases.
The (?R) item is the actual recursive call.
Differences in recursion processing between PCRE2 and Perl
- Recursion processing in PCRE2 differs from Perl in two important ways.
- In PCRE2 (like Python, but unlike Perl), a recursive subpattern call is
- always treated as an atomic group. That is, once it has matched some of
- the subject string, it is never re-entered, even if it contains untried
- alternatives and there is a subsequent matching failure. This can be
- illustrated by the following pattern, which purports to match a palin-
- dromic string that contains an odd number of characters (for example,
- "a", "aba", "abcba", "abcdcba"):
-
- ^(.|(.)(?1)\2)$
-
- The idea is that it either matches a single character, or two identical
- characters surrounding a sub-palindrome. In Perl, this pattern works;
- in PCRE2 it does not if the pattern is longer than three characters.
- Consider the subject string "abcba":
-
- At the top level, the first character is matched, but as it is not at
- the end of the string, the first alternative fails; the second alterna-
- tive is taken and the recursion kicks in. The recursive call to subpat-
- tern 1 successfully matches the next character ("b"). (Note that the
- beginning and end of line tests are not part of the recursion).
-
- Back at the top level, the next character ("c") is compared with what
- subpattern 2 matched, which was "a". This fails. Because the recursion
- is treated as an atomic group, there are now no backtracking points,
- and so the entire match fails. (Perl is able, at this point, to re-
- enter the recursion and try the second alternative.) However, if the
- pattern is written with the alternatives in the other order, things are
- different:
+ Some former differences between PCRE2 and Perl no longer exist.
- ^((.)(?1)\2|.)$
+ Before release 10.30, recursion processing in PCRE2 differed from Perl
+ in that a recursive subpattern call was always treated as an atomic
+ group. That is, once it had matched some of the subject string, it was
+ never re-entered, even if it contained untried alternatives and there
+ was a subsequent matching failure. (Historical note: PCRE implemented
+ recursion before Perl did.)
- This time, the recursing alternative is tried first, and continues to
- recurse until it runs out of characters, at which point the recursion
- fails. But this time we do have another alternative to try at the
- higher level. That is the big difference: in the previous case the
- remaining alternative is at a deeper recursion level, which PCRE2 can-
- not use.
+ Starting with release 10.30, recursive subroutine calls are no longer
+ treated as atomic. That is, they can be re-entered to try unused alter-
+ natives if there is a matching failure later in the pattern. This is
+ now compatible with the way Perl works. If you want a subroutine call
+ to be atomic, you must explicitly enclose it in an atomic group.
- To change the pattern so that it matches all palindromic strings, not
- just those with an odd number of characters, it is tempting to change
- the pattern to this:
+ Supporting backtracking into recursions simplifies certain types of
+ recursive pattern. For example, this pattern matches palindromic
+ strings:
^((.)(?1)\2|.?)$
- Again, this works in Perl, but not in PCRE2, and for the same reason.
- When a deeper recursion has matched a single character, it cannot be
- entered again in order to match an empty string. The solution is to
- separate the two cases, and write out the odd and even cases as alter-
- natives at the higher level:
-
- ^(?:((.)(?1)\2|)|((.)(?3)\4|.))
-
- If you want to match typical palindromic phrases, the pattern has to
- ignore all non-word characters, which can be done like this:
+ The second branch in the group matches a single central character in
+ the palindrome when there are an odd number of characters, or nothing
+ when there are an even number of characters, but in order to work it
+ has to be able to try the second case when the rest of the pattern
+ match fails. If you want to match typical palindromic phrases, the pat-
+ tern has to ignore all non-word characters, which can be done like
+ this:
- ^\W*+(?:((.)\W*+(?1)\W*+\2|)|((.)\W*+(?3)\W*+\4|\W*+.\W*+))\W*+$
+ ^\W*+((.)\W*+(?1)\W*+\2|\W*+.?)\W*+$
If run with the PCRE2_CASELESS option, this pattern matches phrases
- such as "A man, a plan, a canal: Panama!" and it works in both PCRE2
- and Perl. Note the use of the possessive quantifier *+ to avoid back-
- tracking into sequences of non-word characters. Without this, PCRE2
- takes a great deal longer (ten times or more) to match typical phrases,
- and Perl takes so long that you think it has gone into a loop.
-
- WARNING: The palindrome-matching patterns above work only if the sub-
- ject string does not start with a palindrome that is shorter than the
- entire string. For example, although "abcba" is correctly matched, if
- the subject is "ababa", PCRE2 finds the palindrome "aba" at the start,
- then fails at top level because the end of the string does not follow.
- Once again, it cannot jump back into the recursion to try other alter-
- natives, so the entire match fails.
-
- The second way in which PCRE2 and Perl differ in their recursion pro-
- cessing is in the handling of captured values. In Perl, when a subpat-
- tern is called recursively or as a subpattern (see the next section),
- it has no access to any values that were captured outside the recur-
- sion, whereas in PCRE2 these values can be referenced. Consider this
- pattern:
+ such as "A man, a plan, a canal: Panama!". Note the use of the posses-
+ sive quantifier *+ to avoid backtracking into sequences of non-word
+ characters. Without this, PCRE2 takes a great deal longer (ten times or
+ more) to match typical phrases, and Perl takes so long that you think
+ it has gone into a loop.
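The logic of the palindromic-phrase pattern can be approximated by a short recursive function. This is a Python sketch under stated assumptions, not PCRE2 code: discarding non-word characters stands in for the \W*+ items, str.lower() stands in for PCRE2_CASELESS, and the names is_palindromic_phrase and group1 are invented for illustration.

```python
def is_palindromic_phrase(text):
    """Approximate  ^\\W*+((.)\\W*+(?1)\\W*+\\2|\\W*+.?)\\W*+$  with caseless matching."""
    # Keep only word characters (letters, digits, underscore), case-folded.
    chars = [c.lower() for c in text if c.isalnum() or c == "_"]

    def group1(lo, hi):
        # Second branch, .? : zero or one central character remains.
        if hi - lo <= 1:
            return True
        # First branch, (.) (?1) \2 : outer characters must agree and the
        # inner part must itself match group 1.
        return chars[lo] == chars[hi - 1] and group1(lo + 1, hi - 1)

    return group1(0, len(chars))
```

With this model, "A man, a plan, a canal: Panama!" is accepted, and so is "ababa", which the pre-10.30 atomic recursion could not match.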
+
+ Another way in which PCRE2 and Perl used to differ in their recursion
+ processing was in the handling of captured values. Formerly in Perl,
+ when a subpattern was called recursively or as a subroutine (see the
+ next section), it had no access to any values that were captured out-
+ side the recursion, whereas in PCRE2 these values can be referenced.
+ Consider this pattern:
^(.)(\1|a(?2))
- In PCRE2, this pattern matches "bab". The first capturing parentheses
- match "b", then in the second group, when the back reference \1 fails
- to match "b", the second alternative matches "a" and then recurses. In
- the recursion, \1 does now match "b" and so the whole match succeeds.
- In Perl, the pattern fails to match because inside the recursive call
- \1 cannot access the externally set value.
+ This pattern matches "bab". The first capturing parentheses match "b",
+ then in the second group, when the back reference \1 fails to match
+ "b", the second alternative matches "a" and then recurses. In the
+ recursion, \1 does now match "b" and so the whole match succeeds. This
+ match used to fail in Perl, but in later versions (I tried 5.024) it
+ now works.
SUBPATTERNS AS SUBROUTINES
@@ -7844,12 +8349,10 @@ SUBPATTERNS AS SUBROUTINES
two strings. Another example is given in the discussion of DEFINE
above.
- All subroutine calls, whether recursive or not, are always treated as
- atomic groups. That is, once a subroutine has matched some of the sub-
- ject string, it is never re-entered, even if it contains untried alter-
- natives and there is a subsequent matching failure. Any capturing
- parentheses that are set during the subroutine call revert to their
- previous values afterwards.
+ Like recursions, subroutine calls used to be treated as atomic, but
+ this changed at PCRE2 release 10.30, so backtracking into subroutine
+ calls can now occur. However, any capturing parentheses that are set
+ during the subroutine call revert to their previous values afterwards.
Processing options such as case-independence are fixed when a subpat-
tern is defined, so if it is used as a subroutine, such options cannot
@@ -7956,43 +8459,46 @@ CALLOUTS
BACKTRACKING CONTROL
- Perl 5.10 introduced a number of "Special Backtracking Control Verbs",
- which are still described in the Perl documentation as "experimental
- and subject to change or removal in a future version of Perl". It goes
- on to say: "Their usage in production code should be noted to avoid
- problems during upgrades." The same remarks apply to the PCRE2 features
- described in this section.
-
- The new verbs make use of what was previously invalid syntax: an open-
- ing parenthesis followed by an asterisk. They are generally of the form
- (*VERB) or (*VERB:NAME). Some verbs take either form, possibly behaving
- differently depending on whether or not a name is present.
+ There are a number of special "Backtracking Control Verbs" (to use
+ Perl's terminology) that modify the behaviour of backtracking during
+ matching. They are generally of the form (*VERB) or (*VERB:NAME). Some
+ verbs take either form, possibly behaving differently depending on
+ whether or not a name is present.
By default, for compatibility with Perl, a name is any sequence of
characters that does not include a closing parenthesis. The name is not
processed in any way, and it is not possible to include a closing
- parenthesis in the name. However, if the PCRE2_ALT_VERBNAMES option is
- set, normal backslash processing is applied to verb names and only an
- unescaped closing parenthesis terminates the name. A closing parenthe-
- sis can be included in a name either as \) or between \Q and \E. If the
- PCRE2_EXTENDED option is set, unescaped whitespace in verb names is
- skipped and #-comments are recognized, exactly as in the rest of the
- pattern.
-
- The maximum length of a name is 255 in the 8-bit library and 65535 in
- the 16-bit and 32-bit libraries. If the name is empty, that is, if the
- closing parenthesis immediately follows the colon, the effect is as if
+ parenthesis in the name. This can be changed by setting the
+ PCRE2_ALT_VERBNAMES option, but the result is no longer Perl-compati-
+ ble.
+
+ When PCRE2_ALT_VERBNAMES is set, backslash processing is applied to
+ verb names and only an unescaped closing parenthesis terminates the
+ name. However, the only backslash items that are permitted are \Q, \E,
+ and sequences such as \x{100} that define character code points. Char-
+ acter type escapes such as \d are faulted.
+
+ A closing parenthesis can be included in a name either as \) or between
+ \Q and \E. In addition to backslash processing, if the PCRE2_EXTENDED
+ option is also set, unescaped whitespace in verb names is skipped, and
+ #-comments are recognized, exactly as in the rest of the pattern.
+ PCRE2_EXTENDED does not affect verb names unless PCRE2_ALT_VERBNAMES is
+ also set.
+
+ The maximum length of a name is 255 in the 8-bit library and 65535 in
+ the 16-bit and 32-bit libraries. If the name is empty, that is, if the
+ closing parenthesis immediately follows the colon, the effect is as if
the colon were not there. Any number of these verbs may occur in a pat-
tern.
- Since these verbs are specifically related to backtracking, most of
- them can be used only when the pattern is to be matched using the tra-
- ditional matching function, because these use a backtracking algorithm.
- With the exception of (*FAIL), which behaves like a failing negative
+ Since these verbs are specifically related to backtracking, most of
+ them can be used only when the pattern is to be matched using the tra-
+ ditional matching function, because that uses a backtracking algorithm.
+ With the exception of (*FAIL), which behaves like a failing negative
assertion, the backtracking control verbs cause an error if encountered
by the DFA matching function.
- The behaviour of these verbs in repeated groups, assertions, and in
+ The behaviour of these verbs in repeated groups, assertions, and in
subpatterns called as subroutines (whether or not recursively) is docu-
mented below.
@@ -8000,71 +8506,71 @@ BACKTRACKING CONTROL
PCRE2 contains some optimizations that are used to speed up matching by
running some checks at the start of each match attempt. For example, it
- may know the minimum length of matching subject, or that a particular
+ may know the minimum length of matching subject, or that a particular
character must be present. When one of these optimizations bypasses the
- running of a match, any included backtracking verbs will not, of
+ running of a match, any included backtracking verbs will not, of
course, be processed. You can suppress the start-of-match optimizations
- by setting the PCRE2_NO_START_OPTIMIZE option when calling pcre2_com-
- pile(), or by starting the pattern with (*NO_START_OPT). There is more
+ by setting the PCRE2_NO_START_OPTIMIZE option when calling pcre2_com-
+ pile(), or by starting the pattern with (*NO_START_OPT). There is more
discussion of this option in the section entitled "Compiling a pattern"
in the pcre2api documentation.
- Experiments with Perl suggest that it too has similar optimizations,
+ Experiments with Perl suggest that it too has similar optimizations,
sometimes leading to anomalous results.
Verbs that act immediately
- The following verbs act as soon as they are encountered. They may not
+ The following verbs act as soon as they are encountered. They may not
be followed by a name.
(*ACCEPT)
- This verb causes the match to end successfully, skipping the remainder
- of the pattern. However, when it is inside a subpattern that is called
- as a subroutine, only that subpattern is ended successfully. Matching
+ This verb causes the match to end successfully, skipping the remainder
+ of the pattern. However, when it is inside a subpattern that is called
+ as a subroutine, only that subpattern is ended successfully. Matching
then continues at the outer level. If (*ACCEPT) is triggered in a posi-
- tive assertion, the assertion succeeds; in a negative assertion, the
+ tive assertion, the assertion succeeds; in a negative assertion, the
assertion fails.
- If (*ACCEPT) is inside capturing parentheses, the data so far is cap-
+ If (*ACCEPT) is inside capturing parentheses, the data so far is cap-
tured. For example:
A((?:A|B(*ACCEPT)|C)D)
- This matches "AB", "AAD", or "ACD"; when it matches "AB", "B" is cap-
+ This matches "AB", "AAD", or "ACD"; when it matches "AB", "B" is cap-
tured by the outer parentheses.
(*FAIL) or (*F)
- This verb causes a matching failure, forcing backtracking to occur. It
- is equivalent to (?!) but easier to read. The Perl documentation notes
- that it is probably useful only when combined with (?{}) or (??{}).
- Those are, of course, Perl features that are not present in PCRE2. The
- nearest equivalent is the callout feature, as for example in this pat-
+ This verb causes a matching failure, forcing backtracking to occur. It
+ is equivalent to (?!) but easier to read. The Perl documentation notes
+ that it is probably useful only when combined with (?{}) or (??{}).
+ Those are, of course, Perl features that are not present in PCRE2. The
+ nearest equivalent is the callout feature, as for example in this pat-
tern:
a+(?C)(*FAIL)
- A match with the string "aaaa" always fails, but the callout is taken
+ A match with the string "aaaa" always fails, but the callout is taken
before each backtrack happens (in this example, 10 times).
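The figure of 10 can be reproduced with a small model of the backtracking. The Python sketch below is not PCRE2 code; it assumes an unanchored match with no start-of-match optimizations, and the name count_callouts is invented for illustration. At each starting position a+ first takes the longest run of "a" characters and then gives them back one at a time, and the callout is reached once for each length tried.

```python
def count_callouts(subject):
    """Count how often (?C) would be reached in  a+(?C)(*FAIL)  (model only)."""
    calls = 0
    for start in range(len(subject)):          # advance the starting point
        run = 0
        while start + run < len(subject) and subject[start + run] == "a":
            run += 1
        # a+ tries lengths run, run-1, ..., 1; each reaches (?C), after
        # which (*FAIL) forces a backtrack into a+.
        for _ in range(run, 0, -1):
            calls += 1
    return calls
```

For "aaaa" the starting positions contribute 4, 3, 2 and 1 callouts, giving 10 in total.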
Recording which path was taken
- There is one verb whose main purpose is to track how a match was
- arrived at, though it also has a secondary use in conjunction with
+ There is one verb whose main purpose is to track how a match was
+ arrived at, though it also has a secondary use in conjunction with
advancing the match starting point (see (*SKIP) below).
(*MARK:NAME) or (*:NAME)
- A name is always required with this verb. There may be as many
- instances of (*MARK) as you like in a pattern, and their names do not
+ A name is always required with this verb. There may be as many
+ instances of (*MARK) as you like in a pattern, and their names do not
have to be unique.
- When a match succeeds, the name of the last-encountered (*MARK:NAME),
- (*PRUNE:NAME), or (*THEN:NAME) on the matching path is passed back to
- the caller as described in the section entitled "Other information
- about the match" in the pcre2api documentation. Here is an example of
- pcre2test output, where the "mark" modifier requests the retrieval and
+ When a match succeeds, the name of the last-encountered (*MARK:NAME),
+ (*PRUNE:NAME), or (*THEN:NAME) on the matching path is passed back to
+ the caller as described in the section entitled "Other information
+ about the match" in the pcre2api documentation. Here is an example of
+ pcre2test output, where the "mark" modifier requests the retrieval and
outputting of (*MARK) data:
re> /X(*MARK:A)Y|X(*MARK:B)Z/mark
@@ -8076,72 +8582,72 @@ BACKTRACKING CONTROL
MK: B
The (*MARK) name is tagged with "MK:" in this output, and in this exam-
- ple it indicates which of the two alternatives matched. This is a more
- efficient way of obtaining this information than putting each alterna-
+ ple it indicates which of the two alternatives matched. This is a more
+ efficient way of obtaining this information than putting each alterna-
tive in its own capturing parentheses.
- If a verb with a name is encountered in a positive assertion that is
- true, the name is recorded and passed back if it is the last-encoun-
+ If a verb with a name is encountered in a positive assertion that is
+ true, the name is recorded and passed back if it is the last-encoun-
tered. This does not happen for negative assertions or failing positive
assertions.
- After a partial match or a failed match, the last encountered name in
+ After a partial match or a failed match, the last encountered name in
the entire match process is returned. For example:
re> /X(*MARK:A)Y|X(*MARK:B)Z/mark
data> XP
No match, mark = B
- Note that in this unanchored example the mark is retained from the
+ Note that in this unanchored example the mark is retained from the
match attempt that started at the letter "X" in the subject. Subsequent
match attempts starting at "P" and then with an empty string do not get
as far as the (*MARK) item, but nevertheless do not reset it.
- If you are interested in (*MARK) values after failed matches, you
- should probably set the PCRE2_NO_START_OPTIMIZE option (see above) to
+ If you are interested in (*MARK) values after failed matches, you
+ should probably set the PCRE2_NO_START_OPTIMIZE option (see above) to
ensure that the match is always attempted.
Verbs that act after backtracking
The following verbs do nothing when they are encountered. Matching con-
- tinues with what follows, but if there is no subsequent match, causing
- a backtrack to the verb, a failure is forced. That is, backtracking
- cannot pass to the left of the verb. However, when one of these verbs
- appears inside an atomic group (which includes any group that is called
- as a subroutine) or in an assertion that is true, its effect is con-
- fined to that group, because once the group has been matched, there is
- never any backtracking into it. In this situation, backtracking has to
- jump to the left of the entire atomic group or assertion.
-
- These verbs differ in exactly what kind of failure occurs when back-
- tracking reaches them. The behaviour described below is what happens
- when the verb is not in a subroutine or an assertion. Subsequent sec-
+ tinues with what follows, but if there is no subsequent match, causing
+ a backtrack to the verb, a failure is forced. That is, backtracking
+ cannot pass to the left of the verb. However, when one of these verbs
+ appears inside an atomic group or in an assertion that is true, its
+ effect is confined to that group, because once the group has been
+ matched, there is never any backtracking into it. In this situation,
+ backtracking has to jump to the left of the entire atomic group or
+ assertion.
+
+ These verbs differ in exactly what kind of failure occurs when back-
+ tracking reaches them. The behaviour described below is what happens
+ when the verb is not in a subroutine or an assertion. Subsequent sec-
tions cover these special cases.
(*COMMIT)
- This verb, which may not be followed by a name, causes the whole match
+ This verb, which may not be followed by a name, causes the whole match
to fail outright if there is a later matching failure that causes back-
- tracking to reach it. Even if the pattern is unanchored, no further
+ tracking to reach it. Even if the pattern is unanchored, no further
attempts to find a match by advancing the starting point take place. If
- (*COMMIT) is the only backtracking verb that is encountered, once it
- has been passed pcre2_match() is committed to finding a match at the
+ (*COMMIT) is the only backtracking verb that is encountered, once it
+ has been passed pcre2_match() is committed to finding a match at the
current starting point, or not at all. For example:
a+(*COMMIT)b
- This matches "xxaab" but not "aacaab". It can be thought of as a kind
+ This matches "xxaab" but not "aacaab". It can be thought of as a kind
of dynamic anchor, or "I've started, so I must finish." The name of the
- most recently passed (*MARK) in the path is passed back when (*COMMIT)
+ most recently passed (*MARK) in the path is passed back when (*COMMIT)
forces a match failure.
- If there is more than one backtracking verb in a pattern, a different
- one that follows (*COMMIT) may be triggered first, so merely passing
+ If there is more than one backtracking verb in a pattern, a different
+ one that follows (*COMMIT) may be triggered first, so merely passing
(*COMMIT) during a match does not always guarantee that a match must be
at this starting point.
- Note that (*COMMIT) at the start of a pattern is not the same as an
- anchor, unless PCRE2's start-of-match optimizations are turned off, as
+ Note that (*COMMIT) at the start of a pattern is not the same as an
+ anchor, unless PCRE2's start-of-match optimizations are turned off, as
shown in this output from pcre2test:
re> /(*COMMIT)abc/
@@ -8152,33 +8658,32 @@ BACKTRACKING CONTROL
data> xyzabc
No match
- For the first pattern, PCRE2 knows that any match must start with "a",
- so the optimization skips along the subject to "a" before applying the
- pattern to the first set of data. The match attempt then succeeds. The
- second pattern disables the optimization that skips along to the first
- character. The pattern is now applied starting at "x", and so the
- (*COMMIT) causes the match to fail without trying any other starting
+ For the first pattern, PCRE2 knows that any match must start with "a",
+ so the optimization skips along the subject to "a" before applying the
+ pattern to the first set of data. The match attempt then succeeds. The
+ second pattern disables the optimization that skips along to the first
+ character. The pattern is now applied starting at "x", and so the
+ (*COMMIT) causes the match to fail without trying any other starting
points.
(*PRUNE) or (*PRUNE:NAME)
- This verb causes the match to fail at the current starting position in
+ This verb causes the match to fail at the current starting position in
the subject if there is a later matching failure that causes backtrack-
- ing to reach it. If the pattern is unanchored, the normal "bumpalong"
- advance to the next starting character then happens. Backtracking can
- occur as usual to the left of (*PRUNE), before it is reached, or when
- matching to the right of (*PRUNE), but if there is no match to the
- right, backtracking cannot cross (*PRUNE). In simple cases, the use of
- (*PRUNE) is just an alternative to an atomic group or possessive quan-
+ ing to reach it. If the pattern is unanchored, the normal "bumpalong"
+ advance to the next starting character then happens. Backtracking can
+ occur as usual to the left of (*PRUNE), before it is reached, or when
+ matching to the right of (*PRUNE), but if there is no match to the
+ right, backtracking cannot cross (*PRUNE). In simple cases, the use of
+ (*PRUNE) is just an alternative to an atomic group or possessive quan-
tifier, but there are some uses of (*PRUNE) that cannot be expressed in
- any other way. In an anchored pattern (*PRUNE) has the same effect as
+ any other way. In an anchored pattern (*PRUNE) has the same effect as
(*COMMIT).
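
As a rough illustration of how the two verbs differ, the following Python sketch hand-simulates the patterns a+(*COMMIT)b and a+(*PRUNE)b. It does not call PCRE2: the function name simulate and its verb argument are invented for this example, and start-of-match optimizations are ignored.

```python
def simulate(subject, verb):
    """Hand-simulate a+(*VERB)b for verb in {"COMMIT", "PRUNE"}.

    Returns (start, end) of the match, or None. Illustrative only: this is
    not PCRE2, and it ignores start-of-match optimizations.
    """
    if verb not in ("COMMIT", "PRUNE"):
        raise ValueError("verb must be 'COMMIT' or 'PRUNE'")
    for start in range(len(subject) + 1):
        end = start
        while end < len(subject) and subject[end] == "a":
            end += 1                      # greedy a+
        if end == start:
            continue                      # a+ failed before the verb: bump along
        # The verb has now been passed.
        if end < len(subject) and subject[end] == "b":
            return (start, end + 1)       # overall match
        # "b" failed, so backtracking reaches the verb.
        if verb == "COMMIT":
            return None                   # no further starting points are tried
        # PRUNE: fail at this starting point only, then bump along.
    return None
```

With this sketch, simulate("xxaab", "COMMIT") returns (2, 5) and simulate("aacaab", "COMMIT") returns None, as described for (*COMMIT) above, whereas simulate("aacaab", "PRUNE") returns (3, 6) because the bumpalong advance still happens.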
- The behaviour of (*PRUNE:NAME) is the not the same as
- (*MARK:NAME)(*PRUNE). It is like (*MARK:NAME) in that the name is
- remembered for passing back to the caller. However, (*SKIP:NAME)
- searches only for names set with (*MARK), ignoring those set by
- (*PRUNE) or (*THEN).
+ The behaviour of (*PRUNE:NAME) is not the same as (*MARK:NAME)(*PRUNE).
+ It is like (*MARK:NAME) in that the name is remembered for passing back
+ to the caller. However, (*SKIP:NAME) searches only for names set with
+ (*MARK), ignoring those set by (*PRUNE) or (*THEN).
(*SKIP)
@@ -8311,50 +8816,55 @@ BACKTRACKING CONTROL
Backtracking verbs in assertions
- (*FAIL) in an assertion has its normal effect: it forces an immediate
- backtrack.
+ (*FAIL) in any assertion has its normal effect: it forces an immediate
+ backtrack. The behaviour of the other backtracking verbs depends on
+ whether the assertion is standalone or acting as the condition in a
+ conditional subpattern.
- (*ACCEPT) in a positive assertion causes the assertion to succeed with-
- out any further processing. In a negative assertion, (*ACCEPT) causes
- the assertion to fail without any further processing.
+ (*ACCEPT) in a standalone positive assertion causes the assertion to
+ succeed without any further processing; captured strings are retained.
+ In a standalone negative assertion, (*ACCEPT) causes the assertion to
+ fail without any further processing; captured substrings are discarded.
- The other backtracking verbs are not treated specially if they appear
- in a positive assertion. In particular, (*THEN) skips to the next
- alternative in the innermost enclosing group that has alternations,
- whether or not this is within the assertion.
+ If the assertion is a condition, (*ACCEPT) causes the condition to be
+ true for a positive assertion and false for a negative one; captured
+ substrings are retained in both cases.
- Negative assertions are, however, different, in order to ensure that
- changing a positive assertion into a negative assertion changes its
- result. Backtracking into (*COMMIT), (*SKIP), or (*PRUNE) causes a neg-
- ative assertion to be true, without considering any further alternative
- branches in the assertion. Backtracking into (*THEN) causes it to skip
- to the next enclosing alternative within the assertion (the normal be-
- haviour), but if the assertion does not have such an alternative,
- (*THEN) behaves like (*PRUNE).
+ The effect of (*THEN) is not allowed to escape beyond an assertion. If
+ there are no more branches to try, (*THEN) causes a positive assertion
+ to be false, and a negative assertion to be true.
+
+ The other backtracking verbs are not treated specially if they appear
+ in a standalone positive assertion. In a conditional positive asser-
+ tion, backtracking into (*COMMIT), (*SKIP), or (*PRUNE) causes the con-
+ dition to be false. However, for both standalone and conditional nega-
+ tive assertions, backtracking into (*COMMIT), (*SKIP), or (*PRUNE)
+ causes the assertion to be true, without considering any further alter-
+ native branches.
Backtracking verbs in subroutines
- These behaviours occur whether or not the subpattern is called recur-
+ These behaviours occur whether or not the subpattern is called recur-
sively. Perl's treatment of subroutines is different in some cases.
- (*FAIL) in a subpattern called as a subroutine has its normal effect:
+ (*FAIL) in a subpattern called as a subroutine has its normal effect:
it forces an immediate backtrack.
- (*ACCEPT) in a subpattern called as a subroutine causes the subroutine
- match to succeed without any further processing. Matching then contin-
+ (*ACCEPT) in a subpattern called as a subroutine causes the subroutine
+ match to succeed without any further processing. Matching then contin-
ues after the subroutine call.
(*COMMIT), (*SKIP), and (*PRUNE) in a subpattern called as a subroutine
cause the subroutine match to fail.
- (*THEN) skips to the next alternative in the innermost enclosing group
- within the subpattern that has alternatives. If there is no such group
+ (*THEN) skips to the next alternative in the innermost enclosing group
+ within the subpattern that has alternatives. If there is no such group
within the subpattern, (*THEN) causes the subroutine match to fail.
SEE ALSO
- pcre2api(3), pcre2callout(3), pcre2matching(3), pcre2syntax(3),
+ pcre2api(3), pcre2callout(3), pcre2matching(3), pcre2syntax(3),
pcre2(3).
@@ -8367,8 +8877,8 @@ AUTHOR
REVISION
- Last updated: 20 June 2016
- Copyright (c) 1997-2016 University of Cambridge.
+ Last updated: 12 September 2017
+ Copyright (c) 1997-2017 University of Cambridge.
------------------------------------------------------------------------------
@@ -8389,11 +8899,12 @@ PCRE2 PERFORMANCE
COMPILED PATTERN MEMORY USAGE
Patterns are compiled by PCRE2 into a reasonably efficient interpretive
- code, so that most simple patterns do not use much memory. However,
- there is one case where the memory usage of a compiled pattern can be
- unexpectedly large. If a parenthesized subpattern has a quantifier with
- a minimum greater than 1 and/or a limited maximum, the whole subpattern
- is repeated in the compiled code. For example, the pattern
+ code, so that most simple patterns do not use much memory for storing
+ the compiled version. However, there is one case where the memory usage
+ of a compiled pattern can be unexpectedly large. If a parenthesized
+ subpattern has a quantifier with a minimum greater than 1 and/or a lim-
+ ited maximum, the whole subpattern is repeated in the compiled code.
+ For example, the pattern
(abc|def){2,4}
@@ -8401,134 +8912,188 @@ COMPILED PATTERN MEMORY USAGE
(abc|def)(abc|def)((abc|def)(abc|def)?)?
- (Technical aside: It is done this way so that backtrack points within
+ (Technical aside: It is done this way so that backtrack points within
each of the repetitions can be independently maintained.)
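
The replication can be pictured textually with a short Python sketch. The helper name expand_bounded_repeat is hypothetical, and its output is only a source-level picture of the expansion, not PCRE2's actual compiled code. Python's re module is used solely to check that the two spellings accept the same strings.

```python
import re

def expand_bounded_repeat(group, lo, hi):
    """Write group{lo,hi} as explicit copies of group (0 <= lo <= hi)."""
    optional = ""
    for i in range(hi - lo):
        if i == 0:
            optional = group + "?"                    # innermost optional copy
        else:
            optional = "(" + group + optional + ")?"  # wrap one more optional copy
    return group * lo + optional

expanded = expand_bounded_repeat("(abc|def)", 2, 4)

# Both spellings should accept exactly 2, 3, or 4 repetitions.
for n in range(6):
    subject = "abc" * n
    short = re.fullmatch(r"(abc|def){2,4}", subject) is not None
    long_ = re.fullmatch(expanded, subject) is not None
    assert short == long_
```

For the arguments shown, expanded is the string (abc|def)(abc|def)((abc|def)(abc|def)?)?, which is the form given above. The length of the expanded text grows with the upper bound, which is the effect the section describes.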
- For regular expressions whose quantifiers use only small numbers, this
- is not usually a problem. However, if the numbers are large, and par-
- ticularly if such repetitions are nested, the memory usage can become
+ For regular expressions whose quantifiers use only small numbers, this
+ is not usually a problem. However, if the numbers are large, and par-
+ ticularly if such repetitions are nested, the memory usage can become
an embarrassment. For example, the very simple pattern
((ab){1,1000}c){1,3}
- uses 51K bytes when compiled using the 8-bit library. When PCRE2 is
- compiled with its default internal pointer size of two bytes, the size
- limit on a compiled pattern is 64K code units in the 8-bit and 16-bit
- libraries, and this is reached with the above pattern if the outer rep-
- etition is increased from 3 to 4. PCRE2 can be compiled to use larger
- internal pointers and thus handle larger compiled patterns, but it is
- better to try to rewrite your pattern to use less memory if you can.
+ uses over 50K bytes when compiled using the 8-bit library. When PCRE2
+ is compiled with its default internal pointer size of two bytes, the
+ size limit on a compiled pattern is 64K code units in the 8-bit and
+ 16-bit libraries, and this is reached with the above pattern if the
+ outer repetition is increased from 3 to 4. PCRE2 can be compiled to use
+ larger internal pointers and thus handle larger compiled patterns, but
+ it is better to try to rewrite your pattern to use less memory if you
+ can.
One way of reducing the memory usage for such patterns is to make use
of PCRE2's "subroutine" facility. Re-writing the above pattern as
((ab)(?2){0,999}c)(?1){0,2}
- reduces the memory requirements to 18K, and indeed it remains under 20K
- even with the outer repetition increased to 100. However, this pattern
- is not exactly equivalent, because the "subroutine" calls are treated
- as atomic groups into which there can be no backtracking if there is a
- subsequent matching failure. Therefore, PCRE2 cannot do this kind of
- rewriting automatically. Furthermore, there is a noticeable loss of
- speed when executing the modified pattern. Nevertheless, if the atomic
- grouping is not a problem and the loss of speed is acceptable, this
- kind of rewriting will allow you to process patterns that PCRE2 cannot
- otherwise handle.
-
-
-STACK USAGE AT RUN TIME
-
- When pcre2_match() is used for matching, certain kinds of pattern can
- cause it to use large amounts of the process stack. In some environ-
- ments the default process stack is quite small, and if it runs out the
- result is often SIGSEGV. Rewriting your pattern can often help. The
- pcre2stack documentation discusses this issue in detail.
+ reduces the memory requirements to around 16K, and indeed it remains
+ under 20K even with the outer repetition increased to 100. However,
+ this kind of pattern is not always exactly equivalent, because any cap-
+ tures within subroutine calls are lost when the subroutine completes.
+ If this is not a problem, this kind of rewriting will allow you to
+ process patterns that PCRE2 cannot otherwise handle. The matching
+ performance of the two versions of the pattern is roughly the same.
+ (This applies from release 10.30; things were different in earlier
+ releases.)
+
+
+STACK AND HEAP USAGE AT RUN TIME
+
+ From release 10.30, the interpretive (non-JIT) version of pcre2_match()
+ uses very little system stack at run time. In earlier releases recur-
+ sive function calls could use a great deal of stack, and this could
+ cause problems, but this usage has been eliminated. Backtracking posi-
+ tions are now explicitly remembered in memory frames controlled by the
+ code. An initial 20K vector of frames is allocated on the system stack
+ (enough for about 100 frames for small patterns), but if this is insuf-
+ ficient, heap memory is used. The amount of heap memory can be limited;
+ if the limit is set to zero, only the initial stack vector is used.
+ Rewriting patterns to be time-efficient, as described below, may also
+ reduce the memory requirements.
+
+ In contrast to pcre2_match(), pcre2_dfa_match() does use recursive
+ function calls, but only for processing atomic groups, lookaround
+ assertions, and recursion within the pattern. Too much nested recursion
+ may cause stack issues. The "match depth" parameter can be used to
+ limit the depth of function recursion in pcre2_dfa_match().
PROCESSING TIME
- Certain items in regular expression patterns are processed more effi-
+ Certain items in regular expression patterns are processed more effi-
ciently than others. It is more efficient to use a character class like
- [aeiou] than a set of single-character alternatives such as
- (a|e|i|o|u). In general, the simplest construction that provides the
+ [aeiou] than a set of single-character alternatives such as
+ (a|e|i|o|u). In general, the simplest construction that provides the
required behaviour is usually the most efficient. Jeffrey Friedl's book
- contains a lot of useful general discussion about optimizing regular
- expressions for efficient performance. This document contains a few
+ contains a lot of useful general discussion about optimizing regular
+ expressions for efficient performance. This document contains a few
observations about PCRE2.
- Using Unicode character properties (the \p, \P, and \X escapes) is
- slow, because PCRE2 has to use a multi-stage table lookup whenever it
- needs a character's property. If you can find an alternative pattern
+ Using Unicode character properties (the \p, \P, and \X escapes) is
+ slow, because PCRE2 has to use a multi-stage table lookup whenever it
+ needs a character's property. If you can find an alternative pattern
that does not use character properties, it will probably be faster.
- By default, the escape sequences \b, \d, \s, and \w, and the POSIX
- character classes such as [:alpha:] do not use Unicode properties,
+ By default, the escape sequences \b, \d, \s, and \w, and the POSIX
+ character classes such as [:alpha:] do not use Unicode properties,
partly for backwards compatibility, and partly for performance reasons.
- However, you can set the PCRE2_UCP option or start the pattern with
- (*UCP) if you want Unicode character properties to be used. This can
- double the matching time for items such as \d, when matched with
- pcre2_match(); the performance loss is less with a DFA matching func-
+ However, you can set the PCRE2_UCP option or start the pattern with
+ (*UCP) if you want Unicode character properties to be used. This can
+ double the matching time for items such as \d, when matched with
+ pcre2_match(); the performance loss is less with a DFA matching func-
tion, and in both cases there is not much difference for \b.
- When a pattern begins with .* not in atomic parentheses, nor in paren-
- theses that are the subject of a backreference, and the PCRE2_DOTALL
- option is set, the pattern is implicitly anchored by PCRE2, since it
- can match only at the start of a subject string. If the pattern has
+ When a pattern begins with .* not in atomic parentheses, nor in paren-
+ theses that are the subject of a backreference, and the PCRE2_DOTALL
+ option is set, the pattern is implicitly anchored by PCRE2, since it
+ can match only at the start of a subject string. If the pattern has
multiple top-level branches, they must all be anchorable. The optimiza-
- tion can be disabled by the PCRE2_NO_DOTSTAR_ANCHOR option, and is
+ tion can be disabled by the PCRE2_NO_DOTSTAR_ANCHOR option, and is
automatically disabled if the pattern contains (*PRUNE) or (*SKIP).
- If PCRE2_DOTALL is not set, PCRE2 cannot make this optimization,
+ If PCRE2_DOTALL is not set, PCRE2 cannot make this optimization,
because the dot metacharacter does not then match a newline, and if the
- subject string contains newlines, the pattern may match from the char-
+ subject string contains newlines, the pattern may match from the char-
acter immediately following one of them instead of from the very start.
For example, the pattern
.*second
- matches the subject "first\nand second" (where \n stands for a newline
- character), with the match starting at the seventh character. In order
- to do this, PCRE2 has to retry the match starting after every newline
+ matches the subject "first\nand second" (where \n stands for a newline
+ character), with the match starting at the seventh character. In order
+ to do this, PCRE2 has to retry the match starting after every newline
in the subject.
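
The effect of the dot not matching a newline can be reproduced with Python's re module, used here only as a stand-in: it is a different engine from PCRE2, but it treats "." and newlines the same way in this respect. Offsets are zero-based, so offset 6 is the seventh character.

```python
import re

subject = "first\nand second"

# Without DOTALL, "." does not match the newline, so the match can only
# begin after it: offset 6, i.e. the seventh character.
m_default = re.search(r".*second", subject)

# With DOTALL, ".*" can run across the newline and the match begins at 0.
m_dotall = re.search(r".*second", subject, re.DOTALL)

print(m_default.start(), m_dotall.start())
```

This prints 6 0. The first search has to be retried at later starting points before it succeeds; the second succeeds from the very start of the subject.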
- If you are using such a pattern with subject strings that do not con-
- tain newlines, the best performance is obtained by setting
- PCRE2_DOTALL, or starting the pattern with ^.* or ^.*? to indicate
+ If you are using such a pattern with subject strings that do not con-
+ tain newlines, the best performance is obtained by setting
+ PCRE2_DOTALL, or starting the pattern with ^.* or ^.*? to indicate
explicit anchoring. That saves PCRE2 from having to scan along the sub-
ject looking for a newline to restart at.
- Beware of patterns that contain nested indefinite repeats. These can
- take a long time to run when applied to a string that does not match.
+ Beware of patterns that contain nested indefinite repeats. These can
+ take a long time to run when applied to a string that does not match.
Consider the pattern fragment
^(a+)*
- This can match "aaaa" in 16 different ways, and this number increases
- very rapidly as the string gets longer. (The * repeat can match 0, 1,
- 2, 3, or 4 times, and for each of those cases other than 0 or 4, the +
- repeats can match different numbers of times.) When the remainder of
- the pattern is such that the entire match is going to fail, PCRE2 has
- in principle to try every possible variation, and this can take an
+ This can match "aaaa" in 16 different ways, and this number increases
+ very rapidly as the string gets longer. (The * repeat can match 0, 1,
+ 2, 3, or 4 times, and for each of those cases other than 0 or 4, the +
+ repeats can match different numbers of times.) When the remainder of
+ the pattern is such that the entire match is going to fail, PCRE2 has
+ in principle to try every possible variation, and this can take an
extremely long time, even for relatively short strings.
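
The figure of 16 can be checked by counting. The sketch below assumes that a "way" means a choice of how many characters each iteration of the group consumes, over every prefix of the subject; the function name count_ways is made up for this example.

```python
def count_ways(n):
    """Count the ways ^(a+)* can match a prefix of a string of n "a"s."""
    # splits[k]: ways to cut exactly k characters into an ordered sequence of
    # non-empty runs (one run per iteration of the group); splits[0] = 1
    # stands for zero iterations.
    splits = [1] + [0] * n
    for k in range(1, n + 1):
        splits[k] = sum(splits[k - run] for run in range(1, k + 1))
    return sum(splits)
```

count_ways(4) returns 16, and count_ways(20) returns 1048576: under this counting the total doubles with each extra character, which is why a failing match can take so long.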
An optimization catches some of the more simple cases such as
(a+)*b
- where a literal character follows. Before embarking on the standard
- matching procedure, PCRE2 checks that there is a "b" later in the sub-
- ject string, and if there is not, it fails the match immediately. How-
- ever, when there is no following literal this optimization cannot be
+ where a literal character follows. Before embarking on the standard
+ matching procedure, PCRE2 checks that there is a "b" later in the sub-
+ ject string, and if there is not, it fails the match immediately. How-
+ ever, when there is no following literal this optimization cannot be
used. You can see the difference by comparing the behaviour of
(a+)*\d
- with the pattern above. The former gives a failure almost instantly
- when applied to a whole line of "a" characters, whereas the latter
+ with the pattern above. The pattern above, (a+)*b, gives a failure almost
+ instantly when applied to a whole line of "a" characters, whereas (a+)*\d
takes an appreciable time with strings longer than about 20 characters.
In many cases, the solution to this kind of performance issue is to use
- an atomic group or a possessive quantifier.
+ an atomic group or a possessive quantifier. This can often reduce mem-
+ ory requirements as well. As another example, consider this pattern:
+
+ ([^<]|<(?!inet))+
+
+ It matches from wherever it starts until it encounters "<inet" or the
+ end of the data, and is the kind of pattern that might be used when
+ processing an XML file. Each iteration of the outer parentheses matches
+ either one character that is not "<" or a "<" that is not followed by
+ "inet". However, each time a parenthesis is processed, a backtracking
+ position is passed, so this formulation uses a memory frame for each
+ matched character. For a long string, a lot of memory is required. Con-
+ sider now this rewritten pattern, which matches exactly the same
+ strings:
+
+ ([^<]++|<(?!inet))+
+
+ This runs much faster, because sequences of characters that do not con-
+ tain "<" are "swallowed" in one item inside the parentheses, and a pos-
+ sessive quantifier is used to stop any backtracking into the runs of
+ non-"<" characters. This version also uses a lot less memory because
+ entry to a new set of parentheses happens only when a "<" character
+ that is not followed by "inet" is encountered (and we assume this is
+ relatively rare).
+
+ This example shows that one way of optimizing performance when matching
+ long subject strings is to write repeated parenthesized subpatterns to
+ match more than one character whenever possible.
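
To get a feel for the saving, the Python sketch below counts how many times the outer group is iterated for each formulation, as a rough proxy for the number of memory frames. It uses Python's re module rather than PCRE2, the function name group_iterations is invented here, and a plain "+" stands in for the possessive "++" when splitting text that is already known to match.

```python
import re

def group_iterations(subject, possessive_runs):
    """Count iterations of the outer group needed to match from offset 0."""
    whole = re.match(r"(?:[^<]|<(?!inet))+", subject)
    if whole is None:
        return 0
    matched = whole.group(0)
    # Plain "+" stands in for "++" here: splitting the already-matched text
    # left to right gives the same pieces.
    item = r"[^<]+|<(?!inet)" if possessive_runs else r"[^<]|<(?!inet)"
    return len(re.findall(item, matched))
```

For the subject "abc<x>def<inet>", both formulations match "abc<x>def". The one-character-per-iteration form needs 9 iterations (group_iterations(subject, False)), while the run-based form needs 3 (group_iterations(subject, True)).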
+
+ SETTING RESOURCE LIMITS
+
+ You can set limits on the amount of processing that takes place when
+ matching, and on the amount of heap memory that is used. The default
+ values of the limits are very large, and unlikely ever to operate. They
+ can be changed when PCRE2 is built, and they can also be set when
+ pcre2_match() or pcre2_dfa_match() is called. For details of these
+ interfaces, see the pcre2build documentation and the section entitled
+ "The match context" in the pcre2api documentation.
+
+ The pcre2test test program has a modifier called "find_limits" which,
+ if applied to a subject line, causes it to find the smallest limits
+ that allow a pattern to match. This is done by repeatedly matching with
+ different limits.
AUTHOR
@@ -8540,8 +9105,8 @@ AUTHOR
REVISION
- Last updated: 02 January 2015
- Copyright (c) 1997-2015 University of Cambridge.
+ Last updated: 08 April 2017
+ Copyright (c) 1997-2017 University of Cambridge.
------------------------------------------------------------------------------
@@ -8593,32 +9158,34 @@ DESCRIPTION
There are also some options that are not defined by POSIX. These have
been added at the request of users who want to make use of certain
- PCRE2-specific features via the POSIX calling interface.
-
- When PCRE2 is called via these functions, it is only the API that is
- POSIX-like in style. The syntax and semantics of the regular expres-
- sions themselves are still those of Perl, subject to the setting of
- various PCRE2 options, as described below. "POSIX-like in style" means
- that the API approximates to the POSIX definition; it is not fully
- POSIX-compatible, and in multi-unit encoding domains it is probably
+ PCRE2-specific features via the POSIX calling interface or to add BSD
+ or GNU functionality.
+
+ When PCRE2 is called via these functions, it is only the API that is
+ POSIX-like in style. The syntax and semantics of the regular expres-
+ sions themselves are still those of Perl, subject to the setting of
+ various PCRE2 options, as described below. "POSIX-like in style" means
+ that the API approximates to the POSIX definition; it is not fully
+ POSIX-compatible, and in multi-unit encoding domains it is probably
even less compatible.
The header for these functions is supplied as pcre2posix.h to avoid any
- potential clash with other POSIX libraries. It can, of course, be
+ potential clash with other POSIX libraries. It can, of course, be
renamed or aliased as regex.h, which is the "correct" name. It provides
- two structure types, regex_t for compiled internal forms, and reg-
- match_t for returning captured substrings. It also defines some con-
- stants whose names start with "REG_"; these are used for setting
+ two structure types, regex_t for compiled internal forms, and reg-
+ match_t for returning captured substrings. It also defines some con-
+ stants whose names start with "REG_"; these are used for setting
options and identifying error codes.
COMPILING A PATTERN
- The function regcomp() is called to compile a pattern into an internal
- form. The pattern is a C string terminated by a binary zero, and is
- passed in the argument pattern. The preg argument is a pointer to a
- regex_t structure that is used as a base for storing information about
- the compiled regular expression.
+ The function regcomp() is called to compile a pattern into an internal
+ form. By default, the pattern is a C string terminated by a binary zero
+ (but see REG_PEND below). The preg argument is a pointer to a regex_t
+ structure that is used as a base for storing information about the com-
+ piled regular expression. (It is also used for input when REG_PEND is
+ set.)
The argument cflags is either zero, or contains one or more of the bits
defined by the following macros:
@@ -8641,14 +9208,34 @@ COMPILING A PATTERN
the defined POSIX behaviour for REG_NEWLINE (see the following sec-
tion).
+ REG_NOSPEC
+
+ The PCRE2_LITERAL option is set when the regular expression is passed
+ for compilation to the native function. This disables all meta charac-
+ ters in the pattern, causing it to be treated as a literal string. The
+ only other options that are allowed with REG_NOSPEC are REG_ICASE,
+ REG_NOSUB, REG_PEND, and REG_UTF. Note that REG_NOSPEC is not part of
+ the POSIX standard.
+
REG_NOSUB
- When a pattern that is compiled with this flag is passed to regexec()
- for matching, the nmatch and pmatch arguments are ignored, and no cap-
+ When a pattern that is compiled with this flag is passed to regexec()
+ for matching, the nmatch and pmatch arguments are ignored, and no
captured strings are returned. Versions of the PCRE2 library prior to 10.22
- used to set the PCRE2_NO_AUTO_CAPTURE compile option, but this no
+ used to set the PCRE2_NO_AUTO_CAPTURE compile option, but this no
longer happens because it disables the use of back references.
+ REG_PEND
+
+ If this option is set, the re_endp field in the preg structure (which
+ has the type const char *) must be set to point to the character beyond
+ the end of the pattern before calling regcomp(). The pattern itself may
+ now contain binary zeroes, which are treated as data characters. With-
+ out REG_PEND, a binary zero terminates the pattern and the re_endp
+ field is ignored. This is a GNU extension to the POSIX standard and
+ should be used with caution in software intended to be portable to
+ other systems.
+
REG_UCP
The PCRE2_UCP option is set when the regular expression is passed for
@@ -8678,11 +9265,12 @@ COMPILING A PATTERN
ter (they are not) or by a negative class such as [^a] (they are).
The yield of regcomp() is zero on success, and non-zero otherwise. The
- preg structure is filled in on success, and one member of the structure
- is public: re_nsub contains the number of capturing subpatterns in the
- regular expression. Various error codes are defined in the header file.
+ preg structure is filled in on success, and one other member of the
+ structure (as well as re_endp) is public: re_nsub contains the number
+ of capturing subpatterns in the regular expression. Various error codes
+ are defined in the header file.
- NOTE: If the yield of regcomp() is non-zero, you must not attempt to
+ NOTE: If the yield of regcomp() is non-zero, you must not attempt to
use the contents of the preg structure. If, for example, you pass it to
regexec(), the result is undefined and your program is likely to crash.
@@ -8690,9 +9278,9 @@ COMPILING A PATTERN
MATCHING NEWLINE CHARACTERS
This area is not simple, because POSIX and Perl take different views of
- things. It is not possible to get PCRE2 to obey POSIX semantics, but
+ things. It is not possible to get PCRE2 to obey POSIX semantics, but
then PCRE2 was never intended to be a POSIX engine. The following table
- lists the different possibilities for matching newline characters in
+ lists the different possibilities for matching newline characters in
Perl and PCRE2:
Default Change with
@@ -8713,25 +9301,25 @@ MATCHING NEWLINE CHARACTERS
$ matches \n in middle no REG_NEWLINE
^ matches \n in middle no REG_NEWLINE
- This behaviour is not what happens when PCRE2 is called via its POSIX
- API. By default, PCRE2's behaviour is the same as Perl's, except that
- there is no equivalent for PCRE2_DOLLAR_ENDONLY in Perl. In both PCRE2
+ This behaviour is not what happens when PCRE2 is called via its POSIX
+ API. By default, PCRE2's behaviour is the same as Perl's, except that
+ there is no equivalent for PCRE2_DOLLAR_ENDONLY in Perl. In both PCRE2
and Perl, there is no way to stop newline from matching [^a].
- Default POSIX newline handling can be obtained by setting PCRE2_DOTALL
- and PCRE2_DOLLAR_ENDONLY when calling pcre2_compile() directly, but
- there is no way to make PCRE2 behave exactly as for the REG_NEWLINE
- action. When using the POSIX API, passing REG_NEWLINE to PCRE2's reg-
+ Default POSIX newline handling can be obtained by setting PCRE2_DOTALL
+ and PCRE2_DOLLAR_ENDONLY when calling pcre2_compile() directly, but
+ there is no way to make PCRE2 behave exactly as for the REG_NEWLINE
+ action. When using the POSIX API, passing REG_NEWLINE to PCRE2's reg-
comp() function causes PCRE2_MULTILINE to be passed to pcre2_compile(),
- and REG_DOTALL passes PCRE2_DOTALL. There is no way to pass PCRE2_DOL-
+ and REG_DOTALL passes PCRE2_DOTALL. There is no way to pass PCRE2_DOL-
LAR_ENDONLY.
MATCHING A PATTERN
- The function regexec() is called to match a compiled pattern preg
- against a given string, which is by default terminated by a zero byte
- (but see REG_STARTEND below), subject to the options in eflags. These
+ The function regexec() is called to match a compiled pattern preg
+ against a given string, which is by default terminated by a zero byte
+ (but see REG_STARTEND below), subject to the options in eflags. These
can be:
REG_NOTBOL
@@ -8741,9 +9329,9 @@ MATCHING A PATTERN
REG_NOTEMPTY
- The PCRE2_NOTEMPTY option is set when calling the underlying PCRE2
- matching function. Note that REG_NOTEMPTY is not part of the POSIX
- standard. However, setting this option can give more POSIX-like behav-
+ The PCRE2_NOTEMPTY option is set when calling the underlying PCRE2
+ matching function. Note that REG_NOTEMPTY is not part of the POSIX
+ standard. However, setting this option can give more POSIX-like behav-
iour in some situations.
REG_NOTEOL
@@ -8753,15 +9341,24 @@ MATCHING A PATTERN
REG_STARTEND
- The string is considered to start at string + pmatch[0].rm_so and to
- have a terminating NUL located at string + pmatch[0].rm_eo (there need
- not actually be a NUL at that location), regardless of the value of
- nmatch. This is a BSD extension, compatible with but not specified by
- IEEE Standard 1003.2 (POSIX.2), and should be used with caution in
- software intended to be portable to other systems. Note that a non-zero
- rm_so does not imply REG_NOTBOL; REG_STARTEND affects only the location
- of the string, not how it is matched. Setting REG_STARTEND and passing
- pmatch as NULL are mutually exclusive; the error REG_INVARG is
+ When this option is set, the subject string starts at string +
+ pmatch[0].rm_so and ends at string + pmatch[0].rm_eo, which should
+ point to the first character beyond the string. There may be binary
+ zeroes within the subject string, and indeed, using REG_STARTEND is the
+ only way to pass a subject string that contains a binary zero.
+
+ Whatever the value of pmatch[0].rm_so, the offsets of the matched
+ string and any captured substrings are still given relative to the
+ start of string itself. (Before PCRE2 release 10.30 these were given
+ relative to string + pmatch[0].rm_so, but this differs from other
+ implementations.)
+
+ This is a BSD extension, compatible with but not specified by IEEE
+ Standard 1003.2 (POSIX.2), and should be used with caution in software
+ intended to be portable to other systems. Note that a non-zero rm_so
+ does not imply REG_NOTBOL; REG_STARTEND affects only the location and
+ length of the string, not how it is matched. Setting REG_STARTEND and
+ passing pmatch as NULL are mutually exclusive; the error REG_INVARG is
returned.
If the pattern was compiled with the REG_NOSUB flag, no data about any
@@ -8816,8 +9413,8 @@ AUTHOR
REVISION
- Last updated: 31 January 2016
- Copyright (c) 1997-2016 University of Cambridge.
+ Last updated: 15 June 2017
+ Copyright (c) 1997-2017 University of Cambridge.
------------------------------------------------------------------------------
@@ -8949,26 +9546,29 @@ SECURITY CONCERNS
use within individual applications. As such, the data supplied to
pcre2_serialize_decode() is expected to be trusted data, not data from
arbitrary external sources. There is only some simple consistency
- checking, not complete validation of what is being re-loaded.
+ checking, not complete validation of what is being re-loaded. Corrupted
+ data may cause undefined results. For example, if the length field of a
+ pattern in the serialized data is corrupted, the deserializing code may
+ read beyond the end of the byte stream that is passed to it.
SAVING COMPILED PATTERNS
Before compiled patterns can be saved they must be serialized, that is,
- converted to a stream of bytes. A single byte stream may contain any
- number of compiled patterns, but they must all use the same character
+ converted to a stream of bytes. A single byte stream may contain any
+ number of compiled patterns, but they must all use the same character
tables. A single copy of the tables is included in the byte stream (its
size is 1088 bytes). For more details of character tables, see the sec-
tion on locale support in the pcre2api documentation.
- The function pcre2_serialize_encode() creates a serialized byte stream
- from a list of compiled patterns. Its first two arguments specify the
+ The function pcre2_serialize_encode() creates a serialized byte stream
+ from a list of compiled patterns. Its first two arguments specify the
list, being a pointer to a vector of pointers to compiled patterns, and
the length of the vector. The third and fourth arguments point to vari-
ables which are set to point to the created byte stream and its length,
- respectively. The final argument is a pointer to a general context,
- which can be used to specify custom memory mangagement functions. If
- this argument is NULL, malloc() is used to obtain memory for the byte
+ respectively. The final argument is a pointer to a general context,
+ which can be used to specify custom memory management functions. If
+ this argument is NULL, malloc() is used to obtain memory for the byte
stream. The yield of the function is the number of serialized patterns,
or one of the following negative error codes:
@@ -8978,12 +9578,12 @@ SAVING COMPILED PATTERNS
PCRE2_ERROR_MIXEDTABLES the patterns do not all use the same tables
PCRE2_ERROR_NULL the 1st, 3rd, or 4th argument is NULL
- PCRE2_ERROR_BADMAGIC means either that a pattern's code has been cor-
- rupted, or that a slot in the vector does not point to a compiled pat-
+ PCRE2_ERROR_BADMAGIC means either that a pattern's code has been cor-
+ rupted, or that a slot in the vector does not point to a compiled pat-
tern.
Once a set of patterns has been serialized you can save the data in any
- appropriate manner. Here is sample code that compiles two patterns and
+ appropriate manner. Here is sample code that compiles two patterns and
writes them to a file. It assumes that the variable fd refers to a file
that is open for output. The error checking that should be present in a
real application has been omitted for simplicity.
@@ -9001,13 +9601,13 @@ SAVING COMPILED PATTERNS
&bytescount, NULL);
errorcode = fwrite(bytes, 1, bytescount, fd);
- Note that the serialized data is binary data that may contain any of
- the 256 possible byte values. On systems that make a distinction
+ Note that the serialized data is binary data that may contain any of
+ the 256 possible byte values. On systems that make a distinction
between binary and non-binary data, be sure that the file is opened for
binary output.
- Serializing a set of patterns leaves the original data untouched, so
- they can still be used for matching. Their memory must eventually be
+ Serializing a set of patterns leaves the original data untouched, so
+ they can still be used for matching. Their memory must eventually be
freed in the usual way by calling pcre2_code_free(). When you have fin-
ished with the byte stream, it too must be freed by calling pcre2_seri-
alize_free().
@@ -9015,11 +9615,11 @@ SAVING COMPILED PATTERNS
RE-USING PRECOMPILED PATTERNS
- In order to re-use a set of saved patterns you must first make the
- serialized byte stream available in main memory (for example, by read-
- ing from a file). The management of this memory block is up to the
+ In order to re-use a set of saved patterns you must first make the
+ serialized byte stream available in main memory (for example, by read-
+ ing from a file). The management of this memory block is up to the
application. You can use the pcre2_serialize_get_number_of_codes()
- function to find out how many compiled patterns are in the serialized
+ function to find out how many compiled patterns are in the serialized
data without actually decoding the patterns:
uint8_t *bytes = <serialized data>;
@@ -9027,10 +9627,10 @@ RE-USING PRECOMPILED PATTERNS
The pcre2_serialize_decode() function reads a byte stream and recreates
the compiled patterns in new memory blocks, setting pointers to them in
- a vector. The first two arguments are a pointer to a suitable vector
- and its length, and the third argument points to a byte stream. The
- final argument is a pointer to a general context, which can be used to
- specify custom memory mangagement functions for the decoded patterns.
+ a vector. The first two arguments are a pointer to a suitable vector
+ and its length, and the third argument points to a byte stream. The
+ final argument is a pointer to a general context, which can be used to
+ specify custom memory management functions for the decoded patterns.
If this argument is NULL, malloc() and free() are used. After deserial-
ization, the byte stream is no longer needed and can be discarded.
@@ -9040,9 +9640,9 @@ RE-USING PRECOMPILED PATTERNS
int32_t number_of_codes =
pcre2_serialize_decode(list_of_codes, 2, bytes, NULL);
- If the vector is not large enough for all the patterns in the byte
- stream, it is filled with those that fit, and the remainder are
- ignored. The yield of the function is the number of decoded patterns,
+ If the vector is not large enough for all the patterns in the byte
+ stream, it is filled with those that fit, and the remainder are
+ ignored. The yield of the function is the number of decoded patterns,
or one of the following negative error codes:
PCRE2_ERROR_BADDATA second argument is zero or less
@@ -9052,24 +9652,24 @@ RE-USING PRECOMPILED PATTERNS
PCRE2_ERROR_MEMORY memory allocation failed
PCRE2_ERROR_NULL first or third argument is NULL
- PCRE2_ERROR_BADMAGIC may mean that the data is corrupt, or that it was
+ PCRE2_ERROR_BADMAGIC may mean that the data is corrupt, or that it was
compiled on a system with different endianness.
Decoded patterns can be used for matching in the usual way, and must be
- freed by calling pcre2_code_free(). However, be aware that there is a
- potential race issue if you are using multiple patterns that were
- decoded from a single byte stream in a multithreaded application. A
+ freed by calling pcre2_code_free(). However, be aware that there is a
+ potential race issue if you are using multiple patterns that were
+ decoded from a single byte stream in a multithreaded application. A
single copy of the character tables is used by all the decoded patterns
and a reference count is used to arrange for its memory to be automati-
- cally freed when the last pattern is freed, but there is no locking on
- this reference count. Therefore, if you want to call pcre2_code_free()
- for these patterns in different threads, you must arrange your own
- locking, and ensure that pcre2_code_free() cannot be called by two
+ cally freed when the last pattern is freed, but there is no locking on
+ this reference count. Therefore, if you want to call pcre2_code_free()
+ for these patterns in different threads, you must arrange your own
+ locking, and ensure that pcre2_code_free() cannot be called by two
threads at the same time.
- If a pattern was processed by pcre2_jit_compile() before being serial-
- ized, the JIT data is discarded and so is no longer available after a
- save/restore cycle. You can, however, process a restored pattern with
+ If a pattern was processed by pcre2_jit_compile() before being serial-
+ ized, the JIT data is discarded and so is no longer available after a
+ save/restore cycle. You can, however, process a restored pattern with
pcre2_jit_compile() if you wish.
@@ -9082,174 +9682,8 @@ AUTHOR
REVISION
- Last updated: 24 May 2016
- Copyright (c) 1997-2016 University of Cambridge.
-------------------------------------------------------------------------------
-
-
-PCRE2STACK(3) Library Functions Manual PCRE2STACK(3)
-
-
-
-NAME
- PCRE2 - Perl-compatible regular expressions (revised API)
-
-PCRE2 DISCUSSION OF STACK USAGE
-
- When you call pcre2_match(), it makes use of an internal function
- called match(). This calls itself recursively at branch points in the
- pattern, in order to remember the state of the match so that it can
- back up and try a different alternative after a failure. As matching
- proceeds deeper and deeper into the tree of possibilities, the recur-
- sion depth increases. The match() function is also called in other cir-
- cumstances, for example, whenever a parenthesized sub-pattern is
- entered, and in certain cases of repetition.
-
- Not all calls of match() increase the recursion depth; for an item such
- as a* it may be called several times at the same level, after matching
- different numbers of a's. Furthermore, in a number of cases where the
- result of the recursive call would immediately be passed back as the
- result of the current call (a "tail recursion"), the function is just
- restarted instead.
-
- Each time the internal match() function is called recursively, it uses
- memory from the process stack. For certain kinds of pattern and data,
- very large amounts of stack may be needed, despite the recognition of
- "tail recursion". Note that if PCRE2 is compiled with the -fsani-
- tize=address option of the GCC compiler, the stack requirements are
- greatly increased.
-
- The above comments apply when pcre2_match() is run in its normal inter-
- pretive manner. If the compiled pattern was processed by pcre2_jit_com-
- pile(), and just-in-time compiling was successful, and the options
- passed to pcre2_match() were not incompatible, the matching process
- uses the JIT-compiled code instead of the match() function. In this
- case, the memory requirements are handled entirely differently. See the
- pcre2jit documentation for details.
-
- The pcre2_dfa_match() function operates in a different way to
- pcre2_match(), and uses recursion only when there is a regular expres-
- sion recursion or subroutine call in the pattern. This includes the
- processing of assertion and "once-only" subpatterns, which are handled
- like subroutine calls. Normally, these are never very deep, and the
- limit on the complexity of pcre2_dfa_match() is controlled by the
- amount of workspace it is given. However, it is possible to write pat-
- terns with runaway infinite recursions; such patterns will cause
- pcre2_dfa_match() to run out of stack. At present, there is no protec-
- tion against this.
-
- The comments that follow do NOT apply to pcre2_dfa_match(); they are
- relevant only for pcre2_match() without the JIT optimization.
-
- Reducing pcre2_match()'s stack usage
-
- You can often reduce the amount of recursion, and therefore the amount
- of stack used, by modifying the pattern that is being matched. Con-
- sider, for example, this pattern:
-
- ([^<]|<(?!inet))+
-
- It matches from wherever it starts until it encounters "<inet" or the
- end of the data, and is the kind of pattern that might be used when
- processing an XML file. Each iteration of the outer parentheses matches
- either one character that is not "<" or a "<" that is not followed by
- "inet". However, each time a parenthesis is processed, a recursion
- occurs, so this formulation uses a stack frame for each matched charac-
- ter. For a long string, a lot of stack is required. Consider now this
- rewritten pattern, which matches exactly the same strings:
-
- ([^<]++|<(?!inet))+
-
- This uses very much less stack, because runs of characters that do not
- contain "<" are "swallowed" in one item inside the parentheses. Recur-
- sion happens only when a "<" character that is not followed by "inet"
- is encountered (and we assume this is relatively rare). A possessive
- quantifier is used to stop any backtracking into the runs of non-"<"
- characters, but that is not related to stack usage.
-
- This example shows that one way of avoiding stack problems when match-
- ing long subject strings is to write repeated parenthesized subpatterns
- to match more than one character whenever possible.
-
- Compiling PCRE2 to use heap instead of stack for pcre2_match()
-
- In environments where stack memory is constrained, you might want to
- compile PCRE2 to use heap memory instead of stack for remembering back-
- up points when pcre2_match() is running. This makes it run more slowly,
- however. Details of how to do this are given in the pcre2build documen-
- tation. When built in this way, instead of using the stack, PCRE2 gets
- memory for remembering backup points from the heap. By default, the
- memory is obtained by calling the system malloc() function, but you can
- arrange to supply your own memory management function. For details, see
- the section entitled "The match context" in the pcre2api documentation.
- Since the block sizes are always the same, it may be possible to imple-
- ment customized a memory handler that is more efficient than the stan-
- dard function. The memory blocks obtained for this purpose are retained
- and re-used if possible while pcre2_match() is running. They are all
- freed just before it exits.
-
- Limiting pcre2_match()'s stack usage
-
- You can set limits on the number of times the internal match() function
- is called, both in total and recursively. If a limit is exceeded,
- pcre2_match() returns an error code. Setting suitable limits should
- prevent it from running out of stack. The default values of the limits
- are very large, and unlikely ever to operate. They can be changed when
- PCRE2 is built, and they can also be set when pcre2_match() is called.
- For details of these interfaces, see the pcre2build documentation and
- the section entitled "The match context" in the pcre2api documentation.
-
- As a very rough rule of thumb, you should reckon on about 500 bytes per
- recursion. Thus, if you want to limit your stack usage to 8Mb, you
- should set the limit at 16000 recursions. A 64Mb stack, on the other
- hand, can support around 128000 recursions.
-
- The pcre2test test program has a modifier called "find_limits" which,
- if applied to a subject line, causes it to find the smallest limits
- that allow a a pattern to match. This is done by calling pcre2_match()
- repeatedly with different limits.
-
- Changing stack size in Unix-like systems
-
- In Unix-like environments, there is not often a problem with the stack
- unless very long strings are involved, though the default limit on
- stack size varies from system to system. Values from 8Mb to 64Mb are
- common. You can find your default limit by running the command:
-
- ulimit -s
-
- Unfortunately, the effect of running out of stack is often SIGSEGV,
- though sometimes a more explicit error message is given. You can nor-
- mally increase the limit on stack size by code such as this:
-
- struct rlimit rlim;
- getrlimit(RLIMIT_STACK, &rlim);
- rlim.rlim_cur = 100*1024*1024;
- setrlimit(RLIMIT_STACK, &rlim);
-
- This reads the current limits (soft and hard) using getrlimit(), then
- attempts to increase the soft limit to 100Mb using setrlimit(). You
- must do this before calling pcre2_match().
-
- Changing stack size in Mac OS X
-
- Using setrlimit(), as described above, should also work on Mac OS X. It
- is also possible to set a stack size when linking a program. There is a
- discussion about stack sizes in Mac OS X at this web site:
- http://developer.apple.com/qa/qa2005/qa1419.html.
-
-
-AUTHOR
-
- Philip Hazel
- University Computing Service
- Cambridge, England.
-
-
-REVISION
-
- Last updated: 21 November 2014
- Copyright (c) 1997-2014 University of Cambridge.
+ Last updated: 21 March 2017
+ Copyright (c) 1997-2017 University of Cambridge.
------------------------------------------------------------------------------
@@ -9526,18 +9960,21 @@ OPTION SETTING
(?i) caseless
(?J) allow duplicate names
(?m) multiline
+ (?n) no auto capture
(?s) single line (dotall)
(?U) default ungreedy (lazy)
- (?x) extended (ignore white space)
+ (?x) extended: ignore white space except in classes
+ (?xx) as (?x) but also ignore space and tab in classes
(?-...) unset option(s)
The following are recognized only at the very start of a pattern or
after one of the newline or \R options with similar syntax. More than
- one of them may appear.
+ one of them may appear. For the first three, d is a decimal number.
- (*LIMIT_MATCH=d) set the match limit to d (decimal number)
- (*LIMIT_RECURSION=d) set the recursion limit to d (decimal number)
- (*NOTEMPTY) set PCRE2_NOTEMPTY when matching
+ (*LIMIT_DEPTH=d) set the backtracking limit to d
+ (*LIMIT_HEAP=d) set the heap size limit to d kilobytes
+ (*LIMIT_MATCH=d) set the match limit to d
+ (*NOTEMPTY) set PCRE2_NOTEMPTY when matching
(*NOTEMPTY_ATSTART) set PCRE2_NOTEMPTY_ATSTART when matching
(*NO_AUTO_POSSESS) no auto-possessification (PCRE2_NO_AUTO_POSSESS)
(*NO_DOTSTAR_ANCHOR) no .* anchoring (PCRE2_NO_DOTSTAR_ANCHOR)
@@ -9546,16 +9983,17 @@ OPTION SETTING
(*UTF) set appropriate UTF mode for the library in use
(*UCP) set PCRE2_UCP (use Unicode properties for \d etc)
- Note that LIMIT_MATCH and LIMIT_RECURSION can only reduce the value of
- the limits set by the caller of pcre2_match(), not increase them. The
- application can lock out the use of (*UTF) and (*UCP) by setting the
- PCRE2_NEVER_UTF or PCRE2_NEVER_UCP options, respectively, at compile
- time.
+ Note that LIMIT_DEPTH, LIMIT_HEAP, and LIMIT_MATCH can only reduce the
+ value of the limits set by the caller of pcre2_match() or
+ pcre2_dfa_match(), not increase them. LIMIT_RECURSION is an obsolete
+ synonym for LIMIT_DEPTH. The application can lock out the use of (*UTF)
+ and (*UCP) by setting the PCRE2_NEVER_UTF or PCRE2_NEVER_UCP options,
+ respectively, at compile time.
NEWLINE CONVENTION
- These are recognized only at the very start of the pattern or after
+ These are recognized only at the very start of the pattern or after
option settings with a similar syntax.
(*CR) carriage return only
@@ -9563,11 +10001,12 @@ NEWLINE CONVENTION
(*CRLF) carriage return followed by linefeed
(*ANYCRLF) all three of the above
(*ANY) any Unicode newline sequence
+ (*NUL) the NUL character (binary zero)
WHAT \R MATCHES
- These are recognized only at the very start of the pattern or after
+ These are recognized only at the very start of the pattern or after
option setting with a similar syntax.
(*BSR_ANYCRLF) CR, LF, or CRLF
@@ -9589,6 +10028,9 @@ BACKREFERENCES
\n reference by number (can be ambiguous)
\gn reference by number
\g{n} reference by number
+ \g+n relative reference by number (PCRE2 extension)
+ \g-n relative reference by number
+ \g{+n} relative reference by number (PCRE2 extension)
\g{-n} relative reference by number
\k<name> reference by name (Perl)
\k'name' reference by name (Perl)
@@ -9625,14 +10067,18 @@ CONDITIONAL PATTERNS
(?(-n) relative reference condition
(?(<name>) named reference condition (Perl)
(?('name') named reference condition (Perl)
- (?(name) named reference condition (PCRE2)
+ (?(name) named reference condition (PCRE2, deprecated)
(?(R) overall recursion condition
- (?(Rn) specific group recursion condition
- (?(R&name) specific recursion condition
+ (?(Rn) specific numbered group recursion condition
+ (?(R&name) specific named group recursion condition
(?(DEFINE) define subpattern for reference
(?(VERSION[>]=n.m) test PCRE2 version
(?(assert) assertion condition
+ Note the ambiguity of (?(R) and (?(Rn) which might be named reference
+ conditions or recursion tests. Such a condition is interpreted as a
+ reference condition if the relevant named group exists.
+
BACKTRACKING CONTROL
@@ -9642,7 +10088,7 @@ BACKTRACKING CONTROL
(*FAIL) force backtrack; synonym (*F)
(*MARK:NAME) set name to be passed back; synonym (*:NAME)
- The following act only when a subsequent match failure causes a back-
+ The following act only when a subsequent match failure causes a back-
track to reach them. They all force a match failure, but they differ in
what happens afterwards. Those that advance the start-of-match point do
so only if the pattern is not anchored.
@@ -9664,14 +10110,14 @@ CALLOUTS
(?C"text") callout with string data
The allowed string delimiters are ` ' " ^ % # $ (which are the same for
- the start and the end), and the starting delimiter { matched with the
- ending delimiter }. To encode the ending delimiter within the string,
+ the start and the end), and the starting delimiter { matched with the
+ ending delimiter }. To encode the ending delimiter within the string,
double it.
SEE ALSO
- pcre2pattern(3), pcre2api(3), pcre2callout(3), pcre2matching(3),
+ pcre2pattern(3), pcre2api(3), pcre2callout(3), pcre2matching(3),
pcre2(3).
@@ -9684,8 +10130,8 @@ AUTHOR
REVISION
- Last updated: 16 October 2015
- Copyright (c) 1997-2015 University of Cambridge.
+ Last updated: 17 June 2017
+ Copyright (c) 1997-2017 University of Cambridge.
------------------------------------------------------------------------------
@@ -9724,7 +10170,7 @@ UNICODE PROPERTY SUPPORT
names for properties are supported. For example, \p{L} matches a let-
ter. Its Perl synonym, \p{Letter}, is not supported. Furthermore, in
Perl, many properties may optionally be prefixed by "Is", for compati-
- bility with Perl 5.6. PCRE does not support this.
+ bility with Perl 5.6. PCRE2 does not support this.
WIDE CHARACTERS AND UTF MODES
@@ -9775,64 +10221,78 @@ WIDE CHARACTERS AND UTF MODES
escapes (\h, \H, \v, and \V) do match all the appropriate Unicode char-
acters, whether or not PCRE2_UCP is set.
- Case-insensitive matching in UTF mode makes use of Unicode properties.
- A few Unicode characters such as Greek sigma have more than two code-
- points that are case-equivalent, and these are treated as such.
+
+CASE-EQUIVALENCE IN UTF MODES
+
+ Case-insensitive matching in a UTF mode makes use of Unicode properties
+ except for characters whose code points are less than 128 and that have
+ at most two case-equivalent values. For these, a direct table lookup is
+ used for speed. A few Unicode characters such as Greek sigma have more
+ than two codepoints that are case-equivalent, and these are treated as
+ such.
VALIDITY OF UTF STRINGS
- When the PCRE2_UTF option is set, the strings passed as patterns and
+ When the PCRE2_UTF option is set, the strings passed as patterns and
subjects are (by default) checked for validity on entry to the relevant
- functions. If an invalid UTF string is passed, an negative error code
- is returned. The code unit offset to the offending character can be
- extracted from the match data block by calling pcre2_get_startchar(),
+ functions. If an invalid UTF string is passed, a negative error code
+ is returned. The code unit offset to the offending character can be
+ extracted from the match data block by calling pcre2_get_startchar(),
which is used for this purpose after a UTF error.
UTF-16 and UTF-32 strings can indicate their endianness by special code
- knows as a byte-order mark (BOM). The PCRE2 functions do not handle
+ known as a byte-order mark (BOM). The PCRE2 functions do not handle
this, expecting strings to be in host byte order.
A UTF string is checked before any other processing takes place. In the
- case of pcre2_match() and pcre2_dfa_match() calls with a non-zero
- starting offset, the check is applied only to that part of the subject
- that could be inspected during matching, and there is a check that the
- starting offset points to the first code unit of a character or to the
- end of the subject. If there are no lookbehind assertions in the pat-
- tern, the check starts at the starting offset. Otherwise, it starts at
- the length of the longest lookbehind before the starting offset, or at
- the start of the subject if there are not that many characters before
- the starting offset. Note that the sequences \b and \B are one-charac-
+ case of pcre2_match() and pcre2_dfa_match() calls with a non-zero
+ starting offset, the check is applied only to that part of the subject
+ that could be inspected during matching, and there is a check that the
+ starting offset points to the first code unit of a character or to the
+ end of the subject. If there are no lookbehind assertions in the pat-
+ tern, the check starts at the starting offset. Otherwise, it starts at
+ the length of the longest lookbehind before the starting offset, or at
+ the start of the subject if there are not that many characters before
+ the starting offset. Note that the sequences \b and \B are one-charac-
ter lookbehinds.
- In addition to checking the format of the string, there is a check to
+ In addition to checking the format of the string, there is a check to
ensure that all code points lie in the range U+0 to U+10FFFF, excluding
- the surrogate area. The so-called "non-character" code points are not
+ the surrogate area. The so-called "non-character" code points are not
excluded because Unicode corrigendum #9 makes it clear that they should
not be.
- Characters in the "Surrogate Area" of Unicode are reserved for use by
- UTF-16, where they are used in pairs to encode code points with values
- greater than 0xFFFF. The code points that are encoded by UTF-16 pairs
- are available independently in the UTF-8 and UTF-32 encodings. (In
- other words, the whole surrogate thing is a fudge for UTF-16 which
+ Characters in the "Surrogate Area" of Unicode are reserved for use by
+ UTF-16, where they are used in pairs to encode code points with values
+ greater than 0xFFFF. The code points that are encoded by UTF-16 pairs
+ are available independently in the UTF-8 and UTF-32 encodings. (In
+ other words, the whole surrogate thing is a fudge for UTF-16 which
unfortunately messes up UTF-8 and UTF-32.)
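
The range check and the surrogate area described above can be illustrated with a short sketch. This is not PCRE2's own validation code; the function names is_valid_code_point() and to_surrogate_pair() are hypothetical and use only the standard C library. The first function applies the stated rule (U+0 to U+10FFFF, excluding U+D800 to U+DFFF, with non-characters allowed); the second shows how UTF-16 uses a pair of surrogates to encode a code point greater than 0xFFFF.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true if cp lies in U+0..U+10FFFF and is not in the
   surrogate area U+D800..U+DFFF. Non-character code points are accepted. */
static bool is_valid_code_point(uint32_t cp)
{
    if (cp > 0x10FFFF) return false;
    if (cp >= 0xD800 && cp <= 0xDFFF) return false;
    return true;
}

/* Hypothetical helper: split a code point greater than 0xFFFF into the
   UTF-16 surrogate pair that encodes it. */
static void to_surrogate_pair(uint32_t cp, uint16_t *high, uint16_t *low)
{
    assert(cp > 0xFFFF && cp <= 0x10FFFF);
    uint32_t v = cp - 0x10000;                 /* 20 significant bits remain */
    *high = (uint16_t)(0xD800 + (v >> 10));    /* top 10 bits */
    *low  = (uint16_t)(0xDC00 + (v & 0x3FF));  /* bottom 10 bits */
}
```

Because both halves of a pair always fall inside U+D800 to U+DFFF, that range cannot also be used for ordinary characters, which is why it is excluded in every UTF mode.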
- In some situations, you may already know that your strings are valid,
- and therefore want to skip these checks in order to improve perfor-
- mance, for example in the case of a long subject string that is being
- scanned repeatedly. If you set the PCRE2_NO_UTF_CHECK option at com-
- pile time or at match time, PCRE2 assumes that the pattern or subject
+ In some situations, you may already know that your strings are valid,
+ and therefore want to skip these checks in order to improve perfor-
+ mance, for example in the case of a long subject string that is being
+ scanned repeatedly. If you set the PCRE2_NO_UTF_CHECK option at com-
+ pile time or at match time, PCRE2 assumes that the pattern or subject
it is given (respectively) contains only valid UTF code unit sequences.
- Passing PCRE2_NO_UTF_CHECK to pcre2_compile() just disables the check
+ Passing PCRE2_NO_UTF_CHECK to pcre2_compile() just disables the check
for the pattern; it does not also apply to subject strings. If you want
- to disable the check for a subject string you must pass this option to
+ to disable the check for a subject string you must pass this option to
pcre2_match() or pcre2_dfa_match().
- If you pass an invalid UTF string when PCRE2_NO_UTF_CHECK is set, the
+ If you pass an invalid UTF string when PCRE2_NO_UTF_CHECK is set, the
result is undefined and your program may crash or loop indefinitely.
+ Note that setting PCRE2_NO_UTF_CHECK at compile time does not disable
+ the error that is given if an escape sequence for an invalid Unicode
+ code point is encountered in the pattern. If you want to allow escape
+ sequences such as \x{d800} (a surrogate code point) you can set the
+ PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES extra option. However, this is pos-
+ sible only in UTF-8 and UTF-32 modes, because these values are not rep-
+ resentable in UTF-16.
+
Errors in UTF-8 strings
The following negative error codes are given for invalid UTF-8 strings:
@@ -9843,10 +10303,10 @@ VALIDITY OF UTF STRINGS
PCRE2_ERROR_UTF8_ERR4
PCRE2_ERROR_UTF8_ERR5
- The string ends with a truncated UTF-8 character; the code specifies
- how many bytes are missing (1 to 5). Although RFC 3629 restricts UTF-8
- characters to be no longer than 4 bytes, the encoding scheme (origi-
- nally defined by RFC 2279) allows for up to 6 bytes, and this is
+ The string ends with a truncated UTF-8 character; the code specifies
+ how many bytes are missing (1 to 5). Although RFC 3629 restricts UTF-8
+ characters to be no longer than 4 bytes, the encoding scheme (origi-
+ nally defined by RFC 2279) allows for up to 6 bytes, and this is
checked first; hence the possibility of 4 or 5 missing bytes.
PCRE2_ERROR_UTF8_ERR6
@@ -9856,24 +10316,24 @@ VALIDITY OF UTF STRINGS
PCRE2_ERROR_UTF8_ERR10
The two most significant bits of the 2nd, 3rd, 4th, 5th, or 6th byte of
- the character do not have the binary value 0b10 (that is, either the
+ the character do not have the binary value 0b10 (that is, either the
most significant bit is 0, or the next bit is 1).
PCRE2_ERROR_UTF8_ERR11
PCRE2_ERROR_UTF8_ERR12
- A character that is valid by the RFC 2279 rules is either 5 or 6 bytes
+ A character that is valid by the RFC 2279 rules is either 5 or 6 bytes
long; these code points are excluded by RFC 3629.
PCRE2_ERROR_UTF8_ERR13
- A 4-byte character has a value greater than 0x10fff; these code points
+ A 4-byte character has a value greater than 0x10ffff; these code points
are excluded by RFC 3629.
PCRE2_ERROR_UTF8_ERR14
- A 3-byte character has a value in the range 0xd800 to 0xdfff; this
- range of code points are reserved by RFC 3629 for use with UTF-16, and
+ A 3-byte character has a value in the range 0xd800 to 0xdfff; this
+ range of code points is reserved by RFC 3629 for use with UTF-16, and
so are excluded from UTF-8.
PCRE2_ERROR_UTF8_ERR15
@@ -9882,26 +10342,26 @@ VALIDITY OF UTF STRINGS
PCRE2_ERROR_UTF8_ERR18
PCRE2_ERROR_UTF8_ERR19
- A 2-, 3-, 4-, 5-, or 6-byte character is "overlong", that is, it codes
- for a value that can be represented by fewer bytes, which is invalid.
- For example, the two bytes 0xc0, 0xae give the value 0x2e, whose cor-
+ A 2-, 3-, 4-, 5-, or 6-byte character is "overlong", that is, it codes
+ for a value that can be represented by fewer bytes, which is invalid.
+ For example, the two bytes 0xc0, 0xae give the value 0x2e, whose cor-
rect coding uses just one byte.
PCRE2_ERROR_UTF8_ERR20
The two most significant bits of the first byte of a character have the
- binary value 0b10 (that is, the most significant bit is 1 and the sec-
- ond is 0). Such a byte can only validly occur as the second or subse-
+ binary value 0b10 (that is, the most significant bit is 1 and the sec-
+ ond is 0). Such a byte can only validly occur as the second or subse-
quent byte of a multi-byte character.
PCRE2_ERROR_UTF8_ERR21
- The first byte of a character has the value 0xfe or 0xff. These values
+ The first byte of a character has the value 0xfe or 0xff. These values
can never occur in a valid UTF-8 string.
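
The checks listed above can be sketched as a small stand-alone validator. This is only an illustration of the categories of error, not PCRE2's own code: the function name utf8_check() and the enum values are hypothetical, and they do not correspond to the PCRE2_ERROR_UTF8_ERRn numbers. The order of the checks follows the description above (truncation first, then continuation bits, then the RFC 3629 restrictions).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical result codes for this sketch only; they are NOT the
   PCRE2_ERROR_UTF8_* values. */
enum utf8_status {
  UTF8_OK = 0,
  UTF8_TRUNCATED,   /* string ends in the middle of a character */
  UTF8_BAD_CONT,    /* a following byte does not have top bits 0b10 */
  UTF8_TOO_LONG,    /* 5- or 6-byte form, allowed only by RFC 2279 */
  UTF8_TOO_BIG,     /* code point greater than 0x10ffff */
  UTF8_SURROGATE,   /* code point in the range 0xd800 to 0xdfff */
  UTF8_OVERLONG,    /* value could be coded in fewer bytes */
  UTF8_STRAY_CONT,  /* first byte has top bits 0b10 */
  UTF8_BAD_LEAD     /* first byte is 0xfe or 0xff */
};

enum utf8_status utf8_check(const unsigned char *s, size_t len)
{
  /* Smallest code point that genuinely needs n bytes, indexed by n. */
  static const uint32_t min_value[] =
    { 0, 0, 0x80, 0x800, 0x10000, 0x200000, 0x4000000 };
  size_t i = 0;

  while (i < len) {
    unsigned char c = s[i];
    size_t n;       /* total number of bytes in this character */
    uint32_t cp;    /* decoded code point */

    if (c < 0x80) { i++; continue; }          /* ASCII */
    if (c < 0xc0) return UTF8_STRAY_CONT;
    if (c >= 0xfe) return UTF8_BAD_LEAD;

    if (c < 0xe0)      { n = 2; cp = c & 0x1f; }
    else if (c < 0xf0) { n = 3; cp = c & 0x0f; }
    else if (c < 0xf8) { n = 4; cp = c & 0x07; }
    else if (c < 0xfc) { n = 5; cp = c & 0x03; }
    else               { n = 6; cp = c & 0x01; }

    if (len - i < n) return UTF8_TRUNCATED;

    for (size_t k = 1; k < n; k++) {
      if ((s[i + k] & 0xc0) != 0x80) return UTF8_BAD_CONT;
      cp = (cp << 6) | (uint32_t)(s[i + k] & 0x3f);
    }

    if (n > 4) return UTF8_TOO_LONG;
    if (cp > 0x10ffff) return UTF8_TOO_BIG;
    if (cp < min_value[n]) return UTF8_OVERLONG;
    if (cp >= 0xd800 && cp <= 0xdfff) return UTF8_SURROGATE;

    i += n;
  }
  return UTF8_OK;
}
```

For instance, passing the two bytes 0xc0, 0xae from the overlong example above yields UTF8_OVERLONG in this sketch, because the decoded value 0x2e is below the two-byte minimum of 0x80.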
Errors in UTF-16 strings
- The following negative error codes are given for invalid UTF-16
+ The following negative error codes are given for invalid UTF-16
strings:
PCRE2_ERROR_UTF16_ERR1 Missing low surrogate at end of string
@@ -9911,7 +10371,7 @@ VALIDITY OF UTF STRINGS
Errors in UTF-32 strings
- The following negative error codes are given for invalid UTF-32
+ The following negative error codes are given for invalid UTF-32
strings:
PCRE2_ERROR_UTF32_ERR1 Surrogate character (0xd800 to 0xdfff)
@@ -9927,8 +10387,8 @@ AUTHOR
REVISION
- Last updated: 03 July 2016
- Copyright (c) 1997-2016 University of Cambridge.
+ Last updated: 17 May 2017
+ Copyright (c) 1997-2017 University of Cambridge.
------------------------------------------------------------------------------
diff --git a/doc/pcre2_callout_enumerate.3 b/doc/pcre2_callout_enumerate.3
index 4573bb4..109c9be 100644
--- a/doc/pcre2_callout_enumerate.3
+++ b/doc/pcre2_callout_enumerate.3
@@ -1,4 +1,4 @@
-.TH PCRE2_COMPILE 3 "23 March 2015" "PCRE2 10.20"
+.TH PCRE2_CALLOUT_ENUMERATE 3 "23 March 2017" "PCRE2 10.30"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH SYNOPSIS
@@ -24,20 +24,21 @@ for success and non-zero otherwise. The arguments are:
\fIcallout_data\fP User data that is passed to the callback
.sp
The \fIcallback()\fP function is passed a pointer to a data block containing
-the following fields:
+the following fields (not necessarily in this order):
.sp
- \fIversion\fP Block version number
- \fIpattern_position\fP Offset to next item in pattern
- \fInext_item_length\fP Length of next item in pattern
- \fIcallout_number\fP Number for numbered callouts
- \fIcallout_string_offset\fP Offset to string within pattern
- \fIcallout_string_length\fP Length of callout string
- \fIcallout_string\fP Points to callout string or is NULL
+ uint32_t \fIversion\fP Block version number
+ uint32_t \fIcallout_number\fP Number for numbered callouts
+ PCRE2_SIZE \fIpattern_position\fP Offset to next item in pattern
+ PCRE2_SIZE \fInext_item_length\fP Length of next item in pattern
+ PCRE2_SIZE \fIcallout_string_offset\fP Offset to string within pattern
+ PCRE2_SIZE \fIcallout_string_length\fP Length of callout string
+ PCRE2_SPTR \fIcallout_string\fP Points to callout string or is NULL
.sp
-The second argument is the callout data that was passed to
-\fBpcre2_callout_enumerate()\fP. The \fBcallback()\fP function must return zero
-for success. Any other value causes the pattern scan to stop, with the value
-being passed back as the result of \fBpcre2_callout_enumerate()\fP.
+The second argument passed to the \fBcallback()\fP function is the callout data
+that was passed to \fBpcre2_callout_enumerate()\fP. The \fBcallback()\fP
+function must return zero for success. Any other value causes the pattern scan
+to stop, with the value being passed back as the result of
+\fBpcre2_callout_enumerate()\fP.
.P
There is a complete description of the PCRE2 native API in the
.\" HREF
diff --git a/doc/pcre2_code_copy.3 b/doc/pcre2_code_copy.3
index 270b3a6..09b4705 100644
--- a/doc/pcre2_code_copy.3
+++ b/doc/pcre2_code_copy.3
@@ -1,4 +1,4 @@
-.TH PCRE2_CODE_COPY 3 "26 February 2016" "PCRE2 10.22"
+.TH PCRE2_CODE_COPY 3 "22 November 2016" "PCRE2 10.23"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH SYNOPSIS
@@ -16,8 +16,9 @@ PCRE2 - Perl-compatible regular expressions (revised API)
This function makes a copy of the memory used for a compiled pattern, excluding
any memory used by the JIT compiler. Without a subsequent call to
\fBpcre2_jit_compile()\fP, the copy can be used only for non-JIT matching. The
-yield of the function is NULL if \fIcode\fP is NULL or if sufficient memory
-cannot be obtained.
+pointer to the character tables is copied, not the tables themselves (see
+\fBpcre2_code_copy_with_tables()\fP). The yield of the function is NULL if
+\fIcode\fP is NULL or if sufficient memory cannot be obtained.
.P
There is a complete description of the PCRE2 native API in the
.\" HREF
diff --git a/doc/pcre2_code_copy_with_tables.3 b/doc/pcre2_code_copy_with_tables.3
new file mode 100644
index 0000000..cfbddb3
--- /dev/null
+++ b/doc/pcre2_code_copy_with_tables.3
@@ -0,0 +1,32 @@
+.TH PCRE2_CODE_COPY_WITH_TABLES 3 "22 November 2016" "PCRE2 10.23"
+.SH NAME
+PCRE2 - Perl-compatible regular expressions (revised API)
+.SH SYNOPSIS
+.rs
+.sp
+.B #include <pcre2.h>
+.PP
+.nf
+.B pcre2_code *pcre2_code_copy_with_tables(const pcre2_code *\fIcode\fP);
+.fi
+.
+.SH DESCRIPTION
+.rs
+.sp
+This function makes a copy of the memory used for a compiled pattern, excluding
+any memory used by the JIT compiler. Without a subsequent call to
+\fBpcre2_jit_compile()\fP, the copy can be used only for non-JIT matching.
+Unlike \fBpcre2_code_copy()\fP, a separate copy of the character tables is also
+made, with the new code pointing to it. This memory will be automatically freed
+when \fBpcre2_code_free()\fP is called. The yield of the function is NULL if
+\fIcode\fP is NULL or if sufficient memory cannot be obtained.
+.P
+There is a complete description of the PCRE2 native API in the
+.\" HREF
+\fBpcre2api\fP
+.\"
+page and a description of the POSIX API in the
+.\" HREF
+\fBpcre2posix\fP
+.\"
+page.
diff --git a/doc/pcre2_code_free.3 b/doc/pcre2_code_free.3
index 5127081..7376869 100644
--- a/doc/pcre2_code_free.3
+++ b/doc/pcre2_code_free.3
@@ -1,4 +1,4 @@
-.TH PCRE2_CODE_FREE 3 "29 July 2015" "PCRE2 10.21"
+.TH PCRE2_CODE_FREE 3 "23 March 2017" "PCRE2 10.30"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH SYNOPSIS
@@ -14,7 +14,9 @@ PCRE2 - Perl-compatible regular expressions (revised API)
.rs
.sp
This function frees the memory used for a compiled pattern, including any
-memory used by the JIT compiler.
+memory used by the JIT compiler. If the compiled pattern was created by a call
+to \fBpcre2_code_copy_with_tables()\fP, the memory for the character tables is
+also freed.
.P
There is a complete description of the PCRE2 native API in the
.\" HREF
diff --git a/doc/pcre2_compile.3 b/doc/pcre2_compile.3
index 1e0dca5..19f35c3 100644
--- a/doc/pcre2_compile.3
+++ b/doc/pcre2_compile.3
@@ -1,4 +1,4 @@
-.TH PCRE2_COMPILE 3 "22 April 2015" "PCRE2 10.20"
+.TH PCRE2_COMPILE 3 "16 June 2017" "PCRE2 10.30"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH SYNOPSIS
@@ -25,26 +25,34 @@ arguments are:
\fIerroffset\fP Where to put an error offset
\fIccontext\fP Pointer to a compile context or NULL
.sp
-The length of the string and any error offset that is returned are in code
-units, not characters. A compile context is needed only if you want to change
+The length of the pattern and any error offset that is returned are in code
+units, not characters. A compile context is needed only if you want to provide
+custom memory allocation functions, or to provide an external function for
+system stack size checking, or to change one or more of these parameters:
.sp
- What \eR matches (Unicode newlines or CR, LF, CRLF only)
- PCRE2's character tables
- The newline character sequence
- The compile time nested parentheses limit
+ What \eR matches (Unicode newlines, or CR, LF, CRLF only);
+ PCRE2's character tables;
+ The newline character sequence;
+ The compile time nested parentheses limit;
+ The maximum pattern length (in code units) that is allowed;
+ The additional option bits (see pcre2_set_compile_extra_options()).
.sp
-or provide an external function for stack size checking. The option bits are:
+The option bits are:
.sp
PCRE2_ANCHORED Force pattern anchoring
+ PCRE2_ALLOW_EMPTY_CLASS Allow empty classes
PCRE2_ALT_BSUX Alternative handling of \eu, \eU, and \ex
PCRE2_ALT_CIRCUMFLEX Alternative handling of ^ in multiline mode
+ PCRE2_ALT_VERBNAMES Process backslashes in verb names
PCRE2_AUTO_CALLOUT Compile automatic callouts
PCRE2_CASELESS Do caseless matching
PCRE2_DOLLAR_ENDONLY $ not to match newline at end
PCRE2_DOTALL . matches anything including NL
PCRE2_DUPNAMES Allow duplicate names for subpatterns
+ PCRE2_ENDANCHORED Pattern can match only at end of subject
PCRE2_EXTENDED Ignore white space and # comments
PCRE2_FIRSTLINE Force matching to be before newline
+ PCRE2_LITERAL Pattern characters are all literal
PCRE2_MATCH_UNSET_BACKREF Match unset back references
PCRE2_MULTILINE ^ and $ match newlines within data
PCRE2_NEVER_BACKSLASH_C Lock out the use of \eC in patterns
@@ -59,19 +67,21 @@ or provide an external function for stack size checking. The option bits are:
(only relevant if PCRE2_UTF is set)
PCRE2_UCP Use Unicode properties for \ed, \ew, etc.
PCRE2_UNGREEDY Invert greediness of quantifiers
+ PCRE2_USE_OFFSET_LIMIT Enable offset limit for unanchored matching
PCRE2_UTF Treat pattern and subjects as UTF strings
.sp
-PCRE2 must be built with Unicode support in order to use PCRE2_UTF, PCRE2_UCP
-and related options.
+PCRE2 must be built with Unicode support (the default) in order to use
+PCRE2_UTF, PCRE2_UCP and related options.
.P
The yield of the function is a pointer to a private data structure that
contains the compiled pattern, or NULL if an error was detected.
.P
-There is a complete description of the PCRE2 native API in the
+There is a complete description of the PCRE2 native API, with more detail on
+each option, in the
.\" HREF
\fBpcre2api\fP
.\"
-page and a description of the POSIX API in the
+page, and a description of the POSIX API in the
.\" HREF
\fBpcre2posix\fP
.\"
diff --git a/doc/pcre2_config.3 b/doc/pcre2_config.3
index 0c29ce6..ab9623d 100644
--- a/doc/pcre2_config.3
+++ b/doc/pcre2_config.3
@@ -1,4 +1,4 @@
-.TH PCRE2_CONFIG 3 "20 April 2014" "PCRE2 10.0"
+.TH PCRE2_CONFIG 3 "16 September 2017" "PCRE2 10.31"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH SYNOPSIS
@@ -31,22 +31,29 @@ point to a uint32_t integer variable. The available codes are:
PCRE2_CONFIG_BSR Indicates what \eR matches by default:
PCRE2_BSR_UNICODE
PCRE2_BSR_ANYCRLF
+ PCRE2_CONFIG_COMPILED_WIDTHS Which of 8/16/32 support was compiled
+ PCRE2_CONFIG_DEPTHLIMIT Default backtracking depth limit
+ PCRE2_CONFIG_HEAPLIMIT Default heap memory limit
+.\" JOIN
PCRE2_CONFIG_JIT Availability of just-in-time compiler
support (1=yes 0=no)
- PCRE2_CONFIG_JITTARGET Information about the target archi-
- tecture for the JIT compiler
+.\" JOIN
+ PCRE2_CONFIG_JITTARGET Information (a string) about the target
+ architecture for the JIT compiler
PCRE2_CONFIG_LINKSIZE Configured internal link size (2, 3, 4)
PCRE2_CONFIG_MATCHLIMIT Default internal resource limit
+ PCRE2_CONFIG_NEVER_BACKSLASH_C Whether or not \eC is disabled
PCRE2_CONFIG_NEWLINE Code for the default newline sequence:
PCRE2_NEWLINE_CR
PCRE2_NEWLINE_LF
PCRE2_NEWLINE_CRLF
PCRE2_NEWLINE_ANY
PCRE2_NEWLINE_ANYCRLF
+ PCRE2_NEWLINE_NUL
PCRE2_CONFIG_PARENSLIMIT Default parentheses nesting limit
- PCRE2_CONFIG_RECURSIONLIMIT Internal recursion depth limit
- PCRE2_CONFIG_STACKRECURSE Recursion implementation (1=stack
- 0=heap)
+ PCRE2_CONFIG_RECURSIONLIMIT Obsolete: use PCRE2_CONFIG_DEPTHLIMIT
+ PCRE2_CONFIG_STACKRECURSE Obsolete: always returns 0
+.\" JOIN
PCRE2_CONFIG_UNICODE Availability of Unicode support (1=yes
0=no)
PCRE2_CONFIG_UNICODE_VERSION The Unicode version (a string)
diff --git a/doc/pcre2_convert_context_copy.3 b/doc/pcre2_convert_context_copy.3
new file mode 100644
index 0000000..827c3e9
--- /dev/null
+++ b/doc/pcre2_convert_context_copy.3
@@ -0,0 +1,26 @@
+.TH PCRE2_CONVERT_CONTEXT_COPY 3 "10 July 2017" "PCRE2 10.30"
+.SH NAME
+PCRE2 - Perl-compatible regular expressions (revised API)
+.SH SYNOPSIS
+.rs
+.sp
+.B #include <pcre2.h>
+.PP
+.nf
+.B pcre2_convert_context *pcre2_convert_context_copy(
+.B " pcre2_convert_context *\fIcvcontext\fP);"
+.fi
+.
+.SH DESCRIPTION
+.rs
+.sp
+This function is part of an experimental set of pattern conversion functions.
+It makes a new copy of a convert context, using the memory allocation function
+that was used for the original context. The result is NULL if the memory cannot
+be obtained.
+.P
+The pattern conversion functions are described in the
+.\" HREF
+\fBpcre2convert\fP
+.\"
+documentation.
diff --git a/doc/pcre2_convert_context_create.3 b/doc/pcre2_convert_context_create.3
new file mode 100644
index 0000000..91c17fb
--- /dev/null
+++ b/doc/pcre2_convert_context_create.3
@@ -0,0 +1,27 @@
+.TH PCRE2_CONVERT_CONTEXT_CREATE 3 "10 July 2017" "PCRE2 10.30"
+.SH NAME
+PCRE2 - Perl-compatible regular expressions (revised API)
+.SH SYNOPSIS
+.rs
+.sp
+.B #include <pcre2.h>
+.PP
+.nf
+.B pcre2_convert_context *pcre2_convert_context_create(
+.B " pcre2_general_context *\fIgcontext\fP);"
+.fi
+.
+.SH DESCRIPTION
+.rs
+.sp
+This function is part of an experimental set of pattern conversion functions.
+It creates and initializes a new convert context. If its argument is
+NULL, \fBmalloc()\fP is used to get the necessary memory; otherwise the memory
+allocation function within the general context is used. The result is NULL if
+the memory could not be obtained.
+.P
+The pattern conversion functions are described in the
+.\" HREF
+\fBpcre2convert\fP
+.\"
+documentation.
diff --git a/doc/pcre2_convert_context_free.3 b/doc/pcre2_convert_context_free.3
new file mode 100644
index 0000000..fd5b13c
--- /dev/null
+++ b/doc/pcre2_convert_context_free.3
@@ -0,0 +1,25 @@
+.TH PCRE2_CONVERT_CONTEXT_FREE 3 "10 July 2017" "PCRE2 10.30"
+.SH NAME
+PCRE2 - Perl-compatible regular expressions (revised API)
+.SH SYNOPSIS
+.rs
+.sp
+.B #include <pcre2.h>
+.PP
+.nf
+.B void pcre2_convert_context_free(pcre2_convert_context *\fIcvcontext\fP);
+.fi
+.
+.SH DESCRIPTION
+.rs
+.sp
+This function is part of an experimental set of pattern conversion functions.
+It frees the memory occupied by a convert context, using the memory
+freeing function from the general context with which it was created, or
+\fBfree()\fP if that was not set.
+.P
+The pattern conversion functions are described in the
+.\" HREF
+\fBpcre2convert\fP
+.\"
+documentation.
diff --git a/doc/pcre2_converted_pattern_free.3 b/doc/pcre2_converted_pattern_free.3
new file mode 100644
index 0000000..687e078
--- /dev/null
+++ b/doc/pcre2_converted_pattern_free.3
@@ -0,0 +1,25 @@
+.TH PCRE2_CONVERTED_PATTERN_FREE 3 "11 July 2017" "PCRE2 10.30"
+.SH NAME
+PCRE2 - Perl-compatible regular expressions (revised API)
+.SH SYNOPSIS
+.rs
+.sp
+.B #include <pcre2.h>
+.PP
+.nf
+.B void pcre2_converted_pattern_free(PCRE2_UCHAR *\fIconverted_pattern\fP);
+.fi
+.
+.SH DESCRIPTION
+.rs
+.sp
+This function is part of an experimental set of pattern conversion functions.
+It frees the memory occupied by a converted pattern that was obtained by
+calling \fBpcre2_pattern_convert()\fP with arguments that caused it to place
+the converted pattern into newly obtained heap memory.
+.P
+The pattern conversion functions are described in the
+.\" HREF
+\fBpcre2convert\fP
+.\"
+documentation.
diff --git a/doc/pcre2_dfa_match.3 b/doc/pcre2_dfa_match.3
index f45da0d..7839145 100644
--- a/doc/pcre2_dfa_match.3
+++ b/doc/pcre2_dfa_match.3
@@ -1,4 +1,4 @@
-.TH PCRE2_DFA_MATCH 3 "12 May 2013" "PCRE2 10.00"
+.TH PCRE2_DFA_MATCH 3 "30 May 2017" "PCRE2 10.30"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH SYNOPSIS
@@ -19,8 +19,9 @@ PCRE2 - Perl-compatible regular expressions (revised API)
.sp
This function matches a compiled regular expression against a given subject
string, using an alternative matching algorithm that scans the subject string
-just once (\fInot\fP Perl-compatible). (The Perl-compatible matching function
-is \fBpcre2_match()\fP.) The arguments for this function are:
+just once (except when processing lookaround assertions). This function is
+\fInot\fP Perl-compatible (the Perl-compatible matching function is
+\fBpcre2_match()\fP). The arguments for this function are:
.sp
\fIcode\fP Points to the compiled pattern
\fIsubject\fP Points to the subject string
@@ -33,22 +34,28 @@ is \fBpcre2_match()\fP.) The arguments for this function are:
\fIwscount\fP Number of elements in the vector
.sp
For \fBpcre2_dfa_match()\fP, a match context is needed only if you want to set
-up a callout function. The \fIlength\fP and \fIstartoffset\fP values are code
-units, not characters. The options are:
+up a callout function or specify the match and/or the recursion depth limits.
+The \fIlength\fP and \fIstartoffset\fP values are code units, not characters.
+The options are:
.sp
PCRE2_ANCHORED Match only at the first position
+ PCRE2_ENDANCHORED Pattern can match only at end of subject
PCRE2_NOTBOL Subject is not the beginning of a line
PCRE2_NOTEOL Subject is not the end of a line
PCRE2_NOTEMPTY An empty string is not a valid match
+.\" JOIN
PCRE2_NOTEMPTY_ATSTART An empty string at the start of the subject
is not a valid match
+.\" JOIN
PCRE2_NO_UTF_CHECK Do not check the subject for UTF
validity (only relevant if PCRE2_UTF
was set at compile time)
+.\" JOIN
+ PCRE2_PARTIAL_HARD Return PCRE2_ERROR_PARTIAL for a partial
+ match even if there is a full match
+.\" JOIN
PCRE2_PARTIAL_SOFT Return PCRE2_ERROR_PARTIAL for a partial
- match if no full matches are found
- PCRE2_PARTIAL_HARD Return PCRE2_ERROR_PARTIAL for a partial match
- even if there is a full match as well
+ match if no full matches are found
PCRE2_DFA_RESTART Restart after a partial match
PCRE2_DFA_SHORTEST Return only the shortest match
.sp
diff --git a/doc/pcre2_get_error_message.3 b/doc/pcre2_get_error_message.3
index 9378b18..3d3e0de 100644
--- a/doc/pcre2_get_error_message.3
+++ b/doc/pcre2_get_error_message.3
@@ -1,4 +1,4 @@
-.TH PCRE2_GET_ERROR_MESSAGE 3 "17 June 2016" "PCRE2 10.22"
+.TH PCRE2_GET_ERROR_MESSAGE 3 "24 March 2017" "PCRE2 10.30"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH SYNOPSIS
@@ -22,11 +22,11 @@ errors are negative numbers. The arguments are:
\fIbuffer\fP where to put the message
\fIbufflen\fP the length of the buffer (code units)
.sp
-The function returns the length of the message, excluding the trailing zero, or
-the negative error code PCRE2_ERROR_NOMEMORY if the buffer is too small. In
-this case, the returned message is truncated (but still with a trailing zero).
-If \fIerrorcode\fP does not contain a recognized error code number, the
-negative value PCRE2_ERROR_BADDATA is returned.
+The function returns the length of the message in code units, excluding the
+trailing zero, or the negative error code PCRE2_ERROR_NOMEMORY if the buffer is
+too small. In this case, the returned message is truncated (but still with a
+trailing zero). If \fIerrorcode\fP does not contain a recognized error code
+number, the negative value PCRE2_ERROR_BADDATA is returned.
.P
There is a complete description of the PCRE2 native API in the
.\" HREF
diff --git a/doc/pcre2_get_mark.3 b/doc/pcre2_get_mark.3
index e741dfe..dce377d 100644
--- a/doc/pcre2_get_mark.3
+++ b/doc/pcre2_get_mark.3
@@ -1,4 +1,4 @@
-.TH PCRE2_GET_MARK 3 "24 October 2014" "PCRE2 10.00"
+.TH PCRE2_GET_MARK 3 "13 October 2017" "PCRE2 10.31"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH SYNOPSIS
@@ -14,11 +14,14 @@ PCRE2 - Perl-compatible regular expressions (revised API)
.rs
.sp
After a call of \fBpcre2_match()\fP that was passed the match block that is
-this function's argument, this function returns a pointer to the last (*MARK)
-name that was encountered. The name is zero-terminated, and is within the
-compiled pattern. If no (*MARK) name is available, NULL is returned. A (*MARK)
-name may be available after a failed match or a partial match, as well as after
-a successful one.
+this function's argument, this function returns a pointer to the last (*MARK),
+(*PRUNE), or (*THEN) name that was encountered during the matching process. The
+name is zero-terminated, and is within the compiled pattern. The length of the
+name is in the preceding code unit. If no name is available, NULL is returned.
+.P
+After a successful match, the name that is returned is the last one on the
+matching path. After a failed match or a partial match, the last encountered
+name is returned.
.P
There is a complete description of the PCRE2 native API in the
.\" HREF
diff --git a/doc/pcre2_jit_stack_create.3 b/doc/pcre2_jit_stack_create.3
index d530d50..61ccf79 100644
--- a/doc/pcre2_jit_stack_create.3
+++ b/doc/pcre2_jit_stack_create.3
@@ -1,4 +1,4 @@
-.TH PCRE2_JIT_STACK_CREATE 3 "03 November 2014" "PCRE2 10.00"
+.TH PCRE2_JIT_STACK_CREATE 3 "24 March 2017" "PCRE2 10.30"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH SYNOPSIS
@@ -20,10 +20,9 @@ maximum size to which it is allowed to grow. The final argument is a general
context, for memory allocation functions, or NULL for standard memory
allocation. The result can be passed to the JIT run-time code by calling
\fBpcre2_jit_stack_assign()\fP to associate the stack with a compiled pattern,
-which can then be processed by \fBpcre2_match()\fP. If the "fast path" JIT
-matcher, \fBpcre2_jit_match()\fP is used, the stack can be passed directly as
-an argument. A maximum stack size of 512K to 1M should be more than enough for
-any pattern. For more details, see the
+which can then be processed by \fBpcre2_match()\fP or \fBpcre2_jit_match()\fP.
+A maximum stack size of 512K to 1M should be more than enough for any pattern.
+For more details, see the
.\" HREF
\fBpcre2jit\fP
.\"
diff --git a/doc/pcre2_maketables.3 b/doc/pcre2_maketables.3
index 322dba7..740954b 100644
--- a/doc/pcre2_maketables.3
+++ b/doc/pcre2_maketables.3
@@ -1,4 +1,4 @@
-.TH PCRE2_MAKETABLES 3 "21 October 2014" "PCRE2 10.00"
+.TH PCRE2_MAKETABLES 3 "17 April 2017" "PCRE2 10.30"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH SYNOPSIS
@@ -7,15 +7,15 @@ PCRE2 - Perl-compatible regular expressions (revised API)
.B #include <pcre2.h>
.PP
.SM
-.B const unsigned char *pcre2_maketables(pcre22_general_context *\fIgcontext\fP);
+.B const unsigned char *pcre2_maketables(pcre2_general_context *\fIgcontext\fP);
.
.SH DESCRIPTION
.rs
.sp
-This function builds a set of character tables for character values less than
-256. These can be passed to \fBpcre2_compile()\fP in a compile context in order
-to override the internal, built-in tables (which were either defaulted or made
-by \fBpcre2_maketables()\fP when PCRE2 was compiled). See the
+This function builds a set of character tables for character code points that
+are less than 256. These can be passed to \fBpcre2_compile()\fP in a compile
+context in order to override the internal, built-in tables (which were either
+defaulted or made by \fBpcre2_maketables()\fP when PCRE2 was compiled). See the
.\" HREF
\fBpcre2_set_character_tables()\fP
.\"
diff --git a/doc/pcre2_match.3 b/doc/pcre2_match.3
index f25cace..6f7aefb 100644
--- a/doc/pcre2_match.3
+++ b/doc/pcre2_match.3
@@ -1,4 +1,4 @@
-.TH PCRE2_MATCH 3 "21 October 2014" "PCRE2 10.00"
+.TH PCRE2_MATCH 3 "14 November 2017" "PCRE2 10.31"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH SYNOPSIS
@@ -18,7 +18,13 @@ PCRE2 - Perl-compatible regular expressions (revised API)
.sp
This function matches a compiled regular expression against a given subject
string, using a matching algorithm that is similar to Perl's. It returns
-offsets to captured substrings. Its arguments are:
+offsets to what it has matched and to captured substrings via the
+\fBmatch_data\fP block, which can be processed by functions with names that
+start with \fBpcre2_get_ovector_...()\fP or \fBpcre2_substring_...()\fP. The
+return from \fBpcre2_match()\fP is one more than the highest numbered capturing
+pair that has been set (for example, 1 if there are no captures), zero if the
+vector of offsets is too small, or a negative error code for no match and other
+errors. The function arguments are:
.sp
\fIcode\fP Points to the compiled pattern
\fIsubject\fP Points to the subject string
@@ -31,26 +37,35 @@ offsets to captured substrings. Its arguments are:
A match context is needed only if you want to:
.sp
Set up a callout function
- Change the limit for calling the internal function \fImatch()\fP
- Change the limit for calling \fImatch()\fP recursively
- Set custom memory management when the heap is used for recursion
+ Set a matching offset limit
+ Change the heap memory limit
+ Change the backtracking match limit
+ Change the backtracking depth limit
+ Set custom memory management specifically for the match
.sp
The \fIlength\fP and \fIstartoffset\fP values are code
-units, not characters. The options are:
+units, not characters. The length may be given as PCRE2_ZERO_TERMINATED for a
+subject that is terminated by a binary zero code unit. The options are:
.sp
PCRE2_ANCHORED Match only at the first position
+ PCRE2_ENDANCHORED Pattern can match only at end of subject
PCRE2_NOTBOL Subject string is not the beginning of a line
PCRE2_NOTEOL Subject string is not the end of a line
PCRE2_NOTEMPTY An empty string is not a valid match
+.\" JOIN
PCRE2_NOTEMPTY_ATSTART An empty string at the start of the subject
is not a valid match
+ PCRE2_NO_JIT Do not use JIT matching
+.\" JOIN
PCRE2_NO_UTF_CHECK Do not check the subject for UTF
validity (only relevant if PCRE2_UTF
was set at compile time)
+.\" JOIN
+ PCRE2_PARTIAL_HARD Return PCRE2_ERROR_PARTIAL for a partial
+ match even if there is a full match
+.\" JOIN
PCRE2_PARTIAL_SOFT Return PCRE2_ERROR_PARTIAL for a partial
match if no full matches are found
- PCRE2_PARTIAL_HARD Return PCRE2_ERROR_PARTIAL for a partial match
- if that is found before a full match
.sp
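
The return-value rule given in the description of pcre2_match() (positive: one more than the highest numbered capturing pair that was set; zero: the vector of offsets is too small; negative: no match or another error) can be sketched as a small helper. The names classify_match_rc() and enum match_outcome are hypothetical and are not part of the PCRE2 API; the helper only interprets an integer that is assumed to have come from pcre2_match().

```c
#include <stddef.h>

/* Hypothetical outcome categories for this sketch; not PCRE2 names. */
enum match_outcome {
  MATCH_FAILED_OR_ERROR,     /* negative return: no match or other error */
  MATCH_OVECTOR_TOO_SMALL,   /* zero return: vector of offsets too small */
  MATCH_OK                   /* positive return: a match was found */
};

/* Interpret the value returned by pcre2_match(). On MATCH_OK, and if
   highest_pair is not NULL, store the highest numbered capturing pair
   that was set (0 means only the overall match, with no captures). */
enum match_outcome classify_match_rc(int rc, int *highest_pair)
{
  if (rc < 0) return MATCH_FAILED_OR_ERROR;
  if (rc == 0) return MATCH_OVECTOR_TOO_SMALL;
  if (highest_pair != NULL) *highest_pair = rc - 1;
  return MATCH_OK;
}
```

A caller would normally go on to distinguish the individual negative codes (for example a failure to match from a partial match), which this sketch deliberately leaves out.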
For details of partial matching, see the
.\" HREF
diff --git a/doc/pcre2_match_data_free.3 b/doc/pcre2_match_data_free.3
index 5e4bc62..e22074b 100644
--- a/doc/pcre2_match_data_free.3
+++ b/doc/pcre2_match_data_free.3
@@ -1,4 +1,4 @@
-.TH PCRE2_MATCH_DATA_FREE 3 "24 October 2014" "PCRE2 10.00"
+.TH PCRE2_MATCH_DATA_FREE 3 "25 March 2017" "PCRE2 10.30"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH SYNOPSIS
@@ -14,8 +14,8 @@ PCRE2 - Perl-compatible regular expressions (revised API)
.rs
.sp
This function frees the memory occupied by a match data block, using the memory
-freeing function from the general context with which it was created, or
-\fBfree()\fP if that was not set.
+freeing function from the general context or compiled pattern with which it was
+created, or \fBfree()\fP if that was not set.
.P
There is a complete description of the PCRE2 native API in the
.\" HREF
diff --git a/doc/pcre2_pattern_convert.3 b/doc/pcre2_pattern_convert.3
new file mode 100644
index 0000000..b72acb7
--- /dev/null
+++ b/doc/pcre2_pattern_convert.3
@@ -0,0 +1,55 @@
+.TH PCRE2_PATTERN_CONVERT 3 "11 July 2017" "PCRE2 10.30"
+.SH NAME
+PCRE2 - Perl-compatible regular expressions (revised API)
+.SH SYNOPSIS
+.rs
+.sp
+.B #include <pcre2.h>
+.PP
+.nf
+.B int pcre2_pattern_convert(PCRE2_SPTR \fIpattern\fP, PCRE2_SIZE \fIlength\fP,
+.B " uint32_t \fIoptions\fP, PCRE2_UCHAR **\fIbuffer\fP,"
+.B " PCRE2_SIZE *\fIblength\fP, pcre2_convert_context *\fIcvcontext\fP);"
+.fi
+.
+.SH DESCRIPTION
+.rs
+.sp
+This function is part of an experimental set of pattern conversion functions.
+It converts a foreign pattern (for example, a glob) into a PCRE2 regular
+expression pattern. Its arguments are:
+.sp
+ \fIpattern\fP The foreign pattern
+ \fIlength\fP The length of the input pattern or PCRE2_ZERO_TERMINATED
+ \fIoptions\fP Option bits
+ \fIbuffer\fP Pointer to pointer to output buffer, or NULL
+ \fIblength\fP Pointer to output length field
+ \fIcvcontext\fP Pointer to a convert context or NULL
+.sp
+The length of the converted pattern (excluding the terminating zero) is
+returned via \fIblength\fP. If \fIbuffer\fP is NULL, the function just returns
+the output length. If \fIbuffer\fP points to a NULL pointer, heap memory is
+obtained for the converted pattern, using the allocator in the context if
+present (or else \fBmalloc()\fP), and the field pointed to by \fIbuffer\fP is
+updated. If \fIbuffer\fP points to a non-NULL field, that must point to a
+buffer whose size is in the variable pointed to by \fIblength\fP. This value is
+updated.
+.P
+The option bits are:
+.sp
+ PCRE2_CONVERT_UTF Input is UTF
+ PCRE2_CONVERT_NO_UTF_CHECK Do not check UTF validity
+ PCRE2_CONVERT_POSIX_BASIC Convert POSIX basic pattern
+ PCRE2_CONVERT_POSIX_EXTENDED Convert POSIX extended pattern
+ PCRE2_CONVERT_GLOB ) Convert
+ PCRE2_CONVERT_GLOB_NO_WILD_SEPARATOR ) various types
+ PCRE2_CONVERT_GLOB_NO_STARSTAR ) of glob
+.sp
+The return value from \fBpcre2_pattern_convert()\fP is zero on success or a
+non-zero PCRE2 error code.
+.P
+The pattern conversion functions are described in the
+.\" HREF
+\fBpcre2convert\fP
+.\"
+documentation.
diff --git a/doc/pcre2_pattern_info.3 b/doc/pcre2_pattern_info.3
index 575840b..64bfc45 100644
--- a/doc/pcre2_pattern_info.3
+++ b/doc/pcre2_pattern_info.3
@@ -1,4 +1,4 @@
-.TH PCRE2_PATTERN_INFO 3 "21 November 2015" "PCRE2 10.21"
+.TH PCRE2_PATTERN_INFO 3 "16 December 2017" "PCRE2 10.31"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH SYNOPSIS
@@ -15,7 +15,7 @@ PCRE2 - Perl-compatible regular expressions (revised API)
.sp
This function returns information about a compiled pattern. Its arguments are:
.sp
- \fIcode\fP Pointer to a compiled regular expression
+ \fIcode\fP Pointer to a compiled regular expression pattern
\fIwhat\fP What information is required
\fIwhere\fP Where to put the information
.sp
@@ -29,25 +29,38 @@ request are as follows:
PCRE2_BSR_UNICODE: Unicode line endings
PCRE2_BSR_ANYCRLF: CR, LF, or CRLF only
PCRE2_INFO_CAPTURECOUNT Number of capturing subpatterns
+.\" JOIN
+ PCRE2_INFO_DEPTHLIMIT Backtracking depth limit if set,
+ otherwise PCRE2_ERROR_UNSET
+ PCRE2_INFO_EXTRAOPTIONS Extra options that were passed in the
+ compile context
PCRE2_INFO_FIRSTBITMAP Bitmap of first code units, or NULL
PCRE2_INFO_FIRSTCODETYPE Type of start-of-match information
0 nothing set
1 first code unit is set
2 start of string or after newline
PCRE2_INFO_FIRSTCODEUNIT First code unit when type is 1
+ PCRE2_INFO_FRAMESIZE Size of backtracking frame
PCRE2_INFO_HASBACKSLASHC Return 1 if pattern contains \eC
+.\" JOIN
PCRE2_INFO_HASCRORLF Return 1 if explicit CR or LF matches
exist in the pattern
+.\" JOIN
+ PCRE2_INFO_HEAPLIMIT Heap memory limit if set,
+ otherwise PCRE2_ERROR_UNSET
PCRE2_INFO_JCHANGED Return 1 if (?J) or (?-J) was used
PCRE2_INFO_JITSIZE Size of JIT compiled code, or 0
PCRE2_INFO_LASTCODETYPE Type of must-be-present information
0 nothing set
1 code unit is set
PCRE2_INFO_LASTCODEUNIT Last code unit when type is 1
+.\" JOIN
PCRE2_INFO_MATCHEMPTY 1 if the pattern can match an
empty string, 0 otherwise
+.\" JOIN
PCRE2_INFO_MATCHLIMIT Match limit if set,
otherwise PCRE2_ERROR_UNSET
+.\" JOIN
PCRE2_INFO_MAXLOOKBEHIND Length (in characters) of the longest
lookbehind assertion
PCRE2_INFO_MINLENGTH Lower bound length of matching strings
@@ -60,8 +73,8 @@ request are as follows:
PCRE2_NEWLINE_CRLF
PCRE2_NEWLINE_ANY
PCRE2_NEWLINE_ANYCRLF
- PCRE2_INFO_RECURSIONLIMIT Recursion limit if set,
- otherwise PCRE2_ERROR_UNSET
+ PCRE2_NEWLINE_NUL
+ PCRE2_INFO_RECURSIONLIMIT Obsolete synonym for PCRE2_INFO_DEPTHLIMIT
PCRE2_INFO_SIZE Size of compiled pattern
.sp
If \fIwhere\fP is NULL, the function returns the amount of memory needed for
diff --git a/doc/pcre2_set_callout.3 b/doc/pcre2_set_callout.3
index 2f86f69..cb48e14 100644
--- a/doc/pcre2_set_callout.3
+++ b/doc/pcre2_set_callout.3
@@ -1,4 +1,4 @@
-.TH PCRE2_SET_CALLOUT 3 "24 October 2014" "PCRE2 10.00"
+.TH PCRE2_SET_CALLOUT 3 "21 March 2017" "PCRE2 10.30"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH SYNOPSIS
@@ -17,7 +17,7 @@ PCRE2 - Perl-compatible regular expressions (revised API)
.sp
This function sets the callout fields in a match context (the first argument).
The second argument specifies a callout function, and the third argument is an
-opaque data time that is passed to it. The result of this function is always
+opaque data item that is passed to it. The result of this function is always
zero.
.P
There is a complete description of the PCRE2 native API in the
diff --git a/doc/pcre2_set_compile_extra_options.3 b/doc/pcre2_set_compile_extra_options.3
new file mode 100644
index 0000000..1d73a8f
--- /dev/null
+++ b/doc/pcre2_set_compile_extra_options.3
@@ -0,0 +1,38 @@
+.TH PCRE2_SET_COMPILE_EXTRA_OPTIONS 3 "16 June 2017" "PCRE2 10.30"
+.SH NAME
+PCRE2 - Perl-compatible regular expressions (revised API)
+.SH SYNOPSIS
+.rs
+.sp
+.B #include <pcre2.h>
+.PP
+.nf
+.B int pcre2_set_compile_extra_options(pcre2_compile_context *\fIccontext\fP,
+.B " uint32_t \fIextra_options\fP);"
+.fi
+.
+.SH DESCRIPTION
+.rs
+.sp
+This function sets additional option bits for \fBpcre2_compile()\fP that are
+housed in a compile context. It completely replaces all the bits. The extra
+options are:
+.sp
+.\" JOIN
+ PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES Allow \ex{d800} to \ex{dfff}
+ in UTF-8 and UTF-32 modes
+.\" JOIN
+ PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL Treat all invalid escapes as
+ a literal following character
+ PCRE2_EXTRA_MATCH_LINE Pattern matches whole lines
+ PCRE2_EXTRA_MATCH_WORD Pattern matches "words"
+.sp
+There is a complete description of the PCRE2 native API in the
+.\" HREF
+\fBpcre2api\fP
+.\"
+page and a description of the POSIX API in the
+.\" HREF
+\fBpcre2posix\fP
+.\"
+page.
diff --git a/doc/pcre2_set_depth_limit.3 b/doc/pcre2_set_depth_limit.3
new file mode 100644
index 0000000..62bc7fe
--- /dev/null
+++ b/doc/pcre2_set_depth_limit.3
@@ -0,0 +1,28 @@
+.TH PCRE2_SET_DEPTH_LIMIT 3 "25 March 2017" "PCRE2 10.30"
+.SH NAME
+PCRE2 - Perl-compatible regular expressions (revised API)
+.SH SYNOPSIS
+.rs
+.sp
+.B #include <pcre2.h>
+.PP
+.nf
+.B int pcre2_set_depth_limit(pcre2_match_context *\fImcontext\fP,
+.B " uint32_t \fIvalue\fP);"
+.fi
+.
+.SH DESCRIPTION
+.rs
+.sp
+This function sets the backtracking depth limit field in a match context. The
+result is always zero.
+.P
+There is a complete description of the PCRE2 native API in the
+.\" HREF
+\fBpcre2api\fP
+.\"
+page and a description of the POSIX API in the
+.\" HREF
+\fBpcre2posix\fP
+.\"
+page.
diff --git a/doc/pcre2_set_glob_escape.3 b/doc/pcre2_set_glob_escape.3
new file mode 100644
index 0000000..d5637af
--- /dev/null
+++ b/doc/pcre2_set_glob_escape.3
@@ -0,0 +1,29 @@
+.TH PCRE2_SET_GLOB_ESCAPE 3 "11 July 2017" "PCRE2 10.30"
+.SH NAME
+PCRE2 - Perl-compatible regular expressions (revised API)
+.SH SYNOPSIS
+.rs
+.sp
+.B #include <pcre2.h>
+.PP
+.nf
+.B int pcre2_set_glob_escape(pcre2_convert_context *\fIcvcontext\fP,
+.B " uint32_t \fIescape_char\fP);"
+.fi
+.
+.SH DESCRIPTION
+.rs
+.sp
+This function is part of an experimental set of pattern conversion functions.
+It sets the escape character that is used when converting globs. The second
+argument must either be zero (meaning there is no escape character) or a
+punctuation character whose code point is less than 256. The default is grave
+accent if running under Windows, otherwise backslash. The result of the
+function is zero for success or PCRE2_ERROR_BADDATA if the second argument is
+invalid.
+.P
+The pattern conversion functions are described in the
+.\" HREF
+\fBpcre2convert\fP
+.\"
+documentation.
diff --git a/doc/pcre2_set_glob_separator.3 b/doc/pcre2_set_glob_separator.3
new file mode 100644
index 0000000..273b515
--- /dev/null
+++ b/doc/pcre2_set_glob_separator.3
@@ -0,0 +1,28 @@
+.TH PCRE2_SET_GLOB_SEPARATOR 3 "11 July 2017" "PCRE2 10.30"
+.SH NAME
+PCRE2 - Perl-compatible regular expressions (revised API)
+.SH SYNOPSIS
+.rs
+.sp
+.B #include <pcre2.h>
+.PP
+.nf
+.B int pcre2_set_glob_separator(pcre2_convert_context *\fIcvcontext\fP,
+.B " uint32_t \fIseparator_char\fP);"
+.fi
+.
+.SH DESCRIPTION
+.rs
+.sp
+This function is part of an experimental set of pattern conversion functions.
+It sets the component separator character that is used when converting globs.
+The second argument must be one of the characters forward slash, backslash, or
+dot. The default is backslash when running under Windows, otherwise forward
+slash. The result of the function is zero for success or PCRE2_ERROR_BADDATA if
+the second argument is invalid.
+.P
+The pattern conversion functions are described in the
+.\" HREF
+\fBpcre2convert\fP
+.\"
+documentation.
diff --git a/doc/pcre2_set_heap_limit.3 b/doc/pcre2_set_heap_limit.3
new file mode 100644
index 0000000..a99b4ab
--- /dev/null
+++ b/doc/pcre2_set_heap_limit.3
@@ -0,0 +1,28 @@
+.TH PCRE2_SET_HEAP_LIMIT 3 "11 April 2017" "PCRE2 10.30"
+.SH NAME
+PCRE2 - Perl-compatible regular expressions (revised API)
+.SH SYNOPSIS
+.rs
+.sp
+.B #include <pcre2.h>
+.PP
+.nf
+.B int pcre2_set_heap_limit(pcre2_match_context *\fImcontext\fP,
+.B " uint32_t \fIvalue\fP);"
+.fi
+.
+.SH DESCRIPTION
+.rs
+.sp
+This function sets the backtracking heap limit field in a match context. The
+result is always zero.
+.P
+There is a complete description of the PCRE2 native API in the
+.\" HREF
+\fBpcre2api\fP
+.\"
+page and a description of the POSIX API in the
+.\" HREF
+\fBpcre2posix\fP
+.\"
+page.
diff --git a/doc/pcre2_set_max_pattern_length.3 b/doc/pcre2_set_max_pattern_length.3
new file mode 100644
index 0000000..7aa01c7
--- /dev/null
+++ b/doc/pcre2_set_max_pattern_length.3
@@ -0,0 +1,31 @@
+.TH PCRE2_SET_MAX_PATTERN_LENGTH 3 "05 October 2016" "PCRE2 10.23"
+.SH NAME
+PCRE2 - Perl-compatible regular expressions (revised API)
+.SH SYNOPSIS
+.rs
+.sp
+.B #include <pcre2.h>
+.PP
+.nf
+.B int pcre2_set_max_pattern_length(pcre2_compile_context *\fIccontext\fP,
+.B " PCRE2_SIZE \fIvalue\fP);"
+.fi
+.
+.SH DESCRIPTION
+.rs
+.sp
+This function sets, in a compile context, the maximum text length (in code
+units) of the pattern that can be compiled. The result is always zero. If a
+longer pattern is passed to \fBpcre2_compile()\fP there is an immediate error
+return. The default is effectively unlimited, being the largest value a
+PCRE2_SIZE variable can hold.
+.P
+There is a complete description of the PCRE2 native API in the
+.\" HREF
+\fBpcre2api\fP
+.\"
+page and a description of the POSIX API in the
+.\" HREF
+\fBpcre2posix\fP
+.\"
+page.
diff --git a/doc/pcre2_set_newline.3 b/doc/pcre2_set_newline.3
index 8237500..0bccfc7 100644
--- a/doc/pcre2_set_newline.3
+++ b/doc/pcre2_set_newline.3
@@ -1,4 +1,4 @@
-.TH PCRE2_SET_NEWLINE 3 "22 October 2014" "PCRE2 10.00"
+.TH PCRE2_SET_NEWLINE 3 "26 May 2017" "PCRE2 10.30"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH SYNOPSIS
@@ -23,6 +23,7 @@ matching patterns. The second argument must be one of:
PCRE2_NEWLINE_CRLF CR followed by LF only
PCRE2_NEWLINE_ANYCRLF Any of the above
PCRE2_NEWLINE_ANY Any Unicode newline sequence
+ PCRE2_NEWLINE_NUL The NUL character (binary zero)
.sp
The result is zero for success or PCRE2_ERROR_BADDATA if the second argument is
invalid.
diff --git a/doc/pcre2_set_recursion_limit.3 b/doc/pcre2_set_recursion_limit.3
index ab1f3cd..26f4257 100644
--- a/doc/pcre2_set_recursion_limit.3
+++ b/doc/pcre2_set_recursion_limit.3
@@ -1,4 +1,4 @@
-.TH PCRE2_SET_RECURSION_LIMIT 3 "24 October 2014" "PCRE2 10.00"
+.TH PCRE2_SET_RECURSION_LIMIT 3 "25 March 2017" "PCRE2 10.30"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH SYNOPSIS
@@ -14,8 +14,8 @@ PCRE2 - Perl-compatible regular expressions (revised API)
.SH DESCRIPTION
.rs
.sp
-This function sets the recursion limit field in a match context. The result is
-always zero.
+This function is obsolete and should not be used in new code. Use
+\fBpcre2_set_depth_limit()\fP instead.
.P
There is a complete description of the PCRE2 native API in the
.\" HREF
diff --git a/doc/pcre2_set_recursion_memory_management.3 b/doc/pcre2_set_recursion_memory_management.3
index 9b5887a..12f175d 100644
--- a/doc/pcre2_set_recursion_memory_management.3
+++ b/doc/pcre2_set_recursion_memory_management.3
@@ -1,4 +1,4 @@
-.TH PCRE2_SET_RECURSION_MEMORY_MANAGEMENT 3 "24 October 2014" "PCRE2 10.00"
+.TH PCRE2_SET_RECURSION_MEMORY_MANAGEMENT 3 "25 March 2017" "PCRE2 10.30"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH SYNOPSIS
@@ -16,13 +16,8 @@ PCRE2 - Perl-compatible regular expressions (revised API)
.SH DESCRIPTION
.rs
.sp
-This function sets the match context fields for custom memory management when
-PCRE2 is compiled to use the heap instead of the system stack for recursive
-function calls while matching. When PCRE2 is compiled to use the stack (the
-default) this function does nothing. The first argument is a match context, the
-second and third specify the memory allocation and freeing functions, and the
-final argument is an opaque value that is passed to them whenever they are
-called. The result of this function is always zero.
+From release 10.30 onwards, this function is obsolete and does nothing. The
+result is always zero.
.P
There is a complete description of the PCRE2 native API in the
.\" HREF
diff --git a/doc/pcre2_substitute.3 b/doc/pcre2_substitute.3
index e69e0cc..7da668c 100644
--- a/doc/pcre2_substitute.3
+++ b/doc/pcre2_substitute.3
@@ -1,4 +1,4 @@
-.TH PCRE2_SUBSTITUTE 3 "12 December 2015" "PCRE2 10.21"
+.TH PCRE2_SUBSTITUTE 3 "04 April 2017" "PCRE2 10.30"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH SYNOPSIS
@@ -35,24 +35,32 @@ Its arguments are:
\fIoutputbuffer\fP Points to the output buffer
\fIoutlengthptr\fP Points to the length of the output buffer
.sp
-A match context is needed only if you want to:
+A match data block is needed only if you want to inspect the data from the
+match that is returned in that block. A match context is needed only if you
+want to:
.sp
Set up a callout function
- Change the limit for calling the internal function \fImatch()\fP
- Change the limit for calling \fImatch()\fP recursively
- Set custom memory management when the heap is used for recursion
+ Set a matching offset limit
+ Change the backtracking match limit
+ Change the backtracking depth limit
+ Set custom memory management in the match context
.sp
The \fIlength\fP, \fIstartoffset\fP and \fIrlength\fP values are code
units, not characters, as is the contents of the variable pointed at by
\fIoutlengthptr\fP, which is updated to the actual length of the new string.
-The options are:
+The subject and replacement lengths can be given as PCRE2_ZERO_TERMINATED for
+zero-terminated strings. The options are:
.sp
PCRE2_ANCHORED Match only at the first position
+ PCRE2_ENDANCHORED Pattern can match only at end of subject
PCRE2_NOTBOL Subject is not the beginning of a line
PCRE2_NOTEOL Subject is not the end of a line
PCRE2_NOTEMPTY An empty string is not a valid match
+.\" JOIN
PCRE2_NOTEMPTY_ATSTART An empty string at the start of the
subject is not a valid match
+ PCRE2_NO_JIT Do not use JIT matching
+.\" JOIN
PCRE2_NO_UTF_CHECK Do not check the subject or replacement
for UTF validity (only relevant if
PCRE2_UTF was set at compile time)
diff --git a/doc/pcre2api.3 b/doc/pcre2api.3
index db61ea0..786b314 100644
--- a/doc/pcre2api.3
+++ b/doc/pcre2api.3
@@ -1,11 +1,11 @@
-.TH PCRE2API 3 "17 June 2016" "PCRE2 10.22"
+.TH PCRE2API 3 "31 December 2017" "PCRE2 10.31"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.sp
.B #include <pcre2.h>
.sp
-PCRE2 is a new API for PCRE. This document contains a description of all its
-functions. See the
+PCRE2 is a new API for PCRE, starting at release 10.0. This document contains a
+description of all its native functions. See the
.\" HREF
\fBpcre2\fP
.\"
@@ -90,6 +90,9 @@ document for an overview of all the PCRE2 documentation.
.B int pcre2_set_character_tables(pcre2_compile_context *\fIccontext\fP,
.B " const unsigned char *\fItables\fP);"
.sp
+.B int pcre2_set_compile_extra_options(pcre2_compile_context *\fIccontext\fP,
+.B " uint32_t \fIextra_options\fP);"
+.sp
.B int pcre2_set_max_pattern_length(pcre2_compile_context *\fIccontext\fP,
.B " PCRE2_SIZE \fIvalue\fP);"
.sp
@@ -120,19 +123,17 @@ document for an overview of all the PCRE2 documentation.
.B " int (*\fIcallout_function\fP)(pcre2_callout_block *, void *),"
.B " void *\fIcallout_data\fP);"
.sp
-.B int pcre2_set_match_limit(pcre2_match_context *\fImcontext\fP,
-.B " uint32_t \fIvalue\fP);"
-.sp
.B int pcre2_set_offset_limit(pcre2_match_context *\fImcontext\fP,
.B " PCRE2_SIZE \fIvalue\fP);"
.sp
-.B int pcre2_set_recursion_limit(pcre2_match_context *\fImcontext\fP,
+.B int pcre2_set_heap_limit(pcre2_match_context *\fImcontext\fP,
.B " uint32_t \fIvalue\fP);"
.sp
-.B int pcre2_set_recursion_memory_management(
-.B " pcre2_match_context *\fImcontext\fP,"
-.B " void *(*\fIprivate_malloc\fP)(PCRE2_SIZE, void *),"
-.B " void (*\fIprivate_free\fP)(void *, void *), void *\fImemory_data\fP);"
+.B int pcre2_set_match_limit(pcre2_match_context *\fImcontext\fP,
+.B " uint32_t \fIvalue\fP);"
+.sp
+.B int pcre2_set_depth_limit(pcre2_match_context *\fImcontext\fP,
+.B " uint32_t \fIvalue\fP);"
.fi
.
.
@@ -235,6 +236,8 @@ document for an overview of all the PCRE2 documentation.
.nf
.B pcre2_code *pcre2_code_copy(const pcre2_code *\fIcode\fP);
.sp
+.B pcre2_code *pcre2_code_copy_with_tables(const pcre2_code *\fIcode\fP);
+.sp
.B int pcre2_get_error_message(int \fIerrorcode\fP, PCRE2_UCHAR *\fIbuffer\fP,
.B " PCRE2_SIZE \fIbufflen\fP);"
.sp
@@ -250,6 +253,60 @@ document for an overview of all the PCRE2 documentation.
.fi
.
.
+.SH "PCRE2 NATIVE API OBSOLETE FUNCTIONS"
+.rs
+.sp
+.nf
+.B int pcre2_set_recursion_limit(pcre2_match_context *\fImcontext\fP,
+.B " uint32_t \fIvalue\fP);"
+.sp
+.B int pcre2_set_recursion_memory_management(
+.B " pcre2_match_context *\fImcontext\fP,"
+.B " void *(*\fIprivate_malloc\fP)(PCRE2_SIZE, void *),"
+.B " void (*\fIprivate_free\fP)(void *, void *), void *\fImemory_data\fP);"
+.fi
+.sp
+These functions became obsolete at release 10.30 and are retained only for
+backward compatibility. They should not be used in new code. The first is
+replaced by \fBpcre2_set_depth_limit()\fP; the second is no longer needed and
+has no effect (it always returns zero).
+.
+.
+.SH "PCRE2 EXPERIMENTAL PATTERN CONVERSION FUNCTIONS"
+.rs
+.sp
+.nf
+.B pcre2_convert_context *pcre2_convert_context_create(
+.B " pcre2_general_context *\fIgcontext\fP);"
+.sp
+.B pcre2_convert_context *pcre2_convert_context_copy(
+.B " pcre2_convert_context *\fIcvcontext\fP);"
+.sp
+.B void pcre2_convert_context_free(pcre2_convert_context *\fIcvcontext\fP);
+.sp
+.B int pcre2_set_glob_escape(pcre2_convert_context *\fIcvcontext\fP,
+.B " uint32_t \fIescape_char\fP);"
+.sp
+.B int pcre2_set_glob_separator(pcre2_convert_context *\fIcvcontext\fP,
+.B " uint32_t \fIseparator_char\fP);"
+.sp
+.B int pcre2_pattern_convert(PCRE2_SPTR \fIpattern\fP, PCRE2_SIZE \fIlength\fP,
+.B " uint32_t \fIoptions\fP, PCRE2_UCHAR **\fIbuffer\fP,"
+.B " PCRE2_SIZE *\fIblength\fP, pcre2_convert_context *\fIcvcontext\fP);"
+.sp
+.B void pcre2_converted_pattern_free(PCRE2_UCHAR *\fIconverted_pattern\fP);
+.fi
+.sp
+These functions provide a way of converting non-PCRE2 patterns into
+patterns that can be processed by \fBpcre2_compile()\fP. This facility is
+experimental and may be changed in future releases. At present, "globs" and
+POSIX basic and extended patterns can be converted. Details are given in the
+.\" HREF
+\fBpcre2convert\fP
+.\"
+documentation.
+.
+.
.SH "PCRE2 8-BIT, 16-BIT, AND 32-BIT LIBRARIES"
.rs
.sp
@@ -300,11 +357,11 @@ When using multiple libraries in an application, you must take care when
processing any particular pattern to use only functions from a single library.
For example, if you want to run a match using a pattern that was compiled with
\fBpcre2_compile_16()\fP, you must do so with \fBpcre2_match_16()\fP, not
-\fBpcre2_match_8()\fP.
+\fBpcre2_match_8()\fP or \fBpcre2_match_32()\fP.
.P
In the function summaries above, and in the rest of this document and other
PCRE2 documents, functions and data types are described using their generic
-names, without the 8, 16, or 32 suffix.
+names, without the _8, _16, or _32 suffix.
.
.
.SH "PCRE2 API OVERVIEW"
@@ -313,23 +370,23 @@ names, without the 8, 16, or 32 suffix.
PCRE2 has its own native API, which is described in this document. There are
also some wrapper functions for the 8-bit library that correspond to the
POSIX regular expression API, but they do not give access to all the
-functionality. They are described in the
+functionality of PCRE2. They are described in the
.\" HREF
\fBpcre2posix\fP
.\"
documentation. Both these APIs define a set of C function calls.
.P
The native API C data types, function prototypes, option values, and error
-codes are defined in the header file \fBpcre2.h\fP, which contains definitions
-of PCRE2_MAJOR and PCRE2_MINOR, the major and minor release numbers for the
-library. Applications can use these to include support for different releases
-of PCRE2.
+codes are defined in the header file \fBpcre2.h\fP, which also contains
+definitions of PCRE2_MAJOR and PCRE2_MINOR, the major and minor release numbers
+for the library. Applications can use these to include support for different
+releases of PCRE2.
.P
In a Windows environment, if you want to statically link an application program
against a non-dll PCRE2 library, you must define PCRE2_STATIC before including
\fBpcre2.h\fP.
.P
-The functions \fBpcre2_compile()\fP, and \fBpcre2_match()\fP are used for
+The functions \fBpcre2_compile()\fP and \fBpcre2_match()\fP are used for
compiling and matching regular expressions in a Perl-compatible manner. A
sample program that demonstrates the simplest way of using them is provided in
the file called \fIpcre2demo.c\fP in the PCRE2 source distribution. A listing
@@ -343,10 +400,16 @@ documentation, and the
.\"
documentation describes how to compile and run it.
.P
-Just-in-time compiler support is an optional feature of PCRE2 that can be built
-in appropriate hardware environments. It greatly speeds up the matching
+The compiling and matching functions recognize various options that are passed
+as bits in an options argument. There are also some more complicated parameters
+such as custom memory management functions and resource limits that are passed
+in "contexts" (which are just memory blocks, described below). Simple
+applications do not need to make use of contexts.
+.P
+Just-in-time (JIT) compiler support is an optional feature of PCRE2 that can be
+built in appropriate hardware environments. It greatly speeds up the matching
performance of many patterns. Programs can request that it be used if
-available, by calling \fBpcre2_jit_compile()\fP after a pattern has been
+available by calling \fBpcre2_jit_compile()\fP after a pattern has been
successfully compiled by \fBpcre2_compile()\fP. This does nothing if JIT
support is not available.
.P
@@ -356,8 +419,8 @@ More complicated programs might need to make use of the specialist functions
.P
JIT matching is automatically used by \fBpcre2_match()\fP if it is available,
unless the PCRE2_NO_JIT option is set. There is also a direct interface for JIT
-matching, which gives improved performance. The JIT-specific functions are
-discussed in the
+matching, which gives improved performance at the expense of less sanity
+checking. The JIT-specific functions are discussed in the
.\" HREF
\fBpcre2jit\fP
.\"
@@ -367,7 +430,7 @@ A second matching function, \fBpcre2_dfa_match()\fP, which is not
Perl-compatible, is also provided. This uses a different algorithm for the
matching. The alternative algorithm finds all possible matches (at a given
point in the subject), and scans the subject just once (unless there are
-lookbehind assertions). However, this algorithm does not return captured
+lookaround assertions). However, this algorithm does not return captured
substrings. A description of the two matching algorithms and their advantages
and disadvantages is given in the
.\" HREF
@@ -390,7 +453,7 @@ been matched by \fBpcre2_match()\fP. They are:
\fBpcre2_substring_number_from_name()\fP
.sp
\fBpcre2_substring_free()\fP and \fBpcre2_substring_list_free()\fP are also
-provided, to free the memory used for extracted strings.
+provided, to free memory used for extracted strings.
.P
The function \fBpcre2_substitute()\fP can be called to match a pattern and
return a copy of the subject string with substitutions for parts that were
@@ -482,8 +545,8 @@ and does not change when the pattern is matched. Therefore, it is thread-safe,
that is, the same compiled pattern can be used by more than one thread
simultaneously. For example, an application can compile all its patterns at the
start, before forking off multiple threads that use them. However, if the
-just-in-time optimization feature is being used, it needs separate memory stack
-areas for each thread. See the
+just-in-time (JIT) optimization feature is being used, it needs separate memory
+stack areas for each thread. See the
.\" HREF
\fBpcre2jit\fP
.\"
@@ -509,8 +572,9 @@ If JIT is being used, but the JIT compilation is not being done immediately,
(perhaps waiting to see if the pattern is used often enough) similar logic is
required. JIT compilation updates a pointer within the compiled code block, so
a thread must gain unique write access to the pointer before calling
-\fBpcre2_jit_compile()\fP. Alternatively, \fBpcre2_code_copy()\fP can be used
-to obtain a private copy of the compiled code.
+\fBpcre2_jit_compile()\fP. Alternatively, \fBpcre2_code_copy()\fP or
+\fBpcre2_code_copy_with_tables()\fP can be used to obtain a private copy of the
+compiled code before calling the JIT compiler.
.
.
.SS "Context blocks"
@@ -533,10 +597,10 @@ thread-specific copy.
.SS "Match blocks"
.rs
.sp
-The matching functions need a block of memory for working space and for storing
-the results of a match. This includes details of what was matched, as well as
-additional information such as the name of a (*MARK) setting. Each thread must
-provide its own copy of this memory.
+The matching functions need a block of memory for storing the results of a
+match. This includes details of what was matched, as well as additional
+information such as the name of a (*MARK) setting. Each thread must provide its
+own copy of this memory.
.
.
.SH "PCRE2 CONTEXTS"
@@ -608,15 +672,16 @@ The memory used for a general context should be freed by calling:
.SS "The compile context"
.rs
.sp
-A compile context is required if you want to change the default values of any
-of the following compile-time parameters:
+A compile context is required if you want to provide an external function for
+stack checking during compilation or to change the default values of any of the
+following compile-time parameters:
.sp
What \eR matches (Unicode newlines or CR, LF, CRLF only)
PCRE2's character tables
The newline character sequence
The compile time nested parentheses limit
The maximum length of the pattern string
- An external function for stack checking
+ The extra options bits (none set by default)
.sp
A compile context is also required if you are using custom memory management.
If none of these apply, just pass NULL as the context argument of
@@ -659,15 +724,32 @@ argument is a general context. This function builds a set of character tables
in the current locale.
.sp
.nf
+.B int pcre2_set_compile_extra_options(pcre2_compile_context *\fIccontext\fP,
+.B " uint32_t \fIextra_options\fP);"
+.fi
+.sp
+As PCRE2 has developed, almost all the 32 option bits that are available in
+the \fIoptions\fP argument of \fBpcre2_compile()\fP have been used up. To avoid
+running out, the compile context contains a set of extra option bits which are
+used for some newer, assumed rarer, options. This function sets those bits. It
+always sets all the bits (either on or off), replacing any existing setting
+rather than merging with it. The available options are defined in the section
+entitled "Extra compile options"
+.\" HTML <a href="#extracompileoptions">
+.\" </a>
+below.
+.\"
+.sp
+.nf
.B int pcre2_set_max_pattern_length(pcre2_compile_context *\fIccontext\fP,
.B " PCRE2_SIZE \fIvalue\fP);"
.fi
.sp
-This sets a maximum length, in code units, for the pattern string that is to be
-compiled. If the pattern is longer, an error is generated. This facility is
-provided so that applications that accept patterns from external sources can
-limit their size. The default is the largest number that a PCRE2_SIZE variable
-can hold, which is effectively unlimited.
+This sets a maximum length, in code units, for any pattern string that is
+compiled with this context. If the pattern is longer, an error is generated.
+This facility is provided so that applications that accept patterns from
+external sources can limit their size. The default is the largest number that a
+PCRE2_SIZE variable can hold, which is effectively unlimited.
.sp
.nf
.B int pcre2_set_newline(pcre2_compile_context *\fIccontext\fP,
@@ -677,14 +759,22 @@ can hold, which is effectively unlimited.
This specifies which characters or character sequences are to be recognized as
newlines. The value must be one of PCRE2_NEWLINE_CR (carriage return only),
PCRE2_NEWLINE_LF (linefeed only), PCRE2_NEWLINE_CRLF (the two-character
-sequence CR followed by LF), PCRE2_NEWLINE_ANYCRLF (any of the above), or
-PCRE2_NEWLINE_ANY (any Unicode newline sequence).
+sequence CR followed by LF), PCRE2_NEWLINE_ANYCRLF (any of the above),
+PCRE2_NEWLINE_ANY (any Unicode newline sequence), or PCRE2_NEWLINE_NUL (the
+NUL character, that is a binary zero).
.P
-When a pattern is compiled with the PCRE2_EXTENDED option, the value of this
-parameter affects the recognition of white space and the end of internal
-comments starting with #. The value is saved with the compiled pattern for
-subsequent use by the JIT compiler and by the two interpreted matching
-functions, \fIpcre2_match()\fP and \fIpcre2_dfa_match()\fP.
+A pattern can override the value set in the compile context by starting with a
+sequence such as (*CRLF). See the
+.\" HREF
+\fBpcre2pattern\fP
+.\"
+page for details.
+.P
+When a pattern is compiled with the PCRE2_EXTENDED or PCRE2_EXTENDED_MORE
+option, the newline convention affects the recognition of white space and the
+end of internal comments starting with #. The value is saved with the compiled
+pattern for subsequent use by the JIT compiler and by the two interpreted
+matching functions, \fIpcre2_match()\fP and \fIpcre2_dfa_match()\fP.
.sp
.nf
.B int pcre2_set_parens_nest_limit(pcre2_compile_context *\fIccontext\fP,
@@ -693,7 +783,8 @@ functions, \fIpcre2_match()\fP and \fIpcre2_dfa_match()\fP.
.sp
This parameter adjusts the limit, set when PCRE2 is built (default 250), on the
depth of parenthesis nesting in a pattern. This limit stops rogue patterns
-using up too much system stack when being compiled.
+using up too much system stack when being compiled. The limit applies to
+parentheses of all kinds, not just capturing parentheses.
.sp
.nf
.B int pcre2_set_compile_recursion_guard(pcre2_compile_context *\fIccontext\fP,
@@ -703,10 +794,10 @@ using up too much system stack when being compiled.
There is at least one application that runs PCRE2 in threads with very limited
system stack, where running out of stack is to be avoided at all costs. The
parenthesis limit above cannot take account of how much stack is actually
-available. For a finer control, you can supply a function that is called
-whenever \fBpcre2_compile()\fP starts to compile a parenthesized part of a
-pattern. This function can check the actual stack size (or anything else that
-it wants to, of course).
+available during compilation. For a finer control, you can supply a function
+that is called whenever \fBpcre2_compile()\fP starts to compile a parenthesized
+part of a pattern. This function can check the actual stack size (or anything
+else that it wants to, of course).
.P
The first argument to the callout function gives the current depth of
nesting, and the second is user data that is set up by the last argument of
@@ -718,15 +809,15 @@ zero if all is well, or non-zero to force an error.
.SS "The match context"
.rs
.sp
-A match context is required if you want to change the default values of any
-of the following match-time parameters:
+A match context is required if you want to:
.sp
- A callout function
- The offset limit for matching an unanchored pattern
- The limit for calling \fBmatch()\fP (see below)
- The limit for calling \fBmatch()\fP recursively
+ Set up a callout function
+ Set an offset limit for matching an unanchored pattern
+ Change the limit on the amount of heap used when matching
+ Change the backtracking match limit
+ Change the backtracking depth limit
+ Set custom memory management specifically for the match
.sp
-A match context is also required if you are using custom memory management.
If none of these apply, just pass NULL as the context argument of
\fBpcre2_match()\fP, \fBpcre2_dfa_match()\fP, or \fBpcre2_jit_match()\fP.
.P
@@ -752,7 +843,7 @@ PCRE2_ERROR_BADDATA if invalid data is detected.
.B " void *\fIcallout_data\fP);"
.fi
.sp
-This sets up a "callout" function, which PCRE2 will call at specified points
+This sets up a "callout" function for PCRE2 to call at specified points
during a matching operation. Details are given in the
.\" HREF
\fBpcre2callout\fP
@@ -768,22 +859,61 @@ The \fIoffset_limit\fP parameter limits how far an unanchored search can
advance in the subject string. The default value is PCRE2_UNSET. The
\fBpcre2_match()\fP and \fBpcre2_dfa_match()\fP functions return
PCRE2_ERROR_NOMATCH if a match with a starting point before or at the given
-offset is not found. For example, if the pattern /abc/ is matched against
-"123abc" with an offset limit less than 3, the result is PCRE2_ERROR_NO_MATCH.
-A match can never be found if the \fIstartoffset\fP argument of
-\fBpcre2_match()\fP or \fBpcre2_dfa_match()\fP is greater than the offset
-limit.
-.P
-When using this facility, you must set PCRE2_USE_OFFSET_LIMIT when calling
-\fBpcre2_compile()\fP so that when JIT is in use, different code can be
+offset is not found. The \fBpcre2_substitute()\fP function makes no more
+substitutions.
+.P
+For example, if the pattern /abc/ is matched against "123abc" with an offset
+limit less than 3, the result is PCRE2_ERROR_NOMATCH. A match can never be
+found if the \fIstartoffset\fP argument of \fBpcre2_match()\fP,
+\fBpcre2_dfa_match()\fP, or \fBpcre2_substitute()\fP is greater than the offset
+limit set in the match context.
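The offset limit rule above can be modelled with a small sketch. This is a toy model, not PCRE2 code: it searches for a literal string instead of a pattern, and the function name find_with_offset_limit is hypothetical. It only illustrates that a start position is tried only while it is at or before the limit.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy model only (hypothetical helper, literal search instead of a regex).
   Returns the start of the first occurrence of needle whose start position
   is >= start_offset and <= offset_limit, or -1 for "no match". */
long find_with_offset_limit(const char *subject, const char *needle,
                            size_t start_offset, size_t offset_limit)
{
    assert(subject != NULL && needle != NULL);
    size_t slen = strlen(subject);
    size_t nlen = strlen(needle);

    /* A start position beyond the offset limit is never tried. */
    for (size_t pos = start_offset;
         pos <= offset_limit && pos + nlen <= slen; pos++) {
        if (memcmp(subject + pos, needle, nlen) == 0)
            return (long)pos;
    }
    return -1;
}
```

With the subject "123abc" and the literal "abc", a limit of 2 gives no match, a limit of 3 finds the match at offset 3, and a start offset greater than the limit can never succeed.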
+.P
+When using this facility, you must set the PCRE2_USE_OFFSET_LIMIT option when
+calling \fBpcre2_compile()\fP so that when JIT is in use, different code can be
compiled. If a match is started with a non-default offset limit when
PCRE2_USE_OFFSET_LIMIT is not set, an error is generated.
.P
The offset limit facility can be used to track progress when searching large
-subject strings. See also the PCRE2_FIRSTLINE option, which requires a match to
-start within the first line of the subject. If this is set with an offset
-limit, a match must occur in the first line and also within the offset limit.
-In other words, whichever limit comes first is used.
+subject strings or to limit the extent of global substitutions. See also the
+PCRE2_FIRSTLINE option, which requires a match to start before or at the first
+newline that follows the start of matching in the subject. If this is set with
+an offset limit, a match must occur in the first line and also within the
+offset limit. In other words, whichever limit comes first is used.
+.sp
+.nf
+.B int pcre2_set_heap_limit(pcre2_match_context *\fImcontext\fP,
+.B " uint32_t \fIvalue\fP);"
+.fi
+.sp
+The \fIheap_limit\fP parameter specifies, in units of kilobytes, the maximum
+amount of heap memory that \fBpcre2_match()\fP may use to hold backtracking
+information when running an interpretive match. This limit does not apply to
+matching with the JIT optimization, which has its own memory control
+arrangements (see the
+.\" HREF
+\fBpcre2jit\fP
+.\"
+documentation for more details), nor does it apply to \fBpcre2_dfa_match()\fP.
+If the limit is reached, the negative error code PCRE2_ERROR_HEAPLIMIT is
+returned. The default limit is set when PCRE2 is built; the default default is
+very large and is essentially "unlimited".
+.P
+A value for the heap limit may also be supplied by an item at the start of a
+pattern of the form
+.sp
+ (*LIMIT_HEAP=ddd)
+.sp
+where ddd is a decimal number. However, such a setting is ignored unless ddd is
+less than the limit set by the caller of \fBpcre2_match()\fP or, if no such
+limit is set, less than the default.
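The rule for a limit given in the pattern can be read as "the pattern may lower the limit but never raise it". A minimal sketch of that rule follows, assuming nothing about PCRE2's internals; effective_limit and its parameters are hypothetical names used only for illustration.

```c
#include <stdint.h>

/* Hypothetical helper, not part of PCRE2.
   build_default     : the default chosen when the library was built
   caller_limit_set  : nonzero if the application set a limit in a match context
   pattern_limit_set : nonzero if the pattern starts with (*LIMIT_...=ddd) */
uint32_t effective_limit(uint32_t build_default,
                         int caller_limit_set, uint32_t caller_limit,
                         int pattern_limit_set, uint32_t pattern_limit)
{
    /* The caller's setting, if any, replaces the build-time default. */
    uint32_t limit = caller_limit_set ? caller_limit : build_default;

    /* A value in the pattern is honoured only if it is smaller. */
    if (pattern_limit_set && pattern_limit < limit)
        limit = pattern_limit;
    return limit;
}
```

The same wording is used in this text for the match limit and the depth limit, so the sketch applies to those settings as well.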
+.P
+The \fBpcre2_match()\fP function starts out using a 20K vector on the system
+stack for recording backtracking points. The more nested backtracking points
+there are (that is, the deeper the search tree), the more memory is needed.
+Heap memory is used only if the initial vector is too small. If the heap limit
+is set to a value less than 21 (in particular, zero) no heap memory will be
+used. In this case, only patterns that do not have a lot of nested backtracking
+can be successfully processed.
.sp
.nf
.B int pcre2_set_match_limit(pcre2_match_context *\fImcontext\fP,
@@ -791,17 +921,17 @@ In other words, whichever limit comes first is used.
.fi
.sp
The \fImatch_limit\fP parameter provides a means of preventing PCRE2 from using
-up too many resources when processing patterns that are not going to match, but
-which have a very large number of possibilities in their search trees. The
-classic example is a pattern that uses nested unlimited repeats.
-.P
-Internally, \fBpcre2_match()\fP uses a function called \fBmatch()\fP, which it
-calls repeatedly (sometimes recursively). The limit set by \fImatch_limit\fP is
-imposed on the number of times this function is called during a match, which
-has the effect of limiting the amount of backtracking that can take place. For
+up too many computing resources when processing patterns that are not going to
+match, but which have a very large number of possibilities in their search
+trees. The classic example is a pattern that uses nested unlimited repeats.
+.P
+There is an internal counter in \fBpcre2_match()\fP that is incremented each
+time round its main matching loop. If this value reaches the match limit,
+\fBpcre2_match()\fP returns the negative value PCRE2_ERROR_MATCHLIMIT. This has
+the effect of limiting the amount of backtracking that can take place. For
patterns that are not anchored, the count restarts from zero for each position
-in the subject string. This limit is not relevant to \fBpcre2_dfa_match()\fP,
-which ignores it.
+in the subject string. This limit also applies to \fBpcre2_dfa_match()\fP,
+though the counting is done in a different way.
.P
When \fBpcre2_match()\fP is called with a pattern that was successfully
processed by \fBpcre2_jit_compile()\fP, the way in which matching is executed
@@ -811,75 +941,49 @@ is also used in this case (but in a different way) to limit how long the
matching can continue.
.P
The default value for the limit can be set when PCRE2 is built; the default
-default is 10 million, which handles all but the most extreme cases. If the
-limit is exceeded, \fBpcre2_match()\fP returns PCRE2_ERROR_MATCHLIMIT. A value
+default is 10 million, which handles all but the most extreme cases. A value
for the match limit may also be supplied by an item at the start of a pattern
of the form
.sp
(*LIMIT_MATCH=ddd)
.sp
where ddd is a decimal number. However, such a setting is ignored unless ddd is
-less than the limit set by the caller of \fBpcre2_match()\fP or, if no such
-limit is set, less than the default.
+less than the limit set by the caller of \fBpcre2_match()\fP or
+\fBpcre2_dfa_match()\fP or, if no such limit is set, less than the default.
.sp
.nf
-.B int pcre2_set_recursion_limit(pcre2_match_context *\fImcontext\fP,
+.B int pcre2_set_depth_limit(pcre2_match_context *\fImcontext\fP,
.B " uint32_t \fIvalue\fP);"
.fi
.sp
-The \fIrecursion_limit\fP parameter is similar to \fImatch_limit\fP, but
-instead of limiting the total number of times that \fBmatch()\fP is called, it
-limits the depth of recursion. The recursion depth is a smaller number than the
-total number of calls, because not all calls to \fBmatch()\fP are recursive.
-This limit is of use only if it is set smaller than \fImatch_limit\fP.
-.P
-Limiting the recursion depth limits the amount of system stack that can be
-used, or, when PCRE2 has been compiled to use memory on the heap instead of the
-stack, the amount of heap memory that can be used. This limit is not relevant,
-and is ignored, when matching is done using JIT compiled code or by the
-\fBpcre2_dfa_match()\fP function.
-.P
-The default value for \fIrecursion_limit\fP can be set when PCRE2 is built; the
-default default is the same value as the default for \fImatch_limit\fP. If the
-limit is exceeded, \fBpcre2_match()\fP returns PCRE2_ERROR_RECURSIONLIMIT. A
-value for the recursion limit may also be supplied by an item at the start of a
-pattern of the form
-.sp
- (*LIMIT_RECURSION=ddd)
+This parameter limits the depth of nested backtracking in \fBpcre2_match()\fP.
+Each time a nested backtracking point is passed, a new memory "frame" is used
+to remember the state of matching at that point. Thus, this parameter
+indirectly limits the amount of memory that is used in a match. However,
+because the size of each memory "frame" depends on the number of capturing
+parentheses, the actual memory limit varies from pattern to pattern. This limit
+was more useful in versions before 10.30, where function recursion was used for
+backtracking.
+.P
+The depth limit is not relevant, and is ignored, when matching is done using
+JIT compiled code. However, it is supported by \fBpcre2_dfa_match()\fP, which
+uses it to limit the depth of internal recursive function calls that implement
+atomic groups, lookaround assertions, and pattern recursions. This is,
+therefore, an indirect limit on the amount of system stack that is used. A
+recursive pattern such as /(.)(?1)/, when matched to a very long string using
+\fBpcre2_dfa_match()\fP, can use a great deal of stack.
+.P
+The default value for the depth limit can be set when PCRE2 is built; the
+default default is the same value as the default for the match limit. If the
+limit is exceeded, \fBpcre2_match()\fP or \fBpcre2_dfa_match()\fP returns
+PCRE2_ERROR_DEPTHLIMIT. A value for the depth limit may also be supplied by an
+item at the start of a pattern of the form
+.sp
+ (*LIMIT_DEPTH=ddd)
.sp
where ddd is a decimal number. However, such a setting is ignored unless ddd is
-less than the limit set by the caller of \fBpcre2_match()\fP or, if no such
-limit is set, less than the default.
-.sp
-.nf
-.B int pcre2_set_recursion_memory_management(
-.B " pcre2_match_context *\fImcontext\fP,"
-.B " void *(*\fIprivate_malloc\fP)(PCRE2_SIZE, void *),"
-.B " void (*\fIprivate_free\fP)(void *, void *), void *\fImemory_data\fP);"
-.fi
-.sp
-This function sets up two additional custom memory management functions for use
-by \fBpcre2_match()\fP when PCRE2 is compiled to use the heap for remembering
-backtracking data, instead of recursive function calls that use the system
-stack. There is a discussion about PCRE2's stack usage in the
-.\" HREF
-\fBpcre2stack\fP
-.\"
-documentation. See the
-.\" HREF
-\fBpcre2build\fP
-.\"
-documentation for details of how to build PCRE2.
-.P
-Using the heap for recursion is a non-standard way of building PCRE2, for use
-in environments that have limited stacks. Because of the greater use of memory
-management, \fBpcre2_match()\fP runs more slowly. Functions that are different
-to the general custom memory functions are provided so that special-purpose
-external code can be used for this case, because the memory blocks are all the
-same size. The blocks are retained by \fBpcre2_match()\fP until it is about to
-exit so that they can be re-used when possible during the match. In the absence
-of these functions, the normal custom memory management functions are used, if
-supplied, otherwise the system functions.
+less than the limit set by the caller of \fBpcre2_match()\fP or
+\fBpcre2_dfa_match()\fP or, if no such limit is set, less than the default.
.
.
.SH "CHECKING BUILD-TIME OPTIONS"
@@ -915,6 +1019,25 @@ PCRE2_BSR_UNICODE means that \eR matches any Unicode line ending sequence; a
value of PCRE2_BSR_ANYCRLF means that \eR matches only CR, LF, or CRLF. The
default can be overridden when a pattern is compiled.
.sp
+ PCRE2_CONFIG_COMPILED_WIDTHS
+.sp
+The output is a uint32_t integer whose lower bits indicate which code unit
+widths were selected when PCRE2 was built. The 1-bit indicates 8-bit support,
+and the 2-bit and 4-bit indicate 16-bit and 32-bit support, respectively.
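As an illustration of the bit layout just described, the following sketch tests one width in such a value. The helper name width_supported is hypothetical; in a real program the value would be the uint32_t written by pcre2_config() for PCRE2_CONFIG_COMPILED_WIDTHS.

```c
#include <stdint.h>

/* Hypothetical helper: returns 1 if the given code unit width (8, 16 or 32)
   is flagged in a PCRE2_CONFIG_COMPILED_WIDTHS value, otherwise 0. */
int width_supported(uint32_t widths, int code_unit_bits)
{
    switch (code_unit_bits) {
    case 8:  return (widths & 1u) != 0;   /* the 1-bit: 8-bit library  */
    case 16: return (widths & 2u) != 0;   /* the 2-bit: 16-bit library */
    case 32: return (widths & 4u) != 0;   /* the 4-bit: 32-bit library */
    default: return 0;                    /* not a PCRE2 code unit width */
    }
}
```

A value of 7 would therefore mean that all three libraries were built, and a value of 1 that only the 8-bit library is available.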
+.sp
+ PCRE2_CONFIG_DEPTHLIMIT
+.sp
+The output is a uint32_t integer that gives the default limit for the depth of
+nested backtracking in \fBpcre2_match()\fP or the depth of nested recursions
+and lookarounds in \fBpcre2_dfa_match()\fP. Further details are given with
+\fBpcre2_set_depth_limit()\fP above.
+.sp
+ PCRE2_CONFIG_HEAPLIMIT
+.sp
+The output is a uint32_t integer that gives, in kilobytes, the default limit
+for the amount of heap memory used by \fBpcre2_match()\fP. Further details are
+given with \fBpcre2_set_heap_limit()\fP above.
+.sp
PCRE2_CONFIG_JIT
.sp
The output is a uint32_t integer that is set to one if support for just-in-time
@@ -948,9 +1071,9 @@ be compiled by those two libraries, but at the expense of slower matching.
.sp
PCRE2_CONFIG_MATCHLIMIT
.sp
-The output is a uint32_t integer that gives the default limit for the number of
-internal matching function calls in a \fBpcre2_match()\fP execution. Further
-details are given with \fBpcre2_match()\fP below.
+The output is a uint32_t integer that gives the default match limit for
+\fBpcre2_match()\fP. Further details are given with
+\fBpcre2_set_match_limit()\fP above.
.sp
PCRE2_CONFIG_NEWLINE
.sp
@@ -962,10 +1085,16 @@ sequence that is recognized as meaning "newline". The values are:
PCRE2_NEWLINE_CRLF Carriage return, linefeed (CRLF)
PCRE2_NEWLINE_ANY Any Unicode line ending
PCRE2_NEWLINE_ANYCRLF Any of CR, LF, or CRLF
+ PCRE2_NEWLINE_NUL The NUL character (binary zero)
.sp
The default should normally correspond to the standard sequence for your
operating system.
.sp
+ PCRE2_CONFIG_NEVER_BACKSLASH_C
+.sp
+The output is a uint32_t integer that is set to one if the use of \eC was
+permanently disabled when PCRE2 was built; otherwise it is set to zero.
+.sp
PCRE2_CONFIG_PARENSLIMIT
.sp
The output is a uint32_t integer that gives the maximum depth of nesting
@@ -975,19 +1104,10 @@ PCRE2 is built; the default is 250. This limit does not take into account the
stack that may already be used by the calling application. For finer control
over compilation stack usage, see \fBpcre2_set_compile_recursion_guard()\fP.
.sp
- PCRE2_CONFIG_RECURSIONLIMIT
-.sp
-The output is a uint32_t integer that gives the default limit for the depth of
-recursion when calling the internal matching function in a \fBpcre2_match()\fP
-execution. Further details are given with \fBpcre2_match()\fP below.
-.sp
PCRE2_CONFIG_STACKRECURSE
.sp
-The output is a uint32_t integer that is set to one if internal recursion when
-running \fBpcre2_match()\fP is implemented by recursive function calls that use
-the system stack to remember their state. This is the usual way that PCRE2 is
-compiled. The output is zero if PCRE2 was compiled to use blocks of data on the
-heap instead of recursive function calls.
+This parameter is obsolete and should not be used in new code. The output is a
+uint32_t integer that is always set to zero.
.sp
PCRE2_CONFIG_UNICODE_VERSION
.sp
@@ -1006,7 +1126,7 @@ available; otherwise it is set to zero. Unicode support implies UTF support.
.sp
PCRE2_CONFIG_VERSION
.sp
-The \fIwhere\fP argument should point to a buffer that is at least 12 code
+The \fIwhere\fP argument should point to a buffer that is at least 24 code
units long. (The exact length required can be found by calling
\fBpcre2_config()\fP with \fBwhere\fP set to NULL.) The buffer is filled with
the PCRE2 version string, zero-terminated. The number of code units used is
@@ -1026,11 +1146,13 @@ zero.
.B void pcre2_code_free(pcre2_code *\fIcode\fP);
.sp
.B pcre2_code *pcre2_code_copy(const pcre2_code *\fIcode\fP);
+.sp
+.B pcre2_code *pcre2_code_copy_with_tables(const pcre2_code *\fIcode\fP);
.fi
.P
The \fBpcre2_compile()\fP function compiles a pattern into an internal form.
-The pattern is defined by a pointer to a string of code units and a length. If
-the pattern is zero-terminated, the length can be specified as
+The pattern is defined by a pointer to a string of code units and a length (in
+code units). If the pattern is zero-terminated, the length can be specified as
PCRE2_ZERO_TERMINATED. The function returns a pointer to a block of memory that
contains the compiled pattern and related data, or NULL if an error occurred.
.P
@@ -1048,9 +1170,24 @@ below),
.\"
the JIT information cannot be copied (because it is position-dependent).
The new copy can initially be used only for non-JIT matching, though it can be
-passed to \fBpcre2_jit_compile()\fP if required. The \fBpcre2_code_copy()\fP
-function provides a way for individual threads in a multithreaded application
-to acquire a private copy of shared compiled code.
+passed to \fBpcre2_jit_compile()\fP if required.
+.P
+The \fBpcre2_code_copy()\fP function provides a way for individual threads in a
+multithreaded application to acquire a private copy of shared compiled code.
+However, it does not make a copy of the character tables used by the compiled
+pattern; the new pattern code points to the same tables as the original code.
+(See
+.\" HTML <a href="#localesupport">
+.\" </a>
+"Locale Support"
+.\"
+below for details of these character tables.) In many applications the same
+tables are used throughout, so this behaviour is appropriate. Nevertheless,
+there are occasions when a copy of a compiled pattern and the relevant tables
+are needed. The \fBpcre2_code_copy_with_tables()\fP function provides this facility.
+Copies of both the code and the tables are made, with the new code pointing to
+the new tables. The memory for the new tables is automatically freed when
+\fBpcre2_code_free()\fP is called for the new copy of the compiled code.
.P
NOTE: When one of the matching functions is called, pointers to the compiled
pattern and the subject string are set in the match data block so that they can
@@ -1076,8 +1213,8 @@ documentation).
.P
For those options that can be different in different parts of the pattern, the
contents of the \fIoptions\fP argument specifies their settings at the start of
-compilation. The PCRE2_ANCHORED and PCRE2_NO_UTF_CHECK options can be set at
-the time of matching as well as at compile time.
+compilation. The PCRE2_ANCHORED, PCRE2_ENDANCHORED, and PCRE2_NO_UTF_CHECK
+options can be set at the time of matching as well as at compile time.
.P
Other, less frequently required compile-time parameters (for example, the
newline setting) can be provided in a compile context (as described
@@ -1093,16 +1230,30 @@ respectively, when \fBpcre2_compile()\fP returns NULL because a compilation
error has occurred. The values are not defined when compilation is successful
and \fBpcre2_compile()\fP returns a non-NULL value.
.P
-The \fBpcre2_get_error_message()\fP function (see "Obtaining a textual error
+There are nearly 100 positive error codes that \fBpcre2_compile()\fP may return
+if it finds an error in the pattern. There are also some negative error codes
+that are used for invalid UTF strings. These are the same as given by
+\fBpcre2_match()\fP and \fBpcre2_dfa_match()\fP, and are described in the
+.\" HREF
+\fBpcre2unicode\fP
+.\"
+page. There is no separate documentation for the positive error codes, because
+the textual error messages that are obtained by calling the
+\fBpcre2_get_error_message()\fP function (see "Obtaining a textual error
message"
.\" HTML <a href="#geterrormessage">
.\" </a>
below)
.\"
-provides a textual message for each error code. Compilation errors have
-positive error codes; UTF formatting error codes are negative. For an invalid
-UTF-8 or UTF-16 string, the offset is that of the first code unit of the
-failing character.
+should be self-explanatory. Macro names starting with PCRE2_ERROR_ are defined
+for both positive and negative error codes in \fBpcre2.h\fP.
+.P
+The value returned in \fIerroroffset\fP is an indication of where in the
+pattern the error occurred. It is not necessarily the furthest point in the
+pattern that was read. For example, after the error "lookbehind assertion is
+not fixed length", the error offset points to the start of the failing
+assertion. For an invalid UTF-8 or UTF-16 string, the offset is that of the
+first code unit of the failing character.
.P
Some errors are not detected until the whole pattern has been scanned; in these
cases, the offset passed back is the length of the pattern. Note that the
@@ -1178,13 +1329,15 @@ include a closing parenthesis in the name. However, if the PCRE2_ALT_VERBNAMES
option is set, normal backslash processing is applied to verb names and only an
unescaped closing parenthesis terminates the name. A closing parenthesis can be
included in a name either as \e) or between \eQ and \eE. If the PCRE2_EXTENDED
-option is set, unescaped whitespace in verb names is skipped and #-comments are
-recognized, exactly as in the rest of the pattern.
+or PCRE2_EXTENDED_MORE option is set, unescaped whitespace in verb names is
+skipped and #-comments are recognized in this mode, exactly as in the rest of
+the pattern.
.sp
PCRE2_AUTO_CALLOUT
.sp
If this bit is set, \fBpcre2_compile()\fP automatically inserts callout items,
-all with number 255, before each pattern item. For discussion of the callout
+all with number 255, before each pattern item, except immediately before or
+after an explicit callout in the pattern. For discussion of the callout
facility, see the
.\" HREF
\fBpcre2callout\fP
@@ -1195,7 +1348,13 @@ documentation.
.sp
If this bit is set, letters in the pattern match both upper and lower case
letters in the subject. It is equivalent to Perl's /i option, and it can be
-changed within a pattern by a (?i) option setting.
+changed within a pattern by a (?i) option setting. If PCRE2_UTF is set, Unicode
+properties are used for all characters with more than one other case, and for
+all characters whose code points are greater than U+007f. For lower valued
+characters with only one other case, a lookup table is used for speed. When
+PCRE2_UTF is not set, a lookup table is used for all code points less than 256,
+and higher code points (available only in 16-bit or 32-bit mode) are treated as
+not having another case.
.sp
PCRE2_DOLLAR_ENDONLY
.sp
@@ -1227,6 +1386,29 @@ details of named subpatterns below; see also the
.\"
documentation.
.sp
+ PCRE2_ENDANCHORED
+.sp
+If this bit is set, the end of any pattern match must be right at the end of
+the string being searched (the "subject string"). If the pattern match
+succeeds by reaching (*ACCEPT), but does not reach the end of the subject, the
+match fails at the current starting point. For unanchored patterns, a new match
+is then tried at the next starting point. However, if the match succeeds by
+reaching the end of the pattern, but not the end of the subject, backtracking
+occurs and an alternative match may be found. Consider these two patterns:
+.sp
+ .(*ACCEPT)|..
+ .|..
+.sp
+If matched against "abc" with PCRE2_ENDANCHORED set, the first matches "c"
+whereas the second matches "bc". The effect of PCRE2_ENDANCHORED can also be
+achieved by appropriate constructs in the pattern itself, which is the only way
+to do it in Perl.
+.P
+For DFA matching with \fBpcre2_dfa_match()\fP, PCRE2_ENDANCHORED applies only
+to the first (that is, the longest) matched string. Other parallel matches,
+which are necessarily substrings of the first one, must obviously end before
+the end of the subject.
+.sp
PCRE2_EXTENDED
.sp
If this bit is set, most white space characters in the pattern are totally
@@ -1254,14 +1436,39 @@ sequence at the start of the pattern, as described in the section entitled
in the \fBpcre2pattern\fP documentation. A default is defined when PCRE2 is
built.
.sp
+ PCRE2_EXTENDED_MORE
+.sp
+This option has the effect of PCRE2_EXTENDED, but, in addition, unescaped space
+and horizontal tab characters are ignored inside a character class.
+PCRE2_EXTENDED_MORE is equivalent to Perl 5.26's /xx option, and it can be
+changed within a pattern by a (?xx) option setting.
+.sp
PCRE2_FIRSTLINE
.sp
-If this option is set, an unanchored pattern is required to match before or at
-the first newline in the subject string, though the matched text may continue
-over the newline. See also PCRE2_USE_OFFSET_LIMIT, which provides a more
-general limiting facility. If PCRE2_FIRSTLINE is set with an offset limit, a
-match must occur in the first line and also within the offset limit. In other
-words, whichever limit comes first is used.
+If this option is set, the start of an unanchored pattern match must be before
+or at the first newline in the subject string following the start of matching,
+though the matched text may continue over the newline. If \fIstartoffset\fP is
+non-zero, the limiting newline is not necessarily the first newline in the
+subject. For example, if the subject string is "abc\enxyz" (where \en
+represents a single-character newline) a pattern match for "yz" succeeds with
+PCRE2_FIRSTLINE if \fIstartoffset\fP is greater than 3. See also
+PCRE2_USE_OFFSET_LIMIT, which provides a more general limiting facility. If
+PCRE2_FIRSTLINE is set with an offset limit, a match must occur in the first
+line and also within the offset limit. In other words, whichever limit comes
+first is used.
+.sp
+ PCRE2_LITERAL
+.sp
+If this option is set, all meta-characters in the pattern are disabled, and it
+is treated as a literal string. Matching literal strings with a regular
+expression engine is not the most efficient way of doing it. If you are doing a
+lot of literal matching and are worried about efficiency, you should consider
+using other approaches. The only other main options that are allowed with
+PCRE2_LITERAL are: PCRE2_ANCHORED, PCRE2_ENDANCHORED, PCRE2_AUTO_CALLOUT,
+PCRE2_CASELESS, PCRE2_FIRSTLINE, PCRE2_NO_START_OPTIMIZE, PCRE2_NO_UTF_CHECK,
+PCRE2_UTF, and PCRE2_USE_OFFSET_LIMIT. The extra options PCRE2_EXTRA_MATCH_LINE
+and PCRE2_EXTRA_MATCH_WORD are also supported. Any other options cause an
+error.
.sp
PCRE2_MATCH_UNSET_BACKREF
.sp
@@ -1325,8 +1532,8 @@ PCRE2_NEVER_UTF causes an error.
If this option is set, it disables the use of numbered capturing parentheses in
the pattern. Any opening parenthesis that is not followed by ? behaves as if it
were followed by ?: but named parentheses can still be used for capturing (and
-they acquire numbers in the usual way). There is no equivalent of this option
-in Perl. Note that, if this option is set, references to capturing groups (back
+they acquire numbers in the usual way). This is the same as Perl's /n option.
+Note that, when this option is set, references to capturing groups (back
references or recursion/subroutine calls) may only refer to named groups,
though the reference can be by name or by number.
.sp
@@ -1361,8 +1568,8 @@ compiler.
.P
There are a number of optimizations that may occur at the start of a match, in
order to speed up the process. For example, if it is known that an unanchored
-match must start with a specific character, the matching code searches the
-subject for that character, and fails immediately if it cannot find it, without
+match must start with a specific code unit value, the matching code searches
+the subject for that value, and fails immediately if it cannot find it, without
actually running the main matching function. This means that a special item
such as (*COMMIT) at the start of a pattern is not considered until after a
suitable starting point for the match has been found. Also, when callouts or
@@ -1389,9 +1596,10 @@ current starting position, which in this case, it does. However, if the same
match is run with PCRE2_NO_START_OPTIMIZE set, the initial scan along the
subject string does not happen. The first match attempt is run starting from
"D" and when this fails, (*COMMIT) prevents any further matches being tried, so
-the overall result is "no match". There are also other start-up optimizations.
-For example, a minimum length for the subject may be recorded. Consider the
-pattern
+the overall result is "no match".
+.P
+There are also other start-up optimizations. For example, a minimum length for
+the subject may be recorded. Consider the pattern
.sp
(*MARK:A)(X|Y)
.sp
@@ -1423,16 +1631,30 @@ in the
.\" HREF
\fBpcre2unicode\fP
.\"
-document.
-If an invalid UTF sequence is found, \fBpcre2_compile()\fP returns a negative
-error code.
+document. If an invalid UTF sequence is found, \fBpcre2_compile()\fP returns a
+negative error code.
.P
-If you know that your pattern is valid, and you want to skip this check for
-performance reasons, you can set the PCRE2_NO_UTF_CHECK option. When it is set,
-the effect of passing an invalid UTF string as a pattern is undefined. It may
-cause your program to crash or loop. Note that this option can also be passed
-to \fBpcre2_match()\fP and \fBpcre_dfa_match()\fP, to suppress validity
-checking of the subject string.
+If you know that your pattern is a valid UTF string, and you want to skip this
+check for performance reasons, you can set the PCRE2_NO_UTF_CHECK option. When
+it is set, the effect of passing an invalid UTF string as a pattern is
+undefined. It may cause your program to crash or loop.
+.P
+Note that this option can also be passed to \fBpcre2_match()\fP and
+\fBpcre2_dfa_match()\fP, to suppress UTF validity checking of the subject
+string.
+.P
+Note also that setting PCRE2_NO_UTF_CHECK at compile time does not disable the
+error that is given if an escape sequence for an invalid Unicode code point is
+encountered in the pattern. In particular, the so-called "surrogate" code
+points (0xd800 to 0xdfff) are invalid. If you want to allow escape sequences
+such as \ex{d800} you can set the PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES extra
+option, as described in the section entitled "Extra compile options"
+.\" HTML <a href="#extracompileoptions">
+.\" </a>
+below.
+.\"
+However, this is possible only in UTF-8 and UTF-32 modes, because these values
+are not representable in UTF-16.
.sp
PCRE2_UCP
.sp
@@ -1450,7 +1672,7 @@ in the
.\"
page. If you set PCRE2_UCP, matching one of the items it affects takes much
longer. The option is available only if PCRE2 has been compiled with Unicode
-support.
+support (which is the default).
.sp
PCRE2_UNGREEDY
.sp
@@ -1478,32 +1700,78 @@ This option causes PCRE2 to regard both the pattern and the subject strings
that are subsequently processed as strings of UTF characters instead of
single-code-unit strings. It is available when PCRE2 is built to include
Unicode support (which is the default). If Unicode support is not available,
-the use of this option provokes an error. Details of how this option changes
-the behaviour of PCRE2 are given in the
+the use of this option provokes an error. Details of how PCRE2_UTF changes the
+behaviour of PCRE2 are given in the
.\" HREF
\fBpcre2unicode\fP
.\"
page.
.
.
-.SH "COMPILATION ERROR CODES"
+.\" HTML <a name="extracompileoptions"></a>
+.SS "Extra compile options"
.rs
.sp
-There are over 80 positive error codes that \fBpcre2_compile()\fP may return
-(via \fIerrorcode\fP) if it finds an error in the pattern. There are also some
-negative error codes that are used for invalid UTF strings. These are the same
-as given by \fBpcre2_match()\fP and \fBpcre2_dfa_match()\fP, and are described
-in the
-.\" HREF
-\fBpcre2unicode\fP
-.\"
-page. The \fBpcre2_get_error_message()\fP function (see "Obtaining a textual
-error message"
-.\" HTML <a href="#geterrormessage">
-.\" </a>
-below)
-.\"
-can be called to obtain a textual error message from any error code.
+Unlike the main compile-time options, the extra options are not saved with the
+compiled pattern. The option bits that can be set in a compile context by
+calling the \fBpcre2_set_compile_extra_options()\fP function are as follows:
+.sp
+ PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES
+.sp
+This option applies when compiling a pattern in UTF-8 or UTF-32 mode. It is
+forbidden in UTF-16 mode, and ignored in non-UTF modes. Unicode "surrogate"
+code points in the range 0xd800 to 0xdfff are used in pairs in UTF-16 to encode
+code points with values in the range 0x10000 to 0x10ffff. The surrogates cannot
+therefore be represented in UTF-16. They can be represented in UTF-8 and
+UTF-32, but are defined as invalid code points, and cause errors if encountered
+in a UTF-8 or UTF-32 string that is being checked for validity by PCRE2.
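The paragraph above depends on how UTF-16 uses surrogate pairs. A short sketch of the standard UTF-16 arithmetic follows. The function names are hypothetical and the code is not taken from PCRE2. It shows that every code point from 0x10000 to 0x10ffff maps to one value in 0xd800 to 0xdbff followed by one in 0xdc00 to 0xdfff, which is why those values cannot stand for characters themselves in UTF-16.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helpers illustrating standard UTF-16 encoding.
   Both require a code point in the range 0x10000 to 0x10ffff. */
uint16_t high_surrogate(uint32_t cp)
{
    assert(cp >= 0x10000u && cp <= 0x10FFFFu);
    /* Top 10 bits of the 20-bit offset select 0xd800..0xdbff. */
    return (uint16_t)(0xD800u + ((cp - 0x10000u) >> 10));
}

uint16_t low_surrogate(uint32_t cp)
{
    assert(cp >= 0x10000u && cp <= 0x10FFFFu);
    /* Bottom 10 bits of the 20-bit offset select 0xdc00..0xdfff. */
    return (uint16_t)(0xDC00u + ((cp - 0x10000u) & 0x3FFu));
}
```

For example, U+1F600 is encoded as the pair 0xd83d, 0xde00.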
+.P
+These values also cause errors if encountered in escape sequences such as
+\ex{d912} within a pattern. However, it seems that some applications, when
+using PCRE2 to check for unwanted characters in UTF-8 strings, explicitly test
+for the surrogates using escape sequences. The PCRE2_NO_UTF_CHECK option does
+not disable the error that occurs, because it applies only to the testing of
+input strings for UTF validity.
+.P
+If the extra option PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES is set, surrogate code
+point values in UTF-8 and UTF-32 patterns no longer provoke errors and are
+incorporated in the compiled pattern. However, they can only match subject
+characters if the matching function is called with PCRE2_NO_UTF_CHECK set.
+.sp
+ PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL
+.sp
+This is a dangerous option. Use with care. By default, an unrecognized escape
+such as \ej or a malformed one such as \ex{2z} causes a compile-time error when
+detected by \fBpcre2_compile()\fP. Perl is somewhat inconsistent in handling
+such items: for example, \ej is treated as a literal "j", and non-hexadecimal
+digits in \ex{} are just ignored, though warnings are given in both cases if
+Perl's warning switch is enabled. However, a malformed octal number after \eo{
+always causes an error in Perl.
+.P
+If the PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL extra option is passed to
+\fBpcre2_compile()\fP, all unrecognized or erroneous escape sequences are
+treated as single-character escapes. For example, \ej is a literal "j" and
+\ex{2z} is treated as the literal string "x{2z}". Setting this option means
+that typos in patterns may go undetected and have unexpected results. This is a
+dangerous option. Use with care.
+.sp
+ PCRE2_EXTRA_MATCH_LINE
+.sp
+This option is provided for use by the \fB-x\fP option of \fBpcre2grep\fP. It
+causes the pattern only to match complete lines. This is achieved by
+automatically inserting the code for "^(?:" at the start of the compiled
+pattern and ")$" at the end. Thus, when PCRE2_MULTILINE is set, the matched
+line may be in the middle of the subject string. This option can be used with
+PCRE2_LITERAL.
+.sp
+ PCRE2_EXTRA_MATCH_WORD
+.sp
+This option is provided for use by the \fB-w\fP option of \fBpcre2grep\fP. It
+causes the pattern only to match strings that have a word boundary at the start
+and the end. This is achieved by automatically inserting the code for "\eb(?:"
+at the start of the compiled pattern and ")\eb" at the end. The option may be
+used with PCRE2_LITERAL. However, it is ignored if PCRE2_EXTRA_MATCH_LINE is
+also set.
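As a rough illustration of the surrogate range described under PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES above, the following C sketch classifies a code point and shows how UTF-16 splits a code point in the range 0x10000 to 0x10ffff into a pair of surrogates. The helper names is_surrogate and to_surrogate_pair are hypothetical and are not part of the PCRE2 API.

```c
#include <stdint.h>

/* Hypothetical helper (not a PCRE2 function): returns 1 if cp lies in the
   surrogate range 0xd800..0xdfff, otherwise 0. */
static int is_surrogate(uint32_t cp)
{
    return cp >= 0xd800 && cp <= 0xdfff;
}

/* Hypothetical helper (not a PCRE2 function): encode a code point in the
   range 0x10000..0x10ffff as a UTF-16 surrogate pair.
   Returns 0 on success, or -1 if cp is outside that range. */
static int to_surrogate_pair(uint32_t cp, uint16_t *high, uint16_t *low)
{
    if (cp < 0x10000 || cp > 0x10ffff) return -1;
    cp -= 0x10000;                              /* 20 significant bits remain */
    *high = (uint16_t)(0xd800 + (cp >> 10));    /* top 10 bits */
    *low  = (uint16_t)(0xdc00 + (cp & 0x3ff));  /* bottom 10 bits */
    return 0;
}
```

Because every value in 0xd800..0xdfff is reserved for this pairing, a lone value such as 0xd912 has no meaning as a character, which is why it is rejected by default in escape sequences such as \ex{d912}.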
.
.
.\" HTML <a name="jitcompiling"></a>
@@ -1541,7 +1809,7 @@ documentation.
JIT compilation is a heavyweight optimization. It can take some time for
patterns to be analyzed, and for one-off matches and simple patterns the
benefit of faster execution might be offset by a much slower compilation time.
-Most, but not all patterns can be optimized by the JIT compiler.
+Most (but not all) patterns can be optimized by the JIT compiler.
.
.
.\" HTML <a name="localesupport"></a>
@@ -1552,10 +1820,10 @@ PCRE2 handles caseless matching, and determines whether characters are letters,
digits, or whatever, by reference to a set of tables, indexed by character code
point. This applies only to characters whose code points are less than 256. By
default, higher-valued code points never match escapes such as \ew or \ed.
-However, if PCRE2 is built with UTF support, all characters can be tested with
-\ep and \eP, or, alternatively, the PCRE2_UCP option can be set when a pattern
-is compiled; this causes \ew and friends to use Unicode property support
-instead of the built-in tables.
+However, if PCRE2 is built with Unicode support, all characters can be tested
+with \ep and \eP, or, alternatively, the PCRE2_UCP option can be set when a
+pattern is compiled; this causes \ew and friends to use Unicode property
+support instead of the built-in tables.
.P
The use of locales with Unicode is discouraged. If you are handling characters
with code points greater than 128, you should either use Unicode support, or
@@ -1594,7 +1862,7 @@ available for as long as it is needed.
The pointer that is passed (via the compile context) to \fBpcre2_compile()\fP
is saved with the compiled pattern, and the same tables are used by
 \fBpcre2_match()\fP and \fBpcre2_dfa_match()\fP. Thus, for any single pattern,
-compilation, and matching all happen in the same locale, but different patterns
+compilation and matching both happen in the same locale, but different patterns
can be processed in different locales.
.
.
@@ -1617,7 +1885,7 @@ pattern. The second argument specifies which piece of information is required,
and the third argument is a pointer to a variable to receive the data. If the
third argument is NULL, the first argument is ignored, and the function returns
the size in bytes of the variable that is required for the information
-requested. Otherwise, The yield of the function is zero for success, or one of
+requested. Otherwise, the yield of the function is zero for success, or one of
the following negative numbers:
.sp
PCRE2_ERROR_NULL the argument \fIcode\fP was NULL
@@ -1641,12 +1909,15 @@ are as follows:
.sp
PCRE2_INFO_ALLOPTIONS
PCRE2_INFO_ARGOPTIONS
+ PCRE2_INFO_EXTRAOPTIONS
.sp
-Return a copy of the pattern's options. The third argument should point to a
+Return copies of the pattern's options. The third argument should point to a
\fBuint32_t\fP variable. PCRE2_INFO_ARGOPTIONS returns exactly the options that
were passed to \fBpcre2_compile()\fP, whereas PCRE2_INFO_ALLOPTIONS returns
the compile options as modified by any top-level (*XXX) option settings such as
-(*UTF) at the start of the pattern itself.
+(*UTF) at the start of the pattern itself. PCRE2_INFO_EXTRAOPTIONS returns the
+extra options that were set in the compile context by calling the
+\fBpcre2_set_compile_extra_options()\fP function.
.P
For example, if the pattern /(*UTF)abc/ is compiled with the PCRE2_EXTENDED
option, the result for PCRE2_INFO_ALLOPTIONS is PCRE2_EXTENDED and PCRE2_UTF.
@@ -1670,8 +1941,8 @@ following are true:
.* is not in a capturing group that is the subject
of a back reference
PCRE2_DOTALL is in force for .*
- Neither (*PRUNE) nor (*SKIP) appears in the pattern.
- PCRE2_NO_DOTSTAR_ANCHOR is not set.
+ Neither (*PRUNE) nor (*SKIP) appears in the pattern
+ PCRE2_NO_DOTSTAR_ANCHOR is not set
.sp
For patterns that are auto-anchored, the PCRE2_ANCHORED bit is set in the
options returned for PCRE2_INFO_ALLOPTIONS.
@@ -1699,6 +1970,15 @@ Return the highest capturing subpattern number in the pattern. In patterns
where (?| is not used, this is also the total number of capturing subpatterns.
The third argument should point to an \fBuint32_t\fP variable.
.sp
+ PCRE2_INFO_DEPTHLIMIT
+.sp
+If the pattern set a backtracking depth limit by including an item of the form
+(*LIMIT_DEPTH=nnnn) at the start, the value is returned. The third argument
+should point to an unsigned 32-bit integer. If no such value has been set, the
+call to \fBpcre2_pattern_info()\fP returns the error PCRE2_ERROR_UNSET. Note
+that this limit will only be used during matching if it is less than the limit
+set or defaulted by the caller of the match function.
+.sp
PCRE2_INFO_FIRSTBITMAP
.sp
In the absence of a single first code unit for a non-anchored pattern,
@@ -1715,21 +1995,29 @@ returned. Otherwise NULL is returned. The third argument should point to an
Return information about the first code unit of any matched string, for a
non-anchored pattern. The third argument should point to an \fBuint32_t\fP
variable. If there is a fixed first value, for example, the letter "c" from a
-pattern such as (cat|cow|coyote), 1 is returned, and the character value can be
-retrieved using PCRE2_INFO_FIRSTCODEUNIT. If there is no fixed first value, but
-it is known that a match can occur only at the start of the subject or
-following a newline in the subject, 2 is returned. Otherwise, and for anchored
-patterns, 0 is returned.
+pattern such as (cat|cow|coyote), 1 is returned, and the value can be retrieved
+using PCRE2_INFO_FIRSTCODEUNIT. If there is no fixed first value, but it is
+known that a match can occur only at the start of the subject or following a
+newline in the subject, 2 is returned. Otherwise, and for anchored patterns, 0
+is returned.
.sp
PCRE2_INFO_FIRSTCODEUNIT
.sp
-Return the value of the first code unit of any matched string in the situation
+Return the value of the first code unit of any matched string for a pattern
where PCRE2_INFO_FIRSTCODETYPE returns 1; otherwise return 0. The third
argument should point to an \fBuint32_t\fP variable. In the 8-bit library, the
value is always less than 256. In the 16-bit library the value can be up to
0xffff. In the 32-bit library in UTF-32 mode the value can be up to 0x10ffff,
and up to 0xffffffff when not using UTF-32 mode.
.sp
+ PCRE2_INFO_FRAMESIZE
+.sp
+Return the size (in bytes) of the data frames that are used to remember
+backtracking positions when the pattern is processed by \fBpcre2_match()\fP
+without the use of JIT. The third argument should point to a \fBsize_t\fP
+variable. The frame size depends on the number of capturing parentheses in the
+pattern. Each additional capturing group adds two PCRE2_SIZE variables.
+.sp
PCRE2_INFO_HASBACKSLASHC
.sp
Return 1 if the pattern contains any instances of \eC, otherwise 0. The third
@@ -1739,7 +2027,17 @@ argument should point to an \fBuint32_t\fP variable.
.sp
Return 1 if the pattern contains any explicit matches for CR or LF characters,
otherwise 0. The third argument should point to an \fBuint32_t\fP variable. An
-explicit match is either a literal CR or LF character, or \er or \en.
+explicit match is either a literal CR or LF character, or \er or \en or one of
+the equivalent hexadecimal or octal escape sequences.
+.sp
+ PCRE2_INFO_HEAPLIMIT
+.sp
+If the pattern set a heap memory limit by including an item of the form
+(*LIMIT_HEAP=nnnn) at the start, the value is returned. The third argument
+should point to an unsigned 32-bit integer. If no such value has been set, the
+call to \fBpcre2_pattern_info()\fP returns the error PCRE2_ERROR_UNSET. Note
+that this limit will only be used during matching if it is less than the limit
+set or defaulted by the caller of the match function.
.sp
PCRE2_INFO_JCHANGED
.sp
@@ -1766,10 +2064,10 @@ PCRE2_INFO_LASTCODEUNIT), but for /^a\edz\ed/ the returned value is 0.
.sp
PCRE2_INFO_LASTCODEUNIT
.sp
-Return the value of the rightmost literal data unit that must exist in any
-matched string, other than at its start, if such a value has been recorded. The
-third argument should point to an \fBuint32_t\fP variable. If there is no such
-value, 0 is returned.
+Return the value of the rightmost literal code unit that must exist in any
+matched string, other than at its start, for a pattern where
+PCRE2_INFO_LASTCODETYPE returns 1. Otherwise, return 0. The third argument
+should point to an \fBuint32_t\fP variable.
.sp
PCRE2_INFO_MATCHEMPTY
.sp
@@ -1784,7 +2082,9 @@ in such cases.
If the pattern set a match limit by including an item of the form
(*LIMIT_MATCH=nnnn) at the start, the value is returned. The third argument
should point to an unsigned 32-bit integer. If no such value has been set, the
-call to \fBpcre2_pattern_info()\fP returns the error PCRE2_ERROR_UNSET.
+call to \fBpcre2_pattern_info()\fP returns the error PCRE2_ERROR_UNSET. Note
+that this limit will only be used during matching if it is less than the limit
+set or defaulted by the caller of the match function.
.sp
PCRE2_INFO_MAXLOOKBEHIND
.sp
@@ -1796,7 +2096,8 @@ require a one-character lookbehind. \eA also registers a one-character
lookbehind, though it does not actually inspect the previous character. This is
to ensure that at least one character from the old segment is retained when a
new segment is processed. Otherwise, if there are no lookbehinds in the
-pattern, \eA might match incorrectly at the start of a new segment.
+pattern, \eA might match incorrectly at the start of a second or subsequent
+segment.
.sp
PCRE2_INFO_MINLENGTH
.sp
@@ -1878,23 +2179,17 @@ different for each compiled pattern.
.sp
PCRE2_INFO_NEWLINE
.sp
-The output is a \fBuint32_t\fP with one of the following values:
+The output is one of the following \fBuint32_t\fP values:
.sp
PCRE2_NEWLINE_CR Carriage return (CR)
PCRE2_NEWLINE_LF Linefeed (LF)
PCRE2_NEWLINE_CRLF Carriage return, linefeed (CRLF)
PCRE2_NEWLINE_ANY Any Unicode line ending
PCRE2_NEWLINE_ANYCRLF Any of CR, LF, or CRLF
+ PCRE2_NEWLINE_NUL The NUL character (binary zero)
.sp
-This specifies the default character sequence that will be recognized as
-meaning "newline" while matching.
-.sp
- PCRE2_INFO_RECURSIONLIMIT
-.sp
-If the pattern set a recursion limit by including an item of the form
-(*LIMIT_RECURSION=nnnn) at the start, the value is returned. The third
-argument should point to an unsigned 32-bit integer. If no such value has been
-set, the call to \fBpcre2_pattern_info()\fP returns the error PCRE2_ERROR_UNSET.
+This identifies the character sequence that will be recognized as meaning
+"newline" while matching.
.sp
PCRE2_INFO_SIZE
.sp
@@ -1964,16 +2259,16 @@ Information about a successful or unsuccessful match is placed in a match
data block, which is an opaque structure that is accessed by function calls. In
particular, the match data block contains a vector of offsets into the subject
string that define the matched part of the subject and any substrings that were
-captured. This is know as the \fIovector\fP.
+captured. This is known as the \fIovector\fP.
.P
Before calling \fBpcre2_match()\fP, \fBpcre2_dfa_match()\fP, or
\fBpcre2_jit_match()\fP you must create a match data block by calling one of
the creation functions above. For \fBpcre2_match_data_create()\fP, the first
argument is the number of pairs of offsets in the \fIovector\fP. One pair of
offsets is required to identify the string that matched the whole pattern, with
-another pair for each captured substring. For example, a value of 4 creates
-enough space to record the matched portion of the subject plus three captured
-substrings. A minimum of at least 1 pair is imposed by
+an additional pair for each captured substring. For example, a value of 4
+creates enough space to record the matched portion of the subject plus three
+captured substrings. A minimum of 1 pair is imposed by
\fBpcre2_match_data_create()\fP, so it is always possible to return the overall
matched string.
.P
@@ -2052,7 +2347,7 @@ Here is an example of a simple call to \fBpcre2_match()\fP:
11, /* the length of the subject string */
0, /* start at offset 0 in the subject */
0, /* default options */
- match_data, /* the match data block */
+ md, /* the match data block */
NULL); /* a match context; NULL means use defaults */
.sp
If the subject string is zero-terminated, the length can be given as
@@ -2116,9 +2411,11 @@ newline convention recognizes CRLF as a newline, and if so, and the current
character is CR followed by LF, advance the starting offset by two characters
instead of one.
.P
-If a non-zero starting offset is passed when the pattern is anchored, one
+If a non-zero starting offset is passed when the pattern is anchored, a single
attempt to match at the given offset is made. This can only succeed if the
-pattern does not require the match to be at the start of the subject.
+pattern does not require the match to be at the start of the subject. In other
+words, the anchoring must be the result of setting the PCRE2_ANCHORED option or
+the use of .* with PCRE2_DOTALL, not by starting the pattern with ^ or \eA.
.
.
.\" HTML <a name="matchoptions"></a>
@@ -2126,15 +2423,15 @@ pattern does not require the match to be at the start of the subject.
.rs
.sp
The unused bits of the \fIoptions\fP argument for \fBpcre2_match()\fP must be
-zero. The only bits that may be set are PCRE2_ANCHORED, PCRE2_NOTBOL,
-PCRE2_NOTEOL, PCRE2_NOTEMPTY, PCRE2_NOTEMPTY_ATSTART, PCRE2_NO_JIT,
-PCRE2_NO_UTF_CHECK, PCRE2_PARTIAL_HARD, and PCRE2_PARTIAL_SOFT. Their action is
-described below.
+zero. The only bits that may be set are PCRE2_ANCHORED, PCRE2_ENDANCHORED,
+PCRE2_NOTBOL, PCRE2_NOTEOL, PCRE2_NOTEMPTY, PCRE2_NOTEMPTY_ATSTART,
+PCRE2_NO_JIT, PCRE2_NO_UTF_CHECK, PCRE2_PARTIAL_HARD, and PCRE2_PARTIAL_SOFT.
+Their action is described below.
.P
-Setting PCRE2_ANCHORED at match time is not supported by the just-in-time (JIT)
-compiler. If it is set, JIT matching is disabled and the normal interpretive
-code in \fBpcre2_match()\fP is run. Apart from PCRE2_NO_JIT (obviously), the
-remaining options are supported for JIT matching.
+Setting PCRE2_ANCHORED or PCRE2_ENDANCHORED at match time is not supported by
+the just-in-time (JIT) compiler. If either is set, JIT matching is disabled and the
+interpretive code in \fBpcre2_match()\fP is run. Apart from PCRE2_NO_JIT
+(obviously), the remaining options are supported for JIT matching.
.sp
PCRE2_ANCHORED
.sp
@@ -2144,6 +2441,12 @@ to be anchored by virtue of its contents, it cannot be made unachored at
matching time. Note that setting the option at match time disables JIT
matching.
.sp
+ PCRE2_ENDANCHORED
+.sp
+If the PCRE2_ENDANCHORED option is set, any string that \fBpcre2_match()\fP
+matches must be right at the end of the subject string. Note that setting the
+option at match time disables JIT matching.
+.sp
PCRE2_NOTBOL
.sp
This option specifies that first character of the subject string is not the
@@ -2228,12 +2531,12 @@ page.
If you know that your subject is valid, and you want to skip these checks for
performance reasons, you can set the PCRE2_NO_UTF_CHECK option when calling
\fBpcre2_match()\fP. You might want to do this for the second and subsequent
-calls to \fBpcre2_match()\fP if you are making repeated calls to find all the
-matches in a single subject string.
+calls to \fBpcre2_match()\fP if you are making repeated calls to find other
+matches in the same subject string.
.P
-NOTE: When PCRE2_NO_UTF_CHECK is set, the effect of passing an invalid string
-as a subject, or an invalid value of \fIstartoffset\fP, is undefined. Your
-program may crash or loop indefinitely.
+WARNING: When PCRE2_NO_UTF_CHECK is set, the effect of passing an invalid
+string as a subject, or an invalid value of \fIstartoffset\fP, is undefined.
+Your program may crash or loop indefinitely.
.sp
PCRE2_PARTIAL_HARD
PCRE2_PARTIAL_SOFT
@@ -2300,9 +2603,9 @@ start, it skips both the CR and the LF before retrying. However, the pattern
reference, and so advances only by one character after the first failure.
.P
 An explicit match for CR or LF is either a literal appearance of one of those
-characters in the pattern, or one of the \er or \en escape sequences. Implicit
-matches such as [^X] do not count, nor does \es, even though it includes CR and
-LF in the characters that it matches.
+characters in the pattern, or one of the \er or \en or equivalent octal or
+hexadecimal escape sequences. Implicit matches such as [^X] do not count, nor
+does \es, even though it includes CR and LF in the characters that it matches.
.P
Notwithstanding the above, anomalous effects may still occur when CRLF is a
valid newline sequence and explicit \er or \en escapes appear in the pattern.
@@ -2366,12 +2669,12 @@ identify the part of the subject that was partially matched. See the
.\"
documentation for details of partial matching.
.P
-After a successful match, the first pair of offsets identifies the portion of
-the subject string that was matched by the entire pattern. The next pair is
-used for the first capturing subpattern, and so on. The value returned by
+After a fully successful match, the first pair of offsets identifies the
+portion of the subject string that was matched by the entire pattern. The next
+pair is used for the first captured substring, and so on. The value returned by
\fBpcre2_match()\fP is one more than the highest numbered pair that has been
set. For example, if two substrings have been captured, the returned value is
-3. If there are no capturing subpatterns, the return value from a successful
+3. If there are no captured substrings, the return value from a successful
match is 1, indicating that just the first pair of offsets has been set.
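The relationship between the return value and the offset pairs can be sketched in C as follows. This is a minimal illustration rather than PCRE2 code: copy_group and UNSET_OFFSET are hypothetical names, the ovector is passed in as a plain array of size_t, and UNSET_OFFSET is assumed to stand in for the library's PCRE2_UNSET value. In a real program the array would come from the match data block.

```c
#include <stddef.h>
#include <string.h>

/* Assumed stand-in for PCRE2_UNSET, the offset value of a group that did
   not take part in the match. */
#define UNSET_OFFSET (~(size_t)0)

/* Hypothetical helper (not a PCRE2 function): copy group n of a match into
   buf as a zero-terminated string. Pair n occupies ovector[2*n] (start
   offset) and ovector[2*n+1] (end offset); rc is the positive value that
   the match function returned. Returns the substring length, or -1 if the
   pair was not set or buf is too small. */
static long copy_group(const char *subject, const size_t *ovector, int rc,
                       int n, char *buf, size_t buflen)
{
    if (n < 0 || n >= rc) return -1;          /* beyond the highest pair set */
    size_t start = ovector[2 * n];
    size_t end = ovector[2 * n + 1];
    if (start == UNSET_OFFSET || end == UNSET_OFFSET) return -1;
    if (end < start) return -1;               /* possible when \K is used in an assertion */
    size_t len = end - start;
    if (len + 1 > buflen) return -1;          /* need room for the trailing zero */
    memcpy(buf, subject + start, len);
    buf[len] = '\0';
    return (long)len;
}
```

For the pattern a(b)c matched against "=abc=", the match function returns 2 and the ovector holds the offsets 1, 4, 2, 3, so group 0 is "abc" and group 1 is "b".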
.P
If a pattern uses the \eK escape sequence within a positive assertion, the
@@ -2386,11 +2689,7 @@ returned.
If the ovector is too small to hold all the captured substring offsets, as much
as possible is filled in, and the function returns a value of zero. If captured
substrings are not of interest, \fBpcre2_match()\fP may be called with a match
-data block whose ovector is of minimum length (that is, one pair). However, if
-the pattern contains back references and the \fIovector\fP is not big enough to
-remember the related substrings, PCRE2 has to get additional memory for use
-during matching. Thus it is usually advisable to set up a match data block
-containing an ovector of reasonable size.
+data block whose ovector is of minimum length (that is, one pair).
.P
It is possible for capturing subpattern number \fIn+1\fP to match some part of
the subject when subpattern \fIn\fP has not been used at all. For example, if
@@ -2430,24 +2729,27 @@ appropriate circumstances. If they are called at other times, the result is
undefined.
.P
After a successful match, a partial match (PCRE2_ERROR_PARTIAL), or a failure
-to match (PCRE2_ERROR_NOMATCH), a (*MARK) name may be available, and
-\fBpcre2_get_mark()\fP can be called. It returns a pointer to the
-zero-terminated name, which is within the compiled pattern. Otherwise NULL is
-returned. The length of the (*MARK) name (excluding the terminating zero) is
-stored in the code unit that preceeds the name. You should use this instead of
-relying on the terminating zero if the (*MARK) name might contain a binary
-zero.
-.P
-After a successful match, the (*MARK) name that is returned is the
-last one encountered on the matching path through the pattern. After a "no
-match" or a partial match, the last encountered (*MARK) name is returned. For
-example, consider this pattern:
+to match (PCRE2_ERROR_NOMATCH), a (*MARK), (*PRUNE), or (*THEN) name may be
+available. The function \fBpcre2_get_mark()\fP can be called to access this
+name. The same function applies to all three verbs. It returns a pointer to the
+zero-terminated name, which is within the compiled pattern. If no name is
+available, NULL is returned. The length of the name (excluding the terminating
+zero) is stored in the code unit that precedes the name. You should use this
+length instead of relying on the terminating zero if the name might contain a
+binary zero.
+.P
+After a successful match, the name that is returned is the last (*MARK),
+(*PRUNE), or (*THEN) name encountered on the matching path through the pattern.
+Instances of (*PRUNE) and (*THEN) without names are ignored. Thus, for example,
+if the matching path contains (*MARK:A)(*PRUNE), the name "A" is returned.
+After a "no match" or a partial match, the last encountered name is returned.
+For example, consider this pattern:
.sp
^(*MARK:A)((*MARK:B)a|b)c
.sp
-When it matches "bc", the returned mark is A. The B mark is "seen" in the first
+When it matches "bc", the returned name is A. The B mark is "seen" in the first
branch of the group, but it is not on the matching path. On the other hand,
-when this pattern fails to match "bx", the returned mark is B.
+when this pattern fails to match "bx", the returned name is B.
.P
After a successful match, a partial match, or one of the invalid UTF errors
(for example, PCRE2_ERROR_UTF8_ERR5), \fBpcre2_get_startchar()\fP can be
@@ -2506,8 +2808,9 @@ returned when the magic number is not present.
.sp
PCRE2_ERROR_BADMODE
.sp
-This error is given when a pattern that was compiled by the 8-bit library is
-passed to a 16-bit or 32-bit library function, or vice versa.
+This error is given when a compiled pattern is passed to a function in a
+library of a different code unit width, for example, when a pattern compiled
+by the 8-bit library is passed to a 16-bit or 32-bit library function.
.sp
PCRE2_ERROR_BADOFFSET
.sp
@@ -2534,22 +2837,19 @@ use by callout functions that want to cause \fBpcre2_match()\fP or
.\"
documentation for details.
.sp
+ PCRE2_ERROR_DEPTHLIMIT
+.sp
+The nested backtracking depth limit was reached.
+.sp
+ PCRE2_ERROR_HEAPLIMIT
+.sp
+The heap limit was reached.
+.sp
PCRE2_ERROR_INTERNAL
.sp
An unexpected internal error has occurred. This error could be caused by a bug
in PCRE2 or by overwriting of the compiled pattern.
.sp
- PCRE2_ERROR_JIT_BADOPTION
-.sp
-This error is returned when a pattern that was successfully studied using JIT
-is being matched, but the matching mode (partial or complete match) does not
-correspond to any JIT compilation mode. When the JIT fast path function is
-used, this error may be also given for invalid options. See the
-.\" HREF
-\fBpcre2jit\fP
-.\"
-documentation for more details.
-.sp
PCRE2_ERROR_JIT_STACKLIMIT
.sp
This error is returned when a pattern that was successfully studied using JIT
@@ -2562,15 +2862,14 @@ documentation for more details.
.sp
PCRE2_ERROR_MATCHLIMIT
.sp
-The backtracking limit was reached.
+The backtracking match limit was reached.
.sp
PCRE2_ERROR_NOMEMORY
.sp
-If a pattern contains back references, but the ovector is not big enough to
-remember the referenced substrings, PCRE2 gets a block of memory at the start
-of matching to use for this purpose. There are some other special cases where
-extra memory is needed during matching. This error is given when memory cannot
-be obtained.
+If a pattern contains many nested backtracking points, heap memory is used to
+remember them. This error is given when the memory allocation function (default
+or custom) fails. Note that a different error, PCRE2_ERROR_HEAPLIMIT, is given
+if the amount of memory needed exceeds the heap limit.
.sp
PCRE2_ERROR_NULL
.sp
@@ -2586,10 +2885,6 @@ in the subject string. Some simple patterns that might do this are detected and
faulted at compile time, but more complicated cases, in particular mutual
recursions between two different subpatterns, cannot be detected until matching
is attempted.
-.sp
- PCRE2_ERROR_RECURSIONLIMIT
-.sp
-The internal recursion limit was reached.
.
.
.\" HTML <a name="geterrormessage"></a>
@@ -2604,8 +2899,8 @@ The internal recursion limit was reached.
A text message for an error code from any PCRE2 function (compile, match, or
auxiliary) can be obtained by calling \fBpcre2_get_error_message()\fP. The code
is passed as the first argument, with the remaining two arguments specifying a
-code unit buffer and its length, into which the text message is placed. Note
-that the message is returned in code units of the appropriate width for the
+code unit buffer and its length in code units, into which the text message is
+placed. The message is returned in code units of the appropriate width for the
library that is being used.
.P
The returned message is terminated with a trailing zero, and the function
@@ -2779,8 +3074,8 @@ calling \fBpcre2_substring_number_from_name()\fP. The first argument is the
compiled pattern, and the second is the name. The yield of the function is the
subpattern number, PCRE2_ERROR_NOSUBSTRING if there is no subpattern of that
name, or PCRE2_ERROR_NOUNIQUESUBSTRING if there is more than one subpattern of
-that name. Given the number, you can extract the substring directly, or use one
-of the functions described above.
+that name. Given the number, you can extract the substring directly from the
+ovector, or use one of the "bynumber" functions described above.
.P
For convenience, there are also "byname" functions that correspond to the
"bynumber" functions, the only difference being that the second argument is a
@@ -2855,12 +3150,12 @@ length is in code units, not bytes.
In the replacement string, which is interpreted as a UTF string in UTF mode,
and is checked for UTF validity unless the PCRE2_NO_UTF_CHECK option is set, a
dollar character is an escape character that can specify the insertion of
-characters from capturing groups or (*MARK) items in the pattern. The following
-forms are always recognized:
+characters from capturing groups or (*MARK), (*PRUNE), or (*THEN) items in the
+pattern. The following forms are always recognized:
.sp
$$ insert a dollar character
$<n> or ${<n>} insert the contents of group <n>
- $*MARK or ${*MARK} insert the name of the last (*MARK) encountered
+ $*MARK or ${*MARK} insert a (*MARK), (*PRUNE), or (*THEN) name
.sp
Either a group number or a group name can be given for <n>. Curly brackets are
required only if the following character would be interpreted as part of the
@@ -2868,24 +3163,41 @@ number or name. The number may be zero to include the entire matched string.
For example, if the pattern a(b)c is matched with "=abc=" and the replacement
string "+$1$0$1+", the result is "=+babcb+=".
.P
-The facility for inserting a (*MARK) name can be used to perform simple
-simultaneous substitutions, as this \fBpcre2test\fP example shows:
+$*MARK inserts the name from the last encountered (*MARK), (*PRUNE), or (*THEN)
+on the matching path that has a name. (*MARK) must always include a name, but
+(*PRUNE) and (*THEN) need not. For example, in the case of (*MARK:A)(*PRUNE)
+the name inserted is "A", but for (*MARK:A)(*PRUNE:B) the relevant name is "B".
+This facility can be used to perform simple simultaneous substitutions, as this
+\fBpcre2test\fP example shows:
.sp
- /(*:pear)apple|(*:orange)lemon/g,replace=${*MARK}
+ /(*MARK:pear)apple|(*MARK:orange)lemon/g,replace=${*MARK}
apple lemon
2: pear orange
.sp
As well as the usual options for \fBpcre2_match()\fP, a number of additional
-options can be set in the \fIoptions\fP argument.
+options can be set in the \fIoptions\fP argument of \fBpcre2_substitute()\fP.
.P
PCRE2_SUBSTITUTE_GLOBAL causes the function to iterate over the subject string,
-replacing every matching substring. If this is not set, only the first matching
-substring is replaced. If any matched substring has zero length, after the
-substitution has happened, an attempt to find a non-empty match at the same
-position is performed. If this is not successful, the current position is
-advanced by one character except when CRLF is a valid newline sequence and the
-next two characters are CR, LF. In this case, the current position is advanced
-by two characters.
+replacing every matching substring. If this option is not set, only the first
+matching substring is replaced. The search for matches takes place in the
+original subject string (that is, previous replacements do not affect it).
+Iteration is implemented by advancing the \fIstartoffset\fP value for each
+search, which is always passed the entire subject string. If an offset limit is
+set in the match context, searching stops when that limit is reached.
+.P
+You can restrict the effect of a global substitution to a portion of the
+subject string by setting either or both of \fIstartoffset\fP and an offset
+limit. Here is a \fBpcre2test\fP example:
+.sp
+ /B/g,replace=!,use_offset_limit
+ ABC ABC ABC ABC\e=offset=3,offset_limit=12
+ 2: ABC A!C A!C ABC
+.sp
+When continuing with global substitutions after matching a substring with zero
+length, an attempt to find a non-empty match at the same offset is performed.
+If this is not successful, the offset is advanced by one character except when
+CRLF is a valid newline sequence and the next two characters are CR, LF. In
+this case, the offset is advanced by two characters.
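The offset-advancing rule for empty matches described above can be sketched in C. This is a simplified illustration, not the library's implementation: advance_after_empty_match is a hypothetical name, and the sketch assumes a non-UTF 8-bit subject, so that one character is exactly one code unit.

```c
#include <stddef.h>

/* Hypothetical helper (not a PCRE2 function): given that an empty match was
   found at offset and no non-empty match exists there, return the offset at
   which the next search should start. Assumes a non-UTF 8-bit subject.
   If offset is already at or past the end, it is returned unchanged and the
   caller should stop iterating. */
static size_t advance_after_empty_match(const char *subject, size_t length,
                                        size_t offset, int crlf_is_newline)
{
    if (offset >= length) return offset;
    if (crlf_is_newline &&
        subject[offset] == '\r' &&
        offset + 1 < length &&
        subject[offset + 1] == '\n')
        return offset + 2;                    /* skip CR and LF together */
    return offset + 1;                        /* skip one character */
}
```

With the subject "a\er\enb" and an empty match at offset 1, the next search starts at offset 3 when CRLF is a valid newline sequence, and at offset 2 otherwise.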
.P
PCRE2_SUBSTITUTE_OVERFLOW_LENGTH changes what happens when the output buffer is
too small. The default action is to return PCRE2_ERROR_NOMEMORY immediately. If
@@ -2987,10 +3299,10 @@ default.
.P
PCRE2_ERROR_BADREPLACEMENT is used for miscellaneous syntax errors in the
replacement string, with more particular errors being PCRE2_ERROR_BADREPESCAPE
-(invalid escape sequence), PCRE2_ERROR_REPMISSING_BRACE (closing curly bracket
-not found), PCRE2_BADSUBSTITUTION (syntax error in extended group
-substitution), and PCRE2_BADSUBPATTERN (the pattern match ended before it
-started, which can happen if \eK is used in an assertion).
+(invalid escape sequence), PCRE2_ERROR_REPMISSINGBRACE (closing curly bracket
+not found), PCRE2_ERROR_BADSUBSTITUTION (syntax error in extended group
+substitution), and PCRE2_ERROR_BADSUBSPATTERN (the pattern match ended before
+it started, which can happen if \eK is used in an assertion).
.P
As for all PCRE2 errors, a text message that describes the error can be
obtained by calling the \fBpcre2_get_error_message()\fP function (see
@@ -3084,11 +3396,12 @@ other alternatives. Ultimately, when it runs out of matches,
.P
The function \fBpcre2_dfa_match()\fP is called to match a subject string
against a compiled pattern, using a matching algorithm that scans the subject
-string just once, and does not backtrack. This has different characteristics to
-the normal algorithm, and is not compatible with Perl. Some of the features of
-PCRE2 patterns are not supported. Nevertheless, there are times when this kind
-of matching can be useful. For a discussion of the two matching algorithms, and
-a list of features that \fBpcre2_dfa_match()\fP does not support, see the
+string just once (not counting lookaround assertions), and does not backtrack.
+This has different characteristics to the normal algorithm, and is not
+compatible with Perl. Some of the features of PCRE2 patterns are not supported.
+Nevertheless, there are times when this kind of matching can be useful. For a
+discussion of the two matching algorithms, and a list of features that
+\fBpcre2_dfa_match()\fP does not support, see the
.\" HREF
\fBpcre2matching\fP
.\"
@@ -3115,7 +3428,7 @@ Here is an example of a simple call to \fBpcre2_dfa_match()\fP:
11, /* the length of the subject string */
0, /* start at offset 0 in the subject */
0, /* default options */
- match_data, /* the match data block */
+ md, /* the match data block */
NULL, /* a match context; NULL means use defaults */
wspace, /* working space vector */
20); /* number of elements (NOT size in bytes) */
@@ -3124,11 +3437,11 @@ Here is an example of a simple call to \fBpcre2_dfa_match()\fP:
.rs
.sp
The unused bits of the \fIoptions\fP argument for \fBpcre2_dfa_match()\fP must
-be zero. The only bits that may be set are PCRE2_ANCHORED, PCRE2_NOTBOL,
-PCRE2_NOTEOL, PCRE2_NOTEMPTY, PCRE2_NOTEMPTY_ATSTART, PCRE2_NO_UTF_CHECK,
-PCRE2_PARTIAL_HARD, PCRE2_PARTIAL_SOFT, PCRE2_DFA_SHORTEST, and
-PCRE2_DFA_RESTART. All but the last four of these are exactly the same as for
-\fBpcre2_match()\fP, so their description is not repeated here.
+be zero. The only bits that may be set are PCRE2_ANCHORED, PCRE2_ENDANCHORED,
+PCRE2_NOTBOL, PCRE2_NOTEOL, PCRE2_NOTEMPTY, PCRE2_NOTEMPTY_ATSTART,
+PCRE2_NO_UTF_CHECK, PCRE2_PARTIAL_HARD, PCRE2_PARTIAL_SOFT, PCRE2_DFA_SHORTEST,
+and PCRE2_DFA_RESTART. All but the last four of these are exactly the same as
+for \fBpcre2_match()\fP, so their description is not repeated here.
.sp
PCRE2_PARTIAL_HARD
PCRE2_PARTIAL_SOFT
@@ -3222,7 +3535,7 @@ NOTE: PCRE2's "auto-possessification" optimization usually applies to character
repeats at the end of a pattern (as well as internally). For example, the
pattern "a\ed+" is compiled as if it were "a\ed++". For DFA matching, this
means that only one possible match is found. If you really do want multiple
-matches in such cases, either use an ungreedy repeat auch as "a\ed+?" or set
+matches in such cases, either use an ungreedy repeat such as "a\ed+?" or set
the PCRE2_NO_AUTO_POSSESS option when compiling.
.
.
@@ -3275,7 +3588,7 @@ fail, this error is given.
.sp
\fBpcre2build\fP(3), \fBpcre2callout\fP(3), \fBpcre2demo(3)\fP,
\fBpcre2matching\fP(3), \fBpcre2partial\fP(3), \fBpcre2posix\fP(3),
-\fBpcre2sample\fP(3), \fBpcre2stack\fP(3), \fBpcre2unicode\fP(3).
+\fBpcre2sample\fP(3), \fBpcre2unicode\fP(3).
.
.
.SH AUTHOR
@@ -3292,6 +3605,6 @@ Cambridge, England.
.rs
.sp
.nf
-Last updated: 17 June 2016
-Copyright (c) 1997-2016 University of Cambridge.
+Last updated: 31 December 2017
+Copyright (c) 1997-2017 University of Cambridge.
.fi
diff --git a/doc/pcre2build.3 b/doc/pcre2build.3
index 11b1c57..7586d22 100644
--- a/doc/pcre2build.3
+++ b/doc/pcre2build.3
@@ -1,4 +1,4 @@
-.TH PCRE2BUILD 3 "01 April 2016" "PCRE2 10.22"
+.TH PCRE2BUILD 3 "18 July 2017" "PCRE2 10.30"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.
@@ -55,21 +55,21 @@ running
.sp
./configure --help
.sp
-The following sections include descriptions of options whose names begin with
---enable or --disable. These settings specify changes to the defaults for the
-\fBconfigure\fP command. Because of the way that \fBconfigure\fP works,
---enable and --disable always come in pairs, so the complementary option always
-exists as well, but as it specifies the default, it is not described.
+The following sections include descriptions of "on/off" options whose names
+begin with --enable or --disable. Because of the way that \fBconfigure\fP
+works, --enable and --disable always come in pairs, so the complementary option
+always exists as well, but as it specifies the default, it is not described.
+Options that specify values have names that start with --with.
.
.
.SH "BUILDING 8-BIT, 16-BIT AND 32-BIT LIBRARIES"
.rs
.sp
By default, a library called \fBlibpcre2-8\fP is built, containing functions
-that take string arguments contained in vectors of bytes, interpreted either as
+that take string arguments contained in arrays of bytes, interpreted either as
single-byte characters, or UTF-8 strings. You can also build two other
libraries, called \fBlibpcre2-16\fP and \fBlibpcre2-32\fP, which process
-strings that are contained in vectors of 16-bit and 32-bit code units,
+strings that are contained in arrays of 16-bit and 32-bit code units,
respectively. These can be interpreted either as single-unit characters or
UTF-16/UTF-32 strings. To build these additional libraries, add one or both of
the following to the \fBconfigure\fP command:
@@ -119,10 +119,10 @@ Alternatively, patterns may be started with (*UTF) unless the application has
locked this out by setting PCRE2_NEVER_UTF.
.P
UTF support allows the libraries to process character code points up to
-0x10ffff in the strings that they handle. It also provides support for
-accessing the Unicode properties of such characters, using pattern escapes such
-as \eP, \ep, and \eX. Only the general category properties such as \fILu\fP and
-\fINd\fP are supported. Details are given in the
+0x10ffff in the strings that they handle. Unicode support also gives access to
+the Unicode properties of characters, using pattern escapes such as \eP, \ep,
+and \eX. Only the general category properties such as \fILu\fP and \fINd\fP are
+supported. Details are given in the
.\" HREF
\fBpcre2pattern\fP
.\"
@@ -151,13 +151,18 @@ out by setting the PCRE2_NEVER_BACKSLASH_C option when calling
.SH "JUST-IN-TIME COMPILER SUPPORT"
.rs
.sp
-Just-in-time compiler support is included in the build by specifying
+Just-in-time (JIT) compiler support is included in the build by specifying
.sp
--enable-jit
.sp
This support is available only for certain hardware architectures. If this
-option is set for an unsupported architecture, a building error occurs.
-See the
+option is set for an unsupported architecture, a building error occurs. If you
+are running under SELinux you may also want to add
+.sp
+ --enable-jit-sealloc
+.sp
+which enables the use of an execmem allocator in JIT that is compatible with
+SELinux. This has no effect if JIT is not enabled. See the
.\" HREF
\fBpcre2jit\fP
.\"
@@ -192,18 +197,22 @@ to the \fBconfigure\fP command. There is a fourth option, specified by
--enable-newline-is-anycrlf
.sp
which causes PCRE2 to recognize any of the three sequences CR, LF, or CRLF as
-indicating a line ending. Finally, a fifth option, specified by
+indicating a line ending. A fifth option, specified by
.sp
--enable-newline-is-any
.sp
causes PCRE2 to recognize any Unicode newline sequence. The Unicode newline
sequences are the three just mentioned, plus the single characters VT (vertical
tab, U+000B), FF (form feed, U+000C), NEL (next line, U+0085), LS (line
-separator, U+2028), and PS (paragraph separator, U+2029).
+separator, U+2028), and PS (paragraph separator, U+2029). The final option is
+.sp
+ --enable-newline-is-nul
+.sp
+which causes NUL (binary zero) to be set as the default line-ending character.
.P
Whatever default line ending convention is selected when PCRE2 is built can be
overridden by applications that use the library. At build time it is
-conventional to use the standard for your operating system.
+recommended to use the standard for your operating system.
.
.
.SH "WHAT \eR MATCHES"
@@ -217,7 +226,7 @@ specify
.sp
the default is changed so that \eR matches only CR, LF, or CRLF. Whatever is
selected when PCRE2 is built can be overridden by applications that use the
-called.
+library.
.
.
.SH "HANDLING VERY LARGE PATTERNS"
@@ -241,41 +250,13 @@ additional data when handling them. For the 32-bit library the value is always
4 and cannot be overridden; the value of --with-link-size is ignored.
.
.
-.SH "AVOIDING EXCESSIVE STACK USAGE"
-.rs
-.sp
-When matching with the \fBpcre2_match()\fP function, PCRE2 implements
-backtracking by making recursive calls to an internal function called
-\fBmatch()\fP. In environments where the size of the stack is limited, this can
-severely limit PCRE2's operation. (The Unix environment does not usually suffer
-from this problem, but it may sometimes be necessary to increase the maximum
-stack size. There is a discussion in the
-.\" HREF
-\fBpcre2stack\fP
-.\"
-documentation.) An alternative approach to recursion that uses memory from the
-heap to remember data, instead of using recursive function calls, has been
-implemented to work round the problem of limited stack size. If you want to
-build a version of PCRE2 that works this way, add
-.sp
- --disable-stack-for-recursion
-.sp
-to the \fBconfigure\fP command. By default, the system functions \fBmalloc()\fP
-and \fBfree()\fP are called to manage the heap memory that is required, but
-custom memory management functions can be called instead. PCRE2 runs noticeably
-more slowly when built in this way. This option affects only the
-\fBpcre2_match()\fP function; it is not relevant for \fBpcre2_dfa_match()\fP.
-.
-.
.SH "LIMITING PCRE2 RESOURCE USAGE"
.rs
.sp
-Internally, PCRE2 has a function called \fBmatch()\fP, which it calls
-repeatedly (sometimes recursively) when matching a pattern with the
-\fBpcre2_match()\fP function. By controlling the maximum number of times this
-function may be called during a single matching operation, a limit can be
-placed on the resources used by a single call to \fBpcre2_match()\fP. The limit
-can be changed at run time, as described in the
+The \fBpcre2_match()\fP function increments a counter each time it goes round
+its main loop. Putting a limit on this counter controls the amount of computing
+resource used by a single call to \fBpcre2_match()\fP. The limit can be changed
+at run time, as described in the
.\" HREF
\fBpcre2api\fP
.\"
@@ -284,19 +265,47 @@ setting such as
.sp
--with-match-limit=500000
.sp
-to the \fBconfigure\fP command. This setting has no effect on the
-\fBpcre2_dfa_match()\fP matching function.
+to the \fBconfigure\fP command. This setting also applies to the
+\fBpcre2_dfa_match()\fP matching function, and to JIT matching (though the
+counting is done differently).
.P
-In some environments it is desirable to limit the depth of recursive calls of
-\fBmatch()\fP more strictly than the total number of calls, in order to
-restrict the maximum amount of stack (or heap, if --disable-stack-for-recursion
-is specified) that is used. A second limit controls this; it defaults to the
-value that is set for --with-match-limit, which imposes no additional
-constraints. However, you can set a lower limit by adding, for example,
+The \fBpcre2_match()\fP function starts out using a 20K vector on the system
+stack to record backtracking points. The more nested backtracking points there
+are (that is, the deeper the search tree), the more memory is needed. If the
+initial vector is not large enough, heap memory is used, up to a certain limit,
+which is specified in kilobytes. The limit can be changed at run time, as
+described in the
+.\" HREF
+\fBpcre2api\fP
+.\"
+documentation. The default limit is 20 million kilobytes, which is in effect
+unlimited. You can change this by a setting such as
.sp
- --with-match-limit-recursion=10000
+ --with-heap-limit=500
.sp
-to the \fBconfigure\fP command. This value can also be overridden at run time.
+which limits the amount of heap to 500 kilobytes. This limit applies only to
+interpretive matching in \fBpcre2_match()\fP. It does not apply when JIT (which has
+its own memory arrangements) is used, nor does it apply to
+\fBpcre2_dfa_match()\fP.
+.P
+You can also explicitly limit the depth of nested backtracking in the
+\fBpcre2_match()\fP interpreter. This limit defaults to the value that is set
+for --with-match-limit. You can set a lower default limit by adding, for
+example,
+.sp
+ --with-match-limit-depth=10000
+.sp
+to the \fBconfigure\fP command. This value can be overridden at run time. This
+depth limit indirectly limits the amount of heap memory that is used, but
+because the size of each backtracking "frame" depends on the number of
+capturing parentheses in a pattern, the amount of heap that is used before the
+limit is reached varies from pattern to pattern. This limit was more useful in
+versions before 10.30, where function recursion was used for backtracking.
+.P
+As well as applying to \fBpcre2_match()\fP, the depth limit also controls
+the depth of recursive function calls in \fBpcre2_dfa_match()\fP. These are
+used for lookaround assertions, atomic groups, and recursion within patterns.
+The limit does not apply to JIT matching.
.
.
.SH "CREATING CHARACTER TABLES AT BUILD TIME"
@@ -312,10 +321,10 @@ only. If you add
to the \fBconfigure\fP command, the distributed tables are no longer used.
Instead, a program called \fBdftables\fP is compiled and run. This outputs the
source for a new set of tables, created in the default locale of your C run-time
-system. (This method of replacing the tables does not work if you are cross
+system. This method of replacing the tables does not work if you are cross
compiling, because \fBdftables\fP is run on the local host. If you need to
create alternative tables when cross compiling, you will have to do so "by
-hand".)
+hand".
.
.
.SH "USING EBCDIC CODE"
@@ -385,16 +394,19 @@ they are not.
.sp
\fBpcre2grep\fP uses an internal buffer to hold a "window" on the file it is
scanning, in order to be able to output "before" and "after" lines when it
-finds a match. The size of the buffer is controlled by a parameter whose
-default value is 20K. The buffer itself is three times this size, but because
-of the way it is used for holding "before" lines, the longest line that is
-guaranteed to be processable is the parameter size. You can change the default
-parameter value by adding, for example,
+finds a match. The starting size of the buffer is controlled by a parameter
+whose default value is 20K. The buffer itself is three times this size, but
+because of the way it is used for holding "before" lines, the longest line that
+is guaranteed to be processable is the parameter size. If a longer line is
+encountered, \fBpcre2grep\fP automatically expands the buffer, up to a
+specified maximum size, whose default is 1M or the starting size, whichever is
+the larger. You can change the default parameter values by adding, for example,
.sp
- --with-pcre2grep-bufsize=50K
+ --with-pcre2grep-bufsize=51200
+ --with-pcre2grep-max-bufsize=2097152
.sp
-to the \fBconfigure\fP command. The caller of \fPpcre2grep\fP can override this
-value by using --buffer-size on the command line.
+to the \fBconfigure\fP command. The caller of \fBpcre2grep\fP can override
+these values by using --buffer-size and --max-buffer-size on the command line.
.
.
.SH "PCRE2TEST OPTION FOR LIBREADLINE SUPPORT"
@@ -512,6 +524,44 @@ information about code coverage, see the \fBgcov\fP and \fBlcov\fP
documentation.
.
.
+.SH "SUPPORT FOR FUZZERS"
+.rs
+.sp
+There is a special option for use by people who want to run fuzzing tests on
+PCRE2:
+.sp
+ --enable-fuzz-support
+.sp
+At present this applies only to the 8-bit library. If set, it causes an extra
+library called libpcre2-fuzzsupport.a to be built, but not installed. This
+contains a single function called LLVMFuzzerTestOneInput() whose arguments are
+a pointer to a string and the length of the string. When called, this function
+tries to compile the string as a pattern, and if that succeeds, to match it.
+This is done both with no options and with some random options bits that are
+generated from the string.
+.P
+Setting --enable-fuzz-support also causes a binary called \fBpcre2fuzzcheck\fP
+to be created. This is normally run under valgrind or used when PCRE2 is
+compiled with address sanitizing enabled. It calls the fuzzing function and
+outputs information about what it is doing. The input strings are specified by
+arguments: if an argument starts with "=" the rest of it is a literal input
+string. Otherwise, it is assumed to be a file name, and the contents of the
+file are the test string.
+.
+.
+.SH "OBSOLETE OPTION"
+.rs
+.sp
+In versions of PCRE2 prior to 10.30, there were two ways of handling
+backtracking in the \fBpcre2_match()\fP function. The default was to use the
+system stack, but if
+.sp
+ --disable-stack-for-recursion
+.sp
+was set, memory on the heap was used. From release 10.30 onwards this has
+changed (the stack is no longer used) and this option now does nothing except
+give a warning.
+.
.SH "SEE ALSO"
.rs
.sp
@@ -532,6 +582,6 @@ Cambridge, England.
.rs
.sp
.nf
-Last updated: 01 April 2016
-Copyright (c) 1997-2016 University of Cambridge.
+Last updated: 18 July 2017
+Copyright (c) 1997-2017 University of Cambridge.
.fi
diff --git a/doc/pcre2callout.3 b/doc/pcre2callout.3
index 6919f5a..e3fd600 100644
--- a/doc/pcre2callout.3
+++ b/doc/pcre2callout.3
@@ -1,4 +1,4 @@
-.TH PCRE2CALLOUT 3 "23 March 2015" "PCRE2 10.20"
+.TH PCRE2CALLOUT 3 "22 December 2017" "PCRE2 10.31"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH SYNOPSIS
@@ -40,13 +40,22 @@ two callout points:
.sp
If the PCRE2_AUTO_CALLOUT option bit is set when a pattern is compiled, PCRE2
automatically inserts callouts, all with number 255, before each item in the
-pattern. For example, if PCRE2_AUTO_CALLOUT is used with the pattern
+pattern except for immediately before or after an explicit callout. For
+example, if PCRE2_AUTO_CALLOUT is used with the pattern
.sp
- A(\ed{2}|--)
+ A(?C3)B
.sp
it is processed as if it were
.sp
-(?C255)A(?C255)((?C255)\ed{2}(?C255)|(?C255)-(?C255)-(?C255))(?C255)
+ (?C255)A(?C3)B(?C255)
+.sp
+Here is a more complicated example:
+.sp
+ A(\ed{2}|--)
+.sp
+With PCRE2_AUTO_CALLOUT, this pattern is processed as if it were
+.sp
+ (?C255)A(?C255)((?C255)\ed{2}(?C255)|(?C255)-(?C255)-(?C255))(?C255)
.sp
Notice that there is a callout before and after each parenthesis and
alternation bar. If the pattern contains a conditional group whose condition is
@@ -91,10 +100,10 @@ with PCRE2_ANCHORED and PCRE2_AUTO_CALLOUT and then applied to the string
No match
.sp
This indicates that when matching [bc] fails, there is no backtracking into a+
-and therefore the callouts that would be taken for the backtracks do not occur.
-You can disable the auto-possessify feature by passing PCRE2_NO_AUTO_POSSESS to
-\fBpcre2_compile()\fP, or starting the pattern with (*NO_AUTO_POSSESS). In this
-case, the output changes to this:
+(because it is being treated as a++) and therefore the callouts that would be
+taken for the backtracks do not occur. You can disable the auto-possessify
+feature by passing PCRE2_NO_AUTO_POSSESS to \fBpcre2_compile()\fP, or starting
+the pattern with (*NO_AUTO_POSSESS). In this case, the output changes to this:
.sp
--->aaaa
+0 ^ a+
@@ -115,10 +124,13 @@ By default, an optimization is applied when .* is the first significant item in
a pattern. If PCRE2_DOTALL is set, so that the dot can match any character, the
pattern is automatically anchored. If PCRE2_DOTALL is not set, a match can
start only after an internal newline or at the beginning of the subject, and
-\fBpcre2_compile()\fP remembers this. This optimization is disabled, however,
-if .* is in an atomic group or if there is a back reference to the capturing
-group in which it appears. It is also disabled if the pattern contains (*PRUNE)
-or (*SKIP). However, the presence of callouts does not affect it.
+\fBpcre2_compile()\fP remembers this. If a pattern has more than one top-level
+branch, automatic anchoring occurs if all branches are anchorable.
+.P
+This optimization is disabled, however, if .* is in an atomic group or if there
+is a back reference to the capturing group in which it appears. It is also
+disabled if the pattern contains (*PRUNE) or (*SKIP). However, the presence of
+callouts does not affect it.
.P
For example, if the pattern .*\ed is compiled with PCRE2_AUTO_CALLOUT and
applied to the string "aa", the \fBpcre2test\fP output is:
@@ -148,9 +160,6 @@ pattern with (*NO_DOTSTAR_ANCHOR). In this case, the output changes to:
This shows more match attempts, starting at the second subject character.
Another optimization, described in the next section, means that there is no
subsequent attempt to match with an empty subject.
-.P
-If a pattern has more than one top-level branch, automatic anchoring occurs if
-all branches are anchorable.
.
.
.SS "Other optimizations"
@@ -166,9 +175,10 @@ subject string is "abyz", the lack of "d" means that matching doesn't ever
start, and the callout is never reached. However, with "abyd", though the
result is still no match, the callout is obeyed.
.P
-PCRE2 also knows the minimum length of a matching string, and will immediately
-give a "no match" return without actually running a match if the subject is not
-long enough, or, for unanchored patterns, if it has been scanned far enough.
+For most patterns PCRE2 also knows the minimum length of a matching string, and
+will immediately give a "no match" return without actually running a match if
+the subject is not long enough, or, for unanchored patterns, if it has been
+scanned far enough.
.P
You can disable these optimizations by passing the PCRE2_NO_START_OPTIMIZE
option to \fBpcre2_compile()\fP, or by starting the pattern with
@@ -181,20 +191,22 @@ callouts such as the example above are obeyed.
.rs
.sp
During matching, when PCRE2 reaches a callout point, if an external function is
-set in the match context, it is called. This applies to both normal and DFA
-matching. The first argument to the callout function is a pointer to a
-\fBpcre2_callout\fP block. The second argument is the void * callout data that
-was supplied when the callout was set up by calling \fBpcre2_set_callout()\fP
-(see the
+provided in the match context, it is called. This applies to normal,
+DFA, and JIT matching. The first argument to the callout function is a pointer
+to a \fBpcre2_callout\fP block. The second argument is the void * callout data
+that was supplied when the callout was set up by calling
+\fBpcre2_set_callout()\fP (see the
.\" HREF
\fBpcre2api\fP
.\"
-documentation). The callout block structure contains the following fields:
+documentation). The callout block structure contains the following fields, not
+necessarily in this order:
.sp
uint32_t \fIversion\fP;
uint32_t \fIcallout_number\fP;
uint32_t \fIcapture_top\fP;
uint32_t \fIcapture_last\fP;
+ uint32_t \fIcallout_flags\fP;
PCRE2_SIZE *\fIoffset_vector\fP;
PCRE2_SPTR \fImark\fP;
PCRE2_SPTR \fIsubject\fP;
@@ -208,11 +220,12 @@ documentation). The callout block structure contains the following fields:
PCRE2_SPTR \fIcallout_string\fP;
.sp
The \fIversion\fP field contains the version number of the block format. The
-current version is 1; the three callout string fields were added for this
-version. If you are writing an application that might use an earlier release of
-PCRE2, you should check the version number before accessing any of these
-fields. The version number will increase in future if more fields are added,
-but the intention is never to remove any of the existing fields.
+current version is 2; the three callout string fields were added for version 1,
+and the \fIcallout_flags\fP field for version 2. If you are writing an
+application that might use an earlier release of PCRE2, you should check the
+version number before accessing any of these fields. The version number will
+increase in future if more fields are added, but the intention is never to
+remove any of the existing fields.
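As a minimal sketch of the version check recommended above, the following C fragment reads \fIcallout_flags\fP only when the block format is new enough to contain it. The structure here is a cut-down stand-in with just the two fields the sketch needs, and both names are hypothetical; a real callout function receives the callout block type declared in pcre2.h.

```c
#include <stdint.h>

/* Stand-in for the two fields this sketch needs. A real callout function
   receives a pointer to the callout block declared in <pcre2.h>. */
typedef struct {
    uint32_t version;
    uint32_t callout_flags;   /* meaningful only when version >= 2 */
} sketch_callout_block;

/* Returns callout_flags when the block format provides it, otherwise 0. */
uint32_t sketch_flags_or_zero(const sketch_callout_block *cb)
{
    return (cb->version >= 2) ? cb->callout_flags : 0;
}
```

The same pattern would apply to the three callout string fields, with a test for version 1 or later.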
.
.
.SS "Fields for numerical callouts"
@@ -220,8 +233,8 @@ but the intention is never to remove any of the existing fields.
.sp
For a numerical callout, \fIcallout_string\fP is NULL, and \fIcallout_number\fP
contains the number of the callout, in the range 0-255. This is the number
-that follows (?C for manual callouts; it is 255 for automatically generated
-callouts.
+that follows (?C for callouts that are part of the pattern; it is 255 for
+automatically generated callouts.
.
.
.SS "Fields for string callouts"
@@ -250,12 +263,38 @@ need to report errors in the callout string within the pattern.
The remaining fields in the callout block are the same for both kinds of
callout.
.P
-The \fIoffset_vector\fP field is a pointer to the vector of capturing offsets
-(the "ovector") that was passed to the matching function in the match data
-block. When \fBpcre2_match()\fP is used, the contents can be inspected in
+The \fIoffset_vector\fP field is a pointer to a vector of capturing offsets
+(the "ovector"). You may read the elements in this vector, but you must not
+change any of them.
+.P
+For calls to \fBpcre2_match()\fP, the \fIoffset_vector\fP field is not (since
+release 10.30) a pointer to the actual ovector that was passed to the matching
+function in the match data block. Instead it points to an internal ovector of a
+size large enough to hold all possible captured substrings in the pattern. Note
+that whenever a recursion or subroutine call within a pattern completes, the
+capturing state is reset to what it was before.
+.P
+The \fIcapture_last\fP field contains the number of the most recently captured
+substring, and the \fIcapture_top\fP field contains one more than the number of
+the highest numbered captured substring so far. If no substrings have yet been
+captured, the value of \fIcapture_last\fP is 0 and the value of
+\fIcapture_top\fP is 1. The values of these fields do not always differ by one;
+for example, when the callout in the pattern ((a)(b))(?C2) is taken,
+\fIcapture_last\fP is 1 but \fIcapture_top\fP is 4.
+.P
+The contents of ovector[2] to ovector[<capture_top>*2-1] can be inspected in
order to extract substrings that have been matched so far, in the same way as
-for extracting substrings after a match has completed. For the DFA matching
-function, this field is not useful.
+extracting substrings after a match has completed. The values in ovector[0] and
+ovector[1] are always PCRE2_UNSET because the match is by definition not
+complete. Substrings that have not been captured but whose numbers are less
+than \fIcapture_top\fP also have both of their ovector slots set to
+PCRE2_UNSET.
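The indexing described above can be sketched as a loop over the pairs for groups 1 to \fIcapture_top\fP-1. This is an illustration only: the function name is hypothetical, size_t stands in for PCRE2_SIZE, and SKETCH_UNSET stands in for PCRE2_UNSET so that the fragment compiles without pcre2.h.

```c
#include <stddef.h>

/* Stand-in for PCRE2_UNSET, which <pcre2.h> defines as (~(PCRE2_SIZE)0);
   size_t plays the role of PCRE2_SIZE in this sketch. */
#define SKETCH_UNSET (~(size_t)0)

/* Counts the groups 1 .. capture_top-1 whose ovector pair is set, and stores
   the highest such group number in *highest_set (0 if there is none). */
size_t sketch_count_set_captures(const size_t *ovector, unsigned capture_top,
                                 unsigned *highest_set)
{
    size_t count = 0;
    *highest_set = 0;
    for (unsigned n = 1; n < capture_top; n++) {
        size_t start = ovector[2 * n];
        size_t end = ovector[2 * n + 1];
        if (start == SKETCH_UNSET || end == SKETCH_UNSET)
            continue;             /* group n has not captured anything */
        count++;
        *highest_set = n;
    }
    return count;
}
```

The loop starts at group 1 because the pair for group 0 is always unset during a callout, and it only reads the vector, as the text requires.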
+.P
+For DFA matching, the \fIoffset_vector\fP field points to the ovector that was
+passed to the matching function in the match data block, but it holds no useful
+information at callout time because \fBpcre2_dfa_match()\fP does not support
+substring capturing. The value of \fIcapture_top\fP is always 1 and the value
+of \fIcapture_last\fP is always 0 for DFA matching.
.P
The \fIsubject\fP and \fIsubject_length\fP fields contain copies of the values
that were passed to the matching function.
@@ -270,26 +309,19 @@ in the subject.
The \fIcurrent_position\fP field contains the offset within the subject of the
current match pointer.
.P
-When the \fBpcre2_match()\fP is used, the \fIcapture_top\fP field contains one
-more than the number of the highest numbered captured substring so far. If no
-substrings have been captured, the value of \fIcapture_top\fP is one. This is
-always the case when the DFA functions are used, because they do not support
-captured substrings.
-.P
-The \fIcapture_last\fP field contains the number of the most recently captured
-substring. However, when a recursion exits, the value reverts to what it was
-outside the recursion, as do the values of all captured substrings. If no
-substrings have been captured, the value of \fIcapture_last\fP is 0. This is
-always the case for the DFA matching functions.
-.P
The \fIpattern_position\fP field contains the offset in the pattern string to
the next item to be matched.
.P
The \fInext_item_length\fP field contains the length of the next item to be
-matched in the pattern string. When the callout immediately precedes an
-alternation bar, a closing parenthesis, or the end of the pattern, the length
-is zero. When the callout precedes an opening parenthesis, the length is that
-of the entire subpattern.
+processed in the pattern string. When the callout is at the end of the pattern,
+the length is zero. When the callout precedes an opening parenthesis, the
+length includes meta characters that follow the parenthesis. For example, in a
+callout before an assertion such as (?=ab) the length is 3. For an
+alternation bar or a closing parenthesis, the length is one, unless a closing
+parenthesis is followed by a quantifier, in which case its length is included.
+(This changed in release 10.23. In earlier releases, before an opening
+parenthesis the length was that of the entire subpattern, and before an
+alternation bar or a closing parenthesis the length was zero.)
.P
The \fIpattern_position\fP and \fInext_item_length\fP fields are intended to
help in distinguishing between different automatic callouts, which all have the
@@ -302,6 +334,33 @@ the zero-terminated name of the most recently passed (*MARK), (*PRUNE), or
(*THEN) item in the match, or NULL if no such items have been passed. Instances
of (*PRUNE) or (*THEN) without a name do not obliterate a previous (*MARK). In
callouts from the DFA matching function this field always contains NULL.
+.P
+The \fIcallout_flags\fP field is always zero in callouts from
+\fBpcre2_dfa_match()\fP or when JIT is being used. When \fBpcre2_match()\fP
+without JIT is used, the following bits may be set:
+.sp
+ PCRE2_CALLOUT_STARTMATCH
+.sp
+This is set for the first callout after the start of matching for each new
+starting position in the subject.
+.sp
+ PCRE2_CALLOUT_BACKTRACK
+.sp
+This is set if there has been a matching backtrack since the previous callout,
+or since the start of matching if this is the first callout from a
+\fBpcre2_match()\fP run.
+.P
+Both bits are set when a backtrack has caused a "bumpalong" to a new starting
+position in the subject. Output from \fBpcre2test\fP does not indicate the
+presence of these bits unless the \fBcallout_extra\fP modifier is set.
+.P
+The information in the \fBcallout_flags\fP field is provided so that
+applications can track and tell their users how matching with backtracking is
+done. This can be useful when trying to optimize patterns, or just to
+understand how PCRE2 works. There is no support in \fBpcre2_dfa_match()\fP
+because there is no backtracking in DFA matching, and there is no support in
+JIT because JIT is all about maximizing matching performance. In both these
+cases the \fBcallout_flags\fP field is always zero.
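A callout function that wants to report these bits might decode them along the following lines. This is a sketch under stated assumptions: the two SKETCH_ macros are stand-ins whose values are assumed for illustration, and the function name is hypothetical; real code should test PCRE2_CALLOUT_STARTMATCH and PCRE2_CALLOUT_BACKTRACK from pcre2.h.

```c
#include <stdint.h>

/* Stand-ins for PCRE2_CALLOUT_STARTMATCH and PCRE2_CALLOUT_BACKTRACK. The
   values are assumed for this sketch; real code should use the macros
   from <pcre2.h>. */
#define SKETCH_CALLOUT_STARTMATCH 0x00000001u
#define SKETCH_CALLOUT_BACKTRACK  0x00000002u

/* Maps the flag bits to a short description for logging. */
const char *sketch_describe_flags(uint32_t callout_flags)
{
    int start = (callout_flags & SKETCH_CALLOUT_STARTMATCH) != 0;
    int back = (callout_flags & SKETCH_CALLOUT_BACKTRACK) != 0;
    if (start && back)
        return "first callout for a starting position, after a backtrack";
    if (start)
        return "first callout for a starting position";
    if (back)
        return "backtrack since the previous callout";
    return "no flags set";
}
```

Because the field is always zero for DFA and JIT matching, such a function would report "no flags set" for every callout in those modes.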
.
.
.SH "RETURN VALUES FROM CALLOUTS"
@@ -382,6 +441,6 @@ Cambridge, England.
.rs
.sp
.nf
-Last updated: 23 March 2015
-Copyright (c) 1997-2015 University of Cambridge.
+Last updated: 22 December 2017
+Copyright (c) 1997-2017 University of Cambridge.
.fi
diff --git a/doc/pcre2compat.3 b/doc/pcre2compat.3
index a3306d7..8094ebd 100644
--- a/doc/pcre2compat.3
+++ b/doc/pcre2compat.3
@@ -1,4 +1,4 @@
-.TH PCRE2COMPAT 3 "15 March 2015" "PCRE2 10.20"
+.TH PCRE2COMPAT 3 "18 April 2017" "PCRE2 10.30"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH "DIFFERENCES BETWEEN PCRE2 AND PERL"
@@ -6,7 +6,8 @@ PCRE2 - Perl-compatible regular expressions (revised API)
.sp
This document describes the differences in the ways that PCRE2 and Perl handle
regular expressions. The differences described here are with respect to Perl
-versions 5.10 and above.
+version 5.26, but as both Perl and PCRE2 are continually changing, the
+information may sometimes be out of date.
.P
1. PCRE2 has only a subset of Perl's Unicode support. Details of what it does
have are given in the
@@ -15,16 +16,17 @@ have are given in the
.\"
page.
.P
-2. PCRE2 allows repeat quantifiers only on parenthesized assertions, but they
-do not mean what you might think. For example, (?!a){3} does not assert that
-the next three characters are not "a". It just asserts that the next character
-is not "a" three times (in principle: PCRE2 optimizes this to run the assertion
-just once). Perl allows repeat quantifiers on other assertions such as \eb, but
-these do not seem to have any use.
+2. Like Perl, PCRE2 allows repeat quantifiers on parenthesized assertions, but
+they do not mean what you might think. For example, (?!a){3} does not assert
+that the next three characters are not "a". It just asserts that the next
+character is not "a" three times (in principle: PCRE2 optimizes this to run the
+assertion just once). Perl allows some repeat quantifiers on other assertions,
+for example, \eb* (but not \eb{3}), but these do not seem to have any use.
.P
-3. Capturing subpatterns that occur inside negative lookahead assertions are
-counted, but their entries in the offsets vector are never set. Perl sometimes
-(but not always) sets its numerical variables from inside negative assertions.
+3. Capturing subpatterns that occur inside negative lookaround assertions are
+counted, but their entries in the offsets vector are set only when a negative
+assertion is a condition that has a matching branch (that is, the condition is
+false).
.P
4. The following Perl escape sequences are not supported: \el, \eu, \eL,
\eU, and \eN when followed by a character name or Unicode value. (\eN on its
@@ -35,13 +37,13 @@ generated by default. However, if the PCRE2_ALT_BSUX option is set,
\eU and \eu are interpreted as ECMAScript interprets them.
.P
5. The Perl escape sequences \ep, \eP, and \eX are supported only if PCRE2 is
-built with Unicode support. The properties that can be tested with \ep and \eP
-are limited to the general category properties such as Lu and Nd, script names
-such as Greek or Han, and the derived properties Any and L&. PCRE2 does support
-the Cs (surrogate) property, which Perl does not; the Perl documentation says
-"Because Perl hides the need for the user to understand the internal
-representation of Unicode characters, there is no need to implement the
-somewhat messy concept of surrogates."
+built with Unicode support (the default). The properties that can be tested
+with \ep and \eP are limited to the general category properties such as Lu and
+Nd, script names such as Greek or Han, and the derived properties Any and L&.
+PCRE2 does support the Cs (surrogate) property, which Perl does not; the Perl
+documentation says "Because Perl hides the need for the user to understand the
+internal representation of Unicode characters, there is no need to implement
+the somewhat messy concept of surrogates."
.P
6. PCRE2 does support the \eQ...\eE escape for quoting substrings. Characters
in between are treated as literals. This is slightly different from Perl in
@@ -60,29 +62,16 @@ Note the following examples:
The \eQ...\eE sequence is recognized both inside and outside character classes.
.P
7. Fairly obviously, PCRE2 does not support the (?{code}) and (??{code})
-constructions. However, there is support for recursive patterns. This is not
-available in Perl 5.8, but it is in Perl 5.10. Also, the PCRE2 "callout"
-feature allows an external function to be called during pattern matching. See
-the
+constructions. However, PCRE2 has a "callout" feature, which
+allows an external function to be called during pattern matching. See the
.\" HREF
\fBpcre2callout\fP
.\"
documentation for details.
.P
-8. Subroutine calls (whether recursive or not) are treated as atomic groups.
-Atomic recursion is like Python, but unlike Perl. Captured values that are set
-outside a subroutine call can be referenced from inside in PCRE2, but not in
-Perl. There is a discussion that explains these differences in more detail in
-the
-.\" HTML <a href="pcre2pattern.html#recursiondifference">
-.\" </a>
-section on recursion differences from Perl
-.\"
-in the
-.\" HREF
-\fBpcre2pattern\fP
-.\"
-page.
+8. Subroutine calls (whether recursive or not) were treated as atomic groups up
+to PCRE2 release 10.23, but from release 10.30 this changed, and backtracking
+into subroutine calls is now supported, as in Perl.
.P
9. If any of the backtracking control verbs are used in a subpattern that is
called as a subroutine (whether or not recursively), their effect is confined
@@ -96,7 +85,7 @@ processed as anchored at the point where they are tested.
one that is backtracked onto acts. For example, in the pattern
A(*COMMIT)B(*PRUNE)C a failure in B triggers (*COMMIT), but a failure in C
triggers (*PRUNE). Perl's behaviour is more complex; in many cases it is the
-same as PCRE2, but there are examples where it differs.
+same as PCRE2, but there are cases where it differs.
.P
11. Most backtracking verbs in assertions have their normal actions. They are
not confined to the assertion.
@@ -109,17 +98,18 @@ the pattern /^(a(b)?)+$/ in Perl leaves $2 unset, but in PCRE2 it is set to
13. PCRE2's handling of duplicate subpattern numbers and duplicate subpattern
names is not as general as Perl's. This is a consequence of the fact the PCRE2
works internally just with numbers, using an external table to translate
-between numbers and names. In particular, a pattern such as (?|(?<a>A)|(?<b)B),
+between numbers and names. In particular, a pattern such as (?|(?<a>A)|(?<b>B)),
where the two capturing parentheses have the same number but different names,
is not supported, and causes an error at compile time. If it were allowed, it
would not be possible to distinguish which parentheses matched, because both
names map to capturing subpattern number 1. To avoid this confusing situation,
an error is given at compile time.
.P
-14. Perl recognizes comments in some places that PCRE2 does not, for example,
-between the ( and ? at the start of a subpattern. If the /x modifier is set,
-Perl allows white space between ( and ? (though current Perls warn that this is
-deprecated) but PCRE2 never does, even if the PCRE2_EXTENDED option is set.
+14. Perl used to recognize comments in some places that PCRE2 does not, for
+example, between the ( and ? at the start of a subpattern. If the /x modifier
+is set, Perl allowed white space between ( and ? though the latest Perls give
+an error (for a while it was just deprecated). There may still be some cases
+where Perl behaves differently.
.P
15. Perl, when in warning mode, gives warnings for character classes such as
[A-\ed] or [a-[:digit:]]. It then treats the hyphens as literals. PCRE2 has no
@@ -129,46 +119,65 @@ certainly user mistakes.
16. In PCRE2, the upper/lower case character properties Lu and Ll are not
affected when case-independent matching is specified. For example, \ep{Lu}
always matches an upper case letter. I think Perl has changed in this respect;
-in the release at the time of writing (5.16), \ep{Lu} and \ep{Ll} match all
+in the release at the time of writing (5.24), \ep{Lu} and \ep{Ll} match all
letters, regardless of case, when case independence is specified.
.P
17. PCRE2 provides some extensions to the Perl regular expression facilities.
Perl 5.10 includes new features that are not in earlier versions of Perl, some
-of which (such as named parentheses) have been in PCRE2 for some time. This
-list is with respect to Perl 5.10:
+of which (such as named parentheses) were in PCRE2 for some time before. This
+list is with respect to Perl 5.26:
.sp
(a) Although lookbehind assertions in PCRE2 must match fixed length strings,
each alternative branch of a lookbehind assertion can match a different length
of string. Perl requires them all to have the same length.
.sp
-(b) If PCRE2_DOLLAR_ENDONLY is set and PCRE2_MULTILINE is not set, the $
+(b) From PCRE2 10.23, back references to groups of fixed length are supported
+in lookbehinds, provided that there is no possibility of referencing a
+non-unique number or name. Perl does not support backreferences in lookbehinds.
+.sp
+(c) If PCRE2_DOLLAR_ENDONLY is set and PCRE2_MULTILINE is not set, the $
meta-character matches only at the very end of the string.
.sp
-(c) A backslash followed by a letter with no special meaning is faulted. (Perl
+(d) A backslash followed by a letter with no special meaning is faulted. (Perl
can be made to issue a warning.)
.sp
-(d) If PCRE2_UNGREEDY is set, the greediness of the repetition quantifiers is
+(e) If PCRE2_UNGREEDY is set, the greediness of the repetition quantifiers is
inverted, that is, by default they are not greedy, but if followed by a
question mark they are.
.sp
-(e) PCRE2_ANCHORED can be used at matching time to force a pattern to be tried
+(f) PCRE2_ANCHORED can be used at matching time to force a pattern to be tried
only at the first matching position in the subject string.
.sp
-(f) The PCRE2_NOTBOL, PCRE2_NOTEOL, PCRE2_NOTEMPTY, PCRE2_NOTEMPTY_ATSTART, and
-PCRE2_NO_AUTO_CAPTURE options have no Perl equivalents.
+(g) The PCRE2_NOTBOL, PCRE2_NOTEOL, PCRE2_NOTEMPTY and PCRE2_NOTEMPTY_ATSTART
+options have no Perl equivalents.
.sp
-(g) The \eR escape sequence can be restricted to match only CR, LF, or CRLF
+(h) The \eR escape sequence can be restricted to match only CR, LF, or CRLF
by the PCRE2_BSR_ANYCRLF option.
.sp
-(h) The callout facility is PCRE2-specific.
+(i) The callout facility is PCRE2-specific. Perl supports codeblocks and
+variable interpolation, but not general hooks on every match.
.sp
-(i) The partial matching facility is PCRE2-specific.
+(j) The partial matching facility is PCRE2-specific.
.sp
-(j) The alternative matching function (\fBpcre2_dfa_match()\fP matches in a
+(k) The alternative matching function (\fBpcre2_dfa_match()\fP) matches in a
different way and is not Perl-compatible.
.sp
-(k) PCRE2 recognizes some special sequences such as (*CR) at the start of
-a pattern that set overall options that cannot be changed within the pattern.
+(l) PCRE2 recognizes some special sequences such as (*CR) or (*NO_JIT) at
+the start of a pattern that set overall options that cannot be changed within
+the pattern.
+.P
+18. The Perl /a modifier restricts /d numbers to pure ASCII, and the /aa
+modifier restricts /i case-insensitive matching to pure ASCII, ignoring Unicode
+rules. This separation cannot be represented with PCRE2_UCP.
+.P
+19. Perl has different limits than PCRE2. See the
+.\" HREF
+\fBpcre2limit\fP
+.\"
+documentation for details. In release 5.10, Perl changed from recursion to
+iteration, keeping the intermediate matches on the heap, which is ~10% slower
+but cannot hit a stack-overflow limit. PCRE2 made a similar change at release
+10.30, and also has many build-time and run-time customizable limits.
.
.
.SH AUTHOR
@@ -185,6 +194,6 @@ Cambridge, England.
.rs
.sp
.nf
-Last updated: 15 March 2015
-Copyright (c) 1997-2015 University of Cambridge.
+Last updated: 18 April 2017
+Copyright (c) 1997-2017 University of Cambridge.
.fi
diff --git a/doc/pcre2convert.3 b/doc/pcre2convert.3
new file mode 100644
index 0000000..3dadf6e
--- /dev/null
+++ b/doc/pcre2convert.3
@@ -0,0 +1,163 @@
+.TH PCRE2CONVERT 3 "12 July 2017" "PCRE2 10.30"
+.SH NAME
+PCRE2 - Perl-compatible regular expressions (revised API)
+.SH "EXPERIMENTAL PATTERN CONVERSION FUNCTIONS"
+.rs
+.sp
+This document describes a set of functions that can be used to convert
+"foreign" patterns into PCRE2 regular expressions. This facility is currently
+experimental, and may be changed in future releases. Two kinds of pattern,
+globs and POSIX patterns, are supported.
+.
+.
+.SH "THE CONVERT CONTEXT"
+.rs
+.sp
+.nf
+.B pcre2_convert_context *pcre2_convert_context_create(
+.B " pcre2_general_context *\fIgcontext\fP);"
+.sp
+.B pcre2_convert_context *pcre2_convert_context_copy(
+.B " pcre2_convert_context *\fIcvcontext\fP);"
+.sp
+.B void pcre2_convert_context_free(pcre2_convert_context *\fIcvcontext\fP);
+.sp
+.B int pcre2_set_glob_escape(pcre2_convert_context *\fIcvcontext\fP,
+.B " uint32_t \fIescape_char\fP);"
+.sp
+.B int pcre2_set_glob_separator(pcre2_convert_context *\fIcvcontext\fP,
+.B " uint32_t \fIseparator_char\fP);"
+.fi
+.sp
+A convert context is used to hold parameters that affect the way that pattern
+conversion works. Like all PCRE2 contexts, you need to use a context only if
+you want to override the defaults. There are the usual create, copy, and free
+functions. If custom memory management functions are set in a general context
+that is passed to \fBpcre2_convert_context_create()\fP, they are used for all
+memory management within the conversion functions.
+.P
+There are only two parameters in the convert context at present. Both apply
+only to glob conversions. The escape character defaults to grave accent under
+Windows, otherwise backslash. It can be set to zero, meaning no escape
+character, or to any punctuation character with a code point less than 256.
+The separator character defaults to backslash under Windows, otherwise forward
+slash. It can be set to forward slash, backslash, or dot.
+.P
+The two setting functions return zero on success, or PCRE2_ERROR_BADDATA if
+their second argument is invalid.
+.
+.
+.SH "THE CONVERSION FUNCTION"
+.rs
+.sp
+.nf
+.B int pcre2_pattern_convert(PCRE2_SPTR \fIpattern\fP, PCRE2_SIZE \fIlength\fP,
+.B " uint32_t \fIoptions\fP, PCRE2_UCHAR **\fIbuffer\fP,"
+.B " PCRE2_SIZE *\fIblength\fP, pcre2_convert_context *\fIcvcontext\fP);"
+.sp
+.B void pcre2_converted_pattern_free(PCRE2_UCHAR *\fIconverted_pattern\fP);
+.fi
+.sp
+The first two arguments of \fBpcre2_pattern_convert()\fP define the foreign
+pattern that is to be converted. The length may be given as
+PCRE2_ZERO_TERMINATED. The \fBoptions\fP argument defines how the pattern is to
+be processed. If the input is UTF, the PCRE2_CONVERT_UTF option should be set.
+PCRE2_CONVERT_NO_UTF_CHECK may also be set if you are sure the input is valid.
+One or more of the glob options, or one of the following POSIX options must be
+set to define the type of conversion that is required:
+.sp
+ PCRE2_CONVERT_GLOB
+ PCRE2_CONVERT_GLOB_NO_WILD_SEPARATOR
+ PCRE2_CONVERT_GLOB_NO_STARSTAR
+ PCRE2_CONVERT_POSIX_BASIC
+ PCRE2_CONVERT_POSIX_EXTENDED
+.sp
+Details of the conversions are given below. The \fBbuffer\fP and \fBblength\fP
+arguments define how the output is handled:
+.P
+If \fBbuffer\fP is NULL, the function just returns the length of the converted
+pattern via \fBblength\fP. This is one less than the length of buffer needed,
+because a terminating zero is always added to the output.
+.P
+If \fBbuffer\fP points to a NULL pointer, an output buffer is obtained using
+the allocator in the context or \fBmalloc()\fP if no context is supplied. A
+pointer to this buffer is placed in the variable to which \fBbuffer\fP points.
+When no longer needed the output buffer must be freed by calling
+\fBpcre2_converted_pattern_free()\fP.
+.P
+If \fBbuffer\fP points to a non-NULL pointer, \fBblength\fP must be set to the
+actual length of the buffer provided (in code units).
+.P
+In all cases, after successful conversion, the variable pointed to by
+\fBblength\fP is updated to the length actually used (in code units), excluding
+the terminating zero that is always added.
+.P
+If an error occurs, the length (via \fBblength\fP) is set to the offset
+within the input pattern where the error was detected. Only gross syntax errors
+are caught; there are plenty of errors that will get passed on for
+\fBpcre2_compile()\fP to discover.
+.P
+The return from \fBpcre2_pattern_convert()\fP is zero on success or a non-zero
+PCRE2 error code. Note that PCRE2 error codes may be positive or negative:
+\fBpcre2_compile()\fP uses mostly positive codes and \fBpcre2_match()\fP
+negative ones; \fBpcre2_pattern_convert()\fP uses existing codes of both kinds. A
+textual error message can be obtained by calling
+\fBpcre2_get_error_message()\fP.
+.
+.
+.SH "CONVERTING GLOBS"
+.rs
+.sp
+Globs are used to match file names, and consequently have the concept of a
+"path separator", which defaults to backslash under Windows and forward slash
+otherwise. If PCRE2_CONVERT_GLOB is set, the wildcards * and ? are not
+permitted to match separator characters, but the double-star (**) feature
+(which does match separators) is supported.
+.P
+PCRE2_CONVERT_GLOB_NO_WILD_SEPARATOR matches globs with wildcards allowed to
+match separator characters. PCRE2_CONVERT_GLOB_NO_STARSTAR matches globs with the
+double-star feature disabled. These options may be given together.
+.
+.
+.SH "CONVERTING POSIX PATTERNS"
+.rs
+.sp
+POSIX defines two kinds of regular expression pattern: basic and extended.
+These can be processed by setting PCRE2_CONVERT_POSIX_BASIC or
+PCRE2_CONVERT_POSIX_EXTENDED, respectively.
+.P
+In POSIX patterns, backslash is not special in a character class. Unmatched
+closing parentheses are treated as literals.
+.P
+In basic patterns, ? + | {} and () must be escaped to be recognized
+as metacharacters outside a character class. If the first character in the
+pattern is * it is treated as a literal. ^ is a metacharacter only at the start
+of a branch.
+.P
+In extended patterns, a backslash not in a character class always
+makes the next character literal, whatever it is. There are no backreferences.
+.P
+Note: POSIX mandates that the longest possible match at the first matching
+position must be found. This is not what \fBpcre2_match()\fP does; it yields
+the first match that is found. An application can use \fBpcre2_dfa_match()\fP
+to find the longest match, but that does not support backreferences (but then
+neither do POSIX extended patterns).
+.
+.
+.SH AUTHOR
+.rs
+.sp
+.nf
+Philip Hazel
+University Computing Service
+Cambridge, England.
+.fi
+.
+.
+.SH REVISION
+.rs
+.sp
+.nf
+Last updated: 12 July 2017
+Copyright (c) 1997-2017 University of Cambridge.
+.fi
diff --git a/doc/pcre2demo.3 b/doc/pcre2demo.3
index c02dcd9..a9e58e2 100644
--- a/doc/pcre2demo.3
+++ b/doc/pcre2demo.3
@@ -228,6 +228,21 @@ pcre2_match_data_create_from_pattern() above. */
if (rc == 0)
printf("ovector was not big enough for all the captured substrings\en");
+/* We must guard against patterns such as /(?=.\eK)/ that use \eK in an assertion
+to set the start of a match later than its end. In this demonstration program,
+we just detect this case and give up. */
+
+if (ovector[0] > ovector[1])
+ {
+ printf("\e\eK was used in an assertion to set the match start after its end.\en"
+ "From end to start the match was: %.*s\en", (int)(ovector[0] - ovector[1]),
+ (char *)(subject + ovector[1]));
+ printf("Run abandoned\en");
+ pcre2_match_data_free(match_data);
+ pcre2_code_free(re);
+ return 1;
+ }
+
/* Show substrings stored in the output vector by number. Obviously, in a real
application you might want to do things other than print them. */
@@ -355,6 +370,29 @@ for (;;)
options = PCRE2_NOTEMPTY_ATSTART | PCRE2_ANCHORED;
}
+ /* If the previous match was not an empty string, there is one tricky case to
+ consider. If a pattern contains \eK within a lookbehind assertion at the
+ start, the end of the matched string can be at the offset where the match
+ started. Without special action, this leads to a loop that keeps on matching
+ the same substring. We must detect this case and arrange to move the start on
+ by one character. The pcre2_get_startchar() function returns the starting
+ offset that was passed to pcre2_match(). */
+
+ else
+ {
+ PCRE2_SIZE startchar = pcre2_get_startchar(match_data);
+ if (start_offset <= startchar)
+ {
+ if (startchar >= subject_length) break; /* Reached end of subject. */
+ start_offset = startchar + 1; /* Advance by one character. */
+ if (utf8) /* If UTF-8, it may be more */
+ { /* than one code unit. */
+ for (; start_offset < subject_length; start_offset++)
+ if ((subject[start_offset] & 0xc0) != 0x80) break;
+ }
+ }
+ }
+
/* Run the next matching operation */
rc = pcre2_match(
@@ -419,6 +457,21 @@ for (;;)
if (rc == 0)
printf("ovector was not big enough for all the captured substrings\en");
+ /* We must guard against patterns such as /(?=.\eK)/ that use \eK in an
+ assertion to set the start of a match later than its end. In this
+ demonstration program, we just detect this case and give up. */
+
+ if (ovector[0] > ovector[1])
+ {
+ printf("\e\eK was used in an assertion to set the match start after its end.\en"
+ "From end to start the match was: %.*s\en", (int)(ovector[0] - ovector[1]),
+ (char *)(subject + ovector[1]));
+ printf("Run abandoned\en");
+ pcre2_match_data_free(match_data);
+ pcre2_code_free(re);
+ return 1;
+ }
+
/* As before, show substrings stored in the output vector by number, and then
also any named substrings. */
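The start-offset adjustment that this change adds to the pcre2demo matching loop can be read in isolation. The following is a minimal sketch, not code taken from PCRE2: the helper name advance_one_char and its signature are assumptions made for illustration. It mirrors the demo's logic of stepping on by one code unit and then, in UTF-8 mode, skipping any continuation bytes (those of the form 10xxxxxx) so that the new offset lands on a character boundary.

```c
#include <stddef.h>

/* Hypothetical helper (not part of the PCRE2 API): return the offset of the
   next character start after `offset` in a subject of `length` code units.
   If `utf8` is non-zero the subject is assumed to be valid UTF-8. */
size_t advance_one_char(const unsigned char *subject, size_t length,
                        size_t offset, int utf8)
{
  if (offset >= length) return length;   /* Already at the end of subject. */
  offset++;                              /* Advance by one code unit. */
  if (utf8)
    {
    /* A character may occupy more than one code unit: skip the
       continuation bytes, which all have the top two bits set to 10. */
    while (offset < length && (subject[offset] & 0xc0) == 0x80) offset++;
    }
  return offset;
}
```

In the demo itself the equivalent steps are written inline and applied to start_offset, with a break out of the loop when the end of the subject has been reached; a separate function is used here only to make the behaviour easy to check.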
diff --git a/doc/pcre2grep.1 b/doc/pcre2grep.1
index 6d27780..5e5cbea 100644
--- a/doc/pcre2grep.1
+++ b/doc/pcre2grep.1
@@ -1,4 +1,4 @@
-.TH PCRE2GREP 1 "19 June 2016" "PCRE2 10.22"
+.TH PCRE2GREP 1 "13 November 2017" "PCRE2 10.31"
.SH NAME
pcre2grep - a grep with Perl-compatible regular expressions.
.SH SYNOPSIS
@@ -52,11 +52,18 @@ span line boundaries. What defines a line boundary is controlled by the
\fB-N\fP (\fB--newline\fP) option.
.P
The amount of memory used for buffering files that are being scanned is
-controlled by a parameter that can be set by the \fB--buffer-size\fP option.
-The default value for this parameter is specified when \fBpcre2grep\fP is
-built, with the default default being 20K. A block of memory three times this
-size is used (to allow for buffering "before" and "after" lines). An error
-occurs if a line overflows the buffer.
+controlled by parameters that can be set by the \fB--buffer-size\fP and
+\fB--max-buffer-size\fP options. The first of these sets the size of buffer
+that is obtained at the start of processing. If an input file contains very
+long lines, a larger buffer may be needed; this is handled by automatically
+extending the buffer, up to the limit specified by \fB--max-buffer-size\fP. The
+default values for these parameters are specified when \fBpcre2grep\fP is
+built, with the default defaults being 20K and 1M respectively. An error occurs
+if a line is too long and the buffer can no longer be expanded.
+.P
+The block of memory that is actually used is three times the "buffer size", to
+allow for buffering "before" and "after" lines. If the buffer size is too
+small, fewer than requested "before" and "after" lines may be output.
.P
Patterns can be no longer than 8K or BUFSIZ bytes, whichever is the greater.
BUFSIZ is defined in \fB<stdio.h>\fP. When there is more than one pattern
@@ -94,27 +101,31 @@ The \fB--locale\fP option can be used to override this.
.rs
.sp
It is possible to compile \fBpcre2grep\fP so that it uses \fBlibz\fP or
-\fBlibbz2\fP to read files whose names end in \fB.gz\fP or \fB.bz2\fP,
-respectively. You can find out whether your binary has support for one or both
-of these file types by running it with the \fB--help\fP option. If the
-appropriate support is not present, files are treated as plain text. The
-standard input is always so treated.
+\fBlibbz2\fP to read compressed files whose names end in \fB.gz\fP or
+\fB.bz2\fP, respectively. You can find out whether your \fBpcre2grep\fP binary
+has support for one or both of these file types by running it with the
+\fB--help\fP option. If the appropriate support is not present, all files are
+treated as plain text. The standard input is always so treated. When input is
+from a compressed .gz or .bz2 file, the \fB--line-buffered\fP option is
+ignored.
.
.
.SH "BINARY FILES"
.rs
.sp
By default, a file that contains a binary zero byte within the first 1024 bytes
-is identified as a binary file, and is processed specially. (GNU grep also
-identifies binary files in this manner.) See the \fB--binary-files\fP option
-for a means of changing the way binary files are handled.
+is identified as a binary file, and is processed specially. (GNU grep
+identifies binary files in this manner.) However, if the newline type is
+specified as "nul", that is, the line terminator is a binary zero, the test for
+a binary file is not applied. See the \fB--binary-files\fP option for a means
+of changing the way binary files are handled.
.
.
.SH OPTIONS
.rs
.sp
The order in which some of the options appear can affect the output. For
-example, both the \fB-h\fP and \fB-l\fP options affect the printing of file
+example, both the \fB-H\fP and \fB-l\fP options affect the printing of file
names. Whichever comes later in the command line will be the one that takes
effect. Similarly, except where noted below, if an option is given twice, the
later setting is used. Numerical values for options may be followed by K or M,
@@ -126,24 +137,27 @@ command line starts with a hyphen but is not an option. This allows for the
processing of patterns and file names that start with hyphens.
.TP
\fB-A\fP \fInumber\fP, \fB--after-context=\fP\fInumber\fP
-Output \fInumber\fP lines of context after each matching line. If file names
-and/or line numbers are being output, a hyphen separator is used instead of a
-colon for the context lines. A line containing "--" is output between each
-group of lines, unless they are in fact contiguous in the input file. The value
-of \fInumber\fP is expected to be relatively small. However, \fBpcre2grep\fP
-guarantees to have up to 8K of following text available for context output.
+Output up to \fInumber\fP lines of context after each matching line. Fewer
+lines are output if the next match or the end of the file is reached, or if the
+processing buffer size has been set too small. If file names and/or line
+numbers are being output, a hyphen separator is used instead of a colon for the
+context lines. A line containing "--" is output between each group of lines,
+unless they are in fact contiguous in the input file. The value of \fInumber\fP
+is expected to be relatively small. When \fB-c\fP is used, \fB-A\fP is ignored.
.TP
\fB-a\fP, \fB--text\fP
Treat binary files as text. This is equivalent to
\fB--binary-files\fP=\fItext\fP.
.TP
\fB-B\fP \fInumber\fP, \fB--before-context=\fP\fInumber\fP
-Output \fInumber\fP lines of context before each matching line. If file names
-and/or line numbers are being output, a hyphen separator is used instead of a
-colon for the context lines. A line containing "--" is output between each
-group of lines, unless they are in fact contiguous in the input file. The value
-of \fInumber\fP is expected to be relatively small. However, \fBpcre2grep\fP
-guarantees to have up to 8K of preceding text available for context output.
+Output up to \fInumber\fP lines of context before each matching line. Fewer
+lines are output if the previous match or the start of the file is within
+\fInumber\fP lines, or if the processing buffer size has been set too small. If
+file names and/or line numbers are being output, a hyphen separator is used
+instead of a colon for the context lines. A line containing "--" is output
+between each group of lines, unless they are in fact contiguous in the input
+file. The value of \fInumber\fP is expected to be relatively small. When
+\fB-c\fP is used, \fB-B\fP is ignored.
.TP
\fB--binary-files=\fP\fIword\fP
Specify how binary files are to be processed. If the word is "binary" (the
@@ -158,8 +172,9 @@ be of interest and are skipped without causing any output or affecting the
return code.
.TP
\fB--buffer-size=\fP\fInumber\fP
-Set the parameter that controls how much memory is used for buffering files
-that are being scanned.
+Set the parameter that controls how much memory is obtained at the start of
+processing for buffering files that are being scanned. See also
+\fB--max-buffer-size\fP below.
.TP
\fB-C\fP \fInumber\fP, \fB--context=\fP\fInumber\fP
Output \fInumber\fP lines of context both before and after each matching line.
@@ -167,13 +182,15 @@ This is equivalent to setting both \fB-A\fP and \fB-B\fP to the same value.
.TP
\fB-c\fP, \fB--count\fP
Do not output lines from the files that are being scanned; instead output the
-number of matches (or non-matches if \fB-v\fP is used) that would otherwise
-have caused lines to be shown. By default, this count is the same as the number
-of suppressed lines, but if the \fB-M\fP (multiline) option is used (without
-\fB-v\fP), there may be more suppressed lines than the number of matches.
+number of lines that would have been shown, either because they matched, or, if
+\fB-v\fP is set, because they failed to match. By default, this count is
+exactly the same as the number of lines that would have been output, but if the
+\fB-M\fP (multiline) option is used (without \fB-v\fP), there may be more
+suppressed lines than the count (that is, the number of matches).
.sp
If no lines are selected, the number zero is output. If several files are are
-being scanned, a count is output for each of them. However, if the
+being scanned, a count is output for each of them and the \fB-t\fP option can
+be used to cause a total to be output at the end. However, if the
\fB--files-with-matches\fP option is also used, only those files whose counts
are greater than zero are listed. When \fB-c\fP is used, the \fB-A\fP,
\fB-B\fP, and \fB-C\fP options are ignored.
@@ -192,12 +209,22 @@ connected to a terminal. More resources are used when colouring is enabled,
because \fBpcre2grep\fP has to search for all possible matches in a line, not
just one, in order to colour them all.
.sp
-The colour that is used can be specified by setting the environment variable
-PCRE2GREP_COLOUR or PCRE2GREP_COLOR. The value of this variable should be a
-string of two numbers, separated by a semicolon. They are copied directly into
-the control string for setting colour on a terminal, so it is your
-responsibility to ensure that they make sense. If neither of the environment
-variables is set, the default is "1;31", which gives red.
+The colour that is used can be specified by setting one of the environment
+variables PCRE2GREP_COLOUR, PCRE2GREP_COLOR, PCREGREP_COLOUR, or
+PCREGREP_COLOR, which are checked in that order. If none of these are set,
+\fBpcre2grep\fP looks for GREP_COLORS or GREP_COLOR (in that order). The value
+of the variable should be a string of two numbers, separated by a semicolon,
+except in the case of GREP_COLORS, which must start with "ms=" or "mt="
+followed by two semicolon-separated colours, terminated by the end of the
+string or by a colon. If GREP_COLORS does not start with "ms=" or "mt=" it is
+ignored, and GREP_COLOR is checked.
+.sp
+If the string obtained from one of the above variables contains any characters
+other than semicolon or digits, the setting is ignored and the default colour
+is used. The string is copied directly into the control string for setting
+colour on a terminal, so it is your responsibility to ensure that the values
+make sense. If no relevant environment variable is set, the default is "1;31",
+which gives red.
.TP
\fB-D\fP \fIaction\fP, \fB--devices=\fP\fIaction\fP
If an input path is not a regular file or a directory, "action" specifies how
@@ -213,6 +240,9 @@ compatibility with GNU grep), "recurse" (equivalent to the \fB-r\fP option), or
operating systems the effect of reading a directory like this is an immediate
end-of-file; in others it may provoke an error.
.TP
+\fB--depth-limit\fP=\fInumber\fP
+See \fB--match-limit\fP below.
+.TP
\fB-e\fP \fIpattern\fP, \fB--regex=\fP\fIpattern\fP, \fB--regexp=\fP\fIpattern\fP
Specify a pattern to be matched. This option can be used multiple times in
order to specify several patterns. It can also be used as a way of specifying a
@@ -273,17 +303,17 @@ files; it does not apply to patterns specified by any of the \fB--include\fP or
\fB--exclude\fP options.
.TP
\fB-f\fP \fIfilename\fP, \fB--file=\fP\fIfilename\fP
-Read patterns from the file, one per line, and match them against
-each line of input. What constitutes a newline when reading the file is the
-operating system's default. The \fB--newline\fP option has no effect on this
-option. Trailing white space is removed from each line, and blank lines are
-ignored. An empty file contains no patterns and therefore matches nothing. See
-also the comments about multiple patterns versus a single pattern with
-alternatives in the description of \fB-e\fP above.
-.sp
-If this option is given more than once, all the specified files are
-read. A data line is output if any of the patterns match it. A file name can
-be given as "-" to refer to the standard input. When \fB-f\fP is used, patterns
+Read patterns from the file, one per line, and match them against each line of
+input. What constitutes a newline when reading the file is the operating
+system's default. The \fB--newline\fP option has no effect on this option.
+Trailing white space is removed from each line, and blank lines are ignored. An
+empty file contains no patterns and therefore matches nothing. See also the
+comments about multiple patterns versus a single pattern with alternatives in
+the description of \fB-e\fP above.
+.sp
+If this option is given more than once, all the specified files are read. A
+data line is output if any of the patterns match it. A file name can be given
+as "-" to refer to the standard input. When \fB-f\fP is used, patterns
specified on the command line using \fB-e\fP may also be present; they are
tested before the file's patterns. However, no other pattern is taken from the
command line; all arguments are treated as the names of paths to be searched.
@@ -304,8 +334,8 @@ Instead of showing lines or parts of lines that match, show each match as an
offset from the start of the file and a length, separated by a comma. In this
mode, no context is shown. That is, the \fB-A\fP, \fB-B\fP, and \fB-C\fP
options are ignored. If there is more than one match in a line, each of them is
-shown separately. This option is mutually exclusive with \fB--line-offsets\fP
-and \fB--only-matching\fP.
+shown separately. This option is mutually exclusive with \fB--output\fP,
+\fB--line-offsets\fP, and \fB--only-matching\fP.
.TP
\fB-H\fP, \fB--with-filename\fP
Force the inclusion of the file name at the start of output lines when
@@ -313,13 +343,18 @@ searching a single file. By default, the file name is not shown in this case.
For matching lines, the file name is followed by a colon; for context lines, a
hyphen separator is used. If a line number is also being output, it follows the
file name. When the \fB-M\fP option causes a pattern to match more than one
-line, only the first is preceded by the file name.
+line, only the first is preceded by the file name. This option overrides any
+previous \fB-h\fP, \fB-l\fP, or \fB-L\fP options.
.TP
\fB-h\fP, \fB--no-filename\fP
Suppress the output of file names when searching multiple files. By default,
file names are shown when multiple files are searched. For matching lines, the
file name is followed by a colon; for context lines, a hyphen separator is used.
-If a line number is also being output, it follows the file name.
+If a line number is also being output, it follows the file name. This option
+overrides any previous \fB-H\fP, \fB-L\fP, or \fB-l\fP options.
+.TP
+\fB--heap-limit\fP=\fInumber\fP
+See \fB--match-limit\fP below.
.TP
\fB--help\fP
Output a help message, giving brief details of the command options and file
@@ -365,16 +400,18 @@ given any number of times. If a directory matches both \fB--include-dir\fP and
\fB-L\fP, \fB--files-without-match\fP
Instead of outputting lines from the files, just output the names of the files
that do not contain any lines that would have been output. Each file name is
-output once, on a separate line.
+output once, on a separate line. This option overrides any previous \fB-H\fP,
+\fB-h\fP, or \fB-l\fP options.
.TP
\fB-l\fP, \fB--files-with-matches\fP
Instead of outputting lines from the files, just output the names of the files
-containing lines that would have been output. Each file name is output
-once, on a separate line. Searching normally stops as soon as a matching line
-is found in a file. However, if the \fB-c\fP (count) option is also used,
-matching continues in order to obtain the correct count, and those files that
-have at least one match are listed along with their counts. Using this option
-with \fB-c\fP is a way of suppressing the listing of files with no matches.
+containing lines that would have been output. Each file name is output once, on
+a separate line. Searching normally stops as soon as a matching line is found
+in a file. However, if the \fB-c\fP (count) option is also used, matching
+continues in order to obtain the correct count, and those files that have at
+least one match are listed along with their counts. Using this option with
+\fB-c\fP is a way of suppressing the listing of files with no matches. This
+option overrides any previous \fB-H\fP, \fB-h\fP, or \fB-L\fP options.
.TP
\fB--label\fP=\fIname\fP
This option supplies a name to be used for the standard input when file names
@@ -382,14 +419,16 @@ are being output. If not supplied, "(standard input)" is used. There is no
short form for this option.
.TP
\fB--line-buffered\fP
-When this option is given, input is read and processed line by line, and the
-output is flushed after each write. By default, input is read in large chunks,
-unless \fBpcre2grep\fP can determine that it is reading from a terminal (which
-is currently possible only in Unix-like environments). Output to terminal is
-normally automatically flushed by the operating system. This option can be
-useful when the input or output is attached to a pipe and you do not want
-\fBpcre2grep\fP to buffer up large amounts of data. However, its use will
-affect performance, and the \fB-M\fP (multiline) option ceases to work.
+When this option is given, non-compressed input is read and processed line by
+line, and the output is flushed after each write. By default, input is read in
+large chunks, unless \fBpcre2grep\fP can determine that it is reading from a
+terminal (which is currently possible only in Unix-like environments). Output
+to terminal is normally automatically flushed by the operating system. This
+option can be useful when the input or output is attached to a pipe and you do
+not want \fBpcre2grep\fP to buffer up large amounts of data. However, its use
+will affect performance, and the \fB-M\fP (multiline) option ceases to work.
+When input is from a compressed .gz or .bz2 file, \fB--line-buffered\fP is
+ignored.
.TP
\fB--line-offsets\fP
Instead of showing lines or parts of lines that match, show each match as a
@@ -398,7 +437,8 @@ number is terminated by a colon (as usual; see the \fB-n\fP option), and the
offset and length are separated by a comma. In this mode, no context is shown.
That is, the \fB-A\fP, \fB-B\fP, and \fB-C\fP options are ignored. If there is
more than one match in a line, each of them is shown separately. This option is
-mutually exclusive with \fB--file-offsets\fP and \fB--only-matching\fP.
+mutually exclusive with \fB--output\fP, \fB--file-offsets\fP, and
+\fB--only-matching\fP.
.TP
\fB--locale\fP=\fIlocale-name\fP
This option specifies a locale to be used for pattern matching. It overrides
@@ -407,46 +447,51 @@ locale is specified, the PCRE2 library's default (usually the "C" locale) is
used. There is no short form for this option.
.TP
\fB--match-limit\fP=\fInumber\fP
-Processing some regular expression patterns can require a very large amount of
-memory, leading in some cases to a program crash if not enough is available.
-Other patterns may take a very long time to search for all possible matching
-strings. The \fBpcre2_match()\fP function that is called by \fBpcre2grep\fP to
-do the matching has two parameters that can limit the resources that it uses.
-.sp
-The \fB--match-limit\fP option provides a means of limiting resource usage
-when processing patterns that are not going to match, but which have a very
-large number of possibilities in their search trees. The classic example is a
-pattern that uses nested unlimited repeats. Internally, PCRE2 uses a function
-called \fBmatch()\fP which it calls repeatedly (sometimes recursively). The
-limit set by \fB--match-limit\fP is imposed on the number of times this
-function is called during a match, which has the effect of limiting the amount
-of backtracking that can take place.
-.sp
-The \fB--recursion-limit\fP option is similar to \fB--match-limit\fP, but
-instead of limiting the total number of times that \fBmatch()\fP is called, it
-limits the depth of recursive calls, which in turn limits the amount of memory
-that can be used. The recursion depth is a smaller number than the total number
-of calls, because not all calls to \fBmatch()\fP are recursive. This limit is
-of use only if it is set smaller than \fB--match-limit\fP.
+Processing some regular expression patterns may take a very long time to search
+for all possible matching strings. Others may require a very large amount of
+memory. There are three options that set resource limits for matching.
+.sp
+The \fB--match-limit\fP option provides a means of limiting computing resource
+usage when processing patterns that are not going to match, but which have a
+very large number of possibilities in their search trees. The classic example
+is a pattern that uses nested unlimited repeats. Internally, PCRE2 has a
+counter that is incremented each time around its main processing loop. If the
+value set by \fB--match-limit\fP is reached, an error occurs.
+.sp
+The \fB--heap-limit\fP option specifies, as a number of kilobytes, the amount
+of heap memory that may be used for matching. Heap memory is needed only if
+matching the pattern requires a significant number of nested backtracking
+points to be remembered. This parameter can be set to zero to forbid the use of
+heap memory altogether.
+.sp
+The \fB--depth-limit\fP option limits the depth of nested backtracking points,
+which indirectly limits the amount of memory that is used. The amount of memory
+needed for each backtracking point depends on the number of capturing
+parentheses in the pattern, so the amount of memory that is used before this
+limit acts varies from pattern to pattern. This limit is of use only if it is
+set smaller than \fB--match-limit\fP.
.sp
There are no short forms for these options. The default settings are specified
-when the PCRE2 library is compiled, with the default default being 10 million.
+when the PCRE2 library is compiled, with the default defaults being very large
+and so effectively unlimited.
+.TP
+\fB--max-buffer-size\fP=\fInumber\fP
+This limits the expansion of the processing buffer, whose initial size can be
+set by \fB--buffer-size\fP. The maximum buffer size is silently forced to be no
+smaller than the starting buffer size.
.TP
\fB-M\fP, \fB--multiline\fP
-Allow patterns to match more than one line. When this option is given, patterns
-may usefully contain literal newline characters and internal occurrences of ^
-and $ characters. The output for a successful match may consist of more than
-one line. The first is the line in which the match started, and the last is the
-line in which the match ended. If the matched string ends with a newline
-sequence the output ends at the end of that line.
-.sp
-When this option is set, the PCRE2 library is called in "multiline" mode. This
-allows a matched string to extend past the end of a line and continue on one or
-more subsequent lines. However, \fBpcre2grep\fP still processes the input line
-by line. Once a match has been handled, scanning restarts at the beginning of
-the next line, just as it does when \fB-M\fP is not present. This means that it
-is possible for the second or subsequent lines in a multiline match to be
-output again as part of another match.
+Allow patterns to match more than one line. When this option is set, the PCRE2
+library is called in "multiline" mode. This allows a matched string to extend
+past the end of a line and continue on one or more subsequent lines. Patterns
+used with \fB-M\fP may usefully contain literal newline characters and internal
+occurrences of ^ and $ characters. The output for a successful match may
+consist of more than one line. The first line is the line in which the match
+started, and the last line is the line in which the match ended. If the matched
+string ends with a newline sequence, the output ends at the end of that line.
+If \fB-v\fP is set, none of the lines in a multi-line match are output. Once a
+match has been handled, scanning restarts at the beginning of the line after
+the one in which the match ended.
.sp
The newline sequence that separates multiple lines must be matched as part of
the pattern. For example, to find the phrase "regular expression" in a file
@@ -460,11 +505,8 @@ and is followed by + so as to match trailing white space on the first line as
well as possibly handling a two-character newline sequence.
.sp
There is a limit to the number of lines that can be matched, imposed by the way
-that \fBpcre2grep\fP buffers the input file as it scans it. However,
-\fBpcre2grep\fP ensures that at least 8K characters or the rest of the file
-(whichever is the shorter) are available for forward matching, and similarly
-the previous 8K characters (or all the previous characters, if fewer than 8K)
-are guaranteed to be available for lookbehind assertions. The \fB-M\fP option
+that \fBpcre2grep\fP buffers the input file as it scans it. With a sufficiently
+large processing buffer, this should not be a problem, but the \fB-M\fP option
does not work when input is read line by line (see \fB--line-buffered\fP).
.TP
\fB-N\fP \fInewline-type\fP, \fB--newline\fP=\fInewline-type\fP
@@ -503,16 +545,41 @@ was explicitly disabled at build time. This option can be used to disable the
use of JIT at run time. It is provided for testing and working round problems.
It should never be needed in normal use.
.TP
+\fB-O\fP \fItext\fP, \fB--output\fP=\fItext\fP
+When there is a match, instead of outputting the whole line that matched,
+output just the given text. This option is mutually exclusive with
+\fB--only-matching\fP, \fB--file-offsets\fP, and \fB--line-offsets\fP. Escape
+sequences starting with a dollar character may be used to insert the contents
+of the matched part of the line and/or captured substrings into the text.
+.sp
+$<digits> or ${<digits>} is replaced by the captured
+substring of the given decimal number; zero substitutes the whole match. If
+the number is greater than the number of capturing substrings, or if the
+capture is unset, the replacement is empty.
+.sp
+$a is replaced by bell; $b by backspace; $e by escape; $f by form feed; $n by
+newline; $r by carriage return; $t by tab; $v by vertical tab.
+.sp
+$o<digits> is replaced by the character represented by the given octal
+number; up to three digits are processed.
+.sp
+$x<digits> is replaced by the character represented by the given hexadecimal
+number; up to two digits are processed.
+.sp
+Any other character is substituted by itself. In particular, $$ is replaced by
+a single dollar.
+.TP
\fB-o\fP, \fB--only-matching\fP
Show only the part of the line that matched a pattern instead of the whole
line. In this mode, no context is shown. That is, the \fB-A\fP, \fB-B\fP, and
\fB-C\fP options are ignored. If there is more than one match in a line, each
-of them is shown separately. If \fB-o\fP is combined with \fB-v\fP (invert the
-sense of the match to find non-matching lines), no output is generated, but the
-return code is set appropriately. If the matched portion of the line is empty,
-nothing is output unless the file name or line number are being printed, in
-which case they are shown on an otherwise empty line. This option is mutually
-exclusive with \fB--file-offsets\fP and \fB--line-offsets\fP.
+of them is shown separately, on a separate line of output. If \fB-o\fP is
+combined with \fB-v\fP (invert the sense of the match to find non-matching
+lines), no output is generated, but the return code is set appropriately. If
+the matched portion of the line is empty, nothing is output unless the file
+name or line number are being printed, in which case they are shown on an
+otherwise empty line. This option is mutually exclusive with \fB--output\fP,
+\fB--file-offsets\fP and \fB--line-offsets\fP.
.TP
\fB-o\fP\fInumber\fP, \fB--only-matching\fP=\fInumber\fP
Show only the part of the line that matched the capturing parentheses of the
@@ -520,14 +587,15 @@ given number. Up to 32 capturing parentheses are supported, and -o0 is
equivalent to \fB-o\fP without a number. Because these options can be given
without an argument (see above), if an argument is present, it must be given in
the same shell item, for example, -o3 or --only-matching=2. The comments given
-for the non-argument case above also apply to this case. If the specified
+for the non-argument case above also apply to this option. If the specified
capturing parentheses do not exist in the pattern, or were not set in the
match, nothing is output unless the file name or line number are being output.
.sp
-If this option is given multiple times, multiple substrings are output, in the
-order the options are given. For example, -o3 -o1 -o3 causes the substrings
-matched by capturing parentheses 3 and 1 and then 3 again to be output. By
-default, there is no separator (but see the next option).
+If this option is given multiple times, multiple substrings are output for each
+match, in the order the options are given, and all on one line. For example,
+-o3 -o1 -o3 causes the substrings matched by capturing parentheses 3 and 1 and
+then 3 again to be output. By default, there is no separator (but see the next
+option).
.TP
\fB--om-separator\fP=\fItext\fP
Specify a separating string for multiple occurrences of \fB-o\fP. The default
@@ -552,6 +620,17 @@ Suppress error messages about non-existent or unreadable files. Such files are
quietly skipped. However, the return code is still 2, even if matches were
found in other files.
.TP
+\fB-t\fP, \fB--total-count\fP
+This option is useful when scanning more than one file. If used on its own,
+\fB-t\fP suppresses all output except for a grand total number of matching
+lines (or non-matching lines if \fB-v\fP is used) in all the files. If \fB-t\fP
+is used with \fB-c\fP, a grand total is output except when the previous output
+is just one line. In other words, it is not output when just one file's count
+is listed. If file names are being output, the grand total is preceded by
+"TOTAL:". Otherwise, it appears as just another number. The \fB-t\fP option is
+ignored when used with \fB-L\fP (list files without matches), because the grand
+total would always be zero.
+.TP
\fB-u\fP, \fB--utf-8\fP
Operate in UTF-8 mode. This option is available only if PCRE2 has been compiled
with UTF-8 support. All patterns (including those for any \fB--exclude\fP and
@@ -568,16 +647,18 @@ Invert the sense of the match, so that lines which do \fInot\fP match any of
the patterns are the ones that are found.
.TP
\fB-w\fP, \fB--word-regex\fP, \fB--word-regexp\fP
-Force the patterns to match only whole words. This is equivalent to having \eb
-at the start and end of the pattern. This option applies only to the patterns
-that are matched against the contents of files; it does not apply to patterns
-specified by any of the \fB--include\fP or \fB--exclude\fP options.
+Force the patterns only to match "words". That is, there must be a word
+boundary at the start and end of each matched string. This is equivalent to
+having "\eb(?:" at the start of each pattern, and ")\eb" at the end. This
+option applies only to the patterns that are matched against the contents of
+files; it does not apply to patterns specified by any of the \fB--include\fP or
+\fB--exclude\fP options.
.TP
\fB-x\fP, \fB--line-regex\fP, \fB--line-regexp\fP
-Force the patterns to be anchored (each must start matching at the beginning of
-a line) and in addition, require them to match entire lines. This is equivalent
-to having ^ and $ characters at the start and end of each alternative top-level
-branch in every pattern. This option applies only to the patterns that are
+Force the patterns to start matching only at the beginnings of lines, and in
+addition, require them to match entire lines. In multiline mode the match may
+be more than one line. This is equivalent to having "^(?:" at the start of each
+pattern and ")$" at the end. This option applies only to the patterns that are
matched against the contents of files; it does not apply to patterns specified
by any of the \fB--include\fP or \fB--exclude\fP options.
.
@@ -612,10 +693,11 @@ relying on the C I/O library to convert this to an appropriate sequence.
Many of the short and long forms of \fBpcre2grep\fP's options are the same
as in the GNU \fBgrep\fP program. Any long option of the form
\fB--xxx-regexp\fP (GNU terminology) is also available as \fB--xxx-regex\fP
-(PCRE2 terminology). However, the \fB--file-list\fP, \fB--file-offsets\fP,
-\fB--include-dir\fP, \fB--line-offsets\fP, \fB--locale\fP, \fB--match-limit\fP,
-\fB-M\fP, \fB--multiline\fP, \fB-N\fP, \fB--newline\fP, \fB--om-separator\fP,
-\fB--recursion-limit\fP, \fB-u\fP, and \fB--utf-8\fP options are specific to
+(PCRE2 terminology). However, the \fB--depth-limit\fP, \fB--file-list\fP,
+\fB--file-offsets\fP, \fB--heap-limit\fP, \fB--include-dir\fP,
+\fB--line-offsets\fP, \fB--locale\fP, \fB--match-limit\fP, \fB-M\fP,
+\fB--multiline\fP, \fB-N\fP, \fB--newline\fP, \fB--om-separator\fP,
+\fB--output\fP, \fB-u\fP, and \fB--utf-8\fP options are specific to
\fBpcre2grep\fP, as is the use of the \fB--only-matching\fP option with a
capturing parentheses number.
.P
@@ -658,14 +740,14 @@ options does have data, it must be given in the first form, using an equals
character. Otherwise \fBpcre2grep\fP will assume that it has no data.
.
.
-.SH "CALLING EXTERNAL SCRIPTS"
+.SH "USING PCRE2'S CALLOUT FACILITY"
.rs
.sp
-On non-Windows systems, \fBpcre2grep\fP has, by default, support for calling
-external programs or scripts during matching by making use of PCRE2's callout
-facility. However, this support can be disabled when \fBpcre2grep\fP is built.
-You can find out whether your binary has support for callouts by running it
-with the \fB--help\fP option. If the support is not enabled, all callouts in
+\fBpcre2grep\fP has, by default, support for calling external programs or
+scripts or echoing specific strings during matching by making use of PCRE2's
+callout facility. However, this support can be disabled when \fBpcre2grep\fP is
+built. You can find out whether your binary has support for callouts by running
+it with the \fB--help\fP option. If the support is not enabled, all callouts in
patterns are ignored by \fBpcre2grep\fP.
.P
A callout in a PCRE2 pattern is of the form (?C<arg>) where the argument is
@@ -673,10 +755,17 @@ either a number or a quoted string (see the
.\" HREF
\fBpcre2callout\fP
.\"
-documentation for details). Numbered callouts are ignored by \fBpcre2grep\fP.
-String arguments are parsed as a list of substrings separated by pipe (vertical
-bar) characters. The first substring must be an executable name, with the
-following substrings specifying arguments:
+documentation for details). Numbered callouts are ignored by \fBpcre2grep\fP;
+only callouts with string arguments are useful.
+.
+.
+.SS "Calling external programs or scripts"
+.rs
+.sp
+If the callout string does not start with a pipe (vertical bar) character, it
+is parsed into a list of substrings separated by pipe characters. The first
+substring must be an executable name, with the following substrings specifying
+arguments:
.sp
executable_name|arg1|arg2|...
.sp
@@ -710,6 +799,19 @@ the non-existence of the executable), a local matching failure occurs and the
matcher backtracks in the normal way.
.
.
+.SS "Echoing a specific string"
+.rs
+.sp
+If the callout string starts with a pipe (vertical bar) character, the rest of
+the string is written to the output, having been passed through the same escape
+processing as text from the --output option. This provides a simple echoing
+facility that avoids calling an external program or script. No terminator is
+added to the string, so if you want a newline, you must include it explicitly.
+Matching continues normally after the string is output. If you want to see only
+the callout output but not any output from an actual match, you should end the
+relevant pattern with (*FAIL).
+.
+.
.SH "MATCHING ERRORS"
.rs
.sp
@@ -722,9 +824,9 @@ message and the line that caused the problem to the standard error stream. If
there are more than 20 such errors, \fBpcre2grep\fP gives up.
.P
The \fB--match-limit\fP option of \fBpcre2grep\fP can be used to set the
-overall resource limit; there is a second option called \fB--recursion-limit\fP
-that sets a limit on the amount of memory (usually stack) that is used (see the
-discussion of these options above).
+overall resource limit. There are also other limits that affect the amount of
+memory used during matching; see the discussion of \fB--heap-limit\fP and
+\fB--depth-limit\fP above.
.
.
.SH DIAGNOSTICS
@@ -735,6 +837,9 @@ for syntax errors, overlong lines, non-existent or inaccessible files (even if
matches were found in other files) or too many matching errors. Using the
\fB-s\fP option to suppress error messages about inaccessible files does not
affect the return code.
+.P
+When run under VMS, the return code is placed in the symbol PCRE2GREP_RC
+because VMS does not distinguish between exit(0) and exit(1).
.
.
.SH "SEE ALSO"
@@ -757,6 +862,6 @@ Cambridge, England.
.rs
.sp
.nf
-Last updated: 19 June 2016
-Copyright (c) 1997-2016 University of Cambridge.
+Last updated: 13 November 2017
+Copyright (c) 1997-2017 University of Cambridge.
.fi
diff --git a/doc/pcre2grep.txt b/doc/pcre2grep.txt
index 31aa610..30517b4 100644
--- a/doc/pcre2grep.txt
+++ b/doc/pcre2grep.txt
@@ -51,61 +51,73 @@ DESCRIPTION
boundary is controlled by the -N (--newline) option.
The amount of memory used for buffering files that are being scanned is
- controlled by a parameter that can be set by the --buffer-size option.
- The default value for this parameter is specified when pcre2grep is
- built, with the default default being 20K. A block of memory three
- times this size is used (to allow for buffering "before" and "after"
- lines). An error occurs if a line overflows the buffer.
-
- Patterns can be no longer than 8K or BUFSIZ bytes, whichever is the
- greater. BUFSIZ is defined in <stdio.h>. When there is more than one
+ controlled by parameters that can be set by the --buffer-size and
+ --max-buffer-size options. The first of these sets the size of buffer
+ that is obtained at the start of processing. If an input file contains
+ very long lines, a larger buffer may be needed; this is handled by
+ automatically extending the buffer, up to the limit specified by --max-
+ buffer-size. The default values for these parameters are specified when
+ pcre2grep is built, with the default defaults being 20K and 1M respec-
+ tively. An error occurs if a line is too long and the buffer can no
+ longer be expanded.
+
+ The block of memory that is actually used is three times the "buffer
+ size", to allow for buffering "before" and "after" lines. If the buffer
+ size is too small, fewer than requested "before" and "after" lines may
+ be output.
+
+ Patterns can be no longer than 8K or BUFSIZ bytes, whichever is the
+ greater. BUFSIZ is defined in <stdio.h>. When there is more than one
pattern (specified by the use of -e and/or -f), each pattern is applied
- to each line in the order in which they are defined, except that all
+ to each line in the order in which they are defined, except that all
the -e patterns are tried before the -f patterns.
- By default, as soon as one pattern matches a line, no further patterns
+ By default, as soon as one pattern matches a line, no further patterns
are considered. However, if --colour (or --color) is used to colour the
- matching substrings, or if --only-matching, --file-offsets, or --line-
- offsets is used to output only the part of the line that matched
+ matching substrings, or if --only-matching, --file-offsets, or --line-
+ offsets is used to output only the part of the line that matched
(either shown literally, or as an offset), scanning resumes immediately
- following the match, so that further matches on the same line can be
- found. If there are multiple patterns, they are all tried on the
- remainder of the line, but patterns that follow the one that matched
+ following the match, so that further matches on the same line can be
+ found. If there are multiple patterns, they are all tried on the
+ remainder of the line, but patterns that follow the one that matched
are not tried on the earlier part of the line.
- This behaviour means that the order in which multiple patterns are
- specified can affect the output when one of the above options is used.
- This is no longer the same behaviour as GNU grep, which now manages to
- display earlier matches for later patterns (as long as there is no
+ This behaviour means that the order in which multiple patterns are
+ specified can affect the output when one of the above options is used.
+ This is no longer the same behaviour as GNU grep, which now manages to
+ display earlier matches for later patterns (as long as there is no
overlap).
- Patterns that can match an empty string are accepted, but empty string
+ Patterns that can match an empty string are accepted, but empty string
matches are never recognized. An example is the pattern
- "(super)?(man)?", in which all components are optional. This pattern
- finds all occurrences of both "super" and "man"; the output differs
- from matching with "super|man" when only the matching substrings are
+ "(super)?(man)?", in which all components are optional. This pattern
+ finds all occurrences of both "super" and "man"; the output differs
+ from matching with "super|man" when only the matching substrings are
being shown.
- If the LC_ALL or LC_CTYPE environment variable is set, pcre2grep uses
+ If the LC_ALL or LC_CTYPE environment variable is set, pcre2grep uses
the value to set a locale when calling the PCRE2 library. The --locale
option can be used to override this.
SUPPORT FOR COMPRESSED FILES
- It is possible to compile pcre2grep so that it uses libz or libbz2 to
- read files whose names end in .gz or .bz2, respectively. You can find
- out whether your binary has support for one or both of these file types
- by running it with the --help option. If the appropriate support is not
- present, files are treated as plain text. The standard input is always
- so treated.
+ It is possible to compile pcre2grep so that it uses libz or libbz2 to
+ read compressed files whose names end in .gz or .bz2, respectively. You
+ can find out whether your pcre2grep binary has support for one or both
+ of these file types by running it with the --help option. If the appro-
+ priate support is not present, all files are treated as plain text. The
+ standard input is always so treated. When input is from a compressed
+ .gz or .bz2 file, the --line-buffered option is ignored.
BINARY FILES
By default, a file that contains a binary zero byte within the first
1024 bytes is identified as a binary file, and is processed specially.
- (GNU grep also identifies binary files in this manner.) See the
+ (GNU grep identifies binary files in this manner.) However, if the new-
+ line type is specified as "nul", that is, the line terminator is a
+ binary zero, the test for a binary file is not applied. See the
--binary-files option for a means of changing the way binary files are
handled.
@@ -113,7 +125,7 @@ BINARY FILES
OPTIONS
The order in which some of the options appear can affect the output.
- For example, both the -h and -l options affect the printing of file
+ For example, both the -H and -l options affect the printing of file
names. Whichever comes later in the command line will be the one that
takes effect. Similarly, except where noted below, if an option is
given twice, the later setting is used. Numerical values for options
@@ -126,46 +138,50 @@ OPTIONS
names that start with hyphens.
-A number, --after-context=number
- Output number lines of context after each matching line. If
- file names and/or line numbers are being output, a hyphen
- separator is used instead of a colon for the context lines. A
- line containing "--" is output between each group of lines,
- unless they are in fact contiguous in the input file. The
- value of number is expected to be relatively small. However,
- pcre2grep guarantees to have up to 8K of following text
- available for context output.
+ Output up to number lines of context after each matching
+ line. Fewer lines are output if the next match or the end of
+ the file is reached, or if the processing buffer size has
+ been set too small. If file names and/or line numbers are
+ being output, a hyphen separator is used instead of a colon
+ for the context lines. A line containing "--" is output
+ between each group of lines, unless they are in fact contigu-
+ ous in the input file. The value of number is expected to be
+ relatively small. When -c is used, -A is ignored.
-a, --text
- Treat binary files as text. This is equivalent to --binary-
+ Treat binary files as text. This is equivalent to --binary-
files=text.
-B number, --before-context=number
- Output number lines of context before each matching line. If
- file names and/or line numbers are being output, a hyphen
- separator is used instead of a colon for the context lines. A
- line containing "--" is output between each group of lines,
- unless they are in fact contiguous in the input file. The
- value of number is expected to be relatively small. However,
- pcre2grep guarantees to have up to 8K of preceding text
- available for context output.
+ Output up to number lines of context before each matching
+ line. Fewer lines are output if the previous match or the
+ start of the file is within number lines, or if the process-
+ ing buffer size has been set too small. If file names and/or
+ line numbers are being output, a hyphen separator is used
+ instead of a colon for the context lines. A line containing
+ "--" is output between each group of lines, unless they are
+ in fact contiguous in the input file. The value of number is
+ expected to be relatively small. When -c is used, -B is
+ ignored.
--binary-files=word
- Specify how binary files are to be processed. If the word is
- "binary" (the default), pattern matching is performed on
- binary files, but the only output is "Binary file <name>
- matches" when a match succeeds. If the word is "text", which
- is equivalent to the -a or --text option, binary files are
- processed in the same way as any other file. In this case,
- when a match succeeds, the output may be binary garbage,
- which can have nasty effects if sent to a terminal. If the
- word is "without-match", which is equivalent to the -I
- option, binary files are not processed at all; they are
+ Specify how binary files are to be processed. If the word is
+ "binary" (the default), pattern matching is performed on
+ binary files, but the only output is "Binary file <name>
+ matches" when a match succeeds. If the word is "text", which
+ is equivalent to the -a or --text option, binary files are
+ processed in the same way as any other file. In this case,
+ when a match succeeds, the output may be binary garbage,
+ which can have nasty effects if sent to a terminal. If the
+ word is "without-match", which is equivalent to the -I
+ option, binary files are not processed at all; they are
assumed not to be of interest and are skipped without causing
any output or affecting the return code.
--buffer-size=number
- Set the parameter that controls how much memory is used for
- buffering files that are being scanned.
+ Set the parameter that controls how much memory is obtained
+ at the start of processing for buffering files that are being
+ scanned. See also --max-buffer-size below.
-C number, --context=number
Output number lines of context both before and after each
@@ -174,19 +190,21 @@ OPTIONS
-c, --count
Do not output lines from the files that are being scanned;
- instead output the number of matches (or non-matches if -v is
- used) that would otherwise have caused lines to be shown. By
- default, this count is the same as the number of suppressed
- lines, but if the -M (multiline) option is used (without -v),
- there may be more suppressed lines than the number of
- matches.
-
- If no lines are selected, the number zero is output. If sev-
- eral files are are being scanned, a count is output for each
- of them. However, if the --files-with-matches option is also
- used, only those files whose counts are greater than zero are
- listed. When -c is used, the -A, -B, and -C options are
- ignored.
+ instead output the number of lines that would have been
+ shown, either because they matched, or, if -v is set, because
+ they failed to match. By default, this count is exactly the
+ same as the number of lines that would have been output, but
+ if the -M (multiline) option is used (without -v), there may
+ be more suppressed lines than the count (that is, the number
+ of matches).
+
+ If no lines are selected, the number zero is output. If sev-
+ eral files are being scanned, a count is output for each
+ of them and the -t option can be used to cause a total to be
+ output at the end. However, if the --files-with-matches
+ option is also used, only those files whose counts are
+ greater than zero are listed. When -c is used, the -A, -B,
+ and -C options are ignored.
--colour, --color
If this option is given without any data, it is equivalent to
@@ -204,205 +222,225 @@ OPTIONS
possible matches in a line, not just one, in order to colour
them all.
- The colour that is used can be specified by setting the envi-
- ronment variable PCRE2GREP_COLOUR or PCRE2GREP_COLOR. The
- value of this variable should be a string of two numbers,
- separated by a semicolon. They are copied directly into the
- control string for setting colour on a terminal, so it is
- your responsibility to ensure that they make sense. If nei-
- ther of the environment variables is set, the default is
- "1;31", which gives red.
+ The colour that is used can be specified by setting one of
+ the environment variables PCRE2GREP_COLOUR, PCRE2GREP_COLOR,
+ PCREGREP_COLOUR, or PCREGREP_COLOR, which are checked in that
+ order. If none of these are set, pcre2grep looks for
+ GREP_COLORS or GREP_COLOR (in that order). The value of the
+ variable should be a string of two numbers, separated by a
+ semicolon, except in the case of GREP_COLORS, which must
+ start with "ms=" or "mt=" followed by two semicolon-separated
+ colours, terminated by the end of the string or by a colon.
+ If GREP_COLORS does not start with "ms=" or "mt=" it is
+ ignored, and GREP_COLOR is checked.
+
+ If the string obtained from one of the above variables con-
+ tains any characters other than semicolon or digits, the set-
+ ting is ignored and the default colour is used. The string is
+ copied directly into the control string for setting colour on
+ a terminal, so it is your responsibility to ensure that the
+ values make sense. If no relevant environment variable is
+ set, the default is "1;31", which gives red.
-D action, --devices=action
- If an input path is not a regular file or a directory,
- "action" specifies how it is to be processed. Valid values
+ If an input path is not a regular file or a directory,
+ "action" specifies how it is to be processed. Valid values
are "read" (the default) or "skip" (silently skip the path).
-d action, --directories=action
If an input path is a directory, "action" specifies how it is
- to be processed. Valid values are "read" (the default in
- non-Windows environments, for compatibility with GNU grep),
- "recurse" (equivalent to the -r option), or "skip" (silently
- skip the path, the default in Windows environments). In the
- "read" case, directories are read as if they were ordinary
- files. In some operating systems the effect of reading a
+ to be processed. Valid values are "read" (the default in
+ non-Windows environments, for compatibility with GNU grep),
+ "recurse" (equivalent to the -r option), or "skip" (silently
+ skip the path, the default in Windows environments). In the
+ "read" case, directories are read as if they were ordinary
+ files. In some operating systems the effect of reading a
directory like this is an immediate end-of-file; in others it
may provoke an error.
+ --depth-limit=number
+ See --match-limit below.
+
-e pattern, --regex=pattern, --regexp=pattern
Specify a pattern to be matched. This option can be used mul-
tiple times in order to specify several patterns. It can also
- be used as a way of specifying a single pattern that starts
- with a hyphen. When -e is used, no argument pattern is taken
- from the command line; all arguments are treated as file
- names. There is no limit to the number of patterns. They are
- applied to each line in the order in which they are defined
+ be used as a way of specifying a single pattern that starts
+ with a hyphen. When -e is used, no argument pattern is taken
+ from the command line; all arguments are treated as file
+ names. There is no limit to the number of patterns. They are
+ applied to each line in the order in which they are defined
until one matches.
- If -f is used with -e, the command line patterns are matched
+ If -f is used with -e, the command line patterns are matched
first, followed by the patterns from the file(s), independent
- of the order in which these options are specified. Note that
- multiple use of -e is not the same as a single pattern with
+ of the order in which these options are specified. Note that
+ multiple use of -e is not the same as a single pattern with
alternatives. For example, X|Y finds the first character in a
- line that is X or Y, whereas if the two patterns are given
+ line that is X or Y, whereas if the two patterns are given
separately, with X first, pcre2grep finds X if it is present,
even if it follows Y in the line. It finds Y only if there is
- no X in the line. This matters only if you are using -o or
+ no X in the line. This matters only if you are using -o or
--colo(u)r to show the part(s) of the line that matched.
--exclude=pattern
Files (but not directories) whose names match the pattern are
- skipped without being processed. This applies to all files,
- whether listed on the command line, obtained from --file-
+ skipped without being processed. This applies to all files,
+ whether listed on the command line, obtained from --file-
list, or by scanning a directory. The pattern is a PCRE2 reg-
- ular expression, and is matched against the final component
- of the file name, not the entire path. The -F, -w, and -x
+ ular expression, and is matched against the final component
+ of the file name, not the entire path. The -F, -w, and -x
options do not apply to this pattern. The option may be given
any number of times in order to specify multiple patterns. If
- a file name matches both an --include and an --exclude pat-
+ a file name matches both an --include and an --exclude pat-
tern, it is excluded. There is no short form for this option.
--exclude-from=filename
- Treat each non-empty line of the file as the data for an
+ Treat each non-empty line of the file as the data for an
--exclude option. What constitutes a newline when reading the
- file is the operating system's default. The --newline option
- has no effect on this option. This option may be given more
+ file is the operating system's default. The --newline option
+ has no effect on this option. This option may be given more
than once in order to specify a number of files to read.
--exclude-dir=pattern
Directories whose names match the pattern are skipped without
- being processed, whatever the setting of the --recursive
- option. This applies to all directories, whether listed on
+ being processed, whatever the setting of the --recursive
+ option. This applies to all directories, whether listed on
the command line, obtained from --file-list, or by scanning a
- parent directory. The pattern is a PCRE2 regular expression,
- and is matched against the final component of the directory
- name, not the entire path. The -F, -w, and -x options do not
- apply to this pattern. The option may be given any number of
- times in order to specify more than one pattern. If a direc-
- tory matches both --include-dir and --exclude-dir, it is
+ parent directory. The pattern is a PCRE2 regular expression,
+ and is matched against the final component of the directory
+ name, not the entire path. The -F, -w, and -x options do not
+ apply to this pattern. The option may be given any number of
+ times in order to specify more than one pattern. If a direc-
+ tory matches both --include-dir and --exclude-dir, it is
excluded. There is no short form for this option.
-F, --fixed-strings
- Interpret each data-matching pattern as a list of fixed
- strings, separated by newlines, instead of as a regular
- expression. What constitutes a newline for this purpose is
- controlled by the --newline option. The -w (match as a word)
- and -x (match whole line) options can be used with -F. They
+ Interpret each data-matching pattern as a list of fixed
+ strings, separated by newlines, instead of as a regular
+ expression. What constitutes a newline for this purpose is
+ controlled by the --newline option. The -w (match as a word)
+ and -x (match whole line) options can be used with -F. They
apply to each of the fixed strings. A line is selected if any
of the fixed strings are found in it (subject to -w or -x, if
- present). This option applies only to the patterns that are
- matched against the contents of files; it does not apply to
- patterns specified by any of the --include or --exclude
+ present). This option applies only to the patterns that are
+ matched against the contents of files; it does not apply to
+ patterns specified by any of the --include or --exclude
options.
-f filename, --file=filename
- Read patterns from the file, one per line, and match them
- against each line of input. What constitutes a newline when
- reading the file is the operating system's default. The
- --newline option has no effect on this option. Trailing white
- space is removed from each line, and blank lines are ignored.
- An empty file contains no patterns and therefore matches
- nothing. See also the comments about multiple patterns versus
- a single pattern with alternatives in the description of -e
- above.
-
- If this option is given more than once, all the specified
- files are read. A data line is output if any of the patterns
- match it. A file name can be given as "-" to refer to the
- standard input. When -f is used, patterns specified on the
- command line using -e may also be present; they are tested
- before the file's patterns. However, no other pattern is
+ Read patterns from the file, one per line, and match them
+ against each line of input. What constitutes a newline when
+ reading the file is the operating system's default. The
+ --newline option has no effect on this option. Trailing
+ white space is removed from each line, and blank lines are
+ ignored. An empty file contains no patterns and therefore
+ matches nothing. See also the comments about multiple pat-
+ terns versus a single pattern with alternatives in the
+ description of -e above.
+
+ If this option is given more than once, all the specified
+ files are read. A data line is output if any of the patterns
+ match it. A file name can be given as "-" to refer to the
+ standard input. When -f is used, patterns specified on the
+ command line using -e may also be present; they are tested
+ before the file's patterns. However, no other pattern is
taken from the command line; all arguments are treated as the
names of paths to be searched.
--file-list=filename
- Read a list of files and/or directories that are to be
- scanned from the given file, one per line. Trailing white
+ Read a list of files and/or directories that are to be
+ scanned from the given file, one per line. Trailing white
space is removed from each line, and blank lines are ignored.
- These paths are processed before any that are listed on the
- command line. The file name can be given as "-" to refer to
+ These paths are processed before any that are listed on the
+ command line. The file name can be given as "-" to refer to
the standard input. If --file and --file-list are both spec-
- ified as "-", patterns are read first. This is useful only
- when the standard input is a terminal, from which further
- lines (the list of files) can be read after an end-of-file
- indication. If this option is given more than once, all the
+ ified as "-", patterns are read first. This is useful only
+ when the standard input is a terminal, from which further
+ lines (the list of files) can be read after an end-of-file
+ indication. If this option is given more than once, all the
specified files are read.
--file-offsets
- Instead of showing lines or parts of lines that match, show
- each match as an offset from the start of the file and a
- length, separated by a comma. In this mode, no context is
- shown. That is, the -A, -B, and -C options are ignored. If
+ Instead of showing lines or parts of lines that match, show
+ each match as an offset from the start of the file and a
+ length, separated by a comma. In this mode, no context is
+ shown. That is, the -A, -B, and -C options are ignored. If
there is more than one match in a line, each of them is shown
- separately. This option is mutually exclusive with --line-
- offsets and --only-matching.
+ separately. This option is mutually exclusive with --output,
+ --line-offsets, and --only-matching.
-H, --with-filename
- Force the inclusion of the file name at the start of output
+ Force the inclusion of the file name at the start of output
lines when searching a single file. By default, the file name
is not shown in this case. For matching lines, the file name
is followed by a colon; for context lines, a hyphen separator
- is used. If a line number is also being output, it follows
- the file name. When the -M option causes a pattern to match
- more than one line, only the first is preceded by the file
- name.
+ is used. If a line number is also being output, it follows
+ the file name. When the -M option causes a pattern to match
+ more than one line, only the first is preceded by the file
+ name. This option overrides any previous -h, -l, or -L
+ options.
-h, --no-filename
Suppress the output of file names when searching multiple files.
By default, file names are shown when multiple files are
searched. For matching lines, the file name is followed by a
colon; for context lines, a hyphen separator is used. If a
- line number is also being output, it follows the file name.
+ line number is also being output, it follows the file name.
+ This option overrides any previous -H, -L, or -l options.
+
+ --heap-limit=number
+ See --match-limit below.
- --help Output a help message, giving brief details of the command
- options and file type support, and then exit. Anything else
+ --help Output a help message, giving brief details of the command
+ options and file type support, and then exit. Anything else
on the command line is ignored.
- -I Ignore binary files. This is equivalent to --binary-
+ -I Ignore binary files. This is equivalent to --binary-
files=without-match.
-i, --ignore-case
Ignore upper/lower case distinctions during comparisons.
--include=pattern
- If any --include patterns are specified, the only files that
- are processed are those that match one of the patterns (and
- do not match an --exclude pattern). This option does not
- affect directories, but it applies to all files, whether
- listed on the command line, obtained from --file-list, or by
- scanning a directory. The pattern is a PCRE2 regular expres-
- sion, and is matched against the final component of the file
- name, not the entire path. The -F, -w, and -x options do not
- apply to this pattern. The option may be given any number of
- times. If a file name matches both an --include and an
- --exclude pattern, it is excluded. There is no short form
+ If any --include patterns are specified, the only files that
+ are processed are those that match one of the patterns (and
+ do not match an --exclude pattern). This option does not
+ affect directories, but it applies to all files, whether
+ listed on the command line, obtained from --file-list, or by
+ scanning a directory. The pattern is a PCRE2 regular expres-
+ sion, and is matched against the final component of the file
+ name, not the entire path. The -F, -w, and -x options do not
+ apply to this pattern. The option may be given any number of
+ times. If a file name matches both an --include and an
+ --exclude pattern, it is excluded. There is no short form
for this option.
--include-from=filename
- Treat each non-empty line of the file as the data for an
+ Treat each non-empty line of the file as the data for an
--include option. What constitutes a newline for this purpose
- is the operating system's default. The --newline option has
+ is the operating system's default. The --newline option has
no effect on this option. This option may be given any number
of times; all the files are read.
--include-dir=pattern
- If any --include-dir patterns are specified, the only direc-
- tories that are processed are those that match one of the
- patterns (and do not match an --exclude-dir pattern). This
- applies to all directories, whether listed on the command
- line, obtained from --file-list, or by scanning a parent
- directory. The pattern is a PCRE2 regular expression, and is
- matched against the final component of the directory name,
- not the entire path. The -F, -w, and -x options do not apply
+ If any --include-dir patterns are specified, the only direc-
+ tories that are processed are those that match one of the
+ patterns (and do not match an --exclude-dir pattern). This
+ applies to all directories, whether listed on the command
+ line, obtained from --file-list, or by scanning a parent
+ directory. The pattern is a PCRE2 regular expression, and is
+ matched against the final component of the directory name,
+ not the entire path. The -F, -w, and -x options do not apply
to this pattern. The option may be given any number of times.
- If a directory matches both --include-dir and --exclude-dir,
+ If a directory matches both --include-dir and --exclude-dir,
it is excluded. There is no short form for this option.
-L, --files-without-match
- Instead of outputting lines from the files, just output the
- names of the files that do not contain any lines that would
- have been output. Each file name is output once, on a sepa-
- rate line.
+ Instead of outputting lines from the files, just output the
+ names of the files that do not contain any lines that would
+ have been output. Each file name is output once, on a sepa-
+ rate line. This option overrides any previous -H, -h, or -l
+ options.
-l, --files-with-matches
Instead of outputting lines from the files, just output the
@@ -413,7 +451,8 @@ OPTIONS
matching continues in order to obtain the correct count, and
those files that have at least one match are listed along
with their counts. Using this option with -c is a way of sup-
- pressing the listing of files with no matches.
+ pressing the listing of files with no matches. This option
+ overrides any previous -H, -h, or -L options.
--label=name
This option supplies a name to be used for the standard input
@@ -421,163 +460,194 @@ OPTIONS
input)" is used. There is no short form for this option.
--line-buffered
- When this option is given, input is read and processed line
- by line, and the output is flushed after each write. By
- default, input is read in large chunks, unless pcre2grep can
- determine that it is reading from a terminal (which is cur-
- rently possible only in Unix-like environments). Output to
- terminal is normally automatically flushed by the operating
- system. This option can be useful when the input or output is
- attached to a pipe and you do not want pcre2grep to buffer up
- large amounts of data. However, its use will affect perfor-
- mance, and the -M (multiline) option ceases to work.
+ When this option is given, non-compressed input is read and
+ processed line by line, and the output is flushed after each
+ write. By default, input is read in large chunks, unless
+ pcre2grep can determine that it is reading from a terminal
+ (which is currently possible only in Unix-like environments).
+ Output to terminal is normally automatically flushed by the
+ operating system. This option can be useful when the input or
+ output is attached to a pipe and you do not want pcre2grep to
+ buffer up large amounts of data. However, its use will affect
+ performance, and the -M (multiline) option ceases to work.
+ When input is from a compressed .gz or .bz2 file, --line-
+ buffered is ignored.
--line-offsets
- Instead of showing lines or parts of lines that match, show
+ Instead of showing lines or parts of lines that match, show
each match as a line number, the offset from the start of the
- line, and a length. The line number is terminated by a colon
- (as usual; see the -n option), and the offset and length are
- separated by a comma. In this mode, no context is shown.
- That is, the -A, -B, and -C options are ignored. If there is
- more than one match in a line, each of them is shown sepa-
- rately. This option is mutually exclusive with --file-offsets
- and --only-matching.
+ line, and a length. The line number is terminated by a colon
+ (as usual; see the -n option), and the offset and length are
+ separated by a comma. In this mode, no context is shown.
+ That is, the -A, -B, and -C options are ignored. If there is
+ more than one match in a line, each of them is shown sepa-
+ rately. This option is mutually exclusive with --output,
+ --file-offsets, and --only-matching.
--locale=locale-name
- This option specifies a locale to be used for pattern match-
- ing. It overrides the value in the LC_ALL or LC_CTYPE envi-
- ronment variables. If no locale is specified, the PCRE2
- library's default (usually the "C" locale) is used. There is
+ This option specifies a locale to be used for pattern match-
+ ing. It overrides the value in the LC_ALL or LC_CTYPE envi-
+ ronment variables. If no locale is specified, the PCRE2
+ library's default (usually the "C" locale) is used. There is
no short form for this option.
--match-limit=number
- Processing some regular expression patterns can require a
- very large amount of memory, leading in some cases to a pro-
- gram crash if not enough is available. Other patterns may
- take a very long time to search for all possible matching
- strings. The pcre2_match() function that is called by
- pcre2grep to do the matching has two parameters that can
- limit the resources that it uses.
-
- The --match-limit option provides a means of limiting
- resource usage when processing patterns that are not going to
- match, but which have a very large number of possibilities in
- their search trees. The classic example is a pattern that
- uses nested unlimited repeats. Internally, PCRE2 uses a func-
- tion called match() which it calls repeatedly (sometimes
- recursively). The limit set by --match-limit is imposed on
- the number of times this function is called during a match,
- which has the effect of limiting the amount of backtracking
- that can take place.
-
- The --recursion-limit option is similar to --match-limit, but
- instead of limiting the total number of times that match() is
- called, it limits the depth of recursive calls, which in turn
- limits the amount of memory that can be used. The recursion
- depth is a smaller number than the total number of calls,
- because not all calls to match() are recursive. This limit is
- of use only if it is set smaller than --match-limit.
-
- There are no short forms for these options. The default set-
- tings are specified when the PCRE2 library is compiled, with
- the default default being 10 million.
+ Processing some regular expression patterns may take a very
+ long time to search for all possible matching strings. Others
+ may require a very large amount of memory. There are three
+ options that set resource limits for matching.
+
+ The --match-limit option provides a means of limiting comput-
+ ing resource usage when processing patterns that are not
+ going to match, but which have a very large number of possi-
+ bilities in their search trees. The classic example is a pat-
+ tern that uses nested unlimited repeats. Internally, PCRE2
+ has a counter that is incremented each time around its main
+ processing loop. If the value set by --match-limit is
+ reached, an error occurs.
+
+ The --heap-limit option specifies, as a number of kilobytes,
+ the amount of heap memory that may be used for matching. Heap
+ memory is needed only if matching the pattern requires a sig-
+ nificant number of nested backtracking points to be remem-
+ bered. This parameter can be set to zero to forbid the use of
+ heap memory altogether.
+
+ The --depth-limit option limits the depth of nested back-
+ tracking points, which indirectly limits the amount of memory
+ that is used. The amount of memory needed for each backtrack-
+ ing point depends on the number of capturing parentheses in
+ the pattern, so the amount of memory that is used before this
+ limit acts varies from pattern to pattern. This limit is of
+ use only if it is set smaller than --match-limit.
+
+ There are no short forms for these options. The default set-
+ tings are specified when the PCRE2 library is compiled, with
+ the default defaults being very large and so effectively
+ unlimited.
+
+ --max-buffer-size=number
+ This limits the expansion of the processing buffer, whose
+ initial size can be set by --buffer-size. The maximum buffer
+ size is silently forced to be no smaller than the starting
+ buffer size.
-M, --multiline
- Allow patterns to match more than one line. When this option
- is given, patterns may usefully contain literal newline char-
- acters and internal occurrences of ^ and $ characters. The
- output for a successful match may consist of more than one
- line. The first is the line in which the match started, and
- the last is the line in which the match ended. If the matched
- string ends with a newline sequence the output ends at the
- end of that line.
-
- When this option is set, the PCRE2 library is called in "mul-
- tiline" mode. This allows a matched string to extend past the
- end of a line and continue on one or more subsequent lines.
- However, pcre2grep still processes the input line by line.
- Once a match has been handled, scanning restarts at the
- beginning of the next line, just as it does when -M is not
- present. This means that it is possible for the second or
- subsequent lines in a multiline match to be output again as
- part of another match.
-
- The newline sequence that separates multiple lines must be
- matched as part of the pattern. For example, to find the
- phrase "regular expression" in a file where "regular" might
- be at the end of a line and "expression" at the start of the
+ Allow patterns to match more than one line. When this option
+ is set, the PCRE2 library is called in "multiline" mode. This
+ allows a matched string to extend past the end of a line and
+ continue on one or more subsequent lines. Patterns used with
+ -M may usefully contain literal newline characters and inter-
+ nal occurrences of ^ and $ characters. The output for a suc-
+ cessful match may consist of more than one line. The first
+ line is the line in which the match started, and the last
+ line is the line in which the match ended. If the matched
+ string ends with a newline sequence, the output ends at the
+ end of that line. If -v is set, none of the lines in a
+ multi-line match are output. Once a match has been handled,
+ scanning restarts at the beginning of the line after the one
+ in which the match ended.
+
+ The newline sequence that separates multiple lines must be
+ matched as part of the pattern. For example, to find the
+ phrase "regular expression" in a file where "regular" might
+ be at the end of a line and "expression" at the start of the
next line, you could use this command:
pcre2grep -M 'regular\s+expression' <file>
- The \s escape sequence matches any white space character,
- including newlines, and is followed by + so as to match
- trailing white space on the first line as well as possibly
+ The \s escape sequence matches any white space character,
+ including newlines, and is followed by + so as to match
+ trailing white space on the first line as well as possibly
handling a two-character newline sequence.
- There is a limit to the number of lines that can be matched,
- imposed by the way that pcre2grep buffers the input file as
- it scans it. However, pcre2grep ensures that at least 8K
- characters or the rest of the file (whichever is the shorter)
- are available for forward matching, and similarly the previ-
- ous 8K characters (or all the previous characters, if fewer
- than 8K) are guaranteed to be available for lookbehind asser-
- tions. The -M option does not work when input is read line by
- line (see --line-buffered.)
+ There is a limit to the number of lines that can be matched,
+ imposed by the way that pcre2grep buffers the input file as
+ it scans it. With a sufficiently large processing buffer,
+ this should not be a problem, but the -M option does not work
+ when input is read line by line (see --line-buffered).
-N newline-type, --newline=newline-type
- The PCRE2 library supports five different conventions for
- indicating the ends of lines. They are the single-character
- sequences CR (carriage return) and LF (linefeed), the two-
- character sequence CRLF, an "anycrlf" convention, which rec-
- ognizes any of the preceding three types, and an "any" con-
+ The PCRE2 library supports five different conventions for
+ indicating the ends of lines. They are the single-character
+ sequences CR (carriage return) and LF (linefeed), the two-
+ character sequence CRLF, an "anycrlf" convention, which rec-
+ ognizes any of the preceding three types, and an "any" con-
vention, in which any Unicode line ending sequence is assumed
- to end a line. The Unicode sequences are the three just men-
- tioned, plus VT (vertical tab, U+000B), FF (form feed,
- U+000C), NEL (next line, U+0085), LS (line separator,
+ to end a line. The Unicode sequences are the three just men-
+ tioned, plus VT (vertical tab, U+000B), FF (form feed,
+ U+000C), NEL (next line, U+0085), LS (line separator,
U+2028), and PS (paragraph separator, U+2029).
- When the PCRE2 library is built, a default line-ending
- sequence is specified. This is normally the standard
+ When the PCRE2 library is built, a default line-ending
+ sequence is specified. This is normally the standard
sequence for the operating system. Unless otherwise specified
- by this option, pcre2grep uses the library's default. The
+ by this option, pcre2grep uses the library's default. The
possible values for this option are CR, LF, CRLF, ANYCRLF, or
- ANY. This makes it possible to use pcre2grep to scan files
+ ANY. This makes it possible to use pcre2grep to scan files
that have come from other environments without having to mod-
- ify their line endings. If the data that is being scanned
- does not agree with the convention set by this option,
- pcre2grep may behave in strange ways. Note that this option
- does not apply to files specified by the -f, --exclude-from,
- or --include-from options, which are expected to use the
+ ify their line endings. If the data that is being scanned
+ does not agree with the convention set by this option,
+ pcre2grep may behave in strange ways. Note that this option
+ does not apply to files specified by the -f, --exclude-from,
+ or --include-from options, which are expected to use the
operating system's standard newline sequence.
-n, --line-number
Precede each output line by its line number in the file, fol-
- lowed by a colon for matching lines or a hyphen for context
+ lowed by a colon for matching lines or a hyphen for context
lines. If the file name is also being output, it precedes the
- line number. When the -M option causes a pattern to match
- more than one line, only the first is preceded by its line
+ line number. When the -M option causes a pattern to match
+ more than one line, only the first is preceded by its line
number. This option is forced if --line-offsets is used.
- --no-jit If the PCRE2 library is built with support for just-in-time
+ --no-jit If the PCRE2 library is built with support for just-in-time
compiling (which speeds up matching), pcre2grep automatically
makes use of this, unless it was explicitly disabled at build
- time. This option can be used to disable the use of JIT at
- run time. It is provided for testing and working round prob-
+ time. This option can be used to disable the use of JIT at
+ run time. It is provided for testing and working round prob-
lems. It should never be needed in normal use.
+ -O text, --output=text
+ When there is a match, instead of outputting the whole line
+ that matched, output just the given text. This option is
+ mutually exclusive with --only-matching, --file-offsets, and
+ --line-offsets. Escape sequences starting with a dollar char-
+ acter may be used to insert the contents of the matched part
+ of the line and/or captured substrings into the text.
+
+ $<digits> or ${<digits>} is replaced by the captured sub-
+ string of the given decimal number; zero substitutes the
+ whole match. If the number is greater than the number of cap-
+ turing substrings, or if the capture is unset, the replace-
+ ment is empty.
+
+ $a is replaced by bell; $b by backspace; $e by escape; $f by
+ form feed; $n by newline; $r by carriage return; $t by tab;
+ $v by vertical tab.
+
+ $o<digits> is replaced by the character represented by the
+ given octal number; up to three digits are processed.
+
+ $x<digits> is replaced by the character represented by the
+ given hexadecimal number; up to two digits are processed.
+
+ Any other character is substituted by itself. In particular,
+ $$ is replaced by a single dollar.
+
-o, --only-matching
Show only the part of the line that matched a pattern instead
- of the whole line. In this mode, no context is shown. That
- is, the -A, -B, and -C options are ignored. If there is more
- than one match in a line, each of them is shown separately.
- If -o is combined with -v (invert the sense of the match to
- find non-matching lines), no output is generated, but the
- return code is set appropriately. If the matched portion of
- the line is empty, nothing is output unless the file name or
- line number are being printed, in which case they are shown
- on an otherwise empty line. This option is mutually exclusive
- with --file-offsets and --line-offsets.
+ of the whole line. In this mode, no context is shown. That
+ is, the -A, -B, and -C options are ignored. If there is more
+ than one match in a line, each of them is shown separately,
+ on a separate line of output. If -o is combined with -v
+ (invert the sense of the match to find non-matching lines),
+ no output is generated, but the return code is set appropri-
+ ately. If the matched portion of the line is empty, nothing
+ is output unless the file name or line number is being
+ printed, in which case they are shown on an otherwise empty
+ line. This option is mutually exclusive with --output,
+ --file-offsets and --line-offsets.
-onumber, --only-matching=number
Show only the part of the line that matched the capturing
@@ -587,82 +657,98 @@ OPTIONS
(see above), if an argument is present, it must be given in
the same shell item, for example, -o3 or --only-matching=2.
The comments given for the non-argument case above also apply
- to this case. If the specified capturing parentheses do not
+ to this option. If the specified capturing parentheses do not
exist in the pattern, or were not set in the match, nothing
is output unless the file name or line number are being out-
put.
If this option is given multiple times, multiple substrings
- are output, in the order the options are given. For example,
- -o3 -o1 -o3 causes the substrings matched by capturing paren-
- theses 3 and 1 and then 3 again to be output. By default,
- there is no separator (but see the next option).
+ are output for each match, in the order the options are
+ given, and all on one line. For example, -o3 -o1 -o3 causes
+ the substrings matched by capturing parentheses 3 and 1 and
+ then 3 again to be output. By default, there is no separator
+ (but see the next option).
--om-separator=text
- Specify a separating string for multiple occurrences of -o.
- The default is an empty string. Separating strings are never
+ Specify a separating string for multiple occurrences of -o.
+ The default is an empty string. Separating strings are never
coloured.
-q, --quiet
Work quietly, that is, display nothing except error messages.
- The exit status indicates whether or not any matches were
+ The exit status indicates whether or not any matches were
found.
-r, --recursive
- If any given path is a directory, recursively scan the files
- it contains, taking note of any --include and --exclude set-
- tings. By default, a directory is read as a normal file; in
- some operating systems this gives an immediate end-of-file.
- This option is a shorthand for setting the -d option to
+ If any given path is a directory, recursively scan the files
+ it contains, taking note of any --include and --exclude set-
+ tings. By default, a directory is read as a normal file; in
+ some operating systems this gives an immediate end-of-file.
+ This option is a shorthand for setting the -d option to
"recurse".
--recursion-limit=number
See --match-limit above.
-s, --no-messages
- Suppress error messages about non-existent or unreadable
- files. Such files are quietly skipped. However, the return
+ Suppress error messages about non-existent or unreadable
+ files. Such files are quietly skipped. However, the return
code is still 2, even if matches were found in other files.
+ -t, --total-count
+ This option is useful when scanning more than one file. If
+ used on its own, -t suppresses all output except for a grand
+ total number of matching lines (or non-matching lines if -v
+ is used) in all the files. If -t is used with -c, a grand
+ total is output except when the previous output is just one
+ line. In other words, it is not output when just one file's
+ count is listed. If file names are being output, the grand
+ total is preceded by "TOTAL:". Otherwise, it appears as just
+ another number. The -t option is ignored when used with -L
+ (list files without matches), because the grand total would
+ always be zero.
+
-u, --utf-8
Operate in UTF-8 mode. This option is available only if PCRE2
has been compiled with UTF-8 support. All patterns (including
- those for any --exclude and --include options) and all sub-
- ject lines that are scanned must be valid strings of UTF-8
+ those for any --exclude and --include options) and all sub-
+ ject lines that are scanned must be valid strings of UTF-8
characters.
-V, --version
- Write the version numbers of pcre2grep and the PCRE2 library
- to the standard output and then exit. Anything else on the
+ Write the version numbers of pcre2grep and the PCRE2 library
+ to the standard output and then exit. Anything else on the
command line is ignored.
-v, --invert-match
- Invert the sense of the match, so that lines which do not
+ Invert the sense of the match, so that lines which do not
match any of the patterns are the ones that are found.
-w, --word-regex, --word-regexp
- Force the patterns to match only whole words. This is equiva-
- lent to having \b at the start and end of the pattern. This
- option applies only to the patterns that are matched against
- the contents of files; it does not apply to patterns speci-
- fied by any of the --include or --exclude options.
+ Force the patterns only to match "words". That is, there must
+ be a word boundary at the start and end of each matched
+ string. This is equivalent to having "\b(?:" at the start of
+ each pattern, and ")\b" at the end. This option applies only
+ to the patterns that are matched against the contents of
+ files; it does not apply to patterns specified by any of the
+ --include or --exclude options.
-x, --line-regex, --line-regexp
- Force the patterns to be anchored (each must start matching
- at the beginning of a line) and in addition, require them to
- match entire lines. This is equivalent to having ^ and $
- characters at the start and end of each alternative top-level
- branch in every pattern. This option applies only to the pat-
- terns that are matched against the contents of files; it does
- not apply to patterns specified by any of the --include or
- --exclude options.
+ Force the patterns to start matching only at the beginnings
+ of lines, and in addition, require them to match entire
+ lines. In multiline mode the match may be more than one line.
+ This is equivalent to having "^(?:" at the start of each pat-
+ tern and ")$" at the end. This option applies only to the
+ patterns that are matched against the contents of files; it
+ does not apply to patterns specified by any of the --include
+ or --exclude options.
ENVIRONMENT VARIABLES
- The environment variables LC_ALL and LC_CTYPE are examined, in that
- order, for a locale. The first one that is set is used. This can be
- overridden by the --locale option. If no locale is set, the PCRE2
+ The environment variables LC_ALL and LC_CTYPE are examined, in that
+ order, for a locale. The first one that is set is used. This can be
+ overridden by the --locale option. If no locale is set, the PCRE2
library's default (usually the "C" locale) is used.
@@ -670,82 +756,87 @@ NEWLINES
The -N (--newline) option allows pcre2grep to scan files with different
newline conventions from the default. Any parts of the input files that
- are written to the standard output are copied identically, with what-
- ever newline sequences they have in the input. However, the setting of
- this option does not affect the interpretation of files specified by
+ are written to the standard output are copied identically, with what-
+ ever newline sequences they have in the input. However, the setting of
+ this option does not affect the interpretation of files specified by
the -f, --exclude-from, or --include-from options, which are assumed to
- use the operating system's standard newline sequence, nor does it
- affect the way in which pcre2grep writes informational messages to the
+ use the operating system's standard newline sequence, nor does it
+ affect the way in which pcre2grep writes informational messages to the
standard error and output streams. For these it uses the string "\n" to
- indicate newlines, relying on the C I/O library to convert this to an
+ indicate newlines, relying on the C I/O library to convert this to an
appropriate sequence.
OPTIONS COMPATIBILITY
Many of the short and long forms of pcre2grep's options are the same as
- in the GNU grep program. Any long option of the form --xxx-regexp (GNU
+ in the GNU grep program. Any long option of the form --xxx-regexp (GNU
terminology) is also available as --xxx-regex (PCRE2 terminology). How-
- ever, the --file-list, --file-offsets, --include-dir, --line-offsets,
- --locale, --match-limit, -M, --multiline, -N, --newline, --om-separa-
- tor, --recursion-limit, -u, and --utf-8 options are specific to
- pcre2grep, as is the use of the --only-matching option with a capturing
- parentheses number.
-
- Although most of the common options work the same way, a few are dif-
- ferent in pcre2grep. For example, the --include option's argument is a
- glob for GNU grep, but a regular expression for pcre2grep. If both the
- -c and -l options are given, GNU grep lists only file names, without
+ ever, the --depth-limit, --file-list, --file-offsets, --heap-limit,
+ --include-dir, --line-offsets, --locale, --match-limit, -M, --multi-
+ line, -N, --newline, --om-separator, --output, -u, and --utf-8 options
+ are specific to pcre2grep, as is the use of the --only-matching option
+ with a capturing parentheses number.
+
+ Although most of the common options work the same way, a few are dif-
+ ferent in pcre2grep. For example, the --include option's argument is a
+ glob for GNU grep, but a regular expression for pcre2grep. If both the
+ -c and -l options are given, GNU grep lists only file names, without
counts, but pcre2grep gives the counts as well.
OPTIONS WITH DATA
There are four different ways in which an option with data can be spec-
- ified. If a short form option is used, the data may follow immedi-
+ ified. If a short form option is used, the data may follow immedi-
ately, or (with one exception) in the next command line item. For exam-
ple:
-f/some/file
-f /some/file
- The exception is the -o option, which may appear with or without data.
- Because of this, if data is present, it must follow immediately in the
+ The exception is the -o option, which may appear with or without data.
+ Because of this, if data is present, it must follow immediately in the
same item, for example -o3.
- If a long form option is used, the data may appear in the same command
- line item, separated by an equals character, or (with two exceptions)
+ If a long form option is used, the data may appear in the same command
+ line item, separated by an equals character, or (with two exceptions)
it may appear in the next command line item. For example:
--file=/some/file
--file /some/file
- Note, however, that if you want to supply a file name beginning with ~
- as data in a shell command, and have the shell expand ~ to a home
+ Note, however, that if you want to supply a file name beginning with ~
+ as data in a shell command, and have the shell expand ~ to a home
directory, you must separate the file name from the option, because the
shell does not treat ~ specially unless it is at the start of an item.
- The exceptions to the above are the --colour (or --color) and --only-
- matching options, for which the data is optional. If one of these
- options does have data, it must be given in the first form, using an
+ The exceptions to the above are the --colour (or --color) and --only-
+ matching options, for which the data is optional. If one of these
+ options does have data, it must be given in the first form, using an
equals character. Otherwise pcre2grep will assume that it has no data.
-CALLING EXTERNAL SCRIPTS
+USING PCRE2'S CALLOUT FACILITY
+
+ pcre2grep has, by default, support for calling external programs or
+ scripts or echoing specific strings during matching by making use of
+ PCRE2's callout facility. However, this support can be disabled when
+ pcre2grep is built. You can find out whether your binary has support
+ for callouts by running it with the --help option. If the support is
+ not enabled, all callouts in patterns are ignored by pcre2grep.
+
+ A callout in a PCRE2 pattern is of the form (?C<arg>) where the argu-
+ ment is either a number or a quoted string (see the pcre2callout docu-
+ mentation for details). Numbered callouts are ignored by pcre2grep;
+ only callouts with string arguments are useful.
- On non-Windows systems, pcre2grep has, by default, support for calling
- external programs or scripts during matching by making use of PCRE2's
- callout facility. However, this support can be disabled when pcre2grep
- is built. You can find out whether your binary has support for call-
- outs by running it with the --help option. If the support is not
- enabled, all callouts in patterns are ignored by pcre2grep.
+ Calling external programs or scripts
- A callout in a PCRE2 pattern is of the form (?C<arg>) where the argu-
- ment is either a number or a quoted string (see the pcre2callout docu-
- mentation for details). Numbered callouts are ignored by pcre2grep.
- String arguments are parsed as a list of substrings separated by pipe
- (vertical bar) characters. The first substring must be an executable
- name, with the following substrings specifying arguments:
+ If the callout string does not start with a pipe (vertical bar) charac-
+ ter, it is parsed into a list of substrings separated by pipe charac-
+ ters. The first substring must be an executable name, with the follow-
+ ing substrings specifying arguments:
executable_name|arg1|arg2|...
@@ -781,6 +872,18 @@ CALLING EXTERNAL SCRIPTS
local matching failure occurs and the matcher backtracks in the normal
way.
+ Echoing a specific string
+
+ If the callout string starts with a pipe (vertical bar) character, the
+ rest of the string is written to the output, having been passed through
+ the same escape processing as text from the --output option. This pro-
+ vides a simple echoing facility that avoids calling an external program
+ or script. No terminator is added to the string, so if you want a new-
+ line, you must include it explicitly. Matching continues normally
+ after the string is output. If you want to see only the callout output
+ but not any output from an actual match, you should end the relevant
+ pattern with (*FAIL).
+
MATCHING ERRORS
@@ -794,9 +897,9 @@ MATCHING ERRORS
such errors, pcre2grep gives up.
The --match-limit option of pcre2grep can be used to set the overall
- resource limit; there is a second option called --recursion-limit that
- sets a limit on the amount of memory (usually stack) that is used (see
- the discussion of these options above).
+ resource limit. There are also other limits that affect the amount of
+ memory used during matching; see the discussion of --heap-limit and
+ --depth-limit above.
DIAGNOSTICS
@@ -807,6 +910,10 @@ DIAGNOSTICS
errors. Using the -s option to suppress error messages about inaccessi-
ble files does not affect the return code.
+ When run under VMS, the return code is placed in the symbol
+ PCRE2GREP_RC because VMS does not distinguish between exit(0) and
+ exit(1).
+
SEE ALSO
@@ -822,5 +929,5 @@ AUTHOR
REVISION
- Last updated: 19 June 2016
- Copyright (c) 1997-2016 University of Cambridge.
+ Last updated: 13 November 2017
+ Copyright (c) 1997-2017 University of Cambridge.
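The escape processing that the pcre2grep.txt changes above describe for -O (--output) can be sketched in a few lines. This is only an illustrative emulation in Python using the standard re module, not pcre2grep's own code. The name expand_output is hypothetical. The handling of malformed escapes (for example "${" without a closing brace, or "$o" with no octal digits) is an assumption: they fall through to the "any other character is substituted by itself" rule.

```python
import re

# Single-letter escapes listed in the --output description.
SIMPLE_ESCAPES = {"a": "\a", "b": "\b", "e": "\x1b", "f": "\f",
                  "n": "\n", "r": "\r", "t": "\t", "v": "\v"}

# One $-escape: $<digits>, ${<digits>}, $o<octal>, $x<hex>, or "$" followed
# by any other single character.
ESCAPE = re.compile(
    r"\$(?:([0-9]+)|\{([0-9]+)\}|o([0-7]{1,3})|x([0-9A-Fa-f]{1,2})|(.))",
    re.DOTALL,
)


def expand_output(text, match):
    """Expand $-escapes in text using a Python re.Match (emulation only)."""

    def replace(tok):
        number, braced, octal, hexa, other = tok.groups()
        if number is not None or braced is not None:
            index = int(number if number is not None else braced)
            # Out-of-range or unset captures give an empty replacement.
            if index > match.re.groups or match.group(index) is None:
                return ""
            return match.group(index)
        if octal is not None:
            return chr(int(octal, 8))      # up to three octal digits
        if hexa is not None:
            return chr(int(hexa, 16))      # up to two hexadecimal digits
        # Named escape such as $n, otherwise the character itself ($$ -> $).
        return SIMPLE_ESCAPES.get(other, other)

    return ESCAPE.sub(replace, text)
```

For example, with a match for (\w+)@(\w+) against "bob@example", the text "$2:$1$n" expands to "example:bob" followed by a newline, and "$0" gives the whole match. According to the text above, the same processing is applied to callout strings that start with a pipe character.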
diff --git a/doc/pcre2jit.3 b/doc/pcre2jit.3
index 0b95b4d..f6d17ca 100644
--- a/doc/pcre2jit.3
+++ b/doc/pcre2jit.3
@@ -1,4 +1,4 @@
-.TH PCRE2JIT 3 "05 June 2016" "PCRE2 10.22"
+.TH PCRE2JIT 3 "31 March 2017" "PCRE2 10.30"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH "PCRE2 JUST-IN-TIME COMPILER SUPPORT"
@@ -152,7 +152,7 @@ below for a discussion of JIT stack usage.
The error code PCRE2_ERROR_MATCHLIMIT is returned by the JIT code if searching
a very large pattern tree goes on for too long, as it is in the same
circumstance when JIT is not used, but the details of exactly what is counted
-are not the same. The PCRE2_ERROR_RECURSIONLIMIT error code is never returned
+are not the same. The PCRE2_ERROR_DEPTHLIMIT error code is never returned
when JIT matching is used.
.
.
@@ -178,11 +178,8 @@ allocation functions, or NULL for standard memory allocation). It returns a
pointer to an opaque structure of type \fBpcre2_jit_stack\fP, or NULL if there
is an error. The \fBpcre2_jit_stack_free()\fP function is used to free a stack
that is no longer needed. (For the technically minded: the address space is
-allocated by mmap or VirtualAlloc.)
-.P
-JIT uses far less memory for recursion than the interpretive code,
-and a maximum stack size of 512K to 1M should be more than enough for any
-pattern.
+allocated by mmap or VirtualAlloc.) A maximum stack size of 512K to 1M should
+be more than enough for any pattern.
.P
The \fBpcre2_jit_stack_assign()\fP function specifies which stack JIT code
should use. Its arguments are as follows:
@@ -413,6 +410,6 @@ Cambridge, England.
.rs
.sp
.nf
-Last updated: 05 June 2016
-Copyright (c) 1997-2016 University of Cambridge.
+Last updated: 31 March 2017
+Copyright (c) 1997-2017 University of Cambridge.
.fi
diff --git a/doc/pcre2limits.3 b/doc/pcre2limits.3
index a5bab81..88944db 100644
--- a/doc/pcre2limits.3
+++ b/doc/pcre2limits.3
@@ -1,4 +1,4 @@
-.TH PCRE2LIMITS 3 "05 November 2015" "PCRE2 10.21"
+.TH PCRE2LIMITS 3 "30 March 2017" "PCRE2 10.30"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH "SIZE AND OTHER LIMITATIONS"
@@ -30,15 +30,6 @@ integer type, usually defined as size_t. Its maximum value (that is
~(PCRE2_SIZE)0) is reserved as a special indicator for zero-terminated strings
and unset offsets.
.P
-Note that when using the traditional matching function, PCRE2 uses recursion to
-handle subpatterns and indefinite repetition. This means that the available
-stack space may limit the size of a subject string that can be processed by
-certain patterns. For a discussion of stack issues, see the
-.\" HREF
-\fBpcre2stack\fP
-.\"
-documentation.
-.P
All values in repeating quantifiers must be less than 65536.
.P
The maximum length of a lookbehind assertion is 65535 characters.
@@ -46,19 +37,20 @@ The maximum length of a lookbehind assertion is 65535 characters.
There is no limit to the number of parenthesized subpatterns, but there can be
no more than 65535 capturing subpatterns. There is, however, a limit to the
depth of nesting of parenthesized subpatterns of all kinds. This is imposed in
-order to limit the amount of system stack used at compile time. The limit can
-be specified when PCRE2 is built; the default is 250.
-.P
-There is a limit to the number of forward references to subsequent subpatterns
-of around 200,000. Repeated forward references with fixed upper limits, for
-example, (?2){0,100} when subpattern number 2 is to the right, are included in
-the count. There is no limit to the number of backward references.
+order to limit the amount of system stack used at compile time. The default
+limit can be specified when PCRE2 is built; if it is not, the default is 250.
+An application can change this limit by calling
+\fBpcre2_set_parens_nest_limit()\fP to set the limit in a compile context.
.P
The maximum length of name for a named subpattern is 32 code units, and the
maximum number of named subpatterns is 10000.
.P
The maximum length of a name in a (*MARK), (*PRUNE), (*SKIP), or (*THEN) verb
-is 255 for the 8-bit library and 65535 for the 16-bit and 32-bit libraries.
+is 255 code units for the 8-bit library and 65535 code units for the 16-bit and
+32-bit libraries.
+.P
+The maximum length of a string argument to a callout is the largest number a
+32-bit unsigned integer can hold.
.
.
.SH AUTHOR
@@ -75,6 +67,6 @@ Cambridge, England.
.rs
.sp
.nf
-Last updated: 05 November 2015
-Copyright (c) 1997-2015 University of Cambridge.
+Last updated: 30 March 2017
+Copyright (c) 1997-2017 University of Cambridge.
.fi
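Some of the limits listed in the pcre2limits.3 changes above can be checked before a pattern is ever compiled. The following Python sketch is a rough pre-check under stated assumptions, not part of PCRE2: the name check_limits is hypothetical, the scan ignores character classes and \Q...\E quoting, and name length is counted in Python characters, which equals code units only for ASCII names.

```python
import re

MAX_QUANTIFIER = 65535     # quantifier values must be less than 65536
MAX_NAME_LENGTH = 32       # code units in a named subpattern's name
DEFAULT_NEST_LIMIT = 250   # default parenthesis nesting limit


def check_limits(pattern, nest_limit=DEFAULT_NEST_LIMIT):
    """Return a list of limit violations found by a naive scan."""
    problems = []

    # Quantifiers written as {n}, {n,} or {n,m}.
    for m in re.finditer(r"\{([0-9]+)(?:,([0-9]*))?\}", pattern):
        for value in m.groups():
            if value and int(value) > MAX_QUANTIFIER:
                problems.append("quantifier value %s is too large" % value)

    # Named groups written as (?<name>...), (?P<name>...) or (?'name'...).
    for m in re.finditer(r"\(\?(?:P?<|')(\w+)[>']", pattern):
        if len(m.group(1)) > MAX_NAME_LENGTH:
            problems.append("name %r is too long" % m.group(1))

    # Parenthesis nesting depth, skipping backslash-escaped characters.
    depth = deepest = 0
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\":
            i += 2
            continue
        if ch == "(":
            depth += 1
            deepest = max(deepest, depth)
        elif ch == ")":
            depth -= 1
        i += 1
    if deepest > nest_limit:
        problems.append("nesting depth %d exceeds %d" % (deepest, nest_limit))

    return problems
```

An empty list means that none of these three limits was tripped. The nest_limit parameter stands in for a value an application might set with pcre2_set_parens_nest_limit(); the library itself remains the authority on whether a pattern compiles.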
diff --git a/doc/pcre2pattern.3 b/doc/pcre2pattern.3
index 70ac14a..5c0daa8 100644
--- a/doc/pcre2pattern.3
+++ b/doc/pcre2pattern.3
@@ -1,4 +1,4 @@
-.TH PCRE2PATTERN 3 "20 June 2016" "PCRE2 10.22"
+.TH PCRE2PATTERN 3 "12 September 2017" "PCRE2 10.31"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH "PCRE2 REGULAR EXPRESSION DETAILS"
@@ -138,36 +138,52 @@ the application to apply the JIT optimization by calling
\fBpcre2_jit_compile()\fP is ignored.
.
.
-.SS "Setting match and recursion limits"
+.SS "Setting match resource limits"
.rs
.sp
-The caller of \fBpcre2_match()\fP can set a limit on the number of times the
-internal \fBmatch()\fP function is called and on the maximum depth of
-recursive calls. These facilities are provided to catch runaway matches that
-are provoked by patterns with huge matching trees (a typical example is a
-pattern with nested unlimited repeats) and to avoid running out of system stack
-by too much recursion. When one of these limits is reached, \fBpcre2_match()\fP
-gives an error return. The limits can also be set by items at the start of the
-pattern of the form
+The \fBpcre2_match()\fP function contains a counter that is incremented every time it
+goes round its main loop. The caller of \fBpcre2_match()\fP can set a limit on
+this counter, which therefore limits the amount of computing resource used for
+a match. The maximum depth of nested backtracking can also be limited; this
+indirectly restricts the amount of heap memory that is used, but there is also
+an explicit memory limit that can be set.
+.P
+These facilities are provided to catch runaway matches that are provoked by
+patterns with huge matching trees (a typical example is a pattern with nested
+unlimited repeats applied to a long string that does not match). When one of
+these limits is reached, \fBpcre2_match()\fP gives an error return. The limits
+can also be set by items at the start of the pattern of the form
.sp
+ (*LIMIT_HEAP=d)
(*LIMIT_MATCH=d)
- (*LIMIT_RECURSION=d)
+ (*LIMIT_DEPTH=d)
.sp
where d is any number of decimal digits. However, the value of the setting must
be less than the value set (or defaulted) by the caller of \fBpcre2_match()\fP
for it to have any effect. In other words, the pattern writer can lower the
limits set by the programmer, but not raise them. If there is more than one
setting of one of these limits, the lower value is used.
+.P
+Prior to release 10.30, LIMIT_DEPTH was called LIMIT_RECURSION. This name is
+still recognized for backwards compatibility.
+.P
+The heap limit applies only when the \fBpcre2_match()\fP interpreter is used
+for matching. It does not apply to JIT or DFA matching. The match limit is used
+(but in a different way) when JIT is being used, or when
+\fBpcre2_dfa_match()\fP is called, to limit computing resource usage by those
+matching functions. The depth limit is ignored by JIT but is relevant for DFA
+matching, which uses function recursion for recursions within the pattern. In
+this case, the depth limit controls the amount of system stack that is used.
.
.
.\" HTML <a name="newlines"></a>
.SS "Newline conventions"
.rs
.sp
-PCRE2 supports five different conventions for indicating line breaks in
+PCRE2 supports six different conventions for indicating line breaks in
strings: a single CR (carriage return) character, a single LF (linefeed)
-character, the two-character sequence CRLF, any of the three preceding, or any
-Unicode newline sequence. The
+character, the two-character sequence CRLF, any of the three preceding, any
+Unicode newline sequence, or the NUL character (binary zero). The
.\" HREF
\fBpcre2api\fP
.\"
@@ -180,13 +196,14 @@ about newlines, and shows how to set the newline convention when calling
\fBpcre2_compile()\fP.
.P
It is also possible to specify a newline convention by starting a pattern
-string with one of the following five sequences:
+string with one of the following sequences:
.sp
(*CR) carriage return
(*LF) linefeed
(*CRLF) carriage return, followed by linefeed
(*ANYCRLF) any of the three above
(*ANY) all Unicode newline sequences
+ (*NUL) the NUL character (binary zero)
.sp
These override the default and the options given to the compiling function. For
example, on a Unix system where LF is the default newline sequence, the pattern
@@ -201,8 +218,8 @@ The newline convention affects where the circumflex and dollar assertions are
true. It also affects the interpretation of the dot metacharacter when
PCRE2_DOTALL is not set, and the behaviour of \eN. However, it does not affect
what the \eR escape sequence matches. By default, this is any Unicode newline
-sequence, for Perl compatibility. However, this can be changed; see the
-description of \eR in the section entitled
+sequence, for Perl compatibility. However, this can be changed; see the next
+section and the description of \eR in the section entitled
.\" HTML <a href="#newlineseq">
.\" </a>
"Newline sequences"
@@ -225,7 +242,7 @@ corresponding to PCRE2_BSR_UNICODE.
.rs
.sp
PCRE2 can be compiled to run in an environment that uses EBCDIC as its
-character code rather than ASCII or Unicode (typically a mainframe system). In
+character code instead of ASCII or Unicode (typically a mainframe system). In
the sections below, character code values are ASCII or Unicode; in an EBCDIC
environment these characters may have different code values, and there are no
code points greater than 255.
@@ -292,11 +309,11 @@ character that is not a number or a letter, it takes away any special meaning
that character may have. This use of backslash as an escape character applies
both inside and outside character classes.
.P
-For example, if you want to match a * character, you write \e* in the pattern.
-This escaping action applies whether or not the following character would
-otherwise be interpreted as a metacharacter, so it is always safe to precede a
-non-alphanumeric with backslash to specify that it stands for itself. In
-particular, if you want to match a backslash, you write \e\e.
+For example, if you want to match a * character, you must write \e* in the
+pattern. This escaping action applies whether or not the following character
+would otherwise be interpreted as a metacharacter, so it is always safe to
+precede a non-alphanumeric with backslash to specify that it stands for itself.
+In particular, if you want to match a backslash, you write \e\e.
.P
In a UTF mode, only ASCII numbers and letters have any special meaning after a
backslash. All other characters (in particular, those whose codepoints are
@@ -326,7 +343,7 @@ An isolated \eE that is not preceded by \eQ is ignored. If \eQ is not followed
by \eE later in the pattern, the literal interpretation continues to the end of
the pattern (that is, \eE is assumed at the end). If the isolated \eQ is inside
a character class, this causes an error, because the character class is not
-terminated.
+terminated by a closing square bracket.
.
.
.\" HTML <a name="digitsafterbackslash"></a>
@@ -359,29 +376,28 @@ case letter, it is converted to upper case. Then bit 6 of the character (hex
40) is inverted. Thus \ecA to \ecZ become hex 01 to hex 1A (A is 41, Z is 5A),
but \ec{ becomes hex 3B ({ is 7B), and \ec; becomes hex 7B (; is 3B). If the
code unit following \ec has a value less than 32 or greater than 126, a
-compile-time error occurs. This locks out non-printable ASCII characters in all
-modes.
+compile-time error occurs.
.P
When PCRE2 is compiled in EBCDIC mode, \ea, \ee, \ef, \en, \er, and \et
generate the appropriate EBCDIC code values. The \ec escape is processed
as specified for Perl in the \fBperlebcdic\fP document. The only characters
that are allowed after \ec are A-Z, a-z, or one of @, [, \e, ], ^, _, or ?. Any
-other character provokes a compile-time error. The sequence \e@ encodes
-character code 0; the letters (in either case) encode characters 1-26 (hex 01
-to hex 1A); [, \e, ], ^, and _ encode characters 27-31 (hex 1B to hex 1F), and
-\e? becomes either 255 (hex FF) or 95 (hex 5F).
+other character provokes a compile-time error. The sequence \ec@ encodes
+character code 0; after \ec the letters (in either case) encode characters 1-26
+(hex 01 to hex 1A); [, \e, ], ^, and _ encode characters 27-31 (hex 1B to hex
+1F), and \ec? becomes either 255 (hex FF) or 95 (hex 5F).
.P
-Thus, apart from \e?, these escapes generate the same character code values as
+Thus, apart from \ec?, these escapes generate the same character code values as
they do in an ASCII environment, though the meanings of the values mostly
-differ. For example, \eG always generates code value 7, which is BEL in ASCII
+differ. For example, \ecG always generates code value 7, which is BEL in ASCII
but DEL in EBCDIC.
.P
-The sequence \e? generates DEL (127, hex 7F) in an ASCII environment, but
+The sequence \ec? generates DEL (127, hex 7F) in an ASCII environment, but
because 127 is not a control character in EBCDIC, Perl makes it generate the
APC character. Unfortunately, there are several variants of EBCDIC. In most of
them the APC character has the value 255 (hex FF), but in the one Perl calls
POSIX-BC its value is 95 (hex 5F). If certain other characters have POSIX-BC
-values, PCRE2 makes \e? generate 95; otherwise it generates 255.
+values, PCRE2 makes \ec? generate 95; otherwise it generates 255.
.P
After \e0 up to two further octal digits are read. If there are fewer than two
digits, just those that are present are used. Thus the sequence \e0\ex\e015
@@ -455,9 +471,9 @@ a hexadecimal digit appears between \ex{ and }, or if there is no terminating
.P
If the PCRE2_ALT_BSUX option is set, the interpretation of \ex is as just
described only when it is followed by two hexadecimal digits. Otherwise, it
-matches a literal "x" character. In this mode mode, support for code points
-greater than 256 is provided by \eu, which must be followed by four hexadecimal
-digits; otherwise it matches a literal "u" character.
+matches a literal "x" character. In this mode, support for code points greater
+than 256 is provided by \eu, which must be followed by four hexadecimal digits;
+otherwise it matches a literal "u" character.
.P
Characters whose value is less than 256 can be defined by either of the two
syntaxes for \ex (or by \eu in PCRE2_ALT_BSUX mode). There is no difference in
@@ -471,15 +487,15 @@ the way they are handled. For example, \exdc is exactly the same as \ex{dc} (or
Characters that are specified using octal or hexadecimal numbers are
limited to certain values, as follows:
.sp
- 8-bit non-UTF mode less than 0x100
- 8-bit UTF-8 mode less than 0x10ffff and a valid codepoint
- 16-bit non-UTF mode less than 0x10000
- 16-bit UTF-16 mode less than 0x10ffff and a valid codepoint
- 32-bit non-UTF mode less than 0x100000000
- 32-bit UTF-32 mode less than 0x10ffff and a valid codepoint
+ 8-bit non-UTF mode no greater than 0xff
+ 16-bit non-UTF mode no greater than 0xffff
+ 32-bit non-UTF mode no greater than 0xffffffff
+ All UTF modes no greater than 0x10ffff and a valid codepoint
.sp
-Invalid Unicode codepoints are the range 0xd800 to 0xdfff (the so-called
-"surrogate" codepoints), and 0xffef.
+Invalid Unicode codepoints are all those in the range 0xd800 to 0xdfff (the
+so-called "surrogate" codepoints). The check for these can be disabled by the
+caller of \fBpcre2_compile()\fP by setting the option
+PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES.
.
.
.SS "Escape sequences in character classes"
@@ -502,15 +518,15 @@ In Perl, the sequences \el, \eL, \eu, and \eU are recognized by its string
handler and used to modify the case of following characters. By default, PCRE2
does not support these escape sequences. However, if the PCRE2_ALT_BSUX option
is set, \eU matches a "U" character, and \eu can be used to define a character
-by code point, as described in the previous section.
+by code point, as described above.
.
.
.SS "Absolute and relative back references"
.rs
.sp
-The sequence \eg followed by an unsigned or a negative number, optionally
-enclosed in braces, is an absolute or relative back reference. A named back
-reference can be coded as \eg{name}. Back references are discussed
+The sequence \eg followed by a signed or unsigned number, optionally enclosed
+in braces, is an absolute or relative back reference. A named back reference
+can be coded as \eg{name}. Back references are discussed
.\" HTML <a href="#backreferences">
.\" </a>
later,
@@ -710,7 +726,9 @@ When PCRE2 is built with Unicode support (the default), three additional escape
sequences that match characters with specific properties are available. In
8-bit non-UTF-8 mode, these sequences are of course limited to testing
characters whose codepoints are less than 256, but they do work in this mode.
-The extra escape sequences are:
+In 32-bit non-UTF mode, codepoints greater than 0x10ffff (the Unicode limit)
+may be encountered. These are all treated as being in the Common script and
+with an unassigned type. The extra escape sequences are:
.sp
\ep{\fIxx\fP} a character with the \fIxx\fP property
\eP{\fIxx\fP} a character without the \fIxx\fP property
@@ -738,6 +756,7 @@ example:
Those that are not part of an identified script are lumped together as
"Common". The current list of scripts is:
.P
+Adlam,
Ahom,
Anatolian_Hieroglyphs,
Arabic,
@@ -748,6 +767,7 @@ Bamum,
Bassa_Vah,
Batak,
Bengali,
+Bhaiksuki,
Bopomofo,
Brahmi,
Braille,
@@ -809,6 +829,8 @@ Mahajani,
Malayalam,
Mandaic,
Manichaean,
+Marchen,
+Masaram_Gondi,
Meetei_Mayek,
Mende_Kikakui,
Meroitic_Cursive,
@@ -821,7 +843,9 @@ Multani,
Myanmar,
Nabataean,
New_Tai_Lue,
+Newa,
Nko,
+Nushu,
Ogham,
Ol_Chiki,
Old_Hungarian,
@@ -832,6 +856,7 @@ Old_Persian,
Old_South_Arabian,
Old_Turkic,
Oriya,
+Osage,
Osmanya,
Pahawh_Hmong,
Palmyrene,
@@ -849,6 +874,7 @@ Siddham,
SignWriting,
Sinhala,
Sora_Sompeng,
+Soyombo,
Sundanese,
Syloti_Nagri,
Syriac,
@@ -859,6 +885,7 @@ Tai_Tham,
Tai_Viet,
Takri,
Tamil,
+Tangut,
Telugu,
Thaana,
Thai,
@@ -868,7 +895,8 @@ Tirhuta,
Ugaritic,
Vai,
Warang_Citi,
-Yi.
+Yi,
+Zanabazar_Square.
.P
Each character has exactly one Unicode general category property, specified by
a two-letter abbreviation. For compatibility with Perl, negation can be
@@ -972,9 +1000,11 @@ grapheme cluster", and treats the sequence as an atomic group
.\"
Unicode supports various kinds of composite character by giving each character
a grapheme breaking property, and having rules that use these properties to
-define the boundaries of extended grapheme clusters. \eX always matches at
-least one character. Then it decides whether to add additional characters
-according to the following rules for ending a cluster:
+define the boundaries of extended grapheme clusters. The rules are defined in
+Unicode Standard Annex 29, "Unicode Text Segmentation".
+.P
+\eX always matches at least one character. Then it decides whether to add
+additional characters according to the following rules for ending a cluster:
.P
1. End at the end of the subject string.
.P
@@ -985,11 +1015,22 @@ are of five types: L, V, T, LV, and LVT. An L character may be followed by an
L, V, LV, or LVT character; an LV or V character may be followed by a V or T
character; an LVT or T character may be follwed only by a T character.
.P
-4. Do not end before extending characters or spacing marks. Characters with
-the "mark" property always have the "extend" grapheme breaking property.
+4. Do not end before extending characters, spacing marks, or the "zero-width
+joiner" character. Characters with the "mark" property always have the
+"extend" grapheme breaking property.
.P
5. Do not end after prepend characters.
.P
+6. Do not break within emoji modifier sequences (a base character followed by a
+modifier). Extending characters are allowed before the modifier.
+.P
+7. Do not break within emoji zwj sequences (zero-width joiner followed by
+"glue after ZWJ" or "base glue after ZWJ").
+.P
+8. Do not break within emoji flag sequences. That is, do not break between
+regional indicator (RI) characters if there are an odd number of RI characters
+before the break point.
+.P
6. Otherwise, end the cluster.
.
.
@@ -1325,13 +1366,34 @@ when matching character classes, whatever line-ending sequence is in use, and
whatever setting of the PCRE2_DOTALL and PCRE2_MULTILINE options is used. A
class such as [^a] always matches one of these characters.
.P
+The character escape sequences \ed, \eD, \eh, \eH, \ep, \eP, \es, \eS, \ev,
+\eV, \ew, and \eW may appear in a character class, and add the characters that
+they match to the class. For example, [\edABCDEF] matches any hexadecimal
+digit. In UTF modes, the PCRE2_UCP option affects the meanings of \ed, \es, \ew
+and their upper case partners, just as it does when they appear outside a
+character class, as described in the section entitled
+.\" HTML <a href="#genericchartypes">
+.\" </a>
+"Generic character types"
+.\"
+above. The escape sequence \eb has a different meaning inside a character
+class; it matches the backspace character. The sequences \eB, \eN, \eR, and \eX
+are not special inside a character class. Like any other unrecognized escape
+sequences, they cause an error.
+.P
The minus (hyphen) character can be used to specify a range of characters in a
character class. For example, [d-m] matches any letter between d and m,
inclusive. If a minus character is required in a class, it must be escaped with
a backslash or appear in a position where it cannot be interpreted as
-indicating a range, typically as the first or last character in the class, or
-immediately after a range. For example, [b-d-z] matches letters in the range b
-to d, a hyphen character, or z.
+indicating a range, typically as the first or last character in the class,
+or immediately after a range. For example, [b-d-z] matches letters in the range
+b to d, a hyphen character, or z.
+.P
+Perl treats a hyphen as a literal if it appears before or after a POSIX class
+(see below) or before or after a character type escape such as \ed or \eH.
+However, unless the hyphen is the last character in the class, Perl outputs a
+warning in its warning mode, as this is most likely a user error. As PCRE2 has
+no facility for warning, an error is given in these cases.
.P
It is not possible to have the literal character "]" as the end character of a
range. A pattern such as [W-]46] is interpreted as a class of two characters
@@ -1341,15 +1403,14 @@ the end of range, so [W-\e]46] is interpreted as a class containing a range
followed by two other characters. The octal or hexadecimal representation of
"]" can also be used to end a range.
.P
-An error is generated if a POSIX character class (see below) or an escape
-sequence other than one that defines a single character appears at a point
-where a range ending character is expected. For example, [z-\exff] is valid,
-but [A-\ed] and [A-[:digit:]] are not.
-.P
Ranges normally include all code points between the start and end characters,
inclusive. They can also be used for code points specified numerically, for
example [\e000-\e037]. Ranges can include any characters that are valid for the
-current mode.
+current mode. In any UTF mode, the so-called "surrogate" characters (those
+whose code points lie between 0xd800 and 0xdfff inclusive) may not be specified
+explicitly by default (the PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES option disables
+this check). However, ranges such as [\ex{d7ff}-\ex{e000}], which include the
+surrogates, are always permitted.
.P
There is a special case in EBCDIC environments for ranges whose end points are
both specified as literal letters in the same case. For compatibility with
@@ -1365,21 +1426,6 @@ matches the letters in either case. For example, [W-c] is equivalent to
tables for a French locale are in use, [\exc8-\excb] matches accented E
characters in both cases.
.P
-The character escape sequences \ed, \eD, \eh, \eH, \ep, \eP, \es, \eS, \ev,
-\eV, \ew, and \eW may appear in a character class, and add the characters that
-they match to the class. For example, [\edABCDEF] matches any hexadecimal
-digit. In UTF modes, the PCRE2_UCP option affects the meanings of \ed, \es, \ew
-and their upper case partners, just as it does when they appear outside a
-character class, as described in the section entitled
-.\" HTML <a href="#genericchartypes">
-.\" </a>
-"Generic character types"
-.\"
-above. The escape sequence \eb has a different meaning inside a character
-class; it matches the backspace character. The sequences \eB, \eN, \eR, and \eX
-are not special inside a character class. Like any other unrecognized escape
-sequences, they cause an error.
-.P
A circumflex can conveniently be used with the upper case character types to
specify a more restricted set of characters than the matching lower case type.
For example, the class [^\eW_] matches any letter or digit, but not underscore,
@@ -1527,20 +1573,25 @@ alternative in the subpattern.
.SH "INTERNAL OPTION SETTING"
.rs
.sp
-The settings of the PCRE2_CASELESS, PCRE2_MULTILINE, PCRE2_DOTALL, and
-PCRE2_EXTENDED options (which are Perl-compatible) can be changed from within
-the pattern by a sequence of Perl option letters enclosed between "(?" and ")".
-The option letters are
+The settings of the PCRE2_CASELESS, PCRE2_MULTILINE, PCRE2_DOTALL,
+PCRE2_EXTENDED, PCRE2_EXTENDED_MORE, and PCRE2_NO_AUTO_CAPTURE options (which
+are Perl-compatible) can be changed from within the pattern by a sequence of
+Perl option letters enclosed between "(?" and ")". The option letters are
.sp
i for PCRE2_CASELESS
m for PCRE2_MULTILINE
+ n for PCRE2_NO_AUTO_CAPTURE
s for PCRE2_DOTALL
x for PCRE2_EXTENDED
+ xx for PCRE2_EXTENDED_MORE
.sp
For example, (?im) sets caseless, multiline matching. It is also possible to
-unset these options by preceding the letter with a hyphen, and a combined
-setting and unsetting such as (?im-sx), which sets PCRE2_CASELESS and
-PCRE2_MULTILINE while unsetting PCRE2_DOTALL and PCRE2_EXTENDED, is also
+unset these options by preceding the letter with a hyphen. The two "extended"
+options are not independent; unsetting either one cancels the effects of both
+of them.
+.P
+A combined setting and unsetting such as (?im-sx), which sets PCRE2_CASELESS
+and PCRE2_MULTILINE while unsetting PCRE2_DOTALL and PCRE2_EXTENDED, is also
permitted. If a letter appears both before and after the hyphen, the option is
unset. An empty options setting "(?)" is allowed. Needless to say, it has no
effect.
@@ -1551,12 +1602,8 @@ respectively.
.P
When one of these option changes occurs at top level (that is, not inside
subpattern parentheses), the change applies to the remainder of the pattern
-that follows. If the change is placed right at the start of a pattern, PCRE2
-extracts it into the global options (and it will therefore show up in data
-extracted by the \fBpcre2_pattern_info()\fP function).
-.P
-An option change within a subpattern (see below for a description of
-subpatterns) affects only that part of the subpattern that follows it, so
+that follows. An option change within a subpattern (see below for a description
+of subpatterns) affects only that part of the subpattern that follows it, so
.sp
(a(?i)b)c
.sp
@@ -2096,9 +2143,9 @@ no such problem when named parentheses are used. A back reference to any
subpattern is possible using named parentheses (see below).
.P
Another way of avoiding the ambiguity inherent in the use of digits following a
-backslash is to use the \eg escape sequence. This escape must be followed by an
-unsigned number or a negative number, optionally enclosed in braces. These
-examples are all identical:
+backslash is to use the \eg escape sequence. This escape must be followed by a
+signed or unsigned number, optionally enclosed in braces. These examples are
+all identical:
.sp
(ring), \e1
(ring), \eg1
@@ -2106,8 +2153,7 @@ examples are all identical:
.sp
An unsigned number specifies an absolute reference without the ambiguity that
is present in the older syntax. It is also useful when literal digits follow
-the reference. A negative number is a relative reference. Consider this
-example:
+the reference. A signed number is a relative reference. Consider this example:
.sp
(abc(def)ghi)\eg{-1}
.sp
@@ -2117,6 +2163,10 @@ Similarly, \eg{-2} would be equivalent to \e1. The use of relative references
can be helpful in long patterns, and also in patterns that are created by
joining together fragments that contain references within themselves.
.P
+The sequence \eg{+1} is a reference to the next capturing subpattern. This kind
+of forward reference can be useful in patterns that repeat. Perl does not
+support the use of + in this way.
+.P
A back reference matches whatever actually matched the capturing subpattern in
the current subject string, rather than anything matching the subpattern
itself (see
@@ -2215,14 +2265,28 @@ above.
.P
More complicated assertions are coded as subpatterns. There are two kinds:
those that look ahead of the current position in the subject string, and those
-that look behind it. An assertion subpattern is matched in the normal way,
-except that it does not cause the current matching position to be changed.
-.P
-Assertion subpatterns are not capturing subpatterns. If such an assertion
-contains capturing subpatterns within it, these are counted for the purposes of
+that look behind it, and in each case an assertion may be positive (must
+succeed for matching to continue) or negative (must not succeed for matching to
+continue). An assertion subpattern is matched in the normal way, except that,
+when matching continues afterwards, the matching position in the subject string
+is as it was at the start of the assertion.
+.P
+Assertion subpatterns are not capturing subpatterns. If an assertion contains
+capturing subpatterns within it, these are counted for the purposes of
numbering the capturing subpatterns in the whole pattern. However, substring
-capturing is carried out only for positive assertions. (Perl sometimes, but not
-always, does do capturing in negative assertions.)
+capturing is carried out only for positive assertions that succeed, that is,
+one of their branches matches, so matching continues after the assertion. If
+all branches of a positive assertion fail to match, nothing is captured, and
+control is passed to the previous backtracking point.
+.P
+No capturing is done for a negative assertion unless it is being used as a
+condition in a
+.\" HTML <a href="#subpatternsassubroutines">
+.\" </a>
+conditional subpattern
+.\"
+(see the discussion below). Matching continues after a non-conditional negative
+assertion only if all its branches fail to match.
.P
For compatibility with Perl, most assertion subpatterns may be repeated; though
it makes no sense to assert the same thing several times, the side effect of
@@ -2321,23 +2385,34 @@ temporarily move the current position back by the fixed length and then try to
match. If there are insufficient characters before the current position, the
assertion fails.
.P
-In a UTF mode, PCRE2 does not allow the \eC escape (which matches a single code
-unit even in a UTF mode) to appear in lookbehind assertions, because it makes
-it impossible to calculate the length of the lookbehind. The \eX and \eR
-escapes, which can match different numbers of code units, are also not
-permitted.
+In UTF-8 and UTF-16 modes, PCRE2 does not allow the \eC escape (which matches a
+single code unit even in a UTF mode) to appear in lookbehind assertions,
+because it makes it impossible to calculate the length of the lookbehind. The
+\eX and \eR escapes, which can match different numbers of code units, are never
+permitted in lookbehinds.
.P
.\" HTML <a href="#subpatternsassubroutines">
.\" </a>
"Subroutine"
.\"
calls (see below) such as (?2) or (?&X) are permitted in lookbehinds, as long
-as the subpattern matches a fixed-length string.
+as the subpattern matches a fixed-length string. However,
.\" HTML <a href="#recursion">
.\" </a>
-Recursion,
+recursion,
.\"
-however, is not supported.
+that is, a "subroutine" call into a group that is already active,
+is not supported.
+.P
+Perl does not support back references in lookbehinds. PCRE2 does support them,
+but only if certain conditions are met. The PCRE2_MATCH_UNSET_BACKREF option
+must not be set, there must be no use of (?| in the pattern (it creates
+duplicate subpattern numbers), and if the back reference is by name, the name
+must be unique. Of course, the referenced subpattern must itself be of fixed
+length. The following pattern matches words containing at least two characters
+that begin and end with the same character:
+.sp
+ \eb(\ew)\ew++(?<=\e1)
.P
Possessive quantifiers can be used in conjunction with lookbehind assertions to
specify efficient matching of fixed-length strings at the end of subject
@@ -2476,7 +2551,9 @@ This makes the fragment independent of the parentheses in the larger pattern.
.sp
Perl uses the syntax (?(<name>)...) or (?('name')...) to test for a used
subpattern by name. For compatibility with earlier versions of PCRE1, which had
-this facility before Perl, the syntax (?(name)...) is also recognized.
+this facility before Perl, the syntax (?(name)...) is also recognized. Note,
+however, that undelimited names consisting of the letter R followed by digits
+are ambiguous (see the following section).
.P
Rewriting the above example to use a named subpattern gives this:
.sp
@@ -2490,33 +2567,55 @@ matched.
.SS "Checking for pattern recursion"
.rs
.sp
-If the condition is the string (R), and there is no subpattern with the name R,
-the condition is true if a recursive call to the whole pattern or any
-subpattern has been made. If digits or a name preceded by ampersand follow the
-letter R, for example:
+"Recursion" in this sense refers to any subroutine-like call from one part of
+the pattern to another, whether or not it is actually recursive. See the
+sections entitled
+.\" HTML <a href="#recursion">
+.\" </a>
+"Recursive patterns"
+.\"
+and
+.\" HTML <a href="#subpatternsassubroutines">
+.\" </a>
+"Subpatterns as subroutines"
+.\"
+below for details of recursion and subpattern calls.
+.P
+If a condition is the string (R), and there is no subpattern with the name R,
+the condition is true if matching is currently in a recursion or subroutine
+call to the whole pattern or any subpattern. If digits follow the letter R, and
+there is no subpattern with that name, the condition is true if the most recent
+call is into a subpattern with the given number, which must exist somewhere in
+the overall pattern. This is a contrived example that is equivalent to a+b:
+.sp
+ ((?(R1)a+|(?1)b))
+.sp
+However, in both cases, if there is a subpattern with a matching name, the
+condition tests for its being set, as described in the section above, instead
+of testing for recursion. For example, creating a group with the name R1 by
+adding (?<R1>) to the above pattern completely changes its meaning.
+.P
+If a name preceded by ampersand follows the letter R, for example:
.sp
- (?(R3)...) or (?(R&name)...)
+ (?(R&name)...)
.sp
-the condition is true if the most recent recursion is into a subpattern whose
-number or name is given. This condition does not check the entire recursion
-stack. If the name used in a condition of this kind is a duplicate, the test is
-applied to all subpatterns of the same name, and is true if any one of them is
-the most recent recursion.
+the condition is true if the most recent recursion is into a subpattern of that
+name (which must exist within the pattern).
+.P
+This condition does not check the entire recursion stack. It tests only the
+current level. If the name used in a condition of this kind is a duplicate, the
+test is applied to all subpatterns of the same name, and is true if any one of
+them is the most recent recursion.
.P
At "top level", all these recursion test conditions are false.
-.\" HTML <a href="#recursion">
-.\" </a>
-The syntax for recursive patterns
-.\"
-is described below.
.
.
.\" HTML <a name="subdefine"></a>
.SS "Defining subpatterns for use by reference only"
.rs
.sp
-If the condition is the string (DEFINE), and there is no subpattern with the
-name DEFINE, the condition is always false. In this case, there may be only one
+If the condition is the string (DEFINE), the condition is always false, even if
+there is a group with the name DEFINE. In this case, there may be only one
alternative in the subpattern. It is always skipped if control reaches this
point in the pattern; the idea of DEFINE is that it can be used to define
subroutines that can be referenced from elsewhere. (The use of
@@ -2574,6 +2673,12 @@ presence of at least one letter in the subject. If a letter is found, the
subject is matched against the first alternative; otherwise it is matched
against the second. This pattern matches strings in one of the two forms
dd-aaa-dd or dd-dd-dd, where aaa are letters and dd are digits.
+.P
+When an assertion that is a condition contains capturing subpatterns, any
+capturing that occurs in a matching branch is retained afterwards, for both
+positive and negative assertions, because matching always continues after the
+assertion, whether it succeeds or fails. (Compare non-conditional assertions,
+for which captures are retained only for positive assertions that succeed.)
.
.
.\" HTML <a name="comments"></a>
@@ -2753,88 +2858,53 @@ is the actual recursive call.
.SS "Differences in recursion processing between PCRE2 and Perl"
.rs
.sp
-Recursion processing in PCRE2 differs from Perl in two important ways. In PCRE2
-(like Python, but unlike Perl), a recursive subpattern call is always treated
-as an atomic group. That is, once it has matched some of the subject string, it
-is never re-entered, even if it contains untried alternatives and there is a
-subsequent matching failure. This can be illustrated by the following pattern,
-which purports to match a palindromic string that contains an odd number of
-characters (for example, "a", "aba", "abcba", "abcdcba"):
-.sp
- ^(.|(.)(?1)\e2)$
-.sp
-The idea is that it either matches a single character, or two identical
-characters surrounding a sub-palindrome. In Perl, this pattern works; in PCRE2
-it does not if the pattern is longer than three characters. Consider the
-subject string "abcba":
+Some former differences between PCRE2 and Perl no longer exist.
.P
-At the top level, the first character is matched, but as it is not at the end
-of the string, the first alternative fails; the second alternative is taken
-and the recursion kicks in. The recursive call to subpattern 1 successfully
-matches the next character ("b"). (Note that the beginning and end of line
-tests are not part of the recursion).
+Before release 10.30, recursion processing in PCRE2 differed from Perl in that
+a recursive subpattern call was always treated as an atomic group. That is,
+once it had matched some of the subject string, it was never re-entered, even
+if it contained untried alternatives and there was a subsequent matching
+failure. (Historical note: PCRE implemented recursion before Perl did.)
.P
-Back at the top level, the next character ("c") is compared with what
-subpattern 2 matched, which was "a". This fails. Because the recursion is
-treated as an atomic group, there are now no backtracking points, and so the
-entire match fails. (Perl is able, at this point, to re-enter the recursion and
-try the second alternative.) However, if the pattern is written with the
-alternatives in the other order, things are different:
-.sp
- ^((.)(?1)\e2|.)$
-.sp
-This time, the recursing alternative is tried first, and continues to recurse
-until it runs out of characters, at which point the recursion fails. But this
-time we do have another alternative to try at the higher level. That is the big
-difference: in the previous case the remaining alternative is at a deeper
-recursion level, which PCRE2 cannot use.
+Starting with release 10.30, recursive subroutine calls are no longer treated
+as atomic. That is, they can be re-entered to try unused alternatives if there
+is a matching failure later in the pattern. This is now compatible with the way
+Perl works. If you want a subroutine call to be atomic, you must explicitly
+enclose it in an atomic group.
.P
-To change the pattern so that it matches all palindromic strings, not just
-those with an odd number of characters, it is tempting to change the pattern to
-this:
+Supporting backtracking into recursions simplifies certain types of recursive
+pattern. For example, this pattern matches palindromic strings:
.sp
^((.)(?1)\e2|.?)$
.sp
-Again, this works in Perl, but not in PCRE2, and for the same reason. When a
-deeper recursion has matched a single character, it cannot be entered again in
-order to match an empty string. The solution is to separate the two cases, and
-write out the odd and even cases as alternatives at the higher level:
-.sp
- ^(?:((.)(?1)\e2|)|((.)(?3)\e4|.))
+The second branch in the group matches a single central character in the
+palindrome when there are an odd number of characters, or nothing when there
+are an even number of characters, but in order to work it has to be able to try
+the second case when the rest of the pattern match fails. If you want to match
+typical palindromic phrases, the pattern has to ignore all non-word characters,
+which can be done like this:
.sp
-If you want to match typical palindromic phrases, the pattern has to ignore all
-non-word characters, which can be done like this:
-.sp
- ^\eW*+(?:((.)\eW*+(?1)\eW*+\e2|)|((.)\eW*+(?3)\eW*+\e4|\eW*+.\eW*+))\eW*+$
+ ^\eW*+((.)\eW*+(?1)\eW*+\e2|\eW*+.?)\eW*+$
.sp
If run with the PCRE2_CASELESS option, this pattern matches phrases such as "A
-man, a plan, a canal: Panama!" and it works in both PCRE2 and Perl. Note the
-use of the possessive quantifier *+ to avoid backtracking into sequences of
-non-word characters. Without this, PCRE2 takes a great deal longer (ten times
-or more) to match typical phrases, and Perl takes so long that you think it has
-gone into a loop.
-.P
-\fBWARNING\fP: The palindrome-matching patterns above work only if the subject
-string does not start with a palindrome that is shorter than the entire string.
-For example, although "abcba" is correctly matched, if the subject is "ababa",
-PCRE2 finds the palindrome "aba" at the start, then fails at top level because
-the end of the string does not follow. Once again, it cannot jump back into the
-recursion to try other alternatives, so the entire match fails.
-.P
-The second way in which PCRE2 and Perl differ in their recursion processing is
-in the handling of captured values. In Perl, when a subpattern is called
-recursively or as a subpattern (see the next section), it has no access to any
-values that were captured outside the recursion, whereas in PCRE2 these values
-can be referenced. Consider this pattern:
+man, a plan, a canal: Panama!". Note the use of the possessive quantifier *+ to
+avoid backtracking into sequences of non-word characters. Without this, PCRE2
+takes a great deal longer (ten times or more) to match typical phrases, and
+Perl takes so long that you think it has gone into a loop.
+.P
+Another way in which PCRE2 and Perl used to differ in their recursion
+processing was in the handling of captured values. Formerly in Perl, when a
+subpattern was called recursively or as a subroutine (see the next section), it
+had no access to any values that were captured outside the recursion, whereas
+in PCRE2 these values can be referenced. Consider this pattern:
.sp
^(.)(\e1|a(?2))
.sp
-In PCRE2, this pattern matches "bab". The first capturing parentheses match "b",
-then in the second group, when the back reference \e1 fails to match "b", the
-second alternative matches "a" and then recurses. In the recursion, \e1 does
-now match "b" and so the whole match succeeds. In Perl, the pattern fails to
-match because inside the recursive call \e1 cannot access the externally set
-value.
+This pattern matches "bab". The first capturing parentheses match "b", then in
+the second group, when the back reference \e1 fails to match "b", the second
+alternative matches "a" and then recurses. In the recursion, \e1 does now match
+"b" and so the whole match succeeds. This match used to fail in Perl, but in
+later versions (I tried 5.024) it now works.
.
.
.\" HTML <a name="subpatternsassubroutines"></a>
@@ -2863,11 +2933,10 @@ matches "sense and sensibility" and "response and responsibility", but not
is used, it does match "sense and responsibility" as well as the other two
strings. Another example is given in the discussion of DEFINE above.
.P
-All subroutine calls, whether recursive or not, are always treated as atomic
-groups. That is, once a subroutine has matched some of the subject string, it
-is never re-entered, even if it contains untried alternatives and there is a
-subsequent matching failure. Any capturing parentheses that are set during the
-subroutine call revert to their previous values afterwards.
+Like recursions, subroutine calls used to be treated as atomic, but this
+changed at PCRE2 release 10.30, so backtracking into subroutine calls can now
+occur. However, any capturing parentheses that are set during the subroutine
+call revert to their previous values afterwards.
.P
Processing options such as case-independence are fixed when a subpattern is
defined, so if it is used as a subroutine, such options cannot be changed for
@@ -2980,26 +3049,28 @@ The doubling is removed before the string is passed to the callout function.
.SH "BACKTRACKING CONTROL"
.rs
.sp
-Perl 5.10 introduced a number of "Special Backtracking Control Verbs", which
-are still described in the Perl documentation as "experimental and subject to
-change or removal in a future version of Perl". It goes on to say: "Their usage
-in production code should be noted to avoid problems during upgrades." The same
-remarks apply to the PCRE2 features described in this section.
-.P
-The new verbs make use of what was previously invalid syntax: an opening
-parenthesis followed by an asterisk. They are generally of the form (*VERB) or
-(*VERB:NAME). Some verbs take either form, possibly behaving differently
-depending on whether or not a name is present.
+There are a number of special "Backtracking Control Verbs" (to use Perl's
+terminology) that modify the behaviour of backtracking during matching. They
+are generally of the form (*VERB) or (*VERB:NAME). Some verbs take either form,
+possibly behaving differently depending on whether or not a name is present.
.P
By default, for compatibility with Perl, a name is any sequence of characters
that does not include a closing parenthesis. The name is not processed in
any way, and it is not possible to include a closing parenthesis in the name.
-However, if the PCRE2_ALT_VERBNAMES option is set, normal backslash processing
-is applied to verb names and only an unescaped closing parenthesis terminates
-the name. A closing parenthesis can be included in a name either as \e) or
-between \eQ and \eE. If the PCRE2_EXTENDED option is set, unescaped whitespace
-in verb names is skipped and #-comments are recognized, exactly as in the rest
-of the pattern.
+This can be changed by setting the PCRE2_ALT_VERBNAMES option, but the result
+is no longer Perl-compatible.
+.P
+When PCRE2_ALT_VERBNAMES is set, backslash processing is applied to verb names
+and only an unescaped closing parenthesis terminates the name. However, the
+only backslash items that are permitted are \eQ, \eE, and sequences such as
+\ex{100} that define character code points. Character type escapes such as \ed
+are faulted.
+.P
+A closing parenthesis can be included in a name either as \e) or between \eQ
+and \eE. In addition to backslash processing, if the PCRE2_EXTENDED option is
+also set, unescaped whitespace in verb names is skipped, and #-comments are
+recognized, exactly as in the rest of the pattern. PCRE2_EXTENDED does not
+affect verb names unless PCRE2_ALT_VERBNAMES is also set.
.P
The maximum length of a name is 255 in the 8-bit library and 65535 in the
16-bit and 32-bit libraries. If the name is empty, that is, if the closing
@@ -3008,7 +3079,7 @@ not there. Any number of these verbs may occur in a pattern.
.P
Since these verbs are specifically related to backtracking, most of them can be
used only when the pattern is to be matched using the traditional matching
-function, because these use a backtracking algorithm. With the exception of
+function, because that uses a backtracking algorithm. With the exception of
(*FAIL), which behaves like a failing negative assertion, the backtracking
control verbs cause an error if encountered by the DFA matching function.
.P
@@ -3162,11 +3233,11 @@ to ensure that the match is always attempted.
The following verbs do nothing when they are encountered. Matching continues
with what follows, but if there is no subsequent match, causing a backtrack to
the verb, a failure is forced. That is, backtracking cannot pass to the left of
-the verb. However, when one of these verbs appears inside an atomic group
-(which includes any group that is called as a subroutine) or in an assertion
-that is true, its effect is confined to that group, because once the group has
-been matched, there is never any backtracking into it. In this situation,
-backtracking has to jump to the left of the entire atomic group or assertion.
+the verb. However, when one of these verbs appears inside an atomic group or in
+an assertion that is true, its effect is confined to that group, because once
+the group has been matched, there is never any backtracking into it. In this
+situation, backtracking has to jump to the left of the entire atomic group or
+assertion.
.P
These verbs differ in exactly what kind of failure occurs when backtracking
reaches them. The behaviour described below is what happens when the verb is
@@ -3226,8 +3297,8 @@ possessive quantifier, but there are some uses of (*PRUNE) that cannot be
expressed in any other way. In an anchored pattern (*PRUNE) has the same effect
as (*COMMIT).
.P
-The behaviour of (*PRUNE:NAME) is the not the same as (*MARK:NAME)(*PRUNE).
-It is like (*MARK:NAME) in that the name is remembered for passing back to the
+The behaviour of (*PRUNE:NAME) is not the same as (*MARK:NAME)(*PRUNE). It is
+like (*MARK:NAME) in that the name is remembered for passing back to the
caller. However, (*SKIP:NAME) searches only for names set with (*MARK),
ignoring those set by (*PRUNE) or (*THEN).
.sp
@@ -3365,25 +3436,30 @@ in the second repeat of the group acts.
.SS "Backtracking verbs in assertions"
.rs
.sp
-(*FAIL) in an assertion has its normal effect: it forces an immediate
-backtrack.
+(*FAIL) in any assertion has its normal effect: it forces an immediate
+backtrack. The behaviour of the other backtracking verbs depends on whether
+the assertion is standalone or acting as the condition in a conditional
+subpattern.
.P
-(*ACCEPT) in a positive assertion causes the assertion to succeed without any
-further processing. In a negative assertion, (*ACCEPT) causes the assertion to
-fail without any further processing.
+(*ACCEPT) in a standalone positive assertion causes the assertion to succeed
+without any further processing; captured substrings are retained. In a standalone
+negative assertion, (*ACCEPT) causes the assertion to fail without any further
+processing; captured substrings are discarded.
.P
-The other backtracking verbs are not treated specially if they appear in a
-positive assertion. In particular, (*THEN) skips to the next alternative in the
-innermost enclosing group that has alternations, whether or not this is within
-the assertion.
+If the assertion is a condition, (*ACCEPT) causes the condition to be true for
+a positive assertion and false for a negative one; captured substrings are
+retained in both cases.
.P
-Negative assertions are, however, different, in order to ensure that changing a
-positive assertion into a negative assertion changes its result. Backtracking
-into (*COMMIT), (*SKIP), or (*PRUNE) causes a negative assertion to be true,
-without considering any further alternative branches in the assertion.
-Backtracking into (*THEN) causes it to skip to the next enclosing alternative
-within the assertion (the normal behaviour), but if the assertion does not have
-such an alternative, (*THEN) behaves like (*PRUNE).
+The effect of (*THEN) is not allowed to escape beyond an assertion. If there
+are no more branches to try, (*THEN) causes a positive assertion to be false,
+and a negative assertion to be true.
+.P
+The other backtracking verbs are not treated specially if they appear in a
+standalone positive assertion. In a conditional positive assertion,
+backtracking into (*COMMIT), (*SKIP), or (*PRUNE) causes the condition to be
+false. However, for both standalone and conditional negative assertions,
+backtracking into (*COMMIT), (*SKIP), or (*PRUNE) causes the assertion to be
+true, without considering any further alternative branches.
.
.
.\" HTML <a name="btsub"></a>
@@ -3429,6 +3505,6 @@ Cambridge, England.
.rs
.sp
.nf
-Last updated: 20 June 2016
-Copyright (c) 1997-2016 University of Cambridge.
+Last updated: 12 September 2017
+Copyright (c) 1997-2017 University of Cambridge.
.fi
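The recursion change in doc/pcre2pattern.3 above (recursive calls are no longer atomic from release 10.30) can be illustrated with a small model. The following is a minimal sketch in Python, not PCRE2 code: it hand-simulates only the pattern ^((.)(?1)\e2|.?)$, and the names group1_ends, matches and atomic_calls are hypothetical. Each call yields the possible end offsets of group 1 in the order a backtracking matcher would try them; with atomic_calls set, a recursive call is cut off after its first result, which models the pre-10.30 behaviour.

```python
import itertools


def group1_ends(s, i, atomic_calls):
    """Yield end offsets at which group 1 of ^((.)(?1)\\2|.?)$ can match from i."""
    if i < len(s):
        c = s[i]  # what (.) captures at this level of the recursion
        inner = group1_ends(s, i + 1, atomic_calls)  # the (?1) call
        if atomic_calls:
            # Pre-10.30 model: the call cannot be re-entered after it returns.
            inner = itertools.islice(inner, 1)
        for j in inner:
            if j < len(s) and s[j] == c:  # the back reference \2
                yield j + 1
    # Second alternative .? is greedy: one character first, then nothing.
    if i < len(s):
        yield i + 1
    yield i


def matches(s, atomic_calls=False):
    """True if some way of matching group 1 from offset 0 reaches the end of s."""
    return any(end == len(s) for end in group1_ends(s, 0, atomic_calls))


for subject in ("abcba", "ababa", "abba"):
    print(subject, matches(subject), matches(subject, atomic_calls=True))
```

In this model "abcba" matches either way, while "ababa" and "abba" match only when the recursive call can be re-entered. That agrees with the cases described in the removed WARNING paragraph and in the removed discussion of even-length palindromes.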
diff --git a/doc/pcre2perform.3 b/doc/pcre2perform.3
index ec86fe7..8b49a2a 100644
--- a/doc/pcre2perform.3
+++ b/doc/pcre2perform.3
@@ -1,4 +1,4 @@
-.TH PCRE2PERFORM 3 "02 January 2015" "PCRE2 10.00"
+.TH PCRE2PERFORM 3 "08 April 2017" "PCRE2 10.30"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH "PCRE2 PERFORMANCE"
@@ -12,11 +12,11 @@ of them.
.rs
.sp
Patterns are compiled by PCRE2 into a reasonably efficient interpretive code,
-so that most simple patterns do not use much memory. However, there is one case
-where the memory usage of a compiled pattern can be unexpectedly large. If a
-parenthesized subpattern has a quantifier with a minimum greater than 1 and/or
-a limited maximum, the whole subpattern is repeated in the compiled code. For
-example, the pattern
+so that most simple patterns do not use much memory for storing the compiled
+version. However, there is one case where the memory usage of a compiled
+pattern can be unexpectedly large. If a parenthesized subpattern has a
+quantifier with a minimum greater than 1 and/or a limited maximum, the whole
+subpattern is repeated in the compiled code. For example, the pattern
.sp
(abc|def){2,4}
.sp
@@ -34,13 +34,13 @@ example, the very simple pattern
.sp
((ab){1,1000}c){1,3}
.sp
-uses 51K bytes when compiled using the 8-bit library. When PCRE2 is compiled
-with its default internal pointer size of two bytes, the size limit on a
-compiled pattern is 64K code units in the 8-bit and 16-bit libraries, and this
-is reached with the above pattern if the outer repetition is increased from 3
-to 4. PCRE2 can be compiled to use larger internal pointers and thus handle
-larger compiled patterns, but it is better to try to rewrite your pattern to
-use less memory if you can.
+uses over 50K bytes when compiled using the 8-bit library. When PCRE2 is
+compiled with its default internal pointer size of two bytes, the size limit on
+a compiled pattern is 64K code units in the 8-bit and 16-bit libraries, and
+this is reached with the above pattern if the outer repetition is increased
+from 3 to 4. PCRE2 can be compiled to use larger internal pointers and thus
+handle larger compiled patterns, but it is better to try to rewrite your
+pattern to use less memory if you can.
.P
One way of reducing the memory usage for such patterns is to make use of
PCRE2's
@@ -52,32 +52,35 @@ facility. Re-writing the above pattern as
.sp
((ab)(?2){0,999}c)(?1){0,2}
.sp
-reduces the memory requirements to 18K, and indeed it remains under 20K even
-with the outer repetition increased to 100. However, this pattern is not
-exactly equivalent, because the "subroutine" calls are treated as
-.\" HTML <a href="pcre2pattern.html#atomicgroup">
-.\" </a>
-atomic groups
-.\"
-into which there can be no backtracking if there is a subsequent matching
-failure. Therefore, PCRE2 cannot do this kind of rewriting automatically.
-Furthermore, there is a noticeable loss of speed when executing the modified
-pattern. Nevertheless, if the atomic grouping is not a problem and the loss of
-speed is acceptable, this kind of rewriting will allow you to process patterns
-that PCRE2 cannot otherwise handle.
+reduces the memory requirements to around 16K, and indeed it remains under 20K
+even with the outer repetition increased to 100. However, this kind of pattern
+is not always exactly equivalent, because any captures within subroutine calls
+are lost when the subroutine completes. If this is not a problem, this kind of
+rewriting will allow you to process patterns that PCRE2 cannot otherwise
+handle. The matching performance of the two different versions of the pattern
+is roughly the same. (This applies from release 10.30 - things were different
+in earlier releases.)
.
.
-.SH "STACK USAGE AT RUN TIME"
+.SH "STACK AND HEAP USAGE AT RUN TIME"
.rs
.sp
-When \fBpcre2_match()\fP is used for matching, certain kinds of pattern can
-cause it to use large amounts of the process stack. In some environments the
-default process stack is quite small, and if it runs out the result is often
-SIGSEGV. Rewriting your pattern can often help. The
-.\" HREF
-\fBpcre2stack\fP
-.\"
-documentation discusses this issue in detail.
+From release 10.30, the interpretive (non-JIT) version of \fBpcre2_match()\fP
+uses very little system stack at run time. In earlier releases recursive
+function calls could use a great deal of stack, and this could cause problems,
+but this usage has been eliminated. Backtracking positions are now explicitly
+remembered in memory frames controlled by the code. An initial 20K vector of
+frames is allocated on the system stack (enough for about 100 frames for small
+patterns), but if this is insufficient, heap memory is used. The amount of heap
+memory can be limited; if the limit is set to zero, only the initial stack
+vector is used. Rewriting patterns to be time-efficient, as described below,
+may also reduce the memory requirements.
+.P
+In contrast to \fBpcre2_match()\fP, \fBpcre2_dfa_match()\fP does use recursive
+function calls, but only for processing atomic groups, lookaround assertions,
+and recursion within the pattern. Too much nested recursion may cause stack
+issues. The "match depth" parameter can be used to limit the depth of function
+recursion in \fBpcre2_dfa_match()\fP.
.
.
.SH "PROCESSING TIME"
@@ -160,7 +163,59 @@ applied to a whole line of "a" characters, whereas the latter takes an
appreciable time with strings longer than about 20 characters.
.P
In many cases, the solution to this kind of performance issue is to use an
-atomic group or a possessive quantifier.
+atomic group or a possessive quantifier. This can often reduce memory
+requirements as well. As another example, consider this pattern:
+.sp
+ ([^<]|<(?!inet))+
+.sp
+It matches from wherever it starts until it encounters "<inet" or the end of
+the data, and is the kind of pattern that might be used when processing an XML
+file. Each iteration of the outer parentheses matches either one character that
+is not "<" or a "<" that is not followed by "inet". However, each time a
+parenthesis is processed, a backtracking position is passed, so this
+formulation uses a memory frame for each matched character. For a long string,
+a lot of memory is required. Consider now this rewritten pattern, which matches
+exactly the same strings:
+.sp
+ ([^<]++|<(?!inet))+
+.sp
+This runs much faster, because sequences of characters that do not contain "<"
+are "swallowed" in one item inside the parentheses, and a possessive quantifier
+is used to stop any backtracking into the runs of non-"<" characters. This
+version also uses a lot less memory because entry to a new set of parentheses
+happens only when a "<" character that is not followed by "inet" is encountered
+(and we assume this is relatively rare).
+.P
+This example shows that one way of optimizing performance when matching long
+subject strings is to write repeated parenthesized subpatterns to match more
+than one character whenever possible.
+.
+.
+.SS "SETTING RESOURCE LIMITS"
+.rs
+.sp
+You can set limits on the amount of processing that takes place when matching,
+and on the amount of heap memory that is used. The default values of the limits
+are very large, and unlikely ever to be reached. They can be changed when PCRE2 is
+built, and they can also be set when \fBpcre2_match()\fP or
+\fBpcre2_dfa_match()\fP is called. For details of these interfaces, see the
+.\" HREF
+\fBpcre2build\fP
+.\"
+documentation and the section entitled
+.\" HTML <a href="pcre2api.html#matchcontext">
+.\" </a>
+"The match context"
+.\"
+in the
+.\" HREF
+\fBpcre2api\fP
+.\"
+documentation.
+.P
+The \fBpcre2test\fP test program has a modifier called "find_limits" which, if
+applied to a subject line, causes it to find the smallest limits that allow a
+pattern to match. This is done by repeatedly matching with different limits.
.
.
.SH AUTHOR
@@ -177,6 +232,6 @@ Cambridge, England.
.rs
.sp
.nf
-Last updated: 02 January 2015
-Copyright (c) 1997-2015 University of Cambridge.
+Last updated: 08 April 2017
+Copyright (c) 1997-2017 University of Cambridge.
.fi
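The doc/pcre2perform.3 text above says that ([^<]|<(?!inet))+ uses a memory frame for each matched character, while ([^<]++|<(?!inet))+ enters the parentheses far less often. The following is a rough sketch in Python that counts group iterations along the successful match path. It does not measure PCRE2's actual frame allocation. The helper name group_iterations and the sample subject are assumptions made for illustration, and a greedy + stands in for the possessive ++ (older Python versions do not accept ++, and the iteration count is the same here).

```python
import re


def group_iterations(token_pattern, subject):
    """Count the iterations of (token)+ when matching from the start of subject.

    Returns (iterations, end offset of the overall match).
    """
    token = re.compile(token_pattern)
    pos = count = 0
    while True:
        m = token.match(subject, pos)
        if m is None or m.end() == pos:
            break  # the group cannot match again, so the repeat stops here
        pos = m.end()
        count += 1
    return count, pos


# Hypothetical XML-like data: two long runs of text, then the "<inet" stop marker.
subject = "x" * 1000 + "<b>" + "y" * 1000 + "<inet>"

per_char = group_iterations(r"[^<]|<(?!inet)", subject)   # one character per iteration
per_run = group_iterations(r"[^<]+|<(?!inet)", subject)   # one run per iteration

print(per_char)  # (2003, 2003)
print(per_run)   # (3, 2003)
```

Both forms stop at the same offset, just before "<inet", but the second enters the group only three times for this subject: once per run of non-"<" characters and once for the "<" that is not followed by "inet".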
diff --git a/doc/pcre2posix.3 b/doc/pcre2posix.3
index 70a86d8..399e2a8 100644
--- a/doc/pcre2posix.3
+++ b/doc/pcre2posix.3
@@ -1,4 +1,4 @@
-.TH PCRE2POSIX 3 "31 January 2016" "PCRE2 10.22"
+.TH PCRE2POSIX 3 "15 June 2017" "PCRE2 10.30"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH "SYNOPSIS"
@@ -46,7 +46,7 @@ replacement library. Other POSIX options are not even defined.
.P
There are also some options that are not defined by POSIX. These have been
added at the request of users who want to make use of certain PCRE2-specific
-features via the POSIX calling interface.
+features via the POSIX calling interface or to add BSD or GNU functionality.
.P
When PCRE2 is called via these functions, it is only the API that is POSIX-like
in style. The syntax and semantics of the regular expressions themselves are
@@ -68,10 +68,11 @@ identifying error codes.
.rs
.sp
The function \fBregcomp()\fP is called to compile a pattern into an
-internal form. The pattern is a C string terminated by a binary zero, and
-is passed in the argument \fIpattern\fP. The \fIpreg\fP argument is a pointer
-to a \fBregex_t\fP structure that is used as a base for storing information
-about the compiled regular expression.
+internal form. By default, the pattern is a C string terminated by a binary
+zero (but see REG_PEND below). The \fIpreg\fP argument is a pointer to a
+\fBregex_t\fP structure that is used as a base for storing information about
+the compiled regular expression. (It is also used for input when REG_PEND is
+set.)
.P
The argument \fIcflags\fP is either zero, or contains one or more of the bits
defined by the following macros:
@@ -93,6 +94,14 @@ The PCRE2_MULTILINE option is set when the regular expression is passed for
compilation to the native function. Note that this does \fInot\fP mimic the
defined POSIX behaviour for REG_NEWLINE (see the following section).
.sp
+ REG_NOSPEC
+.sp
+The PCRE2_LITERAL option is set when the regular expression is passed for
+compilation to the native function. This disables all meta characters in the
+pattern, causing it to be treated as a literal string. The only other options
+that are allowed with REG_NOSPEC are REG_ICASE, REG_NOSUB, REG_PEND, and
+REG_UTF. Note that REG_NOSPEC is not part of the POSIX standard.
+.sp
REG_NOSUB
.sp
When a pattern that is compiled with this flag is passed to \fBregexec()\fP for
@@ -101,6 +110,16 @@ captured strings are returned. Versions of the PCRE library prior to 10.22 used
to set the PCRE2_NO_AUTO_CAPTURE compile option, but this no longer happens
because it disables the use of back references.
.sp
+ REG_PEND
+.sp
+If this option is set, the \fBre_endp\fP field in the \fIpreg\fP structure
+(which has the type const char *) must be set to point to the character beyond
+the end of the pattern before calling \fBregcomp()\fP. The pattern itself may
+now contain binary zeroes, which are treated as data characters. Without
+REG_PEND, a binary zero terminates the pattern and the \fBre_endp\fP field is
+ignored. This is a GNU extension to the POSIX standard and should be used with
+caution in software intended to be portable to other systems.
+.sp
REG_UCP
.sp
The PCRE2_UCP option is set when the regular expression is passed for
@@ -130,9 +149,10 @@ newlines are matched by the dot metacharacter (they are not) or by a negative
class such as [^a] (they are).
.P
The yield of \fBregcomp()\fP is zero on success, and non-zero otherwise. The
-\fIpreg\fP structure is filled in on success, and one member of the structure
-is public: \fIre_nsub\fP contains the number of capturing subpatterns in
-the regular expression. Various error codes are defined in the header file.
+\fIpreg\fP structure is filled in on success, and one other member of the
+structure (as well as \fIre_endp\fP) is public: \fIre_nsub\fP contains the
+number of capturing subpatterns in the regular expression. Various error codes
+are defined in the header file.
.P
NOTE: If the yield of \fBregcomp()\fP is non-zero, you must not attempt to
use the contents of the \fIpreg\fP structure. If, for example, you pass it to
@@ -204,15 +224,24 @@ function.
.sp
REG_STARTEND
.sp
-The string is considered to start at \fIstring\fP + \fIpmatch[0].rm_so\fP and
-to have a terminating NUL located at \fIstring\fP + \fIpmatch[0].rm_eo\fP
-(there need not actually be a NUL at that location), regardless of the value of
-\fInmatch\fP. This is a BSD extension, compatible with but not specified by
-IEEE Standard 1003.2 (POSIX.2), and should be used with caution in software
-intended to be portable to other systems. Note that a non-zero \fIrm_so\fP does
-not imply REG_NOTBOL; REG_STARTEND affects only the location of the string, not
-how it is matched. Setting REG_STARTEND and passing \fIpmatch\fP as NULL are
-mutually exclusive; the error REG_INVARG is returned.
+When this option is set, the subject string starts at \fIstring\fP +
+\fIpmatch[0].rm_so\fP and ends at \fIstring\fP + \fIpmatch[0].rm_eo\fP, which
+should point to the first character beyond the string. There may be binary
+zeroes within the subject string, and indeed, using REG_STARTEND is the only
+way to pass a subject string that contains a binary zero.
+.P
+Whatever the value of \fIpmatch[0].rm_so\fP, the offsets of the matched string
+and any captured substrings are still given relative to the start of
+\fIstring\fP itself. (Before PCRE2 release 10.30 these were given relative to
+\fIstring\fP + \fIpmatch[0].rm_so\fP, but this differs from other
+implementations.)
+.P
+This is a BSD extension, compatible with but not specified by IEEE Standard
+1003.2 (POSIX.2), and should be used with caution in software intended to be
+portable to other systems. Note that a non-zero \fIrm_so\fP does not imply
+REG_NOTBOL; REG_STARTEND affects only the location and length of the string,
+not how it is matched. Setting REG_STARTEND and passing \fIpmatch\fP as NULL
+are mutually exclusive; the error REG_INVARG is returned.
.P
If the pattern was compiled with the REG_NOSUB flag, no data about any matched
strings is returned. The \fInmatch\fP and \fIpmatch\fP arguments of
@@ -271,6 +300,6 @@ Cambridge, England.
.rs
.sp
.nf
-Last updated: 31 January 2016
-Copyright (c) 1997-2016 University of Cambridge.
+Last updated: 15 June 2017
+Copyright (c) 1997-2017 University of Cambridge.
.fi
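The REG_STARTEND change in doc/pcre2posix.3 above concerns which origin the returned offsets use. As a loose analogy only, Python's re module has the same convention as the 10.30 behaviour: Pattern.search(string, pos, endpos) reports offsets relative to the start of the whole string, not relative to pos. The sketch below does not call regexec() or PCRE2, and the subject, pattern and variable names are made up for illustration. It shows both offset conventions for a subject that contains binary zeros.

```python
import re

# Subject with embedded binary zeros; search only the part from offset 6 onward.
subject = "ab\0cd ab\0cd"
pattern = re.compile(r"b\0c")

start, end = 6, len(subject)  # play the roles of pmatch[0].rm_so and rm_eo
m = pattern.search(subject, start, end)

# Offsets relative to the start of the whole string (the 10.30 convention).
absolute = (m.start(), m.end())
# Offsets relative to string + rm_so (what releases before 10.30 returned).
relative = (m.start() - start, m.end() - start)

print(absolute)  # (7, 10)
print(relative)  # (1, 4)
```

The analogy covers offsets only. Anchoring differs: in Python a non-zero pos does not make ^ match at pos, whereas the text above states that a non-zero rm_so does not imply REG_NOTBOL.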
diff --git a/doc/pcre2serialize.3 b/doc/pcre2serialize.3
index 664c1db..5a87cec 100644
--- a/doc/pcre2serialize.3
+++ b/doc/pcre2serialize.3
@@ -1,4 +1,4 @@
-.TH PCRE2SERIALIZE 3 "24 May 2016" "PCRE2 10.22"
+.TH PCRE2SERIALIZE 3 "21 March 2017" "PCRE2 10.30"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH "SAVING AND RE-USING PRECOMPILED PCRE2 PATTERNS"
@@ -37,7 +37,10 @@ The facility for saving and restoring compiled patterns is intended for use
within individual applications. As such, the data supplied to
\fBpcre2_serialize_decode()\fP is expected to be trusted data, not data from
arbitrary external sources. There is only some simple consistency checking, not
-complete validation of what is being re-loaded.
+complete validation of what is being re-loaded. Corrupted data may cause
+undefined results. For example, if the length field of a pattern in the
+serialized data is corrupted, the deserializing code may read beyond the end of
+the byte stream that is passed to it.
.
.
.SH "SAVING COMPILED PATTERNS"
@@ -181,6 +184,6 @@ Cambridge, England.
.rs
.sp
.nf
-Last updated: 24 May 2016
-Copyright (c) 1997-2016 University of Cambridge.
+Last updated: 21 March 2017
+Copyright (c) 1997-2017 University of Cambridge.
.fi
diff --git a/doc/pcre2stack.3 b/doc/pcre2stack.3
deleted file mode 100644
index 8711263..0000000
--- a/doc/pcre2stack.3
+++ /dev/null
@@ -1,202 +0,0 @@
-.TH PCRE2STACK 3 "21 November 2014" "PCRE2 10.00"
-.SH NAME
-PCRE2 - Perl-compatible regular expressions (revised API)
-.SH "PCRE2 DISCUSSION OF STACK USAGE"
-.rs
-.sp
-When you call \fBpcre2_match()\fP, it makes use of an internal function called
-\fBmatch()\fP. This calls itself recursively at branch points in the pattern,
-in order to remember the state of the match so that it can back up and try a
-different alternative after a failure. As matching proceeds deeper and deeper
-into the tree of possibilities, the recursion depth increases. The
-\fBmatch()\fP function is also called in other circumstances, for example,
-whenever a parenthesized sub-pattern is entered, and in certain cases of
-repetition.
-.P
-Not all calls of \fBmatch()\fP increase the recursion depth; for an item such
-as a* it may be called several times at the same level, after matching
-different numbers of a's. Furthermore, in a number of cases where the result of
-the recursive call would immediately be passed back as the result of the
-current call (a "tail recursion"), the function is just restarted instead.
-.P
-Each time the internal \fBmatch()\fP function is called recursively, it uses
-memory from the process stack. For certain kinds of pattern and data, very
-large amounts of stack may be needed, despite the recognition of "tail
-recursion". Note that if PCRE2 is compiled with the -fsanitize=address option
-of the GCC compiler, the stack requirements are greatly increased.
-.P
-The above comments apply when \fBpcre2_match()\fP is run in its normal
-interpretive manner. If the compiled pattern was processed by
-\fBpcre2_jit_compile()\fP, and just-in-time compiling was successful, and the
-options passed to \fBpcre2_match()\fP were not incompatible, the matching
-process uses the JIT-compiled code instead of the \fBmatch()\fP function. In
-this case, the memory requirements are handled entirely differently. See the
-.\" HREF
-\fBpcre2jit\fP
-.\"
-documentation for details.
-.P
-The \fBpcre2_dfa_match()\fP function operates in a different way to
-\fBpcre2_match()\fP, and uses recursion only when there is a regular expression
-recursion or subroutine call in the pattern. This includes the processing of
-assertion and "once-only" subpatterns, which are handled like subroutine calls.
-Normally, these are never very deep, and the limit on the complexity of
-\fBpcre2_dfa_match()\fP is controlled by the amount of workspace it is given.
-However, it is possible to write patterns with runaway infinite recursions;
-such patterns will cause \fBpcre2_dfa_match()\fP to run out of stack. At
-present, there is no protection against this.
-.P
-The comments that follow do NOT apply to \fBpcre2_dfa_match()\fP; they are
-relevant only for \fBpcre2_match()\fP without the JIT optimization.
-.
-.
-.SS "Reducing \fBpcre2_match()\fP's stack usage"
-.rs
-.sp
-You can often reduce the amount of recursion, and therefore the
-amount of stack used, by modifying the pattern that is being matched. Consider,
-for example, this pattern:
-.sp
- ([^<]|<(?!inet))+
-.sp
-It matches from wherever it starts until it encounters "<inet" or the end of
-the data, and is the kind of pattern that might be used when processing an XML
-file. Each iteration of the outer parentheses matches either one character that
-is not "<" or a "<" that is not followed by "inet". However, each time a
-parenthesis is processed, a recursion occurs, so this formulation uses a stack
-frame for each matched character. For a long string, a lot of stack is
-required. Consider now this rewritten pattern, which matches exactly the same
-strings:
-.sp
- ([^<]++|<(?!inet))+
-.sp
-This uses very much less stack, because runs of characters that do not contain
-"<" are "swallowed" in one item inside the parentheses. Recursion happens only
-when a "<" character that is not followed by "inet" is encountered (and we
-assume this is relatively rare). A possessive quantifier is used to stop any
-backtracking into the runs of non-"<" characters, but that is not related to
-stack usage.
-.P
-This example shows that one way of avoiding stack problems when matching long
-subject strings is to write repeated parenthesized subpatterns to match more
-than one character whenever possible.
-.
-.
-.SS "Compiling PCRE2 to use heap instead of stack for \fBpcre2_match()\fP"
-.rs
-.sp
-In environments where stack memory is constrained, you might want to compile
-PCRE2 to use heap memory instead of stack for remembering back-up points when
-\fBpcre2_match()\fP is running. This makes it run more slowly, however. Details
-of how to do this are given in the
-.\" HREF
-\fBpcre2build\fP
-.\"
-documentation. When built in this way, instead of using the stack, PCRE2
-gets memory for remembering backup points from the heap. By default, the memory
-is obtained by calling the system \fBmalloc()\fP function, but you can arrange
-to supply your own memory management function. For details, see the section
-entitled
-.\" HTML <a href="pcre2api.html#matchcontext">
-.\" </a>
-"The match context"
-.\"
-in the
-.\" HREF
-\fBpcre2api\fP
-.\"
-documentation. Since the block sizes are always the same, it may be possible to
-implement customized a memory handler that is more efficient than the standard
-function. The memory blocks obtained for this purpose are retained and re-used
-if possible while \fBpcre2_match()\fP is running. They are all freed just
-before it exits.
-.
-.
-.SS "Limiting \fBpcre2_match()\fP's stack usage"
-.rs
-.sp
-You can set limits on the number of times the internal \fBmatch()\fP function
-is called, both in total and recursively. If a limit is exceeded,
-\fBpcre2_match()\fP returns an error code. Setting suitable limits should
-prevent it from running out of stack. The default values of the limits are very
-large, and unlikely ever to operate. They can be changed when PCRE2 is built,
-and they can also be set when \fBpcre2_match()\fP is called. For details of
-these interfaces, see the
-.\" HREF
-\fBpcre2build\fP
-.\"
-documentation and the section entitled
-.\" HTML <a href="pcre2api.html#matchcontext">
-.\" </a>
-"The match context"
-.\"
-in the
-.\" HREF
-\fBpcre2api\fP
-.\"
-documentation.
-.P
-As a very rough rule of thumb, you should reckon on about 500 bytes per
-recursion. Thus, if you want to limit your stack usage to 8Mb, you should set
-the limit at 16000 recursions. A 64Mb stack, on the other hand, can support
-around 128000 recursions.
-.P
-The \fBpcre2test\fP test program has a modifier called "find_limits" which, if
-applied to a subject line, causes it to find the smallest limits that allow a a
-pattern to match. This is done by calling \fBpcre2_match()\fP repeatedly with
-different limits.
-.
-.
-.SS "Changing stack size in Unix-like systems"
-.rs
-.sp
-In Unix-like environments, there is not often a problem with the stack unless
-very long strings are involved, though the default limit on stack size varies
-from system to system. Values from 8Mb to 64Mb are common. You can find your
-default limit by running the command:
-.sp
- ulimit -s
-.sp
-Unfortunately, the effect of running out of stack is often SIGSEGV, though
-sometimes a more explicit error message is given. You can normally increase the
-limit on stack size by code such as this:
-.sp
- struct rlimit rlim;
- getrlimit(RLIMIT_STACK, &rlim);
- rlim.rlim_cur = 100*1024*1024;
- setrlimit(RLIMIT_STACK, &rlim);
-.sp
-This reads the current limits (soft and hard) using \fBgetrlimit()\fP, then
-attempts to increase the soft limit to 100Mb using \fBsetrlimit()\fP. You must
-do this before calling \fBpcre2_match()\fP.
-.
-.
-.SS "Changing stack size in Mac OS X"
-.rs
-.sp
-Using \fBsetrlimit()\fP, as described above, should also work on Mac OS X. It
-is also possible to set a stack size when linking a program. There is a
-discussion about stack sizes in Mac OS X at this web site:
-.\" HTML <a href="http://developer.apple.com/qa/qa2005/qa1419.html">
-.\" </a>
-http://developer.apple.com/qa/qa2005/qa1419.html.
-.\"
-.
-.
-.SH AUTHOR
-.rs
-.sp
-.nf
-Philip Hazel
-University Computing Service
-Cambridge, England.
-.fi
-.
-.
-.SH REVISION
-.rs
-.sp
-.nf
-Last updated: 21 November 2014
-Copyright (c) 1997-2014 University of Cambridge.
-.fi
diff --git a/doc/pcre2syntax.3 b/doc/pcre2syntax.3
index 8be8b92..6eb0235 100644
--- a/doc/pcre2syntax.3
+++ b/doc/pcre2syntax.3
@@ -1,4 +1,4 @@
-.TH PCRE2SYNTAX 3 "16 October 2015" "PCRE2 10.21"
+.TH PCRE2SYNTAX 3 "17 June 2017" "PCRE2 10.30"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH "PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY"
@@ -407,18 +407,21 @@ but some of them use Unicode properties if PCRE2_UCP is set. You can use
(?i) caseless
(?J) allow duplicate names
(?m) multiline
+ (?n) no auto capture
(?s) single line (dotall)
(?U) default ungreedy (lazy)
- (?x) extended (ignore white space)
+ (?x) extended: ignore white space except in classes
+ (?xx) as (?x) but also ignore space and tab in classes
(?-...) unset option(s)
.sp
The following are recognized only at the very start of a pattern or after one
of the newline or \eR options with similar syntax. More than one of them may
-appear.
+appear. For the first three, d is a decimal number.
.sp
- (*LIMIT_MATCH=d) set the match limit to d (decimal number)
- (*LIMIT_RECURSION=d) set the recursion limit to d (decimal number)
- (*NOTEMPTY) set PCRE2_NOTEMPTY when matching
+ (*LIMIT_DEPTH=d) set the backtracking limit to d
+ (*LIMIT_HEAP=d) set the heap size limit to d kilobytes
+ (*LIMIT_MATCH=d) set the match limit to d
+ (*NOTEMPTY) set PCRE2_NOTEMPTY when matching
(*NOTEMPTY_ATSTART) set PCRE2_NOTEMPTY_ATSTART when matching
(*NO_AUTO_POSSESS) no auto-possessification (PCRE2_NO_AUTO_POSSESS)
(*NO_DOTSTAR_ANCHOR) no .* anchoring (PCRE2_NO_DOTSTAR_ANCHOR)
@@ -427,10 +430,11 @@ appear.
(*UTF) set appropriate UTF mode for the library in use
(*UCP) set PCRE2_UCP (use Unicode properties for \ed etc)
.sp
-Note that LIMIT_MATCH and LIMIT_RECURSION can only reduce the value of the
-limits set by the caller of pcre2_match(), not increase them. The application
-can lock out the use of (*UTF) and (*UCP) by setting the PCRE2_NEVER_UTF or
-PCRE2_NEVER_UCP options, respectively, at compile time.
+Note that LIMIT_DEPTH, LIMIT_HEAP, and LIMIT_MATCH can only reduce the value of
+the limits set by the caller of \fBpcre2_match()\fP or \fBpcre2_dfa_match()\fP,
+not increase them. LIMIT_RECURSION is an obsolete synonym for LIMIT_DEPTH. The
+application can lock out the use of (*UTF) and (*UCP) by setting the
+PCRE2_NEVER_UTF or PCRE2_NEVER_UCP options, respectively, at compile time.
.
.
.SH "NEWLINE CONVENTION"
@@ -444,6 +448,7 @@ settings with a similar syntax.
(*CRLF) carriage return followed by linefeed
(*ANYCRLF) all three of the above
(*ANY) any Unicode newline sequence
+ (*NUL) the NUL character (binary zero)
.
.
.SH "WHAT \eR MATCHES"
@@ -473,6 +478,9 @@ Each top-level branch of a look behind must be of a fixed length.
\en reference by number (can be ambiguous)
\egn reference by number
\eg{n} reference by number
+ \eg+n relative reference by number (PCRE2 extension)
+ \eg-n relative reference by number
+ \eg{+n} relative reference by number (PCRE2 extension)
\eg{-n} relative reference by number
\ek<name> reference by name (Perl)
\ek'name' reference by name (Perl)
@@ -511,13 +519,17 @@ Each top-level branch of a look behind must be of a fixed length.
(?(-n) relative reference condition
(?(<name>) named reference condition (Perl)
(?('name') named reference condition (Perl)
- (?(name) named reference condition (PCRE2)
+ (?(name) named reference condition (PCRE2, deprecated)
(?(R) overall recursion condition
- (?(Rn) specific group recursion condition
- (?(R&name) specific recursion condition
+ (?(Rn) specific numbered group recursion condition
+ (?(R&name) specific named group recursion condition
(?(DEFINE) define subpattern for reference
(?(VERSION[>]=n.m) test PCRE2 version
(?(assert) assertion condition
+.sp
+Note the ambiguity of (?(R) and (?(Rn) which might be named reference
+conditions or recursion tests. Such a condition is interpreted as a reference
+condition if the relevant named group exists.
.
.
.SH "BACKTRACKING CONTROL"
@@ -577,6 +589,6 @@ Cambridge, England.
.rs
.sp
.nf
-Last updated: 16 October 2015
-Copyright (c) 1997-2015 University of Cambridge.
+Last updated: 17 June 2017
+Copyright (c) 1997-2017 University of Cambridge.
.fi
diff --git a/doc/pcre2test.1 b/doc/pcre2test.1
index 2fbf794..ee78792 100644
--- a/doc/pcre2test.1
+++ b/doc/pcre2test.1
@@ -1,4 +1,4 @@
-.TH PCRE2TEST 1 "06 July 2016" "PCRE 10.22"
+.TH PCRE2TEST 1 "21 December 2017" "PCRE 10.31"
.SH NAME
pcre2test - a program for testing Perl-compatible regular expressions.
.SH SYNOPSIS
@@ -29,7 +29,7 @@ subject is processed, and what output is produced.
.P
As the original fairly simple PCRE library evolved, it acquired many different
features, and as a result, the original \fBpcretest\fP program ended up with a
-lot of options in a messy, arcane syntax, for testing all the features. The
+lot of options in a messy, arcane syntax for testing all the features. The
move to the new PCRE2 API provided an opportunity to re-implement the test
program as \fBpcre2test\fP, with a cleaner modifier syntax. Nevertheless, there
are still many obscure modifiers, some of which are specifically designed for
@@ -47,32 +47,64 @@ strings that are encoded in 8-bit, 16-bit, or 32-bit code units. One, two, or
all three of these libraries may be simultaneously installed. The
\fBpcre2test\fP program can be used to test all the libraries. However, its own
input and output are always in 8-bit format. When testing the 16-bit or 32-bit
-libraries, patterns and subject strings are converted to 16- or 32-bit format
-before being passed to the library functions. Results are converted back to
-8-bit code units for output.
+libraries, patterns and subject strings are converted to 16-bit or 32-bit
+format before being passed to the library functions. Results are converted back
+to 8-bit code units for output.
.P
In the rest of this document, the names of library functions and structures
are given in generic form, for example, \fBpcre_compile()\fP. The actual
names used in the libraries have a suffix _8, _16, or _32, as appropriate.
.
.
+.\" HTML <a name="inputencoding"></a>
.SH "INPUT ENCODING"
.rs
.sp
Input to \fBpcre2test\fP is processed line by line, either by calling the C
-library's \fBfgets()\fP function, or via the \fBlibreadline\fP library (see
-below). The input is processed using using C's string functions, so must not
-contain binary zeroes, even though in Unix-like environments, \fBfgets()\fP
-treats any bytes other than newline as data characters. In some Windows
-environments character 26 (hex 1A) causes an immediate end of file, and no
-further data is read.
+library's \fBfgets()\fP function, or via the \fBlibreadline\fP library. In some
+Windows environments character 26 (hex 1A) causes an immediate end of file, and
+no further data is read, so this character should be avoided unless you really
+want that action.
.P
-For maximum portability, therefore, it is safest to avoid non-printing
-characters in \fBpcre2test\fP input files. There is a facility for specifying
-some or all of a pattern's characters as hexadecimal pairs, thus making it
-possible to include binary zeroes in a pattern for testing purposes. Subject
-lines are processed for backslash escapes, which makes it possible to include
-any data value.
+The input is processed using C's string functions, so must not
+contain binary zeros, even though in Unix-like environments, \fBfgets()\fP
+treats any bytes other than newline as data characters. An error is generated
+if a binary zero is encountered. By default, subject lines are processed for
+backslash escapes, which makes it possible to include any data value in strings
+that are passed to the library for matching. For patterns, there is a facility
+for specifying some or all of the 8-bit input characters as hexadecimal pairs,
+which makes it possible to include binary zeros.
+.
+.
+.SS "Input for the 16-bit and 32-bit libraries"
+.rs
+.sp
+When testing the 16-bit or 32-bit libraries, there is a need to be able to
+generate character code points greater than 255 in the strings that are passed
+to the library. For subject lines, backslash escapes can be used. In addition,
+when the \fButf\fP modifier (see
+.\" HTML <a href="#optionmodifiers">
+.\" </a>
+"Setting compilation options"
+.\"
+below) is set, the pattern and any following subject lines are interpreted as
+UTF-8 strings and translated to UTF-16 or UTF-32 as appropriate.
+.P
+For non-UTF testing of wide characters, the \fButf8_input\fP modifier can be
+used. This is mutually exclusive with \fButf\fP, and is allowed only in 16-bit
+or 32-bit mode. It causes the pattern and following subject lines to be treated
+as UTF-8 according to the original definition (RFC 2279), which allows for
+character values up to 0x7fffffff. Each character is placed in one 16-bit or
+32-bit code unit (in the 16-bit case, values greater than 0xffff cause an error
+to occur).
+.P
+UTF-8 (in its original definition) is not capable of encoding values greater
+than 0x7fffffff, but such values can be handled by the 32-bit library. When
+testing this library in non-UTF mode with \fButf8_input\fP set, if any
+character is preceded by the byte 0xff (which is an illegal byte in UTF-8)
+0x80000000 is added to the character's value. This is the only way of passing
+such code points in a pattern string. For subject strings, using an escape
+sequence is preferable.
.
.
.SH "COMMAND LINE OPTIONS"
@@ -93,14 +125,24 @@ If the 32-bit library has been built, this option causes it to be used. If only
the 32-bit library has been built, this is the default. If the 32-bit library
has not been built, this option causes an error.
.TP 10
+\fB-ac\fP
+Behave as if each pattern has the \fBauto_callout\fP modifier, that is, insert
+automatic callouts into every pattern that is compiled.
+.TP 10
+\fB-AC\fP
+As for \fB-ac\fP, but in addition behave as if each subject line has the
+\fBcallout_extra\fP modifier, that is, show additional information from
+callouts.
+.TP 10
\fB-b\fP
-Behave as if each pattern has the \fB/fullbincode\fP modifier; the full
+Behave as if each pattern has the \fBfullbincode\fP modifier; the full
internal binary form of the pattern is output after compilation.
.TP 10
\fB-C\fP
Output the version number of the PCRE2 library, and all available information
about the optional features that are included, and then exit with zero exit
-code. All other options are ignored.
+code. All other options are ignored. If both -C and -LM are present, whichever
+is first is recognized.
.TP 10
\fB-C\fP \fIoption\fP
Output information about a specific build-time option, then exit. This
@@ -114,7 +156,7 @@ following options output the value and set the exit code as indicated:
linksize the configured internal link size (2, 3, or 4)
exit code is set to the link size
newline the default newline setting:
- CR, LF, CRLF, ANYCRLF, or ANY
+ CR, LF, CRLF, ANYCRLF, ANY, or NUL
exit code is always 0
bsr the default setting for what \eR matches:
ANYCRLF or ANY
@@ -153,13 +195,23 @@ a convenience facility for PCRE2 maintainers.
Output a brief summary these options and then exit.
.TP 10
\fB-i\fP
-Behave as if each pattern has the \fB/info\fP modifier; information about the
+Behave as if each pattern has the \fBinfo\fP modifier; information about the
compiled pattern is given after compilation.
.TP 10
\fB-jit\fP
Behave as if each pattern line has the \fBjit\fP modifier; after successful
compilation, each pattern is passed to the just-in-time compiler, if available.
.TP 10
+\fB-jitverify\fP
+Behave as if each pattern line has the \fBjitverify\fP modifier; after
+successful compilation, each pattern is passed to the just-in-time compiler, if
+available, and the use of JIT is verified.
+.TP 10
+\fB-LM\fP
+List modifiers: write a list of available pattern and subject modifiers to the
+standard output, then exit with zero exit code. All other options are ignored.
+If both -C and -LM are present, whichever is first is recognized.
+.TP 10
\fB-pattern\fB \fImodifier-list\fP
Behave as if each pattern line contains the given modifiers.
.TP 10
@@ -279,8 +331,8 @@ recognized as a newline by default. Without special action the tests would fail
when PCRE2 is compiled with either CR or CRLF as the default newline.
.P
The #newline_default command specifies a list of newline types that are
-acceptable as the default. The types must be one of CR, LF, CRLF, ANYCRLF, or
-ANY (in upper or lower case), for example:
+acceptable as the default. The types must be one of CR, LF, CRLF, ANYCRLF,
+ANY, or NUL (in upper or lower case), for example:
.sp
#newline_default LF Any anyCRLF
.sp
@@ -293,8 +345,9 @@ of the standard test input files.
.P
When the POSIX API is being tested there is no way to override the default
newline convention, though it is possible to set the newline convention from
-within the pattern. A warning is given if the \fBposix\fP modifier is used when
-\fB#newline_default\fP would set a default for the non-POSIX API.
+within the pattern. A warning is given if the \fBposix\fP or \fBposix_nosub\fP
+modifier is used when \fB#newline_default\fP would set a default for the
+non-POSIX API.
.sp
#pattern <modifier-list>
.sp
@@ -400,8 +453,9 @@ A pattern can be followed by a modifier list (details below).
.sp
Before each subject line is passed to \fBpcre2_match()\fP or
\fBpcre2_dfa_match()\fP, leading and trailing white space is removed, and the
-line is scanned for backslash escapes. The following provide a means of
-encoding non-printing characters in a visible way:
+line is scanned for backslash escapes, unless the \fBsubject_literal\fP
+modifier was set for the pattern. The following provide a means of encoding
+non-printing characters in a visible way:
.sp
\ea alarm (BEL, \ex07)
\eb backspace (\ex08)
@@ -463,6 +517,11 @@ character. A backslash followed by anything else causes an error. However, if
the very last character in the line is a backslash (and there is no modifier
list), it is ignored. This gives a way of passing an empty line as data, since
a real empty line terminates the data input.
+.P
+If the \fBsubject_literal\fP modifier is set for a pattern, all subject lines
+that follow are treated as literals, with no special treatment of backslashes.
+No replication is possible, and any subject modifiers must be set as defaults
+by a \fB#subject\fP command.
.
.
.SH "PATTERN MODIFIERS"
@@ -478,31 +537,44 @@ by a previous \fB#pattern\fP command.
.SS "Setting compilation options"
.rs
.sp
-The following modifiers set options for \fBpcre2_compile()\fP. The most common
-ones have single-letter abbreviations. See
+The following modifiers set options for \fBpcre2_compile()\fP. Most of them set
+bits in the options argument of that function, but those whose names start with
+PCRE2_EXTRA are additional options that are set in the compile context. For the
+main options, there are some single-letter abbreviations that are the same as
+Perl options. There is special handling for /x: if a second x is present,
+PCRE2_EXTENDED is converted into PCRE2_EXTENDED_MORE as in Perl. A third
+appearance adds PCRE2_EXTENDED as well, though this makes no difference to the
+way \fBpcre2_compile()\fP behaves. See
.\" HREF
\fBpcre2api\fP
.\"
-for a description of their effects.
+for a description of the effects of these options.
.sp
allow_empty_class set PCRE2_ALLOW_EMPTY_CLASS
+ allow_surrogate_escapes set PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES
alt_bsux set PCRE2_ALT_BSUX
alt_circumflex set PCRE2_ALT_CIRCUMFLEX
alt_verbnames set PCRE2_ALT_VERBNAMES
anchored set PCRE2_ANCHORED
auto_callout set PCRE2_AUTO_CALLOUT
+ bad_escape_is_literal set PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL
/i caseless set PCRE2_CASELESS
dollar_endonly set PCRE2_DOLLAR_ENDONLY
/s dotall set PCRE2_DOTALL
dupnames set PCRE2_DUPNAMES
+ endanchored set PCRE2_ENDANCHORED
/x extended set PCRE2_EXTENDED
+ /xx extended_more set PCRE2_EXTENDED_MORE
firstline set PCRE2_FIRSTLINE
+ literal set PCRE2_LITERAL
+ match_line set PCRE2_EXTRA_MATCH_LINE
match_unset_backref set PCRE2_MATCH_UNSET_BACKREF
+ match_word set PCRE2_EXTRA_MATCH_WORD
/m multiline set PCRE2_MULTILINE
never_backslash_c set PCRE2_NEVER_BACKSLASH_C
never_ucp set PCRE2_NEVER_UCP
never_utf set PCRE2_NEVER_UTF
- no_auto_capture set PCRE2_NO_AUTO_CAPTURE
+ /n no_auto_capture set PCRE2_NO_AUTO_CAPTURE
no_auto_possess set PCRE2_NO_AUTO_POSSESS
no_dotstar_anchor set PCRE2_NO_DOTSTAR_ANCHOR
no_start_optimize set PCRE2_NO_START_OPTIMIZE
@@ -515,7 +587,9 @@ for a description of their effects.
As well as turning on the PCRE2_UTF option, the \fButf\fP modifier causes all
non-printing characters in output strings to be printed using the \ex{hh...}
notation. Otherwise, those less than 0x100 are output in hex without the curly
-brackets.
+brackets. Setting \fButf\fP in 16-bit or 32-bit mode also causes pattern and
+subject strings to be translated to UTF-16 or UTF-32, respectively, before
+being passed to library functions.
.
.
.\" HTML <a name="controlmodifiers"></a>
@@ -523,12 +597,18 @@ brackets.
.rs
.sp
The following modifiers affect the compilation process or request information
-about the pattern:
+about the pattern. There are single-letter abbreviations for some that are
+heavily used in the test files.
.sp
bsr=[anycrlf|unicode] specify \eR handling
/B bincode show binary code without lengths
callout_info show callout information
+ convert=<options> request foreign pattern conversion
+ convert_glob_escape=c set glob escape character
+ convert_glob_separator=c set glob separator character
+ convert_length set convert buffer length
debug same as info,fullbincode
+ framesize show matching frame size
fullbincode show binary code with lengths
/I info show info about compiled pattern
hex unquoted characters are hexadecimal
@@ -546,7 +626,10 @@ about the pattern:
push push compiled pattern onto the stack
pushcopy push a copy onto the stack
stackguard=<number> test the stackguard feature
+ subject_literal treat all subject lines as literal
tables=[0|1|2] select internal tables
+ use_length do not zero-terminate the pattern
+ utf8_input treat input as UTF-8
.sp
The effects of these modifiers are described in the following sections.
.
@@ -561,7 +644,7 @@ is built, with the default default being Unicode.
.P
The \fBnewline\fP modifier specifies which characters are to be interpreted as
newlines, both in the pattern and in subject lines. The type must be one of CR,
-LF, CRLF, ANYCRLF, or ANY (in upper or lower case).
+LF, CRLF, ANYCRLF, ANY, or NUL (in upper or lower case).
.
.
.SS "Information about a pattern"
@@ -609,6 +692,10 @@ unit" is the last literal code unit that must be present in any match. This is
not necessarily the last character. These lines are omitted if no starting or
ending code units are recorded.
.P
+The \fBframesize\fP modifier shows the size, in bytes, of the storage frames
+used by \fBpcre2_match()\fP for handling backtracking. The size depends on the
+number of capturing parentheses in the pattern.
+.P
The \fBcallout_info\fP modifier requests information about all the callouts in
the pattern. A list of them is output at the end of any other information that
is requested. For each callout, either its number or string is given, followed
@@ -642,12 +729,41 @@ nine characters, only two of which are specified in hexadecimal:
/ab "literal" 32/hex
.sp
Either single or double quotes may be used. There is no way of including
-the delimiter within a substring.
+the delimiter within a substring. The \fBhex\fP and \fBexpand\fP modifiers are
+mutually exclusive.
+.
+.
+.SS "Specifying the pattern's length"
+.rs
+.sp
+By default, patterns are passed to the compiling functions as zero-terminated
+strings but can be passed by length instead of being zero-terminated. The
+\fBuse_length\fP modifier causes this to happen. Using a length happens
+automatically (whether or not \fBuse_length\fP is set) when \fBhex\fP is set,
+because patterns specified in hexadecimal may contain binary zeros.
.P
-By default, \fBpcre2test\fP passes patterns as zero-terminated strings to
-\fBpcre2_compile()\fP, giving the length as PCRE2_ZERO_TERMINATED. However, for
-patterns specified with the \fBhex\fP modifier, the actual length of the
-pattern is passed.
+If \fBhex\fP or \fBuse_length\fP is used with the POSIX wrapper API (see
+.\" HTML <a href="#posixwrapper">
+.\" </a>
+"Using the POSIX wrapper API"
+.\"
+below), the REG_PEND extension is used to pass the pattern's length.
+.
+.
+.SS "Specifying wide characters in 16-bit and 32-bit modes"
+.rs
+.sp
+In 16-bit and 32-bit modes, all input is automatically treated as UTF-8 and
+translated to UTF-16 or UTF-32 when the \fButf\fP modifier is set. For testing
+the 16-bit and 32-bit libraries in non-UTF mode, the \fButf8_input\fP modifier
+can be used. It is mutually exclusive with \fButf\fP. Input lines are
+interpreted as UTF-8 as a means of specifying wide characters. More details are
+given in
+.\" HTML <a href="#inputencoding">
+.\" </a>
+"Input encoding"
+.\"
+above.
.
.
.SS "Generating long repetitive patterns"
@@ -665,7 +781,8 @@ are expanded before the pattern is passed to \fBpcre2_compile()\fP. For
example, \e[AB]{6000} is expanded to "ABAB..." 6000 times. This construction
cannot be nested. An initial "\e[" sequence is recognized only if "]{" followed
by decimal digits and "}" is found later in the pattern. If not, the characters
-remain in the pattern unaltered.
+remain in the pattern unaltered. The \fBexpand\fP and \fBhex\fP modifiers are
+mutually exclusive.
.P
If part of an expanded pattern looks like an expansion, but is really part of
the actual pattern, unwanted expansion can be avoided by giving two values in
@@ -696,7 +813,7 @@ below
.\"
for details of how these options are specified for each match attempt.
.P
-JIT compilation is requested by the \fB/jit\fP pattern modifier, which may
+JIT compilation is requested by the \fBjit\fP pattern modifier, which may
optionally be followed by an equals sign and a number in the range 0 to 7.
The three bits that make up the number specify which of the three JIT operating
modes are to be compiled:
@@ -705,7 +822,7 @@ modes are to be compiled:
2 compile JIT code for soft partial matching
4 compile JIT code for hard partial matching
.sp
-The possible values for the \fB/jit\fP modifier are therefore:
+The possible values for the \fBjit\fP modifier are therefore:
.sp
0 disable JIT
1 normal matching only
@@ -720,7 +837,7 @@ to \fBpcre2_match()\fP with either the PCRE2_PARTIAL_SOFT or the
PCRE2_PARTIAL_HARD option set. Note that such a call may return a complete
match; the options enable the possibility of a partial match, but do not
require it. Note also that if you request JIT compilation only for partial
-matching (for example, /jit=2) but do not set the \fBpartial\fP modifier on a
+matching (for example, jit=2) but do not set the \fBpartial\fP modifier on a
subject line, that match will not use JIT code because none was compiled for
non-partial matching.
.P
@@ -750,14 +867,14 @@ code was actually used in the match.
.SS "Setting a locale"
.rs
.sp
-The \fB/locale\fP modifier must specify the name of a locale, for example:
+The \fBlocale\fP modifier must specify the name of a locale, for example:
.sp
/pattern/locale=fr_FR
.sp
The given locale is set, \fBpcre2_maketables()\fP is called to build a set of
character tables for the locale, and this is then passed to
\fBpcre2_compile()\fP when compiling the regular expression. The same tables
-are used when matching the following subject lines. The \fB/locale\fP modifier
+are used when matching the following subject lines. The \fBlocale\fP modifier
applies only to the pattern on which it appears, but can be given in a
\fB#pattern\fP command if a default is needed. Setting a locale and alternate
character tables are mutually exclusive.
@@ -766,7 +883,7 @@ character tables are mutually exclusive.
.SS "Showing pattern memory"
.rs
.sp
-The \fB/memory\fP modifier causes the size in bytes of the memory used to hold
+The \fBmemory\fP modifier causes the size in bytes of the memory used to hold
the compiled pattern to be output. This does not include the size of the
\fBpcre2_code\fP block; it is just the actual compiled data. If the pattern is
subsequently passed to the JIT compiler, the size of the JIT compiled code is
@@ -797,10 +914,11 @@ causes a compilation error. The default is the largest number a PCRE2_SIZE
variable can hold (essentially unlimited).
.
.
+.\" HTML <a name="posixwrapper"></a>
.SS "Using the POSIX wrapper API"
.rs
.sp
-The \fB/posix\fP and \fBposix_nosub\fP modifiers cause \fBpcre2test\fP to call
+The \fBposix\fP and \fBposix_nosub\fP modifiers cause \fBpcre2test\fP to call
PCRE2 via the POSIX wrapper API rather than its native API. When
\fBposix_nosub\fP is used, the POSIX option REG_NOSUB is passed to
\fBregcomp()\fP. The POSIX wrapper supports only the 8-bit library. Note that
@@ -830,12 +948,16 @@ large buffer is used.
The \fBaftertext\fP and \fBallaftertext\fP subject modifiers work as described
below. All other modifiers are either ignored, with a warning message, or cause
an error.
+.P
+The pattern is passed to \fBregcomp()\fP as a zero-terminated string by
+default, but if the \fBuse_length\fP or \fBhex\fP modifiers are set, the
+REG_PEND extension is used to pass it by length.
.
.
.SS "Testing the stack guard feature"
.rs
.sp
-The \fB/stackguard\fP modifier is used to test the use of
+The \fBstackguard\fP modifier is used to test the use of
\fBpcre2_set_compile_recursion_guard()\fP, a function that is provided to
enable stack availability to be checked during compilation (see the
.\" HREF
@@ -852,7 +974,7 @@ be aborted.
.SS "Using alternative character tables"
.rs
.sp
-The value specified for the \fB/tables\fP modifier must be one of the digits 0,
+The value specified for the \fBtables\fP modifier must be one of the digits 0,
1, or 2. It causes a specific set of built-in character tables to be passed to
\fBpcre2_compile()\fP. This is used in the PCRE2 tests to check behaviour with
different character tables. The digit specifies the tables as follows:
@@ -870,17 +992,19 @@ are mutually exclusive.
.SS "Setting certain match controls"
.rs
.sp
-The following modifiers are really subject modifiers, and are described below.
-However, they may be included in a pattern's modifier list, in which case they
-are applied to every subject line that is processed with that pattern. They may
-not appear in \fB#pattern\fP commands. These modifiers do not affect the
-compilation process.
+The following modifiers are really subject modifiers, and are described under
+"Subject Modifiers" below. However, they may be included in a pattern's
+modifier list, in which case they are applied to every subject line that is
+processed with that pattern. These modifiers do not affect the compilation
+process.
.sp
aftertext show text after match
allaftertext show text after captures
allcaptures show all captures
allusedtext show all consulted text
+ altglobal alternative global matching
/g global global matching
+ jitstack=<n> set size of JIT stack
mark show mark values
replace=<string> specify a replacement string
startchar show starting character when relevant
@@ -893,6 +1017,15 @@ These modifiers may not appear in a \fB#pattern\fP command. If you want them as
defaults, set them in a \fB#subject\fP command.
.
.
+.SS "Specifying literal subject lines"
+.rs
+.sp
+If the \fBsubject_literal\fP modifier is present on a pattern, all the subject
+lines that it matches are taken as literal strings, with no interpretation of
+backslashes. It is not possible to set subject modifiers on such lines, but any
+that are set as defaults by a \fB#subject\fP command are recognized.
+.
+.
.SS "Saving a compiled pattern"
.rs
.sp
@@ -903,7 +1036,9 @@ facility is used when saving compiled patterns to a file, as described in the
section entitled "Saving and restoring compiled patterns"
.\" HTML <a href="#saverestore">
.\" </a>
-below. If \fBpushcopy\fP is used instead of \fBpush\fP, a copy of the compiled
+below.
+.\"
+If \fBpushcopy\fP is used instead of \fBpush\fP, a copy of the compiled
pattern is stacked, leaving the original as current, ready to match the
following input lines. This provides a way of testing the
\fBpcre2_code_copy()\fP function.
@@ -916,6 +1051,39 @@ allowed, does not carry through to any subsequent matching that uses a stacked
pattern.
.
.
+.SS "Testing foreign pattern conversion"
+.rs
+.sp
+The experimental foreign pattern conversion functions in PCRE2 can be tested by
+setting the \fBconvert\fP modifier. Its argument is a colon-separated list of
+options, which set the equivalent option for the \fBpcre2_pattern_convert()\fP
+function:
+.sp
+ glob PCRE2_CONVERT_GLOB
+ glob_no_starstar PCRE2_CONVERT_GLOB_NO_STARSTAR
+ glob_no_wild_separator PCRE2_CONVERT_GLOB_NO_WILD_SEPARATOR
+ posix_basic PCRE2_CONVERT_POSIX_BASIC
+ posix_extended PCRE2_CONVERT_POSIX_EXTENDED
+ unset Unset all options
+.sp
+The "unset" value is useful for turning off a default that has been set by a
+\fB#pattern\fP command. When one of these options is set, the input pattern is
+passed to \fBpcre2_pattern_convert()\fP. If the conversion is successful, the
+result is reflected in the output and then passed to \fBpcre2_compile()\fP. The
+normal \fButf\fP and \fBno_utf_check\fP options, if set, cause the
+PCRE2_CONVERT_UTF and PCRE2_CONVERT_NO_UTF_CHECK options to be passed to
+\fBpcre2_pattern_convert()\fP.
+.P
+By default, the conversion function is allowed to allocate a buffer for its
+output. However, if the \fBconvert_length\fP modifier is set to a value greater
+than zero, \fBpcre2test\fP passes a buffer of the given length. This makes it
+possible to test the length check.
+.P
+The \fBconvert_glob_escape\fP and \fBconvert_glob_separator\fP modifiers can be
+used to specify the escape and separator characters for glob processing,
+overriding the defaults, which are operating-system dependent.
+.
+.
.\" HTML <a name="subjectmodifiers"></a>
.SH "SUBJECT MODIFIERS"
.rs
@@ -935,6 +1103,7 @@ The following modifiers set options for \fBpcre2_match()\fP or
for a description of their effects.
.sp
anchored set PCRE2_ANCHORED
+ endanchored set PCRE2_ENDANCHORED
dfa_restart set PCRE2_DFA_RESTART
dfa_shortest set PCRE2_DFA_SHORTEST
no_jit set PCRE2_NO_JIT
@@ -949,11 +1118,27 @@ for a description of their effects.
The partial matching modifiers are provided with abbreviations because they
appear frequently in tests.
.P
-If the \fB/posix\fP modifier was present on the pattern, causing the POSIX
-wrapper API to be used, the only option-setting modifiers that have any effect
-are \fBnotbol\fP, \fBnotempty\fP, and \fBnoteol\fP, causing REG_NOTBOL,
-REG_NOTEMPTY, and REG_NOTEOL, respectively, to be passed to \fBregexec()\fP.
-The other modifiers are ignored, with a warning message.
+If the \fBposix\fP or \fBposix_nosub\fP modifier was present on the pattern,
+causing the POSIX wrapper API to be used, the only option-setting modifiers
+that have any effect are \fBnotbol\fP, \fBnotempty\fP, and \fBnoteol\fP,
+causing REG_NOTBOL, REG_NOTEMPTY, and REG_NOTEOL, respectively, to be passed to
+\fBregexec()\fP. The other modifiers are ignored, with a warning message.
+.P
+There is one additional modifier that can be used with the POSIX wrapper. It is
+ignored (with a warning) if used for non-POSIX matching.
+.sp
+ posix_startend=<n>[:<m>]
+.sp
+This causes the subject string to be passed to \fBregexec()\fP using the
+REG_STARTEND option, which uses offsets to specify which part of the string is
+searched. If only one number is given, the end offset is passed as the end of
+the subject string. For more detail of REG_STARTEND, see the
+.\" HREF
+\fBpcre2posix\fP
+.\"
+documentation. If the subject string contains binary zeros (coded as escapes
+such as \ex{00} because \fBpcre2test\fP does not support actual binary zeros in
+its input), you must use \fBposix_startend\fP to specify its length.
.
.
.SS "Setting match controls"
@@ -971,23 +1156,28 @@ pattern.
altglobal alternative global matching
callout_capture show captures at callout time
callout_data=<n> set a value to pass via callouts
+ callout_error=<n>[:<m>] control callout error
+ callout_extra show extra callout information
callout_fail=<n>[:<m>] control callout failure
+ callout_no_where do not show position of a callout
callout_none do not supply a callout function
copy=<number or name> copy captured substring
+ depth_limit=<n> set a depth limit
dfa use \fBpcre2_dfa_match()\fP
- find_limits find match and recursion limits
+ find_limits find match and depth limits
get=<number or name> extract captured substring
getall extract all captured substrings
/g global global matching
+ heap_limit=<n> set a limit on heap memory
jitstack=<n> set size of JIT stack
mark show mark values
match_limit=<n> set a match limit
- memory show memory usage
+ memory show heap memory usage
null_context match with a NULL context
offset=<n> set starting offset
offset_limit=<n> set offset limit
ovector=<n> set size of output vector
- recursion_limit=<n> set a recursion limit
+ recursion_limit=<n> obsolete synonym for depth_limit
replace=<string> specify a replacement string
startchar show startchar when relevant
startoffset=<n> same as offset=<n>
@@ -1063,27 +1253,20 @@ does no capturing); it is ignored, with a warning message, if present.
.rs
.sp
A callout function is supplied when \fBpcre2test\fP calls the library matching
-functions, unless \fBcallout_none\fP is specified. If \fBcallout_capture\fP is
-set, the current captured groups are output when a callout occurs.
-.P
-The \fBcallout_fail\fP modifier can be given one or two numbers. If there is
-only one number, 1 is returned instead of 0 when a callout of that number is
-reached. If two numbers are given, 1 is returned when callout <n> is reached
-for the <m>th time. Note that callouts with string arguments are always given
-the number zero. See "Callouts" below for a description of the output when a
-callout it taken.
-.P
-The \fBcallout_data\fP modifier can be given an unsigned or a negative number.
-This is set as the "user data" that is passed to the matching function, and
-passed back when the callout function is invoked. Any value other than zero is
-used as a return from \fBpcre2test\fP's callout function.
+functions, unless \fBcallout_none\fP is specified. Its behaviour can be
+controlled by various modifiers listed above whose names begin with
+\fBcallout_\fP. Details are given in the section entitled "Callouts"
+.\" HTML <a href="#callouts">
+.\" </a>
+below.
+.\"
.
.
.SS "Finding all matches in a string"
.rs
.sp
Searching for all possible matches within a subject can be requested by the
-\fBglobal\fP or \fB/altglobal\fP modifier. After finding a match, the matching
+\fBglobal\fP or \fBaltglobal\fP modifier. After finding a match, the matching
function is called again to search the remainder of the subject. The difference
between \fBglobal\fP and \fBaltglobal\fP is that the former uses the
\fIstart_offset\fP argument to \fBpcre2_match()\fP or \fBpcre2_dfa_match()\fP
@@ -1198,39 +1381,44 @@ matching provokes an error return ("bad option value") from
.sp
The \fBjitstack\fP modifier provides a way of setting the maximum stack size
that is used by the just-in-time optimization code. It is ignored if JIT
-optimization is not being used. The value is a number of kilobytes. Providing a
-stack that is larger than the default 32K is necessary only for very
-complicated patterns.
+optimization is not being used. The value is a number of kilobytes. Setting
+zero reverts to the default of 32K. Providing a stack that is larger than the
+default is necessary only for very complicated patterns. If \fBjitstack\fP is
+set non-zero on a subject line it overrides any value that was set on the
+pattern.
.
.
-.SS "Setting match and recursion limits"
+.SS "Setting heap, match, and depth limits"
.rs
.sp
-The \fBmatch_limit\fP and \fBrecursion_limit\fP modifiers set the appropriate
-limits in the match context. These values are ignored when the
+The \fBheap_limit\fP, \fBmatch_limit\fP, and \fBdepth_limit\fP modifiers set
+the appropriate limits in the match context. These values are ignored when the
\fBfind_limits\fP modifier is specified.
.
.
.SS "Finding minimum limits"
.rs
.sp
-If the \fBfind_limits\fP modifier is present, \fBpcre2test\fP calls
-\fBpcre2_match()\fP several times, setting different values in the match
-context via \fBpcre2_set_match_limit()\fP and \fBpcre2_set_recursion_limit()\fP
-until it finds the minimum values for each parameter that allow
-\fBpcre2_match()\fP to complete without error.
+If the \fBfind_limits\fP modifier is present on a subject line, \fBpcre2test\fP
+calls the relevant matching function several times, setting different values in
+the match context via \fBpcre2_set_heap_limit()\fP, \fBpcre2_set_match_limit()\fP,
+or \fBpcre2_set_depth_limit()\fP until it finds the minimum values for each
+parameter that allow the match to complete without error.
.P
If JIT is being used, only the match limit is relevant. If DFA matching is
-being used, neither limit is relevant, and this modifier is ignored (with a
-warning message).
+being used, only the depth limit is relevant.
.P
The \fImatch_limit\fP number is a measure of the amount of backtracking
that takes place, and learning the minimum value can be instructive. For most
simple matches, the number is quite small, but for patterns with very large
numbers of matching possibilities, it can become large very quickly with
-increasing length of subject string. The \fImatch_limit_recursion\fP number is
-a measure of how much stack (or, if PCRE2 is compiled with NO_RECURSE, how much
-heap) memory is needed to complete the match attempt.
+increasing length of subject string.
+.P
+For non-DFA matching, the minimum \fIdepth_limit\fP number is a measure of how
+much nested backtracking happens (that is, how deeply the pattern's tree is
+searched). In the case of DFA matching, \fIdepth_limit\fP controls the depth of
+recursive calls of the internal function that is used for handling pattern
+recursion, lookaround assertions, and atomic groups.
.
.
.SS "Showing MARK names"
@@ -1247,8 +1435,15 @@ is added to the non-match message.
.SS "Showing memory usage"
.rs
.sp
-The \fBmemory\fP modifier causes \fBpcre2test\fP to log all memory allocation
-and freeing calls that occur during a match operation.
+The \fBmemory\fP modifier causes \fBpcre2test\fP to log the sizes of all heap
+memory allocation and freeing calls that occur during a call to
+\fBpcre2_match()\fP. These occur only when a match requires a bigger vector
+than the default for remembering backtracking points. In many cases there will
+be no heap memory used and therefore no additional output. No heap memory is
+allocated during matching with \fBpcre2_dfa_match()\fP or with JIT, so in those
+cases the \fBmemory\fP modifier never has any effect. For this modifier to
+work, the \fBnull_context\fP modifier must not be set on both the pattern and
+the subject, though it can be set on one or the other.
.
.
.SS "Setting a starting offset"
@@ -1291,8 +1486,8 @@ pair of offsets.)
By default, the subject string is passed to a native API matching function with
its correct length. In order to test the facility for passing a zero-terminated
string, the \fBzero_terminate\fP modifier is provided. It causes the length to
-be passed as PCRE2_ZERO_TERMINATED. (When matching via the POSIX interface,
-this modifier has no effect, as there is no facility for passing a length.)
+be passed as PCRE2_ZERO_TERMINATED. When matching via the POSIX interface,
+this modifier is ignored, with a warning.
.P
When testing \fBpcre2_substitute()\fP, this modifier also has the effect of
passing the replacement string as zero-terminated.
@@ -1349,7 +1544,7 @@ code unit offset of the start of the failing character is also output. Here is
an example of an interactive \fBpcre2test\fP run.
.sp
$ pcre2test
- PCRE2 version 9.00 2014-05-10
+ PCRE2 version 10.22 2016-07-29
.sp
re> /^abc(\ed+)/
data> abc123
@@ -1376,7 +1571,7 @@ unset substring is shown as "<unset>", as for the second data line.
If the strings contain any non-printing characters, they are output as \exhh
escapes if the value is less than 256 and UTF mode is not set. Otherwise they
are output as \ex{hh...} escapes. See below for the definition of non-printing
-characters. If the \fB/aftertext\fP modifier is set, the output for substring
+characters. If the \fBaftertext\fP modifier is set, the output for substring
0 is followed by the the rest of the subject string, identified by "0+" like
this:
.sp
@@ -1470,27 +1665,15 @@ For further information about partial matching, see the
documentation.
.
.
+.\" HTML <a name="callouts"></a>
.SH CALLOUTS
.rs
.sp
If the pattern contains any callout requests, \fBpcre2test\fP's callout
-function is called during matching unless \fBcallout_none\fP is specified.
-This works with both matching functions.
-.P
-The callout function in \fBpcre2test\fP returns zero (carry on matching) by
-default, but you can use a \fBcallout_fail\fP modifier in a subject line (as
-described above) to change this and other parameters of the callout.
-.P
-Inserting callouts can be helpful when using \fBpcre2test\fP to check
-complicated regular expressions. For further information about callouts, see
-the
-.\" HREF
-\fBpcre2callout\fP
-.\"
-documentation.
-.P
-The output for callouts with numerical arguments and those with string
-arguments is slightly different.
+function is called during matching unless \fBcallout_none\fP is specified. This
+works with both matching functions, and with JIT, though there are some
+differences in behaviour. The output for callouts with numerical arguments and
+those with string arguments is slightly different.
.
.
.SS "Callouts with numerical arguments"
@@ -1511,7 +1694,7 @@ the current position precedes the start position, which can happen if the
callout is in a lookbehind assertion.
.P
Callouts numbered 255 are assumed to be automatic callouts, inserted as a
-result of the \fB/auto_callout\fP pattern modifier. In this case, instead of
+result of the \fBauto_callout\fP pattern modifier. In this case, instead of
showing the callout number, the offset in the pattern, preceded by a plus, is
output. For example:
.sp
@@ -1564,6 +1747,103 @@ example:
.sp
.
.
+.SS "Callout modifiers"
+.rs
+.sp
+The callout function in \fBpcre2test\fP returns zero (carry on matching) by
+default, but you can use a \fBcallout_fail\fP modifier in a subject line to
+change this and other parameters of the callout (see below).
+.P
+If the \fBcallout_capture\fP modifier is set, the current captured groups are
+output when a callout occurs. This is useful only for non-DFA matching, as
+\fBpcre2_dfa_match()\fP does not support capturing, so no captures are ever
+shown.
+.P
+The normal callout output, showing the callout number or pattern offset (as
+described above) is suppressed if the \fBcallout_no_where\fP modifier is set.
+.P
+When using the interpretive matching function \fBpcre2_match()\fP without JIT,
+setting the \fBcallout_extra\fP modifier causes additional output from
+\fBpcre2test\fP's callout function to be generated. For the first callout in a
+match attempt at a new starting position in the subject, "New match attempt" is
+output. If there has been a backtrack since the last callout (or start of
+matching if this is the first callout), "Backtrack" is output, followed by "No
+other matching paths" if the backtrack ended the previous match attempt. For
+example:
+.sp
+ re> /(a+)b/auto_callout,no_start_optimize,no_auto_possess
+ data> aac\e=callout_extra
+ New match attempt
+ --->aac
+ +0 ^ (
+ +1 ^ a+
+ +3 ^ ^ )
+ +4 ^ ^ b
+ Backtrack
+ --->aac
+ +3 ^^ )
+ +4 ^^ b
+ Backtrack
+ No other matching paths
+ New match attempt
+ --->aac
+ +0 ^ (
+ +1 ^ a+
+ +3 ^^ )
+ +4 ^^ b
+ Backtrack
+ No other matching paths
+ New match attempt
+ --->aac
+ +0 ^ (
+ +1 ^ a+
+ Backtrack
+ No other matching paths
+ New match attempt
+ --->aac
+ +0 ^ (
+ +1 ^ a+
+ No match
+.sp
+Notice that various optimizations must be turned off if you want all possible
+matching paths to be scanned. If \fBno_start_optimize\fP is not used, there is
+an immediate "no match", without any callouts, because the starting
+optimization fails to find "b" in the subject, which it knows must be present
+for any match. If \fBno_auto_possess\fP is not used, the "a+" item is turned
+into "a++", which reduces the number of backtracks.
+.P
+The \fBcallout_extra\fP modifier has no effect if used with the DFA matching
+function, or with JIT.
+.
+.
+.SS "Return values from callouts"
+.rs
+.sp
+The default return from the callout function is zero, which allows matching to
+continue. The \fBcallout_fail\fP modifier can be given one or two numbers. If
+there is only one number, 1 is returned instead of 0 (causing matching to
+backtrack) when a callout of that number is reached. If two numbers (<n>:<m>)
+are given, 1 is returned when callout <n> is reached and there have been at
+least <m> callouts. The \fBcallout_error\fP modifier is similar, except that
+PCRE2_ERROR_CALLOUT is returned, causing the entire matching process to be
+aborted. If both these modifiers are set for the same callout number,
+\fBcallout_error\fP takes precedence. Note that callouts with string arguments
+are always given the number zero.
+.P
+The \fBcallout_data\fP modifier can be given an unsigned or a negative number.
+This is set as the "user data" that is passed to the matching function, and
+passed back when the callout function is invoked. Any value other than zero is
+used as a return from \fBpcre2test\fP's callout function.
+.P
+Inserting callouts can be helpful when using \fBpcre2test\fP to check
+complicated regular expressions. For further information about callouts, see
+the
+.\" HREF
+\fBpcre2callout\fP
+.\"
+documentation.
+.
+.
.
.SH "NON-PRINTING CHARACTERS"
.rs
@@ -1574,7 +1854,7 @@ therefore shown as hex escapes.
.P
When \fBpcre2test\fP is outputting text that is a matched part of a subject
string, it behaves in the same way, unless a different locale has been set for
-the pattern (using the \fB/locale\fP modifier). In this case, the
+the pattern (using the \fBlocale\fP modifier). In this case, the
\fBisprint()\fP function is used to distinguish printing and non-printing
characters.
.
@@ -1682,6 +1962,6 @@ Cambridge, England.
.rs
.sp
.nf
-Last updated: 06 July 2016
-Copyright (c) 1997-2016 University of Cambridge.
+Last updated: 21 December 2017
+Copyright (c) 1997-2017 University of Cambridge.
.fi
diff --git a/doc/pcre2test.txt b/doc/pcre2test.txt
index cfa0baa..93efd24 100644
--- a/doc/pcre2test.txt
+++ b/doc/pcre2test.txt
@@ -26,7 +26,7 @@ SYNOPSIS
As the original fairly simple PCRE library evolved, it acquired many
different features, and as a result, the original pcretest program
- ended up with a lot of options in a messy, arcane syntax, for testing
+ ended up with a lot of options in a messy, arcane syntax for testing
all the features. The move to the new PCRE2 API provided an opportunity
to re-implement the test program as pcre2test, with a cleaner modifier
syntax. Nevertheless, there are still many obscure modifiers, some of
@@ -45,7 +45,7 @@ PCRE2's 8-BIT, 16-BIT AND 32-BIT LIBRARIES
installed. The pcre2test program can be used to test all the libraries.
However, its own input and output are always in 8-bit format. When
testing the 16-bit or 32-bit libraries, patterns and subject strings
- are converted to 16- or 32-bit format before being passed to the
+ are converted to 16-bit or 32-bit format before being passed to the
library functions. Results are converted back to 8-bit code units for
output.
@@ -58,45 +58,81 @@ PCRE2's 8-BIT, 16-BIT AND 32-BIT LIBRARIES
INPUT ENCODING
Input to pcre2test is processed line by line, either by calling the C
- library's fgets() function, or via the libreadline library (see below).
+ library's fgets() function, or via the libreadline library. In some
+ Windows environments character 26 (hex 1A) causes an immediate end of
+ file, and no further data is read, so this character should be avoided
+ unless you really want that action.
+
The input is processed using using C's string functions, so must not
- contain binary zeroes, even though in Unix-like environments, fgets()
- treats any bytes other than newline as data characters. In some Windows
- environments character 26 (hex 1A) causes an immediate end of file, and
- no further data is read.
+ contain binary zeros, even though in Unix-like environments, fgets()
+ treats any bytes other than newline as data characters. An error is
+ generated if a binary zero is encountered. By default subject lines are
+ processed for backslash escapes, which makes it possible to include any
+ data value in strings that are passed to the library for matching. For
+ patterns, there is a facility for specifying some or all of the 8-bit
+ input characters as hexadecimal pairs, which makes it possible to
+ include binary zeros.
+
+ Input for the 16-bit and 32-bit libraries
+
+ When testing the 16-bit or 32-bit libraries, there is a need to be able
+ to generate character code points greater than 255 in the strings that
+ are passed to the library. For subject lines, backslash escapes can be
+ used. In addition, when the utf modifier (see "Setting compilation
+ options" below) is set, the pattern and any following subject lines are
+ interpreted as UTF-8 strings and translated to UTF-16 or UTF-32 as
+ appropriate.
- For maximum portability, therefore, it is safest to avoid non-printing
- characters in pcre2test input files. There is a facility for specifying
- some or all of a pattern's characters as hexadecimal pairs, thus making
- it possible to include binary zeroes in a pattern for testing purposes.
- Subject lines are processed for backslash escapes, which makes it pos-
- sible to include any data value.
+ For non-UTF testing of wide characters, the utf8_input modifier can be
+ used. This is mutually exclusive with utf, and is allowed only in
+ 16-bit or 32-bit mode. It causes the pattern and following subject
+ lines to be treated as UTF-8 according to the original definition (RFC
+ 2279), which allows for character values up to 0x7fffffff. Each charac-
+ ter is placed in one 16-bit or 32-bit code unit (in the 16-bit case,
+ values greater than 0xffff cause an error to occur).
+
+ UTF-8 (in its original definition) is not capable of encoding values
+ greater than 0x7fffffff, but such values can be handled by the 32-bit
+ library. When testing this library in non-UTF mode with utf8_input set,
+ if any character is preceded by the byte 0xff (which is an illegal byte
+ in UTF-8) 0x80000000 is added to the character's value. This is the
+ only way of passing such code points in a pattern string. For subject
+ strings, using an escape sequence is preferable.
COMMAND LINE OPTIONS
-8 If the 8-bit library has been built, this option causes it to
- be used (this is the default). If the 8-bit library has not
+ be used (this is the default). If the 8-bit library has not
been built, this option causes an error.
- -16 If the 16-bit library has been built, this option causes it
- to be used. If only the 16-bit library has been built, this
- is the default. If the 16-bit library has not been built,
+ -16 If the 16-bit library has been built, this option causes it
+ to be used. If only the 16-bit library has been built, this
+ is the default. If the 16-bit library has not been built,
this option causes an error.
- -32 If the 32-bit library has been built, this option causes it
- to be used. If only the 32-bit library has been built, this
- is the default. If the 32-bit library has not been built,
+ -32 If the 32-bit library has been built, this option causes it
+ to be used. If only the 32-bit library has been built, this
+ is the default. If the 32-bit library has not been built,
this option causes an error.
- -b Behave as if each pattern has the /fullbincode modifier; the
+ -ac Behave as if each pattern has the auto_callout modifier, that
+ is, insert automatic callouts into every pattern that is com-
+ piled.
+
+ -AC As for -ac, but in addition behave as if each subject line
+ has the callout_extra modifier, that is, show additional
+ information from callouts.
+
+ -b Behave as if each pattern has the fullbincode modifier; the
full internal binary form of the pattern is output after com-
pilation.
- -C Output the version number of the PCRE2 library, and all
- available information about the optional features that are
- included, and then exit with zero exit code. All other
- options are ignored.
+ -C Output the version number of the PCRE2 library, and all
+ available information about the optional features that are
+ included, and then exit with zero exit code. All other
+ options are ignored. If both -C and -LM are present, which-
+ ever is first is recognized.
-C option Output information about a specific build-time option, then
exit. This functionality is intended for use in scripts such
@@ -110,7 +146,7 @@ COMMAND LINE OPTIONS
linksize the configured internal link size (2, 3, or 4)
exit code is set to the link size
newline the default newline setting:
- CR, LF, CRLF, ANYCRLF, or ANY
+ CR, LF, CRLF, ANYCRLF, ANY, or NUL
exit code is always 0
bsr the default setting for what \R matches:
ANYCRLF or ANY
@@ -147,13 +183,24 @@ COMMAND LINE OPTIONS
-help Output a brief summary these options and then exit.
- -i Behave as if each pattern has the /info modifier; information
+ -i Behave as if each pattern has the info modifier; information
about the compiled pattern is given after compilation.
-jit Behave as if each pattern line has the jit modifier; after
successful compilation, each pattern is passed to the just-
in-time compiler, if available.
+ -jitverify
+ Behave as if each pattern line has the jitverify modifier;
+ after successful compilation, each pattern is passed to the
+ just-in-time compiler, if available, and the use of JIT is
+ verified.
+
+ -LM List modifiers: write a list of available pattern and subject
+ modifiers to the standard output, then exit with zero exit
+ code. All other options are ignored. If both -C and -LM are
+ present, whichever is first is recognized.
+
-pattern modifier-list
Behave as if each pattern line contains the given modifiers.
@@ -269,7 +316,7 @@ COMMAND LINES
The #newline_default command specifies a list of newline types that are
acceptable as the default. The types must be one of CR, LF, CRLF, ANY-
- CRLF, or ANY (in upper or lower case), for example:
+ CRLF, ANY, or NUL (in upper or lower case), for example:
#newline_default LF Any anyCRLF
@@ -282,9 +329,9 @@ COMMAND LINES
When the POSIX API is being tested there is no way to override the
default newline convention, though it is possible to set the newline
- convention from within the pattern. A warning is given if the posix
- modifier is used when #newline_default would set a default for the non-
- POSIX API.
+ convention from within the pattern. A warning is given if the posix or
+ posix_nosub modifier is used when #newline_default would set a default
+ for the non-POSIX API.
#pattern <modifier-list>
@@ -387,8 +434,9 @@ SUBJECT LINE SYNTAX
Before each subject line is passed to pcre2_match() or
pcre2_dfa_match(), leading and trailing white space is removed, and the
- line is scanned for backslash escapes. The following provide a means of
- encoding non-printing characters in a visible way:
+ line is scanned for backslash escapes, unless the subject_literal modi-
+ fier was set for the pattern. The following provide a means of encoding
+ non-printing characters in a visible way:
\a alarm (BEL, \x07)
\b backspace (\x08)
@@ -405,23 +453,23 @@ SUBJECT LINE SYNTAX
\x{hh...} hexadecimal character (any number of hex digits)
The use of \x{hh...} is not dependent on the use of the utf modifier on
- the pattern. It is recognized always. There may be any number of hexa-
- decimal digits inside the braces; invalid values provoke error mes-
+ the pattern. It is recognized always. There may be any number of hexa-
+ decimal digits inside the braces; invalid values provoke error mes-
sages.
- Note that \xhh specifies one byte rather than one character in UTF-8
- mode; this makes it possible to construct invalid UTF-8 sequences for
- testing purposes. On the other hand, \x{hh} is interpreted as a UTF-8
- character in UTF-8 mode, generating more than one byte if the value is
- greater than 127. When testing the 8-bit library not in UTF-8 mode,
+ Note that \xhh specifies one byte rather than one character in UTF-8
+ mode; this makes it possible to construct invalid UTF-8 sequences for
+ testing purposes. On the other hand, \x{hh} is interpreted as a UTF-8
+ character in UTF-8 mode, generating more than one byte if the value is
+ greater than 127. When testing the 8-bit library not in UTF-8 mode,
\x{hh} generates one byte for values less than 256, and causes an error
for greater values.
In UTF-16 mode, all 4-digit \x{hhhh} values are accepted. This makes it
possible to construct invalid UTF-16 sequences for testing purposes.
- In UTF-32 mode, all 4- to 8-digit \x{...} values are accepted. This
- makes it possible to construct invalid UTF-32 sequences for testing
+ In UTF-32 mode, all 4- to 8-digit \x{...} values are accepted. This
+ makes it possible to construct invalid UTF-32 sequences for testing
purposes.
There is a special backslash sequence that specifies replication of one
@@ -429,33 +477,38 @@ SUBJECT LINE SYNTAX
\[<characters>]{<count>}
- This makes it possible to test long strings without having to provide
+ This makes it possible to test long strings without having to provide
them as part of the file. For example:
\[abc]{4}
- is converted to "abcabcabcabc". This feature does not support nesting.
+ is converted to "abcabcabcabc". This feature does not support nesting.
To include a closing square bracket in the characters, code it as \x5D.
- A backslash followed by an equals sign marks the end of the subject
+ A backslash followed by an equals sign marks the end of the subject
string and the start of a modifier list. For example:
abc\=notbol,notempty
- If the subject string is empty and \= is followed by whitespace, the
- line is treated as a comment line, and is not used for matching. For
+ If the subject string is empty and \= is followed by whitespace, the
+ line is treated as a comment line, and is not used for matching. For
example:
\= This is a comment.
abc\= This is an invalid modifier list.
- A backslash followed by any other non-alphanumeric character just
+ A backslash followed by any other non-alphanumeric character just
escapes that character. A backslash followed by anything else causes an
- error. However, if the very last character in the line is a backslash
- (and there is no modifier list), it is ignored. This gives a way of
- passing an empty line as data, since a real empty line terminates the
+ error. However, if the very last character in the line is a backslash
+ (and there is no modifier list), it is ignored. This gives a way of
+ passing an empty line as data, since a real empty line terminates the
data input.
+ If the subject_literal modifier is set for a pattern, all subject lines
+ that follow are treated as literals, with no special treatment of back-
+ slashes. No replication is possible, and any subject modifiers must be
+ set as defaults by a #subject command.
+
PATTERN MODIFIERS
@@ -466,28 +519,42 @@ PATTERN MODIFIERS
Setting compilation options
- The following modifiers set options for pcre2_compile(). The most com-
- mon ones have single-letter abbreviations. See pcre2api for a descrip-
- tion of their effects.
+ The following modifiers set options for pcre2_compile(). Most of them
+ set bits in the options argument of that function, but those whose
+ names start with PCRE2_EXTRA are additional options that are set in the
+ compile context. For the main options, there are some single-letter
+ abbreviations that are the same as Perl options. There is special han-
+ dling for /x: if a second x is present, PCRE2_EXTENDED is converted
+ into PCRE2_EXTENDED_MORE as in Perl. A third appearance adds
+ PCRE2_EXTENDED as well, though this makes no difference to the way
+ pcre2_compile() behaves. See pcre2api for a description of the effects
+ of these options.
allow_empty_class set PCRE2_ALLOW_EMPTY_CLASS
+ allow_surrogate_escapes set PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES
alt_bsux set PCRE2_ALT_BSUX
alt_circumflex set PCRE2_ALT_CIRCUMFLEX
alt_verbnames set PCRE2_ALT_VERBNAMES
anchored set PCRE2_ANCHORED
auto_callout set PCRE2_AUTO_CALLOUT
+ bad_escape_is_literal set PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL
/i caseless set PCRE2_CASELESS
dollar_endonly set PCRE2_DOLLAR_ENDONLY
/s dotall set PCRE2_DOTALL
dupnames set PCRE2_DUPNAMES
+ endanchored set PCRE2_ENDANCHORED
/x extended set PCRE2_EXTENDED
+ /xx extended_more set PCRE2_EXTENDED_MORE
firstline set PCRE2_FIRSTLINE
+ literal set PCRE2_LITERAL
+ match_line set PCRE2_EXTRA_MATCH_LINE
match_unset_backref set PCRE2_MATCH_UNSET_BACKREF
+ match_word set PCRE2_EXTRA_MATCH_WORD
/m multiline set PCRE2_MULTILINE
never_backslash_c set PCRE2_NEVER_BACKSLASH_C
never_ucp set PCRE2_NEVER_UCP
never_utf set PCRE2_NEVER_UTF
- no_auto_capture set PCRE2_NO_AUTO_CAPTURE
+ /n no_auto_capture set PCRE2_NO_AUTO_CAPTURE
no_auto_possess set PCRE2_NO_AUTO_POSSESS
no_dotstar_anchor set PCRE2_NO_DOTSTAR_ANCHOR
no_start_optimize set PCRE2_NO_START_OPTIMIZE
@@ -498,19 +565,27 @@ PATTERN MODIFIERS
utf set PCRE2_UTF
As well as turning on the PCRE2_UTF option, the utf modifier causes all
- non-printing characters in output strings to be printed using the
- \x{hh...} notation. Otherwise, those less than 0x100 are output in hex
- without the curly brackets.
+ non-printing characters in output strings to be printed using the
+ \x{hh...} notation. Otherwise, those less than 0x100 are output in hex
+ without the curly brackets. Setting utf in 16-bit or 32-bit mode also
+ causes pattern and subject strings to be translated to UTF-16 or
+ UTF-32, respectively, before being passed to library functions.
Setting compilation controls
- The following modifiers affect the compilation process or request
- information about the pattern:
+ The following modifiers affect the compilation process or request
+ information about the pattern. There are single-letter abbreviations
+ for some that are heavily used in the test files.
bsr=[anycrlf|unicode] specify \R handling
/B bincode show binary code without lengths
callout_info show callout information
+ convert=<options> request foreign pattern conversion
+ convert_glob_escape=c set glob escape character
+ convert_glob_separator=c set glob separator character
+ convert_length set convert buffer length
debug same as info,fullbincode
+ framesize show matching frame size
fullbincode show binary code with lengths
/I info show info about compiled pattern
hex unquoted characters are hexadecimal
@@ -528,7 +603,10 @@ PATTERN MODIFIERS
push push compiled pattern onto the stack
pushcopy push a copy onto the stack
stackguard=<number> test the stackguard feature
+ subject_literal treat all subject lines as literal
tables=[0|1|2] select internal tables
+ use_length do not zero-terminate the pattern
+ utf8_input treat input as UTF-8
The effects of these modifiers are described in the following sections.
@@ -541,7 +619,7 @@ PATTERN MODIFIERS
The newline modifier specifies which characters are to be interpreted
as newlines, both in the pattern and in subject lines. The type must be
- one of CR, LF, CRLF, ANYCRLF, or ANY (in upper or lower case).
+ one of CR, LF, CRLF, ANYCRLF, ANY, or NUL (in upper or lower case).
Information about a pattern
@@ -589,6 +667,10 @@ PATTERN MODIFIERS
last character. These lines are omitted if no starting or ending code
units are recorded.
+ The framesize modifier shows the size, in bytes, of the storage frames
+ used by pcre2_match() for handling backtracking. The size depends on
+ the number of capturing parentheses in the pattern.
+
The callout_info modifier requests information about all the callouts
in the pattern. A list of them is output at the end of any other infor-
mation that is requested. For each callout, either its number or string
@@ -619,12 +701,30 @@ PATTERN MODIFIERS
/ab "literal" 32/hex
Either single or double quotes may be used. There is no way of includ-
- ing the delimiter within a substring.
+ ing the delimiter within a substring. The hex and expand modifiers are
+ mutually exclusive.
+
+ Specifying the pattern's length
+
+ By default, patterns are passed to the compiling functions as
+ zero-terminated strings, but they can be passed by length instead. The
+ use_length modifier causes this to happen. Using a length
+ happens automatically (whether or not use_length is set) when hex is
+ set, because patterns specified in hexadecimal may contain binary
+ zeros.
+
+ If hex or use_length is used with the POSIX wrapper API (see "Using the
+ POSIX wrapper API" below), the REG_PEND extension is used to pass the
+ pattern's length.
+
+ Specifying wide characters in 16-bit and 32-bit modes
- By default, pcre2test passes patterns as zero-terminated strings to
- pcre2_compile(), giving the length as PCRE2_ZERO_TERMINATED. However,
- for patterns specified with the hex modifier, the actual length of the
- pattern is passed.
+ In 16-bit and 32-bit modes, all input is automatically treated as UTF-8
+ and translated to UTF-16 or UTF-32 when the utf modifier is set. For
+ testing the 16-bit and 32-bit libraries in non-UTF mode, the utf8_input
+ modifier can be used. It is mutually exclusive with utf. Input lines
+ are interpreted as UTF-8 as a means of specifying wide characters. More
+ details are given in "Input encoding" above.
Generating long repetitive patterns
@@ -640,38 +740,39 @@ PATTERN MODIFIERS
ple, \[AB]{6000} is expanded to "ABAB..." 6000 times. This construction
cannot be nested. An initial "\[" sequence is recognized only if "]{"
followed by decimal digits and "}" is found later in the pattern. If
- not, the characters remain in the pattern unaltered.
+ not, the characters remain in the pattern unaltered. The expand and hex
+ modifiers are mutually exclusive.
- If part of an expanded pattern looks like an expansion, but is really
+ If part of an expanded pattern looks like an expansion, but is really
part of the actual pattern, unwanted expansion can be avoided by giving
two values in the quantifier. For example, \[AB]{6000,6000} is not rec-
ognized as an expansion item.
- If the info modifier is set on an expanded pattern, the result of the
+ If the info modifier is set on an expanded pattern, the result of the
expansion is included in the information that is output.
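
The expansion rule described above can be modelled with a minimal Python sketch. This is not pcre2test's implementation: the helper name expand_pattern and the regular expression are assumptions made for illustration, and the sketch simplifies the rule by requiring the first "]" after "\[" to be followed directly by "{". Items are not nested, and a two-value quantifier is left unaltered.

```python
import re

# Hypothetical model of the "expand" rule; not taken from pcre2test.
# "\[" starts an expansion only if "]{", decimal digits and "}" follow.
_EXPANSION = re.compile(r"\\\[([^\]]*)\]\{(\d+)\}")

def expand_pattern(text):
    """Replace each \\[chars]{count} item with chars repeated count times."""
    return _EXPANSION.sub(lambda m: m.group(1) * int(m.group(2)), text)

print(expand_pattern(r"\[AB]{3}"))    # ABABAB
print(expand_pattern(r"\[AB]{3,3}"))  # \[AB]{3,3} (not an expansion item)
```

Using a callable as the replacement avoids any special treatment of backslashes in the repeated characters.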
JIT compilation
- Just-in-time (JIT) compiling is a heavyweight optimization that can
- greatly speed up pattern matching. See the pcre2jit documentation for
- details. JIT compiling happens, optionally, after a pattern has been
- successfully compiled into an internal form. The JIT compiler converts
+ Just-in-time (JIT) compiling is a heavyweight optimization that can
+ greatly speed up pattern matching. See the pcre2jit documentation for
+ details. JIT compiling happens, optionally, after a pattern has been
+ successfully compiled into an internal form. The JIT compiler converts
this to optimized machine code. It needs to know whether the match-time
options PCRE2_PARTIAL_HARD and PCRE2_PARTIAL_SOFT are going to be used,
- because different code is generated for the different cases. See the
- partial modifier in "Subject Modifiers" below for details of how these
+ because different code is generated for the different cases. See the
+ partial modifier in "Subject Modifiers" below for details of how these
options are specified for each match attempt.
- JIT compilation is requested by the /jit pattern modifier, which may
+ JIT compilation is requested by the jit pattern modifier, which may
optionally be followed by an equals sign and a number in the range 0 to
- 7. The three bits that make up the number specify which of the three
+ 7. The three bits that make up the number specify which of the three
JIT operating modes are to be compiled:
1 compile JIT code for non-partial matching
2 compile JIT code for soft partial matching
4 compile JIT code for hard partial matching
- The possible values for the /jit modifier are therefore:
+ The possible values for the jit modifier are therefore:
0 disable JIT
1 normal matching only
@@ -681,54 +782,54 @@ PATTERN MODIFIERS
6 soft and hard partial matching only
7 all three modes
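
To make the bit arithmetic concrete, the following Python sketch decodes a jit value into the modes it selects. The bit values come from the table above; the names JIT_MODES and jit_modes are hypothetical and exist only for this example. The default argument of 7 mirrors the value that pcre2test assumes when no number is given.

```python
# Bit values as listed above; the names are illustrative only.
JIT_MODES = {1: "non-partial", 2: "soft partial", 4: "hard partial"}

def jit_modes(value=7):
    """Return the JIT operating modes selected by a jit=<value> setting."""
    if not 0 <= value <= 7:
        raise ValueError("jit value must be in the range 0 to 7")
    return [name for bit, name in JIT_MODES.items() if value & bit]

print(jit_modes(6))  # ['soft partial', 'hard partial']
print(jit_modes(0))  # [] (JIT disabled)
```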
- If no number is given, 7 is assumed. The phrase "partial matching"
+ If no number is given, 7 is assumed. The phrase "partial matching"
means a call to pcre2_match() with either the PCRE2_PARTIAL_SOFT or the
- PCRE2_PARTIAL_HARD option set. Note that such a call may return a com-
+ PCRE2_PARTIAL_HARD option set. Note that such a call may return a com-
plete match; the options enable the possibility of a partial match, but
- do not require it. Note also that if you request JIT compilation only
- for partial matching (for example, /jit=2) but do not set the partial
- modifier on a subject line, that match will not use JIT code because
+ do not require it. Note also that if you request JIT compilation only
+ for partial matching (for example, jit=2) but do not set the partial
+ modifier on a subject line, that match will not use JIT code because
none was compiled for non-partial matching.
- If JIT compilation is successful, the compiled JIT code will automati-
- cally be used when an appropriate type of match is run, except when
- incompatible run-time options are specified. For more details, see the
- pcre2jit documentation. See also the jitstack modifier below for a way
+ If JIT compilation is successful, the compiled JIT code will automati-
+ cally be used when an appropriate type of match is run, except when
+ incompatible run-time options are specified. For more details, see the
+ pcre2jit documentation. See also the jitstack modifier below for a way
of setting the size of the JIT stack.
- If the jitfast modifier is specified, matching is done using the JIT
- "fast path" interface, pcre2_jit_match(), which skips some of the san-
- ity checks that are done by pcre2_match(), and of course does not work
- when JIT is not supported. If jitfast is specified without jit, jit=7
+ If the jitfast modifier is specified, matching is done using the JIT
+ "fast path" interface, pcre2_jit_match(), which skips some of the san-
+ ity checks that are done by pcre2_match(), and of course does not work
+ when JIT is not supported. If jitfast is specified without jit, jit=7
is assumed.
- If the jitverify modifier is specified, information about the compiled
- pattern shows whether JIT compilation was or was not successful. If
- jitverify is specified without jit, jit=7 is assumed. If JIT compila-
- tion is successful when jitverify is set, the text "(JIT)" is added to
+ If the jitverify modifier is specified, information about the compiled
+ pattern shows whether JIT compilation was or was not successful. If
+ jitverify is specified without jit, jit=7 is assumed. If JIT compila-
+ tion is successful when jitverify is set, the text "(JIT)" is added to
the first output line after a match or non-match when JIT-compiled code
was actually used in the match.
Setting a locale
- The /locale modifier must specify the name of a locale, for example:
+ The locale modifier must specify the name of a locale, for example:
/pattern/locale=fr_FR
The given locale is set, pcre2_maketables() is called to build a set of
- character tables for the locale, and this is then passed to pcre2_com-
- pile() when compiling the regular expression. The same tables are used
- when matching the following subject lines. The /locale modifier applies
+ character tables for the locale, and this is then passed to pcre2_com-
+ pile() when compiling the regular expression. The same tables are used
+ when matching the following subject lines. The locale modifier applies
only to the pattern on which it appears, but can be given in a #pattern
- command if a default is needed. Setting a locale and alternate charac-
+ command if a default is needed. Setting a locale and alternate charac-
ter tables are mutually exclusive.
Showing pattern memory
- The /memory modifier causes the size in bytes of the memory used to
- hold the compiled pattern to be output. This does not include the size
- of the pcre2_code block; it is just the actual compiled data. If the
- pattern is subsequently passed to the JIT compiler, the size of the JIT
+ The memory modifier causes the size in bytes of the memory used to hold
+ the compiled pattern to be output. This does not include the size of
+ the pcre2_code block; it is just the actual compiled data. If the pat-
+ tern is subsequently passed to the JIT compiler, the size of the JIT
compiled code is also output. Here is an example:
re> /a(b)c/jit,memory
@@ -738,27 +839,27 @@ PATTERN MODIFIERS
Limiting nested parentheses
- The parens_nest_limit modifier sets a limit on the depth of nested
- parentheses in a pattern. Breaching the limit causes a compilation
- error. The default for the library is set when PCRE2 is built, but
- pcre2test sets its own default of 220, which is required for running
+ The parens_nest_limit modifier sets a limit on the depth of nested
+ parentheses in a pattern. Breaching the limit causes a compilation
+ error. The default for the library is set when PCRE2 is built, but
+ pcre2test sets its own default of 220, which is required for running
the standard test suite.
Limiting the pattern length
- The max_pattern_length modifier sets a limit, in code units, to the
+ The max_pattern_length modifier sets a limit, in code units, to the
length of pattern that pcre2_compile() will accept. Breaching the limit
- causes a compilation error. The default is the largest number a
+ causes a compilation error. The default is the largest number a
PCRE2_SIZE variable can hold (essentially unlimited).
Using the POSIX wrapper API
- The /posix and posix_nosub modifiers cause pcre2test to call PCRE2 via
- the POSIX wrapper API rather than its native API. When posix_nosub is
- used, the POSIX option REG_NOSUB is passed to regcomp(). The POSIX
- wrapper supports only the 8-bit library. Note that it does not imply
+ The posix and posix_nosub modifiers cause pcre2test to call PCRE2 via
+ the POSIX wrapper API rather than its native API. When posix_nosub is
+ used, the POSIX option REG_NOSUB is passed to regcomp(). The POSIX
+ wrapper supports only the 8-bit library. Note that it does not imply
POSIX matching semantics; for more detail see the pcre2posix documenta-
- tion. The following pattern modifiers set options for the regcomp()
+ tion. The following pattern modifiers set options for the regcomp()
function:
caseless REG_ICASE
@@ -768,35 +869,39 @@ PATTERN MODIFIERS
ucp REG_UCP ) the POSIX standard
utf REG_UTF8 )
- The regerror_buffsize modifier specifies a size for the error buffer
- that is passed to regerror() in the event of a compilation error. For
+ The regerror_buffsize modifier specifies a size for the error buffer
+ that is passed to regerror() in the event of a compilation error. For
example:
/abc/posix,regerror_buffsize=20
- This provides a means of testing the behaviour of regerror() when the
- buffer is too small for the error message. If this modifier has not
+ This provides a means of testing the behaviour of regerror() when the
+ buffer is too small for the error message. If this modifier has not
been set, a large buffer is used.
- The aftertext and allaftertext subject modifiers work as described
- below. All other modifiers are either ignored, with a warning message,
+ The aftertext and allaftertext subject modifiers work as described
+ below. All other modifiers are either ignored, with a warning message,
or cause an error.
+ The pattern is passed to regcomp() as a zero-terminated string by
+ default, but if the use_length or hex modifiers are set, the REG_PEND
+ extension is used to pass it by length.
+
Testing the stack guard feature
- The /stackguard modifier is used to test the use of pcre2_set_com-
- pile_recursion_guard(), a function that is provided to enable stack
- availability to be checked during compilation (see the pcre2api docu-
- mentation for details). If the number specified by the modifier is
+ The stackguard modifier is used to test the use of pcre2_set_com-
+ pile_recursion_guard(), a function that is provided to enable stack
+ availability to be checked during compilation (see the pcre2api docu-
+ mentation for details). If the number specified by the modifier is
greater than zero, pcre2_set_compile_recursion_guard() is called to set
- up callback from pcre2_compile() to a local function. The argument it
- receives is the current nesting parenthesis depth; if this is greater
+ up a callback from pcre2_compile() to a local function. The argument it
+ receives is the current nesting parenthesis depth; if this is greater
than the value given by the modifier, non-zero is returned, causing the
compilation to be aborted.
Using alternative character tables
- The value specified for the /tables modifier must be one of the digits
+ The value specified for the tables modifier must be one of the digits
0, 1, or 2. It causes a specific set of built-in character tables to be
passed to pcre2_compile(). This is used in the PCRE2 tests to check be-
haviour with different character tables. The digit specifies the tables
@@ -807,23 +912,25 @@ PATTERN MODIFIERS
pcre2_chartables.c.dist
2 a set of tables defining ISO 8859 characters
- In table 2, some characters whose codes are greater than 128 are iden-
- tified as letters, digits, spaces, etc. Setting alternate character
+ In table 2, some characters whose codes are greater than 128 are iden-
+ tified as letters, digits, spaces, etc. Setting alternate character
tables and a locale are mutually exclusive.
Setting certain match controls
The following modifiers are really subject modifiers, and are described
- below. However, they may be included in a pattern's modifier list, in
- which case they are applied to every subject line that is processed
- with that pattern. They may not appear in #pattern commands. These mod-
- ifiers do not affect the compilation process.
+ under "Subject Modifiers" below. However, they may be included in a
+ pattern's modifier list, in which case they are applied to every sub-
+ ject line that is processed with that pattern. These modifiers do not
+ affect the compilation process.
aftertext show text after match
allaftertext show text after captures
allcaptures show all captures
allusedtext show all consulted text
+ altglobal alternative global matching
/g global global matching
+ jitstack=<n> set size of JIT stack
mark show mark values
replace=<string> specify a replacement string
startchar show starting character when relevant
@@ -832,26 +939,65 @@ PATTERN MODIFIERS
substitute_unknown_unset use PCRE2_SUBSTITUTE_UNKNOWN_UNSET
substitute_unset_empty use PCRE2_SUBSTITUTE_UNSET_EMPTY
- These modifiers may not appear in a #pattern command. If you want them
+ These modifiers may not appear in a #pattern command. If you want them
as defaults, set them in a #subject command.
+ Specifying literal subject lines
+
+ If the subject_literal modifier is present on a pattern, all the sub-
+ ject lines that it matches are taken as literal strings, with no inter-
+ pretation of backslashes. It is not possible to set subject modifiers
+ on such lines, but any that are set as defaults by a #subject command
+ are recognized.
+
Saving a compiled pattern
- When a pattern with the push modifier is successfully compiled, it is
- pushed onto a stack of compiled patterns, and pcre2test expects the
- next line to contain a new pattern (or a command) instead of a subject
+ When a pattern with the push modifier is successfully compiled, it is
+ pushed onto a stack of compiled patterns, and pcre2test expects the
+ next line to contain a new pattern (or a command) instead of a subject
line. This facility is used when saving compiled patterns to a file, as
- described in the section entitled "Saving and restoring compiled pat-
- terns" below. If pushcopy is used instead of push, a copy of the com-
- piled pattern is stacked, leaving the original as current, ready to
- match the following input lines. This provides a way of testing the
- pcre2_code_copy() function. The push and pushcopy modifiers are
- incompatible with compilation modifiers such as global that act at
- match time. Any that are specified are ignored (for the stacked copy),
+ described in the section entitled "Saving and restoring compiled pat-
+ terns" below. If pushcopy is used instead of push, a copy of the com-
+ piled pattern is stacked, leaving the original as current, ready to
+ match the following input lines. This provides a way of testing the
+ pcre2_code_copy() function. The push and pushcopy modifiers are
+ incompatible with compilation modifiers such as global that act at
+ match time. Any that are specified are ignored (for the stacked copy),
with a warning message, except for replace, which causes an error. Note
- that jitverify, which is allowed, does not carry through to any subse-
+ that jitverify, which is allowed, does not carry through to any subse-
quent matching that uses a stacked pattern.
+ Testing foreign pattern conversion
+
+ The experimental foreign pattern conversion functions in PCRE2 can be
+ tested by setting the convert modifier. Its argument is a colon-sepa-
+ rated list of options, which set the equivalent option for the
+ pcre2_pattern_convert() function:
+
+ glob PCRE2_CONVERT_GLOB
+ glob_no_starstar PCRE2_CONVERT_GLOB_NO_STARSTAR
+ glob_no_wild_separator PCRE2_CONVERT_GLOB_NO_WILD_SEPARATOR
+ posix_basic PCRE2_CONVERT_POSIX_BASIC
+ posix_extended PCRE2_CONVERT_POSIX_EXTENDED
+ unset Unset all options
+
+ The "unset" value is useful for turning off a default that has been set
+ by a #pattern command. When one of these options is set, the input pat-
+ tern is passed to pcre2_pattern_convert(). If the conversion is suc-
+ cessful, the result is reflected in the output and then passed to
+ pcre2_compile(). The normal utf and no_utf_check options, if set, cause
+ the PCRE2_CONVERT_UTF and PCRE2_CONVERT_NO_UTF_CHECK options to be
+ passed to pcre2_pattern_convert().
+
+ By default, the conversion function is allowed to allocate a buffer for
+ its output. However, if the convert_length modifier is set to a value
+ greater than zero, pcre2test passes a buffer of the given length. This
+ makes it possible to test the length check.
+
+ The convert_glob_escape and convert_glob_separator modifiers can be
+ used to specify the escape and separator characters for glob process-
+ ing, overriding the defaults, which are operating-system dependent.
+
SUBJECT MODIFIERS
@@ -860,10 +1006,11 @@ SUBJECT MODIFIERS
Setting match options
- The following modifiers set options for pcre2_match() or
+ The following modifiers set options for pcre2_match() or
pcre2_dfa_match(). See pcre2api for a description of their effects.
anchored set PCRE2_ANCHORED
+ endanchored set PCRE2_ENDANCHORED
dfa_restart set PCRE2_DFA_RESTART
dfa_shortest set PCRE2_DFA_SHORTEST
no_jit set PCRE2_NO_JIT
@@ -875,20 +1022,34 @@ SUBJECT MODIFIERS
partial_hard (or ph) set PCRE2_PARTIAL_HARD
partial_soft (or ps) set PCRE2_PARTIAL_SOFT
- The partial matching modifiers are provided with abbreviations because
+ The partial matching modifiers are provided with abbreviations because
they appear frequently in tests.
- If the /posix modifier was present on the pattern, causing the POSIX
- wrapper API to be used, the only option-setting modifiers that have any
- effect are notbol, notempty, and noteol, causing REG_NOTBOL,
- REG_NOTEMPTY, and REG_NOTEOL, respectively, to be passed to regexec().
- The other modifiers are ignored, with a warning message.
+ If the posix or posix_nosub modifier was present on the pattern, caus-
+ ing the POSIX wrapper API to be used, the only option-setting modifiers
+ that have any effect are notbol, notempty, and noteol, causing REG_NOT-
+ BOL, REG_NOTEMPTY, and REG_NOTEOL, respectively, to be passed to
+ regexec(). The other modifiers are ignored, with a warning message.
+
+ There is one additional modifier that can be used with the POSIX wrap-
+ per. It is ignored (with a warning) if used for non-POSIX matching.
+
+ posix_startend=<n>[:<m>]
+
+ This causes the subject string to be passed to regexec() using the
+ REG_STARTEND option, which uses offsets to specify which part of the
+ string is searched. If only one number is given, the end offset is
+ passed as the end of the subject string. For more detail of REG_STAR-
+ TEND, see the pcre2posix documentation. If the subject string contains
+ binary zeros (coded as escapes such as \x{00} because pcre2test does
+ not support actual binary zeros in its input), you must use posix_star-
+ tend to specify its length.
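
The way the two offsets are interpreted can be sketched in Python. This only models the behaviour described above and does not call regexec(); the helper name startend_offsets is an assumption made for this example.

```python
def startend_offsets(value, subject_length):
    """Parse "<n>" or "<n>:<m>" into (start, end) offsets.

    If only one number is given, the end offset is the end of the subject.
    """
    start, sep, end = value.partition(":")
    return (int(start), int(end) if sep else subject_length)

# A subject containing a binary zero: the length must be given explicitly.
subject = b"abc\x00def"
start, end = startend_offsets("2", len(subject))
print(subject[start:end])  # b'c\x00def'
```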
Setting match controls
- The following modifiers affect the matching process or request addi-
- tional information. Some of them may also be specified on a pattern
- line (see above), in which case they apply to every subject line that
+ The following modifiers affect the matching process or request addi-
+ tional information. Some of them may also be specified on a pattern
+ line (see above), in which case they apply to every subject line that
is matched against that pattern.
aftertext show text after match
@@ -898,23 +1059,28 @@ SUBJECT MODIFIERS
altglobal alternative global matching
callout_capture show captures at callout time
callout_data=<n> set a value to pass via callouts
+ callout_error=<n>[:<m>] control callout error
+ callout_extra show extra callout information
callout_fail=<n>[:<m>] control callout failure
+ callout_no_where do not show position of a callout
callout_none do not supply a callout function
copy=<number or name> copy captured substring
+ depth_limit=<n> set a depth limit
dfa use pcre2_dfa_match()
- find_limits find match and recursion limits
+ find_limits find match and depth limits
get=<number or name> extract captured substring
getall extract all captured substrings
/g global global matching
+ heap_limit=<n> set a limit on heap memory
jitstack=<n> set size of JIT stack
mark show mark values
match_limit=<n> set a match limit
- memory show memory usage
+ memory show heap memory usage
null_context match with a NULL context
offset=<n> set starting offset
offset_limit=<n> set offset limit
ovector=<n> set size of output vector
- recursion_limit=<n> set a recursion limit
+ recursion_limit=<n> obsolete synonym for depth_limit
replace=<string> specify a replacement string
startchar show startchar when relevant
startoffset=<n> same as offset=<n>
@@ -925,29 +1091,29 @@ SUBJECT MODIFIERS
zero_terminate pass the subject as zero-terminated
The effects of these modifiers are described in the following sections.
- When matching via the POSIX wrapper API, the aftertext, allaftertext,
- and ovector subject modifiers work as described below. All other modi-
+ When matching via the POSIX wrapper API, the aftertext, allaftertext,
+ and ovector subject modifiers work as described below. All other modi-
fiers are either ignored, with a warning message, or cause an error.
Showing more text
- The aftertext modifier requests that as well as outputting the part of
+ The aftertext modifier requests that as well as outputting the part of
the subject string that matched the entire pattern, pcre2test should in
addition output the remainder of the subject string. This is useful for
tests where the subject contains multiple copies of the same substring.
- The allaftertext modifier requests the same action for captured sub-
+ The allaftertext modifier requests the same action for captured sub-
strings as well as the main matched substring. In each case the remain-
der is output on the following line with a plus character following the
capture number.
- The allusedtext modifier requests that all the text that was consulted
- during a successful pattern match by the interpreter should be shown.
- This feature is not supported for JIT matching, and if requested with
- JIT it is ignored (with a warning message). Setting this modifier
+ The allusedtext modifier requests that all the text that was consulted
+ during a successful pattern match by the interpreter should be shown.
+ This feature is not supported for JIT matching, and if requested with
+ JIT it is ignored (with a warning message). Setting this modifier
affects the output if there is a lookbehind at the start of a match, or
- a lookahead at the end, or if \K is used in the pattern. Characters
- that precede or follow the start and end of the actual match are indi-
- cated in the output by '<' or '>' characters underneath them. Here is
+ a lookahead at the end, or if \K is used in the pattern. Characters
+ that precede or follow the start and end of the actual match are indi-
+ cated in the output by '<' or '>' characters underneath them. Here is
an example:
re> /(?<=pqr)abc(?=xyz)/
@@ -955,16 +1121,16 @@ SUBJECT MODIFIERS
0: pqrabcxyz
<<< >>>
- This shows that the matched string is "abc", with the preceding and
- following strings "pqr" and "xyz" having been consulted during the
+ This shows that the matched string is "abc", with the preceding and
+ following strings "pqr" and "xyz" having been consulted during the
match (when processing the assertions).
- The startchar modifier requests that the starting character for the
- match be indicated, if it is different to the start of the matched
+ The startchar modifier requests that the starting character for the
+ match be indicated, if it is different to the start of the matched
string. The only time when this occurs is when \K has been processed as
part of the match. In this situation, the output for the matched string
- is displayed from the starting character instead of from the match
- point, with circumflex characters under the earlier characters. For
+ is displayed from the starting character instead of from the match
+ point, with circumflex characters under the earlier characters. For
example:
re> /abc\Kxyz/
@@ -972,7 +1138,7 @@ SUBJECT MODIFIERS
0: abcxyz
^^^
- Unlike allusedtext, the startchar modifier can be used with JIT. How-
+ Unlike allusedtext, the startchar modifier can be used with JIT. How-
ever, these two modifiers are mutually exclusive.
Showing the value of all capture groups
@@ -980,90 +1146,78 @@ SUBJECT MODIFIERS
The allcaptures modifier requests that the values of all potential cap-
tured parentheses be output after a match. By default, only those up to
the highest one actually used in the match are output (corresponding to
- the return code from pcre2_match()). Groups that did not take part in
- the match are output as "<unset>". This modifier is not relevant for
- DFA matching (which does no capturing); it is ignored, with a warning
+ the return code from pcre2_match()). Groups that did not take part in
+ the match are output as "<unset>". This modifier is not relevant for
+ DFA matching (which does no capturing); it is ignored, with a warning
message, if present.
Testing callouts
- A callout function is supplied when pcre2test calls the library match-
- ing functions, unless callout_none is specified. If callout_capture is
- set, the current captured groups are output when a callout occurs.
-
- The callout_fail modifier can be given one or two numbers. If there is
- only one number, 1 is returned instead of 0 when a callout of that num-
- ber is reached. If two numbers are given, 1 is returned when callout
- <n> is reached for the <m>th time. Note that callouts with string argu-
- ments are always given the number zero. See "Callouts" below for a
- description of the output when a callout it taken.
-
- The callout_data modifier can be given an unsigned or a negative num-
- ber. This is set as the "user data" that is passed to the matching
- function, and passed back when the callout function is invoked. Any
- value other than zero is used as a return from pcre2test's callout
- function.
+ A callout function is supplied when pcre2test calls the library match-
+ ing functions, unless callout_none is specified. Its behaviour can be
+ controlled by various modifiers listed above whose names begin with
+ callout_. Details are given in the section entitled "Callouts" below.
Finding all matches in a string
Searching for all possible matches within a subject can be requested by
- the global or /altglobal modifier. After finding a match, the matching
- function is called again to search the remainder of the subject. The
- difference between global and altglobal is that the former uses the
- start_offset argument to pcre2_match() or pcre2_dfa_match() to start
- searching at a new point within the entire string (which is what Perl
+ the global or altglobal modifier. After finding a match, the matching
+ function is called again to search the remainder of the subject. The
+ difference between global and altglobal is that the former uses the
+ start_offset argument to pcre2_match() or pcre2_dfa_match() to start
+ searching at a new point within the entire string (which is what Perl
does), whereas the latter passes over a shortened subject. This makes a
difference to the matching process if the pattern begins with a lookbe-
hind assertion (including \b or \B).
- If an empty string is matched, the next match is done with the
+ If an empty string is matched, the next match is done with the
PCRE2_NOTEMPTY_ATSTART and PCRE2_ANCHORED flags set, in order to search
for another, non-empty, match at the same point in the subject. If this
- match fails, the start offset is advanced, and the normal match is
- retried. This imitates the way Perl handles such cases when using the
- /g modifier or the split() function. Normally, the start offset is
- advanced by one character, but if the newline convention recognizes
- CRLF as a newline, and the current character is CR followed by LF, an
+ match fails, the start offset is advanced, and the normal match is
+ retried. This imitates the way Perl handles such cases when using the
+ /g modifier or the split() function. Normally, the start offset is
+ advanced by one character, but if the newline convention recognizes
+ CRLF as a newline, and the current character is CR followed by LF, an
advance of two characters occurs.
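
The advance step described above can be sketched as follows. This is an illustration only, not code taken from pcre2test: the function name next_start_offset and its parameters are assumptions, and offsets are counted in characters of a Python string rather than in code units.

```python
def next_start_offset(subject, offset, crlf_is_newline):
    """Return the offset at which the normal match is retried.

    Hypothetical helper: it models the case where an empty match at
    `offset` could not be followed by a non-empty match at the same point.
    """
    # Advance by two when the newline convention recognizes CRLF and the
    # current character is CR followed by LF.
    if crlf_is_newline and subject[offset:offset + 2] == "\r\n":
        return offset + 2
    # Otherwise advance by one character.
    return offset + 1
```

Keeping the CRLF pair together means the retry does not start between the CR and the LF of a single newline.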
Testing substring extraction functions
- The copy and get modifiers can be used to test the pcre2_sub-
+ The copy and get modifiers can be used to test the pcre2_sub-
string_copy_xxx() and pcre2_substring_get_xxx() functions. They can be
- given more than once, and each can specify a group name or number, for
+ given more than once, and each can specify a group name or number, for
example:
abcd\=copy=1,copy=3,get=G1
- If the #subject command is used to set default copy and/or get lists,
- these can be unset by specifying a negative number to cancel all num-
+ If the #subject command is used to set default copy and/or get lists,
+ these can be unset by specifying a negative number to cancel all num-
bered groups and an empty name to cancel all named groups.
- The getall modifier tests pcre2_substring_list_get(), which extracts
+ The getall modifier tests pcre2_substring_list_get(), which extracts
all captured substrings.
- If the subject line is successfully matched, the substrings extracted
- by the convenience functions are output with C, G, or L after the
- string number instead of a colon. This is in addition to the normal
- full list. The string length (that is, the return from the extraction
+ If the subject line is successfully matched, the substrings extracted
+ by the convenience functions are output with C, G, or L after the
+ string number instead of a colon. This is in addition to the normal
+ full list. The string length (that is, the return from the extraction
function) is given in parentheses after each substring, followed by the
name when the extraction was by name.
Testing the substitution function
- If the replace modifier is set, the pcre2_substitute() function is
- called instead of one of the matching functions. Note that replacement
- strings cannot contain commas, because a comma signifies the end of a
+ If the replace modifier is set, the pcre2_substitute() function is
+ called instead of one of the matching functions. Note that replacement
+ strings cannot contain commas, because a comma signifies the end of a
modifier. This is not thought to be an issue in a test program.
- Unlike subject strings, pcre2test does not process replacement strings
- for escape sequences. In UTF mode, a replacement string is checked to
- see if it is a valid UTF-8 string. If so, it is correctly converted to
- a UTF string of the appropriate code unit width. If it is not a valid
- UTF-8 string, the individual code units are copied directly. This pro-
+ Unlike subject strings, pcre2test does not process replacement strings
+ for escape sequences. In UTF mode, a replacement string is checked to
+ see if it is a valid UTF-8 string. If so, it is correctly converted to
+ a UTF string of the appropriate code unit width. If it is not a valid
+ UTF-8 string, the individual code units are copied directly. This pro-
vides a means of passing an invalid UTF-8 string for testing purposes.
- The following modifiers set options (in additional to the normal match
+ The following modifiers set options (in addition to the normal match
options) for pcre2_substitute():
global PCRE2_SUBSTITUTE_GLOBAL
@@ -1073,8 +1227,8 @@ SUBJECT MODIFIERS
substitute_unset_empty PCRE2_SUBSTITUTE_UNSET_EMPTY
- After a successful substitution, the modified string is output, pre-
- ceded by the number of replacements. This may be zero if there were no
+ After a successful substitution, the modified string is output, pre-
+ ceded by the number of replacements. This may be zero if there were no
matches. Here is a simple example of a substitution test:
/abc/replace=xxx
@@ -1083,12 +1237,12 @@ SUBJECT MODIFIERS
=abc=abc=\=global
2: =xxx=xxx=
- Subject and replacement strings should be kept relatively short (fewer
- than 256 characters) for substitution tests, as fixed-size buffers are
- used. To make it easy to test for buffer overflow, if the replacement
- string starts with a number in square brackets, that number is passed
- to pcre2_substitute() as the size of the output buffer, with the
- replacement string starting at the next character. Here is an example
+ Subject and replacement strings should be kept relatively short (fewer
+ than 256 characters) for substitution tests, as fixed-size buffers are
+ used. To make it easy to test for buffer overflow, if the replacement
+ string starts with a number in square brackets, that number is passed
+ to pcre2_substitute() as the size of the output buffer, with the
+ replacement string starting at the next character. Here is an example
that tests the edge case:
/abc/
@@ -1097,11 +1251,11 @@ SUBJECT MODIFIERS
123abc123\=replace=[9]XYZ
Failed: error -47: no more memory
- The default action of pcre2_substitute() is to return
- PCRE2_ERROR_NOMEMORY when the output buffer is too small. However, if
- the PCRE2_SUBSTITUTE_OVERFLOW_LENGTH option is set (by using the sub-
- stitute_overflow_length modifier), pcre2_substitute() continues to go
- through the motions of matching and substituting, in order to compute
+ The default action of pcre2_substitute() is to return
+ PCRE2_ERROR_NOMEMORY when the output buffer is too small. However, if
+ the PCRE2_SUBSTITUTE_OVERFLOW_LENGTH option is set (by using the sub-
+ stitute_overflow_length modifier), pcre2_substitute() continues to go
+ through the motions of matching and substituting, in order to compute
the size of buffer that is required. When this happens, pcre2test shows
the required buffer length (which includes space for the trailing zero)
as part of the error message. For example:
@@ -1111,43 +1265,48 @@ SUBJECT MODIFIERS
Failed: error -47: no more memory: 10 code units are needed
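
The arithmetic behind the "10 code units" figure can be checked with a short sketch. It assumes the subject 123abc123, the pattern abc, and the replacement [9]XYZ from the earlier example. The helper names split_size_prefix and units_needed are hypothetical, and Python's re.sub is only a stand-in for pcre2_substitute().

```python
import re

def split_size_prefix(replacement):
    """Split a leading "[n]" buffer-size prefix from a replacement string.

    Returns (size, rest); size is None when there is no prefix.
    """
    m = re.match(r"\[(\d+)\]", replacement)
    if m is None:
        return None, replacement
    return int(m.group(1)), replacement[m.end():]

def units_needed(pattern, subject, replacement):
    """Length of the substituted string plus one for the trailing zero.

    re.sub with count=1 stands in for a non-global pcre2_substitute() call.
    """
    return len(re.sub(pattern, replacement, subject, count=1)) + 1

size, rest = split_size_prefix("[9]XYZ")
needed = units_needed("abc", "123abc123", rest)
print(size, rest, needed)   # 9 XYZ 10
```

The substituted string 123XYZ123 is 9 code units long, so a 9-unit buffer has no room for the trailing zero, which is why this is the edge case.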
A replacement string is ignored with POSIX and DFA matching. Specifying
- partial matching provokes an error return ("bad option value") from
+ partial matching provokes an error return ("bad option value") from
pcre2_substitute().
Setting the JIT stack size
- The jitstack modifier provides a way of setting the maximum stack size
- that is used by the just-in-time optimization code. It is ignored if
+ The jitstack modifier provides a way of setting the maximum stack size
+ that is used by the just-in-time optimization code. It is ignored if
JIT optimization is not being used. The value is a number of kilobytes.
- Providing a stack that is larger than the default 32K is necessary only
- for very complicated patterns.
+ Setting zero reverts to the default of 32K. Providing a stack that is
+ larger than the default is necessary only for very complicated pat-
+ terns. If jitstack is set non-zero on a subject line it overrides any
+ value that was set on the pattern.
- Setting match and recursion limits
+ Setting heap, match, and depth limits
- The match_limit and recursion_limit modifiers set the appropriate lim-
- its in the match context. These values are ignored when the find_limits
- modifier is specified.
+ The heap_limit, match_limit, and depth_limit modifiers set the appro-
+ priate limits in the match context. These values are ignored when the
+ find_limits modifier is specified.
Finding minimum limits
- If the find_limits modifier is present, pcre2test calls pcre2_match()
- several times, setting different values in the match context via
- pcre2_set_match_limit() and pcre2_set_recursion_limit() until it finds
- the minimum values for each parameter that allow pcre2_match() to com-
- plete without error.
+ If the find_limits modifier is present on a subject line, pcre2test
+ calls the relevant matching function several times, setting different
+ values in the match context via pcre2_set_heap_limit(),
+ pcre2_set_match_limit(), or pcre2_set_depth_limit() until it finds the
+ minimum values for each parameter that allow the match to complete
+ without error.
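
One way such a search for a minimum limit could be organized is sketched below. This is not necessarily the strategy pcre2test uses. The name find_min_limit and the succeeds callback (which stands for "run the match with this limit and report whether it completed without error") are assumptions made for illustration.

```python
def find_min_limit(succeeds, upper=2**32 - 1):
    """Return the smallest limit in [1, upper] for which succeeds(limit)
    is true, or None if even `upper` is not enough.

    Assumes succeeds is monotonic: once a limit is large enough, every
    larger limit is large enough too.
    """
    if not succeeds(upper):
        return None
    # Phase 1: double the candidate until it succeeds.
    lo, hi = 0, 1          # lo is known to be too small (or is zero)
    while hi < upper and not succeeds(hi):
        lo, hi = hi, min(hi * 2, upper)
    # Phase 2: bisect between the last failure and the first success.
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if succeeds(mid):
            hi = mid
        else:
            lo = mid
    return hi

# Usage with a stand-in predicate: pretend the match needs a limit of 37.
print(find_min_limit(lambda limit: limit >= 37))   # 37
```

Doubling first keeps the number of trial matches small when the minimum is far below the upper bound, which is the common case for simple patterns.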
If JIT is being used, only the match limit is relevant. If DFA matching
- is being used, neither limit is relevant, and this modifier is ignored
- (with a warning message).
-
- The match_limit number is a measure of the amount of backtracking that
- takes place, and learning the minimum value can be instructive. For
- most simple matches, the number is quite small, but for patterns with
- very large numbers of matching possibilities, it can become large very
- quickly with increasing length of subject string. The
- match_limit_recursion number is a measure of how much stack (or, if
- PCRE2 is compiled with NO_RECURSE, how much heap) memory is needed to
- complete the match attempt.
+ is being used, only the depth limit is relevant.
+
+ The match_limit number is a measure of the amount of backtracking that
+ takes place, and learning the minimum value can be instructive. For
+ most simple matches, the number is quite small, but for patterns with
+ very large numbers of matching possibilities, it can become large very
+ quickly with increasing length of subject string.
+
+ For non-DFA matching, the minimum depth_limit number is a measure of
+ how much nested backtracking happens (that is, how deeply the pattern's
+ tree is searched). In the case of DFA matching, depth_limit controls
+ the depth of recursive calls of the internal function that is used for
+ handling pattern recursion, lookaround assertions, and atomic groups.
Showing MARK names
@@ -1160,8 +1319,16 @@ SUBJECT MODIFIERS
Showing memory usage
- The memory modifier causes pcre2test to log all memory allocation and
- freeing calls that occur during a match operation.
+ The memory modifier causes pcre2test to log the sizes of all heap mem-
+ ory allocation and freeing calls that occur during a call to
+ pcre2_match(). These occur only when a match requires a bigger vector
+ than the default for remembering backtracking points. In many cases
+ there will be no heap memory used and therefore no additional output.
+ No heap memory is allocated during matching with pcre2_dfa_match or
+ with JIT, so in those cases the memory modifier never has any effect.
+ For this modifier to work, the null_context modifier must not be set on
+ both the pattern and the subject, though it can be set on one or the
+ other.
Setting a starting offset
@@ -1196,59 +1363,58 @@ SUBJECT MODIFIERS
By default, the subject string is passed to a native API matching func-
tion with its correct length. In order to test the facility for passing
a zero-terminated string, the zero_terminate modifier is provided. It
- causes the length to be passed as PCRE2_ZERO_TERMINATED. (When matching
- via the POSIX interface, this modifier has no effect, as there is no
- facility for passing a length.)
+ causes the length to be passed as PCRE2_ZERO_TERMINATED. When matching
+ via the POSIX interface, this modifier is ignored, with a warning.
- When testing pcre2_substitute(), this modifier also has the effect of
+ When testing pcre2_substitute(), this modifier also has the effect of
passing the replacement string as zero-terminated.
Passing a NULL context
- Normally, pcre2test passes a context block to pcre2_match(),
+ Normally, pcre2test passes a context block to pcre2_match(),
pcre2_dfa_match() or pcre2_jit_match(). If the null_context modifier is
- set, however, NULL is passed. This is for testing that the matching
+ set, however, NULL is passed. This is for testing that the matching
functions behave correctly in this case (they use default values). This
- modifier cannot be used with the find_limits modifier or when testing
+ modifier cannot be used with the find_limits modifier or when testing
the substitution function.
THE ALTERNATIVE MATCHING FUNCTION
- By default, pcre2test uses the standard PCRE2 matching function,
+ By default, pcre2test uses the standard PCRE2 matching function,
pcre2_match() to match each subject line. PCRE2 also supports an alter-
- native matching function, pcre2_dfa_match(), which operates in a dif-
- ferent way, and has some restrictions. The differences between the two
+ native matching function, pcre2_dfa_match(), which operates in a dif-
+ ferent way, and has some restrictions. The differences between the two
functions are described in the pcre2matching documentation.
- If the dfa modifier is set, the alternative matching function is used.
- This function finds all possible matches at a given point in the sub-
- ject. If, however, the dfa_shortest modifier is set, processing stops
- after the first match is found. This is always the shortest possible
+ If the dfa modifier is set, the alternative matching function is used.
+ This function finds all possible matches at a given point in the sub-
+ ject. If, however, the dfa_shortest modifier is set, processing stops
+ after the first match is found. This is always the shortest possible
match.
DEFAULT OUTPUT FROM pcre2test
- This section describes the output when the normal matching function,
+ This section describes the output when the normal matching function,
pcre2_match(), is being used.
- When a match succeeds, pcre2test outputs the list of captured sub-
- strings, starting with number 0 for the string that matched the whole
- pattern. Otherwise, it outputs "No match" when the return is
- PCRE2_ERROR_NOMATCH, or "Partial match:" followed by the partially
- matching substring when the return is PCRE2_ERROR_PARTIAL. (Note that
- this is the entire substring that was inspected during the partial
- match; it may include characters before the actual match start if a
+ When a match succeeds, pcre2test outputs the list of captured sub-
+ strings, starting with number 0 for the string that matched the whole
+ pattern. Otherwise, it outputs "No match" when the return is
+ PCRE2_ERROR_NOMATCH, or "Partial match:" followed by the partially
+ matching substring when the return is PCRE2_ERROR_PARTIAL. (Note that
+ this is the entire substring that was inspected during the partial
+ match; it may include characters before the actual match start if a
lookbehind assertion, \K, \b, or \B was involved.)
For any other return, pcre2test outputs the PCRE2 negative error number
- and a short descriptive phrase. If the error is a failed UTF string
- check, the code unit offset of the start of the failing character is
+ and a short descriptive phrase. If the error is a failed UTF string
+ check, the code unit offset of the start of the failing character is
also output. Here is an example of an interactive pcre2test run.
$ pcre2test
- PCRE2 version 9.00 2014-05-10
+ PCRE2 version 10.22 2016-07-29
re> /^abc(\d+)/
data> abc123
@@ -1260,8 +1426,8 @@ DEFAULT OUTPUT FROM pcre2test
Unset capturing substrings that are not followed by one that is set are
not shown by pcre2test unless the allcaptures modifier is specified. In
the following example, there are two capturing substrings, but when the
- first data line is matched, the second, unset substring is not shown.
- An "internal" unset substring is shown as "<unset>", as for the second
+ first data line is matched, the second, unset substring is not shown.
+ An "internal" unset substring is shown as "<unset>", as for the second
data line.
re> /(a)|(b)/
@@ -1273,11 +1439,11 @@ DEFAULT OUTPUT FROM pcre2test
1: <unset>
2: b
- If the strings contain any non-printing characters, they are output as
- \xhh escapes if the value is less than 256 and UTF mode is not set.
+ If the strings contain any non-printing characters, they are output as
+ \xhh escapes if the value is less than 256 and UTF mode is not set.
Otherwise they are output as \x{hh...} escapes. See below for the defi-
- nition of non-printing characters. If the /aftertext modifier is set,
- the output for substring 0 is followed by the the rest of the subject
+ nition of non-printing characters. If the aftertext modifier is set,
+ the output for substring 0 is followed by the rest of the subject
string, identified by "0+" like this:
re> /cat/aftertext
@@ -1285,7 +1451,7 @@ DEFAULT OUTPUT FROM pcre2test
0: cat
0+ aract
- If global matching is requested, the results of successive matching
+ If global matching is requested, the results of successive matching
attempts are output in sequence, like this:
re> /\Bi(\w\w)/g
@@ -1297,8 +1463,8 @@ DEFAULT OUTPUT FROM pcre2test
0: ipp
1: pp
- "No match" is output only if the first match attempt fails. Here is an
- example of a failure message (the offset 4 that is specified by the
+ "No match" is output only if the first match attempt fails. Here is an
+ example of a failure message (the offset 4 that is specified by the
offset modifier is past the end of the subject string):
re> /xyz/
@@ -1306,7 +1472,7 @@ DEFAULT OUTPUT FROM pcre2test
Error -24 (bad offset value)
Note that whereas patterns can be continued over several lines (a plain
- ">" prompt is used for continuations), subject lines may not. However
+ ">" prompt is used for continuations), subject lines may not. However
newlines can be included in a subject by means of the \n escape (or \r,
\r\n, etc., depending on the newline sequence setting).
@@ -1314,7 +1480,7 @@ DEFAULT OUTPUT FROM pcre2test
OUTPUT FROM THE ALTERNATIVE MATCHING FUNCTION
When the alternative matching function, pcre2_dfa_match(), is used, the
- output consists of a list of all the matches that start at the first
+ output consists of a list of all the matches that start at the first
point in the subject where there is at least one match. For example:
re> /(tang|tangerine|tan)/
@@ -1323,11 +1489,11 @@ OUTPUT FROM THE ALTERNATIVE MATCHING FUNCTION
1: tang
2: tan
- Using the normal matching function on this data finds only "tang". The
- longest matching string is always given first (and numbered zero).
- After a PCRE2_ERROR_PARTIAL return, the output is "Partial match:",
- followed by the partially matching substring. Note that this is the
- entire substring that was inspected during the partial match; it may
+ Using the normal matching function on this data finds only "tang". The
+ longest matching string is always given first (and numbered zero).
+ After a PCRE2_ERROR_PARTIAL return, the output is "Partial match:",
+ followed by the partially matching substring. Note that this is the
+ entire substring that was inspected during the partial match; it may
include characters before the actual match start if a lookbehind asser-
tion, \b, or \B was involved. (\K is not supported for DFA matching.)
@@ -1343,16 +1509,16 @@ OUTPUT FROM THE ALTERNATIVE MATCHING FUNCTION
1: tan
0: tan
- The alternative matching function does not support substring capture,
- so the modifiers that are concerned with captured substrings are not
+ The alternative matching function does not support substring capture,
+ so the modifiers that are concerned with captured substrings are not
relevant.
RESTARTING AFTER A PARTIAL MATCH
- When the alternative matching function has given the PCRE2_ERROR_PAR-
+ When the alternative matching function has given the PCRE2_ERROR_PAR-
TIAL return, indicating that the subject partially matched the pattern,
- you can restart the match with additional subject data by means of the
+ you can restart the match with additional subject data by means of the
dfa_restart modifier. For example:
re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
@@ -1361,26 +1527,17 @@ RESTARTING AFTER A PARTIAL MATCH
data> n05\=dfa,dfa_restart
0: n05
- For further information about partial matching, see the pcre2partial
+ For further information about partial matching, see the pcre2partial
documentation.
CALLOUTS
If the pattern contains any callout requests, pcre2test's callout func-
- tion is called during matching unless callout_none is specified. This
- works with both matching functions.
-
- The callout function in pcre2test returns zero (carry on matching) by
- default, but you can use a callout_fail modifier in a subject line (as
- described above) to change this and other parameters of the callout.
-
- Inserting callouts can be helpful when using pcre2test to check compli-
- cated regular expressions. For further information about callouts, see
- the pcre2callout documentation.
-
- The output for callouts with numerical arguments and those with string
- arguments is slightly different.
+ tion is called during matching unless callout_none is specified. This
+ works with both matching functions, and with JIT, though there are some
+ differences in behaviour. The output for callouts with numerical argu-
+ ments and those with string arguments is slightly different.
Callouts with numerical arguments
@@ -1399,8 +1556,8 @@ CALLOUTS
position, which can happen if the callout is in a lookbehind assertion.
Callouts numbered 255 are assumed to be automatic callouts, inserted as
- a result of the /auto_callout pattern modifier. In this case, instead
- of showing the callout number, the offset in the pattern, preceded by a
+ a result of the auto_callout pattern modifier. In this case, instead of
+ showing the callout number, the offset in the pattern, preceded by a
plus, is output. For example:
re> /\d?[A-E]\*/auto_callout
@@ -1451,46 +1608,140 @@ CALLOUTS
0: abcdef
+ Callout modifiers
+
+ The callout function in pcre2test returns zero (carry on matching) by
+ default, but you can use a callout_fail modifier in a subject line to
+ change this and other parameters of the callout (see below).
+
+ If the callout_capture modifier is set, the current captured groups are
+ output when a callout occurs. This is useful only for non-DFA matching,
+ as pcre2_dfa_match() does not support capturing, so no captures are
+ ever shown.
+
+ The normal callout output, showing the callout number or pattern offset
+ (as described above) is suppressed if the callout_no_where modifier is
+ set.
+
+ When using the interpretive matching function pcre2_match() without
+ JIT, setting the callout_extra modifier causes additional output from
+ pcre2test's callout function to be generated. For the first callout in
+ a match attempt at a new starting position in the subject, "New match
+ attempt" is output. If there has been a backtrack since the last call-
+ out (or start of matching if this is the first callout), "Backtrack" is
+ output, followed by "No other matching paths" if the backtrack ended
+ the previous match attempt. For example:
+
+ re> /(a+)b/auto_callout,no_start_optimize,no_auto_possess
+ data> aac\=callout_extra
+ New match attempt
+ --->aac
+ +0 ^ (
+ +1 ^ a+
+ +3 ^ ^ )
+ +4 ^ ^ b
+ Backtrack
+ --->aac
+ +3 ^^ )
+ +4 ^^ b
+ Backtrack
+ No other matching paths
+ New match attempt
+ --->aac
+ +0 ^ (
+ +1 ^ a+
+ +3 ^^ )
+ +4 ^^ b
+ Backtrack
+ No other matching paths
+ New match attempt
+ --->aac
+ +0 ^ (
+ +1 ^ a+
+ Backtrack
+ No other matching paths
+ New match attempt
+ --->aac
+ +0 ^ (
+ +1 ^ a+
+ No match
+
+ Notice that various optimizations must be turned off if you want all
+ possible matching paths to be scanned. If no_start_optimize is not
+ used, there is an immediate "no match", without any callouts, because
+ the starting optimization fails to find "b" in the subject, which it
+ knows must be present for any match. If no_auto_possess is not used,
+ the "a+" item is turned into "a++", which reduces the number of back-
+ tracks.
+
+ The callout_extra modifier has no effect if used with the DFA matching
+ function, or with JIT.
+
+ Return values from callouts
+
+ The default return from the callout function is zero, which allows
+ matching to continue. The callout_fail modifier can be given one or two
+ numbers. If there is only one number, 1 is returned instead of 0 (caus-
+ ing matching to backtrack) when a callout of that number is reached. If
+ two numbers (<n>:<m>) are given, 1 is returned when callout <n> is
+ reached and there have been at least <m> callouts. The callout_error
+ modifier is similar, except that PCRE2_ERROR_CALLOUT is returned, caus-
+ ing the entire matching process to be aborted. If both these modifiers
+ are set for the same callout number, callout_error takes precedence.
+ Note that callouts with string arguments are always given the number
+ zero.
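
The precedence between callout_fail and callout_error described above can be sketched as a small decision function. This is an illustration, not pcre2test's code: the name callout_return and its parameters are assumptions, the callout_data modifier is deliberately left out, and the numeric value used for PCRE2_ERROR_CALLOUT is believed to match pcre2.h but should be checked there.

```python
PCRE2_ERROR_CALLOUT = -37  # assumed value; consult pcre2.h

def callout_return(number, count, fail=None, error=None):
    """Decide what the callout function returns.

    number: the callout number that was reached.
    count:  how many callouts have happened so far, including this one.
    fail, error: (n, m) pairs from callout_fail and callout_error, or
                 None when the modifier is not set. Use m = 0 when only
                 one number was given.
    """
    def triggered(setting):
        if setting is None:
            return False
        n, m = setting
        return number == n and count >= m

    if triggered(error):        # callout_error takes precedence
        return PCRE2_ERROR_CALLOUT
    if triggered(fail):         # force a backtrack
        return 1
    return 0                    # carry on matching
```

For example, callout_return(2, 1, fail=(2, 0)) gives 1, while adding error=(2, 0) to the same call gives PCRE2_ERROR_CALLOUT instead.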
+
+ The callout_data modifier can be given an unsigned or a negative num-
+ ber. This is set as the "user data" that is passed to the matching
+ function, and passed back when the callout function is invoked. Any
+ value other than zero is used as a return from pcre2test's callout
+ function.
+
+ Inserting callouts can be helpful when using pcre2test to check compli-
+ cated regular expressions. For further information about callouts, see
+ the pcre2callout documentation.
+
+
NON-PRINTING CHARACTERS
When pcre2test is outputting text in the compiled version of a pattern,
- bytes other than 32-126 are always treated as non-printing characters
+ bytes other than 32-126 are always treated as non-printing characters
and are therefore shown as hex escapes.
- When pcre2test is outputting text that is a matched part of a subject
- string, it behaves in the same way, unless a different locale has been
- set for the pattern (using the /locale modifier). In this case, the
- isprint() function is used to distinguish printing and non-printing
+ When pcre2test is outputting text that is a matched part of a subject
+ string, it behaves in the same way, unless a different locale has been
+ set for the pattern (using the locale modifier). In this case, the
+ isprint() function is used to distinguish printing and non-printing
characters.
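
The choice between the \xhh and \x{hh...} escape forms, combined with the default printing range 32-126, can be sketched as follows. The function name show_char is hypothetical, the is_printing parameter stands in for the locale's isprint(), and the use of lower-case hex with at least two digits is an assumption about the exact output format.

```python
def show_char(code, utf=False, is_printing=None):
    """Return the text used to display one character value.

    code: the character value (a code unit, or a code point in UTF mode).
    is_printing: optional predicate replacing the default 32-126 test,
                 as a stand-in for isprint() under a different locale.
    """
    if is_printing is None:
        is_printing = lambda c: 32 <= c <= 126
    if is_printing(code):
        return chr(code)
    # Short form only for values below 256 when UTF mode is not set.
    if code < 256 and not utf:
        return "\\x%02x" % code
    return "\\x{%02x}" % code
```

With these assumptions, a newline (value 10) is shown as \x0a outside UTF mode, and the value 0x100 is shown as \x{100} in either mode.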
SAVING AND RESTORING COMPILED PATTERNS
- It is possible to save compiled patterns on disc or elsewhere, and
+ It is possible to save compiled patterns on disc or elsewhere, and
reload them later, subject to a number of restrictions. JIT data cannot
- be saved. The host on which the patterns are reloaded must be running
+ be saved. The host on which the patterns are reloaded must be running
the same version of PCRE2, with the same code unit width, and must also
- have the same endianness, pointer width and PCRE2_SIZE type. Before
- compiled patterns can be saved they must be serialized, that is, con-
- verted to a stream of bytes. A single byte stream may contain any num-
- ber of compiled patterns, but they must all use the same character
+ have the same endianness, pointer width and PCRE2_SIZE type. Before
+ compiled patterns can be saved they must be serialized, that is, con-
+ verted to a stream of bytes. A single byte stream may contain any num-
+ ber of compiled patterns, but they must all use the same character
tables. A single copy of the tables is included in the byte stream (its
size is 1088 bytes).
- The functions whose names begin with pcre2_serialize_ are used for
- serializing and de-serializing. They are described in the pcre2serial-
+ The functions whose names begin with pcre2_serialize_ are used for
+ serializing and de-serializing. They are described in the pcre2serial-
ize documentation. In this section we describe the features of
pcre2test that can be used to test these functions.
- When a pattern with push modifier is successfully compiled, it is
- pushed onto a stack of compiled patterns, and pcre2test expects the
- next line to contain a new pattern (or command) instead of a subject
- line. By contrast, the pushcopy modifier causes a copy of the compiled
- pattern to be stacked, leaving the original available for immediate
- matching. By using push and/or pushcopy, a number of patterns can be
+ When a pattern with push modifier is successfully compiled, it is
+ pushed onto a stack of compiled patterns, and pcre2test expects the
+ next line to contain a new pattern (or command) instead of a subject
+ line. By contrast, the pushcopy modifier causes a copy of the compiled
+ pattern to be stacked, leaving the original available for immediate
+ matching. By using push and/or pushcopy, a number of patterns can be
compiled and retained. These modifiers are incompatible with posix, and
- control modifiers that act at match time are ignored (with a message)
- for the stacked patterns. The jitverify modifier applies only at com-
+ control modifiers that act at match time are ignored (with a message)
+ for the stacked patterns. The jitverify modifier applies only at com-
pile time.
The command
@@ -1498,21 +1749,21 @@ SAVING AND RESTORING COMPILED PATTERNS
#save <filename>
causes all the stacked patterns to be serialized and the result written
- to the named file. Afterwards, all the stacked patterns are freed. The
+ to the named file. Afterwards, all the stacked patterns are freed. The
command
#load <filename>
- reads the data in the file, and then arranges for it to be de-serial-
- ized, with the resulting compiled patterns added to the pattern stack.
- The pattern on the top of the stack can be retrieved by the #pop com-
- mand, which must be followed by lines of subjects that are to be
- matched with the pattern, terminated as usual by an empty line or end
- of file. This command may be followed by a modifier list containing
- only control modifiers that act after a pattern has been compiled. In
+ reads the data in the file, and then arranges for it to be de-serial-
+ ized, with the resulting compiled patterns added to the pattern stack.
+ The pattern on the top of the stack can be retrieved by the #pop com-
+ mand, which must be followed by lines of subjects that are to be
+ matched with the pattern, terminated as usual by an empty line or end
+ of file. This command may be followed by a modifier list containing
+ only control modifiers that act after a pattern has been compiled. In
particular, hex, posix, posix_nosub, push, and pushcopy are not
- allowed, nor are any option-setting modifiers. The JIT modifiers are,
- however, permitted. Here is an example that saves and reloads two pat-
+ allowed, nor are any option-setting modifiers. The JIT modifiers are,
+ however, permitted. Here is an example that saves and reloads two pat-
terns.
/abc/push
@@ -1525,10 +1776,10 @@ SAVING AND RESTORING COMPILED PATTERNS
#pop jit,bincode
abc
- If jitverify is used with #pop, it does not automatically imply jit,
+ If jitverify is used with #pop, it does not automatically imply jit,
which is different behaviour from when it is used on a pattern.
- The #popcopy command is analogous to the pushcopy modifier in that it
+ The #popcopy command is analogous to the pushcopy modifier in that it
makes current a copy of the topmost stack pattern, leaving the original
still on the stack.
@@ -1548,5 +1799,5 @@ AUTHOR
REVISION
- Last updated: 06 July 2016
- Copyright (c) 1997-2016 University of Cambridge.
+ Last updated: 21 December 2017
+ Copyright (c) 1997-2017 University of Cambridge.
diff --git a/doc/pcre2unicode.3 b/doc/pcre2unicode.3
index 253d4b6..813fadf 100644
--- a/doc/pcre2unicode.3
+++ b/doc/pcre2unicode.3
@@ -1,4 +1,4 @@
-.TH PCRE2UNICODE 3 "03 July 2016" "PCRE2 10.22"
+.TH PCRE2UNICODE 3 "17 May 2017" "PCRE2 10.30"
.SH NAME
PCRE - Perl-compatible regular expressions (revised API)
.SH "UNICODE AND UTF SUPPORT"
@@ -40,7 +40,7 @@ and
documentation. Only the short names for properties are supported. For example,
\ep{L} matches a letter. Its Perl synonym, \ep{Letter}, is not supported.
Furthermore, in Perl, many properties may optionally be prefixed by "Is", for
-compatibility with Perl 5.6. PCRE does not support this.
+compatibility with Perl 5.6. PCRE2 does not support this.
.
.
.SH "WIDE CHARACTERS AND UTF MODES"
@@ -101,10 +101,16 @@ low-valued characters, unless the PCRE2_UCP option is set.
However, the special horizontal and vertical white space matching escapes (\eh,
\eH, \ev, and \eV) do match all the appropriate Unicode characters, whether or
not PCRE2_UCP is set.
-.P
-Case-insensitive matching in UTF mode makes use of Unicode properties. A few
-Unicode characters such as Greek sigma have more than two codepoints that are
-case-equivalent, and these are treated as such.
+.
+.
+.SH "CASE-EQUIVALENCE IN UTF MODES"
+.rs
+.sp
+Case-insensitive matching in a UTF mode makes use of Unicode properties except
+for characters whose code points are less than 128 and that have at most two
+case-equivalent values. For these, a direct table lookup is used for speed. A
+few Unicode characters such as Greek sigma have more than two codepoints that
+are case-equivalent, and these are treated as such.
.
.
.SH "VALIDITY OF UTF STRINGS"
@@ -158,6 +164,14 @@ or \fBpcre2_dfa_match()\fP.
.P
If you pass an invalid UTF string when PCRE2_NO_UTF_CHECK is set, the result
is undefined and your program may crash or loop indefinitely.
+.P
+Note that setting PCRE2_NO_UTF_CHECK at compile time does not disable the error
+that is given if an escape sequence for an invalid Unicode code point is
+encountered in the pattern. If you want to allow escape sequences such as
+\ex{d800} (a surrogate code point) you can set the
+PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES extra option. However, this is possible
+only in UTF-8 and UTF-32 modes, because these values are not representable in
+UTF-16.
.
.
.\" HTML <a name="utf8strings"></a>
@@ -266,6 +280,6 @@ Cambridge, England.
.rs
.sp
.nf
-Last updated: 03 July 2016
-Copyright (c) 1997-2016 University of Cambridge.
+Last updated: 17 May 2017
+Copyright (c) 1997-2017 University of Cambridge.
.fi
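Note on the pcre2unicode.3 hunk that adds PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES: the restriction to UTF-8 and UTF-32 modes follows from how the encodings work. The sketch below is illustrative only and is not PCRE2 code; the helper names (is_surrogate, utf8_encode_3byte_raw, utf16_units_needed) are hypothetical. It shows that the 3-byte UTF-8 bit layout can be applied mechanically to a surrogate such as U+D800, whereas UTF-16 has no encoding for it, because a 16-bit unit in the range 0xD800 to 0xDFFF is always read as half of a surrogate pair.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if cp lies in the surrogate range U+D800..U+DFFF, else 0. */
static int is_surrogate(uint32_t cp)
{
    return cp >= 0xD800u && cp <= 0xDFFFu;
}

/* Applies the 3-byte UTF-8 bit layout to cp with no validity check.
 * Only meaningful for 0x800..0xFFFF (asserted). Strict UTF-8 forbids
 * surrogates, but the bit pattern itself can still be produced.
 * Returns the number of bytes written, which is always 3. */
static size_t utf8_encode_3byte_raw(uint32_t cp, unsigned char out[3])
{
    assert(cp >= 0x800u && cp <= 0xFFFFu);
    out[0] = (unsigned char)(0xE0u | (cp >> 12));          /* 1110xxxx */
    out[1] = (unsigned char)(0x80u | ((cp >> 6) & 0x3Fu)); /* 10xxxxxx */
    out[2] = (unsigned char)(0x80u | (cp & 0x3Fu));        /* 10xxxxxx */
    return 3;
}

/* Returns how many 16-bit units UTF-16 needs for cp, or 0 when cp has
 * no UTF-16 encoding at all (a surrogate, or a value above U+10FFFF). */
static size_t utf16_units_needed(uint32_t cp)
{
    if (is_surrogate(cp) || cp > 0x10FFFFu)
        return 0;
    return cp < 0x10000u ? 1 : 2;
}
```

With these helpers, U+D800 yields the three bytes ED A0 80 from utf8_encode_3byte_raw, while utf16_units_needed reports 0 for it. In UTF-32 the value simply occupies one 32-bit unit, which is why the extra option can be honoured there as well.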