author | Matthew Vernon <matthew@debian.org> | 2018-02-24 12:07:04 +0000 |
---|---|---|
committer | Matthew Vernon <matthew@debian.org> | 2018-02-24 12:07:04 +0000 |
commit | e98c3314cf9e05aa99f5e192862ec37f29b7dbb5 (patch) | |
tree | b69bb3feb63a4fd79ad8a6e55865228f6fde04eb /README | |
parent | 92b17f0eb8fddd7117c5344a1e1177daec21995a (diff) |
New upstream version 10.31
Diffstat (limited to 'README')
-rw-r--r-- | README | 212 |
1 files changed, 125 insertions, 87 deletions
@@ -15,8 +15,8 @@ subscribe or manage your subscription here:

     https://lists.exim.org/mailman/listinfo/pcre-dev

-Please read the NEWS file if you are upgrading from a previous release.
-The contents of this README file are:
+Please read the NEWS file if you are upgrading from a previous release. The
+contents of this README file are:

   The PCRE2 APIs
   Documentation for PCRE2
@@ -44,8 +44,8 @@ wrappers.

 The distribution does contain a set of C wrapper functions for the 8-bit
 library that are based on the POSIX regular expression API (see the pcre2posix
-man page). These can be found in a library called libpcre2posix. Note that this
-just provides a POSIX calling interface to PCRE2; the regular expressions
+man page). These can be found in a library called libpcre2-posix. Note that
+this just provides a POSIX calling interface to PCRE2; the regular expressions
 themselves still follow Perl syntax and semantics. The POSIX API is restricted,
 and does not give full access to all of PCRE2's facilities.

@@ -58,8 +58,8 @@ renamed or pointed at by a link.
 If you are using the POSIX interface to PCRE2 and there is already a POSIX
 regex library installed on your system, as well as worrying about the regex.h
 header file (as mentioned above), you must also take care when linking programs
-to ensure that they link with PCRE2's libpcre2posix library. Otherwise they may
-pick up the POSIX functions of the same name from the other library.
+to ensure that they link with PCRE2's libpcre2-posix library. Otherwise they
+may pick up the POSIX functions of the same name from the other library.

 One way of avoiding this confusion is to compile PCRE2 with the addition of
 -Dregcomp=PCRE2regcomp (and similarly for the other POSIX functions) to the
@@ -95,10 +95,9 @@ PCRE2 documentation is supplied in two other forms:
 Building PCRE2 on non-Unix-like systems
 ---------------------------------------

-For a non-Unix-like system, please read the comments in the file
-NON-AUTOTOOLS-BUILD, though if your system supports the use of "configure" and
-"make" you may be able to build PCRE2 using autotools in the same way as for
-many Unix-like systems.
+For a non-Unix-like system, please read the file NON-AUTOTOOLS-BUILD, though if
+your system supports the use of "configure" and "make" you may be able to build
+PCRE2 using autotools in the same way as for many Unix-like systems.

 PCRE2 can also be configured using CMake, which can be run in various ways
 (command line, GUI, etc). This creates Makefiles, solution files, etc. The file
@@ -172,21 +171,24 @@ library. They are also documented in the pcre2build man page.
   give large performance improvements on certain platforms, add --enable-jit to
   the "configure" command. This support is available only for certain hardware
   architectures. If you try to enable it on an unsupported architecture, there
-  will be a compile time error.
-
-. If you do not want to make use of the support for UTF-8 Unicode character
-  strings in the 8-bit library, UTF-16 Unicode character strings in the 16-bit
-  library, or UTF-32 Unicode character strings in the 32-bit library, you can
-  add --disable-unicode to the "configure" command. This reduces the size of
-  the libraries. It is not possible to configure one library with Unicode
-  support, and another without, in the same configuration.
+  will be a compile time error. If you are running under SELinux you may also
+  want to add --enable-jit-sealloc, which enables the use of an execmem
+  allocator in JIT that is compatible with SELinux. This has no effect if JIT
+  is not enabled.
+
+. If you do not want to make use of the default support for UTF-8 Unicode
+  character strings in the 8-bit library, UTF-16 Unicode character strings in
+  the 16-bit library, or UTF-32 Unicode character strings in the 32-bit
+  library, you can add --disable-unicode to the "configure" command. This
+  reduces the size of the libraries. It is not possible to configure one
+  library with Unicode support, and another without, in the same configuration.
+  It is also not possible to use --enable-ebcdic (see below) with Unicode
+  support, so if this option is set, you must also use --disable-unicode.

   When Unicode support is available, the use of a UTF encoding still has to be
   enabled by setting the PCRE2_UTF option at run time or starting a pattern
   with (*UTF). When PCRE2 is compiled with Unicode support, its input can only
-  either be ASCII or UTF-8/16/32, even when running on EBCDIC platforms. It is
-  not possible to use both --enable-unicode and --enable-ebcdic at the same
-  time.
+  either be ASCII or UTF-8/16/32, even when running on EBCDIC platforms.

   As well as supporting UTF strings, Unicode support includes support for the
   \P, \p, and \X sequences that recognize Unicode character properties.
@@ -196,20 +198,14 @@ library. They are also documented in the pcre2build man page.
   or starting a pattern with (*UCP).

 . You can build PCRE2 to recognize either CR or LF or the sequence CRLF, or any
-  of the preceding, or any of the Unicode newline sequences, as indicating the
-  end of a line. Whatever you specify at build time is the default; the caller
-  of PCRE2 can change the selection at run time. The default newline indicator
-  is a single LF character (the Unix standard). You can specify the default
-  newline indicator by adding --enable-newline-is-cr, --enable-newline-is-lf,
-  --enable-newline-is-crlf, --enable-newline-is-anycrlf, or
-  --enable-newline-is-any to the "configure" command, respectively.
-
-  If you specify --enable-newline-is-cr or --enable-newline-is-crlf, some of
-  the standard tests will fail, because the lines in the test files end with
-  LF. Even if the files are edited to change the line endings, there are likely
-  to be some failures. With --enable-newline-is-anycrlf or
-  --enable-newline-is-any, many tests should succeed, but there may be some
-  failures.
+  of the preceding, or any of the Unicode newline sequences, or the NUL (zero)
+  character as indicating the end of a line. Whatever you specify at build time
+  is the default; the caller of PCRE2 can change the selection at run time. The
+  default newline indicator is a single LF character (the Unix standard). You
+  can specify the default newline indicator by adding --enable-newline-is-cr,
+  --enable-newline-is-lf, --enable-newline-is-crlf,
+  --enable-newline-is-anycrlf, --enable-newline-is-any, or
+  --enable-newline-is-nul to the "configure" command, respectively.

 . By default, the sequence \R in a pattern matches any Unicode line ending
   sequence. This is independent of the option specifying what PCRE2 considers
@@ -231,49 +227,44 @@ library. They are also documented in the pcre2build man page.

     --with-parens-nest-limit=500

-. PCRE2 has a counter that can be set to limit the amount of resources it uses
-  when matching a pattern. If the limit is exceeded during a match, the match
-  fails. The default is ten million. You can change the default by setting, for
-  example,
+. PCRE2 has a counter that can be set to limit the amount of computing resource
+  it uses when matching a pattern. If the limit is exceeded during a match, the
+  match fails. The default is ten million. You can change the default by
+  setting, for example,

     --with-match-limit=500000

   on the "configure" command. This is just the default; individual calls to
-  pcre2_match() can supply their own value. There is more discussion on the
-  pcre2api man page.
+  pcre2_match() or pcre2_dfa_match() can supply their own value. There is more
+  discussion in the pcre2api man page (search for pcre2_set_match_limit).
+
+. There is a separate counter that limits the depth of nested backtracking
+  during a matching process, which indirectly limits the amount of heap memory
+  that is used. This also has a default of ten million, which is essentially
+  "unlimited". You can change the default by setting, for example,
+
+    --with-match-limit-depth=5000

-. There is a separate counter that limits the depth of recursive function calls
-  during a matching process. This also has a default of ten million, which is
-  essentially "unlimited". You can change the default by setting, for example,
+  There is more discussion in the pcre2api man page (search for
+  pcre2_set_depth_limit).

-    --with-match-limit-recursion=500000
+. You can also set an explicit limit on the amount of heap memory used by
+  the pcre2_match() interpreter:

-  Recursive function calls use up the runtime stack; running out of stack can
-  cause programs to crash in strange ways. There is a discussion about stack
-  sizes in the pcre2stack man page.
+    --with-heap-limit=500
+
+  The units are kilobytes. This limit does not apply when the JIT optimization
+  (which has its own memory control features) is used. There is more discussion
+  on the pcre2api man page (search for pcre2_set_heap_limit).

 . In the 8-bit library, the default maximum compiled pattern size is around
-  64K. You can increase this by adding --with-link-size=3 to the "configure"
-  command. PCRE2 then uses three bytes instead of two for offsets to different
-  parts of the compiled pattern. In the 16-bit library, --with-link-size=3 is
-  the same as --with-link-size=4, which (in both libraries) uses four-byte
-  offsets. Increasing the internal link size reduces performance in the 8-bit
-  and 16-bit libraries. In the 32-bit library, the link size setting is
-  ignored, as 4-byte offsets are always used.
-
-. You can build PCRE2 so that its internal match() function that is called from
-  pcre2_match() does not call itself recursively. Instead, it uses memory
-  blocks obtained from the heap to save data that would otherwise be saved on
-  the stack. To build PCRE2 like this, use
-
-    --disable-stack-for-recursion
-
-  on the "configure" command. PCRE2 runs more slowly in this mode, but it may
-  be necessary in environments with limited stack sizes. This applies only to
-  the normal execution of the pcre2_match() function; if JIT support is being
-  successfully used, it is not relevant. Equally, it does not apply to
-  pcre2_dfa_match(), which does not use deeply nested recursion. There is a
-  discussion about stack sizes in the pcre2stack man page.
+  64K bytes. You can increase this by adding --with-link-size=3 to the
+  "configure" command. PCRE2 then uses three bytes instead of two for offsets
+  to different parts of the compiled pattern. In the 16-bit library,
+  --with-link-size=3 is the same as --with-link-size=4, which (in both
+  libraries) uses four-byte offsets. Increasing the internal link size reduces
+  performance in the 8-bit and 16-bit libraries. In the 32-bit library, the
+  link size setting is ignored, as 4-byte offsets are always used.

 . For speed, PCRE2 uses four tables for manipulating and identifying characters
   whose code point values are less than 256. By default, it uses a set of
@@ -339,12 +330,23 @@ library. They are also documented in the pcre2build man page.

   Of course, the relevant libraries must be installed on your system.

-. The default size (in bytes) of the internal buffer used by pcre2grep can be
-  set by, for example:
+. The default starting size (in bytes) of the internal buffer used by pcre2grep
+  can be set by, for example:

     --with-pcre2grep-bufsize=51200

-  The value must be a plain integer. The default is 20480.
+  The value must be a plain integer. The default is 20480. The amount of memory
+  used by pcre2grep is actually three times this number, to allow for "before"
+  and "after" lines. If very long lines are encountered, the buffer is
+  automatically enlarged, up to a fixed maximum size.
+
+. The default maximum size of pcre2grep's internal buffer can be set by, for
+  example:
+
+    --with-pcre2grep-max-bufsize=2097152
+
+  The default is either 1048576 or the value of --with-pcre2grep-bufsize,
+  whichever is the larger.

 . It is possible to compile pcre2test so that it links with the libreadline
   or libedit libraries, by specifying, respectively,
@@ -369,6 +371,29 @@ library. They are also documented in the pcre2build man page.
   tgetflag, or tgoto, this is the problem, and linking with the ncurses library
   should fix it.

+. There is a special option called --enable-fuzz-support for use by people who
+  want to run fuzzing tests on PCRE2. At present this applies only to the 8-bit
+  library. If set, it causes an extra library called libpcre2-fuzzsupport.a to
+  be built, but not installed. This contains a single function called
+  LLVMFuzzerTestOneInput() whose arguments are a pointer to a string and the
+  length of the string. When called, this function tries to compile the string
+  as a pattern, and if that succeeds, to match it. This is done both with no
+  options and with some random options bits that are generated from the string.
+  Setting --enable-fuzz-support also causes a binary called pcre2fuzzcheck to
+  be created. This is normally run under valgrind or used when PCRE2 is
+  compiled with address sanitizing enabled. It calls the fuzzing function and
+  outputs information about it is doing. The input strings are specified by
+  arguments: if an argument starts with "=" the rest of it is a literal input
+  string. Otherwise, it is assumed to be a file name, and the contents of the
+  file are the test string.
+
+. Releases before 10.30 could be compiled with --disable-stack-for-recursion,
+  which caused pcre2_match() to use individual blocks on the heap for
+  backtracking instead of recursive function calls (which use the stack). This
+  is now obsolete since pcre2_match() was refactored always to use the heap (in
+  a much more efficient way than before). This option is retained for backwards
+  compatibility, but has no effect other than to output a warning.
+
 The "configure" script builds the following files for the basic C library:

 . Makefile             the makefile that builds the library
@@ -543,7 +568,7 @@ script creates the .txt and HTML forms of the documentation from the man pages.


 Testing PCRE2
-------------
+-------------

 To test the basic PCRE2 library on a Unix-like system, run the RunTest script.
 There is another script called RunGrepTest that tests the pcre2grep command.
@@ -635,32 +660,43 @@ with the perltest.sh script, and test 5 checking PCRE2-specific things.
 Tests 6 and 7 check the pcre2_dfa_match() alternative matching function, in
 non-UTF mode and UTF-mode with Unicode property support, respectively.

-Test 8 checks some internal offsets and code size features; it is run only when
-the default "link size" of 2 is set (in other cases the sizes change) and when
-Unicode support is enabled.
+Test 8 checks some internal offsets and code size features, but it is run only
+when Unicode support is enabled. The output is different in 8-bit, 16-bit, and
+32-bit modes and for different link sizes, so there are different output files
+for each mode and link size.

 Tests 9 and 10 are run only in 8-bit mode, and tests 11 and 12 are run only in
 16-bit and 32-bit modes. These are tests that generate different output in
 8-bit mode. Each pair are for general cases and Unicode support, respectively.
+
 Test 13 checks the handling of non-UTF characters greater than 255 by
 pcre2_dfa_match() in 16-bit and 32-bit modes.

-Test 14 contains a number of tests that must not be run with JIT. They check,
+Test 14 contains some special UTF and UCP tests that give different output for
+different code unit widths.
+
+Test 15 contains a number of tests that must not be run with JIT. They check,
 among other non-JIT things, the match-limiting features of the intepretive
 matcher.

-Test 15 is run only when JIT support is not available. It checks that an
+Test 16 is run only when JIT support is not available. It checks that an
 attempt to use JIT has the expected behaviour.

-Test 16 is run only when JIT support is available. It checks JIT complete and
+Test 17 is run only when JIT support is available. It checks JIT complete and
 partial modes, match-limiting under JIT, and other JIT-specific features.

-Tests 17 and 18 are run only in 8-bit mode. They check the POSIX interface to
+Tests 18 and 19 are run only in 8-bit mode. They check the POSIX interface to
 the 8-bit library, without and with Unicode support, respectively.

-Test 19 checks the serialization functions by writing a set of compiled
+Test 20 checks the serialization functions by writing a set of compiled
 patterns to a file, and then reloading and checking them.

+Tests 21 and 22 test \C support when the use of \C is not locked out, without
+and with UTF support, respectively. Test 23 tests \C when it is locked out.
+
+Tests 24 and 25 test the experimental pattern conversion functions, without and
+with UTF support, respectively.
+

 Character tables
 ----------------
@@ -679,7 +715,7 @@ specified for ./configure, a different version of pcre2_chartables.c is built
 by the program dftables (compiled from dftables.c), which uses the ANSI C
 character handling functions such as isalnum(), isalpha(), isupper(),
 islower(), etc. to build the table sources. This means that the default C
-locale which is set for your system will control the contents of these default
+locale that is set for your system will control the contents of these default
 tables. You can change the default tables by editing pcre2_chartables.c and
 then re-building PCRE2. If you do this, you should take care to ensure that the
 file does not get automatically re-generated. The best way to do this is to
@@ -734,8 +770,10 @@ The distribution should contain the files listed below.
   src/pcre2_compile.c      )
   src/pcre2_config.c       )
   src/pcre2_context.c      )
+  src/pcre2_convert.c      )
   src/pcre2_dfa_match.c    )
   src/pcre2_error.c        )
+  src/pcre2_extuni.c       )
   src/pcre2_find_bracket.c )
   src/pcre2_jit_compile.c  )
   src/pcre2_jit_match.c    ) sources for the functions in the library,
@@ -757,6 +795,7 @@ The distribution should contain the files listed below.
   src/pcre2_xclass.c       )

   src/pcre2_printint.c     debugging function that is used by pcre2test,
+  src/pcre2_fuzzsupport.c  function for (optional) fuzzing support

   src/config.h.in          template for config.h, when built by "configure"
   src/pcre2.h.in           template for pcre2.h when built by "configure"
@@ -772,7 +811,6 @@ The distribution should contain the files listed below.
   src/pcre2demo.c          simple demonstration of coding calls to PCRE2
   src/pcre2grep.c          source of a grep utility that uses PCRE2
   src/pcre2test.c          comprehensive test program
-  src/pcre2_printint.c     part of pcre2test
   src/pcre2_jit_test.c     JIT test program

 (C) Auxiliary files:
@@ -814,7 +852,7 @@ The distribution should contain the files listed below.
   libpcre2-8.pc.in         template for libpcre2-8.pc for pkg-config
   libpcre2-16.pc.in        template for libpcre2-16.pc for pkg-config
   libpcre2-32.pc.in        template for libpcre2-32.pc for pkg-config
-  libpcre2posix.pc.in      template for libpcre2posix.pc for pkg-config
+  libpcre2-posix.pc.in     template for libpcre2-posix.pc for pkg-config
   ltmain.sh                file used to build a libtool script
   missing                  ) common stub for a few missing GNU programs while
                            ) installing, generated by automake
@@ -837,12 +875,12 @@ The distribution should contain the files listed below.

 (E) Auxiliary files for building PCRE2 "by hand"

-  pcre2.h.generic          ) a version of the public PCRE2 header file
+  src/pcre2.h.generic      ) a version of the public PCRE2 header file
                            ) for use in non-"configure" environments
-  config.h.generic         ) a version of config.h for use in non-"configure"
+  src/config.h.generic     ) a version of config.h for use in non-"configure"
                            ) environments

 Philip Hazel
 Email local part: ph10
 Email domain: cam.ac.uk
-Last updated: 01 April 2016
+Last updated: 12 September 2017
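
The new pcre2grep text in this patch gives two sizing rules: the memory used at start is three times the configured buffer size, and the default maximum buffer size is 1048576 or the configured buffer size, whichever is larger. A minimal sketch of that arithmetic follows. The function and macro names are hypothetical and are not taken from the pcre2grep sources; only the numbers come from the README text.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Values quoted in the README text; the macro names are illustrative only. */
#define README_DEFAULT_BUFSIZE   ((size_t)20480)    /* default --with-pcre2grep-bufsize */
#define README_DEFAULT_MAX_FLOOR ((size_t)1048576)  /* floor for the default maximum */

/* Memory used at start: three times the buffer size, which the README says
   allows for "before" and "after" lines. */
static size_t grep_initial_memory(size_t bufsize)
{
  assert(bufsize <= SIZE_MAX / 3);  /* guard against overflow in the multiply */
  return 3 * bufsize;
}

/* Default for --with-pcre2grep-max-bufsize when it is not given explicitly:
   the larger of 1048576 and the configured starting size. */
static size_t grep_default_max_bufsize(size_t bufsize)
{
  return bufsize > README_DEFAULT_MAX_FLOOR ? bufsize : README_DEFAULT_MAX_FLOOR;
}
```

With the built-in default of 20480 this gives 61440 bytes of initial memory and a default maximum of 1048576. A starting size above 1048576 raises the default maximum to match it.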
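
The --enable-fuzz-support paragraph added by this patch describes how pcre2fuzzcheck reads its arguments: an argument that starts with "=" is a literal input string, and anything else is a file name whose contents are the test string. The sketch below illustrates that convention only. The helper load_fuzz_input() is hypothetical and is not the code in the PCRE2 distribution; it assumes a C99 compiler and the standard library.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not from the PCRE2 sources): turn one command-line
   argument into a test string using the convention the README describes.
   Returns a malloc'd buffer that the caller must free, with its length stored
   in *len, or NULL on failure. */
static unsigned char *load_fuzz_input(const char *arg, size_t *len)
{
  unsigned char *buf;

  assert(arg != NULL && len != NULL);

  if (arg[0] == '=') {
    /* Literal input: everything after the "=" is the test string. */
    *len = strlen(arg + 1);
    buf = malloc(*len + 1);
    if (buf == NULL) return NULL;
    memcpy(buf, arg + 1, *len + 1);  /* copies the terminating zero as well */
    return buf;
  }

  /* Otherwise the argument names a file whose contents are the test string. */
  FILE *f = fopen(arg, "rb");
  if (f == NULL) return NULL;
  if (fseek(f, 0, SEEK_END) != 0) { fclose(f); return NULL; }
  long size = ftell(f);
  if (size < 0) { fclose(f); return NULL; }
  rewind(f);

  buf = malloc((size_t)size + 1);
  if (buf == NULL) { fclose(f); return NULL; }
  *len = fread(buf, 1, (size_t)size, f);
  fclose(f);
  buf[*len] = 0;  /* zero-terminate for convenience; not counted in *len */
  return buf;
}
```

The returned buffer and length correspond to the two arguments that the README says LLVMFuzzerTestOneInput() takes, a pointer to a string and the length of the string. A driver along these lines would pass them to that function once per command-line argument and then free the buffer.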