Diffstat (limited to 'doc/pcre2api.3')
-rw-r--r-- | doc/pcre2api.3 | 246 |
1 files changed, 186 insertions, 60 deletions
diff --git a/doc/pcre2api.3 b/doc/pcre2api.3 index b29f7b0..db61ea0 100644 --- a/doc/pcre2api.3 +++ b/doc/pcre2api.3 @@ -1,4 +1,4 @@ -.TH PCRE2API 3 "16 December 2015" "PCRE2 10.21" +.TH PCRE2API 3 "17 June 2016" "PCRE2 10.22" .SH NAME PCRE2 - Perl-compatible regular expressions (revised API) .sp @@ -233,6 +233,8 @@ document for an overview of all the PCRE2 documentation. .rs .sp .nf +.B pcre2_code *pcre2_code_copy(const pcre2_code *\fIcode\fP); +.sp .B int pcre2_get_error_message(int \fIerrorcode\fP, PCRE2_UCHAR *\fIbuffer\fP, .B " PCRE2_SIZE \fIbufflen\fP);" .sp @@ -352,9 +354,10 @@ More complicated programs might need to make use of the specialist functions \fBpcre2_jit_stack_create()\fP, \fBpcre2_jit_stack_free()\fP, and \fBpcre2_jit_stack_assign()\fP in order to control the JIT code's memory usage. .P -JIT matching is automatically used by \fBpcre2_match()\fP if it is available. -There is also a direct interface for JIT matching, which gives improved -performance. The JIT-specific functions are discussed in the +JIT matching is automatically used by \fBpcre2_match()\fP if it is available, +unless the PCRE2_NO_JIT option is set. There is also a direct interface for JIT +matching, which gives improved performance. The JIT-specific functions are +discussed in the .\" HREF \fBpcre2jit\fP .\" @@ -393,9 +396,16 @@ The function \fBpcre2_substitute()\fP can be called to match a pattern and return a copy of the subject string with substitutions for parts that were matched. .P +Functions whose names begin with \fBpcre2_serialize_\fP are used for saving +compiled patterns on disc or elsewhere, and reloading them later. +.P Finally, there are functions for finding out information about a compiled pattern (\fBpcre2_pattern_info()\fP) and about the configuration with which PCRE2 was built (\fBpcre2_config()\fP). +.P +Functions with names ending with \fB_free()\fP are used for freeing memory +blocks of various sorts. 
In all cases, if one of these functions is called with +a NULL argument, it does nothing. . . .SH "STRING LENGTHS AND OFFSETS" @@ -461,21 +471,52 @@ time ensuring that multithreaded applications can use it. .P There are several different blocks of data that are used to pass information between the application and the PCRE2 libraries. -.P -(1) A pointer to the compiled form of a pattern is returned to the user when +. +. +.SS "The compiled pattern" +.rs +.sp +A pointer to the compiled form of a pattern is returned to the user when \fBpcre2_compile()\fP is successful. The data in the compiled pattern is fixed, and does not change when the pattern is matched. Therefore, it is thread-safe, that is, the same compiled pattern can be used by more than one thread -simultaneously. An application can compile all its patterns at the start, -before forking off multiple threads that use them. However, if the just-in-time -optimization feature is being used, it needs separate memory stack areas for -each thread. See the +simultaneously. For example, an application can compile all its patterns at the +start, before forking off multiple threads that use them. However, if the +just-in-time optimization feature is being used, it needs separate memory stack +areas for each thread. See the .\" HREF \fBpcre2jit\fP .\" documentation for more details. .P -(2) The next section below introduces the idea of "contexts" in which PCRE2 +In a more complicated situation, where patterns are compiled only when they are +first needed, but are still shared between threads, pointers to compiled +patterns must be protected from simultaneous writing by multiple threads, at +least until a pattern has been compiled. The logic can be something like this: +.sp + Get a read-only (shared) lock (mutex) for pointer + if (pointer == NULL) + { + Get a write (unique) lock for pointer + pointer = pcre2_compile(... 
+ } + Release the lock + Use pointer in pcre2_match() +.sp +Of course, testing for compilation errors should also be included in the code. +.P +If JIT is being used, but the JIT compilation is not being done immediately, +(perhaps waiting to see if the pattern is used often enough) similar logic is +required. JIT compilation updates a pointer within the compiled code block, so +a thread must gain unique write access to the pointer before calling +\fBpcre2_jit_compile()\fP. Alternatively, \fBpcre2_code_copy()\fP can be used +to obtain a private copy of the compiled code. +. +. +.SS "Context blocks" +.rs +.sp +The next main section below introduces the idea of "contexts" in which PCRE2 functions are called. A context is nothing more than a collection of parameters that control the way PCRE2 operates. Grouping a number of parameters together in a context is a convenient way of passing them to a PCRE2 function without @@ -487,11 +528,15 @@ In a multithreaded application, if the parameters in a context are values that are never changed, the same context can be used by all the threads. However, if any thread needs to change any value in a context, it must make its own thread-specific copy. -.P -(3) The matching functions need a block of memory for working space and for -storing the results of a match. This includes details of what was matched, as -well as additional information such as the name of a (*MARK) setting. Each -thread must provide its own version of this memory. +. +. +.SS "Match blocks" +.rs +.sp +The matching functions need a block of memory for working space and for storing +the results of a match. This includes details of what was matched, as well as +additional information such as the name of a (*MARK) setting. Each thread must +provide its own copy of this memory. . . .SH "PCRE2 CONTEXTS" @@ -979,34 +1024,51 @@ zero. 
.B " pcre2_compile_context *\fIccontext\fP);" .sp .B void pcre2_code_free(pcre2_code *\fIcode\fP); +.sp +.B pcre2_code *pcre2_code_copy(const pcre2_code *\fIcode\fP); .fi .P The \fBpcre2_compile()\fP function compiles a pattern into an internal form. -The pattern is defined by a pointer to a string of code units and a length, If +The pattern is defined by a pointer to a string of code units and a length. If the pattern is zero-terminated, the length can be specified as PCRE2_ZERO_TERMINATED. The function returns a pointer to a block of memory that -contains the compiled pattern and related data. The caller must free the memory -by calling \fBpcre2_code_free()\fP when it is no longer needed. +contains the compiled pattern and related data, or NULL if an error occurred. +.P +If the compile context argument \fIccontext\fP is NULL, memory for the compiled +pattern is obtained by calling \fBmalloc()\fP. Otherwise, it is obtained from +the same memory function that was used for the compile context. The caller must +free the memory by calling \fBpcre2_code_free()\fP when it is no longer needed. +.P +The function \fBpcre2_code_copy()\fP makes a copy of the compiled code in new +memory, using the same memory allocator as was used for the original. However, +if the code has been processed by the JIT compiler (see +.\" HTML <a href="#jitcompiling"> +.\" </a> +below), +.\" +the JIT information cannot be copied (because it is position-dependent). +The new copy can initially be used only for non-JIT matching, though it can be +passed to \fBpcre2_jit_compile()\fP if required. The \fBpcre2_code_copy()\fP +function provides a way for individual threads in a multithreaded application +to acquire a private copy of shared compiled code. .P NOTE: When one of the matching functions is called, pointers to the compiled pattern and the subject string are set in the match data block so that they can -be referenced by the extraction functions. 
After running a match, you must not -free a compiled pattern (or a subject string) until after all operations on the +be referenced by the substring extraction functions. After running a match, you +must not free a compiled pattern (or a subject string) until after all +operations on the .\" HTML <a href="#matchdatablock"> .\" </a> match data block .\" have taken place. .P -If the compile context argument \fIccontext\fP is NULL, memory for the compiled -pattern is obtained by calling \fBmalloc()\fP. Otherwise, it is obtained from -the same memory function that was used for the compile context. -.P -The \fIoptions\fP argument contains various bit settings that affect the -compilation. It should be zero if no options are required. The available -options are described below. Some of them (in particular, those that are -compatible with Perl, but some others as well) can also be set and unset from -within the pattern (see the detailed description in the +The \fIoptions\fP argument for \fBpcre2_compile()\fP contains various bit +settings that affect the compilation. It should be zero if no options are +required. The available options are described below. Some of them (in +particular, those that are compatible with Perl, but some others as well) can +also be set and unset from within the pattern (see the detailed description in +the .\" HREF \fBpcre2pattern\fP .\" @@ -1025,13 +1087,22 @@ above). .\" .P If \fIerrorcode\fP or \fIerroroffset\fP is NULL, \fBpcre2_compile()\fP returns -NULL immediately. Otherwise, if compilation of a pattern fails, -\fBpcre2_compile()\fP returns NULL, having set these variables to an error code -and an offset (number of code units) within the pattern, respectively. The -\fBpcre2_get_error_message()\fP function provides a textual message for each -error code. Compilation errors are positive numbers, but UTF formatting errors -are negative numbers. 
For an invalid UTF-8 or UTF-16 string, the offset is that -of the first code unit of the failing character. +NULL immediately. Otherwise, the variables to which these point are set to an +error code and an offset (number of code units) within the pattern, +respectively, when \fBpcre2_compile()\fP returns NULL because a compilation +error has occurred. The values are not defined when compilation is successful +and \fBpcre2_compile()\fP returns a non-NULL value. +.P +The \fBpcre2_get_error_message()\fP function (see "Obtaining a textual error +message" +.\" HTML <a href="#geterrormessage"> +.\" </a> +below) +.\" +provides a textual message for each error code. Compilation errors have +positive error codes; UTF formatting error codes are negative. For an invalid +UTF-8 or UTF-16 string, the offset is that of the first code unit of the +failing character. .P Some errors are not detected until the whole pattern has been scanned; in these cases, the offset passed back is the length of the pattern. Note that the @@ -1255,7 +1326,9 @@ If this option is set, it disables the use of numbered capturing parentheses in the pattern. Any opening parenthesis that is not followed by ? behaves as if it were followed by ?: but named parentheses can still be used for capturing (and they acquire numbers in the usual way). There is no equivalent of this option -in Perl. +in Perl. Note that, if this option is set, references to capturing groups (back +references or recursion/subroutine calls) may only refer to named groups, +though the reference can be by name or by number. .sp PCRE2_NO_AUTO_POSSESS .sp @@ -1416,17 +1489,24 @@ page. .SH "COMPILATION ERROR CODES" .rs .sp -There are over 80 positive error codes that \fBpcre2_compile()\fP may return if -it finds an error in the pattern. There are also some negative error codes that -are used for invalid UTF strings. 
These are the same as given by -\fBpcre2_match()\fP and \fBpcre2_dfa_match()\fP, and are described in the +There are over 80 positive error codes that \fBpcre2_compile()\fP may return +(via \fIerrorcode\fP) if it finds an error in the pattern. There are also some +negative error codes that are used for invalid UTF strings. These are the same +as given by \fBpcre2_match()\fP and \fBpcre2_dfa_match()\fP, and are described +in the .\" HREF \fBpcre2unicode\fP .\" -page. The \fBpcre2_get_error_message()\fP function can be called to obtain a -textual error message from any error code. +page. The \fBpcre2_get_error_message()\fP function (see "Obtaining a textual +error message" +.\" HTML <a href="#geterrormessage"> +.\" </a> +below) +.\" +can be called to obtain a textual error message from any error code. . . +.\" HTML <a name="jitcompiling"></a> .SH "JUST-IN-TIME (JIT) COMPILATION" .rs .sp @@ -1565,10 +1645,14 @@ are as follows: Return a copy of the pattern's options. The third argument should point to a \fBuint32_t\fP variable. PCRE2_INFO_ARGOPTIONS returns exactly the options that were passed to \fBpcre2_compile()\fP, whereas PCRE2_INFO_ALLOPTIONS returns -the compile options as modified by any top-level option settings such as (*UTF) -at the start of the pattern itself. For example, if the pattern /(*UTF)abc/ is -compiled with the PCRE2_EXTENDED option, the result is PCRE2_EXTENDED and -PCRE2_UTF. +the compile options as modified by any top-level (*XXX) option settings such as +(*UTF) at the start of the pattern itself. +.P +For example, if the pattern /(*UTF)abc/ is compiled with the PCRE2_EXTENDED +option, the result for PCRE2_INFO_ALLOPTIONS is PCRE2_EXTENDED and PCRE2_UTF. +Option settings such as (?i) that can change within a pattern do not affect the +result of PCRE2_INFO_ALLOPTIONS, even if they appear right at the start of the +pattern. (This was different in some earlier releases.) 
.P A pattern compiled without PCRE2_ANCHORED is automatically anchored by PCRE2 if the first significant item in every top-level branch is one of the following: @@ -2043,13 +2127,14 @@ pattern does not require the match to be at the start of the subject. .sp The unused bits of the \fIoptions\fP argument for \fBpcre2_match()\fP must be zero. The only bits that may be set are PCRE2_ANCHORED, PCRE2_NOTBOL, -PCRE2_NOTEOL, PCRE2_NOTEMPTY, PCRE2_NOTEMPTY_ATSTART, PCRE2_NO_UTF_CHECK, -PCRE2_PARTIAL_HARD, and PCRE2_PARTIAL_SOFT. Their action is described below. +PCRE2_NOTEOL, PCRE2_NOTEMPTY, PCRE2_NOTEMPTY_ATSTART, PCRE2_NO_JIT, +PCRE2_NO_UTF_CHECK, PCRE2_PARTIAL_HARD, and PCRE2_PARTIAL_SOFT. Their action is +described below. .P Setting PCRE2_ANCHORED at match time is not supported by the just-in-time (JIT) compiler. If it is set, JIT matching is disabled and the normal interpretive -code in \fBpcre2_match()\fP is run. The remaining options are supported for JIT -matching. +code in \fBpcre2_match()\fP is run. Apart from PCRE2_NO_JIT (obviously), the +remaining options are supported for JIT matching. .sp PCRE2_ANCHORED .sp @@ -2097,6 +2182,13 @@ the starting offset. An empty string match later in the subject is permitted. If the pattern is anchored, such a match can occur only if the pattern contains \eK. .sp + PCRE2_NO_JIT +.sp +By default, if a pattern has been successfully processed by +\fBpcre2_jit_compile()\fP, JIT is automatically used when \fBpcre2_match()\fP +is called with options that JIT supports. Setting PCRE2_NO_JIT disables the use +of JIT; it forces matching to be done by the interpreter. +.sp PCRE2_NO_UTF_CHECK .sp When PCRE2_UTF is set at compile time, the validity of the subject as a UTF @@ -2378,11 +2470,16 @@ page. .rs .sp If \fBpcre2_match()\fP fails, it returns a negative number. This can be -converted to a text string by calling \fBpcre2_get_error_message()\fP. Negative -error codes are also returned by other functions, and are documented with them. 
-The codes are given names in the header file. If UTF checking is in force and -an invalid UTF subject string is detected, one of a number of UTF-specific -negative error codes is returned. Details are given in the +converted to a text string by calling the \fBpcre2_get_error_message()\fP +function (see "Obtaining a textual error message" +.\" HTML <a href="#geterrormessage"> +.\" </a> +below). +.\" +Negative error codes are also returned by other functions, and are documented +with them. The codes are given names in the header file. If UTF checking is in +force and an invalid UTF subject string is detected, one of a number of +UTF-specific negative error codes is returned. Details are given in the .\" HREF \fBpcre2unicode\fP .\" @@ -2495,6 +2592,30 @@ is attempted. The internal recursion limit was reached. . . +.\" HTML <a name="geterrormessage"></a> +.SH "OBTAINING A TEXTUAL ERROR MESSAGE" +.rs +.sp +.nf +.B int pcre2_get_error_message(int \fIerrorcode\fP, PCRE2_UCHAR *\fIbuffer\fP, +.B " PCRE2_SIZE \fIbufflen\fP);" +.fi +.P +A text message for an error code from any PCRE2 function (compile, match, or +auxiliary) can be obtained by calling \fBpcre2_get_error_message()\fP. The code +is passed as the first argument, with the remaining two arguments specifying a +code unit buffer and its length, into which the text message is placed. Note +that the message is returned in code units of the appropriate width for the +library that is being used. +.P +The returned message is terminated with a trailing zero, and the function +returns the number of code units used, excluding the trailing zero. If the +error number is unknown, the negative error code PCRE2_ERROR_BADDATA is +returned. If the buffer is too small, the message is truncated (but still with +a trailing zero), and the negative error code PCRE2_ERROR_NOMEMORY is returned. +None of the messages are very long; a buffer size of 120 code units is ample. +. +. 
.\" HTML <a name="extractbynumber"></a> .SH "EXTRACTING CAPTURED SUBSTRINGS BY NUMBER" .rs @@ -2872,7 +2993,12 @@ substitution), and PCRE2_BADSUBPATTERN (the pattern match ended before it started, which can happen if \eK is used in an assertion). .P As for all PCRE2 errors, a text message that describes the error can be -obtained by calling \fBpcre2_get_error_message()\fP. +obtained by calling the \fBpcre2_get_error_message()\fP function (see +"Obtaining a textual error message" +.\" HTML <a href="#geterrormessage"> +.\" </a> +above). +.\" . . .SH "DUPLICATE SUBPATTERN NAMES" @@ -3166,6 +3292,6 @@ Cambridge, England. .rs .sp .nf -Last updated: 16 December 2015 -Copyright (c) 1997-2015 University of Cambridge. +Last updated: 17 June 2016 +Copyright (c) 1997-2016 University of Cambridge. .fi