From afa50af6ab95d3d8899891d82ed4fd473cb00571 Mon Sep 17 00:00:00 2001 From: John Millaway Date: Thu, 3 Apr 2003 01:01:37 +0000 Subject: Docbook. --- doc/flex.xml | 773 +++++++++++++++++++++++++++++++++++++---------------------- 1 file changed, 482 insertions(+), 291 deletions(-) (limited to 'doc') diff --git a/doc/flex.xml b/doc/flex.xml index 71edc75..10a9703 100644 --- a/doc/flex.xml +++ b/doc/flex.xml @@ -14,7 +14,7 @@ All rights reserved. VernPaxson @@ -22,16 +22,20 @@ All rights reserved. JohnMillaway + This code is derived from software contributed to Berkeley by Vern Paxson. - + + The United States Government has rights in this work pursuant to contract no. DE-AC03-76SF00098 between the United States Department of Energy and the University of California. - + + Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: + @@ -47,15 +51,17 @@ documentation and/or other materials provided with the distribution. - + Neither the name of the University nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. - + + THIS SOFTWARE IS PROVIDED ``AS IS'' AND WITHOUT ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. + @@ -68,42 +74,34 @@ PURPOSE. @direntry - + This manual describes flex, a tool for generating programs that perform pattern-matching on text. The manual includes both tutorial and reference sections. - + + This edition of @cite{The flex Manual} documents flex version @value{VERSION}. It was last updated on @value{UPDATED}. 
+ - -Copyright - - - - -The flex manual is placed under the same licensing conditions as the -rest of flex: - - - Reporting Bugs - + If you have problems with flex or think you have found a bug, please send mail detailing your problem to @email{lex-help@@lists.sourceforge.net}. Patches are always welcome. + Introduction - + flex is a tool for generating @dfn{scanners}. A scanner is a program which recognizes lexical patterns in text. The flex @@ -116,19 +114,22 @@ This file can be compiled and linked with the flex runtime library to produce an executable. When the executable is run, it analyzes its input for occurrences of the regular expressions. Whenever it finds one, it executes the corresponding C code. + Some Simple Examples -First some simple examples to get the flavor of how one uses +First some simple examples to get the flavor of how one uses flex. - + + The following flex input specifies a scanner which, when it encounters the string @samp{username} will replace it with the user's login name: + @@ -139,6 +140,7 @@ login name: + By default, any text not matched by a flex scanner is copied to @@ -147,8 +149,11 @@ to its output with each occurrence of @samp{username} expanded. In this input, there is just one rule. @samp{username} is the @dfn{pattern} and the @samp{printf} is the @dfn{action}. The @samp{%%} symbol marks the beginning of the rules. + + Here's another simple example: + @@ -171,6 +176,7 @@ Here's another simple example: + This scanner counts the number of characters and the number of lines in its input. It produces no output other than the final report on the character and line counts. The first line declares two globals, @@ -180,8 +186,11 @@ second @samp{%%}. There are two rules, one which matches a newline (@samp{\n}) and increments both the line count and the character count, and one which matches any character other than a newline (indicated by the @samp{.} regular expression). 
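As a rough illustration of what the two rules described above do, here is a hand-written C sketch that applies the same logic to a buffer. It is not flex output, and the names `struct counts`, `count_text`, `report`, `num_lines` and `num_chars` are assumptions made for this example only.

```c
#include <stddef.h>
#include <stdio.h>

/* Running totals, standing in for the two globals the scanner declares.
 * The field names are illustrative assumptions. */
struct counts {
    long num_lines;
    long num_chars;
};

/* Apply the two rules to every byte of `text`:
 *   a newline       -> increment both the line count and the character count
 *   any other char  -> increment only the character count
 */
struct counts count_text(const char *text, size_t len)
{
    struct counts c = {0, 0};
    for (size_t i = 0; i < len; i++) {
        if (text[i] == '\n')
            c.num_lines++;
        c.num_chars++;
    }
    return c;
}

/* Print the final report, which is the only output the scanner produces. */
void report(struct counts c)
{
    printf("# of lines = %ld, # of chars = %ld\n", c.num_lines, c.num_chars);
}
```

A caller would read its input into a buffer, pass it to `count_text`, and hand the result to `report`. A real flex scanner does the same work inside the actions attached to the two patterns.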
+ + A somewhat more complicated example: + @@ -241,12 +250,16 @@ A somewhat more complicated example: + This is the beginning of a simple scanner for a language like Pascal. It identifies different types of @dfn{tokens} and reports on what it has seen. + + The details of this example will be explained in the following sections. + @@ -259,8 +272,10 @@ sections. + The flex input file consists of three sections, separated by a line containing only @samp{%%}. + @@ -277,10 +292,10 @@ line containing only @samp{%%}. @@ -288,15 +303,19 @@ line containing only @samp{%%}.
Format of the Definitions Section + The @dfn{definitions section} contains declarations of simple @dfn{name} definitions to simplify the scanner specification, and declarations of @dfn{start conditions}, which are explained in a later section. + + Name definitions have the form: + @@ -306,12 +325,14 @@ Name definitions have the form: + The @samp{name} is a word beginning with a letter or an underscore (@samp{_}) followed by zero or more letters, digits, @samp{_}, or @samp{-} (dash). The definition is taken to begin at the first non-whitespace character following the name and continuing to the end of the line. The definition can subsequently be referred to using -@samp{@{name@}}, which will expand to @samp{(definition)}. For example, +@samp{{name}}, which will expand to @samp{(definition)}. For example, + @@ -324,9 +345,11 @@ the line. The definition can subsequently be referred to using + Defines @samp{DIGIT} to be a regular expression which matches a single digit, and @samp{ID} to be a regular expression which matches a letter followed by zero-or-more letters-or-digits. A subsequent reference to + @@ -337,7 +360,9 @@ followed by zero-or-more letters-or-digits. A subsequent reference to + is identical to + @@ -347,32 +372,40 @@ is identical to + and matches one-or-more digits followed by a @samp{.} followed by zero-or-more digits. + + An unindented comment (i.e., a line beginning with @samp{/*}) is copied verbatim to the output up to the next @samp{*/}. + - + + -Any @emph{indented} text or text enclosed in @samp{%@{} and @samp{%@}} -is also copied verbatim to the output (with the %@{ and %@} symbols -removed). The %@{ and %@} symbols must appear unindented on lines by +Any indented text or text enclosed in @samp{%{} and @samp{%}} +is also copied verbatim to the output (with the %{ and %} symbols +removed). The %{ and %} symbols must appear unindented on lines by themselves. + -A @code{%top} block is similar to a @samp{%@{} ... 
@samp{%@}} block, except -that the code in a @code{%top} block is relocated to the @emph{top} of the + +A @code{%top} block is similar to a @samp{%{} ... @samp{%}} block, except +that the code in a @code{%top} block is relocated to the top of the generated file, before any flex definitions @footnote{Actually, -@code{yyIN_HEADER} is defined before the @samp{%top} block.}. +@code{yyIN_HEADER} is defined before the @samp{%top} block.}. The @code{%top} block is useful when you want certain preprocessor macros to be defined or certain files to be included before the generated code. -The single characters, @samp{@{} and @samp{@}} are used to delimit the +The single characters, @samp{{} and @samp{}} are used to delimit the @code{%top} block, as shown in the example below: + @@ -386,7 +419,9 @@ The single characters, @samp{@{} and @samp{@}} are used to delimit the + Multiple @code{%top} blocks are allowed, and their order is preserved. + 
@@ -395,8 +430,11 @@ Multiple @code{%top} blocks are allowed, and their order is preserved. + + The @dfn{rules} section of the flex input contains a series of rules of the form: + @@ -406,22 +444,28 @@ rules of the form: + where the pattern must be unindented and the action must begin on the same line. @xref{Patterns}, for a further description of patterns and actions. + -In the rules section, any indented or %@{ %@} enclosed text appearing + +In the rules section, any indented or %{ %} enclosed text appearing before the first rule may be used to declare variables which are local to the scanning routine and (after the declarations) code which is to be executed whenever the scanning routine is entered. Other indented or -%@{ %@} text in the rule section is still copied to the output, but its +%{ %} text in the rule section is still copied to the output, but its meaning is not well-defined and it may well cause compile-time errors (this feature is present for @acronym{POSIX} compliance. @xref{Lex and Posix}, for other such features). + -Any @emph{indented} text or text enclosed in @samp{%@{} and @samp{%@}} -is copied verbatim to the output (with the %@{ and %@} symbols removed). -The %@{ and %@} symbols must appear unindented on lines by themselves. + +Any indented text or text enclosed in @samp{%{} and @samp{%}} +is copied verbatim to the output (with the %{ and %} symbols removed). +The %{ and %} symbols must appear unindented on lines by themselves. + @@ -430,10 +474,13 @@ The %@{ and %@} symbols must appear unindented on lines by themselves. + + The user code section is simply copied to lex.yy.c verbatim. It is used for companion routines which call or are called by the scanner. The presence of this section is optional; if it is missing, the second @samp{%%} in the input file may be skipped, too. 
+ @@ -441,33 +488,46 @@ The presence of this section is optional; if it is missing, the second Comments in the Input + + Flex supports C-style comments, that is, anything between /* and */ is considered a comment. Whenever flex encounters a comment, it copies the entire comment verbatim to the generated source code. Comments may appear just about anywhere, but with the following exceptions: + + Comments may not appear in the Rules Section wherever flex is expecting a regular expression. This means comments may not appear at the beginning of a line, or immediately following a list of scanner states. + + + Comments may not appear on an @samp{%option} line in the Definitions Section. + + -If you want to follow a simple rule, then always begin a comment on a + +If you want to follow a simple rule, then always begin a comment on a new line, with one or more whitespace characters before the initial @samp{/*}). This rule will work anywhere in the input file. + + All the comments in the following example are valid: + @@ -507,14 +567,18 @@ ruleD ECHO; + + The patterns in the input (see @ref{Rules Section}) are written using an extended set of regular expressions. These are: + x + match the character 'x' @@ -589,21 +653,21 @@ zero or one r's (that is, ``an optional r'') -r@{2,5@} +r{2,5} anywhere from two to five r's -r@{2,@} +r{2,} two or more r's -r@{4@} +r{4} exactly 4 r's @@ -611,7 +675,7 @@ exactly 4 r's -@{name@} +{name} the expansion of the @samp{name} definition (@pxref{Format}). @@ -773,7 +837,7 @@ operators, @samp{-}, @samp{]]}, and, at the beginning of the class, @samp{^}. The regular expressions listed above are grouped according to precedence, from highest precedence at the top to lowest at the bottom. Those grouped together have equal precedence (see special note on the -precedence of the repeat operator, @samp{@{@}}, under the documentation +precedence of the repeat operator, @samp{{}}, under the documentation for the @samp{--posix} POSIX compliance option). 
For example, @@ -797,7 +861,7 @@ is the same as since the @samp{*} operator has higher precedence than concatenation, and concatenation higher than alternation (@samp{|}). This pattern -therefore matches @emph{either} the string @samp{foo} @emph{or} the +therefore matches either the string @samp{foo} or the string @samp{ba} followed by zero-or-more @samp{r}'s. To match @samp{foo} or zero-or-more repetitions of the string @samp{bar}, use: @@ -893,7 +957,7 @@ enabled: @item @samp{[a-t]} @tab ok @tab @samp{[a-tA-T]} @tab @item @samp{[A-T]} @tab ok @tab @samp{[a-tA-T]} @tab @item @samp{[A-t]} @tab ambiguous @tab @samp{[A-Z\[\\\]_`a-t]} @tab @samp{[a-tA-T]} -@item @samp{[_-@{]} @tab ambiguous @tab @samp{[_`a-z@{]} @tab @samp{[_`a-zA-Z@{]} +@item @samp{[_-{]} @tab ambiguous @tab @samp{[_`a-z{]} @tab @samp{[_`a-zA-Z{]} @item @samp{[@@-C]} @tab ambiguous @tab @samp{[@@ABC]} @tab @samp{[@@A-Z\[\\\]_`abc]} @end multitable--> @@ -904,7 +968,7 @@ enabled: A negated character class such as the example @samp{[^A-Z]} above -@emph{will} match a newline unless @samp{\n} (or an equivalent escape +will match a newline unless @samp{\n} (or an equivalent escape sequence) is one of the characters explicitly present in the negated character class (e.g., @samp{[^A-Z\n]}). This is unlike how many other regular expression tools treat negated character classes, but @@ -1028,7 +1092,7 @@ a time) to its output. Note that yytext can be defined in two different ways: either as -a character @emph{pointer} or as a character @emph{array}. You can +a character pointer or as a character array. You can control which definition flex uses by including one of the special directives @code{%pointer} or @code{%array} in the first (definitions) section of your flex input. 
The default is @@ -1070,7 +1134,7 @@ accommodate very large tokens (such as matching entire blocks of comments), bear in mind that each time the scanner must resize yytext it also must rescan the entire token from the beginning, so matching such tokens can prove slow. yytext presently does -@emph{not} dynamically grow if a call to unput results in too +not dynamically grow if a call to unput results in too much text being pushed back; instead, a run-time error results. @@ -1118,17 +1182,17 @@ single blank, and throws away whitespace found at the end of a line:
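The behavior named in the hunk header above (compress runs of blanks and tabs to a single blank, and throw away whitespace at the end of a line) can be sketched in plain C as follows. This is a hand-written approximation, not the manual's flex example. The function name `squeeze_blanks` is hypothetical, and treating only a following newline as "end of a line" is an assumption of the sketch.

```c
#include <stddef.h>

/* Copy `in` to `out`, replacing each run of blanks and tabs with one blank
 * and dropping any run that is immediately followed by a newline.
 * `out` must have room for strlen(in) + 1 bytes.
 * Returns the number of bytes written, not counting the terminating NUL.
 */
size_t squeeze_blanks(const char *in, char *out)
{
    size_t n = 0;
    while (*in != '\0') {
        if (*in == ' ' || *in == '\t') {
            /* Consume the whole run, as a longest-match scanner would. */
            while (*in == ' ' || *in == '\t')
                in++;
            /* Assumption: only a run directly before '\n' counts as
             * end-of-line whitespace; such a run is discarded.
             * Any other run becomes a single blank. */
            if (*in != '\n')
                out[n++] = ' ';
        } else {
            out[n++] = *in++;
        }
    }
    out[n] = '\0';
    return n;
}
```

In a flex scanner the same effect comes from two patterns with different actions, one of which is empty. The C version makes the "consume the run, then decide" step explicit.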
- - + + -If the action contains a @samp{@}}, then the action spans till the -balancing @samp{@}} is found, and the action may cross multiple lines. +If the action contains a @samp{}}, then the action spans till the +balancing @samp{}} is found, and the action may cross multiple lines. flex knows about C strings and comments and won't be fooled by braces found within them, but also allows actions to begin with -@samp{%@{} and will consider the action to be all the text up to the -next @samp{%@}} (regardless of ordinary braces inside the action). +@samp{%{} and will consider the action to be all the text up to the +next @samp{%}} (regardless of ordinary braces inside the action). An action consisting solely of a vertical bar (@samp{|}) means ``same as the @@ -1225,14 +1289,14 @@ The first three rules share the fourth's action since they use the special @samp{|} action. @code{REJECT} is a particularly expensive feature in terms of scanner -performance; if it is used in @emph{any} of the scanner's actions it -will slow down @emph{all} of the scanner's matching. Furthermore, +performance; if it is used in any of the scanner's actions it +will slow down all of the scanner's matching. Furthermore, @code{REJECT} cannot be used with the @samp{-Cf} or @samp{-CF} options (@pxref{Scanner Options}). Note also that unlike the other special actions, @code{REJECT} is a -@emph{branch}. code immediately following it in the action will -@emph{not} be executed. +branch. code immediately following it in the action will +not be executed. @@ -1241,7 +1305,7 @@ Note also that unlike the other special actions, @code{REJECT} is a tells the scanner that the next time it matches a rule, the -corresponding token should be @emph{appended} onto the current value of +corresponding token should be appended onto the current value of yytext rather than replacing it. 
For example, given the input @samp{mega-kludge} the following will write @samp{mega-mega-kludge} to the output: @@ -1331,14 +1395,14 @@ the current token and cause it to be rescanned enclosed in parentheses. Note that since each unput puts the given character back at the -@emph{beginning} of the input stream, pushing back strings must be done +beginning of the input stream, pushing back strings must be done back-to-front. An important potential problem when using unput is that if you are using @code{%pointer} (the default), a call to unput -@emph{destroys} the contents of yytext, starting with its +destroys the contents of yytext, starting with its rightmost character and devouring one character to the left with each call. If you need the value of yytext preserved after a call to unput (as in the above example), you must either first copy it @@ -1463,7 +1527,7 @@ definitions prevent us from using any standard data types smaller than int (such as short, char, or bool) as function arguments. For this reason, future versions of flex may generate standard C99 code only, leaving K&R-style functions to the historians. Currently, if you -do @strong{not} want @samp{C99} definitions, then you must use +do not want @samp{C99} definitions, then you must use @code{%option noansi-definitions}. @@ -1489,7 +1553,7 @@ the latter is available for compatibility with previous versions of middle of scanning. It can also be used to throw away the current input buffer, by calling it with an argument of yyin; but it would be better to use @code{YY_FLUSH_BUFFER} (@pxref{Actions}). Note that -yyrestart does @emph{not} reset the start condition to +yyrestart does not reset the start condition to @code{INITIAL} (@pxref{Start Conditions}). @@ -1537,7 +1601,7 @@ false (zero), then it is assumed that the function has gone ahead and set up yyin to point to another input file, and scanning continues. If it returns true (non-zero), then the scanner terminates, returning 0 to its caller. 
Note that in either case, the start -condition remains unchanged; it does @emph{not} revert to +condition remains unchanged; it does not revert to @code{INITIAL}. @@ -1607,7 +1671,7 @@ action. Until the next @code{BEGIN} action is executed, rules with the given start condition will be active and rules with other start conditions will be inactive. If the start condition is inclusive, then rules with no start conditions at all will also be active. If it is -exclusive, then @emph{only} rules qualified with the start condition +exclusive, then only rules qualified with the start condition will be active. A set of rules contingent on the same exclusive start condition describe a scanner which is independent of any of the other rules in the flex input. Because of this, exclusive start @@ -1930,8 +1994,8 @@ condition @dfn{scope}. A start condition scope is begun with: where @code{SCs} is a list of one or more start conditions. Inside the start condition scope, every rule automatically has the prefix -@code{SCs>} applied to it, until a @samp{@}} which matches the initial -@samp{@{}. So, for example, +@code{SCs>} applied to it, until a @samp{}} which matches the initial +@samp{{}. So, for example, @@ -1947,7 +2011,9 @@ start condition scope, every rule automatically has the prefix + is equivalent to: + @@ -1960,37 +2026,61 @@ is equivalent to: + Start condition scopes may be nested. + + The following routines are available for manipulating stacks of start conditions: + + + + +void yy_push_state + int @code{new_state} + + -@deftypefun void yy_push_state ( int @code{new_state} ) pushes the current start condition onto the top of the start condition stack and switches to @code{new_state} as though you had used @code{BEGIN new_state} (recall that start condition names are also integers). -@end deftypefun -@deftypefun void yy_pop_state () + + +void yy_pop_state + + + + pops the top of the stack and switches to it via @code{BEGIN}. 
-@end deftypefun -@deftypefun int yy_top_state () + + +int yy_top_state + + + + returns the top of the stack without altering the stack's contents. -@end deftypefun + + The start condition stack grows dynamically and so has no built-in size limitation. If memory is exhausted, program execution aborts. + + To use start condition stacks, your scanner must include a @code{%option stack} directive (@pxref{Scanner Options}). + @@ -1998,6 +2088,8 @@ stack} directive (@pxref{Scanner Options}). Multiple Input Buffers + + Some scanners (such as those which support ``include'' files) require reading from several input streams. As flex scanners do a large amount of buffering, one cannot control where the next input will be @@ -2006,15 +2098,25 @@ the scanning context. YY_INPUT is only called when the sca reaches the end of its buffer, which may be a long time after scanning a statement such as an @code{include} statement which requires switching the input source. + + To negotiate these sorts of problems, flex provides a mechanism for creating and switching between multiple input buffers. An input buffer is created by using: + -@deftypefun YY_BUFFER_STATE yy_create_buffer ( FILE *file, int size ) -@end deftypefun + + +YY_BUFFER_STATE yy_create_buffer + FILE *file + intsize + + + + which takes a @code{FILE} pointer and a size and creates a buffer associated with the given file and large enough to hold @code{size} characters (when in doubt, use @code{YY_BUF_SIZE} for the size). It @@ -2032,76 +2134,123 @@ scanner. Note that the @code{FILE} pointer in the call to yyin, then you can safely pass a NULL @code{FILE} pointer to yy_create_buffer. 
You select a particular buffer to scan from using: + + + + +void yy_switch_to_buffer + YY_BUFFER_STATE new_buffer + + -@deftypefun void yy_switch_to_buffer ( YY_BUFFER_STATE new_buffer ) -@end deftypefun -The above function switches the scanner's input buffer so subsequent tokens +The above function switches the scanner's input buffer so subsequent tokens will come from @code{new_buffer}. Note that yy_switch_to_buffer may be used by yywrap to set things up for continued scanning, instead of opening a new file and pointing yyin at it. If you are looking for a stack of input buffers, then you want to use yypush_buffer_state instead of this function. Note also that switching input sources via either -yy_switch_to_buffer or yywrap does @emph{not} change the +yy_switch_to_buffer or yywrap does not change the start condition. + -@deftypefun void yy_delete_buffer ( YY_BUFFER_STATE buffer ) -@end deftypefun + + +void yy_delete_buffer + YY_BUFFER_STATE buffer + + + + is used to reclaim the storage associated with a buffer. (@code{buffer} can be NULL, in which case the routine does nothing.) You can also clear the current contents of a buffer using: + -@deftypefun void yypush_buffer_state ( YY_BUFFER_STATE buffer ) -@end deftypefun + + +void yypush_buffer_state + YY_BUFFER_STATE buffer + + + + This function pushes the new buffer state onto an internal stack. The pushed state becomes the new current state. The stack is maintained by flex and will grow as required. This function is intended to be used instead of yy_switch_to_buffer, when you want to change states, but preserve the -current state for later use. +current state for later use. + -@deftypefun void yypop_buffer_state ( ) -@end deftypefun + + +void yypop_buffer_state + + + + + This function removes the current state from the top of the stack, and deletes it by calling yy_delete_buffer. The next state on the stack, if any, becomes the new current state. 
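The push and pop semantics described above can be modeled with a small linked stack. The sketch below is a toy model only: `struct toy_buffer`, `struct toy_stack`, `toy_push` and `toy_pop` are hypothetical names and do not reflect how flex implements its internal buffer stack. Doing nothing when an empty stack is popped is likewise an assumption.

```c
#include <stdlib.h>

/* Hypothetical stand-in for a flex buffer; a real scanner would hold a
 * YY_BUFFER_STATE here. */
struct toy_buffer {
    const char *name;
    struct toy_buffer *below;   /* next state down the stack */
};

/* The top of the stack is the current state; NULL means the stack is empty. */
struct toy_stack {
    struct toy_buffer *current;
    size_t deleted;             /* how many states have been popped and freed */
};

/* Modeled on yypush_buffer_state: the pushed state becomes the current one,
 * and the previous current state is kept underneath for later use.
 * Returns 0 on success, -1 if memory is exhausted. */
int toy_push(struct toy_stack *s, const char *name)
{
    struct toy_buffer *b = malloc(sizeof *b);
    if (b == NULL)
        return -1;
    b->name = name;
    b->below = s->current;
    s->current = b;
    return 0;
}

/* Modeled on yypop_buffer_state: remove and delete the current state; the
 * next state on the stack, if any, becomes the new current state. */
void toy_pop(struct toy_stack *s)
{
    struct toy_buffer *top = s->current;
    if (top == NULL)
        return;                 /* assumption: popping an empty stack is a no-op */
    s->current = top->below;
    free(top);
    s->deleted++;
}
```

The model shows why the stack interface suits nested include files: returning to the enclosing file is a single pop, with no bookkeeping in user code.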
+ -@deftypefun void yy_flush_buffer ( YY_BUFFER_STATE buffer ) -@end deftypefun + + +void yy_flush_buffer + YY_BUFFER_STATE buffer + + + + This function discards the buffer's contents, so the next time the scanner attempts to match a token from the buffer, it will first fill the buffer anew using YY_INPUT. + + @deftypefun YY_BUFFER_STATE yy_new_buffer ( FILE *file, int size ) @end deftypefun + + is an alias for yy_create_buffer, provided for compatibility with the C++ use of @code{new} and @code{delete} for creating and destroying dynamic objects. + + + @code{YY_CURRENT_BUFFER} macro returns a @code{YY_BUFFER_STATE} handle to the current buffer. It should not be used as an lvalue. + + + Here are two examples of using these features for writing a scanner which expands include files (the @code{<<EOF>>} feature is discussed below). + + This first example uses yypush_buffer_state and yypop_buffer_state. Flex maintains the stack internally. + @@ -2141,8 +2290,10 @@ maintains the stack internally. + The second example, below, does the same thing as the previous example did, but manages its own input buffer stack manually (instead of letting flex do it). + @@ -2214,28 +2365,36 @@ input buffer for scanning the string, and return a corresponding new buffer using yy_switch_to_buffer, so the next call to yylex will start scanning the string. -@deftypefun YY_BUFFER_STATE yy_scan_string ( const char *str ) + + +YY_BUFFER_STATE yy_scan_string + const char *str + + scans a NUL-terminated string. -@end deftypefun @deftypefun YY_BUFFER_STATE yy_scan_bytes ( const char *bytes, int len ) +@end deftypefun scans @code{len} bytes (including possibly @code{NUL}s) starting at location @code{bytes}. -@end deftypefun -Note that both of these functions create and scan a @emph{copy} of the +Note that both of these functions create and scan a copy of the string or bytes. (This may be desirable, since yylex modifies the contents of the buffer it is scanning.) 
You can avoid the copy by using: + + @deftypefun YY_BUFFER_STATE yy_scan_buffer (char *base, yy_size_t size) +@end deftypefun + + which scans in place the buffer starting at @code{base}, consisting of -@code{size} bytes, the last two bytes of which @emph{must} be +@code{size} bytes, the last two bytes of which must be @code{YY_END_OF_BUFFER_CHAR} (ASCII NUL). These last two bytes are not scanned; thus, scanning consists of @code{base[0]} through @code{base[size-2]}, inclusive. -@end deftypefun If you fail to set up @code{base} in this manner (i.e., forget the final two @code{YY_END_OF_BUFFER_CHAR} bytes), then yy_scan_buffer @@ -2288,7 +2447,7 @@ shown in the example above. <<EOF>> rules may not be used with other patterns; they may only be qualified with a list of start conditions. If an unqualified <<EOF>> -rule is given, it applies to @emph{all} start conditions which do not +rule is given, it applies to all start conditions which do not already have <<EOF>> actions. To specify an <<EOF>> rule for only the initial start condition, use: @@ -2552,12 +2711,12 @@ menu. If you want to lookup a particular option by name, @xref{Index of Scanner @@ -2715,7 +2874,7 @@ This option is for flex development. We document it here in case you stumble upon it by accident or in case you suspect some inconsistency in the serialized tables. Flex will serialize the scanner dfa tables but will also generate the in-code tables as it normally does. At runtime, the scanner will verify that -the serialized tables match the in-code tables, instead of loading them. +the serialized tables match the in-code tables, instead of loading them. @@ -2752,7 +2911,7 @@ not be folded). For tricky behavior, see @ref{case and character ranges}. -l, --lex-compat, @code{%option lex-compat} turns on maximum compatibility with the original & @code{lex} -implementation. Note that this does not mean @emph{full} compatibility. +implementation. Note that this does not mean full compatibility. 
Use of this option costs a considerable amount of performance, and it cannot be used with the @samp{--c++}, @samp{--full}, @samp{--fast}, @samp{-Cf}, or @samp{-CF} options. For details on the compatibilities it provides, see @@ -2771,11 +2930,11 @@ cannot be used with the @samp{--c++}, @samp{--full}, @samp{--fast}, @samp{-Cf}, -B, --batch, @code{%option batch} instructs flex to generate a @dfn{batch} scanner, the opposite of -@emph{interactive} scanners generated by @samp{--interactive} (see below). In -general, you use @samp{-B} when you are @emph{certain} that your scanner +interactive scanners generated by @samp{--interactive} (see below). In +general, you use @samp{-B} when you are certain that your scanner will never be used interactively, and you want to squeeze a -@emph{little} more performance out of it. If your goal is instead to -squeeze out a @emph{lot} more performance, you should be using the +little more performance out of it. If your goal is instead to +squeeze out a lot more performance, you should be using the @samp{-Cf} or @samp{-CF} options, which turn on @samp{--batch} automatically anyway. @@ -2798,7 +2957,7 @@ enough text to disambiguate the current token, is a bit faster than only looking ahead when necessary. But scanners that always look ahead give dreadful interactive performance; for example, when a user types a newline, it is not recognized as a newline token until they enter -@emph{another} token, which often means typing in another whole line. +another token, which often means typing in another whole line. flex scanners default to @code{interactive} unless you use the @samp{-Cf} or @samp{-CF} table-compression options @@ -2806,12 +2965,12 @@ newline, it is not recognized as a newline token until they enter high-performance you should be using one of these options, so if you didn't, flex assumes you'd rather trade off a bit of run-time performance for intuitive interactive behavior. 
Note also that you -@emph{cannot} use @samp{--interactive} in conjunction with @samp{-Cf} or +cannot use @samp{--interactive} in conjunction with @samp{-Cf} or @samp{-CF}. Thus, this option is not really needed; it is on by default for all those cases in which it is allowed. You can force a scanner to -@emph{not} +not be interactive by using @samp{--batch} @@ -2892,7 +3051,7 @@ generate the default rule. --always-interactive, @code{%option always-interactive} instructs flex to generate a scanner which always considers its input -@emph{interactive}. Normally, on each new input file the scanner calls +interactive. Normally, on each new input file the scanner calls isatty in an attempt to determine whether the scanner's input source is interactive and thus should be read a character at a time. When this option is used, however, then no such call is made. @@ -2928,11 +3087,11 @@ in behavior. At the current writing the known differences between -In POSIX and & @code{lex}, the repeat operator, @samp{@{@}}, has lower -precedence than concatenation (thus @samp{ab@{3@}} yields @samp{ababab}). +In POSIX and & @code{lex}, the repeat operator, @samp{{}}, has lower +precedence than concatenation (thus @samp{ab{3}} yields @samp{ababab}). Most POSIX utilities use an Extended Regular Expression (ERE) precedence that has the precedence of the repeat operator higher than concatenation -(which causes @samp{ab@{3@}} to yield @samp{abbb}). By default, flex +(which causes @samp{ab{3}} to yield @samp{abbb}). By default, flex places the precedence of the repeat operator higher than concatenation which matches the ERE processing of other POSIX utilities. When either @samp{--posix} or @samp{-l} are specified, flex will use the @@ -3034,7 +3193,7 @@ is generated. --ansi-prototypes, @code{%option ansi-prototypes} -instructs flex to generate ANSI C99 prototypes for functions. +instructs flex to generate ANSI C99 prototypes for functions. This option is enabled by default. 
If @code{noansi-prototypes} is specified, then prototypes will have empty parameter lists. @@ -3066,7 +3225,7 @@ is modified to take an additional parameter, --bison-locations, @code{%option bison-locations} -instruct flex that +instruct flex that @code{GNU bison} @code{%locations} are being used. This means yylex will be passed an additional parameter, yylloc. This option @@ -3215,7 +3374,7 @@ programs into the same executable. Note, though, that using this option also renames yywrap, so you now -@emph{must} +must either provide your own (appropriately-named) version of the routine for your scanner, or use @@ -3381,7 +3540,7 @@ array look-up per character scanned). -Cr, --read, @code{%option read} -causes the generated scanner to @emph{bypass} use of the standard I/O +causes the generated scanner to bypass use of the standard I/O library (@code{stdio}) for input. Instead of calling fread or getc, the scanner will use the read system call, resulting in a performance gain which varies from system to system, but @@ -3455,12 +3614,12 @@ The result is large but fast. This option is equivalent to -F, --fast, @code{%option fast} -specifies that the @emph{fast} scanner table representation should be +specifies that the fast scanner table representation should be used (and @code{stdio} bypassed). This representation is about as fast as the full table representation @samp{--full}, and for some sets of patterns will be considerably smaller (and for others, larger). In -general, if the pattern set contains both @emph{keywords} and a -catch-all, @emph{identifier} rule, such as in the set: +general, if the pattern set contains both keywords and a +catch-all, identifier rule, such as in the set: @@ -3475,7 +3634,7 @@ catch-all, @emph{identifier} rule, such as in the set: then you're better off using the full table representation. 
If only -the @emph{identifier} rule is present and you then use a hash table or some such +the identifier rule is present and you then use a hash table or some such to detect the keywords, you're better off using @samp{--fast}. @@ -3503,7 +3662,7 @@ with @samp{--c++}. Generate backing-up information to lex.backup. This is a list of scanner states which require backing up and the input characters on which they do so. By adding rules one can remove backing-up states. If -@emph{all} backing-up states are eliminated and @samp{-Cf} or @code{-CF} +all backing-up states are eliminated and @samp{-Cf} or @code{-CF} is used, the generated scanner will run faster (see the @samp{--perf-report} flag). Only users who wish to squeeze every last cycle out of their scanners need worry about this option. (@pxref{Performance}). @@ -3572,7 +3731,7 @@ the @samp{--interactive} flag entail minor performance penalties. -s, --nodefault, @code{%option nodefault} -causes the @emph{default rule} (that unmatched scanner input is echoed +causes the default rule (that unmatched scanner input is echoed to stdout) to be suppressed. If the scanner encounters input that does not match any of its rules, it aborts with an error. This option is useful for finding holes in a scanner's rule set. @@ -3733,7 +3892,7 @@ you scanned, use ss. important. It is a particularly expensive option. There is one case when @code{%option yylineno} can be expensive. That is when -your patterns match long tokens that could @emph{possibly} contain a newline +your patterns match long tokens that could possibly contain a newline character. There is no performance penalty for rules that can not possibly match newlines, since flex does not need to check them for newlines. In general, you should avoid rules such as @code{[^f]+}, which match very long @@ -3869,10 +4028,10 @@ accidentally match a valid token. A possible future flexevery instance of backing up. Leaving just one means you gain nothing. 
-@emph{Variable} trailing context (where both the leading and trailing +Variable trailing context (where both the leading and trailing parts do not have a fixed length) entails almost the same performance loss as @code{REJECT} (i.e., substantial). So when possible a rule like: @@ -3911,7 +4070,7 @@ or as -Note that here the special '|' action does @emph{not} provide any +Note that here the special '|' action does not provide any savings, and can even make things worse (@pxref{Limitations}). Another area where the user can increase a scanner's performance (and @@ -3962,8 +4121,8 @@ This could be sped up by writing it as: Now instead of each newline requiring the processing of another action, recognizing the newlines is distributed over the other rules to keep the -matched text as long as possible. Note that @emph{adding} rules does -@emph{not} slow down the scanner! The speed of the scanner is +matched text as long as possible. Note that adding rules does +not slow down the scanner! The speed of the scanner is independent of the number of rules or (modulo the considerations given at the beginning of this section) how complicated the rules are with regard to operators such as @samp{*} and @samp{|}. @@ -4034,7 +4193,7 @@ recognition of newlines with that of the other tokens: One has to be careful here, as we have now reintroduced backing up into the scanner. In particular, while -@emph{we} +we know that there will never be any characters in the input stream other than letters or newlines, flex @@ -4071,7 +4230,7 @@ Compiled with @samp{-Cf}, this is about as fast as one can get a A final note: flex is slow when matching @code{NUL}s, particularly when a token contains multiple @code{NUL}s. It's best to -write rules which match @emph{short} amounts of text if it's anticipated +write rules which match short amounts of text if it's anticipated that the text will often include @code{NUL}s. 
Another final note regarding performance: as mentioned in @@ -4089,7 +4248,7 @@ characters per token. -@strong{IMPORTANT}: the present form of the scanning class is @emph{experimental} +IMPORTANT: the present form of the scanning class is experimental and may change considerably between major releases. @@ -4102,7 +4261,7 @@ not encounter any compilation errors (@pxref{Reporting Bugs}). You can then use C++ code in your rule actions instead of C code. Note that the default input source for your scanner remains yyin, and default echoing is still done to yyout. Both of these remain @code{FILE -*} variables and not C++ @emph{streams}. +*} variables and not C++ streams. You can also use flex to generate a C++ scanner class, using the @samp{-+} option (or, equivalently, @code{%option c++)}, which is @@ -4266,7 +4425,7 @@ writes the message to the stream @code{cerr} and exits. -Note that a @code{yyFlexLexer} object contains its @emph{entire} +Note that a @code{yyFlexLexer} object contains its entire scanning state. Thus you can use such objects to create reentrant scanners, but see also @ref{Reentrant}. You can instantiate multiple instances of the same @code{yyFlexLexer} class, and you can also combine @@ -4384,11 +4543,11 @@ multi-threaded applications. Any thread may create and execute a reentrant @@ -4540,13 +4699,13 @@ Here are the things you need to do or know to use the reentrant C API of @@ -4562,7 +4721,7 @@ Notice that @code{%option reentrant} is specified in the above example (@pxref{Reentrant Example}. Had this option not been specified, flex would have happily generated a non-reentrant scanner without complaining. You may explicitly specify @code{%option noreentrant}, if -you do @emph{not} want a reentrant scanner, although it is not +you do not want a reentrant scanner, although it is not necessary. The default is to generate a non-reentrant scanner. @@ -4971,7 +5130,7 @@ input. 
-flex is a rewrite of the & Unix @emph{lex} tool (the two +flex is a rewrite of the & Unix lex tool (the two implementations do not share any code, though), with some extensions and incompatibilities, both of which are of concern to those who wish to write scanners acceptable to both implementations. flex is fully @@ -5069,7 +5228,7 @@ isn't a problem with an interactive scanner. @xref{Reentrant}, for Also note that flex C++ scanner classes -@emph{are} +are reentrant, so if using C++ is an option for you, you should use them instead. @xref{Cxx}, and @ref{Reentrant} for details. @@ -5118,7 +5277,7 @@ and so the string @samp{foo} will match. Note that if the definition begins with @samp{^} or ends with @samp{$} -then it is @emph{not} expanded with parentheses, to allow these +then it is not expanded with parentheses, to allow these operators to appear in definitions without losing their special meanings. But the @samp{<s>}, @samp{/}, and @code{<<EOF>>} operators cannot be used in a flex definition. @@ -5162,7 +5321,7 @@ supported. It is not part of the POSIX specification. -After a call to unput, @emph{yytext} is undefined until the +After a call to unput, yytext is undefined until the next token is matched, unless the scanner was built using @code{%array}. This is not the case with @code{lex} or the POSIX specification. The @samp{-l} option does away with this incompatibility. @@ -5170,9 +5329,9 @@ This is not the case with @code{lex} or the POSIX specification. The -The precedence of the @samp{@{,@}} (numeric range) operator is +The precedence of the @samp{{,}} (numeric range) operator is different. The & and POSIX specifications of @code{lex} -interpret @samp{abc@{1,3@}} as match one, two, +interpret @samp{abc{1,3}} as match one, two, or three occurrences of @samp{abc}'', whereas flex interprets it as ``match @samp{ab} followed by one, two, or three occurrences of @samp{c}''. 
The @samp{-l} and @samp{--posix} options do away with this @@ -5282,7 +5441,7 @@ YY_USER_INIT -%@{@}'s around actions +%{}'s around actions @@ -5337,9 +5496,9 @@ override the default behavior. @@ -5355,7 +5514,7 @@ buffer. As of version 2.5.9 Flex will clean up all memory when you call @@ -5402,7 +5561,7 @@ is about 40 bytes, plus an additional large character buffer (described above.) The initial buffer state is created during initialization, and with each call to yy_create_buffer(). You can't tune the size of this, but you can tune the character buffer as described above. Any buffer state that you explicitly -create by calling yy_create_buffer() is @emph{NOT} destroyed automatically. You +create by calling yy_create_buffer() is NOT destroyed automatically. You must call yy_delete_buffer() to free the memory. The exception to this rule is that flex will delete the current buffer automatically when you call yylex_destroy(). If you delete the current buffer, be sure to set it to NULL. @@ -5524,7 +5683,7 @@ void * yyrealloc (void * ptr, size_t bytes, void* yyscanner) { return allocator_realloc (yyextra, bytes); } -void yyfree (void * ptr, void * yyscanner) { +void yyfree (void * ptr, void * yyscanner) { /* Do nothing -- we leave it to the garbage collector. */ } @@ -5542,7 +5701,7 @@ void yyfree (void * ptr, void * yyscanner) { When flex finds a match, yytext points to the first character of the match in the input buffer. The string itself is part of the input buffer, and -is @emph{NOT} allocated separately. The value of yytext will be overwritten the next +is NOT allocated separately. The value of yytext will be overwritten the next time yylex() is called. In short, the value of yytext is only valid from within the matched rule's action. @@ -5579,9 +5738,9 @@ scanning begins. The tables may be discarded when scanning is finished. @@ -5605,9 +5764,9 @@ or These options instruct flex to save the DFA tables to the file @var{FILE}. 
The tables -will @emph{not} be embedded in the generated scanner. The scanner will not +will not be embedded in the generated scanner. The scanner will not function on its own. The scanner will be dependent upon the serialized tables. You must -load the tables from this file at runtime before you can scan anything. +load the tables from this file at runtime before you can scan anything. If you do not specify a filename to @code{--tables-file}, the tables will be saved to lex.yy.tables, where @samp{yy} is the appropriate prefix. @@ -5656,7 +5815,7 @@ only appears in the reentrant scanner. This function returns @samp{0} (zero) on success, or non-zero on error. @end deftypefun -The loaded tables are @strong{not} automatically destroyed (unloaded) when you +The loaded tables are not automatically destroyed (unloaded) when you call yylex_destroy. The reason is that you may create several scanners of the same type (in a reentrant scanner), each of which needs access to these tables. To avoid a nasty memory leak, you must call the following function: @@ -5668,7 +5827,7 @@ scanner. This function returns @samp{0} (zero) on success, or non-zero on error. @end deftypefun -@strong{The functions yytables_fload and yytables_destroy are not thread-safe.} You must ensure that these functions are called exactly once (for +The functions yytables_fload and yytables_destroy are not thread-safe. You must ensure that these functions are called exactly once (for each scanner type) in a threaded program, before any thread calls yylex. After the tables are loaded, they are never written to, and no thread protection is required thereafter -- until you destroy them. @@ -5731,7 +5890,7 @@ and tables sections are padded to 64-bit boundaries. Below we describe each field in detail. This format does not specify how the scanner will expand the given data, i.e., data may be serialized as int8, but expanded to an int32 array at runtime. 
This is to reduce the size of the serialized data where -possible. Remember, @emph{all integer values are in network byte order}. +possible. Remember, all integer values are in network byte order. @noindent Fields of a table header: @@ -6109,11 +6268,11 @@ matches the 'x' at the beginning of the trailing context. (Note that the POSIX draft states that the text matched by such patterns is undefined.) For some trailing context rules, parts which are actually fixed-length are not recognized as such, leading to the abovementioned -performance loss. In particular, parts using @samp{|} or @samp{@{n@}} -(such as @samp{foo@{3@}}) are always considered variable-length. +performance loss. In particular, parts using @samp{|} or @samp{{n}} +(such as @samp{foo{3}}) are always considered variable-length. Combining trailing context with the special @samp{|} action can result -in @emph{fixed} trailing context being turned into the more expensive -@emph{variable} trailing context. For example, in the following: +in fixed trailing context being turned into the more expensive +variable trailing context. For example, in the following: @@ -6169,7 +6328,7 @@ You may wish to read more about the following programs: The following books may contain material of interest: John Levine, Tony Mason, and Doug Brown, -@emph{Lex & Yacc}, +Lex & Yacc, O'Reilly and Associates. Be sure to get the 2nd edition. M. E. Lesk and E. Schmidt, @@ -6191,105 +6350,105 @@ publish them here. @@ -6355,9 +6514,9 @@ No. You cannot have recursive definitions. The pattern-matching power of regular expressions in general (and therefore flex scanners, too) is limited. In particular, regular expressions cannot ``balance'' parentheses to an arbitrary degree. For example, it's impossible to write a regular -expression that matches all strings containing the same number of '@{'s -as '@}'s. For more powerful pattern matching, you need a parser, such -as @cite{GNU bison}. 
+expression that matches all strings containing the same number of '{'s +as '}'s. For more powerful pattern matching, you need a parser, such +as GNU bison. @@ -6379,7 +6538,7 @@ simultaneously, in parallel. (Seems impossible, but it's actually a fairly simple technique once you understand the principles.) A side-effect of this parallel matching is that when the input matches more -than one rule, flex scanners pick the rule that matched the @emph{most} text. This +than one rule, flex scanners pick the rule that matched the most text. This is explained further in the manual, in the section @xref{Matching}. If you want flex to choose a shorter match, then you can work around this @@ -6409,7 +6568,7 @@ also not have the option of changing the input language.)
My actions are executing out of order or sometimes not at all. -Most likely, you have (in error) placed the opening @samp{@{} of the action +Most likely, you have (in error) placed the opening @samp{{} of the action block on a different line than the rule, e.g., @@ -6423,7 +6582,7 @@ block on a different line than the rule, e.g., -flex requires that the opening @samp{@{} of an action associated with a rule +flex requires that the opening @samp{{} of an action associated with a rule begin on the same line as does the rule. You need instead to write your rules as follows: @@ -6662,29 +6821,43 @@ Here are some tips for using @samp{.}: + A common mistake is to place the grouping parenthesis AFTER an operator, when you really meant to place the parenthesis BEFORE the operator, e.g., you probably want this @code{(foo|bar)+} and NOT this @code{(foo|bar+)}. + + The first pattern matches the words @samp{foo} or @samp{bar} any number of times, e.g., it matches the text @samp{barfoofoobarfoo}. The second pattern matches a single instance of @code{foo} or a single instance of @code{bar} followed by one or more @samp{r}s, e.g., it matches the text @code{barrrr} . + + + A @samp{.} inside @samp{[]}'s just means a literal@samp{.} (period), and NOT ``any character except newline''. + + + Remember that @samp{.} matches any character EXCEPT @samp{\n} (and @samp{EOF}). If you really want to match ANY character, including newlines, then use @code{(.|\n)} Beware that the regex @code{(.|\n)+} will match your entire input! + + + Finally, if you want to match a literal @samp{.} (a period), then use @samp{[.]} or @samp{"."} + + @@ -6705,10 +6878,12 @@ number of formats.
Does there exist a "faster" NDFA->DFA algorithm? + There's no way around the potential exponential running time - it can take you exponential time just to enumerate all of the DFA states. In practice, though, the running time is closer to linear, or sometimes quadratic. +
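Editor's illustration (not part of the commit): the grouping and `[.]` tips in the regex-pitfalls hunk above can be checked with a short C sketch. It assumes POSIX `regcomp`/`regexec` with `REG_EXTENDED` as a stand-in for flex's own matcher, which is not invoked here; the two agree on the patterns shown, but they differ elsewhere (for instance in how `.` treats newlines), so the sketch deliberately avoids those cases. `full_match` is a hypothetical helper name.

```c
#include <assert.h>
#include <regex.h>
#include <stdio.h>

/*
 * Hypothetical helper: returns 1 if the whole of `text` matches the
 * POSIX extended regular expression `pattern`, 0 otherwise.
 * The pattern is wrapped as ^(pattern)$ so that partial matches do not count.
 */
static int full_match(const char *pattern, const char *text)
{
    char anchored[128];
    regex_t re;
    int n, matched;

    n = snprintf(anchored, sizeof anchored, "^(%s)$", pattern);
    assert(n > 0 && (size_t)n < sizeof anchored); /* pattern must fit the buffer */

    if (regcomp(&re, anchored, REG_EXTENDED | REG_NOSUB) != 0)
        return 0; /* treat a compile failure as "no match" in this sketch */

    matched = (regexec(&re, text, 0, NULL, 0) == 0);
    regfree(&re);
    return matched;
}
```

Under these assumptions, `full_match("(foo|bar)+", "barfoofoobarfoo")` succeeds, while `(foo|bar+)` accepts `barrrr` but rejects `barfoo`, and `[.]` accepts only a literal period.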
@@ -6721,18 +6896,24 @@ There are two big speed wins that flex uses: + It analyzes the input rules to construct equivalence classes for those characters that always make the same transitions. It then rewrites the NFA using equivalence classes for transitions instead of characters. This cuts down the NFA->DFA computation time dramatically, to the point where, for uncompressed DFA tables, the DFA generation is often I/O bound in writing out the tables. + + + It maintains hash values for previously computed DFA states, so testing whether a newly constructed DFA state is equivalent to a previously constructed state can be done very quickly, by first comparing hash values. + + @@ -6742,9 +6923,11 @@ state can be done very quickly, by first comparing hash values.
How can I use more than 8192 rules? + flex is compiled with an upper limit of 8192 rules per scanner. If you need more than 8192 rules in your scanner, you'll have to recompile flex with the following changes in flexdef.h: + @@ -6758,13 +6941,19 @@ with the following changes in flexdef.h: + This should work okay as long as your C compiler uses 32 bit integers. But you might want to think about whether using such a huge number of rules is the best way to solve your problem. + + The following may also be relevant: + + With luck, you should be able to increase the definitions in flexdef.h for: + @@ -6776,10 +6965,12 @@ With luck, you should be able to increase the definitions in flexdef.h for: + recompile everything, and it'll all work. Flex only has these 16-bit-like values built into it because a long time ago it was developed on a machine with 16-bit ints. I've given this advice to others in the past but haven't heard back from them whether it worked okay or not... +
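Editor's illustration (not part of the commit): the equivalence-class idea from the speed-wins answer above can be sketched in C. This is a simplified approximation, not flex's actual implementation: it assumes each rule contributes a plain character set, and it treats two byte values as equivalent exactly when they belong to the same sets. The names `build_equiv_classes` and `demo_classes` are hypothetical, and the demo assumes an ASCII-compatible character set.

```c
#include <assert.h>
#include <string.h>

#define NCHARS 256

/*
 * Hypothetical helper (not flex's real code): given `nsets` character sets,
 * each a 256-entry membership table, store a class id for every byte value
 * in ec[]. Two bytes share an id exactly when they belong to the same sets.
 * Returns the number of classes found.
 */
static int build_equiv_classes(int nsets, unsigned char member[][NCHARS],
                               int ec[NCHARS])
{
    int rep[NCHARS];   /* rep[k] is the first byte assigned to class k */
    int nclasses = 0;

    assert(nsets >= 0);
    for (int c = 0; c < NCHARS; c++) {
        int k;
        for (k = 0; k < nclasses; k++) {
            int s;
            for (s = 0; s < nsets; s++)
                if (member[s][c] != member[s][rep[k]])
                    break;
            if (s == nsets)
                break;           /* c behaves like class k's representative */
        }
        if (k == nclasses)
            rep[nclasses++] = c; /* no existing class fits: open a new one */
        ec[c] = k;
    }
    return nclasses;
}

/* Classify bytes for the toy character sets [a-z], [0-9] and newline. */
static int demo_classes(int ec[NCHARS])
{
    unsigned char member[3][NCHARS];

    memset(member, 0, sizeof member);
    for (int c = 'a'; c <= 'z'; c++) member[0][c] = 1;
    for (int c = '0'; c <= '9'; c++) member[1][c] = 1;
    member[2]['\n'] = 1;
    return build_equiv_classes(3, member, ec);
}
```

In this toy setup the 256 byte values collapse into four classes (letters, digits, newline, everything else), so a transition table indexed by class id needs four columns per state instead of 256.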
@@ -7114,7 +7305,7 @@ How do I skip as many chars as possible -- without interfering with the other patterns? In the example below, we want to skip over characters until we see the phrase -"endskip". The following will @emph{NOT} work correctly (do you see why not?) +"endskip". The following will NOT work correctly (do you see why not?) @@ -9547,7 +9738,7 @@ code such as @code{x[y[z]]}. m4 is only required at the time you run flex. The generated -scanner is ordinary C or C++, and does @emph{not} require m4. +scanner is ordinary C or C++, and does not require m4. @@ -9556,12 +9747,12 @@ scanner is ordinary C or C++, and does @emph{not} require m4Indices @menu -* Concept Index:: -* Index of Functions and Macros:: -* Index of Variables:: -* Index of Data Types:: -* Index of Hooks:: -* Index of Scanner Options:: +* Concept Index:: +* Index of Functions and Macros:: +* Index of Variables:: +* Index of Data Types:: +* Index of Hooks:: +* Index of Scanner Options:: @end menu