diff options
Diffstat (limited to 'MISC/texinfo')
-rw-r--r-- | MISC/texinfo/flex.info | 2951 | ||||
-rw-r--r-- | MISC/texinfo/flex.texi | 3448 |
2 files changed, 6399 insertions, 0 deletions
diff --git a/MISC/texinfo/flex.info b/MISC/texinfo/flex.info new file mode 100644 index 0000000..9269418 --- /dev/null +++ b/MISC/texinfo/flex.info @@ -0,0 +1,2951 @@ +This is Info file flex.info, produced by Makeinfo-1.55 from the input +file flex.texi. + +START-INFO-DIR-ENTRY +* Flex: (flex). A fast scanner generator. +END-INFO-DIR-ENTRY + + This file documents Flex. + + Copyright (c) 1990 The Regents of the University of California. All +rights reserved. + + This code is derived from software contributed to Berkeley by Vern +Paxson. + + The United States Government has rights in this work pursuant to +contract no. DE-AC03-76SF00098 between the United States Department of +Energy and the University of California. + + Redistribution and use in source and binary forms with or without +modification are permitted provided that: (1) source distributions +retain this entire copyright notice and comment, and (2) distributions +including binaries display the following acknowledgement: "This +product includes software developed by the University of California, +Berkeley and its contributors" in the documentation or other materials +provided with the distribution and in all advertising materials +mentioning features or use of this software. Neither the name of the +University nor the names of its contributors may be used to endorse or +promote products derived from this software without specific prior +written permission. + + THIS SOFTWARE IS PROVIDED "AS IS" AND WITHOUT ANY EXPRESS OR IMPLIED +WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED WARRANTIES OF +MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. + + +File: flex.info, Node: Top, Next: Name, Prev: (dir), Up: (dir) + +flex +**** + + This manual documents `flex'. It covers release 2.5. + +* Menu: + +* Name:: Name +* Synopsis:: Synopsis +* Overview:: Overview +* Description:: Description +* Examples:: Some simple examples +* Format:: Format of the input file +* Patterns:: Patterns +* Matching:: How the input is matched +* Actions:: Actions +* Generated scanner:: The generated scanner +* Start conditions:: Start conditions +* Multiple buffers:: Multiple input buffers +* End-of-file rules:: End-of-file rules +* Miscellaneous:: Miscellaneous macros +* User variables:: Values available to the user +* YACC interface:: Interfacing with `yacc' +* Options:: Options +* Performance:: Performance considerations +* C++:: Generating C++ scanners +* Incompatibilities:: Incompatibilities with `lex' and POSIX +* Diagnostics:: Diagnostics +* Files:: Files +* Deficiencies:: Deficiencies / Bugs +* See also:: See also +* Author:: Author + + +File: flex.info, Node: Name, Next: Synopsis, Prev: Top, Up: Top + +Name +==== + + flex - fast lexical analyzer generator + + +File: flex.info, Node: Synopsis, Next: Overview, Prev: Name, Up: Top + +Synopsis +======== + + flex [-bcdfhilnpstvwBFILTV78+? -C[aefFmr] -ooutput -Pprefix -Sskeleton] + [--help --version] [FILENAME ...] + + +File: flex.info, Node: Overview, Next: Description, Prev: Synopsis, Up: Top + +Overview +======== + + This manual describes `flex', a tool for generating programs that +perform pattern-matching on text. The manual includes both tutorial +and reference sections: + +Description + a brief overview of the tool + +Some Simple Examples +Format Of The Input File +Patterns + the extended regular expressions used by flex + +How The Input Is Matched + the rules for determining what has been matched + +Actions + how to specify what to do when a pattern is matched + +The Generated Scanner + details regarding the scanner that flex produces; how to control + the input source + +Start Conditions + introducing context into your scanners, and managing + "mini-scanners" + +Multiple Input Buffers + how to manipulate multiple input sources; how to scan from strings + instead of files + +End-of-file Rules + special rules for matching the end of the input + +Miscellaneous Macros + a summary of macros available to the actions + +Values Available To The User + a summary of values available to the actions + +Interfacing With Yacc + connecting flex scanners together with yacc parsers + +Options + flex command-line options, and the "%option" directive + +Performance Considerations + how to make your scanner go as fast as possible + +Generating C++ Scanners + the (experimental) facility for generating C++ scanner classes + +Incompatibilities With Lex And POSIX + how flex differs from AT&T lex and the POSIX lex standard + +Diagnostics + those error messages produced by flex (or scanners it generates) + whose meanings might not be apparent + +Files + files used by flex + +Deficiencies / Bugs + known problems with flex + +See Also + other documentation, related tools + +Author + includes contact information + + +File: flex.info, Node: Description, Next: Examples, Prev: Overview, Up: Top + +Description +=========== + + `flex' is a tool for generating "scanners": programs which +recognized lexical patterns in text. `flex' reads the given input +files, or its standard input if no file names are given, for a +description of a scanner to generate. The description is in the form +of pairs of regular expressions and C code, called "rules". `flex' +generates as output a C source file, `lex.yy.c', which defines a +routine `yylex()'. This file is compiled and linked with the `-lfl' +library to produce an executable. When the executable is run, it +analyzes its input for occurrences of the regular expressions. +Whenever it finds one, it executes the corresponding C code. + + +File: flex.info, Node: Examples, Next: Format, Prev: Description, Up: Top + +Some simple examples +==================== + + First some simple examples to get the flavor of how one uses `flex'. +The following `flex' input specifies a scanner which whenever it +encounters the string "username" will replace it with the user's login +name: + + %% + username printf( "%s", getlogin() ); + + By default, any text not matched by a `flex' scanner is copied to +the output, so the net effect of this scanner is to copy its input file +to its output with each occurrence of "username" expanded. In this +input, there is just one rule. "username" is the PATTERN and the +"printf" is the ACTION. The "%%" marks the beginning of the rules. + + Here's another simple example: + + int num_lines = 0, num_chars = 0; + + %% + \n ++num_lines; ++num_chars; + . ++num_chars; + + %% + main() + { + yylex(); + printf( "# of lines = %d, # of chars = %d\n", + num_lines, num_chars ); + } + + This scanner counts the number of characters and the number of lines +in its input (it produces no output other than the final report on the +counts). The first line declares two globals, "num_lines" and +"num_chars", which are accessible both inside `yylex()' and in the +`main()' routine declared after the second "%%". There are two rules, +one which matches a newline ("\n") and increments both the line count +and the character count, and one which matches any character other than +a newline (indicated by the "." regular expression). + + A somewhat more complicated example: + + /* scanner for a toy Pascal-like language */ + + %{ + /* need this for the call to atof() below */ + #include <math.h> + %} + + DIGIT [0-9] + ID [a-z][a-z0-9]* + + %% + + {DIGIT}+ { + printf( "An integer: %s (%d)\n", yytext, + atoi( yytext ) ); + } + + {DIGIT}+"."{DIGIT}* { + printf( "A float: %s (%g)\n", yytext, + atof( yytext ) ); + } + + if|then|begin|end|procedure|function { + printf( "A keyword: %s\n", yytext ); + } + + {ID} printf( "An identifier: %s\n", yytext ); + + "+"|"-"|"*"|"/" printf( "An operator: %s\n", yytext ); + + "{"[^}\n]*"}" /* eat up one-line comments */ + + [ \t\n]+ /* eat up whitespace */ + + . printf( "Unrecognized character: %s\n", yytext ); + + %% + + main( argc, argv ) + int argc; + char **argv; + { + ++argv, --argc; /* skip over program name */ + if ( argc > 0 ) + yyin = fopen( argv[0], "r" ); + else + yyin = stdin; + + yylex(); + } + + This is the beginnings of a simple scanner for a language like +Pascal. It identifies different types of TOKENS and reports on what it +has seen. + + The details of this example will be explained in the following +sections. + + +File: flex.info, Node: Format, Next: Patterns, Prev: Examples, Up: Top + +Format of the input file +======================== + + The `flex' input file consists of three sections, separated by a +line with just `%%' in it: + + definitions + %% + rules + %% + user code + + The "definitions" section contains declarations of simple "name" +definitions to simplify the scanner specification, and declarations of +"start conditions", which are explained in a later section. Name +definitions have the form: + + name definition + + The "name" is a word beginning with a letter or an underscore ('_') +followed by zero or more letters, digits, '_', or '-' (dash). The +definition is taken to begin at the first non-white-space character +following the name and continuing to the end of the line. The +definition can subsequently be referred to using "{name}", which will +expand to "(definition)". For example, + + DIGIT [0-9] + ID [a-z][a-z0-9]* + +defines "DIGIT" to be a regular expression which matches a single +digit, and "ID" to be a regular expression which matches a letter +followed by zero-or-more letters-or-digits. A subsequent reference to + + {DIGIT}+"."{DIGIT}* + +is identical to + + ([0-9])+"."([0-9])* + +and matches one-or-more digits followed by a '.' followed by +zero-or-more digits. + + The RULES section of the `flex' input contains a series of rules of +the form: + + pattern action + +where the pattern must be unindented and the action must begin on the +same line. + + See below for a further description of patterns and actions. + + Finally, the user code section is simply copied to `lex.yy.c' +verbatim. It is used for companion routines which call or are called +by the scanner. The presence of this section is optional; if it is +missing, the second `%%' in the input file may be skipped, too. + + In the definitions and rules sections, any *indented* text or text +enclosed in `%{' and `%}' is copied verbatim to the output (with the +`%{}''s removed). The `%{}''s must appear unindented on lines by +themselves. + + In the rules section, any indented or %{} text appearing before the +first rule may be used to declare variables which are local to the +scanning routine and (after the declarations) code which is to be +executed whenever the scanning routine is entered. Other indented or +%{} text in the rule section is still copied to the output, but its +meaning is not well-defined and it may well cause compile-time errors +(this feature is present for `POSIX' compliance; see below for other +such features). + + In the definitions section (but not in the rules section), an +unindented comment (i.e., a line beginning with "/*") is also copied +verbatim to the output up to the next "*/". + + +File: flex.info, Node: Patterns, Next: Matching, Prev: Format, Up: Top + +Patterns +======== + + The patterns in the input are written using an extended set of +regular expressions. These are: + +`x' + match the character `x' + +`.' + any character (byte) except newline + +`[xyz]' + a "character class"; in this case, the pattern matches either an + `x', a `y', or a `z' + +`[abj-oZ]' + a "character class" with a range in it; matches an `a', a `b', any + letter from `j' through `o', or a `Z' + +`[^A-Z]' + a "negated character class", i.e., any character but those in the + class. In this case, any character EXCEPT an uppercase letter. + +`[^A-Z\n]' + any character EXCEPT an uppercase letter or a newline + +`R*' + zero or more R's, where R is any regular expression + +`R+' + one or more R's + +`R?' + zero or one R's (that is, "an optional R") + +`R{2,5}' + anywhere from two to five R's + +`R{2,}' + two or more R's + +`R{4}' + exactly 4 R's + +`{NAME}' + the expansion of the "NAME" definition (see above) + +`"[xyz]\"foo"' + the literal string: `[xyz]"foo' + +`\X' + if X is an `a', `b', `f', `n', `r', `t', or `v', then the ANSI-C + interpretation of \X. Otherwise, a literal `X' (used to escape + operators such as `*') + +`\0' + a NUL character (ASCII code 0) + +`\123' + the character with octal value 123 + +`\x2a' + the character with hexadecimal value `2a' + +`(R)' + match an R; parentheses are used to override precedence (see below) + +`RS' + the regular expression R followed by the regular expression S; + called "concatenation" + +`R|S' + either an R or an S + +`R/S' + an R but only if it is followed by an S. The text matched by S is + included when determining whether this rule is the "longest + match", but is then returned to the input before the action is + executed. So the action only sees the text matched by R. This + type of pattern is called "trailing context". (There are some + combinations of `R/S' that `flex' cannot match correctly; see + notes in the Deficiencies / Bugs section below regarding + "dangerous trailing context".) + +`^R' + an R, but only at the beginning of a line (i.e., which just + starting to scan, or right after a newline has been scanned). + +`R$' + an R, but only at the end of a line (i.e., just before a newline). + Equivalent to "R/\n". + + Note that flex's notion of "newline" is exactly whatever the C + compiler used to compile flex interprets '\n' as; in particular, + on some DOS systems you must either filter out \r's in the input + yourself, or explicitly use R/\r\n for "r$". + +`<S>R' + an R, but only in start condition S (see below for discussion of + start conditions) <S1,S2,S3>R same, but in any of start conditions + S1, S2, or S3 + +`<*>R' + an R in any start condition, even an exclusive one. + +`<<EOF>>' + an end-of-file <S1,S2><<EOF>> an end-of-file when in start + condition S1 or S2 + + Note that inside of a character class, all regular expression +operators lose their special meaning except escape ('\') and the +character class operators, '-', ']', and, at the beginning of the +class, '^'. + + The regular expressions listed above are grouped according to +precedence, from highest precedence at the top to lowest at the bottom. +Those grouped together have equal precedence. For example, + + foo|bar* + +is the same as + + (foo)|(ba(r*)) + +since the '*' operator has higher precedence than concatenation, and +concatenation higher than alternation ('|'). This pattern therefore +matches *either* the string "foo" *or* the string "ba" followed by +zero-or-more r's. To match "foo" or zero-or-more "bar"'s, use: + + foo|(bar)* + +and to match zero-or-more "foo"'s-or-"bar"'s: + + (foo|bar)* + + In addition to characters and ranges of characters, character +classes can also contain character class "expressions". These are +expressions enclosed inside `[': and `:'] delimiters (which themselves +must appear between the '[' and ']' of the character class; other +elements may occur inside the character class, too). The valid +expressions are: + + [:alnum:] [:alpha:] [:blank:] + [:cntrl:] [:digit:] [:graph:] + [:lower:] [:print:] [:punct:] + [:space:] [:upper:] [:xdigit:] + + These expressions all designate a set of characters equivalent to +the corresponding standard C `isXXX' function. For example, +`[:alnum:]' designates those characters for which `isalnum()' returns +true - i.e., any alphabetic or numeric. Some systems don't provide +`isblank()', so flex defines `[:blank:]' as a blank or a tab. + + For example, the following character classes are all equivalent: + + [[:alnum:]] + [[:alpha:][:digit:] + [[:alpha:]0-9] + [a-zA-Z0-9] + + If your scanner is case-insensitive (the `-i' flag), then +`[:upper:]' and `[:lower:]' are equivalent to `[:alpha:]'. + + Some notes on patterns: + + - A negated character class such as the example "[^A-Z]" above *will + match a newline* unless "\n" (or an equivalent escape sequence) is + one of the characters explicitly present in the negated character + class (e.g., "[^A-Z\n]"). This is unlike how many other regular + expression tools treat negated character classes, but + unfortunately the inconsistency is historically entrenched. + Matching newlines means that a pattern like [^"]* can match the + entire input unless there's another quote in the input. + + - A rule can have at most one instance of trailing context (the '/' + operator or the '$' operator). The start condition, '^', and + "<<EOF>>" patterns can only occur at the beginning of a pattern, + and, as well as with '/' and '$', cannot be grouped inside + parentheses. A '^' which does not occur at the beginning of a + rule or a '$' which does not occur at the end of a rule loses its + special properties and is treated as a normal character. + + The following are illegal: + + foo/bar$ + <sc1>foo<sc2>bar + + Note that the first of these, can be written "foo/bar\n". + + The following will result in '$' or '^' being treated as a normal + character: + + foo|(bar$) + foo|^bar + + If what's wanted is a "foo" or a bar-followed-by-a-newline, the + following could be used (the special '|' action is explained + below): + + foo | + bar$ /* action goes here */ + + A similar trick will work for matching a foo or a + bar-at-the-beginning-of-a-line. + + +File: flex.info, Node: Matching, Next: Actions, Prev: Patterns, Up: Top + +How the input is matched +======================== + + When the generated scanner is run, it analyzes its input looking for +strings which match any of its patterns. If it finds more than one +match, it takes the one matching the most text (for trailing context +rules, this includes the length of the trailing part, even though it +will then be returned to the input). If it finds two or more matches +of the same length, the rule listed first in the `flex' input file is +chosen. + + Once the match is determined, the text corresponding to the match +(called the TOKEN) is made available in the global character pointer +`yytext', and its length in the global integer `yyleng'. The ACTION +corresponding to the matched pattern is then executed (a more detailed +description of actions follows), and then the remaining input is +scanned for another match. + + If no match is found, then the "default rule" is executed: the next +character in the input is considered matched and copied to the standard +output. Thus, the simplest legal `flex' input is: + + %% + + which generates a scanner that simply copies its input (one +character at a time) to its output. + + Note that `yytext' can be defined in two different ways: either as a +character *pointer* or as a character *array*. You can control which +definition `flex' uses by including one of the special directives +`%pointer' or `%array' in the first (definitions) section of your flex +input. The default is `%pointer', unless you use the `-l' lex +compatibility option, in which case `yytext' will be an array. The +advantage of using `%pointer' is substantially faster scanning and no +buffer overflow when matching very large tokens (unless you run out of +dynamic memory). The disadvantage is that you are restricted in how +your actions can modify `yytext' (see the next section), and calls to +the `unput()' function destroys the present contents of `yytext', which +can be a considerable porting headache when moving between different +`lex' versions. + + The advantage of `%array' is that you can then modify `yytext' to +your heart's content, and calls to `unput()' do not destroy `yytext' +(see below). Furthermore, existing `lex' programs sometimes access +`yytext' externally using declarations of the form: + extern char yytext[]; + This definition is erroneous when used with `%pointer', but correct +for `%array'. + + `%array' defines `yytext' to be an array of `YYLMAX' characters, +which defaults to a fairly large value. You can change the size by +simply #define'ing `YYLMAX' to a different value in the first section +of your `flex' input. As mentioned above, with `%pointer' yytext grows +dynamically to accommodate large tokens. While this means your +`%pointer' scanner can accommodate very large tokens (such as matching +entire blocks of comments), bear in mind that each time the scanner +must resize `yytext' it also must rescan the entire token from the +beginning, so matching such tokens can prove slow. `yytext' presently +does *not* dynamically grow if a call to `unput()' results in too much +text being pushed back; instead, a run-time error results. + + Also note that you cannot use `%array' with C++ scanner classes (the +`c++' option; see below). + + +File: flex.info, Node: Actions, Next: Generated scanner, Prev: Matching, Up: Top + +Actions +======= + + Each pattern in a rule has a corresponding action, which can be any +arbitrary C statement. The pattern ends at the first non-escaped +whitespace character; the remainder of the line is its action. If the +action is empty, then when the pattern is matched the input token is +simply discarded. For example, here is the specification for a program +which deletes all occurrences of "zap me" from its input: + + %% + "zap me" + + (It will copy all other characters in the input to the output since +they will be matched by the default rule.) + + Here is a program which compresses multiple blanks and tabs down to +a single blank, and throws away whitespace found at the end of a line: + + %% + [ \t]+ putchar( ' ' ); + [ \t]+$ /* ignore this token */ + + If the action contains a '{', then the action spans till the +balancing '}' is found, and the action may cross multiple lines. +`flex' knows about C strings and comments and won't be fooled by braces +found within them, but also allows actions to begin with `%{' and will +consider the action to be all the text up to the next `%}' (regardless +of ordinary braces inside the action). + + An action consisting solely of a vertical bar ('|') means "same as +the action for the next rule." See below for an illustration. + + Actions can include arbitrary C code, including `return' statements +to return a value to whatever routine called `yylex()'. Each time +`yylex()' is called it continues processing tokens from where it last +left off until it either reaches the end of the file or executes a +return. + + Actions are free to modify `yytext' except for lengthening it +(adding characters to its end-these will overwrite later characters in +the input stream). This however does not apply when using `%array' +(see above); in that case, `yytext' may be freely modified in any way. + + Actions are free to modify `yyleng' except they should not do so if +the action also includes use of `yymore()' (see below). + + There are a number of special directives which can be included +within an action: + + - `ECHO' copies yytext to the scanner's output. + + - `BEGIN' followed by the name of a start condition places the + scanner in the corresponding start condition (see below). + + - `REJECT' directs the scanner to proceed on to the "second best" + rule which matched the input (or a prefix of the input). The rule + is chosen as described above in "How the Input is Matched", and + `yytext' and `yyleng' set up appropriately. It may either be one + which matched as much text as the originally chosen rule but came + later in the `flex' input file, or one which matched less text. + For example, the following will both count the words in the input + and call the routine special() whenever "frob" is seen: + + int word_count = 0; + %% + + frob special(); REJECT; + [^ \t\n]+ ++word_count; + + Without the `REJECT', any "frob"'s in the input would not be + counted as words, since the scanner normally executes only one + action per token. Multiple `REJECT's' are allowed, each one + finding the next best choice to the currently active rule. For + example, when the following scanner scans the token "abcd", it + will write "abcdabcaba" to the output: + + %% + a | + ab | + abc | + abcd ECHO; REJECT; + .|\n /* eat up any unmatched character */ + + (The first three rules share the fourth's action since they use + the special '|' action.) `REJECT' is a particularly expensive + feature in terms of scanner performance; if it is used in *any* of + the scanner's actions it will slow down *all* of the scanner's + matching. Furthermore, `REJECT' cannot be used with the `-Cf' or + `-CF' options (see below). + + Note also that unlike the other special actions, `REJECT' is a + *branch*; code immediately following it in the action will *not* + be executed. + + - `yymore()' tells the scanner that the next time it matches a rule, + the corresponding token should be *appended* onto the current + value of `yytext' rather than replacing it. For example, given + the input "mega-kludge" the following will write + "mega-mega-kludge" to the output: + + %% + mega- ECHO; yymore(); + kludge ECHO; + + First "mega-" is matched and echoed to the output. Then "kludge" + is matched, but the previous "mega-" is still hanging around at + the beginning of `yytext' so the `ECHO' for the "kludge" rule will + actually write "mega-kludge". + + Two notes regarding use of `yymore()'. First, `yymore()' depends on +the value of `yyleng' correctly reflecting the size of the current +token, so you must not modify `yyleng' if you are using `yymore()'. +Second, the presence of `yymore()' in the scanner's action entails a +minor performance penalty in the scanner's matching speed. + + - `yyless(n)' returns all but the first N characters of the current + token back to the input stream, where they will be rescanned when + the scanner looks for the next match. `yytext' and `yyleng' are + adjusted appropriately (e.g., `yyleng' will now be equal to N ). + For example, on the input "foobar" the following will write out + "foobarbar": + + %% + foobar ECHO; yyless(3); + [a-z]+ ECHO; + + An argument of 0 to `yyless' will cause the entire current input + string to be scanned again. Unless you've changed how the scanner + will subsequently process its input (using `BEGIN', for example), + this will result in an endless loop. + + Note that `yyless' is a macro and can only be used in the flex + input file, not from other source files. + + - `unput(c)' puts the character `c' back onto the input stream. It + will be the next character scanned. The following action will + take the current token and cause it to be rescanned enclosed in + parentheses. + + { + int i; + /* Copy yytext because unput() trashes yytext */ + char *yycopy = strdup( yytext ); + unput( ')' ); + for ( i = yyleng - 1; i >= 0; --i ) + unput( yycopy[i] ); + unput( '(' ); + free( yycopy ); + } + + Note that since each `unput()' puts the given character back at + the *beginning* of the input stream, pushing back strings must be + done back-to-front. An important potential problem when using + `unput()' is that if you are using `%pointer' (the default), a + call to `unput()' *destroys* the contents of `yytext', starting + with its rightmost character and devouring one character to the + left with each call. If you need the value of yytext preserved + after a call to `unput()' (as in the above example), you must + either first copy it elsewhere, or build your scanner using + `%array' instead (see How The Input Is Matched). + + Finally, note that you cannot put back `EOF' to attempt to mark + the input stream with an end-of-file. + + - `input()' reads the next character from the input stream. For + example, the following is one way to eat up C comments: + + %% + "/*" { + register int c; + + for ( ; ; ) + { + while ( (c = input()) != '*' && + c != EOF ) + ; /* eat up text of comment */ + + if ( c == '*' ) + { + while ( (c = input()) == '*' ) + ; + if ( c == '/' ) + break; /* found the end */ + } + + if ( c == EOF ) + { + error( "EOF in comment" ); + break; + } + } + } + + (Note that if the scanner is compiled using `C++', then `input()' + is instead referred to as `yyinput()', in order to avoid a name + clash with the `C++' stream by the name of `input'.) + + - YY_FLUSH_BUFFER flushes the scanner's internal buffer so that the + next time the scanner attempts to match a token, it will first + refill the buffer using `YY_INPUT' (see The Generated Scanner, + below). This action is a special case of the more general + `yy_flush_buffer()' function, described below in the section + Multiple Input Buffers. + + - `yyterminate()' can be used in lieu of a return statement in an + action. It terminates the scanner and returns a 0 to the + scanner's caller, indicating "all done". By default, + `yyterminate()' is also called when an end-of-file is encountered. + It is a macro and may be redefined. + + +File: flex.info, Node: Generated scanner, Next: Start conditions, Prev: Actions, Up: Top + +The generated scanner +===================== + + The output of `flex' is the file `lex.yy.c', which contains the +scanning routine `yylex()', a number of tables used by it for matching +tokens, and a number of auxiliary routines and macros. By default, +`yylex()' is declared as follows: + + int yylex() + { + ... various definitions and the actions in here ... + } + + (If your environment supports function prototypes, then it will be +"int yylex( void )".) This definition may be changed by defining +the "YY_DECL" macro. For example, you could use: + + #define YY_DECL float lexscan( a, b ) float a, b; + + to give the scanning routine the name `lexscan', returning a float, +and taking two floats as arguments. Note that if you give arguments to +the scanning routine using a K&R-style/non-prototyped function +declaration, you must terminate the definition with a semi-colon (`;'). + + Whenever `yylex()' is called, it scans tokens from the global input +file `yyin' (which defaults to stdin). It continues until it either +reaches an end-of-file (at which point it returns the value 0) or one +of its actions executes a `return' statement. + + If the scanner reaches an end-of-file, subsequent calls are undefined +unless either `yyin' is pointed at a new input file (in which case +scanning continues from that file), or `yyrestart()' is called. +`yyrestart()' takes one argument, a `FILE *' pointer (which can be nil, +if you've set up `YY_INPUT' to scan from a source other than `yyin'), +and initializes `yyin' for scanning from that file. Essentially there +is no difference between just assigning `yyin' to a new input file or +using `yyrestart()' to do so; the latter is available for compatibility +with previous versions of `flex', and because it can be used to switch +input files in the middle of scanning. It can also be used to throw +away the current input buffer, by calling it with an argument of +`yyin'; but better is to use `YY_FLUSH_BUFFER' (see above). Note that +`yyrestart()' does *not* reset the start condition to `INITIAL' (see +Start Conditions, below). + + If `yylex()' stops scanning due to executing a `return' statement in +one of the actions, the scanner may then be called again and it will +resume scanning where it left off. + + By default (and for purposes of efficiency), the scanner uses +block-reads rather than simple `getc()' calls to read characters from +`yyin'. The nature of how it gets its input can be controlled by +defining the `YY_INPUT' macro. YY_INPUT's calling sequence is +"YY_INPUT(buf,result,max_size)". Its action is to place up to MAX_SIZE +characters in the character array BUF and return in the integer +variable RESULT either the number of characters read or the constant +YY_NULL (0 on Unix systems) to indicate EOF. The default YY_INPUT +reads from the global file-pointer "yyin". + + A sample definition of YY_INPUT (in the definitions section of the +input file): + + %{ + #define YY_INPUT(buf,result,max_size) \ + { \ + int c = getchar(); \ + result = (c == EOF) ? YY_NULL : (buf[0] = c, 1); \ + } + %} + + This definition will change the input processing to occur one +character at a time. + + When the scanner receives an end-of-file indication from YY_INPUT, +it then checks the `yywrap()' function. If `yywrap()' returns false +(zero), then it is assumed that the function has gone ahead and set up +`yyin' to point to another input file, and scanning continues. If it +returns true (non-zero), then the scanner terminates, returning 0 to +its caller. Note that in either case, the start condition remains +unchanged; it does *not* revert to `INITIAL'. + + If you do not supply your own version of `yywrap()', then you must +either use `%option noyywrap' (in which case the scanner behaves as +though `yywrap()' returned 1), or you must link with `-lfl' to obtain +the default version of the routine, which always returns 1. + + Three routines are available for scanning from in-memory buffers +rather than files: `yy_scan_string()', `yy_scan_bytes()', and +`yy_scan_buffer()'. See the discussion of them below in the section +Multiple Input Buffers. + + The scanner writes its `ECHO' output to the `yyout' global (default, +stdout), which may be redefined by the user simply by assigning it to +some other `FILE' pointer. + + +File: flex.info, Node: Start conditions, Next: Multiple buffers, Prev: Generated scanner, Up: Top + +Start conditions +================ + + `flex' provides a mechanism for conditionally activating rules. Any +rule whose pattern is prefixed with "<sc>" will only be active when the +scanner is in the start condition named "sc". For example, + + <STRING>[^"]* { /* eat up the string body ... */ + ... + } + +will be active only when the scanner is in the "STRING" start +condition, and + + <INITIAL,STRING,QUOTE>\. { /* handle an escape ... */ + ... + } + +will be active only when the current start condition is either +"INITIAL", "STRING", or "QUOTE". + + Start conditions are declared in the definitions (first) section of +the input using unindented lines beginning with either `%s' or `%x' +followed by a list of names. The former declares *inclusive* start +conditions, the latter *exclusive* start conditions. A start condition +is activated using the `BEGIN' action. Until the next `BEGIN' action is +executed, rules with the given start condition will be active and rules +with other start conditions will be inactive. If the start condition +is *inclusive*, then rules with no start conditions at all will also be +active. If it is *exclusive*, then *only* rules qualified with the +start condition will be active. A set of rules contingent on the same +exclusive start condition describe a scanner which is independent of +any of the other rules in the `flex' input. Because of this, exclusive +start conditions make it easy to specify "mini-scanners" which scan +portions of the input that are syntactically different from the rest +(e.g., comments). + + If the distinction between inclusive and exclusive start conditions +is still a little vague, here's a simple example illustrating the +connection between the two. The set of rules: + + %s example + %% + + <example>foo do_something(); + + bar something_else(); + +is equivalent to + + %x example + %% + + <example>foo do_something(); + + <INITIAL,example>bar something_else(); + + Without the `<INITIAL,example>' qualifier, the `bar' pattern in the +second example wouldn't be active (i.e., couldn't match) when in start +condition `example'. If we just used `<example>' to qualify `bar', +though, then it would only be active in `example' and not in `INITIAL', +while in the first example it's active in both, because in the first +example the `example' starting condition is an *inclusive* (`%s') start +condition. + + Also note that the special start-condition specifier `<*>' matches +every start condition. Thus, the above example could also have been +written; + + %x example + %% + + <example>foo do_something(); + + <*>bar something_else(); + + The default rule (to `ECHO' any unmatched character) remains active +in start conditions. It is equivalent to: + + <*>.|\\n ECHO; + + `BEGIN(0)' returns to the original state where only the rules with +no start conditions are active. This state can also be referred to as +the start-condition "INITIAL", so `BEGIN(INITIAL)' is equivalent to +`BEGIN(0)'. (The parentheses around the start condition name are not +required but are considered good style.) + + `BEGIN' actions can also be given as indented code at the beginning +of the rules section. For example, the following will cause the +scanner to enter the "SPECIAL" start condition whenever `yylex()' is +called and the global variable `enter_special' is true: + + int enter_special; + + %x SPECIAL + %% + if ( enter_special ) + BEGIN(SPECIAL); + + <SPECIAL>blahblahblah + ...more rules follow... + + To illustrate the uses of start conditions, here is a scanner which +provides two different interpretations of a string like "123.456". By +default it will treat it as as three tokens, the integer "123", a dot +('.'), and the integer "456". But if the string is preceded earlier in +the line by the string "expect-floats" it will treat it as a single +token, the floating-point number 123.456: + + %{ + #include <math.h> + %} + %s expect + + %% + expect-floats BEGIN(expect); + + <expect>[0-9]+"."[0-9]+ { + printf( "found a float, = %f\n", + atof( yytext ) ); + } + <expect>\n { + /* that's the end of the line, so + * we need another "expect-number" + * before we'll recognize any more + * numbers + */ + BEGIN(INITIAL); + } + + [0-9]+ { + + Version 2.5 December 1994 18 + + printf( "found an integer, = %d\n", + atoi( yytext ) ); + } + + "." printf( "found a dot\n" ); + + Here is a scanner which recognizes (and discards) C comments while +maintaining a count of the current input line. + + %x comment + %% + int line_num = 1; + + "/*" BEGIN(comment); + + <comment>[^*\n]* /* eat anything that's not a '*' */ + <comment>"*"+[^*/\n]* /* eat up '*'s not followed by '/'s */ + <comment>\n ++line_num; + <comment>"*"+"/" BEGIN(INITIAL); + + This scanner goes to a bit of trouble to match as much text as +possible with each rule. In general, when attempting to write a +high-speed scanner try to match as much possible in each rule, as it's +a big win. + + Note that start-conditions names are really integer values and can +be stored as such. Thus, the above could be extended in the following +fashion: + + %x comment foo + %% + int line_num = 1; + int comment_caller; + + "/*" { + comment_caller = INITIAL; + BEGIN(comment); + } + + ... + + <foo>"/*" { + comment_caller = foo; + BEGIN(comment); + } + + <comment>[^*\n]* /* eat anything that's not a '*' */ + <comment>"*"+[^*/\n]* /* eat up '*'s not followed by '/'s */ + <comment>\n ++line_num; + <comment>"*"+"/" BEGIN(comment_caller); + + Furthermore, you can access the current start condition using the +integer-valued `YY_START' macro. For example, the above assignments to +`comment_caller' could instead be written + + comment_caller = YY_START; + + Flex provides `YYSTATE' as an alias for `YY_START' (since that is +what's used by AT&T `lex'). + + Note that start conditions do not have their own name-space; %s's +and %x's declare names in the same fashion as #define's. + + Finally, here's an example of how to match C-style quoted strings +using exclusive start conditions, including expanded escape sequences +(but not including checking for a string that's too long): + + %x str + + %% + char string_buf[MAX_STR_CONST]; + char *string_buf_ptr; + + \" string_buf_ptr = string_buf; BEGIN(str); + + <str>\" { /* saw closing quote - all done */ + BEGIN(INITIAL); + *string_buf_ptr = '\0'; + /* return string constant token type and + * value to parser + */ + } + + <str>\n { + /* error - unterminated string constant */ + /* generate error message */ + } + + <str>\\[0-7]{1,3} { + /* octal escape sequence */ + int result; + + (void) sscanf( yytext + 1, "%o", &result ); + + if ( result > 0xff ) + /* error, constant is out-of-bounds */ + + *string_buf_ptr++ = result; + } + + <str>\\[0-9]+ { + /* generate error - bad escape sequence; something + * like '\48' or '\0777777' + */ + } + + <str>\\n *string_buf_ptr++ = '\n'; + <str>\\t *string_buf_ptr++ = '\t'; + <str>\\r *string_buf_ptr++ = '\r'; + <str>\\b *string_buf_ptr++ = '\b'; + <str>\\f *string_buf_ptr++ = '\f'; + + <str>\\(.|\n) *string_buf_ptr++ = yytext[1]; + + <str>[^\\\n\"]+ { + char *yptr = yytext; + + while ( *yptr ) + *string_buf_ptr++ = *yptr++; + } + + Often, such as in some of the examples above, you wind up writing a +whole bunch of rules all preceded by the same start condition(s). Flex +makes this a little easier and cleaner by introducing a notion of start +condition "scope". A start condition scope is begun with: + + <SCs>{ + +where SCs is a list of one or more start conditions. Inside the start +condition scope, every rule automatically has the prefix `<SCs>' +applied to it, until a `}' which matches the initial `{'. So, for +example, + + <ESC>{ + "\\n" return '\n'; + "\\r" return '\r'; + "\\f" return '\f'; + "\\0" return '\0'; + } + +is equivalent to: + + <ESC>"\\n" return '\n'; + <ESC>"\\r" return '\r'; + <ESC>"\\f" return '\f'; + <ESC>"\\0" return '\0'; + + Start condition scopes may be nested. + + Three routines are available for manipulating stacks of start +conditions: + +`void yy_push_state(int new_state)' + pushes the current start condition onto the top of the start + condition stack and switches to NEW_STATE as though you had used + `BEGIN new_state' (recall that start condition names are also + integers). + +`void yy_pop_state()' + pops the top of the stack and switches to it via `BEGIN'. + +`int yy_top_state()' + returns the top of the stack without altering the stack's contents. + + The start condition stack grows dynamically and so has no built-in +size limitation. If memory is exhausted, program execution aborts. + + To use start condition stacks, your scanner must include a `%option +stack' directive (see Options below). + + +File: flex.info, Node: Multiple buffers, Next: End-of-file rules, Prev: Start conditions, Up: Top + +Multiple input buffers +====================== + + Some scanners (such as those which support "include" files) require +reading from several input streams. As `flex' scanners do a large +amount of buffering, one cannot control where the next input will be +read from by simply writing a `YY_INPUT' which is sensitive to the +scanning context. `YY_INPUT' is only called when the scanner reaches +the end of its buffer, which may be a long time after scanning a +statement such as an "include" which requires switching the input +source. + + To negotiate these sorts of problems, `flex' provides a mechanism +for creating and switching between multiple input buffers. An input +buffer is created by using: + + YY_BUFFER_STATE yy_create_buffer( FILE *file, int size ) + +which takes a `FILE' pointer and a size and creates a buffer associated +with the given file and large enough to hold SIZE characters (when in +doubt, use `YY_BUF_SIZE' for the size). It returns a `YY_BUFFER_STATE' +handle, which may then be passed to other routines (see below). The +`YY_BUFFER_STATE' type is a pointer to an opaque `struct' +`yy_buffer_state' structure, so you may safely initialize +YY_BUFFER_STATE variables to `((YY_BUFFER_STATE) 0)' if you wish, and +also refer to the opaque structure in order to correctly declare input +buffers in source files other than that of your scanner. Note that the +`FILE' pointer in the call to `yy_create_buffer' is only used as the +value of `yyin' seen by `YY_INPUT'; if you redefine `YY_INPUT' so it no +longer uses `yyin', then you can safely pass a nil `FILE' pointer to +`yy_create_buffer'. You select a particular buffer to scan from using: + + void yy_switch_to_buffer( YY_BUFFER_STATE new_buffer ) + + switches the scanner's input buffer so subsequent tokens will come +from NEW_BUFFER. Note that `yy_switch_to_buffer()' may be used by +`yywrap()' to set things up for continued scanning, instead of opening +a new file and pointing `yyin' at it. Note also that switching input +sources via either `yy_switch_to_buffer()' or `yywrap()' does *not* +change the start condition. + + void yy_delete_buffer( YY_BUFFER_STATE buffer ) + +is used to reclaim the storage associated with a buffer. You can also +clear the current contents of a buffer using: + + void yy_flush_buffer( YY_BUFFER_STATE buffer ) + + This function discards the buffer's contents, so the next time the +scanner attempts to match a token from the buffer, it will first fill +the buffer anew using `YY_INPUT'. + + `yy_new_buffer()' is an alias for `yy_create_buffer()', provided for +compatibility with the C++ use of `new' and `delete' for creating and +destroying dynamic objects. + + Finally, the `YY_CURRENT_BUFFER' macro returns a `YY_BUFFER_STATE' +handle to the current buffer. + + Here is an example of using these features for writing a scanner +which expands include files (the `<<EOF>>' feature is discussed below): + + /* the "incl" state is used for picking up the name + * of an include file + */ + %x incl + + %{ + #define MAX_INCLUDE_DEPTH 10 + YY_BUFFER_STATE include_stack[MAX_INCLUDE_DEPTH]; + int include_stack_ptr = 0; + %} + + %% + include BEGIN(incl); + + [a-z]+ ECHO; + [^a-z\n]*\n? ECHO; + + <incl>[ \t]* /* eat the whitespace */ + <incl>[^ \t\n]+ { /* got the include file name */ + if ( include_stack_ptr >= MAX_INCLUDE_DEPTH ) + { + fprintf( stderr, "Includes nested too deeply" ); + exit( 1 ); + } + + include_stack[include_stack_ptr++] = + YY_CURRENT_BUFFER; + + yyin = fopen( yytext, "r" ); + + if ( ! yyin ) + error( ... ); + + yy_switch_to_buffer( + yy_create_buffer( yyin, YY_BUF_SIZE ) ); + + BEGIN(INITIAL); + } + + <<EOF>> { + if ( --include_stack_ptr < 0 ) + { + yyterminate(); + } + + else + { + yy_delete_buffer( YY_CURRENT_BUFFER ); + yy_switch_to_buffer( + include_stack[include_stack_ptr] ); + } + } + + Three routines are available for setting up input buffers for +scanning in-memory strings instead of files. All of them create a new +input buffer for scanning the string, and return a corresponding +`YY_BUFFER_STATE' handle (which you should delete with +`yy_delete_buffer()' when done with it). They also switch to the new +buffer using `yy_switch_to_buffer()', so the next call to `yylex()' will +start scanning the string. + +`yy_scan_string(const char *str)' + scans a NUL-terminated string. + +`yy_scan_bytes(const char *bytes, int len)' + scans `len' bytes (including possibly NUL's) starting at location + BYTES. + + Note that both of these functions create and scan a *copy* of the +string or bytes. (This may be desirable, since `yylex()' modifies the +contents of the buffer it is scanning.) You can avoid the copy by using: + +`yy_scan_buffer(char *base, yy_size_t size)' + which scans in place the buffer starting at BASE, consisting of + SIZE bytes, the last two bytes of which *must* be + `YY_END_OF_BUFFER_CHAR' (ASCII NUL). These last two bytes are not + scanned; thus, scanning consists of `base[0]' through + `base[size-2]', inclusive. + + If you fail to set up BASE in this manner (i.e., forget the final + two `YY_END_OF_BUFFER_CHAR' bytes), then `yy_scan_buffer()' + returns a nil pointer instead of creating a new input buffer. + + The type `yy_size_t' is an integral type to which you can cast an + integer expression reflecting the size of the buffer. + + +File: flex.info, Node: End-of-file rules, Next: Miscellaneous, Prev: Multiple buffers, Up: Top + +End-of-file rules +================= + + The special rule "<<EOF>>" indicates actions which are to be taken +when an end-of-file is encountered and yywrap() returns non-zero (i.e., +indicates no further files to process). The action must finish by +doing one of four things: + + - assigning `yyin' to a new input file (in previous versions of + flex, after doing the assignment you had to call the special + action `YY_NEW_FILE'; this is no longer necessary); + + - executing a `return' statement; + + - executing the special `yyterminate()' action; + + - or, switching to a new buffer using `yy_switch_to_buffer()' as + shown in the example above. + + <<EOF>> rules may not be used with other patterns; they may only be +qualified with a list of start conditions. If an unqualified <<EOF>> +rule is given, it applies to *all* start conditions which do not +already have <<EOF>> actions. To specify an <<EOF>> rule for only the +initial start condition, use + + <INITIAL><<EOF>> + + These rules are useful for catching things like unclosed comments. +An example: + + %x quote + %% + + ...other rules for dealing with quotes... + + <quote><<EOF>> { + error( "unterminated quote" ); + yyterminate(); + } + <<EOF>> { + if ( *++filelist ) + yyin = fopen( *filelist, "r" ); + else + yyterminate(); + } + + +File: flex.info, Node: Miscellaneous, Next: User variables, Prev: End-of-file rules, Up: Top + +Miscellaneous macros +==================== + + The macro `YY_USER_ACTION' can be defined to provide an action which +is always executed prior to the matched rule's action. For example, it +could be #define'd to call a routine to convert yytext to lower-case. +When `YY_USER_ACTION' is invoked, the variable `yy_act' gives the +number of the matched rule (rules are numbered starting with 1). +Suppose you want to profile how often each of your rules is matched. +The following would do the trick: + + #define YY_USER_ACTION ++ctr[yy_act] + + where `ctr' is an array to hold the counts for the different rules. +Note that the macro `YY_NUM_RULES' gives the total number of rules +(including the default rule, even if you use `-s', so a correct +declaration for `ctr' is: + + int ctr[YY_NUM_RULES]; + + The macro `YY_USER_INIT' may be defined to provide an action which +is always executed before the first scan (and before the scanner's +internal initializations are done). For example, it could be used to +call a routine to read in a data table or open a logging file. + + The macro `yy_set_interactive(is_interactive)' can be used to +control whether the current buffer is considered *interactive*. An +interactive buffer is processed more slowly, but must be used when the +scanner's input source is indeed interactive to avoid problems due to +waiting to fill buffers (see the discussion of the `-I' flag below). A +non-zero value in the macro invocation marks the buffer as interactive, +a zero value as non-interactive. Note that use of this macro overrides +`%option always-interactive' or `%option never-interactive' (see +Options below). `yy_set_interactive()' must be invoked prior to +beginning to scan the buffer that is (or is not) to be considered +interactive. + + The macro `yy_set_bol(at_bol)' can be used to control whether the +current buffer's scanning context for the next token match is done as +though at the beginning of a line. A non-zero macro argument makes +rules anchored with + + The macro `YY_AT_BOL()' returns true if the next token scanned from +the current buffer will have '^' rules active, false otherwise. + + In the generated scanner, the actions are all gathered in one large +switch statement and separated using `YY_BREAK', which may be +redefined. By default, it is simply a "break", to separate each rule's +action from the following rule's. Redefining `YY_BREAK' allows, for +example, C++ users to #define YY_BREAK to do nothing (while being very +careful that every rule ends with a "break" or a "return"!) to avoid +suffering from unreachable statement warnings where because a rule's +action ends with "return", the `YY_BREAK' is inaccessible. + + +File: flex.info, Node: User variables, Next: YACC interface, Prev: Miscellaneous, Up: Top + +Values available to the user +============================ + + This section summarizes the various values available to the user in +the rule actions. + + - `char *yytext' holds the text of the current token. It may be + modified but not lengthened (you cannot append characters to the + end). + + If the special directive `%array' appears in the first section of + the scanner description, then `yytext' is instead declared `char + yytext[YYLMAX]', where `YYLMAX' is a macro definition that you can + redefine in the first section if you don't like the default value + (generally 8KB). Using `%array' results in somewhat slower + scanners, but the value of `yytext' becomes immune to calls to + `input()' and `unput()', which potentially destroy its value when + `yytext' is a character pointer. The opposite of `%array' is + `%pointer', which is the default. + + You cannot use `%array' when generating C++ scanner classes (the + `-+' flag). + + - `int yyleng' holds the length of the current token. + + - `FILE *yyin' is the file which by default `flex' reads from. It + may be redefined but doing so only makes sense before scanning + begins or after an EOF has been encountered. Changing it in the + midst of scanning will have unexpected results since `flex' + buffers its input; use `yyrestart()' instead. Once scanning + terminates because an end-of-file has been seen, you can assign + `yyin' at the new input file and then call the scanner again to + continue scanning. + + - `void yyrestart( FILE *new_file )' may be called to point `yyin' + at the new input file. The switch-over to the new file is + immediate (any previously buffered-up input is lost). Note that + calling `yyrestart()' with `yyin' as an argument thus throws away + the current input buffer and continues scanning the same input + file. + + - `FILE *yyout' is the file to which `ECHO' actions are done. It + can be reassigned by the user. + + - `YY_CURRENT_BUFFER' returns a `YY_BUFFER_STATE' handle to the + current buffer. + + - `YY_START' returns an integer value corresponding to the current + start condition. You can subsequently use this value with `BEGIN' + to return to that start condition. + + +File: flex.info, Node: YACC interface, Next: Options, Prev: User variables, Up: Top + +Interfacing with `yacc' +======================= + + One of the main uses of `flex' is as a companion to the `yacc' +parser-generator. `yacc' parsers expect to call a routine named +`yylex()' to find the next input token. The routine is supposed to +return the type of the next token as well as putting any associated +value in the global `yylval'. To use `flex' with `yacc', one specifies +the `-d' option to `yacc' to instruct it to generate the file `y.tab.h' +containing definitions of all the `%tokens' appearing in the `yacc' +input. This file is then included in the `flex' scanner. For example, +if one of the tokens is "TOK_NUMBER", part of the scanner might look +like: + + %{ + #include "y.tab.h" + %} + + %% + + [0-9]+ yylval = atoi( yytext ); return TOK_NUMBER; + + +File: flex.info, Node: Options, Next: Performance, Prev: YACC interface, Up: Top + +Options +======= + + `flex' has the following options: + +`-b' + Generate backing-up information to `lex.backup'. This is a list + of scanner states which require backing up and the input + characters on which they do so. By adding rules one can remove + backing-up states. If *all* backing-up states are eliminated and + `-Cf' or `-CF' is used, the generated scanner will run faster (see + the `-p' flag). Only users who wish to squeeze every last cycle + out of their scanners need worry about this option. (See the + section on Performance Considerations below.) + +`-c' + is a do-nothing, deprecated option included for POSIX compliance. + +`-d' + makes the generated scanner run in "debug" mode. Whenever a + pattern is recognized and the global `yy_flex_debug' is non-zero + (which is the default), the scanner will write to `stderr' a line + of the form: + + --accepting rule at line 53 ("the matched text") + + The line number refers to the location of the rule in the file + defining the scanner (i.e., the file that was fed to flex). + Messages are also generated when the scanner backs up, accepts the + default rule, reaches the end of its input buffer (or encounters a + NUL; at this point, the two look the same as far as the scanner's + concerned), or reaches an end-of-file. + +`-f' + specifies "fast scanner". No table compression is done and stdio + is bypassed. The result is large but fast. This option is + equivalent to `-Cfr' (see below). + +`-h' + generates a "help" summary of `flex's' options to `stdout' and + then exits. `-?' and `--help' are synonyms for `-h'. + +`-i' + instructs `flex' to generate a *case-insensitive* scanner. The + case of letters given in the `flex' input patterns will be + ignored, and tokens in the input will be matched regardless of + case. The matched text given in `yytext' will have the preserved + case (i.e., it will not be folded). + +`-l' + turns on maximum compatibility with the original AT&T `lex' + implementation. Note that this does not mean *full* + compatibility. Use of this option costs a considerable amount of + performance, and it cannot be used with the `-+, -f, -F, -Cf', or + `-CF' options. For details on the compatibilities it provides, see + the section "Incompatibilities With Lex And POSIX" below. This + option also results in the name `YY_FLEX_LEX_COMPAT' being + #define'd in the generated scanner. + +`-n' + is another do-nothing, deprecated option included only for POSIX + compliance. + +`-p' + generates a performance report to stderr. The report consists of + comments regarding features of the `flex' input file which will + cause a serious loss of performance in the resulting scanner. If + you give the flag twice, you will also get comments regarding + features that lead to minor performance losses. + + Note that the use of `REJECT', `%option yylineno' and variable + trailing context (see the Deficiencies / Bugs section below) + entails a substantial performance penalty; use of `yymore()', the + `^' operator, and the `-I' flag entail minor performance penalties. + +`-s' + causes the "default rule" (that unmatched scanner input is echoed + to `stdout') to be suppressed. If the scanner encounters input + that does not match any of its rules, it aborts with an error. + This option is useful for finding holes in a scanner's rule set. + +`-t' + instructs `flex' to write the scanner it generates to standard + output instead of `lex.yy.c'. + +`-v' + specifies that `flex' should write to `stderr' a summary of + statistics regarding the scanner it generates. Most of the + statistics are meaningless to the casual `flex' user, but the + first line identifies the version of `flex' (same as reported by + `-V'), and the next line the flags used when generating the + scanner, including those that are on by default. + +`-w' + suppresses warning messages. + +`-B' + instructs `flex' to generate a *batch* scanner, the opposite of + *interactive* scanners generated by `-I' (see below). In general, + you use `-B' when you are *certain* that your scanner will never + be used interactively, and you want to squeeze a *little* more + performance out of it. If your goal is instead to squeeze out a + *lot* more performance, you should be using the `-Cf' or `-CF' + options (discussed below), which turn on `-B' automatically anyway. + +`-F' + specifies that the "fast" scanner table representation should be + used (and stdio bypassed). This representation is about as fast + as the full table representation `(-f)', and for some sets of + patterns will be considerably smaller (and for others, larger). + In general, if the pattern set contains both "keywords" and a + catch-all, "identifier" rule, such as in the set: + + "case" return TOK_CASE; + "switch" return TOK_SWITCH; + ... + "default" return TOK_DEFAULT; + [a-z]+ return TOK_ID; + + then you're better off using the full table representation. If + only the "identifier" rule is present and you then use a hash + table or some such to detect the keywords, you're better off using + `-F'. + + This option is equivalent to `-CFr' (see below). It cannot be + used with `-+'. + +`-I' + instructs `flex' to generate an *interactive* scanner. An + interactive scanner is one that only looks ahead to decide what + token has been matched if it absolutely must. It turns out that + always looking one extra character ahead, even if the scanner has + already seen enough text to disambiguate the current token, is a + bit faster than only looking ahead when necessary. But scanners + that always look ahead give dreadful interactive performance; for + example, when a user types a newline, it is not recognized as a + newline token until they enter *another* token, which often means + typing in another whole line. + + `Flex' scanners default to *interactive* unless you use the `-Cf' + or `-CF' table-compression options (see below). That's because if + you're looking for high-performance you should be using one of + these options, so if you didn't, `flex' assumes you'd rather trade + off a bit of run-time performance for intuitive interactive + behavior. Note also that you *cannot* use `-I' in conjunction + with `-Cf' or `-CF'. Thus, this option is not really needed; it + is on by default for all those cases in which it is allowed. + + You can force a scanner to *not* be interactive by using `-B' (see + above). + +`-L' + instructs `flex' not to generate `#line' directives. Without this + option, `flex' peppers the generated scanner with #line directives + so error messages in the actions will be correctly located with + respect to either the original `flex' input file (if the errors + are due to code in the input file), or `lex.yy.c' (if the errors + are `flex's' fault - you should report these sorts of errors to + the email address given below). + +`-T' + makes `flex' run in `trace' mode. It will generate a lot of + messages to `stderr' concerning the form of the input and the + resultant non-deterministic and deterministic finite automata. + This option is mostly for use in maintaining `flex'. + +`-V' + prints the version number to `stdout' and exits. `--version' is a + synonym for `-V'. + +`-7' + instructs `flex' to generate a 7-bit scanner, i.e., one which can + only recognized 7-bit characters in its input. The advantage of + using `-7' is that the scanner's tables can be up to half the size + of those generated using the `-8' option (see below). The + disadvantage is that such scanners often hang or crash if their + input contains an 8-bit character. + + Note, however, that unless you generate your scanner using the + `-Cf' or `-CF' table compression options, use of `-7' will save + only a small amount of table space, and make your scanner + considerably less portable. `Flex's' default behavior is to + generate an 8-bit scanner unless you use the `-Cf' or `-CF', in + which case `flex' defaults to generating 7-bit scanners unless + your site was always configured to generate 8-bit scanners (as + will often be the case with non-USA sites). You can tell whether + flex generated a 7-bit or an 8-bit scanner by inspecting the flag + summary in the `-v' output as described above. + + Note that if you use `-Cfe' or `-CFe' (those table compression + options, but also using equivalence classes as discussed see + below), flex still defaults to generating an 8-bit scanner, since + usually with these compression options full 8-bit tables are not + much more expensive than 7-bit tables. + +`-8' + instructs `flex' to generate an 8-bit scanner, i.e., one which can + recognize 8-bit characters. This flag is only needed for scanners + generated using `-Cf' or `-CF', as otherwise flex defaults to + generating an 8-bit scanner anyway. + + See the discussion of `-7' above for flex's default behavior and + the tradeoffs between 7-bit and 8-bit scanners. + +`-+' + specifies that you want flex to generate a C++ scanner class. See + the section on Generating C++ Scanners below for details. + +`-C[aefFmr]' + controls the degree of table compression and, more generally, + trade-offs between small scanners and fast scanners. + + `-Ca' ("align") instructs flex to trade off larger tables in the + generated scanner for faster performance because the elements of + the tables are better aligned for memory access and computation. + On some RISC architectures, fetching and manipulating long-words + is more efficient than with smaller-sized units such as + shortwords. This option can double the size of the tables used by + your scanner. + + `-Ce' directs `flex' to construct "equivalence classes", i.e., + sets of characters which have identical lexical properties (for + example, if the only appearance of digits in the `flex' input is + in the character class "[0-9]" then the digits '0', '1', ..., '9' + will all be put in the same equivalence class). Equivalence + classes usually give dramatic reductions in the final table/object + file sizes (typically a factor of 2-5) and are pretty cheap + performance-wise (one array look-up per character scanned). + + `-Cf' specifies that the *full* scanner tables should be generated + - `flex' should not compress the tables by taking advantages of + similar transition functions for different states. + + `-CF' specifies that the alternate fast scanner representation + (described above under the `-F' flag) should be used. This option + cannot be used with `-+'. + + `-Cm' directs `flex' to construct "meta-equivalence classes", + which are sets of equivalence classes (or characters, if + equivalence classes are not being used) that are commonly used + together. Meta-equivalence classes are often a big win when using + compressed tables, but they have a moderate performance impact + (one or two "if" tests and one array look-up per character + scanned). + + `-Cr' causes the generated scanner to *bypass* use of the standard + I/O library (stdio) for input. Instead of calling `fread()' or + `getc()', the scanner will use the `read()' system call, resulting + in a performance gain which varies from system to system, but in + general is probably negligible unless you are also using `-Cf' or + `-CF'. Using `-Cr' can cause strange behavior if, for example, + you read from `yyin' using stdio prior to calling the scanner + (because the scanner will miss whatever text your previous reads + left in the stdio input buffer). + + `-Cr' has no effect if you define `YY_INPUT' (see The Generated + Scanner above). + + A lone `-C' specifies that the scanner tables should be compressed + but neither equivalence classes nor meta-equivalence classes + should be used. + + The options `-Cf' or `-CF' and `-Cm' do not make sense together - + there is no opportunity for meta-equivalence classes if the table + is not being compressed. Otherwise the options may be freely + mixed, and are cumulative. + + The default setting is `-Cem', which specifies that `flex' should + generate equivalence classes and meta-equivalence classes. This + setting provides the highest degree of table compression. You can + trade off faster-executing scanners at the cost of larger tables + with the following generally being true: + + slowest & smallest + -Cem + -Cm + -Ce + -C + -C{f,F}e + -C{f,F} + -C{f,F}a + fastest & largest + + Note that scanners with the smallest tables are usually generated + and compiled the quickest, so during development you will usually + want to use the default, maximal compression. + + `-Cfe' is often a good compromise between speed and size for + production scanners. + +`-ooutput' + directs flex to write the scanner to the file `out-' `put' instead + of `lex.yy.c'. If you combine `-o' with the `-t' option, then the + scanner is written to `stdout' but its `#line' directives (see the + `-L' option above) refer to the file `output'. + +`-Pprefix' + changes the default `yy' prefix used by `flex' for all + globally-visible variable and function names to instead be PREFIX. + For example, `-Pfoo' changes the name of `yytext' to `footext'. + It also changes the name of the default output file from + `lex.yy.c' to `lex.foo.c'. Here are all of the names affected: + + yy_create_buffer + yy_delete_buffer + yy_flex_debug + yy_init_buffer + yy_flush_buffer + yy_load_buffer_state + yy_switch_to_buffer + yyin + yyleng + yylex + yylineno + yyout + yyrestart + yytext + yywrap + + (If you are using a C++ scanner, then only `yywrap' and + `yyFlexLexer' are affected.) Within your scanner itself, you can + still refer to the global variables and functions using either + version of their name; but externally, they have the modified name. + + This option lets you easily link together multiple `flex' programs + into the same executable. Note, though, that using this option + also renames `yywrap()', so you now *must* either provide your own + (appropriately-named) version of the routine for your scanner, or + use `%option noyywrap', as linking with `-lfl' no longer provides + one for you by default. + +`-Sskeleton_file' + overrides the default skeleton file from which `flex' constructs + its scanners. You'll never need this option unless you are doing + `flex' maintenance or development. + + `flex' also provides a mechanism for controlling options within the +scanner specification itself, rather than from the flex command-line. +This is done by including `%option' directives in the first section of +the scanner specification. You can specify multiple options with a +single `%option' directive, and multiple directives in the first +section of your flex input file. Most options are given simply as +names, optionally preceded by the word "no" (with no intervening +whitespace) to negate their meaning. A number are equivalent to flex +flags or their negation: + + 7bit -7 option + 8bit -8 option + align -Ca option + backup -b option + batch -B option + c++ -+ option + + caseful or + case-sensitive opposite of -i (default) + + case-insensitive or + caseless -i option + + debug -d option + default opposite of -s option + ecs -Ce option + fast -F option + full -f option + interactive -I option + lex-compat -l option + meta-ecs -Cm option + perf-report -p option + read -Cr option + stdout -t option + verbose -v option + warn opposite of -w option + (use "%option nowarn" for -w) + + array equivalent to "%array" + pointer equivalent to "%pointer" (default) + + Some `%option's' provide features otherwise not available: + +`always-interactive' + instructs flex to generate a scanner which always considers its + input "interactive". Normally, on each new input file the scanner + calls `isatty()' in an attempt to determine whether the scanner's + input source is interactive and thus should be read a character at + a time. When this option is used, however, then no such call is + made. + +`main' + directs flex to provide a default `main()' program for the + scanner, which simply calls `yylex()'. This option implies + `noyywrap' (see below). + +`never-interactive' + instructs flex to generate a scanner which never considers its + input "interactive" (again, no call made to `isatty())'. This is + the opposite of `always-' *interactive*. + +`stack' + enables the use of start condition stacks (see Start Conditions + above). + +`stdinit' + if unset (i.e., `%option nostdinit') initializes `yyin' and + `yyout' to nil `FILE' pointers, instead of `stdin' and `stdout'. + +`yylineno' + directs `flex' to generate a scanner that maintains the number of + the current line read from its input in the global variable + `yylineno'. This option is implied by `%option lex-compat'. + +`yywrap' + if unset (i.e., `%option noyywrap'), makes the scanner not call + `yywrap()' upon an end-of-file, but simply assume that there are + no more files to scan (until the user points `yyin' at a new file + and calls `yylex()' again). + + `flex' scans your rule actions to determine whether you use the +`REJECT' or `yymore()' features. The `reject' and `yymore' options are +available to override its decision as to whether you use the options, +either by setting them (e.g., `%option reject') to indicate the feature +is indeed used, or unsetting them to indicate it actually is not used +(e.g., `%option noyymore'). + + Three options take string-delimited values, offset with '=': + + %option outfile="ABC" + +is equivalent to `-oABC', and + + %option prefix="XYZ" + +is equivalent to `-PXYZ'. + + Finally, + + %option yyclass="foo" + +only applies when generating a C++ scanner (`-+' option). It informs +`flex' that you have derived `foo' as a subclass of `yyFlexLexer' so +`flex' will place your actions in the member function `foo::yylex()' +instead of `yyFlexLexer::yylex()'. It also generates a +`yyFlexLexer::yylex()' member function that emits a run-time error (by +invoking `yyFlexLexer::LexerError()') if called. See Generating C++ +Scanners, below, for additional information. + + A number of options are available for lint purists who want to +suppress the appearance of unneeded routines in the generated scanner. +Each of the following, if unset, results in the corresponding routine +not appearing in the generated scanner: + + input, unput + yy_push_state, yy_pop_state, yy_top_state + yy_scan_buffer, yy_scan_bytes, yy_scan_string + +(though `yy_push_state()' and friends won't appear anyway unless you +use `%option stack'). + + +File: flex.info, Node: Performance, Next: C++, Prev: Options, Up: Top + +Performance considerations +========================== + + The main design goal of `flex' is that it generate high-performance +scanners. It has been optimized for dealing well with large sets of +rules. Aside from the effects on scanner speed of the table +compression `-C' options outlined above, there are a number of +options/actions which degrade performance. These are, from most +expensive to least: + + REJECT + %option yylineno + arbitrary trailing context + + pattern sets that require backing up + %array + %option interactive + %option always-interactive + + '^' beginning-of-line operator + yymore() + + with the first three all being quite expensive and the last two +being quite cheap. Note also that `unput()' is implemented as a +routine call that potentially does quite a bit of work, while +`yyless()' is a quite-cheap macro; so if just putting back some excess +text you scanned, use `yyless()'. + + `REJECT' should be avoided at all costs when performance is +important. It is a particularly expensive option. + + Getting rid of backing up is messy and often may be an enormous +amount of work for a complicated scanner. In principal, one begins by +using the `-b' flag to generate a `lex.backup' file. For example, on +the input + + %% + foo return TOK_KEYWORD; + foobar return TOK_KEYWORD; + +the file looks like: + + State #6 is non-accepting - + associated rule line numbers: + 2 3 + out-transitions: [ o ] + jam-transitions: EOF [ \001-n p-\177 ] + + State #8 is non-accepting - + associated rule line numbers: + 3 + out-transitions: [ a ] + jam-transitions: EOF [ \001-` b-\177 ] + + State #9 is non-accepting - + associated rule line numbers: + 3 + out-transitions: [ r ] + jam-transitions: EOF [ \001-q s-\177 ] + + Compressed tables always back up. + + The first few lines tell us that there's a scanner state in which it +can make a transition on an 'o' but not on any other character, and +that in that state the currently scanned text does not match any rule. +The state occurs when trying to match the rules found at lines 2 and 3 +in the input file. If the scanner is in that state and then reads +something other than an 'o', it will have to back up to find a rule +which is matched. With a bit of head-scratching one can see that this +must be the state it's in when it has seen "fo". When this has +happened, if anything other than another 'o' is seen, the scanner will +have to back up to simply match the 'f' (by the default rule). + + The comment regarding State #8 indicates there's a problem when +"foob" has been scanned. Indeed, on any character other than an 'a', +the scanner will have to back up to accept "foo". Similarly, the +comment for State #9 concerns when "fooba" has been scanned and an 'r' +does not follow. + + The final comment reminds us that there's no point going to all the +trouble of removing backing up from the rules unless we're using `-Cf' +or `-CF', since there's no performance gain doing so with compressed +scanners. + + The way to remove the backing up is to add "error" rules: + + %% + foo return TOK_KEYWORD; + foobar return TOK_KEYWORD; + + fooba | + foob | + fo { + /* false alarm, not really a keyword */ + return TOK_ID; + } + + Eliminating backing up among a list of keywords can also be done +using a "catch-all" rule: + + %% + foo return TOK_KEYWORD; + foobar return TOK_KEYWORD; + + [a-z]+ return TOK_ID; + + This is usually the best solution when appropriate. + + Backing up messages tend to cascade. With a complicated set of +rules it's not uncommon to get hundreds of messages. If one can +decipher them, though, it often only takes a dozen or so rules to +eliminate the backing up (though it's easy to make a mistake and have +an error rule accidentally match a valid token. A possible future +`flex' feature will be to automatically add rules to eliminate backing +up). + + It's important to keep in mind that you gain the benefits of +eliminating backing up only if you eliminate *every* instance of +backing up. Leaving just one means you gain nothing. + + VARIABLE trailing context (where both the leading and trailing parts +do not have a fixed length) entails almost the same performance loss as +`REJECT' (i.e., substantial). So when possible a rule like: + + %% + mouse|rat/(cat|dog) run(); + +is better written: + + %% + mouse/cat|dog run(); + rat/cat|dog run(); + +or as + + %% + mouse|rat/cat run(); + mouse|rat/dog run(); + + Note that here the special '|' action does *not* provide any +savings, and can even make things worse (see Deficiencies / Bugs below). + + Another area where the user can increase a scanner's performance +(and one that's easier to implement) arises from the fact that the +longer the tokens matched, the faster the scanner will run. This is +because with long tokens the processing of most input characters takes +place in the (short) inner scanning loop, and does not often have to go +through the additional work of setting up the scanning environment +(e.g., `yytext') for the action. Recall the scanner for C comments: + + %x comment + %% + int line_num = 1; + + "/*" BEGIN(comment); + + <comment>[^*\n]* + <comment>"*"+[^*/\n]* + <comment>\n ++line_num; + <comment>"*"+"/" BEGIN(INITIAL); + + This could be sped up by writing it as: + + %x comment + %% + int line_num = 1; + + "/*" BEGIN(comment); + + <comment>[^*\n]* + <comment>[^*\n]*\n ++line_num; + <comment>"*"+[^*/\n]* + <comment>"*"+[^*/\n]*\n ++line_num; + <comment>"*"+"/" BEGIN(INITIAL); + + Now instead of each newline requiring the processing of another +action, recognizing the newlines is "distributed" over the other rules +to keep the matched text as long as possible. Note that *adding* rules +does *not* slow down the scanner! The speed of the scanner is +independent of the number of rules or (modulo the considerations given +at the beginning of this section) how complicated the rules are with +regard to operators such as '*' and '|'. + + A final example in speeding up a scanner: suppose you want to scan +through a file containing identifiers and keywords, one per line and +with no other extraneous characters, and recognize all the keywords. A +natural first approach is: + + %% + asm | + auto | + break | + ... etc ... + volatile | + while /* it's a keyword */ + + .|\n /* it's not a keyword */ + + To eliminate the back-tracking, introduce a catch-all rule: + + %% + asm | + auto | + break | + ... etc ... + volatile | + while /* it's a keyword */ + + [a-z]+ | + .|\n /* it's not a keyword */ + + Now, if it's guaranteed that there's exactly one word per line, then +we can reduce the total number of matches by a half by merging in the +recognition of newlines with that of the other tokens: + + %% + asm\n | + auto\n | + break\n | + ... etc ... + volatile\n | + while\n /* it's a keyword */ + + [a-z]+\n | + .|\n /* it's not a keyword */ + + One has to be careful here, as we have now reintroduced backing up +into the scanner. In particular, while *we* know that there will never +be any characters in the input stream other than letters or newlines, +`flex' can't figure this out, and it will plan for possibly needing to +back up when it has scanned a token like "auto" and then the next +character is something other than a newline or a letter. Previously it +would then just match the "auto" rule and be done, but now it has no +"auto" rule, only a "auto\n" rule. To eliminate the possibility of +backing up, we could either duplicate all rules but without final +newlines, or, since we never expect to encounter such an input and +therefore don't how it's classified, we can introduce one more +catch-all rule, this one which doesn't include a newline: + + %% + asm\n | + auto\n | + break\n | + ... etc ... + volatile\n | + while\n /* it's a keyword */ + + [a-z]+\n | + [a-z]+ | + .|\n /* it's not a keyword */ + + Compiled with `-Cf', this is about as fast as one can get a `flex' +scanner to go for this particular problem. + + A final note: `flex' is slow when matching NUL's, particularly when +a token contains multiple NUL's. It's best to write rules which match +*short* amounts of text if it's anticipated that the text will often +include NUL's. + + Another final note regarding performance: as mentioned above in the +section How the Input is Matched, dynamically resizing `yytext' to +accommodate huge tokens is a slow process because it presently requires +that the (huge) token be rescanned from the beginning. Thus if +performance is vital, you should attempt to match "large" quantities of +text but not "huge" quantities, where the cutoff between the two is at +about 8K characters/token. + + +File: flex.info, Node: C++, Next: Incompatibilities, Prev: Performance, Up: Top + +Generating C++ scanners +======================= + + `flex' provides two different ways to generate scanners for use with +C++. The first way is to simply compile a scanner generated by `flex' +using a C++ compiler instead of a C compiler. You should not encounter +any compilations errors (please report any you find to the email address +given in the Author section below). You can then use C++ code in your +rule actions instead of C code. Note that the default input source for +your scanner remains `yyin', and default echoing is still done to +`yyout'. Both of these remain `FILE *' variables and not C++ `streams'. + + You can also use `flex' to generate a C++ scanner class, using the +`-+' option, (or, equivalently, `%option c++'), which is automatically +specified if the name of the flex executable ends in a `+', such as +`flex++'. When using this option, flex defaults to generating the +scanner to the file `lex.yy.cc' instead of `lex.yy.c'. The generated +scanner includes the header file `FlexLexer.h', which defines the +interface to two C++ classes. + + The first class, `FlexLexer', provides an abstract base class +defining the general scanner class interface. It provides the +following member functions: + +`const char* YYText()' + returns the text of the most recently matched token, the + equivalent of `yytext'. + +`int YYLeng()' + returns the length of the most recently matched token, the + equivalent of `yyleng'. + +`int lineno() const' + returns the current input line number (see `%option yylineno'), or + 1 if `%option yylineno' was not used. + +`void set_debug( int flag )' + sets the debugging flag for the scanner, equivalent to assigning to + `yy_flex_debug' (see the Options section above). Note that you + must build the scanner using `%option debug' to include debugging + information in it. + +`int debug() const' + returns the current setting of the debugging flag. + + Also provided are member functions equivalent to +`yy_switch_to_buffer(), yy_create_buffer()' (though the first argument +is an `istream*' object pointer and not a `FILE*', `yy_flush_buffer()', +`yy_delete_buffer()', and `yyrestart()' (again, the first argument is a +`istream*' object pointer). + + The second class defined in `FlexLexer.h' is `yyFlexLexer', which is +derived from `FlexLexer'. It defines the following additional member +functions: + +`yyFlexLexer( istream* arg_yyin = 0, ostream* arg_yyout = 0 )' + constructs a `yyFlexLexer' object using the given streams for + input and output. If not specified, the streams default to `cin' + and `cout', respectively. + +`virtual int yylex()' + performs the same role is `yylex()' does for ordinary flex + scanners: it scans the input stream, consuming tokens, until a + rule's action returns a value. If you derive a subclass S from + `yyFlexLexer' and want to access the member functions and + variables of S inside `yylex()', then you need to use `%option + yyclass="S"' to inform `flex' that you will be using that subclass + instead of `yyFlexLexer'. In this case, rather than generating + `yyFlexLexer::yylex()', `flex' generates `S::yylex()' (and also + generates a dummy `yyFlexLexer::yylex()' that calls + `yyFlexLexer::LexerError()' if called). + +`virtual void switch_streams(istream* new_in = 0, ostream* new_out = 0)' + reassigns `yyin' to `new_in' (if non-nil) and `yyout' to `new_out' + (ditto), deleting the previous input buffer if `yyin' is + reassigned. + +`int yylex( istream* new_in = 0, ostream* new_out = 0 )' + first switches the input streams via `switch_streams( new_in, + new_out )' and then returns the value of `yylex()'. + + In addition, `yyFlexLexer' defines the following protected virtual +functions which you can redefine in derived classes to tailor the +scanner: + +`virtual int LexerInput( char* buf, int max_size )' + reads up to `max_size' characters into BUF and returns the number + of characters read. To indicate end-of-input, return 0 + characters. Note that "interactive" scanners (see the `-B' and + `-I' flags) define the macro `YY_INTERACTIVE'. If you redefine + `LexerInput()' and need to take different actions depending on + whether or not the scanner might be scanning an interactive input + source, you can test for the presence of this name via `#ifdef'. + +`virtual void LexerOutput( const char* buf, int size )' + writes out SIZE characters from the buffer BUF, which, while + NUL-terminated, may also contain "internal" NUL's if the scanner's + rules can match text with NUL's in them. + +`virtual void LexerError( const char* msg )' + reports a fatal error message. The default version of this + function writes the message to the stream `cerr' and exits. + + Note that a `yyFlexLexer' object contains its *entire* scanning +state. Thus you can use such objects to create reentrant scanners. +You can instantiate multiple instances of the same `yyFlexLexer' class, +and you can also combine multiple C++ scanner classes together in the +same program using the `-P' option discussed above. Finally, note that +the `%array' feature is not available to C++ scanner classes; you must +use `%pointer' (the default). + + Here is an example of a simple C++ scanner: + + // An example of using the flex C++ scanner class. + + %{ + int mylineno = 0; + %} + + string \"[^\n"]+\" + + ws [ \t]+ + + alpha [A-Za-z] + dig [0-9] + name ({alpha}|{dig}|\$)({alpha}|{dig}|[_.\-/$])* + num1 [-+]?{dig}+\.?([eE][-+]?{dig}+)? + num2 [-+]?{dig}*\.{dig}+([eE][-+]?{dig}+)? + number {num1}|{num2} + + %% + + {ws} /* skip blanks and tabs */ + + "/*" { + int c; + + while((c = yyinput()) != 0) + { + if(c == '\n') + ++mylineno; + + else if(c == '*') + { + if((c = yyinput()) == '/') + break; + else + unput(c); + } + } + } + + {number} cout << "number " << YYText() << '\n'; + + \n mylineno++; + + {name} cout << "name " << YYText() << '\n'; + + {string} cout << "string " << YYText() << '\n'; + + %% + + Version 2.5 December 1994 44 + + int main( int /* argc */, char** /* argv */ ) + { + FlexLexer* lexer = new yyFlexLexer; + while(lexer->yylex() != 0) + ; + return 0; + } + + If you want to create multiple (different) lexer classes, you use +the `-P' flag (or the `prefix=' option) to rename each `yyFlexLexer' to +some other `xxFlexLexer'. You then can include `<FlexLexer.h>' in your +other sources once per lexer class, first renaming `yyFlexLexer' as +follows: + + #undef yyFlexLexer + #define yyFlexLexer xxFlexLexer + #include <FlexLexer.h> + + #undef yyFlexLexer + #define yyFlexLexer zzFlexLexer + #include <FlexLexer.h> + + if, for example, you used `%option prefix="xx"' for one of your +scanners and `%option prefix="zz"' for the other. + + IMPORTANT: the present form of the scanning class is *experimental* +and may change considerably between major releases. + + +File: flex.info, Node: Incompatibilities, Next: Diagnostics, Prev: C++, Up: Top + +Incompatibilities with `lex' and POSIX +====================================== + + `flex' is a rewrite of the AT&T Unix `lex' tool (the two +implementations do not share any code, though), with some extensions +and incompatibilities, both of which are of concern to those who wish +to write scanners acceptable to either implementation. Flex is fully +compliant with the POSIX `lex' specification, except that when using +`%pointer' (the default), a call to `unput()' destroys the contents of +`yytext', which is counter to the POSIX specification. + + In this section we discuss all of the known areas of incompatibility +between flex, AT&T lex, and the POSIX specification. + + `flex's' `-l' option turns on maximum compatibility with the +original AT&T `lex' implementation, at the cost of a major loss in the +generated scanner's performance. We note below which incompatibilities +can be overcome using the `-l' option. + + `flex' is fully compatible with `lex' with the following exceptions: + + - The undocumented `lex' scanner internal variable `yylineno' is not + supported unless `-l' or `%option yylineno' is used. `yylineno' + should be maintained on a per-buffer basis, rather than a + per-scanner (single global variable) basis. `yylineno' is not + part of the POSIX specification. + + - The `input()' routine is not redefinable, though it may be called + to read characters following whatever has been matched by a rule. + If `input()' encounters an end-of-file the normal `yywrap()' + processing is done. A "real" end-of-file is returned by `input()' + as `EOF'. + + Input is instead controlled by defining the `YY_INPUT' macro. + + The `flex' restriction that `input()' cannot be redefined is in + accordance with the POSIX specification, which simply does not + specify any way of controlling the scanner's input other than by + making an initial assignment to `yyin'. + + - The `unput()' routine is not redefinable. This restriction is in + accordance with POSIX. + + - `flex' scanners are not as reentrant as `lex' scanners. In + particular, if you have an interactive scanner and an interrupt + handler which long-jumps out of the scanner, and the scanner is + subsequently called again, you may get the following message: + + fatal flex scanner internal error--end of buffer missed + + To reenter the scanner, first use + + yyrestart( yyin ); + + Note that this call will throw away any buffered input; usually + this isn't a problem with an interactive scanner. + + Also note that flex C++ scanner classes *are* reentrant, so if + using C++ is an option for you, you should use them instead. See + "Generating C++ Scanners" above for details. + + - `output()' is not supported. Output from the `ECHO' macro is done + to the file-pointer `yyout' (default `stdout'). + + `output()' is not part of the POSIX specification. + + - `lex' does not support exclusive start conditions (%x), though + they are in the POSIX specification. + + - When definitions are expanded, `flex' encloses them in + parentheses. With lex, the following: + + NAME [A-Z][A-Z0-9]* + %% + foo{NAME}? printf( "Found it\n" ); + %% + + will not match the string "foo" because when the macro is expanded + the rule is equivalent to "foo[A-Z][A-Z0-9]*?" and the precedence + is such that the '?' is associated with "[A-Z0-9]*". With `flex', + the rule will be expanded to "foo([A-Z][A-Z0-9]*)?" and so the + string "foo" will match. + + Note that if the definition begins with `^' or ends with `$' then + it is *not* expanded with parentheses, to allow these operators to + appear in definitions without losing their special meanings. But + the `<s>, /', and `<<EOF>>' operators cannot be used in a `flex' + definition. + + Using `-l' results in the `lex' behavior of no parentheses around + the definition. + + The POSIX specification is that the definition be enclosed in + parentheses. + + - Some implementations of `lex' allow a rule's action to begin on a + separate line, if the rule's pattern has trailing whitespace: + + %% + foo|bar<space here> + { foobar_action(); } + + `flex' does not support this feature. + + - The `lex' `%r' (generate a Ratfor scanner) option is not + supported. It is not part of the POSIX specification. + + - After a call to `unput()', `yytext' is undefined until the next + token is matched, unless the scanner was built using `%array'. + This is not the case with `lex' or the POSIX specification. The + `-l' option does away with this incompatibility. + + - The precedence of the `{}' (numeric range) operator is different. + `lex' interprets "abc{1,3}" as "match one, two, or three + occurrences of 'abc'", whereas `flex' interprets it as "match 'ab' + followed by one, two, or three occurrences of 'c'". The latter is + in agreement with the POSIX specification. + + - The precedence of the `^' operator is different. `lex' interprets + "^foo|bar" as "match either 'foo' at the beginning of a line, or + 'bar' anywhere", whereas `flex' interprets it as "match either + 'foo' or 'bar' if they come at the beginning of a line". The + latter is in agreement with the POSIX specification. + + - The special table-size declarations such as `%a' supported by + `lex' are not required by `flex' scanners; `flex' ignores them. + + - The name FLEX_SCANNER is #define'd so scanners may be written for + use with either `flex' or `lex'. Scanners also include + `YY_FLEX_MAJOR_VERSION' and `YY_FLEX_MINOR_VERSION' indicating + which version of `flex' generated the scanner (for example, for the + 2.5 release, these defines would be 2 and 5 respectively). + + The following `flex' features are not included in `lex' or the POSIX +specification: + + C++ scanners + %option + start condition scopes + start condition stacks + interactive/non-interactive scanners + yy_scan_string() and friends + yyterminate() + yy_set_interactive() + yy_set_bol() + YY_AT_BOL() + <<EOF>> + <*> + YY_DECL + YY_START + YY_USER_ACTION + YY_USER_INIT + #line directives + %{}'s around actions + multiple actions on a line + +plus almost all of the flex flags. The last feature in the list refers +to the fact that with `flex' you can put multiple actions on the same +line, separated with semicolons, while with `lex', the following + + foo handle_foo(); ++num_foos_seen; + +is (rather surprisingly) truncated to + + foo handle_foo(); + + `flex' does not truncate the action. Actions that are not enclosed +in braces are simply terminated at the end of the line. + + +File: flex.info, Node: Diagnostics, Next: Files, Prev: Incompatibilities, Up: Top + +Diagnostics +=========== + +`warning, rule cannot be matched' + indicates that the given rule cannot be matched because it follows + other rules that will always match the same text as it. For + example, in the following "foo" cannot be matched because it comes + after an identifier "catch-all" rule: + + [a-z]+ got_identifier(); + foo got_foo(); + + Using `REJECT' in a scanner suppresses this warning. + +`warning, -s option given but default rule can be matched' + means that it is possible (perhaps only in a particular start + condition) that the default rule (match any single character) is + the only one that will match a particular input. Since `-s' was + given, presumably this is not intended. + +`reject_used_but_not_detected undefined' +`yymore_used_but_not_detected undefined' + These errors can occur at compile time. They indicate that the + scanner uses `REJECT' or `yymore()' but that `flex' failed to + notice the fact, meaning that `flex' scanned the first two sections + looking for occurrences of these actions and failed to find any, + but somehow you snuck some in (via a #include file, for example). + Use `%option reject' or `%option yymore' to indicate to flex that + you really do use these features. + +`flex scanner jammed' + a scanner compiled with `-s' has encountered an input string which + wasn't matched by any of its rules. This error can also occur due + to internal problems. + +`token too large, exceeds YYLMAX' + your scanner uses `%array' and one of its rules matched a string + longer than the `YYL-' `MAX' constant (8K bytes by default). You + can increase the value by #define'ing `YYLMAX' in the definitions + section of your `flex' input. + +`scanner requires -8 flag to use the character 'X'' + Your scanner specification includes recognizing the 8-bit + character X and you did not specify the -8 flag, and your scanner + defaulted to 7-bit because you used the `-Cf' or `-CF' table + compression options. See the discussion of the `-7' flag for + details. + +`flex scanner push-back overflow' + you used `unput()' to push back so much text that the scanner's + buffer could not hold both the pushed-back text and the current + token in `yytext'. Ideally the scanner should dynamically resize + the buffer in this case, but at present it does not. + +`input buffer overflow, can't enlarge buffer because scanner uses REJECT' + the scanner was working on matching an extremely large token and + needed to expand the input buffer. This doesn't work with + scanners that use `REJECT'. + +`fatal flex scanner internal error--end of buffer missed' + This can occur in an scanner which is reentered after a long-jump + has jumped out (or over) the scanner's activation frame. Before + reentering the scanner, use: + + yyrestart( yyin ); + + or, as noted above, switch to using the C++ scanner class. + +`too many start conditions in <> construct!' + you listed more start conditions in a <> construct than exist (so + you must have listed at least one of them twice). + + +File: flex.info, Node: Files, Next: Deficiencies, Prev: Diagnostics, Up: Top + +Files +===== + +`-lfl' + library with which scanners must be linked. + +`lex.yy.c' + generated scanner (called `lexyy.c' on some systems). + +`lex.yy.cc' + generated C++ scanner class, when using `-+'. + +`<FlexLexer.h>' + header file defining the C++ scanner base class, `FlexLexer', and + its derived class, `yyFlexLexer'. + +`flex.skl' + skeleton scanner. This file is only used when building flex, not + when flex executes. + +`lex.backup' + backing-up information for `-b' flag (called `lex.bck' on some + systems). + + +File: flex.info, Node: Deficiencies, Next: See also, Prev: Files, Up: Top + +Deficiencies / Bugs +=================== + + Some trailing context patterns cannot be properly matched and +generate warning messages ("dangerous trailing context"). These are +patterns where the ending of the first part of the rule matches the +beginning of the second part, such as "zx*/xy*", where the 'x*' matches +the 'x' at the beginning of the trailing context. (Note that the POSIX +draft states that the text matched by such patterns is undefined.) + + For some trailing context rules, parts which are actually +fixed-length are not recognized as such, leading to the abovementioned +performance loss. In particular, parts using '|' or {n} (such as +"foo{3}") are always considered variable-length. + + Combining trailing context with the special '|' action can result in +*fixed* trailing context being turned into the more expensive VARIABLE +trailing context. For example, in the following: + + %% + abc | + xyz/def + + Use of `unput()' invalidates yytext and yyleng, unless the `%array' +directive or the `-l' option has been used. + + Pattern-matching of NUL's is substantially slower than matching +other characters. + + Dynamic resizing of the input buffer is slow, as it entails +rescanning all the text matched so far by the current (generally huge) +token. + + Due to both buffering of input and read-ahead, you cannot intermix +calls to <stdio.h> routines, such as, for example, `getchar()', with +`flex' rules and expect it to work. Call `input()' instead. + + The total table entries listed by the `-v' flag excludes the number +of table entries needed to determine what rule has been matched. The +number of entries is equal to the number of DFA states if the scanner +does not use `REJECT', and somewhat greater than the number of states +if it does. + + `REJECT' cannot be used with the `-f' or `-F' options. + + The `flex' internal algorithms need documentation. + + +File: flex.info, Node: See also, Next: Author, Prev: Deficiencies, Up: Top + +See also +======== + + `lex'(1), `yacc'(1), `sed'(1), `awk'(1). + + John Levine, Tony Mason, and Doug Brown: Lex & Yacc; O'Reilly and +Associates. Be sure to get the 2nd edition. + + M. E. Lesk and E. Schmidt, LEX - Lexical Analyzer Generator. + + Alfred Aho, Ravi Sethi and Jeffrey Ullman: Compilers: Principles, +Techniques and Tools; Addison-Wesley (1986). Describes the +pattern-matching techniques used by `flex' (deterministic finite +automata). + + +File: flex.info, Node: Author, Prev: See also, Up: Top + +Author +====== + + Vern Paxson, with the help of many ideas and much inspiration from +Van Jacobson. Original version by Jef Poskanzer. The fast table +representation is a partial implementation of a design done by Van +Jacobson. The implementation was done by Kevin Gong and Vern Paxson. + + Thanks to the many `flex' beta-testers, feedbackers, and +contributors, especially Francois Pinard, Casey Leedom, Stan Adermann, +Terry Allen, David Barker-Plummer, John Basrai, Nelson H.F. Beebe, +`benson@odi.com', Karl Berry, Peter A. Bigot, Simon Blanchard, Keith +Bostic, Frederic Brehm, Ian Brockbank, Kin Cho, Nick Christopher, Brian +Clapper, J.T. Conklin, Jason Coughlin, Bill Cox, Nick Cropper, Dave +Curtis, Scott David Daniels, Chris G. Demetriou, Theo Deraadt, Mike +Donahue, Chuck Doucette, Tom Epperly, Leo Eskin, Chris Faylor, Chris +Flatters, Jon Forrest, Joe Gayda, Kaveh R. Ghazi, Eric Goldman, +Christopher M. Gould, Ulrich Grepel, Peer Griebel, Jan Hajic, Charles +Hemphill, NORO Hideo, Jarkko Hietaniemi, Scott Hofmann, Jeff Honig, +Dana Hudes, Eric Hughes, John Interrante, Ceriel Jacobs, Michal +Jaegermann, Sakari Jalovaara, Jeffrey R. Jones, Henry Juengst, Klaus +Kaempf, Jonathan I. Kamens, Terrence O Kane, Amir Katz, +`ken@ken.hilco.com', Kevin B. Kenny, Steve Kirsch, Winfried Koenig, +Marq Kole, Ronald Lamprecht, Greg Lee, Rohan Lenard, Craig Leres, John +Levine, Steve Liddle, Mike Long, Mohamed el Lozy, Brian Madsen, Malte, +Joe Marshall, Bengt Martensson, Chris Metcalf, Luke Mewburn, Jim +Meyering, R. Alexander Milowski, Erik Naggum, G.T. Nicol, Landon Noll, +James Nordby, Marc Nozell, Richard Ohnemus, Karsten Pahnke, Sven Panne, +Roland Pesch, Walter Pelissero, Gaumond Pierre, Esmond Pitt, Jef +Poskanzer, Joe Rahmeh, Jarmo Raiha, Frederic Raimbault, Pat Rankin, +Rick Richardson, Kevin Rodgers, Kai Uwe Rommel, Jim Roskind, Alberto +Santini, Andreas Scherer, Darrell Schiebel, Raf Schietekat, Doug +Schmidt, Philippe Schnoebelen, Andreas Schwab, Alex Siegel, Eckehard +Stolz, Jan-Erik Strvmquist, Mike Stump, Paul Stuart, Dave Tallman, Ian +Lance Taylor, Chris Thewalt, Richard M. Timoney, Jodi Tsai, Paul +Tuinenga, Gary Weik, Frank Whaley, Gerhard Wilhelms, Kent Williams, Ken +Yap, Ron Zellar, Nathan Zelle, David Zuhn, and those whose names have +slipped my marginal mail-archiving skills but whose contributions are +appreciated all the same. + + Thanks to Keith Bostic, Jon Forrest, Noah Friedman, John Gilmore, +Craig Leres, John Levine, Bob Mulcahy, G.T. Nicol, Francois Pinard, +Rich Salz, and Richard Stallman for help with various distribution +headaches. + + Thanks to Esmond Pitt and Earle Horton for 8-bit character support; +to Benson Margulies and Fred Burke for C++ support; to Kent Williams +and Tom Epperly for C++ class support; to Ove Ewerlid for support of +NUL's; and to Eric Hughes for support of multiple buffers. + + This work was primarily done when I was with the Real Time Systems +Group at the Lawrence Berkeley Laboratory in Berkeley, CA. Many thanks +to all there for the support I received. + + Send comments to `vern@ee.lbl.gov'. + + + +Tag Table: +Node: Top1430 +Node: Name2808 +Node: Synopsis2933 +Node: Overview3145 +Node: Description4986 +Node: Examples5748 +Node: Format8896 +Node: Patterns11637 +Node: Matching18138 +Node: Actions21438 +Node: Generated scanner30560 +Node: Start conditions34988 +Node: Multiple buffers45069 +Node: End-of-file rules50975 +Node: Miscellaneous52508 +Node: User variables55279 +Node: YACC interface57651 +Node: Options58542 +Node: Performance78234 +Node: C++87532 +Node: Incompatibilities94993 +Node: Diagnostics101853 +Node: Files105094 +Node: Deficiencies105715 +Node: See also107684 +Node: Author108216 + +End Tag Table diff --git a/MISC/texinfo/flex.texi b/MISC/texinfo/flex.texi new file mode 100644 index 0000000..23280b1 --- /dev/null +++ b/MISC/texinfo/flex.texi @@ -0,0 +1,3448 @@ +\input texinfo +@c %**start of header +@setfilename flex.info +@settitle Flex - a scanner generator +@c @finalout +@c @setchapternewpage odd +@c %**end of header + +@set EDITION 2.5 +@set UPDATED March 1995 +@set VERSION 2.5 + +@c FIXME - Reread a printed copy with a red pen and patience. +@c FIXME - Modify all "See ..." references and replace with @xref's. + +@ifinfo +@format +START-INFO-DIR-ENTRY +* Flex: (flex). A fast scanner generator. +END-INFO-DIR-ENTRY +@end format +@end ifinfo + +@c Define new indices for commands, filenames, and options. +@c @defcodeindex cm +@c @defcodeindex fl +@c @defcodeindex op + +@c Put everything in one index (arbitrarily chosen to be the concept index). +@c @syncodeindex cm cp +@c @syncodeindex fl cp +@syncodeindex fn cp +@syncodeindex ky cp +@c @syncodeindex op cp +@syncodeindex pg cp +@syncodeindex vr cp + +@ifinfo +This file documents Flex. + +Copyright (c) 1990 The Regents of the University of California. +All rights reserved. + +This code is derived from software contributed to Berkeley by +Vern Paxson. + +The United States Government has rights in this work pursuant +to contract no. DE-AC03-76SF00098 between the United States +Department of Energy and the University of California. + +Redistribution and use in source and binary forms with or without +modification are permitted provided that: (1) source distributions +retain this entire copyright notice and comment, and (2) +distributions including binaries display the following +acknowledgement: ``This product includes software developed by the +University of California, Berkeley and its contributors'' in the +documentation or other materials provided with the distribution and +in all advertising materials mentioning features or use of this +software. Neither the name of the University nor the names of its +contributors may be used to endorse or promote products derived +from this software without specific prior written permission. + +THIS SOFTWARE IS PROVIDED ``AS IS'' AND WITHOUT ANY EXPRESS OR +IMPLIED WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED +WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +PURPOSE. + +@ignore +Permission is granted to process this file through TeX and print the +results, provided the printed document carries copying permission +notice identical to this one except for the removal of this paragraph +(this paragraph not being relevant to the printed manual). + +@end ignore +@end ifinfo + +@titlepage +@title Flex, version @value{VERSION} +@subtitle A fast scanner generator +@subtitle Edition @value{EDITION}, @value{UPDATED} +@author Vern Paxson + +@page +@vskip 0pt plus 1filll +Copyright @copyright{} 1990 The Regents of the University of California. +All rights reserved. + +This code is derived from software contributed to Berkeley by +Vern Paxson. + +The United States Government has rights in this work pursuant +to contract no. DE-AC03-76SF00098 between the United States +Department of Energy and the University of California. + +Redistribution and use in source and binary forms with or without +modification are permitted provided that: (1) source distributions +retain this entire copyright notice and comment, and (2) +distributions including binaries display the following +acknowledgement: ``This product includes software developed by the +University of California, Berkeley and its contributors'' in the +documentation or other materials provided with the distribution and +in all advertising materials mentioning features or use of this +software. Neither the name of the University nor the names of its +contributors may be used to endorse or promote products derived +from this software without specific prior written permission. + +THIS SOFTWARE IS PROVIDED ``AS IS'' AND WITHOUT ANY EXPRESS OR +IMPLIED WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED +WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +PURPOSE. +@end titlepage + +@ifinfo + +@node Top, Name, (dir), (dir) +@top flex + +@cindex scanner generator + +This manual documents @code{flex}. It covers release @value{VERSION}. + +@menu +* Name:: Name +* Synopsis:: Synopsis +* Overview:: Overview +* Description:: Description +* Examples:: Some simple examples +* Format:: Format of the input file +* Patterns:: Patterns +* Matching:: How the input is matched +* Actions:: Actions +* Generated scanner:: The generated scanner +* Start conditions:: Start conditions +* Multiple buffers:: Multiple input buffers +* End-of-file rules:: End-of-file rules +* Miscellaneous:: Miscellaneous macros +* User variables:: Values available to the user +* YACC interface:: Interfacing with @code{yacc} +* Options:: Options +* Performance:: Performance considerations +* C++:: Generating C++ scanners +* Incompatibilities:: Incompatibilities with @code{lex} and POSIX +* Diagnostics:: Diagnostics +* Files:: Files +* Deficiencies:: Deficiencies / Bugs +* See also:: See also +* Author:: Author +@c * Index:: Index +@end menu + +@end ifinfo + +@node Name, Synopsis, Top, Top +@section Name + +flex - fast lexical analyzer generator + +@node Synopsis, Overview, Name, Top +@section Synopsis + +@example +flex [-bcdfhilnpstvwBFILTV78+? -C[aefFmr] -ooutput -Pprefix -Sskeleton] +[--help --version] [@var{filename} @dots{}] +@end example + +@node Overview, Description, Synopsis, Top +@section Overview + +This manual describes @code{flex}, a tool for generating programs +that perform pattern-matching on text. The manual +includes both tutorial and reference sections: + +@table @asis +@item Description +a brief overview of the tool + +@item Some Simple Examples + +@item Format Of The Input File + +@item Patterns +the extended regular expressions used by flex + +@item How The Input Is Matched +the rules for determining what has been matched + +@item Actions +how to specify what to do when a pattern is matched + +@item The Generated Scanner +details regarding the scanner that flex produces; +how to control the input source + +@item Start Conditions +introducing context into your scanners, and +managing "mini-scanners" + +@item Multiple Input Buffers +how to manipulate multiple input sources; how to +scan from strings instead of files + +@item End-of-file Rules +special rules for matching the end of the input + +@item Miscellaneous Macros +a summary of macros available to the actions + +@item Values Available To The User +a summary of values available to the actions + +@item Interfacing With Yacc +connecting flex scanners together with yacc parsers + +@item Options +flex command-line options, and the "%option" +directive + +@item Performance Considerations +how to make your scanner go as fast as possible + +@item Generating C++ Scanners +the (experimental) facility for generating C++ +scanner classes + +@item Incompatibilities With Lex And POSIX +how flex differs from AT&T lex and the POSIX lex +standard + +@item Diagnostics +those error messages produced by flex (or scanners +it generates) whose meanings might not be apparent + +@item Files +files used by flex + +@item Deficiencies / Bugs +known problems with flex + +@item See Also +other documentation, related tools + +@item Author +includes contact information +@end table + +@node Description, Examples, Overview, Top +@section Description + +@code{flex} is a tool for generating @dfn{scanners}: programs which +recognized lexical patterns in text. @code{flex} reads the given +input files, or its standard input if no file names are +given, for a description of a scanner to generate. The +description is in the form of pairs of regular expressions +and C code, called @dfn{rules}. @code{flex} generates as output a C +source file, @file{lex.yy.c}, which defines a routine @samp{yylex()}. +This file is compiled and linked with the @samp{-lfl} library to +produce an executable. When the executable is run, it +analyzes its input for occurrences of the regular +expressions. Whenever it finds one, it executes the +corresponding C code. + +@node Examples, Format, Description, Top +@section Some simple examples + +First some simple examples to get the flavor of how one +uses @code{flex}. The following @code{flex} input specifies a scanner +which whenever it encounters the string "username" will +replace it with the user's login name: + +@example +%% +username printf( "%s", getlogin() ); +@end example + +By default, any text not matched by a @code{flex} scanner is +copied to the output, so the net effect of this scanner is +to copy its input file to its output with each occurrence +of "username" expanded. In this input, there is just one +rule. "username" is the @var{pattern} and the "printf" is the +@var{action}. The "%%" marks the beginning of the rules. + +Here's another simple example: + +@example + int num_lines = 0, num_chars = 0; + +%% +\n ++num_lines; ++num_chars; +. ++num_chars; + +%% +main() + @{ + yylex(); + printf( "# of lines = %d, # of chars = %d\n", + num_lines, num_chars ); + @} +@end example + +This scanner counts the number of characters and the +number of lines in its input (it produces no output other +than the final report on the counts). The first line +declares two globals, "num_lines" and "num_chars", which +are accessible both inside @samp{yylex()} and in the @samp{main()} +routine declared after the second "%%". There are two rules, +one which matches a newline ("\n") and increments both the +line count and the character count, and one which matches +any character other than a newline (indicated by the "." +regular expression). + +A somewhat more complicated example: + +@example +/* scanner for a toy Pascal-like language */ + +%@{ +/* need this for the call to atof() below */ +#include <math.h> +%@} + +DIGIT [0-9] +ID [a-z][a-z0-9]* + +%% + +@{DIGIT@}+ @{ + printf( "An integer: %s (%d)\n", yytext, + atoi( yytext ) ); + @} + +@{DIGIT@}+"."@{DIGIT@}* @{ + printf( "A float: %s (%g)\n", yytext, + atof( yytext ) ); + @} + +if|then|begin|end|procedure|function @{ + printf( "A keyword: %s\n", yytext ); + @} + +@{ID@} printf( "An identifier: %s\n", yytext ); + +"+"|"-"|"*"|"/" printf( "An operator: %s\n", yytext ); + +"@{"[^@}\n]*"@}" /* eat up one-line comments */ + +[ \t\n]+ /* eat up whitespace */ + +. printf( "Unrecognized character: %s\n", yytext ); + +%% + +main( argc, argv ) +int argc; +char **argv; + @{ + ++argv, --argc; /* skip over program name */ + if ( argc > 0 ) + yyin = fopen( argv[0], "r" ); + else + yyin = stdin; + + yylex(); + @} +@end example + +This is the beginnings of a simple scanner for a language +like Pascal. It identifies different types of @var{tokens} and +reports on what it has seen. + +The details of this example will be explained in the +following sections. + +@node Format, Patterns, Examples, Top +@section Format of the input file + +The @code{flex} input file consists of three sections, separated +by a line with just @samp{%%} in it: + +@example +definitions +%% +rules +%% +user code +@end example + +The @dfn{definitions} section contains declarations of simple +@dfn{name} definitions to simplify the scanner specification, +and declarations of @dfn{start conditions}, which are explained +in a later section. +Name definitions have the form: + +@example +name definition +@end example + +The "name" is a word beginning with a letter or an +underscore ('_') followed by zero or more letters, digits, '_', +or '-' (dash). The definition is taken to begin at the +first non-white-space character following the name and +continuing to the end of the line. The definition can +subsequently be referred to using "@{name@}", which will +expand to "(definition)". For example, + +@example +DIGIT [0-9] +ID [a-z][a-z0-9]* +@end example + +@noindent +defines "DIGIT" to be a regular expression which matches a +single digit, and "ID" to be a regular expression which +matches a letter followed by zero-or-more +letters-or-digits. A subsequent reference to + +@example +@{DIGIT@}+"."@{DIGIT@}* +@end example + +@noindent +is identical to + +@example +([0-9])+"."([0-9])* +@end example + +@noindent +and matches one-or-more digits followed by a '.' followed +by zero-or-more digits. + +The @var{rules} section of the @code{flex} input contains a series of +rules of the form: + +@example +pattern action +@end example + +@noindent +where the pattern must be unindented and the action must +begin on the same line. + +See below for a further description of patterns and +actions. + +Finally, the user code section is simply copied to +@file{lex.yy.c} verbatim. It is used for companion routines +which call or are called by the scanner. The presence of +this section is optional; if it is missing, the second @samp{%%} +in the input file may be skipped, too. + +In the definitions and rules sections, any @emph{indented} text or +text enclosed in @samp{%@{} and @samp{%@}} is copied verbatim to the +output (with the @samp{%@{@}}'s removed). The @samp{%@{@}}'s must +appear unindented on lines by themselves. + +In the rules section, any indented or %@{@} text appearing +before the first rule may be used to declare variables +which are local to the scanning routine and (after the +declarations) code which is to be executed whenever the +scanning routine is entered. Other indented or %@{@} text +in the rule section is still copied to the output, but its +meaning is not well-defined and it may well cause +compile-time errors (this feature is present for @code{POSIX} compliance; +see below for other such features). + +In the definitions section (but not in the rules section), +an unindented comment (i.e., a line beginning with "/*") +is also copied verbatim to the output up to the next "*/". + +@node Patterns, Matching, Format, Top +@section Patterns + +The patterns in the input are written using an extended +set of regular expressions. These are: + +@table @samp +@item x +match the character @samp{x} +@item . +any character (byte) except newline +@item [xyz] +a "character class"; in this case, the pattern +matches either an @samp{x}, a @samp{y}, or a @samp{z} +@item [abj-oZ] +a "character class" with a range in it; matches +an @samp{a}, a @samp{b}, any letter from @samp{j} through @samp{o}, +or a @samp{Z} +@item [^A-Z] +a "negated character class", i.e., any character +but those in the class. In this case, any +character EXCEPT an uppercase letter. +@item [^A-Z\n] +any character EXCEPT an uppercase letter or +a newline +@item @var{r}* +zero or more @var{r}'s, where @var{r} is any regular expression +@item @var{r}+ +one or more @var{r}'s +@item @var{r}? +zero or one @var{r}'s (that is, "an optional @var{r}") +@item @var{r}@{2,5@} +anywhere from two to five @var{r}'s +@item @var{r}@{2,@} +two or more @var{r}'s +@item @var{r}@{4@} +exactly 4 @var{r}'s +@item @{@var{name}@} +the expansion of the "@var{name}" definition +(see above) +@item "[xyz]\"foo" +the literal string: @samp{[xyz]"foo} +@item \@var{x} +if @var{x} is an @samp{a}, @samp{b}, @samp{f}, @samp{n}, @samp{r}, @samp{t}, or @samp{v}, +then the ANSI-C interpretation of \@var{x}. +Otherwise, a literal @samp{@var{x}} (used to escape +operators such as @samp{*}) +@item \0 +a NUL character (ASCII code 0) +@item \123 +the character with octal value 123 +@item \x2a +the character with hexadecimal value @code{2a} +@item (@var{r}) +match an @var{r}; parentheses are used to override +precedence (see below) +@item @var{r}@var{s} +the regular expression @var{r} followed by the +regular expression @var{s}; called "concatenation" +@item @var{r}|@var{s} +either an @var{r} or an @var{s} +@item @var{r}/@var{s} +an @var{r} but only if it is followed by an @var{s}. The text +matched by @var{s} is included when determining whether this rule is +the @dfn{longest match}, but is then returned to the input before +the action is executed. So the action only sees the text matched +by @var{r}. This type of pattern is called @dfn{trailing context}. +(There are some combinations of @samp{@var{r}/@var{s}} that @code{flex} +cannot match correctly; see notes in the Deficiencies / Bugs section +below regarding "dangerous trailing context".) +@item ^@var{r} +an @var{r}, but only at the beginning of a line (i.e., +which just starting to scan, or right after a +newline has been scanned). +@item @var{r}$ +an @var{r}, but only at the end of a line (i.e., just +before a newline). Equivalent to "@var{r}/\n". + +Note that flex's notion of "newline" is exactly +whatever the C compiler used to compile flex +interprets '\n' as; in particular, on some DOS +systems you must either filter out \r's in the +input yourself, or explicitly use @var{r}/\r\n for "r$". +@item <@var{s}>@var{r} +an @var{r}, but only in start condition @var{s} (see +below for discussion of start conditions) +<@var{s1},@var{s2},@var{s3}>@var{r} +same, but in any of start conditions @var{s1}, +@var{s2}, or @var{s3} +@item <*>@var{r} +an @var{r} in any start condition, even an exclusive one. +@item <<EOF>> +an end-of-file +<@var{s1},@var{s2}><<EOF>> +an end-of-file when in start condition @var{s1} or @var{s2} +@end table + +Note that inside of a character class, all regular +expression operators lose their special meaning except escape +('\') and the character class operators, '-', ']', and, at +the beginning of the class, '^'. + +The regular expressions listed above are grouped according +to precedence, from highest precedence at the top to +lowest at the bottom. Those grouped together have equal +precedence. For example, + +@example +foo|bar* +@end example + +@noindent +is the same as + +@example +(foo)|(ba(r*)) +@end example + +@noindent +since the '*' operator has higher precedence than +concatenation, and concatenation higher than alternation ('|'). +This pattern therefore matches @emph{either} the string "foo" @emph{or} +the string "ba" followed by zero-or-more r's. To match +"foo" or zero-or-more "bar"'s, use: + +@example +foo|(bar)* +@end example + +@noindent +and to match zero-or-more "foo"'s-or-"bar"'s: + +@example +(foo|bar)* +@end example + +In addition to characters and ranges of characters, +character classes can also contain character class +@dfn{expressions}. These are expressions enclosed inside @samp{[}: and @samp{:}] +delimiters (which themselves must appear between the '[' +and ']' of the character class; other elements may occur +inside the character class, too). The valid expressions +are: + +@example +[:alnum:] [:alpha:] [:blank:] +[:cntrl:] [:digit:] [:graph:] +[:lower:] [:print:] [:punct:] +[:space:] [:upper:] [:xdigit:] +@end example + +These expressions all designate a set of characters +equivalent to the corresponding standard C @samp{isXXX} function. For +example, @samp{[:alnum:]} designates those characters for which +@samp{isalnum()} returns true - i.e., any alphabetic or numeric. +Some systems don't provide @samp{isblank()}, so flex defines +@samp{[:blank:]} as a blank or a tab. + +For example, the following character classes are all +equivalent: + +@example +[[:alnum:]] +[[:alpha:][:digit:] +[[:alpha:]0-9] +[a-zA-Z0-9] +@end example + +If your scanner is case-insensitive (the @samp{-i} flag), then +@samp{[:upper:]} and @samp{[:lower:]} are equivalent to @samp{[:alpha:]}. + +Some notes on patterns: + +@itemize - +@item +A negated character class such as the example +"[^A-Z]" above @emph{will match a newline} unless "\n" (or an +equivalent escape sequence) is one of the +characters explicitly present in the negated character +class (e.g., "[^A-Z\n]"). This is unlike how many +other regular expression tools treat negated +character classes, but unfortunately the inconsistency +is historically entrenched. Matching newlines +means that a pattern like [^"]* can match the +entire input unless there's another quote in the +input. + +@item +A rule can have at most one instance of trailing +context (the '/' operator or the '$' operator). +The start condition, '^', and "<<EOF>>" patterns +can only occur at the beginning of a pattern, and, +as well as with '/' and '$', cannot be grouped +inside parentheses. A '^' which does not occur at +the beginning of a rule or a '$' which does not +occur at the end of a rule loses its special +properties and is treated as a normal character. + +The following are illegal: + +@example +foo/bar$ +<sc1>foo<sc2>bar +@end example + +Note that the first of these, can be written +"foo/bar\n". + +The following will result in '$' or '^' being +treated as a normal character: + +@example +foo|(bar$) +foo|^bar +@end example + +If what's wanted is a "foo" or a +bar-followed-by-a-newline, the following could be used (the special +'|' action is explained below): + +@example +foo | +bar$ /* action goes here */ +@end example + +A similar trick will work for matching a foo or a +bar-at-the-beginning-of-a-line. +@end itemize + +@node Matching, Actions, Patterns, Top +@section How the input is matched + +When the generated scanner is run, it analyzes its input +looking for strings which match any of its patterns. If +it finds more than one match, it takes the one matching +the most text (for trailing context rules, this includes +the length of the trailing part, even though it will then +be returned to the input). If it finds two or more +matches of the same length, the rule listed first in the +@code{flex} input file is chosen. + +Once the match is determined, the text corresponding to +the match (called the @var{token}) is made available in the +global character pointer @code{yytext}, and its length in the +global integer @code{yyleng}. The @var{action} corresponding to the +matched pattern is then executed (a more detailed +description of actions follows), and then the remaining input is +scanned for another match. + +If no match is found, then the @dfn{default rule} is executed: +the next character in the input is considered matched and +copied to the standard output. Thus, the simplest legal +@code{flex} input is: + +@example +%% +@end example + +which generates a scanner that simply copies its input +(one character at a time) to its output. + +Note that @code{yytext} can be defined in two different ways: +either as a character @emph{pointer} or as a character @emph{array}. +You can control which definition @code{flex} uses by including +one of the special directives @samp{%pointer} or @samp{%array} in the +first (definitions) section of your flex input. The +default is @samp{%pointer}, unless you use the @samp{-l} lex +compatibility option, in which case @code{yytext} will be an array. The +advantage of using @samp{%pointer} is substantially faster +scanning and no buffer overflow when matching very large +tokens (unless you run out of dynamic memory). The +disadvantage is that you are restricted in how your actions can +modify @code{yytext} (see the next section), and calls to the +@samp{unput()} function destroys the present contents of @code{yytext}, +which can be a considerable porting headache when moving +between different @code{lex} versions. + +The advantage of @samp{%array} is that you can then modify @code{yytext} +to your heart's content, and calls to @samp{unput()} do not +destroy @code{yytext} (see below). Furthermore, existing @code{lex} +programs sometimes access @code{yytext} externally using +declarations of the form: +@example +extern char yytext[]; +@end example +This definition is erroneous when used with @samp{%pointer}, but +correct for @samp{%array}. + +@samp{%array} defines @code{yytext} to be an array of @code{YYLMAX} characters, +which defaults to a fairly large value. You can change +the size by simply #define'ing @code{YYLMAX} to a different value +in the first section of your @code{flex} input. As mentioned +above, with @samp{%pointer} yytext grows dynamically to +accommodate large tokens. While this means your @samp{%pointer} scanner +can accommodate very large tokens (such as matching entire +blocks of comments), bear in mind that each time the +scanner must resize @code{yytext} it also must rescan the entire +token from the beginning, so matching such tokens can +prove slow. @code{yytext} presently does @emph{not} dynamically grow if +a call to @samp{unput()} results in too much text being pushed +back; instead, a run-time error results. + +Also note that you cannot use @samp{%array} with C++ scanner +classes (the @code{c++} option; see below). + +@node Actions, Generated scanner, Matching, Top +@section Actions + +Each pattern in a rule has a corresponding action, which +can be any arbitrary C statement. The pattern ends at the +first non-escaped whitespace character; the remainder of +the line is its action. If the action is empty, then when +the pattern is matched the input token is simply +discarded. For example, here is the specification for a +program which deletes all occurrences of "zap me" from its +input: + +@example +%% +"zap me" +@end example + +(It will copy all other characters in the input to the +output since they will be matched by the default rule.) + +Here is a program which compresses multiple blanks and +tabs down to a single blank, and throws away whitespace +found at the end of a line: + +@example +%% +[ \t]+ putchar( ' ' ); +[ \t]+$ /* ignore this token */ +@end example + +If the action contains a '@{', then the action spans till +the balancing '@}' is found, and the action may cross +multiple lines. @code{flex} knows about C strings and comments and +won't be fooled by braces found within them, but also +allows actions to begin with @samp{%@{} and will consider the +action to be all the text up to the next @samp{%@}} (regardless of +ordinary braces inside the action). + +An action consisting solely of a vertical bar ('|') means +"same as the action for the next rule." See below for an +illustration. + +Actions can include arbitrary C code, including @code{return} +statements to return a value to whatever routine called +@samp{yylex()}. Each time @samp{yylex()} is called it continues +processing tokens from where it last left off until it either +reaches the end of the file or executes a return. + +Actions are free to modify @code{yytext} except for lengthening +it (adding characters to its end--these will overwrite +later characters in the input stream). This however does +not apply when using @samp{%array} (see above); in that case, +@code{yytext} may be freely modified in any way. + +Actions are free to modify @code{yyleng} except they should not +do so if the action also includes use of @samp{yymore()} (see +below). + +There are a number of special directives which can be +included within an action: + +@itemize - +@item +@samp{ECHO} copies yytext to the scanner's output. + +@item +@code{BEGIN} followed by the name of a start condition +places the scanner in the corresponding start +condition (see below). + +@item +@code{REJECT} directs the scanner to proceed on to the +"second best" rule which matched the input (or a +prefix of the input). The rule is chosen as +described above in "How the Input is Matched", and +@code{yytext} and @code{yyleng} set up appropriately. It may +either be one which matched as much text as the +originally chosen rule but came later in the @code{flex} +input file, or one which matched less text. For +example, the following will both count the words in +the input and call the routine special() whenever +"frob" is seen: + +@example + int word_count = 0; +%% + +frob special(); REJECT; +[^ \t\n]+ ++word_count; +@end example + +Without the @code{REJECT}, any "frob"'s in the input would +not be counted as words, since the scanner normally +executes only one action per token. Multiple +@code{REJECT's} are allowed, each one finding the next +best choice to the currently active rule. For +example, when the following scanner scans the token +"abcd", it will write "abcdabcaba" to the output: + +@example +%% +a | +ab | +abc | +abcd ECHO; REJECT; +.|\n /* eat up any unmatched character */ +@end example + +(The first three rules share the fourth's action +since they use the special '|' action.) @code{REJECT} is +a particularly expensive feature in terms of +scanner performance; if it is used in @emph{any} of the +scanner's actions it will slow down @emph{all} of the +scanner's matching. Furthermore, @code{REJECT} cannot be used +with the @samp{-Cf} or @samp{-CF} options (see below). + +Note also that unlike the other special actions, +@code{REJECT} is a @emph{branch}; code immediately following it +in the action will @emph{not} be executed. + +@item +@samp{yymore()} tells the scanner that the next time it +matches a rule, the corresponding token should be +@emph{appended} onto the current value of @code{yytext} rather +than replacing it. For example, given the input +"mega-kludge" the following will write +"mega-mega-kludge" to the output: + +@example +%% +mega- ECHO; yymore(); +kludge ECHO; +@end example + +First "mega-" is matched and echoed to the output. +Then "kludge" is matched, but the previous "mega-" +is still hanging around at the beginning of @code{yytext} +so the @samp{ECHO} for the "kludge" rule will actually +write "mega-kludge". +@end itemize + +Two notes regarding use of @samp{yymore()}. First, @samp{yymore()} +depends on the value of @code{yyleng} correctly reflecting the +size of the current token, so you must not modify @code{yyleng} +if you are using @samp{yymore()}. Second, the presence of +@samp{yymore()} in the scanner's action entails a minor +performance penalty in the scanner's matching speed. + +@itemize - +@item +@samp{yyless(n)} returns all but the first @var{n} characters of +the current token back to the input stream, where +they will be rescanned when the scanner looks for +the next match. @code{yytext} and @code{yyleng} are adjusted +appropriately (e.g., @code{yyleng} will now be equal to @var{n} +). For example, on the input "foobar" the +following will write out "foobarbar": + +@example +%% +foobar ECHO; yyless(3); +[a-z]+ ECHO; +@end example + +An argument of 0 to @code{yyless} will cause the entire +current input string to be scanned again. Unless +you've changed how the scanner will subsequently +process its input (using @code{BEGIN}, for example), this +will result in an endless loop. + +Note that @code{yyless} is a macro and can only be used in the +flex input file, not from other source files. + +@item +@samp{unput(c)} puts the character @code{c} back onto the input +stream. It will be the next character scanned. +The following action will take the current token +and cause it to be rescanned enclosed in +parentheses. + +@example +@{ +int i; +/* Copy yytext because unput() trashes yytext */ +char *yycopy = strdup( yytext ); +unput( ')' ); +for ( i = yyleng - 1; i >= 0; --i ) + unput( yycopy[i] ); +unput( '(' ); +free( yycopy ); +@} +@end example + +Note that since each @samp{unput()} puts the given +character back at the @emph{beginning} of the input stream, +pushing back strings must be done back-to-front. +An important potential problem when using @samp{unput()} is that +if you are using @samp{%pointer} (the default), a call to @samp{unput()} +@emph{destroys} the contents of @code{yytext}, starting with its +rightmost character and devouring one character to the left +with each call. If you need the value of yytext preserved +after a call to @samp{unput()} (as in the above example), you +must either first copy it elsewhere, or build your scanner +using @samp{%array} instead (see How The Input Is Matched). + +Finally, note that you cannot put back @code{EOF} to attempt to +mark the input stream with an end-of-file. + +@item +@samp{input()} reads the next character from the input +stream. For example, the following is one way to +eat up C comments: + +@example +%% +"/*" @{ + register int c; + + for ( ; ; ) + @{ + while ( (c = input()) != '*' && + c != EOF ) + ; /* eat up text of comment */ + + if ( c == '*' ) + @{ + while ( (c = input()) == '*' ) + ; + if ( c == '/' ) + break; /* found the end */ + @} + + if ( c == EOF ) + @{ + error( "EOF in comment" ); + break; + @} + @} + @} +@end example + +(Note that if the scanner is compiled using @samp{C++}, +then @samp{input()} is instead referred to as @samp{yyinput()}, +in order to avoid a name clash with the @samp{C++} stream +by the name of @code{input}.) + +@item YY_FLUSH_BUFFER +flushes the scanner's internal buffer so that the next time the scanner +attempts to match a token, it will first refill the buffer using +@code{YY_INPUT} (see The Generated Scanner, below). This action is +a special case of the more general @samp{yy_flush_buffer()} function, +described below in the section Multiple Input Buffers. + +@item +@samp{yyterminate()} can be used in lieu of a return +statement in an action. It terminates the scanner +and returns a 0 to the scanner's caller, indicating +"all done". By default, @samp{yyterminate()} is also +called when an end-of-file is encountered. It is a +macro and may be redefined. +@end itemize + +@node Generated scanner, Start conditions, Actions, Top +@section The generated scanner + +The output of @code{flex} is the file @file{lex.yy.c}, which contains +the scanning routine @samp{yylex()}, a number of tables used by +it for matching tokens, and a number of auxiliary routines +and macros. By default, @samp{yylex()} is declared as follows: + +@example +int yylex() + @{ + @dots{} various definitions and the actions in here @dots{} + @} +@end example + +(If your environment supports function prototypes, then it +will be "int yylex( void )".) This definition may be +changed by defining the "YY_DECL" macro. For example, you +could use: + +@example +#define YY_DECL float lexscan( a, b ) float a, b; +@end example + +to give the scanning routine the name @code{lexscan}, returning a +float, and taking two floats as arguments. Note that if +you give arguments to the scanning routine using a +K&R-style/non-prototyped function declaration, you must +terminate the definition with a semi-colon (@samp{;}). + +Whenever @samp{yylex()} is called, it scans tokens from the +global input file @code{yyin} (which defaults to stdin). It +continues until it either reaches an end-of-file (at which +point it returns the value 0) or one of its actions +executes a @code{return} statement. + +If the scanner reaches an end-of-file, subsequent calls are undefined +unless either @code{yyin} is pointed at a new input file (in which case +scanning continues from that file), or @samp{yyrestart()} is called. +@samp{yyrestart()} takes one argument, a @samp{FILE *} pointer (which +can be nil, if you've set up @code{YY_INPUT} to scan from a source +other than @code{yyin}), and initializes @code{yyin} for scanning from +that file. Essentially there is no difference between just assigning +@code{yyin} to a new input file or using @samp{yyrestart()} to do so; +the latter is available for compatibility with previous versions of +@code{flex}, and because it can be used to switch input files in the +middle of scanning. It can also be used to throw away the current +input buffer, by calling it with an argument of @code{yyin}; but +better is to use @code{YY_FLUSH_BUFFER} (see above). Note that +@samp{yyrestart()} does @emph{not} reset the start condition to +@code{INITIAL} (see Start Conditions, below). + + +If @samp{yylex()} stops scanning due to executing a @code{return} +statement in one of the actions, the scanner may then be called +again and it will resume scanning where it left off. + +By default (and for purposes of efficiency), the scanner +uses block-reads rather than simple @samp{getc()} calls to read +characters from @code{yyin}. The nature of how it gets its input +can be controlled by defining the @code{YY_INPUT} macro. +YY_INPUT's calling sequence is +"YY_INPUT(buf,result,max_size)". Its action is to place +up to @var{max_size} characters in the character array @var{buf} and +return in the integer variable @var{result} either the number of +characters read or the constant YY_NULL (0 on Unix +systems) to indicate EOF. The default YY_INPUT reads from +the global file-pointer "yyin". + +A sample definition of YY_INPUT (in the definitions +section of the input file): + +@example +%@{ +#define YY_INPUT(buf,result,max_size) \ + @{ \ + int c = getchar(); \ + result = (c == EOF) ? YY_NULL : (buf[0] = c, 1); \ + @} +%@} +@end example + +This definition will change the input processing to occur +one character at a time. + +When the scanner receives an end-of-file indication from +YY_INPUT, it then checks the @samp{yywrap()} function. If +@samp{yywrap()} returns false (zero), then it is assumed that the +function has gone ahead and set up @code{yyin} to point to +another input file, and scanning continues. If it returns +true (non-zero), then the scanner terminates, returning 0 +to its caller. Note that in either case, the start +condition remains unchanged; it does @emph{not} revert to @code{INITIAL}. + +If you do not supply your own version of @samp{yywrap()}, then you +must either use @samp{%option noyywrap} (in which case the scanner +behaves as though @samp{yywrap()} returned 1), or you must link with +@samp{-lfl} to obtain the default version of the routine, which always +returns 1. + +Three routines are available for scanning from in-memory +buffers rather than files: @samp{yy_scan_string()}, +@samp{yy_scan_bytes()}, and @samp{yy_scan_buffer()}. See the discussion +of them below in the section Multiple Input Buffers. + +The scanner writes its @samp{ECHO} output to the @code{yyout} global +(default, stdout), which may be redefined by the user +simply by assigning it to some other @code{FILE} pointer. + +@node Start conditions, Multiple buffers, Generated scanner, Top +@section Start conditions + +@code{flex} provides a mechanism for conditionally activating +rules. Any rule whose pattern is prefixed with "<sc>" +will only be active when the scanner is in the start +condition named "sc". For example, + +@example +<STRING>[^"]* @{ /* eat up the string body ... */ + @dots{} + @} +@end example + +@noindent +will be active only when the scanner is in the "STRING" +start condition, and + +@example +<INITIAL,STRING,QUOTE>\. @{ /* handle an escape ... */ + @dots{} + @} +@end example + +@noindent +will be active only when the current start condition is +either "INITIAL", "STRING", or "QUOTE". + +Start conditions are declared in the definitions (first) +section of the input using unindented lines beginning with +either @samp{%s} or @samp{%x} followed by a list of names. The former +declares @emph{inclusive} start conditions, the latter @emph{exclusive} +start conditions. A start condition is activated using +the @code{BEGIN} action. Until the next @code{BEGIN} action is +executed, rules with the given start condition will be active +and rules with other start conditions will be inactive. +If the start condition is @emph{inclusive}, then rules with no +start conditions at all will also be active. If it is +@emph{exclusive}, then @emph{only} rules qualified with the start +condition will be active. A set of rules contingent on the +same exclusive start condition describe a scanner which is +independent of any of the other rules in the @code{flex} input. +Because of this, exclusive start conditions make it easy +to specify "mini-scanners" which scan portions of the +input that are syntactically different from the rest +(e.g., comments). + +If the distinction between inclusive and exclusive start +conditions is still a little vague, here's a simple +example illustrating the connection between the two. The set +of rules: + +@example +%s example +%% + +<example>foo do_something(); + +bar something_else(); +@end example + +@noindent +is equivalent to + +@example +%x example +%% + +<example>foo do_something(); + +<INITIAL,example>bar something_else(); +@end example + +Without the @samp{<INITIAL,example>} qualifier, the @samp{bar} pattern +in the second example wouldn't be active (i.e., couldn't match) when +in start condition @samp{example}. If we just used @samp{<example>} +to qualify @samp{bar}, though, then it would only be active in +@samp{example} and not in @code{INITIAL}, while in the first example +it's active in both, because in the first example the @samp{example} +starting condition is an @emph{inclusive} (@samp{%s}) start condition. + +Also note that the special start-condition specifier @samp{<*>} +matches every start condition. Thus, the above example +could also have been written; + +@example +%x example +%% + +<example>foo do_something(); + +<*>bar something_else(); +@end example + +The default rule (to @samp{ECHO} any unmatched character) remains +active in start conditions. It is equivalent to: + +@example +<*>.|\\n ECHO; +@end example + +@samp{BEGIN(0)} returns to the original state where only the +rules with no start conditions are active. This state can +also be referred to as the start-condition "INITIAL", so +@samp{BEGIN(INITIAL)} is equivalent to @samp{BEGIN(0)}. (The +parentheses around the start condition name are not required but +are considered good style.) + +@code{BEGIN} actions can also be given as indented code at the +beginning of the rules section. For example, the +following will cause the scanner to enter the "SPECIAL" start +condition whenever @samp{yylex()} is called and the global +variable @code{enter_special} is true: + +@example + int enter_special; + +%x SPECIAL +%% + if ( enter_special ) + BEGIN(SPECIAL); + +<SPECIAL>blahblahblah +@dots{}more rules follow@dots{} +@end example + +To illustrate the uses of start conditions, here is a +scanner which provides two different interpretations of a +string like "123.456". By default it will treat it as as +three tokens, the integer "123", a dot ('.'), and the +integer "456". But if the string is preceded earlier in +the line by the string "expect-floats" it will treat it as +a single token, the floating-point number 123.456: + +@example +%@{ +#include <math.h> +%@} +%s expect + +%% +expect-floats BEGIN(expect); + +<expect>[0-9]+"."[0-9]+ @{ + printf( "found a float, = %f\n", + atof( yytext ) ); + @} +<expect>\n @{ + /* that's the end of the line, so + * we need another "expect-number" + * before we'll recognize any more + * numbers + */ + BEGIN(INITIAL); + @} + +[0-9]+ @{ + +Version 2.5 December 1994 18 + + printf( "found an integer, = %d\n", + atoi( yytext ) ); + @} + +"." printf( "found a dot\n" ); +@end example + +Here is a scanner which recognizes (and discards) C +comments while maintaining a count of the current input line. + +@example +%x comment +%% + int line_num = 1; + +"/*" BEGIN(comment); + +<comment>[^*\n]* /* eat anything that's not a '*' */ +<comment>"*"+[^*/\n]* /* eat up '*'s not followed by '/'s */ +<comment>\n ++line_num; +<comment>"*"+"/" BEGIN(INITIAL); +@end example + +This scanner goes to a bit of trouble to match as much +text as possible with each rule. In general, when +attempting to write a high-speed scanner try to match as +much possible in each rule, as it's a big win. + +Note that start-conditions names are really integer values +and can be stored as such. Thus, the above could be +extended in the following fashion: + +@example +%x comment foo +%% + int line_num = 1; + int comment_caller; + +"/*" @{ + comment_caller = INITIAL; + BEGIN(comment); + @} + +@dots{} + +<foo>"/*" @{ + comment_caller = foo; + BEGIN(comment); + @} + +<comment>[^*\n]* /* eat anything that's not a '*' */ +<comment>"*"+[^*/\n]* /* eat up '*'s not followed by '/'s */ +<comment>\n ++line_num; +<comment>"*"+"/" BEGIN(comment_caller); +@end example + +Furthermore, you can access the current start condition +using the integer-valued @code{YY_START} macro. For example, the +above assignments to @code{comment_caller} could instead be +written + +@example +comment_caller = YY_START; +@end example + +Flex provides @code{YYSTATE} as an alias for @code{YY_START} (since that +is what's used by AT&T @code{lex}). + +Note that start conditions do not have their own +name-space; %s's and %x's declare names in the same fashion as +#define's. + +Finally, here's an example of how to match C-style quoted +strings using exclusive start conditions, including +expanded escape sequences (but not including checking for +a string that's too long): + +@example +%x str + +%% + char string_buf[MAX_STR_CONST]; + char *string_buf_ptr; + +\" string_buf_ptr = string_buf; BEGIN(str); + +<str>\" @{ /* saw closing quote - all done */ + BEGIN(INITIAL); + *string_buf_ptr = '\0'; + /* return string constant token type and + * value to parser + */ + @} + +<str>\n @{ + /* error - unterminated string constant */ + /* generate error message */ + @} + +<str>\\[0-7]@{1,3@} @{ + /* octal escape sequence */ + int result; + + (void) sscanf( yytext + 1, "%o", &result ); + + if ( result > 0xff ) + /* error, constant is out-of-bounds */ + + *string_buf_ptr++ = result; + @} + +<str>\\[0-9]+ @{ + /* generate error - bad escape sequence; something + * like '\48' or '\0777777' + */ + @} + +<str>\\n *string_buf_ptr++ = '\n'; +<str>\\t *string_buf_ptr++ = '\t'; +<str>\\r *string_buf_ptr++ = '\r'; +<str>\\b *string_buf_ptr++ = '\b'; +<str>\\f *string_buf_ptr++ = '\f'; + +<str>\\(.|\n) *string_buf_ptr++ = yytext[1]; + +<str>[^\\\n\"]+ @{ + char *yptr = yytext; + + while ( *yptr ) + *string_buf_ptr++ = *yptr++; + @} +@end example + +Often, such as in some of the examples above, you wind up +writing a whole bunch of rules all preceded by the same +start condition(s). Flex makes this a little easier and +cleaner by introducing a notion of start condition @dfn{scope}. +A start condition scope is begun with: + +@example +<SCs>@{ +@end example + +@noindent +where SCs is a list of one or more start conditions. +Inside the start condition scope, every rule automatically +has the prefix @samp{<SCs>} applied to it, until a @samp{@}} which +matches the initial @samp{@{}. So, for example, + +@example +<ESC>@{ + "\\n" return '\n'; + "\\r" return '\r'; + "\\f" return '\f'; + "\\0" return '\0'; +@} +@end example + +@noindent +is equivalent to: + +@example +<ESC>"\\n" return '\n'; +<ESC>"\\r" return '\r'; +<ESC>"\\f" return '\f'; +<ESC>"\\0" return '\0'; +@end example + +Start condition scopes may be nested. + +Three routines are available for manipulating stacks of +start conditions: + +@table @samp +@item void yy_push_state(int new_state) +pushes the current start condition onto the top of +the start condition stack and switches to @var{new_state} +as though you had used @samp{BEGIN new_state} (recall that +start condition names are also integers). + +@item void yy_pop_state() +pops the top of the stack and switches to it via +@code{BEGIN}. + +@item int yy_top_state() +returns the top of the stack without altering the +stack's contents. +@end table + +The start condition stack grows dynamically and so has no +built-in size limitation. If memory is exhausted, program +execution aborts. + +To use start condition stacks, your scanner must include a +@samp{%option stack} directive (see Options below). + +@node Multiple buffers, End-of-file rules, Start conditions, Top +@section Multiple input buffers + +Some scanners (such as those which support "include" +files) require reading from several input streams. As +@code{flex} scanners do a large amount of buffering, one cannot +control where the next input will be read from by simply +writing a @code{YY_INPUT} which is sensitive to the scanning +context. @code{YY_INPUT} is only called when the scanner reaches +the end of its buffer, which may be a long time after +scanning a statement such as an "include" which requires +switching the input source. + +To negotiate these sorts of problems, @code{flex} provides a +mechanism for creating and switching between multiple +input buffers. An input buffer is created by using: + +@example +YY_BUFFER_STATE yy_create_buffer( FILE *file, int size ) +@end example + +@noindent +which takes a @code{FILE} pointer and a size and creates a buffer +associated with the given file and large enough to hold +@var{size} characters (when in doubt, use @code{YY_BUF_SIZE} for the +size). It returns a @code{YY_BUFFER_STATE} handle, which may +then be passed to other routines (see below). The +@code{YY_BUFFER_STATE} type is a pointer to an opaque @code{struct} +@code{yy_buffer_state} structure, so you may safely initialize +YY_BUFFER_STATE variables to @samp{((YY_BUFFER_STATE) 0)} if you +wish, and also refer to the opaque structure in order to +correctly declare input buffers in source files other than +that of your scanner. Note that the @code{FILE} pointer in the +call to @code{yy_create_buffer} is only used as the value of @code{yyin} +seen by @code{YY_INPUT}; if you redefine @code{YY_INPUT} so it no longer +uses @code{yyin}, then you can safely pass a nil @code{FILE} pointer to +@code{yy_create_buffer}. You select a particular buffer to scan +from using: + +@example +void yy_switch_to_buffer( YY_BUFFER_STATE new_buffer ) +@end example + +switches the scanner's input buffer so subsequent tokens +will come from @var{new_buffer}. Note that +@samp{yy_switch_to_buffer()} may be used by @samp{yywrap()} to set +things up for continued scanning, instead of opening a new +file and pointing @code{yyin} at it. Note also that switching +input sources via either @samp{yy_switch_to_buffer()} or @samp{yywrap()} +does @emph{not} change the start condition. + +@example +void yy_delete_buffer( YY_BUFFER_STATE buffer ) +@end example + +@noindent +is used to reclaim the storage associated with a buffer. +You can also clear the current contents of a buffer using: + +@example +void yy_flush_buffer( YY_BUFFER_STATE buffer ) +@end example + +This function discards the buffer's contents, so the next time the +scanner attempts to match a token from the buffer, it will first fill +the buffer anew using @code{YY_INPUT}. + +@samp{yy_new_buffer()} is an alias for @samp{yy_create_buffer()}, +provided for compatibility with the C++ use of @code{new} and @code{delete} +for creating and destroying dynamic objects. + +Finally, the @code{YY_CURRENT_BUFFER} macro returns a +@code{YY_BUFFER_STATE} handle to the current buffer. + +Here is an example of using these features for writing a +scanner which expands include files (the @samp{<<EOF>>} feature +is discussed below): + +@example +/* the "incl" state is used for picking up the name + * of an include file + */ +%x incl + +%@{ +#define MAX_INCLUDE_DEPTH 10 +YY_BUFFER_STATE include_stack[MAX_INCLUDE_DEPTH]; +int include_stack_ptr = 0; +%@} + +%% +include BEGIN(incl); + +[a-z]+ ECHO; +[^a-z\n]*\n? ECHO; + +<incl>[ \t]* /* eat the whitespace */ +<incl>[^ \t\n]+ @{ /* got the include file name */ + if ( include_stack_ptr >= MAX_INCLUDE_DEPTH ) + @{ + fprintf( stderr, "Includes nested too deeply" ); + exit( 1 ); + @} + + include_stack[include_stack_ptr++] = + YY_CURRENT_BUFFER; + + yyin = fopen( yytext, "r" ); + + if ( ! yyin ) + error( @dots{} ); + + yy_switch_to_buffer( + yy_create_buffer( yyin, YY_BUF_SIZE ) ); + + BEGIN(INITIAL); + @} + +<<EOF>> @{ + if ( --include_stack_ptr < 0 ) + @{ + yyterminate(); + @} + + else + @{ + yy_delete_buffer( YY_CURRENT_BUFFER ); + yy_switch_to_buffer( + include_stack[include_stack_ptr] ); + @} + @} +@end example + +Three routines are available for setting up input buffers +for scanning in-memory strings instead of files. All of +them create a new input buffer for scanning the string, +and return a corresponding @code{YY_BUFFER_STATE} handle (which +you should delete with @samp{yy_delete_buffer()} when done with +it). They also switch to the new buffer using +@samp{yy_switch_to_buffer()}, so the next call to @samp{yylex()} will +start scanning the string. + +@table @samp +@item yy_scan_string(const char *str) +scans a NUL-terminated string. + +@item yy_scan_bytes(const char *bytes, int len) +scans @code{len} bytes (including possibly NUL's) starting +at location @var{bytes}. +@end table + +Note that both of these functions create and scan a @emph{copy} +of the string or bytes. (This may be desirable, since +@samp{yylex()} modifies the contents of the buffer it is +scanning.) You can avoid the copy by using: + +@table @samp +@item yy_scan_buffer(char *base, yy_size_t size) +which scans in place the buffer starting at @var{base}, +consisting of @var{size} bytes, the last two bytes of +which @emph{must} be @code{YY_END_OF_BUFFER_CHAR} (ASCII NUL). +These last two bytes are not scanned; thus, +scanning consists of @samp{base[0]} through @samp{base[size-2]}, +inclusive. + +If you fail to set up @var{base} in this manner (i.e., +forget the final two @code{YY_END_OF_BUFFER_CHAR} bytes), +then @samp{yy_scan_buffer()} returns a nil pointer instead +of creating a new input buffer. + +The type @code{yy_size_t} is an integral type to which you +can cast an integer expression reflecting the size +of the buffer. +@end table + +@node End-of-file rules, Miscellaneous, Multiple buffers, Top +@section End-of-file rules + +The special rule "<<EOF>>" indicates actions which are to +be taken when an end-of-file is encountered and yywrap() +returns non-zero (i.e., indicates no further files to +process). The action must finish by doing one of four +things: + +@itemize - +@item +assigning @code{yyin} to a new input file (in previous +versions of flex, after doing the assignment you +had to call the special action @code{YY_NEW_FILE}; this is +no longer necessary); + +@item +executing a @code{return} statement; + +@item +executing the special @samp{yyterminate()} action; + +@item +or, switching to a new buffer using +@samp{yy_switch_to_buffer()} as shown in the example +above. +@end itemize + +<<EOF>> rules may not be used with other patterns; they +may only be qualified with a list of start conditions. If +an unqualified <<EOF>> rule is given, it applies to @emph{all} +start conditions which do not already have <<EOF>> +actions. To specify an <<EOF>> rule for only the initial +start condition, use + +@example +<INITIAL><<EOF>> +@end example + +These rules are useful for catching things like unclosed +comments. An example: + +@example +%x quote +%% + +@dots{}other rules for dealing with quotes@dots{} + +<quote><<EOF>> @{ + error( "unterminated quote" ); + yyterminate(); + @} +<<EOF>> @{ + if ( *++filelist ) + yyin = fopen( *filelist, "r" ); + else + yyterminate(); + @} +@end example + +@node Miscellaneous, User variables, End-of-file rules, Top +@section Miscellaneous macros + +The macro @code{YY_USER_ACTION} can be defined to provide an +action which is always executed prior to the matched +rule's action. For example, it could be #define'd to call +a routine to convert yytext to lower-case. When +@code{YY_USER_ACTION} is invoked, the variable @code{yy_act} gives the +number of the matched rule (rules are numbered starting +with 1). Suppose you want to profile how often each of +your rules is matched. The following would do the trick: + +@example +#define YY_USER_ACTION ++ctr[yy_act] +@end example + +where @code{ctr} is an array to hold the counts for the different +rules. Note that the macro @code{YY_NUM_RULES} gives the total number +of rules (including the default rule, even if you use @samp{-s}, so +a correct declaration for @code{ctr} is: + +@example +int ctr[YY_NUM_RULES]; +@end example + +The macro @code{YY_USER_INIT} may be defined to provide an action +which is always executed before the first scan (and before +the scanner's internal initializations are done). For +example, it could be used to call a routine to read in a +data table or open a logging file. + +The macro @samp{yy_set_interactive(is_interactive)} can be used +to control whether the current buffer is considered +@emph{interactive}. An interactive buffer is processed more slowly, +but must be used when the scanner's input source is indeed +interactive to avoid problems due to waiting to fill +buffers (see the discussion of the @samp{-I} flag below). A +non-zero value in the macro invocation marks the buffer as +interactive, a zero value as non-interactive. Note that +use of this macro overrides @samp{%option always-interactive} or +@samp{%option never-interactive} (see Options below). +@samp{yy_set_interactive()} must be invoked prior to beginning to +scan the buffer that is (or is not) to be considered +interactive. + +The macro @samp{yy_set_bol(at_bol)} can be used to control +whether the current buffer's scanning context for the next +token match is done as though at the beginning of a line. +A non-zero macro argument makes rules anchored with + +The macro @samp{YY_AT_BOL()} returns true if the next token +scanned from the current buffer will have '^' rules +active, false otherwise. + +In the generated scanner, the actions are all gathered in +one large switch statement and separated using @code{YY_BREAK}, +which may be redefined. By default, it is simply a +"break", to separate each rule's action from the following +rule's. Redefining @code{YY_BREAK} allows, for example, C++ +users to #define YY_BREAK to do nothing (while being very +careful that every rule ends with a "break" or a +"return"!) to avoid suffering from unreachable statement +warnings where because a rule's action ends with "return", +the @code{YY_BREAK} is inaccessible. + +@node User variables, YACC interface, Miscellaneous, Top +@section Values available to the user + +This section summarizes the various values available to +the user in the rule actions. + +@itemize - +@item +@samp{char *yytext} holds the text of the current token. +It may be modified but not lengthened (you cannot +append characters to the end). + +If the special directive @samp{%array} appears in the +first section of the scanner description, then +@code{yytext} is instead declared @samp{char yytext[YYLMAX]}, +where @code{YYLMAX} is a macro definition that you can +redefine in the first section if you don't like the +default value (generally 8KB). Using @samp{%array} +results in somewhat slower scanners, but the value +of @code{yytext} becomes immune to calls to @samp{input()} and +@samp{unput()}, which potentially destroy its value when +@code{yytext} is a character pointer. The opposite of +@samp{%array} is @samp{%pointer}, which is the default. + +You cannot use @samp{%array} when generating C++ scanner +classes (the @samp{-+} flag). + +@item +@samp{int yyleng} holds the length of the current token. + +@item +@samp{FILE *yyin} is the file which by default @code{flex} reads +from. It may be redefined but doing so only makes +sense before scanning begins or after an EOF has +been encountered. Changing it in the midst of +scanning will have unexpected results since @code{flex} +buffers its input; use @samp{yyrestart()} instead. Once +scanning terminates because an end-of-file has been +seen, you can assign @code{yyin} at the new input file and +then call the scanner again to continue scanning. + +@item +@samp{void yyrestart( FILE *new_file )} may be called to +point @code{yyin} at the new input file. The switch-over +to the new file is immediate (any previously +buffered-up input is lost). Note that calling +@samp{yyrestart()} with @code{yyin} as an argument thus throws +away the current input buffer and continues +scanning the same input file. + +@item +@samp{FILE *yyout} is the file to which @samp{ECHO} actions are +done. It can be reassigned by the user. + +@item +@code{YY_CURRENT_BUFFER} returns a @code{YY_BUFFER_STATE} handle +to the current buffer. + +@item +@code{YY_START} returns an integer value corresponding to +the current start condition. You can subsequently +use this value with @code{BEGIN} to return to that start +condition. +@end itemize + +@node YACC interface, Options, User variables, Top +@section Interfacing with @code{yacc} + +One of the main uses of @code{flex} is as a companion to the @code{yacc} +parser-generator. @code{yacc} parsers expect to call a routine +named @samp{yylex()} to find the next input token. The routine +is supposed to return the type of the next token as well +as putting any associated value in the global @code{yylval}. To +use @code{flex} with @code{yacc}, one specifies the @samp{-d} option to @code{yacc} to +instruct it to generate the file @file{y.tab.h} containing +definitions of all the @samp{%tokens} appearing in the @code{yacc} input. +This file is then included in the @code{flex} scanner. For +example, if one of the tokens is "TOK_NUMBER", part of the +scanner might look like: + +@example +%@{ +#include "y.tab.h" +%@} + +%% + +[0-9]+ yylval = atoi( yytext ); return TOK_NUMBER; +@end example + +@node Options, Performance, YACC interface, Top +@section Options +@code{flex} has the following options: + +@table @samp +@item -b +Generate backing-up information to @file{lex.backup}. +This is a list of scanner states which require +backing up and the input characters on which they +do so. By adding rules one can remove backing-up +states. If @emph{all} backing-up states are eliminated +and @samp{-Cf} or @samp{-CF} is used, the generated scanner will +run faster (see the @samp{-p} flag). Only users who wish +to squeeze every last cycle out of their scanners +need worry about this option. (See the section on +Performance Considerations below.) + +@item -c +is a do-nothing, deprecated option included for +POSIX compliance. + +@item -d +makes the generated scanner run in @dfn{debug} mode. +Whenever a pattern is recognized and the global +@code{yy_flex_debug} is non-zero (which is the default), +the scanner will write to @code{stderr} a line of the +form: + +@example +--accepting rule at line 53 ("the matched text") +@end example + +The line number refers to the location of the rule +in the file defining the scanner (i.e., the file +that was fed to flex). Messages are also generated +when the scanner backs up, accepts the default +rule, reaches the end of its input buffer (or +encounters a NUL; at this point, the two look the +same as far as the scanner's concerned), or reaches +an end-of-file. + +@item -f +specifies @dfn{fast scanner}. No table compression is +done and stdio is bypassed. The result is large +but fast. This option is equivalent to @samp{-Cfr} (see +below). + +@item -h +generates a "help" summary of @code{flex's} options to +@code{stdout} and then exits. @samp{-?} and @samp{--help} are synonyms +for @samp{-h}. + +@item -i +instructs @code{flex} to generate a @emph{case-insensitive} +scanner. The case of letters given in the @code{flex} input +patterns will be ignored, and tokens in the input +will be matched regardless of case. The matched +text given in @code{yytext} will have the preserved case +(i.e., it will not be folded). + +@item -l +turns on maximum compatibility with the original +AT&T @code{lex} implementation. Note that this does not +mean @emph{full} compatibility. Use of this option costs +a considerable amount of performance, and it cannot +be used with the @samp{-+, -f, -F, -Cf}, or @samp{-CF} options. +For details on the compatibilities it provides, see +the section "Incompatibilities With Lex And POSIX" +below. This option also results in the name +@code{YY_FLEX_LEX_COMPAT} being #define'd in the generated +scanner. + +@item -n +is another do-nothing, deprecated option included +only for POSIX compliance. + +@item -p +generates a performance report to stderr. The +report consists of comments regarding features of +the @code{flex} input file which will cause a serious loss +of performance in the resulting scanner. If you +give the flag twice, you will also get comments +regarding features that lead to minor performance +losses. + +Note that the use of @code{REJECT}, @samp{%option yylineno} and +variable trailing context (see the Deficiencies / Bugs section below) +entails a substantial performance penalty; use of @samp{yymore()}, +the @samp{^} operator, and the @samp{-I} flag entail minor performance +penalties. + +@item -s +causes the @dfn{default rule} (that unmatched scanner +input is echoed to @code{stdout}) to be suppressed. If +the scanner encounters input that does not match +any of its rules, it aborts with an error. This +option is useful for finding holes in a scanner's +rule set. + +@item -t +instructs @code{flex} to write the scanner it generates to +standard output instead of @file{lex.yy.c}. + +@item -v +specifies that @code{flex} should write to @code{stderr} a +summary of statistics regarding the scanner it +generates. Most of the statistics are meaningless to +the casual @code{flex} user, but the first line identifies +the version of @code{flex} (same as reported by @samp{-V}), and +the next line the flags used when generating the +scanner, including those that are on by default. + +@item -w +suppresses warning messages. + +@item -B +instructs @code{flex} to generate a @emph{batch} scanner, the +opposite of @emph{interactive} scanners generated by @samp{-I} +(see below). In general, you use @samp{-B} when you are +@emph{certain} that your scanner will never be used +interactively, and you want to squeeze a @emph{little} more +performance out of it. If your goal is instead to +squeeze out a @emph{lot} more performance, you should be +using the @samp{-Cf} or @samp{-CF} options (discussed below), +which turn on @samp{-B} automatically anyway. + +@item -F +specifies that the @dfn{fast} scanner table +representation should be used (and stdio bypassed). This +representation is about as fast as the full table +representation @samp{(-f)}, and for some sets of patterns +will be considerably smaller (and for others, +larger). In general, if the pattern set contains +both "keywords" and a catch-all, "identifier" rule, +such as in the set: + +@example +"case" return TOK_CASE; +"switch" return TOK_SWITCH; +... +"default" return TOK_DEFAULT; +[a-z]+ return TOK_ID; +@end example + +@noindent +then you're better off using the full table +representation. If only the "identifier" rule is +present and you then use a hash table or some such to +detect the keywords, you're better off using @samp{-F}. + +This option is equivalent to @samp{-CFr} (see below). It +cannot be used with @samp{-+}. + +@item -I +instructs @code{flex} to generate an @emph{interactive} scanner. +An interactive scanner is one that only looks ahead +to decide what token has been matched if it +absolutely must. It turns out that always looking one +extra character ahead, even if the scanner has +already seen enough text to disambiguate the +current token, is a bit faster than only looking ahead +when necessary. But scanners that always look +ahead give dreadful interactive performance; for +example, when a user types a newline, it is not +recognized as a newline token until they enter +@emph{another} token, which often means typing in another +whole line. + +@code{Flex} scanners default to @emph{interactive} unless you use +the @samp{-Cf} or @samp{-CF} table-compression options (see +below). That's because if you're looking for +high-performance you should be using one of these +options, so if you didn't, @code{flex} assumes you'd +rather trade off a bit of run-time performance for +intuitive interactive behavior. Note also that you +@emph{cannot} use @samp{-I} in conjunction with @samp{-Cf} or @samp{-CF}. +Thus, this option is not really needed; it is on by +default for all those cases in which it is allowed. + +You can force a scanner to @emph{not} be interactive by +using @samp{-B} (see above). + +@item -L +instructs @code{flex} not to generate @samp{#line} directives. +Without this option, @code{flex} peppers the generated +scanner with #line directives so error messages in +the actions will be correctly located with respect +to either the original @code{flex} input file (if the +errors are due to code in the input file), or +@file{lex.yy.c} (if the errors are @code{flex's} fault -- you +should report these sorts of errors to the email +address given below). + +@item -T +makes @code{flex} run in @code{trace} mode. It will generate a +lot of messages to @code{stderr} concerning the form of +the input and the resultant non-deterministic and +deterministic finite automata. This option is +mostly for use in maintaining @code{flex}. + +@item -V +prints the version number to @code{stdout} and exits. +@samp{--version} is a synonym for @samp{-V}. + +@item -7 +instructs @code{flex} to generate a 7-bit scanner, i.e., +one which can only recognized 7-bit characters in +its input. The advantage of using @samp{-7} is that the +scanner's tables can be up to half the size of +those generated using the @samp{-8} option (see below). +The disadvantage is that such scanners often hang +or crash if their input contains an 8-bit +character. + +Note, however, that unless you generate your +scanner using the @samp{-Cf} or @samp{-CF} table compression options, +use of @samp{-7} will save only a small amount of table +space, and make your scanner considerably less +portable. @code{Flex's} default behavior is to generate +an 8-bit scanner unless you use the @samp{-Cf} or @samp{-CF}, in +which case @code{flex} defaults to generating 7-bit +scanners unless your site was always configured to +generate 8-bit scanners (as will often be the case +with non-USA sites). You can tell whether flex +generated a 7-bit or an 8-bit scanner by inspecting +the flag summary in the @samp{-v} output as described +above. + +Note that if you use @samp{-Cfe} or @samp{-CFe} (those table +compression options, but also using equivalence +classes as discussed see below), flex still +defaults to generating an 8-bit scanner, since +usually with these compression options full 8-bit +tables are not much more expensive than 7-bit +tables. + +@item -8 +instructs @code{flex} to generate an 8-bit scanner, i.e., +one which can recognize 8-bit characters. This +flag is only needed for scanners generated using +@samp{-Cf} or @samp{-CF}, as otherwise flex defaults to +generating an 8-bit scanner anyway. + +See the discussion of @samp{-7} above for flex's default +behavior and the tradeoffs between 7-bit and 8-bit +scanners. + +@item -+ +specifies that you want flex to generate a C++ +scanner class. See the section on Generating C++ +Scanners below for details. + +@item -C[aefFmr] +controls the degree of table compression and, more +generally, trade-offs between small scanners and +fast scanners. + +@samp{-Ca} ("align") instructs flex to trade off larger +tables in the generated scanner for faster +performance because the elements of the tables are better +aligned for memory access and computation. On some +RISC architectures, fetching and manipulating +long-words is more efficient than with smaller-sized +units such as shortwords. This option can double +the size of the tables used by your scanner. + +@samp{-Ce} directs @code{flex} to construct @dfn{equivalence classes}, +i.e., sets of characters which have identical +lexical properties (for example, if the only appearance +of digits in the @code{flex} input is in the character +class "[0-9]" then the digits '0', '1', @dots{}, '9' +will all be put in the same equivalence class). +Equivalence classes usually give dramatic +reductions in the final table/object file sizes +(typically a factor of 2-5) and are pretty cheap +performance-wise (one array look-up per character +scanned). + +@samp{-Cf} specifies that the @emph{full} scanner tables should +be generated - @code{flex} should not compress the tables +by taking advantages of similar transition +functions for different states. + +@samp{-CF} specifies that the alternate fast scanner +representation (described above under the @samp{-F} flag) +should be used. This option cannot be used with +@samp{-+}. + +@samp{-Cm} directs @code{flex} to construct @dfn{meta-equivalence +classes}, which are sets of equivalence classes (or +characters, if equivalence classes are not being +used) that are commonly used together. +Meta-equivalence classes are often a big win when using +compressed tables, but they have a moderate +performance impact (one or two "if" tests and one array +look-up per character scanned). + +@samp{-Cr} causes the generated scanner to @emph{bypass} use of +the standard I/O library (stdio) for input. +Instead of calling @samp{fread()} or @samp{getc()}, the scanner +will use the @samp{read()} system call, resulting in a +performance gain which varies from system to +system, but in general is probably negligible unless +you are also using @samp{-Cf} or @samp{-CF}. Using @samp{-Cr} can cause +strange behavior if, for example, you read from +@code{yyin} using stdio prior to calling the scanner +(because the scanner will miss whatever text your +previous reads left in the stdio input buffer). + +@samp{-Cr} has no effect if you define @code{YY_INPUT} (see The +Generated Scanner above). + +A lone @samp{-C} specifies that the scanner tables should +be compressed but neither equivalence classes nor +meta-equivalence classes should be used. + +The options @samp{-Cf} or @samp{-CF} and @samp{-Cm} do not make sense +together - there is no opportunity for +meta-equivalence classes if the table is not being +compressed. Otherwise the options may be freely +mixed, and are cumulative. + +The default setting is @samp{-Cem}, which specifies that +@code{flex} should generate equivalence classes and +meta-equivalence classes. This setting provides the +highest degree of table compression. You can trade +off faster-executing scanners at the cost of larger +tables with the following generally being true: + +@example +slowest & smallest + -Cem + -Cm + -Ce + -C + -C@{f,F@}e + -C@{f,F@} + -C@{f,F@}a +fastest & largest +@end example + +Note that scanners with the smallest tables are +usually generated and compiled the quickest, so +during development you will usually want to use the +default, maximal compression. + +@samp{-Cfe} is often a good compromise between speed and +size for production scanners. + +@item -ooutput +directs flex to write the scanner to the file @samp{out-} +@code{put} instead of @file{lex.yy.c}. If you combine @samp{-o} with +the @samp{-t} option, then the scanner is written to +@code{stdout} but its @samp{#line} directives (see the @samp{-L} option +above) refer to the file @code{output}. + +@item -Pprefix +changes the default @samp{yy} prefix used by @code{flex} for all +globally-visible variable and function names to +instead be @var{prefix}. For example, @samp{-Pfoo} changes the +name of @code{yytext} to @file{footext}. It also changes the +name of the default output file from @file{lex.yy.c} to +@file{lex.foo.c}. Here are all of the names affected: + +@example +yy_create_buffer +yy_delete_buffer +yy_flex_debug +yy_init_buffer +yy_flush_buffer +yy_load_buffer_state +yy_switch_to_buffer +yyin +yyleng +yylex +yylineno +yyout +yyrestart +yytext +yywrap +@end example + +(If you are using a C++ scanner, then only @code{yywrap} +and @code{yyFlexLexer} are affected.) Within your scanner +itself, you can still refer to the global variables +and functions using either version of their name; +but externally, they have the modified name. + +This option lets you easily link together multiple +@code{flex} programs into the same executable. Note, +though, that using this option also renames +@samp{yywrap()}, so you now @emph{must} either provide your own +(appropriately-named) version of the routine for +your scanner, or use @samp{%option noyywrap}, as linking +with @samp{-lfl} no longer provides one for you by +default. + +@item -Sskeleton_file +overrides the default skeleton file from which @code{flex} +constructs its scanners. You'll never need this +option unless you are doing @code{flex} maintenance or +development. +@end table + +@code{flex} also provides a mechanism for controlling options +within the scanner specification itself, rather than from +the flex command-line. This is done by including @samp{%option} +directives in the first section of the scanner +specification. You can specify multiple options with a single +@samp{%option} directive, and multiple directives in the first +section of your flex input file. Most options are given +simply as names, optionally preceded by the word "no" +(with no intervening whitespace) to negate their meaning. +A number are equivalent to flex flags or their negation: + +@example +7bit -7 option +8bit -8 option +align -Ca option +backup -b option +batch -B option +c++ -+ option + +caseful or +case-sensitive opposite of -i (default) + +case-insensitive or +caseless -i option + +debug -d option +default opposite of -s option +ecs -Ce option +fast -F option +full -f option +interactive -I option +lex-compat -l option +meta-ecs -Cm option +perf-report -p option +read -Cr option +stdout -t option +verbose -v option +warn opposite of -w option + (use "%option nowarn" for -w) + +array equivalent to "%array" +pointer equivalent to "%pointer" (default) +@end example + +Some @samp{%option's} provide features otherwise not available: + +@table @samp +@item always-interactive +instructs flex to generate a scanner which always +considers its input "interactive". Normally, on +each new input file the scanner calls @samp{isatty()} in +an attempt to determine whether the scanner's input +source is interactive and thus should be read a +character at a time. When this option is used, +however, then no such call is made. + +@item main +directs flex to provide a default @samp{main()} program +for the scanner, which simply calls @samp{yylex()}. This +option implies @code{noyywrap} (see below). + +@item never-interactive +instructs flex to generate a scanner which never +considers its input "interactive" (again, no call +made to @samp{isatty())}. This is the opposite of @samp{always-} +@emph{interactive}. + +@item stack +enables the use of start condition stacks (see +Start Conditions above). + +@item stdinit +if unset (i.e., @samp{%option nostdinit}) initializes @code{yyin} +and @code{yyout} to nil @code{FILE} pointers, instead of @code{stdin} +and @code{stdout}. + +@item yylineno +directs @code{flex} to generate a scanner that maintains the number +of the current line read from its input in the global variable +@code{yylineno}. This option is implied by @samp{%option lex-compat}. + +@item yywrap +if unset (i.e., @samp{%option noyywrap}), makes the +scanner not call @samp{yywrap()} upon an end-of-file, but +simply assume that there are no more files to scan +(until the user points @code{yyin} at a new file and calls +@samp{yylex()} again). +@end table + +@code{flex} scans your rule actions to determine whether you use +the @code{REJECT} or @samp{yymore()} features. The @code{reject} and @code{yymore} +options are available to override its decision as to +whether you use the options, either by setting them (e.g., +@samp{%option reject}) to indicate the feature is indeed used, or +unsetting them to indicate it actually is not used (e.g., +@samp{%option noyymore}). + +Three options take string-delimited values, offset with '=': + +@example +%option outfile="ABC" +@end example + +@noindent +is equivalent to @samp{-oABC}, and + +@example +%option prefix="XYZ" +@end example + +@noindent +is equivalent to @samp{-PXYZ}. + +Finally, + +@example +%option yyclass="foo" +@end example + +@noindent +only applies when generating a C++ scanner (@samp{-+} option). It +informs @code{flex} that you have derived @samp{foo} as a subclass of +@code{yyFlexLexer} so @code{flex} will place your actions in the member +function @samp{foo::yylex()} instead of @samp{yyFlexLexer::yylex()}. +It also generates a @samp{yyFlexLexer::yylex()} member function that +emits a run-time error (by invoking @samp{yyFlexLexer::LexerError()}) +if called. See Generating C++ Scanners, below, for additional +information. + +A number of options are available for lint purists who +want to suppress the appearance of unneeded routines in +the generated scanner. Each of the following, if unset, +results in the corresponding routine not appearing in the +generated scanner: + +@example +input, unput +yy_push_state, yy_pop_state, yy_top_state +yy_scan_buffer, yy_scan_bytes, yy_scan_string +@end example + +@noindent +(though @samp{yy_push_state()} and friends won't appear anyway +unless you use @samp{%option stack}). + +@node Performance, C++, Options, Top +@section Performance considerations + +The main design goal of @code{flex} is that it generate +high-performance scanners. It has been optimized for dealing +well with large sets of rules. Aside from the effects on +scanner speed of the table compression @samp{-C} options outlined +above, there are a number of options/actions which degrade +performance. These are, from most expensive to least: + +@example +REJECT +%option yylineno +arbitrary trailing context + +pattern sets that require backing up +%array +%option interactive +%option always-interactive + +'^' beginning-of-line operator +yymore() +@end example + +with the first three all being quite expensive and the +last two being quite cheap. Note also that @samp{unput()} is +implemented as a routine call that potentially does quite +a bit of work, while @samp{yyless()} is a quite-cheap macro; so +if just putting back some excess text you scanned, use +@samp{yyless()}. + +@code{REJECT} should be avoided at all costs when performance is +important. It is a particularly expensive option. + +Getting rid of backing up is messy and often may be an +enormous amount of work for a complicated scanner. In +principal, one begins by using the @samp{-b} flag to generate a +@file{lex.backup} file. For example, on the input + +@example +%% +foo return TOK_KEYWORD; +foobar return TOK_KEYWORD; +@end example + +@noindent +the file looks like: + +@example +State #6 is non-accepting - + associated rule line numbers: + 2 3 + out-transitions: [ o ] + jam-transitions: EOF [ \001-n p-\177 ] + +State #8 is non-accepting - + associated rule line numbers: + 3 + out-transitions: [ a ] + jam-transitions: EOF [ \001-` b-\177 ] + +State #9 is non-accepting - + associated rule line numbers: + 3 + out-transitions: [ r ] + jam-transitions: EOF [ \001-q s-\177 ] + +Compressed tables always back up. +@end example + +The first few lines tell us that there's a scanner state +in which it can make a transition on an 'o' but not on any +other character, and that in that state the currently +scanned text does not match any rule. The state occurs +when trying to match the rules found at lines 2 and 3 in +the input file. If the scanner is in that state and then +reads something other than an 'o', it will have to back up +to find a rule which is matched. With a bit of +head-scratching one can see that this must be the state it's in +when it has seen "fo". When this has happened, if +anything other than another 'o' is seen, the scanner will +have to back up to simply match the 'f' (by the default +rule). + +The comment regarding State #8 indicates there's a problem +when "foob" has been scanned. Indeed, on any character +other than an 'a', the scanner will have to back up to +accept "foo". Similarly, the comment for State #9 +concerns when "fooba" has been scanned and an 'r' does not +follow. + +The final comment reminds us that there's no point going +to all the trouble of removing backing up from the rules +unless we're using @samp{-Cf} or @samp{-CF}, since there's no +performance gain doing so with compressed scanners. + +The way to remove the backing up is to add "error" rules: + +@example +%% +foo return TOK_KEYWORD; +foobar return TOK_KEYWORD; + +fooba | +foob | +fo @{ + /* false alarm, not really a keyword */ + return TOK_ID; + @} +@end example + +Eliminating backing up among a list of keywords can also +be done using a "catch-all" rule: + +@example +%% +foo return TOK_KEYWORD; +foobar return TOK_KEYWORD; + +[a-z]+ return TOK_ID; +@end example + +This is usually the best solution when appropriate. + +Backing up messages tend to cascade. With a complicated +set of rules it's not uncommon to get hundreds of +messages. If one can decipher them, though, it often only +takes a dozen or so rules to eliminate the backing up +(though it's easy to make a mistake and have an error rule +accidentally match a valid token. A possible future @code{flex} +feature will be to automatically add rules to eliminate +backing up). + +It's important to keep in mind that you gain the benefits +of eliminating backing up only if you eliminate @emph{every} +instance of backing up. Leaving just one means you gain +nothing. + +@var{Variable} trailing context (where both the leading and +trailing parts do not have a fixed length) entails almost +the same performance loss as @code{REJECT} (i.e., substantial). +So when possible a rule like: + +@example +%% +mouse|rat/(cat|dog) run(); +@end example + +@noindent +is better written: + +@example +%% +mouse/cat|dog run(); +rat/cat|dog run(); +@end example + +@noindent +or as + +@example +%% +mouse|rat/cat run(); +mouse|rat/dog run(); +@end example + +Note that here the special '|' action does @emph{not} provide any +savings, and can even make things worse (see Deficiencies +/ Bugs below). + +Another area where the user can increase a scanner's +performance (and one that's easier to implement) arises from +the fact that the longer the tokens matched, the faster +the scanner will run. This is because with long tokens +the processing of most input characters takes place in the +(short) inner scanning loop, and does not often have to go +through the additional work of setting up the scanning +environment (e.g., @code{yytext}) for the action. Recall the +scanner for C comments: + +@example +%x comment +%% + int line_num = 1; + +"/*" BEGIN(comment); + +<comment>[^*\n]* +<comment>"*"+[^*/\n]* +<comment>\n ++line_num; +<comment>"*"+"/" BEGIN(INITIAL); +@end example + +This could be sped up by writing it as: + +@example +%x comment +%% + int line_num = 1; + +"/*" BEGIN(comment); + +<comment>[^*\n]* +<comment>[^*\n]*\n ++line_num; +<comment>"*"+[^*/\n]* +<comment>"*"+[^*/\n]*\n ++line_num; +<comment>"*"+"/" BEGIN(INITIAL); +@end example + +Now instead of each newline requiring the processing of +another action, recognizing the newlines is "distributed" +over the other rules to keep the matched text as long as +possible. Note that @emph{adding} rules does @emph{not} slow down the +scanner! The speed of the scanner is independent of the +number of rules or (modulo the considerations given at the +beginning of this section) how complicated the rules are +with regard to operators such as '*' and '|'. + +A final example in speeding up a scanner: suppose you want +to scan through a file containing identifiers and +keywords, one per line and with no other extraneous +characters, and recognize all the keywords. A natural first +approach is: + +@example +%% +asm | +auto | +break | +@dots{} etc @dots{} +volatile | +while /* it's a keyword */ + +.|\n /* it's not a keyword */ +@end example + +To eliminate the back-tracking, introduce a catch-all +rule: + +@example +%% +asm | +auto | +break | +... etc ... +volatile | +while /* it's a keyword */ + +[a-z]+ | +.|\n /* it's not a keyword */ +@end example + +Now, if it's guaranteed that there's exactly one word per +line, then we can reduce the total number of matches by a +half by merging in the recognition of newlines with that +of the other tokens: + +@example +%% +asm\n | +auto\n | +break\n | +@dots{} etc @dots{} +volatile\n | +while\n /* it's a keyword */ + +[a-z]+\n | +.|\n /* it's not a keyword */ +@end example + +One has to be careful here, as we have now reintroduced +backing up into the scanner. In particular, while @emph{we} know +that there will never be any characters in the input +stream other than letters or newlines, @code{flex} can't figure +this out, and it will plan for possibly needing to back up +when it has scanned a token like "auto" and then the next +character is something other than a newline or a letter. +Previously it would then just match the "auto" rule and be +done, but now it has no "auto" rule, only a "auto\n" rule. +To eliminate the possibility of backing up, we could +either duplicate all rules but without final newlines, or, +since we never expect to encounter such an input and +therefore don't how it's classified, we can introduce one +more catch-all rule, this one which doesn't include a +newline: + +@example +%% +asm\n | +auto\n | +break\n | +@dots{} etc @dots{} +volatile\n | +while\n /* it's a keyword */ + +[a-z]+\n | +[a-z]+ | +.|\n /* it's not a keyword */ +@end example + +Compiled with @samp{-Cf}, this is about as fast as one can get a +@code{flex} scanner to go for this particular problem. + +A final note: @code{flex} is slow when matching NUL's, +particularly when a token contains multiple NUL's. It's best to +write rules which match @emph{short} amounts of text if it's +anticipated that the text will often include NUL's. + +Another final note regarding performance: as mentioned +above in the section How the Input is Matched, dynamically +resizing @code{yytext} to accommodate huge tokens is a slow +process because it presently requires that the (huge) token +be rescanned from the beginning. Thus if performance is +vital, you should attempt to match "large" quantities of +text but not "huge" quantities, where the cutoff between +the two is at about 8K characters/token. + +@node C++, Incompatibilities, Performance, Top +@section Generating C++ scanners + +@code{flex} provides two different ways to generate scanners for +use with C++. The first way is to simply compile a +scanner generated by @code{flex} using a C++ compiler instead of a C +compiler. You should not encounter any compilations +errors (please report any you find to the email address +given in the Author section below). You can then use C++ +code in your rule actions instead of C code. Note that +the default input source for your scanner remains @code{yyin}, +and default echoing is still done to @code{yyout}. Both of these +remain @samp{FILE *} variables and not C++ @code{streams}. + +You can also use @code{flex} to generate a C++ scanner class, using +the @samp{-+} option, (or, equivalently, @samp{%option c++}), which +is automatically specified if the name of the flex executable ends +in a @samp{+}, such as @code{flex++}. When using this option, flex +defaults to generating the scanner to the file @file{lex.yy.cc} instead +of @file{lex.yy.c}. The generated scanner includes the header file +@file{FlexLexer.h}, which defines the interface to two C++ classes. + +The first class, @code{FlexLexer}, provides an abstract base +class defining the general scanner class interface. It +provides the following member functions: + +@table @samp +@item const char* YYText() +returns the text of the most recently matched +token, the equivalent of @code{yytext}. + +@item int YYLeng() +returns the length of the most recently matched +token, the equivalent of @code{yyleng}. + +@item int lineno() const +returns the current input line number (see @samp{%option yylineno}), +or 1 if @samp{%option yylineno} was not used. + +@item void set_debug( int flag ) +sets the debugging flag for the scanner, equivalent to assigning to +@code{yy_flex_debug} (see the Options section above). Note that you +must build the scanner using @samp{%option debug} to include debugging +information in it. + +@item int debug() const +returns the current setting of the debugging flag. +@end table + +Also provided are member functions equivalent to +@samp{yy_switch_to_buffer(), yy_create_buffer()} (though the +first argument is an @samp{istream*} object pointer and not a +@samp{FILE*}, @samp{yy_flush_buffer()}, @samp{yy_delete_buffer()}, +and @samp{yyrestart()} (again, the first argument is a @samp{istream*} +object pointer). + +The second class defined in @file{FlexLexer.h} is @code{yyFlexLexer}, +which is derived from @code{FlexLexer}. It defines the following +additional member functions: + +@table @samp +@item yyFlexLexer( istream* arg_yyin = 0, ostream* arg_yyout = 0 ) +constructs a @code{yyFlexLexer} object using the given +streams for input and output. If not specified, +the streams default to @code{cin} and @code{cout}, respectively. + +@item virtual int yylex() +performs the same role is @samp{yylex()} does for ordinary +flex scanners: it scans the input stream, consuming +tokens, until a rule's action returns a value. If you derive a subclass +@var{S} +from @code{yyFlexLexer} +and want to access the member functions and variables of +@var{S} +inside @samp{yylex()}, +then you need to use @samp{%option yyclass="@var{S}"} +to inform @code{flex} +that you will be using that subclass instead of @code{yyFlexLexer}. +In this case, rather than generating @samp{yyFlexLexer::yylex()}, +@code{flex} generates @samp{@var{S}::yylex()} +(and also generates a dummy @samp{yyFlexLexer::yylex()} +that calls @samp{yyFlexLexer::LexerError()} +if called). + +@item virtual void switch_streams(istream* new_in = 0, ostream* new_out = 0) +reassigns @code{yyin} to @code{new_in} +(if non-nil) +and @code{yyout} to @code{new_out} +(ditto), deleting the previous input buffer if @code{yyin} +is reassigned. + +@item int yylex( istream* new_in = 0, ostream* new_out = 0 ) +first switches the input streams via @samp{switch_streams( new_in, new_out )} +and then returns the value of @samp{yylex()}. +@end table + +In addition, @code{yyFlexLexer} defines the following protected +virtual functions which you can redefine in derived +classes to tailor the scanner: + +@table @samp +@item virtual int LexerInput( char* buf, int max_size ) +reads up to @samp{max_size} characters into @var{buf} and +returns the number of characters read. To indicate +end-of-input, return 0 characters. Note that +"interactive" scanners (see the @samp{-B} and @samp{-I} flags) +define the macro @code{YY_INTERACTIVE}. If you redefine +@code{LexerInput()} and need to take different actions +depending on whether or not the scanner might be +scanning an interactive input source, you can test +for the presence of this name via @samp{#ifdef}. + +@item virtual void LexerOutput( const char* buf, int size ) +writes out @var{size} characters from the buffer @var{buf}, +which, while NUL-terminated, may also contain +"internal" NUL's if the scanner's rules can match +text with NUL's in them. + +@item virtual void LexerError( const char* msg ) +reports a fatal error message. The default version +of this function writes the message to the stream +@code{cerr} and exits. +@end table + +Note that a @code{yyFlexLexer} object contains its @emph{entire} +scanning state. Thus you can use such objects to create +reentrant scanners. You can instantiate multiple instances of +the same @code{yyFlexLexer} class, and you can also combine +multiple C++ scanner classes together in the same program +using the @samp{-P} option discussed above. +Finally, note that the @samp{%array} feature is not available to +C++ scanner classes; you must use @samp{%pointer} (the default). + +Here is an example of a simple C++ scanner: + +@example + // An example of using the flex C++ scanner class. + +%@{ +int mylineno = 0; +%@} + +string \"[^\n"]+\" + +ws [ \t]+ + +alpha [A-Za-z] +dig [0-9] +name (@{alpha@}|@{dig@}|\$)(@{alpha@}|@{dig@}|[_.\-/$])* +num1 [-+]?@{dig@}+\.?([eE][-+]?@{dig@}+)? +num2 [-+]?@{dig@}*\.@{dig@}+([eE][-+]?@{dig@}+)? +number @{num1@}|@{num2@} + +%% + +@{ws@} /* skip blanks and tabs */ + +"/*" @{ + int c; + + while((c = yyinput()) != 0) + @{ + if(c == '\n') + ++mylineno; + + else if(c == '*') + @{ + if((c = yyinput()) == '/') + break; + else + unput(c); + @} + @} + @} + +@{number@} cout << "number " << YYText() << '\n'; + +\n mylineno++; + +@{name@} cout << "name " << YYText() << '\n'; + +@{string@} cout << "string " << YYText() << '\n'; + +%% + +Version 2.5 December 1994 44 + +int main( int /* argc */, char** /* argv */ ) + @{ + FlexLexer* lexer = new yyFlexLexer; + while(lexer->yylex() != 0) + ; + return 0; + @} +@end example + +If you want to create multiple (different) lexer classes, +you use the @samp{-P} flag (or the @samp{prefix=} option) to rename each +@code{yyFlexLexer} to some other @code{xxFlexLexer}. You then can +include @samp{<FlexLexer.h>} in your other sources once per lexer +class, first renaming @code{yyFlexLexer} as follows: + +@example +#undef yyFlexLexer +#define yyFlexLexer xxFlexLexer +#include <FlexLexer.h> + +#undef yyFlexLexer +#define yyFlexLexer zzFlexLexer +#include <FlexLexer.h> +@end example + +if, for example, you used @samp{%option prefix="xx"} for one of +your scanners and @samp{%option prefix="zz"} for the other. + +IMPORTANT: the present form of the scanning class is +@emph{experimental} and may change considerably between major +releases. + +@node Incompatibilities, Diagnostics, C++, Top +@section Incompatibilities with @code{lex} and POSIX + +@code{flex} is a rewrite of the AT&T Unix @code{lex} tool (the two +implementations do not share any code, though), with some +extensions and incompatibilities, both of which are of +concern to those who wish to write scanners acceptable to +either implementation. Flex is fully compliant with the +POSIX @code{lex} specification, except that when using @samp{%pointer} +(the default), a call to @samp{unput()} destroys the contents of +@code{yytext}, which is counter to the POSIX specification. + +In this section we discuss all of the known areas of +incompatibility between flex, AT&T lex, and the POSIX +specification. + +@code{flex's} @samp{-l} option turns on maximum compatibility with the +original AT&T @code{lex} implementation, at the cost of a major +loss in the generated scanner's performance. We note +below which incompatibilities can be overcome using the @samp{-l} +option. + +@code{flex} is fully compatible with @code{lex} with the following +exceptions: + +@itemize - +@item +The undocumented @code{lex} scanner internal variable @code{yylineno} +is not supported unless @samp{-l} or @samp{%option yylineno} is used. +@code{yylineno} should be maintained on a per-buffer basis, rather +than a per-scanner (single global variable) basis. @code{yylineno} is +not part of the POSIX specification. + +@item +The @samp{input()} routine is not redefinable, though it +may be called to read characters following whatever +has been matched by a rule. If @samp{input()} encounters +an end-of-file the normal @samp{yywrap()} processing is +done. A ``real'' end-of-file is returned by +@samp{input()} as @code{EOF}. + +Input is instead controlled by defining the +@code{YY_INPUT} macro. + +The @code{flex} restriction that @samp{input()} cannot be +redefined is in accordance with the POSIX +specification, which simply does not specify any way of +controlling the scanner's input other than by making +an initial assignment to @code{yyin}. + +@item +The @samp{unput()} routine is not redefinable. This +restriction is in accordance with POSIX. + +@item +@code{flex} scanners are not as reentrant as @code{lex} scanners. +In particular, if you have an interactive scanner +and an interrupt handler which long-jumps out of +the scanner, and the scanner is subsequently called +again, you may get the following message: + +@example +fatal flex scanner internal error--end of buffer missed +@end example + +To reenter the scanner, first use + +@example +yyrestart( yyin ); +@end example + +Note that this call will throw away any buffered +input; usually this isn't a problem with an +interactive scanner. + +Also note that flex C++ scanner classes @emph{are} +reentrant, so if using C++ is an option for you, you +should use them instead. See "Generating C++ +Scanners" above for details. + +@item +@samp{output()} is not supported. Output from the @samp{ECHO} +macro is done to the file-pointer @code{yyout} (default +@code{stdout}). + +@samp{output()} is not part of the POSIX specification. + +@item +@code{lex} does not support exclusive start conditions +(%x), though they are in the POSIX specification. + +@item +When definitions are expanded, @code{flex} encloses them +in parentheses. With lex, the following: + +@example +NAME [A-Z][A-Z0-9]* +%% +foo@{NAME@}? printf( "Found it\n" ); +%% +@end example + +will not match the string "foo" because when the +macro is expanded the rule is equivalent to +"foo[A-Z][A-Z0-9]*?" and the precedence is such that the +'?' is associated with "[A-Z0-9]*". With @code{flex}, the +rule will be expanded to "foo([A-Z][A-Z0-9]*)?" and +so the string "foo" will match. + +Note that if the definition begins with @samp{^} or ends +with @samp{$} then it is @emph{not} expanded with parentheses, to +allow these operators to appear in definitions +without losing their special meanings. But the +@samp{<s>, /}, and @samp{<<EOF>>} operators cannot be used in a +@code{flex} definition. + +Using @samp{-l} results in the @code{lex} behavior of no +parentheses around the definition. + +The POSIX specification is that the definition be enclosed in +parentheses. + +@item +Some implementations of @code{lex} allow a rule's action to begin on +a separate line, if the rule's pattern has trailing whitespace: + +@example +%% +foo|bar<space here> + @{ foobar_action(); @} +@end example + +@code{flex} does not support this feature. + +@item +The @code{lex} @samp{%r} (generate a Ratfor scanner) option is +not supported. It is not part of the POSIX +specification. + +@item +After a call to @samp{unput()}, @code{yytext} is undefined until +the next token is matched, unless the scanner was +built using @samp{%array}. This is not the case with @code{lex} +or the POSIX specification. The @samp{-l} option does +away with this incompatibility. + +@item +The precedence of the @samp{@{@}} (numeric range) operator +is different. @code{lex} interprets "abc@{1,3@}" as "match +one, two, or three occurrences of 'abc'", whereas +@code{flex} interprets it as "match 'ab' followed by one, +two, or three occurrences of 'c'". The latter is +in agreement with the POSIX specification. + +@item +The precedence of the @samp{^} operator is different. @code{lex} +interprets "^foo|bar" as "match either 'foo' at the +beginning of a line, or 'bar' anywhere", whereas +@code{flex} interprets it as "match either 'foo' or 'bar' +if they come at the beginning of a line". The +latter is in agreement with the POSIX specification. + +@item +The special table-size declarations such as @samp{%a} +supported by @code{lex} are not required by @code{flex} scanners; +@code{flex} ignores them. + +@item +The name FLEX_SCANNER is #define'd so scanners may +be written for use with either @code{flex} or @code{lex}. +Scanners also include @code{YY_FLEX_MAJOR_VERSION} and +@code{YY_FLEX_MINOR_VERSION} indicating which version of +@code{flex} generated the scanner (for example, for the +2.5 release, these defines would be 2 and 5 +respectively). +@end itemize + +The following @code{flex} features are not included in @code{lex} or the +POSIX specification: + +@example +C++ scanners +%option +start condition scopes +start condition stacks +interactive/non-interactive scanners +yy_scan_string() and friends +yyterminate() +yy_set_interactive() +yy_set_bol() +YY_AT_BOL() +<<EOF>> +<*> +YY_DECL +YY_START +YY_USER_ACTION +YY_USER_INIT +#line directives +%@{@}'s around actions +multiple actions on a line +@end example + +@noindent +plus almost all of the flex flags. The last feature in +the list refers to the fact that with @code{flex} you can put +multiple actions on the same line, separated with +semicolons, while with @code{lex}, the following + +@example +foo handle_foo(); ++num_foos_seen; +@end example + +@noindent +is (rather surprisingly) truncated to + +@example +foo handle_foo(); +@end example + +@code{flex} does not truncate the action. Actions that are not +enclosed in braces are simply terminated at the end of the +line. + +@node Diagnostics, Files, Incompatibilities, Top +@section Diagnostics + +@table @samp +@item warning, rule cannot be matched +indicates that the given +rule cannot be matched because it follows other rules that +will always match the same text as it. For example, in +the following "foo" cannot be matched because it comes +after an identifier "catch-all" rule: + +@example +[a-z]+ got_identifier(); +foo got_foo(); +@end example + +Using @code{REJECT} in a scanner suppresses this warning. + +@item warning, -s option given but default rule can be matched +means that it is possible (perhaps only in a particular +start condition) that the default rule (match any single +character) is the only one that will match a particular +input. Since @samp{-s} was given, presumably this is not +intended. + +@item reject_used_but_not_detected undefined +@itemx yymore_used_but_not_detected undefined +These errors can +occur at compile time. They indicate that the scanner +uses @code{REJECT} or @samp{yymore()} but that @code{flex} failed to notice the +fact, meaning that @code{flex} scanned the first two sections +looking for occurrences of these actions and failed to +find any, but somehow you snuck some in (via a #include +file, for example). Use @samp{%option reject} or @samp{%option yymore} +to indicate to flex that you really do use these features. + +@item flex scanner jammed +a scanner compiled with @samp{-s} has +encountered an input string which wasn't matched by any of +its rules. This error can also occur due to internal +problems. + +@item token too large, exceeds YYLMAX +your scanner uses @samp{%array} +and one of its rules matched a string longer than the @samp{YYL-} +@code{MAX} constant (8K bytes by default). You can increase the +value by #define'ing @code{YYLMAX} in the definitions section of +your @code{flex} input. + +@item scanner requires -8 flag to use the character '@var{x}' +Your +scanner specification includes recognizing the 8-bit +character @var{x} and you did not specify the -8 flag, and your +scanner defaulted to 7-bit because you used the @samp{-Cf} or @samp{-CF} +table compression options. See the discussion of the @samp{-7} +flag for details. + +@item flex scanner push-back overflow +you used @samp{unput()} to push +back so much text that the scanner's buffer could not hold +both the pushed-back text and the current token in @code{yytext}. +Ideally the scanner should dynamically resize the buffer +in this case, but at present it does not. + +@item input buffer overflow, can't enlarge buffer because scanner uses REJECT +the scanner was working on matching an +extremely large token and needed to expand the input +buffer. This doesn't work with scanners that use @code{REJECT}. + +@item fatal flex scanner internal error--end of buffer missed +This can occur in an scanner which is reentered after a +long-jump has jumped out (or over) the scanner's +activation frame. Before reentering the scanner, use: + +@example +yyrestart( yyin ); +@end example + +@noindent +or, as noted above, switch to using the C++ scanner class. + +@item too many start conditions in <> construct! +you listed +more start conditions in a <> construct than exist (so you +must have listed at least one of them twice). +@end table + +@node Files, Deficiencies, Diagnostics, Top +@section Files + +@table @file +@item -lfl +library with which scanners must be linked. + +@item lex.yy.c +generated scanner (called @file{lexyy.c} on some systems). + +@item lex.yy.cc +generated C++ scanner class, when using @samp{-+}. + +@item <FlexLexer.h> +header file defining the C++ scanner base class, +@code{FlexLexer}, and its derived class, @code{yyFlexLexer}. + +@item flex.skl +skeleton scanner. This file is only used when +building flex, not when flex executes. + +@item lex.backup +backing-up information for @samp{-b} flag (called @file{lex.bck} +on some systems). +@end table + +@node Deficiencies, See also, Files, Top +@section Deficiencies / Bugs + +Some trailing context patterns cannot be properly matched +and generate warning messages ("dangerous trailing +context"). These are patterns where the ending of the first +part of the rule matches the beginning of the second part, +such as "zx*/xy*", where the 'x*' matches the 'x' at the +beginning of the trailing context. (Note that the POSIX +draft states that the text matched by such patterns is +undefined.) + +For some trailing context rules, parts which are actually +fixed-length are not recognized as such, leading to the +abovementioned performance loss. In particular, parts +using '|' or @{n@} (such as "foo@{3@}") are always considered +variable-length. + +Combining trailing context with the special '|' action can +result in @emph{fixed} trailing context being turned into the +more expensive @var{variable} trailing context. For example, in +the following: + +@example +%% +abc | +xyz/def +@end example + +Use of @samp{unput()} invalidates yytext and yyleng, unless the +@samp{%array} directive or the @samp{-l} option has been used. + +Pattern-matching of NUL's is substantially slower than +matching other characters. + +Dynamic resizing of the input buffer is slow, as it +entails rescanning all the text matched so far by the +current (generally huge) token. + +Due to both buffering of input and read-ahead, you cannot +intermix calls to <stdio.h> routines, such as, for +example, @samp{getchar()}, with @code{flex} rules and expect it to work. +Call @samp{input()} instead. + +The total table entries listed by the @samp{-v} flag excludes the +number of table entries needed to determine what rule has +been matched. The number of entries is equal to the +number of DFA states if the scanner does not use @code{REJECT}, and +somewhat greater than the number of states if it does. + +@code{REJECT} cannot be used with the @samp{-f} or @samp{-F} options. + +The @code{flex} internal algorithms need documentation. + +@node See also, Author, Deficiencies, Top +@section See also + +@code{lex}(1), @code{yacc}(1), @code{sed}(1), @code{awk}(1). + +John Levine, Tony Mason, and Doug Brown: Lex & Yacc; +O'Reilly and Associates. Be sure to get the 2nd edition. + +M. E. Lesk and E. Schmidt, LEX - Lexical Analyzer Generator. + +Alfred Aho, Ravi Sethi and Jeffrey Ullman: Compilers: +Principles, Techniques and Tools; Addison-Wesley (1986). +Describes the pattern-matching techniques used by @code{flex} +(deterministic finite automata). + +@node Author, , See also, Top +@section Author + +Vern Paxson, with the help of many ideas and much inspiration from +Van Jacobson. Original version by Jef Poskanzer. The fast table +representation is a partial implementation of a design done by Van +Jacobson. The implementation was done by Kevin Gong and Vern Paxson. + +Thanks to the many @code{flex} beta-testers, feedbackers, and +contributors, especially Francois Pinard, Casey Leedom, Stan +Adermann, Terry Allen, David Barker-Plummer, John Basrai, Nelson +H.F. Beebe, @samp{benson@@odi.com}, Karl Berry, Peter A. Bigot, +Simon Blanchard, Keith Bostic, Frederic Brehm, Ian Brockbank, Kin +Cho, Nick Christopher, Brian Clapper, J.T. Conklin, Jason Coughlin, +Bill Cox, Nick Cropper, Dave Curtis, Scott David Daniels, Chris +G. Demetriou, Theo Deraadt, Mike Donahue, Chuck Doucette, Tom Epperly, +Leo Eskin, Chris Faylor, Chris Flatters, Jon Forrest, Joe Gayda, Kaveh +R. Ghazi, Eric Goldman, Christopher M. Gould, Ulrich Grepel, Peer +Griebel, Jan Hajic, Charles Hemphill, NORO Hideo, Jarkko Hietaniemi, +Scott Hofmann, Jeff Honig, Dana Hudes, Eric Hughes, John Interrante, +Ceriel Jacobs, Michal Jaegermann, Sakari Jalovaara, Jeffrey R. Jones, +Henry Juengst, Klaus Kaempf, Jonathan I. Kamens, Terrence O Kane, +Amir Katz, @samp{ken@@ken.hilco.com}, Kevin B. Kenny, Steve Kirsch, +Winfried Koenig, Marq Kole, Ronald Lamprecht, Greg Lee, Rohan Lenard, +Craig Leres, John Levine, Steve Liddle, Mike Long, Mohamed el Lozy, +Brian Madsen, Malte, Joe Marshall, Bengt Martensson, Chris Metcalf, +Luke Mewburn, Jim Meyering, R. Alexander Milowski, Erik Naggum, +G.T. Nicol, Landon Noll, James Nordby, Marc Nozell, Richard Ohnemus, +Karsten Pahnke, Sven Panne, Roland Pesch, Walter Pelissero, Gaumond +Pierre, Esmond Pitt, Jef Poskanzer, Joe Rahmeh, Jarmo Raiha, Frederic +Raimbault, Pat Rankin, Rick Richardson, Kevin Rodgers, Kai Uwe Rommel, +Jim Roskind, Alberto Santini, Andreas Scherer, Darrell Schiebel, Raf +Schietekat, Doug Schmidt, Philippe Schnoebelen, Andreas Schwab, Alex +Siegel, Eckehard Stolz, Jan-Erik Strvmquist, Mike Stump, Paul Stuart, +Dave Tallman, Ian Lance Taylor, Chris Thewalt, Richard M. Timoney, +Jodi Tsai, Paul Tuinenga, Gary Weik, Frank Whaley, Gerhard Wilhelms, +Kent Williams, Ken Yap, Ron Zellar, Nathan Zelle, David Zuhn, and +those whose names have slipped my marginal mail-archiving skills but +whose contributions are appreciated all the same. + +Thanks to Keith Bostic, Jon Forrest, Noah Friedman, John Gilmore, +Craig Leres, John Levine, Bob Mulcahy, G.T. Nicol, Francois Pinard, +Rich Salz, and Richard Stallman for help with various distribution +headaches. + +Thanks to Esmond Pitt and Earle Horton for 8-bit character support; +to Benson Margulies and Fred Burke for C++ support; to Kent Williams +and Tom Epperly for C++ class support; to Ove Ewerlid for support of +NUL's; and to Eric Hughes for support of multiple buffers. + +This work was primarily done when I was with the Real Time Systems +Group at the Lawrence Berkeley Laboratory in Berkeley, CA. Many thanks +to all there for the support I received. + +Send comments to @samp{vern@@ee.lbl.gov}. + +@c @node Index, , Top, Top +@c @unnumbered Index +@c +@c @printindex cp + +@contents +@bye + +@c Local variables: +@c texinfo-column-for-description: 32 +@c End: |