From edc848712307fe5c881364e12e520e9fe58d9969 Mon Sep 17 00:00:00 2001
From: Manoj Srivastava
Date: Wed, 3 Dec 2003 10:38:02 +0000
Subject: Initial import of the 2.5.4a branch

git-archimport-id: srivasta@debian.org--2003-primary/flex--upstream--2.5--base-0
---
 MISC/flex.man | 3696 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 3696 insertions(+)
 create mode 100644 MISC/flex.man

diff --git a/MISC/flex.man b/MISC/flex.man
new file mode 100644
index 0000000..d41f5ba
--- /dev/null
+++ b/MISC/flex.man
@@ -0,0 +1,3696 @@
+
+
+
+FLEX(1)                  USER COMMANDS                  FLEX(1)
+
+
+
+NAME
+     flex - fast lexical analyzer generator
+
+SYNOPSIS
+     flex [-bcdfhilnpstvwBFILTV78+? -C[aefFmr] -ooutput -Pprefix
+     -Sskeleton] [--help --version] [filename ...]
+
+OVERVIEW
+     This manual describes flex, a tool for generating programs
+     that perform pattern-matching on text. The manual includes
+     both tutorial and reference sections:
+
+         Description
+             a brief overview of the tool
+
+         Some Simple Examples
+
+         Format Of The Input File
+
+         Patterns
+             the extended regular expressions used by flex
+
+         How The Input Is Matched
+             the rules for determining what has been matched
+
+         Actions
+             how to specify what to do when a pattern is matched
+
+         The Generated Scanner
+             details regarding the scanner that flex produces;
+             how to control the input source
+
+         Start Conditions
+             introducing context into your scanners, and
+             managing "mini-scanners"
+
+         Multiple Input Buffers
+             how to manipulate multiple input sources; how to
+             scan from strings instead of files
+
+         End-of-file Rules
+             special rules for matching the end of the input
+
+         Miscellaneous Macros
+             a summary of macros available to the actions
+
+         Values Available To The User
+             a summary of values available to the actions
+
+         Interfacing With Yacc
+             connecting flex scanners together with yacc parsers
+
+
+Version 2.5          Last change: April 1995                    1
+
+         Options
+             flex command-line options, and the "%option"
+             directive
+
+         Performance Considerations
+             how to make your scanner go as fast as possible
+
+         Generating C++ Scanners
+             the (experimental) facility for generating C++
+             scanner classes
+
+         Incompatibilities With Lex And POSIX
+             how flex differs from AT&T lex and the POSIX lex
+             standard
+
+         Diagnostics
+             those error messages produced by flex (or scanners
+             it generates) whose meanings might not be apparent
+
+         Files
+             files used by flex
+
+         Deficiencies / Bugs
+             known problems with flex
+
+         See Also
+             other documentation, related tools
+
+         Author
+             includes contact information
+
+
+DESCRIPTION
+     flex is a tool for generating scanners: programs which
+     recognize lexical patterns in text. flex reads the given
+     input files, or its standard input if no file names are
+     given, for a description of a scanner to generate. The
+     description is in the form of pairs of regular expressions
+     and C code, called rules. flex generates as output a C
+     source file, lex.yy.c, which defines a routine yylex(). This
+     file is compiled and linked with the -lfl library to produce
+     an executable. When the executable is run, it analyzes its
+     input for occurrences of the regular expressions. Whenever
+     it finds one, it executes the corresponding C code.
+
+SOME SIMPLE EXAMPLES
+     First, some simple examples to get the flavor of how one uses
+     flex. The following flex input specifies a scanner which,
+     whenever it encounters the string "username", will replace it
+     with the user's login name:
+
+         %%
+         username    printf( "%s", getlogin() );
+
+     By default, any text not matched by a flex scanner is copied
+     to the output, so the net effect of this scanner is to copy
+     its input file to its output with each occurrence of
+     "username" expanded. In this input, there is just one rule.
+     "username" is the pattern and the "printf" is the action.
+     The "%%" marks the beginning of the rules.
+
+     Here's another simple example:
+
+                 int num_lines = 0, num_chars = 0;
+
+         %%
+         \n      ++num_lines; ++num_chars;
+         .       ++num_chars;
+
+         %%
+         main()
+                 {
+                 yylex();
+                 printf( "# of lines = %d, # of chars = %d\n",
+                         num_lines, num_chars );
+                 }
+
+     This scanner counts the number of characters and the number
+     of lines in its input (it produces no output other than the
+     final report on the counts). The first line declares two
+     globals, "num_lines" and "num_chars", which are accessible
+     both inside yylex() and in the main() routine declared after
+     the second "%%". There are two rules, one which matches a
+     newline ("\n") and increments both the line count and the
+     character count, and one which matches any character other
+     than a newline (indicated by the "." regular expression).
+
+     A somewhat more complicated example:
+
+         /* scanner for a toy Pascal-like language */
+
+         %{
+         /* need this for the call to atof() below */
+         #include <stdlib.h>
+         %}
+
+         DIGIT    [0-9]
+         ID       [a-z][a-z0-9]*
+
+         %%
+
+         {DIGIT}+    {
+                     printf( "An integer: %s (%d)\n", yytext,
+                             atoi( yytext ) );
+                     }
+
+         {DIGIT}+"."{DIGIT}*        {
+                     printf( "A float: %s (%g)\n", yytext,
+                             atof( yytext ) );
+                     }
+
+         if|then|begin|end|procedure|function        {
+                     printf( "A keyword: %s\n", yytext );
+                     }
+
+         {ID}        printf( "An identifier: %s\n", yytext );
+
+         "+"|"-"|"*"|"/"   printf( "An operator: %s\n", yytext );
+
+         "{"[^}\n]*"}"     /* eat up one-line comments */
+
+         [ \t\n]+          /* eat up whitespace */
+
+         .           printf( "Unrecognized character: %s\n", yytext );
+
+         %%
+
+         main( argc, argv )
+         int argc;
+         char **argv;
+             {
+             ++argv, --argc;  /* skip over program name */
+             if ( argc > 0 )
+                     yyin = fopen( argv[0], "r" );
+             else
+                     yyin = stdin;
+
+             yylex();
+             }
+
+     This is the beginning of a simple scanner for a language
+     like Pascal. It identifies different types of tokens and
+     reports on what it has seen.
+
+     The details of this example will be explained in the following
+     sections.
+
+FORMAT OF THE INPUT FILE
+     The flex input file consists of three sections, separated by
+     a line with just %% in it:
+
+         definitions
+         %%
+         rules
+         %%
+         user code
+
+     The definitions section contains declarations of simple name
+     definitions to simplify the scanner specification, and
+     declarations of start conditions, which are explained in a
+     later section.
+
+     Name definitions have the form:
+
+         name definition
+
+     The "name" is a word beginning with a letter or an underscore
+     ('_') followed by zero or more letters, digits, '_',
+     or '-' (dash). The definition is taken to begin at the
+     first non-white-space character following the name and to
+     continue to the end of the line. The definition can
+     subsequently be referred to using "{name}", which will expand to
+     "(definition)". For example,
+
+         DIGIT    [0-9]
+         ID       [a-z][a-z0-9]*
+
+     defines "DIGIT" to be a regular expression which matches a
+     single digit, and "ID" to be a regular expression which
+     matches a letter followed by zero-or-more letters-or-digits.
+     A subsequent reference to
+
+         {DIGIT}+"."{DIGIT}*
+
+     is identical to
+
+         ([0-9])+"."([0-9])*
+
+     and matches one-or-more digits followed by a '.' followed by
+     zero-or-more digits.
+
+     The rules section of the flex input contains a series of
+     rules of the form:
+
+         pattern   action
+
+     where the pattern must be unindented and the action must
+     begin on the same line.
+
+     See below for a further description of patterns and actions.
+
+     Finally, the user code section is simply copied to lex.yy.c
+     verbatim. It is used for companion routines which call or
+     are called by the scanner. The presence of this section is
+     optional; if it is missing, the second %% in the input file
+     may be skipped, too.
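The naming rule for definitions given earlier in this section can be made concrete with a small C check. This is only an illustrative sketch and is not code taken from flex: the function name is_definition_name() is hypothetical, and the sketch assumes that "letter" and "digit" mean what the standard <ctype.h> classifiers report.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper, not part of flex: returns 1 if s follows the
 * naming rule for definitions (a letter or '_' first, then any mix
 * of letters, digits, '_' and '-'), and 0 otherwise. */
int is_definition_name(const char *s)
{
    assert(s != NULL);

    /* The first character must be a letter or an underscore. */
    if (!isalpha((unsigned char)*s) && *s != '_')
        return 0;

    /* Every following character: letter, digit, '_' or '-'. */
    for (++s; *s != '\0'; ++s) {
        unsigned char c = (unsigned char)*s;
        if (!isalnum(c) && c != '_' && c != '-')
            return 0;
    }
    return 1;
}
```

Under these assumptions "DIGIT" and "_id-2" are accepted, while a name that starts with a digit or a dash, or that contains a blank, is rejected.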
+
+     In the definitions and rules sections, any indented text or
+     text enclosed in %{ and %} is copied verbatim to the output
+     (with the %{}'s removed). The %{}'s must appear unindented
+     on lines by themselves.
+
+     In the rules section, any indented or %{} text appearing
+     before the first rule may be used to declare variables which
+     are local to the scanning routine and (after the
+     declarations) code which is to be executed whenever the scanning
+     routine is entered. Other indented or %{} text in the rule
+     section is still copied to the output, but its meaning is
+     not well-defined and it may well cause compile-time errors
+     (this feature is present for POSIX compliance; see below for
+     other such features).
+
+     In the definitions section (but not in the rules section),
+     an unindented comment (i.e., a line beginning with "/*") is
+     also copied verbatim to the output up to the next "*/".
+
+PATTERNS
+     The patterns in the input are written using an extended set
+     of regular expressions. These are:
+
+         x          match the character 'x'
+         .          any character (byte) except newline
+         [xyz]      a "character class"; in this case, the pattern
+                      matches either an 'x', a 'y', or a 'z'
+         [abj-oZ]   a "character class" with a range in it; matches
+                      an 'a', a 'b', any letter from 'j' through 'o',
+                      or a 'Z'
+         [^A-Z]     a "negated character class", i.e., any character
+                      but those in the class. In this case, any
+                      character EXCEPT an uppercase letter.
+         [^A-Z\n]   any character EXCEPT an uppercase letter or
+                      a newline
+         r*         zero or more r's, where r is any regular expression
+         r+         one or more r's
+         r?
+                    zero or one r's (that is, "an optional r")
+         r{2,5}     anywhere from two to five r's
+         r{2,}      two or more r's
+         r{4}       exactly 4 r's
+         {name}     the expansion of the "name" definition
+                      (see above)
+         "[xyz]\"foo"
+                    the literal string: [xyz]"foo
+         \X         if X is an 'a', 'b', 'f', 'n', 'r', 't', or 'v',
+                      then the ANSI-C interpretation of \X.
+                      Otherwise, a literal 'X' (used to escape
+                      operators such as '*')
+         \0         a NUL character (ASCII code 0)
+         \123       the character with octal value 123
+         \x2a       the character with hexadecimal value 2a
+         (r)        match an r; parentheses are used to override
+                      precedence (see below)
+
+
+         rs         the regular expression r followed by the
+                      regular expression s; called "concatenation"
+
+
+         r|s        either an r or an s
+
+
+         r/s        an r but only if it is followed by an s. The
+                      text matched by s is included when determining
+                      whether this rule is the "longest match",
+                      but is then returned to the input before
+                      the action is executed. So the action only
+                      sees the text matched by r. This type
+                      of pattern is called "trailing context".
+                      (There are some combinations of r/s that flex
+                      cannot match correctly; see notes in the
+                      Deficiencies / Bugs section below regarding
+                      "dangerous trailing context".)
+         ^r         an r, but only at the beginning of a line (i.e.,
+                      when just starting to scan, or right after a
+                      newline has been scanned).
+         r$         an r, but only at the end of a line (i.e., just
+                      before a newline). Equivalent to "r/\n".
+
+                    Note that flex's notion of "newline" is exactly
+                    whatever the C compiler used to compile flex
+                    interprets '\n' as; in particular, on some DOS
+                    systems you must either filter out \r's in the
+                    input yourself, or explicitly use r/\r\n for "r$".
+
+
+         <s>r       an r, but only in start condition s (see
+                      below for discussion of start conditions)
+         <s1,s2,s3>r
+                    same, but in any of start conditions s1,
+                      s2, or s3
+         <*>r       an r in any start condition, even an exclusive one.
+
+
+         <<EOF>>    an end-of-file
+         <s1,s2><<EOF>>
+                    an end-of-file when in start condition s1 or s2
+
+     Note that inside of a character class, all regular expression
+     operators lose their special meaning except escape
+     ('\') and the character class operators, '-', ']', and, at
+     the beginning of the class, '^'.
+
+     The regular expressions listed above are grouped according
+     to precedence, from highest precedence at the top to lowest
+     at the bottom. Those grouped together have equal
+     precedence. For example,
+
+         foo|bar*
+
+     is the same as
+
+         (foo)|(ba(r*))
+
+     since the '*' operator has higher precedence than
+     concatenation, and concatenation higher than alternation ('|'). This
+     pattern therefore matches either the string "foo" or the
+     string "ba" followed by zero-or-more r's. To match "foo" or
+     zero-or-more "bar"'s, use:
+
+         foo|(bar)*
+
+     and to match zero-or-more "foo"'s-or-"bar"'s:
+
+         (foo|bar)*
+
+
+     In addition to characters and ranges of characters,
+     character classes can also contain character class expressions.
+     These are expressions enclosed inside [: and :] delimiters
+     (which themselves must appear between the '[' and ']' of the
+     character class; other elements may occur inside the
+     character class, too). The valid expressions are:
+
+         [:alnum:] [:alpha:] [:blank:]
+         [:cntrl:] [:digit:] [:graph:]
+         [:lower:] [:print:] [:punct:]
+         [:space:] [:upper:] [:xdigit:]
+
+     These expressions all designate a set of characters
+     equivalent to the corresponding standard C isXXX function.
+     For example, [:alnum:] designates those characters for which
+     isalnum() returns true - i.e., any alphabetic or numeric.
+     Some systems don't provide isblank(), so flex defines
+     [:blank:] as a blank or a tab.
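The correspondence between character class expressions and the isXXX functions can be checked directly in C. The following is a hedged sketch rather than anything flex itself contains: the names class_blank(), class_alnum() and alnum_matches_alpha_or_digit() are hypothetical, and the comparison is made in whatever locale the program happens to run in.

```c
#include <assert.h>
#include <ctype.h>
#include <limits.h>

/* [:blank:] as this manual defines it: a blank or a tab. */
int class_blank(int c)
{
    return c == ' ' || c == '\t';
}

/* [:alnum:] designates the characters for which isalnum() is true. */
int class_alnum(int c)
{
    assert(c >= 0 && c <= UCHAR_MAX);
    return isalnum(c) != 0;
}

/* Returns 1 if [[:alnum:]] and [[:alpha:][:digit:]] select the same
 * characters over the whole unsigned char range, 0 otherwise. */
int alnum_matches_alpha_or_digit(void)
{
    int c;

    for (c = 0; c <= UCHAR_MAX; ++c) {
        int in_union = isalpha(c) || isdigit(c);
        if (class_alnum(c) != in_union)
            return 0;
    }
    return 1;
}
```

The loop relies on the C library's definition of isalnum() as true exactly where isalpha() or isdigit() is true, which is why a class built from [:alpha:] and [:digit:] selects the same characters as [:alnum:].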
+
+     For example, the following character classes are all
+     equivalent:
+
+         [[:alnum:]]
+         [[:alpha:][:digit:]]
+         [[:alpha:]0-9]
+         [a-zA-Z0-9]
+
+     If your scanner is case-insensitive (the -i flag), then
+     [:upper:] and [:lower:] are equivalent to [:alpha:].
+
+     Some notes on patterns:
+
+     -    A negated character class such as the example "[^A-Z]"
+          above will match a newline unless "\n" (or an
+          equivalent escape sequence) is one of the characters
+          explicitly present in the negated character class
+          (e.g., "[^A-Z\n]"). This is unlike how many other
+          regular expression tools treat negated character classes,
+          but unfortunately the inconsistency is historically
+          entrenched. Matching newlines means that a pattern
+          like [^"]* can match the entire input unless there's
+          another quote in the input.
+
+     -    A rule can have at most one instance of trailing
+          context (the '/' operator or the '$' operator). The start
+          condition, '^', and "<<EOF>>" patterns can only occur
+          at the beginning of a pattern, and, as well as with '/'
+          and '$', cannot be grouped inside parentheses. A '^'
+          which does not occur at the beginning of a rule or a
+          '$' which does not occur at the end of a rule loses its
+          special properties and is treated as a normal
+          character.
+
+          The following are illegal:
+
+              foo/bar$
+              foo<sc1>bar
+
+          Note that the first of these can be written
+          "foo/bar\n".
+
+          The following will result in '$' or '^' being treated
+          as a normal character:
+
+              foo|(bar$)
+              foo|^bar
+
+          If what's wanted is a "foo" or a
+          bar-followed-by-a-newline, the following could be used (the special '|'
+          action is explained below):
+
+              foo      |
+              bar$     /* action goes here */
+
+          A similar trick will work for matching a foo or a
+          bar-at-the-beginning-of-a-line.
+
+HOW THE INPUT IS MATCHED
+     When the generated scanner is run, it analyzes its input
+     looking for strings which match any of its patterns.
+     If it finds more than one match, it takes the one matching the
+     most text (for trailing context rules, this includes the
+     length of the trailing part, even though it will then be
+     returned to the input). If it finds two or more matches of
+     the same length, the rule listed first in the flex input
+     file is chosen.
+
+     Once the match is determined, the text corresponding to the
+     match (called the token) is made available in the global
+     character pointer yytext, and its length in the global
+     integer yyleng. The action corresponding to the matched
+     pattern is then executed (a more detailed description of
+     actions follows), and then the remaining input is scanned
+     for another match.
+
+     If no match is found, then the default rule is executed: the
+     next character in the input is considered matched and copied
+     to the standard output. Thus, the simplest legal flex input
+     is:
+
+         %%
+
+     which generates a scanner that simply copies its input (one
+     character at a time) to its output.
+
+     Note that yytext can be defined in two different ways:
+     either as a character pointer or as a character array. You
+     can control which definition flex uses by including one of
+     the special directives %pointer or %array in the first
+     (definitions) section of your flex input. The default is
+     %pointer, unless you use the -l lex compatibility option, in
+     which case yytext will be an array. The advantage of using
+     %pointer is substantially faster scanning and no buffer
+     overflow when matching very large tokens (unless you run out
+     of dynamic memory). The disadvantage is that you are
+     restricted in how your actions can modify yytext (see the next
+     section), and calls to the unput() function destroy the
+     present contents of yytext, which can be a considerable
+     porting headache when moving between different lex versions.
+
+     The advantage of %array is that you can then modify yytext
+     to your heart's content, and calls to unput() do not destroy
+     yytext (see below). Furthermore, existing lex programs
+     sometimes access yytext externally using declarations of the
+     form:
+         extern char yytext[];
+     This definition is erroneous when used with %pointer, but
+     correct for %array.
+
+     %array defines yytext to be an array of YYLMAX characters,
+     which defaults to a fairly large value. You can change the
+     size by simply #define'ing YYLMAX to a different value in
+     the first section of your flex input. As mentioned above,
+     with %pointer yytext grows dynamically to accommodate large
+     tokens. While this means your %pointer scanner can
+     accommodate very large tokens (such as matching entire blocks of
+     comments), bear in mind that each time the scanner must
+     resize yytext it also must rescan the entire token from the
+     beginning, so matching such tokens can prove slow. yytext
+     presently does not dynamically grow if a call to unput()
+     results in too much text being pushed back; instead, a
+     run-time error results.
+
+     Also note that you cannot use %array with C++ scanner
+     classes (the c++ option; see below).
+
+ACTIONS
+     Each pattern in a rule has a corresponding action, which can
+     be any arbitrary C statement. The pattern ends at the first
+     non-escaped whitespace character; the remainder of the line
+     is its action. If the action is empty, then when the
+     pattern is matched the input token is simply discarded. For
+     example, here is the specification for a program which
+     deletes all occurrences of "zap me" from its input:
+
+         %%
+         "zap me"
+
+     (It will copy all other characters in the input to the
+     output since they will be matched by the default rule.)
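For comparison, the effect of the "zap me" scanner can be sketched in plain C. This is a hand-written stand-in and not the code flex generates; the function name zap() and the string-to-string interface are assumptions made for illustration, since a real scanner reads yyin and writes yyout.

```c
#include <assert.h>
#include <string.h>

/* Hand-written stand-in for the "zap me" scanner (not flex output):
 * copies in to out, dropping every occurrence of "zap me".
 * out must have room for strlen(in) + 1 bytes. */
void zap(const char *in, char *out)
{
    static const char pattern[] = "zap me";
    const size_t plen = sizeof pattern - 1;
    const char *hit;

    assert(in != NULL && out != NULL);

    while ((hit = strstr(in, pattern)) != NULL) {
        /* Default rule: unmatched text is copied through unchanged. */
        memcpy(out, in, (size_t)(hit - in));
        out += hit - in;
        /* Empty action: the matched token is simply discarded. */
        in = hit + plen;
    }
    strcpy(out, in);    /* copy whatever follows the last match */
}
```

The two commented steps mirror the two things the scanner does: the default rule copies unmatched characters, and the empty action throws the matched token away.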
+
+     Here is a program which compresses multiple blanks and tabs
+     down to a single blank, and throws away whitespace found at
+     the end of a line:
+
+         %%
+         [ \t]+        putchar( ' ' );
+         [ \t]+$       /* ignore this token */
+
+
+     If the action contains a '{', then the action extends until the
+     balancing '}' is found, and the action may cross multiple
+     lines. flex knows about C strings and comments and won't be
+     fooled by braces found within them, but also allows actions
+     to begin with %{ and will consider the action to be all the
+     text up to the next %} (regardless of ordinary braces inside
+     the action).
+
+     An action consisting solely of a vertical bar ('|') means
+     "same as the action for the next rule." See below for an
+     illustration.
+
+     Actions can include arbitrary C code, including return
+     statements to return a value to whatever routine called
+     yylex(). Each time yylex() is called it continues processing
+     tokens from where it last left off until it either reaches
+     the end of the file or executes a return.
+
+     Actions are free to modify yytext except for lengthening it
+     (adding characters to its end--these will overwrite later
+     characters in the input stream). This however does not
+     apply when using %array (see above); in that case, yytext
+     may be freely modified in any way.
+
+     Actions are free to modify yyleng except they should not do
+     so if the action also includes use of yymore() (see below).
+
+     There are a number of special directives which can be
+     included within an action:
+
+     -    ECHO copies yytext to the scanner's output.
+
+     -    BEGIN followed by the name of a start condition places
+          the scanner in the corresponding start condition (see
+          below).
+
+     -    REJECT directs the scanner to proceed on to the "second
+          best" rule which matched the input (or a prefix of the
+          input).
+          The rule is chosen as described above in "How
+          the Input is Matched", and yytext and yyleng are set up
+          appropriately. It may either be one which matched as
+          much text as the originally chosen rule but came later
+          in the flex input file, or one which matched less text.
+          For example, the following will both count the words in
+          the input and call the routine special() whenever
+          "frob" is seen:
+
+                      int word_count = 0;
+              %%
+
+              frob        special(); REJECT;
+              [^ \t\n]+   ++word_count;
+
+          Without the REJECT, any "frob"'s in the input would not
+          be counted as words, since the scanner normally
+          executes only one action per token. Multiple REJECT's are
+          allowed, each one finding the next best choice to the
+          currently active rule. For example, when the following
+          scanner scans the token "abcd", it will write
+          "abcdabcaba" to the output:
+
+              %%
+              a        |
+              ab       |
+              abc      |
+              abcd     ECHO; REJECT;
+              .|\n     /* eat up any unmatched character */
+
+          (The first three rules share the fourth's action since
+          they use the special '|' action.) REJECT is a
+          particularly expensive feature in terms of scanner
+          performance; if it is used in any of the scanner's actions
+          it will slow down all of the scanner's matching.
+          Furthermore, REJECT cannot be used with the -Cf or -CF
+          options (see below).
+
+          Note also that unlike the other special actions, REJECT
+          is a branch; code immediately following it in the
+          action will not be executed.
+
+     -    yymore() tells the scanner that the next time it
+          matches a rule, the corresponding token should be
+          appended onto the current value of yytext rather than
+          replacing it. For example, given the input
+          "mega-kludge" the following will write "mega-mega-kludge" to
+          the output:
+
+              %%
+              mega-    ECHO; yymore();
+              kludge   ECHO;
+
+          First "mega-" is matched and echoed to the output.
+          Then "kludge" is matched, but the previous "mega-" is
+          still hanging around at the beginning of yytext so the
+          ECHO for the "kludge" rule will actually write
+          "mega-kludge".
+
+     Two notes regarding use of yymore(). First, yymore() depends
+     on the value of yyleng correctly reflecting the size of the
+     current token, so you must not modify yyleng if you are
+     using yymore(). Second, the presence of yymore() in the
+     scanner's action entails a minor performance penalty in the
+     scanner's matching speed.
+
+     -    yyless(n) returns all but the first n characters of the
+          current token back to the input stream, where they will
+          be rescanned when the scanner looks for the next match.
+          yytext and yyleng are adjusted appropriately (e.g.,
+          yyleng will now be equal to n). For example, on the
+          input "foobar" the following will write out
+          "foobarbar":
+
+              %%
+              foobar    ECHO; yyless(3);
+              [a-z]+    ECHO;
+
+          An argument of 0 to yyless will cause the entire
+          current input string to be scanned again. Unless
+          you've changed how the scanner will subsequently
+          process its input (using BEGIN, for example), this will
+          result in an endless loop.
+
+     Note that yyless is a macro and can only be used in the flex
+     input file, not from other source files.
+
+     -    unput(c) puts the character c back onto the input
+          stream. It will be the next character scanned. The
+          following action will take the current token and cause
+          it to be rescanned enclosed in parentheses.
+
+              {
+              int i;
+              /* Copy yytext because unput() trashes yytext */
+              char *yycopy = strdup( yytext );
+              unput( ')' );
+              for ( i = yyleng - 1; i >= 0; --i )
+                  unput( yycopy[i] );
+              unput( '(' );
+              free( yycopy );
+              }
+
+          Note that since each unput() puts the given character
+          back at the beginning of the input stream, pushing back
+          strings must be done back-to-front.
+
+     An important potential problem when using unput() is that if
+     you are using %pointer (the default), a call to unput()
+     destroys the contents of yytext, starting with its rightmost
+     character and devouring one character to the left with each
+     call. If you need the value of yytext preserved after a
+     call to unput() (as in the above example), you must either
+     first copy it elsewhere, or build your scanner using %array
+     instead (see How The Input Is Matched).
+
+     Finally, note that you cannot put back EOF to attempt to
+     mark the input stream with an end-of-file.
+
+     -    input() reads the next character from the input stream.
+          For example, the following is one way to eat up C
+          comments:
+
+              %%
+              "/*"        {
+                          register int c;
+
+                          for ( ; ; )
+                              {
+                              while ( (c = input()) != '*' &&
+                                      c != EOF )
+                                  ;    /* eat up text of comment */
+
+                              if ( c == '*' )
+                                  {
+                                  while ( (c = input()) == '*' )
+                                      ;
+                                  if ( c == '/' )
+                                      break;    /* found the end */
+                                  }
+
+                              if ( c == EOF )
+                                  {
+                                  error( "EOF in comment" );
+                                  break;
+                                  }
+                              }
+                          }
+
+          (Note that if the scanner is compiled using C++, then
+          input() is instead referred to as yyinput(), in order
+          to avoid a name clash with the C++ stream by the name
+          of input.)
+
+     -    YY_FLUSH_BUFFER flushes the scanner's internal buffer
+          so that the next time the scanner attempts to match a
+          token, it will first refill the buffer using YY_INPUT
+          (see The Generated Scanner, below). This action is a
+          special case of the more general yy_flush_buffer()
+          function, described below in the section Multiple Input
+          Buffers.
+
+     -    yyterminate() can be used in lieu of a return statement
+          in an action. It terminates the scanner and returns a
+          0 to the scanner's caller, indicating "all done". By
+          default, yyterminate() is also called when an
+          end-of-file is encountered. It is a macro and may be
+          redefined.
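The back-to-front rule for unput() described in the list above can be seen in isolation with a toy pushback buffer. The sketch below is hypothetical and far simpler than the buffering flex really uses: struct pushback and its functions are invented for illustration, and the buffer is a fixed-size array with no growth.

```c
#include <assert.h>
#include <string.h>

#define PUSHBACK_MAX 64

/* Toy pushback buffer (hypothetical, not flex's implementation).
 * Characters are stored right-aligned so that the most recently
 * pushed character is the next one to be read. */
struct pushback {
    char data[PUSHBACK_MAX + 1];  /* data[PUSHBACK_MAX] stays '\0' */
    size_t start;                 /* index of the next char to read */
};

void pushback_init(struct pushback *pb)
{
    assert(pb != NULL);
    pb->start = PUSHBACK_MAX;
    pb->data[PUSHBACK_MAX] = '\0';
}

/* Like unput(c): c becomes the next character to be scanned. */
void pushback_unput(struct pushback *pb, int c)
{
    assert(pb != NULL && pb->start > 0);
    pb->data[--pb->start] = (char)c;
}

/* Push token back enclosed in parentheses, working back-to-front. */
void pushback_parenthesized(struct pushback *pb, const char *token)
{
    size_t i;

    assert(token != NULL);
    i = strlen(token);

    pushback_unput(pb, ')');
    while (i > 0)
        pushback_unput(pb, token[--i]);
    pushback_unput(pb, '(');
}

/* The pending input, in the order it would be scanned. */
const char *pushback_pending(const struct pushback *pb)
{
    assert(pb != NULL);
    return pb->data + pb->start;
}
```

Because every push lands in front of what is already pending, the closing parenthesis has to go in first and the opening one last, which is the same order the unput() example uses.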
+
+THE GENERATED SCANNER
+     The output of flex is the file lex.yy.c, which contains the
+     scanning routine yylex(), a number of tables used by it for
+     matching tokens, and a number of auxiliary routines and
+     macros. By default, yylex() is declared as follows:
+
+         int yylex()
+             {
+             ... various definitions and the actions in here ...
+             }
+
+     (If your environment supports function prototypes, then it
+     will be "int yylex( void )".) This definition may be
+     changed by defining the "YY_DECL" macro. For example, you
+     could use:
+
+         #define YY_DECL float lexscan( a, b ) float a, b;
+
+     to give the scanning routine the name lexscan, returning a
+     float, and taking two floats as arguments. Note that if you
+     give arguments to the scanning routine using a
+     K&R-style/non-prototyped function declaration, you must
+     terminate the definition with a semi-colon (;).
+
+     Whenever yylex() is called, it scans tokens from the global
+     input file yyin (which defaults to stdin). It continues
+     until it either reaches an end-of-file (at which point it
+     returns the value 0) or one of its actions executes a return
+     statement.
+
+     If the scanner reaches an end-of-file, subsequent calls are
+     undefined unless either yyin is pointed at a new input file
+     (in which case scanning continues from that file), or
+     yyrestart() is called. yyrestart() takes one argument, a FILE *
+     pointer (which can be nil, if you've set up YY_INPUT to scan
+     from a source other than yyin), and initializes yyin for
+     scanning from that file. Essentially there is no difference
+     between just assigning yyin to a new input file or using
+     yyrestart() to do so; the latter is available for
+     compatibility with previous versions of flex, and because it can be
+     used to switch input files in the middle of scanning.
+     It can also be used to throw away the current input buffer, by
+     calling it with an argument of yyin, though it is better to use
+     YY_FLUSH_BUFFER (see above). Note that yyrestart() does not
+     reset the start condition to INITIAL (see Start Conditions,
+     below).
+
+     If yylex() stops scanning due to executing a return
+     statement in one of the actions, the scanner may then be called
+     again and it will resume scanning where it left off.
+
+     By default (and for purposes of efficiency), the scanner
+     uses block-reads rather than simple getc() calls to read
+     characters from yyin. The nature of how it gets its input
+     can be controlled by defining the YY_INPUT macro.
+     YY_INPUT's calling sequence is
+     "YY_INPUT(buf,result,max_size)". Its action is to place up
+     to max_size characters in the character array buf and return
+     in the integer variable result either the number of
+     characters read or the constant YY_NULL (0 on Unix systems) to
+     indicate EOF. The default YY_INPUT reads from the global
+     file-pointer "yyin".
+
+     A sample definition of YY_INPUT (in the definitions section
+     of the input file):
+
+         %{
+         #define YY_INPUT(buf,result,max_size) \
+             { \
+             int c = getchar(); \
+             result = (c == EOF) ? YY_NULL : (buf[0] = c, 1); \
+             }
+         %}
+
+     This definition will change the input processing to occur
+     one character at a time.
+
+     When the scanner receives an end-of-file indication from
+     YY_INPUT, it then checks the yywrap() function. If yywrap()
+     returns false (zero), then it is assumed that the function
+     has gone ahead and set up yyin to point to another input
+     file, and scanning continues. If it returns true
+     (non-zero), then the scanner terminates, returning 0 to its
+     caller. Note that in either case, the start condition
+     remains unchanged; it does not revert to INITIAL.
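The YY_INPUT contract described above (fill buf with at most max_size characters and report how many were stored, or 0 at end of input) can be exercised outside a scanner. The following sketch is an assumption-laden illustration: struct mem_source and mem_read() are hypothetical names, and the input is a NUL-terminated string held in memory rather than yyin.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical in-memory input source; not part of flex. */
struct mem_source {
    const char *text;   /* NUL-terminated input */
    size_t pos;         /* how much has been handed out so far */
};

/* Follows the YY_INPUT contract: copy up to max_size characters
 * into buf and return how many were copied, or 0 (the value of
 * YY_NULL on Unix systems) once the input is exhausted. */
size_t mem_read(struct mem_source *src, char *buf, size_t max_size)
{
    size_t left, n;

    assert(src != NULL && buf != NULL);

    left = strlen(src->text) - src->pos;
    n = left < max_size ? left : max_size;
    memcpy(buf, src->text + src->pos, n);
    src->pos += n;
    return n;
}
```

A YY_INPUT definition could, in principle, assign the return value of a function like this to result; handing out the input in blocks rather than one character at a time keeps the efficiency argument made for the default block-reads.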
+
+     If you do not supply your own version of yywrap(), then you
+     must either use %option noyywrap (in which case the scanner
+     behaves as though yywrap() returned 1), or you must link
+     with -lfl to obtain the default version of the routine,
+     which always returns 1.
+
+     Three routines are available for scanning from in-memory
+     buffers rather than files: yy_scan_string(),
+     yy_scan_bytes(), and yy_scan_buffer(). See the discussion of
+     them below in the section Multiple Input Buffers.
+
+     The scanner writes its ECHO output to the yyout global
+     (default, stdout), which may be redefined by the user simply
+     by assigning it to some other FILE pointer.
+
+START CONDITIONS
+     flex provides a mechanism for conditionally activating
+     rules. Any rule whose pattern is prefixed with "<sc>" will
+     only be active when the scanner is in the start condition
+     named "sc". For example,
+
+         <STRING>[^"]*        { /* eat up the string body ... */
+                     ...
+                     }
+
+     will be active only when the scanner is in the "STRING"
+     start condition, and
+
+         <INITIAL,STRING,QUOTE>\.        { /* handle an escape ... */
+                     ...
+                     }
+
+     will be active only when the current start condition is
+     either "INITIAL", "STRING", or "QUOTE".
+
+     Start conditions are declared in the definitions (first)
+     section of the input using unindented lines beginning with
+     either %s or %x followed by a list of names. The former
+     declares inclusive start conditions, the latter exclusive
+     start conditions. A start condition is activated using the
+     BEGIN action. Until the next BEGIN action is executed,
+     rules with the given start condition will be active and
+     rules with other start conditions will be inactive. If the
+     start condition is inclusive, then rules with no start
+     conditions at all will also be active. If it is exclusive,
+     then only rules qualified with the start condition will be
+     active.
+     A set of rules contingent on the same exclusive
+     start condition describes a scanner which is independent of
+     any of the other rules in the flex input. Because of this,
+     exclusive start conditions make it easy to specify
+     "mini-scanners" which scan portions of the input that are
+     syntactically different from the rest (e.g., comments).
+
+     If the distinction between inclusive and exclusive start
+     conditions is still a little vague, here's a simple example
+     illustrating the connection between the two. The set of
+     rules:
+
+          %s example
+          %%
+
+          <example>foo   do_something();
+
+          bar            something_else();
+
+     is equivalent to
+
+          %x example
+          %%
+
+          <example>foo   do_something();
+
+          <INITIAL,example>bar    something_else();
+
+     Without the <INITIAL,example> qualifier, the bar pattern in
+     the second example wouldn't be active (i.e., couldn't match)
+     when in start condition example. If we just used <example>
+     to qualify bar, though, then it would only be active in
+     example and not in INITIAL, while in the first example it's
+     active in both, because in the first example the example
+     start condition is an inclusive (%s) start condition.
+
+     Also note that the special start-condition specifier <*>
+     matches every start condition. Thus, the above example
+     could also have been written:
+
+          %x example
+          %%
+
+          <example>foo   do_something();
+
+          <*>bar    something_else();
+
+     The default rule (to ECHO any unmatched character) remains
+     active in start conditions. It is equivalent to:
+
+          <*>.|\n     ECHO;
+
+     BEGIN(0) returns to the original state where only the rules
+     with no start conditions are active. This state can also be
+     referred to as the start-condition "INITIAL", so
+     BEGIN(INITIAL) is equivalent to BEGIN(0). (The parentheses
+     around the start condition name are not required but are
+     considered good style.)
+
+     BEGIN actions can also be given as indented code at the
+     beginning of the rules section.
+     For example, the following
+     will cause the scanner to enter the "SPECIAL" start
+     condition whenever yylex() is called and the global variable
+     enter_special is true:
+
+                  int enter_special;
+
+          %x SPECIAL
+          %%
+                  if ( enter_special )
+                      BEGIN(SPECIAL);
+
+          <SPECIAL>blahblahblah
+          ...more rules follow...
+
+     To illustrate the uses of start conditions, here is a
+     scanner which provides two different interpretations of a
+     string like "123.456". By default it will treat it as three
+     tokens, the integer "123", a dot ('.'), and the integer
+     "456". But if the string is preceded earlier in the line by
+     the string "expect-floats" it will treat it as a single
+     token, the floating-point number 123.456:
+
+          %{
+          #include <stdlib.h>   /* atof(), atoi() */
+          %}
+          %s expect
+
+          %%
+          expect-floats        BEGIN(expect);
+
+          <expect>[0-9]+"."[0-9]+      {
+                      printf( "found a float, = %f\n",
+                              atof( yytext ) );
+                      }
+          <expect>\n           {
+                      /* that's the end of the line, so
+                       * we need another "expect-floats"
+                       * before we'll recognize any more
+                       * floating-point numbers
+                       */
+                      BEGIN(INITIAL);
+                      }
+
+          [0-9]+      {
+                      printf( "found an integer, = %d\n",
+                              atoi( yytext ) );
+                      }
+
+          "."         printf( "found a dot\n" );
+
+     Here is a scanner which recognizes (and discards) C comments
+     while maintaining a count of the current input line.
+
+          %x comment
+          %%
+                  int line_num = 1;
+
+          "/*"         BEGIN(comment);
+
+          <comment>[^*\n]*        /* eat anything that's not a '*' */
+          <comment>"*"+[^*/\n]*   /* eat up '*'s not followed by '/'s */
+          <comment>\n             ++line_num;
+          <comment>"*"+"/"        BEGIN(INITIAL);
+
+     This scanner goes to a bit of trouble to match as much text
+     as possible with each rule. In general, when attempting to
+     write a high-speed scanner, try to match as much as possible
+     in each rule, as it's a big win.
+
+     Note that start condition names are really integer values
+     and can be stored as such.
+     Thus, the above could be
+     extended in the following fashion:
+
+          %x comment foo
+          %%
+                  int line_num = 1;
+                  int comment_caller;
+
+          "/*"         {
+                       comment_caller = INITIAL;
+                       BEGIN(comment);
+                       }
+
+          ...
+
+          <foo>"/*"    {
+                       comment_caller = foo;
+                       BEGIN(comment);
+                       }
+
+          <comment>[^*\n]*        /* eat anything that's not a '*' */
+          <comment>"*"+[^*/\n]*   /* eat up '*'s not followed by '/'s */
+          <comment>\n             ++line_num;
+          <comment>"*"+"/"        BEGIN(comment_caller);
+
+     Furthermore, you can access the current start condition
+     using the integer-valued YY_START macro. For example, the
+     above assignments to comment_caller could instead be written
+
+          comment_caller = YY_START;
+
+     Flex provides YYSTATE as an alias for YY_START (since that
+     is what's used by AT&T lex).
+
+     Note that start conditions do not have their own name-space;
+     %s's and %x's declare names in the same fashion as
+     #define's.
+
+     Finally, here's an example of how to match C-style quoted
+     strings using exclusive start conditions, including expanded
+     escape sequences (but not including checking for a string
+     that's too long):
+
+          %x str
+
+          %%
+                  char string_buf[MAX_STR_CONST];
+                  char *string_buf_ptr;
+
+
+          \"      string_buf_ptr = string_buf; BEGIN(str);
+
+          <str>\"        { /* saw closing quote - all done */
+                  BEGIN(INITIAL);
+                  *string_buf_ptr = '\0';
+                  /* return string constant token type and
+                   * value to parser
+                   */
+                  }
+
+          <str>\n        {
+                  /* error - unterminated string constant */
+                  /* generate error message */
+                  }
+
+          <str>\\[0-7]{1,3} {
+                  /* octal escape sequence; "%o" expects a
+                   * pointer to unsigned int
+                   */
+                  unsigned int result;
+
+                  (void) sscanf( yytext + 1, "%o", &result );
+
+                  if ( result > 0xff )
+                          {
+                          /* error, constant is out-of-bounds */
+                          }
+
+                  *string_buf_ptr++ = result;
+                  }
+
+          <str>\\[0-9]+ {
+                  /* generate error - bad escape sequence; something
+                   * like '\48' or '\0777777'
+                   */
+                  }
+
+          <str>\\n  *string_buf_ptr++ = '\n';
+          <str>\\t  *string_buf_ptr++ = '\t';
+          <str>\\r  *string_buf_ptr++ = '\r';
+          <str>\\b  *string_buf_ptr++ = '\b';
+          <str>\\f  *string_buf_ptr++ = '\f';
+
+          <str>\\(.|\n)  *string_buf_ptr++ = yytext[1];
+
+          <str>[^\\\n\"]+        {
+                  char *yptr = yytext;
+
+                  while ( *yptr )
+                          *string_buf_ptr++ = *yptr++;
+                  }
+
+     Often, such as in some of the examples above, you wind up
+     writing a whole bunch of rules all preceded by the same
+     start condition(s). Flex makes this a little easier and
+     cleaner by introducing a notion of start condition scope. A
+     start condition scope is begun with:
+
+          <SCs>{
+
+     where SCs is a list of one or more start conditions. Inside
+     the start condition scope, every rule automatically has the
+     prefix <SCs> applied to it, until a '}' which matches the
+     initial '{'. So, for example,
+
+          <ESC>{
+              "\\n"   return '\n';
+              "\\r"   return '\r';
+              "\\f"   return '\f';
+              "\\0"   return '\0';
+          }
+
+     is equivalent to:
+
+          <ESC>"\\n"  return '\n';
+          <ESC>"\\r"  return '\r';
+          <ESC>"\\f"  return '\f';
+          <ESC>"\\0"  return '\0';
+
+     Start condition scopes may be nested.
+
+     Three routines are available for manipulating stacks of
+     start conditions:
+
+     void yy_push_state(int new_state)
+          pushes the current start condition onto the top of the
+          start condition stack and switches to new_state as
+          though you had used BEGIN new_state (recall that start
+          condition names are also integers).
+
+     void yy_pop_state()
+          pops the top of the stack and switches to it via BEGIN.
+
+     int yy_top_state()
+          returns the top of the stack without altering the
+          stack's contents.
+
+     The start condition stack grows dynamically and so has no
+     built-in size limitation. If memory is exhausted, program
+     execution aborts.
+
+     To use start condition stacks, your scanner must include a
+     %option stack directive (see Options below).
+
+MULTIPLE INPUT BUFFERS
+     Some scanners (such as those which support "include" files)
+     require reading from several input streams.
+     As flex scanners do a large amount of buffering, one cannot
+     control where the next input will be read from by simply
+     writing a YY_INPUT which is sensitive to the scanning
+     context. YY_INPUT is only called when the scanner reaches
+     the end of its buffer, which may be a long time after
+     scanning a statement such as an "include" which requires
+     switching the input source.
+
+     To negotiate these sorts of problems, flex provides a
+     mechanism for creating and switching between multiple input
+     buffers. An input buffer is created by using:
+
+          YY_BUFFER_STATE yy_create_buffer( FILE *file, int size )
+
+     which takes a FILE pointer and a size and creates a buffer
+     associated with the given file and large enough to hold size
+     characters (when in doubt, use YY_BUF_SIZE for the size).
+     It returns a YY_BUFFER_STATE handle, which may then be
+     passed to other routines (see below). The YY_BUFFER_STATE
+     type is a pointer to an opaque struct yy_buffer_state
+     structure, so you may safely initialize YY_BUFFER_STATE
+     variables to ((YY_BUFFER_STATE) 0) if you wish, and also
+     refer to the opaque structure in order to correctly declare
+     input buffers in source files other than that of your
+     scanner. Note that the FILE pointer in the call to
+     yy_create_buffer is only used as the value of yyin seen by
+     YY_INPUT; if you redefine YY_INPUT so it no longer uses
+     yyin, then you can safely pass a nil FILE pointer to
+     yy_create_buffer. You select a particular buffer to scan
+     from using:
+
+          void yy_switch_to_buffer( YY_BUFFER_STATE new_buffer )
+
+     which switches the scanner's input buffer so subsequent
+     tokens will come from new_buffer. Note that
+     yy_switch_to_buffer() may be used by yywrap() to set things
+     up for continued scanning, instead of opening a new file and
+     pointing yyin at it.
+     Note also that switching input sources via either
+     yy_switch_to_buffer() or yywrap() does not change the start
+     condition.
+
+          void yy_delete_buffer( YY_BUFFER_STATE buffer )
+
+     is used to reclaim the storage associated with a buffer.
+     (buffer can be nil, in which case the routine does nothing.)
+     You can also clear the current contents of a buffer using:
+
+          void yy_flush_buffer( YY_BUFFER_STATE buffer )
+
+     This function discards the buffer's contents, so the next
+     time the scanner attempts to match a token from the buffer,
+     it will first fill the buffer anew using YY_INPUT.
+
+     yy_new_buffer() is an alias for yy_create_buffer(), provided
+     for compatibility with the C++ use of new and delete for
+     creating and destroying dynamic objects.
+
+     Finally, the YY_CURRENT_BUFFER macro returns a
+     YY_BUFFER_STATE handle to the current buffer.
+
+     Here is an example of using these features for writing a
+     scanner which expands include files (the <<EOF>> feature is
+     discussed below):
+
+          /* the "incl" state is used for picking up the name
+           * of an include file
+           */
+          %x incl
+
+          %{
+          #define MAX_INCLUDE_DEPTH 10
+          YY_BUFFER_STATE include_stack[MAX_INCLUDE_DEPTH];
+          int include_stack_ptr = 0;
+          %}
+
+          %%
+          include             BEGIN(incl);
+
+          [a-z]+              ECHO;
+          [^a-z\n]*\n?        ECHO;
+
+          <incl>[ \t]*      /* eat the whitespace */
+          <incl>[^ \t\n]+   { /* got the include file name */
+                  if ( include_stack_ptr >= MAX_INCLUDE_DEPTH )
+                      {
+                      fprintf( stderr, "Includes nested too deeply" );
+                      exit( 1 );
+                      }
+
+                  include_stack[include_stack_ptr++] =
+                      YY_CURRENT_BUFFER;
+
+                  yyin = fopen( yytext, "r" );
+
+                  if ( ! yyin )
+                      error( ...
+                           );
+
+                  yy_switch_to_buffer(
+                      yy_create_buffer( yyin, YY_BUF_SIZE ) );
+
+                  BEGIN(INITIAL);
+                  }
+
+          <<EOF>> {
+                  if ( --include_stack_ptr < 0 )
+                      {
+                      yyterminate();
+                      }
+
+                  else
+                      {
+                      yy_delete_buffer( YY_CURRENT_BUFFER );
+                      yy_switch_to_buffer(
+                           include_stack[include_stack_ptr] );
+                      }
+                  }
+
+     Three routines are available for setting up input buffers
+     for scanning in-memory strings instead of files. All of
+     them create a new input buffer for scanning the string, and
+     return a corresponding YY_BUFFER_STATE handle (which you
+     should delete with yy_delete_buffer() when done with it).
+     They also switch to the new buffer using
+     yy_switch_to_buffer(), so the next call to yylex() will
+     start scanning the string.
+
+     yy_scan_string(const char *str)
+          scans a NUL-terminated string.
+
+     yy_scan_bytes(const char *bytes, int len)
+          scans len bytes (including possibly NUL's) starting at
+          location bytes.
+
+     Note that both of these functions create and scan a copy of
+     the string or bytes. (This may be desirable, since yylex()
+     modifies the contents of the buffer it is scanning.) You
+     can avoid the copy by using:
+
+     yy_scan_buffer(char *base, yy_size_t size)
+          which scans in place the buffer starting at base,
+          consisting of size bytes, the last two bytes of which
+          must be YY_END_OF_BUFFER_CHAR (ASCII NUL). These last
+          two bytes are not scanned; thus, scanning consists of
+          base[0] through base[size-2], inclusive.
+
+          If you fail to set up base in this manner (i.e., forget
+          the final two YY_END_OF_BUFFER_CHAR bytes), then
+          yy_scan_buffer() returns a nil pointer instead of
+          creating a new input buffer.
+
+          The type yy_size_t is an integral type to which you can
+          cast an integer expression reflecting the size of the
+          buffer.
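The buffer layout that yy_scan_buffer() expects can be sketched in plain C as follows. make_scan_buffer() is a hypothetical helper, not a flex routine; it assumes, as stated above, that YY_END_OF_BUFFER_CHAR is ASCII NUL, and it leaves the caller responsible for freeing the storage.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: copy len bytes into freshly allocated storage
 * and append the two trailing NUL bytes that yy_scan_buffer()
 * requires.  On success the total size (len + 2) is stored in
 * *size_out and the buffer is returned; on allocation failure the
 * function returns NULL and leaves *size_out untouched.
 */
char *make_scan_buffer(const char *bytes, size_t len, size_t *size_out)
{
    char *base = malloc(len + 2);

    if (base == NULL)
        return NULL;

    memcpy(base, bytes, len);
    base[len] = '\0';       /* first end-of-buffer byte */
    base[len + 1] = '\0';   /* second end-of-buffer byte */
    *size_out = len + 2;
    return base;
}
```

Inside a scanner, the result would then be handed over with something like yy_scan_buffer( base, (yy_size_t) size ). Because the scan happens in place and yylex() modifies the buffer, the storage should stay allocated until the returned handle has been passed to yy_delete_buffer().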
+
+END-OF-FILE RULES
+     The special rule "<<EOF>>" indicates actions which are to be
+     taken when an end-of-file is encountered and yywrap()
+     returns non-zero (i.e., indicates no further files to
+     process). The action must finish by doing one of four
+     things:
+
+     -    assigning yyin to a new input file (in previous
+          versions of flex, after doing the assignment you had to
+          call the special action YY_NEW_FILE; this is no longer
+          necessary);
+
+     -    executing a return statement;
+
+     -    executing the special yyterminate() action;
+
+     -    or, switching to a new buffer using
+          yy_switch_to_buffer() as shown in the example above.
+
+     <<EOF>> rules may not be used with other patterns; they may
+     only be qualified with a list of start conditions. If an
+     unqualified <<EOF>> rule is given, it applies to all start
+     conditions which do not already have <<EOF>> actions. To
+     specify an <<EOF>> rule for only the initial start
+     condition, use
+
+          <INITIAL><<EOF>>
+
+     These rules are useful for catching things like unclosed
+     comments. An example:
+
+          %x quote
+          %%
+
+          ...other rules for dealing with quotes...
+
+          <quote><<EOF>>   {
+                   error( "unterminated quote" );
+                   yyterminate();
+                   }
+          <<EOF>>  {
+                   if ( *++filelist )
+                       yyin = fopen( *filelist, "r" );
+                   else
+                      yyterminate();
+                   }
+
+MISCELLANEOUS MACROS
+     The macro YY_USER_ACTION can be defined to provide an action
+     which is always executed prior to the matched rule's action.
+     For example, it could be #define'd to call a routine to
+     convert yytext to lower-case. When YY_USER_ACTION is
+     invoked, the variable yy_act gives the number of the matched
+     rule (rules are numbered starting with 1). Suppose you want
+     to profile how often each of your rules is matched. The
+     following would do the trick:
+
+          #define YY_USER_ACTION ++ctr[yy_act]
+
+     where ctr is an array to hold the counts for the different
+     rules.
+     Note that the macro YY_NUM_RULES gives the total
+     number of rules (including the default rule, even if you use
+     -s), so a correct declaration for ctr is:
+
+          int ctr[YY_NUM_RULES];
+
+     The macro YY_USER_INIT may be defined to provide an action
+     which is always executed before the first scan (and before
+     the scanner's internal initializations are done). For
+     example, it could be used to call a routine to read in a
+     data table or open a logging file.
+
+     The macro yy_set_interactive(is_interactive) can be used to
+     control whether the current buffer is considered
+     interactive. An interactive buffer is processed more slowly,
+     but must be used when the scanner's input source is indeed
+     interactive to avoid problems due to waiting to fill buffers
+     (see the discussion of the -I flag below). A non-zero value
+     in the macro invocation marks the buffer as interactive, a
+     zero value as non-interactive. Note that use of this macro
+     overrides %option always-interactive or %option
+     never-interactive (see Options below). yy_set_interactive()
+     must be invoked prior to beginning to scan the buffer that
+     is (or is not) to be considered interactive.
+
+     The macro yy_set_bol(at_bol) can be used to control whether
+     the current buffer's scanning context for the next token
+     match is done as though at the beginning of a line. A
+     non-zero macro argument makes rules anchored with '^'
+     active, while a zero argument makes '^' rules inactive.
+
+     The macro YY_AT_BOL() returns true if the next token scanned
+     from the current buffer will have '^' rules active, false
+     otherwise.
+
+     In the generated scanner, the actions are all gathered in
+     one large switch statement and separated using YY_BREAK,
+     which may be redefined. By default, it is simply a "break",
+     to separate each rule's action from the following rule's.
+     Redefining YY_BREAK allows, for example, C++ users to
+     #define YY_BREAK to do nothing (while being very careful
+     that every rule ends with a "break" or a "return"!) to avoid
+     suffering from unreachable statement warnings where, because
+     a rule's action ends with "return", the YY_BREAK is
+     inaccessible.
+
+VALUES AVAILABLE TO THE USER
+     This section summarizes the various values available to the
+     user in the rule actions.
+
+     -    char *yytext holds the text of the current token. It
+          may be modified but not lengthened (you cannot append
+          characters to the end).
+
+          If the special directive %array appears in the first
+          section of the scanner description, then yytext is
+          instead declared char yytext[YYLMAX], where YYLMAX is a
+          macro definition that you can redefine in the first
+          section if you don't like the default value (generally
+          8KB). Using %array results in somewhat slower
+          scanners, but the value of yytext becomes immune to
+          calls to input() and unput(), which potentially destroy
+          its value when yytext is a character pointer. The
+          opposite of %array is %pointer, which is the default.
+
+          You cannot use %array when generating C++ scanner
+          classes (the -+ flag).
+
+     -    int yyleng holds the length of the current token.
+
+     -    FILE *yyin is the file which by default flex reads
+          from. It may be redefined but doing so only makes
+          sense before scanning begins or after an EOF has been
+          encountered. Changing it in the midst of scanning will
+          have unexpected results since flex buffers its input;
+          use yyrestart() instead. Once scanning terminates
+          because an end-of-file has been seen, you can assign
+          the new input file to yyin and then call the scanner
+          again to continue scanning.
+
+     -    void yyrestart( FILE *new_file ) may be called to point
+          yyin at the new input file.
+          The switch-over to the new
+          file is immediate (any previously buffered-up input is
+          lost). Note that calling yyrestart() with yyin as an
+          argument thus throws away the current input buffer and
+          continues scanning the same input file.
+
+     -    FILE *yyout is the file to which ECHO actions are done.
+          It can be reassigned by the user.
+
+     -    YY_CURRENT_BUFFER returns a YY_BUFFER_STATE handle to
+          the current buffer.
+
+     -    YY_START returns an integer value corresponding to the
+          current start condition. You can subsequently use this
+          value with BEGIN to return to that start condition.
+
+INTERFACING WITH YACC
+     One of the main uses of flex is as a companion to the yacc
+     parser-generator. yacc parsers expect to call a routine
+     named yylex() to find the next input token. The routine is
+     supposed to return the type of the next token as well as
+     putting any associated value in the global yylval. To use
+     flex with yacc, one specifies the -d option to yacc to
+     instruct it to generate the file y.tab.h containing
+     definitions of all the %tokens appearing in the yacc input.
+     This file is then included in the flex scanner. For
+     example, if one of the tokens is "TOK_NUMBER", part of the
+     scanner might look like:
+
+          %{
+          #include "y.tab.h"
+          %}
+
+          %%
+
+          [0-9]+        yylval = atoi( yytext ); return TOK_NUMBER;
+
+OPTIONS
+     flex has the following options:
+
+     -b   Generate backing-up information to lex.backup. This is
+          a list of scanner states which require backing up and
+          the input characters on which they do so. By adding
+          rules one can remove backing-up states. If all
+          backing-up states are eliminated and -Cf or -CF is
+          used, the generated scanner will run faster (see the -p
+          flag). Only users who wish to squeeze every last cycle
+          out of their scanners need worry about this option.
+          (See the section on Performance Considerations below.)
+
+     -c   is a do-nothing, deprecated option included for POSIX
+          compliance.
+
+     -d   makes the generated scanner run in debug mode.
+          Whenever a pattern is recognized and the global
+          yy_flex_debug is non-zero (which is the default), the
+          scanner will write to stderr a line of the form:
+
+              --accepting rule at line 53 ("the matched text")
+
+          The line number refers to the location of the rule in
+          the file defining the scanner (i.e., the file that was
+          fed to flex). Messages are also generated when the
+          scanner backs up, accepts the default rule, reaches the
+          end of its input buffer (or encounters a NUL; at this
+          point, the two look the same as far as the scanner's
+          concerned), or reaches an end-of-file.
+
+     -f   specifies fast scanner. No table compression is done
+          and stdio is bypassed. The result is large but fast.
+          This option is equivalent to -Cfr (see below).
+
+     -h   generates a "help" summary of flex's options to stdout
+          and then exits. -? and --help are synonyms for -h.
+
+     -i   instructs flex to generate a case-insensitive scanner.
+          The case of letters given in the flex input patterns
+          will be ignored, and tokens in the input will be
+          matched regardless of case. The matched text given in
+          yytext will have the preserved case (i.e., it will not
+          be folded).
+
+     -l   turns on maximum compatibility with the original AT&T
+          lex implementation. Note that this does not mean full
+          compatibility. Use of this option costs a considerable
+          amount of performance, and it cannot be used with the
+          -+, -f, -F, -Cf, or -CF options. For details on the
+          compatibilities it provides, see the section
+          "Incompatibilities With Lex And POSIX" below. This
+          option also results in the name YY_FLEX_LEX_COMPAT
+          being #define'd in the generated scanner.
+
+     -n   is another do-nothing, deprecated option included only
+          for POSIX compliance.
+
+     -p   generates a performance report to stderr. The report
+          consists of comments regarding features of the flex
+          input file which will cause a serious loss of
+          performance in the resulting scanner. If you give the
+          flag twice, you will also get comments regarding
+          features that lead to minor performance losses.
+
+          Note that the use of REJECT, %option yylineno, and
+          variable trailing context (see the Deficiencies / Bugs
+          section below) entails a substantial performance
+          penalty; use of yymore(), the ^ operator, and the -I
+          flag entail minor performance penalties.
+
+     -s   causes the default rule (that unmatched scanner input
+          is echoed to stdout) to be suppressed. If the scanner
+          encounters input that does not match any of its rules,
+          it aborts with an error. This option is useful for
+          finding holes in a scanner's rule set.
+
+     -t   instructs flex to write the scanner it generates to
+          standard output instead of lex.yy.c.
+
+     -v   specifies that flex should write to stderr a summary of
+          statistics regarding the scanner it generates. Most of
+          the statistics are meaningless to the casual flex user,
+          but the first line identifies the version of flex (same
+          as reported by -V), and the next line the flags used
+          when generating the scanner, including those that are
+          on by default.
+
+     -w   suppresses warning messages.
+
+     -B   instructs flex to generate a batch scanner, the
+          opposite of interactive scanners generated by -I (see
+          below). In general, you use -B when you are certain
+          that your scanner will never be used interactively, and
+          you want to squeeze a little more performance out of
+          it. If your goal is instead to squeeze out a lot more
+          performance, you should be using the -Cf or -CF
+          options (discussed below), which turn on -B
+          automatically anyway.
+
+     -F   specifies that the fast scanner table representation
+          should be used (and stdio bypassed).
+          This representation is about as fast as the full table
+          representation (-f), and for some sets of patterns will
+          be considerably smaller (and for others, larger). In
+          general, if the pattern set contains both "keywords"
+          and a catch-all, "identifier" rule, such as in the set:
+
+               "case"    return TOK_CASE;
+               "switch"  return TOK_SWITCH;
+               ...
+               "default" return TOK_DEFAULT;
+               [a-z]+    return TOK_ID;
+
+          then you're better off using the full table
+          representation. If only the "identifier" rule is
+          present and you then use a hash table or some such to
+          detect the keywords, you're better off using -F.
+
+          This option is equivalent to -CFr (see below). It
+          cannot be used with -+.
+
+     -I   instructs flex to generate an interactive scanner. An
+          interactive scanner is one that only looks ahead to
+          decide what token has been matched if it absolutely
+          must. It turns out that always looking one extra
+          character ahead, even if the scanner has already seen
+          enough text to disambiguate the current token, is a bit
+          faster than only looking ahead when necessary. But
+          scanners that always look ahead give dreadful
+          interactive performance; for example, when a user types
+          a newline, it is not recognized as a newline token
+          until they enter another token, which often means
+          typing in another whole line.
+
+          Flex scanners default to interactive unless you use the
+          -Cf or -CF table-compression options (see below).
+          That's because if you're looking for high-performance
+          you should be using one of these options, so if you
+          didn't, flex assumes you'd rather trade off a bit of
+          run-time performance for intuitive interactive
+          behavior. Note also that you cannot use -I in
+          conjunction with -Cf or -CF. Thus, this option is not
+          really needed; it is on by default for all those cases
+          in which it is allowed.
+
+          You can force a scanner to not be interactive by using
+          -B (see above).
+
+     -L   instructs flex not to generate #line directives.
+          Without this option, flex peppers the generated scanner
+          with #line directives so error messages in the actions
+          will be correctly located with respect to either the
+          original flex input file (if the errors are due to code
+          in the input file), or lex.yy.c (if the errors are
+          flex's fault -- you should report these sorts of errors
+          to the email address given below).
+
+     -T   makes flex run in trace mode. It will generate a lot
+          of messages to stderr concerning the form of the input
+          and the resultant non-deterministic and deterministic
+          finite automata. This option is mostly for use in
+          maintaining flex.
+
+     -V   prints the version number to stdout and exits.
+          --version is a synonym for -V.
+
+     -7   instructs flex to generate a 7-bit scanner, i.e., one
+          which can only recognize 7-bit characters in its
+          input. The advantage of using -7 is that the scanner's
+          tables can be up to half the size of those generated
+          using the -8 option (see below). The disadvantage is
+          that such scanners often hang or crash if their input
+          contains an 8-bit character.
+
+          Note, however, that unless you generate your scanner
+          using the -Cf or -CF table compression options, use of
+          -7 will save only a small amount of table space, and
+          make your scanner considerably less portable. Flex's
+          default behavior is to generate an 8-bit scanner unless
+          you use the -Cf or -CF, in which case flex defaults to
+          generating 7-bit scanners unless your site was always
+          configured to generate 8-bit scanners (as will often be
+          the case with non-USA sites). You can tell whether
+          flex generated a 7-bit or an 8-bit scanner by
+          inspecting the flag summary in the -v output as
+          described above.
+
+          Note that if you use -Cfe or -CFe (those table
+          compression options, but also using equivalence classes
+          as discussed below), flex still defaults to generating
+          an 8-bit scanner, since usually with these compression
+          options full 8-bit tables are not much more expensive
+          than 7-bit tables.
+
+     -8   instructs flex to generate an 8-bit scanner, i.e., one
+          which can recognize 8-bit characters. This flag is
+          only needed for scanners generated using -Cf or -CF, as
+          otherwise flex defaults to generating an 8-bit scanner
+          anyway.
+
+          See the discussion of -7 above for flex's default
+          behavior and the tradeoffs between 7-bit and 8-bit
+          scanners.
+
+     -+   specifies that you want flex to generate a C++ scanner
+          class. See the section on Generating C++ Scanners
+          below for details.
+
+     -C[aefFmr]
+          controls the degree of table compression and, more
+          generally, trade-offs between small scanners and fast
+          scanners.
+
+          -Ca ("align") instructs flex to trade off larger tables
+          in the generated scanner for faster performance because
+          the elements of the tables are better aligned for
+          memory access and computation. On some RISC
+          architectures, fetching and manipulating longwords is
+          more efficient than with smaller-sized units such as
+          shortwords. This option can double the size of the
+          tables used by your scanner.
+
+          -Ce directs flex to construct equivalence classes,
+          i.e., sets of characters which have identical lexical
+          properties (for example, if the only appearance of
+          digits in the flex input is in the character class
+          "[0-9]" then the digits '0', '1', ..., '9' will all be
+          put in the same equivalence class). Equivalence
+          classes usually give dramatic reductions in the final
+          table/object file sizes (typically a factor of 2-5) and
+          are pretty cheap performance-wise (one array look-up
+          per character scanned).
+
+          -Cf specifies that the full scanner tables should be
+          generated - flex should not compress the tables by
+          taking advantage of similar transition functions for
+          different states.
+
+          -CF specifies that the alternate fast scanner
+          representation (described above under the -F flag)
+          should be used. This option cannot be used with -+.
+
+          -Cm directs flex to construct meta-equivalence classes,
+          which are sets of equivalence classes (or characters,
+          if equivalence classes are not being used) that are
+          commonly used together. Meta-equivalence classes are
+          often a big win when using compressed tables, but they
+          have a moderate performance impact (one or two "if"
+          tests and one array look-up per character scanned).
+
+          -Cr causes the generated scanner to bypass use of the
+          standard I/O library (stdio) for input. Instead of
+          calling fread() or getc(), the scanner will use the
+          read() system call, resulting in a performance gain
+          which varies from system to system, but in general is
+          probably negligible unless you are also using -Cf or
+          -CF. Using -Cr can cause strange behavior if, for
+          example, you read from yyin using stdio prior to
+          calling the scanner (because the scanner will miss
+          whatever text your previous reads left in the stdio
+          input buffer).
+
+          -Cr has no effect if you define YY_INPUT (see The
+          Generated Scanner above).
+
+          A lone -C specifies that the scanner tables should be
+          compressed but neither equivalence classes nor
+          meta-equivalence classes should be used.
+
+          The options -Cf or -CF and -Cm do not make sense
+          together - there is no opportunity for meta-equivalence
+          classes if the table is not being compressed.
+          Otherwise the options may be freely mixed, and are
+          cumulative.
+
+          The default setting is -Cem, which specifies that flex
+          should generate equivalence classes and
+          meta-equivalence classes.
This setting provides the highest + degree of table compression. You can trade off + faster-executing scanners at the cost of larger tables + with the following generally being true: + + slowest & smallest + -Cem + -Cm + -Ce + -C + -C{f,F}e + -C{f,F} + -C{f,F}a + fastest & largest + + Note that scanners with the smallest tables are usually + generated and compiled the quickest, so during develop- + ment you will usually want to use the default, maximal + compression. + + -Cfe is often a good compromise between speed and size + for production scanners. + + -ooutput + directs flex to write the scanner to the file output + instead of lex.yy.c. If you combine -o with the -t + option, then the scanner is written to stdout but its + #line directives (see the -L option above) refer to the + file output. + + -Pprefix + + + +Version 2.5 Last change: April 1995 35 + + + + + + +FLEX(1) USER COMMANDS FLEX(1) + + + + changes the default yy prefix used by flex for all + globally-visible variable and function names to instead + be prefix. For example, -Pfoo changes the name of + yytext to footext. It also changes the name of the + default output file from lex.yy.c to lex.foo.c. Here + are all of the names affected: + + yy_create_buffer + yy_delete_buffer + yy_flex_debug + yy_init_buffer + yy_flush_buffer + yy_load_buffer_state + yy_switch_to_buffer + yyin + yyleng + yylex + yylineno + yyout + yyrestart + yytext + yywrap + + (If you are using a C++ scanner, then only yywrap and + yyFlexLexer are affected.) Within your scanner itself, + you can still refer to the global variables and func- + tions using either version of their name; but exter- + nally, they have the modified name. + + This option lets you easily link together multiple flex + programs into the same executable. 
Note, though, that + using this option also renames yywrap(), so you now + must either provide your own (appropriately-named) ver- + sion of the routine for your scanner, or use %option + noyywrap, as linking with -lfl no longer provides one + for you by default. + + -Sskeleton_file + overrides the default skeleton file from which flex + constructs its scanners. You'll never need this option + unless you are doing flex maintenance or development. + + flex also provides a mechanism for controlling options + within the scanner specification itself, rather than from + the flex command-line. This is done by including %option + directives in the first section of the scanner specifica- + tion. You can specify multiple options with a single + %option directive, and multiple directives in the first sec- + tion of your flex input file. + + Most options are given simply as names, optionally preceded + by the word "no" (with no intervening whitespace) to negate + + + +Version 2.5 Last change: April 1995 36 + + + + + + +FLEX(1) USER COMMANDS FLEX(1) + + + + their meaning. A number are equivalent to flex flags or + their negation: + + 7bit -7 option + 8bit -8 option + align -Ca option + backup -b option + batch -B option + c++ -+ option + + caseful or + case-sensitive opposite of -i (default) + + case-insensitive or + caseless -i option + + debug -d option + default opposite of -s option + ecs -Ce option + fast -F option + full -f option + interactive -I option + lex-compat -l option + meta-ecs -Cm option + perf-report -p option + read -Cr option + stdout -t option + verbose -v option + warn opposite of -w option + (use "%option nowarn" for -w) + + array equivalent to "%array" + pointer equivalent to "%pointer" (default) + + Some %option's provide features otherwise not available: + + always-interactive + instructs flex to generate a scanner which always con- + siders its input "interactive". 
Normally, on each new + input file the scanner calls isatty() in an attempt to + determine whether the scanner's input source is + interactive and thus should be read a character at a + time. When this option is used, however, then no such + call is made. + + main directs flex to provide a default main() program for + the scanner, which simply calls yylex(). This option + implies noyywrap (see below). + + never-interactive + instructs flex to generate a scanner which never con- + siders its input "interactive" (again, no call made to + + + +Version 2.5 Last change: April 1995 37 + + + + + + +FLEX(1) USER COMMANDS FLEX(1) + + + + isatty()). This is the opposite of always-interactive. + + stack + enables the use of start condition stacks (see Start + Conditions above). + + stdinit + if set (i.e., %option stdinit) initializes yyin and + yyout to stdin and stdout, instead of the default of + nil. Some existing lex programs depend on this + behavior, even though it is not compliant with ANSI C, + which does not require stdin and stdout to be compile- + time constant. + + yylineno + directs flex to generate a scanner that maintains the + number of the current line read from its input in the + global variable yylineno. This option is implied by + %option lex-compat. + + yywrap + if unset (i.e., %option noyywrap), makes the scanner + not call yywrap() upon an end-of-file, but simply + assume that there are no more files to scan (until the + user points yyin at a new file and calls yylex() + again). + + flex scans your rule actions to determine whether you use + the REJECT or yymore() features. The reject and yymore + options are available to override its decision as to whether + you use the options, either by setting them (e.g., %option + reject) to indicate the feature is indeed used, or unsetting + them to indicate it actually is not used (e.g., %option + noyymore). 
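For scanners that keep yywrap enabled, the following C sketch shows one way a user-supplied yywrap() might step through several input files. It relies on the usual yywrap() convention: a zero return means yyin has been pointed at more input and scanning continues, a non-zero return means there is nothing left. The helper set_input_files() and the variable next_file are hypothetical names introduced here, and the sketch defines its own yyin only so that it compiles standalone; in a real scanner yyin comes from the generated lex.yy.c.

```c
#include <stdio.h>

/* Stand-in so the sketch compiles on its own.  In a real scanner this
 * variable is defined by the flex-generated lex.yy.c; declare it there
 * with "extern FILE *yyin;" instead. */
FILE *yyin;

/* Hypothetical bookkeeping: NULL-terminated list of files still to scan. */
static const char **next_file;

void set_input_files(const char **files)
{
    next_file = files;
}

/* Called by the scanner at end-of-file.
 * Returns 0 if yyin now refers to a new file, 1 if no input remains. */
int yywrap(void)
{
    while (next_file != NULL && *next_file != NULL) {
        FILE *fp = fopen(*next_file++, "r");

        if (fp != NULL) {
            if (yyin != NULL && yyin != stdin)
                fclose(yyin);      /* done with the previous file */
            yyin = fp;
            return 0;              /* keep scanning */
        }
        /* File could not be opened: skip it and try the next one. */
    }
    return 1;                      /* no more input */
}
```

A scanner that only ever reads a single input has no use for such a routine, which is the case %option noyywrap is meant for.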
+ + Three options take string-delimited values, offset with '=': + + %option outfile="ABC" + + is equivalent to -oABC, and + + %option prefix="XYZ" + + is equivalent to -PXYZ. Finally, + + %option yyclass="foo" + + only applies when generating a C++ scanner ( -+ option). It + informs flex that you have derived foo as a subclass of + yyFlexLexer, so flex will place your actions in the member + function foo::yylex() instead of yyFlexLexer::yylex(). It + also generates a yyFlexLexer::yylex() member function that + + + +Version 2.5 Last change: April 1995 38 + + + + + + +FLEX(1) USER COMMANDS FLEX(1) + + + + emits a run-time error (by invoking + yyFlexLexer::LexerError()) if called. See Generating C++ + Scanners, below, for additional information. + + A number of options are available for lint purists who want + to suppress the appearance of unneeded routines in the gen- + erated scanner. Each of the following, if unset (e.g., + %option nounput ), results in the corresponding routine not + appearing in the generated scanner: + + input, unput + yy_push_state, yy_pop_state, yy_top_state + yy_scan_buffer, yy_scan_bytes, yy_scan_string + + (though yy_push_state() and friends won't appear anyway + unless you use %option stack). + +PERFORMANCE CONSIDERATIONS + The main design goal of flex is that it generate high- + performance scanners. It has been optimized for dealing + well with large sets of rules. Aside from the effects on + scanner speed of the table compression -C options outlined + above, there are a number of options/actions which degrade + performance. These are, from most expensive to least: + + REJECT + %option yylineno + arbitrary trailing context + + pattern sets that require backing up + %array + %option interactive + %option always-interactive + + '^' beginning-of-line operator + yymore() + + with the first three all being quite expensive and the last + two being quite cheap. 
Note also that unput() is imple- + mented as a routine call that potentially does quite a bit + of work, while yyless() is a quite-cheap macro; so if just + putting back some excess text you scanned, use yyless(). + + REJECT should be avoided at all costs when performance is + important. It is a particularly expensive option. + + Getting rid of backing up is messy and often may be an enor- + mous amount of work for a complicated scanner. In princi- + ple, one begins by using the -b flag to generate a + lex.backup file. For example, on the input + + %% + foo return TOK_KEYWORD; + foobar return TOK_KEYWORD; + + the file looks like: + + State #6 is non-accepting - + associated rule line numbers: + 2 3 + out-transitions: [ o ] + jam-transitions: EOF [ \001-n p-\177 ] + + State #8 is non-accepting - + associated rule line numbers: + 3 + out-transitions: [ a ] + jam-transitions: EOF [ \001-` b-\177 ] + + State #9 is non-accepting - + associated rule line numbers: + 3 + out-transitions: [ r ] + jam-transitions: EOF [ \001-q s-\177 ] + + Compressed tables always back up. + + The first few lines tell us that there's a scanner state in + which it can make a transition on an 'o' but not on any + other character, and that in that state the currently + scanned text does not match any rule. The state occurs when + trying to match the rules found at lines 2 and 3 in the + input file. If the scanner is in that state and then reads + something other than an 'o', it will have to back up to find + a rule which is matched. With a bit of headscratching one + can see that this must be the state it's in when it has seen + "fo". When this has happened, if anything other than + another 'o' is seen, the scanner will have to back up to + simply match the 'f' (by the default rule). + + The comment regarding State #8 indicates there's a problem + when "foob" has been scanned.
Indeed, on any character + other than an 'a', the scanner will have to back up to + accept "foo". Similarly, the comment for State #9 concerns + when "fooba" has been scanned and an 'r' does not follow. + + The final comment reminds us that there's no point going to + all the trouble of removing backing up from the rules unless + we're using -Cf or -CF, since there's no performance gain + doing so with compressed scanners. + + The way to remove the backing up is to add "error" rules: + + %% + + + +Version 2.5 Last change: April 1995 40 + + + + + + +FLEX(1) USER COMMANDS FLEX(1) + + + + foo return TOK_KEYWORD; + foobar return TOK_KEYWORD; + + fooba | + foob | + fo { + /* false alarm, not really a keyword */ + return TOK_ID; + } + + + Eliminating backing up among a list of keywords can also be + done using a "catch-all" rule: + + %% + foo return TOK_KEYWORD; + foobar return TOK_KEYWORD; + + [a-z]+ return TOK_ID; + + This is usually the best solution when appropriate. + + Backing up messages tend to cascade. With a complicated set + of rules it's not uncommon to get hundreds of messages. If + one can decipher them, though, it often only takes a dozen + or so rules to eliminate the backing up (though it's easy to + make a mistake and have an error rule accidentally match a + valid token. A possible future flex feature will be to + automatically add rules to eliminate backing up). + + It's important to keep in mind that you gain the benefits of + eliminating backing up only if you eliminate every instance + of backing up. Leaving just one means you gain nothing. + + Variable trailing context (where both the leading and trail- + ing parts do not have a fixed length) entails almost the + same performance loss as REJECT (i.e., substantial). 
So + when possible a rule like: + + %% + mouse|rat/(cat|dog) run(); + + is better written: + + %% + mouse/cat|dog run(); + rat/cat|dog run(); + + or as + + %% + mouse|rat/cat run(); + + + +Version 2.5 Last change: April 1995 41 + + + + + + +FLEX(1) USER COMMANDS FLEX(1) + + + + mouse|rat/dog run(); + + Note that here the special '|' action does not provide any + savings, and can even make things worse (see Deficiencies / + Bugs below). + + Another area where the user can increase a scanner's perfor- + mance (and one that's easier to implement) arises from the + fact that the longer the tokens matched, the faster the + scanner will run. This is because with long tokens the pro- + cessing of most input characters takes place in the (short) + inner scanning loop, and does not often have to go through + the additional work of setting up the scanning environment + (e.g., yytext) for the action. Recall the scanner for C + comments: + + %x comment + %% + int line_num = 1; + + "/*" BEGIN(comment); + + [^*\n]* + "*"+[^*/\n]* + \n ++line_num; + "*"+"/" BEGIN(INITIAL); + + This could be sped up by writing it as: + + %x comment + %% + int line_num = 1; + + "/*" BEGIN(comment); + + [^*\n]* + [^*\n]*\n ++line_num; + "*"+[^*/\n]* + "*"+[^*/\n]*\n ++line_num; + "*"+"/" BEGIN(INITIAL); + + Now instead of each newline requiring the processing of + another action, recognizing the newlines is "distributed" + over the other rules to keep the matched text as long as + possible. Note that adding rules does not slow down the + scanner! The speed of the scanner is independent of the + number of rules or (modulo the considerations given at the + beginning of this section) how complicated the rules are + with regard to operators such as '*' and '|'. 
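The saving can be made concrete by counting rule matches. The C sketch below is a hand-written model of the two comment rule sets above, not code produced by flex, and the function name count_matches() is invented for this illustration. It walks over the body of a comment (the text after the opening "/*") and counts how many rule matches, and hence action dispatches, are needed to reach the closing "*/", either with the separate newline rule or with the newlines merged into the other rules.

```c
#include <stddef.h>

/* Counts the rule matches needed to consume a comment body up to and
 * including the closing star-slash (or up to the end of the string if
 * the comment is unterminated).
 * merge_newlines == 0 models the first rule set, where \n is its own rule;
 * merge_newlines != 0 models the second, where a trailing \n is folded
 * into the preceding match. */
size_t count_matches(const char *s, int merge_newlines)
{
    size_t n = 0;

    while (*s != '\0') {
        if (*s == '*') {
            while (*s == '*')               /* "*"+ */
                ++s;
            if (*s == '/')
                return n + 1;               /* "*"+"/" ends the comment */
            while (*s != '\0' && *s != '*' && *s != '/' && *s != '\n')
                ++s;                        /* [^*\/\n]* after the stars */
        } else if (*s != '\n') {
            while (*s != '\0' && *s != '*' && *s != '\n')
                ++s;                        /* [^*\n]* */
        } else if (!merge_newlines) {
            ++s;                            /* the stand-alone \n rule */
            ++n;
            continue;
        }
        if (merge_newlines && *s == '\n')
            ++s;                            /* newline folded into match */
        ++n;
    }
    return n;
}
```

For the body " line one\n line two\n line three\n */" this model needs 8 matches with the first rule set and 5 with the second; the gap widens as the number of lines in a comment grows.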
+ + A final example in speeding up a scanner: suppose you want + to scan through a file containing identifiers and keywords, + one per line and with no other extraneous characters, and + recognize all the keywords. A natural first approach is: + + %% + asm | + auto | + break | + ... etc ... + volatile | + while /* it's a keyword */ + + .|\n /* it's not a keyword */ + + To eliminate the back-tracking, introduce a catch-all rule: + + %% + asm | + auto | + break | + ... etc ... + volatile | + while /* it's a keyword */ + + [a-z]+ | + .|\n /* it's not a keyword */ + + Now, if it's guaranteed that there's exactly one word per + line, then we can reduce the total number of matches by a + half by merging in the recognition of newlines with that of + the other tokens: + + %% + asm\n | + auto\n | + break\n | + ... etc ... + volatile\n | + while\n /* it's a keyword */ + + [a-z]+\n | + .|\n /* it's not a keyword */ + + One has to be careful here, as we have now reintroduced + backing up into the scanner. In particular, while we know + that there will never be any characters in the input stream + other than letters or newlines, flex can't figure this out, + and it will plan for possibly needing to back up when it has + scanned a token like "auto" and then the next character is + something other than a newline or a letter. Previously it + would then just match the "auto" rule and be done, but now + it has no "auto" rule, only an "auto\n" rule. To eliminate + the possibility of backing up, we could either duplicate all + rules but without final newlines, or, since we never expect + to encounter such an input and therefore don't care how it's + classified, we can introduce one more catch-all rule, this + one which doesn't include a newline: + + %% + asm\n | + auto\n | + break\n | + ... etc ... 
+ volatile\n | + while\n /* it's a keyword */ + + [a-z]+\n | + [a-z]+ | + .|\n /* it's not a keyword */ + + Compiled with -Cf, this is about as fast as one can get a + flex scanner to go for this particular problem. + + A final note: flex is slow when matching NUL's, particularly + when a token contains multiple NUL's. It's best to write + rules which match short amounts of text if it's anticipated + that the text will often include NUL's. + + Another final note regarding performance: as mentioned above + in the section How the Input is Matched, dynamically resiz- + ing yytext to accommodate huge tokens is a slow process + because it presently requires that the (huge) token be res- + canned from the beginning. Thus if performance is vital, + you should attempt to match "large" quantities of text but + not "huge" quantities, where the cutoff between the two is + at about 8K characters/token. + +GENERATING C++ SCANNERS + flex provides two different ways to generate scanners for + use with C++. The first way is to simply compile a scanner + generated by flex using a C++ compiler instead of a C com- + piler. You should not encounter any compilation errors + (please report any you find to the email address given in + the Author section below). You can then use C++ code in + your rule actions instead of C code. Note that the default + input source for your scanner remains yyin, and default + echoing is still done to yyout. Both of these remain FILE * + variables and not C++ streams. + + You can also use flex to generate a C++ scanner class, using + the -+ option (or, equivalently, %option c++), which is + automatically specified if the name of the flex executable + ends in a '+', such as flex++. When using this option, flex + defaults to generating the scanner to the file lex.yy.cc + instead of lex.yy.c.
The generated scanner includes the + header file FlexLexer.h, which defines the interface to two + C++ classes. + + The first class, FlexLexer, provides an abstract base class + defining the general scanner class interface. It provides + the following member functions: + + const char* YYText() + returns the text of the most recently matched token, + the equivalent of yytext. + + int YYLeng() + returns the length of the most recently matched token, + the equivalent of yyleng. + + int lineno() const + returns the current input line number (see %option + yylineno), or 1 if %option yylineno was not used. + + void set_debug( int flag ) + sets the debugging flag for the scanner, equivalent to + assigning to yy_flex_debug (see the Options section + above). Note that you must build the scanner using + %option debug to include debugging information in it. + + int debug() const + returns the current setting of the debugging flag. + + Also provided are member functions equivalent to + yy_switch_to_buffer(), yy_create_buffer() (though the first + argument is an istream* object pointer and not a FILE*), + yy_flush_buffer(), yy_delete_buffer(), and yyrestart() + (again, the first argument is an istream* object pointer). + + The second class defined in FlexLexer.h is yyFlexLexer, + which is derived from FlexLexer. It defines the following + additional member functions: + + yyFlexLexer( istream* arg_yyin = 0, ostream* arg_yyout = 0 ) + constructs a yyFlexLexer object using the given streams + for input and output. If not specified, the streams + default to cin and cout, respectively. + + virtual int yylex() + performs the same role as yylex() does for ordinary + flex scanners: it scans the input stream, consuming + tokens, until a rule's action returns a value.
If you + derive a subclass S from yyFlexLexer and want to access + the member functions and variables of S inside yylex(), + then you need to use %option yyclass="S" to inform flex + that you will be using that subclass instead of yyFlex- + Lexer. In this case, rather than generating + + + +Version 2.5 Last change: April 1995 45 + + + + + + +FLEX(1) USER COMMANDS FLEX(1) + + + + yyFlexLexer::yylex(), flex generates S::yylex() (and + also generates a dummy yyFlexLexer::yylex() that calls + yyFlexLexer::LexerError() if called). + + virtual void switch_streams(istream* new_in = 0, + ostream* new_out = 0) reassigns yyin to new_in (if + non-nil) and yyout to new_out (ditto), deleting the + previous input buffer if yyin is reassigned. + + int yylex( istream* new_in, ostream* new_out = 0 ) + first switches the input streams via switch_streams( + new_in, new_out ) and then returns the value of + yylex(). + + In addition, yyFlexLexer defines the following protected + virtual functions which you can redefine in derived classes + to tailor the scanner: + + virtual int LexerInput( char* buf, int max_size ) + reads up to max_size characters into buf and returns + the number of characters read. To indicate end-of- + input, return 0 characters. Note that "interactive" + scanners (see the -B and -I flags) define the macro + YY_INTERACTIVE. If you redefine LexerInput() and need + to take different actions depending on whether or not + the scanner might be scanning an interactive input + source, you can test for the presence of this name via + #ifdef. + + virtual void LexerOutput( const char* buf, int size ) + writes out size characters from the buffer buf, which, + while NUL-terminated, may also contain "internal" NUL's + if the scanner's rules can match text with NUL's in + them. + + virtual void LexerError( const char* msg ) + reports a fatal error message. The default version of + this function writes the message to the stream cerr and + exits. 
+ + Note that a yyFlexLexer object contains its entire scanning + state. Thus you can use such objects to create reentrant + scanners. You can instantiate multiple instances of the + same yyFlexLexer class, and you can also combine multiple + C++ scanner classes together in the same program using the + -P option discussed above. + + Finally, note that the %array feature is not available to + C++ scanner classes; you must use %pointer (the default). + + Here is an example of a simple C++ scanner: + + + + +Version 2.5 Last change: April 1995 46 + + + + + + +FLEX(1) USER COMMANDS FLEX(1) + + + + // An example of using the flex C++ scanner class. + + %{ + int mylineno = 0; + %} + + string \"[^\n"]+\" + + ws [ \t]+ + + alpha [A-Za-z] + dig [0-9] + name ({alpha}|{dig}|\$)({alpha}|{dig}|[_.\-/$])* + num1 [-+]?{dig}+\.?([eE][-+]?{dig}+)? + num2 [-+]?{dig}*\.{dig}+([eE][-+]?{dig}+)? + number {num1}|{num2} + + %% + + {ws} /* skip blanks and tabs */ + + "/*" { + int c; + + while((c = yyinput()) != 0) + { + if(c == '\n') + ++mylineno; + + else if(c == '*') + { + if((c = yyinput()) == '/') + break; + else + unput(c); + } + } + } + + {number} cout << "number " << YYText() << '\n'; + + \n mylineno++; + + {name} cout << "name " << YYText() << '\n'; + + {string} cout << "string " << YYText() << '\n'; + + %% + + int main( int /* argc */, char** /* argv */ ) + { + FlexLexer* lexer = new yyFlexLexer; + + + +Version 2.5 Last change: April 1995 47 + + + + + + +FLEX(1) USER COMMANDS FLEX(1) + + + + while(lexer->yylex() != 0) + ; + return 0; + } + If you want to create multiple (different) lexer classes, + you use the -P flag (or the prefix= option) to rename each + yyFlexLexer to some other xxFlexLexer. 
You then can include + <FlexLexer.h> in your other sources once per lexer class, + first renaming yyFlexLexer as follows: + + #undef yyFlexLexer + #define yyFlexLexer xxFlexLexer + #include <FlexLexer.h> + + #undef yyFlexLexer + #define yyFlexLexer zzFlexLexer + #include <FlexLexer.h> + + if, for example, you used %option prefix="xx" for one of + your scanners and %option prefix="zz" for the other. + + IMPORTANT: the present form of the scanning class is experi- + mental and may change considerably between major releases. + +INCOMPATIBILITIES WITH LEX AND POSIX + flex is a rewrite of the AT&T Unix lex tool (the two imple- + mentations do not share any code, though), with some exten- + sions and incompatibilities, both of which are of concern to + those who wish to write scanners acceptable to either imple- + mentation. Flex is fully compliant with the POSIX lex + specification, except that when using %pointer (the + default), a call to unput() destroys the contents of yytext, + which is counter to the POSIX specification. + + In this section we discuss all of the known areas of incom- + patibility between flex, AT&T lex, and the POSIX specifica- + tion. + + flex's -l option turns on maximum compatibility with the + original AT&T lex implementation, at the cost of a major + loss in the generated scanner's performance. We note below + which incompatibilities can be overcome using the -l option. + + flex is fully compatible with lex with the following excep- + tions: + + - The undocumented lex scanner internal variable yylineno + is not supported unless -l or %option yylineno is used. + + yylineno should be maintained on a per-buffer basis, + rather than a per-scanner (single global variable) + basis. + + yylineno is not part of the POSIX specification. + + - The input() routine is not redefinable, though it may + be called to read characters following whatever has + been matched by a rule.
If input() encounters an end- + of-file the normal yywrap() processing is done. A + ``real'' end-of-file is returned by input() as EOF. + + Input is instead controlled by defining the YY_INPUT + macro. + + The flex restriction that input() cannot be redefined + is in accordance with the POSIX specification, which + simply does not specify any way of controlling the + scanner's input other than by making an initial assign- + ment to yyin. + + - The unput() routine is not redefinable. This restric- + tion is in accordance with POSIX. + + - flex scanners are not as reentrant as lex scanners. In + particular, if you have an interactive scanner and an + interrupt handler which long-jumps out of the scanner, + and the scanner is subsequently called again, you may + get the following message: + + fatal flex scanner internal error--end of buffer missed + + To reenter the scanner, first use + + yyrestart( yyin ); + + Note that this call will throw away any buffered input; + usually this isn't a problem with an interactive + scanner. + + Also note that flex C++ scanner classes are reentrant, + so if using C++ is an option for you, you should use + them instead. See "Generating C++ Scanners" above for + details. + + - output() is not supported. Output from the ECHO macro + is done to the file-pointer yyout (default stdout). + + output() is not part of the POSIX specification. + + - lex does not support exclusive start conditions (%x), + though they are in the POSIX specification. + + - When definitions are expanded, flex encloses them in + parentheses. With lex, the following: + + + + +Version 2.5 Last change: April 1995 49 + + + + + + +FLEX(1) USER COMMANDS FLEX(1) + + + + NAME [A-Z][A-Z0-9]* + %% + foo{NAME}? printf( "Found it\n" ); + %% + + will not match the string "foo" because when the macro + is expanded the rule is equivalent to "foo[A-Z][A-Z0- + 9]*?" and the precedence is such that the '?' is asso- + ciated with "[A-Z0-9]*". 
With flex, the rule will be + expanded to "foo([A-Z][A-Z0-9]*)?" and so the string + "foo" will match. + + Note that if the definition begins with ^ or ends with + $ then it is not expanded with parentheses, to allow + these operators to appear in definitions without losing + their special meanings. But the <s>, /, and <<EOF>> + operators cannot be used in a flex definition. + + Using -l results in the lex behavior of no parentheses + around the definition. + + The POSIX specification is that the definition be + enclosed in parentheses. + + - Some implementations of lex allow a rule's action to + begin on a separate line, if the rule's pattern has + trailing whitespace: + + %% + foo|bar + { foobar_action(); } + + flex does not support this feature. + + - The lex %r (generate a Ratfor scanner) option is not + supported. It is not part of the POSIX specification. + + - After a call to unput(), yytext is undefined until the + next token is matched, unless the scanner was built + using %array. This is not the case with lex or the + POSIX specification. The -l option does away with this + incompatibility. + + - The precedence of the {} (numeric range) operator is + different. lex interprets "abc{1,3}" as "match one, + two, or three occurrences of 'abc'", whereas flex + interprets it as "match 'ab' followed by one, two, or + three occurrences of 'c'". The latter is in agreement + with the POSIX specification. + + - The precedence of the ^ operator is different. lex + interprets "^foo|bar" as "match either 'foo' at the + beginning of a line, or 'bar' anywhere", whereas flex + interprets it as "match either 'foo' or 'bar' if they + come at the beginning of a line". The latter is in + agreement with the POSIX specification. + + - The special table-size declarations such as %a sup- + ported by lex are not required by flex scanners; flex + ignores them.
+ + - The name FLEX_SCANNER is #define'd so scanners may be + written for use with either flex or lex. Scanners also + include YY_FLEX_MAJOR_VERSION and YY_FLEX_MINOR_VERSION + indicating which version of flex generated the scanner + (for example, for the 2.5 release, these defines would + be 2 and 5 respectively). + + The following flex features are not included in lex or the + POSIX specification: + + C++ scanners + %option + start condition scopes + start condition stacks + interactive/non-interactive scanners + yy_scan_string() and friends + yyterminate() + yy_set_interactive() + yy_set_bol() + YY_AT_BOL() + <<EOF>> + <*> + YY_DECL + YY_START + YY_USER_ACTION + YY_USER_INIT + #line directives + %{}'s around actions + multiple actions on a line + + plus almost all of the flex flags. The last feature in the + list refers to the fact that with flex you can put multiple + actions on the same line, separated with semi-colons, while + with lex, the following + + foo handle_foo(); ++num_foos_seen; + + is (rather surprisingly) truncated to + + foo handle_foo(); + + flex does not truncate the action. Actions that are not + enclosed in braces are simply terminated at the end of the + line. + +DIAGNOSTICS + warning, rule cannot be matched indicates that the given + rule cannot be matched because it follows other rules that + will always match the same text as it. For example, in the + following "foo" cannot be matched because it comes after an + identifier "catch-all" rule: + + [a-z]+ got_identifier(); + foo got_foo(); + + Using REJECT in a scanner suppresses this warning. + + warning, -s option given but default rule can be matched + means that it is possible (perhaps only in a particular + start condition) that the default rule (match any single + character) is the only one that will match a particular + input. Since -s was given, presumably this is not intended.
+ + reject_used_but_not_detected undefined or + yymore_used_but_not_detected undefined - These errors can + occur at compile time. They indicate that the scanner uses + REJECT or yymore() but that flex failed to notice the fact, + meaning that flex scanned the first two sections looking for + occurrences of these actions and failed to find any, but + somehow you snuck some in (via a #include file, for exam- + ple). Use %option reject or %option yymore to indicate to + flex that you really do use these features. + + flex scanner jammed - a scanner compiled with -s has encoun- + tered an input string which wasn't matched by any of its + rules. This error can also occur due to internal problems. + + token too large, exceeds YYLMAX - your scanner uses %array + and one of its rules matched a string longer than the YYLMAX + constant (8K bytes by default). You can increase the value + by #define'ing YYLMAX in the definitions section of your + flex input. + + scanner requires -8 flag to use the character 'x' - Your + scanner specification includes recognizing the 8-bit charac- + ter 'x' and you did not specify the -8 flag, and your + scanner defaulted to 7-bit because you used the -Cf or -CF + table compression options. See the discussion of the -7 + flag for details. + + flex scanner push-back overflow - you used unput() to push + back so much text that the scanner's buffer could not hold + both the pushed-back text and the current token in yytext. + Ideally the scanner should dynamically resize the buffer in + this case, but at present it does not. + + + +Version 2.5 Last change: April 1995 52 + + + + + + +FLEX(1) USER COMMANDS FLEX(1) + + + + input buffer overflow, can't enlarge buffer because scanner + uses REJECT - the scanner was working on matching an + extremely large token and needed to expand the input buffer. + This doesn't work with scanners that use REJECT. 
+ + fatal flex scanner internal error--end of buffer missed - + This can occur in a scanner which is reentered after a + long-jump has jumped out (or over) the scanner's activation + frame. Before reentering the scanner, use: + + yyrestart( yyin ); + + or, as noted above, switch to using the C++ scanner class. + + too many start conditions in <> - you listed more start + conditions in a <> construct than exist (so you must have listed + at least one of them twice). + +FILES + -lfl library with which scanners must be linked. + + lex.yy.c + generated scanner (called lexyy.c on some systems). + + lex.yy.cc + generated C++ scanner class, when using -+. + + + header file defining the C++ scanner base class, + FlexLexer, and its derived class, yyFlexLexer. + + flex.skl + skeleton scanner. This file is only used when building + flex, not when flex executes. + + lex.backup + backing-up information for -b flag (called lex.bck on + some systems). + +DEFICIENCIES / BUGS + Some trailing context patterns cannot be properly matched + and generate warning messages ("dangerous trailing + context"). These are patterns where the ending of the first + part of the rule matches the beginning of the second part, + such as "zx*/xy*", where the 'x*' matches the 'x' at the + beginning of the trailing context. (Note that the POSIX + draft states that the text matched by such patterns is + undefined.) + + For some trailing context rules, parts which are actually + fixed-length are not recognized as such, leading to the + above-mentioned performance loss. In particular, parts using + '|' or {n} (such as "foo{3}") are always considered + variable-length. + + Combining trailing context with the special '|' action can + result in fixed trailing context being turned into the more + expensive variable trailing context.
For example, in the + following: + + %% + abc | + xyz/def + + + Use of unput() invalidates yytext and yyleng, unless the + %array directive or the -l option has been used. + + Pattern-matching of NUL's is substantially slower than + matching other characters. + + Dynamic resizing of the input buffer is slow, as it entails + rescanning all the text matched so far by the current + (generally huge) token. + + Due to both buffering of input and read-ahead, you cannot + intermix calls to routines such as getchar() with flex + rules and expect it to work. Call input() instead. + + The total number of table entries listed by the -v flag + excludes the entries needed to determine which rule has + been matched. The number of these excluded entries is equal + to the number of DFA states if the scanner does not use + REJECT, and somewhat greater than the number of states if + it does. + + REJECT cannot be used with the -f or -F options. + + The flex internal algorithms need documentation. + +SEE ALSO + lex(1), yacc(1), sed(1), awk(1). + + John Levine, Tony Mason, and Doug Brown, Lex & Yacc, + O'Reilly and Associates. Be sure to get the 2nd edition. + + M. E. Lesk and E. Schmidt, LEX - Lexical Analyzer Generator + + Alfred Aho, Ravi Sethi and Jeffrey Ullman, Compilers: + Principles, Techniques and Tools, Addison-Wesley (1986). + Describes the pattern-matching techniques used by flex + (deterministic finite automata). + +AUTHOR + Vern Paxson, with the help of many ideas and much + inspiration from Van Jacobson. Original version by Jef Poskanzer. + The fast table representation is a partial implementation of + a design done by Van Jacobson. The implementation was done + by Kevin Gong and Vern Paxson.
+ + Thanks to the many flex beta-testers, feedbackers, and con- + tributors, especially Francois Pinard, Casey Leedom, Robert + Abramovitz, Stan Adermann, Terry Allen, David Barker- + Plummer, John Basrai, Neal Becker, Nelson H.F. Beebe, + benson@odi.com, Karl Berry, Peter A. Bigot, Simon Blanchard, + Keith Bostic, Frederic Brehm, Ian Brockbank, Kin Cho, Nick + Christopher, Brian Clapper, J.T. Conklin, Jason Coughlin, + Bill Cox, Nick Cropper, Dave Curtis, Scott David Daniels, + Chris G. Demetriou, Theo Deraadt, Mike Donahue, Chuck + Doucette, Tom Epperly, Leo Eskin, Chris Faylor, Chris + Flatters, Jon Forrest, Jeffrey Friedl, Joe Gayda, Kaveh R. + Ghazi, Wolfgang Glunz, Eric Goldman, Christopher M. Gould, + Ulrich Grepel, Peer Griebel, Jan Hajic, Charles Hemphill, + NORO Hideo, Jarkko Hietaniemi, Scott Hofmann, Jeff Honig, + Dana Hudes, Eric Hughes, John Interrante, Ceriel Jacobs, + Michal Jaegermann, Sakari Jalovaara, Jeffrey R. Jones, Henry + Juengst, Klaus Kaempf, Jonathan I. Kamens, Terrence O Kane, + Amir Katz, ken@ken.hilco.com, Kevin B. Kenny, Steve Kirsch, + Winfried Koenig, Marq Kole, Ronald Lamprecht, Greg Lee, + Rohan Lenard, Craig Leres, John Levine, Steve Liddle, David + Loffredo, Mike Long, Mohamed el Lozy, Brian Madsen, Malte, + Joe Marshall, Bengt Martensson, Chris Metcalf, Luke Mewburn, + Jim Meyering, R. Alexander Milowski, Erik Naggum, G.T. + Nicol, Landon Noll, James Nordby, Marc Nozell, Richard + Ohnemus, Karsten Pahnke, Sven Panne, Roland Pesch, Walter + Pelissero, Gaumond Pierre, Esmond Pitt, Jef Poskanzer, Joe + Rahmeh, Jarmo Raiha, Frederic Raimbault, Pat Rankin, Rick + Richardson, Kevin Rodgers, Kai Uwe Rommel, Jim Roskind, + Alberto Santini, Andreas Scherer, Darrell Schiebel, Raf + Schietekat, Doug Schmidt, Philippe Schnoebelen, Andreas + Schwab, Larry Schwimmer, Alex Siegel, Eckehard Stolz, Jan- + Erik Strvmquist, Mike Stump, Paul Stuart, Dave Tallman, Ian + Lance Taylor, Chris Thewalt, Richard M. 
Timoney, Jodi Tsai, + Paul Tuinenga, Gary Weik, Frank Whaley, Gerhard Wilhelms, + Kent Williams, Ken Yap, Ron Zellar, Nathan Zelle, David + Zuhn, and those whose names have slipped my marginal + mail-archiving skills but whose contributions are appreciated all + the same. + + Thanks to Keith Bostic, Jon Forrest, Noah Friedman, John + Gilmore, Craig Leres, John Levine, Bob Mulcahy, G.T. Nicol, + Francois Pinard, Rich Salz, and Richard Stallman for help + with various distribution headaches. + + Thanks to Esmond Pitt and Earle Horton for 8-bit character + support; to Benson Margulies and Fred Burke for C++ support; + to Kent Williams and Tom Epperly for C++ class support; to + Ove Ewerlid for support of NUL's; and to Eric Hughes for + support of multiple buffers. + + This work was primarily done when I was with the Real Time + Systems Group at the Lawrence Berkeley Laboratory in + Berkeley, CA. Many thanks to all there for the support I + received. + + Send comments to vern@ee.lbl.gov. + -- cgit v1.2.3