author | Manoj Srivastava <srivasta@golden-gryphon.com> | 2003-12-03 22:33:17 -0800 |
---|---|---|
committer | Manoj Srivastava <srivasta@golden-gryphon.com> | 2003-12-03 22:33:17 -0800 |
commit | c2b22e08bd48278f2cf125f054c9f6286e345ff0 (patch) | |
tree | 3c0ab722c83ef33913ad293af7d56ce2c4e1fcc9 /MISC/flex.man | |
parent | edc848712307fe5c881364e12e520e9fe58d9969 (diff) |
Imported Upstream version 2.5.31
Diffstat (limited to 'MISC/flex.man')
-rw-r--r-- | MISC/flex.man | 3696 |
1 files changed, 0 insertions, 3696 deletions
diff --git a/MISC/flex.man b/MISC/flex.man deleted file mode 100644 index d41f5ba..0000000 --- a/MISC/flex.man +++ /dev/null @@ -1,3696 +0,0 @@ - - - -FLEX(1) USER COMMANDS FLEX(1) - - - -NAME - flex - fast lexical analyzer generator - -SYNOPSIS - flex [-bcdfhilnpstvwBFILTV78+? -C[aefFmr] -ooutput -Pprefix - -Sskeleton] [--help --version] [filename ...] - -OVERVIEW - This manual describes flex, a tool for generating programs - that perform pattern-matching on text. The manual includes - both tutorial and reference sections: - - Description - a brief overview of the tool - - Some Simple Examples - - Format Of The Input File - - Patterns - the extended regular expressions used by flex - - How The Input Is Matched - the rules for determining what has been matched - - Actions - how to specify what to do when a pattern is matched - - The Generated Scanner - details regarding the scanner that flex produces; - how to control the input source - - Start Conditions - introducing context into your scanners, and - managing "mini-scanners" - - Multiple Input Buffers - how to manipulate multiple input sources; how to - scan from strings instead of files - - End-of-file Rules - special rules for matching the end of the input - - Miscellaneous Macros - a summary of macros available to the actions - - Values Available To The User - a summary of values available to the actions - - Interfacing With Yacc - connecting flex scanners together with yacc parsers - - - - -Version 2.5 Last change: April 1995 1 - - - - - - -FLEX(1) USER COMMANDS FLEX(1) - - - - Options - flex command-line options, and the "%option" - directive - - Performance Considerations - how to make your scanner go as fast as possible - - Generating C++ Scanners - the (experimental) facility for generating C++ - scanner classes - - Incompatibilities With Lex And POSIX - how flex differs from AT&T lex and the POSIX lex - standard - - Diagnostics - those error messages produced by flex (or scanners - it generates) whose 
meanings might not be apparent - - Files - files used by flex - - Deficiencies / Bugs - known problems with flex - - See Also - other documentation, related tools - - Author - includes contact information - - -DESCRIPTION - flex is a tool for generating scanners: programs which - recognized lexical patterns in text. flex reads the given - input files, or its standard input if no file names are - given, for a description of a scanner to generate. The - description is in the form of pairs of regular expressions - and C code, called rules. flex generates as output a C - source file, lex.yy.c, which defines a routine yylex(). This - file is compiled and linked with the -lfl library to produce - an executable. When the executable is run, it analyzes its - input for occurrences of the regular expressions. Whenever - it finds one, it executes the corresponding C code. - -SOME SIMPLE EXAMPLES - First some simple examples to get the flavor of how one uses - flex. The following flex input specifies a scanner which - whenever it encounters the string "username" will replace it - with the user's login name: - - %% - - - -Version 2.5 Last change: April 1995 2 - - - - - - -FLEX(1) USER COMMANDS FLEX(1) - - - - username printf( "%s", getlogin() ); - - By default, any text not matched by a flex scanner is copied - to the output, so the net effect of this scanner is to copy - its input file to its output with each occurrence of "user- - name" expanded. In this input, there is just one rule. - "username" is the pattern and the "printf" is the action. - The "%%" marks the beginning of the rules. - - Here's another simple example: - - int num_lines = 0, num_chars = 0; - - %% - \n ++num_lines; ++num_chars; - . ++num_chars; - - %% - main() - { - yylex(); - printf( "# of lines = %d, # of chars = %d\n", - num_lines, num_chars ); - } - - This scanner counts the number of characters and the number - of lines in its input (it produces no output other than the - final report on the counts). 
The first line declares two - globals, "num_lines" and "num_chars", which are accessible - both inside yylex() and in the main() routine declared after - the second "%%". There are two rules, one which matches a - newline ("\n") and increments both the line count and the - character count, and one which matches any character other - than a newline (indicated by the "." regular expression). - - A somewhat more complicated example: - - /* scanner for a toy Pascal-like language */ - - %{ - /* need this for the call to atof() below */ - #include <math.h> - %} - - DIGIT [0-9] - ID [a-z][a-z0-9]* - - %% - - {DIGIT}+ { - printf( "An integer: %s (%d)\n", yytext, - atoi( yytext ) ); - - - -Version 2.5 Last change: April 1995 3 - - - - - - -FLEX(1) USER COMMANDS FLEX(1) - - - - } - - {DIGIT}+"."{DIGIT}* { - printf( "A float: %s (%g)\n", yytext, - atof( yytext ) ); - } - - if|then|begin|end|procedure|function { - printf( "A keyword: %s\n", yytext ); - } - - {ID} printf( "An identifier: %s\n", yytext ); - - "+"|"-"|"*"|"/" printf( "An operator: %s\n", yytext ); - - "{"[^}\n]*"}" /* eat up one-line comments */ - - [ \t\n]+ /* eat up whitespace */ - - . printf( "Unrecognized character: %s\n", yytext ); - - %% - - main( argc, argv ) - int argc; - char **argv; - { - ++argv, --argc; /* skip over program name */ - if ( argc > 0 ) - yyin = fopen( argv[0], "r" ); - else - yyin = stdin; - - yylex(); - } - - This is the beginnings of a simple scanner for a language - like Pascal. It identifies different types of tokens and - reports on what it has seen. - - The details of this example will be explained in the follow- - ing sections. 
- -FORMAT OF THE INPUT FILE - The flex input file consists of three sections, separated by - a line with just %% in it: - - definitions - %% - rules - %% - user code - - - -Version 2.5 Last change: April 1995 4 - - - - - - -FLEX(1) USER COMMANDS FLEX(1) - - - - The definitions section contains declarations of simple name - definitions to simplify the scanner specification, and - declarations of start conditions, which are explained in a - later section. - - Name definitions have the form: - - name definition - - The "name" is a word beginning with a letter or an under- - score ('_') followed by zero or more letters, digits, '_', - or '-' (dash). The definition is taken to begin at the - first non-white-space character following the name and con- - tinuing to the end of the line. The definition can subse- - quently be referred to using "{name}", which will expand to - "(definition)". For example, - - DIGIT [0-9] - ID [a-z][a-z0-9]* - - defines "DIGIT" to be a regular expression which matches a - single digit, and "ID" to be a regular expression which - matches a letter followed by zero-or-more letters-or-digits. - A subsequent reference to - - {DIGIT}+"."{DIGIT}* - - is identical to - - ([0-9])+"."([0-9])* - - and matches one-or-more digits followed by a '.' followed by - zero-or-more digits. - - The rules section of the flex input contains a series of - rules of the form: - - pattern action - - where the pattern must be unindented and the action must - begin on the same line. - - See below for a further description of patterns and actions. - - Finally, the user code section is simply copied to lex.yy.c - verbatim. It is used for companion routines which call or - are called by the scanner. The presence of this section is - optional; if it is missing, the second %% in the input file - may be skipped, too. 
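The name-definition expansion described above, where "{name}" becomes "(definition)", can be sketched in plain C. This is only an illustration of the textual substitution, not flex's actual implementation: struct name_def and expand_names() are hypothetical names, and the sketch does not special-case quoted strings or character classes the way a real scanner generator would.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper type: one "name definition" line. */
struct name_def {
    const char *name;
    const char *definition;
};

/* Expand each {name} in pattern to (definition), writing the result to out.
 * Returns 0 on success, -1 if out is too small.
 * Unknown names and repetition braces such as {2,5} are copied unchanged. */
int expand_names(const char *pattern, const struct name_def *defs,
                 size_t ndefs, char *out, size_t outsize)
{
    size_t o = 0;
    const char *p = pattern;

    assert(pattern && out && outsize > 0);

    while (*p != '\0') {
        const char *close = (*p == '{') ? strchr(p, '}') : NULL;
        const struct name_def *hit = NULL;

        /* A name begins with a letter or an underscore. */
        if (close && (isalpha((unsigned char)p[1]) || p[1] == '_')) {
            size_t len = (size_t)(close - p - 1);
            size_t i;
            for (i = 0; i < ndefs; ++i)
                if (strlen(defs[i].name) == len &&
                    strncmp(defs[i].name, p + 1, len) == 0)
                    hit = &defs[i];
        }

        if (hit) {
            size_t dlen = strlen(hit->definition);
            if (o + dlen + 2 >= outsize)
                return -1;
            out[o++] = '(';                 /* parentheses keep precedence intact */
            memcpy(out + o, hit->definition, dlen);
            o += dlen;
            out[o++] = ')';
            p = close + 1;
        } else {
            if (o + 1 >= outsize)
                return -1;
            out[o++] = *p++;
        }
    }
    out[o] = '\0';
    return 0;
}
```

With DIGIT defined as [0-9], passing the pattern {DIGIT}+"."{DIGIT}* through expand_names() yields ([0-9])+"."([0-9])*, the same expansion the manual gives. The added parentheses are what make a following '+' or '*' apply to the whole definition rather than to its last element.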
- - In the definitions and rules sections, any indented text or - text enclosed in %{ and %} is copied verbatim to the output - - - -Version 2.5 Last change: April 1995 5 - - - - - - -FLEX(1) USER COMMANDS FLEX(1) - - - - (with the %{}'s removed). The %{}'s must appear unindented - on lines by themselves. - - In the rules section, any indented or %{} text appearing - before the first rule may be used to declare variables which - are local to the scanning routine and (after the declara- - tions) code which is to be executed whenever the scanning - routine is entered. Other indented or %{} text in the rule - section is still copied to the output, but its meaning is - not well-defined and it may well cause compile-time errors - (this feature is present for POSIX compliance; see below for - other such features). - - In the definitions section (but not in the rules section), - an unindented comment (i.e., a line beginning with "/*") is - also copied verbatim to the output up to the next "*/". - -PATTERNS - The patterns in the input are written using an extended set - of regular expressions. These are: - - x match the character 'x' - . any character (byte) except newline - [xyz] a "character class"; in this case, the pattern - matches either an 'x', a 'y', or a 'z' - [abj-oZ] a "character class" with a range in it; matches - an 'a', a 'b', any letter from 'j' through 'o', - or a 'Z' - [^A-Z] a "negated character class", i.e., any character - but those in the class. In this case, any - character EXCEPT an uppercase letter. - [^A-Z\n] any character EXCEPT an uppercase letter or - a newline - r* zero or more r's, where r is any regular expression - r+ one or more r's - r? 
zero or one r's (that is, "an optional r") - r{2,5} anywhere from two to five r's - r{2,} two or more r's - r{4} exactly 4 r's - {name} the expansion of the "name" definition - (see above) - "[xyz]\"foo" - the literal string: [xyz]"foo - \X if X is an 'a', 'b', 'f', 'n', 'r', 't', or 'v', - then the ANSI-C interpretation of \x. - Otherwise, a literal 'X' (used to escape - operators such as '*') - \0 a NUL character (ASCII code 0) - \123 the character with octal value 123 - \x2a the character with hexadecimal value 2a - (r) match an r; parentheses are used to override - precedence (see below) - - - -Version 2.5 Last change: April 1995 6 - - - - - - -FLEX(1) USER COMMANDS FLEX(1) - - - - rs the regular expression r followed by the - regular expression s; called "concatenation" - - - r|s either an r or an s - - - r/s an r but only if it is followed by an s. The - text matched by s is included when determining - whether this rule is the "longest match", - but is then returned to the input before - the action is executed. So the action only - sees the text matched by r. This type - of pattern is called trailing context". - (There are some combinations of r/s that flex - cannot match correctly; see notes in the - Deficiencies / Bugs section below regarding - "dangerous trailing context".) - ^r an r, but only at the beginning of a line (i.e., - which just starting to scan, or right after a - newline has been scanned). - r$ an r, but only at the end of a line (i.e., just - before a newline). Equivalent to "r/\n". - - Note that flex's notion of "newline" is exactly - whatever the C compiler used to compile flex - interprets '\n' as; in particular, on some DOS - systems you must either filter out \r's in the - input yourself, or explicitly use r/\r\n for "r$". 
- - - <s>r an r, but only in start condition s (see - below for discussion of start conditions) - <s1,s2,s3>r - same, but in any of start conditions s1, - s2, or s3 - <*>r an r in any start condition, even an exclusive one. - - - <<EOF>> an end-of-file - <s1,s2><<EOF>> - an end-of-file when in start condition s1 or s2 - - Note that inside of a character class, all regular expres- - sion operators lose their special meaning except escape - ('\') and the character class operators, '-', ']', and, at - the beginning of the class, '^'. - - The regular expressions listed above are grouped according - to precedence, from highest precedence at the top to lowest - at the bottom. Those grouped together have equal pre- - cedence. For example, - - - -Version 2.5 Last change: April 1995 7 - - - - - - -FLEX(1) USER COMMANDS FLEX(1) - - - - foo|bar* - - is the same as - - (foo)|(ba(r*)) - - since the '*' operator has higher precedence than concatena- - tion, and concatenation higher than alternation ('|'). This - pattern therefore matches either the string "foo" or the - string "ba" followed by zero-or-more r's. To match "foo" or - zero-or-more "bar"'s, use: - - foo|(bar)* - - and to match zero-or-more "foo"'s-or-"bar"'s: - - (foo|bar)* - - - In addition to characters and ranges of characters, charac- - ter classes can also contain character class expressions. - These are expressions enclosed inside [: and :] delimiters - (which themselves must appear between the '[' and ']' of the - character class; other elements may occur inside the charac- - ter class, too). The valid expressions are: - - [:alnum:] [:alpha:] [:blank:] - [:cntrl:] [:digit:] [:graph:] - [:lower:] [:print:] [:punct:] - [:space:] [:upper:] [:xdigit:] - - These expressions all designate a set of characters - equivalent to the corresponding standard C isXXX function. - For example, [:alnum:] designates those characters for which - isalnum() returns true - i.e., any alphabetic or numeric. 
- Some systems don't provide isblank(), so flex defines - [:blank:] as a blank or a tab. - - For example, the following character classes are all - equivalent: - - [[:alnum:]] - [[:alpha:][:digit:] - [[:alpha:]0-9] - [a-zA-Z0-9] - - If your scanner is case-insensitive (the -i flag), then - [:upper:] and [:lower:] are equivalent to [:alpha:]. - - Some notes on patterns: - - - A negated character class such as the example "[^A-Z]" - - - -Version 2.5 Last change: April 1995 8 - - - - - - -FLEX(1) USER COMMANDS FLEX(1) - - - - above will match a newline unless "\n" (or an - equivalent escape sequence) is one of the characters - explicitly present in the negated character class - (e.g., "[^A-Z\n]"). This is unlike how many other reg- - ular expression tools treat negated character classes, - but unfortunately the inconsistency is historically - entrenched. Matching newlines means that a pattern - like [^"]* can match the entire input unless there's - another quote in the input. - - - A rule can have at most one instance of trailing con- - text (the '/' operator or the '$' operator). The start - condition, '^', and "<<EOF>>" patterns can only occur - at the beginning of a pattern, and, as well as with '/' - and '$', cannot be grouped inside parentheses. A '^' - which does not occur at the beginning of a rule or a - '$' which does not occur at the end of a rule loses its - special properties and is treated as a normal charac- - ter. - - The following are illegal: - - foo/bar$ - <sc1>foo<sc2>bar - - Note that the first of these, can be written - "foo/bar\n". - - The following will result in '$' or '^' being treated - as a normal character: - - foo|(bar$) - foo|^bar - - If what's wanted is a "foo" or a bar-followed-by-a- - newline, the following could be used (the special '|' - action is explained below): - - foo | - bar$ /* action goes here */ - - A similar trick will work for matching a foo or a bar- - at-the-beginning-of-a-line. 
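The precedence rules in this section (repetition binds tighter than concatenation, which binds tighter than alternation) can be checked outside flex. The following is a minimal sketch in C, assuming a POSIX system: it uses regcomp() and regexec() as a stand-in for flex's matcher, since POSIX extended regular expressions share this precedence ordering, though not the rest of flex's pattern syntax. matches_whole() is a hypothetical helper, not part of flex, and the caller is assumed to anchor the pattern with '^' and '$'.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Return 1 if text matches the POSIX extended regular expression
 * pattern, 0 if it does not. The caller anchors the pattern with
 * ^( ... )$ so that only whole-string matches count. */
int matches_whole(const char *pattern, const char *text)
{
    regex_t re;
    int rc;

    assert(pattern && text);

    rc = regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB);
    if (rc != 0)
        return 0;               /* treat an invalid pattern as "no match" */

    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

Under these assumptions, matches_whole("^(foo|bar*)$", "barrr") returns 1 while matches_whole("^(foo|bar*)$", "barbar") returns 0, because the '*' applies only to the final 'r'. Grouping changes the result: "^(foo|(bar)*)$" accepts "barbar" but rejects "barrr", and "^((foo|bar)*)$" accepts "foobarfoo".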
- -HOW THE INPUT IS MATCHED - When the generated scanner is run, it analyzes its input - looking for strings which match any of its patterns. If it - finds more than one match, it takes the one matching the - most text (for trailing context rules, this includes the - length of the trailing part, even though it will then be - returned to the input). If it finds two or more matches of - the same length, the rule listed first in the flex input - - - -Version 2.5 Last change: April 1995 9 - - - - - - -FLEX(1) USER COMMANDS FLEX(1) - - - - file is chosen. - - Once the match is determined, the text corresponding to the - match (called the token) is made available in the global - character pointer yytext, and its length in the global - integer yyleng. The action corresponding to the matched pat- - tern is then executed (a more detailed description of - actions follows), and then the remaining input is scanned - for another match. - - If no match is found, then the default rule is executed: the - next character in the input is considered matched and copied - to the standard output. Thus, the simplest legal flex input - is: - - %% - - which generates a scanner that simply copies its input (one - character at a time) to its output. - - Note that yytext can be defined in two different ways: - either as a character pointer or as a character array. You - can control which definition flex uses by including one of - the special directives %pointer or %array in the first - (definitions) section of your flex input. The default is - %pointer, unless you use the -l lex compatibility option, in - which case yytext will be an array. The advantage of using - %pointer is substantially faster scanning and no buffer - overflow when matching very large tokens (unless you run out - of dynamic memory). 
The disadvantage is that you are res- - tricted in how your actions can modify yytext (see the next - section), and calls to the unput() function destroys the - present contents of yytext, which can be a considerable - porting headache when moving between different lex versions. - - The advantage of %array is that you can then modify yytext - to your heart's content, and calls to unput() do not destroy - yytext (see below). Furthermore, existing lex programs - sometimes access yytext externally using declarations of the - form: - extern char yytext[]; - This definition is erroneous when used with %pointer, but - correct for %array. - - %array defines yytext to be an array of YYLMAX characters, - which defaults to a fairly large value. You can change the - size by simply #define'ing YYLMAX to a different value in - the first section of your flex input. As mentioned above, - with %pointer yytext grows dynamically to accommodate large - tokens. While this means your %pointer scanner can accommo- - date very large tokens (such as matching entire blocks of - comments), bear in mind that each time the scanner must - - - -Version 2.5 Last change: April 1995 10 - - - - - - -FLEX(1) USER COMMANDS FLEX(1) - - - - resize yytext it also must rescan the entire token from the - beginning, so matching such tokens can prove slow. yytext - presently does not dynamically grow if a call to unput() - results in too much text being pushed back; instead, a run- - time error results. - - Also note that you cannot use %array with C++ scanner - classes (the c++ option; see below). - -ACTIONS - Each pattern in a rule has a corresponding action, which can - be any arbitrary C statement. The pattern ends at the first - non-escaped whitespace character; the remainder of the line - is its action. If the action is empty, then when the pat- - tern is matched the input token is simply discarded. 
For - example, here is the specification for a program which - deletes all occurrences of "zap me" from its input: - - %% - "zap me" - - (It will copy all other characters in the input to the out- - put since they will be matched by the default rule.) - - Here is a program which compresses multiple blanks and tabs - down to a single blank, and throws away whitespace found at - the end of a line: - - %% - [ \t]+ putchar( ' ' ); - [ \t]+$ /* ignore this token */ - - - If the action contains a '{', then the action spans till the - balancing '}' is found, and the action may cross multiple - lines. flex knows about C strings and comments and won't be - fooled by braces found within them, but also allows actions - to begin with %{ and will consider the action to be all the - text up to the next %} (regardless of ordinary braces inside - the action). - - An action consisting solely of a vertical bar ('|') means - "same as the action for the next rule." See below for an - illustration. - - Actions can include arbitrary C code, including return - statements to return a value to whatever routine called - yylex(). Each time yylex() is called it continues processing - tokens from where it last left off until it either reaches - the end of the file or executes a return. - - - - - -Version 2.5 Last change: April 1995 11 - - - - - - -FLEX(1) USER COMMANDS FLEX(1) - - - - Actions are free to modify yytext except for lengthening it - (adding characters to its end--these will overwrite later - characters in the input stream). This however does not - apply when using %array (see above); in that case, yytext - may be freely modified in any way. - - Actions are free to modify yyleng except they should not do - so if the action also includes use of yymore() (see below). - - There are a number of special directives which can be - included within an action: - - - ECHO copies yytext to the scanner's output. 
- - - BEGIN followed by the name of a start condition places - the scanner in the corresponding start condition (see - below). - - - REJECT directs the scanner to proceed on to the "second - best" rule which matched the input (or a prefix of the - input). The rule is chosen as described above in "How - the Input is Matched", and yytext and yyleng set up - appropriately. It may either be one which matched as - much text as the originally chosen rule but came later - in the flex input file, or one which matched less text. - For example, the following will both count the words in - the input and call the routine special() whenever - "frob" is seen: - - int word_count = 0; - %% - - frob special(); REJECT; - [^ \t\n]+ ++word_count; - - Without the REJECT, any "frob"'s in the input would not - be counted as words, since the scanner normally exe- - cutes only one action per token. Multiple REJECT's are - allowed, each one finding the next best choice to the - currently active rule. For example, when the following - scanner scans the token "abcd", it will write "abcdab- - caba" to the output: - - %% - a | - ab | - abc | - abcd ECHO; REJECT; - .|\n /* eat up any unmatched character */ - - (The first three rules share the fourth's action since - they use the special '|' action.) REJECT is a - - - -Version 2.5 Last change: April 1995 12 - - - - - - -FLEX(1) USER COMMANDS FLEX(1) - - - - particularly expensive feature in terms of scanner per- - formance; if it is used in any of the scanner's actions - it will slow down all of the scanner's matching. - Furthermore, REJECT cannot be used with the -Cf or -CF - options (see below). - - Note also that unlike the other special actions, REJECT - is a branch; code immediately following it in the - action will not be executed. - - - yymore() tells the scanner that the next time it - matches a rule, the corresponding token should be - appended onto the current value of yytext rather than - replacing it. 
For example, given the input "mega- - kludge" the following will write "mega-mega-kludge" to - the output: - - %% - mega- ECHO; yymore(); - kludge ECHO; - - First "mega-" is matched and echoed to the output. - Then "kludge" is matched, but the previous "mega-" is - still hanging around at the beginning of yytext so the - ECHO for the "kludge" rule will actually write "mega- - kludge". - - Two notes regarding use of yymore(). First, yymore() depends - on the value of yyleng correctly reflecting the size of the - current token, so you must not modify yyleng if you are - using yymore(). Second, the presence of yymore() in the - scanner's action entails a minor performance penalty in the - scanner's matching speed. - - - yyless(n) returns all but the first n characters of the - current token back to the input stream, where they will - be rescanned when the scanner looks for the next match. - yytext and yyleng are adjusted appropriately (e.g., - yyleng will now be equal to n ). For example, on the - input "foobar" the following will write out "foobar- - bar": - - %% - foobar ECHO; yyless(3); - [a-z]+ ECHO; - - An argument of 0 to yyless will cause the entire - current input string to be scanned again. Unless - you've changed how the scanner will subsequently pro- - cess its input (using BEGIN, for example), this will - result in an endless loop. - - - - -Version 2.5 Last change: April 1995 13 - - - - - - -FLEX(1) USER COMMANDS FLEX(1) - - - - Note that yyless is a macro and can only be used in the flex - input file, not from other source files. - - - unput(c) puts the character c back onto the input - stream. It will be the next character scanned. The - following action will take the current token and cause - it to be rescanned enclosed in parentheses. 
- - { - int i; - /* Copy yytext because unput() trashes yytext */ - char *yycopy = strdup( yytext ); - unput( ')' ); - for ( i = yyleng - 1; i >= 0; --i ) - unput( yycopy[i] ); - unput( '(' ); - free( yycopy ); - } - - Note that since each unput() puts the given character - back at the beginning of the input stream, pushing back - strings must be done back-to-front. - - An important potential problem when using unput() is that if - you are using %pointer (the default), a call to unput() des- - troys the contents of yytext, starting with its rightmost - character and devouring one character to the left with each - call. If you need the value of yytext preserved after a - call to unput() (as in the above example), you must either - first copy it elsewhere, or build your scanner using %array - instead (see How The Input Is Matched). - - Finally, note that you cannot put back EOF to attempt to - mark the input stream with an end-of-file. - - - input() reads the next character from the input stream. - For example, the following is one way to eat up C com- - ments: - - %% - "/*" { - register int c; - - for ( ; ; ) - { - while ( (c = input()) != '*' && - c != EOF ) - ; /* eat up text of comment */ - - if ( c == '*' ) - { - while ( (c = input()) == '*' ) - - - -Version 2.5 Last change: April 1995 14 - - - - - - -FLEX(1) USER COMMANDS FLEX(1) - - - - ; - if ( c == '/' ) - break; /* found the end */ - } - - if ( c == EOF ) - { - error( "EOF in comment" ); - break; - } - } - } - - (Note that if the scanner is compiled using C++, then - input() is instead referred to as yyinput(), in order - to avoid a name clash with the C++ stream by the name - of input.) - - - YY_FLUSH_BUFFER flushes the scanner's internal buffer - so that the next time the scanner attempts to match a - token, it will first refill the buffer using YY_INPUT - (see The Generated Scanner, below). 
This action is a - special case of the more general yy_flush_buffer() - function, described below in the section Multiple Input - Buffers. - - - yyterminate() can be used in lieu of a return statement - in an action. It terminates the scanner and returns a - 0 to the scanner's caller, indicating "all done". By - default, yyterminate() is also called when an end-of- - file is encountered. It is a macro and may be rede- - fined. - -THE GENERATED SCANNER - The output of flex is the file lex.yy.c, which contains the - scanning routine yylex(), a number of tables used by it for - matching tokens, and a number of auxiliary routines and mac- - ros. By default, yylex() is declared as follows: - - int yylex() - { - ... various definitions and the actions in here ... - } - - (If your environment supports function prototypes, then it - will be "int yylex( void )".) This definition may be - changed by defining the "YY_DECL" macro. For example, you - could use: - - #define YY_DECL float lexscan( a, b ) float a, b; - - to give the scanning routine the name lexscan, returning a - - - -Version 2.5 Last change: April 1995 15 - - - - - - -FLEX(1) USER COMMANDS FLEX(1) - - - - float, and taking two floats as arguments. Note that if you - give arguments to the scanning routine using a K&R- - style/non-prototyped function declaration, you must ter- - minate the definition with a semi-colon (;). - - Whenever yylex() is called, it scans tokens from the global - input file yyin (which defaults to stdin). It continues - until it either reaches an end-of-file (at which point it - returns the value 0) or one of its actions executes a return - statement. - - If the scanner reaches an end-of-file, subsequent calls are - undefined unless either yyin is pointed at a new input file - (in which case scanning continues from that file), or yyres- - tart() is called. 
yyrestart() takes one argument, a FILE * - pointer (which can be nil, if you've set up YY_INPUT to scan - from a source other than yyin), and initializes yyin for - scanning from that file. Essentially there is no difference - between just assigning yyin to a new input file or using - yyrestart() to do so; the latter is available for compati- - bility with previous versions of flex, and because it can be - used to switch input files in the middle of scanning. It - can also be used to throw away the current input buffer, by - calling it with an argument of yyin; but better is to use - YY_FLUSH_BUFFER (see above). Note that yyrestart() does not - reset the start condition to INITIAL (see Start Conditions, - below). - - If yylex() stops scanning due to executing a return state- - ment in one of the actions, the scanner may then be called - again and it will resume scanning where it left off. - - By default (and for purposes of efficiency), the scanner - uses block-reads rather than simple getc() calls to read - characters from yyin. The nature of how it gets its input - can be controlled by defining the YY_INPUT macro. - YY_INPUT's calling sequence is - "YY_INPUT(buf,result,max_size)". Its action is to place up - to max_size characters in the character array buf and return - in the integer variable result either the number of charac- - ters read or the constant YY_NULL (0 on Unix systems) to - indicate EOF. The default YY_INPUT reads from the global - file-pointer "yyin". - - A sample definition of YY_INPUT (in the definitions section - of the input file): - - %{ - #define YY_INPUT(buf,result,max_size) \ - { \ - int c = getchar(); \ - result = (c == EOF) ? YY_NULL : (buf[0] = c, 1); \ - - - -Version 2.5 Last change: April 1995 16 - - - - - - -FLEX(1) USER COMMANDS FLEX(1) - - - - } - %} - - This definition will change the input processing to occur - one character at a time. 
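The YY_INPUT contract described above (place up to max_size characters in buf, and report either the number read or 0 for end-of-file) does not require the data to come from a FILE pointer. As a hedged sketch of that contract only, here is a plain C reader over an in-memory string; struct mem_source and mem_read() are hypothetical names introduced for illustration and are not part of flex.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical in-memory input source; not part of flex. */
struct mem_source {
    const char *data;   /* bytes to scan */
    size_t len;         /* total number of bytes in data */
    size_t pos;         /* how many bytes have been handed out so far */
};

/* Copy up to max_size bytes into buf and advance the source.
 * Returns the number of bytes copied, or 0 at end of input, which is
 * the role YY_NULL plays in the description above. */
size_t mem_read(struct mem_source *src, char *buf, size_t max_size)
{
    size_t n;

    assert(src && buf && src->pos <= src->len);

    n = src->len - src->pos;
    if (n > max_size)
        n = max_size;           /* never write more than the caller allows */
    memcpy(buf, src->data + src->pos, n);
    src->pos += n;
    return n;
}
```

A YY_INPUT definition could then assign the return value of mem_read() to result, assuming a struct mem_source is visible inside the scanner. Unlike the one-character-at-a-time sample, this version hands back a full block per call, which is closer to the block reads the default scanner uses for efficiency.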
- - When the scanner receives an end-of-file indication from - YY_INPUT, it then checks the yywrap() function. If yywrap() - returns false (zero), then it is assumed that the function - has gone ahead and set up yyin to point to another input - file, and scanning continues. If it returns true (non- - zero), then the scanner terminates, returning 0 to its - caller. Note that in either case, the start condition - remains unchanged; it does not revert to INITIAL. - - If you do not supply your own version of yywrap(), then you - must either use %option noyywrap (in which case the scanner - behaves as though yywrap() returned 1), or you must link - with -lfl to obtain the default version of the routine, - which always returns 1. - - Three routines are available for scanning from in-memory - buffers rather than files: yy_scan_string(), - yy_scan_bytes(), and yy_scan_buffer(). See the discussion of - them below in the section Multiple Input Buffers. - - The scanner writes its ECHO output to the yyout global - (default, stdout), which may be redefined by the user simply - by assigning it to some other FILE pointer. - -START CONDITIONS - flex provides a mechanism for conditionally activating - rules. Any rule whose pattern is prefixed with "<sc>" will - only be active when the scanner is in the start condition - named "sc". For example, - - <STRING>[^"]* { /* eat up the string body ... */ - ... - } - - will be active only when the scanner is in the "STRING" - start condition, and - - <INITIAL,STRING,QUOTE>\. { /* handle an escape ... */ - ... - } - - will be active only when the current start condition is - either "INITIAL", "STRING", or "QUOTE". - - Start conditions are declared in the definitions (first) - section of the input using unindented lines beginning with - - - -Version 2.5 Last change: April 1995 17 - - - - - - -FLEX(1) USER COMMANDS FLEX(1) - - - - either %s or %x followed by a list of names. 
-     The former declares inclusive start conditions, the latter
-     exclusive start conditions. A start condition is activated
-     using the BEGIN action. Until the next BEGIN action is
-     executed, rules with the given start condition will be active
-     and rules with other start conditions will be inactive. If
-     the start condition is inclusive, then rules with no start
-     conditions at all will also be active. If it is exclusive,
-     then only rules qualified with the start condition will be
-     active. A set of rules contingent on the same exclusive
-     start condition describes a scanner which is independent of
-     any of the other rules in the flex input. Because of this,
-     exclusive start conditions make it easy to specify
-     "mini-scanners" which scan portions of the input that are
-     syntactically different from the rest (e.g., comments).
-
-     If the distinction between inclusive and exclusive start
-     conditions is still a little vague, here's a simple example
-     illustrating the connection between the two. The set of
-     rules:
-
-         %s example
-         %%
-
-         <example>foo   do_something();
-
-         bar            something_else();
-
-     is equivalent to
-
-         %x example
-         %%
-
-         <example>foo   do_something();
-
-         <INITIAL,example>bar    something_else();
-
-     Without the <INITIAL,example> qualifier, the bar pattern in
-     the second example wouldn't be active (i.e., couldn't match)
-     when in start condition example. If we just used <example>
-     to qualify bar, though, then it would only be active in
-     example and not in INITIAL, while in the first example it's
-     active in both, because in the first example the example
-     start condition is an inclusive (%s) start condition.
-
-     Also note that the special start-condition specifier <*>
-     matches every start condition.
-     Thus, the above example could also have been written:
-
-         %x example
-         %%
-
-         <example>foo   do_something();
-
-         <*>bar    something_else();
-
-     The default rule (to ECHO any unmatched character) remains
-     active in start conditions. It is equivalent to:
-
-         <*>.|\n     ECHO;
-
-     BEGIN(0) returns to the original state where only the rules
-     with no start conditions are active. This state can also be
-     referred to as the start-condition "INITIAL", so
-     BEGIN(INITIAL) is equivalent to BEGIN(0). (The parentheses
-     around the start condition name are not required but are
-     considered good style.)
-
-     BEGIN actions can also be given as indented code at the
-     beginning of the rules section. For example, the following
-     will cause the scanner to enter the "SPECIAL" start condition
-     whenever yylex() is called and the global variable
-     enter_special is true:
-
-                 int enter_special;
-
-         %x SPECIAL
-         %%
-                 if ( enter_special )
-                     BEGIN(SPECIAL);
-
-         <SPECIAL>blahblahblah
-         ...more rules follow...
-
-     To illustrate the uses of start conditions, here is a
-     scanner which provides two different interpretations of a
-     string like "123.456". By default it will treat it as three
-     tokens, the integer "123", a dot ('.'), and the integer
-     "456".
-     But if the string is preceded earlier in the line by the
-     string "expect-floats" it will treat it as a single token,
-     the floating-point number 123.456:
-
-         %{
-         #include <stdlib.h>   /* for atof() and atoi() */
-         %}
-         %s expect
-
-         %%
-         expect-floats        BEGIN(expect);
-
-         <expect>[0-9]+"."[0-9]+      {
-                     printf( "found a float, = %f\n",
-                             atof( yytext ) );
-                     }
-         <expect>\n           {
-                     /* that's the end of the line, so
-                      * we need another "expect-floats"
-                      * before we'll recognize any more
-                      * floats
-                      */
-                     BEGIN(INITIAL);
-                     }
-
-         [0-9]+      {
-                     printf( "found an integer, = %d\n",
-                             atoi( yytext ) );
-                     }
-
-         "."         printf( "found a dot\n" );
-
-     Here is a scanner which recognizes (and discards) C comments
-     while maintaining a count of the current input line.
-
-         %x comment
-         %%
-                 int line_num = 1;
-
-         "/*"         BEGIN(comment);
-
-         <comment>[^*\n]*        /* eat anything that's not a '*' */
-         <comment>"*"+[^*/\n]*   /* eat up '*'s not followed by '/'s */
-         <comment>\n             ++line_num;
-         <comment>"*"+"/"        BEGIN(INITIAL);
-
-     This scanner goes to a bit of trouble to match as much text
-     as possible with each rule. In general, when attempting to
-     write a high-speed scanner, try to match as much as possible
-     in each rule, as it's a big win.
-
-     Note that start-condition names are really integer values
-     and can be stored as such. Thus, the above could be
-     extended in the following fashion:
-
-         %x comment foo
-         %%
-                 int line_num = 1;
-                 int comment_caller;
-
-         "/*"         {
-                      comment_caller = INITIAL;
-                      BEGIN(comment);
-                      }
-
-         ...
-
-         <foo>"/*"    {
-                      comment_caller = foo;
-                      BEGIN(comment);
-                      }
-
-         <comment>[^*\n]*        /* eat anything that's not a '*' */
-         <comment>"*"+[^*/\n]*   /* eat up '*'s not followed by '/'s */
-         <comment>\n             ++line_num;
-         <comment>"*"+"/"        BEGIN(comment_caller);
-
-     Furthermore, you can access the current start condition
-     using the integer-valued YY_START macro. For example, the
-     above assignments to comment_caller could instead be written
-
-         comment_caller = YY_START;
-
-     Flex provides YYSTATE as an alias for YY_START (since that
-     is what's used by AT&T lex).
-
-     Note that start conditions do not have their own name-space;
-     %s's and %x's declare names in the same fashion as
-     #define's.
-
-     Finally, here's an example of how to match C-style quoted
-     strings using exclusive start conditions, including expanded
-     escape sequences (but not including checking for a string
-     that's too long):
-
-         %x str
-
-         %%
-                 char string_buf[MAX_STR_CONST];
-                 char *string_buf_ptr;
-
-         \"      string_buf_ptr = string_buf; BEGIN(str);
-
-         <str>\"        { /* saw closing quote - all done */
-                 BEGIN(INITIAL);
-                 *string_buf_ptr = '\0';
-                 /* return string constant token type and
-                  * value to parser
-                  */
-                 }
-
-         <str>\n        {
-                 /* error - unterminated string constant */
-                 /* generate error message */
-                 }
-
-         <str>\\[0-7]{1,3} {
-                 /* octal escape sequence */
-                 unsigned int result;   /* "%o" stores into an unsigned int */
-
-                 (void) sscanf( yytext + 1, "%o", &result );
-
-                 if ( result > 0xff )
-                         {
-                         /* error, constant is out-of-bounds */
-                         }
-                 else
-                         *string_buf_ptr++ = result;
-                 }
-
-         <str>\\[0-9]+ {
-                 /* generate error - bad escape sequence; something
-                  * like '\48' or '\0777777'
-                  */
-                 }
-
-         <str>\\n  *string_buf_ptr++ = '\n';
-         <str>\\t  *string_buf_ptr++ = '\t';
-         <str>\\r  *string_buf_ptr++ = '\r';
-         <str>\\b  *string_buf_ptr++ = '\b';
-         <str>\\f  *string_buf_ptr++ = '\f';
-
-         <str>\\(.|\n)  *string_buf_ptr++ = yytext[1];
-
-         <str>[^\\\n\"]+        {
-                 char *yptr = yytext;
-
-                 while (
*yptr ) - *string_buf_ptr++ = *yptr++; - } - - - Often, such as in some of the examples above, you wind up - writing a whole bunch of rules all preceded by the same - start condition(s). Flex makes this a little easier and - cleaner by introducing a notion of start condition scope. A - start condition scope is begun with: - - <SCs>{ - - where SCs is a list of one or more start conditions. Inside - the start condition scope, every rule automatically has the - prefix <SCs> applied to it, until a '}' which matches the - initial '{'. So, for example, - - <ESC>{ - "\\n" return '\n'; - "\\r" return '\r'; - "\\f" return '\f'; - "\\0" return '\0'; - - - -Version 2.5 Last change: April 1995 22 - - - - - - -FLEX(1) USER COMMANDS FLEX(1) - - - - } - - is equivalent to: - - <ESC>"\\n" return '\n'; - <ESC>"\\r" return '\r'; - <ESC>"\\f" return '\f'; - <ESC>"\\0" return '\0'; - - Start condition scopes may be nested. - - Three routines are available for manipulating stacks of - start conditions: - - void yy_push_state(int new_state) - pushes the current start condition onto the top of the - start condition stack and switches to new_state as - though you had used BEGIN new_state (recall that start - condition names are also integers). - - void yy_pop_state() - pops the top of the stack and switches to it via BEGIN. - - int yy_top_state() - returns the top of the stack without altering the - stack's contents. - - The start condition stack grows dynamically and so has no - built-in size limitation. If memory is exhausted, program - execution aborts. - - To use start condition stacks, your scanner must include a - %option stack directive (see Options below). - -MULTIPLE INPUT BUFFERS - Some scanners (such as those which support "include" files) - require reading from several input streams. As flex - scanners do a large amount of buffering, one cannot control - where the next input will be read from by simply writing a - YY_INPUT which is sensitive to the scanning context. 
- YY_INPUT is only called when the scanner reaches the end of - its buffer, which may be a long time after scanning a state- - ment such as an "include" which requires switching the input - source. - - To negotiate these sorts of problems, flex provides a - mechanism for creating and switching between multiple input - buffers. An input buffer is created by using: - - YY_BUFFER_STATE yy_create_buffer( FILE *file, int size ) - - which takes a FILE pointer and a size and creates a buffer - - - -Version 2.5 Last change: April 1995 23 - - - - - - -FLEX(1) USER COMMANDS FLEX(1) - - - - associated with the given file and large enough to hold size - characters (when in doubt, use YY_BUF_SIZE for the size). - It returns a YY_BUFFER_STATE handle, which may then be - passed to other routines (see below). The YY_BUFFER_STATE - type is a pointer to an opaque struct yy_buffer_state struc- - ture, so you may safely initialize YY_BUFFER_STATE variables - to ((YY_BUFFER_STATE) 0) if you wish, and also refer to the - opaque structure in order to correctly declare input buffers - in source files other than that of your scanner. Note that - the FILE pointer in the call to yy_create_buffer is only - used as the value of yyin seen by YY_INPUT; if you redefine - YY_INPUT so it no longer uses yyin, then you can safely pass - a nil FILE pointer to yy_create_buffer. You select a partic- - ular buffer to scan from using: - - void yy_switch_to_buffer( YY_BUFFER_STATE new_buffer ) - - switches the scanner's input buffer so subsequent tokens - will come from new_buffer. Note that yy_switch_to_buffer() - may be used by yywrap() to set things up for continued scan- - ning, instead of opening a new file and pointing yyin at it. - Note also that switching input sources via either - yy_switch_to_buffer() or yywrap() does not change the start - condition. - - void yy_delete_buffer( YY_BUFFER_STATE buffer ) - - is used to reclaim the storage associated with a buffer. 
( - buffer can be nil, in which case the routine does nothing.) - You can also clear the current contents of a buffer using: - - void yy_flush_buffer( YY_BUFFER_STATE buffer ) - - This function discards the buffer's contents, so the next - time the scanner attempts to match a token from the buffer, - it will first fill the buffer anew using YY_INPUT. - - yy_new_buffer() is an alias for yy_create_buffer(), provided - for compatibility with the C++ use of new and delete for - creating and destroying dynamic objects. - - Finally, the YY_CURRENT_BUFFER macro returns a - YY_BUFFER_STATE handle to the current buffer. - - Here is an example of using these features for writing a - scanner which expands include files (the <<EOF>> feature is - discussed below): - - /* the "incl" state is used for picking up the name - * of an include file - */ - %x incl - - - -Version 2.5 Last change: April 1995 24 - - - - - - -FLEX(1) USER COMMANDS FLEX(1) - - - - %{ - #define MAX_INCLUDE_DEPTH 10 - YY_BUFFER_STATE include_stack[MAX_INCLUDE_DEPTH]; - int include_stack_ptr = 0; - %} - - %% - include BEGIN(incl); - - [a-z]+ ECHO; - [^a-z\n]*\n? ECHO; - - <incl>[ \t]* /* eat the whitespace */ - <incl>[^ \t\n]+ { /* got the include file name */ - if ( include_stack_ptr >= MAX_INCLUDE_DEPTH ) - { - fprintf( stderr, "Includes nested too deeply" ); - exit( 1 ); - } - - include_stack[include_stack_ptr++] = - YY_CURRENT_BUFFER; - - yyin = fopen( yytext, "r" ); - - if ( ! yyin ) - error( ... ); - - yy_switch_to_buffer( - yy_create_buffer( yyin, YY_BUF_SIZE ) ); - - BEGIN(INITIAL); - } - - <<EOF>> { - if ( --include_stack_ptr < 0 ) - { - yyterminate(); - } - - else - { - yy_delete_buffer( YY_CURRENT_BUFFER ); - yy_switch_to_buffer( - include_stack[include_stack_ptr] ); - } - } - - Three routines are available for setting up input buffers - for scanning in-memory strings instead of files. 
All of - them create a new input buffer for scanning the string, and - return a corresponding YY_BUFFER_STATE handle (which you - - - -Version 2.5 Last change: April 1995 25 - - - - - - -FLEX(1) USER COMMANDS FLEX(1) - - - - should delete with yy_delete_buffer() when done with it). - They also switch to the new buffer using - yy_switch_to_buffer(), so the next call to yylex() will - start scanning the string. - - yy_scan_string(const char *str) - scans a NUL-terminated string. - - yy_scan_bytes(const char *bytes, int len) - scans len bytes (including possibly NUL's) starting at - location bytes. - - Note that both of these functions create and scan a copy of - the string or bytes. (This may be desirable, since yylex() - modifies the contents of the buffer it is scanning.) You - can avoid the copy by using: - - yy_scan_buffer(char *base, yy_size_t size) - which scans in place the buffer starting at base, con- - sisting of size bytes, the last two bytes of which must - be YY_END_OF_BUFFER_CHAR (ASCII NUL). These last two - bytes are not scanned; thus, scanning consists of - base[0] through base[size-2], inclusive. - - If you fail to set up base in this manner (i.e., forget - the final two YY_END_OF_BUFFER_CHAR bytes), then - yy_scan_buffer() returns a nil pointer instead of - creating a new input buffer. - - The type yy_size_t is an integral type to which you can - cast an integer expression reflecting the size of the - buffer. - -END-OF-FILE RULES - The special rule "<<EOF>>" indicates actions which are to be - taken when an end-of-file is encountered and yywrap() - returns non-zero (i.e., indicates no further files to pro- - cess). 
The action must finish by doing one of four things: - - - assigning yyin to a new input file (in previous ver- - sions of flex, after doing the assignment you had to - call the special action YY_NEW_FILE; this is no longer - necessary); - - - executing a return statement; - - - executing the special yyterminate() action; - - - or, switching to a new buffer using - yy_switch_to_buffer() as shown in the example above. - - - - - -Version 2.5 Last change: April 1995 26 - - - - - - -FLEX(1) USER COMMANDS FLEX(1) - - - - <<EOF>> rules may not be used with other patterns; they may - only be qualified with a list of start conditions. If an - unqualified <<EOF>> rule is given, it applies to all start - conditions which do not already have <<EOF>> actions. To - specify an <<EOF>> rule for only the initial start condi- - tion, use - - <INITIAL><<EOF>> - - - These rules are useful for catching things like unclosed - comments. An example: - - %x quote - %% - - ...other rules for dealing with quotes... - - <quote><<EOF>> { - error( "unterminated quote" ); - yyterminate(); - } - <<EOF>> { - if ( *++filelist ) - yyin = fopen( *filelist, "r" ); - else - yyterminate(); - } - - -MISCELLANEOUS MACROS - The macro YY_USER_ACTION can be defined to provide an action - which is always executed prior to the matched rule's action. - For example, it could be #define'd to call a routine to con- - vert yytext to lower-case. When YY_USER_ACTION is invoked, - the variable yy_act gives the number of the matched rule - (rules are numbered starting with 1). Suppose you want to - profile how often each of your rules is matched. The fol- - lowing would do the trick: - - #define YY_USER_ACTION ++ctr[yy_act] - - where ctr is an array to hold the counts for the different - rules. 
-     Note that the macro YY_NUM_RULES gives the total number of
-     rules (including the default rule, even if you use -s), so a
-     correct declaration for ctr is:
-
-         int ctr[YY_NUM_RULES];
-
-     The macro YY_USER_INIT may be defined to provide an action
-     which is always executed before the first scan (and before
-     the scanner's internal initializations are done). For
-     example, it could be used to call a routine to read in a data
-     table or open a logging file.
-
-     The macro yy_set_interactive(is_interactive) can be used to
-     control whether the current buffer is considered interactive.
-     An interactive buffer is processed more slowly, but must be
-     used when the scanner's input source is indeed interactive to
-     avoid problems due to waiting to fill buffers (see the
-     discussion of the -I flag below). A non-zero value in the
-     macro invocation marks the buffer as interactive, a zero
-     value as non-interactive. Note that use of this macro
-     overrides %option always-interactive or %option
-     never-interactive (see Options below). yy_set_interactive()
-     must be invoked prior to beginning to scan the buffer that is
-     (or is not) to be considered interactive.
-
-     The macro yy_set_bol(at_bol) can be used to control whether
-     the current buffer's scanning context for the next token
-     match is done as though at the beginning of a line. A
-     non-zero macro argument makes rules anchored with '^' active,
-     while a zero argument makes '^' rules inactive.
-
-     The macro YY_AT_BOL() returns true if the next token scanned
-     from the current buffer will have '^' rules active, false
-     otherwise.
-
-     In the generated scanner, the actions are all gathered in
-     one large switch statement and separated using YY_BREAK,
-     which may be redefined. By default, it is simply a "break",
-     to separate each rule's action from the following rule's.
-     Redefining YY_BREAK allows, for example, C++ users to
-     #define YY_BREAK to do nothing (while being very careful
-     that every rule ends with a "break" or a "return"!) to avoid
-     suffering from unreachable statement warnings where, because
-     a rule's action ends with "return", the YY_BREAK is
-     inaccessible.
-
-VALUES AVAILABLE TO THE USER
-     This section summarizes the various values available to the
-     user in the rule actions.
-
-     -    char *yytext holds the text of the current token. It
-          may be modified but not lengthened (you cannot append
-          characters to the end).
-
-          If the special directive %array appears in the first
-          section of the scanner description, then yytext is
-          instead declared char yytext[YYLMAX], where YYLMAX is a
-          macro definition that you can redefine in the first
-          section if you don't like the default value (generally
-          8KB). Using %array results in somewhat slower
-          scanners, but the value of yytext becomes immune to
-          calls to input() and unput(), which potentially destroy
-          its value when yytext is a character pointer. The
-          opposite of %array is %pointer, which is the default.
-
-          You cannot use %array when generating C++ scanner
-          classes (the -+ flag).
-
-     -    int yyleng holds the length of the current token.
-
-     -    FILE *yyin is the file which by default flex reads
-          from. It may be redefined but doing so only makes
-          sense before scanning begins or after an EOF has been
-          encountered. Changing it in the midst of scanning will
-          have unexpected results since flex buffers its input;
-          use yyrestart() instead. Once scanning terminates
-          because an end-of-file has been seen, you can point
-          yyin at the new input file and then call the scanner
-          again to continue scanning.
-
-     -    void yyrestart( FILE *new_file ) may be called to point
-          yyin at the new input file.
The switch-over to the new - file is immediate (any previously buffered-up input is - lost). Note that calling yyrestart() with yyin as an - argument thus throws away the current input buffer and - continues scanning the same input file. - - - FILE *yyout is the file to which ECHO actions are done. - It can be reassigned by the user. - - - YY_CURRENT_BUFFER returns a YY_BUFFER_STATE handle to - the current buffer. - - - YY_START returns an integer value corresponding to the - current start condition. You can subsequently use this - value with BEGIN to return to that start condition. - -INTERFACING WITH YACC - One of the main uses of flex is as a companion to the yacc - parser-generator. yacc parsers expect to call a routine - named yylex() to find the next input token. The routine is - supposed to return the type of the next token as well as - putting any associated value in the global yylval. To use - flex with yacc, one specifies the -d option to yacc to - instruct it to generate the file y.tab.h containing defini- - tions of all the %tokens appearing in the yacc input. This - file is then included in the flex scanner. For example, if - one of the tokens is "TOK_NUMBER", part of the scanner might - look like: - - %{ - #include "y.tab.h" - %} - - - -Version 2.5 Last change: April 1995 29 - - - - - - -FLEX(1) USER COMMANDS FLEX(1) - - - - %% - - [0-9]+ yylval = atoi( yytext ); return TOK_NUMBER; - - -OPTIONS - flex has the following options: - - -b Generate backing-up information to lex.backup. This is - a list of scanner states which require backing up and - the input characters on which they do so. By adding - rules one can remove backing-up states. If all - backing-up states are eliminated and -Cf or -CF is - used, the generated scanner will run faster (see the -p - flag). Only users who wish to squeeze every last cycle - out of their scanners need worry about this option. - (See the section on Performance Considerations below.) 
- - -c is a do-nothing, deprecated option included for POSIX - compliance. - - -d makes the generated scanner run in debug mode. When- - ever a pattern is recognized and the global - yy_flex_debug is non-zero (which is the default), the - scanner will write to stderr a line of the form: - - --accepting rule at line 53 ("the matched text") - - The line number refers to the location of the rule in - the file defining the scanner (i.e., the file that was - fed to flex). Messages are also generated when the - scanner backs up, accepts the default rule, reaches the - end of its input buffer (or encounters a NUL; at this - point, the two look the same as far as the scanner's - concerned), or reaches an end-of-file. - - -f specifies fast scanner. No table compression is done - and stdio is bypassed. The result is large but fast. - This option is equivalent to -Cfr (see below). - - -h generates a "help" summary of flex's options to stdout - and then exits. -? and --help are synonyms for -h. - - -i instructs flex to generate a case-insensitive scanner. - The case of letters given in the flex input patterns - will be ignored, and tokens in the input will be - matched regardless of case. The matched text given in - yytext will have the preserved case (i.e., it will not - be folded). - - -l turns on maximum compatibility with the original AT&T - lex implementation. Note that this does not mean full - - - -Version 2.5 Last change: April 1995 30 - - - - - - -FLEX(1) USER COMMANDS FLEX(1) - - - - compatibility. Use of this option costs a considerable - amount of performance, and it cannot be used with the - -+, -f, -F, -Cf, or -CF options. For details on the - compatibilities it provides, see the section "Incompa- - tibilities With Lex And POSIX" below. This option also - results in the name YY_FLEX_LEX_COMPAT being #define'd - in the generated scanner. - - -n is another do-nothing, deprecated option included only - for POSIX compliance. 
- - -p generates a performance report to stderr. The report - consists of comments regarding features of the flex - input file which will cause a serious loss of perfor- - mance in the resulting scanner. If you give the flag - twice, you will also get comments regarding features - that lead to minor performance losses. - - Note that the use of REJECT, %option yylineno, and - variable trailing context (see the Deficiencies / Bugs - section below) entails a substantial performance - penalty; use of yymore(), the ^ operator, and the -I - flag entail minor performance penalties. - - -s causes the default rule (that unmatched scanner input - is echoed to stdout) to be suppressed. If the scanner - encounters input that does not match any of its rules, - it aborts with an error. This option is useful for - finding holes in a scanner's rule set. - - -t instructs flex to write the scanner it generates to - standard output instead of lex.yy.c. - - -v specifies that flex should write to stderr a summary of - statistics regarding the scanner it generates. Most of - the statistics are meaningless to the casual flex user, - but the first line identifies the version of flex (same - as reported by -V), and the next line the flags used - when generating the scanner, including those that are - on by default. - - -w suppresses warning messages. - - -B instructs flex to generate a batch scanner, the oppo- - site of interactive scanners generated by -I (see - below). In general, you use -B when you are certain - that your scanner will never be used interactively, and - you want to squeeze a little more performance out of - it. If your goal is instead to squeeze out a lot more - performance, you should be using the -Cf or -CF - options (discussed below), which turn on -B automati- - cally anyway. - - - -Version 2.5 Last change: April 1995 31 - - - - - - -FLEX(1) USER COMMANDS FLEX(1) - - - - -F specifies that the fast scanner table representation - should be used (and stdio bypassed). 
This representa- - tion is about as fast as the full table representation - (-f), and for some sets of patterns will be consider- - ably smaller (and for others, larger). In general, if - the pattern set contains both "keywords" and a catch- - all, "identifier" rule, such as in the set: - - "case" return TOK_CASE; - "switch" return TOK_SWITCH; - ... - "default" return TOK_DEFAULT; - [a-z]+ return TOK_ID; - - then you're better off using the full table representa- - tion. If only the "identifier" rule is present and you - then use a hash table or some such to detect the key- - words, you're better off using -F. - - This option is equivalent to -CFr (see below). It can- - not be used with -+. - - -I instructs flex to generate an interactive scanner. An - interactive scanner is one that only looks ahead to - decide what token has been matched if it absolutely - must. It turns out that always looking one extra char- - acter ahead, even if the scanner has already seen - enough text to disambiguate the current token, is a bit - faster than only looking ahead when necessary. But - scanners that always look ahead give dreadful interac- - tive performance; for example, when a user types a new- - line, it is not recognized as a newline token until - they enter another token, which often means typing in - another whole line. - - Flex scanners default to interactive unless you use the - -Cf or -CF table-compression options (see below). - That's because if you're looking for high-performance - you should be using one of these options, so if you - didn't, flex assumes you'd rather trade off a bit of - run-time performance for intuitive interactive - behavior. Note also that you cannot use -I in conjunc- - tion with -Cf or -CF. Thus, this option is not really - needed; it is on by default for all those cases in - which it is allowed. - - You can force a scanner to not be interactive by using - -B (see above). - - -L instructs flex not to generate #line directives. 
-          Without this option, flex peppers the generated scanner
-          with #line directives so error messages in the actions
-          will be correctly located with respect to either the
-          original flex input file (if the errors are due to code
-          in the input file), or lex.yy.c (if the errors are
-          flex's fault -- you should report these sorts of errors
-          to the email address given below).
-
-     -T   makes flex run in trace mode. It will generate a lot
-          of messages to stderr concerning the form of the input
-          and the resultant non-deterministic and deterministic
-          finite automata. This option is mostly for use in
-          maintaining flex.
-
-     -V   prints the version number to stdout and exits.
-          --version is a synonym for -V.
-
-     -7   instructs flex to generate a 7-bit scanner, i.e., one
-          which can only recognize 7-bit characters in its
-          input. The advantage of using -7 is that the scanner's
-          tables can be up to half the size of those generated
-          using the -8 option (see below). The disadvantage is
-          that such scanners often hang or crash if their input
-          contains an 8-bit character.
-
-          Note, however, that unless you generate your scanner
-          using the -Cf or -CF table compression options, use of
-          -7 will save only a small amount of table space, and
-          make your scanner considerably less portable. Flex's
-          default behavior is to generate an 8-bit scanner unless
-          you use -Cf or -CF, in which case flex defaults to
-          generating 7-bit scanners unless your site was always
-          configured to generate 8-bit scanners (as will often be
-          the case with non-USA sites). You can tell whether
-          flex generated a 7-bit or an 8-bit scanner by
-          inspecting the flag summary in the -v output as
-          described above.
-
-          Note that if you use -Cfe or -CFe (those table
-          compression options, but also using equivalence classes
-          as discussed below), flex still defaults to generating
-          an 8-bit scanner, since usually with these compression
-          options full 8-bit tables are not much more expensive
-          than 7-bit tables.
-
-     -8   instructs flex to generate an 8-bit scanner, i.e., one
-          which can recognize 8-bit characters. This flag is
-          only needed for scanners generated using -Cf or -CF, as
-          otherwise flex defaults to generating an 8-bit scanner
-          anyway.
-
-          See the discussion of -7 above for flex's default
-          behavior and the tradeoffs between 7-bit and 8-bit
-          scanners.
-
-     -+   specifies that you want flex to generate a C++ scanner
-          class. See the section on Generating C++ Scanners
-          below for details.
-
-     -C[aefFmr]
-          controls the degree of table compression and, more
-          generally, trade-offs between small scanners and fast
-          scanners.
-
-          -Ca ("align") instructs flex to trade off larger tables
-          in the generated scanner for faster performance because
-          the elements of the tables are better aligned for
-          memory access and computation. On some RISC
-          architectures, fetching and manipulating longwords is
-          more efficient than with smaller-sized units such as
-          shortwords. This option can double the size of the
-          tables used by your scanner.
-
-          -Ce directs flex to construct equivalence classes,
-          i.e., sets of characters which have identical lexical
-          properties (for example, if the only appearance of
-          digits in the flex input is in the character class
-          "[0-9]" then the digits '0', '1', ..., '9' will all be
-          put in the same equivalence class). Equivalence
-          classes usually give dramatic reductions in the final
-          table/object file sizes (typically a factor of 2-5) and
-          are pretty cheap performance-wise (one array look-up
-          per character scanned).
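The -Ce description can be made concrete with a small hand-written, table-driven matcher. This is only a sketch of the idea, not the tables flex actually emits: the names ec, next_state, build_ec() and matches_float() are invented for the illustration. It recognizes the pattern [0-9]+"."[0-9]+ using three equivalence classes (digit, dot, everything else), so the transition table needs three columns instead of 256, at the price of one extra array look-up (ec[]) per character.

```c
#include <assert.h>

enum { C_OTHER, C_DIGIT, C_DOT, NUM_CLASSES };
enum { S_START, S_INT, S_FRAC0, S_FRAC, S_DEAD, NUM_STATES };

/* Maps each of the 256 possible byte values to its equivalence class. */
static unsigned char ec[256];

/* Fill ec[]: all digits share one class, '.' has its own, the rest share C_OTHER. */
void build_ec(void)
{
    int c;

    for (c = 0; c < 256; ++c)
        ec[c] = C_OTHER;
    for (c = '0'; c <= '9'; ++c)
        ec[c] = C_DIGIT;
    ec['.'] = C_DOT;
}

/* Transition table indexed by [state][class] rather than [state][byte]. */
static const unsigned char next_state[NUM_STATES][NUM_CLASSES] = {
    /*            other    digit    dot      */
    /* START */ { S_DEAD,  S_INT,   S_DEAD  },
    /* INT   */ { S_DEAD,  S_INT,   S_FRAC0 },
    /* FRAC0 */ { S_DEAD,  S_FRAC,  S_DEAD  },
    /* FRAC  */ { S_DEAD,  S_FRAC,  S_DEAD  },
    /* DEAD  */ { S_DEAD,  S_DEAD,  S_DEAD  }
};

/* Returns 1 if s consists of digits, a dot, and more digits; 0 otherwise.
 * build_ec() must have been called first. */
int matches_float(const char *s)
{
    int state = S_START;

    assert(s != 0);
    for (; *s != '\0'; ++s)
        state = next_state[state][ec[(unsigned char) *s]];
    return state == S_FRAC;
}
```

With byte-indexed rows the same automaton would need 5 x 256 table entries; with classes it needs 5 x 3 plus the 256-entry ec[] map, which is shared by every state.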
-
-          -Cf specifies that the full scanner tables should be
-          generated - flex should not compress the tables by
-          taking advantage of similar transition functions for
-          different states.
-
-          -CF specifies that the alternate fast scanner
-          representation (described above under the -F flag)
-          should be used. This option cannot be used with -+.
-
-          -Cm directs flex to construct meta-equivalence classes,
-          which are sets of equivalence classes (or characters,
-          if equivalence classes are not being used) that are
-          commonly used together. Meta-equivalence classes are
-          often a big win when using compressed tables, but they
-          have a moderate performance impact (one or two "if"
-          tests and one array look-up per character scanned).
-
-          -Cr causes the generated scanner to bypass use of the
-          standard I/O library (stdio) for input. Instead of
-          calling fread() or getc(), the scanner will use the
-          read() system call, resulting in a performance gain
-          which varies from system to system, but in general is
-          probably negligible unless you are also using -Cf or
-          -CF. Using -Cr can cause strange behavior if, for
-          example, you read from yyin using stdio prior to calling
-          the scanner (because the scanner will miss whatever
-          text your previous reads left in the stdio input
-          buffer).
-
-          -Cr has no effect if you define YY_INPUT (see The
-          Generated Scanner above).
-
-          A lone -C specifies that the scanner tables should be
-          compressed but neither equivalence classes nor
-          meta-equivalence classes should be used.
-
-          The options -Cf or -CF and -Cm do not make sense
-          together - there is no opportunity for meta-equivalence
-          classes if the table is not being compressed.
-          Otherwise the options may be freely mixed, and are
-          cumulative.
-
-          The default setting is -Cem, which specifies that flex
-          should generate equivalence classes and
-          meta-equivalence classes.
-          This setting provides the highest degree of table
-          compression. You can obtain faster-executing scanners
-          at the cost of larger tables, with the following
-          generally being true:
-
-              slowest & smallest
-                    -Cem
-                    -Cm
-                    -Ce
-                    -C
-                    -C{f,F}e
-                    -C{f,F}
-                    -C{f,F}a
-              fastest & largest
-
-          Note that scanners with the smallest tables are usually
-          generated and compiled the quickest, so during
-          development you will usually want to use the default,
-          maximal compression.
-
-          -Cfe is often a good compromise between speed and size
-          for production scanners.
-
-     -ooutput
-          directs flex to write the scanner to the file output
-          instead of lex.yy.c. If you combine -o with the -t
-          option, then the scanner is written to stdout but its
-          #line directives (see the -L option above) refer to the
-          file output.
-
-     -Pprefix
-          changes the default yy prefix used by flex for all
-          globally-visible variable and function names to instead
-          be prefix. For example, -Pfoo changes the name of
-          yytext to footext. It also changes the name of the
-          default output file from lex.yy.c to lex.foo.c. Here
-          are all of the names affected:
-
-              yy_create_buffer
-              yy_delete_buffer
-              yy_flex_debug
-              yy_init_buffer
-              yy_flush_buffer
-              yy_load_buffer_state
-              yy_switch_to_buffer
-              yyin
-              yyleng
-              yylex
-              yylineno
-              yyout
-              yyrestart
-              yytext
-              yywrap
-
-          (If you are using a C++ scanner, then only yywrap and
-          yyFlexLexer are affected.) Within your scanner itself,
-          you can still refer to the global variables and
-          functions using either version of their name; but
-          externally, they have the modified name.
-
-          This option lets you easily link together multiple flex
-          programs into the same executable.
-          Note, though, that using this option also renames
-          yywrap(), so you now must either provide your own
-          (appropriately-named) version of the routine for your
-          scanner, or use %option noyywrap, as linking with -lfl
-          no longer provides one for you by default.
-
-     -Sskeleton_file
-          overrides the default skeleton file from which flex
-          constructs its scanners. You'll never need this option
-          unless you are doing flex maintenance or development.
-
-     flex also provides a mechanism for controlling options
-     within the scanner specification itself, rather than from
-     the flex command-line. This is done by including %option
-     directives in the first section of the scanner
-     specification. You can specify multiple options with a
-     single %option directive, and multiple directives in the
-     first section of your flex input file.
-
-     Most options are given simply as names, optionally preceded
-     by the word "no" (with no intervening whitespace) to negate
-     their meaning. A number are equivalent to flex flags or
-     their negation:
-
-         7bit            -7 option
-         8bit            -8 option
-         align           -Ca option
-         backup          -b option
-         batch           -B option
-         c++             -+ option
-
-         caseful or
-         case-sensitive  opposite of -i (default)
-
-         case-insensitive or
-         caseless        -i option
-
-         debug           -d option
-         default         opposite of -s option
-         ecs             -Ce option
-         fast            -F option
-         full            -f option
-         interactive     -I option
-         lex-compat      -l option
-         meta-ecs        -Cm option
-         perf-report     -p option
-         read            -Cr option
-         stdout          -t option
-         verbose         -v option
-         warn            opposite of -w option
-                         (use "%option nowarn" for -w)
-
-         array           equivalent to "%array"
-         pointer         equivalent to "%pointer" (default)
-
-     Some %option's provide features otherwise not available:
-
-     always-interactive
-          instructs flex to generate a scanner which always
-          considers its input "interactive".
-          Normally, on each new input file the scanner calls
-          isatty() in an attempt to determine whether the
-          scanner's input source is interactive and thus should
-          be read a character at a time. When this option is
-          used, however, then no such call is made.
-
-     main directs flex to provide a default main() program for
-          the scanner, which simply calls yylex(). This option
-          implies noyywrap (see below).
-
-     never-interactive
-          instructs flex to generate a scanner which never
-          considers its input "interactive" (again, no call made
-          to isatty()). This is the opposite of
-          always-interactive.
-
-     stack
-          enables the use of start condition stacks (see Start
-          Conditions above).
-
-     stdinit
-          if set (i.e., %option stdinit) initializes yyin and
-          yyout to stdin and stdout, instead of the default of
-          nil. Some existing lex programs depend on this
-          behavior, even though it is not compliant with ANSI C,
-          which does not require stdin and stdout to be
-          compile-time constant.
-
-     yylineno
-          directs flex to generate a scanner that maintains the
-          number of the current line read from its input in the
-          global variable yylineno. This option is implied by
-          %option lex-compat.
-
-     yywrap
-          if unset (i.e., %option noyywrap), makes the scanner
-          not call yywrap() upon an end-of-file, but simply
-          assume that there are no more files to scan (until the
-          user points yyin at a new file and calls yylex()
-          again).
-
-     flex scans your rule actions to determine whether you use
-     the REJECT or yymore() features. The reject and yymore
-     options are available to override its decision as to whether
-     you use the options, either by setting them (e.g., %option
-     reject) to indicate the feature is indeed used, or unsetting
-     them to indicate it actually is not used (e.g., %option
-     noyymore).
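As a small illustration of the yywrap() routine mentioned under -P and %option yywrap above, the following sketch shows what a user-supplied version might look like for a scanner built with -Pfoo. The name foowrap simply follows the renaming rule described for -P (yywrap becomes foowrap). It assumes the scanner has only one input to read.

```c
/* Hypothetical end-of-input hook for a scanner generated with -Pfoo.
 * With the default prefix the same routine would be named yywrap(). */
int foowrap(void)
{
    return 1;   /* non-zero: no further input files to scan */
}
```

A scanner that should continue with another file would instead point its input at that file and return 0.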
-
-     Three options take string-delimited values, offset with '=':
-
-         %option outfile="ABC"
-
-     is equivalent to -oABC, and
-
-         %option prefix="XYZ"
-
-     is equivalent to -PXYZ. Finally,
-
-         %option yyclass="foo"
-
-     only applies when generating a C++ scanner (-+ option). It
-     informs flex that you have derived foo as a subclass of
-     yyFlexLexer, so flex will place your actions in the member
-     function foo::yylex() instead of yyFlexLexer::yylex(). It
-     also generates a yyFlexLexer::yylex() member function that
-     emits a run-time error (by invoking
-     yyFlexLexer::LexerError()) if called. See Generating C++
-     Scanners, below, for additional information.
-
-     A number of options are available for lint purists who want
-     to suppress the appearance of unneeded routines in the
-     generated scanner. Each of the following, if unset (e.g.,
-     %option nounput), results in the corresponding routine not
-     appearing in the generated scanner:
-
-         input, unput
-         yy_push_state, yy_pop_state, yy_top_state
-         yy_scan_buffer, yy_scan_bytes, yy_scan_string
-
-     (though yy_push_state() and friends won't appear anyway
-     unless you use %option stack).
-
-PERFORMANCE CONSIDERATIONS
-     The main design goal of flex is that it generate
-     high-performance scanners. It has been optimized for dealing
-     well with large sets of rules. Aside from the effects on
-     scanner speed of the table compression -C options outlined
-     above, there are a number of options/actions which degrade
-     performance. These are, from most expensive to least:
-
-         REJECT
-         %option yylineno
-         arbitrary trailing context
-
-         pattern sets that require backing up
-         %array
-         %option interactive
-         %option always-interactive
-
-         '^' beginning-of-line operator
-         yymore()
-
-     with the first three all being quite expensive and the last
-     two being quite cheap.
-     Note also that unput() is implemented as a routine call that
-     potentially does quite a bit of work, while yyless() is a
-     quite-cheap macro; so if just putting back some excess text
-     you scanned, use yyless().
-
-     REJECT should be avoided at all costs when performance is
-     important. It is a particularly expensive option.
-
-     Getting rid of backing up is messy and often may be an
-     enormous amount of work for a complicated scanner. In
-     principle, one begins by using the -b flag to generate a
-     lex.backup file. For example, on the input
-
-         %%
-         foo        return TOK_KEYWORD;
-         foobar     return TOK_KEYWORD;
-
-     the file looks like:
-
-         State #6 is non-accepting -
-          associated rule line numbers:
-                2       3
-          out-transitions: [ o ]
-          jam-transitions: EOF [ \001-n  p-\177 ]
-
-         State #8 is non-accepting -
-          associated rule line numbers:
-                3
-          out-transitions: [ a ]
-          jam-transitions: EOF [ \001-`  b-\177 ]
-
-         State #9 is non-accepting -
-          associated rule line numbers:
-                3
-          out-transitions: [ r ]
-          jam-transitions: EOF [ \001-q  s-\177 ]
-
-         Compressed tables always back up.
-
-     The first few lines tell us that there's a scanner state in
-     which it can make a transition on an 'o' but not on any
-     other character, and that in that state the currently
-     scanned text does not match any rule. The state occurs when
-     trying to match the rules found at lines 2 and 3 in the
-     input file. If the scanner is in that state and then reads
-     something other than an 'o', it will have to back up to find
-     a rule which is matched. With a bit of headscratching one
-     can see that this must be the state it's in when it has seen
-     "fo". When this has happened, if anything other than
-     another 'o' is seen, the scanner will have to back up to
-     simply match the 'f' (by the default rule).
-
-     The comment regarding State #8 indicates there's a problem
-     when "foob" has been scanned.
-     Indeed, on any character other than an 'a', the scanner
-     will have to back up to accept "foo". Similarly, the
-     comment for State #9 concerns when "fooba" has been scanned
-     and an 'r' does not follow.
-
-     The final comment reminds us that there's no point going to
-     all the trouble of removing backing up from the rules unless
-     we're using -Cf or -CF, since there's no performance gain
-     doing so with compressed scanners.
-
-     The way to remove the backing up is to add "error" rules:
-
-         %%
-         foo         return TOK_KEYWORD;
-         foobar      return TOK_KEYWORD;
-
-         fooba       |
-         foob        |
-         fo          {
-                     /* false alarm, not really a keyword */
-                     return TOK_ID;
-                     }
-
-     Eliminating backing up among a list of keywords can also be
-     done using a "catch-all" rule:
-
-         %%
-         foo         return TOK_KEYWORD;
-         foobar      return TOK_KEYWORD;
-
-         [a-z]+      return TOK_ID;
-
-     This is usually the best solution when appropriate.
-
-     Backing up messages tend to cascade. With a complicated set
-     of rules it's not uncommon to get hundreds of messages. If
-     one can decipher them, though, it often only takes a dozen
-     or so rules to eliminate the backing up (though it's easy to
-     make a mistake and have an error rule accidentally match a
-     valid token. A possible future flex feature will be to
-     automatically add rules to eliminate backing up).
-
-     It's important to keep in mind that you gain the benefits of
-     eliminating backing up only if you eliminate every instance
-     of backing up. Leaving just one means you gain nothing.
-
-     Variable trailing context (where both the leading and
-     trailing parts do not have a fixed length) entails almost
-     the same performance loss as REJECT (i.e., substantial).
-     So when possible a rule like:
-
-         %%
-         mouse|rat/(cat|dog)   run();
-
-     is better written:
-
-         %%
-         mouse/cat|dog         run();
-         rat/cat|dog           run();
-
-     or as
-
-         %%
-         mouse|rat/cat         run();
-         mouse|rat/dog         run();
-
-     Note that here the special '|' action does not provide any
-     savings, and can even make things worse (see Deficiencies /
-     Bugs below).
-
-     Another area where the user can increase a scanner's
-     performance (and one that's easier to implement) arises from
-     the fact that the longer the tokens matched, the faster the
-     scanner will run. This is because with long tokens the
-     processing of most input characters takes place in the
-     (short) inner scanning loop, and does not often have to go
-     through the additional work of setting up the scanning
-     environment (e.g., yytext) for the action. Recall the
-     scanner for C comments:
-
-         %x comment
-         %%
-                 int line_num = 1;
-
-         "/*"         BEGIN(comment);
-
-         <comment>[^*\n]*
-         <comment>"*"+[^*/\n]*
-         <comment>\n             ++line_num;
-         <comment>"*"+"/"        BEGIN(INITIAL);
-
-     This could be sped up by writing it as:
-
-         %x comment
-         %%
-                 int line_num = 1;
-
-         "/*"         BEGIN(comment);
-
-         <comment>[^*\n]*
-         <comment>[^*\n]*\n      ++line_num;
-         <comment>"*"+[^*/\n]*
-         <comment>"*"+[^*/\n]*\n ++line_num;
-         <comment>"*"+"/"        BEGIN(INITIAL);
-
-     Now instead of each newline requiring the processing of
-     another action, recognizing the newlines is "distributed"
-     over the other rules to keep the matched text as long as
-     possible. Note that adding rules does not slow down the
-     scanner! The speed of the scanner is independent of the
-     number of rules or (modulo the considerations given at the
-     beginning of this section) how complicated the rules are
-     with regard to operators such as '*' and '|'.
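To make the effect of the two comment rule sets above concrete, here is a rough hand-written model in C. It is not code produced by flex. The function name count_comment_actions and its merge_newlines flag are invented for this sketch. It walks a comment body (the text after the opening delimiter) and counts how many rule matches a longest-match scanner would perform, first with newlines as separate matches and then with newlines absorbed into the preceding match.

```c
#include <stddef.h>

/* Count rule matches over a comment body, up to and including the
 * closing delimiter.  merge_newlines == 0 models the first rule set;
 * non-zero models the second, where a newline that follows a run of
 * text is absorbed into the same match. */
size_t count_comment_actions(const char *s, int merge_newlines)
{
    size_t i = 0, actions = 0;

    while (s[i] != '\0') {
        if (s[i] == '\n') {             /* a newline matched on its own */
            ++i;
            ++actions;
            continue;
        }
        if (s[i] == '*') {
            while (s[i] == '*')         /* one or more stars */
                ++i;
            if (s[i] == '/')            /* stars then slash: comment ends */
                return actions + 1;
            while (s[i] != '\0' && s[i] != '*' &&
                   s[i] != '/' && s[i] != '\n')
                ++i;                    /* text that is not star, slash, newline */
        } else {
            while (s[i] != '\0' && s[i] != '*' && s[i] != '\n')
                ++i;                    /* text that is not star or newline */
        }
        if (merge_newlines && s[i] == '\n')
            ++i;                        /* newline joins the current match */
        ++actions;
    }
    return actions;
}
```

For the body " line one\n line two\n */" this model counts 6 matches with the first rule set and 4 with the second, which is the reduction in per-token overhead the text describes.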
-
-     A final example in speeding up a scanner: suppose you want
-     to scan through a file containing identifiers and keywords,
-     one per line and with no other extraneous characters, and
-     recognize all the keywords. A natural first approach is:
-
-         %%
-         asm      |
-         auto     |
-         break    |
-         ... etc ...
-         volatile |
-         while    /* it's a keyword */
-
-         .|\n     /* it's not a keyword */
-
-     To eliminate the back-tracking, introduce a catch-all rule:
-
-         %%
-         asm      |
-         auto     |
-         break    |
-         ... etc ...
-         volatile |
-         while    /* it's a keyword */
-
-         [a-z]+   |
-         .|\n     /* it's not a keyword */
-
-     Now, if it's guaranteed that there's exactly one word per
-     line, then we can reduce the total number of matches by a
-     half by merging in the recognition of newlines with that of
-     the other tokens:
-
-         %%
-         asm\n    |
-         auto\n   |
-         break\n  |
-         ... etc ...
-         volatile\n |
-         while\n  /* it's a keyword */
-
-         [a-z]+\n |
-         .|\n     /* it's not a keyword */
-
-     One has to be careful here, as we have now reintroduced
-     backing up into the scanner. In particular, while we know
-     that there will never be any characters in the input stream
-     other than letters or newlines, flex can't figure this out,
-     and it will plan for possibly needing to back up when it has
-     scanned a token like "auto" and then the next character is
-     something other than a newline or a letter. Previously it
-     would then just match the "auto" rule and be done, but now
-     it has no "auto" rule, only a "auto\n" rule. To eliminate
-     the possibility of backing up, we could either duplicate all
-     rules but without final newlines, or, since we never expect
-     to encounter such an input and therefore don't care how it's
-     classified, we can introduce one more catch-all rule, this
-     one which doesn't include a newline:
-
-         %%
-         asm\n    |
-         auto\n   |
-         break\n  |
-         ... etc ...
-         volatile\n |
-         while\n  /* it's a keyword */
-
-         [a-z]+\n |
-         [a-z]+   |
-         .|\n     /* it's not a keyword */
-
-     Compiled with -Cf, this is about as fast as one can get a
-     flex scanner to go for this particular problem.
-
-     A final note: flex is slow when matching NUL's, particularly
-     when a token contains multiple NUL's. It's best to write
-     rules which match short amounts of text if it's anticipated
-     that the text will often include NUL's.
-
-     Another final note regarding performance: as mentioned above
-     in the section How the Input is Matched, dynamically
-     resizing yytext to accommodate huge tokens is a slow process
-     because it presently requires that the (huge) token be
-     rescanned from the beginning. Thus if performance is vital,
-     you should attempt to match "large" quantities of text but
-     not "huge" quantities, where the cutoff between the two is
-     at about 8K characters/token.
-
-GENERATING C++ SCANNERS
-     flex provides two different ways to generate scanners for
-     use with C++. The first way is to simply compile a scanner
-     generated by flex using a C++ compiler instead of a C
-     compiler. You should not encounter any compilation errors
-     (please report any you find to the email address given in
-     the Author section below). You can then use C++ code in
-     your rule actions instead of C code. Note that the default
-     input source for your scanner remains yyin, and default
-     echoing is still done to yyout. Both of these remain FILE *
-     variables and not C++ streams.
-
-     You can also use flex to generate a C++ scanner class, using
-     the -+ option (or, equivalently, %option c++), which is
-     automatically specified if the name of the flex executable
-     ends in a '+', such as flex++. When using this option, flex
-     defaults to generating the scanner to the file lex.yy.cc
-     instead of lex.yy.c.
-     The generated scanner includes the header file FlexLexer.h,
-     which defines the interface to two C++ classes.
-
-     The first class, FlexLexer, provides an abstract base class
-     defining the general scanner class interface. It provides
-     the following member functions:
-
-     const char* YYText()
-          returns the text of the most recently matched token,
-          the equivalent of yytext.
-
-     int YYLeng()
-          returns the length of the most recently matched token,
-          the equivalent of yyleng.
-
-     int lineno() const
-          returns the current input line number (see %option
-          yylineno), or 1 if %option yylineno was not used.
-
-     void set_debug( int flag )
-          sets the debugging flag for the scanner, equivalent to
-          assigning to yy_flex_debug (see the Options section
-          above). Note that you must build the scanner using
-          %option debug to include debugging information in it.
-
-     int debug() const
-          returns the current setting of the debugging flag.
-
-     Also provided are member functions equivalent to
-     yy_switch_to_buffer(), yy_create_buffer() (though the first
-     argument is an istream* object pointer and not a FILE*),
-     yy_flush_buffer(), yy_delete_buffer(), and yyrestart()
-     (again, the first argument is an istream* object pointer).
-
-     The second class defined in FlexLexer.h is yyFlexLexer,
-     which is derived from FlexLexer. It defines the following
-     additional member functions:
-
-     yyFlexLexer( istream* arg_yyin = 0, ostream* arg_yyout = 0 )
-          constructs a yyFlexLexer object using the given streams
-          for input and output. If not specified, the streams
-          default to cin and cout, respectively.
-
-     virtual int yylex()
-          performs the same role as yylex() does for ordinary
-          flex scanners: it scans the input stream, consuming
-          tokens, until a rule's action returns a value.
-          If you derive a subclass S from yyFlexLexer and want to
-          access the member functions and variables of S inside
-          yylex(), then you need to use %option yyclass="S" to
-          inform flex that you will be using that subclass
-          instead of yyFlexLexer. In this case, rather than
-          generating yyFlexLexer::yylex(), flex generates
-          S::yylex() (and also generates a dummy
-          yyFlexLexer::yylex() that calls
-          yyFlexLexer::LexerError() if called).
-
-     virtual void switch_streams(istream* new_in = 0,
-          ostream* new_out = 0)
-          reassigns yyin to new_in (if non-nil) and yyout to
-          new_out (ditto), deleting the previous input buffer if
-          yyin is reassigned.
-
-     int yylex( istream* new_in, ostream* new_out = 0 )
-          first switches the input streams via switch_streams(
-          new_in, new_out ) and then returns the value of
-          yylex().
-
-     In addition, yyFlexLexer defines the following protected
-     virtual functions which you can redefine in derived classes
-     to tailor the scanner:
-
-     virtual int LexerInput( char* buf, int max_size )
-          reads up to max_size characters into buf and returns
-          the number of characters read. To indicate
-          end-of-input, return 0 characters. Note that
-          "interactive" scanners (see the -B and -I flags) define
-          the macro YY_INTERACTIVE. If you redefine LexerInput()
-          and need to take different actions depending on whether
-          or not the scanner might be scanning an interactive
-          input source, you can test for the presence of this
-          name via #ifdef.
-
-     virtual void LexerOutput( const char* buf, int size )
-          writes out size characters from the buffer buf, which,
-          while NUL-terminated, may also contain "internal" NUL's
-          if the scanner's rules can match text with NUL's in
-          them.
-
-     virtual void LexerError( const char* msg )
-          reports a fatal error message. The default version of
-          this function writes the message to the stream cerr and
-          exits.
-
-     Note that a yyFlexLexer object contains its entire scanning
-     state. Thus you can use such objects to create reentrant
-     scanners. You can instantiate multiple instances of the
-     same yyFlexLexer class, and you can also combine multiple
-     C++ scanner classes together in the same program using the
-     -P option discussed above.
-
-     Finally, note that the %array feature is not available to
-     C++ scanner classes; you must use %pointer (the default).
-
-     Here is an example of a simple C++ scanner:
-
-             // An example of using the flex C++ scanner class.
-
-         %{
-         int mylineno = 0;
-         %}
-
-         string  \"[^\n"]+\"
-
-         ws      [ \t]+
-
-         alpha   [A-Za-z]
-         dig     [0-9]
-         name    ({alpha}|{dig}|\$)({alpha}|{dig}|[_.\-/$])*
-         num1    [-+]?{dig}+\.?([eE][-+]?{dig}+)?
-         num2    [-+]?{dig}*\.{dig}+([eE][-+]?{dig}+)?
-         number  {num1}|{num2}
-
-         %%
-
-         {ws}    /* skip blanks and tabs */
-
-         "/*"    {
-                 int c;
-
-                 while((c = yyinput()) != 0)
-                     {
-                     if(c == '\n')
-                         ++mylineno;
-
-                     else if(c == '*')
-                         {
-                         if((c = yyinput()) == '/')
-                             break;
-                         else
-                             unput(c);
-                         }
-                     }
-                 }
-
-         {number}  cout << "number " << YYText() << '\n';
-
-         \n        mylineno++;
-
-         {name}    cout << "name " << YYText() << '\n';
-
-         {string}  cout << "string " << YYText() << '\n';
-
-         %%
-
-         int main( int /* argc */, char** /* argv */ )
-             {
-             FlexLexer* lexer = new yyFlexLexer;
-             while(lexer->yylex() != 0)
-                 ;
-             return 0;
-             }
-
-     If you want to create multiple (different) lexer classes,
-     you use the -P flag (or the prefix= option) to rename each
-     yyFlexLexer to some other xxFlexLexer.
-     You then can include <FlexLexer.h> in your other sources
-     once per lexer class, first renaming yyFlexLexer as follows:
-
-         #undef yyFlexLexer
-         #define yyFlexLexer xxFlexLexer
-         #include <FlexLexer.h>
-
-         #undef yyFlexLexer
-         #define yyFlexLexer zzFlexLexer
-         #include <FlexLexer.h>
-
-     if, for example, you used %option prefix="xx" for one of
-     your scanners and %option prefix="zz" for the other.
-
-     IMPORTANT: the present form of the scanning class is
-     experimental and may change considerably between major
-     releases.
-
-INCOMPATIBILITIES WITH LEX AND POSIX
-     flex is a rewrite of the AT&T Unix lex tool (the two
-     implementations do not share any code, though), with some
-     extensions and incompatibilities, both of which are of
-     concern to those who wish to write scanners acceptable to
-     either implementation. Flex is fully compliant with the
-     POSIX lex specification, except that when using %pointer
-     (the default), a call to unput() destroys the contents of
-     yytext, which is counter to the POSIX specification.
-
-     In this section we discuss all of the known areas of
-     incompatibility between flex, AT&T lex, and the POSIX
-     specification.
-
-     flex's -l option turns on maximum compatibility with the
-     original AT&T lex implementation, at the cost of a major
-     loss in the generated scanner's performance. We note below
-     which incompatibilities can be overcome using the -l option.
-
-     flex is fully compatible with lex with the following
-     exceptions:
-
-     -  The undocumented lex scanner internal variable yylineno
-        is not supported unless -l or %option yylineno is used.
-
-        yylineno should be maintained on a per-buffer basis,
-        rather than a per-scanner (single global variable)
-        basis.
-
-        yylineno is not part of the POSIX specification.
-
-     -  The input() routine is not redefinable, though it may
-        be called to read characters following whatever has
-        been matched by a rule. If input() encounters an
-        end-of-file the normal yywrap() processing is done. A
-        "real" end-of-file is returned by input() as EOF.
-
-        Input is instead controlled by defining the YY_INPUT
-        macro.
-
-        The flex restriction that input() cannot be redefined
-        is in accordance with the POSIX specification, which
-        simply does not specify any way of controlling the
-        scanner's input other than by making an initial
-        assignment to yyin.
-
-     -  The unput() routine is not redefinable. This
-        restriction is in accordance with POSIX.
-
-     -  flex scanners are not as reentrant as lex scanners. In
-        particular, if you have an interactive scanner and an
-        interrupt handler which long-jumps out of the scanner,
-        and the scanner is subsequently called again, you may
-        get the following message:
-
-            fatal flex scanner internal error--end of buffer missed
-
-        To reenter the scanner, first use
-
-            yyrestart( yyin );
-
-        Note that this call will throw away any buffered input;
-        usually this isn't a problem with an interactive
-        scanner.
-
-        Also note that flex C++ scanner classes are reentrant,
-        so if using C++ is an option for you, you should use
-        them instead. See "Generating C++ Scanners" above for
-        details.
-
-     -  output() is not supported. Output from the ECHO macro
-        is done to the file-pointer yyout (default stdout).
-
-        output() is not part of the POSIX specification.
-
-     -  lex does not support exclusive start conditions (%x),
-        though they are in the POSIX specification.
-
-     -  When definitions are expanded, flex encloses them in
-        parentheses. With lex, the following:
-
-            NAME    [A-Z][A-Z0-9]*
-            %%
-            foo{NAME}?      printf( "Found it\n" );
-            %%
-
-        will not match the string "foo" because when the macro
-        is expanded the rule is equivalent to
-        "foo[A-Z][A-Z0-9]*?" and the precedence is such that the
-        '?' is associated with "[A-Z0-9]*". With flex, the rule
-        will be expanded to "foo([A-Z][A-Z0-9]*)?" and so the
-        string "foo" will match.
-
-        Note that if the definition begins with ^ or ends with
-        $ then it is not expanded with parentheses, to allow
-        these operators to appear in definitions without losing
-        their special meanings. But the <s>, /, and <<EOF>>
-        operators cannot be used in a flex definition.
-
-        Using -l results in the lex behavior of no parentheses
-        around the definition.
-
-        The POSIX specification is that the definition be
-        enclosed in parentheses.
-
-     -  Some implementations of lex allow a rule's action to
-        begin on a separate line, if the rule's pattern has
-        trailing whitespace:
-
-            %%
-            foo|bar<space here>
-              { foobar_action(); }
-
-        flex does not support this feature.
-
-     -  The lex %r (generate a Ratfor scanner) option is not
-        supported. It is not part of the POSIX specification.
-
-     -  After a call to unput(), yytext is undefined until the
-        next token is matched, unless the scanner was built
-        using %array. This is not the case with lex or the
-        POSIX specification. The -l option does away with this
-        incompatibility.
-
-     -  The precedence of the {} (numeric range) operator is
-        different. lex interprets "abc{1,3}" as "match one,
-        two, or three occurrences of 'abc'", whereas flex
-        interprets it as "match 'ab' followed by one, two, or
-        three occurrences of 'c'". The latter is in agreement
-        with the POSIX specification.
-
-     -  The precedence of the ^ operator is different.
-        lex interprets "^foo|bar" as "match either 'foo' at the
-        beginning of a line, or 'bar' anywhere", whereas flex
-        interprets it as "match either 'foo' or 'bar' if they
-        come at the beginning of a line". The latter is in
-        agreement with the POSIX specification.
-
-     -  The special table-size declarations such as %a
-        supported by lex are not required by flex scanners; flex
-        ignores them.
-
-     -  The name FLEX_SCANNER is #define'd so scanners may be
-        written for use with either flex or lex. Scanners also
-        include YY_FLEX_MAJOR_VERSION and YY_FLEX_MINOR_VERSION
-        indicating which version of flex generated the scanner
-        (for example, for the 2.5 release, these defines would
-        be 2 and 5 respectively).
-
-     The following flex features are not included in lex or the
-     POSIX specification:
-
-         C++ scanners
-         %option
-         start condition scopes
-         start condition stacks
-         interactive/non-interactive scanners
-         yy_scan_string() and friends
-         yyterminate()
-         yy_set_interactive()
-         yy_set_bol()
-         YY_AT_BOL()
-         <<EOF>>
-         <*>
-         YY_DECL
-         YY_START
-         YY_USER_ACTION
-         YY_USER_INIT
-         #line directives
-         %{}'s around actions
-         multiple actions on a line
-
-     plus almost all of the flex flags. The last feature in the
-     list refers to the fact that with flex you can put multiple
-     actions on the same line, separated with semi-colons, while
-     with lex, the following
-
-         foo    handle_foo(); ++num_foos_seen;
-
-     is (rather surprisingly) truncated to
-
-         foo    handle_foo();
-
-     flex does not truncate the action. Actions that are not
-     enclosed in braces are simply terminated at the end of the
-     line.
-
-DIAGNOSTICS
-     warning, rule cannot be matched indicates that the given
-     rule cannot be matched because it follows other rules that
-     will always match the same text as it.
-     For example, in the following "foo" cannot be matched
-     because it comes after an identifier "catch-all" rule:
-
-         [a-z]+    got_identifier();
-         foo       got_foo();
-
-     Using REJECT in a scanner suppresses this warning.
-
-     warning, -s option given but default rule can be matched
-     means that it is possible (perhaps only in a particular
-     start condition) that the default rule (match any single
-     character) is the only one that will match a particular
-     input. Since -s was given, presumably this is not intended.
-
-     reject_used_but_not_detected undefined or
-     yymore_used_but_not_detected undefined - These errors can
-     occur at compile time. They indicate that the scanner uses
-     REJECT or yymore() but that flex failed to notice the fact,
-     meaning that flex scanned the first two sections looking for
-     occurrences of these actions and failed to find any, but
-     somehow you snuck some in (via a #include file, for
-     example). Use %option reject or %option yymore to indicate
-     to flex that you really do use these features.
-
-     flex scanner jammed - a scanner compiled with -s has
-     encountered an input string which wasn't matched by any of
-     its rules. This error can also occur due to internal
-     problems.
-
-     token too large, exceeds YYLMAX - your scanner uses %array
-     and one of its rules matched a string longer than the YYLMAX
-     constant (8K bytes by default). You can increase the value
-     by #define'ing YYLMAX in the definitions section of your
-     flex input.
-
-     scanner requires -8 flag to use the character 'x' - Your
-     scanner specification includes recognizing the 8-bit
-     character 'x' and you did not specify the -8 flag, and your
-     scanner defaulted to 7-bit because you used the -Cf or -CF
-     table compression options. See the discussion of the -7
-     flag for details.
-
-     flex scanner push-back overflow - you used unput() to push
-     back so much text that the scanner's buffer could not hold
-     both the pushed-back text and the current token in yytext.
-     Ideally the scanner should dynamically resize the buffer in
-     this case, but at present it does not.
-
-     input buffer overflow, can't enlarge buffer because scanner
-     uses REJECT - the scanner was working on matching an
-     extremely large token and needed to expand the input buffer.
-     This doesn't work with scanners that use REJECT.
-
-     fatal flex scanner internal error--end of buffer missed -
-     This can occur in a scanner which is reentered after a
-     long-jump has jumped out (or over) the scanner's activation
-     frame. Before reentering the scanner, use:
-
-         yyrestart( yyin );
-
-     or, as noted above, switch to using the C++ scanner class.
-
-     too many start conditions in <> - you listed more start
-     conditions in a <> construct than exist (so you must have
-     listed at least one of them twice).
-
-FILES
-     -lfl library with which scanners must be linked.
-
-     lex.yy.c
-          generated scanner (called lexyy.c on some systems).
-
-     lex.yy.cc
-          generated C++ scanner class, when using -+.
-
-     <FlexLexer.h>
-          header file defining the C++ scanner base class,
-          FlexLexer, and its derived class, yyFlexLexer.
-
-     flex.skl
-          skeleton scanner. This file is only used when building
-          flex, not when flex executes.
-
-     lex.backup
-          backing-up information for -b flag (called lex.bck on
-          some systems).
-
-DEFICIENCIES / BUGS
-     Some trailing context patterns cannot be properly matched
-     and generate warning messages ("dangerous trailing
-     context"). These are patterns where the ending of the first
-     part of the rule matches the beginning of the second part,
-     such as "zx*/xy*", where the 'x*' matches the 'x' at the
-     beginning of the trailing context. (Note that the POSIX
-     draft states that the text matched by such patterns is
-     undefined.)
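One way to see why a pattern such as "zx*/xy*" is troublesome is to count how many different places the head could legitimately end. The C sketch below is only an illustration of that ambiguity and says nothing about how flex actually matches such rules. The function name count_head_choices is made up for this example.

```c
#include <stddef.h>

/* Count the ways the start of s can be split into a head matching
 * zx* followed immediately by text that begins with a match for xy*. */
size_t count_head_choices(const char *s)
{
    size_t xs = 0;

    if (s[0] != 'z')
        return 0;
    while (s[1 + xs] == 'x')    /* number of x's directly after the z */
        ++xs;
    /* A head of 'z' plus k x's is valid whenever another 'x' follows
     * it to start the trailing context, i.e. for k = 0 .. xs-1. */
    return xs;
}
```

For the input "zxxy" there are two valid choices (head "z" or head "zx"), so the matched text is not uniquely determined. For "zxy" there is only one.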
-
-     For some trailing context rules, parts which are actually
-     fixed-length are not recognized as such, leading to the
-     abovementioned performance loss.  In particular, parts using
-     '|' or {n} (such as "foo{3}") are always considered
-     variable-length.
-
-     Combining trailing context with the special '|' action can
-     result in fixed trailing context being turned into the more
-     expensive variable trailing context.  For example, in the
-     following:
-
-         %%
-         abc      |
-         xyz/def
-
-     Use of unput() invalidates yytext and yyleng, unless the
-     %array directive or the -l option has been used.
-
-     Pattern-matching of NUL's is substantially slower than
-     matching other characters.
-
-     Dynamic resizing of the input buffer is slow, as it entails
-     rescanning all the text matched so far by the current
-     (generally huge) token.
-
-     Due to both buffering of input and read-ahead, you cannot
-     intermix calls to <stdio.h> routines, such as, for example,
-     getchar(), with flex rules and expect it to work.  Call
-     input() instead.
-
-     The total table entries listed by the -v flag excludes the
-     number of table entries needed to determine what rule has
-     been matched.  The number of entries is equal to the number
-     of DFA states if the scanner does not use REJECT, and
-     somewhat greater than the number of states if it does.
-
-     REJECT cannot be used with the -f or -F options.
-
-     The flex internal algorithms need documentation.
-
-SEE ALSO
-     lex(1), yacc(1), sed(1), awk(1).
-
-     John Levine, Tony Mason, and Doug Brown, Lex & Yacc,
-     O'Reilly and Associates.  Be sure to get the 2nd edition.
-
-     M. E. Lesk and E. Schmidt, LEX - Lexical Analyzer Generator
-
-     Alfred Aho, Ravi Sethi and Jeffrey Ullman, Compilers:
-     Principles, Techniques and Tools, Addison-Wesley (1986).
-     Describes the pattern-matching techniques used by flex
-     (deterministic finite automata).
-
-AUTHOR
-     Vern Paxson, with the help of many ideas and much
-     inspiration from Van Jacobson.  Original version by Jef
-     Poskanzer.  The fast table representation is a partial
-     implementation of a design done by Van Jacobson.  The
-     implementation was done by Kevin Gong and Vern Paxson.
-
-     Thanks to the many flex beta-testers, feedbackers, and
-     contributors, especially Francois Pinard, Casey Leedom,
-     Robert Abramovitz, Stan Adermann, Terry Allen, David
-     Barker-Plummer, John Basrai, Neal Becker, Nelson H.F. Beebe,
-     benson@odi.com, Karl Berry, Peter A. Bigot, Simon Blanchard,
-     Keith Bostic, Frederic Brehm, Ian Brockbank, Kin Cho, Nick
-     Christopher, Brian Clapper, J.T. Conklin, Jason Coughlin,
-     Bill Cox, Nick Cropper, Dave Curtis, Scott David Daniels,
-     Chris G. Demetriou, Theo Deraadt, Mike Donahue, Chuck
-     Doucette, Tom Epperly, Leo Eskin, Chris Faylor, Chris
-     Flatters, Jon Forrest, Jeffrey Friedl, Joe Gayda, Kaveh R.
-     Ghazi, Wolfgang Glunz, Eric Goldman, Christopher M. Gould,
-     Ulrich Grepel, Peer Griebel, Jan Hajic, Charles Hemphill,
-     NORO Hideo, Jarkko Hietaniemi, Scott Hofmann, Jeff Honig,
-     Dana Hudes, Eric Hughes, John Interrante, Ceriel Jacobs,
-     Michal Jaegermann, Sakari Jalovaara, Jeffrey R. Jones, Henry
-     Juengst, Klaus Kaempf, Jonathan I. Kamens, Terrence O Kane,
-     Amir Katz, ken@ken.hilco.com, Kevin B. Kenny, Steve Kirsch,
-     Winfried Koenig, Marq Kole, Ronald Lamprecht, Greg Lee,
-     Rohan Lenard, Craig Leres, John Levine, Steve Liddle, David
-     Loffredo, Mike Long, Mohamed el Lozy, Brian Madsen, Malte,
-     Joe Marshall, Bengt Martensson, Chris Metcalf, Luke Mewburn,
-     Jim Meyering, R. Alexander Milowski, Erik Naggum, G.T.
-     Nicol, Landon Noll, James Nordby, Marc Nozell, Richard
-     Ohnemus, Karsten Pahnke, Sven Panne, Roland Pesch, Walter
-     Pelissero, Gaumond Pierre, Esmond Pitt, Jef Poskanzer, Joe
-     Rahmeh, Jarmo Raiha, Frederic Raimbault, Pat Rankin, Rick
-     Richardson, Kevin Rodgers, Kai Uwe Rommel, Jim Roskind,
-     Alberto Santini, Andreas Scherer, Darrell Schiebel, Raf
-     Schietekat, Doug Schmidt, Philippe Schnoebelen, Andreas
-     Schwab, Larry Schwimmer, Alex Siegel, Eckehard Stolz,
-     Jan-Erik Strömquist, Mike Stump, Paul Stuart, Dave Tallman,
-     Ian Lance Taylor, Chris Thewalt, Richard M. Timoney, Jodi
-     Tsai, Paul Tuinenga, Gary Weik, Frank Whaley, Gerhard
-     Wilhelms, Kent Williams, Ken Yap, Ron Zellar, Nathan Zelle,
-     David Zuhn, and those whose names have slipped my marginal
-     mail-archiving skills but whose contributions are
-     appreciated all the same.
-
-     Thanks to Keith Bostic, Jon Forrest, Noah Friedman, John
-     Gilmore, Craig Leres, John Levine, Bob Mulcahy, G.T. Nicol,
-     Francois Pinard, Rich Salz, and Richard Stallman for help
-     with various distribution headaches.
-
-     Thanks to Esmond Pitt and Earle Horton for 8-bit character
-     support; to Benson Margulies and Fred Burke for C++ support;
-     to Kent Williams and Tom Epperly for C++ class support; to
-     Ove Ewerlid for support of NUL's; and to Eric Hughes for
-     support of multiple buffers.
-
-     This work was primarily done when I was with the Real Time
-     Systems Group at the Lawrence Berkeley Laboratory in
-     Berkeley, CA.  Many thanks to all there for the support I
-     received.
-
-     Send comments to vern@ee.lbl.gov. |