@c This file is part of flex.
@c Copyright (c) 1990, 1997 The Regents of the University of California.
@c All rights reserved.

@c This code is derived from software contributed to Berkeley by
@c Vern Paxson.

@c The United States Government has rights in this work pursuant
@c to contract no. DE-AC03-76SF00098 between the United States
@c Department of Energy and the University of California.

@c Redistribution and use in source and binary forms, with or without
@c modification, are permitted provided that the following conditions
@c are met:

@c 1. Redistributions of source code must retain the above copyright
@c    notice, this list of conditions and the following disclaimer.
@c 2. Redistributions in binary form must reproduce the above copyright
@c    notice, this list of conditions and the following disclaimer in the
@c    documentation and/or other materials provided with the distribution.

@c Neither the name of the University nor the names of its contributors
@c may be used to endorse or promote products derived from this software
@c without specific prior written permission.

@c THIS SOFTWARE IS PROVIDED ``AS IS'' AND WITHOUT ANY EXPRESS OR
@c IMPLIED WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED
@c WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
@c PURPOSE.

@node FAQ
@unnumbered FAQ

@menu
* When was flex born?::
* How do I expand \ escape sequences in C-style quoted strings?::
* Why do flex scanners call fileno if it is not ANSI compatible?::
* Does flex support recursive pattern definitions?::
* How do I skip huge chunks of input (tens of megabytes) while using flex?::
* Flex is not matching my patterns in the same order that I defined them.::
* My actions are executing out of order or sometimes not at all.::
* How can I have multiple input sources feed into the same scanner at the same time?::
* Can I build nested parsers that work with the same input file?::
* How can I match text only at the end of a file?::
* How can I make REJECT cascade across start condition boundaries?::
* Why can't I use fast or full tables with interactive mode?::
* How much faster is -F or -f than -C?::
* If I have a simple grammar can't I just parse it with flex?::
* Why doesn't yyrestart() set the start state back to INITIAL?::
* How can I match C-style comments?::
* The '.' isn't working the way I expected.::
* Can I get the flex manual in another format?::
* Does there exist a "faster" NDFA->DFA algorithm?::
* How does flex compile the DFA so quickly?::
* How can I use more than 8192 rules?::
* How do I abandon a file in the middle of a scan and switch to a new file?::
* How do I execute code only during initialization (only before the first scan)?::
* How do I execute code at termination?::
* Where else can I find help?::
* Can I include comments in the "rules" section of the file?::
* I get an error about undefined yywrap().::
* How can I change the matching pattern at run time?::
* Is there a way to increase the rules (NFA states) to a bigger number?::
* How can I expand macros in the input?::
* How can I build a two-pass scanner?::
* How do I match any string not matched in the preceding rules?::
* I am trying to port code from AT&T lex that uses yysptr and yysbuf.::
* Is there a way to make flex treat NULL like a regular character?::
* Whenever flex can not match the input it says "flex scanner jammed".::
* Why doesn't flex have non-greedy operators like perl does?::
@end menu

@node When was flex born?
@unnumberedsec When was flex born?

When was flex born?

Vern Paxson took over the Software Tools lex project from Jef Poskanzer
in 1982.  At that point it was written in Ratfor.  Around 1987 or so,
Paxson translated it into C, and a legend was born :-).

@node How do I expand \ escape sequences in C-style quoted strings?
@unnumberedsec How do I expand \ escape sequences in C-style quoted strings?

How do I expand \ escape sequences in C-style quoted strings?

A key point when scanning quoted strings is that you cannot (easily)
write a single rule that will precisely match the string if you allow
things like embedded escape sequences and newlines.  If you try to
match strings with a single rule then you'll wind up having to rescan
the string anyway to find any escape sequences.

Instead you use exclusive start conditions and a set of rules, one for
matching non-escaped text, one for matching a single escape, one for
matching an embedded newline, and one for recognizing the end of the
string.  Each of these rules is then faced with the question of where
to put its intermediary results.  The best solution is for the rules to
append their local value of yytext to the end of a "string literal"
buffer.  A rule like the escape-matcher will append to the buffer the
meaning of the escape sequence rather than the literal text in yytext.
In this way, yytext does not need to be modified at all.
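A minimal sketch of this approach, in the spirit of the string-scanning
example in the manual's section on start conditions (the buffer names
and size below are arbitrary, and overflow and error handling are
omitted):

@example
@verbatim
%{
#include <string.h>

#define MAX_STR_CONST 1024
static char string_buf[MAX_STR_CONST];  /* holds the expanded string */
static char *string_buf_ptr;            /* next free spot in the buffer */
%}

%x STR

%%
\"              string_buf_ptr = string_buf; BEGIN(STR);

<STR>\"         {  /* closing quote: terminate and leave the start condition */
                *string_buf_ptr = '\0';
                BEGIN(INITIAL);
                /* string_buf now holds the expanded string */
                }

<STR>\\n        *string_buf_ptr++ = '\n';      /* append the escape's meaning... */
<STR>\\t        *string_buf_ptr++ = '\t';
<STR>\\(.|\n)   *string_buf_ptr++ = yytext[1]; /* ...not the literal text */

<STR>[^\\\n"]+  {  /* ordinary text: copy yytext verbatim */
                memcpy(string_buf_ptr, yytext, yyleng);
                string_buf_ptr += yyleng;
                }
@end verbatim
@end example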
@node Why do flex scanners call fileno if it is not ANSI compatible?
@unnumberedsec Why do flex scanners call fileno if it is not ANSI compatible?

Why do flex scanners call fileno if it is not ANSI compatible?

Flex scanners call fileno() in order to get the file descriptor
corresponding to yyin.  The file descriptor may be passed to isatty()
or read(), depending upon which %options you specified.  If your system
does not have fileno() support, you can avoid both calls: to get rid of
the read() call, do not specify %option read; to get rid of the
isatty() call, specify one of %option always-interactive or %option
never-interactive.

@node Does flex support recursive pattern definitions?
@unnumberedsec Does flex support recursive pattern definitions?

Does flex support recursive pattern definitions?

e.g.,

@example
@verbatim
%%
block   "{"({block}|{statement})*"}"
@end verbatim
@end example

No.  You cannot have recursive definitions.  The pattern-matching power
of regular expressions in general (and therefore flex scanners, too) is
limited.  In particular, regular expressions cannot "balance"
parentheses to an arbitrary degree.  For example, it's impossible to
write a regular expression that matches all strings containing the same
number of '@{'s as '@}'s.  For more powerful pattern matching, you need
a parser, such as GNU bison.

@node How do I skip huge chunks of input (tens of megabytes) while using flex?
@unnumberedsec How do I skip huge chunks of input (tens of megabytes) while using flex?

How do I skip huge chunks of input (tens of megabytes) while using flex?

Use fseek (or lseek) to position yyin, then call yyrestart().
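For example, a minimal sketch (the function name is arbitrary; it
assumes you know the absolute byte offset you want to jump to, and it
belongs in the user-code section of the scanner):

@example
@verbatim
/* Jump to byte `offset' in yyin and resume scanning from there.
 * yyrestart() discards anything flex had already buffered. */
void seek_and_rescan( long offset )
{
    fseek( yyin, offset, SEEK_SET );
    yyrestart( yyin );
}
@end verbatim
@end example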
@node Flex is not matching my patterns in the same order that I defined them.
@unnumberedsec Flex is not matching my patterns in the same order that I defined them.

Flex is not matching my patterns in the same order that I defined them.

This is indeed the natural way to expect it to work; however, flex
picks the rule that matches the most text (i.e., the longest possible
input string).  This is because flex uses an entirely different
matching technique ("deterministic finite automata") that actually does
all of the matching simultaneously, in parallel.  (Seems impossible,
but it's actually a fairly simple technique once you understand the
principles.)

A side-effect of this parallel matching is that when the input matches
more than one rule, flex scanners pick the rule that matched the *most*
text.  This is explained further in the manual, in the section "How the
Input is Matched".

If you want flex to choose a shorter match, then you can work around
this behavior by expanding your short rule to match more text, then
putting back the extra:

@example
@verbatim
data_.*        yyless( 5 ); BEGIN BLOCKIDSTATE;
@end verbatim
@end example

Another fix would be to make the second rule active only during the
<BLOCKIDSTATE> start condition, and make that start condition exclusive
by declaring it with %x instead of %s.

A final fix is to change the input language so that the ambiguity for
data_ is removed, by adding characters to it that don't match the
identifier rule, or by removing characters (such as '_') from the
identifier rule so it no longer matches "data_".  (Of course, you might
also not have the option of changing the input language...)

@node My actions are executing out of order or sometimes not at all.
@unnumberedsec My actions are executing out of order or sometimes not at all.

My actions are executing out of order or sometimes not at all.  What's
happening?

Most likely, you have (in error) placed the opening @samp{@{} of the
action block on a different line than the rule, e.g.,

@example
@verbatim
^(foo|bar)
{           <<<--- WRONG!
}
@end verbatim
@end example

flex requires that the opening @samp{@{} of an action associated with a
rule begin on the same line as does the rule.  You need instead to
write your rules as follows:

@example
@verbatim
^(foo|bar)  {   // CORRECT!
            }
@end verbatim
@end example

@node How can I have multiple input sources feed into the same scanner at the same time?
@unnumberedsec How can I have multiple input sources feed into the same scanner at the same time?

How can I have multiple input sources feed into the same scanner at the
same time?

If...

@itemize
@item
your scanner is free of backtracking (verified using flex's -b flag),

@item
AND you run it interactively (-I option; default unless using special
table compression options),

@item
AND you feed it one character at a time by redefining YY_INPUT to do
so,
@end itemize

then every time it matches a token, it will have exhausted its input
buffer (because the scanner is free of backtracking).  This means you
can safely use select() at that point and only call yylex() for another
token if select() indicates there's data available.  That is, move the
select() out from the input function to a point where it determines
whether yylex() gets called for the next token.

With this approach, you will still have problems if your input can
arrive piecemeal; select() could inform you that the beginning of a
token is available, you call yylex() to get it, but it winds up
blocking waiting for the later characters in the token.

Here's another way: Move your input multiplexing inside of YY_INPUT.
That is, whenever YY_INPUT is called, it select()'s to see where input
is available.  If input is available for the scanner, it reads and
returns the next byte.  If input is available from another source, it
calls whatever function is responsible for reading from that source.
(If no input is available, it blocks until some is.)  I've used this
technique in an interpreter I wrote that reads both keyboard input
(using a flex scanner) and IPC traffic from sockets, and it works fine.
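Here is a rough sketch of that second approach, assuming a POSIX
system; the file-descriptor variables and the handle_other_source()
callback are hypothetical placeholders for whatever your application
actually uses:

@example
@verbatim
%{
#include <sys/select.h>
#include <unistd.h>

extern int scanner_fd;                  /* fd feeding the scanner, e.g. fileno(yyin) */
extern int other_fd;                    /* the other input source */
extern void handle_other_source(void);  /* hypothetical: service other_fd */

static int multiplexed_input( char *buf );

#define YY_INPUT(buf,result,max_size)  ( (result) = multiplexed_input( (buf) ) )
%}

%%

    /* ... your rules ... */

%%

static int multiplexed_input( char *buf )
{
    for ( ;; )
        {
        fd_set readfds;
        int maxfd = scanner_fd > other_fd ? scanner_fd : other_fd;

        FD_ZERO( &readfds );
        FD_SET( scanner_fd, &readfds );
        FD_SET( other_fd, &readfds );

        /* Block until at least one source has data. */
        if ( select( maxfd + 1, &readfds, NULL, NULL, NULL ) < 0 )
            return 0;                        /* treat errors as EOF in this sketch */

        if ( FD_ISSET( other_fd, &readfds ) )
            handle_other_source();           /* not ours: let its owner read it */

        if ( FD_ISSET( scanner_fd, &readfds ) )
            {
            int n = read( scanner_fd, buf, 1 );  /* one byte at a time */
            return n > 0 ? n : 0;                /* 0 tells flex "end of input" */
            }
        }
}
@end verbatim
@end example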
@node Can I build nested parsers that work with the same input file?
@unnumberedsec Can I build nested parsers that work with the same input file?

Can I build nested parsers that work with the same input file?

This is not going to work without some additional effort.  The reason
is that flex block-buffers the input it reads from yyin.  This means
that the "outermost" yylex(), when called, will automatically slurp up
the first 8K of input available on yyin, and subsequent calls to other
yylex()'s won't see that input.

You might be tempted to work around this problem by redefining YY_INPUT
to only return a small amount of text, but it turns out that that
approach is quite difficult.  Instead, the best solution is to combine
all of your scanners into one large scanner, using a different
exclusive start condition for each.

@node How can I match text only at the end of a file?
@unnumberedsec How can I match text only at the end of a file?

How can I match text only at the end of a file?

There is no way to write a rule which is "match this text, but only if
it comes at the end of the file".  You can fake it, though, if you
happen to have a character lying around that you don't allow in your
input.  Then you redefine YY_INPUT to call your own routine which, if
it sees an EOF, returns the magic character first (and remembers to
return a real EOF next time it's called).  Then you could write:

@example
@verbatim
(.|\n)*{EOF_CHAR}    /* saw comment at EOF */
@end verbatim
@end example
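A minimal sketch of such a YY_INPUT wrapper, assuming the byte 0x01
never occurs in legitimate input (the macro and function names here are
only illustrative):

@example
@verbatim
%{
#include <stdio.h>

extern FILE *yyin;

#define EOF_CHAR 0x01           /* assumed never to appear in real input */

static int sent_eof_char = 0;   /* already delivered the magic byte? */

static int eof_marking_input( char *buf )
{
    int c = getc( yyin );

    if ( c != EOF )
        {
        buf[0] = c;
        return 1;
        }

    if ( ! sent_eof_char )
        {                       /* first EOF: hand back the magic character */
        sent_eof_char = 1;
        buf[0] = EOF_CHAR;
        return 1;
        }

    return 0;                   /* second EOF: report the real end of input */
}

#define YY_INPUT(buf,result,max_size)  ( (result) = eof_marking_input( (buf) ) )
%}
@end verbatim
@end example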
@node How can I make REJECT cascade across start condition boundaries?
@unnumberedsec How can I make REJECT cascade across start condition boundaries?

How can I make REJECT cascade across start condition boundaries?

You can do this as follows.  Suppose you have a start condition <A>,
and after exhausting all of the possible matches in <A>, you want to
try matches in <INITIAL>.  Then you could use the following:

@example
@verbatim
%x A
%%

<A>rule_that_is_long   ...; REJECT;
<A>rule                ...; REJECT;  /* shorter rule */
<A>etc.
...
<A>.|\n    {
           /* Shortest and last rule in <A>, so
            * cascaded REJECT's will eventually
            * wind up matching this rule.  We want
            * to now switch to the initial state
            * and try matching from there instead.
            */
           yyless(0);    /* put back matched text */
           BEGIN(INITIAL);
           }
@end verbatim
@end example

@node Why can't I use fast or full tables with interactive mode?
@unnumberedsec Why can't I use fast or full tables with interactive mode?

Why can't I use fast or full tables with interactive mode?

One of the assumptions flex makes is that interactive applications are
inherently slow (they're waiting on a human, after all).

It has to do with how the scanner detects that it must be finished
scanning a token.  For interactive scanners, after scanning each
character the current state is looked up in a table (essentially) to
see whether there's a chance of another input character possibly
extending the length of the match.  If not, the scanner halts.  For
non-interactive scanners, the end-of-token test is much simpler,
basically a compare with 0, so no memory bus cycles.  Since the test
occurs in the innermost scanning loop, one would like to make it go as
fast as possible.

Still, it seems reasonable to allow the user to choose to trade off a
bit of performance in this area to gain the corresponding flexibility.
There might be another reason, though, why fast scanners don't support
the interactive option.

@node How much faster is -F or -f than -C?
@unnumberedsec How much faster is -F or -f than -C?

How much faster is -F or -f than -C?

Much faster (factor of 2-3).

@node If I have a simple grammar can't I just parse it with flex?
@unnumberedsec If I have a simple grammar can't I just parse it with flex?

If I have a simple grammar, can't I just parse it with flex?

Is your grammar recursive?  That's almost always a sign that you're
better off using a parser/scanner rather than just trying to use a
scanner alone.

@node Why doesn't yyrestart() set the start state back to INITIAL?
@unnumberedsec Why doesn't yyrestart() set the start state back to INITIAL?

Why doesn't yyrestart() set the start state back to INITIAL?

There are two reasons.  The first is that there might be programs that
rely on the start state not changing across file changes.  The second
is that with flex 2.4, use of yyrestart() is no longer required, so
fixing the problem there doesn't solve the more general problem.

@node How can I match C-style comments?
@unnumberedsec How can I match C-style comments?

How can I match C-style comments?

You might be tempted to try something like this:

@example
@verbatim
"/*".*"*/"       // WRONG!
@end verbatim
@end example

or, worse, this:

@example
@verbatim
"/*"(.|\n)*"*/"  // WRONG!
@end verbatim
@end example

The above rules will eat too much input, and blow up on things like:

@example
@verbatim
/* a comment */ do_my_thing( "oops */" );
@end verbatim
@end example

Here is one way which allows you to track line information:

@example
@verbatim
<INITIAL>{
"/*"      BEGIN(IN_COMMENT);
}
<IN_COMMENT>{
"*/"      BEGIN(INITIAL);
[^*\n]+   // eat comment in chunks
"*"       // eat the lone star
\n        yylineno++;
}
@end verbatim
@end example

@node The '.' isn't working the way I expected.
@unnumberedsec The '.' isn't working the way I expected.

The '.' (dot) isn't working the way I expected.

Here are some tips for using @samp{.}:

@itemize
@item
A common mistake is to place the grouping parenthesis AFTER an
operator, when you really meant to place the parenthesis BEFORE the
operator, e.g., you probably want this @code{(foo|bar)+} and NOT this
@code{(foo|bar+)}.

The first pattern matches the words @code{foo} or @code{bar} any number
of times, e.g., it matches the text @code{barfoofoobarfoo}.  The second
pattern matches a single instance of @code{foo} or a single instance of
@code{ba} followed by one or more @samp{r}s, e.g., it matches the text
@code{barrrr}.

@item
A @samp{.} inside []'s just means a literal @samp{.} (period), and NOT
"any character except newline".

@item
Remember that @samp{.} matches any character EXCEPT @samp{\n} (and
EOF).  If you really want to match ANY character, including newlines,
then use @code{(.|\n)} --- Beware that the regex @code{(.|\n)+} will
match your entire input!

@item
Finally, if you want to match a literal @samp{.} (a period), then use
@code{[.]} or @code{"."}.
@end itemize

@node Can I get the flex manual in another format?
@unnumberedsec Can I get the flex manual in another format?

Can I get the flex manual in another format?

As of flex 2.5, the manual is distributed in texinfo format.  You can
use the "texi2*" tools to convert the manual to any format you desire
(e.g., @samp{texi2html}).

@node Does there exist a "faster" NDFA->DFA algorithm?
@unnumberedsec Does there exist a "faster" NDFA->DFA algorithm?

Does there exist a "faster" NDFA->DFA algorithm?

Most standard texts (e.g., Aho) imply that NDFA->DFA conversion can
take exponential time, since there are an exponential number of
potential states in the NDFA.  There's no way around the potential
exponential running time - it can take you exponential time just to
enumerate all of the DFA states.  In practice, though, the running time
is closer to linear, or sometimes quadratic.
@node How does flex compile the DFA so quickly?
@unnumberedsec How does flex compile the DFA so quickly?

How does flex compile the DFA so quickly?

There are two big speed wins that flex uses:

@enumerate
@item
It analyzes the input rules to construct equivalence classes for those
characters that always make the same transitions.  It then rewrites the
NFA using equivalence classes for transitions instead of characters.
This cuts down the NFA->DFA computation time dramatically, to the point
where, for uncompressed DFA tables, the DFA generation is often I/O
bound in writing out the tables.

@item
It maintains hash values for previously computed DFA states, so testing
whether a newly constructed DFA state is equivalent to a previously
constructed state can be done very quickly, by first comparing hash
values.
@end enumerate

@node How can I use more than 8192 rules?
@unnumberedsec How can I use more than 8192 rules?

How can I use more than 8192 rules?

Flex is compiled with an upper limit of 8192 rules per scanner.  If you
need more than 8192 rules in your scanner, you'll have to recompile
flex with the following changes in flexdef.h:

@example
@verbatim
< #define YY_TRAILING_MASK 0x2000
< #define YY_TRAILING_HEAD_MASK 0x4000
--
> #define YY_TRAILING_MASK 0x20000000
> #define YY_TRAILING_HEAD_MASK 0x40000000
@end verbatim
@end example

This should work okay as long as your C compiler uses 32 bit integers.
But you might want to think about whether using such a huge number of
rules is the best way to solve your problem.

@node How do I abandon a file in the middle of a scan and switch to a new file?
@unnumberedsec How do I abandon a file in the middle of a scan and switch to a new file?

How do I abandon a file in the middle of a scan and switch to a new
file?

Just call yyrestart(newfile).  Be sure to reset the start state if you
want a "fresh" start, since yyrestart does NOT reset the start state
back to INITIAL.
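For example (a minimal sketch; the function name is arbitrary, and it
must live in the third section of the .l file so that BEGIN is
visible):

@example
@verbatim
/* Abandon the current input and continue scanning from `newfile'. */
void switch_input_file( FILE *newfile )
{
    yyrestart( newfile );   /* flush the buffer and read from the new file */
    BEGIN(INITIAL);         /* yyrestart() does not reset the start state */
}
@end verbatim
@end example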
@node How do I execute code only during initialization (only before the first scan)?
@unnumberedsec How do I execute code only during initialization (only before the first scan)?

How do I execute code only during initialization (only before the first
scan)?

You can specify an initial action by defining the macro YY_USER_INIT
(though note that yyout may not be available at the time this macro is
executed).  Or you can add to the beginning of your rules section:

@example
@verbatim
%%
    /* Must be indented! */
    static int did_init = 0;

    if ( ! did_init )
        {
        do_my_init();
        did_init = 1;
        }
@end verbatim
@end example

@node How do I execute code at termination?
@unnumberedsec How do I execute code at termination?

How do I execute code at termination (i.e., only after the last scan)?

You can specify an action for the <<EOF>> rule.

@node Where else can I find help?
@unnumberedsec Where else can I find help?

Where else can I find help?

The @code{help-flex} email list is served by GNU.  See
http://www.gnu.org/ for details on how to subscribe or search the
archives.

@node Can I include comments in the "rules" section of the file?
@unnumberedsec Can I include comments in the "rules" section of the file?

Can I include comments in the "rules" section of the file?

Yes, just about anywhere you want to.  See the manual for the specific
syntax.

@node I get an error about undefined yywrap().
@unnumberedsec I get an error about undefined yywrap().

I get an error about undefined yywrap().

You must supply a yywrap() function of your own, or link to libfl.a
(which provides one), or use %option noyywrap in your source to say you
don't want a yywrap() function.  See the manual page for more details
concerning yywrap().
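If you choose to write one yourself, a minimal definition that simply
says "no more input files" looks like this (it can go in the user-code
section of the scanner or in a separate source file):

@example
@verbatim
int yywrap( void )
{
    return 1;   /* non-zero: no further files, stop at end of input */
}
@end verbatim
@end example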
@node How can I change the matching pattern at run time?
@unnumberedsec How can I change the matching pattern at run time?

How can I change the matching pattern at run time?

You can't; the patterns are compiled into static tables when flex
builds the scanner.

@node Is there a way to increase the rules (NFA states) to a bigger number?
@unnumberedsec Is there a way to increase the rules (NFA states) to a bigger number?

Is there a way to increase the rules (NFA states) to a bigger number?

With luck, you should be able to increase the definitions in flexdef.h
for:

@example
@verbatim
#define JAMSTATE -32766    /* marks a reference to the state that always jams */
#define MAXIMUM_MNS 31999
#define BAD_SUBSCRIPT -32767
@end verbatim
@end example

recompile everything, and it'll all work.  Flex only has these
16-bit-like values built into it because a long time ago it was
developed on a machine with 16-bit ints.  I've given this advice to
others in the past but haven't heard back from them whether it worked
okay or not...

@node How can I expand macros in the input?
@unnumberedsec How can I expand macros in the input?

How can I expand macros in the input?

The best way to approach this problem is at a higher level, e.g., in
the parser.  However, you can do this using multiple input buffers.

@example
@verbatim
%%

macro/[a-z]+   {
               /* Saw the macro "macro" followed by extra stuff. */
               main_buffer = YY_CURRENT_BUFFER;
               expansion_buffer = yy_scan_string(expand(yytext));
               yy_switch_to_buffer(expansion_buffer);
               }

<<EOF>>        {
               if ( expansion_buffer )
                    {
                    // We were doing an expansion, return to where
                    // we were.
                    yy_switch_to_buffer(main_buffer);
                    yy_delete_buffer(expansion_buffer);
                    expansion_buffer = 0;
                    }
               else
                    yyterminate();
               }
@end verbatim
@end example

You probably will want a stack of expansion buffers to allow nested
macros.  From the above, though, hopefully the idea is clear.

@node How can I build a two-pass scanner?
@unnumberedsec How can I build a two-pass scanner?

How can I build a two-pass scanner?

One way to do it is to filter the first pass to a temporary file, then
process the temporary file on the second pass.  You will probably see a
performance hit, due to all the disk I/O.

When you need to look far ahead like this, it almost always means that
the right solution is to build a parse tree of the entire input, then
walk it after the parse in order to generate the output.  In a sense,
this is a two-pass approach, once through the text and once through the
parse tree, but the performance hit for the latter is usually an order
of magnitude smaller, since everything is already classified, in binary
format, and residing in memory.

@node How do I match any string not matched in the preceding rules?
@unnumberedsec How do I match any string not matched in the preceding rules?

How do I match any string not matched in the preceding rules?

One way to assign precedence is to place the more specific rules first.
If two rules would match the same input (same sequence of characters)
then the first rule listed in the flex input wins, e.g.,

@example
@verbatim
%%
foo[a-zA-Z_]+    return FOO_ID;
bar[a-zA-Z_]+    return BAR_ID;
[a-zA-Z_]+       return GENERIC_ID;
@end verbatim
@end example

Note that the rule @code{[a-zA-Z_]+} must come *after* the others.  It
will match the same amount of text as the more specific rules, and in
that case the flex scanner will pick the first rule listed in your
scanner as the one to match.

@node I am trying to port code from AT&T lex that uses yysptr and yysbuf.
@unnumberedsec I am trying to port code from AT&T lex that uses yysptr and yysbuf.

I am trying to port code from AT&T lex that uses yysptr and yysbuf.

Those are internal variables pointing into the AT&T scanner's input
buffer.  I imagine they're being manipulated in user versions of the
input() and unput() functions.  If so, what you need to do is analyze
those functions to figure out what they're doing, and then replace
input() with an appropriate definition of YY_INPUT (see the flex man
page).  You shouldn't need to (and must not) replace flex's unput()
function.

@node Is there a way to make flex treat NULL like a regular character?
@unnumberedsec Is there a way to make flex treat NULL like a regular character?

Is there a way to make flex treat NULL like a regular character?

Yes, \0 and \x00 should both do the trick.  Perhaps you have an ancient
version of flex.  The latest release is version @value{VERSION}.

@node Whenever flex can not match the input it says "flex scanner jammed".
@unnumberedsec Whenever flex can not match the input it says "flex scanner jammed".

Whenever flex can not match the input it says "flex scanner jammed".

You need to add a rule that matches the otherwise-unmatched text, e.g.,

@example
@verbatim
%option yylineno
%%
[[a bunch of rules here]]

.       printf("bad input character '%s' at line %d\n", yytext, yylineno);
@end verbatim
@end example

See %option default for more information.

@node Why doesn't flex have non-greedy operators like perl does?
@unnumberedsec Why doesn't flex have non-greedy operators like perl does?

Why doesn't flex have non-greedy operators like perl does?

A DFA can do a non-greedy match by stopping the first time it enters an
accepting state, instead of consuming input until it determines that no
further matching is possible (a "jam" state).  This is actually easier
to implement than longest leftmost match (which flex does).

But it's also much less useful than longest leftmost match.  In
general, when you find yourself wishing for non-greedy matching, that's
usually a sign that you're trying to make the scanner do some parsing.
That's generally the wrong approach, since it lacks the power to do a
decent job.  Better is to either introduce a separate parser, or to
split the scanner into multiple scanners using (exclusive) start
conditions.

You might have a separate start state once you've seen the BEGIN.  In
that state, you might then have a regex that will match END (to kick
you out of the state), and perhaps (.|\n) to get a single character
within the chunk...

This approach also has much better error-reporting properties.
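As a rough sketch of the start-condition approach described above (the
BEGIN/END keywords and the chunk handling are placeholders for whatever
your input language actually uses):

@example
@verbatim
%x CHUNK
%%
"BEGIN"          BEGIN(CHUNK);     /* enter the chunk-scanning state */

<CHUNK>"END"     BEGIN(INITIAL);   /* the closing delimiter kicks us back out */
<CHUNK>.|\n      {
                 /* One character of chunk contents at a time;
                  * collect or process it however you like.
                  */
                 }
@end verbatim
@end example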