author | Will Estes <westes575@gmail.com> | 2014-02-13 12:59:14 -0500 |
---|---|---|
committer | Will Estes <westes575@gmail.com> | 2014-02-13 12:59:14 -0500 |
commit | fefcb5f6b51b71464f6f8d1a6b187c09178a15d1 (patch) | |
tree | 5c1dc65341ca4ccc5da5051da2c733d52f9dd503 /doc | |
parent | 54f041e9b304c21f0dfb4f8bb017ec75b2c1af46 (diff) |
remove unmaintained xml documentation
Diffstat (limited to 'doc')
-rw-r--r-- | doc/flex.xml | 9821 |
1 files changed, 0 insertions, 9821 deletions
diff --git a/doc/flex.xml b/doc/flex.xml deleted file mode 100644 index 7f1d4c4..0000000 --- a/doc/flex.xml +++ /dev/null @@ -1,9821 +0,0 @@ -<?xml version="1.0"?> -<!DOCTYPE article PUBLIC "-//OASIS//DTD DocBook XML V4.2//EN" -"http://www.oasis-open.org/docbook/xml/4.2/docbookx.dtd"> -<book> -<bookinfo> -<title>flex: a fast lexical analyzer generator</title> - -<copyright> -<year>1990</year> -<year>1997</year> -<holder>The Regents of the University of California.</holder> -All rights reserved. -</copyright> - -<!-- -@title Flex, version @value{VERSION} -@subtitle Edition <edition>@value{EDITION}</edition>, @value{UPDATED} ---> - -<author><firstname>Vern</firstname><surname>Paxson</surname></author> -<author><firstname>W. L.</firstname><surname>Estes</surname></author> -<author><firstname>John</firstname><surname>Millaway</surname></author> - -<legalnotice> -<para> -This code is derived from software contributed to Berkeley by -Vern Paxson. -</para> -<para> -The United States Government has rights in this work pursuant -to contract no. DE-AC03-76SF00098 between the United States -Department of Energy and the University of California. -</para> -<para> -Redistribution and use in source and binary forms, with or without -modification, are permitted provided that the following conditions -are met: -</para> - -<orderedlist> - -<listitem> -Redistributions of source code must retain the above copyright -notice, this list of conditions and the following disclaimer. -</listitem> - -<listitem> -Redistributions in binary form must reproduce the above copyright -notice, this list of conditions and the following disclaimer in the -documentation and/or other materials provided with the distribution. -</listitem> -</orderedlist> - -<para> -Neither the name of the University nor the names of its contributors -may be used to endorse or promote products derived from this software -without specific prior written permission. 
-</para> -<para> -THIS SOFTWARE IS PROVIDED ``AS IS'' AND WITHOUT ANY EXPRESS OR -IMPLIED WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED -WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR -PURPOSE. -</para> -</legalnotice> - -</bookinfo> - -<!-- @c "Macro Hooks" index --> -@defindex hk -<!-- @c "Options" index --> -@defindex op -@dircategory Programming -@direntry - -<preface> -<para> -This manual describes <application>flex</application>, a tool for generating programs that -perform pattern-matching on text. The manual includes both tutorial and -reference sections. -</para> -<para> -This edition of @cite{The flex Manual} documents <application>flex</application> version -@value{VERSION}. It was last updated on @value{UPDATED}. -</para> - -</preface> - -<chapter> -<title>Reporting Bugs</title> - -<!-- @cindex bugs, reporting --> -<!-- @cindex reporting bugs --> -<para> -If you have problems with <application>flex</application> or think you have found a bug, -please send mail detailing your problem to -@email{flex-help@@lists.sourceforge.net}. Patches are always welcome. -</para> - -</chapter> - -<chapter> -<title>Introduction</title> -<para> -<!-- @cindex scanner, definition of --> -<application>flex</application> is a tool for generating @dfn{scanners}. A scanner is a -program which recognizes lexical patterns in text. The <application>flex</application> -program reads the given input files, or its standard input if no file -names are given, for a description of a scanner to generate. The -description is in the form of pairs of regular expressions and C code, -called @dfn{rules}. <application>flex</application> generates as output a C source file, -<filename>lex.yy.c</filename> by default, which defines a routine <function>yylex</function>. -This file can be compiled and linked with the flex runtime library to -produce an executable. When the executable is run, it analyzes its -input for occurrences of the regular expressions. 
Whenever it finds -one, it executes the corresponding C code. -</para> - -</chapter> - -<chapter> -<title>Some Simple Examples</title> - -<para>First, some simple examples to get the flavor of how one uses -<application>flex</application>. -</para> -<para> -<!-- @cindex username expansion --> -The following <application>flex</application> input specifies a scanner which, when it -encounters the string @samp{username}, will replace it with the user's -login name: -</para> - -<informalexample> -<programlisting> -<![CDATA[ - %% - username printf( "%s", getlogin() ); -]]> -</programlisting> -</informalexample> - -<para> -<!-- @cindex default rule --> -<!-- @cindex rules, default --> -By default, any text not matched by a <application>flex</application> scanner is copied to -the output, so the net effect of this scanner is to copy its input file -to its output with each occurrence of @samp{username} expanded. In this -input, there is just one rule. @samp{username} is the @dfn{pattern} and -the @samp{printf} is the @dfn{action}. The @samp{%%} symbol marks the -beginning of the rules. -</para> - -<para> -Here's another simple example: -</para> - -<!-- @cindex counting characters and lines --> -<informalexample> -<programlisting> -<![CDATA[ - int num_lines = 0, num_chars = 0; - - %% - \n ++num_lines; ++num_chars; - . ++num_chars; - - %% - main() - { - yylex(); - printf( "# of lines = %d, # of chars = %d\n", - num_lines, num_chars ); - } -]]> -</programlisting> -</informalexample> - -<para> -This scanner counts the number of characters and the number of lines in -its input. It produces no output other than the final report on the -character and line counts. The first line declares two globals, -@code{num_lines} and @code{num_chars}, which are accessible both inside -<function>yylex</function> and in the <function>main</function> routine declared after the -second @samp{%%}. 
There are two rules, one which matches a newline -(@samp{\n}) and increments both the line count and the character count, -and one which matches any character other than a newline (indicated by -the @samp{.} regular expression). -</para> - -<para> -A somewhat more complicated example: -</para> - -<!-- @cindex Pascal-like language --> -<informalexample> -<programlisting> -<![CDATA[ - /* scanner for a toy Pascal-like language */ - - %{ - /* need this for the call to atof() below */ - #include <math.h> - %} - - DIGIT [0-9] - ID [a-z][a-z0-9]* - - %% - - {DIGIT}+ { - printf( "An integer: %s (%d)\n", yytext, - atoi( yytext ) ); - } - - {DIGIT}+"."{DIGIT}* { - printf( "A float: %s (%g)\n", yytext, - atof( yytext ) ); - } - - if|then|begin|end|procedure|function { - printf( "A keyword: %s\n", yytext ); - } - - {ID} printf( "An identifier: %s\n", yytext ); - - "+"|"-"|"*"|"/" printf( "An operator: %s\n", yytext ); - - "{"[\^{}}\n]*"}" /* eat up one-line comments */ - - [ \t\n]+ /* eat up whitespace */ - - . printf( "Unrecognized character: %s\n", yytext ); - - %% - - main( argc, argv ) - int argc; - char **argv; - { - ++argv, --argc; /* skip over program name */ - if ( argc > 0 ) - yyin = fopen( argv[0], "r" ); - else - yyin = stdin; - - yylex(); - } -]]> -</programlisting> -</informalexample> - -<para> -This is the beginnings of a simple scanner for a language like Pascal. -It identifies different types of @dfn{tokens} and reports on what it has -seen. -</para> - -<para> -The details of this example will be explained in the following -sections. -</para> - -</chapter> - -<chapter> -<title>Format of the Input File</title> - - -<!-- @cindex format of flex input --> -<!-- @cindex input, format of --> -<!-- @cindex file format --> -<!-- @cindex sections of flex input --> - -<para> -The <application>flex</application> input file consists of three sections, separated by a -line containing only @samp{%%}. 
-</para> - -<!-- @cindex format of input file --> -<informalexample> -<programlisting> -<![CDATA[ - definitions - %% - rules - %% - user code -]]> -</programlisting> -</informalexample> - -<!-- -@menu -* Definitions Section:: -* Rules Section:: -* User Code Section:: -* Comments in the Input:: -@end menu ---> - - -<section> -<title>Format of the Definitions Section</title> - -<para> -<!-- @cindex input file, Definitions section --> -<!-- @cindex Definitions, in flex input --> -The @dfn{definitions section} contains declarations of simple @dfn{name} -definitions to simplify the scanner specification, and declarations of -@dfn{start conditions}, which are explained in a later section. -</para> - -<para> -<!-- @cindex aliases, how to define --> -<!-- @cindex pattern aliases, how to define --> -Name definitions have the form: -</para> - -<informalexample> -<programlisting> -<![CDATA[ - name definition -]]> -</programlisting> -</informalexample> - -<para> -The @samp{name} is a word beginning with a letter or an underscore -(@samp{_}) followed by zero or more letters, digits, @samp{_}, or -@samp{-} (dash). The definition is taken to begin at the first -non-whitespace character following the name and continuing to the end of -the line. The definition can subsequently be referred to using -@samp{{name}}, which will expand to @samp{(definition)}. For example, -</para> - -<!-- @cindex pattern aliases, defining --> -<!-- @cindex defining pattern aliases --> -<informalexample> -<programlisting> -<![CDATA[ - DIGIT [0-9] - ID [a-z][a-z0-9]* -]]> -</programlisting> -</informalexample> - -<para> -Defines @samp{DIGIT} to be a regular expression which matches a single -digit, and @samp{ID} to be a regular expression which matches a letter -followed by zero-or-more letters-or-digits. 
A subsequent reference to -</para> - -<!-- @cindex pattern aliases, use of --> -<informalexample> -<programlisting> -<![CDATA[ - {DIGIT}+"."{DIGIT}* -]]> -</programlisting> -</informalexample> - -<para> -is identical to -</para> - -<informalexample> -<programlisting> -<![CDATA[ - ([0-9])+"."([0-9])* -]]> -</programlisting> -</informalexample> - -<para> -and matches one-or-more digits followed by a @samp{.} followed by -zero-or-more digits. -</para> - -<para> -<!-- @cindex comments in flex input --> -An unindented comment (i.e., a line -beginning with @samp{/*}) is copied verbatim to the output up -to the next @samp{*/}. -</para> - -<para> -<!-- @cindex %{ and %}, in Definitions Section --> -<!-- @cindex embedding C code in flex input --> -<!-- @cindex C code in flex input --> -Any <emphasis>indented</emphasis> text or text enclosed in @samp{%{} and @samp{%}} -is also copied verbatim to the output (with the %{ and %} symbols -removed). The %{ and %} symbols must appear unindented on lines by -themselves. -</para> - -<!-- @cindex %top --> - -<para> -A @code{%top} block is similar to a @samp{%{} ... @samp{%}} block, except -that the code in a @code{%top} block is relocated to the <emphasis>top</emphasis> of the -generated file, before any flex definitions @footnote{Actually, -@code{yyIN_HEADER} is defined before the @samp{%top} block.}. -The @code{%top} block is useful when you want certain preprocessor macros to be -defined or certain files to be included before the generated code. -The single characters, @samp{{} and @samp{}} are used to delimit the -@code{%top} block, as shown in the example below: -</para> - -<informalexample> -<programlisting> -<![CDATA[ - %top{ - /* This code goes at the "top" of the generated file. */ - #include <stdint.h> - #include <inttypes.h> - } -]]> -</programlisting> -</informalexample> - -<para> -Multiple @code{%top} blocks are allowed, and their order is preserved. 
-</para> - -</section> - -<section> -<title>Format of the Rules Section</title> - -<!-- @cindex input file, Rules Section --> -<!-- @cindex rules, in flex input --> - -<para> -The @dfn{rules} section of the <application>flex</application> input contains a series of -rules of the form: -</para> - -<informalexample> -<programlisting> -<![CDATA[ - pattern action -]]> -</programlisting> -</informalexample> - -<para> -where the pattern must be unindented and the action must begin -on the same line. -@xref{Patterns}, for a further description of patterns and actions. -</para> - -<para> -In the rules section, any indented or %{ %} enclosed text appearing -before the first rule may be used to declare variables which are local -to the scanning routine and (after the declarations) code which is to be -executed whenever the scanning routine is entered. Other indented or -%{ %} text in the rule section is still copied to the output, but its -meaning is not well-defined and it may well cause compile-time errors -(this feature is present for @acronym{POSIX} compliance. @xref{Lex and -Posix}, for other such features). -</para> - -<para> -Any <emphasis>indented</emphasis> text or text enclosed in @samp{%{} and @samp{%}} -is copied verbatim to the output (with the %{ and %} symbols removed). -The %{ and %} symbols must appear unindented on lines by themselves. -</para> - -</section> - -<section> -<title>Format of the User Code Section</title> - -<!-- @cindex input file, user code Section --> -<!-- @cindex user code, in flex input --> - -<para> -The user code section is simply copied to <filename>lex.yy.c</filename> verbatim. It -is used for companion routines which call or are called by the scanner. -The presence of this section is optional; if it is missing, the second -@samp{%%} in the input file may be skipped, too. 
-</para> - -</section> - -<section> -<title>Comments in the Input</title> - -<!-- @cindex comments, syntax of --> - -<para> -Flex supports C-style comments, that is, anything between /* and */ is -considered a comment. Whenever flex encounters a comment, it copies the -entire comment verbatim to the generated source code. Comments may -appear just about anywhere, but with the following exceptions: -</para> - -<itemizedlist> - -<!-- @cindex comments, in rules section --> -<listitem> - -<para> -Comments may not appear in the Rules Section wherever flex is expecting -a regular expression. This means comments may not appear at the -beginning of a line, or immediately following a list of scanner states. -</para> - -</listitem> -<listitem> - -<para> -Comments may not appear on an @samp{%option} line in the Definitions -Section. -</para> - -</listitem> -</itemizedlist> - - - -<para>If you want to follow a simple rule, then always begin a comment on a -new line, with one or more whitespace characters before the initial -@samp{/*}. This rule will work anywhere in the input file. -</para> - -<para> -All the comments in the following example are valid: -</para> - -<!-- @cindex comments, valid uses of --> -<!-- @cindex comments in the input --> -<informalexample> -<programlisting> -<![CDATA[ -%{ -/* code block */ -%} - -/* Definitions Section */ -%x STATE_X - -%% - /* Rules Section */ -ruleA /* after regex */ { /* code block */ } /* after code block */ - /* Rules Section (indented) */ -<STATE_X>{ -ruleC ECHO; -ruleD ECHO; -%{ -/* code block */ -%} -} -%% -/* User Code Section */ - -]]> -</programlisting> -</informalexample> - -</section> -</chapter> - -<chapter> -<title>Patterns</title> - -<!-- @cindex patterns, in rules section --> -<!-- @cindex regular expressions, in patterns --> - -<para> -The patterns in the input (see @ref{Rules Section}) are written using an -extended set of regular expressions. 
These are: -</para> - -<!-- @cindex patterns, syntax --> -<!-- @cindex patterns, syntax --> -<variablelist> -<varlistentry><term>x</term> -<listitem> - -match the character 'x' - -</listitem> -</varlistentry> - -<varlistentry><term>.</term> -<listitem> -any character (byte) except newline - -<!-- @cindex [] in patterns --> -<!-- @cindex character classes in patterns, syntax of --> -<!-- @cindex POSIX, character classes in patterns, syntax of --> -</listitem> -</varlistentry> - -<varlistentry><term>[xyz]</term> -<listitem> -a @dfn{character class}; in this case, the pattern -matches either an 'x', a 'y', or a 'z' - -<!-- @cindex ranges in patterns --> -</listitem> -</varlistentry> - -<varlistentry><term>[abj-oZ]</term> -<listitem> -a "character class" with a range in it; matches -an 'a', a 'b', any letter from 'j' through 'o', -or a 'Z' - -<!-- @cindex ranges in patterns, negating --> -<!-- @cindex negating ranges in patterns --> -</listitem> -</varlistentry> - -<varlistentry><term>[^A-Z]</term> -<listitem> -a "negated character class", i.e., any character -but those in the class. In this case, any -character EXCEPT an uppercase letter. 
- -</listitem> -</varlistentry> - -<varlistentry><term>[^A-Z\n]</term> -<listitem> -any character EXCEPT an uppercase letter or -a newline - -</listitem> -</varlistentry> - -<varlistentry><term>r*</term> -<listitem> -zero or more r's, where r is any regular expression - -</listitem> -</varlistentry> - -<varlistentry><term>r+</term> -<listitem> -one or more r's - -</listitem> -</varlistentry> - -<varlistentry><term>r?</term> -<listitem> -zero or one r's (that is, ``an optional r'') - -<!-- @cindex braces in patterns --> -</listitem> -</varlistentry> - -<varlistentry><term>r{2,5}</term> -<listitem> -anywhere from two to five r's - -</listitem> -</varlistentry> - -<varlistentry><term>r{2,}</term> -<listitem> -two or more r's - -</listitem> -</varlistentry> - -<varlistentry><term>r{4}</term> -<listitem> -exactly 4 r's - -<!-- @cindex pattern aliases, expansion of --> -</listitem> -</varlistentry> - -<varlistentry><term>{name}</term> -<listitem> -the expansion of the @samp{name} definition -(@pxref{Format}). - -<!-- @cindex literal text in patterns, syntax of --> -<!-- @cindex verbatim text in patterns, syntax of --> -</listitem> -</varlistentry> - -<varlistentry><term>"[xyz]\"foo"</term> -<listitem> -the literal string: @samp{[xyz]"foo} - -<!-- @cindex escape sequences in patterns, syntax of --> -</listitem> -</varlistentry> - -<varlistentry><term>\X</term> -<listitem> -if X is @samp{a}, @samp{b}, @samp{f}, @samp{n}, @samp{r}, @samp{t}, or -@samp{v}, then the ANSI-C interpretation of @samp{\x}. 
Otherwise, a -literal @samp{X} (used to escape operators such as @samp{*}) - -<!-- @cindex NULL character in patterns, syntax of --> -</listitem> -</varlistentry> - -<varlistentry><term>\0</term> -<listitem> -a NUL character (ASCII code 0) - -<!-- @cindex octal characters in patterns --> -</listitem> -</varlistentry> - -<varlistentry><term>\123</term> -<listitem> -the character with octal value 123 - -</listitem> -</varlistentry> - -<varlistentry><term>\x2a</term> -<listitem> -the character with hexadecimal value 2a - -</listitem> -</varlistentry> - -<varlistentry><term>(r)</term> -<listitem> -match an @samp{r}; parentheses are used to override precedence (see below) - -<!-- @cindex concatenation, in patterns --> -</listitem> -</varlistentry> - -<varlistentry><term>rs</term> -<listitem> -the regular expression @samp{r} followed by the regular expression @samp{s}; called -@dfn{concatenation} - -</listitem> -</varlistentry> - -<varlistentry><term>r|s</term> -<listitem> -either an @samp{r} or an @samp{s} - -<!-- @cindex trailing context, in patterns --> -</listitem> -</varlistentry> - -<varlistentry><term>r/s</term> -<listitem> -an @samp{r} but only if it is followed by an @samp{s}. The text matched by @samp{s} is -included when determining whether this rule is the longest match, but is -then returned to the input before the action is executed. So the action -only sees the text matched by @samp{r}. This type of pattern is called -@dfn{trailing context}. (There are some combinations of @samp{r/s} that flex -cannot match correctly. @xref{Limitations}, regarding dangerous trailing -context.) - -<!-- @cindex beginning of line, in patterns --> -<!-- @cindex BOL, in patterns --> -</listitem> -</varlistentry> - -<varlistentry><term>^r</term> -<listitem> -an @samp{r}, but only at the beginning of a line (i.e., -when just starting to scan, or right after a -newline has been scanned). 
- -<!-- @cindex end of line, in patterns --> -<!-- @cindex EOL, in patterns --> -</listitem> -</varlistentry> - -<varlistentry><term>r$</term> -<listitem> -an @samp{r}, but only at the end of a line (i.e., just before a -newline). Equivalent to @samp{r/\n}. - -<!-- @cindex newline, matching in patterns --> -Note that <application>flex</application>'s notion of ``newline'' is exactly -whatever the C compiler used to compile <application>flex</application> -interprets @samp{\n} as; in particular, on some DOS -systems you must either filter out @samp{\r}s in the -input yourself, or explicitly use @samp{r/\r\n} for @samp{r$}. - -<!-- @cindex start conditions, in patterns --> -</listitem> -</varlistentry> - -<varlistentry><term><s>r</term> -<listitem> -an @samp{r}, but only in start condition @code{s} (see @ref{Start -Conditions} for discussion of start conditions). - -</listitem> -</varlistentry> - -<varlistentry><term><s1,s2,s3>r</term> -<listitem> -same, but in any of start conditions @code{s1}, @code{s2}, or @code{s3}. - -</listitem> -</varlistentry> - -<varlistentry><term><*>r</term> -<listitem> -an @samp{r} in any start condition, even an exclusive one. - -<!-- @cindex end of file, in patterns --> -<!-- @cindex EOF in patterns, syntax of --> -</listitem> -</varlistentry> - -<varlistentry><term><<EOF>></term> -<listitem> -an end-of-file. - -</listitem> -</varlistentry> - -<varlistentry><term><s1,s2><<EOF>></term> -<listitem> -an end-of-file when in start condition @code{s1} or @code{s2} -</listitem> -</varlistentry> -</variablelist> - -Note that inside of a character class, all regular expression operators -lose their special meaning except escape (@samp{\}) and the character class -operators, @samp{-}, @samp{]]}, and, at the beginning of the class, @samp{^}. - -<!-- @cindex patterns, precedence of operators --> -The regular expressions listed above are grouped according to -precedence, from highest precedence at the top to lowest at the bottom. 
-Those grouped together have equal precedence (see special note on the -precedence of the repeat operator, @samp{{}}, under the documentation -for the @samp{--posix} POSIX compliance option). For example, - -<!-- @cindex patterns, grouping and precedence --> -<informalexample> -<programlisting> -<![CDATA[ - foo|bar* -]]> -</programlisting> -</informalexample> - -is the same as - -<informalexample> -<programlisting> -<![CDATA[ - (foo)|(ba(r*)) -]]> -</programlisting> -</informalexample> - -since the @samp{*} operator has higher precedence than concatenation, -and concatenation higher than alternation (@samp{|}). This pattern -therefore matches <emphasis>either</emphasis> the string @samp{foo} <emphasis>or</emphasis> the -string @samp{ba} followed by zero-or-more @samp{r}'s. To match -@samp{foo} or zero-or-more repetitions of the string @samp{bar}, use: - -<informalexample> -<programlisting> -<![CDATA[ - foo|(bar)* -]]> -</programlisting> -</informalexample> - -And to match a sequence of zero or more repetitions of @samp{foo} and -@samp{bar}: - -<!-- @cindex patterns, repetitions with grouping --> -<informalexample> -<programlisting> -<![CDATA[ - (foo|bar)* -]]> -</programlisting> -</informalexample> - -<!-- @cindex character classes in patterns --> -In addition to characters and ranges of characters, character classes -can also contain @dfn{character class expressions}. These are -expressions enclosed inside @samp{[:} and @samp{:]} delimiters (which -themselves must appear between the @samp{[} and @samp{]} of the -character class. Other elements may occur inside the character class, -too). 
The valid expressions are: - -<!-- @cindex patterns, valid character classes --> -<informalexample> -<programlisting> -<![CDATA[ - [:alnum:] [:alpha:] [:blank:] - [:cntrl:] [:digit:] [:graph:] - [:lower:] [:print:] [:punct:] - [:space:] [:upper:] [:xdigit:] -]]> -</programlisting> -</informalexample> - -These expressions all designate a set of characters equivalent to the -corresponding standard C @code{isXXX} function. For example, -@samp{[:alnum:]} designates those characters for which <function>isalnum</function> -returns true - i.e., any alphabetic or numeric character. Some systems -don't provide <function>isblank</function>, so flex defines @samp{[:blank:]} as a -blank or a tab. - -For example, the following character classes are all equivalent: - -<!-- @cindex character classes, equivalence of --> -<!-- @cindex patterns, character class equivalence --> -<informalexample> -<programlisting> -<![CDATA[ - [[:alnum:]] - [[:alpha:][:digit:]] - [[:alpha:][0-9]] - [a-zA-Z0-9] -]]> -</programlisting> -</informalexample> - -Some notes on patterns are in order. - - -<itemizedlist> - -<!-- @cindex case-insensitive, effect on character classes --> -<listitem> - If your scanner is case-insensitive (the @samp{-i} flag), then -@samp{[:upper:]} and @samp{[:lower:]} are equivalent to -@samp{[:alpha:]}. - -@anchor{case and character ranges} -</listitem> -<listitem> - Character classes with ranges, such as @samp{[a-Z]}, should be used with -caution in a case-insensitive scanner if the range spans upper or lowercase -characters. Flex does not know if you want to fold all upper and lowercase -characters together, or if you want the literal numeric range specified (with -no case folding). When in doubt, flex will assume that you meant the literal -numeric range, and will issue a warning. The exception to this rule is a -character range such as @samp{[a-z]} or @samp{[S-W]} where it is obvious that you -want case-folding to occur. 
Here are some examples with the @samp{-i} flag -enabled: - -<!-- -@multitable {@samp{[a-zA-Z]}} {ambiguous} {@samp{[A-Z\[\\\]_`a-t]}} {@samp{[@@A-Z\[\\\]_`abc]}} -@item Range @tab Result @tab Literal Range @tab Alternate Range -@item @samp{[a-t]} @tab ok @tab @samp{[a-tA-T]} @tab -@item @samp{[A-T]} @tab ok @tab @samp{[a-tA-T]} @tab -@item @samp{[A-t]} @tab ambiguous @tab @samp{[A-Z\[\\\]_`a-t]} @tab @samp{[a-tA-T]} -@item @samp{[_-{]} @tab ambiguous @tab @samp{[_`a-z{]} @tab @samp{[_`a-zA-Z{]} -@item @samp{[@@-C]} @tab ambiguous @tab @samp{[@@ABC]} @tab @samp{[@@A-Z\[\\\]_`abc]} -@end multitable--> - - -<!-- @cindex end of line, in negated character classes --> -<!-- @cindex EOL, in negated character classes --> -</listitem> -<listitem> - -A negated character class such as the example @samp{[^A-Z]} above -<emphasis>will</emphasis> match a newline unless @samp{\n} (or an equivalent escape -sequence) is one of the characters explicitly present in the negated -character class (e.g., @samp{[^A-Z\n]}). This is unlike how many other -regular expression tools treat negated character classes, but -unfortunately the inconsistency is historically entrenched. Matching -newlines means that a pattern like @samp{[^"]*} can match the entire -input unless there's another quote in the input. - -<!-- @cindex trailing context, limits of --> -<!-- @cindex ^ as non-special character in patterns --> -<!-- @cindex $ as normal character in patterns --> -</listitem> -<listitem> - -A rule can have at most one instance of trailing context (the @samp{/} operator -or the @samp{$} operator). The start condition, @samp{^}, and @samp{<<EOF>>} patterns -can only occur at the beginning of a pattern, and, as well as with @samp{/} and @samp{$}, -cannot be grouped inside parentheses. A @samp{^} which does not occur at -the beginning of a rule or a @samp{$} which does not occur at the end of -a rule loses its special properties and is treated as a normal character. 
- -</listitem> -<listitem> - -The following are invalid: - -<!-- @cindex patterns, invalid trailing context --> -<informalexample> -<programlisting> -<![CDATA[ - foo/bar$ - <sc1>foo<sc2>bar -]]> -</programlisting> -</informalexample> - -Note that the first of these can be written @samp{foo/bar\n}. - -</listitem> -<listitem> - -The following will result in @samp{$} or @samp{^} being treated as a normal character: - -<!-- @cindex patterns, special characters treated as non-special --> -<informalexample> -<programlisting> -<![CDATA[ - foo|(bar$) - foo|^bar -]]> -</programlisting> -</informalexample> - -If the desired meaning is a @samp{foo} or a -@samp{bar}-followed-by-a-newline, the following could be used (the -special @code{|} action is explained below, @pxref{Actions}): - -<!-- @cindex patterns, end of line --> -<informalexample> -<programlisting> -<![CDATA[ - foo | - bar$ /* action goes here */ -]]> -</programlisting> -</informalexample> - -A similar trick will work for matching a @samp{foo} or a -@samp{bar}-at-the-beginning-of-a-line. -</listitem> -</itemizedlist> - - -</chapter> - -<chapter> -<title>How the Input Is Matched</title> - -<!-- @cindex patterns, matching --> -<!-- @cindex input, matching --> -<!-- @cindex trailing context, matching --> -<!-- @cindex matching, and trailing context --> -<!-- @cindex matching, length of --> -<!-- @cindex matching, multiple matches --> -When the generated scanner is run, it analyzes its input looking for -strings which match any of its patterns. If it finds more than one -match, it takes the one matching the most text (for trailing context -rules, this includes the length of the trailing part, even though it -will then be returned to the input). If it finds two or more matches of -the same length, the rule listed first in the <application>flex</application> input file is -chosen. 
- -<!-- @cindex token --> -<!-- @cindex yytext --> -<!-- @cindex yyleng --> -Once the match is determined, the text corresponding to the match -(called the @dfn{token}) is made available in the global character -pointer <varname>yytext</varname>, and its length in the global integer -<varname>yyleng</varname>. The @dfn{action} corresponding to the matched pattern is -then executed (@pxref{Actions}), and then the remaining input is scanned -for another match. - -<!-- @cindex default rule --> -If no match is found, then the @dfn{default rule} is executed: the next -character in the input is considered matched and copied to the standard -output. Thus, the simplest valid <application>flex</application> input is: - -<!-- @cindex minimal scanner --> -<informalexample> -<programlisting> -<![CDATA[ - %% -]]> -</programlisting> -</informalexample> - -which generates a scanner that simply copies its input (one character at -a time) to its output. - -<!-- @cindex yytext, two types of --> -<!-- @cindex %array, use of --> -<!-- @cindex %pointer, use of --> -<!-- @vindex yytext --> -Note that <varname>yytext</varname> can be defined in two different ways: either as -a character <emphasis>pointer</emphasis> or as a character <emphasis>array</emphasis>. You can -control which definition <application>flex</application> uses by including one of the -special directives @code{%pointer} or @code{%array} in the first -(definitions) section of your flex input. The default is -@code{%pointer}, unless you use the @samp{-l} lex compatibility option, -in which case <varname>yytext</varname> will be an array. The advantage of using -@code{%pointer} is substantially faster scanning and no buffer overflow -when matching very large tokens (unless you run out of dynamic memory). 
-The disadvantage is that you are restricted in how your actions can -modify <varname>yytext</varname> (@pxref{Actions}), and calls to the <function>unput</function> -function destroy the present contents of <varname>yytext</varname>, which can be a -considerable porting headache when moving between different @code{lex} -versions. - -<!-- @cindex %array, advantages of --> -The advantage of @code{%array} is that you can then modify <varname>yytext</varname> -to your heart's content, and calls to <function>unput</function> do not destroy -<varname>yytext</varname> (@pxref{Actions}). Furthermore, existing @code{lex} -programs sometimes access <varname>yytext</varname> externally using declarations of -the form: - -<informalexample> -<programlisting> -<![CDATA[ - extern char yytext[]; -]]> -</programlisting> -</informalexample> - -This definition is erroneous when used with @code{%pointer}, but correct -for @code{%array}. - -The @code{%array} declaration defines <varname>yytext</varname> to be an array of -@code{YYLMAX} characters, which defaults to a fairly large value. You -can change the size by simply #define'ing @code{YYLMAX} to a different -value in the first section of your <application>flex</application> input. As mentioned -above, with @code{%pointer} yytext grows dynamically to accommodate -large tokens. While this means your @code{%pointer} scanner can -accommodate very large tokens (such as matching entire blocks of -comments), bear in mind that each time the scanner must resize -<varname>yytext</varname> it also must rescan the entire token from the beginning, -so matching such tokens can prove slow. <varname>yytext</varname> presently does -<emphasis>not</emphasis> dynamically grow if a call to <function>unput</function> results in too -much text being pushed back; instead, a run-time error results. - -<!-- @cindex %array, with C++ --> -Also note that you cannot use @code{%array} with C++ scanner classes -(@pxref{Cxx}). 
- -</chapter> - -<chapter> -<title>Actions</title> - -<!-- @cindex actions --> -Each pattern in a rule has a corresponding @dfn{action}, which can be -any arbitrary C statement. The pattern ends at the first non-escaped -whitespace character; the remainder of the line is its action. If the -action is empty, then when the pattern is matched the input token is -simply discarded. For example, here is the specification for a program -which deletes all occurrences of @samp{zap me} from its input: - -<!-- @cindex deleting lines from input --> -<informalexample> -<programlisting> -<![CDATA[ - %% - "zap me" -]]> -</programlisting> -</informalexample> - -This example will copy all other characters in the input to the output -since they will be matched by the default rule. - -Here is a program which compresses multiple blanks and tabs down to a -single blank, and throws away whitespace found at the end of a line: - -<!-- @cindex whitespace, compressing --> -<!-- @cindex compressing whitespace --> -<informalexample> -<programlisting> -<![CDATA[ - %% - [ \t]+ putchar( ' ' ); - [ \t]+$ /* ignore this token */ -]]> -</programlisting> -</informalexample> - -<!-- @cindex %{ and %}, in Rules Section --> -<!-- @cindex actions, use of { and } --> -<!-- @cindex actions, embedded C strings --> -<!-- @cindex C-strings, in actions --> -<!-- @cindex comments, in actions --> -If the action contains a @samp{{}, then the action spans till the -balancing @samp{}} is found, and the action may cross multiple lines. -<application>flex</application> knows about C strings and comments and won't be fooled by -braces found within them, but also allows actions to begin with -@samp{%{} and will consider the action to be all the text up to the -next @samp{%}} (regardless of ordinary braces inside the action). - -<!-- @cindex |, in actions --> -An action consisting solely of a vertical bar (@samp{|}) means ``same as the -action for the next rule''. See below for an illustration. 
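
To make the effect of the whitespace-compressing rules earlier in this chapter concrete, here is a rough hand-written C equivalent. It is only a sketch and not anything that flex generates: the function name compress_blanks and its string-buffer interface are hypothetical, chosen for illustration. It assumes that a run of blanks and tabs counts as trailing only when a newline follows it, since the trailing-context pattern in the second rule matches just before a newline.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper, not part of flex.  Copies `in` to `out`,
 * replacing each run of blanks and tabs with a single blank and
 * dropping the run entirely when it is followed by a newline.
 * Output is truncated to fit `out_size` and is always NUL-terminated.
 * Returns the number of characters written, excluding the NUL.
 */
static size_t compress_blanks(const char *in, char *out, size_t out_size)
{
    size_t n = 0;

    assert(out_size > 0);

    while (*in != '\0' && n + 1 < out_size) {
        if (*in == ' ' || *in == '\t') {
            /* Consume the whole run, like the pattern [ \t]+ */
            while (*in == ' ' || *in == '\t')
                ++in;
            /* Run at end of line: discard it (second rule).
             * Otherwise emit one blank (first rule). */
            if (*in != '\n')
                out[n++] = ' ';
        } else {
            /* Anything else is copied, like the default rule. */
            out[n++] = *in++;
        }
    }
    out[n] = '\0';
    return n;
}
```

Compared with this loop, the two-rule scanner states the same behavior declaratively and leaves the buffering to flex.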
- -Actions can include arbitrary C code, including @code{return} statements -to return a value to whatever routine called <function>yylex</function>. Each time -<function>yylex</function> is called it continues processing tokens from where it -last left off until it either reaches the end of the file or executes a -return. - -<!-- @cindex yytext, modification of --> -Actions are free to modify <varname>yytext</varname> except for lengthening it -(adding characters to its end--these will overwrite later characters in -the input stream). This however does not apply when using @code{%array} -(@pxref{Matching}). In that case, <varname>yytext</varname> may be freely modified -in any way. - -<!-- @cindex yyleng, modification of --> -<!-- @cindex yymore, and yyleng --> -Actions are free to modify <varname>yyleng</varname> except they should not do so if -the action also includes use of <function>yymore</function> (see below). - -<!-- @cindex preprocessor macros, for use in actions --> -There are a number of special directives which can be included within an -action: - -<variablelist> - -<varlistentry><term>ECHO</term> -<listitem> -<!-- @cindex ECHO --> -copies yytext to the scanner's output. - -</listitem> -</varlistentry> - -<varlistentry><term>BEGIN</term> -<listitem> -<!-- @cindex BEGIN --> -followed by the name of a start condition places the scanner in the -corresponding start condition (see below). - -</listitem> -</varlistentry> - -<varlistentry><term>REJECT</term> -<listitem> -<!-- @cindex REJECT --> -directs the scanner to proceed on to the ``second best'' rule which -matched the input (or a prefix of the input). The rule is chosen as -described above in @ref{Matching}, and <varname>yytext</varname> and <varname>yyleng</varname> -set up appropriately. It may either be one which matched as much text -as the originally chosen rule but came later in the <application>flex</application> input -file, or one which matched less text. 
For example, the following will -both count the words in the input and call the routine <function>special</function> -whenever @samp{frob} is seen: - -<informalexample> -<programlisting> -<![CDATA[ - int word_count = 0; - %% - - frob special(); REJECT; - [^ \t\n]+ ++word_count; -]]> -</programlisting> -</informalexample> - -Without the @code{REJECT}, any occurrences of @samp{frob} in the input -would not be counted as words, since the scanner normally executes only -one action per token. Multiple uses of @code{REJECT} are allowed, each -one finding the next best choice to the currently active rule. For -example, when the following scanner scans the token @samp{abcd}, it will -write @samp{abcdabcaba} to the output: - -<!-- @cindex REJECT, calling multiple times --> -<!-- @cindex |, use of --> -<informalexample> -<programlisting> -<![CDATA[ - %% - a | - ab | - abc | - abcd ECHO; REJECT; - .|\n /* eat up any unmatched character */ -]]> -</programlisting> -</informalexample> - -The first three rules share the fourth's action since they use the -special @samp{|} action. - -@code{REJECT} is a particularly expensive feature in terms of scanner -performance; if it is used in <emphasis>any</emphasis> of the scanner's actions it -will slow down <emphasis>all</emphasis> of the scanner's matching. Furthermore, -@code{REJECT} cannot be used with the @samp{-Cf} or @samp{-CF} options -(@pxref{Scanner Options}). - -Note also that unlike the other special actions, @code{REJECT} is a -<emphasis>branch</emphasis>; code immediately following it in the action will -<emphasis>not</emphasis> be executed. - -</listitem> -</varlistentry> - -<varlistentry><term>yymore()</term> -<listitem> -<!-- @cindex yymore() --> -tells the scanner that the next time it matches a rule, the -corresponding token should be <emphasis>appended</emphasis> onto the current value of -<varname>yytext</varname> rather than replacing it. 
For example, given the input -@samp{mega-kludge} the following will write @samp{mega-mega-kludge} to -the output: - -<!-- @cindex yymore(), mega-kludge --> -<!-- @cindex yymore() to append token to previous token --> -<informalexample> -<programlisting> -<![CDATA[ - %% - mega- ECHO; yymore(); - kludge ECHO; -]]> -</programlisting> -</informalexample> - -First @samp{mega-} is matched and echoed to the output. Then @samp{kludge} -is matched, but the previous @samp{mega-} is still hanging around at the -beginning of -<varname>yytext</varname> -so the -@code{ECHO} -for the @samp{kludge} rule will actually write @samp{mega-kludge}. -</listitem> -</varlistentry> -</variablelist> - -<!-- @cindex yymore, performance penalty of --> -Two notes regarding use of <function>yymore</function>. First, <function>yymore</function> -depends on the value of <varname>yyleng</varname> correctly reflecting the size of -the current token, so you must not modify <varname>yyleng</varname> if you are using -<function>yymore</function>. Second, the presence of <function>yymore</function> in the -scanner's action entails a minor performance penalty in the scanner's -matching speed. - -<!-- @cindex yyless() --> -@code{yyless(n)} returns all but the first @code{n} characters of the -current token back to the input stream, where they will be rescanned -when the scanner looks for the next match. <varname>yytext</varname> and -<varname>yyleng</varname> are adjusted appropriately (e.g., <varname>yyleng</varname> will now -be equal to @code{n}). For example, on the input @samp{foobar} the -following will write out @samp{foobarbar}: - -<!-- @cindex yyless(), pushing back characters --> -<!-- @cindex pushing back characters with yyless --> -<informalexample> -<programlisting> -<![CDATA[ - %% - foobar ECHO; yyless(3); - [a-z]+ ECHO; -]]> -</programlisting> -</informalexample> - -An argument of 0 to <function>yyless</function> will cause the entire current input -string to be scanned again. 
Unless you've changed how the scanner will -subsequently process its input (using @code{BEGIN}, for example), this -will result in an endless loop. - -Note that <function>yyless</function> is a macro and can only be used in the flex -input file, not from other source files. - -<!-- @cindex unput() --> -<!-- @cindex pushing back characters with unput --> -@code{unput(c)} puts the character @code{c} back onto the input stream. -It will be the next character scanned. The following action will take -the current token and cause it to be rescanned enclosed in parentheses. - -<!-- @cindex unput(), pushing back characters --> -<!-- @cindex pushing back characters with unput() --> -<informalexample> -<programlisting> -<![CDATA[ - { - int i; - /* Copy yytext because unput() trashes yytext */ - char *yycopy = strdup( yytext ); - unput( ')' ); - for ( i = yyleng - 1; i >= 0; --i ) - unput( yycopy[i] ); - unput( '(' ); - free( yycopy ); - } -]]> -</programlisting> -</informalexample> - -Note that since each <function>unput</function> puts the given character back at the -<emphasis>beginning</emphasis> of the input stream, pushing back strings must be done -back-to-front. - -<!-- @cindex %pointer, and unput() --> -<!-- @cindex unput(), and %pointer --> -An important potential problem when using <function>unput</function> is that if you -are using @code{%pointer} (the default), a call to <function>unput</function> -<emphasis>destroys</emphasis> the contents of <varname>yytext</varname>, starting with its -rightmost character and devouring one character to the left with each -call. If you need the value of <varname>yytext</varname> preserved after a call to -<function>unput</function> (as in the above example), you must either first copy it -elsewhere, or build your scanner using @code{%array} instead -(@pxref{Matching}). 
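
The back-to-front ordering described above can be checked with a small standalone C model of a pushback stack. This is a sketch for illustration only: the names model_unput, model_input and the other model_ helpers are hypothetical, the fixed capacity is an arbitrary assumption, and the code does not reproduce how flex actually stores pushed-back text.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MODEL_PUSHBACK_MAX 64   /* arbitrary capacity for this model */

/* Hypothetical stand-in for the scanner's pushed-back input. */
struct pushback_model {
    char   data[MODEL_PUSHBACK_MAX];
    size_t count;               /* characters currently pushed back */
};

/* Model of unput(c): c becomes the next character to be read. */
static void model_unput(struct pushback_model *pb, char c)
{
    assert(pb->count < MODEL_PUSHBACK_MAX);
    pb->data[pb->count++] = c;
}

/* Model of input(): most recently pushed-back character, or -1 if none. */
static int model_input(struct pushback_model *pb)
{
    if (pb->count == 0)
        return -1;
    return (unsigned char) pb->data[--pb->count];
}

/* Push back "(token)" so that it reads left to right:
 * the last character goes in first, the first character last. */
static void model_unput_parenthesized(struct pushback_model *pb,
                                      const char *token)
{
    size_t i = strlen(token);

    model_unput(pb, ')');
    while (i > 0)
        model_unput(pb, token[--i]);
    model_unput(pb, '(');
}

/* Drain the model into out (NUL-terminated); returns the length read. */
static size_t model_read_all(struct pushback_model *pb,
                             char *out, size_t out_size)
{
    size_t n = 0;
    int c;

    assert(out_size > 0);
    while (n + 1 < out_size && (c = model_input(pb)) != -1)
        out[n++] = (char) c;
    out[n] = '\0';
    return n;
}
```

Because the model takes the token as a separate string, it also mirrors the advice to work from a copy rather than from <varname>yytext</varname> itself.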
- -<!-- @cindex pushing back EOF --> -<!-- @cindex EOF, pushing back --> -Finally, note that you cannot put back @samp{EOF} to attempt to mark the -input stream with an end-of-file. - -<!-- @cindex input() --> -<function>input</function> reads the next character from the input stream. For -example, the following is one way to eat up C comments: - -<!-- @cindex comments, discarding --> -<!-- @cindex discarding C comments --> -<informalexample> -<programlisting> -<![CDATA[ - %% - "/*" { - register int c; - - for ( ; ; ) - { - while ( (c = input()) != '*' && - c != EOF ) - ; /* eat up text of comment */ - - if ( c == '*' ) - { - while ( (c = input()) == '*' ) - ; - if ( c == '/' ) - break; /* found the end */ - } - - if ( c == EOF ) - { - error( "EOF in comment" ); - break; - } - } - } -]]> -</programlisting> -</informalexample> - -<!-- @cindex input(), and C++ --> -<!-- @cindex yyinput() --> -(Note that if the scanner is compiled using @code{C++}, then -<function>input</function> is instead referred to as @b{yyinput()}, in order to -avoid a name clash with the @code{C++} stream by the name of -@code{input}.) - -<!-- @cindex flushing the internal buffer --> -<!-- @cindex YY_FLUSH_BUFFER() --> -<function>YY_FLUSH_BUFFER</function> flushes the scanner's internal buffer so that -the next time the scanner attempts to match a token, it will first -refill the buffer using <function>YY_INPUT</function> (@pxref{Generated Scanner}). -This action is a special case of the more general -<function>yy_flush_buffer</function> function, described below (@pxref{Multiple -Input Buffers}) - -<!-- @cindex yyterminate() --> -<!-- @cindex terminating with yyterminate() --> -<!-- @cindex exiting with yyterminate() --> -<!-- @cindex halting with yyterminate() --> -<function>yyterminate</function> can be used in lieu of a return statement in an -action. It terminates the scanner and returns a 0 to the scanner's -caller, indicating ``all done''. 
By default, <function>yyterminate</function> is -also called when an end-of-file is encountered. It is a macro and may -be redefined. - -</chapter> - -<chapter> -<title>The Generated Scanner</title> - -<!-- @cindex yylex(), in generated scanner --> -The output of <application>flex</application> is the file <filename>lex.yy.c</filename>, which contains -the scanning routine <function>yylex</function>, a number of tables used by it for -matching tokens, and a number of auxiliary routines and macros. By -default, <function>yylex</function> is declared as follows: - -<informalexample> -<programlisting> -<![CDATA[ - int yylex() - { - ... various definitions and the actions in here ... - } -]]> -</programlisting> -</informalexample> - -<!-- @cindex yylex(), overriding --> -(If your environment supports function prototypes, then it will be -@code{int yylex( void )}.) This definition may be changed by defining -the @code{YY_DECL} macro. For example, you could use: - -<!-- @cindex yylex, overriding the prototype of --> -<informalexample> -<programlisting> -<![CDATA[ - #define YY_DECL float lexscan( a, b ) float a, b; -]]> -</programlisting> -</informalexample> - -to give the scanning routine the name @code{lexscan}, returning a float, -and taking two floats as arguments. Note that if you give arguments to -the scanning routine using a K&R-style/non-prototyped function -declaration, you must terminate the definition with a semi-colon (;). - -<application>flex</application> generates @samp{C99} function definitions by -default. However flex does have the ability to generate obsolete, er, -@samp{traditional}, function definitions. This is to support -bootstrapping gcc on old systems. Unfortunately, traditional -definitions prevent us from using any standard data types smaller than -int (such as short, char, or bool) as function arguments. 
For this -reason, future versions of <application>flex</application> may generate standard C99 code -only, leaving K&R-style functions to the historians. Currently, if you -do <emphasis role="strong">not</emphasis> want @samp{C99} definitions, then you must use -@code{%option noansi-definitions}. - -<!-- @cindex stdin, default for yyin --> -<!-- @cindex yyin --> -Whenever <function>yylex</function> is called, it scans tokens from the global input -file <filename>yyin</filename> (which defaults to stdin). It continues until it -either reaches an end-of-file (at which point it returns the value 0) or -one of its actions executes a @code{return} statement. - -<!-- @cindex EOF and yyrestart() --> -<!-- @cindex end-of-file, and yyrestart() --> -<!-- @cindex yyrestart() --> -If the scanner reaches an end-of-file, subsequent calls are undefined -unless either <filename>yyin</filename> is pointed at a new input file (in which case -scanning continues from that file), or <function>yyrestart</function> is called. -<function>yyrestart</function> takes one argument, a @code{FILE *} pointer (which -can be NULL, if you've set up @code{YY_INPUT} to scan from a source other -than <varname>yyin</varname>), and initializes <filename>yyin</filename> for scanning from that -file. Essentially there is no difference between just assigning -<filename>yyin</filename> to a new input file or using <function>yyrestart</function> to do so; -the latter is available for compatibility with previous versions of -<application>flex</application>, and because it can be used to switch input files in the -middle of scanning. It can also be used to throw away the current input -buffer, by calling it with an argument of <filename>yyin</filename>; but it would be -better to use @code{YY_FLUSH_BUFFER} (@pxref{Actions}). Note that -<function>yyrestart</function> does <emphasis>not</emphasis> reset the start condition to -@code{INITIAL} (@pxref{Start Conditions}). 
- -<!-- @cindex RETURN, within actions --> -If <function>yylex</function> stops scanning due to executing a @code{return} -statement in one of the actions, the scanner may then be called again -and it will resume scanning where it left off. - -<!-- @cindex YY_INPUT --> -By default (and for purposes of efficiency), the scanner uses -block-reads rather than simple <function>getc</function> calls to read characters -from <filename>yyin</filename>. The nature of how it gets its input can be controlled -by defining the @code{YY_INPUT} macro. The calling sequence for -<function>YY_INPUT</function> is @code{YY_INPUT(buf,result,max_size)}. Its action -is to place up to @code{max_size} characters in the character array -@code{buf} and return in the integer variable @code{result} either the -number of characters read or the constant @code{YY_NULL} (0 on Unix -systems) to indicate @samp{EOF}. The default @code{YY_INPUT} reads from -the global file-pointer <filename>yyin</filename>. - -<!-- @cindex YY_INPUT, overriding --> -Here is a sample definition of @code{YY_INPUT} (in the definitions -section of the input file): - -<informalexample> -<programlisting> -<![CDATA[ - %{ - #define YY_INPUT(buf,result,max_size) \ - { \ - int c = getchar(); \ - result = (c == EOF) ? YY_NULL : (buf[0] = c, 1); \ - } - %} -]]> -</programlisting> -</informalexample> - -This definition will change the input processing to occur one character -at a time. - -<!-- @cindex yywrap() --> -When the scanner receives an end-of-file indication from YY_INPUT, it -then checks the <function>yywrap</function> function. If <function>yywrap</function> returns -false (zero), then it is assumed that the function has gone ahead and -set up <filename>yyin</filename> to point to another input file, and scanning -continues. If it returns true (non-zero), then the scanner terminates, -returning 0 to its caller. 
Note that in either case, the start -condition remains unchanged; it does <emphasis>not</emphasis> revert to -@code{INITIAL}. - -<!-- @cindex yywrap, default for --> -<!-- @cindex nowrap, %option --> -<!-- @cindex %option nowrap --> -If you do not supply your own version of <function>yywrap</function>, then you must -either use @code{%option noyywrap} (in which case the scanner behaves as -though <function>yywrap</function> returned 1), or you must link with @samp{-lfl} to -obtain the default version of the routine, which always returns 1. - -For scanning from in-memory buffers (e.g., scanning strings), see -@ref{Scanning Strings}. @xref{Multiple Input Buffers}. - -<!-- @cindex ECHO, and yyout --> -<!-- @cindex yyout --> -<!-- @cindex stdout, as default for yyout --> -The scanner writes its @code{ECHO} output to the <filename>yyout</filename> global -(default, <filename>stdout</filename>), which may be redefined by the user simply by -assigning it to some other @code{FILE} pointer. - -</chapter> - -<chapter> -<title>Start Conditions</title> - -<!-- @cindex start conditions --> -<application>flex</application> provides a mechanism for conditionally activating rules. -Any rule whose pattern is prefixed with @samp{<NAME>} will only be active -when the scanner is in the @dfn{start condition} named @code{NAME}. For -example, - -<!-- @c proofread edit stopped here --> -<informalexample> -<programlisting> -<![CDATA[ - <STRING>[^"]* { /* eat up the string body ... */ - ... - } -]]> -</programlisting> -</informalexample> - -will be active only when the scanner is in the @code{STRING} start -condition, and - -<!-- @cindex start conditions, multiple --> -<informalexample> -<programlisting> -<![CDATA[ - <INITIAL,STRING,QUOTE>\. { /* handle an escape ... */ - ... - } -]]> -</programlisting> -</informalexample> - -will be active only when the current start condition is either -@code{INITIAL}, @code{STRING}, or @code{QUOTE}. - -<!-- @cindex start conditions, inclusive v.s. 
exclusive --> -Start conditions are declared in the definitions (first) section of the -input using unindented lines beginning with either @samp{%s} or -@samp{%x} followed by a list of names. The former declares -@dfn{inclusive} start conditions, the latter @dfn{exclusive} start -conditions. A start condition is activated using the @code{BEGIN} -action. Until the next @code{BEGIN} action is executed, rules with the -given start condition will be active and rules with other start -conditions will be inactive. If the start condition is inclusive, then -rules with no start conditions at all will also be active. If it is -exclusive, then <emphasis>only</emphasis> rules qualified with the start condition -will be active. A set of rules contingent on the same exclusive start -condition describe a scanner which is independent of any of the other -rules in the <application>flex</application> input. Because of this, exclusive start -conditions make it easy to specify ``mini-scanners'' which scan portions -of the input that are syntactically different from the rest (e.g., -comments). - -If the distinction between inclusive and exclusive start conditions -is still a little vague, here's a simple example illustrating the -connection between the two. The set of rules: - -<!-- @cindex start conditions, inclusive --> -<informalexample> -<programlisting> -<![CDATA[ - %s example - %% - - <example>foo do_something(); - - bar something_else(); -]]> -</programlisting> -</informalexample> - -is equivalent to - -<!-- @cindex start conditions, exclusive --> -<informalexample> -<programlisting> -<![CDATA[ - %x example - %% - - <example>foo do_something(); - - <INITIAL,example>bar something_else(); -]]> -</programlisting> -</informalexample> - -Without the @code{<INITIAL,example>} qualifier, the @code{bar} pattern in -the second example wouldn't be active (i.e., couldn't match) when in -start condition @code{example}. 
If we just used @code{<example>} to -qualify @code{bar}, though, then it would only be active in -@code{example} and not in @code{INITIAL}, while in the first example -it's active in both, because in the first example the @code{example} -start condition is an inclusive @code{(%s)} start condition. - -<!-- @cindex start conditions, special wildcard condition --> -Also note that the special start-condition specifier -@code{<*>} -matches every start condition. Thus, the above example could also -have been written: - -<!-- @cindex start conditions, use of wildcard condition (<*>) --> -<informalexample> -<programlisting> -<![CDATA[ - %x example - %% - - <example>foo do_something(); - - <*>bar something_else(); -]]> -</programlisting> -</informalexample> - -The default rule (to @code{ECHO} any unmatched character) remains active -in start conditions. It is equivalent to: - -<!-- @cindex start conditions, behavior of default rule --> -<informalexample> -<programlisting> -<![CDATA[ - <*>.|\n ECHO; -]]> -</programlisting> -</informalexample> - -<!-- @cindex BEGIN, explanation --> -<!-- @findex BEGIN --> -<!-- @vindex INITIAL --> -@code{BEGIN(0)} returns to the original state where only the rules with -no start conditions are active. This state can also be referred to as -the start-condition @code{INITIAL}, so @code{BEGIN(INITIAL)} is -equivalent to @code{BEGIN(0)}. (The parentheses around the start -condition name are not required but are considered good style.) - -@code{BEGIN} actions can also be given as indented code at the beginning -of the rules section. 
For example, the following will cause the scanner -to enter the @code{SPECIAL} start condition whenever <function>yylex</function> is -called and the global variable @code{enter_special} is true: - -<!-- @cindex start conditions, using BEGIN --> -<informalexample> -<programlisting> -<![CDATA[ - int enter_special; - - %x SPECIAL - %% - if ( enter_special ) - BEGIN(SPECIAL); - - <SPECIAL>blahblahblah - ...more rules follow... -]]> -</programlisting> -</informalexample> - -To illustrate the uses of start conditions, here is a scanner which -provides two different interpretations of a string like @samp{123.456}. -By default it will treat it as three tokens, the integer @samp{123}, a -dot (@samp{.}), and the integer @samp{456}. But if the string is -preceded earlier in the line by the string @samp{expect-floats} it will -treat it as a single token, the floating-point number @samp{123.456}: - -<!-- @cindex start conditions, for different interpretations of same input --> -<informalexample> -<programlisting> -<![CDATA[ - %{ - #include <stdlib.h> - %} - %s expect - - %% - expect-floats BEGIN(expect); - - <expect>[0-9]+"."[0-9]+ { - printf( "found a float, = %f\n", - atof( yytext ) ); - } - <expect>\n { - /* that's the end of the line, so - * we need another "expect-floats" - * before we'll recognize any more - * numbers - */ - BEGIN(INITIAL); - } - - [0-9]+ { - printf( "found an integer, = %d\n", - atoi( yytext ) ); - } - - "." printf( "found a dot\n" ); -]]> -</programlisting> -</informalexample> - -<!-- @cindex comments, example of scanning C comments --> -Here is a scanner which recognizes (and discards) C comments while -maintaining a count of the current input line. 
- -<!-- @cindex recognizing C comments --> -<informalexample> -<programlisting> -<![CDATA[ - %x comment - %% - int line_num = 1; - - "/*" BEGIN(comment); - - <comment>[^*\n]* /* eat anything that's not a '*' */ - <comment>"*"+[^*/\n]* /* eat up '*'s not followed by '/'s */ - <comment>\n ++line_num; - <comment>"*"+"/" BEGIN(INITIAL); -]]> -</programlisting> -</informalexample> - -This scanner goes to a bit of trouble to match as much -text as possible with each rule. In general, when attempting to write -a high-speed scanner try to match as much as possible in each rule, as -it's a big win. - -Note that start condition names are really integer values and -can be stored as such. Thus, the above could be extended in the -following fashion: - -<!-- @cindex start conditions, integer values --> -<!-- @cindex using integer values of start condition names --> -<informalexample> -<programlisting> -<![CDATA[ - %x comment foo - %% - int line_num = 1; - int comment_caller; - - "/*" { - comment_caller = INITIAL; - BEGIN(comment); - } - - ... - - <foo>"/*" { - comment_caller = foo; - BEGIN(comment); - } - - <comment>[^*\n]* /* eat anything that's not a '*' */ - <comment>"*"+[^*/\n]* /* eat up '*'s not followed by '/'s */ - <comment>\n ++line_num; - <comment>"*"+"/" BEGIN(comment_caller); -]]> -</programlisting> -</informalexample> - -<!-- @cindex YY_START, example --> -Furthermore, you can access the current start condition using the -integer-valued @code{YY_START} macro. For example, the above -assignments to @code{comment_caller} could instead be written - -<!-- @cindex getting current start state with YY_START --> -<informalexample> -<programlisting> -<![CDATA[ - comment_caller = YY_START; -]]> -</programlisting> -</informalexample> - -<!-- @vindex YY_START --> -Flex provides @code{YYSTATE} as an alias for @code{YY_START} (since that -is what's used by <acronym>AT&T</acronym> @code{lex}). 
- -For historical reasons, start conditions do not have their own -name-space within the generated scanner. The start condition names are -unmodified in the generated scanner and generated header. -@xref{option-header}. @xref{option-prefix}. - - - -Finally, here's an example of how to match C-style quoted strings using -exclusive start conditions, including expanded escape sequences (but -not including checking for a string that's too long): - -<!-- @cindex matching C-style double-quoted strings --> -<informalexample> -<programlisting> -<![CDATA[ - %x str - - %% - char string_buf[MAX_STR_CONST]; - char *string_buf_ptr; - - - \" string_buf_ptr = string_buf; BEGIN(str); - - <str>\" { /* saw closing quote - all done */ - BEGIN(INITIAL); - *string_buf_ptr = '\0'; - /* return string constant token type and - * value to parser - */ - } - - <str>\n { - /* error - unterminated string constant */ - /* generate error message */ - } - - <str>\\[0-7]{1,3} { - /* octal escape sequence */ - unsigned int result; - - (void) sscanf( yytext + 1, "%o", &result ); - - if ( result > 0xff ) - { - /* error, constant is out-of-bounds */ - } - else - *string_buf_ptr++ = result; - } - - <str>\\[0-9]+ { - /* generate error - bad escape sequence; something - * like '\48' or '\0777777' - */ - } - - <str>\\n *string_buf_ptr++ = '\n'; - <str>\\t *string_buf_ptr++ = '\t'; - <str>\\r *string_buf_ptr++ = '\r'; - <str>\\b *string_buf_ptr++ = '\b'; - <str>\\f *string_buf_ptr++ = '\f'; - - <str>\\(.|\n) *string_buf_ptr++ = yytext[1]; - - <str>[^\\\n\"]+ { - char *yptr = yytext; - - while ( *yptr ) - *string_buf_ptr++ = *yptr++; - } -]]> -</programlisting> -</informalexample> - -<!-- @cindex start condition, applying to multiple patterns --> -Often, such as in some of the examples above, you wind up writing a -whole bunch of rules all preceded by the same start condition(s). Flex -makes this a little easier and cleaner by introducing a notion of start -condition @dfn{scope}. 
A start condition scope is begun with: - -<informalexample> -<programlisting> -<![CDATA[ - <SCs>{ -]]> -</programlisting> -</informalexample> - -where @code{SCs} is a list of one or more start conditions. Inside the -start condition scope, every rule automatically has the prefix -@code{<SCs>} applied to it, until a @samp{}} which matches the initial -@samp{{}. So, for example, - -<!-- @cindex extended scope of start conditions --> -<informalexample> -<programlisting> -<![CDATA[ - <ESC>{ - "\\n" return '\n'; - "\\r" return '\r'; - "\\f" return '\f'; - "\\0" return '\0'; - } -]]> -</programlisting> -</informalexample> - -<para> -is equivalent to: -</para> - -<informalexample> -<programlisting> -<![CDATA[ - <ESC>"\\n" return '\n'; - <ESC>"\\r" return '\r'; - <ESC>"\\f" return '\f'; - <ESC>"\\0" return '\0'; -]]> -</programlisting> -</informalexample> - -<para> -Start condition scopes may be nested. -</para> - -<!-- @cindex stacks, routines for manipulating --> -<!-- @cindex start conditions, use of a stack --> - -<para> -The following routines are available for manipulating stacks of start conditions: -</para> - -<funcsynopsis> -<funcprototype> -<funcdef>void <function>yy_push_state</function></funcdef> - <paramdef>int <parameter>new_state</parameter></paramdef> -</funcprototype> -</funcsynopsis> - -pushes the current start condition onto the top of the start condition -stack and switches to -@code{new_state} -as though you had used -@code{BEGIN new_state} -(recall that start condition names are also integers). - -<funcsynopsis> -<funcprototype> -<funcdef>void <function>yy_pop_state</function></funcdef> - <void/> -</funcprototype> -</funcsynopsis> - -pops the top of the stack and switches to it via -@code{BEGIN}. - -<funcsynopsis> -<funcprototype> -<funcdef>int <function>yy_top_state</function></funcdef> - <void/> -</funcprototype> -</funcsynopsis> - -returns the top of the stack without altering the stack's contents. 
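
The behavior of the three stack routines just described can be pictured with a small standalone C model. It is a sketch under stated assumptions, not flex's implementation: the struct sc_model type and the sc_ function names are hypothetical, the current field stands in for what @code{YY_START} would report, and the doubling growth policy is an arbitrary choice.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical model of a scanner's start condition state. */
struct sc_model {
    int    current;    /* stands in for the value of YY_START */
    int   *stack;      /* saved start conditions, oldest first */
    size_t depth;      /* number of saved entries */
    size_t capacity;   /* allocated slots in stack */
};

/* Model of yy_push_state(): save the current start condition,
 * then switch to new_state as BEGIN would. */
static void sc_push_state(struct sc_model *m, int new_state)
{
    if (m->depth == m->capacity) {
        size_t new_cap = m->capacity ? m->capacity * 2 : 4;
        int *grown = realloc(m->stack, new_cap * sizeof *grown);

        if (grown == NULL)
            abort();   /* the model simply aborts when memory runs out */
        m->stack = grown;
        m->capacity = new_cap;
    }
    m->stack[m->depth++] = m->current;
    m->current = new_state;
}

/* Model of yy_pop_state(): pop the top entry and switch to it. */
static void sc_pop_state(struct sc_model *m)
{
    assert(m->depth > 0);
    m->current = m->stack[--m->depth];
}

/* Model of yy_top_state(): peek at the top entry without popping. */
static int sc_top_state(const struct sc_model *m)
{
    assert(m->depth > 0);
    return m->stack[m->depth - 1];
}
```

In this model the top of the stack is always the start condition that was active just before the most recent push, which is why a pop returns the scanner to where it was.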
- -<!-- @cindex memory, for start condition stacks --> - -<para> -The start condition stack grows dynamically and so has no built-in size -limitation. If memory is exhausted, program execution aborts. -</para> - -<para> -To use start condition stacks, your scanner must include a @code{%option -stack} directive (@pxref{Scanner Options}). -</para> - -</chapter> - -<chapter> -<title>Multiple Input Buffers</title> - -<!-- @cindex multiple input streams --> - -<para> -Some scanners (such as those which support ``include'' files) require -reading from several input streams. As <application>flex</application> scanners do a large -amount of buffering, one cannot control where the next input will be -read from by simply writing a <function>YY_INPUT</function> which is sensitive to -the scanning context. <function>YY_INPUT</function> is only called when the scanner -reaches the end of its buffer, which may be a long time after scanning a -statement such as an @code{include} statement which requires switching -the input source. -</para> - -<para> -To negotiate these sorts of problems, <application>flex</application> provides a mechanism -for creating and switching between multiple input buffers. An input -buffer is created by using: -</para> - -<!-- @cindex memory, allocating input buffers --> - -<funcsynopsis> -<funcprototype> -<funcdef>YY_BUFFER_STATE <function>yy_create_buffer</function></funcdef> - <paramdef>FILE *<parameter>file</parameter></paramdef> - <paramdef>int <parameter>size</parameter></paramdef> -</funcprototype> -</funcsynopsis> - -<para> -which takes a @code{FILE} pointer and a size and creates a buffer -associated with the given file and large enough to hold @code{size} -characters (when in doubt, use @code{YY_BUF_SIZE} for the size). It -returns a @code{YY_BUFFER_STATE} handle, which may then be passed to -other routines (see below). 
-<!-- @tindex YY_BUFFER_STATE --> -The @code{YY_BUFFER_STATE} type is a -pointer to an opaque @code{struct yy_buffer_state} structure, so you may -safely initialize @code{YY_BUFFER_STATE} variables to @code{((YY_BUFFER_STATE) -0)} if you wish, and also refer to the opaque structure in order to -correctly declare input buffers in source files other than that of your -scanner. Note that the @code{FILE} pointer in the call to -<function>yy_create_buffer</function> is only used as the value of <filename>yyin</filename> seen by -@code{YY_INPUT}. If you redefine <function>YY_INPUT</function> so it no longer uses -<filename>yyin</filename>, then you can safely pass a NULL @code{FILE} pointer to -<function>yy_create_buffer</function>. You select a particular buffer to scan from -using: -</para> - -<funcsynopsis> -<funcprototype> -<funcdef>void <function>yy_switch_to_buffer</function></funcdef> - <paramdef> YY_BUFFER_STATE <parameter>new_buffer</parameter> </paramdef> -</funcprototype> -</funcsynopsis> - - -<para>The above function switches the scanner's input buffer so subsequent tokens -will come from @code{new_buffer}. Note that <function>yy_switch_to_buffer</function> may -be used by <function>yywrap</function> to set things up for continued scanning, instead of -opening a new file and pointing <filename>yyin</filename> at it. If you are looking for a -stack of input buffers, then you want to use <function>yypush_buffer_state</function> -instead of this function. Note also that switching input sources via either -<function>yy_switch_to_buffer</function> or <function>yywrap</function> does <emphasis>not</emphasis> change the -start condition. -</para> - -<!-- @cindex memory, deleting input buffers --> - -<funcsynopsis> -<funcprototype> -<funcdef>void <function>yy_delete_buffer</function></funcdef> - <paramdef> YY_BUFFER_STATE <parameter>buffer</parameter> </paramdef> -</funcprototype> -</funcsynopsis> - -<para> -is used to reclaim the storage associated with a buffer. 
(@code{buffer} -can be NULL, in which case the routine does nothing.) -</para> - -<!-- @cindex pushing an input buffer --> -<!-- @cindex stack, input buffer push --> - -<funcsynopsis> -<funcprototype> -<funcdef>void <function>yypush_buffer_state</function></funcdef> - <paramdef> YY_BUFFER_STATE <parameter>buffer</parameter> </paramdef> -</funcprototype> -</funcsynopsis> - -<para> -This function pushes the new buffer state onto an internal stack. The pushed -state becomes the new current state. The stack is maintained by flex and will -grow as required. This function is intended to be used instead of -<function>yy_switch_to_buffer</function> when you want to change states but preserve the -current state for later use. -</para> - -<!-- @cindex popping an input buffer --> -<!-- @cindex stack, input buffer pop --> - -<funcsynopsis> -<funcprototype> -<funcdef>void <function>yypop_buffer_state</function></funcdef> - <void/> -</funcprototype> -</funcsynopsis> - -<para> -This function removes the current state from the top of the stack, and deletes -it by calling <function>yy_delete_buffer</function>. The next state on the stack, if any, -becomes the new current state. -</para> - -<!-- @cindex clearing an input buffer --> -<!-- @cindex flushing an input buffer --> - -<para> -You can also clear the current contents of a buffer using: -</para> - -<funcsynopsis> -<funcprototype> -<funcdef>void <function>yy_flush_buffer</function></funcdef> - <paramdef> YY_BUFFER_STATE <parameter>buffer</parameter> </paramdef> -</funcprototype> -</funcsynopsis> - -<para> -This function discards the buffer's contents, -so the next time the scanner attempts to match a token from the -buffer, it will first fill the buffer anew using -<function>YY_INPUT</function>.
-</para> - -<funcsynopsis> -<funcprototype> -<funcdef>YY_BUFFER_STATE <function>yy_new_buffer</function></funcdef> - <paramdef>FILE *<parameter>file</parameter></paramdef> - <paramdef>int <parameter>size</parameter></paramdef> -</funcprototype> -</funcsynopsis> - -<para> -is an alias for <function>yy_create_buffer</function>, -provided for compatibility with the C++ use of @code{new} and -@code{delete} for creating and destroying dynamic objects. -</para> - -<!-- @cindex YY_CURRENT_BUFFER, and multiple buffers --> - -<para> -Finally, the @code{YY_CURRENT_BUFFER} macro returns a @code{YY_BUFFER_STATE} handle to the -current buffer. It should not be used as an lvalue. -</para> - -<!-- @cindex EOF, example using multiple input buffers --> - -<para> -Here are two examples of using these features for writing a scanner -which expands include files (the -@code{<<EOF>>} -feature is discussed below). -</para> - -<para> -This first example uses <function>yypush_buffer_state</function> and <function>yypop_buffer_state</function>. Flex -maintains the stack internally. -</para> - -<!-- @cindex handling include files with multiple input buffers --> -<informalexample> -<programlisting> -<![CDATA[ - /* the "incl" state is used for picking up the name - * of an include file - */ - %x incl - %% - include BEGIN(incl); - - [a-z]+ ECHO; - [^a-z\n]*\n? ECHO; - - <incl>[ \t]* /* eat the whitespace */ - <incl>[^ \t\n]+ { /* got the include file name */ - yyin = fopen( yytext, "r" ); - - if ( ! yyin ) - error( ... ); - - yypush_buffer_state(yy_create_buffer( yyin, YY_BUF_SIZE )); - - BEGIN(INITIAL); - } - - <<EOF>> { - yypop_buffer_state(); - - if ( !YY_CURRENT_BUFFER ) - { - yyterminate(); - } - } -]]> -</programlisting> -</informalexample> - -<para> -The second example, below, does the same thing as the first, but -manages its own input buffer stack manually (instead of letting flex do it).
-</para> - -<!-- @cindex handling include files with multiple input buffers --> -<informalexample> -<programlisting> -<![CDATA[ - /* the "incl" state is used for picking up the name - * of an include file - */ - %x incl - - %{ - #define MAX_INCLUDE_DEPTH 10 - YY_BUFFER_STATE include_stack[MAX_INCLUDE_DEPTH]; - int include_stack_ptr = 0; - %} - - %% - include BEGIN(incl); - - [a-z]+ ECHO; - [^a-z\n]*\n? ECHO; - - <incl>[ \t]* /* eat the whitespace */ - <incl>[^ \t\n]+ { /* got the include file name */ - if ( include_stack_ptr >= MAX_INCLUDE_DEPTH ) - { - fprintf( stderr, "Includes nested too deeply" ); - exit( 1 ); - } - - include_stack[include_stack_ptr++] = - YY_CURRENT_BUFFER; - - yyin = fopen( yytext, "r" ); - - if ( ! yyin ) - error( ... ); - - yy_switch_to_buffer( - yy_create_buffer( yyin, YY_BUF_SIZE ) ); - - BEGIN(INITIAL); - } - - <<EOF>> { - if ( --include_stack_ptr < 0 ) - { - yyterminate(); - } - - else - { - yy_delete_buffer( YY_CURRENT_BUFFER ); - yy_switch_to_buffer( - include_stack[include_stack_ptr] ); - } - } -]]> -</programlisting> -</informalexample> - -@anchor{Scanning Strings} -<!-- @cindex strings, scanning strings instead of files --> -The following routines are available for setting up input buffers for -scanning in-memory strings instead of files. All of them create a new -input buffer for scanning the string, and return a corresponding -@code{YY_BUFFER_STATE} handle (which you should delete with -<function>yy_delete_buffer</function> when done with it). They also switch to the -new buffer using <function>yy_switch_to_buffer</function>, so the next call to -<function>yylex</function> will start scanning the string. - -<funcsynopsis> -<funcprototype> -<funcdef>YY_BUFFER_STATE <function>yy_scan_string</function></funcdef> - <paramdef> const char *<parameter>str</parameter> </paramdef> -</funcprototype> -</funcsynopsis> -scans a NUL-terminated string.
- -<funcsynopsis> -<funcprototype> -<funcdef>YY_BUFFER_STATE <function>yy_scan_bytes</function></funcdef> - <paramdef> const char *<parameter>bytes</parameter> </paramdef> - <paramdef> int <parameter>len</parameter> </paramdef> -</funcprototype> -</funcsynopsis> -scans @code{len} bytes (including possibly @code{NUL}s) starting at location -@code{bytes}. - -Note that both of these functions create and scan a <emphasis>copy</emphasis> of the -string or bytes. (This may be desirable, since <function>yylex</function> modifies -the contents of the buffer it is scanning.) You can avoid the copy by -using: - -<!-- @vindex YY_END_OF_BUFFER_CHAR --> - -<funcsynopsis> -<funcprototype> -<funcdef>YY_BUFFER_STATE <function>yy_scan_buffer</function></funcdef> - <paramdef> char *<parameter>base</parameter> </paramdef> - <paramdef> yy_size_t <parameter>size</parameter> </paramdef> -</funcprototype> -</funcsynopsis> - -which scans in place the buffer starting at @code{base}, consisting of -@code{size} bytes, the last two bytes of which <emphasis>must</emphasis> be -@code{YY_END_OF_BUFFER_CHAR} (ASCII NUL). These last two bytes are not -scanned; thus, scanning consists of @code{base[0]} through -@code{base[size-2]}, inclusive. - -If you fail to set up @code{base} in this manner (i.e., forget the final -two @code{YY_END_OF_BUFFER_CHAR} bytes), then <function>yy_scan_buffer</function> -returns a NULL pointer instead of creating a new input buffer. - -The data type @code{yy_size_t} -is an integral type to which you can cast an integer expression -reflecting the size of the buffer. - -</chapter> - -<chapter> -<title>End-of-File Rules</title> - -<!-- @cindex EOF, explanation --> -The special rule @code{<<EOF>>} indicates -actions which are to be taken when an end-of-file is -encountered and <function>yywrap</function> returns non-zero (i.e., indicates -no further files to process). The action must finish -by doing one of the following things: - -<itemizedlist> - -<listitem> - -<!-- @findex YY_NEW_FILE (now obsolete) --> -assigning <filename>yyin</filename> to a new input file (in previous versions of -<application>flex</application>, after doing the assignment you had to call the special -action @code{YY_NEW_FILE}. This is no longer necessary.)
- -</listitem> -<listitem> - -executing a @code{return} statement; - -</listitem> -<listitem> - -executing the special <function>yyterminate</function> action. - -</listitem> -<listitem> - -or, switching to a new buffer using <function>yy_switch_to_buffer</function> as -shown in the example above. -</listitem> -</itemizedlist> - - -<<EOF>> rules may not be used with other patterns; they may only be -qualified with a list of start conditions. If an unqualified <<EOF>> -rule is given, it applies to <emphasis>all</emphasis> start conditions which do not -already have <<EOF>> actions. To specify an <<EOF>> rule for only the -initial start condition, use: - -<informalexample> -<programlisting> -<![CDATA[ - <INITIAL><<EOF>> -]]> -</programlisting> -</informalexample> - -These rules are useful for catching things like unclosed comments. An -example: - -<!-- @cindex <<EOF>>, use of --> -<informalexample> -<programlisting> -<![CDATA[ - %x quote - %% - - ...other rules for dealing with quotes... - - <quote><<EOF>> { - error( "unterminated quote" ); - yyterminate(); - } - <<EOF>> { - if ( *++filelist ) - yyin = fopen( *filelist, "r" ); - else - yyterminate(); - } -]]> -</programlisting> -</informalexample> - -</chapter> - -<chapter> -<title>Miscellaneous Macros</title> - -@hkindex YY_USER_ACTION -The macro @code{YY_USER_ACTION} can be defined to provide an action -which is always executed prior to the matched rule's action. For -example, it could be #define'd to call a routine to convert yytext to -lower-case. When @code{YY_USER_ACTION} is invoked, the variable -@code{yy_act} gives the number of the matched rule (rules are numbered -starting with 1). Suppose you want to profile how often each of your -rules is matched. 
The following would do the trick: - -<!-- @cindex YY_USER_ACTION to track each time a rule is matched --> -<informalexample> -<programlisting> -<![CDATA[ - #define YY_USER_ACTION ++ctr[yy_act] -]]> -</programlisting> -</informalexample> - -<!-- @vindex YY_NUM_RULES --> -where @code{ctr} is an array to hold the counts for the different rules. -Note that the macro @code{YY_NUM_RULES} gives the total number of rules -(including the default rule, even if you use @samp{-s}), so a correct -declaration for @code{ctr} is: - -<informalexample> -<programlisting> -<![CDATA[ - int ctr[YY_NUM_RULES]; -]]> -</programlisting> -</informalexample> - -@hkindex YY_USER_INIT -The macro @code{YY_USER_INIT} may be defined to provide an action which -is always executed before the first scan (and before the scanner's -internal initializations are done). For example, it could be used to -call a routine to read in a data table or open a logging file. - -<!-- @findex yy_set_interactive --> -The macro @code{yy_set_interactive(is_interactive)} can be used to -control whether the current buffer is considered @dfn{interactive}. An -interactive buffer is processed more slowly, but must be used when the -scanner's input source is indeed interactive to avoid problems due to -waiting to fill buffers (see the discussion of the @samp{-I} flag in -@ref{Scanner Options}). A non-zero value in the macro invocation marks -the buffer as interactive, a zero value as non-interactive. Note that -use of this macro overrides @code{%option always-interactive} or -@code{%option never-interactive} (@pxref{Scanner Options}). -<function>yy_set_interactive</function> must be invoked prior to beginning to scan -the buffer that is (or is not) to be considered interactive. - -<!-- @cindex BOL, setting it --> -<!-- @findex yy_set_bol --> -The macro @code{yy_set_bol(at_bol)} can be used to control whether the -current buffer's scanning context for the next token match is done as -though at the beginning of a line.
A non-zero macro argument makes -rules anchored with @samp{^} active, while a zero argument makes -@samp{^} rules inactive. - -<!-- @cindex BOL, checking the BOL flag --> -<!-- @findex YY_AT_BOL --> -The macro <function>YY_AT_BOL</function> returns true if the next token scanned from -the current buffer will have @samp{^} rules active, false otherwise. - -<!-- @cindex actions, redefining YY_BREAK --> -@hkindex YY_BREAK -In the generated scanner, the actions are all gathered in one large -switch statement and separated using @code{YY_BREAK}, which may be -redefined. By default, it is simply a @code{break}, to separate each -rule's action from the following rule's. Redefining @code{YY_BREAK} -allows, for example, C++ users to #define YY_BREAK to do nothing (while -being very careful that every rule ends with a @code{break} or a -@code{return}!) to avoid unreachable-statement warnings -that arise when a rule's action ends with @code{return}, leaving the -@code{YY_BREAK} that follows it unreachable. - -</chapter> - -<chapter> -<title>Values Available To the User</title> - -This chapter summarizes the various values available to the user in the -rule actions. - -<variablelist> -<!-- @vindex yytext --> - -<varlistentry><term>char *yytext</term> -<listitem> -holds the text of the current token. It may be modified but not -lengthened (you cannot append characters to the end). - -<!-- @cindex yytext, default array size --> -<!-- @cindex array, default size for yytext --> -<!-- @vindex YYLMAX --> -If the special directive @code{%array} appears in the first section of -the scanner description, then <varname>yytext</varname> is instead declared -@code{char yytext[YYLMAX]}, where @code{YYLMAX} is a macro definition -that you can redefine in the first section if you don't like the default -value (generally 8KB).
Using @code{%array} results in somewhat slower -scanners, but the value of <varname>yytext</varname> becomes immune to calls to -<function>unput</function>, which potentially destroy its value when <varname>yytext</varname> is -a character pointer. The opposite of @code{%array} is @code{%pointer}, -which is the default. - -<!-- @cindex C++ and %array --> -You cannot use @code{%array} when generating C++ scanner classes (the -@samp{-+} flag). - -<!-- @vindex yyleng --> -</listitem> -</varlistentry> - -<varlistentry><term>int yyleng</term> -<listitem> -holds the length of the current token. - -<!-- @vindex yyin --> -</listitem> -</varlistentry> - -<varlistentry><term>FILE *yyin</term> -<listitem> -is the file which by default <application>flex</application> reads from. It may be -reassigned, but doing so only makes sense before scanning begins or after -an EOF has been encountered. Changing it in the midst of scanning will -have unexpected results since <application>flex</application> buffers its input; use -<function>yyrestart</function> instead. Once scanning terminates because an -end-of-file has been seen, you can point <filename>yyin</filename> at the new input -file and then call the scanner again to continue scanning. - -<!-- @findex yyrestart --> -</listitem> -</varlistentry> - -<varlistentry><term>void yyrestart( FILE *new_file )</term> -<listitem> -may be called to point <filename>yyin</filename> at the new input file. The -switch-over to the new file is immediate (any previously buffered-up -input is lost). Note that calling <function>yyrestart</function> with <filename>yyin</filename> -as an argument thus throws away the current input buffer and continues -scanning the same input file. - -<!-- @vindex yyout --> -</listitem> -</varlistentry> - -<varlistentry><term>FILE *yyout</term> -<listitem> -is the file to which @code{ECHO} actions are done. It can be reassigned -by the user.
- -<!-- @vindex YY_CURRENT_BUFFER --> -</listitem> -</varlistentry> - -<varlistentry><term>YY_CURRENT_BUFFER</term> -<listitem> -returns a @code{YY_BUFFER_STATE} handle to the current buffer. - -<!-- @vindex YY_START --> -</listitem> -</varlistentry> - -<varlistentry><term>YY_START</term> -<listitem> -returns an integer value corresponding to the current start condition. -You can subsequently use this value with @code{BEGIN} to return to that -start condition. -</listitem> -</varlistentry> -</variablelist> - -</chapter> - -<chapter> -<title>Interfacing with <application>Yacc</application></title> - -<!-- @cindex yacc, interface --> - -<!-- @vindex yylval, with yacc --> -One of the main uses of <application>flex</application> is as a companion to the <application>yacc</application> -parser-generator. <application>yacc</application> parsers expect to call a routine named -<function>yylex</function> to find the next input token. The routine is supposed to -return the type of the next token as well as putting any associated -value in the global <varname>yylval</varname>. To use <application>flex</application> with <application>yacc</application>, -one specifies the @samp{-d} option to <application>yacc</application> to instruct it to -generate the file <filename>y.tab.h</filename> containing definitions of all the -@code{%tokens} appearing in the <application>yacc</application> input. This file is then -included in the <application>flex</application> scanner. 
For example, if one of the tokens -is @code{TOK_NUMBER}, part of the scanner might look like: - -<!-- @cindex yacc interface --> -<informalexample> -<programlisting> -<![CDATA[ - %{ - #include "y.tab.h" - %} - - %% - - [0-9]+ yylval = atoi( yytext ); return TOK_NUMBER; -]]> -</programlisting> -</informalexample> - -</chapter> - -<chapter> -<title>Scanner Options</title> - -<!-- @cindex command-line options --> -<!-- @cindex options, command-line --> -<!-- @cindex arguments, command-line --> - -The various <application>flex</application> options are categorized by function in the following -menu. If you want to look up a particular option by name, see @ref{Index of Scanner Options}. - -<!-- -@menu -* Options for Specifying Filenames:: -* Options Affecting Scanner Behavior:: -* Code-Level And API Options:: -* Options for Scanner Speed and Size:: -* Debugging Options:: -* Miscellaneous Options:: -@end menu ---> - - -Even though there are many scanner options, a typical scanner might only -specify the following options: - -<informalexample> -<programlisting> -<![CDATA[ -%option 8bit reentrant bison-bridge -%option warn nodefault -%option yylineno -%option outfile="scanner.c" header-file="scanner.h" -]]> -</programlisting> -</informalexample> - -The first line specifies the general type of scanner we want. The second line -specifies that we are being careful. The third line asks flex to track line -numbers. The last line tells flex what to name the files. (The options can be -specified in any order. We just divided them up for clarity.) - -<application>flex</application> also provides a mechanism for controlling options within the -scanner specification itself, rather than from the flex command-line. -This is done by including @code{%option} directives in the first section -of the scanner specification. You can specify multiple options with a -single @code{%option} directive, and multiple directives in the first -section of your flex input file.
- -Most options are given simply as names, optionally preceded by the -word @samp{no} (with no intervening whitespace) to negate their meaning. -The names are the same as their long-option equivalents (but without the -leading @samp{--}). - -<application>flex</application> scans your rule actions to determine whether you use the -@code{REJECT} or <function>yymore</function> features. The @code{reject} and -@code{yymore} options are available to override its decision as to -whether you use these features, either by setting them (e.g., @code{%option -reject}) to indicate the feature is indeed used, or unsetting them to -indicate it actually is not used (e.g., @code{%option noyymore}). - - -A number of options are available for lint purists who want to suppress -the appearance of unneeded routines in the generated scanner. Each of -the following, if unset (e.g., @code{%option nounput}), results in the -corresponding routine not appearing in the generated scanner: - -<informalexample> -<programlisting> -<![CDATA[ - input, unput - yy_push_state, yy_pop_state, yy_top_state - yy_scan_buffer, yy_scan_bytes, yy_scan_string - - yyget_extra, yyset_extra, yyget_leng, yyget_text, - yyget_lineno, yyset_lineno, yyget_in, yyset_in, - yyget_out, yyset_out, yyget_lval, yyset_lval, - yyget_lloc, yyset_lloc, yyget_debug, yyset_debug -]]> -</programlisting> -</informalexample> - -(though <function>yy_push_state</function> and friends won't appear anyway unless -you use @code{%option stack}). - -<section> -<title>Options for Specifying Filenames</title> - -<variablelist> - -@anchor{option-header} -@opindex ---header-file -@opindex header-file - -<varlistentry><term>--header-file=FILE, @code{%option header-file="FILE"}</term> -<listitem> -instructs flex to write a C header to <filename>FILE</filename>. This file contains -function prototypes, extern variables, and types used by the scanner. -Only the external API is exported by the header file.
Many macros that -are usable from within scanner actions are not exported to the header -file. This is due to namespace problems and the goal of a clean -external API. - -While in the header, the macro @code{yyIN_HEADER} is defined, where @samp{yy} -is substituted with the appropriate prefix. - -The @samp{--header-file} option is not compatible with the @samp{--c++} option, -since the C++ scanner provides its own header in <filename>yyFlexLexer.h</filename>. - - - -@anchor{option-outfile} -@opindex -o -@opindex ---outfile -@opindex outfile -</listitem> -</varlistentry> - -<varlistentry><term>-oFILE, --outfile=FILE, @code{%option outfile="FILE"}</term> -<listitem> -directs flex to write the scanner to the file <filename>FILE</filename> instead of -<filename>lex.yy.c</filename>. If you combine @samp{--outfile} with the @samp{--stdout} option, -then the scanner is written to <filename>stdout</filename> but its @code{#line} -directives (see the @samp{-l} option above) refer to the file -<filename>FILE</filename>. - - - -@anchor{option-stdout} -@opindex -t -@opindex ---stdout -@opindex stdout -</listitem> -</varlistentry> - -<varlistentry><term>-t, --stdout, @code{%option stdout}</term> -<listitem> -instructs <application>flex</application> to write the scanner it generates to standard -output instead of <filename>lex.yy.c</filename>. - - - -@opindex ---skel -</listitem> -</varlistentry> - -<varlistentry><term>-SFILE, --skel=FILE</term> -<listitem> -overrides the default skeleton file from which -<application>flex</application> -constructs its scanners. You'll never need this option unless you are doing -<application>flex</application> -maintenance or development. - -@opindex ---tables-file -@opindex tables-file -</listitem> -</varlistentry> - -<varlistentry><term>--tables-file=FILE</term> -<listitem> -Write serialized scanner dfa tables to FILE. The generated scanner will not -contain the tables, and requires them to be loaded at runtime. -@xref{serialization}. 
- -@opindex ---tables-verify -@opindex tables-verify -</listitem> -</varlistentry> - -<varlistentry><term>--tables-verify</term> -<listitem> -This option is for flex development. We document it here in case you stumble -upon it by accident or in case you suspect some inconsistency in the serialized -tables. Flex will serialize the scanner DFA tables but will also generate the -in-code tables as it normally does. At runtime, the scanner will verify that -the serialized tables match the in-code tables, instead of loading them. - -</listitem> -</varlistentry> -</variablelist> - -</section> - -<section> -<title>Options Affecting Scanner Behavior</title> - -<variablelist> -@anchor{option-case-insensitive} -@opindex -i -@opindex ---case-insensitive -@opindex case-insensitive - -<varlistentry><term>-i, --case-insensitive, @code{%option case-insensitive}</term> -<listitem> -instructs <application>flex</application> to generate a @dfn{case-insensitive} scanner. The -case of letters given in the <application>flex</application> input patterns will be ignored, -and tokens in the input will be matched regardless of case. The matched -text given in <varname>yytext</varname> will preserve the original case (i.e., it will -not be folded). For tricky behavior, see @ref{case and character ranges}. - - - -@anchor{option-lex-compat} -@opindex -l -@opindex ---lex-compat -@opindex lex-compat -</listitem> -</varlistentry> - -<varlistentry><term>-l, --lex-compat, @code{%option lex-compat}</term> -<listitem> -turns on maximum compatibility with the original <acronym>AT&amp;T</acronym> @code{lex} -implementation. Note that this does not mean <emphasis>full</emphasis> compatibility. -Use of this option costs a considerable amount of performance, and it -cannot be used with the @samp{--c++}, @samp{--full}, @samp{--fast}, @samp{-Cf}, or -@samp{-CF} options. For details on the compatibilities it provides, see -@ref{Lex and Posix}.
This option also results in the name -@code{YY_FLEX_LEX_COMPAT} being @code{#define}'d in the generated scanner. - - - -@anchor{option-batch} -@opindex -B -@opindex ---batch -@opindex batch -</listitem> -</varlistentry> - -<varlistentry><term>-B, --batch, @code{%option batch}</term> -<listitem> -instructs <application>flex</application> to generate a @dfn{batch} scanner, the opposite of -<emphasis>interactive</emphasis> scanners generated by @samp{--interactive} (see below). In -general, you use @samp{-B} when you are <emphasis>certain</emphasis> that your scanner -will never be used interactively, and you want to squeeze a -<emphasis>little</emphasis> more performance out of it. If your goal is instead to -squeeze out a <emphasis>lot</emphasis> more performance, you should be using the -@samp{-Cf} or @samp{-CF} options, which turn on @samp{--batch} automatically -anyway. - - - -@anchor{option-interactive} -@opindex -I -@opindex ---interactive -@opindex interactive -</listitem> -</varlistentry> - -<varlistentry><term>-I, --interactive, @code{%option interactive}</term> -<listitem> -instructs <application>flex</application> to generate an @i{interactive} scanner. An -interactive scanner is one that only looks ahead to decide what token -has been matched if it absolutely must. It turns out that always -looking one extra character ahead, even if the scanner has already seen -enough text to disambiguate the current token, is a bit faster than only -looking ahead when necessary. But scanners that always look ahead give -dreadful interactive performance; for example, when a user types a -newline, it is not recognized as a newline token until they enter -<emphasis>another</emphasis> token, which often means typing in another whole line. - -<application>flex</application> scanners default to @code{interactive} unless you use the -@samp{-Cf} or @samp{-CF} table-compression options -(@pxref{Performance}). 
That's because if you're looking for -high-performance you should be using one of these options, so if you -didn't, <application>flex</application> assumes you'd rather trade off a bit of run-time -performance for intuitive interactive behavior. Note also that you -<emphasis>cannot</emphasis> use @samp{--interactive} in conjunction with @samp{-Cf} or -@samp{-CF}. Thus, this option is not really needed; it is on by default -for all those cases in which it is allowed. - -You can force a scanner to -<emphasis>not</emphasis> -be interactive by using -@samp{--batch} - - - -@anchor{option-7bit} -@opindex -7 -@opindex ---7bit -@opindex 7bit -</listitem> -</varlistentry> - -<varlistentry><term>-7, --7bit, @code{%option 7bit}</term> -<listitem> -instructs <application>flex</application> to generate a 7-bit scanner, i.e., one which can -only recognize 7-bit characters in its input. The advantage of using -@samp{--7bit} is that the scanner's tables can be up to half the size of -those generated using the @samp{--8bit}. The disadvantage is that such -scanners often hang or crash if their input contains an 8-bit character. - -Note, however, that unless you generate your scanner using the -@samp{-Cf} or @samp{-CF} table compression options, use of @samp{--7bit} -will save only a small amount of table space, and make your scanner -considerably less portable. <application>flex</application>'s default behavior is to -generate an 8-bit scanner unless you use the @samp{-Cf} or @samp{-CF}, -in which case <application>flex</application> defaults to generating 7-bit scanners unless -your site was always configured to generate 8-bit scanners (as will -often be the case with non-USA sites). You can tell whether flex -generated a 7-bit or an 8-bit scanner by inspecting the flag summary in -the @samp{--verbose} output as described above. 
- -Note that if you use @samp{-Cfe} or @samp{-CFe}, <application>flex</application> still -defaults to generating an 8-bit scanner, since usually with these -compression options full 8-bit tables are not much more expensive than -7-bit tables. - - - -@anchor{option-8bit} -@opindex -8 -@opindex ---8bit -@opindex 8bit -</listitem> -</varlistentry> - -<varlistentry><term>-8, --8bit, @code{%option 8bit}</term> -<listitem> -instructs <application>flex</application> to generate an 8-bit scanner, i.e., one which can -recognize 8-bit characters. This flag is only needed for scanners -generated using @samp{-Cf} or @samp{-CF}, as otherwise flex defaults to -generating an 8-bit scanner anyway. - -See the discussion of -@samp{--7bit} -above for <application>flex</application>'s default behavior and the tradeoffs between 7-bit -and 8-bit scanners. - - - -@anchor{option-default} -@opindex ---default -@opindex default -</listitem> -</varlistentry> - -<varlistentry><term>--default, @code{%option default}</term> -<listitem> -generates the default rule. - - - -@anchor{option-always-interactive} -@opindex ---always-interactive -@opindex always-interactive -</listitem> -</varlistentry> - -<varlistentry><term>--always-interactive, @code{%option always-interactive}</term> -<listitem> -instructs flex to generate a scanner which always considers its input -<emphasis>interactive</emphasis>. Normally, on each new input file the scanner calls -<function>isatty</function> in an attempt to determine whether the scanner's input -source is interactive and thus should be read a character at a time. -When this option is used, however, no such call is made. - - - -@opindex ---never-interactive -</listitem> -</varlistentry> - -<varlistentry><term>--never-interactive, @code{%option never-interactive}</term> -<listitem> -instructs flex to generate a scanner which never considers its input -interactive. This is the opposite of @code{always-interactive}.
- - -@anchor{option-posix} -@opindex -X -@opindex ---posix -@opindex posix -</listitem> -</varlistentry> - -<varlistentry><term>-X, --posix, @code{%option posix}</term> -<listitem> -turns on maximum compatibility with the POSIX 1003.2-1992 definition of -@code{lex}. Since <application>flex</application> was originally designed to implement the -POSIX definition of @code{lex}, this generally involves very few changes -in behavior. At the current writing the known differences between -<application>flex</application> and the POSIX standard are: - -<itemizedlist> - -<listitem> - -In POSIX and <acronym>AT&amp;T</acronym> @code{lex}, the repeat operator, @samp{{}}, has lower -precedence than concatenation (thus @samp{ab{3}} yields @samp{ababab}). -Most POSIX utilities use an Extended Regular Expression (ERE) precedence -that has the precedence of the repeat operator higher than concatenation -(which causes @samp{ab{3}} to yield @samp{abbb}). By default, <application>flex</application> -places the precedence of the repeat operator higher than concatenation, -which matches the ERE processing of other POSIX utilities. When either -@samp{--posix} or @samp{-l} is specified, <application>flex</application> will use the -traditional <acronym>AT&amp;T</acronym> and POSIX-compliant precedence for the repeat operator, -where concatenation has higher precedence than the repeat operator. -</listitem> -</itemizedlist> - - - -@anchor{option-stack} -@opindex ---stack -@opindex stack -</listitem> -</varlistentry> - -<varlistentry><term>--stack, @code{%option stack}</term> -<listitem> -enables the use of -start condition stacks (@pxref{Start Conditions}).
- - - -@anchor{option-stdinit} -@opindex ---stdinit -@opindex stdinit -</listitem> -</varlistentry> - -<varlistentry><term>--stdinit, @code{%option stdinit}</term> -<listitem> -if set (i.e., @b{%option stdinit}) initializes <varname>yyin</varname> and -<varname>yyout</varname> to <filename>stdin</filename> and <filename>stdout</filename>, instead of the default of -<filename>NULL</filename>. Some existing @code{lex} programs depend on this behavior, -even though it is not compliant with ANSI C, which does not require -<filename>stdin</filename> and <filename>stdout</filename> to be compile-time constants. In a -reentrant scanner, however, this is not a problem since initialization -is performed in <function>yylex_init</function> at runtime. - - - -@anchor{option-yylineno} -@opindex ---yylineno -@opindex yylineno -</listitem> -</varlistentry> - -<varlistentry><term>--yylineno, @code{%option yylineno}</term> -<listitem> -directs <application>flex</application> to generate a scanner -that maintains the number of the current line read from its input in the -global variable <varname>yylineno</varname>. This option is implied by @code{%option -lex-compat}. In a reentrant C scanner, the macro <varname>yylineno</varname> is -accessible regardless of the value of @code{%option yylineno}; however, its -value is not modified by <application>flex</application> unless @code{%option yylineno} is enabled. - - - -@anchor{option-yywrap} -@opindex ---yywrap -@opindex yywrap -</listitem> -</varlistentry> - -<varlistentry><term>--yywrap, @code{%option yywrap}</term> -<listitem> -if unset (i.e., @code{--noyywrap}), makes the scanner not call -<function>yywrap</function> upon an end-of-file, but simply assume that there are no -more files to scan (until the user points <filename>yyin</filename> at a new file and -calls <function>yylex</function> again).
- -</listitem> -</varlistentry> -</variablelist> - -</section> - -<section> -<title>Code-Level And API Options</title> - -<variablelist> - -@anchor{option-ansi-definitions} -@opindex ---option-ansi-definitions -@opindex ansi-definitions - -<varlistentry><term>--ansi-definitions, @code{%option ansi-definitions}</term> -<listitem> -instructs flex to generate ANSI C99 definitions for functions. -This option is enabled by default. -If @code{%option noansi-definitions} is specified, then the obsolete style -is generated. - -@anchor{option-ansi-prototypes} -@opindex ---option-ansi-prototypes -@opindex ansi-prototypes -</listitem> -</varlistentry> - -<varlistentry><term>--ansi-prototypes, @code{%option ansi-prototypes}</term> -<listitem> -instructs flex to generate ANSI C99 prototypes for functions. -This option is enabled by default. -If @code{noansi-prototypes} is specified, then -prototypes will have empty parameter lists. - -@anchor{option-bison-bridge} -@opindex ---bison-bridge -@opindex bison-bridge -</listitem> -</varlistentry> - -<varlistentry><term>--bison-bridge, @code{%option bison-bridge}</term> -<listitem> -instructs flex to generate a C scanner that is -meant to be called by a -@code{GNU bison} -parser. The scanner has minor API changes for -<application>bison</application> -compatibility. In particular, the declaration of -<function>yylex</function> -is modified to take an additional parameter, -<varname>yylval</varname>. -@xref{Bison Bridge}. - -@anchor{option-bison-locations} -@opindex ---bison-locations -@opindex bison-locations -</listitem> -</varlistentry> - -<varlistentry><term>--bison-locations, @code{%option bison-locations}</term> -<listitem> -instructs flex that -@code{GNU bison} @code{%locations} are being used. -This means <function>yylex</function> will be passed -an additional parameter, <varname>yylloc</varname>. This option -implies @code{%option bison-bridge}. -@xref{Bison Bridge}.
- -@anchor{option-noline} -@opindex -L -@opindex ---noline -@opindex noline -</listitem> -</varlistentry> - -<varlistentry><term>-L, --noline, @code{%option noline}</term> -<listitem> -instructs -<application>flex</application> -not to generate -@code{#line} -directives. Without this option, -<application>flex</application> -peppers the generated scanner -with @code{#line} directives so error messages in the actions will be correctly -located with respect to either the original -<application>flex</application> -input file (if the errors are due to code in the input file), or -<filename>lex.yy.c</filename> -(if the errors are -<application>flex</application>'s -fault -- you should report these sorts of errors to the email address -given in @ref{Reporting Bugs}). - - - -@anchor{option-reentrant} -@opindex -R -@opindex ---reentrant -@opindex reentrant -</listitem> -</varlistentry> - -<varlistentry><term>-R, --reentrant, @code{%option reentrant}</term> -<listitem> -instructs flex to generate a reentrant C scanner. The generated scanner -may safely be used in a multi-threaded environment. The API for a -reentrant scanner is different than for a non-reentrant scanner -(@pxref{Reentrant}). Because of the API difference between -reentrant and non-reentrant <application>flex</application> scanners, non-reentrant flex -code must be modified before it is suitable for use with this option. -This option is not compatible with the @samp{--c++} option. - -The option @samp{--reentrant} does not affect the performance of -the scanner. - - - -@anchor{option-c++} -@opindex -+ -@opindex ---c++ -@opindex c++ -</listitem> -</varlistentry> - -<varlistentry><term>-+, --c++, @code{%option c++}</term> -<listitem> -specifies that you want flex to generate a C++ -scanner class. @xref{Cxx}, for -details.
- - - -@anchor{option-array} -@opindex ---array -@opindex array -</listitem> -</varlistentry> - -<varlistentry><term>--array, @code{%option array}</term> -<listitem> -specifies that you want <varname>yytext</varname> to be an array instead of a @code{char *}. - - - -@anchor{option-pointer} -@opindex ---pointer -@opindex pointer -</listitem> -</varlistentry> - -<varlistentry><term>--pointer, @code{%option pointer}</term> -<listitem> -specifies that <varname>yytext</varname> should be a @code{char *}, not an array. -The default is @code{char *}. - - - -@anchor{option-prefix} -@opindex -P -@opindex ---prefix -@opindex prefix -</listitem> -</varlistentry> - -<varlistentry><term>-PPREFIX, --prefix=PREFIX, @code{%option prefix="PREFIX"}</term> -<listitem> -changes the default @samp{yy} prefix used by <application>flex</application> for all -globally-visible variable and function names to instead be -@samp{PREFIX}. For example, @samp{--prefix=foo} changes the name of -<varname>yytext</varname> to @code{footext}. It also changes the name of the default -output file from <filename>lex.yy.c</filename> to <filename>lex.foo.c</filename>. Here is a partial -list of the names affected: - -<informalexample> -<programlisting> -<![CDATA[ - yy_create_buffer - yy_delete_buffer - yy_flex_debug - yy_init_buffer - yy_flush_buffer - yy_load_buffer_state - yy_switch_to_buffer - yyin - yyleng - yylex - yylineno - yyout - yyrestart - yytext - yywrap - yyalloc - yyrealloc - yyfree -]]> -</programlisting> -</informalexample> - -(If you are using a C++ scanner, then only <function>yywrap</function> and -@code{yyFlexLexer} are affected.) Within your scanner itself, you can -still refer to the global variables and functions using either version -of their name; but externally, they have the modified name. - -This option lets you easily link together multiple -<application>flex</application> -programs into the same executable.
Note, though, that using this -option also renames -<function>yywrap</function>, -so you now -<emphasis>must</emphasis> -either -provide your own (appropriately-named) version of the routine for your -scanner, or use -@code{%option noyywrap}, -as linking with -@samp{-lfl} -no longer provides one for you by default. - - - -@anchor{option-main} -@opindex ---main -@opindex main -</listitem> -</varlistentry> - -<varlistentry><term>--main, @code{%option main}</term> -<listitem> - directs flex to provide a default <function>main</function> program for the -scanner, which simply calls <function>yylex</function>. This option implies -@code{noyywrap} (see below). - - - -@anchor{option-nounistd} -@opindex ---nounistd -@opindex nounistd -</listitem> -</varlistentry> - -<varlistentry><term>--nounistd, @code{%option nounistd}</term> -<listitem> -suppresses inclusion of the non-ANSI header file <filename>unistd.h</filename>. This option -is meant to target environments in which <filename>unistd.h</filename> does not exist. Be aware -that certain options may cause flex to generate code that relies on functions -normally found in <filename>unistd.h</filename>, (e.g. <function>isatty</function>, <function>read</function>.) -If you wish to use these functions, you will have to inform your compiler where -to find them. -@xref{option-always-interactive}. @xref{option-read}. - - - -@anchor{option-yyclass} -@opindex ---yyclass -@opindex yyclass -</listitem> -</varlistentry> - -<varlistentry><term>--yyclass, @code{%option yyclass="NAME"}</term> -<listitem> -only applies when generating a C++ scanner (the @samp{--c++} option). It -informs <application>flex</application> that you have derived @code{foo} as a subclass of -@code{yyFlexLexer}, so <application>flex</application> will place your actions in the member -function @code{foo::yylex()} instead of @code{yyFlexLexer::yylex()}. 
It -also generates a @code{yyFlexLexer::yylex()} member function that emits -a run-time error (by invoking @code{yyFlexLexer::LexerError()}) if -called. @xref{Cxx}. - -</listitem> -</varlistentry> -</variablelist> - -</section> - -<section> -<title>Options for Scanner Speed and Size</title> - -<variablelist> - -<varlistentry><term>-C[aefFmr]</term> -<listitem> -controls the degree of table compression and, more generally, trade-offs -between small scanners and fast scanners. - -<variablelist> -@opindex -C - -<varlistentry><term>-C</term> -<listitem> -A lone @samp{-C} specifies that the scanner tables should be compressed -but neither equivalence classes nor meta-equivalence classes should be -used. - -@anchor{option-align} -@opindex -Ca -@opindex ---align -@opindex align -</listitem> -</varlistentry> - -<varlistentry><term>-Ca, --align, @code{%option align}</term> -<listitem> -(``align'') instructs flex to accept larger tables in the -generated scanner in exchange for faster performance, because the elements of -the tables are better aligned for memory access and computation. On some -RISC architectures, fetching and manipulating longwords is more efficient -than with smaller-sized units such as shortwords. This option can -quadruple the size of the tables used by your scanner. - -@anchor{option-ecs} -@opindex -Ce -@opindex ---ecs -@opindex ecs -</listitem> -</varlistentry> - -<varlistentry><term>-Ce, --ecs, @code{%option ecs}</term> -<listitem> -directs <application>flex</application> to construct @dfn{equivalence classes}, i.e., sets -of characters which have identical lexical properties (for example, if -the only appearance of digits in the <application>flex</application> input is in the -character class ``[0-9]'' then the digits '0', '1', ..., '9' will all be -put in the same equivalence class).
Equivalence classes usually give -dramatic reductions in the final table/object file sizes (typically a -factor of 2-5) and are pretty cheap performance-wise (one array look-up -per character scanned). - -@opindex -Cf -</listitem> -</varlistentry> - -<varlistentry><term>-Cf</term> -<listitem> -specifies that the @dfn{full} scanner tables should be generated; -<application>flex</application> should not compress the tables by taking advantage of -similar transition functions for different states. - -@opindex -CF -</listitem> -</varlistentry> - -<varlistentry><term>-CF</term> -<listitem> -specifies that the alternate fast scanner representation (described -below under the @samp{--fast} flag) should be used. This option cannot be -used with @samp{--c++}. - -@anchor{option-meta-ecs} -@opindex -Cm -@opindex ---meta-ecs -@opindex meta-ecs -</listitem> -</varlistentry> - -<varlistentry><term>-Cm, --meta-ecs, @code{%option meta-ecs}</term> -<listitem> -directs -<application>flex</application> -to construct -@dfn{meta-equivalence classes}, -which are sets of equivalence classes (or characters, if equivalence -classes are not being used) that are commonly used together. Meta-equivalence -classes are often a big win when using compressed tables, but they -have a moderate performance impact (one or two @code{if} tests and one -array look-up per character scanned). - -@anchor{option-read} -@opindex -Cr -@opindex ---read -@opindex read -</listitem> -</varlistentry> - -<varlistentry><term>-Cr, --read, @code{%option read}</term> -<listitem> -causes the generated scanner to <emphasis>bypass</emphasis> use of the standard I/O -library (@code{stdio}) for input. Instead of calling <function>fread</function> or -<function>getc</function>, the scanner will use the <function>read</function> system call, -resulting in a performance gain which varies from system to system, but -in general is probably negligible unless you are also using @samp{-Cf} -or @samp{-CF}.
Using @samp{-Cr} can cause strange behavior if, for -example, you read from <filename>yyin</filename> using @code{stdio} prior to calling -the scanner (because the scanner will miss whatever text your previous -reads left in the @code{stdio} input buffer). @samp{-Cr} has no effect -if you define <function>YY_INPUT</function> (@pxref{Generated Scanner}). - -</listitem> -</varlistentry> -</variablelist> - -The options @samp{-Cf} or @samp{-CF} and @samp{-Cm} do not make sense -together - there is no opportunity for meta-equivalence classes if the -table is not being compressed. Otherwise the options may be freely -mixed, and are cumulative. - -The default setting is @samp{-Cem}, which specifies that <application>flex</application> -should generate equivalence classes and meta-equivalence classes. This -setting provides the highest degree of table compression. You can trade -off faster-executing scanners at the cost of larger tables with the -following generally being true: - -<informalexample> -<programlisting> -<![CDATA[ - slowest & smallest - -Cem - -Cm - -Ce - -C - -C{f,F}e - -C{f,F} - -C{f,F}a - fastest & largest -]]> -</programlisting> -</informalexample> - -Note that scanners with the smallest tables are usually generated and -compiled the quickest, so during development you will usually want to -use the default, maximal compression. - -@samp{-Cfe} is often a good compromise between speed and size for -production scanners. - -@anchor{option-full} -@opindex -f -@opindex ---full -@opindex full -</listitem> -</varlistentry> - -<varlistentry><term>-f, --full, @code{%option full}</term> -<listitem> -specifies -@dfn{fast scanner}. -No table compression is done and @code{stdio} is bypassed. -The result is large but fast. 
This option is equivalent to -@samp{-Cfr}. - - -@anchor{option-fast} -@opindex -F -@opindex ---fast -@opindex fast -</listitem> -</varlistentry> - -<varlistentry><term>-F, --fast, @code{%option fast}</term> -<listitem> -specifies that the <emphasis>fast</emphasis> scanner table representation should be -used (and @code{stdio} bypassed). This representation is about as fast -as the full table representation @samp{--full}, and for some sets of -patterns will be considerably smaller (and for others, larger). In -general, if the pattern set contains both <emphasis>keywords</emphasis> and a -catch-all, <emphasis>identifier</emphasis> rule, such as in the set: - -<informalexample> -<programlisting> -<![CDATA[ - "case" return TOK_CASE; - "switch" return TOK_SWITCH; - ... - "default" return TOK_DEFAULT; - [a-z]+ return TOK_ID; -]]> -</programlisting> -</informalexample> - -then you're better off using the full table representation. If only -the <emphasis>identifier</emphasis> rule is present and you then use a hash table or some such -to detect the keywords, you're better off using -@samp{--fast}. - -This option is equivalent to @samp{-CFr} (see above). It cannot be used -with @samp{--c++}. - -</listitem> -</varlistentry> -</variablelist> - -</section> - -<section> -<title>Debugging Options</title> - -<variablelist> - -@anchor{option-backup} -@opindex -b -@opindex ---backup -@opindex backup - -<varlistentry><term>-b, --backup, @code{%option backup}</term> -<listitem> -Generate backing-up information to <filename>lex.backup</filename>. This is a list of -scanner states which require backing up and the input characters on -which they do so. By adding rules one can remove backing-up states. If -<emphasis>all</emphasis> backing-up states are eliminated and @samp{-Cf} or @samp{-CF} -is used, the generated scanner will run faster (see the @samp{--perf-report} flag). -Only users who wish to squeeze every last cycle out of their scanners -need worry about this option.
(@pxref{Performance}). - - - -@anchor{option-debug} -@opindex -d -@opindex ---debug -@opindex debug -</listitem> -</varlistentry> - -<varlistentry><term>-d, --debug, @code{%option debug}</term> -<listitem> -makes the generated scanner run in @dfn{debug} mode. Whenever a pattern -is recognized and the global variable @code{yy_flex_debug} is non-zero -(which is the default), the scanner will write to <filename>stderr</filename> a line -of the form: - -<informalexample> -<programlisting> -<![CDATA[ - -accepting rule at line 53 ("the matched text") -]]> -</programlisting> -</informalexample> - -The line number refers to the location of the rule in the file defining -the scanner (i.e., the file that was fed to flex). Messages are also -generated when the scanner backs up, accepts the default rule, reaches -the end of its input buffer (or encounters a NUL; at this point, the two -look the same as far as the scanner's concerned), or reaches an -end-of-file. - - - -@anchor{option-perf-report} -@opindex -p -@opindex ---perf-report -@opindex perf-report -</listitem> -</varlistentry> - -<varlistentry><term>-p, --perf-report, @code{%option perf-report}</term> -<listitem> -generates a performance report to <filename>stderr</filename>. The report consists of -comments regarding features of the <application>flex</application> input file which will -cause a serious loss of performance in the resulting scanner. If you -give the flag twice, you will also get comments regarding features that -lead to minor performance losses. - -Note that the use of @code{REJECT}, and -variable trailing context (@pxref{Limitations}) entails a substantial -performance penalty; use of <function>yymore</function>, the @samp{^} operator, and -the @samp{--interactive} flag entail minor performance penalties. 
- - - -@anchor{option-nodefault} -@opindex -s -@opindex ---nodefault -@opindex nodefault -</listitem> -</varlistentry> - -<varlistentry><term>-s, --nodefault, @code{%option nodefault}</term> -<listitem> -causes the <emphasis>default rule</emphasis> (that unmatched scanner input is echoed -to <filename>stdout</filename>) to be suppressed. If the scanner encounters input -that does not match any of its rules, it aborts with an error. This -option is useful for finding holes in a scanner's rule set. - - - -@anchor{option-trace} -@opindex -T -@opindex ---trace -@opindex trace -</listitem> -</varlistentry> - -<varlistentry><term>-T, --trace, @code{%option trace}</term> -<listitem> -makes <application>flex</application> run in @dfn{trace} mode. It will generate a lot of -messages to <filename>stderr</filename> concerning the form of the input and the -resultant non-deterministic and deterministic finite automata. This -option is mostly for use in maintaining <application>flex</application>. - - - -@anchor{option-nowarn} -@opindex -w -@opindex ---nowarn -@opindex nowarn -</listitem> -</varlistentry> - -<varlistentry><term>-w, --nowarn, @code{%option nowarn}</term> -<listitem> -suppresses warning messages. - - - -@anchor{option-verbose} -@opindex -v -@opindex ---verbose -@opindex verbose -</listitem> -</varlistentry> - -<varlistentry><term>-v, --verbose, @code{%option verbose}</term> -<listitem> -specifies that <application>flex</application> should write to <filename>stderr</filename> a summary of -statistics regarding the scanner it generates. Most of the statistics -are meaningless to the casual <application>flex</application> user, but the first line -identifies the version of <application>flex</application> (same as reported by @samp{--version}), -and the next line the flags used when generating the scanner, including -those that are on by default.
- - - -@anchor{option-warn} -@opindex ---warn -@opindex warn -</listitem> -</varlistentry> - -<varlistentry><term>--warn, @code{%option warn}</term> -<listitem> -warns about certain things. In particular, if the default rule can be -matched but no default rule has been given, <application>flex</application> will warn you. -We recommend using this option always. - -</listitem> -</varlistentry> -</variablelist> - -</section> - -<section> -<title>Miscellaneous Options</title> - -<variablelist> -@opindex -c - -<varlistentry><term>-c</term> -<listitem> -is a do-nothing option included for POSIX compliance. - -@opindex -h -@opindex ---help -</listitem> -</varlistentry> - -<varlistentry><term>-h, -?, --help</term> -<listitem> -generates a ``help'' summary of <application>flex</application>'s options to <filename>stdout</filename> -and then exits. - -@opindex -n -</listitem> -</varlistentry> - -<varlistentry><term>-n</term> -<listitem> -is another do-nothing option included only for -POSIX compliance. - -@opindex -V -@opindex ---version -</listitem> -</varlistentry> - -<varlistentry><term>-V, --version</term> -<listitem> -prints the version number to <filename>stdout</filename> and exits. - -</listitem> -</varlistentry> -</variablelist> - - -</section> -</chapter> - -<chapter> -<title>Performance Considerations</title> - -<!-- @cindex performance, considerations --> -The main design goal of <application>flex</application> is that it generate high-performance -scanners. It has been optimized for dealing well with large sets of -rules. Aside from the effects on scanner speed of the table compression -@samp{-C} options outlined above, there are a number of options/actions -which degrade performance.
These are, from most expensive to least: - -<!-- @cindex REJECT, performance costs --> -<!-- @cindex yylineno, performance costs --> -<!-- @cindex trailing context, performance costs --> -<informalexample> -<programlisting> -<![CDATA[ - REJECT - arbitrary trailing context - - pattern sets that require backing up - %option yylineno - %array - - %option interactive - %option always-interactive - - '^' beginning-of-line operator - yymore() -]]> -</programlisting> -</informalexample> - -with the first two being quite expensive and the last two being -quite cheap. Note also that <function>unput</function> is implemented as a routine -call that potentially does quite a bit of work, while <function>yyless</function> is -a quite-cheap macro. So if you are just putting back some excess text -you scanned, use <function>yyless</function>. - -@code{REJECT} should be avoided at all costs when performance is -important. It is a particularly expensive option. - -There is one case when @code{%option yylineno} can be expensive. That is when -your patterns match long tokens that could <emphasis>possibly</emphasis> contain a newline -character. There is no performance penalty for rules that cannot possibly -match newlines, since flex does not need to check them for newlines. In -general, you should avoid rules such as @code{[^f]+}, which match very long -tokens, including newlines, and may possibly match your entire file! A better -approach is to separate @code{[^f]+} into two rules: - -<informalexample> -<programlisting> -<![CDATA[ -%option yylineno -%% - [^f\n]+ - \n+ -]]> -</programlisting> -</informalexample> - -The above scanner does not incur a performance penalty. - -<!-- @cindex patterns, tuning for performance --> -<!-- @cindex performance, backing up --> -<!-- @cindex backing up, example of eliminating --> -Getting rid of backing up is messy and often may be an enormous amount -of work for a complicated scanner.
In principle, one begins by using -the @samp{-b} flag to generate a <filename>lex.backup</filename> file. For example, -on the input: - -<!-- @cindex backing up, eliminating --> -<informalexample> -<programlisting> -<![CDATA[ - %% - foo return TOK_KEYWORD; - foobar return TOK_KEYWORD; -]]> -</programlisting> -</informalexample> - -the file looks like: - -<informalexample> -<programlisting> -<![CDATA[ - State #6 is non-accepting - - associated rule line numbers: - 2 3 - out-transitions: [ o ] - jam-transitions: EOF [ \001-n p-\177 ] - - State #8 is non-accepting - - associated rule line numbers: - 3 - out-transitions: [ a ] - jam-transitions: EOF [ \001-` b-\177 ] - - State #9 is non-accepting - - associated rule line numbers: - 3 - out-transitions: [ r ] - jam-transitions: EOF [ \001-q s-\177 ] - - Compressed tables always back up. -]]> -</programlisting> -</informalexample> - -The first few lines tell us that there's a scanner state in which it can -make a transition on an 'o' but not on any other character, and that in -that state the currently scanned text does not match any rule. The -state occurs when trying to match the rules found at lines 2 and 3 in -the input file. If the scanner is in that state and then reads -something other than an 'o', it will have to back up to find a rule -which is matched. With a bit of headscratching one can see that this -must be the state it's in when it has seen @samp{fo}. When this has -happened, if anything other than another @samp{o} is seen, the scanner -will have to back up to simply match the @samp{f} (by the default rule). - -The comment regarding State #8 indicates there's a problem when -@samp{foob} has been scanned. Indeed, on any character other than an -@samp{a}, the scanner will have to back up to accept @samp{foo}. Similarly, -the comment for State #9 concerns when @samp{fooba} has been scanned and -an @samp{r} does not follow.
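The backing-up behavior described above can be sketched in plain C. This is only an illustrative model, not the table-driven code flex generates: the function match_keyword and the struct match are hypothetical names, and the matcher is hand-written for just the two rules foo and foobar plus the default rule. It scans as far as the input follows f-o-o-b-a-r, remembers the last position at which a rule matched, and reports how many characters had to be given back.

```c
/* Illustrative model only; flex's generated scanner is table-driven. */
struct match {
    int length;     /* characters consumed by the rule that finally matches */
    int backed_up;  /* characters scanned past that match and then given back */
};

/* Hypothetical matcher for the rules "foo" and "foobar" plus the default rule. */
static struct match match_keyword(const char *input)
{
    static const char longest[] = "foobar";
    struct match m;
    int pos = 0;          /* characters scanned so far */
    int last_accept = 0;  /* length of the longest rule match seen so far */

    /* Follow the path f-o-o-b-a-r until the input leaves it (a "jam"). */
    while (longest[pos] != '\0') {
        if (input[pos] != longest[pos])
            break;
        pos++;
        if (pos == 3 || pos == 6)   /* "foo" and "foobar" are accepting */
            last_accept = pos;
    }

    if (last_accept != 0) {
        m.length = last_accept;
        m.backed_up = pos - last_accept;
    } else if (input[0] == '\0') {
        m.length = 0;               /* end of input, nothing to match */
        m.backed_up = 0;
    } else {
        m.length = 1;               /* default rule takes one character */
        m.backed_up = pos > 1 ? pos - 1 : 0;
    }
    return m;
}
```

In this sketch, the input foobaz is matched as foo with two characters given back (the fooba case of State #9), fox falls back to the default rule on f with one character given back (the fo case of State #6), and foobar is matched whole with nothing given back.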
- -The final comment reminds us that there's no point going to all the -trouble of removing backing up from the rules unless we're using -@samp{-Cf} or @samp{-CF}, since there's no performance gain doing so -with compressed scanners. - -<!-- @cindex error rules, to eliminate backing up --> -The way to remove the backing up is to add ``error'' rules: - -<!-- @cindex backing up, eliminating by adding error rules --> -<informalexample> -<programlisting> -<![CDATA[ - %% - foo return TOK_KEYWORD; - foobar return TOK_KEYWORD; - - fooba | - foob | - fo { - /* false alarm, not really a keyword */ - return TOK_ID; - } -]]> -</programlisting> -</informalexample> - -Eliminating backing up among a list of keywords can also be done using a -``catch-all'' rule: - -<!-- @cindex backing up, eliminating with catch-all rule --> -<informalexample> -<programlisting> -<![CDATA[ - %% - foo return TOK_KEYWORD; - foobar return TOK_KEYWORD; - - [a-z]+ return TOK_ID; -]]> -</programlisting> -</informalexample> - -This is usually the best solution when appropriate. - -Backing up messages tend to cascade. With a complicated set of rules -it's not uncommon to get hundreds of messages. If one can decipher -them, though, it often only takes a dozen or so rules to eliminate the -backing up (though it's easy to make a mistake and have an error rule -accidentally match a valid token). A possible future <application>flex</application> feature -will be to automatically add rules to eliminate backing up. - -It's important to keep in mind that you gain the benefits of eliminating -backing up only if you eliminate <emphasis>every</emphasis> instance of backing up. -Leaving just one means you gain nothing. - -<emphasis>Variable</emphasis> trailing context (where both the leading and trailing -parts do not have a fixed length) entails almost the same performance -loss as @code{REJECT} (i.e., substantial).
So when possible a rule -like: - -<!-- @cindex trailing context, variable length --> -<informalexample> -<programlisting> -<![CDATA[ - %% - mouse|rat/(cat|dog) run(); -]]> -</programlisting> -</informalexample> - -is better written: - -<informalexample> -<programlisting> -<![CDATA[ - %% - mouse/cat|dog run(); - rat/cat|dog run(); -]]> -</programlisting> -</informalexample> - -or as - -<informalexample> -<programlisting> -<![CDATA[ - %% - mouse|rat/cat run(); - mouse|rat/dog run(); -]]> -</programlisting> -</informalexample> - -Note that here the special '|' action does <emphasis>not</emphasis> provide any -savings, and can even make things worse (@pxref{Limitations}). - -Another area where the user can increase a scanner's performance (and -one that's easier to implement) arises from the fact that the longer the -tokens matched, the faster the scanner will run. This is because with -long tokens the processing of most input characters takes place in the -(short) inner scanning loop, and does not often have to go through the -additional work of setting up the scanning environment (e.g., -<varname>yytext</varname>) for the action. 
Recall the scanner for C comments: - -<!-- @cindex performance optimization, matching longer tokens --> -<informalexample> -<programlisting> -<![CDATA[ - %x comment - %% - int line_num = 1; - - "/*" BEGIN(comment); - - <comment>[^*\n]* - <comment>"*"+[^*/\n]* - <comment>\n ++line_num; - <comment>"*"+"/" BEGIN(INITIAL); -]]> -</programlisting> -</informalexample> - -This could be sped up by writing it as: - -<informalexample> -<programlisting> -<![CDATA[ - %x comment - %% - int line_num = 1; - - "/*" BEGIN(comment); - - <comment>[^*\n]* - <comment>[^*\n]*\n ++line_num; - <comment>"*"+[^*/\n]* - <comment>"*"+[^*/\n]*\n ++line_num; - <comment>"*"+"/" BEGIN(INITIAL); -]]> -</programlisting> -</informalexample> - -Now instead of each newline requiring the processing of another action, -recognizing the newlines is distributed over the other rules to keep the -matched text as long as possible. Note that <emphasis>adding</emphasis> rules does -<emphasis>not</emphasis> slow down the scanner! The speed of the scanner is -independent of the number of rules or (modulo the considerations given -at the beginning of this section) how complicated the rules are with -regard to operators such as @samp{*} and @samp{|}. - -<!-- @cindex keywords, for performance --> -<!-- @cindex performance, using keywords --> -A final example in speeding up a scanner: suppose you want to scan -through a file containing identifiers and keywords, one per line -and with no other extraneous characters, and recognize all the -keywords. A natural first approach is: - -<!-- @cindex performance optimization, recognizing keywords --> -<informalexample> -<programlisting> -<![CDATA[ - %% - asm | - auto | - break | - ... etc ... - volatile | - while /* it's a keyword */ - - .|\n /* it's not a keyword */ -]]> -</programlisting> -</informalexample> - -To eliminate the back-tracking, introduce a catch-all rule: - -<informalexample> -<programlisting> -<![CDATA[ - %% - asm | - auto | - break | - ... etc ... 
- volatile | - while /* it's a keyword */ - - [a-z]+ | - .|\n /* it's not a keyword */ -]]> -</programlisting> -</informalexample> - -Now, if it's guaranteed that there's exactly one word per line, then we -can reduce the total number of matches by a half by merging in the -recognition of newlines with that of the other tokens: - -<informalexample> -<programlisting> -<![CDATA[ - %% - asm\n | - auto\n | - break\n | - ... etc ... - volatile\n | - while\n /* it's a keyword */ - - [a-z]+\n | - .|\n /* it's not a keyword */ -]]> -</programlisting> -</informalexample> - -One has to be careful here, as we have now reintroduced backing up -into the scanner. In particular, while -<emphasis>we</emphasis> -know that there will never be any characters in the input stream -other than letters or newlines, -<application>flex</application> -can't figure this out, and it will plan for possibly needing to back up -when it has scanned a token like @samp{auto} and then the next character -is something other than a newline or a letter. Previously it would -then just match the @samp{auto} rule and be done, but now it has no @samp{auto} -rule, only an @samp{auto\n} rule. To eliminate the possibility of backing up, -we could either duplicate all rules but without final newlines, or, -since we never expect to encounter such an input and therefore don't -care how it's classified, we can introduce one more catch-all rule, this -one which doesn't include a newline: - -<informalexample> -<programlisting> -<![CDATA[ - %% - asm\n | - auto\n | - break\n | - ... etc ... - volatile\n | - while\n /* it's a keyword */ - - [a-z]+\n | - [a-z]+ | - .|\n /* it's not a keyword */ -]]> -</programlisting> -</informalexample> - -Compiled with @samp{-Cf}, this is about as fast as one can get a -<application>flex</application> scanner to go for this particular problem. - -A final note: <application>flex</application> is slow when matching @code{NUL}s, -particularly when a token contains multiple @code{NUL}s.
It's best to -write rules which match <emphasis>short</emphasis> amounts of text if it's anticipated -that the text will often include @code{NUL}s. - -Another final note regarding performance: as mentioned in -@ref{Matching}, dynamically resizing <varname>yytext</varname> to accommodate huge -tokens is a slow process because it presently requires that the (huge) -token be rescanned from the beginning. Thus if performance is vital, -you should attempt to match ``large'' quantities of text but not -``huge'' quantities, where the cutoff between the two is at about 8K -characters per token. - -</chapter> - -<chapter> -<title>Generating C++ Scanners</title> - -<!-- @cindex c++, experimental form of scanner class --> -<!-- @cindex experimental form of c++ scanner class --> -<emphasis role="strong">IMPORTANT</emphasis>: the present form of the scanning class is <emphasis>experimental</emphasis> -and may change considerably between major releases. - -<!-- @cindex C++ --> -<!-- @cindex member functions, C++ --> -<!-- @cindex methods, c++ --> -<application>flex</application> provides two different ways to generate scanners for use -with C++. The first way is to simply compile a scanner generated by -<application>flex</application> using a C++ compiler instead of a C compiler. You should -not encounter any compilation errors (@pxref{Reporting Bugs}). You can -then use C++ code in your rule actions instead of C code. Note that the -default input source for your scanner remains <filename>yyin</filename>, and default -echoing is still done to <filename>yyout</filename>. Both of these remain @code{FILE -*} variables and not C++ <emphasis>streams</emphasis>. - -You can also use <application>flex</application> to generate a C++ scanner class, using the -@samp{-+} option (or, equivalently, @code{%option c++}), which is -automatically specified if the name of the <application>flex</application> executable ends -in a '+', such as @code{flex++}. 
When using this option, <application>flex</application> -defaults to generating the scanner to the file <filename>lex.yy.cc</filename> instead -of <filename>lex.yy.c</filename>. The generated scanner includes the header file -<filename>FlexLexer.h</filename>, which defines the interface to two C++ classes. - -The first class, -@code{FlexLexer}, -provides an abstract base class defining the general scanner class -interface. It provides the following member functions: - -<variablelist> -<!-- @findex YYText (C++ only) --> - -<varlistentry><term>const char* YYText()</term> -<listitem> -returns the text of the most recently matched token, the equivalent of -<varname>yytext</varname>. - -<!-- @findex YYLeng (C++ only) --> -</listitem> -</varlistentry> - -<varlistentry><term>int YYLeng()</term> -<listitem> -returns the length of the most recently matched token, the equivalent of -<varname>yyleng</varname>. - -<!-- @findex lineno (C++ only) --> -</listitem> -</varlistentry> - -<varlistentry><term>int lineno() const</term> -<listitem> -returns the current input line number (see @code{%option yylineno}), or -@code{1} if @code{%option yylineno} was not used. - -<!-- @findex set_debug (C++ only) --> -</listitem> -</varlistentry> - -<varlistentry><term>void set_debug( int flag )</term> -<listitem> -sets the debugging flag for the scanner, equivalent to assigning to -@code{yy_flex_debug} (@pxref{Scanner Options}). Note that you must build -the scanner using @code{%option debug} to include debugging information -in it. - -<!-- @findex debug (C++ only) --> -</listitem> -</varlistentry> - -<varlistentry><term>int debug() const</term> -<listitem> -returns the current setting of the debugging flag. 
-</listitem> -</varlistentry> -</variablelist> - -Also provided are member functions equivalent to -<function>yy_switch_to_buffer</function>, <function>yy_create_buffer</function> (though the -first argument is an @code{istream*} object pointer and not a -@code{FILE*}), <function>yy_flush_buffer</function>, <function>yy_delete_buffer</function>, and -<function>yyrestart</function> (again, the first argument is an @code{istream*} -object pointer). - -<!-- @tindex yyFlexLexer (C++ only) --> -<!-- @tindex FlexLexer (C++ only) --> -The second class defined in <filename>FlexLexer.h</filename> is @code{yyFlexLexer}, -which is derived from @code{FlexLexer}. It defines the following -additional member functions: - -<variablelist> -<!-- @findex yyFlexLexer constructor (C++ only) --> - -<varlistentry><term>yyFlexLexer( istream* arg_yyin = 0, ostream* arg_yyout = 0 )</term> -<listitem> -constructs a @code{yyFlexLexer} object using the given streams for input -and output. If not specified, the streams default to @code{cin} and -@code{cout}, respectively. - -<!-- @findex yylex (C++ version) --> -</listitem> -</varlistentry> - -<varlistentry><term>virtual int yylex()</term> -<listitem> -performs the same role as <function>yylex</function> does for ordinary <application>flex</application> -scanners: it scans the input stream, consuming tokens, until a rule's -action returns a value. If you derive a subclass @code{S} from -@code{yyFlexLexer} and want to access the member functions and variables -of @code{S} inside <function>yylex</function>, then you need to use @code{%option -yyclass="S"} to inform <application>flex</application> that you will be using that subclass -instead of @code{yyFlexLexer}. In this case, rather than generating -@code{yyFlexLexer::yylex()}, <application>flex</application> generates @code{S::yylex()} -(and also generates a dummy @code{yyFlexLexer::yylex()} that calls -@code{yyFlexLexer::LexerError()} if called). 
- -<!-- @findex switch_streams (C++ only) --> -</listitem> -</varlistentry> - -<varlistentry><term>virtual void switch_streams(istream* new_in = 0, ostream* new_out = 0)</term> -<listitem> -reassigns <varname>yyin</varname> to @code{new_in} (if non-null) and <varname>yyout</varname> to -@code{new_out} (if non-null), deleting the previous input buffer if -<varname>yyin</varname> is reassigned. - -</listitem> -</varlistentry> - -<varlistentry><term>int yylex( istream* new_in, ostream* new_out = 0 )</term> -<listitem> -first switches the input streams via @code{switch_streams( new_in, -new_out )} and then returns the value of <function>yylex</function>. -</listitem> -</varlistentry> -</variablelist> - -In addition, @code{yyFlexLexer} defines the following protected virtual -functions which you can redefine in derived classes to tailor the -scanner: - -<variablelist> -<!-- @findex LexerInput (C++ only) --> - -<varlistentry><term>virtual int LexerInput( char* buf, int max_size )</term> -<listitem> -reads up to @code{max_size} characters into @code{buf} and returns the -number of characters read. To indicate end-of-input, return 0 -characters. Note that @code{interactive} scanners (see the @samp{-B} -and @samp{-I} flags in @ref{Scanner Options}) define the macro -@code{YY_INTERACTIVE}. If you redefine <function>LexerInput</function> and need to -take different actions depending on whether or not the scanner might be -scanning an interactive input source, you can test for the presence of -this name via @code{#ifdef} statements. - -<!-- @findex LexerOutput (C++ only) --> -</listitem> -</varlistentry> - -<varlistentry><term>virtual void LexerOutput( const char* buf, int size )</term> -<listitem> -writes out @code{size} characters from the buffer @code{buf}, which, while -@code{NUL}-terminated, may also contain internal @code{NUL}s if the -scanner's rules can match text with @code{NUL}s in them. 
- -<!-- @cindex error reporting, in C++ --> -<!-- @findex LexerError (C++ only) --> -</listitem> -</varlistentry> - -<varlistentry><term>virtual void LexerError( const char* msg )</term> -<listitem> -reports a fatal error message. The default version of this function -writes the message to the stream @code{cerr} and exits. -</listitem> -</varlistentry> -</variablelist> - -Note that a @code{yyFlexLexer} object contains its <emphasis>entire</emphasis> -scanning state. Thus you can use such objects to create reentrant -scanners, but see also @ref{Reentrant}. You can instantiate multiple -instances of the same @code{yyFlexLexer} class, and you can also combine -multiple C++ scanner classes together in the same program using the -@samp{-P} option discussed above. - -Finally, note that the @code{%array} feature is not available to C++ -scanner classes; you must use @code{%pointer} (the default). - -Here is an example of a simple C++ scanner: - -<!-- @cindex C++ scanners, use of --> -<informalexample> -<programlisting> -<![CDATA[ - // An example of using the flex C++ scanner class. - - %{ - int mylineno = 0; - %} - - string \"[^\n"]+\" - - ws [ \t]+ - - alpha [A-Za-z] - dig [0-9] - name ({alpha}|{dig}|\$)({alpha}|{dig}|[_.\-/$])* - num1 [-+]?{dig}+\.?([eE][-+]?{dig}+)? - num2 [-+]?{dig}*\.{dig}+([eE][-+]?{dig}+)? 
- number {num1}|{num2} - - %% - - {ws} /* skip blanks and tabs */ - - "/*" { - int c; - - while((c = yyinput()) != 0) - { - if(c == '\n') - ++mylineno; - - else if(c == '*') - { - if((c = yyinput()) == '/') - break; - else - unput(c); - } - } - } - - {number} cout << "number " << YYText() << '\n'; - - \n mylineno++; - - {name} cout << "name " << YYText() << '\n'; - - {string} cout << "string " << YYText() << '\n'; - - %% - - int main( int /* argc */, char** /* argv */ ) - { - FlexLexer* lexer = new yyFlexLexer; - while(lexer->yylex() != 0) - ; - return 0; - } -]]> -</programlisting> -</informalexample> - -<!-- @cindex C++, multiple different scanners --> -If you want to create multiple (different) lexer classes, you use the -@samp{-P} flag (or the @code{prefix=} option) to rename each -@code{yyFlexLexer} to some other @samp{xxFlexLexer}. You then can -include <filename>FlexLexer.h</filename> in your other sources once per lexer class, -first renaming @code{yyFlexLexer} as follows: - -<!-- @cindex include files, with C++ --> -<!-- @cindex header files, with C++ --> -<!-- @cindex C++ scanners, including multiple scanners --> -<informalexample> -<programlisting> -<![CDATA[ - #undef yyFlexLexer - #define yyFlexLexer xxFlexLexer - #include <FlexLexer.h> - - #undef yyFlexLexer - #define yyFlexLexer zzFlexLexer - #include <FlexLexer.h> -]]> -</programlisting> -</informalexample> - -if, for example, you used @code{%option prefix="xx"} for one of your -scanners and @code{%option prefix="zz"} for the other. - -</chapter> - -<chapter> -<title>Reentrant C Scanners</title> - -<!-- @cindex reentrant, explanation --> -<application>flex</application> has the ability to generate a reentrant C scanner. This is -accomplished by specifying @code{%option reentrant} (@samp{-R}). The generated -scanner is both portable and safe to use in one or more separate threads of -control. The most common use for reentrant scanners is from within -multi-threaded applications. 
Any thread may create and execute a reentrant -<application>flex</application> scanner without the need for synchronization with other threads. - -<!-- -@menu -* Reentrant Uses:: -* Reentrant Overview:: -* Reentrant Example:: -* Reentrant Detail:: -* Reentrant Functions:: -@end menu ---> - - -<section> -<title>Uses for Reentrant Scanners</title> - -However, there are other uses for a reentrant scanner. For example, you -could scan two or more files simultaneously to implement a @code{diff} at -the token level (i.e., instead of at the character level): - -<!-- @cindex reentrant scanners, multiple interleaved scanners --> -<informalexample> -<programlisting> -<![CDATA[ - /* Example of maintaining more than one active scanner. */ - - /* Declared outside the loop so the while condition can see them. */ - int tok1, tok2; - - do { - tok1 = yylex( scanner_1 ); - tok2 = yylex( scanner_2 ); - - if( tok1 != tok2 ) - printf("Files are different."); - - } while ( tok1 && tok2 ); -]]> -</programlisting> -</informalexample> - -Another use for a reentrant scanner is recursion. -(Note that a recursive scanner can also be created using a non-reentrant scanner and -buffer states. @xref{Multiple Input Buffers}.) - -The following crude scanner supports the @samp{eval} command by invoking -another instance of itself. - -<!-- @cindex reentrant scanners, recursive invocation --> -<informalexample> -<programlisting> -<![CDATA[ - /* Example of recursive invocation. */ - - %option reentrant - - %% - "eval(".+")" { - yyscan_t scanner; - YY_BUFFER_STATE buf; - - yylex_init( &scanner ); - yytext[yyleng-1] = ' '; - - buf = yy_scan_string( yytext + 5, scanner ); - yylex( scanner ); - - yy_delete_buffer(buf,scanner); - yylex_destroy( scanner ); - } - ... - %% -]]> -</programlisting> -</informalexample> - -</section> - -<section> -<title>An Overview of the Reentrant API</title> - -<!-- @cindex reentrant, API explanation --> -The API for reentrant scanners is different than for non-reentrant -scanners. 
Here is a quick overview of the API: - -<itemizedlist> -<listitem> - -@code{%option reentrant} must be specified. - -</listitem> -<listitem> - -All functions take one additional argument: <varname>yyscanner</varname>. - -</listitem> -<listitem> - -All global variables are replaced by their macro equivalents. -(We tell you this because it may be important to you during debugging.) - -</listitem> -<listitem> - -<function>yylex_init</function> and <function>yylex_destroy</function> must be called before and -after <function>yylex</function>, respectively. - -</listitem> -<listitem> - -Accessor methods (get/set functions) provide access to common -<application>flex</application> variables. - -</listitem> -<listitem> - -User-specific data can be stored in @code{yyextra}. -</listitem> -</itemizedlist> - - -</section> - -<section> -<title>Reentrant Example</title> - -First, an example of a reentrant scanner: -<!-- @cindex reentrant, example of --> -<informalexample> -<programlisting> -<![CDATA[ - /* This scanner prints "//" comments. */ - %option reentrant stack - %x COMMENT - %% - "//" yy_push_state( COMMENT, yyscanner); - .|\n - <COMMENT>\n yy_pop_state( yyscanner ); - <COMMENT>[^\n]+ fprintf( yyout, "%s\n", yytext); - %% - int main ( int argc, char * argv[] ) - { - yyscan_t scanner; - - yylex_init ( &scanner ); - yylex ( scanner ); - yylex_destroy ( scanner ); - return 0; - } -]]> -</programlisting> -</informalexample> - -</section> - -<section> -<title>The Reentrant API in Detail</title> - -Here are the things you need to do or know to use the reentrant C API of -<application>flex</application>. - -<!-- -@menu -* Specify Reentrant:: -* Extra Reentrant Argument:: -* Global Replacement:: -* Init and Destroy Functions:: -* Accessor Methods:: -* Extra Data:: -* About yyscan_t:: -@end menu ---> - - -</section> - -<section> -<title>Declaring a Scanner As Reentrant</title> - - %option reentrant (--reentrant) must be specified. 
- -Notice that @code{%option reentrant} is specified in the above example -(@pxref{Reentrant Example}). Had this option not been specified, -<application>flex</application> would have happily generated a non-reentrant scanner without -complaining. You may explicitly specify @code{%option noreentrant}, if -you do <emphasis>not</emphasis> want a reentrant scanner, although it is not -necessary. The default is to generate a non-reentrant scanner. - -</section> - -<section> -<title>The Extra Argument</title> - -<!-- @cindex reentrant, calling functions --> -<!-- @vindex yyscanner (reentrant only) --> -All functions take one additional argument: <varname>yyscanner</varname>. - -Notice that the calls to <function>yy_push_state</function> and <function>yy_pop_state</function> -both have an argument, <varname>yyscanner</varname>, that is not present in a -non-reentrant scanner. Here are the declarations of -<function>yy_push_state</function> and <function>yy_pop_state</function> in the generated scanner: - -<informalexample> -<programlisting> -<![CDATA[ - static void yy_push_state ( int new_state , yyscan_t yyscanner ) ; - static void yy_pop_state ( yyscan_t yyscanner ) ; -]]> -</programlisting> -</informalexample> - -Notice that the argument <varname>yyscanner</varname> appears in the declaration of -both functions. In fact, all <application>flex</application> functions in a reentrant -scanner have this additional argument. It is always the last argument -in the argument list, it is always of type @code{yyscan_t} (which is -typedef'd to @code{void *}), and it is -always named <varname>yyscanner</varname>. As you may have guessed, -<varname>yyscanner</varname> is a pointer to an opaque data structure encapsulating -the current state of the scanner. For a list of function declarations, -see @ref{Reentrant Functions}. Note that preprocessor macros, such as -@code{BEGIN}, @code{ECHO}, and @code{REJECT}, do not take this -additional argument. 
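The convention described above (an opaque handle passed as the last argument, with macros that take no handle) can be illustrated with a small stand-alone sketch. This is not code generated by flex: the names demo_scan_t, struct demo_guts, DEMO_BEGIN, demo_push_state and demo_current_state are hypothetical stand-ins chosen for illustration, and the assumption is only that a macro can omit the handle because a variable named yyscanner is in scope wherever the macro is used.

```c
/* Hypothetical stand-ins, not flex's generated code: demo_scan_t plays the
   role of yyscan_t and struct demo_guts the role of the hidden state. */
typedef void *demo_scan_t;

struct demo_guts {
    int start_state;   /* loosely analogous to the current start condition */
};

/* A macro in the style of BEGIN: it takes no handle argument and instead
   relies on a variable named yyscanner being in scope where it is used. */
#define DEMO_BEGIN(s) (((struct demo_guts *)yyscanner)->start_state = (s))

/* The handle is the last parameter and is named yyscanner. */
void demo_push_state(int new_state, demo_scan_t yyscanner)
{
    DEMO_BEGIN(new_state);
}

/* Reads the state back through the same opaque handle. */
int demo_current_state(demo_scan_t yyscanner)
{
    return ((struct demo_guts *)yyscanner)->start_state;
}
```

Because every function receives its state through the handle, two handles can be used side by side without interfering with each other, which is the property that makes a reentrant scanner usable from several threads or recursively.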
- -</section> - -<section> -<title>Global Variables Replaced By Macros</title> - -<!-- @cindex reentrant, accessing flex variables --> -All global variables in traditional flex have been replaced by macro equivalents. - -Note that in the above example, <varname>yyout</varname> and <varname>yytext</varname> are -not plain variables. These are macros that will expand to their equivalent lvalue. -All of the familiar <application>flex</application> globals have been replaced by their macro -equivalents. In particular, <varname>yytext</varname>, <varname>yyleng</varname>, <varname>yylineno</varname>, -<varname>yyin</varname>, <varname>yyout</varname>, @code{yyextra}, <varname>yylval</varname>, and <varname>yylloc</varname> -are macros. You may safely use these macros in actions as if they were plain -variables. We only tell you this so you don't expect to link to these variables -externally. Currently, each macro expands to a member of an internal struct, e.g., - -<informalexample> -<programlisting> -<![CDATA[ -#define yytext (((struct yyguts_t*)yyscanner)->yytext_r) -]]> -</programlisting> -</informalexample> - -One important thing to remember about -<varname>yytext</varname> -and friends is that -<varname>yytext</varname> -is not a global variable in a reentrant -scanner, you can not access it directly from outside an action or from -other functions. You must use an accessor method, e.g., -<function>yyget_text</function>, -to accomplish this. (See below). - -</section> - -<section> -<title>Init and Destroy Functions</title> - -<!-- @cindex memory, considerations for reentrant scanners --> -<!-- @cindex reentrant, initialization --> -<!-- @findex yylex_init --> -<!-- @findex yylex_destroy --> - -<function>yylex_init</function> and <function>yylex_destroy</function> must be called before and -after <function>yylex</function>, respectively. 
- -<informalexample> -<programlisting> -<![CDATA[ - int yylex_init ( yyscan_t * ptr_yy_globals ) ; - int yylex ( yyscan_t yyscanner ) ; - int yylex_destroy ( yyscan_t yyscanner ) ; -]]> -</programlisting> -</informalexample> - -The function <function>yylex_init</function> must be called before calling any other -function. The argument to <function>yylex_init</function> is the address of an -uninitialized pointer to be filled in by <application>flex</application>. The contents of -@code{ptr_yy_globals} need not be initialized, since <application>flex</application> will -overwrite it anyway. The value stored in @code{ptr_yy_globals} should -thereafter be passed to <function>yylex</function> and @b{yylex_destroy()}. Flex -does not save the argument passed to <function>yylex_init</function>, so it is safe to -pass the address of a local pointer to <function>yylex_init</function>. The function -<function>yylex</function> should be familiar to you by now. The reentrant version -takes one argument, which is the value returned (via an argument) by -<function>yylex_init</function>. Otherwise, it behaves the same as the non-reentrant -version of <function>yylex</function>. - -<function>yylex_init</function> returns 0 (zero) on success, or non-zero on failure, -in which case, errno is set to one of the following values: - -<itemizedlist> - -<listitem> - ENOMEM -Memory allocation error. @xref{memory-management}. -</listitem> -<listitem> - EINVAL -Invalid argument. -</listitem> -</itemizedlist> - - - -The function <function>yylex_destroy</function> should be -called to free resources used by the scanner. After <function>yylex_destroy</function> -is called, the contents of <varname>yyscanner</varname> should not be used. Of -course, there is no need to destroy a scanner if you plan to reuse it. -A <application>flex</application> scanner (both reentrant and non-reentrant) may be -restarted by calling <function>yyrestart</function>. 
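The return-value convention described above for <function>yylex_init</function> (zero on success, non-zero with errno set to ENOMEM or EINVAL on failure) can be sketched in isolation. The functions below are hypothetical stand-ins named demo_init and demo_destroy, not the functions flex generates; they only assume that the initializer rejects a null pointer argument and reports allocation failure through errno.

```c
#include <errno.h>
#include <stdlib.h>

typedef void *demo_scan_t;          /* hypothetical stand-in for yyscan_t */

struct demo_guts { int lineno; };   /* hypothetical internal scanner state */

/* Returns 0 on success. On failure returns non-zero and sets errno to
   EINVAL (null argument) or ENOMEM (allocation failed). */
int demo_init(demo_scan_t *out)
{
    if (out == NULL) {
        errno = EINVAL;
        return 1;
    }
    *out = calloc(1, sizeof(struct demo_guts));
    if (*out == NULL) {
        errno = ENOMEM;
        return 1;
    }
    return 0;
}

/* Releases the state; the handle must not be used afterwards. */
int demo_destroy(demo_scan_t scanner)
{
    free(scanner);
    return 0;
}
```

A caller following this pattern would check the return value of the initializer before scanning, and inspect errno only when that value is non-zero.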
- -Below is an example of a program that creates a scanner, uses it, then destroys -it when done: - -<informalexample> -<programlisting> -<![CDATA[ - int main () - { - yyscan_t scanner; - int tok; - - yylex_init(&scanner); - - while ((tok=yylex(scanner)) > 0) - printf("tok=%d yytext=%s\n", tok, yyget_text(scanner)); - - yylex_destroy(scanner); - return 0; - } -]]> -</programlisting> -</informalexample> - -</section> - -<section> -<title>Accessing Variables with Reentrant Scanners</title> - -<!-- @cindex reentrant, accessor functions --> -Accessor methods (get/set functions) provide access to common -<application>flex</application> variables. - -Many scanners that you build will be part of a larger project. Portions -of your project will need access to <application>flex</application> values, such as -<varname>yytext</varname>. In a non-reentrant scanner, these values are global, so -there is no problem accessing them. However, in a reentrant scanner, there are no -global <application>flex</application> values. You cannot access them directly. Instead, -you must access <application>flex</application> values using accessor methods (get/set -functions). Each accessor method is named <function>yyget_NAME</function> or -<function>yyset_NAME</function>, where @code{NAME} is the name of the <application>flex</application> -variable you want. For example: - -<!-- @cindex accessor functions, use of --> -<informalexample> -<programlisting> -<![CDATA[ - /* Set the last character of yytext to NUL. */ - void chop ( yyscan_t scanner ) - { - int len = yyget_leng( scanner ); - yyget_text( scanner )[len - 1] = '\0'; - } -]]> -</programlisting> -</informalexample> - -The above code may be called from within an action like this: - -<informalexample> -<programlisting> -<![CDATA[ - %% - .+\n { chop( yyscanner );} -]]> -</programlisting> -</informalexample> - -You may find that @code{%option header-file} is particularly useful for generating -prototypes of all the accessor functions. 
@xref{option-header}. - -</section> - -<section> -<title>Extra Data</title> - -<!-- @cindex reentrant, extra data --> -<!-- @vindex yyextra --> -User-specific data can be stored in @code{yyextra}. - -In a reentrant scanner, it is unwise to use global variables to -communicate with or maintain state between different pieces of your program. -However, you may need access to external data or invoke external functions -from within the scanner actions. -Likewise, you may need to pass information to your scanner -(e.g., open file descriptors, or database connections). -In a non-reentrant scanner, the only way to do this would be through the -use of global variables. -<application>flex</application> allows you to store arbitrary, ``extra'' data in a scanner. -This data is accessible through the accessor methods -<function>yyget_extra</function> -and -<function>yyset_extra</function> -from outside the scanner, and through the shortcut macro -@code{yyextra} -from within the scanner itself. They are defined as follows: - -<!-- @tindex YY_EXTRA_TYPE (reentrant only) --> -<!-- @findex yyget_extra --> -<!-- @findex yyset_extra --> -<informalexample> -<programlisting> -<![CDATA[ - #define YY_EXTRA_TYPE void* - YY_EXTRA_TYPE yyget_extra ( yyscan_t scanner ); - void yyset_extra ( YY_EXTRA_TYPE arbitrary_data , yyscan_t scanner); -]]> -</programlisting> -</informalexample> - -By default, @code{YY_EXTRA_TYPE} is defined as type @code{void *}. You -will have to cast @code{yyextra} and the return value from -<function>yyget_extra</function> to the appropriate value each time you access the -extra data. To avoid casting, you may override the default type by -defining @code{YY_EXTRA_TYPE} in section 1 of your scanner: - -<!-- @cindex YY_EXTRA_TYPE, defining your own type --> -<informalexample> -<programlisting> -<![CDATA[ - /* An example of overriding YY_EXTRA_TYPE. 
*/ - %{ - #include <sys/stat.h> - #include <unistd.h> - #define YY_EXTRA_TYPE struct stat* - %} - %option reentrant - %% - - __filesize__ printf( "%ld", yyextra->st_size ); - __lastmod__ printf( "%ld", yyextra->st_mtime ); - %% - void scan_file( char* filename ) - { - yyscan_t scanner; - struct stat buf; - - yylex_init ( &scanner ); - yyset_in( fopen(filename,"r"), scanner ); - - stat( filename, &buf); - yyset_extra( &buf, scanner ); - yylex ( scanner ); - yylex_destroy( scanner ); - } -]]> -</programlisting> -</informalexample> - - -</section> - -<section> -<title>About yyscan_t</title> - -<!-- @tindex yyscan_t (reentrant only) --> -@code{yyscan_t} is defined as: - -<informalexample> -<programlisting> -<![CDATA[ - typedef void* yyscan_t; -]]> -</programlisting> -</informalexample> - -It is initialized by <function>yylex_init</function> to point to -an internal structure. You should never access this value -directly. In particular, you should never attempt to free it -(use <function>yylex_destroy</function> instead.) 
- -</section> - -<section> -<title>Functions and Macros Available in Reentrant C Scanners</title> - -The following Functions are available in a reentrant scanner: - -<!-- @findex yyget_text --> -<!-- @findex yyget_leng --> -<!-- @findex yyget_in --> -<!-- @findex yyget_out --> -<!-- @findex yyget_lineno --> -<!-- @findex yyset_in --> -<!-- @findex yyset_out --> -<!-- @findex yyset_lineno --> -<!-- @findex yyget_debug --> -<!-- @findex yyset_debug --> -<!-- @findex yyget_extra --> -<!-- @findex yyset_extra --> - -<informalexample> -<programlisting> -<![CDATA[ - char *yyget_text ( yyscan_t scanner ); - int yyget_leng ( yyscan_t scanner ); - FILE *yyget_in ( yyscan_t scanner ); - FILE *yyget_out ( yyscan_t scanner ); - int yyget_lineno ( yyscan_t scanner ); - YY_EXTRA_TYPE yyget_extra ( yyscan_t scanner ); - int yyget_debug ( yyscan_t scanner ); - - void yyset_debug ( int flag, yyscan_t scanner ); - void yyset_in ( FILE * in_str , yyscan_t scanner ); - void yyset_out ( FILE * out_str , yyscan_t scanner ); - void yyset_lineno ( int line_number , yyscan_t scanner ); - void yyset_extra ( YY_EXTRA_TYPE user_defined , yyscan_t scanner ); -]]> -</programlisting> -</informalexample> - -There are no ``set'' functions for yytext and yyleng. This is intentional. - -The following Macro shortcuts are available in actions in a reentrant -scanner: - -<informalexample> -<programlisting> -<![CDATA[ - yytext - yyleng - yyin - yyout - yylineno - yyextra - yy_flex_debug -]]> -</programlisting> -</informalexample> - -<!-- @cindex yylineno, in a reentrant scanner --> -In a reentrant C scanner, support for yylineno is always present -(i.e., you may access yylineno), but the value is never modified by -<application>flex</application> unless @code{%option yylineno} is enabled. This is to allow -the user to maintain the line count independently of <application>flex</application>. 
- -@anchor{bison-functions} -The following functions and macros are made available when @code{%option -bison-bridge} (@samp{--bison-bridge}) is specified: - -<informalexample> -<programlisting> -<![CDATA[ - YYSTYPE * yyget_lval ( yyscan_t scanner ); - void yyset_lval ( YYSTYPE * yylvalp , yyscan_t scanner ); - yylval -]]> -</programlisting> -</informalexample> - -The following functions and macros are made available -when @code{%option bison-locations} (@samp{--bison-locations}) is specified: - -<informalexample> -<programlisting> -<![CDATA[ - YYLTYPE *yyget_lloc ( yyscan_t scanner ); - void yyset_lloc ( YYLTYPE * yyllocp , yyscan_t scanner ); - yylloc -]]> -</programlisting> -</informalexample> - -Support for yylval assumes that @code{YYSTYPE} is a valid type. Support for -yylloc assumes that @code{YYLTYPE} is a valid type. Typically, these types are -generated by <application>bison</application>, and are included in section 1 of the <application>flex</application> -input. - -</section> -</chapter> - -<chapter> -<title>Incompatibilities with Lex and Posix</title> - -<!-- @cindex POSIX and lex --> -<!-- @cindex lex (traditional) and POSIX --> - -<application>flex</application> is a rewrite of the <acronym>AT&T</acronym> Unix <emphasis>lex</emphasis> tool (the two -implementations do not share any code, though), with some extensions and -incompatibilities, both of which are of concern to those who wish to -write scanners acceptable to both implementations. <application>flex</application> is fully -compliant with the POSIX @code{lex} specification, except that when -using @code{%pointer} (the default), a call to <function>unput</function> destroys -the contents of <varname>yytext</varname>, which is counter to the POSIX -specification. In this section we discuss all of the known areas of -incompatibility between <application>flex</application>, <acronym>AT&T</acronym> @code{lex}, and the POSIX -specification. 
<application>flex</application>'s @samp{-l} option turns on maximum -compatibility with the original <acronym>AT&T</acronym> @code{lex} implementation, at the -cost of a major loss in the generated scanner's performance. We note -below which incompatibilities can be overcome using the @samp{-l} -option. <application>flex</application> is fully compatible with @code{lex} with the -following exceptions: - -<itemizedlist> - -<listitem> - -The undocumented @code{lex} scanner internal variable <varname>yylineno</varname> is -not supported unless @samp{-l} or @code{%option yylineno} is used. - -</listitem> -<listitem> - -<varname>yylineno</varname> should be maintained on a per-buffer basis, rather than -a per-scanner (single global variable) basis. - -</listitem> -<listitem> - -<varname>yylineno</varname> is not part of the POSIX specification. - -</listitem> -<listitem> - -The <function>input</function> routine is not redefinable, though it may be called -to read characters following whatever has been matched by a rule. If -<function>input</function> encounters an end-of-file the normal <function>yywrap</function> -processing is done. A ``real'' end-of-file is returned by -<function>input</function> as @code{EOF}. - -</listitem> -<listitem> - -Input is instead controlled by defining the <function>YY_INPUT</function> macro. - -</listitem> -<listitem> - -The <application>flex</application> restriction that <function>input</function> cannot be redefined is -in accordance with the POSIX specification, which simply does not -specify any way of controlling the scanner's input other than by making -an initial assignment to <filename>yyin</filename>. - -</listitem> -<listitem> - -The <function>unput</function> routine is not redefinable. This restriction is in -accordance with POSIX. - -</listitem> -<listitem> - -<application>flex</application> scanners are not as reentrant as @code{lex} scanners. 
In -particular, if you have an interactive scanner and an interrupt handler -which long-jumps out of the scanner, and the scanner is subsequently -called again, you may get the following message: - -<!-- @cindex error messages, end of buffer missed --> -<informalexample> -<programlisting> -<![CDATA[ - fatal flex scanner internal error--end of buffer missed -]]> -</programlisting> -</informalexample> - -To reenter the scanner, first use: - -<!-- @cindex restarting the scanner --> -<informalexample> -<programlisting> -<![CDATA[ - yyrestart( yyin ); -]]> -</programlisting> -</informalexample> - -Note that this call will throw away any buffered input; usually this -isn't a problem with an interactive scanner. @xref{Reentrant}, for -<application>flex</application>'s reentrant API. - -</listitem> -<listitem> - -Also note that <application>flex</application> C++ scanner classes -<emphasis>are</emphasis> -reentrant, so if using C++ is an option for you, you should use -them instead. @xref{Cxx}, and @ref{Reentrant} for details. - -</listitem> -<listitem> - -<function>output</function> is not supported. Output from the @b{ECHO} macro is -done to the file-pointer <varname>yyout</varname> (default <filename>stdout</filename>). - -</listitem> -<listitem> - -<function>output</function> is not part of the POSIX specification. - -</listitem> -<listitem> - -@code{lex} does not support exclusive start conditions (%x), though they -are in the POSIX specification. - -</listitem> -<listitem> - -When definitions are expanded, <application>flex</application> encloses them in parentheses. -With @code{lex}, the following: - -<!-- @cindex name definitions, not POSIX --> -<informalexample> -<programlisting> -<![CDATA[ - NAME [A-Z][A-Z0-9]* - %% - foo{NAME}? 
printf( "Found it\n" ); - %% -]]> -</programlisting> -</informalexample> - -will not match the string @samp{foo} because when the macro is expanded -the rule is equivalent to @samp{foo[A-Z][A-Z0-9]*?} and the precedence -is such that the @samp{?} is associated with @samp{[A-Z0-9]*}. With -<application>flex</application>, the rule will be expanded to @samp{foo([A-Z][A-Z0-9]*)?} -and so the string @samp{foo} will match. - -</listitem> -<listitem> - -Note that if the definition begins with @samp{^} or ends with @samp{$} -then it is <emphasis>not</emphasis> expanded with parentheses, to allow these -operators to appear in definitions without losing their special -meanings. But the @samp{<s>}, @samp{/}, and @code{<<EOF>>} operators -cannot be used in a <application>flex</application> definition. - -</listitem> -<listitem> - -Using @samp{-l} results in the @code{lex} behavior of no parentheses -around the definition. - -</listitem> -<listitem> - -The POSIX specification is that the definition be enclosed in parentheses. - -</listitem> -<listitem> - -Some implementations of @code{lex} allow a rule's action to begin on a -separate line, if the rule's pattern has trailing whitespace: - -<!-- @cindex patterns and actions on different lines --> -<informalexample> -<programlisting> -<![CDATA[ - %% - foo|bar<space here> - { foobar_action();} -]]> -</programlisting> -</informalexample> - -<application>flex</application> does not support this feature. - -</listitem> -<listitem> - -The @code{lex} @code{%r} (generate a Ratfor scanner) option is not -supported. It is not part of the POSIX specification. - -</listitem> -<listitem> - -After a call to <function>unput</function>, <emphasis>yytext</emphasis> is undefined until the -next token is matched, unless the scanner was built using @code{%array}. -This is not the case with @code{lex} or the POSIX specification. The -@samp{-l} option does away with this incompatibility. 
- -</listitem> -<listitem> - -The precedence of the @samp{{,}} (numeric range) operator is -different. The <acronym>AT&amp;T</acronym> and POSIX specifications of @code{lex} -interpret @samp{abc{1,3}} as ``match one, two, -or three occurrences of @samp{abc}'', whereas <application>flex</application> interprets it -as ``match @samp{ab} followed by one, two, or three occurrences of -@samp{c}''. The @samp{-l} and @samp{--posix} options do away with this -incompatibility. - -</listitem> -<listitem> - -The precedence of the @samp{^} operator is different. @code{lex} -interprets @samp{^foo|bar} as ``match either @samp{foo} at the beginning of a -line, or @samp{bar} anywhere'', whereas <application>flex</application> interprets it as ``match -either @samp{foo} or @samp{bar} if they come at the beginning of a -line''. The latter is in agreement with the POSIX specification. - -</listitem> -<listitem> - -The special table-size declarations such as @code{%a} supported by -@code{lex} are not required by <application>flex</application> scanners. <application>flex</application> -ignores them. -</listitem> -<listitem> - -The name @code{FLEX_SCANNER} is @code{#define}'d so scanners may be -written for use with either <application>flex</application> or @code{lex}. Scanners also -include @code{YY_FLEX_MAJOR_VERSION}, @code{YY_FLEX_MINOR_VERSION} -and @code{YY_FLEX_SUBMINOR_VERSION} -indicating which version of <application>flex</application> generated the scanner. For -example, for the 2.5.22 release, these defines would be 2, 5 and 22 -respectively. If the version of <application>flex</application> being used is a beta -version, then the symbol @code{FLEX_BETA} is defined. 
-</listitem> -</itemizedlist> - - -<!-- @cindex POSIX comp;compliance --> -<!-- @cindex non-POSIX features of flex --> -The following <application>flex</application> features are not included in @code{lex} or the -POSIX specification: - -<itemizedlist> - -<listitem> - -C++ scanners -</listitem> -<listitem> - -%option -</listitem> -<listitem> - -start condition scopes -</listitem> -<listitem> - -start condition stacks -</listitem> -<listitem> - -interactive/non-interactive scanners -</listitem> -<listitem> - -yy_scan_string() and friends -</listitem> -<listitem> - -yyterminate() -</listitem> -<listitem> - -yy_set_interactive() -</listitem> -<listitem> - -yy_set_bol() -</listitem> -<listitem> - -YY_AT_BOL() - <<EOF>> -</listitem> -<listitem> - -<*> -</listitem> -<listitem> - -YY_DECL -</listitem> -<listitem> - -YY_START -</listitem> -<listitem> - -YY_USER_ACTION -</listitem> -<listitem> - -YY_USER_INIT -</listitem> -<listitem> - -#line directives -</listitem> -<listitem> - -%{}'s around actions -</listitem> -<listitem> - -reentrant C API -</listitem> -<listitem> - -multiple actions on a line -</listitem> -<listitem> - -almost all of the <application>flex</application> command-line options -</listitem> -</itemizedlist> - - -The feature ``multiple actions on a line'' -refers to the fact that with <application>flex</application> you can put multiple actions on -the same line, separated with semi-colons, while with @code{lex}, the -following: - -<informalexample> -<programlisting> -<![CDATA[ - foo handle_foo(); ++num_foos_seen; -]]> -</programlisting> -</informalexample> - -is (rather surprisingly) truncated to - -<informalexample> -<programlisting> -<![CDATA[ - foo handle_foo(); -]]> -</programlisting> -</informalexample> - -<application>flex</application> does not truncate the action. Actions that are not enclosed -in braces are simply terminated at the end of the line. 
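
The @samp{{,}} precedence difference listed earlier in this chapter can be checked outside of any scanner generator. The following is only a sketch: it uses the POSIX @code{regcomp}/@code{regexec} interface as a stand-in for a scanner, with the two readings of @samp{abc{1,3}} written out with explicit parentheses. The helper names @code{full_match}, @code{lex_reading}, and @code{flex_reading} are made up for this illustration and are not part of <application>flex</application> or @code{lex}.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Return 1 if `text` matches the POSIX extended regex `pattern`,
 * 0 otherwise. The caller supplies the ^...$ anchors. */
static int full_match(const char *pattern, const char *text)
{
    regex_t re;
    int rc = regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB);

    assert(rc == 0);            /* the illustrative patterns must compile */
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}

/* lex / POSIX reading of abc{1,3}: one to three repetitions of "abc". */
int lex_reading(const char *text)
{
    return full_match("^(abc){1,3}$", text);
}

/* flex reading of abc{1,3}: "ab" followed by one to three "c". */
int flex_reading(const char *text)
{
    return full_match("^ab(c){1,3}$", text);
}
```

Under these assumptions the input @samp{abcabc} is accepted only by @code{lex_reading}, the input @samp{abccc} only by @code{flex_reading}, and the plain @samp{abc} by both.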
- -</chapter> - -<chapter> -<title>Memory Management</title> - -<!-- @cindex memory management --> -@anchor{memory-management} -This chapter describes how flex handles dynamic memory, and how you can -override the default behavior. - -<!-- -@menu -* The Default Memory Management:: -* Overriding The Default Memory Management:: -* A Note About yytext And Memory:: -@end menu ---> - - -<section> -<title>The Default Memory Management</title> - -Flex allocates dynamic memory during initialization, and once in a while from -within a call to yylex(). Initialization takes place during the first call to -yylex(). Thereafter, flex may reallocate more memory if it needs to enlarge a -buffer. As of version 2.5.9, flex will clean up all memory when you call <function>yylex_destroy</function>. -@xref{faq-memory-leak}. - -Flex allocates dynamic memory for the purposes listed below. @footnote{The -quantities given here are approximate, and may vary due to host architecture, -compiler configuration, or due to future enhancements to flex.} - -<variablelist> - - -<varlistentry><term>16kB for the input buffer.</term> -<listitem> -Flex allocates memory for the character buffer used to perform pattern -matching. Flex must read ahead from the input stream and store it in a large -character buffer. This buffer is typically the largest chunk of dynamic memory -flex consumes. This buffer will grow if necessary, doubling the size each time. -Flex frees this memory when you call yylex_destroy(). The default size of this -buffer (16384 bytes) is almost always too large. The ideal size for this -buffer is the length of the longest token expected. Flex will allocate a few -extra bytes for housekeeping. - -</listitem> -</varlistentry> - -<varlistentry><term>16kB for the REJECT state. 
This will only be allocated if you use REJECT.</term> -<listitem> -The size is the same as the input buffer, so if you override the size of the -input buffer, then you automatically override the size of this buffer as well. - -</listitem> -</varlistentry> - -<varlistentry><term>100 bytes for the start condition stack.</term> -<listitem> -Flex allocates memory for the start condition stack. This is the stack used -for pushing start states, i.e., with yy_push_state(). It will grow if -necessary. Since the states are simply integers, this stack doesn't consume -much memory. This stack is not present if @code{%option stack} is not -specified. You will rarely need to tune this buffer. The ideal size for this -stack is the maximum depth expected. The memory for this stack is -automatically destroyed when you call yylex_destroy(). @xref{option-stack}. - -</listitem> -</varlistentry> - -<varlistentry><term>40 bytes for each YY_BUFFER_STATE.</term> -<listitem> -Flex allocates memory for each YY_BUFFER_STATE. The buffer state itself -is about 40 bytes, plus an additional large character buffer (described above.) -The initial buffer state is created during initialization, and with each call -to yy_create_buffer(). You can't tune the size of this, but you can tune the -character buffer as described above. Any buffer state that you explicitly -create by calling yy_create_buffer() is <emphasis>NOT</emphasis> destroyed automatically. You -must call yy_delete_buffer() to free the memory. The exception to this rule is -that flex will delete the current buffer automatically when you call -yylex_destroy(). If you delete the current buffer, be sure to set it to NULL. -That way, flex will not try to delete the buffer a second time (possibly -crashing your program!) At the time of this writing, flex does not provide a -growable stack for the buffer states. You have to manage that yourself. -@xref{Multiple Input Buffers}. 
- -</listitem> -</varlistentry> - -<varlistentry><term>84 bytes for the reentrant scanner guts</term> -<listitem> -Flex allocates about 84 bytes for the reentrant scanner structure when -you call yylex_init(). It is destroyed when the user calls yylex_destroy(). - -</listitem> -</varlistentry> -</variablelist> - - -</section> - -<section> -<title>Overriding The Default Memory Management</title> - -<!-- @cindex yyalloc, overriding --> -<!-- @cindex yyrealloc, overriding --> -<!-- @cindex yyfree, overriding --> - -Flex calls the functions <function>yyalloc</function>, <function>yyrealloc</function>, and <function>yyfree</function> -when it needs to allocate or free memory. By default, these functions are -wrappers around the standard C functions, @code{malloc}, @code{realloc}, and -@code{free}, respectively. You can override the default implementations by telling -flex that you will provide your own implementations. - -To override the default implementations, you must do two things: - -<orderedlist> - - -<listitem> - Suppress the default implementations by specifying one or more of the -following options: - -<itemizedlist> - -@opindex noyyalloc -<listitem> - @code{%option noyyalloc} -</listitem> -<listitem> - @code{%option noyyrealloc} -</listitem> -<listitem> - @code{%option noyyfree}. -</listitem> -</itemizedlist> - - -</listitem> -<listitem> - Provide your own implementation of the following functions: @footnote{It -is not necessary to override all (or any) of the memory management routines. 
-You may, for example, override <function>yyrealloc</function>, but not <function>yyfree</function> or -<function>yyalloc</function>.} - -<informalexample> -<programlisting> -<![CDATA[ -// For a non-reentrant scanner -void * yyalloc (size_t bytes); -void * yyrealloc (void * ptr, size_t bytes); -void yyfree (void * ptr); - -// For a reentrant scanner -void * yyalloc (size_t bytes, void * yyscanner); -void * yyrealloc (void * ptr, size_t bytes, void * yyscanner); -void yyfree (void * ptr, void * yyscanner); -]]> -</programlisting> -</informalexample> - -</listitem> -</orderedlist> - - -In the following example, we will override all three memory routines. We assume -that there is a custom allocator with garbage collection. In order to make this -example interesting, we will use a reentrant scanner, passing a pointer to the -custom allocator through @code{yyextra}. - -<!-- @cindex overriding the memory routines --> -<informalexample> -<programlisting> -<![CDATA[ -%{ -#include "some_allocator.h" -%} - -/* Suppress the default implementations. */ -%option noyyalloc noyyrealloc noyyfree -%option reentrant - -/* Initialize the allocator. */ -%{ -#define YY_EXTRA_TYPE struct allocator* -#define YY_USER_INIT yyextra = allocator_create(); -%} - -%% -.|\n ; -%% - -/* Provide our own implementations. */ -void * yyalloc (size_t bytes, void* yyscanner) { - return allocator_alloc (yyextra, bytes); -} - -void * yyrealloc (void * ptr, size_t bytes, void* yyscanner) { - /* Pass the old block along so the allocator can resize it. */ - return allocator_realloc (yyextra, ptr, bytes); -} - -void yyfree (void * ptr, void * yyscanner) { - /* Do nothing -- we leave it to the garbage collector. */ -} - -]]> -</programlisting> -</informalexample> - - -</section> - -<section> -<title>A Note About yytext And Memory</title> - -<!-- @cindex yytext, memory considerations --> - -When flex finds a match, <varname>yytext</varname> points to the first character of the -match in the input buffer. 
The string itself is part of the input buffer, and -is <emphasis>NOT</emphasis> allocated separately. The value of yytext will be overwritten the next -time yylex() is called. In short, the value of yytext is only valid from within -the matched rule's action. - -Often, you want the value of yytext to persist for later processing, i.e., by a -parser with non-zero lookahead. In order to preserve yytext, you will have to -copy it with strdup() or a similar function. But this introduces some headache -because your parser is now responsible for freeing the copy of yytext. If you -use a yacc or bison parser, (commonly used with flex), you will discover that -the error recovery mechanisms can cause memory to be leaked. - -To prevent memory leaks from strdup'd yytext, you will have to track the memory -somehow. Our experience has shown that a garbage collection mechanism or a -pooled memory mechanism will save you a lot of grief when writing parsers. - -</section> -</chapter> - -<chapter> -<title>Serialized Tables</title> -<!-- @cindex serialization --> -<!-- @cindex memory, serialized tables --> - -@anchor{serialization} -A <application>flex</application> scanner has the ability to save the DFA tables to a file, and -load them at runtime when needed. The motivation for this feature is to reduce -the runtime memory footprint. Traditionally, these tables have been compiled into -the scanner as C arrays, and are sometimes quite large. Since the tables are -compiled into the scanner, the memory used by the tables can never be freed. -This is a waste of memory, especially if an application uses several scanners, -but none of them at the same time. - -The serialization feature allows the tables to be loaded at runtime, before -scanning begins. The tables may be discarded when scanning is finished. 
- -<!-- -@menu -* Creating Serialized Tables:: -* Loading and Unloading Serialized Tables:: -* Tables File Format:: -@end menu ---> - - - -<section> -<title>Creating Serialized Tables</title> -<!-- @cindex tables, creating serialized --> -<!-- @cindex serialization of tables --> - -You may create a scanner with serialized tables by specifying: - -<informalexample> -<programlisting> -<![CDATA[ - %option tables-file=FILE -or - --tables-file=FILE -]]> -</programlisting> -</informalexample> - -These options instruct flex to save the DFA tables to the file @var{FILE}. The tables -will <emphasis>not</emphasis> be embedded in the generated scanner. The scanner will not -function on its own. The scanner will be dependent upon the serialized tables. You must -load the tables from this file at runtime before you can scan anything. - -If you do not specify a filename to @code{--tables-file}, the tables will be -saved to <filename>lex.yy.tables</filename>, where @samp{yy} is the appropriate prefix. - -If your project uses several different scanners, you can concatenate the -serialized tables into one file, and flex will find the correct set of tables, -using the scanner prefix as part of the lookup key. An example follows: - -<!-- @cindex serialized tables, multiple scanners --> -<informalexample> -<programlisting> -<![CDATA[ -$ flex --tables-file --prefix=cpp cpp.l -$ flex --tables-file --prefix=c c.l -$ cat lex.cpp.tables lex.c.tables > all.tables -]]> -</programlisting> -</informalexample> - -The above example created two scanners, @samp{cpp}, and @samp{c}. Since we did -not specify a filename, the tables were serialized to <filename>lex.c.tables</filename> and -<filename>lex.cpp.tables</filename>, respectively. Then, we concatenated the two files -together into <filename>all.tables</filename>, which we will distribute with our project. At -runtime, we will open the file and tell flex to load the tables from it. Flex -will find the correct tables automatically. 
(See next section). - -</section> - -<section> -<title>Loading and Unloading Serialized Tables</title> -<!-- @cindex tables, loading and unloading --> -<!-- @cindex loading tables at runtime --> -<!-- @cindex tables, freeing --> -<!-- @cindex freeing tables --> -<!-- @cindex memory, serialized tables --> - -If you've built your scanner with @code{%option tables-file}, then you must -load the scanner tables at runtime. This can be accomplished with the following -function: - -@deftypefun int yytables_fload (FILE* @var{fp} [, yyscan_t @var{scanner}]) -Locates scanner tables in the stream pointed to by @var{fp} and loads them. -Memory for the tables is allocated via <function>yyalloc</function>. You must call this -function before the first call to <function>yylex</function>. The argument @var{scanner} -only appears in the reentrant scanner. -This function returns @samp{0} (zero) on success, or non-zero on error. -@end deftypefun - -The loaded tables are <emphasis role="strong">not</emphasis> automatically destroyed (unloaded) when you -call <function>yylex_destroy</function>. The reason is that you may create several scanners -of the same type (in a reentrant scanner), each of which needs access to these -tables. To avoid a nasty memory leak, you must call the following function: - -@deftypefun int yytables_destroy ([yyscan_t @var{scanner}]) -Unloads the scanner tables. The tables must be loaded again before you can scan -any more data. The argument @var{scanner} only appears in the reentrant -scanner. This function returns @samp{0} (zero) on success, or non-zero on -error. -@end deftypefun - -<emphasis role="strong">The functions <function>yytables_fload</function> and <function>yytables_destroy</function> are not thread-safe.</emphasis> You must ensure that these functions are called exactly once (for -each scanner type) in a threaded program, before any thread calls <function>yylex</function>. 
-After the tables are loaded, they are never written to, and no thread -protection is required thereafter -- until you destroy them. - -</section> - -<section> -<title>Tables File Format</title> -<!-- @cindex tables, file format --> -<!-- @cindex file format, serialized tables --> - -This section defines the file format of serialized <application>flex</application> tables. - -The tables format allows for one or more sets of tables to be -specified, where each set corresponds to a given scanner. Scanners are -indexed by name, as described below. The file format is as follows: - -<informalexample> -<programlisting> -<![CDATA[ - TABLE SET 1 - +-------------------------------+ - Header | uint32 th_magic; | - | uint32 th_hsize; | - | uint32 th_ssize; | - | uint16 th_flags; | - | char th_version[]; | - | char th_name[]; | - | uint8 th_pad64[]; | - +-------------------------------+ - Table 1 | uint16 td_id; | - | uint16 td_flags; | - | uint32 td_lolen; | - | uint32 td_hilen; | - | void td_data[]; | - | uint8 td_pad64[]; | - +-------------------------------+ - Table 2 | | - . . . - . . . - . . . - . . . - Table n | | - +-------------------------------+ - TABLE SET 2 - . - . - . - TABLE SET N -]]> -</programlisting> -</informalexample> - -The above diagram shows that a complete set of tables consists of a header -followed by multiple individual tables. Furthermore, multiple complete sets may -be present in the same file, each set with its own header and tables. The sets -are contiguous in the file. The only way to know if another set follows is to -check the next four bytes for the magic number (or check for EOF). The header -and tables sections are padded to 64-bit boundaries. Below we describe each -field in detail. This format does not specify how the scanner will expand the -given data, i.e., data may be serialized as int8, but expanded to an int32 -array at runtime. This is to reduce the size of the serialized data where -possible. 
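
As a rough illustration of the layout above, the fixed-size leading fields of a table-set header could be decoded as sketched below. The sketch assumes that the four leading fields are stored back to back with no padding between them, reads integers as big-endian (network byte order), and uses the magic number given in the field list of this section. The names @code{read_be16}, @code{read_be32}, @code{pad64_bytes}, @code{struct header_prefix}, and @code{parse_header_prefix} are hypothetical helpers, not part of the <application>flex</application> API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TABLES_MAGIC 0xF13C57B1u   /* th_magic value documented in this section */

/* Decode a big-endian (network byte order) 16-bit integer. */
uint16_t read_be16(const unsigned char *p)
{
    return (uint16_t)(((unsigned)p[0] << 8) | p[1]);
}

/* Decode a big-endian (network byte order) 32-bit integer. */
uint32_t read_be32(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Number of zero bytes needed to pad `len` bytes to a 64-bit boundary. */
size_t pad64_bytes(size_t len)
{
    return (8 - len % 8) % 8;
}

/* The fixed-size fields at the start of a table-set header. */
struct header_prefix {
    uint32_t magic;   /* th_magic */
    uint32_t hsize;   /* th_hsize */
    uint32_t ssize;   /* th_ssize */
    uint16_t flags;   /* th_flags */
};

/* Fill `out` from `buf`. Returns 0 on success, or -1 if the buffer is
 * shorter than the 14 fixed bytes or the magic number does not match. */
int parse_header_prefix(const unsigned char *buf, size_t len,
                        struct header_prefix *out)
{
    assert(buf != NULL && out != NULL);
    if (len < 14)
        return -1;
    out->magic = read_be32(buf);
    out->hsize = read_be32(buf + 4);
    out->ssize = read_be32(buf + 8);
    out->flags = read_be16(buf + 12);
    return out->magic == TABLES_MAGIC ? 0 : -1;
}
```

A reader built this way would skip @code{th_ssize} bytes from the start of one set to reach the next, and would use @code{pad64_bytes} to work out how many padding bytes follow the variable-length strings.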
Remember, <emphasis>all integer values are in network byte order</emphasis>. - -@noindent -Fields of a table header: - -<variablelist> - -<varlistentry><term>th_magic</term> -<listitem> -Magic number, always 0xF13C57B1. - -</listitem> -</varlistentry> - -<varlistentry><term>th_hsize</term> -<listitem> -Size of this entire header, in bytes, including all fields plus any padding. - -</listitem> -</varlistentry> - -<varlistentry><term>th_ssize</term> -<listitem> -Size of this entire set, in bytes, including the header, all tables, plus -any padding. - -</listitem> -</varlistentry> - -<varlistentry><term>th_flags</term> -<listitem> -Bit flags for this table set. Currently unused. - -</listitem> -</varlistentry> - -<varlistentry><term>th_version[]</term> -<listitem> -Flex version in NULL-terminated string format, e.g., @samp{2.5.13a}. This is -the version of flex that was used to create the serialized tables. - -</listitem> -</varlistentry> - -<varlistentry><term>th_name[]</term> -<listitem> -Contains the name of this table set. The default is @samp{yytables}, -and is prefixed accordingly, e.g., @samp{footables}. Must be NULL-terminated. - -</listitem> -</varlistentry> - -<varlistentry><term>th_pad64[]</term> -<listitem> -Zero or more NULL bytes, padding the entire header to the next 64-bit boundary -as calculated from the beginning of the header. -</listitem> -</varlistentry> -</variablelist> - -@noindent -Fields of a table: - -<variablelist> - -<varlistentry><term>td_id</term> -<listitem> -Specifies the table identifier. 
Possible values are: -<variablelist> - -<varlistentry><term>YYTD_ID_ACCEPT (0x01)</term> -<listitem> -@code{yy_accept} -</listitem> -</varlistentry> - -<varlistentry><term>YYTD_ID_BASE (0x02)</term> -<listitem> -@code{yy_base} -</listitem> -</varlistentry> - -<varlistentry><term>YYTD_ID_CHK (0x03)</term> -<listitem> -@code{yy_chk} -</listitem> -</varlistentry> - -<varlistentry><term>YYTD_ID_DEF (0x04)</term> -<listitem> -@code{yy_def} -</listitem> -</varlistentry> - -<varlistentry><term>YYTD_ID_EC (0x05)</term> -<listitem> -@code{yy_ec } -</listitem> -</varlistentry> - -<varlistentry><term>YYTD_ID_META (0x06)</term> -<listitem> -@code{yy_meta} -</listitem> -</varlistentry> - -<varlistentry><term>YYTD_ID_NUL_TRANS (0x07)</term> -<listitem> -@code{yy_NUL_trans} -</listitem> -</varlistentry> - -<varlistentry><term>YYTD_ID_NXT (0x08)</term> -<listitem> -@code{yy_nxt}. This array may be two dimensional. See the <structfield>td_hilen</structfield> -field below. -</listitem> -</varlistentry> - -<varlistentry><term>YYTD_ID_RULE_CAN_MATCH_EOL (0x09)</term> -<listitem> -@code{yy_rule_can_match_eol} -</listitem> -</varlistentry> - -<varlistentry><term>YYTD_ID_START_STATE_LIST (0x0A)</term> -<listitem> -@code{yy_start_state_list}. This array is handled specially because it is an -array of pointers to structs. See the <structfield>td_flags</structfield> field below. -</listitem> -</varlistentry> - -<varlistentry><term>YYTD_ID_TRANSITION (0x0B)</term> -<listitem> -@code{yy_transition}. This array is handled specially because it is an array of -structs. See the <structfield>td_lolen</structfield> field below. -</listitem> -</varlistentry> - -<varlistentry><term>YYTD_ID_ACCLIST (0x0C)</term> -<listitem> -@code{yy_acclist} -</listitem> -</varlistentry> -</variablelist> - -</listitem> -</varlistentry> - -<varlistentry><term>td_flags</term> -<listitem> -Bit flags describing how to interpret the data in <structfield>td_data</structfield>. 
-The data arrays are one-dimensional by default, but may be -two dimensional as specified in the <structfield>td_hilen</structfield> field. - -<variablelist> - -<varlistentry><term>YYTD_DATA8 (0x01)</term> -<listitem> -The data is serialized as an array of type int8. -</listitem> -</varlistentry> - -<varlistentry><term>YYTD_DATA16 (0x02)</term> -<listitem> -The data is serialized as an array of type int16. -</listitem> -</varlistentry> - -<varlistentry><term>YYTD_DATA32 (0x04)</term> -<listitem> -The data is serialized as an array of type int32. -</listitem> -</varlistentry> - -<varlistentry><term>YYTD_PTRANS (0x08)</term> -<listitem> -The data is a list of indexes of entries in the expanded @code{yy_transition} -array. Each index should be expanded to a pointer to the corresponding entry -in the @code{yy_transition} array. We count on the fact that the -@code{yy_transition} array has already been seen. -</listitem> -</varlistentry> - -<varlistentry><term>YYTD_STRUCT (0x10)</term> -<listitem> -The data is a list of yy_trans_info structs, each of which consists of -two integers. There is no padding between struct elements or between structs. -The type of each member is determined by the @code{YYTD_DATA*} bits. -</listitem> -</varlistentry> -</variablelist> - -</listitem> -</varlistentry> - -<varlistentry><term>td_lolen</term> -<listitem> -Specifies the number of elements in the lowest dimension array. If this is -a one-dimensional array, then it is simply the number of elements in this array. -The element size is determined by the <structfield>td_flags</structfield> field. - -</listitem> -</varlistentry> - -<varlistentry><term>td_hilen</term> -<listitem> -If <structfield>td_hilen</structfield> is non-zero, then the data is a two-dimensional array. -Otherwise, the data is a one-dimensional array. 
<structfield>td_hilen</structfield> contains the -number of elements in the higher dimensional array, and <structfield>td_lolen</structfield> contains -the number of elements in the lowest dimension. - -Conceptually, <structfield>td_data</structfield> is either @code{sometype td_data[td_lolen]}, or -@code{sometype td_data[td_hilen][td_lolen]}, where @code{sometype} is specified -by the <structfield>td_flags</structfield> field. It is possible for both <structfield>td_lolen</structfield> and -<structfield>td_hilen</structfield> to be zero, in which case <structfield>td_data</structfield> is a zero length -array, and no data is loaded, i.e., this table is simply skipped. Flex does not -currently generate tables of zero length. - -</listitem> -</varlistentry> - -<varlistentry><term>td_data[]</term> -<listitem> -The table data. This array may be a one- or two-dimensional array, of type -@code{int8}, @code{int16}, @code{int32}, @code{struct yy_trans_info}, or -@code{struct yy_trans_info*}, depending upon the values in the -<structfield>td_flags</structfield>, <structfield>td_lolen</structfield>, and <structfield>td_hilen</structfield> fields. - -</listitem> -</varlistentry> - -<varlistentry><term>td_pad64[]</term> -<listitem> -Zero or more NULL bytes, padding the entire table to the next 64-bit boundary as -calculated from the beginning of this table. -</listitem> -</varlistentry> -</variablelist> - -</section> -</chapter> - -<chapter> -<title>Diagnostics</title> - -<!-- @cindex error reporting, diagnostic messages --> -<!-- @cindex warnings, diagnostic messages --> - -The following is a list of <application>flex</application> diagnostic messages: - -<itemizedlist> - -<listitem> - -@samp{warning, rule cannot be matched} indicates that the given rule -cannot be matched because it follows other rules that will always match -the same text as it. 
For example, in the following, @samp{foo} cannot be -matched because it comes after an identifier ``catch-all'' rule: - -<!-- @cindex warning, rule cannot be matched --> -<informalexample> -<programlisting> -<![CDATA[ - [a-z]+ got_identifier(); - foo got_foo(); -]]> -</programlisting> -</informalexample> - -Using @code{REJECT} in a scanner suppresses this warning. - -</listitem> -<listitem> - -@samp{warning, -s option given but default rule can be matched} means -that it is possible (perhaps only in a particular start condition) that -the default rule (match any single character) is the only one that will -match a particular input. Since @samp{-s} was given, presumably this is -not intended. - -</listitem> -<listitem> - -@code{reject_used_but_not_detected undefined} or -@code{yymore_used_but_not_detected undefined}. These errors can occur -at compile time. They indicate that the scanner uses @code{REJECT} or -<function>yymore</function> but that <application>flex</application> failed to notice the fact, meaning -that <application>flex</application> scanned the first two sections looking for occurrences -of these actions and failed to find any, but somehow you snuck some in -(via a #include file, for example). Use @code{%option reject} or -@code{%option yymore} to indicate to <application>flex</application> that you really do use -these features. - -</listitem> -<listitem> - -@samp{flex scanner jammed}. A scanner compiled with -@samp{-s} has encountered an input string which wasn't matched by any of -its rules. This error can also occur due to internal problems. - -</listitem> -<listitem> - -@samp{token too large, exceeds YYLMAX}. Your scanner uses @code{%array} -and one of its rules matched a string longer than the @code{YYLMAX} -constant (8K bytes by default). You can increase the value by -#define'ing @code{YYLMAX} in the definitions section of your <application>flex</application> -input. 
- -</listitem> -<listitem> - -@samp{scanner requires -8 flag to use the character 'x'}. Your scanner -specification includes recognizing the 8-bit character @samp{'x'} and -you did not specify the @samp{-8} flag, and your scanner defaulted to 7-bit -because you used the @samp{-Cf} or @samp{-CF} table compression options. -See the discussion of the @samp{-7} flag, @ref{Scanner Options}, for -details. - -</listitem> -<listitem> - -@samp{flex scanner push-back overflow}. You used <function>unput</function> to push -back so much text that the scanner's buffer could not hold both the -pushed-back text and the current token in <varname>yytext</varname>. Ideally the -scanner should dynamically resize the buffer in this case, but at -present it does not. - -</listitem> -<listitem> - -@samp{input buffer overflow, can't enlarge buffer because scanner uses -REJECT}. The scanner was working on matching an extremely large token -and needed to expand the input buffer. This doesn't work with scanners -that use @code{REJECT}. - -</listitem> -<listitem> - -@samp{fatal flex scanner internal error--end of buffer missed}. This can -occur in a scanner which is reentered after a long-jump has jumped out of -(or over) the scanner's activation frame. Before reentering the -scanner, use: -<informalexample> -<programlisting> -<![CDATA[ - yyrestart( yyin ); -]]> -</programlisting> -</informalexample> -or, as noted above, switch to using the C++ scanner class. - -</listitem> -<listitem> - -@samp{too many start conditions in <> construct!} You listed more start -conditions in a <> construct than exist (so you must have listed at -least one of them twice). -</listitem> -</itemizedlist> - - -</chapter> - -<chapter> -<title>Limitations</title> - -<!-- @cindex limitations of flex --> - -Some trailing context patterns cannot be properly matched and generate -warning messages (@samp{dangerous trailing context}). 
These are -patterns where the ending of the first part of the rule matches the -beginning of the second part, such as @samp{zx*/xy*}, where the @samp{x*} -matches the @samp{x} at the beginning of the trailing context. (Note that -the POSIX draft states that the text matched by such patterns is -undefined.) - -For some trailing context rules, parts which are actually -fixed-length are not recognized as such, leading to the abovementioned -performance loss. In particular, parts using @samp{|} or @samp{{n}} -(such as @samp{foo{3}}) are always considered variable-length. - -Combining trailing context with the special @samp{|} action can result -in <emphasis>fixed</emphasis> trailing context being turned into the more expensive -<emphasis>variable</emphasis> trailing context, as happens in the following: - -<!-- @cindex warning, dangerous trailing context --> -<informalexample> -<programlisting> -<![CDATA[ - %% - abc | - xyz/def -]]> -</programlisting> -</informalexample> - -Use of <function>unput</function> invalidates yytext and yyleng, unless the -@code{%array} directive or the @samp{-l} option has been used. - -Pattern-matching of @code{NUL}s is substantially slower than matching -other characters. - -Dynamic resizing of the input buffer is slow, as it -entails rescanning all the text matched so far by the current (generally -huge) token. - -Due to both buffering of input and read-ahead, you cannot -intermix calls to <filename>stdio.h</filename> routines, such as @b{getchar()}, -with <application>flex</application> rules and expect it to work. Call <function>input</function> -instead. - -The total number of table entries listed by the @samp{-v} flag excludes -the number of table entries needed to determine what rule has been -matched. The number of entries is equal to the number of DFA states if -the scanner does not use @code{REJECT}, and somewhat greater than the -number of states if it does. - -@code{REJECT} cannot be used with the -@samp{-f} or @samp{-F} options. 
- -The <application>flex</application> internal algorithms need documentation. - -</chapter> - -<chapter> -<title>Additional Reading</title> - -You may wish to read more about the following programs: -<itemizedlist> - -<listitem> - lex -</listitem> -<listitem> - yacc -</listitem> -<listitem> - sed -</listitem> -<listitem> - awk -</listitem> -</itemizedlist> - - -The following books may contain material of interest: - -John Levine, Tony Mason, and Doug Brown, -Lex & Yacc, -O'Reilly and Associates. Be sure to get the 2nd edition. - -M. E. Lesk and E. Schmidt, -@emph{LEX -- Lexical Analyzer Generator} - -Alfred Aho, Ravi Sethi and Jeffrey Ullman, @emph{Compilers: Principles, -Techniques and Tools}, Addison-Wesley (1986). Describes the -pattern-matching techniques used by <application>flex</application> (deterministic finite -automata). - -</chapter> - -<chapter> -<title>FAQ</title> - -From time to time, the <application>flex</application> maintainer receives certain -questions. Rather than repeat answers to well-understood problems, we -publish them here. 
- -<!-- -@menu -* When was flex born?:: -* How do I expand \ escape sequences in C-style quoted strings?:: -* Why do flex scanners call fileno if it is not ANSI compatible?:: -* Does flex support recursive pattern definitions?:: -* How do I skip huge chunks of input (tens of megabytes) while using flex?:: -* Flex is not matching my patterns in the same order that I defined them.:: -* My actions are executing out of order or sometimes not at all.:: -* How can I have multiple input sources feed into the same scanner at the same time?:: -* Can I build nested parsers that work with the same input file?:: -* How can I match text only at the end of a file?:: -* How can I make REJECT cascade across start condition boundaries?:: -* Why cant I use fast or full tables with interactive mode?:: -* How much faster is -F or -f than -C?:: -* If I have a simple grammar cant I just parse it with flex?:: -* Why doesnt yyrestart() set the start state back to INITIAL?:: -* How can I match C-style comments?:: -* The period isnt working the way I expected.:: -* Can I get the flex manual in another format?:: -* Does there exist a "faster" NDFA->DFA algorithm?:: -* How does flex compile the DFA so quickly?:: -* How can I use more than 8192 rules?:: -* How do I abandon a file in the middle of a scan and switch to a new file?:: -* How do I execute code only during initialization (only before the first scan)?:: -* How do I execute code at termination?:: -* Where else can I find help?:: -* Can I include comments in the "rules" section of the file?:: -* I get an error about undefined yywrap().:: -* How can I change the matching pattern at run time?:: -* How can I expand macros in the input?:: -* How can I build a two-pass scanner?:: -* How do I match any string not matched in the preceding rules?:: -* I am trying to port code from <acronym>&</acronym> lex that uses yysptr and yysbuf.:: -* Is there a way to make flex treat NULL like a regular character?:: -* Whenever flex can not match the 
input it says "flex scanner jammed".:: -* Why doesnt flex have non-greedy operators like perl does?:: -* Memory leak - 16386 bytes allocated by malloc.:: -* How do I track the byte offset for lseek()?:: -* How do I use my own I/O classes in a C++ scanner?:: -* How do I skip as many chars as possible?:: -* deleteme00:: -* Are certain equivalent patterns faster than others?:: -* Is backing up a big deal?:: -* Can I fake multi-byte character support?:: -* deleteme01:: -* Can you discuss some flex internals?:: -* unput() messes up yy_at_bol:: -* The | operator is not doing what I want:: -* Why can't flex understand this variable trailing context pattern?:: -* The ^ operator isn't working:: -* Trailing context is getting confused with trailing optional patterns:: -* Is flex GNU or not?:: -* ERASEME53:: -* I need to scan if-then-else blocks and while loops:: -* ERASEME55:: -* ERASEME56:: -* ERASEME57:: -* Is there a repository for flex scanners?:: -* How can I conditionally compile or preprocess my flex input file?:: -* Where can I find grammars for lex and yacc?:: -* I get an end-of-buffer message for each character scanned.:: -* unnamed-faq-62:: -* unnamed-faq-63:: -* unnamed-faq-64:: -* unnamed-faq-65:: -* unnamed-faq-66:: -* unnamed-faq-67:: -* unnamed-faq-68:: -* unnamed-faq-69:: -* unnamed-faq-70:: -* unnamed-faq-71:: -* unnamed-faq-72:: -* unnamed-faq-73:: -* unnamed-faq-74:: -* unnamed-faq-75:: -* unnamed-faq-76:: -* unnamed-faq-77:: -* unnamed-faq-78:: -* unnamed-faq-79:: -* unnamed-faq-80:: -* unnamed-faq-81:: -* unnamed-faq-82:: -* unnamed-faq-83:: -* unnamed-faq-84:: -* unnamed-faq-85:: -* unnamed-faq-86:: -* unnamed-faq-87:: -* unnamed-faq-88:: -* unnamed-faq-90:: -* unnamed-faq-91:: -* unnamed-faq-92:: -* unnamed-faq-93:: -* unnamed-faq-94:: -* unnamed-faq-95:: -* unnamed-faq-96:: -* unnamed-faq-97:: -* unnamed-faq-98:: -* unnamed-faq-99:: -* unnamed-faq-100:: -* unnamed-faq-101:: -@end menu--> - - -<section> -<title>When was flex born?</title> - -Vern 
Paxson took over -the @cite{Software Tools} lex project from Jef Poskanzer in 1982. At that point it -was written in Ratfor. Around 1987 or so, Paxson translated it into C, and -a legend was born :-). - -</section> - -<section> -<title>How do I expand \ escape sequences in C-style quoted strings?</title> - -A key point when scanning quoted strings is that you cannot (easily) write -a single rule that will precisely match the string if you allow things -like embedded escape sequences and newlines. If you try to match strings -with a single rule then you'll wind up having to rescan the string anyway -to find any escape sequences. - -Instead you can use exclusive start conditions and a set of rules, one for -matching non-escaped text, one for matching a single escape, one for -matching an embedded newline, and one for recognizing the end of the -string. Each of these rules is then faced with the question of where to -put its intermediary results. The best solution is for the rules to -append their local value of <varname>yytext</varname> to the end of a ``string literal'' -buffer. A rule like the escape-matcher will append to the buffer the -meaning of the escape sequence rather than the literal text in <varname>yytext</varname>. -In this way, <varname>yytext</varname> does not need to be modified at all. - -</section> - -<section> -<title>Why do flex scanners call fileno if it is not ANSI compatible?</title> - -Flex scanners call <function>fileno</function> in order to get the file descriptor -corresponding to <varname>yyin</varname>. The file descriptor may be passed to -<function>isatty</function> or <function>read</function>, depending upon which @code{%options} you specified. -If your system does not have <function>fileno</function> support, to get rid of the -<function>read</function> call, do not specify @code{%option read}. 
To get rid of the <function>isatty</function> -call, you must specify one of @code{%option always-interactive} or -@code{%option never-interactive}. - -</section> - -<section> -<title>Does flex support recursive pattern definitions?</title> - -e.g., - -<informalexample> -<programlisting> -<![CDATA[ -%% -block "{"({block}|{statement})*"}" -]]> -</programlisting> -</informalexample> - -No. You cannot have recursive definitions. The pattern-matching power of -regular expressions in general (and therefore flex scanners, too) is -limited. In particular, regular expressions cannot ``balance'' parentheses -to an arbitrary degree. For example, it's impossible to write a regular -expression that matches all strings containing the same number of '{'s -as '}'s. For more powerful pattern matching, you need a parser, such -as <application>GNU bison</application>. - -</section> - -<section> -<title>How do I skip huge chunks of input (tens of megabytes) while using flex?</title> - -Use <function>fseek</function> (or <function>lseek</function>) to position yyin, then call <function>yyrestart</function>. - -</section> - -<section> -<title>Flex is not matching my patterns in the same order that I defined them.</title> - -<application>flex</application> picks the -rule that matches the most text (i.e., the longest possible input string). -This is because <application>flex</application> uses an entirely different matching technique -(``deterministic finite automata'') that actually does all of the matching -simultaneously, in parallel. (Seems impossible, but it's actually a fairly -simple technique once you understand the principles.) - -A side-effect of this parallel matching is that when the input matches more -than one rule, <application>flex</application> scanners pick the rule that matched the <emphasis>most</emphasis> text. This -is explained further in the manual, in the section @xref{Matching}. 
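The selection rule described above (longest match wins, and the earliest rule wins a tie) can be sketched in plain C. This is only a naive simulation of the outcome, not how flex works internally: flex compiles all rules into a single DFA rather than trying them one by one. The names rule_fn, match_data_prefix, match_identifier, and pick_rule are hypothetical and exist only for this illustration.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical rule type: returns how many leading characters of s
 * the rule matches, or 0 if it does not match at all. */
typedef size_t (*rule_fn)(const char *s);

/* Rule 0: the literal prefix "data_". */
size_t match_data_prefix(const char *s)
{
    return strncmp(s, "data_", 5) == 0 ? 5 : 0;
}

/* Rule 1: an identifier of the form [A-Za-z_][A-Za-z0-9_]* */
size_t match_identifier(const char *s)
{
    size_t n = 0;

    if (!(isalpha((unsigned char)s[0]) || s[0] == '_'))
        return 0;
    while (isalnum((unsigned char)s[n]) || s[n] == '_')
        n++;
    return n;
}

/* Returns the index of the winning rule, or -1 if no rule matches.
 * The length of the winning match is stored in *len. */
int pick_rule(const rule_fn *rules, size_t nrules, const char *s, size_t *len)
{
    int best = -1;
    size_t best_len = 0;
    size_t i;

    for (i = 0; i < nrules; i++) {
        size_t n = rules[i](s);
        if (n > best_len) {   /* strict '>' keeps the earlier rule on a tie */
            best = (int)i;
            best_len = n;
        }
    }
    *len = best_len;
    return best;
}
```

With these two rules, the input "data_123" goes to the identifier rule because it matches eight characters against five, even though the "data_" rule is listed first. The input "data_ x" is a five-character tie, so the rule listed first wins.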
- -If you want <application>flex</application> to choose a shorter match, then you can work around this -behavior by expanding your short -rule to match more text, then put back the extra: - -<informalexample> -<programlisting> -<![CDATA[ -data_.* yyless( 5 ); BEGIN BLOCKIDSTATE; -]]> -</programlisting> -</informalexample> - -Another fix would be to make the second rule active only during the -@code{<BLOCKIDSTATE>} start condition, and make that start condition exclusive -by declaring it with @code{%x} instead of @code{%s}. - -A final fix is to change the input language so that the ambiguity for -@samp{data_} is removed, by adding characters to it that don't match the -identifier rule, or by removing characters (such as @samp{_}) from the -identifier rule so it no longer matches @samp{data_}. (Of course, you might -also not have the option of changing the input language.) - -</section> - -<section> -<title>My actions are executing out of order or sometimes not at all.</title> - -Most likely, you have (in error) placed the opening @samp{{} of the action -block on a different line than the rule, e.g., - -<informalexample> -<programlisting> -<![CDATA[ -^(foo|bar) -{ <<<--- WRONG! - -} -]]> -</programlisting> -</informalexample> - -<application>flex</application> requires that the opening @samp{{} of an action associated with a rule -begin on the same line as does the rule. You need instead to write your rules -as follows: - -<informalexample> -<programlisting> -<![CDATA[ -^(foo|bar) { // CORRECT! 
- -} -]]> -</programlisting> -</informalexample> - -</section> - -<section> -<title>How can I have multiple input sources feed into the same scanner at the same time?</title> - -If @dots{} -<itemizedlist> - -<listitem> - -your scanner is free of backtracking (verified using <application>flex</application>'s @samp{-b} flag), -</listitem> -<listitem> - -AND you run your scanner interactively (@samp{-I} option; default unless using special table -compression options), -</listitem> -<listitem> - -AND you feed it one character at a time by redefining @code{YY_INPUT} to do so, -</listitem> -</itemizedlist> - - -then every time it matches a token, it will have exhausted its input -buffer (because the scanner is free of backtracking). This means you -can safely use <function>select</function> at the point and only call <function>yylex</function> for another -token if <function>select</function> indicates there's data available. - -That is, move the <function>select</function> out from the input function to a point where -it determines whether <function>yylex</function> gets called for the next token. - -With this approach, you will still have problems if your input can arrive -piecemeal; <function>select</function> could inform you that the beginning of a token is -available, you call <function>yylex</function> to get it, but it winds up blocking waiting -for the later characters in the token. - -Here's another way: Move your input multiplexing inside of @code{YY_INPUT}. That -is, whenever @code{YY_INPUT} is called, it <function>select</function>'s to see where input is -available. If input is available for the scanner, it reads and returns the -next byte. If input is available from another source, it calls whatever -function is responsible for reading from that source. (If no input is -available, it blocks until some input is available.) 
I've used this technique in an -interpreter I wrote that both reads keyboard input using a <application>flex</application> scanner and -IPC traffic from sockets, and it works fine. - -</section> - -<section> -<title>Can I build nested parsers that work with the same input file?</title> - -This is not going to work without some additional effort. The reason is -that <application>flex</application> block-buffers the input it reads from <varname>yyin</varname>. This means that the -``outermost'' <function>yylex</function>, when called, will automatically slurp up the first 8K -of input available on yyin, and subsequent calls to other <function>yylex</function>'s won't -see that input. You might be tempted to work around this problem by -redefining @code{YY_INPUT} to only return a small amount of text, but it turns out -that that approach is quite difficult. Instead, the best solution is to -combine all of your scanners into one large scanner, using a different -exclusive start condition for each. - -</section> - -<section> -<title>How can I match text only at the end of a file?</title> - -There is no way to write a rule which is ``match this text, but only if -it comes at the end of the file''. You can fake it, though, if you happen -to have a character lying around that you don't allow in your input. -Then you redefine @code{YY_INPUT} to call your own routine which, if it sees -an @samp{EOF}, returns the magic character first (and remembers to return a -real @code{EOF} next time it's called). Then you could write: - -<informalexample> -<programlisting> -<![CDATA[ -<COMMENT>(.|\n)*{EOF_CHAR} /* saw comment at EOF */ -]]> -</programlisting> -</informalexample> - -</section> - -<section> -<title>How can I make REJECT cascade across start condition boundaries?</title> - -You can do this as follows. Suppose you have a start condition @samp{A}, and -after exhausting all of the possible matches in @samp{<A>}, you want to try -matches in @samp{<INITIAL>}. 
Then you could use the following: - -<informalexample> -<programlisting> -<![CDATA[ -%x A -%% -<A>rule_that_is_long ...; REJECT; -<A>rule ...; REJECT; /* shorter rule */ -<A>etc. -... -<A>.|\n { -/* Shortest and last rule in <A>, so -* cascaded REJECT's will eventually -* wind up matching this rule. We want -* to now switch to the initial state -* and try matching from there instead. -*/ -yyless(0); /* put back matched text */ -BEGIN(INITIAL); -} -]]> -</programlisting> -</informalexample> - -</section> - -<section> -<title>Why can't I use fast or full tables with interactive mode?</title> - -One of the assumptions -flex makes is that interactive applications are inherently slow (they're -waiting on a human after all). -It has to do with how the scanner detects that it must be finished scanning -a token. For interactive scanners, after scanning each character the current -state is looked up in a table (essentially) to see whether there's a chance -of another input character possibly extending the length of the match. If -not, the scanner halts. For non-interactive scanners, the end-of-token test -is much simpler, basically a compare with 0, so no memory bus cycles. Since -the test occurs in the innermost scanning loop, one would like to make it go -as fast as possible. - -Still, it seems reasonable to allow the user to choose to trade off a bit -of performance in this area to gain the corresponding flexibility. There -might be another reason, though, why fast scanners don't support the -interactive option. - -</section> - -<section> -<title>How much faster is -F or -f than -C?</title> - -Much faster (factor of 2-3). - -</section> - -<section> -<title>If I have a simple grammar can't I just parse it with flex?</title> - -Is your grammar recursive? That's almost always a sign that you're -better off using a parser/scanner rather than just trying to use a scanner -alone. 
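As a rough illustration of why recursion is the warning sign: recognizing nested constructs requires remembering how deep you are, which is state that regular expressions alone cannot carry. The sketch below, in plain C with a hypothetical function name braces_balanced, keeps that state in a counter. In a flex scanner you would have to maintain such state by hand in your actions, whereas a parser generator handles the nesting for you.

```c
#include <stddef.h>

/* Returns 1 if every '{' in s is closed by a later '}' and no '}'
 * appears before its matching '{'; returns 0 otherwise. */
int braces_balanced(const char *s)
{
    size_t depth = 0;   /* current nesting depth; unbounded in principle */

    for (; *s != '\0'; s++) {
        if (*s == '{') {
            depth++;
        } else if (*s == '}') {
            if (depth == 0)
                return 0;   /* closing brace with nothing open */
            depth--;
        }
    }
    return depth == 0;
}
```

A single counter is enough for one kind of bracket. Once the grammar mixes several nested constructs, the counter becomes a stack, and at that point a real parser is usually the simpler tool.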
- -</section> - -<section> -<title>Why doesn't yyrestart() set the start state back to INITIAL?</title> - -There are two reasons. The first is that there might -be programs that rely on the start state not changing across file changes. -The second is that beginning with <application>flex</application> version 2.4, use of <function>yyrestart</function> is no longer required, -so fixing the problem there doesn't solve the more general problem. - -</section> - -<section> -<title>How can I match C-style comments?</title> - -You might be tempted to try something like this: - -<informalexample> -<programlisting> -<![CDATA[ -"/*".*"*/" // WRONG! -]]> -</programlisting> -</informalexample> - -or, worse, this: - -<informalexample> -<programlisting> -<![CDATA[ -"/*"(.|\n)"*/" // WRONG! -]]> -</programlisting> -</informalexample> - -The above rules will eat too much input, and blow up on things like: - -<informalexample> -<programlisting> -<![CDATA[ -/* a comment */ do_my_thing( "oops */" ); -]]> -</programlisting> -</informalexample> - -Here is one way which allows you to track line information: - -<informalexample> -<programlisting> -<![CDATA[ -<INITIAL>{ -"/*" BEGIN(IN_COMMENT); -} -<IN_COMMENT>{ -"*/" BEGIN(INITIAL); -[^*\n]+ // eat comment in chunks -"*" // eat the lone star -\n yylineno++; -} -]]> -</programlisting> -</informalexample> - -</section> - -<section> -<title>The '.' isn't working the way I expected.</title> - -Here are some tips for using @samp{.}: - -<itemizedlist> - -<listitem> - -<para> -A common mistake is to place the grouping parenthesis AFTER an operator, when -you really meant to place the parenthesis BEFORE the operator, e.g., you -probably want this @code{(foo|bar)+} and NOT this @code{(foo|bar+)}. -</para> - -<para> -The first pattern matches the words @samp{foo} or @samp{bar} any number of -times, e.g., it matches the text @samp{barfoofoobarfoo}. 
The -second pattern matches a single instance of @code{foo} or a single instance of -@code{bar} followed by one or more @samp{r}s, e.g., it matches the text @code{barrrr}. -</para> - -</listitem> -<listitem> - -<para> -A @samp{.} inside @samp{[]}'s just means a literal @samp{.} (period), -and NOT ``any character except newline''. -</para> - -</listitem> -<listitem> - -<para> -Remember that @samp{.} matches any character EXCEPT @samp{\n} (and @samp{EOF}). -If you really want to match ANY character, including newlines, then use @code{(.|\n)}. -Beware that the regex @code{(.|\n)+} will match your entire input! -</para> - -</listitem> -<listitem> - -<para> -Finally, if you want to match a literal @samp{.} (a period), then use @samp{[.]} or @samp{"."}. -</para> - -</listitem> -</itemizedlist> - - -</section> - -<section> -<title>Can I get the flex manual in another format?</title> - -The <application>flex</application> source distribution includes a -<application>texinfo</application> manual. You are free to convert that -<application>texinfo</application> into whatever format you desire. The -<application>texinfo</application> package includes tools for conversion to a -number of formats. - -</section> - -<section> -<title>Does there exist a "faster" NDFA->DFA algorithm?</title> - -<para> -There's no way around the potential exponential running time - it -can take you exponential time just to enumerate all of the DFA states. -In practice, though, the running time is closer to linear, or sometimes -quadratic. -</para> - -</section> - -<section> -<title>How does flex compile the DFA so quickly?</title> - -There are two big speed wins that <application>flex</application> uses: - -<orderedlist> - -<listitem> - -<para> -It analyzes the input rules to construct equivalence classes for those -characters that always make the same transitions. It then rewrites the NFA -using equivalence classes for transitions instead of characters. 
This cuts -down the NFA->DFA computation time dramatically, to the point where, for -uncompressed DFA tables, the DFA generation is often I/O bound in writing out -the tables. -</para> - -</listitem> -<listitem> - -<para> -It maintains hash values for previously computed DFA states, so testing -whether a newly constructed DFA state is equivalent to a previously constructed -state can be done very quickly, by first comparing hash values. -</para> - -</listitem> -</orderedlist> - - -</section> - -<section> -<title>How can I use more than 8192 rules?</title> - -<para> -<application>flex</application> is compiled with an upper limit of 8192 rules per scanner. -If you need more than 8192 rules in your scanner, you'll have to recompile <application>flex</application> -with the following changes in <filename>flexdef.h</filename>: -</para> - -<informalexample> -<programlisting> -<![CDATA[ -< #define YY_TRAILING_MASK 0x2000 -< #define YY_TRAILING_HEAD_MASK 0x4000 --- -> #define YY_TRAILING_MASK 0x20000000 -> #define YY_TRAILING_HEAD_MASK 0x40000000 -]]> -</programlisting> -</informalexample> - -<para> -This should work okay as long as your C compiler uses 32 bit integers. -But you might want to think about whether using such a huge number of rules -is the best way to solve your problem. -</para> - -<para> -The following may also be relevant: -</para> - -<para> -With luck, you should be able to increase the definitions in flexdef.h for: -</para> - -<informalexample> -<programlisting> -<![CDATA[ -#define JAMSTATE -32766 /* marks a reference to the state that always jams */ -#define MAXIMUM_MNS 31999 -#define BAD_SUBSCRIPT -32767 -]]> -</programlisting> -</informalexample> - -<para> -recompile everything, and it'll all work. Flex only has these 16-bit-like -values built into it because a long time ago it was developed on a machine -with 16-bit ints. I've given this advice to others in the past but haven't -heard back from them whether it worked okay or not... 
-</para> - -</section> - -<section> -<title>How do I abandon a file in the middle of a scan and switch to a new file?</title> - -Just call @code{yyrestart(newfile)}. Be sure to reset the start state if you want a -``fresh start'', since <function>yyrestart</function> does NOT reset the start state back to @code{INITIAL}. - -</section> - -<section> -<title>How do I execute code only during initialization (only before the first scan)?</title> - -You can specify an initial action by defining the macro @code{YY_USER_INIT} (though -note that <varname>yyout</varname> may not be available at the time this macro is executed). Or you -can add to the beginning of your rules section: - -<informalexample> -<programlisting> -<![CDATA[ -%% -/* Must be indented! */ -static int did_init = 0; - -if ( ! did_init ){ -do_my_init(); -did_init = 1; -} -]]> -</programlisting> -</informalexample> - -</section> - -<section> -<title>How do I execute code at termination?</title> - -You can specify an action for the @code{<<EOF>>} rule. - -</section> - -<section> -<title>Where else can I find help?</title> - -You can find the flex homepage on the web at -@uref{http://flex.sourceforge.net/}. See that page for details about flex -mailing lists as well. - -</section> - -<section> -<title>Can I include comments in the "rules" section of the file?</title> - -Yes, just about anywhere you want to. See the manual for the specific syntax. - -</section> - -<section> -<title>I get an error about undefined yywrap().</title> - -You must supply a <function>yywrap</function> function of your own, or link to <filename>libfl.a</filename> -(which provides one), or use - -<informalexample> -<programlisting> -<![CDATA[ -%option noyywrap -]]> -</programlisting> -</informalexample> - -in your source to say you don't want a <function>yywrap</function> function. 
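If you take the first route and supply your own function, a minimal sketch could look like the following. It assumes the simplest case, where the scanner should just stop at end of file. A fuller version would point yyin at the next input file and return 0 to continue scanning.

```c
/* Minimal yywrap: a nonzero return tells the scanner that there is
 * no further input, so it should finish when it reaches end of file. */
int yywrap(void)
{
    return 1;
}
```

This definition can go in the user code section of the scanner specification or in any C file linked with the generated scanner.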
- -</section> - -<section> -<title>How can I change the matching pattern at run time?</title> - -You can't; it's compiled into a static table when flex builds the scanner. - -</section> - -<section> -<title>How can I expand macros in the input?</title> - -The best way to approach this problem is at a higher level, e.g., in the parser. - -However, you can do this using multiple input buffers. - -<informalexample> -<programlisting> -<![CDATA[ -%% -macro/[a-z]+ { -/* Saw the macro "macro" followed by extra stuff. */ -main_buffer = YY_CURRENT_BUFFER; -expansion_buffer = yy_scan_string(expand(yytext)); -yy_switch_to_buffer(expansion_buffer); -} - -<<EOF>> { -if ( expansion_buffer ) -{ -// We were doing an expansion, return to where -// we were. -yy_switch_to_buffer(main_buffer); -yy_delete_buffer(expansion_buffer); -expansion_buffer = 0; -} -else -yyterminate(); -} -]]> -</programlisting> -</informalexample> - -You probably will want a stack of expansion buffers to allow nested macros. -From the above, though, hopefully the idea is clear. - -</section> - -<section> -<title>How can I build a two-pass scanner?</title> - -One way to do it is to filter the first pass to a temporary file, -then process the temporary file on the second pass. You will probably see a -performance hit, due to all the disk I/O. - -When you need to look ahead far forward like this, it almost always means -that the right solution is to build a parse tree of the entire input, then -walk it after the parse in order to generate the output. In a sense, this -is a two-pass approach, once through the text and once through the parse -tree, but the performance hit for the latter is usually an order of magnitude -smaller, since everything is already classified, in binary format, and -residing in memory. - -</section> - -<section> -<title>How do I match any string not matched in the preceding rules?</title> - -One way to assign precedence is to place the more specific rules first. 
If -two rules would match the same input (same sequence of characters) then the -first rule listed in the <application>flex</application> input wins, e.g., - -<informalexample> -<programlisting> -<![CDATA[ -%% -foo[a-zA-Z_]+ return FOO_ID; -bar[a-zA-Z_]+ return BAR_ID; -[a-zA-Z_]+ return GENERIC_ID; -]]> -</programlisting> -</informalexample> - -Note that the rule @code{[a-zA-Z_]+} must come *after* the others. It will match the -same amount of text as the more specific rules, and in that case the -<application>flex</application> scanner will pick the first rule listed in your scanner as the -one to match. - -</section> - -<section> -<title>I am trying to port code from <acronym>AT&T</acronym> lex that uses yysptr and yysbuf.</title> - -Those are internal variables pointing into the <acronym>AT&T</acronym> scanner's input buffer. I -imagine they're being manipulated in user versions of the <function>input</function> and <function>unput</function> -functions. If so, what you need to do is analyze those functions to figure out -what they're doing, and then replace <function>input</function> with an appropriate definition of -@code{YY_INPUT}. You shouldn't need to (and must not) replace -<application>flex</application>'s <function>unput</function> function. - -</section> - -<section> -<title>Is there a way to make flex treat NULL like a regular character?</title> - -Yes, @samp{\0} and @samp{\x00} should both do the trick. Perhaps you have an ancient -version of <application>flex</application>. The latest release is version @value{VERSION}. - -</section> - -<section> -<title>Whenever flex can not match the input it says "flex scanner jammed".</title> - -You need to add a rule that matches the otherwise-unmatched text, -e.g., - -<informalexample> -<programlisting> -<![CDATA[ -%option yylineno -%% -[[a bunch of rules here]] - -. 
printf("bad input character '%s' at line %d\n", yytext, yylineno); -]]> -</programlisting> -</informalexample> - -See @code{%option default} for more information. - -</section> - -<section> -<title>Why doesn't flex have non-greedy operators like perl does?</title> - -A DFA can do a non-greedy match by stopping -the first time it enters an accepting state, instead of consuming input until -it determines that no further matching is possible (a ``jam'' state). This -is actually easier to implement than longest leftmost match (which flex does). - -But it's also much less useful than longest leftmost match. In general, -when you find yourself wishing for non-greedy matching, that's usually a -sign that you're trying to make the scanner do some parsing. That's -generally the wrong approach, since it lacks the power to do a decent job. -Better is to either introduce a separate parser, or to split the scanner -into multiple scanners using (exclusive) start conditions. - -You might have -a separate start state once you've seen the @samp{BEGIN}. In that state, you -might then have a regex that will match @samp{END} (to kick you out of the -state), and perhaps @samp{(.|\n)} to get a single character within the chunk ... - -This approach also has much better error-reporting properties. - -</section> - -<section> -<title>Memory leak - 16386 bytes allocated by malloc.</title> -@anchor{faq-memory-leak} - -UPDATED 2002-07-10: As of <application>flex</application> version 2.5.9, this leak means that you did not -call <function>yylex_destroy</function>. If you are using an earlier version of <application>flex</application>, then read -on. - -The leak is about 16426 bytes. That is, (8192 * 2 + 2) for the read-buffer, and -about 40 for @code{struct yy_buffer_state} (depending upon alignment). The leak is in -the non-reentrant C scanner only (NOT in the reentrant scanner, NOT in the C++ -scanner). 
Since <application>flex</application> doesn't know when you are done, the buffer is never freed. - -However, the leak won't multiply since the buffer is reused no matter how many -times you call <function>yylex</function>. - -If you want to reclaim the memory when you are completely done scanning, then -you might try this: - -<informalexample> -<programlisting> -<![CDATA[ -/* For non-reentrant C scanner only. */ -yy_delete_buffer(YY_CURRENT_BUFFER); -yy_init = 1; -]]> -</programlisting> -</informalexample> - -Note: @code{yy_init} is an "internal variable", and hasn't been tested in this -situation. It is possible that some other globals may need resetting as well. - -</section> - -<section> -<title>How do I track the byte offset for lseek()?</title> - -<informalexample> -<programlisting> -<![CDATA[ -> We thought that it would be possible to have this number through the -> evaluation of the following expression: -> -> seek_position = (no_buffers)*YY_READ_BUF_SIZE + yy_c_buf_p - YY_CURRENT_BUFFER->yy_ch_buf -]]> -</programlisting> -</informalexample> - -While this is the right idea, it has two problems. The first is that -it's possible that <application>flex</application> will request less than @code{YY_READ_BUF_SIZE} during -an invocation of @code{YY_INPUT} (or that your input source will return less -even though @code{YY_READ_BUF_SIZE} bytes were requested). The second problem -is that when refilling its internal buffer, <application>flex</application> keeps some characters -from the previous buffer (because usually it's in the middle of a match, -and needs those characters to construct <varname>yytext</varname> for the match once it's -done). Because of this, @code{yy_c_buf_p - YY_CURRENT_BUFFER->yy_ch_buf} won't -be exactly the number of characters already read from the current buffer. - -An alternative solution is to count the number of characters you've matched -since starting to scan. This can be done by using @code{YY_USER_ACTION}. 
For -example, - -<informalexample> -<programlisting> -<![CDATA[ -#define YY_USER_ACTION num_chars += yyleng; -]]> -</programlisting> -</informalexample> - -(You need to be careful to update your bookkeeping if you use <function>yymore</function>, -<function>yyless</function>, <function>unput</function>, or <function>input</function>.) - -</section> - -<section> -<title>How do I use my own I/O classes in a C++ scanner?</title> - -When the flex C++ scanning class rewrite finally happens, then this sort of thing should become much easier. - -<!-- @cindex LexerOutput, overriding --> -<!-- @cindex LexerInput, overriding --> -<!-- @cindex overriding LexerOutput --> -<!-- @cindex overriding LexerInput --> -<!-- @cindex customizing I/O in C++ scanners --> -<!-- @cindex C++ I/O, customizing --> -You can do this by passing the various functions (such as <function>LexerInput</function> -and <function>LexerOutput</function>) NULL @code{iostream*}'s, and then -dealing with your own I/O classes surreptitiously (i.e., stashing them in -special member variables). This works because the only assumption about -the lexer regarding what's done with the iostream's is that they're -ultimately passed to <function>LexerInput</function> and <function>LexerOutput</function>, which then do whatever -is necessary with them. - -<!-- @c faq edit stopped here --> -</section> - -<section> -<title>How do I skip as many chars as possible?</title> - -How do I skip as many chars as possible -- without interfering with the other -patterns? - -In the example below, we want to skip over characters until we see the phrase -"endskip". The following will <emphasis>NOT</emphasis> work correctly (do you see why not?) - -<informalexample> -<programlisting> -<![CDATA[ -/* INCORRECT SCANNER */ -%x SKIP -%% -<INITIAL>startskip BEGIN(SKIP); -... -<SKIP>"endskip" BEGIN(INITIAL); -<SKIP>.* ; -]]> -</programlisting> -</informalexample> - -The problem is that the pattern .* will eat up the word "endskip." 
-The simplest (but slow) fix is: - -<informalexample> -<programlisting> -<![CDATA[ -<SKIP>"endskip" BEGIN(INITIAL); -<SKIP>. ; -]]> -</programlisting> -</informalexample> - -A faster fix involves making the second rule match more, without -making it match "endskip" plus something else. So for example: - -<informalexample> -<programlisting> -<![CDATA[ -<SKIP>"endskip" BEGIN(INITIAL); -<SKIP>[^e]+ ; -<SKIP>. ;/* so you eat up e's, too */ -]]> -</programlisting> -</informalexample> - -<!-- @c TODO: Evaluate this faq. --> -</section> - -<section> -<title>deleteme00</title> -<informalexample> -<programlisting> -<![CDATA[ -QUESTION: -When was flex born? - -Vern Paxson took over -the Software Tools lex project from Jef Poskanzer in 1982. At that point it -was written in Ratfor. Around 1987 or so, Paxson translated it into C, and -a legend was born :-). -]]> -</programlisting> -</informalexample> - -<!-- @c TODO: Evaluate this faq. --> -</section> - -<section> -<title>Are certain equivalent patterns faster than others?</title> -<informalexample> -<programlisting> -<![CDATA[ -To: Adoram Rogel <adoram@orna.hybridge.com> -Subject: Re: Flex 2.5.2 performance questions -In-reply-to: Your message of Wed, 18 Sep 96 11:12:17 EDT. -Date: Wed, 18 Sep 96 10:51:02 PDT -From: Vern Paxson <vern> - -[Note, the most recent flex release is 2.5.4, which you can get from -ftp.ee.lbl.gov. It has bug fixes over 2.5.2 and 2.5.3.] - -> 1. Using the pattern -> ([Ff](oot)?)?[Nn](ote)?(\.)? -> instead of -> (((F|f)oot(N|n)ote)|((N|n)ote)|((N|n)\.)|((F|f)(N|n)(\.))) -> (in a very complicated flex program) caused the program to slow from -> 300K+/min to 100K/min (no other changes were done). - -These two are not equivalent. For example, the first can match "footnote." -but the second can only match "footnote". This is almost certainly the -cause of the discrepancy - the slower scanner run is matching more tokens, -and/or having to do more backing up. - -> 2. 
Which of these two are better: [Ff]oot or (F|f)oot ? - -From a performance point of view, they're equivalent (modulo presumably -minor effects such as memory cache hit rates; and the presence of trailing -context, see below). From a space point of view, the first is slightly -preferable. - -> 3. I have a pattern that look like this: -> pats {p1}|{p2}|{p3}|...|{p50} (50 patterns ORd) -> -> running yet another complicated program that includes the following rule: -> <snext>{and}/{no4}{bb}{pats} -> -> gets me to "too complicated - over 32,000 states"... - -I can't tell from this example whether the trailing context is variable-length -or fixed-length (it could be the latter if {and} is fixed-length). If it's -variable length, which flex -p will tell you, then this reflects a basic -performance problem, and if you can eliminate it by restructuring your -scanner, you will see significant improvement. - -> so I divided {pats} to {pats1}, {pats2},..., {pats5} each consists of about -> 10 patterns and changed the rule to be 5 rules. -> This did compile, but what is the rule of thumb here ? - -The rule is to avoid trailing context other than fixed-length, in which for -a/b, either the 'a' pattern or the 'b' pattern have a fixed length. Use -of the '|' operator automatically makes the pattern variable length, so in -this case '[Ff]oot' is preferred to '(F|f)oot'. - -> 4. I changed a rule that looked like this: -> <snext8>{and}{bb}/{ROMAN}[^A-Za-z] { BEGIN... -> -> to the next 2 rules: -> <snext8>{and}{bb}/{ROMAN}[A-Za-z] { ECHO;} -> <snext8>{and}{bb}/{ROMAN} { BEGIN... -> -> Again, I understand the using [^...] will cause a great performance loss - -Actually, it doesn't cause any sort of performance loss. It's a surprising -fact about regular expressions that they always match in linear time -regardless of how complex they are. - -> but are there any specific rules about it ? 
- -See the "Performance Considerations" section of the man page, and also -the example in MISC/fastwc/. - - Vern -]]> -</programlisting> -</informalexample> - -<!-- @c TODO: Evaluate this faq. --> -</section> - -<section> -<title>Is backing up a big deal?</title> -<informalexample> -<programlisting> -<![CDATA[ -To: Adoram Rogel <adoram@hybridge.com> -Subject: Re: Flex 2.5.2 performance questions -In-reply-to: Your message of Thu, 19 Sep 96 10:16:04 EDT. -Date: Thu, 19 Sep 96 09:58:00 PDT -From: Vern Paxson <vern> - -> a lot about the backing up problem. -> I believe that there lies my biggest problem, and I'll try to improve -> it. - -Since you have variable trailing context, this is a bigger performance -problem. Fixing it is usually easier than fixing backing up, which in a -complicated scanner (yours seems to fit the bill) can be extremely -difficult to do correctly. - -You also don't mention what flags you are using for your scanner. --f makes a large speed difference, and -Cfe buys you nearly as much -speed but the resulting scanner is considerably smaller. - -> I have an | operator in {and} and in {pats} so both of them are variable -> length. - --p should have reported this. - -> Is changing one of them to fixed-length is enough ? - -Yes. - -> Is it possible to change the 32,000 states limit ? - -Yes. I've appended instructions on how. Before you make this change, -though, you should think about whether there are ways to fundamentally -simplify your scanner - those are certainly preferable! - - Vern - -To increase the 32K limit (on a machine with 32 bit integers), you increase -the magnitude of the following in flexdef.h: - -#define JAMSTATE -32766 /* marks a reference to the state that always jams */ -#define MAXIMUM_MNS 31999 -#define BAD_SUBSCRIPT -32767 -#define MAX_SHORT 32700 - -Adding a 0 or two after each should do the trick. -]]> -</programlisting> -</informalexample> - -<!-- @c TODO: Evaluate this faq. 
--> -</section> - -<section> -<title>Can I fake multi-byte character support?</title> -<informalexample> -<programlisting> -<![CDATA[ -To: Heeman_Lee@hp.com -Subject: Re: flex - multi-byte support? -In-reply-to: Your message of Thu, 03 Oct 1996 17:24:04 PDT. -Date: Fri, 04 Oct 1996 11:42:18 PDT -From: Vern Paxson <vern> - -> I assume as long as my *.l file defines the -> range of expected character code values (in octal format), flex will -> scan the file and read multi-byte characters correctly. But I have no -> confidence in this assumption. - -Your lack of confidence is justified - this won't work. - -Flex has in it a widespread assumption that the input is processed -one byte at a time. Fixing this is on the to-do list, but is involved, -so it won't happen any time soon. In the interim, the best I can suggest -(unless you want to try fixing it yourself) is to write your rules in -terms of pairs of bytes, using definitions in the first section: - - X \xfe\xc2 - ... - %% - foo{X}bar found_foo_fe_c2_bar(); - -etc. Definitely a pain - sorry about that. - -By the way, the email address you used for me is ancient, indicating you -have a very old version of flex. You can get the most recent, 2.5.4, from -ftp.ee.lbl.gov. - - Vern -]]> -</programlisting> -</informalexample> - -<!-- @c TODO: Evaluate this faq. --> -</section> - -<section> -<title>deleteme01</title> -<informalexample> -<programlisting> -<![CDATA[ -To: moleary@primus.com -Subject: Re: Flex / Unicode compatibility question -In-reply-to: Your message of Tue, 22 Oct 1996 10:15:42 PDT. -Date: Tue, 22 Oct 1996 11:06:13 PDT -From: Vern Paxson <vern> - -Unfortunately flex at the moment has a widespread assumption within it -that characters are processed 8 bits at a time. I don't see any easy -fix for this (other than writing your rules in terms of double characters - -a pain). 
I also don't know of a wider lex, though you might try surfing -the Plan 9 stuff because I know it's a Unicode system, and also the PCCT -toolkit (try searching say Alta Vista for "Purdue Compiler Construction -Toolkit"). - -Fixing flex to handle wider characters is on the long-term to-do list. -But since flex is a strictly spare-time project these days, this probably -won't happen for quite a while, unless someone else does it first. - - Vern -]]> -</programlisting> -</informalexample> - -<!-- @c TODO: Evaluate this faq. --> -</section> - -<section> -<title>Can you discuss some flex internals?</title> -<informalexample> -<programlisting> -<![CDATA[ -To: Johan Linde <jl@theophys.kth.se> -Subject: Re: translation of flex -In-reply-to: Your message of Sun, 10 Nov 1996 09:16:36 PST. -Date: Mon, 11 Nov 1996 10:33:50 PST -From: Vern Paxson <vern> - -> I'm working for the Swedish team translating GNU program, and I'm currently -> working with flex. I have a few questions about some of the messages which -> I hope you can answer. - -All of the things you're wondering about, by the way, concerning flex -internals - probably the only person who understands what they mean in -English is me! So I wouldn't worry too much about getting them right. -That said ... - -> #: main.c:545 -> msgid " %d protos created\n" -> -> Does proto mean prototype? - -Yes - prototypes of state compression tables. - -> #: main.c:539 -> msgid " %d/%d (peak %d) template nxt-chk entries created\n" -> -> Here I'm mainly puzzled by 'nxt-chk'. I guess it means 'next-check'. (?) -> However, 'template next-check entries' doesn't make much sense to me. To be -> able to find a good translation I need to know a little bit more about it. - -There is a scheme in the Aho/Sethi/Ullman compiler book for compressing -scanner tables. It involves creating two pairs of tables. The first has -"base" and "default" entries, the second has "next" and "check" entries. 
-The "base" entry is indexed by the current state and yields an index into -the next/check table. The "default" entry gives what to do if the state -transition isn't found in next/check. The "next" entry gives the next -state to enter, but only if the "check" entry verifies that this entry is -correct for the current state. Flex creates templates of series of -next/check entries and then encodes differences from these templates as a -way to compress the tables. - -> #: main.c:533 -> msgid " %d/%d base-def entries created\n" -> -> The same problem here for 'base-def'. - -See above. - - Vern -]]> -</programlisting> -</informalexample> - -<!-- @c TODO: Evaluate this faq. --> -</section> - -<section> -<title>unput() messes up yy_at_bol</title> -<informalexample> -<programlisting> -<![CDATA[ -To: Xinying Li <xli@npac.syr.edu> -Subject: Re: FLEX ? -In-reply-to: Your message of Wed, 13 Nov 1996 17:28:38 PST. -Date: Wed, 13 Nov 1996 19:51:54 PST -From: Vern Paxson <vern> - -> "unput()" them to input flow, question occurs. If I do this after I scan -> a carriage, the variable "YY_CURRENT_BUFFER->yy_at_bol" is changed. That -> means the carriage flag has gone. - -You can control this by calling yy_set_bol(). It's described in the manual. - -> And if in pre-reading it goes to the end of file, is anything done -> to control the end of curren buffer and end of file? - -No, there's no way to put back an end-of-file. - -> By the way I am using flex 2.5.2 and using the "-l". - -The latest release is 2.5.4, by the way. It fixes some bugs in 2.5.2 and -2.5.3. You can get it from ftp.ee.lbl.gov. - - Vern -]]> -</programlisting> -</informalexample> - -<!-- @c TODO: Evaluate this faq. --> -</section> - -<section> -<title>The | operator is not doing what I want</title> -<informalexample> -<programlisting> -<![CDATA[ -To: Alain.ISSARD@st.com -Subject: Re: Start condition with FLEX -In-reply-to: Your message of Mon, 18 Nov 1996 09:45:02 PST. 
-Date: Mon, 18 Nov 1996 10:41:34 PST -From: Vern Paxson <vern> - -> I am not able to use the start condition scope and to use the | (OR) with -> rules having start conditions. - -The problem is that if you use '|' as a regular expression operator, for -example "a|b" meaning "match either 'a' or 'b'", then it must *not* have -any blanks around it. If you instead want the special '|' *action* (which -from your scanner appears to be the case), which is a way of giving two -different rules the same action: - - foo | - bar matched_foo_or_bar(); - -then '|' *must* be separated from the first rule by whitespace and *must* -be followed by a new line. You *cannot* write it as: - - foo | bar matched_foo_or_bar(); - -even though you might think you could because yacc supports this syntax. -The reason for this unfortunate incompatibility is historical, but it's -unlikely to be changed. - -Your problems with start condition scope are simply due to syntax errors -from your use of '|' later confusing flex. - -Let me know if you still have problems. - - Vern -]]> -</programlisting> -</informalexample> - -<!-- @c TODO: Evaluate this faq. --> -</section> - -<section> -<title>Why can't flex understand this variable trailing context pattern?</title> -<informalexample> -<programlisting> -<![CDATA[ -To: Gregory Margo <gmargo@newton.vip.best.com> -Subject: Re: flex-2.5.3 bug report -In-reply-to: Your message of Sat, 23 Nov 1996 16:50:09 PST. -Date: Sat, 23 Nov 1996 17:07:32 PST -From: Vern Paxson <vern> - -> Enclosed is a lex file that "real" lex will process, but I cannot get -> flex to process it. Could you try it and maybe point me in the right direction? - -Your problem is that some of the definitions in the scanner use the '/' -trailing context operator, and have it enclosed in ()'s. Flex does not -allow this operator to be enclosed in ()'s because doing so allows undefined -regular expressions such as "(a/b)+". So the solution is to remove the -parentheses. 
Note that you must also be building the scanner with the -l -option for AT&T lex compatibility. Without this option, flex automatically -encloses the definitions in parentheses. - - Vern -]]> -</programlisting> -</informalexample> - -<!-- @c TODO: Evaluate this faq. --> -</section> - -<section> -<title>The ^ operator isn't working</title> -<informalexample> -<programlisting> -<![CDATA[ -To: Thomas Hadig <hadig@toots.physik.rwth-aachen.de> -Subject: Re: Flex Bug ? -In-reply-to: Your message of Tue, 26 Nov 1996 14:35:01 PST. -Date: Tue, 26 Nov 1996 11:15:05 PST -From: Vern Paxson <vern> - -> In my lexer code, i have the line : -> ^\*.* { } -> -> Thus all lines starting with an astrix (*) are comment lines. -> This does not work ! - -I can't get this problem to reproduce - it works fine for me. Note -though that if what you have is slightly different: - - COMMENT ^\*.* - %% - {COMMENT} { } - -then it won't work, because flex pushes back macro definitions enclosed -in ()'s, so the rule becomes - - (^\*.*) { } - -and now that the '^' operator is not at the immediate beginning of the -line, it's interpreted as just a regular character. You can avoid this -behavior by using the "-l" lex-compatibility flag, or "%option lex-compat". - - Vern -]]> -</programlisting> -</informalexample> - -<!-- @c TODO: Evaluate this faq. --> -</section> - -<section> -<title>Trailing context is getting confused with trailing optional patterns</title> -<informalexample> -<programlisting> -<![CDATA[ -To: Adoram Rogel <adoram@hybridge.com> -Subject: Re: Flex 2.5.4 BOF ??? -In-reply-to: Your message of Tue, 26 Nov 1996 16:10:41 PST. -Date: Wed, 27 Nov 1996 10:56:25 PST -From: Vern Paxson <vern> - -> Organization(s)?/[a-z] -> -> This matched "Organizations" (looking in debug mode, the trailing s -> was matched with trailing context instead of the optional (s) in the -> end of the word. - -That should only happen with lex. Flex can properly match this pattern. 
-(That might be what you're saying, I'm just not sure.) - -> Is there a way to avoid this dangerous trailing context problem ? - -Unfortunately, there's no easy way. On the other hand, I don't see why -it should be a problem. Lex's matching is clearly wrong, and I'd hope -that usually the intent remains the same as expressed with the pattern, -so flex's matching will be correct. - - Vern -]]> -</programlisting> -</informalexample> - -<!-- @c TODO: Evaluate this faq. --> -</section> - -<section> -<title>Is flex GNU or not?</title> -<informalexample> -<programlisting> -<![CDATA[ -To: Cameron MacKinnon <mackin@interlog.com> -Subject: Re: Flex documentation bug -In-reply-to: Your message of Mon, 02 Dec 1996 00:07:08 PST. -Date: Sun, 01 Dec 1996 22:29:39 PST -From: Vern Paxson <vern> - -> I'm not sure how or where to submit bug reports (documentation or -> otherwise) for the GNU project stuff ... - -Well, strictly speaking flex isn't part of the GNU project. They just -distribute it because no one's written a decent GPL'd lex replacement. -So you should send bugs directly to me. Those sent to the GNU folks -sometimes find their way to me, but some may drop between the cracks. - -> In GNU Info, under the section 'Start Conditions', and also in the man -> page (mine's dated April '95) is a nice little snippet showing how to -> parse C quoted strings into a buffer, defined to be MAX_STR_CONST in -> size. Unfortunately, no overflow checking is ever done ... - -This is already mentioned in the manual: - -Finally, here's an example of how to match C-style quoted -strings using exclusive start conditions, including expanded -escape sequences (but not including checking for a string -that's too long): - -The reason for not doing the overflow checking is that it will needlessly -clutter up an example whose main purpose is just to demonstrate how to -use flex. - -The latest release is 2.5.4, by the way, available from ftp.ee.lbl.gov. 
- - Vern -]]> -</programlisting> -</informalexample> - -<!-- @c TODO: Evaluate this faq. --> -</section> - -<section> -<title>ERASEME53</title> -<informalexample> -<programlisting> -<![CDATA[ -To: tsv@cs.UManitoba.CA -Subject: Re: Flex (reg).. -In-reply-to: Your message of Thu, 06 Mar 1997 23:50:16 PST. -Date: Thu, 06 Mar 1997 15:54:19 PST -From: Vern Paxson <vern> - -> [:alpha:] ([:alnum:] | \\_)* - -If your rule really has embedded blanks as shown above, then it won't -work, as the first blank delimits the rule from the action. (It wouldn't -even compile ...) You need instead: - -[:alpha:]([:alnum:]|\\_)* - -and that should work fine - there's no restriction on what can go inside -of ()'s except for the trailing context operator, '/'. - - Vern -]]> -</programlisting> -</informalexample> - -<!-- @c TODO: Evaluate this faq. --> -</section> - -<section> -<title>I need to scan if-then-else blocks and while loops</title> -<informalexample> -<programlisting> -<![CDATA[ -To: "Mike Stolnicki" <mstolnic@ford.com> -Subject: Re: FLEX help -In-reply-to: Your message of Fri, 30 May 1997 13:33:27 PDT. -Date: Fri, 30 May 1997 10:46:35 PDT -From: Vern Paxson <vern> - -> We'd like to add "if-then-else", "while", and "for" statements to our -> language ... -> We've investigated many possible solutions. The one solution that seems -> the most reasonable involves knowing the position of a TOKEN in yyin. - -I strongly advise you to instead build a parse tree (abstract syntax tree) -and loop over that instead. You'll find this has major benefits in keeping -your interpreter simple and extensible. - -That said, the functionality you mention for get_position and set_position -have been on the to-do list for a while. As flex is a purely spare-time -project for me, no guarantees when this will be added (in particular, it -for sure won't be for many months to come). - - Vern -]]> -</programlisting> -</informalexample> - -<!-- @c TODO: Evaluate this faq. 
--> -</section> - -<section> -<title>ERASEME55</title> -<informalexample> -<programlisting> -<![CDATA[ -To: Colin Paul Adams <colin@colina.demon.co.uk> -Subject: Re: Flex C++ classes and Bison -In-reply-to: Your message of 09 Aug 1997 17:11:41 PDT. -Date: Fri, 15 Aug 1997 10:48:19 PDT -From: Vern Paxson <vern> - -> #define YY_DECL int yylex (YYSTYPE *lvalp, struct parser_control -> *parm) -> -> I have been trying to get this to work as a C++ scanner, but it does -> not appear to be possible (warning that it matches no declarations in -> yyFlexLexer, or something like that). -> -> Is this supposed to be possible, or is it being worked on (I DID -> notice the comment that scanner classes are still experimental, so I'm -> not too hopeful)? - -What you need to do is derive a subclass from yyFlexLexer that provides -the above yylex() method, squirrels away lvalp and parm into member -variables, and then invokes yyFlexLexer::yylex() to do the regular scanning. - - Vern -]]> -</programlisting> -</informalexample> - -<!-- @c TODO: Evaluate this faq. --> -</section> - -<section> -<title>ERASEME56</title> -<informalexample> -<programlisting> -<![CDATA[ -To: Mikael.Latvala@lmf.ericsson.se -Subject: Re: Possible mistake in Flex v2.5 document -In-reply-to: Your message of Fri, 05 Sep 1997 16:07:24 PDT. -Date: Fri, 05 Sep 1997 10:01:54 PDT -From: Vern Paxson <vern> - -> In that example you show how to count comment lines when using -> C style /* ... */ comments. My question is, shouldn't you take into -> account a scenario where end of a comment marker occurs inside -> character or string literals? - -The scanner certainly needs to also scan character and string literals. -However it does that (there's an example in the man page for strings), the -lexer will recognize the beginning of the literal before it runs across the -embedded "/*". Consequently, it will finish scanning the literal before it -even considers the possibility of matching "/*". 
- -Example: - - '([^']*|{ESCAPE_SEQUENCE})' - -will match all the text between the ''s (inclusive). So the lexer -considers this as a token beginning at the first ', and doesn't even -attempt to match other tokens inside it. - -I think this subtlety is not worth putting in the manual, as I suspect -it would confuse more people than it would enlighten. - - Vern -]]> -</programlisting> -</informalexample> - -<!-- @c TODO: Evaluate this faq. --> -</section> - -<section> -<title>ERASEME57</title> -<informalexample> -<programlisting> -<![CDATA[ -To: "Marty Leisner" <leisner@sdsp.mc.xerox.com> -Subject: Re: flex limitations -In-reply-to: Your message of Sat, 06 Sep 1997 11:27:21 PDT. -Date: Mon, 08 Sep 1997 11:38:08 PDT -From: Vern Paxson <vern> - -> %% -> [a-zA-Z]+ /* skip a line */ -> { printf("got %s\n", yytext); } -> %% - -What version of flex are you using? If I feed this to 2.5.4, it complains: - - "bug.l", line 5: EOF encountered inside an action - "bug.l", line 5: unrecognized rule - "bug.l", line 5: fatal parse error - -Not the world's greatest error message, but it manages to flag the problem. - -(With the introduction of start condition scopes, flex can't accommodate -an action on a separate line, since it's ambiguous with an indented rule.) - -You can get 2.5.4 from ftp.ee.lbl.gov. - - Vern -]]> -</programlisting> -</informalexample> - -<!-- @c TODO: Evaluate this faq. --> -</section> - -<section> -<title>Is there a repository for flex scanners?</title> - -Not that we know of. You might try asking on comp.compilers. - -<!-- @c TODO: Evaluate this faq. --> -</section> - -<section> -<title>How can I conditionally compile or preprocess my flex input file?</title> - - -Flex doesn't have a preprocessor like C does. You might try using m4, or the C -preprocessor plus a sed script to clean up the result. - - -<!-- @c TODO: Evaluate this faq. 
--> -</section> - -<section> -<title>Where can I find grammars for lex and yacc?</title> - -In the sources for flex and bison. - -<!-- @c TODO: Evaluate this faq. --> -</section> - -<section> -<title>I get an end-of-buffer message for each character scanned.</title> - -This will happen if your LexerInput() function returns only one character -at a time, which can happen either if your scanner is "interactive", or -if the streams library on your platform always returns 1 for yyin->gcount(). - -Solution: override LexerInput() with a version that returns whole buffers. - -<!-- @c TODO: Evaluate this faq. --> -</section> - -<section> -<title>unnamed-faq-62</title> -<informalexample> -<programlisting> -<![CDATA[ -To: Georg.Rehm@CL-KI.Uni-Osnabrueck.DE -Subject: Re: Flex maximums -In-reply-to: Your message of Mon, 17 Nov 1997 17:16:06 PST. -Date: Mon, 17 Nov 1997 17:16:15 PST -From: Vern Paxson <vern> - -> I took a quick look into the flex-sources and altered some #defines in -> flexdefs.h: -> -> #define INITIAL_MNS 64000 -> #define MNS_INCREMENT 1024000 -> #define MAXIMUM_MNS 64000 - -The things to fix are to add a couple of zeroes to: - -#define JAMSTATE -32766 /* marks a reference to the state that always jams */ -#define MAXIMUM_MNS 31999 -#define BAD_SUBSCRIPT -32767 -#define MAX_SHORT 32700 - -and, if you get complaints about too many rules, make the following change too: - - #define YY_TRAILING_MASK 0x200000 - #define YY_TRAILING_HEAD_MASK 0x400000 - -- Vern -]]> -</programlisting> -</informalexample> - -<!-- @c TODO: Evaluate this faq. --> -</section> - -<section> -<title>unnamed-faq-63</title> -<informalexample> -<programlisting> -<![CDATA[ -To: jimmey@lexis-nexis.com (Jimmey Todd) -Subject: Re: FLEX question regarding istream vs ifstream -In-reply-to: Your message of Mon, 08 Dec 1997 15:54:15 PST. 
-Date: Mon, 15 Dec 1997 13:21:35 PST -From: Vern Paxson <vern> - -> stdin_handle = YY_CURRENT_BUFFER; -> ifstream fin( "aFile" ); -> yy_switch_to_buffer( yy_create_buffer( fin, YY_BUF_SIZE ) ); -> -> What I'm wanting to do, is pass the contents of a file thru one set -> of rules and then pass stdin thru another set... It works great if, I -> don't use the C++ classes. But since everything else that I'm doing is -> in C++, I thought I'd be consistent. -> -> The problem is that 'yy_create_buffer' is expecting an istream* as it's -> first argument (as stated in the man page). However, fin is a ifstream -> object. Any ideas on what I might be doing wrong? Any help would be -> appreciated. Thanks!! - -You need to pass &fin, to turn it into an ifstream* instead of an ifstream. -Then its type will be compatible with the expected istream*, because ifstream -is derived from istream. - - Vern -]]> -</programlisting> -</informalexample> - -<!-- @c TODO: Evaluate this faq. --> -</section> - -<section> -<title>unnamed-faq-64</title> -<informalexample> -<programlisting> -<![CDATA[ -To: Enda Fadian <fadiane@piercom.ie> -Subject: Re: Question related to Flex man page? -In-reply-to: Your message of Tue, 16 Dec 1997 15:17:34 PST. -Date: Tue, 16 Dec 1997 14:17:09 PST -From: Vern Paxson <vern> - -> Can you explain to me what is ment by a long-jump in relation to flex? - -Using the longjmp() function while inside yylex() or a routine called by it. - -> what is the flex activation frame. - -Just yylex()'s stack frame. - -> As far as I can see yyrestart will bring me back to the sart of the input -> file and using flex++ isnot really an option! - -No, yyrestart() doesn't imply a rewind, even though its name might sound -like it does. It tells the scanner to flush its internal buffers and -start reading from the given file at its present location. - - Vern -]]> -</programlisting> -</informalexample> - -<!-- @c TODO: Evaluate this faq. 
--> -</section> - -<section> -<title>unnamed-faq-65</title> -<informalexample> -<programlisting> -<![CDATA[ -To: hassan@larc.info.uqam.ca (Hassan Alaoui) -Subject: Re: Need urgent Help -In-reply-to: Your message of Sat, 20 Dec 1997 19:38:19 PST. -Date: Sun, 21 Dec 1997 21:30:46 PST -From: Vern Paxson <vern> - -> /usr/lib/yaccpar: In function `int yyparse()': -> /usr/lib/yaccpar:184: warning: implicit declaration of function `int yylex(...)' -> -> ld: Undefined symbol -> _yylex -> _yyparse -> _yyin - -This is a known problem with Solaris C++ (and/or Solaris yacc). I believe -the fix is to explicitly insert some 'extern "C"' statements for the -corresponding routines/symbols. - - Vern -]]> -</programlisting> -</informalexample> - -<!-- @c TODO: Evaluate this faq. --> -</section> - -<section> -<title>unnamed-faq-66</title> -<informalexample> -<programlisting> -<![CDATA[ -To: mc0307@mclink.it -Cc: gnu@prep.ai.mit.edu -Subject: Re: [mc0307@mclink.it: Help request] -In-reply-to: Your message of Fri, 12 Dec 1997 17:57:29 PST. -Date: Sun, 21 Dec 1997 22:33:37 PST -From: Vern Paxson <vern> - -> This is my definition for float and integer types: -> . . . -> NZD [1-9] -> ... -> I've tested my program on other lex version (on UNIX Sun Solaris an HP -> UNIX) and it work well, so I think that my definitions are correct. -> There are any differences between Lex and Flex? - -There are indeed differences, as discussed in the man page. The one -you are probably running into is that when flex expands a name definition, -it puts parentheses around the expansion, while lex does not. There's -an example in the man page of how this can lead to different matching. -Flex's behavior complies with the POSIX standard (or at least with the -last POSIX draft I saw). - - Vern -]]> -</programlisting> -</informalexample> - -<!-- @c TODO: Evaluate this faq. 
--> -</section> - -<section> -<title>unnamed-faq-67</title> -<informalexample> -<programlisting> -<![CDATA[ -To: hassan@larc.info.uqam.ca (Hassan Alaoui) -Subject: Re: Thanks -In-reply-to: Your message of Mon, 22 Dec 1997 16:06:35 PST. -Date: Mon, 22 Dec 1997 14:35:05 PST -From: Vern Paxson <vern> - -> Thank you very much for your help. I compile and link well with C++ while -> declaring 'yylex ...' extern, But a little problem remains. I get a -> segmentation default when executing ( I linked with lfl library) while it -> works well when using LEX instead of flex. Do you have some ideas about the -> reason for this ? - -The one possible reason for this that comes to mind is if you've defined -yytext as "extern char yytext[]" (which is what lex uses) instead of -"extern char *yytext" (which is what flex uses). If it's not that, then -I'm afraid I don't know what the problem might be. - - Vern -]]> -</programlisting> -</informalexample> - -<!-- @c TODO: Evaluate this faq. --> -</section> - -<section> -<title>unnamed-faq-68</title> -<informalexample> -<programlisting> -<![CDATA[ -To: "Bart Niswonger" <NISWONGR@almaden.ibm.com> -Subject: Re: flex 2.5: c++ scanners & start conditions -In-reply-to: Your message of Tue, 06 Jan 1998 10:34:21 PST. -Date: Tue, 06 Jan 1998 19:19:30 PST -From: Vern Paxson <vern> - -> The problem is that when I do this (using %option c++) start -> conditions seem to not apply. - -The BEGIN macro modifies the yy_start variable. For C scanners, this -is a static with scope visible through the whole file. For C++ scanners, -it's a member variable, so it only has visible scope within a member -function. Your lexbegin() routine is not a member function when you -build a C++ scanner, so it's not modifying the correct yy_start. 
The -diagnostic that indicates this is that you found you needed to add -a declaration of yy_start in order to get your scanner to compile when -using C++; instead, the correct fix is to make lexbegin() a member -function (by deriving from yyFlexLexer). - - Vern -]]> -</programlisting> -</informalexample> - -<!-- @c TODO: Evaluate this faq. --> -</section> - -<section> -<title>unnamed-faq-69</title> -<informalexample> -<programlisting> -<![CDATA[ -To: "Boris Zinin" <boris@ippe.rssi.ru> -Subject: Re: current position in flex buffer -In-reply-to: Your message of Mon, 12 Jan 1998 18:58:23 PST. -Date: Mon, 12 Jan 1998 12:03:15 PST -From: Vern Paxson <vern> - -> The problem is how to determine the current position in flex active -> buffer when a rule is matched.... - -You will need to keep track of this explicitly, such as by redefining -YY_USER_ACTION to count the number of characters matched. - -The latest flex release, by the way, is 2.5.4, available from ftp.ee.lbl.gov. - - Vern -]]> -</programlisting> -</informalexample> - -<!-- @c TODO: Evaluate this faq. --> -</section> - -<section> -<title>unnamed-faq-70</title> -<informalexample> -<programlisting> -<![CDATA[ -To: Bik.Dhaliwal@bis.org -Subject: Re: Flex question -In-reply-to: Your message of Mon, 26 Jan 1998 13:05:35 PST. -Date: Tue, 27 Jan 1998 22:41:52 PST -From: Vern Paxson <vern> - -> That requirement involves knowing -> the character position at which a particular token was matched -> in the lexer. - -The way you have to do this is by explicitly keeping track of where -you are in the file, by counting the number of characters scanned -for each token (available in yyleng). It may prove convenient to -do this by redefining YY_USER_ACTION, as described in the manual. - - Vern -]]> -</programlisting> -</informalexample> - -<!-- @c TODO: Evaluate this faq. 
--> -</section> - -<section> -<title>unnamed-faq-71</title> -<informalexample> -<programlisting> -<![CDATA[ -To: Vladimir Alexiev <vladimir@cs.ualberta.ca> -Subject: Re: flex: how to control start condition from parser? -In-reply-to: Your message of Mon, 26 Jan 1998 05:50:16 PST. -Date: Tue, 27 Jan 1998 22:45:37 PST -From: Vern Paxson <vern> - -> It seems useful for the parser to be able to tell the lexer about such -> context dependencies, because then they don't have to be limited to -> local or sequential context. - -One way to do this is to have the parser call a stub routine that's -included in the scanner's .l file, and consequently that has access to -BEGIN. The only ugliness is that the parser can't pass in the state -it wants, because those aren't visible - but if you don't have many -such states, then using a different set of names doesn't seem like -too much of a burden. - -While generating a .h file like you suggest is certainly cleaner, -flex development has come to a virtual stand-still :-(, so a workaround -like the above is much more pragmatic than waiting for a new feature. - - Vern -]]> -</programlisting> -</informalexample> - -<!-- @c TODO: Evaluate this faq. --> -</section> - -<section> -<title>unnamed-faq-72</title> -<informalexample> -<programlisting> -<![CDATA[ -To: Barbara Denny <denny@3com.com> -Subject: Re: freebsd flex bug? -In-reply-to: Your message of Fri, 30 Jan 1998 12:00:43 PST. -Date: Fri, 30 Jan 1998 12:42:32 PST -From: Vern Paxson <vern> - -> lex.yy.c:1996: parse error before `=' - -This is the key, identifying this error. (It may help to pinpoint -it by using flex -L, so it doesn't generate #line directives in its -output.) I will bet you heavy money that you have a start condition -name that is also a variable name, or something like that; flex spits -out #define's for each start condition name, mapping them to a number, -so you can wind up with: - - %x foo - %% - ...
- %% - void bar() - { - int foo = 3; - } - -and the penultimate will turn into "int 1 = 3" after C preprocessing, -since flex will put "#define foo 1" in the generated scanner. - - Vern -]]> -</programlisting> -</informalexample> - -<!-- @c TODO: Evaluate this faq. --> -</section> - -<section> -<title>unnamed-faq-73</title> -<informalexample> -<programlisting> -<![CDATA[ -To: Maurice Petrie <mpetrie@infoscigroup.com> -Subject: Re: Lost flex .l file -In-reply-to: Your message of Mon, 02 Feb 1998 14:10:01 PST. -Date: Mon, 02 Feb 1998 11:15:12 PST -From: Vern Paxson <vern> - -> I am curious as to -> whether there is a simple way to backtrack from the generated source to -> reproduce the lost list of tokens we are searching on. - -In theory, it's straight-forward to go from the DFA representation -back to a regular-expression representation - the two are isomorphic. -In practice, a huge headache, because you have to unpack all the tables -back into a single DFA representation, and then write a program to munch -on that and translate it into an RE. - -Sorry for the less-than-happy news ... - - Vern -]]> -</programlisting> -</informalexample> - -<!-- @c TODO: Evaluate this faq. --> -</section> - -<section> -<title>unnamed-faq-74</title> -<informalexample> -<programlisting> -<![CDATA[ -To: jimmey@lexis-nexis.com (Jimmey Todd) -Subject: Re: Flex performance question -In-reply-to: Your message of Thu, 19 Feb 1998 11:01:17 PST. -Date: Thu, 19 Feb 1998 08:48:51 PST -From: Vern Paxson <vern> - -> What I have found, is that the smaller the data chunk, the faster the -> program executes. This is the opposite of what I expected. Should this be -> happening this way? - -This is exactly what will happen if your input file has embedded NULs. -From the man page: - -A final note: flex is slow when matching NUL's, particularly -when a token contains multiple NUL's. It's best to write -rules which match short amounts of text if it's anticipated -that the text will often include NUL's. 
- -So that's the first thing to look for. - - Vern -]]> -</programlisting> -</informalexample> - -<!-- @c TODO: Evaluate this faq. --> -</section> - -<section> -<title>unnamed-faq-75</title> -<informalexample> -<programlisting> -<![CDATA[ -To: jimmey@lexis-nexis.com (Jimmey Todd) -Subject: Re: Flex performance question -In-reply-to: Your message of Thu, 19 Feb 1998 11:01:17 PST. -Date: Thu, 19 Feb 1998 15:42:25 PST -From: Vern Paxson <vern> - -So there are several problems. - -First, to go fast, you want to match as much text as possible, which -your scanners don't in the case that what they're scanning is *not* -a <RN> tag. So you want a rule like: - - [^<]+ - -Second, C++ scanners are particularly slow if they're interactive, -which they are by default. Using -B speeds it up by a factor of 3-4 -on my workstation. - -Third, C++ scanners that use the istream interface are slow, because -of how poorly implemented istream's are. I built two versions of -the following scanner: - - %% - .*\n - .* - %% - -and the C version inhales a 2.5MB file on my workstation in 0.8 seconds. -The C++ istream version, using -B, takes 3.8 seconds. - - Vern -]]> -</programlisting> -</informalexample> - -<!-- @c TODO: Evaluate this faq. --> -</section> - -<section> -<title>unnamed-faq-76</title> -<informalexample> -<programlisting> -<![CDATA[ -To: "Frescatore, David (CRD, TAD)" <frescatore@exc01crdge.crd.ge.com> -Subject: Re: FLEX 2.5 & THE YEAR 2000 -In-reply-to: Your message of Wed, 03 Jun 1998 11:26:22 PDT. -Date: Wed, 03 Jun 1998 10:22:26 PDT -From: Vern Paxson <vern> - -> I am researching the Y2K problem with General Electric R&D -> and need to know if there are any known issues concerning -> the above mentioned software and Y2K regardless of version. - -There shouldn't be, all it ever does with the date is ask the system -for it and then print it out. - - Vern -]]> -</programlisting> -</informalexample> - -<!-- @c TODO: Evaluate this faq. 
--> -</section> - -<section> -<title>unnamed-faq-77</title> -<informalexample> -<programlisting> -<![CDATA[ -To: "Hans Dermot Doran" <htd@ibhdoran.com> -Subject: Re: flex problem -In-reply-to: Your message of Wed, 15 Jul 1998 21:30:13 PDT. -Date: Tue, 21 Jul 1998 14:23:34 PDT -From: Vern Paxson <vern> - -> To overcome this, I gets() the stdin into a string and lex the string. The -> string is lexed OK except that the end of string isn't lexed properly -> (yy_scan_string()), that is the lexer dosn't recognise the end of string. - -Flex doesn't contain mechanisms for recognizing buffer endpoints. But if -you use fgets instead (which you should anyway, to protect against buffer -overflows), then the final \n will be preserved in the string, and you can -scan that in order to find the end of the string. - - Vern -]]> -</programlisting> -</informalexample> - -<!-- @c TODO: Evaluate this faq. --> -</section> - -<section> -<title>unnamed-faq-78</title> -<informalexample> -<programlisting> -<![CDATA[ -To: soumen@almaden.ibm.com -Subject: Re: Flex++ 2.5.3 instance member vs. static member -In-reply-to: Your message of Mon, 27 Jul 1998 02:10:04 PDT. -Date: Tue, 28 Jul 1998 01:10:34 PDT -From: Vern Paxson <vern> - -> %{ -> int mylineno = 0; -> %} -> ws [ \t]+ -> alpha [A-Za-z] -> dig [0-9] -> %% -> -> Now you'd expect mylineno to be a member of each instance of class -> yyFlexLexer, but is this the case? A look at the lex.yy.cc file seems to -> indicate otherwise; unless I am missing something the declaration of -> mylineno seems to be outside any class scope. -> -> How will this work if I want to run a multi-threaded application with each -> thread creating a FlexLexer instance? - -Derive your own subclass and make mylineno a member variable of it. - - Vern -]]> -</programlisting> -</informalexample> - -<!-- @c TODO: Evaluate this faq. 
--> -</section> - -<section> -<title>unnamed-faq-79</title> -<informalexample> -<programlisting> -<![CDATA[ -To: Adoram Rogel <adoram@hybridge.com> -Subject: Re: More than 32K states change hangs -In-reply-to: Your message of Tue, 04 Aug 1998 16:55:39 PDT. -Date: Tue, 04 Aug 1998 22:28:45 PDT -From: Vern Paxson <vern> - -> Vern Paxson, -> -> I followed your advice, posted on Usenet bu you, and emailed to me -> personally by you, on how to overcome the 32K states limit. I'm running -> on Linux machines. -> I took the full source of version 2.5.4 and did the following changes in -> flexdef.h: -> #define JAMSTATE -327660 -> #define MAXIMUM_MNS 319990 -> #define BAD_SUBSCRIPT -327670 -> #define MAX_SHORT 327000 -> -> and compiled. -> All looked fine, including check and bigcheck, so I installed. - -Hmmm, you shouldn't increase MAX_SHORT, though looking through my email -archives I see that I did indeed recommend doing so. Try setting it back -to 32700; that should suffice that you no longer need -Ca. If it still -hangs, then the interesting question is - where? - -> Compiling the same hanged program with a out-of-the-box (RedHat 4.2 -> distribution of Linux) -> flex 2.5.4 binary works. - -Since Linux comes with source code, you should diff it against what -you have to see what problems they missed. - -> Should I always compile with the -Ca option now ? even short and simple -> filters ? - -No, definitely not. It's meant to be for those situations where you -absolutely must squeeze every last cycle out of your scanner. - - Vern -]]> -</programlisting> -</informalexample> - -<!-- @c TODO: Evaluate this faq. --> -</section> - -<section> -<title>unnamed-faq-80</title> -<informalexample> -<programlisting> -<![CDATA[ -To: "Schmackpfeffer, Craig" <Craig.Schmackpfeffer@usa.xerox.com> -Subject: Re: flex output for static code portion -In-reply-to: Your message of Tue, 11 Aug 1998 11:55:30 PDT. 
-Date: Mon, 17 Aug 1998 23:57:42 PDT -From: Vern Paxson <vern> - -> I would like to use flex under the hood to generate a binary file -> containing the data structures that control the parse. - -This has been on the wish-list for a long time. In principle it's -straight-forward - you redirect mkdata() et al's I/O to another file, -and modify the skeleton to have a start-up function that slurps these -into dynamic arrays. The concerns are (1) the scanner generation code -is hairy and full of corner cases, so it's easy to get surprised when -going down this path :-( ; and (2) being careful about buffering so -that when the tables change you make sure the scanner starts in the -correct state and reading at the right point in the input file. - -> I was wondering if you know of anyone who has used flex in this way. - -I don't - but it seems like a reasonable project to undertake (unlike -numerous other flex tweaks :-). - - Vern -]]> -</programlisting> -</informalexample> - -<!-- @c TODO: Evaluate this faq. 
--> -</section> - -<section> -<title>unnamed-faq-81</title> -<informalexample> -<programlisting> -<![CDATA[ -Received: from 131.173.17.11 (131.173.17.11 [131.173.17.11]) - by ee.lbl.gov (8.9.1/8.9.1) with ESMTP id AAA03838 - for <vern@ee.lbl.gov>; Thu, 20 Aug 1998 00:47:57 -0700 (PDT) -Received: from hal.cl-ki.uni-osnabrueck.de (hal.cl-ki.Uni-Osnabrueck.DE [131.173.141.2]) - by deimos.rz.uni-osnabrueck.de (8.8.7/8.8.8) with ESMTP id JAA34694 - for <vern@ee.lbl.gov>; Thu, 20 Aug 1998 09:47:55 +0200 -Received: (from georg@localhost) by hal.cl-ki.uni-osnabrueck.de (8.6.12/8.6.12) id JAA34834 for vern@ee.lbl.gov; Thu, 20 Aug 1998 09:47:54 +0200 -From: Georg Rehm <georg@hal.cl-ki.uni-osnabrueck.de> -Message-Id: <199808200747.JAA34834@hal.cl-ki.uni-osnabrueck.de> -Subject: "flex scanner push-back overflow" -To: vern@ee.lbl.gov -Date: Thu, 20 Aug 1998 09:47:54 +0200 (MEST) -Reply-To: Georg.Rehm@CL-KI.Uni-Osnabrueck.DE -X-NoJunk: Do NOT send commercial mail, spam or ads to this address! -X-URL: http://www.cl-ki.uni-osnabrueck.de/~georg/ -X-Mailer: ELM [version 2.4ME+ PL28 (25)] -MIME-Version: 1.0 -Content-Type: text/plain; charset=US-ASCII -Content-Transfer-Encoding: 7bit - -Hi Vern, - -Yesterday, I encountered a strange problem: I use the macro processor m4 -to include some lengthy lists into a .l file. Following is a flex macro -definition that causes some serious pain in my neck: - -AUTHOR ("A. Boucard / L. Boucard"|"A. Dastarac / M. Levent"|"A.Boucaud / L.Boucaud"|"Abderrahim Lamchichi"|"Achmat Dangor"|"Adeline Toullier"|"Adewale Maja-Pearce"|"Ahmed Ziri"|"Akram Ellyas"|"Alain Bihr"|"Alain Gresh"|"Alain Guillemoles"|"Alain Joxe"|"Alain Morice"|"Alain Renon"|"Alain Zecchini"|"Albert Memmi"|"Alberto Manguel"|"Alex De Waal"|"Alfonso Artico"| [...]) - -The complete list contains about 10kB. 
When I try to "flex" this file -(on a Solaris 2.6 machine, using a modified flex 2.5.4 (I only increased -some of the predefined values in flexdefs.h) I get the error: - -myflex/flex -8 sentag.tmp.l -flex scanner push-back overflow - -When I remove the slashes in the macro definition everything works fine. -As I understand it, the double quotes escape the slash-character so it -really means "/" and not "trailing context". Furthermore, I tried to -escape the slashes with backslashes, but with no use, the same error message -appeared when flexing the code. - -Do you have an idea what's going on here? - -Greetings from Germany, - Georg --- -Georg Rehm georg@cl-ki.uni-osnabrueck.de -Institute for Semantic Information Processing, University of Osnabrueck, FRG -]]> -</programlisting> -</informalexample> - -<!-- @c TODO: Evaluate this faq. --> -</section> - -<section> -<title>unnamed-faq-82</title> -<informalexample> -<programlisting> -<![CDATA[ -To: Georg.Rehm@CL-KI.Uni-Osnabrueck.DE -Subject: Re: "flex scanner push-back overflow" -In-reply-to: Your message of Thu, 20 Aug 1998 09:47:54 PDT. -Date: Thu, 20 Aug 1998 07:05:35 PDT -From: Vern Paxson <vern> - -> myflex/flex -8 sentag.tmp.l -> flex scanner push-back overflow - -Flex itself uses a flex scanner. That scanner is running out of buffer -space when it tries to unput() the humongous macro you've defined. When -you remove the '/'s, you make it small enough so that it fits in the buffer; -removing spaces would do the same thing. - -The fix is to either rethink how come you're using such a big macro and -perhaps there's another/better way to do it; or to rebuild flex's own -scan.c with a larger value for - - #define YY_BUF_SIZE 16384 - -- Vern -]]> -</programlisting> -</informalexample> - -<!-- @c TODO: Evaluate this faq. 
--> -</section> - -<section> -<title>unnamed-faq-83</title> -<informalexample> -<programlisting> -<![CDATA[ -To: Jan Kort <jan@research.techforce.nl> -Subject: Re: Flex -In-reply-to: Your message of Fri, 04 Sep 1998 12:18:43 +0200. -Date: Sat, 05 Sep 1998 00:59:49 PDT -From: Vern Paxson <vern> - -> %% -> -> "TEST1\n" { fprintf(stderr, "TEST1\n"); yyless(5); } -> ^\n { fprintf(stderr, "empty line\n"); } -> . { } -> \n { fprintf(stderr, "new line\n"); } -> -> %% -> -- input --------------------------------------- -> TEST1 -> -- output -------------------------------------- -> TEST1 -> empty line -> ------------------------------------------------ - -IMHO, it's not clear whether or not this is in fact a bug. It depends -on whether you view yyless() as backing up in the input stream, or as -pushing new characters onto the beginning of the input stream. Flex -interprets it as the latter (for implementation convenience, I'll admit), -and so considers the newline as in fact matching at the beginning of a -line, as after all the last token scanned an entire line and so the -scanner is now at the beginning of a new line. - -I agree that this is counter-intuitive for yyless(), given its -functional description (it's less so for unput(), depending on whether -you're unput()'ing new text or scanned text). But I don't plan to -change it any time soon, as it's a pain to do so. Consequently, -you do indeed need to use yy_set_bol() and YY_AT_BOL() to tweak -your scanner into the behavior you desire. - -Sorry for the less-than-completely-satisfactory answer. - - Vern -]]> -</programlisting> -</informalexample> - -<!-- @c TODO: Evaluate this faq. --> -</section> - -<section> -<title>unnamed-faq-84</title> -<informalexample> -<programlisting> -<![CDATA[ -To: Patrick Krusenotto <krusenot@mac-info-link.de> -Subject: Re: Problems with restarting flex-2.5.2-generated scanner -In-reply-to: Your message of Thu, 24 Sep 1998 10:14:07 PDT. 
-Date: Thu, 24 Sep 1998 23:28:43 PDT -From: Vern Paxson <vern> - -> I am using flex-2.5.2 and bison 1.25 for Solaris and I am desperately -> trying to make my scanner restart with a new file after my parser stops -> with a parse error. When my compiler restarts, the parser always -> receives the token after the token (in the old file!) that caused the -> parser error. - -I suspect the problem is that your parser has read ahead in order -to attempt to resolve an ambiguity, and when it's restarted it picks -up with that token rather than reading a fresh one. If you're using -yacc, then the special "error" production can sometimes be used to -consume tokens in an attempt to get the parser into a consistent state. - - Vern -]]> -</programlisting> -</informalexample> - -<!-- @c TODO: Evaluate this faq. --> -</section> - -<section> -<title>unnamed-faq-85</title> -<informalexample> -<programlisting> -<![CDATA[ -To: Henric Jungheim <junghelh@pe-nelson.com> -Subject: Re: flex 2.5.4a -In-reply-to: Your message of Tue, 27 Oct 1998 16:41:42 PST. -Date: Tue, 27 Oct 1998 16:50:14 PST -From: Vern Paxson <vern> - -> This brings up a feature request: How about a command line -> option to specify the filename when reading from stdin? That way one -> doesn't need to create a temporary file in order to get the "#line" -> directives to make sense. - -Use -o combined with -t (per the man page description of -o). - -> P.S., Is there any simple way to use non-blocking IO to parse multiple -> streams? - -Simple, no. - -One approach might be to return a magic character on EWOULDBLOCK and -have a rule - - .*<magic-character> // put back .*, eat magic character - -This is off the top of my head, not sure it'll work. - - Vern -]]> -</programlisting> -</informalexample> - -<!-- @c TODO: Evaluate this faq. 
--> -</section> - -<section> -<title>unnamed-faq-86</title> -<informalexample> -<programlisting> -<![CDATA[ -To: "Repko, Billy D" <billy.d.repko@intel.com> -Subject: Re: Compiling scanners -In-reply-to: Your message of Wed, 13 Jan 1999 10:52:47 PST. -Date: Thu, 14 Jan 1999 00:25:30 PST -From: Vern Paxson <vern> - -> It appears that maybe it cannot find the lfl library. - -The Makefile in the distribution builds it, so you should have it. -It's exceedingly trivial, just a main() that calls yylex() and -a yywrap() that always returns 1. - -> %% -> \n ++num_lines; ++num_chars; -> . ++num_chars; - -You can't indent your rules like this - that's where the errors are coming -from. Flex copies indented text to the output file, it's how you do things -like - - int num_lines_seen = 0; - -to declare local variables. - - Vern -]]> -</programlisting> -</informalexample> - -<!-- @c TODO: Evaluate this faq. --> -</section> - -<section> -<title>unnamed-faq-87</title> -<informalexample> -<programlisting> -<![CDATA[ -To: Erick Branderhorst <Erick.Branderhorst@asml.nl> -Subject: Re: flex input buffer -In-reply-to: Your message of Tue, 09 Feb 1999 13:53:46 PST. -Date: Tue, 09 Feb 1999 21:03:37 PST -From: Vern Paxson <vern> - -> In the flex.skl file the size of the default input buffers is set. Can you -> explain why this size is set and why it is such a high number. - -It's large to optimize performance when scanning large files. You can -safely make it a lot lower if needed. - - Vern -]]> -</programlisting> -</informalexample> - -<!-- @c TODO: Evaluate this faq. --> -</section> - -<section> -<title>unnamed-faq-88</title> -<informalexample> -<programlisting> -<![CDATA[ -To: "Guido Minnen" <guidomi@cogs.susx.ac.uk> -Subject: Re: Flex error message -In-reply-to: Your message of Wed, 24 Feb 1999 15:31:46 PST. -Date: Thu, 25 Feb 1999 00:11:31 PST -From: Vern Paxson <vern> - -> I'm extending a larger scanner written in Flex and I keep running into -> problems.
More specifically, I get the error message: -> "flex: input rules are too complicated (>= 32000 NFA states)" - -Increase the definitions in flexdef.h for: - -#define JAMSTATE -32766 /* marks a reference to the state that always jams */ -#define MAXIMUM_MNS 31999 -#define BAD_SUBSCRIPT -32767 - -recompile everything, and it should all work. - - Vern -]]> -</programlisting> -</informalexample> - -<!-- @c TODO: Evaluate this faq. --> -</section> - -<section> -<title>unnamed-faq-90</title> -<informalexample> -<programlisting> -<![CDATA[ -To: "Dmitriy Goldobin" <gold@ems.chel.su> -Subject: Re: FLEX trouble -In-reply-to: Your message of Mon, 31 May 1999 18:44:49 PDT. -Date: Tue, 01 Jun 1999 00:15:07 PDT -From: Vern Paxson <vern> - -> I have a trouble with FLEX. Why rule "/*".*"*/" work properly, -> but rule "/*"(.|\n)*"*/" don't work ? - -The second of these will have to scan the entire input stream (because -"(.|\n)*" matches an arbitrary amount of any text) in order to see if -it ends with "*/", terminating the comment. That potentially will overflow -the input buffer. - -> More complex rule "/*"([^*]|(\*/[^/]))*"*/ give an error -> 'unrecognized rule'. - -You can't use the '/' operator inside parentheses. It's not clear -what "(a/b)*" actually means. - -> I now use workaround with state <comment>, but single-rule is -> better, i think. - -Single-rule is nice but will always have the problem of either setting -restrictions on comments (like not allowing multi-line comments) and/or -running the risk of consuming the entire input stream, as noted above. - - Vern -]]> -</programlisting> -</informalexample> - -<!-- @c TODO: Evaluate this faq.
--> -</section> - -<section> -<title>unnamed-faq-91</title> -<informalexample> -<programlisting> -<![CDATA[ -Received: from mc-qout4.whowhere.com (mc-qout4.whowhere.com [209.185.123.18]) - by ee.lbl.gov (8.9.3/8.9.3) with SMTP id IAA05100 - for <vern@ee.lbl.gov>; Tue, 15 Jun 1999 08:56:06 -0700 (PDT) -Received: from Unknown/Local ([?.?.?.?]) by my-deja.com; Tue Jun 15 08:55:43 1999 -To: vern@ee.lbl.gov -Date: Tue, 15 Jun 1999 08:55:43 -0700 -From: "Aki Niimura" <neko@my-deja.com> -Message-ID: <KNONDOHDOBGAEAAA@my-deja.com> -Mime-Version: 1.0 -Cc: -X-Sent-Mail: on -Reply-To: -X-Mailer: MailCity Service -Subject: A question on flex C++ scanner -X-Sender-Ip: 12.72.207.61 -Organization: My Deja Email (http://www.my-deja.com:80) -Content-Type: text/plain; charset=us-ascii -Content-Transfer-Encoding: 7bit - -Dear Dr. Paxon, - -I have been using flex for years. -It works very well on many projects. -Most case, I used it to generate a scanner on C language. -However, one project I needed to generate a scanner -on C++ lanuage. Thanks to your enhancement, flex did -the job. - -Currently, I'm working on enhancing my previous project. -I need to deal with multiple input streams (recursive -inclusion) in this scanner (C++). -I did similar thing for another scanner (C) as you -explained in your documentation. - -The generated scanner (C++) has necessary methods: -- switch_to_buffer(struct yy_buffer_state *b) -- yy_create_buffer(istream *is, int sz) -- yy_delete_buffer(struct yy_buffer_state *b) - -However, I couldn't figure out how to access current -buffer (yy_current_buffer). - -yy_current_buffer is a protected member of yyFlexLexer. -I can't access it directly. -Then, I thought yy_create_buffer() with is = 0 might -return current stream buffer. But it seems not as far -as I checked the source. (flex 2.5.4) - -I went through the Web in addition to Flex documentation. -However, it hasn't been successful, so far. 
- -It is not my intention to bother you, but, can you -comment about how to obtain the current stream buffer? - -Your response would be highly appreciated. - -Best regards, -Aki Niimura - ---== Sent via Deja.com http://www.deja.com/ ==-- -Share what you know. Learn what you don't. -]]> -</programlisting> -</informalexample> - -<!-- @c TODO: Evaluate this faq. --> -</section> - -<section> -<title>unnamed-faq-92</title> -<informalexample> -<programlisting> -<![CDATA[ -To: neko@my-deja.com -Subject: Re: A question on flex C++ scanner -In-reply-to: Your message of Tue, 15 Jun 1999 08:55:43 PDT. -Date: Tue, 15 Jun 1999 09:04:24 PDT -From: Vern Paxson <vern> - -> However, I couldn't figure out how to access current -> buffer (yy_current_buffer). - -Derive your own subclass from yyFlexLexer. - - Vern -]]> -</programlisting> -</informalexample> - -<!-- @c TODO: Evaluate this faq. --> -</section> - -<section> -<title>unnamed-faq-93</title> -<informalexample> -<programlisting> -<![CDATA[ -To: "Stones, Darren" <Darren.Stones@nectech.co.uk> -Subject: Re: You're the man to see? -In-reply-to: Your message of Wed, 23 Jun 1999 11:10:29 PDT. -Date: Wed, 23 Jun 1999 09:01:40 PDT -From: Vern Paxson <vern> - -> I hope you can help me. I am using Flex and Bison to produce an interpreted -> language. However all goes well until I try to implement an IF statement or -> a WHILE. I cannot get this to work as the parser parses all the conditions -> eg. the TRUE and FALSE conditons to check for a rule match. So I cannot -> make a decision!! - -You need to use the parser to build a parse tree (= abstract syntax tree), -and when that's all done you recursively evaluate the tree, binding variables -to values at that time. - - Vern -]]> -</programlisting> -</informalexample> - -<!-- @c TODO: Evaluate this faq.
--> -</section> - -<section> -<title>unnamed-faq-94</title> -<informalexample> -<programlisting> -<![CDATA[ -To: Petr Danecek <petr@ics.cas.cz> -Subject: Re: flex - question -In-reply-to: Your message of Mon, 28 Jun 1999 19:21:41 PDT. -Date: Fri, 02 Jul 1999 16:52:13 PDT -From: Vern Paxson <vern> - -> file, it takes an enormous amount of time. It is funny, because the -> source code has only 12 rules!!! I think it looks like an exponencial -> growth. - -Right, that's the problem - some patterns (those with a lot of -ambiguity, which yours has because at any given time the scanner can -be in the middle of all sorts of combinations of the different -rules) blow up exponentially. - -For your rules, there is an easy fix. Change the ".*" that comes after -the directory name to "[^ ]*". With that in place, the rules are no -longer nearly so ambiguous, because then once one of the directories -has been matched, no other can be matched (since they all require a -leading blank). - -If that's not an acceptable solution, then you can enter a start state -to pick up the .*\n after each directory is matched. - -Also note that for speed, you'll want to add a ".*" rule at the end, -otherwise text that doesn't match any of the patterns will be matched -very slowly, a character at a time. - - Vern -]]> -</programlisting> -</informalexample> - -<!-- @c TODO: Evaluate this faq. --> -</section> - -<section> -<title>unnamed-faq-95</title> -<informalexample> -<programlisting> -<![CDATA[ -To: Tielman Koekemoer <tielman@spi.co.za> -Subject: Re: Please help. -In-reply-to: Your message of Thu, 08 Jul 1999 13:20:37 PDT. -Date: Thu, 08 Jul 1999 08:20:39 PDT -From: Vern Paxson <vern> - -> I was hoping you could help me with my problem. -> -> I tried compiling (gnu)flex on a Solaris 2.4 machine -> but when I ran make (after configure) I got an error. -> -> -------------------------------------------------------------- -> gcc -c -I. -I.
-g -O parse.c -> ./flex -t -p ./scan.l >scan.c -> sh: ./flex: not found -> *** Error code 1 -> make: Fatal error: Command failed for target `scan.c' -> ------------------------------------------------------------- -> -> What's strange to me is that I'm only -> trying to install flex now. I then edited the Makefile to -> and changed where it says "FLEX = flex" to "FLEX = lex" -> ( lex: the native Solaris one ) but then it complains about -> the "-p" option. Is there any way I can compile flex without -> using flex or lex? -> -> Thanks so much for your time. - -You managed to step on the bootstrap sequence, which first copies -initscan.c to scan.c in order to build flex. Try fetching a fresh -distribution from ftp.ee.lbl.gov. (Or you can first try removing -".bootstrap" and doing a make again.) - - Vern -]]> -</programlisting> -</informalexample> - -<!-- @c TODO: Evaluate this faq. --> -</section> - -<section> -<title>unnamed-faq-96</title> -<informalexample> -<programlisting> -<![CDATA[ -To: Tielman Koekemoer <tielman@spi.co.za> -Subject: Re: Please help. -In-reply-to: Your message of Fri, 09 Jul 1999 09:16:14 PDT. -Date: Fri, 09 Jul 1999 00:27:20 PDT -From: Vern Paxson <vern> - -> First I removed .bootstrap (and ran make) - no luck. I downloaded the -> software but I still have the same problem. Is there anything else I -> could try. - -Try: - - cp initscan.c scan.c - touch scan.c - make scan.o - -If this last tries to first build scan.c from scan.l using ./flex, then -your "make" is broken, in which case compile scan.c to scan.o by hand. - - Vern -]]> -</programlisting> -</informalexample> - -<!-- @c TODO: Evaluate this faq. --> -</section> - -<section> -<title>unnamed-faq-97</title> -<informalexample> -<programlisting> -<![CDATA[ -To: Sumanth Kamenani <skamenan@crl.nmsu.edu> -Subject: Re: Error -In-reply-to: Your message of Mon, 19 Jul 1999 23:08:41 PDT. -Date: Tue, 20 Jul 1999 00:18:26 PDT -From: Vern Paxson <vern> - -> I am getting a compilation error. 
The error is given as "unknown symbol- yylex". - -The parser relies on calling yylex(), but you're instead using the C++ scanning -class, so you need to supply a yylex() "glue" function that calls yylex() on an -instance of the scanner class (e.g., "scanner->yylex()"). - - Vern -]]> -</programlisting> -</informalexample> - -<!-- @c TODO: Evaluate this faq. --> -</section> - -<section> -<title>unnamed-faq-98</title> -<informalexample> -<programlisting> -<![CDATA[ -To: daniel@synchrods.synchrods.COM (Daniel Senderowicz) -Subject: Re: lex -In-reply-to: Your message of Mon, 22 Nov 1999 11:19:04 PST. -Date: Tue, 23 Nov 1999 15:54:30 PST -From: Vern Paxson <vern> - -Well, your problem is the - -switch (yybgin-yysvec-1) { /* witchcraft */ - -at the beginning of lex rules. "witchcraft" == "non-portable". It's -assuming knowledge of the AT&T lex's internal variables. - -For flex, you can probably do the equivalent using a switch on YYSTATE. - - Vern -]]> -</programlisting> -</informalexample> - -<!-- @c TODO: Evaluate this faq. --> -</section> - -<section> -<title>unnamed-faq-99</title> -<informalexample> -<programlisting> -<![CDATA[ -To: archow@hss.hns.com -Subject: Re: Regarding distribution of flex and yacc based grammars -In-reply-to: Your message of Sun, 19 Dec 1999 17:50:24 +0530. -Date: Wed, 22 Dec 1999 01:56:24 PST -From: Vern Paxson <vern> - -> When we provide the customer with an object code distribution, is it -> necessary for us to provide source -> for the generated C files from flex and bison since they are generated by -> flex and bison ? - -For flex, no. I don't know what the current state of this is for bison. - -> Also, is there any requrirement for us to neccessarily provide source for -> the grammar files which are fed into flex and bison ? - -Again, for flex, no. - -See the file "COPYING" in the flex distribution for the legalese. - - Vern -]]> -</programlisting> -</informalexample> - -<!-- @c TODO: Evaluate this faq.
--> -</section> - -<section> -<title>unnamed-faq-100</title> -<informalexample> -<programlisting> -<![CDATA[ -To: Martin Gallwey <gallweym@hyperion.moe.ul.ie> -Subject: Re: Flex, and self referencing rules -In-reply-to: Your message of Sun, 20 Feb 2000 01:01:21 PST. -Date: Sat, 19 Feb 2000 18:33:16 PST -From: Vern Paxson <vern> - -> However, I do not use unput anywhere. I do use self-referencing -> rules like this: -> -> UnaryExpr ({UnionExpr})|("-"{UnaryExpr}) - -You can't do this - flex is *not* a parser like yacc (which does indeed -allow recursion), it is a scanner that's confined to regular expressions. - - Vern -]]> -</programlisting> -</informalexample> - -<!-- @c TODO: Evaluate this faq. --> -</section> - -<section> -<title>unnamed-faq-101</title> -<informalexample> -<programlisting> -<![CDATA[ -To: slg3@lehigh.edu (SAMUEL L. GULDEN) -Subject: Re: Flex problem -In-reply-to: Your message of Thu, 02 Mar 2000 12:29:04 PST. -Date: Thu, 02 Mar 2000 23:00:46 PST -From: Vern Paxson <vern> - -If this is exactly your program: - -> digit [0-9] -> digits {digit}+ -> whitespace [ \t\n]+ -> -> %% -> "[" { printf("open_brac\n");} -> "]" { printf("close_brac\n");} -> "+" { printf("addop\n");} -> "*" { printf("multop\n");} -> {digits} { printf("NUMBER = %s\n", yytext);} -> whitespace ; - -then the problem is that the last rule needs to be "{whitespace}" ! - - Vern -]]> -</programlisting> -</informalexample> -</section> - -<!-- END CHAPTER FAQ --> -</chapter> - -<appendix> -<title>Makefiles and Flex</title> - -<!-- @cindex Makefile, syntax --> - -In this appendix, we provide tips for writing Makefiles to build your scanners. - -In a traditional build environment, we say that the <filename>.c</filename> files are the -sources, and the <filename>.o</filename> files are the intermediate files. 
When using -<application>flex</application>, however, the <filename>.l</filename> files are the sources, and the generated -<filename>.c</filename> files (along with the <filename>.o</filename> files) are the intermediate files. -This requires you to carefully plan your Makefile. - -Modern @command{make} programs understand that <filename>foo.l</filename> is intended to -generate <filename>lex.yy.c</filename> or <filename>foo.c</filename>, and will behave -accordingly@footnote{GNU @command{make} and GNU @command{automake} are two such -programs that provide implicit rules for flex-generated scanners.}. The -following Makefile does not explicitly instruct @command{make} how to build -<filename>foo.c</filename> from <filename>foo.l</filename>. Instead, it relies on the implicit rules of the -@command{make} program to build the intermediate file, <filename>scan.c</filename>: - -<!-- @cindex Makefile, example of implicit rules --> -<informalexample> -<programlisting> -<![CDATA[ - # Basic Makefile -- relies on implicit rules - # Creates "myprogram" from "scan.l" and "myprogram.c" - # - LEX=flex - myprogram: scan.o myprogram.o - scan.o: scan.l - -]]> -</programlisting> -</informalexample> - - -For simple cases, the above may be sufficient. For other cases, -you may have to explicitly instruct @command{make} how to build your scanner. 
-The following is an example of a Makefile containing explicit rules: - -<!-- @cindex Makefile, explicit example --> -<informalexample> -<programlisting> -<![CDATA[ - # Basic Makefile -- provides explicit rules - # Creates "myprogram" from "scan.l" and "myprogram.c" - # - LEX=flex - myprogram: scan.o myprogram.o - $(CC) -o $@ $(LDFLAGS) $^ - - myprogram.o: myprogram.c - $(CC) $(CPPFLAGS) $(CFLAGS) -o $@ -c $^ - - scan.o: scan.c - $(CC) $(CPPFLAGS) $(CFLAGS) -o $@ -c $^ - - scan.c: scan.l - $(LEX) $(LFLAGS) -o $@ $^ - - clean: - $(RM) *.o scan.c - -]]> -</programlisting> -</informalexample> - -Notice in the above example that <filename>scan.c</filename> is in the @code{clean} target. -This is because we consider the file <filename>scan.c</filename> to be an intermediate file. - -Finally, we provide a realistic example of a <application>flex</application> scanner used with a -<application>bison</application> parser@footnote{This example also applies to yacc parsers.}. -There is a tricky problem we have to deal with. Since a <application>flex</application> scanner -will typically include a header file (e.g., <filename>y.tab.h</filename>) generated by the -parser, we need to be sure that the header file is generated BEFORE the scanner -is compiled. We handle this case in the following example: - -<informalexample> -<programlisting> -<![CDATA[ - # Makefile example -- scanner and parser. - # Creates "myprogram" from "scan.l", "parse.y", and "myprogram.c" - # - LEX = flex - YACC = bison -y - YFLAGS = -d - objects = scan.o parse.o myprogram.o - - myprogram: $(objects) - scan.o: scan.l parse.c - parse.o: parse.y - myprogram.o: myprogram.c - -]]> -</programlisting> -</informalexample> - -In the above example, notice the line, - -<informalexample> -<programlisting> -<![CDATA[ - scan.o: scan.l parse.c -]]> -</programlisting> -</informalexample> - -, which lists the file <filename>parse.c</filename> (the generated parser) as a dependency of -<filename>scan.o</filename>. 
We want to ensure that the parser is created before the scanner -is compiled, and the above line seems to do the trick. Feel free to experiment -with your specific implementation of @command{make}. - - -For more details on writing Makefiles, see @ref{Top, , , make, The -GNU Make Manual}. - -</appendix> - -<appendix> -<title>C Scanners with Bison Parsers</title> - -<!-- @cindex bison, bridging with flex --> -<!-- @vindex yylval --> -<!-- @vindex yylloc --> -<!-- @tindex YYLTYPE --> -<!-- @tindex YYSTYPE --> - -This section describes the <application>flex</application> features useful when integrating -<application>flex</application> with @code{GNU bison}@footnote{The features described here are -purely optional, and are by no means the only way to use flex with bison. -We merely provide some glue to ease development of your parser-scanner pair.}. -Skip this section if you are not using -<application>bison</application> with your scanner. Here we discuss only the <application>flex</application> -half of the <application>flex</application> and <application>bison</application> pair. We do not discuss -<application>bison</application> in any detail. For more information about generating -<application>bison</application> parsers, see @ref{Top, , , bison, the GNU Bison Manual}. - -A compatible <application>bison</application> scanner is generated by declaring @samp{%option -bison-bridge} or by supplying @samp{--bison-bridge} when invoking <application>flex</application> -from the command line. This instructs <application>flex</application> that the macro -<varname>yylval</varname> may be used. The data type for -<varname>yylval</varname>, @code{YYSTYPE}, -is typically defined in a header file, included in section 1 of the -<application>flex</application> input file. For a list of functions and macros -available, @xref{bison-functions}. 
- -The declaration of yylex becomes, - -<!-- @findex yylex (reentrant version) --> -<informalexample> -<programlisting> -<![CDATA[ - int yylex ( YYSTYPE * lvalp, yyscan_t scanner ); -]]> -</programlisting> -</informalexample> - -If @code{%option bison-locations} is specified, then the declaration -becomes, - -<!-- @findex yylex (reentrant version) --> -<informalexample> -<programlisting> -<![CDATA[ - int yylex ( YYSTYPE * lvalp, YYLTYPE * llocp, yyscan_t scanner ); -]]> -</programlisting> -</informalexample> - -Note that the macros <varname>yylval</varname> and <varname>yylloc</varname> evaluate to pointers. -Support for <varname>yylloc</varname> is optional in <application>bison</application>, so it is optional in -<application>flex</application> as well. The following is an example of a <application>flex</application> scanner that -is compatible with <application>bison</application>. - -<!-- @cindex bison, scanner to be called from bison --> -<informalexample> -<programlisting> -<![CDATA[ - /* Scanner for "C" assignment statements... sort of. */ - %{ - #include "y.tab.h" /* Generated by bison. */ - %} - - %option bison-bridge bison-locations - %% - - [[:digit:]]+ { yylval->num = atoi(yytext); return NUMBER;} - [[:alnum:]]+ { yylval->str = strdup(yytext); return STRING;} - "="|";" { return yytext[0];} - . {} - %% -]]> -</programlisting> -</informalexample> - -As you can see, there really is no magic here. We just use -<varname>yylval</varname> as we would any other variable. The data type of -<varname>yylval</varname> is generated by <application>bison</application>, and included in the file -<filename>y.tab.h</filename>. Here is the corresponding <application>bison</application> parser: - -<!-- @cindex bison, parser --> -<informalexample> -<programlisting> -<![CDATA[ - /* Parser to convert "C" assignments to lisp. */ - %{ - /* Pass the argument to yyparse through to yylex. 
*/ - #define YYPARSE_PARAM scanner - #define YYLEX_PARAM scanner - %} - %locations - %pure_parser - %union { - int num; - char* str; - } - %token <str> STRING - %token <num> NUMBER - %% - assignment: - STRING '=' NUMBER ';' { - printf( "(setf %s %d)", $1, $3 ); - } - ; -]]> -</programlisting> -</informalexample> - -</appendix> - -<appendix> -<title>M4 Dependency</title> -<!-- @cindex m4 --> - -The macro processor <application>m4</application>@footnote{The use of m4 is subject to change in -future revisions of flex.} must be installed wherever flex is installed. -<application>flex</application> invokes @samp{m4}, found by searching the directories in the -@code{PATH} environment variable. Any code you place in section 1 or in the -actions will be sent through m4. Please follow these rules to protect your -code from unwanted <application>m4</application> processing. - -<itemizedlist> - - -<listitem> - Do not use symbols that begin with, @samp{m4_}, such as, @samp{m4_define}, -or @samp{m4_include}, since those are reserved for <application>m4</application> macro names. - -</listitem> -<listitem> - Do not use the strings @samp{[[} or @samp{]]} anywhere in your code. The -former is not valid in C, except within comments, but the latter is valid in -code such as @code{x[y[z]]}. -</listitem> -</itemizedlist> - - -<application>m4</application> is only required at the time you run <application>flex</application>. The generated -scanner is ordinary C or C++, and does <emphasis>not</emphasis> require <application>m4</application>. - - -</appendix> - -<!-- -<title>Indices</title> - -@menu -* Concept Index:: -* Index of Functions and Macros:: -* Index of Variables:: -* Index of Data Types:: -* Index of Hooks:: -* Index of Scanner Options:: -@end menu - -<section> -<title>Concept Index</title> - -@printindex cp - -</section> - -<section> -<title>Index of Functions and Macros</title> - -This is an index of functions and preprocessor macros that look like functions. 
-For macros that expand to variables or constants, see @ref{Index of Variables}. - -@printindex fn - -</section> - -<section> -<title>Index of Variables</title> - -This is an index of variables, constants, and preprocessor macros -that expand to variables or constants. - -@printindex vr - -</section> - -<section> -<title>Index of Data Types</title> -@printindex tp - -</section> - -<section> -<title>Index of Hooks</title> - -This is an index of "hooks" that the user may define. These hooks typically correspond -to specific locations in the generated scanner, and may be used to insert arbitrary code. - -@printindex hk - -</section> - -<section> -<title>Index of Scanner Options</title> - -@printindex op -</section> - ---> - -<!-- -A vim script to name the faq entries. delete this when faqs are no longer -named "unnamed-faq-XXX". -MIGHT NOT WORK AFTER XML CONVERSION. - -fu! Faq2 () range abort - let @r=input("Rename to: ") - exe "%s/" . @w . "/" . @r . "/g" - normal 'f -endf -nnoremap <F5> 1G/@node\s\+unnamed-faq-\d\+<cr>mfww"wy5ezt:call Faq2()<cr> ---> -</book> |
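The M4 Dependency appendix in the removed text warns against the string `]]`, which turns up naturally in nested array indexing. A minimal sketch of one way around it, assuming it is acceptable to put whitespace between the two closing brackets (the helper name `nested_lookup` is hypothetical and not part of flex):

```c
#include <stddef.h>

/* Hypothetical helper, not part of flex: nested array indexing written so
 * that the two closing brackets never sit next to each other. */
static int nested_lookup(const int *x, const int *y, size_t z)
{
    /* The space between the closing brackets changes nothing for the C
     * compiler, but it keeps the reserved two-bracket string out of the
     * text that flex passes through m4. */
    return x[y[z] ];
}
```

Called with `x = {10, 20, 30}`, `y = {2, 0, 1}` and `z = 0`, the helper reads `y[0]` (which is 2) and then `x[2]`. The same spacing could be used directly inside a scanner action or in section 1 of the input file.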