Initial revision

author: Vern Paxson <vern@ee.lbl.gov> 1990-02-25 01:28:22 +0000
committer: Vern Paxson <vern@ee.lbl.gov> 1990-02-25 01:28:22 +0000
commit: 65a8cee37cb94b7824c6f82246f262f314cde717 (patch)
tree: fdcff2418fc59886ff17c5d656b1d780c75dbf9e /flex.1
parent: fd6ee792d251e9ff28372ec39934a6a14ce346ad (diff)
1 files changed, 1434 insertions, 0 deletions
diff --git a/flex.1 b/flex.1
new file mode 100644
index 0000000..d52f126
--- /dev/null
+++ b/flex.1
@@ -0,0 +1,1434 @@
+.TH FLEX 1 "24 February 1990" "Version 2.2"
+.SH NAME
+flex - fast lexical analyzer generator
+.SH SYNOPSIS
+.B flex
+.B [-bcdfinpstvFILT -C[efmF] -Sskeleton]
+.I [filename ...]
+.SH DESCRIPTION
+.I flex
+is a tool for generating
+.I scanners:
+programs which recognized lexical patterns in text.
+.I flex
+reads
+the given input files, or its standard input if no file names are given,
+for a description of a scanner to generate.  The description is in
+the form of pairs
+of regular expressions and C code, called
+.I rules.  flex
+generates as output a C source file,
+.B lex.yy.c,
+which defines a routine
+.B yylex().
+This file is compiled and linked with the
+.B -ll
+library to produce an executable.  When the executable is run,
+it analyzes its input for occurrences
+of the regular expressions.  Whenever it finds one, it executes
+the corresponding C code.
+.SH SOME SIMPLE EXAMPLES
+.LP
+First some simple examples to get the flavor of how one uses
+.I flex.
+The following
+.I flex
+input specifies a scanner which whenever it encounters a tab
+will print eight blanks to its standard output:
+.nf
+
+    %%
+    \t    printf( "        " );
+
+.fi
+By default, any text not matched by a
+.I flex
+scanner
+is copied to the output, so the net effect of this scanner is
+to copy its input file to its output with each tab expanded
+into eight blanks.
+In this input, there is just one rule.  "\t" is the
+.I pattern
+(it's a regular expression specifying a tab) and the "printf" is the
+.I action.  The "%%" marks the beginning of the rules.
+.LP
+Here's another simple example:
+.nf
+
+        int num_lines = 0, num_chars = 0;
+
+    %%
+    \n    ++num_lines; ++num_chars;
+    .     ++num_chars;
+
+    %%
+    main()
+        {
+        yylex();
+        printf( "# of lines = %d, # of chars = %d\n",
+                num_lines, num_chars );
+        }
+
+.fi
+This scanner counts the number of characters and the number
+of lines in its input (it produces no output other than the
+final report on the counts).  The first line
+declares two globals, "num_lines" and "num_chars", which are accessible
+both inside
+.B yylex()
+and in the
+.B main()
+routine declared after the second "%%".  There are two rules, one
+which matches a newline ("\\n") and increments both the line count and
+the character count, and one which matches any character other than
+a newline (indicated by the "." regular expression).
+.LP
+A somewhat more complicated example:
+.nf
+
+    /* scanner for a toy Pascal-like language */
+
+    %{
+    /* need this for the call to atof() below */
+    #include <math.h>
+    %}
+
+    DIGIT    [0-9]
+    ID       [a-z][a-z0-9]*
+
+    %%
+
+    {DIGIT}+    {
+                printf( "An integer: %s (%d)\\n", yytext,
+                        atoi( yytext ) );
+                }
+
+    {DIGIT}+"."{DIGIT}*        {
+                printf( "A float: %s (%d)\\n", yytext,
+                        atof( yytext ) );
+                }
+
+    if|then|begin|end|procedure|function        {
+                printf( "A keyword: %s\\n", yytext );
+                }
+
+    {ID}        printf( "An identifier: %s\\n", yytext );
+
+    "+"|"-"|"*"|"/"   printf( "An operator: %s\\n", yytext );
+
+    "{"[^}\\n]*"}"   /* eat up one-line comments */
+
+    [ \\t\\n]+       /* eat up whitespace */
+
+    .           printf( "Unrecognized character: %s\\n", yytext );
+
+    %%
+
+    main( argc, argv )
+    int argc;
+    char **argv;
+        {
+        ++argv, --argc;
+        if ( argc > 0 )
+                yyin = fopen( argv[0], "r" );
+        else
+                yyin = stdin;
+        
+        yylex();
+        }
+
+.fi
+This is the beginnings of a simple scanner for a language like
+Pascal.  It identifies different types of
+.I tokens
+and reports on what it has seen.
+.LP
+The details of the example will be explained in the following
+sections.
+.SH FORMAT OF THE INPUT FILE
+The
+.I flex
+input file consists of three sections, separated by
+.B %%:
+.nf
+
+    definitions
+    %%
+    rules
+    %%
+    user code
+
+.fi
+The
+.I definitions
+section contains declarations of simple
+.I name
+definitions to simplify the scanner specification and of
+.I start conditions,
+which are explained in a later section.
+.LP
+Name definitions have the form:
+.nf
+
+    name definition
+
+.fi
+The "name" is a word beginning with a letter or a '_'
+followed by zero or more letters, digits, '_', or '-'.
+The definition is taken to begin at the first non-white-space
+following the name and continue to the end of the line.
+Definition can subsequently be referred to using "{name}", which
+will expand to "(definition)".  For example,
+.nf
+
+    DIGIT    [0-9]
+    ID       [a-z][a-z0-9]*
+
+.fi
+defines "DIGIT" to be a regular expression which matches a
+single digit, and
+"ID" to be a regular expression which matches a letter
+followed by zero-or-more letters or digits.
+A subsequent reference to
+.nf
+
+    {DIGIT}+"."{DIGIT}*
+
+.fi
+is identical to
+.nf
+
+    ([0-9])+"."([0-9])*
+
+.fi
+and matches one-or-more digits followed by a '.' followed
+by zero-or-more digits.
+.LP
+The
+.I rules
+section of the
+.I flex
+input contains a series of rules of the form:
+.nf
+
+    pattern   action
+
+.fi
+where the pattern must be unindented and the action must begin
+on the same line.
+.LP
+See below for a further description of patterns and actions.
+.LP
+Finally, the user code section is simply copied to
+.B lex.yy.c
+verbatim.
+It is used for companion routines which call or are called
+by the scanner.  The presence of this section is optional;
+if it is missing, the second
+.B %%
+in the input file may be skipped, too.
+.LP
+In the definitions and rule sections, any
+.I indented
+text or text enclosed in
+.B %{
+and
+.B %}
+is copied verbatim to the output (with the %{}'s removed).
+The %{}'s must appear unindented on lines by themselves.
+.LP
+In the rules section,
+any indented or %{} text appearing before the
+first rule may be used to declare variables
+which are local to the scanning routine, and, after the declarations,
+code which is to be executed whenever the scanning routine is entered.
+Other indented or %{} text in the rule section is still copied to the output,
+but its meaning is not well-defined and it may well cause compile-time
+errors (this feature is present for
+.I POSIX
+compliance; see below for other such features).
+.LP
+In the definitions section, an unindented comment (i.e., a line
+beginning with "/*") is also copied verbatim to the output up
+to the next "*/".  Also, any line beginning with '#' is ignored.
+.SH PATTERNS
+The patterns in the input are written using an extended set of regular
+expressions.  These are:
+.nf
+
+    x          match the character 'x'
+    .          any character except newline
+    [xyz]      an 'x', a 'y', or a 'z'
+    [abj-oZ]   an 'a', a 'b', any letter
+               from 'j' through 'o', or a 'Z'
+    [^A-Z]     any character EXCEPT an uppercase letter,
+               including a newline (unlike how many other
+               regular expression tools treat the '^'!).
+               This means that a pattern like [^"]* will
+               match an entire file (overflowing the input
+               buffer) unless there's another quote in
+               the input.
+    [^A-Z\\n]   any character EXCEPT an uppercase letter or
+               a newline
+    r*         zero or more r's, where r is any regular expression
+    r+         one or more r's
+    r?         zero or one r's (that is, "an optional r")
+    r{2,5}     anywhere from two to five r's
+    r{2,}      two or more r's
+    r{4}       exactly 4 r's
+    {name}     the expansion of the "name" definition
+               (see above)
+    "[xyz]\\"foo"
+               the literal string: [xyz]"foo
+    \\x         if x is an 'a', 'b', 'f', 'n', 'r',
+               't', or 'v', then the ANSI-C
+               interpretation of \\x.  Otherwise,
+               a literal 'x' (used to escape
+               operators such as '*')
+    \\123      the character with octal value 123
+    \\x2a      the character with hexadecimal value 2a
+    (r)        match an r; parentheses are used
+               to override precedence (see below)
+
+
+    rs         the regular expression r followed
+               by the regular expression s; called
+               "concatenation"
+
+
+    r|s        either an r or an s
+
+
+    r/s        an r but only if it is followed by
+               an s.  The s is not part of the
+               matched text.  This type of
+               pattern is known as "trailing context".
+    ^r         an r, but only at the beginning of a line
+    r$         an r, but only at the end of a line
+               (r must not use trailing context)
+
+
+    <s>r       an r, but only in start condition s (see
+               below for discussion of start conditions)
+    <s1,s2,s3>r
+               same, but in any of start conditions s1,
+               s2, or s3
+
+
+    <<EOF>>    an end-of-file
+    <s1,s2><<EOF>>
+               an end-of-file when in start condition s1 or s2
+
+.fi
+The regular expressions listed above are grouped according to
+precedence, from highest precedence at the top to lowest at the bottom.
+Those grouped together have equal precedence.  For example,
+.nf
+
+    foo|bar*
+
+.fi
+is the same as
+.nf
+
+    (foo)|(ba(r*))
+
+.fi
+since the '*' operator has higher precedence than concatenation,
+and concatenation higher than alternation ('|').  This pattern
+therefore matches
+.I either
+the string "foo"
+.I or
+the string "ba" followed by zero-or-more r's.
+To match "foo" or zero-or-more "bar"'s, use:
+.nf
+
+    foo|(bar)*
+
+.fi
+and to match zero-or-more "foo"'s or "bar"'s:
+.nf
+
+    (foo|bar)*
+
+.fi
+.SH HOW THE INPUT IS MATCHED
+When the generated scanner is run, it analyzes its input looking
+for strings which match any of its patterns.  If it finds more than
+one match, it takes the one matching the most text (for trailing
+context rules, this includes the length of the trailing part, even
+though it will then be returned to the input).  If it finds two
+or more matches of the same length, the
+rule listed first in the
+.I flex
+input file is chosen.
+.LP
+Once the match is determined, the text corresponding to the match
+(called the
+.I token)
+is made available in the global character pointer
+.B yytext,
+and its length in the global integer
+.B yyleng.
+The
+.I action
+corresponding to the matched pattern is then executed (a more
+detailed description of actions follows), and then the remaining
+input is scanned for another match.
+.LP
+If no match is found, then the
+.I default rule
+is executed: the next character in the input is matched and
+copied to the standard output.  Thus, the simplest legal
+.I flex
+input is:
+.nf
+
+    %%
+
+.fi
+which generates a scanner that simply copies its input (one character
+at a time) to its output.
+.SH ACTIONS
+Each pattern in a rule has a corresponding action, which can be any
+arbitrary C statement.  The pattern ends at the first non-escaped
+whitespace character; the remainder of the line is its action.  If the
+action is empty, then when the pattern is matched the input token
+is simply discarded.  For example, here is the specification for a program
+which deletes all occurrences of "zap me" from its input:
+.nf
+
+    %%
+    "zap me"
+
+.fi
+Here is a program which compresses multiple blanks and tabs down to
+a single blank, and throws away whitespace found at the end of a line:
+.nf
+
+    %%
+    [ \t]+        putchar( ' ' );
+    [ \t]+$       /* ignore this token */
+
+.fi
+.LP
+If the action contains a '{', then the action spans till the balancing
+'}' is found, and the action may cross multiple lines.
+.I flex 
+knows about C strings and comments and won't be fooled by braces found
+within them, but also allows actions to begin with
+.B %{
+and will consider the action to be all the text up to the next
+.B %}.
+.LP
+An action consisting solely of a vertical bar ('|') means "same as
+the action for the next rule.  See below for an illustration.
+.LP
+Actions can include arbitrary C code, including
+.B return
+statements to return a value whatever routine called
+.B yylex().
+Each time
+.B yylex()
+is called it continues processing tokens from where it last left
+off until it either reaches
+the end of the file or executes a return.
+.LP
+Actions are not allowed to modify yytext or yyleng.
+.LP
+There are a number of special directives which can be included within
+an action:
+.IP -
+.B ECHO
+copies yytext to the scanner's output.
+.IP -
+.B BEGIN
+followed by the name of a start condition places the scanner in the
+corresponding start condition (see below).
+.IP -
+.B REJECT
+directs the scanner to proceed on to the "second best" rule which matched the
+input (or a prefix of the input).  The rule is chosen as described
+above in "How the Input is Matched", and
+.B yytext
+and
+.B yyleng
+set up appropriately.
+It may either be one which matched as much text
+as the originally chosen rule but came later in the
+.I flex
+input file, or one which matched less text.
+For example, the following will both count the
+words in the input and call the routine special() whenever "frob" is seen:
+.nf
+
+    %{
+    int word_count = 0;
+    %}
+    %%
+
+    frob        special(); REJECT;
+    [^ \t\n]+   ++word_count;
+
+.fi
+Without the
+.B REJECT,
+any "frob"'s in the input would not be counted as words, since the
+scanner normally executes only one action per token.
+Multiple
+.B REJECT's
+are allowed, each one finding the next best choice to the currently
+active rule.  For example, when the following scanner scans the token
+"abcd", it will write "abcdabcaba" to the output:
+.nf
+
+    %%
+    a        |
+    ab       |
+    abc      |
+    abcd     ECHO; REJECT;
+    .|\n     /* eat up any unmatched character */
+
+.fi
+(The first three rules share the fourth's action since they use
+the special '|' action.)
+.B REJECT
+is a particularly expensive feature in terms scanner performance;
+if it is used in
+.I any
+of the scanner's actions it will slow down
+.I all
+of the scanner's matching.
+.IP -
+.B yymore()
+tells the scanner that the next time it matches a rule, the corresponding
+token should be
+.I appended
+onto the current value of
+.B yytext
+rather than replacing it.  For example, given the input "mega-kludge"
+the following will write "mega-mega-kludge" to the output:
+.nf
+
+    %%
+    mega-    ECHO; yymore();
+    kludge   ECHO;
+
+.fi
+First "mega-" is matched and echoed to the output.  Then "kludge"
+is matched, but the previous "mega-" is still hanging around at the
+beginning of
+.B yytext
+so the ECHO for the "kludge" rule will actually write "mega-kludge".
+The presence of
+.B yymore()
+in the scanner's action entails a minor performance penalty in the
+scanner's matching speed.
+.IP -
+.B yyless(n)
+returns all but the first
+.I n
+characters of the current token back to the input stream, where they
+will be rescanned when the scanner looks for the next match.
+.B yytext
+and
+.B yyleng
+are adjusted appropriately (e.g.,
+.B yyleng
+will now be equal to
+.I n
+).  For example, on the input "foobar" the following will write out
+"foobarbar":
+.nf
+
+    foobar    ECHO; yyless(3);
+    [a-z]+    ECHO;
+
+.fi
+An argument of 0 to
+.B yyless
+will cause the current input string to be scanned again.  Unless you've
+changed how the scanner will subsequently process its input (using
+.B BEGIN,
+for example), this will result in an endless loop.
+.IP -
+.B unput(c)
+puts the character
+.I c
+back onto the input stream.  It will be the next character scanned.
+The following action will take the current token and cause it
+to be rescanned enclosed in parentheses.
+.nf
+
+    {
+    int i;
+    unput( ')' );
+    for ( i = yyleng - 1; i >= 0; --i )
+        unput( yytext[i] );
+    unput( '(' );
+    }
+
+.fi
+Note that since each
+.B unput()
+puts the given character back at the
+.I beginning
+of the input stream, pushing back strings must be done back-to-front.
+.IP -
+.B input()
+reads the next character from the input stream.  For example,
+the following is one way to eat up C comments:
+.nf
+
+    %%
+    "/*"        {
+                register int c;
+
+                for ( ; ; )
+                    {
+                    while ( (c = input()) != '*' &&
+                            c != EOF )
+                        ;    /* eat up text of comment */
+
+                    if ( c == '*' )
+                        {
+                        while ( (c = input()) == '*' )
+                            ;
+                        if ( c == '/' )
+                            break;    /* found the end */
+                        }
+
+                    if ( c == EOF )
+                        {
+                        error( "EOF in comment" );
+                        break;
+                        }
+                    }
+                }
+
+.fi
+.IP -
+.I yyterminate()
+can be used in lieu of a return statement in an action.  It terminates
+the scanner and returns a 0 to the scanner's caller, indicating "all done".
+By default,
+.I yyterminate()
+is also called when an end-of-file is encountered.  It is a macro and
+may be redefined.
+.SH THE GENERATED SCANNER
+The output of
+.I flex
+is the file
+.B lex.yy.c,
+which contains the scanning routine
+.B yylex(),
+a number of tables used by it for matching tokens, and a number
+of auxilliary routines and macros.  By default,
+.B yylex()
+is declared as follows:
+.nf
+
+    int yylex()
+        {
+        ... various definitions and the actions in here ...
+        }
+
+.fi
+(If your environment supports function prototypes, then it will
+be "int yylex( void )".)  This definition may be changed by redefining
+the "YY_DECL" macro.  For example, you could use:
+.nf
+
+    #undef YY_DECL
+    #define YY_DECL float lexscan( a, b ) float a, b;
+
+.fi
+to give it the the scanning routine the name
+.I lexscan,
+returning a float, and taking two floats as arguments.  Note that
+if you give arguments to the scanning routine using a
+K&R-style/non-prototyped function declaration, you must terminate
+the definition with a semi-colon (;).
+.LP
+Whenever
+.B yylex()
+is called, it scans tokens from the global input file
+.I yyin
+(default, stdin).  It continues until it either reaches
+an end-of-file (at which point it returns the value 0) or
+one of its actions executes a
+.I return
+statement.
+In the former case, the scanner may not be called again unless
+.B void yyrestart( FILE *input_file )
+is called, to point
+.I yyin
+at the new input_file.  In the latter case (i.e., when an action
+executes a return), the scanner may then be called again and it
+will resume scanning where it left off.
+.LP
+By default (and for purposes of efficiency), the scanner uses
+block-reads rather than simple
+.I getc()
+calls to read characters from
+.I yyin.
+The nature of how it gets its input can be controlled by redefining the
+.B YY_INPUT
+macro.
+YY_INPUT's calling sequence is "YY_INPUT(buf,result,max_size)".  Its
+action is to place up to
+.I max_size
+characters in the character array
+.I buf
+and return in the integer variable
+.I result
+either the
+number of characters read or the constant YY_NULL (0 on Unix systems)
+to indicate EOF.  The default YY_INPUT reads from the
+global file-pointer "yyin" (which is by default
+.I stdin),
+so if you
+just want to change the input file, you needn't redefine
+YY_INPUT - just point yyin at the input file.
+.LP
+A sample redefinition of YY_INPUT (in the definitions section of the input
+file):
+.nf
+
+    %{
+    #undef YY_INPUT
+    #define YY_INPUT(buf,result,max_size) \\
+        result = ((buf[0] = getchar()) == EOF) ? YY_NULL : 1;
+    %}
+
+.fi
+You also can add in things like keeping track of the
+input line number this way; but don't expect your scanner to
+go very fast.
+.LP
+When the scanner receives an end-of-file indication from YY_INPUT,
+it then checks the
+.B yywrap()
+function.  If it returns false (zero), then it is assumed that the
+function has gone ahead and set up
+.I yyin
+to point to another input file, and scanning continues.  If it returns
+true (non-zero), then the scanner terminates, returning 0 to its
+caller.
+.LP
+The default
+.B yywrap()
+always returns 1.  Presently, to redefine it you must first
+"#undef yywrap", as it is currently implemented as a macro.  As noted
+by the hedging in the previous sentence, it may be changed to
+a true function in the near future.
+.LP
+The scanner writes its
+.B ECHO
+output to the
+.I yyout
+global (default, stdout), which may be redefined by the user simply
+by assigning it to some other FILE*.
+.SH START CONDITIONS
+.I flex
+provides a mechanism for conditionally activating rules.  Any rule
+whose pattern is prefixed with "<sc>" will only be active when
+the scanner is in the start condition named "sc".  For example,
+.nf
+
+    <STRING>[^"]*        { /* eat up the string body ... */
+                ...
+                }
+
+.fi
+will be active only when the scanner is in the "STRING" start
+condition, and
+.nf
+
+    <INITIAL,STRING,QUOTE>\\.        { /* handle an escape ... */
+                ...
+                }
+
+.fi
+will be active only when the current start condition is
+either "INITIAL", "STRING", or "QUOTE".
+.LP
+Start conditions
+are declared in the definitions (first) section of the input
+using unindented lines beginning with either
+.B %s
+or
+.B %x
+followed by a list of names.
+The former declares
+.I inclusive
+start conditions, the latter
+.I exclusive
+start conditions.  A start condition is activated using the
+.B BEGIN
+action.  Until the next
+.B BEGIN
+action is executed, rules with the given start
+condition will be active and
+rules with other start conditions will be inactive.
+If the start condition is
+.I inclusive,
+then rules with no start conditions at all will also be active.
+If it is
+.I exclusive,
+then
+.I only
+rules qualified with the start condition will be active.
+So a set of rules conditioned on the same exclusive start condition
+describe a scanner which is independent of any of the other rules in the
+.I flex
+input.  Because of this,
+exclusive start conditions make it easy to specify "mini-scanners"
+which scan portions of the input that are syntactically different
+from the rest (e.g., comments).
+.LP
+.B BEGIN(0)
+returns to the original state where only the rules with
+no start conditions are active.  This state can also be
+referred to as the start-condition "INITIAL", so
+.B BEGIN(INITIAL)
+is equivalent to
+.B BEGIN(0).
+.LP
+Here is a scanner which will recognize numbers only if they
+are preceded earlier in the line by the string "expect-number":
+.nf
+
+    %s expect
+
+    %%
+    expect-number        BEGIN(expect);
+
+    <expect>[0-9]+       printf( "found a number\n" );
+    <expect>\n           {
+                /* that's the end of the line, so
+                 * we need another "expect-number"
+                 * before we'll recognize any more
+                 * numbers
+                 */
+                BEGIN(INITIAL);
+                }
+
+.fi
+Here is a scanner which recognizes (and discards) C comments while
+maintaining a count of the current input line.
+.nf
+
+    %x comment
+    %%
+            int line_num = 1;
+
+    <comment>[^*\n]*
+    <comment>"*"+[^*/\n]*
+    <comment>\n             ++line_num;
+    <comment>"*"+"/"        BEGIN(INITIAL);
+
+    "/*"         BEGIN(comment);
+
+.fi
+Note that start-conditions names are really integer values and
+can be stored as such.  Thus, the above could be extended in the
+following fashion:
+.nf
+
+    %x comment
+    %%
+            int line_num = 1;
+            int comment_caller;
+
+    <comment>[^*\n]*
+    <comment>"*"+[^*/\n]*
+    <comment>\n             ++line_num;
+    <comment>"*"+"/"        BEGIN(comment_caller);
+
+    "/*"         {
+                 comment_caller = INTIIAL;
+                 BEGIN(comment);
+                 }
+
+    ...
+
+    <foo>"/*"    {
+                 comment_caller = foo;
+                 BEGIN(comment);
+                 }
+.fi
+One can then implement a "stack" of start conditions using an
+array of integers.  (It is likely that such stacks will become
+a full-fledged
+.I flex
+feature in the future.)  Note, though, that
+start conditions do not have their own name-space; %s's and %x's
+declare names effectively the same as #define's.
+.SH END-OF-FILE RULES
+The special rule "<<EOF>>" indicates
+actions which are to be taken when an end-of-file is
+encountered and yywrap() returns non-zero (i.e., indicates
+no further files to process).  The action can either
+point yyin at a new file to process, in which case the
+action
+.I must
+finish with the special
+.I YY_NEW_FILE
+action
+(this is a branch, so subsequent code in the action won't
+be executed), or the action must finish with a
+.I return
+or
+.I yyterminate()
+statement.  <<EOF>> rules may not be used with other
+patterns; they may only be qualified with a list of start
+conditions.  If an unqualified <<EOF>> rule is given, it
+applies only to the
+.B INITIAL
+start condition, and
+.I not
+to
+.B %s
+(or
+.B %x)
+start conditions.
+.LP
+These rules are useful for catching things like unclosed comments.
+An example:
+.nf
+
+    %x quote
+    %%
+    ...
+    <quote><<EOF>>   {
+             error( "unterminated quote" );
+             yyterminate();
+             }
+    <<EOF>>          {
+             if ( *++filelist )
+                     {
+                     yyin = fopen( *filelist, "r" );
+                     YY_NEW_FILE;
+                     }
+             else
+                yyterminate();
+             }
+
+.fi
+.SH MISCELLANEOUS MACROS
+The macro
+.bd
+YY_USER_ACTION
+can be redefined to provide an action
+which is always executed prior to the matched rule's action.  For example,
+it could be #define'd to call a routine to convert yytext to lower-case.
+.LP
+In the generated scanner, the actions are all gathered in one large
+switch statement and separated using
+.B YY_BREAK,
+which may be redefined.
+This allows, for example, C++ users to
+#define YY_BREAK to do nothing (while being very careful that every
+rule ends with a "break" or a "return"!) to avoid suffering from
+unreachable statement warnings where a rule's action ends with "return".
+.SH INTERFACING WITH YACC
+One of the main uses of
+.I flex
+is as a companion to the
+.I yacc
+parser-generator.
+.I yacc
+parsers expect to call the
+.B yylex()
+routine to find the next input token.  The routine is supposed to
+return the type of the next token as well as putting any associated
+value in the global
+.B yylval.
+To use
+.I flex
+with
+.I yacc,
+one specifies the
+.B -d
+option to
+.I yacc
+to instruct it to generate the file
+.B y.tab.h
+containing definitions of all the
+.B %tokens
+appearing in the
+.I yacc
+input.  This file is then included in the
+.I flex
+scanner.  For example, if one of the tokens is "TOK_NUMBER",
+part of the scanner might look like:
+.nf
+
+    %{
+    #include "y.tab.h"
+    %}
+
+    %%
+
+    [0-9]+        yylval = atoi( yytext ); return TOK_NUMBER;
+
+.fi
+.SH TRANSLATION TABLE
+In the name of POSIX compliance,
+.I flex
+supports a
+.I translation table
+for mapping input characters together into specified sets.
+The table is specified in the first section, and its format looks like:
+.nf
+
+    %t
+    1        abcd
+    2        ABCDEFGHIJKLMNOPQRSTUVWXYZ
+    52       0123456789
+    %t
+
+.fi
+This example specifies that the characters 'a', 'b', 'c', and 'd'
+are to all be lumped into group #1, the upper-case letters are
+to be in group #2, and digits in group #52, and
+.I no other characters will appear in the patterns
+(note that characters can also be specified in a
+.B %t
+table using escape sequences).
+The group numbers are actually disregarded by
+.I flex;
+.B %t
+serves, though, to lump characters together.  Given the above
+table, for example, the pattern "aAA*5" is equivalent to "dZQ*0".
+They both say, "match any character in group #1, followed by
+a character from group #2, followed by zero-or-more characters
+from group #2, followed by a character from group #52."  Thus
+.B %t
+provides a crude way for introducing equivalence classes into
+the scanner specification.  It is the author's belief that the
+.B -i
+option coupled with the equivalence classes which
+.I flex
+automatically generates take care of virtually all the instances
+when one might consider using
+.B %t.
+But what the hell, it's there if you want it.
+.so options.man
+.SH PERFORMANCE CONSIDERATIONS
+The main design goal of
+.I flex
+is that it generate high-performance scanners.  It has been optimized
+for dealing well with large sets of rules.  Aside from the effects
+outlined above of table compression on scanner speed,
+there are a number of options/actions which degrade performance.  These
+are, in decreasing order of performance impact:
+.nf
+
+    REJECT
+    pattern sets that require backtracking
+    arbitrary trailing context
+    %T
+    '^' beginning-of-line operator
+    yymore()
+    start conditions
+
+.fi
+.LP
+Getting rid of backtracking is messy and often may be too much
+work for a complicated scanner's rules.  In principal, one begins
+by using the
+.B -b 
+flag to generate a
+.I lex.backtrack
+file.  For example, on the input
+.nf
+
+    %%
+    foo        return TOK_KEYWORD;
+    foobar     return TOK_KEYWORD;
+
+.fi
+the file looks like:
+.nf
+
+    State #6 is non-accepting -
+     associated rules:
+           2       3
+     out-transitions: [ o ]
+     jam-transitions: EOF [ \001-n  p-\177 ]
+
+    State #8 is non-accepting -
+     associated rules:
+           3
+     out-transitions: [ a ]
+     jam-transitions: EOF [ \001-`  b-\177 ]
+
+    State #9 is non-accepting -
+     associated rules:
+           3
+     out-transitions: [ r ]
+     jam-transitions: EOF [ \001-q  s-\177 ]
+
+    Compressed tables always backtrack.
+
+.fi
+The first few lines tell us that there's a scanner state in
+which it can make a transition on an 'o' but not on any other
+character, and the currently scanned text does not match any rule.
+If the scanner is in that state and then reads
+something other than an 'o', it will have to backtrack to find
+a rule which is matched.  With
+a bit of headscratching one can see that this must be the
+state it's in when it has seen "fo".  When this has happened,
+if anything other than another 'o' is seen, the scanner will
+have to back up to simply match the 'f' (by the default rule).
+.LP
+The comment regarding State #8 indicates there's a problem
+when "foob" has been scanned.  Indeed, on any character other
+than a 'b', the scanner will have to back up to accept "foo".
+Similarly, the comment for State #9 concerns when "fooba" has
+been scanned.
+.LP
+The final comment reminds us that there's no point going to
+all the trouble of removing backtracking from the rules unless
+we're using
+.B -f
+or
+.B -F,
+since there's no performance gain doing so with compressed scanners.
+.LP
+The way to remove the backtracking is to add "error" rules:
+.nf
+
+    %%
+    foo         return TOK_KEYWORD;
+    foobar      return TOK_KEYWORD;
+
+    fooba       |
+    foob        |
+    fo          {
+                /* false alarm, not really a keyword */
+                return TOK_ID;
+                }
+
+.fi
+.LP
+Unfortunately backtracking messages tend to cascade and
+with a complicated input set it's not uncommon to get hundreds
+of messages.  If one can decipher them, though, it often
+only takes a dozen or so rules to eliminate the backtracking.
+(A possible future
+.I flex
+feature will be to automatically add rules to eliminate backtracking.
+The problem is that while it's easy for
+.I flex
+to figure out what rules are needed, it's very hard for it to
+know what the proper action is.  Currently I'm thinking that it
+will simply invoke a user-redefinable macro and that's it ...)
+.LP
+Another area where the user can increase a scanner's performance
+(and one that's easier to implement) arises from the fact that
+the longer the tokens matched, the faster the scanner will run.
+This is because with long tokens the processing of most input
+characters takes place in the (short) inner scanning loop, and
+does not often have to go through the additional work of setting up
+the scanning environment (e.g.,
+.B yytext)
+for the action.  Recall the scanner for C comments:
+.nf
+
+    %x comment
+    %%
+            int line_num = 1;
+
+    <comment>[^*\n]*
+    <comment>"*"+[^*/\n]*
+    <comment>\n             ++line_num;
+    <comment>"*"+"/"        BEGIN(INITIAL);
+
+    "/*"         BEGIN(comment);
+
+.fi
+This could be sped up by writing it as:
+.nf
+
+    %x comment
+    %%
+            int line_num = 1;
+
+    <comment>[^*\n]*
+    <comment>[^*\n]*\n      ++line_num;
+    <comment>"*"+[^*/\n]*
+    <comment>"*"+[^*/\n]*\n ++line_num;
+    <comment>"*"+"/"        BEGIN(INITIAL);
+
+    "/*"         BEGIN(comment);
+
+.fi
+Now instead of each newline requiring the processing of another
+action, recognizing the newlines is "distributed" over the other rules
+to keep the matched text as long as possible.  Note that
+.I adding
+rules does
+.I not
+slow down the scanner!  The speed of the scanner is independent
+of the number of rules or (modulo the considerations given at the
+beginning of this section) how complicated the rules are with
+regard to operators such as '*' and '|'.
+.SH INCOMPATIBILITIES WITH LEX AND POSIX
+.I flex
+is a rewrite of the Unix
+.I lex
+tool (the two implementations do not share any code, though),
+which dates to the late 1970's.  There are some incompatibilities
+which are of concern to those who wish to write scanners acceptable
+to either implementation.  At present, the POSIX lex draft is
+very close to the original lex implementation, so some of these
+incompatibilities are also in conflict with the POSIX draft.  But
+the intent is that except as noted below,
+.I flex
+as it presently stands will
+ultimately be POSIX comformant (i.e., that those areas of conflict with
+the POSIX draft will be resolved in
+.I flex's
+favor).  Please bare in
+mind that all the comments are with regard to the POSIX
+.I draft
+standard and not the final document; they are included so
+.I flex
+users can be aware of the standardization issues and those areas where
+.I flex
+may in the near future be incompatibly changed with
+its current definition.
+.LP
+.I flex
+is fully compatible with
+.I lex
+with the following exceptions:
+.IP -
+When definitions are expanded,
+.I flex
+encloses them in parentheses.
+With lex, the following
+.nf
+
+    NAME    [A-Z][A-Z0-9]*
+    %%
+    foo{NAME}?      printf( "Found it\\n" );
+    %%
+
+.fi
+will not match the string "foo" because when the macro
+is expanded the rule is equivalent to "foo[A-Z][A-Z0-9]*?"
+and the precedence is such that the '?' is associated with
+"[A-Z0-9]*".  With
+.I flex,
+the rule will be expanded to
+"foo([A-z][A-Z0-9]*)?" and so the string "foo" will match.
+Note that because of this, the
+.B ^, $, <s>,
+and
+.B /
+operators cannot be used in a definition.
+.IP
+Note that the POSIX draft interpretation here is the same as
+.I flex's.
+.IP -
+The undocumented lex-scanner internal variable
+.B yylineno
+is not supported.  (The variable is not part of the POSIX draft.)
+.IP -
+The
+.B input()
+routine is not redefinable, though may be called to read characters
+following whatever has been matched by a rule.  If
+.B input()
+encounters an end-of-file the normal
+.B yywrap()
+processing is done.  A ``real'' end-of-file is returned as
+.I EOF.
+.IP
+Input is instead controlled by redefining the
+.B YY_INPUT
+macro.
+.IP
+The
+.I flex
+restriction that
+.B input()
+cannot be redefined is in accordance with the POSIX draft, but
+.B YY_INPUT
+has not yet been accepted into the draft.
+.IP -
+.B output()
+is not supported.
+Output from the ECHO macro is done to the file-pointer
+"yyout" (default
+.I stdout).
+.IP
+The POSIX draft mentions that an
+.B output()
+routine exists but currently gives no details as to what it does.
+.IP -
+If you are providing your own yywrap() routine, you must include a
+"#undef yywrap" in the definitions section (section 1).  Note that
+the "#undef" will have to be enclosed in %{}'s.
+.IP
+The POSIX draft
+specifies that yywrap() is a function and this is unlikely to change; so
+.I flex users are warned
+that
+.I yywrap()
+is likely to be changed to a function in the near future.
+.IP -
+The precedence of the
+.B {}
+operator is different.  lex interprets "abc{1,3}" as "match one, two, or
+three occurrences of 'abc'", whereas
+.I flex
+interprets it as "match 'ab'
+followed by one, two, or three occurrences of 'c'".  The latter is
+in agreement with the current POSIX draft.
+.IP -
+To refer to yytext outside of your scanner source file, use
+"extern char *yytext;" rather than "extern char yytext[];".
+This is contrary to the POSIX draft but a point on which I refuse
+to budge, as the array representation entails a serious performance penalty.
+.IP -
+The name
+.bd
+FLEX_SCANNER
+is #define'd so scanners may be written for use with either
+.I flex
+or
+.I lex.
+.SH DEFICIENCES / BUGS
+.LP
+Some trailing context
+patterns cannot be properly matched and generate
+warning messages ("Dangerous trailing context").  These are
+patterns where the ending of the
+first part of the rule matches the beginning of the second
+part, such as "zx*/xy*", where the 'x*' matches the 'x' at
+the beginning of the trailing context.  (Note that the POSIX draft
+states that the text matched by such patterns is undefined.)
+If desperate, you can use
+.B yyless()
+to effect arbitrary trailing context.
+.LP
+.I variable
+trailing context (where both the leading and trailing parts do not have
+a fixed length) entails the same performance loss as
+.I REJECT
+(i.e., substantial).
+.LP
+For some trailing context rules, parts which are actually fixed-length are
+not recognized as such, leading to the abovementioned performance loss.
+In particular, parts using '|' or {n} are always considered variable-length.
+.LP
+Use of unput() or input() invalidates yytext and yyleng.
+.LP
+Use of unput() to push back more text than was matched can
+result in the pushed-back text matching a beginning-of-line ('^')
+rule even though it didn't come at the beginning of the line
+(though this is rare!).
+.LP
+Nulls are not allowed in
+.I flex
+inputs or in the inputs to
+scanners generated by
+.I flex.
+Their presence generates fatal errors.
+.LP
+.I flex
+does not generate correct #line directives for code internal
+to the scanner; thus, bugs in
+.I
+flex.skel
+yield bogus line numbers.
+.LP
+Due to both buffering of input and read-ahead, you cannot intermix
+calls to stdio routines, such as, for example,
+.B getchar()
+with
+.I flex
+rules and expect it to work.  Call
+.B input()
+instead.
+.LP
+The total table entries listed by the
+.B -v
+flag excludes the number of table entries needed to determine
+what rule has been matched.  The number of entries is equal
+to the number of DFA states if the scanner does not use REJECT,
+and somewhat greater than the number of states if it does.
+.LP
+It would be useful if
+.I flex
+wrote to lex.yy.c a summary of the flags used in
+its generation (such as which table compression options).
+.LP
+Some of the macros, such as
+.B yywrap(),
+may in the future become functions which live in the
+.B -ll
+library.  This will doubtless break a lot of code, but may be
+required for POSIX-compliance.
+.LP
+The
+.I flex
+internal algorithms need documentation.
+.SH "SEE ALSO"
+.LP
+lex(1), yacc(1), sed(1), awk(1).
+.LP
+M. E. Lesk and E. Schmidt,
+.I LEX - Lexical Analyzer Generator
+.SH AUTHOR
+Vern Paxson, with the help of many ideas and much inspiration from
+Van Jacobson.  Original version by Jef Poskanzer.  Fast table
+representation is a partial implementation of a design done by Van
+Jacobson.  The implementation was done by Kevin Gong and Vern Paxson.
+.LP
+Thanks to the many
+.I flex
+beta-testers and feedbackers, especially Casey
+Leedom, benson@odi.com,
+Frederic Brehm, Nick Christopher, Jason Coughlin,
+Chris Faylor, Eric Goldman, Eric
+Hughes, Jeffrey R. Jones, Kevin B. Kenny, Ronald Lamprecht,
+Greg Lee, Craig Leres, Mohamed el Lozy, Jim Meyering, Marc Nozell, Esmond Pitt,
+Jef Poskanzer, Dave Tallman, Frank Whaley, Ken Yap, and others whose names
+have slipped my marginal mail-archiving skills but whose contributions
+are appreciated all the same.
+.LP
+Thanks to Keith Bostic, John Gilmore, Bob
+Mulcahy, Rich Salz, and Richard Stallman for help with various distribution
+headaches.
+.LP
+Thanks to Esmond Pitt for 8-bit character support, Benson Margulies and Fred
+Burke for C++ support, and Ove Ewerlid for supporting NUL's (as well as for
+impressive efforts regarding generating extremely high-performance
+scanners, which with luck will be soon forthcoming).
+.LP
+This work was primarily done when I was a member of the Real Time System Group
+at the Lawrence Berkeley Laboratory in Berkeley, CA.  Many thanks to all there
+for the support I received.
+.LP
+Send comments to:
+.nf
+
+     Vern Paxson
+     Computer Science Department
+     4126 Upson Hall
+     Cornell University
+     Ithaca, NY 14853-7501
+
+     vern@cs.cornell.edu
+     decvax!cornell!vern
+     vern@LBL (bitnet)
+
+.fi
author	Vern Paxson <vern@ee.lbl.gov>	1990-02-25 01:28:22 +0000
committer	Vern Paxson <vern@ee.lbl.gov>	1990-02-25 01:28:22 +0000
commit	65a8cee37cb94b7824c6f82246f262f314cde717 (patch)
tree	fdcff2418fc59886ff17c5d656b1d780c75dbf9e /flex.1
parent	fd6ee792d251e9ff28372ec39934a6a14ce346ad (diff)