summaryrefslogtreecommitdiff
path: root/flex.1
diff options
context:
space:
mode:
authorVern Paxson <vern@ee.lbl.gov>1990-02-25 01:28:22 +0000
committerVern Paxson <vern@ee.lbl.gov>1990-02-25 01:28:22 +0000
commit65a8cee37cb94b7824c6f82246f262f314cde717 (patch)
treefdcff2418fc59886ff17c5d656b1d780c75dbf9e /flex.1
parentfd6ee792d251e9ff28372ec39934a6a14ce346ad (diff)
Initial revision
Diffstat (limited to 'flex.1')
-rw-r--r--flex.11434
1 files changed, 1434 insertions, 0 deletions
diff --git a/flex.1 b/flex.1
new file mode 100644
index 0000000..d52f126
--- /dev/null
+++ b/flex.1
@@ -0,0 +1,1434 @@
+.TH FLEX 1 "24 February 1990" "Version 2.2"
+.SH NAME
+flex - fast lexical analyzer generator
+.SH SYNOPSIS
+.B flex
+.B [-bcdfinpstvFILT -C[efmF] -Sskeleton]
+.I [filename ...]
+.SH DESCRIPTION
+.I flex
+is a tool for generating
+.I scanners:
+programs which recognized lexical patterns in text.
+.I flex
+reads
+the given input files, or its standard input if no file names are given,
+for a description of a scanner to generate. The description is in
+the form of pairs
+of regular expressions and C code, called
+.I rules. flex
+generates as output a C source file,
+.B lex.yy.c,
+which defines a routine
+.B yylex().
+This file is compiled and linked with the
+.B -ll
+library to produce an executable. When the executable is run,
+it analyzes its input for occurrences
+of the regular expressions. Whenever it finds one, it executes
+the corresponding C code.
+.SH SOME SIMPLE EXAMPLES
+.LP
+First some simple examples to get the flavor of how one uses
+.I flex.
+The following
+.I flex
+input specifies a scanner which whenever it encounters a tab
+will print eight blanks to its standard output:
+.nf
+
+ %%
+ \t printf( " " );
+
+.fi
+By default, any text not matched by a
+.I flex
+scanner
+is copied to the output, so the net effect of this scanner is
+to copy its input file to its output with each tab expanded
+into eight blanks.
+In this input, there is just one rule. "\t" is the
+.I pattern
+(it's a regular expression specifying a tab) and the "printf" is the
+.I action. The "%%" marks the beginning of the rules.
+.LP
+Here's another simple example:
+.nf
+
+ int num_lines = 0, num_chars = 0;
+
+ %%
+ \n ++num_lines; ++num_chars;
+ . ++num_chars;
+
+ %%
+ main()
+ {
+ yylex();
+ printf( "# of lines = %d, # of chars = %d\n",
+ num_lines, num_chars );
+ }
+
+.fi
+This scanner counts the number of characters and the number
+of lines in its input (it produces no output other than the
+final report on the counts). The first line
+declares two globals, "num_lines" and "num_chars", which are accessible
+both inside
+.B yylex()
+and in the
+.B main()
+routine declared after the second "%%". There are two rules, one
+which matches a newline ("\\n") and increments both the line count and
+the character count, and one which matches any character other than
+a newline (indicated by the "." regular expression).
+.LP
+A somewhat more complicated example:
+.nf
+
+ /* scanner for a toy Pascal-like language */
+
+ %{
+ /* need this for the call to atof() below */
+ #include <math.h>
+ %}
+
+ DIGIT [0-9]
+ ID [a-z][a-z0-9]*
+
+ %%
+
+ {DIGIT}+ {
+ printf( "An integer: %s (%d)\\n", yytext,
+ atoi( yytext ) );
+ }
+
+ {DIGIT}+"."{DIGIT}* {
+ printf( "A float: %s (%d)\\n", yytext,
+ atof( yytext ) );
+ }
+
+ if|then|begin|end|procedure|function {
+ printf( "A keyword: %s\\n", yytext );
+ }
+
+ {ID} printf( "An identifier: %s\\n", yytext );
+
+ "+"|"-"|"*"|"/" printf( "An operator: %s\\n", yytext );
+
+ "{"[^}\\n]*"}" /* eat up one-line comments */
+
+ [ \\t\\n]+ /* eat up whitespace */
+
+ . printf( "Unrecognized character: %s\\n", yytext );
+
+ %%
+
+ main( argc, argv )
+ int argc;
+ char **argv;
+ {
+ ++argv, --argc;
+ if ( argc > 0 )
+ yyin = fopen( argv[0], "r" );
+ else
+ yyin = stdin;
+
+ yylex();
+ }
+
+.fi
+This is the beginnings of a simple scanner for a language like
+Pascal. It identifies different types of
+.I tokens
+and reports on what it has seen.
+.LP
+The details of the example will be explained in the following
+sections.
+.SH FORMAT OF THE INPUT FILE
+The
+.I flex
+input file consists of three sections, separated by
+.B %%:
+.nf
+
+ definitions
+ %%
+ rules
+ %%
+ user code
+
+.fi
+The
+.I definitions
+section contains declarations of simple
+.I name
+definitions to simplify the scanner specification and of
+.I start conditions,
+which are explained in a later section.
+.LP
+Name definitions have the form:
+.nf
+
+ name definition
+
+.fi
+The "name" is a word beginning with a letter or a '_'
+followed by zero or more letters, digits, '_', or '-'.
+The definition is taken to begin at the first non-white-space
+following the name and continue to the end of the line.
+Definition can subsequently be referred to using "{name}", which
+will expand to "(definition)". For example,
+.nf
+
+ DIGIT [0-9]
+ ID [a-z][a-z0-9]*
+
+.fi
+defines "DIGIT" to be a regular expression which matches a
+single digit, and
+"ID" to be a regular expression which matches a letter
+followed by zero-or-more letters or digits.
+A subsequent reference to
+.nf
+
+ {DIGIT}+"."{DIGIT}*
+
+.fi
+is identical to
+.nf
+
+ ([0-9])+"."([0-9])*
+
+.fi
+and matches one-or-more digits followed by a '.' followed
+by zero-or-more digits.
+.LP
+The
+.I rules
+section of the
+.I flex
+input contains a series of rules of the form:
+.nf
+
+ pattern action
+
+.fi
+where the pattern must be unindented and the action must begin
+on the same line.
+.LP
+See below for a further description of patterns and actions.
+.LP
+Finally, the user code section is simply copied to
+.B lex.yy.c
+verbatim.
+It is used for companion routines which call or are called
+by the scanner. The presence of this section is optional;
+if it is missing, the second
+.B %%
+in the input file may be skipped, too.
+.LP
+In the definitions and rule sections, any
+.I indented
+text or text enclosed in
+.B %{
+and
+.B %}
+is copied verbatim to the output (with the %{}'s removed).
+The %{}'s must appear unindented on lines by themselves.
+.LP
+In the rules section,
+any indented or %{} text appearing before the
+first rule may be used to declare variables
+which are local to the scanning routine, and, after the declarations,
+code which is to be executed whenever the scanning routine is entered.
+Other indented or %{} text in the rule section is still copied to the output,
+but its meaning is not well-defined and it may well cause compile-time
+errors (this feature is present for
+.I POSIX
+compliance; see below for other such features).
+.LP
+In the definitions section, an unindented comment (i.e., a line
+beginning with "/*") is also copied verbatim to the output up
+to the next "*/". Also, any line beginning with '#' is ignored.
+.SH PATTERNS
+The patterns in the input are written using an extended set of regular
+expressions. These are:
+.nf
+
+ x match the character 'x'
+ . any character except newline
+ [xyz] an 'x', a 'y', or a 'z'
+ [abj-oZ] an 'a', a 'b', any letter
+ from 'j' through 'o', or a 'Z'
+ [^A-Z] any character EXCEPT an uppercase letter,
+ including a newline (unlike how many other
+ regular expression tools treat the '^'!).
+ This means that a pattern like [^"]* will
+ match an entire file (overflowing the input
+ buffer) unless there's another quote in
+ the input.
+ [^A-Z\\n] any character EXCEPT an uppercase letter or
+ a newline
+ r* zero or more r's, where r is any regular expression
+ r+ one or more r's
+ r? zero or one r's (that is, "an optional r")
+ r{2,5} anywhere from two to five r's
+ r{2,} two or more r's
+ r{4} exactly 4 r's
+ {name} the expansion of the "name" definition
+ (see above)
+ "[xyz]\\"foo"
+ the literal string: [xyz]"foo
+ \\x if x is an 'a', 'b', 'f', 'n', 'r',
+ 't', or 'v', then the ANSI-C
+ interpretation of \\x. Otherwise,
+ a literal 'x' (used to escape
+ operators such as '*')
+ \\123 the character with octal value 123
+ \\x2a the character with hexadecimal value 2a
+ (r) match an r; parentheses are used
+ to override precedence (see below)
+
+
+ rs the regular expression r followed
+ by the regular expression s; called
+ "concatenation"
+
+
+ r|s either an r or an s
+
+
+ r/s an r but only if it is followed by
+ an s. The s is not part of the
+ matched text. This type of
+ pattern is known as "trailing context".
+ ^r an r, but only at the beginning of a line
+ r$ an r, but only at the end of a line
+ (r must not use trailing context)
+
+
+ <s>r an r, but only in start condition s (see
+ below for discussion of start conditions)
+ <s1,s2,s3>r
+ same, but in any of start conditions s1,
+ s2, or s3
+
+
+ <<EOF>> an end-of-file
+ <s1,s2><<EOF>>
+ an end-of-file when in start condition s1 or s2
+
+.fi
+The regular expressions listed above are grouped according to
+precedence, from highest precedence at the top to lowest at the bottom.
+Those grouped together have equal precedence. For example,
+.nf
+
+ foo|bar*
+
+.fi
+is the same as
+.nf
+
+ (foo)|(ba(r*))
+
+.fi
+since the '*' operator has higher precedence than concatenation,
+and concatenation higher than alternation ('|'). This pattern
+therefore matches
+.I either
+the string "foo"
+.I or
+the string "ba" followed by zero-or-more r's.
+To match "foo" or zero-or-more "bar"'s, use:
+.nf
+
+ foo|(bar)*
+
+.fi
+and to match zero-or-more "foo"'s or "bar"'s:
+.nf
+
+ (foo|bar)*
+
+.fi
+.SH HOW THE INPUT IS MATCHED
+When the generated scanner is run, it analyzes its input looking
+for strings which match any of its patterns. If it finds more than
+one match, it takes the one matching the most text (for trailing
+context rules, this includes the length of the trailing part, even
+though it will then be returned to the input). If it finds two
+or more matches of the same length, the
+rule listed first in the
+.I flex
+input file is chosen.
+.LP
+Once the match is determined, the text corresponding to the match
+(called the
+.I token)
+is made available in the global character pointer
+.B yytext,
+and its length in the global integer
+.B yyleng.
+The
+.I action
+corresponding to the matched pattern is then executed (a more
+detailed description of actions follows), and then the remaining
+input is scanned for another match.
+.LP
+If no match is found, then the
+.I default rule
+is executed: the next character in the input is matched and
+copied to the standard output. Thus, the simplest legal
+.I flex
+input is:
+.nf
+
+ %%
+
+.fi
+which generates a scanner that simply copies its input (one character
+at a time) to its output.
+.SH ACTIONS
+Each pattern in a rule has a corresponding action, which can be any
+arbitrary C statement. The pattern ends at the first non-escaped
+whitespace character; the remainder of the line is its action. If the
+action is empty, then when the pattern is matched the input token
+is simply discarded. For example, here is the specification for a program
+which deletes all occurrences of "zap me" from its input:
+.nf
+
+ %%
+ "zap me"
+
+.fi
+Here is a program which compresses multiple blanks and tabs down to
+a single blank, and throws away whitespace found at the end of a line:
+.nf
+
+ %%
+ [ \t]+ putchar( ' ' );
+ [ \t]+$ /* ignore this token */
+
+.fi
+.LP
+If the action contains a '{', then the action spans till the balancing
+'}' is found, and the action may cross multiple lines.
+.I flex
+knows about C strings and comments and won't be fooled by braces found
+within them, but also allows actions to begin with
+.B %{
+and will consider the action to be all the text up to the next
+.B %}.
+.LP
+An action consisting solely of a vertical bar ('|') means "same as
+the action for the next rule. See below for an illustration.
+.LP
+Actions can include arbitrary C code, including
+.B return
+statements to return a value whatever routine called
+.B yylex().
+Each time
+.B yylex()
+is called it continues processing tokens from where it last left
+off until it either reaches
+the end of the file or executes a return.
+.LP
+Actions are not allowed to modify yytext or yyleng.
+.LP
+There are a number of special directives which can be included within
+an action:
+.IP -
+.B ECHO
+copies yytext to the scanner's output.
+.IP -
+.B BEGIN
+followed by the name of a start condition places the scanner in the
+corresponding start condition (see below).
+.IP -
+.B REJECT
+directs the scanner to proceed on to the "second best" rule which matched the
+input (or a prefix of the input). The rule is chosen as described
+above in "How the Input is Matched", and
+.B yytext
+and
+.B yyleng
+set up appropriately.
+It may either be one which matched as much text
+as the originally chosen rule but came later in the
+.I flex
+input file, or one which matched less text.
+For example, the following will both count the
+words in the input and call the routine special() whenever "frob" is seen:
+.nf
+
+ %{
+ int word_count = 0;
+ %}
+ %%
+
+ frob special(); REJECT;
+ [^ \t\n]+ ++word_count;
+
+.fi
+Without the
+.B REJECT,
+any "frob"'s in the input would not be counted as words, since the
+scanner normally executes only one action per token.
+Multiple
+.B REJECT's
+are allowed, each one finding the next best choice to the currently
+active rule. For example, when the following scanner scans the token
+"abcd", it will write "abcdabcaba" to the output:
+.nf
+
+ %%
+ a |
+ ab |
+ abc |
+ abcd ECHO; REJECT;
+ .|\n /* eat up any unmatched character */
+
+.fi
+(The first three rules share the fourth's action since they use
+the special '|' action.)
+.B REJECT
+is a particularly expensive feature in terms scanner performance;
+if it is used in
+.I any
+of the scanner's actions it will slow down
+.I all
+of the scanner's matching.
+.IP -
+.B yymore()
+tells the scanner that the next time it matches a rule, the corresponding
+token should be
+.I appended
+onto the current value of
+.B yytext
+rather than replacing it. For example, given the input "mega-kludge"
+the following will write "mega-mega-kludge" to the output:
+.nf
+
+ %%
+ mega- ECHO; yymore();
+ kludge ECHO;
+
+.fi
+First "mega-" is matched and echoed to the output. Then "kludge"
+is matched, but the previous "mega-" is still hanging around at the
+beginning of
+.B yytext
+so the ECHO for the "kludge" rule will actually write "mega-kludge".
+The presence of
+.B yymore()
+in the scanner's action entails a minor performance penalty in the
+scanner's matching speed.
+.IP -
+.B yyless(n)
+returns all but the first
+.I n
+characters of the current token back to the input stream, where they
+will be rescanned when the scanner looks for the next match.
+.B yytext
+and
+.B yyleng
+are adjusted appropriately (e.g.,
+.B yyleng
+will now be equal to
+.I n
+). For example, on the input "foobar" the following will write out
+"foobarbar":
+.nf
+
+ foobar ECHO; yyless(3);
+ [a-z]+ ECHO;
+
+.fi
+An argument of 0 to
+.B yyless
+will cause the current input string to be scanned again. Unless you've
+changed how the scanner will subsequently process its input (using
+.B BEGIN,
+for example), this will result in an endless loop.
+.IP -
+.B unput(c)
+puts the character
+.I c
+back onto the input stream. It will be the next character scanned.
+The following action will take the current token and cause it
+to be rescanned enclosed in parentheses.
+.nf
+
+ {
+ int i;
+ unput( ')' );
+ for ( i = yyleng - 1; i >= 0; --i )
+ unput( yytext[i] );
+ unput( '(' );
+ }
+
+.fi
+Note that since each
+.B unput()
+puts the given character back at the
+.I beginning
+of the input stream, pushing back strings must be done back-to-front.
+.IP -
+.B input()
+reads the next character from the input stream. For example,
+the following is one way to eat up C comments:
+.nf
+
+ %%
+ "/*" {
+ register int c;
+
+ for ( ; ; )
+ {
+ while ( (c = input()) != '*' &&
+ c != EOF )
+ ; /* eat up text of comment */
+
+ if ( c == '*' )
+ {
+ while ( (c = input()) == '*' )
+ ;
+ if ( c == '/' )
+ break; /* found the end */
+ }
+
+ if ( c == EOF )
+ {
+ error( "EOF in comment" );
+ break;
+ }
+ }
+ }
+
+.fi
+.IP -
+.I yyterminate()
+can be used in lieu of a return statement in an action. It terminates
+the scanner and returns a 0 to the scanner's caller, indicating "all done".
+By default,
+.I yyterminate()
+is also called when an end-of-file is encountered. It is a macro and
+may be redefined.
+.SH THE GENERATED SCANNER
+The output of
+.I flex
+is the file
+.B lex.yy.c,
+which contains the scanning routine
+.B yylex(),
+a number of tables used by it for matching tokens, and a number
+of auxilliary routines and macros. By default,
+.B yylex()
+is declared as follows:
+.nf
+
+ int yylex()
+ {
+ ... various definitions and the actions in here ...
+ }
+
+.fi
+(If your environment supports function prototypes, then it will
+be "int yylex( void )".) This definition may be changed by redefining
+the "YY_DECL" macro. For example, you could use:
+.nf
+
+ #undef YY_DECL
+ #define YY_DECL float lexscan( a, b ) float a, b;
+
+.fi
+to give it the the scanning routine the name
+.I lexscan,
+returning a float, and taking two floats as arguments. Note that
+if you give arguments to the scanning routine using a
+K&R-style/non-prototyped function declaration, you must terminate
+the definition with a semi-colon (;).
+.LP
+Whenever
+.B yylex()
+is called, it scans tokens from the global input file
+.I yyin
+(default, stdin). It continues until it either reaches
+an end-of-file (at which point it returns the value 0) or
+one of its actions executes a
+.I return
+statement.
+In the former case, the scanner may not be called again unless
+.B void yyrestart( FILE *input_file )
+is called, to point
+.I yyin
+at the new input_file. In the latter case (i.e., when an action
+executes a return), the scanner may then be called again and it
+will resume scanning where it left off.
+.LP
+By default (and for purposes of efficiency), the scanner uses
+block-reads rather than simple
+.I getc()
+calls to read characters from
+.I yyin.
+The nature of how it gets its input can be controlled by redefining the
+.B YY_INPUT
+macro.
+YY_INPUT's calling sequence is "YY_INPUT(buf,result,max_size)". Its
+action is to place up to
+.I max_size
+characters in the character array
+.I buf
+and return in the integer variable
+.I result
+either the
+number of characters read or the constant YY_NULL (0 on Unix systems)
+to indicate EOF. The default YY_INPUT reads from the
+global file-pointer "yyin" (which is by default
+.I stdin),
+so if you
+just want to change the input file, you needn't redefine
+YY_INPUT - just point yyin at the input file.
+.LP
+A sample redefinition of YY_INPUT (in the definitions section of the input
+file):
+.nf
+
+ %{
+ #undef YY_INPUT
+ #define YY_INPUT(buf,result,max_size) \\
+ result = ((buf[0] = getchar()) == EOF) ? YY_NULL : 1;
+ %}
+
+.fi
+You also can add in things like keeping track of the
+input line number this way; but don't expect your scanner to
+go very fast.
+.LP
+When the scanner receives an end-of-file indication from YY_INPUT,
+it then checks the
+.B yywrap()
+function. If it returns false (zero), then it is assumed that the
+function has gone ahead and set up
+.I yyin
+to point to another input file, and scanning continues. If it returns
+true (non-zero), then the scanner terminates, returning 0 to its
+caller.
+.LP
+The default
+.B yywrap()
+always returns 1. Presently, to redefine it you must first
+"#undef yywrap", as it is currently implemented as a macro. As noted
+by the hedging in the previous sentence, it may be changed to
+a true function in the near future.
+.LP
+The scanner writes its
+.B ECHO
+output to the
+.I yyout
+global (default, stdout), which may be redefined by the user simply
+by assigning it to some other FILE*.
+.SH START CONDITIONS
+.I flex
+provides a mechanism for conditionally activating rules. Any rule
+whose pattern is prefixed with "<sc>" will only be active when
+the scanner is in the start condition named "sc". For example,
+.nf
+
+ <STRING>[^"]* { /* eat up the string body ... */
+ ...
+ }
+
+.fi
+will be active only when the scanner is in the "STRING" start
+condition, and
+.nf
+
+ <INITIAL,STRING,QUOTE>\\. { /* handle an escape ... */
+ ...
+ }
+
+.fi
+will be active only when the current start condition is
+either "INITIAL", "STRING", or "QUOTE".
+.LP
+Start conditions
+are declared in the definitions (first) section of the input
+using unindented lines beginning with either
+.B %s
+or
+.B %x
+followed by a list of names.
+The former declares
+.I inclusive
+start conditions, the latter
+.I exclusive
+start conditions. A start condition is activated using the
+.B BEGIN
+action. Until the next
+.B BEGIN
+action is executed, rules with the given start
+condition will be active and
+rules with other start conditions will be inactive.
+If the start condition is
+.I inclusive,
+then rules with no start conditions at all will also be active.
+If it is
+.I exclusive,
+then
+.I only
+rules qualified with the start condition will be active.
+So a set of rules conditioned on the same exclusive start condition
+describe a scanner which is independent of any of the other rules in the
+.I flex
+input. Because of this,
+exclusive start conditions make it easy to specify "mini-scanners"
+which scan portions of the input that are syntactically different
+from the rest (e.g., comments).
+.LP
+.B BEGIN(0)
+returns to the original state where only the rules with
+no start conditions are active. This state can also be
+referred to as the start-condition "INITIAL", so
+.B BEGIN(INITIAL)
+is equivalent to
+.B BEGIN(0).
+.LP
+Here is a scanner which will recognize numbers only if they
+are preceded earlier in the line by the string "expect-number":
+.nf
+
+ %s expect
+
+ %%
+ expect-number BEGIN(expect);
+
+ <expect>[0-9]+ printf( "found a number\n" );
+ <expect>\n {
+ /* that's the end of the line, so
+ * we need another "expect-number"
+ * before we'll recognize any more
+ * numbers
+ */
+ BEGIN(INITIAL);
+ }
+
+.fi
+Here is a scanner which recognizes (and discards) C comments while
+maintaining a count of the current input line.
+.nf
+
+ %x comment
+ %%
+ int line_num = 1;
+
+ <comment>[^*\n]*
+ <comment>"*"+[^*/\n]*
+ <comment>\n ++line_num;
+ <comment>"*"+"/" BEGIN(INITIAL);
+
+ "/*" BEGIN(comment);
+
+.fi
+Note that start-conditions names are really integer values and
+can be stored as such. Thus, the above could be extended in the
+following fashion:
+.nf
+
+ %x comment
+ %%
+ int line_num = 1;
+ int comment_caller;
+
+ <comment>[^*\n]*
+ <comment>"*"+[^*/\n]*
+ <comment>\n ++line_num;
+ <comment>"*"+"/" BEGIN(comment_caller);
+
+ "/*" {
+ comment_caller = INTIIAL;
+ BEGIN(comment);
+ }
+
+ ...
+
+ <foo>"/*" {
+ comment_caller = foo;
+ BEGIN(comment);
+ }
+.fi
+One can then implement a "stack" of start conditions using an
+array of integers. (It is likely that such stacks will become
+a full-fledged
+.I flex
+feature in the future.) Note, though, that
+start conditions do not have their own name-space; %s's and %x's
+declare names effectively the same as #define's.
+.SH END-OF-FILE RULES
+The special rule "<<EOF>>" indicates
+actions which are to be taken when an end-of-file is
+encountered and yywrap() returns non-zero (i.e., indicates
+no further files to process). The action can either
+point yyin at a new file to process, in which case the
+action
+.I must
+finish with the special
+.I YY_NEW_FILE
+action
+(this is a branch, so subsequent code in the action won't
+be executed), or the action must finish with a
+.I return
+or
+.I yyterminate()
+statement. <<EOF>> rules may not be used with other
+patterns; they may only be qualified with a list of start
+conditions. If an unqualified <<EOF>> rule is given, it
+applies only to the
+.B INITIAL
+start condition, and
+.I not
+to
+.B %s
+(or
+.B %x)
+start conditions.
+.LP
+These rules are useful for catching things like unclosed comments.
+An example:
+.nf
+
+ %x quote
+ %%
+ ...
+ <quote><<EOF>> {
+ error( "unterminated quote" );
+ yyterminate();
+ }
+ <<EOF>> {
+ if ( *++filelist )
+ {
+ yyin = fopen( *filelist, "r" );
+ YY_NEW_FILE;
+ }
+ else
+ yyterminate();
+ }
+
+.fi
+.SH MISCELLANEOUS MACROS
+The macro
+.bd
+YY_USER_ACTION
+can be redefined to provide an action
+which is always executed prior to the matched rule's action. For example,
+it could be #define'd to call a routine to convert yytext to lower-case.
+.LP
+In the generated scanner, the actions are all gathered in one large
+switch statement and separated using
+.B YY_BREAK,
+which may be redefined.
+This allows, for example, C++ users to
+#define YY_BREAK to do nothing (while being very careful that every
+rule ends with a "break" or a "return"!) to avoid suffering from
+unreachable statement warnings where a rule's action ends with "return".
+.SH INTERFACING WITH YACC
+One of the main uses of
+.I flex
+is as a companion to the
+.I yacc
+parser-generator.
+.I yacc
+parsers expect to call the
+.B yylex()
+routine to find the next input token. The routine is supposed to
+return the type of the next token as well as putting any associated
+value in the global
+.B yylval.
+To use
+.I flex
+with
+.I yacc,
+one specifies the
+.B -d
+option to
+.I yacc
+to instruct it to generate the file
+.B y.tab.h
+containing definitions of all the
+.B %tokens
+appearing in the
+.I yacc
+input. This file is then included in the
+.I flex
+scanner. For example, if one of the tokens is "TOK_NUMBER",
+part of the scanner might look like:
+.nf
+
+ %{
+ #include "y.tab.h"
+ %}
+
+ %%
+
+ [0-9]+ yylval = atoi( yytext ); return TOK_NUMBER;
+
+.fi
+.SH TRANSLATION TABLE
+In the name of POSIX compliance,
+.I flex
+supports a
+.I translation table
+for mapping input characters together into specified sets.
+The table is specified in the first section, and its format looks like:
+.nf
+
+ %t
+ 1 abcd
+ 2 ABCDEFGHIJKLMNOPQRSTUVWXYZ
+ 52 0123456789
+ %t
+
+.fi
+This example specifies that the characters 'a', 'b', 'c', and 'd'
+are to all be lumped into group #1, the upper-case letters are
+to be in group #2, and digits in group #52, and
+.I no other characters will appear in the patterns
+(note that characters can also be specified in a
+.B %t
+table using escape sequences).
+The group numbers are actually disregarded by
+.I flex;
+.B %t
+serves, though, to lump characters together. Given the above
+table, for example, the pattern "aAA*5" is equivalent to "dZQ*0".
+They both say, "match any character in group #1, followed by
+a character from group #2, followed by zero-or-more characters
+from group #2, followed by a character from group #52." Thus
+.B %t
+provides a crude way for introducing equivalence classes into
+the scanner specification. It is the author's belief that the
+.B -i
+option coupled with the equivalence classes which
+.I flex
+automatically generates take care of virtually all the instances
+when one might consider using
+.B %t.
+But what the hell, it's there if you want it.
+.so options.man
+.SH PERFORMANCE CONSIDERATIONS
+The main design goal of
+.I flex
+is that it generate high-performance scanners. It has been optimized
+for dealing well with large sets of rules. Aside from the effects
+outlined above of table compression on scanner speed,
+there are a number of options/actions which degrade performance. These
+are, in decreasing order of performance impact:
+.nf
+
+ REJECT
+ pattern sets that require backtracking
+ arbitrary trailing context
+ %T
+ '^' beginning-of-line operator
+ yymore()
+ start conditions
+
+.fi
+.LP
+Getting rid of backtracking is messy and often may be too much
+work for a complicated scanner's rules. In principal, one begins
+by using the
+.B -b
+flag to generate a
+.I lex.backtrack
+file. For example, on the input
+.nf
+
+ %%
+ foo return TOK_KEYWORD;
+ foobar return TOK_KEYWORD;
+
+.fi
+the file looks like:
+.nf
+
+ State #6 is non-accepting -
+ associated rules:
+ 2 3
+ out-transitions: [ o ]
+ jam-transitions: EOF [ \001-n p-\177 ]
+
+ State #8 is non-accepting -
+ associated rules:
+ 3
+ out-transitions: [ a ]
+ jam-transitions: EOF [ \001-` b-\177 ]
+
+ State #9 is non-accepting -
+ associated rules:
+ 3
+ out-transitions: [ r ]
+ jam-transitions: EOF [ \001-q s-\177 ]
+
+ Compressed tables always backtrack.
+
+.fi
+The first few lines tell us that there's a scanner state in
+which it can make a transition on an 'o' but not on any other
+character, and the currently scanned text does not match any rule.
+If the scanner is in that state and then reads
+something other than an 'o', it will have to backtrack to find
+a rule which is matched. With
+a bit of headscratching one can see that this must be the
+state it's in when it has seen "fo". When this has happened,
+if anything other than another 'o' is seen, the scanner will
+have to back up to simply match the 'f' (by the default rule).
+.LP
+The comment regarding State #8 indicates there's a problem
+when "foob" has been scanned. Indeed, on any character other
+than a 'b', the scanner will have to back up to accept "foo".
+Similarly, the comment for State #9 concerns when "fooba" has
+been scanned.
+.LP
+The final comment reminds us that there's no point going to
+all the trouble of removing backtracking from the rules unless
+we're using
+.B -f
+or
+.B -F,
+since there's no performance gain doing so with compressed scanners.
+.LP
+The way to remove the backtracking is to add "error" rules:
+.nf
+
+ %%
+ foo return TOK_KEYWORD;
+ foobar return TOK_KEYWORD;
+
+ fooba |
+ foob |
+ fo {
+ /* false alarm, not really a keyword */
+ return TOK_ID;
+ }
+
+.fi
+.LP
+Unfortunately backtracking messages tend to cascade and
+with a complicated input set it's not uncommon to get hundreds
+of messages. If one can decipher them, though, it often
+only takes a dozen or so rules to eliminate the backtracking.
+(A possible future
+.I flex
+feature will be to automatically add rules to eliminate backtracking.
+The problem is that while it's easy for
+.I flex
+to figure out what rules are needed, it's very hard for it to
+know what the proper action is. Currently I'm thinking that it
+will simply invoke a user-redefinable macro and that's it ...)
+.LP
+Another area where the user can increase a scanner's performance
+(and one that's easier to implement) arises from the fact that
+the longer the tokens matched, the faster the scanner will run.
+This is because with long tokens the processing of most input
+characters takes place in the (short) inner scanning loop, and
+does not often have to go through the additional work of setting up
+the scanning environment (e.g.,
+.B yytext)
+for the action. Recall the scanner for C comments:
+.nf
+
+ %x comment
+ %%
+ int line_num = 1;
+
+ <comment>[^*\n]*
+ <comment>"*"+[^*/\n]*
+ <comment>\n ++line_num;
+ <comment>"*"+"/" BEGIN(INITIAL);
+
+ "/*" BEGIN(comment);
+
+.fi
+This could be sped up by writing it as:
+.nf
+
+ %x comment
+ %%
+ int line_num = 1;
+
+ <comment>[^*\n]*
+ <comment>[^*\n]*\n ++line_num;
+ <comment>"*"+[^*/\n]*
+ <comment>"*"+[^*/\n]*\n ++line_num;
+ <comment>"*"+"/" BEGIN(INITIAL);
+
+ "/*" BEGIN(comment);
+
+.fi
+Now instead of each newline requiring the processing of another
+action, recognizing the newlines is "distributed" over the other rules
+to keep the matched text as long as possible. Note that
+.I adding
+rules does
+.I not
+slow down the scanner! The speed of the scanner is independent
+of the number of rules or (modulo the considerations given at the
+beginning of this section) how complicated the rules are with
+regard to operators such as '*' and '|'.
+.SH INCOMPATIBILITIES WITH LEX AND POSIX
+.I flex
+is a rewrite of the Unix
+.I lex
+tool (the two implementations do not share any code, though),
+which dates to the late 1970's. There are some incompatibilities
+which are of concern to those who wish to write scanners acceptable
+to either implementation. At present, the POSIX lex draft is
+very close to the original lex implementation, so some of these
+incompatibilities are also in conflict with the POSIX draft. But
+the intent is that except as noted below,
+.I flex
+as it presently stands will
+ultimately be POSIX comformant (i.e., that those areas of conflict with
+the POSIX draft will be resolved in
+.I flex's
+favor). Please bare in
+mind that all the comments are with regard to the POSIX
+.I draft
+standard and not the final document; they are included so
+.I flex
+users can be aware of the standardization issues and those areas where
+.I flex
+may in the near future be incompatibly changed with
+its current definition.
+.LP
+.I flex
+is fully compatible with
+.I lex
+with the following exceptions:
+.IP -
+When definitions are expanded,
+.I flex
+encloses them in parentheses.
+With lex, the following
+.nf
+
+ NAME [A-Z][A-Z0-9]*
+ %%
+ foo{NAME}? printf( "Found it\\n" );
+ %%
+
+.fi
+will not match the string "foo" because when the macro
+is expanded the rule is equivalent to "foo[A-Z][A-Z0-9]*?"
+and the precedence is such that the '?' is associated with
+"[A-Z0-9]*". With
+.I flex,
+the rule will be expanded to
+"foo([A-z][A-Z0-9]*)?" and so the string "foo" will match.
+Note that because of this, the
+.B ^, $, <s>,
+and
+.B /
+operators cannot be used in a definition.
+.IP
+Note that the POSIX draft interpretation here is the same as
+.I flex's.
+.IP -
+The undocumented lex-scanner internal variable
+.B yylineno
+is not supported. (The variable is not part of the POSIX draft.)
+.IP -
+The
+.B input()
+routine is not redefinable, though may be called to read characters
+following whatever has been matched by a rule. If
+.B input()
+encounters an end-of-file the normal
+.B yywrap()
+processing is done. A ``real'' end-of-file is returned as
+.I EOF.
+.IP
+Input is instead controlled by redefining the
+.B YY_INPUT
+macro.
+.IP
+The
+.I flex
+restriction that
+.B input()
+cannot be redefined is in accordance with the POSIX draft, but
+.B YY_INPUT
+has not yet been accepted into the draft.
+.IP -
+.B output()
+is not supported.
+Output from the ECHO macro is done to the file-pointer
+"yyout" (default
+.I stdout).
+.IP
+The POSIX draft mentions that an
+.B output()
+routine exists but currently gives no details as to what it does.
+.IP -
+If you are providing your own yywrap() routine, you must include a
+"#undef yywrap" in the definitions section (section 1). Note that
+the "#undef" will have to be enclosed in %{}'s.
+.IP
+The POSIX draft
+specifies that yywrap() is a function and this is unlikely to change; so
+.I flex users are warned
+that
+.I yywrap()
+is likely to be changed to a function in the near future.
+.IP -
+The precedence of the
+.B {}
+operator is different. lex interprets "abc{1,3}" as "match one, two, or
+three occurrences of 'abc'", whereas
+.I flex
+interprets it as "match 'ab'
+followed by one, two, or three occurrences of 'c'". The latter is
+in agreement with the current POSIX draft.
+.IP -
+To refer to yytext outside of your scanner source file, use
+"extern char *yytext;" rather than "extern char yytext[];".
+This is contrary to the POSIX draft but a point on which I refuse
+to budge, as the array representation entails a serious performance penalty.
+.IP -
+The name
+.bd
+FLEX_SCANNER
+is #define'd so scanners may be written for use with either
+.I flex
+or
+.I lex.
+.SH DEFICIENCES / BUGS
+.LP
+Some trailing context
+patterns cannot be properly matched and generate
+warning messages ("Dangerous trailing context"). These are
+patterns where the ending of the
+first part of the rule matches the beginning of the second
+part, such as "zx*/xy*", where the 'x*' matches the 'x' at
+the beginning of the trailing context. (Note that the POSIX draft
+states that the text matched by such patterns is undefined.)
+If desperate, you can use
+.B yyless()
+to effect arbitrary trailing context.
+.LP
+.I variable
+trailing context (where both the leading and trailing parts do not have
+a fixed length) entails the same performance loss as
+.I REJECT
+(i.e., substantial).
+.LP
+For some trailing context rules, parts which are actually fixed-length are
+not recognized as such, leading to the abovementioned performance loss.
+In particular, parts using '|' or {n} are always considered variable-length.
+.LP
+Use of unput() or input() invalidates yytext and yyleng.
+.LP
+Use of unput() to push back more text than was matched can
+result in the pushed-back text matching a beginning-of-line ('^')
+rule even though it didn't come at the beginning of the line
+(though this is rare!).
+.LP
+Nulls are not allowed in
+.I flex
+inputs or in the inputs to
+scanners generated by
+.I flex.
+Their presence generates fatal errors.
+.LP
+.I flex
+does not generate correct #line directives for code internal
+to the scanner; thus, bugs in
+.I
+flex.skel
+yield bogus line numbers.
+.LP
+Due to both buffering of input and read-ahead, you cannot intermix
+calls to stdio routines, such as, for example,
+.B getchar()
+with
+.I flex
+rules and expect it to work. Call
+.B input()
+instead.
+.LP
+The total table entries listed by the
+.B -v
+flag excludes the number of table entries needed to determine
+what rule has been matched. The number of entries is equal
+to the number of DFA states if the scanner does not use REJECT,
+and somewhat greater than the number of states if it does.
+.LP
+It would be useful if
+.I flex
+wrote to lex.yy.c a summary of the flags used in
+its generation (such as which table compression options).
+.LP
+Some of the macros, such as
+.B yywrap(),
+may in the future become functions which live in the
+.B -ll
+library. This will doubtless break a lot of code, but may be
+required for POSIX-compliance.
+.LP
+The
+.I flex
+internal algorithms need documentation.
+.SH "SEE ALSO"
+.LP
+lex(1), yacc(1), sed(1), awk(1).
+.LP
+M. E. Lesk and E. Schmidt,
+.I LEX - Lexical Analyzer Generator
+.SH AUTHOR
+Vern Paxson, with the help of many ideas and much inspiration from
+Van Jacobson. Original version by Jef Poskanzer. Fast table
+representation is a partial implementation of a design done by Van
+Jacobson. The implementation was done by Kevin Gong and Vern Paxson.
+.LP
+Thanks to the many
+.I flex
+beta-testers and feedbackers, especially Casey
+Leedom, benson@odi.com,
+Frederic Brehm, Nick Christopher, Jason Coughlin,
+Chris Faylor, Eric Goldman, Eric
+Hughes, Jeffrey R. Jones, Kevin B. Kenny, Ronald Lamprecht,
+Greg Lee, Craig Leres, Mohamed el Lozy, Jim Meyering, Marc Nozell, Esmond Pitt,
+Jef Poskanzer, Dave Tallman, Frank Whaley, Ken Yap, and others whose names
+have slipped my marginal mail-archiving skills but whose contributions
+are appreciated all the same.
+.LP
+Thanks to Keith Bostic, John Gilmore, Bob
+Mulcahy, Rich Salz, and Richard Stallman for help with various distribution
+headaches.
+.LP
+Thanks to Esmond Pitt for 8-bit character support, Benson Margulies and Fred
+Burke for C++ support, and Ove Ewerlid for supporting NUL's (as well as for
+impressive efforts regarding generating extremely high-performance
+scanners, which with luck will be soon forthcoming).
+.LP
+This work was primarily done when I was a member of the Real Time System Group
+at the Lawrence Berkeley Laboratory in Berkeley, CA. Many thanks to all there
+for the support I received.
+.LP
+Send comments to:
+.nf
+
+ Vern Paxson
+ Computer Science Department
+ 4126 Upson Hall
+ Cornell University
+ Ithaca, NY 14853-7501
+
+ vern@cs.cornell.edu
+ decvax!cornell!vern
+ vern@LBL (bitnet)
+
+.fi