This is flex.info, produced by makeinfo version 4.5 from flex.texi.

INFO-DIR-SECTION Programming
START-INFO-DIR-ENTRY
* flex: (flex).      Fast lexical analyzer generator (lex replacement).
END-INFO-DIR-ENTRY


   The flex manual is placed under the same licensing conditions as the
rest of flex:

   Copyright (C) 1990, 1997 The Regents of the University of California.
All rights reserved.

   This code is derived from software contributed to Berkeley by Vern
Paxson.

   The United States Government has rights in this work pursuant to
contract no. DE-AC03-76SF00098 between the United States Department of
Energy and the University of California.

   Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are
met:

  1.  Redistributions of source code must retain the above copyright
     notice, this list of conditions and the following disclaimer.

  2. Redistributions in binary form must reproduce the above copyright
     notice, this list of conditions and the following disclaimer in the
     documentation and/or other materials provided with the
     distribution.
   Neither the name of the University nor the names of its contributors
may be used to endorse or promote products derived from this software
without specific prior written permission.

   THIS SOFTWARE IS PROVIDED "AS IS" AND WITHOUT ANY EXPRESS OR IMPLIED
WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED WARRANTIES OF
MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE.

File: flex.info,  Node: Top,  Next: Copyright,  Prev: (dir),  Up: (dir)

flex
****

   This manual describes `flex', a tool for generating programs that
perform pattern-matching on text.  The manual includes both tutorial and
reference sections.

   This edition of `The flex Manual' documents `flex' version 2.5.33.
It was last updated on 20 February 2006.

* Menu:

* Copyright::
* Reporting Bugs::
* Introduction::
* Simple Examples::
* Format::
* Patterns::
* Matching::
* Actions::
* Generated Scanner::
* Start Conditions::
* Multiple Input Buffers::
* EOF::
* Misc Macros::
* User Values::
* Yacc::
* Scanner Options::
* Performance::
* Cxx::
* Reentrant::
* Lex and Posix::
* Memory Management::
* Serialized Tables::
* Diagnostics::
* Limitations::
* Bibliography::
* FAQ::
* Appendices::
* Indices::

 --- The Detailed Node Listing ---

Format of the Input File

* Definitions Section::
* Rules Section::
* User Code Section::
* Comments in the Input::

Scanner Options

* Options for Specifing Filenames::
* Options Affecting Scanner Behavior::
* Code-Level And API Options::
* Options for Scanner Speed and Size::
* Debugging Options::
* Miscellaneous Options::

Reentrant C Scanners

* Reentrant Uses::
* Reentrant Overview::
* Reentrant Example::
* Reentrant Detail::
* Reentrant Functions::

The Reentrant API in Detail

* Specify Reentrant::
* Extra Reentrant Argument::
* Global Replacement::
* Init and Destroy Functions::
* Accessor Methods::
* Extra Data::
* About yyscan_t::

Memory Management

* The Default Memory Management::
* Overriding The Default Memory Management::
* A Note About yytext And Memory::

Serialized Tables

* Creating Serialized Tables::
* Loading and Unloading Serialized Tables::
* Tables File Format::

FAQ

* When was flex born?::
* How do I expand \ escape sequences in C-style quoted strings?::
* Why do flex scanners call fileno if it is not ANSI compatible?::
* Does flex support recursive pattern definitions?::
* How do I skip huge chunks of input (tens of megabytes) while using flex?::
* Flex is not matching my patterns in the same order that I defined them.::
* My actions are executing out of order or sometimes not at all.::
* How can I have multiple input sources feed into the same scanner at the same time?::
* Can I build nested parsers that work with the same input file?::
* How can I match text only at the end of a file?::
* How can I make REJECT cascade across start condition boundaries?::
* Why cant I use fast or full tables with interactive mode?::
* How much faster is -F or -f than -C?::
* If I have a simple grammar cant I just parse it with flex?::
* Why doesnt yyrestart() set the start state back to INITIAL?::
* How can I match C-style comments?::
* The period isnt working the way I expected.::
* Can I get the flex manual in another format?::
* Does there exist a "faster" NDFA->DFA algorithm?::
* How does flex compile the DFA so quickly?::
* How can I use more than 8192 rules?::
* How do I abandon a file in the middle of a scan and switch to a new file?::
* How do I execute code only during initialization (only before the first scan)?::
* How do I execute code at termination?::
* Where else can I find help?::
* Can I include comments in the "rules" section of the file?::
* I get an error about undefined yywrap().::
* How can I change the matching pattern at run time?::
* How can I expand macros in the input?::
* How can I build a two-pass scanner?::
* How do I match any string not matched in the preceding rules?::
* I am trying to port code from AT&T lex that uses yysptr and yysbuf.::
* Is there a way to make flex treat NULL like a regular character?::
* Whenever flex can not match the input it says "flex scanner jammed".::
* Why doesnt flex have non-greedy operators like perl does?::
* Memory leak - 16386 bytes allocated by malloc.::
* How do I track the byte offset for lseek()?::
* How do I use my own I/O classes in a C++ scanner?::
* How do I skip as many chars as possible?::
* deleteme00::
* Are certain equivalent patterns faster than others?::
* Is backing up a big deal?::
* Can I fake multi-byte character support?::
* deleteme01::
* Can you discuss some flex internals?::
* unput() messes up yy_at_bol::
* The | operator is not doing what I want::
* Why can't flex understand this variable trailing context pattern?::
* The ^ operator isn't working::
* Trailing context is getting confused with trailing optional patterns::
* Is flex GNU or not?::
* ERASEME53::
* I need to scan if-then-else blocks and while loops::
* ERASEME55::
* ERASEME56::
* ERASEME57::
* Is there a repository for flex scanners?::
* How can I conditionally compile or preprocess my flex input file?::
* Where can I find grammars for lex and yacc?::
* I get an end-of-buffer message for each character scanned.::
* unnamed-faq-62::
* unnamed-faq-63::
* unnamed-faq-64::
* unnamed-faq-65::
* unnamed-faq-66::
* unnamed-faq-67::
* unnamed-faq-68::
* unnamed-faq-69::
* unnamed-faq-70::
* unnamed-faq-71::
* unnamed-faq-72::
* unnamed-faq-73::
* unnamed-faq-74::
* unnamed-faq-75::
* unnamed-faq-76::
* unnamed-faq-77::
* unnamed-faq-78::
* unnamed-faq-79::
* unnamed-faq-80::
* unnamed-faq-81::
* unnamed-faq-82::
* unnamed-faq-83::
* unnamed-faq-84::
* unnamed-faq-85::
* unnamed-faq-86::
* unnamed-faq-87::
* unnamed-faq-88::
* unnamed-faq-90::
* unnamed-faq-91::
* unnamed-faq-92::
* unnamed-faq-93::
* unnamed-faq-94::
* unnamed-faq-95::
* unnamed-faq-96::
* unnamed-faq-97::
* unnamed-faq-98::
* unnamed-faq-99::
* unnamed-faq-100::
* unnamed-faq-101::
* What is the difference between YYLEX_PARAM and YY_DECL?::
* Why do I get "conflicting types for yylex" error?::
* How do I access the values set in a Flex action from within a Bison action?::

Appendices

* Makefiles and Flex::
* Bison Bridge::
* M4 Dependency::

Indices

* Concept Index::
* Index of Functions and Macros::
* Index of Variables::
* Index of Data Types::
* Index of Hooks::
* Index of Scanner Options::


File: flex.info,  Node: Copyright,  Next: Reporting Bugs,  Prev: Top,  Up: Top

Copyright
*********


   The flex manual is placed under the same licensing conditions as the
rest of flex:

   Copyright (C) 1990, 1997 The Regents of the University of California.
All rights reserved.

   This code is derived from software contributed to Berkeley by Vern
Paxson.

   The United States Government has rights in this work pursuant to
contract no. DE-AC03-76SF00098 between the United States Department of
Energy and the University of California.

   Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are
met:

  1.  Redistributions of source code must retain the above copyright
     notice, this list of conditions and the following disclaimer.

  2. Redistributions in binary form must reproduce the above copyright
     notice, this list of conditions and the following disclaimer in the
     documentation and/or other materials provided with the
     distribution.
   Neither the name of the University nor the names of its contributors
may be used to endorse or promote products derived from this software
without specific prior written permission.

   THIS SOFTWARE IS PROVIDED "AS IS" AND WITHOUT ANY EXPRESS OR IMPLIED
WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED WARRANTIES OF
MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE.

File: flex.info,  Node: Reporting Bugs,  Next: Introduction,  Prev: Copyright,  Up: Top

Reporting Bugs
**************

   If you have problems with `flex' or think you have found a bug,
please send mail detailing your problem to
<flex-help@lists.sourceforge.net>. Patches are always welcome.


File: flex.info,  Node: Introduction,  Next: Simple Examples,  Prev: Reporting Bugs,  Up: Top

Introduction
************

   `flex' is a tool for generating "scanners".  A scanner is a program
which recognizes lexical patterns in text.  The `flex' program reads
the given input files, or its standard input if no file names are
given, for a description of a scanner to generate.  The description is
in the form of pairs of regular expressions and C code, called "rules".
`flex' generates as output a C source file, `lex.yy.c' by default,
which defines a routine `yylex()'.  This file can be compiled and
linked with the flex runtime library to produce an executable.  When
the executable is run, it analyzes its input for occurrences of the
regular expressions.  Whenever it finds one, it executes the
corresponding C code.


File: flex.info,  Node: Simple Examples,  Next: Format,  Prev: Introduction,  Up: Top

Some Simple Examples
********************

   First some simple examples to get the flavor of how one uses `flex'.

   The following `flex' input specifies a scanner which, when it
encounters the string `username', will replace it with the user's login
name:


         %%
         username    printf( "%s", getlogin() );

   By default, any text not matched by a `flex' scanner is copied to
the output, so the net effect of this scanner is to copy its input file
to its output with each occurrence of `username' expanded.  In this
input, there is just one rule.  `username' is the "pattern" and the
`printf' is the "action".  The `%%' symbol marks the beginning of the
rules.
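
   As a rough illustration of what this scanner does, the following
hand-written C function copies its input and expands `username' in the
same way.  It is only a sketch and not the code `flex' generates: the
name `expand_username' and its parameters are made up for this example,
and the login name is passed in rather than obtained from `getlogin()'.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the one-rule scanner: copy `in` to `out`,
 * replacing each occurrence of "username" with `login`.  Output is
 * truncated if `out` is too small.  Returns the number of bytes
 * written, not counting the terminating NUL. */
size_t expand_username(const char *in, const char *login,
                       char *out, size_t cap)
{
    static const char pattern[] = "username";
    const size_t plen = sizeof pattern - 1;
    const size_t llen = strlen(login);
    size_t n = 0;

    assert(cap > 0);
    while (*in != '\0') {
        if (strncmp(in, pattern, plen) == 0) {
            /* The rule matched: run its "action". */
            if (n + llen >= cap)
                break;
            memcpy(out + n, login, llen);
            n += llen;
            in += plen;
        } else {
            /* No rule matched: the default rule copies one character. */
            if (n + 1 >= cap)
                break;
            out[n++] = *in++;
        }
    }
    out[n] = '\0';
    return n;
}
```

   Calling `expand_username( "hi username", "vern", buf, sizeof buf )'
leaves `hi vern' in `buf'.  The `else' branch plays the role of the
default rule described above.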

   Here's another simple example:


                 int num_lines = 0, num_chars = 0;
     
         %%
         \n      ++num_lines; ++num_chars;
         .       ++num_chars;
     
         %%
         int main( void )
                 {
                 yylex();
                 printf( "# of lines = %d, # of chars = %d\n",
                         num_lines, num_chars );
                 }

   This scanner counts the number of characters and the number of lines
in its input. It produces no output other than the final report on the
character and line counts.  The first line declares two globals,
`num_lines' and `num_chars', which are accessible both inside `yylex()'
and in the `main()' routine declared after the second `%%'.  There are
two rules, one which matches a newline (`\n') and increments both the
line count and the character count, and one which matches any character
other than a newline (indicated by the `.' regular expression).

   A somewhat more complicated example:


         /* scanner for a toy Pascal-like language */
     
         %{
         /* need this for the call to atof() below */
         #include <math.h>
         %}
     
         DIGIT    [0-9]
         ID       [a-z][a-z0-9]*
     
         %%
     
         {DIGIT}+    {
                     printf( "An integer: %s (%d)\n", yytext,
                             atoi( yytext ) );
                     }
     
         {DIGIT}+"."{DIGIT}*        {
                     printf( "A float: %s (%g)\n", yytext,
                             atof( yytext ) );
                     }
     
         if|then|begin|end|procedure|function        {
                     printf( "A keyword: %s\n", yytext );
                     }
     
         {ID}        printf( "An identifier: %s\n", yytext );
     
         "+"|"-"|"*"|"/"   printf( "An operator: %s\n", yytext );
     
         "{"[^{}\n]*"}"     /* eat up one-line comments */
     
         [ \t\n]+          /* eat up whitespace */
     
         .           printf( "Unrecognized character: %s\n", yytext );
     
         %%
     
         int main( int argc, char **argv )
             {
             ++argv, --argc;  /* skip over program name */
             if ( argc > 0 )
                     yyin = fopen( argv[0], "r" );
             else
                     yyin = stdin;
     
             yylex();
             }

   This is the beginning of a simple scanner for a language like
Pascal.  It identifies different types of "tokens" and reports on what
it has seen.

   The details of this example will be explained in the following
sections.


File: flex.info,  Node: Format,  Next: Patterns,  Prev: Simple Examples,  Up: Top

Format of the Input File
************************

   The `flex' input file consists of three sections, separated by a
line containing only `%%'.


         definitions
         %%
         rules
         %%
         user code

* Menu:

* Definitions Section::
* Rules Section::
* User Code Section::
* Comments in the Input::


File: flex.info,  Node: Definitions Section,  Next: Rules Section,  Prev: Format,  Up: Format

Format of the Definitions Section
=================================

   The "definitions section" contains declarations of simple "name"
definitions to simplify the scanner specification, and declarations of
"start conditions", which are explained in a later section.

   Name definitions have the form:


         name definition

   The `name' is a word beginning with a letter or an underscore (`_')
followed by zero or more letters, digits, `_', or `-' (dash).  The
definition is taken to begin at the first non-whitespace character
following the name and continuing to the end of the line.  The
definition can subsequently be referred to using `{name}', which will
expand to `(definition)'.  For example,


         DIGIT    [0-9]
         ID       [a-z][a-z0-9]*

   defines `DIGIT' to be a regular expression which matches a single
digit, and `ID' to be a regular expression which matches a letter
followed by zero-or-more letters-or-digits.  A subsequent reference to


         {DIGIT}+"."{DIGIT}*

   is identical to


         ([0-9])+"."([0-9])*

   and matches one-or-more digits followed by a `.' followed by
zero-or-more digits.
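
   The substitution can be pictured with a small C sketch.  This is not
how `flex' implements name definitions; `struct name_def' and
`expand_names' are hypothetical names used only here, and the sketch
ignores quoting and escapes.  Its purpose is to show that each `{name}'
is replaced by the definition wrapped in parentheses, so that a
following operator such as `+' applies to the whole definition.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical table entry: one "name definition" line. */
struct name_def {
    const char *name;
    const char *definition;
};

/* Expand every {name} in `pattern` whose name is in `defs` to
 * (definition).  Braces that do not name a definition, such as the
 * repeat count in r{2,5}, are copied unchanged.
 * Returns 0 on success, -1 if `out` is too small. */
int expand_names(const char *pattern,
                 const struct name_def *defs, size_t ndefs,
                 char *out, size_t cap)
{
    size_t n = 0;

    assert(cap > 0);
    while (*pattern != '\0') {
        const char *close = (*pattern == '{') ? strchr(pattern, '}') : NULL;
        const struct name_def *hit = NULL;

        if (close != NULL) {
            size_t len = (size_t)(close - pattern - 1);
            for (size_t i = 0; i < ndefs; i++) {
                if (strlen(defs[i].name) == len &&
                    strncmp(defs[i].name, pattern + 1, len) == 0) {
                    hit = &defs[i];
                    break;
                }
            }
        }
        if (hit != NULL) {
            size_t dlen = strlen(hit->definition);
            if (n + dlen + 2 >= cap)
                return -1;
            out[n++] = '(';
            memcpy(out + n, hit->definition, dlen);
            n += dlen;
            out[n++] = ')';
            pattern = close + 1;   /* skip past the closing brace */
        } else {
            if (n + 1 >= cap)
                return -1;
            out[n++] = *pattern++;
        }
    }
    out[n] = '\0';
    return 0;
}
```

   Without the added parentheses, a definition such as `a|b' referenced
as `{name}+' would turn into `a|b+', which means something different
from `(a|b)+'.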

   An unindented comment (i.e., a line beginning with `/*') is copied
verbatim to the output up to the next `*/'.

   Any _indented_ text or text enclosed in `%{' and `%}' is also copied
verbatim to the output (with the %{ and %} symbols removed).  The %{
and %} symbols must appear unindented on lines by themselves.

   A `%top' block is similar to a `%{' ... `%}' block, except that the
code in a `%top' block is relocated to the _top_ of the generated file,
before any flex definitions (1).  The `%top' block is useful when you
want certain preprocessor macros to be defined or certain files to be
included before the generated code.  The single characters `{' and
`}' are used to delimit the `%top' block, as shown in the example below:


         %top{
             /* This code goes at the "top" of the generated file. */
             #include <stdint.h>
             #include <inttypes.h>
         }

   Multiple `%top' blocks are allowed, and their order is preserved.

   ---------- Footnotes ----------

   (1) Actually, `yyIN_HEADER' is defined before the `%top' block.


File: flex.info,  Node: Rules Section,  Next: User Code Section,  Prev: Definitions Section,  Up: Format

Format of the Rules Section
===========================

   The "rules" section of the `flex' input contains a series of rules
of the form:


         pattern   action

   where the pattern must be unindented and the action must begin on
the same line.  *Note Patterns::, for a further description of patterns
and actions.

   In the rules section, any indented or %{ %} enclosed text appearing
before the first rule may be used to declare variables which are local
to the scanning routine and (after the declarations) code which is to be
executed whenever the scanning routine is entered.  Other indented or
%{ %} text in the rule section is still copied to the output, but its
meaning is not well-defined and it may well cause compile-time errors
(this feature is present for POSIX compliance. *Note Lex and Posix::,
for other such features).

   Any _indented_ text or text enclosed in `%{' and `%}' is copied
verbatim to the output (with the %{ and %} symbols removed).  The %{
and %} symbols must appear unindented on lines by themselves.


File: flex.info,  Node: User Code Section,  Next: Comments in the Input,  Prev: Rules Section,  Up: Format

Format of the User Code Section
===============================

   The user code section is simply copied to `lex.yy.c' verbatim.  It
is used for companion routines which call or are called by the scanner.
The presence of this section is optional; if it is missing, the second
`%%' in the input file may be skipped, too.


File: flex.info,  Node: Comments in the Input,  Prev: User Code Section,  Up: Format

Comments in the Input
=====================

   Flex supports C-style comments, that is, anything between /* and */
is considered a comment. Whenever flex encounters a comment, it copies
the entire comment verbatim to the generated source code. Comments may
appear just about anywhere, but with the following exceptions:

   * Comments may not appear in the Rules Section wherever flex is
     expecting a regular expression. This means comments may not appear
     at the beginning of a line, or immediately following a list of
     scanner states.

   * Comments may not appear on an `%option' line in the Definitions
     Section.

   If you want to follow a simple rule, then always begin a comment on a
new line, with one or more whitespace characters before the initial
`/*'.  This rule will work anywhere in the input file.

   All the comments in the following example are valid:


     %{
     /* code block */
     %}
     
     /* Definitions Section */
     %x STATE_X
     
     %%
         /* Rules Section */
     ruleA   /* after regex */ { /* code block */ } /* after code block */
             /* Rules Section (indented) */
     <STATE_X>{
     ruleC   ECHO;
     ruleD   ECHO;
     %{
     /* code block */
     %}
     }
     %%
     /* User Code Section */


File: flex.info,  Node: Patterns,  Next: Matching,  Prev: Format,  Up: Top

Patterns
********

   The patterns in the input (see *Note Rules Section::) are written
using an extended set of regular expressions.  These are:

`x'
     match the character 'x'

`.'
     any character (byte) except newline

`[xyz]'
     a "character class"; in this case, the pattern matches either an
     'x', a 'y', or a 'z'

`[abj-oZ]'
     a "character class" with a range in it; matches an 'a', a 'b', any
     letter from 'j' through 'o', or a 'Z'

`[^A-Z]'
     a "negated character class", i.e., any character but those in the
     class.  In this case, any character EXCEPT an uppercase letter.

`[^A-Z\n]'
     any character EXCEPT an uppercase letter or a newline

`r*'
     zero or more r's, where r is any regular expression

`r+'
     one or more r's

`r?'
     zero or one r's (that is, "an optional r")

`r{2,5}'
     anywhere from two to five r's

`r{2,}'
     two or more r's

`r{4}'
     exactly 4 r's

`{name}'
     the expansion of the `name' definition (*note Format::).

`"[xyz]\"foo"'
     the literal string: `[xyz]"foo'

`\X'
     if X is `a', `b', `f', `n', `r', `t', or `v', then the ANSI-C
     interpretation of `\x'.  Otherwise, a literal `X' (used to escape
     operators such as `*')

`\0'
     a NUL character (ASCII code 0)

`\123'
     the character with octal value 123

`\x2a'
     the character with hexadecimal value 2a

`(r)'
     match an `r'; parentheses are used to override precedence (see
     below)

`rs'
     the regular expression `r' followed by the regular expression `s';
     called "concatenation"

`r|s'
     either an `r' or an `s'

`r/s'
     an `r' but only if it is followed by an `s'.  The text matched by
     `s' is included when determining whether this rule is the longest
     match, but is then returned to the input before the action is
     executed.  So the action only sees the text matched by `r'.  This
     type of pattern is called "trailing context".  (There are some
     combinations of `r/s' that flex cannot match correctly. *Note
     Limitations::, regarding dangerous trailing context.)

`^r'
     an `r', but only at the beginning of a line (i.e., when just
     starting to scan, or right after a newline has been scanned).

`r$'
     an `r', but only at the end of a line (i.e., just before a
     newline).  Equivalent to `r/\n'.

     Note that `flex''s notion of "newline" is exactly whatever the C
     compiler used to compile `flex' interprets `\n' as; in particular,
     on some DOS systems you must either filter out `\r's in the input
     yourself, or explicitly use `r/\r\n' for `r$'.

`<s>r'
     an `r', but only in start condition `s' (see *Note Start
     Conditions:: for discussion of start conditions).

`<s1,s2,s3>r'
     same, but in any of start conditions `s1', `s2', or `s3'.

`<*>r'
     an `r' in any start condition, even an exclusive one.

`<<EOF>>'
     an end-of-file.

`<s1,s2><<EOF>>'
     an end-of-file when in start condition `s1' or `s2'

   Note that inside of a character class, all regular expression
operators lose their special meaning except escape (`\') and the
character class operators, `-', `]', and, at the beginning of the
class, `^'.

   The regular expressions listed above are grouped according to
precedence, from highest precedence at the top to lowest at the bottom.
Those grouped together have equal precedence (see special note on the
precedence of the repeat operator, `{}', under the documentation for
the `--posix' POSIX compliance option).  For example,


         foo|bar*

   is the same as


         (foo)|(ba(r*))

   since the `*' operator has higher precedence than concatenation, and
concatenation higher than alternation (`|').  This pattern therefore
matches _either_ the string `foo' _or_ the string `ba' followed by
zero-or-more `r''s.  To match `foo' or zero-or-more repetitions of the
string `bar', use:


         foo|(bar)*

   And to match a sequence of zero or more repetitions of `foo' and
`bar':


         (foo|bar)*
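
   These three groupings can be checked outside of `flex' with a short
C helper.  The sketch below assumes a POSIX system and uses the POSIX
extended regular expression routines `regcomp()' and `regexec()' as a
stand-in for flex patterns; for these particular patterns the
precedence rules are the same.  The helper name `matches_whole' is made
up for this example.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>
#include <stdio.h>

/* Return 1 if `pattern` matches all of `text`, 0 otherwise.
 * Assumption: POSIX extended regular expressions stand in for flex
 * patterns; the pattern is wrapped as ^(pattern)$ so that only
 * whole-string matches count. */
int matches_whole(const char *pattern, const char *text)
{
    char anchored[256];
    regex_t re;
    int n, rc;

    n = snprintf(anchored, sizeof anchored, "^(%s)$", pattern);
    assert(n > 0 && (size_t)n < sizeof anchored);

    rc = regcomp(&re, anchored, REG_EXTENDED | REG_NOSUB);
    assert(rc == 0);

    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

   With this helper, `matches_whole( "foo|bar*", "barrr" )' is true but
`matches_whole( "foo|bar*", "barbar" )' is false, whereas
`matches_whole( "foo|(bar)*", "barbar" )' is true.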

   In addition to characters and ranges of characters, character classes
can also contain "character class expressions".  These are expressions
enclosed inside `[:' and `:]' delimiters (which themselves must appear
between the `[' and `]' of the character class; other elements may
occur inside the character class, too).  The valid expressions are:


         [:alnum:] [:alpha:] [:blank:]
         [:cntrl:] [:digit:] [:graph:]
         [:lower:] [:print:] [:punct:]
         [:space:] [:upper:] [:xdigit:]

   These expressions all designate a set of characters equivalent to the
corresponding standard C `isXXX' function.  For example, `[:alnum:]'
designates those characters for which `isalnum()' returns true - i.e.,
any alphabetic or numeric character.  Some systems don't provide
`isblank()', so flex defines `[:blank:]' as a blank or a tab.

   For example, the following character classes are all equivalent:


         [[:alnum:]]
         [[:alpha:][:digit:]]
         [[:alpha:][0-9]]
         [a-zA-Z0-9]
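
   The correspondence with the C `isXXX' functions can be sketched
directly in C.  The function names below are invented for this example
and are not part of `flex'.  The sketch relies on the C library's
guarantee that `isalnum()' is true exactly when `isalpha()' or
`isdigit()' is true; the last spelling above, `[a-zA-Z0-9]', matches the
others only in a locale where the letters are the ASCII letters.

```c
#include <assert.h>
#include <ctype.h>
#include <limits.h>

/* Hypothetical membership tests for three character class spellings. */
int class_alnum(int c)       { return isalnum(c) != 0; }           /* [[:alnum:]] */
int class_alpha_digit(int c) { return isalpha(c) || isdigit(c); }  /* [[:alpha:][:digit:]] */
int class_blank(int c)       { return c == ' ' || c == '\t'; }     /* [[:blank:]] as flex defines it */

/* Count the character codes on which the first two spellings disagree. */
int count_disagreements(void)
{
    int diffs = 0;

    for (int c = 0; c <= UCHAR_MAX; c++) {
        assert(c >= 0);   /* the isXXX functions need a non-negative value */
        if (class_alnum(c) != class_alpha_digit(c))
            diffs++;
    }
    return diffs;
}
```

   `count_disagreements()' returns zero, which is what "equivalent"
means for the first two classes in the list.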

   Some notes on patterns are in order.

   * If your scanner is case-insensitive (the `-i' flag), then
     `[:upper:]' and `[:lower:]' are equivalent to `[:alpha:]'.

   * Character classes with ranges, such as `[a-Z]', should be used with
     caution in a case-insensitive scanner if the range spans upper or
     lowercase characters. Flex does not know if you want to fold all
     upper and lowercase characters together, or if you want the
     literal numeric range specified (with no case folding). When in
     doubt, flex will assume that you meant the literal numeric range,
     and will issue a warning. The exception to this rule is a
     character range such as `[a-z]' or `[S-W]' where it is obvious
     that you want case-folding to occur. Here are some examples with
     the `-i' flag enabled:

     Range        Result      Literal Range        Alternate Range
     `[a-t]'      ok          `[a-tA-T]'           
     `[A-T]'      ok          `[a-tA-T]'           
     `[A-t]'      ambiguous   `[A-Z\[\\\]_`a-t]'   `[a-tA-T]'
     `[_-{]'      ambiguous   `[_`a-z{]'           `[_`a-zA-Z{]'
     `[@-C]'      ambiguous   `[@ABC]'             `[@A-Z\[\\\]_`abc]'

   * A negated character class such as the example `[^A-Z]' above
     _will_ match a newline unless `\n' (or an equivalent escape
     sequence) is one of the characters explicitly present in the
     negated character class (e.g., `[^A-Z\n]').  This is unlike how
     many other regular expression tools treat negated character
     classes, but unfortunately the inconsistency is historically
     entrenched.  Matching newlines means that a pattern like `[^"]*'
     can match the entire input unless there's another quote in the
     input.

   * A rule can have at most one instance of trailing context (the `/'
     operator or the `$' operator).  The start condition, `^', and
     `<<EOF>>' patterns can only occur at the beginning of a pattern
     and, like `/' and `$', cannot be grouped inside
     parentheses.  A `^' which does not occur at the beginning of a
     rule or a `$' which does not occur at the end of a rule loses its
     special properties and is treated as a normal character.

   * The following are invalid:


              foo/bar$
              <sc1>foo<sc2>bar

     Note that the first of these can be written `foo/bar\n'.

   * The following will result in `$' or `^' being treated as a normal
     character:


              foo|(bar$)
              foo|^bar

     If the desired meaning is a `foo' or a
     `bar'-followed-by-a-newline, the following could be used (the
     special `|' action is explained below, *note Actions::):


              foo      |
              bar$     /* action goes here */

     A similar trick will work for matching a `foo' or a
     `bar'-at-the-beginning-of-a-line.
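
     A minimal sketch of that similar trick, assuming the same
     placeholder action as in the previous example: it matches a `foo'
     anywhere, or a `bar' at the beginning of a line.

```lex
foo      |
^bar     /* action goes here */
```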


File: flex.info,  Node: Matching,  Next: Actions,  Prev: Patterns,  Up: Top

How the Input Is Matched
************************

   When the generated scanner is run, it analyzes its input looking for
strings which match any of its patterns.  If it finds more than one
match, it takes the one matching the most text (for trailing context
rules, this includes the length of the trailing part, even though it
will then be returned to the input).  If it finds two or more matches of
the same length, the rule listed first in the `flex' input file is
chosen.
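
   The selection rule can be modelled in a few lines of plain C.  This
is only a sketch of the rule as stated above, not the code `flex'
generates; the name `pick_rule' and the convention that a length of 0
means "this rule did not match" are assumptions made for illustration.

```c
#include <stddef.h>

/* Return the index of the winning rule, or -1 if no rule matched.
   match_len[i] is the number of characters rule i matched (0 = no
   match); rules are indexed in the order they appear in the input
   file. */
static int pick_rule( const size_t *match_len, int n_rules )
    {
    int best = -1;
    int i;

    for ( i = 0; i < n_rules; ++i )
        if ( match_len[i] > 0 &&
             ( best < 0 || match_len[i] > match_len[best] ) )
            best = i;    /* strictly longer only, so ties keep the earlier rule */

    return best;
    }
```

   With the two rules `if' and `[a-z]+' (in that order), the input `if'
gives lengths 2 and 2, so the first rule wins; the input `iffy' gives
lengths 2 and 4, so the second rule wins.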

   Once the match is determined, the text corresponding to the match
(called the "token") is made available in the global character pointer
`yytext', and its length in the global integer `yyleng'.  The "action"
corresponding to the matched pattern is then executed (*note
Actions::), and then the remaining input is scanned for another match.

   If no match is found, then the "default rule" is executed: the next
character in the input is considered matched and copied to the standard
output.  Thus, the simplest valid `flex' input is:


         %%

   which generates a scanner that simply copies its input (one
character at a time) to its output.

   Note that `yytext' can be defined in two different ways: either as a
character _pointer_ or as a character _array_. You can control which
definition `flex' uses by including one of the special directives
`%pointer' or `%array' in the first (definitions) section of your flex
input.  The default is `%pointer', unless you use the `-l' lex
compatibility option, in which case `yytext' will be an array.  The
advantage of using `%pointer' is substantially faster scanning and no
buffer overflow when matching very large tokens (unless you run out of
dynamic memory).  The disadvantage is that you are restricted in how
your actions can modify `yytext' (*note Actions::), and calls to the
`unput()' function destroy the present contents of `yytext', which can
be a considerable porting headache when moving between different `lex'
versions.

   The advantage of `%array' is that you can then modify `yytext' to
your heart's content, and calls to `unput()' do not destroy `yytext'
(*note Actions::).  Furthermore, existing `lex' programs sometimes
access `yytext' externally using declarations of the form:


         extern char yytext[];

   This definition is erroneous when used with `%pointer', but correct
for `%array'.

   The `%array' declaration defines `yytext' to be an array of `YYLMAX'
characters, which defaults to a fairly large value.  You can change the
size by simply #define'ing `YYLMAX' to a different value in the first
section of your `flex' input.  As mentioned above, with `%pointer'
`yytext' grows dynamically to accommodate large tokens.  While this means
your `%pointer' scanner can accommodate very large tokens (such as
matching entire blocks of comments), bear in mind that each time the
scanner must resize `yytext' it also must rescan the entire token from
the beginning, so matching such tokens can prove slow.  `yytext'
presently does _not_ dynamically grow if a call to `unput()' results in
too much text being pushed back; instead, a run-time error results.
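
   As a sketch, the first section of a `flex' input that selects
`%array' and enlarges the array might look like the following.  The
value 32768 is purely illustrative, not a recommendation.

```lex
%{
#define YYLMAX 32768    /* illustrative size for the yytext array */
%}
%array
%%
```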

   Also note that you cannot use `%array' with C++ scanner classes
(*note Cxx::).


File: flex.info,  Node: Actions,  Next: Generated Scanner,  Prev: Matching,  Up: Top

Actions
*******

   Each pattern in a rule has a corresponding "action", which can be
any arbitrary C statement.  The pattern ends at the first non-escaped
whitespace character; the remainder of the line is its action.  If the
action is empty, then when the pattern is matched the input token is
simply discarded.  For example, here is the specification for a program
which deletes all occurrences of `zap me' from its input:


         %%
         "zap me"

   This example will copy all other characters in the input to the
output since they will be matched by the default rule.

   Here is a program which compresses multiple blanks and tabs down to a
single blank, and throws away whitespace found at the end of a line:


         %%
         [ \t]+        putchar( ' ' );
         [ \t]+$       /* ignore this token */

   If the action contains a `{', then the action spans till the
balancing `}' is found, and the action may cross multiple lines.
`flex' knows about C strings and comments and won't be fooled by braces
found within them, but also allows actions to begin with `%{' and will
consider the action to be all the text up to the next `%}' (regardless
of ordinary braces inside the action).

   An action consisting solely of a vertical bar (`|') means "same as
the action for the next rule".  See below for an illustration.

   Actions can include arbitrary C code, including `return' statements
to return a value to whatever routine called `yylex()'.  Each time
`yylex()' is called it continues processing tokens from where it last
left off until it either reaches the end of the file or executes a
return.

   Actions are free to modify `yytext' except for lengthening it
(adding characters to its end; these will overwrite later characters in
the input stream).  This however does not apply when using `%array'
(*note Matching::). In that case, `yytext' may be freely modified in
any way.

   Actions are free to modify `yyleng' except they should not do so if
the action also includes use of `yymore()' (see below).

   There are a number of special directives which can be included
within an action:

`ECHO'
     copies yytext to the scanner's output.

`BEGIN'
     followed by the name of a start condition places the scanner in the
     corresponding start condition (see below).

`REJECT'
     directs the scanner to proceed on to the "second best" rule which
     matched the input (or a prefix of the input).  The rule is chosen
     as described above in *Note Matching::, and `yytext' and `yyleng'
     set up appropriately.  It may either be one which matched as much
     text as the originally chosen rule but came later in the `flex'
     input file, or one which matched less text.  For example, the
     following will both count the words in the input and call the
     routine `special()' whenever `frob' is seen:


                      int word_count = 0;
              %%
          
              frob        special(); REJECT;
              [^ \t\n]+   ++word_count;

     Without the `REJECT', any occurrences of `frob' in the input would
     not be counted as words, since the scanner normally executes only
     one action per token.  Multiple uses of `REJECT' are allowed, each
     one finding the next best choice to the currently active rule.  For
     example, when the following scanner scans the token `abcd', it will
     write `abcdabcaba' to the output:


              %%
              a        |
              ab       |
              abc      |
              abcd     ECHO; REJECT;
              .|\n     /* eat up any unmatched character */

     The first three rules share the fourth's action since they use the
     special `|' action.

     `REJECT' is a particularly expensive feature in terms of scanner
     performance; if it is used in _any_ of the scanner's actions it
     will slow down _all_ of the scanner's matching.  Furthermore,
     `REJECT' cannot be used with the `-Cf' or `-CF' options (*note
     Scanner Options::).

     Note also that unlike the other special actions, `REJECT' is a
     _branch_: code immediately following it in the action will _not_
     be executed.

`yymore()'
     tells the scanner that the next time it matches a rule, the
     corresponding token should be _appended_ onto the current value of
     `yytext' rather than replacing it.  For example, given the input
     `mega-kludge' the following will write `mega-mega-kludge' to the
     output:


              %%
              mega-    ECHO; yymore();
              kludge   ECHO;

     First `mega-' is matched and echoed to the output.  Then `kludge'
     is matched, but the previous `mega-' is still hanging around at the
     beginning of `yytext' so the `ECHO' for the `kludge' rule will
     actually write `mega-kludge'.

   Two notes regarding use of `yymore()'.  First, `yymore()' depends on
the value of `yyleng' correctly reflecting the size of the current
token, so you must not modify `yyleng' if you are using `yymore()'.
Second, the presence of `yymore()' in the scanner's action entails a
minor performance penalty in the scanner's matching speed.

   `yyless(n)' returns all but the first `n' characters of the current
token back to the input stream, where they will be rescanned when the
scanner looks for the next match.  `yytext' and `yyleng' are adjusted
appropriately (e.g., `yyleng' will now be equal to `n').  For example,
on the input `foobar' the following will write out `foobarbar':


         %%
         foobar    ECHO; yyless(3);
         [a-z]+    ECHO;

   An argument of 0 to `yyless()' will cause the entire current input
string to be scanned again.  Unless you've changed how the scanner will
subsequently process its input (using `BEGIN', for example), this will
result in an endless loop.

   Note that `yyless()' is a macro and can only be used in the flex
input file, not from other source files.

   `unput(c)' puts the character `c' back onto the input stream.  It
will be the next character scanned.  The following action will take the
current token and cause it to be rescanned enclosed in parentheses.


         {
         int i;
         /* Copy yytext because unput() trashes yytext */
         char *yycopy = strdup( yytext );
         unput( ')' );
         for ( i = yyleng - 1; i >= 0; --i )
             unput( yycopy[i] );
         unput( '(' );
         free( yycopy );
         }

   Note that since each `unput()' puts the given character back at the
_beginning_ of the input stream, pushing back strings must be done
back-to-front.
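
   The ordering can be checked with a small stand-alone model in plain
C.  The names `model_unput', `model_input' and `push_parenthesized',
and the fixed-size pushback array, are hypothetical stand-ins used only
to show the last-in, first-out behaviour; they are not part of `flex'.

```c
#include <string.h>

#define PUSHBACK_MAX 64          /* illustrative capacity */

static char pushback[PUSHBACK_MAX];
static int  pushback_len = 0;

/* Model of unput(): the character pushed last is read first. */
static void model_unput( char c )
    {
    if ( pushback_len < PUSHBACK_MAX )
        pushback[pushback_len++] = c;
    }

/* Model of input(): returns -1 when nothing is left. */
static int model_input( void )
    {
    return pushback_len > 0 ? (unsigned char) pushback[--pushback_len] : -1;
    }

/* Push back "(" token ")" so that it is read back in that order. */
static void push_parenthesized( const char *token )
    {
    size_t i = strlen( token );

    model_unput( ')' );              /* read last, so pushed first */
    while ( i > 0 )
        model_unput( token[--i] );   /* token characters, back to front */
    model_unput( '(' );              /* read first, so pushed last */
    }
```

   Calling `push_parenthesized( "abc" )' and then draining
`model_input()' yields the characters of `(abc)' in reading order.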

   An important potential problem when using `unput()' is that if you
are using `%pointer' (the default), a call to `unput()' _destroys_ the
contents of `yytext', starting with its rightmost character and
devouring one character to the left with each call.  If you need the
value of `yytext' preserved after a call to `unput()' (as in the above
example), you must either first copy it elsewhere, or build your
scanner using `%array' instead (*note Matching::).

   Finally, note that you cannot put back `EOF' to attempt to mark the
input stream with an end-of-file.

   `input()' reads the next character from the input stream.  For
example, the following is one way to eat up C comments:


         %%
         "/*"        {
                     register int c;
     
                     for ( ; ; )
                         {
                         while ( (c = input()) != '*' &&
                                 c != EOF )
                             ;    /* eat up text of comment */
     
                         if ( c == '*' )
                             {
                             while ( (c = input()) == '*' )
                                 ;
                             if ( c == '/' )
                                 break;    /* found the end */
                             }
     
                         if ( c == EOF )
                             {
                             error( "EOF in comment" );
                             break;
                             }
                         }
                     }

   (Note that if the scanner is compiled using `C++', then `input()' is
instead referred to as `yyinput()', in order to avoid a name clash with
the `C++' stream by the name of `input'.)

   `YY_FLUSH_BUFFER()' flushes the scanner's internal buffer so that
the next time the scanner attempts to match a token, it will first
refill the buffer using `YY_INPUT()' (*note Generated Scanner::).  This
action is a special case of the more general `yy_flush_buffer()'
function, described below (*note Multiple Input Buffers::).

   `yyterminate()' can be used in lieu of a return statement in an
action.  It terminates the scanner and returns a 0 to the scanner's
caller, indicating "all done".  By default, `yyterminate()' is also
called when an end-of-file is encountered.  It is a macro and may be
redefined.


File: flex.info,  Node: Generated Scanner,  Next: Start Conditions,  Prev: Actions,  Up: Top

The Generated Scanner
*********************

   The output of `flex' is the file `lex.yy.c', which contains the
scanning routine `yylex()', a number of tables used by it for matching
tokens, and a number of auxiliary routines and macros.  By default,
`yylex()' is declared as follows:


         int yylex()
             {
             ... various definitions and the actions in here ...
             }

   (If your environment supports function prototypes, then it will be
`int yylex( void )'.)  This definition may be changed by defining the
`YY_DECL' macro.  For example, you could use:


         #define YY_DECL float lexscan( a, b ) float a, b;

   to give the scanning routine the name `lexscan', returning a float,
and taking two floats as arguments.  Note that if you give arguments to
the scanning routine using a K&R-style/non-prototyped function
declaration, you must terminate the definition with a semi-colon (;).
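
   Where the compiler supports function prototypes, the same routine
can be declared prototype-style instead.  The sketch below reuses the
`lexscan' name and the two float arguments from the example above; in
this form no trailing semi-colon is needed.

```lex
%{
#define YY_DECL float lexscan( float a, float b )
%}
```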

   `flex' generates `C99' function definitions by default.  However,
`flex' does have the ability to generate obsolete, er, `traditional',
function definitions.  This is to support bootstrapping gcc on old
systems.
Unfortunately, traditional definitions prevent us from using any
standard data types smaller than int (such as short, char, or bool) as
function arguments.  For this reason, future versions of `flex' may
generate standard C99 code only, leaving K&R-style functions to the
historians.  Currently, if you do *not* want `C99' definitions, then
you must use `%option noansi-definitions'.

   Whenever `yylex()' is called, it scans tokens from the global input
file `yyin' (which defaults to stdin).  It continues until it either
reaches an end-of-file (at which point it returns the value 0) or one
of its actions executes a `return' statement.

   If the scanner reaches an end-of-file, the behavior of subsequent
calls is undefined unless either `yyin' is pointed at a new input file
(in which case scanning continues from that file), or `yyrestart()' is
called.
`yyrestart()' takes one argument, a `FILE *' pointer (which can be
NULL, if you've set up `YY_INPUT' to scan from a source other than
`yyin'), and initializes `yyin' for scanning from that file.
Essentially there is no difference between just assigning `yyin' to a
new input file or using `yyrestart()' to do so; the latter is available
for compatibility with previous versions of `flex', and because it can
be used to switch input files in the middle of scanning.  It can also
be used to throw away the current input buffer, by calling it with an
argument of `yyin'; but it would be better to use `YY_FLUSH_BUFFER'
(*note Actions::).  Note that `yyrestart()' does _not_ reset the start
condition to `INITIAL' (*note Start Conditions::).

   If `yylex()' stops scanning due to executing a `return' statement in
one of the actions, the scanner may then be called again and it will
resume scanning where it left off.
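
   The following complete `flex' input is a minimal sketch of a driver
built on these rules.  It calls `yylex()' repeatedly, relying on each
call resuming where the previous one returned, and uses `yyrestart()'
to move on to the next file named on the command line.  The token code
`NUMBER' and the choice to count integers are assumptions made for the
example.

```lex
%option noyywrap
%{
#define NUMBER 1    /* hypothetical token code */
%}
%%
[0-9]+    return NUMBER;
.|\n      /* skip everything else */
%%
int main( int argc, char **argv )
    {
    int i, count = 0;

    for ( i = 1; i < argc; ++i )
        {
        FILE *fp = fopen( argv[i], "r" );

        if ( ! fp )
            continue;             /* unreadable file: skip it */

        yyrestart( fp );          /* point the scanner at this file */
        while ( yylex() == NUMBER )
            ++count;              /* each call resumes after the last token */
        fclose( fp );             /* yylex() returned 0: end of this file */
        }

    printf( "%d numbers\n", count );
    return 0;
    }
```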

   By default (and for purposes of efficiency), the scanner uses
block-reads rather than simple `getc()' calls to read characters from
`yyin'.  The nature of how it gets its input can be controlled by
defining the `YY_INPUT' macro.  The calling sequence for `YY_INPUT()'
is `YY_INPUT(buf,result,max_size)'.  Its action is to place up to
`max_size' characters in the character array `buf' and return in the
integer variable `result' either the number of characters read or the
constant `YY_NULL' (0 on Unix systems) to indicate `EOF'.  The default
`YY_INPUT' reads from the global file-pointer `yyin'.

   Here is a sample definition of `YY_INPUT' (in the definitions
section of the input file):


         %{
         #define YY_INPUT(buf,result,max_size) \
             { \
             int c = getchar(); \
             result = (c == EOF) ? YY_NULL : (buf[0] = c, 1); \
             }
         %}

   This definition will change the input processing to occur one
character at a time.
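
   As a further sketch, `YY_INPUT' can also hand the scanner a block of
characters from memory.  The variables `src_ptr' and `src_left' are
hypothetical names for the unread part of an in-memory source, and the
cast assumes that `result' is an integer variable as described above.
The `#ifndef' guard exists only so the fragment compiles on its own
outside a generated scanner.

```c
#include <string.h>

#ifndef YY_NULL
#define YY_NULL 0
#endif

/* Hypothetical in-memory source; these names are not part of flex. */
static const char *src_ptr = "";
static size_t src_left = 0;

/* Copy up to max_size unread characters into buf and report how many
   were copied, or YY_NULL once the source is exhausted. */
#define YY_INPUT(buf,result,max_size) \
    { \
    size_t n_ = src_left < (size_t) (max_size) ? \
                src_left : (size_t) (max_size); \
    if ( n_ == 0 ) \
        result = YY_NULL; \
    else \
        { \
        memcpy( (buf), src_ptr, n_ ); \
        src_ptr += n_; \
        src_left -= n_; \
        result = (int) n_; \
        } \
    }
```

   In a scanner, the definitions would go inside `%{' and `%}' in the
definitions section, with `src_ptr' and `src_left' set before the first
call to `yylex()'.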

   When the scanner receives an end-of-file indication from `YY_INPUT', it
then checks the `yywrap()' function.  If `yywrap()' returns false
(zero), then it is assumed that the function has gone ahead and set up
`yyin' to point to another input file, and scanning continues.  If it
returns true (non-zero), then the scanner terminates, returning 0 to
its caller.  Note that in either case, the start condition remains
unchanged; it does _not_ revert to `INITIAL'.

   If you do not supply your own version of `yywrap()', then you must
either use `%option noyywrap' (in which case the scanner behaves as
though `yywrap()' returned 1), or you must link with `-lfl' to obtain
the default version of the routine, which always returns 1.
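
   A user-supplied `yywrap()' might look like the sketch below, which
copies each file named on the command line to the output in turn.  The
variable `next_file' is a hypothetical helper, and the example assumes
`yyin' is still a null pointer before the first call to `yylex()'.

```lex
%{
/* Hypothetical list of input files still to be read; set by main(). */
static char **next_file;
%}
%%
.|\n    ECHO;
%%
int yywrap( void )
    {
    if ( yyin && yyin != stdin )
        fclose( yyin );              /* done with the previous file */

    while ( *next_file )
        {
        yyin = fopen( *next_file++, "r" );
        if ( yyin )
            return 0;                /* false: yyin is set up, keep scanning */
        }

    return 1;                        /* true: no more input, terminate */
    }

int main( int argc, char **argv )
    {
    (void) argc;
    next_file = argv + 1;            /* argv is terminated by a null pointer */

    if ( yywrap() )
        return 0;                    /* no readable files at all */

    yylex();
    return 0;
    }
```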

   For scanning from in-memory buffers (e.g., scanning strings), see
*Note Scanning Strings:: and *Note Multiple Input Buffers::.

   The scanner writes its `ECHO' output to the `yyout' global (by
default, `stdout'), which may be redefined by the user simply by
assigning it to some other `FILE' pointer.