@@ -13,16 +13,10 @@ Note: if there is any discrepancy, please refer to `The flex Manual`.
...
**************************
When the generated scanner is run, it analyzes its input looking for strings which match any of its patterns. If it finds more than one match, it takes the one matching the most text (for trailing context rules, this includes the length of the trailing part, even though it will then be returned to the input). If it finds two or more matches of the same length, the rule listed first in the `flex` input file is chosen.
Once the match is determined, the text corresponding to the match (called the "token") is made available in the global character pointer `yytext`, and its length in the global integer `yyleng`. The "action" corresponding to the matched pattern is then executed, and then the remaining input is scanned for another match.
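The selection rule described above (longest match wins, ties go to the rule listed first) can be sketched in plain C. This is only an illustration of the rule, not how `flex` implements it: the `struct rule` type, the restriction to literal patterns, and the names `match_length` and `select_rule` are all assumptions made for this example.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical rule: a literal pattern. Array order mirrors rule order in the flex input. */
struct rule {
    const char *literal;
};

/* Length of the match of rule r at the start of input, or 0 if it does not match. */
size_t match_length(const struct rule *r, const char *input)
{
    size_t n = strlen(r->literal);
    return strncmp(input, r->literal, n) == 0 ? n : 0;
}

/*
 * Return the index of the winning rule, or -1 if no rule matches.
 * The strict '>' comparison keeps the earliest rule when lengths tie.
 */
int select_rule(const struct rule *rules, int count,
                const char *input, size_t *length)
{
    int best = -1;
    size_t best_len = 0;
    for (int i = 0; i < count; i++) {
        size_t len = match_length(&rules[i], input);
        if (len > best_len) {
            best = i;
            best_len = len;
        }
    }
    *length = best_len;
    return best;
}
```

With the rules `if`, `iffy`, `if` (in that order), the input `iffy x` selects the second rule because it matches the most text, while `if x` selects the first rule because the two equal-length candidates are resolved by position.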
If no match is found, then the "default rule" is executed: the next character in the input is considered matched and copied to the standard output. Thus, the simplest valid `flex` input is:
```c
...
@@ -31,12 +25,8 @@ If no match is found, then the "default rule" is executed: the next character in
which generates a scanner that simply copies its input (one character at a time) to its output.
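For comparison, the observable behavior of such a scanner is roughly that of the following C function. It is a sketch of the effect of the default rule only; the function name `copy_unmatched` is an assumption, and a real `flex` scanner reads its input through its own buffering rather than `getc()`.

```c
#include <stdio.h>

/* Copy `in` to `out` one character at a time; returns the number of characters copied. */
long copy_unmatched(FILE *in, FILE *out)
{
    long copied = 0;
    int c;
    while ((c = getc(in)) != EOF) {
        putc(c, out);
        copied++;
    }
    return copied;
}
```

Calling `copy_unmatched(stdin, stdout)` would mimic a scanner that has no rules of its own.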
Note that `yytext` can be defined in two different ways: either as a character _pointer_ or as a character _array_. You can control which definition `flex` uses by including one of the special directives `%pointer` or `%array` in the first (definitions) section of your flex input. The default is `%pointer`, unless you use the `-l` lex compatibility option, in which case `yytext` will be an array. The advantage of using `%pointer` is substantially faster scanning and no buffer overflow when matching very large tokens (unless you run out of dynamic memory). The disadvantage is that you are restricted in how your actions can modify `yytext`, and calls to the `unput()` function destroy the present contents of `yytext`, which can be a considerable porting headache when moving between different `lex` versions.
The advantage of `%array` is that you can then modify `yytext` to your heart's content, and calls to `unput()` do not destroy `yytext`. Furthermore, existing `lex` programs sometimes access `yytext` externally using declarations of the form:
```c
...
@@ -45,10 +35,6 @@ The advantage of `%array` is that you can then modify `yytext` to your heart‘s
This definition is erroneous when used with `%pointer`, but correct for `%array`.
The `%array` declaration defines `yytext` to be an array of `YYLMAX` characters, which defaults to a fairly large value. You can change the size by simply `#define`-ing `YYLMAX` to a different value in the first section of your `flex` input. As mentioned above, with `%pointer`, `yytext` grows dynamically to accommodate large tokens. While this means your `%pointer` scanner can accommodate very large tokens (such as matching entire blocks of comments), bear in mind that each time the scanner must resize `yytext` it also must rescan the entire token from the beginning, so matching such tokens can prove slow. `yytext` presently does _not_ dynamically grow if a call to `unput()` results in too much text being pushed back; instead, a run-time error results.
Also note that you cannot use `%array` with C++ scanner classes.
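Because `unput()` destroys the contents of `yytext` under `%pointer`, an action that still needs the token text afterwards can take a private copy first. The helper below is a minimal sketch of that idea; the name `save_token` is an assumption for this example and is not part of `flex`.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Return a NUL-terminated heap copy of the first `leng` bytes of `text`,
 * or NULL if allocation fails. The caller is responsible for free().
 */
char *save_token(const char *text, size_t leng)
{
    char *copy = malloc(leng + 1);
    if (copy == NULL)
        return NULL;
    memcpy(copy, text, leng);
    copy[leng] = '\0';
    return copy;
}
```

Inside an action this would be used along the lines of `char *saved = save_token(yytext, yyleng);` before any call to `unput()`, with `free(saved)` once the text is no longer needed. The copy works the same way whether `yytext` is a pointer or an array.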
@@ -16,189 +16,181 @@ Note: if there is any discrepancy, please refer to `The flex Manual`.
...
The patterns in the input are written using an extended set of regular expressions. These are:
* `x`
match the character `x`
* `.`
any character (byte) except newline
* `[xyz]`
a "character class"; in this case, the pattern matches either an `x`, a `y`, or a `z`
* `[abj-oZ]`
a "character class" with a range in it; matches an `a`, a `b`, any letter from `j` through `o`, or a `Z`
* `[^A-Z]`
a "negated character class", i.e., any character but those in the class. In this case, any character EXCEPT an uppercase letter.
* `[^A-Z\n]`
any character EXCEPT an uppercase letter or a newline
* `[a-z]{-}[aeiou]`
the lowercase consonants
* `r*`
zero or more `r`s, where `r` is any regular expression
* `r+`
one or more `r`s
* `r?`
zero or one `r` (that is, "an optional `r`")
* `r{2,5}`
anywhere from two to five `r`s
* `r{2,}`
two or more `r`s
* `r{4}`
exactly 4 `r`s
* `"[xyz]\"foo"`
the literal string: `[xyz]"foo`
* `\X`
if `X` is `a`, `b`, `f`, `n`, `r`, `t`, or `v`, then the ANSI-C interpretation of `\X`. Otherwise, a literal `X` (used to escape operators such as `*`)
* `\0`
a NUL character (ASCII code 0)
* `\123`
the character with octal value 123
* `\x2a`
the character with hexadecimal value 2a
* `(r)`
match an `r`; parentheses are used to override precedence (see below)
* `(?r-s:pattern)`
apply option `r` and omit option `s` while interpreting pattern. Options may be zero or more of the characters `i`, `s`, or `x`.
`i` means case-insensitive. `-i` means case-sensitive.
`s` alters the meaning of the `.` syntax to match any single byte whatsoever. `-s` alters the meaning of `.` to match any byte except `\n`.
`x` ignores comments and whitespace in patterns. Whitespace is ignored unless it is backslash-escaped, contained within `""`s, or appears inside a character class.
The following are all valid:
* `(?:foo)` same as `(foo)`
* `(?i:ab7)` same as `([aA][bB]7)`
* `(?-i:ab)` same as `(ab)`
* `(?s:.)` same as `[\x00-\xFF]`
* `(?-s:.)` same as `[^\n]`
* `(?ix-s: a . b)` same as `([Aa][^\n][bB])`
* `(?x:a b)` same as `("ab")`
* `(?x:a\ b)` same as `("a b")`
* `(?x:a" "b)` same as `("a b")`
* `(?x:a[ ]b)` same as `("a b")`
* ```shell
(?x:a
/* comment */
b
c)
```
same as `(abc)`
* `(?# comment )`
omit everything within `()`. The first `)` character encountered ends the pattern. It is not possible for the comment to contain a `)` character. The comment may span lines.
* `rs`
the regular expression `r` followed by the regular expression `s`; called "concatenation"
* `r|s`
either an `r` or an `s`
* `r/s`
an `r` but only if it is followed by an `s`. The text matched by `s` is included when determining whether this rule is the longest match, but is then returned to the input before the action is executed. So the action only sees the text matched by `r`. This type of pattern is called "trailing context". (There are some combinations of `r/s` that flex cannot match correctly.)
* `^r`
an `r`, but only at the beginning of a line (i.e., when just starting to scan, or right after a newline has been scanned).
* `r$`
an `r`, but only at the end of a line (i.e., just before a newline). Equivalent to `r/\n`.
Note that `flex`'s notion of "newline" is exactly whatever the C compiler used to compile `flex` interprets `\n` as; in particular, on some DOS systems you must either filter out `\r`s in the input yourself, or explicitly use `r/\r\n` for `r$`.
* `<s>r`
an `r`, but only in start condition `s`.
* `<s1,s2,s3>r`
same, but in any of start conditions `s1`, `s2`, or `s3`.
* `<*>r`
an `r` in any start condition, even an exclusive one.
* `<<EOF>>`
an end-of-file.
* `<s1,s2><<EOF>>`
an end-of-file when in start condition `s1` or `s2`
Note that inside of a character class, all regular expression operators lose their special meaning except escape (`\`) and the character class operators, `-`, `]]`, and, at the beginning of the class, `^`.
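The trailing-context entry `r/s` in the list above distinguishes two lengths: the one used when competing with other rules (text matched by `r` plus text matched by `s`) and the one the action sees (text matched by `r` only). The C sketch below models that distinction for the simplest case where both `r` and `s` are literal strings. The `struct tc_match` type and the function `match_trailing` are assumptions for illustration, not part of `flex`.

```c
#include <stddef.h>
#include <string.h>

struct tc_match {
    size_t token_len;   /* text the action sees (matched by r) */
    size_t compare_len; /* length used for the longest-match comparison (r plus s) */
};

/* Match literal r followed by literal s at the start of input. Returns 1 on a match, else 0. */
int match_trailing(const char *input, const char *r, const char *s,
                   struct tc_match *m)
{
    size_t rl = strlen(r);
    size_t sl = strlen(s);
    /* The second comparison only runs once the first rl bytes are known to exist. */
    if (strncmp(input, r, rl) != 0 || strncmp(input + rl, s, sl) != 0)
        return 0;
    m->token_len = rl;
    m->compare_len = rl + sl;
    return 1;
}
```

For the pattern `foo/bar` on the input `foobar baz`, this reports a comparison length of 6 but a token length of 3, which corresponds to `bar` being returned to the input before the action runs.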
The regular expressions listed above are grouped according to precedence, from highest precedence at the top to lowest at the bottom. Those grouped together have equal precedence (see special note on the precedence of the repeat operator, `{}`, under the documentation for the `--posix` POSIX compliance option). For example,
`foo|bar*` is the same as `(foo)|(ba(r*))`
since the `*` operator has higher precedence than concatenation, and concatenation higher than alternation (`|`). This pattern therefore matches _either_ the string `foo` _or_ the string `ba` followed by zero-or-more `r`'s. To match `foo` or zero-or-more repetitions of the string `bar`, use:
`foo|(bar)*`
And to match a sequence of zero or more repetitions of `foo` and `bar`:
`(foo|bar)*`
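The three patterns above can be checked outside of `flex` with the POSIX `regcomp()`/`regexec()` functions, since POSIX extended regular expressions give `*`, concatenation, and `|` the same relative precedence for these simple cases. This is only a sketch: `flex` patterns are not identical to POSIX EREs in general, and the helper name `full_match` is an assumption for this example.

```c
#include <regex.h>
#include <stdio.h>

/*
 * Return 1 if the whole of `text` matches `pattern` (a POSIX ERE),
 * 0 if it does not, and -1 if the pattern is too long or fails to compile.
 */
int full_match(const char *pattern, const char *text)
{
    char anchored[256];
    regex_t re;
    /* Wrap in ^( ... )$ so the alternation cannot escape the anchors. */
    int n = snprintf(anchored, sizeof anchored, "^(%s)$", pattern);
    if (n < 0 || (size_t)n >= sizeof anchored)
        return -1;
    if (regcomp(&re, anchored, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    int rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

Under these assumptions, `full_match("foo|bar*", "barrr")` succeeds while `full_match("foo|bar*", "barbar")` does not; `full_match("foo|(bar)*", "barbar")` succeeds, and only `(foo|bar)*` accepts a mixed sequence such as `foobarfoo`.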
In addition to characters and ranges of characters, character classes can also contain "character class expressions". These are expressions enclosed inside `[:` and `:]` delimiters (which themselves must appear between the `[` and `]` of the character class; other elements may occur inside the character class, too). The valid expressions are:
...
@@ -219,91 +211,61 @@ For example, the following character classes are all equivalent:
* `[[:alpha:][0-9]]`
* `[a-zA-Z0-9]`
A word of caution. Character classes are expanded immediately when seen in the `flex` input. This means the character classes are sensitive to the locale in which `flex` is executed, and the resulting scanner will not be sensitive to the runtime locale. This may or may not be desirable.
* If your scanner is case-insensitive (the `-i` flag), then `[:upper:]` and `[:lower:]` are equivalent to `[:alpha:]`.
* Character classes with ranges, such as `[a-Z]`, should be used with caution in a case-insensitive scanner if the range spans upper or lowercase characters. Flex does not know if you want to fold all upper and lowercase characters together, or if you want the literal numeric range specified (with no case folding). When in doubt, flex will assume that you meant the literal numeric range, and will issue a warning. The exception to this rule is a character range such as `[a-z]` or `[S-W]` where it is obvious that you want case-folding to occur. Here are some examples with the `-i` flag enabled:
| Range | Result | Literal Range | Alternate Range |
* A negated character class such as the example `[^A-Z]` above _will_ match a newline unless `\n` (or an equivalent escape sequence) is one of the characters explicitly present in the negated character class (e.g., `[^A-Z\n]`). This is unlike how many other regular expression tools treat negated character classes, but unfortunately the inconsistency is historically entrenched. Matching newlines means that a pattern like `[^"]*` can match the entire input unless there's another quote in the input.
Flex allows negation of character class expressions by prepending `^` to the POSIX character class name.
`[:^alnum:]` `[:^alpha:]` `[:^blank:]`
`[:^cntrl:]` `[:^digit:]` `[:^graph:]`
`[:^lower:]` `[:^print:]` `[:^punct:]`
`[:^space:]` `[:^upper:]` `[:^xdigit:]`
Flex will issue a warning if the expressions `[:^upper:]` and `[:^lower:]` appear in a case-insensitive scanner, since their meaning is unclear. The current behavior is to skip them entirely, but this may change without notice in future revisions of flex.
* The `{-}` operator computes the difference of two character classes. For example, `[a-c]{-}[b-z]` represents all the characters in the class `[a-c]` that are not in the class `[b-z]` (which in this case, is just the single character `a`). The `{-}` operator is left associative, so `[abc]{-}[b]{-}[c]` is the same as `[a]`. Be careful not to accidentally create an empty set, which will never match.
* The `{+}` operator computes the union of two character classes. For example, `[a-z]{+}[0-9]` is the same as `[a-z0-9]`. This operator is useful when preceded by the result of a difference operation, as in, `[[:alpha:]]{-}[[:lower:]]{+}[q]`, which is equivalent to `[A-Zq]` in the "C" locale.
* A rule can have at most one instance of trailing context (the `/` operator or the `$` operator). The start condition, `^`, and `<<EOF>>` patterns can only occur at the beginning of a pattern, and, as well as with `/` and `$`, cannot be grouped inside parentheses. A `^` which does not occur at the beginning of a rule or a `$` which does not occur at the end of a rule loses its special properties and is treated as a normal character.
* The following are invalid:
`foo/bar$`
`<sc1>foo<sc2>bar`
Note that the first of these can be written `foo/bar\n`.
* The following will result in `$` or `^` being treated as a normal character:
`foo|(bar$)`
`foo|^bar`
If the desired meaning is a `foo` or a `bar`-followed-by-a-newline, the following could be used (the special `|` action is explained below):
```shell
foo |
bar$ /* action goes here */
```
A similar trick will work for matching a `foo` or a `bar`-at-the-beginning-of-a-line.
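Returning to the `{-}` and `{+}` character class operators described earlier in this section, their set semantics can be modelled in C with a 256-entry membership table. This is a sketch of the semantics only; the `struct char_class` type and the functions `class_of`, `class_diff`, `class_union`, and `class_is_empty` are assumptions for this example and say nothing about how `flex` represents classes internally.

```c
#include <stdbool.h>
#include <string.h>

#define CLASS_SIZE 256

/* A character class as a membership table indexed by byte value. */
struct char_class {
    bool member[CLASS_SIZE];
};

/* Build a class containing exactly the characters listed in `chars`. */
struct char_class class_of(const char *chars)
{
    struct char_class c;
    memset(&c, 0, sizeof c);
    for (; *chars != '\0'; chars++)
        c.member[(unsigned char)*chars] = true;
    return c;
}

/* a{-}b: the characters of a that are not in b. */
struct char_class class_diff(struct char_class a, struct char_class b)
{
    for (int i = 0; i < CLASS_SIZE; i++)
        a.member[i] = a.member[i] && !b.member[i];
    return a;
}

/* a{+}b: the characters that are in a or in b. */
struct char_class class_union(struct char_class a, struct char_class b)
{
    for (int i = 0; i < CLASS_SIZE; i++)
        a.member[i] = a.member[i] || b.member[i];
    return a;
}

/* True if the class has no members, i.e. it could never match. */
bool class_is_empty(struct char_class c)
{
    for (int i = 0; i < CLASS_SIZE; i++)
        if (c.member[i])
            return false;
    return true;
}
```

Left associativity corresponds to nesting the calls from the left: `[abc]{-}[b]{-}[c]` is `class_diff(class_diff(class_of("abc"), class_of("b")), class_of("c"))`, which leaves only `a`. A check with `class_is_empty` shows when a difference has removed every member.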