Regular expression syntax
Accurate regular expression syntax is vital for detecting different forms of the same attack, for rewriting all of the intended URLs and no others, and for allowing normal traffic to pass (see “Reducing false positives”). When configuring Expression or similar settings, always use the >> (test) button to:
Validate your expression’s syntax.
Look for unintended matches.
Verify intended matches.
Will your expression match? Will it match more than once? Where will it match? Generally, unless the feature is specifically designed to look for all instances, FortiWeb evaluates only a specific location for a match, starting from that location’s beginning. (In English, this is the leftmost, topmost point in the string.) FortiWeb takes only the first match, unless you have defined a number of repetitions.
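The difference between taking the first match and looking for all instances can be sketched in Python. This is only an approximation: Python's re module resembles, but is not identical to, the PCRE syntax that FortiWeb follows, and the variable names are illustrative.

```python
import re

text = "My cat catches things."

# re.search stops at the first (leftmost) match.
first = re.search(r"cat", text)
print(first.group(), first.start())

# re.findall collects every non-overlapping match, left to right,
# so it also finds "cat" at the beginning of "catches".
every = re.findall(r"cat", text)
print(every)
```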
FortiWeb follows most Perl-compatible regular expression (PCRE) syntax. Table 75 shows syntax and popular grammar examples. You can find additional examples with each feature, such as “Example: Sanitizing poisoned HTML”.
 
Inverse string matching is not currently supported.
For example, to match all strings that do not contain hamsters, you cannot use:
!(hamsters)
You can, however, use inverse matching for specific character classes, such as:
[^A]
to match any string that contains at least one character that is not the letter A.
Table 75: Popular FortiWeb regular expression syntax
Notation
Function
Sample Matches
Anything except *.|^$?+\(){}[]
Literal match, except if the character is part of a:
capture group
back-reference (e.g. $0 or \1)
other regular expression token (e.g. \w)
Text: My cat catches things.
Regular expression: cat
Matches: cat
Depending on whether the feature looks for all instances, it may also match “cat” in the beginning of “catches”.
\
Escape character. If it is followed by:
An alphanumeric character, the alphanumeric character is not matched literally as usual. Instead, it is interpreted as a regular expression token. For example, \w matches a word character, as defined by the locale.
Any regular expression special character:
*.|^$?+\(){}[]\
this escapes interpretation as a regular expression token, and instead treats it as a literal character. For example, \\ matches:
\
Text: /url?parameter=value
Regular expression: \?param
Matches: ?param
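A short Python sketch shows why the escape matters. It assumes Python's re module, which treats ? and \ in the same way for this case but is not FortiWeb's own engine.

```python
import re

text = "/url?parameter=value"

# Escaped: \? is a literal question mark.
escaped = re.search(r"url\?param", text)

# Unescaped: ? makes the preceding "l" optional, so the literal
# question mark in the text is never matched and the search fails.
unescaped = re.search(r"url?param", text)

# re.escape adds the backslashes for you when the input is plain text.
auto_escaped = re.escape("url?param")

print(escaped.group(), unescaped, auto_escaped)
```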
(?i)
Turns on case-insensitive matching for subsequent evaluation, until it is turned off or the evaluation completes.
Text: /url?Parameter=value
Regular expression: (?i)param
Matches: Param
Would also match pArAM etc.
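As an illustration only, the same inline flag behaves as follows in Python's re module, which accepts (?i) at the start of an expression:

```python
import re

text = "/url?Parameter=value"

# Without the flag, the capital "P" prevents a match.
case_sensitive = re.search(r"param", text)

# With (?i), letter case is ignored for the whole expression.
case_insensitive = re.search(r"(?i)param", text)

print(case_sensitive, case_insensitive.group())
```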
\n
Matches a new line (also called a line feed).
Microsoft Windows platforms typically use \r\n at the end of each line. Linux, Unix, and Mac OS X platforms typically use \n. Classic Mac OS (version 9 and earlier) used \r.
Text: My cat catches things.
Regular expression: \n
Matches: The whole line ending on Linux, Mac OS X, and other Unix-like platforms, only part of the line ending on Windows, and nothing on classic Mac OS.
\r
Matches a carriage return.
Text: My cat catches things.
Regular expression: \r
Matches: Part of the line ending on Windows, nothing on Linux, Mac OS X, and other Unix-like platforms, and the whole line ending on classic Mac OS.
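The following Python sketch counts how often \n and \r match for each line-ending convention. The labels CRLF, LF, and CR are illustrative names for the three byte sequences, not FortiWeb settings.

```python
import re

# One sample per line-ending convention (the labels are illustrative only).
samples = {
    "CRLF": "My cat catches things.\r\n",
    "LF": "My cat catches things.\n",
    "CR": "My cat catches things.\r",
}

# For every sample, count the matches of \n and of \r.
counts = {
    name: (len(re.findall(r"\n", text)), len(re.findall(r"\r", text)))
    for name, text in samples.items()
}

for name, (newlines, returns) in counts.items():
    print(f"{name}: \\n matches {newlines}, \\r matches {returns}")
```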
\s
Matches a space, non-breaking space, tab, line ending, or other white space character.
Tip: Many languages do not separate words with white space. Even in languages that usually use a white space separator, words can be separated with many other characters such as:
\/-”’"“‘.,><—:;
and new lines.
In these cases, you should usually include those in addition to \s in a match set ( [] ) or may need to use \b (word boundary) instead.
Text: <a href=‘http://www.example.com’>
Regular expression: www\.example\.com\s
Matches: Nothing.
Because the final ’ is not a white space character, this does not match. The position before it is a word boundary, however, so the regular expression should be:
www\.example\.com\b
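The contrast between \s and \b can be tried out in Python. This is a sketch under the assumption that Python's re module treats the closing quotation mark as a non-word character, as PCRE does.

```python
import re

text = "<a href=‘http://www.example.com’>"

# \s requires an actual white space character after "com"; there is none.
with_space = re.search(r"www\.example\.com\s", text)

# \b only requires a boundary between a word character ("m")
# and a non-word character (the closing quotation mark).
with_boundary = re.search(r"www\.example\.com\b", text)

print(with_space, with_boundary.group())
```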
\S
Matches a character that is not white space, such as A or 9.
Text: My cat catches things.
Regular expression: \S
Matches: Mycatcatchesthings.
\d
Matches a decimal digit such as 9.
Text: /url?parameterA=value1
Regular expression: \d
Matches: 1
\D
Matches a character that is not a digit, such as A or b or É.
 
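\w
Matches a single word character.
Word characters are any character from this set:
[a-zA-Z0-9_]
A whole word is an uninterrupted run of one or more of these characters between two word boundaries (space, new line, :, etc.).
It does not match equivalent Unicode characters, such as ٣ (the Arabic-Indic digit three).
Text: Yahoo!
Regular expression: \w
Matches: Y. If the feature looks for all instances, it also matches a, h, o, and o individually. To match the whole word Yahoo in a single match, use \w+.
Does not match the terminal exclamation point, which is not a word character.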
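In Python's re module, which is used here only as an approximation of PCRE, \w consumes one word character at a time, and \w+ is needed to take a whole word. The re.ASCII flag is an assumption made to restrict \w to the [a-zA-Z0-9_] set described above, because Python otherwise also accepts Unicode letters and digits.

```python
import re

text = "Yahoo!"

single = re.search(r"\w", text, flags=re.ASCII)       # one word character
whole = re.search(r"\w+", text, flags=re.ASCII)       # the whole run
each = re.findall(r"\w", text, flags=re.ASCII)        # every word character

# With the ASCII set, the Arabic-Indic digit three is not a word character.
arabic_digit = re.findall(r"\w", "٣", flags=re.ASCII)

print(single.group(), whole.group(), each, arabic_digit)
```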
\W
Matches any character that is not a word character.
Text: Sell?!?~
Regular expression: \W
Matches: ?!?~
.
Matches any single character except \r or \n.
Note: If the character is written by combining two Unicode code points, such as à where the core letter is encoded separately from the accent mark, this will not match the entire character: it will only match one of the code points.
Text: My cat catches things.
Regular expression: c.t
Matches: cat cat
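The note about combined code points can be checked with a small Python sketch. The escape sequences \u00e0 (precomposed à) and a followed by \u0300 (combining grave accent) are the two encodings of the same visible character.

```python
import re

cats = re.findall(r"c.t", "My cat catches things.")

composed = "\u00e0"      # à as one code point
decomposed = "a\u0300"   # à as "a" plus a combining accent

# A single dot covers the one-code-point form completely...
composed_ok = re.fullmatch(r".", composed) is not None
# ...but only the first code point of the two-code-point form.
decomposed_ok = re.fullmatch(r".", decomposed) is not None

# By default the dot does not match a line feed.
newline_match = re.search(r".", "\n")

print(cats, composed_ok, decomposed_ok, newline_match)
```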
+
Repeatedly matches the previous character or capture group, 1 or more times, as many times as possible (also called “greedy” matching) unless followed by a question mark ( ? ), which makes it match as few times as possible (also called “lazy” matching).
Does not match if there is not at least 1 instance.
Text: www.example.com
Regular expression: w+
Matches: www
Would also match “w”, “ww”, “wwww”, or any number of uninterrupted repetitions of the character “w”.
*
Repeatedly matches the previous character or capture group, 0 or more times. Depending on its combination with other special characters, this token could be either:
* — Match as many times as possible (also called “greedy” matching).
*? — Match as few times as possible (also called “lazy” matching).
Text: www.example.com
Regular expression: .*
Matches: www.example.com
All of any text, except line endings (\r and \n).
Text: www.example.com
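Regular expression: (w)*?
Matches: An empty string at the start of the text when the expression is used alone, because lazy matching takes as few repetitions as possible. When more of the expression follows, such as (w)*?\.example, the match extends only as far as needed: www.example
It would also match common typos where the “w” was repeated too few or too many times, such as “w” in w.example.com or “wwww” in wwww.example.com. It would still match, however, if there were no “w” at all.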
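Greedy and lazy repetition can be compared side by side in Python. This is a sketch that assumes Python's re module, whose +, *, +? and *? quantifiers follow the same rules as PCRE for these cases.

```python
import re

text = "www.example.com"

greedy_plus = re.search(r"w+", text).group()         # as many "w" as possible
lazy_plus = re.search(r"w+?", text).group()          # as few as possible, at least 1
greedy_star = re.search(r".*", text).group()         # everything up to a line ending
lazy_star = re.search(r"(w)*?", text).group()        # as few as possible, may be 0
lazy_then_dot = re.search(r"(w)*?\.", text).group()  # expands until "." can match

print(repr(greedy_plus), repr(lazy_plus), repr(greedy_star),
      repr(lazy_star), repr(lazy_then_dot))
```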
? except when followed by =
Makes the preceding character or capture group optional: it matches 0 or 1 times.
Text: www.example.com
Regular expression: (www\.)?example\.com
Matches: www.example.com
Would also match example.com.
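An optional group can be sketched in Python as follows. The second pattern is included only to show what an unescaped dot would additionally accept; the test strings are made up for illustration.

```python
import re

optional_www = re.compile(r"(www\.)?example\.com")

with_prefix = optional_www.search("www.example.com").group()
without_prefix = optional_www.search("example.com").group()

# An unescaped dot matches any character, so it accepts more than intended.
loose = re.search(r"example.com", "exampleXcom")
strict = re.search(r"example\.com", "exampleXcom")

print(with_prefix, without_prefix, loose is not None, strict)
```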
?=
Looks ahead to see if the next characters or capture group match, and evaluates the match based upon them, but does not include those next characters in the returned match string (if any).
This can be useful for back-references where you do not want to include permutations of the final few characters, such as matching “cat” when it is part of “cats” but not when it is part of “catch”.
Text: /url?parameter=valuepack
Regular expression: p(?=arameter)
Matches: p, but only in “parameter”, not in “pack”, where it is not followed by “arameter”.
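A look-ahead can be sketched in Python, whose re module supports the same (?=...) construct. The list below records each match together with its position, to show that only one of the two p characters qualifies.

```python
import re

text = "/url?parameter=valuepack"

# Every plain "p" in the text, for comparison.
all_p = [m.start() for m in re.finditer(r"p", text)]

# Only a "p" that is followed by "arameter"; the look-ahead text
# is checked but not included in the returned match.
lookahead = [(m.group(), m.start()) for m in re.finditer(r"p(?=arameter)", text)]

print(all_p, lookahead)
```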
()
Creates a capture group or sub-pattern for back-reference or to denote order of operations. See also “Example: Inserting & deleting body text” and “What are back-references?”.
Text: /url/app/app/mapp
Regular expression: (/app)*
Matches: /app/app
Text: /url?paramA=valueA&paramB=valueB
Regular expression: (param)A=(value)A&\0B=\1B
Matches: paramA=valueA&paramB=valueB
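The back-reference sample can be approximated in Python with one important difference: Python's re module numbers capture groups from 1, so the first group is \1 and the second is \2, whereas the sample above counts from 0. The pattern below is therefore an adaptation, not the FortiWeb expression itself.

```python
import re

text = "/url?paramA=valueA&paramB=valueB"

# (param) is group 1 and (value) is group 2 in Python's numbering.
pattern = r"(param)A=(value)A&\1B=\2B"
match = re.search(pattern, text)

print(match.group())    # the whole matched text
print(match.groups())   # the captured sub-patterns
```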
|
Matches either the character/capture group before or after the pipe ( | ).
Text: Host: www.example.com
Regular expression: (\r\n)|\n|\r
Matches: The line ending, regardless of platform.
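The alternation above can be exercised in Python against all three line-ending styles. Alternatives are tried from left to right, which is why \r\n is listed first: otherwise the pair would be reported as a lone \r.

```python
import re

line_ending = re.compile(r"(\r\n)|\n|\r")

headers = [
    "Host: www.example.com\r\n",
    "Host: www.example.com\n",
    "Host: www.example.com\r",
]

# Collect the line ending found in each sample.
found = [line_ending.search(header).group() for header in headers]

print([repr(ending) for ending in found])
```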
^
Matches either:
the position of the beginning of the string (or, in multiline mode, the beginning of each line), not the first character itself
the inverse of a character, but only if ^ is the first character in a character class, such as [^A]
This is useful if you want to match a word, but only when it occurs at the start of the line, or when you want to match anything that is not a specific character.
Text: /url?parameter=value
Regular expression: ^/url
Matches: /url, but only if it is at the beginning of the path string. It will not match “/url” in subdirectories.
Text: /url?parameter=value
Regular expression: [^u]
Matches: /rl?parameter=vale
$
Matches the position of the end of the string (or, in multiline mode, the end of each line), not the last character itself.
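Both anchors, and the negating use of ^ inside a character class, can be sketched in Python. The re.MULTILINE flag is Python's name for multiline mode; how a given FortiWeb feature sets that mode is not assumed here.

```python
import re

text = "/url?parameter=value"

start_match = re.search(r"^/url", text)          # "/url" at the very start
nested_match = re.search(r"^/url", "/app/url")   # "/url" is not at the start
end_match = re.search(r"value$", text)           # "value" at the very end
not_u = "".join(re.findall(r"[^u]", text))       # every character except "u"

# In multiline mode, ^ also matches at the start of each following line.
two_lines = "first\nsecond"
default_mode = re.findall(r"^\w+", two_lines)
multiline_mode = re.findall(r"^\w+", two_lines, flags=re.MULTILINE)

print(start_match.group(), nested_match, end_match.group(), not_u)
print(default_mode, multiline_mode)
```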
 
[]
Defines a set of characters, any one of which is an acceptable match.
To define a set via a whole range instead of listing every possible match, separate the first and last character in the range with a hyphen.
Note: Character ranges are matched according to their numerical code point in the encoding. For example, [@-B] matches any code point from hexadecimal 40 to 42 inclusive (U+0040 to U+0042):
@AB
Text: /url?parameter=value1
Regular expression: [012]
Matches: 1
Would also match 0 or 2.
Text: /url?parameter=valueB
Regular expression: [A-C]
Matches: B
Would also match “A” or “C”. It would not match “b”.
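The code point ordering behind character ranges can be seen in a short Python sketch. The test string "?@ABCab" is made up for illustration.

```python
import re

# The range @-B covers the code points 0x40, 0x41 and 0x42.
in_range = re.findall(r"[@-B]", "?@ABCab")
code_points = [hex(ord(ch)) for ch in in_range]

# Ranges are case-sensitive: "b" (0x62) lies outside A-C (0x41 to 0x43).
upper = re.search(r"[A-C]", "/url?parameter=valueB")
lower = re.search(r"[A-C]", "b")

print(in_range, code_points, upper.group(), lower)
```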
{}
Quantifies the number of times the previous character or capture group may be repeated continuously.
To define a varying number of repetitions, delimit the minimum and maximum with a comma.
Text: 1234567890
Regular expression: \d{3}
Matches: 123
Text: www.example.com
Regular expression: w{1,4}
Matches: www
If the string were a typo such as “ww” or “wwww”, it would also match that.
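Fixed and ranged repetition counts can be tried out in Python, whose {n} and {min,max} quantifiers follow the same rules as PCRE. The five-w host name is a made-up input that shows the upper limit taking effect.

```python
import re

digits = "1234567890"

first_three = re.search(r"\d{3}", digits).group()   # first run of exactly 3 digits
all_threes = re.findall(r"\d{3}", digits)           # every non-overlapping run of 3

normal = re.search(r"w{1,4}", "www.example.com").group()
too_many = re.search(r"w{1,4}", "wwwww.example.com").group()  # capped at 4

print(first_three, all_threes, normal, too_many)
```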
See also
What are back-references?
Cookbook regular expressions
Language support
Rewriting & redirecting
Defining custom data leak & attack signatures
Configuring URL interpreters
Configuring custom suspicious request URLs