Appendix D: Regular expressions : Language support
 
Language support
Features such as Recursive URL Decoding, input rules, and attack signatures can detect attacks and data leaks even when multiple languages are used as an evasion technique.
When configuring FortiWeb, regardless of the display language (see “Global web UI & CLI settings”), the simplest case is to configure with only US-ASCII characters. All features, including queries to external servers, support it.
If you want to configure FortiWeb using another language/encoding, or support clients using another language or multiple languages, sometimes characters such as ñ, é, symbols, and ideographs such as are valid input. Support varies by the nature of the item being configured.
For example, by definition, host names cannot contain special characters. DNS standards predate many standards for internationalization. Because of this, the web UI and CLI will reject input if it contains non-ASCII encoded characters when configuring the host name. This means that languages other than English are not supported unless encoded as an RFC 3490 international domain name (IDN) prefixed with xn--. However, other configuration items, such as names and comments, often support the language of your choice.
To use your preferred languages in those cases, use an encoding that supports it.
For best results:
for regular expressions that must match HTTP requests, use the same encoding as your HTTP clients
for other features, use UTF-8 encoding, or use only the characters whose encoded values are the same in UTF‑8 (for example, US-ASCII characters are usually encoded using the same byte-wise values in ISO 8859-1, Windows code page 1252, Shift-JIS and others; however, ideographs such as may be garbled or interpreted as the wrong character when viewed as another encoding)
 
HTTP clients may send requests in encodings that are not UTF-8. Encodings vary by the client’s operating system or input language.
If you input the configuration in English, the client’s request may match regardless of encoding: due to US-ASCII predating most other encodings, byte-wise, the values for English characters tend to have identical numerical values in many encoding types. For example, English words may be readable regardless of interpreting a web page as either ISO 8859-1 or as GB2312.
For other languages (especially non-Latin alphabets such as Cyrillic and Thai), match the client’s encoding exactly.
For example, with Shift-JIS, backslashes ( \ ) could be inadvertently interpreted as yen symbols ( ¥ ) and vice versa. A regular expression intended to match HTTP requests containing money values with a yen symbol therefore may not work if the symbol is entered using the wrong encoding. Likewise, simplified Chinese characters might only be understandable if the page is interpreted as GB2312. Test your expressions. If you enter a regular expression using another encoding, or if an HTTP client sends a request in an encoding other than UTF‑8, remember that matches may not be what you initially expect.
Regular expressions are especially impacted. Matching engines on FortiWeb use the UTF‑8 character values. If you need to match multiple possible languages from clients, especially for attack signatures, make sure you construct a regular expression that matches all alternative values.
For example, the Latin letter C is not encoded using the same byte-wise value as the similar-looking Cyrillic letter С. A human being can read a Spanish phrase written with that Cyrillic character, because they are visually similar. But a regular expressions will not match unless written to match both numerical values: one for the Latin character, and one for the Cyrillic look-alike (sometimes called a “confusable”).To configure your FortiWeb appliance using other encodings, you may need to switch language settings on your management computer, including for your web browser or Telnet/SSH client. For instructions on how to configure your management computer’s operating system language, locale, or input method, see its documentation.
 
If you choose to configure parts of the FortiWeb appliance using non-ASCII characters, you should also use the same encoding throughout the configuration if possible in order to avoid needing to switch the language settings of your web browser or Telnet/SSH client while you work.
Similarly, your web browser or CLI client should usually interpret display output as encoded using UTF-8. If it does not, your configured items may not display correctly in the web UI or CLI. Exceptions include items such as regular expressions that you may have configured using other encodings in order to match the encoding of HTTP requests that the FortiWeb appliance receives.
See also
Cookbook regular expressions
Regular expression syntax