The Algorithm to Validate an UTF-8 String
Frank Yung-Fong Tang <[EMAIL PROTECTED]>Draft: 0.4
Status of this document: DRAFT. NEED TO BE CHECKED and REVIEWED
Background Information
UTF-8 is widly used by many software these days. Many Internet Draft, RFC and W3C standards/recommendation refers to it. Many public or private software start to support it. However, a lot of people do not know how to perform a common task- how to validate an UTF-8 string.Applications of this Algorithm
The implemtation of this UTF-8 validation process/function could be used to- work with assertion code in debuggin mode to check input parameter which is expected in UTF-8. If failed, the debugging mode code could alert user with assertion.
- work with function entry error checking code to check input parameter which is expected in UTF-8. If failed, the function could return error value immediately in order to reduce error in the future processing
- intergrated into the debugging facility of an IDE (intergrated development
environment), for example, people can build a gdb command to allow the developer
to verify the input parameter in the break point is encoded in valid UTF-8
sequence.
- work with logging code to log invalide data in the entry point of a system (for example, an application server). In case of failure, the system could log the invalid request and perform the error handling policy.
Consideration
In order to consider a string is a valide UTF-8 string. The following thing need to be considered:- The octects sequences should fit into the UTF-8 pattern
- No five and sex octets UTF-8 sequence and no unused byte
- No non-shortest form
- No suurogate characters encoded directly in 3-bytes UTF-8 sequence
- No more than pattern represent UCS4 value larger than U+10FFFF
The octects sequences should fit into the UTF-8 pattern
UTF-8 have the following byte patterns according to RFC 2279:
Number |
UCS-4 range (hex.) |
UTF-8 octet sequence (binary) |
UTF-8 octet range (hex) |
| 1 |
0000 0000-0000 007F |
0xxxxxxx |
0x00-0x7f |
| 2 |
0000 0080-0000 07FF |
110xxxxx 10xxxxxx |
0xc0-0xdf 0x80-0xbf |
| 3 |
0000 0800-0000 FFFF |
1110xxxx 10xxxxxx 10xxxxxx |
0xe0-0xef 0x80-0xbf 0x80-0xbf |
| 4 |
0001 0000-001F FFFF |
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx |
0xf0-0xf7 0x80-0xbf 0x80-0xbf 0x80-0xbf |
See the next session about the five and 6 octets sequences mentioned in RFC 2279
No five and six octets UTF-8 sequence and no unused byte
Unicode 3.1 ruled out the five and six octets UTF-8 sequence as illegal although previous standard / specification such as Unciode 3.0 and RFC 2279 allow the five and six octets UTF-8 sequence. Therefore, we need to make sure those value are not in the UTF-8The checking is easy. If you just allow the patten in the previous session in the sequence, then these five and six octets squence won't be allowed. Another way to look at it is in the old UTF-8 definitation, the first octet of the five octect sequence is start with 0xf8-0xfb, the first octet of the five octect sequence is start with 0xfc-0xfd, and these bytes never appeared in any other places in the UTF-8 sequence. The byte 0xfe and 0xff is also never used in UTF-8 neither. Therefore, we now can conclude any sequence have a byte range 0xf8-0xfd is not a valid UTF-8 sequence.
This was introduced into Unicode 3.1 dated May 6, 2001
No non-shortest form
With the byte pattern in UTF-8, it is possible to use non shortest form the encode UCS4. However, Unicode 3.1 already rule out such sequence is legal UTF-8 sequence due to the concern of security. Therefore, we could check the sequence to see any of the following pattern exist, if so, then it is not a valid UTF-8 sequence:
Number |
UCS-4 range (hex.) |
Non-shortest form (which is illegal) UTF-8 octet
sequence (binary) |
Non-shortest form (which is illegal) UTF-8 octet
range (hex) |
| 2 |
0000 0000-0000 007F |
1100000x 10xxxxxx |
0xc0-0xc1 0x80-0xbf |
| 3 |
0000 0080-0000 07FF |
11100000 100xxxxx 10xxxxxx |
0xe0 0x80-0x9f 0x80-0xbf |
| 4 |
0000 0800-0000 FFFF |
11110000 1000xxxx 10xxxxxx 10xxxxxx |
0xf0 0x80-0x8f 0x80-0xbf 0x80-0xbf |
Since 0xc0 and 0xc1 in UTF-8 should only be used as the first octet of a 2 octet sequence, and that always lead to a 2 octect non-shortest form of UCS4 range in 0x00-0x7f, we can conclude in a legal UTF-8 sequence, there should have no appearance of octet 0xc0-0xc1.
Similar to the byte 0xe0, we can conclude there are no UTF-8 sequence have the sequence 0xe0 followed by a octect in the range of 0x80-0x9f. For the 4 octect sequence, we can conclude there are no UTF-8 sequence have the sequence 0xf0 followed by a octects in the range of 0x80-0x8f.
This was introduced into Unicode 3.1 dated May 6, 2001
No suurogate characters encoded directly in 3-bytes UTF-8 sequence
This is similar to non-shortest form. In UTF-16, the UCS4 U+10000 - U+10FFFF is encoded with a surrogate pair - 0xD800-0xDBFF then 0xDC00-0xDFFF. In the UTF-8, they should be encoded in the 4 octects ranges instead of two three-octect UTF-8 sequence which represent the surrogate pair directly with the bit pattern shifting. Therefore, the patten in the three octect UTF-8 seqeunce which represent value 0xD800-0xDFFF should be consider illegal UTF-8
Number |
UCS-4 range (hex.) |
Surrogate high and surrogate low range directly
map to UTF-8 octet sequence (binary) |
Surrogate high and surrogate low range directly
map to UTF-8 octet range (hex) |
| 3 |
0000 D800-0000 DFFF |
11101101 101xxxxx 10xxxxxx |
0xed 0xa0-0xbf 0x80-0xbf |
Therefore, we can conclude any octet 0xed followed by a octect ranged in 0xa0-0xbf is illegal UTF-8.
This was introduced into Unicode 3.2 dated March 27, 2002
No more than pattern represent UCS4 value larger than U+10FFFF
Unicode defined BMP and only plane 1 to plane 16. Therefore, there are no Unicode could be greater than U+10FFFF. Therefore, the combination of UTF-8 which lead to a UCS4 value larger than 10ffff should be consider illegal unicode. Since 4 octets UTF-8 sequence origionally could encode UCS4 value up to 0x1ffff, we need to make sure UTF-8 pattern represent 0x110000 to 0x1fffff be considered as illegal. This need to be break down to 4 sub range
Number |
UCS-4 range (hex.) |
Illegal UTF-8 octet sequence (binary) represent
UCS4 value greater than 0x10FFFF |
Illegal UTF-8 octet range (hex) represent UCS4
value greater than 0x10FFFF |
| 4 |
0011 0000-0011 FFFF |
11110100 1001xxxx 10xxxxxx 10xxxxxx |
0xf4 0x90-0x9f 0x80-0xbf 0x80-0xbf |
| 4 |
0012 0000-0013 FFFF |
11110100 101xxxxx 10xxxxxx 10xxxxxx |
0xf4 0xa0-0xbf 0x80-0xbf 0x80-0xbf |
| 4 |
0014 0000-0017 FFFF |
11110101 10xxxxxx 10xxxxxx 10xxxxxx |
0xf5 0x80-0xbf 0x80-0xbf 0x80-0xbf |
| 4 |
0018 0000-001F FFFF |
1111011x 10xxxxxx 10xxxxxx 10xxxxxx |
0xf6-0xf7 0x80-0xbf 0x80-0xbf 0x80-0xbf |
Therefore, we can conclude a sequence include byte range in 0xf5-0xf7 and a two octets sequence 0xf4 followed by 0x90-0xbf always should be treated as illegal UTF-8.
This was introduced into Unicode 3.1 dated May 6, 2001
Summary:
Below are the byte values which never should be seen in a valid UTF-8 string| Bytes never used by valid UTF-8: |
Reason |
0xFE-0xFF |
Is not used by any UTF-8 octet pattern, not leading
octet, neither trial octet |
0xF8-0xFB |
Represent 5 octets UTF-8 sequence which the definitation
obsoleted by Unicode 3.1 |
0xFC-0xFD |
Represent 6 octets UTF-8 sequence which the definitation
obsoleted by Unicode 3.1 |
0xC0-0xC1 |
Represent non-shortest form in 2 octet sequence
if followed by one 0x80-0xbf |
0xF5-0xF7 |
Represent value larger than U+10FFFF in 4 octet
sequence if followed by three 0x80-0xbf |
Below are the two-bytes paris which never should be seen in a valid UTF-8 string
| Two byte paris never used in UTF-8: |
Reason |
0xc0-0xf7 0x00-0x7f |
Does not match with UTF-8 pattern |
0xe0 0x80-0x9f |
Represent non-shortest form in 3 octet sequence if followed by one 0x80-0xbf |
0xf0 0x80-0x8f |
Represent non-shortest form in 4 octet sequence
if followed by two 0x80-0xbf |
0xed 0xa0-0xbf |
Represent range 0xd800-0xdfff in 3 octet sequence
if followed by one 0x80-0xbf |
0xf4 0x90-0xbf |
Represent value larger than U+10FFFF in 4 octet sequence if followed by one 0x80-0xbf |
Algorithm
A regular expression algorithm
The validation code could be implemented by regular expression if environment permit:We can say a sequence is UTF-8 if all the bytes match
/^(([\0-\x7F])|But in the mean time it does NOT match any of the following
([\xC0-\xDF][\x80-\xBF])|
([\xE0-\xEF][\x80-\xBF][\x80-\xBF])|
([\xF0-\xF7][\x80-\xBF][\x80-\xBF][\x80-\xBF]))*$/
([\xC0-\xC1])|This regular expression can be simplified to
([\xE0][\x80-\x9F])|
([\xF0][\x80-\x8F])|
([\xED][\xA0-\xBF])|
([\xF4][\x90-\xBF])|
([\xF5-\xF7])
/^(([\0-\x7F])
|([\xC2-\xDF][\x80-\xBF])
|((([\xE0][\xA0-\xBF])
|([\xE1-\xEC\xEE-\xEF][\x80-\xBF])
|([\xED][\x80-\x9F])
)[\x80-\xBF])
|((([\xF0][\x90-\xBF])
|([\xF1-\xF3][\x80-\xBF])
|([\xF4][\x80-\x8F])
)[\x80-\xBF][\x80-\xBF])
)*$/
A state machine algorithm
The validation process could also be implemented by a statemachine if the environment does not support regular expression or the regluar expression implementation is not effective enough.Below is the state machine diagram
- state START
- Input = 0x00 ~ 0x7F : change state to START
- Input = 0xC2-0xDF: change state to A
- Input = 0xE1-0xEC, 0xEE-0xEF: change state to B
- Input = 0xE0: change state to C
- Input = 0xED: change state to D
- Input = 0xF1-0xF3:change state to E
- Input = 0xF0: change state to F
- Input = 0xF4: change state to G
- Input = Others (0x80-0xBF,0xC0-0xC1, 0xF5-0xFF): ERROR
- state A
- Input = 0x80-0xBF: change state to START
- Others: ERROR
- state B
- Input = 0x80-0xBF: change state to A
- Others: ERROR
- state C
- Input = 0xA0-0xBF: change state to A
- Others: ERROR
- state D
- Input = 0x80-0x9F: change state to A
- Others: ERROR
- state E
- Input = 0x80-0xBF: change state to B
- Others: ERROR
- state F
- Input = 0x90-0xBF: change state to B
- Others: ERROR
- state G
- Input = 0x80-0x8F: change state to B
- Others: ERROR
| Change the state to |
Current State |
|||||||
| Input |
START |
A |
B |
C |
D |
E |
F |
G |
| 0x00-7F |
START |
ERROR |
ERROR |
ERROR |
ERROR |
ERROR |
ERROR |
ERROR |
| 0x80-0x8F |
ERROR |
START |
A |
A |
B |
B |
||
| 0x90-0x9F |
B |
ERROR |
||||||
| 0xA0-0xBF |
A |
ERROR |
||||||
| 0xC0-0xC1,0xF5-0xFF |
ERROR |
ERROR |
ERROR |
ERROR |
ERROR |
|||
| 0xC2-0xDF |
A |
|||||||
| 0xE0 |
C |
|||||||
| 0xE1-0xEC, 0xEE-0xEF |
B |
|||||||
| 0xED |
D |
|||||||
| 0xF0 |
F |
|||||||
| 0xF1-0xF3 |
E |
|||||||
| 0xF4 |
G |
|||||||
Sample Implementations
A Perl Implementation of the Regular Expression Algorithm:
# |
A C Implementation of the State Machine Algorithm:
/* -*- Mode: C; tab-width: 4; indent-tabs-mode: nil; c-basic-offset: 2 -*- |
References
- Unicode Standard Annex #27 - Unicode 3.1
- Unicode Standard
Annex #28 - Unicode 3.2
- RFC 2279 - UTF-8, a transformation format of ISO 10646
- Internet Draft which obsoleted RFC 2279
Change History:
- version 0.2:
- Correct the state D to from "Input = 0x90-0x9F: change state to A" to "Input = 0x80-0x9F: change state to A" and change the state table to reflect it.
- Add new regular expression which use one regluar experssion to catch all valid UTF-8
- Add reference link to Unicode 3.2
- Add when does those consideration introduced into Unicode with version and date.
- version 0.3;
- Add MPL based C implementation
- Add MPL based Perl implementation
- version 0.4:
- correct the C implementation to make the byte 0xF0 class 9 instead
of 10.

