ok, I find some problem and rewirte a little bit. Here is the rev 0.4 . Please give me your comment.
Title: The Algorithm to Valide an UTF-8 String

The Algorithm to Validate an UTF-8 String

Frank Yung-Fong Tang <[EMAIL PROTECTED]>
Draft: 0.4
Status of this document: DRAFT. NEED TO BE CHECKED and REVIEWED

Background Information

UTF-8 is widly used by many software these days. Many Internet Draft, RFC and W3C standards/recommendation refers to it. Many public or private software start to support it. However, a lot of people do not know how to perform a common task- how to validate an UTF-8 string.

Applications of this Algorithm

The implemtation of this UTF-8 validation process/function could be used to
  • work with assertion code in debuggin mode to check input parameter which is expected in UTF-8. If failed, the debugging mode code could alert user with assertion.
  • work with function entry error checking code to check input parameter which is expected in UTF-8. If failed, the function could return error value immediately in order to reduce error in the future processing
  • intergrated into the debugging facility of an IDE (intergrated development environment), for example, people can build a gdb command to allow the developer to verify the input parameter in the break point is encoded in valid UTF-8 sequence.
  • work with logging code to log invalide data in the entry point of a system (for example, an application server). In case of failure, the system could log the invalid request and perform the error handling policy.

Consideration

In order to consider a string is a valide UTF-8 string. The following thing need to be considered:
  1. The octects sequences should fit into the UTF-8 pattern
  2. No five and sex octets UTF-8 sequence and no unused byte
  3. No non-shortest form
  4. No suurogate characters encoded directly in 3-bytes UTF-8 sequence
  5. No more than pattern represent UCS4 value larger than U+10FFFF

The octects sequences should fit into the UTF-8 pattern

UTF-8 have the following byte patterns according to RFC 2279:
Number 
of Bytes
UCS-4 range (hex.)
UTF-8 octet sequence (binary)
UTF-8 octet range (hex)
1
0000 0000-0000 007F
0xxxxxxx
0x00-0x7f
2
0000 0080-0000 07FF
110xxxxx 10xxxxxx
0xc0-0xdf 0x80-0xbf
3
0000 0800-0000 FFFF
1110xxxx 10xxxxxx 10xxxxxx
0xe0-0xef 0x80-0xbf 0x80-0xbf
4
0001 0000-001F FFFF
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
0xf0-0xf7 0x80-0xbf 0x80-0xbf 0x80-0xbf

See the next session about the five and 6 octets sequences mentioned in RFC 2279

No five and six octets UTF-8 sequence and no unused byte 

Unicode 3.1 ruled out the five and six octets UTF-8 sequence as illegal although previous standard / specification such as Unciode 3.0 and RFC 2279 allow the five and six octets UTF-8 sequence. Therefore, we need to make sure those value are not in the UTF-8

The checking is easy. If you just allow the patten in the previous session in the sequence, then these five and six octets squence won't be allowed. Another way to look at it is in the old UTF-8 definitation, the first octet of the five octect sequence is start with 0xf8-0xfb, the first octet of the five octect sequence is start with 0xfc-0xfd, and these bytes never appeared in any other places in the UTF-8 sequence. The byte 0xfe and 0xff is also never used in UTF-8 neither. Therefore, we now can conclude any sequence have a byte range 0xf8-0xfd is not a valid UTF-8 sequence.

This was introduced into Unicode 3.1 dated May 6, 2001

No non-shortest form

With the byte pattern in UTF-8, it is possible to use non shortest form the encode UCS4. However, Unicode 3.1 already rule out such sequence is legal UTF-8 sequence due to the concern of security. Therefore, we could check the sequence to see any of the following pattern exist, if so, then it is not a valid UTF-8 sequence:

Number 
of Bytes
UCS-4 range (hex.)
Non-shortest form (which is illegal) UTF-8 octet sequence (binary)
Non-shortest form (which is illegal) UTF-8 octet range (hex)
2
0000 0000-0000 007F
1100000x 10xxxxxx
0xc0-0xc1 0x80-0xbf
3
0000 0080-0000 07FF
11100000 100xxxxx 10xxxxxx
0xe0      0x80-0x9f 0x80-0xbf
4
0000 0800-0000 FFFF
11110000 1000xxxx 10xxxxxx 10xxxxxx
0xf0      0x80-0x8f 0x80-0xbf 0x80-0xbf

Since 0xc0 and 0xc1 in UTF-8 should only be used as the first octet of a 2 octet sequence, and that always lead to a 2 octect non-shortest form of UCS4 range in 0x00-0x7f, we can conclude in a legal UTF-8 sequence, there should have no appearance of octet 0xc0-0xc1.
Similar to the byte 0xe0, we can conclude there are no UTF-8 sequence have the sequence 0xe0 followed by a octect in the range of 0x80-0x9f. For the 4 octect sequence, we can conclude there are no UTF-8 sequence have the sequence 0xf0 followed by a octects in the range of 0x80-0x8f.

This was introduced into Unicode 3.1 dated May 6, 2001

No suurogate characters encoded directly in 3-bytes UTF-8 sequence

This is similar to non-shortest form. In UTF-16, the UCS4 U+10000 - U+10FFFF is encoded with a surrogate pair - 0xD800-0xDBFF then 0xDC00-0xDFFF. In the UTF-8, they should be encoded in the 4 octects ranges instead of two three-octect UTF-8 sequence which represent the surrogate pair directly with the bit pattern shifting. Therefore, the patten in the three octect UTF-8 seqeunce which represent value 0xD800-0xDFFF should be consider illegal UTF-8

Number 
of Bytes
UCS-4 range (hex.)
Surrogate high and surrogate low range directly map to UTF-8 octet sequence (binary)
Surrogate high and surrogate low range directly map to UTF-8 octet range (hex)
3
0000 D800-0000 DFFF
11101101 101xxxxx 10xxxxxx
0xed 0xa0-0xbf 0x80-0xbf

Therefore, we can conclude any octet 0xed followed by a octect ranged in 0xa0-0xbf is illegal UTF-8.

This was introduced into Unicode 3.2 dated March 27, 2002


No more than pattern represent UCS4 value larger than U+10FFFF

Unicode defined BMP and only plane 1 to plane 16. Therefore, there are no Unicode could be greater than U+10FFFF. Therefore, the combination of UTF-8 which lead to a UCS4 value larger than 10ffff should be consider illegal unicode. Since 4 octets UTF-8 sequence origionally could encode UCS4 value up to 0x1ffff, we need to make sure UTF-8 pattern represent 0x110000 to 0x1fffff be considered as illegal. This need to be break down to 4 sub range
Number 
of Bytes
UCS-4 range (hex.)
Illegal UTF-8 octet sequence (binary) represent UCS4 value greater than 0x10FFFF
Illegal UTF-8 octet range (hex) represent UCS4 value greater than 0x10FFFF
4
0011 0000-0011 FFFF
11110100 1001xxxx 10xxxxxx 10xxxxxx
0xf4      0x90-0x9f 0x80-0xbf 0x80-0xbf
4
0012 0000-0013 FFFF
11110100 101xxxxx 10xxxxxx 10xxxxxx
0xf4      0xa0-0xbf 0x80-0xbf 0x80-0xbf
4
0014 0000-0017 FFFF
11110101 10xxxxxx 10xxxxxx 10xxxxxx
0xf5      0x80-0xbf 0x80-0xbf 0x80-0xbf
4
0018 0000-001F FFFF
1111011x 10xxxxxx 10xxxxxx 10xxxxxx
0xf6-0xf7 0x80-0xbf 0x80-0xbf 0x80-0xbf

Therefore, we can conclude a sequence include byte range in 0xf5-0xf7 and a two octets sequence 0xf4 followed by 0x90-0xbf always should be treated as illegal UTF-8.

This was introduced into Unicode 3.1 dated May 6, 2001

Summary:

Below are the byte values which never should be seen in a valid UTF-8 string

Bytes never used by valid UTF-8:
Reason
0xFE-0xFF
Is not used by any UTF-8 octet pattern, not leading octet, neither trial octet
0xF8-0xFB
Represent 5 octets UTF-8 sequence which the definitation obsoleted by Unicode 3.1
0xFC-0xFD
Represent 6 octets UTF-8 sequence which the definitation obsoleted by Unicode 3.1
0xC0-0xC1
Represent non-shortest form in 2 octet sequence if followed by one 0x80-0xbf
0xF5-0xF7
Represent value larger than U+10FFFF in 4 octet sequence if followed by three 0x80-0xbf

Below are the two-bytes paris which never should be seen in a valid UTF-8 string

Two byte paris never used in UTF-8:
Reason
0xc0-0xf7 0x00-0x7f
0xc0-0xf7 0xc0-0xff
Does not match with UTF-8 pattern
0xe0      0x80-0x9f
Represent non-shortest form in 3 octet sequence if followed by one 0x80-0xbf
0xf0      0x80-0x8f

Represent non-shortest form in 4 octet sequence if followed by two 0x80-0xbf

0xed      0xa0-0xbf

Represent range 0xd800-0xdfff in 3 octet sequence if followed by one 0x80-0xbf

0xf4      0x90-0xbf
Represent value larger than U+10FFFF in 4 octet sequence if followed by one 0x80-0xbf

Algorithm

A regular expression algorithm

The validation code could be implemented by regular expression if environment permit:

We can say a sequence is UTF-8 if all the bytes match
/^(([\0-\x7F])|
([\xC0-\xDF][\x80-\xBF])|
([\xE0-\xEF][\x80-\xBF][\x80-\xBF])|
([\xF0-\xF7][\x80-\xBF][\x80-\xBF][\x80-\xBF]))*$/
But in the mean time it does NOT match any of the following
 ([\xC0-\xC1])|
([\xE0][\x80-\x9F])|
([\xF0][\x80-\x8F])|
([\xED][\xA0-\xBF])|
([\xF4][\x90-\xBF])|
([\xF5-\xF7])

This regular expression can be simplified to
/^(([\0-\x7F])
|([\xC2-\xDF][\x80-\xBF])
|((([\xE0][\xA0-\xBF])
|([\xE1-\xEC\xEE-\xEF][\x80-\xBF])
|([\xED][\x80-\x9F])
)[\x80-\xBF])
|((([\xF0][\x90-\xBF])
|([\xF1-\xF3][\x80-\xBF])
|([\xF4][\x80-\x8F])
)[\x80-\xBF][\x80-\xBF])
)*$/

A state machine algorithm

The validation process could also be implemented by a statemachine if the environment does not support regular expression or the regluar expression implementation is not effective enough.

Below is the state machine diagram
  • state START
    • Input = 0x00 ~ 0x7F : change state to START
    • Input = 0xC2-0xDF: change state to A
    • Input = 0xE1-0xEC, 0xEE-0xEF: change state to B
    • Input = 0xE0: change state to C
    • Input = 0xED: change state to D
    • Input = 0xF1-0xF3:change state to E
    • Input = 0xF0: change state to F
    • Input = 0xF4: change state to G
    • Input = Others (0x80-0xBF,0xC0-0xC1, 0xF5-0xFF): ERROR
  • state A
    • Input = 0x80-0xBF: change state to START
    • Others: ERROR
  • state B
    • Input = 0x80-0xBF: change state to A
    • Others: ERROR
  • state C
    • Input = 0xA0-0xBF: change state to A
    • Others: ERROR
  • state D
    • Input = 0x80-0x9F: change state to A
    • Others: ERROR
  • state E
    • Input = 0x80-0xBF: change state to B
    • Others: ERROR
  • state F
    • Input = 0x90-0xBF: change state to B
    • Others: ERROR
  • state G
    • Input = 0x80-0x8F: change state to B
    • Others: ERROR
Change the state to
Current State
Input
START
A
B
C
D
E
F
G
0x00-7F
START
ERROR
ERROR
ERROR
ERROR
ERROR
ERROR
ERROR
0x80-0x8F
ERROR
START
A
A

B
B
0x90-0x9F
B
ERROR
0xA0-0xBF
A
ERROR
0xC0-0xC1,0xF5-0xFF
ERROR
ERROR
ERROR
ERROR
ERROR
0xC2-0xDF
A
0xE0
C
0xE1-0xEC, 0xEE-0xEF
B
0xED
D
0xF0
F
0xF1-0xF3
E
0xF4
G

Sample Implementations

A Perl Implementation of the Regular Expression Algorithm:

#
# The contents of this file are subject to the Netscape Public License
# Version 1.0 (the "NPL"); you may not use this file except in
# compliance with the NPL. You may obtain a copy of the NPL at
# http://www.mozilla.org/NPL/
#
# Software distributed under the NPL is distributed on an "AS IS" basis,
# WITHOUT WARRANTY OF ANY KIND, either express or implied. See the NPL
# for the specific language governing rights and limitations under the
# NPL.
#
# The Initial Developer of this code under the NPL is Netscape
# Communications Corporation. Portions created by Netscape are
# Copyright (C) 1998 Netscape Communications Corporation. All Rights
# Reserved.
#

sub ValidUTF8 {
local ( $utf8 ) = pop (@_);
return ($utf8 =~ /^(([\0-\x7F])|([\xC2-\xDF][\x80-\xBF])|
((([\xE0][\xA0-\xBF])|([\xE1-\xEC\xEE-\xEF][\x80-\xBF])|([\xED][\x80-\x9F]))[\x80-\xBF])|
((([\xF0][\x90-\xBF])|([\xF1-\xF3][\x80-\xBF])|([\xF4][\x80-\x8F]))[\x80-\xBF][\x80-\xBF]))*$/);
}

A C Implementation of the State Machine Algorithm:

/* -*- Mode: C; tab-width: 4; indent-tabs-mode: nil; c-basic-offset: 2 -*-
*
* The contents of this file are subject to the Netscape Public License
* Version 1.0 (the "NPL"); you may not use this file except in
* compliance with the NPL. You may obtain a copy of the NPL at
* http://www.mozilla.org/NPL/
*
* Software distributed under the NPL is distributed on an "AS IS" basis,
* WITHOUT WARRANTY OF ANY KIND, either express or implied. See the NPL
* for the specific language governing rights and limitations under the
* NPL.
*
* The Initial Developer of this code under the NPL is Netscape
* Communications Corporation. Portions created by Netscape are
* Copyright (C) 1998 Netscape Communications Corporation. All Rights
* Reserved.
*/

/*
* Character class
* 0x00-0x7F: 0
* 0x80-0x8F: 1
* 0x90-0x9f: 2
* 0xa0-0xbf: 3
* 0xc0-0xc1, 0xf5-0xff: 4
* 0xc2-0xdf: 5
* 0xe0: 6
* 0xe1-0xec, 0xee-0xef: 7
* 0xed: 8
* 0xf0: 9
* 0xf1-0xf3: 10
* 0xf4: 11
*/
static int byte_class_table[256] =
{
/* 00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F */
/* 00 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* 10 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* 20 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* 30 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* 40 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* 50 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* 60 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* 70 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* 80 */ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
/* 90 */ 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
/* A0 */ 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
/* B0 */ 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
/* C0 */ 4, 4, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,
/* D0 */ 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,
/* E0 */ 6, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 8, 7, 7,
/* F0 */ 9,10,10,10,11, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4
/* 00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F */
};
/* state table */
typedef enum {
kSTART = 0,
kA,
kB,
kC,
kD,
kE,
kF,
kG,
kERROR,
kNumOfStates
} utf8_state;

static utf8_state state_table[] = {
/* kSTART, kA, kB, kC, kD, kE, kF, kG, kERROR */
/* 0x00-0x7F: 0 */ kSTART, kERROR, kERROR, kERROR, kERROR, kERROR, kERROR, kERROR, kERROR,
/* 0x80-0x8F: 1 */ kERROR, kSTART, kA, kERROR, kA, kB, kERROR, kB, kERROR,
/* 0x90-0x9f: 2 */ kERROR, kSTART, kA, kERROR, kA, kB, kB, kERROR, kERROR,
/* 0xa0-0xbf: 3 */ kERROR, kSTART, kA, kA, kERROR, kB, kB, kERROR, kERROR,
/* 0xc0-0xc1, 0xf5-0xff: 4 */ kERROR, kERROR, kERROR, kERROR, kERROR, kERROR, kERROR, kERROR, kERROR,
/* 0xc2-0xdf: 5 */ kA, kERROR, kERROR, kERROR, kERROR, kERROR, kERROR, kERROR, kERROR,
/* 0xe0: 6 */ kC, kERROR, kERROR, kERROR, kERROR, kERROR, kERROR, kERROR, kERROR,
/* 0xe1-0xec, 0xee-0xef: 7 */ kB, kERROR, kERROR, kERROR, kERROR, kERROR, kERROR, kERROR, kERROR,
/* 0xed: 8 */ kD, kERROR, kERROR, kERROR, kERROR, kERROR, kERROR, kERROR, kERROR,
/* 0xf0: 9 */ kF, kERROR, kERROR, kERROR, kERROR, kERROR, kERROR, kERROR, kERROR,
/* 0xf1-0xf3: 10 */ kE, kERROR, kERROR, kERROR, kERROR, kERROR, kERROR, kERROR, kERROR,
/* 0xf4: 11 */ kG, kERROR, kERROR, kERROR, kERROR, kERROR, kERROR, kERROR, kERROR
};

#define BYTE_CLASS(b) (byte_class_table[(unsigned char)b])
#define NEXT_STATE( b, cur) (state_table[ (BYTE_CLASS(b) * kNumOfStates) + (cur)])


int IsUTF8Text(const char* utf8, int len)
{
utf8_state current = kSTART;
int i;
const char* pt = utf8;
for(i = 0; i < len ; i++, pt++)
{
current = NEXT_STATE(*pt, current);
if (kERROR == current)
return 0;
}
return (current == kSTART) ? -1 : 0;
}

References

Change History:

  • version 0.2: 
    • Correct the state D to from "Input = 0x90-0x9F: change state to A" to "Input = 0x80-0x9F: change state to A" and change the state table to reflect it. 
    • Add new regular expression which use one regluar experssion to catch all valid UTF-8
    • Add reference link to Unicode 3.2
    • Add when does those consideration introduced into Unicode with version and date.
  • version 0.3;
    • Add MPL based C implementation
    • Add MPL based Perl implementation
  • version 0.4:
    • correct the C implementation to make the byte 0xF0 class 9 instead of 10.




Reply via email to