"Back up" here refers to decrementing the pointer in the string.

If you have a string consisting of the following UTF-16 code units, for example:

00C0 0020 20AC D800 DC00 00C5
 0    1    2    3    4    5

If you set the pointer to code unit number 4 (counting from 0), it points
at DC00, which is a trailing ("low") surrogate. The pointer needs to
"back up" (decrement) by one, to position 3 (D800, the leading or "high"
surrogate), to find the start of the character. Each of the other code
units represents a single code point on its own.
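To make the idea concrete, here is a minimal sketch in Python of that "back up" step for UTF-16. It assumes the string is held as a list of 16-bit code unit values; the function names (is_lead_surrogate, is_trail_surrogate, char_start) are made up for illustration and do not come from the Unicode Standard or any particular library.

```python
def is_lead_surrogate(unit: int) -> bool:
    # Leading ("high") surrogates occupy D800..DBFF.
    return 0xD800 <= unit <= 0xDBFF


def is_trail_surrogate(unit: int) -> bool:
    # Trailing ("low") surrogates occupy DC00..DFFF.
    return 0xDC00 <= unit <= 0xDFFF


def char_start(units: list[int], i: int) -> int:
    """Return the index of the first code unit of the character at units[i].

    Hypothetical helper: backs up by at most one position, and only when
    units[i] is a trailing surrogate preceded by a leading surrogate.
    """
    if i > 0 and is_trail_surrogate(units[i]) and is_lead_surrogate(units[i - 1]):
        return i - 1  # "back up" one code unit
    return i  # already at a character boundary (or an unpaired surrogate)


# The example string from this message, as UTF-16 code units.
units = [0x00C0, 0x0020, 0x20AC, 0xD800, 0xDC00, 0x00C5]

print(char_start(units, 4))  # pointer at DC00 backs up to index 3
print(char_start(units, 2))  # 20AC is a complete character, stays at 2
```

The check on units[i - 1] is a design choice in this sketch: an unpaired trailing surrogate is left where it is rather than backed up onto an unrelated code unit.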

Addison Phillips
Globalization Architect (Amazon Lab126)
Chair (W3C I18N WG)

Internationalization is not a feature.
It is an architecture.



> -----Original Message-----
> From: [email protected] [mailto:[email protected]] On
> Behalf Of Xue Fuqiao
> Sent: Tuesday, August 27, 2013 6:37 PM
> To: [email protected]
> Subject: What to backup after corruption of code units?
> 
> Hi list,
> 
> I'm reading Unicode 6.2.0 and have a question. In Section 2.5, Encoding Forms:
> 
>   For example, when randomly accessing a string, a program can find the
>   boundary of a character with limited backup. In UTF-16, if a pointer
>   points to a leading surrogate, a single backup is required. In UTF-8,
>   if a pointer points to a byte starting with 10xxxxxx (in binary), one
>   to three backups are required to find the beginning of the character.
> 
> What does the "backup" mean here? What does the program backup?
> 
> I searched "backup" with unicode.org/search/ but didn't get anything that
> looked promising.  Can anyone point me in the right direction?
> 
> (English is not my native language; please excuse typing errors.)
> 
> --
> Best regards, Xue Fuqiao.


