The term is probably badly chosen, but "backup" here means that you must move backward from the start position and read from there. It is not related to any data copying or saving operation.
- In UTF-16, there's an error in your citation: if you find a leading surrogate (in 0xD800..0xDBFF), you are already at the correct position, and the next code unit you read should be the trailing surrogate (in 0xDC00..0xDFFF). Only if you find a trailing surrogate do you need to back up to the previous position, which should hold the leading surrogate.
- In UTF-8, you'll need to look backward 1 to 3 positions before your start position to find the leading 8-bit code unit (>= 0xC0).

In both cases you have to check the value found. If you don't find it within that limited range of positions, the input is not valid UTF-8 or UTF-16 and you have to handle an encoding error in the input stream.

The Unicode standard does not specify how you should handle this error situation, from where you can resync the stream, or even whether you should resync from some later position; this is application-dependent.

If the input stream is live (for example, coming from a broadcast medium), you'll probably want to just skip the error, invalidate the current state, signal an error to the user in some way, and then try to restart from the next valid position. But data truncation will occur, and there's no easy way to determine whether your text stream will parse correctly:

- If the input text stream is a script with its own syntax, the script will not be processed correctly, and its interpretation or compilation should be stopped with an exception thrown or an error status returned to the client API.
- If the stream is just some readable text (e.g. subtitle text displayed on a video), the user will just see part of the text, but the video will continue playing.
- If the input text is an ongoing chat discussion, some of the discussion will be truncated, but the discussion will continue from there.

If the input text comes from a file or from a data structure supposed to contain the full text, the file or data structure is corrupted.
Depending on the case, this could be an internal software bug, a reliability problem with the storage, or an error in the transmission medium or network. It could also be an input stream that was not actually encoded with this UTF (you may retry by guessing which text encoding was used, not necessarily a UTF).

2013/8/28 Xue Fuqiao <[email protected]>

> Hi list,
>
> I'm reading Unicode 6.2.0 and have a question. In Section 2.5, Encoding
> Forms:
>
> For example, when randomly accessing a string, a program can find the
> boundary of a character with limited backup. In UTF-16, if a pointer
> points to a leading surrogate, a single backup is required. In UTF-8,
> if a pointer points to a byte starting with 10xxxxxx (in binary), one
> to three backups are required to find the beginning of the character.
>
> What does the "backup" mean here? What does the program backup?
>
> I searched "backup" with unicode.org/search/ but didn't get anything
> that looked promising. Can anyone point me in the right direction?
>
> (English is not my native language; please excuse typing errors.)
>
> --
> Best regards, Xue Fuqiao.
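
P.S. To make the "limited backup" in my reply concrete, here is a minimal Python sketch of the backward scan for both encoding forms. The function names utf16_char_start and utf8_char_start are hypothetical, chosen only for this illustration; they are not taken from the Unicode standard or from any library. The sketch assumes the text is already available as a sequence of integer code units, and it only performs the checks needed to locate the start of a character. It is not a full validator (for instance, it does not reject overlong UTF-8 forms), and raising ValueError is just one possible way to report the encoding error.

```python
def utf16_char_start(units, i):
    """Index of the first 16-bit code unit of the character containing units[i].

    Hypothetical helper: `units` is a sequence of integers in 0..0xFFFF.
    """
    u = units[i]
    if 0xDC00 <= u <= 0xDFFF:
        # Trailing surrogate: a single backup is needed, and the unit
        # found there must be a leading surrogate.
        if i == 0 or not (0xD800 <= units[i - 1] <= 0xDBFF):
            raise ValueError("unpaired trailing surrogate at index %d" % i)
        return i - 1
    # Leading surrogate or ordinary BMP code unit: already at the start.
    return i


def utf8_char_start(data, i):
    """Index of the leading byte of the character containing data[i].

    Hypothetical helper: `data` is a bytes-like object.
    """
    start = i
    # Continuation bytes have the binary form 10xxxxxx (0x80..0xBF).
    while data[start] & 0xC0 == 0x80:
        if i - start == 3 or start == 0:
            # More than three backups would be needed, or the data begins
            # with a continuation byte: not valid UTF-8.
            raise ValueError("no leading byte found for index %d" % i)
        start -= 1
    if start != i:
        # We backed up, so the byte found must be a leading byte (>= 0xC0)
        # whose declared sequence length actually reaches position i.
        lead = data[start]
        length = 2 if lead < 0xE0 else 3 if lead < 0xF0 else 4
        if lead < 0xC0 or i - start >= length:
            raise ValueError("stray continuation byte at index %d" % i)
    return start
```

What the caller does after a ValueError (stop, skip ahead, signal the user) is the application-dependent part discussed in my reply; the sketch deliberately leaves that choice open.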

