Re: What to backup after corruption of code units?

Asmus Freytag Wed, 28 Aug 2013 17:58:45 -0700

On 8/28/2013 5:19 PM, Doug Ewell wrote:

Actually 0xC2, according to the rules of UTF-8.

Hmm. What you are referring to is that 0xC0 and 0xC1 don't occur because of the requirement for minimal length encoding. However, a check for >= 0xC0 will give the correct result for backing up, assuming the data is valid UTF-8 (or at least locally valid).
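
A minimal sketch of that backing-up step, in Python. The helper name back_up_to_lead is hypothetical and not from this thread, and the sketch assumes the data is at least locally valid. It steps back over continuation bytes (0x80..0xBF), at most three of them, so the byte it stops on is either ASCII (< 0x80) or passes the >= 0xC0 check.

```python
def back_up_to_lead(buf: bytes, pos: int) -> int:
    """Return the index where the sequence containing buf[pos] starts.

    Assumes locally valid UTF-8: steps back over at most three
    continuation bytes (0x80..0xBF) and stops on the first byte
    that is not one (ASCII, or a byte >= 0xC0).
    """
    start = pos
    while start > 0 and pos - start < 3 and 0x80 <= buf[start] <= 0xBF:
        start -= 1
    return start


data = "a€b".encode("utf-8")      # b'a\xe2\x82\xacb'
start = back_up_to_lead(data, 3)  # pos 3 is the last byte of the euro sign
```

Here start lands on the 0xE2 byte at index 1, which is >= 0xC0.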

In terms of boundary determination, would you take violating the rule about minimal length encoding as evidence for corrupted data, or would you first detect the boundary, then decide that a sequence starting with 0xC0 is in violation?
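
One way to picture the second option (detect the boundary first, judge validity afterwards) is a byte classifier that accepts anything >= 0xC0 as a boundary and only then flags 0xC0 and 0xC1. This is a sketch under that assumption, not a recommendation from this thread; the function name and the labels it returns are made up for illustration.

```python
def classify_byte(byte: int) -> str:
    """Classify one UTF-8 code unit by its numeric range."""
    if byte < 0x80:
        return "ascii"             # single-byte sequence
    if byte < 0xC0:
        return "continuation"      # 0x80..0xBF, never a boundary
    if byte < 0xC2:
        return "invalid-overlong"  # 0xC0, 0xC1: boundary-shaped, but non-minimal
    if byte < 0xE0:
        return "lead-2"            # 0xC2..0xDF
    if byte < 0xF0:
        return "lead-3"            # 0xE0..0xEF
    if byte < 0xF5:
        return "lead-4"            # 0xF0..0xF4
    return "invalid"               # 0xF5..0xFF never occur in UTF-8


def is_boundary(byte: int) -> bool:
    """Boundary test only: anything that is not a continuation byte."""
    return classify_byte(byte) != "continuation"
```

With this split, 0xC0 counts as a boundary for backing up, and the violation is reported in a separate step.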

A./


--
Doug Ewell | Thornton, CO, USA
http://ewellic.org | @DougEwell
------------------------------------------------------------------------
From: Ian Clifton <mailto:[email protected]>
Sent: ‎8/‎28/‎2013 17:34
To: Unicode discussion <mailto:[email protected]>
Subject: Re: What to backup after corruption of code units?

On 28/08/13 23:29, Xue Fuqiao wrote:
> I see.  Thanks for all your replies!
>
> BTW I have a further question:
>

> On Wed, Aug 28, 2013 at 1:44 PM, Philippe Verdy <[email protected]> wrote:
>> - in UTF-8, you'll need to look backward between 1 to 3 positions before
>> your start position to find the leading 8-bit code unit (>= 0xC0).
> Why should this be >=0xC0?
>

Because a well‐formed UTF-8 header byte must start with at least two 1
bits. Numerically, the smallest such byte is 16#C0#.
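
The bit-pattern argument can be checked mechanically. The following Python sketch (the names are illustrative only) masks the top two bits of every possible byte value and finds the smallest one whose top two bits are both 1.

```python
TOP_TWO_BITS = 0xC0  # binary 1100 0000


def starts_with_two_ones(byte: int) -> bool:
    """True if the two most significant bits of the byte are both 1."""
    return (byte & TOP_TWO_BITS) == TOP_TWO_BITS


def is_continuation(byte: int) -> bool:
    """True if the byte has the form 10xxxxxx (0x80..0xBF)."""
    return (byte & TOP_TWO_BITS) == 0x80


# Smallest byte value with two leading 1 bits.
smallest = min(b for b in range(256) if starts_with_two_ones(b))
```

The value of smallest is 0xC0, so on a byte known not to be ASCII, the numeric test >= 0xC0 is the same test as "starts with at least two 1 bits".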

--
Ian ◎
