On Mon, Oct 19, 2015 at 1:32 PM, Doug Ewell <[email protected]> wrote:
> > ICU (but perhaps it's actually Java) seems to have a culture of
> > tolerating lone surrogates, and rules for handling lone surrogates are
> > strewn across the Unicode standards and annexes.
>
> I suspect you have an example.

I have examples from ICU processing of 16-bit Unicode strings (which are
not usually required to be well-formed UTF-16 strings):

- "Count code points" counts an unpaired surrogate as 1.
- "Move forward/backward by n code points" counts an unpaired surrogate
  as 1.
- "Lower-/title-/upper-case the string" passes through an unpaired
  surrogate as-is, like any code point that does not have case mappings.
- "Get property x of code point y" returns the property value according
  to the UCD; for example, gc(surrogate)=Cs.
- Collating a string that contains an unpaired surrogate: ICU currently
  uses the second approach from UCA section 7.1.1
  <http://www.unicode.org/reports/tr10/#Handling_Illformed>.

See
http://userguide.icu-project.org/strings#TOC-ICU:-16-bit-Unicode-strings

However, "convert from UTF-16 to UTF-8" and similar conversions treat an
unpaired surrogate as an error.

> > The Unicode collation algorithm conformance test once tested that
> > implementations of collation collated lone surrogates correctly.
> > Raising an exception was an automatic test failure! By contrast,
> > no-one's proposed collation rules for broken bits of UTF-8 characters
> > or non-minimal length forms.
>
> Are these tests still included, or did someone notice that they were in
> conflict with the standard and removed them?

We updated http://www.unicode.org/Public/UCA/latest/CollationTest.html to
say:

"These files contain test cases that include ill-formed strings, with
surrogate code points. Implementations that do not weight surrogate code
points the same way as reserved code points may filter out such lines in
the test cases, before testing for conformance."

Best regards,
markus
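
For readers without ICU at hand, the pattern described in the message (lenient string operations, strict conversion) can be sketched with Python's built-in str type. This is only an analogy, not ICU code: Python strings are sequences of code points rather than 16-bit units, and the helper name describe is made up for this illustration.

```python
import unicodedata

# A lone high surrogate (U+D800) between two ASCII letters.
# Python str permits this, much like a 16-bit string that is
# not required to be well-formed UTF-16.
s = "a\ud800b"


def describe(text):
    """Hypothetical helper: report how Python treats `text`."""
    try:
        text.encode("utf-8")  # strict conversion
        utf8_ok = True
    except UnicodeEncodeError:
        # An unpaired surrogate cannot be encoded as UTF-8.
        utf8_ok = False
    return {
        # Each code point, including the lone surrogate, counts as 1.
        "code_points": len(text),
        # Case mapping passes the surrogate through unchanged.
        "upper": text.upper(),
        # General category comes from the UCD; surrogates are Cs.
        "categories": [unicodedata.category(c) for c in text],
        "utf8_ok": utf8_ok,
    }


if __name__ == "__main__":
    print(describe(s))
```

Counting, case mapping, and property lookup all accept the unpaired surrogate, while the conversion to UTF-8 rejects it; collation is not covered by this sketch.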

