Re: Whitespace characters in Unicode

Martin J. Dürst Mon, 08 Aug 2016 00:12:50 -0700

On 2016/08/08 08:08, Sean Leonard wrote:

On 8/6/2016 11:30 AM, Doug Ewell wrote:

Additionally, in UTF-8, either LS or PS actually takes more bytes than
CR plus LF, so the "increased text size" argument also discouraged use
of the new controls.
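
The byte counts above can be checked with a minimal Python sketch. It uses only the built-in str.encode; the helper name utf8_len is made up for illustration.

```python
# Compare the UTF-8 encoded length of CR LF with that of LS and PS.
CRLF = "\r\n"    # U+000D U+000A
LS = "\u2028"    # LINE SEPARATOR
PS = "\u2029"    # PARAGRAPH SEPARATOR


def utf8_len(s: str) -> int:
    """Return the number of bytes s occupies when encoded as UTF-8."""
    return len(s.encode("utf-8"))


for name, s in [("CR LF", CRLF), ("LS", LS), ("PS", PS)]:
    print(f"{name}: {utf8_len(s)} bytes")
# CR LF: 2 bytes
# LS: 3 bytes
# PS: 3 bytes
```

So a single LS or PS costs one byte more than a CR LF pair in UTF-8.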


That is true, it takes 3 bytes. However, the original UTF-8 proposal

The term "original UTF-8 proposal" is quite misleading, because that proposal was never labeled as UTF-8. "FSS-UTF draft version" would be much better.

encoded U+0080 - U+207F in two octets:
https://en.wikipedia.org/wiki/UTF-8 :
|10xxxxxx|     |1xxxxxxx|


So, the space block /just barely makes it/. Was this intentional during
the original design of UTF-8, or just a coincidence? I think it was more
than a coincidence.

Just a coincidence, I'd say. When designing such schemes, trying to be compact is obviously one of the goals. But "how can I design it so that these two characters still make it as two bytes" isn't.

It is regrettable that the space block was too high
to work in the final version of UTF-8...maybe it should have gone below
U+07FF.
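
The "just barely makes it" arithmetic can be sketched as follows. This assumes the draft's two-octet form carries 6 + 7 = 13 payload bits starting at U+0080 (which matches the quoted range U+0080 - U+207F), and that final UTF-8's two-octet form (110xxxxx 10xxxxxx) carries 11 bits. The constant and function names are made up for illustration.

```python
# Two-octet capacity under the assumptions stated above.
DRAFT_FIRST = 0x0080
DRAFT_LAST = DRAFT_FIRST + (1 << 13) - 1   # 13 payload bits above U+0080
UTF8_LAST = (1 << 11) - 1                  # 11 payload bits

LS, PS = 0x2028, 0x2029


def fits_draft_two_octets(cp: int) -> bool:
    """True if cp falls in the draft's two-octet range (assumed layout)."""
    return DRAFT_FIRST <= cp <= DRAFT_LAST


def fits_utf8_two_octets(cp: int) -> bool:
    """True if cp is encoded in exactly two octets in final UTF-8."""
    return 0x80 <= cp <= UTF8_LAST


print(f"draft two-octet range: U+{DRAFT_FIRST:04X} - U+{DRAFT_LAST:04X}")
print(f"UTF-8 two-octet range: U+0080 - U+{UTF8_LAST:04X}")
for name, cp in [("LS", LS), ("PS", PS)]:
    print(name, fits_draft_two_octets(cp), fits_utf8_two_octets(cp))
```

Under these assumptions LS and PS sit 0x57 and 0x56 code points below the draft's two-octet limit, but far above U+07FF.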

There aren't too many line breaks (and usually even fewer paragraph breaks) in a text, so the overall effect of the encoding length for LS or PS was really not that much of an issue. The main reason why they didn't spread was that everybody was already dealing with several variants of line breaks and didn't want more of these, even at the prospect of (potentially, eventually, in the very, very long run maybe) having only a single one.


Regards,   Martin.
