On 2016/08/08 08:08, Sean Leonard wrote:
On 8/6/2016 11:30 AM, Doug Ewell wrote:
Additionally, in UTF-8, either LS or PS actually takes more bytes than
CR plus LF, so the "increased text size" argument also discouraged use
of the new controls.

That is true, it takes 3 bytes. However, the original UTF-8 proposal

The term "original UTF-8 proposal" is quite misleading, because that proposal was never labeled as UTF-8. "FSS-UTF draft version" would be much better.

encoded U+0080 - U+207F in two octets:
https://en.wikipedia.org/wiki/UTF-8 :
|10xxxxxx|1xxxxxxx|


So, the space block /just barely makes it/. Was this intentional during
the original design of UTF-8, or just a coincidence? I think it was more
than a coincidence.
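
As a rough check of the ranges quoted above, here is a minimal Python sketch. The constant and function names are illustrative assumptions, not taken from either proposal; the draft range is simply the U+0080 - U+207F figure cited above, and the bit count is read off the quoted byte pattern.

```python
# Two-octet ranges, as quoted in this thread (names are hypothetical).
FSS_UTF_DRAFT_2_OCTET = range(0x0080, 0x207F + 1)  # draft range cited above
UTF8_2_OCTET = range(0x0080, 0x07FF + 1)           # final UTF-8 two-byte range

LS = 0x2028  # LINE SEPARATOR
PS = 0x2029  # PARAGRAPH SEPARATOR

# The quoted pattern 10xxxxxx 1xxxxxxx carries 6 + 7 = 13 payload bits.
PAYLOAD_BITS = "10xxxxxx".count("x") + "1xxxxxxx".count("x")


def fits_in_two_octets(code_point, two_octet_range):
    """Return True if the code point falls inside the given two-octet range."""
    return code_point in two_octet_range


# 13 payload bits give exactly as many values as the cited draft range holds.
draft_capacity_matches = (2 ** PAYLOAD_BITS == len(FSS_UTF_DRAFT_2_OCTET))

# How many code points lie between PS and the top of the draft range.
headroom_after_ps = FSS_UTF_DRAFT_2_OCTET[-1] - PS

for name, cp in (("LS", LS), ("PS", PS)):
    print(name,
          "draft:", fits_in_two_octets(cp, FSS_UTF_DRAFT_2_OCTET),
          "final:", fits_in_two_octets(cp, UTF8_2_OCTET))
```

Under these assumptions, both characters sit inside the draft's two-octet range with only a few dozen code points to spare, and well above the U+07FF limit of the final encoding.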

Just a coincidence, I'd say. When designing such schemes, trying to be compact is obviously one of the goals. But "how can I design it so that these two characters still make it as two bytes" isn't.

It is regrettable that the space block was too high
to work in the final version of UTF-8...maybe it should have gone below
U+07FF.

There aren't too many line breaks (and usually even fewer paragraph breaks) in a text, so the overall effect of the encoding length of LS or PS was really not that much of an issue. The main reason they didn't spread was that everybody was already dealing with several variants of line breaks and didn't want more of them, even with the prospect of (potentially, eventually, in the very, very long run maybe) having only a single one.
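
To put a number on the size argument, a small Python sketch follows. The sample text (100 lines of 60 ASCII characters) and the helper names are made up for illustration; real texts will differ.

```python
LS = "\u2028"    # LINE SEPARATOR
PS = "\u2029"    # PARAGRAPH SEPARATOR
CRLF = "\r\n"    # CR plus LF


def utf8_len(text):
    """Number of bytes the text occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))


def extra_bytes_with_ls(text):
    """Bytes added if every CR LF in the text is replaced by LS."""
    return utf8_len(text.replace(CRLF, LS)) - utf8_len(text)


# Hypothetical sample: 100 lines of 60 ASCII characters, joined by CR LF.
sample = CRLF.join(["x" * 60] * 100)

print("LS bytes:", utf8_len(LS))
print("PS bytes:", utf8_len(PS))
print("CR LF bytes:", utf8_len(CRLF))
print("extra bytes for sample:", extra_bytes_with_ls(sample),
      "out of", utf8_len(sample))
```

Each break costs one extra byte (three instead of two), so for a text like this sample the growth is a small fraction of the total size.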

Regards,   Martin.
