Martin J. Dürst wrote:

>> encoded U+0080 - U+207F in two octets:
>> https://en.wikipedia.org/wiki/UTF-8 :
>> |10xxxxxx| |1xxxxxxx|
>>
>> So, the space block /just barely makes it/. Was this intentional
>> during the original design of UTF-8, or just a coincidence? I think
>> it was more than a coincidence.
>
> Just a coincidence, I'd say. When designing such schemes, trying to be
> compact is obviously one of the goals. But "how can I design it so
> that these two characters still make it as two bytes" isn't.
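To put numbers on "just barely makes it": the quoted two-byte form carries
6 + 7 = 13 payload bits, and with the 0x80 offset implied by the range in
the Wikipedia table it tops out at 0x80 + 0x1FFF = 0x207F, just past the
General Punctuation block that holds the space characters. A rough sketch
of that reading (the function name and comments are mine, not from the
original proposal text):

    def two_byte_form(cp):
        # Two-byte pattern 10xxxxxx 1xxxxxxx quoted above: 13 payload bits,
        # biased by 0x80, so exactly U+0080..U+207F, with no second spelling
        # for anything encodable in one byte.
        assert 0x0080 <= cp <= 0x207F
        v = cp - 0x0080
        return bytes([0x80 | (v >> 7),      # 10xxxxxx: high 6 bits
                      0x80 | (v & 0x7F)])   # 1xxxxxxx: low 7 bits

    print(hex(0x0080 + 0x1FFF))             # 0x207f -- the two-byte ceiling
    print(two_byte_form(0x2003).hex())      # bf83 -- U+2003 EM SPACE still fits
    print(len('\u2003'.encode('utf-8')))    # 3 -- standard UTF-8 needs three bytes
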
For actual Unicode compression schemes (SCSU and BOCU-1), certain design elements do exist to allow certain character blocks "in widespread use" to fit in minimal space. For byte-based UTFs, that wasn't a goal at all. ASCII in one byte was a given -- for compatibility with existing software, not favoritism toward English, as was sometimes claimed -- but otherwise, algorithmic simplicity and reasonable overall efficiency were more important than optimizing for particular blocks.

Replacing one encoding, with ranges like "U+2080 through U+8207F", with another that architecturally allows non-shortest sequences and then disallows them is simply a matter of different engineering solutions to the same problem. Each adds simplicity in one place and complexity in another. UTF-8 happened to tick more additional boxes (e.g. self-synchronization) than the others.

--
Doug Ewell | Thornton, CO, US | ewellic.org
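A quick illustration of the non-shortest-form point, using Python's stock
decoder rather than anything from this thread: in an offset-style scheme like
the one quoted above, each character has exactly one spelling, whereas UTF-8's
bit layout would also admit an overlong two-byte spelling of U+005C, which
conformant decoders must reject by rule.

    print(b'\x5c'.decode('utf-8'))        # '\' -- the only legal UTF-8 form
    try:
        b'\xc1\x9c'.decode('utf-8')       # overlong spelling of U+005C
    except UnicodeDecodeError as e:
        print('rejected:', e.reason)      # e.g. 'invalid start byte'

The distinct lead-byte patterns are also what gives UTF-8 the
self-synchronization mentioned above: a decoder dropped into the middle of a
stream can find the next character boundary from the byte values alone.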

