Re: Unicode String Models

2018-09-11 Thread Henri Sivonen via Unicode
On Tue, Sep 11, 2018 at 2:13 PM Eli Zaretskii wrote: > > > Date: Tue, 11 Sep 2018 13:12:40 +0300 > > From: Henri Sivonen via Unicode > > > > * I suggest splitting the "UTF-8 model" into three substantially > > different models: > > > > 1) The UTF-8 Garbage In, Garbage Out model (the model of

Re: Tamil Brahmi Short Mid Vowels

2018-09-11 Thread Asmus Freytag via Unicode
On 9/11/2018 5:02 PM, Andrew Glass via Unicode wrote: On Windows, Khmer is rendered with a dedicated shaping engine. I don't see a need to alter that engine or integrate Khmer with USE. How we fix Tai Tham, which does go to USE, is a different matter. We need to

Re: Unicode String Models

2018-09-11 Thread Eli Zaretskii via Unicode
> Date: Wed, 12 Sep 2018 00:13:52 +0200 > Cc: unicode@unicode.org > From: Hans Åberg via Unicode > > It might be useful to represent non-UTF-8 bytes as Unicode code points. One > way might be to use a codepoint to indicate high bit set followed by the byte > value with its high bit set to 0,
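The preview above is cut off, so only the part of the proposal that is visible can be illustrated: a marker code point followed by the offending byte with its high bit cleared. The sketch below is one possible reading of that idea, not the poster's actual design. The marker U+E000, the error-handler name "marker-escape", and both function names are assumptions made up for this illustration. It relies on the fact that any byte rejected by a UTF-8 decoder has its high bit set, and it is ambiguous if the input already contains the marker followed by an ASCII character.

```python
import codecs

# Hypothetical marker; the quoted message does not say which code point to use.
MARKER = "\uE000"


def _escape_high_bytes(exc):
    """Error handler: turn each undecodable byte into MARKER + (byte & 0x7F)."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    # Bytes rejected by the UTF-8 decoder are always >= 0x80.
    escaped = "".join(MARKER + chr(b & 0x7F) for b in bad)
    return escaped, exc.end


codecs.register_error("marker-escape", _escape_high_bytes)


def decode_lossless(data: bytes) -> str:
    """Decode UTF-8, escaping invalid bytes instead of failing."""
    return data.decode("utf-8", errors="marker-escape")


def encode_lossless(text: str) -> bytes:
    """Inverse of decode_lossless: restore escaped bytes, encode the rest."""
    out = bytearray()
    i = 0
    while i < len(text):
        if text[i] == MARKER and i + 1 < len(text) and ord(text[i + 1]) < 0x80:
            out.append(ord(text[i + 1]) | 0x80)  # put the high bit back
            i += 2
        else:
            out += text[i].encode("utf-8")
            i += 1
    return bytes(out)


raw = b"ab\xff\xc3("          # 0xFF and a dangling 0xC3 are not valid UTF-8
text = decode_lossless(raw)
assert encode_lossless(text) == raw
```

The round trip holds only for inputs that do not themselves contain the marker followed by an ASCII character, which is one reason a scheme like this needs a marker that cannot occur in ordinary text.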

RE: Tamil Brahmi Short Mid Vowels

2018-09-11 Thread Andrew Glass via Unicode
On Windows, Khmer is rendered with a dedicated shaping engine. I don't see a need to alter that engine or integrate Khmer with USE. How we fix Tai Tham, which does go to USE, is a different matter. We need to work through the solution for Tai Tham. I'm opposed to a generic and broad relaxation

Re: Unicode String Models

2018-09-11 Thread Philippe Verdy via Unicode
No, 0xF8..0xFF are not used at all in UTF-8, but U+00F8..U+00FF really **do** have UTF-8 encodings (using two bytes). The only safe way to represent arbitrary bytes within strings when they are not valid UTF-8 is to use invalid UTF-8 sequences, i.e. by using a "UTF-8-like" private extension of
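The distinction between the byte values 0xF8..0xFF and the code points U+00F8..U+00FF can be checked directly. The following is a small sketch using only Python's built-in UTF-8 codec; the variable names are chosen for this illustration. It collects every byte value that appears in the UTF-8 encoding of any Unicode scalar value and confirms that 0xF8..0xFF never occur, while each of U+00F8..U+00FF encodes as two bytes.

```python
# Collect every byte value that appears in the UTF-8 encoding of some scalar value.
used_bytes = set()
for cp in range(0x110000):
    if 0xD800 <= cp <= 0xDFFF:
        continue  # surrogates are not scalar values and cannot be encoded
    used_bytes.update(chr(cp).encode("utf-8"))

unused_bytes = sorted(set(range(256)) - used_bytes)

# The byte values 0xF8..0xFF never occur in valid UTF-8 ...
assert all(b in unused_bytes for b in range(0xF8, 0x100))

# ... but the code points U+00F8..U+00FF each have a two-byte encoding.
two_byte_forms = {cp: chr(cp).encode("utf-8") for cp in range(0xF8, 0x100)}
assert all(len(enc) == 2 for enc in two_byte_forms.values())

print([hex(b) for b in unused_bytes])
print(two_byte_forms[0xF8])
```

The loop covers all 1,112,064 scalar values, so it takes a moment to run; it is meant as a one-off check rather than something to call repeatedly.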

Re: Tamil Brahmi Short Mid Vowels

2018-09-11 Thread Richard Wordingham via Unicode
On Wed, 29 Aug 2018 21:42:57 + Andrew Glass via Unicode wrote: > Thank you Richard and Shriramana for bringing up this interesting > problem. > > I agree we need to fix this. I don’t want to fix this with a font > hack or change to USE cluster rules or properties. I think the right > place

Re: Unicode String Models

2018-09-11 Thread J Decker via Unicode
On Tue, Sep 11, 2018 at 3:15 PM Hans Åberg via Unicode wrote: > > > On 11 Sep 2018, at 23:48, Richard Wordingham via Unicode < > unicode@unicode.org> wrote: > > > > On Tue, 11 Sep 2018 21:10:03 +0200 > > Hans Åberg via Unicode wrote: > > > >> Indeed, before UTF-8, in the 1990s, I recall some

Re: Unicode String Models

2018-09-11 Thread Hans Åberg via Unicode
> On 11 Sep 2018, at 23:48, Richard Wordingham via Unicode > wrote: > > On Tue, 11 Sep 2018 21:10:03 +0200 > Hans Åberg via Unicode wrote: > >> Indeed, before UTF-8, in the 1990s, I recall some Russians using >> LaTeX files with sections in different Cyrillic and Latin encodings, >>

Re: Unicode String Models

2018-09-11 Thread Richard Wordingham via Unicode
On Tue, 11 Sep 2018 21:10:03 +0200 Hans Åberg via Unicode wrote: > Indeed, before UTF-8, in the 1990s, I recall some Russians using > LaTeX files with sections in different Cyrillic and Latin encodings, > changing the editor encoding while typing. Rather like some of the old Unicode list

Re: Unicode String Models

2018-09-11 Thread Hans Åberg via Unicode
> On 11 Sep 2018, at 20:40, Eli Zaretskii wrote: > >> From: Hans Åberg >> Date: Tue, 11 Sep 2018 20:14:30 +0200 >> Cc: hsivo...@hsivonen.fi, >> unicode@unicode.org >> >> If one encounters a file with mixed encodings, it is good to be able to view >> its contents and then convert it, as I

Re: Unicode String Models

2018-09-11 Thread Eli Zaretskii via Unicode
> From: Hans Åberg > Date: Tue, 11 Sep 2018 20:14:30 +0200 > Cc: hsivo...@hsivonen.fi, > unicode@unicode.org > > If one encounters a file with mixed encodings, it is good to be able to view > its contents and then convert it, as I see one can do in Emacs. Yes. And mixed encodings is not the

Re: Unicode String Models

2018-09-11 Thread Hans Åberg via Unicode
> On 11 Sep 2018, at 19:21, Eli Zaretskii wrote: > >> From: Hans Åberg >> Date: Tue, 11 Sep 2018 19:13:28 +0200 >> Cc: Henri Sivonen , >> unicode@unicode.org >> >>> In Emacs, each raw byte belonging >>> to a byte sequence which is invalid under UTF-8 is represented as a >>> special multibyte

Re: Unicode String Models

2018-09-11 Thread Eli Zaretskii via Unicode
> From: Hans Åberg > Date: Tue, 11 Sep 2018 19:13:28 +0200 > Cc: Henri Sivonen , > unicode@unicode.org > > > In Emacs, each raw byte belonging > > to a byte sequence which is invalid under UTF-8 is represented as a > > special multibyte sequence. IOW, Emacs's internal representation > >

Re: Unicode String Models

2018-09-11 Thread Hans Åberg via Unicode
> On 11 Sep 2018, at 13:13, Eli Zaretskii via Unicode > wrote: > > In Emacs, each raw byte belonging > to a byte sequence which is invalid under UTF-8 is represented as a > special multibyte sequence. IOW, Emacs's internal representation > extends UTF-8 with multibyte sequences it uses to

Re: Unicode String Models

2018-09-11 Thread Mark Davis ☕️ via Unicode
These are all interesting and useful comments. I'll be responding once I get a bit of free time, probably Friday or Saturday. Mark On Tue, Sep 11, 2018 at 4:16 AM Eli Zaretskii via Unicode < unicode@unicode.org> wrote: > > Date: Tue, 11 Sep 2018 13:12:40 +0300 > > From: Henri Sivonen via

Re: Unicode String Models

2018-09-11 Thread Eli Zaretskii via Unicode
> Date: Tue, 11 Sep 2018 13:12:40 +0300 > From: Henri Sivonen via Unicode > > * I suggest splitting the "UTF-8 model" into three substantially > different models: > > 1) The UTF-8 Garbage In, Garbage Out model (the model of Go): No > UTF-8-related operations are performed when ingesting

Re: Unicode String Models

2018-09-11 Thread Henri Sivonen via Unicode
On Sat, Sep 8, 2018 at 7:36 PM Mark Davis ☕️ via Unicode wrote: > > I recently did some extensive revisions of a paper on Unicode string models > (APIs). Comments are welcome. > > https://docs.google.com/document/d/1wuzzMOvKOJw93SWZAqoim1VUl9mloUxE0W6Ki_G23tw/edit# * The Grapheme Cluster Model