On Tue, Sep 11, 2018 at 2:13 PM Eli Zaretskii wrote:
>
> > Date: Tue, 11 Sep 2018 13:12:40 +0300
> > From: Henri Sivonen via Unicode
> >
> > * I suggest splitting the "UTF-8 model" into three substantially
> > different models:
> >
> > 1) The UTF-8 Garbage In, Garbage Out model (the model of
On 9/11/2018 5:02 PM, Andrew Glass via
Unicode wrote:
On Windows, Khmer is rendered with a dedicated shaping engine. I don't see a need to alter that engine or integrate Khmer with USE. How we fix Tai Tham, which does go to USE, is a different matter. We need to
> Date: Wed, 12 Sep 2018 00:13:52 +0200
> Cc: unicode@unicode.org
> From: Hans Åberg via Unicode
>
> It might be useful to represent non-UTF-8 bytes as Unicode code points. One
> way might be to use a codepoint to indicate high bit set followed by the byte
> value with its high bit set to 0,
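A rough sketch of the marker idea quoted above, in Python. The marker code point (U+E000, a private-use character) and both function names are assumptions made for illustration; the quoted message does not say which code point to use. Only bytes with the high bit set can be invalid in UTF-8, so each one becomes the marker followed by the byte with its high bit cleared.

```python
MARKER = "\uE000"  # hypothetical marker meaning "next value had its high bit set"

def bytes_to_text(data: bytes) -> str:
    """Decode UTF-8, turning each invalid byte into MARKER + chr(byte & 0x7F)."""
    out = []
    # surrogateescape maps each undecodable byte 0x80..0xFF to U+DC80..U+DCFF,
    # which lets us find the offending bytes one at a time.
    for ch in data.decode("utf-8", errors="surrogateescape"):
        cp = ord(ch)
        if 0xDC80 <= cp <= 0xDCFF:
            byte = cp - 0xDC00              # recover the original byte value
            out.append(MARKER + chr(byte & 0x7F))
        else:
            out.append(ch)
    return "".join(out)

def text_to_bytes(text: str) -> bytes:
    """Inverse of bytes_to_text for strings that bytes_to_text produced."""
    out = bytearray()
    i = 0
    while i < len(text):
        if text[i] == MARKER and i + 1 < len(text) and ord(text[i + 1]) < 0x80:
            out.append(ord(text[i + 1]) | 0x80)   # restore the high bit
            i += 2
        else:
            out += text[i].encode("utf-8")
            i += 1
    return bytes(out)

if __name__ == "__main__":
    raw = b"abc\xff\x80d"
    text = bytes_to_text(raw)
    print([hex(ord(c)) for c in text])
    print(text_to_bytes(text) == raw)
```

One weakness of this sketch: if the input already contains the marker code point followed by an ASCII character, the reverse mapping cannot tell it apart from an escaped byte, so the round trip is only safe when the marker never occurs in real text.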
On Windows, Khmer is rendered with a dedicated shaping engine. I don't see a
need to alter that engine or integrate Khmer with USE. How we fix Tai Tham,
which does go to USE, is a different matter. We need to work through the
solution for Tai Tham. I'm opposed to a generic and broad relaxation
No, 0xF8..0xFF are not used at all in UTF-8, but U+00F8..U+00FF really
**do** have UTF-8 encodings (using two bytes).
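The distinction between byte values and code points can be checked directly. The following Python sketch (function names are made up for illustration) collects every byte value that occurs in the UTF-8 encoding of some Unicode scalar value, and shows that the code points U+00F8..U+00FF each encode to two bytes.

```python
def bytes_used_by_utf8() -> set:
    """Return every byte value that appears in the UTF-8 form of some scalar value."""
    used = set()
    for cp in range(0x110000):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogates are not scalar values and have no UTF-8 form
        used.update(chr(cp).encode("utf-8"))
    return used

def is_valid_utf8(data: bytes) -> bool:
    """True if data decodes as strict UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True

if __name__ == "__main__":
    used = bytes_used_by_utf8()
    # Bytes 0xF8..0xFF never occur in well-formed UTF-8.
    print(any(b in used for b in range(0xF8, 0x100)))
    # Code points U+00F8..U+00FF are encoded with two bytes each.
    for cp in range(0xF8, 0x100):
        print(f"U+{cp:04X}", chr(cp).encode("utf-8").hex(" "))
```

This takes a second or so because it walks all scalar values; the point is only that the byte 0xF8 and the code point U+00F8 are unrelated things.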
The only safe way to represent arbitrary bytes within strings when they are
not valid UTF-8 is to use invalid UTF-8 sequences, i.e. by using a
"UTF-8-like" private extension of
On Wed, 29 Aug 2018 21:42:57 +
Andrew Glass via Unicode wrote:
> Thank you Richard and Shriramana for bringing up this interesting
> problem.
>
> I agree we need to fix this. I don’t want to fix this with a font
> hack or change to USE cluster rules or properties. I think the right
> place
On Tue, Sep 11, 2018 at 3:15 PM Hans Åberg via Unicode
wrote:
>
> > On 11 Sep 2018, at 23:48, Richard Wordingham via Unicode <
> unicode@unicode.org> wrote:
> >
> > On Tue, 11 Sep 2018 21:10:03 +0200
> > Hans Åberg via Unicode wrote:
> >
> >> Indeed, before UTF-8, in the 1990s, I recall some
> On 11 Sep 2018, at 23:48, Richard Wordingham via Unicode
> wrote:
>
> On Tue, 11 Sep 2018 21:10:03 +0200
> Hans Åberg via Unicode wrote:
>
>> Indeed, before UTF-8, in the 1990s, I recall some Russians using
>> LaTeX files with sections in different Cyrillic and Latin encodings,
>>
On Tue, 11 Sep 2018 21:10:03 +0200
Hans Åberg via Unicode wrote:
> Indeed, before UTF-8, in the 1990s, I recall some Russians using
> LaTeX files with sections in different Cyrillic and Latin encodings,
> changing the editor encoding while typing.
Rather like some of the old Unicode list
> On 11 Sep 2018, at 20:40, Eli Zaretskii wrote:
>
>> From: Hans Åberg
>> Date: Tue, 11 Sep 2018 20:14:30 +0200
>> Cc: hsivo...@hsivonen.fi,
>> unicode@unicode.org
>>
>> If one encounters a file with mixed encodings, it is good to be able to view
>> its contents and then convert it, as I
> From: Hans Åberg
> Date: Tue, 11 Sep 2018 20:14:30 +0200
> Cc: hsivo...@hsivonen.fi,
> unicode@unicode.org
>
> If one encounters a file with mixed encodings, it is good to be able to view
> its contents and then convert it, as I see one can do in Emacs.
Yes. And mixed encodings is not the
> On 11 Sep 2018, at 19:21, Eli Zaretskii wrote:
>
>> From: Hans Åberg
>> Date: Tue, 11 Sep 2018 19:13:28 +0200
>> Cc: Henri Sivonen,
>> unicode@unicode.org
>>
>>> In Emacs, each raw byte belonging
>>> to a byte sequence which is invalid under UTF-8 is represented as a
>>> special multibyte
> From: Hans Åberg
> Date: Tue, 11 Sep 2018 19:13:28 +0200
> Cc: Henri Sivonen,
> unicode@unicode.org
>
> > In Emacs, each raw byte belonging
> > to a byte sequence which is invalid under UTF-8 is represented as a
> > special multibyte sequence. IOW, Emacs's internal representation
> >
> On 11 Sep 2018, at 13:13, Eli Zaretskii via Unicode
> wrote:
>
> In Emacs, each raw byte belonging
> to a byte sequence which is invalid under UTF-8 is represented as a
> special multibyte sequence. IOW, Emacs's internal representation
> extends UTF-8 with multibyte sequences it uses to
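For comparison, here is a small Python sketch of the same general idea: keeping raw bytes that are invalid under UTF-8 so that the original data can be written back unchanged. This is not Emacs's mechanism. It uses Python's standard "surrogateescape" error handler, which maps each undecodable byte to a lone surrogate in U+DC80..U+DCFF; the two helper names are made up for illustration.

```python
def decode_lossless(data: bytes) -> str:
    """Decode UTF-8, keeping each invalid byte as a lone surrogate U+DC80..U+DCFF."""
    return data.decode("utf-8", errors="surrogateescape")

def encode_lossless(text: str) -> bytes:
    """Encode to UTF-8, turning escaped surrogates back into the original raw bytes."""
    return text.encode("utf-8", errors="surrogateescape")

if __name__ == "__main__":
    # A Latin-1 e-acute (0xE9, invalid here) followed by a valid UTF-8 e-acute.
    raw = b"caf\xe9 \xc3\xa9"
    text = decode_lossless(raw)
    print([hex(ord(c)) for c in text])
    print(encode_lossless(text) == raw)
```

The decoded string is not valid Unicode text (it contains lone surrogates), which is the price of the round trip; a strict encode without the error handler will refuse it.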
These are all interesting and useful comments. I'll be responding once I
get a bit of free time, probably Friday or Saturday.
Mark
On Tue, Sep 11, 2018 at 4:16 AM Eli Zaretskii via Unicode <
unicode@unicode.org> wrote:
> > Date: Tue, 11 Sep 2018 13:12:40 +0300
> > From: Henri Sivonen via
> Date: Tue, 11 Sep 2018 13:12:40 +0300
> From: Henri Sivonen via Unicode
>
> * I suggest splitting the "UTF-8 model" into three substantially
> different models:
>
> 1) The UTF-8 Garbage In, Garbage Out model (the model of Go): No
> UTF-8-related operations are performed when ingesting
On Sat, Sep 8, 2018 at 7:36 PM Mark Davis ☕️ via Unicode
wrote:
>
> I recently did some extensive revisions of a paper on Unicode string models
> (APIs). Comments are welcome.
>
> https://docs.google.com/document/d/1wuzzMOvKOJw93SWZAqoim1VUl9mloUxE0W6Ki_G23tw/edit#
* The Grapheme Cluster Model