On Tue, Oct 2, 2018 at 3:04 PM Mark Davis ☕️ wrote:
>
> * The Python 3.3 model mentions the disadvantages of memory usage
>> cliffs but doesn't mention the associated performance cliffs. It would
>> be good to also mention that when a string manipulation causes the
>> storage to expand or
On 3 October 2018 at 15:41:42, Mark Davis ☕️ via Unicode (unicode@unicode.org)
wrote:
> Let me clear that up; I meant that "the underlying storage never contains
> something that would need to be represented as a surrogate code point." Of
> course, UTF-16 does need surrogate code units. What #1
Mark
On Wed, Oct 3, 2018 at 3:01 PM Daniel Bünzli
wrote:
> On 3 October 2018 at 09:17:10, Mark Davis ☕️ via Unicode (
> unicode@unicode.org) wrote:
>
> > There are two main choices for a scalar-value API:
> >
> > 1. Guarantee that the storage never contains surrogates. This is the
> > simplest
On 3 October 2018 at 09:17:10, Mark Davis ☕️ via Unicode (unicode@unicode.org)
wrote:
> There are two main choices for a scalar-value API:
>
> 1. Guarantee that the storage never contains surrogates. This is the
> simplest model.
> 2. Substitute U+FFFD for surrogates when the API returns code
Mark
On Tue, Oct 2, 2018 at 8:31 PM Daniel Bünzli
wrote:
> On 2 October 2018 at 14:03:48, Mark Davis ☕️ via Unicode (
> unicode@unicode.org) wrote:
>
> > Because of performance and storage consideration, you need to consider the
> > possible internal data structures when you are looking at
On 2 October 2018 at 14:03:48, Mark Davis ☕️ via Unicode (unicode@unicode.org)
wrote:
> Because of performance and storage consideration, you need to consider the
> possible internal data structures when you are looking at something as
> low-level as strings. But most of the 'model's in the
Whether or not it is well suited, that's probably water under the bridge at
this point. Think of it as jargon by now; after all, there are
lots of cases like that: a "near miss" wasn't nearly a miss, it was nearly
a hit.
Mark
On Sun, Sep 9, 2018 at 10:56 AM Janusz S. Bień wrote:
> On
Mark
On Tue, Sep 11, 2018 at 12:17 PM Henri Sivonen via Unicode <
unicode@unicode.org> wrote:
> On Sat, Sep 8, 2018 at 7:36 PM Mark Davis ☕️ via Unicode
> wrote:
> >
> > I recently did some extensive revisions of a paper on Unicode string
> > models (APIs). Comments are welcome.
> >
> >
>
Mark
On Sun, Sep 9, 2018 at 3:42 PM Daniel Bünzli
wrote:
> Hello,
>
> I find your notion of "model" and presentation a bit confusing since it
> conflates what I would call the internal representation and the API.
>
> The internal representation defines how the Unicode text is stored and
>
Mark
On Sun, Sep 9, 2018 at 10:03 AM Richard Wordingham via Unicode <
unicode@unicode.org> wrote:
> On Sat, 8 Sep 2018 18:36:00 +0200
> Mark Davis ☕️ via Unicode wrote:
>
> > I recently did some extensive revisions of a paper on Unicode string
> > models (APIs). Comments are welcome.
> >
> >
>
Thanks, added a quote from you on that; see if it looks ok.
Mark
On Sat, Sep 8, 2018 at 9:20 PM John Cowan wrote:
> This paper makes the default assumption that the internal storage of a
> string is a featureless array. If this assumption is abandoned, it is
> possible to get O(1) indexes
Thanks to all for comments. Just revised the text in https://goo.gl/neguxb.
Mark
On Sat, Sep 8, 2018 at 6:36 PM Mark Davis ☕️ wrote:
> I recently did some extensive revisions of a paper on Unicode string
> models (APIs). Comments are welcome.
>
>
>
On Wed, Sep 12, 2018 at 11:37 AM Hans Åberg via Unicode
wrote:
> The idea is to extend Unicode itself, so that those bytes can be represented
> by legal codepoints.
Extending Unicode itself would likely create more problems than it
would solve. Extending the value space of Unicode scalar values
> On 12 Sep 2018, at 04:34, Eli Zaretskii via Unicode
> wrote:
>
>> Date: Wed, 12 Sep 2018 00:13:52 +0200
>> Cc: unicode@unicode.org
>> From: Hans Åberg via Unicode
>>
>> It might be useful to represent non-UTF-8 bytes as Unicode code points. One
>> way might be to use a codepoint to
On Tue, Sep 11, 2018 at 2:13 PM Eli Zaretskii wrote:
>
> > Date: Tue, 11 Sep 2018 13:12:40 +0300
> > From: Henri Sivonen via Unicode
> >
> > * I suggest splitting the "UTF-8 model" into three substantially
> > different models:
> >
> > 1) The UTF-8 Garbage In, Garbage Out model (the model of
> Date: Wed, 12 Sep 2018 00:13:52 +0200
> Cc: unicode@unicode.org
> From: Hans Åberg via Unicode
>
> It might be useful to represent non-UTF-8 bytes as Unicode code points. One
> way might be to use a codepoint to indicate high bit set followed by the byte
> value with its high bit set to 0,
No, 0xF8..0xFF are not used at all in UTF-8; but U+00F8..U+00FF really
**do** have UTF-8 encodings (using two bytes).
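
The distinction between the byte values 0xF8..0xFF and the code points U+00F8..U+00FF can be checked with a minimal Python sketch. The helper name `is_valid_utf8` and the two result variables are illustrative only; the sketch relies on nothing beyond the built-in `str.encode` and `bytes.decode`.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as well-formed UTF-8 (illustrative helper)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# The code points U+00F8..U+00FF each have a two-byte UTF-8 encoding.
two_byte_encodings = {cp: chr(cp).encode("utf-8") for cp in range(0xF8, 0x100)}

# The byte values 0xF8..0xFF, taken on their own, are never valid UTF-8.
lone_bytes_valid = [is_valid_utf8(bytes([b])) for b in range(0xF8, 0x100)]

for cp, encoded in two_byte_encodings.items():
    print(f"U+{cp:04X} -> {encoded.hex(' ')}")
```

None of the encoded forms contains a byte in the range 0xF8..0xFF, which is why those byte values are free for a private "UTF-8-like" extension but the code points are not.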
The only safe way to represent arbitrary bytes within strings when they are
not valid UTF-8 is to use invalid UTF-8 sequences, i.e. by using a
"UTF-8-like" private extension of
On Tue, Sep 11, 2018 at 3:15 PM Hans Åberg via Unicode
wrote:
>
> > On 11 Sep 2018, at 23:48, Richard Wordingham via Unicode <
> unicode@unicode.org> wrote:
> >
> > On Tue, 11 Sep 2018 21:10:03 +0200
> > Hans Åberg via Unicode wrote:
> >
> >> Indeed, before UTF-8, in the 1990s, I recall some
> On 11 Sep 2018, at 23:48, Richard Wordingham via Unicode
> wrote:
>
> On Tue, 11 Sep 2018 21:10:03 +0200
> Hans Åberg via Unicode wrote:
>
>> Indeed, before UTF-8, in the 1990s, I recall some Russians using
>> LaTeX files with sections in different Cyrillic and Latin encodings,
>>
On Tue, 11 Sep 2018 21:10:03 +0200
Hans Åberg via Unicode wrote:
> Indeed, before UTF-8, in the 1990s, I recall some Russians using
> LaTeX files with sections in different Cyrillic and Latin encodings,
> changing the editor encoding while typing.
Rather like some of the old Unicode list
> On 11 Sep 2018, at 20:40, Eli Zaretskii wrote:
>
>> From: Hans Åberg
>> Date: Tue, 11 Sep 2018 20:14:30 +0200
>> Cc: hsivo...@hsivonen.fi,
>> unicode@unicode.org
>>
>> If one encounters a file with mixed encodings, it is good to be able to view
>> its contents and then convert it, as I
> From: Hans Åberg
> Date: Tue, 11 Sep 2018 20:14:30 +0200
> Cc: hsivo...@hsivonen.fi,
> unicode@unicode.org
>
> If one encounters a file with mixed encodings, it is good to be able to view
> its contents and then convert it, as I see one can do in Emacs.
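
As an illustration of viewing and then converting a mixed-encoding file, here is a minimal Python sketch. It does not show how Emacs does it; it uses a different technique, Python's `surrogateescape` error handler, which maps each undecodable byte to a lone surrogate in U+DC80..U+DCFF so the original bytes can be recovered. The sample bytes and the helper `repair_as_latin1` are assumptions made for the example.

```python
# Mixed content: UTF-8 text with one stray Latin-1 byte (0xE9 is "é" in Latin-1).
raw = "caf".encode("utf-8") + b"\xe9" + " naïve".encode("utf-8")

# View: decode as UTF-8, carrying each invalid byte as a lone surrogate.
text = raw.decode("utf-8", errors="surrogateescape")

# The original bytes are recoverable, so nothing is lost by viewing.
round_tripped = text.encode("utf-8", errors="surrogateescape")


def repair_as_latin1(s: str) -> str:
    """Convert: reinterpret every escaped byte as a Latin-1 character.

    Illustrative helper; assumes the stray bytes really are Latin-1.
    """
    return "".join(
        chr(ord(ch) - 0xDC00) if 0xDC80 <= ord(ch) <= 0xDCFF else ch
        for ch in s
    )


repaired = repair_as_latin1(text)
```

Here `round_tripped` equals `raw`, and `repaired` is ordinary Unicode text that can be saved as valid UTF-8.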
Yes. And mixed encodings is not the
> On 11 Sep 2018, at 19:21, Eli Zaretskii wrote:
>
>> From: Hans Åberg
>> Date: Tue, 11 Sep 2018 19:13:28 +0200
>> Cc: Henri Sivonen ,
>> unicode@unicode.org
>>
>>> In Emacs, each raw byte belonging
>>> to a byte sequence which is invalid under UTF-8 is represented as a
>>> special multibyte
> From: Hans Åberg
> Date: Tue, 11 Sep 2018 19:13:28 +0200
> Cc: Henri Sivonen ,
> unicode@unicode.org
>
> > In Emacs, each raw byte belonging
> > to a byte sequence which is invalid under UTF-8 is represented as a
> > special multibyte sequence. IOW, Emacs's internal representation
> >
> On 11 Sep 2018, at 13:13, Eli Zaretskii via Unicode
> wrote:
>
> In Emacs, each raw byte belonging
> to a byte sequence which is invalid under UTF-8 is represented as a
> special multibyte sequence. IOW, Emacs's internal representation
> extends UTF-8 with multibyte sequences it uses to
These are all interesting and useful comments. I'll be responding once I
get a bit of free time, probably Friday or Saturday.
Mark
On Tue, Sep 11, 2018 at 4:16 AM Eli Zaretskii via Unicode <
unicode@unicode.org> wrote:
> > Date: Tue, 11 Sep 2018 13:12:40 +0300
> > From: Henri Sivonen via
> Date: Tue, 11 Sep 2018 13:12:40 +0300
> From: Henri Sivonen via Unicode
>
> * I suggest splitting the "UTF-8 model" into three substantially
> different models:
>
> 1) The UTF-8 Garbage In, Garbage Out model (the model of Go): No
> UTF-8-related operations are performed when ingesting
On Sat, Sep 8, 2018 at 7:36 PM Mark Davis ☕️ via Unicode
wrote:
>
> I recently did some extensive revisions of a paper on Unicode string models
> (APIs). Comments are welcome.
>
> https://docs.google.com/document/d/1wuzzMOvKOJw93SWZAqoim1VUl9mloUxE0W6Ki_G23tw/edit#
* The Grapheme Cluster Model
> On 9 Sep 2018, at 21:20, Eli Zaretskii via Unicode
> wrote:
>
> In Emacs, the gap is always where the text is inserted or deleted, be
> it in the middle of text or at its end.
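
The idea of a gap that follows the edit position can be sketched in Python as follows. This is a toy model, not Emacs's implementation: the class name `GapBuffer`, the default gap size, and the growth policy are all assumptions, and the buffer holds Python characters where a real editor would hold bytes.

```python
class GapBuffer:
    """Toy gap buffer: text is stored in one list with a movable free region."""

    def __init__(self, text="", gap_size=16):
        self._buf = list(text) + [None] * gap_size
        self._gap_start = len(text)      # first free slot
        self._gap_end = len(self._buf)   # one past the last free slot

    def __len__(self):
        return len(self._buf) - (self._gap_end - self._gap_start)

    def _move_gap(self, pos):
        """Slide the gap so that it starts at text position pos."""
        if not 0 <= pos <= len(self):
            raise IndexError("position out of range")
        while self._gap_start > pos:     # move the gap left
            self._gap_start -= 1
            self._gap_end -= 1
            self._buf[self._gap_end] = self._buf[self._gap_start]
        while self._gap_start < pos:     # move the gap right
            self._buf[self._gap_start] = self._buf[self._gap_end]
            self._gap_start += 1
            self._gap_end += 1

    def _grow(self, extra):
        """Add extra free slots to the end of the gap."""
        self._buf[self._gap_end:self._gap_end] = [None] * extra
        self._gap_end += extra

    def insert(self, pos, s):
        self._move_gap(pos)
        if self._gap_end - self._gap_start < len(s):
            self._grow(len(s) + 16)
        for ch in s:                     # fill the gap from its start
            self._buf[self._gap_start] = ch
            self._gap_start += 1

    def delete(self, pos, count):
        """Delete count characters after pos by widening the gap over them."""
        self._move_gap(pos)
        self._gap_end = min(self._gap_end + count, len(self._buf))

    def text(self):
        return "".join(self._buf[:self._gap_start] + self._buf[self._gap_end:])


buf = GapBuffer("hello world")
buf.insert(5, ",")      # the gap moves to position 5, then absorbs the comma
buf.delete(0, 1)        # the gap moves to position 0 and swallows "h"
buf.insert(0, "H")
print(buf.text())
```

Consecutive edits at the same position cost nothing to reposition; only a jump to a distant position pays for moving the gap.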
>
>> All editors I have seen treat the text as ordered collections of small
>> buffers (these small buffers may
> From: Philippe Verdy
> Date: Sun, 9 Sep 2018 19:35:47 +0200
> Cc: Richard Wordingham ,
> unicode Unicode Discussion
>
> In Emacs, buffer text is a character string with a gap, actually.
>
> A text buffer with gaps is a complex structure, not just a plain string.
The difference is
Le dim. 9 sept. 2018 à 17:53, Eli Zaretskii a écrit :
> > Text editors use various indexing caches always, to manage memory, I/O,
> > and allow working on large texts even on systems with low memory available.
> > As much as possible they attempt to use the OS-level caches of the filesystem.
> Date: Sun, 9 Sep 2018 16:10:26 +0200
> Cc: unicode Unicode Discussion
> From: Philippe Verdy via Unicode
>
> In practice, we use a memory by preparing the "small memory" while
> instantiating a new iterator that will
> process the whole string (which may not be fully loaded in memory, in
Le dim. 9 sept. 2018 à 10:10, Richard Wordingham via Unicode <
unicode@unicode.org> a écrit :
> On Sat, 8 Sep 2018 18:36:00 +0200
> Mark Davis ☕️ via Unicode wrote:
>
> > I recently did some extensive revisions of a paper on Unicode string
> > models (APIs). Comments are welcome.
> >
> >
>
Hello,
I find your notion of "model" and presentation a bit confusing since it
conflates what I would call the internal representation and the API.
The internal representation defines how the Unicode text is stored and should
not really matter to the end user of the string data structure.
On Sat, Sep 08 2018 at 18:36 +0200, Mark Davis ☕️ via Unicode wrote:
> I recently did some extensive revisions of a paper on Unicode string models
> (APIs). Comments are welcome.
>
> https://docs.google.com/document/d/1wuzzMOvKOJw93SWZAqoim1VUl9mloUxE0W6Ki_G23tw/edit#
It's a good opportunity to
Thanks, excellent comments. While it is clear that some string models have
more complicated structures (with their own pros and cons), my focus was on
simple internal structures. The focus was also on immutable strings — and
the tradeoffs for mutable ones can be quite different — and that needs to
On Sat, 8 Sep 2018 18:36:00 +0200
Mark Davis ☕️ via Unicode wrote:
> I recently did some extensive revisions of a paper on Unicode string
> models (APIs). Comments are welcome.
>
> https://docs.google.com/document/d/1wuzzMOvKOJw93SWZAqoim1VUl9mloUxE0W6Ki_G23tw/edit#
Theoretically at least,
a programmer (excepting a
knowledge of HTML / CSS and a little JavaScript and maybe just a bit of other
stuff).
From: cewcat...@hotmail.com
To: verd...@wanadoo.fr
CC: unicode@unicode.org
Subject: RE: Unicode String Models -- minor proofreading nit (was: Unicode
String Models)
Date: Sat, 28 Jul
2012/7/31 CE Whitehead cewcat...@hotmail.com
Hmm, after checking several unicode documents and some of the faq (
http://unicode.org/faq/collation.html), my understanding is that using a
non-character code point is the best solution here; I don't know which
non-character code point is best,
From: verd...@wanadoo.fr
Date: Fri, 27 Jul 2012 03:17:07 +0200
Subject: Re: Unicode String Models -- minor proofreading nit (was: Unicode
String Models)
To: m...@macchiato.com
CC: cewcat...@hotmail.com; unicode@unicode.org
I just wonder where the XSS attack is really an issue here
David Starner wrote (Saturday, July 21, 2012 12:02 AM):
The question of whether to allow non-ASCII characters in variables is open.
I don't see why. Yes, a lot of organizations will use ASCII only, but
not all programming is done by large international organizations. For
personal hacking, or
Hi, I have one minor comment:
* * *
Validation; par 3, comment in parentheses
. . . (you never want to just delete it; that has security problems).
{ COMMENT: would it be helpful here to have a reference here to the unicode
security document that discusses this issue -- TR 36, 3.5
Thanks, good suggestion.
Mark https://plus.google.com/114199149796022210033
— Il meglio è l’inimico del bene —
On Thu, Jul 26, 2012 at 12:40 PM, CE Whitehead cewcat...@hotmail.com wrote:
Validation; par 3, comment in parentheses
. . . (you never want to just delete it; that has
I just wonder where the XSS attack is really an issue here. XSS
attacks involve bypassing the document source domain in order to
attempt to use or insert data found in another document issued or
managed by another domain, in a distinct security realm.
What is a more serious issue would be the
On Fri, 20 Jul 2012 15:01:42 -0700
David Starner prosfil...@gmail.com wrote:
The question of whether to allow non-ASCII characters in variables
is open.
It's not like Chinese variables
with Chinese comments are going to be much harder to debug for the
English speaker than English variables
On Fri, Jul 20, 2012 at 1:31 PM, Mark Davis ☕ m...@macchiato.com wrote:
I put together some notes on different ways for programming languages to
handle Unicode at a low level. Comments welcome.
Macchiato »
Many programming languages (and most modern software) have moved to Unicode
model of
That means that it is best to optimize for BMP characters (and as a
subset, ASCII and Latin-1), and fall into a ‘slow path’ when a
supplementary character is encountered.
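
A minimal Python sketch of that fast-path/slow-path split, iterating over UTF-16 code units. The function names are illustrative, and replacing an unpaired surrogate with U+FFFD is one possible policy, chosen here as an assumption rather than taken from the article.

```python
def utf16_units(s: str) -> list:
    """Return the UTF-16 code units of s as integers (illustrative helper)."""
    data = s.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]


def code_points(units):
    """Yield code points from UTF-16 code units, BMP-first."""
    i, n = 0, len(units)
    while i < n:
        u = units[i]
        i += 1
        if u < 0xD800 or u > 0xDFFF:
            # Fast path: one code unit is one BMP code point.
            yield u
            continue
        # Slow path: u is a surrogate code unit.
        if u <= 0xDBFF and i < n and 0xDC00 <= units[i] <= 0xDFFF:
            # Lead surrogate followed by a trail surrogate: combine the pair.
            yield 0x10000 + ((u - 0xD800) << 10) + (units[i] - 0xDC00)
            i += 1
        else:
            # Unpaired surrogate: substitute U+FFFD (assumed policy).
            yield 0xFFFD


decoded = list(code_points(utf16_units("aé€😀")))
print([f"U+{cp:04X}" for cp in decoded])
```

The single range check on the fast path is the same for ASCII, Latin-1, and the rest of the BMP, so the cost of supplementary characters is paid only when a surrogate code unit actually appears.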
I'm concerned about the statement/implication that one can optimize
for ASCII and Latin-1. It's too easy for a lot of
Mark wrote: “I put together some notes on different ways for programming
languages to handle Unicode at a low level. Comments welcome.”
Nice article as far as it goes and additions are forthcoming. In addition to
multiple code units per character in UTF-8 and UTF-16, there are variation
Thanks, nice article. We got into some of those hairy caret positioning
issues back at Apple; we even had a design that would associate a series of
lines (which could be slanted and positioned) with a ligature, but
ultimately 1/m gets you 99% of the value, with very little cost.
(My article was
On 2012/07/21 7:01, David Starner wrote:
I'm concerned about the statement/implication that one can optimize
for ASCII and Latin-1. It's too easy for a lot of developers to test
speed with the English/European documents they have around and test
correctness only with Chinese. I see the argument