Re: Unicode String Models

2018-11-22 Thread Henri Sivonen via Unicode
On Tue, Oct 2, 2018 at 3:04 PM Mark Davis ☕️ wrote: > > * The Python 3.3 model mentions the disadvantages of memory usage >> cliffs but doesn't mention the associated performance cliffs. It would >> be good to also mention that when a string manipulation causes the >> storage to expand or

Re: Unicode String Models

2018-10-03 Thread Daniel Bünzli via Unicode
On 3 October 2018 at 15:41:42, Mark Davis ☕️ via Unicode (unicode@unicode.org) wrote:   > Let me clear that up; I meant that "the underlying storage never contains > something that would need to be represented as a surrogate code point." Of > course, UTF-16 does need surrogate code units. What #1

Re: Unicode String Models

2018-10-03 Thread Mark Davis ☕️ via Unicode
Mark On Wed, Oct 3, 2018 at 3:01 PM Daniel Bünzli wrote: > On 3 October 2018 at 09:17:10, Mark Davis ☕️ via Unicode ( > unicode@unicode.org) wrote: > > > There are two main choices for a scalar-value API: > > > > 1. Guarantee that the storage never contains surrogates. This is the > > simplest

Re: Unicode String Models

2018-10-03 Thread Daniel Bünzli via Unicode
On 3 October 2018 at 09:17:10, Mark Davis ☕️ via Unicode (unicode@unicode.org) wrote: > There are two main choices for a scalar-value API: > > 1. Guarantee that the storage never contains surrogates. This is the > simplest model. > 2. Substitute U+FFFD for surrogates when the API returns code
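The second choice quoted above (substituting U+FFFD for surrogates at the API boundary) can be illustrated with a minimal sketch. It assumes Python's `str`, which can hold lone surrogate code points; the function name `scalar_values` is hypothetical and is not taken from the paper or from this thread.

```python
REPLACEMENT = 0xFFFD  # U+FFFD REPLACEMENT CHARACTER


def scalar_values(s: str):
    """Yield the scalar values of s, reporting any surrogate as U+FFFD.

    The storage is left untouched; only the values handed to the
    caller are sanitized (hypothetical helper, for illustration only).
    """
    for ch in s:
        cp = ord(ch)
        # Surrogate code points occupy U+D800..U+DFFF and are not scalar values.
        yield REPLACEMENT if 0xD800 <= cp <= 0xDFFF else cp
```

With this sketch, `list(scalar_values("a\ud800b"))` gives `[0x61, 0xFFFD, 0x62]`, while the string itself still contains the lone surrogate. The first choice would instead reject or repair such input when the string is constructed.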

Re: Unicode String Models

2018-10-03 Thread Mark Davis ☕️ via Unicode
Mark On Tue, Oct 2, 2018 at 8:31 PM Daniel Bünzli wrote: > On 2 October 2018 at 14:03:48, Mark Davis ☕️ via Unicode ( > unicode@unicode.org) wrote: > > > Because of performance and storage consideration, you need to consider > the > > possible internal data structures when you are looking at

Re: Unicode String Models

2018-10-02 Thread Daniel Bünzli via Unicode
On 2 October 2018 at 14:03:48, Mark Davis ☕️ via Unicode (unicode@unicode.org) wrote: > Because of performance and storage consideration, you need to consider the > possible internal data structures when you are looking at something as > low-level as strings. But most of the 'model's in the

Re: Unicode String Models

2018-10-02 Thread Mark Davis ☕️ via Unicode
Whether or not it is well suited, that's probably water under the bridge at this point. Think of it as a jargon at this point; after all, there are lots of cases like that: a "near miss" wasn't nearly a miss, it was nearly a hit. Mark On Sun, Sep 9, 2018 at 10:56 AM Janusz S. Bień wrote: > On

Re: Unicode String Models

2018-10-02 Thread Mark Davis ☕️ via Unicode
Mark On Tue, Sep 11, 2018 at 12:17 PM Henri Sivonen via Unicode < unicode@unicode.org> wrote: > On Sat, Sep 8, 2018 at 7:36 PM Mark Davis ☕️ via Unicode > wrote: > > > > I recently did some extensive revisions of a paper on Unicode string > models (APIs). Comments are welcome. > > > > >

Re: Unicode String Models

2018-10-02 Thread Mark Davis ☕️ via Unicode
Mark On Sun, Sep 9, 2018 at 3:42 PM Daniel Bünzli wrote: > Hello, > > I find your notion of "model" and presentation a bit confusing since it > conflates what I would call the internal representation and the API. > > The internal representation defines how the Unicode text is stored and >

Re: Unicode String Models

2018-10-02 Thread Mark Davis ☕️ via Unicode
Mark On Sun, Sep 9, 2018 at 10:03 AM Richard Wordingham via Unicode < unicode@unicode.org> wrote: > On Sat, 8 Sep 2018 18:36:00 +0200 > Mark Davis ☕️ via Unicode wrote: > > > I recently did some extensive revisions of a paper on Unicode string > > models (APIs). Comments are welcome. > > > > >

Re: Unicode String Models

2018-10-02 Thread Mark Davis ☕️ via Unicode
Thanks, added a quote from you on that; see if it looks ok. Mark On Sat, Sep 8, 2018 at 9:20 PM John Cowan wrote: > This paper makes the default assumption that the internal storage of a > string is a featureless array. If this assumption is abandoned, it is > possible to get O(1) indexes

Re: Unicode String Models

2018-10-02 Thread Mark Davis ☕️ via Unicode
Thanks to all for comments. Just revised the text in https://goo.gl/neguxb. Mark On Sat, Sep 8, 2018 at 6:36 PM Mark Davis ☕️ wrote: > I recently did some extensive revisions of a paper on Unicode string > models (APIs). Comments are welcome. > > >

Re: Unicode String Models

2018-09-12 Thread Henri Sivonen via Unicode
On Wed, Sep 12, 2018 at 11:37 AM Hans Åberg via Unicode wrote: > The idea is to extend Unicode itself, so that those bytes can be represented > by legal codepoints. Extending Unicode itself would likely create more problems than it would solve. Extending the value space of Unicode scalar values

Re: Unicode String Models

2018-09-12 Thread Hans Åberg via Unicode
> On 12 Sep 2018, at 04:34, Eli Zaretskii via Unicode > wrote: > >> Date: Wed, 12 Sep 2018 00:13:52 +0200 >> Cc: unicode@unicode.org >> From: Hans Åberg via Unicode >> >> It might be useful to represent non-UTF-8 bytes as Unicode code points. One >> way might be to use a codepoint to

Re: Unicode String Models

2018-09-11 Thread Henri Sivonen via Unicode
On Tue, Sep 11, 2018 at 2:13 PM Eli Zaretskii wrote: > > > Date: Tue, 11 Sep 2018 13:12:40 +0300 > > From: Henri Sivonen via Unicode > > > > * I suggest splitting the "UTF-8 model" into three substantially > > different models: > > > > 1) The UTF-8 Garbage In, Garbage Out model (the model of

Re: Unicode String Models

2018-09-11 Thread Eli Zaretskii via Unicode
> Date: Wed, 12 Sep 2018 00:13:52 +0200 > Cc: unicode@unicode.org > From: Hans Åberg via Unicode > > It might be useful to represent non-UTF-8 bytes as Unicode code points. One > way might be to use a codepoint to indicate high bit set followed by the byte > value with its high bit set to 0,

Re: Unicode String Models

2018-09-11 Thread Philippe Verdy via Unicode
No, 0xF8..0xFF are not used at all in UTF-8; but U+00F8..U+00FF really **do** have UTF-8 encodings (using two bytes). The only safe way to represent arbitrary bytes within strings when they are not valid UTF-8 is to use invalid UTF-8 sequences, i.e. by using a "UTF-8-like" private extension of
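The distinction drawn in this message, between the bytes 0xF8..0xFF and the code points U+00F8..U+00FF, can be checked with a short sketch using Python's built-in UTF-8 codec. The helper name `is_valid_utf8` is hypothetical and only serves the illustration.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as UTF-8 under Python's strict decoder."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Code points U+00F8..U+00FF each encode to a two-byte UTF-8 sequence.
two_byte = [chr(cp).encode("utf-8") for cp in range(0xF8, 0x100)]

# The raw bytes 0xF8..0xFF, taken on their own, are never valid UTF-8.
raw_bytes_valid = [is_valid_utf8(bytes([b])) for b in range(0xF8, 0x100)]
```

Here every entry of `two_byte` has length 2 (for example, U+00F8 becomes the bytes C3 B8), and every entry of `raw_bytes_valid` is `False`.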

Re: Unicode String Models

2018-09-11 Thread J Decker via Unicode
On Tue, Sep 11, 2018 at 3:15 PM Hans Åberg via Unicode wrote: > > > On 11 Sep 2018, at 23:48, Richard Wordingham via Unicode < > unicode@unicode.org> wrote: > > > > On Tue, 11 Sep 2018 21:10:03 +0200 > > Hans Åberg via Unicode wrote: > > > >> Indeed, before UTF-8, in the 1990s, I recall some

Re: Unicode String Models

2018-09-11 Thread Hans Åberg via Unicode
> On 11 Sep 2018, at 23:48, Richard Wordingham via Unicode > wrote: > > On Tue, 11 Sep 2018 21:10:03 +0200 > Hans Åberg via Unicode wrote: > >> Indeed, before UTF-8, in the 1990s, I recall some Russians using >> LaTeX files with sections in different Cyrillic and Latin encodings, >>

Re: Unicode String Models

2018-09-11 Thread Richard Wordingham via Unicode
On Tue, 11 Sep 2018 21:10:03 +0200 Hans Åberg via Unicode wrote: > Indeed, before UTF-8, in the 1990s, I recall some Russians using > LaTeX files with sections in different Cyrillic and Latin encodings, > changing the editor encoding while typing. Rather like some of the old Unicode list

Re: Unicode String Models

2018-09-11 Thread Hans Åberg via Unicode
> On 11 Sep 2018, at 20:40, Eli Zaretskii wrote: > >> From: Hans Åberg >> Date: Tue, 11 Sep 2018 20:14:30 +0200 >> Cc: hsivo...@hsivonen.fi, >> unicode@unicode.org >> >> If one encounters a file with mixed encodings, it is good to be able to view >> its contents and then convert it, as I

Re: Unicode String Models

2018-09-11 Thread Eli Zaretskii via Unicode
> From: Hans Åberg > Date: Tue, 11 Sep 2018 20:14:30 +0200 > Cc: hsivo...@hsivonen.fi, > unicode@unicode.org > > If one encounters a file with mixed encodings, it is good to be able to view > its contents and then convert it, as I see one can do in Emacs. Yes. And mixed encodings is not the

Re: Unicode String Models

2018-09-11 Thread Hans Åberg via Unicode
> On 11 Sep 2018, at 19:21, Eli Zaretskii wrote: > >> From: Hans Åberg >> Date: Tue, 11 Sep 2018 19:13:28 +0200 >> Cc: Henri Sivonen , >> unicode@unicode.org >> >>> In Emacs, each raw byte belonging >>> to a byte sequence which is invalid under UTF-8 is represented as a >>> special multibyte

Re: Unicode String Models

2018-09-11 Thread Eli Zaretskii via Unicode
> From: Hans Åberg > Date: Tue, 11 Sep 2018 19:13:28 +0200 > Cc: Henri Sivonen , > unicode@unicode.org > > > In Emacs, each raw byte belonging > > to a byte sequence which is invalid under UTF-8 is represented as a > > special multibyte sequence. IOW, Emacs's internal representation > >

Re: Unicode String Models

2018-09-11 Thread Hans Åberg via Unicode
> On 11 Sep 2018, at 13:13, Eli Zaretskii via Unicode > wrote: > > In Emacs, each raw byte belonging > to a byte sequence which is invalid under UTF-8 is represented as a > special multibyte sequence. IOW, Emacs's internal representation > extends UTF-8 with multibyte sequences it uses to

Re: Unicode String Models

2018-09-11 Thread Mark Davis ☕️ via Unicode
These are all interesting and useful comments. I'll be responding once I get a bit of free time, probably Friday or Saturday. Mark On Tue, Sep 11, 2018 at 4:16 AM Eli Zaretskii via Unicode < unicode@unicode.org> wrote: > > Date: Tue, 11 Sep 2018 13:12:40 +0300 > > From: Henri Sivonen via

Re: Unicode String Models

2018-09-11 Thread Eli Zaretskii via Unicode
> Date: Tue, 11 Sep 2018 13:12:40 +0300 > From: Henri Sivonen via Unicode > > * I suggest splitting the "UTF-8 model" into three substantially > different models: > > 1) The UTF-8 Garbage In, Garbage Out model (the model of Go): No > UTF-8-related operations are performed when ingesting

Re: Unicode String Models

2018-09-11 Thread Henri Sivonen via Unicode
On Sat, Sep 8, 2018 at 7:36 PM Mark Davis ☕️ via Unicode wrote: > > I recently did some extensive revisions of a paper on Unicode string models > (APIs). Comments are welcome. > > https://docs.google.com/document/d/1wuzzMOvKOJw93SWZAqoim1VUl9mloUxE0W6Ki_G23tw/edit# * The Grapheme Cluster Model

Re: Unicode String Models

2018-09-10 Thread Hans Åberg via Unicode
> On 9 Sep 2018, at 21:20, Eli Zaretskii via Unicode > wrote: > > In Emacs, the gap is always where the text is inserted or deleted, be > it in the middle of text or at its end. > >> All editors I have seen treat the text as ordered collections of small >> buffers (these small buffers may

Re: Unicode String Models

2018-09-09 Thread Eli Zaretskii via Unicode
> From: Philippe Verdy > Date: Sun, 9 Sep 2018 19:35:47 +0200 > Cc: Richard Wordingham , > unicode Unicode Discussion > > In Emacs, buffer text is a character string with a gap, actually. > > A text buffer with gaps is a complex structure, not just a plain string. The difference is

Re: Unicode String Models

2018-09-09 Thread Philippe Verdy via Unicode
Le dim. 9 sept. 2018 à 17:53, Eli Zaretskii a écrit : > > Text editors use various indexing caches always, to manage memory, I/O, > and allow working on large texts > > even on systems with low memory available. As much as possible they > attempt to use the OS-level caches > > of the filesystem.

Re: Unicode String Models

2018-09-09 Thread Eli Zaretskii via Unicode
> Date: Sun, 9 Sep 2018 16:10:26 +0200 > Cc: unicode Unicode Discussion > From: Philippe Verdy via Unicode > > In practice, we use a memory by preparing the "small memory" while > instantiating a new iterator that will > process the whole string (which may not be fully loaded in memory, in

Re: Unicode String Models

2018-09-09 Thread Philippe Verdy via Unicode
Le dim. 9 sept. 2018 à 10:10, Richard Wordingham via Unicode < unicode@unicode.org> a écrit : > On Sat, 8 Sep 2018 18:36:00 +0200 > Mark Davis ☕️ via Unicode wrote: > > > I recently did some extensive revisions of a paper on Unicode string > > models (APIs). Comments are welcome. > > > > >

Re: Unicode String Models

2018-09-09 Thread Daniel Bünzli via Unicode
Hello,  I find your notion of "model" and presentation a bit confusing since it conflates what I would call the internal representation and the API.  The internal representation defines how the Unicode text is stored and should not really matter to the end user of the string data structure.

Re: Unicode String Models

2018-09-09 Thread Janusz S. Bień via Unicode
On Sat, Sep 08 2018 at 18:36 +0200, Mark Davis ☕️ via Unicode wrote: > I recently did some extensive revisions of a paper on Unicode string models > (APIs). Comments are welcome. > > https://docs.google.com/document/d/1wuzzMOvKOJw93SWZAqoim1VUl9mloUxE0W6Ki_G23tw/edit# It's a good opportunity to

Re: Unicode String Models

2018-09-09 Thread Mark Davis ☕️ via Unicode
Thanks, excellent comments. While it is clear that some string models have more complicated structures (with their own pros and cons), my focus was on simple internal structures. The focus was also on immutable strings — and the tradeoffs for mutable ones can be quite different — and that needs to

Re: Unicode String Models

2018-09-09 Thread Richard Wordingham via Unicode
On Sat, 8 Sep 2018 18:36:00 +0200 Mark Davis ☕️ via Unicode wrote: > I recently did some extensive revisions of a paper on Unicode string > models (APIs). Comments are welcome. > > https://docs.google.com/document/d/1wuzzMOvKOJw93SWZAqoim1VUl9mloUxE0W6Ki_G23tw/edit# Theoretically at least,

RE: Unicode String Models -- minor proofreading nit (was: Unicode String Models)

2012-07-31 Thread CE Whitehead
a programmer (excepting a knowledge of html / css and a little java script and maybe just a bit of other stuff). From: cewcat...@hotmail.com To: verd...@wanadoo.fr CC: unicode@unicode.org Subject: RE: Unicode String Models -- minor proofreading nit (was: Unicode String Models) Date: Sat, 28 Jul

Re: Unicode String Models -- minor proofreading nit (was: Unicode String Models)

2012-07-31 Thread Philippe Verdy
2012/7/31 CE Whitehead cewcat...@hotmail.com Hmm, after checking several unicode documents and some of the faq ( http://unicode.org/faq/collation.html), my understanding is that using a non-character code point is the best solution here; I don't know which non-character code point is best,

RE: Unicode String Models -- minor proofreading nit (was: Unicode String Models)

2012-07-28 Thread CE Whitehead
From: verd...@wanadoo.fr Date: Fri, 27 Jul 2012 03:17:07 +0200 Subject: Re: Unicode String Models -- minor proofreading nit (was: Unicode String Models) To: m...@macchiato.com CC: cewcat...@hotmail.com; unicode@unicode.org I just wonder where the XSS attack is really an issue here

RE: Unicode String Models

2012-07-26 Thread Dreiheller, Albrecht
David Starner wrote (Saturday, July 21, 2012 12:02 AM): The question of whether to allow non-ASCII characters in variables is open. I don't see why. Yes, a lot of organizations will use ASCII only, but not all programming is done by large international organizations. For personal hacking, or

Re: Unicode String Models -- minor proofreading nit (was: Unicode String Models)

2012-07-26 Thread CE Whitehead
Hi, I have one minor comment: * * * Validation; par 3, comment in parentheses . . . (you never want to just delete it; that has security problems). { COMMENT: would it be helpful to have a reference here to the Unicode security document that discusses this issue -- TR 36, 3.5

Re: Unicode String Models -- minor proofreading nit (was: Unicode String Models)

2012-07-26 Thread Mark Davis ☕
Thanks, good suggestion. Mark https://plus.google.com/114199149796022210033 *— Il meglio è l’inimico del bene —* On Thu, Jul 26, 2012 at 12:40 PM, CE Whitehead cewcat...@hotmail.com wrote: Validation; par 3, comment in parentheses . . . (you never want to just delete it; that has

Re: Unicode String Models -- minor proofreading nit (was: Unicode String Models)

2012-07-26 Thread Philippe Verdy
I just wonder where the XSS attack is really an issue here. XSS attacks involve bypassing the document source domain in order to attempt to use or insert data found in another document issued or managed by another domain, in a distinct security realm. What is a more serious issue would be the

Re: Unicode String Models

2012-07-21 Thread Richard Wordingham
On Fri, 20 Jul 2012 15:01:42 -0700 David Starner prosfil...@gmail.com wrote: The question of whether to allow non-ASCII characters in variables is open. It's not like Chinese variables with Chinese comments is going to be much harder to debug for the English speaker than English variables

Re: Unicode String Models

2012-07-20 Thread David Starner
On Fri, Jul 20, 2012 at 1:31 PM, Mark Davis ☕ m...@macchiato.com wrote: I put together some notes on different ways for programming languages to handle Unicode at a low level. Comments welcome. Macchiato » Many programming languages (and most modern software) have moved to Unicode model of

Re: Unicode String Models

2012-07-20 Thread martin
That means that it is best to optimize for BMP characters (and as a subset, ASCII and Latin-1), and fall into a ‘slow path’ when a supplementary character is encountered. I'm concerned about the statement/implication that one can optimize for ASCII and Latin-1. It's too easy for a lot of
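The quoted statement about optimizing for BMP characters and falling into a "slow path" for supplementary characters can be sketched as follows. This is only an illustration under stated assumptions: the input is a sequence of UTF-16 code units given as integers, the function name `count_code_points` is hypothetical, and a lone surrogate is counted as one code point.

```python
def count_code_points(units):
    """Count code points in a sequence of UTF-16 code units (ints)."""
    count = 0
    i = 0
    n = len(units)
    while i < n:
        u = units[i]
        if u < 0xD800 or u > 0xDFFF:
            # Fast path: a BMP code unit that is not a surrogate.
            i += 1
        elif u <= 0xDBFF and i + 1 < n and 0xDC00 <= units[i + 1] <= 0xDFFF:
            # Slow path: a lead surrogate followed by a trail surrogate
            # forms one supplementary code point.
            i += 2
        else:
            # Lone surrogate: counted as a single code point (an assumption).
            i += 1
        count += 1
    return count
```

For example, the units `[0x61, 0xD83D, 0xDE00, 0x62]` (the letter a, U+1F600, the letter b) count as 3 code points. The concern raised in this message applies to such code: the slow path needs to be tested as carefully as the fast one.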

RE: Unicode String Models

2012-07-20 Thread Murray Sargent
Mark wrote: “I put together some notes on different ways for programming languages to handle Unicode at a low level. Comments welcome.” Nice article as far as it goes and additions are forthcoming. In addition to multiple code units per character in UTF-8 and UTF-16, there are variation

Re: Unicode String Models

2012-07-20 Thread Mark Davis ☕
Thanks, nice article. We got into some of those hairy caret positioning issues back at Apple; we even had a design that would associate a series of lines (which could be slanted and positioned) with a ligature, but ultimately 1/m gets you 99% of the value, with very little cost. (My article was

Re: Unicode String Models

2012-07-20 Thread Martin J. Dürst
On 2012/07/21 7:01, David Starner wrote: I'm concerned about the statement/implication that one can optimize for ASCII and Latin-1. It's too easy for a lot of developers to test speed with the English/European documents they have around and test correctness only with Chinese. I see the argument