Re: Unicode String Models

2018-11-22 Thread Henri Sivonen via Unicode
On Tue, Oct 2, 2018 at 3:04 PM Mark Davis ☕️  wrote:

>
>   * The Python 3.3 model mentions the disadvantages of memory usage
>> cliffs but doesn't mention the associated performance cliffs. It would
>> be good to also mention that when a string manipulation causes the
>> storage to expand or contract, there's a performance impact that's not
>> apparent from the nature of the operation if the programmer's
>> intuition works on the assumption that the programmer is dealing with
>> UTF-32.
>>
>
> The focus was on immutable string models, but I didn't make that clear.
> Added some text.
>

Thanks.


>  * The UTF-16/Latin1 model is missing. It's used by SpiderMonkey, DOM
>> text node storage in Gecko, (I believe but am not 100% sure) V8 and,
>> optionally, HotSpot
>> (
>> https://docs.oracle.com/javase/9/vm/java-hotspot-virtual-machine-performance-enhancements.htm#JSJVM-GUID-3BB4C26F-6DE7-4299-9329-A3E02620D50A
>> ).
>> That is, text has UTF-16 semantics, but if the high half of every code
>> unit in a string is zero, only the lower half is stored. This has
>> properties analogous to the Python 3.3 model, except non-BMP doesn't
>> expand to UTF-32 but uses UTF-16 surrogate pairs.
>>
>
> Thanks, will add.
>

V8 source code shows it has a OneByteString storage option:
https://cs.chromium.org/chromium/src/v8/src/objects/string.h?sq=package:chromium=0=494
. From hearsay, I'm convinced that it means Latin1, but I've failed to find
a clear quotable statement from a V8 developer to that effect.


>   3) Type-system-tagged UTF-8 (the model of Rust): Valid UTF-8 buffers
>> have a different type in the type system than byte buffers. To go from
>> a byte buffer to an UTF-8 buffer, UTF-8 validity is checked. Once data
>> has been tagged as valid UTF-8, the validity is trusted completely so
>> that iteration by code point does not have "else" branches for
>> malformed sequences. If data that the type system indicates to be
>> valid UTF-8 wasn't actually valid, it would be nasal demon time. The
>> language has a default "safe" side and an opt-in "unsafe" side. The
>> unsafe side is for performing low-level operations in a way where the
>> responsibility of upholding invariants is moved from the compiler to
>> the programmer. It's impossible to violate the UTF-8 validity
>> invariant using the safe part of the language.
>>
>
> Added a quote based on this; please check if it is ok.
>

Looks accurate. Thanks.
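
For reference, the model is visible directly in Rust's standard library:
the checked conversion validates, and the unsafe one shifts responsibility
for upholding the invariant to the programmer.

    fn ingest(bytes: Vec<u8>) {
        // Checked: returns Err if `bytes` is not valid UTF-8.
        match String::from_utf8(bytes) {
            Ok(s) => println!("valid UTF-8: {s}"),
            Err(e) => println!("rejected: {e}"),
        }
    }

    // Opt-in unsafe side: the compiler trusts the caller's claim of
    // validity; a false claim makes later char iteration undefined.
    unsafe fn ingest_trusted(bytes: Vec<u8>) -> String {
        String::from_utf8_unchecked(bytes)
    }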

-- 
Henri Sivonen
hsivo...@hsivonen.fi
https://hsivonen.fi/


Re: Unicode String Models

2018-10-03 Thread Daniel Bünzli via Unicode
On 3 October 2018 at 15:41:42, Mark Davis ☕️ via Unicode (unicode@unicode.org) 
wrote:
 
> Let me clear that up; I meant that "the underlying storage never contains
> something that would need to be represented as a surrogate code point." Of
> course, UTF-16 does need surrogate code units. What #1 would be excluding
> in the case of UTF-16 would be unpaired surrogates. That is, suppose the
> underlying storage is UTF-16 code units that don't satisfy #1.
>  
> 0061 D83D DC7D 0061 D83D
>  
> A code point API would return for those a sequence of 4 values, the last of
> which would be a surrogate code point.
>  
> 0061, 0001F47D, 0061, D83D
>  
> A scalar value API would return for those also 4 values, but since we
> aren't in #1, it would need to remap.
>  
> 0061, 0001F47D, 0061, FFFD

OK, understood. But I think that if you go to the length of providing a 
scalar-value API you would also prevent the construction of strings that have 
such anomalies in the first place (e.g. by erroring in the constructor if you 
provide it with malformed UTF-X data), i.e. maintain #1. From a programmer's 
perspective I really don't get anything from #2 except confusion.

> If it is a real datatype, with strong guarantees that it *never* contains
> values outside of [0x0000..0xD7FF 0xE000..0x10FFFF], then every conversion
> from number will require checking. And in my experience, without a strong
> guarantee the datatype is in practice pretty useless.

Sure. My point was that the places where you perform this check are few in 
practice: mainly at the IO boundary of your program, where you actually 
need to deal with encodings, and, additionally, whenever you define scalar value 
constants (a check that could actually be performed by your compiler if your 
language provides a literal notation for values of this type).

Best, 

Daniel





Re: Unicode String Models

2018-10-03 Thread Mark Davis ☕️ via Unicode
Mark


On Wed, Oct 3, 2018 at 3:01 PM Daniel Bünzli 
wrote:

> On 3 October 2018 at 09:17:10, Mark Davis ☕️ via Unicode (
> unicode@unicode.org) wrote:
>
> > There are two main choices for a scalar-value API:
> >
> > 1. Guarantee that the storage never contains surrogates. This is the
> > simplest model.
> > 2. Substitute U+FFFD for surrogates when the API returns code
> > points. This can be done where #1 is not feasible, such as where the API
> is
> > a shim on top of a (perhaps large) IO buffer of 16-bit code
> units
> > that are not guaranteed to be UTF-16. The cost is extra tests on every
> code
> > point access.
>
> I'm not sure #2 really makes sense in practice: it would mean you can't
> access scalar values which need surrogates to be encoded.
>

Let me clear that up; I meant that "the underlying storage never contains
something that would need to be represented as a surrogate code point." Of
course, UTF-16 does need surrogate code units. What #1 would be excluding
in the case of UTF-16 would be unpaired surrogates. That is, suppose the
underlying storage is UTF-16 code units that don't satisfy #1.

0061 D83D DC7D 0061 D83D

A code point API would return for those a sequence of 4 values, the last of
which would be a surrogate code point.

0061, 0001F47D, 0061, D83D

A scalar value API would return for those also 4 values, but since we
aren't in #1, it would need to remap.

0061, 0001F47D, 0061, FFFD

>
> Also regarding #1, you can always define an API that has this property
> regardless of the actual storage, it's only that your indexing operations
> might be costly as they do not directly map to the underlying storage array.


> That being said I don't think direct indexing/iterating for Unicode text
> is such an interesting operation due of course to the
> normalization/segmentation issues. Basically if your API provides them I
> only see these indexes as useful ways to define substrings. APIs that
> identify/iterate boundaries (and thus substrings) are more interesting due
> to the nature of Unicode text.
>

I agree that iteration is a very common case. But quite often
implementations need to have at least opaque indexes (as discussed).

>
> > If the programming language provides for such a primitive datatype, that
> is
> > possible. That would mean at a minimum that casting/converting to that
> > datatype from other numerical datatypes would require bounds-checking and
> > throwing an exception for values outside of [0x0000..0xD7FF
> > 0xE000..0x10FFFF].
>
> Yes. But note that in practice if you are in 1. above you usually perform
> this only at the point of decoding where you are already performing a lot
> of other checks. Once done you no longer need to check anything as long as
> the operations you perform on the values preserve the invariant. Also
> converting back to an integer if you need one is a no-op: it's the identity
> function.
>

If it is a real datatype, with strong guarantees that it *never* contains
values outside of [0x0000..0xD7FF 0xE000..0x10FFFF], then every conversion
from number will require checking. And in my experience, without a strong
guarantee the datatype is in practice pretty useless.


>
> The OCaml Uchar module does this. This is the interface:
>
>   https://github.com/ocaml/ocaml/blob/trunk/stdlib/uchar.mli
>
> which defines the type t as abstract and here is the implementation:
>
>   https://github.com/ocaml/ocaml/blob/trunk/stdlib/uchar.ml
>
> which defines the implementation of type t = int which means values of
> this type are an *unboxed* OCaml integer (and will be stored as such in say
> an OCaml array). However since the module system enforces type abstraction
> the only way of creating such values is to use the constants or the
> constructors (e.g. of_int) which all maintain the scalar value invariant
> (if you disregard the unsafe_* functions).
>
> Note that it would be perfectly possible to adopt a similar approach in C
> via a typedef, though given C's rather loose type system a little bit more
> discipline would be required from the programmer (always go through the
> constructor functions to create values of the type).


That's the C motto: "requiring a 'bit more' discipline from programmers"

>


> Best,
>
> Daniel
>
>
>


Re: Unicode String Models

2018-10-03 Thread Daniel Bünzli via Unicode
On 3 October 2018 at 09:17:10, Mark Davis ☕️ via Unicode (unicode@unicode.org) 
wrote:

> There are two main choices for a scalar-value API:
>  
> 1. Guarantee that the storage never contains surrogates. This is the
> simplest model.
> 2. Substitute U+FFFD for surrogates when the API returns code
> points. This can be done where #1 is not feasible, such as where the API is
> a shim on top of a (perhaps large) IO buffer of 16-bit code units
> that are not guaranteed to be UTF-16. The cost is extra tests on every code
> point access.

I'm not sure #2 really makes sense in practice: it would mean you can't access 
scalar values which need surrogates to be encoded. 

Also regarding #1, you can always define an API that has this property 
regardless of the actual storage, it's only that your indexing operations might 
be costly as they do not directly map to the underlying storage array.

That being said I don't think direct indexing/iterating for Unicode text is 
such an interesting operation due of course to the normalization/segmentation 
issues. Basically if your API provides them I only see these indexes as useful 
ways to define substrings. APIs that identify/iterate boundaries (and thus 
substrings) are more interesting due to the nature of Unicode text.

> If the programming language provides for such a primitive datatype, that is
> possible. That would mean at a minimum that casting/converting to that
> datatype from other numerical datatypes would require bounds-checking and
> throwing an exception for values outside of [0x0000..0xD7FF
> 0xE000..0x10FFFF]. 

Yes. But note that in practice if you are in 1. above you usually perform this 
only at the point of decoding where you are already performing a lot of other 
checks. Once done you no longer need to check anything as long as the 
operations you perform on the values preserve the invariant. Also converting 
back to an integer if you need one is a no-op: it's the identity function. 

The OCaml Uchar module does this. This is the interface: 

  https://github.com/ocaml/ocaml/blob/trunk/stdlib/uchar.mli

which defines the type t as abstract and here is the implementation: 

  https://github.com/ocaml/ocaml/blob/trunk/stdlib/uchar.ml

which defines the implementation of type t = int which means values of this 
type are an *unboxed* OCaml integer (and will be stored as such in say an OCaml 
array). However since the module system enforces type abstraction the only way 
of creating such values is to use the constants or the constructors (e.g. 
of_int) which all maintain the scalar value invariant (if you disregard the 
unsafe_* functions). 

Note that it would be perfectly possible to adopt a similar approach in C via a 
typedef, though given C's rather loose type system a little bit more discipline 
would be required from the programmer (always go through the constructor 
functions to create values of the type).
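
Rust happens to ship such a type as the primitive char, which is a Unicode
scalar value by definition; the same discipline can also be spelled as a
checked newtype, roughly mirroring the Uchar approach (a sketch, not the
Uchar API itself):

    /// A Unicode scalar value: an integer in [0x0000;0xD7FF] or
    /// [0xE000;0x10FFFF]. The field is private, so the invariant can
    /// only be established through the checked constructor.
    #[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
    pub struct Scalar(u32);

    impl Scalar {
        /// The one place the range check happens (cf. Uchar.of_int).
        pub fn of_int(n: u32) -> Option<Scalar> {
            match n {
                0..=0xD7FF | 0xE000..=0x10FFFF => Some(Scalar(n)),
                _ => None,
            }
        }

        /// Converting back to an integer is the identity function.
        pub fn to_int(self) -> u32 {
            self.0
        }
    }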

Best, 

Daniel





Re: Unicode String Models

2018-10-03 Thread Mark Davis ☕️ via Unicode
Mark


On Tue, Oct 2, 2018 at 8:31 PM Daniel Bünzli 
wrote:

> On 2 October 2018 at 14:03:48, Mark Davis ☕️ via Unicode (
> unicode@unicode.org) wrote:
>
> > Because of performance and storage considerations, you need to consider the
> > possible internal data structures when you are looking at something as
> > low-level as strings. But most of the 'models' in the document are only
> > really distinguished by API; only the "Code Point model" discussions are
> > segmented by internal storage, as with "Code Point Model: UTF-32"
>
> I guess my gripe with the presentation of that document is that it
> perpetuates the problem of confusing "unicode characters" (or integers, or
> scalar values) and their *encoding* (how to represent these integers as
> byte sequences), which is a source of endless confusion among programmers.
>
> This confusion is easily lifted once you explain that there exist certain
> integers, the scalar values, which are your actual characters and then you
> have different ways of encoding your characters; one can then explain that
> a surrogate is not a character per se, it's a hack and there's no point in
> indexing them except if you want trouble.
>
> This may also suggest another taxonomy of classification for the APIs,
> those in which you work directly with the character data (the scalar
> values) and those in which you work with an encoding of the actual
> character data (e.g. a JavaScript string).
>

Thanks for the feedback. It is worth adding a discussion of the issues,
perhaps something like:

A code-point-based API takes and returns int32's, although only a small
subset of the values are valid code points, namely 0x0..0x10FFFF. (In
practice some APIs may support returning -1 to signal an error or
termination, such as before or after the end of a string.) A surrogate code
point is one in U+D800..U+DFFF; these reflect a range of special code units
used in pairs in UTF-16 for representing code points above U+FFFF. A scalar
value is a code point that is not a surrogate.

A scalar-value API for immutable strings requires that no surrogate code
points are ever returned. In practice, the main advantage of that API is
that round-tripping to UTF-8/16 is guaranteed. Otherwise, a leaked
surrogate code point is relatively harmless: Unicode properties are devised
so that clients can essentially treat them as (permanently) unassigned
characters. Warning: an iterator should *never* avoid returning surrogate
code points by skipping them; that can cause security problems; see
https://www.unicode.org/reports/tr36/tr36-7.html#Substituting_for_Ill_Formed_Subsequences
and
https://www.unicode.org/reports/tr36/tr36-7.html#Deletion_of_Noncharacters.

There are two main choices for a scalar-value API:

   1. Guarantee that the storage never contains surrogates. This is the
   simplest model.
   2. Substitute U+FFFD for surrogates when the API returns code
   points. This can be done where #1 is not feasible, such as where the API is
   a shim on top of a (perhaps large) IO buffer of buffer of 16-bit code units
   that are not guaranteed to be UTF-16. The cost is extra tests on every code
   point access.
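
Choice 2 corresponds to what Rust's standard char::decode_utf16 yields when
its per-unit errors are mapped to U+FFFD; a sketch, using the example
sequence given earlier in the thread:

    use std::char::{decode_utf16, REPLACEMENT_CHARACTER};

    /// Scalar-value iteration over 16-bit units that are not guaranteed
    /// to be well-formed UTF-16 (choice 2): unpaired surrogates come
    /// out as U+FFFD.
    fn scalar_values(units: &[u16]) -> impl Iterator<Item = char> + '_ {
        decode_utf16(units.iter().copied())
            .map(|r| r.unwrap_or(REPLACEMENT_CHARACTER))
    }

    fn main() {
        // 0061 D83D DC7D 0061 D83D
        let v: Vec<char> =
            scalar_values(&[0x0061, 0xD83D, 0xDC7D, 0x0061, 0xD83D]).collect();
        assert_eq!(v, vec!['a', '\u{1F47D}', 'a', '\u{FFFD}']);
    }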


> > In reality, most APIs are not even going to be in terms of code points:
> > they will return int32's.
>
> That reality depends on your programming language. If the latter supports
> type abstraction you can define an abstract type for scalar values (whose
> implementation may simply be an integer). If you always go through the
> constructor to create these "integers" you can maintain the invariant that
> a value of this type is an integer in the ranges [0x0000;0xD7FF] and
> [0xE000;0x10FFFF]. Knowing this invariant holds is quite useful when you
> feed your "character" data to other processes like UTF-X encoders: it
> guarantees the correctness of their outputs regardless of what the
> programmer does.
>

If the programming language provides for such a primitive datatype, that is
possible. That would mean at a minimum that casting/converting to that
datatype from other numerical datatypes would require bounds-checking and
throwing an exception for values outside of [0x0000..0xD7FF
0xE000..0x10FFFF]. Most common-use programming languages that I know of
don't support that for primitives; the API would have to use a class, which
would be so very painful for performance/storage. If you (or others) know
of languages that do have such a cheap primitive datatype, that would be
worth mentioning!


> Best,
>
> Daniel
>
>
>


Re: Unicode String Models

2018-10-02 Thread Daniel Bünzli via Unicode
On 2 October 2018 at 14:03:48, Mark Davis ☕️ via Unicode (unicode@unicode.org) 
wrote:

> Because of performance and storage considerations, you need to consider the
> possible internal data structures when you are looking at something as
> low-level as strings. But most of the 'models' in the document are only
> really distinguished by API; only the "Code Point model" discussions are
> segmented by internal storage, as with "Code Point Model: UTF-32"

I guess my gripe with the presentation of that document is that it perpetuates 
the problem of confusing "unicode characters" (or integers, or scalar values) 
and their *encoding* (how to represent these integers as byte sequences), which 
is a source of endless confusion among programmers. 

This confusion is easily lifted once you explain that there exist certain 
integers, the scalar values, which are your actual characters and then you have 
different ways of encoding your characters; one can then explain that a 
surrogate is not a character per se, it's a hack and there's no point in 
indexing them except if you want trouble.

This may also suggest another taxonomy of classification for the APIs, those in 
which you work directly with the character data (the scalar values) and those 
in which you work with an encoding of the actual character data (e.g. a 
JavaScript string).

> In reality, most APIs are not even going to be in terms of code points:
> they will return int32's. 

That reality depends on your programming language. If the latter supports type 
abstraction you can define an abstract type for scalar values (whose 
implementation may simply be an integer). If you always go through the 
constructor to create these "integers" you can maintain the invariant that a 
value of this type is an integer in the ranges [0x0000;0xD7FF] and 
[0xE000;0x10FFFF]. Knowing this invariant holds is quite useful when you feed 
your "character" data to other processes like UTF-X encoders: it guarantees the 
correctness of their outputs regardless of what the programmer does.

Best, 

Daniel





Re: Unicode String Models

2018-10-02 Thread Mark Davis ☕️ via Unicode
Whether or not it is well suited, that's probably water under the bridge at
this point. Think of it as jargon; after all, there are
lots of cases like that: a "near miss" wasn't nearly a miss, it was nearly
a hit.

Mark


On Sun, Sep 9, 2018 at 10:56 AM Janusz S. Bień  wrote:

> On Sat, Sep 08 2018 at 18:36 +0200, Mark Davis ☕️ via Unicode wrote:
> > I recently did some extensive revisions of a paper on Unicode string
> models (APIs). Comments are welcome.
> >
> >
> https://docs.google.com/document/d/1wuzzMOvKOJw93SWZAqoim1VUl9mloUxE0W6Ki_G23tw/edit#
>
> It's a good opportunity to propose a better term for "extended grapheme
> cluster": such clusters usually are neither extended nor clusters, and it's
> also not obvious that they are always graphemes.
>
> Cf.the earlier threads
>
> https://www.unicode.org/mail-arch/unicode-ml/y2017-m03/0031.html
> https://www.unicode.org/mail-arch/unicode-ml/y2016-m09/0040.html
>
> Best regards
>
> Janusz
>
> --
>  ,
> Janusz S. Bien
> emeryt (emeritus)
> https://sites.google.com/view/jsbien
>


Re: Unicode String Models

2018-10-02 Thread Mark Davis ☕️ via Unicode
Mark


On Tue, Sep 11, 2018 at 12:17 PM Henri Sivonen via Unicode <
unicode@unicode.org> wrote:

> On Sat, Sep 8, 2018 at 7:36 PM Mark Davis ☕️ via Unicode
>  wrote:
> >
> > I recently did some extensive revisions of a paper on Unicode string
> models (APIs). Comments are welcome.
> >
> >
> https://docs.google.com/document/d/1wuzzMOvKOJw93SWZAqoim1VUl9mloUxE0W6Ki_G23tw/edit#
>
> * The Grapheme Cluster Model seems to have a couple of disadvantages
> that are not mentioned:
>   1) The subunit of string is also a string (a short string conforming
> to particular constraints). There's a need for *another* more atomic
> mechanism for examining the internals of the grapheme cluster string.
>

I did mention this.


>   2) The way an arbitrary string is divided into units when iterating
> over it changes when the program is executed on a newer version of the
> language runtime that is aware of newly-assigned codepoints from a
> newer version of Unicode.
>

Good point. I did mention the EGC definitions changing, but should point
out that if you have a string with unassigned characters in it, they may be
clustered in future versions. Will add.


>  * The Python 3.3 model mentions the disadvantages of memory usage
> cliffs but doesn't mention the associated performance cliffs. It would
> be good to also mention that when a string manipulation causes the
> storage to expand or contract, there's a performance impact that's not
> apparent from the nature of the operation if the programmer's
> intuition works on the assumption that the programmer is dealing with
> UTF-32.
>

The focus was on immutable string models, but I didn't make that clear.
Added some text.

>
>  * The UTF-16/Latin1 model is missing. It's used by SpiderMonkey, DOM
> text node storage in Gecko, (I believe but am not 100% sure) V8 and,
> optionally, HotSpot
> (
> https://docs.oracle.com/javase/9/vm/java-hotspot-virtual-machine-performance-enhancements.htm#JSJVM-GUID-3BB4C26F-6DE7-4299-9329-A3E02620D50A
> ).
> That is, text has UTF-16 semantics, but if the high half of every code
> unit in a string is zero, only the lower half is stored. This has
> properties analogous to the Python 3.3 model, except non-BMP doesn't
> expand to UTF-32 but uses UTF-16 surrogate pairs.
>

Thanks, will add.

>
>  * I think the fact that systems that chose UTF-16 or UTF-32 have
> implemented models that try to save storage by omitting leading zeros
> and gaining complexity and performance cliffs as a result is a strong
> indication that UTF-8 should be recommended for newly-designed systems
> that don't suffer from a forceful legacy need to expose UTF-16 or
> UTF-32 semantics.
>
>  * I suggest splitting the "UTF-8 model" into three substantially
> different models:
>
>  1) The UTF-8 Garbage In, Garbage Out model (the model of Go): No
> UTF-8-related operations are performed when ingesting byte-oriented
> data. Byte buffers and text buffers are type-wise ambiguous. Only
> iterating over byte data by code point gives the data the UTF-8
> interpretation. Unless the data is cleaned up as a side effect of such
> iteration, malformed sequences in input survive into output.
>
>  2) UTF-8 without full trust in ability to retain validity (the model
> of the UTF-8-using C++ parts of Gecko; I believe this to be the most
> common UTF-8 model for C and C++, but I don't have evidence to back
> this up): When data is ingested with text semantics, it is converted
> to UTF-8. For data that's supposed to already be in UTF-8, this means
> replacing malformed sequences with the REPLACEMENT CHARACTER, so the
> data is valid UTF-8 right after input. However, iteration by code
> point doesn't trust ability of other code to retain UTF-8 validity
> perfectly and has "else" branches in order not to blow up if invalid
> UTF-8 creeps into the system.
>
>  3) Type-system-tagged UTF-8 (the model of Rust): Valid UTF-8 buffers
> have a different type in the type system than byte buffers. To go from
> a byte buffer to an UTF-8 buffer, UTF-8 validity is checked. Once data
> has been tagged as valid UTF-8, the validity is trusted completely so
> that iteration by code point does not have "else" branches for
> malformed sequences. If data that the type system indicates to be
> valid UTF-8 wasn't actually valid, it would be nasal demon time. The
> language has a default "safe" side and an opt-in "unsafe" side. The
> unsafe side is for performing low-level operations in a way where the
> responsibility of upholding invariants is moved from the compiler to
> the programmer. It's impossible to violate the UTF-8 validity
> invariant using the safe part of the language.
>

Added a quote based on this; please check if it is ok.

Re: Unicode String Models

2018-10-02 Thread Mark Davis ☕️ via Unicode
Mark


On Sun, Sep 9, 2018 at 3:42 PM Daniel Bünzli 
wrote:

> Hello,
>
> I find your notion of "model" and presentation a bit confusing since it
> conflates what I would call the internal representation and the API.
>
> The internal representation defines how the Unicode text is stored and
> should not really matter to the end user of the string data structure. The
> API defines how the Unicode text is accessed, expressed by what is the
> result of an indexing operation on the string. The latter is really what
> matters for the end-user and what I would call the "model".
>

Because of performance and storage considerations, you need to consider the
possible internal data structures when you are looking at something as
low-level as strings. But most of the 'models' in the document are only
really distinguished by API; only the "Code Point model" discussions are
segmented by internal storage, as with "Code Point Model: UTF-32"


> I think the presentation would benefit from making a clear distinction
> between the internal representation and the API; you could then easily
> summarize them in a table which would make a nice summary of the design
> space.
>

That's an interesting suggestion, I'll mull it over.

>
> I also think you are missing one API, which is the one with EGC I would
> favour: indexing returns Unicode scalar values, internally be it whatever
> you wish UTF-{8,16,32} or a custom encoding. Maybe that's what you intended
> by the "Code Point Model: Internal 8/16/32" but that's not what it says,
> the distinction between code point and scalar value is an important one and
> I think it would be good to insist on it to clarify the minds in such
> documents.
>

In reality, most APIs are not even going to be in terms of code points:
they will return int32's. So not only are they not scalar values,
99.97% are not even code points. Of course, values above 0x10FFFF or below 0
shouldn't ever be stored in strings, but in practice treating
non-scalar-value-code-points as "permanently unassigned" characters doesn't
really cause problems in processing.


> Best,
>
> Daniel
>
>
>


Re: Unicode String Models

2018-10-02 Thread Mark Davis ☕️ via Unicode
Mark


On Sun, Sep 9, 2018 at 10:03 AM Richard Wordingham via Unicode <
unicode@unicode.org> wrote:

> On Sat, 8 Sep 2018 18:36:00 +0200
> Mark Davis ☕️ via Unicode  wrote:
>
> > I recently did some extensive revisions of a paper on Unicode string
> > models (APIs). Comments are welcome.
> >
> >
> https://docs.google.com/document/d/1wuzzMOvKOJw93SWZAqoim1VUl9mloUxE0W6Ki_G23tw/edit#
>
>
> Theoretically at least, the cost of indexing a big string by codepoint
> is negligible.  For example, the cost of accessing the middle character is
> O(1)*, not O(n), where n is the length of the string.  The trick is to
> use a proportionately small amount of memory to store and maintain a
> partial conversion table from character index to byte index.  For
> example, Emacs claims to offer O(1) access to a UTF-8 buffer by
> character number, and I can't significantly fault the claim.
>
> *There may be some creep, but it doesn't matter for strings that can be
> stored within a galaxy.
>
> Of course, the coefficients implied by big-oh notation also matter.
> For example, it can be very easy to forget that a bubble sort is often
> the quickest sorting algorithm.
>

Thanks, added a quote from you on that; see if it looks ok.
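
A sketch of the trick being quoted: store the byte offset of every K-th
character, so indexing costs one table lookup plus at most K character
steps (illustrative Rust, not Emacs's actual implementation):

    /// Sampled index over a UTF-8 string: offsets[i] holds the byte
    /// offset of character number i * K. Space overhead is ~1/K of the
    /// character count.
    struct SampledIndex {
        offsets: Vec<usize>,
    }

    const K: usize = 32;

    impl SampledIndex {
        fn new(s: &str) -> Self {
            SampledIndex {
                offsets: s.char_indices().map(|(b, _)| b).step_by(K).collect(),
            }
        }

        /// Byte offset of character number n in s (n must be in range).
        fn byte_of_char(&self, s: &str, n: usize) -> usize {
            let base = self.offsets[n / K];
            s[base..]
                .char_indices()
                .nth(n % K)
                .map(|(b, _)| base + b)
                .expect("character index out of range")
        }
    }

Maintaining the table under edits is the real work Emacs does; for an
immutable string, building it once is a single O(n) pass.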


> You keep muttering that a sequence of 8-bit code units can contain
> invalid sequences, but often forget that that is also true of sequences
> of 16-bit code units.  Do emoji now ensure that confusion between
> codepoints and code units rapidly comes to light?
>

I didn't neglect that, had a [TBD] for it.

While invalid unpaired surrogates in UTF-16 don't complicate processing much
if they are treated as unassigned characters, allowing invalid UTF-8 sequences
is more troublesome. See, for example, the convolutions needed in ICU
methods that allow ill-formed UTF-8.


> You seem to keep forgetting that grapheme clusters are not how some
> people work.  Does the English word 'café' contain the letter
> 'e'?  Yes or no?  I maintain that it does.  I can't help thinking that
> one might want to look for the letter 'ă' in Vietnamese and find it
> whatever the associated tone mark is.
>

I'm pretty familiar with the situation, thanks for asking.

Often you want to find out more about the components of grapheme clusters,
so you always need to be able to iterate through the code points it
contains. One might think that iterating by grapheme cluster is hiding
features of the text. For example, with *fox́* (fox\u{301}) it is easy to
find that the text contains an *x* by iterating through code points. But
code points often don't reveal their components: does the word
*también* contain
the letter *e*? A reasonable question, but iterating by code point rather
than grapheme cluster doesn't help, since it is typically encoded as a
single U+00E9. And even decomposing to NFD doesn't always help, as with
cases like *rødgrød*.
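
In Rust terms, with the unicode-segmentation crate (an assumption about
tooling; any UAX #29 segmenter behaves the same way):

    use unicode_segmentation::UnicodeSegmentation; // external crate

    fn main() {
        let s = "fox\u{301}"; // "fox́": x followed by combining acute
        // Iterating by code point reveals the 'x'...
        assert!(s.chars().any(|c| c == 'x'));
        // ...while iterating by extended grapheme cluster does not:
        let clusters: Vec<&str> = s.graphemes(true).collect();
        assert_eq!(clusters, vec!["f", "o", "x\u{301}"]);
        // And neither helps for "también" as typically encoded,
        // where é is the single code point U+00E9:
        assert!(!"tambi\u{e9}n".chars().any(|c| c == 'e'));
    }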

>
> You didn't discuss substrings.


I did. But if you mean a definition of substring that lets you access
internal components of substrings, I'm afraid that is quite a specialized
usage. One could do it, but it would burden the general use case.

> I'm interested in how subsequences of
> strings are defined, as the concept of 'substring' isn't really Unicode
> compliant.  Again, expressing 'ă' as a subsequence of the Vietnamese
> word 'nặng' ought to be possible, whether one is using NFD (easier) or
> NFC.  (And there are alternative normalisations that are compatible
> with canonical equivalence.)  I'm most interested in subsequences X of a
> word W where W is the same as AXB for some strings A and B.


> Richard.
>
>


Re: Unicode String Models

2018-10-02 Thread Mark Davis ☕️ via Unicode
Thanks, added a quote from you on that; see if it looks ok.

Mark


On Sat, Sep 8, 2018 at 9:20 PM John Cowan  wrote:

> This paper makes the default assumption that the internal storage of a
> string is a featureless array.  If this assumption is abandoned, it is
> possible to get O(1) indexes with fairly low space overhead.  The Scheme
> language has recently adopted immutable strings called "texts" as a
> supplement to its pre-existing mutable strings, and the sample
> implementation for this feature uses a vector of either native strings or
> bytevectors (char[] vectors in C/Java terms).  I would urge anyone
> interested in the question of storing and accessing mutable strings to read
> the following parts of SRFI 135 at <
> https://srfi.schemers.org/srfi-135/srfi-135.html>:  Abstract, Rationale,
> Specification / Basic concepts, and Implementation.  In addition, the
> design notes at <https://github.com/larcenists/larceny/wiki/ImmutableTexts>,
> though not up to date (in particular, UTF-16 internals are now allowed as
> an alternative to UTF-8), are of interest: unfortunately, the link to the
> span API has rotted.
>
> On Sat, Sep 8, 2018 at 12:53 PM Mark Davis ☕️ via Unicore <
> unic...@unicode.org> wrote:
>
>> I recently did some extensive revisions of a paper on Unicode string
>> models (APIs). Comments are welcome.
>>
>>
>> https://docs.google.com/document/d/1wuzzMOvKOJw93SWZAqoim1VUl9mloUxE0W6Ki_G23tw/edit#
>>
>> Mark
>>
>
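
The shape of the non-featureless-array representation John describes
above, as a toy Rust sketch (chars per chunk for simplicity, where the
SRFI 135 sample implementation uses native strings or bytevectors):

    /// A toy immutable "text": fixed-size chunks of scalar values, so
    /// character number n is located by O(1) chunk arithmetic rather
    /// than a linear scan (illustrative only).
    const CHUNK: usize = 64;

    struct Text {
        chunks: Vec<Vec<char>>, // last chunk may be shorter
    }

    impl Text {
        fn new(s: &str) -> Self {
            let cs: Vec<char> = s.chars().collect();
            Text {
                chunks: cs.chunks(CHUNK).map(|c| c.to_vec()).collect(),
            }
        }

        fn char_at(&self, n: usize) -> char {
            self.chunks[n / CHUNK][n % CHUNK]
        }
    }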


Re: Unicode String Models

2018-10-02 Thread Mark Davis ☕️ via Unicode
Thanks to all for comments. Just revised the text in https://goo.gl/neguxb.

Mark


On Sat, Sep 8, 2018 at 6:36 PM Mark Davis ☕️  wrote:

> I recently did some extensive revisions of a paper on Unicode string
> models (APIs). Comments are welcome.
>
>
> https://docs.google.com/document/d/1wuzzMOvKOJw93SWZAqoim1VUl9mloUxE0W6Ki_G23tw/edit#
>
> Mark
>


Re: Unicode String Models

2018-09-12 Thread Henri Sivonen via Unicode
On Wed, Sep 12, 2018 at 11:37 AM Hans Åberg via Unicode
 wrote:
> The idea is to extend Unicode itself, so that those bytes can be represented 
> by legal codepoints.

Extending Unicode itself would likely create more problems than it
would solve. Extending the value space of Unicode scalar values would
be extremely disruptive for systems whose design is deeply committed
to the current definitions of UTF-16 and UTF-8 staying unchanged.
Assigning a scalar value within the current Unicode scalar value space
to currently malformed bytes would have the problem of those scalar
values losing information about whether they came from malformed bytes or
the well-formed encoding of those scalar values.

It seems better to let applications that have use cases involving
representing non-Unicode values use a special-purpose extension on
their own.

-- 
Henri Sivonen
hsivo...@hsivonen.fi
https://hsivonen.fi/



Re: Unicode String Models

2018-09-12 Thread Hans Åberg via Unicode


> On 12 Sep 2018, at 04:34, Eli Zaretskii via Unicode  
> wrote:
> 
>> Date: Wed, 12 Sep 2018 00:13:52 +0200
>> Cc: unicode@unicode.org
>> From: Hans Åberg via Unicode 
>> 
>> It might be useful to represent non-UTF-8 bytes as Unicode code points. One 
>> way might be to use a codepoint to indicate high bit set followed by the 
>> byte value with its high bit set to 0, that is, truncated into the ASCII 
>> range. For example, U+0080 looks like it is not in use, though I could not 
>> verify this.
> 
> You must use a codepoint that is not defined by Unicode, and never
> will be.  That is what Emacs does: it extends the Unicode codepoint space
> beyond 0x10FFFF.

The idea is to extend Unicode itself, so that those bytes can be represented by 
legal codepoints. U+0080 has had some use in other encodings, but apparently 
not in Unicode itself. But one could use some other value or values, and 
mark them for this special purpose.

There are a number of other byte sequences that are in use, too, like overlong 
UTF-8. The original UTF-8 could then also be extended to handle all 32-bit 
words, including those with the high bit set.




Re: Unicode String Models

2018-09-11 Thread Henri Sivonen via Unicode
On Tue, Sep 11, 2018 at 2:13 PM Eli Zaretskii  wrote:
>
> > Date: Tue, 11 Sep 2018 13:12:40 +0300
> > From: Henri Sivonen via Unicode 
> >
> >  * I suggest splitting the "UTF-8 model" into three substantially
> > different models:
> >
> >  1) The UTF-8 Garbage In, Garbage Out model (the model of Go): No
> > UTF-8-related operations are performed when ingesting byte-oriented
> > data. Byte buffers and text buffers are type-wise ambiguous. Only
> > iterating over byte data by code point gives the data the UTF-8
> > interpretation. Unless the data is cleaned up as a side effect of such
> > iteration, malformed sequences in input survive into output.
> >
> >  2) UTF-8 without full trust in ability to retain validity (the model
> > of the UTF-8-using C++ parts of Gecko; I believe this to be the most
> > common UTF-8 model for C and C++, but I don't have evidence to back
> > this up): When data is ingested with text semantics, it is converted
> > to UTF-8. For data that's supposed to already be in UTF-8, this means
> > replacing malformed sequences with the REPLACEMENT CHARACTER, so the
> > data is valid UTF-8 right after input. However, iteration by code
> > point doesn't trust ability of other code to retain UTF-8 validity
> > perfectly and has "else" branches in order not to blow up if invalid
> > UTF-8 creeps into the system.
> >
> >  3) Type-system-tagged UTF-8 (the model of Rust): Valid UTF-8 buffers
> > have a different type in the type system than byte buffers. To go from
> > a byte buffer to an UTF-8 buffer, UTF-8 validity is checked. Once data
> > has been tagged as valid UTF-8, the validity is trusted completely so
> > that iteration by code point does not have "else" branches for
> > malformed sequences. If data that the type system indicates to be
> > valid UTF-8 wasn't actually valid, it would be nasal demon time. The
> > language has a default "safe" side and an opt-in "unsafe" side. The
> > unsafe side is for performing low-level operations in a way where the
> > responsibility of upholding invariants is moved from the compiler to
> > the programmer. It's impossible to violate the UTF-8 validity
> > invariant using the safe part of the language.
>
> There's another model, the one used by Emacs.  AFAIU, it is different
> from all the 3 you describe above.  In Emacs, each raw byte belonging
> to a byte sequence which is invalid under UTF-8 is represented as a
> special multibyte sequence.  IOW, Emacs's internal representation
> extends UTF-8 with multibyte sequences it uses to represent raw bytes.
> This allows mixing stray bytes and valid text in the same buffer,
> without risking lossy conversions (such as those one gets under model
> 2 above).

I think extensions of UTF-8 that expand the value space beyond Unicode
scalar values, and the problems these extensions are designed to solve,
are a worthwhile topic to cover, but I think it's not quite the topic of
the document; it's a slightly adjacent one.

On that topic, these two are relevant:
https://simonsapin.github.io/wtf-8/
https://github.com/kennytm/omgwtf8

The former is used in the Rust standard library in order to provide a
Unix-like view to Windows file paths in a way that can represent all
Windows file paths. File paths on Unix-like systems are sequences of
bytes whose presentable-to-humans interpretation (these days) is
UTF-8, but there's no guarantee of UTF-8 validity. File paths on
Windows are sequences of unsigned 16-bit numbers whose
presentable-to-humans interpretation is UTF-16, but there's no
guarantee of UTF-16 validity. WTF-8 can represent all Windows file
paths as sequences of bytes such that the paths that are valid UTF-16
as sequences of 16-bit units are valid UTF-8 in the 8-bit-unit
representation. This allows application-visible file paths in the Rust
standard library to be sequences of bytes both on Windows and
non-Windows platforms and to be presentable to humans by decoding as
UTF-8 in both cases.

To my knowledge, the latter isn't in use yet. The implementation is
tracked in https://github.com/rust-lang/rust/issues/49802
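
The step that distinguishes WTF-8 from UTF-8 is small: lone surrogates,
which a UTF-8 encoder must reject, are simply run through the ordinary
three-byte pattern. A sketch of that encoder (illustrative; see the WTF-8
spec for the full rules, e.g. that paired surrogates must be combined
before encoding):

    /// Encode one code point (0..=0x10FFFF), *including* lone
    /// surrogates, using the generalized UTF-8 byte patterns. For
    /// non-surrogates this is exactly UTF-8.
    fn push_wtf8(cp: u32, out: &mut Vec<u8>) {
        match cp {
            0..=0x7F => out.push(cp as u8),
            0x80..=0x7FF => {
                out.push(0xC0 | (cp >> 6) as u8);
                out.push(0x80 | (cp & 0x3F) as u8);
            }
            // U+D800..U+DFFF falls in here and is NOT rejected:
            0x800..=0xFFFF => {
                out.push(0xE0 | (cp >> 12) as u8);
                out.push(0x80 | ((cp >> 6) & 0x3F) as u8);
                out.push(0x80 | (cp & 0x3F) as u8);
            }
            0x10000..=0x10FFFF => {
                out.push(0xF0 | (cp >> 18) as u8);
                out.push(0x80 | ((cp >> 12) & 0x3F) as u8);
                out.push(0x80 | ((cp >> 6) & 0x3F) as u8);
                out.push(0x80 | (cp & 0x3F) as u8);
            }
            _ => panic!("not a code point"),
        }
    }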

-- 
Henri Sivonen
hsivo...@hsivonen.fi
https://hsivonen.fi/


Re: Unicode String Models

2018-09-11 Thread Eli Zaretskii via Unicode
> Date: Wed, 12 Sep 2018 00:13:52 +0200
> Cc: unicode@unicode.org
> From: Hans Åberg via Unicode 
> 
> It might be useful to represent non-UTF-8 bytes as Unicode code points. One 
> way might be to use a codepoint to indicate high bit set followed by the byte 
> value with its high bit set to 0, that is, truncated into the ASCII range. 
> For example, U+0080 looks like it is not in use, though I could not verify 
> this.

You must use a codepoint that is not defined by Unicode, and never
will.  That is what Emacs does: it extends the Unicode codepoint space
beyond 0x10.


Re: Unicode String Models

2018-09-11 Thread Philippe Verdy via Unicode
No, 0xF8..0xFF are not used at all in UTF-8; but U+00F8..U+00FF really
**do** have UTF-8 encodings (using two bytes).

The only safe way to represent arbitrary bytes within strings when they are
not valid UTF-8 is to use invalid UTF-8 sequences, i.e. by using a
"UTF-8-like" private extension of UTF-8 (that extension is still not UTF-8!)

This is what Java does for representing U+0000 by (0xC0,0x80) in the
compiled Bytecode or via the C/C++ interface for JNI when converting the
Java string buffer into a C/C++ string terminated by a NULL byte (not part
of the Java string content itself). That special sequence however is really
exposed in the Java API as a true unsigned 16-bit code unit (char) with
value 0x0000, and a valid single code point.
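
A sketch of that NUL handling (Rust for illustration; the real conversion
happens inside the JVM, and Java's modified UTF-8 additionally encodes
non-BMP characters as CESU-8-style surrogate pairs, omitted here):

    /// Encode a scalar value as in Java "modified UTF-8": like UTF-8,
    /// except U+0000 becomes the overlong pair (0xC0, 0x80), so the
    /// output never contains a NUL byte and can serve as a C string.
    fn push_modified_utf8(c: char, out: &mut Vec<u8>) {
        if c == '\0' {
            out.extend_from_slice(&[0xC0, 0x80]);
        } else {
            let mut buf = [0u8; 4];
            out.extend_from_slice(c.encode_utf8(&mut buf).as_bytes());
        }
    }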

The same can be done for reencoding each invalid byte in non-UTF-8
conforming texts using sequences with a "UTF-8-like" scheme (still
compatible with plain UTF-8 for every valid UTF-8 text); you may either:
  * (a) encode each invalid byte separately (using two bytes for each), or
by encoding them in groups of 3 bits (represented using bytes 0xF8..0xFF)
and then needing 3 bytes in the encoding.
  * (b) encode a private starter (e.g. 0xFF), followed by a byte for the
length of the raw bytes sequence that follows, and then the raw bytes
sequence of that length without any reencoding: this will never be confused
with other valid codepoints (however this scheme may no longer be directly
indexable from arbitrary random positions, unlike scheme (a), which may be
marginally longer)
But both schemes (a) and (b) would be useful in editors that allow editing
arbitrary binary files as if they were plain text, even if they contain
null bytes or invalid UTF-8 sequences (it's up to these editors to find a
way to distinctively represent these bytes, and a way to enter/change them
reliably).

There's also a possibility of extension if the backing store uses UTF-16,
as all code units 0x0000..0xFFFF are used, but one scheme is possible by
using unpaired surrogates (notably a low surrogate NOT prefixed by a high
surrogate: the low surrogate already has 10 useful bits that can store any
raw byte value in its lowest bits): this scheme allows indexing from random
positions and reliable sequential traversal in both directions (backward or
forward)...

... But the presence of such an extension of UTF-16 means that all the
implementation code handling standard text has to detect unpaired
surrogates, and can no longer assume that a low surrogate necessarily has a
high surrogate encoded just before it: it must be tested, and that previous
position may be before the buffer start, causing a possible buffer overrun
in the backward direction (so the code will need to also know the start
position of the buffer and check it, or know the index, which cannot be
negative), possibly exposing unrelated data and causing some security
risks, unless the backing store always adds a leading "guard" code unit set
arbitrarily to 0x0000.





On Wed, Sep 12, 2018 at 12:48 AM J Decker via Unicode 
wrote:

>
>
> On Tue, Sep 11, 2018 at 3:15 PM Hans Åberg via Unicode <
> unicode@unicode.org> wrote:
>
>>
>> > On 11 Sep 2018, at 23:48, Richard Wordingham via Unicode <
>> unicode@unicode.org> wrote:
>> >
>> > On Tue, 11 Sep 2018 21:10:03 +0200
>> > Hans Åberg via Unicode  wrote:
>> >
>> >> Indeed, before UTF-8, in the 1990s, I recall some Russians using
>> >> LaTeX files with sections in different Cyrillic and Latin encodings,
>> >> changing the editor encoding while typing.
>> >
>> > Rather like some of the old Unicode list archives, which are just
>> > concatenations of a month's emails, with all sorts of 8-bit encodings
>> > and stretches of base64.
>>
>> It might be useful to represent non-UTF-8 bytes as Unicode code points.
>> One way might be to use a codepoint to indicate high bit set followed by
>> the byte value with its high bit set to 0, that is, truncated into the
>> ASCII range. For example, U+0080 looks like it is not in use, though I
>> could not verify this.
>>
>>
> it's used for character 0x400.   0xD0 0x80   or 0x8000   0xE8 0x80 0x80
> (I'm probably off a bit in the leading byte)
> UTF-8 can represent every value from 0 to 0x10FFFF (which is all defined
> codepoints); early variants can support up to U+7FFFFFFF...
> and there's enough bits to carry the pattern forward to support 36 bits or
> 42 bits... (the last one breaking the standard a bit by allowing a byte
> without one bit off... 0xFF would be the leadin)
>
> 0xF8-0xFF are unused byte values; but those values can all be encoded into UTF-8.
>


Re: Unicode String Models

2018-09-11 Thread J Decker via Unicode
On Tue, Sep 11, 2018 at 3:15 PM Hans Åberg via Unicode 
wrote:

>
> > On 11 Sep 2018, at 23:48, Richard Wordingham via Unicode <
> unicode@unicode.org> wrote:
> >
> > On Tue, 11 Sep 2018 21:10:03 +0200
> > Hans Åberg via Unicode  wrote:
> >
> >> Indeed, before UTF-8, in the 1990s, I recall some Russians using
> >> LaTeX files with sections in different Cyrillic and Latin encodings,
> >> changing the editor encoding while typing.
> >
> > Rather like some of the old Unicode list archives, which are just
> > concatenations of a month's emails, with all sorts of 8-bit encodings
> > and stretches of base64.
>
> It might be useful to represent non-UTF-8 bytes as Unicode code points.
> One way might be to use a codepoint to indicate high bit set followed by
> the byte value with its high bit set to 0, that is, truncated into the
> ASCII range. For example, U+0080 looks like it is not in use, though I
> could not verify this.
>
>
it's used for character 0x400.   0xD0 0x80   or 0x8000   0xE8 0x80 0x80
(I'm probably off a bit in the leading byte)
UTF-8 can represent every value from 0 to 0x10FFFF (which is all defined
codepoints); early variants can support up to U+7FFFFFFF...
and there's enough bits to carry the pattern forward to support 36 bits or
42 bits... (the last one breaking the standard a bit by allowing a byte
without one bit off... 0xFF would be the leadin)

0xF8-0xFF are unused byte values; but those values can all be encoded into UTF-8.


Re: Unicode String Models

2018-09-11 Thread Hans Åberg via Unicode


> On 11 Sep 2018, at 23:48, Richard Wordingham via Unicode 
>  wrote:
> 
> On Tue, 11 Sep 2018 21:10:03 +0200
> Hans Åberg via Unicode  wrote:
> 
>> Indeed, before UTF-8, in the 1990s, I recall some Russians using
>> LaTeX files with sections in different Cyrillic and Latin encodings,
>> changing the editor encoding while typing.
> 
> Rather like some of the old Unicode list archives, which are just
> concatenations of a month's emails, with all sorts of 8-bit encodings
> and stretches of base64.

It might be useful to represent non-UTF-8 bytes as Unicode code points. One way 
might be to use a codepoint to indicate high bit set followed by the byte value 
with its high bit set to 0, that is, truncated into the ASCII range. For 
example, U+0080 looks like it is not in use, though I could not verify this.




Re: Unicode String Models

2018-09-11 Thread Richard Wordingham via Unicode
On Tue, 11 Sep 2018 21:10:03 +0200
Hans Åberg via Unicode  wrote:

> Indeed, before UTF-8, in the 1990s, I recall some Russians using
> LaTeX files with sections in different Cyrillic and Latin encodings,
> changing the editor encoding while typing.

Rather like some of the old Unicode list archives, which are just
concatenations of a month's emails, with all sorts of 8-bit encodings
and stretches of base64.

Richard.



Re: Unicode String Models

2018-09-11 Thread Hans Åberg via Unicode


> On 11 Sep 2018, at 20:40, Eli Zaretskii  wrote:
> 
>> From: Hans Åberg 
>> Date: Tue, 11 Sep 2018 20:14:30 +0200
>> Cc: hsivo...@hsivonen.fi,
>> unicode@unicode.org
>> 
>> If one encounters a file with mixed encodings, it is good to be able to view 
>> its contents and then convert it, as I see one can do in Emacs.
> 
> Yes.  And mixed encodings is not the only use case: it may well happen
> that the initial attempt to decode the file uses an incorrect assumption
> about the encoding, for some reason.
> 
> In addition, it is important that changing some portion of the file,
> then saving the modified text will never change any part that the user
> didn't touch, as will happen if invalid sequences are rejected at
> input time and replaced with something else.

Indeed, before UTF-8, in the 1990s, I recall some Russians using LaTeX files 
with sections in different Cyrillic and Latin encodings, changing the editor 
encoding while typing.





Re: Unicode String Models

2018-09-11 Thread Eli Zaretskii via Unicode
> From: Hans Åberg 
> Date: Tue, 11 Sep 2018 20:14:30 +0200
> Cc: hsivo...@hsivonen.fi,
>  unicode@unicode.org
> 
> If one encounters a file with mixed encodings, it is good to be able to view 
> its contents and then convert it, as I see one can do in Emacs.

Yes.  And mixed encodings is not the only use case: it may well happen
that the initial attempt to decode the file uses an incorrect assumption
about the encoding, for some reason.

In addition, it is important that changing some portion of the file,
then saving the modified text will never change any part that the user
didn't touch, as will happen if invalid sequences are rejected at
input time and replaced with something else.


Re: Unicode String Models

2018-09-11 Thread Hans Åberg via Unicode


> On 11 Sep 2018, at 19:21, Eli Zaretskii  wrote:
> 
>> From: Hans Åberg 
>> Date: Tue, 11 Sep 2018 19:13:28 +0200
>> Cc: Henri Sivonen ,
>> unicode@unicode.org
>> 
>>> In Emacs, each raw byte belonging
>>> to a byte sequence which is invalid under UTF-8 is represented as a
>>> special multibyte sequence.  IOW, Emacs's internal representation
>>> extends UTF-8 with multibyte sequences it uses to represent raw bytes.
>>> This allows mixing stray bytes and valid text in the same buffer,
>>> without risking lossy conversions (such as those one gets under model
>>> 2 above).
>> 
>> Can you give a reference detailing this format?
> 
> There's no formal description as English text, if that's what you
> meant.  The comments, macros and functions in the files
> src/character.[ch] in the Emacs source tree tell most of that story,
> albeit indirectly, and some additional info can be found in the
> section "Text Representation" of the Emacs Lisp Reference manual.

OK. If one encounters a file with mixed encodings, it is good to be able to 
view its contents and then convert it, as I see one can do in Emacs.





Re: Unicode String Models

2018-09-11 Thread Eli Zaretskii via Unicode
> From: Hans Åberg 
> Date: Tue, 11 Sep 2018 19:13:28 +0200
> Cc: Henri Sivonen ,
>  unicode@unicode.org
> 
> > In Emacs, each raw byte belonging
> > to a byte sequence which is invalid under UTF-8 is represented as a
> > special multibyte sequence.  IOW, Emacs's internal representation
> > extends UTF-8 with multibyte sequences it uses to represent raw bytes.
> > This allows mixing stray bytes and valid text in the same buffer,
> > without risking lossy conversions (such as those one gets under model
> > 2 above).
> 
> Can you give a reference detailing this format?

There's no formal description as English text, if that's what you
meant.  The comments, macros and functions in the files
src/character.[ch] in the Emacs source tree tell most of that story,
albeit indirectly, and some additional info can be found in the
section "Text Representation" of the Emacs Lisp Reference manual.


Re: Unicode String Models

2018-09-11 Thread Hans Åberg via Unicode


> On 11 Sep 2018, at 13:13, Eli Zaretskii via Unicode  
> wrote:
> 
> In Emacs, each raw byte belonging
> to a byte sequence which is invalid under UTF-8 is represented as a
> special multibyte sequence.  IOW, Emacs's internal representation
> extends UTF-8 with multibyte sequences it uses to represent raw bytes.
> This allows mixing stray bytes and valid text in the same buffer,
> without risking lossy conversions (such as those one gets under model
> 2 above).

Can you give a reference detailing this format?





Re: Unicode String Models

2018-09-11 Thread Mark Davis ☕️ via Unicode
These are all interesting and useful comments. I'll be responding once I
get a bit of free time, probably Friday or Saturday.

Mark


On Tue, Sep 11, 2018 at 4:16 AM Eli Zaretskii via Unicode <
unicode@unicode.org> wrote:

> > Date: Tue, 11 Sep 2018 13:12:40 +0300
> > From: Henri Sivonen via Unicode 
> >
> >  * I suggest splitting the "UTF-8 model" into three substantially
> > different models:
> >
> >  1) The UTF-8 Garbage In, Garbage Out model (the model of Go): No
> > UTF-8-related operations are performed when ingesting byte-oriented
> > data. Byte buffers and text buffers are type-wise ambiguous. Only
> > iterating over byte data by code point gives the data the UTF-8
> > interpretation. Unless the data is cleaned up as a side effect of such
> > iteration, malformed sequences in input survive into output.
> >
> >  2) UTF-8 without full trust in ability to retain validity (the model
> > of the UTF-8-using C++ parts of Gecko; I believe this to be the most
> > common UTF-8 model for C and C++, but I don't have evidence to back
> > this up): When data is ingested with text semantics, it is converted
> > to UTF-8. For data that's supposed to already be in UTF-8, this means
> > replacing malformed sequences with the REPLACEMENT CHARACTER, so the
> > data is valid UTF-8 right after input. However, iteration by code
> > point doesn't trust ability of other code to retain UTF-8 validity
> > perfectly and has "else" branches in order not to blow up if invalid
> > UTF-8 creeps into the system.
> >
> >  3) Type-system-tagged UTF-8 (the model of Rust): Valid UTF-8 buffers
> > have a different type in the type system than byte buffers. To go from
> > a byte buffer to an UTF-8 buffer, UTF-8 validity is checked. Once data
> > has been tagged as valid UTF-8, the validity is trusted completely so
> > that iteration by code point does not have "else" branches for
> > malformed sequences. If data that the type system indicates to be
> > valid UTF-8 wasn't actually valid, it would be nasal demon time. The
> > language has a default "safe" side and an opt-in "unsafe" side. The
> > unsafe side is for performing low-level operations in a way where the
> > responsibility of upholding invariants is moved from the compiler to
> > the programmer. It's impossible to violate the UTF-8 validity
> > invariant using the safe part of the language.
>
> There's another model, the one used by Emacs.  AFAIU, it is different
> from all the 3 you describe above.  In Emacs, each raw byte belonging
> to a byte sequence which is invalid under UTF-8 is represented as a
> special multibyte sequence.  IOW, Emacs's internal representation
> extends UTF-8 with multibyte sequences it uses to represent raw bytes.
> This allows mixing stray bytes and valid text in the same buffer,
> without risking lossy conversions (such as those one gets under model
> 2 above).
>


Re: Unicode String Models

2018-09-11 Thread Eli Zaretskii via Unicode
> Date: Tue, 11 Sep 2018 13:12:40 +0300
> From: Henri Sivonen via Unicode 
> 
>  * I suggest splitting the "UTF-8 model" into three substantially
> different models:
> 
>  1) The UTF-8 Garbage In, Garbage Out model (the model of Go): No
> UTF-8-related operations are performed when ingesting byte-oriented
> data. Byte buffers and text buffers are type-wise ambiguous. Only
> iterating over byte data by code point gives the data the UTF-8
> interpretation. Unless the data is cleaned up as a side effect of such
> iteration, malformed sequences in input survive into output.
> 
>  2) UTF-8 without full trust in ability to retain validity (the model
> of the UTF-8-using C++ parts of Gecko; I believe this to be the most
> common UTF-8 model for C and C++, but I don't have evidence to back
> this up): When data is ingested with text semantics, it is converted
> to UTF-8. For data that's supposed to already be in UTF-8, this means
> replacing malformed sequences with the REPLACEMENT CHARACTER, so the
> data is valid UTF-8 right after input. However, iteration by code
> point doesn't trust ability of other code to retain UTF-8 validity
> perfectly and has "else" branches in order not to blow up if invalid
> UTF-8 creeps into the system.
> 
>  3) Type-system-tagged UTF-8 (the model of Rust): Valid UTF-8 buffers
> have a different type in the type system than byte buffers. To go from
> a byte buffer to an UTF-8 buffer, UTF-8 validity is checked. Once data
> has been tagged as valid UTF-8, the validity is trusted completely so
> that iteration by code point does not have "else" branches for
> malformed sequences. If data that the type system indicates to be
> valid UTF-8 wasn't actually valid, it would be nasal demon time. The
> language has a default "safe" side and an opt-in "unsafe" side. The
> unsafe side is for performing low-level operations in a way where the
> responsibility of upholding invariants is moved from the compiler to
> the programmer. It's impossible to violate the UTF-8 validity
> invariant using the safe part of the language.

There's another model, the one used by Emacs.  AFAIU, it is different
from all the 3 you describe above.  In Emacs, each raw byte belonging
to a byte sequence which is invalid under UTF-8 is represented as a
special multibyte sequence.  IOW, Emacs's internal representation
extends UTF-8 with multibyte sequences it uses to represent raw bytes.
This allows mixing stray bytes and valid text in the same buffer,
without risking lossy conversions (such as those one gets under model
2 above).
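
The general shape of that representation, as a sketch (the internal code
values below are illustrative, not Emacs's actual codes for raw bytes,
which live above the Unicode range; see src/character.h):

    /// A buffer element in an Emacs-style model: either a scalar value
    /// or a raw byte that failed UTF-8 decoding, kept distinct so that
    /// output can reproduce the original bytes exactly.
    enum Elem {
        Char(char),
        RawByte(u8),
    }

    /// Illustrative internal code: raw byte b maps to 0x110000 + b,
    /// which can never collide with a Unicode scalar value.
    fn internal_code(e: &Elem) -> u32 {
        match e {
            Elem::Char(c) => *c as u32,
            Elem::RawByte(b) => 0x110000 + *b as u32,
        }
    }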


Re: Unicode String Models

2018-09-11 Thread Henri Sivonen via Unicode
On Sat, Sep 8, 2018 at 7:36 PM Mark Davis ☕️ via Unicode
 wrote:
>
> I recently did some extensive revisions of a paper on Unicode string models 
> (APIs). Comments are welcome.
>
> https://docs.google.com/document/d/1wuzzMOvKOJw93SWZAqoim1VUl9mloUxE0W6Ki_G23tw/edit#

* The Grapheme Cluster Model seems to have a couple of disadvantages
that are not mentioned:
  1) The subunit of a string is also a string (a short string conforming
to particular constraints). There's a need for *another* more atomic
mechanism for examining the internals of the grapheme cluster string.
  2) The way an arbitrary string is divided into units when iterating
over it changes when the program is executed on a newer version of the
language runtime that is aware of newly-assigned codepoints from a
newer version of Unicode.

 * The Python 3.3 model mentions the disadvantages of memory usage
cliffs but doesn't mention the associated performance cliffs. It would
be good to also mention that when a string manipulation causes the
storage to expand or contract, there's a performance impact that's not
apparent from the nature of the operation if the programmer's
intuition works on the assumption that the programmer is dealing with
UTF-32.

 * The UTF-16/Latin1 model is missing. It's used by SpiderMonkey, DOM
text node storage in Gecko, (I believe but am not 100% sure) V8 and,
optionally, HotSpot
(https://docs.oracle.com/javase/9/vm/java-hotspot-virtual-machine-performance-enhancements.htm#JSJVM-GUID-3BB4C26F-6DE7-4299-9329-A3E02620D50A).
That is, text has UTF-16 semantics, but if the high half of every code
unit in a string is zero, only the lower half is stored. This has
properties analogous to the Python 3.3 model, except non-BMP doesn't
expand to UTF-32 but uses UTF-16 surrogate pairs.
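
A hedged sketch of the mechanism shared by this model and the Python 3.3
model above (hypothetical type, not any engine's actual code): the string
picks the narrowest storage that fits, and one out-of-range character
forces an O(n) re-encode of the whole string, which is the hidden cliff.

    // "Narrowest storage wins": appending one character outside the current
    // range silently re-encodes everything. (Wide is UTF-32 for simplicity;
    // the SpiderMonkey-style model uses UTF-16 code units instead.)
    enum CompactString {
        Latin1(Vec<u8>), // every scalar value <= U+00FF
        Wide(Vec<u32>),  // anything else
    }

    impl CompactString {
        fn push(&mut self, c: char) {
            match self {
                CompactString::Latin1(v) if (c as u32) <= 0xFF => v.push(c as u8),
                CompactString::Latin1(v) => {
                    // The cliff: one wide char widens the whole string, O(n).
                    let mut wide: Vec<u32> = v.iter().map(|&b| b as u32).collect();
                    wide.push(c as u32);
                    *self = CompactString::Wide(wide);
                }
                CompactString::Wide(v) => v.push(c as u32),
            }
        }
    }

Appending U+1F47D to a megabyte of Latin-1 text looks like a constant-time
push in the programmer's UTF-32 mental model, but costs a full copy here.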

 * I think the fact that systems that chose UTF-16 or UTF-32 have
implemented models that try to save storage by omitting leading zeros
and gaining complexity and performance cliffs as a result is a strong
indication that UTF-8 should be recommended for newly-designed systems
that don't suffer from a forceful legacy need to expose UTF-16 or
UTF-32 semantics.

 * I suggest splitting the "UTF-8 model" into three substantially
different models:

 1) The UTF-8 Garbage In, Garbage Out model (the model of Go): No
UTF-8-related operations are performed when ingesting byte-oriented
data. Byte buffers and text buffers are type-wise ambiguous. Only
iterating over byte data by code point gives the data the UTF-8
interpretation. Unless the data is cleaned up as a side effect of such
iteration, malformed sequences in input survive into output.

 2) UTF-8 without full trust in ability to retain validity (the model
of the UTF-8-using C++ parts of Gecko; I believe this to be the most
common UTF-8 model for C and C++, but I don't have evidence to back
this up): When data is ingested with text semantics, it is converted
to UTF-8. For data that's supposed to already be in UTF-8, this means
replacing malformed sequences with the REPLACEMENT CHARACTER, so the
data is valid UTF-8 right after input. However, iteration by code
point doesn't trust ability of other code to retain UTF-8 validity
perfectly and has "else" branches in order not to blow up if invalid
UTF-8 creeps into the system.
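
A sketch of the model-2 iteration step (hand-rolled over raw bytes, since a
safe Rust string cannot hold invalid UTF-8): the "else" branches yield
U+FFFD instead of trusting the validity invariant.

    // Defensive code point iteration: decode one scalar value from `bytes`,
    // returning (char, bytes consumed), with U+FFFD fallbacks for malformed
    // sequences that should not be there but are not trusted to be absent.
    fn next_code_point(bytes: &[u8]) -> Option<(char, usize)> {
        let first = *bytes.first()?;
        let len = match first {
            0x00..=0x7F => 1,
            0xC2..=0xDF => 2,
            0xE0..=0xEF => 3,
            0xF0..=0xF4 => 4,
            _ => return Some(('\u{FFFD}', 1)), // else: invalid lead byte
        };
        match bytes.get(..len).and_then(|s| std::str::from_utf8(s).ok()) {
            Some(s) => Some((s.chars().next().unwrap(), len)),
            None => Some(('\u{FFFD}', 1)), // else: truncated or invalid tail
        }
    }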

 3) Type-system-tagged UTF-8 (the model of Rust): Valid UTF-8 buffers
have a different type in the type system than byte buffers. To go from
a byte buffer to an UTF-8 buffer, UTF-8 validity is checked. Once data
has been tagged as valid UTF-8, the validity is trusted completely so
that iteration by code point does not have "else" branches for
malformed sequences. If data that the type system indicates to be
valid UTF-8 wasn't actually valid, it would be nasal demon time. The
language has a default "safe" side and an opt-in "unsafe" side. The
unsafe side is for performing low-level operations in a way where the
responsibility of upholding invariants is moved from the compiler to
the programmer. It's impossible to violate the UTF-8 validity
invariant using the safe part of the language.
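
Concretely, in Rust itself:

    use std::str;

    fn main() {
        let bytes: &[u8] = b"caf\xC3\xA9";

        // Safe side: &[u8] -> &str checks UTF-8 validity exactly once.
        let s: &str = str::from_utf8(bytes).expect("valid UTF-8");
        assert_eq!(s, "café");

        // From here on, iteration trusts the invariant; no error branches.
        for c in s.chars() {
            print!("U+{:04X} ", c as u32);
        }

        // Unsafe side: the programmer vouches for validity. Feeding this
        // invalid bytes would be undefined behavior ("nasal demon time").
        let t: &str = unsafe { str::from_utf8_unchecked(bytes) };
        assert_eq!(s, t);
    }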

 * After working with different string models, I'd recommend the Rust
model for newly-designed programming languages. (Not because I work
for Mozilla but because I believe Rust's way of dealing with Unicode
is the best I've seen.) Rust's standard library provides Unicode
version-independent iterations over strings: by code unit and by code
point. Iteration by extended grapheme cluster is provided by a library
that's easy to include due to the nature of Rust package management
(https://crates.io/crates/unicode_segmentation). Viewing a UTF-8
buffer as a read-only byte buffer has zero run-time cost and allows
for maximally fast guaranteed-valid-UTF-8 output.
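
The three granularities mentioned above, side by side (the grapheme line
assumes the unicode-segmentation crate):

    use unicode_segmentation::UnicodeSegmentation;

    fn main() {
        let s = "e\u{301}!"; // 'e' + COMBINING ACUTE ACCENT + '!'

        let bytes: &[u8] = s.as_bytes(); // code units: zero-cost view
        let code_points: Vec<char> = s.chars().collect();
        let graphemes: Vec<&str> = s.graphemes(true).collect(); // extended

        assert_eq!(bytes.len(), 4);       // 1 + 2 + 1 UTF-8 bytes
        assert_eq!(code_points.len(), 3); // 'e', U+0301, '!'
        assert_eq!(graphemes.len(), 2);   // "é", "!"
    }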

-- 
Henri Sivonen
hsivo...@hsivonen.fi
https://hsivonen.fi/



Re: Unicode String Models

2018-09-10 Thread Hans Åberg via Unicode


> On 9 Sep 2018, at 21:20, Eli Zaretskii via Unicode  
> wrote:
> 
> In Emacs, the gap is always where the text is inserted or deleted, be
> it in the middle of text or at its end.
> 
>> All editors I have seen treat the text as ordered collections of small 
>> buffers (these small buffers may still have
>> small gaps), which are occasionally merged or split when needed (merging 
>> does not cause any
>> reallocation but may free one of the buffers), some of them being paged out 
>> to temporary files when memory is
>> stressed. There are some heuristics in the editor's code as to when 
>> maintenance of the collection is really
>> needed and useful for the performance.
> 
> My point was to say that Emacs is not one of these editors you
> describe.

FYI, gap and rope buffers are described at [1-2]; also see the Emacs manual [3].

1. https://en.wikipedia.org/wiki/Gap_buffer
2. https://en.wikipedia.org/wiki/Rope_(data_structure)
3. https://www.gnu.org/software/emacs/manual/html_node/elisp/Buffer-Gap.html
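
For reference, a minimal gap buffer sketch in Rust (bytes only, ignoring
UTF-8 boundaries; hypothetical code, not Emacs's): the gap sits at the
cursor, so edits there are cheap and only moving the cursor moves bytes.

    struct GapBuffer {
        buf: Vec<u8>,
        gap_start: usize, // cursor position
        gap_end: usize,   // first byte after the gap
    }

    impl GapBuffer {
        fn new(capacity: usize) -> Self {
            GapBuffer { buf: vec![0; capacity], gap_start: 0, gap_end: capacity }
        }

        // Move the gap so that it starts at `pos`; the text is
        // buf[..gap_start] followed by buf[gap_end..].
        fn move_gap(&mut self, pos: usize) {
            while self.gap_start > pos {
                self.gap_start -= 1;
                self.gap_end -= 1;
                self.buf[self.gap_end] = self.buf[self.gap_start];
            }
            while self.gap_start < pos {
                self.buf[self.gap_start] = self.buf[self.gap_end];
                self.gap_start += 1;
                self.gap_end += 1;
            }
        }

        fn insert(&mut self, pos: usize, byte: u8) {
            self.move_gap(pos);
            assert!(self.gap_start < self.gap_end, "gap full; would need to grow");
            self.buf[self.gap_start] = byte;
            self.gap_start += 1;
        }

        fn delete_before(&mut self, pos: usize) {
            self.move_gap(pos);
            self.gap_start = self.gap_start.saturating_sub(1);
        }
    }

Typing at one spot costs O(1) per keystroke; jumping the cursor far away
costs one block move, which matches the description above.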





Re: Unicode String Models

2018-09-09 Thread Eli Zaretskii via Unicode
> From: Philippe Verdy 
> Date: Sun, 9 Sep 2018 19:35:47 +0200
> Cc: Richard Wordingham , 
>   unicode Unicode Discussion 
> 
>  In Emacs, buffer text is a character string with a gap, actually.
> 
> A text buffer with gaps is a complex structure, not just a plain string.

The difference is very small, and a couple of macros allow you to
almost forget about the gap.

> I doubt it constantly uses a single gap at end (insertions and deletions in 
> the middle would
> constantly move large blocks and use excessive CPU and memory bandwidth, with 
> very slow response: users
> do not want to see what they type appearing on the screen at one keystroke 
> every few seconds because each
> typed key causes massive block moves and excessive memory paging from/to disk 
> while this move is being
> performed).

In Emacs, the gap is always where the text is inserted or deleted, be
it in the middle of text or at its end.

> All editors I have seen treat the text as ordered collections of small 
> buffers (these small buffers may still have
> small gaps), which are occasionally merged or split when needed (merging 
> does not cause any
> reallocation but may free one of the buffers), some of them being paged out 
> to temporary files when memory is
> stressed. There are some heuristics in the editor's code as to when maintenance 
> of the collection is really
> needed and useful for the performance.

My point was to say that Emacs is not one of these editors you
describe.

> But beside this the performance cost of UTF indexing of the codepoints is 
> invisible: each buffer will only need
> to avoid breaking text between codepoint boundaries, if the current encoding 
> of the edited text is a UTF. An
> editor may also avoid breaking buffers in the middle of clusters if they 
> render clusters (including ligatures if
> they are supported): clusters are still small in size in every encoding and 
> reasonable buffer sizes can hold at
> least hundreds of clusters (even the largest ones which occur rarely). How 
> editors will manage clusters to
> make them editable is dependent on the implementation, but even UTF or 
> codepoint boundaries are not
> enough to handle that. In all cases the logical text buffer is structured 
> with a complex backing store, where
> parts may be paged out (and will also include more than just the current 
> text, notably it will include parts of the
> indexes, possibly in another temporary working file).

You ignore or disregard the need to represent raw bytes in editor
buffers.  That is when the encoding stops being "invisible".


Re: Unicode String Models

2018-09-09 Thread Philippe Verdy via Unicode
Le dim. 9 sept. 2018 à 17:53, Eli Zaretskii  a écrit :

> > Text editors always use various indexing caches, to manage memory, I/O,
> and allow working on large texts
> > even on systems with low memory available. As much as possible they
> attempt to use the OS-level caches
> > of the filesystem. And in all cases, they don't work directly on their
> text buffer (whose internal representation in
> > their backing store is not just a single string, but a structured
> collection of buffers, built on top of an interface
> > masking the details: the effective text will then be reencoded and saved
> from that object, using complex
> > serialization schemes; the text buffer is "virtualized").
>
> In Emacs, buffer text is a character string with a gap, actually.
>

A text buffer with gaps is a complex structure, not just a plain string.
Gaps are one way to manage memory more efficiently and get reasonable
performance when editing, without having to constantly move large blocks:
these "strings" with gaps may then actually be just a byte buffer used as
a backing store, but that buffer alone does not represent only the
currently represented text. A process will still serialize and perform
cleanup before this buffer can be used to save the edited text. Emacs may
not necessarily deallocate the end of the buffer, but I doubt it constantly
uses a single gap at end (insertions and deletions in the middle would
constantly move large blocks and use excessive CPU and memory bandwidth, with
very slow response: users do not want to see what they type appearing on
the screen at one keystroke every few seconds because each typed key causes
massive block moves and excessive memory paging from/to disk while this
move is being performed).

All editors I have seen treat the text as ordered collections of small
buffers (these small buffers may still have small gaps), which are
occasionally merged or split when needed (merging does not cause any
reallocation but may free one of the buffers), some of them being paged out
to temporary files when memory is stressed. There are some heuristics in the
editor's code as to when maintenance of the collection is really needed and
useful for the performance.

But beside this the performance cost of UTF indexing of the codepoints is
invisible: each buffer will only need to avoid breaking text between
codepoint boundaries, if the current encoding of the edited text is a UTF.
An editor may also avoid breaking buffers in the middle of clusters if they
render clusters (including ligatures if they are supported): clusters are
still small in size in every encoding and reasonable buffer sizes can hold
at least hundreds of clusters (even the largest ones, which occur rarely).
How editors will manage clusters to make them editable is dependent on the
implementation, but even UTF or codepoint boundaries are not enough
to handle that. In all cases the logical text buffer is structured with a
complex backing store, where parts may be paged out (and will also include
more than just the current text, notably it will include parts of the
indexes, possibly in another temporary working file).


Re: Unicode String Models

2018-09-09 Thread Eli Zaretskii via Unicode
> Date: Sun, 9 Sep 2018 16:10:26 +0200
> Cc: unicode Unicode Discussion 
> From: Philippe Verdy via Unicode 
> 
> In practice, we build the "small memory" while 
> instantiating a new iterator that will
> process the whole string (which may not be fully loaded in memory, in which 
> case that "small memory" will
> need reallocation as we also read the whole string (but not necessarily keep 
> it in memory if it's a very long
> text file: the index buffer will still remain in memory even if we no longer 
> need to come back to the start of the
> string). That "small memory" is just a local helper, its cost must be 
> evaluated. In practice however, long texts
> come from I/O: the text will have its interface from files, in which case 
> you'll benefit from the filesystem cache
> of the OS to save I/O, or from network (in which case you'll need to store 
> the network data in a local
> temporary file if you don't want to keep it fully in memory and allow some 
> parts to be paged out of memory by
> the OS. But in Emacs, it only works with files: network texts are necessarily 
> backed at least by a local
> temporary file.

Emacs maintains caches for byte to character conversions for both
strings and buffers.  The cache holds data only for the last string
and separately the last buffer where Emacs needed to convert character
counts to byte counts or vice versa.  For buffers, there are 4 places
that are maintained for every buffer at all times, for which both the
character and byte positions are known, and Emacs uses those whenever
it needs to do conversions for a buffer that is not the cached one.
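
A hedged sketch of such a cache (hypothetical, not Emacs's actual data
structure): keep known (character position, byte position) pairs and
convert by scanning UTF-8 forward from the nearest one at or before the
target.

    struct PosCache {
        known: Vec<(usize, usize)>, // (char index, byte index) pairs
    }

    impl PosCache {
        fn char_to_byte(&mut self, text: &str, charpos: usize) -> usize {
            // Start from the nearest known point at or before `charpos`.
            let &(mut c, mut b) = self
                .known
                .iter()
                .filter(|&&(kc, _)| kc <= charpos)
                .max_by_key(|&&(kc, _)| kc)
                .unwrap_or(&(0, 0));
            while c < charpos {
                b += text[b..].chars().next().expect("in bounds").len_utf8();
                c += 1;
            }
            self.known.push((c, b)); // remember for the next conversion
            b
        }
    }

Repeated conversions near the same spot then cost only the distance from
the last cached pair, which is the effect described above.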

> So that "small memory" for the index is not even needed (but Emacs maintains 
> an index in memory only to locate line numbers.)
> locate line numbers.

That's a different cache, unrelated to what Richard was alluding to
(and I think unrelated to the current discussion).

> Text editors always use various indexing caches, to manage memory, I/O, and 
> allow working on large texts
> even on systems with low memory available. As much as possible they attempt 
> to use the OS-level caches
> of the filesystem. And in all cases, they don't work directly on their text 
> buffer (whose internal representation in
> their backing store is not just a single string, but a structured collection 
> of buffers, built on top of an interface
> masking the details: the effective text will then be reencoded and saved from 
> that object, using complex
> serialization schemes; the text buffer is "virtualized").

In Emacs, buffer text is a character string with a gap, actually.


Re: Unicode String Models

2018-09-09 Thread Philippe Verdy via Unicode
Le dim. 9 sept. 2018 à 10:10, Richard Wordingham via Unicode <
unicode@unicode.org> a écrit :

> On Sat, 8 Sep 2018 18:36:00 +0200
> Mark Davis ☕️ via Unicode  wrote:
>
> > I recently did some extensive revisions of a paper on Unicode string
> > models (APIs). Comments are welcome.
> >
> >
> https://docs.google.com/document/d/1wuzzMOvKOJw93SWZAqoim1VUl9mloUxE0W6Ki_G23tw/edit#
>
>
> Theoretically at least, the cost of indexing a big string by codepoint
> is negligible.  For example, cost of accessing the middle character is
> O(1)*, not O(n), where n is the length of the string.  The trick is to
> use a proportionately small amount of memory to store and maintain a
> partial conversion table from character index to byte index.  For
> example, Emacs claims to offer O(1) access to a UTF-8 buffer by
> character number, and I can't significantly fault the claim.
>

I fully agree, as long as the "middle" character is **approximated** by the
middle of the **encoded** length.

But if it has to be the exact middle (by code point number), you have to
count the codepoints exactly by parsing the whole string in O(n) time, then
compute the middle from it and parse again from the beginning to locate the
encoded position of that code point index, another n/2 steps, so the total
is about 1.5n operations -- still O(n).

The trick of using a "small amount" of memory is only to avoid the second
parse and keep the result at O(n). You get O(1)* only if you keep that "small
memory" around to locate the indexes. But the claim that it is "small" is wrong
if the string is large (big n), and it is of no interest if the string is
indexed only once.

In practice, we build the "small memory" while instantiating a new iterator
that will process the whole string (which may not be fully loaded in memory,
in which case that "small memory" will need reallocation as we read the whole
string, though we need not keep the string itself in memory if it's a very
long text file: the index buffer will still remain in memory even if we no
longer need to come back to the start of the string). That "small memory" is
just a local helper, and its cost must be evaluated. In practice, however,
long texts come from I/O: the text will be read from files, in which case
you'll benefit from the OS filesystem cache to save I/O, or from the network,
in which case you'll need to store the network data in a local temporary file
if you don't want to keep it fully in memory and want to allow some parts to
be paged out of memory by the OS. Emacs only works with files: network texts
are necessarily backed at least by a local temporary file.

So that "small memory" for the index is not even needed. (Emacs maintains an
index in memory only to locate line numbers. It has no need to do that for
column numbers, as it is just faster to rescan the line; extremely long lines
of text are exceptional, and such files are rarely edited with Emacs, unless
you use it to load a binary file, whose representation on screen will be very
different, notably for controls, which are expanded into another cached form.
The column index for display, which is different from the code point index
and specific to the Emacs representation for display/editing, is built only
line by line, separately from the line index kept for the whole edited file.
It is also independent of the effective encoding: it would still be needed
even if the encoding of the backing buffer were UTF-32 with only 1 codepoint
per code unit, because the actual display will still expand the code points
to other forms using visible escaping mechanisms, and it is even needed when
the file is pure 7-bit ASCII, kept with one byte per code point. Choosing
among the Unicode encoding forms has no impact at all on what is really
needed for display in text editors.)

Text editors always use various indexing caches, to manage memory, I/O, and
allow working on large texts even on systems with low memory available. As
much as possible they attempt to use the OS-level caches of the filesystem.
And in all cases, they don't work directly on their text buffer (whose
internal representation in their backing store is not just a single string,
but a structured collection of buffers, built on top of an interface
masking the details: the effective text will then be reencoded and saved
from that object, using complex serialization schemes; the text buffer is
"virtualized").

Only very basic text editors (such as Notepad) use a native single text
buffer, but they are very slow when editing very large files as they
constantly need to copy/move large blocks of memory to perform
inserts/deletions, and they also overuse the memory reallocator. Even
vi(m) or (s)ed in Unix/Linux now use another internal encoded form with a
temporary backing store in temporary files, created automatically when
needed as you start modifying the content. The final consolidation and
serialization will occur only when saving the result.


Re: Unicode String Models

2018-09-09 Thread Daniel Bünzli via Unicode
Hello, 

I find your notion of "model" and presentation a bit confusing since it 
conflates what I would call the internal representation and the API. 

The internal representation defines how the Unicode text is stored and should 
not really matter to the end user of the string data structure. The API defines 
how the Unicode text is accessed, expressed by what is the result of an 
indexing operation on the string. The latter is really what matters for the 
end-user and what I would call the "model".

I think the presentation would benefit from making a clear distinction between 
the internal representation and the API; you could then easily summarize them 
in a table which would make a nice summary of the design space.

I also think you are missing one API which is the one with ECG I would favour: 
indexing returns Unicode scalar values, internally backed by whatever you wish, 
UTF-{8,16,32} or a custom encoding. Maybe that's what you intended by the "Code 
Point Model: Internal 8/16/32" but that's not what it says, the distinction 
between code point and scalar value is an important one, and I think it would be 
good to insist on it in such documents to clarify people's minds.

Best, 

Daniel





Re: Unicode String Models

2018-09-09 Thread Janusz S. Bień via Unicode
On Sat, Sep 08 2018 at 18:36 +0200, Mark Davis ☕️ via Unicode wrote:
> I recently did some extensive revisions of a paper on Unicode string models 
> (APIs). Comments are welcome.
>
> https://docs.google.com/document/d/1wuzzMOvKOJw93SWZAqoim1VUl9mloUxE0W6Ki_G23tw/edit#

It's a good opportunity to propose a better term for "extended grapheme
clusters", which usually are neither extended nor clusters; it's also not
obvious that they are always graphemes.

Cf.the earlier threads

https://www.unicode.org/mail-arch/unicode-ml/y2017-m03/0031.html
https://www.unicode.org/mail-arch/unicode-ml/y2016-m09/0040.html

Best regards

Janusz

-- 
Janusz S. Bień
emeryt (emeritus)
https://sites.google.com/view/jsbien



Re: Unicode String Models

2018-09-09 Thread Mark Davis ☕️ via Unicode
Thanks, excellent comments. While it is clear that some string models have
more complicated structures (with their own pros and cons), my focus was on
simple internal structures. The focus was also on immutable strings — and
the tradeoffs for mutable ones can be quite different — and that needs to
be clearer. I'll add some material about those two areas (with pointers to
sources where possible).

Mark


On Sat, Sep 8, 2018 at 9:20 PM John Cowan  wrote:

> This paper makes the default assumption that the internal storage of a
> string is a featureless array.  If this assumption is abandoned, it is
> possible to get O(1) indexes with fairly low space overhead.  The Scheme
> language has recently adopted immutable strings called "texts" as a
> supplement to its pre-existing mutable strings, and the sample
> implementation for this feature uses a vector of either native strings or
> bytevectors (char[] vectors in C/Java terms).  I would urge anyone
> interested in the question of storing and accessing mutable strings to read
> the following parts of SRFI 135 at <
> https://srfi.schemers.org/srfi-135/srfi-135.html>:  Abstract, Rationale,
> Specification / Basic concepts, and Implementation.  In addition, the
> design notes at <https://github.com/larcenists/larceny/wiki/ImmutableTexts>,
> though not up to date (in particular, UTF-16 internals are now allowed as
> an alternative to UTF-8), are of interest: unfortunately, the link to the
> span API has rotted.
>
> On Sat, Sep 8, 2018 at 12:53 PM Mark Davis ☕️ via Unicore <
> unic...@unicode.org> wrote:
>
>> I recently did some extensive revisions of a paper on Unicode string
>> models (APIs). Comments are welcome.
>>
>>
>> https://docs.google.com/document/d/1wuzzMOvKOJw93SWZAqoim1VUl9mloUxE0W6Ki_G23tw/edit#
>>
>> Mark
>>
>


Re: Unicode String Models

2018-09-09 Thread Richard Wordingham via Unicode
On Sat, 8 Sep 2018 18:36:00 +0200
Mark Davis ☕️ via Unicode  wrote:

> I recently did some extensive revisions of a paper on Unicode string
> models (APIs). Comments are welcome.
> 
> https://docs.google.com/document/d/1wuzzMOvKOJw93SWZAqoim1VUl9mloUxE0W6Ki_G23tw/edit#


Theoretically at least, the cost of indexing a big string by codepoint
is negligible.  For example, cost of accessing the middle character is
O(1)*, not O(n), where n is the length of the string.  The trick is to
use a proportionately small amount of memory to store and maintain a
partial conversion table from character index to byte index.  For
example, Emacs claims to offer O(1) access to a UTF-8 buffer by
character number, and I can't significantly fault the claim.

*There may be some creep, but it doesn't matter for strings that can be
stored within a galaxy.
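
A sketch of that trick (hypothetical code; a checkpoint every K characters):
the table costs O(n/K) memory and answers char-to-byte queries by scanning
at most K-1 characters, i.e. O(1) for fixed K.

    const K: usize = 64;

    struct IndexedStr<'a> {
        text: &'a str,
        checkpoints: Vec<usize>, // byte offset of characters 0, K, 2K, ...
    }

    impl<'a> IndexedStr<'a> {
        fn new(text: &'a str) -> Self {
            let checkpoints = text
                .char_indices()
                .step_by(K)
                .map(|(byte, _)| byte)
                .collect();
            IndexedStr { text, checkpoints }
        }

        /// Byte offset of the i-th character: O(K), i.e. O(1) for fixed K.
        fn byte_index(&self, i: usize) -> usize {
            let mut byte = self.checkpoints[i / K];
            for _ in 0..(i % K) {
                byte += self.text[byte..].chars().next().unwrap().len_utf8();
            }
            byte
        }
    }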

Of course, the coefficients implied by big-oh notation also matter.
For example, it can be very easy to forget that for small arrays a bubble
sort is often the quickest sorting algorithm.

You keep muttering that a sequence of 8-bit code units can contain
invalid sequences, but often forget that that is also true of sequences
of 16-bit code units.  Do emoji now ensure that confusion between
codepoints and code units rapidly comes to light?

You seem to keep forgetting that grapheme clusters are not how some
people work.  Does the English word 'café' contain the letter
'e'?  Yes or no?  I maintain that it does.  I can't help thinking that
one might want to look for the letter 'ă' in Vietnamese and find it
whatever the associated tone mark is.

You didn't discuss substrings.  I'm interested in how subsequences of
strings are defined, as the concept of 'substring' isn't really Unicode
compliant.  Again, expressing 'ă' as a subsequence of the Vietnamese
word 'nặng' ought to be possible, whether one is using NFD (easier) or
NFC.  (And there are alternative normalisations that are compatible
with canonical equivalence.)  I'm most interested in subsequences X of a
word W where W is the same as AXB for some strings A and B.

Richard.



Unicode String Models

2018-09-08 Thread Mark Davis ☕️ via Unicode
I recently did some extensive revisions of a paper on Unicode string models
(APIs). Comments are welcome.

https://docs.google.com/document/d/1wuzzMOvKOJw93SWZAqoim1VUl9mloUxE0W6Ki_G23tw/edit#

Mark


RE: Unicode String Models -- minor proofreading nit (was: Unicode String Models)

2012-07-31 Thread CE Whitehead

Hi, once more Philippe; one more note: my apologies; I am still trying to make 
sense of the effects of the various characters/non-characters on the rest of 
the text in processing of character strings; thus, if there are any errors in 
my reply (below), someone correct me; I am not really a programmer (excepting a 
knowledge of HTML/CSS and a little JavaScript and maybe just a bit of other 
stuff).

From: cewcat...@hotmail.com
To: verd...@wanadoo.fr
CC: unicode@unicode.org
Subject: RE: Unicode String Models -- minor proofreading nit (was: Unicode 
String Models)
Date: Sat, 28 Jul 2012 13:35:57 -0400

 From: verd...@wanadoo.fr
 Date: Fri, 27 Jul 2012 03:17:07 +0200
 Subject: Re: Unicode String Models -- minor proofreading nit (was: Unicode 
 String Models)
 To: m...@macchiato.com
 CC: cewcat...@hotmail.com; unicode@unicode.org
 
 I just wonder where the XSS attack is really an issue here. XSS
 attacks involve bypassing the document source domain in order to
 attempt to use or insert data found in another document issued or
 managed by another domain, in a distinct security realm.
 
 A more serious issue would be the fact that the document being
 parsed has an unknown security status, and that the document is subject to
 inspection (for example by an antivirus or antimalware trying to
 identify sensitive code which would remain usable, but hidden by the
 cipher-like invalid encoding that a browser would just interpret
 blindly).
 
 Yes that's what I think is the issue here. 

And this is also what's discussed in the Unicode security document I suggested 
linking to.
 One problem with the strategy of deleting invalid sequences blindly is
 of course the fact that such invalid sequences may be complex and
 could be arbitrarily long. But antivirus/antimalware solutions already
 know how to ignore these invalid sequences when trying to identify
 malicious code, so that they will detect more possibilities.
 
 Thanks for info. I did not know this.
 In that case, the safest strategy for an antivirus is effectively to
 discard the invalid sequences, trying to mimic what an unaware browser
 would do blindly, with the consequence of running the potentially
 dangerous code. The strategy used in a browser for rendering the
 document, or in a security solution when trying to detect malicious
 code, will then be completely opposed.
 
 Yes, this is a good strategy for anti-virus and malware detection programs; 
 however I think Unicode is more focused on general 
 character handling/display.
 Another concern is the choice of the replacement character. This
 document only suggests the U+FFFD character, which may also not pass
 some encoding converters used when forwarding the document to a lower
 layer API running the code effectively.
 
 If the code (as opposed to the normal text) is used, it will
 frequently be restricted only to ASCII or to a SBCS encoding. And in
 that case, a better substitute will be the ASCII C0 control which is
 normally invalid in plain text programming/scripting source code.
 Traditionally this C0 control character is SUB. It may even be used to
 replace all invalid bytes of an invalid UTF-8 sequence, without
 changing its length (this is not always possible with U+FFFD in UTF-8
 because it will be encoded as 3 bytes and there may be
 invalid/rejected sequences containing only 1 or 2 bytes that should
 survive with the same length after the replacement).
 
 One concern is that SUB and U+FFFD have different character
 properties. And not all Unicode algorithms treat it the way they
 should (for example in boundary breakers or in some transforms).
Hmm, after checking several Unicode documents and some of the FAQ 
(http://unicode.org/faq/collation.html), my understanding is that using a 
non-character code point is the best solution here; I don't know which 
non-character code point is best, but at least in collation any non-character 
code point should be ignored. That is, collation is ideally performed on 
normalized character strings and not on code points.
However, I do believe that some string processing/comparison algorithms that 
look at the string itself and not the characters may be affected. So this is an 
issue to consider for some yes.
 Another concern is that even this C0 control may be used for
 controlling some terminal functions (such uses are probably in very old
 applications), so some code converters are using instead the question
 mark (?) which is even worse as it may break a query URL, unexpectedly
 passing the data encoded after it to another HTTP(S) resource than the
 expected one, and also because it will bypass some cache-control
 mechanism.
Thanks for bringing this up. (I'm not a programmer and really can't discuss 
this further, but I do know how to create my own queries for the search 
engine, placing question marks wherever needed so I can bring a particular search page 
up by typing a URL, for example when I'm searching for particular text in a 
Google book . . . )

Re: Unicode String Models -- minor proofreading nit (was: Unicode String Models)

2012-07-31 Thread Philippe Verdy
2012/7/31 CE Whitehead cewcat...@hotmail.com

 Hmm, after checking several Unicode documents and some of the FAQ (
 http://unicode.org/faq/collation.html), my understanding is that using a
 non-character code point is the best solution here; I don't know which
 non-character code point is best, but at least in collation any
 non-character code point should be ignored. That is, collation is ideally
 performed on normalized character strings and not on code points.
 However, I do believe that some string processing/comparison algorithms
 that look at the string itself and not the characters may be affected. So
 this is an issue to consider for some yes.


The issue when using a placeholder to replace invalid sequences is that in
frequent cases the stream length must not be altered. If you use a
non-character in a UTF-8 stream, it will not always be possible to insert
it. The null character (even though it is encoded as a single byte in
UTF-8) is the worst choice, due to the many assumptions made throughout
software where it means an end-of-string or sometimes end-of-stream
(sometimes also some downstream processes will represent the actual
character as a 2-byte sequence even if that's not strictly UTF-8).

In UTF-8 you may use 0xFF as a placeholder, but it will not pass through
some interfaces because it is an invalid sequence everywhere in UTF-8. So
you need a valid character that is still encoded as a single byte and not
used in plain-text files. The SUB C0 control character matches such needs.

As always, this is not an universal solution, there are always pros and
cons in all approaches when trying to manage encoding errors and how to
pass over them (if it is desirable).



 Another concern is that even this C0 control may be used for
  controlling some terminal functions (such uses are probably in very old
  applications), so some code converters are using instead the question
  mark (?) which is even worse as it may break a query URL, unexpectedly
  passing the data encoded after it to another HTTP(S) resource than the
  expected one, and also because it will bypass some cache-control
  mechanism.
 Thanks for bringing this up. (I'm not a programmer and really can't
 discuss this further thus but I do know how to create my own queries for
 the search engine, placing question marks wherever so I can bring a
 particular search page up by typing a url for example when I'm searching
 for particular text in a google book . . . )

 
  The document does not discuss really how to choose the replacement
  character. My opinion is that for UTF-8 encoded documents, the ASCII
  C0 control (SUB) is still better than the U+FFFD character which works
  well only in UTF-16 and UTF-32 encodings. It also works well with many
  legacy SBCS or MBCS encodings (including ISO 8859-*, Windows codepages
  and many PC/OEM codepages, JIS or EUC variants; it is also mapped in
  many EBCDIC codepages, distinctly from simple filler/padding
  characters that are blindly stripped in many applications as if they
  were just whitespaces at end of a fixed-width data field).
 
 It seems that in a previous Unicode discussion, it has been recommended
 that applications use codepoints in the noncharacter code points block
 rather than non-unicode control codes.  Thus one should not use a character
 at all, just a placeholder.


If the encoding length is not an issue (UTF-16 and UTF-32 streams), yes
this is a good solution. Unfortunately we don't have any non-character in
the ASCII range which is encoded as one byte in most encodings.


 IMO (in my opinion), just having any placeholder is helpful
 security-wise. (However, I'm still thinking this over.)


Not any placeholder randomly, but placeholders that can be universally
replaced one for another, depending on the situations and constraints. Then
you pass only that value. But if encoding length is an issue, you'll have
no other choice than allowing sequences of multiple placeholders.

The list of possible placeholders that an application can process on input
or return on output should be documented. Non-characters are not the only
possible choices.


RE: Unicode String Models -- minor proofreading nit (was: Unicode String Models)

2012-07-28 Thread CE Whitehead



 From: verd...@wanadoo.fr
 Date: Fri, 27 Jul 2012 03:17:07 +0200
 Subject: Re: Unicode String Models -- minor proofreading nit (was: Unicode 
 String Models)
 To: m...@macchiato.com
 CC: cewcat...@hotmail.com; unicode@unicode.org
 
 I just wonder where the XSS attack is really an issue here. XSS
 attacks involve bypassing the document source domain in order to
 attempt to use or insert data found in another document issued or
 managed by another domain, in a distinct security realm.
 
 A more serious issue would be the fact that the document being
 parsed has an unknown security status, and that the document is subject to
 inspection (for example by an antivirus or antimalware trying to
 identify sensitive code which would remain usable, but hidden by the
 cipher-like invalid encoding that a browser would just interpret
 blindly).
 
Yes that's what I think is the issue here.
 One problem with the strategy of deleting invalid sequences blindly is
 of course the fact that such invalid sequences may be complex and
 could be arbitrarily long. But antivirus/antimalware solutions already
 know how to ignore these invalid sequences when trying to identify
 malicious code, so that they will detect more possibilities.
 
Thanks for info. I did not know this.
 In that case, the safest strategy for an antivirus is effectively to
 discard the invalid sequences, trying to mimic what an unaware browser
 would do blindly, with the consequence of running the potentially
 dangerous code. The strategy used in a browser for rendering the
 document, or in a security solution when trying to detect malicious
 code, will then be completely opposed.
 
Yes, this is a good strategy for anti-virus and malware detection programs; 
however I think Unicode is more focused on general character handling/display.
 Another concern is the choice of the replacement character. This
 document only suggests the U+FFFD character, which may also not pass
 some encoding converters used when forwarding the document to a lower
 layer API running the code effectively.
 
 If the code (as opposed to the normal text) is used, it will
 frequently be restricted only to ASCII or to a SBCS encoding. And in
 that case, a better substitute will be the ASCII C0 control which is
 normally invalid in plain text programming/scripting source code.
 Traditionally this C0 control character is SUB. It may even be used to
 replace all invalid bytes of an invalid UTF-8 sequence, without
 changing its length (this is not always possible with U+FFFD in UTF-8
 because it will be encoded as 3 bytes and there may be
 invalid/rejected sequences containing only 1 or 2 bytes that should
 survive with the same length after the replacement).
 
 One concern is that SUB and U+FFFD have different character
 properties. And not all Unicode algorithms treat it the way they
 should (for example in boundary breakers or in some transforms).
 Another concern is that even this C0 control may be used for
 controlling some terminal functions (such uses are probably in very old
 applications), so some code converters are using instead the question
 mark (?) which is even worse as it may break a query URL, unexpectedly
 passing the data encoded after it to another HTTP(S) resource than the
 expected one, and also because it will bypass some cache-control
 mechanism.
 
 The document does not really discuss how to choose the replacement
 character. My opinion is that for UTF-8 encoded documents, the ASCII
 C0 control (SUB) is still better than the U+FFFD character which works
 well only in UTF-16 and UTF-32 encodings. It also works well with many
 legacy SBCS or MBCS encodings (including ISO 8859-*, Windows codepages
 and many PC/OEM codepages, JIS or EUC variants; it is also mapped in
 many EBCDIC codepages, distinctly from simple filler/padding
 characters that are blindly stripped in many applications as if they
 were just whitespaces at end of a fixed-width data field).
 
 How many replacements must be made? My opinion is that replacements
 should be done so that no change occurs to the data length. For the
 remaining cases, data security can detect this case with strong data
 signatures like SHA1 for not too long documents (like HTML pages, or
 full email contents, with some common headers needed for their
 indexing or routing or delivery to the right person), or SHA256 for
 very short documents (like single datagrams or the value of short
 database fields like phone numbers or people's last names or email
 address) or very long documents (or with security certificates over a
 secure channel which will also detect otherwise undetected data corruption in
 the end-to-end communication channel, either one-to-one or one-to-many
 for broadcasts and selective multicasts but this case of secure
 channels should not be a problem here as it also has to detect and
 secure many other cases than just invalid plain-text encodings,
 notably by man-in-the-middle attacks or replay attacks, or to reliably
 detect

RE: Unicode String Models

2012-07-26 Thread Dreiheller, Albrecht
David Starner wrote (Saturday, July 21, 2012 12:02 AM):

 The question of whether to allow non-ASCII characters in variables is open.
 
 I don't see why. Yes, a lot of organizations will use ASCII only, but
 not all programming is done large international organizations. For
 personal hacking, or small mononational organizations, Unicode
 variables may be much more convenient. It's not like Chinese variables
 with Chinese comments is going to be much harder to debug for the
 English speaker then English variables (or bad English variables) with
 Chinese comments, and ASCII-romanized Chinese variables may be the
 worst of all worlds.

Imagine mixed use of Latin and Cyrillic variable names. How would you debug
code using two variables named /* cyrillic */ А and /* latin */ A?

If it were standard practice to use Unicode variables, a bad guy
could have his back door even in public source code without being detected.

To avoid confusion, rules from
http://www.unicode.org/Public/security/latest/confusables.txt
would have to be applied.

A.D.




Re: Unicode String Models -- minor proofreading nit (was: Unicode String Models)

2012-07-26 Thread CE Whitehead




Hi, I have one minor comment:




* * *

Validation; par 3, comment in parentheses
. . . (you never want to just delete it; that has security problems).
{ COMMENT: would it be helpful to have a reference here to the Unicode 
security document that discusses this issue -- TR 36, 3.5 
http://www.unicode.org/reports/tr36/#Deletion_of_Noncharacters ?}

Best,

--C. E. Whitehead
cewcat...@hotmail.com 


Re: Unicode String Models -- minor proofreading nit (was: Unicode String Models)

2012-07-26 Thread Mark Davis ☕
Thanks, good suggestion.

Mark <https://plus.google.com/114199149796022210033>

— Il meglio è l’inimico del bene —



On Thu, Jul 26, 2012 at 12:40 PM, CE Whitehead cewcat...@hotmail.com wrote:

 Validation; par 3, comment in parentheses
 . . . (you never want to just delete it; that has security problems).
 { COMMENT: would it be helpful to have a reference here to the
 Unicode security document that discusses this issue -- TR 36, 3.5
 http://www.unicode.org/reports/tr36/#Deletion_of_Noncharacters ?}



Re: Unicode String Models -- minor proofreading nit (was: Unicode String Models)

2012-07-26 Thread Philippe Verdy
I just wonder where the XSS attack is really an issue here. XSS
attacks involve bypassing the document source domain in order to
attempt to use or insert data found in another document issued or
managed by another domain, in a distinct security realm.

A more serious issue would be the fact that the document being
parsed has an unknown security status, and that the document is subject to
inspection (for example by an antivirus or antimalware trying to
identify sensitive code which would remain usable, but hidden by the
cipher-like invalid encoding that a browser would just interpret
blindly).

One problem with the strategy of deleting invalid sequences blindly is
of course the fact that such invalid sequences may be complex and
could be arbitrarily long. But antivirus/antimalware solutions already
know how to ignore these invalid sequences when trying to identify
malicious code, so that they will detect more possibilities.

In that case, the safest strategy for an antivirus is effectively to
discard the invalid sequences, trying to mimic what an unaware browser
would do blindly, with the consequence of running the potentially
dangerous code. The strategy used in a browser for rendering the
document, or in a security solution when trying to detect malicious
code, will then be completely opposed.

Another concern is the choice of the replacement character. This
document only suggests the U+FFFD character, which may also not pass
some encoding converters used when forwarding the document to a lower
layer API running the code effectively.

If the code (as opposed to the normal text) is used, it will
frequently be restricted only to ASCII or to a SBCS encoding. And in
that case, a better substitute will be the ASCII C0 control which is
normally invalid in plain text programming/scripting source code.
Traditionally this C0 control character is SUB. It may even be used to
replace all invalid bytes of an invalid UTF-8 sequence, without
changing its length (this is not always possible with U+FFFD in UTF-8
because it will be encoded as 3 bytes and there may be
invalid/rejected sequences containing only 1 or 2 bytes that should
survive with the same length after the replacement).

One concern is that SUB and U+FFFD have different character
properties. And not all Unicode algorithms treat it the way they
should (for example in boundary breakers or in some transforms).
Another concern is that even this C0 control may be used for
controlling some terminal functions (such uses are probably in very old
applications), so some code converters are using instead the question
mark (?) which is even worse as it may break a query URL, unexpectedly
passing the data encoded after it to another HTTP(S) resource than the
expected one, and also because it will bypass some cache-control
mechanism.

The document does not really discuss how to choose the replacement
character. My opinion is that for UTF-8 encoded documents, the ASCII
C0 control (SUB) is still better than the U+FFFD character which works
well only in UTF-16 and UTF-32 encodings. It also works well with many
legacy SBCS or MBCS encodings (including ISO 8859-*, Windows codepages
and many PC/OEM codepages, JIS or EUC variants; it is also mapped in
many EBCDIC codepages, distinctly from simple filler/padding
characters that are blindly stripped in many applications as if they
were just whitespaces at end of a fixed-width data field).

How many replacements must be made? My opinion is that replacements
should be done so that no change occurs to the data length. For the
remaining cases, data security can detect this case with strong data
signatures like SHA1 for not too long documents (like HTML pages, or
full email contents, with some common headers needed for their
indexing or routing or delivery to the right person), or SHA256 for
very short documents (like single datagrams or the value of short
database fields like phone numbers or people's last names or email
address) or very long documents (or with security certificates over a
secure channel which will also detect otherwise undetected data corruption in
the end-to-end communication channel, either one-to-one or one-to-many
for broadcasts and selective multicasts but this case of secure
channels should not be a problem here as it also has to detect and
secure many other cases than just invalid plain-text encodings,
notably by man-in-the-middle attacks or replay attacks, or to reliably
detect a DoS attack by a broken channel with unrecoverable data losses,
something that can be enforced by reasonable timeout watchdogs, if
performance of the channel should be ensured).

2012/7/27 Mark Davis ☕ m...@macchiato.com:
 Thanks, good suggestion.

 Mark

 — Il meglio è l’inimico del bene —



 On Thu, Jul 26, 2012 at 12:40 PM, CE Whitehead cewcat...@hotmail.com
 wrote:

 Validation; par 3, comment in parentheses
 . . . (you never want to just delete it; that has security problems).
 { COMMENT: would it be helpful to have a reference here to 

Re: User-Hostile Text Editing (was: Unicode String Models)

2012-07-22 Thread Julian Bradfield
On 2012-07-21, Richard Wordingham richard.wording...@ntlworld.com wrote:
 Are there any widely available ways of enabling the deleting of the
 first character in a default grapheme cluster?  Having carefully added
 two or more marks to a base character, I find it extremely irritating
 to find I have entered the wrong base character and have to type the
 whole thing again. As one can delete the last character in a cluster,
 why not the first? It's not as though the default grapheme cluster is
 usually thought of as a single character.

What do you mean by widely available?
A decent editor should let you choose whether to break apart clusters
or not. I presume that such editors exist! (Mine always breaks
clusters, but that's because I'm the only user, and I don't care
enough to implement clustering;-) Yudit might be one, but since it
seems to have no documentation, I can't tell.
If yours doesn't, then get on to its authors!


-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.




Re: User-Hostile Text Editing (was: Unicode String Models)

2012-07-22 Thread Richard Wordingham
On Sun, 22 Jul 2012 08:59:13 +0100
Julian Bradfield jcb+unic...@inf.ed.ac.uk wrote:

 On 2012-07-21, Richard Wordingham richard.wording...@ntlworld.com
 wrote:

  Are there any widely available ways of enabling the deleting of the
  first character in a default grapheme cluster?

 What do you mean by widely available?

An example would be a technique that worked for many applications on a
platform, or for several significant applications across most
platforms.  An example of the former would be an effective per-user
tailoring of grapheme clusters.  A candidate for the latter is
Libreoffice's rule that alt+cursor key moves within grapheme clusters
rather than moving the point to the start of the next grapheme
cluster.  (Unfortunately this doesn't even work inside tables, so it
doesn't look much of a candidate.)  This can be used in the sequence
alt/right-arrow rubout.

Richard.



User-Hostile Text Editing (was: Unicode String Models)

2012-07-21 Thread Richard Wordingham
On Fri, 20 Jul 2012 23:16:17 +
Murray Sargent murr...@exchange.microsoft.com wrote:

 My latest blog post “Ligatures, Clusters, Combining Marks and
 Variation
 Sequences” <http://blogs.msdn.com/b/murrays/archive/2012/06/30/ligatures-clusters-combining-marks-and-variation-sequences.aspx>
 discusses some of these complications.

Are there any widely available ways of enabling the deleting of the
first character in a default grapheme cluster?  Having carefully added
two or more marks to a base character, I find it extremely irritating
to find I have entered the wrong base character and have to type the
whole thing again. As one can delete the last character in a cluster,
why not the first? It's not as though the default grapheme cluster is
usually thought of as a single character.

Richard.




Re: Unicode String Models

2012-07-21 Thread Richard Wordingham
On Fri, 20 Jul 2012 15:01:42 -0700
David Starner prosfil...@gmail.com wrote:

 The question of whether to allow non-ASCII characters in variables
 is open.

 It's not like Chinese variables
 with Chinese comments is going to be much harder to debug for the
 English speaker then English variables (or bad English variables) with
 Chinese comments, and ASCII-romanized Chinese variables may be the
 worst of all worlds.

On the contrary, there is the issue of confusables.  An English speaker
may easily overlook the Chinese equivalent of ASCII confusables such as
the letter 'l' and the digit '1' or the letter 'O' and the digit '0'.
It gets even worse if the Chinese characters are rendered as missing
glyphs.

Moreover, one method of hiding design information while still
delivering 'source' code is to not only strip out all comments, but to
replace all variable names by meaningless and hard to distinguish names
such as x1234, x1235, x1236, etc.

Richard.



RE: User-Hostile Text Editing (was: Unicode String Models)

2012-07-21 Thread Murray Sargent
For math accents, it's easy since the base is the argument of the accent 
operator. But for clusters the standard practice is for the Delete key to 
delete the whole cluster as you note. Also you can't select just part of a 
cluster to save it from deletion. 

I'd think deleting the first character of a cluster would make a nice 
context-menu option. For example, when you right-click on a cluster, the 
resulting context menu could have an entry like “delete first character”. Maybe 
other such options could be added as well.

Murray

-Original Message-
From: unicode-bou...@unicode.org [mailto:unicode-bou...@unicode.org] On Behalf 
Of Richard Wordingham
Sent: Saturday, July 21, 2012 4:52 PM
To: Unicode
Subject: User-Hostile Text Editing (was: Unicode String Models)

On Fri, 20 Jul 2012 23:16:17 +
Murray Sargent murr...@exchange.microsoft.com wrote:

 My latest blog post “Ligatures, Clusters, Combining Marks and 
 Variation 
 Sequences” <http://blogs.msdn.com/b/murrays/archive/2012/06/30/ligatures-clusters-combining-marks-and-variation-sequences.aspx>
 discusses some of these complications.

Are there any widely available ways of enabling the deleting of the first 
character in a default grapheme cluster?  Having carefully added two or more 
marks to a base character, I find it extremely irritating to find I have 
entered the wrong base character and have to type the whole thing again. As one 
can delete the last character in a cluster, why not the first? It's not as 
though the default grapheme cluster is usually thought of as a single character.

Richard.







Unicode String Models

2012-07-20 Thread Mark Davis ☕
I put together some notes on different ways for programming languages to
handle Unicode at a low level. Comments welcome.
http://macchiati.blogspot.com/2012/07/unicode-string-models-many-programming.html

--
Mark <https://plus.google.com/114199149796022210033>

— Il meglio è l’inimico del bene —


Re: Unicode String Models

2012-07-20 Thread David Starner
On Fri, Jul 20, 2012 at 1:31 PM, Mark Davis ☕ m...@macchiato.com wrote:
 I put together some notes on different ways for programming languages to 
 handle Unicode at a low level. Comments welcome.

I had a few comments for general discussion:

That means that it is best to optimize for BMP characters (and as a
subset, ASCII and Latin-1), and fall into a ‘slow path’ when a
supplementary character is encountered.

I'm concerned about the statement/implication that one can optimize
for ASCII and Latin-1. It's too easy for a lot of developers to test
speed with the English/European documents they have around and test
correctness only with Chinese. I see the argument in theory and
practice, but it's a tough line to walk, especially if you're not
familiar with i18n.

I can see for i in range (1, 1000) do a := " "; a +:= "龜"; done being
way slower than necessary (especially for non-trivially optimized away
cases), for example.

Interfacing with most software libraries can avoid conversions in and out

I'm curious about this. I won't dismiss it off hand, but besides ICU,
what libraries are we talking about that haven't already been
rewritten for GTK, Java, Python, take your pick.

The string class is indexed by code unit, and is UTF-32. Used by: glibc?

I haven't poked at it, but Ada 2012 (in pre-standard editorial-changes
only stage) has Latin-1, UCS-2 (the standard is not clear here about
UTF-16 vs. UCS-2) and UTF-32 (UCS-4; it mentions 2147483648 code
points) strings. There are functions in the standard to store a
Unicode string in the Latin-1 strings as UTF-8 and in the UCS-2
strings as UTF-16, but there is a choice to use straight UTF-32.

The question of whether to allow non-ASCII characters in variables is open.

I don't see why. Yes, a lot of organizations will use ASCII only, but
not all programming is done large international organizations. For
personal hacking, or small mononational organizations, Unicode
variables may be much more convenient. It's not like Chinese variables
with Chinese comments is going to be much harder to debug for the
English speaker then English variables (or bad English variables) with
Chinese comments, and ASCII-romanized Chinese variables may be the
worst of all worlds.

--
Kie ekzistas vivo, ekzistas espero.




Re: Unicode String Models

2012-07-20 Thread martin

That means that it is best to optimize for BMP characters (and as a
subset, ASCII and Latin-1), and fall into a ‘slow path’ when a
supplementary character is encountered.

I'm concerned about the statement/implication that one can optimize
for ASCII and Latin-1. It's too easy for a lot of developers to test
speed with the English/European documents they have around and test
correctness only with Chinese.


I don't think this is a concern within the context of the posting.
He is talking about Unicode string models, something that most developers
will never have to design themselves; instead, they use what the
language gives them.

People implementing Unicode support for programming languages, in
turn, will typically be aware of these issues.


I can see for i in range (1, 1000) do a :=  ; a +:= 龜; done being
way slower than necessary (especially for non-trivially optimized away
cases), for example.


Why is that? Take Python 3.3, for example. It does optimize for ASCII,
so the first string will use only 1 byte for the space, and two bytes
for 龜 (each in a string literal, which is already stored as a constant
string object).

The concatenation determines that the result string will need two bytes
per char and will have two chars, so it allocates a string able to
hold four bytes. It then copies the space (widening the representation)
and the other character (as is). I don't see why this is slower than
necessary.
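(A small sketch to make this observable, assuming CPython 3.3 or later
with PEP 393 flexible string storage; sys.getsizeof reports the widened
allocation after the concatenation:)

    import sys

    # CPython 3.3+ (PEP 393) stores a string at 1, 2, or 4 bytes per
    # character, depending on the widest code point it contains.
    a = " "       # ASCII: 1 byte per character
    b = "龜"      # U+9F9C: 2 bytes per character
    c = a + b     # widened: 2 bytes per character for *both* characters
    for s in (a, b, c):
        print(repr(s), sys.getsizeof(s))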


Interfacing with most software libraries can avoid conversions in and out

I'm curious about this. I won't dismiss it offhand, but besides ICU,
what libraries are we talking about that haven't already been
rewritten for GTK, Java, or Python (take your pick)?


rewritten for? None. Besides perhaps XML parsers, I don't think many
libraries have been rewritten *for* Python, none for Gtk, and many not
for Java. Take database adapters, for example. To access MySQL, Postgres,
Oracle, or SQLite, you often need to use the C library of the database
vendor, which then gets integrated (e.g. through some FFI) into GTK,
Java, and Python. However, this FFI integration is where the conversions
in and out need to be performed.
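(A minimal Python/ctypes sketch of that boundary, assuming a
POSIX-style C library can be loaded; the encode call is exactly the
conversion being described:)

    import ctypes

    libc = ctypes.CDLL(None)  # assumption: a POSIX-style libc is loadable
    s = "a龜"
    # The language's internal string representation cannot cross the
    # FFI boundary as-is; it is converted to bytes for the C side
    # (here UTF-8), and results come back as bytes to be decoded.
    libc.puts(s.encode("utf-8"))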


The question of whether to allow non-ASCII characters in variables is open.

I don't see why.


Do you factually disagree that there is no universal consensus on this
question? Some languages support non-ASCII identifiers, but many more
don't, and proponents of those languages often claim that such support
isn't really needed. So I'd agree that the question is still undecided,
i.e. open.

Regards,
Martin





RE: Unicode String Models

2012-07-20 Thread Murray Sargent
Mark wrote: “I put together some notes on different ways for programming 
languages to handle Unicode at a low level. Comments welcome.”

Nice article as far as it goes, and additions are forthcoming. In addition to
multiple code units per character in UTF-8 and UTF-16, there are variation 
selectors, combining marks, ligatures, and clusters, all of which imply 
handling variable-length sequences even for UTF-32. Handling the variable 
length code points in UTF-8 and UTF-16 is actually considerably easier than 
dealing with these other sources of variable length. For all cases, you need to 
be able to find character entity boundaries for an arbitrary code-unit index.
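(At the code-point level, that boundary search is cheap; a sketch for
UTF-16, assuming the code units are given as a Python list of ints.
Cluster, ligature, and combining-mark boundaries additionally need
UAX #29-style segmentation, which this does not attempt:)

    def snap_to_code_point_boundary(units, i):
        # Back up one UTF-16 code unit if index i would split a
        # surrogate pair (a low surrogate preceded by a high surrogate).
        if (0 < i < len(units)
                and 0xDC00 <= units[i] <= 0xDFFF
                and 0xD800 <= units[i - 1] <= 0xDBFF):
            return i - 1
        return i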

My latest blog post “Ligatures, Clusters, Combining Marks and Variation
Sequences” (http://blogs.msdn.com/b/murrays/archive/2012/06/30/ligatures-clusters-combining-marks-and-variation-sequences.aspx)
discusses some of these complications.

One amusing thing is that where I work it’s common to use cp to mean “character 
position”, which more precisely is “UTF-16 code-unit index”, whereas in Mark’s 
post, cp is used for codepoint.

Murray




Re: Unicode String Models

2012-07-20 Thread Mark Davis ☕
Thanks, nice article. We got into some of those hairy caret-positioning
issues back at Apple; we even had a design that would associate a series of
lines (which could be slanted and positioned) with a ligature, but
ultimately 1/m gets you 99% of the value, with very little cost.

(My article was just targeted at the very lowest level of Unicode
representation, without getting into the further complications for higher
level constructs like grapheme clusters, ligatures, etc.)

--
Mark https://plus.google.com/114199149796022210033
— Il meglio è l’inimico del bene —






Re: Unicode String Models

2012-07-20 Thread Martin J. Dürst

On 2012/07/21 7:01, David Starner wrote:


I'm concerned about the statement/implication that one can optimize
for ASCII and Latin-1. It's too easy for a lot of developers to test
speed with the English/European documents they have around and test
correctness only with Chinese. I see the argument in theory and
practice, but it's a tough line to walk, especially if you're not
familiar with i18n.

I can see for i in range (1, 1000) do a := " "; a +:= 龜; done being
way slower than necessary (especially in cases that aren't trivially
optimized away), for example.


The main problem with the above loop isn't ASCII vs. Chinese or some
such. It's that, depending on the way the programming language handles
strings, it can result in a painter's algorithm phenomenon (see
http://www.joelonsoftware.com/articles/fog000319.html).
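(For the record, the usual linear-time workaround in Python is to
accumulate pieces and join once at the end; a sketch:)

    # Repeated "a += piece" can go quadratic when each concatenation
    # copies the whole accumulated string (Shlemiel the painter).
    pieces = []
    for i in range(1000):
        pieces.append(" 龜")
    a = "".join(pieces)  # one pass; time linear in the output size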


Regards,   Martin.