Re: Unicode String Models
On Tue, Oct 2, 2018 at 3:04 PM Mark Davis ☕️ wrote: > > * The Python 3.3 model mentions the disadvantages of memory usage >> cliffs but doesn't mention the associated performance cliffs. It would >> be good to also mention that when a string manipulation causes the >> storage to expand or contract, there's a performance impact that's not >> apparent from the nature of the operation if the programmer's >> intuition works on the assumption that the programmer is dealing with >> UTF-32. >> > > The focus was on immutable string models, but I didn't make that clear. > Added some text. > Thanks. > * The UTF-16/Latin1 model is missing. It's used by SpiderMonkey, DOM >> text node storage in Gecko, (I believe but am not 100% sure) V8 and, >> optionally, HotSpot >> ( >> https://docs.oracle.com/javase/9/vm/java-hotspot-virtual-machine-performance-enhancements.htm#JSJVM-GUID-3BB4C26F-6DE7-4299-9329-A3E02620D50A >> ). >> That is, text has UTF-16 semantics, but if the high half of every code >> unit in a string is zero, only the lower half is stored. This has >> properties analogous to the Python 3.3 model, except non-BMP doesn't >> expand to UTF-32 but uses UTF-16 surrogate pairs. >> > > Thanks, will add. > V8 source code shows it has a OneByteString storage option: https://cs.chromium.org/chromium/src/v8/src/objects/string.h?sq=package:chromium=0=494 . From hearsay, I'm convinced that it means Latin1, but I've failed to find a clear quotable statement from a V8 developer to that effect. > 3) Type-system-tagged UTF-8 (the model of Rust): Valid UTF-8 buffers >> have a different type in the type system than byte buffers. To go from >> a byte buffer to an UTF-8 buffer, UTF-8 validity is checked. Once data >> has been tagged as valid UTF-8, the validity is trusted completely so >> that iteration by code point does not have "else" branches for >> malformed sequences. If data that the type system indicates to be >> valid UTF-8 wasn't actually valid, it would be nasal demon time. 
The >> language has a default "safe" side and an opt-in "unsafe" side. The >> unsafe side is for performing low-level operations in a way where the >> responsibility of upholding invariants is moved from the compiler to >> the programmer. It's impossible to violate the UTF-8 validity >> invariant using the safe part of the language. >> > > Added a quote based on this; please check if it is ok. > Looks accurate. Thanks. -- Henri Sivonen hsivo...@hsivonen.fi https://hsivonen.fi/
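The Rust model described above maps directly onto the standard library: `&[u8]` and `&str` are distinct types, and `std::str::from_utf8` is the checked crossing between them. A minimal sketch (standard library only):

```rust
fn main() {
    // "fox" followed by a malformed byte: the byte-to-text crossing is
    // where UTF-8 validity is checked, and the error reports where it failed.
    let bytes: &[u8] = &[0x66, 0x6F, 0x78, 0xFF];
    let err = std::str::from_utf8(bytes).unwrap_err();
    assert_eq!(err.valid_up_to(), 3);
    // Once data carries the &str type, validity is a type invariant, so
    // iteration by code point needs no "else" branch for bad sequences.
    let s = std::str::from_utf8(&bytes[..3]).unwrap();
    assert_eq!(s.chars().collect::<Vec<_>>(), ['f', 'o', 'x']);
}
```

The only way to skip the check is `str::from_utf8_unchecked`, which is marked `unsafe` for exactly the reason given in the quoted text: it moves responsibility for the invariant from the compiler to the programmer.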
Re: Unicode String Models
On 3 October 2018 at 15:41:42, Mark Davis ☕️ via Unicode (unicode@unicode.org) wrote: > Let me clear that up; I meant that "the underlying storage never contains > something that would need to be represented as a surrogate code point." Of > course, UTF-16 does need surrogate code units. What #1 would be excluding > in the case of UTF-16 would be unpaired surrogates. That is, suppose the > underlying storage is UTF-16 code units that don't satisfy #1. > > 0061 D83D DC7D 0061 D83D > > A code point API would return for those a sequence of 4 values, the last of > which would be a surrogate code point. > > 0061, 0001F47D, 0061, D83D > > A scalar value API would return for those also 4 values, but since we > aren't in #1, it would need to remap. > > 0061, 0001F47D, 0061, FFFD Ok understood. But I think that if you go to the length of providing a scalar-value API you would also prevent the construction of strings that have such anomalies in the first place (e.g. by erroring in the constructor if you provide it with malformed UTF-X data), i.e. maintain 1. From a programmer's perspective I really don't get anything from 2. except confusion. > If it is a real datatype, with strong guarantees that it *never* contains > values outside of [0x0000..0xD7FF 0xE000..0x10FFFF], then every conversion > from number will require checking. And in my experience, without a strong > guarantee the datatype is in practice pretty useless. Sure. My point was that the places where you perform this check are few in practice. Namely mainly at the IO boundary of your program where you actually need to deal with encodings and, additionally, whenever you define scalar value constants (a check that could actually be performed by your compiler if your language provides a literal notation for values of this type). Best, Daniel
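The erroring-constructor approach Daniel describes is what Rust's owned string type does; this sketch (Rust rather than OCaml, purely for illustration) shows both the runtime check at the IO boundary and the compile-time check for scalar-value literals:

```rust
fn main() {
    // The constructor rejects malformed input, so a String can never
    // hold an unpaired surrogate or other invalid sequence.
    assert!(String::from_utf8(vec![0x61, 0xE2, 0x82, 0xAC]).is_ok()); // "a€"
    // 0xED 0xA0 0xBD would encode the lone surrogate U+D83D; rejected.
    assert!(String::from_utf8(vec![0x61, 0xED, 0xA0, 0xBD]).is_err());
    // Scalar-value constants are checked by the compiler, as suggested:
    let c = '\u{1F47D}';
    // let d = '\u{D83D}'; // does not compile: surrogates are not scalar values
    assert_eq!(c as u32, 0x1F47D);
}
```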
Re: Unicode String Models
Mark On Wed, Oct 3, 2018 at 3:01 PM Daniel Bünzli wrote: > On 3 October 2018 at 09:17:10, Mark Davis ☕️ via Unicode ( > unicode@unicode.org) wrote: > > > There are two main choices for a scalar-value API: > > > > 1. Guarantee that the storage never contains surrogates. This is the > > simplest model. > > 2. Substitute U+FFFD for surrogates when the API returns code > > points. This can be done where #1 is not feasible, such as where the API > is > > a shim on top of a (perhaps large) IO buffer of 16-bit code > units > > that are not guaranteed to be UTF-16. The cost is extra tests on every > code > > point access. > > I'm not sure 2. really makes sense in practice: it would mean you can't > access scalar values > which need surrogates to be encoded. > Let me clear that up; I meant that "the underlying storage never contains something that would need to be represented as a surrogate code point." Of course, UTF-16 does need surrogate code units. What #1 would be excluding in the case of UTF-16 would be unpaired surrogates. That is, suppose the underlying storage is UTF-16 code units that don't satisfy #1. 0061 D83D DC7D 0061 D83D A code point API would return for those a sequence of 4 values, the last of which would be a surrogate code point. 0061, 0001F47D, 0061, D83D A scalar value API would return for those also 4 values, but since we aren't in #1, it would need to remap. 0061, 0001F47D, 0061, FFFD > > Also regarding 1. you can always define an API that has this property > regardless of the actual storage, it's only that your indexing operations > might be costly as they do not directly map to the underlying storage array. > That being said I don't think direct indexing/iterating for Unicode text > is such an interesting operation due of course to the > normalization/segmentation issues. Basically if your API provides them I > only see these indexes as useful ways to define substrings. 
APIs that > identify/iterate boundaries (and thus substrings) are more interesting due > to the nature of Unicode text. > I agree that iteration is a very common case. But quite often implementations need to have at least opaque indexes (as discussed). > > > If the programming language provides for such a primitive datatype, that > is > > possible. That would mean at a minimum that casting/converting to that > > datatype from other numerical datatypes would require bounds-checking and > > throwing an exception for values outside of [0x0000..0xD7FF > > 0xE000..0x10FFFF]. > > Yes. But note that in practice if you are in 1. above you usually perform > this only at the point of decoding where you are already performing a lot > of other checks. Once done you no longer need to check anything as long as > the operations you perform on the values preserve the invariant. Also > converting back to an integer if you need one is a no-op: it's the identity > function. > If it is a real datatype, with strong guarantees that it *never* contains values outside of [0x0000..0xD7FF 0xE000..0x10FFFF], then every conversion from number will require checking. And in my experience, without a strong guarantee the datatype is in practice pretty useless. > > The OCaml Uchar module does this. This is the interface: > > https://github.com/ocaml/ocaml/blob/trunk/stdlib/uchar.mli > > which defines the type t as abstract and here is the implementation: > > https://github.com/ocaml/ocaml/blob/trunk/stdlib/uchar.ml > > which defines the implementation of type t = int which means values of > this type are an *unboxed* OCaml integer (and will be stored as such in say > an OCaml array). However since the module system enforces type abstraction > the only way of creating such values is to use the constants or the > constructors (e.g. of_int) which all maintain the scalar value invariant > (if you disregard the unsafe_* functions). 
> > Note that it would perfectly be possible to adopt a similar approach in C > via a typedef though given C's rather loose type system a little bit more > discipline would be required from the programmer (always go through the > constructor functions to create values of the type). That's the C motto: "requiring a 'bit more' discipline from programmers" > > Best, > > Daniel > > >
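Choice #2 above — a scalar-value shim over 16-bit units that are not guaranteed to be UTF-16 — can be sketched with `char::decode_utf16` from the Rust standard library, using the example sequence from the quoted text:

```rust
fn main() {
    // 'a', a valid surrogate pair for U+1F47D, 'a', an unpaired surrogate.
    let units: [u16; 5] = [0x0061, 0xD83D, 0xDC7D, 0x0061, 0xD83D];
    // The shim remaps anything that fails to form a scalar value to U+FFFD,
    // at the cost of a test on every code point access.
    let scalars: Vec<char> = char::decode_utf16(units.iter().copied())
        .map(|r| r.unwrap_or(char::REPLACEMENT_CHARACTER))
        .collect();
    assert_eq!(scalars, ['\u{0061}', '\u{1F47D}', '\u{0061}', '\u{FFFD}']);
}
```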
Re: Unicode String Models
On 3 October 2018 at 09:17:10, Mark Davis ☕️ via Unicode (unicode@unicode.org) wrote: > There are two main choices for a scalar-value API: > > 1. Guarantee that the storage never contains surrogates. This is the > simplest model. > 2. Substitute U+FFFD for surrogates when the API returns code > points. This can be done where #1 is not feasible, such as where the API is > a shim on top of a (perhaps large) IO buffer of 16-bit code units > that are not guaranteed to be UTF-16. The cost is extra tests on every code > point access. I'm not sure 2. really makes sense in practice: it would mean you can't access scalar values which need surrogates to be encoded. Also regarding 1. you can always define an API that has this property regardless of the actual storage, it's only that your indexing operations might be costly as they do not directly map to the underlying storage array. That being said I don't think direct indexing/iterating for Unicode text is such an interesting operation due of course to the normalization/segmentation issues. Basically if your API provides them I only see these indexes as useful ways to define substrings. APIs that identify/iterate boundaries (and thus substrings) are more interesting due to the nature of Unicode text. > If the programming language provides for such a primitive datatype, that is > possible. That would mean at a minimum that casting/converting to that > datatype from other numerical datatypes would require bounds-checking and > throwing an exception for values outside of [0x0000..0xD7FF > 0xE000..0x10FFFF]. Yes. But note that in practice if you are in 1. above you usually perform this only at the point of decoding where you are already performing a lot of other checks. Once done you no longer need to check anything as long as the operations you perform on the values preserve the invariant. Also converting back to an integer if you need one is a no-op: it's the identity function. The OCaml Uchar module does this. 
This is the interface: https://github.com/ocaml/ocaml/blob/trunk/stdlib/uchar.mli which defines the type t as abstract and here is the implementation: https://github.com/ocaml/ocaml/blob/trunk/stdlib/uchar.ml which defines the implementation of type t = int which means values of this type are an *unboxed* OCaml integer (and will be stored as such in say an OCaml array). However since the module system enforces type abstraction the only way of creating such values is to use the constants or the constructors (e.g. of_int) which all maintain the scalar value invariant (if you disregard the unsafe_* functions). Note that it would perfectly be possible to adopt a similar approach in C via a typedef though given C's rather loose type system a little bit more discipline would be required from the programmer (always go through the constructor functions to create values of the type). Best, Daniel
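Rust's `char` plays the same role as OCaml's `Uchar.t`: the bounds check happens once, in the constructor, and converting back to an integer is a free cast. A sketch for comparison:

```rust
fn main() {
    // The checked constructor: surrogates and values above 0x10FFFF
    // are rejected at the single point of conversion.
    assert_eq!(char::from_u32(0x0041), Some('A'));
    assert_eq!(char::from_u32(0xD800), None); // surrogate
    assert_eq!(char::from_u32(0x110000), None); // beyond the scalar range
    // Converting back to an integer is the identity function (a no-op):
    assert_eq!('A' as u32, 0x41);
    // The unchecked escape hatch exists, but is marked unsafe,
    // analogous to Uchar's unsafe_* functions:
    assert_eq!(unsafe { char::from_u32_unchecked(0x41) }, 'A');
}
```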
Re: Unicode String Models
Mark On Tue, Oct 2, 2018 at 8:31 PM Daniel Bünzli wrote: > On 2 October 2018 at 14:03:48, Mark Davis ☕️ via Unicode ( > unicode@unicode.org) wrote: > > > Because of performance and storage consideration, you need to consider > the > > possible internal data structures when you are looking at something as > > low-level as strings. But most of the 'model's in the document are only > > really distinguished by API, only the "Code Point model" discussions are > > segmented by internal storage, as with "Code Point Model: UTF-32" > > I guess my gripe with the presentation of that document is that it > perpetuates the problem of confusing "unicode characters" (or integers, or > scalar values) and their *encoding* (how to represent these integers as > byte sequences) which is a source of endless confusion among programmers. > > This confusion is easily lifted once you explain that there exist certain > integers, the scalar values, which are your actual characters and then you > have different ways of encoding your characters; one can then explain that > a surrogate is not a character per se, it's a hack and there's no point in > indexing them except if you want trouble. > > This may also suggest another taxonomy of classification for the APIs, > those in which you work directly with the character data (the scalar > values) and those in which you work with an encoding of the actual > character data (e.g. a JavaScript string). > Thanks for the feedback. It is worth adding a discussion of the issues, perhaps something like: A code-point-based API takes and returns int32's, although only a small subset of the values are valid code points, namely 0x0..0x10FFFF. (In practice some APIs may support returning -1 to signal an error or termination, such as before or after the end of a string.) A surrogate code point is one in U+D800..U+DFFF; these reflect a range of special code units used in pairs in UTF-16 for representing code points above U+FFFF. 
A scalar value is a code point that is not a surrogate. A scalar-value API for immutable strings requires that no surrogate code points are ever returned. In practice, the main advantage of that API is that round-tripping to UTF-8/16 is guaranteed. Otherwise, a leaked surrogate code point is relatively harmless: Unicode properties are devised so that clients can essentially treat them as (permanently) unassigned characters. Warning: an iterator should *never* avoid returning surrogate code points by skipping them; that can cause security problems; see https://www.unicode.org/reports/tr36/tr36-7.html#Substituting_for_Ill_Formed_Subsequences and https://www.unicode.org/reports/tr36/tr36-7.html#Deletion_of_Noncharacters. There are two main choices for a scalar-value API: 1. Guarantee that the storage never contains surrogates. This is the simplest model. 2. Substitute U+FFFD for surrogates when the API returns code points. This can be done where #1 is not feasible, such as where the API is a shim on top of a (perhaps large) IO buffer of 16-bit code units that are not guaranteed to be UTF-16. The cost is extra tests on every code point access. > > In reality, most APIs are not even going to be in terms of code points: > > they will return int32's. > > That reality depends on your programming language. If the latter supports > type abstraction you can define an abstract type for scalar values (whose > implementation may simply be an integer). If you always go through the > constructor to create these "integers" you can maintain the invariant that > a value of this type is an integer in the ranges [0x0000;0xD7FF] and > [0xE000;0x10FFFF]. Knowing this invariant holds is quite useful when you > feed your "character" data to other processes like UTF-X encoders: it > guarantees the correctness of their outputs regardless of what the > programmer does. > If the programming language provides for such a primitive datatype, that is possible. 
That would mean at a minimum that casting/converting to that datatype from other numerical datatypes would require bounds-checking and throwing an exception for values outside of [0x0000..0xD7FF 0xE000..0x10FFFF]. Most common-use programming languages that I know of don't support that for primitives; the API would have to use a class, which would be so very painful for performance/storage. If you (or others) know of languages that do have such a cheap primitive datatype, that would be worth mentioning! > Best, > > Daniel > > >
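One language worth mentioning in answer to that question is Rust, whose `char` is exactly such a cheap primitive: an unboxed 4-byte value whose invariant is the scalar-value range. A small demonstration (the `Option<char>` size relies on the compiler's niche optimization, which exploits the invalid bit patterns):

```rust
use std::mem::size_of;

fn main() {
    // `char` is an unboxed 4-byte primitive whose invariant is exactly
    // the scalar-value range [0x0..0xD7FF, 0xE000..0x10FFFF].
    assert_eq!(size_of::<char>(), 4);
    // The invariant even buys storage back: the compiler uses the
    // invalid values as a niche, so Option<char> also fits in 4 bytes.
    assert_eq!(size_of::<Option<char>>(), 4);
    // Arrays of char are flat arrays of 32-bit values, no boxing:
    assert_eq!(size_of::<[char; 8]>(), 32);
}
```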
Re: Unicode String Models
On 2 October 2018 at 14:03:48, Mark Davis ☕️ via Unicode (unicode@unicode.org) wrote: > Because of performance and storage consideration, you need to consider the > possible internal data structures when you are looking at something as > low-level as strings. But most of the 'model's in the document are only > really distinguished by API, only the "Code Point model" discussions are > segmented by internal storage, as with "Code Point Model: UTF-32" I guess my gripe with the presentation of that document is that it perpetuates the problem of confusing "unicode characters" (or integers, or scalar values) and their *encoding* (how to represent these integers as byte sequences) which is a source of endless confusion among programmers. This confusion is easily lifted once you explain that there exist certain integers, the scalar values, which are your actual characters and then you have different ways of encoding your characters; one can then explain that a surrogate is not a character per se, it's a hack and there's no point in indexing them except if you want trouble. This may also suggest another taxonomy of classification for the APIs, those in which you work directly with the character data (the scalar values) and those in which you work with an encoding of the actual character data (e.g. a JavaScript string). > In reality, most APIs are not even going to be in terms of code points: > they will return int32's. That reality depends on your programming language. If the latter supports type abstraction you can define an abstract type for scalar values (whose implementation may simply be an integer). If you always go through the constructor to create these "integers" you can maintain the invariant that a value of this type is an integer in the ranges [0x0000;0xD7FF] and [0xE000;0x10FFFF]. 
Knowing this invariant holds is quite useful when you feed your "character" data to other processes like UTF-X encoders: it guarantees the correctness of their outputs regardless of what the programmer does. Best, Daniel
Re: Unicode String Models
Whether or not it is well suited, that's probably water under the bridge at this point. Think of it as a jargon at this point; after all, there are lots of cases like that: a "near miss" wasn't nearly a miss, it was nearly a hit. Mark On Sun, Sep 9, 2018 at 10:56 AM Janusz S. Bień wrote: > On Sat, Sep 08 2018 at 18:36 +0200, Mark Davis ☕️ via Unicode wrote: > > I recently did some extensive revisions of a paper on Unicode string > models (APIs). Comments are welcome. > > > > > https://docs.google.com/document/d/1wuzzMOvKOJw93SWZAqoim1VUl9mloUxE0W6Ki_G23tw/edit# > > It's a good opportunity to propose a better term for "extended grapheme > cluster", which usually are neither extended nor clusters, it's also not > obvious that they are always graphemes. > > Cf.the earlier threads > > https://www.unicode.org/mail-arch/unicode-ml/y2017-m03/0031.html > https://www.unicode.org/mail-arch/unicode-ml/y2016-m09/0040.html > > Best regards > > Janusz > > -- > , > Janusz S. Bien > emeryt (emeritus) > https://sites.google.com/view/jsbien >
Re: Unicode String Models
Mark On Tue, Sep 11, 2018 at 12:17 PM Henri Sivonen via Unicode < unicode@unicode.org> wrote: > On Sat, Sep 8, 2018 at 7:36 PM Mark Davis ☕️ via Unicode > wrote: > > > > I recently did some extensive revisions of a paper on Unicode string > models (APIs). Comments are welcome. > > > > > https://docs.google.com/document/d/1wuzzMOvKOJw93SWZAqoim1VUl9mloUxE0W6Ki_G23tw/edit# > > * The Grapheme Cluster Model seems to have a couple of disadvantages > that are not mentioned: > 1) The subunit of string is also a string (a short string conforming > to particular constraints). There's a need for *another* more atomic > mechanism for examining the internals of the grapheme cluster string. > I did mention this. > 2) The way an arbitrary string is divided into units when iterating > over it changes when the program is executed on a newer version of the > language runtime that is aware of newly-assigned codepoints from a > newer version of Unicode. > Good point. I did mention the EGC definitions changing, but should point out that if you have a string with unassigned characters in it, they may be clustered on future versions. Will add. > * The Python 3.3 model mentions the disadvantages of memory usage > cliffs but doesn't mention the associated performance cliffs. It would > be good to also mention that when a string manipulation causes the > storage to expand or contract, there's a performance impact that's not > apparent from the nature of the operation if the programmer's > intuition works on the assumption that the programmer is dealing with > UTF-32. > The focus was on immutable string models, but I didn't make that clear. Added some text. > > * The UTF-16/Latin1 model is missing. It's used by SpiderMonkey, DOM > text node storage in Gecko, (I believe but am not 100% sure) V8 and, > optionally, HotSpot > ( > https://docs.oracle.com/javase/9/vm/java-hotspot-virtual-machine-performance-enhancements.htm#JSJVM-GUID-3BB4C26F-6DE7-4299-9329-A3E02620D50A > ). 
> That is, text has UTF-16 semantics, but if the high half of every code > unit in a string is zero, only the lower half is stored. This has > properties analogous to the Python 3.3 model, except non-BMP doesn't > expand to UTF-32 but uses UTF-16 surrogate pairs. > Thanks, will add. > > * I think the fact that systems that chose UTF-16 or UTF-32 have > implemented models that try to save storage by omitting leading zeros > and gaining complexity and performance cliffs as a result is a strong > indication that UTF-8 should be recommended for newly-designed systems > that don't suffer from a forceful legacy need to expose UTF-16 or > UTF-32 semantics. > > * I suggest splitting the "UTF-8 model" into three substantially > different models: > > 1) The UTF-8 Garbage In, Garbage Out model (the model of Go): No > UTF-8-related operations are performed when ingesting byte-oriented > data. Byte buffers and text buffers are type-wise ambiguous. Only > iterating over byte data by code point gives the data the UTF-8 > interpretation. Unless the data is cleaned up as a side effect of such > iteration, malformed sequences in input survive into output. > > 2) UTF-8 without full trust in ability to retain validity (the model > of the UTF-8-using C++ parts of Gecko; I believe this to be the most > common UTF-8 model for C and C++, but I don't have evidence to back > this up): When data is ingested with text semantics, it is converted > to UTF-8. For data that's supposed to already be in UTF-8, this means > replacing malformed sequences with the REPLACEMENT CHARACTER, so the > data is valid UTF-8 right after input. However, iteration by code > point doesn't trust ability of other code to retain UTF-8 validity > perfectly and has "else" branches in order not to blow up if invalid > UTF-8 creeps into the system. > > 3) Type-system-tagged UTF-8 (the model of Rust): Valid UTF-8 buffers > have a different type in the type system than byte buffers. 
To go from > a byte buffer to an UTF-8 buffer, UTF-8 validity is checked. Once data > has been tagged as valid UTF-8, the validity is trusted completely so > that iteration by code point does not have "else" branches for > malformed sequences. If data that the type system indicates to be > valid UTF-8 wasn't actually valid, it would be nasal demon time. The > language has a default "safe" side and an opt-in "unsafe" side. The > unsafe side is for performing low-level operations in a way where the > responsibility of upholding invariants is moved from the compiler to > the programmer. It's impossible to violate the UTF-8 validity > invariant using the safe part of the language. > Added a quote based on this; please check if it is ok.
Re: Unicode String Models
Mark On Sun, Sep 9, 2018 at 3:42 PM Daniel Bünzli wrote: > Hello, > > I find your notion of "model" and presentation a bit confusing since it > conflates what I would call the internal representation and the API. > > The internal representation defines how the Unicode text is stored and > should not really matter to the end user of the string data structure. The > API defines how the Unicode text is accessed, expressed by what is the > result of an indexing operation on the string. The latter is really what > matters for the end-user and what I would call the "model". > Because of performance and storage consideration, you need to consider the possible internal data structures when you are looking at something as low-level as strings. But most of the 'model's in the document are only really distinguished by API, only the "Code Point model" discussions are segmented by internal storage, as with "Code Point Model: UTF-32" > I think the presentation would benefit from making a clear distinction > between the internal representation and the API; you could then easily > summarize them in a table which would make a nice summary of the design > space. > That's an interesting suggestion, I'll mull it over. > > I also think you are missing one API which is the one with EGC I would > favour: indexing returns Unicode scalar values, internally be it whatever > you wish UTF-{8,16,32} or a custom encoding. Maybe that's what you intended > by the "Code Point Model: Internal 8/16/32" but that's not what it says, > the distinction between code point and scalar value is an important one and > I think it would be good to insist on it to clarify the minds in such > documents. > In reality, most APIs are not even going to be in terms of code points: they will return int32's. So not only are they not scalar values, 99.97% are not even code points. 
Of course, values above 0x10FFFF or below 0 shouldn't ever be stored in strings, but in practice treating non-scalar-value-code-points as "permanently unassigned" characters doesn't really cause problems in processing. > Best, > > Daniel > > >
Re: Unicode String Models
Mark On Sun, Sep 9, 2018 at 10:03 AM Richard Wordingham via Unicode < unicode@unicode.org> wrote: > On Sat, 8 Sep 2018 18:36:00 +0200 > Mark Davis ☕️ via Unicode wrote: > > > I recently did some extensive revisions of a paper on Unicode string > > models (APIs). Comments are welcome. > > > > > https://docs.google.com/document/d/1wuzzMOvKOJw93SWZAqoim1VUl9mloUxE0W6Ki_G23tw/edit# > > > Theoretically at least, the cost of indexing a big string by codepoint > is negligible. For example, cost of accessing the middle character is > O(1)*, not O(n), where n is the length of the string. The trick is to > use a proportionately small amount of memory to store and maintain a > partial conversion table from character index to byte index. For > example, Emacs claims to offer O(1) access to a UTF-8 buffer by > character number, and I can't significantly fault the claim. > > *There may be some creep, but it doesn't matter for strings that can be > stored within a galaxy. > > Of course, the coefficients implied by big-oh notation also matter. > For example, it can be very easy to forget that a bubble sort is often > the quickest sorting algorithm. > Thanks, added a quote from you on that; see if it looks ok. > You keep muttering that a sequence of 8-bit code units can contain > invalid sequences, but often forget that that is also true of sequences > of 16-bit code units. Do emoji now ensure that confusion between > codepoints and code units rapidly comes to light? > I didn't neglect that, had a [TBD] for it. While UTF16 invalid unpaired surrogates don't complicate processing much if they are treated as unassigned characters, allowing UTF8 invalid sequences is more troublesome. See, for example, the convolutions needed in ICU methods that allow ill-formed UTF8. > You seem to keep forgetting that grapheme clusters are not how some > people work. Does the English word 'café' contain the letter > 'e'? Yes or no? I maintain that it does. 
I can't help thinking that > one might want to look for the letter 'ă' in Vietnamese and find it > whatever the associated tone mark is. > I'm pretty familiar with the situation, thanks for asking. Often you want to find out more about the components of grapheme clusters, so you always need to be able to iterate through the code points they contain. One might think that iterating by grapheme cluster is hiding features of the text. For example, with *fox́* (fox\u{301}) it is easy to find that the text contains an *x* by iterating through code points. But code points often don't reveal their components: does the word *también* contain the letter *e*? A reasonable question, but iterating by code point rather than grapheme cluster doesn't help, since it is typically encoded as a single U+00E9. And even decomposing to NFD doesn't always help, as with cases like *rødgrød*. > > You didn't discuss substrings. I did. But if you mean a definition of substring that lets you access internal components of substrings, I'm afraid that is quite a specialized usage. One could do it, but it would burden the general use case. > I'm interested in how subsequences of > strings are defined, as the concept of 'substring' isn't really Unicode > compliant. Again, expressing 'ă' as a subsequence of the Vietnamese > word 'nặng' ought to be possible, whether one is using NFD (easier) or > NFC. (And there are alternative normalisations that are compatible > with canonical equivalence.) I'm most interested in subsequences X of a > word W where W is the same as AXB for some strings A and B. > Richard. > >
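The *fox́*/*también* examples above can be checked mechanically with plain code-point iteration (Rust shown, standard library only, so no grapheme segmentation is involved here):

```rust
fn main() {
    // fox́ = "fox" + combining U+0301: code-point iteration does find 'x'.
    let fox = "fox\u{301}";
    assert!(fox.chars().any(|c| c == 'x'));
    // también with precomposed U+00E9: no 'e' appears at the code point
    // level; NFD decomposition would be needed to expose one.
    let tambien = "tambi\u{E9}n";
    assert!(!tambien.chars().any(|c| c == 'e'));
    assert!(tambien.chars().any(|c| c == '\u{E9}'));
}
```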
Re: Unicode String Models
Thanks, added a quote from you on that; see if it looks ok. Mark On Sat, Sep 8, 2018 at 9:20 PM John Cowan wrote: > This paper makes the default assumption that the internal storage of a > string is a featureless array. If this assumption is abandoned, it is > possible to get O(1) indexes with fairly low space overhead. The Scheme > language has recently adopted immutable strings called "texts" as a > supplement to its pre-existing mutable strings, and the sample > implementation for this feature uses a vector of either native strings or > bytevectors (char[] vectors in C/Java terms). I would urge anyone > interested in the question of storing and accessing mutable strings to read > the following parts of SRFI 135 at < > https://srfi.schemers.org/srfi-135/srfi-135.html>: Abstract, Rationale, > Specification / Basic concepts, and Implementation. In addition, the > design notes at <https://github.com/larcenists/larceny/wiki/ImmutableTexts>, > though not up to date (in particular, UTF-16 internals are now allowed as > an alternative to UTF-8), are of interest: unfortunately, the link to the > span API has rotted. > > On Sat, Sep 8, 2018 at 12:53 PM Mark Davis ☕️ via Unicore < > unic...@unicode.org> wrote: > >> I recently did some extensive revisions of a paper on Unicode string >> models (APIs). Comments are welcome. >> >> >> https://docs.google.com/document/d/1wuzzMOvKOJw93SWZAqoim1VUl9mloUxE0W6Ki_G23tw/edit# >> >> Mark >> >
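The partial conversion table that Richard and John describe can be sketched in a few lines (Rust used for illustration; the block size `K` and the helper names are invented here). Storing one byte offset per `K` code points gives O(K) lookup, effectively constant time for fixed `K`, with only 1/K space overhead:

```rust
// Record the byte offset of every K-th code point; a lookup then decodes
// at most K code points instead of scanning the whole string.
const K: usize = 16;

fn build_index(s: &str) -> Vec<usize> {
    s.char_indices()
        .enumerate()
        .filter(|(i, _)| i % K == 0)
        .map(|(_, (byte, _))| byte)
        .collect()
}

fn char_at(s: &str, n: usize, index: &[usize]) -> Option<char> {
    let start = *index.get(n / K)?;
    s[start..].chars().nth(n % K)
}

fn main() {
    // Mixed 1-, 2-, and 4-byte code points, so byte and char offsets diverge.
    let s = "a\u{F1}\u{10348}x".repeat(100);
    let index = build_index(&s);
    assert_eq!(char_at(&s, 0, &index), Some('a'));
    assert_eq!(char_at(&s, 2, &index), Some('\u{10348}'));
    assert_eq!(char_at(&s, 399, &index), Some('x'));
    assert_eq!(char_at(&s, 400, &index), None);
}
```

A mutable buffer (as in Emacs) additionally has to repair the table on edits, which is where most of the real engineering effort goes.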
Re: Unicode String Models
Thanks to all for comments. Just revised the text in https://goo.gl/neguxb. Mark On Sat, Sep 8, 2018 at 6:36 PM Mark Davis ☕️ wrote: > I recently did some extensive revisions of a paper on Unicode string > models (APIs). Comments are welcome. > > > https://docs.google.com/document/d/1wuzzMOvKOJw93SWZAqoim1VUl9mloUxE0W6Ki_G23tw/edit# > > Mark >
Re: Unicode String Models
On Wed, Sep 12, 2018 at 11:37 AM Hans Åberg via Unicode wrote: > The idea is to extend Unicode itself, so that those bytes can be represented > by legal codepoints. Extending Unicode itself would likely create more problems than it would solve. Extending the value space of Unicode scalar values would be extremely disruptive for systems whose design is deeply committed to the current definitions of UTF-16 and UTF-8 staying unchanged. Assigning a scalar value within the current Unicode scalar value space to currently malformed bytes would have the problem of those scalar values losing information about whether they came from malformed bytes or the well-formed encoding of those scalar values. It seems better to let applications that have use cases that involve representing non-Unicode values use a special-purpose extension on their own. -- Henri Sivonen hsivo...@hsivonen.fi https://hsivonen.fi/
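The information-loss problem is easy to demonstrate with any U+FFFD-substituting decoder (Rust's `String::from_utf8_lossy` shown purely for illustration):

```rust
fn main() {
    // A malformed byte and the well-formed UTF-8 encoding of U+FFFD
    // decode to the same text, so the distinction is unrecoverable.
    let malformed: &[u8] = &[0x61, 0xFF];                // 'a' + invalid byte
    let encoded_fffd: &[u8] = &[0x61, 0xEF, 0xBF, 0xBD]; // 'a' + U+FFFD
    let a = String::from_utf8_lossy(malformed);
    let b = String::from_utf8_lossy(encoded_fffd);
    assert_eq!(a, b);
    assert_eq!(a, "a\u{FFFD}");
}
```

Assigning dedicated scalar values to malformed bytes (instead of U+FFFD) would merely move the ambiguity: those scalar values would then collide with their own well-formed encodings, which is the point made above.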
Re: Unicode String Models
> On 12 Sep 2018, at 04:34, Eli Zaretskii via Unicode > wrote: > >> Date: Wed, 12 Sep 2018 00:13:52 +0200 >> Cc: unicode@unicode.org >> From: Hans Åberg via Unicode >> >> It might be useful to represent non-UTF-8 bytes as Unicode code points. One >> way might be to use a codepoint to indicate high bit set followed by the >> byte value with its high bit set to 0, that is, truncated into the ASCII >> range. For example, U+0080 looks like it is not in use, though I could not >> verify this. > > You must use a codepoint that is not defined by Unicode, and never > will. That is what Emacs does: it extends the Unicode codepoint space > beyond 0x10FFFF. The idea is to extend Unicode itself, so that those bytes can be represented by legal codepoints. U+0080 has had some use in other encodings, but it seems not to be used in Unicode itself. But one could use some other value or values, and mark them for this special purpose. There are a number of other byte sequences that are in use, too, like overlong UTF-8. Also, the original UTF-8 scheme could then be extended to handle all 32-bit values, including those with the high bit set.
Re: Unicode String Models
On Tue, Sep 11, 2018 at 2:13 PM Eli Zaretskii wrote: > > > Date: Tue, 11 Sep 2018 13:12:40 +0300 > > From: Henri Sivonen via Unicode > > > > * I suggest splitting the "UTF-8 model" into three substantially > > different models: > > > > 1) The UTF-8 Garbage In, Garbage Out model (the model of Go): No > > UTF-8-related operations are performed when ingesting byte-oriented > > data. Byte buffers and text buffers are type-wise ambiguous. Only > > iterating over byte data by code point gives the data the UTF-8 > > interpretation. Unless the data is cleaned up as a side effect of such > > iteration, malformed sequences in input survive into output. > > > > 2) UTF-8 without full trust in ability to retain validity (the model > > of the UTF-8-using C++ parts of Gecko; I believe this to be the most > > common UTF-8 model for C and C++, but I don't have evidence to back > > this up): When data is ingested with text semantics, it is converted > > to UTF-8. For data that's supposed to already be in UTF-8, this means > > replacing malformed sequences with the REPLACEMENT CHARACTER, so the > > data is valid UTF-8 right after input. However, iteration by code > > point doesn't trust ability of other code to retain UTF-8 validity > > perfectly and has "else" branches in order not to blow up if invalid > > UTF-8 creeps into the system. > > > > 3) Type-system-tagged UTF-8 (the model of Rust): Valid UTF-8 buffers > > have a different type in the type system than byte buffers. To go from > > a byte buffer to an UTF-8 buffer, UTF-8 validity is checked. Once data > > has been tagged as valid UTF-8, the validity is trusted completely so > > that iteration by code point does not have "else" branches for > > malformed sequences. If data that the type system indicates to be > > valid UTF-8 wasn't actually valid, it would be nasal demon time. The > > language has a default "safe" side and an opt-in "unsafe" side. 
The > > unsafe side is for performing low-level operations in a way where the > > responsibility of upholding invariants is moved from the compiler to > > the programmer. It's impossible to violate the UTF-8 validity > > invariant using the safe part of the language. > > There's another model, the one used by Emacs. AFAIU, it is different > from all the 3 you describe above. In Emacs, each raw byte belonging > to a byte sequence which is invalid under UTF-8 is represented as a > special multibyte sequence. IOW, Emacs's internal representation > extends UTF-8 with multibyte sequences it uses to represent raw bytes. > This allows mixing stray bytes and valid text in the same buffer, > without risking lossy conversions (such as those one gets under model > 2 above). I think extensions of UTF-8 that expand the value space beyond Unicode scalar values, and the problems these extensions are designed to solve, are a worthwhile topic to cover, but it's not the same topic as in the document, rather a slightly adjacent one. On that topic, these two are relevant: https://simonsapin.github.io/wtf-8/ https://github.com/kennytm/omgwtf8 The former is used in the Rust standard library in order to provide a Unix-like view to Windows file paths in a way that can represent all Windows file paths. File paths on Unix-like systems are sequences of bytes whose presentable-to-humans interpretation (these days) is UTF-8, but there's no guarantee of UTF-8 validity. File paths on Windows are sequences of unsigned 16-bit numbers whose presentable-to-humans interpretation is UTF-16, but there's no guarantee of UTF-16 validity. WTF-8 can represent all Windows file paths as sequences of bytes such that the paths that are valid UTF-16 as sequences of 16-bit units are valid UTF-8 in the 8-bit-unit representation.
This allows application-visible file paths in the Rust standard library to be sequences of bytes both on Windows and non-Windows platforms and to be presentable to humans by decoding as UTF-8 in both cases. To my knowledge, the latter isn't in use yet. The implementation is tracked in https://github.com/rust-lang/rust/issues/49802 -- Henri Sivonen hsivo...@hsivonen.fi https://hsivonen.fi/
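For the lone-surrogate case that motivates WTF-8, Python's surrogatepass handler happens to produce the same generalized-UTF-8 bytes (illustrative only: full WTF-8 additionally requires that paired surrogates be joined into one supplementary code point before encoding):

```python
# An unpaired high surrogate is invalid in strict UTF-8/UTF-16, but
# WTF-8 encodes it with the ordinary three-byte UTF-8 bit pattern.
lone = "\ud800"
wtf8_bytes = lone.encode("utf-8", "surrogatepass")
assert wtf8_bytes == b"\xed\xa0\x80"
# A well-formed supplementary character stays ordinary 4-byte UTF-8:
assert "\U0001F600".encode("utf-8") == b"\xf0\x9f\x98\x80"
```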
Re: Unicode String Models
> Date: Wed, 12 Sep 2018 00:13:52 +0200 > Cc: unicode@unicode.org > From: Hans Åberg via Unicode > > It might be useful to represent non-UTF-8 bytes as Unicode code points. One > way might be to use a codepoint to indicate high bit set followed by the byte > value with its high bit set to 0, that is, truncated into the ASCII range. > For example, U+0080 looks like it is not in use, though I could not verify > this. You must use a codepoint that is not defined by Unicode, and never will. That is what Emacs does: it extends the Unicode codepoint space beyond 0x10FFFF.
Re: Unicode String Models
No, 0xF8..0xFF are not used at all in UTF-8; but U+00F8..U+00FF really **do** have UTF-8 encodings (using two bytes). The only safe way to represent arbitrary bytes within strings when they are not valid UTF-8 is to use invalid UTF-8 sequences, i.e. by using a "UTF-8-like" private extension of UTF-8 (that extension is still not UTF-8!). This is what Java does, representing U+0000 by (0xC0,0x80) in the compiled bytecode, or via the C/C++ interface for JNI when converting the Java string buffer into a C/C++ string terminated by a NULL byte (not part of the Java string content itself). That special sequence, however, is really exposed in the Java API as a true unsigned 16-bit code unit (char) with value 0x0000, a valid single code point. The same can be done for reencoding each invalid byte in non-UTF-8-conforming texts using sequences with a "UTF-8-like" scheme (still compatible with plain UTF-8 for every valid UTF-8 text): you may either: * (a) encode each invalid byte separately (using two bytes for each), or encode them by groups of 3 bits (represented using bytes 0xF8..0xFF), then needing 3 bytes in the encoding; or * (b) encode a private starter (e.g. 0xFF), followed by a byte for the length of the raw byte sequence that follows, and then the raw byte sequence of that length without any reencoding: this will never be confused with other valid codepoints (however this scheme may no longer be directly indexable from arbitrary random positions, unlike scheme (a), which may be marginally longer). But both schemes (a) and (b) would be useful in editors that allow editing arbitrary binary files as if they were plain text, even if they contain null bytes or invalid UTF-8 sequences (it's up to these editors to find a way to distinctively represent these bytes, and a way to enter/change them reliably).
There's also a possibility of extension if the backing store uses UTF-16, as all code units 0x0000..0xFFFF are used, but one scheme is possible by using unpaired surrogates (notably a low surrogate NOT prefixed by a high surrogate: the low surrogate already has 10 useful bits that can store any raw byte value in its lowest bits): this scheme allows indexing from random positions and reliable sequential traversal in both directions (backward or forward)... ... But the presence of such an extension of UTF-16 means that all the implementation code handling standard text has to detect unpaired surrogates, and can no longer assume that a low surrogate necessarily has a high surrogate encoded just before it: it must be tested, and that previous position may be before the buffer start, possibly causing a buffer overrun in the backward direction (so the code will need to also know the start position of the buffer and check it, or know the index, which cannot be negative), possibly exposing unrelated data and causing some security risks, unless the backing store always adds a leading "guard" code unit set arbitrarily to 0x0000. On Wed, Sep 12, 2018 at 00:48, J Decker via Unicode wrote: > > > On Tue, Sep 11, 2018 at 3:15 PM Hans Åberg via Unicode < > unicode@unicode.org> wrote: > >> >> > On 11 Sep 2018, at 23:48, Richard Wordingham via Unicode < >> unicode@unicode.org> wrote: >> > >> > On Tue, 11 Sep 2018 21:10:03 +0200 >> > Hans Åberg via Unicode wrote: >> > >> >> Indeed, before UTF-8, in the 1990s, I recall some Russians using >> >> LaTeX files with sections in different Cyrillic and Latin encodings, >> >> changing the editor encoding while typing. >> > >> > Rather like some of the old Unicode list archives, which are just >> > concatenations of a month's emails, with all sorts of 8-bit encodings >> > and stretches of base64. >> >> It might be useful to represent non-UTF-8 bytes as Unicode code points. 
>> One way might be to use a codepoint to indicate high bit set followed by >> the byte value with its high bit set to 0, that is, truncated into the >> ASCII range. For example, U+0080 looks like it is not in use, though I >> could not verify this. >> >> > it's used for character 0x400. 0xD0 0x80 or 0x8000 0xE8 0x80 0x80 > (I'm probably off a bit in the leading byte) > UTF-8 can represent from 0 to 0x20 every value; (which is all defined > codepoints) early variants can support up to U+7FFF... > and there's enough bits to carry the pattern forward to support 36 bits or > 42 bits... (the last one breaking the standard a bit by allowing a byte > without one bit off... 0xFF would be the leadin) > > 0xF8-FF are unused byte values; but those can all be encoded into utf-8. >
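The Java trick described above can be sketched in Python (hypothetical helper name; this covers only the NUL rule, not the CESU-8-style treatment of supplementary characters that real modified UTF-8 also uses):

```python
# Java's "modified UTF-8" stores U+0000 as the overlong pair 0xC0 0x80,
# so the byte 0x00 stays free to terminate C strings.
def to_modified_utf8(s: str) -> bytes:
    out = bytearray()
    for ch in s:
        if ch == "\x00":
            out += b"\xc0\x80"          # overlong encoding of NUL
        else:
            out += ch.encode("utf-8")   # everything else: plain UTF-8
    return bytes(out)

encoded = to_modified_utf8("a\x00b")
assert encoded == b"a\xc0\x80b"
assert 0x00 not in encoded              # no embedded NUL byte
```

A strict UTF-8 decoder rejects the overlong pair, which is exactly why this is a private extension rather than UTF-8.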
Re: Unicode String Models
On Tue, Sep 11, 2018 at 3:15 PM Hans Åberg via Unicode wrote: > > > On 11 Sep 2018, at 23:48, Richard Wordingham via Unicode < > unicode@unicode.org> wrote: > > > > On Tue, 11 Sep 2018 21:10:03 +0200 > > Hans Åberg via Unicode wrote: > > > >> Indeed, before UTF-8, in the 1990s, I recall some Russians using > >> LaTeX files with sections in different Cyrillic and Latin encodings, > >> changing the editor encoding while typing. > > > > Rather like some of the old Unicode list archives, which are just > > concatenations of a month's emails, with all sorts of 8-bit encodings > > and stretches of base64. > > It might be useful to represent non-UTF-8 bytes as Unicode code points. > One way might be to use a codepoint to indicate high bit set followed by > the byte value with its high bit set to 0, that is, truncated into the > ASCII range. For example, U+0080 looks like it is not in use, though I > could not verify this. > > it's used for character 0x400. 0xD0 0x80 or 0x8000 0xE8 0x80 0x80 (I'm probably off a bit in the leading byte) UTF-8 can represent from 0 to 0x20 every value; (which is all defined codepoints) early variants can support up to U+7FFF... and there's enough bits to carry the pattern forward to support 36 bits or 42 bits... (the last one breaking the standard a bit by allowing a byte without one bit off... 0xFF would be the leadin) 0xF8-FF are unused byte values; but those can all be encoded into utf-8.
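The claims in this exchange are easy to check (a sketch, not from the thread): U+0080 already has a two-byte UTF-8 encoding, so it cannot serve as an escape for raw bytes, while the byte values 0xF8..0xFF never appear in UTF-8 output at all:

```python
# U+0080 and U+0400 both have real UTF-8 encodings...
assert "\u0080".encode("utf-8") == b"\xc2\x80"
assert "\u0400".encode("utf-8") == b"\xd0\x80"   # the 0xD0 0x80 mentioned above
# ...and no UTF-8 encoding of any scalar value produces a 0xF8..0xFF byte.
produced = set()
for cp in (0x41, 0x80, 0x400, 0x10000, 0x10FFFF):
    produced.update(chr(cp).encode("utf-8"))
assert all(b < 0xF8 for b in produced)
```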
Re: Unicode String Models
> On 11 Sep 2018, at 23:48, Richard Wordingham via Unicode > wrote: > > On Tue, 11 Sep 2018 21:10:03 +0200 > Hans Åberg via Unicode wrote: > >> Indeed, before UTF-8, in the 1990s, I recall some Russians using >> LaTeX files with sections in different Cyrillic and Latin encodings, >> changing the editor encoding while typing. > > Rather like some of the old Unicode list archives, which are just > concatenations of a month's emails, with all sorts of 8-bit encodings > and stretches of base64. It might be useful to represent non-UTF-8 bytes as Unicode code points. One way might be to use a codepoint to indicate high bit set followed by the byte value with its high bit set to 0, that is, truncated into the ASCII range. For example, U+0080 looks like it is not in use, though I could not verify this.
Re: Unicode String Models
On Tue, 11 Sep 2018 21:10:03 +0200 Hans Åberg via Unicode wrote: > Indeed, before UTF-8, in the 1990s, I recall some Russians using > LaTeX files with sections in different Cyrillic and Latin encodings, > changing the editor encoding while typing. Rather like some of the old Unicode list archives, which are just concatenations of a month's emails, with all sorts of 8-bit encodings and stretches of base64. Richard.
Re: Unicode String Models
> On 11 Sep 2018, at 20:40, Eli Zaretskii wrote: > >> From: Hans Åberg >> Date: Tue, 11 Sep 2018 20:14:30 +0200 >> Cc: hsivo...@hsivonen.fi, >> unicode@unicode.org >> >> If one encounters a file with mixed encodings, it is good to be able to view >> its contents and then convert it, as I see one can do in Emacs. > > Yes. And mixed encodings is not the only use case: it may well happen > that the initial attempt to decode the file uses incorrect assumption > about the encoding, for some reason. > > In addition, it is important that changing some portion of the file, > then saving the modified text will never change any part that the user > didn't touch, as will happen if invalid sequences are rejected at > input time and replaced with something else. Indeed, before UTF-8, in the 1990s, I recall some Russians using LaTeX files with sections in different Cyrillic and Latin encodings, changing the editor encoding while typing.
Re: Unicode String Models
> From: Hans Åberg > Date: Tue, 11 Sep 2018 20:14:30 +0200 > Cc: hsivo...@hsivonen.fi, > unicode@unicode.org > > If one encounters a file with mixed encodings, it is good to be able to view > its contents and then convert it, as I see one can do in Emacs. Yes. And mixed encodings is not the only use case: it may well happen that the initial attempt to decode the file uses incorrect assumption about the encoding, for some reason. In addition, it is important that changing some portion of the file, then saving the modified text will never change any part that the user didn't touch, as will happen if invalid sequences are rejected at input time and replaced with something else.
Re: Unicode String Models
> On 11 Sep 2018, at 19:21, Eli Zaretskii wrote: > >> From: Hans Åberg >> Date: Tue, 11 Sep 2018 19:13:28 +0200 >> Cc: Henri Sivonen , >> unicode@unicode.org >> >>> In Emacs, each raw byte belonging >>> to a byte sequence which is invalid under UTF-8 is represented as a >>> special multibyte sequence. IOW, Emacs's internal representation >>> extends UTF-8 with multibyte sequences it uses to represent raw bytes. >>> This allows mixing stray bytes and valid text in the same buffer, >>> without risking lossy conversions (such as those one gets under model >>> 2 above). >> >> Can you give a reference detailing this format? > > There's no formal description as English text, if that's what you > meant. The comments, macros and functions in the files > src/character.[ch] in the Emacs source tree tell most of that story, > albeit indirectly, and some additional info can be found in the > section "Text Representation" of the Emacs Lisp Reference manual. OK. If one encounters a file with mixed encodings, it is good to be able to view its contents and then convert it, as I see one can do in Emacs.
Re: Unicode String Models
> From: Hans Åberg > Date: Tue, 11 Sep 2018 19:13:28 +0200 > Cc: Henri Sivonen , > unicode@unicode.org > > > In Emacs, each raw byte belonging > > to a byte sequence which is invalid under UTF-8 is represented as a > > special multibyte sequence. IOW, Emacs's internal representation > > extends UTF-8 with multibyte sequences it uses to represent raw bytes. > > This allows mixing stray bytes and valid text in the same buffer, > > without risking lossy conversions (such as those one gets under model > > 2 above). > > Can you give a reference detailing this format? There's no formal description as English text, if that's what you meant. The comments, macros and functions in the files src/character.[ch] in the Emacs source tree tell most of that story, albeit indirectly, and some additional info can be found in the section "Text Representation" of the Emacs Lisp Reference manual.
Re: Unicode String Models
> On 11 Sep 2018, at 13:13, Eli Zaretskii via Unicode > wrote: > > In Emacs, each raw byte belonging > to a byte sequence which is invalid under UTF-8 is represented as a > special multibyte sequence. IOW, Emacs's internal representation > extends UTF-8 with multibyte sequences it uses to represent raw bytes. > This allows mixing stray bytes and valid text in the same buffer, > without risking lossy conversions (such as those one gets under model > 2 above). Can you give a reference detailing this format?
Re: Unicode String Models
These are all interesting and useful comments. I'll be responding once I get a bit of free time, probably Friday or Saturday. Mark On Tue, Sep 11, 2018 at 4:16 AM Eli Zaretskii via Unicode < unicode@unicode.org> wrote: > > Date: Tue, 11 Sep 2018 13:12:40 +0300 > > From: Henri Sivonen via Unicode > > > > * I suggest splitting the "UTF-8 model" into three substantially > > different models: > > > > 1) The UTF-8 Garbage In, Garbage Out model (the model of Go): No > > UTF-8-related operations are performed when ingesting byte-oriented > > data. Byte buffers and text buffers are type-wise ambiguous. Only > > iterating over byte data by code point gives the data the UTF-8 > > interpretation. Unless the data is cleaned up as a side effect of such > > iteration, malformed sequences in input survive into output. > > > > 2) UTF-8 without full trust in ability to retain validity (the model > > of the UTF-8-using C++ parts of Gecko; I believe this to be the most > > common UTF-8 model for C and C++, but I don't have evidence to back > > this up): When data is ingested with text semantics, it is converted > > to UTF-8. For data that's supposed to already be in UTF-8, this means > > replacing malformed sequences with the REPLACEMENT CHARACTER, so the > > data is valid UTF-8 right after input. However, iteration by code > > point doesn't trust ability of other code to retain UTF-8 validity > > perfectly and has "else" branches in order not to blow up if invalid > > UTF-8 creeps into the system. > > > > 3) Type-system-tagged UTF-8 (the model of Rust): Valid UTF-8 buffers > > have a different type in the type system than byte buffers. To go from > > a byte buffer to an UTF-8 buffer, UTF-8 validity is checked. Once data > > has been tagged as valid UTF-8, the validity is trusted completely so > > that iteration by code point does not have "else" branches for > > malformed sequences. 
If data that the type system indicates to be > > valid UTF-8 wasn't actually valid, it would be nasal demon time. The > > language has a default "safe" side and an opt-in "unsafe" side. The > > unsafe side is for performing low-level operations in a way where the > > responsibility of upholding invariants is moved from the compiler to > > the programmer. It's impossible to violate the UTF-8 validity > > invariant using the safe part of the language. > > There's another model, the one used by Emacs. AFAIU, it is different > from all the 3 you describe above. In Emacs, each raw byte belonging > to a byte sequence which is invalid under UTF-8 is represented as a > special multibyte sequence. IOW, Emacs's internal representation > extends UTF-8 with multibyte sequences it uses to represent raw bytes. > This allows mixing stray bytes and valid text in the same buffer, > without risking lossy conversions (such as those one gets under model > 2 above). >
Re: Unicode String Models
> Date: Tue, 11 Sep 2018 13:12:40 +0300 > From: Henri Sivonen via Unicode > > * I suggest splitting the "UTF-8 model" into three substantially > different models: > > 1) The UTF-8 Garbage In, Garbage Out model (the model of Go): No > UTF-8-related operations are performed when ingesting byte-oriented > data. Byte buffers and text buffers are type-wise ambiguous. Only > iterating over byte data by code point gives the data the UTF-8 > interpretation. Unless the data is cleaned up as a side effect of such > iteration, malformed sequences in input survive into output. > > 2) UTF-8 without full trust in ability to retain validity (the model > of the UTF-8-using C++ parts of Gecko; I believe this to be the most > common UTF-8 model for C and C++, but I don't have evidence to back > this up): When data is ingested with text semantics, it is converted > to UTF-8. For data that's supposed to already be in UTF-8, this means > replacing malformed sequences with the REPLACEMENT CHARACTER, so the > data is valid UTF-8 right after input. However, iteration by code > point doesn't trust ability of other code to retain UTF-8 validity > perfectly and has "else" branches in order not to blow up if invalid > UTF-8 creeps into the system. > > 3) Type-system-tagged UTF-8 (the model of Rust): Valid UTF-8 buffers > have a different type in the type system than byte buffers. To go from > a byte buffer to an UTF-8 buffer, UTF-8 validity is checked. Once data > has been tagged as valid UTF-8, the validity is trusted completely so > that iteration by code point does not have "else" branches for > malformed sequences. If data that the type system indicates to be > valid UTF-8 wasn't actually valid, it would be nasal demon time. The > language has a default "safe" side and an opt-in "unsafe" side. The > unsafe side is for performing low-level operations in a way where the > responsibility of upholding invariants is moved from the compiler to > the programmer. 
It's impossible to violate the UTF-8 validity > invariant using the safe part of the language. There's another model, the one used by Emacs. AFAIU, it is different from all the 3 you describe above. In Emacs, each raw byte belonging to a byte sequence which is invalid under UTF-8 is represented as a special multibyte sequence. IOW, Emacs's internal representation extends UTF-8 with multibyte sequences it uses to represent raw bytes. This allows mixing stray bytes and valid text in the same buffer, without risking lossy conversions (such as those one gets under model 2 above).
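Model 2's ingestion step can be shown in miniature (illustrative sketch, not Gecko code): malformed sequences are replaced with U+FFFD at input time, so the buffer is valid UTF-8 immediately afterwards.

```python
# Replace malformed sequences with the REPLACEMENT CHARACTER on ingestion.
dirty = b"ok \xff then"
clean = dirty.decode("utf-8", "replace")
assert clean == "ok \ufffd then"
clean.encode("utf-8")  # re-encoding the cleaned data can no longer fail
```

The Emacs model differs precisely here: instead of collapsing the stray byte into U+FFFD, it keeps a distinct internal representation for it, so the conversion is not lossy.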
Re: Unicode String Models
On Sat, Sep 8, 2018 at 7:36 PM Mark Davis ☕️ via Unicode wrote: > > I recently did some extensive revisions of a paper on Unicode string models > (APIs). Comments are welcome. > > https://docs.google.com/document/d/1wuzzMOvKOJw93SWZAqoim1VUl9mloUxE0W6Ki_G23tw/edit# * The Grapheme Cluster Model seems to have a couple of disadvantages that are not mentioned: 1) The subunit of string is also a string (a short string conforming to particular constraints). There's a need for *another* more atomic mechanism for examining the internals of the grapheme cluster string. 2) The way an arbitrary string is divided into units when iterating over it changes when the program is executed on a newer version of the language runtime that is aware of newly-assigned codepoints from a newer version of Unicode. * The Python 3.3 model mentions the disadvantages of memory usage cliffs but doesn't mention the associated performance cliffs. It would be good to also mention that when a string manipulation causes the storage to expand or contract, there's a performance impact that's not apparent from the nature of the operation if the programmer's intuition works on the assumption that the programmer is dealing with UTF-32. * The UTF-16/Latin1 model is missing. It's used by SpiderMonkey, DOM text node storage in Gecko, (I believe but am not 100% sure) V8 and, optionally, HotSpot (https://docs.oracle.com/javase/9/vm/java-hotspot-virtual-machine-performance-enhancements.htm#JSJVM-GUID-3BB4C26F-6DE7-4299-9329-A3E02620D50A). That is, text has UTF-16 semantics, but if the high half of every code unit in a string is zero, only the lower half is stored. This has properties analogous to the Python 3.3 model, except non-BMP doesn't expand to UTF-32 but uses UTF-16 surrogate pairs. 
* I think the fact that systems that chose UTF-16 or UTF-32 have implemented models that try to save storage by omitting leading zeros and gaining complexity and performance cliffs as a result is a strong indication that UTF-8 should be recommended for newly-designed systems that don't suffer from a forceful legacy need to expose UTF-16 or UTF-32 semantics. * I suggest splitting the "UTF-8 model" into three substantially different models: 1) The UTF-8 Garbage In, Garbage Out model (the model of Go): No UTF-8-related operations are performed when ingesting byte-oriented data. Byte buffers and text buffers are type-wise ambiguous. Only iterating over byte data by code point gives the data the UTF-8 interpretation. Unless the data is cleaned up as a side effect of such iteration, malformed sequences in input survive into output. 2) UTF-8 without full trust in ability to retain validity (the model of the UTF-8-using C++ parts of Gecko; I believe this to be the most common UTF-8 model for C and C++, but I don't have evidence to back this up): When data is ingested with text semantics, it is converted to UTF-8. For data that's supposed to already be in UTF-8, this means replacing malformed sequences with the REPLACEMENT CHARACTER, so the data is valid UTF-8 right after input. However, iteration by code point doesn't trust ability of other code to retain UTF-8 validity perfectly and has "else" branches in order not to blow up if invalid UTF-8 creeps into the system. 3) Type-system-tagged UTF-8 (the model of Rust): Valid UTF-8 buffers have a different type in the type system than byte buffers. To go from a byte buffer to an UTF-8 buffer, UTF-8 validity is checked. Once data has been tagged as valid UTF-8, the validity is trusted completely so that iteration by code point does not have "else" branches for malformed sequences. If data that the type system indicates to be valid UTF-8 wasn't actually valid, it would be nasal demon time. 
The language has a default "safe" side and an opt-in "unsafe" side. The unsafe side is for performing low-level operations in a way where the responsibility of upholding invariants is moved from the compiler to the programmer. It's impossible to violate the UTF-8 validity invariant using the safe part of the language. * After working with different string models, I'd recommend the Rust model for newly-designed programming languages. (Not because I work for Mozilla but because I believe Rust's way of dealing with Unicode is the best I've seen.) Rust's standard library provides Unicode version-independent iterations over strings: by code unit and by code point. Iteration by extended grapheme cluster is provided by a library that's easy to include due to the nature of Rust package management (https://crates.io/crates/unicode_segmentation). Viewing a UTF-8 buffer as a read-only byte buffer has zero run-time cost and allows for maximally fast guaranteed-valid-UTF-8 output. -- Henri Sivonen hsivo...@hsivonen.fi https://hsivonen.fi/
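The UTF-16/Latin1 storage choice described above can be sketched as follows (hypothetical helper, not code from any of the engines named):

```python
# Keep only the low byte of each UTF-16 code unit when every high byte
# is zero; otherwise fall back to full UTF-16 storage.
def compact_store(s: str) -> tuple:
    utf16 = s.encode("utf-16-le")
    units = [int.from_bytes(utf16[i:i + 2], "little")
             for i in range(0, len(utf16), 2)]
    if all(u < 0x100 for u in units):
        return bytes(units), True    # one byte per code unit ("Latin1")
    return utf16, False              # full UTF-16 storage

assert compact_store("caf\u00e9") == (b"caf\xe9", True)
data, is_one_byte = compact_store("\U0001F600")  # non-BMP: surrogate pair
assert not is_one_byte and len(data) == 4        # two 16-bit code units
```

Note the cliff the original message points out: appending one character above U+00FF to a long one-byte string forces the whole buffer to the two-byte representation.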
Re: Unicode String Models
> On 9 Sep 2018, at 21:20, Eli Zaretskii via Unicode > wrote: > > In Emacs, the gap is always where the text is inserted or deleted, be > it in the middle of text or at its end. > >> All editors I have seen treat the text as ordered collections of small >> buffers (these small buffers may still have >> small gaps), which are occasionnally merged or splitted when needed (merging >> does not cause any >> reallocation but may free one of the buffers), some of them being paged out >> to tempoary files when memory is >> stressed. There are some heuristics in the editor's code to when >> mainatenance of the collection is really >> needed and useful for the performance. > > My point was to say that Emacs is not one of these editors you > describe. FYI, gap and rope buffers are described at [1-2]; also see the Emacs manual [3]. 1. https://en.wikipedia.org/wiki/Gap_buffer 2. https://en.wikipedia.org/wiki/Rope_(data_structure) 3. https://www.gnu.org/software/emacs/manual/html_node/elisp/Buffer-Gap.html
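A minimal gap buffer along the lines of [1] and [3] can be sketched like this (illustrative only, not the Emacs implementation): the free "gap" is kept at the edit point, so repeated typing at the cursor costs O(1) amortized per character instead of shifting the whole tail of the buffer.

```python
class GapBuffer:
    def __init__(self, capacity: int = 16):
        self.buf = [None] * capacity
        self.gap_start = 0        # text lives in buf[:gap_start]
        self.gap_end = capacity   # ...and in buf[gap_end:]

    def _move_gap(self, pos: int) -> None:
        while self.gap_start < pos:          # slide gap right
            self.buf[self.gap_start] = self.buf[self.gap_end]
            self.gap_start += 1
            self.gap_end += 1
        while self.gap_start > pos:          # slide gap left
            self.gap_start -= 1
            self.gap_end -= 1
            self.buf[self.gap_end] = self.buf[self.gap_start]

    def insert(self, pos: int, ch: str) -> None:
        if self.gap_start == self.gap_end:   # gap exhausted: grow it
            grow = len(self.buf)
            self.buf = (self.buf[:self.gap_start] + [None] * grow
                        + self.buf[self.gap_end:])
            self.gap_end = self.gap_start + grow
        self._move_gap(pos)
        self.buf[self.gap_start] = ch
        self.gap_start += 1

    def delete(self, pos: int) -> None:      # remove the char at pos
        self._move_gap(pos)
        self.gap_end += 1

    def text(self) -> str:
        return ("".join(self.buf[:self.gap_start])
                + "".join(self.buf[self.gap_end:]))

g = GapBuffer()
for i, ch in enumerate("abc"):
    g.insert(i, ch)
g.insert(1, "X")
g.delete(2)
assert g.text() == "aXc"
```

Moving the gap far from its current position is the expensive case, which is the cost the rest of this sub-thread debates.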
Re: Unicode String Models
> From: Philippe Verdy > Date: Sun, 9 Sep 2018 19:35:47 +0200 > Cc: Richard Wordingham , > unicode Unicode Discussion > > In Emacs, buffer text is a character string with a gap, actually. > > A text buffer with gaps is a complex structure, not just a plain string. The difference is very small, and a couple of macros allow you to almost forget about the gap. > I doubt it constantly uses a single gap at end (insertions and deletions in the middle would > constantly move large blocks and use excessive CPU and memory bandwidth, with very slow response: users > do not want to see what they type appearing on the screen at one keystroke > every few seconds because each > typed key causes massive block moves and excessive memory paging from/to disk > while this move is being > performed). In Emacs, the gap is always where the text is inserted or deleted, be it in the middle of text or at its end. > All editors I have seen treat the text as ordered collections of small > buffers (these small buffers may still have > small gaps), which are occasionally merged or split when needed (merging > does not cause any > reallocation but may free one of the buffers), some of them being paged out > to temporary files when memory is > stressed. There are some heuristics in the editor's code as to when maintenance > of the collection is really > needed and useful for the performance. My point was to say that Emacs is not one of these editors you describe. > But beside this the performance cost of UTF indexing of the codepoints is > invisible: each buffer will only need > to avoid breaking text between codepoint boundaries, if the current encoding > of the edited text is an UTF. An > editor may also avoid breaking buffers in the middle of clusters if they > render clusters (including ligatures if > they are supported): clusters are still small in size in every encoding and > reasonable buffer sizes can hold at > least hundreds of clusters (even the largest ones which occur rarely). 
How > editors will manage clusters to > make them editable is dependent on the implementation, but even the UTF or > codepoint boundaries are not > enough to handle that. In all cases the logical text buffer is structured > with a complex backing store, where > parts may be paged out (and will also include more than just the current > text, notably it will include parts of the > indexes, possibly in another temporary working file). You ignore or disregard the need to represent raw bytes in editor buffers. That is when the encoding stops being "invisible".
Re: Unicode String Models
On Sun, Sep 9, 2018 at 5:53 PM, Eli Zaretskii wrote: > > Text editors use various indexing caches always, to manage memory, I/O, > and allow working on large texts > > even on systems with low memory available. As much as possible they > attempt to use the OS-level caches > > of the filesystem. And in all cases, they don't work directly on their > text buffer (whose internal representation in > > their backing store is not just a single string, but a structured > collection of buffers, built on top of an interface > > masking the details: the effective text will then be reencoded and saved > from that object, using complex > > serialization schemes; the text buffer is "virtualized"). > > In Emacs, buffer text is a character string with a gap, actually. > A text buffer with gaps is a complex structure, not just a plain string. Gaps are one way to manage memory more efficiently and get reasonable performance when editing, without having to constantly move large blocks: these "strings" with gaps may then actually be just a byte buffer used as a backing store, but that buffer alone does not represent only the currently represented text. A process will still serialize and perform cleanup before this buffer can be used to save the edited text. Emacs may not necessarily deallocate the end of the buffer, but I doubt it constantly uses a single gap at end (insertions and deletions in the middle would constantly move large blocks and use excessive CPU and memory bandwidth, with very slow response: users do not want to see what they type appearing on the screen at one keystroke every few seconds because each typed key causes massive block moves and excessive memory paging from/to disk while this move is being performed). 
All editors I have seen treat the text as ordered collections of small buffers (these small buffers may still have small gaps), which are occasionally merged or split when needed (merging does not cause any reallocation but may free one of the buffers), some of them being paged out to temporary files when memory is stressed. There are some heuristics in the editor's code as to when maintenance of the collection is really needed and useful for performance. But besides this, the performance cost of UTF indexing of the code points is invisible: each buffer only needs to avoid breaking text between code point boundaries, if the current encoding of the edited text is a UTF. An editor may also avoid breaking buffers in the middle of clusters if it renders clusters (including ligatures, if they are supported): clusters are still small in size in every encoding, and reasonable buffer sizes can hold at least hundreds of clusters (even the largest ones, which occur rarely). How editors will manage clusters to make them editable is dependent on the implementation, but even the UTF or code point boundaries are not enough to handle that. In all cases the logical text buffer is structured with a complex backing store, where parts may be paged out (and will also include more than just the current text; notably it will include parts of the indexes, possibly in another temporary working file).
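The "ordered collection of small buffers" scheme can be sketched roughly as follows (a toy Python illustration; the chunk size, the split policy, gaps within chunks, merging heuristics and paging to temporary files are all omitted or invented for the example):

```python
class ChunkedText:
    """Text as an ordered list of small buffers.

    An insertion only rewrites the one chunk it touches; if that chunk
    grows past MAX it is split in place, so no edit ever moves the
    whole text.
    """

    MAX = 8  # tiny for illustration; real editors use a few KB per chunk

    def __init__(self, text=""):
        self.chunks = [text[i:i + self.MAX]
                       for i in range(0, len(text), self.MAX)] or [""]

    def insert(self, pos, s):
        # Walk to the chunk containing `pos` (editors keep indexes to
        # avoid this linear walk; omitted here for brevity).
        k = 0
        while pos > len(self.chunks[k]):
            pos -= len(self.chunks[k])
            k += 1
        c = self.chunks[k][:pos] + s + self.chunks[k][pos:]
        # Split the oversized chunk instead of reallocating everything.
        self.chunks[k:k + 1] = [c[i:i + self.MAX]
                                for i in range(0, len(c), self.MAX)]

    def text(self):
        return "".join(self.chunks)
```

In a real editor each chunk would additionally end on a code point (or cluster) boundary, and rarely-touched chunks could be paged out to a temporary file, exactly as described above.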
Re: Unicode String Models
> Date: Sun, 9 Sep 2018 16:10:26 +0200
> Cc: unicode Unicode Discussion
> From: Philippe Verdy via Unicode
>
> In practice, we save memory by preparing the "small memory" while
> instantiating a new iterator that will process the whole string (which may
> not be fully loaded in memory, in which case that "small memory" will need
> reallocation as we read the whole string, though we do not necessarily keep
> the string in memory if it's a very long text file: the index buffer will
> still remain in memory even if we no longer need to come back to the start
> of the string). That "small memory" is just a local helper; its cost must be
> evaluated. In practice, however, long texts come from I/O: the text will
> have its interface from files, in which case you'll benefit from the
> filesystem cache of the OS to save I/O, or from the network (in which case
> you'll need to store the network data in a local temporary file if you don't
> want to keep it fully in memory and want to allow some parts to be paged out
> of memory by the OS). But in Emacs, it only works with files: network texts
> are necessarily backed at least by a local temporary file.

Emacs maintains caches for byte-to-character conversions for both strings and buffers. The cache holds data only for the last string, and separately the last buffer, where Emacs needed to convert character counts to byte counts or vice versa. For buffers, there are 4 positions maintained for every buffer at all times, for which both the character and byte positions are known, and Emacs uses those whenever it needs to do conversions for a buffer that is not the cached one.

> So that "small memory" for the index is not even needed (but Emacs maintains
> an index in memory only to locate line numbers).

That's a different cache, unrelated to what Richard was alluding to (and, I think, unrelated to the current discussion).
> Text editors always use various indexing caches, to manage memory and I/O
> and to allow working on large texts even on systems with little memory
> available. As much as possible they attempt to use the OS-level caches of
> the filesystem. And in all cases, they don't work directly on their text
> buffer (whose internal representation in their backing store is not just a
> single string, but a structured collection of buffers, built on top of an
> interface masking the details: the effective text will then be reencoded and
> saved from that object, using complex serialization schemes; the text buffer
> is "virtualized").

In Emacs, buffer text is a character string with a gap, actually.
Re: Unicode String Models
On Sun, Sep 9, 2018 at 10:10 AM, Richard Wordingham via Unicode <unicode@unicode.org> wrote:

> On Sat, 8 Sep 2018 18:36:00 +0200
> Mark Davis ☕️ via Unicode wrote:
>
> > I recently did some extensive revisions of a paper on Unicode string
> > models (APIs). Comments are welcome.
> >
> > https://docs.google.com/document/d/1wuzzMOvKOJw93SWZAqoim1VUl9mloUxE0W6Ki_G23tw/edit#
>
> Theoretically at least, the cost of indexing a big string by codepoint is
> negligible. For example, cost of accessing the middle character is O(1)*,
> not O(n), where n is the length of the string. The trick is to use a
> proportionately small amount of memory to store and maintain a partial
> conversion table from character index to byte index. For example, Emacs
> claims to offer O(1) access to a UTF-8 buffer by character number, and I
> can't significantly fault the claim.

I fully agree, as long as the "middle" character is **approximated** by the middle of the **encoded** length. But if it has to be the exact middle (by code point number), you have to count the code points exactly by parsing the whole string, which is O(n), then compute the middle from that count and parse again from the beginning to locate the encoded position of that code point index, which is O(n/2); the final cost is O(n*3/2). The trick of using a "small amount" of memory serves only to avoid the second parse, giving an O(n) result. You get O(1)* only if you keep that "small memory" to locate the indexes. But the claim that it is "small" is wrong if the string is large (big value of n), and it is of no interest if the string is indexed only once.
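As a rough Python illustration of the costs discussed above, finding the exact middle code point of a UTF-8 buffer without any persistent index requires scanning the bytes (the helper name and return shape are mine, for the example only):

```python
def middle_codepoint_offset(data: bytes):
    """Return (byte offset of the middle code point, code point count).

    In UTF-8, every byte except continuation bytes (10xxxxxx) starts a
    code point, so counting lead bytes counts code points.  Conceptually
    this is the O(n) counting pass plus the O(n/2) walk to the midpoint;
    here both are collapsed into one list of lead-byte offsets.
    """
    starts = [i for i, b in enumerate(data) if b & 0xC0 != 0x80]
    n = len(starts)            # number of code points, found in O(n)
    return starts[n // 2], n   # byte offset of code point n // 2
```

Without keeping `starts` (or a sparser table) around, every exact-index access repeats this scan, which is precisely the trade-off debated in this thread.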
In practice, we save memory by preparing the "small memory" while instantiating a new iterator that will process the whole string (which may not be fully loaded in memory, in which case that "small memory" will need reallocation as we read the whole string, though we do not necessarily keep the string in memory if it's a very long text file: the index buffer will still remain in memory even if we no longer need to come back to the start of the string). That "small memory" is just a local helper; its cost must be evaluated. In practice, however, long texts come from I/O: the text will have its interface from files, in which case you'll benefit from the filesystem cache of the OS to save I/O, or from the network (in which case you'll need to store the network data in a local temporary file if you don't want to keep it fully in memory and want to allow some parts to be paged out of memory by the OS). But in Emacs, it only works with files: network texts are necessarily backed at least by a local temporary file. So that "small memory" for the index is not even needed (Emacs maintains an index in memory only to locate line numbers).
It has no need to do that for column numbers, as it is just faster to rescan the line (and extremely long lines of text are exceptional; such files are rarely edited with Emacs, unless you use it to load a binary file, whose representation on screen will be very different, notably for controls, which are expanded into another cached form: the column index for display, which is different from the code point index and specific to the Emacs representation for display/editing, is built only line by line, separately from the line index kept for the whole edited file; it is also independent of the effective encoding: it would still be needed even if the encoding of the backing buffer were UTF-32 with only 1 code point per code unit, because the actual display will still expand the code points to other forms using visible escaping mechanisms, and it is even needed when the file is pure 7-bit ASCII kept with one byte per code point: choosing among the Unicode encoding forms has no impact at all on what is really needed for display in text editors).

Text editors always use various indexing caches, to manage memory and I/O and to allow working on large texts even on systems with little memory available. As much as possible they attempt to use the OS-level caches of the filesystem. And in all cases, they don't work directly on their text buffer (whose internal representation in their backing store is not just a single string, but a structured collection of buffers, built on top of an interface masking the details: the effective text will then be reencoded and saved from that object, using complex serialization schemes; the text buffer is "virtualized"). Only very basic text editors (such as Notepad) use a native single text buffer, but they are very slow when editing very large files, as they constantly need to copy/move large blocks of memory to perform insertions/deletions, and they also put too much pressure on the memory allocator.
Even vi(m) or (s)ed on Unix/Linux now use another internal encoded form with a temporary backing store in temporary files, created automatically as needed when you start modifying the content. The final consolidation and serialization occur only when saving the result.
Re: Unicode String Models
Hello, I find your notion of "model" and presentation a bit confusing, since it conflates what I would call the internal representation and the API. The internal representation defines how the Unicode text is stored and should not really matter to the end user of the string data structure. The API defines how the Unicode text is accessed, expressed by what is the result of an indexing operation on the string. The latter is really what matters for the end user and what I would call the "model". I think the presentation would benefit from making a clear distinction between the internal representation and the API; you could then easily summarize them in a table, which would make a nice summary of the design space. I also think you are missing one API, which is the one with ECG I would favour: indexing returns Unicode scalar values; internally it can be whatever you wish, UTF-{8,16,32} or a custom encoding. Maybe that's what you intended by the "Code Point Model: Internal 8/16/32", but that's not what it says; the distinction between code point and scalar value is an important one, and I think it would be good to insist on it to clarify matters in such documents. Best, Daniel
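For reference, the code point versus scalar value distinction Daniel draws can be captured in one line: a Unicode scalar value is any code point except the surrogates (D800..DFFF). A small illustrative predicate (the function name is mine, not from the paper under discussion):

```python
def is_scalar_value(cp: int) -> bool:
    """True iff `cp` is a Unicode scalar value.

    Scalar values are the code points U+0000..U+10FFFF minus the
    surrogate range U+D800..U+DFFF, which is reserved for UTF-16
    code units and can never appear in well-formed UTF-8/16/32.
    """
    return 0 <= cp <= 0x10FFFF and not (0xD800 <= cp <= 0xDFFF)
```

An API that hands out only scalar values can therefore guarantee that every returned item is encodable in any UTF, which an API returning arbitrary code points cannot.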
Re: Unicode String Models
On Sat, Sep 08 2018 at 18:36 +0200, Mark Davis ☕️ via Unicode wrote:

> I recently did some extensive revisions of a paper on Unicode string models
> (APIs). Comments are welcome.
>
> https://docs.google.com/document/d/1wuzzMOvKOJw93SWZAqoim1VUl9mloUxE0W6Ki_G23tw/edit#

It's a good opportunity to propose a better term for "extended grapheme clusters", which usually are neither extended nor clusters; it's also not obvious that they are always graphemes. Cf. the earlier threads

https://www.unicode.org/mail-arch/unicode-ml/y2017-m03/0031.html
https://www.unicode.org/mail-arch/unicode-ml/y2016-m09/0040.html

Best regards

Janusz

--
Janusz S. Bien, emeryt (emeritus)
https://sites.google.com/view/jsbien
Re: Unicode String Models
Thanks, excellent comments. While it is clear that some string models have more complicated structures (with their own pros and cons), my focus was on simple internal structures. The focus was also on immutable strings — and the tradeoffs for mutable ones can be quite different — and that needs to be clearer. I'll add some material about those two areas (with pointers to sources where possible). Mark On Sat, Sep 8, 2018 at 9:20 PM John Cowan wrote: > This paper makes the default assumption that the internal storage of a > string is a featureless array. If this assumption is abandoned, it is > possible to get O(1) indexes with fairly low space overhead. The Scheme > language has recently adopted immutable strings called "texts" as a > supplement to its pre-existing mutable strings, and the sample > implementation for this feature uses a vector of either native strings or > bytevectors (char[] vectors in C/Java terms). I would urge anyone > interested in the question of storing and accessing mutable strings to read > the following parts of SRFI 135 at < > https://srfi.schemers.org/srfi-135/srfi-135.html>: Abstract, Rationale, > Specification / Basic concepts, and Implementation. In addition, the > design notes at <https://github.com/larcenists/larceny/wiki/ImmutableTexts>, > though not up to date (in particular, UTF-16 internals are now allowed as > an alternative to UTF-8), are of interest: unfortunately, the link to the > span API has rotted. > > On Sat, Sep 8, 2018 at 12:53 PM Mark Davis ☕️ via Unicore < > unic...@unicode.org> wrote: > >> I recently did some extensive revisions of a paper on Unicode string >> models (APIs). Comments are welcome. >> >> >> https://docs.google.com/document/d/1wuzzMOvKOJw93SWZAqoim1VUl9mloUxE0W6Ki_G23tw/edit# >> >> Mark >> >
Re: Unicode String Models
On Sat, 8 Sep 2018 18:36:00 +0200, Mark Davis ☕️ via Unicode wrote:

> I recently did some extensive revisions of a paper on Unicode string
> models (APIs). Comments are welcome.
>
> https://docs.google.com/document/d/1wuzzMOvKOJw93SWZAqoim1VUl9mloUxE0W6Ki_G23tw/edit#

Theoretically at least, the cost of indexing a big string by codepoint is negligible. For example, the cost of accessing the middle character is O(1)*, not O(n), where n is the length of the string. The trick is to use a proportionately small amount of memory to store and maintain a partial conversion table from character index to byte index. For example, Emacs claims to offer O(1) access to a UTF-8 buffer by character number, and I can't significantly fault the claim.

*There may be some creep, but it doesn't matter for strings that can be stored within a galaxy.

Of course, the coefficients implied by big-oh notation also matter. For example, it can be very easy to forget that a bubble sort is often the quickest sorting algorithm.

You keep muttering that a sequence of 8-bit code units can contain invalid sequences, but often forget that that is also true of sequences of 16-bit code units. Do emoji now ensure that confusion between codepoints and code units rapidly comes to light?

You seem to keep forgetting that grapheme clusters are not how some people work. Does the English word 'café' contain the letter 'e'? Yes or no? I maintain that it does. I can't help thinking that one might want to look for the letter 'ă' in Vietnamese and find it whatever the associated tone mark is.

You didn't discuss substrings. I'm interested in how subsequences of strings are defined, as the concept of 'substring' isn't really Unicode compliant. Again, expressing 'ă' as a subsequence of the Vietnamese word 'nặng' ought to be possible, whether one is using NFD (easier) or NFC. (And there are alternative normalisations that are compatible with canonical equivalence.)
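The "partial conversion table from character index to byte index" that Richard describes can be sketched like this in Python (a toy version; the class name and stride choice are mine, and Emacs's actual cache works differently):

```python
class Utf8Index:
    """Sparse char-index -> byte-offset table over a UTF-8 buffer.

    One byte offset is recorded per `stride` code points, so any lookup
    decodes at most `stride` - 1 characters forward from a recorded
    mark: O(1) for a fixed stride, with memory proportional to
    n / stride rather than n.
    """

    def __init__(self, data: bytes, stride: int = 64):
        self.data = data
        self.stride = stride
        self.marks = []  # marks[k] = byte offset of code point k * stride
        count = 0
        for i, b in enumerate(data):
            if b & 0xC0 != 0x80:  # lead byte of a code point
                if count % stride == 0:
                    self.marks.append(i)
                count += 1
        self.length = count

    def byte_offset(self, char_index: int) -> int:
        # Assumes 0 <= char_index < self.length.
        pos = self.marks[char_index // self.stride]
        for _ in range(char_index % self.stride):
            pos += 1
            # Skip the continuation bytes of the current code point.
            while pos < len(self.data) and self.data[pos] & 0xC0 == 0x80:
                pos += 1
        return pos
```

The stride is the tuning knob Richard alludes to: a small stride makes the table larger but lookups cheaper; a large stride does the opposite. Either way the table is invalidated by edits, which is why this fits immutable strings better than editor buffers.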
I'm most interested in subsequences X of a word W where W is the same as AXB for some strings A and B. Richard.
Unicode String Models
I recently did some extensive revisions of a paper on Unicode string models (APIs). Comments are welcome. https://docs.google.com/document/d/1wuzzMOvKOJw93SWZAqoim1VUl9mloUxE0W6Ki_G23tw/edit# Mark
RE: Unicode String Models -- minor proofreading nit (was: Unicode String Models)
Hi, once more, Philippe; one more note: my apologies; I am still trying to make sense of the effects of the various characters/non-characters on the rest of the text in processing of character strings; thus, if there are any errors in my reply (below), someone correct me; I am not really a programmer (excepting a knowledge of HTML/CSS and a little JavaScript and maybe just a bit of other stuff).

From: cewcat...@hotmail.com To: verd...@wanadoo.fr CC: unicode@unicode.org Subject: RE: Unicode String Models -- minor proofreading nit (was: Unicode String Models) Date: Sat, 28 Jul 2012 13:35:57 -0400

From: verd...@wanadoo.fr Date: Fri, 27 Jul 2012 03:17:07 +0200 Subject: Re: Unicode String Models -- minor proofreading nit (was: Unicode String Models) To: m...@macchiato.com CC: cewcat...@hotmail.com; unicode@unicode.org

> I just wonder whether the XSS attack is really an issue here. XSS attacks
> involve bypassing the document source domain in order to attempt to use or
> insert data found in another document issued or managed by another domain,
> in a distinct security realm. A more serious issue would be the fact that
> the parsed document has unknown security, and that the document is subject
> to inspection (for example by an antivirus or antimalware tool trying to
> identify sensitive code which would remain usable, but hidden by the
> cipher-like invalid encoding that a browser would just interpret blindly).

Yes, that's what I think is the issue here. And this is also what's discussed in the Unicode security document I suggested linking to.

> One problem with the strategy of deleting invalid sequences blindly is of
> course the fact that such invalid sequences may be complex and could be
> arbitrarily long. But antivirus/antimalware solutions already know how to
> ignore these invalid sequences when trying to identify malicious code, so
> that they will detect more possibilities.

Thanks for the info. I did not know this.

> In that case, the safest strategy for an antivirus is effectively to discard
> the invalid sequences, trying to mimic what an unaware browser would do
> blindly, with the consequence of running the potentially dangerous code. The
> strategy used in a browser for rendering the document, and in a security
> solution when trying to detect malicious code, will then be completely
> opposed.

Yes, this is a good strategy for anti-virus and malware detection programs; however, I think Unicode is more focused on general character handling/display.

> Another concern is the choice of the replacement character. This document
> only suggests the U+FFFD character, which may also not pass some encoding
> converters used when forwarding the document to a lower-layer API running
> the code effectively. If the code (as opposed to the normal text) is used,
> it will frequently be restricted to ASCII or to an SBCS encoding. And in
> that case, a better substitute will be the ASCII C0 control which is
> normally invalid in plain-text programming/scripting source code.
> Traditionally this C0 control character is SUB. It may even be used to
> replace all invalid bytes of an invalid UTF-8 sequence without changing its
> length (this is not always possible with U+FFFD in UTF-8, because it will be
> encoded as 3 bytes, and there may be invalid/rejected sequences containing
> only 1 or 2 bytes that should survive with the same length after the
> replacement). One concern is that SUB and U+FFFD have different character
> properties. And not all Unicode algorithms treat them the way they should
> (for example in boundary breakers or in some transforms).

Hmm, after checking several Unicode documents and some of the FAQ (http://unicode.org/faq/collation.html), my understanding is that using a non-character code point is the best solution here; I don't know which non-character code point is best, but at least in collation any non-character code point should be ignored. That is, collation is ideally performed on normalized character strings and not on code points. However, I do believe that some string processing/comparison algorithms that look at the string itself and not the characters may be affected. So this is an issue to consider for some, yes.

> Another concern is that even this C0 control may be used for controlling
> some terminal functions (such uses are probably in very old applications),
> so some code converters use instead the question mark (?), which is even
> worse, as it may break a query URL, unexpectedly passing the data encoded
> after it to another HTTP(S) resource than the expected one, and also because
> it will bypass some cache-control mechanisms.

Thanks for bringing this up. (I'm not a programmer and really can't discuss this further, but I do know how to create my own queries for the search engine, placing question marks wherever, so I can bring a particular search page up by typing a URL, for example when I'm searching for particular text in a Google book
Re: Unicode String Models -- minor proofreading nit (was: Unicode String Models)
2012/7/31 CE Whitehead cewcat...@hotmail.com

> Hmm, after checking several Unicode documents and some of the FAQ
> (http://unicode.org/faq/collation.html), my understanding is that using a
> non-character code point is the best solution here; I don't know which
> non-character code point is best, but at least in collation any
> non-character code point should be ignored. That is, collation is ideally
> performed on normalized character strings and not on code points. However,
> I do believe that some string processing/comparison algorithms that look at
> the string itself and not the characters may be affected. So this is an
> issue to consider for some, yes.

The issue when using a placeholder to replace invalid sequences is that, in frequent cases, the stream length must not be altered. If you use a non-character in a UTF-8 stream, it will not always be possible to insert it. The null character (even though it is encoded as a single byte in UTF-8) is the worst choice, due to the many assumptions made throughout software where it means end-of-string or sometimes end-of-stream (sometimes also some downstream processes will represent the actual character as a 2-byte sequence even if that's not strictly UTF-8). In UTF-8 you may use 0xFF as a placeholder, but it will not pass through some interfaces, because it is an invalid sequence everywhere in UTF-8. So you need a valid character that is still encoded as a single byte and is not used in plain-text files. The SUB C0 control character matches such needs. As always, this is not a universal solution; there are always pros and cons in all approaches when trying to manage encoding errors and how to pass over them (if that is desirable).

> Another concern is that even this C0 control may be used for controlling
> some terminal functions (such uses are probably in very old applications),
> so some code converters use instead the question mark (?), which is even
> worse, as it may break a query URL, unexpectedly passing the data encoded
> after it to another HTTP(S) resource than the expected one, and also because
> it will bypass some cache-control mechanisms.
>
> Thanks for bringing this up. (I'm not a programmer and really can't discuss
> this further, but I do know how to create my own queries for the search
> engine, placing question marks wherever, so I can bring a particular search
> page up by typing a URL, for example when I'm searching for particular text
> in a Google book . . . )

The document does not really discuss how to choose the replacement character. My opinion is that for UTF-8 encoded documents, the ASCII C0 control (SUB) is still better than the U+FFFD character, which works well only in the UTF-16 and UTF-32 encodings. It also works well with many legacy SBCS or MBCS encodings (including ISO 8859-*, Windows codepages and many PC/OEM codepages, JIS or EUC variants; it is also mapped in many EBCDIC codepages, distinctly from simple filler/padding characters that are blindly stripped in many applications as if they were just whitespace at the end of a fixed-width data field).

> It seems that in a previous Unicode discussion, it has been recommended that
> applications use code points in the noncharacter code points block rather
> than non-Unicode control codes. Thus one should not use a character at all,
> just a placeholder.

If the encoding length is not an issue (UTF-16 and UTF-32 streams), yes, this is a good solution. Unfortunately we don't have any non-character in the ASCII range that is encoded as one byte in most encodings.

> IMO (in my opinion), just having any placeholder is helpful security-wise.
> (However, I'm still thinking this over.)

Not any placeholder randomly, but placeholders that can be universally replaced one for another, depending on the situations and constraints. Then you pass only that value. But if encoding length is an issue, you'll have no other choice than allowing sequences of multiple placeholders.
The list of possible placeholders that an application can process on input or return on output should be documented. Non-characters are not the only possible choices.
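The length-preserving SUB substitution discussed in this thread can be sketched as follows (an illustrative Python version, not a recommendation; it leans on Python's own UTF-8 decoder to judge validity, and the function name is mine):

```python
SUB = 0x1A  # ASCII C0 SUBSTITUTE control character

def sanitize_utf8_preserving_length(data: bytes) -> bytes:
    """Replace every byte that is not part of a valid UTF-8 sequence
    with SUB, leaving the total byte length unchanged.

    This is the property U+FFFD cannot provide in UTF-8: U+FFFD is a
    3-byte sequence, so substituting it for a 1- or 2-byte invalid
    sequence changes the stream length.
    """
    out = bytearray(data)
    i = 0
    while i < len(data):
        # Try to decode the longest valid sequence starting at i (1-4 bytes).
        for n in (1, 2, 3, 4):
            chunk = data[i:i + n]
            try:
                chunk.decode("utf-8")
                i += n  # valid code point of n bytes; keep it
                break
            except UnicodeDecodeError:
                continue
        else:
            out[i] = SUB  # no valid sequence starts here; replace one byte
            i += 1
    return bytes(out)
```

The per-byte retry loop is quadratic in the worst case and is kept only for clarity; a real converter would use the decoder's error offsets instead. The point of the sketch is the invariant: `len(output) == len(input)`, with the output guaranteed to be well-formed UTF-8.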
RE: Unicode String Models -- minor proofreading nit (was: Unicode String Models)
From: verd...@wanadoo.fr Date: Fri, 27 Jul 2012 03:17:07 +0200 Subject: Re: Unicode String Models -- minor proofreading nit (was: Unicode String Models) To: m...@macchiato.com CC: cewcat...@hotmail.com; unicode@unicode.org

> I just wonder whether the XSS attack is really an issue here. XSS attacks
> involve bypassing the document source domain in order to attempt to use or
> insert data found in another document issued or managed by another domain,
> in a distinct security realm. A more serious issue would be the fact that
> the parsed document has unknown security, and that the document is subject
> to inspection (for example by an antivirus or antimalware tool trying to
> identify sensitive code which would remain usable, but hidden by the
> cipher-like invalid encoding that a browser would just interpret blindly).

Yes, that's what I think is the issue here.

> One problem with the strategy of deleting invalid sequences blindly is of
> course the fact that such invalid sequences may be complex and could be
> arbitrarily long. But antivirus/antimalware solutions already know how to
> ignore these invalid sequences when trying to identify malicious code, so
> that they will detect more possibilities.

Thanks for the info. I did not know this.

> In that case, the safest strategy for an antivirus is effectively to discard
> the invalid sequences, trying to mimic what an unaware browser would do
> blindly, with the consequence of running the potentially dangerous code. The
> strategy used in a browser for rendering the document, and in a security
> solution when trying to detect malicious code, will then be completely
> opposed.

Yes, this is a good strategy for anti-virus and malware detection programs; however, I think Unicode is more focused on general character handling/display.

> Another concern is the choice of the replacement character. This document
> only suggests the U+FFFD character, which may also not pass some encoding
> converters used when forwarding the document to a lower-layer API running
> the code effectively. If the code (as opposed to the normal text) is used,
> it will frequently be restricted to ASCII or to an SBCS encoding. And in
> that case, a better substitute will be the ASCII C0 control which is
> normally invalid in plain-text programming/scripting source code.
> Traditionally this C0 control character is SUB. It may even be used to
> replace all invalid bytes of an invalid UTF-8 sequence without changing its
> length (this is not always possible with U+FFFD in UTF-8, because it will be
> encoded as 3 bytes, and there may be invalid/rejected sequences containing
> only 1 or 2 bytes that should survive with the same length after the
> replacement). One concern is that SUB and U+FFFD have different character
> properties. And not all Unicode algorithms treat them the way they should
> (for example in boundary breakers or in some transforms).
>
> Another concern is that even this C0 control may be used for controlling
> some terminal functions (such uses are probably in very old applications),
> so some code converters use instead the question mark (?), which is even
> worse, as it may break a query URL, unexpectedly passing the data encoded
> after it to another HTTP(S) resource than the expected one, and also because
> it will bypass some cache-control mechanisms.
>
> The document does not really discuss how to choose the replacement
> character. My opinion is that for UTF-8 encoded documents, the ASCII C0
> control (SUB) is still better than the U+FFFD character, which works well
> only in the UTF-16 and UTF-32 encodings. It also works well with many legacy
> SBCS or MBCS encodings (including ISO 8859-*, Windows codepages and many
> PC/OEM codepages, JIS or EUC variants; it is also mapped in many EBCDIC
> codepages, distinctly from simple filler/padding characters that are blindly
> stripped in many applications as if they were just whitespace at the end of
> a fixed-width data field).
>
> How many replacements must be made? My opinion is that replacements should
> be done so that no change occurs to the data length. For the remaining
> cases, data security can detect this with strong data signatures such as
> SHA-1 for not-too-long documents (like HTML pages, or full email contents,
> with some common headers needed for their indexing, routing or delivery to
> the right person), or SHA-256 for very short documents (like single
> datagrams, or the value of short database fields such as phone numbers,
> people's last names or email addresses) or very long documents; or with
> security certificates over a secure channel, which will also detect
> undetected data corruption in the end-to-end communication channel, either
> one-to-one or one-to-many for broadcasts and selective multicasts. But this
> case of secure channels should not be a problem here, as it also has to
> detect and secure many other cases than just invalid plain-text encodings,
> notably man-in-the-middle attacks or replay attacks, or to reliably detect
RE: Unicode String Models
David Starner wrote (Saturday, July 21, 2012 12:02 AM):

> > The question of whether to allow non-ASCII characters in variables is open.
>
> I don't see why. Yes, a lot of organizations will use ASCII only, but not
> all programming is done by large international organizations. For personal
> hacking, or small mononational organizations, Unicode variables may be much
> more convenient. It's not like Chinese variables with Chinese comments are
> going to be much harder to debug for the English speaker than English
> variables (or bad English variables) with Chinese comments, and
> ASCII-romanized Chinese variables may be the worst of all worlds.

Imagine mixed use of Latin and Cyrillic variable names. How would one debug code using two variables named /* cyrillic */ А and /* latin */ A? If it were state-of-the-art to use Unicode variables, a bad guy could have his back door even in public source code without being detected. To avoid confusion, the rules from http://www.unicode.org/Public/security/latest/confusables.txt would have to be applied.

A.D.
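A quick Python demonstration of the confusable-identifier problem described above: the Latin and Cyrillic capital A render identically but are distinct code points (the helper name is mine, for illustration):

```python
import unicodedata

# Two visually identical letters that are distinct code points:
latin_a = "A"          # U+0041 LATIN CAPITAL LETTER A
cyrillic_a = "\u0410"  # U+0410 CYRILLIC CAPITAL LETTER A

def describe(s):
    """Return the official Unicode name of each character in `s`,
    which exposes confusables that are invisible to the eye."""
    return [unicodedata.name(ch) for ch in s]
```

A compiler or linter applying the confusables data would flag an identifier that mixes scripts, or map both spellings to one skeleton before comparing, so that the two "A"s above could not name two different variables unnoticed.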
Re: Unicode String Models -- minor proofreading nit (was: Unicode String Models)
Hi, I have one minor comment: * * * Validation; par 3, comment in parentheses . . . (you never want to just delete it; that has security problems). { COMMENT: would it be helpful here to have a reference here to the unicode security document that discusses this issue -- TR 36, 3.5 http://www.unicode.org/reports/tr36/#Deletion_of_Noncharacters ?} Best, --C. E. Whitehead cewcat...@hotmail.com
Re: Unicode String Models -- minor proofreading nit (was: Unicode String Models)
Thanks, good suggestion. Mark

https://plus.google.com/114199149796022210033
— Il meglio è l’inimico del bene —

On Thu, Jul 26, 2012 at 12:40 PM, CE Whitehead cewcat...@hotmail.com wrote:

> Validation; par 3, comment in parentheses . . . (you never want to just
> delete it; that has security problems). { COMMENT: would it be helpful here
> to have a reference here to the unicode security document that discusses
> this issue -- TR 36, 3.5
> http://www.unicode.org/reports/tr36/#Deletion_of_Noncharacters ?}
Re: Unicode String Models -- minor proofreading nit (was: Unicode String Models)
I just wonder whether the XSS attack is really an issue here. XSS attacks involve bypassing the document source domain in order to attempt to use or insert data found in another document issued or managed by another domain, in a distinct security realm. A more serious issue would be the fact that the parsed document has unknown security, and that the document is subject to inspection (for example by an antivirus or antimalware tool trying to identify sensitive code which would remain usable, but hidden by the cipher-like invalid encoding that a browser would just interpret blindly). One problem with the strategy of deleting invalid sequences blindly is of course the fact that such invalid sequences may be complex and could be arbitrarily long. But antivirus/antimalware solutions already know how to ignore these invalid sequences when trying to identify malicious code, so that they will detect more possibilities. In that case, the safest strategy for an antivirus is effectively to discard the invalid sequences, trying to mimic what an unaware browser would do blindly, with the consequence of running the potentially dangerous code. The strategy used in a browser for rendering the document, and in a security solution when trying to detect malicious code, will then be completely opposed. Another concern is the choice of the replacement character. This document only suggests the U+FFFD character, which may also not pass some encoding converters used when forwarding the document to a lower-layer API running the code effectively. If the code (as opposed to the normal text) is used, it will frequently be restricted to ASCII or to an SBCS encoding. And in that case, a better substitute will be the ASCII C0 control which is normally invalid in plain-text programming/scripting source code. Traditionally this C0 control character is SUB.
It may even be used to replace all invalid bytes of an invalid UTF-8 sequence without changing its length (this is not always possible with U+FFFD in UTF-8, because U+FFFD is encoded as 3 bytes, and there may be invalid/rejected sequences containing only 1 or 2 bytes that should survive with the same length after the replacement).

One concern is that SUB and U+FFFD have different character properties, and not all Unicode algorithms treat SUB the way it should be treated (for example in boundary breakers or in some transforms). Another concern is that even this C0 control may be used for controlling some terminal functions (such uses are probably confined to very old applications), so some code converters instead use the question mark (?), which is even worse: it may break a query URL, unexpectedly passing the data encoded after it to a different HTTP(S) resource than the expected one, and it will also bypass some cache-control mechanisms.

The document does not really discuss how to choose the replacement character. My opinion is that for UTF-8 encoded documents the ASCII C0 control SUB is still better than U+FFFD, which works well only in the UTF-16 and UTF-32 encodings. SUB also works well with many legacy SBCS or MBCS encodings (including ISO 8859-*, Windows codepages and many PC/OEM codepages, JIS or EUC variants; it is also mapped in many EBCDIC codepages, distinct from the simple filler/padding characters that are blindly stripped in many applications as if they were just whitespace at the end of a fixed-width data field).

How many replacements must be made? My opinion is that replacements should be done so that no change occurs to the data length.
For the remaining cases, data security can detect corruption with strong data signatures: SHA-1 for documents that are not too long (like HTML pages, or full email contents with the common headers needed for their indexing, routing or delivery to the right person), or SHA-256 for very short documents (like single datagrams, or the values of short database fields such as phone numbers, last names or email addresses) or very long documents. Security certificates over a secure channel will also detect otherwise-undetected data corruption in the end-to-end communication channel, either one-to-one or one-to-many for broadcasts and selective multicasts. But the case of secure channels should not be a problem here, as such channels also have to detect and secure many other cases than just invalid plain-text encodings, notably man-in-the-middle attacks, replay attacks, or DoS attacks via a broken channel with unrecoverable data losses, something that can be enforced by reasonable timeout watchdogs if the performance of the channel must be ensured.
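The length-preserving SUB replacement suggested above can be sketched with a custom decode-error handler in Python. The handler name "sub" and the whole snippet are illustrative, not from the thread; they rely only on the standard `codecs.register_error` protocol:

```python
import codecs

def sub_replace(exc):
    """Replace each undecodable byte with U+001A (SUB), one per byte,
    so the decoded length in characters equals the input length in bytes."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    return ("\x1a" * len(bad), exc.end)

codecs.register_error("sub", sub_replace)

# Two invalid bytes become exactly two SUB characters:
decoded = b"abc\xff\xfedef".decode("utf-8", errors="sub")
```

By contrast, the built-in `errors="replace"` handler would emit U+FFFD, which re-encodes to three bytes in UTF-8, which is exactly the length-change problem described above.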
Re: User-Hostile Text Editing (was: Unicode String Models)
On 2012-07-21, Richard Wordingham richard.wording...@ntlworld.com wrote: Are there any widely available ways of enabling the deleting of the first character in a default grapheme cluster? Having carefully added two or more marks to a base character, I find it extremely irritating to find I have entered the wrong base character and have to type the whole thing again. As one can delete the last character in a cluster, why not the first? It's not as though the default grapheme cluster is usually thought of as a single character. What do you mean by widely available? A decent editor should let you choose whether to break apart clusters or not. I presume that such editors exist! (Mine always breaks clusters, but that's because I'm the only user, and I don't care enough to implement clustering;-) Yudit might be one, but since it seems to have no documentation, I can't tell. If yours doesn't, then get on to its authors! -- The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.
Re: User-Hostile Text Editing (was: Unicode String Models)
On Sun, 22 Jul 2012 08:59:13 +0100 Julian Bradfield jcb+unic...@inf.ed.ac.uk wrote: On 2012-07-21, Richard Wordingham richard.wording...@ntlworld.com wrote: Are there any widely available ways of enabling the deleting of the first character in a default grapheme cluster? What do you mean by widely available? An example would be a technique that works for many applications on a platform, or for several significant applications across most platforms. An example of the former would be an effective per-user tailoring of grapheme clusters. A candidate for the latter is LibreOffice's rule that alt+cursor key moves within grapheme clusters rather than moving the point to the start of the next grapheme cluster. (Unfortunately this doesn't even work inside tables, so it doesn't look like much of a candidate.) This can be used in the sequence alt/right-arrow, rubout. Richard.
User-Hostile Text Editing (was: Unicode String Models)
On Fri, 20 Jul 2012 23:16:17 + Murray Sargent murr...@exchange.microsoft.com wrote: My latest blog post “Ligatures, Clusters, Combining Marks and Variation Sequences” (http://blogs.msdn.com/b/murrays/archive/2012/06/30/ligatures-clusters-combining-marks-and-variation-sequences.aspx) discusses some of these complications. Are there any widely available ways of enabling the deleting of the first character in a default grapheme cluster? Having carefully added two or more marks to a base character, I find it extremely irritating to find I have entered the wrong base character and have to type the whole thing again. As one can delete the last character in a cluster, why not the first? It's not as though the default grapheme cluster is usually thought of as a single character. Richard.
Re: Unicode String Models
On Fri, 20 Jul 2012 15:01:42 -0700 David Starner prosfil...@gmail.com wrote: The question of whether to allow non-ASCII characters in variables is open. It's not like Chinese variables with Chinese comments are going to be much harder to debug for the English speaker than English variables (or bad English variables) with Chinese comments, and ASCII-romanized Chinese variables may be the worst of all worlds. On the contrary, there is the issue of confusables. An English speaker may easily overlook the Chinese equivalent of ASCII confusables such as the letter 'l' and the digit '1' or the letter 'O' and the digit '0'. It gets even worse if the Chinese characters are rendered as missing glyphs. Moreover, one method of hiding design information while still delivering 'source' code is to not only strip out all comments, but to replace all variable names by meaningless and hard-to-distinguish names such as x1234, x1235, x1236, etc. Richard.
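Richard's confusables concern is easy to demonstrate in a language that already allows non-ASCII identifiers. Python 3 (PEP 3131) accepts both Latin 'a' and Cyrillic 'а' (U+0430) as identifier characters, so the following hypothetical snippet (not from the thread) defines two distinct variables that render identically in most fonts:

```python
# Latin 'a' and Cyrillic 'а' (U+0430) look alike but are distinct
# identifiers under Python 3's rules (PEP 3131).
src = "a = 1\n\u0430 = 2\n"
ns = {}
exec(src, ns)
# Reading the rendered source, you would swear there is only one variable.
```

The same trap generalizes to any script with look-alike characters, which is exactly why reviewers who cannot read the script may miss a deliberately confusable pair.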
RE: User-Hostile Text Editing (was: Unicode String Models)
For math accents, it's easy since the base is the argument of the accent operator. But for clusters the standard practice is for the Delete key to delete the whole cluster, as you note. Also you can't select just part of a cluster to save it from deletion. I'd think deleting the first character of a cluster would make a nice context-menu option. For example, when you right-click on a cluster, the resulting context menu could have an entry like “delete first character”. Maybe other such options could be added as well. Murray -Original Message- From: unicode-bou...@unicode.org [mailto:unicode-bou...@unicode.org] On Behalf Of Richard Wordingham Sent: Saturday, July 21, 2012 4:52 PM To: Unicode Subject: User-Hostile Text Editing (was: Unicode String Models) On Fri, 20 Jul 2012 23:16:17 + Murray Sargent murr...@exchange.microsoft.com wrote: My latest blog post “Ligatures, Clusters, Combining Marks and Variation Sequences” (http://blogs.msdn.com/b/murrays/archive/2012/06/30/ligatures-clusters-combining-marks-and-variation-sequences.aspx) discusses some of these complications. Are there any widely available ways of enabling the deleting of the first character in a default grapheme cluster? Having carefully added two or more marks to a base character, I find it extremely irritating to find I have entered the wrong base character and have to type the whole thing again. As one can delete the last character in a cluster, why not the first? It's not as though the default grapheme cluster is usually thought of as a single character. Richard.
Unicode String Models
I put together some notes on different ways for programming languages to handle Unicode at a low level. Comments welcome. http://macchiati.blogspot.com/2012/07/unicode-string-models-many-programming.html -- Mark https://plus.google.com/114199149796022210033 — Il meglio è l’inimico del bene —
Re: Unicode String Models
On Fri, Jul 20, 2012 at 1:31 PM, Mark Davis ☕ m...@macchiato.com wrote: I put together some notes on different ways for programming languages to handle Unicode at a low level. Comments welcome. I had a few comments for general discussion: That means that it is best to optimize for BMP characters (and as a subset, ASCII and Latin-1), and fall into a ‘slow path’ when a supplementary character is encountered. I'm concerned about the statement/implication that one can optimize for ASCII and Latin-1. It's too easy for a lot of developers to test speed with the English/European documents they have around and test correctness only with Chinese. I see the argument in theory and practice, but it's a tough line to walk, especially if you're not familiar with i18n. I can see for i in range(1, 1000) do a := " "; a +:= "龜"; done being way slower than necessary (especially for cases that aren't trivially optimized away), for example. Interfacing with most software libraries can avoid conversions in and out I'm curious about this. I won't dismiss it offhand, but besides ICU, what libraries are we talking about that haven't already been rewritten for GTK, Java, Python, take your pick? The string class is indexed by code unit, and is UTF-32. Used by: glibc? I haven't poked at it, but Ada 2012 (in its pre-standard, editorial-changes-only stage) has Latin-1, UCS-2 (the standard is not clear here about UTF-16 vs. UCS-2) and UTF-32 (UCS-4 -- it mentions 2147483648 code points) strings. There are functions in the standard to store a Unicode string in the Latin-1 strings as UTF-8 and in the UCS-2 strings as UTF-16, but there is a choice to use straight UTF-32. The question of whether to allow non-ASCII characters in variables is open. I don't see why.
Yes, a lot of organizations will use ASCII only, but not all programming is done in large international organizations. For personal hacking, or for small mononational organizations, Unicode variables may be much more convenient. It's not like Chinese variables with Chinese comments are going to be much harder to debug for the English speaker than English variables (or bad English variables) with Chinese comments, and ASCII-romanized Chinese variables may be the worst of all worlds. -- Kie ekzistas vivo, ekzistas espero.
Re: Unicode String Models
That means that it is best to optimize for BMP characters (and as a subset, ASCII and Latin-1), and fall into a ‘slow path’ when a supplementary character is encountered. I'm concerned about the statement/implication that one can optimize for ASCII and Latin-1. It's too easy for a lot of developers to test speed with the English/European documents they have around and test correctness only with Chinese. I don't think this is a concern within the context of the posting. He is talking about Unicode string models, something that most developers will never have to design themselves - instead, they use what the language gives them. People implementing Unicode support for programming languages, in turn, typically will be aware of all issues. I can see for i in range(1, 1000) do a := " "; a +:= "龜"; done being way slower than necessary (especially for cases that aren't trivially optimized away), for example. Why is that? Take Python 3.3, for example. It does optimize for ASCII, so the first string will use only 1 byte for the space, and two bytes for 龜 (both in a string literal, which is already stored in a constant string object). The concatenation determines that the result string will need two bytes per char, and will have two chars, so it allocates a string able to hold four bytes. It then copies the space (widening the representation), and the other character (as-is). I don't see why this is slower than necessary. Interfacing with most software libraries can avoid conversions in and out I'm curious about this. I won't dismiss it offhand, but besides ICU, what libraries are we talking about that haven't already been rewritten for GTK, Java, Python, take your pick? Rewritten for? None. Besides perhaps XML parsers, I don't think many libraries have been rewritten *for* Python, none for Gtk, and many not for Java. Take database adapters, for example.
To access MySQL, Postgres, Oracle, or SQLite, you often need to use the C library of the database vendor, which then got integrated (e.g. through some FFI) into GTK, Java, and Python. However, this FFI integration is where the conversions in and out need to be performed. The question of whether to allow non-ASCII characters in variables is open. I don't see why. Do you factually disagree that there is no universal consensus on this question? Some languages support non-ASCII identifiers, but many more don't, and proponents of those languages often claim that such support isn't really needed. So I'd agree that the question is still undecided, i.e. open. Regards, Martin
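Martin's description of Python 3.3's flexible string representation (PEP 393) can be observed directly with `sys.getsizeof`: the per-character storage width depends on the widest code point present, and concatenation widens the result as needed. A small sketch; the exact byte counts vary by CPython build, so only relative sizes are asserted:

```python
import sys

ascii_s  = "a" * 100                  # 1 byte per char
bmp_s    = "\u9f9c" * 100             # 龜: 2 bytes per char
astral_s = "\U00010348" * 100         # 𐍈: 4 bytes per char

# Widening on concatenation: a single supplementary character forces
# the whole result string into the 4-bytes-per-char representation.
mixed = ("a" * 100) + "\U00010348"
```

This is the memory-usage cliff (and the associated copy-and-widen performance cost) discussed elsewhere in the thread: one character determines the storage of the entire string.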
RE: Unicode String Models
Mark wrote: “I put together some notes on different ways for programming languages to handle Unicode at a low level. Comments welcome.” Nice article as far as it goes and additions are forthcoming. In addition to multiple code units per character in UTF-8 and UTF-16, there are variation selectors, combining marks, ligatures, and clusters, all of which imply handling variable-length sequences even for UTF-32. Handling the variable-length code points in UTF-8 and UTF-16 is actually considerably easier than dealing with these other sources of variable length. For all cases, you need to be able to find character entity boundaries for an arbitrary code-unit index. My latest blog post “Ligatures, Clusters, Combining Marks and Variation Sequences” (http://blogs.msdn.com/b/murrays/archive/2012/06/30/ligatures-clusters-combining-marks-and-variation-sequences.aspx) discusses some of these complications. One amusing thing is that where I work it’s common to use cp to mean “character position”, which more precisely is “UTF-16 code-unit index”, whereas in Mark’s post, cp is used for codepoint. Murray
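Murray's point about finding boundaries at an arbitrary code-unit index is simplest at the code-point layer of UTF-16: a low surrogate can only occur as the second unit of a pair, so you back up one unit. A hypothetical sketch (real implementations must also decide what to do with unpaired surrogates):

```python
def utf16_units(s):
    """Return the UTF-16 code units of s as a list of ints."""
    data = s.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]

def cp_boundary(units, i):
    """Snap index i back to the start of the code point it falls inside."""
    if 0 <= i < len(units) and 0xDC00 <= units[i] <= 0xDFFF:
        return i - 1          # i points at a low surrogate: back up one unit
    return i

units = utf16_units("a\U00010348b")   # [0x0061, 0xD800, 0xDF48, 0x0062]
```

Finding boundaries for grapheme clusters, ligatures, or variation sequences (Murray's harder cases) has no such constant-time local test; it requires scanning with the segmentation rules of UAX #29.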
Re: Unicode String Models
Thanks, nice article. We got into some of those hairy caret-positioning issues back at Apple; we even had a design that would associate a series of lines (which could be slanted and positioned) with a ligature, but ultimately 1/m gets you 99% of the value, with very little cost. (My article was just targeted at the very lowest level of Unicode representation, without getting into the further complications for higher-level constructs like grapheme clusters, ligatures, etc.) -- Mark https://plus.google.com/114199149796022210033 — Il meglio è l’inimico del bene — On Fri, Jul 20, 2012 at 4:16 PM, Murray Sargent murr...@exchange.microsoft.com wrote: Mark wrote: “I put together some notes on different ways for programming languages to handle Unicode at a low level. Comments welcome.” Nice article as far as it goes and additions are forthcoming. In addition to multiple code units per character in UTF-8 and UTF-16, there are variation selectors, combining marks, ligatures, and clusters, all of which imply handling variable-length sequences even for UTF-32. Handling the variable-length code points in UTF-8 and UTF-16 is actually considerably easier than dealing with these other sources of variable length. For all cases, you need to be able to find character entity boundaries for an arbitrary code-unit index. My latest blog post “Ligatures, Clusters, Combining Marks and Variation Sequences” (http://blogs.msdn.com/b/murrays/archive/2012/06/30/ligatures-clusters-combining-marks-and-variation-sequences.aspx) discusses some of these complications. One amusing thing is that where I work it’s common to use cp to mean “character position”, which more precisely is “UTF-16 code-unit index”, whereas in Mark’s post, cp is used for codepoint. Murray
Re: Unicode String Models
On 2012/07/21 7:01, David Starner wrote: I'm concerned about the statement/implication that one can optimize for ASCII and Latin-1. It's too easy for a lot of developers to test speed with the English/European documents they have around and test correctness only with Chinese. I see the argument in theory and practice, but it's a tough line to walk, especially if you're not familiar with i18n. I can see for i in range(1, 1000) do a := " "; a +:= "龜"; done being way slower than necessary (especially for cases that aren't trivially optimized away), for example. The main problem with the above loop isn't ASCII vs. Chinese or some such. It's that, depending on the way the programming language handles strings, it will result in a painter's-algorithm phenomenon (see http://www.joelonsoftware.com/articles/fog000319.html). Regards, Martin.
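The painter's-algorithm ("Schlemiel the Painter") effect Martin refers to is that repeated concatenation on immutable strings copies the whole accumulated string on each append, O(n²) work overall, while building the pieces and joining once is O(n). An illustrative sketch (note that CPython sometimes optimizes `+=` in place when the string has a single reference, which can mask the effect):

```python
def build_by_concat(n):
    """Naive accumulation: in a pure immutable-string model each +=
    copies all characters accumulated so far, O(n^2) total."""
    a = " "
    for _ in range(n):
        a += "\u9f9c"          # 龜
    return a

def build_by_join(n):
    """Collect pieces, then allocate and copy once: O(n) total."""
    return " " + "".join(["\u9f9c"] * n)
```

Either way the result is identical; the difference is purely in how many times intermediate characters get copied.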