Re: Dealing with Georgian capitalization in programming languages

2018-10-02 Thread Ken Whistler via Unicode



On 10/2/2018 12:45 AM, Martin J. Dürst via Unicode wrote:
capitalize: uppercase (or title-case) the first character of the 
string, lowercase the rest



When I say "cause problems", I mean producing mixed-case output. I 
originally thought that 'capitalize' would be fine. It is fine for 
lowercase input: It stays lowercase because Unicode Data indicates that 
titlecase for lowercase Georgian letters is the letter itself. But it 
will produce the apparently undesirable Mixed Case for ALL UPPERCASE 
input.


My questions here are:
- Has this been considered when Georgian Mtavruli was discussed in the
  UTC?

Not explicitly, that I recall. The whole issue of titlecasing came up 
very late in the preparation of case mapping tables for Mtavruli and 
Mkhedruli for 11.0.


But it seems to me that the problem you are citing can be avoided if you 
simply rethink what your "capitalize" means. It really should be 
conceived of as first lowercasing the *entire* string, and then 
titlecasing the *eligible* letters -- i.e., usually the first letter. 
(Note that this allows for the concept that titlecasing might then be 
localized on a per-writing-system basis -- the issue would devolve to 
determining what the rules are for "eligible" letters.) But the simple 
default would just be to titlecase the initial letter of each "word" 
segment of a string.


Note that conceived this way, for the Georgian mappings, where the 
titlecase mapping for Mkhedruli is simply the letter itself, this 
approach ends up with:


capitalize(mkhedrulistring) --> mkhedrulistring

capitalize(MTAVRULISTRING) --> titlecase(lowercase(MTAVRULISTRING)) --> 
mkhedrulistring


Thus avoiding any mixed case.
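A quick sketch of this approach (in Python rather than Ruby, as a hypothetical illustration of Ken's description, not Ruby's actual implementation; it assumes a Unicode 11.0-aware runtime, i.e. Python 3.7+):

```python
# Ken's "capitalize": lowercase the *entire* string first, then
# titlecase the eligible (here: initial) letter.
def capitalize(s):
    s = s.lower()                    # Mtavruli -> Mkhedruli
    return s[:1].title() + s[1:]     # titlecase of Mkhedruli is itself

AN, BAN = "\u1C90", "\u1C91"         # MTAVRULI CAPITAL LETTER AN, BAN
an, ban = "\u10D0", "\u10D1"         # Mkhedruli an, ban

assert capitalize(an + ban) == an + ban   # mkhedrulistring -> unchanged
assert capitalize(AN + BAN) == an + ban   # MTAVRULISTRING -> mkhedruli
assert capitalize("hello") == "Hello"     # Latin still capitalizes
```

Because the whole string is lowercased before titlecasing, Mtavruli input can never leak through as a mixed-case first letter.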

--Ken



Re: Dealing with Georgian capitalization in programming languages

2018-10-02 Thread Philippe Verdy via Unicode
I see no easy way to convert ALL UPPERCASE text with consistent casing, as
there's no rule, except by using dictionary lookups.
In reality, data should be input using default casing (as in dictionary
entries), independently of its position in sentences, paragraphs or
titles, with the contextual conversion of some or all characters to
uppercase done algorithmically (this is safe for conversion to ALL
UPPERCASE, and quite reliable for conversion to Title Case, with just a few
dictionary lookups for a small set of known words per language).

Note that title casing works differently in English (which most often
overuses it by putting capitals on every word), while most other languages
capitalize only selected words; French capitalizes just the first selected
word (in addition to the possible first letter of non-selected words such as
definite and indefinite articles at the start of the sentence). Capitalizing
the initial of every word is wrong in German, which uses capitalization even
more strictly than French or Italian: when in doubt, do not perform any
titlecasing, and allow the data to provide the actual capitalization of titles
directly (it is OK, and even recommended in German, to have section headings,
or even book titles, written as if they were in the middle of sentences;
you capitalize only titles and headings that are grammatically full
sentences, but not simple nominal groups).

So title casing should not even be promoted by the UCD standard (where it
in fact uses only very basic, simplistic rules); it is applicable only in
some applications, for some languages, and in specific technical or rendering
contexts.



On Tue, Oct 2, 2018 at 10:21 PM Markus Scherer via Unicode 
wrote:

> On Tue, Oct 2, 2018 at 12:50 AM Martin J. Dürst via Unicode <
> unicode@unicode.org> wrote:
>
>> ... The only
>> operation that can cause problems is 'capitalize'.
>>
>> When I say "cause problems", I mean producing mixed-case output. I
>> originally thought that 'capitalize' would be fine. It is fine for
>> lowercase input: It stays lowercase because Unicode Data indicates that
>> titlecase for lowercase Georgian letters is the letter itself. But it
>> will produce the apparently undesirable Mixed Case for ALL UPPERCASE
>> input.
>>
>> My questions here are:
>> - Has this been considered when Georgian Mtavruli was discussed in the
>>UTC?
>> - How have any other implementers (ICU,...) addressed this, in
>>particular the operation that's called 'capitalize' in Ruby?
>>
>
> By default, ICU toTitle() functions titlecase at word boundaries (with
> adjustment) and lowercase all else.
> That is, we implement Unicode chapter 3.13 Default Case Conversions R3
> toTitlecase(x), except that we modified the default boundary adjustment.
>
> You can customize the boundaries (e.g., only the start of the string).
> We have options for whether and how to adjust the boundaries (e.g., adjust
> to the next cased letter) and for copying, not lowercasing, the other
> characters.
> See C++ and Java class CaseMap and the relevant options.
>
> markus
>


Re: Dealing with Georgian capitalization in programming languages

2018-10-02 Thread Markus Scherer via Unicode
On Tue, Oct 2, 2018 at 12:50 AM Martin J. Dürst via Unicode <
unicode@unicode.org> wrote:

> ... The only
> operation that can cause problems is 'capitalize'.
>
> When I say "cause problems", I mean producing mixed-case output. I
> originally thought that 'capitalize' would be fine. It is fine for
> lowercase input: It stays lowercase because Unicode Data indicates that
> titlecase for lowercase Georgian letters is the letter itself. But it
> will produce the apparently undesirable Mixed Case for ALL UPPERCASE input.
>
> My questions here are:
> - Has this been considered when Georgian Mtavruli was discussed in the
>UTC?
> - How have any other implementers (ICU,...) addressed this, in
>particular the operation that's called 'capitalize' in Ruby?
>

By default, ICU toTitle() functions titlecase at word boundaries (with
adjustment) and lowercase all else.
That is, we implement Unicode chapter 3.13 Default Case Conversions R3
toTitlecase(x), except that we modified the default boundary adjustment.

You can customize the boundaries (e.g., only the start of the string).
We have options for whether and how to adjust the boundaries (e.g., adjust
to the next cased letter) and for copying, not lowercasing, the other
characters.
See C++ and Java class CaseMap and the relevant options.

markus
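Outside ICU, the default behavior described here (Unicode R3 toTitlecase: titlecase at each word boundary, lowercase all else) can be approximated in a few lines. This is a rough pure-Python sketch, using a naive \w+ word segmentation instead of ICU's UAX #29 boundaries and boundary-adjustment options:

```python
import re

def to_title(s):
    # Roughly R3 toTitlecase(x): titlecase the first letter of each
    # "word", lowercase the rest of it.  Real implementations (ICU)
    # use UAX #29 word boundaries plus configurable adjustment.
    def one_word(m):
        w = m.group()
        return w[:1].title() + w[1:].lower()
    return re.sub(r"\w+", one_word, s)

print(to_title("dealing WITH georgian CAPITALIZATION"))
# -> Dealing With Georgian Capitalization
```

Customizing the boundaries (e.g., titlecasing only at the start of the string) then amounts to changing which matches get the `one_word` treatment.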


Re: Unicode String Models

2018-10-02 Thread Daniel Bünzli via Unicode
On 2 October 2018 at 14:03:48, Mark Davis ☕️ via Unicode (unicode@unicode.org) 
wrote:

> Because of performance and storage consideration, you need to consider the
> possible internal data structures when you are looking at something as
> low-level as strings. But most of the 'model's in the document are only
> really distinguished by API, only the "Code Point model" discussions are
> segmented by internal storage, as with "Code Point Model: UTF-32"

I guess my gripe with the presentation of that document is that it perpetuates 
the problem of confusing "unicode characters" (or integers, or scalar values) 
and their *encoding* (how to represent these integers as byte sequences), which 
is a source of endless confusion among programmers. 

This confusion is easily lifted once you explain that there exist certain 
integers, the scalar values, which are your actual characters, and that you 
then have different ways of encoding your characters; one can then explain 
that a surrogate is not a character per se, it's a hack, and there's no point 
in indexing them except if you want trouble.

This may also suggest another taxonomy of classification for the APIs, those in 
which you work directly with the character data (the scalar values) and those 
in which you work with an encoding of the actual character data (e.g. a 
JavaScript string).

> In reality, most APIs are not even going to be in terms of code points:
> they will return int32's. 

That reality depends on your programming language. If the latter supports type 
abstraction you can define an abstract type for scalar values (whose 
implementation may simply be an integer). If you always go through the 
constructor to create these "integers" you can maintain the invariant that a 
value of this type is an integer in the ranges [0x0000;0xD7FF] and 
[0xE000;0x10FFFF]. Knowing this invariant holds is quite useful when you feed 
your "character" data to other processes like UTF-X encoders: it guarantees the 
correctness of their outputs regardless of what the programmer does.
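As an illustration of this kind of abstraction (a hypothetical type for the sake of example, not any particular library's API):

```python
class Scalar:
    """A Unicode scalar value: an int in [0x0000, 0xD7FF] or
    [0xE000, 0x10FFFF].  The constructor is the only way to make one,
    so the invariant holds for every value of this type."""
    __slots__ = ("value",)

    def __init__(self, value: int):
        if not (0 <= value <= 0xD7FF or 0xE000 <= value <= 0x10FFFF):
            raise ValueError(f"not a Unicode scalar value: {value:#x}")
        self.value = value

    def __str__(self):
        # Always safe: a scalar value can be handed to any UTF-X encoder.
        return chr(self.value)
```

Surrogate code points (U+D800..U+DFFF) are rejected at construction, so downstream encoders never see them.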

Best, 

Daniel





Re: Unicode String Models

2018-10-02 Thread Mark Davis ☕️ via Unicode
Whether or not it is well suited, that's probably water under the bridge at
this point. Think of it as jargon; after all, there are lots of cases like
that: a "near miss" wasn't nearly a miss, it was nearly a hit.

Mark


On Sun, Sep 9, 2018 at 10:56 AM Janusz S. Bień  wrote:

> On Sat, Sep 08 2018 at 18:36 +0200, Mark Davis ☕️ via Unicode wrote:
> > I recently did some extensive revisions of a paper on Unicode string
> models (APIs). Comments are welcome.
> >
> >
> https://docs.google.com/document/d/1wuzzMOvKOJw93SWZAqoim1VUl9mloUxE0W6Ki_G23tw/edit#
>
> It's a good opportunity to propose a better term for "extended grapheme
> cluster", which usually are neither extended nor clusters, it's also not
> obvious that they are always graphemes.
>
> Cf.the earlier threads
>
> https://www.unicode.org/mail-arch/unicode-ml/y2017-m03/0031.html
> https://www.unicode.org/mail-arch/unicode-ml/y2016-m09/0040.html
>
> Best regards
>
> Janusz
>
> --
> Janusz S. Bień
> emeryt (emeritus)
> https://sites.google.com/view/jsbien
>


Re: Unicode String Models

2018-10-02 Thread Mark Davis ☕️ via Unicode
Mark


On Tue, Sep 11, 2018 at 12:17 PM Henri Sivonen via Unicode <
unicode@unicode.org> wrote:

> On Sat, Sep 8, 2018 at 7:36 PM Mark Davis ☕️ via Unicode
>  wrote:
> >
> > I recently did some extensive revisions of a paper on Unicode string
> models (APIs). Comments are welcome.
> >
> >
> https://docs.google.com/document/d/1wuzzMOvKOJw93SWZAqoim1VUl9mloUxE0W6Ki_G23tw/edit#
>
> * The Grapheme Cluster Model seems to have a couple of disadvantages
> that are not mentioned:
>   1) The subunit of string is also a string (a short string conforming
> to particular constraints). There's a need for *another* more atomic
> mechanism for examining the internals of the grapheme cluster string.
>

I did mention this.


>   2) The way an arbitrary string is divided into units when iterating
> over it changes when the program is executed on a newer version of the
> language runtime that is aware of newly-assigned codepoints from a
> newer version of Unicode.
>

Good point. I did mention the EGC definitions changing, but should point
out that if you have a string with unassigned characters in it, they may be
clustered in future versions. Will add.


>  * The Python 3.3 model mentions the disadvantages of memory usage
> cliffs but doesn't mention the associated performance cliffs. It would
> be good to also mention that when a string manipulation causes the
> storage to expand or contract, there's a performance impact that's not
> apparent from the nature of the operation if the programmer's
> intuition works on the assumption that the programmer is dealing with
> UTF-32.
>

The focus was on immutable string models, but I didn't make that clear.
Added some text.

>
>  * The UTF-16/Latin1 model is missing. It's used by SpiderMonkey, DOM
> text node storage in Gecko, (I believe but am not 100% sure) V8 and,
> optionally, HotSpot
> (
> https://docs.oracle.com/javase/9/vm/java-hotspot-virtual-machine-performance-enhancements.htm#JSJVM-GUID-3BB4C26F-6DE7-4299-9329-A3E02620D50A
> ).
> That is, text has UTF-16 semantics, but if the high half of every code
> unit in a string is zero, only the lower half is stored. This has
> properties analogous to the Python 3.3 model, except non-BMP doesn't
> expand to UTF-32 but uses UTF-16 surrogate pairs.
>

Thanks, will add.

>
>  * I think the fact that systems that chose UTF-16 or UTF-32 have
> implemented models that try to save storage by omitting leading zeros
> and gaining complexity and performance cliffs as a result is a strong
> indication that UTF-8 should be recommended for newly-designed systems
> that don't suffer from a forceful legacy need to expose UTF-16 or
> UTF-32 semantics.
>
>  * I suggest splitting the "UTF-8 model" into three substantially
> different models:
>
>  1) The UTF-8 Garbage In, Garbage Out model (the model of Go): No
> UTF-8-related operations are performed when ingesting byte-oriented
> data. Byte buffers and text buffers are type-wise ambiguous. Only
> iterating over byte data by code point gives the data the UTF-8
> interpretation. Unless the data is cleaned up as a side effect of such
> iteration, malformed sequences in input survive into output.
>
>  2) UTF-8 without full trust in ability to retain validity (the model
> of the UTF-8-using C++ parts of Gecko; I believe this to be the most
> common UTF-8 model for C and C++, but I don't have evidence to back
> this up): When data is ingested with text semantics, it is converted
> to UTF-8. For data that's supposed to already be in UTF-8, this means
> replacing malformed sequences with the REPLACEMENT CHARACTER, so the
> data is valid UTF-8 right after input. However, iteration by code
> point doesn't trust ability of other code to retain UTF-8 validity
> perfectly and has "else" branches in order not to blow up if invalid
> UTF-8 creeps into the system.
>
>  3) Type-system-tagged UTF-8 (the model of Rust): Valid UTF-8 buffers
> have a different type in the type system than byte buffers. To go from
> a byte buffer to an UTF-8 buffer, UTF-8 validity is checked. Once data
> has been tagged as valid UTF-8, the validity is trusted completely so
> that iteration by code point does not have "else" branches for
> malformed sequences. If data that the type system indicates to be
> valid UTF-8 wasn't actually valid, it would be nasal demon time. The
> language has a default "safe" side and an opt-in "unsafe" side. The
> unsafe side is for performing low-level operations in a way where the
> responsibility of upholding invariants is moved from the compiler to
> the programmer. It's impossible to violate the UTF-8 validity
> invariant using the safe part of the language.
>

Added a quote based on this; please check if it is ok.
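For models 2 and 3 above, the difference shows up at the ingestion boundary. In Python terms (as an analogy, not the Gecko or Rust code itself):

```python
data = b"caf\xc3\xa9 \xff"   # valid UTF-8 for "café " plus one stray byte

# Model 2: repair on ingestion -- malformed sequences become U+FFFD,
# so the data is valid right after input; later code typically still
# keeps defensive "else" branches.
text = data.decode("utf-8", errors="replace")
assert text == "café \ufffd"

# Model 3: validity is checked at the type boundary and then trusted;
# invalid input is an error, never silently repaired (cf. Rust's
# String::from_utf8 returning a Result).
try:
    data.decode("utf-8", errors="strict")
except UnicodeDecodeError:
    pass  # caller must handle or reject the bytes
```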

>
>  * After working with different string models, I'd recommend the Rust
> model for newly-designed programming languages. (Not because I work
> for Mozilla but because I believe Rust's way of dealing with Unicode
> is the best I've seen.) Rust's standard library provides Unicode
> 

Re: Unicode String Models

2018-10-02 Thread Mark Davis ☕️ via Unicode
Mark


On Sun, Sep 9, 2018 at 3:42 PM Daniel Bünzli 
wrote:

> Hello,
>
> I find your notion of "model" and presentation a bit confusing since it
> conflates what I would call the internal representation and the API.
>
> The internal representation defines how the Unicode text is stored and
> should not really matter to the end user of the string data structure. The
> API defines how the Unicode text is accessed, expressed by what is the
> result of an indexing operation on the string. The latter is really what
> matters for the end-user and what I would call the "model".
>

Because of performance and storage consideration, you need to consider the
possible internal data structures when you are looking at something as
low-level as strings. But most of the 'model's in the document are only
really distinguished by API, only the "Code Point model" discussions are
segmented by internal storage, as with "Code Point Model: UTF-32"


> I think the presentation would benefit from making a clear distinction
> between the internal representation and the API; you could then easily
> summarize them in a table which would make a nice summary of the design
> space.
>

That's an interesting suggestion, I'll mull it over.

>
> I also think you are missing one API which is the one with ECG I would
> favour: indexing returns Unicode scalar values, internally be it whatever
> you wish UTF-{8,16,32} or a custom encoding. Maybe that's what you intended
> by the "Code Point Model: Internal 8/16/32" but that's not what it says,
> the distinction between code point and scalar value is an important one and
> I think it would be good to insist on it to clarify the minds in such
> documents.
>

In reality, most APIs are not even going to be in terms of code points:
they will return int32's. So not only are they not scalar values,
99.97% are not even code points. Of course, values above 0x10FFFF or below 0
shouldn't ever be stored in strings, but in practice treating
non-scalar-value code points as "permanently unassigned" characters doesn't
really cause problems in processing.


> Best,
>
> Daniel
>
>
>


Re: Unicode String Models

2018-10-02 Thread Mark Davis ☕️ via Unicode
Mark


On Sun, Sep 9, 2018 at 10:03 AM Richard Wordingham via Unicode <
unicode@unicode.org> wrote:

> On Sat, 8 Sep 2018 18:36:00 +0200
> Mark Davis ☕️ via Unicode  wrote:
>
> > I recently did some extensive revisions of a paper on Unicode string
> > models (APIs). Comments are welcome.
> >
> >
> https://docs.google.com/document/d/1wuzzMOvKOJw93SWZAqoim1VUl9mloUxE0W6Ki_G23tw/edit#
>
>
> Theoretically at least, the cost of indexing a big string by codepoint
> is negligible.  For example, cost of accessing the middle character is
> O(1)*, not O(n), where n is the length of the string.  The trick is to
> use a proportionately small amount of memory to store and maintain a
> partial conversion table from character index to byte index.  For
> example, Emacs claims to offer O(1) access to a UTF-8 buffer by
> character number, and I can't significantly fault the claim.
>
> *There may be some creep, but it doesn't matter for strings that can be
> stored within a galaxy.
>
> Of course, the coefficients implied by big-oh notation also matter.
> For example, it can be very easy to forget that a bubble sort is often
> the quickest sorting algorithm.
>

Thanks, added a quote from you on that; see if it looks ok.
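A toy version of such a checkpoint table (a hypothetical sketch, not Emacs's implementation) shows why the cost stays effectively O(1):

```python
class IndexedUtf8:
    """UTF-8 bytes plus a table of the byte offset of every k-th code
    point.  Indexing scans at most k-1 code points from a checkpoint,
    so access is O(k) = O(1) for fixed k, at ~1/k extra storage."""

    def __init__(self, data: bytes, k: int = 64):
        self.data, self.k = data, k
        self.checkpoints = []        # byte offsets of code points 0, k, 2k...
        n = 0
        for off, b in enumerate(data):
            if b & 0xC0 != 0x80:     # not a continuation byte: new code point
                if n % k == 0:
                    self.checkpoints.append(off)
                n += 1
        self.length = n

    def __getitem__(self, i: int) -> str:
        off = self.checkpoints[i // self.k]
        for _ in range(i % self.k):  # skip forward to code point i
            off += 1
            while self.data[off] & 0xC0 == 0x80:
                off += 1
        end = off + 1
        while end < len(self.data) and self.data[end] & 0xC0 == 0x80:
            end += 1
        return self.data[off:end].decode("utf-8")

s = "aé€😀x"                          # 1-, 2-, 3-, and 4-byte code points
ix = IndexedUtf8(s.encode(), k=2)
assert [ix[i] for i in range(ix.length)] == list(s)
```

A real implementation would also keep the table up to date under mutation; for immutable strings it is built once.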


> You keep muttering that a sequence of 8-bit code units can contain
> invalid sequences, but often forget that that is also true of sequences
> of 16-bit code units.  Do emoji now ensure that confusion between
> codepoints and code units rapidly comes to light?
>

I didn't neglect that, had a [TBD] for it.

While UTF-16 invalid unpaired surrogates don't complicate processing much if
they are treated as unassigned characters, allowing invalid UTF-8 sequences
is more troublesome. See, for example, the convolutions needed in ICU
methods that allow ill-formed UTF-8.


> You seem to keep forgetting that grapheme clusters are not how some
> people work.  Does the English word 'café' contain the letter
> 'e'?  Yes or no?  I maintain that it does.  I can't help thinking that
> one might want to look for the letter 'ă' in Vietnamese and find it
> whatever the associated tone mark is.
>

I'm pretty familiar with the situation, thanks for asking.

Often you want to find out more about the components of grapheme clusters,
so you always need to be able to iterate through the code points it
contains. One might think that iterating by grapheme cluster is hiding
features of the text. For example, with *fox́* (fox\u{301}) it is easy to
find that the text contains an *x* by iterating through code points. But
code points often don't reveal their components: does the word
*también* contain
the letter *e*? A reasonable question, but iterating by code point rather
than grapheme cluster doesn't help, since it is typically encoded as a
single U+00E9. And even decomposing to NFD doesn't always help, as with
cases like *rødgrød*.
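The *también*/*rødgrød* point is easy to check mechanically:

```python
import unicodedata

s = "también"                        # é is normally one code point, U+00E9
assert "e" not in s                  # code point iteration misses the 'e'
assert "e" in unicodedata.normalize("NFD", s)   # NFD exposes 'e' + U+0301

# ...but NFD doesn't always help: ø has no canonical decomposition,
# so no plain 'o' ever appears in "rødgrød".
assert "o" not in unicodedata.normalize("NFD", "rødgrød")
```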

>
> You didn't discuss substrings.


I did. But if you mean a definition of substring that lets you access
internal components of substrings, I'm afraid that is quite a specialized
usage. One could do it, but it would burden the general use case.

> I'm interested in how subsequences of
> strings are defined, as the concept of 'substring' isn't really Unicode
> compliant.  Again, expressing 'ă' as a subsequence of the Vietnamese
> word 'nặng' ought to be possible, whether one is using NFD (easier) or
> NFC.  (And there are alternative normalisations that are compatible
> with canonical equivalence.)  I'm most interested in subsequences X of a
> word W where W is the same as AXB for some strings A and B.


> Richard.
>
>


Re: Unicode String Models

2018-10-02 Thread Mark Davis ☕️ via Unicode
Thanks, added a quote from you on that; see if it looks ok.

Mark


On Sat, Sep 8, 2018 at 9:20 PM John Cowan  wrote:

> This paper makes the default assumption that the internal storage of a
> string is a featureless array.  If this assumption is abandoned, it is
> possible to get O(1) indexes with fairly low space overhead.  The Scheme
> language has recently adopted immutable strings called "texts" as a
> supplement to its pre-existing mutable strings, and the sample
> implementation for this feature uses a vector of either native strings or
> bytevectors (char[] vectors in C/Java terms).  I would urge anyone
> interested in the question of storing and accessing mutable strings to read
> the following parts of SRFI 135 at <
> https://srfi.schemers.org/srfi-135/srfi-135.html>:  Abstract, Rationale,
> Specification / Basic concepts, and Implementation.  In addition, the
> design notes at ,
> though not up to date (in particular, UTF-16 internals are now allowed as
> an alternative to UTF-8), are of interest: unfortunately, the link to the
> span API has rotted.
>
> On Sat, Sep 8, 2018 at 12:53 PM Mark Davis ☕️ via Unicore <
> unic...@unicode.org> wrote:
>
>> I recently did some extensive revisions of a paper on Unicode string
>> models (APIs). Comments are welcome.
>>
>>
>> https://docs.google.com/document/d/1wuzzMOvKOJw93SWZAqoim1VUl9mloUxE0W6Ki_G23tw/edit#
>>
>> Mark
>>
>


Re: Unicode String Models

2018-10-02 Thread Mark Davis ☕️ via Unicode
Thanks to all for comments. Just revised the text in https://goo.gl/neguxb.

Mark


On Sat, Sep 8, 2018 at 6:36 PM Mark Davis ☕️  wrote:

> I recently did some extensive revisions of a paper on Unicode string
> models (APIs). Comments are welcome.
>
>
> https://docs.google.com/document/d/1wuzzMOvKOJw93SWZAqoim1VUl9mloUxE0W6Ki_G23tw/edit#
>
> Mark
>


Dealing with Georgian capitalization in programming languages

2018-10-02 Thread Martin J. Dürst via Unicode
Since the last discussion on Georgian (Mtavruli) on this mailing list, I 
have been looking into how to implement it in the Programming language Ruby.


Ruby has four case-conversion operations for its class String:

upcase:   convert all characters to upper case
downcase: convert all characters to lower case
swapcase: switch upper to lower and lower to upper case
capitalize:  uppercase (or title-case) the first character of the 
string, lowercase the rest


'upcase' and 'downcase' don't pose problems. 'swapcase' doesn't cause 
problems assuming the input doesn't have any problems. The only 
operation that can cause problems is 'capitalize'.


When I say "cause problems", I mean producing mixed-case output. I 
originally thought that 'capitalize' would be fine. It is fine for 
lowercase input: It stays lowercase because Unicode Data indicates that 
titlecase for lowercase Georgian letters is the letter itself. But it 
will produce the apparently undesirable Mixed Case for ALL UPPERCASE input.
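For concreteness, here is the failure mode sketched in Python as a stand-in for Ruby's 'capitalize' (a hypothetical illustration, assuming a Unicode 11.0-aware runtime, i.e. Python 3.7+):

```python
def naive_capitalize(s):
    # Titlecase the first character, lowercase the rest -- the same
    # shape as Ruby's String#capitalize.
    return s[:1].title() + s[1:].lower()

AN, BAN = "\u1C90", "\u1C91"   # MTAVRULI CAPITAL LETTER AN, BAN
an, ban = "\u10D0", "\u10D1"   # Mkhedruli an, ban

# Lowercase input is fine: titlecase of a Mkhedruli letter is itself.
assert naive_capitalize(an + ban) == an + ban
# ALL UPPERCASE input comes out Mixed Case: the Mtavruli first letter
# is left as-is while the rest is lowercased to Mkhedruli.
assert naive_capitalize(AN + BAN) == AN + ban
```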


My questions here are:
- Has this been considered when Georgian Mtavruli was discussed in the
  UTC?
- How have any other implementers (ICU,...) addressed this, in
  particular the operation that's called 'capitalize' in Ruby?

Many thanks in advance for your input,

Regards,   Martin.