Re: Compatibility Casefold Equivalence

2018-11-22 Thread Carl via Unicode
(It looks like my HTML email got scrubbed, sorry for the double post)

Hi,


In Chapter 3 Section 13, the Unicode spec defines D146:


"A string X is a compatibility caseless match for a string Y if and only if: 
NFKD(toCasefold(NFKD(toCasefold(NFD(X))))) = 
NFKD(toCasefold(NFKD(toCasefold(NFD(Y)))))"


I am trying to understand the "if and only if" part of this. Specifically, 
why is the outermost NFKD necessary? Could it also be an NFKC normalization? 
Is wrapping the outer NFKD in an NFC or NFKC on both sides of the equation okay?


My use case is that I am trying to store user-provided tags in a database.  I 
would like the tags to be deduplicated based on compatibility and caseless 
equivalence, which is how I ended up looking at D146.  However, because 
decomposition can result in much larger strings, I would prefer to keep the 
stored version in NFC or NFKC (I *think* this doesn't matter after doing the 
casefolding as described above).
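
A minimal Python sketch of the D146 comparison, for concreteness. It is not a reference implementation: it assumes Python's `str.casefold()` is an acceptable stand-in for toCasefold and that `unicodedata` tracks a suitable Unicode version. The names `compat_caseless_key`, `compat_caseless_match` and `storage_key` are made up for illustration. `storage_key` is the NFC-wrapped variant the question asks about.

```python
import unicodedata


def compat_caseless_key(s: str) -> str:
    """NFKD(toCasefold(NFKD(toCasefold(NFD(s))))), following D146."""
    step = unicodedata.normalize("NFD", s)
    step = unicodedata.normalize("NFKD", step.casefold())
    return unicodedata.normalize("NFKD", step.casefold())


def compat_caseless_match(x: str, y: str) -> bool:
    """True if x and y compare equal under the D146 key."""
    return compat_caseless_key(x) == compat_caseless_key(y)


def storage_key(s: str) -> str:
    """Hypothetical compact form: the D146 key recomposed with NFC."""
    return unicodedata.normalize("NFC", compat_caseless_key(s))


if __name__ == "__main__":
    for a, b in [("Straße", "STRASSE"), ("\ufb01", "FI"), ("tag", "tags")]:
        print(repr(a), repr(b), compat_caseless_match(a, b))
```

Since the D146 key is already fully decomposed, applying NFC to both sides only recomposes it, so deduplicating on `storage_key` should give the same groups as deduplicating on the raw key.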


Thanks,


Carl



Re: The encoding of the Welsh flag

2018-11-22 Thread Doug Ewell via Unicode

Christoph Päper wrote:


We have gotten requests for this, but the stumbling block is the lack
of an official N. Ireland document describing what the official flag
is and should look like.


Such documents are lacking for several of the RIS flag emojis as well,
though, e.g. for 🇺🇲 from ISO 3166-1 code `UM` (United States Outlying
Islands), resulting in unknown or duplicate flags, hence confusion.
The solution there would have been to exclude codes for dependent
territories becoming RGI emojis. ISO 3166 provides that property.


That's neither the problem nor the solution, IMHO. Even for RIS 
sequences, you have no guarantee of exactly how the flag will be 
depicted. For flags that have been recently changed, you might get the 
old or the new. For UM, you might get the US flag or one of the 
unofficially adopted flags. For Northern Ireland (if it were 
RGI-blessed), you might get either the Ulster Banner or St. Patrick's 
Saltire.


This situation is described, and explicitly so for the UM flags, in 
Annex B of UTS #51 under "Caveats."


--
Doug Ewell | Thornton, CO, US | ewellic.org




Re: The encoding of the Welsh flag

2018-11-22 Thread Doug Ewell via Unicode

Ken Whistler replied to Michael Everson:


What really annoys me about this is that there is no flag for
Northern Ireland. The folks at CLDR did not think to ask either the
UK or the Irish representatives to SC2 about this.


[...]


If you or Andrew West or anyone else is interested in pursuing an
emoji tag sequence for an emoji flag for Northern Ireland, then that
should be done by submitting a proposal, with justification, to the
Emoji Subcommittee, which *does* have jurisdiction.


There is, of course, an encoding for the flag of Northern Ireland:

1F3F4 E0067 E0062 E006E E0069 E0072 E007F

where the tag characters are "gbnir" followed by CANCEL TAG.
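
As a rough illustration of how such a sequence is put together, here is a short Python sketch. The function name `subdivision_flag` is hypothetical. It assumes the input is a lowercase ASCII subdivision code with the hyphen removed (e.g. "gbnir"), and it does not check that the code is a real subdivision or that any system will render it.

```python
import string

BLACK_FLAG = "\U0001F3F4"   # U+1F3F4 WAVING BLACK FLAG
CANCEL_TAG = "\U000E007F"   # U+E007F CANCEL TAG
_ALLOWED = set(string.ascii_lowercase + string.digits)


def subdivision_flag(code: str) -> str:
    """Build an emoji tag sequence for a code such as 'gbnir'."""
    if not code or any(c not in _ALLOWED for c in code):
        raise ValueError("expected lowercase ASCII letters and digits")
    # Each tag character is the ASCII code point offset by 0xE0000.
    tags = "".join(chr(0xE0000 + ord(c)) for c in code)
    return BLACK_FLAG + tags + CANCEL_TAG


if __name__ == "__main__":
    seq = subdivision_flag("gbnir")
    print(" ".join(f"{ord(c):X}" for c in seq))
```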

What I suspect Michael means is that this sequence is not RGI, or 
"recommended for general interchange," a status which applies for flag 
emoji only to England, Scotland, and Wales, and not to any of the 
thousands of other subdivisions worldwide.


The terminology currently in UTS #51 is definitely an improvement over 
early drafts, which explicitly labeled such sequences "not recommended," 
but it still leads practically everyone, evidently including Michael, to 
believe the sequences are invalid or non-existent.


I would certainly like to use the flag of Colorado, whose visual 
appearance is very much standardized, but the vicious circle of vendor 
support and UTS #51 categorization means no system will offer glyph 
support, and some systems may even reject it as invalid.


--
Doug Ewell | Thornton, CO, US | ewellic.org




Re: Unicode String Models

2018-11-22 Thread Henri Sivonen via Unicode
On Tue, Oct 2, 2018 at 3:04 PM Mark Davis ☕️  wrote:

>
>   * The Python 3.3 model mentions the disadvantages of memory usage
>> cliffs but doesn't mention the associated performance cliffs. It would
>> be good to also mention that when a string manipulation causes the
>> storage to expand or contract, there's a performance impact that's not
>> apparent from the nature of the operation if the programmer's
>> intuition works on the assumption that the programmer is dealing with
>> UTF-32.
>>
>
> The focus was on immutable string models, but I didn't make that clear.
> Added some text.
>

Thanks.


>  * The UTF-16/Latin1 model is missing. It's used by SpiderMonkey, DOM
>> text node storage in Gecko, (I believe but am not 100% sure) V8 and,
>> optionally, HotSpot
>> (
>> https://docs.oracle.com/javase/9/vm/java-hotspot-virtual-machine-performance-enhancements.htm#JSJVM-GUID-3BB4C26F-6DE7-4299-9329-A3E02620D50A
>> ).
>> That is, text has UTF-16 semantics, but if the high half of every code
>> unit in a string is zero, only the lower half is stored. This has
>> properties analogous to the Python 3.3 model, except non-BMP doesn't
>> expand to UTF-32 but uses UTF-16 surrogate pairs.
>>
>
> Thanks, will add.
>

V8 source code shows it has a OneByteString storage option:
https://cs.chromium.org/chromium/src/v8/src/objects/string.h?sq=package:chromium=0=494
. From hearsay, I'm convinced that it means Latin1, but I've failed to find
a clear quotable statement from a V8 developer to that effect.
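
To make the UTF-16/Latin1 model concrete, here is a small Python sketch of the storage decision described above. It only illustrates the idea and does not reflect how SpiderMonkey, Gecko, V8 or HotSpot actually implement it. The names `compact_store` and `compact_load` are invented for this example.

```python
from typing import Tuple


def compact_store(text: str) -> Tuple[str, bytes]:
    """Store text with UTF-16 semantics, one byte per unit when possible."""
    units = text.encode("utf-16-le")
    # In little-endian UTF-16 the high half of each code unit is the odd byte.
    if all(units[i] == 0 for i in range(1, len(units), 2)):
        return ("latin1", units[0::2])   # keep only the low halves
    return ("utf16", units)              # non-BMP stays as surrogate pairs


def compact_load(kind: str, data: bytes) -> str:
    """Inverse of compact_store."""
    return data.decode("latin-1" if kind == "latin1" else "utf-16-le")


if __name__ == "__main__":
    for s in ["cafe", "caf\u00e9", "\u65e5\u672c", "\U0001F600"]:
        kind, data = compact_store(s)
        print(repr(s), kind, len(data))
```

Appending a single non-Latin1 character to a long "latin1" string forces the whole buffer to be rewritten at two bytes per unit, which is the kind of performance cliff mentioned earlier in the thread.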


>   3) Type-system-tagged UTF-8 (the model of Rust): Valid UTF-8 buffers
>> have a different type in the type system than byte buffers. To go from
>> a byte buffer to a UTF-8 buffer, UTF-8 validity is checked. Once data
>> has been tagged as valid UTF-8, the validity is trusted completely so
>> that iteration by code point does not have "else" branches for
>> malformed sequences. If data that the type system indicates to be
>> valid UTF-8 wasn't actually valid, it would be nasal demon time. The
>> language has a default "safe" side and an opt-in "unsafe" side. The
>> unsafe side is for performing low-level operations in a way where the
>> responsibility of upholding invariants is moved from the compiler to
>> the programmer. It's impossible to violate the UTF-8 validity
>> invariant using the safe part of the language.
>>
>
> Added a quote based on this; please check if it is ok.
>

Looks accurate. Thanks.
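
The "validate once, then trust" idea can be sketched in Python, with the caveat that Python cannot enforce the invariant through the type system the way Rust does. The class name `Utf8` and its method are hypothetical. The point is only that the iterator has no branches for malformed input, because validity was checked at construction.

```python
class Utf8:
    """Bytes checked to be valid UTF-8 at construction time."""

    __slots__ = ("_buf",)

    def __init__(self, buf: bytes):
        buf.decode("utf-8")      # raises UnicodeDecodeError if malformed
        self._buf = bytes(buf)   # private copy, so the check stays true

    def code_points(self):
        """Yield code points, trusting the invariant: no error branches."""
        b, i = self._buf, 0
        while i < len(b):
            lead = b[i]
            if lead < 0x80:
                cp, n = lead, 1
            elif lead < 0xE0:
                cp, n = lead & 0x1F, 2
            elif lead < 0xF0:
                cp, n = lead & 0x0F, 3
            else:
                cp, n = lead & 0x07, 4
            for k in range(1, n):
                cp = (cp << 6) | (b[i + k] & 0x3F)
            yield cp
            i += n


if __name__ == "__main__":
    text = Utf8("a\u00e9\u20ac\U0001F600".encode("utf-8"))
    print([hex(cp) for cp in text.code_points()])
```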

-- 
Henri Sivonen
hsivo...@hsivonen.fi
https://hsivonen.fi/


Re: Can NFKC turn valid UAX 31 identifiers into non-identifiers?

2018-11-22 Thread Henri Sivonen via Unicode
On Wed, Jun 13, 2018 at 2:49 PM Mark Davis ☕️  wrote:
>
> > That is, why is conforming to UAX #31 worth the risk of prohibiting the use 
> > of characters that some users might want to use?
>
> One could parse for certain sequences, putting characters into a number of 
> broad categories. Very approximately:
>
> junk ~= [[:cn:][:cs:][:co:]]+
> whitespace ~= [[:z:][:c:]-junk]+
> syntax ~= [[:s:][:p:]] // broadly speaking, including both the language 
> syntax and user-named operators
> identifiers ~= [all-else]+
>
> UAX #31 specifies several different kinds of identifiers, and takes roughly 
> that approach for 
> http://unicode.org/reports/tr31/#Immutable_Identifier_Syntax, although the 
> focus there is on immutability.
>
> So an implementation could choose to follow that course, rather than the more 
> narrowly defined identifiers in 
> http://unicode.org/reports/tr31/#Default_Identifier_Syntax. Alternatively, 
> one can conform to the Default Identifiers but declare a profile that expands 
> the allowable characters. One could take a Swiftian approach, for example...

Thank you and sorry about my slow reply. Why is excluding junk important?
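
For readers who want to experiment with the four broad categories quoted above, here is an approximate Python rendering based on General_Category. It is a sketch under assumptions: the results depend on the Unicode version behind `unicodedata`, and the names `classify` and `tokens` are invented for illustration.

```python
import unicodedata
from itertools import groupby


def classify(ch: str) -> str:
    """Map one character to junk / whitespace / syntax / identifier."""
    cat = unicodedata.category(ch)
    if cat in ("Cn", "Cs", "Co"):   # unassigned, surrogate, private use
        return "junk"
    if cat[0] in "ZC":              # separators and remaining controls/format
        return "whitespace"
    if cat[0] in "SP":              # symbols and punctuation
        return "syntax"
    return "identifier"             # all else: letters, marks, numbers


def tokens(text: str):
    """Group consecutive characters that fall in the same broad category."""
    return [(kind, "".join(run)) for kind, run in groupby(text, key=classify)]


if __name__ == "__main__":
    print(tokens("ab.cd();"))
```

Note that under this rough split the ASCII underscore (General_Category Pc) lands in "syntax", which is one reason the categories are only "very approximately" right.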

> On Fri, Jun 8, 2018 at 11:07 AM, Henri Sivonen via Unicode  wrote:
>>
>> On Wed, Jun 6, 2018 at 2:55 PM, Henri Sivonen  wrote:
>> > Considering that ruling out too much can be a problem later, but just
>> > treating anything above ASCII as opaque hasn't caused trouble (that I
>> > know of) for HTML other than compatibility issues with XML's stricter
>> > stance, why should a programming language, if it opts to support
>> > non-ASCII identifiers in an otherwise ASCII core syntax, implement the
>> > complexity of UAX #31 instead of allowing everything above ASCII in
>> > identifiers? In other words, what problem does making a programming
>> > language conform to UAX #31 solve?
>>
>> After refreshing my memory of XML history, I realize that mentioning
>> XML does not helpfully illustrate my question despite the mention of
>> XML 1.0 5th ed. in UAX #31 itself. My apologies for that. Please
>> ignore the XML part.
>>
>> Trying to rephrase my question more clearly:
>>
>> Let's assume that we are designing a computer-parseable syntax where
>> tokens consisting of user-chosen characters can't occur next to each
>> other and, instead, always have some syntax-reserved characters
>> between them. That is, I'm talking about syntaxes that look like this
>> (could be e.g. Java):
>>
>>     ab.cd();
>>
>> Here, ab and cd are tokens with user-chosen characters whereas space
>> (the indent), period, parenthesis and the semicolon are
>> syntax-reserved. We know that ab and cd are distinct tokens, because
>> there is a period between them, and we know the opening parenthesis
>> ends the cd token.
>>
>> To illustrate what I'm explicitly _not_ talking about, I'm not talking
>> about a syntax like this:
>>
>> αβ⊗γδ
>>
>> Here αβ and γδ are user-named variable names and ⊗ is a user-named
>> operator and the distinction between different kinds of user-named
>> tokens has to be known somehow in order to be able to tell that there
>> are three distinct tokens: αβ, ⊗, and γδ.
>>
>> My question is:
>>
>> When designing a syntax where tokens with the user-chosen characters
>> can't occur next to each other without some syntax-reserved characters
>> between them, what advantages are there from limiting the user-chosen
>> characters according to UAX #31 as opposed to treating any character
>> that is not a syntax-reserved character as a character that can occur
>> in user-named tokens?
>>
>> I understand that taking the latter approach allows users to mint
>> tokens that on some aesthetic measure don't make sense (e.g. minting
>> tokens that consist of glyphless code points), but why is it important
>> to prescribe that this is prohibited as opposed to just letting users
>> choose not to mint tokens that are inconvenient for them to work with
>> given the behavior that their plain text editor gives to various
>> characters? That is, why is conforming to UAX #31 worth the risk of
>> prohibiting the use of characters that some users might want to use?
>> The introduction of XID after ID and the introduction of Extended
>> Hashtag Identifiers after XID is indicative of over-restriction having
>> been a problem.
>>
>> Limiting user-minted tokens to UAX #31 does not appear to be necessary
>> for security purposes considering that HTML and CSS exist in a
>> particularly adversarial environment and get away with taking the
>> approach that any character that isn't a syntax-reserved character is
>> collected as part of a user-minted identifier. (Informally, both treat
>> non-ASCII characters the same as an ASCII underscore. HTML even treats
>> non-whitespace, non-U+ ASCII controls that way.)
>>
>> --
>> Henri Sivonen
>> hsivo...@hsivonen.fi
>> https://hsivonen.fi/
>>
>
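
The alternative being asked about, where anything that is not syntax-reserved counts as part of a user-chosen token, can be sketched in a few lines of Python. The reserved set here (ASCII whitespace, punctuation and controls) is an assumption made for the example and is not taken from HTML, CSS or any real language. The name `tokenize` is likewise hypothetical.

```python
import re

# A user token is a run of ASCII letters, digits, underscore, or ANY
# non-ASCII code point. Every other non-whitespace character is treated
# as a single syntax-reserved token; whitespace is skipped.
_TOKEN = re.compile(r"[A-Za-z0-9_\u0080-\U0010FFFF]+|\S")


def tokenize(source: str):
    """Split source into user tokens and single reserved characters."""
    return _TOKEN.findall(source)


if __name__ == "__main__":
    print(tokenize("    ab.cd();"))
    print(tokenize("\u03b1\u03b2.\u03b3\u03b4();"))
```

In this sketch even glyphless code points such as U+200D end up inside user tokens, which is exactly the behavior whose costs and benefits are in question.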


-- 
Henri Sivonen
hsivo...@hsivonen.fi
https://hsivonen.fi/