Re: Collation Grapheme Clusters and Canonical Equivalence

2019-10-18 Thread Richard Wordingham via Unicode
On Thu, 17 Oct 2019 23:11:55 +0100 Richard Wordingham via Unicode wrote: > There seems to be a Unicode non-compliance (C6) issue in the > definition of collation grapheme clusters (defined in UTS#10 Section > 9.9). Using the DUCET collation, the canonically equivalent strings > รู้

Collation Grapheme Clusters and Canonical Equivalence

2019-10-17 Thread Richard Wordingham via Unicode
There seems to be a Unicode non-compliance (C6) issue in the definition of collation grapheme clusters (defined in UTS#10 Section 9.9). Using the DUCET collation, the canonically equivalent strings รู้ and รัู decompose into collation grapheme clusters in two different ways. The first
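
A minimal sketch of the canonical-equivalence claim in the snippet above, using Python's standard `unicodedata` module. The code points are an assumption (the snippet does not spell them out): U+0E23 RO RUA followed by U+0E39 SARA UU and U+0E49 MAI THO in either order. It only checks that the two orders share one NFD form; it does not compute collation grapheme clusters.

```python
import unicodedata

# Assumed code points, not given explicitly in the snippet:
# U+0E23 THAI CHARACTER RO RUA, U+0E39 SARA UU, U+0E49 MAI THO.
VOWEL_THEN_TONE = "\u0E23\u0E39\u0E49"
TONE_THEN_VOWEL = "\u0E23\u0E49\u0E39"


def canonically_equivalent(a: str, b: str) -> bool:
    """Two strings are canonically equivalent when their NFD forms match."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


def combining_classes(s: str) -> list:
    """Canonical combining class of each code point, in string order."""
    return [unicodedata.combining(ch) for ch in s]


if __name__ == "__main__":
    print(canonically_equivalent(VOWEL_THEN_TONE, TONE_THEN_VOWEL))
    print(combining_classes(TONE_THEN_VOWEL))
```

The two marks have different non-zero combining classes, so normalization reorders them into a single canonical order regardless of how they were typed.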

Re: UCA unnecessary collation weight 0000

2018-11-04 Thread Philippe Verdy via Unicode
0" does not >>> mean it is a valid "weight", it's a notation only >>> >>> No, it is explicitly a valid weight. And it is explicitly and >>> normatively referred to in the specification of the algorithm. See UTS10-D8 >>> (and su

Re: UCA unnecessary collation weight 0000

2018-11-04 Thread Mark Davis ☕️ via Unicode
plicitly a valid weight. And it is explicitly and normatively >> referred to in the specification of the algorithm. See UTS10-D8 (and >> subsequent definitions), which explicitly depend on a definition of "A >> collation weight whose value is zero." The entire statement o

Re: UCA unnecessary collation weight 0000

2018-11-03 Thread Philippe Verdy via Unicode
and normatively > referred to in the specification of the algorithm. See UTS10-D8 (and > subsequent definitions), which explicitly depend on a definition of "A > collation weight whose value is zero." The entire statement of what are > primary, secondary, tertiary, etc. collation elemen

Re: UCA unnecessary collation weight 0000

2018-11-03 Thread Philippe Verdy via Unicode
and normatively > referred to in the specification of the algorithm. See UTS10-D8 (and > subsequent definitions), which explicitly depend on a definition of "A > collation weight whose value is zero." The entire statement of what are > primary, secondary, tertiary, etc. collation elemen

Re: UCA unnecessary collation weight 0000

2018-11-02 Thread Richard Wordingham via Unicode
only > > No, it is explicitly a valid weight. And it is explicitly and > normatively referred to in the specification of the algorithm. See > UTS10-D8 (and subsequent definitions), which explicitly depend on a > definition of "A collation weight whose value is zero." The

Re: UCA unnecessary collation weight 0000

2018-11-02 Thread Ken Whistler via Unicode
at its use of "0000" does not mean it is a valid "weight", it's a notation only No, it is explicitly a valid weight. And it is explicitly and normatively referred to in the specification of the algorithm. See UTS10-D8 (and subsequent definitions), which explicitly depe

Re: UCA unnecessary collation weight 0000

2018-11-02 Thread Philippe Verdy via Unicode
introducing confusion about these "0000". UTR#10 still does not explicitly state that its use of "0000" does not mean it is a valid "weight", it's a notation only (but the notation is used for TWO distinct purposes: one is for presenting the notation format used in the DU

Re: UCA unnecessary collation weight 0000

2018-11-02 Thread Richard Wordingham via Unicode
required to > create an equivalent collation order. These steps are completely > unnecessary and should be removed. > > On Fri, 2 Nov 2018 at 14:03, Mark Davis ☕️ > wrote: > > > You may not like the format of the data, but you are not bound to > > it. If you don't li

Re: UCA unnecessary collation weight 0000

2018-11-02 Thread Mark Davis ☕️ via Unicode
dard makes the presence of 0000 required in some steps, and the > requirement is in fact wrong: this is in fact NEVER required to create an > equivalent collation order. These steps are completely unnecessary and > should be removed. > > On Fri, 2 Nov 2018 at 14:03, Mark Davis ☕️ wrote

Re: UCA unnecessary collation weight 0000

2018-11-02 Thread Philippe Verdy via Unicode
It's not just a question of "I like it or not". But the fact that the standard makes the presence of 0000 required in some steps, and the requirement is in fact wrong: this is in fact NEVER required to create an equivalent collation order. These steps are completely unnecessar

Re: UCA unnecessary collation weight 0000

2018-11-02 Thread Mark Davis ☕️ via Unicode
mance “The Unicode Collation Algorithm is a logical specification. Implementations are free to change any part of the algorithm as long as any two strings compared by the implementation are ordered the same as they would be by the algorithm as specified. Implementations may also use a different forma

Re: UCA unnecessary collation weight 0000

2018-11-01 Thread Philippe Verdy via Unicode
As well the step 2 of the algorithm speaks about a single "array" of collation elements. Actually it's best to create one separate array per level, and append weights for each level in the relevant array for that level. The steps S2.2 to S2.4 can do this, including for derived col
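
The per-level arrays suggested in the snippet above can be sketched as follows. This is an illustration under stated assumptions, not the UCA reference implementation: collation elements are modelled as hypothetical `(primary, secondary, tertiary)` tuples, and the function names are invented for this example. Zero weights are skipped and a 0000 level separator joins the levels, following the usual description of sort-key formation (the thread itself disputes whether that separator is needed).

```python
from typing import List, Sequence, Tuple

# Hypothetical representation: one (primary, secondary, tertiary) tuple
# per collation element.
CollationElement = Tuple[int, int, int]
LEVEL_SEPARATOR = 0x0000


def split_by_level(elements: Sequence[CollationElement]) -> List[List[int]]:
    """Collect the non-zero weights of each level into its own array."""
    levels: List[List[int]] = [[], [], []]
    for element in elements:
        for level, weight in enumerate(element):
            if weight != 0:  # zero means "ignorable at this level"
                levels[level].append(weight)
    return levels


def form_sort_key(elements: Sequence[CollationElement]) -> List[int]:
    """Concatenate the per-level arrays, separated by LEVEL_SEPARATOR."""
    key: List[int] = []
    for index, weights in enumerate(split_by_level(elements)):
        if index > 0:
            key.append(LEVEL_SEPARATOR)
        key.extend(weights)
    return key


if __name__ == "__main__":
    # Toy input: a base letter followed by a primary-ignorable accent.
    sample = [(0x1C47, 0x0020, 0x0002), (0x0000, 0x0021, 0x0002)]
    print([f"{w:04X}" for w in form_sort_key(sample)])
```

Keeping one array per level means each weight is appended exactly once, and the final concatenation is a single pass over three lists.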

Re: UCA unnecessary collation weight 0000

2018-11-01 Thread Richard Wordingham via Unicode
nto > target (binary) strings: > > For a level-3 collation, you just then need only 3 calls to > "string:gsub()" to compute any collation: > > - the first ":gsub(mapNormalize)" can decompose a source text into > collation elements and can perform reordering t

Re: UCA unnecessary collation weight 0000

2018-11-01 Thread Richard Wordingham via Unicode
On Thu, 1 Nov 2018 21:13:46 +0100 Philippe Verdy via Unicode wrote: > I'm not speaking just about how collation keys will finally be stored > (as uint16 or bytes, or sequences of bits with variable length); I'm > just referring to the sequence of weights you generate. >

Re: UCA unnecessary collation weight 0000

2018-11-01 Thread Richard Wordingham via Unicode
On Thu, 1 Nov 2018 22:04:40 +0100 Philippe Verdy via Unicode wrote: > The DUCET could have as well used the notation ".none", or > just dropped every ".0000" in its file (provided it contains a data > entry specifying what is the minimum weight used for each level). > This notation is only intend

Re: UCA unnecessary collation weight 0000

2018-11-01 Thread Philippe Verdy via Unicode
So it should be clear in the UCA algorithm and in the DUCET datatable that "0000" is NOT a valid weight. It is just a notational placeholder used as ".0000", only indicating in the DUCET format that there's NO weight assigned at the indicated level, because the collation el

Re: UCA unnecessary collation weight 0000

2018-11-01 Thread Philippe Verdy via Unicode
cX" where "X" can be any character(s). Remove any reference to the "level separator" from the UCA. You never need it. As well this paragraph 7.3 Form Sort Keys <http://unicode.org/reports/tr10/#Step_3> *Step 3.* Construct a sort key for each collation element array

Re: UCA unnecessary collation weight 0000

2018-11-01 Thread Philippe Verdy via Unicode
't actually assemble a sort key. > > People who want sort keys usually want them to be short, so you spend time > on compression. You probably also build sort keys as byte vectors not > uint16 vectors (because byte vectors fit into more APIs and tend to be > shorter), like ICU do

Re: UCA unnecessary collation weight 0000

2018-11-01 Thread Philippe Verdy via Unicode
On Thu, 1 Nov 2018 at 21:31, Philippe Verdy wrote: > so you can use these two last functions to write the first one: > > bool isIgnorable(int level, string element) { > return getLevel(getWeightAt(element, 0)) > getMinWeight(level); > } > correction: return getWeightAt(element, 0)

Re: UCA unnecessary collation weight 0000

2018-11-01 Thread Philippe Verdy via Unicode
On Thu, 1 Nov 2018 at 21:08, Markus Scherer wrote: > When you want fast string comparison, the zero weights are useful for >> processing -- and you don't actually assemble a sort key. >> > And no, I see absolutely no case where any 0000 weight is useful during processing, it does not distinguish a

Re: UCA unnecessary collation weight 0000

2018-11-01 Thread Philippe Verdy via Unicode
I'm not speaking just about how collation keys will finally be stored (as uint16 or bytes, or sequences of bits with variable length); I'm just referring to the sequence of weights you generate. You absolutely NEVER need ANYWHERE in the UCA algorithm any 0000 weight, not even during proc

Re: UCA unnecessary collation weight 0000

2018-11-01 Thread Philippe Verdy via Unicode
secondary weights in the sort key are terminated by any sequence of 0020 (the minimal secondary weight), you can suppress them from the collation key. * when the tertiary weights are in the sort key are terminated by any sequence of 0002 (the minimal tertiary weight), you can suppress them from
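
The trailing-weight suppression described in the snippet above can be sketched as a purely mechanical operation on one level of a sort key. The minimal weights 0020 and 0002 come from the message; the function name and the list-of-integers representation are assumptions for illustration. The sketch does not show that the shortened keys still compare correctly, which depends on the rest of the key layout.

```python
from typing import List, Sequence

# Minimal weights quoted in the message above.
MIN_SECONDARY = 0x0020
MIN_TERTIARY = 0x0002


def strip_trailing(weights: Sequence[int], minimum: int) -> List[int]:
    """Drop the run of `minimum` weights at the end of one level."""
    end = len(weights)
    while end > 0 and weights[end - 1] == minimum:
        end -= 1
    return list(weights[:end])


if __name__ == "__main__":
    secondary = [0x0020, 0x0021, 0x0020, 0x0020]
    tertiary = [0x0002, 0x0002, 0x0002]
    print(strip_trailing(secondary, MIN_SECONDARY))
    print(strip_trailing(tertiary, MIN_TERTIARY))
```

Only the trailing run is removed; a minimal weight that precedes a larger weight stays in place because it still carries positional information.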

Re: UCA unnecessary collation weight 0000

2018-11-01 Thread Markus Scherer via Unicode
sort keys as byte vectors not uint16 vectors (because byte vectors fit into more APIs and tend to be shorter), like ICU does using the CLDR collation data file. The CLDR root collation data file remunges all weights into fractional byte sequences, and leaves gaps for tailoring. markus

UCA unnecessary collation weight 0000

2018-11-01 Thread Philippe Verdy via Unicode
I just remarked that there's absolutely NO utility of the collation weight 0000 anywhere in the algorithm. For example in UTR #10, section 3.3.1 gives a collation element: [.0000.0021.0002] for COMBINING GRAVE ACCENT. However it can also be simply: [.0021.0002] for a simple reason

Re: Minimal Implementation of Unicode Collation Algorithm

2017-12-04 Thread Richard Wordingham via Unicode
doing so. I am not aware of any general requirement that a CET be a tailoring of DUCET or of the CLDR root collation, so the implicit weights would be irrelevant in this case. The implicit weights are part of DUCET. If no characters are supported, performing NFD will be a rather obvious trivial tra

Re: Minimal Implementation of Unicode Collation Algorithm

2017-12-04 Thread Markus Scherer via Unicode
On Mon, Dec 4, 2017 at 5:30 AM, Richard Wordingham via Unicode < unicode@unicode.org> wrote: > May a collation algorithm that always compares all strings as equal be a > compliant implementation of the Unicode Collation Algorithm (UTS #10)? > If not, by which clause is it not compl

Minimal Implementation of Unicode Collation Algorithm

2017-12-04 Thread Richard Wordingham via Unicode
May a collation algorithm that always compares all strings as equal be a compliant implementation of the Unicode Collation Algorithm (UTS #10)? If not, by which clause is it not compliant? Formally, this algorithm would require that all weights be zero. Would an implementation that supported no

Wild Card Collation Matches

2014-06-01 Thread Richard Wordingham
lar Expressions' Version 13 (dated 2008, superseded in 2012), RL3.5 comes pretty close to this with ranges tailored for collation. The pattern [\u0E01-\u0E02]* would match both those words. To be precise, one would use a search for [ก-ไก]*. RL3.5 has been withdrawn because of difficulties

Codepoint Support for Phonetically-Aware Collation

2014-01-05 Thread Richard Wordingham
typically, visually opaque syllable boundaries are taken into account, e.g. in Lao and in some older Thai dictionaries (though the Thai examples I know of were compiled by Europeans). There are two approaches to these ambiguities for correct automated collation. One can either use a vocabulary-based

Re: FCD and Collation

2013-04-02 Thread Markus Scherer
Hi Richard, I was looking again at your example where U+0344 causes bad results in collation of FCD strings. See inline below. On Tue, Feb 12, 2013 at 12:19 PM, Richard Wordingham < richard.wording...@ntlworld.com> wrote: > On Mon, 11 Feb 2013 17:13:58 -0800 > Markus Scherer wr

Re: Size of Weights in Unicode Collation Algorithm

2013-03-16 Thread Philippe Verdy
ppresses codes that are prefixes of another. Exactly what I described (you use variable number of bits), except that your scheme is highly suboptimal, compared to a Huffman coding or the optimal arithmetic coding (for which you can generate statistics of frequencies (precomputed from some initia

Re: Size of Weights in Unicode Collation Algorithm

2013-03-16 Thread Richard Wordingham
rote: > >> > Please give an example of how the low/high split would fail. With > >> > the primary collation weights 20, 21, 21 80 and 22 I get the > >> > following primary collation weight sequences for one and two > >> > collating elements, marking bou

Re: Size of Weights in Unicode Collation Algorithm

2013-03-16 Thread Philippe Verdy
can be used there as well. For example frequent sequences of weights could also have a predictive encoding model, notably when creating collation keys for strengths 2 or higher, because there will be very frequent sequences of identical secondary weights. So instead of encoding the same secondary w

Re: Size of Weights in Unicode Collation Algorithm

2013-03-16 Thread Philippe Verdy
2013/3/16 Richard Wordingham : > On Sat, 16 Mar 2013 09:29:07 -0700 > Markus Scherer wrote: > >> On Sat, Mar 16, 2013 at 4:09 AM, Richard Wordingham < >> richard.wording...@ntlworld.com> wrote: >> >> > Please give an example of how the low/high split w

Re: Size of Weights in Unicode Collation Algorithm

2013-03-16 Thread Richard Wordingham
On Sat, 16 Mar 2013 09:29:07 -0700 Markus Scherer wrote: > On Sat, Mar 16, 2013 at 4:09 AM, Richard Wordingham < > richard.wording...@ntlworld.com> wrote: > > > Please give an example of how the low/high split would fail. With > > the primary collation weights 20,

Re: Size of Weights in Unicode Collation Algorithm

2013-03-16 Thread Markus Scherer
On Sat, Mar 16, 2013 at 4:09 AM, Richard Wordingham < richard.wording...@ntlworld.com> wrote: > Please give an example of how the low/high split would fail. With the > primary collation weights 20, 21, 21 80 and 22 I get the following > primary collation weight sequences for one an

Re: Size of Weights in Unicode Collation Algorithm

2013-03-16 Thread Richard Wordingham
lit in this post, for greatly improved clarity.) Please give an example of how the low/high split would fail. With the primary collation weights 20, 21, 21 80 and 22 I get the following primary collation weight sequences for one and two collating elements, marking boundaries of collating elements

Re: Size of Weights in Unicode Collation Algorithm

2013-03-15 Thread Markus Scherer
clearly treats 'large > weights' as being in multiple collation elements, whereas, in various > places, for transforming collation element tables properly, one needs > them to be treated as being in a single collation element. > Correct, that's where the complexities are that

Re: Size of Weights in Unicode Collation Algorithm

2013-03-15 Thread Richard Wordingham
> > The "fractional" refers to the same kind of mechanism as the "large > weight values" in the UCA spec. Yes. The problem is that formally the UCA clearly treats 'large weights' as being in multiple collation elements, whereas, in various places, for transfo

Re: Size of Weights in Unicode Collation Algorithm

2013-03-15 Thread Markus Scherer
On Fri, Mar 15, 2013 at 3:05 PM, Richard Wordingham < richard.wording...@ntlworld.com> wrote: > > In CLDR/ICU's FractionalUCA.txt, all but 40 or so of the primary > > weights (and many of the secondary weights) use the "large weights" > > mechanism. > > No, they're 32-bit weights expressed by omit

Re: Size of Weights in Unicode Collation Algorithm

2013-03-15 Thread Richard Wordingham
p for a character > or substring. In fact, that's really what ICU does, except the > current code is limited to one-or-two units (bytes). I would say that the UCA Section 6.2 stops me. It clearly says that the generic example '[(X+1)..], [..]' is two collation

Re: Size of Weights in Unicode Collation Algorithm

2013-03-15 Thread Markus Scherer
On Fri, Mar 15, 2013 at 12:50 PM, Richard Wordingham < richard.wording...@ntlworld.com> wrote: > On Thu, 14 Mar 2013 19:13:43 -0700 > Markus Scherer wrote: > > > On Thu, Mar 14, 2013 at 4:09 PM, Richard Wordingham < > > richard.wording...@ntlworld.com> wrote: > > > > > On Thu, 14 Mar 2013 14:49:1

Re: Size of Weights in Unicode Collation Algorithm

2013-03-15 Thread Richard Wordingham
s of other collating elements and an order preserving change of the irreducible substrings will preserve the order of the collating elements. This is a consequence of how humans (or just Unicode man?) generate primary weights, and does not apply to collation elements in general. This decompositi

Re: Size of Weights in Unicode Collation Algorithm

2013-03-14 Thread Markus Scherer
range -- it's basically an > > option for an "ignore punctuation" mode, and you wouldn't want to > > ignore nearly every assigned character in Unicode. > > There are a lot of characters in the SIP! Richard, we are talking about collation here, and "va

Re: Size of Weights in Unicode Collation Algorithm

2013-03-14 Thread Richard Wordingham
> and it would be expressed "u00u2FD5", not "u2FD5". No - though your being confused merits feedback. The example given specifies variableTop by means of a *string* - the 'string value' for the variable top. The equivalent basic syntax for variableTop = "uXXu"

Re: Size of Weights in Unicode Collation Algorithm

2013-03-14 Thread Richard Wordingham
On Thu, 14 Mar 2013 14:49:18 -0700 Markus Scherer wrote: > However, it does not make a lot of sense to set the variable top to > something above the currency symbols range -- it's basically an > option for an "ignore punctuation" mode, and you wouldn't want to > ignore nearly every assigned chara

Re: Size of Weights in Unicode Collation Algorithm

2013-03-14 Thread Markus Scherer
-byte primaries for the majority of characters now. See this doc from a few years ago: http://site.icu-project.org/design/collation/uca-weight-allocation Unfortunately, this makes setVariableTop() not work with most characters<http://bugs.icu-project.org/trac/ticket/8103>. I believe we have n

RE: Size of Weights in Unicode Collation Algorithm

2013-03-14 Thread Whistler, Ken
, and isn't a part of UCA per se at all. > Although I can't find a clear official definition of the semantics of > 'topVariable', "variableTop" is now defined in the LDML spec. See the proposed update for UTS #37. > I do remember being told that it simply us

Re: Size of Weights in Unicode Collation Algorithm

2013-03-14 Thread Richard Wordingham
'topVariable', I do remember being told that it simply uses the first positive primary in the collation key as the maximum variable weight. Now in allkeys.txt, U+2FD5 expands to two collation elements. However, in FractionalUCA.txt, which specifies 32-bit (fractional) weights, it has a sin

Re: Size of Weights in Unicode Collation Algorithm

2013-03-14 Thread Richard Wordingham
On Thu, 14 Mar 2013 00:19:15 +0000 "Whistler, Ken" wrote: > Richard Wordingham wrote: > > > > It loosened up the spec, so that the spec itself didn't seem to be > > > requiring that each of the first 3 levels had to be expressed > > > wi

RE: Size of Weights in Unicode Collation Algorithm

2013-03-13 Thread Whistler, Ken
Richard Wordingham wrote: > > It loosened up the spec, so that the spec itself didn't seem to be > > requiring that each of the first 3 levels had to be expressed with a > > full 16 bits in any collation element table. > > I don't read it that way. But it di

Re: Size of Weights in Unicode Collation Algorithm

2013-03-13 Thread Richard Wordingham
ghts. 'Large weights' make it difficult (I don't say impossible) to check UCETs for well-formedness. > It loosened up the spec, so that the spec itself didn't seem to be > requiring that each of the first 3 levels had to be expressed with a > full 16 bits in any collat

RE: Size of Weights in Unicode Collation Algorithm

2013-03-13 Thread Whistler, Ken
0), or is it intended to do > away with the inconvenient concept of 'large weights'? Amplifying somewhat on Markus' response to these questions... In UCA 6.1.0, the wording was: "...where a collation element is an ordered list of three or more 16-bit weights." In

Re: Size of Weights in Unicode Collation Algorithm

2013-03-13 Thread Markus Scherer
U logically stores weights as sequences of 1, 2, 3 or 4 bytes, with collation elements encoded in interesting ways so that most CEs fit into 32-bit integers. Previously, each of the four weights could be accommodated in 16, 16, > 16 and 24 bits. How many bits may be needed for a DUCET collation &

Size of Weights in Unicode Collation Algorithm

2013-03-13 Thread Richard Wordingham
inconvenient concept of 'large weights'? Previously, each of the four weights could be accommodated in 16, 16, 16 and 24 bits. How many bits may be needed for a DUCET collation element now? Are we threatened with having to accommodate 36 bit weights? If it is not intended to do away with

Re: FCD and Collation

2013-02-14 Thread Richard Wordingham
losure > S of T is the least set such that: > > 1) E(T) ⊂ S > 2) If xu ∈ S, vy ∈ T, u and v are characters, and vy is the last > collation element in xuvy, then x(E(uv) ∩ U ∩ F)E(y) ⊂ S. I got Condition 2 wrong. See http://bugs.icu-project.org/trac/ticket/9319 for the correction. Richard.

Re: FCD and Collation

2013-02-12 Thread Richard Wordingham
closure > S of T is the least set such that: > > 1) E(T) ⊂ S > 2) If xu ∈ S, vy ∈ T, u and v are characters, and vy is the last > collation element in xuvy, then x(E(uv) ∩ U ∩ F)E(y) ⊂ S. CORRECTION: 'Collating element', not 'collation element'. If the '

Re: FCD and Collation

2013-02-12 Thread Richard Wordingham
On Tue, 12 Feb 2013 01:17:45 +0000 "Whistler, Ken" wrote: > One of the reasons I resisted incorporation of > canonical enclosure in the basic UCA algorithm and in the DUCET table > is because of its infinitesimal ROI. It complicates the table and its > processing substantially, all in service of

Re: FCD and Collation

2013-02-12 Thread Richard Wordingham
On Mon, 11 Feb 2013 17:13:58 -0800 Markus Scherer wrote: > I would not revise FCD itself. For a number of processes, it is > sufficient as is. For collation it's not. > > About the Tibetan precomposed vowels: > > For the LDML spec, I submitted a CLDR ticket this mornin

RE: FCD and Collation

2013-02-11 Thread Whistler, Ken
> for the adequacy of the current canonical closure. If the collation > fails this adequacy test, then again disabling normalisation should be > prohibited. (I would suggest that in these cases the normalisation > setting should be overridden with only the gentlest of chidings.) FCD is

Re: FCD and Collation

2013-02-11 Thread Markus Scherer
I would not revise FCD itself. For a number of processes, it is sufficient as is. For collation it's not. About the Tibetan precomposed vowels: For the LDML spec, I submitted a CLDR ticket this morning: http://unicode.org/cldr/trac/ticket/5667 For UTS #10 section 6.5, I just now submitt

FCD and Collation

2013-02-11 Thread Richard Wordingham
here is an ICU bug report http://bugs.icu-project.org/trac/ticket/9319 Default collation I remarked that the UCA (Technical Report 10) and LDML (Technical Report 35) specifications, taken together, make sense only if there is no such problem. Before raising a specific Unicode bug, I think it woul

Re: Unicode 6.2.0 Beta Collation Tests

2012-07-08 Thread Mark Davis ☕
del bene —* ** On Sun, Jul 8, 2012 at 7:46 AM, Richard Wordingham < richard.wording...@ntlworld.com> wrote: > Are the collation tests meant to have been updated for the change in > the draft of Step 2.1 of the collation algorithm? I haven't changed > what I believe to be

Unicode 6.2.0 Beta Collation Tests

2012-07-08 Thread Richard Wordingham
Are the collation tests meant to have been updated for the change in the draft of Step 2.1 of the collation algorithm? I haven't changed what I believe to be a UCA 6.1.0-compliant implementation, yet my code now passes the 6.2.0 tests for both DUCET and CLDR root. (I understand that the err

Re: [cldr-dev] Re: Questions on Chinese collation, stroke

2012-06-25 Thread Stephan Stiller
Mark Davis ☕ wrote: There are no current plans to do that. If you want to present a case for adding additional collation sequences to CLDR, please start the process by filing a bug at http://unicode.org/cldr/trac/newticket Mark — Il meglio è l’inimico del bene —

Re: [cldr-dev] Re: Questions on Chinese collation, stroke

2012-06-25 Thread Matt Ma
> Entered ticket #4949 for Simplified Chinese, stroke order. >> >> Thanks, >> Matt >> >> On Fri, Jun 22, 2012 at 12:55 PM, Mark Davis ☕ wrote: >>> >>> There are no current plans to do that. If you want to present a case for >>> addin

Re: [cldr-dev] Re: Questions on Chinese collation, stroke

2012-06-22 Thread Stephan Stiller
ootwork already?) Stephan On 6/22/2012 5:05 PM, Matt Ma wrote: Entered ticket #4949 for Simplified Chinese, stroke order. Thanks, Matt On Fri, Jun 22, 2012 at 12:55 PM, Mark Davis ☕ wrote: There are no current plans to do that. If you want to present a case for adding additional collation sequences

Re: [cldr-dev] Re: Questions on Chinese collation, stroke

2012-06-22 Thread Matt Ma
Entered ticket #4949 for Simplified Chinese, stroke order. Thanks, Matt On Fri, Jun 22, 2012 at 12:55 PM, Mark Davis ☕ wrote: > There are no current plans to do that. If you want to present a case for > adding additional collation sequences to CLDR, please start the process by > filing

Re: [cldr-dev] Re: Questions on Chinese collation, stroke

2012-06-22 Thread Mark Davis ☕
There are no current plans to do that. If you want to present a case for adding additional collation sequences to CLDR, please start the process by filing a bug at http://unicode.org/cldr/trac/newticket -- Mark <https://plus.google.com/114199149796022210033> * *

Re: [cldr-dev] Re: Questions on Chinese collation, stroke

2012-06-22 Thread Matt Ma
ese, U+8303 has 9 strokes as Matt mentioned in the >>>> email. >>>> >>>> The radical "++" is counted as 4 strokes. I think there are several >>>> radicals have the same issue, different stroke counts, between simplified >>>> Chine

Re: [cldr-dev] Re: Questions on Chinese collation, stroke

2012-06-08 Thread Mark Davis ☕
> Claire. >> >> On Thu, Jun 7, 2012 at 5:54 PM, Mark Davis ☕ wrote: >> >>> On Thu, Jun 7, 2012 at 4:28 PM, Matt Ma wrote: >>> >>>> Hi, >>>> >>>> I have two questions regarding the collation sequence defined in >

Re: [cldr-dev] Re: Questions on Chinese collation, stroke

2012-06-08 Thread 賀靜蘭
issue, different stroke counts, between simplified > Chinese and traditional Chinese. > > Claire. > > On Thu, Jun 7, 2012 at 5:54 PM, Mark Davis ☕ wrote: > >> On Thu, Jun 7, 2012 at 4:28 PM, Matt Ma wrote: >> >>> Hi, >>> >>> I have two q

Re: [cldr-dev] Re: Questions on Chinese collation, stroke

2012-06-08 Thread 賀靜蘭
fied Chinese and traditional Chinese. Claire. On Thu, Jun 7, 2012 at 5:54 PM, Mark Davis ☕ wrote: > On Thu, Jun 7, 2012 at 4:28 PM, Matt Ma wrote: > >> Hi, >> >> I have two questions regarding the collation sequence defined in >> zh.xml, CLDR 21.0 >> >> 1

Re: Questions on Chinese collation, stroke

2012-06-07 Thread Mark Davis ☕
On Thu, Jun 7, 2012 at 4:28 PM, Matt Ma wrote: > Hi, > > I have two questions regarding the collation sequence defined in > zh.xml, CLDR 21.0 > > 1. Why is U+8303 (范) counted as 9 strokes instead of 8 for type="stroke">? As a reference, U+59DA (姚) is counted as

Re: Discontiguous Collation Grapheme Clusters

2012-05-27 Thread Philippe Verdy
UTS#18 is really a mess about collation clusters. But remember that collation elements are specific to each language for which they are defined (including the "root" locale which acts as a pseudo-language just working as a default option for all languages that don't have specific

Discontiguous Collation Grapheme Clusters

2012-05-27 Thread Richard Wordingham
I'm currently reviewing the definition of the Unicode Collation Algorithm (as opposed to just trying to comply with it), and I came across the concept of collation grapheme clusters, defined in UTS#18 'Unicode Regular Expressions'. For what types of strings are they supposed to b

Re: Compliant Tailoring of Normalisation for the Unicode Collation Algorithm

2012-05-21 Thread Richard Wordingham
On Sat, 19 May 2012 01:12:17 +0100 Richard Wordingham wrote: > This will then work for DUCET > 6.1.0, work for Danish, and work for my mischievous 0302 COMBINING > CIRCUMFLEX ACCENT+0067 LATIN SMALL LETTER G contraction. There is a very similar rule in CLDR for Lithuanian - 0307+0301 has CE(0301

Re: Compliant Tailoring of Normalisation for the Unicode Collation Algorithm

2012-05-20 Thread Markus Scherer
Hi Richard, This is essentially the same problem as http://bugs.icu-project.org/trac/ticket/9319 right? (Contractions overlapping with decomposition mappings.) Would you mind adding a reply to that with the Lithuanian issue? Thanks, markus

Re: Compliant Tailoring of Normalisation for the Unicode Collation Algorithm

2012-05-20 Thread Richard Wordingham
On Sun, 20 May 2012 17:05:00 +0100 Richard Wordingham wrote: > CORRECTION to correction I wrote "rules for +0307+" when, of course, I meant "rules for +0307+" Sorry about that. Richard.

Re: Compliant Tailoring of Normalisation for the Unicode Collation Algorithm

2012-05-20 Thread Richard Wordingham
On Sun, 20 May 2012 16:15:24 +0100 Richard Wordingham wrote: CORRECTION: > For the general case, we ought to be able to express a rule such as > 'ignore the countering of soft-dottedness', as in Lithuanian casing, > but I don't see any finite method of expressing it under the UCA, As we have dis

Re: Compliant Tailoring of Normalisation for the Unicode Collation Algorithm

2012-05-20 Thread Richard Wordingham
one starting with a combining accent or the non-initial part of an > Indic vowel. You may think the suggestions about hiders is excessive, but a real example of hiding occurs when subjecting the current Lithuanian collation in CLDR, which has a humanly unreadable contraction making 0307+0301 c

Re: Compliant Tailoring of Normalisation for the Unicode Collation Algorithm

2012-05-18 Thread Richard Wordingham
On Fri, 18 May 2012 09:51:34 -0700 Markus Scherer wrote: > On inspection, we think we can do better (and want to), probably by > adding overlap contractions. If we get into trouble with that, we > will think of alternatives. One is to decompose more characters even > in FCD input. Another is to k

Re: Compliant Tailoring of Normalisation for the Unicode Collation Algorithm

2012-05-18 Thread Mark Davis ☕
r particular input > > conditions (except NFD input). > > So long as you don't claim conformance to the CLDR collation > definitions. If you do, a lot depends on how one interprets the > definition of normalisation settings given in UTS#35 'Unicode Locale > Data Markup La

Re: Compliant Tailoring of Normalisation for the Unicode Collation Algorithm

2012-05-18 Thread Richard Wordingham
On Fri, 18 May 2012 09:51:34 -0700 Markus Scherer wrote: > There is nothing that requires us to get correct results *without > normalization* for all FCD strings or any other particular input > conditions (except NFD input). So long as you don't claim conformance to the

Re: Compliant Tailoring of Normalisation for the Unicode Collation Algorithm

2012-05-18 Thread Markus Scherer
collation to be fast, at least for most of normal input. One of the main performance optimizations is to skip the normalization step but still get the correct results for most input. We used to think and write that as long as input strings pass the FCD test, we will get the correct results. Except, at

Re: Compliant Tailoring of Normalisation for the Unicode Collation Algorithm

2012-05-18 Thread Richard Wordingham
On Thu, 17 May 2012 21:32:19 -0700 Markus Scherer wrote: > Ok, but assuming we didn't add 0FB2+0F71, why can't we add the > contraction 0FB2+0F81 and have the 0334 and any other non-starter be > handled via discontiguous matching? Time for me to make a pronouncement on colla

Re: Compliant Tailoring of Normalisation for the Unicode Collation Algorithm

2012-05-18 Thread Richard Wordingham
have the implementation-generated contractions for 0F71+0F73 and 0F71+0F73+0F72 (and the other pairs based on pairs of vowels from 0F72, 0F74 and 0F80), and 0F73 (and the other long vowels) are not blocked by 0F71, we're OK for UCA 6.1.0 at least as far back as UCA 4.1.0. (A collation ha

Re: Compliant Tailoring of Normalisation for the Unicode Collation Algorithm

2012-05-17 Thread Markus Scherer
0F72> skipping the two middle 0F71. That string is equivalent to the FCD-passing string <0F71, 0F71, 0F73> but there is no 0F72 in sight there to complete the match if we don't modify the string. If we cannot find a way to handle this with a finite (actually, small) amount of data,
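The FCD claim in the snippet above can be checked with a short sketch. This is not code from the thread: it uses only Python's standard `unicodedata` module, the helper name `passes_fcd` is hypothetical, and the test follows the usual description of FCD (no canonical reordering is needed across the boundary between the decompositions of adjacent characters).

```python
import unicodedata


def passes_fcd(text: str) -> bool:
    """Rough FCD test (hypothetical helper, not an ICU API).

    For each character, take its canonical decomposition. The string fails
    if the last combining class of the previous decomposition is greater
    than a non-zero first combining class of the next one.
    """
    prev_trail = 0
    for ch in text:
        decomposed = unicodedata.normalize("NFD", ch)
        lead = unicodedata.combining(decomposed[0])
        if lead != 0 and prev_trail > lead:
            return False
        prev_trail = unicodedata.combining(decomposed[-1])
    return True


def code_points(text: str) -> list[str]:
    """Format a string as a list of four-digit hex code points."""
    return [f"{ord(c):04X}" for c in text]


# <0F71, 0F71, 0F73>: the string the snippet describes as FCD-passing.
fcd_string = "\u0f71\u0f71\u0f73"
print(passes_fcd(fcd_string))
print(code_points(unicodedata.normalize("NFD", fcd_string)))

# <0F73, 0F71>: the trailing 0F72 inside 0F73 outranks the following 0F71.
print(passes_fcd("\u0f73\u0f71"))
```

Under these assumptions, the NFD form of <0F71, 0F71, 0F73> contains a 0F72 that is absent from the FCD-passing input, so a contraction ending in 0F72 cannot match unless the string is modified.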

Re: Compliant Tailoring of Normalisation for the Unicode Collation Algorithm

2012-05-17 Thread Richard Wordingham
On Thu, 17 May 2012 15:42:37 -0700 Markus Scherer wrote: > On Thu, May 17, 2012 at 3:00 PM, Richard Wordingham < > richard.wording...@ntlworld.com> wrote: >> HOWEVER, you must *not* have the added contraction for 0F71+0F71. > If we don't have this prefix contraction, then we will miss a > disco

Re: Compliant Tailoring of Normalisation for the Unicode Collation Algorithm

2012-05-17 Thread Markus Scherer
On Thu, May 17, 2012 at 3:00 PM, Richard Wordingham < richard.wording...@ntlworld.com> wrote: > If using DUCET, the collation elements for 0F71+0F71+0F72 are those for > <0F73, 0F71>, namely (at 6.1.0): > > [.2572.0020.0002.0F73][.2570.0020.0002.0F71]. > > The corr
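The equivalence relied on in the snippet above, that 0F71+0F71+0F72 and <0F73, 0F71> must receive the same collation elements, comes down to both strings having the same NFD form. A minimal sketch, assuming only Python's standard `unicodedata` module (the helper name `canonically_equivalent` is hypothetical and not taken from the thread):

```python
import unicodedata


def canonically_equivalent(a: str, b: str) -> bool:
    """Two strings are canonically equivalent when their NFD forms match."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


three_vowels = "\u0f71\u0f71\u0f72"  # 0F71 + 0F71 + 0F72
long_then_short = "\u0f73\u0f71"     # <0F73, 0F71>

# 0F73 decomposes to 0F71 0F72; canonical reordering then moves the
# trailing 0F71 (lower combining class) ahead of 0F72.
nfd = unicodedata.normalize("NFD", long_then_short)
print([f"{ord(c):04X}" for c in nfd])
print(canonically_equivalent(three_vowels, long_then_short))
```

This checks canonical equivalence only; it does not compute collation elements or consult DUCET.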

Re: Compliant Tailoring of Normalisation for the Unicode Collation Algorithm

2012-05-17 Thread Richard Wordingham
> give the pair of long vowels. We don't need to worry about > > because that is not FCD. > I am not following. > Given contractions > 0F71+0F71 (needed as a prefix of the next one) > 0F71+0F73 > what other contractions do we need to add to avoid which pro

Re: Mark-Driven Script Categorisation (was: Compliant Tailoring of Normalisation for the Unicode Collation Algorithm)

2012-05-17 Thread Philippe Verdy
2012/5/17 Richard Wordingham : > On Wed, 16 May 2012 21:46:17 -0700 > Mark Davis ☕ wrote: > >> No, it's not. >> >> Including x in Lao for some pedagogical (I'm guessing) purpose is >> completely out of scope. That'd be like including π in Latin because >> it sometimes occurs in the middle of Engli

Re: Compliant Tailoring of Normalisation for the Unicode Collation Algorithm

2012-05-17 Thread Markus Scherer
On Thu, May 17, 2012 at 1:02 PM, Richard Wordingham < richard.wording...@ntlworld.com> wrote: > As x = 0F71, we also need the > contractions of x+0F73 (or x+0F71+0F72) with 0F72, 0F74 and 0F80 to > give the pair of long vowels. We don't need to worry about > because that is not FCD. > I am not

Mark-Driven Script Categorisation (was: Compliant Tailoring of Normalisation for the Unicode Collation Algorithm)

2012-05-17 Thread Richard Wordingham
On Wed, 16 May 2012 21:46:17 -0700 Mark Davis ☕ wrote: > No, it's not. > > Including x in Lao for some pedagogical (I'm guessing) purpose is > completely out of scope. That'd be like including π in Latin because > it sometimes occurs in the middle of English text. No, it's more like including D

Re: Compliant Tailoring of Normalisation for the Unicode Collation Algorithm

2012-05-17 Thread Richard Wordingham
On Wed, 16 May 2012 16:03:08 -0700 Markus Scherer wrote: > The problem is a contraction x+0F72 and input text x+0F73 where the > inner 0F71 should be skipped. We can avoid this by adding a > contraction for x+0F73 (and one for the equivalent x+0F71+0F72). > > On the other hand, x+0F73 (together

Re: Fw: Re: Compliant Tailoring of Normalisation for the Unicode Collation Algorithm

2012-05-17 Thread Markus Scherer
*Please* use a different email subject line for the "x vs. Lao" discussion. markus On Thu, May 17, 2012 at 1:57 AM, wrote: > Well, I was speaking of the general case, not this specific example. > Orthographies which mix in random characters from other scripts do not, and > should not, drive the

Re: Fw: Re: Compliant Tailoring of Normalisation for the Unicode Collation Algorithm

2012-05-17 Thread vanisaac
From: Mark Davis ☕ > On Wed, May 16, 2012 at 9:20 PM, wrote: >> From: Ken Whistler >> > Orthographies which mix in random characters from other scripts do not >> > (or should not) drive the identity of characters for *scripts* per se. >> > And edge cases fo

Re: Fw: Re: Compliant Tailoring of Normalisation for the Unicode Collation Algorithm

2012-05-16 Thread Asmus Freytag
On 5/16/2012 9:46 PM, Mark Davis ☕ wrote: No, it's not. Including x in Lao for some pedagogical (I'm guessing) purpose is completely out of scope. That'd be like including π in Latin because it sometimes occurs in the middle of English text. --
