Re: A sign/abbreviation for "magister"

2018-11-01 Thread Janusz S. Bień via Unicode
On Thu, Nov 01 2018 at 13:34 -0700, Asmus Freytag via Unicode wrote:
> On 11/1/2018 10:23 AM, Janusz S. Bień via Unicode wrote:

[...]

> Looks like you completely missed my point. Nobody ever claimed that
> reproducing all variations in manuscripts is in scope of Unicode, so
> whom do you want to convince that it is not?
>
> Looks like you are missing my point about there being a continuum with
> not clear lines that can be perfectly drawn a-priori.

Why do you think so? There is nothing in my posts which can be used to
support your claim. Perhaps you confused me with some other poster?

Best regards

Janusz

-- 
Janusz S. Bien
emeryt (emeritus)
https://sites.google.com/view/jsbien



Re: A sign/abbreviation for "magister"

2018-11-01 Thread Asmus Freytag via Unicode

  
  
On 11/1/2018 7:59 PM, James Kass via Unicode wrote:

> Alphabetic script users write things the way they are spelled and
> spell things the way they are written.  The abbreviation in question
> as written consists of three recognizable symbols.  An "M", a
> superscript "r", and an equal sign (= two lines).  It can be printed,
> handwritten, or in fraktur; it will still consist of those same three
> recognizable symbols.
>
> We're supposed to be preserving the past, not editing it or revising
> it.

Alphabetic script users' handwriting does not match print in all
features. Traditional German handwriting used a line like a macron over
the letter 'u' to distinguish it from 'n'. Rendering this with a
u-macron in print would be the height of absurdity.

I feel similarly about the assertion that the "two lines" are something
that needs to be encoded, but only an expert would know for sure.

A./

  



Re: A sign/abbreviation for "magister"

2018-11-01 Thread James Kass via Unicode



Alphabetic script users write things the way they are spelled and spell 
things the way they are written.  The abbreviation in question as 
written consists of three recognizable symbols.  An "M", a superscript 
"r", and an equal sign (= two lines).  It can be printed, handwritten, 
or in fraktur; it will still consist of those same three recognizable 
symbols.


We're supposed to be preserving the past, not editing it or revising it.



Re: A sign/abbreviation for "magister"

2018-11-01 Thread James Kass via Unicode



Richard Wordingham responded to Janusz S. Bień,

>> ... Nobody ever claimed that reproducing all variations
>> in manuscripts is in scope of Unicode, so whom do you want
>> to convince that it is not?
>
> I think the counter-claim is that one will never be able
> to encode all the meaning-conveying distinctions of text
> in Unicode.

I think that the general agreement is that Unicode plain text isn't 
intended for preserving stylistic differences.  The dilemma is that 
opinions differ as to what constitutes a stylistic difference.


If there had been an "International Typewriter Usage Consortium" a 
hundred years ago which had issued an edict like "the underscore is 
placed on the keyboard for the explicit purpose of typing empty lines 
for 'fill-in-the-blank' forms, and must never be used by the typist to 
underline any other element of type", then that consortium would have 
been dictating how users perceive their own written symbols along with 
preventing users from establishing new conventions using existing 
symbols, experimenting, or innovating.


Some people consider that Unicode is essentially doing the same kind of 
thing.  It's *that* perception which needs to be addressed, perhaps with 
FAQs and education, or with some kind of revisiting and rethinking.  Or 
both.




Re: UCA unnecessary collation weight 0000

2018-11-01 Thread Philippe Verdy via Unicode
As well, step 2 of the algorithm speaks about a single "array" of
collation elements. Actually it's best to create one separate array per
level, and append the weights for each level to the relevant array for
that level.
The steps S2.2 to S2.4 can do this, including for derived collation
elements in section 10.1, or variable weighting in section 4.

This also means that, for fast string compares, the primary weights can
be processed on the fly (without needing any buffering) if the primary
weights differ between the two strings (including when one or both of
the two strings end, and the secondary or tertiary weights seen so far
have not included any weight higher than the minimum weight value for
their level).
Otherwise:
- the first secondary weight higher than the minimum secondary weight
value, and all subsequent secondary weights, must be buffered in a
secondary buffer.
- the first tertiary weight higher than the minimum tertiary weight
value, and all subsequent tertiary weights, must be buffered in a
tertiary buffer.
- and so on for higher levels (each buffer just needs to keep a counter,
set when it is first used, indicating how many weights were not buffered
while processing and counting the primary weights, because those weights
were all equal to the minimum value for the relevant level).
- these secondary/tertiary/etc. buffers will only be used once you reach
the end of the two strings when processing the primary level and no
difference was found: you start by comparing the initial counters in
these buffers, and the buffer with the larger counter value is
necessarily for the smaller compared string. If both counters are equal,
then you start comparing the weights stored in each buffer, until one of
the buffers ends before the other (the shorter buffer is for the smaller
compared string). If both weight buffers reach the end, you use the next
pair of buffers built for the next level and process them with the same
algorithm. (A sketch of this comparison follows below.)
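A minimal Python sketch of this level-by-level comparison, assuming the
per-level weight lists (primary first) have already been produced by the
usual UCA mapping step; the function name and data layout are
illustrative only, not from any specification:

    # Sketch only: compare two strings given their per-level weight
    # lists (primary first), never emitting any 0000 weights and never
    # building a single sort key. Producing the lists from text is the
    # usual UCA mapping step, elided here.
    def uca_compare(levels1, levels2):
        for w1, w2 in zip(levels1, levels2):
            if w1 != w2:
                return -1 if w1 < w2 else 1  # lexicographic on int lists
        return 0

    # "cab" vs "Cab" from the DUCET example: equal at levels 1 and 2,
    # decided at level 3.
    assert uca_compare(
        [[0x0706, 0x06D9, 0x06EE], [0x20, 0x20, 0x20], [0x02, 0x02, 0x02]],
        [[0x0706, 0x06D9, 0x06EE], [0x20, 0x20, 0x20], [0x08, 0x02, 0x02]],
    ) == -1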

Nowhere will you ever need to consider any [.0000] weight, which is just
a notation in the format of the DUCET intended only to be readable by
humans but never needed in any machine implementation.

Now if you want to create sort keys, this is similar, except that you
don't have two strings to process and compare; all you want is to create
separate arrays of weights for each level. Each level can be encoded
separately, with the encoding made so that, when you concatenate the
encoded arrays, the first few encoded *bits* in the secondary or
tertiary encodings cannot be larger than or equal to the bits used by
the encoding of the primary weights (this only limits how you encode the
first weight in each array, as its first encoding *bits* must be lower
than the first bits used to encode any weight in previous levels).

Nowhere are you required to encode weights exactly like their logical
weights; this encoding is fully reversible and can use any suitable
compression techniques if needed. As long as you can safely detect where
an encoding ends, because it encounters some bits (with lower values)
used to start the encoding of one of the higher levels, the compression
is safe.

For each level, you can reserve a single code used to "mark" the start
of another, higher level, followed by some bits to indicate which level
it is, then followed by the compressed code for that level, built so
that each weight is encoded by a code not starting with the reserved
mark. That encoding "mark" is not necessarily a 0000; it may be a nul
byte, or a '!' (if the encoding must be readable as ASCII or UTF-8, and
must not use any control, SPACE, or isolated surrogate), and the codes
used to encode each weight must not start with a byte lower than or
equal to this mark. The binary or ASCII code units used to encode each
weight must just be comparable, so that comparing codes is equivalent to
comparing the weights they represent.

As well, you are not required to store multiple "marks". This is just
one of the possibilities for encoding in the sort key which level comes
after each "mark", and the marks are not necessarily the same before
each level (their length may also vary depending on the level they
start): these marks may be completely removed from the final encoding if
the encoding/compression used allows discriminating the level of all
weights, encoded in separate sets of values.

Typical compression techniques are, for example, differential coding,
notably at the secondary or higher levels, and run-length coding to skip
sequences of weights that are all equal to the minimum weight.
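For instance, a run-length pass over one level's weight array could look
like the following Python sketch; the (count, weight) pair representation
and names are illustrative, and the final bit-level encoding is left
abstract:

    # Sketch only: run-length compress runs of the minimum weight in
    # one level's weight array. Returns (count, weight) pairs, where
    # runs of the minimum weight collapse into a single pair.
    def rle_minimum(weights, min_weight):
        out = []
        for w in weights:
            if out and w == min_weight and out[-1][1] == min_weight:
                out[-1] = (out[-1][0] + 1, min_weight)  # extend the run
            else:
                out.append((1, w))
        return out

    # Example: secondary weights 0020 0020 0020 0021 compress to
    # [(3, 0x20), (1, 0x21)].
    assert rle_minimum([0x20, 0x20, 0x20, 0x21], 0x20) == [(3, 0x20), (1, 0x21)]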

The code units used by the weight encoding for each level may also need
to avoid some forbidden values (e.g. when encoding the weights to UTF-8,
UTF-16, BOCU-1, or SCSU, you cannot use code units reserved for or
representing an isolated surrogate in U+D800..U+DFFF, as this would
create a string not conforming to any standard UTF).

Once again, this means that the sequence of logical weights will

Re: A sign/abbreviation for "magister"

2018-11-01 Thread Richard Wordingham via Unicode
On Thu, 01 Nov 2018 18:23:05 +0100
"Janusz S. Bień via Unicode"  wrote:

> On Thu, Nov 01 2018 at  8:43 -0700, Asmus Freytag via Unicode wrote:

> > I don't think it's a joke to recognize that there is a continuum
> > here and that there is no line that can be drawn which is based on
> > straightforward principles. This is a pattern that keeps surfacing
> > the deeper you look at character coding questions.  
> 
> Looks like you completely missed my point. Nobody ever claimed that
> reproducing all variations in manuscripts is in scope of Unicode, so
> whom do you want to convince that it is not?

I think the counter-claim is that one will never be able to encode all
the meaning-conveying distinctions of text in Unicode.

Richard.



Re: UCA unnecessary collation weight 0000

2018-11-01 Thread Richard Wordingham via Unicode
On Thu, 1 Nov 2018 18:39:16 +0100
Philippe Verdy via Unicode  wrote:

> What this means is that we can safely implement UCA using basic
> substitutions (e.g. with a function like "string:gsub(map)" in Lua
> which uses a "map" to map source (binary) strings or regexps, into
> target (binary) strings:
> 
> For a level-3 collation, you just then need only 3 calls to
> "string:gsub()" to compute any collation:
> 
> - the first ":gsub(mapNormalize)" can decompose a source text into
> collation elements and can perform reordering to enforce a normalized
> order (possibly tuned for the tailored locale) using basic regexps.

Are you sure of this?  Will you publish the algorithm?  Have you
passed the official conformance tests?  (Mind you, DUCET is a
relatively easy UCA collation to implement successfully.)

> - the second ":gsub(mapSecondary)" will substitute any collation
> elements by their "intermediary" collation elements+tertiary weight.
> 
> - the third ":gsub(mapSecondary)" will substitute any "intermediary"
> collation element by their primary weight + secondary weight

Richard.


Re: UCA unnecessary collation weight 0000

2018-11-01 Thread Richard Wordingham via Unicode
On Thu, 1 Nov 2018 21:13:46 +0100
Philippe Verdy via Unicode  wrote:

> I'm not speaking just about how collation keys will finally be stored
> (as uint16 or bytes, or sequences  of bits with variable length); I'm
> just refering to the sequence of weights you generate.


> You absolutely NEVER need ANYWHERE in the UCA algorithm any 0000
> weight, not even during processing, or in the DUCET table.

If you take the zero weights out, you have a different table structure
to store, e.g. the CLDR fractional weight tables.

Richard.


Re: UCA unnecessary collation weight 0000

2018-11-01 Thread Richard Wordingham via Unicode
On Thu, 1 Nov 2018 22:04:40 +0100
Philippe Verdy via Unicode  wrote:

> The DUCET could as well have used the notation ".none", or
> just dropped every ".0000" in its file (provided it contains a data
> entry specifying what is the minimum weight used for each level).
> This notation is only intended to be read by humans editing the file,
> so they don't need to wonder what is the level of the first indicated
> weight or remember what is the minimum weight for that level.
> But the DUCET table is actually generated by a machine and processed
> by machines.

A fair few humans have tailored it by hand.

Richard.


Re: UCA unnecessary collation weight 0000

2018-11-01 Thread Philippe Verdy via Unicode
So it should be clear in the UCA algorithm and in the DUCET data table
that "0000" is NOT a valid weight.
It is just a notational placeholder, used as ".0000", only indicating in
the DUCET format that there's NO weight assigned at the indicated level,
because the collation element is ALWAYS ignorable at this level.
The DUCET could as well have used the notation ".none", or just dropped
every ".0000" in its file (provided it contains a data entry specifying
what is the minimum weight used for each level). This notation is only
intended to be read by humans editing the file, so they don't need to
wonder what is the level of the first indicated weight or remember what
is the minimum weight for that level.
But the DUCET table is actually generated by a machine and processed by
machines.



On Thu, Nov 1, 2018 at 21:57, Philippe Verdy wrote:

> In summary, this step given in the algorithm is completely unneeded and
> can be dropped completely:
>
> *S3.2* If L is not 1, append a *level separator*
>
> *Note:* The level separator is zero (0000), which is guaranteed to be
> lower than any weight in the resulting sort key. This guarantees that
> when two strings of unequal length are compared, where the shorter
> string is a prefix of the longer string, the longer string is always
> sorted after the shorter—in the absence of special features like
> contractions. For example: "abc" < "abcX" where "X" can be any
> character(s).
>
> Remove any reference to the "level separator" from the UCA. You never need
> it.
>
> As well this paragraph
>
> 7.3 Form Sort Keys 
>
> *Step 3.* Construct a sort key for each collation element array by
> successively appending all non-zero weights from the collation element
> array. Figure 2 gives an example of the application of this step to one
> collation element array.
>
> Figure 2. Collation Element Array to Sort Key
>
> Collation Element Array: [.0706.0020.0002], [.06D9.0020.0002],
> [.0000.0021.0002], [.06EE.0020.0002]
> Sort Key: 0706 06D9 06EE 0000 0020 0020 0021 0020 0000 0002 0002 0002
> 0002
>
> can be written with this figure:
>
> Figure 2. Collation Element Array to Sort Key
>
> Collation Element Array: [.0706.0020.0002], [.06D9.0020.0002],
> [.0021.0002], [.06EE.0020.0002]
> Sort Key: 0706 06D9 06EE 0020 0020 0021 (0020) (0002 0002 0002 0002)
>
> The parentheses mark the collation weights 0020 and 0002 that can be
> safely removed if they are respectively the minimum secondary weight
> and the minimum tertiary weight.
> But note that 0020 is kept in two places, as those occurrences are
> followed by a higher weight 0021. This is general for any tailored
> collation (not just the DUCET).
>
> On Thu, Nov 1, 2018 at 21:42, Philippe Verdy wrote:
>
>> The 0000 is there in the UCA only because the DUCET is published in a
>> format that uses it, but here also this format is useless: you never
>> need any [.0000], or [.0000.0000] in the DUCET table as well. Instead
>> the DUCET just needs to indicate what is the minimum weight assigned
>> for every level (except the highest level where it is "implicitly"
>> 0001, and not 0000).
>>
>>
>> On Thu, Nov 1, 2018 at 21:08, Markus Scherer wrote:
>>
>>> There are lots of ways to implement the UCA.
>>>
>>> When you want fast string comparison, the zero weights are useful for
>>> processing -- and you don't actually assemble a sort key.
>>>
>>> People who want sort keys usually want them to be short, so you spend
>>> time on compression. You probably also build sort keys as byte vectors not
>>> uint16 vectors (because byte vectors fit into more APIs and tend to be
>>> shorter), like ICU does using the CLDR collation data file. The CLDR root
>>> collation data file remunges all weights into fractional byte sequences,
>>> and leaves gaps for tailoring.
>>>
>>> markus
>>>
>>


Re: UCA unnecessary collation weight 0000

2018-11-01 Thread Philippe Verdy via Unicode
In summary, this step given in the algorithm is completely unneeded and can
be dropped completely:

*S3.2* If L is not 1, append a *level separator*

*Note:* The level separator is zero (0000), which is guaranteed to be
lower than any weight in the resulting sort key. This guarantees that
when two strings of unequal length are compared, where the shorter
string is a prefix of the longer string, the longer string is always
sorted after the shorter—in the absence of special features like
contractions. For example: "abc" < "abcX" where "X" can be any
character(s).

Remove any reference to the "level separator" from the UCA. You never need
it.

As well this paragraph

7.3 Form Sort Keys 

*Step 3.* Construct a sort key for each collation element array by
successively appending all non-zero weights from the collation element
array. Figure 2 gives an example of the application of this step to one
collation element array.

Figure 2. Collation Element Array to Sort Key

Collation Element Array: [.0706.0020.0002], [.06D9.0020.0002],
[.0000.0021.0002], [.06EE.0020.0002]
Sort Key: 0706 06D9 06EE 0000 0020 0020 0021 0020 0000 0002 0002 0002
0002

can be written with this figure:

Figure 2. Collation Element Array to Sort Key

Collation Element Array: [.0706.0020.0002], [.06D9.0020.0002],
[.0021.0002], [.06EE.0020.0002]
Sort Key: 0706 06D9 06EE 0020 0020 0021 (0020) (0002 0002 0002 0002)

The parentheses mark the collation weights 0020 and 0002 that can be
safely removed if they are respectively the minimum secondary weight and
the minimum tertiary weight.
But note that 0020 is kept in two places, as those occurrences are
followed by a higher weight 0021. This is general for any tailored
collation (not just the DUCET).
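A small Python sketch of that trimming rule, assuming the per-level
weight arrays have already been built; the names are illustrative, not
from UTR #10:

    # Sketch only: drop the trailing run of minimum weights from one
    # level's weight array, as shown in parentheses in the figure above.
    def trim_trailing_minimum(weights, min_weight):
        end = len(weights)
        while end > 0 and weights[end - 1] == min_weight:
            end -= 1
        return weights[:end]

    # Secondary level above: 0020 0020 0021 0020 -> 0020 0020 0021.
    # The inner 0020s stay because the higher weight 0021 follows them.
    assert trim_trailing_minimum([0x20, 0x20, 0x21, 0x20], 0x20) == [0x20, 0x20, 0x21]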

On Thu, Nov 1, 2018 at 21:42, Philippe Verdy wrote:

> The 0000 is there in the UCA only because the DUCET is published in a
> format that uses it, but here also this format is useless: you never
> need any [.0000], or [.0000.0000] in the DUCET table as well. Instead
> the DUCET just needs to indicate what is the minimum weight assigned
> for every level (except the highest level where it is "implicitly"
> 0001, and not 0000).
>
>
> On Thu, Nov 1, 2018 at 21:08, Markus Scherer wrote:
>
>> There are lots of ways to implement the UCA.
>>
>> When you want fast string comparison, the zero weights are useful for
>> processing -- and you don't actually assemble a sort key.
>>
>> People who want sort keys usually want them to be short, so you spend
>> time on compression. You probably also build sort keys as byte vectors not
>> uint16 vectors (because byte vectors fit into more APIs and tend to be
>> shorter), like ICU does using the CLDR collation data file. The CLDR root
>> collation data file remunges all weights into fractional byte sequences,
>> and leaves gaps for tailoring.
>>
>> markus
>>
>


Re: A sign/abbreviation for "magister"

2018-11-01 Thread Marcel Schneider via Unicode

On 01/11/2018 01:21, Asmus Freytag via Unicode wrote:

On 10/31/2018 3:37 PM, Marcel Schneider via Unicode wrote:

On 31/10/2018 19:42, Asmus Freytag via Unicode wrote:

[…]

It is a fallacy that all text output on a computer should match the
convention of "fine typography".

Much that is written on computers represents an (unedited) first draft.
Giving such texts the appearance that, in the day of hot metal
typography, was reserved for texts that were fully edited and in many
cases intended for posterity is doing a disservice to the reader.


The disconnect is in many people believing the user should be disabled to 
write
[prevented from writing]


Thank you for correcting.


his or her language without disfiguring it by lack of decent keyboarding, and
that such input should be considered standard for user input. Making such text
usable for publishing needs extra work, that today many users cannot afford,
while the mass of publishing has increased exponentially over the past decades.
The result is garbage, following the rule of “garbage in, garbage out.”


No argument that there are some things that users cannot key in easily and that 
the common
fallbacks from the days of typewritten drafts are not really appropriate in 
many texts that
otherwise fall short of being "fine typography".


The goal I wanted to reach by discussing and invalidating the biased and
misused concept of “fine typography” is that this thread could get rid
of it, but I have clearly been unsuccessful.
It’s hard for you to understand that relegating abbreviation indicators
to the realm of “fine typography” reminds me of what I got to hear
(undisclosed for privacy) when asking that the French standard keyboard
layouts (plural) support punctuation spacing with NARROW NO-BREAK SPACE,
and that is closely related to the issue about social media that you
pointed to below.

Don’t worry about users not being able to “key in easily” what is needed for 
the digital
representation of their language, as long as:

1. Unicode has encoded what is needed;

2. Unicode does not prohibit the use of the needed characters.

The rest is up to keylayout designers. Keying in anything else is not an issue 
so far.




The real
disservice to the reader is not to enable the inputting user to write his or her
language correctly. A draft whose backbone is a string usable as-is for 
publishing
is not a disservice, but a service to the reader, paying the reader due respect.
Such a draft is also a service to the user, enabling him or her to streamline 
the
workflow. Such streamlining brings monetary and reputational benefit to the 
user.


I see a huge disconnect between "writing correctly" and "usable as-is for 
publishing". These
two things are not at all the same.

Publishing involves making many choices that simply aren't necessary for more "rough 
& ready"
types of texts. Not every twitter or e-mail message needs to be "usable as-is for 
publishing", but
should allow "correctly written" text as far as possible.


Not every message, especially not those whose readers expect a quick response.
The reverse is true with new messages (tweets, thread launchers,
requests, invitations).
As already discussed, there are several levels of correctness. We’re talking 
only about
the accurate digital representation of human languages, which includes correct 
punctuation.
E.g. in languages using letter apostrophe, hashtags made of a word including an 
apostrophe
are broken when ASCII or punctuation apostrophe (close quote) is used, as we’ve 
been told.

Supposedly part of this discussion would be streamlined if one could experience 
how easy
it can be to type in one’s language’s accurate digital representation. But  
it’s better
to be told what goes on, and what “strawmen” we’re confused with, since, again,
informed discussion brings advancement.



When "desktop publishing" as it was called then, became available, too many 
people started to
obsess with form over content. You would get these beautifully laid out 
documents, the contents
of which barely warranted calling them a first draft.


Typing in one’s language’s accurate digital representation is not being 
obsessed with form
over content, provided that appropriate keyboarding is available. E.g. the 
punctuation
apostrophe is on level 1 where the ASCII apostrophe is when digits are locked 
on level 1
on the French keyboard I’ve in use; else, digits are on level 3 where is also 
superscript e
for ready input of most of the ordinals (except 1ᵉʳ/1ʳᵉ, 2ⁿᵈ for ranges, and 
plural with ˢ):
2ᵉ 3ᵉ 4ᵉ 5ᵉ 6ᵉ 7ᵉ 8ᵉ 9ᵉ 10ᵉ 11ᵉ 12ᵉ. Hopefully that demo makes clear what is 
intended.
Users not needing accurate representation in a given string are free to
type otherwise.

The goal of this discussion is that Unicode allow accurate
representation, not impose it.
Actually Unicode is still imposing inaccurate representation on some
languages due to TUS prohibiting the use of precomposed superscript
letters in text
Re: UCA unnecessary collation weight 0000

2018-11-01 Thread Philippe Verdy via Unicode
The 0000 is there in the UCA only because the DUCET is published in a
format that uses it, but here also this format is useless: you never
need any [.0000], or [.0000.0000] in the DUCET table as well. Instead
the DUCET just needs to indicate what is the minimum weight assigned for
every level (except the highest level where it is "implicitly" 0001, and
not 0000).
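As a Python sketch, such a minima-declaring file could be handled like
this; the header syntax, names, and weights here are entirely
hypothetical, not the actual DUCET format:

    # Sketch only: a hypothetical DUCET-like format where ignorable
    # levels are simply omitted and a header declares each level's
    # minimum weight instead of writing 0000 placeholders.
    HEADER = "@minweights 0200 0020 0002"
    MIN_WEIGHTS = [int(w, 16) for w in HEADER.split()[1:]]

    def parse_weights(field):
        # "[.0021.0002]" -> [0x0021, 0x0002]: no 0000 field needed; the
        # level of the first weight is recovered from the value ranges.
        return [int(w, 16) for w in field.strip("[].").split(".")]

    print(MIN_WEIGHTS)                    # [512, 32, 2]
    print(parse_weights("[.0021.0002]"))  # [33, 2]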


On Thu, Nov 1, 2018 at 21:08, Markus Scherer wrote:

> There are lots of ways to implement the UCA.
>
> When you want fast string comparison, the zero weights are useful for
> processing -- and you don't actually assemble a sort key.
>
> People who want sort keys usually want them to be short, so you spend time
> on compression. You probably also build sort keys as byte vectors not
> uint16 vectors (because byte vectors fit into more APIs and tend to be
> shorter), like ICU does using the CLDR collation data file. The CLDR root
> collation data file remunges all weights into fractional byte sequences,
> and leaves gaps for tailoring.
>
> markus
>


Re: UCA unnecessary collation weight 0000

2018-11-01 Thread Philippe Verdy via Unicode
On Thu, Nov 1, 2018 at 21:31, Philippe Verdy wrote:

> so you can use these two last functions to write the first one:
>
>   bool isIgnorable(int level, string element) {
> return getLevel(getWeightAt(element, 0)) > getMinWeight(level);
>   }
>
correction:
return getWeightAt(element, 0) > getMinWeight(level);


Re: UCA unnecessary collation weight 0000

2018-11-01 Thread Philippe Verdy via Unicode
On Thu, Nov 1, 2018 at 21:08, Markus Scherer wrote:

> When you want fast string comparison, the zero weights are useful for
>> processing -- and you don't actually assemble a sort key.
>>
>
And no, I see absolutely no case where any 0000 weight is useful during
processing; it does not distinguish any case, even for "fast" string
comparison.

Even if you don't build any sort key, maybe you'll want to return 0000
if you query the weight for a specific collatable element, but this
would be the same as querying whether the collatable element is
ignorable or not for a given specific level; this query just returns a
false or true boolean, like this method of a Collator object:

  bool isIgnorable(int level, string collatableElement)

and you can also make this reliable for any collator:

  int getLevel(int weight);
  int getMinWeight(int level);
  int getWeightAt(string element, int level, int position);

so you can use the last two functions to write the first one:

  bool isIgnorable(int level, string element) {
return getLevel(getWeightAt(element, 0)) > getMinWeight(level);
  }

That's enough to write the fast comparison...

What I said is not a complicated "compression"; this is done on the fly,
without any complex transform. All that counts is that any primary
weight value is higher than any secondary weight, and any secondary
weight is higher than any tertiary weight.
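A runnable Python sketch of those helpers under that invariant; the
level boundary values and the example weights below are invented for
illustration, and an element is represented as its list of weights:

    # Sketch only: classify weights by level using disjoint, ordered
    # value ranges (tertiary < secondary < primary). Ranges invented.
    LEVEL_RANGES = {3: (0x0002, 0x000F), 2: (0x0010, 0x001F), 1: (0x0020, 0xFFFF)}

    def get_level(weight):
        for level, (lo, hi) in LEVEL_RANGES.items():
            if lo <= weight <= hi:
                return level
        raise ValueError("weight out of range")

    def get_min_weight(level):
        return LEVEL_RANGES[level][0]

    def is_ignorable(level, element_weights):
        # Ignorable at 'level' if the element's first weight belongs to
        # a lower-priority (higher-numbered) level than 'level'.
        return get_level(element_weights[0]) > level

    # An accent-like element carrying only secondary + tertiary weights
    # (0x0011 and 0x0002 under these ranges) is primary-ignorable:
    assert is_ignorable(1, [0x0011, 0x0002])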


Re: A sign/abbreviation for "magister"

2018-11-01 Thread Asmus Freytag via Unicode

  
  
On 11/1/2018 10:23 AM, Janusz S. Bień via Unicode wrote:

> On Thu, Nov 01 2018 at 8:43 -0700, Asmus Freytag via Unicode wrote:
>
>> On 11/1/2018 12:33 AM, Janusz S. Bień via Unicode wrote:
>>
>>> On Wed, Oct 31 2018 at 12:14 -0700, Ken Whistler via Unicode wrote:
>>>
>>>> On 10/31/2018 11:27 AM, Asmus Freytag via Unicode wrote:
>>>>
>>>>> but we don't have an agreement that reproducing all variations in
>>>>> manuscripts is in scope.
>>>>
>>>> In fact, I would say that in the UTC, at least, we have an agreement
>>>> that that clearly is out of scope!
>>>>
>>>> Trying to represent all aspects of text in manuscripts, including
>>>> handwriting conventions, as plain text is hopeless.  There is no
>>>> principled line to draw there before you get into arbitrary
>>>> calligraphic conventions.
>>>
>>> Your statements are perfect examples of "attacking a straw man":
>>>
>>> Perhaps you are joking?
>>
>> Not sure which of us you were suggesting as the jokester here.
>>
>> I don't think it's a joke to recognize that there is a continuum here
>> and that there is no line that can be drawn which is based on
>> straightforward principles. This is a pattern that keeps surfacing the
>> deeper you look at character coding questions.
>
> Looks like you completely missed my point. Nobody ever claimed that
> reproducing all variations in manuscripts is in scope of Unicode, so
> whom do you want to convince that it is not?

Looks like you are missing my point about there being a continuum with
no clear lines that can be perfectly drawn a priori. "Reproducing all
variations in manuscripts" is only one possible end point of this
continuum, and therefore less interesting than the overall pattern.

A./

  



Re: UCA unnecessary collation weight 0000

2018-11-01 Thread Philippe Verdy via Unicode
I'm not speaking just about how collation keys will finally be stored
(as uint16 or bytes, or sequences of bits with variable length); I'm
just referring to the sequence of weights you generate.
You absolutely NEVER need ANYWHERE in the UCA algorithm any 0000 weight,
not even during processing, or in the DUCET table.

On Thu, Nov 1, 2018 at 21:08, Markus Scherer wrote:

> There are lots of ways to implement the UCA.
>
> When you want fast string comparison, the zero weights are useful for
> processing -- and you don't actually assemble a sort key.
>
> People who want sort keys usually want them to be short, so you spend time
> on compression. You probably also build sort keys as byte vectors not
> uint16 vectors (because byte vectors fit into more APIs and tend to be
> shorter), like ICU does using the CLDR collation data file. The CLDR root
> collation data file remunges all weights into fractional byte sequences,
> and leaves gaps for tailoring.
>
> markus
>


Re: UCA unnecessary collation weight 0000

2018-11-01 Thread Philippe Verdy via Unicode
For example, Figure 3 in UTR #10 contains:

Figure 3. Comparison of Sort Keys

  String  Sort Key
1 cab     0706 06D9 06EE 0000 0020 0020 0020 0000 0002 0002 0002
2 Cab     0706 06D9 06EE 0000 0020 0020 0020 0000 0008 0002 0002
3 cáb     0706 06D9 06EE 0000 0020 0020 0021 0020 0000 0002 0002 0002 0002
4 dab     0712 06D9 06EE 0000 0020 0020 0020 0000 0002 0002 0002


The  weights are never needed, even if any of the source strings
("cab", "Cab", "cáb", "dab") is followed by ANY other string, or if any
other string (higher than "b") replaces their final "b".
What is really important is to understand where the input text (after
initial transforms like reodering and expansion) is broken at specific
boundaries between collatable elements.
But the boundaries of weights indicated each part of the sort key can
always be infered for example between 06EE and 0020, or between 0020 and
0002.
So this can obviously be changed to just:

Figure 3. Comparison of Sort Keys

  String  Sort Key
1 cab     0706 06D9 06EE 0020 0020 0020 0002 0002 0002
2 Cab     0706 06D9 06EE 0020 0020 0020 0008 0002 0002
3 cáb     0706 06D9 06EE 0020 0020 0021 0020 0002 0002 0002 0002
4 dab     0712 06D9 06EE 0020 0020 0020 0002 0002 0002
As well:
* when the secondary weights in the sort key are terminated by any
sequence of 0020 (the minimal secondary weight), you can suppress them
from the collation key.
* when the tertiary weights in the sort key are terminated by any
sequence of 0002 (the minimal tertiary weight), you can suppress them
from the collation key.
This gives:

Figure 3. Comparison of Sort Keys

  String  Sort Key
1 cab     0706 06D9 06EE
2 Cab     0706 06D9 06EE 0008
3 cáb     0706 06D9 06EE 0020 0020 0021
4 dab     0712 06D9 06EE

See the reduction!
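A quick Python sketch checking that the trimmed keys above induce the
same order as the separator-free but untrimmed keys from the previous
figure (lexicographic list comparison stands in for byte-wise sort key
comparison; the data is just the two figures):

    # Sketch only: untrimmed (no separators) vs trimmed sort keys.
    full = {
        "cab": [0x0706, 0x06D9, 0x06EE, 0x20, 0x20, 0x20, 0x02, 0x02, 0x02],
        "Cab": [0x0706, 0x06D9, 0x06EE, 0x20, 0x20, 0x20, 0x08, 0x02, 0x02],
        "cáb": [0x0706, 0x06D9, 0x06EE, 0x20, 0x20, 0x21, 0x20, 0x02, 0x02, 0x02, 0x02],
        "dab": [0x0712, 0x06D9, 0x06EE, 0x20, 0x20, 0x20, 0x02, 0x02, 0x02],
    }
    trimmed = {
        "cab": [0x0706, 0x06D9, 0x06EE],
        "Cab": [0x0706, 0x06D9, 0x06EE, 0x08],
        "cáb": [0x0706, 0x06D9, 0x06EE, 0x20, 0x20, 0x21],
        "dab": [0x0712, 0x06D9, 0x06EE],
    }
    # Both key sets induce the same order: cab < Cab < cáb < dab.
    assert sorted(full, key=full.get) == sorted(trimmed, key=trimmed.get)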

On Thu, Nov 1, 2018 at 18:39, Philippe Verdy wrote:

> I just remarked that there's absolutely NO utility of the collation
> weight 0000 anywhere in the algorithm.
>
> For example in UTR #10, section 3.3.1 gives a collation element:
>   [.0000.0021.0002]
> for COMBINING GRAVE ACCENT. However it can also be simply:
>   [.0021.0002]
> for a simple reason: the secondary or tertiary weights are necessarily
> LOWER than any primary weight (for conformance reasons):
>  any tertiary weight < any secondary weight < any primary weight
> (the set of all weights for all levels is fully partitioned into
> disjoint intervals in the same order, each interval containing all its
> weights, so weights are sorted by decreasing level, then increasing
> weight in all cases)
>
> This also means that we never need to handle 0000 weights when creating
> sort keys from multiple collation elements, as we can easily detect
> that [.0021.0002] given above starts with a secondary weight 0021 and
> is not a primary weight.
>
> As well we don't need to use any level separator 0000 in the sort key.
>
> This allows more interesting optimizations, and reduction of length for
> sort keys.
> What this means is that we can safely implement UCA using basic
> substitutions (e.g. with a function like "string:gsub(map)" in Lua
> which uses a "map" to map source (binary) strings or regexps, into
> target (binary) strings:
>
> For a level-3 collation, you just then need only 3 calls to
> "string:gsub()" to compute any collation:
>
> - the first ":gsub(mapNormalize)" can decompose a source text into
> collation elements and can perform reordering to enforce a normalized order
> (possibly tuned for the tailored locale) using basic regexps.
>
> - the second ":gsub(mapSecondary)" will substitute any collation
> elements by their "intermediary" collation elements+tertiary weight.
>
> - the third ":gsub(mapSecondary)" will substitute any "intermediary"
> collation element by their primary weight + secondary weight
>
> The "intermediary" collation elements are just like source text, except
> that higher level differences are eliminated, i.e. all source collation
> element strings are replaced by the collation element string that has
> the smallest collation element weights. They must just be encoded so
> that they are HIGHER than any higher level weights.
>
> How to do that:
> - reserve the weight range between .0000 (yes! not just .0001) and
> .001E for the last (tertiary) weight, and make sure that all other
> intermediary collation elements will use only code units higher than
> .0020 (this means that they can remain encoded in their existing UTF
> form!)
> - reserve the weight .001F for the case where you don't want to use
> secondary differences (like letter case) and move them to tertiary
> differences.
>
> This will be used in the second mapping to decompose source collation
> elements into "intermediary collation elements" + tertiary weight. You
> may then decide to leave tertiary weights

Re: UCA unnecessary collation weight 0000

2018-11-01 Thread Markus Scherer via Unicode
There are lots of ways to implement the UCA.

When you want fast string comparison, the zero weights are useful for
processing -- and you don't actually assemble a sort key.

People who want sort keys usually want them to be short, so you spend time
on compression. You probably also build sort keys as byte vectors not
uint16 vectors (because byte vectors fit into more APIs and tend to be
shorter), like ICU does using the CLDR collation data file. The CLDR root
collation data file remunges all weights into fractional byte sequences,
and leaves gaps for tailoring.

markus


UCA unnecessary collation weight 0000

2018-11-01 Thread Philippe Verdy via Unicode
I just remarked that there's absolutely NO utility of the collation
weight 0000 anywhere in the algorithm.

For example in UTR #10, section 3.3.1 gives a collation element:
  [.0000.0021.0002]
for COMBINING GRAVE ACCENT. However it can also be simply:
  [.0021.0002]
for a simple reason: the secondary or tertiary weights are necessarily
LOWER than any primary weight (for conformance reasons):
 any tertiary weight < any secondary weight < any primary weight
(the set of all weights for all levels is fully partitioned into
disjoint intervals in the same order, each interval containing all its
weights, so weights are sorted by decreasing level, then increasing
weight in all cases)

This also means that we never need to handle 0000 weights when creating
sort keys from multiple collation elements, as we can easily detect that
[.0021.0002] given above starts with a secondary weight 0021 and is not
a primary weight.

As well we don't need to use any level separator 0000 in the sort key.

This allows more interesting optimizations, and reduction of length for
sort keys.
What this means is that we can safely implement UCA using basic
substitutions (e.g. with a function like "string:gsub(map)" in Lua which
uses a "map" to map source (binary) strings or regexps, into target
(binary) strings:

For a level-3 collation, you just then need only 3 calls to "string:gsub()"
to compute any collation:

- the first ":gsub(mapNormalize)" can decompose a source text into
collation elements and can perform reordering to enforce a normalized order
(possibly tuned for the tailored locale) using basic regexps.

- the second ":gsub(mapSecondary)" will substitute any collation
elements by their "intermediary" collation elements+tertiary weight.

- the third ":gsub(mapSecondary)" will substitute any "intermediary"
collation element by their primary weight + secondary weight

The "intermediary" collation elements are just like source text, except
that higher level differences are eliminated, i.e. all source collation
element strings are replaced by the collation element string that has
the smallest collation element weights. They must just be encoded so
that they are HIGHER than any higher level weights. (A toy sketch of
this pipeline follows below.)
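Here is a toy Python illustration of this substitution pipeline, in the
spirit of the Lua gsub calls; the two-letter alphabet, the weights, and
the two-level simplification are all invented for the example:

    # Toy sketch of substitution-based collation with two levels.
    # Pass 1 strips the secondary level into a buffer and outputs an
    # "intermediary" string; pass 2 maps it to primary weights.
    SECONDARY = {"á": ("a", 0x21), "a": ("a", 0x20), "b": ("b", 0x20)}
    PRIMARY = {"a": 0x1000, "b": 0x1001}

    def sort_key(text):
        intermediary, secondaries = [], []
        for ch in text:                  # pass 1 (like :gsub(mapSecondary))
            base, w2 = SECONDARY[ch]
            intermediary.append(base)
            secondaries.append(w2)
        primaries = [PRIMARY[ch] for ch in intermediary]  # pass 2
        # Concatenate with no 0000 separator: secondary weights are all
        # lower than any primary weight, so levels stay distinguishable.
        return primaries + secondaries

    assert sort_key("ab") < sort_key("áb")  # secondary difference
    assert sort_key("a") < sort_key("ab")   # prefix sorts first, no separator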

How to do that:
- reserve the weight range between .0000 (yes! not just .0001) and .001E
for the last (tertiary) weight, and make sure that all other intermediary
collation elements will use only code units higher than .0020 (this means
that they can remain encoded in their existing UTF form!)
- reserve the weight .001F for the case where you don't want to use
secondary differences (like letter case) and move them to tertiary
differences.

This will be used in the second mapping to decompose source collation
elements into "intermediary collation elements" + tertiary weight. You
may then decide to leave tertiary weights in the substitute string, or,
because the "gsub()" finds matches from left to right, to accumulate the
tertiary weights into a separate buffer, so that the substitution itself
will still return a valid UTF string, containing only "intermediary
collation elements" (with all tertiary differences erased).

You can repeat the process with the next gsub() to return the primary
collation elements (still in UTF form), and separately the secondary
weights (also accumulable in a separate buffer).

Now there remain only 3 strings:
- one contains only the primary collation elements (still in UTF form,
but using code units always higher than or equal to 0020)
- another one contains only secondary weights (between MINSECONDARYWEIGHT
and 001F)
- another one contains only tertiary weights (between 0000 and
MINSECONDARYWEIGHT-1)

For the rest I will assume that MINSECONDARYWEIGHT is 0010, so
* primary weights are encoded with one or more code units in [0020..]
(multiple code units are possible if you reserve some of these code
units to be prefixes of longer sequences)
* secondary weights are encoded with one or more code units in
[0010..001E] (same remark about multiple code units if you need them)
* tertiary weights are encoded with one or more code units in
[0010..001F] (same remark about multiple code units if you need them)

The last gsub() will only reorder the primary collation elements to
remap them in a suitable binary order (it will be a simple bijective
permutation, except that the target does not have to use multiple code
units, but a single one, when there are contractions). It's always
possible to make this permutation generate integers higher than 0020.
The resulting weights can remain encodable with UTF-8 as if they were
source text.

And to return the sort key, all you need is to concatenate (a sketch
follows below):
* the string containing all primary weights encoded with code units in
[0020..], then
* the string containing secondary weights encoded with code units in
[0010..001E], then
* the string containing tertiary weights encoded with code units in
[..001F].
* you don't need to insert ANY [0000] as a level separator in the final
sort key,
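A minimal Python sketch of that final concatenation, using the code-unit
ranges assumed in this message; the encoding of multi-unit weights is
elided and the example weights are invented:

    # Sketch only: concatenate per-level weight strings into one sort
    # key with no 0000 level separator. Levels remain distinguishable
    # because their code-unit ranges are disjoint and ordered:
    # primary >= 0x0020, secondary in 0x0010..0x001E, tertiary below.
    def make_sort_key(primary, secondary, tertiary):
        return primary + secondary + tertiary  # lists of code units

    # Where one key's primary part ends and its secondary part begins,
    # the code units drop below 0x0020, so comparing two keys
    # lexicographically still decides the primary level first.
    k1 = make_sort_key([0x1000], [0x10], [0x02])                      # "a"
    k2 = make_sort_key([0x1000, 0x1001], [0x10, 0x10], [0x02, 0x02])  # "ab"
    assert k1 < k2  # the prefix string sorts first, no separator needed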

Re: A sign/abbreviation for "magister"

2018-11-01 Thread Janusz S. Bień via Unicode
On Thu, Nov 01 2018 at  8:43 -0700, Asmus Freytag via Unicode wrote:
> On 11/1/2018 12:33 AM, Janusz S. Bień via Unicode wrote:
>
>  On Wed, Oct 31 2018 at 12:14 -0700, Ken Whistler via Unicode wrote:
>
>  On 10/31/2018 11:27 AM, Asmus Freytag via Unicode wrote:
>
>  
>  but we don't have an agreement that reproducing all variations in
>  manuscripts is in scope.
>
>
> In fact, I would say that in the UTC, at least, we have an agreement
> that that clearly is out of scope!
>
> Trying to represent all aspects of text in manuscripts, including
> handwriting conventions, as plain text is hopeless.  There is no
> principled line to draw there before you get into arbitrary
> calligraphic conventions.
>
>
> Your statements are perfect examples of "attacking a straw man":
>
>
> Perhaps you are joking?
>
> Not sure which of us you were suggesting as the jokester here.
>
> I don't think it's a joke to recognize that there is a continuum here
> and that there is no line that can be drawn which is based on
> straightforward principles. This is a pattern that keeps surfacing the
> deeper you look at character coding questions.

Looks like you completely missed my point. Nobody ever claimed that
reproducing all variations in manuscripts is in scope of Unicode, so
whom do you want to convince that it is not?

Best regards

Janusz

-- 
Janusz S. Bien
emeryt (emeritus)
https://sites.google.com/view/jsbien



Re: A sign/abbreviation for "magister"

2018-11-01 Thread Asmus Freytag via Unicode

  
  
On 11/1/2018 12:33 AM, Janusz S. Bień via Unicode wrote:

> On Wed, Oct 31 2018 at 12:14 -0700, Ken Whistler via Unicode wrote:
>
>> On 10/31/2018 11:27 AM, Asmus Freytag via Unicode wrote:
>>
>>> but we don't have an agreement that reproducing all variations in
>>> manuscripts is in scope.
>>
>> In fact, I would say that in the UTC, at least, we have an agreement
>> that that clearly is out of scope!
>>
>> Trying to represent all aspects of text in manuscripts, including
>> handwriting conventions, as plain text is hopeless.  There is no
>> principled line to draw there before you get into arbitrary
>> calligraphic conventions.
>
> Your statements are perfect examples of "attacking a straw man":
>
> Perhaps you are joking?

Not sure which of us you were suggesting as the jokester here.

I don't think it's a joke to recognize that there is a continuum here
and that there is no line that can be drawn which is based on
straightforward principles. This is a pattern that keeps surfacing the
deeper you look at character coding questions.

Well, there used to be something of a joke, and it went like this: when
the first volume containing the Unicode Standard was printed, someone
noted that it contained not one statement of principles, but three. And
each of them listed a different number of them.

Common to all of them was the pattern that it was impossible to satisfy
all of the principles simultaneously. They were always in tension with
each other, meaning that it was necessary to weigh on a case-by-case
basis which ones to prioritize.

This is no accident: it simply reflects the nature of the beast. Some
encoding decisions are blindingly obvious, but beyond them, things
rather quickly become a matter of judgment, and if you go much further
you eventually reach a point where such judgments become little more
than a stab in the dark. Some may not even make useful precedents.
That's a good place to stop, because beyond that you get into extreme
territory.

Sometimes, it may be useful for plain text to be extended so as to
facilitate a 90% solution to something, math for example. That kind of
thing requires systematic analysis; looking at a single example in
isolation is not enough. It also requires buy-in from an established
user community, in the case of math that included mathematical
societies and scientific publishers. That's the kind of thing needed to
help understand how to make encoding decisions in borderline cases, and
to ensure that the dividing line, while essentially still arbitrary,
sits comfortably on the good side, because everyone agrees on which
remaining 10% are to be out of scope.

In this case, there is no such framework that could help establish
pragmatic boundaries dividing the truly useful from the merely fanciful.

A./

  



Re: A sign/abbreviation for "magister"

2018-11-01 Thread Asmus Freytag via Unicode

  
  
On 11/1/2018 12:52 AM, Richard Wordingham via Unicode wrote:

> On Wed, 31 Oct 2018 11:35:19 -0700
> Asmus Freytag via Unicode  wrote:
>
>> On the other hand, I'm a firm believer in applying certain styling
>> attributes to things like e-mail or discussion papers. Well-placed
>> emphasis can make such texts more readable (without requiring that
>> they pay attention to all other facets of "fine typography".)
>
> Unfortunately, your emails are extremely hard to read in plain text.
> It is even difficult to tell who wrote what.

Not sure why that is. After they make the round trip, they look fine to
me.

A./

  



Re: A sign/abbreviation for "magister"

2018-11-01 Thread Richard Wordingham via Unicode
On Wed, 31 Oct 2018 11:35:19 -0700
Asmus Freytag via Unicode  wrote:

> On the other hand, I'm a firm believer in applying certain styling
> attributes to things like e-mail or discussion papers. Well-placed
> emphasis can make such texts more readable (without requiring that
> they pay attention to all other facets of "fine typography".)

Unfortunately, your emails are extremely hard to read in plain text.
It is even difficult to tell who wrote what.

Richard.


Re: use vs mention (was: second attempt)

2018-11-01 Thread Richard Wordingham via Unicode
On Wed, 31 Oct 2018 23:35:06 +0100
Piotr Karocki via Unicode  wrote:

> These are only examples of changes in meaning with subscript or
> superscript; not all of these examples can really exist - but, then,
> another question: can we know what the author means? And as carbon and
> iodine cannot exist, then of course CI should be interpreted as carbon
> in its first oxidation state?

Are you sure about the non-existence?  Some pretty weird
chemical species exist in interstellar space. 

Richard.


Re: A sign/abbreviation for "magister"

2018-11-01 Thread Janusz S. Bień via Unicode
On Wed, Oct 31 2018 at 12:14 -0700, Ken Whistler via Unicode wrote:
> On 10/31/2018 11:27 AM, Asmus Freytag via Unicode wrote:
>>
>>  but we don't have an agreement that reproducing all variations in
>>  manuscripts is in scope.
>
> In fact, I would say that in the UTC, at least, we have an agreement
> that that clearly is out of scope!
>
> Trying to represent all aspects of text in manuscripts, including
> handwriting conventions, as plain text is hopeless.  There is no
> principled line to draw there before you get into arbitrary
> calligraphic conventions.

Your statements are perfect examples of "attacking a straw man":

 Straw Man (Fallacy Of Extension): attacking an exaggerated or
 caricatured version of your opponent's position.

 http://www.don-lindsay-archive.org/skeptic/arguments.html

 https://en.wikipedia.org/wiki/Straw_man

 https://en.wikipedia.org/wiki/The_Art_of_Being_Right

Perhaps you are joking?

Best regards

Janusz

-- 
Janusz S. Bien
emeryt (emeritus)
https://sites.google.com/view/jsbien


Re: A sign/abbreviation for "magister"

2018-11-01 Thread Richard Wordingham via Unicode
On Wed, 31 Oct 2018 14:57:37 -0700
Asmus Freytag via Unicode  wrote:

> On 10/31/2018 10:18 AM, Marcel Schneider via Unicode wrote:
>> Sad that Arabic ² and ³ are still missing.

> How about all the other sets of native digits?

They might not be in natural use this way!  Also, there is the
possibility of non-spacing superscript digits, as in Devanagari, though
they are chiefly not used for counting.

But why limit consideration to digits?  What about oxidation states,
which use spacing superscript Roman numerals - I couldn't find
superscript capital 'V'.

Richard.