Re: use vs mention (was: second attempt)

2018-11-02 Thread Richard Wordingham via Unicode
On Thu, 1 Nov 2018 07:46:40 +
Richard Wordingham via Unicode  wrote:

> On Wed, 31 Oct 2018 23:35:06 +0100
> Piotr Karocki via Unicode  wrote:
> 
> > These are only examples of changes in meaning with  or ,
> > not all of these examples can really exist - but, then, another
> > question: can we know what author means? And as carbon and iodine
> > cannot exist, then of course CI should be interpreted as carbon on
> > first oxidation?  
> 
> Are you sure about the non-existence?  Some pretty weird
> chemical species exist in interstellar space. 

It's not interstellar, but CI is the empirical formula for diiodoethyne
and its isomer iodoiodanuidylethyne, and the CI⁻ ion has PubChem CID
59215341.

Richard.




Re: UCA unnecessary collation weight 0000

2018-11-02 Thread Richard Wordingham via Unicode
On Fri, 2 Nov 2018 14:27:37 -0700
Ken Whistler via Unicode  wrote:

> On 11/2/2018 10:02 AM, Philippe Verdy via Unicode wrote:

> > UTR#10 still does not explicitly state that its use of "0000" does
> > not mean it is a valid "weight", it's a notation only  
> 
> No, it is explicitly a valid weight. And it is explicitly and 
> normatively referred to in the specification of the algorithm. See 
> UTS10-D8 (and subsequent definitions), which explicitly depend on a 
> definition of "A collation weight whose value is zero." The entire 
> statement of what are primary, secondary, tertiary, etc. collation 
> elements depends on that definition. And see the tables in Section
> 3.2, which also depend on those definitions.

The definition is defective in that it doesn't handle 'large weight
values' well.  There is the anomaly that a mapping of collating element
to [1234.0000.0000][0200.020.002] may be compatible with WF1, but the
exactly equivalent mapping to [1234.020.002][0200.0000.0000] makes the
table ill-formed.  The fractional weight definitions for UCA eliminate
this '0000' notion quite well, and I once expected the UCA to move to
the CLDRCA (CLDR Collation Algorithm) fractional weight definition.
The definition of the CLDRCA does a much better job of explaining
'large weight values'.  It turns them from something exceptional to a
normal part of its functioning.  

> > (but the notation is used for TWO distinct purposes: one is for 
> > presenting the notation format used in the DUCET  
> 
> It is *not* just a notation format used in the DUCET -- it is part of 
> the normative definitional structure of the algorithm, which then 
> percolates down into further definitions and rules and the steps of
> the algorithm.

It's not needed for the CLDRCA!  The statement of the UCA algorithm
does depend on its notation, but it can be recast to avoid these zero
weights.

Richard.


mail attribution (was: A sign/abbreviation for "magister")

2018-11-02 Thread Janusz S. Bień via Unicode
On Thu, Nov 01 2018 at  6:43 -0700, Asmus Freytag via Unicode wrote:
> On 11/1/2018 12:52 AM, Richard Wordingham via Unicode wrote:
>
>  On Wed, 31 Oct 2018 11:35:19 -0700
> Asmus Freytag via Unicode  wrote:

[...]

> Unfortunately, your emails are extremely hard to read in plain text.
> It is even difficult to tell who wrote what.

My previous mail is unfortunately an example.

>
> Not sure why that is. After they make the round trip, they look fine
> to me.

When displaying your HTML mail, Emacs Gnus doesn't show the
attributions correctly. If I forget to edit them by hand when replying,
we get confusion like that in my previous mail.

I guess I should submit this as a bug or feature request to Emacs
developers. Perhaps Richard Wordingham should do the same for the mail
agent he uses.

Best regards

Janusz

-- 
Janusz S. Bien
emeryt (emeritus)
https://sites.google.com/view/jsbien


Re: A sign/abbreviation for "magister"

2018-11-02 Thread James Kass via Unicode



Asmus Freytag wrote,

> Alphabetic script users' handwriting does not match
> print in all features. Traditional German handwriting
> used a line like a macron over the letter 'u' to
> distinguish it from 'n'. Rendering this with a
> u-macron in print would be the height of absurdity.

If German text were displayed with a traditional German handwriting 
(cursive) font, then every "u" would display with a macron.  (Except the 
ones with umlauts.)  That's because the macron is part and parcel of the 
identity of the stylistic variant (cursive) of the letter, not because 
the addition of the macron makes a stylistic variation.  It would indeed 
be silly to encode such macrons in data derived from a traditional 
German handwriting specimen.  Hopefully most everyone here agrees with that.


We all seem to accept that, for example, d = d = d = <font face="MyCursiveFont">d</font>.


We all don't seem to agree that d # d̲. Or that "Mr." # "Mr" # "Mʳ" # 
"Mʳ͇" # "M:r".




Re: A sign/abbreviation for "magister"

2018-11-02 Thread James Kass via Unicode



Julian Bradfield wrote,

>> consists of three recognizable symbols.  An "M", a superscript
>> "r", and an equal sign (= two lines).  It can be printed, handwritten,
>
> That's not true. The squiggle under the r is a squiggle - it is a
> matter of interpretation (on which there was some discussion a hundred
> messages up-thread or so :) whether it was intended to be = .

I recall Asmus pointing out that the Z-like squiggle was likely a 
handwritten "=" and that there was some agreement to this, but didn't 
realize that it was in dispute.  FWIW, I agree that the squiggle which 
looks kind of like "こ" is simply the cursive, stylistic variant of "=", 
especially when written quickly.


> Just as it is a matter of interpretation whether the superscript and
> squiggle were deeply meaningful to the writer, or whether they were
> just a stylistic flourish for Mr.

A third possibility is that the double-underlined superscript was a 
writing/spelling convention of the time for writing/spelling abbreviations.


Even if someone produced contemporary Polish manuscripts abbreviating 
magister as "Mr", it could be argued that the two writers were simply 
using different conventions.




Re: A sign/abbreviation for "magister"

2018-11-02 Thread Asmus Freytag via Unicode

On 11/2/2018 4:31 AM, James Kass via Unicode wrote:

> Suppose someone found a hundred year old form from Poland which
> included a section for "sign your name" and "print your name"
> which had been filled out by a man with the typically Polish name
> of Bogus McCoy?  And he was a Magister, to boot!  And proud of it.
>
> If he signed the magister abbreviation using double-underlined
> superscript and likewise his surname *and* printed it the same way
> -- it might still be arguable as to whether it was a
> writing/spelling or a stylish distinction, I suppose.
>
> But if he signed using double-underlined superscripts and printed
> using baseline lower case Latin letters, *that* might be
> persuasive.
>
> Doesn't seem likely, though, does it?
>
> (Bogusław is a legitimate Polish masculine given name.  Its
> nickname is Bogus.  McCoy is not, however, a typical Polish
> surname.  The snarky combination of "Bogus McCoy" was irresistible
> to someone of my character and temperament.  "Bogus" is American
> slang for fake and "McCoy" connotes being genuine, as in "the real
> McCoy".)

Where a contemporaneous printed form of a writing system exists, it
appears Unicode will generally base encoding decisions on it and not
on handwritten forms. Like the case we discussed a few posts above
about German, any differences in appearance typical for the
handwritten form would be handled by styling (e.g. selection of a
"handwriting" font).

To transcribe the postcard would mean selecting the characters
appropriate for the printed equivalent of the text.

If the printed form had a standard way of superscripting letters
with a decoration below when used for abbreviations, then, and only
then, would we start discussing whether this decoration needs to be
encoded, or whether it is something a font can supply as part of
rendering the (sequence of) superscripted letters. (Perhaps with the
aid of markup identifying the sequence as abbreviation.)

All else is just applying visual hacks to simulate a specific
appearance, at the possible cost of obscuring the contents.

A./



Re: UCA unnecessary collation weight 0000

2018-11-02 Thread Mark Davis ☕️ via Unicode
You may not like the format of the data, but you are not bound to it. If
you don't like the data format (eg you want [.0021.0002] instead of
[.0000.0021.0002]), you can transform it however you want as long as you
get the same answer, as it says here:

http://unicode.org/reports/tr10/#Conformance
“The Unicode Collation Algorithm is a logical specification.
Implementations are free to change any part of the algorithm as long as any
two strings compared by the implementation are ordered the same as they
would be by the algorithm as specified. Implementations may also use a
different format for the data in the Default Unicode Collation Element
Table. The sort key is a logical intermediate object: if an implementation
produces the same results in comparison of strings, the sort keys can
differ in format from what is specified in this document. (See Section 9,
Implementation Notes.)”


That is what is done, for example, in ICU's implementation. See
http://demo.icu-project.org/icu-bin/collation.html and turn on "raw
collation elements" and "sort keys" to see the transformed collation
elements (from the DUCET + CLDR) and the resulting sort keys.

a =>[29,05,_05] => 29 , 05 , 05 .
a\u0300 => [29,05,_05][,8A,_05] => 29 , 45 8A , 06 .
à => 
A\u0300 => [29,05,u1C][,8A,_05] => 29 , 45 8A , DC 05 .
À => 
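
A minimal sketch of such a format transformation (illustrative only, not
ICU's code; the hex weights and the three-level assumption are invented for
the example): once the 0000 fields are dropped, both ways of writing a
collation element reduce to the same per-level data.

    # Sketch: read the bracket notation and keep only non-zero weights per
    # level (toy weights, not the real DUCET values).
    def parse(ces: str):
        """'[.0000.0021.0002][.1C47.0020.0002]' -> list of weight tuples."""
        return [tuple(int(w, 16) for w in part.strip(".").split("."))
                for part in ces.strip("[]").split("][")]

    def per_level(ces, levels=3):
        """Per-level lists of non-zero weights: the 0000 fields simply vanish."""
        return [[ce[lvl] for ce in ces if ce[lvl] != 0] for lvl in range(levels)]

    print(per_level(parse("[.0000.0021.0002]")))                  # [[], [33], [2]]
    print(per_level(parse("[.1C47.0020.0002][.0000.0021.0002]"))) # [[7239], [32, 33], [2, 2]]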

Mark


On Fri, Nov 2, 2018 at 12:42 AM Philippe Verdy via Unicode <
unicode@unicode.org> wrote:

> As well the step 2 of the algorithm speaks about a single "array" of
> collation elements. Actually it's best to create one separate array per
> level, and append weights for each level in the relevant array for that
> level.
> The steps S2.2 to S2.4 can do this, including for derived collation
> elements in section 10.1, or variable weighting in section 4.
>
> This also means that for fast string compares, the primary weights can be
> processed on the fly (without needing any buffering) if the primary weights
> are different between the two strings (including when one or both of the
> two strings ends, and the secondary weights or tertiary weights detected
> until then have not found any weight higher than the minimum weight value
> for each level).
> Otherwise:
> - the first secondary weight higher than the minimum secondary weight
> value, and all subsequent secondary weights, must be buffered in a
> secondary buffer.
> - the first tertiary weight higher than the minimum tertiary weight value,
> and all subsequent tertiary weights, must be buffered in a tertiary buffer.
> - and so on for higher levels (each buffer just needs to keep a counter,
> when it's first used, indicating how many weights were not buffered while
> processing and counting the primary weights, because all these weights were
> all equal to the minimum value for the relevant level)
> - these secondary/tertiary/etc. buffers will only be used once you reach
> the end of the two strings when processing the primary level and no
> difference was found: you'll start by comparing the initial counters in
> these buffers and the buffer that has the largest counter value is
> necessarily for the smaller compared string. If both counters are equal,
> then you start comparing the weights stored in each buffer, until one of
> the buffers ends before another (the shorter buffer is for the smaller
> compared string). If both weight buffers reach the end, you use the next
> pair of buffers built for the next level and process them with the same
> algorithm.
>
> Nowhere you'll ever need to consider any [.0000] weight which is just a
> notation in the format of the DUCET intended only to be readable by humans
> but never needed in any machine implementation.
>
> Now if you want to create sort keys this is similar except that you don't
> have two strings to process and compare, all you want is to create separate
> arrays of weights for each level: each level can be encoded separately, the
> encoding must be made so that when you'll concatenate the encoded arrays,
> the first few encoded *bits* in the secondary or tertiary encodings cannot
> be larger or equal to the bits used by the encoding of the primary weights
> (this only limits how you'll encode the 1st weight in each array as its
> first encoding *bits* must be lower than the first bits used to encode any
> weight in previous levels).
>
> Nowhere you are required to encode weights exactly like their logical
> weight, this encoding is fully reversible and can use any suitable
> compression techniques if needed. As long as you can safely detect when an
> encoding ends, because it encounters some bits (with lower values) used to
> start the encoding of one of the higher levels, the compression is safe.
>
> For each level, you can reserve only a single code used to "mark" the
> start of another higher level followed by some bits to indicate which level
> it is, then followed by the compressed code for the level made so that each
> weight is encoded by a code not starting by the reserved mark. That
> encoding "mark" 

Re: A sign/abbreviation for "magister"

2018-11-02 Thread James Kass via Unicode



Suppose someone found a hundred year old form from Poland which included 
a section for "sign your name" and "print your name" which had been 
filled out by a man with the typically Polish name of Bogus McCoy?  And 
he was a Magister, to boot!  And proud of it.


If he signed the magister abbreviation using double-underlined 
superscript and likewise his surname *and* printed it the same way -- it 
might still be arguable as to whether it was a writing/spelling or a 
stylish distinction, I suppose.


But if he signed using double-underlined superscripts and printed using 
baseline lower case Latin letters, *that* might be persuasive.


Doesn't seem likely, though, does it?

(Bogusław is a legitimate Polish masculine given name.  Its nickname is 
Bogus.  McCoy is not, however, a typical Polish surname.  The snarky 
combination of "Bogus McCoy" was irresistible to someone of my character 
and temperament.  "Bogus" is American slang for fake and "McCoy" 
connotes being genuine, as in "the real McCoy".)




Re: UCA unnecessary collation weight 0000

2018-11-02 Thread Richard Wordingham via Unicode
On Fri, 2 Nov 2018 14:54:19 +0100
Philippe Verdy via Unicode  wrote:

> It's not just a question of "I like it or not". But the fact that the
> standard makes the presence of 0000 required in some steps, and the
> requirement is in fact wrong: this is in fact NEVER required to
> create an equivalent collation order. these steps are completely
> unnecessary and should be removed.
> 
> Le ven. 2 nov. 2018 à 14:03, Mark Davis ☕️  a
> écrit :
> 
> > You may not like the format of the data, but you are not bound to
> > it. If you don't like the data format (eg you want [.0021.0002]
> > instead of [.0000.0021.0002]), you can transform it however you
> > want as long as you get the same answer, as it says here:
> >
> > http://unicode.org/reports/tr10/#Conformance
> > “The Unicode Collation Algorithm is a logical specification.
> > Implementations are free to change any part of the algorithm as
> > long as any two strings compared by the implementation are ordered
> > the same as they would be by the algorithm as specified.
> > Implementations may also use a different format for the data in the
> > Default Unicode Collation Element Table. The sort key is a logical
> > intermediate object: if an implementation produces the same results
> > in comparison of strings, the sort keys can differ in format from
> > what is specified in this document. (See Section 9, Implementation
> > Notes.)”

Given the above paragraph, how does the standard force you to use a
special 0000?  Perhaps the wording of the standard can be changed to
prevent your unhappy interpretation.

> > That is what is done, for example, in ICU's implementation. See
> > http://demo.icu-project.org/icu-bin/collation.html and turn on "raw
> > collation elements" and "sort keys" to see the transformed collation
> > elements (from the DUCET + CLDR) and the resulting sort keys.
> >
> > a =>[29,05,_05] => 29 , 05 , 05 .
> > a\u0300 => [29,05,_05][,8A,_05] => 29 , 45 8A , 06 .
> > à => 
> > A\u0300 => [29,05,u1C][,8A,_05] => 29 , 45 8A , DC 05 .
> > À => 

As you can see, Mark does not come to the same conclusion as you, and
nor do I.

Richard.



Re: A sign/abbreviation for "magister"

2018-11-02 Thread Marcel Schneider via Unicode

On 01/11/2018 16:43, Asmus Freytag via Unicode wrote:
[quoted mail]

I don't think it's a joke to recognize that there is a continuum here and that
 there is no line that can be drawn which is based on straightforward 
principles.

[…]

In this case, there is no such framework that could help establish pragmatic
 boundaries dividing the truly useful from the merely fanciful.


I think the red line was always between the positive and the negative answer to
the question whether a given graphic is relevant for legibility/readability of
the plain text backbone. But humans can be trained to mentally disambiguate
a mass of confusables, so the line vanishes and the continuum remains intact.

On 02/11/2018 06:22, Asmus Freytag via Unicode wrote:

On 11/1/2018 7:59 PM, James Kass via Unicode wrote:


Alphabetic script users write things the way they are spelled and spell things
 the way they are written.  The abbreviation in question as written consists of
 three recognizable symbols.  An "M", a superscript "r", and an equal sign
 (= two lines).  It can be printed, handwritten, or in fraktur; it will still
 consist of those same three recognizable symbols.

We're supposed to be preserving the past, not editing it or revising it.


Alphabetic script users' handwriting does not match print in all features.
Traditional German handwriting used a line like a macron over the letter 'u'
to distinguish it from 'n'. Rendering this with a u-macron in print would be
the height of absurdity.

I feel similarly about the assertion that the "two lines" are something that
 needs to be encoded, but only an expert would know for sure.


Indeed it would be relevant to know whether it is mandatory in Polish, and I’m
not an expert. But looking at several scripts using abbreviation indicators as
superscript, i.e. Latin and Cyrillic (when using the Latin-script-written
abbreviation of "Numero", given Cyrillic for "N" is "Н", so it’s strictly
speaking one single script, and two scripts using it), then we can easily see
how single and double underlines are added or not depending on font design
and on customary writing and display. E.g. the Romance feminine and masculine
ordinal indicators have one or zero underlines, to such an extent that French
typography specifies that the masculine ordinal indicator, despite being a
superscript small o, is unfit to compose the French "numéro" abbreviation,
which must not have an underline. Hence DEGREE SIGN is less bad than U+00BA.

If applying the same to Polish, "Magister" is "Mʳ" and is straightforward
to input when using a new French keyboard layout or an enhanced variant of
any national Latin one having small superscripts on the Shift+Num level, or
via a ‹superscript› dead key, mapped e.g. on Shift + AltGr/Option + E or
any of the 26 letter keys as mnemonically convenient ("superscript"
translates to French "exposant"); or ‹Compose› ‹^› [e] (where the ASCII
circumflex or caret is repurposed for superscript compose sequences, while
‹circumflex accent› is active *after* LESS-THAN SIGN, consistently with the
*new* convention for ‹inverted breve› using LEFT PARENTHESIS rather than "g)").

These details are posted in this thread on this List rather than CLDR-USERS
in order to make clear that typing superscript letters directly via the
keyboard is easy, and therefore proposing it is not to harass the end-user.

On 02/11/2018 13:09, Asmus Freytag via Unicode wrote:
[quoted mail]
[…]

To transcribe the postcard would mean selecting the characters appropriate
 for the printed equivalent of the text.


As already suggested, selecting the variants can be done using variation
selectors, provided the Standard has defined the intended use case.



If the printed form had a standard way of superscripting letters with a
 decoration below when used for abbreviations,


As already pointed out, Latin script does not benefit from a consensus
to use underline for superscript. E.g. Italian, Portuguese and Spanish
do use underline for superscript, English and French do not.


then, and only then would we start discussing whether this decoration
needs to be encoded, or whether it is something a font can supply as part
of rendering the (sequence of) superscripted letters.


I think the problem is not completely outlined, as long as the use of
variation sequences is not mentioned. There is no "all" or "nothing"
dilemma, given Unicode has the means of providing a standard way of
representing calligraphic variations using variation selectors. E.g.
the letter ENG is preferred in big lowercase form when writing
Bambara, while other locales may like it in hooked uppercase.
The Bambara Arial font allows to make sure it is the right glyph,
and Arial in general follows the Bambara preference, but other fonts
do not, while some of them have the Bambara-fit glyph inside but
don’t display it unless urged by an OpenType supporting renderer,
and appropriate settings turned on, e.g. on a locale identifier basis.


(Perhaps with the aid of 

Re: A sign/abbreviation for "magister"

2018-11-02 Thread Adam Borowski via Unicode
On Fri, Nov 02, 2018 at 01:44:25PM +, Michael Everson via Unicode wrote:
> I write my 7’s and Z’s with a horizontal line through them.  Ƶ is encoded
> not for this purpose, but because Z and Ƶ are distinct in orthographies
> for varieties of Tatar, Chechen, Karelian, and Mongolian.  This is a
> contemporary writing convention but it does not argue for a new SEVEN WITH
> STROKE character or that I should use Ƶ rather than Z when I write
> *Ƶanƶibar.

And that use conflicts with Ƶ ƶ being an allograph of Polish Ż ż, used
especially when marks above cap height are unwanted or when readability is
important (Żż is too similar to Źź).  It also happened to be nicely
renderable with Z^H- z^H- vs Z^H' z^H' on printers which had backspace.

I unsuccessfully argued for such a variant on a "historical terminals" font:
https://github.com/rbanffy/3270font/issues/19

But in either case the difference is purely visual rather than semantic.
The latter still applies to _some_ uses of superscript, but not to the
mgr.


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ Have you heard of the Amber Road?  For thousands of years, the
⣾⠁⢰⠒⠀⣿⡁ Romans and co valued amber, hauled through the Europe over the
⢿⡄⠘⠷⠚⠋⠀ mountains and along the Vistula, from Gdańsk.  To where it came
⠈⠳⣄ together with silk (judging by today's amber stalls).


Re: UCA unnecessary collation weight 0000

2018-11-02 Thread Mark Davis ☕️ via Unicode
The table is the way it is because it is easier to process (and comprehend)
when the first field is always the primary weight, second is always the
secondary, etc.

Go ahead and transform the input DUCET files as you see fit. The "should be
removed" is your personal preference. Unless we hear strong demand
otherwise from major implementers, people have better things to do than
change their parsers to suit your preference.

Mark


On Fri, Nov 2, 2018 at 2:54 PM Philippe Verdy  wrote:

> It's not just a question of "I like it or not". But the fact that the
> standard makes the presence of 0000 required in some steps, and the
> requirement is in fact wrong: this is in fact NEVER required to create an
> equivalent collation order. these steps are completely unnecessary and
> should be removed.
>
> Le ven. 2 nov. 2018 à 14:03, Mark Davis ☕️  a écrit :
>
>> You may not like the format of the data, but you are not bound to it. If
>> you don't like the data format (eg you want [.0021.0002] instead of
>> [.0000.0021.0002]), you can transform it however you want as long as you
>> get the same answer, as it says here:
>>
>> http://unicode.org/reports/tr10/#Conformance
>> “The Unicode Collation Algorithm is a logical specification.
>> Implementations are free to change any part of the algorithm as long as any
>> two strings compared by the implementation are ordered the same as they
>> would be by the algorithm as specified. Implementations may also use a
>> different format for the data in the Default Unicode Collation Element
>> Table. The sort key is a logical intermediate object: if an implementation
>> produces the same results in comparison of strings, the sort keys can
>> differ in format from what is specified in this document. (See Section 9,
>> Implementation Notes.)”
>>
>>
>> That is what is done, for example, in ICU's implementation. See
>> http://demo.icu-project.org/icu-bin/collation.html and turn on "raw
>> collation elements" and "sort keys" to see the transformed collation
>> elements (from the DUCET + CLDR) and the resulting sort keys.
>>
>> a =>[29,05,_05] => 29 , 05 , 05 .
>> a\u0300 => [29,05,_05][,8A,_05] => 29 , 45 8A , 06 .
>> à => 
>> A\u0300 => [29,05,u1C][,8A,_05] => 29 , 45 8A , DC 05 .
>> À => 
>>
>> Mark
>>
>>
>> On Fri, Nov 2, 2018 at 12:42 AM Philippe Verdy via Unicode <
>> unicode@unicode.org> wrote:
>>
>>> As well the step 2 of the algorithm speaks about a single "array" of
>>> collation elements. Actually it's best to create one separate array per
>>> level, and append weights for each level in the relevant array for that
>>> level.
>>> The steps S2.2 to S2.4 can do this, including for derived collation
>>> elements in section 10.1, or variable weighting in section 4.
>>>
>>> This also means that for fast string compares, the primary weights can
>>> be processed on the fly (without needing any buffering) is the primary
>>> weights are different between the two strings (including when one or both
>>> of the two strings ends, and the secondary weights or tertiary weights
>>> detected until then have not found any weight higher than the minimum
>>> weight value for each level).
>>> Otherwise:
>>> - the first secondary weight higher that the minimum secondary weght
>>> value, and all subsequent secondary weights must be buffered in a
>>> secondary  buffer  .
>>> - the first tertiary weight higher that the minimum secondary weght
>>> value, and all subsequent secondary weights must be buffered in a tertiary
>>> buffer.
>>> - and so on for higher levels (each buffer just needs to keep a counter,
>>> when it's first used, indicating how many weights were not buffered while
>>> processing and counting the primary weights, because all these weights were
>>> all equal to the minimum value for the relevant level)
>>> - these secondary/tertiary/etc. buffers will only be used once you reach
>>> the end of the two strings when processing the primary level and no
>>> difference was found: you'll start by comparing the initial counters in
>>> these buffers and the buffer that has the largest counter value is
>>> necessarily for the smaller compared string. If both counters are equal,
>>> then you start comparing the weights stored in each buffer, until one of
>>> the buffers ends before another (the shorter buffer is for the smaller
>>> compared string). If both weight buffers reach the end, you use the next
>>> pair of buffers built for the next level and process them with the same
>>> algorithm.
>>>
>>> Nowhere you'll ever need to consider any [.0000] weight which is just a
>>> notation in the format of the DUCET intended only to be readable by humans
>>> but never needed in any machine implementation.
>>>
>>> Now if you want to create sort keys this is similar except that you
>>> don"t have two strings to process and compare, all you want is to create
>>> separate arrays of weights for each level: each level can be encoded
>>> separately, the encoding must be made so that when you'll concatenate 

Re: A sign/abbreviation for "magister"

2018-11-02 Thread Doug Ewell via Unicode
Michael Everson wrote:

> I write my 7’s and Z’s with a horizontal line through them. Ƶ is
> encoded not for this purpose, but because Z and Ƶ are distinct in
> orthographies for varieties of Tatar, Chechen, Karelian, and
> Mongolian. This is a contemporary writing convention but it does not
> argue for a new SEVEN WITH STROKE character or that I should use Ƶ
> rather than Z when I write *Ƶanƶibar. 

http://www.unicode.org/L2/L2018/18323-open-four.pdf
 
--
Doug Ewell | Thornton, CO, US | ewellic.org




Re: A sign/abbreviation for "magister"

2018-11-02 Thread Janusz S. Bień via Unicode
On Fri, Nov 02 2018 at  5:09 -0700, Asmus Freytag via Unicode wrote:

[...]

> To transcribe the postcard would mean selecting the characters
> appropriate for the printed equivalent of the text.

You seem to make implicit assumptions which are not necessarily
true. For me to transcribe the postcard would mean to answer the needs
of the intended transcription users.

> If the printed form had a standard way of superscripting letters with
> a decoration below when used for abbreviations, then, and only then
> would we start discussing whether this decoration needs to be encoded,
> or whether it is something a font can supply as part of rendering the
> (sequence of) superscripted letters. (Perhaps with the aid of markup
> identifying the sequence as abbreviation).

As I wrote already some time ago on the list, the alternative "encoding
or using a specialized font" is wrong. These days texts are encoded for
processing (in particular searching); rendering is just a kind of
side-effect.

On the other hand, whom do you mean by "we" and what do you mean by
"encoding"? If I guess correctly what you mean by these words, then
you are discussing an issue which was never raised by anybody (if I'm
wrong, please quote the relevant post). Again it is not clear to me whom
you want to convince or inform.

> All else is just applying visual hacks

I don't mind hacks if they are useful and serve the intended purpose,
even if they are visual :-)

> to simulate a specific appearance,

As I said above, the appearance is not necessarily of primary
importance.

> at the possible cost of obscuring the contents.

It's for the users of the transcription to decide what is obscuring the
text and what, to the contrary, makes the transcription more readable
and useful.

Best regards

Janusz

-- 
Janusz S. Bien
emeryt (emeritus)
https://sites.google.com/view/jsbien


Re: A sign/abbreviation for "magister"

2018-11-02 Thread Richard Wordingham via Unicode
On Fri, 02 Nov 2018 08:38:45 -0700
Doug Ewell via Unicode  wrote:

> Do we have any other evidence of this usage, besides a single
> handwritten postcard? 

What, beyond some of us actually employing it ourselves?  I'm sure I've
seen 'William' abbreviated in print to 'Wᵐ' with some mark below, but I
couldn't lay my hands on an example.

Richard.



Re: A sign/abbreviation for "magister"

2018-11-02 Thread Philippe Verdy via Unicode
Le ven. 2 nov. 2018 à 16:20, Marcel Schneider via Unicode <
unicode@unicode.org> a écrit :

> That seems to me a regression, after the front has moved in favor of
> recognizing Latin script needs preformatted superscript. The use case is
> clear, as we have ª, º, and n° with degree sign, and so on as already
> detailed in long e-mails in this thread and elsewhere. There is no point
> in setting up or maintaining a Unicode policy stating otherwise, as such
> a policy would be inconsistent with longlasting and extremely widespread
> practice.
>

Using variation selectors is only appropriate for these existing
(preencoded) superscript letters ª and º so that they display the
appropriate (underlined or not underlined) glyph. It is not a solution for
creating superscripts on any letters and marking that they should be rendered
as superscript (notably, the base letter to transform into superscript may
also have its own combining diacritics, that must be encoded explicitly,
and if you use the variation selector, it should allow variation on the
presence or absence of the underline, which must then be encoded explicitly
as a combining character).

So finally what we get with variation selectors is:
<letter, variation selector, combining diacritic> and
<letter precombined with the diacritic, variation selector> which
is NOT canonically equivalent.

Using a combining character avoids this caveat:
<letter, combining diacritic, combining abbreviation mark> and
<letter precombined with the diacritic, combining abbreviation mark> which
ARE canonically equivalent.
And this explicitly states the semantic (something that is lost if we are
forced to use presentational superscripts in a higher level protocol like
HTML/CSS for rich text format, and one just extracts the plain text; using
collation will not help at all, except if collators are built with
preprocessing that will first infer the presence of a <combining
abbreviation mark> to insert after each combining sequence of the
plain-text enclosed in an italic style).
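
The normalization behaviour described here is easy to check with any
UCD-aware library; a small illustration (using U+20DD, an existing combining
mark with combining class 0, as a stand-in for the proposed combining
abbreviation mark, which is of course not encoded):

    import unicodedata as ud

    nfd = lambda s: ud.normalize("NFD", s)
    VS1  = "\uFE00"   # VARIATION SELECTOR-1
    MARK = "\u20DD"   # existing ccc=0 combining mark, stand-in for the
                      # hypothetical combining abbreviation mark

    # A variation selector must sit right after its base character, so the two
    # spellings of "a with grave" stop being canonically equivalent:
    print(nfd("a" + VS1 + "\u0300") == nfd("\u00E0" + VS1))   # False

    # A ccc=0 combining mark appended after the whole combining sequence keeps
    # the two spellings canonically equivalent:
    print(nfd("a\u0300" + MARK) == nfd("\u00E0" + MARK))      # True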

There's little risk: if the <combining abbreviation mark> is not mapped in
fonts (or not recognized by text renderers to create synthetic superscript
glyphs from existing recognized clusters), it will render as a visible
.notdef (tofu). But normally text renderers recognize the basic properties
of characters in the UCD and can see that <combining abbreviation mark> has
a combining mark general property (it also knows that it has a 0 combining
class, so canonical equivalences are not broken) to render a better symbol
than the .notdef "tofu": it should better render a dotted circle. Even if
this tofu or dotted circle is rendered, it still explicitly marks the
presence of the abbreviation mark, so there's less confusion about what is
preceding it (the combining sequence that was supposed to be superscripted).

The <combining abbreviation mark> can also have its own <variation
selector> to select other styles when they are optional, such as adding
underlines to the superscripted letter, or rendering the letter instead as
underscript, or as a small baseline letter with a dot after it: this is
still an explicit abbreviation mark, and the meaning of the plain text is
still preserved: the variation selector is only suitable to alter the
rendering of a cluster when it has effectively several variants and the
default rendering is not universal, notably across font styles initially
designed for specific markets with their own local preferences: the
variation selector still allows the same fonts to map all known variants
distinctly, independently of the initial arbitrary choice of the default
glyph used when the variation selector is missing.

Even if fonts (or text renderers) may map the <combining abbreviation mark>
to variable glyphs, this is purely stylistic, the semantic of the plain
text is not lost because the <combining abbreviation mark> is still there.
There's no need of any rich text to encode it (the rich-text styles are
not explicitly encoding that a superscript is actually an abbreviation
mark, so it cannot also allow variation like rendering an underscript, or a
baseline small glyph with an added dot). Typically a <combining abbreviation
mark> used in an English style would render the letter (or cluster) before
it as a "small" letter without any added dot.

So I really think that <combining abbreviation mark> is far better than:
* using preencoded superscript letters (they don't map all the necessary
repertoire of clusters where the abbreviation is needed, it now just covers
Basic Latin, ten digits, plus and minus signs, and the dot or comma, plus a
few other letters like stops; it's impossible to reencode the full Unicode
repertoire and its allowed combining sequences or extended default grapheme
clusters!),
* or using variation selectors to make them appear as a superscript (does
not work with all clusters containing other diacritics like accents),
* or using rich-text styling (from which you cannot safely infer any
semantic: there is no warranty that M<sup>r</sup> in HTML is actually an
abbreviation of "Mister"; in HTML this is encoded elsewhere as <abbr>Mr</abbr>
or <abbr title="Mister">Mr</abbr>; the semantic of the abbreviation has to be
looked up in a possible <abbr> container element and the meaning of the
abbreviation is to look inside its title attribute, so obviously this
requires complex preprocessing before we can infer a plaintext version
(suitable for
example in plain-text searches where you don't want to match a mathematical

Re: UCA unnecessary collation weight 0000

2018-11-02 Thread Philippe Verdy via Unicode
I was replying not about the notational representation of the DUCET data
table (using [.0000.0000.0000.0000] unnecessarily) but about the text of
UTR#10 itself. Which remains highly confusing, and contains completely
unnecessary steps, and just complicates things with absolutely no benefit
at all by introducing confusion about these "0000". UTR#10 still does not
explicitly state that its use of "0000" does not mean it is a valid
"weight", it's a notation only (but the notation is used for TWO distinct
purposes: one is for presenting the notation format used in the DUCET
itself to present how collation elements are structured, the other one is
for marking the presence of a possible, but not always required, encoding
of an explicit level separator for encoding sort keys).

UTR#10 is still needlessly confusing. Even the example tables can be made
without using these "0000" (for example in tables showing how to build sort
keys, it can present the list of weights split in separate columns, one
column per level, without any "0000"). The implementation does not
necessarily have to create a buffer containing all weight values in a row,
when separate buffers for each level are far superior (and even more
efficient as it can save space in memory). The step "S3.2" in the UCA
algorithm should not even be there (it is made in favor of a specific
implementation which is not even efficient or optimal); it complicates the
algorithm with absolutely no benefit at all; you can ALWAYS remove it
completely and this still generates equivalent results.


Le ven. 2 nov. 2018 à 15:23, Mark Davis ☕️  a écrit :

> The table is the way it is because it is easier to process (and
> comprehend) when the first field is always the primary weight, second is
> always the secondary, etc.
>
> Go ahead and transform the input DUCET files as you see fit. The "should
> be removed" is your personal preference. Unless we hear strong demand
> otherwise from major implementers, people have better things to do than
> change their parsers to suit your preference.
>
> Mark
>
>
> On Fri, Nov 2, 2018 at 2:54 PM Philippe Verdy  wrote:
>
>> It's not just a question of "I like it or not". But the fact that the
>> standard makes the presence of 0000 required in some steps, and the
>> requirement is in fact wrong: this is in fact NEVER required to create an
>> equivalent collation order. these steps are completely unnecessary and
>> should be removed.
>>
>> Le ven. 2 nov. 2018 à 14:03, Mark Davis ☕️  a écrit :
>>
>>> You may not like the format of the data, but you are not bound to it. If
>>> you don't like the data format (eg you want [.0021.0002] instead of
>>> [.0000.0021.0002]), you can transform it however you want as long as you
>>> get the same answer, as it says here:
>>>
>>> http://unicode.org/reports/tr10/#Conformance
>>> “The Unicode Collation Algorithm is a logical specification.
>>> Implementations are free to change any part of the algorithm as long as any
>>> two strings compared by the implementation are ordered the same as they
>>> would be by the algorithm as specified. Implementations may also use a
>>> different format for the data in the Default Unicode Collation Element
>>> Table. The sort key is a logical intermediate object: if an implementation
>>> produces the same results in comparison of strings, the sort keys can
>>> differ in format from what is specified in this document. (See Section 9,
>>> Implementation Notes.)”
>>>
>>>
>>> That is what is done, for example, in ICU's implementation. See
>>> http://demo.icu-project.org/icu-bin/collation.html and turn on "raw
>>> collation elements" and "sort keys" to see the transformed collation
>>> elements (from the DUCET + CLDR) and the resulting sort keys.
>>>
>>> a =>[29,05,_05] => 29 , 05 , 05 .
>>> a\u0300 => [29,05,_05][,8A,_05] => 29 , 45 8A , 06 .
>>> à => 
>>> A\u0300 => [29,05,u1C][,8A,_05] => 29 , 45 8A , DC 05 .
>>> À => 
>>>
>>> Mark
>>>
>>>
>>> On Fri, Nov 2, 2018 at 12:42 AM Philippe Verdy via Unicode <
>>> unicode@unicode.org> wrote:
>>>
 As well the step 2 of the algorithm speaks about a single "array" of
 collation elements. Actually it's best to create one separate array per
 level, and append weights for each level in the relevant array for that
 level.
 The steps S2.2 to S2.4 can do this, including for derived collation
 elements in section 10.1, or variable weighting in section 4.

 This also means that for fast string compares, the primary weights can
 be processed on the fly (without needing any buffering) is the primary
 weights are different between the two strings (including when one or both
 of the two strings ends, and the secondary weights or tertiary weights
 detected until then have not found any weight higher than the minimum
 weight value for each level).
 Otherwise:
 - the first secondary weight higher that the minimum secondary weght
 value, and all subsequent secondary weights must be buffered in a

Re: A sign/abbreviation for "magister"

2018-11-02 Thread Doug Ewell via Unicode
Do we have any other evidence of this usage, besides a single
handwritten postcard? 
 
--
Doug Ewell | Thornton, CO, US | ewellic.org




Re: A sign/abbreviation for "magister"

2018-11-02 Thread Marcel Schneider via Unicode

On 31/10/2018 at 19:34, Asmus Freytag via Unicode wrote:


On 10/31/2018 10:32 AM, Janusz S. Bień via Unicode wrote:
>
> Let me remind what plain text is according to the Unicode glossary:
>
> Computer-encoded text that consists only of a sequence of code
> points from a given standard, with no other formatting or structural
> information.
>
> If you try to use this definition to decide what is and what is not a
> character, you get vicious circle.
>
> As mentioned already by others, there is no other generally accepted
> definition of plain text.


Being among those who argued that the “plain text” concept cannot—and
therefore mustn’t—be used per se to disallow the use of a more or less
restricted or extended set of characters in what is called “ordinary text”,
I’m ending up adding the following in case it might be of interest:



This definition becomes tautological only when you try to invoke it in making 
encoding decisions, that is, if you couple it with the statement that only 
"elements of plain text" are ever encoded.


I don’t think that Janusz S. Bień’s concern is about this definition
being “tautological”. AFAICS the Unicode definition of “plain text” is
quoted to back the assumption that it’s hard to use that concept to argue
against the use of a given Unicode character in a given context, or to
use it to kill a proposal for characters significant in natural languages.

The reasoning is that the call not to use character X in plain text, while X is
a legal Unicode character whose use is not discouraged for technical reasons,
is like if “ordinary people” (scarequoted derivative from “ordinary text”) were
told that X is not a Unicode character. That discourse is a “vicious circle” in
that there is no limit to it until Latin script is pulled down to plain ASCII.
As already well known, diacritics are handled by the rendering system and don’t
need to be displayed as such in the plain text backbone. I don’t believe that
the same applies to other scripts, but these are often not considered when the
encoding of Latin preformatted letters is fought, given superscripting seems
to be proper to Latin, and originated from longlasting medieval practice and
writing conventions.



For that purpose, you need a number of other definitions of "plain text". 
Including the definition that plain text is the "backbone" to which you apply 
formatting and layout information. I personally believe that there are more 
2D notations where it's quite obvious to me that what is "placed" is a text 
element. More like maps and music and less like a circuit diagram, where the 
elements are less text like (I deliberately include symbols in the definition 
of text, but not any random graphical line art).


All two-dimensional notations here (outside the parenthetical) use higher-level
protocols; maps and diagrams are often vector graphics. But Unicode strived to
encode all needed plain text elements, such as symbols for maritime and weather
maps. Even arrows of many possible shapes, including 3D-looking ones, have been
encoded. While freehand (rather than “any random”) graphical art is out of 
scope,
we have a lot of box drawing, used with appropriate fonts to draw e.g. layouts 
of
keyboards above the relevant source code in plain text files (examples in XKB).

As a sidenote: Box drawing while useful is unduly neglected on font level, even
in the Code Charts where the advance width, usually half an em, is inconsistent
between different sorts of elements belonging to the same block.



Another definition of plain text is that which contains the "readable content" 
of the text.


As already discussed on this List, many documents in PDF have hard-to-read plain
text backbones, even misleading Google Search, for the purpose of handling 
special
glyphs (and, in some era, even special characters).

As we've discussed here, this definition has edge cases; some 
content is traditionally left to styling.


Many pre-Unicode traditions are found out there, that stay in use, partly for
technical reasons (mainly by lack of updated keyboard layouts), partly for
consistency with accustomed ways of doing. Being traditionally-left-to-styling
is the more unconvincing. Even a letter that got to become LATIN SMALL LETTER O 
E
(Unicode 1.0) was composed on typewriters using the half-backspace, and should 
be
_left to styling_ when it was pulled out of the draft ISO/IEC 8859-1 by the 
fault
of a Frenchman (name undisclosed for privacy). And we’ve been told on this List
that the tradition using styling (a special font) to display the additional 
Latin
letters used to write Bambara survived.

Example: some of the small words in 
some Scandinavian languages are routinely italicized to disambiguate their 
reading.


Other languages use titlecase to achieve the same disambiguation. E.g. French
titlecases the noun "Une" which means the "cover", not the indefinite article,
and German did the same when "Ein(e)" is a numeral, but today, 

Re: A sign/abbreviation for "magister"

2018-11-02 Thread Julian Bradfield via Unicode
On 2018-11-02, James Kass via Unicode  wrote:
> Alphabetic script users write things the way they are spelled and spell 
> things the way they are written.  The abbreviation in question as 
> written consists of three recognizable symbols.  An "M", a superscript 
> "r", and an equal sign (= two lines).  It can be printed, handwritten, 

That's not true. The squiggle under the r is a squiggle - it is a
matter of interpretation (on which there was some discussion a hundred
messages up-thread or so :) whether it was intended to be = .
Just as it is a matter of interpretation whether the superscript and
squiggle were deeply meaningful to the writer, or whether they were
just a stylistic flourish for Mr.

-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.



Re: UCA unnecessary collation weight 0000

2018-11-02 Thread Ken Whistler via Unicode


On 11/2/2018 10:02 AM, Philippe Verdy via Unicode wrote:

> I was replying not about the notational representation of the DUCET
> data table (using [.0000.0000.0000.0000] unnecessarily) but about the
> text of UTR#10 itself. Which remains highly confusing, and contains
> completely unnecessary steps, and just complicates things with
> absolutely no benefit at all by introducing confusion about these "0000".


Sorry, Philippe, but the confusion that I am seeing introduced is what 
you are introducing to the unicode list in the course of this discussion.



> UTR#10 still does not explicitly state that its use of "0000" does not
> mean it is a valid "weight", it's a notation only


No, it is explicitly a valid weight. And it is explicitly and 
normatively referred to in the specification of the algorithm. See 
UTS10-D8 (and subsequent definitions), which explicitly depend on a 
definition of "A collation weight whose value is zero." The entire 
statement of what are primary, secondary, tertiary, etc. collation 
elements depends on that definition. And see the tables in Section 3.2, 
which also depend on those definitions.



> (but the notation is used for TWO distinct purposes: one is for
> presenting the notation format used in the DUCET


It is *not* just a notation format used in the DUCET -- it is part of 
the normative definitional structure of the algorithm, which then 
percolates down into further definitions and rules and the steps of the 
algorithm.


> itself to present how collation elements are structured, the other one
> is for marking the presence of a possible, but not always required,
> encoding of an explicit level separator for encoding sort keys).

That is a numeric value of zero, used in Section 7.3, Form Sort Keys. It 
is not part of the *notation* for collation elements, but instead is a 
magic value chosen for the level separator precisely because zero values 
from the collation elements are removed during sort key construction, so 
that zero is then guaranteed to be a lower value than any remaining 
weight added to the sort key under construction. This part of the 
algorithm is not rocket science, by the way!
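
A bare-bones sketch of that step, as summarized above (not the normative
text of Section 7.3; the weights are illustrative only):

    # Sort key formation: keep non-zero weights per level and put a zero
    # between levels; the zero is lower than any weight that survived.
    def form_sort_key(ces, levels=3):
        key = []
        for level in range(levels):
            if level > 0:
                key.append(0)            # level separator
            key.extend(ce[level] for ce in ces if ce[level] != 0)
        return tuple(key)

    a       = [(0x1C47, 0x0020, 0x0002)]                           # toy weights
    a_grave = [(0x1C47, 0x0020, 0x0002), (0x0000, 0x0025, 0x0002)]
    assert form_sort_key(a) < form_sort_key(a_grave)
    # (7239, 0, 32, 0, 2) < (7239, 0, 32, 37, 0, 2, 2)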


> UTR#10 is still needlessly confusing.


O.k., if you think so, you then know what to do:

https://www.unicode.org/review/pri385/

and

https://www.unicode.org/reporting.html

> Even the example tables can be made without using these "0000" (for
> example in tables showing how to build sort keys, it can present the
> list of weights split in separate columns, one column per level,
> without any "0000"). The implementation does not necessarily have to
> create a buffer containing all weight values in a row, when separate
> buffers for each level are far superior (and even more efficient as it
> can save space in memory).


The UCA doesn't *require* you to do anything particular in your own 
implementation, other than come up with the same results for string 
comparisons. That is clearly stated in the conformance clause of UTS #10.


https://www.unicode.org/reports/tr10/tr10-39.html#Basic_Conformance

The step "S3.2" in the UCA algorithm should not even be there (it is 
made in favor an specific implementation which is not even efficient 
or optimal),


That is a false statement. Step S3.2 is there to provide a clear 
statement of the algorithm, to guarantee correct results for string 
comparison. Section 9 of UTS #10 provides a whole lunch buffet of 
techniques that implementations can choose from to increase the 
efficiency of their implementations, as they deem appropriate. You are 
free to implement as you choose -- including techniques that do not 
require any level separators. You are, however, duly warned in:


https://www.unicode.org/reports/tr10/tr10-39.html#Eliminating_level_separators

that "While this technique is relatively easy to implement, it can 
interfere with other compression methods."


> it complicates the algorithm with absolutely no benefit at all); you
> can ALWAYS remove it completely and this still generates equivalent
> results.


No you cannot ALWAYS remove it completely. Whether or not your 
implementation can do so, depends on what other techniques you may be 
using to increase performance, store shorter keys, or whatever else may 
be at stake in your optimization.


If you don't like zeroes in collation, be my guest, and ignore them 
completely. Take them out of your tables, and don't use level 
separators. Just make sure you end up with conformant result for 
comparison of strings when you are done. And in the meantime, if you 
want to complain about the text of the specification of UTS #10, then 
provide carefully worded alternatives as suggestions for improvement to 
the text, rather than just endlessly ranting about how the standard is 
confusing because the collation weight 0000 is "unnecessary".


--Ken




Re: A sign/abbreviation for "magister"

2018-11-02 Thread Marcel Schneider via Unicode

On 02/11/2018 17:45, Philippe Verdy via Unicode wrote:
[quoted mail]


Using variation selectors is only appropriate for these existing 
(preencoded) superscript letters ª and º so that they display the 
appropriate (underlined or not underlined) glyph.


And it is for forcing the display of DIGIT ZERO with a short stroke:
0030 FE00; short diagonal stroke form; # DIGIT ZERO
https://unicode.org/Public/UCD/latest/ucd/StandardizedVariants.txt

From that it becomes unclear why that isn’t applied to 4, 7, z and Z
mentioned in this thread, to be displayed open or with a short bar.
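
Applying such a standardized variation sequence is just a matter of
appending the selector; whether a given font honours the requested form is
another question. A one-line illustration:

    # DIGIT ZERO + VARIATION SELECTOR-1 requests the "short diagonal stroke"
    # form listed in StandardizedVariants.txt; rendering depends on the font.
    zero_short_stroke = "\u0030\uFE00"
    print(zero_short_stroke, len(zero_short_stroke))   # two code points, one glyph if supported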

It is not a solution for creating superscripts on any letters and
marking that they should be rendered as superscript (notably, the base
letter to transform into superscript may also have its own combining
diacritics, that must be encoded explicitly, and if you use the
variation selector, it should allow variation on the presence or
absence of the underline, which must then be encoded explicitly as a
combining character).


I totally agree that abbreviation indicating superscript should not be
encoded using variation selectors, as already stated I don’t prefer it.


So finally what we get with variation selectors is: <letter,
variation selector, combining diacritic> and <letter precombined
with the diacritic, variation selector> which is NOT
canonically equivalent.


That seems to me like a flaw in canonical equivalence. Variations must
be canonically equivalent, and the variation selector position should
be handled or parsed accordingly. Personally I’m unaware of this rule.


Using a combining character avoids this caveat: <letter,
combining diacritic, combining abbreviation mark> and <letter
precombined with the diacritic, combining abbreviation mark> which
ARE canonically equivalent. And this explicitly states the semantic
(something that is lost if we are forced to use presentational
superscripts in a higher level protocol like HTML/CSS for rich text
format, and one just extracts the plain text; using collation will
not help at all, except if collators are built with preprocessing
that will first infer the presence of a <combining abbreviation mark>
to insert after each combining sequence of the plain-text enclosed in
an italic style).


That exactly outlines my concern with calls for relegating superscript
as an abbreviation indicator to higher level protocols like HTML/CSS.


There's little risk: if the <combining abbreviation mark> is not
mapped in fonts (or not recognized by text renderers to create
synthetic superscript glyphs from existing recognized clusters), it
will render as a visible .notdef (tofu). But normally text renderers
recognize the basic properties of characters in the UCD and can see
that <combining abbreviation mark> has a combining mark general
property (it also knows that it has a 0 combining class, so
canonical equivalences are not broken) to render a better symbol
than the .notdef "tofu": it should better render a dotted circle.
Even if this tofu or dotted circle is rendered, it still explicitly
marks the presence of the abbreviation mark, so there's less
confusion about what is preceding it (the combining sequence that was
supposed to be superscripted).


The problem with the <combining abbreviation mark> you are proposing
is that it contradicts streamlined implementation as well as easy
input of current abbreviations like ordinal indicators in French and,
optionally, in English. Preformatted superscripts are already widely
implemented, and coding of "4ᵉ" only needs two characters, input
using only three fingers in two steps (thumb on AltGr, press key
E04 then E12) with an appropriately programmed layout driver. I’m
afraid that the solution with <combining abbreviation mark> would be
much less straightforward.


The <combining abbreviation mark> can also have its own <variation
selector> to select other styles when they are optional, such as
adding underlines to the superscripted letter, or rendering the
letter instead as underscript, or as a small baseline letter with a
dot after it: this is still an explicit abbreviation mark, and the
meaning of the plain text is still preserved: the variation selector
is only suitable to alter the rendering of a cluster when it has
effectively several variants and the default rendering is not
universal, notably across font styles initially designed for specific
markets with their own local preferences: the variation selector
still allows the same fonts to map all known variants distinctly,
independently of the initial arbitrary choice of the default glyph
used when the variation selector is missing.


I don’t think German users would welcome being directed to input a
<combining abbreviation mark> plus a <variation selector> instead of
a period.


Even if fonts (or text renderers) may map the <combining abbreviation
mark> to variable glyphs, this is purely stylistic, the semantic of
the plain text is not lost because the <combining abbreviation mark>
is still there. There's no need of any rich text to encode it (the
rich-text styles are not explicitly encoding that a superscript is
actually an abbreviation mark, so it cannot also allow variation like
rendering an underscript, or a baseline small glyph with an added
dot). Typically a <combining abbreviation mark> used in an English
style would