Re: use vs mention (was: second attempt)
On Thu, 1 Nov 2018 07:46:40 +0000 Richard Wordingham via Unicode wrote:

> On Wed, 31 Oct 2018 23:35:06 +0100 Piotr Karocki via Unicode wrote:
>
> > These are only examples of changes in meaning with <l> or <I>,
> > not all of these examples can really exist - but, then, another
> > question: can we know what author means? And as carbon and iodine
> > cannot exist, then of course CI should be interpreted as carbon on
> > first oxidation?
>
> Are you sure about the non-existence? Some pretty weird
> chemical species exist in interstellar space.

It's not interstellar, but CI is the empirical formula for diiodoethyne
and its isomer iodoiodanuidylethyne, and the CI⁻ ion has PubChem CID
59215341.

Richard.
Re: UCA unnecessary collation weight 0000
On Fri, 2 Nov 2018 14:27:37 -0700 Ken Whistler via Unicode wrote:

> On 11/2/2018 10:02 AM, Philippe Verdy via Unicode wrote:
>
> > UTR#10 still does not explicitly state that its use of "0000" does
> > not mean it is a valid "weight", it's a notation only
>
> No, it is explicitly a valid weight. And it is explicitly and
> normatively referred to in the specification of the algorithm. See
> UTS10-D8 (and subsequent definitions), which explicitly depend on a
> definition of "A collation weight whose value is zero." The entire
> statement of what are primary, secondary, tertiary, etc. collation
> elements depends on that definition. And see the tables in Section
> 3.2, which also depend on those definitions.

The definition is defective in that it doesn't handle 'large weight
values' well. There is the anomaly that a mapping of a collating element
to [1234.0000.0000][0200.020.002] may be compatible with WF1, but the
exactly equivalent mapping to [1234.020.002][0200.0000.0000] makes the
table ill-formed.

The fractional weight definitions for the UCA eliminate this '0000'
notion quite well, and I once expected the UCA to move to the CLDRCA
(CLDR Collation Algorithm) fractional weight definition. The definition
of the CLDRCA does a much better job of explaining 'large weight
values': it turns them from something exceptional into a normal part of
its functioning.

> > (but the notation is used for TWO distinct purposes: one is for
> > presenting the notation format used in the DUCET
>
> It is *not* just a notation format used in the DUCET -- it is part of
> the normative definitional structure of the algorithm, which then
> percolates down into further definitions and rules and the steps of
> the algorithm.

It's not needed for the CLDRCA! The statement of the UCA algorithm does
depend on its notation, but it can be recast to avoid these zero
weights.

Richard.
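Richard's equivalence claim is easy to check mechanically. A minimal sketch, using the weights from his example and the generic UCA key-building convention (append non-zero weights level by level, with a separator); this is not any particular implementation:

```python
# A minimal sketch (illustrative weights, not a real DUCET entry) of how
# UCA sort keys are formed: append the non-zero weights of each level in
# turn, with a level separator in between.  Zero weights contribute
# nothing, which is why the two mappings below are "exactly equivalent".

LEVELS = 3

def sort_key(collation_elements):
    key = []
    for level in range(LEVELS):
        if level > 0:
            key.append(0)  # level separator, lower than any real weight
        for ce in collation_elements:
            if ce[level] != 0:
                key.append(ce[level])
    return tuple(key)

# The "large weight value" mapping that WF1 accepts:
a = [(0x1234, 0x0000, 0x0000), (0x0200, 0x020, 0x002)]
# The equivalent mapping that makes the table ill-formed under WF1:
b = [(0x1234, 0x020, 0x002), (0x0200, 0x0000, 0x0000)]

assert sort_key(a) == sort_key(b) == (0x1234, 0x0200, 0, 0x020, 0, 0x002)
```

Both mappings contribute the same non-zero weights at every level, so the resulting keys are identical even though only one of them satisfies WF1.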
mail attribution (was: A sign/abbreviation for "magister")
On Thu, Nov 01 2018 at 6:43 -0700, Asmus Freytag via Unicode wrote:

> On 11/1/2018 12:52 AM, Richard Wordingham via Unicode wrote:
> > On Wed, 31 Oct 2018 11:35:19 -0700
> > Asmus Freytag via Unicode wrote:
[...]
> > Unfortunately, your emails are extremely hard to read in plain text.
> > It is even difficult to tell who wrote what.

My previous mail is unfortunately an example.

> Not sure why that is. After they make the round trip, they look fine
> to me.

When displaying your HTML mail, Emacs Gnus doesn't show the attributions
correctly. If I forget to edit them by hand when replying, we get
confusion like that in my previous mail. I guess I should submit this as
a bug or feature request to the Emacs developers. Perhaps Richard
Wordingham should do the same for the mail agent he uses.

Best regards

Janusz

--
Janusz S. Bien
emeryt (emeritus)
https://sites.google.com/view/jsbien
Re: A sign/abbreviation for "magister"
Asmus Freytag wrote,

> Alphabetic script users' handwriting does not match
> print in all features. Traditional German handwriting
> used a line like a macron over the letter 'u' to
> distinguish it from 'n'. Rendering this with a
> u-macron in print would be the height of absurdity.

If German text were displayed with a traditional German handwriting
(cursive) font, then every "u" would display with a macron. (Except the
ones with umlauts.) That's because the macron is part and parcel of the
identity of the stylistic variant (cursive) of the letter, not because
the addition of the macron makes a stylistic variation. It would indeed
be silly to encode such macrons in data derived from a traditional
German handwriting specimen. Hopefully most everyone here agrees with
that.

We all seem to accept that, for example, d = d = d = d, even when the
last one is rendered in a cursive font (e.g. face="MyCursiveFont"). We
don't all seem to agree that d # d̲. Or that "Mr." # "Mr" # "Mʳ" # "Mʳ͇"
# "M:r".
Re: A sign/abbreviation for "magister"
Julian Bradfield wrote,

> > consists of three recognizable symbols. An "M", a superscript
> > "r", and an equal sign (= two lines). It can be printed, handwritten,
>
> That's not true. The squiggle under the r is a squiggle - it is a
> matter of interpretation (on which there was some discussion a hundred
> messages up-thread or so :) whether it was intended to be "=".

I recall Asmus pointing out that the Z-like squiggle was likely a
handwritten "=" and that there was some agreement to this, but didn't
realize that it was in dispute. FWIW, I agree that the squiggle which
looks kind of like "こ" is simply the cursive, stylistic variant of "=",
especially when written quickly.

> Just as it is a matter of interpretation whether the superscript and
> squiggle were deeply meaningful to the writer, or whether they were
> just a stylistic flourish for Mr.

A third possibility is that the double-underlined superscript was a
writing/spelling convention of the time for writing/spelling
abbreviations. Even if someone produced contemporary Polish manuscripts
abbreviating magister as "Mr", it could be argued that the two writers
were simply using different conventions.
Re: A sign/abbreviation for "magister"
On 11/2/2018 4:31 AM, James Kass via Unicode wrote:

> Suppose someone found a hundred year old form from Poland which
> included a section for "sign your name" and "print your name" which
> had been filled out by a man with the typically Polish name of Bogus
> McCoy? And he was a Magister, to boot! And proud of it. If he signed
> the magister abbreviation using double-underlined superscript and
> likewise his surname *and* printed it the same way -- it might still
> be arguable as to whether it was a writing/spelling or a stylish
> distinction, I suppose. But if he signed using double-underlined
> superscripts and printed using baseline lower case Latin letters,
> *that* might be persuasive. Doesn't seem likely, though, does it?
> (Bogusław is a legitimate Polish masculine given name. Its nickname is
> Bogus. McCoy is not, however, a typical Polish surname. The snarky
> combination of "Bogus McCoy" was irresistible to someone of my
> character and temperament. "Bogus" is American slang for fake and
> "McCoy" connotes being genuine, as in "the real McCoy".)

Where a contemporaneous printed form of a writing system exists, it
appears Unicode will generally base encoding decisions on it and not on
handwritten forms. Like the case we discussed a few posts above about
German, any differences in appearance typical for the handwritten form
would be handled by styling (e.g. selection of a "handwriting" font).

To transcribe the postcard would mean selecting the characters
appropriate for the printed equivalent of the text. If the printed form
had a standard way of superscripting letters with a decoration below
when used for abbreviations, then, and only then would we start
discussing whether this decoration needs to be encoded, or whether it is
something a font can supply as part of rendering the (sequence of)
superscripted letters. (Perhaps with the aid of markup identifying the
sequence as abbreviation.)
All else is just applying visual hacks to simulate a specific appearance, at the possible cost of obscuring the contents. A./
Re: UCA unnecessary collation weight 0000
You may not like the format of the data, but you are not bound to it. If
you don't like the data format (eg you want [.0021.0002] instead of
[.0000.0021.0002]), you can transform it however you want as long as you
get the same answer, as it says here:

http://unicode.org/reports/tr10/#Conformance
“The Unicode Collation Algorithm is a logical specification.
Implementations are free to change any part of the algorithm as long as
any two strings compared by the implementation are ordered the same as
they would be by the algorithm as specified. Implementations may also
use a different format for the data in the Default Unicode Collation
Element Table. The sort key is a logical intermediate object: if an
implementation produces the same results in comparison of strings, the
sort keys can differ in format from what is specified in this document.
(See Section 9, Implementation Notes.)”

That is what is done, for example, in ICU's implementation. See
http://demo.icu-project.org/icu-bin/collation.html and turn on "raw
collation elements" and "sort keys" to see the transformed collation
elements (from the DUCET + CLDR) and the resulting sort keys.

a => [29,05,_05] => 29 , 05 , 05 .
a\u0300 => [29,05,_05][,8A,_05] => 29 , 45 8A , 06 .
à =>
A\u0300 => [29,05,u1C][,8A,_05] => 29 , 45 8A , DC 05 .
À =>

Mark

On Fri, Nov 2, 2018 at 12:42 AM Philippe Verdy via Unicode
<unicode@unicode.org> wrote:

> As well, step 2 of the algorithm speaks about a single "array" of
> collation elements. Actually it's best to create one separate array
> per level, and append weights for each level in the relevant array for
> that level.
> The steps S2.2 to S2.4 can do this, including for derived collation
> elements in section 10.1, or variable weighting in section 4.
> This also means that for fast string compares, the primary weights can
> be processed on the fly (without needing any buffering) if the primary
> weights are different between the two strings (including when one or
> both of the two strings ends, and the secondary or tertiary weights
> detected until then have not found any weight higher than the minimum
> weight value for each level).
> Otherwise:
> - the first secondary weight higher than the minimum secondary weight
> value, and all subsequent secondary weights, must be buffered in a
> secondary buffer.
> - the first tertiary weight higher than the minimum tertiary weight
> value, and all subsequent tertiary weights, must be buffered in a
> tertiary buffer.
> - and so on for higher levels (each buffer just needs to keep a
> counter, when it's first used, indicating how many weights were not
> buffered while processing and counting the primary weights, because
> these weights were all equal to the minimum value for the relevant
> level)
> - these secondary/tertiary/etc. buffers will only be used once you
> reach the end of the two strings when processing the primary level and
> no difference was found: you'll start by comparing the initial
> counters in these buffers, and the buffer that has the largest counter
> value is necessarily for the smaller compared string. If both counters
> are equal, then you start comparing the weights stored in each buffer,
> until one of the buffers ends before the other (the shorter buffer is
> for the smaller compared string). If both weight buffers reach the
> end, you use the next pair of buffers built for the next level and
> process them with the same algorithm.
>
> Nowhere will you ever need to consider any [.0000] weight, which is
> just a notation in the format of the DUCET intended only to be
> readable by humans but never needed in any machine implementation.
> Now if you want to create sort keys this is similar, except that you
> don't have two strings to process and compare; all you want is to
> create separate arrays of weights for each level: each level can be
> encoded separately, and the encoding must be made so that when you
> concatenate the encoded arrays, the first few encoded *bits* in the
> secondary or tertiary encodings cannot be larger than or equal to the
> bits used by the encoding of the primary weights (this only limits how
> you encode the 1st weight in each array, as its first encoding *bits*
> must be lower than the first bits used to encode any weight in
> previous levels).
>
> Nowhere are you required to encode weights exactly like their logical
> weight; this encoding is fully reversible and can use any suitable
> compression techniques if needed. As long as you can safely detect
> when an encoding ends, because it encounters some bits (with lower
> values) used to start the encoding of one of the higher levels, the
> compression is safe.
>
> For each level, you can reserve only a single code used to "mark" the
> start of another higher level, followed by some bits to indicate which
> level it is, then followed by the compressed code for the level, made
> so that each weight is encoded by a code not starting with the
> reserved mark. That encoding "mark"
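The per-level buffering idea quoted above can be sketched as follows. This is a hedged illustration, not ICU code: weights are gathered into one array per level in a single pass, and lower levels are consulted only when every higher level ties. The collation element values are modelled loosely on the ICU demo output quoted earlier in the thread, not real DUCET data:

```python
# Sketch of level-separated comparison: one weight array per level, with
# zero weights never stored, and lower levels consulted only on a tie.

def compare(ces1, ces2, levels=3):
    def level_arrays(ces):
        arrays = [[] for _ in range(levels)]
        for ce in ces:
            for level in range(levels):
                if ce[level] != 0:  # zero weights contribute nothing
                    arrays[level].append(ce[level])
        return arrays

    a1, a2 = level_arrays(ces1), level_arrays(ces2)
    for level in range(levels):
        if a1[level] != a2[level]:
            # Lexicographic comparison of the weight arrays at this level.
            return -1 if a1[level] < a2[level] else 1
    return 0

# "a" vs "a + combining grave": primaries tie, the secondary level decides.
a      = [(0x29, 0x05, 0x05)]
agrave = [(0x29, 0x05, 0x05), (0x00, 0x8A, 0x05)]
assert compare(a, agrave) == -1 and compare(agrave, a) == 1
assert compare(a, a) == 0
```

The skipped-minimum-weight counters Verdy describes are an additional compression on top of this; the sketch keeps only the core point that zero weights never need to be materialized.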
Re: A sign/abbreviation for "magister"
Suppose someone found a hundred year old form from Poland which included a section for "sign your name" and "print your name" which had been filled out by a man with the typically Polish name of Bogus McCoy? And he was a Magister, to boot! And proud of it. If he signed the magister abbreviation using double-underlined superscript and likewise his surname *and* printed it the same way -- it might still be arguable as to whether it was a writing/spelling or a stylish distinction, I suppose. But if he signed using double-underlined superscripts and printed using baseline lower case Latin letters, *that* might be persuasive. Doesn't seem likely, though, does it? (Bogusław is a legitimate Polish masculine given name. Its nickname is Bogus. McCoy is not, however, a typical Polish surname. The snarky combination of "Bogus McCoy" was irresistible to someone of my character and temperament. "Bogus" is American slang for fake and "McCoy" connotes being genuine, as in "the real McCoy".)
Re: UCA unnecessary collation weight 0000
On Fri, 2 Nov 2018 14:54:19 +0100 Philippe Verdy via Unicode wrote:

> It's not just a question of "I like it or not". But the fact is that
> the standard makes the presence of 0000 required in some steps, and
> the requirement is in fact wrong: this is in fact NEVER required to
> create an equivalent collation order. These steps are completely
> unnecessary and should be removed.
>
> Le ven. 2 nov. 2018 à 14:03, Mark Davis ☕️ a écrit :
>
> > You may not like the format of the data, but you are not bound to
> > it. If you don't like the data format (eg you want [.0021.0002]
> > instead of [.0000.0021.0002]), you can transform it however you
> > want as long as you get the same answer, as it says here:
> >
> > http://unicode.org/reports/tr10/#Conformance
> > “The Unicode Collation Algorithm is a logical specification.
> > Implementations are free to change any part of the algorithm as
> > long as any two strings compared by the implementation are ordered
> > the same as they would be by the algorithm as specified.
> > Implementations may also use a different format for the data in the
> > Default Unicode Collation Element Table. The sort key is a logical
> > intermediate object: if an implementation produces the same results
> > in comparison of strings, the sort keys can differ in format from
> > what is specified in this document. (See Section 9, Implementation
> > Notes.)”

Given the above paragraph, how does the standard force you to use a
special 0000? Perhaps the wording of the standard can be changed to
prevent your unhappy interpretation.

> > That is what is done, for example, in ICU's implementation. See
> > http://demo.icu-project.org/icu-bin/collation.html and turn on "raw
> > collation elements" and "sort keys" to see the transformed collation
> > elements (from the DUCET + CLDR) and the resulting sort keys.
> >
> > a => [29,05,_05] => 29 , 05 , 05 .
> > a\u0300 => [29,05,_05][,8A,_05] => 29 , 45 8A , 06 .
> > à =>
> > A\u0300 => [29,05,u1C][,8A,_05] => 29 , 45 8A , DC 05 .
> > À =>

As you can see, Mark does not come to the same conclusion as you, and
nor do I.

Richard.
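The ICU demo output quoted in this exchange can be approximated by a short sketch. This is illustrative only: the separator value and weights are assumptions, and real ICU keys also compress runs of common weights, which this omits:

```python
# Sketch of what the ICU demo output shows: a collation element with a
# zero field contributes nothing at that level, and the sort key is the
# non-zero weights of each level joined by a low separator byte.

SEP = 0x01  # assumed level separator, lower than any real weight

def sort_key(ces, levels=3):
    key = []
    for level in range(levels):
        if level:
            key.append(SEP)
        key.extend(ce[level] for ce in ces if ce[level])
    return bytes(key)

# "a"       => one CE [29, 05, 05]
# "a\u0300" => adds a zero-primary continuation CE [00, 8A, 05]
a      = sort_key([(0x29, 0x05, 0x05)])
agrave = sort_key([(0x29, 0x05, 0x05), (0x00, 0x8A, 0x05)])
assert a == bytes([0x29, SEP, 0x05, SEP, 0x05])
assert agrave == bytes([0x29, SEP, 0x05, 0x8A, SEP, 0x05, 0x05])
assert a < agrave  # "a" sorts before "a" + combining grave
```

No zero weight ever reaches the key, yet the comparison results match what the logical specification with explicit 0000 weights would give.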
Re: A sign/abbreviation for "magister"
On 01/11/2018 16:43, Asmus Freytag via Unicode wrote:
[quoted mail]

> I don't think it's a joke to recognize that there is a continuum here
> and that there is no line that can be drawn which is based on
> straightforward principles. […] In this case, there is no such
> framework that could help establish pragmatic boundaries dividing the
> truly useful from the merely fanciful.

I think the red line was always between the positive and the negative
answer to the question whether a given graphic is relevant for the
legibility/readability of the plain text backbone. But humans can be
trained to mentally disambiguate a mass of confusables, so the line
vanishes and the continuum remains intact.

On 02/11/2018 06:22, Asmus Freytag via Unicode wrote:

> On 11/1/2018 7:59 PM, James Kass via Unicode wrote:
>
> > Alphabetic script users write things the way they are spelled and
> > spell things the way they are written. The abbreviation in question
> > as written consists of three recognizable symbols. An "M", a
> > superscript "r", and an equal sign (= two lines). It can be printed,
> > handwritten, or in fraktur; it will still consist of those same
> > three recognizable symbols. We're supposed to be preserving the
> > past, not editing it or revising it.
>
> Alphabetic script users' handwriting does not match print in all
> features. Traditional German handwriting used a line like a macron
> over the letter 'u' to distinguish it from 'n'. Rendering this with a
> u-macron in print would be the height of absurdity.
>
> I feel similarly about the assertion that the "two lines" are
> something that needs to be encoded, but only an expert would know for
> sure.

Indeed it would be relevant to know whether it is mandatory in Polish,
and I’m not an expert. But looking at several scripts using abbreviation
indicators as superscript, i.e.
Latin and Cyrillic (when using the Latin-script-written abbreviation of
"Numero", given Cyrillic for "N" is "Н", so it’s strictly speaking one
single script, and two scripts using it), then we can easily see how
single and double underlines are added or not depending on font design
and on customary writing and display. E.g. the Romance feminine and
masculine ordinal indicators have one or zero underlines, to such an
extent that French typography specifies that the masculine ordinal
indicator, despite being a superscript small o, is unfit to compose the
French "numéro" abbreviation, which must not have an underline. Hence
DEGREE SIGN is less bad than U+00BA.

If applying the same to Polish, "Magister" is "Mʳ" and is
straightforward to input when using a new French keyboard layout or an
enhanced variant of any national Latin one having small superscripts on
the Shift+Num level, or via a ‹superscript› dead key, mapped e.g. on
Shift + AltGr/Option + E or any of the 26 letter keys as mnemonically
convenient ("superscript" translates to French "exposant"); or via
‹Compose› ‹^› [e] (where the ASCII circumflex or caret is repurposed for
superscript compose sequences, while ‹circumflex accent› is active
*after* LESS-THAN SIGN, consistently with the *new* convention for
‹inverted breve› using LEFT PARENTHESIS rather than "g").

These details are posted in this thread on this List rather than
CLDR-USERS in order to make clear that typing superscript letters
directly via the keyboard is easy, and that therefore to propose it is
not to harass the end-user.

On 02/11/2018 13:09, Asmus Freytag via Unicode wrote:
[quoted mail]

> […] To transcribe the postcard would mean selecting the characters
> appropriate for the printed equivalent of the text.

As already suggested, selecting the variants can be done using variation
selectors, provided the Standard has defined the intended use case.
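As a concrete illustration of the ‹Compose› ‹^› idea described earlier in this message, a hypothetical ~/.XCompose fragment might look like the following. The mappings are assumptions for illustration, not an existing standard layout:

```
<Multi_key> <asciicircum> <r> : "ʳ"   U02B3  # MODIFIER LETTER SMALL R
<Multi_key> <asciicircum> <e> : "ᵉ"   U1D49  # MODIFIER LETTER SMALL E
<Multi_key> <asciicircum> <o> : "ᵒ"   U1D52  # MODIFIER LETTER SMALL O
```

With such a fragment, "Mʳ" is typed as M, Compose, ^, r.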
> If the printed form had a standard way of superscripting letters with
> a decoration below when used for abbreviations,

As already pointed out, the Latin script does not benefit from a
consensus to use an underline under superscripts. E.g. Italian,
Portuguese and Spanish do use the underline for superscript
abbreviations; English and French do not.

> then, and only then would we start discussing whether this decoration
> needs to be encoded, or whether it is something a font can supply as
> part of rendering the (sequence of) superscripted letters.

I think the problem is not completely outlined as long as the use of
variation sequences is not mentioned. There is no "all or nothing"
dilemma, given Unicode has the means of providing a standard way of
representing calligraphic variations using variation selectors. E.g. the
letter ENG is preferred in big-lowercase form when writing Bambara,
while other locales may like it in hooked-uppercase form. The Bambara
Arial font makes sure it is the right glyph, and Arial in general
follows the Bambara preference, but other fonts do not, while some of
them have the Bambara-fit glyph inside but don’t display it unless urged
by an OpenType-supporting renderer, with the appropriate settings turned
on, e.g. on a locale-identifier basis.

> (Perhaps with the aid of
Re: A sign/abbreviation for "magister"
On Fri, Nov 02, 2018 at 01:44:25PM +0000, Michael Everson via Unicode
wrote:

> I write my 7’s and Z’s with a horizontal line through them. Ƶ is
> encoded not for this purpose, but because Z and Ƶ are distinct in
> orthographies for varieties of Tatar, Chechen, Karelian, and
> Mongolian. This is a contemporary writing convention but it does not
> argue for a new SEVEN WITH STROKE character or that I should use Ƶ
> rather than Z when I write *Ƶanƶibar.

And that use conflicts with Ƶ ƶ being an allograph of Polish Ż ż, used
especially when marks above cap height are unwanted or when readability
is important (Żż is too similar to Źź). It also happened to be nicely
renderable with Z^H- z^H- vs Z^H' z^H' on printers which had backspace.
I unsuccessfully argued for such a variant in a "historical terminals"
font: https://github.com/rbanffy/3270font/issues/19

But in either case the difference is purely visual rather than semantic.
The latter still applies to _some_ uses of superscript, but not to the
mgr.

Meow!
--
⢀⣴⠾⠻⢶⣦⠀ Have you heard of the Amber Road? For thousands of years, the
⣾⠁⢰⠒⠀⣿⡁ Romans and co valued amber, hauled through the Europe over the
⢿⡄⠘⠷⠚⠋⠀ mountains and along the Vistula, from Gdańsk. To where it came
⠈⠳⣄ together with silk (judging by today's amber stalls).
Re: UCA unnecessary collation weight 0000
The table is the way it is because it is easier to process (and
comprehend) when the first field is always the primary weight, the
second is always the secondary, etc. Go ahead and transform the input
DUCET files as you see fit. The "should be removed" is your personal
preference. Unless we hear strong demand otherwise from major
implementers, people have better things to do than change their parsers
to suit your preference.

Mark

On Fri, Nov 2, 2018 at 2:54 PM Philippe Verdy wrote:

> It's not just a question of "I like it or not". But the fact is that
> the standard makes the presence of 0000 required in some steps, and
> the requirement is in fact wrong: this is in fact NEVER required to
> create an equivalent collation order. These steps are completely
> unnecessary and should be removed.
>
> Le ven. 2 nov. 2018 à 14:03, Mark Davis ☕️ a écrit :
>
> > You may not like the format of the data, but you are not bound to
> > it. If you don't like the data format (eg you want [.0021.0002]
> > instead of [.0000.0021.0002]), you can transform it however you want
> > as long as you get the same answer, as it says here:
> >
> > http://unicode.org/reports/tr10/#Conformance
> > “The Unicode Collation Algorithm is a logical specification.
> > Implementations are free to change any part of the algorithm as long
> > as any two strings compared by the implementation are ordered the
> > same as they would be by the algorithm as specified. Implementations
> > may also use a different format for the data in the Default Unicode
> > Collation Element Table. The sort key is a logical intermediate
> > object: if an implementation produces the same results in comparison
> > of strings, the sort keys can differ in format from what is
> > specified in this document. (See Section 9, Implementation Notes.)”
> >
> > That is what is done, for example, in ICU's implementation.
> > See http://demo.icu-project.org/icu-bin/collation.html and turn on
> > "raw collation elements" and "sort keys" to see the transformed
> > collation elements (from the DUCET + CLDR) and the resulting sort
> > keys.
> >
> > a => [29,05,_05] => 29 , 05 , 05 .
> > a\u0300 => [29,05,_05][,8A,_05] => 29 , 45 8A , 06 .
> > à =>
> > A\u0300 => [29,05,u1C][,8A,_05] => 29 , 45 8A , DC 05 .
> > À =>
> >
> > Mark
> >
> > On Fri, Nov 2, 2018 at 12:42 AM Philippe Verdy via Unicode
> > <unicode@unicode.org> wrote:
> >
> > > As well, step 2 of the algorithm speaks about a single "array" of
> > > collation elements. Actually it's best to create one separate
> > > array per level, and append weights for each level in the relevant
> > > array for that level.
> > > The steps S2.2 to S2.4 can do this, including for derived
> > > collation elements in section 10.1, or variable weighting in
> > > section 4.
> > >
> > > This also means that for fast string compares, the primary weights
> > > can be processed on the fly (without needing any buffering) if the
> > > primary weights are different between the two strings (including
> > > when one or both of the two strings ends, and the secondary or
> > > tertiary weights detected until then have not found any weight
> > > higher than the minimum weight value for each level).
> > > Otherwise:
> > > - the first secondary weight higher than the minimum secondary
> > > weight value, and all subsequent secondary weights, must be
> > > buffered in a secondary buffer.
> > > - the first tertiary weight higher than the minimum tertiary
> > > weight value, and all subsequent tertiary weights, must be
> > > buffered in a tertiary buffer.
> > > - and so on for higher levels (each buffer just needs to keep a
> > > counter, when it's first used, indicating how many weights were
> > > not buffered while processing and counting the primary weights,
> > > because these weights were all equal to the minimum value for the
> > > relevant level)
> > > - these secondary/tertiary/etc. buffers will only be used once you
> > > reach the end of the two strings when processing the primary level
> > > and no difference was found: you'll start by comparing the initial
> > > counters in these buffers, and the buffer that has the largest
> > > counter value is necessarily for the smaller compared string. If
> > > both counters are equal, then you start comparing the weights
> > > stored in each buffer, until one of the buffers ends before the
> > > other (the shorter buffer is for the smaller compared string). If
> > > both weight buffers reach the end, you use the next pair of
> > > buffers built for the next level and process them with the same
> > > algorithm.
> > >
> > > Nowhere will you ever need to consider any [.0000] weight, which
> > > is just a notation in the format of the DUCET intended only to be
> > > readable by humans but never needed in any machine implementation.
> > >
> > > Now if you want to create sort keys this is similar, except that
> > > you don't have two strings to process and compare; all you want is
> > > to create separate arrays of weights for each level: each level
> > > can be encoded separately, and the encoding must be made so that
> > > when you concatenate
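Mark's point that "you can transform it however you want" is simple in practice. A hedged sketch of such a build-time transformation, parsing allkeys.txt-style collation elements (syntax per the DUCET file format, '.' for non-variable and '*' for variable elements; the weight values are illustrative) into a zero-free representation:

```python
# Parse "[.0000.0035.0002]"-style collation elements into per-level
# dicts, dropping the zero fields entirely at build time.
import re

CE = re.compile(r'\[([.*])([0-9A-Fa-f.]+)\]')

def parse_elements(field):
    """Return one {level: weight} dict per collation element,
    omitting zero weights."""
    elements = []
    for _variable, weights in CE.findall(field):
        levels = [int(w, 16) for w in weights.split('.')]
        elements.append({lvl: w for lvl, w in enumerate(levels) if w})
    return elements

# A secondary-only element keeps no trace of its 0000 primary:
assert parse_elements('[.0000.0035.0002]') == [{1: 0x35, 2: 0x2}]
```

The parser still reads the 0000 fields (they keep the file format uniform, per Mark's first paragraph), but nothing downstream of it ever sees them.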
Re: A sign/abbreviation for "magister"
Michael Everson wrote: > I write my 7’s and Z’s with a horizontal line through them. Ƶ is > encoded not for this purpose, but because Z and Ƶ are distinct in > orthographies for varieties of Tatar, Chechen, Karelian, and > Mongolian. This is a contemporary writing convention but it does not > argue for a new SEVEN WITH STROKE character or that I should use Ƶ > rather than Z when I write *Ƶanƶibar. http://www.unicode.org/L2/L2018/18323-open-four.pdf -- Doug Ewell | Thornton, CO, US | ewellic.org
Re: A sign/abbreviation for "magister"
On Fri, Nov 02 2018 at 5:09 -0700, Asmus Freytag via Unicode wrote:
[...]

> To transcribe the postcard would mean selecting the characters
> appropriate for the printed equivalent of the text.

You seem to make implicit assumptions which are not necessarily true.
For me, to transcribe the postcard would mean to answer the needs of the
intended transcription users.

> If the printed form had a standard way of superscripting letters with
> a decoration below when used for abbreviations, then, and only then
> would we start discussing whether this decoration needs to be encoded,
> or whether it is something a font can supply as part of rendering the
> (sequence of) superscripted letters. (Perhaps with the aid of markup
> identifying the sequence as abbreviation).

As I wrote already some time ago on the list, the alternative "encoding
or using a specialized font" is a false one. These days texts are
encoded for processing (in particular searching); rendering is just a
kind of side effect.

On the other hand, whom do you mean by "we" and what do you mean by
"encoding"? If I guess correctly what you mean by these words, then you
are discussing an issue which was never raised by anybody (if I'm wrong,
please quote the relevant post). Again, it is not clear to me whom you
want to convince or inform.

> All else is just applying visual hacks

I don't mind hacks if they are useful and serve the intended purpose,
even if they are visual :-)

> to simulate a specific appearance,

As I said above, the appearance is not necessarily of primary
importance.

> at the possible cost of obscuring the contents.

It's for the users of the transcription to decide what is obscuring the
text and what, to the contrary, makes the transcription more readable
and useful.

Best regards

Janusz

--
Janusz S. Bien
emeryt (emeritus)
https://sites.google.com/view/jsbien
Re: A sign/abbreviation for "magister"
On Fri, 02 Nov 2018 08:38:45 -0700 Doug Ewell via Unicode wrote: > Do we have any other evidence of this usage, besides a single > handwritten postcard? What, beyond some of us actually employing it ourselves? I'm sure I've seen 'William' abbreviated in print to 'Wᵐ' with some mark below, but I couldn't lay my hands on an example. Richard.
Re: A sign/abbreviation for "magister"
Le ven. 2 nov. 2018 à 16:20, Marcel Schneider via Unicode
<unicode@unicode.org> a écrit :

> That seems to me a regression, after the front has moved in favor of
> recognizing that the Latin script needs preformatted superscripts. The
> use case is clear, as we have ª, º, and n° with degree sign, and so on
> as already detailed in long e-mails in this thread and elsewhere.
> There is no point in setting up or maintaining a Unicode policy
> stating otherwise, as such a policy would be inconsistent with
> longlasting and extremely widespread practice.

Using variation selectors is only appropriate for these existing
(preencoded) superscript letters ª and º so that they display the
appropriate (underlined or not underlined) glyph. It is not a solution
for creating superscripts on arbitrary letters, or for marking that a
letter should be rendered as superscript (notably, the base letter to
transform into superscript may also have its own combining diacritics,
which must be encoded explicitly; and if you use the variation selector,
it should allow variation on the presence or absence of the underline,
which must then be encoded explicitly as a combining character).

So finally what we get with variation selectors is a pair of encodings
of the same abbreviation which are NOT canonically equivalent. Using a
combining character, a combining <abbreviation mark>, avoids this
caveat: the resulting encodings ARE canonically equivalent.

And this explicitly states the semantics (something that is lost if we
are forced to use presentational superscripts in a higher-level protocol
like HTML/CSS for rich text format, and one just extracts the plain
text; using collation will not help at all, except if collators are
built with preprocessing that will first infer the presence of an
<abbreviation mark> to insert after each combining sequence of the plain
text enclosed in a superscript style).

There's little risk: if the <abbreviation mark> is not mapped in fonts
(or not recognized by text renderers to create synthetic superscript
glyphs from existing recognized clusters), it will render as a visible
.notdef (tofu).
But normally text renderers recognize the basic properties of characters in the UCD and can see that the <abbreviation mark> has a combining mark general category (they also know that it has combining class 0, so canonical equivalences are not broken), so they can render a better symbol than the .notdef "tofu": they should rather render a dotted circle. Even if this tofu or dotted circle is rendered, it still explicitly marks the presence of the abbreviation mark, so there's less confusion about what precedes it (the combining sequence that was supposed to be superscripted). The <abbreviation mark> can also have its own <variation selector> to select other styles when they are optional, such as adding underlines to the superscripted letter, or rendering the letter instead as a subscript, or as a small baseline letter with a dot after it: this is still an explicit abbreviation mark, and the meaning of the plain text is still preserved. The variation selector is only suitable to alter the rendering of a cluster when it effectively has several variants and the default rendering is not universal, notably across font styles initially designed for specific markets with their own local preferences: the variation selector still allows the same fonts to map all known variants distinctly, independently of the initial arbitrary choice of the default glyph used when the variation selector is missing. Even if fonts (or text renderers) map the <abbreviation mark> to variable glyphs, this is purely stylistic; the semantics of the plain text are not lost, because the <abbreviation mark> is still there. There's no need of any rich text to encode it (rich-text styles do not explicitly encode that a superscript is actually an abbreviation mark, so they also cannot allow variations like rendering a subscript, or a baseline small glyph with an added dot). Typically an <abbreviation mark> used in an English style would render the letter (or cluster) before it as a "small" letter without any added dot. 
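The fallback reasoning above (a renderer consulting UCD properties of an unknown mark) can be checked mechanically. Since the proposed <abbreviation mark> is hypothetical, this sketch uses U+034F COMBINING GRAPHEME JOINER as a stand-in: a real character with the same key properties (general category Mn, combining class 0):

```python
import unicodedata

# U+034F COMBINING GRAPHEME JOINER: stand-in for the hypothetical
# <abbreviation mark> -- general category Mn, combining class 0.
CGJ = "\u034f"

def renderer_fallback_hints(ch):
    """Return the UCD properties a renderer would consult for fallback."""
    return {
        "category": unicodedata.category(ch),          # 'Mn' => combining mark
        "combining_class": unicodedata.combining(ch),  # 0 => not reordered
    }

print(renderer_fallback_hints(CGJ))
# {'category': 'Mn', 'combining_class': 0}
```

A renderer seeing category Mn with no glyph available can thus fall back to a dotted-circle presentation rather than a plain tofu box.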
So I really think that a <combining abbreviation mark> is far better than: * using preencoded superscript letters (they don't cover the full repertoire of clusters where the abbreviation is needed; coverage is currently just Basic Latin, the ten digits, plus and minus signs, and the dot or comma, plus a few other letters and stops; it's impossible to re-encode the full Unicode repertoire and all its allowed combining sequences or extended default grapheme clusters!), * or using variation selectors to make letters appear as superscripts (this does not work with clusters containing other diacritics such as accents), * or using rich-text styling, from which you cannot safely infer any semantics (there is no warranty that M<sup>r</sup> in HTML is actually an abbreviation of "Mister"; in HTML the abbreviation is encoded elsewhere, e.g. as <abbr title="Mister">Mr</abbr> or <abbr title="Mister">M<sup>r</sup></abbr>, so the semantics have to be looked up in a possible container element, and the meaning of the abbreviation found inside its title attribute; obviously this requires complex preprocessing before we can infer a plain-text version, suitable for example in plain-text searches where you don't want to match a mathematical
Re: UCA unnecessary collation weight 0000
I was replying not about the notational representation of the DUCET data table (using 0000 weights unnecessarily) but about the text of UTR#10 itself, which remains highly confusing, contains completely unnecessary steps, and just complicates things with absolutely no benefit at all by introducing confusion about these "0000". UTR#10 still does not explicitly state that its use of "0000" does not mean it is a valid "weight"; it's a notation only (but the notation is used for TWO distinct purposes: one is the notation format used in the DUCET itself to present how collation elements are structured, the other is marking the presence of a possible, but not always required, encoding of an explicit level separator for encoding sort keys). UTR#10 is still needlessly confusing. Even the example tables can be made without using these "0000" (for example, in tables showing how to build sort keys, the list of weights can be split into separate columns, one column per level, without any "0000"). The implementation does not necessarily have to create a buffer containing all weight values in a row, when separate buffers for each level are far superior (and even more efficient, as this can save space in memory). The step "S3.2" in the UCA algorithm should not even be there (it is made in favor of a specific implementation which is not even efficient or optimal); it complicates the algorithm with absolutely no benefit at all: you can ALWAYS remove it completely and this still generates equivalent results. On Fri, Nov 2, 2018 at 3:23 PM, Mark Davis ☕️ wrote: > The table is the way it is because it is easier to process (and > comprehend) when the first field is always the primary weight, the second is > always the secondary, etc. > > Go ahead and transform the input DUCET files as you see fit. The "should > be removed" is your personal preference. 
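The "one column per level, without any 0000" presentation Verdy describes can be sketched in Python. The weight tuples and function name here are illustrative, not from UTS #10:

```python
def per_level_columns(collation_elements):
    """Split a list of collation elements (tuples of weights, one per
    level) into one column of nonzero weights per level; the zero
    weights simply never appear in any column."""
    levels = len(collation_elements[0])
    columns = [[] for _ in range(levels)]
    for element in collation_elements:
        for lvl, weight in enumerate(element):
            if weight != 0:          # zero weights are dropped
                columns[lvl].append(weight)
    return columns

# 'a' + a secondary-only element, with toy weights in the [.p.s.t] shape
elements = [(0x1C47, 0x0020, 0x0002), (0x0000, 0x0035, 0x0002)]
print(per_level_columns(elements))
# [[7239], [32, 53], [2, 2]]
```

Comparing two strings column by column (all primaries first, then all secondaries, and so on) yields the same order as comparing concatenated sort keys with level separators.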
Unless we hear strong demand > otherwise from major implementers, people have better things to do than > change their parsers to suit your preference. > > Mark > > > On Fri, Nov 2, 2018 at 2:54 PM Philippe Verdy wrote: > >> It's not just a question of "I like it or not". But the fact is that the >> standard makes the presence of 0000 required in some steps, and the >> requirement is in fact wrong: this is in fact NEVER required to create an >> equivalent collation order. These steps are completely unnecessary and >> should be removed. >> >> On Fri, Nov 2, 2018 at 2:03 PM, Mark Davis ☕️ wrote: >> >>> You may not like the format of the data, but you are not bound to it. If >>> you don't like the data format (e.g. you want [.0021.0002] instead of >>> [.0000.0021.0002]), you can transform it however you want as long as you >>> get the same answer, as it says here: >>> >>> http://unicode.org/reports/tr10/#Conformance >>> “The Unicode Collation Algorithm is a logical specification. >>> Implementations are free to change any part of the algorithm as long as any >>> two strings compared by the implementation are ordered the same as they >>> would be by the algorithm as specified. Implementations may also use a >>> different format for the data in the Default Unicode Collation Element >>> Table. The sort key is a logical intermediate object: if an implementation >>> produces the same results in comparison of strings, the sort keys can >>> differ in format from what is specified in this document. (See Section 9, >>> Implementation Notes.)” >>> >>> >>> That is what is done, for example, in ICU's implementation. See >>> http://demo.icu-project.org/icu-bin/collation.html and turn on "raw >>> collation elements" and "sort keys" to see the transformed collation >>> elements (from the DUCET + CLDR) and the resulting sort keys. >>> >>> a =>[29,05,_05] => 29 , 05 , 05 . >>> a\u0300 => [29,05,_05][,8A,_05] => 29 , 45 8A , 06 . >>> à => >>> A\u0300 => [29,05,u1C][,8A,_05] => 29 , 45 8A , DC 05 . 
>>> À => >>> >>> Mark >>> >>> >>> On Fri, Nov 2, 2018 at 12:42 AM Philippe Verdy via Unicode < >>> unicode@unicode.org> wrote: As well, step 2 of the algorithm speaks about a single "array" of collation elements. Actually it's best to create one separate array per level, and append the weights for each level to the relevant array for that level. Steps S2.2 to S2.4 can do this, including for derived collation elements in section 10.1, or variable weighting in section 4. This also means that for fast string compares, the primary weights can be processed on the fly (without needing any buffering) if the primary weights differ between the two strings (including when one or both of the two strings end, and the secondary or tertiary weights seen until then have not included any weight higher than the minimum weight value for each level). Otherwise: - the first secondary weight higher than the minimum secondary weight value, and all subsequent secondary weights, must be buffered in a
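The fast-compare idea described in this (truncated) message — decide on primary weights on the fly, and fall back to per-level weight lists only on a full primary tie — can be sketched in Python. The `(primary, secondary, tertiary)` tuple format and the function name are illustrative, not from UTS #10:

```python
from itertools import zip_longest

def stream_compare(elems_a, elems_b):
    """Compare two sequences of (primary, secondary, tertiary) collation
    elements. Primaries are compared as a stream, so a primary
    difference decides immediately; only on a full primary tie do we
    compare the per-level lists of lower weights."""
    primaries = lambda es: (e[0] for e in es if e[0] != 0)
    for pa, pb in zip_longest(primaries(elems_a), primaries(elems_b),
                              fillvalue=0):
        if pa != pb:                       # primary difference decides now
            return -1 if pa < pb else 1
    for lvl in (1, 2):                     # secondary, then tertiary
        xs = [e[lvl] for e in elems_a if e[lvl] != 0]
        ys = [e[lvl] for e in elems_b if e[lvl] != 0]
        if xs != ys:
            return -1 if xs < ys else 1
    return 0
```

Python's list comparison makes a strict prefix sort first, which matches the semantics of a 0000 level separator (lower than any real weight), so this orders strings the same way as comparing full sort keys.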
Re: A sign/abbreviation for "magister"
Do we have any other evidence of this usage, besides a single handwritten postcard? -- Doug Ewell | Thornton, CO, US | ewellic.org
Re: A sign/abbreviation for "magister"
On 31/10/2018 at 19:34, Asmus Freytag via Unicode wrote: On 10/31/2018 10:32 AM, Janusz S. Bień via Unicode wrote: > > Let me remind you what plain text is according to the Unicode glossary: > > Computer-encoded text that consists only of a sequence of code > points from a given standard, with no other formatting or structural > information. > > If you try to use this definition to decide what is and what is not a > character, you get a vicious circle. > > As mentioned already by others, there is no other generally accepted > definition of plain text. Being among those who argued that the “plain text” concept cannot—and therefore mustn’t—be used per se to disallow the use of a more or less restricted or extended set of characters in what is called “ordinary text”, I end up adding the following in case it might be of interest: This definition becomes tautological only when you try to invoke it in making encoding decisions, that is, if you couple it with the statement that only "elements of plain text" are ever encoded. I don’t think that Janusz S. Bień’s concern is about this definition being “tautological”. AFAICS the Unicode definition of “plain text” is quoted to back the assumption that it’s hard to use that concept to argue against the use of a given Unicode character in a given context, or to use it to kill a proposal for characters significant in natural languages. The reasoning is that the call not to use character X in plain text, while X is a legal Unicode character whose use is not discouraged for technical reasons, is as if “ordinary people” (a scare-quoted derivative of “ordinary text”) were told that X is not a Unicode character. That discourse is a “vicious circle” in that there is no limit to it until Latin script is pulled down to plain ASCII. As already well known, diacritics are handled by the rendering system and don’t need to be displayed as such in the plain text backbone. 
I don’t believe that the same applies to other scripts, but these are often not considered when the encoding of Latin preformatted letters is fought, given that superscripting seems to be proper to Latin, and originated from long-standing medieval practice and writing conventions. For that purpose, you need a number of other definitions of "plain text". Including the definition that plain text is the "backbone" to which you apply formatting and layout information. I personally believe that there are more 2D notations where it's quite obvious to me that what is "placed" is a text element. More like maps and music and less like a circuit diagram, where the elements are less text-like (I deliberately include symbols in the definition of text, but not any random graphical line art). All two-dimensional notations here (outside the parenthetical) use higher-level protocols; maps and diagrams are often vector graphics. But Unicode has striven to encode all needed plain text elements, such as symbols for maritime and weather maps. Even arrows of many possible shapes, including 3D-looking ones, have been encoded. While freehand (rather than “any random”) graphical art is out of scope, we have a lot of box drawing, used with appropriate fonts to draw e.g. layouts of keyboards above the relevant source code in plain text files (examples in XKB). As a side note: box drawing, while useful, is unduly neglected at the font level, even in the Code Charts, where the advance width, usually half an em, is inconsistent between different sorts of elements belonging to the same block. Another definition of plain text is that which contains the "readable content" of the text. As already discussed on this List, many documents in PDF have hard-to-read plain text backbones, even misleading Google Search, for the purpose of handling special glyphs (and, in some era, even special characters). As we've discussed here, this definition has edge cases; some content is traditionally left to styling. 
Many pre-Unicode traditions are found out there that stay in use, partly for technical reasons (mainly the lack of updated keyboard layouts), partly for consistency with accustomed ways of doing things. The argument that something is traditionally left to styling is all the more unconvincing. Even a letter that eventually became LATIN SMALL LETTER O E (Unicode 1.0) was composed on typewriters using the half-backspace, and was to be _left to styling_ when it was pulled out of the draft ISO/IEC 8859-1 by the fault of a Frenchman (name undisclosed for privacy). And we’ve been told on this List that the tradition of using styling (a special font) to display the additional Latin letters used to write Bambara survived. Example: some of the small words in some Scandinavian languages are routinely italicized to disambiguate their reading. Other languages use titlecase to achieve the same disambiguation. E.g. French titlecases the noun "Une", which means the "cover", not the indefinite article, and German did the same when "Ein(e)" is a numeral, but today,
Re: A sign/abbreviation for "magister"
On 2018-11-02, James Kass via Unicode wrote: > Alphabetic script users write things the way they are spelled and spell > things the way they are written. The abbreviation in question as > written consists of three recognizable symbols. An "M", a superscript > "r", and an equal sign (= two lines). It can be printed, handwritten, That's not true. The squiggle under the r is a squiggle - it is a matter of interpretation (on which there was some discussion a hundred messages up-thread or so :) whether it was intended to be = . Just as it is a matter of interpretation whether the superscript and squiggle were deeply meaningful to the writer, or whether they were just a stylistic flourish for Mr. -- The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.
Re: UCA unnecessary collation weight 0000
On 11/2/2018 10:02 AM, Philippe Verdy via Unicode wrote: I was replying not about the notational representation of the DUCET data table (using 0000 weights unnecessarily) but about the text of UTR#10 itself, which remains highly confusing, contains completely unnecessary steps, and just complicates things with absolutely no benefit at all by introducing confusion about these "0000". Sorry, Philippe, but the confusion that I am seeing introduced is what you are introducing to the unicode list in the course of this discussion. UTR#10 still does not explicitly state that its use of "0000" does not mean it is a valid "weight"; it's a notation only No, it is explicitly a valid weight. And it is explicitly and normatively referred to in the specification of the algorithm. See UTS10-D8 (and subsequent definitions), which explicitly depend on a definition of "A collation weight whose value is zero." The entire statement of what are primary, secondary, tertiary, etc. collation elements depends on that definition. And see the tables in Section 3.2, which also depend on those definitions. (but the notation is used for TWO distinct purposes: one is for presenting the notation format used in the DUCET It is *not* just a notation format used in the DUCET -- it is part of the normative definitional structure of the algorithm, which then percolates down into further definitions and rules and the steps of the algorithm. itself to present how collation elements are structured, the other one is for marking the presence of a possible, but not always required, encoding of an explicit level separator for encoding sort keys). That is a numeric value of zero, used in Section 7.3, Form Sort Keys. 
It is not part of the *notation* for collation elements, but instead is a magic value chosen for the level separator precisely because zero values from the collation elements are removed during sort key construction, so that zero is then guaranteed to be a lower value than any remaining weight added to the sort key under construction. This part of the algorithm is not rocket science, by the way! UTR#10 is still needlessly confusing. O.k., if you think so, you then know what to do: https://www.unicode.org/review/pri385/ and https://www.unicode.org/reporting.html Even the example tables can be made without using these "0000" (for example, in tables showing how to build sort keys, the list of weights can be split into separate columns, one column per level, without any "0000"). The implementation does not necessarily have to create a buffer containing all weight values in a row, when separate buffers for each level are far superior (and even more efficient, as this can save space in memory). The UCA doesn't *require* you to do anything particular in your own implementation, other than come up with the same results for string comparisons. That is clearly stated in the conformance clause of UTS #10. https://www.unicode.org/reports/tr10/tr10-39.html#Basic_Conformance The step "S3.2" in the UCA algorithm should not even be there (it is made in favor of a specific implementation which is not even efficient or optimal), That is a false statement. Step S3.2 is there to provide a clear statement of the algorithm, to guarantee correct results for string comparison. Section 9 of UTS #10 provides a whole lunch buffet of techniques that implementations can choose from to increase the efficiency of their implementations, as they deem appropriate. You are free to implement as you choose -- including techniques that do not require any level separators. 
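Ken's description of Section 7.3 sort key formation (drop the zero weights, insert a 0000 level separator between levels) can be sketched directly; the weight tuples here are toy values, not DUCET data:

```python
def form_sort_key(collation_elements):
    """UTS #10 Section 7.3-style sort key formation: for each level in
    turn, append the nonzero weights of that level, separating levels
    with a 0x0000 weight -- guaranteed lower than any real weight that
    remains after the zeroes are removed."""
    key = []
    levels = len(collation_elements[0])
    for lvl in range(levels):
        if lvl > 0:
            key.append(0x0000)   # level separator (step S3.2)
        key.extend(e[lvl] for e in collation_elements if e[lvl] != 0)
    return key

# 'a' + a secondary-only element, with toy weights
elements = [(0x1C47, 0x0020, 0x0002), (0x0000, 0x0035, 0x0002)]
print(form_sort_key(elements))
# [7239, 0, 32, 53, 0, 2, 2]
```

Because every zero weight was removed, the 0 separator is strictly smaller than anything else in the key, which is exactly why a string whose primary weights are a prefix of another's sorts first.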
You are, however, duly warned in: https://www.unicode.org/reports/tr10/tr10-39.html#Eliminating_level_separators that "While this technique is relatively easy to implement, it can interfere with other compression methods." it complicates the algorithm with absolutely no benefit at all; you can ALWAYS remove it completely and this still generates equivalent results. No, you cannot ALWAYS remove it completely. Whether or not your implementation can do so depends on what other techniques you may be using to increase performance, store shorter keys, or whatever else may be at stake in your optimization. If you don't like zeroes in collation, be my guest, and ignore them completely. Take them out of your tables, and don't use level separators. Just make sure you end up with conformant results for comparison of strings when you are done. And in the meantime, if you want to complain about the text of the specification of UTS #10, then provide carefully worded alternatives as suggestions for improvement to the text, rather than just endlessly ranting about how the standard is confusing because the 0000 collation weight is "unnecessary". --Ken
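Ken's warning can be made concrete with a counterexample. With toy weights (not DUCET values) where a secondary weight overlaps the primary range, naively dropping the level separators reverses an ordering:

```python
def key_with_seps(elems):
    """Sort key with a 0 level separator between levels (toy 3-level form)."""
    key = []
    for lvl in range(3):
        if lvl:
            key.append(0)
        key.extend(e[lvl] for e in elems if e[lvl])
    return key

def key_without_seps(elems):
    """Same key but with the level separators naively eliminated."""
    key = []
    for lvl in range(3):
        key.extend(e[lvl] for e in elems if e[lvl])
    return key

# Toy weights: secondary 0x90 overlaps the primary range (0x40-0x80),
# which is exactly the situation separator elimination must avoid.
x = [(0x40, 0x90, 0x02)]                       # one element, "heavy" secondary
y = [(0x40, 0x20, 0x02), (0x41, 0x20, 0x02)]  # two elements

assert key_with_seps(x) < key_with_seps(y)         # correct: x sorts first
assert key_without_seps(x) > key_without_seps(y)   # separators dropped: reversed!
```

With separators, x's key hits the 0 separator where y still has a primary weight, so x correctly sorts first; without them, x's large secondary weight is compared against y's second primary and wins. This is why UTS #10 only permits eliminating separators when weight ranges per level are kept disjoint.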
Re: A sign/abbreviation for "magister"
On 02/11/2018 17:45, Philippe Verdy via Unicode wrote: [quoted mail] Using variation selectors is only appropriate for these existing (preencoded) superscript letters ª and º so that they display the appropriate (underlined or not underlined) glyph. And it is for forcing the display of DIGIT ZERO with a short stroke: 0030 FE00; short diagonal stroke form; # DIGIT ZERO https://unicode.org/Public/UCD/latest/ucd/StandardizedVariants.txt From that it becomes unclear why the same isn’t applied to 4, 7, z and Z mentioned in this thread, to be displayed open or with a short bar. It is not a solution for creating superscripts on any letters and marking that they should be rendered as superscript (notably, the base letter to transform into superscript may also have its own combining diacritics, which must be encoded explicitly, and if you use the variation selector, it should allow variation on the presence or absence of the underline (which must then be encoded explicitly as a combining character). I totally agree that abbreviation-indicating superscript should not be encoded using variation selectors; as already stated, I don’t prefer it. So finally what we get with variation selectors is: <base letter, variation selector, combining diacritic> and <letter precombined with the diacritic, variation selector> which is NOT canonically equivalent. That seems to me like a flaw in canonical equivalence. Variations must be canonically equivalent, and the variation selector position should be handled or parsed accordingly. Personally I’m unaware of this rule. Using a combining character avoids this caveat: <base letter, combining diacritic, combining abbreviation mark> and <letter precombined with the diacritic, combining abbreviation mark> which ARE canonically equivalent. 
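The (non-)equivalence claims in this exchange can be verified with Python's unicodedata. Here U+FE00 is the variation selector, and U+034F COMBINING GRAPHEME JOINER (combining class 0) stands in for the hypothetical <combining abbreviation mark>:

```python
import unicodedata as ud

VS = "\ufe00"    # VARIATION SELECTOR-1
MARK = "\u034f"  # COMBINING GRAPHEME JOINER, ccc=0 stand-in for the
                 # proposed <combining abbreviation mark> (hypothetical)

# <o, VS, combining acute> vs <o-acute precomposed, VS>:
# the intervening VS blocks composition, so the two sequences are
# NOT canonically equivalent.
decomposed_vs = ud.normalize("NFC", "o" + VS + "\u0301")
precomposed_vs = ud.normalize("NFC", "\u00f3" + VS)
assert decomposed_vs != precomposed_vs

# <o, combining acute, MARK> vs <o-acute precomposed, MARK>:
# a trailing ccc=0 mark preserves canonical equivalence.
decomposed_mark = ud.normalize("NFC", "o" + "\u0301" + MARK)
precomposed_mark = ud.normalize("NFC", "\u00f3" + MARK)
assert decomposed_mark == precomposed_mark
```

This is exactly Verdy's point: placing the variation selector between base and diacritic splits the canonical equivalence class, while a trailing combining mark does not.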
And this explicitly states the semantics (something that is lost if we are forced to use presentational superscripts in a higher-level protocol like HTML/CSS for rich text format, and one just extracts the plain text; using collation will not help at all, except if collators are built with preprocessing that will first infer the presence of an <abbreviation mark> to insert after each combining sequence of the plain text enclosed in an italic style). That exactly outlines my concern with calls for relegating superscript as an abbreviation indicator to higher-level protocols like HTML/CSS. There's little risk: if the <abbreviation mark> is not mapped in fonts (or not recognized by text renderers to create synthetic superscripts from existing recognized clusters), it will render as a visible .notdef (tofu). But normally text renderers recognize the basic properties of characters in the UCD and can see that the <abbreviation mark> has a combining mark general category (they also know that it has combining class 0, so canonical equivalences are not broken) and render a better symbol than the .notdef "tofu": they should rather render a dotted circle. Even if this tofu or dotted circle is rendered, it still explicitly marks the presence of the abbreviation mark, so there's less confusion about what is preceding it (the combining sequence that was supposed to be superscripted). The problem with the <abbreviation mark> you are proposing is that it contradicts streamlined implementation as well as easy input of current abbreviations like ordinal indicators in French and, optionally, in English. Preformatted superscripts are already widely implemented, and coding of "4ᵉ" only needs two characters, input using only three fingers in two strokes (thumb on AltGr, press key E04 then E12) with an appropriately programmed layout driver. I’m afraid that the solution with <abbreviation mark> would be much less straightforward. 
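Marcel's two-character count for "4ᵉ" is easy to confirm; the preformatted superscript here is U+1D49 MODIFIER LETTER SMALL E:

```python
import unicodedata

# "4ᵉ" -- DIGIT FOUR followed by the preformatted superscript e
ordinal = "4\u1d49"
assert len(ordinal) == 2
assert unicodedata.name(ordinal[1]) == "MODIFIER LETTER SMALL E"
```

An <abbreviation mark> encoding would instead need base letter + mark (plus any variation selector), so the two-code-point claim only holds for the preformatted approach.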
The <abbreviation mark> can also have its own <variation selector> to select other styles when they are optional, such as adding underlines to the superscripted letter, or rendering the letter instead as a subscript, or as a small baseline letter with a dot after it: this is still an explicit abbreviation mark, and the meaning of the plain text is still preserved: the variation selector is only suitable to alter the rendering of a cluster when it effectively has several variants and the default rendering is not universal, notably across font styles initially designed for specific markets with their own local preferences: the variation selector still allows the same fonts to map all known variants distinctly, independently of the initial arbitrary choice of the default glyph used when the variation selector is missing. I don’t think German users would welcome being directed to input an <abbreviation mark> plus a <variation selector> instead of a period. Even if fonts (or text renderers) may map the <abbreviation mark> to variable glyphs, this is purely stylistic; the semantics of the plain text are not lost because the <abbreviation mark> is still there. There's no need of any rich text to encode it (the rich-text styles are not explicitly encoding that a superscript is actually an abbreviation mark, so it cannot also allow variations like rendering a subscript, or a baseline small glyph with an added dot). Typically an <abbreviation mark> used in an English style would