On 04/11/2018 20:19, Philippe Verdy via Unicode wrote:
[…]
> Even the mere fallback to render the <combining abbreviation mark> as
> a dotted circle (total absence of support) will not block completely
> reading the abbreviation:
> * you'll see "2e◌" (which is still better than only "2e", with
> minimal impact) instead of
> * "2◌" (which is worse! this is still what already happens when you
> use the legacy encoded <superscript e> which is also semantically
> ambiguous for text processing), or
> * "2e." (which is acceptable for rendering but ambiguous semantically
> for text processing)
I’m afraid the dotted circle instead of the .notdef box would be confusing.
> So compare things fairly: the solution I propose is EVEN MORE
> INTEROPERABLE than using <superscript Latin letters> (which is
> also impossible for noting all abbreviations as it is limited to just
> a few letters, and most of the time limited to only the few lowercase
> IPA symbols). It puts an end to the pressure to encode superscript
> letters.
Actually it encompasses all Latin lowercase base letters except q.
As for putting an end to that pressure, that is also possible by encoding
the missing ones once and for all. As already stated, until the opposite
is posted authoritatively to this List, Latin script is deemed the only
one making extensive use of superscript to denote abbreviations, due to
a strong and long-lasting medieval practice acting as a template on a few
natural languages, namely those enumerated so far, among which Polish.
> If you want to support other notations (e.g. in chemical or
> mathematical notations, where both superscript and subscript must be
> present and stack together, and where variation using a dot or
> similar is allowed) you need another encoding, and the existing
> legacy <superscript Latin letters> are not suitable either.
I don’t lobby to support mathematics with more superscripts, but for
sure UnicodeMath would be able to use them once the set is complete.
What I did for chemical notations is to point out that chemistry seems
to be disfavored compared to mathematics, because instead of peculiar
subscripts it uses subscript Greek small letters. Three of them, as
has been reported on this List. They are being refused because they
are letters of a script. If they were fancy symbols, they would be
encoded, like alchemical symbols and mathematical symbols are.
Further, on 04/11/2018 20:51, Philippe Verdy via Unicode wrote:
[…]
> Once again you need something else for these technical notations, but
> NOT the proposed <combining abbreviation mark>, and NOT EVEN the
> existing "modifier letters" <superscript letter X>, which were in
> fact first introduced only for IPA […]
> […] these letters are NOT conveying any semantic of an abbreviation,
> and this is also NOT the case for their usage as IPA symbols).
They do convey that semantic when used in a natural language that gives
superscripts the semantics of an abbreviation. Unicode does not encode
semantics, as TUS specifies.
> There's NO interoperability at all when taking **abusively** the
> existing "modifier letters" <superscript letter X> or <superscript
> digit> for use in abbreviations […].
The interoperability I mean is between formats and environments.
Interoperable in that sense is what is in the plain-text backbone.
> Keep these "modifier letters" or <superscript digit> or <superscript
> punctuation> for use as plain letters or plain digits or plain
> punctuation or plain symbols (including IPA) in natural languages.
That is what I’m suggesting to do: Superscript letters are plain
abbreviation indicators, notably ordinal indicators and indicators
in other abbreviations, used in natural languages.
> Anything else is abusive and should be considered only as "legacy"
> encoding, not recommended at all in natural languages.
Put "traditional" in the place of "legacy", and you will come close
to what is actually going on when palaeographic texts are encoded
using purposely encoded Latin superscripts. The same applies to
living languages, because it is interoperable and therefore meets
Unicode’s quality standards for digitally representing the world’s
languages.
Finally, on 04/11/2018 21:59, Philippe Verdy via Unicode wrote:
> I can take another example of what I call "legacy encoding" (which
> really means that such encoding is just an "approximation" from which
> no semantic can be clearly inferred, except by using a non-deterministic
> heuristic, which can frequently make "false guesses").
> Consider the case of the legacy Hangul "half-width" jamos: […]
> The same can be said about the heuristics that attempt to infer an
> abbreviation semantic from existing superscript letters (either
> encoded in Unicode, or encoded as plain letters modified by
> superscripting style in CSS or HTML, or in word processors for
> example): it fails to give the correct guess most of the time if
> there's no user to confirm the actual intended meaning.
I don’t agree: As opposed to baseline fallbacks, Unicode superscripts
allow the reader to parse the string as an abbreviation, and machines
can be programmed to act likewise.
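To sketch what such programming could look like: the snippet below is my own illustrative heuristic (not any standard algorithm), flagging a token whose final character is a modifier letter carrying a <super> compatibility decomposition, using only Python's stdlib unicodedata:

```python
import unicodedata

def ends_in_superscript_letter(token: str) -> bool:
    """Illustrative heuristic (my own assumption, not a standard
    algorithm): True if the token ends in a modifier letter
    (category Lm) whose compatibility decomposition is tagged
    <super>, e.g. U+02B3 MODIFIER LETTER SMALL R."""
    if not token:
        return False
    last = token[-1]
    return (unicodedata.category(last) == "Lm"
            and unicodedata.decomposition(last).startswith("<super>"))

print(ends_in_superscript_letter("M\u02b3"))  # "Mʳ" → True
print(ends_in_superscript_letter("Mr."))      # baseline fallback → False
```

A baseline fallback like "Mr." gives such a parser nothing to anchor on, which is exactly the contrast drawn above.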
> Such confirmation is the job of spell correctors in word processors:
> […] the user may type "Mr." then the wavy line will appear under
> these 3 characters, the spell checker will propose to encode it as an
> abbreviation "Mr<combining abbreviation mark>" or leave "Mr."
> unchanged (and no longer signaled) in which case the dot remains a
> regular punctuation, and the "r" is not modified. Then the user may
> choose to style the "r" with superscripting or underlining, and a new
> wavy red underline will appear below the three characters "M<styled
> r>.", proposing to only transform the <styled r> as <superscript r>
> or <r,combining underline> and even when the user accepts one of
> these suggestions it will remain "M<superscript r>." or
> "M<r,combining underline>." where it is still possible to infer the
> semantics of an abbreviation (propose to replace or keep the dot
> after it), or doing nothing else and cancel these suggestions (to
> hide the wavy red underline hint, added by the spell checker), or
> instruct the spell checker that the meaning of the superscript r is
> that of a mathematical exponent, or a chemical notation.
That mainly illustrates why <combining abbreviation mark> is not
interoperable. The input process seems to be too complicated. And if
a base letter is to be transformed into a formatted superscript, you
do need OpenType, much like with U+2044 FRACTION SLASH behaving as
intended, i.e. transforming the preceding digit string into formatted
numerator digits, and the following one into denominator digit glyphs.
In that, U+2044 acts as a format control, and so does the <combining
abbreviation mark> that you are suggesting to encode.
> In all cases, the user/author has full control of the intended
> meaning of his text and an informed decision is made where all cases
> are now distinguished. "Legacy" encoding can be kept as is (in
> Unicode), even if it's no longer recommended, just like Unicode has
> documented that half-width Hangul is deprecated (it just offers a
> "compatibility decomposition" for NFKD or NFKC, but this is lossy and
> cannot be done automatically without a human decision).
> And the user/author can now freely and easily compose any
> abbreviation he wishes in natural languages, without being limited by
> the reduced "legacy" set of <superscript letters> encoded in Unicode
Provided that the full Latin lowercase alphabet, and, for use in
all-caps settings, possibly the full Latin uppercase alphabet, are
encoded, I can see nothing of a limitation, given that these letters
have the grapheme cluster base property and therefore work with all
combining diacritics. That is already working with good font support,
as demonstrated in the parent thread.
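That property claim can be checked with Python's stdlib unicodedata: modifier letters are ordinary letters (General_Category Lm), so under UAX #29 they serve as grapheme cluster bases to which combining marks attach (a minimal check, with code points chosen by me):

```python
import unicodedata

sup_e = "\u1d49"   # MODIFIER LETTER SMALL E
acute = "\u0301"   # COMBINING ACUTE ACCENT

# Lm = modifier letter: a letter, not a format control, hence a
# valid grapheme cluster base under UAX #29.
print(unicodedata.category(sup_e))    # → Lm
# Mn with canonical combining class 230: stacks above its base.
print(unicodedata.category(acute))    # → Mn
print(unicodedata.combining(acute))   # → 230
```

So a sequence like <U+1D49, U+0301> segments as a single user-perceived character, which is what makes diacritics on superscript letters workable.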
> (which should no longer be extended, except for use as distinct plain
> letters needed in alphabets of actual natural languages, or as
> possibly new IPA symbols),
One should be able to overcome the pattern tagging superscripts as not
being “plain letters”, because that is irrelevant when they are used as
abbreviation indicators in natural languages, and as such are plain
characters, like e.g. the Romance ordinal indicators U+00AA and U+00BA;
see also the DEGREE SIGN hijacked as a substitute for <superscript o>,
because not superscripting the o in "nᵒ" is considered unacceptable.
> and without using the styling tricks (of
> HTML/CSS, or of word processor documents, spreadsheets, presentation
> documents allowing "rich text" formats on top of "plain text") which
> are best suitable for "free styling" of any human text, without any
> additional semantics, […]
Yes I fully agree, if “semantics” is that required for readability in
accordance with standard orthographies in use.
Best regards,
Marcel