Re: New Unicode Working Group: Message Formatting

2020-01-14 Thread Philippe Verdy via Unicode
People's names are NOT transliterated freely. It is up to each person to
document their romanized name; it should not be invented by automatic
processes. And frequently the official romanized name does not match the
original name in another script: this is very common for Chinese people,
as well as for trademarks.
There are also common but informal names, not always official but commonly
used in the press/media, and their spelling varies across
countries/languages. If these people are "well known" (notably historical
personalities or artists), they may have their own page in some Wikipedia
edition and in Wikidata.

There's no need to "translate" them, you'll use a database query to
retrieve names (including the preferred/most frequent one, the official
one). In some countries several orthographies may be used (e.g. for streets
named after people's: these names are not translatable, except if locally
the streets are multilingual: this is not a database of people names but a
geographic database for other purposes, even if these originate from people
they are still geographic names *derived* from people names).

For this you'll still use placeholders in the messages, and the value of
each placeholder may be queried in the relevant database for the relevant
target language; inflected forms of these names (e.g. genitives) may be
found there, but are not easily derived. If these are geographic names,
they may be transliterated, but there are competing standards for the
transliteration of toponyms, so you'll also need to tune your application
to select the romanization system relevant for the target language (the
international standards are language-neutral, but not relevant for specific
countries that have their own official terminology, or for the United
Nations, which needs to cite them in several official working languages),
if the geographic database does not already contain an official/preferred
romanization (there are also needs for transliteration from Latin to other
scripts).
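As a rough illustration only (assuming ICU4J is available), a generic,
language-neutral romanization can be obtained with a Transliterator; a real
application would instead select a specific transform, or a value already
stored in the geographic database, according to the target language and the
official/preferred romanization system:

    // Minimal sketch (assumption: ICU4J). "Any-Latin" is a generic,
    // language-neutral transform; registered variants (e.g. "Russian-Latin/BGN",
    // if available in the ICU build) select competing romanization standards.
    import com.ibm.icu.text.Transliterator;

    public class ToponymRomanization {
        public static void main(String[] args) {
            Transliterator toLatin = Transliterator.getInstance("Any-Latin");
            System.out.println(toLatin.transliterate("Санкт-Петербург"));
        }
    }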

Anyway, proper names are to be treated specially; there's nothing in a
message format API that can be used to select what the effective
replacement value of a placeholder will be. But the replacement may, or may
not, specify alternate forms for correct formatting when multiple forms are
possible (genitives, capitalisation, elisions and contextual mutations)
for the same selected name coming from an external database.

The MessageFormat API and translator tools should not have to manage the
external databases, which will be "translated" separately with enough forms
relevant for their presentation and composition into larger messages.
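As a rough sketch of that separation (assuming ICU4J's MessageFormat, which
supports named arguments), the formatting layer just substitutes whatever the
external database returned for the target language; it does not translate or
transliterate the name itself:

    // Minimal sketch (assumption: ICU4J). The pattern is translated separately,
    // one pattern per target language; the placeholder value comes from the
    // external name database (the lookup itself is not shown here).
    import com.ibm.icu.text.MessageFormat;
    import java.util.Locale;
    import java.util.Map;

    public class NamePlaceholder {
        public static void main(String[] args) {
            String pattern = "The street is named after {person}.";
            String person = "Pyotr Ilyich Tchaikovsky";  // value returned by the database query
            MessageFormat msg = new MessageFormat(pattern, Locale.ENGLISH);
            System.out.println(msg.format(Map.of("person", person)));
        }
    }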

Why does this group exist now in CLDR? Most probably because there are
already difficulties managing translations in the existing CLDR data (which
covers only a small part of what is translatable). CLDR is concerned with
only a few geographic items: countries, some subnational regions,
continents, and some cities used for time zones.

But the main problem is the proliferation of variant forms in CLDR, added
only for the few languages that need them, with no obvious fallback to the
common form used in most other languages that don't need that distinction,
or that need different kinds of distinctions (e.g. plural forms,
grammatical gender and personal gender not always matching, and
politeness/formality levels).

Once again I suggest you start contributing to a translation project and
experiment with it before continuing. Look at Wikimedia wikis (translation
templates, the Translate extension, and the companion translatewiki.net
wiki), Transifex, Google Translator, ResourceBundle and the formatting APIs
in Java, .po/.pot files for Gettext in many open source projects,
Facebook's translation tool, the internationalization APIs in Windows, iOS
and macOS, and the ICU library, which is the de facto base for CLDR...


Le mar. 14 janv. 2020 à 16:11, wjgo_10...@btinternet.com via Unicode <
unicode@unicode.org> a écrit :

> The reply from Mr Verdy has indeed been helpful, as indeed has also been
> an offlist private reply from someone who has, thus far, not been a
> participant in this thread.
>
>
> Mr Verdy wrote:
>
>
> > You seem to have never seen how translation packages work and are used
> in common projects (not just CLDR, but you could find them as well in
> Wikimedia projects, or translation packages for lot of open source
> packages).
>
> What seems to be the case to Mr Verdy is in fact the actual situation.
>
> I do not satisfy the second of the two conditions of the invitation to
> join the working group. I am, in fact, retired and I have never worked in
> the i18n/l10n industry. Also, from the explanations it is not as close to
> my research interests as I had thought, and indeed hoped. I just do what I
> can on my research project from time to time using a home computer, a
> personal webspace hosted by an internet service provider, some budget
> software, mainly High-Logic FontCreator, and Serif PagePlus desktop
> publishing package, together with the software 

Re: Geological symbols

2020-01-13 Thread Philippe Verdy via Unicode
It is possible with some other markup languages, including HTML, by using
ruby notation and other interlinear notations to create special vertical
layouts inside a horizontal line.
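Purely as an illustration of that ruby approach (a hypothetical sketch, not
a recommendation), the stacked "1 over 1" asked about below could be
approximated with markup along these lines (the markup is held in a Java
string here):

    public class StackedDigits {
        public static void main(String[] args) {
            // Hypothetical HTML ruby approximation: SUBSCRIPT ONE as the ruby base
            // with SUPERSCRIPT ONE annotated above it, then MODIFIER LETTER MACRON
            // and SUPERSCRIPT TWO. Rendering quality, line wrapping and line height
            // remain the renderer's problem, as discussed below.
            String stacked = "Q<ruby>\u2081<rt>\u00B9</rt></ruby>\u02C9\u00B2";
            System.out.println(stacked);
        }
    }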

There are difficulties, however, caused by line wraps, which may occur
before the vertical layout or even inside it for each stacked item, and by
managing the line height for the whole line. Finally you could end up with
the same problems as those found in mathematical formulas... and in
composing Egyptian hieroglyphs or Visible Speech, for which a markup
language has to be defined (with a convention, similar to an orthographic
or typographic convention) in addition to the core characters that are used
to build up the composition, and possibly some extra styling (to adjust the
size of individual items, or to align them properly in the stack and fit
them cleanly into the composition area, e.g. an ideographic square).
Further difficulties are added by bidirectionality.

Not all texts are purely linear (unidimensional), and a linear
representation is difficult to interpret without adding the markup syntax
inside the source text and sometimes even adding extra symbols (or
punctuation) to the linear composition, which would not be needed in a true
bidimensional layout. Unicode does not encode characters for the second
dimension or the layout, so it's up to markup languages (or orthographic
conventions) to define the extra semantics and/or layout. A font alone
cannot guess without these conventions, and even if these conventions are
used, the assumptions made may sometimes produce an incorrect layout.




Le lun. 13 janv. 2020 à 17:16, Oren Watson via Unicode 
a écrit :

> This is not possible in unicode plaintext as far as I can tell, since
> Unicode doesn't allow overstriking arbitrary characters over each other the
> way more advanced layout systems, e.g. LaTeX do. It is however possible to
> engineer a font to arrange those characters like that by using aggressive
> kerning.
>
>
> On Mon, Jan 13, 2020 at 10:14 AM Thomas Spehs (MonMap) via Unicode <
> unicode@unicode.org> wrote:
>
>> Hi, I would like to ask if there is any way to create geological
>> “symbols” with Unicode such as: Q₁¹ˉ², but with the two “1”s over each
>> other, without a space. Thanks!
>>
>


Re: New Unicode Working Group: Message Formatting

2020-01-11 Thread Philippe Verdy via Unicode
You seem to have never seen how translation packages work and are used in
common projects (not just CLDR; you can find them as well in Wikimedia
projects, or in the translation packages of lots of open source packages).
The purpose is to allow translating the UI of these applications into the
user's requested language. Internally the application can use whatever
representation it needs: it may be in any language, or could be just an
identifier; this does not matter here, as it is independent of the final
rendered translation. In CLDR, identifiers are used (more or less based on
simplified English, sometimes abbreviations or conventional codes). In
typical .po(t) packages the identifiers are the source-language strings
from which the software was built and extracted, to be replaced by calling
an API.
Various projects do not always use English as the source of their
translation, and even when it is the source, the strings themselves are not
always the unique identifiers used.
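As a small sketch of identifier-based lookup (assuming the standard
java.util.ResourceBundle mechanism, with one properties file per locale,
e.g. messages_en.properties and messages_de.properties, both keyed by the
same stable identifier rather than by an English string; the key below is
hypothetical):

    import java.util.Locale;
    import java.util.ResourceBundle;

    public class KeyedLookup {
        public static void main(String[] args) {
            // Selects messages_de.properties (or a fallback) at run time.
            ResourceBundle bundle = ResourceBundle.getBundle("messages", Locale.GERMAN);
            String pattern = bundle.getString("parcel.delivery");  // hypothetical key
            System.out.println(pattern);  // e.g. "Das Paket wird am {date} um {time} geliefert."
        }
    }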

If you send your package and need to print it, of course you'll print the
label in a chosen language. Nothing forbids the printout from displaying
both languages, i.e. two copies of the message translated into two
languages (English and German in your example); just look at the printed
notices you find in your purchases: the booklets frequently include
multiple copies, one per language, often a dozen for products imported from
China to Europe; even food is frequently labeled in several languages by
international brands.

If needed, product descriptions or source and delivery addresses can be
made accessible via an online web app by printing a barcode or QR code on
the label (converted to a URI): a URI by itself has no language, it is also
an identifier, allowing the texts to be retrieved in multiple languages or
in the language of the user's choice.
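A minimal sketch of that retrieval step (assuming Java 11's java.net.http;
the URL and resource layout are hypothetical): the identifier in the URI
carries no language, and the language is negotiated only when the text is
fetched, here through the Accept-Language header.

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class LabelLookup {
        public static void main(String[] args) throws Exception {
            URI uri = URI.create("https://example.org/parcels/123456");  // what the QR code encodes
            HttpRequest request = HttpRequest.newBuilder(uri)
                    .header("Accept-Language", "de")  // the reader's preferred language
                    .build();
            HttpResponse<String> response = HttpClient.newHttpClient()
                    .send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(response.body());
        }
    }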

So your question is nonsense with the example you give.

Le sam. 11 janv. 2020 à 21:21, wjgo_10...@btinternet.com via Unicode <
unicode@unicode.org> a écrit :

> A person in England, who knows no German, wants to send the parcel to a
> person in Germany, who knows no English.
>
> The person in England wants to send a message about the delivery to the
> person in Germany.
>
> > English: “The package will arrive at {time} on {date}.”
>
> The person wants to send the message by email.
>
> > German: “Das Paket wird am {date} um {time} geliefert.”
>
> Where does the translation of the text take place please, and by whom or
> by which computer?
>
> During the actual  transmission from the computer in England to the
> computer in Germany, is the text of the string in English, or German, or
> in a language-independent form please?
>
> 
>
> If the parcel were being sent from France to Germany by a person who
> knows only French, during the transmission of the message about the
> parcel, is the text of the string in French, or English, or German, or
> in a language-independent form please?
>
> William Overington
>
> Saturday 11 January 2020
>
>


Re: emojis for mouse buttons?

2020-01-01 Thread Philippe Verdy via Unicode
This is a user setting; the OS and software will automatically adapt to it
to display the proper label or icon, and they'll be able to document it
accordingly.

Primary/secondary/tertiary buttons are not used, even in the OS itself (the
mouse driver remaps the internal events when the mouse is configured for
left-handed use). If needed (when they want to document the difference
between right-handed and left-handed use), they will change the label, icon
or character; there's no reason not to use the left vs. right indication
for the mouse buttons (I think it's definitely better to force applications
to change the character accordingly). Usually "left click" is just named
"click" (not "primary click"), but "right click" is used everywhere (it may
be contextually changed to "left click" where appropriate to document the
left-handed behavior).

Also, I do not advocate a glyph limited to a mouse; the character could
just as well be shown as a square touchpad. And the wired vs. wireless
distinction is not relevant here, as we just want to be able to
conveniently document the key mappings used by applications and present
them the same way as other keys on a keyboard (even if the keyboard is
virtual, on a touch screen).

Those who want a real mouse, a real wired vs. wireless distinction, or a
touchpad do not need a distinction of clicked buttons, and they already
have characters encoded for them, including as emojis; but these are NOT
usable to document the key mappings that are so frequently needed in apps
(e.g. menus showing shortcuts) and in their documentation.



Le mer. 1 janv. 2020 à 16:08, John W Kennedy  a
écrit :

> As I have already said, this will not do. Mouses do not have “left” and
> “right” buttons; they have “primary” buttons, which may be on the left or
> right, and “secondary” buttons, which may be on the right or left. If this
> goes through, users with left-handed mouse setups will curse you forever.
>
> --
> John W. Kennedy
> "Compact is becoming contract,
> Man only earns and pays."
>  -- Charles Williams.  "Bors to Elayne:  On the King's Coins"
>
> > On Jan 1, 2020, at 6:43 AM, Marius Spix via Unicode 
> wrote:
> >
> > Because the middle button of many mice is a scroll button, I think we
> > need five different characters:
> >
> > LEFT MOUSE BUTTON CLICK (mouse with left button black)
> > MIDDLE MOUSE BUTTON CLICK (mouse with middle button black)
> > RIGHT MOUSE BUTTON CLICK (mouse with right button black)
> > MOUSE SCROLL UP (mouse with middle button black and white triangle
> > pointing up inside)
> > MOUSE SCROLL DOWN (mouse with middle button black and white triangle
> > pointing down inside)
> >
> > These characters are pretty useful in software manuals, training
> > materials and user interfaces.
> >
> > Happy New Year,
> >
> > Marius
> >
> >
> >
> >> On Tue, 31 Dec 2019 23:04:39 +0100
> >> Philippe Verdy via Unicode  WROTE:
> >>
> >> Playing with the filling of the middle cell to mean a double click
> >> is a bad idea; it would be better to add one or two rounded borders
> >> separated from the button, or simply display two icons in sequence
> >> for a double click.
> >>
> >> Note that the glyphs do not necessarily have to show a mouse, it
> >> could as well be a square with its lower third split into two or
> >> three squares, like a touchpad (see the notification icons displayed
> >> by Synaptics touchpad drivers). The same rounded borders could also
> >> mean the number of clicks. As well, if a mouse is represented, it may
> >> or may not have a wire.
> >>
> >> Emoji-styles could use more realistic 3D-like rendering with extra
> >> shadows...
> >>
> >> Le mar. 31 déc. 2019 à 22:16, wjgo_10...@btinternet.com via Unicode <
> >> unicode@unicode.org> a écrit :
> >>
> >>> How about the following.
> >>>
> >>> A filled upper cell to mean click,
> >>>
> >>> a filled upper cell and a filled middle cell to mean double click,
> >>>
> >> Note that clicking and holding the button is just like the
> >> convention of using "+" after a key modifier before the actual key
> >> (both keys may be styled separately to decorate their glyphs into a
> >> keycap, but such styling should not be applied in the distinctive
> >> glyph; there may also be emoji sequences to combine an anonymous
> >> keycap base emoji with the following characters, using joiner
> >> controls, but this is more difficult for keys whose labels are texts
> >> made of multiple letters like "End" or words like "Print Screen",
> >> after a possible Unicode symbol for keys like Page Up, Home, End,
> >> NumLock; styling the text offers a better option and accessibility
> >> even if symbols are used and a whole translatable string is surrounded
> >> by decorating styles to create a visual keycap).
> >
>


Re: emojis for mouse buttons?

2019-12-31 Thread Philippe Verdy via Unicode
Playing with the filling of the middle cell to mean a double click is a
bad idea; it would be better to add one or two rounded borders separated
from the button, or simply display two icons in sequence for a double
click.

Note that the glyphs do not necessarily have to show a mouse; it could as
well be a square with its lower third split into two or three squares,
like a touchpad (see the notification icons displayed by Synaptics touchpad
drivers). The same rounded borders could also mean the number of clicks. As
well, if a mouse is represented, it may or may not have a wire.

Emoji-styles could use more realistic 3D-like rendering with extra
shadows...

Le mar. 31 déc. 2019 à 22:16, wjgo_10...@btinternet.com via Unicode <
unicode@unicode.org> a écrit :

> How about the following.
>
> A filled upper cell to mean click,
>
> a filled upper cell and a filled middle cell to mean double click,
>
Note that clicking and holding the button is just like the convention
of using "+" after a key modifier before the actual key (both keys may be
styled separately to decorate their glyphs into a keycap, but such styling
should not be applied in the distinctive glyph; there may also be emoji
sequences to combine an anonymous keycap base emoji with the following
characters, using joiner controls, but this is more difficult for keys
whose labels are texts made of multiple letters like "End" or words like
"Print Screen", after a possible Unicode symbol for keys like Page Up,
Home, End, NumLock; styling the text offers a better option and
accessibility even if symbols are used and a whole translatable string is
surrounded by decorating styles to create a visual keycap).


Re: emojis for mouse buttons?

2019-12-31 Thread Philippe Verdy via Unicode
I say "emoji" because they would belong to the subsets of emojis, within
characters, and existing mouse characters (but not button-specific) are
already encoded as emojis (i.e. two styles: basic glyphs or color icons).

What is important is less the mouse than the identification of the button
(left/center/right) for documenting keymaps in UI (the documentation
usually indicate the default right-hand assignment, a user may still
configure the mouse driver to swap the left/right buttons).

For now the alternative is to compose a localisable string like "L" or "R"
or "C", followed by the generic mouse (when documenting keymaps, the
surrounding square and shading may be done outside using styling, we
just need the unique symbol in a more immediately readable way than just
"click".

A generic clic (1st button) is sometimes represented as an arrow cursor or
hand with a pointing finger, and some radial strokes near the tip of the
arrow, but it is not very distinctive when we need to explicitly disinguish
the buttons, so I suggest a basic empty shape (rounded rectangle or ovoid
like a narrow theta "Θ"), with the top part split in three cells by
horizontal and vertical strokes, and one of the three cells filled
(representing the wire or the wireless waves is not necessary).


Le mar. 31 déc. 2019 à 14:57, Shriramana Sharma  a
écrit :

> Why are these called "emojis" for mouse buttons rather than just
> "characters" for them?
>
> On Tue, 31 Dec, 2019, 18:45 Philippe Verdy via Unicode, <
> unicode@unicode.org> wrote:
>
>> A lot of applications need to document their keymaps and want to display
>> keys.
>>
>> For now there are emojis for mice (several variants: 1, 2 or 3
>> buttons), independent of the button actually pressed.
>>
>> However there's no simple emoji to represent the very common mouse click
>> buttons used in lots of UIs.
>>
>> But it would be good to have emojis for the left, center, and right click
>> (showing a mouse with the correct button filled in black), instead of
>> writing "left click" in plain text.
>>
>> Has it been proposed?
>>
>> See for example https://wiki.openstreetmap.org/wiki/ID/Shortcuts
>>
>>


emojis for mouse buttons?

2019-12-31 Thread Philippe Verdy via Unicode
A lot of applications need to document their keymaps and want to display keys.

For now there are emojis for mice (several variants: 1, 2 or 3 buttons),
independent of the button actually pressed.

However there's no simple emoji to represent the very common mouse click
buttons used in lots of UIs.

But it would be good to have emojis for the left, center, and right click
(showing a mouse with the correct button filled in black), instead of
writing "left click" in plain text.

Has it been proposed?

See for example https://wiki.openstreetmap.org/wiki/ID/Shortcuts


Re: Encoding the Nsibidi script (African) for writing the Igbo language

2019-11-11 Thread Philippe Verdy via Unicode
Le lun. 11 nov. 2019 à 17:31, Markus Scherer  a
écrit :

> We generally assign the script code when the script is in the pipeline for
> a near-future version of Unicode, which demonstrates that it's "a candidate
> for encoding". We also want the name of the script to be settled, so that
> the script code can be roughly mnemonic for the name.
>

This is not true for some scripts that have long been encoded in ISO
15924, not all of which had a candidate proposal for encoding (notably
Tolkien's various invented scripts, Cirth, Tengwar, ... and Klingon, all of
which have limited use but active supporters).

Other scripts were added even without much evidence, or are not even
deciphered (Mayan hieroglyphs, Linear A...). There are also missing scripts
in India which are still in contemporary use and important for local
cultures (but with limited support, in specific states or smaller
communities at a subregional level only), in Myanmar/Burma, and in
aboriginal communities of some southern Indonesian islands (I think there
are also some aboriginal logographic scripts in Australia, other
pre-Columbian scripts in Central and South America and on very remote
islands in the southern Pacific, and still others in north-eastern
Russia/Beringia).


Re: Encoding the Nsibidi script (African) for writing the Igbo language

2019-11-11 Thread Philippe Verdy via Unicode
The name of this script can vary a bit ("Nsibidi", "Nsibiri"), but not a
lot (the d/r variation may be a phonetic romanization difference in one of
the supported languages). It is stable across various sites.

Uniqueness is quite easy to assert; there are not a lot of ideographic
scripts, at least in modern use. But it is still not as complex as the
Chinese script. The site speaks about an inventory of about 500 base
characters (in the first educational books), probably double that in total
(in which case it compares to the modern use of sinograms in China by
children, whereas adults use only about 2,000 signs for almost everything,
comparable to the average of about 2,000 common words in Indo-European
languages, and in Afroasiatic or Nilo-Saharan languages; Igbo is still a
minority language, most of its speakers have a low level of literacy, even
in the Latin or Arabic scripts, and due to the proliferation of vernacular
languages, they may as well use only about 500-1,000 basic words to
understand each other).

Anyway, I suppose that you were already aware of that script, but were
just looking for more evidence and comparative research from a few more
sources (there is a lack of interest or funding for linguistic projects in
Africa, which prefer to place their efforts in the major scripts that have
official national support in educational and cultural programs: Latin,
Arabic, Ethiopic, Tifinagh; other scripts are still of interest due to
their important historical background and centuries of propagation across
countries, whether through wars, invasions, diplomacy, or commercial
interests).


Le lun. 11 nov. 2019 à 17:31, Markus Scherer  a
écrit :

> On Mon, Nov 11, 2019 at 4:03 AM Philippe Verdy via Unicode <
> unicode@unicode.org> wrote:
>
>> But first there's still no code in ISO 15924 (first step easy to complete
>> before encoding in the UCS).
>>
>
> That's not first; it's nearly last.
>
> The script code standard says "In general, script codes shall be added to
> ISO 15924 when the script has been coded in ISO/IEC 10646, and when the
> script is agreed, by experts in ISO 15924/RA-JAC to be unique and a *candidate
> for encoding in the UCS*."
>
> We generally assign the script code when the script is in the pipeline for
> a near-future version of Unicode, which demonstrates that it's "a candidate
> for encoding". We also want the name of the script to be settled, so that
> the script code can be roughly mnemonic for the name.
>
> markus
>


Encoding the Nsibidi script (African) for writing the Igbo language

2019-11-11 Thread Philippe Verdy via Unicode
Encoding the Nsibidi script (African) for writing the Efik, Ekoi, Ibibio
and Igbo languages.

See this site as an example of use, with links to published educational
books.
http://blog.nsibiri.org/
Also this online dictionary:
https://fr.scribd.com/doc/281219778/Ikpokwu

Other links:
https://en.wikipedia.org/wiki/Nsibidi

But first, there's still no code in ISO 15924 (a first step that would be
easy to complete before encoding in the UCS).


Re: comma ellipses

2019-10-07 Thread Philippe Verdy via Unicode
Commas may be used instead of dots by users of French keyboards (it's
easier to type the comma, since the dot/full stop requires pressing the
SHIFT key).
I may be wrong, but I've quite frequently seen commas or semicolons instead
of dots/full stops under normal orthography.
But the web, and notably social networks, can invent their own "rule":
pretending that the dot/full stop at the end of a sentence is "aggressive"
is probably a deviation from the English-only designation of the dot as a
"full stop", reinterpreted as "stop talking about this, my sentence is
final, I don't want to give more justification" (when in such a case the
user would have been better off using the exclamation mark!).

Anyway, I've never liked the 3-dot ellipsis character, which exists in
Unicode only for compatibility with fixed-width fonts on terminals, just to
compact 3 cells into one (or, in CJK styles, to replace the "bubble" dots
with their half-cell gap on the right side of each cell, contracting them
to three smaller dots in just one CJK cell).

But another reason could be that using commas instead of dots allows
distinguishing the ellipsis from an abbreviation dot used just before it.
Or making the distinction to explicitly mark the end of a sentence with a
regular dot/full stop after the ellipsis, given that the ellipsis can also
be used in the middle of a sentence (there is no clear distinction when
what follows the ellipsis is a proper name starting with a capital, or is
not a word at all: where is the end of the sentence?), in which case the
alternative comma ellipsis would explicitly say that the ellipsis does not
terminate the sentence, as in "I need to spend $2... $4 to return" (one
sentence, whose meaning is different from "I need to spend $2,,, $4 to
return", where the comma ellipsis would be an abbreviation for "between $2
and $4").

Anyway, people have the right to use commas if they prefer them for the
semantics they intend to distinguish. This does not mean that we need to
encode this sequence as a separate unbreakable character, as was done for
the dot ellipsis. Otherwise we would have to encode "etc." as a single
character too, or we would end up also adding many more leader dots (in
classic metal type, regular dots/full stops were used, but some type
compositors may have liked to mount a single "..." character to avoid
having to keep the dots glued or regularly spaced with special spacers when
justifying lines mechanically: this saved them a little time when composing
rows of metal type). There's no real need for CJK or for monospaced
terminals to get a more compact presentation. And for regular text, just
using multiple separate commas will still render as intended. And metal
type is no longer used.

Personally, I don't like the 3-dot ellipsis character because it plays
badly even in monospaced fonts. And there's no demonstrated use where a
single 3-comma ellipsis character would have to be distinguished
semantically and visually from 3 separate commas.

If people want to use ",,," for their informal speech on social networks,
or in chat sessions, they can do that today without needing any new
character and a new keyboard layout or input method. And nobody will really
know if this ",,," was mistyped instead of "..." to avoid pressing SHIFT on
a French AZERTY keyboard (not extended by a numeric keypad where the
dot/full stop may also be typed easily without SHIFT). As well a French
typist could have used ";;;" with semicolons when forgetting to press the
SHIFT key.

If we encode ",,," as a single character, then why not "???" or "!!!", or
"", or "**", or and many other variants mixing multiple punctuation
signs or symbols (like "$$" as an "angry" mark or the abbreviation for
"costly", then also "€€" or "££"...) Then also why not "eee" or
"h" for noting hesitations? This would become endless, without any
limit: Unicode would ten start encoding millions of whole words of
thousands languages as single characters, much more than the whole existing
set of CJK ideographs (including its extensions in nearly two planes).
Interoperability would worsen.




Le lun. 7 oct. 2019 à 01:14, Tex via Unicode  a écrit :

> Now that comma ellipses (,,,) are a thing (at least on social media) do we
> need a character proposal?
>
>
>
> Asking for a friend,,, J
>
>
>
> tex
>


Re: Acute/apostrophe diacritic in Võro for palatalized consonants

2019-08-19 Thread Philippe Verdy via Unicode
I must add that the current version of Wikipedia in Võro, seems to have
completely renounced to encode this combining mark (no acute, no
apostrophe), probably because of lack of proper encoding in Unicode and
difficulty to harmonize its orthography.

It may be a good argument for the addition of the missing combining palatal
accent and to restore the correct expected typography.

I'm curious also about other existing styles (notably with blackletters aka
"Gothic", or ISO 15924 "Latf" in historic texts: was that diacritic ever
handwritten, or typesetted in printed books, and how?)

Le mar. 20 août 2019 à 04:17, Philippe Verdy  a écrit :
>
> I'm curious about this statement in English Wikipedia about Võro:
>
>> Palatalization of consonants is marked with an acute accent (´) or
apostrophe ('). In proper typography and in handwriting, the palatalisation
mark does not extend above the cap height (except uppercase letters Ń, Ŕ,
Ś, V́ etc.), and it is written above the letter if the letter has no
ascender (ǵ, ḿ, ń, ṕ, ŕ, ś, v́ etc.) but written to the right of it
otherwise (b’, d’, f’, h’, k’, l’, t’). In computing, it is not usually
possible to enter these character combinations or to make them look
esthetically pleasing with most common fonts, so the apostrophe is
generally placed after the letter in all cases. This convention is followed
in this article as well.
>
>
> The problem is the encoding of this acute/apostrophe which changes
depending on lettercase or even depending on letterform for specific styles
(i.e. when there are ascenders or not for lowercase letters).


Re: Akkha script (used by Eastern Magar language) in ISO 15924?

2019-07-22 Thread Philippe Verdy via Unicode
So can I conclude that what the Ethnologue displays (using a private-use
ISO 15924 code "Qabl") is wrong?
And that translations classified under "mgp-Brah" are fine (while
"mgp-Qabl" would be unusable for interchange)?
"mgp-Qabl" would be unusable for interchange) ?


Le mar. 23 juil. 2019 à 02:42, Anshuman Pandey  a écrit :

> As I pointed out in L2/11-144, the “Magar Akkha” script is an
> appropriation of Brahmi, renamed to link it to the primordialist daydreams
> of an ethno-linguistic community in Nepal. I have never seen actual usage
> of the script by Magars. If things have changed since 2011, I would very
> much welcome such information. Otherwise, the so-called “Magar Akkha” is
> not suitable for encoding. The Brahmi encoding that we have should suffice.
>
> All my best,
> Anshu
>
> On Jul 22, 2019, at 10:06 AM, Lorna Evans via Unicode 
> wrote:
>
> Also: https://scriptsource.org/scr/Qabl
>
>
> On Mon, Jul 22, 2019, 12:47 PM Ken Whistler via Unicode <
> unicode@unicode.org> wrote:
>
>> See the entry for "Magar Akkha" on:
>>
>> http://linguistics.berkeley.edu/sei/scripts-not-encoded.html
>>
>> Anshuman Pandey did preliminary research on this in 2011.
>>
>> http://www.unicode.org/L2/L2011/11144-magar-akkha.pdf
>>
>> It would be premature to assign an ISO 15924 script code, pending the
>> research to determine whether this script should be separately encoded.
>>
>> --Ken
>> On 7/22/2019 9:16 AM, Philippe Verdy via Unicode wrote:
>>
>> According to Ethnolog, the Eastern Magar language (mgp) is written in two
>> scripts: Devanagari and "Akkha".
>>
>> But the "Akkha" script does not seem to have any ISO 15924 code.
>>
>> The Ethnologue currently assigns a private use code (Qabl) for this
>> script.
>>
>> Was the addition delayed due to lack of evidence (even if this language
>> is official in Nepal and India) ?
>>
>> Did the editors of Ethnologue submit an addition request for that script
>> (e.g. for the code "Akkh" or "Akha" ?)
>>
>> Or is it considered unified with another script that could explain why it
>> is not coded ? If this is a variant it could have its own code (like
>> Nastaliq in Arabic). Or may be this is just a subset of another
>> (Sino-Tibetan) script ?
>>
>>
>>
>>


Re: Akkha script (used by Eastern Magar language) in ISO 15924?

2019-07-22 Thread Philippe Verdy via Unicode
Also we can note that "mgp" (Eastern Magari) is severely endangered
according to multiple sources include Ethnologue and the Linguist List.
This is still not the case for Western Magari (mostly on Nepal, not in
Sikkim India), where evidence is probably easier to find (where the
encoding of a new script and disunificaition from Brahmi, may then be more
easily justified with their modern use, and probably unified with the
remaining use for Eastern Magari).


Le lun. 22 juil. 2019 à 19:33, Philippe Verdy  a écrit :

>
>
> Le lun. 22 juil. 2019 à 18:43, Ken Whistler  a
> écrit :
>
>> See the entry for "Magar Akkha" on:
>>
>> http://linguistics.berkeley.edu/sei/scripts-not-encoded.html
>>
>> Anshuman Pandey did preliminary research on this in 2011.
>>
>
> That's what I said: 8 years ago already.
>
>
>> http://www.unicode.org/L2/L2011/11144-magar-akkha.pdf
>>
>> It would be premature to assign an ISO 15924 script code, pending the
>> research to determine whether this script should be separately encoded.
>>
> And before that, does it mean that texts have to use the "Brah" code for
> early classification if they are tentatively encoded with Brahmi (and
> tagged as "mgp-Brah", which should limit the impact, because there's no
> other evidence that "mgp", the modern language, is directly related to the
> old Brahmi script, from a time when "mgp" did not even exist)?
>


Re: Akkha script (used by Eastern Magar language) in ISO 15924?

2019-07-22 Thread Philippe Verdy via Unicode
Le lun. 22 juil. 2019 à 18:43, Ken Whistler  a
écrit :

> See the entry for "Magar Akkha" on:
>
> http://linguistics.berkeley.edu/sei/scripts-not-encoded.html
>
> Anshuman Pandey did preliminary research on this in 2011.
>

That's what I said: 8 years ago already.


> http://www.unicode.org/L2/L2011/11144-magar-akkha.pdf
>
> It would be premature to assign an ISO 15924 script code, pending the
> research to determine whether this script should be separately encoded.
>
And before that, does it mean that texts have to use the "Brah" code for
early classification if they are tentatively encoded with Brahmi (and
tagged as "mgp-Brah", which should limit the impact, because there's no
other evidence that "mgp", the modern language, is directly related to the
old Brahmi script, from a time when "mgp" did not even exist)?
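As a small side note (a sketch of the tagging mechanics only, not an answer
to whether "Brah" is the right choice), such a language-plus-script tag is
already well-formed BCP 47 and can be carried by standard APIs:

    import java.util.Locale;

    public class ScriptTag {
        public static void main(String[] args) {
            Locale tagged = Locale.forLanguageTag("mgp-Brah");
            System.out.println(tagged.getLanguage());  // "mgp"
            System.out.println(tagged.getScript());    // "Brah"
        }
    }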


Akkha script (used by Eastern Magar language) in ISO 15924?

2019-07-22 Thread Philippe Verdy via Unicode
According to Ethnologue, the Eastern Magar language (mgp) is written in
two scripts: Devanagari and "Akkha".

But the "Akkha" script does not seem to have any ISO 15924 code.

The Ethnologue currently assigns a private use code (Qabl) for this script.

Was the addition delayed due to lack of evidence (even though this language
is official in Nepal and India)?

Did the editors of Ethnologue submit an addition request for that script
(e.g. for the code "Akkh" or "Akha")?

Or is it considered unified with another script, which could explain why it
is not coded? If it is a variant, it could have its own code (like Nastaliq
for Arabic). Or maybe it is just a subset of another (Sino-Tibetan) script?


Re: ISO 15924 : missing indication of support for Syriac variants

2019-07-20 Thread Philippe Verdy via Unicode
I had strange browser/caching issues: I did not see several "Age" values
in that page (possibly because of a broken cache), and even my script did
not detect them. I have already fixed that on my side and cleared my cache
to get a proper view of the page. Sorry for the disturbance; I trusted too
much what my small semi-automated tool had collected (but I've not found
where it failed to parse the content, so I updated my own data manually).
ISO 15924 does not have so much data that it cannot be edited by hand.

Le jeu. 18 juil. 2019 à 18:10, Ken Whistler  a
écrit :

>
> On 7/17/2019 4:54 PM, Philippe Verdy via Unicode wrote:
>
> then the Unicode version (age) used for Hieroglyphs should also be
> assigned to Hieratic.
>
> It is already.
>
>
> In fact the ligatures system for the "cursive" Egyptian Hieratic is so
> complex (and may also have its own variants showing its progression from
> Hieroglyphs to Demotic or Old Coptic), that probably Hieratic should no
> longer be considered "unified" with Hieroglyphs, and its existing ISO 15924
> code is then not represented at all in Unicode.
>
> It *is* considered unified with Egyptian hieroglyphs, until such time as
> anyone would make a serious case that the Unicode Standard (and students of
> the Egyptian hieroglyphs, in both their classic, monumental forms and in
> hieratic) would be better served by a disunification.
>
> Note that *many* cursive forms of scripts are not easily "supported" by
> out-of-the-box plain text implementations, for obvious reasons. And in the
> case of Egyptian hieroglyphs, it would probably be a good strategy to first
> get some experience in implementations/fonts supporting the Unicode 12.0
> controls for hieroglyphs, before worrying too much about what does or
> doesn't work to represent hieratic texts adequately. (Demotic is clearly a
> different case.)
>
>
> For now ISO 15924 still does not consider Egyptian Hieratic to be
> "unified" with Egyptian Hieroglyphs; this is not indicated in its
> descriptive names given in English or French with a suffix like "(cursive
> variant of Egyptian Hieroglyphs)", *and it has no "Unicode Age" version
> given, as if it was still not encoded at all by Unicode*,
>
> That latter part of that statement (highlighted) is false, as is easily
> determined by simple inspection of the Egyh entry on:
>
> https://www.unicode.org/iso15924/iso15924-codes.html
>
> --Ken
>
>
>


Re: ISO 15924 : missing indication of support for Syriac variants

2019-07-17 Thread Philippe Verdy via Unicode
But my concern is in fact valid as well for Egyptian Hieratic (considered
in Chapter 14 to be "unified" with the Hieroglyphs; being a cursive
variant, it is currently not supported in any font because of the very
complex set of ligatures this would require, which may not even work
properly with the existing markup notations used with Hieroglyphs).
But if the "Manuel de codage" for Egyptian Hieroglyphs (describing a markup
notation) contains extensions to represent the Hieratic variants with the
unified Hieroglyphs, then the Unicode version (Age) used for Hieroglyphs
should also be assigned to Hieratic.

In fact the ligature system for the "cursive" Egyptian Hieratic is so
complex (and may also have its own variants showing its progression from
Hieroglyphs to Demotic or Old Coptic) that Hieratic should probably no
longer be considered "unified" with Hieroglyphs, and its existing ISO 15924
code would then not be represented at all in Unicode.

For now ISO 15924 still does not consider Egyptian Hieratic to be "unified"
with Egyptian Hieroglyphs; this is not indicated in its descriptive names
given in English or French with a suffix like "(cursive variant of Egyptian
Hieroglyphs)", and it has no "Unicode Age" version given, as if it were
still not encoded at all by Unicode; Chapter 14 of the standard (in its
section about Hieroglyphs, where Hieratic is cited once) is then probably
misleading, pending further study.

And I'm unable to find any non-proprietary (interoperable?) attempt to
encode Hieratic, the only attempts being with Hieroglyphs.

Le jeu. 18 juil. 2019 à 01:16, Philippe Verdy  a écrit :

> Sorry I misread (with an automated tool) an old dataset where these "3.0"
> versions were indicated in an incorrect form
>
> Le jeu. 18 juil. 2019 à 01:07, Philippe Verdy  a
> écrit :
>
>> Note also that there are variants registered with Unicode versions (Age)
>> for symbols, even if they don't have any assigned Unicode alias, but this
>> is not a problem.
>> 994 Zinh Code for inherited script codet pour écriture héritée Inherited
>> 2009-02-23
>> 995 *Zmth * 
>> Mathematical
>> notation notation mathématique 3.2 2007-11-26
>> 993 *Zsye * Symbols
>> (Emoji variant) symboles (variante émoji) 6.0 2015-12-16
>> 996 *Zsym
>> *
>> Symbols symboles 1.1 2007-11-26
>> The Unicode version is an important information which allows determining
>> that texts created in a given language (or notation), and written in these
>> scripts, can be written using the UCS.
>>
>> Weren't the 3 variants of Syriac unified in Unicode (even if they may be
>> distinguished in ISO 15924, for example to allow selecting a suitable but
>> preferred sets of fonts, like this is commonly used for Chinese Mandarin,
>> Arabic, Japanese, Korean or Latin) ?
>>
>>
>> Le jeu. 18 juil. 2019 à 00:55, Philippe Verdy  a
>> écrit :
>>
>>> The ISO 15924/RA reference page contains indication of support in
>>> Unicode for variants of various scripts such as Aran, Latf, Latg, Hanb,
>>> Hans, Hant:.
>>> 160 *Arab* Arabic arabe Arabic 1.1 2004-05-01
>>> 161 *Aran* Arabic (Nastaliq variant) arabe (variante nastalique) 1.1
>>> 2014-11-15
>>> ...
>>> 503 *Hanb* Han with Bopomofo (alias for Han + Bopomofo) han avec
>>> bopomofo (alias pour han + bopomofo) 1.1 2016-01-19
>>>
>>> 500 *Hani* Han (Hanzi, Kanji, Hanja) idéogrammes han (sinogrammes) Han
>>> 1.1 2009-02-23
>>>
>>> 501 *Hans* Han (Simplified variant) idéogrammes han (variante
>>> simplifiée) 1.1 2004-05-29
>>> 502 *Hant* Han (Traditional variant) idéogrammes han (variante
>>> traditionnelle) 1.1 2004-05-29
>>> ...
>>> 217 *Latf* Latin (Fraktur variant) latin (variante brisée) 1.1
>>> 2004-05-01
>>> 216 *Latg* Latin (Gaelic variant) latin (variante gaélique) 1.1
>>> 2004-05-01
>>> 215 *Latn* Latin latin Latin 1.1 2004-05-01
>>> ...
>>> There are other entries for aliases or mixed script also for Japanese
>>> and Korean.
>>>
>>> But for Syriac variants this is missing and this is the only script for
>>> which this occurs:
>>> 135 *Syrc* Syriac syriaque Syriac 3.0 2004-05-01
>>> 138 Syre Syriac (Estrangelo variant) syriaque (variante estranghélo)
>>> 2004-05-01
>>> 137 Syrj Syriac (Western variant) syriaque (variante occidentale)
>>> 2004-05-01
>>> 136 Syrn Syriac (Eastern variant) syriaque (variante orientale)
>>> 2004-05-01
>>> Why is there no Unicode version given for these 3 variants ?
>>>
>>>


Re: ISO 15924 : missing indication of support for Syriac variants

2019-07-17 Thread Philippe Verdy via Unicode
Sorry I misread (with an automated tool) an old dataset where these "3.0"
versions were indicated in an incorrect form

Le jeu. 18 juil. 2019 à 01:07, Philippe Verdy  a écrit :

> Note also that there are variants registered with Unicode versions (Age)
> for symbols, even if they don't have any assigned Unicode alias, but this
> is not a problem.
> 994 Zinh Code for inherited script codet pour écriture héritée Inherited
> 2009-02-23
> 995 *Zmth * 
> Mathematical
> notation notation mathématique 3.2 2007-11-26
> 993 *Zsye * Symbols
> (Emoji variant) symboles (variante émoji) 6.0 2015-12-16
> 996 *Zsym
> *
> Symbols symboles 1.1 2007-11-26
> The Unicode version is an important information which allows determining
> that texts created in a given language (or notation), and written in these
> scripts, can be written using the UCS.
>
> Weren't the 3 variants of Syriac unified in Unicode (even if they may be
> distinguished in ISO 15924, for example to allow selecting a suitable but
> preferred sets of fonts, like this is commonly used for Chinese Mandarin,
> Arabic, Japanese, Korean or Latin) ?
>
>
> Le jeu. 18 juil. 2019 à 00:55, Philippe Verdy  a
> écrit :
>
>> The ISO 15924/RA reference page contains indication of support in Unicode
>> for variants of various scripts such as Aran, Latf, Latg, Hanb, Hans, Hant:.
>> 160 *Arab* Arabic arabe Arabic 1.1 2004-05-01
>> 161 *Aran* Arabic (Nastaliq variant) arabe (variante nastalique) 1.1
>> 2014-11-15
>> ...
>> 503 *Hanb* Han with Bopomofo (alias for Han + Bopomofo) han avec
>> bopomofo (alias pour han + bopomofo) 1.1 2016-01-19
>>
>> 500 *Hani* Han (Hanzi, Kanji, Hanja) idéogrammes han (sinogrammes) Han
>> 1.1 2009-02-23
>>
>> 501 *Hans* Han (Simplified variant) idéogrammes han (variante simplifiée)
>> 1.1 2004-05-29
>> 502 *Hant* Han (Traditional variant) idéogrammes han (variante
>> traditionnelle) 1.1 2004-05-29
>> ...
>> 217 *Latf* Latin (Fraktur variant) latin (variante brisée) 1.1 2004-05-01
>> 216 *Latg* Latin (Gaelic variant) latin (variante gaélique) 1.1
>> 2004-05-01
>> 215 *Latn* Latin latin Latin 1.1 2004-05-01
>> ...
>> There are other entries for aliases or mixed script also for Japanese and
>> Korean.
>>
>> But for Syriac variants this is missing and this is the only script for
>> which this occurs:
>> 135 *Syrc* Syriac syriaque Syriac 3.0 2004-05-01
>> 138 Syre Syriac (Estrangelo variant) syriaque (variante estranghélo)
>> 2004-05-01
>> 137 Syrj Syriac (Western variant) syriaque (variante occidentale)
>> 2004-05-01
>> 136 Syrn Syriac (Eastern variant) syriaque (variante orientale)
>> 2004-05-01
>> Why is there no Unicode version given for these 3 variants ?
>>
>>


Re: ISO 15924 : missing indication of support for Syriac variants

2019-07-17 Thread Philippe Verdy via Unicode
Note also that there are variants registered with Unicode versions (Age)
for symbols, even if they don't have any assigned Unicode alias, but this
is not a problem.
994 Zinh Code for inherited script codet pour écriture héritée Inherited
2009-02-23
995 *Zmth* Mathematical notation notation mathématique 3.2 2007-11-26
993 *Zsye* Symbols (Emoji variant) symboles (variante émoji) 6.0 2015-12-16
996 *Zsym* Symbols symboles 1.1 2007-11-26
The Unicode version is an important piece of information, which allows
determining whether texts created in a given language (or notation), and
written in these scripts, can be represented using the UCS.
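For instance (a minimal sketch assuming ICU4J; the sample string is
arbitrary), the per-character Age property can be used to compute the
minimum Unicode version needed to represent a given text, which is the kind
of information that column provides per script:

    import com.ibm.icu.lang.UCharacter;
    import com.ibm.icu.util.VersionInfo;

    public class MinUnicodeVersion {
        public static void main(String[] args) {
            String text = "ܫܠܡܐ";  // some Syriac text
            VersionInfo min = VersionInfo.getInstance(1, 1, 0, 0);
            for (int i = 0; i < text.length(); ) {
                int cp = text.codePointAt(i);
                VersionInfo age = UCharacter.getAge(cp);  // "Age" property of the code point
                if (age.compareTo(min) > 0) min = age;
                i += Character.charCount(cp);
            }
            System.out.println(min);  // e.g. 3.0.0.0 for the basic Syriac letters
        }
    }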

Weren't the 3 variants of Syriac unified in Unicode (even if they may be
distinguished in ISO 15924, for example to allow selecting a suitable or
preferred set of fonts, as is commonly done for Chinese Mandarin, Arabic,
Japanese, Korean or Latin)?


Le jeu. 18 juil. 2019 à 00:55, Philippe Verdy  a écrit :

> The ISO 15924/RA reference page contains indication of support in Unicode
> for variants of various scripts such as Aran, Latf, Latg, Hanb, Hans, Hant:.
> 160 *Arab* Arabic arabe Arabic 1.1 2004-05-01
> 161 *Aran* Arabic (Nastaliq variant) arabe (variante nastalique) 1.1
> 2014-11-15
> ...
> 503 *Hanb* Han with Bopomofo (alias for Han + Bopomofo) han avec bopomofo
> (alias pour han + bopomofo) 1.1 2016-01-19
>
> 500 *Hani* Han (Hanzi, Kanji, Hanja) idéogrammes han (sinogrammes) Han 1.1
> 2009-02-23
>
> 501 *Hans* Han (Simplified variant) idéogrammes han (variante simplifiée)
> 1.1 2004-05-29
> 502 *Hant* Han (Traditional variant) idéogrammes han (variante
> traditionnelle) 1.1 2004-05-29
> ...
> 217 *Latf* Latin (Fraktur variant) latin (variante brisée) 1.1 2004-05-01
> 216 *Latg* Latin (Gaelic variant) latin (variante gaélique) 1.1 2004-05-01
> 215 *Latn* Latin latin Latin 1.1 2004-05-01
> ...
> There are other entries for aliases or mixed script also for Japanese and
> Korean.
>
> But for Syriac variants this is missing and this is the only script for
> which this occurs:
> 135 *Syrc* Syriac syriaque Syriac 3.0 2004-05-01
> 138 Syre Syriac (Estrangelo variant) syriaque (variante estranghélo)
> 2004-05-01
> 137 Syrj Syriac (Western variant) syriaque (variante occidentale)
> 2004-05-01
> 136 Syrn Syriac (Eastern variant) syriaque (variante orientale) 2004-05-01
> Why is there no Unicode version given for these 3 variants ?
>
>


ISO 15924 : missing indication of support for Syriac variants

2019-07-17 Thread Philippe Verdy via Unicode
The ISO 15924/RA reference page contains an indication of support in Unicode
for variants of various scripts such as Aran, Latf, Latg, Hanb, Hans, Hant:
160 *Arab* Arabic arabe Arabic 1.1 2004-05-01
161 *Aran* Arabic (Nastaliq variant) arabe (variante nastalique) 1.1
2014-11-15
...
503 *Hanb* Han with Bopomofo (alias for Han + Bopomofo) han avec bopomofo
(alias pour han + bopomofo) 1.1 2016-01-19

500 *Hani* Han (Hanzi, Kanji, Hanja) idéogrammes han (sinogrammes) Han 1.1
2009-02-23

501 *Hans* Han (Simplified variant) idéogrammes han (variante simplifiée)
1.1 2004-05-29
502 *Hant* Han (Traditional variant) idéogrammes han (variante
traditionnelle) 1.1 2004-05-29
...
217 *Latf* Latin (Fraktur variant) latin (variante brisée) 1.1 2004-05-01
216 *Latg* Latin (Gaelic variant) latin (variante gaélique) 1.1 2004-05-01
215 *Latn* Latin latin Latin 1.1 2004-05-01
...
There are other entries for aliases or mixed script also for Japanese and
Korean.

But for Syriac variants this is missing and this is the only script for
which this occurs:
135 *Syrc* Syriac syriaque Syriac 3.0 2004-05-01
138 Syre Syriac (Estrangelo variant) syriaque (variante estranghélo)
2004-05-01
137 Syrj Syriac (Western variant) syriaque (variante occidentale) 2004-05-01
136 Syrn Syriac (Eastern variant) syriaque (variante orientale) 2004-05-01
Why is there no Unicode version given for these 3 variants?


Fwd: Numeric group separators and Bidi

2019-07-09 Thread Philippe Verdy via Unicode
> Well my first feeling was that U+202F should work all the time, but I
> found cases where this is not always the case. So this must be bugs in
> those renderers.
>

I think we can attribute these bugs to the fact that this character is
insufficiently known, and not even accessible in most input tools...
including the Windows "Charmap", where it is not even listed with the other
spaces or punctuation, unless we display the FULL list of characters
supported by a selected font that maps it (many fonts don't) under the
"Unicode" encoding. Windows Charmap is so outdated (and has many
inconsistencies in its proposed grouping; look for example at the groups
proposed for Greek: they are complete nonsense, with duplicate subranges
and groups made completely arbitrarily, making this basic tool really
difficult to use).

And besides that, the input methods proposed in Windows still don't
offer it (this is also true on other platforms). So in the end there is not
enough text to render with it, and renderers are not fixed to render it
correctly; developers think there's no urgency and that the bug is minor,
so it can stay for years without ever being corrected (just like the old
"Charmap" on Windows), even if the bug or omission has been reported
repeatedly.

This finally tends to perpetuate the old bad practices (this is what
happened with ASCII spreading everywhere, even in scopes where it should
not have been used at all, and certainly not selected as the only viable
alternative; the same is seen today with the choice of languages/locales,
where everything that is not English is treated as unimportant for users).


Re: Numeric group separators and Bidi

2019-07-09 Thread Philippe Verdy via Unicode
Well, my first feeling was that U+202F should work all the time, but I found
cases where this is not always the case. So these must be bugs in those
renderers.

And using Bidi controls (LRI/PDI) is absolutely not an option. These
controls are only intended to be used in pure plain-text files that have no
other way to specify the embedding, and whose content is entirely static
(not generated by templates that return data from unspecified locales to an
unspecified locale).

As well, the option of localizing each item is not possible. That's why I am
looking for a locale-neutral solution that is acceptable in all languages
and does not give a false interpretation of the actual values of the
numbers (which can have different scales or precisions, and also optional
data, not always present in all items to render but added to the list, for
example as annotations that should still be as locale-neutral as possible).

So U+202F is supposed to be the solution, but I did not find any way to
properly present the decimal separator: it is only unambiguous as a decimal
separator (and not a group separator) if there's a group separator present
in the number (and this is not always true!). And there I'm stuck with the
dot or comma, with no appropriate symbol that would not be confusable
(maybe the small vertical tick hanging from the baseline could replace both
the dot and the comma?).
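For reference (a minimal sketch using java.text; nothing here works around
the renderer bugs, it only produces the string), U+202F can at least be set
as the grouping separator when formatting:

    import java.text.DecimalFormat;
    import java.text.DecimalFormatSymbols;
    import java.util.Locale;

    public class NarrowSpaceGrouping {
        public static void main(String[] args) {
            DecimalFormatSymbols symbols = DecimalFormatSymbols.getInstance(Locale.ROOT);
            symbols.setGroupingSeparator('\u202F');  // NARROW NO-BREAK SPACE
            DecimalFormat format = new DecimalFormat("#,##0.###", symbols);
            // "1 234 567.89", with U+202F between the digit groups; whether an
            // RTL renderer keeps the groups in logical order is the open issue.
            System.out.println(format.format(1234567.89));
        }
    }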



Le mar. 9 juil. 2019 à 22:10, Egmont Koblinger  a écrit :

> Hi Philippe,
>
> What do you mean U+202F doesn't work fo you?
>
> Whereas the logical string "hebrew 123456 hebrew" indeed shows
> the number incorrectly as "456 123", it's not the case with U+202F
> instead of space, then the number shows up as "123 456" as expected.
>
> I think you need to pick a character whose BiDi class is "Common
> Number Separator", see e.g.
> https://www.compart.com/en/unicode/bidiclass/CS for a list of such
> characters including U+00A0 no-break space and U+202F narrow no-break
> space. This suggests to me that U+202F is a correct choice if you need
> the look of a narrow space.
>
> Another possibility is to embed the number in a LRI...PDI block, as
> e.g. https://unicode.org/cldr/utility/bidic.jsp does with the "1–3%"
> fragment of its default example.
>
> cheers,
> egmont
>
> On Tue, Jul 9, 2019 at 9:01 PM Philippe Verdy via Unicode
>  wrote:
> >
> > Is there a narrow space usable as a numeric group separator, and that
> also has the same bidi property as digits (i.e. neutral outside the span of
> digits and separators, but inheriting the implied directionality of the
> previous digit) ?
> >
> > I can't find a way to use narrow spaces instead of punctuation signs
> (dot or comma) for example in Arabic/Hebrew, for example to present tabular
> numeric data in a really language-neutral way. In Arabic/Hebrew we need to
> use punctuations as group separators because spaces don't work (not even
> the narrow non-breaking space U+202F used in French and recommended in
> ISO), but then these punctuation separators are interpreted differently
> (notably between French and English where the interpretation dot and comma
> are swapped)
> >
> > Note that:
> > - the "figure space" is not suitable (as it has the same width as digits
> and is used as a "filler" in tabular data; but it also does not have the
> correct bidi behavior, as it does not have the same bidi properties as
> digits).
> > - the "thin space" is not suitable (it is breakable)
> > - the "narrow non-breaking space" U+202F (used in French and currently
> in ISO) is not suitable, or may be I'm wrong and its presence is still
> neutral between groups of digits where it inherits the properties of the
> previous digit, but still does not enforces the bidi direction of the whole
> span of digits.
> >
> > Can you point me if U+202F is really suitable ? I made some tests with
> various text renderers, and some of them "break" the group of digits by
> reordering these groups, changing completely the rendered value (units
> become thousands or more, and thousands become units...). But may be these
> are bugs in renderers.
> >
>


Numeric group separators and Bidi

2019-07-09 Thread Philippe Verdy via Unicode
Is there a narrow space usable as a numeric group separator that also
has the same bidi property as digits (i.e. neutral outside the span of
digits and separators, but inheriting the implied directionality of the
previous digit)?

I can't find a way to use narrow spaces instead of punctuation signs (dot
or comma), for example in Arabic/Hebrew, to present tabular numeric data in
a really language-neutral way. In Arabic/Hebrew we need to use punctuation
as group separators because spaces don't work (not even the narrow
non-breaking space U+202F used in French and recommended by ISO), but then
these punctuation separators are interpreted differently (notably between
French and English, where the interpretations of dot and comma are
swapped).

Note that:
- the "figure space" is not suitable (it has the same width as digits and
is used as a "filler" in tabular data, but it also does not have the
correct bidi behavior, as it does not have the same bidi properties as
digits);
- the "thin space" is not suitable (it is breakable);
- the "narrow no-break space" U+202F (used in French and currently by ISO)
is not suitable; or maybe I'm wrong and its presence is still neutral
between groups of digits, where it inherits the properties of the previous
digit, but it still does not enforce the bidi direction of the whole span
of digits.

Can you point me if U+202F is really suitable ? I made some tests with
various text renderers, and some of them "break" the group of digits by
reordering these groups, changing completely the rendered value (units
become thousands or more, and thousands become units...). But may be these
are bugs in renderers.


Re: Unicode "no-op" Character?

2019-07-03 Thread Philippe Verdy via Unicode
Also consider that C0 controls (like STX and ETX) can already be used for
packetizing, but then the need for escaping immediately arises (DLE has
been used for that purpose, placed just before the character to preserve in
the stream content, notably before DLE itself, or before STX and ETX).
There's then no need at all for any new character in Unicode. But if your
protocol does not allow any form of escaping, then it is broken, as it
cannot transport **all** valid Unicode text.
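
As a minimal illustration of that escaping idea (a sketch only, not any
particular standard: the byte values are just the classic STX/ETX/DLE
assignments), a packetizer can place DLE before every special byte in the
payload so the receiver never mistakes payload bytes for frame delimiters:

    STX, ETX, DLE = 0x02, 0x03, 0x10

    def frame(payload: bytes) -> bytes:
        out = bytearray([STX])
        for b in payload:
            if b in (STX, ETX, DLE):
                out.append(DLE)          # escape the special byte
            out.append(b)
        out.append(ETX)
        return bytes(out)

    def unframe(framed: bytes) -> bytes:
        assert framed[0] == STX and framed[-1] == ETX
        out, i = bytearray(), 1
        while i < len(framed) - 1:
            if framed[i] == DLE:
                i += 1                   # next byte is literal payload
            out.append(framed[i])
            i += 1
        return bytes(out)

    text = "any Unicode text: \u20ac, \N{WAVING BLACK FLAG}, even \x02 or \x10"
    assert unframe(frame(text.encode("utf-8"))).decode("utf-8") == text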

Le mer. 3 juil. 2019 à 10:49, Philippe Verdy  a écrit :

> Le mer. 3 juil. 2019 à 06:09, Sławomir Osipiuk  a
> écrit :
>
>> I don’t think you understood me at all. I can packetize a string with any
>> character that is guaranteed not to appear in the text.
>>
>
> Your goal is **impossible** to reach with Unicode. Assume such a character
> is "added" to the UCS: then it can appear in the text. Your goal being
> that it should be "guaranteed" not to be used in any text means that your
> "character" cannot be encoded at all. Unicode and ISO **require** that
> any proposed character can be used in text without limitation. Logically
> it would be rejected, because your character would not be usable at all
> from the start.
>
> So you have no choice: you must use some transport format for your
> "packetizing", just like what is used in MIME for emails, in HTTP(S) for
> streaming, or in internationalized domain names.
>
> For your escaping mechanism you already have a very large choice of
> characters considered special only for your chosen transport syntax.
>
> Your goal is a chicken-and-egg problem. It is not solvable without
> immediately creating self-contradictions (and if you attempt to add some
> restriction to avoid the contradiction, then you'll fall on cases where
> you can no longer transport your message and your protocol will become
> unusable).
>


Re: Unicode "no-op" Character?

2019-07-03 Thread Philippe Verdy via Unicode
Le mer. 3 juil. 2019 à 06:09, Sławomir Osipiuk  a
écrit :

> I don’t think you understood me at all. I can packetize a string with any
> character that is guaranteed not to appear in the text.
>

Your goal is **impossible** to reach with Unicode. Assume such a character
is "added" to the UCS: then it can appear in the text. Your goal being that
it should be "guaranteed" not to be used in any text means that your
"character" cannot be encoded at all. Unicode and ISO **require** that any
proposed character can be used in text without limitation. Logically it
would be rejected, because your character would not be usable at all from
the start.

So you have no choice: you must use some transport format for your
"packetizing", just like what is used in MIME for emails, in HTTP(S) for
streaming, or in internationalized domain names.

For your escaping mechanism you already have a very large choice of
characters considered special only for your chosen transport syntax.

Your goal is a chicken-and-egg problem. It is not solvable without
immediately creating self-contradictions (and if you attempt to add some
restriction to avoid the contradiction, then you'll fall on cases where you
can no longer transport your message and your protocol will become
unusable).


Re: Unicode "no-op" Character?

2019-06-29 Thread Philippe Verdy via Unicode
If you want to "packetize" arbitrarily long Unicode text, you don't need
any new magic character. Just prepend your packet with a base character
used as a syntaxic delimiter, that does not combine with what follows in
any normalization.

There's a fine character for that: the TAB control. Except that during
transmission it may turn into a SPACE that would combine. (the same will
happen with "=" which can combine with a combining slash).

But look at the normalization data (and consider that Unicode warranties
that there will not be any addition of new combining pair starting by the
same base character) there are LOT of suitable base characters in Unicode,
which you can use as a syntaxic delimiter.

Some examples (in the ASCII subset) include the hyphen-minus, the
apostrophe-quote, the double quotation mark...

So it's easy to split an arbitrarily long text at arbitrary character
position, even in the middle of any cluster or combining sequence. It does
not matter that this character may create a "cluster" with the following
character, your "packetized" stream is still not readable text, but only a
transport syntax (just like quoted-printable, or Base64).

You can also freely choose the base character at end of each packet (the
newlines are not safe as lines may be merged, but like Base64, "=" is fine
to terminate each packet, as well as two ASCII quotation marks, and in fact
all punctuations and symbols from ASCII (you can even use the ASCII letters
and digits).

If your packets have variable lengths, you may need to use escaping, or you
may prepend the length (in characters or in combining sequences) of your
packet before the expected terminator.

All this is used in MIME for attachments in emails (with the two common
transport syntaxes: Quoted Printable using escaping, or Base64 which does
not require any length but requires a distinctive terminator (not used to
encode the data part of the "packet") for variable length "packets".
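
As a rough sketch of the length-prefixed variant (illustrative only: the
quote delimiter, the space and the "=" terminator are arbitrary ASCII
choices, and in this layout none of them is ever followed by a combining
mark, so normalization cannot merge them into the payload), splitting can
happen at any code point position, even inside a combining sequence:

    def packetize(text: str, size: int):
        # Split at arbitrary code point positions; each packet carries its own
        # length, so the receiver never needs to inspect the chunk contents.
        for i in range(0, len(text), size):
            chunk = text[i:i + size]
            yield f'"{len(chunk)} {chunk}='

    def unpacketize(packets) -> str:
        out = []
        for p in packets:
            assert p[0] == '"' and p[-1] == '='
            length, _, chunk = p[1:-1].partition(" ")
            assert len(chunk) == int(length)
            out.append(chunk)
        return "".join(out)

    sample = "e\u0301crit"                 # "écrit" with a combining acute accent
    assert unpacketize(packetize(sample, 2)) == sample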





Le dim. 23 juin 2019 à 02:35, Sławomir Osipiuk via Unicode <
unicode@unicode.org> a écrit :

> I assure you, it wasn’t very interesting. :-) Headache-y, more like. The
> diacritic thing was completely inapplicable anyway, as all our text was
> plain English. I really don’t want to get into what the thing was, because
> it sounds stupider the more I try to explain it. But it got the wheels
> spinning in my head, and now that I’ve been reading up a lot about Unicode
> and older standards like 2022/6429, it got me thinking whether there might
> already be an elegant solution.
>
>
>
> But, as an example I’m making up right now, imagine you want to packetize
> a large string. The packets are not all equal sized, the sizes are
> determined by some algorithm. And the packet boundary may occur between a
> base char and a diacritic. You insert markers into the string at the packet
> boundaries. You can then store the string, copy it, display it, or pass it
> to the sending function which will scan the string and know to send the
> next packet when it reaches the marker. And you can now do all that without
> the need to pass around extra metadata (like a list of ints of where the
> packet boundaries are supposed to be) or to re-calculate the boundaries;
> it’s still just a big string. If a different application sees the string,
> it will know to completely ignore the packet markers; it can even strip
> them out if it wants to (the canonical equivalent of the noop character is
> the absence of a character).
>
>
>
> As should be obvious, I’m not recommending this as good practice.
>
>
>
>
>
> *From:* Shawn Steele [mailto:shawn.ste...@microsoft.com]
> *Sent:* Saturday, June 22, 2019 19:57
> *To:* Sławomir Osipiuk; unicode@unicode.org
> *Subject:* RE: Unicode "no-op" Character?
>
>
>
> + the list.  For some reason the list’s reply header is confusing.
>
>
>
> *From:* Shawn Steele
> *Sent:* Saturday, June 22, 2019 4:55 PM
> *To:* Sławomir Osipiuk 
> *Subject:* RE: Unicode "no-op" Character?
>
>
>
> The original comment about putting it between the base character and the
> combining diacritic seems peculiar.  I’m having a hard time visualizing how
> that kind of markup could be interesting?
>
>
>
> *From:* Unicode  *On Behalf Of *Slawomir
> Osipiuk via Unicode
> *Sent:* Saturday, June 22, 2019 2:02 PM
> *To:* unicode@unicode.org
> *Subject:* RE: Unicode "no-op" Character?
>
>
>
> I see there is no such character, which I pretty much expected after
> Google didn’t help.
>
>
>
> The original problem I had was solved long ago but the recent article
> about watermarking reminded me of it, and my question was mostly out of
> curiosity. The task wasn’t, strictly speaking, about “padding”, but about
> marking – injecting “flag” characters at arbitrary points in a string
> without affecting the resulting visible text. I think we ended up using
> ESC, which is a dumb choice in retrospect, though the whole approach was a
> bit of a hack anyway and the process it was for isn’t being used anymore.
>


Symbols of colors used in Portugal for transport

2019-04-27 Thread Philippe Verdy via Unicode
A very useful thing to add to Unicode (for colorblind people)!

http://bestinportugal.com/color-add-project-brings-color-identification-to-the-color-blind


Is it proposed to add these as new symbols?


Re: Emoji Haggadah

2019-04-19 Thread Philippe Verdy via Unicode
I cannot; it definitely requires first a good knowledge of English (to find
possible synonyms, plus phonetic approximations, including the use of
abbreviated words) and of Hebrew culture (to guess names and the context).
Otherwise all this text looks completely random and makes no sense.

Le mar. 16 avr. 2019 à 04:22, Tex via Unicode  a
écrit :

> Oy veh!
>
>
>
> *From:* Unicode [mailto:unicode-boun...@unicode.org] *On Behalf Of *Mark
> E. Shoulson via Unicode
> *Sent:* Monday, April 15, 2019 5:27 PM
> *To:* unicode@unicode.org
> *Subject:* Emoji Haggadah
>
>
>
> The only thing more disturbing than the existence of The Emoji Haggadah (
> https://www.amazon.com/Emoji-Haggadah-Martin-Bodek/dp/1602803463/) is the
> fact that I'm starting to find that I can read it...
>
>
>
> ~mark
>


Re: Bidi paragraph direction in terminal emulators (was: Proposal for BiDi in terminal emulators)

2019-02-17 Thread Philippe Verdy via Unicode
Le ven. 8 févr. 2019 à 13:56, Egmont Koblinger  a écrit :

> Philippe, I hate to say it, but at the risk of being impolite, I just
> have to.
>

Resist that idea: I've not been impolite. I just want to show you that
terminals are legacy environments that are far behind what is needed for
proper internationalization. And when I raised the problem of monospaced
fonts and presented the case of "dualspace" fonts, that is something
already used in legacy terminals to solve practical problems (and there is
even data about it in the UCD): dualspace is an excellent solution that
should be extended even outside CJK contexts (for example to emoji, and to
various other South Asian scripts).


Re: Bidi paragraph direction in terminal emulators

2019-02-14 Thread Philippe Verdy via Unicode
Le mar. 12 févr. 2019 à 14:16, Egmont Koblinger via Unicode <
unicode@unicode.org> a écrit :

> > There is nothing magic about the grid of cells, and once you introduce
> new escape sequences, you might as well truly modernise the terminal.
>
> The magic about the grid of cells is all the software that were built
> up with this assumption during the last couple of decades.
>

The minimum to support (which is already used in VT* terminals) is
"dualspace" rendering (i.e. characters rendered in one or two cells),
widely used for CJK (half-width and full-width characters). If the terminal
has square cells, only one variant is needed (i.e. a monospaced cell), but
common terminals today use rectangular cells.
Thankfully Unicode has properties for that (East Asian Width), allowing
renderers to select the appropriate variant (plus legacy compatibility
encodings for parts of Latin/Greek/Cyrillic).
But an extension would be needed for other scripts, and a control in the
VT* protocol to select the variant (which would take effect in terminals
configured in dualspace rendering mode, normally the default mode in East
Asia). This should apply to other South Asian scripts and most emoji, and
adding such a control would extend dualspace rendering to cover the whole
of Unicode (without having to use the few compatibility characters
specifically encoded at the end of the BMP).
Unfortunately Unicode still does not have any standard variation selector
(or other format control) to control that, at least at cluster level.
This would mean adding some custom escape sequence to the VT* protocol
(using the compatibility characters for half-width/full-width should be
deprecated), which would also be more efficient than having to use
variation selectors or format controls after each character (that solution
only works for isolated characters), or having to configure the terminal in
an ugly monospaced mode (with typically 40 cells per line instead of 80),
which is only fine for CJK, or for output to old analog TVs with very low
vertical resolution (below ~400 pixels, with cells of about 8x8 pixels at
most) such as old CGA, Teletext, and early 8-bit personal computers.
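
A minimal sketch of that cell-width computation, using the East_Asian_Width
property via Python's standard unicodedata module (illustrative only: a real
terminal would also handle combining marks, controls and ambiguous-width
characters):

    import unicodedata

    def cell_width(ch: str) -> int:
        # Wide (W) and Fullwidth (F) characters occupy two cells; everything
        # else is treated as one cell in this simplified model.
        return 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1

    def line_width(text: str) -> int:
        return sum(cell_width(ch) for ch in text)

    print(line_width("abc"))      # 3
    print(line_width("日本語"))   # 6 (three Wide characters)
    print(line_width("ｱｲｳ"))      # 3 (Halfwidth Katakana are class H)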


Re: Encoding colour (from Re: Encoding italic)

2019-02-11 Thread Philippe Verdy via Unicode
Le dim. 10 févr. 2019 à 02:33, wjgo_10...@btinternet.com via Unicode <
unicode@unicode.org> a écrit :

> Previously I wrote:
>
> > A stateful method, though which might be useful for plain text streams
> > in some applications, would be to encode as characters some of the
> > glyphs for indicating colours and the digit characters to go with them
> > from page 5 and from page 3 of the following publication.
>
> > http://www.users.globalnet.co.uk/~ngo/locse027.pdf
>
> Thinking about this further, for this application copies of the glyphs
> could be redesigned so as to be square and could be emoji-style and the
> meanings of the characters specifying which colour component is to be
> set could be changed so that they refer to the number previously entered
> using one or more  of the special  digit characters. Thus the setting of
> colour components could be done in the same reverse notation way that
> the FORTH computer language works.
>

FORTH is not relevant to this discussion. Anyway, the usual order for Forth
operators (Forth is a stack-based language, similar to PostScript, and
working like calculators using reverse Polish order) is to push the
operands from left to right and then use the operator, which will pop them
in reverse order from right to left before pushing the result on the stack
(so "a/b/c" becomes "/a get /b get div /c get div"). But a color is just an
operator like "rgb(r,g,b)", and the natural order in stack-based languages
should also be "/r get /g get /b get rgb".
Note that C/C++ (with C calling conventions) usually uses another order for
its stack, pushing parameters from right to left (if they are not passed
via dedicated registers in a fixed order, the first parameter from the
right that fits in a register being passed not on the stack but in the
"main" accumulator register, possibly a register pair for long integers or
long pointers, or a different register for floating point if floating-point
registers are used).
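
As a tiny worked example of that operand order (a sketch only, with a made-up
"rgb" word and integer operands), a postfix evaluator pushes the operands left
to right and the operator pops them back in reverse:

    def eval_rpn(tokens):
        stack = []
        for tok in tokens:
            if tok == "rgb":
                b, g, r = stack.pop(), stack.pop(), stack.pop()  # popped right to left
                stack.append(("rgb", r, g, b))
            else:
                stack.append(int(tok))
        return stack.pop()

    print(eval_rpn("255 128 0 rgb".split()))   # ('rgb', 255, 128, 0)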

There's no standard for the order of parameters in stack-based languages.
It is arbitrary and specific to each language or to specific
implementations of them. So if you want to create your own scripting
language to support your non-standard extension, you can choose any order
you want, but this will still not define a standard related to other
languages that have never been bound to a specific evaluation/encoding
order. So don't pretend it will be part of the Unicode standard, which is
not a scripting language and does not offer an "ABI" for stateful encodings
with arbitrarily long contexts (Unicode has placed very low limits on the
maximum length of lookahead needed to process text; your extension would
not work under these reasonable limits, so it will have limited private use
and cannot be part of TUS).

You may create your "proof of concept" (tested on limited configurations)
but it will just be private.

[And so it should use PUA for full compatibility and not abuse the other
standardized code points, as your extension would not be
compatible/conforming with the existing rules and limits without amending
them, discussing at length how existing conforming applications can be
adapted, and analyzing the effects if they are not updated. Approving this
extension is another thing, and it will need to pass the standard process
to be added to the proposals schedule, pass through the two technical
committees, pass the alpha and beta phases, and then the prepublication.
You'll also need to work on documentation and fix many quirks found in it,
then you'll need supporters to pass the vote (and if you're not a UTC
member or an ISO member, you will never be able to vote for it: you then
need to convince the voters by listening to their remarks and refining your
specification to match their desires, and probably to split your proposal
into several parts or limit your initial goals, leaving the other
problematic points for later). What remains "stable" in your proposal may
not be usable in practice without the additional extensions still in
discussion, and in fact this subset may remain in the encoding queue for
years, until it reaches a point where it starts being usable for practical
problems; before that, you'll have to experiment with private use and
should be ready to accept competing proposals, not compatible with yours,
and learn from them to reach an acceptable consensus. Reaching that
consensus is the longest step, but initially most voters will not decide
for or against your proposal if they are not confident enough about the
merit of each proposal, because they want to preserve a reasonable
compatibility across TUS versions and with existing applications, without
adding further problems, notably in terms of confusability/security. But
don't ask them to break the existing stability rules, which were even
harder to formalize: these rules are the foundation that allowed TUS/ISO
10646 to become a successful worldwide standard with lots of applications
using 

Re: Encoding italic

2019-02-11 Thread Philippe Verdy via Unicode
Le dim. 10 févr. 2019 à 16:42, James Kass via Unicode 
a écrit :

>
> Philippe Verdy wrote,
>
>  >> ...[one font file having both italic and roman]...
>  > The only case where it happens in real fonts is for the mapping of
>  > Mathematical Symbols which have a distinct encoding for some
>  > variants ...
>
> William Overington made a proof-of-concept font using the VS14 character
> to access the italic glyphs which were, of course, in the same real
> font.  Which means that the developer of a font such as Deja Vu Math TeX
> Gyre could set up an OpenType table mapping the Basic Latin in the font
> to the italic math letter glyphs in the same font using the VS14
> characters.  Such a font would work interoperably on modern systems.
> Such a font would display italic letters both if encoded as math
> alphanumerics or if encoded as ASCII plus VS14.  Significantly, the
> display would be identical.
>
>  > ...[math alphanumerics]...
>  > These were allowed in Unicode because of their specific contextual
>  > use as distinctive symbols from known standards, and not for general
>  > use in human languages
>
> They were encoded for interoperability and round-tripping because they
> existed in character sets such as STIX.  They remain Latin letter form
> variants.  If they had been encoded as the variant forms which
> constitute their essential identity it would have broken the character
> vs. glyph encoding model of that era.  Arguing that they must not be
> used other than for scientific purposes is just so much semantic
> quibbling in order to justify their encoding.
>
> Suppose we started using the double struck ASCII variants on this list
> in order to note Unicode character numbers such as 핌+픽피픽픽 or
> 핌+ퟚퟘퟞퟘ?  Hexadecimal notation is certainly math and Unicode can be
> considered a science.  Would that be “math abuse” if we did it?  (Is
> linguistics not a science?)
>
>  > (because these encodings are defective and don't have the necessary
>  > coverage, notably for the many diacritics,
>
> The combining diacritics would be used.
>
Not for the many precomposed characters that exist in Latin: do you intend
to propose that they be reencoded with all the same variants encoded for
maths? Or to allow the maths symbols to have diacritics added to them?
(Hint: this does not work correctly with the specific mathematical
conventions on diacritics and their specific stacking rules: they are NOT
reorderable through canonical equivalence, the order is significant in
maths, so you would also need to use CGJ to fix the expected logical,
semantic and visual stacking order.)

>
>  > case mappings,
>
> Adjust them as needed.
>

Not so easy: case mappings cannot be adjusted; they are stabilized in
Unicode. You would need special casing rules under a specific "locale" for
maths.

Really, maths is a specific script, even if it borrows some symbols from
Latin, Greek or Hebrew, and only in specific glyph variants. These symbols
should not even be considered part of the script they originate from (just
as Latin A is not the same as Cyrillic A or Greek Alpha, which all have the
same form and the same origin).

I can argue the same thing about IPA notation: it is NOT the Latin script;
it also borrows some letter forms from Latin and Greek, but without any
case mappings (only lowercase is used), and also with specific glyph
variants.

Both examples are technical notations which do not obey the linguistic
rules and normal processing of the script they originate from. They are
specific "writing systems", unfortunately conflated with "Unicode scripts",
and then abused.

Note that some Latin letters have been borrowed from IPA too, for use in
African languages, and then case mappings were needed: these should have
been reencoded as plain letter pairs with a basic case mapping (not the
special case mapping rules now needed for African languages), such as open
o, which looks much like a mirrored Latin c, and open e, which was borrowed
from the Greek lowercase epsilon but does not use the uppercase Greek
Epsilon and instead uses another shape, meaning that the Latin open e
should have been encoded as a plain letter pair, distinct from the Greek
epsilon; but IPA already used the epsilon-like symbol...

In the end these exceptions just cause many inconsistencies and
complexities. Applications and libraries cannot adapt easily and are not
downward compatible, because stable properties are immutable and specific
tailorings are needed each time in applications: the more such exceptions
we add, the harder the standard is to adapt and the more difficult
compatibility is to preserve. In summary, I don't like at all the dual
encodings, or encodings of additional letters that cannot use the normal
stable properties (and this remark is also true for emoji: what a mess!
Full of exceptions and different incoherent encoding models!)


Re: Bidi paragraph direction in terminal emulators

2019-02-10 Thread Philippe Verdy via Unicode
Le sam. 9 févr. 2019 à 20:55, Egmont Koblinger via Unicode <
unicode@unicode.org> a écrit :

> Hi Asmus,
>
> > On quick reading this appears to be a strong argument why such emulators
> will
> > never be able to be used for certain scripts. Effectively, the model
> described works
> > well with any scripts where characters are laid out (or can be laid out)
> in fixed
> > width cells that are linearly adjacent.
>
> I'm wondering if you happen to know:
>
> Are there any (non-CJK) scripts for which a mechanical typewriter does
> not exist due to the complexity of the script?
>

Look into South Asian scripts (Lao, Khmer, Tibetan...) and large
syllabaries (CANS, Ethiopic).
Even Arabic is challenging and does not work very well (or is very ugly)
with typewriters or monospaced fonts, unless we use "simplified" Arabic.
Hebrew is a bit better but also has issues if you need to support all its
diacritics.

Finally, even Latin is not easy to fit, with its ligatures and multiple
diacritics, some of them with complex layouts and applicable to pairs of
letters, or sometimes larger groups.
The monospace restriction is a strong limitation: but then I don't see why
a "terminal" could not handle fonts with variable metrics, and why it must
be modeled only as a regular grid of rectangular cells (all of equal size)
each containing only one "character" (or cluster?). It is perfectly
possible to have a terminal handle text as a collection of "logical lines",
split (horizontally?) into multiple spans covering one or more cells, each
span containing one or more characters (or a full cluster) rendered
correctly.

But then you recreate the basic HTML standard (just discard the "document"
and "body" levels, which would be implicit in a terminal, keep the "block"
and "inline" elements, and flow the text; note that rendered lines could as
well have variable heights, depending on the height of their unbreakable
spans and their vertical alignment...). But then you need specific controls
to make proper vertical alignments (basically you need a "tabulator" in the
terminal, with a way to define the start of a tabulator scope and its end,
and then to reference tabulations by id when defining them in the middle of
the text; this tabulator would be more powerful than just the TAB control,
which only uses an implicit/predefined tabulator).

Then, for editors in terminals, you need a way to query the position of
some items and make "logical" moves: the simple (line/column) coordinates
on a grid are not usable. In HTML we would do that with form input elements
(the form is flowed normally but is navigable, and input elements will have
their own editable areas).

So using controls, you would try to mimic again what HTML already provides
you for free (and without complex specifications and redevelopment).

So my opinion is that all legacy terminal protocols will remain broken, and
it is more viable to work with the W3C to define a basic HTML profile
suitable for terminals, one that will benefit from all the improvements
made in HTML to support i18n, including required ones (BiDi, variable-width
fonts needed for complex scripts, accessibility...), but without the extra
elements that were added in HTML5 for semantic document structures (HTML5
still speaks about the "document" level, but there's little defined for
documents that are infinite streams that you can start reading from a
random position and that are possibly never terminated):

All we need is a subset of HTML5 with only a few block elements without
terminator tags ("p" would be implicit) and the inline elements for all the
rest, and this becomes a viable "terminal protocol" which would deprecate
all the legacy VT-like protocols (and would put an end to the desire of
adding many new controls or duplicate reencodings in Unicode for specific
styles).

The only block elements that would be useful on top of this are forms and
form inputs, to create editable fields, plus some attributes to allow or
disallow editing. Scripting would be an option (only for local data
validation or filtering of inputs that must not be sent to the server, or
to enable accessibility features, input methods and orthographic helpers).
With that we are no longer blocked by the old terminal limitations (but it
will still be possible for a terminal emulator to create a reasonable
layout to map it to a grid-based terminal, and then offer some helper tools
to show a selectable popup view for things that cannot be rendered on the
basic grid).


Re: Encoding italic

2019-02-10 Thread Philippe Verdy via Unicode
Le dim. 10 févr. 2019 à 05:34, James Kass via Unicode 
a écrit :

>
> Martin J. Dürst wrote,
>
>  >> Isn't that already the case if one uses variation sequences to choose
>  >> between Chinese and Japanese glyphs?
>  >
>  > Well, not necessarily. There's nothing prohibiting a font that includes
>  > both Chinese and Japanese glyph variants.
>
> Just as there’s nothing prohibiting a single font file from including
> both roman and italic variants of Latin characters.
>

Maybe, but such a font would not work as intended to display both styles
distinctly with the common use of the italic style: it would have to make a
default choice, and you would then need either a special text encoding or
to enable an OpenType feature (if using the OpenType font format) to select
the other style in a non-standard, custom way.

The only case where it happens in real fonts is for the mapping of
Mathematical Symbols, which have a distinct encoding for some variants
(only for a basic subset of the Latin alphabet, as well as some basic Greek
and a few other letters from other scripts), and this is typically done
only in symbol fonts containing other mathematical symbols, and only
because of the specific encoding for such mathematical use. We also have
the variants registered in Unicode for IPA usage (only lowercase letters,
treated as symbols and not case-paired).
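
For concreteness, here is a small sketch (illustrative only) of what that
distinct mathematical encoding looks like in practice: it remaps plain ASCII
letters onto the Mathematical Italic alphanumeric symbols, and shows one of
the quirks (the hole at italic small h) that makes this range unsuitable as a
general italic encoding.

    def math_italic(text: str) -> str:
        out = []
        for ch in text:
            if "A" <= ch <= "Z":
                out.append(chr(0x1D434 + ord(ch) - ord("A")))   # ITALIC CAPITAL A..Z
            elif ch == "h":
                out.append("\u210E")   # the hole: italic small h is PLANCK CONSTANT
            elif "a" <= ch <= "z":
                out.append(chr(0x1D44E + ord(ch) - ord("a")))   # ITALIC SMALL A..Z
            else:
                out.append(ch)         # no coverage for accents, punctuation, etc.
        return "".join(out)

    print(math_italic("Hello maths"))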

These were allowed in Unicode because of their specific contextual use as
distinctive symbols from known standards, and not for general use in human
languages (because these encodings are defective and don't have the
necessary coverage, notably for the many diacritics, case mappings, and
other linguistic, segmentation and layout properties).

The same can be said about superscript/subscript variants, bold variants,
and monospace variants: they have specific uses and are not made for
general-purpose text in human languages with their common orthographic
conventions. Latin is a large script and one of the most complex, and it's
quite normal that there are some deviating usages for specific purposes,
provided they are bounded in scope and use.

But what you would like is to extend the whole Latin script (and why not
Greek, Cyrillic, and others) with multiple reencodings for lots of
stylistic variants, and each time a new character or diacritic is encoded
it would have to be encoded multiple times (so you'd break the character
encoding model, would just complicate the implementation even more, and
would also create new security issues with lots of new confusables that
every user of Unicode would then have to take into account; every
application or library would then need to be updated and would have to
include large data tables to handle them).

As well, it would create many conflicts if we used the "VARIATION SELECTOR
n" characters, or we would need to permanently assign specific ones to
specific styles; and then we would rapidly no longer have enough "VARIATION
SELECTOR n" selectors in Unicode: we only have 256 of them, and only one is
more or less permanently dedicated.

[VS16 is almost completely reserved now for the distinction between
normal/linguistic and emoji/colorful variants. The emoji subset in Unicode
is an open set which could expand in the future to tens of thousands of
symbols, and will likely cause a large work overhead in the CLDR project
just to describe them, one reason why I think that emoji character data in
CLDR should be separated into a distinct translation project, with its own
versioning and milestones, and not maintained in sync with the rest of CLDR
data, if we consider how emoji have flooded the CLDR survey discussions,
when this subset has many known issues and inconsistencies and still no
viable encoding model like the "character encoding model" to make it more
consistent and updatable separately from the rest of the Unicode UCD
releases; in my opinion the emoji in Unicode are still an alpha project in
development and it's too soon to describe them as a "standard" when there
are many other possible ways to handle them; these emoji are just there now
to remain as "legacy" mappings but won't resist an expected new formal
standard about them instead of the current mess they create now.]


Re: Bidi paragraph direction in terminal emulators (was: Proposal for BiDi in terminal emulators)

2019-02-07 Thread Philippe Verdy via Unicode
Adding a single bit of protection in cell attributes, to indicate that they
are either protected or become transparent (with the rest of the
attributes/character field indicating the id of another terminal grid or
rendering plugin creating its own layer and having its own scrolling state
and dimensions), can allow convenient things, including the possibility of
managing a grid-based system of stackable windows.
You can design one of the layers to allow input (managed directly in the
terminal, with local echo, without transmission delays and without risks of
overwriting surrounding contents).
Asynchronous behavior can be defined as well between the remote
application/OS and the local processing in the terminal.
The protocol can also support an extension to provide alternate streams
(take MIME multipart as an example). This can even be used to transport the
inputs and outputs for each layer, and additional streams to support
(Java)scripts, or the content of an image, or a link to a video stream.
And just like with classic graphics interfaces, you can have more than just
solid RGB colors and add an alpha layer. The single rectangular flat grid
design is not the only option. Layered approaches can then even be rendered
on hardware easily by mapping these virtual layers and flattening them
internally in the terminal emulator to the single flat grid supported by
the hardware. The result is more or less equivalent to graphic RGB frames,
except that the unit is not a single pixel but a whole cell with not just
one color but a pair of colors, an encoded character and a font selected
for that cell, or, if a single font is supported, using a dynamic font and
storing glyph ids in that font (prescaled for the cell size). The hardware
then does the rest to build the pixels of the frame, and it can be easily
accelerated.
The layered approach could also be used to link together the cells that
use the same script and font settings, in order to use proportional fonts
when monospaced fonts are not usable, and to justify their text in the
field (which may turn out to be scrollable itself when needed for input).
Having multiple communication streams between the terminal emulator and the
remote application allows the application to query the properties and
behave in a smarter way than with just static "termcaps" not taking into
account the actual state of the remote terminal.
All this requires some extensions to VT-like protocols (using specific
escape sequences, just like the Xterm extensions for X11).

You can also reconsider how "old" mainframe terminals worked: the user in
fact never submitted characters one by one to the remote application. The
application sent a full screen and an input form, and the user at the
terminal could fill in the form and press a "submit/send" button when he
had finished entering the data. While the user was entering data, there
was absolutely no need to communicate each typed keystroke to the
application; everything was handled by the terminal itself, which was
instructed accordingly (and could even perform form data validation with
input formats and some conditions, possibly with a script as well). In
other words, they worked mostly like an HTML input form with a submit
button.

Such a mode is very useful for small devices because they don't have to
react interactively with the user, the transmission delays (which may be
long) are no longer a problem, the user can enter and correct data easily,
and the editing facilities don't need to be handled by the remote
application (which today could be a very tiny device with in fact much less
processing power than the terminal emulator, and which would in fact have
no knowledge at all of the fonts needed). A terminal emulator can do a lot
of things itself, locally. And this would also be useful on many modern
application servers that need to serve lots of remote clients, possibly
over very slow internet links with long round-trip times.

The idea behind this is to allow distributing the workload and deciding
which side will handle part or all of the I/O. Of course it will transport
text (preferably in a Unicode UTF), but text is not the only content to
transport. There are also audio/video/images, security items (certificates,
signatures, personal data that should remain private and be encrypted, or
only sent to the application in a one-way-hashed form), plus some
states/flags that could provide visual/audio hints to the user when working
in the rendered input/output form on his local terminal emulator.

I spoke about HTML because terminal-based browsers have existed for a long
time, some of them still maintained in 2019 (w3m, still used as a
W3C-sponsored demo, Lynx, best known on Linux, or elinks):
  https://www.slant.co/topics/4702/~web-browsers-that-run-in-a-terminal
This gives a good idea of what is needed, what a good terminal protocol can
do, and what the many legacy VT-like protocol variants have never tried to
unify. These browsers don't reinvent the wheel: HTML 

Re: Bidi paragraph direction in terminal emulators (was: Proposal for BiDi in terminal emulators)

2019-02-07 Thread Philippe Verdy via Unicode
Le jeu. 7 févr. 2019 à 19:38, Egmont Koblinger  a écrit :

> As you can see from previous discussions, there's a whole lot of
> confusion about the terminology.


And that was exactly the subject of my first message sent to this thread!
You probably missed it.


> Philippe, with all due respect, I have the feeling that you have some
> fundamental problems with my work (and I'm tempted to ask back: have
> you read it at all?), but your message about what your problem is just
> doesn't come across to me. Could you please avoid all those irrelevant
> stories with baud rate and font size and Asian scripts and whatnot,
> and clearly get to your point?
>

I have never said anything about your work, because I don't know where you
spoke about it or where you made some proposals. I must have missed one of
your messages (did it reach this list?). So don't take that as a personal
attack, because this only started with a reply I made (the one specifically
about the various ambiguities of encoded newlines in terminal protocols,
which do not match the basic plain-text definition (similar to MIME) made
only for static documents, and were never tuned for interactive
bidirectional use, including for example text editors, which also require
modeling a 2D layout, and which also set some assumptions about
"characters" visible in single cells of a regularly spaced grid, with a
known number of lines and columns, independent of the lines of the text
rendered and read on it).

Terminals are not displaying plain text; they create their own upper-layer
protocol which requires and enforces the 2D layout (whereas Unicode is a
purely linear protocol with only relations between one character and the
next one in a 1D stream, and no assumption at all about their display
width, which cannot be monospaced in all scripts), and display positions
are definitely not in logical order: try adding characters at the end of a
logical line; with Bidi text you do not just replace the content of one
cell, you have to scroll the content of surrounding cells, and your input
caret position does not necessarily change, or you'll reach a point where a
visual line will be split in two parts, but not at the caret position, with
some parts moved from top to bottom.

Bidi does not specify the 2D layout completely; it is purely 1D, speaks
about left and right directions, and does not specify what happens when
contents do not fit on the visual line for the text which is already
present there before inserting new text, or even what will be replaced if
you are in replace mode and not in insert mode. The Bidi algorithm is not
designed to handle overwrites, and neither is the whole Unicode standard
itself, which is made as if all text were inserted only at the end of lines
and never replaced anything.

For now, terminal protocols, and the emulators trying to implement them,
which must mix desynchronized input and output (especially when they have
to do "local echo" of the input for performance reasons over slow serial
links, where there's no synchronization between the local buffer of the
terminal and the remote virtual buffer of the terminal emulator in the
emitting app, even those using the best "termcap" definitions), have no
easy way to do that. The logical encoding of Unicode does not play well,
and the time to resynchronize the local and remote buffers is a limiting
factor (over a 9.6 kbps link, refreshing the whole screen takes too long,
and this cannot be done on every keystroke of input, or user input would
have to be dramatically slow if local echoing is also enabled, or most user
inputs that are too fast would have to be discarded, which makes user input
very unreliable, requiring constant correction; these protocols are
definitely not human-friendly as they depend on strict timing, which is not
the way humans enter text; this timing is also unpredictable and very
variable over serial links, and the protocols do not have any specification
for timing requirements. In fact time is constantly ignored, even if it
plays an evident role).

If you look at historic "terminal" protocols, techniques were used to
control time: notably the XON/XOFF protocols, or mechanical constraints,
especially when the output was a printer (with a daisywheel or matrix
head). But time was just controlled between one machine and another; a
human could not really interact asynchronously. And it was in a time when
full-screen text editors did not even exist (at most people were typing "on
the fly" and text layout was completely forgotten). This changed radically
when the output became a screen, with the assumption that the output was
instantaneous, once the mechanical restrictions were removed.

Some older terminal protocols, notably for mainframes, were better than
today's VT-like protocols: you did not transmit just what would be
displayed, but you also described the screen areas where user input is
allowed, the position of fields, and the navigation between them: the
terminal then had no difficulty avoiding breaking the output when 

Re: Bidi paragraph direction in terminal emulators (was: Proposal for BiDi in terminal emulators)

2019-02-07 Thread Philippe Verdy via Unicode
Le jeu. 7 févr. 2019 à 13:29, Egmont Koblinger  a écrit :

> Hi Philippe,
>
> > There's some rules for correct display including with Bidi:
>
> In what sense are these "rules"? Where are these written, in what kind
> of specification or existing practice?
>

"Rules" are not formally written, they are just a sense of best practices.
Bidi plays very badly on terminals (even enhanced terminals like VT-* or
ANSI that expose capabilities when, most of the time, these capabilities
are not even accessible: it is too late and further modifications of the
terminal properties (notably its display size) can never be taken into
account (it is too late, the ouput has been already generated, and all what
the terminal can do is to play with what is in its history buffers). Even
on dual-channel protocols (input and output), terminal protocols are also
not synchronizing the input and the output and these asynchrnous channels
ignore the transmission time between the terminal and the aware
application, so the terminal protocol must include a functio nthat allows
flushing and redrawing the screen completely (but this requires long
delays). With a common 9.6kbps serial link, refreshing a typical 80x25
screen takes about one half second, which is much longer than typical user
input, so full screen refresh does not work for data input and editing, and
terminals implement themselves the echo of user input, ignoring how and
when the receiving application will handle the input, and also ignoring if
the applciation is already sending ouput to the terminal.
It's hard or impossible to synchroinize this and local echoes on the
terminal causes havoc.
I've not seen any way for a terminal to handle all these constraints. So
the only way for them is to support them only plain-text basic documents,
formatted reasonnably, and inserting layout "hints" in the format of their
output so that termioanl can perform reasonnable guesses and adapt.
But the concept of "line" or "paragraph" in a terminal protocols is
extremely fuzzy. It's then very difficult to take into account the
additiona Bidi contraints as it's impossible to conciliate BOTH the logical
ordering (what is encoded in the transmitted data or kept in history
buffers) and the visual ordering. That's why there are terminal protocols
that absolutely don't want to play with the logical ordering and require
all their data to be transmitted in visual order (in which case, there's no
bidi handling at all). Then terminals will attempt to consiliate the visual
line delimitations (in the transmitted data) with the local-only
capabilities of the rendered frame. Many terminals will also not allow
changing the display width, will not allow changing the display cell size,
will force constraints on cell sizes and fonts, and then won't be able to
correctly output many Asian scripts.
In fact most terminal protocols are very defective and were never dessign
to handle Bidi input, and Asian scripts with compelx clusters and variable
fonts that are needed for them (even CJK scripts which use a mix of
"half-wifth" and "full-width" characters.

> - Separate paragraphs that need a different default Bidi by double
> newlines (to force a hard break)
>
> There is currently no terminal emulator I'm aware of that uses empty
> lines as boundaries of BiDi treatment.
>

These are hints in the absence of something else, and they play a role when
the terminal display width is unpredictable for the application producing
the output, which has no access to any return input channel.
Take the example of terminal emulators in resizable windows: the display
width is undefined, but there's no document level and no buffering,
scrolling text will partially flush the output, and history is limited. A
terminal emulator then needs hints about where paragraphs are delimited,
and most often it doesn't have any other distinctions available, even in
its limited history, that would allow distinguishing the 3 main kinds of
line breaks.


> While my recommendation uses one smaller unit (logical lines), and I
>

And here your unit (logical lines) is not even defined in the terminal
protocol and is not known to the emitting application, which has no
information about the final output terminal properties. So the terminal
must make guesses. As it can insert additional line breaks itself, and
scroll out some portion of the text, there's no way to delimit the effect
of "bidi controls". The basic requirement for correctly handling bidi
controls is to make sure that paragraph delimitations are known and stable.
If additional breaks can occur anywhere in what you think is a "logical
line", but which is different for the emitting application (or for a static
text document which is output "as is" without any change to reformat it),
these bidi controls just make things worse, and it becomes impossible to
make reasonable guesses about paragraph delimitations in the terminal. The
result becomes unpredictable and most often will not even make any sense,
as the terminal uses visual ordering

Re: Bidi paragraph direction in terminal emulators (was: Proposal for BiDi in terminal emulators)

2019-02-06 Thread Philippe Verdy via Unicode
I read your email; you spoke for example about how a typical Unix/Linux
tool shows its usage options (e.g. "anycommand --help"), with a leading
line, then syntaxes and tabulated lists of options followed by translated
help on the same line.

There are some rules for correct display, including with Bidi (see the
sketch after this list):

- Separate paragraphs that need a different default Bidi direction by
double newlines (to force a hard break).
- Use a single newline on continuation.
- If technical items are untranslatable, make sure they are at the
beginning of lines, indented by some leading spaces, before translated
ones.
- Avoid breaking lists.
- Try to separate, as much as possible, text in natural languages from
technical text.
- Be careful about correct usage of leading punctuation (notably for list
items).
- Be consistent about indentation.
- Normalize spaces.
- Don't assume that TAB controls have the same width (ban TABs except at
the beginning of lines).
- In column output, always separate columns with at least two spaces; don't
glue them as if they were sentences.
- Don't use "soft line breaks" in the middle of short lines (fewer than 72
base characters).
- Don't use any Bidi controls!
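
Here is a small sketch of what output following these rules can look like (a
hypothetical tool with hypothetical options, written as a Python snippet only
to keep the layout explicit): the untranslatable syntax line comes first and
indented, the translated prose is its own paragraph, columns are separated by
two spaces, and there are no tabs and no Bidi controls.

    USAGE_LINES = [
        "  anycommand [-o FILE] [--verbose] INPUT...",   # untranslated syntax, indented
        "",                                              # blank line separates paragraphs
        "Reads the INPUT files and writes a report.",    # translated prose
        "",
        "  -o FILE     write the report to FILE",        # two-space column separation
        "  --verbose   print progress messages",
    ]
    print("\n".join(USAGE_LINES))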

With some care, you can perfectly well translate Linux/Unix tools into
languages needing Bidi and get consistent output, but be careful if your
text contains placeholders or technical untranslated terms (make sure to
surround them with paired punctuation, or don't translate them at all), and
avoid paragraphs that would mix natural-language text and technical
untranslatable terms (such as command names or command-line options).

Make sure to test the output so that it will also work with variable-width
fonts (don't assume monospaced fonts are used: they do not exist for
various scripts and don't work reliably for Arabic and most Asian scripts,
nor even for Chinese or Japanese, even though these don't need Bidi
support).

But the difficulty is not really in the terminal emulators but in the
source texts given to translators, when they don't know the context in
which the text will be used and have no hint about which terms should not
be translated (because the result can become inconsistent: there are many
examples, even in Windows 10, where some of the command-line tools are
completely unusable with the translated UI, with examples of syntaxes that
do not even work because some terms were randomly and inconsistently
translated or confused, or because tools assumed an LTR-only layout of the
output and monospaced fonts with one character per display cell, or
required specific fonts that do not contain the characters in their
monospaced variants: this is challenging notably for Asian scripts needing
complex clusters if you made these Latin-based assumptions).


Le mer. 6 févr. 2019 à 22:30, Egmont Koblinger  a écrit :

> Hi Philippe,
>
> Thanks a lot for your input!
>
> Another fundamental difficulty with terminal emulators is: These
> controls (CR, LF...) are control instructions that move the cursor in
> some ways, and then are forgotten. You cannot do BiDi on the
> instructions the terminal receives. You can only do BiDi on the
> result, the contents of the canvas after these instructions are
> executed. Here these controls are either lost, or you have to give a
> specification how exactly they need to be remembered, i.e. converted
> to being part of the canvas's data.
>
> Let's also mention that trying to get apps into using them is quite
> hopeless. The best you can do is design BiDi around what you already
> have, which pretty much means hard vs. soft line endings, and
> hopefully forthcoming semantical marks around shell prompts. (To
> overcomplicate the story, a received LF doesn't convert the line
> ending to hard wrapped in most terminal emulators. In some it does. I
> don't think there's an exact specification anywhere. Maybe the BiDi
> spec needs to create one. Lines are hard wrapped by default, turned to
> soft wrapped when the text gets wrapped at the end of the line, and a
> few random control functions turn them back to hard one, but in most
> terminals, a newline is not such a control function.)
>
> Anyway, please also see my previous email; I hope that clarifies a lot
> for you, too.
>
>
> cheers,
> egmont
>
> On Tue, Feb 5, 2019 at 5:53 PM Philippe Verdy via Unicode
>  wrote:
> >
> > I think that before making any decision we must make some decision about
> what we mean by "newlines". There are in fact 3 different functions:
> > - (1) soft line breaks (which are used to enforce a maximum display
> width between paragraph margins): these are equivalent to breakable and
> compressible whitespaces, and do not change the logical paragraph
> direction, they don't insert any additionnal vertical gap between lines, so
> the logicial line-height is preserved and continues uninterrupted. If text
> justification

Re: Bidi paragraph direction in terminal emulators (was: Proposal for BiDi in terminal emulators)

2019-02-05 Thread Philippe Verdy via Unicode
I think that before making any decision we must decide what we mean by
"newlines". There are in fact 3 different functions:
- (1) soft line breaks (which are used to enforce a maximum display width
between paragraph margins): these are equivalent to breakable and
compressible whitespace, and do not change the logical paragraph direction;
they don't insert any additional vertical gap between lines, so the logical
line-height is preserved and continues uninterrupted. If text justification
applies, this whitespace will be entirely collapsed into the end margin,
and any text before it will still be justified to match the end margin
(until the maximum expansion of other whitespace in the middle is reached,
and the maximum inter-character gap is also reached, in which case that
line will no longer be expanded); but this does not apply to terminal
emulators, which normally never use text justification, so the text will
just be aligned to the start margin, and whitespace before the break on the
same line is preserved, collapsed only at the end of the line (just before
the soft line break itself);
- (2) hard line breaks: they break to a new line but continue the paragraph
in its same logical direction; they are not compressible whitespace (and do
not depend on the logical end margin of the paragraph);
- (3) paragraph breaks: generally they introduce an additional vertical gap
with top and bottom margins.

The problem in terminals is that they usually cannot distinguish types (1)
and (2): they are simply encoded by a single CR, or LF, or CR+LF, or NEL.
Type (1) only exists within the framework of a higher-level protocol which
gives additional interpretation to these "newlines". The special control LS
is almost never used but may be used for type (1), i.e. soft line breaks,
and will fall back to type (2), which is represented by the legacy "simple"
newlines (single CR, or single LF, or single CR+LF, or single NEL). I have
seen very little or no use of the LS (LINE SEPARATOR) special control.

Type (3) may be encoded with PS (PARAGRAPH SEPARATOR), but in terminals
(and common protocols like MIME) it is usually encoded using a pair of
newlines (CR+CR, or LF+LF, or CR+LF+CR+LF, or NEL+NEL), possibly with
additional whitespace (and additional presentation characters, such as ">"
in quotations inserted in mail responses) between them (needed for MIME and
HTTP), which may be collapsed when rendering or interpreting them.

Some terminal protocols can also use other legacy ASCII separators such as
FS, GS, RS, US for grouping units containing multiple paragraphs, or
STX/EOT pairs for encapsulating whole text documents in a protocol-specific
envelope format (and will also use some escaping mechanism for special
controls found in the middle, such as DLE+control to escape the control, or
DLE+0 to escape a NUL, or DLE+# to escape a DEL, or DLE+x+NN where the N
are a fixed number of hexadecimal, decimal or octal digits). There's a wide
variety of escaping mechanisms used by various higher-layer protocols
(including transport protocols or encoding syntaxes used just below the
plain-text layer, in a lower layer than the transport protocol layer).

Le lun. 4 févr. 2019 à 21:46, Eli Zaretskii via Unicode 
a écrit :

> > Date: Mon, 4 Feb 2019 19:45:13 +
> > From: Richard Wordingham via Unicode 
> >
> > Yes.  If one has a text composed of LTR and RTL paragraphs, one has to
> > choose how far apart their starting margins are.  I think that could
> > get complicated for plain text if the terminal has unbounded width.
>
> But no real-life terminal does.  The width is always bounded.
>


Re: Proposal for BiDi in terminal emulators

2019-02-02 Thread Philippe Verdy via Unicode
Actually, not all of U+E0020 through U+E007E are "un-deprecated" for this use.

For now emoji flags only use:
- U+E0041 through U+E005A (mapping to the ASCII letters A through Z used in
2-letter ISO 3166-1 codes). These are usable in pairs, without requiring any
modifier (and only for registered ISO 3166-1 codes).
- I think that U+E0030 through U+E0039 (mapping to ASCII digits 0 through 9)
are reserved for ISO 3166 extensions, starting with only the 3 "countries"
added in the United Kingdom (England, Scotland and Wales), with possible
pending additions for other ISO 3166-2 codes (but not mapping any dash
separator). These tags are used as modifiers in sequences starting with a
leading U+1F3F4 (WAVING BLACK FLAG) emoji.
- U+E007F (CANCEL TAG) is already used as well for these regional
extensions, as a mandatory terminator, as seen in the three British
countries. It is not used for country flags made of 2-letter emoji codes
without any leading flag emoji.
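
For concreteness, here is a minimal Python sketch of how such an emoji tag
sequence is assembled (my own illustration; note that the subdivision flag
sequences actually published by Unicode use lowercase ISO 3166-2 codes such
as "gbsct" for Scotland):

    def tag_flag(region_code: str) -> str:
        # Base flag emoji, each ASCII letter/digit of the region code shifted
        # into the tag-character block (U+E0000 + code point), then CANCEL TAG.
        return ("\U0001F3F4"
                + "".join(chr(0xE0000 + ord(c)) for c in region_code)
                + "\U000E007F")

    scotland = tag_flag("gbsct")
    print([hex(ord(c)) for c in scotland])
    # -> ['0x1f3f4', '0xe0067', '0xe0062', '0xe0073', '0xe0063', '0xe0074', '0xe007f']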

And the proposal discussed here, to use U+E003C (mapped from the ASCII "<"
LESS-THAN SIGN) as the lead of a tag sequence re-encoding HTML tags,
terminated by U+E003E ">", and containing HTML element names written with
lowercase letter tags (possibly digit tags in those names, "/" for closing
tags, possibly U+E0020 TAG SPACE for separating HTML attributes, U+E003D "="
for attribute values, and U+E0022 (") or U+E0027 (') around attribute
values, with a problem left open if the mapped element names or attributes
contain non-ASCII characters...), is not standard (it's just an experiment
in one font), and would in fact not be compatible with the existing
specification for tags.

So only U+E0020 through U+E0040, and U+E005B through U+E007E, remain
deprecated.


On Fri, Feb 1, 2019 at 23:26, Doug Ewell via Unicode wrote:

> Richard Wordingham wrote:
>
> > Language tagging is already available in Unicode, via the tag
> > characters in the deprecated plane.
>
> Plane 14 isn't deprecated -- that isn't a property of planes -- and the
> tag characters U+E0020 through U+E007E have been un-deprecated for use
> with emoji flags. Only U+E0001 LANGUAGE TAG and U+E007F CANCEL TAG are
> deprecated.
>
> --
> Doug Ewell | Thornton, CO, US | ewellic.org
>
>
>


Re: Encoding italic

2019-02-01 Thread Philippe Verdy via Unicode
the proposal would contradict the goals of variation selectors and would
pollute the variation sequences registry (possibly even creating conflicts).
And if we admit it for italics, then another VSn will be dedicated to bold,
another for monospace, and finally many more would follow for various style
modifiers. Eventually we would no longer have enough variation selectors for
all the requests.
And all we would have done is try to reproduce another existing styling
standard, but very inefficiently (and this use would be "abused" for all
purposes, creating new implementation constraints and contradicting the
goals of existing styling languages: they would then decide to make these
characters incompatible for use in conforming applications). The Unicode
encoding would have lost all its interest.
I do not support the idea of encoding generic styles (applicable to more
than 100k+ existing characters) using variation selectors. Their goal is
only to allow semantic distinctions when two glyphs that were unified may
occasionally (not always) carry some significance in specific languages. But
what you propose would apply to all languages and all scripts, and would
definitely reserve some of the few existing VSn for this styling use,
blocking further registration of needed distinctions (VSn characters are
notably needed for sinographic scripts to properly represent toponyms or
person names, or to solve some problems with generic character properties in
Unicode that cannot be changed because of stability rules).


On Thu, Jan 31, 2019 at 16:32, wjgo_10...@btinternet.com via Unicode <unicode@unicode.org> wrote:

> Is the way to try to resolve this for a proposal document to be produced
> for using Variation Selector 14 in order to produce italics and for the
> proposal document to be submitted to the Unicode Technical Committee?
>
> If the proposal is allowed to go to the committee rather than being
> ruled out of scope, then we can know whether the Unicode Technical
> Committee will allow the encoding.
>
> William Overington
>
> Thursday 31 January 2019
>
>


Re: Encoding italic

2019-01-28 Thread Philippe Verdy via Unicode
So you used "bold" wrapped between tag-character versions of "<b>" and
"</b>", i.e. you converted the full HTML sequences from ASCII to tag
characters, including the HTML element name. I see little interest in that
approach.

Additionally this means that U+E003C is the tag identifier and its scope
does not end for the rest of the text (the HTML close tag closes the
previous Unicode tag but opens a new one, as the second sequence is not the
Unicode tag-cancel, U+E007F).

I bet that a Unicode-conforming process that handles some tag characters
could choose to remove everything in a Unicode tag that it does not
understand (e.g. U+E003C is not an understood identifier; only U+E0001 is
understood, as a language tag) or does not want to parse; but without the
tag-cancel, all the rest of your email could have been truncated, instead of
just the tagged text "bold".

Given how HTML tags nest (... or not...), I don't think this approach is
desirable.

And I'm not sure that everyone on this list actually received your mail with
this tag; it may have happened that your mail was truncated, or that all the
U+E00nn characters were silently removed by an intermediate agent not
wanting to support any Unicode tag character.
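
A trivial sketch of what such an intermediate agent might do (my own
illustration, not any specific mail software):

    def strip_tags(text: str) -> str:
        # Drop every character of the tag block U+E0000..U+E007F, as a
        # cautious filter might do with tag sequences it does not support.
        return "".join(c for c in text if not 0xE0000 <= ord(c) <= 0xE007F)

    marked = ("\U000E003C\U000E0062\U000E003E"                 # "<b>" as tags
              + "bold"
              + "\U000E003C\U000E002F\U000E0062\U000E003E")    # "</b>" as tags
    print(strip_tags(marked))   # -> bold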

On Mon, Jan 28, 2019 at 03:03, James Kass via Unicode wrote:

>
> On 2019-01-27 11:44 PM, Philippe Verdy wrote:
>
>  > You're not very explicit about the Tag encoding you use for these
> styles.
>
> This bold new concept was not mine.  When I tested it
> here, I was using the tag encoding recommended by the developer.
>
>  > Of course it must not be a language tag so the introducer is not
> U+E0001, or a cancel-all tag so it
>  > is not prefixed by U+E007F   It cannot also use letter-like,
> digit-like and hyphen-like tag characters
>  > for its introduction.  So probably you use some prefix in
> U+E0002..U+E001F and some additional tag
>  > (tag "I" for italic, tag "B" for bold, tag "U" for underline, tag "S"
> for strikethough?) and the cancel
>  > tag to return to normal text (terminate the tagged sequence).
>
> Yes, U+E0001 remains deprecated and its use is strongly discouraged.
>
>  > Or may be you just use standard HTML encoding by adding U+E to
> each character of the HTML
>  > tag syntax (including attributes and close tags, allowing embedding?)
> So you use the "<" and ">" tag
>  > characters (possibly also the space tag U+E0020, or TAB tag U+E0009
> for separating attributes and the
>  > quotation tags for attribute values)?  Is your proposal also allowing
> the embedding of other HTML
>  > objects (such as SVG)?
>
> AFAICT, this beta release supports the tag sequences , ,
> , &  expressed here in ASCII.  I don’t know if the
> software developer has plans to expand the enhancements in the future.
>
>  > And what is then the interest compared to standard HTML (it is not
> more compact, ...
>
> This was one of the ideas which surfaced earlier in this thread. Some
> users have expressed an interest in preserving, for example, italics in
> plain-text and are uncomfortable using the math alphanumerics for this,
> although the math alphanumerics seem well qualified for the purpose.
> One of the advantages given for this approach earlier is that it can be
> made to work without any official sanction and with no action necessary
> by the Consortium.
>
>  > I bet in fact that all tag characters are most often restricted in
> text input forms, and will be
>  > silently discarded or the whole text will be rejected.
>
> In this e-mail, I used the tags  &  around the word “bold” in the
> first sentence of my reply in order to test your bet.
>
>  > We were told that these tag characters were deprecated, and in fact
> even their use for language
>  > tags has not found any significant use except some trials (but there
> are now better technologies
>  > available in lot of softwares, APIs and services, and application
> design/development tools, or
>  > document editing/publishing tools).
>
> Indeed, these tags were deprecated.  At the time the tags were
> deprecated, there was such sorrow on this list that some list members
> were even inspired to compose haiku lamenting their passing and did post
> those haiku to this list.  Now, thanks to emoji requirements, many of
> those tags are experiencing a resurrection/renaissance.  I wonder if
> anyone is composing limericks in joyful celebration…
>
>


Re: Encoding italic

2019-01-27 Thread Philippe Verdy via Unicode
You're not very explicit about the Tag encoding you use for these styles.

Of course it must not be a language tag, so the introducer is not U+E0001,
nor a cancel-all tag, so it is not prefixed by U+E007F.
Nor can it use letter-like, digit-like and hyphen-like tag characters for
its introduction.
So probably you use some prefix in U+E0002..U+E001F and some additional tag
(tag "I" for italic, tag "B" for bold, tag "U" for underline, tag "S" for
strikethrough?) and the cancel tag to return to normal text (terminate the
tagged sequence).

Or maybe you just use standard HTML encoding by adding 0xE0000 to each
character of the HTML tag syntax (including attributes and close tags,
allowing embedding?). So you use the "<" and ">" tag characters (possibly
also the space tag U+E0020, or the TAB tag U+E0009, for separating
attributes, and the quotation tags for attribute values)?
Is your proposal also allowing the embedding of other HTML objects (such as
SVG)?

In that case what you do is only to remap the HTML syntax outside the
standard text. If an attribute value contains standard text (such as the
title in <span title="Some text">...), do you also remap the attribute
value, i.e. "Some text"? Do you remap the technical name of the HTML tag
itself, i.e. "span" in the last example?

And what is then the interest compared to standard HTML (it is not more
compact, and just adds another layer on top of it), except that it allows
embedding it in places where plain HTML would be restricted by form inputs,
or would be converted to character entities hiding the effect of "<", ">"
and "&" in HTML so that they are not reinterpreted as HTML but as plain-text
characters?

Now let's suppose that your convention starts being decoded and used in some
applications: this could be used to transport sensitive active scripts
(e.g. JavaScript event handlers or plain 

Re: Ancient Greek apostrophe marking elision

2019-01-27 Thread Philippe Verdy via Unicode
For Volapük, it looks much more like U+02BE (right half ring modifier
letter) than like U+02BC (apostrophe "modifier" letter), according to the
PDF at https://archive.org/details/cu31924027111453/page/n12

The half ring makes a clear distinction with the regular apostrophe (for
elisions) or quotation marks. It is really used in this context as a
modifier after other consonants, for borrowing words *phonetically* from
other languages, notably after 'c' and 'l'. Then U+02BD (left half ring
"modifier" letter) is a regular letter (for transliterating the aspirated
'h' from English). But I'm curious about the diacritic used above 'h' on
item (5) ("ta") of that page for transliterating the English soft "th". But
this was describing the "Labas" orthography.

In the next chapter ("Noms Tonabas"), another convention is used for the
apostrophe-like letters, and U+02BE (right half ring modifier letter) is
used instead of U+02BD for the aspirated 'h' (see paragraph 18), but it is
said to use the "Greek mark" (not sure if the author meant the coronis
U+1FBD or the smooth breathing U+1FBF).

So it looks like these were various early adaptations of the basic Volapük
orthography to borrow foreign names (notably proper names of people,
trademarks, toponyms and other place names), and these were part of several
competing proposals. I'm curious to know whether there was finally a wide
enough consensus to standardize these.

So it seems that for Volapük the apostrophe-like letters are not formally
assigned; authors will use whichever one they want when they transliterate
foreign words, or will simply avoid transliterating them at all if they
exist natively in a Latin form. (I bet English is not transliterated at all,
and French or German accents are preserved as is if they are already part of
the basic alphabet; the only standard diacritic is then the "diaeresis", as
used in the German umlaut. Volapük does not need any true diaeresis to avoid
the formation of diphthongs and digrams, since its whole orthography uses a
single base letter as a foundation principle.)

If so, the first convention, using the apostrophe-like modifier to create
digrams, is probably not favored, and the Tonabas convention is probably
more convenient and more compliant with the principles. I don't think they
will ever use the Greek signs or letters directly (like the one used for
transliterating the English 'ng'), and they would now prefer using the Latin
Eng letter.

The right half-ring, being rarely supported by fonts, is now most probably
rendered using U+02BC (for both letter cases, ignoring the bolder style for
the capital variant), which uses a curved comma shape (with a filled bowl at
the top). If there is a case distinction, the same glyph would be used but
at a different height instead of using bold distinctions, or the distinction
would be made using the alternate forms of the comma (probably the wedge for
lowercase, and the bowl with curl for capitals).

Note: are the different shapes of the comma (and of similar apostrophe-like
letters, or even of the semicolon) distinguished with encoded variation
selectors?


On Sun, Jan 27, 2019 at 18:42, Mark E. Shoulson via Unicode <unicode@unicode.org> wrote:

> Well, sure; some languages work better with some fonts.  There's nothing
> wrong with saying that 02BC might look the same as 2019... but it's
> nice, when writing Hawaiian (or Klingon for that matter) to use a bigger
> glyph. That's why they pay typesetters the big bucks (you wish): to make
> things look good on the page.
>
> I recall in early Volapük, ʼ was a letter (presumably 02BC), with value
> /h/.  And the "capital" ʼ was the same, except bolder: see
> https://archive.org/details/cu31924027111453/page/n11 (entry 4, on the
> left-hand page).
>
> ~mark
>
> On 1/27/19 12:23 AM, Asmus Freytag via Unicode wrote:
> > On 1/26/2019 6:25 PM, Michael Everson via Unicode wrote:
> > the 02BC’s need to be bigger or the text can’t be read easily. In our
> > work we found that a vertical height of 140% bigger than the quotation
> > mark improved legibility hugely. Fine typography asks for some other
> > alterations to the glyph, but those are cosmetic.
> >> If the recommended glyph for 02BC were to be changed, it would in no
> case impact adversely on scientific linguistics texts. It would just make
> the mark a bit bigger. But for practical use in Polynesian languages where
> the character has to be found alongside the quotation marks, a glyph
> distinction must be made between this and punctuation.
> >
> > It somehow seems to me that an evolution of the glyph shape of 02BC in
> > a direction of increased distinction from U+2019 is something that
> > Unicode has indeed made possible by a separate encoding. However, that
> > evolution is a matter of ALL the language communities that use U+02BC
> > as part of their orthography, and definitely NOT something were
> > Unicode can be permitted to take a lead. Unicode does not *recommend*
> > glyphs for letters.
> >
> > However, as a publisher, you 

Re: Encoding italic (was: A last missing link)

2019-01-17 Thread Philippe Verdy via Unicode
If encoding italics means re-encoding normal linguistic usage, then the
answer is no! We already have the nightmares caused by the partial encoding
of Latin and Greek (and also a few Hebrew characters) for maths notations or
IPA notations, but those are restricted to a well-delimited scope of use and
subset, and at least they have relevant scientific sources and auditors for
what is needed in serious publications (anyway these subsets may continue to
evolve, but very slowly).
We could have exceptions added for chemical or electrical notations, if
there are standards bodies supporting them.
But for linguistic usage, there's no universal agreement and no single
authority. Characters are added according to common use (by statistical
survey, or because some national standards promote them and sometimes make
their use mandatory with defined meanings, sometimes legally binding).
For everything else, languages are not constrained and users around the
world invent their own letterforms and styles: there's no limit at all, and
if we start accepting such re-encoding, the situation would in fact be worse
in terms of interoperability, because no one can support zillions of
variants if they are not explicitly encoded separately as surrounding
styles, or as scoping characters if needed (using contextual characters,
possibly variation selectors if these variants are most often isolated).
But italics encoded as variation selectors would just pollute everything;
and anyway "italic" is not a single universal convention and does not apply
equally to all scripts. The semantics attached to italic styles also vary
from document to document, the same semantics also have different
typographic conventions depending on authors, and there's no agreed meaning
about the distinctions they encode.
For this reason "italique/oblique/cursive/handwriting..." should remain in
styles (note also that even the italic transform can be variable, it could
also be later a subject of user preferences where people may want to adjust
the degree or slanting, according to their reading preferences, or its
orientation if they are left-handed to match how they write themselves, or
if the writer is a native RTL writer; the context of use (in BiDi) may also
adject this slanting orientation, e.g. inserting some Latin in Arabic could
present the Latin italic letters slanted backward, to better match the
slanting of Arabic itself and avoid collisions of Latin and Arabic glyphs
at BiDi boundaries...
One can still propose a contextual control character, but it would still be
insufficient for correctly representing the many stylistic variants
possible: we have better languages to do that now, and CSS (or even HTML)
is better for it (including for accessibility requirements: note that
there's no way to translate corretly these italics to Braille readers for
example; Braille or audio readers attempt to infer an heuristic to reduce
the number of contextual words or symbols they need to insert between each
character, but using VSn characters would complicate that: they are already
processing the standard HTML/CSS conventions to do that much more simply).
Direct native encoding of italic characters for linguistic use would fail if
it only covers English: it would worsen language coverage if people are then
told to remove the essential diacritics common in their language, only
because of the partial coverage of their alphabet.
I don't think this is worth the effort (and it would in fact cause a lot of
maintenance and would severely complicate the addition of new missing
letters; and let's not forget the case of common ligatures, and correct
typographic features like kerning, which would no longer be supported and
would render ugly text if many new kerning pairs are missing in fonts; many
fonts used today would no longer work properly, we would have a reduction of
stylistic options and fewer usable fonts, and we would fall into the trap of
proprietary solutions with a single provider; it would be too difficult for
any font designer to start defining a usable font sellable on various
markets: these fonts would be reduced to niches, and would no longer find a
way to be economically defined and maintained at a reasonable cost).
Consider the problem orthogonally: even if you use CSS/HTML styles in the
document encoding (rather than the plain-text character encoding), you can
also supply the additional semantics clearly in that document, also encode
the intent of the author, or supply enough information to permit alternate
renderings (for accessibility, or for technical reasons such as small font
sizes on devices with low resolution, or for people with limited vision).
The same applies to color (whose meaning is not clear, except in specific
notations supported by well-known authorities, or by a long tradition shared
by many authors and kept in archives or important text corpora, such as
literature, legal texts, and publications that have fallen into the public
domain after their initial publisher 

Re: NNBSP (was: A last missing link for interoperable representation)

2019-01-17 Thread Philippe Verdy via Unicode
On Thu, Jan 17, 2019 at 05:01, Marcel Schneider via Unicode <unicode@unicode.org> wrote:

> On 16/01/2019 21:53, Richard Wordingham via Unicode wrote:
> >
> > On Tue, 15 Jan 2019 13:25:06 +0100
> > Philippe Verdy via Unicode  wrote:
> >
> >> If your fonts behave incorrectly on your system because it does not
> >> map any glyph for NNBSP, don't blame the font or Unicode about this
> >> problem, blame the renderer (or the application or OS using it, may
> >> be they are very outdated and were not aware of these features, theyt
> >> are probably based on old versions of Unicode when NNBSP was still
> >> not present even if it was requested since very long at least for
> >> French and even English, before even Unicode, and long before
> >> Mongolian was then encoded, only in Unicode and not in any known
> >> supported legacy charset: Mongolian was specified by borrowing the
> >> same NNBSP already designed for Latin, because the Mongolian space
> >> had no known specific behavior: the encoded whitespaces in Unicode
> >> are compeltely script-neutral, they are generic, and are even
> >> BiDi-neutral, they are all usable with any script).
> >
> > The concept of this codepoint started for Mongolian, but was generalised
> > before the character was approved.
>
> Indeed it was proposed as MONGOLIAN SPACE  at block start, which was
> consistent with the need of a MONGOLIAN COMMA, MONGOLIAN FULL STOP and much
> more.


But the French "espace fine insécable" (narrow no-break space) was requested
long, long before Mongolian was discussed for encoding in the UCS. The
problem is that the initial rush for French happened in a period where
Unicode and ISO were competing and not in sync, so no agreement could be
found until there was a decision to merge the efforts. The early rush was in
ISO, still not using any character model but a glyph model, with little
desire to support multiple whitespaces; on the Unicode side, there was
initially no desire to encode all the languages and scripts, the focus being
initially only on trying to unify the existing vendor character sets which
were already implemented by a limited set of proprietary vendor
implementations (notably IBM, Microsoft, HP, Digital) plus a few of the
registered charsets in IANA, including the existing ISO 8859-*, GBK, and
some national or de facto standards (Russia, Thailand, Japan, Korea).
This early rush did not involve typographers (well, there was Adobe at this
time, but still using another, unrelated technology). Font standards did not
yet exist and were competing in incompatible ways; everything was a mess at
that time, so publishers were still required to use proprietary software
solutions, with very low interoperability (at that time the only "standard"
was PostScript, which needed no character encoding at all, only encoded
glyphs!).

If publishers had been involved, they would have revealed that they all
needed various whitespaces for correct typography (i.e. layout). Type
foundries themselves did not care about whitespaces because whitespaces had
no value for them (no glyph to sell). Adobe's publishing software was then
completely proprietary (just like Microsoft's, and others like Lotus,
WordPerfect...).
Years ago I was working for the French press, and they absolutely required
us to manage the [FINE] (the narrow no-break space) for use in newspapers,
classified ads, articles, guides, phone books and dictionaries. It was even
mandatory to enter these [FINE] in composed text, and they trained their
typists and ad sellers to use it (that character was not "sold" in
classified ads; it was necessary for correct layout, notably in narrow
columns, and not using it confused the readers, notably before the ":"
colon). It had to be non-breaking, non-expanding by justification, narrower
than digits and even narrower than the standard non-justified whitespace,
and it was consistently used as a decimal grouping separator.

But at that time the most common OSes did not support it natively because
there was no vendor charset supporting it (and in fact most OSes were still
unable to render proportional fonts everywhere and were frequently limited
to 8-bit encodings: DOS, Windows, Unix(es), and even Linux at its early
start). So an intermediate solution was needed. The US chose not to use the
non-breakable thin space at all, because in English it was not needed for
basic Latin, but also because of the huge prevalence of 7-bit ASCII for
everything (including its own national symbol for the "$", competing with
other ISO 646 variants). There were tons of legacy applications, developed
over decades, that did not support anything else, and interoperability in
the US was available only with ASCII; everything else was unreliable.

If you remember the early years when the Internet started to develop
outside the US, you remember the nig

Re: A last missing link for interoperable representation

2019-01-15 Thread Philippe Verdy via Unicode
Note that even if this NNBSP character is not mapped in a font, it should be
rendered correctly by all modern renderers. The mapping is necessary only
when a font design wants to tune its metrics, because its width varies
between 1/8 and 1/6 em (the narrow space is a bit narrower in traditional
English typography than in French, so a typical English design sets it at
about 1/8 em, a typical French design sets it at 1/6 em, and neutral fonts
may set it somewhere in the middle). The measure in em may however vary with
some fonts, notably those using "narrow" or "wide" letters by default
(because the font size in em indicates only its height) and in
decorated/cursive styles (e.g. fonts with swashes need a larger line gap, so
the design em size may be smaller than in modern simplified display styles).

But a renderer should have no problem using a default metric for all
whitespace characters, which actually don't need any glyph to be drawn.
All that is needed is metrics; everything else, including character
properties like breaking, is inferred by the renderer independently of the
font and of other per-language tuning, or is controlled by styling effects
applied on top of the font.

A renderer may expand the kerning/letter-spacing if needed, for example to
generate "hollow" or "shadow" effects, or to generate synthetic weights,
including with "variable" font support. Typically the renderer will base the
metrics of all missing/unmapped whitespaces on the metrics given to the
normal SPACE or NBSP, which are typically both mapped to the same glyph;
NNBSP can easily be synthesized using half the advance width of SPACE, and
that's fine. Renderers can also synthesize all the other whitespaces for
ideographic usage, or will adapt the rendering if instructed to synthesize a
monospaced variant: there, there's a choice for NNBSP to be rendered like
NBSP (typically for French, as it is normally a bit wider), or as a
zero-width space like in English, or contextually (for example zero-width
near punctuation, and like NBSP between letters/digits).
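
For example, a renderer's fallback could be as simple as this toy sketch
(the font-unit numbers and the half-width ratio are illustrative
assumptions, not taken from any specification):

    def advance_width(glyph_widths, ch):
        # Use the font's own metric when the character is mapped; otherwise
        # derive whitespace metrics from SPACE (NBSP = SPACE, NNBSP and THIN
        # SPACE = half of SPACE).
        if ch in glyph_widths:
            return glyph_widths[ch]
        space = glyph_widths.get("\u0020", 500)
        fallback = {"\u00A0": space, "\u202F": space / 2, "\u2009": space / 2}
        return fallback.get(ch, space)

    print(advance_width({" ": 600}, "\u202F"))   # -> 300.0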

Fonts only specify defaults that alter the rendering produced by a renderer,
but a renderer is not required to use all the information and all the glyphs
in a specific font; it has to adapt to the context and choose what is most
relevant and which kinds of data it recognizes and implements/uses at
runtime. The font just provides the best settings according to the font
designer, if all features are enabled, but most of the work is done by the
renderer (and fonts are completely unaware of the actual encoding of
documents; fonts are only a database containing multiple features/settings,
all of them optional and selectable individually).

If your fonts behave incorrectly on your system because it does not map any
glyph for NNBSP, don't blame the font or Unicode about this problem; blame
the renderer (or the application or OS using it; maybe they are very
outdated and were not aware of these features, and they are probably based
on old versions of Unicode when NNBSP was still not present, even though it
had been requested for a very long time, at least for French and even for
English, before Unicode even existed, and long before Mongolian was then
encoded, only in Unicode and not in any known supported legacy charset:
Mongolian was specified by borrowing the same NNBSP already designed for
Latin, because the Mongolian space had no known specific behavior. The
whitespaces encoded in Unicode are completely script-neutral, they are
generic, and they are even BiDi-neutral: they are all usable with any
script).


Re: A last missing link for interoperable representation

2019-01-15 Thread Philippe Verdy via Unicode
On Mon, Jan 14, 2019 at 20:25, Marcel Schneider via Unicode <unicode@unicode.org> wrote:

> On 14/01/2019 06:08, James Kass via Unicode wrote:
> >
> > Marcel Schneider wrote,
> >
> >> There is a crazy typeface out there, misleadingly called 'Courier
> >> New', as if the foundry didn’t anticipate that at some point it
> >> would be better called "Courier Obsolete". ...
> >
> > 퐴푟푡 푛표푢푣푒푎푢 seems a bit 푝푎푠푠é nowadays, as well.
> >
> > (Had to use mark-up for that “span” of a single letter in order to
> > indicate the proper letter form.  But the plain-text display looks
> > crazy with that HTML jive in it.)
> >
>
> I apologize for seeming to question the font name 푝푒푟 푠푒 while
> targeting only
> the fact that this typeface is not updated to support the . It just
> looks like the grand name is now misused to make people believe that if
> **this** great font is unsupporting , it has a good reason to do so,
> and we should keep people off using that “exotic whitespace” otherwise than
> “intended,” ie for Mongolian. Since fortunately TUS started backing its use
> in French (2014)
>

This is not just for Mongolian: French has wanted this space for a very long
time, and it has had a use even in English, for centuries, in fine
typography.
So no, NNBSP is definitely NOT "exotic whitespace". It's just that it was
forgotten in the early stages of computing with legacy 8-bit encodings, but
it should have been in Unicode since the beginning, as its existence is
attested long before the computing age (before ASCII, or even before Baudot
and telegraphic systems). It has always been used by typographers, it has
centuries of tradition in publishing, and it has always been recommended,
still today, for French by all book and newspaper publishers.


Re: UCA unnecessary collation weight 0000

2018-11-04 Thread Philippe Verdy via Unicode
So you finally admit that I was right... and that the specs include
requirements that are not even needed to make the UCA work, and that are not
even used by well-known implementations. These are old artefacts which are
now really confusing (instructing programmers to adopt the old deprecated
behavior, before realizing that this was bad advice which just complicated
their task). The UCA can be implemented **conformingly** without these, even
for the simplest implementations (where using complex packages like ICU is
not an option, and rewriting it is not one either for much simpler goals),
where these requirements in fact suggest being less efficient than really
needed.
It would not take a lot of work to edit and fix the specs without these
polluting "pseudo-weights".
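
To make the point concrete, here is a toy Python sketch (with made-up
weights, not DUCET data): building a concatenated sort key with a zero level
separator, and comparing per-level weight tuples with no zero values stored
at all, yield exactly the same ordering.

    # Per character, a (primary, secondary, tertiary) triple; 0 marks a level
    # where the character is ignorable.  The values are invented for the example.
    CE = {
        "a": (0x29, 0x05, 0x05),
        "A": (0x29, 0x05, 0x1D),
        "b": (0x2A, 0x05, 0x05),
        "\u0301": (0x00, 0x32, 0x02),   # combining acute: primary-ignorable
    }

    def key_with_separator(s):
        # Concatenate the non-zero weights level by level, with 0 between levels.
        key = []
        for level in range(3):
            key += [CE[ch][level] for ch in s if CE[ch][level] != 0]
            key.append(0)
        return tuple(key[:-1])

    def key_per_level(s):
        # The same information as one tuple per level, with no zero stored.
        return tuple(tuple(CE[ch][level] for ch in s if CE[ch][level] != 0)
                     for level in range(3))

    words = ["ab", "Ab", "a\u0301b", "b"]
    assert sorted(words, key=key_with_separator) == sorted(words, key=key_per_level)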

On Sun, Nov 4, 2018 at 09:27, Mark Davis ☕️ wrote:

> Philippe, I agree that we could have structured the UCA differently. It
> does make sense, for example, to have the weights be simply decimal values
> instead of integers. But nobody is going to go through the substantial
> work of restructuring the UCA spec and data file unless there is a very
> strong reason to do so. It takes far more time and effort than people
> realize to change in the algorithm/data while making sure that everything
> lines up without inadvertent changes being introduced.
>
> It is just not worth the effort. There are so, so, many things we can do
> in Unicode (encoding, properties, algorithms, CLDR, ICU) that have a higher
> benefit.
>
> You can continue flogging this horse all you want, but I'm muting this
> thread (and I suspect I'm not the only one).
>
> Mark
>
>
> On Sun, Nov 4, 2018 at 2:37 AM Philippe Verdy via Unicode <
> unicode@unicode.org> wrote:
>
>> On Fri, Nov 2, 2018 at 22:27, Ken Whistler wrote:
>>
>>>
>>> On 11/2/2018 10:02 AM, Philippe Verdy via Unicode wrote:
>>>
>>> I was replying not about the notational repreentation of the DUCET data
>>> table (using [....] unnecessarily) but about the text of UTR#10 itself.
>>> Which remains highly confusive, and contains completely unnecesary steps,
>>> and just complicates things with absoiluytely no benefit at all by
>>> introducing confusion about these "".
>>>
>>> Sorry, Philippe, but the confusion that I am seeing introduced is what
>>> you are introducing to the unicode list in the course of this discussion.
>>>
>>>
>>> UTR#10 still does not explicitly state that its use of "" does not
>>> mean it is a valid "weight", it's a notation only
>>>
>>> No, it is explicitly a valid weight. And it is explicitly and
>>> normatively referred to in the specification of the algorithm. See UTS10-D8
>>> (and subsequent definitions), which explicitly depend on a definition of "A
>>> collation weight whose value is zero." The entire statement of what are
>>> primary, secondary, tertiary, etc. collation elements depends on that
>>> definition. And see the tables in Section 3.2, which also depend on those
>>> definitions.
>>>
>>> (but the notation is used for TWO distinct purposes: one is for
>>> presenting the notation format used in the DUCET
>>>
>>> It is *not* just a notation format used in the DUCET -- it is part of
>>> the normative definitional structure of the algorithm, which then
>>> percolates down into further definitions and rules and the steps of the
>>> algorithm.
>>>
>>
>> I insist that this is NOT NEEDED at all for the definition, it is
>> absolutely NOT structural. The algorithm still guarantees the SAME result.
>>
>> It is ONLY used to explain the format of the DUCET and the fact the this
>> format does NOT use  as a valid weight, ans os can use it as a notation
>> (in fact only a presentational feature).
>>
>>
>>> itself to present how collation elements are structured, the other one
>>> is for marking the presence of a possible, but not always required,
>>> encoding of an explicit level separator for encoding sort keys).
>>>
>>> That is a numeric value of zero, used in Section 7.3, Form Sort Keys. It
>>> is not part of the *notation* for collation elements, but instead is a
>>> magic value chosen for the level separator precisely because zero values
>>> from the collation elements are removed during sort key construction, so
>>> that zero is then guaranteed to be a lower value than any remaining weight
>>> added to the sort key under construction. This part of the algorithm is not
>>> rocket science, by 

Re: Encoding

2018-11-04 Thread Philippe Verdy via Unicode
I can take another example of what I call "legacy encoding" (which really
means that such an encoding is just an "approximation" from which no
semantics can be clearly inferred, except by using a non-deterministic
heuristic which can frequently make "false guesses").

Consider the case of the legacy Hangul "half-width" jamos: they were kept in
Unicode (as compatibility characters) but are not recommended for encoding
natural Korean text, because their semantics are not clear when they are
used in sequences: it's impossible to know clearly where the semantically
significant syllable breaks occur, because they don't distinguish the
"leading" and "trailing" consonants, and so it is not even possible to
clearly infer that a "half-width" vowel jamo is logically attached to the
same syllable as the "half-width" consonant (or consonant+vowel) jamo
encoded just before it. As a consequence, you cannot safely convert Korean
texts using these "half-width" jamos into normal jamos: a heuristic can only
attempt to determine the syllable breaks and then infer the "leading" or
"trailing" semantics of consonants. This last semantic ("leading" or
"trailing") is exactly like a letter-case distinction in Latin, so it can be
said that the Korean alphabet is bicameral for consonants but unicameral for
vowels, where each Hangul syllable normally starts with an "uppercase-like"
consonant, or with a consonant filler which is also "uppercase-like", and
all other consonants and all vowels are "lowercase-like". The heuristic that
transforms the legacy "half-width" jamos into normal jamos does just the
same thing as a heuristic used in Latin that attempts to capitalize some
leading letters in words: it works frequently, but it also fails, and that
heuristic is just as lossy in Latin as it is in Korean!

The same can be said about the heuristics that attempt to infer an
abbreviation semantic from existing superscript letters (either encoded in
Unicode, or encoded as plain letters modified by a superscripting style in
CSS or HTML, or in word processors for example): they fail to give the
correct guess most of the time if there's no user to confirm the actual
intended meaning.

Such confirmation is the job of spell correctors in word processors: they
must clearly inform the user and let them decide. All that spell checkers
can do is provide visual hints to the user editing the document, such as the
common red wavy underline, showing that several interpretations are
possible, or that this is not the preferred encoding to use to convey the
intended semantics.

A spell checker may be instructed to do the conversion automatically while
typing text, but there must be a way for the user to cancel this transform
and make his own decision about the real meaning, even if canceling the
automatic transform causes the "wavy red underline" to appear. The user may
type "Mr.", and the wavy line will appear under these 3 characters; the
spell checker will propose to encode it as an abbreviation (using the
proposed mark) or to leave "Mr." unchanged (and no longer signaled), in
which case the dot remains a regular punctuation mark and the "r" is not
modified. Then the user may choose to style the "r" with superscripting or
underlining, and a new wavy red underline will appear below the characters,
proposing to transform only the styled "r" into the encoded abbreviation
form; even when the user accepts one of these suggestions, it is still
possible to infer the semantics of an abbreviation (and to propose replacing
or keeping the dot after it), or to do nothing else and cancel these
suggestions (to hide the wavy red underline hint added by the spell
checker), or to instruct the spell checker that the meaning of the
superscript r is that of a mathematical exponent, or of a chemical notation.

In all cases, the user/author has full control of the intended meaning of
his text and an informed decision is made where all cases are now
distinguished. "Legacy" encoding can be kept as is (in Unicode), even if
it's no longer recommended, just like Unicode has documented that
half-width Hangul is deprecated (it just offers a "compatibility
decomposition" for NFKD or NFKC, but this is lossy and cannot be done
automatically without a human decision).

And the user/author can now freely and easily compose any abbreviation he
wishes in natural languages, without being limited by the reduced "legacy"
set of superscript letters encoded in Unicode (which should no longer be
extended, except for use as distinct plain letters needed in the alphabets
of actual natural languages, or as possible new IPA symbols), and without
using the styling tricks (of HTML/CSS, or of word-processor documents,
spreadsheets, and presentation documents allowing "rich text" formats on top
of "plain text"), which are best suited for "free styling" of any human text
without any additional semantics (or as a legacy but insufficient trick for
maths and chemical notations).



On Sun, Nov 4, 2018 at 20:51, Philippe Verdy wrote:

> Note 

Re: Encoding (was: Re: A sign/abbreviation for "magister")

2018-11-04 Thread Philippe Verdy via Unicode
Note that I actually propose not just one rendering for the proposed
abbreviation mark but two possible variants (which would be equally valid,
without preference). Use it after any base cluster (including with
diacritics if needed, like combining underlines):
- the first one can be to render the previous cluster as superscript (very
easy to implement synthetically by any text renderer);
- the second one can be to render it as an abbreviation dot (also very easy
to implement).
Fonts can provide their own mapping (e.g. to offer alternate glyph forms or
kerning for the superscript; they can also reuse the letter forms used for
other existing encoded superscript letters, or position the abbreviation dot
with negative kerning, for example after a T), in which case the renderer
does not have to synthesize the rendering for a combining sequence not
mapped in the font.
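
A purely hypothetical sketch of those two rendering variants (the proposed
mark is not encoded, so a private-use code point stands in for it here, and
the tiny superscript table is only an example):

    ABBR_MARK = "\uE000"               # placeholder only, NOT a real assignment
    SUPERSCRIPT = {"r": "\u02B3", "e": "\u1D49", "o": "\u1D52"}   # ʳ ᵉ ᵒ

    def render(text, variant=1):
        out = []
        for ch in text:
            if ch == ABBR_MARK and out:
                if variant == 1 and out[-1] in SUPERSCRIPT:
                    out[-1] = SUPERSCRIPT[out[-1]]   # variant 1: superscript form
                else:
                    out.append(".")                  # variant 2: abbreviation dot
            else:
                out.append(ch)
        return "".join(out)

    word = "Mr" + ABBR_MARK
    print(render(word, 1), render(word, 2))   # -> Mʳ Mr.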

Allowing this variation from the start will:
- allow renderers to support it fast (hence rapid adoption for encoding
texts in human languages, instead of the few legacy superscript letters);
- allow font designers to develop and provide reasonable mappings if needed
(to adjust the position or size of the superscript) in updated fonts (with
no requirement for them to add new glyphs if it's just to map the same
glyphs used by existing superscript letters);
- also prohibit the abuse of this mark for every text that one would want to
write in superscript (those cases can still use the few existing superscript
letters/digits/signs that are already encoded), so this is not suitable, for
example, for marking mathematical exponents (e.g. "x²": if it were encoded
with the digit "2" followed by the abbreviation mark, it could validly be
rendered as "x2."): exponents must use superscript (either the already
encoded characters, or external styles like in HTML/CSS, or LaTeX, which
uses the notation "x^2", both as a style but also with the intended semantic
of an exponent, and certainly not the intended semantic of an abbreviation).



On Sun, Nov 4, 2018 at 09:34, Marcel Schneider via Unicode <unicode@unicode.org> wrote:

> On 03/11/2018 23:50, James Kass via Unicode wrote:
> >
> > When the topic being discussed no longer matches the thread title,
> > somebody should start a new thread with an appropriate thread title.
> >
>
> Yes, that is what also the OP called for, but my last reply though
> taking me some time to write was sent without checking the new mail,
> so unfortunately it didn’t acknowledge. So let’s start this new thread
> to account for Philippe Verdy’s proposal to encode a new format control.
>
> But all what I can add so far prior to probably stepping out of this
> discussion is that the industry does not seem to be interested in this
> initiative. Why do I think so? As already discussed on this List, even
> the long-existing FRACTION SLASH U+2044 has not been implemented by
> major vendors, except that HarfBuzz does implement it and makes its
> specified behavior available in environments using HarfBuzz, among
> which some major vendors’ products are actually available with
> HarfBuzz support.
>
> As a result, the Polish abbreviation of Magister as found on the
> postcard, and all other abbreviations using superscript that have
> been put into parallel in the parent thread, cannot be reliably
> encoded without using preformatted superscript, so far as the goal
> is a plain text backbone being in the benefit of reliable rendering
> support, rather than a semantic-centered coding that may be easier
> to parse by special applications but lacks wider industrial support.
>
> If nevertheless,  is encoded and will
> gain traction, or rather reversely: if it gains traction and will be
> encoded (I don’t know which way around to put it, given U+2044 has
> been encoded but one still cannot seem to be able to call it widely
> implemented), I would surely add it on keyboard layouts if I will
> still be maintaining any in that era.
>
> Best regards,
>
> Marcel
>


Re: UCA unnecessary collation weight 0000

2018-11-03 Thread Philippe Verdy via Unicode
On Fri, Nov 2, 2018 at 22:27, Ken Whistler wrote:

>
> On 11/2/2018 10:02 AM, Philippe Verdy via Unicode wrote:
>
> I was replying not about the notational repreentation of the DUCET data
> table (using [....] unnecessarily) but about the text of UTR#10 itself.
> Which remains highly confusive, and contains completely unnecesary steps,
> and just complicates things with absoiluytely no benefit at all by
> introducing confusion about these "".
>
> Sorry, Philippe, but the confusion that I am seeing introduced is what you
> are introducing to the unicode list in the course of this discussion.
>
>
> UTR#10 still does not explicitly state that its use of "" does not
> mean it is a valid "weight", it's a notation only
>
> No, it is explicitly a valid weight. And it is explicitly and normatively
> referred to in the specification of the algorithm. See UTS10-D8 (and
> subsequent definitions), which explicitly depend on a definition of "A
> collation weight whose value is zero." The entire statement of what are
> primary, secondary, tertiary, etc. collation elements depends on that
> definition. And see the tables in Section 3.2, which also depend on those
> definitions.
>
> (but the notation is used for TWO distinct purposes: one is for presenting
> the notation format used in the DUCET
>
> It is *not* just a notation format used in the DUCET -- it is part of the
> normative definitional structure of the algorithm, which then percolates
> down into further definitions and rules and the steps of the algorithm.
>

I insist that this is NOT NEEDED at all for the definition, it is
absolutely NOT structural. The algorithm still guarantees the SAME result.

It is ONLY used to explain the format of the DUCET, and the fact that this
format does NOT use 0000 as a valid weight, and so can use it as a notation
(in fact only a presentational feature).


> itself to present how collation elements are structured, the other one is
> for marking the presence of a possible, but not always required, encoding
> of an explicit level separator for encoding sort keys).
>
> That is a numeric value of zero, used in Section 7.3, Form Sort Keys. It
> is not part of the *notation* for collation elements, but instead is a
> magic value chosen for the level separator precisely because zero values
> from the collation elements are removed during sort key construction, so
> that zero is then guaranteed to be a lower value than any remaining weight
> added to the sort key under construction. This part of the algorithm is not
> rocket science, by the way!
>

Here again you are confusing things: a sort key MAY use them as separators
if it wants to compress keys by re-encoding weights per level; that's the
only case where you may want to introduce an encoding pattern starting with
0, while the rest of the encoding for weights in that level must use
patterns not starting with this 0 (the number of bits used to encode this 0
does not matter: it is only part of the encoding used at this level, which
does not necessarily have to use 16-bit code units per weight).

>
> Even the example tables can be made without using these "" (for
> example in tables showing how to build sort keys, it can present the list
> of weights splitted in separate columns, one column per level, without any
> "". The implementation does not necessarily have to create a buffer
> containing all weight values in a row, when separate buffers for each level
> is far superior (and even more efficient as it can save space in memory).
>
> The UCA doesn't *require* you to do anything particular in your own
> implementation, other than come up with the same results for string
> comparisons.
>
Yes I know, but the algorithm also does not require me to use these invalid
0000 pseudo-weights, which the algorithm itself will always discard (in a
completely needless step)!


> That is clearly stated in the conformance clause of UTS #10.
>
> https://www.unicode.org/reports/tr10/tr10-39.html#Basic_Conformance
>
> The step "S3.2" in the UCA algorithm should not even be there (it is made
> in favor an specific implementation which is not even efficient or optimal),
>
> That is a false statement. Step S3.2 is there to provide a clear statement
> of the algorithm, to guarantee correct results for string comparison.
>

You're wrong, this statement is completely useless in all cases. The results
of string comparison are still correct without it: a string comparison can
only compare valid weights at each level, it will not compare any weight
past the end of the text in either of the two compared strings, and nowhere
will it compare weights with one of them being 0, unless this 0 is used as a
"guard value" for the end o

Re: UCA unnecessary collation weight 0000

2018-11-03 Thread Philippe Verdy via Unicode
On Fri, Nov 2, 2018 at 22:27, Ken Whistler wrote:

>
> On 11/2/2018 10:02 AM, Philippe Verdy via Unicode wrote:
>
> I was replying not about the notational repreentation of the DUCET data
> table (using [....] unnecessarily) but about the text of UTR#10 itself.
> Which remains highly confusive, and contains completely unnecesary steps,
> and just complicates things with absoiluytely no benefit at all by
> introducing confusion about these "".
>
> Sorry, Philippe, but the confusion that I am seeing introduced is what you
> are introducing to the unicode list in the course of this discussion.
>
>
> UTR#10 still does not explicitly state that its use of "" does not
> mean it is a valid "weight", it's a notation only
>
> No, it is explicitly a valid weight. And it is explicitly and normatively
> referred to in the specification of the algorithm. See UTS10-D8 (and
> subsequent definitions), which explicitly depend on a definition of "A
> collation weight whose value is zero." The entire statement of what are
> primary, secondary, tertiary, etc. collation elements depends on that
> definition. And see the tables in Section 3.2, which also depend on those
> definitions.
>
OK, it is a valid "weight" when taken *in isolation*, but it is invalid as a
weight at any level.
This does not change the fact, because weights are always relative to a
specific level for which they are defined, and 0000 does not belong to any
of them. This weight is completely artificial and introduced completely
needlessly: all levels are completely defined by a closed range of weights,
none of them being 0000, and all ranges being numerically separated (with
the primary level using the largest range).

I can reread it again and again (even the sections you cite), but there's
absolutely NO need for this artificial "0000" anywhere (any clause
introducing it or using it to define something can be safely removed).


Re: A sign/abbreviation for "magister"

2018-11-03 Thread Philippe Verdy via Unicode
It should be noted that the algorithmic complexity for this NFLD
normalization ("legacy") is exactly the same as for NFKD ("compatibility").

However NFLD is versioned (as is NFLC), so NFLD can take a second parameter:
the maximum Unicode version, which can be used to filter which decomposition
mappings are usable (each mapping indicates the first version in which it
applies).

It is even possible to allow a "legacy" normalization to be changed in a
later version for the same source string:

 # deprecated codepoint(s) ; new preferred sequence ; Unicode version in
which it was deprecated
  101234 ; 101230 0300... ; 10.0
  101234 ; 101240 0301... ; 11.0

It is also possible to add other filters to these recommended new encodings,
for example a language (or a BCP 47 locale identifier):
  101234 ; 101230 0300 ; 10.0 ; fr
  101234 ; 101240 0301... ; 10.0
(here, starting in the same version 10.0, the new recommendation is to
replace <101234> by <101240 0301> in all languages except French (BCP 47
rules), where <101230 0300> should be used instead).

In that case, the NFKD normalization can be viewed as if it were a historic
version of NFLD, or a specialization of NFLD for a "compatibility locale"
(using "u-nfk" as a BCP 47 locale identifier???), independent of the Unicode
version (you can specify any version in the parameters of the NFLD or NFLC
functions, and the locale identifier can be set to "u-nfk").

The complete parameters for NFLD (or NFLC) are:
  NFLD(text, version, locale) -> returns a text in NFD form
  NFLC(text, version, locale) -> returns a text in NFC form

The default version is the latest supported version of Unicode; the default
locale is "root" (in CLDR) or the same as the DUCET in Unicode, but should
not be "u-nfk".

And so:
  NFKD(text) = NFLD(text, 8.0, "u-nfk") = NFLD(text, 12.0, "u-nfk")
             = NFLD(text, "u-nfk") = NFD(NFLD(text, "u-nfk"))
  NFKC(text) = NFLC(text, 8.0, "u-nfk") = NFLC(text, 12.0, "u-nfk")
             = NFLC(text, "u-nfk") = NFC(NFLC(text, "u-nfk"))


Re: A sign/abbreviation for "magister"

2018-11-03 Thread Philippe Verdy via Unicode
On Sat, Nov 3, 2018 at 23:36, Philippe Verdy wrote:

> - this new decomposition mapping file for NFLC and NFLD, where NFLC is
>> defined to be NFC(NFLD), has some stability requirements and it must be
>> warrantied that NFD(NFLD) = NFD
>>
> Oops! fix my typo: it must be warrantied that NFD(NFLD) = NFLD


Re: A sign/abbreviation for "magister"

2018-11-03 Thread Philippe Verdy via Unicode
>
> Unlike NFKC and NFKD, the NFLC and NFLD would be an extensible superset
> based on MUTABLE character properties (this can also be "decompositions
> mappings" except that once a character is added to the new property file,
> they won't be removed, and can have some stability as well, where the
> decision to "deprecate" old encodings can only be done if there's a new
> recommandation, and that if ever this recommandation changes and is
> deprecated, the previous "legacy decomposition mappings" can still be
> decomposed again to the new decompositions recommanded): unlike NFKC, and
> NFKD, a "legacy decomposition" is not "final" in all future versions, and a
> future version may remap them by just adding new entries for the new
> characters considered to be "legacy" and no longer recommended. This new
> properties file would allow evolution and adaptation to humane languages,
> and will allow correcting past errors in the standard. This file should
> have this form:
>
>   # deprecated codepoint(s) ; new preferred sequence ; Unicode version in
> which it was deprecated
>   101234 ; 101230 0300... ; 10.0
>
> This file can also be used to deprecate some old variation sequences, or
> some old clusters made of multiple characters that are isolately not
> deprecated.
>

Another note:

- this new decomposition mapping file for NFLC and NFLD, where NFLC is
defined to be NFC(NFLD), has some stability requirements and it must be
warrantied that NFD(NFLD) = NFD: the "legacy mapping forms" must be a
conforming process respecting the canonical equivalences:

- Unlike in the main UCD file for canonical decompositions, the
decompositions listed there are not limited to mapping one character to one
or two characters.

- The first column should be given in NFC form; the NFD form may also be
used, as this does not change the result. It is NOT required that the first
column be in NFKC or NFKD form (so the decompositions previously recommended
by a "compatibility mapping" in the main UCD can be ignored: that was just a
suggestion, and a requirement only for NFKC and NFKD). This allows NFLC and
NFLD to correct past errors in the permanently frozen NFKC and NFKD
decompositions.

- The mapping done here is permanent but versioned (by the first version of
Unicode deprecating a character or sequence). Being permanent means that the
deprecation cannot be removed, but it can still be changed if the target
string (preferably listed in NFC form) contains some newly deprecated
characters (which will be added separately).

- If the target of a mapping contains other deprecated characters or
sequences (added to the same file), the decompositions listed there become
recursive: a derived data file can be produced listing only the new
recommended mappings.

- if a source string "SATB" is canonically equivalent to "SBTA", and "SA"
is listed as a legacy sequence mapped to be replaced by "X" in this file,
then the NFLD process will not just decompose "SATB" into NFD("XTB"), but
will also decompose "SBTA" into NBT("XBT").

- if a source string "SATB" is NOT canonically equivalent to "SBTA", and
"SA" is listed as a legacy sequence mapped to be replaced by "X" in this
file, then the NFLD process will not decompose "SATB" into NFD("XTB"), but
will not automatically decompose "SBTA" into NBT("XBT")

Then the CLDR project can use NFL(C/D), instead of NFK(C/D), as a better
source for deriving collation elements (in the DUCET or root locale); it
will follow the new recommendations and will correctly adapt the collation
orders for legacy encodings. Tailored (per-locale) collations are not
required to use the compatibility mappings in the main UCD file, or in this
file; they'll use them only if they are based on the DUCET or on the
collation order of the "root" locale. For that purpose, tailored collations
may specify an alternate set of "compatibility or legacy mappings" (to apply
after NFC or NFD normalization, which is still required).

Maybe the CLDR project would like these derived collation elements to be
orderable (so that it can infer and order the new relative weights needed
for ordering strings containing "legacy characters"), but that may require
another column in the legacy mappings data file (in my opinion the "Unicode
version" field already offers a suitable relative ordering by default).


Re: A sign/abbreviation for "magister"

2018-11-03 Thread Philippe Verdy via Unicode
I can give other interesting examples of why the Unicode "character
encoding model" is the best option.

Just consider how the Hangul alphabet is (now) encoded: its consonant
letters are encoded "twice" (leading and trailing jamos) because they carry
semantic distinctions needed for efficient processing of Korean text, where
syllable boundaries are significant to disambiguate the text; this apparent
"double encoding" also has a visual model (still currently employed) to
*preferably* (not mandatorily) render syllables in a well-defined square
layout. But the square layout causes significant rendering issues (notably
at small font sizes), so it is also possible to render the syllable by
aligning letters horizontally. This was done with the "compatibility jamos"
used in old terminals/printers (but unfortunately without marking the
syllable boundaries explicitly before groups of consonants, after them, or
in the middle of the group); due to the need to preserve roundtrip
compatibility with the non-UCS encodings, the "compatibility jamos" had to
be encoded separately, even if their use is no longer recommended for
normal Korean texts, which should explicitly encode syllable boundaries by
distinguishing leading and trailing consonants (this is equivalent to the
distinction of letter case in Latin: leading jamos in Hangul are exactly
like our Latin capital consonants, trailing jamos in Hangul are exactly
like our Latin small letters; the vowel jamos in Hangul, however, are
unicameral... for now). But Hangul is still a true alphabet (it is in fact
much simpler than Greek or Cyrillic, and Latin is the most complex script
of the world!). Thanks to this new (recommended) encoding of Hangul, which
adopts a **semantic** and **logical** model, it is possible to process
Korean text very efficiently (and in fact very simply). The earlier
attempt at encoding Korean was made while the ISO 10646 goals were thought
to be enough (so it was a **visual** encoding): it failed even though this
earlier encoding entered the first versions of Unicode, and it created a
severe precedent where preserving the stability of Unicode (and upward
compatibility) was broken.

I can also cite the case of Egyptian hieroglyphs: there's still no way to
render them correctly because we lack the development of a stable
orthography that would drive the encoding of the missing **semantic**
characters (for this reason Egyptian hieroglyphs still require an upper
layer protocol, as there's still no accepted orthographic norm that
successfully represents all possible semantic variations, but also because
the research on old Egyptian hieroglyphs is still very incomplete).
The same can be said about Mayan hieroglyphs. And because there's still no
semantic encoding of real texts, it's almost impossible to process text in
this script: the characters encoded are ONLY basic glyphs (we don't know
what their allowed variations are, so we cannot use them safely to
compose combining sequences: they are merely a collection of symbols, not a
human script). In my opinion, there was absolutely no emergency to encode
them in the UCS (except by not resisting the pressure of allowing fonts
containing these glyphs to be interchanged, but it remains impossible to
encode and compose complete text with only these fonts: you still need an
orthographic convention and there's still no consensus about it; likewise
the standard higher-level protocols like HTML/CSS cannot compose them
correctly and efficiently). This encoding was not necessary, as these fonts
containing collections of glyphs could have remained encoded with a private
use convention, i.e. with PUAs required only by the attempted (but not
agreed) protocols.

I think, on the contrary, that VisibleSpeech or Duployé shorthand will
reach a point where they have developed a stable orthographic convention:
there will be a standard, and that standard will request that Unicode
encode the missing **semantic** characters.

This path should also be followed now for encoding emojis (there's an early
development of an orthography for them; it is done by Unicode itself, but
I'm not sure this is part of its mission: emoji orthographic conventions
should be made by a separate committee). Unfortunately Unicode is starting
to create this orthography without developing what should come with it: its
integration in the Unicode "character encoding model" (which should then be
reviewed to meet the goals wanted for the composition of emoji sequences):
a clear set of character properties for emojis needs to be developed, and
then the emoji subcommittee can work with it (like what the IRG does for
ideographic scripts). But for now any revision of emojis adds new
incompatibilities and inefficiencies for processing text correctly (for
example it's nearly impossible to define the boundaries between clusters of
emojis).

Just consider what is also still missing for Egyptian and Mayan hieroglyphs
or VisibleSpeech, or Duployé Shorthands: please resist 

Re: A sign/abbreviation for "magister"

2018-11-03 Thread Philippe Verdy via Unicode
in another set which may
> be a value or variable.
>
> Once again this covered all the needs without using this duplicate
> encoding (that was NEVER needed for roundtrip compatibility with legacy
> non-UCS charsets).
>
> All I ask is reasonnable: it's just a SINGLE code point to encode the
> combining mark itself, semantically, NOT visually.
>
> The visual appearance can be controlled by an additional variation
> selector to cancel the effect of glyph variations allowed for ALL
> characters in the UCS, where there's just a **non-mandatory** form
> generally used by default in fonts and matching more or less the
> "representative glyph" shown in the Unicode and ISO 10646 charts, which
> cannot show all allowed variations (if there's a need to detail them,
> Unicode offers the possibility to ask to register known "variation
> sequences" which can feed a supplementary chart showing more representative
> glyphs, one for each accepted "variation sequence", but without even
> needing to modify the "representative glyph" shown in the base chart.
>
> Note that even if Unicode requires registration of variation sequences
> prior to using them, the published charts still omit to add the additional
> charts (just below the existing base chart) showing representative glyphs
> for accepted sequences, with one small chart per base character, listing
> them simply ordered by "VSn" value. All what Unicode publishes is only a
> mere data list with some names (not enough for most users to be ware that
> variations can be encoded explicitly and compliantly)
>
>
> Le sam. 3 nov. 2018 à 20:41, Philippe Verdy  a écrit :
>
>>
>>
>> Le ven. 2 nov. 2018 à 20:01, Marcel Schneider via Unicode <
>> unicode@unicode.org> a écrit :
>>
>>> On 02/11/2018 17:45, Philippe Verdy via Unicode wrote:
>>> [quoted mail]
>>> >
>>> > Using variation selectors is only appropriate for these existing
>>> > (preencoded) superscript letters ª and º so that they display the
>>> > appropriate (underlined or not underlined) glyph.
>>>
>>> And it is for forcing the display of DIGIT ZERO with a short stroke:
>>> 0030 FE00; short diagonal stroke form; # DIGIT ZERO
>>> https://unicode.org/Public/UCD/latest/ucd/StandardizedVariants.txt
>>>
>>>  From that it becomes unclear why that isn’t applied to 4, 7, z and Z
>>> mentioned in this thread, to be displayed open or with a short bar.
>>>
>>> > It is not a solution for creating superscripts on any letters and
>>> > mark that it should be rendered as superscript (notably, the base
>>> > letter to transform into superscript may also have its own combining
>>> > diacritics, that must be encoded explicitly, and if you use the
>>> > varaition selector, it should allow variation on the presence or
>>> > absence of the underline (which must then be encoded explicitly as a
>>> > combining character.
>>>
>>> I totally agree that abbreviation indicating superscript should not be
>>> encoded using variation selectors, as already stated I don’t prefer it.
>>> >
>>> > So finally what we get with variation selectors is: <base letter,
>>> > variation selector, combining diacritic> and <base letter
>>> > precombined with the diacritic, variation selector> which is NOT
>>> > canonically equivalent.
>>>
>>> That seems to me like a flaw in canonical equivalence. Variations must
>>> be canonically equivalent, and the variation selector position should
>>> be handled or parsed accordingly. Personally I’m unaware of this rule.
>>> >
>>> > Using a combining character avoids this caveat: <base letter,
>>> > combining diacritic, combining abbreviation mark> and <base letter
>>> > precombined with the diacritic, combining abbreviation mark> which
>>> > ARE canonically equivalent. And this explicitly states the semantic
>>> > (something that is lost if we are forced to use presentational
>>> > superscripts in a higher level protocol like HTML/CSS for rich text
>>> > format, and one just extracts the plain text; using collation will
>>> > not help at all, except if collators are built with preprocessing
>>> > that will first infer the presence of a <combining abbreviation mark>
>>> > to insert after each combining sequence of the plain-text enclosed in
>>> > an italic style).
>>>
>>> That exactly outlines my concern with calls for relegating superscript
>>> as an abbreviation indicator to higher level protocols like HTML/CSS.
>>>
>>

Re: A sign/abbreviation for "magister"

2018-11-03 Thread Philippe Verdy via Unicode
As well, the separate encoding of mathematical variants could have been
completely avoided (we know that this encoding is not sufficient, so much
so that even LaTeX renderers simply don't need it or use it!).

We could have just encoded a single combining mark to use after any base
cluster, and the whole set would have been covered!

The additional distinction of visual variants (monospace, bold, italic...)
would have been encoded using variation selectors after that combining
mark: the semantics as a mathematical symbol would still be preserved,
including the additional semantics for distinguishing some symbols in maths
notations like "f(f)=f" where the 3 "f" must be distinguished (between the
function in a set of functions, the source belonging to one set of values
or being a variable, and the result in another set, which may be a value or
a variable).

Once again this would have covered all the needs without using this
duplicate encoding (which was NEVER needed for roundtrip compatibility with
legacy non-UCS charsets).

All I ask is reasonable: it's just a SINGLE code point to encode the
combining mark itself, semantically, NOT visually.

The visual appearance can be controlled by an additional variation selector
to cancel the effect of glyph variations allowed for ALL characters in the
UCS, where there's just a **non-mandatory** form generally used by default
in fonts and matching more or less the "representative glyph" shown in the
Unicode and ISO 10646 charts, which cannot show all allowed variations (if
there's a need to detail them, Unicode offers the possibility to ask to
register known "variation sequences" which can feed a supplementary chart
showing more representative glyphs, one for each accepted "variation
sequence", but without even needing to modify the "representative glyph"
shown in the base chart).

Note that even if Unicode requires registration of variation sequences
prior to using them, the published charts still omit the additional
charts (just below the existing base chart) showing representative glyphs
for accepted sequences, with one small chart per base character, listing
them simply ordered by "VSn" value. All that Unicode publishes is only a
mere data list with some names (not enough for most users to be aware that
variations can be encoded explicitly and compliantly).


Le sam. 3 nov. 2018 à 20:41, Philippe Verdy  a écrit :

>
>
> Le ven. 2 nov. 2018 à 20:01, Marcel Schneider via Unicode <
> unicode@unicode.org> a écrit :
>
>> On 02/11/2018 17:45, Philippe Verdy via Unicode wrote:
>> [quoted mail]
>> >
>> > Using variation selectors is only appropriate for these existing
>> > (preencoded) superscript letters ª and º so that they display the
>> > appropriate (underlined or not underlined) glyph.
>>
>> And it is for forcing the display of DIGIT ZERO with a short stroke:
>> 0030 FE00; short diagonal stroke form; # DIGIT ZERO
>> https://unicode.org/Public/UCD/latest/ucd/StandardizedVariants.txt
>>
>>  From that it becomes unclear why that isn’t applied to 4, 7, z and Z
>> mentioned in this thread, to be displayed open or with a short bar.
>>
>> > It is not a solution for creating superscripts on any letters and
>> > mark that it should be rendered as superscript (notably, the base
>> > letter to transform into superscript may also have its own combining
>> > diacritics, that must be encoded explicitly, and if you use the
>> > varaition selector, it should allow variation on the presence or
>> > absence of the underline (which must then be encoded explicitly as a
>> > combining character.
>>
>> I totally agree that abbreviation indicating superscript should not be
>> encoded using variation selectors, as already stated I don’t prefer it.
>> >
>> > So finally what we get with variation selectors is: <base letter,
>> > variation selector, combining diacritic> and <base letter
>> > precombined with the diacritic, variation selector> which is NOT
>> > canonically equivalent.
>>
>> That seems to me like a flaw in canonical equivalence. Variations must
>> be canonically equivalent, and the variation selector position should
>> be handled or parsed accordingly. Personally I’m unaware of this rule.
>> >
>> > Using a combining character avoids this caveat: <base letter,
>> > combining diacritic, combining abbreviation mark> and <base letter
>> > precombined with the diacritic, combining abbreviation mark> which
>> > ARE canonically equivalent. And this explicitly states the semantic
>> > (something that is lost if we are forced to use presentational
>> > superscripts in a higher level protocol like HTML/CSS for rich text
>> > format, and one just extracts the plain text; using collation will
>>

Re: A sign/abbreviation for "magister"

2018-11-03 Thread Philippe Verdy via Unicode
Le ven. 2 nov. 2018 à 20:01, Marcel Schneider via Unicode <
unicode@unicode.org> a écrit :

> On 02/11/2018 17:45, Philippe Verdy via Unicode wrote:
> [quoted mail]
> >
> > Using variation selectors is only appropriate for these existing
> > (preencoded) superscript letters ª and º so that they display the
> > appropriate (underlined or not underlined) glyph.
>
> And it is for forcing the display of DIGIT ZERO with a short stroke:
> 0030 FE00; short diagonal stroke form; # DIGIT ZERO
> https://unicode.org/Public/UCD/latest/ucd/StandardizedVariants.txt
>
>  From that it becomes unclear why that isn’t applied to 4, 7, z and Z
> mentioned in this thread, to be displayed open or with a short bar.
>
> > It is not a solution for creating superscripts on any letters and
> > mark that it should be rendered as superscript (notably, the base
> > letter to transform into superscript may also have its own combining
> > diacritics, that must be encoded explicitly, and if you use the
> > varaition selector, it should allow variation on the presence or
> > absence of the underline (which must then be encoded explicitly as a
> > combining character.
>
> I totally agree that abbreviation indicating superscript should not be
> encoded using variation selectors, as already stated I don’t prefer it.
> >
> > So finally what we get with variation selectors is: <base letter,
> > variation selector, combining diacritic> and <base letter
> > precombined with the diacritic, variation selector> which is NOT
> > canonically equivalent.
>
> That seems to me like a flaw in canonical equivalence. Variations must
> be canonically equivalent, and the variation selector position should
> be handled or parsed accordingly. Personally I’m unaware of this rule.
> >
> > Using a combining character avoids this caveat: <base letter,
> > combining diacritic, combining abbreviation mark> and <base letter
> > precombined with the diacritic, combining abbreviation mark> which
> > ARE canonically equivalent. And this explicitly states the semantic
> > (something that is lost if we are forced to use presentational
> > superscripts in a higher level protocol like HTML/CSS for rich text
> > format, and one just extracts the plain text; using collation will
> > not help at all, except if collators are built with preprocessing
> > that will first infer the presence of a <combining abbreviation mark>
> > to insert after each combining sequence of the plain-text enclosed in
> > an italic style).
>
> That exactly outlines my concern with calls for relegating superscript
> as an abbreviation indicator to higher level protocols like HTML/CSS.
>

That's exactly my concern: this reliance on HTML/CSS should NOT occur
at all! It's really not the solution; HTML/CSS styles have NO semantics at
all (I demonstrated it in the message you are quoting).


> > There's little risk: if the <combining abbreviation mark> is not
> > mapped in fonts (or not recognized by text renderers to create
> > synthetic superscript glyphs from existing recognized clusters), it
> > will render as a visible .notdef (tofu). But normally text renderers
> > recognize the basic properties of characters in the UCD and can see
> > that the <combining abbreviation mark> has a combining mark general
> > property (they also know that it has a 0 combining class, so
> > canonical equivalences are not broken) to render a better symbol
> > than the .notdef "tofu": it would be better to render a dotted
> > circle. Even if this tofu or dotted circle is rendered, it still
> > explicitly marks the presence of the abbreviation mark, so there's
> > less confusion about what precedes it (the combining sequence that
> > was supposed to be superscripted).
>
> The problem with the <combining abbreviation mark> you are proposing
> is that it contradicts streamlined implementation as well as easy
> input of current abbreviations like ordinal indicators in French and,
> optionally, in English. Preformatted superscripts are already widely
> implemented, and coding of "4ᵉ" only needs two characters, input
> using only three fingers in two steps (thumb on AltGr, press key
> E04 then E12) with an appropriately programmed layout driver. I’m
> afraid that the solution with a <combining abbreviation mark> would be
> much less straightforward.
>

This is not a real concern: this is a legacy old practice that should no
longer be recommended as it is ambiguous (nothing says that "4ᵉ" is an
abbreviated ordinal; it can just as well be 4 raised to the power e, or
various other things).

Also the keys to press on a keyboard are absolutely not a concern: the same
key presses you propose can just as well generate the letter followed by
the combining abbreviation mark. In fact what you propose is even less
practical because it uses complex input for all characters and re

Re: UCA unnecessary collation weight 0000

2018-11-02 Thread Philippe Verdy via Unicode
I was replying not about the notational representation of the DUCET data
table (which uses "0000" weights unnecessarily) but about the text of
UTR#10 itself, which remains highly confusing, contains completely
unnecessary steps, and just complicates things with absolutely no benefit
at all by introducing confusion about these "0000". UTR#10 still does not
explicitly state that its use of "0000" does not mean it is a valid
"weight"; it's a notation only (and the notation is used for TWO distinct
purposes: one is for presenting the notation format used in the DUCET
itself to show how collation elements are structured, the other one is for
marking the presence of a possible, but not always required, encoding of an
explicit level separator when encoding sort keys).

UTR#10 is still needlessly confusing. Even the example tables can be made
without using these "0000" (for example, tables showing how to build sort
keys can present the list of weights split into separate columns, one
column per level, without any "0000"). The implementation does not
necessarily have to create a buffer containing all weight values in a row,
when separate buffers for each level are far superior (and even more
efficient as this can save space in memory). The step "S3.2" in the UCA
algorithm should not even be there (it is made in favor of a specific
implementation which is not even efficient or optimal, and it complicates
the algorithm with absolutely no benefit at all); you can ALWAYS remove it
completely and still generate equivalent results.


Le ven. 2 nov. 2018 à 15:23, Mark Davis ☕️  a écrit :

> The table is the way it is because it is easier to process (and
> comprehend) when the first field is always the primary weight, second is
> always the secondary, etc.
>
> Go ahead and transform the input DUCET files as you see fit. The "should
> be removed" is your personal preference. Unless we hear strong demand
> otherwise from major implementers, people have better things to do than
> change their parsers to suit your preference.
>
> Mark
>
>
> On Fri, Nov 2, 2018 at 2:54 PM Philippe Verdy  wrote:
>
>> It's not just a question of "I like it or not". But the fact is that the
>> standard makes the presence of 0000 required in some steps, and the
>> requirement is in fact wrong: this is in fact NEVER required to create an
>> equivalent collation order. These steps are completely unnecessary and
>> should be removed.
>>
>> Le ven. 2 nov. 2018 à 14:03, Mark Davis ☕️  a écrit :
>>
>>> You may not like the format of the data, but you are not bound to it. If
>>> you don't like the data format (eg you want [.0021.0002] instead of
>>> [.0000.0021.0002]), you can transform it however you want as long as you
>>> get the same answer, as it says here:
>>>
>>> http://unicode.org/reports/tr10/#Conformance
>>> “The Unicode Collation Algorithm is a logical specification.
>>> Implementations are free to change any part of the algorithm as long as any
>>> two strings compared by the implementation are ordered the same as they
>>> would be by the algorithm as specified. Implementations may also use a
>>> different format for the data in the Default Unicode Collation Element
>>> Table. The sort key is a logical intermediate object: if an implementation
>>> produces the same results in comparison of strings, the sort keys can
>>> differ in format from what is specified in this document. (See Section 9,
>>> Implementation Notes.)”
>>>
>>>
>>> That is what is done, for example, in ICU's implementation. See
>>> http://demo.icu-project.org/icu-bin/collation.html and turn on "raw
>>> collation elements" and "sort keys" to see the transformed collation
>>> elements (from the DUCET + CLDR) and the resulting sort keys.
>>>
>>> a =>[29,05,_05] => 29 , 05 , 05 .
>>> a\u0300 => [29,05,_05][,8A,_05] => 29 , 45 8A , 06 .
>>> à => 
>>> A\u0300 => [29,05,u1C][,8A,_05] => 29 , 45 8A , DC 05 .
>>> À => 
>>>
>>> Mark
>>>
>>>
>>> On Fri, Nov 2, 2018 at 12:42 AM Philippe Verdy via Unicode <
>>> unicode@unicode.org> wrote:
>>>
>>>> As well the step 2 of the algorithm speaks about a single "array" of
>>>> collation elements. Actually it's best to create one separate array per
>>>> level, and append weights for each level in the relevant array for that
>>>> level.
>>>> The steps S2.2 to S2.4 can do this, including for derived collation
>>>> elements in section 10.1, or va

Re: A sign/abbreviation for "magister"

2018-11-02 Thread Philippe Verdy via Unicode
Le ven. 2 nov. 2018 à 16:20, Marcel Schneider via Unicode <
unicode@unicode.org> a écrit :

> That seems to me a regression, after the front has moved in favor of
> recognizing Latin script needs preformatted superscript. The use case is
> clear, as we have ª, º, and n° with degree sign, and so on as already
> detailed in long e-mails in this thread and elsewhere. There is no point
> in setting up or maintaining a Unicode policy stating otherwise, as such
> a policy would be inconsistent with longlasting and extremely widespread
> practice.
>

Using variation selectors is only appropriate for these existing
(preencoded) superscript letters ª and º so that they display the
appropriate (underlined or not underlined) glyph. It is not a solution for
creating superscripts on arbitrary letters, nor for marking that they
should be rendered as superscript (notably, the base letter to transform
into superscript may also have its own combining diacritics, which must be
encoded explicitly, and if you use the variation selector, it should allow
variation on the presence or absence of the underline, which must then be
encoded explicitly as a combining character).

So finally what we get with variation selectors is:
<base letter, variation selector, combining diacritic>
and
<base letter precombined with the diacritic, variation selector>
which is NOT canonically equivalent.

Using a combining character avoids this caveat:
<base letter, combining diacritic, combining abbreviation mark>
and
<base letter precombined with the diacritic, combining abbreviation mark>
which ARE canonically equivalent.
And this explicitly states the semantics (something that is lost if we are
forced to use presentational superscripts in a higher-level protocol like
HTML/CSS for rich text format, and one just extracts the plain text; using
collation will not help at all, except if collators are built with
preprocessing that will first infer the presence of a <combining
abbreviation mark> to insert after each combining sequence of the
plain text enclosed in an italic style).

There's little risk: if the <combining abbreviation mark> is not mapped in
fonts (or not recognized by text renderers to create synthetic superscript
glyphs from existing recognized clusters), it will render as a visible
.notdef (tofu). But normally text renderers recognize the basic properties
of characters in the UCD and can see that the <combining abbreviation mark>
has a combining mark general property (they also know that it has a 0
combining class, so canonical equivalences are not broken) to render a
better symbol than the .notdef "tofu": it would be better to render a
dotted circle. Even if this tofu or dotted circle is rendered, it still
explicitly marks the presence of the abbreviation mark, so there's less
confusion about what precedes it (the combining sequence that was supposed
to be superscripted).

The <combining abbreviation mark> can also have its own <variation
selector> to select other styles when they are optional, such as adding
underlines to the superscripted letter, or rendering the letter instead as
a subscript, or as a small baseline letter with a dot after it: this is
still an explicit abbreviation mark, and the meaning of the plain text is
still preserved (the variation selector is only suitable to alter the
rendering of a cluster when it effectively has several variants and the
default rendering is not universal, notably across font styles initially
designed for specific markets with their own local preferences: the
variation selector still allows the same fonts to map all known variants
distinctly, independently of the initial arbitrary choice of the default
glyph used when the variation selector is missing).

Even if fonts (or text renderers) may map the <combining abbreviation mark>
to variable glyphs, this is purely stylistic; the semantics of the plain
text are not lost because the <combining abbreviation mark> is still there.
There's no need for any rich text to encode it (rich-text styles do not
explicitly encode that a superscript is actually an abbreviation mark, so
they also cannot allow variations like rendering a subscript, or a baseline
small glyph with an added dot). Typically a <combining abbreviation mark>
used in an English style would render the letter (or cluster) before it as
a "small" letter without any added dot.

So I really think that a <combining abbreviation mark> is far better than:
* using preencoded superscript letters (they don't cover all the necessary
repertoire of clusters where the abbreviation is needed; it now just covers
Basic Latin, ten digits, plus and minus signs, and the dot or comma, plus a
few other letters like stops; it's impossible to re-encode the full Unicode
repertoire and its allowed combining sequences or extended default grapheme
clusters!),
* or using variation selectors to make them appear as a superscript (this
does not work with all clusters containing other diacritics like accents),
* or using rich-text styling (from which you cannot safely infer any
semantics: there is no warranty that a styled superscript "Mr" in HTML is
actually an abbreviation of "Mister"; in HTML the semantics are encoded
elsewhere, in a possible <abbr> container element, and the meaning of the
abbreviation is found inside its title attribute, so obviously this
requires complex preprocessing before we can infer a plaintext version
(suitable for example in plain-text searches where you don't want to match
a mathematical
Re: UCA unnecessary collation weight 0000

2018-11-01 Thread Philippe Verdy via Unicode
As well the step 2 of the algorithm speaks about a single "array" of
collation elements. Actually it's best to create one separate array per
level, and append weights for each level in the relevant array for that
level.
The steps S2.2 to S2.4 can do this, including for derived collation
elements in section 10.1, or variable weighting in section 4.

This also means that for fast string compares, the primary weights can be
processed on the fly (without needing any buffering) if the primary weights
are different between the two strings (including when one or both of the
two strings ends, and the secondary or tertiary weights detected until then
have not included any weight higher than the minimum weight value for each
level).
Otherwise:
- the first secondary weight higher than the minimum secondary weight
value, and all subsequent secondary weights, must be buffered in a
secondary buffer.
- the first tertiary weight higher than the minimum tertiary weight value,
and all subsequent tertiary weights, must be buffered in a tertiary buffer.
- and so on for higher levels (each buffer just needs to keep a counter,
set when it's first used, indicating how many weights were not buffered
while processing and counting the primary weights, because all these
weights were equal to the minimum value for the relevant level)
- these secondary/tertiary/etc. buffers will only be used once you reach
the end of the two strings when processing the primary level and no
difference was found: you'll start by comparing the initial counters in
these buffers and the buffer that has the largest counter value is
necessarily for the smaller compared string. If both counters are equal,
then you start comparing the weights stored in each buffer, until one of
the buffers ends before another (the shorter buffer is for the smaller
compared string). If both weight buffers reach the end, you use the next
pair of buffers built for the next level and process them with the same
algorithm.
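
A minimal sketch of this level-by-level comparison, assuming each string
has already been mapped to its list of collation elements; each element is
represented here as a dict from level to weight, so an element that is
ignorable at some level simply has no entry for it (no 0000 weight is ever
materialized). The counter optimization described above is left out for
brevity, and the helper collation_elements() is hypothetical:

  def compare(s1, s2, collation_elements, levels=3):
      # Compare strings level by level; neither 0000 weights nor level
      # separators are needed anywhere.
      ces1 = collation_elements(s1)
      ces2 = collation_elements(s2)
      for level in range(1, levels + 1):
          # Collect only the weights present at this level; ignorable
          # elements contribute nothing instead of a 0000 weight.
          w1 = [ce[level] for ce in ces1 if level in ce]
          w2 = [ce[level] for ce in ces2 if level in ce]
          if w1 != w2:
              return -1 if w1 < w2 else 1
      return 0

With this representation, the [.0021.0002] element for COMBINING GRAVE
ACCENT discussed elsewhere in this thread would simply be
{2: 0x0021, 3: 0x0002}, with no primary entry at all.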

Nowhere will you ever need to consider any [.0000] weight, which is just a
notation in the format of the DUCET intended only to be readable by humans
but never needed in any machine implementation.

Now if you want to create sort keys this is similar, except that you don't
have two strings to process and compare; all you want is to create separate
arrays of weights for each level: each level can be encoded separately, and
the encoding must be made so that, when you concatenate the encoded arrays,
the first few encoded *bits* in the secondary or tertiary encodings cannot
be larger than or equal to the bits used by the encoding of the primary
weights (this only limits how you'll encode the first weight in each array,
as its first encoding *bits* must be lower than the first bits used to
encode any weight in previous levels).

Nowhere you are required to encode weights exactly like their logical
weight, this encoding is fully reversible and can use any suitable
compression technics if needed. As long as you can safely detect when an
encoding ends, because it encounters some bits (with lower values) used to
start the encoding of one of the higher levels, the compression is safe.

For each level, you can reserve only a single code used to "mark" the start
of another higher level, followed by some bits to indicate which level it
is, then followed by the compressed code for the level, made so that each
weight is encoded by a code not starting with the reserved mark. That
encoding "mark" is not necessarily a 0000; it may be a null byte, or a '!'
(if the encoding must be readable as ASCII or UTF-8-based, and must not use
any control or SPACE or isolated surrogate), and codes used to encode each
weight must not start with a byte lower than or equal to this mark. The
binary or ASCII code units used to encode each weight must just be
comparable, so that comparing codes is equivalent to comparing the weights
represented by each code.

As well, you are not required to store multiple "marks". This is just one
of the possibilities to encode in the sort key which level is encoded after
each "mark", and the marks are not necessarily the same before each level
(their length may also vary depending on the level they are starting):
these marks may be completely removed from the final encoding if the
encoding/compression used allows discriminating the level used by all
weights, encoded in separate sets of values.

Typical compression techniques are, for example, differential encoding,
notably at secondary or higher levels, and run-length encoding to skip
sequences of weights all equal to the minimum weight.

The code units used by the weight encoding for each level may also need to
avoid some forbidden values if needed (e.g. when encoding the weights to
UTF-8 or UTF-16, or BOCU-1, or SCSU, you cannot use code units reserved
for or representing an isolated surrogate in U+D800..U+DFFF, as this
would create a string not conforming to any standard UTF).

Once again this means that the sequence of logical weight will can 

Re: UCA unnecessary collation weight 0000

2018-11-01 Thread Philippe Verdy via Unicode
So it should be clear in the UCA algorithm and in the DUCET datatable that
"0000" is NOT a valid weight.
It is just a notational placeholder, used as ".0000", only indicating in
the DUCET format that there's NO weight assigned at the indicated level,
because the collation element is ALWAYS ignorable at this level.
The DUCET could just as well have used the notation ".none", or just
dropped every ".0000" in its file (provided it contains a data entry
specifying what the minimum weight used for each level is). This notation
is only intended to be read by humans editing the file, so they don't need
to wonder what the level of the first indicated weight is, or to remember
what the minimum weight for that level is.
But the DUCET table is actually generated by a machine and processed by
machines.



Le jeu. 1 nov. 2018 à 21:57, Philippe Verdy  a écrit :

> In summary, this step given in the algorithm is completely unneeded and
> can be dropped completely:
>
> *S3.2  *If L is not 1, append a *level
> separator*
>
> *Note:* The level separator is zero (0000), which is guaranteed to be
> lower than any weight in the resulting sort key. This guarantees that when
> two strings of unequal length are compared, where the shorter string is a
> prefix of the longer string, the longer string is always sorted after the
> shorter—in the absence of special features like contractions. For example:
> "abc" < "abcX" where "X" can be any character(s).
>
> Remove any reference to the "level separator" from the UCA. You never need
> it.
>
> As well this paragraph
>
> 7.3 Form Sort Keys 
>
> *Step 3.* Construct a sort key for each collation element array by
> successively appending all non-zero weights from the collation element
> array. Figure 2 gives an example of the application of this step to one
> collation element array.
>
> Figure 2. Collation Element Array to Sort Key
> 
> Collation Element Array:
> [.0706.0020.0002], [.06D9.0020.0002], [.0000.0021.0002], [.06EE.0020.0002]
> Sort Key:
> 0706 06D9 06EE 0000 0020 0020 0021 0020 0000 0002 0002 0002 0002
>
> can be written with this figure:
>
> Figure 2. Collation Element Array to Sort Key
> 
> Collation Element Array:
> [.0706.0020.0002], [.06D9.0020.0002], [.0021.0002], [.06EE.0020.0002]
> Sort Key:
> 0706 06D9 06EE 0020 0020 0021 (0020) (0002 0002 0002 0002)
>
> The parentheses mark the collation weights 0020 and 0002 that can be
> safely removed if they are respectively the minimum secondary weight and
> minimum tertiary weight.
> But note that 0020 is kept in two places as they are followed by a higher
> weight 0021. This is general for any tailored collation (not just the
> DUCET).
>
> Le jeu. 1 nov. 2018 à 21:42, Philippe Verdy  a écrit :
>
>> The 0000 is there in the UCA only because the DUCET is published in a
>> format that uses it, but here also this format is useless: you never need
>> any [.0000], or [.0000.0000] in the DUCET table either. Instead the DUCET
>> just needs to indicate what the minimum weight assigned for every level is
>> (except the highest level, where it is "implicitly" 0001, and not 0000).
>>
>>
>> Le jeu. 1 nov. 2018 à 21:08, Markus Scherer  a
>> écrit :
>>
>>> There are lots of ways to implement the UCA.
>>>
>>> When you want fast string comparison, the zero weights are useful for
>>> processing -- and you don't actually assemble a sort key.
>>>
>>> People who want sort keys usually want them to be short, so you spend
>>> time on compression. You probably also build sort keys as byte vectors not
>>> uint16 vectors (because byte vectors fit into more APIs and tend to be
>>> shorter), like ICU does using the CLDR collation data file. The CLDR root
>>> collation data file remunges all weights into fractional byte sequences,
>>> and leaves gaps for tailoring.
>>>
>>> markus
>>>
>>


Re: UCA unnecessary collation weight 0000

2018-11-01 Thread Philippe Verdy via Unicode
In summary, this step given in the algorithm is completely unneeded and can
be dropped completely:

*S3.2  *If L is not 1, append a *level
separator*

*Note:* The level separator is zero (0000), which is guaranteed to be lower
than any weight in the resulting sort key. This guarantees that when two
strings of unequal length are compared, where the shorter string is a
prefix of the longer string, the longer string is always sorted after the
shorter—in the absence of special features like contractions. For example:
"abc" < "abcX" where "X" can be any character(s).

Remove any reference to the "level separator" from the UCA. You never need
it.

As well this paragraph

7.3 Form Sort Keys 

*Step 3.* Construct a sort key for each collation element array by
successively appending all non-zero weights from the collation element
array. Figure 2 gives an example of the application of this step to one
collation element array.

Figure 2. Collation Element Array to Sort Key

Collation Element Array:
[.0706.0020.0002], [.06D9.0020.0002], [.0000.0021.0002], [.06EE.0020.0002]
Sort Key:
0706 06D9 06EE 0000 0020 0020 0021 0020 0000 0002 0002 0002 0002

can be written with this figure:

Figure 2. Collation Element Array to Sort Key

Collation Element Array:
[.0706.0020.0002], [.06D9.0020.0002], [.0021.0002], [.06EE.0020.0002]
Sort Key:
0706 06D9 06EE 0020 0020 0021 (0020) (0002 0002 0002 0002)

The parentheses mark the collation weights 0020 and 0002 that can be safely
removed if they are respectively the minimum secondary weight and minimum
tertiary weight.
But note that 0020 is kept in two places as they are followed by a higher
weight 0021. This is general for any tailored collation (not just the
DUCET).
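
A minimal sketch of the reduction shown in this rewritten Figure 2, using a
dict-per-collation-element representation (level -> weight, no 0000
entries); the minimum weights 0020 and 0002 are the DUCET values quoted in
the figure, and the key is concatenated with no level separator, relying on
the claim made in this thread that every weight of a given level is
numerically higher than any weight of the next level:

  def sort_key(collation_elements, min_weights=None):
      # Build a sort key: one weight array per level, trailing minimum
      # weights trimmed (the parenthesized weights above), then concatenated
      # without any 0000 level separator.
      if min_weights is None:
          min_weights = {2: 0x0020, 3: 0x0002}
      levels = sorted({lvl for ce in collation_elements for lvl in ce})
      key = []
      for level in levels:
          weights = [ce[level] for ce in collation_elements if level in ce]
          min_w = min_weights.get(level)
          while min_w is not None and weights and weights[-1] == min_w:
              weights.pop()
          key.extend(weights)
      return key

On the collation element array of Figure 2 this yields
0706 06D9 06EE 0020 0020 0021, i.e. exactly the reduced key above with the
parenthesized weights removed.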

Le jeu. 1 nov. 2018 à 21:42, Philippe Verdy  a écrit :

> The 0000 is there in the UCA only because the DUCET is published in a
> format that uses it, but here also this format is useless: you never need
> any [.0000], or [.0000.0000] in the DUCET table either. Instead the DUCET
> just needs to indicate what the minimum weight assigned for every level is
> (except the highest level, where it is "implicitly" 0001, and not 0000).
>
>
> Le jeu. 1 nov. 2018 à 21:08, Markus Scherer  a
> écrit :
>
>> There are lots of ways to implement the UCA.
>>
>> When you want fast string comparison, the zero weights are useful for
>> processing -- and you don't actually assemble a sort key.
>>
>> People who want sort keys usually want them to be short, so you spend
>> time on compression. You probably also build sort keys as byte vectors not
>> uint16 vectors (because byte vectors fit into more APIs and tend to be
>> shorter), like ICU does using the CLDR collation data file. The CLDR root
>> collation data file remunges all weights into fractional byte sequences,
>> and leaves gaps for tailoring.
>>
>> markus
>>
>


Re: UCA unnecessary collation weight 0000

2018-11-01 Thread Philippe Verdy via Unicode
The 0000 is there in the UCA only because the DUCET is published in a
format that uses it, but here also this format is useless: you never need
any [.0000], or [.0000.0000] in the DUCET table either. Instead the DUCET
just needs to indicate what the minimum weight assigned for every level is
(except the highest level, where it is "implicitly" 0001, and not 0000).


Le jeu. 1 nov. 2018 à 21:08, Markus Scherer  a écrit :

> There are lots of ways to implement the UCA.
>
> When you want fast string comparison, the zero weights are useful for
> processing -- and you don't actually assemble a sort key.
>
> People who want sort keys usually want them to be short, so you spend time
> on compression. You probably also build sort keys as byte vectors not
> uint16 vectors (because byte vectors fit into more APIs and tend to be
> shorter), like ICU does using the CLDR collation data file. The CLDR root
> collation data file remunges all weights into fractional byte sequences,
> and leaves gaps for tailoring.
>
> markus
>


Re: UCA unnecessary collation weight 0000

2018-11-01 Thread Philippe Verdy via Unicode
Le jeu. 1 nov. 2018 à 21:31, Philippe Verdy  a écrit :

> so you can use these two last functions to write the first one:
>
>   bool isIgnorable(int level, string element) {
> return getLevel(getWeightAt(element, 0)) > getMinWeight(level);
>   }
>
correction:
return getWeightAt(element, 0) > getMinWeight(level);


Re: UCA unnecessary collation weight 0000

2018-11-01 Thread Philippe Verdy via Unicode
Le jeu. 1 nov. 2018 à 21:08, Markus Scherer  a
écrit :

> When you want fast string comparison, the zero weights are useful for
>> processing -- and you don't actually assemble a sort key.
>>
>
And no, I see absolutely no case where any 0000 weight is useful during
processing; it does not distinguish any case, even for "fast" string
comparison.

Even if you don't build any sort key, maybe you'll want to return 0000 if
you query the weight for a specific collatable element, but this would be
the same as querying whether the collatable element is ignorable or not for
a given specific level; this query just returns a false or true boolean,
like this method of a Collator object:

  bool isIgnorable(int level, string collatableElement)

and you can also make this reliable for any collator:

  int getLevel(int weight);
  int getMinWeight(int level);
  int getWeightAt(string element, int level, int position);

so you can use these two last functions to write the first one:

  bool isIgnorable(int level, string element) {
return getLevel(getWeightAt(element, 0)) > getMinWeight(level);
  }

That's enough to write the fast comparison...

What I described is not a complicated "compression"; it is done on the fly,
without any complex transform. All that counts is that any primary weight
value is higher than any secondary weight, and any secondary weight is
higher than any tertiary weight.


Re: UCA unnecessary collation weight 0000

2018-11-01 Thread Philippe Verdy via Unicode
I'm not speaking just about how collation keys will finally be stored (as
uint16 or bytes, or variable-length sequences of bits); I'm just
referring to the sequence of weights you generate.
You absolutely NEVER need ANYWHERE in the UCA algorithm any 0000 weight,
not even during processing, nor in the DUCET table.

Le jeu. 1 nov. 2018 à 21:08, Markus Scherer  a écrit :

> There are lots of ways to implement the UCA.
>
> When you want fast string comparison, the zero weights are useful for
> processing -- and you don't actually assemble a sort key.
>
> People who want sort keys usually want them to be short, so you spend time
> on compression. You probably also build sort keys as byte vectors not
> uint16 vectors (because byte vectors fit into more APIs and tend to be
> shorter), like ICU does using the CLDR collation data file. The CLDR root
> collation data file remunges all weights into fractional byte sequences,
> and leaves gaps for tailoring.
>
> markus
>


Re: UCA unnecessary collation weight 0000

2018-11-01 Thread Philippe Verdy via Unicode
For example, Figure 3 in the UTR#10 contains:

Figure 3. Comparison of Sort Keys

  String   Sort Key
1 cab      *0706* 06D9 06EE 0000 0020 0020 *0020* 0000 *0002* 0002 0002
2 Cab      *0706* 06D9 06EE 0000 0020 0020 *0020* 0000 *0008* 0002 0002
3 cáb      *0706* 06D9 06EE 0000 0020 0020 *0021* 0020 0000 0002 0002 0002 0002
4 dab      *0712* 06D9 06EE 0000 0020 0020 0020 0000 0002 0002 0002


The 0000 weights are never needed, even if any of the source strings
("cab", "Cab", "cáb", "dab") is followed by ANY other string, or if any
other string (higher than "b") replaces their final "b".
What is really important is to understand where the input text (after
initial transforms like reordering and expansion) is broken at specific
boundaries between collatable elements.
But the boundaries between the weights of each level in the sort key can
always be inferred, for example between 06EE and 0020, or between 0020 and
0002.
So this can obviously be changed to just:

Figure 3. Comparison of Sort Keys


  String   Sort Key
1 cab *0706* 06D9 06EE 0020 0020 *0020* *0002* 0002 0002
2 Cab *0706* 06D9 06EE 0020 0020 *0020* *0008* 0002 0002
3 cáb *0706* 06D9 06EE 0020 0020 *0021* 0020 0002 0002 0002 0002
4 dab *0712* 06D9 06EE 0020 0020 0020 0002 0002 0002
As well (emphasized above):
* when the secondary weights in the sort key are terminated by any sequence
of 0020 (the minimal secondary weight), you can suppress them from the
collation key;
* when the tertiary weights in the sort key are terminated by any
sequence of 0002 (the minimal tertiary weight), you can suppress them from
the collation key.
This gives:

Figure 3. Comparison of Sort Keys

  String   Sort Key
1 cab *0706* 06D9 06EE
2 Cab *0706* 06D9 06EE *0008*
3 cáb *0706* 06D9 06EE 0020 0020 *0021*
4 dab *0712* 06D9 06EE
See the reduction !

Le jeu. 1 nov. 2018 à 18:39, Philippe Verdy  a écrit :

> I just remarked that there's absolutely NO utility for the collation weight
> 0000 anywhere in the algorithm.
>
> For example in UTR #10, section 3.3.1 gives a collation element:
>   [.0000.0021.0002]
> for COMBINING GRAVE ACCENT. However it can also be simply:
>   [.0021.0002]
> for a simple reason: the secondary or tertiary weights are necessarily
> LOWER than any primary weight (for conformance reasons):
>  any tertiary weight < any secondary weight < any primary weight
> (the set of all weights for all levels is fully partitioned into disjoint
> intervals in the same order, each interval containing all its weights, so
> weights are sorted by decreasing level, then increasing weight in all cases)
>
> This also means that we never need to handle 0000 weights when creating
> sort keys from multiple collation elements, as we can easily detect that
> [.0021.0002] given above starts with a secondary weight 0021 and not a
> primary weight.
>
> As well we don't need to use any 0000 level separator in the sort key.
>
> This allows more interesting optimizations, and reduction of length for
> sort keys.
> What this means is that we can safely implement UCA using basic
> substitions (e.g. with a function like "string:gsub(map)" in Lua which uses
> a "map" to map source (binary) strings or regexps,into target (binary)
> strings:
>
> For a level-3 collation, you just then need only 3 calls to
> "string:gsub()" to compute any collation:
>
> - the first ":gsub(mapNormalize)" can decompose a source text into
> collation elements and can perform reordering to enforce a normalized order
> (possibly tuned for the tailored locale) using basic regexps.
>
> - the second ":gsub(mapTertiary)" will substitute any collation
> elements by their "intermediary" collation elements + tertiary weight.
>
> - the third ":gsub(mapSecondary)" will substitute any "intermediary"
> collation element by their primary weight + secondary weight
>
> The "intermediary" collection elements are just like source text, except
> that higher level differences are eliminated, i.e.all source collation
> element string are replaced by the collection element string that have the
> smallest collation element weights. They must be just encoded so that they
> are HIGHER than any higher level weights.
>
> How to do that:
> - reserve the weight range between .0000 (yes! not just .0001) and .001E
> for the last (tertiary) weight, make sure that all other intermediary
> collation elements will use only code units higher than .0020 (this means
> that they can remain encoded in their existing UTF form!)
> - reserve the weight .001F for the case where you don't want to use
> secondary differences (like letter case) and map them to tertiary
> differences.
>
> This will be used in the second mapping to decompose source collation
> elements into "intermediary collation elements" + tertiary weight. You may
> then decide to leave tertiary weights 

UCA unnecessary collation weight 0000

2018-11-01 Thread Philippe Verdy via Unicode
I just remarked that there's absolutely NO utility for the collation weight
0000 anywhere in the algorithm.

For example in UTR #10, section 3.3.1 gives a collation element:
  [.0000.0021.0002]
for COMBINING GRAVE ACCENT. However it can also be simply:
  [.0021.0002]
for a simple reason: the secondary or tertiary weights are necessarily
LOWER than any primary weight (for conformance reasons):
 any tertiary weight < any secondary weight < any primary weight
(the set of all weights for all levels is fully partitioned into disjoint
intervals in the same order, each interval containing all its weights, so
weights are sorted by decreasing level, then increasing weight in all cases)

This also means that we never need to handle 0000 weights when creating
sort keys from multiple collation elements, as we can easily detect that
[.0021.0002] given above starts with a secondary weight 0021 and not a
primary weight.

As well we don't need to use any 0000 level separator in the sort key.

This allows more interesting optimizations, and reduction of length for
sort keys.
What this means is that we can safely implement UCA using basic
substitutions (e.g. with a function like "string:gsub(map)" in Lua, which
uses a "map" to map source (binary) strings or regexps into target (binary)
strings):

For a level-3 collation, you just then need only 3 calls to "string:gsub()"
to compute any collation:

- the first ":gsub(mapNormalize)" can decompose a source text into
collation elements and can perform reordering to enforce a normalized order
(possibly tuned for the tailored locale) using basic regexps.

- the second ":gsub(mapTertiary)" will substitute any collation elements
by their "intermediary" collation elements + tertiary weight.

- the third ":gsub(mapSecondary)" will substitute any "intermediary"
collation element by its primary weight + secondary weight.

The "intermediary" collection elements are just like source text, except
that higher level differences are eliminated, i.e.all source collation
element string are replaced by the collection element string that have the
smallest collation element weights. They must be just encoded so that they
are HIGHER than any higher level weights.

How to do that:
- reserve the weight range between .0000 (yes! not just .0001) and .001E
for the last (tertiary) weight, and make sure that all other intermediary
collation elements will use only code units higher than .0020 (this means
that they can remain encoded in their existing UTF form!)
- reserve the weight .001F for the case where you don't want to use
secondary differences (like letter case) and want to map them to tertiary
differences.

This will be used in the second mapping to decompose source collation
elements into "intermediary collation elements" + tertiary weight. You may
then decide to leave tertiary weights in the substituted string, or, because
"gsub()" finds matches from left to right, to accumulate the tertiary
weights into a separate buffer, so that the substitution itself will still
return a valid UTF string containing only "intermediary collation
elements" (with all tertiary differences erased).

You can repeat the process with the next gsub() to return the "primary
collation elements" (still in UTF form), and separately the secondary
weights (also accumulated in a separate buffer).

Now there remain only 3 strings:
- one contains only the primary collation elements (still in UTF form, but
using code units always higher than or equal to 0020)
- another one contains only secondary weights (between MINSECONDARYWEIGHT
and 001F)
- another one contains only tertiary weights (between 0000 and
MINSECONDARYWEIGHT-1)

For the rest I will assume that MINSECONDARYWEIGHT is 0010, so
* primary weights are encoded with one or more code units in [0020..]
(multiple code units are possible if you reserve some of these code units
to be prefixes of longer sequences)
* secondary weights are encoded with one or more code units in [0010..001E]
(same remark about multiple code units if you need them)
* tertiary weights are encoded with one or more code units
in [0000..000F] (same remark about multiple code units if you need them)

The last gsub() will only reorder the primary collation elements to remap
them into a suitable binary order (it will be a simple bijective
permutation, except that the target does not have to use multiple code
units, but a single one, when there are contractions). It's always possible
to make this permutation generate integers higher than 0020. The resulting
weights can remain encodable with UTF-8 as if they were source text.

And to return the sort key, all you need is to concatenate
* the string containing all primary weights encoded with code units in
[0020..], then
* the string containing secondary weights encoded with code units in
[0010..001E], then
* the string containing tertiary weights encoded with code units in
[0000..000F].
* you don't need to insert ANY [0000] as a level separator in the final
sort key, 
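
A minimal sketch of this final concatenation, with the code-unit ranges
assumed above (primary >= 0020, secondary in [0010..001E], tertiary below
0010): since each level's code units are drawn from disjoint, ordered
ranges, the three strings can simply be concatenated and compared code unit
by code unit with no 0000 separator. The weights below are made up, only to
illustrate the "abc" < "abcX" prefix case mentioned earlier:

  PRIMARY_MIN = 0x0020
  SECONDARY_RANGE = (0x0010, 0x001E)
  TERTIARY_MAX = 0x000F

  def concat_sort_key(primaries, secondaries, tertiaries):
      # The level boundary is implied by the values themselves: any
      # secondary code unit is lower than any primary one, and any tertiary
      # code unit is lower than any secondary one.
      assert all(w >= PRIMARY_MIN for w in primaries)
      assert all(SECONDARY_RANGE[0] <= w <= SECONDARY_RANGE[1] for w in secondaries)
      assert all(w <= TERTIARY_MAX for w in tertiaries)
      return tuple(primaries) + tuple(secondaries) + tuple(tertiaries)

  # Made-up weights: the longer string sorts after the shorter one because,
  # where the shorter key has already dropped into the (lower) secondary
  # range, the longer key still has a primary code unit >= 0020.
  key_abc  = concat_sort_key([0x0706, 0x06D9], [0x0010, 0x0010], [0x0002, 0x0002])
  key_abcx = concat_sort_key([0x0706, 0x06D9, 0x06EE], [0x0010]*3, [0x0002]*3)
  assert key_abc < key_abcx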

Re: A sign/abbreviation for "magister"

2018-10-31 Thread Philippe Verdy via Unicode
As is "Mgr" for Monseigneur in French ("Mgr" without
superscripts makes little sense, and if "Mr" is sometimes found as an
abbreviation for "Monsieur", its standard abbreviation is "M.", and its
plural "Messieurs" is noted "MM" without any abbreviation dot or
superscript, but normally never as "Mrs" or "Mrs"). If someone
finds "Mgr" without the superscript, it could think it is an English
abbreviation for "Manager" (a term now frequently used in the modern
"Frenglish" language used in French business)...

Le mar. 30 oct. 2018 à 22:58, Ken Whistler via Unicode 
a écrit :

>
> On 10/30/2018 2:32 PM, James Kass via Unicode wrote:
> > but we can't seem to agree on how to encode its abbreviation.
>
> For what it's worth, "mgr" seems to be the usual abbreviation in Polish
> for it.
>
> --Ken
>
>


Re: A sign/abbreviation for "magister"

2018-10-29 Thread Philippe Verdy via Unicode
For the case of "Mister" vs. "Magister", the (double) underlining is not
just a stylistic option but conveys semantics as an explicit abbreviation
mark !
We are here at the line between what is pure visual encoding (e.g. using
superscript letters), and logical encoding (as done eveywhere else in
unicode with combining sequences; the most well known exceptions being for
Thai script which uses the visual model).
Obviously the Latin script should not use any kind of visual encoding, and
even the superscript letters (initially introduced for something else,
notably as distinct symbols for IPA) was not the correct path (it also has
limitation because the superscript letters are quite limited; the same can
be saif about the visual encoding of Mathematic symbols as stylistic
variants transformed as plain characters, which will always be incomplete,
while it could as well be represented logically).
So Unicode does not have a consistent policy (and this inconsistence was
not just introduced due to legacy roundtrip compatibibility, like the
Numero abbreviation or the encoding of the Thai script).


Le lun. 29 oct. 2018 à 12:44, Asmus Freytag via Unicode 
a écrit :

> On 10/28/2018 11:50 PM, Martin J. Dürst via Unicode wrote:
>
> On 2018/10/29 05:42, Michael Everson via Unicode wrote:
>
> This is no different the Irish name McCoy which can be written MᶜCoy where 
> the raising of the c is actually just decorative, though perhaps it was once 
> an abbreviation for Mac. In some styles you can see a line or a dot under the 
> raised c. This is purely decorative.
>
> I would encode this as Mʳ if you wanted to make sure your data contained the 
> abbreviation mark. It would not make sense to encode it as M=ͬ or anything 
> else like that, because the “r” is not modifying a dot or a squiggle or an 
> equals sign. The dot or squiggle or equals sign has no meaning at all. And I 
> would not encode it as Mr͇, firstly because it would never render properly 
> and you might as well encode it as Mr. or M:r, and second because in the IPA 
> at least that character indicates an alveolar realization in disordered 
> speech. (Of course it could be used for anything.)
>
>
> I think this may depend on actual writing practice. In German at least,
> it is customary to have dots (periods) at the end of abbreviations, and
> using any other symbol, or not using the dot, would be considered an error.
>
> The question of how to encode that dot is fortunately an easy one, but
> even if it were not, German-writing people would find a sentence such as
> "The dot or ... has no meaning at all." extremely weird. The dot is
> there (and in German, has to be there) because it's an abbreviation.
>
> Swedes employ ":" for abbreviations but often (always?) for eliding
> several word-interior letters. Definitely also a case of a non-optional
> convention.
>
> The use of superscript is tricky, because it can be optional in some
> contexts; if I write "3rd" in English, it will definitely be understood no
> different from "3rd". Likewise with the several marks below superscripts.
> Whether "numero" has an underline or not appears to be a matter of font
> design, with some regional preferences (which also affect the style of the
> N).
>
> I'm very much with James that questions of what is spelling vs. what is
> style (decoration) can be a matter of opinion - or better perhaps, a matter
> of convention and associated expectations. And that there may not always be
> unanimity in the outcome.
>
> In TeX the two transition fluidly. If I was going to transcribe such texts
> in TeX, I would construct a macro for the construct of the entire
> abbreviation and would name it. That macro would raise the "r", and then -
> depending on the desired fidelity of the style of the document, might
> include secondary elements, such as underlining, or a squiggle.
>
> In the standard rich text model of plaintext "back bone" combined with
> font selection (and other styling), the named macro would correspond to
> encoding the semantic of an Mr abbreviation in the "superscript r"
> convention and the details would be handled in the font design.
>
> That system is perhaps not well suited to exact transcriptions because
> unlike Tex, it separates the two aspects, and removes the aspect of
> detailed glyph design from the control of the author, unless the latter is
> also a font-designer.
>
> Nevertheless, I think the use of devices like combining underlines and
> superscript letters in plain text are best avoided.
>
> A./
>
>
>


Re: A sign/abbreviation for "magister"

2018-10-28 Thread Philippe Verdy via Unicode
Also, if the "combining abbreviation mark" is used only at the end of a
combining sequence to transform it, we can avoid any need of CGJ for that
mark, if the mark is itself assigned the combining class 0.
So
- abbreviating "Mister" as "M" (without the underscore below "r") becomes
  
- abbreviating "Monseigneur" as "M" (without the underscore below "g"
and "r") becomes
  
- abbreviating "Ditto" as "D" (without the underscore below "to")
becomes
  
- abbreviating "Operation" as "Op (without the underscore below "to")
becomes
  
- abbreviating "constitutionalité" as "C (without the underscore below
"té") becomes
  <é,COMBINING ABBREVIATION MARK> or
  
- abbreviating "Numéro" as "N" (without the underscore below "o") becomes
  
- abbreviating "Magister" as "M" (with the double underscore below "r")
becomes
  

It is quite easy for text renderers to infer the selection of a small
superscript for the base (and its other combining characters or extenders,
when they support these combinations) before applying the new combining
mark. If not, they can still render the leading base (and its other
supported combining characters or extenders), followed by some dotted mark
(e.g. a small dotted circle).
Renderers that do not recognize the new combining abbreviation mark will
just render it at the end of the sequence as the usual square or rectangular
"tofu"; those that recognize it as a combining character but have no support
for it will render the usual dotted square (meaning "unsupported combining
mark", to distinguish it from the meaning of a "missing base character"
before a known combining mark or extender).


Le dim. 28 oct. 2018 à 18:54, Philippe Verdy  a écrit :

> Le dim. 28 oct. 2018 à 18:28, Janusz S. Bień  a
> écrit :
>
>> On Sun, Oct 28 2018 at 15:19 +0100, Philippe Verdy via Unicode wrote:
>> > Given the "squiggle" below letters are actually gien distinctive
>> > semantics, I think it should be encoded a combining character (to be
>> > written not after a "superscript" but after any normal base letter,
>> > possibly with other combining characters, or CGJ if needed because of
>> > the compatibility equivalence.  That "squiggle" (which may look like
>> > an underscore) would haver the effect of implicity making the base
>> > letter superscript (smaller and elevated). It would have probably a
>> > "combining below" class.
>>
>> Seems to me an elegant solution.
>>
>> [...]
>>
>> On Sat, Oct 27 2018 at 19:52 GMT, James Kass via Unicode wrote:
>> > Mr͇ / M=ͬ
>>
>> For me only the latter seems acceptable. Using COMBINING LATIN SMALL
>> LETTER R is a natural idea, but I feel uneasy using just EQUALS SIGN as
>> the base character. However in the lack of a better solution I can live
>> with it :-)
>>
>
> There's a third alternative, that uses the superscript letter r, followed
> by the combining double underline, instead of the normal letter r followed
> by the same combining double underline.
> However it is still not very elegant if we stil need to use only the
> limited set of superscript letters (this still reduces the number of
> abbreviations, such as those commonly used in French that needs a
> superscript "é")
>
>
>
>
>


Re: A sign/abbreviation for "magister"

2018-10-28 Thread Philippe Verdy via Unicode
Le dim. 28 oct. 2018 à 18:28, Janusz S. Bień  a écrit :

> On Sun, Oct 28 2018 at 15:19 +0100, Philippe Verdy via Unicode wrote:
> > Given the "squiggle" below letters are actually gien distinctive
> > semantics, I think it should be encoded a combining character (to be
> > written not after a "superscript" but after any normal base letter,
> > possibly with other combining characters, or CGJ if needed because of
> > the compatibility equivalence.  That "squiggle" (which may look like
> > an underscore) would haver the effect of implicity making the base
> > letter superscript (smaller and elevated). It would have probably a
> > "combining below" class.
>
> Seems to me an elegant solution.
>
> [...]
>
> On Sat, Oct 27 2018 at 19:52 GMT, James Kass via Unicode wrote:
> > Mr͇ / M=ͬ
>
> For me only the latter seems acceptable. Using COMBINING LATIN SMALL
> LETTER R is a natural idea, but I feel uneasy using just EQUALS SIGN as
> the base character. However in the lack of a better solution I can live
> with it :-)
>

There's a third alternative, which uses the superscript letter r, followed
by the combining double underline, instead of the normal letter r followed
by the same combining double underline.
However it is still not very elegant if we still need to use only the
limited set of superscript letters (this still reduces the number of
abbreviations, such as those commonly used in French that need a
superscript "é").


Re: A sign/abbreviation for "magister"

2018-10-28 Thread Philippe Verdy via Unicode
Given the "squiggle" below letters are actually gien distinctive semantics,
I think it should be encoded a combining character (to be written not after
a "superscript" but after any normal base letter, possibly with other
combining characters, or CGJ if needed because of the compatibility
equivalence.
That "squiggle" (which may look like an underscore) would haver the effect
of implicity making the base letter superscript (smaller and elevated). It
would have probably a "combining below" class.

In that case U+2116 № is perfectly encodable, but still distinct from the
sequence <N, o, COMBINING ABBREVIATION MARK>, because "№" does not require
this mark (so there's no problem of stability with canonical equivalences,
even if this creates new possible confusable pairs when the mark is used
after a normal letter: the risk of confusion only exists for "№", which is
a legacy non-decomposable ligature that has an existing compatibility
equivalence, just like all the superscript letters).
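
As a side note, the compatibility (non-canonical) equivalence of U+2116
itself can be checked directly with Python's unicodedata; nothing here
depends on the proposed mark, which does not exist in Unicode:

    import unicodedata

    # U+2116 NUMERO SIGN has a compatibility (not canonical) decomposition
    # to plain "No": NFC keeps the ligature, NFKC folds it.
    assert unicodedata.normalize("NFC", "\u2116") == "\u2116"
    assert unicodedata.normalize("NFKC", "\u2116") == "No"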

In that case we have other ways to note *semantically* any abbreviations
using distinctive final letters (including "Nos" abbreviating "Numeros",
"Mme" for "Madame", "Mlle" for "Mademoiselle", "Mgr" for "Monseigneur",
"Pr" abbreviating "Professor"/"Professeur", or "fn" abbreviating
"function").

Notes:
* Superscripted endings are also used in French to abbreviate a "-tion" or
"-tions" suffix (which derives from Latin "-tio" or "-tios"). But I've also
seen other abbreviation marks used for "-tion" and "-tions".
* We also have in Unicode distinctive codes for dots used as abbreviation
marks (they are not combining, but still encoded distinctly from the
regular punctuation full stop), and for the mathematical binary dot
operator, or the decimal separator, or for implicit mathematical operators
that don't mark anything (i.e. invisible and zero-width) but that only
break grapheme clusters and prohibit the formation of discretionary
ligatures.

Medieval books or letters contained lots of abbreviation marks due to the
cost of paper (or parchment): texts were then frequently "packed" using
combining abbreviation marks in various positions (generally above or
below). The Germanic "Fraktur e" was a remnant of this old practice,
inherited from phonetic annotations added on top of Greek, Hebrew and
Arabic, which later turned into an "umlaut" that Unicode unified with the
diaeresis, even if it breaks the historic link to the Latin letter "e" used
like an abbreviation mark or Hebrew vowel point in Fraktur (I think that the
history of the "Germanic Fraktur e" is highly linked to the influence of
Hebrew in today's Germany, or Greek in today's Eastern and Southern Europe,
with some Slavic traditions in Cyrillic connected to religious traditions
in Greek).
The introduction of interlinear annotations in Greek was also largely
influenced by Hebrew and Arabic (which however did not turn these marks
into plain letters and avoided the formation of complex ligatures like in
Indian Brahmic scripts), but was the base of the interlinear notation of
actual phonetics.
Even the combining accents in French were created after an initial step
using ligatures of plain letters, before people started to replace these
ligatures by some unstable combining marks (initially not distinguished)
and then turned them into plain distinctive accents which became the de
facto standard (made the official orthography only very late: before that
there was wide variation between those that wanted to distinguish
phonetics using different accents, but now French tends to simplify this
set: the circumflex in French was an abbreviation mark for an unwritten
letter "s", which initially was more like the tilde, i.e. a turned small
"s"). The German umlaut written like a diaeresis is also very new (only
after the abandonment of the Fraktur alphabet, where the "e" just looked
like two thick vertical strokes).


Le dim. 28 oct. 2018 à 10:41, arno.schmitt via Unicode 
a écrit :

> Am 28.10.2018 um 09:13 schrieb Richard Wordingham via Unicode:
> > The notation is a quite widespread format for abbreviations.  the
> > first letter is normal sized, and the subsequent letter is written in
> > some variety of superscript with a squiggle underneath so that it
> > doesn't get overlooked.  I have deduced that this is not plain text
> > because there is no encoding mechanism for it.  For example, our
> > lecturers would frequently use this treatment to abbreviate function
> > as 'fn' with the 'n' superscript and supported by a squiggle below
> > sitting on the baseline.  The squiggle below has meaning; it marks the
> > word as an abbreviation.
> >
> > Richard.
>
> Looks to me like  U+2116 № NUMERO SIGN
> which perhaps should not have encoded,
> since we have both U+004E LATIN CAPITAL LETTER N and
> U+00BA º MASCULINE ORDINAL INDICATOR
>
> Arn0
>


Re: A sign/abbreviation for "magister"

2018-10-27 Thread Philippe Verdy via Unicode
If it was encoded in Unicode, it would use a single column and the encoding
seems evident:

x0 = MASONIC SQUARE SPACE
x1 = MASONIC SYMBOL A B OR ONE
x2 = MASONIC SYMBOL C D OR TWO
x3 = MASONIC SYMBOL E F OR THREE
x4 = MASONIC SYMBOL G H OR FOUR
x5 = MASONIC SYMBOL I L OR ZERO FIVE
x6 = MASONIC SYMBOL M N OR SIX
x7 = MASONIC SYMBOL O P OR SEVEN
x8 = MASONIC SYMBOL Q R OR EIGHT
x9 = MASONIC SYMBOL S T OR NINE
xA = MASONIC SYMBOL U J
xB = MASONIC SYMBOL X K
xC = MASONIC SYMBOL Y V
xD = MASONIC SYMBOL Z W
xE = MASONIC COMBINING DOT
xF = MASONIC COMBINING DOUBLE DOT (?)


Le dim. 28 oct. 2018 à 04:21, Garth Wallace via Unicode 
a écrit :

> I learned that one as a kid, as the "pigpen cipher". I'm not aware of any
> numerological significance (which is easy enough to "find" in anything).
>
> On Sat, Oct 27, 2018 at 7:43 PM Philippe Verdy via Unicode <
> unicode@unicode.org> wrote:
>
>> More interesting: the Masonic alphabet
>> http://tallermasonico.com/0diccio1.htm
>>
>> - 18 letters of the Latin alphabet (or Hebrew), from A to T (excluding J
>> and K), are disposed by group of 2 letters in a 3x3 square grid, whose
>> global outer sides are not marked on the outer border of the grid but on
>> lines separating columns or rows. Then letters are noted by the marked
>> sides of the square in which they are located, the second letter of the
>> group being distinguished by adding a dot in the middle of the square.
>> - The 4 other letters U to Z (excluding V and W) are noted by disposing
>> them on a 2x2 square grid (this time rotated 45 degrees), whose global
>> outer sides are also not marked on the outer border of the grid but on
>> lines separating columns or rows (only 1 letter is places by cell).
>> They are also noted by the marked sides of their square only.- Finally (if
>> needed) the missing letters J, K, V, W use the same 4 last glyphs, but are
>> distinguished by adding the central dot.
>>
>>
>>AB | CD | EF
>>  --+-+-
>>GH | I L | MN
>>  --+-+-
>>OP | QR | ST
>>
>>  \  XK  /
>>  UJ  >  < WZ
>>  /  YV  \
>>
>>
>> So:
>> - "A" becomes approximately  "_|"
>> - "B" becomes approximately  "_|" with central dot
>> - "U" becomes approximately ">"
>> - "X" becomes approximately "\/"
>> - "J" is noted like "I" as a square, or distinctly approximately as ">"
>> with a central dot
>>
>> The 3x3 grid had some esoterical meaning based on numerology (a legend
>> now propaged by scientology).
>>
>>
>> Le dim. 28 oct. 2018 à 02:59, Philippe Verdy  a
>> écrit :
>>
>>> Do you speak about this one?
>>> https://www.magisterdaire.com/magister-symbol-black-sq/
>>> It looks like a graphic personal signature for the author of this
>>> esoteric book, even if it looks like an interesting composition of several
>>> of our existing Unicode symbols, glued together in a vertical ligature,
>>> rather than a pure combining sequence.
>>> Such technics can be used extensively to create lot of other symbols, by
>>> gluing any kind of wellknown glyphs for standard characters.
>>> Mathematics and technologies (but also companies for their private
>>> corporate logos and branding marks) are constantly inventing new symbols
>>> like this.
>>>
>>>
>>> Le sam. 27 oct. 2018 à 22:01, James Kass via Unicode <
>>> unicode@unicode.org> a écrit :
>>>
>>>>
>>>> Mr͇ / M=ͬ
>>>>
>>>> An image search for "magister symbol" finds many interesting graphics,
>>>> but I couldn't find any resembling the abreviation shown on the post
>>>> card.  (Magister symbol appears to be popular for certain religious and
>>>> gaming uses.)
>>>>
>>>>


Re: A sign/abbreviation for "magister"

2018-10-27 Thread Philippe Verdy via Unicode
So in summary this Masonic "alphabet" uses 13 square "letters" and a single
combining mark (the central dot), possibly extended with the minus and plus
signs and the space. It's possible that the central dot is also used as a
spacing mark to note punctuation.
The assignment of Latin (or Hebrew) letters to this alphabet varies (just
like Braille symbols, depending on languages/scripts).
It may have extensions (like Braille outside its basic 2x3 patterns of
dots), such as a second dot in squares, horizontally as "··" or vertically
as ":".

Le dim. 28 oct. 2018 à 03:40, Philippe Verdy  a écrit :

> More interesting: the Masonic alphabet
> http://tallermasonico.com/0diccio1.htm
>
> - 18 letters of the Latin alphabet (or Hebrew), from A to T (excluding J
> and K), are disposed by group of 2 letters in a 3x3 square grid, whose
> global outer sides are not marked on the outer border of the grid but on
> lines separating columns or rows. Then letters are noted by the marked
> sides of the square in which they are located, the second letter of the
> group being distinguished by adding a dot in the middle of the square.
> - The 4 other letters U to Z (excluding V and W) are noted by disposing
> them on a 2x2 square grid (this time rotated 45 degrees), whose global
> outer sides are also not marked on the outer border of the grid but on
> lines separating columns or rows (only 1 letter is places by cell).
> They are also noted by the marked sides of their square only.- Finally (if
> needed) the missing letters J, K, V, W use the same 4 last glyphs, but are
> distinguished by adding the central dot.
>
>
>AB | CD | EF
>  --+-+-
>GH | I L | MN
>  --+-+-
>OP | QR | ST
>
>  \  XK  /
>  UJ  >  < WZ
>  /  YV  \
>
>
> So:
> - "A" becomes approximately  "_|"
> - "B" becomes approximately  "_|" with central dot
> - "U" becomes approximately ">"
> - "X" becomes approximately "\/"
> - "J" is noted like "I" as a square, or distinctly approximately as ">"
> with a central dot
>
> The 3x3 grid had some esoterical meaning based on numerology (a legend now
> propaged by scientology).
>
>
> Le dim. 28 oct. 2018 à 02:59, Philippe Verdy  a
> écrit :
>
>> Do you speak about this one?
>> https://www.magisterdaire.com/magister-symbol-black-sq/
>> It looks like a graphic personal signature for the author of this
>> esoteric book, even if it looks like an interesting composition of several
>> of our existing Unicode symbols, glued together in a vertical ligature,
>> rather than a pure combining sequence.
>> Such technics can be used extensively to create lot of other symbols, by
>> gluing any kind of wellknown glyphs for standard characters.
>> Mathematics and technologies (but also companies for their private
>> corporate logos and branding marks) are constantly inventing new symbols
>> like this.
>>
>>
>> Le sam. 27 oct. 2018 à 22:01, James Kass via Unicode 
>> a écrit :
>>
>>>
>>> Mr͇ / M=ͬ
>>>
>>> An image search for "magister symbol" finds many interesting graphics,
>>> but I couldn't find any resembling the abreviation shown on the post
>>> card.  (Magister symbol appears to be popular for certain religious and
>>> gaming uses.)
>>>
>>>


Re: A sign/abbreviation for "magister"

2018-10-27 Thread Philippe Verdy via Unicode
I must add that the Masonic 3x3 grid alphabet has been proposed as an
alternative to Braille, easier to learn and memorize, easier and faster to
draw with a pen on paper without any physical guide, and easier also to
recognize using only tactile contact by a fingertip, but more difficult to
form without cutting the sheet of paper while tracing the strokes. But it
was seen on some manufactured Masonic objects.

To note digits with the same shapes (as Braille does with its 2x3 dot
grid), the same 3x3 grid is used for digits 1 to 9 (digit 0, where it is
significant, uses the same square as 5 but with a central dot, or otherwise
a space), but additional symbols "+" and "-" are used (without a central
dot) to switch between letters and digits. The placement of digits 1 to 9
(except 5) on the 3x3 grid varies (horizontally first, or vertically
first).

Le dim. 28 oct. 2018 à 03:40, Philippe Verdy  a écrit :

> More interesting: the Masonic alphabet
> http://tallermasonico.com/0diccio1.htm
>
> - 18 letters of the Latin alphabet (or Hebrew), from A to T (excluding J
> and K), are disposed by group of 2 letters in a 3x3 square grid, whose
> global outer sides are not marked on the outer border of the grid but on
> lines separating columns or rows. Then letters are noted by the marked
> sides of the square in which they are located, the second letter of the
> group being distinguished by adding a dot in the middle of the square.
> - The 4 other letters U to Z (excluding V and W) are noted by disposing
> them on a 2x2 square grid (this time rotated 45 degrees), whose global
> outer sides are also not marked on the outer border of the grid but on
> lines separating columns or rows (only 1 letter is places by cell).
> They are also noted by the marked sides of their square only.- Finally (if
> needed) the missing letters J, K, V, W use the same 4 last glyphs, but are
> distinguished by adding the central dot.
>
>
>AB | CD | EF
>  --+-+-
>GH | I L | MN
>  --+-+-
>OP | QR | ST
>
>  \  XK  /
>  UJ  >  < WZ
>  /  YV  \
>
>
> So:
> - "A" becomes approximately  "_|"
> - "B" becomes approximately  "_|" with central dot
> - "U" becomes approximately ">"
> - "X" becomes approximately "\/"
> - "J" is noted like "I" as a square, or distinctly approximately as ">"
> with a central dot
>
> The 3x3 grid had some esoterical meaning based on numerology (a legend now
> propaged by scientology).
>
>
> Le dim. 28 oct. 2018 à 02:59, Philippe Verdy  a
> écrit :
>
>> Do you speak about this one?
>> https://www.magisterdaire.com/magister-symbol-black-sq/
>> It looks like a graphic personal signature for the author of this
>> esoteric book, even if it looks like an interesting composition of several
>> of our existing Unicode symbols, glued together in a vertical ligature,
>> rather than a pure combining sequence.
>> Such technics can be used extensively to create lot of other symbols, by
>> gluing any kind of wellknown glyphs for standard characters.
>> Mathematics and technologies (but also companies for their private
>> corporate logos and branding marks) are constantly inventing new symbols
>> like this.
>>
>>
>> Le sam. 27 oct. 2018 à 22:01, James Kass via Unicode 
>> a écrit :
>>
>>>
>>> Mr͇ / M=ͬ
>>>
>>> An image search for "magister symbol" finds many interesting graphics,
>>> but I couldn't find any resembling the abreviation shown on the post
>>> card.  (Magister symbol appears to be popular for certain religious and
>>> gaming uses.)
>>>
>>>


Re: A sign/abbreviation for "magister"

2018-10-27 Thread Philippe Verdy via Unicode
More interesting: the Masonic alphabet
http://tallermasonico.com/0diccio1.htm

- 18 letters of the Latin alphabet (or Hebrew), from A to T (excluding J
and K), are arranged in groups of 2 letters in a 3x3 square grid, whose
global outer sides are not marked on the outer border of the grid but on
the lines separating columns or rows. Letters are then noted by the marked
sides of the square in which they are located, the second letter of the
group being distinguished by adding a dot in the middle of the square.
- The 4 other letters, U to Z (excluding V and W), are noted by disposing
them on a 2x2 square grid (this time rotated 45 degrees), whose global
outer sides are also not marked on the outer border of the grid but on the
lines separating columns or rows (only 1 letter is placed per cell).
They are also noted by the marked sides of their square only.
- Finally (if needed) the missing letters J, K, V, W use the same 4 last
glyphs, but are distinguished by adding the central dot.


   AB | CD | EF
 --+-+-
   GH | I L | MN
 --+-+-
   OP | QR | ST

 \  XK  /
 UJ  >  < WZ
 /  YV  \


So:
- "A" becomes approximately  "_|"
- "B" becomes approximately  "_|" with central dot
- "U" becomes approximately ">"
- "X" becomes approximately "\/"
- "J" is noted like "I" as a square, or distinctly approximately as ">"
with a central dot

The 3x3 grid had some esoteric meaning based on numerology (a legend now
propagated by Scientology).


Le dim. 28 oct. 2018 à 02:59, Philippe Verdy  a écrit :

> Do you speak about this one?
> https://www.magisterdaire.com/magister-symbol-black-sq/
> It looks like a graphic personal signature for the author of this esoteric
> book, even if it looks like an interesting composition of several of our
> existing Unicode symbols, glued together in a vertical ligature, rather
> than a pure combining sequence.
> Such technics can be used extensively to create lot of other symbols, by
> gluing any kind of wellknown glyphs for standard characters.
> Mathematics and technologies (but also companies for their private
> corporate logos and branding marks) are constantly inventing new symbols
> like this.
>
>
> Le sam. 27 oct. 2018 à 22:01, James Kass via Unicode 
> a écrit :
>
>>
>> Mr͇ / M=ͬ
>>
>> An image search for "magister symbol" finds many interesting graphics,
>> but I couldn't find any resembling the abreviation shown on the post
>> card.  (Magister symbol appears to be popular for certain religious and
>> gaming uses.)
>>
>>


Re: A sign/abbreviation for "magister"

2018-10-27 Thread Philippe Verdy via Unicode
Do you speak about this one?
https://www.magisterdaire.com/magister-symbol-black-sq/
It looks like a graphic personal signature for the author of this esoteric
book, even if it looks like an interesting composition of several of our
existing Unicode symbols, glued together in a vertical ligature, rather
than a pure combining sequence.
Such techniques can be used extensively to create lots of other symbols, by
gluing together any kind of well-known glyphs for standard characters.
Mathematics and technologies (but also companies, for their private
corporate logos and branding marks) are constantly inventing new symbols
like this.


Le sam. 27 oct. 2018 à 22:01, James Kass via Unicode 
a écrit :

>
> Mr͇ / M=ͬ
>
> An image search for "magister symbol" finds many interesting graphics,
> but I couldn't find any resembling the abreviation shown on the post
> card.  (Magister symbol appears to be popular for certain religious and
> gaming uses.)
>
>


Re: A sign/abbreviation for "magister"

2018-10-27 Thread Philippe Verdy via Unicode
Le sam. 27 oct. 2018 à 15:06, Asmus Freytag via Unicode 
a écrit :

> First question is: how do you interpret the symbol? For me it is
> definitely the capital M followed by the superscript "r" (written in an
> old style no longer used in Poland), but there is something below the
> superscript. It looks like a small "z", but such an interpretation
> doesn't make sense for me.
>
> My suspicion would be that the small "z" is rather a "=" that acquired a
> connecting stroke as part of quick handwriting.
>
I have the same kind of reading: the zigzagging stroke is a handwritten
emphasis of the superscript r above it (explicitly noting that it
terminates the abbreviation), just like the small underline that sometimes
appears below the superscript o in the abbreviation of "numero" (and
sometimes there was not just one but two small underlines, including in
some prints).

This sample is a perfect example of fast cursive handwriting (due to the
high variability of all the other letter shapes, sizes and joins, where
even the capital M is written as two unconnected strokes), and it's not
abnormal to see in such conditions a cursive join between the two
underlining strokes so that it looks like a single zigzag.


Re: Base64 encoding applied to different unicode texts always yields different base64 texts ... true or false?

2018-10-15 Thread Philippe Verdy via Unicode
it there.
>
>
>
> Are these part of any standards? Or are you claiming these are practices
> despite the standards? If so, are these just tolerated by parsers, or are
> they actually generated by encoders?
>
>
>
> What would be the rationale for supporting unnecessary whitespace? If
> linebreaks are forced at some line length they can presumably be removed at
> that length and not treated as part of the encoding.
>
> Maybe we differ on define where the encoding begins and ends, and where
> higher level protocols prescribe how they are embedded within the protocol.
>
>
>
> Tex
>
>
>
>
>
>
>
>
>
> *From:* Unicode [mailto:unicode-boun...@unicode.org] *On Behalf Of *Philippe
> Verdy via Unicode
> *Sent:* Sunday, October 14, 2018 1:41 AM
> *To:* Adam Borowski
> *Cc:* unicode Unicode Discussion
> *Subject:* Re: Base64 encoding applied to different unicode texts always
> yields different base64 texts ... true or false?
>
>
>
> Note that 1-byte pieces do not need to be padded by 2 = signs; only 1 is
> enough to indicate the end of an octets-span. The extra = after it do not
> add any other octet. and as well you're allowed to insert whitespaces
> anywhere in the encoded stream (this is what ensures that the
> Base64-encoded octets-stream will not be altered if line breaks are forced
> anywhere (notably within the body of emails).
>
>
>
> So yes, Base64 is highly flexible, because any whitespace (SPACE, TAB, CR,
> LF, NEL) in the middle is non-significant and ignorable on decoding (their
> "encoded" bit length is 0 and they don't terminate an octets-span, unlike
> "=" which discards extra bits remaining from the encoded stream before that
> are not on 8-bit boundaries).
>
>
>
> Also:
>
> - For 1-octets pieces the minimum format is "XX= ", but the 2nd "X" symbol
> before "=" can vary in its 4 lowest bits (which are then ignored/discarded
> by the "=" symbol)
>
> - For 2-octets pieces the minimum format is "XXX= ", but the 3rd "X"
> symbol before "=" can vary in its 2 lowest bits (which are then
> ignored/discarded by the "=" symbol)
>
>
>
> So you can use Base64 by encoding each octet in separate pieces, as one
> Base64 symbol followed by an "=" symbol, and even insert any number of
> whitespaces between them: there's a infinite number of valid Base64
> encodings for representing the same octets-stream payload.
>
>
>
> Base64 allows encoding any octets streams but not directly any
> bits-streams : it assumes that the effective bits-stream has a binary
> length multiple of 8. To encode a bits-stream with an exact number of bits
> (not multiple of 8), you need to encode an extra payload to indicate the
> effective number of bits to keep at end of the encoded octets-stream (or at
> start):
>
> - Base64 does not specify how you convert a bitstream of arbitrary length
> to an octets-stream;
>
> - for that purpose, you may need to pad the bits-stream at start or at end
> with 1 to 6 bits (so that it the resulting bitstream has a length multiple
> of 8, then encodable with Base64 which takes only octets on input).
>
> - these extra padding bits are not significant for the original bitstream,
> but are significant for the Base64 encoder/decoder, they will be discarded
> by the bitstream decoder built on top of the Base64 decoder, but not by the
> Base64 decoder itself.
>
>
>
> You need to encode somewhere with the bitstream encoder how many padding
> bits (0 to 7) are present at start or end of the octets-stream; this can be
> done:
>
> - as a separate payload (not encoded by Base64), or
>
> - by prepending 3 bits at start of the bits-stream then padded at end with
> 1 to 7 random bits to get a bit-length multiple of 8 suitable for Base64
> encoding.
>
> - by appending 3 bits at end of the  bits-stream, just after 1 to 7 random
> bits needed to get a bit-length multiple of 8 suitable for Base64 encoding.
>
> Finally your bits-stream decoder will be able to use this padding count to
> discard these random padding bits (and possibly realign the stream on
> different byte-boundaries when the effective bitlength bits-stream payload
> is not a multiple of 8 and padding bits were added)
>
>
>
> Base64 also does not specify how bits of the original bits-stream payload
> are packed into the octets-stream input suitable for Base64-encoding,
> notably it does not specify their order and endian-ness. The same remark
> applies as well for MIME, HTTP. So lot of network protocols and file
> formats need to how to properly encode which possible option is used to
> encode bi

Re: Base64 encoding applied to different unicode texts always yields different base64 texts ... true or false?

2018-10-15 Thread Philippe Verdy via Unicode
Note that all this discussion about padding applies to all other base-N
encodings, including base-10.

For example, to represent numbers of arbitrary precision: padding does not
require a separate symbol but can use the "0" digit, which is part of the
10-symbol alphabet, and encoders can discard leading zeros, or trailing
zeros if there's a decimal dot. When the precision is less than an integral
number of decimal digits, the extra bits or fractional bits of information
in the last digit of the encoded sequence do not matter; encoders may
choose not to set them to 0 but may prefer to use rounding, which may
conditionally set these bits to 1, depending on the value of the last
significant bits or fractional bits of maximum precision.

Similarly, encoders may want to insert extra whitespace (notably to limit
lines to arbitrary lengths, for embedding the encoded sequences in printed
documents or documents with a page layout rendered with a readable font
size suitable for the page width, or to group symbols for presentation
purposes).

In summary, padding is not required at all by base-N encoders/decoders,
and non-significant whitespace is frequently needed.
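
As a concrete illustration with Base64, a minimal Python sketch (the
standard base64 module and its default, non-strict decoder are assumed):

    import base64

    payload = "café".encode("utf-8")
    canonical = base64.b64encode(payload)              # b'Y2Fmw6k='

    # Whitespace inserted anywhere is non-significant for the decoder.
    assert base64.b64decode(b"Y2Fm\r\n w6k=") == payload

    # Padding can be dropped for transport and restored before decoding.
    unpadded = canonical.rstrip(b"=")                  # b'Y2Fmw6k'
    repadded = unpadded + b"=" * (-len(unpadded) % 4)
    assert base64.b64decode(repadded) == payload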


Le lun. 15 oct. 2018 à 13:57, Philippe Verdy  a écrit :

> If you want an example where padding with "=" is not used at all,
> - look into URL-shortening schemes
> - look into database fields or data input forms and numerous data formats
> where the "=" sign is restricted (just like in URLs and file paths, or in
> identifiers)
> Padding is not used anywhere in the middle of the binary encoding or even
> at end, only the 64 symbols of the encoding alphabet are needed and the
> extra 2 or 4 lowest bits that may be encoded in the last character of the
> encoded sequence are discarded by the decoder (these extra bits are not
> necessarily set to 0 by encoders in the last symbol, even if this is the
> canonical form recommanded in encoders, their value is simply ignored by
> decoders).
> Some Base64 encoders do not necessarily encode binary octets-streams, but
> bits-streams whose length in bits is not necessarily multiple of 8, in
> which case there may be 1 to 7 trailing bits (not just 2 or 4) in the last
> symbol of the encoded sequence.
> Other encoders use streams of binary code units that are larger than 8
> bits, and may want to encode more padding symbols to force the alignment of
> data required in their associated decoders, or will choose to not use any
> padding at all, letting the decoder discard the trailing bits themselves at
> end of the encoded stream.
>
> Le lun. 15 oct. 2018 à 13:24, Philippe Verdy  a
> écrit :
>
>> Also the rationale for supporting "unnecessary" whitespace is found in
>> MIME's version of Base64, also in RFCs describing encoding formats for
>> digital certificates, or for exchanging public keys in encryption
>> algorithms like PGP (notably, but not only, as texts in the body of emails
>> or in documentations and websites).
>>
>> Le lun. 15 oct. 2018 à 03:56, Tex  a écrit :
>>
>>> Philippe,
>>>
>>>
>>>
>>> Where is the use of whitespace or the idea that 1-byte pieces do not
>>> need all the equal sign paddings documented?
>>>
>>> I read the rfc 3501 you pointed at, I don’t see it there.
>>>
>>>
>>>
>>> Are these part of any standards? Or are you claiming these are practices
>>> despite the standards? If so, are these just tolerated by parsers, or are
>>> they actually generated by encoders?
>>>
>>>
>>>
>>> What would be the rationale for supporting unnecessary whitespace? If
>>> linebreaks are forced at some line length they can presumably be removed at
>>> that length and not treated as part of the encoding.
>>>
>>> Maybe we differ on define where the encoding begins and ends, and where
>>> higher level protocols prescribe how they are embedded within the protocol.
>>>
>>>
>>>
>>> Tex
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> *From:* Unicode [mailto:unicode-boun...@unicode.org] *On Behalf Of *Philippe
>>> Verdy via Unicode
>>> *Sent:* Sunday, October 14, 2018 1:41 AM
>>> *To:* Adam Borowski
>>> *Cc:* unicode Unicode Discussion
>>> *Subject:* Re: Base64 encoding applied to different unicode texts
>>> always yields different base64 texts ... true or false?
>>>
>>>
>>>
>>> Note that 1-byte pieces do not need to be padded by 2 = signs; only 1 is
>>> enough to indicate the end of an octets-span. The extra = after

Re: Base64 encoding applied to different unicode texts always yields different base64 texts ... true or false?

2018-10-15 Thread Philippe Verdy via Unicode
If you want an example where padding with "=" is not used at all:
- look into URL-shortening schemes;
- look into database fields or data input forms and numerous data formats
where the "=" sign is restricted (just like in URLs and file paths, or in
identifiers).
Padding is not used anywhere in the middle of the binary encoding or even
at the end; only the 64 symbols of the encoding alphabet are needed, and
the extra 2 or 4 lowest bits that may be encoded in the last character of
the encoded sequence are discarded by the decoder (these extra bits are not
necessarily set to 0 by encoders in the last symbol, even if this is the
canonical form recommended for encoders; their value is simply ignored by
decoders).
Some Base64 encoders do not necessarily encode binary octet streams, but
bit streams whose length in bits is not necessarily a multiple of 8, in
which case there may be 1 to 7 trailing bits (not just 2 or 4) in the last
symbol of the encoded sequence.
Other encoders use streams of binary code units that are larger than 8
bits, and may want to encode more padding symbols to force the alignment of
data required by their associated decoders, or will choose not to use any
padding at all, letting the decoder discard the trailing bits itself at the
end of the encoded stream.
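
A minimal Python sketch of two of these points (decoders ignoring the
unused trailing bits of the last symbol, and unpadded URL-safe encoding);
Python's default, non-strict base64 decoder is assumed:

    import base64

    # Canonical encoding of b"me" ...
    assert base64.b64encode(b"me") == b"bWU="

    # ... but 'V' differs from 'U' only in the 2 trailing bits that carry
    # no payload, and a tolerant decoder ignores them, so this decodes the
    # same way.
    assert base64.b64decode(b"bWV=") == b"me"

    # URL-safe alphabet with the '=' padding stripped, as commonly done in
    # URLs and tokens; padding is simply restored before decoding.
    token = base64.urlsafe_b64encode(b"me").rstrip(b"=")   # b'bWU'
    padded = token + b"=" * (-len(token) % 4)
    assert base64.urlsafe_b64decode(padded) == b"me"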

Le lun. 15 oct. 2018 à 13:24, Philippe Verdy  a écrit :

> Also the rationale for supporting "unnecessary" whitespace is found in
> MIME's version of Base64, also in RFCs describing encoding formats for
> digital certificates, or for exchanging public keys in encryption
> algorithms like PGP (notably, but not only, as texts in the body of emails
> or in documentations and websites).
>
> Le lun. 15 oct. 2018 à 03:56, Tex  a écrit :
>
>> Philippe,
>>
>>
>>
>> Where is the use of whitespace or the idea that 1-byte pieces do not need
>> all the equal sign paddings documented?
>>
>> I read the rfc 3501 you pointed at, I don’t see it there.
>>
>>
>>
>> Are these part of any standards? Or are you claiming these are practices
>> despite the standards? If so, are these just tolerated by parsers, or are
>> they actually generated by encoders?
>>
>>
>>
>> What would be the rationale for supporting unnecessary whitespace? If
>> linebreaks are forced at some line length they can presumably be removed at
>> that length and not treated as part of the encoding.
>>
>> Maybe we differ on define where the encoding begins and ends, and where
>> higher level protocols prescribe how they are embedded within the protocol.
>>
>>
>>
>> Tex
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> *From:* Unicode [mailto:unicode-boun...@unicode.org] *On Behalf Of *Philippe
>> Verdy via Unicode
>> *Sent:* Sunday, October 14, 2018 1:41 AM
>> *To:* Adam Borowski
>> *Cc:* unicode Unicode Discussion
>> *Subject:* Re: Base64 encoding applied to different unicode texts always
>> yields different base64 texts ... true or false?
>>
>>
>>
>> Note that 1-byte pieces do not need to be padded by 2 = signs; only 1 is
>> enough to indicate the end of an octets-span. The extra = after it do not
>> add any other octet. and as well you're allowed to insert whitespaces
>> anywhere in the encoded stream (this is what ensures that the
>> Base64-encoded octets-stream will not be altered if line breaks are forced
>> anywhere (notably within the body of emails).
>>
>>
>>
>> So yes, Base64 is highly flexible, because any whitespace (SPACE, TAB,
>> CR, LF, NEL) in the middle is non-significant and ignorable on decoding
>> (their "encoded" bit length is 0 and they don't terminate an octets-span,
>> unlike "=" which discards extra bits remaining from the encoded stream
>> before that are not on 8-bit boundaries).
>>
>>
>>
>> Also:
>>
>> - For 1-octets pieces the minimum format is "XX= ", but the 2nd "X"
>> symbol before "=" can vary in its 4 lowest bits (which are then
>> ignored/discarded by the "=" symbol)
>>
>> - For 2-octets pieces the minimum format is "XXX= ", but the 3rd "X"
>> symbol before "=" can vary in its 2 lowest bits (which are then
>> ignored/discarded by the "=" symbol)
>>
>>
>>
>> So you can use Base64 by encoding each octet in separate pieces, as one
>> Base64 symbol followed by an "=" symbol, and even insert any number of
>> whitespaces between them: there's a infinite number of valid Base64
>> encodings for representing the same octets-stream payload.
>>
>>
>&g

Re: Base64 encoding applied to different unicode texts always yields different base64 texts ... true or false?

2018-10-15 Thread Philippe Verdy via Unicode
Also, the rationale for supporting "unnecessary" whitespace is found in
MIME's version of Base64, as well as in RFCs describing encoding formats
for digital certificates, or for exchanging public keys in encryption
algorithms like PGP (notably, but not only, as text in the body of emails
or in documentation and websites).
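
For instance, with the MIME-oriented helpers in Python's standard library
(a small sketch; base64.encodebytes()/decodebytes() are assumed):

    import base64

    data = bytes(range(64))

    # MIME-style transfer encoding: the output is wrapped with newlines
    # (RFC 2045 limits encoded lines to 76 characters).
    wrapped = base64.encodebytes(data)
    assert b"\n" in wrapped

    # The embedded line breaks are transport artifacts, not payload; the
    # decoder ignores them and recovers the original octets exactly.
    assert base64.decodebytes(wrapped) == data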

Le lun. 15 oct. 2018 à 03:56, Tex  a écrit :

> Philippe,
>
>
>
> Where is the use of whitespace or the idea that 1-byte pieces do not need
> all the equal sign paddings documented?
>
> I read the rfc 3501 you pointed at, I don’t see it there.
>
>
>
> Are these part of any standards? Or are you claiming these are practices
> despite the standards? If so, are these just tolerated by parsers, or are
> they actually generated by encoders?
>
>
>
> What would be the rationale for supporting unnecessary whitespace? If
> linebreaks are forced at some line length they can presumably be removed at
> that length and not treated as part of the encoding.
>
> Maybe we differ on define where the encoding begins and ends, and where
> higher level protocols prescribe how they are embedded within the protocol.
>
>
>
> Tex
>
>
>
>
>
>
>
>
>
> *From:* Unicode [mailto:unicode-boun...@unicode.org] *On Behalf Of *Philippe
> Verdy via Unicode
> *Sent:* Sunday, October 14, 2018 1:41 AM
> *To:* Adam Borowski
> *Cc:* unicode Unicode Discussion
> *Subject:* Re: Base64 encoding applied to different unicode texts always
> yields different base64 texts ... true or false?
>
>
>
> Note that 1-byte pieces do not need to be padded by 2 = signs; only 1 is
> enough to indicate the end of an octets-span. The extra = after it do not
> add any other octet. and as well you're allowed to insert whitespaces
> anywhere in the encoded stream (this is what ensures that the
> Base64-encoded octets-stream will not be altered if line breaks are forced
> anywhere (notably within the body of emails).
>
>
>
> So yes, Base64 is highly flexible, because any whitespace (SPACE, TAB, CR,
> LF, NEL) in the middle is non-significant and ignorable on decoding (their
> "encoded" bit length is 0 and they don't terminate an octets-span, unlike
> "=" which discards extra bits remaining from the encoded stream before that
> are not on 8-bit boundaries).
>
>
>
> Also:
>
> - For 1-octets pieces the minimum format is "XX= ", but the 2nd "X" symbol
> before "=" can vary in its 4 lowest bits (which are then ignored/discarded
> by the "=" symbol)
>
> - For 2-octets pieces the minimum format is "XXX= ", but the 3rd "X"
> symbol before "=" can vary in its 2 lowest bits (which are then
> ignored/discarded by the "=" symbol)
>
>
>
> So you can use Base64 by encoding each octet in separate pieces, as one
> Base64 symbol followed by an "=" symbol, and even insert any number of
> whitespaces between them: there's a infinite number of valid Base64
> encodings for representing the same octets-stream payload.
>
>
>
> Base64 allows encoding any octets streams but not directly any
> bits-streams : it assumes that the effective bits-stream has a binary
> length multiple of 8. To encode a bits-stream with an exact number of bits
> (not multiple of 8), you need to encode an extra payload to indicate the
> effective number of bits to keep at end of the encoded octets-stream (or at
> start):
>
> - Base64 does not specify how you convert a bitstream of arbitrary length
> to an octets-stream;
>
> - for that purpose, you may need to pad the bits-stream at start or at end
> with 1 to 6 bits (so that it the resulting bitstream has a length multiple
> of 8, then encodable with Base64 which takes only octets on input).
>
> - these extra padding bits are not significant for the original bitstream,
> but are significant for the Base64 encoder/decoder, they will be discarded
> by the bitstream decoder built on top of the Base64 decoder, but not by the
> Base64 decoder itself.
>
>
>
> You need to encode somewhere with the bitstream encoder how many padding
> bits (0 to 7) are present at start or end of the octets-stream; this can be
> done:
>
> - as a separate payload (not encoded by Base64), or
>
> - by prepending 3 bits at start of the bits-stream then padded at end with
> 1 to 7 random bits to get a bit-length multiple of 8 suitable for Base64
> encoding.
>
> - by appending 3 bits at end of the  bits-stream, just after 1 to 7 random
> bits needed to get a bit-length multiple of 8 suitable for Base64 encoding.
>
> Finally your bits-stream decoder will be able to use this padding count to
> discard these random paddi

Re: Base64 encoding applied to different unicode texts always yields different base64 texts ... true or false?

2018-10-15 Thread Philippe Verdy via Unicode
Look into https://tools.ietf.org/html/rfc4648, section 3.2, paragraph 1,
1st sentence, where it is explicitly stated:

In some circumstances, the use of padding ("=") in base-encoded data
is not required or used.


Le lun. 15 oct. 2018 à 03:56, Tex  a écrit :

> Philippe,
>
>
>
> Where is the use of whitespace or the idea that 1-byte pieces do not need
> all the equal sign paddings documented?
>
> I read the rfc 3501 you pointed at, I don’t see it there.
>
>
>
> Are these part of any standards? Or are you claiming these are practices
> despite the standards? If so, are these just tolerated by parsers, or are
> they actually generated by encoders?
>
>
>
> What would be the rationale for supporting unnecessary whitespace? If
> linebreaks are forced at some line length they can presumably be removed at
> that length and not treated as part of the encoding.
>
> Maybe we differ on define where the encoding begins and ends, and where
> higher level protocols prescribe how they are embedded within the protocol.
>
>
>
> Tex
>
>
>
>
>
>
>
>
>
> *From:* Unicode [mailto:unicode-boun...@unicode.org] *On Behalf Of *Philippe
> Verdy via Unicode
> *Sent:* Sunday, October 14, 2018 1:41 AM
> *To:* Adam Borowski
> *Cc:* unicode Unicode Discussion
> *Subject:* Re: Base64 encoding applied to different unicode texts always
> yields different base64 texts ... true or false?
>
>
>
> Note that 1-byte pieces do not need to be padded by 2 = signs; only 1 is
> enough to indicate the end of an octets-span. The extra = after it do not
> add any other octet. and as well you're allowed to insert whitespaces
> anywhere in the encoded stream (this is what ensures that the
> Base64-encoded octets-stream will not be altered if line breaks are forced
> anywhere (notably within the body of emails).
>
>
>
> So yes, Base64 is highly flexible, because any whitespace (SPACE, TAB, CR,
> LF, NEL) in the middle is non-significant and ignorable on decoding (their
> "encoded" bit length is 0 and they don't terminate an octets-span, unlike
> "=" which discards extra bits remaining from the encoded stream before that
> are not on 8-bit boundaries).
>
>
>
> Also:
>
> - For 1-octets pieces the minimum format is "XX= ", but the 2nd "X" symbol
> before "=" can vary in its 4 lowest bits (which are then ignored/discarded
> by the "=" symbol)
>
> - For 2-octets pieces the minimum format is "XXX= ", but the 3rd "X"
> symbol before "=" can vary in its 2 lowest bits (which are then
> ignored/discarded by the "=" symbol)
>
>
>
> So you can use Base64 by encoding each octet in separate pieces, as one
> Base64 symbol followed by an "=" symbol, and even insert any number of
> whitespaces between them: there's a infinite number of valid Base64
> encodings for representing the same octets-stream payload.
>
>
>
> Base64 allows encoding any octets streams but not directly any
> bits-streams : it assumes that the effective bits-stream has a binary
> length multiple of 8. To encode a bits-stream with an exact number of bits
> (not multiple of 8), you need to encode an extra payload to indicate the
> effective number of bits to keep at end of the encoded octets-stream (or at
> start):
>
> - Base64 does not specify how you convert a bitstream of arbitrary length
> to an octets-stream;
>
> - for that purpose, you may need to pad the bits-stream at start or at end
> with 1 to 6 bits (so that it the resulting bitstream has a length multiple
> of 8, then encodable with Base64 which takes only octets on input).
>
> - these extra padding bits are not significant for the original bitstream,
> but are significant for the Base64 encoder/decoder, they will be discarded
> by the bitstream decoder built on top of the Base64 decoder, but not by the
> Base64 decoder itself.
>
>
>
> You need to encode somewhere with the bitstream encoder how many padding
> bits (0 to 7) are present at start or end of the octets-stream; this can be
> done:
>
> - as a separate payload (not encoded by Base64), or
>
> - by prepending 3 bits at start of the bits-stream then padded at end with
> 1 to 7 random bits to get a bit-length multiple of 8 suitable for Base64
> encoding.
>
> - by appending 3 bits at end of the  bits-stream, just after 1 to 7 random
> bits needed to get a bit-length multiple of 8 suitable for Base64 encoding.
>
> Finally your bits-stream decoder will be able to use this padding count to
> discard these random padding bits (and possibly realign the stream on
> different byte-boundaries when the effective bitlength bits-stream 

Re: Base64 encoding applied to different unicode texts always yields different base64 texts ... true or false?

2018-10-14 Thread Philippe Verdy via Unicode
Le dim. 14 oct. 2018 à 21:21, Doug Ewell via Unicode 
a écrit :

> Steffen Nurpmeso wrote:
>
> > Base64 is defined in RFC 2045 (Multipurpose Internet Mail Extensions
> > (MIME) Part One: Format of Internet Message Bodies).
>
> Base64 is defined in RFC 4648, "The Base16, Base32, and Base64 Data
> Encodings." RFC 2045 defines a particular implementation of base64,
> specific to transporting Internet mail in a 7-bit environment.
>

Wrong: this is "specific" to transporting Internet mail in any 7-bit or
8-bit environment (today almost all mail agents operate in 8-bit), and it
is then referenced directly by HTTP (and its HTTPS variant).

So this is not so "specific". MIME is extremely popular, while RFC 4648 is
extremely exotic (and RFC 4648 is wrong when saying that IMAP is very
specific, as it is now a very popular protocol, widely used as well). MIME
is so frequently used that almost all people refer to it when they look
for Base64, or do not explicitly state that another definition (found in
an exotic RFC) is being used.

Re: Base64 encoding applied to different unicode texts always yields different base64 texts ... true or false?

2018-10-14 Thread Philippe Verdy via Unicode
It's also interesting to look at https://tools.ietf.org/html/rfc3501
- which defines (for IMAP v4) another "BASE64" encoding,
- and also defines a "Modified UTF-7" encoding using it, deviating from
Unicode's definition of UTF-7 (see the sketch just below),
- and adds other requirements (which forbid alternate encodings
permitted in UTF-7 and all other Base64 variants, including those used in
MIME/RFC 2045 or SMTP, which are closely related to IMAP!).
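
As a small illustration of that deviation (Python's standard utf-7 codec
is assumed; there is no built-in codec for IMAP's modified form, so it is
shown as a literal):

    # Standard UTF-7 (RFC 2152), as implemented by Python's utf-7 codec:
    assert "Boîte".encode("utf-7") == b"Bo+AO4-te"

    # IMAP's "Modified UTF-7" (RFC 3501) shifts with '&' instead of '+' and
    # uses ',' instead of '/' in its BASE64 alphabet, so the same mailbox
    # name would be encoded as:
    imap_form = b"Bo&AO4-te"   # literal for comparison; no built-in codec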

And nothing in RFC 4648 is clear about the fact that it only covers the
encoding of "octet streams" and not "bit streams". It also does not
discuss the adaptation of Base64 for transport and storage (needed for
MIME and IMAP, but also in HTTP, and in several file/data formats
including XML, or digital signatures).

RFC 4648 is only superficial and does not cover everything (even
Unicode has its own definition for UTF-7 and also allows variations).

As we are on this Unicode list: the definition used by Unicode (more in
line with MIME) does not follow at all the one in RFC 4648.
Most uses of Base64 encodings are based on the original MIME definition,
and all of them perform new adaptations. (Even the definition of "Base16"
in RFC 4648 contradicts most other definitions.)


Le dim. 14 oct. 2018 à 21:21, Doug Ewell via Unicode 
a écrit :

> Steffen Nurpmeso wrote:
>
> > Base64 is defined in RFC 2045 (Multipurpose Internet Mail Extensions
> > (MIME) Part One: Format of Internet Message Bodies).
>
> Base64 is defined in RFC 4648, "The Base16, Base32, and Base64 Data
> Encodings." RFC 2045 defines a particular implementation of base64,
> specific to transporting Internet mail in a 7-bit environment.
>
> RFC 4648 discusses many of the "higher-level protocol" topics that some
> people are focusing on, such as separating the base64-encoded output
> into lines of length 72 (or other), alternative target code unit sets or
> "alphabets," and padding characters. It would be helpful for everyone to
> read this particular RFC before concluding that these topics have not
> been considered, or that they compromise round-tripping or other
> characteristics of base64.
>
> I had assumed that when Roger asked about "base64 encoding," he was
> asking about the basic definition of base64.
>
> --
> Doug Ewell | Thornton, CO, US | ewellic.org
>
>


Re: Base64 encoding applied to different unicode texts always yields different base64 texts ... true or false?

2018-10-14 Thread Philippe Verdy via Unicode
cify an extra layer for the bits-stream encoder/decoder.

But many other encodings are still possible (and can be conforming to
Unicode, provided they preserve each Unicode scalar value, or at least the
code point identity, because an encoder/decoder is not required to support
surrogate code points or noncharacters such as U+FFFE), where Base64 may
be used for internally generated octet streams.


Le dim. 14 oct. 2018 à 03:47, Adam Borowski via Unicode 
a écrit :

> On Sun, Oct 14, 2018 at 01:37:35AM +0200, Philippe Verdy via Unicode wrote:
> > Le sam. 13 oct. 2018 à 18:58, Steffen Nurpmeso via Unicode <
> > unicode@unicode.org> a écrit :
> > > The only variance is described as:
> > >
> > >   Care must be taken to use the proper octets for line breaks if base64
> > >   encoding is applied directly to text material that has not been
> > >   converted to canonical form.  In particular, text line breaks must be
> > >   converted into CRLF sequences prior to base64 encoding.  The
> > >   important thing to note is that this may be done directly by the
> > >   encoder rather than in a prior canonicalization step in some
> > >   implementations.
> > >
> > > This is MIME, it specifies (in the same RFC):
> >
> > I've not spoken aboutr the encoding of new lines **in the actual encoded
> > text**:
> > -  if their existing text-encoding ever gets converted to Base64 as if
> the
> > whole text was an opaque binary object, their initial text-encoding will
> be
> > preserved (so yes it will preserve the way these embedded newlines are
> > encoded as CR, LF, CR+LF, NL...)
> >
> > I spoke about newlines used in the transport syntax to split the initial
> > binary object (which may actually contain text but it does not matter).
> > MIME defines this operation and even requires splitting the binary object
> > in fragments with maximum binary size so that these binary fragments can
> be
> > converted with Base64 into lines with maximum length. In the MIME Base64
> > representation you can insert newlines anywhere between fragments encoded
> > separately.
>
> There's another kind of fragmentation that can make the encoding differ
> (but
> still decode to the same payload):
>
> The data stream gets split into 3-byte internal, 4-byte external packets.
> Any packet may contain less than those 3 bytes, in which cases it is padded
> with = characters:
> 3 bytes 
> 2 bytes XXX=
> 1 byte  XX==
>
> Usually, such smaller packets happen only at the end of a message, but to
> support encoding a stream piecewise, they are allowed at any point.
>
> For example:
> "meow" is bWVvdw==
> "me""ow"   is bWU=b3c=
> yet both carry the same payload.
>
> > Base64 is used exactly to support this flexibility in transport (or
> > storage) without altering any bit of the initial content once it is
> > decoded.
>
> Right, any such variations are in packaging only.
>
>
> ᛗᛖᛟᚹ
> --
> ⢀⣴⠾⠻⢶⣦⠀
> ⣾⠁⢰⠒⠀⣿⡁ 10 people enter a bar: 1 who understands binary,
> ⢿⡄⠘⠷⠚⠋⠀ 1 who doesn't, D who prefer to write it as hex,
> ⠈⠳⣄ and 1 who narrowly avoided an off-by-one error.
>
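To make that piecewise packaging concrete, here is a minimal sketch using
java.util.Base64; the remark about permissive decoders at the end is an
assumption about MIME-style decoders in general, not a feature of
java.util.Base64, whose strict decoder rejects interior padding.

```java
import java.util.Base64;

public class PiecewiseBase64 {
    public static void main(String[] args) {
        Base64.Encoder enc = Base64.getEncoder();

        // Encoding the whole payload at once:
        String whole = enc.encodeToString("meow".getBytes());

        // Encoding the same payload piecewise, each fragment padded separately:
        String piecewise = enc.encodeToString("me".getBytes())
                         + enc.encodeToString("ow".getBytes());

        System.out.println(whole);      // bWVvdw==
        System.out.println(piecewise);  // bWU=b3c=
        // Both texts carry the same payload, but only a decoder tolerant of
        // padding in the middle of the stream (as some permissive MIME-style
        // decoders are) recovers "meow" from the second form; the strict
        // java.util.Base64 decoder rejects it.
    }
}
```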


Re: Base64 encoding applied to different unicode texts always yields different base64 texts ... true or false?

2018-10-13 Thread Philippe Verdy via Unicode
On Sat, Oct 13, 2018 at 18:58, Steffen Nurpmeso via Unicode <
unicode@unicode.org> wrote:

> Philippe Verdy via Unicode wrote in  w9+jearw4ghyk...@mail.gmail.com>:
>  |You forget that Base64 (as used in MIME) does not follow these rules \
>  |as it allows multiple different encodings for the same source binary. \
>  |MIME actually
>  |splits a binary object into multiple fragments at random positions, \
>  |and then encodes these fragments separately. Also MIME uses an extension
> \
>  |of Base64
>  |where it allows some variations in the encoding alphabet (so even the \
>  |same fragment of the same length may have two disting encodings).
>  |
>  |Base64 in MIME is different from standard Base64 (which never splits \
>  |the binary object before encoding it, and uses a strict alphabet of \
>  |64 ASCII
>  |characters, allowing no variation). So MIME requires special handling: \
>  |the assumpton that a binary message is encoded the same is wrong, but \
>  |MIME still
>  |requires that this non unique Base64 encoding will be decoded back \
>  |to the same initial (unsplitted) binary object (independantly of its \
>  |size and
>  |independantly of the splitting boundaries used in the transport, which \
>  |may change during the transport).
>
> Base64 is defined in RFC 2045 (Multipurpose Internet Mail
> Extensions (MIME) Part One: Format of Internet Message Bodies).
> It is a content-transfer-encoding and encodes any data
> transparently into a 7 bit clean ASCII _and_ EBCDIC compatible
> (the authors commemorate that) text.
> When decoding it reverts this representation into its original form.
> Ok, there is the CRLF newline problem, as below.
> What do you mean by "splitting"?
>
> ...
> The only variance is described as:
>
>   Care must be taken to use the proper octets for line breaks if base64
>   encoding is applied directly to text material that has not been
>   converted to canonical form.  In particular, text line breaks must be
>   converted into CRLF sequences prior to base64 encoding.  The
>   important thing to note is that this may be done directly by the
>   encoder rather than in a prior canonicalization step in some
>   implementations.
>
> This is MIME, it specifies (in the same RFC):


I was not speaking about the encoding of newlines **in the actual encoded
text**:
- if an existing text encoding ever gets converted to Base64 as if the
whole text were an opaque binary object, the initial text encoding will be
preserved (so yes, it will preserve the way these embedded newlines are
encoded as CR, LF, CR+LF, NL...).

I spoke about the newlines used in the transport syntax to split the initial
binary object (which may actually contain text, but that does not matter).
MIME defines this operation and even requires splitting the binary object
into fragments of a maximum binary size, so that these binary fragments can
be converted with Base64 into lines of a maximum length. In the MIME Base64
representation you can insert newlines anywhere between fragments encoded
separately.

The maximum fragment size is not fixed (it is typically 57 binary octets,
converted to lines of at most 76 ASCII characters, the RFC 2045 limit, each
followed by a newline; CR+LF is strongly suggested for MIME, but other
newline sequences are tolerated). Email forwarding agents frequently needed
these line lengths to process the mail properly (not just the MIME headers
but the content body as well, where they want at least some whitespace or
newline in the middle so they can freely rearrange the lines, compressing
whitespace or splitting lines to shorter lengths as necessary for their
processing; this is much less frequent today because most mail agents are
8-bit clean and allow arbitrary line lengths... except in MIME headers).
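As an illustration of this transport adaptation, java.util.Base64 exposes
both forms: the MIME encoder wraps the output into lines of at most 76
characters separated by CR+LF, while the basic encoder emits a single
unwrapped line; both decode back to the same octets. A minimal sketch:

```java
import java.util.Arrays;
import java.util.Base64;

public class MimeWrapping {
    public static void main(String[] args) {
        byte[] payload = new byte[200];            // any opaque binary object
        Arrays.fill(payload, (byte) 0x42);

        // Basic encoding: one unwrapped line, strict 64-character alphabet.
        String basic = Base64.getEncoder().encodeToString(payload);

        // MIME encoding: same payload, split into lines of at most
        // 76 characters, each followed by CR+LF (the RFC 2045 limit).
        String mime = Base64.getMimeEncoder().encodeToString(payload);

        System.out.println(basic.contains("\r\n"));  // false
        System.out.println(mime.contains("\r\n"));   // true

        // The packaging differs, but both decode to the original octets.
        byte[] back = Base64.getMimeDecoder().decode(mime);
        System.out.println(Arrays.equals(back, payload));  // true
    }
}
```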

In MIME headers the situation is different: there really is a maximum line
length, and if a header is too long it has to be split over multiple lines
using continuation sequences, i.e. a newline (CR+LF is standard here)
followed by at least one space (this insertion/change/removal of whitespace
is permitted everywhere in the MIME header after the header type, and even
before the colon that follows the header type). So a MIME header value whose
text gets encoded with Base64 is split into encoded-words: each one starts
with "=?", followed by the charset, an indication that the fragment is
Base64-encoded ("B", as opposed to Quoted-Printable's "Q"), a separator, the
encapsulated Base64 encoding of the fragment, and a closing "?=". A single
header may contain multiple Base64-encoded fragments in the same header
value, and there is large freedom about where to split the value to isolate
fragments of a convenient size that satisfies the MIME requirements. These
multiple fragments may then occur on the same line (separated by whitespace)
or on multiple lines (separated by continuation sequences).
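For example, a header value containing non-ASCII text can be carried as two
RFC 2047 "B"-encoded words; a rough sketch of how such fragments are built
(the split point here is chosen arbitrarily, and a real mail library would
also enforce the 75-character limit per encoded-word):

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class EncodedWords {
    // Wraps one fragment of header text as an RFC 2047 Base64 encoded-word.
    static String encodedWord(String fragment) {
        String b64 = Base64.getEncoder()
                .encodeToString(fragment.getBytes(StandardCharsets.UTF_8));
        return "=?UTF-8?B?" + b64 + "?=";
    }

    public static void main(String[] args) {
        // One header value, split into two fragments at an arbitrary point;
        // the trailing space stays inside the first fragment because the
        // whitespace *between* adjacent encoded-words is ignored by parsers.
        String part1 = encodedWord("Réunion du ");
        String part2 = encodedWord("comité Unicode");

        // The two fragments may sit on one line or on continuation lines.
        System.out.println("Subject: " + part1 + "\r\n " + part2);
        // A conforming parser joins adjacent encoded-words and decodes each
        // one back to the same original text.
    }
}
```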

Re: Base64 encoding applied to different unicode texts always yields different base64 texts ... true or false?

2018-10-13 Thread Philippe Verdy via Unicode
In summary, two distinct implementations are allowed to return different
values t and t' of Base64_Encode(d) from the same message d, but
Base64_Decode(t') and Base64_Decode(t) will be equal and MUST return
d exactly.

There's an allowed choice of implementation for Base64_Encode() but
Base64_Decode() must then be updated to be permissive/flexible and ensure
that in all cases,
Base64_Decode[Base64_Encode[d]] = d, for every value of d.

The reverse is not true because of this flexibility (needed for various
transport protocols that have different requirements, notably on the
allowed set of characters, and on their maximum line lengths):
Base64_Encode[Base64_Decode[t]] = t may be false.
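A small sketch of that asymmetry with java.util.Base64, taking the MIME
variant as the "flexible" transport form:

```java
import java.util.Arrays;
import java.util.Base64;

public class RoundTrip {
    public static void main(String[] args) {
        byte[] d = new byte[120];
        Arrays.fill(d, (byte) 0x2A);

        // Two different conforming encodings t and t' of the same payload d:
        String t      = Base64.getEncoder().encodeToString(d);      // unwrapped
        String tPrime = Base64.getMimeEncoder().encodeToString(d);  // 76-char lines

        // Base64_Decode[Base64_Encode[d]] = d holds for both forms,
        // provided the decoder is permissive enough (here the MIME decoder):
        byte[] back  = Base64.getMimeDecoder().decode(t);
        byte[] back2 = Base64.getMimeDecoder().decode(tPrime);
        System.out.println(Arrays.equals(back, d) && Arrays.equals(back2, d)); // true

        // Base64_Encode[Base64_Decode[t']] = t' does not hold in general:
        // re-encoding with a different (but equally valid) encoder changes
        // only the packaging, not the payload.
        String reEncoded = Base64.getEncoder().encodeToString(back2);
        System.out.println(reEncoded.equals(tPrime));  // false
    }
}
```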


On Sat, Oct 13, 2018 at 16:45, Philippe Verdy wrote:

> You forget that Base64 (as used in MIME) does not follow these rules as it
> allows multiple different encodings for the same source binary. MIME
> actually splits a binary object into multiple fragments at random
> positions, and then encodes these fragments separately. Also MIME uses an
> extension of Base64 where it allows some variations in the encoding
> alphabet (so even the same fragment of the same length may have two disting
> encodings).
>
> Base64 in MIME is different from standard Base64 (which never splits the
> binary object before encoding it, and uses a strict alphabet of 64 ASCII
> characters, allowing no variation). So MIME requires special handling: the
> assumpton that a binary message is encoded the same is wrong, but MIME
> still requires that this non unique Base64 encoding will be decoded back to
> the same initial (unsplitted) binary object (independantly of its size and
> independantly of the splitting boundaries used in the transport, which may
> change during the transport).
>
> This also applies to the Base64 encoding used in HTTP transport syntax,
> and notably in the HTTP/1.1 streaming feature where fragment sizes are also
> variable.
>
>
> On Sat, Oct 13, 2018 at 16:27, Costello, Roger L. via Unicode <
> unicode@unicode.org> wrote:
>
>> Hi Folks,
>>
>> Thank you for your outstanding responses!
>>
>> Below is a summary of what I learned. Are there any errors in the
>> summary? Is there anything you would add? Please let me know of anything
>> that is not clear.   /Roger
>>
>> 1. While base64 encoding is usually applied to binary, it is also
>> sometimes applied to text, such as Unicode text.
>>
>> Note: Since base64 encoding may be applied to both binary and text, in
>> the following bullets I use the more generic term "data". For example,
>> "Data d is base64-encoded to yield ..."
>>
>> 2. Neither base64 encoding nor decoding should presume any special
>> knowledge of the meaning of the data or do anything extra based on that
>> presumption.
>>
>> For example, converting Unicode text to and from base64 should not
>> perform any sort of Unicode normalization, convert between UTFs, insert or
>> remove BOMs, etc. This is like saying that converting a JPEG image to and
>> from base64 should not resize or rescale the image, change its color depth,
>> convert it to another graphic format, etc.
>>
>> If you use base64 for encoding MIME content (e.g. emails), the base64
>> decoding will not transform the content. The email parser must ensure that
>> the content is valid, so the parser might have to transform the content
>> (possibly replacing some invalid sequences or truncating), and then apply
>> Unicode normalization to render the text. These transforms are part of the
>> MIME application and are independent of whether you use base64 or any
>> another encoding or transport syntax.
>>
>> 3. If data d is different than d', then the base64 text resulting from
>> encoding d is different than the base64 text resulting from encoding d'.
>>
>> 4. If base64 text t is different than t', then the data resulting from
>> decoding t is different than the data resulting from decoding t'.
>>
>> 5. For every data d there is exactly one base64 encoding t.
>>
>> 6. Every base64 text t is an encoding of exactly one data d.
>>
>> 7. For all data d, Base64_Decode[Base64_Encode[d]] = d
>>
>>


Re: Base64 encoding applied to different unicode texts always yields different base64 texts ... true or false?

2018-10-13 Thread Philippe Verdy via Unicode
You forget that Base64 (as used in MIME) does not follow these rules, as it
allows multiple different encodings for the same source binary. MIME
actually splits a binary object into multiple fragments at arbitrary
positions, and then encodes these fragments separately. Also, MIME uses an
extension of Base64 that allows some variations in the encoding alphabet
(so even the same fragment of the same length may have two distinct
encodings).

Base64 in MIME is different from standard Base64 (which never splits the
binary object before encoding it, and uses a strict alphabet of 64 ASCII
characters, allowing no variation). So MIME requires special handling: the
assumption that a given binary message is always encoded the same way is
wrong, but MIME still requires that this non-unique Base64 encoding be
decoded back to the same initial (unsplit) binary object (independently of
its size and independently of the splitting boundaries used in the
transport, which may change during the transport).

This also applies to the Base64 encoding used in HTTP transport syntax, and
notably to the HTTP/1.1 chunked streaming feature, where fragment sizes are
also variable.


On Sat, Oct 13, 2018 at 16:27, Costello, Roger L. via Unicode <
unicode@unicode.org> wrote:

> Hi Folks,
>
> Thank you for your outstanding responses!
>
> Below is a summary of what I learned. Are there any errors in the summary?
> Is there anything you would add? Please let me know of anything that is not
> clear.   /Roger
>
> 1. While base64 encoding is usually applied to binary, it is also
> sometimes applied to text, such as Unicode text.
>
> Note: Since base64 encoding may be applied to both binary and text, in the
> following bullets I use the more generic term "data". For example, "Data d
> is base64-encoded to yield ..."
>
> 2. Neither base64 encoding nor decoding should presume any special
> knowledge of the meaning of the data or do anything extra based on that
> presumption.
>
> For example, converting Unicode text to and from base64 should not perform
> any sort of Unicode normalization, convert between UTFs, insert or remove
> BOMs, etc. This is like saying that converting a JPEG image to and from
> base64 should not resize or rescale the image, change its color depth,
> convert it to another graphic format, etc.
>
> If you use base64 for encoding MIME content (e.g. emails), the base64
> decoding will not transform the content. The email parser must ensure that
> the content is valid, so the parser might have to transform the content
> (possibly replacing some invalid sequences or truncating), and then apply
> Unicode normalization to render the text. These transforms are part of the
> MIME application and are independent of whether you use base64 or any
> another encoding or transport syntax.
>
> 3. If data d is different than d', then the base64 text resulting from
> encoding d is different than the base64 text resulting from encoding d'.
>
> 4. If base64 text t is different than t', then the data resulting from
> decoding t is different than the data resulting from decoding t'.
>
> 5. For every data d there is exactly one base64 encoding t.
>
> 6. Every base64 text t is an encoding of exactly one data d.
>
> 7. For all data d, Base64_Decode[Base64_Encode[d]] = d
>
>


Re: Base64 encoding applied to different unicode texts always yields different base64 texts ... true or false?

2018-10-12 Thread Philippe Verdy via Unicode
I think the reverse is also true!

Decoding a Base64 entity does not guarantee that it will return valid text
in any known encoding, so Unicode normalization of the output cannot apply.

Even if it represents text, nothing indicates that the result will be
encoded with some Unicode encoding form (unless this is tagged separately,
like in MIME).

If you use Base64 for decoding MIME contents (e.g. for emails), the Base64
decoding itself will not transform the encoding, but then the email parser
will have to ensure that the text encoding is valid, at which point it may
have to transform it (possibly replacing some invalid sequences or
truncating it), and only then may it apply normalization to help render that
text. But these transforms are part of the MIME application and are
independent of whether you used Base64 or any other binary encoding or
transport syntax.

In other words: "If m is not equal to m', then t will not equal t'" is
reversible, but nothing indicates that m or m' Base64-decoded are texts;
they are just opaque binary objects, which are still equal in value, just
like their Base64 encodings t and t'.

Note: some Base64 envelope formats (like MIME) allow multiple
representations t and t' of the same message m, by adding padding or
transport syntaxes like line-splitting (with variable lengths). Base64 alone
does not allow that variation (it normally uses a static alphabet), but
there are variants whose decoders accept extended alphabets as binary
equivalents. So you may have two MIME-encoded texts that have different
encodings (with Base64 or Quoted-Printable, with variable line lengths)
but that represent the same source binary object, and decoding these
different encoded messages will yield the same binary object: this does not
depend on Base64 but on the permissiveness/flexibility of the decoders for
these envelope formats (using **extensions** of Base64 specific to the
envelope format).


On Fri, Oct 12, 2018 at 18:27, Doug Ewell via Unicode wrote:

> J Decker wrote:
>
> >> How about the opposite direction: If m is base64 encoded to yield t
> >> and then t is base64 decoded to yield n, will it always be the case
> >> that m equals n?
> >
> > False.
> > Canonical translation may occur which the different base64 may be the
> > same sort of string...
>
> Base64 is a binary-to-text encoding. Neither encoding nor decoding
> should presume any special knowledge of the meaning of the binary data,
> or do anything extra based on that presumption.
>
> Converting Unicode text to and from base64 should not perform any sort
> of Unicode normalization, convert between UTFs, insert or remove BOMs,
> etc. This is like saying that converting a JPEG image to and from base64
> should not resize or rescale the image, change its color depth, convert
> it to another graphic format, etc.
>
> So I'd say "true" to Roger's question.
>
> I touched on this a little bit in UTN #14, from the standpoint of trying
> to improve compression by normalizing the Unicode text first.
>
> --
> Doug Ewell | Thornton, CO, US | ewellic.org
>
>


Re: Dealing with Georgian capitalization in programming languages

2018-10-02 Thread Philippe Verdy via Unicode
I see no easy way to convert ALL-UPPERCASE text with consistent casing, as
there is no rule, except by using dictionary lookups.
In reality data should be input using default casing (as in dictionary
entries), independently of its position in sentences, paragraphs or titles,
with the contextual conversion of some or all characters to uppercase being
done algorithmically (this is safe for conversion to ALL UPPERCASE, and
quite reliable for conversion to Title Case, with just a few dictionary
lookups for a small set of known words per language).

Note that title casing works differently in English (which most often
abuses it by putting capitals on every word), while most other languages
capitalize only selected words, or just the first selected word in French
(in addition to the possible first letter of non-selected words such as
definite and indefinite articles at the start of the sentence).
Capitalizing the initial of every word is wrong in German, which uses
capitalization even more strictly than French or Italian: when in doubt, do
not perform any titlecasing, and allow the data to provide the actual
capitalization of titles directly (it is OK, and even recommended, in German
to have section headings, or even book titles, written as if they were in
the middle of sentences, and to capitalize only titles and headings that are
grammatically full sentences, but not simple nominal groups).

So title casing should not even be promoted by the UCD standard (where it is
in fact defined only by very basic, simplistic rules); it is applicable only
in some applications, for some languages, and in specific technical or
rendering contexts.



On Tue, Oct 2, 2018 at 22:21, Markus Scherer via Unicode wrote:

> On Tue, Oct 2, 2018 at 12:50 AM Martin J. Dürst via Unicode <
> unicode@unicode.org> wrote:
>
>> ... The only
>> operation that can cause problems is 'capitalize'.
>>
>> When I say "cause problems", I mean producing mixed-case output. I
>> originally thought that 'capitalize' would be fine. It is fine for
>> lowercase input: I stays lowercase because Unicode Data indicates that
>> titlecase for lowercase Georgian letters is the letter itself. But it
>> will produce the apparently undesirable Mixed Case for ALL UPPERCASE
>> input.
>>
>> My questions here are:
>> - Has this been considered when Georgian Mtavruli was discussed in the
>>UTC?
>> - How have any other implementers (ICU,...) addressed this, in
>>particular the operation that's called 'capitalize' in Ruby?
>>
>
> By default, ICU toTitle() functions titlecase at word boundaries (with
> adjustment) and lowercase all else.
> That is, we implement Unicode chapter 3.13 Default Case Conversions R3
> toTitlecase(x), except that we modified the default boundary adjustment.
>
> You can customize the boundaries (e.g., only the start of the string).
> We have options for whether and how to adjust the boundaries (e.g., adjust
> to the next cased letter) and for copying, not lowercasing, the other
> characters.
> See C++ and Java class CaseMap and the relevant options.
>
> markus
>
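For reference, a minimal ICU4J sketch of the behaviour Markus describes,
assuming the com.ibm.icu.text.CaseMap API (passing null for the break
iterator selects the default word boundaries; noLowercase() is one of the
options for copying, rather than lowercasing, the non-initial characters):

```java
import java.util.Locale;
import com.ibm.icu.text.CaseMap;

public class TitleCasing {
    public static void main(String[] args) {
        String allCaps = "ALL UPPERCASE INPUT";

        // Default: titlecase at word boundaries, lowercase everything else.
        String titled = CaseMap.toTitle()
                .apply(Locale.ENGLISH, null, allCaps);   // "All Uppercase Input"

        // Option: titlecase the word initials but leave the other characters
        // as they are, which keeps ALL-UPPERCASE input unchanged.
        String copied = CaseMap.toTitle().noLowercase()
                .apply(Locale.ENGLISH, null, allCaps);   // "ALL UPPERCASE INPUT"

        System.out.println(titled);
        System.out.println(copied);
    }
}
```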


Re: Shortcuts question

2018-09-17 Thread Philippe Verdy via Unicode
Note: CLDR concentrates on keyboard layouts for text input. Layouts for
other functions (such as copy-pasting or gaming controls) are completely
different (and not necessarily bound directly to the layouts for text, as
they may also have their own dedicated physical keys, or users can reprogram
their keyboard for this; for gaming, software should always have a way to
customize the layout according to the user's needs, and should provide
reasonable defaults for at least the 3 base layouts: QWERTY, AZERTY and
QWERTZ, but I've never seen any game whose UI was tuned for Dvorak).

On Mon, Sep 17, 2018 at 16:42, Marcel Schneider wrote:

> On 17/09/18 05:38 Martin J. Dürst wrote:
> [quote]
> >
> > From my personal experience: A few years ago, installing a Dvorak
> > keyboard (which is what I use every day for typing) didn't remap the
> > control keys, so that Ctrl-C was still on the bottom row of the left
> > hand, and so on. For me, it was really terrible.
> >
> > It may not be the same for everybody, but my experience suggests that it
> > may be similar for some others, and that therefore such a mapping should
> > only be voluntary, not default.
>
> Got it, thanks!
>
> Regards,
>
> Marcel
>


Re: Shortcuts question

2018-09-16 Thread Philippe Verdy via Unicode
For games, the mnemonic meaning of keys is unlikely to be used, because
gamers prefer an ergonomic placement of their fingers according to the
physical positions of the essential commands.
But this won't apply to control keys, as these commands should be single
keystrokes: pressing two keys instead of one would be impractical and a
disadvantage when playing.

That's why the four most common direction keys W/A/S/D on a QWERTY layout
become Z/Q/S/D on a French AZERTY layout. Games that use logical key layouts
based on QWERTY are almost unplayable if there is no interface to customize
these 4 keys. So games preferably use the virtual keys for these commands,
or include built-in layouts adapted to AZERTY- and QWERTZ-based layouts
while still displaying the correct keycaps in the UI: games normally don't
force a switch to a US layout, so they still need to use the logical layout,
simply because they also need to allow users to input real text and not just
gaming commands (for messaging, for naming custom players/objects created in
the game itself, for filling in user profiles, or for entering a
registration email or performing an online logon with the correct password),
in which case they will also need to support characters entered with control
keys (AltGr, Shift, Control...), or with a standard on-screen touch panel
which will still display the common localized layouts.

There are difficulties in games when some of their commands are mapped to
something other than basic Latin letters (including decimal digits: on a
French AZERTY keyboard, the digits are typed by pressing Shift, or in
ShiftLock mode; there is no CapsLock mode, since ShiftLock is released by
pressing Shift. Just as on old French mechanical typewriters, pressing
ShiftLock again does not release it, and ShiftLock applies to all keys on
the keyboard, including punctuation keys. On PC keyboards, ShiftLock does
not apply to the numeric pad, which has its separate NumLock, now largely
redundant; most users would like to disable it completely whenever there is
a numeric pad separate from the directional pad. On these extended keyboards
NumLock is just a nuisance, notably on the OS logon screen, where Windows
turns it off by default unless the BIOS locks it at boot time, and many
BIOSes don't do that or don't have an option to set it permanently).



On Sun, Sep 16, 2018 at 14:18, Marcel Schneider via Unicode <
unicode@unicode.org> wrote:

> On 15/09/18 15:36, Philippe Verdy wrote:
> […]
> > So yes all control keys are potentially localisable to work best with
> the base layout anre remaining mnemonic;
> > but the physical key position may be very different.
>
> An additional level of complexity is induced by ergonomics. so that most
> non-Latin layouts may wish to stick
> with QWERTY, and even ergonomic layouts in the footprints of August Dvorak
> rather than Shai Coleman are
> likely to offer variants with legacy Virtual Key mapping instead of
> staying in congruency with graphics optimized
> for text input. But again that is easier on Windows, where VKs are
> remapped separately, than on Linux that
> appears to use graphics throughout to process application shortcuts, and
> only modifiers can be "preserved" for
> further processing, no underlying letter map that AFAIU appears not to
> exist on Linux.
>
> However, about keyboarding, that may be technically too detailed for this
> List, so that I’ll step out of this thread
> here. Please follow up in parallel thread on CLDR-users instead.
>
> https://unicode.org/pipermail/cldr-users/2018-September/000837.html
>
> Thanks,
>
> Marcel
>
>
>


Re: Shortcuts question

2018-09-15 Thread Philippe Verdy via Unicode
On Fri, Sep 7, 2018 at 05:43, Marcel Schneider via Unicode <
unicode@unicode.org> wrote:

> On 07/09/18 02:32 Shriramana Sharma via Unicode wrote:
> >
> > Hello. This may be slightly OT for this list but I'm asking it here as
> it concerns computer usage with multiple scripts and i18n:
>
> It actually belongs on CLDR-users list. But coming from you, it shall
> remain here while I’m posting a quick answer below.
>
> > 1) Are shortcuts like Ctrl+C changed as per locale? I mean Ctrl+T for
> "tout" io Ctrl+A for "all"?
>
> No, Ctrl+A remains Ctrl+A on a French keyboard.
>

Yes, but its location on the keyboard maps to the same physical key as
Ctrl+Q on a QWERTY layout: Ctrl+ASCII-letter shortcuts are mapped according
to the layout of the letter (without pressing Ctrl) on the localized
keyboard. Some keyboard layouts don't have all the basic Latin letters
because their language doesn't need them (e.g. it may have only one of Q or
K but no C, or it may have no W, or some letters may carry combining
diacritics or be ligatures), but usually the basic Latin letter is still
accessible by pressing another control key or by switching the layout mode.

On non-Latin keyboard layouts there is much more freedom, and Ctrl+A may be
localized according to the main base letter assigned to the key (the
position of the Latin letter is not always visible).

On touch layouts you cannot guess where Ctrl+Latin-letter is located;
actually it may be accessible very differently, on a separate layout for
controls, where the labels will be translated: the Ctrl key is not
necessarily present, usually replaced by a single key for input-mode
selection (which may switch languages, or to emojis, or to
symbols/punctuation/digits)...

The problematic control keys are those like "Ctrl+[" (assuming ASCII as the
base layout) where "[" is not present or is mapped very differently.
Likewise, "Ctrl+1"..."Ctrl+0" may conflict with the assignment of ASCII
controls like "Ctrl+[".

So yes, all control-key shortcuts are potentially localizable to work best
with the base layout and remain mnemonic; but the physical key position may
be very different.


Re: Unicode String Models

2018-09-11 Thread Philippe Verdy via Unicode
No, 0xF8..0xFF are not used at all in UTF-8; but U+00F8..U+00FF really
**do** have UTF-8 encodings (using two bytes).

The only safe way to represent arbitrary bytes within strings when they are
not valid UTF-8 is to use invalid UTF-8 sequences, i.e. by using a
"UTF-8-like" private extension of UTF-8 (that extension is still not UTF-8!).

This is what Java does to represent U+0000 by (0xC0,0x80) in compiled
bytecode, or via the C/C++ interface for JNI when converting the Java string
buffer into a C/C++ string terminated by a NULL byte (not part of the Java
string content itself). That special sequence, however, is really exposed in
the Java API as a true unsigned 16-bit code unit (char) with value 0x0000,
a valid single code point.
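This can be observed directly with java.io.DataOutputStream.writeUTF(),
which emits Java's "modified UTF-8"; a minimal sketch:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class ModifiedUtf8 {
    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        new DataOutputStream(buf).writeUTF("\u0000");   // a single U+0000 char

        // The first two bytes are the big-endian length prefix (00 02);
        // the next two are the overlong pair C0 80 used instead of a raw NUL.
        for (byte b : buf.toByteArray()) {
            System.out.printf("%02X ", b);              // 00 02 C0 80
        }
        System.out.println();
    }
}
```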

The same can be done for re-encoding each invalid byte in non-UTF-8
conforming texts using sequences in a "UTF-8-like" scheme (still compatible
with plain UTF-8 for every valid UTF-8 text). You may either:
  * (a) encode each invalid byte separately (using two bytes for each), or
encode them in groups of 3 bits (represented using bytes 0xF8..0xFF), then
needing 3 bytes in the encoding; or
  * (b) encode a private starter (e.g. 0xFF), followed by a byte giving the
length of the raw byte sequence that follows, and then the raw byte sequence
of that length without any re-encoding: this will never be confused with
other valid code points (however, this scheme may no longer be directly
indexable from arbitrary random positions, unlike scheme (a), which may be
marginally longer).
Both schemes (a) and (b) would be useful in editors allowing arbitrary
binary files to be edited as if they were plain text, even if they contain
null bytes or invalid UTF-8 sequences (it is up to these editors to find a
way to represent these bytes distinctively, and a way to enter/change them
reliably).
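A rough sketch of scheme (b), with hypothetical helper names (0xFF as the
private starter, one length byte, then the raw bytes verbatim):

```java
import java.io.ByteArrayOutputStream;

public class SchemeB {
    // Wraps a run of bytes that failed UTF-8 validation into a private,
    // never-valid-UTF-8 escape sequence: 0xFF, length, raw bytes.
    static byte[] escapeInvalidRun(byte[] invalidRun) {
        if (invalidRun.length > 255) {
            throw new IllegalArgumentException("split runs longer than 255 bytes");
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(0xFF);                 // private starter, never emitted by UTF-8
        out.write(invalidRun.length);    // length of the raw run
        out.write(invalidRun, 0, invalidRun.length);
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] bogus = { (byte) 0xC0, (byte) 0xAF };    // an invalid UTF-8 pair
        for (byte b : escapeInvalidRun(bogus)) {
            System.out.printf("%02X ", b);              // FF 02 C0 AF
        }
        System.out.println();
        // As noted above, this form is compact but not randomly indexable:
        // a scanner must read the length byte to skip over the raw run.
    }
}
```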

There's also a possible extension if the backing store uses UTF-16, where
all code units 0x0000..0xFFFF are used; one scheme is to use unpaired
surrogates (notably a low surrogate NOT preceded by a high surrogate: the
low surrogate already has 10 useful bits that can store any raw byte value
in its lowest bits). This scheme allows indexing from random positions and
reliable sequential traversal in both directions (backward or forward)...

... But the presence of such an extension of UTF-16 means that all the
implementation code handling standard text has to detect unpaired
surrogates, and can no longer assume that a low surrogate necessarily has a
high surrogate encoded just before it: this must be tested, and that
previous position may be before the buffer start, causing a possible buffer
overrun in the backward direction (so the code will also need to know the
start position of the buffer and check it, or know the index, which cannot
be negative), possibly exposing unrelated data and causing some security
risks, unless the backing store always adds a leading "guard" code unit set
arbitrarily to 0x0000.
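A small sketch of that UTF-16 escaping idea (hypothetical helpers; each raw
byte is carried in the low bits of an unpaired low surrogate):

```java
public class SurrogateEscape {
    // Stores one raw byte as an unpaired low (trail) surrogate: 0xDC00 + byte.
    static char escapeByte(int rawByte) {
        return (char) (0xDC00 | (rawByte & 0xFF));
    }

    // Recovers the raw byte, after checking the unit really is such an escape.
    static int unescapeByte(char unit) {
        if (unit < 0xDC00 || unit > 0xDCFF) {
            throw new IllegalArgumentException("not an escaped raw byte");
        }
        return unit & 0xFF;
    }

    public static void main(String[] args) {
        int raw = 0xE9;                     // some byte from a non-UTF-8 file
        char escaped = escapeByte(raw);     // U+DCE9, an unpaired low surrogate
        System.out.printf("U+%04X -> 0x%02X%n", (int) escaped, unescapeByte(escaped));
        // Code walking such a buffer must check for a preceding high surrogate
        // (and for the buffer start) before treating a low surrogate as an escape.
    }
}
```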





On Wed, Sep 12, 2018 at 00:48, J Decker via Unicode wrote:

>
>
> On Tue, Sep 11, 2018 at 3:15 PM Hans Åberg via Unicode <
> unicode@unicode.org> wrote:
>
>>
>> > On 11 Sep 2018, at 23:48, Richard Wordingham via Unicode <
>> unicode@unicode.org> wrote:
>> >
>> > On Tue, 11 Sep 2018 21:10:03 +0200
>> > Hans Åberg via Unicode  wrote:
>> >
>> >> Indeed, before UTF-8, in the 1990s, I recall some Russians using
>> >> LaTeX files with sections in different Cyrillic and Latin encodings,
>> >> changing the editor encoding while typing.
>> >
>> > Rather like some of the old Unicode list archives, which are just
>> > concatenations of a month's emails, with all sorts of 8-bit encodings
>> > and stretches of base64.
>>
>> It might be useful to represent non-UTF-8 bytes as Unicode code points.
>> One way might be to use a codepoint to indicate high bit set followed by
>> the byte value with its high bit set to 0, that is, truncated into the
>> ASCII range. For example, U+0080 looks like it is not in use, though I
>> could not verify this.
>>
>>
> it's used for character 0x400.   0xD0 0x80   or 0x8000   0xE8 0x80 0x80
> (I'm probably off a bit in the leading byte)
> UTF-8 can represent every value from 0 to 0x10FFFF (which is all defined
> codepoints); early variants can support up to U+7FFFFFFF...
> and there's enough bits to carry the pattern forward to support 36 bits or
> 42 bits... (the last one breaking the standard a bit by allowing a byte
> without one bit off... 0xFF would be the lead-in)
>
> 0xF8-FF are unused byte values; but those can all be encoded into utf-8.
>

