RE: PUA (BMP) planned characters HTML tables

2019-08-21 Thread Doug Ewell via Unicode
On August 11, I replied to Robert Wheelock:
 
>> I remember a website that has tables for certain PUA precomposed
>> accented characters that aren’t yet in Unicode (things like:
>> Marshallese M/m-cedilla, H/h-acute, capital T-dieresis, capital H-
>> underbar, acute accented Cyrillic vowels, Cyrillic ER/er-caron, ...).
>
> If you are thinking of these as potential future additions to the
> standard, keep in mind that accented letters that can already be
> represented by a combination of letter + accent will not ever be
> encoded. This is one of the longest-standing principles Unicode has.
 
I missed the possible significance of the Latvian comma below vs.
Marshallese cedilla, which captured most of the ensuing discussion and
morphed into a discussion about different user communities and group
identity.
 
I'd like to restate, since I think the point may have been lost, that
for the OTHER characters Robert mentioned:
 
> H/h-acute, capital T-dieresis, capital H-underbar, acute accented
> Cyrillic vowels, Cyrillic ER/er-caron, ...
 
there does not appear to be any conflicting usage between different user
communities, and no particular difficulty in rendering or otherwise
processing these as combining sequences, using up-to-date fonts and
rendering engines. I suppose Philippe's example of Võro might factor
into whether different groups prefer different appearances for h́, but
otherwise these user-perceived characters seem to be non-controversial.
 
So to reiterate, these characters appear vanishingly unlikely to be
atomically encoded, "yet" or ever, for good reason.
 
--
Doug Ewell | Thornton, CO, US | ewellic.org




Re: PUA (BMP) planned characters HTML tables

2019-08-15 Thread Asmus Freytag via Unicode

  
  
On 8/14/2019 7:49 PM, James Kass via Unicode wrote:

> On 2019-08-15 12:25 AM, Asmus Freytag via Unicode wrote:
>
>> Empirically, it has been observed that some distinctions that are
>> claimed by users, standards developers or implementers were de-facto
>> not honored by type developers (and users selecting fonts) as long as
>> the native text doesn't contain minimal pairs.
>
> Quickly checked a couple of older on-line PDFs and both used the
> comma below unabashedly.
>
> Quoting from this page (which appears to be more modern than the
> PDFs),
>
> http://www.trussel2.com/MOD/peloktxt.htm
>
> "Ij keememej ḷọk wōt ke ikar uwe ippān Jema kab ruo ṃōṃaan ilo
> juon booj jidikdik eo roñoul ruo ne aitokan im jiljino ne
> depakpakin. Ilo iien in eor jiljilimjuon ak rualitōk aō iiō—Ij jab
> kanooj ememej. Wa in ṃōṃkaj kar ..."
>
> It seems that users are happy to employ a dot below in lieu of
> either a comma or cedilla.  This newer web page is from a book
> published in 1978.  There's a scan of the original book cover.
> Although the book title is all caps hand printing it appears that
> commas were used.  The Marshallese orthography which uses
> commas/cedillas is fairly recent, replacing an older scheme
> devised by missionaries.  Perhaps the actual users have already
> resolved this dilemma by simply using dots below.

That may be the case for Marshallese. It wouldn't surprise me.

My comments were based on a different case of the same kinds of
diacritics below (other languages), and at the time we consulted
typographic samples, including newsprint, that were using pre-Unicode
technologies. In that sense it was a cleaner case, because there was no
influence by what Unicode did or didn't do.

Now, having said that, I do get it that some materials, like textbooks,
online class materials etc., need to be prepared / printed using the
normative style for the given orthography.

But it's a far cry from claiming that all text in a given language is
invariably done only one way.

A./
  
  



Re: PUA (BMP) planned characters HTML tables

2019-08-15 Thread Richard Wordingham via Unicode
On Wed, 14 Aug 2019 23:32:37 +
James Kass via Unicode  wrote:

> U+0149 has a compatibility decomposition.  It has been deprecated and
> is not rendered identically on my system.
> 'n ʼn
> ( ’n )

Compatibility decompositions are quite a mix, but are generally
expected to render differently.  If they were expected to render the
same, they would normally be canonical decompositions.

U+0149 and its decomposition naturally render very differently with a
monospaced font.  The same goes for the Roman numerals that the Far
East gave us.
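
The difference between the two decomposition types can be inspected
directly.  What follows is only a minimal sketch, assuming Python's
standard unicodedata module; the helper name describe() is made up for
illustration and is not from this thread.  It reports whether a
character's decomposition is canonical or compatibility, and what NFD
and NFKD do with it.

```python
import unicodedata

def describe(ch):
    """Summarise how a single character decomposes (illustrative helper)."""
    raw = unicodedata.decomposition(ch)   # "" when there is no decomposition
    if not raw:
        kind = "none"
    elif raw.startswith("<"):             # a tag such as <compat> marks compatibility
        kind = "compatibility"
    else:
        kind = "canonical"
    return {
        "name": unicodedata.name(ch),
        "kind": kind,
        "nfd": unicodedata.normalize("NFD", ch),    # canonical decomposition only
        "nfkd": unicodedata.normalize("NFKD", ch),  # canonical + compatibility
    }

apostrophe_n = describe("\u0149")   # LATIN SMALL LETTER N PRECEDED BY APOSTROPHE
roman_eight = describe("\u2167")    # ROMAN NUMERAL EIGHT
e_acute = describe("\u00e9")        # LATIN SMALL LETTER E WITH ACUTE
```

U+0149 and the Roman numeral are left alone by NFD and only change
under NFKD, whereas U+00E9 already decomposes under NFD.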

Richard.



Re: PUA (BMP) planned characters HTML tables

2019-08-14 Thread James Kass via Unicode



On 2019-08-15 12:25 AM, Asmus Freytag via Unicode wrote:

Empirically, it has been observed that some distinctions that are claimed by
users, standards developers or implementers were de-facto not honored by type
developers (and users selecting fonts) as long as the native text doesn't
contain minimal pairs.


Quickly checked a couple of older on-line PDFs and both used the comma 
below unabashedly.


Quoting from this page (which appears to be more modern than the PDFs),
http://www.trussel2.com/MOD/peloktxt.htm

"Ij keememej ḷọk wōt ke ikar uwe ippān Jema kab ruo ṃōṃaan ilo juon booj 
jidikdik eo roñoul ruo ne aitokan im jiljino ne depakpakin. Ilo iien in 
eor jiljilimjuon ak rualitōk aō iiō—Ij jab kanooj ememej. Wa in ṃōṃkaj 
kar ..."


It seems that users are happy to employ a dot below in lieu of either a 
comma or cedilla.  This newer web page is from a book published in 
1978.  There's a scan of the original book cover. Although the book 
title is all caps hand printing it appears that commas were used.  The 
Marshallese orthography which uses commas/cedillas is fairly recent, 
replacing an older scheme devised by missionaries.  Perhaps the actual 
users have already resolved this dilemma by simply using dots below.




Re: PUA (BMP) planned characters HTML tables

2019-08-14 Thread Asmus Freytag via Unicode

  
  
On 8/14/2019 2:05 AM, James Kass via Unicode wrote:

> This presumes that the premise of user communities feeling strongly
> about the unacceptable aspect of the variants is valid.  Since it
> has been reported and nothing seems to be happening, perhaps the
> casual users aren't terribly concerned.  It's also possible that
> the various user communities have already set up their systems to
> handle things acceptably by installing appropriate fonts.

This is always a good question.

Empirically, it has been observed that some distinctions that are
claimed by users, standards developers or implementers were de-facto
not honored by type developers (and users selecting fonts) as long as
the native text doesn't contain minimal pairs.

For example, some Latin fonts drop the dot on the lowercase i for
stylistic reasons (or designers use dotless i in highly designed texts,
like book covers, logos, etc.). That's usually not a problem for
ordinary users for monolingual texts in, say, English; even though
everyone agrees that the lowercase i is normally dotted, the absence
isn't noticed by most, and tolerated even by those who do notice it.

However, as soon as a user community sees a particular variant as
signalling their group identity, they will be very vocal about it -
even, interestingly enough, in cases where de-facto use (e.g. via font
selection, and not forced by implementation defaults) doesn't match
that preference. As I said, we've seen this in the past for some
features in some languages.

Now, which features become strongly identified with group identity is
something that is subject to change over time; this makes it impossible
to guarantee both absolute stability and perfect compatibility,
especially if a combining mark that is used in decompositions needs to
be disunified because the range of shapes changes from being stylistic
to normative.

Before Unicode, with character sets limited to local use, you couldn't
create minimal pairs (except if the variation was part of your
language, like Turkish i with/without dot). So, if a font deviated and
pushed the stylistic envelope, the non-preferred form, if used, would
still necessarily refer to the local character; there was no way it
could mean anything else. With Unicode, that's changed, and instead of
user communities treating this as a typographic issue (exclusive use of
preferred font) which is decentralized to document authors (and perhaps
font vendors) it becomes a character coding issue that is highly
visible and centralized.

That in turn can lead to the issue becoming politicized, not unlike
some grammar issues, where the supposedly "correct" form is far from
universally agreed on in practice.

A./
  
  



Re: PUA (BMP) planned characters HTML tables

2019-08-14 Thread Ken Whistler via Unicode



On 8/14/2019 4:32 PM, James Kass via Unicode wrote:
If a character gets deprecated, can its decomposition type be changed 
from canonical to compatibility?


Simple answer: No.

--Ken



Re: PUA (BMP) planned characters HTML tables

2019-08-14 Thread James Kass via Unicode




On 2019-08-14 7:50 PM, Richard Wordingham via Unicode wrote:

I think you'd also have to change the reference glyph of LATIN LOWER
CASE I WITH HEART to show a heart.  That's valid because the UCD trumps
the code charts, and no Unicode-compliant process may deliberately
render the sequence LATIN LOWER CASE I plus COMBINING HEART differently
from LATIN LOWER CASE I WITH HEART.


U+0149 has a compatibility decomposition.  It has been deprecated and is 
not rendered identically on my system.

'n ʼn
( ’n )
If a character gets deprecated, can its decomposition type be changed 
from canonical to compatibility?




Re: PUA (BMP) planned characters HTML tables

2019-08-14 Thread Richard Wordingham via Unicode
On Wed, 14 Aug 2019 09:05:02 +
James Kass via Unicode  wrote:

> The solution is to deprecate "LATIN LOWER CASE I WITH HEART".  It's
> only in there because of legacy.  Its presence guarantees
> round-tripping with legacy data but it isn't needed for modern data
> or display.  Urge Groups One and Two to encode their data with the
> desired combiner and educate font engine developers about the
> deprecation.  As the rendering engines get updated, the system
> substitution of the wrongly named precomposed glyph will go away.

I think you'd also have to change the reference glyph of LATIN LOWER
CASE I WITH HEART to show a heart.  That's valid because the UCD trumps
the code charts, and no Unicode-compliant process may deliberately
render the sequence LATIN LOWER CASE I plus COMBINING HEART differently
from LATIN LOWER CASE I WITH HEART.

Richard.



Re: PUA (BMP) planned characters HTML tables

2019-08-14 Thread James Kass via Unicode



On 2019-08-12 8:30 AM, Andrew West wrote:

This issue was discussed at WG2 in 2013
(https://www.unicode.org/L2/L2013/13128-latvian-marshal-adhoc.pdf),
when there was a recommendation to encode precomposed letters L and N
with cedilla *with no decomposition*, but that solution does not seem
to have been taken up by the UTC.


Group One dots their lowercase "i" letters with little flowers and Group 
Two dots theirs with little hearts.  Group Two considers flowers 
unacceptable and Group One rejects hearts.  Because of legacy character 
sets there's a precomposed character encoded called "LATIN LOWER CASE I 
WITH HEART", but it was misnamed and is normally drawn with a flower 
instead.  Group Two tries to encode "LATIN LOWER CASE I" plus "COMBINING 
HEART" to get the thing to display properly.  But because there's a 
decomposition involved, the font engine substitutes the glyph mapped to 
"LATIN LOWER CASE I WITH HEART" in the display for the string "LATIN 
LOWER CASE I" plus "COMBINING HEART".  This thwarts Group Two because 
they still get the flower.


The solution is to deprecate "LATIN LOWER CASE I WITH HEART".  It's only 
in there because of legacy.  Its presence guarantees round-tripping 
with legacy data but it isn't needed for modern data or display.  Urge 
Groups One and Two to encode their data with the desired combiner and 
educate font engine developers about the deprecation.  As the rendering 
engines get updated, the system substitution of the wrongly named 
precomposed glyph will go away.


This presumes that the premise of user communities feeling strongly 
about the unacceptable aspect of the variants is valid.  Since it has 
been reported and nothing seems to be happening, perhaps the casual 
users aren't terribly concerned.  It's also possible that the various 
user communities have already set up their systems to handle things 
acceptably by installing appropriate fonts.




Re: PUA (BMP) planned characters HTML tables

2019-08-12 Thread Andrew West via Unicode
On Mon, 12 Aug 2019 at 02:27, James Kass via Unicode
 wrote:
>
> On 2019-08-11 5:26 PM, [ Doug Ewell ] via Unicode wrote:
> > If you are thinking of these as potential future additions to the standard, 
> > keep in mind that accented letters that can already be represented by a 
> > combination of letter + accent will not ever be encoded. This is one of the 
> > longest-standing principles Unicode has.

People seem to be ignoring the fact that Marshallese and Latvian both
use L and N with cedilla, but with completely different glyph shapes:

> In January 2013, the Unicode Technical Committee discussed issues for the 
> representation of
> Marshallese orthography. In particular, Marshallese uses the Latin script and 
> requires the letters l,
> m, n, and o with cedilla. Latvian orthography uses the Latin script and 
> requires the letters g, k, l, n,
> and r with comma below. For Marshallese, it is unacceptable to display 
> cedillas as commas below.
> Conversely, for Latvian, it is unacceptable to display commas below as 
> cedillas.

However, as fonts have been following Latvian practice for these
letters (cedilla is displayed as a comma below) since before Unicode,
Marshallese users cannot get their desired outcome using standard
Unicode combining diacritical marks unless they apply a font specially
designed for Marshallese -- which you can never guarantee if you are
writing an email or posting on twitter, etc.
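
Why typing the combining cedilla explicitly does not help can be seen
from normalization alone.  The following is a rough sketch, assuming
Python's standard unicodedata module; the helper name nfc_codepoints()
is invented for illustration.  It shows that L plus U+0327 COMBINING
CEDILLA is canonically equivalent to the precomposed U+013B, the same
character Latvian text uses, while M plus cedilla and L plus U+0326
COMBINING COMMA BELOW have no precomposed form and stay as sequences.

```python
import unicodedata

COMBINING_COMMA_BELOW = "\u0326"
COMBINING_CEDILLA = "\u0327"

def nfc_codepoints(text):
    """Return the code points of text after NFC, as U+XXXX strings."""
    composed = unicodedata.normalize("NFC", text)
    return [f"U+{ord(c):04X}" for c in composed]

# L + cedilla composes to the single code point that fonts have
# traditionally drawn with a comma below.
l_cedilla = nfc_codepoints("L" + COMBINING_CEDILLA)

# M + cedilla has no precomposed form, so it stays a sequence.
m_cedilla = nfc_codepoints("M" + COMBINING_CEDILLA)

# L + comma below is a different sequence and does not compose.
l_comma = nfc_codepoints("L" + COMBINING_COMMA_BELOW)

l_cedilla_name = unicodedata.name("\u013b")
```

Because the first pair is canonically equivalent, a conformant
renderer has no basis for drawing the two spellings differently; only
the font decides the shape.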

This issue was discussed at WG2 in 2013
(https://www.unicode.org/L2/L2013/13128-latvian-marshal-adhoc.pdf),
when there was a recommendation to encode precomposed letters L and N
with cedilla *with no decomposition*, but that solution does not seem
to have been taken up by the UTC.

Andrew



Re: PUA (BMP) planned characters HTML tables

2019-08-12 Thread Richard Wordingham via Unicode
On Mon, 12 Aug 2019 01:21:42 +
James Kass via Unicode  wrote:

> There was a time when populating the PUA with precomposed glyphs was 
> necessary for printing or display, but that time has passed.

There is still the issue that in pure X one can't put sequences of
characters on a key; if the application doesn't invoke an input method
one is stuck.  Useful 20-year old proprietary code may be totally unable
to use modern font capabilities.  Don't forget the Cobol Y10k joke.

On Ubuntu at least, there was a period when Emacs couldn't access
X-based input methods from an English locale. The work-around: Use a
Japanese locale plus the vanilla lack of internationalisation in the
interface, or Emacs's very convenient alternative keyboard capability
for text input as opposed to commands.  The bug turned out to be in the
definition of the locales, i.e. in privileged data beyond the purview
of Emacs.

As to the need for the PUA, writing fonts to cope with Tai Tham
rendering engines is not easy, and it's no surprise that the PUA is used
on line for a newspaper that uses the Tai Tham script.  The USE is too
user-hostile for it to have helped if it had been available earlier.
(It just ignored the regular expression published in 2007.
It's in L2/07-007R in the UTC document register, ISO/IEC
JTC1/SC2/WG2/N3207R in ISO land.)  Indeed, perhaps I should be
researching the PUA encoding for Tai Tham. (My Tai Tham font Da Lekh
started as proof of principle, for there is already an unpleasant
amount of glyph sequence changing, some style-dependent. I couldn't see
how to get rendering engine support even when it might be added.  I was
pleasantly surprised at how far from impossible Tai Tham layout was
until the USE came along and made everything harder.  I now have to work
out which glyph instances have already been Indicly rearranged when I
repair the clustering.)

Oh, and I seem to need some PUA codepoints for vowels that get stranded
when line-breaks occur between the columns of an akshara.  The
proposals show this phenomenon in old(?) Pali text.  Or is there any
chance of getting them encoded?

Richard.


Re: PUA (BMP) planned characters HTML tables

2019-08-11 Thread James Kass via Unicode




On 2019-08-11 5:26 PM, [ Doug Ewell ] via Unicode wrote:

If you are thinking of these as potential future additions to the standard, 
keep in mind that accented letters that can already be represented by a 
combination of letter + accent will not ever be encoded. This is one of the 
longest-standing principles Unicode has.


Good point.

There was a time when populating the PUA with precomposed glyphs was 
necessary for printing or display, but that time has passed. Hopefully 
anyone seeking charts is transcoding older data into proper Unicode.


This can be illustrated with the Marshallese combos mentioned earlier.

PUA:  
Standard:  ĻļM̧m̧ŅņO̧o̧

Well, that didn't work out as well as expected.  But the standard 
Unicode is supported (more or less) by some of the core fonts installed 
here.  Nothing installed here displays anything useful for the PUA 
characters.  A decent OpenType font designed with Marshallese in mind 
should work just fine with the combiners.


The fact is that the standard characters will survive and can be 
universally exchanged.  And there are plenty of web page charts showing 
the standard characters.




RE: PUA (BMP) planned characters HTML tables

2019-08-11 Thread Doug Ewell via Unicode
Robert Wheelock wrote:

> I remember a website that has tables for certain PUA precomposed
> accented characters that aren’t yet in Unicode (things like:
> Marshallese M/m-cedilla, H/h-acute, capital T-dieresis, capital H-
> underbar, acute accented Cyrillic vowels, Cyrillic ER/er-caron, ...).

If you are thinking of these as potential future additions to the standard, 
keep in mind that accented letters that can already be represented by a 
combination of letter + accent will not ever be encoded. This is one of the 
longest-standing principles Unicode has.

--
Doug Ewell | Thornton, CO, US | ewellic.org





Re: PUA (BMP) planned characters HTML tables

2019-08-11 Thread James Kass via Unicode




On 2019-08-11 4:07 AM, Robert Wheelock via Unicode wrote:

Hello!
I remember a website that has tables for certain PUA precomposed
accented characters that aren’t yet in Unicode (things like:  Marshallese
M/m-cedilla, H/h-acute, capital T-dieresis, capital H-underbar, acute
accented Cyrillic vowels, Cyrillic ER/er-caron, ...).  Where was it at?!  I
still want to get the information.  Thank You!




It sounds familiar but I can't place it.  I tried the SIL pages first, 
as did Richard Wordingham apparently.


https://blogfonts.com/dehuti.font

This font has material in the PUA including:
Marshallese glyphs with cedillas: L (E382 & E394), M (E3A6 & E3BB), N 
(E3CE & E3DE), O (E429 & E465)


These appear to be PUA characters which the font developer has mapped in 
addition to the SIL PUA mappings.
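
For anyone holding text that uses such private mappings, transcoding
to standard sequences might look roughly like the sketch below.  It
assumes Python's standard unicodedata module, and it ASSUMES that in
each pair listed above the first code point is the capital and the
second the lowercase letter; that ordering is a guess, not something
verified against the font.  The table and function names are
hypothetical.

```python
import unicodedata

# Hypothetical table: case order within each pair is an assumption.
PUA_TO_STANDARD = {
    "\ue382": "L\u0327", "\ue394": "l\u0327",
    "\ue3a6": "M\u0327", "\ue3bb": "m\u0327",
    "\ue3ce": "N\u0327", "\ue3de": "n\u0327",
    "\ue429": "O\u0327", "\ue465": "o\u0327",
}

def transcode(text):
    """Replace known PUA code points; report any PUA left unmapped."""
    out = []
    leftovers = []
    for ch in text:
        if ch in PUA_TO_STANDARD:
            out.append(PUA_TO_STANDARD[ch])
        else:
            # General category "Co" marks private-use code points.
            if unicodedata.category(ch) == "Co":
                leftovers.append(f"U+{ord(ch):04X}")
            out.append(ch)
    # NFC composes L/N + cedilla; M/O + cedilla remain sequences.
    return unicodedata.normalize("NFC", "".join(out)), leftovers

converted, unmapped = transcode("\ue382\ue3a6\ue000")
```

Anything reported in the second return value still needs a mapping
from whatever agreement the original data followed.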






Re: PUA (BMP) planned characters HTML tables

2019-08-10 Thread Richard Wordingham via Unicode
On Sun, 11 Aug 2019 00:07:05 -0400
Robert Wheelock via Unicode  wrote:

> I remember a website that has tables for certain PUA precomposed
> accented characters that aren’t yet in Unicode (things like:
> Marshallese M/m-cedilla, H/h-acute, capital T-dieresis, capital
> H-underbar, acute accented Cyrillic vowels, Cyrillic
> ER/er-caron, ...).  Where was it at?!  I still want to get the
> information.  Thank You!

You may mean https://www.eki.ee/letter.  Once there, you'll want to make
a query by Unicode range, e.g. e000-f8ff.  It doesn't seem to refer to
the relevant agreement.  You could start hunting for agreements at
https://scripts.sil.org/cms/scripts/page.php?item_id=VendorUseOfPUA

Most of the characters you mention are scheduled to be assigned their
own codepoint on the Greek kalends.  They are precluded by policy
because they would need to be composition exclusions to avoid making
text in NFC cease to be in NFC.
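
The NFC point can be checked with existing characters.  This is only a
sketch, assuming Python's standard unicodedata module, with is_nfc() as
an illustrative helper.  Today h + U+0301 is already in NFC as two code
points; a new composable h-acute would make that existing text stop
being NFC, so any such character would have to be excluded from
composition.  U+0958 DEVANAGARI LETTER QA, an existing composition
exclusion, shows what that looks like in practice.

```python
import unicodedata

def is_nfc(text):
    """True if the text is already in Normalization Form C."""
    return unicodedata.normalize("NFC", text) == text

# h + COMBINING ACUTE ACCENT: no precomposed form exists, so this
# two-code-point sequence is already NFC.
h_acute = "h\u0301"
h_acute_is_nfc = is_nfc(h_acute)

# U+0958 is a composition exclusion: it has a canonical decomposition,
# but NFC never recomposes it, so NFC text never contains it.
qa = "\u0958"
qa_nfc = unicodedata.normalize("NFC", qa)
qa_codepoints = [f"U+{ord(c):04X}" for c in qa_nfc]
qa_is_nfc = is_nfc(qa)
```

A precomposed letter that normalized text can never contain offers
little over the combining sequence it decomposes to.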

I first thought of the SIL PUA at
https://scripts.sil.org/cms/scripts/page.php?site_id=nrsi=PUA_home ,
but they knew better than to include most of them.

Richard.



RE: PUA (BMP) planned characters HTML tables

2019-08-10 Thread Robert Wheelock via Unicode
Hello!
I remember a website that has tables for certain PUA precomposed
accented characters that aren’t yet in Unicode (things like:  Marshallese
M/m-cedilla, H/h-acute, capital T-dieresis, capital H-underbar, acute
accented Cyrillic vowels, Cyrillic ER/er-caron, ...).  Where was it at?!  I
still want to get the information.  Thank You!

Robert Lloyd Wheelock