Re: Private Use areas

2018-08-21 Thread Andrew Cunningham via Unicode
On Wednesday, 22 August 2018, Mark E. Shoulson via Unicode <
unicode@unicode.org> wrote:

> On 08/21/2018 02:03 PM, Ken Whistler via Unicode wrote:
>
> Best we can do is shout loudly at OpenType tables and hope to cram in
> behavior (or at least appearance, which is more likely all we can get) that
> vaguely resembles what we're after.  And that's not SO awful, given what
> we're dealing with.

At the moment I am looking at implementing three unencoded Arabic
characters in the PUA.

For the foreseeable future OpenType is a non-starter, so I will look at
implementing them in Graphite tables in a font.

Andrew



-- 
Andrew Cunningham
lang.supp...@gmail.com


Re: Private Use areas

2018-08-21 Thread Mark E. Shoulson via Unicode

On 08/21/2018 02:03 PM, Ken Whistler via Unicode wrote:

> On 8/21/2018 7:56 AM, Adam Borowski via Unicode wrote:
>> On Mon, Aug 20, 2018 at 05:17:21PM -0700, Ken Whistler via Unicode wrote:
>>> On 8/20/2018 5:04 PM, Mark E. Shoulson via Unicode wrote:
>>>> Is there a block of RTL PUA also?
>>> No.
>> Perhaps there should be?
>
> This is a periodic suggestion that never goes anywhere--for good
> reason. (You can search the email archives and see that it keeps
> coming up.)
>
> Presuming that this question was asked in good faith...

Yeah, I know there has been talk about such things, and I also knew that
whether or not there was an RTL block (which I did not remember for
certain), there weren't going to be any *changes* in the PUA, and we
were going to have to make do with what there was.  There's no way to
anticipate all the possible properties people would want in the PUA,
though I remember thinking it was probably wrong to make the PUA
*strongly* LTR; I know there's a not-strongly flavor too.

Best we can do is shout loudly at OpenType tables and hope to cram in
behavior (or at least appearance, which is more likely all we can get)
that vaguely resembles what we're after.  And that's not SO awful, given
what we're dealing with.

> As I see it, the only feasible way for people to get specialized
> behavior for PUA ranges involves first ceasing to assume that somehow
> they can jawbone the UTC into *standardizing* some ranges for some
> particular use or another. That simply isn't going to happen. People
> who assume this is somehow easy, and that the UTC are a bunch of
> boneheads who stand in the way of obvious solutions, do not -- I
> contend -- understand the complicated interplay of character
> properties, stability guarantees, and implementation behavior baked
> into system support libraries for the Unicode Standard.

The whole point of the PUA is that it *isn't* standardized (by the
UTC).  It might have been nice to make some more varied choices of
things that couldn't be left unspecified, but you're still going to wind
up with "but there aren't any PUA codepoints that are JUST what I
need!"  And, as said, it's too late now.


~mark


Re: Private Use areas

2018-08-21 Thread Rebecca Bettencourt via Unicode
On Tue, Aug 21, 2018 at 3:02 PM Doug Ewell via Unicode 
wrote:

> Ken Whistler wrote:
>
> > The way forward for folks who want to do this kind of thing is:
> >
> > 1. Define a *protocol* for reliable interchange of custom character
> > property information about PUA code points.
>
> I've often thought that would be a great idea. You can't get to steps 2
> and 3 without step 1. I'd gladly participate in such a project.
>

As would I.


Re: Private Use areas

2018-08-21 Thread Doug Ewell via Unicode
Ken Whistler wrote:

> The way forward for folks who want to do this kind of thing is:
>
> 1. Define a *protocol* for reliable interchange of custom character
> property information about PUA code points. 

I've often thought that would be a great idea. You can't get to steps 2
and 3 without step 1. I'd gladly participate in such a project. 
  
--
Doug Ewell | Thornton, CO, US | ewellic.org



Re: Private Use areas

2018-08-21 Thread Adam Borowski via Unicode
On Tue, Aug 21, 2018 at 11:03:41AM -0700, Ken Whistler via Unicode wrote:
> 
> On 8/21/2018 7:56 AM, Adam Borowski via Unicode wrote:
> > On Mon, Aug 20, 2018 at 05:17:21PM -0700, Ken Whistler via Unicode wrote:
> > > On 8/20/2018 5:04 PM, Mark E. Shoulson via Unicode wrote:
> > > > Is there a block of RTL PUA also?
> > > No.
> > Perhaps there should be?
> 
> This is a periodic suggestion that never goes anywhere--for good reason.
> (You can search the email archives and see that it keeps coming up.)
> 
> Presuming that this question was asked in good faith...

Oif, looks like mere months of inattentive lurking are not enough (the
thread I got pointed to was from 2011).  Apologies.

> > or perhaps by allocating a new range elsewhere.
> See:
> 
> https://www.unicode.org/policies/stability_policy.html
> 
> The General_Category property value Private_Use (Co) is immutable: the set
> of code points with that value will never change.
> 
> That guarantee has been in place since 1996, and is a rule that binds the
> UTC. So nope, sorry, no more PUA ranges.

Right.

> The way forward for folks who want to do this kind of thing is:
> 
> 1. Define a *protocol* for reliable interchange of custom character property
> information about PUA code points.
[...]
> And if the goal for #3 is to get some *system* implementer to support the
> protocol in widespread software, then before starting any of #1, #2, or #3,
> you had better start instead with:
> 
> 0. Create a consortium (or other ongoing organization) with a 10-year time
> horizon and participation by at least one major software implementer, to
> define, publicize, and advocate for support of the protocol.

Heh, good point.  I wonder, perhaps a long-lived consortium tasked with
assigning properties to characters already exists?

So your answer _does_ provide a way to go: any PUA use that's no longer
private, or any problem someone has with character properties, should go
through official channels here instead of inventing one's own standard.

With my existing hats on (Debian fonts team member, and someone who messes
with terminals in general) I already have two such itches to scratch.
Thus, it sounds like I should do the research, prepare a write-up, and then
come back to harass you folks with inane questions.  Inventing new solutions
that work around instead of with you is a bad idea...


Meow!
-- 
ᛊᚨᚾᛁᛏᚣ᛫ᛁᛊ᛫ᚠᛟᚱ᛫ᚦᛖ᛫ᚹᛖᚨᚲ


Re: Private Use areas

2018-08-21 Thread Richard Wordingham via Unicode
On Tue, 21 Aug 2018 11:03:41 -0700
Ken Whistler via Unicode  wrote:

> On 8/21/2018 7:56 AM, Adam Borowski via Unicode wrote:

> Really? Suppose someone wants to implement a bicameral script in PUA. 
> They would need case mappings for that, and how would those be
> "better represented in the font itself"? Or how about digits? Would
> numeric values for digits be "better represented in the font itself"?
> How about implementation of punctuation? Would segmentation
> properties and behavior be "better represented in the font itself"?

The least intrusive way of defining the meaning of a graphic (sensu
lato) character is by a font, in a very wide sense that would interpret
a Unicode code chart as a font.  Without a font in this sense, normal
characters in the PUA have no meaning.  If one insists on a font to
have an interpretation, then:

(1) PUA characters in plain text are meaningless - I believe that's
pretty much the position now.

(2) Different schemes can co-exist, even within the same formatted
document, by having different formats.  This is the case now.  It then
makes sense to store the properties in the font, which needs to be
saved with or in the document for the document to continue to make
sense. 

Casing and digits are luxuries.  Are we not told that searching should
be done by collation?  We then do not need case-folding!  Interpreting
the preferred representation of Roman numerals does not use Unicode
properties beyond the approximate principle of one character, one
codepoint. 

As to segmentation, my understanding was that there were no characters
available to indicate word boundaries in scriptio continua; the closest
one has is line-breaking suggestions.  If my memory serves me right,
SIL Graphite fonts can hold line-breaking information.

Richard.


Re: Private Use areas

2018-08-21 Thread Rebecca Bettencourt via Unicode
On Tue, Aug 21, 2018 at 10:21 AM, Janusz S. Bień via Unicode <
unicode@unicode.org> wrote:

> I think PUA users should provide the
> properties of the characters used in a form analogous to that of Unicode
> itself, and the software should be able to use this additional
> information.
>

I already provide this myself for my uses of the PUA as well as the CSUR
and any vendor-specific agreements I can find:

http://www.kreativekorp.com/charset/PUADATA/

Of course there is no way to get software to use this information. I have
entertained the idea of being able to embed this information into the font
itself as OpenType tables, e.g.:

PUAB -> Blocks.txt
PUAC -> CaseFolding.txt
PUAW -> EastAsianWidth.txt
PUAL -> LineBreak.txt
PUAD -> UnicodeData.txt

I've actually invented table names for the majority of UCD files, but those
are probably the most relevant. The table names for the more obscure files
get rather... creative, e.g.:

PUA[ -> BidiBrackets.txt
PUA] -> BidiMirroring.txt

That alone may get some people to think twice about this idea. :P
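
A rough sketch of what consuming such data could look like, assuming the PUA records follow the semicolon-separated, 15-field layout of UnicodeData.txt. The sample record, its character name, and the `PuaRecord` and `parse_unicode_data_line` names are all hypothetical and are not taken from the PUADATA files; only a few of the fields are pulled out here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PuaRecord:
    """A subset of the UnicodeData.txt fields for one code point."""
    code_point: int
    name: str
    general_category: str
    combining_class: int
    bidi_class: str
    simple_uppercase: Optional[int]
    simple_lowercase: Optional[int]


def _hex_or_none(field: str) -> Optional[int]:
    # Empty fields mean "no mapping" in UnicodeData.txt.
    return int(field, 16) if field else None


def parse_unicode_data_line(line: str) -> PuaRecord:
    """Parse one UnicodeData.txt-style line (15 fields separated by ';')."""
    fields = line.strip().split(";")
    if len(fields) != 15:
        raise ValueError(f"expected 15 fields, got {len(fields)}")
    return PuaRecord(
        code_point=int(fields[0], 16),
        name=fields[1],
        general_category=fields[2],
        combining_class=int(fields[3]),
        bidi_class=fields[4],
        simple_uppercase=_hex_or_none(fields[12]),
        simple_lowercase=_hex_or_none(fields[13]),
    )


# Hypothetical record: a private-use capital letter at U+E000 whose
# lowercase counterpart lives at U+E001.
SAMPLE = "E000;EXAMPLE PRIVATE CAPITAL LETTER A;Lu;0;L;;;;;N;;;;E001;"
record = parse_unicode_data_line(SAMPLE)
print(record)
```

Reusing the UCD layout means any existing UnicodeData.txt parser could read private data unchanged; getting software to act on it is the unsolved part.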


Re: Private Use areas

2018-08-21 Thread Ken Whistler via Unicode


On 8/21/2018 7:56 AM, Adam Borowski via Unicode wrote:

> On Mon, Aug 20, 2018 at 05:17:21PM -0700, Ken Whistler via Unicode wrote:
>> On 8/20/2018 5:04 PM, Mark E. Shoulson via Unicode wrote:
>>> Is there a block of RTL PUA also?
>> No.
> Perhaps there should be?

This is a periodic suggestion that never goes anywhere--for good reason.
(You can search the email archives and see that it keeps coming up.)

Presuming that this question was asked in good faith...

> What about designating a part of the PUA to have a specific property?

The problem with that is that assigning *any* non-default property to
any PUA code point would break existing implementations' assumptions
about PUA character properties and potentially create havoc with
existing use.

> Only certain properties matter enough:

That is an un-demonstrated assertion that I don't think you have thought
through sufficiently.

> * wide
> * RTL

RTL is not some binary counterpart of LTR. There are 23 values of
Bidi_Class, and anyone who wanted to implement a right-to-left script in
PUA might well have to make use of multiple values of Bidi_Class. Also,
there are two major types of strong right-to-leftness: Bidi_Class=R and
Bidi_Class=AL. Should a "RTL PUA" zone favor Arabic type behavior or
non-Arabic type behavior?
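
As a small illustration of the Bidi_Class point, the following sketch uses Python's standard unicodedata module to print the class of a few encoded characters next to a private-use code point. The choice of sample characters is an assumption made for the example; nothing here is specific to any proposal in this thread.

```python
import unicodedata

# A few encoded characters with different Bidi_Class values, plus the
# first BMP private-use code point for comparison.
samples = {
    "LATIN SMALL LETTER A": "a",
    "HEBREW LETTER ALEF": "\u05D0",
    "ARABIC LETTER BEH": "\u0628",
    "ARABIC-INDIC DIGIT ONE": "\u0661",
    "first BMP private-use code point": "\uE000",
}

bidi = {label: unicodedata.bidirectional(ch) for label, ch in samples.items()}

for label, value in bidi.items():
    print(f"{label}: Bidi_Class={value}")
```

Hebrew and Arabic letters are both strongly right-to-left yet carry different classes (R and AL), Arabic-Indic digits carry a third (AN), and the private-use code point reports the default L.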



> * combining

Also not a binary switch. Canonical_Combining_Class is a numeric value,
and any value but ccc=0 for a PUA character would break normalization.
Then for the General_Category, there are three types of "marks" that
count as combining: gc=Mn, gc=Mc, gc=Me. Which of those would be favored
in any PUA assignment?
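
To make the ccc and gc distinction concrete, here is a minimal sketch using Python's standard unicodedata module. The sample marks are chosen for illustration only. The last two lines show how a private-use character, having ccc=0, currently acts as a barrier to canonical reordering, which is the behavior existing normalizers rely on.

```python
import unicodedata

# One example of each kind of "mark", plus a private-use code point.
marks = {
    "U+0301 COMBINING ACUTE ACCENT": "\u0301",
    "U+0903 DEVANAGARI SIGN VISARGA": "\u0903",
    "U+20DD COMBINING ENCLOSING CIRCLE": "\u20DD",
    "U+E000 (private use)": "\uE000",
}

for label, ch in marks.items():
    print(f"{label}: gc={unicodedata.category(ch)} ccc={unicodedata.combining(ch)}")

# Canonical ordering sorts adjacent marks by ccc: dot below (220) moves
# ahead of acute (230).
reordered = unicodedata.normalize("NFD", "a\u0301\u0323")

# With a ccc=0 private-use character in between, no reordering happens.
blocked = unicodedata.normalize("NFD", "a\u0301\uE000\u0323")
```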



> as most others are better represented in the font itself.

Really? Suppose someone wants to implement a bicameral script in PUA.
They would need case mappings for that, and how would those be "better
represented in the font itself"? Or how about digits? Would numeric
values for digits be "better represented in the font itself"? How about
implementation of punctuation? Would segmentation properties and
behavior be "better represented in the font itself"?

> This could be done either by parceling one of existing PUA ranges: planes 15
> and 16 are virtually unused thus any damage would be negligible;

That is simply an assertion -- and not the kind of assertion that the
UTC tends to accept on spec. I rather suspect that there are multiple
participants on this email list, for example, who *do* have
implementations making extensive use of Planes 15/16 PUA code points for
one thing or another.

> or perhaps by allocating a new range elsewhere.

See:

https://www.unicode.org/policies/stability_policy.html

The General_Category property value Private_Use (Co) is immutable: the
set of code points with that value will never change.

That guarantee has been in place since 1996, and is a rule that binds
the UTC. So nope, sorry, no more PUA ranges.
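
Because the set is fixed, the three private-use ranges can safely be hard-coded. The sketch below checks them against the character database shipped with Python's standard unicodedata module; the names `PUA_RANGES` and `is_private_use` are made up for this example.

```python
import unicodedata

# The three private-use ranges: one in the BMP, plus planes 15 and 16
# (excluding the two noncharacters at the end of each plane).
PUA_RANGES = [
    (0xE000, 0xF8FF),
    (0xF0000, 0xFFFFD),
    (0x100000, 0x10FFFD),
]


def is_private_use(cp: int) -> bool:
    """Return True if cp lies in one of the fixed private-use ranges."""
    return any(first <= cp <= last for first, last in PUA_RANGES)


total = sum(last - first + 1 for first, last in PUA_RANGES)

# Scan the whole code space and collect any code point where the
# hard-coded ranges and the database's gc=Co disagree.
mismatches = [
    cp
    for cp in range(0x110000)
    if (unicodedata.category(chr(cp)) == "Co") != is_private_use(cp)
]

print(f"private-use code points: {total}, mismatches: {len(mismatches)}")
```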

> Meow!


Grrr! ;-)

As I see it, the only feasible way for people to get specialized 
behavior for PUA ranges involves first ceasing to assume that somehow 
they can jawbone the UTC into *standardizing* some ranges for some 
particular use or another. That simply isn't going to happen. People who 
assume this is somehow easy, and that the UTC are a bunch of boneheads 
who stand in the way of obvious solutions, do not -- I contend -- 
understand the complicated interplay of character properties, stability 
guarantees, and implementation behavior baked into system support 
libraries for the Unicode Standard.


The way forward for folks who want to do this kind of thing is:

1. Define a *protocol* for reliable interchange of custom character 
property information about PUA code points.


2. Convince more than one party to actually *use* that protocol to 
define sets of interchangeable character property definitions.


3. Convince at least one implementer to support that protocol to create 
some relevant interchangeable *behavior* for those PUA characters.


And if the goal for #3 is to get some *system* implementer to support 
the protocol in widespread software, then before starting any of #1, #2, 
or #3, you had better start instead with:


0. Create a consortium (or other ongoing organization) with a 10-year 
time horizon and participation by at least one major software 
implementer, to define, publicize, and advocate for support of the 
protocol. (And if you expect a major software implementer to 
participate, you might need to make sure you have a business case 
defined that would warrant such a 10-year effort!)


--Ken



Re: Private Use areas

2018-08-21 Thread Steven R. Loomis via Unicode
2011 Thread:
https://www.unicode.org/mail-arch/unicode-ml/y2011-m08/0124.html

Please read in particular these two:

- https://www.unicode.org/mail-arch/unicode-ml/y2011-m08/0174.html
- https://www.unicode.org/mail-arch/unicode-ml/y2011-m08/0212.html

(tl;dr: 1. the PUA set is fixed, 2. being private, the properties may be
overridable by conformant implementations.)
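
A minimal sketch of point 2, assuming an implementation that consults a private agreement before falling back to the default properties. The override table, its contents, and the `bidi_class` helper are hypothetical; the fallback uses Python's standard unicodedata module.

```python
import unicodedata
from typing import Mapping

# Hypothetical private agreement: Bidi_Class overrides for two
# private-use code points.
PRIVATE_BIDI_CLASS: Mapping[int, str] = {
    0xE000: "R",
    0xE001: "AL",
}


def bidi_class(ch: str, overrides: Mapping[int, str] = PRIVATE_BIDI_CLASS) -> str:
    """Return the Bidi_Class of ch, honoring overrides for gc=Co only."""
    cp = ord(ch)
    # Only private-use code points may be overridden; everything else
    # keeps its standard property value.
    if unicodedata.category(ch) == "Co" and cp in overrides:
        return overrides[cp]
    return unicodedata.bidirectional(ch)


for ch in ("\uE000", "\uE001", "\uE002", "a", "\u05D0"):
    print(f"U+{ord(ch):04X}: {bidi_class(ch)}")
```

Restricting the override to gc=Co keeps the behavior of every encoded character untouched, so only parties to the private agreement see any difference.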


On Mon, Aug 20, 2018 at 5:17 PM Ken Whistler via Unicode <
unicode@unicode.org> wrote:

>
>
> On 8/20/2018 5:04 PM, Mark E. Shoulson via Unicode wrote:
> > Is there a block of RTL PUA also?
>
> No.
>
> --Ken
>


Re: Private Use areas

2018-08-21 Thread Janusz S. Bień via Unicode
On Tue, Aug 21 2018 at 16:56 +0200, unicode@unicode.org writes:
> On Mon, Aug 20, 2018 at 05:17:21PM -0700, Ken Whistler via Unicode wrote:
>> On 8/20/2018 5:04 PM, Mark E. Shoulson via Unicode wrote:
>> > Is there a block of RTL PUA also?
>> 
>> No.
>
> Perhaps there should be?
>
> What about designating a part of the PUA to have a specific property?  Only
> certain properties matter enough:
> * wide
> * RTL
> * combining
> as most others are better represented in the font itself.
>
> This could be done either by parceling one of existing PUA ranges: planes 15
> and 16 are virtually unused thus any damage would be negligible; or perhaps
> by allocating a new range elsewhere.

I don't think it's a good idea. I think PUA users should provide the
properties of the characters used in a form analogous to that of Unicode
itself, and the software should be able to use this additional
information.

Best regards

Janusz

-- 
 ,   
Janusz S. Bien
emeryt (emeritus)
https://sites.google.com/view/jsbien


Re: Private Use areas

2018-08-21 Thread Adam Borowski via Unicode
On Mon, Aug 20, 2018 at 05:17:21PM -0700, Ken Whistler via Unicode wrote:
> On 8/20/2018 5:04 PM, Mark E. Shoulson via Unicode wrote:
> > Is there a block of RTL PUA also?
> 
> No.

Perhaps there should be?

What about designating a part of the PUA to have a specific property?  Only
certain properties matter enough:
* wide
* RTL
* combining
as most others are better represented in the font itself.

This could be done either by parceling one of existing PUA ranges: planes 15
and 16 are virtually unused thus any damage would be negligible; or perhaps
by allocating a new range elsewhere.


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ What Would Jesus Do, MUD/MMORPG edition:
⣾⠁⢰⠒⠀⣿⡁ • multiplay with an admin char to benefit your mortal [Mt3:16-17]
⢿⡄⠘⠷⠚⠋⠀ • abuse item cloning bugs [Mt14:17-20, Mt15:34-37]
⠈⠳⣄ • use glitches to walk on water [Mt14:25-26]


Re: Thoughts on working with the Emoji Subcommittee (was Re: Thoughts on Emoji Selection Process)

2018-08-21 Thread James Kass via Unicode
Rebecca Bettencourt wrote,

> Why don't we just get Blissymbolics encoded as it is?

The Pipeline still has the Everson proposal from 1998, but Blissymbols
are still in the Pipeline.

The Scripts Encoding Initiative
( http://linguistics.berkeley.edu/sei/ ) page,
http://linguistics.berkeley.edu/sei/scripts-not-encoded.html
shows Blissymbols and links the same proposal.

Blissymbolics Communication International,
http://www.blissymbolics.org/
will likely produce the next proposal.

Both Scripts Encoding Initiative and Blissymbolics Communication
International depend upon funding.


Re: Thoughts on working with the Emoji Subcommittee (was Re: Thoughts on Emoji Selection Process)

2018-08-21 Thread Asmus Freytag via Unicode

  
  
On 8/21/2018 1:01 AM, Julian Bradfield via Unicode wrote:

> On 2018-08-20, Mark E. Shoulson via Unicode wrote:
>> Moreover, they [William's pronoun symbols] are once again an attempt to
>> shoehorn Overington's pet project, "language-independent
>> sentences/words," which are still generally deemed out of scope for
>> Unicode.
>
> I find it increasingly hard to understand why William's project is out
> of scope (apart from the "demonstrate use first, then encode"
> principle, which is in any case not applied to emoji), when emoji are
> language-independent words - or even sentences: the GROWING HEART
> emoji is (I presume) supposed to be a language-independent way of
> saying "I love you more every day". Which seems rather more
> fatuous as a thing to put in a writing-systems standard than the
> things I think William would want.
>
> Not that I want to hear any more about William's unmentionables; I
> just wish emoji were equally unmentionable.

Unicode is descriptive, not prescriptive (or tries to be). In other
words, it generally tries to track what people use in writing (including
"have used" in the case of obsolete/historic characters and scripts).
Focusing on abstract commonalities misses the point: some things are in
use by large, active user communities that have "voted with their feet"
to treat these on the same footing as "characters". Being descriptive
means that Unicode necessarily will (have to) follow.

It does not mean that other items that are formally of a similar
category should necessarily be treated the same way: they are ideas and
not part of a system that is already in near universal use.

A./
  
  



Re: Private Use areas (was: Re: Thoughts on working with the Emoji Subcommittee (was ...))

2018-08-21 Thread Richard Wordingham via Unicode
On Tue, 21 Aug 2018 08:53:18 +0800
via Unicode  wrote:

> On 2018-08-21 08:04, Mark E. Shoulson via Unicode wrote:

> > Still, maybe it
> > doesn't really matter much: your special-purpose font can treat any
> > codepoint any way it likes, right?

> Not all properties come from the font. For example a Zhuang character 
> PUA font, which supplements CJK ideographs, does not rotate
> characters 90 degrees, when change from RTL to vertical display of
> text.

Isn't that supposed to be treated by an OpenType feature such as
'vert'?  Or does the rendering stack get in the way?

However, one might need reflowing text to be about 40% WJ.

Richard.


Re: Thoughts on working with the Emoji Subcommittee (was Re: Thoughts on Emoji Selection Process)

2018-08-21 Thread Julian Bradfield via Unicode
On 2018-08-20, Mark E. Shoulson via Unicode  wrote:
> Moreover, they [William's pronoun symbols] are once again an attempt to 
> shoehorn Overington's pet 
> project, "language-independent sentences/words," which are still 
> generally deemed out of scope for Unicode.

I find it increasingly hard to understand why William's project is out
of scope (apart from the "demonstrate use first, then encode"
principle, which is in any case not applied to emoji), when emoji are
language-independent words - or even sentences: the GROWING HEART
emoji is (I presume) supposed to be a language-independent way of
saying "I love you more every day". Which seems rather more
fatuous as a thing to put in a writing-systems standard than the
things I think William would want.

Not that I want to hear any more about William's unmentionables; I
just wish emoji were equally unmentionable.

-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.



Re: Unicode 11 Georgian uppercase vs. fonts

2018-08-21 Thread James Kass via Unicode
(from 2018-07-27)

> Michael Everson responded,
>
>>> If members of the Georgian user community want to consider this a stylistic 
>>> difference, they are free to do so.
>>
>> It isn’t a stylistic difference. It is a different use of capital letters 
>> than Latin, Cyrillic and other scripts use them.

suppose that english was written with a bicameral script, but english
users only used the upper case letters for emphasis.  in other words,
personal names (like bela lugosi), place names (like bechuanaland),
and book titles (like "the bridge over the river kwai") would always
be in lower case.  if someone needed to emphasize something by
SHOUTING, they would use all-caps to make this stylistic distinction.
if english users called upper case "harcourt" and lower case "fenton",
there would be no earthly reason for them to consider switching from
fenton to harcourt to be anything other than a stylistic difference.

along comes a consortium with script experts and computer encoding
experts who rightfully determine that the difference between harcourt
and fenton is actually a casing difference, even though the english
writing system does not actually use casing in a manner consistent
with other bicameral scripts.  so the consortium, tasked with breaking
down elements of text for computer entry, exchange, and storage,
encodes the english script as a casing script.

would that action by the consortium alter my perception (as a typical
member of the english user community) that the difference between
harcourt and fenton is simply stylistic?  HECK, no!

the same applies to georgian.  or any script.  whatever the consortium
does for computer text processing purposes should NEVER be interpreted
as an effort to make the users change their perceptions of their OWN
writing systems.  we've been through this kind of thing before, with
tamil as a notorious example.

best regards,

james kass