Re: New Unicode Working Group: Message Formatting

2020-01-10 Thread James Kass via Unicode
Yes, thank you, that answers the question.  Format rather than 
repertoire.  Please note, though, that the example given of a 
localizable message string is also an example of a localized sentence.


On 2020-01-10 11:17 PM, Steven R. Loomis wrote:

James,

A localizable message string is one similar to those given in the example:
English: “The package will arrive at {time} on {date}.”
German: “Das Paket wird am {date} um {time} geliefert.”

The message string may contain any number of complete sentences, including zero 
( “Arrival: {time}” ).

The Message Format Working Group is to define the *format* of the strings, not 
their *repertoire*. That is, should the string be “Arrival: %s” or “Arrival: 
${date}” or “Arrival: {0}”?
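The three placeholder styles contrasted above can be sketched in Python, which happens to support an analogue of each (printf-style `%s`, shell/Template-style `${date}`, and numbered `{0}`); the message text and values here are illustrative only, not part of any proposed format:

```python
from string import Template

time_s, date_s = "3:00 PM", "Jan 10"  # illustrative values

# printf-style placeholder, as in "Arrival: %s"
printf_style = "Arrival: %s" % time_s

# shell/Template-style named placeholder, as in "Arrival: ${date}"
template_style = Template("Arrival: ${date}").substitute(date=date_s)

# numbered placeholder, as in "Arrival: {0}" (MessageFormat-like)
indexed_style = "Arrival: {0}".format(time_s)

print(printf_style)
print(template_style)
print(indexed_style)
```

Each syntax carries different affordances for translators (named vs. positional arguments, reorderability), which is exactly the sort of trade-off a format standard has to settle.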


Does that answer your question?

--
Steven R. Loomis | @srl295 | git.io/srl295




On Jan. 10, 2020, at 2:48 p.m., James Kass via Unicode wrote:


On 2020-01-10 9:55 PM, announceme...@unicode.org wrote:

But until now we have not had a syntax for localizable message strings 
standardized by Unicode.

What is the difference between “localizable message strings” and “localized 
sentences”?  Asking for a friend.





Re: New Unicode Working Group: Message Formatting

2020-01-10 Thread James Kass via Unicode

* sentences

On 2020-01-10 10:48 PM, James Kass wrote:


On 2020-01-10 9:55 PM, announceme...@unicode.org wrote:
But until now we have not had a syntax for localizable message 
strings standardized by Unicode.


What is the difference between “localizable message strings” and 
“localized sentances”?  Asking for a friend.






Re: New Unicode Working Group: Message Formatting

2020-01-10 Thread James Kass via Unicode



On 2020-01-10 9:55 PM, announceme...@unicode.org wrote:
But until now we have not had a syntax for localizable message strings 
standardized by Unicode.


What is the difference between “localizable message strings” and 
“localized sentances”?  Asking for a friend.




Re: Long standing problem with Vedic tone markers and post-base visarga/anusvara

2020-01-04 Thread James Kass via Unicode



On 2020-01-04 12:50 PM, Richard Wordingham via Unicode wrote:

dev2: कः꣡ 

dev3: क꣡ः  
Grantha: (1) [Grantha sequence; glyphs garbled in the archive]
  (2) [Grantha sequence; glyphs garbled in the archive]
The second Grantha spelling is enabled by a Harfbuzz-only change to
the USE categorisations.  It treats Grantha visarga and spacing
anusvara as though inpc=Top rather than inpc=Right.  As I am using
Ubuntu 16.04, this override isn't supported in applications that use the
system HarfBuzz library, such as my email client.

We are now establishing incompatible Devanagari font-specific
encodings fully compliant with TUS!
This seems to be a very bad approach.  And apparently it isn't limited 
to the Devanagari script.


For the Grantha examples above, Grantha (1) displays much better here.  
It seems daft to put a spacing character between a base character and 
any mark which is supposed to combine with the base character.





Re: Long standing problem with Vedic tone markers and post-base visarga/anusvara

2020-01-01 Thread James Kass via Unicode



On 2020-01-02 1:04 AM, Richard Wordingham wrote in a thread deriving 
from this one,


> Have you found a definition of the ISCII handling of Vedic characters?

No.  It would be helpful.  ISCII apparently wasn't really used much.  It 
would also be helpful to know the encoding order in any legacy ISCII 
data using the Vedic characters with respect to VISARGA/ANUSVARA.  
Although such legacy data seems unlikely, I'd expect VISARGA/ANUSVARA to 
be entered/stored post-syllable.


> I've been looking at Microsoft's specification of Devanagari character
> order.  In
> 
https://docs.microsoft.com/en-us/typography/script-development/devanagari,

> the consonant syllable ends
>
> [N]+[A] + [< H+[<ZWNJ|ZWJ>] | {M}+[N]+[H]>]+[SM]+[(VD)]
>
> where
> N is nukta
> A is anudatta (U+0952)
> H is halant/virama
> M is matra
> SM is syllable modifier signs
> VD is vedic
>
> "Syllable modifier signs" and "vedic" are not defined.  It appears that
> SM includes U+0903 DEVANAGARI SIGN VISARGA.

What action should Microsoft take to satisfy the needs of the user 
community?

1.  No action, maintain status quo.
2.  Swap SM and VD in the specs ordering.
3.  Make new category PS (post-syllable) and move VISARGA/ANUSVARA there.
4.  ?

What kind of impact would there be on existing data if Microsoft revised 
the ordering?


Or should Unicode encode a new character like ZERO-WIDTH INVISIBLE 
DOTTED CIRCLE so that users can suppress unwanted and unexpected dotted 
circles by adding superfluous characters to the text stream?


> I note that even ग॒ः  is
> given a dotted circle by HarfBuzz.

Same on Win 7.  And  (गः॒) 
breaks the mark positioning as expected.




Re: One encoding per shape (was Re: Long standing problem with Vedic tone markers and post-base visarga/anusvara)

2020-01-01 Thread James Kass via Unicode



On 2020-01-01 8:11 PM, James Kass wrote:
It’s too bad that ISCII didn’t accommodate the needs of Vedic Sanskrit, 
but here we are.


Sorry, that might be wrong to say.  It's possible that it's Unicode's 
adaptation of ISCII that hinders Vedic Sanskrit.


One encoding per shape (was Re: Long standing problem with Vedic tone markers and post-base visarga/anusvara)

2020-01-01 Thread James Kass via Unicode



On 2020-01-01 11:17 AM, Richard Wordingham via Unicode wrote:

> That's exactly the sort of mess that jack-booted renderers are trying
> to minimise.  Their principle is that there should be only one encoding
> per shape, though to be fair:
>
> 1) some renderers accept canonical equivalents.
> 2) tolerance may be allowed for ligating (ZWJ, ZWNJ, CGJ), collating
> (CGJ, SHY) and line-breaking controls (SHY, ZWSP, WJ).
> 3) Superseded chillu encodings are still supported.

There was never any need for atomic chillu form characters.  The 
principle of only one encoding per shape is best achieved when every 
shape gets an atomic encoding.  Glyph-based encoding is incompatible 
with Unicode character encoding principles.


It’s too bad that ISCII didn’t accommodate the needs of Vedic Sanskrit, 
but here we are.




Re: Long standing problem with Vedic tone markers and post-base visarga/anusvara

2019-12-31 Thread James Kass via Unicode



A workaround until some kind of satisfactory adjustment is made might be 
to simply use COLON for VISARGA.  Or...


VISARGA  ⇒ U+02F8 MODIFIER LETTER RAISED COLON
ANUSVARA ⇒ U+02D9 DOT ABOVE

...as long as the font(s) included both those characters.

य॑ यॆ॑

य॑ं -- anusvara last
यॆ॑ं -- "

य॑: -- colon last
यॆ॑: -- "

य॑˸ -- raised colon modifier last
यॆ॑˸ -- "

य॑˙ -- spacing dot above last
यॆ॑˙ -- "



Re: Long standing problem with Vedic tone markers and post-base visarga/anusvara

2019-12-31 Thread James Kass via Unicode



On 2019-12-21 6:27 AM, Shriramana Sharma via Unicode wrote:

However, even the simplest Vedic sequence (not involving Sama Vedic or
multiple tone marker combinations) like दे॒वेभ्य॑ः throws up a dotted
circle, and one is expected (see developer feedback in that bug
report) to input the visarga before tone markers, hoping the software
is intelligent enough to skip over the visarga (or anusvara) and place the
tone marker over the preceding syllable correctly. Why it is necessary
to put the visarga first in input only to have to skip over it in
shaping is beyond me.

य॔ यॆ॔
य॔ः -- visarga last
यॆ॔ः -- "
यः॔ -- visarga before accent (U+0954)
यॆः॔ -- "

य॑ यॆ॑
य॑ः -- visarga last
यॆ॑ः -- "
यः॑  visarga before svarita (U+0951)
यॆः॑  "

U+0951 and U+0954 have canonical combining class of 230.  Putting 
VISARGA (CCC=0) after those CCC=230 marks generates the dotted circle 
for VISARGA.  Putting VISARGA before those CCC=230 marks generates the 
dotted circle for U+0954 but drops the dotted circle for U+0951.  In 
both cases where VISARGA comes before, the mark positioning is broken.  
(Mangal font, Win 7)
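The combining-class behavior described above can be checked with Python's `unicodedata` module (a sketch; the values come from the UCD version bundled with the interpreter):

```python
import unicodedata

# VISARGA and ANUSVARA carry ccc=0, so canonical reordering will never
# move them past the ccc=230 tone marks; the encoded order is what renderers see.
for cp in (0x0903, 0x0902, 0x0951, 0x0954):
    ch = chr(cp)
    print(f"U+{cp:04X} {unicodedata.name(ch)}: ccc={unicodedata.combining(ch)}")
```

Because normalization cannot reorder across a ccc=0 character, the choice of where VISARGA/ANUSVARA goes in the backing store is a spelling decision, not something normalization can paper over.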


As far as I can tell, the simplest solution would be for the Indic 
shaping engines to suppress the dotted circle for VISARGA (or ANUSVARA) 
where appropriate.  Entering/storing VISARGA or ANUSVARA at the end of 
the syllable makes sense since that's where it goes, visually and logically.




Re: NBSP supposed to stretch, right?

2019-12-20 Thread James Kass via Unicode





On 2019-12-21 2:43 AM, Shriramana Sharma via Unicode wrote:

Ohkay and that's very nice meaningful feedback from actual
developer+user interaction. So the way I look at this going forward is
that we have four options:

1)

With the existing single NBSP character, provide a software option to
either make it flexible or inflexible, but this preference should be
stored as part of the document and not the application settings, else
shared documents would not preserve the layout intended by the
creator.




5)

Update the applications to treat NBSP correctly.  Process legacy data 
based on date/time stamp (or metadata) appropriately and offer users the 
option to update their legacy data algorithmically using proper 
non-stretching space characters such as FIGURE SPACE.


-

Options 1 and 5 have the advantage of not requiring the addition of yet 
more spacing characters to the Standard.




Re: NBSP supposed to stretch, right?

2019-12-19 Thread James Kass via Unicode



From our colleague’s web site,
http://jkorpela.fi/chars/spaces.html

“On web browsers, no-break spaces tended to be non-adjustable, but 
modern browsers generally stretch them on justification.”


Jukka Korpela then offers pointers about avoiding unwanted stretching.

and

“The change in the treatment of no-break spaces, though inconvenient, is 
consistent with changes in CSS specifications. For example, clause 7 
Spacing of CSS Text Module Level 3 (Editor’s Draft 24 Jan. 2019) defines 
the no-break space, but not the fixed-width spaces, as a word-separator 
character, stretchable on justification.”


So it appears that there’s no interoperability problem with HTML.

It seems that the widespread breakage which Asmus Freytag mentions is 
limited to legacy applications, such as Word, which persist in treating 
U+00A0 as the old “hard space”.  It also appears that Microsoft tried and 
failed to correct the problem in Word.  Perhaps they should try again.  
Meanwhile, in the absence of anything from Unicode more explicit than 
already recommended by the Standard, Shriramana Sharma might be well 
advised to continue to lobby the respective software people.  As more 
applications migrate towards the correct treatment of U+00A0, they are 
probably already running into interoperability problems with Microsoft 
Word and may well have already implemented solutions.




Re: NBSP supposed to stretch, right?

2019-12-18 Thread James Kass via Unicode



On 2019-12-17 12:50 AM, Shriramana Sharma via Unicode wrote:

I would have gone and filed this as a LibreOffice bug since that's the
software I use most, but when I found this is a cross-software
problem, I thought it would be best to have this discussed and
documented here (and in a future version of the standard).

There's a bug report for the LibreOffice application here...
https://bugs.documentfoundation.org/show_bug.cgi?id=41652
...which shows an interesting history of the situation.

One issue is whether to be Unicode compliant or MS-Word compliant. 
MS-Word had apparently corrected the bug with Word 2013 but had reverted 
to the incorrect behavior by the time Word 2016 rolled out.  On that 
page it's noted that applications like InDesign, Firefox, TeX, and 
QuarkXPress handle U+00A0 correctly.




Re: HEAVY EQUALS SIGN

2019-12-18 Thread James Kass via Unicode




On 2019-12-18 12:42 PM, Marius Spix via Unicode wrote:

Unicode has a HEAVY PLUS SIGN (U+2795) and a HEAVY MINUS SIGN (U+2796).
I wonder, if a HEAVY EQUALS SIGN could complete that character set.
This would allow emoji phrases like 👨 ➕ 🐱 = ❤️. (man plus cat equals
love) looking typographically better, when you replace the equals sign
with a new HEAVY EQUALS SIGN character. Thoughts?

Marius


 ➕  ⚌ ❤️



Re: NBSP supposed to stretch, right?

2019-12-18 Thread James Kass via Unicode



U+0020 SPACE
U+00A0 NO-BREAK SPACE

These two characters are equal in every way except that one of them 
offers an opportunity for a line break and the other does not.


If the above statement is true, then any conformant application must 
treat/process/display both characters identically.
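The claim above can be checked against the character database itself with Python's `unicodedata` (a sketch; note that the Line_Break property, where the one real difference lives, is not exposed by this module):

```python
import unicodedata

sp, nbsp = "\u0020", "\u00A0"

# Both characters are general category Zs (space separator).
print(unicodedata.category(sp), unicodedata.category(nbsp))

# NBSP's compatibility decomposition is <noBreak> + SPACE: the UCD itself
# records the no-break behavior as the only formal distinction.
print(repr(unicodedata.decomposition(nbsp)))  # '<noBreak> 0020'
print(repr(unicodedata.decomposition(sp)))    # '' (none)
```

So as far as the UCD is concerned, U+00A0 is "SPACE, minus the break opportunity", which supports treating the two identically under justification.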


Responding to Asmus Freytag,

> Now, if someone can show us that there are widespread implementations that
> follow the above recommendation and have no interoperability issues with HTML
> then I may change my tune.

Can anyone show us that there are widespread implementations which would 
break if they started following the above recommendation?


Quoting from this HTML basics page,
http://www.htmlbasictutor.ca/non-breaking-space.htm

“Some browsers will ignore beyond the first instance of the non-breaking 
space.”

and
“Not all browsers acknowledge the additional instances of the 
non-breaking space.”


Fifteen or twenty years ago, we used NO-BREAK SPACE to indent paragraphs 
and to position text and graphics.  Both of those uses are presently 
considered no-nos because some browsers collapse NBSPs and because there 
are proper ways now to accomplish these kinds of effects.


The introduction of browsers which collapsed NBSP strings broke existing 
web pages.  Perhaps the developers of those browsers decided that SPACE 
and NO-BREAK SPACE are indeed identical except for line breaking.


Are there any modern mark-up language uses of SPACE vs NO-BREAK SPACE 
which would be broken if they follow the above recommendation?




Re: NBSP supposed to stretch, right?

2019-12-17 Thread James Kass via Unicode



Asmus Freytag wrote,

> And any recommendation that is not compatible with what the overwhelming
> majority of software has been doing should be ignored (or only enabled on
> explicit user input).
>
> Otherwise, you're just advocating for a massively breaking change.

It seems like the recommendations are already in place and the 
“overwhelming majority of software” is already disregarding them.


I don’t see the massively breaking change here.  Are there any 
illustrations?


If legacy text containing NO-BREAK SPACE characters is popped into a 
justifier, the worst thing that can happen is that the text will be 
correctly justified under a revised application.  That’s not breaking 
anything, it’s fixing it.  Unlike changing the font-face, font size, or 
page width (which often results in reformatting the text), the line 
breaks are calculated before justification occurs.


If a string of NO-BREAK SPACE characters appears in an HTML file, the 
browser should proportionally adjust all of those space characters 
identically with the “normal” space characters.  This should preserve 
the authorial intent.


As for pre-Unicode usage of NO-BREAK SPACE, were there ever any explicit 
guidelines suggesting that the normal SPACE character should expand or 
contract for justification but that the NO-BREAK SPACE must not expand 
or contract?




Re: NBSP supposed to stretch, right?

2019-12-17 Thread James Kass via Unicode



On 2019-12-17 10:37 AM, QSJN 4 UKR via Unicode wrote:

Agree.
By the way, it is common practice to use multiple nbsp in a row to
create a larger span. In my opinion, it is wrong to replace fixed
width spaces with non-breaking spaces.
Quote from Microsoft Typography Character design standards:
«The no-break space is not the same character as the figure space. The
figure space is not a character defined in most computer system's
current code pages. In some fonts this character's width has been
defined as equal to the figure width. This is an incorrect usage of
the character no-break space.»

The mention of code pages made me suspect that this quote was from an 
archived older web page, but it's current.  Here's the link:

https://docs.microsoft.com/en-us/typography/develop/character-design-standards/whitespace

Quoting from that same page,
"Advance width rule : The advance width of the no-break space should be 
equal to the width of the space."


So it follows that any justification operation should treat NO-BREAK 
SPACE and SPACE identically.




Re: A neat description of encoding characters

2019-12-02 Thread James Kass via Unicode




On 2019-12-03 12:59 AM, Richard Wordingham via Unicode wrote:

On Mon, 2 Dec 2019 12:01:52 +
"Costello, Roger L. via Unicode"  wrote:


 From the book titled "Computer Power and Human Reason" by Joseph
Weizenbaum, p.74-75

Suppose that the alphabet with which we wish to concern ourselves
consists of 256 distinct symbols...

Why should I wish to concern myself with only one alphabet?

You shouldn't.  But suppose you did.  That's the hypothetical set-up for 
the illustration.


When that book was published in 1976, that illustration may have helped 
some people gain a better understanding of computer encoding.


Nowadays a character string might be required to produce a glyph which 
the user community considers to be a "character" (or letter) in its 
writing system.  Adding variation selectors, invisible 'formatting' 
characters, and non-alphabetic symbols to the mix has moved computer 
encoding way beyond 1976.




Re: Is the Unicode Standard "The foundation for all modern software and communications around the world"?

2019-11-19 Thread James Kass via Unicode



On 2019-11-19 11:00 PM, Mark E. Shoulson via Unicode wrote:
Why so concerned with these minutiæ? Were you in fact misled?  
(Doesn't sound like it.)  Do you know someone who was, or whom you 
fear would be?  What incorrect conclusions might they draw from that 
misunderstanding, and how serious would they be?  Doesn't sound like 
this is really anything serious even if you were right. 


Anyone unfamiliar with our timeline, such as a millennial, might be led 
to believe that Unicode was in place before personal computers existed.  
A bit of research would have dispelled that notion.  But thereafter any 
assertion from Unicode would be suspect.


Limiting the claims to text, as Asmus Freytag suggests, might be too 
limiting.  Many people may not realize how prevalent textual data really 
is in our exchanges of information.  Imagine producing a video offering 
closed captioning/subtitling in French, Italian, and Hebrew without the 
underlying foundation of Unicode.


Rather than limiting this to text, why not substitute something for the 
word "foundation"?  For example:


The Unicode Standard is the lodestar for all modern software and 
communications around the world, ...




Re: Is the Unicode Standard "The foundation for all modern software and communications around the world"?

2019-11-19 Thread James Kass via Unicode



On 2019-11-19 6:59 PM, Costello, Roger L. via Unicode wrote:

Today I received an email from the Unicode organization. The email said this: 
(italics and yellow highlighting are mine)

The Unicode Standard is the foundation for all modern software and 
communications around the world, including all modern operating systems, 
browsers, laptops, and smart phones-plus the Internet and Web (URLs, HTML, XML, 
CSS, JSON, etc.).

That is a remarkable statement! But is it entirely true? Isn't it assuming that 
everything is text? What about binary information such as JPEG, GIF, MPEG, WAV; 
those are pretty core items to the Web, right? The Unicode Standard is silent 
about them, right? Isn't the above quote a bit misleading?


A bit, perhaps.  But think of it as a press release.

The statement smacks of hyperbole at first blush, but "foundation" can 
mean basis or starting point.  File names (and URLs) of *.WAV, *.MPG, 
etc. are stored and exchanged via Unicode.  Likewise, the tags 
(metadata) for audio/video files are stored (and displayed) via 
Unicode.  So fields such as Title, Artist, Comments/Notes, Release Date, 
Label, Composer, and so forth aren't limited to ASCII data.




Re: New Public Review on QID emoji

2019-11-12 Thread James Kass via Unicode



On 2019-11-13 3:00 AM, Asmus Freytag via Unicode wrote:

The current effort starts from an unrelated problem (Unicode not wanting to
administer emoji applications) and in my analysis, seriously puts the cart
before the horse.

But it does solve the unrelated problem.

There's nothing stopping vendors from making software which recognizes 
tag character strings to reference in-line graphics. There's nothing 
stopping users from employing those in-line graphics as emoji images.  
It would be considered a higher level protocol which uses tag character 
strings in lieu of, for example, ASCII strings like <img 
src="triceratops.png">.  Either way, it's rich-text expressed with 
plain-text strings.
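The tag-character mechanism can be sketched in Python using the one well-established case, the Scotland-flag sequence (BLACK FLAG base, TAG characters spelling "gbsct", CANCEL TAG terminator); a QID sequence would hypothetically follow the same base-plus-tag-payload shape, and the helper name here is mine, not anything from the proposal:

```python
BLACK_FLAG = "\U0001F3F4"   # base character for flag tag sequences
CANCEL_TAG = "\U000E007F"   # terminates a tag sequence

def tag_sequence(base: str, payload: str) -> str:
    # TAG characters mirror printable ASCII at U+E0000 + code point,
    # e.g. 'g' (U+0067) becomes TAG LATIN SMALL LETTER G (U+E0067).
    return base + "".join(chr(0xE0000 + ord(c)) for c in payload) + CANCEL_TAG

scotland = tag_sequence(BLACK_FLAG, "gbsct")
print([f"U+{ord(c):04X}" for c in scotland])
```

The payload is invisible by default, which is why a renderer that doesn't recognize a given sequence can still fall back to displaying just the base character.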


But for Unicode to provide this mechanism which "should be correctly 
parsed by all conformant implementations" as well as possibly 
maintaining a registry of "tag sequences known to be in use" suggests 
that Unicode now considers that random images (with no symbolic meaning 
other than they're pictures of something) should be exchanged as plain-text.


The QID Emoji in Unicode makes as much sense as the original emoji 
inclusion.  It's a natural result of the slippery slope of emoji encoding.


Emoji are open-ended but Unicode currently has barriers erected. QID 
Emoji would eliminate limitations on what's supposed to be an open-ended 
set.  That's the problem that the current effort would resolve.  In my 
opinion it's better to open up a myriad of images and see which 
sequences actually get used than to have vendors/enthusiasts create 
images in the hope or expectation that anyone will actually use them.




Re: On the lack of a SQUARE TB glyph

2019-09-27 Thread James Kass via Unicode



On 2019-09-27 5:15 AM, Fred Brennan via Unicode wrote:

I only have two lingering questions.

* Does the existence of the legacy Adobe encoding Adobe-Japan1-6 shift the
balance? It has a SQUARE TB at CID+8306.

https://www.adobe.com/content/dam/acom/en/devnet/font/pdfs/5078.Adobe-Japan1-6.pdf


That character set also has other items not in Unicode such as numbers 
enclosed in squares from "0" and "00" through "100" and fractions like 
3/7 and 10/11.  It was published in 2008, so it might not be considered 
as "legacy".


Re: Rendering Sanskrit Medial Sequences -vy- and -ry- in Myanmar

2019-08-20 Thread James Kass via Unicode
Well, it was intended to be off list.  It seems that this has been 
mentioned before, for example:

http://www.unicode.org/mail-arch/unicode-ml/y2011-m07/0029.html
Maybe it's time for a new thread/subject title?


Re: Rendering Sanskrit Medial Sequences -vy- and -ry- in Myanmar

2019-08-20 Thread James Kass via Unicode




On 2019-08-21 2:40 AM, James Kass wrote:

Are we are allowed to write Llangollen as the definition of the
Unicode Collation Algorithm implies we should, with an invisible CGJ
between the 'n' and the 'g', so that it will collate correctly in
Welsh?  That CGJ is necessary so that it will collate*after*
Llanberis. (The problem is that the letter 'ng' comes before the letter
'n'.) 
(This is off-list).  If 'ng' comes before 'n', shouldn't Llangollen 
collate *before* Llanberis in a Welsh listing?


Re: Rendering Sanskrit Medial Sequences -vy- and -ry- in Myanmar

2019-08-20 Thread James Kass via Unicode



On 2019-08-21 2:08 AM, Richard Wordingham via Unicode wrote:

Are we are allowed to write Llangollen as the definition of the
Unicode Collation Algorithm implies we should, with an invisible CGJ
between the 'n' and the 'g', so that it will collate correctly in
Welsh?  That CGJ is necessary so that it will collate*after*
Llanberis. (The problem is that the letter 'ng' comes before the letter
'n'.)
So that it won't collate correctly in anything other than Welsh? Isn't 
it better to use an application which enables Welsh collation?  Here's 
how BabelPad handles Welsh:

http://www.babelstone.co.uk/Software/BabelPad_Sort_Lines.html


Re: PUA (BMP) planned characters HTML tables

2019-08-14 Thread James Kass via Unicode



On 2019-08-15 12:25 AM, Asmus Freytag via Unicode wrote:

Empirically, it has been observed that some distinctions that are claimed by
users, standards developers or implementers were de-facto not honored by type
developers (and users selecting fonts) as long as the native text doesn't
contain minimal pairs.


Quickly checked a couple of older on-line PDFs and both used the comma 
below unabashedly.


Quoting from this page (which appears to be more modern than the PDFs),
http://www.trussel2.com/MOD/peloktxt.htm

"Ij keememej ḷọk wōt ke ikar uwe ippān Jema kab ruo ṃōṃaan ilo juon booj 
jidikdik eo roñoul ruo ne aitokan im jiljino ne depakpakin. Ilo iien in 
eor jiljilimjuon ak rualitōk aō iiō—Ij jab kanooj ememej. Wa in ṃōṃkaj 
kar ..."


It seems that users are happy to employ a dot below in lieu of either a 
comma or cedilla.  This newer web page is from a book published in 
1978.  There's a scan of the original book cover. Although the book 
title is all caps hand printing it appears that commas were used.  The 
Marshallese orthography which uses commas/cedillas is fairly recent, 
replacing an older scheme devised by missionaries.  Perhaps the actual 
users have already resolved this dilemma by simply using dots below.




Re: PUA (BMP) planned characters HTML tables

2019-08-14 Thread James Kass via Unicode




On 2019-08-14 7:50 PM, Richard Wordingham via Unicode wrote:

I think you'd also have to change the reference glyph of LATIN LOWER
CASE I WITH HEART to show a heart.  That's valid because the UCD trumps
the code charts, and and no Unicode-compliant process may deliberately
render  differently from LATIN LOWER CASE I WITH
HEART.


U+0149 has a compatibility decomposition.  It has been deprecated and is 
not rendered identically on my system.

'n ʼn
( ’n )
If a character gets deprecated, can its decomposition type be changed 
from canonical to compatibility?




Re: PUA (BMP) planned characters HTML tables

2019-08-14 Thread James Kass via Unicode



On 2019-08-12 8:30 AM, Andrew West wrote:

This issue was discussed at WG2 in 2013
(https://www.unicode.org/L2/L2013/13128-latvian-marshal-adhoc.pdf),
when there was a recommendation to encode precomposed letters L and N
with cedilla*with no decomposition*, but that solution does not seem
to have been taken up by the UTC.


Group One dots their lowercase "i" letters with little flowers and Group 
Two dots theirs with little hearts.  Group Two considers flowers 
unacceptable and Group One rejects hearts.  Because of legacy character 
sets there's a precomposed character encoded called "LATIN LOWER CASE I 
WITH HEART", but it was misnamed and is normally drawn with a flower 
instead.  Group Two tries to encode "LATIN LOWER CASE I" plus "COMBINING 
HEART" to get the thing to display properly.  But because there's a 
decomposition involved, the font engine substitutes the glyph mapped to 
"LATIN LOWER CASE I WITH HEART" in the display for the string "LATIN 
LOWER CASE I" plus "COMBINING HEART".  This thwarts Group Two because 
they still get the flower.


The solution is to deprecate "LATIN LOWER CASE I WITH HEART".  It's only 
in there because of legacy.  Its presence guarantees round-tripping 
with legacy data but it isn't needed for modern data or display.  Urge 
Groups One and Two to encode their data with the desired combiner and 
educate font engine developers about the deprecation.  As the rendering 
engines get updated, the system substitution of the wrongly named 
precomposed glyph will go away.


This presumes that the premise of user communities feeling strongly 
about the unacceptable aspect of the variants is valid.  Since it has 
been reported and nothing seems to be happening, perhaps the casual 
users aren't terribly concerned.  It's also possible that the various 
user communities have already set up their systems to handle things 
acceptably by installing appropriate fonts.




Re: PUA (BMP) planned characters HTML tables

2019-08-11 Thread James Kass via Unicode




On 2019-08-11 5:26 PM, Doug Ewell via Unicode wrote:

If you are thinking of these as potential future additions to the standard, 
keep in mind that accented letters that can already be represented by a 
combination of letter + accent will not ever be encoded. This is one of the 
longest-standing principles Unicode has.


Good point.

There was a time when populating the PUA with precomposed glyphs was 
necessary for printing or display, but that time has passed. Hopefully 
anyone seeking charts is transcoding older data into proper Unicode.


This can be illustrated with the Marshallese combos mentioned earlier.

PUA:  
Standard:  ĻļM̧m̧ŅņO̧o̧

Well, that didn't work out as well as expected.  But the standard 
Unicode is supported (more or less) by some of the core fonts installed 
here.  Nothing installed here displays anything useful for the PUA 
characters.  A decent OpenType font designed with Marshallese in mind 
should work just fine with the combiners.


The fact is that the standard characters will survive and can be 
universally exchanged.  And there's plenty of web page charts showing 
the standard characters.




Re: PUA (BMP) planned characters HTML tables

2019-08-11 Thread James Kass via Unicode




On 2019-08-11 4:07 AM, Robert Wheelock via Unicode wrote:

Hello!
I remember that a website that has tables for certain PUA precomposed
accented characters that aren’t yet in Unicode (thing like:  Marshallese
M/m-cedilla, H/h-acute, capital T-dieresis, capital H-underbar, acute
accented Cyrillic vowels, Cyrillic ER/er-caron, ...).  Where was it at?!  I
still want to get the information.  Thank You!




It sounds familiar but I can't place it.  I tried the SIL pages first, 
as did Richard Wordingham apparently.


https://blogfonts.com/dehuti.font

This font has material in the PUA including:
Marshallese glyphs with cedillas: L (E382 & E394), M (E3A6 & E3BB), N 
(E3CE & E3DE), O (E429 & E465)


These appear to be PUA characters which the font developer has mapped in 
addition to the SIL PUA mappings.






Re: SHEQEL and L2/19-291

2019-07-24 Thread James Kass via Unicode



https://en.wikipedia.org/wiki/Israeli_new_shekel

"With the issuing of the third series, the Bank of Israel has adopted 
the standard English spelling of shekel and plural shekels for its 
currency.[30] Previously, the Bank had formally used the Hebrew 
transcriptions of sheqel and sheqalim (from שְׁקָלִים‎).[31]"


BTW, Google flags "sheqel" in its search box as an incorrect spelling.

On 2019-07-25 2:23 AM, Mark E. Shoulson via Unicode wrote:

Just looking at document L2/19-291,
https://www.unicode.org/L2/L2019/19291-missing-currency.pdf "Currency signs
missing in Unicode" by Eduardo Marín Silva.  And I'm wondering why he feels it
necessary for the Unicode standard to say that a more correct spelling for the
Israeli currency would be "shekel" (and not "sheqel").  What criterion is being
used that makes this "more correct"?  I think it's more popular and common, so
maybe that's it.  But historically and linguistically, "sheqel" is more
accurate.  The middle letter is ק, U+05E7 HEBREW LETTER QOF (which is not "more
correctly" KOF), from the root ש־ק־ל Sh.Q.L meaning "weight".  It's true that
Modern Hebrew does not distinguish K and Q phonetically in speech; maybe that is
what is meant?  Still, the "historical" transliteration of QOF with Q is very
widespread, and I believe occurs even on some coins/bills (could be wrong here;
is this what is meant by "more correct"? That "shekel" is what is used
officially on the currency and I am misremembering?)


Just wondering about this, since it seems to be stressed in the document.


~mark





Re: Is ARMENIAN ABBREVIATION MARK (՟, U+055F) misclassified?

2019-04-26 Thread James Kass via Unicode



On 2019-04-26 11:08 PM, Doug Ewell via Unicode wrote:

This is a small percentage of the number of fonts that have all four of these 
Armenian glyphs, but show the abbreviation mark as a spacing glyph. It looks 
like Unicode is right, Wikipedia is right, and the fonts are wrong.
If the Wikipedia page(s) are correct, then Unicode isn't.  Unicode 
charts don't show the glyph on a dotted circle, and the canonical 
combining class is shown as "spacing".  The fact that Doug Ewell found 
some installed fonts displaying the character as a combining mark 
suggests that the Wikipedia pages are correct.  This character is listed 
as being unused in modern Armenian, but you'd think that it would have 
been exposed before now, since the character has been in Unicode since 
version 1.0.


Re: Script_extension Property of U+0310 Combining Candrabindu

2019-04-18 Thread James Kass via Unicode



The Guara Times font maps Cyrillic letters (Л,л,М,м) with chandrabindus 
in the P.U.A. of the font.  This can be done without the P.U.A. using 
U+0310:  Л̐,л̐,М̐,м̐


http://www.chakra.lv/blog/2016/10/19/transliterating-sanskrit-into-russian/

On 2019-04-18 7:59 PM, Richard Wordingham via Unicode wrote:

Is there any reason why U+0310 COMBINING CANDRABINDU has scx=Inherited
rather than scx=Latn?  The only language I've seen the character used
in is Sanskrit, and the only script I've seen it in is the Latin
script.

Richard.




Re: MODIFIER LETTER SMALL GREEK PHI in Calibri is wrong.

2019-04-17 Thread James Kass via Unicode



Confirming that the installed version here shows psi.  (Version 5.74)

Luc(as) de Groot is the type designer, I've copied him on this message.


On 2019-04-17 10:06 PM, Hans Åberg via Unicode wrote:

You are possibly both right, because it is OK in the web font but wrong in the 
desktop font.



On 17 Apr 2019, at 23:53, Oren Watson via Unicode  wrote:

You can easily reproduce this by going here:
https://www.fonts.com/font/microsoft-corporation/calibri/regular
and putting in the following string: ψϕφᵠ

On Wed, Apr 17, 2019 at 5:23 PM James Tauber  wrote:
It looks correct in Google Docs so it appears to have been fixed in whatever 
version of the font is used there.

James

On Wed, Apr 17, 2019 at 5:10 PM Oren Watson via Unicode  
wrote:
Would anyone know where to report this?
In the widely used Calibri typeface included with MS Office, the glyph shown 
for U+1D60 MODIFIER LETTER SMALL GREEK PHI, actually depicts a letter psi, not 
a phi.






Re: Emoji Haggadah

2019-04-16 Thread James Kass via Unicode

> Perhaps that debunking was in the very book
> cited by Martin J. Dürst earlier in this thread.

Yes, starting on page 24.
https://books.google.com/books?id=hypplIDMd0IC=PA24=isbn:0824812077+Yukaghir=en=X=0ahUKEwj1n4r719zgAhWJn4MKHcdyCHIQ6AEIKjAA#v=onepage=isbn%3A0824812077%20Yukaghir=false


Re: Emoji Haggadah

2019-04-16 Thread James Kass via Unicode



> 
http://historyview.blogspot.com/2011/10/yukaghir-girl-writes-love-letter.html


According to a comment, the Yukaghir love letter as semasiographic 
communication was debunked by John DeFrancis in 1989 who asserted that 
it was merely a prop in a Yukaghir parlor game.  Perhaps that debunking 
was in the very book cited by Martin J. Dürst earlier in this thread.


Martin J. Dürst via Unicode wrote,
>> There is a well-known thesis in linguistics that every script has to be
>> at least in part phonetic, and the above are examples that add support
>> to this. For deeper explanations (unfortunately not yet including
>> emoji), see e.g. "Visible Speech - The Diverse Oneness of Writing
>> Systems", by John DeFrancis, University of Hawaii Press, 1989.

The blog page comment went on to say that Geoffrey Sampson, who wrote 
the book from which the blogger learned of the Yukaghir love letter, 
published a retraction in 1994.




Re: Emoji Haggadah

2019-04-16 Thread James Kass via Unicode



On 2019-04-16 7:09 AM, Martin J. Dürst via Unicode wrote:

All the examples you cite, where images stand for sounds, are typically
used in some of the oldest "ideographic" scripts. Egyptian definitely
has such concepts, and Han (CJK) does so, too, with most ideographs
consisting of a semantic and a phonetic component.


Using emoji as rebus puzzles seems harmless enough but it defeats the 
goals of those emoji proponents who want to see emoji evolve into a 
universal form of communication because phonetic recognition of symbols 
would be language specific.  Users of ancient ideographic systems 
typically shared a common language where rebus or phonetic usage made 
sense to the users.  (Of course, diverse CJK user communities were able 
to adapt over time.)


All of the reviews of this publication on the page originally linked 
seemed positive, so it appears that people are having fun with emoji.  
But I suspect that this work would be jibber-jabber to any non-English 
speaker unfamiliar with the original Haggadah.  No matter how otherwise 
fluent they might be in emoji communication.




Re: Emoji Haggadah

2019-04-15 Thread James Kass via Unicode



On 2019-04-16 3:18 AM, Mark E. Shoulson via Unicode wrote:

> For whatever reason, the author decided to go with 🕉️ for "God" and 
such, ...


"OM"igod.

Just a thought.

If the emoji OM SYMBOL is to be used for "god", shouldn't there be 
casing to enable distinction between the common noun and the deity?


Vendor-assigned emoji (was: Encoding italic)

2019-02-11 Thread James Kass via Unicode



On 2019-01-24 Andrew West wrote,

> The ESC and UTC do an appallingly bad job at regulating emoji, and I
> would like to see the Emoji Subcommittee disbanded, and decisions on
> new emoji taken away from the UTC, and handed over to a consortium or
> committee of vendors who would be given a dedicated vendor-use emoji
> plane to play with (kinda like a PUA plane with pre-assigned
> characters with algorithmic names [VENDOR-ASSIGNED EMOJI X] which
> the vendors can then associate with glyphs as they see fit; and as
> emoji seem to evolve over time they would be free to modify and
> reassign glyphs as they like because the Unicode Standard would not
> define the meaning or glyph for any characters in this plane).

Nobody disagreed and I think it’s a splendid suggestion.  If anyone is 
discussing drafting a proposal to accomplish this, please include me in 
the “cc”.




Re: Encoding italic

2019-02-11 Thread James Kass via Unicode



Philippe Verdy wrote,

>>> case mappings,
>>
>> Adjust them as needed.
>
> Not so easy: case mappings cannot be fixed. They are stabilized in 
Unicode.

> You would need special casing rules under a specific "locale" for maths.

In BabelPad, I can select a string of text and convert it to math 
italics.  If upper case italics is desired, it would be necessary to 
select the text, convert it back to ASCII, convert it to upper case, and 
convert that upper case to math italics.  Casing the math alphanumerics 
doesn’t seem to present any problem.  Any program could make those 
interim steps invisible to the end user.


(With VS14, BabelTags mark-up, or new control character(s)—casing isn’t 
even an issue.)




Re: Encoding italic

2019-02-11 Thread James Kass via Unicode



On 2019-02-11 6:42 PM, Kent Karlsson wrote:

> Using a VS to get italics, or anything like that approach, will
> NEVER be a part of Unicode!

Maybe the crystal ball is jammed.  This can happen, especially on the 
older models which use vacuum tubes.


Wanting a second opinion, I asked the magic 8 ball:
“Will VS14 italic be part of Unicode?”
The answer was:
“It is decidedly so.”



Re: Encoding italic

2019-02-10 Thread James Kass via Unicode



Philippe Verdy wrote,

>> ...[one font file having both italic and roman]...
> The only case where it happens in real fonts is for the mapping of
> Mathematical Symbols which have a distinct encoding for some
> variants ...

William Overington made a proof-of-concept font using the VS14 character 
to access the italic glyphs which were, of course, in the same real 
font.  Which means that the developer of a font such as Deja Vu Math TeX 
Gyre could set up an OpenType table mapping the Basic Latin in the font 
to the italic math letter glyphs in the same font using the VS14 
characters.  Such a font would work interoperably on modern systems.  
Such a font would display italic letters both if encoded as math 
alphanumerics or if encoded as ASCII plus VS14.  Significantly, the 
display would be identical.


> ...[math alphanumerics]...
> These were allowed in Unicode because of their specific contextual
> use as distinctive symbols from known standards, and not for general
> use in human languages

They were encoded for interoperability and round-tripping because they 
existed in character sets such as STIX.  They remain Latin letter form 
variants.  If they had been encoded as the variant forms which 
constitute their essential identity it would have broken the character 
vs. glyph encoding model of that era.  Arguing that they must not be 
used other than for scientific purposes is just so much semantic 
quibbling in order to justify their encoding.


Suppose we started using the double struck ASCII variants on this list 
in order to note Unicode character numbers such as 𝕌+𝔽𝔼𝔽𝔽 or 
𝕌+𝟚𝟘𝟞𝟘?  Hexadecimal notation is certainly math and Unicode can be 
considered a science.  Would that be “math abuse” if we did it?  (Is 
linguistics not a science?)


> (because these encodings are defective and don't have the necessary
> coverage, notably for the many diacritics,

The combining diacritics would be used.

> case mappings,

Adjust them as needed.

> and other linguisitic, segmentation and layout properties).
>
> The same can be said about superscript/subscript variants,
> ... : they have specific use and not made for general purpose texts ...

So people who used ISO-8859-1 were not allowed to use the superscript 
digits therein for marking footnotes?  Those superscript digits were 
reserved by ISO-8859-1 only for use by math and science?


MATHEMATICAL ITALIC CAPITAL A
Decomposition mapping:  U+0041
Binary properties:  Math, Alphabetic, Uppercase, Grapheme Base, ...

SUPERSCRIPT TWO
Decomposition mapping:  U+0032
Binary properties:  Grapheme Base

MODIFIER LETTER SMALL C
Decomposition mapping:  U+0063
Binary properties:  Alphabetic, Lowercase, Grapheme Base, ...



Re: Encoding italic

2019-02-09 Thread James Kass via Unicode



Martin J. Dürst wrote,

>> Isn't that already the case if one uses variation sequences to choose
>> between Chinese and Japanese glyphs?
>
> Well, not necessarily. There's nothing prohibiting a font that includes
> both Chinese and Japanese glyph variants.

Just as there’s nothing prohibiting a single font file from including 
both roman and italic variants of Latin characters.




Re: Encoding italic

2019-02-08 Thread James Kass via Unicode



Asmus Freytag wrote,

> You are still making the assumption that selecting a different glyph for
> the base character would automatically lead to the selection of a 
different

> glyph for the combining mark that follows. That's an iffy assumption
> because "italics" can be realized by choosing a separate font 
(typographically,

> italics is realized as a separate typeface).
>
> There's no such assumption built into the definition of a VS. At 
best, inside
> the same font, there may be an implied ligature, but that does not 
work if

> there's an underlying font switch.

Midstream font switching isn’t a user option in most plain-text 
applications, although there can be some font substitution happening at 
the OS level.  Any combining mark must apply to its base letter glyph, 
even after a base letter glyph has been modified.


More sophisticated editors, like BabelPad, allow users to select 
different fonts for different ranges of Unicode.  If a user selects font 
X for ASCII and font Y for combining marks, then mark positioning is 
already broken.


If the user selects Times New Roman for both ASCII and combining marks, 
then no font switching is involved.  The Times New Roman type face 
includes italic letter form variants.  Any application sharp enough to 
know that the italic letter form variants are stored in a different 
computer *file* should be clever enough to apply mark positioning 
accordingly.  And any single font file which includes italic letters and 
maps them with VS14 would avoid any such issues altogether.




Re: Encoding italic

2019-02-08 Thread James Kass via Unicode



William,

Rather than having the user insert the VS14 after every character, the 
editor might allow the user to select a span of text for italicization.  
Then it would be up to the editor/app to insert the VS14s where appropriate.


For Andrew’s example of “fête”, the user would either type the string:
“f” + “ê” + “t” + “e”
or the string:
“f” + “e” + ◌̂ (U+0302 COMBINING CIRCUMFLEX ACCENT) + “t” + “e”.

If the latter, the application would insert VS14 characters after the 
“f”, “e”, “t”, and “e”.  The application would not insert a VS14 after 
the combining circumflex — because the specification does not allow VS 
characters after combining marks, they may only be used on base characters.


In the first ‘spelling’, since the specifications forbid VS characters 
after any character which is not a base character (in other words, not 
after any character which has a decomposition, such as “ê”) — the 
application would first need to convert the string to the second 
‘spelling’, and proceed as above.  This is known as converting to NFD.


So in order for VS14 to be a viable approach, any application would ① 
need to convert any selected span to NFD, and ② only insert VS14 after 
each base character.  And those are two operations which are quite 
possible, although they do add slightly to the programmer’s burden.  I 
don’t think it’s a “deal-killer”.
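The two operations ① and ② amount to a short routine in any environment with a Unicode library.  A minimal Python sketch of the hypothetical scheme (VS14 is U+FE0E; this is not a sanctioned variation sequence, just an illustration of the approach under discussion):

```python
import unicodedata

VS14 = "\uFE0E"  # VARIATION SELECTOR-14

def italicize(text):
    # Step ①: normalize to NFD so precomposed letters like "ê"
    # are split into base letter + combining mark first.
    decomposed = unicodedata.normalize("NFD", text)
    # Step ②: append VS14 only after base characters, never
    # after combining marks.
    return "".join(
        ch if unicodedata.combining(ch) else ch + VS14
        for ch in decomposed
    )
```

So italicizing “fête” yields f + VS14, e + VS14, combining circumflex, t + VS14, e + VS14, with the mark still attached to its base.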


Of course, the user might insert VS14s without application assistance.  
In which case hopefully the user knows the rules.  The worst case 
scenario is where the user might insert a VS14 after a non-base 
character, in which case it should simply be ignored by any 
application.  It should never “break” the display or the processing; it 
simply makes the text for that document non-conformant.  (Of course 
putting a VS14 after “ê” should not result in an italicized “ê”.)


Cheers,

James



Re: Encoding italic

2019-02-05 Thread James Kass via Unicode



William Overington wrote,

> Well, a proposal just about using VS14 to indicate a request for an
> italic version of a glyph in plain text, including a suggestion of to
> which characters it could apply, would test whether such a proposal
> would be accepted to go into the Document Register for the Unicode
> Technical Committee to consider or just be deemed out of scope and
> rejected and not considered by the Unicode Technical Committee.

As long as “italics in plain-text” is considered out-of-scope by 
Unicode, any proposal for handling italics in plain-text would probably 
be considered out-of-scope, as well.  But I could be wrong and wouldn’t 
mind seeing a proposal.




Re: Encoding italic

2019-02-04 Thread James Kass via Unicode



Philippe Verdy responded to William Overington,

> the proposal would contradict the goals of variation selectors and would
> pollute ther variation sequences registry (possibly even creating 
conflicts).
> And if we admit it for italics, than another VSn will be dedicated to 
bold,

> and another for monospace, and finally many would follow for various
> style modifiers.
> Finally we would no longer have enough variation selectors for all 
requests).


There are 256 variation selector characters.  Any use of variation 
sequences not registered by Unicode would be non-conformant.


William’s suggestion of floating a proposal for handling italics with 
VS14 might be an example of the old saying about “putting the cart 
before the horse”.  Any preliminary proposal would first have to clear 
the hurdle of the propriety of handling italic information at the 
plain-text level.  Such a proposal might list various approaches for 
accomplishing that, if that hurdle can be surmounted.




Re: Ancient Greek apostrophe marking elision

2019-02-04 Thread James Kass via Unicode



On 2019-01-28 8:58 PM, Richard Wordingham wrote:
> On Mon, 28 Jan 2019 03:48:52 +
> James Kass via Unicode  wrote:
>
>> It’s been said that the text segmentation rules seem over-complicated
>> and are probably non-trivial to implement properly.  I tried your
>> suggestion of WORD JOINER U+2060 after tau ( γένοιτ⁠’ ἄν ), but it
>> only added yet another word break in LibreOffice.
>
> I said we *don't* have a control that joins words.  The text of TUS
> used to say we had one in U+2060, but that was removed in 2015.  I
> pleaded for the retention of this functionality in document
> L2/2015/15-192, but my request was refused.  I pointed out in ICU
> ticket #11766 that ICU's Thai word breaker retained this facility. ...

Sorry for sounding obtuse there.  It was your *post* which suggested the 
use of WORD JOINER.  You did clearly assert that it would not work.  So, 
human nature, I /had/ to try it and see.


It. did. not. work.  (No surprise.)  But it /should/ have worked. It’s a 
JOINER, for goodness sake!


When the author/editor puts any kind of JOINER into a text string, 
what’s the intent?  What’s the point of having a JOINER that doesn’t?


Recently I put a ZWJ between the “c” and the “t” in the word 
“Respec‍tfully” as an experiment.  Spellchecker flagged both “respec” 
and “tfully” as being misspelt, which they probably are.  A ZWNJ would 
have been used if there had been any desire for the string to be *split* 
there, e.g., to forbid formation of a discretionary ligature.  Instead 
the ZWJ was inserted, signalling authorial intent that a ‘more joined’ 
form of the “c-t” substring was requested.


Text a man has JOINED together, let not algorithm put asunder.



Re: Encoding italic

2019-02-01 Thread James Kass via Unicode



On 2019-01-31 3:18 PM, Adam Borowski via Unicode wrote:

> They're only from a spammer's point of view.

Spammers need love, too.  They’re just not entitled to any.



Re: Encoding italic

2019-01-31 Thread James Kass via Unicode



David Starner wrote,

> The choice of using single-byte character sets isn't always voluntary.
> That's why we should use ISO-2022, not Unicode. Or we can expect
> people to fix their systems. What systems are we talking about, that
> support Unicode but compel you to use plain text? The use of Twitter
> is surely voluntary.

This marketing-related web page,

https://litmus.com/blog/best-practices-for-plain-text-emails-a-look-at-why-theyre-important

...lists various reasons for using plain-text e-mail.  Here’s an excerpt.

“Some people simply prefer it. Plain and simple—some people prefer text 
emails. ... Some users may also see HTML emails as a security and 
privacy risk, and choose not to load any images and have visibility over 
all links that are included in an email. In addition, the increased 
bandwidth that image-heavy emails tend to consume is another driver of 
why users simply prefer plain-text emails.”


Besides marketing, there’s also newsletters and e-mail discussion 
groups.  Some of those discussion groups are probably scholarly. Anyone 
involved in that would likely embrace ‘super cool Unicode text magic’ 
and it’s surprising if none of them have stumbled across the math 
alphanumerics yet.


A web search for the string “plain text only” leads to all manner of 
applications for which searchers are trying to control their 
environments.  There’s all kinds of reasons why some people prefer to 
use plain-text, it’s often an informed choice and it isn’t limited to 
e-mail.


It’s true that people don’t have to use Twitter.  People don’t have to 
turn on their computers, either.




Re: Encoding italic

2019-01-31 Thread James Kass via Unicode



David Starner wrote,

> Emoji, as have been pointed out several times, were in the original
> Unicode standard and date back to the 1980s; the first DOS code
> page has smileys at 0x01 and 0x02.

That's disingenuous.



Re: Encoding italic

2019-01-30 Thread James Kass via Unicode



David Starner wrote,

>> ... italics, bold, strikethrough, and underline in plain-text
>
> Okay? Ed can do that too, along with nano and notepad. It's called
> HTML (TeX, Troff). If by plain-text, you mean self-interpeting,
> without external standards, then it's simply impossible.

HTML source files are in plain-text.  Hopefully everyone on this list 
understands that and has already explored the marvelous benefits offered 
by granting users the ability to make exciting and effective page 
layouts via any plain-text editor.  HTML is standard and interchangeable.


As Tex Texin observed, differences of opinion as to where we draw the 
line between text and mark-up are somewhat ideological.  If a compelling 
case for handling italics at the plain-text level can be made, then the 
fact that italics can already be handled elsewhere doesn’t matter.  If a 
compelling case cannot be made, there are always alternatives.


As for use of other variant letter forms enabled by the math 
alphanumerics, the situation exists.  It’s an interesting phenomenon 
which is sometimes worthy of comment and relates to this thread because 
the math alphanumerics include italics.  One of the web pages referring 
to third-party input tools calls the practice “super cool Unicode text 
magic”.




Re: Encoding italic

2019-01-29 Thread James Kass via Unicode



Doug Ewell wrote,

> I can't speak for Andrew, but I strongly suspect he implemented this as
> a proof of concept, not to declare himself the Maker of Standards.

BabelPad also offers plain-text styling via math-alpha conversion, 
although this feature isn’t newly added.  Users interested in seeing how 
plain-text italics might work can try out the stateful approach using 
tags contrasted with the character-by-character approach using 
math-range italic letters.  (Of course, the math-range stuff is already 
being interchanged on the WWW, whilst the tagging method does not yet 
appear to be widely supported.)


A few miles upthread, ‘where are the third-party developers’ was asked.  
‘Everywhere’ is the answer.  Since third-party developers have to 
subsist on the crumbs dropped by the large corps, they tend to be 
responsive to user needs and requests.




Re: Encoding italic

2019-01-29 Thread James Kass via Unicode



On 2019-01-29 5:10 PM, Doug Ewell via Unicode wrote:

I thought we had established that someone had mentioned it on this list,
at some time during the past three weeks. Can someone look up what post
that was? I don't have time to go through scores of messages, and there
is no search facility.

http://www.unicode.org/mail-arch/unicode-ml/y2019-m01/0209.html


Re: Ancient Greek apostrophe marking elision

2019-01-28 Thread James Kass via Unicode



On 2019-01-29 1:55 AM, Mark E. Shoulson via Unicode wrote:

I guess "Suck it up and deal with it."  And that may indeed be the answer.


It would certainly make for shorter and simpler FAQ pages, anyway.



Re: Ancient Greek apostrophe marking elision

2019-01-27 Thread James Kass via Unicode



On 2019-01-28 7:31 AM, Mark Davis ☕️ via Unicode wrote:
Expecting people to type in hard-to-find invisible characters just to 
correct double-click is not a realistic expectation.


True, which is why such entries, when consistent, are properly handled 
at the keyboard driver level.  It's a presumption that Greek classicists 
are already specifying fonts and using dedicated keyboard drivers.  
Based on the description provided by James Tauber, it should be 
relatively simple to make the keyboard insert some kind of joiner before 
U+2019 if it follows a Greek letter. This would not be visible to the 
end-user.


This approach would also mean that plain-text, which has no language 
tagging mechanism, would "get it right" cross-platform, cross-applications.
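A keyboard driver of the kind described could be approximated by post-processing the input buffer.  A hypothetical sketch (the choice of U+2060 WORD JOINER as the joiner is illustrative; as noted elsewhere in these threads, current segmentation implementations may not honor it):

```python
import unicodedata

WJ = "\u2060"  # WORD JOINER

def add_joiner_before_apostrophe(text):
    # Insert a WORD JOINER before U+2019 whenever the preceding
    # character is a Greek letter, as a Greek keyboard driver
    # might do transparently; other text passes through unchanged.
    out = []
    for i, ch in enumerate(text):
        if (ch == "\u2019" and i > 0
                and unicodedata.name(text[i - 1], "").startswith("GREEK")):
            out.append(WJ)
        out.append(ch)
    return "".join(out)
```

The inserted joiner is invisible to the end user, and English text with U+2019 (e.g. “can’t”) is left alone.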




Re: Ancient Greek apostrophe marking elision

2019-01-27 Thread James Kass via Unicode



On 2019-01-27 11:38 PM, Richard Wordingham via Unicode wrote:

On Sun, 27 Jan 2019 19:57:37 +
James Kass via Unicode  wrote:


On 2019-01-27 7:09 PM, James Tauber via Unicode wrote:

In my original post, I asked if a language-specific tailoring of
the text segmentation algorithm was the solution but no one here
has agreed so far.

If there are likely to be many languages requiring exceptions to the
segmentation algorithm wrt U+2019, then perhaps it would be better to
establish conventions using ZWJ/ZWNJ and adjust the algorithm
accordingly so that it would be cross-languages.  (Rather than
requiring additional and open ended language-specific tailorings.) (I
inserted several combinations of ZWJ/ZWNJ into James Tauber's
example, but couldn't improve the segmentation in LibreOffice,
although it was possible to make it worse.)

If you look at TR29, you will see that ZWJ should only affect word
boundaries for emoji.  ZWNJ shall have no effect.  What you want is a
control that joins words, but we don't have that.

Richard.



(https://unicode.org/reports/tr29/)

It’s been said that the text segmentation rules seem over-complicated 
and are probably non-trivial to implement properly.  I tried your 
suggestion of WORD JOINER U+2060 after tau ( γένοιτ⁠’ ἄν ), but it only 
added yet another word break in LibreOffice.


The problem may stem from the fact that WORD JOINER is supposed to be 
treated as though it were a zero-width no-break space.  IOW it is a 
*space*, and as a space it indicates a word break.  That doesn’t seem right.


Instead of treating WORD JOINER as a SPACE, why not treat it as a WORD 
JOINER?  It could save a lot of problems wrt undesirable string 
segmentation in addition to possibly minimizing future language-specific 
tailoring and easing the burden on implementers.




Re: Encoding italic

2019-01-27 Thread James Kass via Unicode



On 2019-01-27 11:44 PM, Philippe Verdy wrote:

> You're not very explicit about the Tag encoding you use for these styles.

This bold new concept was not mine.  When I tested it 
here, I was using the tag encoding recommended by the developer.


> Of course it must not be a language tag so the introducer is not 
U+E0001, or a cancel-all tag so it
> is not prefixed by U+E007F   It cannot also use letter-like, 
digit-like and hyphen-like tag characters
> for its introduction.  So probably you use some prefix in 
U+E0002..U+E001F and some additional tag
> (tag "I" for italic, tag "B" for bold, tag "U" for underline, tag "S" 
for strikethough?) and the cancel

> tag to return to normal text (terminate the tagged sequence).

Yes, U+E0001 remains deprecated and its use is strongly discouraged.

> Or maybe you just use standard HTML encoding by adding U+E0000 to 
each character of the HTML
> tag syntax (including attributes and close tags, allowing embedding?) 
So you use the "<" and ">" tag
> characters (possibly also the space tag U+E0020, or TAB tag U+E0009 
for separating attributes and the
> quotation tags for attribute values)?  Is your proposal also allowing 
the embedding of other HTML

> objects (such as SVG)?

AFAICT, this beta release supports the tag sequences <i>, <b>, 
<s>, & <u> (expressed here in ASCII).  I don’t know if the 
software developer has plans to expand the enhancements in the future.

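The mapping Philippe describes, shifting ASCII markup into the Tags block at U+E0000..U+E007F, is mechanical in either direction.  A sketch (the tag vocabulary itself is the software developer's choice, not anything Unicode defines):

```python
def to_tag_chars(markup):
    # Shift each ASCII character into the Tags block by adding
    # 0xE0000, e.g. "<" (U+003C) becomes TAG LESS-THAN SIGN (U+E003C).
    return "".join(chr(0xE0000 + ord(c)) for c in markup)

def from_tag_chars(text):
    # Recover the ASCII form; non-tag characters pass through.
    return "".join(
        chr(ord(c) - 0xE0000) if 0xE0000 <= ord(c) <= 0xE007F else c
        for c in text
    )
```

Renderers unaware of the convention simply see default-ignorable characters, which is what makes the scheme plausible as a plain-text carrier.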

> And what is then the interest compared to standard HTML (it is not 
more compact, ...


This was one of the ideas which surfaced earlier in this thread. Some 
users have expressed an interest in preserving, for example, italics in 
plain-text and are uncomfortable using the math alphanumerics for this, 
although the math alphanumerics seem well qualified for the purpose.  
One of the advantages given for this approach earlier is that it can be 
made to work without any official sanction and with no action necessary 
by the Consortium.


> I bet in fact that all tag characters are most often restricted in 
text input forms, and will be

> silently discarded or the whole text will be rejected.

In this e-mail, I used the tags <b> & </b> around the word “bold” in the 
first sentence of my reply in order to test your bet.


> We were told that these tag characters were deprecated, and in fact 
even their use for language
> tags has not found any significant use except some trials (but there 
are now better technologies
> available in lot of softwares, APIs and services, and application 
design/development tools, or

> document editing/publishing tools).

Indeed, these tags were deprecated.  At the time the tags were 
deprecated, there was such sorrow on this list that some list members 
were even inspired to compose haiku lamenting their passing and did post 
those haiku to this list.  Now, thanks to emoji requirements, many of 
those tags are experiencing a resurrection/renaissance.  I wonder if 
anyone is composing limericks in joyful celebration…




Re: Encoding italic

2019-01-27 Thread James Kass via Unicode



A new beta of BabelPad has been released which enables input, storing, 
and display of italics, bold, strikethrough, and underline in plain-text 
using the tag characters method described earlier in this thread.  This 
enhancement is described in the release notes linked on this download page:


http://www.babelstone.co.uk/Software/index.html



Re: Ancient Greek apostrophe marking elision

2019-01-27 Thread James Kass via Unicode



On 2019-01-27 7:09 PM, James Tauber via Unicode wrote:
In my original post, I asked if a language-specific tailoring of the 
text segmentation algorithm was the solution but no one here has 
agreed so far.
If there are likely to be many languages requiring exceptions to the 
segmentation algorithm wrt U+2019, then perhaps it would be better to 
establish conventions using ZWJ/ZWNJ and adjust the algorithm 
accordingly so that it would be cross-languages.  (Rather than requiring 
additional and open ended language-specific tailorings.) (I inserted 
several combinations of ZWJ/ZWNJ into James Tauber's example, but 
couldn't improve the segmentation in LibreOffice, although it was 
possible to make it worse.)


Re: Ancient Greek apostrophe marking elision

2019-01-27 Thread James Kass via Unicode



On 2019-01-27 3:08 PM, Tom Gewecke via Unicode wrote:
I think the Unicode Hawaiian ʻokina is supposed to be U+02BB (instead 
of U+02BC).

notes for U+02BB
* typographical alternate for 02BD or 02BF
* used in Hawai'ian orthography as 'okina (glottal stop)


Re: Ancient Greek apostrophe marking elision

2019-01-26 Thread James Kass via Unicode



Richard Wordingham responded to Michael Everson,

>> I’ll be publishing a translation of Alice into Ancient Greek in due
>> course. I will absolutely only use U+2019 for the apostrophe. It
>> would be wrong for lots of reasons to use U+02BC for this.
>
> Please list them.

Let's see the list of reasons why U+02BC should be used first.



Re: Ancient Greek apostrophe marking elision

2019-01-26 Thread James Kass via Unicode



Richard Wordingham replied to Asmus Freytag,

>> To make matters worse, users for languages that "should" use U+02BC
>> aren't actually consistent; much data uses U+2019 or U+0027. Ordinary
>> users can't tell the difference (and spell checkers seem not
>> successful in enforcing the practice).
>
> That appears to contradict Michael Everson's remark about a Polynesian
> need to distinguish the two visually.

Does it?

U+02BC /should/ be used but ordinary users can't tell the difference 
because the glyphs in their displays are identical, resulting in much 
data which uses U+2019 or U+0027.  I don't see any contradiction.




Re: Ancient Greek apostrophe marking elision

2019-01-26 Thread James Kass via Unicode



Perhaps I'm not understanding, but if the desired behavior is to 
prohibit both line and word breaks in the example string, then...


In Notepad, replacing U+0020 with U+00A0 removes the line-break.
U+0020 ( δ’ αρχαια )
U+00A0 ( δ’ αρχαια )
U+202F ( δ’ αρχαια )
It also changes the advancement of the text cursor (Ctrl + arrows), 
suggesting that word/string selection would be as desired.  (U+202F also 
does this and may offer a more pleasing appearance to classicists by 
default.)
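The substitution described above can be sketched in a few lines of Python (the phrase "δ’ αρχαια" is the example from the thread; this only illustrates the character swap, not Notepad's rendering):

```python
# Swap U+0020 SPACE for a no-break space so the elision phrase
# "δ’ αρχαια" is not broken across lines or split on word selection.
phrase = "\u03b4\u2019 \u03b1\u03c1\u03c7\u03b1\u03b9\u03b1"  # δ’ αρχαια

nbsp  = phrase.replace("\u0020", "\u00a0")  # U+00A0 NO-BREAK SPACE
nnbsp = phrase.replace("\u0020", "\u202f")  # U+202F NARROW NO-BREAK SPACE
```

Either replacement leaves the visible text unchanged while removing the break opportunity.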


Wouldn't it be best to handle substitution of U+00A0 for U+0020 at the 
input method / keyboard driver level where appropriate, so that 
preferred apostrophe U+2019 can be used?




Re: Ancient Greek apostrophe marking elision

2019-01-26 Thread James Kass via Unicode



Mark Davis responded to Asmus Freytag,

>> breaking selection for "d'Artagnan" or "can't" into two is overly fussy.
>
> True, and that is not what U+2019 does; it does not break medially.

Mark Davis earlier posted this example,
> So something like "δ’ αρχαια" (picking a phrase at random) would have
> a word break after the delta.
If the user wanted to use the preferred character, U+2019, would using 
the no break space (U+00A0) after it resolve the word or line break 
issues?  Or possibly NNBSP (U+202F)?


It's a shame if users choose suboptimal characters over preferred 
characters because of what are essentially rendering/text selection 
issues.  IMO, it's better to use preferred characters in the long run.


(Users should file bug reports on applications which improperly medially 
break strings which include U+2019.)




Re: Ancient Greek apostrophe marking elision

2019-01-25 Thread James Kass via Unicode



On 2019-01-25 10:06 PM, Asmus Freytag via Unicode wrote:

> James, by now it's unclear whether your ' is 2019 or 02BC.

The example word "aren't" in the previous message used U+2019.  Sorry if 
I was unclear.


Re: Encoding italic

2019-01-25 Thread James Kass via Unicode



On 2019-01-26 12:18 AM, Asmus Freytag (c) responded:

On 1/25/2019 3:49 PM, Andrew Cunningham wrote:
>> Assuming some mechanism for italics is added to Unicode, when
>> converting between the new plain text and HTML there is insufficient
>> information to correctly convert to HTML. Many elements may have
>> italic styling and there would be no meta information in Unicode to
>> indicate the appropriate HTML element.
>
> So, we would be creating an interoperability issue.



What happens now when we convert plain-text to HTML?


Re: Ancient Greek apostrophe marking elision

2019-01-25 Thread James Kass via Unicode



For U+2019, there's a note saying 'this is the preferred character to 
use for apostrophe'.


Mark Davis wrote,

> When it is between letters it doesn't cause a word break, ...

Some applications don't seem to get that.  For instance, the 
spellchecker for Mozilla Thunderbird flags the string "aren" for 
correction in the word "aren’t", which suggests that users trying to use 
preferred characters may face uphill battles.
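The Thunderbird behavior can be reproduced with a deliberately naive tokenizer (an assumption for illustration; this is not Thunderbird's actual algorithm): a plain `\w+` scan splits "aren’t" at U+2019, even though UAX #29 word segmentation classes the apostrophe as MidLetter and keeps the word whole.

```python
import re

# A naive word tokenizer splits "aren’t" at U+2019 because the
# right single quotation mark is not a \w character, producing
# exactly the "aren" fragment that gets flagged for correction.
tokens = re.findall(r"\w+", "aren\u2019t")
```

A UAX #29-conformant segmenter would instead return the single word "aren’t".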




Re: Encoding italic

2019-01-24 Thread James Kass via Unicode



> Maybe I should have said emoji are fan-driven.

That works.  Here's the previous assertion rephrased:

  We should no more expect the conventional Unicode character encoding
  model to apply to emoji than we should expect the old-fashioned text
  ranges to become fan-driven.

And if we don't want the text ranges to become fan driven, as pointed 
out by Martin Dürst and others, we take a cautious and conservative 
approach to moving forward with the standard.


Veering back on-topic, the aversion to fan-driven encoding doesn't apply 
to encoding italics, although /fans/ would benefit.  There are 
pre-existing conventions for italics, and a scholar with the credentials 
of Victor Gaultney should be able to make a credible proposal for 
encoding them.  I hope we haven't overwhelmed him with a surplus of 
rhetoric.




Re: Encoding italic (was: A last missing link)

2019-01-24 Thread James Kass via Unicode



Andrew West wrote,

> Why should we not expect the conventional Unicode character encoding
> model to apply to emoji?

Remember when William Overington used to post about encoding colours, 
sometimes accompanied by novel suggestions about how they could be 
encoded or referenced in plain-text?


Here's a very polite reply from John Hudson from 2000,
http://unicode.org/mail-arch/unicode-ml/Archives-Old/UML024/1042.html
...and, over time, many of the replies to William Overington's colorful 
suggestions were less than polite.  But it was clear that colors were 
out-of-scope for a computer plain-text encoding standard.


So I don't expect the conventional model to apply to emoji because it 
didn't; if it had, they'd not have been encoded.  Since they're in 
there, the conventional model does not apply.  Of course, the 
conventions have changed along with the concept of what's acceptable in 
plain-text.


Since emoji are an open-ended evolving phenomenon, there probably has to 
be a provision for expansion.  Any idea about them having been a finite 
set overlooked the probability of open-endedness and the impracticality 
of having only the original subset covered in plain-text while additions 
would be banished to higher level protocols.


Thank you for the information about current emoji additions being 
unrelated to vendors.  I have to confess that I haven't kept up-to-date 
on the emoji.


Maybe I should have said that emoji are fan-driven.



Re: Encoding italic (was: A last missing link)

2019-01-24 Thread James Kass via Unicode



Andrew West wrote,

> ...
> http://www.unicode.org/L2/L2018/18208-white-wine-rgi.pdf), just an
> assertion that it would be a good idea if emoji users could add a
> colored swatch to an existing emoji to indicate what color they want
> it to represent (note that the colored characters do not change the
> color of the emoji they are attached to [before or after, depending
> upon whether you are speaking French or English dialect of emoji],
> they are just intended as a visual indication of what colour you wish
> the emoji was).

In order to simplify emoji processing, these should be stored in the 
data stream in logical order.  Whether these cool new characters become 
reordrant color blobs or not would depend upon language.  So, what we'd 
need is some way of indicating language in plain-text. Some kind of 
tagging mechanism.


AFAICT, the emoji repertoire is vendor-driven, just as the pre-Unicode 
emoji sets were vendor-driven.  Pre-Unicode, if a vendor came up with 
cool ideas for new emoji they added new characters to the PUA.  Now that 
emoji are standardized, when vendors come up with new ideas they put 
them in the emoji ranges in order to preserve the standardization factor 
and ensure interoperability.  (That's probably over-simplified and there 
are bound to be other factors involved.)


We should no more expect the conventional Unicode character encoding 
model to apply to emoji than we should expect the old-fashioned text 
ranges to become vendor-driven.




Re: Encoding italic

2019-01-23 Thread James Kass via Unicode



Nobody has really addressed Andrew West's suggestion about using the tag 
characters.


It seems conformant, unobtrusive, requiring no official sanction, and 
could be supported by third-partiers in the absence of corporate 
interest if deemed desirable.


One argument against it might be:  Whoa, that's just HTML.  Why not just 
use HTML?  SMH


One argument for it might be:  Whoa, that's just HTML!  Most everybody 
already knows about HTML, so a simple subset of HTML would be recognizable.


After revisiting the concept, it does seem elegant and workable. It 
would provide support for elements of writing in plain-text for anyone 
desiring it, enabling essential (or frivolous) preservation of 
editorial/authorial intentions in plain-text.


Am I missing something?  (Please be kind if replying.)

On 2019-01-20 10:35 AM, Andrew West wrote:


A possibility that I don't think has been mentioned so far would be to
use the existing tag characters (E0020..E007F). These are no longer
deprecated, and as they are used in emoji flag tag sequences, software
already needs to support them, and they should just be ignored by
software that does not support them. The advantages are that no new
characters need to be encoded, and they are flexible so that tag
sequences for start/end of italic, bold, fraktur, double-struck,
script, sans-serif styles could be defined. For example start and end
of italic styling could be defined as the tag sequences <i> and </i>
(E003C E0069 E003E and E003C E002F E0069 E003E).

Andrew
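The mechanics of the suggestion quoted above are simple enough to sketch: each ASCII character of a styling tag is shifted into the tag block at U+E0000..U+E007F (the example strings here are illustrative, not a defined convention):

```python
# Shift each ASCII character of a styling tag into the Unicode tag
# block (U+E0000..U+E007F), yielding default-ignorable sequences that
# tag-unaware software should simply pass over.
def to_tags(ascii_text: str) -> str:
    return "".join(chr(ord(c) + 0xE0000) for c in ascii_text)

ITALIC_ON  = to_tags("<i>")   # E003C E0069 E003E
ITALIC_OFF = to_tags("</i>")  # E003C E002F E0069 E003E

tagged = f"Call me {ITALIC_ON}Ishmael{ITALIC_OFF}."
```

In a nonsupporting display the tag characters would be invisible, leaving the plain ASCII text readable.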


Re: Encoding italic

2019-01-21 Thread James Kass via Unicode



David Starner wrote,

> You're emailing from Gmail, which has support for italics in email.

But I compose e-mails in BabelPad, which has support for far more than 
italics in HTML mail.  And I'm using Mozilla Thunderbird to send and 
receive text e-mail via the Gmail account.


And if I wanted to /display/ italics in a web page, I would create the 
source file in a plain-text editor.  (HTML mark-up is fairly easy to 
type with the ASCII keyboard.)


If I compose a text file in BabelPad, it can be opened in many rich-text 
applications and the information survives intact.  Unless I am foolish 
enough to edit the file in the rich-text application and file-save it.  
Because that mungs the plain-text file, and it can no longer be 
retrieved by the plain-text editor which created it.


>> ...third-party...
>
> Where are these tools?

BabelPad is an outstanding example.  Earlier in this discussion a web 
search found at least a handful of third-party tools devoted to 
liberating the math-alphas for Twitter users.


> The superscripts show a problem with multiple encoding; even if you
> think they should be Unicode superscripts, and they look like Unicode
> superscripts, they might be HTML superscripts. Same thing would happen
> with italics if they were encoded in Unicode.

Hmmm.  Rich-text styled italics might be copied into other rich-text 
applications, but they cannot be copied into plain-text apps.  If 
Unicode-enabled italics existed, plain-text italics could be copy/pasted 
into either rich-text or plain-text applications and survive intact.  So 
Unicode-enabled italics would be interoperable. Anyone concerned about 
interoperability would be well advised to go with plain-text.  I am, so 
I do.  When I can.


Kie eksistas fumo, tie eksistas fajro.  (Where there's smoke, there's fire.)



Re: Encoding italic

2019-01-20 Thread James Kass via Unicode



Responding to David Starner,

It’s true that most users can’t be troubled to take the extra time 
needed to insert any kind of special characters which aren’t covered by 
the keyboard.  Even the enthusiasts among us seldom take the trouble to 
include ‘proper’ quotes and apostrophes in e-mails — even for posting to 
specialized lists such as this one where other members might notice and 
appreciate the extra effort involved.  Even though /we/ know how to do 
it and have software installed to help us do it.


It’s also true that standard U.S. keyboards and drivers aren’t very 
helpful with diacritics.  Yet when we reply to list colleagues with 
surnames such as “Dürst” or “Bień”, we usually manage to get it right.  
Sure, the “reply” feature puts the surname into the response for us and 
the e-mail software adds the properly spelled names into our address 
books automatically.  But when we cite those colleagues in a post 
replying to some other list member, we typically take the time and 
trouble to write their names correctly.  Not only because we /can/, but 
because we /should/.


> How do you envision this working?

Splendidly!  (smile)  Social platforms, plain-text editors, and other 
applications do enhance their interfaces based on user demand from time 
to time.  User demand, at least on Twitter, seems established.  As 
pointed out previously in this discussion, that demand doesn’t seem to 
result in much “Chicago style” text (although I have personally observed 
some) and may only be a passing fad /for Twitter users/.  When corporate 
interests aren't interested, third-party developers develop tools.


> You've yet to demonstrate that interoperability is an actual problem.

Copy/pasting from a web page into a plain-text editor removes any 
italics content, which is currently expected behavior.  Opinions differ 
as to whether that represents mere format removal or a loss of meaning.  
Those who consider it as a loss of meaning would perceive a problem with 
interoperability.


Consider superscript/subscript digits as a similar styling issue. The 
Wikipedia page for Romanization of Chinese includes information about 
the Wade-Giles system’s tone marks, which are superscripted digits.


https://en.wikipedia.org/wiki/Romanization_of_Chinese

Copy/pasting an example from the page into plain-text results in “ma1, 
ma2, ma3, ma4”, although the web page displays the letters as italic and 
the digits as (italic) superscripts.  IMO, that’s simply wrong with 
respect to the superscript digits and suboptimal with respect to the 
italic letters.


> To expand on what Mark E. Shoulson said, to add new italics characters,
> you're going to need to not only copy all of Latin, but also Cyrillic ...

I quite agree that expanding atomic italic encoding is off the table at 
this point.  (And that italicized CJK ideographs are daft.)




Re: Encoding italic (was: A last missing link)

2019-01-20 Thread James Kass via Unicode



On 2019-01-20 10:49 PM, Garth Wallace wrote:
> I think the real solution is for Twitter to just implement basic
> styling and make this a moot point.


At which time it would only become a moot point for Twitter users.  
There's also Facebook and other on-line groups.  Plus scholars and 
linguists.  And interoperability.




Re: Encoding italic (was: A last missing link)

2019-01-19 Thread James Kass via Unicode



(In the event that a persuasive proposal presentation prompts the 
possibility of italics encoding...)

Possible approaches include:

1 - Liberating the italics from the Members Only Math Club
...which has been an ongoing practice since they were encoded.  It 
already works, but the set is incomplete and the (mal)practice is 
frowned upon.  Many of the older "shortcomings" of the set can now be 
overcome with combining diacritics.  These italics decompose to ASCII.


2 - Character level
Variation selectors work with today's tech.  Default ignorable property 
suggests that apps that don't want to deal with them won't.  Many see VS 
as pseudo-encoding.  Stripping VS leaves ASCII behind.


3 - Open/Close punctuation treatment
Stateful.  Works on ranges.  Not currently supported in plain-text. 
Could be supported in applications which can take a text string URL and 
make it a clickable link.  Default appearance in nonsupporting apps may 
resemble existing plain-text italic kludges such as slashes.  The ASCII 
is already in the character string.


4 - Leave it alone
This approach requires no new characters and represents the default 
condition.  ASCII.


-

Number 1 would require that anything not already covered would have to 
be eventually proposed and accepted, 2 would require no new characters 
at all, and 3 would require two control characters for starters.


As "food for thought" questions, if a persuasive case is presented for 
encoding italics, and excluding 4, which approach would have the least 
impact on the rich-text world?  Which would have the least impact on 
existing plain-text technology?  Which would be least likely to conflict 
with Unicode principles/encoding model?




Re: Encoding italic (was: A last missing link)

2019-01-19 Thread James Kass via Unicode



Victor Gaultney wrote,

> If however, we say that this "does not adequately consider the harm done
> to the text-processing model that underlies Unicode", then that exposes a
> weakness in that model. That may be a weakness that we have to accept for
> a variety of reasons (technical difficulty, burden on developers, UI
> impact, cost, maturity).

Unicode's character encoding principles and underlying text-processing 
model remain robust.  They are the foundation of modern computer text 
processing.  The goal of 푛푒 푝푙푢푠 푢푙푡푟푎¹ needs to accommodate 
the best expectations of the end users and the fact that the consistent 
approach of the model eases the software people's burdens by ensuring 
that effective programming solutions to support one subset or range of 
characters can be applied to the other subsets of the Unicode 
repertoire.  And that those solutions can be shared with other 
developers in a standard fashion.


Assigning properties to characters gives any conformant application 
clear instructions as to what exactly is expected as the app encounters 
each character in a string.  In simpler times, the only expectation was 
that the application would splat a glyph onto a screen (and/or sheet of 
paper) and store a binary string for later retrieval.  We've moved forward.


'Unicode encodes characters, not glyphs' is a core principle. There's a 
legitimate concern whenever anyone is perceived as heading into the 
general direction of turning the character encoding into a glyph 
registry, as it suggests a possible step backwards and might lead to a 
slippery slope.  For example, if italics are encoded, why not fraktur 
and Gaelic?²


The notion that any given system can't be improved is static.³ ("System" 
refers to Unicode's repertoire and coverage rather than its core 
principles.  Core principles are rock solid by nature.)


¹ /ne plus ultra/
² "Conversely, significant differences in writing style for the same 
script may be reflected in the bibliographical classification—for 
example, Fraktur or Gaelic styles for the Latin script. Such stylistic 
distinctions are ignored in the Unicode Standard, which treats them as 
presentation styles of the Latin script."  Ken Whistler, 
http://unicode.org/reports/tr24/
³ "Static" can be interpreted as either virtually catatonic or radio 
noise.  Either is applicable here.




Re: Encoding italic

2019-01-19 Thread James Kass via Unicode



On 2019-01-19 6:19 PM, wjgo_10...@btinternet.com wrote:

> It seems to me that it would be useful to have some codes that are
> ordinary characters in some contexts yet are control codes in others, ...

Italics aren't a novel concept.  The approach for encoding new 
characters is that  conventions for them exist and that people *are* 
exchanging them, people have exchanged them in the past, or that people 
demonstrably *need* to exchange them.


Excluding emoji, any suggestion or proposal whose premise is "It seems 
to me that it would be useful if ..." is doomed to be deemed out of 
scope for the standard.




Re: NNBSP

2019-01-19 Thread James Kass via Unicode



Marcel Schneider wrote,

> When you ask for knowing the foundations and that knowledge is
> persistently refused, you end up believing that those foundations just
> can’t be told.
>
> Note, too, that I readily ceased blaming UTC, and shifted the blame
> elsewhere, where it actually belongs to.

Why not think of it as a learning curve?  Early concepts and priorities 
were made from a lower position on that curve.  We can learn from the 
past and apply those lessons to the future, but a post-mortem seldom 
benefits the cadaver.


Minutiae about decisions made long ago probably exist, but may be 
presently poorly indexed/organized and difficult to search/access. As 
the collection of encoding history becomes more sophisticated and the 
searching technology becomes more civilized, it may become easier to 
glean information from the archives.


(OT - A little humor, perhaps...
On the topic of Francophobia, it is true that some of us do not like 
dead generalissimos.  But most of us adore the French for reasons beyond 
Brigitte Bardot and bon-bons.  Cuisine, fries, dip, toast, curls, 
culture, kissing, and tarts, for instance.  Not to mention cognac and 
champagne!)




Re: Encoding italic

2019-01-17 Thread James Kass via Unicode



For web searching, using the math-string 푀푎푦푛푎푟푑 퐾푒푦푛푒푠 as 
the keywords finds John Maynard Keynes in web pages.  Tested this in 
both Google and DuckDuckGo.  Seems like search engines are accommodating 
actual user practices.  This suggests that social media data is possibly 
already being processed for the benefit of the users (and future 
historians) by software people who care about such things.
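The substitution those search engines are coping with is a simple codepoint shift onto the Mathematical Italic alphabet, with one irregularity (a sketch, covering only the Latin italic range):

```python
# Map ASCII letters onto Mathematical Italic (A-Z from U+1D434,
# a-z from U+1D44E).  The lone gap: italic small h was encoded
# earlier at U+210E PLANCK CONSTANT, so U+1D455 is reserved.
def to_math_italic(text: str) -> str:
    out = []
    for c in text:
        if "A" <= c <= "Z":
            out.append(chr(0x1D434 + ord(c) - ord("A")))
        elif c == "h":
            out.append("\u210e")  # fills the reserved U+1D455 slot
        elif "a" <= c <= "z":
            out.append(chr(0x1D44E + ord(c) - ord("a")))
        else:
            out.append(c)
    return "".join(out)

keywords = to_math_italic("Maynard Keynes")  # 푀푎푦푛푎푟푑 퐾푒푦푛푒푠
```

A search engine that folds these codepoints back to ASCII can match the math-string against ordinary text.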




Re: Encoding italic

2019-01-17 Thread James Kass via Unicode



On 2019-01-17 11:50 AM, Martin J. Dürst wrote:

> Most probably not. I think Asmus has already alluded to it, but in good
> typography, roman and italic fonts are considered separate.

So are Latin and Cyrillic fonts.  So are American English and Polish 
fonts, for that matter, even though they're both Latin based.  Times New 
Roman and Times New Roman Italic might be two separate font /files/ on 
computers, but they are the same type face.


The point I was trying to make WRT 256-glyph fonts is that they pre-date 
Unicode and I believe much of the "layering" is based on artifacts from 
that era.


Lead fonts were glyph based.  The technical concept of character came later.



Re: Encoding italic

2019-01-17 Thread James Kass via Unicode



On 2019-01-17 6:27 AM, Martin J. Dürst replied:

> ...
> So even if you can find examples where the presence or absence of
> styling clearly makes a semantic difference, this may or will not be
> enough. It's only when it's often or overwhelmingly (as opposed to
> occasionally) the case that a styling difference makes a semantic
> difference that this would start to become a real argument for plain
> text encoding of italics (or other styling information).

(also from PDF chapter 2,)
"Plain text is public, standardized, and universally readable."
The UCS is universal, which implies that even edge cases, such as failed 
or experimental historical orthographies, are preserved in plain text.


> ...
> I think most Unicode specialists have chosen to ignore this thread by
> this point.

Those not switched off by the thread title may well be exhausted and 
pressed for time because of the UTC meeting.


> ...
> Based by these data points, and knowing many of the people involved, my
> description would be that decisions about what to encode as characters
> (plain text) and what to deal with on a higher layer (rich text) were
> taken with a wide and deep background, in a gradually forming industry
> consensus.

(IMO) All of which had to deal with the existing font size limitations 
of 256 characters and the need to reserve many of those for other 
textual symbols as well as box drawing characters.  Cause and effect.  
The computer fonts weren't designed that way *because* there was a 
technical notion to create "layers".  It's the other way around.  (If 
I'm not mistaken.)


>> ..."Jackie Brown"...
> ...
> Also, for probably at least 90% of
> the readership, the style distinction alone wouldn't induce a semantic
> distinction, because most of the readers are not familiar with these
> conventions.

Proper spelling and punctuation seem to be dwindling in popularity, as 
well.  There's a percentage unable to make a semantic distinction 
between 'your' and 'you’re'.


> (If you doubt that, please go out on the street and ask people what
> italics are used for, and count how many of them mention film titles or
> ship names.)

Or the em-dash, en-dash, Mandaic letter ash, or Gurmukhi sign yakash.  
Sure, most street people have other interests.


> (And just while we are at it, it would still not be clear which of
> several potential people named "Jackie Brown" or "Thorstein Veblen"
> would be meant.)

Isn't that outside the scope of italics?  (winks)



Re: Encoding italic (was: A last missing link)

2019-01-16 Thread James Kass via Unicode



Victor Gaultney wrote,

> Treating italic like punctuation is a win for a lot of people:

Italic Unicode encoding is a win for a lot of people regardless of 
approach.  Each of the listed wins remains essentially true whether 
treated as punctuation, encoded atomically, or selected with VS.


> My main point in suggesting that Unicode needs these characters is that
> italic has been used to indicate specific meaning - this text is somehow
> special - for over 400 years, and that content should be preserved in
> plain text.

( http://www.unicode.org/versions/Unicode11.0.0/ch02.pdf )

"Plain text must contain enough information to permit the text to be 
rendered legibly, and nothing more."


The argument is that italic information can be stripped yet still be 
read.  A persuasive argument towards encoding would need to negate that; 
it would have to be shown that removing italic information results in a 
loss of meaning.


The decision makers at Unicode are familiar with italic use conventions 
such as those shown in "The Chicago Manual of Style" (first published in 
1906).  The question of plain-text italics has arisen before on this 
list and has been quickly dismissed.


Unicode began with the idea of standardizing existing code pages for the 
exchange of computer text using a unique double-byte encoding rather 
than relying on code page switching.  Latin was "grandfathered" into the 
standard.  Nobody ever submitted a formal proposal for Basic Latin.  
There was no outreach to establish contact with the user community -- 
the actual people who used the script as opposed to the "computer nerds" 
who grew up with ANSI limitations and subsequent ISO code pages.  
Because that's how Unicode rolled back then.  Unicode did what it was 
supposed to do WRT Basic Latin.


When someone points out that italics are used for disambiguation as well 
as stress, the replies are consistent.


"That's not what plain-text is for."  "That's not how plain-text 
works."  "That's just styling and so should be done in rich-text." 
"Since we do that in rich-text already, there's no reason to provide for 
it in plain-text."  "You can already hack it in plain-text by enclosing 
the string with slashes."  And so it goes.


But if variant letter form information is stripped from a string like 
"Jackie Brown", the primary indication that the string represents either 
a person's name or a Tarantino flick title is also stripped.  "Thorstein 
Veblen" is either a dead economist or the name of a fictional yacht in 
the Travis McGee series.  And so forth.


Computer text tradition aside, nobody seems to offer any legitimate 
reason why such information isn't worthy of being preservable in 
plain-text.  Perhaps there isn't one.


I'm not qualified to assess the impact of italic Unicode inclusion on 
the rich-text world as mentioned by David Starner.  Maybe another list 
member will offer additional insight or a second opinion.




Re: A last missing link for interoperable representation

2019-01-16 Thread James Kass via Unicode



Julian Bradfield wrote,

> Oh, and what about dropped initials? They have been used in both
> manuscripts and typography for many centuries - surely we must encode
> them?

Naa-aah, we just hack the full width presentation forms for that.

Drop Caps in Plain Text

(Whether they actually drop depends on the font, though.)



Re: Encoding italic

2019-01-15 Thread James Kass via Unicode



Responding to David Starner,

> I might complain about the people who claim to like plain text yet would
> only be happy with massive changes to it, though.

Most movie lovers welcomed talkies.

People are free to cling to their rotary phones as long as they like.  
They just can't press the pound sign.


> However, plain text can be used standalone, and it can be used inside
> programs and other formats.

That remains true even post-emoji.  How would italics change that?

> Dismissing the people who use Unicode in ways that aren't plain text
> is unfair and hurts your case.

It wasn't my intention to be dismissive, much, so point taken. 
Discussions like this one exist so that people can express concerns and 
share ideas towards resolutions.


> Adding italics to Unicode will complicate the implementation of all rich
> text applications that currently support italics.

Would there be any advantages to rich-text apps if italics were added to 
Unicode?  Is there any cost/benefit data?  You've made an assertion 
about complication to rich-text apps which I can neither confirm nor refute.


One possible advantage would be interoperability.  People snagging 
snippets of text from web pages or word processors and dropping data 
into their plain-text windows wouldn't be bamboozled by the unexpected.  
If computer text is getting exchanged, isn't it better when it can be 
done in a standard fashion?




Re: Encoding italic (was: A last missing link)

2019-01-15 Thread James Kass via Unicode



Victor Gaultney wrote,

> Use of variation selectors, a single character modifier, or combining
> characters also seem to be less useful options, as they act at the
> individual character level and are highly impractical. They also violate
> the key concept that italics are a way of marking a span of text as
> 'special' - not individual letters. Matched punctuation works the same
> way and is a good fit for italic.


The VS possibility would double the character count of any strings 
including them.  That may make it undesirable for groups like Twitter 
who have limits.  But math (mis)use doesn't affect the character count.  
If the VS method were to be used, the math alphanumerics might continue 
to be used where possible, at least by the Twitter users who already 
employ the math-alphas and whose posts make up the corpus of legacy data.
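The doubling is easy to see in a sketch (hypothetical throughout: no italic variation sequences are actually defined, and VS1 here is only a stand-in):

```python
# Hypothetical italic-via-variation-selector scheme: suffix each
# letter with VS1 (U+FE00) to mark it italic.  Every letter costs
# two code points, which is what doubles the character count.
VS1 = "\ufe00"

def pseudo_italic(text: str) -> str:
    return "".join(c + VS1 if c.isalpha() else c for c in text)

marked = pseudo_italic("Jackie Brown")
```

Stripping the selectors recovers the plain ASCII string, which is the property the thread highlights.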


Using VS arose in the parent thread as a way of avoiding the necessity 
of adding additional characters to the standard.  (But we don't seem to 
be running out of available code space.)  The purpose of VS is to 
preserve variant letter form distinctions in plain-text, which seems to 
apply to italics.  Further, VS is an existing mechanism which wouldn't 
be expected to impact searching and so forth on savvy systems.  (An 
opening/closing pair of control characters also shouldn't impact 
searching.)  Finally, VS already works in existing technology and there 
wouldn't be a long down-time waiting for updates to the standard and 
implementation of same. (Not that we should rush to judgment or 
"solutions" here, just that an ad-hoc "solution" is possible and could 
be implemented by third-parties.)


Concerns about statefulness in plain-text exist.  Treating "italic" as 
an opening/closing "punctuation" may help get around such concerns.  
IIRC, it was proposed that the Egyptian cartouche be handled that way.


Like emoji, people who don't like italics in plain text don't have to 
use them.




Re: Encoding italic

2019-01-15 Thread James Kass via Unicode



Enabling plain-text doesn't make rich-text poor.

People who regard plain-text with derision, disdain, or contempt have 
every right to hold and share opinions about what plain-text is *for* 
and in which direction it should be heading.  Such opinions should 
receive all the consideration they deserve.




Re: Encoding italic (was: A last missing link)

2019-01-15 Thread James Kass via Unicode



Although there probably isn't really any concerted effort to "keep 
plain-text mediocre", it can sometimes seem that way.


As we've been told repeatedly, just because something has been done over 
and over again doesn't mean that there's a precedent for it.


Using spans of text as a general indicator of rich-text seems reasonable 
at first blush.  But selected spans can also be copy/pasted (relocated), 
which is not stylistic at all.  Spans of text can be selected to apply 
casing, which is often seen as non-stylistic.  In applications such as 
BabelPad, spans of text can be converted to-and-from various forms of 
Unicode references and encodings.  Spans of text can be transliterated, 
moved, or deleted. In short, selecting a span of text only means that 
the user is going to apply some kind of process to that span.


Avant-garde enthusiasts are on the leading edge by definition. That's 
why they're known as trend setters.  Unicode exists because 
forward-looking people envisioned it and worked to make it happen. 
Regardless of one's perception of exuberance, Unicode turned out to be 
so much more than a fringe benefit.




Re: A last missing link for interoperable representation

2019-01-14 Thread James Kass via Unicode



Hans Åberg wrote,

> How about using U+0301 COMBINING ACUTE ACCENT: 푝푎푠푠푒́

Thought about using a combining accent.  Figured it would just display 
with a dotted circle but neglected to try it out first.  It actually 
renders perfectly here.  /That's/ good to know.  (smile)
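Hans Åberg's suggestion can be spelled out explicitly; here's a quick Python check (my own illustration, not from the thread) of the sequence and of what normalization does to it:

```python
import unicodedata

# Math-italic "passe" (MATHEMATICAL ITALIC SMALL P, A, S, S, E) followed
# by U+0301 COMBINING ACUTE ACCENT, per the suggestion above.
passe = "\U0001d45d\U0001d44e\U0001d460\U0001d460\U0001d452\u0301"

# NFKC folds the math-italic letters to ASCII via their compatibility
# decompositions and then composes e + U+0301 into U+00E9, so the word
# degrades gracefully to plain "passé".
print(unicodedata.normalize("NFKC", passe))
```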




Re: A last missing link for interoperable representation

2019-01-14 Thread James Kass via Unicode



Hello Martin, others...

> Blaming the problem on Unicode doesn't seem to be appropriate.

I don't consider that there's any problem with plain text users 
exchanging plain text.  I give Unicode /credit/ for being the foundation 
of that ability.  Anyone imagining that I'm casting blame is under a 
misconception.


There's plain text data out there stringing math alphanumerics into 
recognizable words.  It's being stored and shared and indexed.  I have 
no problem with that; I'm in favor of it.


(Everyone, please let's focus on Tex Texin's latest post.  Wish I'd sent 
this post before his...)


Best regards,

James Kass



Re: A last missing link for interoperable representation

2019-01-14 Thread James Kass via Unicode



Not a twitter user, don't know how popular the practice is, but here's a 
couple of links concerned with how to use bold or italics in Twitter 
plain text messages.


https://www.simplehelp.net/2018/03/13/how-to-use-bold-and-italicized-text-on-twitter/
https://mothereff.in/twitalics

Both pages include a form of caveat.  But the caveat isn't about the 
intended use of the math alphanumerics.


The first page includes the following text as part of a "tweet":
Just because you 헰헮헻 doesn’t mean you 혴혩혰혶혭혥 :)

And, as before, I have no idea how /popular/ the practice is.  But 
here's some more links:


(web page from 2013)
How To Write In Italics, Tweet Backwards And Use Lots Of Different ...
https://www.adweek.com/digital/twitter-font-italics-backwards/

(This is copy/pasted *as-is* from the web page to plain-text)
Bold and Italic Unicode Text Tool - 퐁퐨퐥퐝 풂풏풅 푖푡푎푙푖푐푠 - YayText
https://yaytext.com/bold-italic/
Super cool unicode text magic. Write 퐛퐨퐥퐝 and/or 푖푡푎푙푖푐 
updates on Facebook, Twitter, and elsewhere. Bold (serif) preview copy 
tweet.


Michael Maurino [emoji redacted-JK] on Twitter: "Can I make italics on 
twitter? 'cause ...

https://twitter.com/iron_stylus/status/281991180064022528?lang=en

Charlie Brooker on Twitter: "How do you do italics on this thing again?"
https://twitter.com/charltonbrooker/status/484623185862983680?lang=en

How to make your Facebook and Twitter text bold or italic, and other ...
https://boingboing.net/2016/04/10/yaytext-unicode-text-styling.html
Apr 10, 2016 - For years I've been using the Panix Unicode Text 
Converter to create ironic, weird or simply annoying text effects for 
use on Twitter, Facebook ...


How to change your Twitter font | Digital Trends
https://www.digitaltrends.com/.../now-you-can-use-bold-italics-and-other-fancy-fonts-...
Aug 14, 2013 - now you can use bold italics and other fancy fonts on 
twitter isaac ... or phrase into your Twitter text box, and there you 
have it: fancy tweets.


Twitter Fonts Generator (퓬퓸퓹픂 퓪퓷퓭 퓹퓪퓼퓽퓮) ― LingoJam
https://lingojam.com/TwitterFonts
You might have noticed that some users on Twitter are able to change the 
font ... them to seemingly make their tweet font bold, italic, or just 
completely different.
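Tools like these presumably work by offsetting ASCII letters into the Mathematical Alphanumeric Symbols block. A minimal sketch of that mapping (my own illustration, not any of these tools' actual code) for the italic range; note the irregular "h", which was encoded long before the block and so lives at U+210E:

```python
# Sketch: map ASCII letters into Mathematical Italic, U+1D434..U+1D467.
# Not any real tool's code; shown only to illustrate the technique.
def to_math_italic(text: str) -> str:
    out = []
    for ch in text:
        if ch == "h":
            out.append("\u210e")  # italic h predates the block: U+210E PLANCK CONSTANT
        elif "a" <= ch <= "z":
            out.append(chr(0x1D44E + ord(ch) - ord("a")))
        elif "A" <= ch <= "Z":
            out.append(chr(0x1D434 + ord(ch) - ord("A")))
        else:
            out.append(ch)  # digits, spaces, punctuation pass through unchanged
    return "".join(out)

print(to_math_italic("kakistocracy"))
```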






Re: A last missing link for interoperable representation

2019-01-13 Thread James Kass via Unicode



Julian Bradfield wrote,

> I have never seen a Unicode math alphabet character in email
> outside this list.

It's being done though.  Check this message from 2013 which includes the 
following, copy/pasted from the web page into Notepad:


혗혈혙혛 혖혍 헔햳햮헭.향햱햠햬햤햶햮햱햪  © ퟮퟬퟭퟯ 햠햫햤햷 햦햱햠햸  
헀헂헍헁헎햻.햼허헆/헺헿헮헹헲혅헴헿헮혆


https://apple.stackexchange.com/questions/104159/what-are-these-characters-and-how-can-i-use-them



Re: A last missing link for interoperable representation

2019-01-13 Thread James Kass via Unicode



Martin J. Dürst wrote,

> I'd say it should be conservative. As the meaning of that word
> (similar to others such as progressive and regressive) may be
> interpreted in various way, here's what I mean by that.
>
> It should not take up and extend every little fad at the blink of an
> eye. It should wait to see what the real needs are, and what may be
> just a temporary fad. As the Mathematical style variants show, once
> characters are encoded, it's difficult to get people off using them,
> even in ways not intended.

A conservative approach to progress is a sensible position for computer 
character encoders.  Taking a conservative approach doesn't necessarily 
mean being anti-progress.


Trying to "get people off" using already encoded characters, whether or 
not the encoded characters are used as intended, might give an 
impression of being anti-progress.


Unicode doesn't enforce any spelling or punctuation rules.  Unicode 
doesn't tell human beings how to pronounce strings of text or how to 
interpret them.  Unicode doesn't push any rules about splitting 
infinitives or conjugating verbs.


Unicode should not tell people how any written symbol must be 
interpreted.  Unicode should not tell people how or where to deploy 
their own written symbols.


Perhaps fraktur is frivolous in English text.  Perhaps its use would 
result in a new convention for written English which would enhance the 
literary experience.  Italics conventions which have only been around a 
hundred years or so may well turn out to be just a passing fad, so we 
should probably give it a bit more time.


Telling people they mustn't use Latin italics letter forms in computer 
text while we wait to see if the practice catches on seems flawed in 
concept.




Re: A last missing link for interoperable representation

2019-01-13 Thread James Kass via Unicode



Marcel Schneider wrote,

> There is a crazy typeface out there, misleadingly called 'Courier New',
> as if the foundry didn’t anticipate that at some point it would be better
> called "Courier Obsolete". ...

퐴푟푡 푛표푢푣푒푎푢 seems a bit 푝푎푠푠é nowadays, as well.

(Had to use mark-up for that “span” of a single letter in order to 
indicate the proper letter form.  But the plain-text display looks crazy 
with that HTML jive in it.)




Re: A last missing link for interoperable representation

2019-01-13 Thread James Kass via Unicode



Julian Bradfield replied,

>> Sounds like you didn't try it.  VS characters are default ignorable.
>
> By software that has a full understanding of Unicode. There is a very
> large world out there of software that was written before Unicode was
> dreamed of, let alone popular.

यदि आप किसी रोटरी फोन से कॉल कर रहे हैं, तो कृपया स्टार (*) दबाएं।
("If you are calling from a rotary phone, please press star (*).")

What happens with Devanagari text?  Should the user community refrain 
from interchanging data because 1980s era software isn't Unicode aware?




Re: A last missing link for interoperable representation

2019-01-12 Thread James Kass via Unicode



Mark E. Shoulson wrote,

> This discussion has been very interesting, really.  I've heard what I
> thought were very good points and relevant arguments from both/all
> sides, and I confess to not being sure which I actually prefer.

It's subjective, really.  It depends on how one views plain-text and 
one's expectations for its future.  Should plain-text be progressive, 
regressive, or stagnant?  Because those are really the only choices.  
And opinions differ.


Most of us involved with Unicode probably expect plain-text to be around 
for quite a while.  The figure bandied about in the past on this list is 
"a thousand years".  Only a society of mindless drones would cling to 
the past for a millennium.  So, many of us probably figure that 
strictures laid down now will be overridden as a matter of course, over 
time.


Unicode will probably be around for a while, but the barrier between 
plain- and rich-text has already morphed significantly in the relatively 
short period of time it's been around.


I became attracted to Unicode about twenty years ago.  Because Unicode 
opened up entire /realms/ of new vistas relating to what could be done 
with computer plain text.  I hope this trend continues.




Re: A last missing link for interoperable representation

2019-01-12 Thread James Kass via Unicode



On 2019-01-12 4:26 PM, wjgo_10...@btinternet.com wrote:
I have now made, tested and published a font, VS14 Maquette, that uses 
VS14 to indicate italic.


https://forum.high-logic.com/viewtopic.php?f=10&t=7831&p=37561#p37561



The italics don't happen in Notepad, but VS14 Maquette works splendidly 
in LibreOffice!  (Windows 7)  (In a *.TXT file)


Since the VS characters are supposed to be used with officially 
registered/recognized sequences, it's possible that Notepad isn't trying 
to implement the feature.


The official reception of the notion of using variant letter forms, such 
as italics, in plain-text is typically frosty.  So advancement of 
plain-text might be left up to third-party developers, enthusiasts, and 
the actual text users.  And there's nothing wrong with that.  (It's 
non-conformant, though, unless the VS material is officially 
recognized/registered.)


Non-Latin scripts, such as Khmer, may have their own traditions and 
conventions WRT special letter forms.  Which is why starting at VS14 and 
working backwards might be inadequate in the long run.


Khmer has letter forms called muul/moul/muol (not sure how to spell that 
one, but neither is anybody else).  It superficially resembles fraktur 
for Khmer.  Other non-Latin scripts may have a plethora of such 
forms/fonts/styles.




Re: A last missing link for interoperable representation

2019-01-12 Thread James Kass via Unicode



Asmus Freytag wrote,

> ...What this teaches you is that italicizing (or boldfacing)
> text is fundamentally related to picking out parts of your
> text in a different font.

Typically from the same typeface, though.

> So those screen readers got it right, except that they could
> have used one of the more typical notational conventions that
> the mathalphabetics are used to express (e.g. "vector" etc.),
> rather than rattling off the Unicode name.

WRT text-to-voice applications, such as "VoiceOver", I wonder how well 
they would do when encountering /any/ exotic text runs or characters.  
Like Yi, or Vai, or even an isolated CJK ideograph in otherwise Latin 
text.  For example:  "The Han radical # 72, which looks like '日', means 
'sun'."  Would the application "say" the character as a Japanese reader 
would expect to hear it?  Or in one of the Chinese dialects?  Or would 
the application just give the hex code point?


In an era where most of the states in my country no longer teach cursive 
writing in public schools, it seems unlikely that Twitter users (and so 
forth) will be clamoring for the ability to implement Chicago Style text 
properly on their cell phone screens.  (Many users would probably prefer 
to use the cell phone to order a Chicago style pizza.)  But, stranger 
things have happened.




Re: A last missing link for interoperable representation

2019-01-12 Thread James Kass via Unicode



Reading & writing & 'rithmatick...

This is a math formula:
a + b = b + a
... where the estimable "mathematician" used Latin letters from ASCII as 
though they were math alphanumeric variables.


This is an italicized word:
푘푎푘푖푠푡표푐푟푎푐푦
... where the "geek" hacker used Latin italics letters from the math 
alphanumeric range as though they were Latin italics letters.


Where's the harm?

FWIW, the math formula:
a + b ≠ 푏 + 푎
... becomes invalid if normalized NFKD/NFKC.  (Or if copy/pasted from an 
HTML page using marked-up ASCII into a plain-text editor.)


Yet the italicized word "kakistocracy" is still legible if normalized.  
If copy/pasted from an HTML page using the math alphanumeric characters, 
it survives intact.  If copy/pasted from marked-up ASCII, it's still 
legible.
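Both claims are easy to verify with Python's standard unicodedata module (my own illustration, not part of the original message):

```python
import unicodedata

# Math-italic "kakistocracy" (U+1D458, U+1D44E, ... from the Mathematical
# Alphanumeric Symbols block).
word = ("\U0001d458\U0001d44e\U0001d458\U0001d456\U0001d460\U0001d461"
        "\U0001d45c\U0001d450\U0001d45f\U0001d44e\U0001d450\U0001d466")

# The formula: ASCII a, b on the left; math-italic b, a on the right.
formula = "a + b \u2260 \U0001d44f + \U0001d44e"

# The italicized word folds to plain, still-legible ASCII...
print(unicodedata.normalize("NFKC", word))
# ...while the formula's distinct variables collapse into the same
# letters, turning a true statement into a false one.
print(unicodedata.normalize("NFKC", formula))
```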




Re: A last missing link for interoperable representation

2019-01-12 Thread James Kass via Unicode




Julian Bradford wrote,


* Bradfield, sorry.



Re: A last missing link for interoperable representation

2019-01-12 Thread James Kass via Unicode



Julian Bradford wrote,

"It does not work with much existing technology. Interspersing extra
codepoints into what is otherwise plain text breaks all the existing
software that has not been, and never will be updated to deal with
arbitrarily complex algorithms required to do Unicode searching.
Somebody who need to search exotic East Asian text will know that they
need software that understands VS, but a plain ordinary language user
is unlikely to have any idea that VS exist, or that their searches
will mysteriously fail if they use this snazzy new pseudo-plain-text
italicization technique"

Sounds like you didn't try it.  VS characters are default ignorable.

The first one is plain; the second has VS2 characters interspersed, 
including one after the final "t":

apricot
a︁p︁r︁i︁c︁o︁t︁
Notepad finds them both if you type the word "apricot" into the search box.
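"Default ignorable" means a search routine can simply drop variation selectors before matching. A minimal sketch of that idea (my own, not Notepad's actual logic, and covering only the variation-selector ranges rather than the full Default_Ignorable_Code_Point set):

```python
# Sketch: strip variation selectors (VS1..VS16 at U+FE00..U+FE0F and
# VS17..VS256 at U+E0100..U+E01EF) before comparing, so the styled and
# plain spellings of "apricot" match.
def strip_vs(s: str) -> str:
    return "".join(
        ch for ch in s
        if not (0xFE00 <= ord(ch) <= 0xFE0F or 0xE0100 <= ord(ch) <= 0xE01EF)
    )

plain = "apricot"
styled = "a\ufe01p\ufe01r\ufe01i\ufe01c\ufe01o\ufe01t\ufe01"  # VS2 after each letter
print(strip_vs(styled) == plain)
```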

"..."

Regardless of how you input italics in rich-text, you are putting italic 
forms into the display.


"I think the VS or combining format character approach *would* have
been a better way to deal with the mess of mathematical alphabets, ..."

I think so, too, but since I'm not a member of *that* user community, my 
opinion hasn't much value.  Plus VS characters were encoded after the 
math stuff.


"But for plain text, it's crazy."

Are you a member of the plain-text user community?


