Re: New Unicode Working Group: Message Formatting
Yes, thank you, that answers the question. Format rather than repertoire. Please note, though, that the example given of a localizable message string is also an example of a localized sentence.
On 2020-01-10 11:17 PM, Steven R. Loomis wrote:
James,
A localizable message string is one similar to those given in the example:
English: “The package will arrive at {time} on {date}.”
German: “Das Paket wird am {date} um {time} geliefert.”
The message string may contain any number of complete sentences, including zero (“Arrival: {time}”). The Message Format Working Group is to define the *format* of the strings, not their *repertoire*. That is, should the string be “Arrival: %s” or “Arrival: ${date}” or “Arrival: {0}”? Does that answer your question?
-- Steven R. Loomis | @srl295 | git.io/srl295
On Jan. 10, 2020, at 2:48 p.m., James Kass via Unicode wrote:
On 2020-01-10 9:55 PM, announceme...@unicode.org wrote:
But until now we have not had a syntax for localizable message strings standardized by Unicode.
What is the difference between “localizable message strings” and “localized sentences”? Asking for a friend.
Re: New Unicode Working Group: Message Formatting
* sentences
On 2020-01-10 10:48 PM, James Kass wrote:
On 2020-01-10 9:55 PM, announceme...@unicode.org wrote:
But until now we have not had a syntax for localizable message strings standardized by Unicode.
What is the difference between “localizable message strings” and “localized sentances”? Asking for a friend.
Re: New Unicode Working Group: Message Formatting
On 2020-01-10 9:55 PM, announceme...@unicode.org wrote:
But until now we have not had a syntax for localizable message strings standardized by Unicode.
What is the difference between “localizable message strings” and “localized sentances”? Asking for a friend.
Re: Long standing problem with Vedic tone markers and post-base visarga/anusvara
On 2020-01-04 12:50 PM, Richard Wordingham via Unicode wrote:
dev2: कः꣡ dev3: क꣡ः
Grantha: (1) 𑌕𑍧𑌃 (2) 𑌕𑌃𑍧
The second Grantha spelling is enabled by a HarfBuzz-only change to the USE categorisations. It treats Grantha visarga and spacing anusvara as though inpc=Top rather than inpc=Right. As I am using Ubuntu 16.04, this override isn't supported in applications that use the system HarfBuzz library, such as my email client.
We are now establishing incompatible Devanagari font-specific encodings fully compliant with TUS! This seems to be a very bad approach. And apparently it isn't limited to the Devanagari script. For the Grantha examples above, Grantha (1) displays much better here. It seems daft to put a spacing character between a base character and any mark which is supposed to combine with the base character.
Re: Long standing problem with Vedic tone markers and post-base visarga/anusvara
On 2020-01-02 1:04 AM, Richard Wordingham wrote in a thread deriving from this one:
> Have you found a definition of the ISCII handling of Vedic characters?
No. It would be helpful. ISCII apparently wasn't really used much. It would also be helpful to know the encoding order in any legacy ISCII data using the Vedic characters with respect to VISARGA/ANUSVARA. Although such legacy data seems unlikely, I'd expect VISARGA/ANUSVARA to be entered/stored post-syllable.
> I've been looking at Microsoft's specification of Devanagari character
> order. In
> https://docs.microsoft.com/en-us/typography/script-development/devanagari,
> the consonant syllable ends
>
> [N]+[A] + [< H+[<ZWNJ|ZWJ>] | {M}+[N]+[H]>]+[SM]+[(VD)]
>
> where
> N is nukta
> A is anudatta (U+0952)
> H is halant/virama
> M is matra
> SM is syllable modifier signs
> VD is vedic
>
> "Syllable modifier signs" and "vedic" are not defined. It appears that
> SM includes U+0903 DEVANAGARI SIGN VISARGA.
What action should Microsoft take to satisfy the needs of the user community?
1. No action, maintain status quo.
2. Swap SM and VD in the spec's ordering.
3. Make a new category PS (post-syllable) and move VISARGA/ANUSVARA there.
4. ?
What kind of impact would there be on existing data if Microsoft revised the ordering? Or should Unicode encode a new character like ZERO-WIDTH INVISIBLE DOTTED CIRCLE so that users can suppress unwanted and unexpected dotted circles by adding superfluous characters to the text stream?
> I note that even ग॒ः is
> given a dotted circle by HarfBuzz.
Same on Win 7. And (गः॒) breaks the mark positioning as expected.
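To put numbers on the "impact on existing data" question, a minimal Python sketch (the character sets here are illustrative samples, not the full Vedic repertoire) that counts how often each ordering occurs in a corpus:

import re

POST  = "\u0902\u0903"        # ANUSVARA, VISARGA
VEDIC = "\u0951\u0952\u0954"  # udatta, anudatta, acute -- a sample only

mark_first = re.compile("[%s][%s]" % (VEDIC, POST))  # tone mark first, visarga/anusvara last
post_first = re.compile("[%s][%s]" % (POST, VEDIC))  # visarga/anusvara first, tone mark last

def survey(corpus: str) -> dict:
    """Count both orderings so a spec revision can be weighed against real data."""
    return {"mark-first": len(mark_first.findall(corpus)),
            "post-first": len(post_first.findall(corpus))}

print(survey("\u0917\u0952\u0903 \u0917\u0903\u0952"))  # {'mark-first': 1, 'post-first': 1}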
Re: One encoding per shape (was Re: Long standing problem with Vedic tone markers and post-base visarga/anusvara)
On 2020-01-01 8:11 PM, James Kass wrote:
It’s too bad that ISCII didn’t accommodate the needs of Vedic Sanskrit, but here we are.
Sorry, that might be wrong to say. It's possible that it's Unicode's adaptation of ISCII that hinders Vedic Sanskrit.
One encoding per shape (was Re: Long standing problem with Vedic tone markers and post-base visarga/anusvara)
On 2020-01-01 11:17 AM, Richard Wordingham via Unicode wrote:
> That's exactly the sort of mess that jack-booted renderers are trying
> to minimise. Their principle is that there should be only one encoding
> per shape, though to be fair:
>
> 1) some renderers accept canonical equivalents.
> 2) tolerance may be allowed for ligating (ZWJ, ZWNJ, CGJ), collating
> (CGJ, SHY) and line-breaking controls (SHY, ZWSP, WJ).
> 3) Superseded chillu encodings are still supported.
There was never any need for atomic chillu form characters. The principle of only one encoding per shape is best achieved when every shape gets an atomic encoding. Glyph-based encoding is incompatible with Unicode character encoding principles. It’s too bad that ISCII didn’t accommodate the needs of Vedic Sanskrit, but here we are.
Re: Long standing problem with Vedic tone markers and post-base visarga/anusvara
A workaround until some kind of satisfactory adjustment is made might be to simply use COLON for VISARGA. Or...
VISARGA ⇒ U+02F8 MODIFIER LETTER RAISED COLON
ANUSVARA ⇒ U+02D9 DOT ABOVE
...as long as the font(s) included both those characters.
य॑ यॆ॑
य॑ं -- anusvara last
यॆ॑ं -- "
य॑: -- colon last
यॆ॑: -- "
य॑˸ -- raised colon modifier last
यॆ॑˸ -- "
य॑˙ -- spacing dot above last
यॆ॑˙ -- "
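A minimal sketch of that substitution in Python (display-only, and lossy by design, since it rewrites the stored text):

# Workaround mapping from the post above; suitable for display copies only.
FALLBACK = {
    0x0903: 0x02F8,  # DEVANAGARI SIGN VISARGA  -> MODIFIER LETTER RAISED COLON
    0x0902: 0x02D9,  # DEVANAGARI SIGN ANUSVARA -> DOT ABOVE
}

def to_fallback(text: str) -> str:
    return text.translate(FALLBACK)

print(to_fallback("\u092F\u0951\u0903"))  # य॑ः -> य॑˸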
Re: Long standing problem with Vedic tone markers and post-base visarga/anusvara
On 2019-12-21 6:27 AM, Shriramana Sharma via Unicode wrote:
However, even the simplest Vedic sequence (not involving Sama Vedic or multiple tone marker combinations) like दे॒वेभ्य॑ः throws up a dotted circle, and one is expected (see developer feedback in that bug report) to input the visarga before tone markers, hoping the software is intelligent enough to skip over the visarga (or anusvara) and place the tone marker over the preceding syllable correctly.
Why it is necessary to put the visarga first in input only to have to skip over it in shaping is beyond me.
य॔ यॆ॔
य॔ः -- visarga last
यॆ॔ः -- "
यः॔ -- visarga before accent (U+0954)
यॆः॔ -- "
य॑ यॆ॑
य॑ः -- visarga last
यॆ॑ः -- "
यः॑ -- visarga before svarita (U+0951)
यॆः॑ -- "
U+0951 and U+0954 have a canonical combining class of 230. Putting VISARGA (CCC=0) after those CCC=230 marks generates the dotted circle for VISARGA. Putting VISARGA before those CCC=230 marks generates the dotted circle for U+0954 but drops the dotted circle for U+0951. In both cases where VISARGA comes before, the mark positioning is broken. (Mangal font, Win 7)
As far as I can tell, the simplest solution would be for the Indic shaping engines to suppress the dotted circle for VISARGA (or ANUSVARA) where appropriate. Entering/storing VISARGA or ANUSVARA at the end of the syllable makes sense since that's where it goes, visually and logically.
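The combining-class values driving that behavior can be confirmed from Python's standard library:

import unicodedata

for cp in (0x0903, 0x0902, 0x0951, 0x0954):
    ch = chr(cp)
    print(f"U+{cp:04X} {unicodedata.name(ch)}: ccc={unicodedata.combining(ch)}")

# U+0903 DEVANAGARI SIGN VISARGA: ccc=0
# U+0902 DEVANAGARI SIGN ANUSVARA: ccc=0
# U+0951 DEVANAGARI STRESS SIGN UDATTA: ccc=230
# U+0954 DEVANAGARI ACUTE ACCENT: ccc=230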
Re: NBSP supposed to stretch, right?
On 2019-12-21 2:43 AM, Shriramana Sharma via Unicode wrote:
Ohkay and that's very nice meaningful feedback from actual developer+user interaction. So the way I look at this going forward is that we have four options:
1) With the existing single NBSP character, provide a software option to either make it flexible or inflexible, but this preference should be stored as part of the document and not the application settings, else shared documents would not preserve the layout intended by the creator.
5) Update the applications to treat NBSP correctly. Process legacy data based on date/time stamp (or metadata) appropriately and offer users the option to update their legacy data algorithmically using proper non-stretching space characters such as FIGURE SPACE.
- Options 1 and 5 have the advantage of not requiring the addition of yet more spacing characters to the Standard.
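The migration step in option 5 might look like this minimal Python sketch (the run-of-NBSP heuristic is an assumption, not part of the option as stated):

import re

# Runs of two or more NO-BREAK SPACEs were typically layout hacks; rewrite
# them with FIGURE SPACE (U+2007), which never stretches on justification.
NBSP_RUN = re.compile("\u00A0{2,}")

def migrate(text: str) -> str:
    return NBSP_RUN.sub(lambda m: "\u2007" * len(m.group()), text)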
Re: NBSP supposed to stretch, right?
From our colleague’s web site, http://jkorpela.fi/chars/spaces.html
“On web browsers, no-break spaces tended to be non-adjustable, but modern browsers generally stretch them on justification.”
Jukka Korpela then offers pointers about avoiding unwanted stretching. And:
“The change in the treatment of no-break spaces, though inconvenient, is consistent with changes in CSS specifications. For example, clause 7 Spacing of CSS Text Module Level 3 (Editor’s Draft 24 Jan. 2019) defines the no-break space, but not the fixed-width spaces, as a word-separator character, stretchable on justification.”
So it appears that there’s no interoperability problem with HTML. It seems that the widespread breakage which Asmus Freytag mentions is limited to legacy applications, such as Word, which persist in treating U+00A0 as the old “hard space”. It also appears that Microsoft tried and failed to correct the problem in Word. Perhaps they should try again.
Meanwhile, in the absence of anything from Unicode more explicit than already recommended by the Standard, Shriramana Sharma might be well advised to continue to lobby the respective software people. As more applications migrate towards the correct treatment of U+00A0, they are probably already running into interoperability problems with Microsoft Word and may well have already implemented solutions.
Re: NBSP supposed to stretch, right?
On 2019-12-17 12:50 AM, Shriramana Sharma via Unicode wrote: I would have gone and filed this as a LibreOffice bug since that's the software I use most, but when I found this is a cross-software problem, I thought it would be best to have this discussed and documented here (and in a future version of the standard). There's a bug report for the LibreOffice application here... https://bugs.documentfoundation.org/show_bug.cgi?id=41652 ...which shows an interesting history of the situation. One issue is whether to be Unicode compliant or MS-Word compliant. MS-Word had apparently corrected the bug with Word 2013 but had reverted to the incorrect behavior by the time Word 2016 rolled out. On that page it's noted that applications like InDesign, Firefox, TeX, and QuarkXPress handle U+00A0 correctly.
Re: HEAVY EQUALS SIGN
On 2019-12-18 12:42 PM, Marius Spix via Unicode wrote:
Unicode has a HEAVY PLUS SIGN (U+2795) and a HEAVY MINUS SIGN (U+2796). I wonder if a HEAVY EQUALS SIGN could complete that character set. This would allow emoji phrases like 👨➕🐈= ❤️ (man plus cat equals love) to look typographically better, when you replace the equals sign with a new HEAVY EQUALS SIGN character.
Thoughts?
Marius

👨 ➕ 🐈 ⚌ ❤️
Re: NBSP supposed to stretch, right?
U+0020 SPACE
U+00A0 NO-BREAK SPACE
These two characters are equal in every way except that one of them offers an opportunity for a line break and the other does not. If the above statement is true, then any conformant application must treat/process/display both characters identically.
Responding to Asmus Freytag,
> Now, if someone can show us that there are widespread implementations that
> follow the above recommendation and have no interoperability issues with HTML
> then I may change my tune.
Can anyone show us that there are widespread implementations which would break if they started following the above recommendation?
Quoting from this HTML basics page, http://www.htmlbasictutor.ca/non-breaking-space.htm
“Some browsers will ignore beyond the first instance of the non-breaking space.”
and
“Not all browsers acknowledge the additional instances of the non-breaking space.”
Fifteen or twenty years ago, we used NO-BREAK SPACE to indent paragraphs and to position text and graphics. Both of those uses are presently considered no-nos because some browsers collapse NBSPs and because there are proper ways now to accomplish these kinds of effects. The introduction of browsers which collapsed NBSP strings broke existing web pages. Perhaps the developers of those browsers decided that SPACE and NO-BREAK SPACE are indeed identical except for line breaking.
Are there any modern mark-up language uses of SPACE vs NO-BREAK SPACE which would be broken if they follow the above recommendation?
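The "equal in every way except breaking" claim is visible in the UCD itself; a quick check with Python's standard library:

import unicodedata

for cp in (0x0020, 0x00A0):
    ch = chr(cp)
    print(f"U+{cp:04X} {unicodedata.name(ch)}: "
          f"category={unicodedata.category(ch)}, "
          f"decomposition={unicodedata.decomposition(ch) or '(none)'}")

# U+0020 SPACE: category=Zs, decomposition=(none)
# U+00A0 NO-BREAK SPACE: category=Zs, decomposition=<noBreak> 0020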
Re: NBSP supposed to stretch, right?
Asmus Freytag wrote,
> And any recommendation that is not compatible with what the overwhelming
> majority of software has been doing should be ignored (or only enabled on
> explicit user input).
>
> Otherwise, you're just advocating for a massively breaking change.
It seems like the recommendations are already in place and the “overwhelming majority of software” is already disregarding them.
I don’t see the massively breaking change here. Are there any illustrations? If legacy text containing NO-BREAK SPACE characters is popped into a justifier, the worst thing that can happen is that the text will be correctly justified under a revised application. That’s not breaking anything, it’s fixing it. Unlike changing the font-face, font size, or page width (which often results in reformatting the text), the line breaks are calculated before justification occurs.
If a string of NO-BREAK SPACE characters appears in an HTML file, the browser should proportionally adjust all of those space characters identically with the “normal” space characters. This should preserve the authorial intent.
As for pre-Unicode usage of NO-BREAK SPACE, were there ever any explicit guidelines suggesting that the normal SPACE character should expand or contract for justification but that the NO-BREAK SPACE must not expand or contract?
Re: NBSP supposed to stretch, right?
On 2019-12-17 10:37 AM, QSJN 4 UKR via Unicode wrote:
Agree. By the way, it is common practice to use multiple nbsp in a row to create a larger span. In my opinion, it is wrong to replace fixed width spaces with non-breaking spaces. Quote from Microsoft Typography Character design standards:
«The no-break space is not the same character as the figure space. The figure space is not a character defined in most computer system's current code pages. In some fonts this character's width has been defined as equal to the figure width. This is an incorrect usage of the character no-break space.»
The mention of code pages made me suspect that this quote was from an archived older web page, but it's current. Here's the link:
https://docs.microsoft.com/en-us/typography/develop/character-design-standards/whitespace
Quoting from that same page, "Advance width rule: The advance width of the no-break space should be equal to the width of the space."
So it follows that any justification operation should treat NO-BREAK SPACE and SPACE identically.
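Anyone wanting to audit a font against that advance-width rule can do so with fontTools (the font path here is hypothetical):

from fontTools.ttLib import TTFont

font = TTFont("SomeFont.ttf")  # hypothetical path
cmap = font.getBestCmap()
hmtx = font["hmtx"]

widths = {}
for cp in (0x0020, 0x00A0):
    glyph = cmap.get(cp)
    widths[cp] = hmtx[glyph][0] if glyph else None  # hmtx entries are (advance, lsb)
    print(f"U+{cp:04X} -> {glyph}: advance {widths[cp]}")

# Per the rule quoted above, the two advances should be equal.
assert widths[0x0020] == widths[0x00A0]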
Re: A neat description of encoding characters
On 2019-12-03 12:59 AM, Richard Wordingham via Unicode wrote: On Mon, 2 Dec 2019 12:01:52 + "Costello, Roger L. via Unicode" wrote: From the book titled "Computer Power and Human Reason" by Joseph Weizenbaum, p.74-75 Suppose that the alphabet with which we wish to concern ourselves consists of 256 distinct symbols... Why should I wish to concern myself with only one alphabet? You shouldn't. But suppose you did. That's the hypothetical set-up for the illustration. When that book was published in 1976, that illustration may have helped some people gain a better understanding of computer encoding. Nowadays a character string might be required to produce a glyph which the user community considers to be a "character" (or letter) in its writing system. Adding variation selectors, invisible 'formatting' characters, and non-alphabetic symbols to the mix has moved computer encoding way beyond 1976.
Re: Is the Unicode Standard "The foundation for all modern software and communications around the world"?
On 2019-11-19 11:00 PM, Mark E. Shoulson via Unicode wrote:
Why so concerned with these minutiæ? Were you in fact misled? (Doesn't sound like it.) Do you know someone who was, or whom you fear would be? What incorrect conclusions might they draw from that misunderstanding, and how serious would they be? Doesn't sound like this is really anything serious even if you were right.
Anyone unfamiliar with our timeline, such as a millennial, might be led to believe that Unicode was in place before personal computers existed. A bit of research would have dispelled that notion. But thereafter any assertion from Unicode would be suspect.
Restricting the claims to text, as Asmus Freytag suggests, might be too limiting. Many people may not realize how prevalent textual data really is in our exchanges of information. Imagine producing a video offering closed captioning/subtitling in French, Italian, and Hebrew without the underlying foundation of Unicode.
Rather than limiting this to text, why not substitute something for the word "foundation"? For example: The Unicode Standard is the lodestar for all modern software and communications around the world, ...
Re: Is the Unicode Standard "The foundation for all modern software and communications around the world"?
On 2019-11-19 6:59 PM, Costello, Roger L. via Unicode wrote:
Today I received an email from the Unicode organization. The email said this: (italics and yellow highlighting are mine)
The Unicode Standard is the foundation for all modern software and communications around the world, including all modern operating systems, browsers, laptops, and smart phones, plus the Internet and Web (URLs, HTML, XML, CSS, JSON, etc.).
That is a remarkable statement! But is it entirely true? Isn't it assuming that everything is text? What about binary information such as JPEG, GIF, MPEG, WAV; those are pretty core items to the Web, right? The Unicode Standard is silent about them, right? Isn't the above quote a bit misleading?
A bit, perhaps. But think of it as a press release. The statement smacks of hyperbole at first blush, but "foundation" can mean basis or starting point. File names (and URLs) of *.WAV, *.MPG, etc. are stored and exchanged via Unicode. Likewise, the tags (metadata) for audio/video files are stored (and displayed) via Unicode. So fields such as Title, Artist, Comments/Notes, Release Date, Label, Composer, and so forth aren't limited to ASCII data.
Re: New Public Review on QID emoji
On 2019-11-13 3:00 AM, Asmus Freytag via Unicode wrote:
The current effort starts from an unrelated problem (Unicode not wanting to administer emoji applications) and in my analysis, seriously puts the cart before the horse.
But it does solve the unrelated problem. There's nothing stopping vendors from making software which recognizes tag character strings to reference in-line graphics. There's nothing stopping users from employing those in-line graphics as emoji images. It would be considered a higher level protocol which uses tag character strings in lieu of, for example, ASCII strings like <img src="triceratops.png">. Either way, it's rich-text expressed with plain-text strings.
But for Unicode to provide this mechanism which "should be correctly parsed by all conformant implementations", as well as possibly maintaining a registry of "tag sequences known to be in use", suggests that Unicode now considers that random images (with no symbolic meaning other than that they're pictures of something) should be exchanged as plain-text.
The QID Emoji in Unicode makes as much sense as the original emoji inclusion. It's a natural result of the slippery slope of emoji encoding. Emoji are open-ended but Unicode currently has barriers erected. QID Emoji would eliminate limitations on what's supposed to be an open-ended set. That's the problem that the current effort would resolve. In my opinion it's better to open up a myriad of images and see which sequences actually get used than to have vendors/enthusiasts create images in the hope or expectation that anyone will actually use them.
Re: On the lack of a SQUARE TB glyph
On 2019-09-27 5:15 AM, Fred Brennan via Unicode wrote:
I only have two lingering questions.
* Does the existence of the legacy Adobe encoding Adobe-Japan1-6 shift the balance? It has a SQUARE TB at CID+8306. https://www.adobe.com/content/dam/acom/en/devnet/font/pdfs/5078.Adobe-Japan1-6.pdf
That character set also has other items not in Unicode, such as numbers enclosed in squares from "0" and "00" through "100" and fractions like 3/7 and 10/11. It was published in 2008, so it might not be considered "legacy".
Re: Rendering Sanskrit Medial Sequences -vy- and -ry- in Myanmar
Well, it was intended to be off list. It seems that this has been mentioned before, for example:
http://www.unicode.org/mail-arch/unicode-ml/y2011-m07/0029.html
Maybe it's time for a new thread/subject title?
Re: Rendering Sanskrit Medial Sequences -vy- and -ry- in Myanmar
On 2019-08-21 2:40 AM, James Kass wrote:
Are we allowed to write Llangollen as the definition of the Unicode Collation Algorithm implies we should, with an invisible CGJ between the 'n' and the 'g', so that it will collate correctly in Welsh? That CGJ is necessary so that it will collate *after* Llanberis. (The problem is that the letter 'ng' comes before the letter 'n'.)
(This is off-list.) If 'ng' comes before 'n', shouldn't Llangollen collate *before* Llanberis in a Welsh listing?
Re: Rendering Sanskrit Medial Sequences -vy- and -ry- in Myanmar
On 2019-08-21 2:08 AM, Richard Wordingham via Unicode wrote:
Are we allowed to write Llangollen as the definition of the Unicode Collation Algorithm implies we should, with an invisible CGJ between the 'n' and the 'g', so that it will collate correctly in Welsh? That CGJ is necessary so that it will collate *after* Llanberis. (The problem is that the letter 'ng' comes before the letter 'n'.)
So that it won't collate correctly in anything other than Welsh? Isn't it better to use an application which enables Welsh collation? Here's how BabelPad handles Welsh: http://www.babelstone.co.uk/Software/BabelPad_Sort_Lines.html
Re: PUA (BMP) planned characters HTML tables
On 2019-08-15 12:25 AM, Asmus Freytag via Unicode wrote: Empirically, it has been observed that some distinctions that are claimed by users, standards developers or implementers were de-facto not honored by type developers (and users selecting fonts) as long as the native text doesn't contain minimal pairs. Quickly checked a couple of older on-line PDFs and both used the comma below unabashedly. Quoting from this page (which appears to be more modern than the PDFs), http://www.trussel2.com/MOD/peloktxt.htm "Ij keememej ḷọk wōt ke ikar uwe ippān Jema kab ruo ṃōṃaan ilo juon booj jidikdik eo roñoul ruo ne aitokan im jiljino ne depakpakin. Ilo iien in eor jiljilimjuon ak rualitōk aō iiō—Ij jab kanooj ememej. Wa in ṃōṃkaj kar ..." It seems that users are happy to employ a dot below in lieu of either a comma or cedilla. This newer web page is from a book published in 1978. There's a scan of the original book cover. Although the book title is all caps hand printing it appears that commas were used. The Marshallese orthography which uses commas/cedillas is fairly recent, replacing an older scheme devised by missionaries. Perhaps the actual users have already resolved this dilemma by simply using dots below.
Re: PUA (BMP) planned characters HTML tables
On 2019-08-14 7:50 PM, Richard Wordingham via Unicode wrote:
I think you'd also have to change the reference glyph of LATIN LOWER CASE I WITH HEART to show a heart. That's valid because the UCD trumps the code charts, and no Unicode-compliant process may deliberately render <i + combining heart> differently from LATIN LOWER CASE I WITH HEART.
U+0149 has a compatibility decomposition. It has been deprecated and is not rendered identically on my system. 'n ʼn ( ’n )
If a character gets deprecated, can its decomposition type be changed from canonical to compatibility?
Re: PUA (BMP) planned characters HTML tables
On 2019-08-12 8:30 AM, Andrew West wrote:
This issue was discussed at WG2 in 2013 (https://www.unicode.org/L2/L2013/13128-latvian-marshal-adhoc.pdf), when there was a recommendation to encode precomposed letters L and N with cedilla *with no decomposition*, but that solution does not seem to have been taken up by the UTC.
Group One dots their lowercase "i" letters with little flowers and Group Two dots theirs with little hearts. Group Two considers flowers unacceptable and Group One rejects hearts. Because of legacy character sets there's a precomposed character encoded called "LATIN LOWER CASE I WITH HEART", but it was misnamed and is normally drawn with a flower instead. Group Two tries to encode "LATIN LOWER CASE I" plus "COMBINING HEART" to get the thing to display properly. But because there's a decomposition involved, the font engine substitutes the glyph mapped to "LATIN LOWER CASE I WITH HEART" in the display for the string "LATIN LOWER CASE I" plus "COMBINING HEART". This thwarts Group Two because they still get the flower.
The solution is to deprecate "LATIN LOWER CASE I WITH HEART". It's only in there because of legacy. Its presence guarantees round-tripping with legacy data but it isn't needed for modern data or display. Urge Groups One and Two to encode their data with the desired combiner and educate font engine developers about the deprecation. As the rendering engines get updated, the system substitution of the wrongly named precomposed glyph will go away.
This presumes that the premise of user communities feeling strongly about the unacceptable aspect of the variants is valid. Since it has been reported and nothing seems to be happening, perhaps the casual users aren't terribly concerned. It's also possible that the various user communities have already set up their systems to handle things acceptably by installing appropriate fonts.
Re: PUA (BMP) planned characters HTML tables
On 2019-08-11 5:26 PM, Doug Ewell via Unicode wrote:
If you are thinking of these as potential future additions to the standard, keep in mind that accented letters that can already be represented by a combination of letter + accent will not ever be encoded. This is one of the longest-standing principles Unicode has.
Good point. There was a time when populating the PUA with precomposed glyphs was necessary for printing or display, but that time has passed. Hopefully anyone seeking charts is transcoding older data into proper Unicode. This can be illustrated with the Marshallese combos mentioned earlier.
PUA: 
Standard: ĻļM̧m̧ŅņO̧o̧
Well, that didn't work out as well as expected. But the standard Unicode is supported (more or less) by some of the core fonts installed here. Nothing installed here displays anything useful for the PUA characters. A decent OpenType font designed with Marshallese in mind should work just fine with the combiners. The fact is that the standard characters will survive and can be universally exchanged. And there's plenty of web page charts showing the standard characters.
Re: PUA (BMP) planned characters HTML tables
On 2019-08-11 4:07 AM, Robert Wheelock via Unicode wrote:
Hello! I remember that a website that has tables for certain PUA precomposed accented characters that aren’t yet in Unicode (things like: Marshallese M/m-cedilla, H/h-acute, capital T-dieresis, capital H-underbar, acute accented Cyrillic vowels, Cyrillic ER/er-caron, ...). Where was it at?! I still want to get the information. Thank You!
It sounds familiar but I can't place it. I tried the SIL pages first, as did Richard Wordingham apparently.
https://blogfonts.com/dehuti.font
This font has material in the PUA including Marshallese glyphs with cedillas: L (E382 & E394), M (E3A6 & E3BB), N (E3CE & E3DE), O (E429 & E465). These appear to be PUA characters which the font developer has mapped in addition to the SIL PUA mappings.
Re: SHEQEL and L2/19-291
https://en.wikipedia.org/wiki/Israeli_new_shekel
"With the issuing of the third series, the Bank of Israel has adopted the standard English spelling of shekel and plural shekels for its currency.[30] Previously, the Bank had formally used the Hebrew transcriptions of sheqel and sheqalim (from שְׁקָלִים).[31]"
BTW, Google flags "sheqel" in its search box as an incorrect spelling.
On 2019-07-25 2:23 AM, Mark E. Shoulson via Unicode wrote:
Just looking at document L2/19-291, https://www.unicode.org/L2/L2019/19291-missing-currency.pdf "Currency signs missing in Unicode" by Eduardo Marín Silva. And I'm wondering why he feels it necessary for the Unicode standard to say that a more correct spelling for the Israeli currency would be "shekel" (and not "sheqel"). What criterion is being used that makes this "more correct"? I think it's more popular and common, so maybe that's it. But historically and linguistically, "sheqel" is more accurate. The middle letter is ק, U+05E7 HEBREW LETTER QOF (which is not "more correctly" KOF), from the root ש־ק־ל Sh.Q.L meaning "weight". It's true that Modern Hebrew does not distinguish K and Q phonetically in speech; maybe that is what is meant? Still, the "historical" transliteration of QOF with Q is very widespread, and I believe occurs even on some coins/bills (could be wrong here; is this what is meant by "more correct"? That "shekel" is what is used officially on the currency and I am misremembering?) Just wondering about this, since it seems to be stressed in the document.
~mark
Re: Is ARMENIAN ABBREVIATION MARK (՟, U+055F) misclassified?
On 2019-04-26 11:08 PM, Doug Ewell via Unicode wrote:
This is a small percentage of the number of fonts that have all four of these Armenian glyphs, but show the abbreviation mark as a spacing glyph. It looks like Unicode is right, Wikipedia is right, and the fonts are wrong.
If the Wikipedia page(s) are correct, then Unicode isn't. Unicode charts don't show the glyph on a dotted circle, and the character is shown as spacing. The fact that Doug Ewell found some installed fonts displaying the character as a combining mark suggests that the Wikipedia pages are correct. This character is listed as being unused in modern Armenian, but you'd think that this would have been exposed before now, since the character has been in Unicode since version 1.0.
Re: Script_extension Property of U+0310 Combining Candrabindu
The Guara Times font maps Cyrillic letters (Л,л,М,м) with chandrabindus in the P.U.A. of the font. This can be done without the P.U.A. using U+0310: Л̐,л̐,М̐,м̐ http://www.chakra.lv/blog/2016/10/19/transliterating-sanskrit-into-russian/ On 2019-04-18 7:59 PM, Richard Wordingham via Unicode wrote: Is there any reason why U+0310 COMBINING CANDRABINDU has scx=Inherited rather than scx=Latn? The only language I've seen the character used in is Sanskrit, and the only script I've seen it in is the Latin script. Richard.
Re: MODIFIER LETTER SMALL GREEK PHI in Calibri is wrong.
Confirming that the installed version here shows psi. (Version 5.74) Luc(as) de Groot is the type designer; I've copied him on this message.
On 2019-04-17 10:06 PM, Hans Åberg via Unicode wrote:
You are possibly both right, because it is OK in the web font but wrong in the desktop font.
On 17 Apr 2019, at 23:53, Oren Watson via Unicode wrote:
You can easily reproduce this by going here: https://www.fonts.com/font/microsoft-corporation/calibri/regular and putting in the following string: ψϕφᵠ
On Wed, Apr 17, 2019 at 5:23 PM James Tauber wrote:
It looks correct in Google Docs so it appears to have been fixed in whatever version of the font is used there. James
On Wed, Apr 17, 2019 at 5:10 PM Oren Watson via Unicode wrote:
Would anyone know where to report this? In the widely used Calibri typeface included with MS Office, the glyph shown for U+1D60 MODIFIER LETTER SMALL GREEK PHI actually depicts a letter psi, not a phi.
Re: Emoji Haggadah
> Perhaps that debunking was in the very book
> cited by Martin J. Dürst earlier in this thread.
Yes, starting on page 24.
https://books.google.com/books?id=hypplIDMd0IC&pg=PA24&dq=isbn:0824812077+Yukaghir&hl=en&sa=X&ved=0ahUKEwj1n4r719zgAhWJn4MKHcdyCHIQ6AEIKjAA#v=onepage&q=isbn%3A0824812077%20Yukaghir&f=false
Re: Emoji Haggadah
> http://historyview.blogspot.com/2011/10/yukaghir-girl-writes-love-letter.html
According to a comment, the Yukaghir love letter as semasiographic communication was debunked by John DeFrancis in 1989, who asserted that it was merely a prop in a Yukaghir parlor game. Perhaps that debunking was in the very book cited by Martin J. Dürst earlier in this thread.
Martin J. Dürst via Unicode wrote,
>> There is a well-known thesis in linguistics that every script has to be
>> at least in part phonetic, and the above are examples that add support
>> to this. For deeper explanations (unfortunately not yet including
>> emoji), see e.g. "Visible Speech - The Diverse Oneness of Writing
>> Systems", by John DeFrancis, University of Hawaii Press, 1989.
The blog page comment went on to say that Geoffrey Sampson, who wrote the book from which the blogger learned of the Yukaghir love letter, published a retraction in 1994.
Re: Emoji Haggadah
On 2019-04-16 7:09 AM, Martin J. Dürst via Unicode wrote: All the examples you cite, where images stand for sounds, are typically used in some of the oldest "ideographic" scripts. Egyptian definitely has such concepts, and Han (CJK) does so, too, with most ideographs consisting of a semantic and a phonetic component. Using emoji as rebus puzzles seems harmless enough but it defeats the goals of those emoji proponents who want to see emoji evolve into a universal form of communication because phonetic recognition of symbols would be language specific. Users of ancient ideographic systems typically shared a common language where rebus or phonetic usage made sense to the users. (Of course, diverse CJK user communities were able to adapt over time.) All of the reviews of this publication on the page originally linked seemed positive, so it appears that people are having fun with emoji. But I suspect that this work would be jibber-jabber to any non-English speaker unfamiliar with the original Haggadah. No matter how otherwise fluent they might be in emoji communication.
Re: Emoji Haggadah
On 2019-04-16 3:18 AM, Mark E. Shoulson via Unicode wrote:
> For whatever reason, the author decided to go with 🕉️ for "God" and such, ...
"OM"igod. Just a thought. If the emoji OM SYMBOL is to be used for "god", shouldn't it have casing to enable distinction between the common noun and the deity?
Vendor-assigned emoji (was: Encoding italic)
On 2019-01-24 Andrew West wrote,
> The ESC and UTC do an appallingly bad job at regulating emoji, and I
> would like to see the Emoji Subcommittee disbanded, and decisions on
> new emoji taken away from the UTC, and handed over to a consortium or
> committee of vendors who would be given a dedicated vendor-use emoji
> plane to play with (kinda like a PUA plane with pre-assigned
> characters with algorithmic names [VENDOR-ASSIGNED EMOJI X] which
> the vendors can then associate with glyphs as they see fit; and as
> emoji seem to evolve over time they would be free to modify and
> reassign glyphs as they like because the Unicode Standard would not
> define the meaning or glyph for any characters in this plane).
Nobody disagreed and I think it’s a splendid suggestion. If anyone is discussing drafting a proposal to accomplish this, please include me in the “cc”.
Re: Encoding italic
Philippe Verdy wrote,
>>> case mappings,
>>
>> Adjust them as needed.
>
> Not so easy: case mappings cannot be fixed. They are stabilized in Unicode.
> You would need special casing rules under a specific "locale" for maths.
In BabelPad, I can select a string of text and convert it to math italics. If upper case italics is desired, it would be necessary to select the text, convert it back to ASCII, convert it to upper case, and convert that upper case to math italics. Casing the math alphanumerics doesn’t seem to present any problem. Any program could make those interim steps invisible to the end user.
(With VS14, BabelTags mark-up, or new control character(s)—casing isn’t even an issue.)
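Those interim steps are easy to sketch in Python; note the famous "hole" at U+1D455, which is why lowercase "h" must map to U+210E PLANCK CONSTANT:

import unicodedata

def to_math_italic(s: str) -> str:
    out = []
    for c in s:
        if "A" <= c <= "Z":
            out.append(chr(0x1D434 + ord(c) - ord("A")))
        elif c == "h":
            out.append("\u210E")  # U+1D455 is unassigned
        elif "a" <= c <= "z":
            out.append(chr(0x1D44E + ord(c) - ord("a")))
        else:
            out.append(c)
    return "".join(out)

def to_ascii(s: str) -> str:
    # NFKC folds the math alphanumerics back via their compatibility decompositions.
    return unicodedata.normalize("NFKC", s)

# The interim steps described above, invisible to the end user:
print(to_math_italic(to_ascii(to_math_italic("hello")).upper()))  # 𝐻𝐸𝐿𝐿𝑂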
Re: Encoding italic
On 2019-02-11 6:42 PM, Kent Karlsson wrote: > Using a VS to get italics, or anything like that approach, will > NEVER be a part of Unicode! Maybe the crystal ball is jammed. This can happen, especially on the older models which use vacuum tubes. Wanting a second opinion, I asked the magic 8 ball: “Will VS14 italic be part of Unicode?” The answer was: “It is decidedly so.”
Re: Encoding italic
Philippe Verdy wrote,
>> ...[one font file having both italic and roman]...
> The only case where it happens in real fonts is for the mapping of
> Mathematical Symbols which have a distinct encoding for some
> variants ...
William Overington made a proof-of-concept font using the VS14 character to access the italic glyphs which were, of course, in the same real font. Which means that the developer of a font such as Deja Vu Math TeX Gyre could set up an OpenType table mapping the Basic Latin in the font to the italic math letter glyphs in the same font using the VS14 characters. Such a font would work interoperably on modern systems. Such a font would display italic letters both if encoded as math alphanumerics or if encoded as ASCII plus VS14. Significantly, the display would be identical.
> ...[math alphanumerics]...
> These were allowed in Unicode because of their specific contextual
> use as distinctive symbols from known standards, and not for general
> use in human languages
They were encoded for interoperability and round-tripping because they existed in character sets such as STIX. They remain Latin letter form variants. If they had been encoded as the variant forms which constitute their essential identity, it would have broken the character vs. glyph encoding model of that era. Arguing that they must not be used other than for scientific purposes is just so much semantic quibbling in order to justify their encoding.
Suppose we started using the double-struck ASCII variants on this list in order to note Unicode character numbers such as 𝕌+𝔽𝔼𝔽𝔽 or 𝕌+𝟚𝟘𝟞𝟘? Hexadecimal notation is certainly math and Unicode can be considered a science. Would that be “math abuse” if we did it? (Is linguistics not a science?)
> (because these encodings are defective and don't have the necessary
> coverage, notably for the many diacritics,
The combining diacritics would be used.
> case mappings,
Adjust them as needed.
> and other linguistic, segmentation and layout properties).
>
> The same can be said about superscript/subscript variants,
> ... : they have specific use and not made for general purpose texts ...
So people who used ISO-8859-1 were not allowed to use the superscript digits therein for marking footnotes? Those superscript digits were reserved by ISO-8859-1 only for use by math and science?
MATHEMATICAL ITALIC CAPITAL A
Decomposition mapping: U+0041
Binary properties: Math, Alphabetic, Uppercase, Grapheme Base, ...
SUPERSCRIPT TWO
Decomposition mapping: U+0032
Binary properties: Grapheme Base
MODIFIER LETTER SMALL C
Decomposition mapping: U+0063
Binary properties: Alphabetic, Lowercase, Grapheme Base, ...
Re: Encoding italic
Martin J. Dürst wrote, >> Isn't that already the case if one uses variation sequences to choose >> between Chinese and Japanese glyphs? > > Well, not necessarily. There's nothing prohibiting a font that includes > both Chinese and Japanese glyph variants. Just as there’s nothing prohibiting a single font file from including both roman and italic variants of Latin characters.
Re: Encoding italic
Asmus Freytag wrote,
> You are still making the assumption that selecting a different glyph for
> the base character would automatically lead to the selection of a different
> glyph for the combining mark that follows. That's an iffy assumption
> because "italics" can be realized by choosing a separate font (typographically,
> italics is realized as a separate typeface).
>
> There's no such assumption built into the definition of a VS. At best, inside
> the same font, there may be an implied ligature, but that does not work if
> there's an underlying font switch.
Midstream font switching isn’t a user option in most plain-text applications, although there can be some font substitution happening at the OS level. Any combining mark must apply to its base letter glyph, even after a base letter glyph has been modified.
More sophisticated editors, like BabelPad, allow users to select different fonts for different ranges of Unicode. If a user selects font X for ASCII and font Y for combining marks, then mark positioning is already broken. If the user selects Times New Roman for both ASCII and combining marks, then no font switching is involved. The Times New Roman typeface includes italic letter form variants. Any application sharp enough to know that the italic letter form variants are stored in a different computer *file* should be clever enough to apply mark positioning accordingly. And any single font file which includes italic letters and maps them with VS14 would avoid any such issues altogether.
Re: Encoding italic
William,
Rather than having the user insert the VS14 after every character, the editor might allow the user to select a span of text for italicization. Then it would be up to the editor/app to insert the VS14s where appropriate.
For Andrew’s example of “fête”, the user would either type the string: “f” + “ê” + “t” + “e” or the string: “f” + “e” + “◌̂” + “t” + “e”. If the latter, the application would insert VS14 characters after the “f”, “e”, “t”, and “e”. The application would not insert a VS14 after the combining circumflex — because the specification does not allow VS characters after combining marks, they may only be used on base characters.
In the first ‘spelling’, since the specifications forbid VS characters after any character which is not a base character (in other words, not after any character which has a decomposition, such as “ê”) — the application would first need to convert the string to the second ‘spelling’, and proceed as above. This is known as converting to NFD.
So in order for VS14 to be a viable approach, any application would ① need to convert any selected span to NFD, and ② only insert VS14 after each base character. And those are two operations which are quite possible, although they do add slightly to the programmer’s burden. I don’t think it’s a “deal-killer”.
Of course, the user might insert VS14s without application assistance. In which case hopefully the user knows the rules. The worst case scenario is where the user might insert a VS14 after a non-base character, in which case it should simply be ignored by any application. It should never “break” the display or the processing; it simply makes the text for that document non-conformant. (Of course putting a VS14 after “ê” should not result in an italicized “ê”.)
Cheers, James
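That two-step procedure, as a minimal Python sketch (VS14 is U+FE0D; a real editor would likely restrict insertion to letters rather than every base character):

import unicodedata

VS14 = "\uFE0D"

def italicize(span: str) -> str:
    # Step 1: convert to NFD so the selector lands on base letters,
    # never on precomposed characters like "ê".
    out = []
    for ch in unicodedata.normalize("NFD", span):
        out.append(ch)
        # Step 2: only base characters get the selector; combining
        # marks (general category M*) are skipped.
        if not unicodedata.category(ch).startswith("M"):
            out.append(VS14)
    return "".join(out)

# "fête" -> f VS14 e VS14 U+0302 t VS14 e VS14
print(" ".join(f"{ord(c):04X}" for c in italicize("f\u00EAte")))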
Re: Encoding italic
William Overington wrote,
> Well, a proposal just about using VS14 to indicate a request for an
> italic version of a glyph in plain text, including a suggestion of to
> which characters it could apply, would test whether such a proposal
> would be accepted to go into the Document Register for the Unicode
> Technical Committee to consider or just be deemed out of scope and
> rejected and not considered by the Unicode Technical Committee.
As long as “italics in plain-text” is considered out-of-scope by Unicode, any proposal for handling italics in plain-text would probably be considered out-of-scope as well. But I could be wrong and wouldn’t mind seeing a proposal.
Re: Encoding italic
Philippe Verdy responded to William Overington,
> the proposal would contradict the goals of variation selectors and would
> pollute the variation sequences registry (possibly even creating conflicts).
> And if we admit it for italics, then another VSn will be dedicated to bold,
> and another for monospace, and finally many would follow for various
> style modifiers.
> Finally we would no longer have enough variation selectors for all requests.
There are 256 variation selector characters. Any use of variation sequences not registered by Unicode would be non-conformant.
William’s suggestion of floating a proposal for handling italics with VS14 might be an example of the old saying about “putting the cart before the horse”. Any preliminary proposal would first have to clear the hurdle of the propriety of handling italic information at the plain-text level. Such a proposal might list various approaches for accomplishing that, if that hurdle can be surmounted.
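For reference, the 256 selectors come from two blocks, which a two-line check confirms:

# VS1..VS16 (U+FE00..U+FE0F) plus VS17..VS256 (U+E0100..U+E01EF)
selectors = [*range(0xFE00, 0xFE10), *range(0xE0100, 0xE01F0)]
assert len(selectors) == 256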
Re: Ancient Greek apostrophe marking elision
On 2019-01-28 8:58 PM, Richard Wordingham wrote:
> On Mon, 28 Jan 2019 03:48:52 +0000
> James Kass via Unicode wrote:
>
>> It’s been said that the text segmentation rules seem over-complicated
>> and are probably non-trivial to implement properly. I tried your
>> suggestion of WORD JOINER U+2060 after tau ( γένοιτ’ ἄν ), but it
>> only added yet another word break in LibreOffice.
>
> I said we *don't* have a control that joins words. The text of TUS
> used to say we had one in U+2060, but that was removed in 2015. I
> pleaded for the retention of this functionality in document
> L2/2015/15-192, but my request was refused. I pointed out in ICU
> ticket #11766 that ICU's Thai word breaker retained this facility. ...
Sorry for sounding obtuse there. It was your *post* which suggested the use of WORD JOINER. You did clearly assert that it would not work. So, human nature, I /had/ to try it and see. It. did. not. work. (No surprise.)
But it /should/ have worked. It’s a JOINER, for goodness sake! When the author/editor puts any kind of JOINER into a text string, what’s the intent? What’s the poînt of having a JOINER that doesn’t?
Recently I put a ZWJ between the “c” and the “t” in the word “Respectfully” as an experiment. Spellchecker flagged both “respec” and “tfully” as being misspelt, which they probably are. A ZWNJ would have been used if there had been any desire for the string to be *split* there, e.g., to forbid formation of a discretionary ligature. Instead the ZWJ was inserted, signalling authorial intent that a ‘more joined’ form of the “c-t” substring was requested.
Text a man has JOINED together, let not algorithm put asunder.
Re: Encoding italic
On 2019-01-31 3:18 PM, Adam Borowski via Unicode wrote: > They're only from a spammer's point of view. Spammers need love, too. They’re just not entitled to any.
Re: Encoding italic
David Starner wrote,
> The choice of using single-byte character sets isn't always voluntary.
> That's why we should use ISO-2022, not Unicode. Or we can expect
> people to fix their systems. What systems are we talking about, that
> support Unicode but compel you to use plain text? The use of Twitter
> is surely voluntary.
This marketing-related web page,
https://litmus.com/blog/best-practices-for-plain-text-emails-a-look-at-why-theyre-important
...lists various reasons for using plain-text e-mail. Here’s an excerpt.
“Some people simply prefer it. Plain and simple—some people prefer text emails. ... Some users may also see HTML emails as a security and privacy risk, and choose not to load any images and have visibility over all links that are included in an email. In addition, the increased bandwidth that image-heavy emails tend to consume is another driver of why users simply prefer plain-text emails.”
Besides marketing, there’s also newsletters and e-mail discussion groups. Some of those discussion groups are probably scholarly. Anyone involved in that would likely embrace ‘super cool Unicode text magic’ and it’s surprising if none of them have stumbled across the math alphanumerics yet.
A web search for the string “plain text only” leads to all manner of applications for which searchers are trying to control their environments. There’s all kinds of reasons why some people prefer to use plain-text; it’s often an informed choice and it isn’t limited to e-mail. It’s true that people don’t have to use Twitter. People don’t have to turn on their computers, either.
Re: Encoding italic
David Starner wrote,
> Emoji, as have been pointed out several times, were in the original
> Unicode standard and date back to the 1980s; the first DOS character
> page has smileys at 0x01 and 0x02.
That's disingenuous.
Re: Encoding italic
David Starner wrote,
>> ... italics, bold, strikethrough, and underline in plain-text
>
> Okay? Ed can do that too, along with nano and notepad. It's called
> HTML (TeX, Troff). If by plain-text, you mean self-interpreting,
> without external standards, then it's simply impossible.
HTML source files are in plain-text. Hopefully everyone on this list understands that and has already explored the marvelous benefits offered by granting users the ability to make exciting and effective page layouts via any plain-text editor. HTML is standard and interchangeable.
As Tex Texin observed, differences of opinion as to where we draw the line between text and mark-up are somewhat ideological. If a compelling case for handling italics at the plain-text level can be made, then the fact that italics can already be handled elsewhere doesn’t matter. If a compelling case cannot be made, there are always alternatives.
As for use of other variant letter forms enabled by the math alphanumerics, the situation exists. It’s an interesting phenomenon which is sometimes worthy of comment and relates to this thread because the math alphanumerics include italics. One of the web pages referring to third-party input tools calls the practice “super cool Unicode text magic”.
Re: Encoding italic
Doug Ewell wrote,
> I can't speak for Andrew, but I strongly suspect he implemented this as
> a proof of concept, not to declare himself the Maker of Standards.
BabelPad also offers plain-text styling via math-alpha conversion, although this feature isn’t newly added. Users interested in seeing how plain-text italics might work can try out the stateful approach using tags contrasted with the character-by-character approach using math-range italic letters. (Of course, the math-range stuff is already being interchanged on the WWW, whilst the tagging method does not yet appear to be widely supported.)
A few miles upthread, ‘where are the third-party developers’ was asked. ‘Everywhere’ is the answer. Since third-party developers have to subsist on the crumbs dropped by the large corps, they tend to be responsive to user needs and requests.
Re: Encoding italic
On 2019-01-29 5:10 PM, Doug Ewell via Unicode wrote: I thought we had established that someone had mentioned it on this list, at some time during the past three weeks. Can someone look up what post that was? I don't have time to go through scores of messages, and there is no search facility. http://www.unicode.org/mail-arch/unicode-ml/y2019-m01/0209.html
Re: Ancient Greek apostrophe marking elision
On 2019-01-29 1:55 AM, Mark E. Shoulson via Unicode wrote: I guess "Suck it up and deal with it." And that may indeed be the answer. It would certainly make for shorter and simpler FAQ pages, anyway.
Re: Ancient Greek apostrophe marking elision
On 2019-01-28 7:31 AM, Mark Davis ☕️ via Unicode wrote:
Expecting people to type in hard-to-find invisible characters just to correct double-click is not a realistic expectation.
True, which is why such entries, when consistent, are properly handled at the keyboard driver level. It's a presumption that Greek classicists are already specifying fonts and using dedicated keyboard drivers. Based on the description provided by James Tauber, it should be relatively simple to make the keyboard insert some kind of joiner before U+2019 if it follows a Greek letter. This would not be visible to the end-user. This approach would also mean that plain-text, which has no language tagging mechanism, would "get it right" cross-platform, cross-applications.
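A sketch of such a keyboard-driver hook in Python (the Greek test here is a crude name-based heuristic, for illustration only):

import unicodedata

JOINER = "\u2060"  # WORD JOINER; ZWJ (U+200D) is the other obvious candidate

def on_key(buffer: str, key: str) -> str:
    """Hypothetical input hook: when U+2019 is typed right after a
    Greek letter, silently prefix it with a joiner."""
    if key == "\u2019" and buffer:
        prev = buffer[-1]
        if (unicodedata.category(prev).startswith("L")
                and "GREEK" in unicodedata.name(prev, "")):
            return buffer + JOINER + key
    return buffer + key

print("\u2060" in on_key("γένοιτ", "\u2019"))  # True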
Re: Ancient Greek apostrophe marking elision
On 2019-01-27 11:38 PM, Richard Wordingham via Unicode wrote:
On Sun, 27 Jan 2019 19:57:37 +0000, James Kass via Unicode wrote:
On 2019-01-27 7:09 PM, James Tauber via Unicode wrote:
In my original post, I asked if a language-specific tailoring of the text segmentation algorithm was the solution but no one here has agreed so far.
If there are likely to be many languages requiring exceptions to the segmentation algorithm wrt U+2019, then perhaps it would be better to establish conventions using ZWJ/ZWNJ and adjust the algorithm accordingly so that it would be cross-languages. (Rather than requiring additional and open-ended language-specific tailorings.) (I inserted several combinations of ZWJ/ZWNJ into James Tauber's example, but couldn't improve the segmentation in LibreOffice, although it was possible to make it worse.)
If you look at TR29, you will see that ZWJ should only affect word boundaries for emoji. ZWNJ shall have no effect. What you want is a control that joins words, but we don't have that.
Richard. (https://unicode.org/reports/tr29/)
It’s been said that the text segmentation rules seem over-complicated and are probably non-trivial to implement properly. I tried your suggestion of WORD JOINER U+2060 after tau ( γένοιτ’ ἄν ), but it only added yet another word break in LibreOffice.
The problem may stem from the fact that WORD JOINER is supposed to be treated as though it were a zero-width no-break space. IOW it is a *space*, and as a space it indicates a word break. That doesn’t seem right. Instead of treating WORD JOINER as a SPACE, why not treat it as a WORD JOINER? It could save a lot of problems wrt undesirable string segmentation in addition to possibly minimizing future language-specific tailoring and easing the burden on implementers.
Re: Encoding italic
On 2019-01-27 11:44 PM, Philippe Verdy wrote:
> You're not very explicit about the Tag encoding you use for these styles.
This bold new concept was not mine. When I tested it here, I was using the tag encoding recommended by the developer.
> Of course it must not be a language tag so the introducer is not U+E0001, or a cancel-all tag so it
> is not prefixed by U+E007F. It cannot also use letter-like, digit-like and hyphen-like tag characters
> for its introduction. So probably you use some prefix in U+E0002..U+E001F and some additional tag
> (tag "I" for italic, tag "B" for bold, tag "U" for underline, tag "S" for strikethrough?) and the cancel
> tag to return to normal text (terminate the tagged sequence).
Yes, U+E0001 remains deprecated and its use is strongly discouraged.
> Or may be you just use standard HTML encoding by adding U+E0000 to each character of the HTML
> tag syntax (including attributes and close tags, allowing embedding?) So you use the "<" and ">" tag
> characters (possibly also the space tag U+E0020, or TAB tag U+E0009 for separating attributes and the
> quotation tags for attribute values)? Is your proposal also allowing the embedding of other HTML
> objects (such as SVG)?
AFAICT, this beta release supports the tag sequences <b>, <i>, <u>, & <s> (expressed here in ASCII). I don’t know if the software developer has plans to expand the enhancements in the future.
> And what is then the interest compared to standard HTML (it is not more compact, ...
This was one of the ideas which surfaced earlier in this thread. Some users have expressed an interest in preserving, for example, italics in plain-text and are uncomfortable using the math alphanumerics for this, although the math alphanumerics seem well qualified for the purpose. One of the advantages given for this approach earlier is that it can be made to work without any official sanction and with no action necessary by the Consortium.
> I bet in fact that all tag characters are most often restricted in text input forms, and will be
> silently discarded or the whole text will be rejected.
In this e-mail, I used the tags & around the word “bold” in the first sentence of my reply in order to test your bet.
> We were told that these tag characters were deprecated, and in fact even their use for language
> tags has not found any significant use except some trials (but there are now better technologies
> available in lots of software, APIs and services, and application design/development tools, or
> document editing/publishing tools).
Indeed, these tags were deprecated. At the time the tags were deprecated, there was such sorrow on this list that some list members were even inspired to compose haiku lamenting their passing and did post those haiku to this list. Now, thanks to emoji requirements, many of those tags are experiencing a resurrection/renaissance. I wonder if anyone is composing limericks in joyful celebration…
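The "add U+E0000 to each character" scheme is a one-liner in each direction; a sketch (ASCII-only markup assumed, since only U+E0020..U+E007E clone the printable ASCII range):

def to_tags(markup: str) -> str:
    # Shift each printable-ASCII character of the markup into Plane 14.
    return "".join(chr(0xE0000 + ord(c)) for c in markup)

def from_tags(text: str) -> str:
    # Recover the ASCII markup; everything else passes through untouched.
    return "".join(chr(ord(c) - 0xE0000) if 0xE0020 <= ord(c) <= 0xE007E else c
                   for c in text)

styled = to_tags("<b>") + "bold" + to_tags("</b>")
print(from_tags(styled))  # <b>bold</b>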
Re: Encoding italic
A new beta of BabelPad has been released which enables input, storage, and display of italics, bold, strikethrough, and underline in plain-text using the tag characters method described earlier in this thread. This enhancement is described in the release notes linked on this download page: http://www.babelstone.co.uk/Software/index.html
Re: Ancient Greek apostrophe marking elision
On 2019-01-27 7:09 PM, James Tauber via Unicode wrote: In my original post, I asked if a language-specific tailoring of the text segmentation algorithm was the solution, but no one here has agreed so far.

If there are likely to be many languages requiring exceptions to the segmentation algorithm wrt U+2019, then perhaps it would be better to establish conventions using ZWJ/ZWNJ and adjust the algorithm accordingly so that it would work across languages. (Rather than requiring additional and open-ended language-specific tailorings.)

(I inserted several combinations of ZWJ/ZWNJ into James Tauber's example, but couldn't improve the segmentation in LibreOffice, although it was possible to make it worse.)
Re: Ancient Greek apostrophe marking elision
On 2019-01-27 3:08 PM, Tom Gewecke via Unicode wrote: I think the Unicode Hawaiian ʻokina is supposed to be U+02BB (instead of U+02BC).

Notes for U+02BB:
* typographical alternate for 02BD or 02BF
* used in Hawai'ian orthography as 'okina (glottal stop)
Re: Ancient Greek apostrophe marking elision
Richard Wordingham responded to Michael Everson,

>> I’ll be publishing a translation of Alice into Ancient Greek in due
>> course. I will absolutely only use U+2019 for the apostrophe. It
>> would be wrong for lots of reasons to use U+02BC for this.
>
> Please list them.

Let's see the list of reasons why U+02BC should be used first.
Re: Ancient Greek apostrophe marking elision
Richard Wordingham replied to Asmus Freytag,

>> To make matters worse, users for languages that "should" use U+02BC
>> aren't actually consistent; much data uses U+2019 or U+0027. Ordinary
>> users can't tell the difference (and spell checkers seem not
>> successful in enforcing the practice).
>
> That appears to contradict Michael Everson's remark about a Polynesian
> need to distinguish the two visually.

Does it? U+02BC /should/ be used, but ordinary users can't tell the difference because the glyphs in their displays are identical, resulting in much data which uses U+2019 or U+0027. I don't see any contradiction.
Re: Ancient Greek apostrophe marking elision
Perhaps I'm not understanding, but if the desired behavior is to prohibit both line and word breaks in the example string, then... In Notepad, replacing U+0020 with U+00A0 removes the line-break.

U+0020 ( δ’ αρχαια )
U+00A0 ( δ’ αρχαια )
U+202F ( δ’ αρχαια )

It also changes the advancement of the text cursor (Ctrl + arrows), suggesting that word/string selection would be as desired. (U+202F also does this and may offer a more pleasing appearance to classicists by default.) Wouldn't it be best to handle substitution of U+00A0 for U+0020 at the input method / keyboard driver level where appropriate, so that the preferred apostrophe U+2019 can be used?
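If an input method were to do this, the rewrite itself is trivial; the sketch below is a hypothetical illustration (the function name and the choice of U+00A0 over U+202F are mine, not any existing keyboard-driver API):

def fix_elision_spaces(text: str, space: str = "\u00A0") -> str:
    # Replace the breaking space after an elision apostrophe (U+2019)
    # with a no-break space so neither line nor word breaks occur there.
    return text.replace("\u2019 ", "\u2019" + space)

print(fix_elision_spaces("δ’ αρχαια"))  # looks identical, but carries U+00A0 after the apostrophe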
Re: Ancient Greek apostrophe marking elision
Mark Davis responded to Asmus Freytag,

>> breaking selection for "d'Artagnan" or "can't" into two is overly fussy.
>
> True, and that is not what U+2019 does; it does not break medially.

Mark Davis earlier posted this example,

> So something like "δ’ αρχαια" (picking a phrase at random) would have
> a word break after the delta.

If the user wanted to use the preferred character, U+2019, would using the no-break space (U+00A0) after it resolve the word or line break issues? Or possibly NNBSP (U+202F)? It's a shame if users choose suboptimal characters over preferred characters because of what are essentially rendering/text-selection issues. IMO, it's better to use preferred characters in the long run. (Users should file bug reports on applications which improperly medially break strings which include U+2019.)
Re: Ancient Greek apostrophe marking elision
On 2019-01-25 10:06 PM, Asmus Freytag via Unicode wrote: James, by now it's unclear whether your ' is 2019 or 02BC.

The example word "aren't" in the previous message used U+2019. Sorry if I was unclear.
Re: Encoding italic
On 2019-01-26 12:18 AM, Asmus Freytag (c) responded: On 1/25/2019 3:49 PM, Andrew Cunningham wrote: Assuming some mechanism for italics is added to Unicode, when converting between the new plain text and HTML there is insufficient information to correctly convert to HTML. Many elements may have italic styling, and there would be no meta-information in Unicode to indicate the appropriate HTML element. So, we would be creating an interoperability issue.

What happens now when we convert plain-text to HTML?
Re: Ancient Greek apostrophe marking elision
For U+2019, there's a note saying 'this is the preferred character to use for apostrophe'.

Mark Davis wrote,

> When it is between letters it doesn't cause a word break, ...

Some applications don't seem to get that. For instance, the spellchecker for Mozilla Thunderbird flags the string "aren" for correction in the word "aren’t", which suggests that users trying to use preferred characters may face uphill battles.
Re: Encoding italic
> Maybe I should have said emoji are fan-driven.

That works. Here's the previous assertion rephrased: We should no more expect the conventional Unicode character encoding model to apply to emoji than we should expect the old-fashioned text ranges to become fan-driven. And if we don't want the text ranges to become fan-driven, as pointed out by Martin Dürst and others, we take a cautious and conservative approach to moving forward with the standard.

Veering back on-topic, the aversion to anything fan-driven doesn't apply to encoding italics, although /fans/ would benefit. There are pre-existing conventions for italics, and a scholar with the credentials of Victor Gaultney should be able to make a credible proposal for encoding them. I hope we haven't overwhelmed him with a surplus of rhetoric.
Re: Encoding italic (was: A last missing link)
Andrew West wrote,

> Why should we not expect the conventional Unicode character encoding
> model to apply to emoji?

Remember when William Overington used to post about encoding colours, sometimes accompanied by novel suggestions about how they could be encoded or referenced in plain-text? Here's a very polite reply from John Hudson from 2000, http://unicode.org/mail-arch/unicode-ml/Archives-Old/UML024/1042.html ...and, over time, many of the replies to William Overington's colorful suggestions were less than polite. But it was clear that colors were out-of-scope for a computer plain-text encoding standard.

So I don't expect the conventional model to apply to emoji because it didn't; if it had, they'd not have been encoded. Since they're in there, the conventional model does not apply. Of course, the conventions have changed along with the concept of what's acceptable in plain-text. Since emoji are an open-ended, evolving phenomenon, there probably has to be a provision for expansion. Any idea about their having been a finite set overlooked the probability of open-endedness and the impracticality of having only the original subset covered in plain-text while additions would be banished to higher-level protocols.

Thank you for the information about current emoji additions being unrelated to vendors. I have to confess that I haven't kept up-to-date on the emoji. Maybe I should have said that emoji are fan-driven.
Re: Encoding italic (was: A last missing link)
Andrew West wrote,

> ...
> http://www.unicode.org/L2/L2018/18208-white-wine-rgi.pdf), just an
> assertion that it would be a good idea if emoji users could add a
> colored swatch to an existing emoji to indicate what color they want
> it to represent (note that the colored characters do not change the
> color of the emoji they are attached to [before or after, depending
> upon whether you are speaking French or English dialect of emoji],
> they are just intended as a visual indication of what colour you wish
> the emoji was).

In order to simplify emoji processing, these should be stored in the data stream in logical order. Whether these cool new characters become reordrant color blobs or not would depend upon language. So, what we'd need is some way of indicating language in plain-text. Some kind of tagging mechanism.

AFAICT, the emoji repertoire is vendor-driven, just as the pre-Unicode emoji sets were vendor-driven. Pre-Unicode, if a vendor came up with cool ideas for new emoji, they added new characters to the PUA. Now that emoji are standardized, when vendors come up with new ideas they put them in the emoji ranges in order to preserve the standardization factor and ensure interoperability. (That's probably over-simplified, and there are bound to be other factors involved.) We should no more expect the conventional Unicode character encoding model to apply to emoji than we should expect the old-fashioned text ranges to become vendor-driven.
Re: Encoding italic
Nobody has really addressed Andrew West's suggestion about using the tag characters. It seems conformant, unobtrusive, requiring no official sanction, and could be supported by third parties in the absence of corporate interest if deemed desirable.

One argument against it might be: Whoa, that's just HTML. Why not just use HTML? SMH

One argument for it might be: Whoa, that's just HTML! Most everybody already knows about HTML, so a simple subset of HTML would be recognizable.

After revisiting the concept, it does seem elegant and workable. It would provide support for elements of writing in plain-text for anyone desiring it, enabling essential (or frivolous) preservation of editorial/authorial intentions in plain-text. Am I missing something? (Please be kind if replying.)

On 2019-01-20 10:35 AM, Andrew West wrote: A possibility that I don't think has been mentioned so far would be to use the existing tag characters (E0020..E007F). These are no longer deprecated, and as they are used in emoji flag tag sequences, software already needs to support them, and they should just be ignored by software that does not support them. The advantages are that no new characters need to be encoded, and they are flexible, so that tag sequences for start/end of italic, bold, fraktur, double-struck, script, and sans-serif styles could be defined. For example, start and end of italic styling could be defined as the tag sequences <i> and </i> (E003C E0069 E003E and E003C E002F E0069 E003E). Andrew
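To make the suggestion concrete, here is a sketch of both halves of the scheme under the mapping Andrew gives (each ASCII character of a tag shifted up by 0xE0000); the helper names are mine, not any published API:

def to_tags(markup: str) -> str:
    # Encode an ASCII tag like '<i>' as Unicode tag characters.
    return "".join(chr(0xE0000 + ord(c)) for c in markup)

def strip_tags(text: str) -> str:
    # Drop tag characters (U+E0000..U+E007F); software unaware of the
    # convention should ignore them anyway, so this recovers the bare text.
    return "".join(c for c in text if not 0xE0000 <= ord(c) <= 0xE007F)

styled = "some " + to_tags("<i>") + "italic" + to_tags("</i>") + " text"
assert strip_tags(styled) == "some italic text"

The round trip is the point: a styling-aware renderer can act on the invisible tags, while stripping or ignoring them leaves ordinary, searchable ASCII.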
Re: Encoding italic
David Starner wrote,

> You're emailing from Gmail, which has support for italics in email.

But I compose e-mails in BabelPad, which has support for far more than italics in HTML mail. And I'm using Mozilla Thunderbird to send and receive text e-mail via the Gmail account. And if I wanted to /display/ italics in a web page, I would create the source file in a plain-text editor. (HTML mark-up is fairly easy to type with the ASCII keyboard.) If I compose a text file in BabelPad, it can be opened in many rich-text applications and the information survives intact. Unless I am foolish enough to edit the file in the rich-text application and file-save it. Because that mungs the plain-text file, and it can no longer be retrieved by the plain-text editor which created it.

>> ...third-party...
>
> Where are these tools?

BabelPad is an outstanding example. Earlier in this discussion a web search found at least a handful of third-party tools devoted to liberating the math-alphas for Twitter users.

> The superscripts show a problem with multiple encoding; even if you
> think they should be Unicode superscripts, and they look like Unicode
> superscripts, they might be HTML superscripts. Same thing would happen
> with italics if they were encoded in Unicode.

Hmmm. Rich-text styled italics might be copied into other rich-text applications, but they cannot be copied into plain-text apps. If Unicode-enabled italics existed, plain-text italics could be copy/pasted into either rich-text or plain-text applications and survive intact. So Unicode-enabled italics would be interoperable. Anyone concerned about interoperability would be well advised to go with plain-text. I am, so I do. When I can.

Kie eksistas fumo, tie eksistas fajro. (Where there's smoke, there's fire.)
Re: Encoding italic
Responding to David Starner,

It’s true that most users can’t be troubled to take the extra time needed to insert any kind of special characters which aren’t covered by the keyboard. Even the enthusiasts among us seldom take the trouble to include ‘proper’ quotes and apostrophes in e-mails — even for posting to specialized lists such as this one where other members might notice and appreciate the extra effort involved. Even though /we/ know how to do it and have software installed to help us do it.

It’s also true that standard U.S. keyboards and drivers aren’t very helpful with diacritics. Yet when we reply to list colleagues with surnames such as “Dürst” or “Bień”, we usually manage to get it right. Sure, the “reply” feature puts the surname into the response for us and the e-mail software adds the properly spelled names into our address books automatically. But when we cite those colleagues in a post replying to some other list member, we typically take the time and trouble to write their names correctly. Not only because we /can/, but because we /should/.

> How do you envision this working?

Splendidly! (smile) Social platforms, plain-text editors, and other applications do enhance their interfaces based on user demand from time to time. User demand, at least on Twitter, seems established. As pointed out previously in this discussion, that demand doesn’t seem to result in much “Chicago style” text (although I have personally observed some) and may only be a passing fad /for Twitter users/. When corporate interests aren't interested, third-party developers develop tools.

> You've yet to demonstrate that interoperability is an actual problem.

Copy/pasting from a web page into a plain-text editor removes any italics content, which is currently expected behavior. Opinions differ as to whether that represents mere format removal or a loss of meaning. Those who consider it as a loss of meaning would perceive a problem with interoperability.

Consider superscript/subscript digits as a similar styling issue. The Wikipedia page for Romanization of Chinese includes information about the Wade-Giles system’s tone marks, which are superscripted digits. https://en.wikipedia.org/wiki/Romanization_of_Chinese Copy/pasting an example from the page into plain-text results in “ma1, ma2, ma3, ma4”, although the web page displays the letters as italic and the digits as (italic) superscripts. IMO, that’s simply wrong with respect to the superscript digits and suboptimal with respect to the italic letters.

> To expand on what Mark E. Shoulson said, to add new italics characters,
> you're going to need to not only copy all of Latin, but also Cyrillic ...

I quite agree that expanding atomic italic encoding is off the table at this point. (And that italicized CJK ideographs are daft.)
Re: Encoding italic (was: A last missing link)
On 2019-01-20 10:49 PM, Garth Wallace wrote: I think the real solution is for Twitter to just implement basic styling and make this a moot point. At which time it would only become a moot point for Twitter users. There's also Facebook and other on-line groups. Plus scholars and linguists. And interoperability.
Re: Encoding italic (was: A last missing link)
(In the event that a persuasive proposal presentation prompts the possibility of italics encoding...) Possible approaches include:

1 - Liberating the italics from the Members Only Math Club
...which has been an ongoing practice since they were encoded. It already works, but the set is incomplete and the (mal)practice is frowned upon. Many of the older "shortcomings" of the set can now be overcome with combining diacritics. These italics decompose to ASCII.

2 - Character-level variation selectors
Work with today's tech. The default-ignorable property suggests that apps that don't want to deal with them won't. Many see VS as pseudo-encoding. Stripping VS leaves ASCII behind.

3 - Open/close punctuation treatment
Stateful. Works on ranges. Not currently supported in plain-text. Could be supported in applications which can take a text string URL and make it a clickable link. Default appearance in nonsupporting apps may resemble existing plain-text italic kludges such as slashes. The ASCII is already in the character string. (A sketch of how this could behave follows this list.)

4 - Leave it alone
This approach requires no new characters and represents the default condition. ASCII.

Number 1 would require that anything not already covered would have to be eventually proposed and accepted, 2 would require no new characters at all, and 3 would require two control characters for starters.

As "food for thought" questions: if a persuasive case is presented for encoding italics, and excluding 4, which approach would have the least impact on the rich-text world? Which would have the least impact on existing plain-text technology? Which would be least likely to conflict with Unicode principles/encoding model?
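As a thought experiment for approach 3, the sketch below fakes the open/close pair with two Private Use Area code points, since no such controls are encoded; everything about it is hypothetical:

# U+E000/U+E001 are PUA stand-ins for unencoded open/close italic marks.
OPEN_ITALIC, CLOSE_ITALIC = "\uE000", "\uE001"

def render_html(text: str) -> str:
    # Map the stand-in controls to markup for a supporting display.
    return text.replace(OPEN_ITALIC, "<i>").replace(CLOSE_ITALIC, "</i>")

s = "a " + OPEN_ITALIC + "kakistocracy" + CLOSE_ITALIC + " indeed"
print(render_html(s))  # a <i>kakistocracy</i> indeed

A nonsupporting app would show the stand-ins as boxes or nothing, but the ASCII between them stays intact and searchable, which is the property claimed for approach 3 above.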
Re: Encoding italic (was: A last missing link)
Victor Gaultney wrote,

> If however, we say that this "does not adequately consider the harm done
> to the text-processing model that underlies Unicode", then that exposes a
> weakness in that model. That may be a weakness that we have to accept for
> a variety of reasons (technical difficulty, burden on developers, UI impact,
> cost, maturity).

Unicode's character encoding principles and underlying text-processing model remain robust. They are the foundation of modern computer text processing. The goal of 푛푒 푝푙푢푠 푢푙푡푟푎¹ needs to accommodate the best expectations of the end users and the fact that the consistent approach of the model eases the software people's burdens by ensuring that effective programming solutions to support one subset or range of characters can be applied to the other subsets of the Unicode repertoire. And that those solutions can be shared with other developers in a standard fashion. Assigning properties to characters gives any conformant application clear instructions as to what exactly is expected as the app encounters each character in a string. In simpler times, the only expectation was that the application would splat a glyph onto a screen (and/or sheet of paper) and store a binary string for later retrieval. We've moved forward.

'Unicode encodes characters, not glyphs' is a core principle. There's a legitimate concern whenever anyone is perceived as heading in the general direction of turning the character encoding into a glyph registry, as it suggests a possible step backwards and might lead to a slippery slope. For example, if italics are encoded, why not fraktur and Gaelic?²

The notion that any given system can't be improved is static.³ ("System" refers to Unicode's repertoire and coverage rather than its core principles. Core principles are rock solid by nature.)

¹ /ne plus ultra/
² "Conversely, significant differences in writing style for the same script may be reflected in the bibliographical classification—for example, Fraktur or Gaelic styles for the Latin script. Such stylistic distinctions are ignored in the Unicode Standard, which treats them as presentation styles of the Latin script." Ken Whistler, http://unicode.org/reports/tr24/
³ "Static" can be interpreted as either virtually catatonic or radio noise. Either is applicable here.
Re: Encoding italic
On 2019-01-19 6:19 PM, wjgo_10...@btinternet.com wrote:

> It seems to me that it would be useful to have some codes that are
> ordinary characters in some contexts yet are control codes in others, ...

Italics aren't a novel concept. The bar for encoding new characters is that conventions for them exist and that people *are* exchanging them, people have exchanged them in the past, or that people demonstrably *need* to exchange them. Excluding emoji, any suggestion or proposal whose premise is "It seems to me that it would be useful if <characters supporting that>..." is doomed to be deemed out of scope for the standard.
Re: NNBSP
Marcel Schneider wrote,

> When you ask for knowing the foundations and that knowledge is persistently refused,
> you end up believing that those foundations just can’t be told.
>
> Note, too, that I readily ceased blaming UTC, and shifted the blame elsewhere, where it
> actually belongs.

Why not think of it as a learning curve? Early concepts and priorities were made from a lower position on that curve. We can learn from the past and apply those lessons to the future, but a post-mortem seldom benefits the cadaver. Minutiae about decisions made long ago probably exist, but may be presently poorly indexed/organized and difficult to search/access. As the collection of encoding history becomes more sophisticated and the searching technology becomes more civilized, it may become easier to glean information from the archives.

(OT - A little humor, perhaps... On the topic of Francophobia, it is true that some of us do not like dead generalissimos. But most of us adore the French for reasons beyond Brigitte Bardot and bon-bons. Cuisine, fries, dip, toast, curls, culture, kissing, and tarts, for instance. Not to mention cognac and champagne!)
Re: Encoding italic
For web searching, using the math-string 푀푎푦푛푎푟푑 퐾푒푦푛푒푠 as the keywords finds John Maynard Keynes in web pages. Tested this in both Google and DuckDuckGo. Seems like search engines are accommodating actual user practices. This suggests that social media data is possibly already being processed for the benefit of the users (and future historians) by software people who care about such things.
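The likely mechanism is ordinary compatibility folding rather than anything platform-specific: the math alphanumerics carry compatibility decompositions to ASCII, so an indexer applying NFKC/NFKD (a common normalization step; whether these particular engines do exactly this is my assumption) sees the plain name:

import unicodedata

fancy = "푀푎푦푛푎푟푑 퐾푒푦푛푒푠"  # math italic letters from U+1D400..U+1D7FF
print(unicodedata.normalize("NFKC", fancy))  # -> Maynard Keynes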
Re: Encoding italic
On 2019-01-17 11:50 AM, Martin J. Dürst wrote:

> Most probably not. I think Asmus has already alluded to it, but in good
> typography, roman and italic fonts are considered separate.

So are Latin and Cyrillic fonts. So are American English and Polish fonts, for that matter, even though they're both Latin-based. Times New Roman and Times New Roman Italic might be two separate font /files/ on computers, but they are the same typeface. The point I was trying to make WRT 256-glyph fonts is that they pre-date Unicode, and I believe much of the "layering" is based on artifacts from that era. Lead fonts were glyph-based. The technical concept of character came later.
Re: Encoding italic
On 2019-01-17 6:27 AM, Martin J. Dürst replied:

> ...
> So even if you can find examples where the presence or absence of
> styling clearly makes a semantic difference, this may or will not be
> enough. It's only when it's often or overwhelmingly (as opposed to
> occasionally) the case that a styling difference makes a semantic
> difference that this would start to become a real argument for plain
> text encoding of italics (or other styling information).

(Also from chapter 2 of the PDF:) "Plain text is public, standardized, and universally readable." The UCS is universal, which implies that even edge cases, such as failed or experimental historical orthographies, are preserved in plain text.

> ...
> I think most Unicode specialists have chosen to ignore this thread by
> this point.

Those not switched off by the thread title may well be exhausted and pressed for time because of the UTC meeting.

> ...
> Based by these data points, and knowing many of the people involved, my
> description would be that decisions about what to encode as characters
> (plain text) and what to deal with on a higher layer (rich text) were
> taken with a wide and deep background, in a gradually forming industry
> consensus.

(IMO) All of which had to deal with the existing font size limitations of 256 characters and the need to reserve many of those for other textual symbols as well as box drawing characters. Cause and effect. The computer fonts weren't designed that way *because* there was a technical notion to create "layers". It's the other way around. (If I'm not mistaken.)

>> ..."Jackie Brown"...
> ...
> Also, for probably at least 90% of
> the readership, the style distinction alone wouldn't induce a semantic
> distinction, because most of the readers are not familiar with these
> conventions.

Proper spelling and punctuation seem to be dwindling in popularity, as well. There's a percentage unable to make a semantic distinction between 'your' and 'you’re'.

> (If you doubt that, please go out on the street and ask people what
> italics are used for, and count how many of them mention film titles or
> ship names.)

Or the em-dash, en-dash, Mandaic letter ash, or Gurmukhi sign yakash. Sure, most street people have other interests.

> (And just while we are at it, it would still not be clear which of
> several potential people named "Jackie Brown" or "Thorstein Veblen"
> would be meant.)

Isn't that outside the scope of italics? (winks)
Re: Encoding italic (was: A last missing link)
Victor Gaultney wrote,

> Treating italic like punctuation is a win for a lot of people:

Italic Unicode encoding is a win for a lot of people regardless of approach. Each of the listed wins remains essentially true whether treated as punctuation, encoded atomically, or selected with VS.

> My main point in suggesting that Unicode needs these characters is that
> italic has been used to indicate specific meaning - this text is somehow
> special - for over 400 years, and that content should be preserved in plain
> text.

( http://www.unicode.org/versions/Unicode11.0.0/ch02.pdf ) "Plain text must contain enough information to permit the text to be rendered legibly, and nothing more." The argument is that italic information can be stripped yet still be read. A persuasive argument towards encoding would need to negate that; it would have to be shown that removing italic information results in a loss of meaning. The decision makers at Unicode are familiar with italic use conventions such as those shown in "The Chicago Manual of Style" (first published in 1906). The question of plain-text italics has arisen before on this list and has been quickly dismissed.

Unicode began with the idea of standardizing existing code pages for the exchange of computer text using a unique double-byte encoding rather than relying on code page switching. Latin was "grandfathered" into the standard. Nobody ever submitted a formal proposal for Basic Latin. There was no outreach to establish contact with the user community -- the actual people who used the script as opposed to the "computer nerds" who grew up with ANSI limitations and subsequent ISO code pages. Because that's how Unicode rolled back then. Unicode did what it was supposed to do WRT Basic Latin.

When someone points out that italics are used for disambiguation as well as stress, the replies are consistent. "That's not what plain-text is for." "That's not how plain-text works." "That's just styling and so should be done in rich-text." "Since we do that in rich-text already, there's no reason to provide for it in plain-text." "You can already hack it in plain-text by enclosing the string with slashes." And so it goes.

But if variant letter form information is stripped from a string like "Jackie Brown", the primary indication that the string represents either a person's name or a Tarantino flick title is also stripped. "Thorstein Veblen" is either a dead economist or the name of a fictional yacht in the Travis McGee series. And so forth. Computer text tradition aside, nobody seems to offer any legitimate reason why such information isn't worthy of being preservable in plain-text. Perhaps there isn't one.

I'm not qualified to assess the impact of italic Unicode inclusion on the rich-text world as mentioned by David Starner. Maybe another list member will offer additional insight or a second opinion.
Re: A last missing link for interoperable representation
Julian Bradfield wrote, > Oh, and what about dropped initials? They have been used in both > manuscripts and typography for many centuries - surely we must encode > them? Naa-aah, we just hack the full width presentation forms for that. Drop Caps in Plain Text (Whether they actually drop depends on the font, though.)
Re: Encoding italic
Responding to David Starner,

> I might complain about the people who claim to like plain text yet would
> only be happy with massive changes to it, though.

Most movie lovers welcomed talkies. People are free to cling to their rotary phones as long as they like. They just can't press the pound sign.

> However, plain text can be used standalone, and it can be used inside
> programs and other formats.

That remains true even post-emoji. How would italics change that?

> Dismissing the people who use Unicode in ways that aren't plain text
> is unfair and hurts your case.

It wasn't my intention to be dismissive, much, so point taken. Discussions like this one exist so that people can express concerns and share ideas towards resolutions.

> Adding italics to Unicode will complicate the implementation of all rich
> text applications that currently support italics.

Would there be any advantages to rich-text apps if italics were added to Unicode? Is there any cost/benefit data? You've made an assertion about complication to rich-text apps which I can neither confirm nor refute. One possible advantage would be interoperability. People snagging snippets of text from web pages or word processors and dropping data into their plain-text windows wouldn't be bamboozled by the unexpected. If computer text is getting exchanged, isn't it better when it can be done in a standard fashion?
Re: Encoding italic (was: A last missing link)
Victor Gaultney wrote,

> Use of variation selectors, a single character modifier, or combining
> characters also seem to be less useful options, as they act at the individual
> character level and are highly impractical. They also violate the key concept
> that italics are a way of marking a span of text as 'special' - not individual
> letters. Matched punctuation works the same way and is a good fit for italic.

The VS possibility would double the character count of any strings including them. That may make it undesirable for groups like Twitter who have limits. But math (mis)use doesn't affect the character count. If the VS method were to be used, the math alphanumerics might continue to be used where possible, at least by Twitter users whose math-alpha tweets already form a corpus of legacy data.

Using VS arose in the parent thread as a way of avoiding the necessity of adding additional characters to the standard. (But we don't seem to be running out of available code space.) The purpose of VS is to preserve variant letter form distinctions in plain-text, which seems to apply to italics. Further, VS is an existing mechanism which wouldn't be expected to impact searching and so forth on savvy systems. (An opening/closing pair of control characters also shouldn't impact searching.) Finally, VS already works in existing technology, and there wouldn't be a long down-time waiting for updates to the standard and implementation of same. (Not that we should rush to judgment or "solutions" here, just that an ad-hoc "solution" is possible and could be implemented by third parties.)

Concerns about statefulness in plain-text exist. Treating "italic" as an opening/closing "punctuation" may help get around such concerns. IIRC, it was proposed that the Egyptian cartouche be handled that way. Like emoji, people who don't like italics in plain text don't have to use them.
Re: Encoding italic
Enabling plain-text doesn't make rich-text poor. People who regard plain-text with derision, disdain, or contempt have every right to hold and share opinions about what plain-text is *for* and in which direction it should be heading. Such opinions should receive all the consideration they deserve.
Re: Encoding italic (was: A last missing link)
Although there probably isn't really any concerted effort to "keep plain-text mediocre", it can sometimes seem that way. As we've been told repeatedly, just because something has been done over and over again doesn't mean that there's a precedent for it. Using spans of text as a general indicator of rich-text seems reasonable at first blush. But selected spans can also be copy/pasted (relocated), which is not stylistic at all. Spans of text can be selected to apply casing, which is often seen as non-stylistic. In applications such as BabelPad, spans of text can be converted to-and-from various forms of Unicode references and encodings. Spans of text can be transliterated, moved, or deleted. In short, selecting a span of text only means that the user is going to apply some kind of process to that span. Avant-garde enthusiasts are on the leading edge by definition. That's why they're known as trend setters. Unicode exists because forward-looking people envisioned it and worked to make it happen. Regardless of one's perception of exuberance, Unicode turned out to be so much more than a fringe benefit.
Re: A last missing link for interoperable representation
Hans Åberg wrote,

> How about using U+0301 COMBINING ACUTE ACCENT: 푝푎푠푠푒́

Thought about using a combining accent. Figured it would just display with a dotted circle but neglected to try it out first. It actually renders perfectly here. /That's/ good to know. (smile)
Re: A last missing link for interoperable representation
Hello Martin, others... > Blaming the problem on Unicode doesn't seem to be appropriate. I don't consider that there's any problem with plain text users exchanging plain text. I give Unicode /credit/ for being the foundation of that ability. Anyone imagining that I'm casting blame is under a misconception. There's plain text data out there stringing math alphanumerics into recognizable words. It's being stored and shared and indexed. I have no problem with that; I'm in favor of it. (Everyone, please let's focus on Tex Texin's latest post. Wish I'd sent this post before his...) Best regards, James Kass
Re: A last missing link for interoperable representation
Not a twitter user, don't know how popular the practice is, but here's a couple of links concerned with how to use bold or italics in Twitter plain text messages.

https://www.simplehelp.net/2018/03/13/how-to-use-bold-and-italicized-text-on-twitter/
https://mothereff.in/twitalics

Both pages include a form of caveat. But the caveat isn't about the intended use of the math alphanumerics. The first page includes the following text as part of a "tweet":

Just because you 헰헮헻 doesn’t mean you 혴혩혰혶혭혥 :)

And, as before, I have no idea how /popular/ the practice is. But here's some more links:

(web page from 2013)
How To Write In Italics, Tweet Backwards And Use Lots Of Different ...
https://www.adweek.com/digital/twitter-font-italics-backwards/

(This is copy/pasted *as-is* from the web page to plain-text)
Bold and Italic Unicode Text Tool - 퐁퐨퐥퐝 풂풏풅 푖푡푎푙푖푐푠 - YayText
https://yaytext.com/bold-italic/
Super cool unicode text magic. Write 퐛퐨퐥퐝 and/or 푖푡푎푙푖푐 updates on Facebook, Twitter, and elsewhere. Bold (serif) preview copy tweet.

Michael Maurino [emoji redacted-JK] on Twitter: "Can I make italics on twitter? 'cause ...
https://twitter.com/iron_stylus/status/281991180064022528?lang=en

Charlie Brooker on Twitter: "How do you do italics on this thing again?"
https://twitter.com/charltonbrooker/status/484623185862983680?lang=en

How to make your Facebook and Twitter text bold or italic, and other ...
https://boingboing.net/2016/04/10/yaytext-unicode-text-styling.html
Apr 10, 2016 - For years I've been using the Panix Unicode Text Converter to create ironic, weird or simply annoying text effects for use on Twitter, Facebook ...

How to change your Twitter font | Digital Trends
https://www.digitaltrends.com/.../now-you-can-use-bold-italics-and-other-fancy-fonts-...
Aug 14, 2013 - now you can use bold italics and other fancy fonts on twitter isaac ... or phrase into your Twitter text box, and there you have it: fancy tweets.

Twitter Fonts Generator (퓬퓸퓹픂 퓪퓷퓭 퓹퓪퓼퓽퓮) ― LingoJam
https://lingojam.com/TwitterFonts
You might have noticed that some users on Twitter are able to change the font ... them to seemingly make their tweet font bold, italic, or just completely different.
Re: A last missing link for interoperable representation
Julian Bradfield wrote,

> I have never seen a Unicode math alphabet character in email
> outside this list.

It's being done though. Check this message from 2013, which includes the following, copy/pasted from the web page into plain-text:

혗혈혙혛 혖혍 헔햳햮헭.향햱햠햬햤햶햮햱햪 © ퟮퟬퟭퟯ 햠햫햤햷 햦햱햠햸 헀헂헍헁헎햻.햼허헆/헺헿헮헹헲혅헴헿헮혆

https://apple.stackexchange.com/questions/104159/what-are-these-characters-and-how-can-i-use-them
Re: A last missing link for interoperable representation
Martin J. Dürst wrote,

> I'd say it should be conservative. As the meaning of that word
> (similar to others such as progressive and regressive) may be
> interpreted in various ways, here's what I mean by that.
>
> It should not take up and extend every little fad at the blink of an
> eye. It should wait to see what the real needs are, and what may be
> just a temporary fad. As the Mathematical style variants show, once
> characters are encoded, it's difficult to get people off using them,
> even in ways not intended.

A conservative approach to progress is a sensible position for computer character encoders. Taking a conservative approach doesn't necessarily mean being anti-progress. Trying to "get people off" using already encoded characters, whether or not the encoded characters are used as intended, might give an impression of being anti-progress.

Unicode doesn't enforce any spelling or punctuation rules. Unicode doesn't tell human beings how to pronounce strings of text or how to interpret them. Unicode doesn't push any rules about splitting infinitives or conjugating verbs. Unicode should not tell people how any written symbol must be interpreted. Unicode should not tell people how or where to deploy their own written symbols.

Perhaps fraktur is frivolous in English text. Perhaps its use would result in a new convention for written English which would enhance the literary experience. Italics conventions which have only been around a hundred years or so may well turn out to be just a passing fad, so we should probably give it a bit more time. Telling people they mustn't use Latin italic letter forms in computer text while we wait to see if the practice catches on seems flawed in concept.
Re: A last missing link for interoperable representation
Marcel Schneider wrote,

> There is a crazy typeface out there, misleadingly called 'Courier New',
> as if the foundry didn’t anticipate that at some point it would be better
> called "Courier Obsolete". ...

퐴푟푡 푛표푢푣푒푎푢 seems a bit 푝푎푠푠é nowadays, as well. (Had to use mark-up for that “span” of a single letter in order to indicate the proper letter form. But the plain-text display looks crazy with that HTML jive in it.)
Re: A last missing link for interoperable representation
Julian Bradfield replied,

>> Sounds like you didn't try it. VS characters are default ignorable.
>
> By software that has a full understanding of Unicode. There is a very
> large world out there of software that was written before Unicode was
> dreamed of, let alone popular.

यदि आप किसी रोटरी फोन से कॉल कर रहे हैं, तो कृपया स्टार (*) दबाएं। ("If you are calling from a rotary phone, please press star (*).")

What happens with Devanagari text? Should the user community refrain from interchanging data because 1980s-era software isn't Unicode-aware?
Re: A last missing link for interoperable representation
Mark E. Shoulson wrote,

> This discussion has been very interesting, really. I've heard what I
> thought were very good points and relevant arguments from both/all
> sides, and I confess to not being sure which I actually prefer.

It's subjective, really. It depends on how one views plain-text and one's expectations for its future. Should plain-text be progressive, regressive, or stagnant? Because those are really the only choices. And opinions differ.

Most of us involved with Unicode probably expect plain-text to be around for quite a while. The figure bandied about in the past on this list is "a thousand years". Only a society of mindless drones would cling to the past for a millennium. So, many of us probably figure that strictures laid down now will be overridden as a matter of course, over time. Unicode will probably be around for awhile, but the barrier between plain- and rich-text has already morphed significantly in the relatively short period of time it's been around.

I became attracted to Unicode about twenty years ago. Because Unicode opened up entire /realms/ of new vistas relating to what could be done with computer plain text. I hope this trend continues.
Re: A last missing link for interoperable representation
On 2019-01-12 4:26 PM, wjgo_10...@btinternet.com wrote: I have now made, tested and published a font, VS14 Maquette, that uses VS14 to indicate italic. https://forum.high-logic.com/viewtopic.php?f=10&t=7831&p=37561#p37561

The italics don't happen in Notepad, but VS14 Maquette works splendidly in LibreOffice! (Windows 7) (In a *.TXT file.) Since the VS characters are supposed to be used with officially registered/recognized sequences, it's possible that Notepad isn't trying to implement the feature.

The official reception of the notion of using variant letter forms, such as italics, in plain-text is typically frosty. So advancement of plain-text might be left up to third-party developers, enthusiasts, and the actual text users. And there's nothing wrong with that. (It's non-conformant, though, unless the VS material is officially recognized/registered.)

Non-Latin scripts, such as Khmer, may have their own traditions and conventions WRT special letter forms. Which is why starting at VS14 and working backwards might be inadequate in the long run. Khmer has letter forms called muul/moul/muol (not sure how to spell that one, but neither is anybody else). It superficially resembles fraktur for Khmer. Other non-Latin scripts may have a plethora of such forms/fonts/styles.
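For anyone who wants to experiment with the same convention, here is the tagging side as a sketch; it assumes, as the font apparently does, that VS14 (U+FE0D) after a base letter requests the italic form. Nothing here is a registered variation sequence, and the helper name is mine:

VS14 = "\uFE0D"

def italicize(text: str) -> str:
    # Append VS14 to each letter so a cooperating font can show italics;
    # everything else passes through untouched.
    return "".join(c + VS14 if c.isalpha() else c for c in text)

print(italicize("kakistocracy"))  # italic only under a font like VS14 Maquette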
Re: A last missing link for interoperable representation
Asmus Freytag wrote, > ...What this teaches you is that italicizing (or boldfacing) > text is fundamentally related to picking out parts of your > text in a different font. Typically from the same typeface, though. > So those screen readers got it right, except that they could > have used one of the more typical notational conventions that > the mathalphabetics are used to express (e.g. "vector" etc.), > rather than rattling off the Unicode name. WRT text-to-voice applications, such as "VoiceOver", I wonder how well they would do when encountering /any/ exotic text runs or characters. Like Yi, or Vai, or even an isolated CJK ideograph in otherwise Latin text. For example: "The Han radical # 72, which looks like '日', means 'sun'." Would the application "say" the character as a Japanese reader would expect to hear it? Or in one of the Chinese dialects? Or would the application just give the hex code point? In an era where most of the states in my country no longer teach cursive writing in public schools, it seems unlikely that Twitter users (and so forth) will be clamoring for the ability to implement Chicago Style text properly on their cell phone screens. (Many users would probably prefer to use the cell phone to order a Chicago style pizza.) But, stranger things have happened.
Re: A last missing link for interoperable representation
Reading & writing & 'rithmatick...

This is a math formula: a + b = b + a ... where the estimable "mathematician" used Latin letters from ASCII as though they were math alphanumeric variables.

This is an italicized word: 푘푎푘푖푠푡표푐푟푎푐푦 ... where the "geek" hacker used Latin italic letters from the math alphanumeric range as though they were Latin italic letters.

Where's the harm?

FWIW, the math formula: a + b ≠ 푏 + 푎 ... becomes invalid if normalized NFKD/NFKC. (Or if copy/pasted from an HTML page using marked-up ASCII into a plain-text editor.) Yet the italicized word "kakistocracy" is still legible if normalized. If copy/pasted from an HTML page using the math alphanumeric characters, it survives intact. If copy/pasted from marked-up ASCII, it's still legible.
Re: A last missing link for interoperable representation
Julian Bradford wrote, * Bradfield, sorry.
Re: A last missing link for interoperable representation
Julian Bradford wrote,

"It does not work with much existing technology. Interspersing extra codepoints into what is otherwise plain text breaks all the existing software that has not been, and never will be, updated to deal with arbitrarily complex algorithms required to do Unicode searching. Somebody who needs to search exotic East Asian text will know that they need software that understands VS, but a plain ordinary language user is unlikely to have any idea that VS exist, or that their searches will mysteriously fail if they use this snazzy new pseudo-plain-text italicization technique."

Sounds like you didn't try it. VS characters are default ignorable. The first one is straight, the second one has VS2 characters interspersed and after the "t":

apricot
a︁p︁r︁i︁c︁o︁t︁

Notepad finds them both if you type the word "apricot" into the search box.

"..."

Regardless of how you input italics in rich-text, you are putting italic forms into the display.

"I think the VS or combining format character approach *would* have been a better way to deal with the mess of mathematical alphabets, ..."

I think so, too, but since I'm not a member of *that* user community, my opinion hasn't much value. Plus VS characters were encoded after the math stuff.

"But for plain text, it's crazy."

Are you a member of the plain-text user community?
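For anyone who wants to replicate the experiment programmatically, a folding step like the one sketched below is one way a search routine can honour the default-ignorable property; it illustrates the principle and is not a claim about how Notepad's search is actually implemented:

def fold_variation_selectors(text: str) -> str:
    # Drop VS1..VS16 (U+FE00..U+FE0F) and the supplementary selectors
    # (U+E0100..U+E01EF) so VS-laden text compares equal to plain text.
    return "".join(
        c for c in text
        if not (0xFE00 <= ord(c) <= 0xFE0F or 0xE0100 <= ord(c) <= 0xE01EF)
    )

plain = "apricot"
with_vs = "a\uFE01p\uFE01r\uFE01i\uFE01c\uFE01o\uFE01t\uFE01"  # VS2 interspersed
assert fold_variation_selectors(with_vs) == plain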