Re: Proposal for BiDi in terminal emulators
On Fri, Feb 01, 2019 at 06:57:43PM +, Richard Wordingham via Unicode wrote:
> On Fri, 1 Feb 2019 13:02:45 +0200 Khaled Hosny via Unicode wrote:
> > On Thu, Jan 31, 2019 at 11:17:19PM +, Richard Wordingham via Unicode wrote:
> > > On Thu, 31 Jan 2019 12:46:48 +0100 Egmont Koblinger wrote:
> > >
> > > No. How many cells do CJK ideographs occupy? We've had a strong
> > > hint that a medial BEH should occupy one cell, while an isolated
> > > BEH should occupy two.
> >
> > Monospaced Arabic fonts (there are not that many of them) are designed
> > so that all forms occupy just one cell (most even including the
> > mandatory lam-alef ligatures), unlike CJK fonts.
> >
> > I can imagine the terminal restricting itself to monospaced fonts,
> > disabling the “liga” feature just in case, and expecting the font to
> > behave well. Any other magic is likely to fail.
>
> Of course, strictly speaking, a monospaced font cannot support harakat
> as Egmont has proposed.

There are two approaches to handling them in monospaced fonts: combining them with base characters as usual, or treating them as spacing characters placed next to their bases. The latter approach is a bit unusual, but it makes editing heavily vowelled text a bit more pleasant. It requires good OpenType support, though, so virtually no terminal supports it.

Regards,
Khaled
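The two approaches can be told apart programmatically: a combining haraka is a nonspacing mark (General Category Mn) with a fixed canonical combining class, while a spacing presentation of it would occupy a cell of its own. A minimal sketch using Python's unicodedata (the character choices are mine, not from the thread):

```python
import unicodedata

FATHA = "\u064E"  # ARABIC FATHA, a haraka (vowel mark)
BEH = "\u0628"    # ARABIC LETTER BEH, a base letter

def is_combining_mark(ch):
    """True for nonspacing marks, which render on top of their base."""
    return unicodedata.category(ch) == "Mn"

# In the first approach the mark adds no cell of its own:
print(is_combining_mark(FATHA), is_combining_mark(BEH))  # True False
print(unicodedata.combining(FATHA))  # Arabic fixed combining class: 30
```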
Re: Proposal for BiDi in terminal emulators
On Thu, Jan 31, 2019 at 11:17:19PM +, Richard Wordingham via Unicode wrote:
> On Thu, 31 Jan 2019 12:46:48 +0100 Egmont Koblinger wrote:
>
> No. How many cells do CJK ideographs occupy? We've had a strong hint
> that a medial BEH should occupy one cell, while an isolated BEH should
> occupy two.

Monospaced Arabic fonts (there are not that many of them) are designed so that all forms occupy just one cell (most even including the mandatory lam-alef ligatures), unlike CJK fonts.

I can imagine the terminal restricting itself to monospaced fonts, disabling the “liga” feature just in case, and expecting the font to behave well. Any other magic is likely to fail.

Regards,
Khaled
Re: Encoding italic
On Thu, Jan 24, 2019 at 10:42:59PM +, Richard Wordingham via Unicode wrote:
> On Thu, 24 Jan 2019 18:24:07 +0200 Khaled Hosny via Unicode wrote:
> > On Thu, Jan 24, 2019 at 03:54:29PM +, Andrew West via Unicode wrote:
> > > On Thu, 24 Jan 2019 at 15:42, James Kass wrote:
> > > > Going off topic a little, I saw this tweet from Marijn van Putten
> > > > today which shows examples of Arabic script from early Quranic
> > > > manuscripts with phonetic information indicated by the use of red
> > > > and green dots:
> > > >
> > > > https://twitter.com/PhDniX/status/1088171783461703682
> > >
> > > I would be interested to know how those should be represented in
> > > Unicode.
> >
> > It is possible to represent this by use of color fonts.
>
> The limitations of rendering technology should not be an argument
> against an encoding. We have characters that differ only in their
> properties, such as word-breaking and line-breaking.

They are already encoded, in their modern uncolored form. Some of the modern forms, like U+06E5 ARABIC SMALL WAW, etc., were even specifically “invented” in the previous century to overcome the impracticality of printing in multiple colors, so the colored and uncolored forms are different representations of the same underlying characters.

> In this case, it may be argued that their colours apply only to their
> 'plain' colouring. Who determines what their colour should be in blue
> text? (Font technology seems to dictate that their colour is
> unaffected by the choice of foreground colour.)

The colors don’t change: the vowel marks are always red, the hamza is always green/yellow.
Re: Encoding italic (was: A last missing link)
On Thu, Jan 24, 2019 at 03:54:29PM +, Andrew West via Unicode wrote:
> On Thu, 24 Jan 2019 at 15:42, James Kass wrote:
> >
> > Here's a very polite reply from John Hudson from 2000,
> > http://unicode.org/mail-arch/unicode-ml/Archives-Old/UML024/1042.html
> > ...and, over time, many of the replies to William Overington's colorful
> > suggestions were less than polite. But it was clear that colors were
> > out-of-scope for a computer plain-text encoding standard.
>
> Going off topic a little, I saw this tweet from Marijn van Putten
> today which shows examples of Arabic script from early Quranic
> manuscripts with phonetic information indicated by the use of red and
> green dots:
>
> https://twitter.com/PhDniX/status/1088171783461703682
>
> I would be interested to know how those should be represented in Unicode.

It is possible to represent this by use of color fonts. The green (sometimes golden) dots are the hamza; the red ones are various vowel marks. A color font would use colored glyphs for these instead of the modern shapes. I made a color font that does a similar thing (but still uses the modern forms), and it is on my to-do list to do one using archaic Kufi forms.

Regards,
Khaled
Re: A last missing link for interoperable representation
On Sun, Jan 13, 2019 at 04:52:25PM +, Julian Bradfield via Unicode wrote: > On 2019-01-12, James Kass via Unicode wrote: > > This is an italicized word: > > 푘푎푘푖푠푡표푐푟푎푐푦 > > ... where the "geek" hacker used Latin italics letters from the math > > alphanumeric range as though they were Latin italics letters. > > It's a sequence of question marks unless you have an up to date > Unicode font set up (which, as it happens, I don't for the terminal in > which I read this mailing list). Since actual mathematicians don't use > the Unicode math alphabets, there's no strong incentive to get updated > fonts. They do, but not necessarily by directly inputting them. LaTeX with the “unicode-math” package will translate ASCII + font switches to the respective Unicode math alphanumeric characters. Word will do the same. Even browsers rendering MathML will do the same (though most likely the MathML source will have the math alphanumeric characters already). Regards, Khaled
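The font-switch-to-codepoint translation these systems perform is mostly a fixed offset into the Mathematical Alphanumeric Symbols block, with a handful of holes that were encoded earlier elsewhere. A sketch of the italic mapping (an illustration of the idea, not unicode-math's actual implementation):

```python
def math_italic(ch):
    """Map an ASCII letter to its Mathematical Italic counterpart."""
    if ch == "h":
        # The italic-h slot (U+1D455) is reserved: h was already encoded
        # as U+210E PLANCK CONSTANT, one of the block's famous holes.
        return "\u210e"
    if "A" <= ch <= "Z":
        return chr(0x1D434 + ord(ch) - ord("A"))
    if "a" <= ch <= "z":
        return chr(0x1D44E + ord(ch) - ord("a"))
    return ch  # leave non-letters alone

print("".join(map(math_italic, "kakistocracy")))
```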
Re: Excessive emoji usage and TTS (was Re: A last missing link)
On Thu, Jan 10, 2019 at 09:54:59PM +0530, Shriramana Sharma via Unicode wrote: > On Thu 10 Jan, 2019, 20:49 Arthur Reutenauer via Unicode < > unicode@unicode.org wrote: > > > > > On this topic, I was just pointed to > > > > https://twitter.com/kentcdodds/status/1083073242330361856 > > > > “You 혵혩혪혯혬 it's 풸퓊퓉ℯ to 현헿헶혁헲 your tweets and usernames > > 햙햍햎햘 햜햆햞. But > > have you 홡홞홨황홚홣홚홙 to what it 혴혰혶혯혥혴 혭혪혬혦 with assistive > > technologies > > like 퓥퓸퓲퓬퓮퓞퓿퓮퓻?” > > > Something similar: > > https://twitter.com/aaronreynolds/status/1083098920132071424?s=20 > > "This is what it’s like to get texts from my fourteen year old while > driving." > > https://t.co/s8949bmgZI That is pretty good actually and even a positive point for emoji (if these were mere images you would get nothing out of it without extra tagging, and it would still lack the standardization). Nothing like what one gets from the math symbols abuse. Regards, Khaled
Re: A sign/abbreviation for "magister"
On Wed, Oct 31, 2018 at 03:32:09PM -0700, Asmus Freytag via Unicode wrote:
> On 10/31/2018 9:03 AM, Khaled Hosny via Unicode wrote:
> > A while ago I was localizing some application to Arabic and the developer
> > “helpfully” used m² for square meter, but that does not work for Arabic
> > because there is no superscript ٢ in Unicode, so I had to contact the
> > developer and ask for markup to be used for the superscript so that I
> > can use it as well.
>
> This just pushes the issue down one level.
>
> Because it assumes that the presence/absence of markup is locale-independent.
>
> For translation of general text I know this is not true. There are instances
> where some words in certain languages are customarily italicized in a way that
> is not lexical, therefore not something where the source language would ever
> supply markup.

That was a while ago, but IIRC the markup was enabled for that particular widget unconditionally. The localizer is now free to use the markup or not; the string was translatable as a whole with the embedded markup. It should be possible to enable markup for any widget; it is just an option to tick in the UI designer. But my experience is that markup is seldom needed in computer UIs, though I may be biased by the kind of UIs and locales I’m most familiar with.

Regards,
Khaled
Re: A sign/abbreviation for "magister"
On Tue, Oct 30, 2018 at 10:02:43PM +0100, Marcel Schneider wrote:
> On 30/10/2018 at 21:34, Khaled Hosny via Unicode wrote:
> > On Tue, Oct 30, 2018 at 04:52:47PM +0100, Marcel Schneider via Unicode wrote:
> > > E.g. in Arabic script, superscript is considered worth
> > > encoding and using without any caveat, whereas when Latin script is on,
> > > superscripts are thrown into the same cauldron as underscoring.
> >
> > Curious, what Arabic superscripts are encoded in Unicode?
>
> First, ARABIC LETTER SUPERSCRIPT ALEPH U+0670.
> But it is a vowel sign. Many letters put above are called superscript
> when explaining in English.

As you say, this is a vowel sign, not a superscript letter, so the name is a misnomer at best. It should have been called COMBINING ARABIC LETTER ALEF ABOVE, similar to COMBINING LATIN SMALL LETTER A. In Arabic it is called small or dagger alef.

> There is the range U+FC5E..U+FC63 (presentation forms).

That is a backward compatibility block no one is supposed to use; there are many such backward compatibility presentation forms even for the Latin script (U+FB00..U+FB4F). So I don’t see what makes you think, based on this, that Unicode is favouring Arabic or other scripts over Latin.

Regards,
Khaled
Re: A sign/abbreviation for "magister"
On Tue, Oct 30, 2018 at 04:52:47PM +0100, Marcel Schneider via Unicode wrote: > E.g. in Arabic script, superscript is considered worth > encoding and using without any caveat, whereas when Latin script is on, > superscripts are thrown into the same cauldron as underscoring. Curious, what Arabic superscripts are encoded in Unicode? Regards, Khaled
Re: Unicode 11 Georgian uppercase vs. fonts
On Fri, Jul 27, 2018 at 02:02:07PM +0100, Michael Everson via Unicode wrote:
> 1) Show evidence of titlecasing in Hebrew or Arabic.

FWIW, there was a case system for Arabic used at some point in Egypt, called “crown letters”. It was introduced under the direction of King Fuad and was used in some capacity in official documents till the end of the monarchy:

https://en.wikipedia.org/wiki/Crown_Letters_and_Punctuation_and_Their_Placements
http://hibastudio.com/wp-content/uploads/2014/01/ar458.jpg

Regards,
Khaled
Re: metric for block coverage
On Sun, Feb 18, 2018 at 02:14:46AM -0800, James Kass via Unicode wrote:
> Adam Borowski wrote,
>
> > I'm looking for a way to determine a font's coverage of available scripts.
> > It's probably reasonable to do this per Unicode block. Also, it's a safe
> > assumption that a font which doesn't know a codepoint can do no complex
> > shaping of such a glyph, thus looking at just codepoints should be adequate
> > for our purposes.
>
> You probably already know that basic script coverage information is
> stored internally in OpenType fonts in the OS/2 table.
>
> https://docs.microsoft.com/en-us/typography/opentype/spec/os2
>
> Parsing the bits in the "ulUnicodeRange..." entries may be the
> simplest way to get basic script coverage info.

This might not be very reliable, though, since OpenType does not define what it means for a Unicode block to be supported: some font authoring tools use a percentage, others use the presence of any characters in the range, and fonts might even provide incorrect data for various reasons. However, I don’t think script or block coverage is that useful; what users are usually interested in is language coverage.

Regards,
Khaled
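For what it's worth, the ulUnicodeRange fields are just four 32-bit bitfields; decoding them is trivial, it is their meaning that is fuzzy. A sketch of the decoding (the bit-to-name excerpt follows the OpenType OS/2 specification; with fontTools one would read the actual values from `font["OS/2"].ulUnicodeRange1` through `...Range4`):

```python
# Excerpt of bit assignments from the OpenType OS/2 specification.
OS2_RANGE_BITS = {
    0: "Basic Latin",
    1: "Latin-1 Supplement",
    13: "Arabic",
}

def claimed_ranges(ul1, ul2=0, ul3=0, ul4=0):
    """Return the bit numbers set across the four ulUnicodeRange fields."""
    combined = ul1 | (ul2 << 32) | (ul3 << 64) | (ul4 << 96)
    return [bit for bit in range(128) if combined & (1 << bit)]

# A font claiming Basic Latin (bit 0) and Arabic (bit 13):
bits = claimed_ranges((1 << 0) | (1 << 13))
print([OS2_RANGE_BITS.get(b, b) for b in bits])  # ['Basic Latin', 'Arabic']
```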
Re: superscripts & subscripts for science/mathematics?
On Mon, Jan 22, 2018 at 07:43:34PM -0800, David Melik via Unicode wrote:
> ‘The intended use was to allow chemical and algebra formulas to be written
> without markup’ -- https://en.wikipedia.org/wiki/Unicode_subscripts_and_superscripts.
> Unless wrong, apart from disagreement, it's clear mathematics word
> processing software is useful, but not a reason to not finish
> almost-complete set of basic superscripts & subscripts ((super|sub)scripts)
> for relevant alphabets used (English, Greek, perhaps Hebrew, latter two
> which were in my original post subject line, but I likely accidentally used
> link I received to delete pre-moderated post.)

Mathematics written in Arabic notation uses Arabic-Indic numbers and Arabic letters, and they can occur in superscripts and subscripts as well.

Regards,
Khaled
Re: Algorithms for Unicode script detection
On Thu, Jul 06, 2017 at 09:43:29AM +1000, Simon Cozens via Unicode wrote:
> I want to segment a Unicode text into runs according to their script.
> I've had a look through UAX #24 in the hope of finding a standard
> algorithm for doing this, but there isn't one specified. The
> implementation section gives some good pointers for what to be careful
> with (paired punctuation, etc.) but I can't find a step-by-step
> algorithm similar to the bidi algorithm or collation algorithm.
>
> Equally, I don't see anything in ICU that segments into script-based
> runs. You can get script properties, but that doesn't help you resolve
> common characters in the context of a run.
>
> Does anyone know of an open-source algorithm for doing this?

There is source/extra/scrptrun/ in the ICU source tree (but not part of the API); apparently it is used by its ParagraphLayout library. (A copy of this code is used by Pango, and another copy is used by LibreOffice.)

Regards,
Khaled
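The usual resolution, as in ICU's scrptrun, is to carry the current script along and fold Common/Inherited characters into the surrounding run. A toy sketch of the idea (the two-range script lookup is mine, purely for illustration; a real implementation would use the full Scripts.txt data and also handle paired punctuation):

```python
def script_of(ch):
    """Toy script lookup covering just Latin, Arabic, and Common."""
    cp = ord(ch)
    if 0x0600 <= cp <= 0x06FF:
        return "Arab"
    if ("A" <= ch <= "Z") or ("a" <= ch <= "z"):
        return "Latn"
    return "Zyyy"  # Common

def script_runs(text):
    """Split text into (script, substring) runs, merging Common
    characters into the adjacent run instead of starting their own."""
    runs, current, start = [], "Zyyy", 0
    for i, ch in enumerate(text):
        s = script_of(ch)
        if s == "Zyyy" or s == current:
            continue            # Common inherits the current run's script
        if current == "Zyyy":
            current = s         # leading Common chars join the first run
            continue
        runs.append((current, text[start:i]))
        start, current = i, s
    runs.append((current, text[start:]))
    return runs

print(script_runs("abc, \u0645\u062d!"))
```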
Re: Coloured Punctuation and Annotation
On Thu, Apr 06, 2017 at 12:50:02PM +0200, Werner LEMBERG wrote:
> > This page should show colored Hamza, diacritical dots and vowel
> > marks on web browsers that support the MS color font format (currently
> > Firefox, Edge, and Internet Explorer on latest Windows 10):
> > http://www.amirifont.org/fatiha-colored.html
> >
> > No special markup has been used; the color information is embedded
> > in a regular OpenType font.
>
> Very nice! It also works with Firefox on my GNU/Linux box.

I think I worded this vaguely: it works with Firefox on all platforms (even on Android); the Windows 10 restriction is for Internet Explorer only.

Regards,
Khaled
Re: Coloured Punctuation and Annotation
On Wed, Apr 05, 2017 at 05:29:57PM -0700, Asmus Freytag wrote:
> On 4/5/2017 5:14 PM, Michael Everson wrote:
> > On 5 Apr 2017, at 23:16, Asmus Freytag wrote:
> > >
> > > Do you have any examples of plain text that is rendered with parts of
> > > characters having white (opaque) background?
> > >
> > > I'm not aware of any
> >
> > There are certainly MSS (in many languages) where some punctuation made of
> > dots have some of the dots red and some black.
>
> Agreed, those would be a challenge to reproduce with standard font
> technology and in plain text.

Not any more, thanks to Emoji! This page should show colored Hamza, diacritical dots and vowel marks on web browsers that support the MS color font format (currently Firefox, Edge, and Internet Explorer on latest Windows 10):

http://www.amirifont.org/fatiha-colored.html

No special markup has been used; the color information is embedded in a regular OpenType font.

Regards,
Khaled
Re: "A Programmer's Introduction to Unicode"
On Mon, Mar 13, 2017 at 07:18:00PM +, Alastair Houghton wrote:
> On 13 Mar 2017, at 17:55, J Decker wrote:
> >
> > I liked the Go implementation of character type - a rune type - which is a
> > codepoint, and strings that return runes by index.
> > https://blog.golang.org/strings
>
> IMO, returning code points by index is a mistake. It over-emphasises
> the importance of the code point, which helps to continue the notion
> in some developers’ minds that code points are somehow “characters”.
> It also leads to people unnecessarily using UCS-4 as an internal
> representation, which seems to have very few advantages in practice
> over UTF-16.

But there are many text operations that require access to Unicode code points. Take, for example, text layout: mapping characters to glyphs and back has to operate on code points. The idea that you never need to work with code points is too simplistic.

Regards,
Khaled
Re: "A Programmer's Introduction to Unicode"
On Fri, Mar 10, 2017 at 05:00:55PM +, Peter Constable wrote:
> FYI:
>
> http://reedbeta.com/blog/programmers-intro-to-unicode/
>
> The visuals may be the most interesting part. E.g., in the usage heat
> map, Arabic Presentation Forms-B lights up much more than I would have
> expected

I often see U+FEFB and other lam-alef ligatures used on social media (I easily spot them because my default font does not have them, so they end up using a fallback font). My guess is that this might be because some keyboard layouts (Xorg, Android?) use them for the lam-alef keys on the keyboard. (I’m guilty of doing this for the Xorg keyboard layout because it didn’t handle more than one character per key; the ligature was then decomposed back inside the XIM input method, but many people don’t use XIM and the decomposition does not happen. It was messy overall.)

Regards,
Khaled
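The decomposition mentioned above is exactly what compatibility normalization does: NFKC maps the presentation-form ligature back to the regular letters. A quick check in Python:

```python
import unicodedata

LAM_ALEF = "\ufefb"  # ARABIC LIGATURE LAM WITH ALEF ISOLATED FORM

# NFKC folds the presentation form back to plain LAM + ALEF, which is
# what well-behaved input methods should have produced in the first place.
decomposed = unicodedata.normalize("NFKC", LAM_ALEF)
print([f"U+{ord(c):04X}" for c in decomposed])  # ['U+0644', 'U+0627']
```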
Re: Superscript and Subscript Characters in General Use
On Sat, Jan 14, 2017 at 02:18:01AM +0100, Marcel Schneider wrote:
> On Thu, 12 Jan 2017 17:01:41 +0200, Khaled Hosny wrote:
> >
> > LibreOffice indeed did not use HarfBuzz on Windows before 5.3, which is
> > not released yet.
>
> Is the integration of HarfBuzz limited to free software?

HarfBuzz has a fairly liberal license, so in theory it can be used anywhere.

> And what might be the reason of the delayed integration of HarfBuzz in the
> Windows version of LibreOffice?

Nothing specific. LibreOffice, and OpenOffice.org before it, and most likely StarOffice before them, just used whatever API the platform provides to do text layout, which is not an uncommon practice; it even seemed to be the best practice at the time. The reasons it finally switched to HarfBuzz are outlined in:

https://bugs.documentfoundation.org/show_bug.cgi?id=89870

Regards,
Khaled
Re: Superscript and Subscript Characters in General Use
On Thu, Jan 12, 2017 at 03:22:18PM +0100, Marcel Schneider wrote:
> > This is done by HarfBuzz, which automatically activates the OpenType
> > frac/dnom/numr features for such sequences, so if the font has the
> > features one gets vulgar fractions out of the box.
>
> According to Wikipedia (https://en.wikipedia.org/wiki/HarfBuzz#Major_users),
> HarfBuzz is included in LibreOffice too, but being on Windows, despite
> having just installed the brand-new version 5.2.4.2, I still donʼt get it,
> since it comes with 5.3:
> https://wiki.documentfoundation.org/ReleaseNotes/5.3#Text_Layout

LibreOffice indeed did not use HarfBuzz on Windows before 5.3, which is not released yet.

Regards,
Khaled
Re: Superscript and Subscript Characters in General Use
On Thu, Jan 12, 2017 at 12:24:29PM +0900, Martin J. Dürst wrote:
> On 2017/01/11 17:32, Richard Wordingham wrote:
>
> > The truly straight Unicode approach in HTML is to use 1⁄945.
> > Just entering those 5 characters into a text entry box in Firefox gave
> > me a properly formatted vulgar fraction. That is how vulgar fractions
> > are supposed to work. Unfortunately, one may need to avoid 'exciting
> > new fonts' in favour of those with a large, working repertoire.
>
> Just for the record: The vulgar fraction display also happened in
> Thunderbird (on Windows). Firefox and Thunderbird use the same display
> engine. I have switched HTML display off, because I prefer to read all my
> mail in plain text, but it still worked.

This is done by HarfBuzz, which automatically activates the OpenType frac/dnom/numr features for digits + fraction slash + digits sequences, so if the font has the features one gets vulgar fractions out of the box. This works in Chrome as well since it uses HarfBuzz (older versions of Chrome didn’t enable HarfBuzz by default for Latin, so the fractions might not show there).

Regards,
Khaled
Re: Why incomplete subscript/superscript alphabet ?
On Sat, Oct 01, 2016 at 03:00:50PM +0300, Jukka K. Korpela wrote:
> 1.10.2016, 11:29, Khaled Hosny wrote:
> > On Fri, Sep 30, 2016 at 07:31:58PM +0300, Jukka K. Korpela wrote:
> [...]
> > > What I was pointing at was that when using rich text or markup, it is
> > > complicated or impossible to have typographically correct glyphs used
> > > (even when they exist), whereas the use of Unicode codepoints for
> > > subscript or superscript characters may do that in a much simpler way.
> >
> > That is not generally true.
>
> It is generally true, but not without exceptions.
>
> > In TeX you get true superscript glyphs by default.
>
> I suppose you’re right, though I don’t know exactly how TeX implements
> superscripts. I suspect the fonts that TeX normally uses do not contain
> (many) superscript or subscript glyph variants, but TeX might actually map
> e.g. ^2 in math mode to a superscript glyph for 2 (identical to the
> glyph for ²).

TeX has fonts designed for use at 8pt (the size of first-level scripts) and 5pt (the size of second-level scripts), with all the optical corrections needed for them to look right when scaled down. They provide all the glyphs provided by the fonts for larger sizes, so any character can be used in super- or subscripts; no special mapping is needed.

Regards,
Khaled
Re: Why incomplete subscript/superscript alphabet ?
On Fri, Sep 30, 2016 at 07:31:58PM +0300, Jukka K. Korpela wrote:
> 30.9.2016, 19:11, Leonardo Boiko wrote:
>
> > The Unicode codepoints are not intended as a place to store
> > typographically variant glyphs (much like the Unicode "italic"
> > characters aren't designed as a way of encoding italic faces).
>
> There is no disagreement on this. What I was pointing at was that when using
> rich text or markup, it is complicated or impossible to have typographically
> correct glyphs used (even when they exist), whereas the use of Unicode
> codepoints for subscript or superscript characters may do that in a much
> simpler way.

That is not generally true. In TeX you get true superscript glyphs by default. On the web you can use font features in CSS to get them as well, provided that you are using a font that supports them.

Regards,
Khaled
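The CSS hooks for this are font-variant-position and the lower-level font-feature-settings. A sketch, assuming a font that actually implements the OpenType "sups"/"subs" features (the class names are mine):

```css
/* True superscript/subscript glyphs via OpenType features. */
sup.real {
  font-variant-position: super;     /* high-level property */
  font-feature-settings: "sups" 1;  /* low-level fallback for older engines */
}
sub.real {
  font-variant-position: sub;
  font-feature-settings: "subs" 1;
}
```

Without such a font, browsers typically fall back to synthesized (scaled and raised) glyphs, which is exactly the typographic compromise under discussion.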
Re: Unicode Bidi Algorithm – Java reference implementation
On Sat, Sep 17, 2016 at 05:01:10PM +0530, Deepak Jois wrote: > Hi > > It seems that the Java reference implementation for the Unicode Bidi > algorithm that I downloaded from the unicode.org site fails against > some test cases in the BidiCharacterTest.txt file – the ones that are > specifically meant to test for changes in Unicode 8.0. > > Has the reference implementation been updated, and does anyone have a > copy they can share? Is there a reference implementation in some other > language that I could look at, which has been updated? I think there is a C implementation that is kept up to date, and there is also a Python implementation that should pass the tests: https://github.com/behdad/pybyedie Regards, Khaled
Re: Numerical fractions written in Arabic script
On Tue, Jul 26, 2016 at 09:12:38PM -0400, Robert Wheelock wrote:
> Hello again, y’all!
>
> How do Arabs, Iranians, Afghans, Pakistanis, Urdu speakers ... all write
> their equivalents of common numerical fractions (consisting of a numerator,
> a separator character, and a denominator)? Considering that Arabic written
> script reads from right to left (like Hebrew, Syro-Aramaic, and the fantasy
> language of Tsolyáni), would they use a normal right-facing foreslash (1/2),
> a left-facing backslash (1\2), or do they align the numerator above and the
> denominator below a horizontal fraction bar?

In Arabic, beveled fractions are written from left to right with a right-facing slash. Also, the integer is written to the left of the fraction (whether it is a nut or a beveled fraction).

Regards,
Khaled
Re: non-breaking snakes
That sounds more like traditional Tibetan justification than kashida: http://rishida.net/scripts/tibetan/#justification On Wed, May 04, 2016 at 09:23:04AM +0200, Mark Davis ☕️ wrote: > Arabic has tatweel/kashida for justification; rather similar in principle. > > https://en.wikipedia.org/wiki/Kashida > > Mark > > On Wed, May 4, 2016 at 9:14 AM, Shriramana Sharmawrote: > > > Isn't there some Japanese orthography feature that already does > > something like this? > > > > -- > > Shriramana Sharma ஶ்ரீரமணஶர்மா श्रीरमणशर्मा > >
Re: Devanagari and Subscript and Superscript
On Tue, Dec 15, 2015 at 11:55:02AM +, Plug Gulp wrote:
> Please note that the teacher had to use a Circumflex Accent (Caret) to
> indicate superscript, which is an unwritten convention, in the absence
> of proper superscript support within Unicode.

If the teacher is explaining actual math to his students, then the superscript is the least of his worries. Math typesetting is two-dimensional, and is so much more complex than regular formatted text (let alone plain text) that it needs its own typesetting engines. There are various plain-text markup languages for math, if one really wants to represent complex mathematical notation in plain text.

Regards,
Khaled
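As an illustration, TeX's markup uses the same caret convention the teacher reached for, and builds genuinely two-dimensional layouts on top of it:

```latex
% Superscripts use the caret, as in the teacher's ad-hoc convention:
$x^{2} + y^{n+1}$
% Two-dimensional notation has no reasonable plain-text rendering at all:
$\frac{-b \pm \sqrt{b^{2} - 4ac}}{2a}$
```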
Re: AW: Proposal for German capital letter "ß"
On Wed, Dec 09, 2015 at 06:16:35PM +0100, Frédéric Grosshans wrote:
> * use your own casing rule and add a ZWNJ (zero width non joiner character)
>   such that ss↔SS and ß↔S+ZWNJ+S.

Wouldn’t ZWJ be a more logical choice, given that he wants to “join” both S’s into a single character?

Regards,
Khaled
Re: APL Under-bar Characters
On Sun, Aug 16, 2015 at 09:31:25AM -0700, alexwei...@alexweiner.com wrote:
> Khaled,
>
> Thank you for the link. The normalization methods were already discussed,
> specifically here:
> http://lists.gnu.org/archive/html/bug-apl/2015-08/msg00047.html

Grapheme cluster boundary detection is different from normalisation; please read the link I provided.

> Where the problem of how big ä is is discussed. The answer being that this
> is one symbol, because the Unicode Consortium decided that it is also its
> own standalone character. From the thread:
>
> > I'll give you an example. What would you want ⍴,'ä' to be? Right now,
> > that could return either 1 or 2 depending on whether the ä was using the
> > precomposed character (U+00E4) or the combining mark (U+0061, U+0308).
> > Visually, these are identical, and generally you'd expect them to
> > compare equal.

If you are counting grapheme clusters, then the answer is one in both cases.

> In Unicode, the comparison of equivalent (but with different characters)
> strings is done by performing a normalisation step prior to comparison.
> There are 4 different types of normalisation, with different behaviour.

Quoting from the link I provided:

    A key feature of default Unicode grapheme clusters (both legacy and
    extended) is that they remain unchanged across all canonically
    equivalent forms of the underlying text. Thus the boundaries remain
    unchanged whether the text is in NFC or NFD. Using a grapheme cluster
    as the fundamental unit of matching thus provides a very clear and
    easily explained basis for canonically equivalent matching. This is
    important for applications from searching to regular expressions.

See also: http://unicode.org/faq/char_combmark.html#7

> Now, the ä character has a precomposed form in Unicode, and if you couple
> that with the NFC normalisation form, you'd get the above expression to
> return 1.
> So I'm not sure why the allowance was made for ä as well as other certain
> characters, but not for other things (under-bar characters) that face
> similar representation issues.

It was encoded for compatibility with pre-existing character sets, AFAIK.

Regards,
Khaled
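The two points argued in this thread, canonical equivalence and grapheme-cluster counting, are easy to see concretely. Counting clusters needs the third-party regex module's \X, so this Python sketch shows only the normalization side:

```python
import unicodedata

precomposed = "\u00e4"   # ä as a single code point
decomposed = "a\u0308"   # a + COMBINING DIAERESIS

# Two spellings, two different code point counts, identical appearance:
print(len(precomposed), len(decomposed))  # 1 2

# NFC folds the combining sequence into the precomposed form, so after
# normalization both spellings count (and compare) the same:
nfc = unicodedata.normalize("NFC", decomposed)
print(nfc == precomposed, len(nfc))  # True 1
```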
Re: APL Under-bar Characters
On Sun, Aug 16, 2015 at 07:35:17AM -0700, alexwei...@alexweiner.com wrote: Hello Unicode Mailing List, There is significant discussion about the problems of adding capital letters with individual under-bars in this mailing list for GNU APL. http://lists.gnu.org/archive/html/bug-apl/2015-08/msg00050.html Pretty much it adds up to the following problem: The string length functionality would view an 'A' code point combined with an '_' code point as an item that has two elements, while something that looks like 'A' Should be atomic, and return a length of one. I think what you need is better “character” counting [1], rather than new precomposed characters. Regards, Khaled 1. http://unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries
Re: ZWJ as a Ligature Suppressor
This is not always true: some rendering engines (like HarfBuzz) try to follow the Unicode rules, so ZWJ does not break ligatures, except in Arabic, where the standard says it should be interpreted as ZWJ, ZWNJ, ZWJ.

Regards,
Khaled

On Mon, Aug 10, 2015 at 05:58:24PM +, Andrew Glass (WINDOWS) wrote:
> Hi Richard,
>
> To ligate or not to ligate is up to the font designer. Normally, GSUB
> lookups that perform ligation will be broken by the presence of ZWJ or
> ZWNJ. If a font designer wishes to ligate in the presence of a ZWJ or
> ZWNJ then they could choose to include appropriate glyph sequences in
> their ligation lookups. For example:
>
> glyphA glyphB -> glyphC
> glyphA ZWJ glyphB -> glyphC
>
> Cheers,
> Andrew
>
> Andrew Glass Ph.D.
> Program Manager, Shell Text Input Group | Windows | Microsoft
>
> -----Original Message-----
> From: Unicode [mailto:unicode-boun...@unicode.org] On Behalf Of Richard Wordingham
> Sent: Sunday, August 9, 2015 3:58 AM
> To: unicode@unicode.org
> Subject: ZWJ as a Ligature Suppressor
>
> According to the text just after TUS 7.0.0 Figure 23-3
> (http://www.unicode.org/versions/Unicode7.0.0/ch23.pdf#G25237), ZWJ
> suppresses ligatures in Arabic script. Does this rule apply to other
> normally cursive joined scripts, e.g. Syriac and Mongolian? Am I right
> in thinking that for an OpenType font for other scripts, the font
> writer must take precautions to prevent ZWJ accidentally suppressing
> ligatures that would be better suppressed by ZWNJ or ZWJ ZWNJ ZWJ?
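Andrew's example, written out as an AFDKO feature-file sketch (the glyph names are hypothetical; zwj stands for the glyph mapped from U+200D):

```
feature liga {
    # Normal ligature; a ZWJ between the components would break it.
    sub f i by f_i;
    # Explicitly ligate across ZWJ as well, if the designer wants that.
    sub f zwj i by f_i;
} liga;
```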
Re: Plain text custom fraction input
On Thu, Jul 23, 2015 at 10:25:22AM +0200, Marcel Schneider wrote:
> The remaining question would then be: What was the idea when, at font
> design, the fraction slash was given left and right kerning, so that a
> preceding superscript digit will take exactly the place it has as a part
> of a precomposed fraction, and a following subscript takes place like if
> it were a denominator in one of the precomposed fractions?

What says that this kerning is there for superscript/subscript glyphs? It can equally (and more likely) be there for the numerator and denominator glyphs.

Regards,
Khaled
Re: Plain text custom fraction input
On Wed, Jul 22, 2015 at 11:54:02PM +0100, Richard Wordingham wrote:
> On Wed, 22 Jul 2015 12:21:32 +0200 (CEST) Marcel Schneider
> charupd...@orange.fr wrote:
> > On 22 Jul 2015, at 09:52, Richard Wordingham wrote:
> >
> > We never thought of common hieroglyphs otherwise as running LTR, while
> > on monuments the great liberty of the script allows it to run in almost
> > all directions. IMO monumental transcription is always difficult to
> > deal with, whenever exact rendering is expected. However, since
> > Unicode's purpose is plain text encoding, we must stick with what I
> > consider as a convention in egyptology...
>
> Which means that Ancient Egyptian hieroglyphs are unencoded! Their
> default direction is right-to-left, but that's only the start of the
> trouble. The encoded hieroglyphs aren't Bidi-mirrored, so if I embed
> them in a right-to-left override, I should get retrograde characters.

At least in OpenType, you can have mirrored glyphs in the font (which you will need in any case) and use an “rtlm” feature, which should be applied when the text is being typeset right-to-left (naturally or forced).

Regards,
Khaled
Re: Plain text custom fraction input
On Wed, Jul 22, 2015 at 09:00:38AM +0200, Marcel Schneider wrote: On 21 Jul 2015, at 18:42, Doug Ewell wrote: As explained in TUS 7.0, §6.2 (General Punctuation), p. 273, U+2044 FRACTION SLASH is intended for use with Basic Latin digits, or other digits with General Category = Nd. The superscript and subscript presentation forms have General Category = No. That is what bugs me: this kerning fraction slash is presented to us as to be used with plain digits, which overlap the fraction slash in proportional fonts. That recommendation is inconsistent with plain text encoding. Following TUS, anybody who uses U+2044 must use a fraction formatting feature. I know this from the time I'd got the valid demo version of some Desktop Publishing software. The feature wasn't flagged by the fraction slash, and on the other hand, the feature worked with the common slash U+002F too. It's a formatting command like superscript or underline. Some layout engines, like HarfBuzz, automatically turn on the required OpenType features for proper fraction rendering when the fraction flag is used. If the font has “numr” and “dnom” features, HarfBuzz will turn them on for the digits + fraction slash + digits sequence. IMHO, that is the most Unicode-compliant approach and other engines should do the same. Regards, Khaled
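What HarfBuzz triggers around U+2044 happens at the glyph level inside the font; the following character-level sketch merely mimics the visual effect, using superscript/subscript code points as stand-ins for the font's 'numr'/'dnom' glyphs (an illustration, not the real mechanism):

```python
# Toy stand-in for 'numr'/'dnom': digits before U+2044 become
# numerator (superscript) forms, digits after become denominator
# (subscript) forms.
FRACTION_SLASH = "\u2044"
NUMR = str.maketrans("0123456789", "⁰¹²³⁴⁵⁶⁷⁸⁹")
DNOM = str.maketrans("0123456789", "₀₁₂₃₄₅₆₇₈₉")

def render_fraction(text):
    """Apply numerator/denominator forms around the fraction slash."""
    if FRACTION_SLASH not in text:
        return text
    numr, dnom = text.split(FRACTION_SLASH, 1)
    return numr.translate(NUMR) + FRACTION_SLASH + dnom.translate(DNOM)
```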
Re: WORD JOINER vs ZWNBSP
On Tue, Jun 30, 2015 at 11:02:18AM +0200, Marcel Schneider wrote: On Sun, Jun 28, 2015, Peter Constable wrote: Marcel: Can you please clarify in what way Windows 7 is not supporting U+2060. On my netbook, which is running Windows 7 Starter, U+2060 is not a part of any of the shipped fonts. It is a control character; it does not need to have a glyph in the font to be properly supported.
Re: Persian counter styles
On Feb 26, 2015 2:41 AM, Shervin Afshar shervinafs...@gmail.com wrote: On Tue, Feb 24, 2015 at 11:04 AM, Khaled Hosny khaledho...@eglug.org wrote: I don’t know about Persian, but in Arabic isolated Heh is not used in math or lists as it can be confused with Arabic-Indic digit five, and instead it is always used in initial form in such situations. I don't believe that the potential confusability between Arabic-Indic digit five and stand-alone Heh implies that it should not be used in writing math. I only stated that it is not used (i.e., the current practice); whether it should or shouldn't be used is up to the mathematicians who write that math (and for one, the Arabic Mathematical Alphabetic block does not have an isolated Heh, though its place is reserved). Regards, Khaled ___ Unicode mailing list Unicode@unicode.org http://unicode.org/mailman/listinfo/unicode
Re: Arabic percent sign and percent signs in RTL scripts
On Tue, Feb 04, 2014 at 03:43:37PM +0100, Martinho Fernandes wrote: Is the Arabic percent sign (U+066A) just a typographical variation of the normal percent sign (U+0025) or is it somehow more distinct than that? The former. It is mainly used when Arabic-Indic or Extended Arabic-Indic digits are used. What about its placement? Is it placed to the left or to the right of the digits it applies to? It should follow the digit in the input stream, and its proper visual placement should be handled by the Unicode bidirectional algorithm. Regards, Khaled
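The placement falls out of the bidi character properties: both percent signs have bidi class ET (European Number Terminator), so a sign stored after the digits in logical order stays attached to the number and lands on the correct visual side. The properties can be checked with Python's unicodedata:

```python
import unicodedata

def bidi_class(ch):
    """Bidi class of a character, from the Unicode character database."""
    return unicodedata.bidirectional(ch)

# U+0025 PERCENT SIGN and U+066A ARABIC PERCENT SIGN share class ET,
# while Arabic-Indic digits are AN (Arabic Number); the bidi algorithm
# resolves an ET adjacent to a number together with that number.
```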
Re: COMBINING OVER MARK?
On Mon, Sep 30, 2013 at 05:51:09PM -0700, Leo Broukhis wrote: Hi All, Attached is a part of page 36 of Henry Alford's *The Queen's English: a manual of idiom and usage (1888)* [ http://archive.org/details/queensenglishman00alfo] Is the way to indicate alternative s/z spellings used there plain text (arguably, if it can be done with a typewriter, it is plain text) I see a typeset book not an output of a typewriter. or rich text (ignoring the font size of letters s and z)? If it's the latter, what's the markup to achieve it? Using TeX: \def\s{${}^{\rm s}_{\rm z}$} 49. How are we to decide between {\it s} and {\it z} in such words as anathemati\s{}e, cauteri\s{}e, criti\-ci\s{}e, deodori\s{}e, dogmati\s{}e, fraterni\s{}e, and the rest? Many of these are derived from Greek \bye Regards, Khaled attachment: tex.png
Re: COMBINING OVER MARK?
Well that paragraph is rich text; different fonts (roman and upright) at different sizes (text and script size) pretty much make it formatted text to me. Regards, Khaled On Tue, Oct 01, 2013 at 10:19:24AM -0700, Leo Broukhis wrote: Khaled, On a typewriter, the same effect can be achieved as anathemati⟨half-interval up⟩s⟨BS⟩⟨1 interval down⟩z⟨half-interval up⟩e Where would the line between markup and typesetting languages be drawn? Leo On Tue, Oct 1, 2013 at 2:09 AM, Khaled Hosny khaledho...@eglug.org wrote: On Mon, Sep 30, 2013 at 05:51:09PM -0700, Leo Broukhis wrote: Hi All, Attached is a part of page 36 of Henry Alford's *The Queen's English: a manual of idiom and usage (1888)* [ http://archive.org/details/queensenglishman00alfo] Is the way to indicate alternative s/z spellings used there plain text (arguably, if it can be done with a typewriter, it is plain text) I see a typeset book not an output of a typewriter. or rich text (ignoring the font size of letters s and z)? If it's the latter, what's the markup to achieve it? Using TeX: \def\s{${}^{\rm s}_{\rm z}$} 49. How are we to decide between {\it s} and {\it z} in such words as anathemati\s{}e, cauteri\s{}e, criti\-ci\s{}e, deodori\s{}e, dogmati\s{}e, fraterni\s{}e, and the rest? Many of these are derived from Greek \bye Regards, Khaled
Re: Why blackletter letters?
On Thu, Sep 12, 2013 at 01:21:28PM +0100, Neil Harris wrote: On 12/09/13 11:26, Johan Winge wrote: On Wed, 11 Sep 2013 20:29:51 +0200, Hans Aberg haber...@telia.com wrote: ... The symbol for the empty set ∅ is originally a Greek letter phi ϕ, and some use the latter. According to the autobiography of André Weil, quoted at http://jeff560.tripod.com/set.html, the empty set symbol ∅ was inspired by the Scandinavian Ø, and would then have nothing to do with the Greek phi, except for a superficial resemblance. I'm aware that some mathematicians indeed do use Φ/φ, supposedly due to this misconception and/or lacking coverage in fonts and/or carelessness, but I find it terribly annoying. Really, it is no more correct than using ß in lieu of β. -- Johan Winge Do some mathematicians _really_ use Φ/φ instead of ∅, or does it just look like they're doing so? Seems so: http://math.stackexchange.com/q/227548 Also, when I went to school we were taught that phi denotes a group of nothing, not sure if that was supposed to be the empty set (we were taught math in Arabic, so not sure how that translates into English). Regards, Khaled
Re: Latvian and Marshallese Ad Hoc Report (cedilla and comma below)
On Fri, Jun 21, 2013 at 02:27:38PM +0100, Michael Everson wrote: On 21 Jun 2013, at 14:06, Denis Jacquerye moy...@gmail.com wrote: It is not the character model that is not reliable, it is the application. If your application doesn't support locale settings and locale specific font features, fix the application. Try this in the file system. The file system embeds visual rendering of text? You probably mean the file manager; my file manager shows me locale-dependent glyph variants without any special setup (apart from choosing a font that has the said variants, and they are available as OpenType variants, no less). Regards, Khaled
Re: Latvian and Marshallese Ad Hoc Report (cedilla and comma below)
On Fri, Jun 21, 2013 at 04:00:20PM +0100, Michael Everson wrote: Yeah, I don't believe that you can language-tag individual file names for such display as that is markup. Why do you need to? You only need one language; it is not like file names are multilingual high-quality text books where every fine typographic detail for each language has to be respected. Only the language that the user cares about matters, and this can be easily inferred from the system locale, and passed down to the text rendering stack. Regards, Khaled
Re: interaction of Arabic ligatures with vowel marks
On Tue, Jun 11, 2013 at 08:09:31PM -0700, Stephan Stiller wrote: Hi, How is the placement of vowel marks around ligatures handled in Arabic text? OpenType has special support for placing combining marks over ligatures (a subset of the general support for controlling the placement of combining marks); it is entirely handled at the text rendering level, with no difference in input whether the bases will be ligated or not. No idea about other font technologies. Regards, Khaled
Re: xkcd: LTR
Looks OK here, but that is probably FreeType doing its magic as usual. Regards, Khaled On Tue, Nov 27, 2012 at 02:29:45AM +0100, Philippe Verdy wrote: Also I really don't like the Deseret font: {font-family: CMU; src: url(CMUSerif-Roman.ttf) format(truetype);} that you have inserted in your stylesheet (da.css) which is used to display the whole text content of the page, including the English Latin text at the bottom part. This downloaded font is difficult to read as it is not hinted at all (so its rendering on screen is extremely poor, we probably don't want to print each page of this XKCD series, when the main interest is the image which is perfectly readable). Could you ask to someone in this list to help you hinting this font a minimum (even basic autohinting would be much better). 2012/11/27 Philippe Verdy verd...@wanadoo.fr Did you try add the xml:lang=en-Dsrt pseudo-attribute to the html element, as suggested by the W3C Unicorn validator ? http://validator.w3.org/unicorn/check?ucn_uri=www.xn--elqus623b.net%2FXKCD%2F1138.htmlucn_lang=frucn_task=conformance# May be this could help IE and Firefox that can't figure out the language used to properly detect the encoding if they still don't trust the XML declaration in this case, to avoid them to use an encoding guesser. It is anyay curious because this site is valid as XHTML 1.1 (not as HTML5 which uses a very different and simplified prolog, which is not matched here, so the legacy rules should apply to detect XHTML here, then legacy HTML4 if XHTML is no longer recognized by IE and Firefox). Because XHTML is properly tagged, the XML requirements should apply and the XML declaration in the prolog should be used without needing to guess the encoding from the rest of the content (starting by a meta element in the HTML head element). 2012/11/27 John H. Jenkins jenk...@apple.com That's because the domain does, in fact, use sinograms and not Deseret. (It's my Chinese name.) 
On 2012年11月26日, at 下午1:54, Philippe Verdy verd...@wanadoo.fr wrote: I wonder why this IDN link appears to me using sinograms in its domain name, instead of Deseret letters. The link works, but my browser cannot display it and its displays the Punycoded name instead without decoding it. This is strange because I do have Deseret fonts installed and I can view Unicoded HTML pages containing Deseret letters. 2012/11/26 John H. Jenkins jenk...@apple.com Or, if one prefers: http://www.井作恆.net/XKCD/1137.htmlhttp://www.xn--elqus623b.net/XKCD/1137.html On 2012年11月21日, at 上午10:22, Deborah Goldsmith golds...@apple.com wrote: http://xkcd.com/1137/ Finally, an xkcd for Unicoders. :-) Debbie
Re: Connecting overline and Connecting underline
On Fri, Nov 16, 2012 at 08:02:49PM +0100, Andreas Prilop wrote: U+0305 Combining overline U+0332 Combining low line should both connect on left and right. Which software (program and font) actually does this when you overline/underline gh? Test at http://www.user.uni-hannover.de/nhtcapri/combining-marks.html My (unreleased) Amiri Quran font handles the combining overline pretty well, but it has only Arabic characters (the code that generates the overline glyphs and OpenType rules is pretty generic, but that font has only a very specific use case). Regards, Khaled attachment: overline.png
Re: Caret
On Mon, Nov 12, 2012 at 09:35:03PM +0100, Philippe Verdy wrote: I understand then. You have a single logical position (in encoded plain text), that maps to two visual positions which may be considered AFTER depending on the direction properties of the character that you *may* type. A single vertical line assumes however that you'll type a character which will use the SAME direction as the character BEFORE the insertion point. This case remains very infrequent: it is extremely rare to start typing text in the middle between RTL and LTR text. Usually typing occurs at end of a paragraph, and most paragraphs use a single direction and when you have to insert new text in the middle of a paragraph, this is extremely rarely between a visual-LTR sequence and a visual-RTL sequence (I think the most frequent case will occur between digits and letters/symbols, in cases like currency amounts or measurements). I’m not sure where you are getting your statistics from, but I have to deal with all those “rare” and “extremely rare” situations all day. Regards, Khaled
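The split-caret situation can be made concrete with a toy model: given each logical character's visual column and direction, a logical caret index at a direction boundary yields two candidate visual positions (the trailing edge of the previous character and the leading edge of the next one). This is a simplified sketch of bidi caret placement, not any particular toolkit's algorithm.

```python
def visual_carets(logical_to_visual, directions, k):
    """logical_to_visual: visual column of each logical character;
    directions: 'L' or 'R' per logical character; k: logical caret index.
    Returns the set of candidate visual caret columns."""
    carets = set()
    if k > 0:  # trailing edge of the preceding character
        col = logical_to_visual[k - 1]
        carets.add(col + 1 if directions[k - 1] == "L" else col)
    if k < len(directions):  # leading edge of the following character
        col = logical_to_visual[k]
        carets.add(col if directions[k] == "L" else col + 1)
    return carets
```

For "ab" followed by a three-letter RTL run (displayed reversed), the caret between the runs has two visual positions, while a caret inside the LTR run has one.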
Re: Apostrophe, and DIN keyboard
On Tue, Aug 14, 2012 at 03:56:23PM -0400, Robert Wheelock wrote: ... 90º ... 45º BTW, the degree sign is ° not the masculine ordinal indicator that you are using. Regards, Khaled
Re: No appropriate code point for some Chinese punctuation marks
On Sun, Jul 22, 2012 at 09:43:29AM -0700, Asmus Freytag wrote: Especially in a multiscript environment, and those are not that rare, really, it's almost impossible to get such unifications to behave correctly without explicit font binding. And we all know that control of that is elusive in many contexts. It is quite possible actually; all that is needed is a text layout engine that does automatic script tagging (e.g. Pango and, to some extent, Firefox) and a font that provides localised, script-specific punctuation glyphs, and it should just work even with plain text. I've been doing that with Arabic and it works rather reliably. Regards, Khaled
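Automatic script tagging (script itemization) is easy to sketch: split text into runs of uniform script, attaching Common characters like punctuation to the surrounding run. The classifier below covers only a couple of Latin and Arabic ranges for brevity; real engines such as Pango use the full Scripts.txt data and more elaborate rules for Common/Inherited characters.

```python
def script_of(ch):
    """Grossly simplified script classifier (illustration only)."""
    cp = ord(ch)
    if 0x0600 <= cp <= 0x06FF or 0x0750 <= cp <= 0x077F:
        return "Arabic"
    if 0x0041 <= cp <= 0x005A or 0x0061 <= cp <= 0x007A:
        return "Latin"
    return "Common"

def script_runs(text):
    """Split text into (run, script) pairs; Common characters are
    attached to the current run."""
    runs, start, cur = [], 0, None
    for i, ch in enumerate(text):
        s = script_of(ch)
        if s == "Common":
            continue
        if cur is None:
            cur = s
        elif s != cur:
            runs.append((text[start:i], cur))
            start, cur = i, s
    runs.append((text[start:], cur or "Common"))
    return runs
```

A layout engine feeds each run to the font with the run's script tag, so a 'locl'-style substitution can pick script-specific punctuation glyphs per run.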
Re: Too narrowly defined: DIVISION SIGN COLON
On Wed, Jul 11, 2012 at 10:47:33AM +0200, Hans Aberg wrote: On 11 Jul 2012, at 03:51, Khaled Hosny wrote: It can be handled at a different level; when one types 3:5 in a Unicode-compliant TeX engine, what gets output to the output file is the ratio not the colon, and colon gets output with 3\colon{}5. Actually, TeX does it wrongly relative to Unicode: a colon : in the input file should expand to TeX $\colon$, whereas ∶ RATIO U+2236 should expand to TeX $:$. It is a kind of primitive input method, like using / for division slash and * for asterisk operator, and ratio is more frequent in math than the colon. (Original TeX handled this by having different glyphs/glyph classes in math than in text; Unicode-compliant TeX engines map them to the appropriate Unicode character.) Regards, Khaled
Re: Too narrowly defined: DIVISION SIGN COLON
On Wed, Jul 11, 2012 at 04:20:26PM +0200, Hans Aberg wrote: On 11 Jul 2012, at 15:59, Khaled Hosny wrote: On Wed, Jul 11, 2012 at 10:47:33AM +0200, Hans Aberg wrote: On 11 Jul 2012, at 03:51, Khaled Hosny wrote: It can be handled at a different level; when one types 3:5 in a Unicode-compliant TeX engine, what gets output to the output file is the ratio not the colon, and colon gets output with 3\colon{}5. Actually, TeX does it wrongly relative to Unicode: a colon : in the input file should expand to TeX $\colon$, whereas ∶ RATIO U+2236 should expand to TeX $:$. It is a kind of primitive input method, like using / for division slash and * for asterisk operator, and ratio is more frequent in math than the colon. (Original TeX handled this by having different glyphs/glyph classes in math than in text; Unicode-compliant TeX engines map them to the appropriate Unicode character.) There are a number of other incompatibilities between original TeX and Unicode: For example, ASCII letters are in TeX math mode typeset in italics, but Unicode has a mathematical italics style, so ASCII letters should be typeset upright in a strict Unicode mode. And similar for Greek letters, I gather. If I try the code below in lualatex, then the 푩 and the 퐁 both come out typeset upright. There is a “literal” mode in the unicode-math package just for that; check its manual for more details. Also, in the code there is an example where spacing produces a semantic difference: {A: B} is the set of all A satisfying the predicate B, whereas {A : B} is the set of the single element A : B. (It is more common to use | nowadays in the first case, but it is also used as an operator.) There is also an option to control colon vs. ratio behaviour, but this is getting off-topic IMO. Regards, Khaled
Re: Too narrowly defined: DIVISION SIGN COLON
They are spaced differently. Attached is how they are rendered by TeX, using its default spacing rules: the first is the ratio (which is spaced as a relational symbol), the second is the colon (which is spaced as a punctuation mark), both in math mode, and the last one is the colon in text mode. On Tue, Jul 10, 2012 at 04:22:06PM -0700, Mark Davis ☕ wrote: I would disagree about the preference for ratio; I think it is a historical accident in Unicode. What people use and have used for ratio is simply a colon. One writes 3:5, and I doubt that there was a well-established visual difference that demanded a separate code for it, so someone would need to write 3∶5 instead. Mark — Il meglio è l’inimico del bene — On Tue, Jul 10, 2012 at 3:22 PM, Asmus Freytag asm...@ix.netcom.com wrote: U+2236 RATIO * Used in preference to 003A : to denote division or scale attachment: texput.png
Re: Too narrowly defined: DIVISION SIGN COLON
It can be handled at a different level; when one types 3:5 in a Unicode-compliant TeX engine, what gets output to the output file is the ratio not the colon, and colon gets output with 3\colon{}5. Regards, Khaled On Tue, Jul 10, 2012 at 06:00:24PM -0700, Mark Davis ☕ wrote: That is, they may be spaced differently (depending on the font and environment). I'm not against pointing to RATIO for specific math contexts, but to tell Joe Smith that he should be using a different character to say that the ratio of gravel to sand should be 3:1 is artificial and pointless. ━━━ Mark — Il meglio è l’inimico del bene — On Tue, Jul 10, 2012 at 5:51 PM, Khaled Hosny khaledho...@eglug.org wrote: They are spaced differently. Attached is how they are rendered by TeX, using its default spacing rules: the first is the ratio (which is spaced as a relational symbol), the second is the colon (which is spaced as a punctuation mark), both in math mode, and the last one is the colon in text mode. On Tue, Jul 10, 2012 at 04:22:06PM -0700, Mark Davis ☕ wrote: I would disagree about the preference for ratio; I think it is a historical accident in Unicode. What people use and have used for ratio is simply a colon. One writes 3:5, and I doubt that there was a well-established visual difference that demanded a separate code for it, so someone would need to write 3∶5 instead. Mark — Il meglio è l’inimico del bene — On Tue, Jul 10, 2012 at 3:22 PM, Asmus Freytag asm...@ix.netcom.com wrote: U+2236 RATIO * Used in preference to 003A : to denote division or scale
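A minimal illustration of the two spacings in (La)TeX math mode, where ":" is a relation by default and \colon is punctuation:

```latex
% ":" in math mode is spaced as a relation, like U+2236 RATIO;
% \colon gets punctuation spacing, matching U+003A COLON usage.
$3 : 5$       % ratio spacing: medium space on both sides
$3 \colon 5$  % colon spacing: thin space after only
```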
Re: Too narrowly defined: DIVISION SIGN COLON
On Wed, Jul 11, 2012 at 03:39:23AM +0200, Philippe Verdy wrote: (Unfortunately it's still almost impossible to determine how browsers are selecting fonts and which fonts get finally used to render text in their tricky code, Firefox has an addon for that: https://addons.mozilla.org/en-US/firefox/addon/fontinfo/ Regards, Khaled
Re: [OT] Flag coding (was: Re: Tags and future new technologies [...])
On Sat, Jun 02, 2012 at 11:22:12AM +0200, Philippe Verdy wrote: 2012/6/2 William_J_G Overington wjgo_10...@btinternet.com: An interesting spin-off could be that the introduction of such an encoding could lead to the introduction of chromatic font technology by industry. I've been waiting for long for fonts embedding colorful glyphs (that also contain enough information for rendering the embedded colors with monochromatic patterns, also encoded in the font as a property of its internal colormap). Such thing is still not in OpenType, but it DOES exist in other font technologies (e.g. in SVG fonts, even though this is still an unfinished standard that does not meet the technical quality observed in OpentType, but that DOES use a much simpler and coherent design than the many incoherent tricks and deprecated items found in the OpenType family, including for such basic things such as metrics data which are a nightmare to make compatible). https://wiki.mozilla.org/SVGOpenTypeFonts Regards, Khaled
Re: Variant glyphs for mathematical symbols
On Sun, May 06, 2012 at 06:36:36PM -0700, Asmus Freytag wrote: First question: When the integral symbols were encoded in Unicode there was discussion of the fact that these were deliberately unifying an upright and a slanted style of integral. Now, I'm pretty sure that I've seen both styles in print at some point, but I can't seem to find any TrueType or OpenType fonts that support the slanted style. Or, I may just not know where to look. Is this style still in use anywhere, and do people make or maintain fonts for it? Latin Modern Math font has slanted integrals: http://www.gust.org.pl/projects/e-foundry/lm-math XITS Math has slanted integrals by default, as well as optional upright ones: https://github.com/khaledhosny/xits-math Both fonts use the new OpenType MATH table and thus need an application that supports it for proper math typesetting, namely MS Office 2007+, XeTeX and LuaTeX. STIX fonts also provide both: http://stixfonts.org/ Second question: When the mathematical relations were encoded there were variants that were unified where the sole difference was something subtle like a slant of one of the lines. However, these variants were also given Standardized Variation Sequences. Are there any fonts that contain glyphs for these variant forms? Either as replacement for the more typical forms, or as alternate glyphs? Again, I may simply not know where to look. XITS Math supports the mathematical variants using variation sequences that are listed here: http://unicode.org/reports/tr25/tr25-9.html#_Toc218 PS: should these symbols exist in non-Truetype fonts I'd be interested in pointers as well, but preferably from someone who would know how to convert them into TrueType format. Many TeX math fonts have slanted integrals. Regards, Khaled
Re: U+2018 is not RIGHT HIGH 6
On Wed, May 02, 2012 at 08:04:01AM -0700, Doug Ewell wrote: Certain font designers have made these directional for decades, leading to the hideous ``convention'' which some people seem to love, but which is a classic example of abusing character encoding to achieve typographical results. This stems from TeX and its Computer Modern fonts, AFAIK, which are older than history... Regards, Khaled
Re: Joining Arabic Letters
On Sat, Apr 07, 2012 at 08:50:18PM +0200, Escape Landsome wrote: - the browser, including version Mozilla Firefox 9.0.1 There was a bug in Firefox 9 causing the behaviour you described; it has been fixed in Firefox 10: https://bugzilla.mozilla.org/show_bug.cgi?id=714067 Regards, Khaled
Re: Joining Arabic Letters
On Sat, Mar 31, 2012 at 08:55:28AM +0200, Philippe Verdy wrote: For now I've not seen any existing Arabic font that exhibits the correct normative joining behavior for these letters such as U+063D (the Farsi Yeh with an inverted v above, which is dual-joining like the Farsi Yeh at U+06CC without the inverted v above, and in the same joining group; those fonts only map a single non-joining glyph for U+063D, but behave correctly for U+06CC). This is true even for all Arabic fonts shipped with Windows 7. Check my free Amiri font (http://amirifont.org); it has full Unicode 6.0 Arabic coverage, with 6.1 additions on the way. But if you are using a layout engine that predates the addition of that character into Unicode, even a good font will not help here since the engine will be using the older Unicode character database where the joining behaviour of this letter is undefined. Regards, Khaled
Re: Joining Arabic Letters
On Fri, Mar 30, 2012 at 07:37:53PM +0200, Andreas Prilop wrote: I come back to http://www.unicode.org/mail-arch/unicode-ml/y2012-m03/thread.html#11 A similar problem of showing non-joining, isolated Arabic glyphs can be seen in the attached file. Both Internet Explorer 8 and MS Word 2010 display isolated glyphs in some cases. I think a better idea is to have joining glyphs always even for different typefaces. At least the Unicode Standard should say what should happen when Arabic characters of different typefaces follow each other. OpenOffice/LibreOffice work around this by conditionally inserting ZWJ when there is a font switch in the middle of the word and joining is desired. Regards, Khaled
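The OpenOffice/LibreOffice workaround can be sketched as a pass over styled runs: when a font change splits an Arabic word, append a ZWJ to the first run and prepend one to the second, so each run still shapes with joining forms. A toy model (the Arabic letter-range test is a simplification):

```python
ZWJ = "\u200d"

def is_arabic_letter(ch):
    """Crude check for joining Arabic letters (simplified ranges)."""
    return "\u0621" <= ch <= "\u064a" or "\u0671" <= ch <= "\u06d3"

def join_across_font_change(runs):
    """runs: list of (text, font) in logical order.  Insert ZWJ on both
    sides of a run boundary that splits an Arabic word, so separately
    shaped runs keep their medial/final joining forms."""
    out = []
    for i, (text, font) in enumerate(runs):
        piece = text
        if (i > 0 and text and runs[i - 1][0]
                and is_arabic_letter(runs[i - 1][0][-1])
                and is_arabic_letter(text[0])):
            piece = ZWJ + text
            out[-1] = (out[-1][0] + ZWJ, out[-1][1])
        out.append((piece, font))
    return out
```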
Re: Combining Triple Diacritics (N3915) not accepted by UTC #125
On Wed, Nov 10, 2010 at 06:11:08PM +0100, Karl Pentzlin wrote: From the Pre-Preliminary minutes of UTC #125 (L2/10-416): C.4 Preliminary Proposal to enable the use of Combining Triple Diacritics in Plain Text (WG2 N3915) [Pentzlin, L2/10-353] - see http://std.dkuug.dk/jtc1/sc2/wg2/docs/n3915.pdf [125-A13] ... UTC does not believe that either solution A or solution B represents an appropriate encoding solution for the text representation problem shown in this document. Appropriate technology involving markup should be applied to the problem of representation of text at this level. This will not happen. Linguists will continue to use their PUA code points (or even their 8-bit fonts), which employ these characters perfectly (albeit using precomposed glyphs for the used combinations). Advanced typesetting engines like TeX (which were invented 30 years ago, mind you) already support wide accents that span multiple characters: $\widehat{abcd}$ $\widetilde{abcd}$ \bye Even math formulas in new MS Office versions can do that (well it is math because, apparently, only mathematicians cared about that, but I don't see why it should not work for linguists too). Regards, Khaled -- Khaled Hosny Arabic localiser and member of Arabeyes.org team Free font developer
Re: Combining Triple Diacritics (N3915) not accepted by UTC #125
Or the other way around... On Thu, Nov 11, 2010 at 08:53:49AM +0200, Klaas Ruppel wrote: Typographic solutions (as established they ever may be) do not solve encoding matters. Best regards, __ Klaas Ruppel www.kotus.fi/?l=ens=1 Kotus www.kotus.fi Fociswww.focis.fi Tel. +358 207 813 278 Fax +358 207 813 219 Khaled Hosny kirjoitti 10.11.2010 kello 20.03: On Wed, Nov 10, 2010 at 06:11:08PM +0100, Karl Pentzlin wrote: From the Pre-Preliminary minutes of UTC #125 (L2/10-416): C.4 Preliminary Proposal to enable the use of Combining Triple Diacritics in Plain Text (WG2 N3915) [Pentzlin, L2/10-353] - see http://std.dkuug.dk/jtc1/sc2/wg2/docs/n3915.pdf [125-A13] ... UTC does not believe that either solution A or solution B represents an appropriate encoding solution for the text representation problem shown in this document. Appropriate technology involving markup should be applied to the problem of representation of text at this level. This will not happen. Linguists will continue to use their PUA code points (or even their 8-bit fonts), which employ these characters perfectly (albeit using precomposed glyphs for the used combinations). Advanced typesetting engines like TeX (which were invented 30 years ago, mind you) already support wide accents that span multiple characters: $\widehat{abcd}$ $\widetilde{abcd}$ \bye Even math formulas in new MS Office versions can do that (well it is math because, apparently, only mathematicians cared about that, but I don't see why it should not work for linguists too). Regards, Khaled -- Khaled Hosny Arabic localiser and member of Arabeyes.org team Free font developer -- Khaled Hosny Arabic localiser and member of Arabeyes.org team Free font developer
Re: A simpler definition of the Bidi Algorithm
On Fri, Sep 10, 2010 at 05:00:21PM -0700, Asmus Freytag wrote: PS: Personally, I don't find the presentation in terms of the regular expressions any more intuitive than the original. Some people, when confronted with a problem, think ‟I know, I'll use regular expressions.” Now they have two problems. —Jamie Zawinski
Re: High dot/dot above punctuation?
On Wed, Jul 28, 2010 at 11:37:28AM -0700, Asmus Freytag wrote: On 7/28/2010 10:09 AM, Murray Sargent wrote: Contextual rendering is getting to be more common thanks to adoption of OpenType features. For example, both MS Publisher 2010 and MS Word 2010 support various contextually dependent OpenType features at the user's discretion. The choice of glyph for U+002E could be chosen according to an OpenType style. I know that the technology exists that (in principle) can overcome an early limitation of 1:1 relation between characters and glyphs in a single font. I also know that this technology has been implemented for certain (but not all) types of mappings that are not 1:1. It's worth remembering that plain text is a format that was introduced due to the limitations of early computers. Books have always been rendered with at least some degree of rich text. And due to the complexity of Unicode, even Unicode plain text often needs to be rendered with more than one font. However, the question I raised here is whether such mechanisms have been implemented to date for FULL STOP. Which implementation makes the required context analysis to determine whether 002E is part of a number during layout? If it does make this determination, which OpenType feature does it invoke? Which font supports this particular OpenType feature? I have a few fonts where I implemented a 'locl' OpenType feature that maps European to Arabic digits, and a contextual substitution feature that replaces the dot with the Arabic decimal separator when it comes between two Arabic numbers, so I think it is doable. Regards, Khaled
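A character-level sketch of the same idea (the real feature substitutes glyphs inside the font via 'locl' and contextual lookups; this Python stand-in is an assumption for illustration): map European digits to Arabic-Indic ones, then replace a dot between two such digits with U+066B ARABIC DECIMAL SEPARATOR.

```python
import re

# European -> Arabic-Indic digits (U+0660..U+0669)
ARABIC_INDIC = str.maketrans("0123456789", "٠١٢٣٤٥٦٧٨٩")

def localize_number(text):
    """Mimic the font feature at the character level: localize digits,
    then turn "." between two Arabic-Indic digits into U+066B."""
    text = text.translate(ARABIC_INDIC)
    return re.sub(r"(?<=[٠-٩])\.(?=[٠-٩])", "\u066b", text)
```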
Re: High dot/dot above punctuation?
On Thu, Jul 29, 2010 at 10:01:37AM +0200, Kent Karlsson wrote: Den 2010-07-29 08.47, skrev Khaled Hosny khaledho...@eglug.org: I have a few fonts where I implemented a 'locl' OpenType feature that maps European to Arabic digits, and a contextual substitution feature that replaces the dot with the Arabic decimal separator when it comes between two Arabic numbers, so I think it is doable. Doable is not the same thing as a good idea. Your example here is one of the not-at-all-good ideas. This was done for a GUI font; the main aim is to have Arabic numbers in Arabic contexts and vice versa. Since the numbers here are generated on the fly (dates, percentages, etc.), it is not possible (or even desirable) to change the input. Also, I don't buy into the Unicode idea of encoding different sets of decimal digits separately; they are all different graphical presentations of the same thing. Regards, Khaled
Re: Pashto yeh characters
On Wed, Jul 28, 2010 at 04:33:12PM +0200, Andreas Prilop wrote: On Tue, 27 Jul 2010, Khaled Hosny wrote: it just happens not to occur in those two positions in modern orthography, but it can be seen in Quran which is still written in the old, early Islamic orthography. If you argue with archaic spelling, then ð and þ are English letters. Except we are talking about a letter that is still in contemporary use, just not occurring at certain positions of the word. | http://www.unicode.org/mail-arch/unicode-ml/y2010-m07/att-0295/01-U_0649.jpg | http://www.unicode.org/mail-arch/unicode-ml/y2010-m07/att-0295/01-U_0649.jpg According to Grammatik des klassischen Arabisch by Wolfdietrich Fischer, page 9, the ya is written with two dots in such cases, too. Except that this is not a Yaa and not pronounced like a Yaa, it is an Alef (note the small dagger Alef above it). I doubt such questions can be solved with reference to the Quran, which originally had no dots at all. Those are two scans from contemporary prints of Quran, where the regular Yaa has dots. Just because Uyghur is still following the old orthography of placing Alef Maqsura in the middle of the word, doesn't suddenly make it a non-Arabic character. Regards, Khaled
Re: Pashto yeh characters
On Wed, Jul 28, 2010 at 05:32:21PM +0200, Andreas Prilop wrote:
> On Tue, 27 Jul 2010, Khaled Hosny wrote:
> > > According to "Grammatik des klassischen Arabisch" by Wolfdietrich
> > > Fischer, page 9, the ya is written with two dots in such cases, too.
> >
> > Except that this is not a Yaa and is not pronounced like a Yaa; it
> > is an Alef (note the small dagger Alef above it).
>
> That is exactly what I meant and exactly what is written in W. Fischer.
> My point is that there are two dots below.

No, there aren't, at least not in orthographies that differentiate between Yaa and Alef Maqsura by dots.

--
Khaled Hosny
Arabic localiser and member of Arabeyes.org team
Free font developer
Re: Pashto yeh characters
On Tue, Jul 27, 2010 at 06:43:19PM +0200, Andreas Prilop wrote:
[...]
> U+0649 has (should have) four glyphs without any dots. This is no
> Arabic letter, but an Uighur letter. Therefore you should not use
> U+0649 for Arabic, Persian, Pashto, Urdu but only U+06CC.

I'm not sure what the basis of this conclusion is, but U+0649 has no dots in its initial/medial forms in Arabic too; it just happens not to occur in those two positions in modern orthography, but it can be seen in the Quran, which is still written in the old, early Islamic orthography. See the attached image showing the words فسوىهن and ميكىل.

Regards,
Khaled

--
Khaled Hosny
Arabic localiser and member of Arabeyes.org team
Free font developer

attachment: U+0649.jpg
Re: Generic Base Letter
On Tue, Jun 29, 2010 at 09:41:58PM -0700, Michael S. Kaplan wrote:
> Speaking as an MS employee who has seen how easy it is to put
> arbitrary combining marks on scripts like Latin and Cyrillic that
> don't look very good if the font has neither combined form glyphs nor
> knowledge of attachment points, it may be the case that some of these
> situations that don't look good have more to do with the fact that
> making it look good typographically when no one put in the effort for
> the specific case may be simply the price one pays.

In the case of Arabic and Hebrew, Uniscribe inserts the dotted circle between what it considers invalid mark combinations before doing any OpenType layout, so it is impossible for a font designer to support such combinations, simply because what the layout engine sees is a mark, dotted circle, mark sequence, and the mark-to-mark layout feature, even if present in the font, will never be triggered. It is possible to hack around this by treating mark, dotted circle, mark sequences as mark, mark in the layout code, assuming nobody will ever insert the dotted circle manually, but this won't work for Arabic since the dotted circle also breaks Arabic shaping, and getting around that will be much harder.

Given how many times I've seen font designers trying to do advanced Arabic and Hebrew OpenType fonts frustrated by this, I think it is time for MS to review its decision here. Even if dropping the special handling of invalid marks is ruled out for various reasons, at least allowing font designers to tell Uniscribe "please let me handle the valid and invalid marks myself, I really know what I'm doing" would be very helpful.

Regards,
Khaled

--
Khaled Hosny
Arabic localiser and member of Arabeyes.org team
Free font developer
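[Editor's note: a minimal model of the pre-shaping validation described above, in Python. This is a deliberate simplification to show *why* the font never sees the original mark-mark sequence; actual Uniscribe cluster validation is far more involved, and the `allowed_pairs` whitelist is an invention for illustration.]

```python
import unicodedata

DOTTED_CIRCLE = '\u25CC'  # U+25CC DOTTED CIRCLE

def is_mark(ch: str) -> bool:
    # General category M* covers combining marks.
    return unicodedata.category(ch).startswith('M')

def insert_dotted_circles(text, allowed_pairs=frozenset()):
    """Insert U+25CC before any mark that follows another mark, unless
    the pair is whitelisted. Because this runs before OpenType layout,
    the font's mark-to-mark lookups never see the original sequence."""
    out = []
    prev = ''
    for ch in text:
        if is_mark(ch) and is_mark(prev) and (prev, ch) not in allowed_pairs:
            out.append(DOTTED_CIRCLE)
        out.append(ch)
        prev = ch
    return ''.join(out)

# BEH + Fatha + Fatha: the second Fatha is rejected by default,
# so the font is handed BEH, Fatha, dotted circle, Fatha.
print(insert_dotted_circles('\u0628\u064E\u064E'))
# With the pair whitelisted, the sequence reaches the font intact.
print(insert_dotted_circles('\u0628\u064E\u064E', {('\u064E', '\u064E')}))
```

The hypothetical whitelist parameter corresponds to the "let me handle the marks myself" opt-out requested above: the engine would defer to the font instead of rewriting the text.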
Re: Generic Base Letter
On Sun, Jun 27, 2010 at 10:00:18PM -0700, Asmus Freytag wrote:
> The one argument that I find convincing is that too many
> implementations seem set to disallow generic combination, relying
> instead on fixed tables of known/permissible combinations.

Only if you consider Microsoft "too many"; AFAIK, only Microsoft's Uniscribe exhibits such (stupid, in my opinion) behaviour.

> In that situation, a formally adopted character with the clearly
> stated semantic of "is expected to actually render with ANY combining
> mark from ANY script" would have an advantage. List-based
> implementations would then know that this character is expected to be
> added to the rendering tables for all marks of all scripts. Until and
> unless that is done, it couldn't be used successfully in those
> environments, but if the proposers could get buy-in from a critical
> mass of vendors of such implementations, this problem could be
> overcome. Without such a buy-in, by the way, I would be extremely
> wary of such a proposal, because the table-based nature of these
> implementations would prohibit the use of this new character in the
> intended way.

There are so many issues with the MS implementation(s); for example, you cannot combine arbitrary Arabic diacritical marks with any given base character. I don't think Unicode needs to invent workarounds for broken vendor implementations; interested parties should instead put pressure on that vendor to fix its implementation(s).

Regards,
Khaled

--
Khaled Hosny
Arabic localiser and member of Arabeyes.org team
Free font developer
Re: Generic Base Letter
On Mon, Jun 28, 2010 at 03:47:40PM +, Murray Sargent wrote:
> Khaled notes: "There are so many issues with MS implementation(s),
> for example you can not combine any arbitrary Arabic diacritical
> marks on any given base character. I don't think Unicode need to
> invent workaround broken vendor implementations, interested parties
> should instead pressure on that vendor to fix its implementation(s)."
>
> The MS Office math facility allows combining marks in the range
> U+0300..U+036F and most in the range U+20D0..U+20F0 to be applied to
> any base character(s) including complicated mathematical expressions.
> Such generality is needed in mathematics, since tildes, hats, bars,
> etc., are displayed over multiple base characters such as the
> expression a+b. Hebrew and Arabic combining marks aren't currently
> treated as valid mathematical combining marks, so the sequence U+25CC
> U+05BC U+05B8 doesn't render as Vincent desires in a math zone. It
> seems reasonable to allow all Unicode combining marks as accents in
> math zones.

That would be nice, but we were talking about combining marks in normal, non-math text. For example, it is now common practice to use two consecutive Fatha/Damma/Kasra marks for a certain form of Arabic tanwin used in the Koran; however, Uniscribe won't allow this and will always insert a dotted circle between the two marks. I know this behaviour is documented, but I fail to see the rationale behind it. Generally speaking, doing script spell checking in the rendering engine is a lousy idea IMO.

Regards,
Khaled

--
Khaled Hosny
Arabic localiser and member of Arabeyes.org team
Free font developer
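[Editor's note: the range check Murray describes can be written down directly. A rough sketch: he says "most" of U+20D0..U+20F0 is accepted without listing the exclusions, so this version treats the whole range as accepted; the function name is illustrative, not an actual Office API.]

```python
def is_math_accent(ch: str) -> bool:
    """True if a combining mark falls in the ranges the Office math
    facility accepts per the mail above: U+0300..U+036F (combining
    diacritical marks) and U+20D0..U+20F0 (combining marks for
    symbols). Arabic (U+064B..) and Hebrew (U+05B0..) marks fall
    outside both ranges, which is why the tanwin case fails."""
    cp = ord(ch)
    return 0x0300 <= cp <= 0x036F or 0x20D0 <= cp <= 0x20F0

print(is_math_accent('\u0303'))  # COMBINING TILDE: True
print(is_math_accent('\u20D7'))  # COMBINING RIGHT ARROW ABOVE: True
print(is_math_accent('\u064E'))  # ARABIC FATHA: False
print(is_math_accent('\u05B8'))  # HEBREW POINT QAMATS: False
```

This makes the asymmetry concrete: allowing all Unicode combining marks, as Murray suggests, would amount to replacing the two hard-coded ranges with a general category check.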