Re: N4106
I welcome that the combining parentheses are finally encoded as such, and not as precomposed diacritics, especially as I had evidence for 8-10 parenthesized diacritics (from linguistic material, of course; where else from!) in addition to those presented in the earlier Teuthonista proposals.

I was wondering whether COMBINING DOUBLE PARENTHESES *BELOW* shouldn't be added for symmetry. (I am fully aware that analogy is usually not taken into account, as e.g. combining small letters above are still an open set.) But I also seem to remember Teuthonista material showing double parentheses below in a chart. Is its absence an editorial oversight, or is my memory playing tricks on me?

Concerning the explanatory paragraph dealing with the fused carons, I have some reservations. In fact, the current behaviour, with carons changing to apostrophes unconditionally (i.e. without requiring language codes, as is done for Russian vs. Serbian Cyrillic italics), already poses some problems for the encodability of {d, t, l, L}+caron with an actual (detached) caron to display. Such representation is not unheard of. While this can be solved at the font level in high-quality typography, the main question concerning the fused caron is indeed whether the fused carons are distinct from the detached carons themselves. In fact, most attached combining marks are treated as _letter_modifications_ in Unicode, as far as I can see, be it b with right hook or the IPA symbols. This could warrant encoding what is called d/k/t with attached caron as {D, K, T} WITH SPLIT TOP STEM or something similar, as single code points. Of course, if the additional fused carons mentioned for Landsmålsalfabetet give us an order of magnitude more letters, a combining mark solution would become more desirable.
Szabolcs -- Szelp, André Szabolcs +43 (650) 79 22 400 On Sun, Nov 6, 2011 at 22:46, Kent Karlsson kent.karlsso...@telia.com wrote: Den 2011-11-05 04:23, skrev António Martins-Tuválkin tuval...@gmail.com: I'm going through N4106 ( http://std.dkuug.dk/jtc1/sc2/wg2/docs/n4106.pdf ), ... I see the following characters being put forward for proposing to be encoded: 1ABB COMBINING PARENTHESES ABOVE 1ABC COMBINING DOUBLE PARENTHESES ABOVE 1ABD COMBINING PARENTHESES BELOW 1ABE COMBINING PARENTHESES OVERLAY Well, COMBINING DOUBLE PARENTHESES ABOVE seems to be the same as COMBINING PARENTHESES ABOVE, COMBINING PARENTHESES ABOVE. And COMBINING PARENTHESES OVERLAY seems to be just a tiny parenthesis before and a tiny parenthesis after; no need for a combining mark, especially one with a splitting behaviour. Otherwise, I think COMBINING ((DOUBLE)) PARENTHESES ABOVE/BELOW are an entirely new brand of characters in Unicode (if accepted as proposed). They are supposed to split (ok, we have split vowels in some Indic scripts, more on that below), but these split around *another combining mark*. So despite being given (as proposed) vanilla above/below mark properties, they do not stack the way such characters normally do, but is supposed to invoke an entirely new behaviour. Split vowels are not new, but they split around base characters (or more generally, around combining sequences), not around (a) combining character(s) only. Indeed, one can split these vowels into two characters (sometimes by canonical decomposition, when done right; sometime by cheating a bit and split into another character and the supposedly split vowel character but not interpreted as the second part of the decomposition; in principle one may need to cheat even more and use PUA characters in order to do this at the character level, but then that is really bad). 
That supposedly stacking combining marks *sometimes* (more a font dependence than a character dependence) don't stack but instead are laid out linearly is not new. But to *require* non-stacking behaviour for certain characters is new. So we have a combination of: 1. Splitting. (Normally only used for some Indic scripts). 2. Indeed splitting with no other characters to use for the decomposition, thus requiring the use of PUA characters, to stay compliant, for representing the result of the split at the character level. (This is entirely new, as far as I can tell.) 3. The split is entirely *within* the sequence of combining characters (except for COMBINING PARENTHESES OVERLAY, which behaves as split vowels normally do, but still with issue 2), not around the combining sequence including the base. (This is entirely new.) 4. Requiring (if at all supported) to use linear layout of combining characters instead of stacking. (This is entirely new.) This makes these proposed characters entirely unique in their display behaviour, IMO. This could be alleviated by encoding COMBINING BEGIN/END PARENTHESIS ABOVE/BELOW. That way the issues with split, as listed above, can be avoided. There is still the issue of requiring (when at all
Re: N4106
From: Kent Karlsson kent.karlsson14_at_telia.com

Den 2011-11-05 04:23, skrev António Martins-Tuválkin tuvalkin_at_gmail.com:

I'm going through N4106 ( http://std.dkuug.dk/jtc1/sc2/wg2/docs/n4106.pdf ), ... I see the following characters being put forward for proposing to be encoded:

1ABB COMBINING PARENTHESES ABOVE
1ABC COMBINING DOUBLE PARENTHESES ABOVE
1ABD COMBINING PARENTHESES BELOW
1ABE COMBINING PARENTHESES OVERLAY

Well, COMBINING DOUBLE PARENTHESES ABOVE seems to be the same as COMBINING PARENTHESES ABOVE, COMBINING PARENTHESES ABOVE. And COMBINING PARENTHESES OVERLAY seems to be just a tiny parenthesis before and a tiny parenthesis after; no need for a combining mark, especially one with a splitting behaviour.

Otherwise, I think COMBINING ((DOUBLE)) PARENTHESES ABOVE/BELOW are an entirely new brand of characters in Unicode (if accepted as proposed). They are supposed to split (ok, we have split vowels in some Indic scripts, more on that below), but these split around *another combining mark*. So despite being given (as proposed) vanilla above/below mark properties, they do not stack the way such characters normally do, but is supposed to invoke an entirely new behaviour.

I agree, except that if we give them any but a ccc=220/230, then canonical reordering will separate them from the modifier letters that they are attached to. I think this is one of those cases where a definition needs to expand in order to accommodate architecture. We do already have some non-stacking behaviour defined for these characters in order to accommodate polytonic Greek, so we do have some experience with disparate appearances of consecutive marks.

That supposedly stacking combining marks *sometimes* (more a font dependence than a character dependence) don't stack but instead are laid out linearly is not new. But to *require* non-stacking behaviour for certain characters is new.

Then think of it as the non-spacing version of stacking behaviour.

So we have a combination of:

1. Splitting. (Normally only used for some Indic scripts).

2. Indeed splitting with no other characters to use for the decomposition, thus requiring the use of PUA characters, to stay compliant, for representing the result of the split at the character level. (This is entirely new, as far as I can tell.)

I cannot imagine in any way how this requires PUA characters.

3. The split is entirely *within* the sequence of combining characters (except for COMBINING PARENTHESES OVERLAY, which behaves as split vowels normally do, but still with issue 2), not around the combining sequence including the base. (This is entirely new.)

4. Requiring (if at all supported) to use linear layout of combining characters instead of stacking. (This is entirely new.)

If I were designing a font, I would simply make the in/out mark attachment point near the top/middle of the parentheses, so that it drops down around the base mark, and then attaches any subsequent marks as if the parentheses weren't there. I think you're making this too complicated.

This makes these proposed characters entirely unique in their display behaviour, IMO.

I do, however, agree totally with this assessment, I just believe it is more manageable than you paint it.

[snip]

/Kent K

I do, myself, have a couple of concerns in regards to several proposed characters in N4106 as well. Namely, I believe that U+1DF2, U+1DF3, and U+1DF4 should require significant justification as to why they should not be encoded as U+0363 + U+0308, U+0366 + U+0308, and U+0367 + U+0308. I have similar concerns about U+A799, U+AB30, U+AB33, U+AB38, U+AB3E, U+AB3F, etc.

Van A
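
[Editorial illustration] The canonical-reordering concern raised in this message can be checked against the Unicode Character Database. What follows is only a minimal sketch, assuming Python's standard unicodedata module. The sample marks (U+0363, U+0308, U+0323) are already-encoded characters chosen here for illustration; they are not the characters proposed in N4106, whose properties the library does not know. The helper names ccc and nfd_codepoints are hypothetical.

```python
import unicodedata

# Existing combining marks used as stand-ins (illustrative choice, not from N4106).
A_ABOVE = "\u0363"    # COMBINING LATIN SMALL LETTER A, ccc=230
DIAERESIS = "\u0308"  # COMBINING DIAERESIS, ccc=230
DOT_BELOW = "\u0323"  # COMBINING DOT BELOW, ccc=220


def ccc(ch):
    """Return the canonical combining class of a single character."""
    return unicodedata.combining(ch)


def nfd_codepoints(text):
    """Normalize to NFD and list the result as U+XXXX strings."""
    return ["U+%04X" % ord(c) for c in unicodedata.normalize("NFD", text)]


# Marks with the same class keep their relative order under normalization.
same_class = "e" + A_ABOVE + DIAERESIS
# Marks with different classes are sorted by class: 220 moves before 230.
mixed_class = "e" + DIAERESIS + DOT_BELOW

if __name__ == "__main__":
    for mark in (A_ABOVE, DIAERESIS, DOT_BELOW):
        print(unicodedata.name(mark), ccc(mark))
    print(nfd_codepoints(same_class))
    print(nfd_codepoints(mixed_class))
```

In this sketch the sequence U+0363 + U+0308 survives normalization in the order typed because both marks share class 230, while the diaeresis and the dot below swap places. A parenthesis mark whose class differed from that of the mark it encloses would presumably be moved by the same sorting, which appears to be the concern behind keeping ccc=220/230.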
Re: N4106
Den 2011-11-07 10:34, skrev vanis...@boil.afraid.org vanis...@boil.afraid.org:

So despite being given (as proposed) vanilla above/below mark properties, they do not stack the way such characters normally do, but is supposed to invoke an entirely new behaviour.

I agree, except that if we give them any but a ccc=220/230, then canonical reordering will separate them from the modifier letters that they are attached

Nit: modifier letters (as that term is used in Unicode) are not combining marks; here you mean combining marks.

to. I think this is one of those cases where a definition needs to expand in order to accommodate architecture. We do already have some non-stacking behaviour defined for these characters in order to accommodate polytonic Greek, so we do have some experience with disparate appearances of consecutive marks.

Yes, but that they have special behavior needs to be made explicit.

That supposedly stacking combining marks *sometimes* (more a font dependence than a character dependence) don't stack but instead are laid out linearly is not new. But to *require* non-stacking behaviour for certain characters is new.

Then think of it as the non-spacing version of stacking behaviour.

Would not be sufficient. See below.

So we have a combination of:

1. Splitting. (Normally only used for some Indic scripts).

2. Indeed splitting with no other characters to use for the decomposition, thus requiring the use of PUA characters, to stay compliant, for representing the result of the split at the character level. (This is entirely new, as far as I can tell.)

I cannot imagine in any way how this requires PUA characters.

Splitting is usually done at the character level... I know, some say that this should always be done at the glyph level (somehow), but IIUC that is not so in practice. And I think it is preferable to do it at the character level, so that is not just handwaved away (oh, the font should do this...) leaving it up to each and every font designer to do this odd-ball extra (and thus won't be done most of the time, even if the font framework may support it). Laying out linearly instead of stacking is quite enough odd-ball extra.

3. The split is entirely *within* the sequence of combining characters (except for COMBINING PARENTHESES OVERLAY, which behaves as split vowels normally do, but still with issue 2), not around the combining sequence including the base. (This is entirely new.)

4. Requiring (if at all supported) to use linear layout of combining characters instead of stacking. (This is entirely new.)

If I were designing a font, I would simply make the in/out mark attachment point near the top/middle of the parentheses, so that it drops down around the base mark, and then attaches any subsequent marks as if the parentheses weren't there. I think you're making this too complicated.

But glyphs for combining marks may be of different widths, for example a (glyph for a) dot below is much narrower than a (proposed) wiggly line below. Or, consider LENIS MARK and DOUBLE LENIS MARK (both for Teuthonista, and both apparently used together with parentheses). The usual, and general, way of handling that is to actually split the character-that-goes-on-both-sides of something that may have different widths in different instances. Of course you also need width info for combining marks. I would still consider splitting to be a needless complication here, and instead encode begin/end pairs of combining parentheses instead of what is in N4106.

This makes these proposed characters entirely unique in their display behaviour, IMO.

I do, however, agree totally with this assessment, I just believe it is more manageable than you paint it.

[snip]

/Kent K

I do, myself, have a couple of concerns in regards to several proposed characters in N4106 as well. Namely, I believe that U+1DF2, U+1DF3, and U+1DF4 should require significant justification as to why they should not be encoded as U+0363 + U+0308, U+0366 + U+0308, and U+0367 + U+0308.

There is the issue of whether the diaeresis applies to the base letter (plus something) or if it applies to the combining mark just under the diaeresis.

/Kent K

I have similar concerns about U+A799, U+AB30, U+AB33, U+AB38, U+AB3E, U+AB3F, etc.

Van A
Re: [indic] Re: Lack of Complex script rendering support on Android
Hi, Unicode List!

Code2000 supports most BMP code points of Unicode 5.2. It has been open source since September: http://code2000.sourceforge.net/

Regards, Anbu7

On Mon, 7 Nov 2011 13:45:47 +0600, Christopher Fynn chris.f...@gmail.com wrote:

Hard to keep track of these things - but it shouldn't affect the fact that one can safely implement OpenType rendering without a licence from Adobe or Microsoft.

On 7 November 2011 13:21, Philippe Verdy verd...@wanadoo.fr wrote:

2011/11/7 Christopher Fynn chris.f...@gmail.com:

I'm sure people like RedHat, Debian, and Sun/Oracle (who use it in OpenOffice) - have satisfied themselves that the open type rendering they use is unencumbered.

Actually now, this (OpenOffice) should no longer be Sun/Oracle but Apache. Oracle has donated OpenOffice to Apache, which accepted it. After the acquisition of Sun by Oracle, some OpenOffice developers were unhappy with Oracle and forked the project as LibreOffice; I am not sure that the two projects will merge again now that OpenOffice goes to Apache, but Apache has stated that both projects can coexist (there are some differences in the GUI, but many developers are already trying to make modules that work with both projects). For now OpenOffice is still branded by Oracle in the current distribution; this may change in the next major release, showing the Apache brand, once all IPR issues are resolved between Oracle and Apache.
Re: [indic] Re: Lack of Complex script rendering support on Android
On Mon, Nov 7, 2011 at 12:50 AM, Christopher Fynn chris.f...@gmail.com wrote:

On 6 November 2011 15:33, Mahesh T. Pai paiva...@gmail.com wrote: ...

You can update the firmware yourself - go the custom ROM way. (errr... I am afraid the carriers' representatives may do something to me for advising this). :-D You can continue to be with the existing carrier. Look around. If all else fails, ask google - the search engine; not the customer support. (what a paradox). There are ways to root and de-root the phone for warranty purposes.

We can probably all do things to get our own phones to work - but what is needed is that if *any* person buys *any* Android phone in India it just works with Indic scripts (even if the UI is not localized into all regional languages)

Not just if you buy a phone in India. In this modern world, there are plenty of students and immigrants all over the world who might enjoy communicating with their friends and families in their native languages. So I would say, if anyone buys *any* Android phone *anywhere* in the world, it should just automatically support complex text rendering for all Indic and Indic-derived scripts ...

I don't think it is an unreasonable expectation for consumers in these countries, when they shell out a lot of money for the latest phone, to be able to send and receive messages in their script.

Another problem is that reviews of phones, even those published or broadcast in India, rarely mention whether the phone supports Indic scripts - so it is difficult for a consumer to take this into account when deciding which phone to buy. A lot of consumers just expect it to be there, since even much cheaper phones have offered this for years.

I wonder if perhaps India needs to legislate script support on computing devices and cell phones?

I hope the latest Windows 7 phones don't have this limitation when we start to see them.

- C
Code2000 on SourceForge (was Re: [indic] Re: Lack of Complex script rendering support on Android)
On 7 November 2011 08:34, a...@peoplestring.com wrote: Code2000 supports most BMP code points of Unicode 5.2. It is open sourced from September: http://code2000.sourceforge.net/ I have doubts as to whether this project was actually created by James Kass. The project comprises the last public version of code2000.ttf and a 210MB code2000.asm file which turns out to be a dump of the ttf file in human-readable form, both of which could easily have been put onto SourceForge in contravention of copyright and license by someone pretending to be James who wants the font to be open source now that the official Code2000 site has disappeared. James once told me that Code2000 was maintained as a 66 megabyte dBASE III database file which is not what is on SourceForge. Andrew
Re: [indic] Re: Lack of Complex script rendering support on Android
I'd agree with Ed, it's a broader problem than just India, and a problem not just based on market segments. I use Android devices often, but cannot use them as a serious tool for work because of what I would classify as serious limitations in the OS and its internationalisation model. At the moment mobile devices are still toys, unable to deal with most community languages I need to work with or support in Australia, including many Latin script languages. This is a limitation in all mobile OSes, not just Android. I just have to look through my facebook and google+ accounts to see many messages in African and South East Asian languages that will not display. I do love the irony of Android devices not being able to display all content available through Google services.

Andrew
Re: [indic] Re: Lack of Complex script rendering support on Android
Ed Trager said on Mon, Nov 07, 2011 at 11:11:19AM -0500,:

Not just if you buy a phone in India. In this modern world, there are plenty of students and immigrants all over the world who might enjoy communicating with their friends and families in their native languages. So I would say, if anyone buys *any* Android phone *anywhere* in the world, it should just automatically support complex text rendering for all Indic and Indic-derived scripts ...

Arrey Bhai!!! Aap samjah karo; hum hindustani log pura Indic bhasha ASCII mein likha hai. Humko Unicode ya UTF-8 mumbu jumbo nahin chahiye. Sirf ek bhasha; sirf ek keyboard lay out - ASCII. (translation - ASCII is enough for us Indians to communicate in any Indic language - only one Keyboard, only one language)

/just venting my frustration at all those device makers, and I am not very fluent in HI.

And the second paragraph you quoted above - those are almost the exact words I used in one of my comments on one issue over at Android's issue tracker site.

I wonder if perhaps India needs to legislate script support on computing devices and cell phones?

May help. OTOH, when I got my last Nokia phone, the sales lady simply handed it over to me, after fiddling with it for some time. After I got out of the shop, I realised that the GUI was in some other language (either MR or GJ). It was an XpressMusic series phone. She apparently could not find her way out of the unfamiliar script. I finally looked up the user manual (one of those rare events in my life when I read a user manual before using a product), and followed the logos to get out. The X2 too has the same problem. Otherwise, both were very good phones. But ML rendering was very poor.

To be fair, when I flash one of the EU firmwares on my Android device, it boots up into Russian. I tried the Asian firmwares, only once - and it booted up into Chinese / Korean / Japanese / one of those East Asian languages!!! ;-D

-- Mahesh T. Pai || End Users are just friends who haven't submitted a patch yet.
Re: N4106
If I were designing a font, I would simply make the in/out mark attachment point near the top/middle of the parentheses, so that it drops down around the base mark, and then attaches any subsequent marks as if the parentheses weren't there. I think you're making this too complicated. But glyphs for combining marks may be of different widths, for example a (glyph for a) dot below is much narrower than a (proposed) wiggly line below. Or, consider LENIS MARK and DOUBLE LENIS MARK (both for Teuthonista, and both apparently used together with parentheses). The usual, and general, way of handling that is to actually split the character-that-goes-on-both-sides of something that may have different widths in different instances. Of course you also need width info for combining marks. I would still consider splitting to be a needless complication here, and instead encode begin/end pairs of combining parentheses instead of what is in N4106. No, the usual and general way of handling this is, if the uni-width of parenthesis is not desirable for esthetic reasons, to create precomposed _glyphs_ of the parenthesised diacritic by the font designer which are mapped to a character sequence. Szabolcs
Re: [indic] Re: Lack of Complex script rendering support on Android
On Mon, Nov 7, 2011 at 12:07 PM, Mahesh T. Pai paiva...@gmail.com wrote: ... To be fair, when I flash one of the EU firmwares on my Android device, it boots up into Russian. I tried the Asian firmwares, only once - and it booted up into Chinese/ Korean / Japanese / one of those East Asian languages!!! ;-D As Andrew Cunningham noted, the internationalization model --or really what I would call LACK of an intelligent I18N model-- on a lot of so-called smart phones just goes to show how immature these industry segments and markets still remain. For a while I used a Chinese-made smart phone that was capable of displaying more languages than the phone I received from my American cell phone carrier --and thus I conclude it was designed for sale in many export markets-- BUT it had horribly annoying shortcomings. For example, one could only type an SMS message in Chinese if it was set to a Chinese locale. But even then the Chinese input methods were limited and were inseparably tied to specific Chinese locales (i.e., pinyin was available for the Mainland locale, but pinyin was *not* available for a Traditional Chinese locale). Similar problems for Greek, Arabic, and other locale choices. This model is completely wrong and ignores the realities of most major urban centers in the world today where numerous immigrant communities not only exist but are also in constant contact with friends/families/merchants/suppliers/other service providers from all over the world. I should be able to set my locale and display language to anything I want; and still be able to use and switch-on-the-fly to any input method for any language I want independent of the current locale and display language. - Ed -- Mahesh T. Pai || End Users are just friends who haven't submitted a patch yet.