Re: RTL PUA?
On 8/21/2011 7:34 PM, Doug Ewell wrote: So what you are asking about is a directional control character that would assign subsequent characters a BC of 'AL', right? You don't want to call this a LANGUAGE MARK or anything else that implies language identification, because of the existence of real language identification mechanisms and the history of Unicode and language tagging. An ARM (Arabic RTL Mark) would be a sensible addition to the standard. It would close a small gap in design that currently prevents a fully faithful plain text export of bidi text from rich text (higher level protocol) formats. In a HLP you can assign any run to behave as if it was following a character with bidi property AL. When you export this text as plain text, unless there is an actual AL character, you cannot get the same behavior (other than by the heavy-handed method of completely overriding the directionality, making your plain text less editable). So, yes, there's a bit of a use case for such a mark. (Its effect is limited to treatment of numeric expressions, so it's not an Arabic language mark, but one that triggers the same bidi context as the presence of an Arabic Script (AL) character.) A./ -- Doug Ewell • d...@ewellic.org Sent via BlackBerry by ATT -----Original Message----- From: Richard Wordingham richard.wording...@ntlworld.com Sender: unicode-bou...@unicode.org Date: Mon, 22 Aug 2011 03:19:39 To: Unicode Mailing List unicode@unicode.org Subject: Re: RTL PUA? On Sun, 21 Aug 2011 23:55:46 + Doug Ewell d...@ewellic.org wrote: What's a LANGUAGE MARK? There are *three* strong directionalities - 'L' left-to-right, 'AL' right-to-left as in Arabic, 'R' right-to-left (as in Hebrew, I suspect). 'AL' and 'R' have different effects on certain characters next to digits - it's the mind-numbing part of the BiDi algorithm. With one a $ sign after a string of European (or is it Arabic?) digits appears on the left and in the other it appears on the right.
I can't remember whether 'higher-level protocols' have an effect on this logic. LRM has a BC of L, RLM has a BC of R, but no invisible character has a BC of AL. That's why I tentatively raised the notion of ARABIC LANGUAGE MARK. Incidentally, an RLO gives characters with a temporary BC of R, not AL. Richard.
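The Bidi_Class (BC) values mentioned above can be checked against the Unicode Character Database. Here is a minimal Python sketch using the standard `unicodedata` module. The helper name `bidi_class` and the choice of sample characters are illustrative assumptions, and the values reported come from whatever Unicode version your Python ships, not from the 2011 state of the standard.

```python
import unicodedata


def bidi_class(ch: str) -> str:
    """Return the Bidi_Class (BC) short name for a single character."""
    return unicodedata.bidirectional(ch)


# Sample characters chosen for illustration only.
samples = {
    "LRM U+200E": "\u200e",          # LEFT-TO-RIGHT MARK
    "RLM U+200F": "\u200f",          # RIGHT-TO-LEFT MARK
    "HEBREW ALEF U+05D0": "\u05d0",
    "ARABIC ALEF U+0627": "\u0627",
    "PUA U+E000": "\ue000",          # first BMP private-use code point
}

for label, ch in samples.items():
    print(f"{label}: BC={bidi_class(ch)}")
```

The two invisible marks report L and R, the Hebrew and Arabic letters report R and AL, and the private-use code point reports the default L that the RTL PUA discussion is about.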
RE: RTL PUA?
I don't buy the assumption that all the world is either AAT, Graphite or Uniscribe. Anyhow, this discussion is going off topic, the issue is should Unicode specify an RTL PUA area, not whether some products, however respectable, provide a bypass. Jony -----Original Message----- From: unicode-bou...@unicode.org [mailto:unicode-bou...@unicode.org] On Behalf Of Shriramana Sharma Sent: Monday, August 22, 2011 8:12 AM To: unicode@unicode.org Subject: Re: RTL PUA? On 08/22/2011 08:24 AM, Peter Constable wrote: I'm not saying that there shouldn't be _some_ software that can do what you expect. But there will likely be some different views on what ought to be included within that some. Peter, given that both AAT and Graphite have provisions for assigning custom properties including BC to PUA characters, it seems Uniscribe is the only one missing out. Those advocating RTL PUA areas seem to reject AAT and Graphite as hacks or wow *one* application [*]. [* = LibreOffice is the *only* multipurpose application running on /Windows/ to support Graphite and I'm not counting SIL WorldPad. On *nix platforms, *any* number of applications that use HB-NG for rendering will be able to handle Graphite in the near future because HB-Graphite integration is already done. That is to say, once GTK and Qt fully switch to HB-NG.] Anyhow, if you Microsoft guys added support in Uniscribe for ascribing custom properties including BC to PUA characters (or have you already done it) it would be what would satisfy these PUA RTL users and convince them that no RTL PUA zones are needed, it seems. The suggestion has been made that fonts should be able to carry some additional custom tables specifying custom properties for PUA characters, which seems reasonable. I'm not sure if the OT GDEF table or the AAT PROP table completely satisfies this requirement. People interested in using custom properties for the PUA (which includes me for Indic script) should then sit up and formulate the syntax for such tables. 
If Uniscribe, AAT, and Harfbuzz then provided generic support for parsing such tables and rendering PUA characters accordingly, it would be an all-around solution both for RTL PUA as well as Indic PUA, I suppose. (But I'm not sure how such a custom table would interact with the innate ability of Graphite to handle custom properties. It should probably be either the new proposed custom table or Graphite.) [sigh] -- Shriramana Sharma
Re: RTL PUA?
On 22 Aug 2011, at 03:57, Peter Constable wrote: From: unicode-bou...@unicode.org [mailto:unicode-bou...@unicode.org] On Behalf Of Asmus Freytag Treating PUA characters as ON is very problematic As would be changing the default property of PUA characters from L to ON. Which is why that will not be proposed. Michael Everson * http://www.evertype.com/
Re: RTL PUA?
On 22 Aug 2011, at 05:53, Shriramana Sharma wrote: While I don't know much about RTL scripts, if the logic order is ALEF + LAMED, but the presentation order is LAMED + ALEF *because of the RTL nature* do you write the rule as ALEF + LAMED = ALEF_LAMED_LIGATURE or LAMED + ALEF = ALEF_LAMED_LIGATURE ? The specific shape of that ligature is not a result of the directionality property. Michael Everson * http://www.evertype.com/
Re: C1 Control Pictures Proposal
On Aug 17, 2011, at 4:38 PM, Andrew West wrote: Unless you can show evidence that C1 control pictures are currently in use and that there is a clear demand from the user community to On Aug 21, 2011, at 10:13 AM, Doug Ewell wrote: Perhaps it would help for you to do a quick survey of applications that already make use of the existing C0 control pictures, and include the results in your argument. That might help convince some of us who feel the C0 pictures are only there for compatibility with previous character encodings This is a reasonable request. In a follow-up post or in any event in the formal proposal, I shall include examples of use of and/or demand for the representation of control pictures. I would like to ask you/the list for the sources for C0 control pictures. They appear to be ANSI X3.32 and ISO 2047. (Also, FIPS Pub. 1-2, which consolidates ANSI X3.32 and some others.) Does anybody have these, and can you look the pictures up? In particular, X3.32 is withdrawn... -Sean
Re: RTL PUA?
On Mon, Aug 22, 2011 at 10:42:05AM +0530, Shriramana Sharma wrote: On 08/22/2011 08:24 AM, Peter Constable wrote: I'm not saying that there shouldn't be _some_ software that can do what you expect. But there will likely be some different views on what ought to be included within that some. Peter, given that both AAT and Graphite have provisions for assigning custom properties including BC to PUA characters, it seems Uniscribe is the only one missing out. Those advocating RTL PUA areas seem to reject AAT and Graphite as hacks or wow *one* application [*]. I personally would say to make some blocks in Plane 16 default to R, some AL and some ON. For fonts based on rendering engines that don't allow fonts to change character properties this would be crucial; for those engines that are capable of changing the properties it would present no problem (the font can change these properties arbitrarily even if they default to RTL...). [* = LibreOffice is the *only* multipurpose application running on /Windows/ to support Graphite and I'm not counting SIL WorldPad. On *nix platforms, *any* number of applications that use HB-NG for rendering will be able to handle Graphite in the near future because HB-Graphite integration is already done. That is to say, once GTK and Qt fully switch to HB-NG.] That said, HarfBuzz-ng itself (i.e. its own engine) tries to imitate Uniscribe. Most probably, Graphite fonts will still be an exception on these systems... [sigh] -- Shriramana Sharma -- Petr Tomasek http://www.etf.cuni.cz/~tomasek Jabber: but...@jabbim.cz EA 355:001 DU DU DU DU EA 355:002 TU TU TU TU EA 355:003 NU NU NU NU NU NU NU EA 355:004 NA NA NA NA NA
Re: Code pages and Unicode
On 21 August 2011 02:14, Richard Wordingham richard.wording...@ntlworld.com wrote: On Fri, 19 Aug 2011 17:03:41 -0700 Ken Whistler k...@sybase.com wrote: O.k., so apparently we have awhile to go before we have to start worrying about the Y2K or IPv4 problem for Unicode. Call me again in the year 2851, and we'll still have 5 years left to design a new scheme and plan for the transition. ;-) It'll be much easier to extend UTF-16 if there are still enough contiguous points available. Set that wake-up call for 2790, or whenever plane 13 (better, plane 12) is about to come into use. Stymied by the Unicode® stability policies again: The General_Category property values will not be further subdivided. The General_Category property value Surrogate (Cs) is immutable: the set of code points with that value will never change. http://unicode.org/policies/stability_policy.html#Property_Value Can anyone think of a way to extend UTF-16 without adding new surrogates or inventing a new general category? Andrew
Re: Code pages and Unicode
On 08/22/2011 03:05 PM, Andrew West wrote: Can anyone think of a way to extend UTF-16 without adding new surrogates or inventing a new general category? Why would anyone *need* to do so? UTF-16 can represent all code points up to Plane 16, right? -- Shriramana Sharma
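The Plane 16 limit follows directly from the surrogate arithmetic: 1024 high surrogates times 1024 low surrogates address exactly 16 supplementary planes beyond the BMP. The Python sketch below works through that arithmetic. The function names `to_surrogates` and `from_surrogates` are illustrative, not part of any library.

```python
from typing import Tuple


def to_surrogates(cp: int) -> Tuple[int, int]:
    """Split a supplementary code point into a UTF-16 surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary-plane code point")
    v = cp - 0x10000                 # 20-bit offset into planes 1..16
    high = 0xD800 + (v >> 10)        # top 10 bits -> high surrogate
    low = 0xDC00 + (v & 0x3FF)       # bottom 10 bits -> low surrogate
    return high, low


def from_surrogates(high: int, low: int) -> int:
    """Recombine a surrogate pair into the code point it encodes."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# The last high and low surrogates together encode the last code point
# of Plane 16, so the existing surrogates are exhausted exactly there.
print([hex(u) for u in to_surrogates(0x10FFFF)])
print(hex(from_surrogates(0xDBFF, 0xDFFF)))
```

Since every (high, low) combination is already used, going past U+10FFFF would need more surrogate code points, which is exactly what the stability policy quoted by Andrew rules out.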
Re: RTL PUA?
On 08/22/2011 04:34 PM, Behdad Esfahbod wrote: On 08/22/11 06:53, Shriramana Sharma wrote: While I don't know much about RTL scripts, if the logic order is ALEF + LAMED, but the presentation order is LAMED + ALEF*because of the RTL nature* do you write the rule as ALEF + LAMED = ALEF_LAMED_LIGATURE or LAMED + ALEF = ALEF_LAMED_LIGATURE ? Depends on your specific shaping engine logic. OpenType assumes native direction per script. So if you have Arabic text between LRO/PDF, you have to reverse the order then apply OpenType shaping. Other engines may decide to handle these differently. But the general statement is true: ligatures are visual artifacts and hence only form in one direction, not the other (except if it's, say, the ff ligature). Hi Behdad. I only asked whether the OT *tables* would contain the entries in the logical order or the visual order. Clearly it would still be the visual order (but Philippe Verdy seemed to imagine/suggest otherwise). It is clear that in the *script itself* the ligature would form in the direction of writing. -- Shriramana Sharma
Re: RTL PUA?
On 08/22/2011 05:26 PM, Behdad Esfahbod wrote: OpenType tables contain entries in the logical order of the script in question. Ie. Arabic tables are always RTL. Yes I understand, but still, to clarify: The font tables themselves contain only ASCII characters I presume. In it do you write: ALEF + LAMED = ALEF_LAMED_LIGATURE or LAMED + ALEF = ALEF_LAMED_LIGATURE ? IIUC, in logical order ALEF precedes LAMED, and in visual order, ALEF stands to the right of LAMED. -- Shriramana Sharma
Re: Code pages and Unicode
On 22 August 2011 12:51, Shriramana Sharma samj...@gmail.com wrote: On 08/22/2011 03:05 PM, Andrew West wrote: Can anyone think of a way to extend UTF-16 without adding new surrogates or inventing a new general category? Why would anyone *need* to do so? UTF-16 can represent all codepoints upto Plane 16 right? To clarify, I was replying to Richard Wordingham's tongue in cheek suggestion to extend UTF-16 to go beyond Plane 16 in the year 2790 or when only one free plane remains. I am not advocating extending UTF-16 or the Unicode code space, or suggesting that it will ever be necessary to do so. But hypothetically, I don't see a way to extend UTF-16 without breaking the stability policy. The same stability policies would also prohibit the assignment of any area of the Unicode code space for code page usage as Srivas Sinnathurai has proposed. (If there was an automatic filter on ideas that break one or more stability policies this mailing list would be a far quieter place.) Andrew
Re: RTL PUA?
On 08/22/2011 12:21 PM, Jonathan Rosenne wrote: I don't buy the assumption that all the world is either AAT, Graphite or Uniscribe. Nobody asserted that either. It is only pointed out that major implementations are able to provide what you seek. Anyhow, this discussion is going off topic, the issue is should Unicode specify an RTL PUA area, not whether some products, however respectable, provide a bypass. I don't see why you call it a *bypass*. Only if the road in front of you presents obstacles and does not allow you to proceed further, you need to take a bypass. If we are considering the Standard as the road which we need to take, the road doesn't present any obstacle to using PUA characters as RTL, so Graphite etc are not providing a *bypass* but in fact just being good generous implementations that allow custom properties for the PUA as the Standard allows. The request being made to allocate BC=R areas in the PUA is sure to generate an impression that conformant implementations should consider such a property normative, which then would violate the definition of the PUA that conformant implementations need not treat any property of the PUA as normative. Returning to your concerns, it is being asserted that since implementations are *already* able to provide for custom properties for the PUA, there is *no* need for Unicode to specify an RTL PUA area and furthermore as such a specification would violate the definition of the PUA, it should also *not* be done. One both *need* not do it and *should* not do it. -- Shriramana Sharma
Re: RTL PUA?
Um... Computers are hardware, and don't understand a thing. What I think you mean is computer _software_. (I know, I'm being pedantic, but with good reason.) Sorry, I just can’t resist pointing out that the difference between hardware and software is only the fact that the former is material, with all the consequences that follow. In any other way they are completely interchangeable. As for the other part of your mail, Peter, sorry, but it really doesn’t make any sense to me. As John has pointed out, you can adjust the properties of private use characters on Apple computers. Perhaps there is a way to do so on Windows, Unix and other systems as well. What Philippe and Doug are proposing, and I also strongly agree with, is to have a standard way of interchange of these properties. I don’t think it is necessary to go into the advantages of standards. Speaking of actual implementation, I’m convinced that this format should be the same as it is for encoded characters (whether it is the plain text format of the Unicode Character Database, XML or anything else). Rendering engines should – maybe they already do so – accept multiple files containing character properties, which could make upgrades to the newer versions of the standard a matter of downloading the new standard set, and provide a way of overriding private use (or even standard if one is so inclined) characters’ properties. Introduction of unencoded scripts would therefore become a matter of distributing a small properties file and the corresponding fonts. Á
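The idea of a small, interchangeable properties file can be sketched as follows. This is a hypothetical illustration only: the file layout imitates the `range ; value # comment` style of UCD files such as DerivedBidiClass.txt, but the sample content, the code point assignments, and the function name `parse_property_file` are all assumptions, not an existing standard or API.

```python
from typing import Dict


def parse_property_file(text: str) -> Dict[int, str]:
    """Parse 'XXXX..YYYY ; VALUE # comment' lines into {code point: value}."""
    overrides: Dict[int, str] = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments and blanks
        if not line:
            continue
        rng, value = [field.strip() for field in line.split(";")[:2]]
        if ".." in rng:
            first, last = rng.split("..")
        else:
            first = last = rng                # single code point
        for cp in range(int(first, 16), int(last, 16) + 1):
            overrides[cp] = value
    return overrides


# Hypothetical private agreement: bidi classes for some PUA code points.
SAMPLE = """\
# Bidi_Class overrides for an imaginary unencoded RTL script
E000..E07F ; R   # letters
E080       ; AL  # one Arabic-like letter
"""

table = parse_property_file(SAMPLE)
print(len(table), table[0xE000], table[0xE080])
```

A rendering engine that accepted such files would consult the table before falling back to the default properties of the standard.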
Re: RTL PUA?
On 08/22/2011 08:26 AM, Shriramana Sharma wrote: On 08/22/2011 05:26 PM, Behdad Esfahbod wrote: OpenType tables contain entries in the logical order of the script in question. Ie. Arabic tables are always RTL. Yes I understand, but still, to clarify: The font tables themselves contain only ASCII characters I presume. In it do you write: ALEF + LAMED = ALEF_LAMED_LIGATURE or LAMED + ALEF = ALEF_LAMED_LIGATURE ? IIUC, in logical order ALEF precedes LAMED, and in visual order, ALEF stands to the right of LAMED. In the ligature tables, it's recorded as ALEF + LAMED = ALEF_LAMED_LIGATURE. The font tables are concerned with what happens when this character follows that one, not what happens when this character stands on the right of that one. So it's stored in logical order. ~mark
Re: RTL PUA?
2011/8/22 Peter Constable peter...@microsoft.com: From: ver...@gmail.com [mailto:ver...@gmail.com] On Behalf Of Philippe Verdy As I explained in an earlier message, the layout engine doesn't use the default property value but the resolved bidi level. Once again, you refuse to understand my arguments. I don't think I'm refusing to understand anything. I'm merely taking your assertions _as stated_ and evaluating whether I think they are accurate or not. Perhaps what you intend to convey assumes things not clear in what you've stated, since you think I'm not understanding you. What I'm saying is that OpenType CANNOT resolve the bidi level of PUAs (with the exception where we use additional BiDi controls, Of course _OpenType_ cannot, but any rendering engine that uses OpenType _must_ resolve the bidi level of _all_ characters in a sequence that it is given to render. Given our current situation, a default rendering implementation would resolve PUA characters to an even (LTR) level unless, of course, bidi control characters -- particularly RLO -- are used to override the directionality of the character, as you mention. which remains a hack, because it adds unnecessary invisible markup around the encoded texts, and complicates the use of strings and substrings). Well, depending on how you define hack, some might reasonably suggest that any usage of PUA is a hack. (Of course, some who may not use the term in the same way might argue that it is certainly not a hack.) You can turn the problem as you want, but PUAs (as well as unknown characters) still have default properties that, in fine, will get used in absence of a more precise definition (i.e. an explicit override) of the actual BiDi property needed for the character. So now I perceive your opinion: - you don't want the solution proposed by Michael Everson (simply adding a range of RTL PUA), that I also think is not necessary, but is clearly a possible solution. - you propose to use BiDi overrides. 
I also think (like Michael Everson) that this is an impractical hack. Michael Everson, who has to work with old scripts, or with many new unencoded characters to add to existing scripts (notably Arabic), trying to encode them, finding various ways to represent them, and *test* his solutions, will certainly think that embedding each occurrence of a PUA substring in BiDi controls, including in the middle of Arabic words, is a very bad hack. - He must certainly think (as I do) that PUA characters are NOT hacks. They are architectural to the well-being of the UCS, essential in various situations to preserve software conformance with the standard. In fact, for old and rare scripts, using PUAs will remain essential for a long time, because those scripts will now need more and more time to get encoded, requiring more extensive research and more collaboration with less technically aware people, who cannot understand why they have to test the proposed solutions using test fonts and test input methods that require them to enter BiDi controls around all those PUA characters. The only problem here is the strong LTR property of all existing PUAs, as if they were only needed for rare Han sinograms, or for symbols. Note that, for using a PUA for rare letters found in Arabic, it is impossible to embed the whole Arabic text in Bidi overrides: this would completely break the normal behavior of the non-PUA characters found in the text, notably sequences of Arabic digits, because the BiDi controls are effectively disabling the BiDi algorithm so that it will return a single RTL run for all the text in these controls. IF BiDi controls are used, they have to be inserted ONLY around the subranges containing the PUAs, and only those. The solution proposed by Michael (a new block of RTL PUAs, probably in plane 14) still has an advantage: no BiDi controls are needed at all. The BiDi algorithm does not have to be disabled. 
All other aspects of RTL scripts (or mixed RTL/LTR scripts) are preserved (including mirroring behaviors for auto-LTR characters (at the beginning of paragraphs) and characters whose directionality depends on the resolved direction of the preceding text). I don't think this is necessary though: I see no reason why implementations *have to* keep the strong LTR property of existing PUAs. This strong LTR property is only the consequence of the fact that this is only the *default* value of those PUAs, and applications should not be restricted from changing this property as they want, especially for PUAs. But to change this property value, we need an explicit PUA agreement about their usage, in such a way that it can be understood by a computer. This means an external source of character properties. My opinion is that this need is most often sufficient if it solves just the problem of correct display order. Given that the encoded texts (using those existing strong LTR PUAs that we want to adopt a RTL
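The override approach criticized in this message, wrapping only the PUA subranges in RLO ... PDF rather than the whole text, can be sketched in Python as below. The function name `wrap_pua_runs` is illustrative, and for brevity the sketch only considers the BMP private-use block U+E000..U+F8FF, not the private-use planes 15 and 16.

```python
import re

RLO = "\u202e"   # RIGHT-TO-LEFT OVERRIDE
PDF = "\u202c"   # POP DIRECTIONAL FORMATTING

# One or more consecutive BMP private-use characters.
PUA_RUN = re.compile("[\ue000-\uf8ff]+")


def wrap_pua_runs(text: str) -> str:
    """Wrap each maximal run of BMP PUA characters in RLO ... PDF."""
    return PUA_RUN.sub(lambda m: RLO + m.group(0) + PDF, text)


sample = "\u0627\ue000\ue001\u0628 123"
wrapped = wrap_pua_runs(sample)
# Show the result as code points so the invisible controls are visible.
print(" ".join(f"U+{ord(c):04X}" for c in wrapped))
```

Every PUA run costs two extra invisible characters, which is the markup overhead and substring-handling complication the message objects to.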
Re: RTL PUA?
2011/8/22 Peter Constable peter...@microsoft.com: From: unicode-bou...@unicode.org [mailto:unicode-bou...@unicode.org] On Behalf Of Asmus Freytag Treating PUA characters as ON is very problematic As would be changing the default property of PUA characters from L to ON. I also agree with that. This is a bad option that would break compatibility (the solution advocated by Michael Everson seems better, from that perspective, because it does not change any existing property given to existing assigned PUAs). Anyway, when I spoke about a computer, note that I did not use the definite article. It is evidently implied that there's also a need for software changes as well (so this does not mean *all* computers, but this could someday reach *most* computers with their installed or upgraded software). Your last remark in another message of this thread was really pedantic.
Re: RTL PUA?
2011/8/22 Shriramana Sharma samj...@gmail.com: On 08/22/2011 12:01 AM, Peter Constable wrote: If you mean a rule to substitute [g1 g2] with [g3] won't apply if the sequence processed by the OpenType Layout lookup processor is [g2 g1], Peter, actually I suspect Philippe is thinking that in the case of RTL, the *glyphs* are placed in reverse order and then he is asking how can the ligation take place. No, I've not said anything about ligation. But yes the problem is related to the expected reverse order of glyphs, for some PUAs, but not necessarily all of them (not the LTR runs of PUAs, after Bidi resolution). Ligation is a completely orthogonal problem (not really a problem because it is already solved).
RE: Code pages and Unicode
srivas sinnathurai sisrivas at blueyonder dot co dot uk wrote: The true lifting of UTF-16 would be to UTF-32. Leave the UTF-16 untouched and make the new half as versatile as possible. I think any other solution is just a patch up for the time being. There is no evidence whatsoever that this is a problem that needs to be solved, not in 700 or 800 years, not ever. Ken's words are again being ignored. -- Doug Ewell | Thornton, Colorado, USA | RFC 5645, 4645, UTN #14 www.ewellic.org | www.facebook.com/doug.ewell | @DougEwell
RE: RTL PUA?
It's actually quite easy to convince Uniscribe to treat specific characters as RTL, others as LTR, and, in general, with whatever classifications you desire. Pass a preprocessed string to Uniscribe's ScriptItemize(). RichEdit has used that approach to some degree starting with RichEdit 3.0 (Windows/Office 2000). It's also a handy way to force all operators to be treated as LTR in an LTR math zone and as RTL in an RTL math zone (aside from numeric contexts for '.' and ','). And you can force IRIs to display LTR or RTL that way by classifying the delimiters such as the dots in the domain name accordingly. Some of my blog posts on http://blogs.msdn.com/b/murrays/ discuss this in greater detail. So there's no need to change the properties of the PUA to establish PUA RTL conventions. They won't be generally interchangeable, but that's the nature of the PUA. You also have to implement such choices using rich/structured text. Plain text doesn't have a place to store the necessary properties. Most text is rich text anyway grin. Murray
Re: RTL PUA?
2011/8/22 Mark E. Shoulson m...@kli.org: I'm not certain I understand the question, but if I have it right... The logic order is ALEF + LAMED, and the presentation... places those in a right-to-left sequence, shall we say (since talking about the presentation *order* is confusing here). The font table contains the lookup that ALEF + LAMED = ALEF_LAMED_LIGATURE. It all goes according to the logical order, since the presentation order isn't really an order, it's just a direction. (this is different from things like devanagari short-i vowel, which moves with respect to the other letters in the script.) Lookup tables in fonts (at least OpenType) do not work at the character level, but at the glyph level: they substitute glyph ids by other glyph ids. Sequences of glyph ids are already reordered in visual order by the layout engine when they are searched in OpenType lookups, should they be RTL glyphs, or Indic glyphs with special reordering requirements (independent of the logical ordering of characters/code points). In addition, the same sequence of characters may sometimes be searched in several distinct sequences of glyph ids (this depends on the kind of OpenType table being consulted, as well as on character properties which also determine which lookup table will be searched and the relative order of successive lookups). The only lookup table in fonts that works at the character/code point level is the cmap (which maps each encoded character to a default glyph id, independently of their logical or visual ordering, as well as independently of the script/language in which those characters or glyphs are used, but possibly depending on the encoding used and the software platform supporting that encoding). 
Not all fonts need a cmap; for some of them, a default cmap may be implied or automatically constructed -- for example Symbol fonts in Windows, that are implicitly mapped in a PUA range; another example is Type1 or CFF fonts that have a default standardEncoding inherited from PostScript, based on glyph names (rather than glyph ids or code points) that may have themselves an implicit mapping to UCS codepoints (if these names are those defined in the AGL). Not all these mappings are 1-to-1, which means that they are not reversible, in the general case.
Re: RTL PUA?
2011/8/22 Shriramana Sharma samj...@gmail.com: Hi Behdad. I only asked whether the OT *tables* would contain the entries in the logical order or the visual order. Clearly it would still be the visual order (but Philippe Verdy seemed to imagine/suggest otherwise). No ! I've not imagined that. You incorrectly reinterpret imaginatively another incorrect imaginative reinterpretation, made by someone else, of what I wrote, which did not even suggest that.
Re: RTL PUA?
2011/8/22 Shriramana Sharma samj...@gmail.com: On 08/22/2011 05:26 PM, Behdad Esfahbod wrote: OpenType tables contain entries in the logical order of the script in question. Ie. Arabic tables are always RTL. Yes I understand, but still, to clarify: The font tables themselves contain only ASCII characters I presume. No. The lookup tables contain sequences of numeric glyph ids (16 bit integers in TrueType and OpenType), which are neither code point values nor character names or glyph names. you write: ALEF + LAMED = ALEF_LAMED_LIGATURE or LAMED + ALEF = ALEF_LAMED_LIGATURE ? Let's say that:
- the LAMED character is cmap'ped (by its code point value in a cmap for Unicode, or by its code position in a cmap for another legacy 8-bit encoding) to the glyph id 1012,
- the ALEF character is cmapped to the glyph id 1001 (the values of glyph ids are not important, not even their relative order or differences, they don't need to obey any standard),
- and the ALEF-LAMED ligature is in glyph id 1540 (the ALEF-LAMED character of the UCS may also be cmapped separately, but this is not a requirement).
Then the lookup to perform the ligature will contain: (1012, 1001) -> (1540). Glyph ids are presented and scanned in the lookup table, in sequences preordered in visual order by the text layout/shaping engine. However, given that the ALEF-LAMED is also a character of the UCS, the text layout/shaping engine that knows the Arabic script can also perform a character-based substitution itself, even in absence of the lookup of glyph ids in fonts; then it can render the ligature character according to the glyph id to which it is cmapped in that font.
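The glyph-id example above can be modeled with a toy Python sketch. This is not real OpenType processing: `CMAP`, `LIGATURES`, and `apply_ligatures` are hypothetical names, the glyph ids are the ones invented in the message, and the sketch takes no position on whether the shaping engine hands glyphs to the lookup in logical or visual order (the thread disagrees on that point). It only shows that a pair lookup matches one ordering of the input and not the other.

```python
from typing import Dict, List, Tuple

# Code point -> glyph id, using the illustrative ids from the message.
CMAP: Dict[int, int] = {
    0x05DC: 1012,   # HEBREW LETTER LAMED
    0x05D0: 1001,   # HEBREW LETTER ALEF
}

# Pair of glyph ids -> ligature glyph id, as written in the message.
LIGATURES: Dict[Tuple[int, int], int] = {(1012, 1001): 1540}


def apply_ligatures(glyphs: List[int]) -> List[int]:
    """Scan left to right, replacing each matching pair with its ligature."""
    out: List[int] = []
    i = 0
    while i < len(glyphs):
        pair = tuple(glyphs[i:i + 2])
        if len(pair) == 2 and pair in LIGATURES:
            out.append(LIGATURES[pair])
            i += 2                    # both component glyphs are consumed
        else:
            out.append(glyphs[i])
            i += 1
    return out


print(apply_ligatures([CMAP[0x05DC], CMAP[0x05D0]]))   # pair as stored
print(apply_ligatures([CMAP[0x05D0], CMAP[0x05DC]]))   # reversed pair
```

Whichever order the engine uses, the font author has to store the pair in that same order, which is the practical point behind the question.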
Re: Code pages and Unicode
Christoph Päper 於 2011年8月20日 上午2:31 寫道: Mark Davis ☕: Under the original design principles of Unicode, the goal was a bit more limited; we envisioned […] a generative mechanism for infrequent CJK ideographs, I'd still like having that as an option. Et voilà! We have Ideographic Description Sequences. Or, if you're more ambitious, CDL. Generative mechanisms for Han are very attractive given the nature of the script, but once you try to support something other than display, or even try to write a rendering engine, all sorts of nasty problems crop up that have proven difficult to solve. We won't even get into the problem of wanting to discourage people from making up new ad hoc characters for Han. I won't say some sort of generative mechanism will never become the preferred way of handling unencoded ideographs, but there is a lot of work to be done before that would be practical. = John H. Jenkins jenk...@apple.com
Re: RTL PUA?
2011/8/22 Joó Ádám a...@jooadam.hu: Um... Computers are hardware, and don't understand a thing. What I think you mean is computer _software_. (I know, I'm being pedantic, but with good reason.) Sorry, I just can’t resist pointing out that the difference between hardware and software is only the fact that the former is material, with all the consequences that follow. In any other way they are completely interchangeable. Same opinion for me. As for the other part of your mail, Peter, sorry, but it really doesn’t make any sense to me. As John has pointed out, you can adjust the properties of private use characters on Apple computers. Perhaps there is a way to do so on Windows, Unix and other systems as well. What Philippe and Doug are proposing, and I also strongly agree with, is to have a standard way of interchange of these properties. I don’t think it is necessary to go into the advantages of standards. Speaking of actual implementation, I’m convinced that this format should be the same as it is for encoded characters (whether it is the plain text format of the Unicode Character Database, XML or anything else). Rendering engines should – maybe they already do so – accept multiple files containing character properties, which could make upgrades to the newer versions of the standard a matter of downloading the new standard set, and provide a way of overriding private use (or even standard if one is so inclined) characters’ properties. Introduction of unencoded scripts would therefore become a matter of distributing a small properties file and the corresponding fonts. As well, the small properties files can be embedded, in a very compact form, in the PUA font. This small table can be limited to just listing the ranges of PUA code points that are strong RTL instead of LTR. 
Most often, there will be only one range, and this just requires a couple of integers in that embedded table (possibly more, only if you want to represent more properties), without requiring a complex XML parser or a complex parser for the tabulated ASCII format used in the UCD, which is overkill for just the few properties that are needed for correct display. So the duplication in each font is not a real problem (note that there won't be a lot of fonts, most often there will be only one that matches the PUA agreement and that is suitable to render the UCS-encoded PUA text).
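A compact range table of the kind described here, just a few integer pairs marking PUA ranges as strong RTL, could be consulted as in the Python sketch below. The range values, the table name `RTL_PUA_RANGES`, and the function `effective_bidi_class` are hypothetical; no font format defines such a table, and a real engine would feed the result into its bidi resolution rather than print it.

```python
import unicodedata
from bisect import bisect_right

# Hypothetical private agreement: sorted, non-overlapping (first, last)
# ranges of PUA code points to be treated as strong RTL.
RTL_PUA_RANGES = [(0xE100, 0xE1FF)]
_STARTS = [first for first, _ in RTL_PUA_RANGES]


def effective_bidi_class(ch: str) -> str:
    """Return 'R' for code points in the override table, else the UCD value."""
    cp = ord(ch)
    i = bisect_right(_STARTS, cp) - 1     # last range starting at or before cp
    if i >= 0 and cp <= RTL_PUA_RANGES[i][1]:
        return "R"
    return unicodedata.bidirectional(ch)  # default property from the standard


for ch in ("\ue0ff", "\ue100", "\ue1ff", "\ue200", "A"):
    print(f"U+{ord(ch):04X}: {effective_bidi_class(ch)}")
```

With a single range the table is two integers, which matches the size argument made above. Doug's objection in the follow-up message, that properties matter for more than display, would apply to this sketch as well.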
Implement BIDI algorithm by line
Hi all, I have a question about implementing the BIDI algorithm. The Bidi algorithm specifies that the embedding levels must be resolved for the whole paragraph before the paragraph is broken into lines. I don't understand why. Could we instead first break the paragraph into lines, remember the paragraph level, and then resolve the embedding levels for each character line by line? If we did it that way, what problems would occur? Thanks a lot!
RE: RTL PUA?
Philippe Verdy verdy underscore p at wanadoo dot fr wrote: As well, the small properties files can be embedded, in a very compact form, in the PUA font. As soon as you embed all the information in the font, you require different solutions for systems that use different font technologies. I was thinking of something more portable. This small table can be limited to just listing the ranges of PUA code points that are strong RTL instead of LTR. Most often, there will be only one range, and this just requires a couple of integers in that embedded table (possibly more, only if you want to represent more properties), without requiring a complex XML parser or a complex parser for the tabulated ASCII format used in the UCD, which is overkill for just the few properties that are needed for correct display. I generally assume there is more to character handling than display. So the duplication in each font is not a real problem (note that there won't be a lot of fonts, most often there will be only one that matches the PUA agreement and that is suitable to render the UCS-encoded PUA text). Depending on how you count, there are already two to four fonts that support Ewellic in the PUA. There are probably many more that support Tengwar or Cirth or Klingon. -- Doug Ewell | Thornton, Colorado, USA | RFC 5645, 4645, UTN #14 www.ewellic.org | www.facebook.com/doug.ewell | @DougEwell
Re: RTL PUA?
On 08/20/2011 10:54 AM, Shriramana Sharma wrote: On 08/19/2011 10:05 PM, Mark Davis ☕ wrote: All of the property assignments to PUA characters (except the GC) are purely informative. I just now noticed that you had excepted the GC in the above. Why is that? How are applications supposed to handle combining marks etc if in the PUA? Mark, can you please reply to the above -- It seems that while it is true that GC=Co should be retained *in the standard* to clearly identify the character as a PUA character, the applications will still be changing that GC to Lo, Mc, Mn, No etc for their internal private-agreement processing. So what is the exact nature of your excepting the GC in your statement above? -- Shriramana Sharma
Re: RTL PUA?
On 08/22/2011 05:20 PM, Shriramana Sharma wrote: Hi Behdad. I only asked whether the OT *tables* would contain the entries in the logical order or the visual order. Clearly it would still be the visual order My mistake: I should have said *logical* order. (but Philippe Verdy seemed to imagine/suggest otherwise). This one is correct w.r.t. what I had *intended* to say above: i.e. Philippe thinks the entries contain the glyphs in *visual* order. See other mail replying to Philippe pointing this out. -- Shriramana Sharma
Re: RTL PUA?
On 08/22/2011 09:00 PM, Philippe Verdy wrote: The font tables themselves contain only ASCII characters I presume. No. The lookup tables contain sequences of numeric glyph ids (16 bit integers in TrueType and OpenType). Which are also not the code point values, and not the character names or glyph names. And numeric glyph IDs are still ASCII aren't they? I was just noting that the glyph tables themselves don't *use* the actual codepoints of the characters getting ligated (while they *refer* to them). Let's say that; - the LAMED character is cmap'ped (by its code point value in an cmap for Unicode, or by its code position in a cmap for another legacy 8-bit encoding) to the glyph id 1012, - and the ALEF character is cmapped to the glyph id 1001 (the values of glyph ids are not important, not even their relative order or differences, they don't need to obey any standard), - and the ALEF-LAMED ligature is in glyph id 1540 (the ALEF-LAMED character of the UCS may also be cmapped separately, but this is not a requirement) Then the lookup to perform the ligature will contain : (1012, 1001) - (1540). No! See Behdad's post -- it is clearly said that the lookup will still be in logical order (1001, 1012) - (1540) and not in visual order as you say. See? This is what I meant in the other mail by you suggesting that the tables containing the characters in visual order and not in logical order, to which you replied (without much real explanation I'm afraid): quoteNo ! I've not imagined that. You incorrectly reinterpret imaginatively another incorrect imaginative reinterpretation, made by someone else, of what I wrote, which did not even suggest that./quote Glyph id's are presented and scanned in the lookup table, in sequences preordered in visual order by the text layout/shaping engine. Nope -- they are placed in the lookup table in *logical* order. IIUC the entire sequence of glyphs is only reordered from RTL at the very end. Peter or Behdad, can you corroborate this? 
-- Shriramana Sharma
Re: RTL PUA?
On 08/22/2011 09:31 PM, Doug Ewell wrote: Philippe Verdy verdy underscore p at wanadoo dot fr wrote: As well, the small properties files can be embedded, in a very compact form, in the PUA font. As soon as you embed all the information in the font, you require different solutions for systems that use different font technologies. Why? In the end all the systems rely upon the character properties specified by the standard. For the PUA characters in question, what is needed is a table of properties to override the default ones. The systems would then handle those new properties in the same way that they would handle the regular ones. Granted, if the renderers hardcode the properties (as most OT ones do) then some parsing is required to import all the override data provided by the extra font table into a struct or such -- after which (I presume) it would be possible (to a large extent?) to treat it the same as an encoded script. [Actually, this seems quite difficult to implement in OT, where the philosophy is to explicitly hardcode the properties, but Graphite and AAT should be fine I guess.] I generally assume there is more to character handling than display. True -- so if someone wanted a PUA script to be handled properly in sorting etc one would have to prepare collation tables which would obviously go *outside* the font. -- Shriramana Sharma
Re: Feedback from C1 Control Pictures Proposal
I would like to ask Frank for a bit of help here (and, to the extent that Ken thinks that the proposal is reasonable, some affirmation that the uses/demonstration of demand will be seen as acceptable to the Unicode people). Specifically, can Frank help identify, and possibly provide screenshots, of: - C0 control pictures in use - C1 control pictures in use Maybe only an older person would understand this point, but to emulate a particular terminal, you have to make the emulator show on the screen what the real terminal would show. Since I have been laid off and have to clean out my office this week, I don't have time to re-do the research, but many terminals -- my vast collection of terminal manuals has been boxed for shipment to the Computer History Museum: http://www.columbia.edu/cu/computinghistory/books/#terminalmanuals ...have glyphs for C0 controls and some have them for C1 controls. Here's the exhibit I prepared for my proposal in 1998: ftp://kermit.columbia.edu/kermit/ucsterminal/terminal-exhibits.pdf Here again, for reference, is the proposal itself (only the C1 part is relevant to this discussion): ftp://kermit.columbia.edu/kermit/ucsterminal/control.txt The exhibit shows: Terminals that have C0 control glyphs: DEC VT320, 420, etc Data General Dasher HP-2621 Wyse 60 Wyse 370 Atlantic Research Corporation Interview 30A Data Analyzer (exhibit N1) Terminals that have C1 control glyphs: DEC VT320, 420, etc (full set) Data General Dasher (partial set) Siemens-Nixdorf 97801 (as hex byte pictures 80, 81, etc) Wyse 370 (full set) This is not an exhaustive survey, more of a proof by existence. * Unfortunately, I don't actually know of any applications, other than Penango (my company's primary product), which currently use the U+2400 range. [That is what kicked off this proposal, by the way.] I don't have information about what applications use them. Our own terminal emulator, Kermit 95: http://www.columbia.edu/kermit/k95.html does not. 
That's because it was designed to be portable between Windows console screens and GUI screens, and no Windows console font contained control pictures. Instead, when we put the emulator into debugging mode, color is used. Obviously, that's not plain text, but this way it shows control characters in a single cell. By now, Kermit 95 would indeed use control pictures in its GUI version, except that the programmers aren't here any more, and except that C1 control pictures are not defined yet. By the way, the cancellation of the Kermit Project is not an end but a new beginning, because now the source code for Kermit 95 has been published with an Open Source license: http://www.columbia.edu/kermit/k95sourcecode.html So here is why I believe it is important to have C1 control character glyphs available in Unicode: . Terminal emulation is still important. For example, everybody who uses the Unix shell is doing so through a terminal emulator. And here, as we all know, is where the real work gets done -- coding, website creation and maintenance, system administration, network configuration, etc etc. . Since the Unix shell and other text-only online environments exist outside the English-speaking world too, terminal emulators are being updated to support UTF-8. Kermit 95 has supported it since about 2002. The Linux console window (which is a terminal emulator) uses UTF-8 *by default*. . The terminals that are emulated were manufactured before 1995, and therefore mostly follow the ANSI X3.64 definition, which reserves both C0 and C1 for control characters, as Unicode itself has done. . But Microsoft has created code pages that are identical to ISO standard character sets such as ISO-8859-x (which are compatible with ANSI X3.64), but with graphic characters in the C1 area. These have leaked into every part of the Internet, including text that we view in a terminal screen (e.g. email). . 
When a real terminal, or a program that emulates one, receives text written in, say, Microsoft code page 1252, it invariably hangs. Why? Because the text contains smart quotes or some such, which coincide with valid C1 commands understood by the terminal. Some of these, such as ISO 6429 DCS, OSC, or APC, are the header of a packet of control information. The terminal waits for the end-of-packet for the control sequence, as it must do, but it never comes. Those who support terminal emulators need tools to diagnose problems like this. The best and most portable tool is to put the terminal into display controls mode. This is a feature that the above mentioned terminals had. A Unicode-based terminal emulator has glyphs to show the C0 controls but not the C1 controls, which can be even more lethal than the C0 ones when used improperly, as they are in Windows code pages. Note that tech support is done not only on the scene, but remotely. Support technicians
Re: Code pages and Unicode
On Monday 22 August 2011, Andrew West andrewcw...@gmail.com wrote: Can anyone think of a way to extend UTF-16 without adding new surrogates or inventing a new general category? Andrew How about a triple sequence of two high surrogates followed by one low surrogate? I suggest this as a solution to the problem that is posed by Andrew as I feel that it would be interesting to know if that would be possible or whether it would be forbidden due to an existing policy that has already been guaranteed to be unchangeable. William Overington 22 August 2011
RE: RTL PUA?
Shriramana Sharma samjnaa at gmail dot com wrote: As soon as you embed all the information in the font, you require different solutions for systems that use different font technologies. Why? In the end all the systems base upon the character properties specified by the standard. For the PUA characters in question, what is needed for a table of properties to override the default ones. The systems would then handle those new properties in the same way that they would handle the regular ones. Right, so if you embed that table in an OT font, the information is not available to a system that uses a font technology other than OT. What is needed is a way to specify the properties in a platform-independent way, where platform means not only OS but also font technology. -- Doug Ewell | Thornton, Colorado, USA | RFC 5645, 4645, UTN #14 www.ewellic.org | www.facebook.com/doug.ewell | @DougEwell
Re: RTL PUA?
On Mon, Aug 22, 2011 at 07:51:22AM -0700, Doug Ewell wrote: Some PUA properties, like glyph shapes and maybe directionality, can be stored in a font. Others, like numeric values and casing, might not or cannot. An interchangeable format needs to be agreed upon for the Why not? P.T. -- Petr Tomasek http://www.etf.cuni.cz/~tomasek Jabber: but...@jabbim.cz EA 355:001 DU DU DU DU EA 355:002 TU TU TU TU EA 355:003 NU NU NU NU NU NU NU EA 355:004 NA NA NA NA NA
Re: RTL PUA?
On 08/22/2011 10:12 PM, Doug Ewell wrote: Right, so if you embed that table in an OT font, the information is not available to a system that uses a font technology other than OT. I don't understand why you would say so -- assuming we are all talking about TrueType fonts, AAT just uses some tables, OT others and Graphite still others. They are all just tables appended to the TrueType font data. Any software that is able to read TT font data can also read the tables. So what's the problem? -- Shriramana Sharma
Re: RTL PUA?
Shriramana Sharma wrote: The font tables themselves contain only ASCII characters I presume. OpenType Layout tables use Glyph IDs. OTL development tools typically use glyph names, which may be particular to the tool or the same names used in the post or CFF tables. OTL tables work on glyphs, not characters, and bidi will have been resolved prior to application of OTL substitution and positioning. Input glyph strings for substitution lookups are always in the resolved direction of the glyph run, so Arabic and Hebrew alphabetic runs are processed right-to-left, i.e. alef lamed -> alef_lamed *not* lamed alef -> alef_lamed Similarly, context strings for glyph positioning (if present) will be right-to-left, although anchor attachment positions on individual glyphs are relative to the 0,0 coordinate, i.e. the left sidebearing. JH -- Tiro Typeworks www.tiro.com Gulf Islands, BC t...@tiro.com The criminologist's definition of 'public order crimes' comes perilously close to the historian's description of 'working-class leisure-time activity.' - Sidney Harring, _Policing a Class Society_
RE: RTL PUA?
Petr Tomasek tomasek at etf dot cuni dot cz wrote: Some PUA properties, like glyph shapes and maybe directionality, can be stored in a font. Others, like numeric values and casing, might not or cannot. An interchangeable format needs to be agreed upon for Why not? Where does one store numeric values in a font? Maybe this should be taken off-list. -- Doug Ewell | Thornton, Colorado, USA | RFC 5645, 4645, UTN #14 www.ewellic.org | www.facebook.com/doug.ewell | @DougEwell
Re: RTL PUA?
Shriramana Sharma wrote: I was just noting that the glyph tables themselves don't *use* the actual codepoints of the characters getting ligated (while they *refer* to them). Characters are mapped to glyph IDs in the font cmap tables. Glyph IDs are mapped to other glyph IDs (one-to-one, one-to-many, many-to-one, or one-to-one-of-many) in the layout GSUB table. No! See Behdad's post -- it is clearly said that the lookup will still be in logical order (1001, 1012) -> (1540) and not in visual order as you say. I think there may be some confusion in this discussion over what constitutes 'visual order'. I try to avoid the term because it is difficult for right-to-left readers to accustom themselves to thinking of visual order as anything other than right-to-left. I prefer the term 'reading order' or 'resolved order', i.e. resolved bidi and script shaping order, which may have involved integrated reordering (reordering within the glyph processing) as in the case of Indic scripts. Nope -- they are placed in the lookup table in *logical* order. IIUC the entire sequence of glyphs is only reordered from RTL at the very end. Peter or Behdad, can you corroborate this? Glyph ID inputs for OTL processing are according to reading/resolved order. This is typically the same as logical order, but the term logical order really applies to character strings, not glyph strings, which are much more malleable. The order of input strings in GSUB lookups or contexts is dependent not only on the underlying character order, but also on the results of previous GSUB lookups. So while, unlike AAT and Graphite, OpenType Layout doesn't explicitly provide for glyph re-ordering, some kinds of glyph reordering are possible using sequences of contextual lookups to duplicate a glyph in a second location in the string and then remove the first instance. 
We use this in some Devanagari fonts to enable subsequent ligation of short ikar variants to the left of a consonant base with reph marks to the right of that base. JH -- Tiro Typeworks www.tiro.com Gulf Islands, BC t...@tiro.com The criminologist's definition of 'public order crimes' comes perilously close to the historian's description of 'working-class leisure-time activity.' - Sidney Harring, _Policing a Class Society_
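The two mappings described in this message (characters to glyph IDs via cmap, then glyph IDs to other glyph IDs via GSUB) can be sketched as a toy model in Python. The glyph IDs 1001, 1012 and 1540 are the arbitrary ones used in the thread; the dictionaries and function names are assumptions for illustration only and bear no resemblance to the real binary table formats, which use coverage tables, lookup lists and features.

```python
# Toy cmap: character -> glyph ID (IDs are the arbitrary ones from the thread).
CMAP = {
    "\u05D0": 1001,  # HEBREW LETTER ALEF
    "\u05DC": 1012,  # HEBREW LETTER LAMED
}

# Toy many-to-one substitution: input glyph sequence -> ligature glyph ID.
LIGATURES = {(1001, 1012): 1540}

def map_to_glyphs(text):
    # Step 1: characters become glyph IDs; code points are not used afterwards.
    return [CMAP[ch] for ch in text]

def apply_ligatures(glyphs):
    # Step 2: scan the glyph run in the order the shaping engine presents it
    # and replace any matching pair with its ligature glyph.
    out = []
    i = 0
    while i < len(glyphs):
        pair = tuple(glyphs[i:i + 2])
        if pair in LIGATURES:
            out.append(LIGATURES[pair])
            i += 2
        else:
            out.append(glyphs[i])
            i += 1
    return out

glyphs = map_to_glyphs("\u05D0\u05DC")   # alef followed by lamed
shaped = apply_ligatures(glyphs)
```

In this toy model the lookup only fires when the run arrives as alef then lamed; the same two glyphs presented as lamed then alef are left alone, which is why the order in which the engine feeds glyphs to the lookup matters to the discussion.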
Re: Code pages and Unicode
On 22/08/11 16:55, Doug Ewell wrote: srivas sinnathurai sisrivas at blueyonder dot co dot uk wrote: The true lifting of UTF-16 would be to UTF-32. Leave the UTF-16 untouched and make the new half as versatile as possible. I think any other solution is just a patch-up for the time being. There is no evidence whatsoever that this is a problem that needs to be solved, not in 700 or 800 years, not ever. Ken's words are again being ignored. -- Doug Ewell | Thornton, Colorado, USA | RFC 5645, 4645, UTN #14 www.ewellic.org | www.facebook.com/doug.ewell | @DougEwell I see at least one reason to extend the present 17-plane Unicode space: that would provide space for an RTL PUA. ☺ Presently, UTF-16 uses surrogate pairs to address non-BMP characters: HS LS (High Surrogate followed by Low Surrogate). What would happen if we nested them? Would HS1 HS2 LS1 LS2 be acceptable to address more characters?
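For reference, the existing HS LS scheme works as in the following Python sketch; the function names are my own. It shows why one pair addresses exactly 1024 x 1024 = 1,048,576 supplementary code points, i.e. planes 1 through 16, and says nothing about whether any extended sequence would be acceptable.

```python
def to_surrogates(cp):
    # Only supplementary code points (planes 1..16) are encoded as pairs.
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = cp - 0x10000                # 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits -> high surrogate
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits -> low surrogate
    return high, low

def from_surrogates(high, low):
    # Inverse mapping: recombine the two 10-bit halves.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

# 1024 high surrogates x 1024 low surrogates = 16 planes of 65,536.
SUPPLEMENTARY_CODE_POINTS = 1024 * 1024
```

Because every high and low surrogate value is already used up by this mapping, any scheme such as HS1 HS2 LS1 LS2 would have to give new meaning to sequences that current UTF-16 decoders treat as ill-formed.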
Re: RTL PUA?
On Monday 22 August 2011, Philippe Verdy verd...@wanadoo.fr wrote: So there are only two options: [snipped] ... : this requires an approval either by the UTC WG2 (solution 1) or by the OpenType working group (solution 2). Would a third option work? In the Description section of the Macintosh Roman section of a TrueType font, include a line of text in a plain text format of which the following line of text is an example. PUA.RTL=$E000-$E1FF,$E440-$E447,$E541,$E549,$E57C,$EA00-$EA0F,$EC07; One could specify precisely which Private Use Area characters were to become RTL when using that particular font. One would need rendering software that looked for such a string of text in the font file, yet, as far as I am aware, no approval from any committee in order to put this solution into practical use. William Overington 22 August 2011
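As a rough illustration of what the "rendering software that looked for such a string" would have to do, here is a small Python sketch that extracts the ranges from a description string in the format shown above. The function name is hypothetical, and how the description text is read out of the font file is deliberately left out.

```python
def parse_pua_rtl(description):
    # Look for the marker proposed in the message; return [] if it is absent.
    prefix = "PUA.RTL="
    start = description.find(prefix)
    if start < 0:
        return []
    end = description.find(";", start)
    if end < 0:
        return []
    ranges = []
    for item in description[start + len(prefix):end].split(","):
        # Each item is either "$XXXX" or "$XXXX-$YYYY" (hexadecimal).
        first, _, last = item.strip().partition("-")
        lo = int(first.lstrip("$"), 16)
        hi = int(last.lstrip("$"), 16) if last else lo
        ranges.append((lo, hi))
    return ranges

example = "PUA.RTL=$E000-$E1FF,$E440-$E447,$E541,$E549,$E57C,$EA00-$EA0F,$EC07;"
rtl_ranges = parse_pua_rtl(example)
```

A single code point is stored as a range whose first and last values are equal, so the caller only ever has to test `lo <= code_point <= hi`.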
Re: Code pages and Unicode
On 20/08/11 02:03, Ken Whistler wrote: O.k., so apparently we have awhile to go before we have to start worrying about the Y2K or IPv4 problem for Unicode. Call me again in the year 2851, and we'll still have 5 years left to design a new scheme and plan for the transition. ;-) --Ken I wonder whether you aren’t a little too optimistic. Have you considered the unencoded ideographic scripts? 1,071 hieroglyphs have already been encoded. I think there are approximately 4,000 more to encode. 1,165 Yi syllables and 55 Yi radicals have been encoded. But they only support one dialect of Yi and I read there are tens of thousands of Yi ideographs and that a proposal to encode 88,613 classical Yi characters was made 4 years ago. The threshold of 200,000 characters doesn’t seem very far.
Re: RTL PUA?
Doug Ewell 於 2011年8月22日 上午10:59 寫道: Petr Tomasek tomasek at etf dot cuni dot cz wrote: Some PUA properties, like glyph shapes and maybe directionality, can be stored in a font. Others, like numeric values and casing, might not or cannot. An interchangeable format needs to be agreed upon for Why not? Where does one store numeric values in a font? Maybe this should be taken off-list. This is actually a relevant point. The major TrueType variants all work primarily with glyphs, not characters. Using them as a place to store information about the *characters* in the text is therefore not a reliable way to provide an override for default system behavior. By the time the rendering engine consults the fonts for layout specifics, large chunks of the text processing will already be completed. OpenType, for example, expects that the bidi algorithm is largely run in character space, not glyph space, and therefore without regard for the specific font involved. (AAT does almost everything in glyph space, including bidi. I'm not sure about Graphite.) The net result is that a font is an unreliable way of storing character-specific information useful on multiple platforms. This is one reason why embedding the existing directionality controls within the text itself is currently the most reliable way of getting the behavior one might want in a platform-agnostic way. = Siôn ap-Rhisiart John H. Jenkins jenk...@apple.com
Re: RTL PUA?
True -- so if someone wanted a PUA script to be handled properly in sorting etc one would have to prepare collation tables which would obviously go *outside* the font. If a proper definition of an unencoded script needs additional properties which cannot be stored in the font anyway, why would you want to store part of it in OT tables? It’s just not the right place. Fonts’ sole purpose is to display already defined characters, not to define them. Tails shouldn’t be made wagging dogs. Á
Re: RTL PUA?
On 08/22/2011 10:55 PM, Joó Ádám wrote: If a proper definition of an unencoded script needs additional properties which cannot be stored in the font anyway, why would you want to store part of it in OT tables? It’s just not the right place. Fonts’ sole purpose is to display already defined characters, not to define them. Tails shouldn’t be made wagging dogs. True, but we are only trying to help those who find themselves unable to even *display* PUA characters as RTL (or as Indic with reordering, which can be handled by IndicMatraCategory). Since collation never cares about whether the script is LTR or RTL or Indic (with the exception of Thai etc where the encoding is as per visual order and not logical order) the collation data can be outside the font, since it is not needed for display. -- Shriramana Sharma
Re: RTL PUA?
William_J_G Overington 於 2011年8月22日 上午10:49 寫道: In the Description section of the Macintosh Roman section of a TrueType font, include a line of text in a plain text format of which the following line of text is an example. PUA.RTL=$E000-$E1FF,$E440-$E447,$E541,$E549,$E57C,$EA00-$EA0F,$EC07; Forgive my asking, but this reference to the description section of the Macintosh Roman section of a TrueType font has me puzzled, because I don't know what you're talking about. What table contains this string? = 井作恆 John H. Jenkins jenk...@apple.com
Re: RTL PUA?
On Monday 22 August 2011, John H. Jenkins jenk...@apple.com wrote: Forgive my asking, but this reference to the description section of the Macintosh Roman section of a TrueType font has me puzzled, because I don't know what you're talking about. What table contains this string? When I use FontCreator, made by High-Logic, http://www.high-logic.com is the webspace: with a font file open, I can select Format from the menu bar and then select Naming... from the drop down menu. That leads to a dialogue panel. From that dialogue panel one may select, for an ordinary, basic Unicode font, either of two platforms, namely Macintosh Roman and Microsoft Unicode BMP only. Having selected a platform, one may view the text content of various fields for that platform, such as font family name and copyright notice, version string and postscript name. There is then a button that is labelled Advanced... that, if clicked, opens another dialogue panel with various other text fields, including Font Designer and Description, which are the two that I often use. Now, when the text values in the fields are stored in the font file, the values for the Macintosh Roman platform are stored in plain text and the values for the Microsoft Unicode BMP only platform are stored in some encoded format. So, if one opens a TrueType font file in WordPad and one searches for an item of plain text that is in one of the fields of the font, then the text that is in the Macintosh platform can be found, yet the text that is in the Microsoft Unicode BMP only platform cannot be found. So, I thought that if a manufacturer of a wordprocessing application or a desktop publishing application decided to make a special researcher's edition of the software, then that software could, when a font is selected, first scan the font for a PUA.RTL string and, if one is found, override the left-to-right nature of the identified characters to be a right-to-left nature, just while that font is selected. 
Whether such a software package ever becomes available is something that only time will tell, yet it seems to me that it is a method that could be used without needing any changes by any committee. William Overington 22 August 2011
Re: RTL PUA?
2011/8/22 Doug Ewell d...@ewellic.org: Depending on how you count, there are already two to four fonts that support Ewellic in the PUA. There are probably many more that support Tengwar or Cirth or Klingon. First, these fonts can work fine with the default LTR directionality. So there's no need for additional data for them. Second, even if they were RTL, the needed info for each of these fonts, embedded in them, would be extremely small, reduced to just specifying the range of RTL characters they need to contain. So I don't see that as a problem. Those fonts do exist and are used exactly because there was no problem for rendering them with texts encoded in logical order (the same as the visual order). It's still strange that we can have several fonts for esoteric scripts that have been used effectively by very few people, when there are centuries of traditions, and many interested users (but spread in very small communities worldwide) that cannot use computer technologies to render their favorite scripts, or that want to teach them, or make books and other publications to expose them, as an important human cultural heritage, even if this was only to translate them or transcribe them in a more modern script.
Re: Implement BIDI algorithm by line
Huh? What context is this in? On 8/22/2011 11:18 AM, CE Whitehead wrote: Hi. I think many line breaks within paragraphs are soft line breaks, but embedding levels have to be taken into account when deciding the width of the glyphs; that's as near as I can tell. Here is the description of the algorithm -- is this what you have read? http://unicode.org/reports/tr9/ Some rules are in fact applied after the line wrapping (after the soft breaks) -- The following rules describe the logical process of finding the correct display order. As opposed to resolution phases, these rules act on a per-line basis and are applied *after* any line wrapping is applied to the paragraph. Logically there are the following steps:
* The levels of the text are determined according to the previous rules.
* The characters are shaped into glyphs according to their context (taking the embedding levels into account for mirroring).
* The accumulated widths of those glyphs (in logical order) are used to determine line breaks.
* For each line, rules L1 (http://unicode.org/reports/tr9/#L1) through L4 (http://unicode.org/reports/tr9/#L4) are used to reorder the characters on that line.
(I'd have to reread the whole document on line breaking and then on bidi to answer this truly; sorry; hope this helps anyway) --C. E. Whitehead cewcat...@hotmail.com
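The last step in that list, per-line reordering, can be sketched in Python as follows. This is only rule L2 of UAX #9 (reverse every run at or above each level, from the highest level down to the lowest odd level); rules L1, L3 and L4 are omitted, and the levels are assumed to have been resolved already for the whole paragraph, before the line was cut out of it. The function name is my own.

```python
def reorder_line(chars, levels):
    # chars: the characters of ONE line in logical order.
    # levels: their embedding levels, resolved earlier for the whole paragraph.
    items = list(zip(chars, levels))
    odd = [lv for lv in levels if lv % 2 == 1]
    if not odd:
        return [ch for ch, _ in items]   # nothing right-to-left on this line
    # Rule L2: from the highest level down to the lowest odd level,
    # reverse each maximal run of characters at that level or higher.
    for level in range(max(levels), min(odd) - 1, -1):
        i = 0
        while i < len(items):
            if items[i][1] >= level:
                j = i
                while j < len(items) and items[j][1] >= level:
                    j += 1
                items[i:j] = reversed(items[i:j])
                i = j
            else:
                i += 1
    return [ch for ch, _ in items]

# Upper case stands in for right-to-left letters, as in UAX #9's examples.
simple = "".join(reorder_line("car CAR", [0, 0, 0, 0, 1, 1, 1]))
nested = "".join(reorder_line("abCD12EF", [0, 0, 1, 1, 2, 2, 1, 1]))
```

The function takes levels as an input rather than computing them, which mirrors the split in the quoted steps: level resolution sees the whole paragraph, while reordering sees one line at a time.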
Re: RTL PUA?
2011/8/22 Shriramana Sharma samj...@gmail.com: On 08/22/2011 09:00 PM, Philippe Verdy wrote: The font tables themselves contain only ASCII characters I presume. No. The lookup tables contain sequences of numeric glyph ids (16 bit integers in TrueType and OpenType). Which are also not the code point values, and not the character names or glyph names. And numeric glyph IDs are still ASCII aren't they? I was just noting that the glyph tables themselves don't *use* the actual codepoints of the characters getting ligated (while they *refer* to them). Let's say that; - the LAMED character is cmap'ped (by its code point value in an cmap for Unicode, or by its code position in a cmap for another legacy 8-bit encoding) to the glyph id 1012, - and the ALEF character is cmapped to the glyph id 1001 (the values of glyph ids are not important, not even their relative order or differences, they don't need to obey any standard), - and the ALEF-LAMED ligature is in glyph id 1540 (the ALEF-LAMED character of the UCS may also be cmapped separately, but this is not a requirement) Then the lookup to perform the ligature will contain : (1012, 1001) - (1540). No! See Behdad's post -- it is clearly said that the lookup will still be in logical order (1001, 1012) - (1540) and not in visual order as you say. See? This is what I meant in the other mail by you suggesting that the tables containing the characters in visual order and not in logical order, to which you replied (without much real explanation I'm afraid): quoteNo ! I've not imagined that. You incorrectly reinterpret imaginatively another incorrect imaginative reinterpretation, made by someone else, of what I wrote, which did not even suggest that./quote Glyph id's are presented and scanned in the lookup table, in sequences preordered in visual order by the text layout/shaping engine. Nope -- they are placed in the lookup table in *logical* order. IIUC the entire sequence of glyphs is only reordered from RTL at the very end. 
Peter or Behdad, can you corroborate this? Hmmm... this is not very clear then in the OpenType specification. Well, it does not matter which order is physically used in the stored table as long as it is consistent. But this confirms that the OpenType rendering algorithm, the way it is presented in the OpenType specification, is completely wrong: the Bidi algorithm is definitely not the first step needed before performing glyph substitutions. However the Bidi algorithm really needs to reorder the glyphs at least relatively, for correct application of GPOS (glyph positioning). As a consequence, the font to use will be completely known (all cmap'pings will have been applied already, and no glyph substitution can occur across distinct fonts that have independent glyph ids). As such the PUA agreement implied by the PUA font would have been asserted. Nothing then forbids using the font as THE reliable source of information about which PUAs are RTL and which ones are LTR. The computing order of features should not then be:
- BiDi algorithm for reordering grapheme clusters
- font search and font fallback (using cmap)
- GSUB (lookups of ligatures or discretionary glyph variants)
- GPOS
but really:
- font lookup and font fallback (using cmap)
- GSUB (lookups of ligatures or discretionary glyph variants)
- BiDi algorithm for reordering glyphs representing the grapheme clusters or ligatured grapheme clusters
- GPOS
The BiDi algorithm absolutely does not have to be changed. This time there's absolutely no PUA with unknown directionality if the font defines the RTL property for these PUA (using the normative LTR only as a default when the font does not specify it)
Re: RTL PUA?
2011/8/22 Shriramana Sharma samj...@gmail.com: True -- so if someone wanted a PUA script to be handled properly in sorting etc one would have to prepare collation tables which would obviously go *outside* the font. Collation tables can already be tailored very easily with existing technologies. And anyway this has nothing to do with the directionality of characters, or their rendering, on which collation absolutely does not depend. Tailored collations already have a working standard and syntax in the CLDR project or ICU and in a few other libraries (notably in CPAN for Perl).
Re: Code pages and Unicode
On 8/22/2011 9:58 AM, Jean-François Colson wrote: I wonder whether you aren’t a little too optimistic. No. If anything I'm assuming that the folks working on proposals will be amazingly assiduous during the next decade. Have you considered the unencoded ideographic scripts? Why, yes I have. 1,071 hieroglyphs have already been encoded. I think there are approximately 4,000 more to encode. A preliminary listing of 4548 additional hieroglyphs, based on Hieroglyphica (1993), was presented to WG2 in 1999. Twelve years have passed, and no additional document has been forthcoming to work through the issues in standardizing such a list as characters. I won't hold my breath, but somebody *might* get through that work by 2021. 1,165 Yi syllables and 55 Yi radicals have been encoded. But they only support one dialect of Yi and I read there are tens of thousands of Yi ideographs and that a proposal to encode 88,613 classical Yi characters was made 4 years ago. 88,613 classical Yi *glyphs*. This is just a collection of every glyph form noted from wherever. Even the proponents acknowledged that it was more on the order of maybe 7000 *characters* involved. They got feedback to do the homework to work through the character/glyph model for classical Yi, and come back when they have a documented, reliable listing of the Yi *characters* that need encoding, together with the list of variants for each character. Given the nature and scope of the work, and no (current) indication of the progress being made, this also *might* get done by 2021. The threshold of 200,000 characters doesn’t seem very far. Nah. It is still way over the extended horizon. The only big historic ideographic script that is close to being done is Tangut, and the wrangling even over that one has gone on for years now. --Ken
Re: RTL PUA?
On 22 August 2011 at 12:36 PM, William_J_G Overington wrote: On Monday 22 August 2011, John H. Jenkins jenk...@apple.com wrote: Forgive my asking, but this reference to the description section of the Macintosh Roman section of a TrueType font has me puzzled, because I don't know what you're talking about. What table contains this string? When I use FontCreator, made by High-Logic, http://www.high-logic.com is the webspace: with a font file open, I can select Format from the menu bar and then select Naming... from the drop down menu. That leads to a dialogue panel. From that dialogue panel one may select, for an ordinary, basic Unicode font, either of two platforms, namely Macintosh Roman and Microsoft Unicode BMP only. Having selected a platform, one may view the text content of various fields for that platform, such as font family name and copyright notice, version string and postscript name. There is then a button that is labelled Advanced... that, if clicked, opens another dialogue panel with various other text fields, including Font Designer and Description, which are the two that I often use. Now, when the text values in the fields are stored in the font file, the values for the Macintosh Roman platform are stored in plain text and the values for the Microsoft Unicode BMP only platform are stored in some encoded format. So, if one opens a TrueType font file in WordPad and one searches for an item of plain text that is in one of the fields of the font, then the text that is in the Macintosh platform can be found, yet the text that is in the Microsoft Unicode BMP only platform cannot be found.
So, I thought that if a manufacturer of a wordprocessing application or a desktop publishing application decided to make a special researcher's edition of the software, then that software could, when a font is selected, first scan the font for a PUA.RTL string and, if one is found, override the left-to-right nature of the identified characters to be a right-to-left nature, just while that font is selected. Whether such a software package ever becomes available is something that only time will tell, yet it seems to me that it is a method that could be used without needing any changes by any committee. Ah. You're referring to an entry in the 'name' table, then. The intention of the 'name' table is to provide localizable strings for the UI. Using it to store data of any sort for the rendering engine would be very, very inappropriate. In general, one should not be using a text editor to examine the contents of a TrueType font. It would be like using a text editor to examine the contents of an application. Even if you see some plain text, you really don't have any sense for how it's actually being used. You may want to bone up on the structure of TrueType/OpenType fonts. = John H. Jenkins 井作恆 Жбь А. ЖЩэпЮьц jenk...@apple.com
RE: RTL PUA?
There is more to displaying characters than LTR versus RTL, and there is more to handling characters than just displaying them. This point continues to be lost on several people responding to this thread. -- Doug Ewell | Thornton, Colorado, USA | RFC 5645, 4645, UTN #14 www.ewellic.org | www.facebook.com/doug.ewell | @DougEwell
RE: RTL PUA?
Philippe Verdy verdy underscore p at wanadoo dot fr wrote: Depending on how you count, there are already two to four fonts that support Ewellic in the PUA. There are probably many more that support Tengwar or Cirth or Klingon. First, these fonts can work fine with the default LTR directionality. So there's no need for additional data for them. Second, even if they were RTL, the needed info for each of these fonts, embedded in them would be extremely small, reduced to just specifying the range of RTL characters they need to contain. This isn't my point. Multiple fonts can exist for PUA scripts and the user should not have to be constrained to using just the one font which happens to contain property information, because someone decided properties should be stored in the font. So I don't see that as a problem. Those fonts do exist and are used exactly because there was no problem for rendering them with texts encoded in logical order (the same as the visual order). Not my point. It's still strange that we can have several fonts for esoteric fonts that have been used effectively by very few people, when there are centuries of traditions, and many interested users (but spread in very small communities worldwide) that cannot use computer technologies to render their favorite scripts, or that want to teach them, or make books and other publications to expose them, as an important humane cultural heritage, even if this was only to translate them or transcribe them in a more modern script. One person added Ewellic to his shareware font as an experiment, and I paid another person to do a font for me. Sorry if this was culturally insensitive. -- Doug Ewell | Thornton, Colorado, USA | RFC 5645, 4645, UTN #14 www.ewellic.org | www.facebook.com/doug.ewell | @DougEwell
RE: RTL PUA?
Shriramana Sharma samjnaa at gmail dot com wrote: Right, so if you embed that table in an OT font, the information is not available to a system that uses a font technology other than OT. I don't understand why you would say so -- assuming we are all talking about TrueType fonts, AAT just uses some tables, OT others and Graphite still others. They are all just tables appended to the TrueType font data. Any software that is able to read TT font data can also read the tables. So what's the problem? OK, so it's obvious by now I'm not a font guy. But I still maintain that there's more to proper handling of Unicode characters, PUA or otherwise, than whether their directionality is LTR or Arabic-RTL or non-Arabic-RTL or what have you. That's why all those other properties exist. And I maintain that PUA users need a place to store those other properties, and that the font doesn't seem like the right place for non-display properties. -- Doug Ewell | Thornton, Colorado, USA | RFC 5645, 4645, UTN #14 www.ewellic.org | www.facebook.com/doug.ewell | @DougEwell
Re: RTL PUA?
2011/8/22 William_J_G Overington wjgo_10...@btinternet.com: Having selected a platform, one may view the text content of various fields for that platform, such as font family name and copyright notice, version string and postscript name. There is then a button that is labelled Advanced... that, if clicked, opens another dialogue panel with various other text fields, including Font Designer and Description, which are the two that I often use. Now, when the text values in the fields are stored in the font file, the values for the Macintosh Roman platform are stored in plain text and the values for the Microsoft Unicode BMP only platform are stored in some encoded format. Note "some encoded format". The strings are encoded using the encoding specified in the platform selectors. The strings for the Macintosh Roman platform will be encoded using MacRoman. The strings for the MS Unicode BMP platform will be encoded with the BMP part of UTF-16 (without support for surrogates). The strings for the Unicode platform will use the UTF-32 encoding. So, if one opens a TrueType font file in WordPad and one searches for an item of plain text that is in one of the fields of the font, then the text that is in the Macintosh platform can be found: It just happens that you are opening the TrueType font as if it were plain text encoded with Windows-1252, or some other 8-bit encoding based on ASCII. You are also searching for ASCII characters that are encoded identically in Windows-1252 and in the MacRoman encoding, so you find a match. yet the text that is in the Microsoft Unicode BMP only platform cannot be found. Because you would have to insert null bytes in your search strings to find an exact match in a UTF-16 encoded string. Without these nulls, you'll get no match. What you are doing is a search in a text loaded after assuming the wrong encoding.
TrueType fonts are binary containers that can mix several encodings for their plain-text elements, but that also embed much other non-text data. This happens even if your text editor is capable of loading Unicode-encoded texts (this fails here if you try to load the file as UTF-16, because the TTF container as a whole cannot meet the conformance requirements for correctly encoded UTF-16 text; only fragments of it can). Conversely, there's no conformance problem if you try to read the file as if it was Windows-1252 or ISO-8859-1...
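[Editorial sketch] The byte-level reason the WordPad search behaves this way can be shown in a few lines. This is a minimal illustration, assuming a made-up byte blob in place of a real font file (the surrounding bytes are invented, and a real 'name' table has its own record structure); the search string "PUA.RTL" is the marker proposed earlier in the thread.

```python
needle = "PUA.RTL"

# The same string as it would be stored for an 8-bit platform encoding
# and for a 16-bit big-endian Unicode encoding.
mac_bytes = needle.encode("mac_roman")
utf16_bytes = needle.encode("utf-16-be")

# Invented stand-in for a binary font file holding both copies.
blob = b"\x00\x01" + mac_bytes + b"\xff\xfe\x00" + utf16_bytes + b"\x00\x03"

# An editor that treats the file as 8-bit text searches for these bytes.
ascii_hits = blob.count(needle.encode("ascii"))

# Searching with the interleaved zero bytes finds the UTF-16 copy.
utf16_hits = blob.count(utf16_bytes)

print(mac_bytes)    # 7 bytes, identical to the ASCII form
print(utf16_bytes)  # 14 bytes, a zero byte before each ASCII letter
print(ascii_hits, utf16_hits)
```

The 8-bit search matches only the MacRoman copy, because every character of the UTF-16 copy is preceded by a zero byte.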
ALM (was: Re: RTL PUA?)
On 8/21/2011 3:31 PM, Richard Wordingham wrote: I expect ARABIC LANGUAGE MARK would not go down well - has it already been proposed and rejected? ARABIC *LETTER* MARK, not *LANGUAGE* mark. (And suggested to just be renamed to AL MARK.) Proposed? Yes. Discussed? Yes. Rejected? No. The last UTC meeting reached a consensus to issue a public review issue on the proposed ALM and ELM (embedding level mark) characters. So there will be further discussion and chance for input. Nothing has been decided yet. --Ken
Re: RTL PUA?
On Mon, 22 Aug 2011 07:51:22 -0700 Doug Ewell d...@ewellic.org wrote: Some PUA properties, like glyph shapes and maybe directionality, can be stored in a font. Others, like numeric values and casing, might not or cannot. An interchangeable format needs to be agreed upon for the properties in the latter category. I suggest that the obvious format is that used for capturing the UCD in XML. Only the characters in which you are interested need be specified. One very important property for several scripts is the script to which a character belongs. One reason for associating properties with a font is that text that is to be displayed is at that point tentatively associated with a font. Another is that in a multi-font document, a PUA character could have multiple implicit properties dependent on the font it appears in. Richard.
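[Editorial sketch] Richard's suggestion of reusing the UCD-in-XML format for a private agreement might look roughly like the following. Everything here is an assumption made for illustration: the two PUA characters, their names and their property values are invented, and the element and attribute names (`char`, `cp`, `gc`, `bc`, `sc`, `nt`, `nv`) and the namespace follow the UCD XML format of UAX #42 as best recalled, so they should be checked against that document before being relied on.

```python
import xml.etree.ElementTree as ET

# Namespace used by the UCD in XML (to be verified against UAX #42).
UCD_NS = "http://www.unicode.org/ns/2003/ucd/1.0"

# Hypothetical private agreement covering only the characters of interest.
PUA_XML = f"""<ucd xmlns="{UCD_NS}">
  <description>Hypothetical private agreement for a PUA script</description>
  <repertoire>
    <char cp="E000" na="EXAMPLE LETTER A" gc="Lo" bc="R" sc="Qaaa"/>
    <char cp="E030" na="EXAMPLE DIGIT ONE" gc="Nd" bc="R" sc="Qaaa" nt="De" nv="1"/>
  </repertoire>
</ucd>"""


def load_properties(xml_text):
    """Return {code point: {property alias: value}} for every <char>."""
    root = ET.fromstring(xml_text)
    props = {}
    for char in root.iter(f"{{{UCD_NS}}}char"):
        attrs = dict(char.attrib)
        cp = int(attrs.pop("cp"), 16)
        props[cp] = attrs
    return props


props = load_properties(PUA_XML)
print(props[0xE000]["bc"], props[0xE030]["nv"])
```

A file like this lives outside any font, which is the property Doug asks for, while a renderer could still consult it for directionality.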
RE: RTL PUA?
Richard Wordingham richard dot wordingham at ntlworld dot com wrote: One reason for associating properties with a font is that text that is to be displayed is at that point tentatively associated with a font. I thought John said fonts dealt with glyph IDs, not characters per se. Another is that in a multi-font document, a PUA character could have multiple implicit properties dependent on the font it appears in. Normal, assigned characters don't change their Unicode properties depending on font. I don't see why PUA characters would be different. -- Doug Ewell | Thornton, Colorado, USA | RFC 5645, 4645, UTN #14 www.ewellic.org | www.facebook.com/doug.ewell | @DougEwell
Re: RTL PUA?
On Sat, Aug 20, 2011 at 7:08 AM, Shriramana Sharma samj...@gmail.com wrote: On 08/20/2011 01:57 PM, Martin Hosken wrote: D49 states that all properties of PUA characters are overridable by a higher protocol. But in 'normal' implementations, there are no higher level protocols to override the properties and so they use the defaults in the Unicode Database. So while in *theory* it's possible to override these values, nobody does. (This happens to also be the case with other tailoring algorithms in Unicode). Adding the configuration that tailoring requires is usually prohibitive and so it just doesn't get done. Good point -- Michael should note this. Somebody remarked that Apple Mac OS's rendering engine already supports an extended OT table which would signal that the glyphs in a PUA font are RTL. If other rendering engines don't support it, again it is not the fault of the standard. Is there a specification for that OT table? Are you implementing this in anything? Read a previous post by John Jenkins. He's the one who said they have a prop table in Apple's implementation of OT (or is it their own AAT) that enables one to do this. Is this correct? that Apple solves the problem of RTL PUA user requirements? See John Jenkins' latest mail that says: [Begin Quote] To be honest, I don't know if using the 'prop' table to override directionality for glyphs still works. A quick-and-dirty test on Lion suggests that it doesn't, so I may have spoken too quickly. This is not a part of the functionality of AAT which gets much exercise, so it's entirely possible that it was lost at some point without anyone noticing. In any event, my apologies for raising any false hopes. [End Quote] Hope a new proposal or a UTN from UC will make things clear, and RTL community benefits. N. Ganesan
On 21 August 2011 at 10:48 AM, Jonathan Kew wrote: On 21 Aug 2011, at 17:21, Behdad Esfahbod wrote: On 08/21/11 16:44, Shriramana Sharma wrote: BTW can John Jenkins show us a few entries from the prop table of some font supporting the custom Apple PUA characters, especially the RTL and GC=No ones? Like this? https://developer.apple.com/fonts/ttrefman/RM06/Chap6prop.html However, note that this documentation is very old, and does not make it clear whether there is any support for overriding directionality in current Mac OS X software. Yes, it's very old, largely because we haven't done anything with the structure of the 'prop' table for a long, long time. Still, anything referring to QuickDraw GX is obviously overdue for an update. To be honest, I don't know if using the 'prop' table to override directionality for glyphs still works. A quick-and-dirty test on Lion suggests that it doesn't, so I may have spoken too quickly. This is not a part of the functionality of AAT which gets much exercise, so it's entirely possible that it was lost at some point without anyone noticing. In any event, my apologies for raising any false hopes. = 井作恆 John H. Jenkins jenk...@apple.com If the application doesn't do this and allows Graphite to break the text into runs, then Graphite can treat PUA characters as having BC other than L? Yes that understanding is correct. Great! Could you then place some sample characters from your Scheherazade font in the PUA, render them RTL and show them to us? Then Michael would be convinced. -- Shriramana Sharma
Re: Code pages and Unicode
On Mon, 22 Aug 2011 14:06:00 +0100 (BST) William_J_G Overington wjgo_10...@btinternet.com wrote: On Monday 22 August 2011, Andrew West andrewcw...@gmail.com wrote: Can anyone think of a way to extend UTF-16 without adding new surrogates or inventing a new general category? Andrew How about a triple sequence of two high surrogates followed by one low surrogate? The problem is that a search for the character represented by the code unit sequence (H2,L3) would also pick up the sequence (H1,H2,L3). While there is no ambiguity, it does make searching more complicated to code. The same issue applies to the suggestion of using (H1,H2,L3,L4) sequences. Now, we could use (H1,H2,L3,L4) sequences and never assign the (H2,L3) combinations. They would therefore be category Cn, which currently consists of both the unassigned characters and the non-characters. However, I can't help feeling that they'd be almost a sort of surrogate. It's slightly more efficient to replace L3 by a single BMP character. Practically, I think that if we can change the semantics of the Myanmar script, our descendants can go back on the guarantee of no more surrogates. Richard.
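[Editorial sketch] Richard's objection to the (H1,H2,L3) proposal is that a naive code-unit search for the ordinary pair (H2,L3) also matches inside the longer sequence. The sketch below shows this on a list of 16-bit code units. The triple sequence is the hypothetical extension under discussion, not valid UTF-16, and the specific surrogate values are arbitrary choices for the example.

```python
def find_all(units, pattern):
    """Return every index at which pattern occurs in the code unit list."""
    n = len(pattern)
    return [i for i in range(len(units) - n + 1) if units[i:i + n] == pattern]


def is_high_surrogate(unit):
    return 0xD800 <= unit <= 0xDBFF


def find_pair(units, pair):
    """Find a surrogate pair, skipping matches preceded by a high surrogate
    (which, under the hypothetical scheme, would belong to a triple)."""
    return [i for i in find_all(units, pair)
            if i == 0 or not is_high_surrogate(units[i - 1])]


H1, H2, L3 = 0xD800, 0xD801, 0xDC37  # arbitrary example surrogates

# 'A', the ordinary pair (H2,L3), 'B', then the hypothetical triple (H1,H2,L3).
text = [0x41, H2, L3, 0x42, H1, H2, L3]

naive = find_all(text, [H2, L3])     # also hits inside the triple
careful = find_pair(text, [H2, L3])  # needs one unit of look-behind
print(naive, careful)
```

The extra look-behind check is the added search complexity Richard describes; today a pair can be matched without inspecting the unit before it.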
Re: Code pages and Unicode
On 8/22/2011 3:15 PM, Richard Wordingham wrote: On Monday 22 August 2011, Andrew West andrewcw...@gmail.com wrote: Can anyone think of a way to extend UTF-16 without adding new surrogates or inventing a new general category? Andrew How about a triple sequence of two high surrogates followed by one low surrogate? How about Clause 12.5 of ISO/IEC 10646: 001B, 0025, 0040 You escape out of UTF-16 to ISO 2022, and then you can do whatever the heck you want, including exchange and processing of complete 4-byte forms, with all the billions of characters folks seem to think they need. Of course you would have to convince implementers to honor the ISO 2022 escape sequence and liberate themselves into a high-level world of nosebleed character numerosity. But then I guess by the time this is needed, folks are counting on the need being self-evident. ;-) --Ken
Re: Implement BIDI algorithm by line
Yes, this is the algorithm I have read. http://unicode.org/reports/tr9/ But I don't know why the user must take a paragraph as the unit for determining the embedding levels. Why can't I shape the text first, then wrap the lines, determine the embedding levels for the characters within each line, and finally reorder the characters within each line? If a paragraph is too long, I think it will occupy a lot of memory. This would be a limitation on embedded systems such as mobile phones.
On Tue, Aug 23, 2011 at 2:18 AM, CE Whitehead cewcat...@hotmail.com wrote: Hi. I think many line breaks within paragraphs are soft line breaks but that embedding levels have to be taken into account when deciding the width of the glyphs; that's as near as I can tell. Here is the description of the algorithm -- is this what you have read? http://unicode.org/reports/tr9/ Some rules are in fact applied after the line wrapping (after the soft breaks) -- The following rules describe the logical process of finding the correct display order. As opposed to resolution phases, these rules act on a per-line basis *and are applied after any line wrapping is applied to the paragraph.* Logically there are the following steps:
- The levels of the text are determined according to the previous rules.
- The characters are shaped into glyphs according to their context *(taking the embedding levels into account for mirroring).*
- The accumulated widths of those glyphs *(in logical order)* are used to determine line breaks.
- For each line, rules L1 (http://unicode.org/reports/tr9/#L1) to L4 (http://unicode.org/reports/tr9/#L4) are used to reorder the characters on that line.
(I'd have to reread the whole document on line breaking then on bidi to answer this truly; sorry; hope this helps anyway) --C. E. Whitehead cewcat...@hotmail.com
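[Editorial sketch] The steps quoted above separate two things: levels are resolved once for the whole paragraph, and reordering (rule L2 of UAX #9) is then applied to each line after wrapping. The sketch below implements only the L2 reversal step. It is a simplified illustration under stated assumptions: the embedding levels are assigned by hand rather than computed, uppercase letters stand in for right-to-left characters, and rule L1, mirroring and shaping are ignored.

```python
def reorder_line(chars, levels):
    """Rule L2: from the highest level down to the lowest odd level, reverse
    every contiguous run of characters whose level is at least that level."""
    chars = list(chars)
    odd_levels = [lv for lv in levels if lv % 2 == 1]
    if not odd_levels:
        return "".join(chars)
    for level in range(max(levels), min(odd_levels) - 1, -1):
        i = 0
        while i < len(chars):
            if levels[i] >= level:
                j = i
                while j < len(chars) and levels[j] >= level:
                    j += 1
                chars[i:j] = reversed(chars[i:j])
                i = j
            else:
                i += 1
    return "".join(chars)


# Uppercase = pretend RTL letters. Levels are assumed, not computed:
# "abc " is level 0, "DEF GHI" (including its inner space) is level 1.
paragraph = "abc DEF GHI"
levels = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]

# Suppose the line wraps at the space at index 7 (the space is dropped).
line_ranges = [(0, 7), (8, 11)]
per_line = [reorder_line(paragraph[s:e], levels[s:e]) for s, e in line_ranges]

# For comparison: reordering the unwrapped paragraph as a single line.
whole = reorder_line(paragraph, levels)
print(per_line, whole)
```

Reordering the whole paragraph first and wrapping afterwards would put GHI on the first line ahead of DEF, so the reversal has to happen per line. Nothing in this step prevents discarding a line's glyphs once it is laid out; only the levels need the paragraph context.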
Re: Implement BIDI algorithm by line
Sorry, Asmus, what do you mean?
On Tue, Aug 23, 2011 at 2:44 AM, Asmus Freytag asm...@ix.netcom.com wrote: Huh? What context is this in?
On 8/22/2011 11:18 AM, CE Whitehead wrote: Hi. I think many line breaks within paragraphs are soft line breaks but that embedding levels have to be taken into account when deciding the width of the glyphs; that's as near as I can tell. Here is the description of the algorithm -- is this what you have read? http://unicode.org/reports/tr9/ Some rules are in fact applied after the line wrapping (after the soft breaks) -- The following rules describe the logical process of finding the correct display order. As opposed to resolution phases, these rules act on a per-line basis *and are applied after any line wrapping is applied to the paragraph.* Logically there are the following steps:
- The levels of the text are determined according to the previous rules.
- The characters are shaped into glyphs according to their context *(taking the embedding levels into account for mirroring).*
- The accumulated widths of those glyphs *(in logical order)* are used to determine line breaks.
- For each line, rules L1 (http://unicode.org/reports/tr9/#L1) to L4 (http://unicode.org/reports/tr9/#L4) are used to reorder the characters on that line.
(I'd have to reread the whole document on line breaking then on bidi to answer this truly; sorry; hope this helps anyway) --C. E. Whitehead cewcat...@hotmail.com
Re: RTL PUA?
On 08/23/2011 03:29 AM, N. Ganesan wrote: Hope a new proposal or a UTN from UC will make things clear, and RTL community benefits. Dear Ganesan, I wonder if you have actually understood all the issues here. As usual you have done your copy-paste from somebody else's post. Please say something if you have something to actually contribute instead of just saying "I support Oriya OM", "I support PUA RTL" or such. If you support PUA RTL, and since you are so interested in Grantha, you should do a proposal for regions in the PUA to be allocated proper IndicMatraCategory properties so that today we can put Grantha in the PUA and get it rendered properly by existing rendering engines. -- Shriramana Sharma
Re: Code pages and Unicode
On 23/08/11 00:15, Richard Wordingham wrote: The problem is that a search for the character represented by the code unit sequence (H2,L3) would also pick up the sequence (H1,H2,L3). While there is no ambiguity, it does make searching more complicated to code. The same issue applies to the suggestion of using (H1,H2,L3,L4) sequences. And what do you think about (H1,H2,VS1,L3,L4)?