Re: RTL PUA?

2011-08-22 Thread Asmus Freytag

On 8/21/2011 7:34 PM, Doug Ewell wrote:

So what you are asking about is a directional control character that would 
assign subsequent characters a BC of 'AL', right?

You don't want to call this a LANGUAGE MARK or anything else that implies language 
identification, because of the existence of real language identification 
mechanisms and the history of Unicode and language tagging.


An ARM (Arabic RTL Mark) would be a sensible addition to the standard. 
It would close a small gap in design that currently prevents a fully 
faithful plain text export of bidi text from rich text (higher level 
protocol) formats.


In a HLP you can assign any run to behave as if it was following a 
character with bidi property AL.


When you export this text as plain text, unless there is an actual AL 
character, you cannot get the same behavior (other than by the 
heavy-handed method of completely overriding the directionality, making 
your plain text less editable).


So, yes, there's a bit of a use case for such a mark.

(It's effect is limited to treatment of numeric expressions, so it's not 
an Arabic language mark, but one that triggers the same bidi context 
as the presence of an Arabic Script (AL) character.)


A./


--
Doug Ewell • d...@ewellic.org
Sent via BlackBerry by ATT

-Original Message-
From: Richard Wordinghamrichard.wording...@ntlworld.com
Sender: unicode-bou...@unicode.org
Date: Mon, 22 Aug 2011 03:19:39
To: Unicode Mailing Listunicode@unicode.org
Subject: Re: RTL PUA?

On Sun, 21 Aug 2011 23:55:46 +
Doug Ewelld...@ewellic.org  wrote:


What's a LANGUAGE MARK?

There are *three* strong directionalities - 'L' left-to-right, 'AL'
right-to-left as in Arabic, 'R' right-to-left (as in Hebrew, I
suspect).  'AL' and 'R' have different effects on certain characters
next to digits - it's the mind-numbing part of the BiDi algorithm.
With one a $ sign after a string of European (or is it Arabic?) digits
appears on the left and in the other it appears on the right.  I
can't remember whether 'higher-level protocols' have an effect on this
logic. LRM has a BC of L, RLM has a BC of R, but no invisible character
has a BC of AL. That's why I tentatively raised the notion of ARABIC
LANGUAGE MARK.  Incidentally, an RLO gives characters with a
temporary BC of R, not AL.

Richard.









RE: RTL PUA?

2011-08-22 Thread Jonathan Rosenne
I don't buy the assumption that all the world is either AAT, Graphite or 
Uniscribe.

Anyhow, this discussion is going off topic, the issue is should Unicode specify 
an RTL PUA area, not whether some products, however respectable, provide a 
bypass.

Jony

 -Original Message-
 From: unicode-bou...@unicode.org [mailto:unicode-bou...@unicode.org] On
 Behalf Of Shriramana Sharma
 Sent: Monday, August 22, 2011 8:12 AM
 To: unicode@unicode.org
 Subject: Re: RTL PUA?
 
 On 08/22/2011 08:24 AM, Peter Constable wrote:
  I'm not saying that there shouldn't be_some_  software that can do
  what you expect. But there will likely be some different views on
  what ought to be included within that some.
 
 Peter, given that both AAT and Graphite have provisions for assigning
 custom properties including BC to PUA characters, it seems Uniscribe is
 the only one missing out. Those advocating RTL PUA areas seem to reject
 AAT and Graphite as hacks or wow *one* application [*].
 
 [* = LibreOffice is the *only* multipurpose application running on
 /Windows/ to support Graphite and I'm not counting SIL WorldPad. On *nix
 platforms, *any* number of applications that use HB-NG for rendering
 will be able to handle Graphite in the near future because HB-Graphite
 integration is already done. That is to say, once GTK and Qt fully
 switch to HB-NG.]
 
 Anyhow, if you Microsoft guys added support in Uniscribe for ascribing
 custom properties including BC to PUA characters (or have you already
 done it) it would be what would satisfy these PUA RTL users and convince
 them that no RTL PUA zones are needed, it seems.
 
 The suggestion has been made that fonts should be able to carry some
 additional custom tables specifying custom properties for PUA
 characters, which seems reasonable. I'm not sure if the OT GDEF table or
 the AAT PROP table completely satisfies this requirement. People
 interesting in using custom properties for the PUA (which includes me
 for Indic script) should then sit up and formulate the syntax for such
 tables.
 
 If Uniscribe, AAT, and Harfbuzz then provided generic support for
 parsing such tables and rendering PUA characters accordingly, it would
 be an all-around solution both for RTL PUA as well as Indic PUA, I
 suppose. (But I'm not sure how such a custom table would interact with
 the innate ability of Graphite to handle custom properties. It should
 probably be either the new proposed custom table or Graphite.)
 
 [sigh]
 
 --
 Shriramana Sharma





Re: RTL PUA?

2011-08-22 Thread Michael Everson
On 22 Aug 2011, at 03:57, Peter Constable wrote:

 From: unicode-bou...@unicode.org [mailto:unicode-bou...@unicode.org] On 
 Behalf Of Asmus Freytag
 
 Treating PUA characters as ON is very problematic
 
 As would be changing the default property of PUA characters from L to ON.

Which is why that will not be proposed.

Michael Everson * http://www.evertype.com/





Re: RTL PUA?

2011-08-22 Thread Michael Everson
On 22 Aug 2011, at 05:53, Shriramana Sharma wrote:

 While I don't know much about RTL scripts, if the logic order is ALEF + 
 LAMED, but the presentation order is LAMED + ALEF *because of the RTL nature* 
 do you write the rule as ALEF + LAMED = ALEF_LAMED_LIGATURE or LAMED + ALEF = 
 ALEF_LAMED_LIGATURE ?

The specific shape of that ligature is not a result of the directionality 
property.

Michael Everson * http://www.evertype.com/





Re: C1 Control Pictures Proposal

2011-08-22 Thread Sean Leonard

On Aug 17, 2011, at 4:38 PM, Andrew West wrote:

 
 Unless you can show evidence that C1 control pictures are currently in
 use and that there is a clear demand from the user community to


On Aug 21, 2011, at 10:13 AM, Doug Ewell wrote:

 Perhaps it would help for you to do a quick survey of applications that 
 already make use of the existing C0 control pictures, and include the results 
 in your argument.  That might help convince some of us who feel the C0 
 pictures are only there for compatibility with previous character encodings

This is a reasonable request. In a follow-up post or in any event in the formal 
proposal, I shall include examples of use of and/or demand for the 
representation of control pictures.

I would like to ask you/the list for the sources for C0 control pictures. They 
appear to be ANSI X3.32 and ISO 2047. (Also, FIPS Pub. 1-2, which consolidates 
ANSI X3.32 and some others.) Does anybody have these, and can you look the 
pictures up? In particular, X3.32 is withdrawn...

-Sean



Re: RTL PUA?

2011-08-22 Thread Petr Tomasek
On Mon, Aug 22, 2011 at 10:42:05AM +0530, Shriramana Sharma wrote:
 On 08/22/2011 08:24 AM, Peter Constable wrote:
 I'm not saying that there shouldn't be_some_  software that can do
 what you expect. But there will likely be some different views on
 what ought to be included within that some.
 
 Peter, given that both AAT and Graphite have provisions for assigning 
 custom properties including BC to PUA characters, it seems Uniscribe is 
 the only one missing out. Those advocating RTL PUA areas seem to reject 
 AAT and Graphite as hacks or wow *one* application [*].

I personally would say to make some blocks in Plane 16 default to R, some
AL and some ON. For fonts based on rendering engines that
don't allow fonts to change characters properties this would be
crutial, for those engines that are capable of changing the properties
it would present no problem (the font can change this properties arbitrary
even if it defaults to RTL...).

 [* = LibreOffice is the *only* multipurpose application running on 
 /Windows/ to support Graphite and I'm not counting SIL WorldPad. On *nix 
 platforms, *any* number of applications that use HB-NG for rendering 
 will be able to handle Graphite in the near future because HB-Graphite 
 integration is already done. That is to say, once GTK and Qt fully 
 switch to HB-NG.]

That said, the HarfBuzz-ng itself (i.e. it's own engine) tries to imitate
the Uniscribe. Most probably, Graphite fonts will still be an exception
on these systems...

 [sigh]
 
 -- 
 Shriramana Sharma
 

-- 
Petr Tomasek http://www.etf.cuni.cz/~tomasek
Jabber: but...@jabbim.cz


EA 355:001  DU DU DU DU
EA 355:002  TU TU TU TU
EA 355:003  NU NU NU NU NU NU NU
EA 355:004  NA NA NA NA NA






Re: Code pages and Unicode

2011-08-22 Thread Andrew West
On 21 August 2011 02:14, Richard Wordingham
richard.wording...@ntlworld.com wrote:
 On Fri, 19 Aug 2011 17:03:41 -0700
 Ken Whistler k...@sybase.com wrote:

 O.k., so apparently we have awhile to go before we have to start
 worrying about the Y2K or IPv4 problem for Unicode. Call me again in
 the year 2851, and we'll still have 5 years left to design a new
 scheme and plan for the transition. ;-)

 It'll be much easier to extend UTF-16 if there are still enough
 contiguous points available.  Set that wake-up call for 2790, or
 whenever plane 13 (better, plane 12) is about to come into use.

Stymied by the Unicode® stability policies again:

The General_Category property values will not be further subdivided. 
The General_Category property value Surrogate (Cs) is immutable: the
set of code points with that value will never change.

http://unicode.org/policies/stability_policy.html#Property_Value

Can anyone think of a way to extend UTF-16 without adding new
surrogates or inventing a new general category?

Andrew




Re: Code pages and Unicode

2011-08-22 Thread Shriramana Sharma

On 08/22/2011 03:05 PM, Andrew West wrote:

Can anyone think of a way to extend UTF-16 without adding new
surrogates or inventing a new general category?


Why would anyone *need* to do so? UTF-16 can represent all codepoints 
upto Plane 16 right?


--
Shriramana Sharma



Re: RTL PUA?

2011-08-22 Thread Shriramana Sharma

On 08/22/2011 04:34 PM, Behdad Esfahbod wrote:

On 08/22/11 06:53, Shriramana Sharma wrote:



  While I don't know much about RTL scripts, if the logic order is ALEF + 
LAMED,
  but the presentation order is LAMED + ALEF*because of the RTL nature*  do you
  write the rule as ALEF + LAMED = ALEF_LAMED_LIGATURE or LAMED + ALEF =
  ALEF_LAMED_LIGATURE ?

Depends on your specific shaping engine logic.  OpenType assumes native
direction per script.  So if you have Arabic text between LRO/PDF, you have to
reverse the order then apply OpenType shaping.  Other engines may decide to
handle these differently.  But the general statement is true: ligatures are
visual artifacts and hence only form in one direction, not the other (except
if it's, say, the ff ligature).


Hi Behdad. I only asked whether the OT *tables* would contain the 
entries in the logical order or the visual order. Clearly it would still 
be the visual order (but Philippe Verdy seemed to imagine/suggest 
otherwise).


It is clear that in the *script itself* the ligature would form in the 
direction of writing.


--
Shriramana Sharma



Re: RTL PUA?

2011-08-22 Thread Shriramana Sharma

On 08/22/2011 05:26 PM, Behdad Esfahbod wrote:

OpenType tables contain entries in the logical order of the script in
question.  Ie. Arabic tables are always RTL.


Yes I understand, but still, to clarify:

The font tables themselves contain only ASCII characters I presume. In 
it do you write:


ALEF + LAMED = ALEF_LAMED_LIGATURE

or

LAMED + ALEF = ALEF_LAMED_LIGATURE ?

IIUC, in logical order ALEF precedes LAMED, and in visual order, ALEF 
stands to the right of LAMED.


--
Shriramana Sharma



Re: Code pages and Unicode

2011-08-22 Thread Andrew West
On 22 August 2011 12:51, Shriramana Sharma samj...@gmail.com wrote:
 On 08/22/2011 03:05 PM, Andrew West wrote:

 Can anyone think of a way to extend UTF-16 without adding new
 surrogates or inventing a new general category?

 Why would anyone *need* to do so? UTF-16 can represent all codepoints upto
 Plane 16 right?

To clarify, I was replying to Richard Wordingham's tongue in cheek
suggestion to extend UTF-16 to go beyond Plane 16 in the year 2790 or
when only one free plane remains.  I am not advocating extending
UTF-16 or the Unicode code space, or suggesting that it will ever be
necessary to do so.

But hypothetically, I don't see a way to extend UTF-16 without
breaking the stability policy.  The same stability policies would also
prohibit the assignment of any area of the Unicode code space for code
page usage as Srivas Sinnathurai has proposed.  (If there was an
automatic filter on ideas that break one or more stability policies
this mailing list would be a far quieter place.)

Andrew



Re: RTL PUA?

2011-08-22 Thread Shriramana Sharma

On 08/22/2011 12:21 PM, Jonathan Rosenne wrote:

I don't buy the assumption that all the world is either AAT, Graphite
or Uniscribe.


Nobody asserted that either. It is only pointed out that major 
implementations are able to provide what you seek.



Anyhow, this discussion is going off topic, the issue is should
Unicode specify an RTL PUA area, not whether some products, however
respectable, provide a bypass.


I don't see why you call it a *bypass*. Only if the road in front of you 
presents obstacles and does not allow you to proceed further, you need 
to take a bypass. If we are considering the Standard as the road which 
we need to take, the road doesn't present any obstacle to using PUA 
characters as RTL, so Graphite etc are not providing a *bypass* but in 
fact just being good generous implementations that allow custom 
properties for the PUA as the Standard allows.


The request being made to allocate BC=R areas in the PUA is sure to 
generate an impression that conformant implementations should consider 
such a property normative, which then would violate the definition of 
the PUA that conformant implementations need not treat any property of 
the PUA as normative.


Returning to your concerns, it is being asserted that since 
implementations are *already* able to provide for custom properties for 
the PUA, there is *no* need for Unicode to specify an RTL PUA area and 
furthermore as such a specification would violate the definition of the 
PUA, it should also *not* be done. One both *need* not do it and 
*should* not do it.


--
Shriramana Sharma



Re: RTL PUA?

2011-08-22 Thread Joó Ádám
 Um... Computers are hardware, and don't understand a thing. What I think you 
 mean is computer _software_. (I know, I'm being pedantic, but with good 
 reason.)

Sorry, I just can’t resist pointing out that difference between
hardware and software is only the fact that the former is material,
with all the consequences that follows. In any other way they are
completely interchangeable.

As for the other part of your mail, Peter, sorry, but it really
doesn’t make any sense to me. As John has pointed out, you can adjust
the properties of private use characters on Apple computers. Perhaps
there is a way to do so on Windows, Unix and other systems as well.
What Philippe and Doug are proposing, and I also strongly agree with,
is to have a standard way of interchange of these properties. I don’t
think it is neccessary to go into the advantages of standards.

Speaking of actual implementation, I’m convinced that this format
should be the same as it is for encoded characters (whether it is the
plain text format of the Unicode Character Database, XML or anything
else). Rendering engines should – maybe they already do so – accept
multiple files containing character properties, which could make
upgrades to the newer versions of the standard a matter of downloading
the new standard set, and provide a way of overriding private use (or
even standard if one is so inclined) characters’ properties.
Introduction of unencoded scripts would therefore become a matter of
distributing a small properties file and the corresponding fonts.


Á




Re: RTL PUA?

2011-08-22 Thread Mark E. Shoulson

On 08/22/2011 08:26 AM, Shriramana Sharma wrote:

On 08/22/2011 05:26 PM, Behdad Esfahbod wrote:

OpenType tables contain entries in the logical order of the script in
question.  Ie. Arabic tables are always RTL.


Yes I understand, but still, to clarify:

The font tables themselves contain only ASCII characters I presume. In 
it do you write:


ALEF + LAMED = ALEF_LAMED_LIGATURE

or

LAMED + ALEF = ALEF_LAMED_LIGATURE ?

IIUC, in logical order ALEF precedes LAMED, and in visual order, ALEF 
stands to the right of LAMED.


In the ligature tables, it's recorded as ALEF + LAMED = 
ALEF_LAMED_LIGATURE.  The font tables are concerned with what happens 
when this character follows that one, not what happens when this 
character stands on the right of that one.  So it's stored in logical 
order.


~mark



Re: RTL PUA?

2011-08-22 Thread Philippe Verdy
2011/8/22 Peter Constable peter...@microsoft.com:
 From: ver...@gmail.com [mailto:ver...@gmail.com] On Behalf Of Philippe Verdy

 As I explained in an earlier message, the layout engine doesn't use
 the default property value but the resolved bidi level.

 Once again, you refuse to understand my arguments.

 I don't think I'm refusing to understand anything. I'm merely taking your 
 assertions _as stated_ and evaluating whether I think they are accurate or 
 not. Perhaps what you intend to convey assumes things not clear in what 
 you've stated, since you think I'm not understanding you.


 What I'm saying is that OpenType CANNOT resolve the bidi level of
 PUAs (with the exception where we use additional BiDi controls,

 Of course _OpenType_ cannot, but any rendering engine that uses OpenType 
 _must_ resolve the bidi level of _all_ characters in a sequence that it is 
 given to render. Given our current situation, a default rendering 
 implementation would resolve PUA characters to an even (LTR) level unless, of 
 course, bidi control characters -- particularly RLO -- are used to override 
 the directionality of the character, as you mention.

 which remains a hack, because it adds unnecessary unvisible markup
 around the encoded texts, and complexifies the use of strings and
 substrings).

 We'll, depending on how you define hack, some might reasonably suggest that 
 any usage of PUA is a hack. (Of course, some who may not use the term in 
 the same way might argue that it is certainly not a hack.)

 You can turn the problem as you want, but PUAs (as well as unknown
 characters) still have default properties that, in fine, will get used in 
 absence of a more precise definition (i.e. an explicit override) of the 
 actual BiDi property needed for the character.

So now I perceive your opinion :

- you don't want the solution proposed by Michael Everson (simply
adding a range of RTL PUA), that I also think is not necessary, but is
clearly a possible solution.

- you propose to use BiDi overrrides. I also think (like Michael
Everson) that this is an unpractical hack (Michael Everson that has to
work and discuss with old scripts, or many new unencoded characters to
add to existing scripts (notably Arabic) trying to encode them,
finding various ways to represent them, and *test* his solutions, will
certainly think that embedding each occurence of a PUA substring in
BiDi controls, including in the middle of Arabic words, is certainly a
very bad hack.

- He must certainly think (I also think it too), that PUA characters
are NOT hacks. They are architectural to the well-being of the UCS,
essential in various situations to preserve the software conformance
with the standard. In fact, for old and rare scripts, using PUAs will
remain essential for long, because those scripts will need more and
more time now to get encoded, requiring more extensive researches,
more collaborations with less technical-aware people that cannot
understand why they'll have to test the proposed solutions using test
fonts and test input methods tht require them to enter BiDi controls
around all those PUA characters.

The only problem here is the strong LTR property of all existing PUAs,
as if they were only needed for rare Han sinograms, or for symbols.

Note that, for using a PUA for rare letters found in Arabic, it is
impossible to embed the whole Arabic text in Bidi overrides: this
would completely break the normal behavior of the non-PUA characters
found in the text, notably sequences of Arabic digits, because the
BiDi controls are effectively disabling the BiDi algorithm so that it
will return a single RTL run for all the text in these controls. IF
BiDi controls are used, they have to be inserted ONLY between
subranges containing the PUAs, and only those.

The solution proposed by Michael (a new block of RTL PUAs, probably in
plane 14) still has an advantage: no BiDi controls are needed at all.
The BiDi algorithm does not have to be disabled. All other aspects of
RTL scripts (or mixed RTL/LTR scripts) are preserved (including
mirroring behaviors for auto-LTR characters (at the begining of
paragraphs) and characters whose directionality depends on the
resolved direction of the precening text.

I don't think this is necessary though: I see no reason why
implementations *have to* keep the strong LTR property of existing
PUAs. This strong LTR property is only the consequence of the fact
that this is only the *default* value of those PUAs, and applications
should not be restricted from changing this property as they want,
especially for PUAs.

But to change this property value, we need an explicit PUA agreement
about their usage, in such a way that it can be understood by a
computer. This means an external source of character properties. My
opinion is that this need is most often sufficient if it solves just
the problem of correct display order. Given that the encoded texts
(using those existing strong LTR PUAs that we want to adopt a RTL

Re: RTL PUA?

2011-08-22 Thread Philippe Verdy
2011/8/22 Peter Constable peter...@microsoft.com:
 From: unicode-bou...@unicode.org [mailto:unicode-bou...@unicode.org] On 
 Behalf Of Asmus Freytag

 Treating PUA characters as ON is very problematic

 As would be changing the default property of PUA characters from L to ON.

I also agree with that. This is a bad option that would break
compatibility (the solution advocatd by Michael Everson seems better,
in that perspective, because it does not change any existing property
given to existing assigned PUAs).

Anyway when I spoke about a computer note that I did not use the
definite article. It is evidently implied that there's also a need for
software changes as well (so this does not mean *all* computers, but
this could reach someday *most* computers with their installed or
upgraded softwares). Your last remark in another message of this
thread was really pedantic.



Re: RTL PUA?

2011-08-22 Thread Philippe Verdy
2011/8/22 Shriramana Sharma samj...@gmail.com:
 On 08/22/2011 12:01 AM, Peter Constable wrote:

 If you mean a rule to substitute [g1 g2] with [g3] won't apply if the
 sequence processed by the OpenType Layout lookup processor is [g2
 g1],

 Peter, actually I suspect Philippe is thinking that in the case of RTL, the
 *glyphs* are placed in reverse order and then he is asking how can the
 ligation take place.

No, I've not said anything about ligation. But yes the problem is
related to the expected reverse order of glyphs, for some PUAs, but
not necessarily all of them (not the LTR runs of PUAs, after Bidi
resolution). Ligation is a completely orthogonal problem (not really a
problem because it is already solved).



RE: Code pages and Unicode

2011-08-22 Thread Doug Ewell
srivas sinnathurai sisrivas at blueyonder dot co dot uk wrote:

 The true lifting of UTF-16 would be to UTF-32.
 
 Leave the UTF-16 un touched and make the new half versatile as possible.
 
 I think any other solution is just a patch up for the timebeing.

There is no evidence whatsoever that this is a problem that needs to be
solved, not in 700 or 800 years, not ever.  Ken's words are again being
ignored.

--
Doug Ewell | Thornton, Colorado, USA | RFC 5645, 4645, UTN #14
www.ewellic.org | www.facebook.com/doug.ewell | @DougEwell ­






RE: RTL PUA?

2011-08-22 Thread Murray Sargent
It's actually quite easy to convince Uniscribe to treat specific characters as 
RTL, others as LTR, and, in general, with whatever classifications you desire. 
Pass a preprocessed string to Uniscribe's ScriptItemize(). RichEdit has used 
that approach to some degree starting with RichEdit 3.0 (Windows/Office 2000). 
It's also a handy way to force all operators to be treated as LTR in an LTR 
math zone and as RTL in an RTL math zone (aside from numeric contexts for '.' 
and ','). And you can force IRIs to display LTR or RTL that way by classifying 
the delimiters such as the dots in the domain name accordingly. Some of my blog 
posts on http://blogs.msdn.com/b/murrays/ discuss this in greater detail.

So there's no need to change the properties of the PUA to establish PUA RTL 
conventions. They won't be generally interchangeable, but that's the nature of 
the PUA. You also have to implement such choices using rich/structured text. 
Plain text doesn't have a place to store the necessary properties. Most text is 
rich text anyway grin.

Murray




Re: RTL PUA?

2011-08-22 Thread Philippe Verdy
2011/8/22 Mark E. Shoulson m...@kli.org:
 I'm not certain I understand the question, but if I have it right... The
 logic order is ALEF + LAMED, and the presentation... places those in a
 right-to-left sequence, shall we say (since talking about the presentation
 *order* is confusing here).  The font table contains the lookup that ALEF +
 LAMED = ALEF_LAMED_LIGATURE.  It all goes according to the logical order,
 since the presentation order isn't really an order, it's just a direction.
  (this is different from things like devanagari short-i vowel, which moves
 with respect to the other letters in the script.)

Lookup tables in fonts (at least OpenType) do not work at the
character level, but at the glyph level: they substitute glyph ids by
other glyph ids. Sequences of glyph ids are already reordered in
visual order by the layout engine when they are searched in OpenType
lookups, should they be RTL glyphs, or Indic glyphs with special
reordering requirements (independant of the logical ordering of
characters/code points).

In addition, the same sequence of characters may be sometimes searched
in several distinct sequences of glypg ids (this depends on the kind
of OpenType table being consulted, as well as on character properties
which also determine which lookup table will be searched and the
relative order of successive lookups).

The only lookup table in fonts that work at the character/code point
level is their cmap (which maps a default glyph id from each encoded
character, independantly of their logical or visual ordering, as well
as independantly of the script/language in which those characters or
glyphs are used, but possibly depending on the encoding used and the
software platform supporting that encoding).

Not all fonts need a cmap; for some of them, a default cmap may be
implied or automatically constructed -- for example Symbol fonts in
Windows, that are implicitly mapped in a PUA range; another example is
Type1 or CFF fonts that have a default standardEncoding inherited
from PostScript, based on glyph names (rather than glyph ids or code
points) that may have themselves an implicit mapping to UCS codepoints
(if these names are those defined in the AGL). Not all these mappings
are 1-to-1, which means that they are not reversible, in the general
case.



Re: RTL PUA?

2011-08-22 Thread Philippe Verdy
2011/8/22 Shriramana Sharma samj...@gmail.com:
 Hi Behdad. I only asked whether the OT *tables* would contain the entries in
 the logical order or the visual order. Clearly it would still be the visual
 order (but Philippe Verdy seemed to imagine/suggest otherwise).

No ! I've not imagined that. You incorrectly reinterpret
imaginatively another incorrect imaginative reinterpretation, made by
someone else, of what I wrote, which did not even suggest that.



Re: RTL PUA?

2011-08-22 Thread Philippe Verdy
2011/8/22 Shriramana Sharma samj...@gmail.com:
 On 08/22/2011 05:26 PM, Behdad Esfahbod wrote:

 OpenType tables contain entries in the logical order of the script in
 question.  Ie. Arabic tables are always RTL.

 Yes I understand, but still, to clarify:

 The font tables themselves contain only ASCII characters I  presume.

No. The lookup tables contain sequences of numeric glyph ids (16 bit
integers in TrueType and OpenType). Which are also not the code point
values, and not the character names or glyph names.

 you write:

 ALEF + LAMED = ALEF_LAMED_LIGATURE

 or

 LAMED + ALEF = ALEF_LAMED_LIGATURE ?

Let's say that;
- the LAMED character is cmap'ped (by its code point value in an cmap
for Unicode, or by its code position in a cmap for another legacy
8-bit encoding) to the glyph id 1012,
- and the ALEF character is cmapped to the glyph id 1001 (the values
of glyph ids are not important, not even their relative order or
differences, they don't need to obey any standard),
- and the ALEF-LAMED ligature is in glyph id 1540 (the ALEF-LAMED
character of the UCS may also be cmapped separately, but this is not a
requirement)

Then the lookup to perform the ligature will contain : (1012, 1001) - (1540).

Glyph id's are presented and scanned in the lookup table, in sequences
preordered in visual order by the text layout/shaping engine.

However, given that the ALEF-LAMED is also a character of the UCS, the
text layout/shaping engine that knows the Arabic script can also
perform a character-based substitution itself, even in absence of the
lookup of glyph ids in fonts; then it can render the ligature
character according to the glyph id to which it is cmapped in that
font.




Re: Code pages and Unicode

2011-08-22 Thread John H. Jenkins

Christoph Päper 於 2011年8月20日 上午2:31 寫道:

 Mark Davis ☕:
 
 Under the original design principles of Unicode, the goal was a bit more 
 limited; we envisioned […] a generative mechanism for infrequent CJK 
 ideographs,
 
 I'd still like having that as an option.
 


Et voilà!  We have Ideographic Description Sequences.  Or, if you're more 
ambitious, CDL.  

Generative mechanisms for Han are very attractive given the nature of the 
script, but once you try to support something other than display, or even try 
to write a rendering engine, all sorts of nasty problems crop up that have 
proven difficult to solve.  We won't even get into the problem of wanting to 
discourage people from making up new ad hoc characters for Han. 

I won't say some sort of generative mechanism will never become the preferred 
way of handling unencoded ideographs, but there is a lot of work to be done 
before that would be practical.

=
John H. Jenkins
jenk...@apple.com






Re: RTL PUA?

2011-08-22 Thread Philippe Verdy
2011/8/22 Joó Ádám a...@jooadam.hu:
 Um... Computers are hardware, and don't understand a thing. What I think you 
 mean is computer _software_. (I know, I'm being pedantic, but with good 
 reason.)

 Sorry, I just can’t resist pointing out that difference between
 hardware and software is only the fact that the former is material,
 with all the consequences that follows. In any other way they are
 completely interchangeable.

Same opinion for me.

 As for the other part of your mail, Peter, sorry, but it really
 doesn’t make any sense to me. As John has pointed out, you can adjust
 the properties of private use characters on Apple computers. Perhaps
 there is a way to do so on Windows, Unix and other systems as well.
 What Philippe and Doug are proposing, and I also strongly agree with,
 is to have a standard way of interchange of these properties. I don’t
 think it is neccessary to go into the advantages of standards.

 Speaking of actual implementation, I’m convinced that this format
 should be the same as it is for encoded characters (whether it is the
 plain text format of the Unicode Character Database, XML or anything
 else). Rendering engines should – maybe they already do so – accept
 multiple files containing character properties, which could make
 upgrades to the newer versions of the standard a matter of downloading
 the new standard set, and provide a way of overriding private use (or
 even standard if one is so inclined) characters’ properties.
 Introduction of unencoded scripts would therefore become a matter of
 distributing a small properties file and the corresponding fonts.

As well, the small properties files can be embedded, in a very compact
form, in the PUA font.

This small table can be limited to just listing the ranges of PUA code
points that are strong RTL instead of LTR. Most often, there will be
only one range, and this just requires a couple of integers in that
embedded table (possibly more, only if you want to represent more
properties), without requiring a complex XML parser or a complex
parser for the tabulated ASCII format used in the UCD, which is
overkill for just the few properties that are needed for correct
display.

So the duplication in each font is not a real problem (note that there
won't be a lot of fonts, most often there will be only one that
matches the PUA agreement and that is suitable to render the
UCS-encoded PUA text).




Implement BIDI algorithm by line

2011-08-22 Thread li bo
Hi all,
I have a question about the BIDI algorithm implementation. Bidi algorithm
describe that one must resolving embedding level in a paragraph before break
paragraph into lines. I don't understand why. Should we firstly break
paragraph into lines and remember the paragraph level, and then resolving
the embedding levels for each character in lines? If we do it like this,
what issues would be occurred?

Thanks a lot!


RE: RTL PUA?

2011-08-22 Thread Doug Ewell
Philippe Verdy verdy underscore p at wanadoo dot fr wrote:

 As well, the small properties files can be embedded, in a very compact
 form, in the PUA font.

As soon as you embed all the information in the font, you require
different solutions for systems that use different font technologies.
I was thinking of something more portable.

 This small table can be limited to just listing the ranges of PUA code
 points that are strong RTL instead of LTR. Most often, there will be
 only one range, and this just requires a couple of integers in that
 embedded table (possibly more, only if you want to represent more
 properties), without requiring a complex XML parser or a complex
 parser for the tabulated ASCII format used in the UCD, which is
 overkill for just the few properties that are needed for correct
 display.

I generally assume there is more to character handling than display.

 So the duplication in each font is not a real problem (note that there
 won't be a lot of fonts, most often there will be only one that
 matches the PUA agreement and that is suitable to render the
 UCS-encoded PUA text).

Depending on how you count, there are already two to four fonts that
support Ewellic in the PUA.  There are probably many more that support
Tengwar or Cirth or Klingon.

--
Doug Ewell | Thornton, Colorado, USA | RFC 5645, 4645, UTN #14
www.ewellic.org | www.facebook.com/doug.ewell | @DougEwell ­






Re: RTL PUA?

2011-08-22 Thread Shriramana Sharma

On 08/20/2011 10:54 AM, Shriramana Sharma wrote:

On 08/19/2011 10:05 PM, Mark Davis ☕ wrote:

All of the property assignments to PUA characters (except the GC) are
purely informative.


I just now noticed that you had excepted the GC in the above. Why is
that? How are applications supposed to handle combining marks etc if in
the PUA?


Mark, can you please reply to the above --

It seems that while it is true that GC=Co should be retained *in the 
standard* to clearly identify the character as a PUA character, the 
applications will still by changing that GC to Lo, Mc, Mn, No etc for 
their internal private-agreement processing. So what is the exact nature 
of your excepting the GC in your statement above?


--
Shriramana Sharma



Re: RTL PUA?

2011-08-22 Thread Shriramana Sharma

On 08/22/2011 05:20 PM, Shriramana Sharma wrote:


Hi Behdad. I only asked whether the OT *tables* would contain the
entries in the logical order or the visual order. Clearly it would still
be the visual order


My mistake: I should have said *logical* order.


(but Philippe Verdy seemed to imagine/suggest
otherwise).


This one is correct w.r.t. what I had *intended* to say above: i.e. 
Philippe thinks the entries contain the glyphs in *visual* order.


See other mail replying to Philippe pointing this out.

--
Shriramana Sharma



Re: RTL PUA?

2011-08-22 Thread Shriramana Sharma

On 08/22/2011 09:00 PM, Philippe Verdy wrote:


The font tables themselves contain only ASCII characters I  presume.


No. The lookup tables contain sequences of numeric glyph ids (16 bit
integers in TrueType and OpenType). Which are also not the code point
values, and not the character names or glyph names.


And numeric glyph IDs are still ASCII aren't they? I was just noting 
that the glyph tables themselves don't *use* the actual codepoints of 
the characters getting ligated (while they *refer* to them).



Let's say that;
- the LAMED character is cmap'ped (by its code point value in an cmap
for Unicode, or by its code position in a cmap for another legacy
8-bit encoding) to the glyph id 1012,
- and the ALEF character is cmapped to the glyph id 1001 (the values
of glyph ids are not important, not even their relative order or
differences, they don't need to obey any standard),
- and the ALEF-LAMED ligature is in glyph id 1540 (the ALEF-LAMED
character of the UCS may also be cmapped separately, but this is not a
requirement)

Then the lookup to perform the ligature will contain : (1012, 1001) -  (1540).


No! See Behdad's post -- it is clearly said that the lookup will still 
be in logical order (1001, 1012) - (1540) and not in visual order as 
you say. See? This is what I meant in the other mail by you suggesting 
that the tables containing the characters in visual order and not in 
logical order, to which you replied (without much real explanation I'm 
afraid):


quoteNo ! I've not imagined that. You incorrectly reinterpret
imaginatively another incorrect imaginative reinterpretation, made by
someone else, of what I wrote, which did not even suggest that./quote


Glyph id's are presented and scanned in the lookup table, in sequences
preordered in visual order by the text layout/shaping engine.


Nope -- they are placed in the lookup table in *logical* order. IIUC the 
entire sequence of glyphs is only reordered from RTL at the very end. 
Peter or Behdad, can you corroborate this?


--
Shriramana Sharma



Re: RTL PUA?

2011-08-22 Thread Shriramana Sharma

On 08/22/2011 09:31 PM, Doug Ewell wrote:

Philippe Verdyverdy underscore p at wanadoo dot fr  wrote:


As well, the small properties files can be embedded, in a very compact
form, in the PUA font.


As soon as you embed all the information in the font, you require
different solutions for systems that use different font technologies.


Why? In the end all the systems base upon the character properties 
specified by the standard. For the PUA characters in question, what is 
needed for a table of properties to override the default ones. The 
systems would then handle those new properties in the same way that they 
would handle the regular ones. Granted, if the renderers hardcode the 
properties (as most OT ones do) then some parsing is required to import 
all the override data provided by the extra font table into a struct or 
such -- after which (I presume) it would be possible (to a large 
extent?) to treat it the same as an encoded script. [Actually, this 
seems quite difficult to implement in OT, where the philosophy is to 
explicitly hardcode the properties, but Graphite and AAT should be fine 
I guess.]



I generally assume there is more to character handling than display.


True -- so if someone wanted a PUA script to be handled properly in 
sorting etc one would have to prepare collation tables which would 
obviously go *outside* the font.


--
Shriramana Sharma



Re: Feedback from C1 Control Pictures Proposal

2011-08-22 Thread Frank da Cruz
 I would like to ask Frank for a bit of help here (and, to the extent that
 Ken thinks that the proposal is reasonable, some affirmation that the
 uses/demonstration of demand will be seen as acceptable to the Unicode
 people). Specifically, can Frank help identify, and possibly provide
 screenshots, of:
 
  - C0 control pictures in use
  - C1 control pictures in use

Maybe only an older person would understand this point, but to emulate a
particular terminal, you have to make the emulator show on the screen what
the real terminal would show.  Since I have been laid off and have to clean
out my office this week, I don't have time to re-do the research, but many
terminals -- my vast collection of terminal manuals has been boxed for
shipment to the Computer History Museum:

  http://www.columbia.edu/cu/computinghistory/books/#terminalmanuals

...have glyphs for C0 controls and some have them for C1 controls.  Here's
the exhibit I prepared for my proposal in 1998:

  ftp://kermit.columbia.edu/kermit/ucsterminal/terminal-exhibits.pdf

Here again, for reference, is the proposal itself (only the C1 part is
relevant to this discussion):

  ftp://kermit.columbia.edu/kermit/ucsterminal/control.txt

The exhibit shows:

Terminals that have C0 control glyphs:
  DEC VT320, 420, etc
  Data General Dasher
  HP-2621
  Wyse 60
  Wyse 370
  Atlantic Research Corporation Interview 30A Data Analyzer (exhibit N1)

Terminals that have C1 control glyphs:
  DEC VT320, 420, etc (full set)
  Data General Dasher (partial set)
  Siemens-Nixdorf 97801 (as hex byte pictures 80, 81, etc)
  Wyse 370 (full set)  

This is not an exhaustive survey, more of a proof by existence.

 * Unfortunately, I don't actually know of any applications, other than
   Penango (my company's primary product), which currently use the U+2400
   range. [That is what kicked off this proposal, by the way.]
 
I don't have information about what applications use them.  Our own terminal
emulator, Kermit 95:

  http://www.columbia.edu/kermit/k95.html

does not.  That's because it was designed to be portable between Windows
console screens and GUI screens, and no Windows console font contained
control pictures.  Instead, when we put the emulator into debugging mode,
color is used.  Obviously, that's not plain text, but this way it shows
control characters in a single cell.

By now, Kermit 95 would indeed use control pictures in its GUI version,
except that the programmers aren't here any more, and except that C1 control
pictures are not defined yet.  By the way, the cancellation of the Kermit
Project is is not an end but a new beginning, because now the source code
for Kermit 95 has been published with an Open Source license:

  http://www.columbia.edu/kermit/k95sourcecode.html

So here is why I believe it is important to have C1 control character glyphs
available in Unicode:

 . Terminal emulation is still important.  For example, everybody who uses
   the Unix shell is doing so through a terminal emulator.  And here, as we
   all know, is where the real work gets done -- coding, website creation
   and maintenance, system administration, network configuration, etc etc.

 . Since the Unix shell and other text-only online environments exist
   outside the English-speaking world too, terminal emulators are being
   updated to support UTF-8.  Kermit 95 has supported it since about 2002.
   The Linux console window (which is a terminal emulator) uses UTF-8 *by
   default*.

 . The terminals that are emulated were manufactured before 1995, and
   therefore mostly follow the ANSI X3.64 definition, which reserves both
   C0 and C1 for control characters, as Unicode itself has done.

 . But Microsoft has created code pages that are identical to ISO standard
   character sets such as ISO-8859-x (which are compatible with ANSI X3.64),
   but with graphic characters in the C1 area.  These have leaked into
   every part of the Internet, including text that we view in a terminal
   screen (e.g. email).

 . When a real terminal, or a program that emulates one, receives text
   written in, say, Microsoft code page 1252, it invariably hangs.  Why?
   Because the text contains smart quotes or somesuch, which coincide with
   valid C1 commands understood by the terminal.  Some of which, such as
   ISO 6429 DCS, OSC, or APC, are a header for a packet of control
   information.  The terminal waits for the end-of-packet for the control
   sequence, as it must do, but it never comes.

Those who support terminal emulators need tools to diagnose problems like
this.  The best and most portable tool is to put the terminal into display
controls mode.  This is a feature that the above mentioned terminals had.
A Unicode-based terminal emulator has glyphs to show the C0 controls but not
the C1 controls, which can b e even more lethal than the C0 ones when used
improperly, as they are in Windows code pages.

Note that tech support is done not only on the scene, but remotely.  Support
technicians 

Re: Code pages and Unicode

2011-08-22 Thread William_J_G Overington
On Monday 22 August 2011, Andrew West andrewcw...@gmail.com wrote:
 
 Can anyone think of a way to extend UTF-16 without adding new surrogates or 
 inventing a new general category?
 
 Andrew
 
How about a triple sequence of two high surrogates followed by one low 
surrogate?
 
I suggest this as a solution to the problem that is posed by Andrew as I feel 
that it would be interesting to know if that would be possible or whether it 
would be forbidden due to an existing policy that has already been guaranteed 
to be unchangeable.
 
William Overington
 
22 August 2011
 







RE: RTL PUA?

2011-08-22 Thread Doug Ewell
Shriramana Sharma samjnaa at gmail dot com wrote:

 As soon as you embed all the information in the font, you require
 different solutions for systems that use different font technologies.
 
 Why? In the end all the systems base upon the character properties 
 specified by the standard. For the PUA characters in question, what is 
 needed for a table of properties to override the default ones. The 
 systems would then handle those new properties in the same way that they 
 would handle the regular ones.

Right, so if you embed that table in an OT font, the information is not
available to a system that uses a font technology other than OT.

What is needed is a way to specify the properties in a
platform-independent way, where platform means not only OS but also
font technology.

--
Doug Ewell | Thornton, Colorado, USA | RFC 5645, 4645, UTN #14
www.ewellic.org | www.facebook.com/doug.ewell | @DougEwell ­








Re: RTL PUA?

2011-08-22 Thread Petr Tomasek
On Mon, Aug 22, 2011 at 07:51:22AM -0700, Doug Ewell wrote:
 Some PUA properties, like glyph shapes and maybe directionality, can be
 stored in a font.  Others, like numeric values and casing, might not or
 cannot.  An interchangeable format needs to be agreed upon for the

Why not?

P.T.

-- 
Petr Tomasek http://www.etf.cuni.cz/~tomasek
Jabber: but...@jabbim.cz


EA 355:001  DU DU DU DU
EA 355:002  TU TU TU TU
EA 355:003  NU NU NU NU NU NU NU
EA 355:004  NA NA NA NA NA






Re: RTL PUA?

2011-08-22 Thread Shriramana Sharma

On 08/22/2011 10:12 PM, Doug Ewell wrote:

Right, so if you embed that table in an OT font, the information is not
available to a system that uses a font technology other than OT.


I don't understand why you would say so -- assuming we are all talking 
about TrueType fonts, AAT just uses some tables, OT others and Graphite 
still others. They are all just tables appended to the TrueType font 
data. Any software that is able to read TT font data can also read the 
tables. So what's the problem?


--
Shriramana Sharma



Re: RTL PUA?

2011-08-22 Thread John Hudson

Shriramana Sharma wrote:

The font tables themselves contain only ASCII characters I presume. 


OpenType Layout tables use Glyph IDs. OTL development tools typically 
use glyph names, which may be particular to the tool or the same names 
used in the post or CFF tables.


OTL tables work on glyphs, not characters, and bidi will have been 
resolved prior to application of OTL substitution and positioning. Input 
glyph strings for substitution lookups are always in the resolved 
direction of the glyph run, so Arabic and Hebrew alphabetic runs are 
processed right-to-left, i.e.


alef lamed - alef_lamed

*not*

lamed alef - alef_lamed

Similarly, context stings for glyph positioning (if present) will be 
right-to-left, although anchor attachment positions on individual glyphs 
are relative to the 0,0 coordinate, i.e. the left sidebearing.


JH



--

Tiro Typeworkswww.tiro.com
Gulf Islands, BC  t...@tiro.com

The criminologist's definition of 'public order
crimes' comes perilously close to the historian's
description of 'working-class leisure-time activity.'
 - Sidney Harring, _Policing a Class Society_



RE: RTL PUA?

2011-08-22 Thread Doug Ewell
Petr Tomasek tomasek at etf dot cuni dot cz wrote:

 Some PUA properties, like glyph shapes and maybe directionality, can
 be stored in a font.  Others, like numeric values and casing, might
 not or cannot.  An interchangeable format needs to be agreed upon for
 
 Why not?

Where does one store numeric values in a font?  Maybe this should be
taken off-list.

--
Doug Ewell | Thornton, Colorado, USA | RFC 5645, 4645, UTN #14
www.ewellic.org | www.facebook.com/doug.ewell | @DougEwell ­





Re: RTL PUA?

2011-08-22 Thread John Hudson

Shriramana Sharma wrote:

I was just noting 
that the glyph tables themselves don't *use* the actual codepoints of 
the characters getting ligated (while they *refer* to them).


Characters are mapped to glyph IDs in the font cmap tables.

Glyph IDs are mapped to other glyph IDs (one-to-one, one-to-many, 
many-to-one, or one-to-one-of-many) in the layout GSUB table.


No! See Behdad's post -- it is clearly said that the lookup will still 
be in logical order (1001, 1012) - (1540) and not in visual order as 
you say.


I think there may be some confusion in this discussion over what 
constitutes 'visual order'. I try to avoid the term because it is 
difficult for right-to-left readers to accustom themselves to thinking 
of visual order as anything other than right-to-left. I prefer the term 
'reading order' or 'resolved order', i.e. resolved bidi and script 
shaping order, which may have involved integrated reordering (reordering 
within the glyph processing) as in the case of Indic scripts.


Nope -- they are placed in the lookup table in *logical* order. IIUC the 
entire sequence of glyphs is only reordered from RTL at the very end. 
Peter or Behdad, can you corroborate this?


Glyph ID inputs for OTL processing are according to reading/resolved 
order. This is typically the same as logical order, but the term logical 
order really applies to character strings, not glyph strings, which are 
much more maleable. The order of input strings in GSUB lookups or 
contexts is dependent not only on the underlying character order, but 
also on the results of previous GSUB lookups. So while, unlike AAT and 
Graphite, OpenType Layout doesn't explicitly provide for glyph 
re-ordering, some kinds of glyph reordering are possible using sequences 
of contextual lookups to duplicate a glyph in a second location in the 
string and then remove the first instance. We use this in some 
Devanagari fonts to enable subsequent ligation of short ikar variants to 
the left of a consonant base with reph marks to the right of that base.


JH



--

Tiro Typeworkswww.tiro.com
Gulf Islands, BC  t...@tiro.com

The criminologist's definition of 'public order
crimes' comes perilously close to the historian's
description of 'working-class leisure-time activity.'
 - Sidney Harring, _Policing a Class Society_



Re: Code pages and Unicode

2011-08-22 Thread Jean-François Colson

On 22/08/11 16:55, Doug Ewell wrote:

srivas sinnathuraisisrivas at blueyonder dot co dot uk  wrote:


The true lifting of UTF-16 would be to UTF-32.

Leave the UTF-16 un touched and make the new half versatile as possible.

I think any other solution is just a patch up for the timebeing.

There is no evidence whatsoever that this is a problem that needs to be
solved, not in 700 or 800 years, not ever.  Ken's words are again being
ignored.

--
Doug Ewell | Thornton, Colorado, USA | RFC 5645, 4645, UTN #14
www.ewellic.org | www.facebook.com/doug.ewell | @DougEwell ­


I see at least one reason to extend the present 17 planes Unicode space: 
that would provide space for a RTL PUA. ☺


Presently, UTF-16 uses surrogate pairs to address non-BMP characters: HS 
LS (High Surrogate followed by Low Surrogate).


What would happen if we imbricate them? Would HS1 HS2 LS1 LS2 be 
acceptable to address more characters?




Re: RTL PUA?

2011-08-22 Thread William_J_G Overington
On Monday 22 August 2011, Philippe Verdy verd...@wanadoo.fr wrote:
 
 So there are only two options:
 
[snipped]
 
 ... : this requires an approval either by the UTC  WG2 (solution 1) or by 
 the OpenType working group (solution 2).
 
Would a third option work?
 
In the Description section of the Macintosh Roman section of a TrueType font, 
include a line of text in a plain text format of which the following line of 
text is an example.
 
PUA.RTL=$E000-$E1FF,$E440-$E447,$E541,$E549,$E57C,$EA00-$EA0F,$EC07;
 
One could specify precisely which Private Use Area characters were to become 
RTL when using that particular font.
 
One would need rendering software that looked for such a string of text in the 
font file, yet, as far as I am aware, no approval from any committee in order 
to put this solution into practical use.
 
William Overington
 
22 August 2011
 













Re: Code pages and Unicode

2011-08-22 Thread Jean-François Colson

On 20/08/11 02:03, Ken Whistler wrote:

O.k., so apparently we have awhile to go before we have to start worrying
about the Y2K or IPv4 problem for Unicode. Call me again in the
year 2851, and we'll still have 5 years left to design a new scheme 
and plan

for the transition. ;-)

--Ken


I wonder whether you aren’t a little too optimistic.

Have you considered the unencoded ideographic scripts?

1,071 hieroglyphs have already been encoded. I think there are 
approximately 4,000 more to encode.


1,165 Yi syllables and 55 Yi radicals have been encoded. But they only 
support one dialect of Yi and I read there are tens of thousands of Yi 
ideographs and that a proposal to encode 88,613 classical Yi characters 
was made 4 years ago.


The threshold of 200,000 characters doesn’t seem very far.



Re: RTL PUA?

2011-08-22 Thread John H. Jenkins

Doug Ewell 於 2011年8月22日 上午10:59 寫道:

 Petr Tomasek tomasek at etf dot cuni dot cz wrote:
 
 Some PUA properties, like glyph shapes and maybe directionality, can
 be stored in a font.  Others, like numeric values and casing, might
 not or cannot.  An interchangeable format needs to be agreed upon for
 
 Why not?
 
 Where does one store numeric values in a font?  Maybe this should be
 taken off-list.
 


This is actually a relevant point.  The major TrueType variants all work 
primarily with glyphs, not characters.  Using them as a place to store 
information about the *characters* in the text is therefore not a reliable way 
to provide an override for default system behavior.  By the time the rendering 
engine consults the fonts for layout specifics, large chunks of the text 
processing will already be completed.  

OpenType, for example, expects that the bidi algorithm is largely run in 
character space, not glyph space, and therefore without regard for the specific 
font involved.  (AAT does almost everything in glyph space, including bidi.  
I'm not sure about Graphite.)  

The net result is that a font is an unreliable way of storing 
character-specific information useful on multiple platforms.  This is one 
reason why embedding the existing directionality controls within the text 
itself is currently the most reliable way of getting the behavior one might 
want in a platform-agnostic way.

=
Siôn ap-Rhisiart
John H. Jenkins
jenk...@apple.com






Re: RTL PUA?

2011-08-22 Thread Joó Ádám
 True -- so if someone wanted a PUA script to be handled properly in sorting
 etc one would have to prepare collation tables which would obviously go
 *outside* the font.

If a proper definition of an unencoded script needs additional
properties which cannot be stored in the font anyway, why would you
want to store part of it in OT tables? It’s just not the right place.
Fonts’ sole purpose is to display already defined characters, not to
define them. Tails shouldn’t be made wagging dogs.

Á




Re: RTL PUA?

2011-08-22 Thread Shriramana Sharma

On 08/22/2011 10:55 PM, Joó Ádám wrote:

If a proper definition of an unencoded script needs additional
properties which cannot be stored in the font anyway, why would you
want to store part of it in OT tables? It’s just not the right place.
Fonts’ sole purpose is to display already defined characters, not to
define them. Tails shouldn’t be made wagging dogs.


True, but we are only trying to help those who find themselves unable to 
even *display* PUA characters as RTL (or as Indic with reordering, which 
can be handled by IndicMatraCategory). Since collation never cares about 
whether the script is LTR or RTL or Indic (with the except of Thai etc 
where the encoding is as per visual order and not logical order) the 
collation data can be outside the font, since it is not needed for display.


--
Shriramana Sharma



Re: RTL PUA?

2011-08-22 Thread John H. Jenkins

William_J_G Overington 於 2011年8月22日 上午10:49 寫道:

 In the Description section of the Macintosh Roman section of a TrueType font, 
 include a line of text in a plain text format of which the following line of 
 text is an example.
 
 PUA.RTL=$E000-$E1FF,$E440-$E447,$E541,$E549,$E57C,$EA00-$EA0F,$EC07;
 

Forgive my asking, but this reference to the description section of the 
Macintosh Roman section of a TrueType font has me puzzled, because I don't 
know what you're talking about.  What table contains this string?

=
井作恆
John H. Jenkins
jenk...@apple.com







Re: RTL PUA?

2011-08-22 Thread William_J_G Overington
On Monday 22 August 2011, John H. Jenkins jenk...@apple.com wrote:
 
 Forgive my asking, but this reference to the description section of the 
 Macintosh Roman section of a TrueType font has me puzzled, because I don't 
 know what you're talking about.  What table contains this string?
 
When I use FontCreator, made by High-Logic, http://www.high-logic.com is the 
webspace: with a font file open, I can select Format from the menu bar and then 
select Naming... from the drop down menu.
 
That leads to a dialogue panel.
 
From that dialogue panel one may select, for an ordinary, basic Unicode font, 
either of two platforms, namely Macintosh Roman and Microsoft Unicode BMP only.
 
Having selected a platform, one may view the text content of various fields for 
that platform, such as font family name and copyright notice, version string 
and postscript name. There is then a button that is labelled Advanced... that, 
if clicked, opens another dialogue panel with various other text fields, 
including Font Designer and Description, which are the two that I often use.
 
Now, when the text values in the fields are stored in the font file, the values 
for the Macintosh Roman platform are stored in plain text and the values for 
the Microsoft Unicode BMP only platform are stored in some encoded format.
 
So, if one opens a TrueType font file in WordPad and one searches for an item 
of plain text that is in one of the fields of the font, then the text that is 
in the Macintosh platform can be found, yet the text that is in the Microsoft 
Unicode BMP only platform cannot be found.
 
So, I thought that if a manufacturer of a wordprocessing application or a 
desktop publishing application decided to make a special researcher's edition 
of the software, then that software could, when a font is selected, first scan 
the font for a PUA.RTL string and, if one is found, override the left-to-right 
nature of the identified characters to be a right-to-left nature, just while 
that font is selected.
 
Whether such a software package ever becomes available is something that only 
time will tell, yet it seems to me that it is a method that could be used 
without needing any changes by any committee.
 
William Overington
  
22 August 2011
 
 








Re: RTL PUA?

2011-08-22 Thread Philippe Verdy
2011/8/22 Doug Ewell d...@ewellic.org:
 Depending on how you count, there are already two to four fonts that
 support Ewellic in the PUA.  There are probably many more that support
 Tengwar or Cirth or Klingon.

First, these fonts can work fine with the default LTR directionality.
So there's no need for additional data for them. Second, even if they
were RTL, the needed info for each of these fonts, embedded in them
would be extremely small, reduced to just specifying the range of RTL
characters they need to contain.

So I don't see that as a problem. Those fonts do exist and are used
exactly because there was no problem for rendering them with texts
encoded in logical order (the same as the visual order). It's still
strange that we can have several fonts for esoteric fonts that have
been used effectively by very few people, when there are centuries of
traditions, and many interested users (but spread in very small
communities worldwide) that cannot use computer technologies to render
their favorite scripts, or that want to teach them, or make books and
other publications to expose them, as an important humane cultural
heritage, even if this was only to translate them or transcribe them
in a more modern script.




Re: Implement BIDI algorithm by line

2011-08-22 Thread Asmus Freytag

Huh? What context is this in?

On 8/22/2011 11:18 AM, CE Whitehead wrote:

Hi.

I think many line breaks within paragraphs are soft line breaks but 
that embedding levels have to be taken into account when deciding the 
width of the glyphs; that's as near as I can tell.


Here is the description of the algorithm -- is this what you have read?
http://unicode.org/reports/tr9/
Some rules are in fact applied after the line wrapping (after the soft 
breaks) --
The following rules describe the logical process of finding the 
correct display order. As opposed to resolution phases, these rules 
act on a per-line basis/and are applied *after* any line wrapping is 
applied to the paragraph./

Logically there are the following steps:

  * The levels of the text are determined according to the previous rules.
  * The characters are shaped into glyphs according to their context
/(taking the embedding levels into account for mirroring)./
  * The accumulated widths of those glyphs /(in logical order)/ are
used to determine line breaks.
  * For each line, rules L1 http://unicode.org/reports/tr9/#L1–L4
http://unicode.org/reports/tr9/#L4 are used to reorder the
characters on that line.



(I'd have to reread the whole document on line breaking then on bidi 
to answer this truely; sorry; hope this helps anyway)

--C. E. Whitehead
cewcat...@hotmail.com




Re: RTL PUA?

2011-08-22 Thread Philippe Verdy
2011/8/22 Shriramana Sharma samj...@gmail.com:
 On 08/22/2011 09:00 PM, Philippe Verdy wrote:

 The font tables themselves contain only ASCII characters I  presume.

 No. The lookup tables contain sequences of numeric glyph ids (16 bit
 integers in TrueType and OpenType). Which are also not the code point
 values, and not the character names or glyph names.

 And numeric glyph IDs are still ASCII aren't they? I was just noting that
 the glyph tables themselves don't *use* the actual codepoints of the
 characters getting ligated (while they *refer* to them).

 Let's say that;
 - the LAMED character is cmap'ped (by its code point value in an cmap
 for Unicode, or by its code position in a cmap for another legacy
 8-bit encoding) to the glyph id 1012,
 - and the ALEF character is cmapped to the glyph id 1001 (the values
 of glyph ids are not important, not even their relative order or
 differences, they don't need to obey any standard),
 - and the ALEF-LAMED ligature is in glyph id 1540 (the ALEF-LAMED
 character of the UCS may also be cmapped separately, but this is not a
 requirement)

 Then the lookup to perform the ligature will contain : (1012, 1001) -
  (1540).

 No! See Behdad's post -- it is clearly said that the lookup will still be in
 logical order (1001, 1012) - (1540) and not in visual order as you say.
 See? This is what I meant in the other mail by you suggesting that the
 tables containing the characters in visual order and not in logical order,
 to which you replied (without much real explanation I'm afraid):

 quoteNo ! I've not imagined that. You incorrectly reinterpret
 imaginatively another incorrect imaginative reinterpretation, made by
 someone else, of what I wrote, which did not even suggest that./quote

 Glyph id's are presented and scanned in the lookup table, in sequences
 preordered in visual order by the text layout/shaping engine.

 Nope -- they are placed in the lookup table in *logical* order. IIUC the
 entire sequence of glyphs is only reordered from RTL at the very end. Peter
 or Behdad, can you corroborate this?

Hmmm... this is not very clear then in the OpenType specification.
Well it does not matter the which order is physically used in the
stored table as long as it is consistant.

But this confirms that the OpenType rendering algorithm, the way it is
presented in the OpenType specification, is completely wrong: the Bidi
algorithm is definitely not the first step needed before performing
glyph substitutions.

However the Bidi algorithm really needs to reorder the glyphs at least
relatively, for correct application of GPOS (glyph positionining). As
a consequence, the font to use will be completely known (all
cmap'pings will have been applied already, and no glyph substitution
can accur across distinct fonts that have independant glyph ids). As
such the PUA agreement implied by the PUA font would have been
asserted. Nothing forbids then to use the font as THE reliable source
of information about which PUAs are RTL and which ones are LTR.

The computing order of features should not then be:
 - BiDi algorithm for reordering grapheme clusters
 - font search and font fallback (using cmap)
 - GSUB (lookups of ligatures or discretionary glyph variants)
 - GPOS
but really:
 - font lookup and font fallback (using cmap)
 - GSUB (lookups of ligatures or discretionary glyph variants)
 - BiDi algorithm for reordering glyphs representing the grapheme
clusters or ligatured grapheme clusters
 - GPOS

The BiDi algorithm absolutely does not have to be changed. This time
there's absolutely no PUA with unknown directionality if the font
defines the RTL property for these PUA (using the normative LTR only
as a default when the font does not specify it)




Re: RTL PUA?

2011-08-22 Thread Philippe Verdy
2011/8/22 Shriramana Sharma samj...@gmail.com:
 True -- so if someone wanted a PUA script to be handled properly in sorting
 etc one would have to prepare collation tables which would obviously go
 *outside* the font.

Collation tables can aleady be tailored very easily with existing
technologies. And anyway this has nothing to do with directionality of
characters, or their rendering, on which they absolutely do not
depend.

Tailored collations already have a working standard and syntax in the
CLDR project or ICU and in a few other libraries (notably in CPAN for
Perl).



Re: Code pages and Unicode

2011-08-22 Thread Ken Whistler

On 8/22/2011 9:58 AM, Jean-François Colson wrote:

I wonder whether you aren’t a little too optimistic.


No. If anything I'm assuming that the folks working on proposals will
be amazingly assiduous during the next decade.



Have you considered the unencoded ideographic scripts?


Why, yes I have.



1,071 hieroglyphs have already been encoded. I think there are 
approximately 4,000 more to encode.


A preliminary listing of 4548 additional hieroglyphs, based on 
Hieroglyphica (1993), was
presented to WG2 in 1999. Twelve years have passed, and no additional 
document has
been forthcoming to work through the issues in standardizing such a list 
as characters.

I won't hold my breath, but somebody *might* get through that work by 2021.



1,165 Yi syllables and 55 Yi radicals have been encoded. But they only 
support one dialect of Yi and I read there are tens of thousands of Yi 
ideographs and that a proposal to encode 88,613 classical Yi 
characters was made 4 years ago.


88,613 classical Yi *glyphs*. This is just a collection of every glyph 
form noted

from wherever. Even the proponents acknowledged that it was more on the
order of maybe 7000 *characters* involved. They got feedback to do the 
homework
to work through the character/glyph model for classical Yi, and come 
back when
they have a documented, reliable listing of the Yi *characters* that 
need encoding,
together with the list of variants for each character. Given the nature 
and scope of
the work, and no (current) indication of the progress being made, this 
also *might*

get done by 2021.



The threshold of 200,000 characters doesn’t seem very far.


Nah. It is still way over the extended horizon. The only big historic 
ideographic
script that is close to being done is Tangut, and the wrangling even 
over that one

has gone on for years now.

--Ken





Re: RTL PUA?

2011-08-22 Thread John H. Jenkins

William_J_G Overington 於 2011年8月22日 下午12:36 寫道:

 On Monday 22 August 2011, John H. Jenkins jenk...@apple.com wrote:
 
 Forgive my asking, but this reference to the description section of the 
 Macintosh Roman section of a TrueType font has me puzzled, because I don't 
 know what you're talking about.  What table contains this string?
 
 When I use FontCreator, made by High-Logic, http://www.high-logic.com is the 
 webspace: with a font file open, I can select Format from the menu bar and 
 then select Naming... from the drop down menu.
 
 That leads to a dialogue panel.
 
 From that dialogue panel one may select, for an ordinary, basic Unicode font, 
 either of two platforms, namely Macintosh Roman and Microsoft Unicode BMP 
 only.
 
 Having selected a platform, one may view the text content of various fields 
 for that platform, such as font family name and copyright notice, version 
 string and postscript name. There is then a button that is labelled 
 Advanced... that, if clicked, opens another dialogue panel with various other 
 text fields, including Font Designer and Description, which are the two that 
 I often use.
 
 Now, when the text values in the fields are stored in the font file, the 
 values for the Macintosh Roman platform are stored in plain text and the 
 values for the Microsoft Unicode BMP only platform are stored in some encoded 
 format.
 
 So, if one opens a TrueType font file in WordPad and one searches for an item 
 of plain text that is in one of the fields of the font, then the text that is 
 in the Macintosh platform can be found, yet the text that is in the Microsoft 
 Unicode BMP only platform cannot be found.
 
 So, I thought that if a manufacturer of a wordprocessing application or a 
 desktop publishing application decided to make a special researcher's 
 edition of the software, then that software could, when a font is selected, 
 first scan the font for a PUA.RTL string and, if one is found, override the 
 left-to-right nature of the identified characters to be a right-to-left 
 nature, just while that font is selected.
 
 Whether such a software package ever becomes available is something that only 
 time will tell, yet it seems to me that it is a method that could be used 
 without needing any changes by any committee.
 

Ah.  You're referring to an entry in the 'name' table, then.  The intention of 
the 'name' table is to provide localizable strings for the UI.  Using it to 
store data of any sort for the rendering engine would be very, very 
inappropriate.  

In general, one should not be using a text editor to examine the contents of a 
TrueType font. It would be like using a text editor to examine the contents of 
an application.  Even if you see some plain text, you really don't have any 
sense for how it's actually being used.  

You may want to bone up on the structure of TrueType/OpenType fonts.

=
John H. Jenkins
井作恆
Жбь А. ЖЩэпЮьц
jenk...@apple.com







RE: RTL PUA?

2011-08-22 Thread Doug Ewell
There is more to displaying characters than LTR versus RTL, and there is
more to handling characters than just displaying them.  This point
continues to be lost on several people responding to this thread.

--
Doug Ewell | Thornton, Colorado, USA | RFC 5645, 4645, UTN #14
www.ewellic.org | www.facebook.com/doug.ewell | @DougEwell ­






RE: RTL PUA?

2011-08-22 Thread Doug Ewell
Philippe Verdy verdy underscore p at wanadoo dot fr wrote:

 Depending on how you count, there are already two to four fonts that
 support Ewellic in the PUA.  There are probably many more that
 support Tengwar or Cirth or Klingon.
 
 First, these fonts can work fine with the default LTR directionality.
 So there's no need for additional data for them. Second, even if they
 were RTL, the needed info for each of these fonts, embedded in them
 would be extremely small, reduced to just specifying the range of RTL
 characters they need to contain.

This isn't my point.  Multiple fonts can exist for PUA scripts and the
user should not have to be constrained to using just the one font which
happens to contain property information, because someone decided
properties should be stored in the font.

 So I don't see that as a problem. Those fonts do exist and are used
 exactly because there was no problem for rendering them with texts
 encoded in logical order (the same as the visual order).

Not my point.

 It's still
 strange that we can have several fonts for esoteric fonts that have
 been used effectively by very few people, when there are centuries of
 traditions, and many interested users (but spread in very small
 communities worldwide) that cannot use computer technologies to render
 their favorite scripts, or that want to teach them, or make books and
 other publications to expose them, as an important humane cultural
 heritage, even if this was only to translate them or transcribe them
 in a more modern script.

One person added Ewellic to his shareware font as an experiment, and I
paid another person to do a font for me.  Sorry if this was culturally
insensitive.

--
Doug Ewell | Thornton, Colorado, USA | RFC 5645, 4645, UTN #14
www.ewellic.org | www.facebook.com/doug.ewell | @DougEwell ­






RE: RTL PUA?

2011-08-22 Thread Doug Ewell
Shriramana Sharma samjnaa at gmail dot com wrote:

 Right, so if you embed that table in an OT font, the information is not
 available to a system that uses a font technology other than OT.
 
 I don't understand why you would say so -- assuming we are all talking 
 about TrueType fonts, AAT just uses some tables, OT others and Graphite 
 still others. They are all just tables appended to the TrueType font 
 data. Any software that is able to read TT font data can also read the 
 tables. So what's the problem?

OK, so it's obvious by now I'm not a font guy.

But I still maintain that there's more to proper handling of Unicode
characters, PUA or otherwise, than whether their directionality is LTR
or Arabic-RTL or non-Arabic-RTL or what have you.  That's why all those
other properties exist.  And I maintain that PUA users need a place to
store those other properties, and that the font doesn't seem like the
right place for non-display properties.

--
Doug Ewell | Thornton, Colorado, USA | RFC 5645, 4645, UTN #14
www.ewellic.org | www.facebook.com/doug.ewell | @DougEwell ­






Re: RTL PUA?

2011-08-22 Thread Philippe Verdy
2011/8/22 William_J_G Overington wjgo_10...@btinternet.com:
 Having selected a platform, one may view the text content of various fields 
 for that platform, such as font family name and copyright notice, version 
 string and postscript name. There is then a button that is labelled 
 Advanced... that, if clicked, opens another dialogue panel with various other 
 text fields, including Font Designer and Description, which are the two that 
 I often use.

 Now, when the text values in the fields are stored in the font file, the 
 values for the Macintosh Roman platform are stored in plain text and the 
 values for the Microsoft Unicode BMP only platform are stored in some encoded 
 format.

Note some encoded format. The strings are encoded using the encoding
specified in the platform selectors. The strings for the Macintish
Romain platform will be encoded using MacRoman. The strings for the MS
Unicode BMP platform will be encoded with the BMP part of UTF-16
(without support for surrogates). The strings for the Unicode platform
will use the UTF-32 encoding.

 So, if one opens a TrueType font file in WordPad and one searches for an item 
 of plain text that is in one of the fields of the font, then the text that is 
 in the Macintosh platform can be found:

It just happens that you are opening the TrueType font as if it was a
plain-text encoded with Windows-1252, or some other 8-bit encoding
based on ASCII.  You are also searching ASCII characters that are
encoded identically in Windows-1252 as well as in the MacRoman
encoding, so you find a match.

 yet the text that is in the Microsoft Unicode BMP only platform cannot be 
 found.

Because tou would have to insert null bytes in your search strings, to
find an exact match in an UTF-16 encoded string. Without these nulls,
you'll get no match. What you are doing is a search in a text loaded
after assuming the wrong encoding. TrueType fonts are binary
containers, that can mix several encodings for its plain-text
elements, but that also embed many other non-text data. This happens
even if your text editor is capable of loading Unicode-encoded texts
(this fails here if you try to load it as UTF-16, because the whole
TTF container cannot match the conformance requirements for correctly
encoded UTF-16 texts, for the whole document, but only for fragments
of it. On the opposite, there's no conformance problem if you try to
read the file as if it was Windows-1252 or ISO-8859-1...




ALM (was: Re: RTL PUA?)

2011-08-22 Thread Ken Whistler

On 8/21/2011 3:31 PM, Richard Wordingham wrote:

I expect ARABIC LANGUAGE MARK would not go down well
- has it already been proposed and rejected?.


ARABIC *LETTER* MARK, not *LANGUAGE* mark. (And suggested
to just be renamed to AL MARK.)

Proposed? Yes.

Discussed? Yes.

Rejected? No.

The last UTC meeting took a consensus to issue a public review
issue on the proposed ALM and ELM (embedding level mark)
characters. So there will be further discussion and chance for input.

Nothing has been decided yet.

--Ken




Re: RTL PUA?

2011-08-22 Thread Richard Wordingham
On Mon, 22 Aug 2011 07:51:22 -0700
Doug Ewell d...@ewellic.org wrote:

 Some PUA properties, like glyph shapes and maybe directionality, can
 be stored in a font.  Others, like numeric values and casing, might
 not or cannot.  An interchangeable format needs to be agreed upon for
 the properties in the latter category.

I suggest that the obvious format is that used for capturing the UCD in
XML.  Only the characters in which you are interested need be
specified.  One very important property for several scripts is the
script to which a character belongs.

One reason for associating properties with a font is that text that is
to be displayed is at that point tentatively associated with a font.
Another is that in a multi-font document, a PUA character could
have multiple implicit properties dependent on the font it appears in.

Richard. 




RE: RTL PUA?

2011-08-22 Thread Doug Ewell
Richard Wordingham richard dot wordingham at ntlworld dot com wrote:

 One reason for associating properties with a font is that text that is
 to be displayed is at that point tentatively associated with a font.

I thought John said fonts dealt with glyph IDs, not characters per se.

 Another is that in a multi-font document, a PUA character could
 have multiple implicit properties dependent on the font it appears in.

Normal, assigned characters don't change their Unicode properties
depending on font.  I don't see why PUA characters would be different.

--
Doug Ewell | Thornton, Colorado, USA | RFC 5645, 4645, UTN #14
www.ewellic.org | www.facebook.com/doug.ewell | @DougEwell ­






Re: RTL PUA?

2011-08-22 Thread N. Ganesan
On Sat, Aug 20, 2011 at 7:08 AM, Shriramana Sharma samj...@gmail.com
wrote:
 On 08/20/2011 01:57 PM, Martin Hosken wrote:

 D49 states that all properties of PUA characters are overridable by a
 higher protocol. But in 'normal' implementations, there are no higher
 level protocols to override the properties and so they use the
 defaults in the Unicode Database. So while in *theory* it's possible
 to override these values, nobody does. (This happens to also be the
 case with other tailoring algorithms in Unicode). Adding the
 configuration that tailoring requires is usually prohibitive and so
 it just doesn't get done.

 Good point -- Michael should note this.

 Somebody remarked that Apple Mac OS's rendering engine already
 supports an extended OT table which would signal that the glyphs in
 a PUA font are RTL. If other rendering don't support it, again it
 is not the fault of the standard.

 Is there a specificatino for that OT table? Are you implementing this
 in anything?

 Read a previous post by John Jenkins. He's the one who said they have a
 prop table in Apple's implemention of OT (or is it their own AAT) that
 enables one to do this.


Is this correct? that Apple solves the problem of RTL PUA user requirements?

See John Jenkins latest mail that says:
[Begin Quote]
To be honest, I don't know if using the 'prop' table to override
directionality for glyphs still works.  A quick-and-dirty test on Lion
suggests that it doesn't, so I may have spoken too quickly.  This is not a
part of the functionality of AAT which gets much exercise, so it's entirely
possible that it was lost at some point without anyone noticing.  In any
event, my apologies for raising any false hopes.
[End Quote]

Hope a new proposal or a UTN from UC will make things clear, and RTL
community benefits.

N. Ganesan

Jonathan Kew 於 2011年8月21日 上午10:48 寫道:

 On 21 Aug 2011, at 17:21, Behdad Esfahbod wrote:

 On 08/21/11 16:44, Shriramana Sharma wrote:

 BTW can John Jenkins show us a few entries from the prop table of some
font
 supporting the custom Apple PUA characters, especially the RTL and GC=No
ones?

 Like this?

 https://developer.apple.com/fonts/ttrefman/RM06/Chap6prop.html

 However, note that this documentation is very old, and does not make it
clear whether there is any support for overriding directionality in current
Mac OS X software.


Yes, it's very old, largely because we haven't done anything with the
structure of the 'prop' table for a long, long time.  Still, anything
referring to QuickDraw GX is obviously overdue for an update.

To be honest, I don't know if using the 'prop' table to override
directionality for glyphs still works.  A quick-and-dirty test on Lion
suggests that it doesn't, so I may have spoken too quickly.  This is not a
part of the functionality of AAT which gets much exercise, so it's entirely
possible that it was lost at some point without anyone noticing.  In any
event, my apologies for raising any false hopes.

=
井作恆
John H. Jenkins
jenk...@apple.com




 If the application doesn't do this and allows Graphite to break the
 text into runs, then Graphite can treat PUA characters as having BC
 other than L? /myunderstanding

 Yes that understanding is correct.

 Great! Could you then place some sample characters from your
 Scheherezade font in the PUA and render them RTL and show to us then
 Michael would be convinced.

 --
 Shriramana Sharma




Re: Code pages and Unicode

2011-08-22 Thread Richard Wordingham
On Mon, 22 Aug 2011 14:06:00 +0100 (BST)
William_J_G Overington wjgo_10...@btinternet.com wrote:

 On Monday 22 August 2011, Andrew West andrewcw...@gmail.com wrote:
  
  Can anyone think of a way to extend UTF-16 without adding new
  surrogates or inventing a new general category?
  
  Andrew
  
 How about a triple sequence of two high surrogates followed by one
 low surrogate? 

The problem is that a search for the character represented by the code
unit sequence (H2,L3) would also pick up the sequence (H1,H2,L3).
While there is no ambiguity, it does make searching more complicated
to code.  The same issue applies to the suggestion of using
(H1,H2,L3,L4) sequences.

Now, we could use (H1,H2,L3,L4) sequences and never assign the (H2,L3)
combinations.  They would therefore be category Cn, which currently
consists of both the unassigned characters and the non-characters.
However, I can't help feeling that they'd be almost a sort of
surrogate.  It's slightly more efficient to replace L3 by a single BMP
character.

Practically, I think that if we can change the semantics of the Myanmar
script, our descendants can go back on the guarantee of no more
surrogates.

Richard.



Re: Code pages and Unicode

2011-08-22 Thread Ken Whistler

On 8/22/2011 3:15 PM, Richard Wordingham wrote:

On Monday 22 August 2011, Andrew Westandrewcw...@gmail.com  wrote:
  

Can anyone think of a way to extend UTF-16 without adding new
surrogates or inventing a new general category?

Andrew
  
  How about a triple sequence of two high surrogates followed by one

  low surrogate?


How about Clause 12.5 of ISO/IEC 10646:

001B, 0025, 0040

You escape out of UTF-16 to ISO 2022, and then you can do whatever the
heck you want, including exchange and processing of complete 4-byte forms,
with all the billions of characters folks seem to think they need.

Of course you would have to convince implementers to honor the ISO 2022
escape sequence and liberate themselves into a high-level world of 
nosebleed

character numerosity. But then I guess by the time this is needed, folks are
counting on the need being self-evident. ;-)

--Ken




Re: Implement BIDI algorithm by line

2011-08-22 Thread li bo
 Yes, this is the algorithm I have read.  http://unicode.org/reports/tr9/
But I don't know why user must take a paragraph as a unit to determine the
embedding levels. Why can't i shape the text first and then wrapping the
line, and determining the embedding levels for characters within a line.
finally, reordering the characters within a line.
If a paragraph is too long, i think it's a big memory occupied. This would
be a limite in embedding system such as mobile phone.
On Tue, Aug 23, 2011 at 2:18 AM, CE Whitehead cewcat...@hotmail.com wrote:

  Hi.

 I think many line breaks within paragraphs are soft line breaks but that
 embedding levels have to be taken into account when deciding the width of
 the glyphs; that's as near as I can tell.

 Here is the description of the algorithm -- is this what you have read?
 http://unicode.org/reports/tr9/
 Some rules are in fact applied after the line wrapping (after the soft
 breaks) --
 The following rules describe the logical process of finding the correct
 display order. As opposed to resolution phases, these rules act on a
 per-line basis* and are applied after any line wrapping is applied to the
 paragraph.*
 Logically there are the following steps:

- The levels of the text are determined according to the previous
rules.
- The characters are shaped into glyphs according to their context *(taking
the embedding levels into account for mirroring).*
- The accumulated widths of those glyphs *(in logical order)* are used
to determine line breaks.
- For each line, rules L1 
 http://unicode.org/reports/tr9/#L1–L4http://unicode.org/reports/tr9/#L4are
  used to reorder the characters on that line.



 (I'd have to reread the whole document on line breaking then on bidi to
 answer this truely; sorry; hope this helps anyway)
 --C. E. Whitehead
 cewcat...@hotmail.com



Re: Implement BIDI algorithm by line

2011-08-22 Thread li bo
Sorry, Asmus, what do you mean?

On Tue, Aug 23, 2011 at 2:44 AM, Asmus Freytag asm...@ix.netcom.com wrote:

 Huh? What context is this in?


 On 8/22/2011 11:18 AM, CE Whitehead wrote:

 Hi.

 I think many line breaks within paragraphs are soft line breaks but that
 embedding levels have to be taken into account when deciding the width of
 the glyphs; that's as near as I can tell.

 Here is the description of the algorithm -- is this what you have read?
 http://unicode.org/reports/tr9/
 Some rules are in fact applied after the line wrapping (after the soft
 breaks) --
 The following rules describe the logical process of finding the correct
 display order. As opposed to resolution phases, these rules act on a
 per-line basis* and are applied after any line wrapping is applied to the
 paragraph.*
 Logically there are the following steps:

- The levels of the text are determined according to the previous
rules.
- The characters are shaped into glyphs according to their context *(taking
the embedding levels into account for mirroring).*
- The accumulated widths of those glyphs *(in logical order)* are used
to determine line breaks.
- For each line, rules L1 
 http://unicode.org/reports/tr9/#L1–L4http://unicode.org/reports/tr9/#L4are
  used to reorder the characters on that line.



 (I'd have to reread the whole document on line breaking then on bidi to
 answer this truely; sorry; hope this helps anyway)
 --C. E. Whitehead
 cewcat...@hotmail.com





Re: RTL PUA?

2011-08-22 Thread Shriramana Sharma

On 08/23/2011 03:29 AM, N. Ganesan wrote:

Hope a new proposal or a UTN from UC will make things clear, and RTL
community benefits.


Dear Ganesan,

I wonder if you have actually understood all the issues here. As usual 
you have done your copy-paste from somebody else's post. Please say 
something if you have something to actually contribute instead of just 
saying I support Oriya OM I support PUA RTL or such.


If you support PUA RTL, and since you are so interested in Grantha, you 
should do a proposal for regions in the PUA to be allocated proper 
IndicMatraCategory properties so that today we can put Grantha in the 
PUA and get it rendered properly by existing rendering engines.


--
Shriramana Sharma



Re: Code pages and Unicode

2011-08-22 Thread Jean-François Colson

On 23/08/11 00:15, Richard Wordingham wrote:

The problem is that a search for the character represented by the code
unit sequence (H2,L3) would also pick up the sequence (H1,H2,L3).
While there is no ambiguity, it does make searching more complicated
to code.  The same issue applies to the suggestion of using
(H1,H2,L3,L4) sequences.


And what dou you think about (H1,H2,VS1,L3,L4)?