Re: Character Sequences of Uncertain Rendering (was: Version linking?)

2017-08-27 Thread Philippe Verdy via Unicode

Actually the matras in question in the first message were neither
left-to-right nor right-to-left; they were two-part vowels, repeatedly
encoded after a base letter.
Malayalam itself is left-to-right, but this only makes sense for the order
of base letters. Matras encoded after a base letter are placed around it
according to the script rules, but two-part vowels cause problems if several
are used. We know how to order the right parts that are postposed, but there's
no clear order for the left parts that are preposed (including when there
are also preposed one-part vowels).

This is similar to the problem of defining the stacking order when
multiple diacritics above (or below) all compete for the same position.
Generally the option is to render them ordered from the innermost to the
outermost position, so successive diacritics normally positioned above
should stack vertically upward. But there are known exceptions where they
do not stack vertically but horizontally, either left-to-right or
right-to-left, and some cases where their order is also reversed.

There are only common positions and stacking options, which should be used
by default in the absence of any kind of joiner between them. For all other
cases, we need additional joiner controls between them if this is not the
default. But here, what is the default for the uncommon case where there are
multiple occurrences of the same two-part matra? In my opinion, their
respective left parts and right parts should still be ordered from
innermost to outermost, so the left parts will be rendered right-to-left
and the right parts will be rendered left-to-right.

Here the problem is that Firefox performs this only for a limited
number (2) of preposed one-part vowels, preposed diacritics, or preposed
left parts (of two-part vowels). So after rendering the first two matras,
there's no space left for the third matra, which is then rendered
entirely after the cluster, in a separate cluster (missing a base
consonant, so you see the dotted circle glyph in the middle). IE does seem
to do things correctly by supporting more left-side preposed matras or
left-side preposed "half-matras": it first decomposes each two-part matra
into two pseudo-matras, one for each part, and then orders the first
pseudo-matra like other preposed vowels, all by default right-to-left (i.e.
from innermost to outermost when you place the center of view on the base
letter).

But there are no special joiners encoded in Unicode to override the placement
(direction) or relative order of diacritics competing for the same position.
If one were used, it should be encoded just before that diacritic, but
two-part diacritics are even more challenging as they could need
one or two separate overrides (for the left part, the right part,
or both!).

However for the case given above, it makes no sense to use what Google
Chrome currently renders for "കോോോ" (U+0D15, followed by 3 occurrences of
U+0D4B).
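
For reference, the sequence under discussion can be inspected with Python's
standard unicodedata module. This is only a sketch to show which code points
are involved; the variable names are illustrative and not part of the thread.

```python
import unicodedata

# The test word: U+0D15 followed by three occurrences of U+0D4B.
word = "\u0d15" + "\u0d4b" * 3

for ch in word:
    # Print code point, canonical combining class and character name.
    print(f"U+{ord(ch):04X} ccc={unicodedata.combining(ch)} {unicodedata.name(ch)}")

# U+0D4B has a canonical decomposition into two vowel signs,
# which correspond to its left part and its right part.
parts = unicodedata.normalize("NFD", "\u0d4b")
part_names = [unicodedata.name(p) for p in parts]
print(part_names)
```

All four code points have combining class 0, so normalization never reorders
them relative to each other; the placement of the left and right parts is
purely a rendering matter.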

To make it clear, I'll use ASCII-only notation: M for the base letter
(U+0D15), CD for the two-part diacritic U+0D4B, and o for the dotted
circle.
- When we encode M+CD, the rendering should be "CMD". It is OK in all
browsers.
- When we encode M+CD+CD we also see "CCMDD" everywhere, including in
Chrome and Firefox.
- Then comes the encoding M+CD+CD+CD, which IE correctly renders as
"CCCMDDD", but Chrome and Firefox cannot render this correctly: they first
render M+CD+CD as "CCMDD", then comes CD left alone without a base
consonant, so a dotted circle is inserted and we see "CoD" as a glued (but
now separate) cluster. The final result is "CCMDDCoD" (which is still not
breakable when trying to select it with keyboard/mouse/touch).

I think this is caused by the algorithm used in the Chrome and Firefox
renderers, which offers at most two positions for preposed parts when
computing the reordered layout of glyphs. IE does this correctly by not
limiting the number of preposed glyphs or by using a higher limit (I did not
test arbitrarily long sequences of preposed vowels or two-part
vowels, i.e. 4 of them or more).
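
The two behaviours described above can be modelled with a small Python
sketch. It uses the same letters as the quoted renderings (M is the base, C
and D are the left and right parts, o is the dotted circle). It is a toy
model of the observed output, not the code of any real shaper: the function
names, the max_preposed parameter and the handling of more than three vowels
are assumptions.

```python
def unlimited_layout(n_vowels: int) -> str:
    """Left parts stack outward to the left, right parts outward to the right."""
    return "C" * n_vowels + "M" + "D" * n_vowels


def limited_layout(n_vowels: int, max_preposed: int = 2) -> str:
    """Model a shaper that accepts at most max_preposed left parts per cluster.

    Assumption: overflow vowels start a new cluster on a dotted circle "o",
    as observed for the third vowel in the thread.
    """
    first = min(n_vowels, max_preposed)
    out = "C" * first + "M" + "D" * first
    remaining = n_vowels - first
    while remaining > 0:
        take = min(remaining, max_preposed)
        out += "C" * take + "o" + "D" * take
        remaining -= take
    return out


for n in (1, 2, 3):
    print(n, unlimited_layout(n), limited_layout(n))
```

With three vowels the first function gives the rendering attributed to IE
and the second gives the one attributed to Chrome and Firefox.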

I know that IE/Edge is now capable of rendering very high stacks of diacritics
(this was probably implemented for the Tibetan script, or for
supporting mathematical notations).

But still, overriding the default direction of stacking is unspecified in
Unicode, except for a few documented cases where some joiner controls are
used (for the "liquid" vowels that we consider as consonants in Latin, and
that will be present in words borrowed to Indic languages in their script
using matras) to alter the restation of stacking (but without complex glyph
reordering)

Consider also the case of the acute accent in Greek, whose default position is
altered when it occurs contextually with capital letters, from
above, to the left. so  is reordered as
, but most Greek fonts will render like their
precombined

Re: Character Sequences of Uncertain Rendering (was: Version linking?)

2017-08-27 Thread Richard Wordingham via Unicode

On Sun, 27 Aug 2017 19:55:31 +0200
Philippe Verdy via Unicode  wrote:

> 2017-08-27 6:06 GMT+02:00 Richard Wordingham via Unicode <
> unicode@unicode.org>:  

> Canonical reordering is unambiguously referring to the canonical
> equivalences in TUS. These are automated and can occur at any time,
> and the only way to avoid them is to insert joiners. But they should
> never be needed for normal texts, except to split clusters or
> introduce semantic differences where they are relevant (and in that
> case the renderers will also try to distinguish them, otherwise they
> can freely reorder every sequence of diacritics with distinct
> non-zero combining classes and will represent all canonically
> equivalent sequences exactly the same way without distinguishing them).

This wasn't the sort of problem I was talking about.  The Indic
example with undefined rendering has two left matras with ccc=0.  The
question was whether they should be displayed from left to right (as in
MS Edge) or right to left (as in Firefox).
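
A short Python check illustrates why normalization does not settle this.
The two Malayalam left matras below are chosen purely as an illustration
(the thread does not say which matras the original example used), and the
variable names are hypothetical.

```python
import unicodedata

KA = "\u0d15"       # MALAYALAM LETTER KA
SIGN_E = "\u0d46"   # MALAYALAM VOWEL SIGN E (a left matra)
SIGN_EE = "\u0d47"  # MALAYALAM VOWEL SIGN EE (a left matra)

one_order = KA + SIGN_E + SIGN_EE
other_order = KA + SIGN_EE + SIGN_E

# Both matras have canonical combining class 0 ...
classes = [unicodedata.combining(c) for c in (SIGN_E, SIGN_EE)]

# ... so normalization keeps the two encodings distinct.
still_distinct = (
    unicodedata.normalize("NFD", one_order)
    != unicodedata.normalize("NFD", other_order)
)
print(classes, still_distinct)
```

Since the two encodings are not canonically equivalent, the display order
of the glyphs is left entirely to the renderer.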

The problem of diacritics below having different combining classes has
been raised for minority languages in Thai.  There seems a definite
prospect that the rendering order has to depend on the writing system -
and the other order would simply be wrong.  Standardisation occurs
outside the purview of the UTC.  The order may be forced by CGJ,
which is a joiner in name only when it occurs before combining marks.
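
As a sketch of that last point, the Python snippet below uses two Thai marks
below with different combining classes. The choice of SARA U and PHINTHU is
an assumption made for illustration; the thread does not name the marks at
issue in the minority languages.

```python
import unicodedata

KO_KAI = "\u0e01"   # THAI CHARACTER KO KAI
SARA_U = "\u0e38"   # THAI CHARACTER SARA U, ccc=103
PHINTHU = "\u0e3a"  # THAI CHARACTER PHINTHU, ccc=9
CGJ = "\u034f"      # COMBINING GRAPHEME JOINER, ccc=0

# Without CGJ, normalization moves the lower class (PHINTHU) first.
typed = KO_KAI + SARA_U + PHINTHU
reordered = unicodedata.normalize("NFD", typed)

# With CGJ between the marks, the typed order survives normalization.
protected = KO_KAI + SARA_U + CGJ + PHINTHU
kept = unicodedata.normalize("NFD", protected)

print([f"U+{ord(c):04X}" for c in reordered])
print([f"U+{ord(c):04X}" for c in kept])
```

Whether a font then displays the two orders differently is a separate
question that normalization cannot answer.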

Richard.

Re: Character Sequences of Uncertain Rendering (was: Version linking?)

2017-08-27 Thread Philippe Verdy via Unicode

2017-08-27 6:06 GMT+02:00 Richard Wordingham via Unicode <
unicode@unicode.org>:

> On Sat, 26 Aug 2017 21:52:19 +0200
> Philippe Verdy via Unicode  wrote:
>
> > 2017-08-26 21:28 GMT+02:00 Richard Wordingham via Unicode <
> > unicode@unicode.org>:
>
> > Of course SHY in this use is not suitable, but who knows if one will
> > not need this to split in two parts what would be otherwise a single
> > cluster (possibly reordered by canonical reordering if one needs to
> > split between two Indic matras: this would suggest there's a need for
> > a new "empty base consonant" for that Indic script, but SHY (U+00AD)
> > should probably not have the correct effect if it also inserts an
> > undesired line break opportunity, independently of how the glyph
> > which would be rendered and the position (first or second line) where
> > it would be rendered if the linebreak is honored).
>
> I am confused as to what conceivable case you have in mind.  An example
> would help.  I wonder if I'm misunderstanding what you mean by
> 'canonical reordering'.

Canonical reordering unambiguously refers to the canonical
equivalences in TUS. These are automated and can occur at any time, and the
only way to avoid them is to insert joiners. But joiners should never be
needed for normal texts, except to split clusters or introduce semantic
differences where they are relevant (and in that case the renderers will
also try to distinguish them; otherwise they can freely reorder every
sequence of diacritics with distinct non-zero combining classes and will
represent all canonically equivalent sequences exactly the same way without
distinguishing them).
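
A minimal Python illustration of this, using Latin combining marks picked
only as an example: marks with distinct non-zero combining classes are
canonically equivalent in either order, while marks of the same class are
not.

```python
import unicodedata

ACUTE = "\u0301"      # COMBINING ACUTE ACCENT, ccc=230 (above)
DIAERESIS = "\u0308"  # COMBINING DIAERESIS, ccc=230 (above)
DOT_BELOW = "\u0323"  # COMBINING DOT BELOW, ccc=220 (below)


def nfd(text: str) -> str:
    """Return the canonical decomposition (with canonical reordering)."""
    return unicodedata.normalize("NFD", text)


# Distinct non-zero classes: both orders normalize to the same sequence.
different_classes_equal = nfd("a" + ACUTE + DOT_BELOW) == nfd("a" + DOT_BELOW + ACUTE)

# Same class: the typed order is significant and is preserved.
same_class_equal = nfd("a" + ACUTE + DIAERESIS) == nfd("a" + DIAERESIS + ACUTE)

print(different_classes_equal, same_class_equal)
```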

Re: Character Sequences of Uncertain Rendering (was: Version linking?)

2017-08-26 Thread Richard Wordingham via Unicode

On Sat, 26 Aug 2017 21:52:19 +0200
Philippe Verdy via Unicode  wrote:

> 2017-08-26 21:28 GMT+02:00 Richard Wordingham via Unicode <
> unicode@unicode.org>:  

> Of course SHY in this use is not suitable, but who knows if one will
> not need this to split in two parts what would be otherwise a single
> cluster (possibly reordered by canonical reordering if one needs to
> split between two Indic matras: this would suggest there's a need for
> a new "empty base consonant" for that Indic script, but SHY (U+00AD)
> should probably not have the correct effect if it also inserts an
> undesired line break opportunity, independently of how the glyph
> which would be rendered and the position (first or second line) where
> it would be rendered if the linebreak is honored).

I am confused as to what conceivable case you have in mind.  An example
would help.  I wonder if I'm misunderstanding what you mean by
'canonical reordering'.  Do you mean the order of codepoints, or the
arrangement of glyphs?  CGJ is available to preserve a specific
ordering of codepoints, though it is completely redundant in most Indic
scripts.

It is a fact that aksharas do get split between lines in manuscripts,
undesirable though it may be.  In a transcription intended to preserve
a division into lines, one would probably use NBSP at such a point,
and worry less about attempting to preserve the structure of the
line-broken akshara.  It seems that Unicode only supports word
boundaries and their absence where they provide or prohibit line
breaks.

Richard.

Re: Character Sequences of Uncertain Rendering (was: Version linking?)

2017-08-26 Thread Philippe Verdy via Unicode

2017-08-26 21:28 GMT+02:00 Richard Wordingham via Unicode <
unicode@unicode.org>:

>
> I'm wondering if there are any cases where a SHY _should_ go between a
> Latin letter and diacritic.  I can't think of any.
>

In standard Latin orthography you would not expect it, normally, but there
will be cases where this will still occur at random places between long
spans of letters.

However I did NOT suggest (like you are doing here) using SHY between a
Latin letter and any diacritic.

But maybe you've been confused by the fact I took the example of free
insertion of SHY controls in alphabetic scripts in comparison to the free
insertion of joiner controls (not the same thing) between Indic letters
(including vowel matras or subjoined consonants that are encoded as
combining characters but are not really "diacritics").

Of course SHY in this use is not suitable, but who knows if one will not
need this to split in two parts what would otherwise be a single cluster
(possibly reordered by canonical reordering if one needs to split between
two Indic matras: this would suggest there's a need for a new "empty base
consonant" for that Indic script, but SHY (U+00AD) would probably not
have the correct effect if it also inserts an undesired line break
opportunity, independently of which glyph would be rendered and the
position (first or second line) where it would be rendered if the line break
is honored).

If one wants an empty base letter to combine with the diacritic after it,
I think it should be NBSP (U+00A0), to avoid the interpretation as a
"defective" cluster using an implied glyph such as the dotted circle (but
NBSP also has its own problems, notably for collation, where it would
collate like a space instead of being ignorable at primary level; this can
be fixed quite easily in collation tailorings, however, using collation
elements made with "NBSP+combining matra").
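
A rough Python sketch of the NBSP idea follows. The function name is
hypothetical, and the test for a "defective" sequence is deliberately
simplified: a combining mark counts as lacking a base only at the start of
the text or directly after whitespace.

```python
import unicodedata

NBSP = "\u00a0"


def add_nbsp_base(text: str) -> str:
    """Insert NBSP before a combining mark that has no base to attach to."""
    out = []
    prev = None
    for ch in text:
        # General categories Mn, Mc and Me are the combining marks.
        is_mark = unicodedata.category(ch).startswith("M")
        # NBSP itself counts as a base, so the function is idempotent.
        prev_is_base = prev is not None and (prev == NBSP or not prev.isspace())
        if is_mark and not prev_is_base:
            out.append(NBSP)
        out.append(ch)
        prev = ch
    return "".join(out)


isolated = add_nbsp_base("\u0d4b")          # matra alone
attached = add_nbsp_base("\u0d15\u0d4b")    # matra on a real base letter
print([f"U+{ord(c):04X}" for c in isolated])
print([f"U+{ord(c):04X}" for c in attached])
```

The collation side of the suggestion (tailoring NBSP plus matra as a single
collation element) is not covered by this sketch.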

Character Sequences of Uncertain Rendering (was: Version linking?)

2017-08-26 Thread Richard Wordingham via Unicode

On Fri, 25 Aug 2017 01:24:36 +0200
Philippe Verdy via Unicode  wrote:

> 2017-08-17 22:37 GMT+02:00 Richard Wordingham via Unicode <
> unicode@unicode.org>:  
> 
> > Fortunately, there is no good evidence that the occurrence
> > of multiple distinct left matras is anything but a typing error,
> > though I can easily see how it might be used as a lexicographical
> > convention on the fuzzy edge of plain text.
> >
> > In a similar vein, in Malayalam, we get repeats of the 2-part vowel
> > U+0D4B MALAYALAM VOWEL SIGN OO (see Cibu Johny's report at
> > https://lists.freedesktop.org/archives/harfbuzz/2013-February/002945.html
> > ),
> > but I'm not sure what the legitimate encodings of the example word
> > കോോോ (typed here as ) are.

> Even if there were typing errors, the input method should either
> signal it visually to the user (using canonical reordering), or the
> user could still cancel this reordering (e.g. CTRL+Z for undoing it)
> and the input method could still fix it and maintain the order by
> then inserting combining joiners automatically even if the user did
> not enter them directly.

I don't see how any of ZWJ, ZWNJ and CGJ would help multiple
distinct left matras or repeated 2-part vowels. You might argue for
insertion of U+25CC as a base consonant, along with the ability to
delete just it.

> The joiners should better be removed transparently by the text editor
> without requiring the user to perform complex selections or pressing
> BACKSPACE multiple times, as I don't see any use of these joiners at
> end of graphemes, or multiple joiners in a sequence.

I believe  has a rôle in some Arabic script writing systems,
and possibly in other cursive Semitic scripts, such as Mongolian.
 is required at some syllable boundaries, and it is nice
to have ZWNJ honoured in the sequence , which is composed of two
extended grapheme clusters,  and .  This latter,
of course, is no more than one would require of good Latin typography
that works well with an English spell-checker - I would expect 'caecum'
to have a ligature but not 'sundae'.

> Even for Latin, one can freely enter SHY controls at any place within
> words, even if they are not at correct syllabic separations: this will
> impact the rendering if there are linebreaks, but this is done on
> purpose, and still easy to correct if this was made by error (a spell
> checker could also help locate these uncommon errors in existing
> texts but would not automatically correct them without instruction
> given by the user and a user can also choose to ignore/discard these
> signals and store the text as is).

Now that brings to mind some interesting cases -  and .  I'm not sure where the
handling should go, but Firefox handles the former reasonably.  My one
gripe is that I don't know how to tell the system that a rendered soft
hyphen is invisible.  Some typographers claim that the glyph for the
soft hyphen (i.e. the glyph for U+00AD) should be used when it becomes
manifest.  I haven't found any cases where a line break should go
between a left matra and a base consonant, but I wouldn't be surprised
to encounter an example in a manuscript in a phonetically ordered
script.  (They are far from unknown in Thai, but that's probably due
to software deficiencies.)  TUS treats the rendering of soft hyphens as
beyond its scope except for line-breaking - the rules are
language-dependent and beyond the scope of Unicode.  I don't know if
CLDR handles rendering around line-breaking soft hyphens.

I'm wondering if there are any cases where a SHY _should_ go between
a Latin letter and diacritic.  I can't think of any.
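
One concrete consequence of placing SHY there can be checked in Python. The
helper below is a hypothetical sketch that locates a SHY directly followed
by a combining mark; the normalization calls show that such a SHY also
stops the letter and the diacritic from composing.

```python
import unicodedata

SHY = "\u00ad"  # SOFT HYPHEN


def shy_before_mark(text: str) -> list[int]:
    """Return the indexes of each SHY that is directly followed by a mark."""
    return [
        i
        for i in range(len(text) - 1)
        if text[i] == SHY and unicodedata.category(text[i + 1]).startswith("M")
    ]


plain = "e\u0301"              # e + COMBINING ACUTE ACCENT
split = "e" + SHY + "\u0301"   # the same with SHY in between

composed_plain = unicodedata.normalize("NFC", plain)
composed_split = unicodedata.normalize("NFC", split)

print(shy_before_mark(plain), shy_before_mark(split))
print(len(composed_plain), len(composed_split))
```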

Richard.
