Guys, does nobody read the bloody Standard anymore!?

You CAN currently add a diacritic on top of a double diacritic. The "other" 
base character is called the Combining Grapheme Joiner (U+034F).

From V. 5.0, ch 7.9:

Occasionally one runs across orthographic conventions that use a dot, an acute 
accent, or other simple diacritic above a ligature tie - that is, U+0361 
Combining Double Inverted Breve. Because of the considerations of canonical 
ordering [...], one cannot represent such text simply by putting a combining 
dot above or combining acute directly after U+0361 in the text. Instead, the 
recommended way of representing such text is to place U+034F Combining Grapheme 
Joiner (CGJ) between the ligature tie and the combining mark that follows it, as

0075 + 0361 + 034F + 0301 + 0069 .

Because CGJ has a combining class of zero, it blocks reordering of the double 
diacritic to follow the second combining mark in canonical order. The sequence 
of <CGJ, acute> is then rendered with default stacking, placing it centered 
above the ligature tie. This convention can be used to create similar effects 
with combining marks above other double diacritics (or below double diacritics 
that render below base characters).
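
The reordering described in that passage can be checked with a short sketch 
using Python's standard unicodedata module. The variable and helper names 
here are my own, chosen for illustration; only the code points come from the 
Standard's example.

```python
import unicodedata

TIE = "\u0361"    # COMBINING DOUBLE INVERTED BREVE (the ligature tie)
CGJ = "\u034F"    # COMBINING GRAPHEME JOINER
ACUTE = "\u0301"  # COMBINING ACUTE ACCENT


def code_points(s):
    """Format a string as space-separated hexadecimal code points."""
    return " ".join(f"{ord(c):04X}" for c in s)


# Canonical combining classes: the tie and the acute are both non-zero,
# so they can be reordered relative to each other; CGJ is zero.
classes = {name: unicodedata.combining(ch)
           for name, ch in (("TIE", TIE), ("CGJ", CGJ), ("ACUTE", ACUTE))}

without_cgj = "u" + TIE + ACUTE + "i"
with_cgj = "u" + TIE + CGJ + ACUTE + "i"

nfd_without = unicodedata.normalize("NFD", without_cgj)
nfd_with = unicodedata.normalize("NFD", with_cgj)

print(classes)
print(code_points(nfd_without))  # the acute moves in front of the tie
print(code_points(nfd_with))     # the order is preserved
```

Without CGJ the acute ends up attached to the "u"; with CGJ the sequence 
survives normalization unchanged.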

--------------------------------------------------------------------------------
"Philippe Verdy" wrote: 
First encode each base (unjoined) extended grapheme clusters 
separately (possibly with their own diacritics or extenders or 
prependers, including ZWJ and ZWNJ, according to their definition in 
the UAX defining text segmentations). 


Then encode the double diacritic between them. 


So for your examples you get <006F, 035D, 006F> (double breve) or 
<006F, 035E, 006F> (double macron). 


Double diacritics have a combining property equal to zero, so they 
block the reordering for canonical equivalences and the relative order 
and independence for the encoding of base grapheme clusters will be 
preserved during normalizations. 


As a consequence, if there's another diacritic added on top of the 
double diacritic, it can only be added at end of this sequence, but 
the bad thing is that it will appear just after the encoding of the 
second base grapheme cluster, and so it is subject to reordering, as 
it will be interpreted as being part itself of the second grapheme 
clusters. 


Currently you cannot add another diacritic on top of a double 
diacritic, we lack something for blocking such interpretation in the 
second cluster. 


To do that, we would need another base character with combining 
property 0 (blocking canonical reorderings), and that would have the 
same grouping semantic as other double diacritics. This character 
would be abstract (and invisible by itself) and could be something 
like: 


  U+xyzt DOUBLE DIACRITIC HOLDER 


For example to add an acute accent above the double breve joining the 
two letters 'o', we would encode: 


  <006F, 035D, 006F, xyzt, 0301> 


instead of just <006F, 035D, 006F, 0301> which is canonically 
equivalent to <006F, 035D, 00F3> and which encodes the letter 'o' and 
the letter 'o' with an acute accent (centered on this second o) joined 
with the double breve *above* the acute accent of the second 'o'. 
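
The canonical equivalence claimed here can be verified with a small sketch 
based on Python's standard unicodedata module (the variable names are 
illustrative only):

```python
import unicodedata

# <006F, 035D, 006F, 0301>: o, double breve, o, combining acute
decomposed = "o\u035Do\u0301"
# <006F, 035D, 00F3>: o, double breve, precomposed o-acute
precomposed = "o\u035D\u00F3"

nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", precomposed)

# Both forms normalize to the same sequences, so the trailing acute
# belongs to the second 'o' and not to the double breve.
print(nfc == precomposed, nfd == decomposed)
```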


My opinion is that such a double diacritic holder already exists: it's ZWJ, 
which could be safely used as the needed invisible base for additional 
diacritics occurring on top (and centered) of a double diacritic. But 
others may have other preferences about the choice of this character. 


I don't know if ZWJ has been specified so that it could occur only 
before a "defective" combining sequence containing only combining 
diacritics. For this case, this would mean that the semantic of the 
combining diacritics encoded after it must apply to the full part of 
the extended grapheme cluster encoded before it. 


This use of ZWJ effectively allows the interpretation of the encoded 
sequence as if it was in TeX syntax: 


  \acute{ \breve{oo} } 


Without the ZWJ, it would be interpreted as: 


  \breve{ o\acute{o} } 


The double diacritics are just intended to be used between each pair of base 
grapheme clusters to join. And they could possibly be used to group more 
than 2 base graphemes, for example with 3 'o' as: 


  <006F, 035D, 006F, 035D, 006F> 


interpreted in TeX syntax as: \breve{ooo} 


But even in this case, you won't be able to encode with the ZWJ trick 
in plain text, such groupings that are expressed this way in TeX: 


  \breve{ \breve{oo} x \breve{ o\acute{o} } } 


Because double diacritics encoded in Unicode can't be safely stacked 
together (for such application you'll need a rich-text layer on top of 
Unicode, such as TeX here). 


Philippe. 


--------------------------------------------------------------------------------
verdy_p wrote:


I just thought about a solution to allow stacking of double-diacritics: we 
could use variation selectors after them, 
to specify a higher level of grouping. 


So in the example above: 
- "\breve{oo}" remains encoded as: 
- "x" remains encoded as: 
- "o \acute{o}" remains encoded as: followed by or 
- "\breve{o \acute{o}}" remains encoded as: 


And to stack a second level of breves, we could use between those three groups: 




Even software that does not know how to create the layout would still consider 
this long sequence as an unbreakable extended 
grapheme cluster, and its important relative ordering will be preserved by 
normalizations. Here also you'll be able to 
add other single diacritics on top of the double breves... 


This way, you may stack up to 256 additional levels of double diacritics in a 
structured layer that will be 
preserved as a single extended grapheme cluster. 


Software that doesn't know what to do with the variation selectors will ignore 
them, and will treat all double breves 
above as equal, so they will render something like this in TeX: 


\breve{ oo x o \acute{o} } 


in a single grouping (not so bad after all...) 


BUT! Such variation sequences have NOT been allocated in the Unicode registry 
for this purpose. I think that such an 
application should use something other than variation selectors, which are 
intended to represent glyphic variants for 
the individual double diacritics. 


And I think that this could be done by allocating instead, in the special plane 
15, a block for STACKING selectors 
(or more generally GROUPING LEVELS), with exactly the same properties as 
variation selectors, except that they won't 
require a prior registration for their use in association with double 
diacritics. 


Such selectors could possibly be used to encode two-dimensional structures like 
those used in Egyptian hieroglyphs, 
which already use the default horizontal layout and would require only a single 
additional vertical stacking. For 
example: 


- generates the TeX equivalent of: "\hiero{1} \hiero{2}" : this is the normal 
horizontal reading 



- generates the TeX-like equivalent of: "\vstack{ \hiero{1} \hiero{2} }" : this 
is the 
vertical stacking behavior, and needs a joiner-like character to preserve the 
unbreakable "extended grapheme 
cluster". 


But when both horizontal and vertical layout are used, the direction of 
stacking in complex groupings must be 
disambiguated, and would require two distinct characters. We could use ZWJ for 
grouping with horizontal layout 
(within a larger vertically stacked compound), and ZWNJ for grouping with 
vertical layout. So we would encode here 
for this second case. 


Now if the structure is more complex, we'll need several levels of grouping, 
both for the horizontal and the 
vertical joiners. Adding a GROUPING LEVEL (acting exactly like a variation 
selector), encoded just after ZWJ or ZWNJ 
(using the special codepoint in plane 15, encoded as a combining character with 
combining class 0) would solve the 
representation problem. 


For example (HIERO1-HIERO2:HIERO3)-HIERO4:HIERO5 (using the WikiHiero 
notation), whose layout is similar to: 


+--------+--------+--------+
| HIERO1 | HIERO2 |        |
+--------+--------+ HIERO4 |
|     HIERO3      |        |
+-----------------+--------+
|          HIERO5          |
+--------------------------+


could be encoded as: 




And it will still match the definition of extended grapheme clusters, while 
also fully preserving the semantic 
composition and structure of the cluster : 


* The absence of a grouping level selector means that the horizontal or 
vertical joiners are acting at level 0. 
* Sequences encoded at the same grouping level using ZWJ separators are 
assuming the horizontal layout for 
hieroglyphs 
* Those encoded at the same grouping level with ZWNJ are assuming the vertical 
layout. 
* ZWJ (horizontal layout) has a higher grouping priority than ZWNJ if they 
occur simultaneously at the same level. 
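
The precedence rule in the list above (horizontal grouping binds tighter than 
vertical grouping at the same level) can be sketched with a toy parser for the 
WikiHiero-like notation used in this message, where "-" stands in for the 
horizontal joiner and ":" for the vertical one. The parser, its name, and its 
tuple output are hypothetical illustrations only; they are not part of 
WikiHiero or of Unicode, and parentheses here play the role of the proposed 
grouping levels.

```python
import re


def parse(expr):
    """Parse '-' (horizontal, binds tighter) and ':' (vertical) with parentheses.

    Returns nested tuples: ("H", ...) for a horizontal row,
    ("V", ...) for a vertical stack, or a bare sign name.
    """
    tokens = re.findall(r"[():\-]|[^():\-\s]+", expr)
    pos = 0

    def atom():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            node = vertical()
            pos += 1  # skip the closing ')'
            return node
        return tok

    def horizontal():
        nonlocal pos
        items = [atom()]
        while pos < len(tokens) and tokens[pos] == "-":
            pos += 1
            items.append(atom())
        return items[0] if len(items) == 1 else ("H", *items)

    def vertical():
        nonlocal pos
        items = [horizontal()]
        while pos < len(tokens) and tokens[pos] == ":":
            pos += 1
            items.append(horizontal())
        return items[0] if len(items) == 1 else ("V", *items)

    return vertical()


grouped = parse("(HIERO1-HIERO2:HIERO3)-HIERO4:HIERO5")
flat = parse("HIERO1-HIERO2:HIERO3-HIERO4:HIERO5")
print(grouped)
print(flat)
```

With the explicit grouping, HIERO1 and HIERO2 sit above HIERO3 and that block 
sits beside HIERO4; without it, the same signs fall into three stacked rows.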


If the grouping level selectors are not supported by the layout engine, it will 
just try to honor ZWJ and ZWNJ 
(ignoring the specified grouping levels) as if it was only encoded as: 




which is the actual encoding (in WikiHiero syntax) of 
(HIERO1-HIERO2:HIERO3-HIERO4:HIERO5) 


+--------+--------+ 
| HIERO1 | HIERO2 | 
+--------+--------+ 
| HIERO3 | HIERO4 | 
+--------+--------+ 
|     HIERO5      |
+-----------------+ 


And if the vertical stacking is not supported by the layout engine, it will 
also ignore the ZWJ and ZWNJ, and so 
will render the five hieroglyphs linearly, in fact ignoring only the 
vertical layers by drawing them in three 
successive spans as: 


+-----------------+-----------------+--------+
|  HIERO1 HIERO2  |  HIERO3 HIERO4  | HIERO5 |
+-----------------+-----------------+--------+


Which is, for now, all that Unicode officially documents. 


But what I don't like in such a use of ZWJ and ZWNJ is that they are not 
intended for controlling the layout, 
but instead for hinting the presence or absence of ligatures. Are compound 
layouts such as those used in hieroglyphs to 
be considered as special graphic ligatures? 


I think that they represent something much stronger than what ZWJ and ZWNJ 
represent. But there are precedents for 
such strong semantic assignments to ZWJ and ZWNJ in Indic scripts. I don't 
see why what is already used to 
control the semantics (and partially the graphic appearance) in Indic scripts 
(in a way specific to those scripts) 
couldn't also be used here specifically for hieroglyphs, which really need such 
strong semantics, even if they certainly 
don't need other kinds of ligatures. 


Adding the generic ZWJ, ZWNJ (optionally followed by the generic grouping level 
selectors) to the hieroglyphic 
script will not alter the way it is already encoded. But at least it will be 
possible to preserve the hieroglyph 
semantics in plain-text, without depending on an unspecified syntax. 


So my discussion here proposes only one addition for encoding as new 
characters in Unicode: 


- adding a new block of grouping selectors in the special plane 15. In my 
opinion, a single row of 16 grouping level 
selectors (acting in addition to the implicit level 0) will be enough for all 
situations. They MUST have combining 
class 0, and might be ignorable, just like variation selectors, except that 
they don't imply any glyph modification 
for the characters that are encoded in the composite "default grapheme 
cluster". They must have a general category 
of "zero-width" combining characters (probably Mn), and must be *optionally* 
ignorable in collations. They should 
not be format controls (in general category C) because they would then be 
ignored in all cases in collations. 


- the addition of 2 generic horizontal/vertical grouping characters may be 
discussed: can we override ZWJ and ZWNJ? If not, 
then ZWJ/ZWNJ + a grouping level may also be encoded as a single Unicode 
character, with the same general properties 
as ZWJ and ZWNJ, all in the same allocated block in the special plane. 


Only the vertical groupings will be used to stack vertically the double 
diacritics or to stack other diacritics on 
top of a double diacritic. 


This is left to discussions as several options are possible, before one can be 
implemented somewhere, tested, and 
finally recommended. 


I'm not asking to add grouping selectors immediately, if existing variation 
selectors can safely be used on top of 
ZWJ and ZWNJ, and if ZWJ/ZWNJ can be used in some scripts (like Egyptian 
hieroglyphs) to encode their semantic 2D 
layout. 


Philippe. 



