https://bugzilla.wikimedia.org/show_bug.cgi?id=3821


Philippe Verdy <verd...@wanadoo.fr> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |verd...@wanadoo.fr




--- Comment #18 from Philippe Verdy <verd...@wanadoo.fr>  2010-01-19 16:17:38 UTC ---
Re comment #17 (Tisza Gergő): you write:
''in case of multiple characters in the charinsert block, all but the first are
thrown away''

Certainly not, this would be exactly against the purpose of the initial bug, as
we really need the possibility of inserting a sequence of '''several'''
characters when they do not exist as precomposed characters.

My proposal would be similar to yours, but the text element content of the
charinsert tag would be '''fully''' inserted.

But given that the text content of charinsert is a list of characters,
the best we can do is to allow separating them with spaces (including the
possibility of using an initial space to avoid collision and possible
precomposition with the trailing ">" character that terminates the charinsert
start tag). Then split this text on spaces, ignoring leading and trailing spaces.
Each remaining sequence becomes a candidate for insertion in the list.
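
As an illustration only, the splitting rule above could be sketched in
JavaScript as follows. The function name splitCandidates is hypothetical, and
the sketch assumes that only XML whitespace (space, tab, CR, LF) separates
entries, so that a non-breaking space can still be part of a candidate.

```javascript
// Split the text content of a <charinsert> block into insertion candidates.
// Assumption: only XML whitespace separates candidates; U+00A0 and other
// Unicode spaces are kept, because they may themselves need to be inserted.
function splitCandidates(text) {
  return text.split(/[ \t\r\n]+/).filter((candidate) => candidate !== "");
}

// Each candidate may be several characters long (e.g. base + combining mark).
console.log(splitCandidates(" č Č e\u0301 "));
```

Filtering out empty strings is what discards the leading and trailing spaces,
so no separate trimming step is needed.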

Note that the characters in the text content may also be written as numeric
character references (to their Unicode code point value in decimal or
hexadecimal). For full XML conformance (and for easier editing in edit tools
when the characters are not found in the native script and language of the host
wiki), these numeric character references '''must''' still be interpreted
equivalently.
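
A minimal sketch of that equivalence, assuming the references have not already
been resolved by an XML parser (in which case nothing needs to be done). The
function name decodeNumericRefs is hypothetical; it only handles the XML forms
&#NNN; and &#xHHH;.

```javascript
// Replace decimal (&#269;) and hexadecimal (&#x10d;) character references
// with the characters they denote.
function decodeNumericRefs(text) {
  return text.replace(/&#(?:x([0-9a-fA-F]+)|([0-9]+));/g, (match, hex, dec) => {
    const codePoint = hex !== undefined ? parseInt(hex, 16) : parseInt(dec, 10);
    // Leave references beyond the Unicode range untouched instead of throwing.
    return codePoint <= 0x10ffff ? String.fromCodePoint(codePoint) : match;
  });
}

console.log(decodeNumericRefs("&#x25cc;&#x30c;") === "\u25CC\u030C");
```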

But I would favor another syntax where the text content of the charinsert tag
would always be the displayed string, using an attribute only to qualify some
spans and limit the characters that will be actually inserted. For example:
<charinsert>č Č š Š ž Ž <char title="hacek"
insert="ˇ">&#x25cc;ˇ</char></charinsert>

Note that for isolated diacritics (combining characters), the base character
displayed before them in the selector (needed only there, and not to be
inserted) should probably be implied. Above I use U+25CC (◌) DOTTED CIRCLE,
which is the recommended one (and supported in many fonts), to avoid
confusion with the actual precomposed letters, which should be clickable
directly and should be visibly distinct. In that case the insert attribute used
above (which specifies which string will actually be inserted in the edited
text when the displayed character is selected) could be avoided completely:
<charinsert>č Č š Š ž Ž <comb title="hacek">ˇ</comb></charinsert>

If some diacritics cannot be used with the default dotted circle, an
"ignorable" (but displayed) quoted substring could be specified:
<charinsert>č Č š Š ž Ž <comb
title="hacek"><q>&#x25cc;</q>ˇ</comb></charinsert>

Note that the separate <comb> element is not really different from the <char>
element above. If the intent is to insert a single Unicode character, the fact
that the referenced character is combining can be inferred directly from the
Unicode character properties (there are not so many combining characters in
Unicode, so they could easily be detected in JavaScript from a small
preinitialized array of booleans indexed by character, for fast lookups; note
also that some Unicode combining characters are themselves decomposable, so the
actual characters can also be any string that starts with a combining
character). In that case, this reduces the code to just:

<charinsert>č Č š Š ž Ž <span
title="hacek"><q>&#x25cc;</q>ˇ</span></charinsert>

when specifying the base character explicitly (U+25CC here, even if this is
the default), or

<charinsert>č Č š Š ž Ž <span title="hacek">ˇ</span></charinsert>

when using the default (the string to insert will just ignore the substrings
between <q> and </q>, and if the resulting string still starts with a combining
character, it will be displayed with a leading U+25CC DOTTED CIRCLE implicitly).
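
The detection of a leading combining character and the implied dotted circle
could be sketched as below. This is only one possible approach: instead of the
preinitialized boolean array mentioned above, it assumes a JavaScript engine
with Unicode property escapes and approximates "combining" by the general
category Mark (Mn, Mc, Me). The names startsWithCombining and displayForm are
hypothetical.

```javascript
const DOTTED_CIRCLE = "\u25CC";

// True when the string begins with a character of general category Mark.
// Assumption: this category test stands in for the lookup table.
function startsWithCombining(text) {
  return /^\p{M}/u.test(text);
}

// What the selector would show: the dotted circle is prepended only when the
// string to insert starts with a combining character.
function displayForm(text) {
  return startsWithCombining(text) ? DOTTED_CIRCLE + text : text;
}

console.log(displayForm("\u030C")); // U+030C COMBINING CARON on a dotted circle
console.log(displayForm("č"));      // precomposed letter, shown unchanged
```

Note that the spacing caron U+02C7 is a modifier letter, not a combining mark,
so under this test it would be shown without a dotted circle.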

However, I still think that using a separate <comb> element will be more
explicit, and will allow a different presentation that can be customized (for
example the display tool could display the diacritic after a non-breaking space
U+00A0 instead of a dotted circle, and use a distinctive background color,
or could display it within a table cell with a thin dotted border, according to
the site's stylesheet or the user's preferences).

So I argue for:

<charinsert>č Č š Š ž Ž <comb>ˇ</comb></charinsert>

(the simplest form), with the following optional extensions:

<charinsert>č Č š <char title="S with caron">Š</char> ž Ž <comb
title="hacek"><q>&nbsp;</q>ˇ</comb></charinsert>

The content of charinsert will be a free list of text elements or <char>
elements or <comb> elements. Here is the DTD:

<!ELEMENT charinsert (#PCDATA | char | comb)* >
<!-- char: force the normal presentation -->
<!ELEMENT char (#PCDATA | q)* >
<!-- comb: force the alternate presentation for combining characters or
strings starting with one -->
<!ELEMENT comb (#PCDATA | q)* >
<!ELEMENT q (#PCDATA) >
<!-- title: default is empty -->
<!ATTLIST comb
   title CDATA #IMPLIED
>
<!ATTLIST char
   title CDATA #IMPLIED
>

Note also that ***not all*** diacritics are combining in Unicode: this is the
case for Thai, which is encoded in visual order without using combining
characters for leading diacritics, but these will often still not display
correctly if they are used before an arbitrary character (with which they may
create ligatures, or they could simply be displayed with an undesired trailing
dotted circle generated by the text renderer or by the font).

There is also the need to support the insertion of other "invisible"
characters, notably format controls, and to render, in the display tool, various
spaces and make them easily distinguishable (for example in clickable table
cells).
In those cases, it may even be desirable not to display at all the character
that will be inserted when the table cell is clicked (for example, "ZWJ",
"ZWNJ", "NBSP"...). Generally, in those cases, a separate label will be used
instead of the character itself, and this label should probably be displayed
in a smaller font within the table cell.

The "title" attribute is not made for this, as its role is to give a helper hint
which may be much longer than what is displayed in the clickable table cell
(and too large to fit there cleanly). Such a hint will be displayed and made
accessible elsewhere (for example in a "bubble", or on the browser's status
bar, or in any other more convenient single element within the HTML page
whose content will be refreshed to display this hint string when the table
cell is active or hovered over with the mouse; it may also be vocalized, or
directed (out of flow) to the contextual helper line of a Braille reader,
according to the local preferences of the user, who can reserve a part of the
display pad for displaying HTML "title" hints or image descriptions). The title
attribute is therefore descriptive and can be arbitrarily long (it could be
several sentences), and is not intended to contain visual abbreviations like
"NBSP".

To support an alternate representation of the character using abbreviations
(which will replace the actual rendering of the character within the table
cell), we can use another optional attribute. It may even be preferable to make
visual distinctions for format controls:

<!-- new definition here: adding ctrl -->
<!ELEMENT charinsert (#PCDATA | char | comb | ctrl)* >
<!-- ctrl: force the alternate presentation for controls or strings starting
with one -->
<!ELEMENT ctrl (#PCDATA | q)* >
<!-- title, alt: default is empty -->
<!ATTLIST ctrl
   title CDATA #IMPLIED
   alt CDATA #IMPLIED
>
<!ATTLIST comb
   title CDATA #IMPLIED
   alt CDATA #IMPLIED
>
<!ATTLIST char
   title CDATA #IMPLIED
   alt CDATA #IMPLIED
>

For example:

<charinsert><char title="non-breaking space"
alt="NBSP">&nbsp;</char></charinsert>
<charinsert><comb title="acute accent" alt="ʹ">&#x301;</comb></charinsert>
<charinsert><comb title="dotted circle">&#x25CC;</comb></charinsert>
<charinsert><ctrl title="zero-width space"
alt="ZWSP">&#x200B;</ctrl></charinsert><!-- this is not a format control -->
<charinsert><ctrl title="zero-width non-joiner"
alt="ZWNJ">&#x200C;</ctrl></charinsert><!-- this is a format control -->
<charinsert><ctrl title="left-to-right override"
alt="»">&#x202D;</ctrl></charinsert>
<charinsert><ctrl title="right-to-left override"
alt="«">&#x202E;</ctrl></charinsert>
<charinsert><char title="narrow non-breaking space"
alt="NNBSP">&#x202F;</char></charinsert><!-- this is not a control, it is
visible! -->

Separating the <char>, <comb>, and <ctrl> elements allows distinct presentations
when needed (such as distinct table cell background colors in a character
selector). The <q> element will only be usable when there is no "alt="
attribute, since that attribute removes the display of the actual character
within the rendered table cell.

The U+25CC default base character (DOTTED CIRCLE) for sequences starting with a
combining character will be implied and generated automatically by default, if:
* the text content within the <comb> or <ctrl> or <char> (including the text
within <q> elements, which is also rendered by simple concatenation but
discarded from the actually inserted text) starts with a combining character
(even if its "combining class" is 0, which just means that it can never be
precomposed with a prior base character, and never be reordered through
normalization);
* and there is no alt attribute in the <comb> or <ctrl> or <char> element start
tag.
It will also be used implicitly if the characters within <charinsert> are
packed in a space-separated string without surrounding <comb> or <ctrl> or
<char> to subqualify them (meaning that by default they are treated as if each
space-separated sequence was within a <char> element with unspecified "alt="
and "title=" attributes).
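
To make the two conditions concrete, here is a hedged sketch of how a display
tool might derive the cell label and the inserted string for one entry. The
entry shape ({ alt, parts: [{ text, quoted }] }) and the name resolveEntry are
assumptions made for this illustration, not part of the proposal; "combining"
is again approximated by the general category Mark.

```javascript
// entry.parts lists the text runs of a <char>, <comb> or <ctrl> element in
// order; quoted is true for runs that came from a <q> child.
function resolveEntry(entry) {
  const dottedCircle = "\u25CC";
  // Text inside <q> is displayed but never inserted.
  const inserted = entry.parts
    .filter((part) => !part.quoted)
    .map((part) => part.text)
    .join("");
  const rendered = entry.parts.map((part) => part.text).join("");

  let label;
  if (entry.alt !== undefined) {
    label = entry.alt; // alt replaces the rendering of the character itself
  } else if (/^\p{M}/u.test(rendered)) {
    label = dottedCircle + rendered; // implied base for a leading mark
  } else {
    label = rendered;
  }
  return { label, inserted };
}

// <comb><q>&nbsp;</q>(U+030C)</comb>: shown after NBSP, only the mark is inserted.
console.log(resolveEntry({
  parts: [{ text: "\u00A0", quoted: true }, { text: "\u030C", quoted: false }],
}));
```

A bare space-separated sequence would be passed as a single unquoted part with
no alt, which matches the default <char> treatment described above.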

It will also be illegal to use <q> directly within the content of <charinsert>,
that is, anywhere except within the content of <char> or <ctrl> or <comb>, as
it would not be clear which of the surrounding space-separated character
sequences it belongs to. Such sequences must be delimited, at least by a <char>
without attributes.

The text content of a single <char> or <comb> or <ctrl> element can contain any
text; it is not restricted to a single Unicode character. It can also
include spaces (however, the spaces are implicitly packed, with leading and
trailing spaces discarded). If one still wants to be able to include a literal
space within the text to insert, one that must not be discarded completely, I
propose adding the <space> element:

<!ELEMENT space EMPTY>
<!-- title, alt: default is empty -->
<!ATTLIST space
   title CDATA #IMPLIED
   alt CDATA #IMPLIED
>

And allow it within the content of <charinsert>, <ctrl>, <comb> and <char> :

<!-- new definition here: adding space -->
<!ELEMENT charinsert (#PCDATA | space | char | comb | ctrl)* >
<!-- ctrl: force the alternate presentation for controls or strings starting
with one -->
<!ELEMENT ctrl (#PCDATA | space | q)* >
<!-- char: force the normal presentation -->
<!ELEMENT char (#PCDATA | space | q)* >
<!-- comb: force the alternate presentation for combining characters or
strings starting with one -->
<!ELEMENT comb (#PCDATA | space | q)* >
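
The space packing and the effect of <space> could be sketched as follows. This
is an illustration under stated assumptions: the content of an element is given
as an array of text runs and a marker for each <space> child, the names SPACE
and packText are hypothetical, and packed whitespace next to a <space> is kept
as an ordinary single space (the proposal does not settle that case).

```javascript
// Marker standing for one <space/> element in the element content.
const SPACE = Symbol("space");

function packText(parts) {
  // Assumption: this private-use character never occurs in real input.
  const placeholder = "\uE000";
  const joined = parts
    .map((part) => (part === SPACE ? placeholder : part))
    .join("");
  // Pack runs of XML whitespace and drop leading and trailing whitespace.
  const packed = joined
    .split(/[ \t\r\n]+/)
    .filter((piece) => piece !== "")
    .join(" ");
  // Only now turn each <space> marker into a literal space, so it survives.
  return packed.split(placeholder).join(" ");
}

console.log(JSON.stringify(packText(["a", SPACE, "b"])));
console.log(JSON.stringify(packText([SPACE])));
```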

All these definitions are easy to parse through XML DOM in the Wiki parser. I
hope they are precise enough. Comments are welcome.

Philippe.

