https://bugzilla.wikimedia.org/show_bug.cgi?id=3821
Philippe Verdy <[email protected]> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |[email protected]

--- Comment #18 from Philippe Verdy <[email protected]> 2010-01-19 16:17:38 UTC ---
Ref #17 (Tisza Gergő): you write:

''in case of multiple characters in the charinsert block, all but the first are thrown away''

Certainly not. This would go exactly against the purpose of the initial bug, as we really need the possibility of inserting a sequence of '''several''' characters when they do not exist as precomposed characters.

My proposal would be similar to yours, but the text content of the charinsert tag would be '''fully''' inserted. Given that the text content of charinsert is a list of characters, the best we can do is to allow separating them with spaces (including the possibility of using an initial space to avoid the collision, and possible precomposition, with the trailing ">" character that terminates the charinsert start tag). Then split this text by spaces, ignoring leading and trailing spaces. Each remaining sequence becomes a candidate for insertion in the list.

Note that the characters in the text content may also be entered as numeric character references (to their Unicode code point value in decimal or hexadecimal). For full XML conformance (and easier editing in edit tools when the characters are not found in the native script and language of the host wiki), these numeric character references '''must''' still be interpreted equivalently.

But I would favor another syntax where the text content of the charinsert tag would always be the displayed string, using an attribute only to qualify some spans and limit the characters that will actually be inserted.
For example:

  <charinsert>č Č š Š ž Ž <char title="hacek" insert="ˇ">◌ˇ</char></charinsert>

Note that for isolated diacritics (combining characters), the character displayed before them in the selector (which is only needed there and should not be part of the inserted text) should probably be implied. Above I use U+25CC (◌) DOTTED CIRCLE, which is the recommended one (and is supported in many fonts), to avoid confusion with the actual precomposed letters, which should be directly clickable and visibly distinct.

In that case, the insert attribute used above (which specifies the string that will actually be inserted in the edited text when the displayed character is selected) could be avoided completely:

  <charinsert>č Č š Š ž Ž <comb title="hacek">ˇ</comb></charinsert>

If some diacritics cannot be used with the default dotted circle, an "ignorable" (but displayed) quoted substring could be specified:

  <charinsert>č Č š Š ž Ž <comb title="hacek"><q>◌</q>ˇ</comb></charinsert>

Note that the separate <comb> element is not really different from the <char> element above. If the intent is to insert a single Unicode character, the fact that the referenced character is combining can be inferred directly from the Unicode character properties (there are not so many combining characters in Unicode, so they could easily be detected in JavaScript from a small preinitialized array of booleans indexed by character, for fast lookups; note also that some Unicode combining characters are themselves decomposable, so the string to insert can be any string that starts with a combining character).
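A minimal sketch of the property lookup mentioned above, written in Python rather than JavaScript for brevity. It assumes that "combining" means any character whose Unicode general category is a mark (Mn, Mc or Me), whatever its combining class; the names is_combining, starts_with_combining and COMBINING_BMP are illustrative only and do not come from any existing extension.

```python
import unicodedata


def is_combining(ch: str) -> bool:
    """True when ch is a Unicode mark (general category Mn, Mc or Me)."""
    return unicodedata.category(ch).startswith("M")


def starts_with_combining(text: str) -> bool:
    """True when a non-empty string begins with a combining mark."""
    return bool(text) and is_combining(text[0])


# Preinitialized lookup table for the Basic Multilingual Plane, one byte
# per code point (1 = combining), built once so later lookups are a
# plain index operation.
COMBINING_BMP = bytearray(is_combining(chr(cp)) for cp in range(0x10000))
```

Testing the general category rather than a non-zero combining class keeps marks whose combining class is 0 in the set, while spacing letters stay out of it.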
If the combining property is inferred in this way, the markup reduces to just:

  <charinsert>č Č š Š ž Ž <span title="hacek"><q>◌</q>ˇ</span></charinsert>

when specifying the base character explicitly (U+25CC here, even if this is the default), or

  <charinsert>č Č š Š ž Ž <span title="hacek">ˇ</span></charinsert>

when using the default (the string to insert will just ignore the substrings between <q> tags, and if the resulting string still starts with a combining character, a leading U+25CC DOTTED CIRCLE will be displayed implicitly).

However, I still think that using a separate <comb> element will be more explicit, and will allow a different presentation that can be customized (for example, the display tool could display the diacritic after a non-breaking space U+00A0 instead of a dotted circle, and use a distinctive background color, or display it within a table cell with a thin dotted border, according to the site's stylesheet or the user's preference).

So I advocate for:

  <charinsert>č Č š Š ž Ž <comb>ˇ</comb></charinsert>

(the simplest form), with the following optional extensions:

  <charinsert>č Č š <char title="S with caron">Š</char> ž Ž <comb title="hacek"><q> </q>ˇ</comb></charinsert>

The content of charinsert will be a free list of text elements, <char> elements, or <comb> elements.
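As a rough sketch of the <q> rule described above (the displayed string keeps the <q> text, the inserted string drops it), here is one way it could be computed. It assumes Python's standard xml.etree.ElementTree parser; the helper name display_and_insert is hypothetical, and whitespace packing is deliberately left out.

```python
import xml.etree.ElementTree as ET


def display_and_insert(elem):
    """Return (displayed string, inserted string) for one element.

    Text inside <q> children is shown in the selector but is not
    part of the string that gets inserted into the edited text.
    """
    display = [elem.text or ""]
    insert = [elem.text or ""]
    for child in elem:
        if child.tag == "q":
            display.append(child.text or "")  # shown only
        # Text following the child belongs to both strings.
        display.append(child.tail or "")
        insert.append(child.tail or "")
    return "".join(display), "".join(insert)


# Explicit dotted circle in <q>, followed by a combining caron.
comb = ET.fromstring('<comb title="hacek"><q>\u25cc</q>\u030c</comb>')
shown, inserted = display_and_insert(comb)
```

Here shown holds the dotted circle followed by the caron, while inserted holds the caron alone.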
Here is the DTD:

  <!ELEMENT charinsert (#PCDATA | char | comb)* >
  <!-- char forces the normal presentation -->
  <!ELEMENT char (#PCDATA | q)* >
  <!-- comb forces the alternate presentation for combining characters or strings starting with one -->
  <!ELEMENT comb (#PCDATA | q)* >
  <!ELEMENT q (#PCDATA) >
  <!-- title is optional; default is empty -->
  <!ATTLIST comb title CDATA #IMPLIED >
  <!ATTLIST char title CDATA #IMPLIED >

Note also that ***not all*** diacritics are combining in Unicode. This is the case for Thai, which is encoded in visual order without using combining characters for leading diacritics; these will still often not display correctly if they are used before any random character (with which they may create ligatures, or they could simply be displayed with an undesired trailing dotted circle generated by the text renderer or by the font).

There is also the need to support the insertion of other "invisible" characters, notably format controls, and to render, in the display tool, various spaces and make them easily distinguishable (for example in clickable table cells). In those cases, it may even be desirable not to display at all the character that will be inserted when the table cell is clicked (for example, "ZWJ", "ZWNJ", "NBSP"...). Generally, in those cases, there will be a separate label that will be used instead of the character itself, and this label should probably be displayed in a smaller font within the table cell.

The "title" attribute is not made for this, as its role is to give a helper hint, which may be much longer than what is displayed in the clickable table cell (and too large to fit there cleanly).
Such a hint will be displayed and made accessible elsewhere (for example in a "bubble", or on the browser's status bar, or in any other more convenient single element within the HTML page whose content will be refreshed to display this hint string when the table cell is active or hovered over with the mouse; it may also be vocalized, or directed (out of flow) to the contextual help line of a Braille reader, according to the local preferences of its user, who can reserve a part of the display pad for displaying HTML "title" hints or image descriptions). The title attribute is thus descriptive and can be arbitrarily long (it could be several sentences); it is not intended to contain visual abbreviations like "NBSP".

To support an alternate representation of the character using abbreviations (which will replace the actual rendering of the character within the table cell), we can use another optional attribute. It may even be preferable to make visual distinctions for format controls:

  <!-- new definition here: adding ctrl -->
  <!ELEMENT charinsert (#PCDATA | char | comb | ctrl)* >
  <!-- ctrl forces the alternate presentation for controls or strings starting with one -->
  <!ELEMENT ctrl (#PCDATA | q)* >
  <!-- title and alt are optional; default is empty -->
  <!ATTLIST ctrl title CDATA #IMPLIED
                 alt   CDATA #IMPLIED >
  <!ATTLIST comb title CDATA #IMPLIED
                 alt   CDATA #IMPLIED >
  <!ATTLIST char title CDATA #IMPLIED
                 alt   CDATA #IMPLIED >

For example:

  <charinsert><char title="non-breaking space" alt="NBSP"> </char></charinsert>
  <charinsert><comb title="acute accent" alt="ʹ">́</comb></charinsert>
  <charinsert><comb title="dotted circle">◌</comb></charinsert>
  <charinsert><ctrl title="zero-width space" alt="ZWSP">​</ctrl></charinsert><!-- this is not a format control -->
  <charinsert><ctrl title="zero-width non-joiner" alt="ZWNJ">‌</ctrl></charinsert><!-- this is a format control -->
  <charinsert><ctrl title="left-to-right override" alt="»">‪</ctrl></charinsert>
  <charinsert><ctrl title="right-to-left override" alt="«">‫</ctrl></charinsert>
  <charinsert><char title="narrow non-breaking space" alt="NNBSP">‿</char></charinsert><!-- this is not a control, it is visible! -->

Separating the <char>, <comb>, and <ctrl> elements allows distinct presentations when needed (such as distinct table cell background colors in a character selector). The <q> element will only be usable when there is no "alt=" attribute to suppress the display of the actual character within the rendered table cell.

The U+25CC default base character (DOTTED CIRCLE) for sequences starting with a combining character will be implied and generated automatically by default, if:

* the text content within the <comb> or <ctrl> or <char> element (including the text within <q> elements, which is also rendered by simple concatenation but discarded from the actually inserted text) starts with a combining character (even if its "combining class" is 0, which just means that it can never be precomposed with a prior base character, and never be reordered through normalization).
* and there is no alt attribute in the <comb> or <ctrl> or <char> element start tag.

It will also be used implicitly if the characters within <charinsert> are packed in a space-separated string without a surrounding <comb>, <ctrl> or <char> element to qualify them (meaning that by default they are treated as if each space-separated sequence was within a <char> element with unspecified "alt=" and "title=" attributes).

It will also be illegal to use <q> directly within the content of <charinsert>, outside the content of <char>, <ctrl> or <comb>, as it would not be clear which of the surrounding space-separated character sequences it should be associated with. Such sequences must be delimited, at least within a <char> without attributes.
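The rules above (implied dotted circle, alt taking over the display, bare text treated as attribute-less <char> entries, <q> rejected outside an element) can be sketched as follows. This is only an illustration under stated assumptions: it uses Python's standard xml.etree.ElementTree, treats any character of general category M* as combining, splits bare text on ASCII whitespace only, and the names make_entry and parse_charinsert are hypothetical. Numeric character references need no special handling here because the XML parser expands them.

```python
import re
import unicodedata
import xml.etree.ElementTree as ET

DOTTED_CIRCLE = "\u25cc"
KINDS = ("char", "comb", "ctrl")


def make_entry(kind, shown, inserted, title=None, alt=None):
    """Build one selector cell: what is displayed versus what is inserted."""
    if alt is not None:
        label = alt  # alt replaces the rendering, so no dotted circle
    elif shown and unicodedata.category(shown[0]).startswith("M"):
        label = DOTTED_CIRCLE + shown  # implied base for a leading mark
    else:
        label = shown
    return {"kind": kind, "label": label, "insert": inserted, "title": title}


def parse_charinsert(xml_text):
    """Turn a <charinsert> fragment into a list of selector cells."""
    root = ET.fromstring(xml_text)
    entries = []

    def add_bare(text):
        # Each space-separated run acts like an attribute-less <char>.
        for token in re.findall(r"[^ \t\r\n]+", text or ""):
            entries.append(make_entry("char", token, token))

    add_bare(root.text)
    for child in root:
        if child.tag not in KINDS:
            # Covers a stray <q> directly inside <charinsert>.
            raise ValueError(f"<{child.tag}> is not allowed directly in <charinsert>")
        shown = inserted = child.text or ""
        for sub in child:
            if sub.tag != "q":
                raise ValueError(f"<{sub.tag}> is not allowed inside <{child.tag}>")
            shown += (sub.text or "") + (sub.tail or "")
            inserted += sub.tail or ""  # <q> text is displayed, not inserted
        entries.append(
            make_entry(child.tag, shown, inserted, child.get("title"), child.get("alt"))
        )
        add_bare(child.tail)
    return entries
```

Keeping the displayed label and the inserted string as separate fields means the selector can restyle cells per element kind without touching what a click inserts.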
The text content of a single <char>, <comb> or <ctrl> element can contain any text; it is not restricted to a single Unicode character. It can also include spaces; however, the spaces are implicitly packed, with leading and trailing spaces discarded. If one still wants to be able to include a literal space within the text to insert, one that must not be discarded completely, I propose adding the <space> element:

  <!-- title and alt are optional; default is empty -->
  <!ELEMENT space EMPTY >
  <!ATTLIST space title CDATA #IMPLIED
                  alt   CDATA #IMPLIED >

and allowing it within the content of <charinsert>, <ctrl>, <comb> and <char>:

  <!-- new definition here: adding space -->
  <!ELEMENT charinsert (#PCDATA | space | char | comb | ctrl)* >
  <!-- ctrl forces the alternate presentation for controls or strings starting with one -->
  <!ELEMENT ctrl (#PCDATA | space | q)* >
  <!-- char forces the normal presentation -->
  <!ELEMENT char (#PCDATA | space | q)* >
  <!-- comb forces the alternate presentation for combining characters or strings starting with one -->
  <!ELEMENT comb (#PCDATA | space | q)* >

All these definitions are easy to parse through the XML DOM in the wiki parser. I hope they are precise enough. Comments are welcome.

Philippe.
