Complex Combining

Peter Kirk Thu, 27 Nov 2003 09:59:46 -0800

On 27/11/2003 05:00, Philippe Verdy wrote:

...

        encircle(<DIGIT 9, DIGIT 2, DIGIT 3, DOT, DIGIT 0>)
        == <DIGIT 9, DOUBLE ENCLOSING CIRCLE,
            DIGIT 2, DOUBLE ENCLOSING CIRCLE,
            DIGIT 3, DOUBLE ENCLOSING CIRCLE,
            DOT, DOUBLE ENCLOSING CIRCLE,
            DIGIT 0>

Here there is no ZWJ character; it is the double diacritic that
explicitly creates the ligature between the preceding and following
base characters.
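
To make the convention above concrete, here is a minimal sketch in Python.
It is only an illustration: Unicode defines no character named DOUBLE
ENCLOSING CIRCLE, so a Private Use Area code point stands in for it, and
the function name encircle is simply taken from the example above.

```python
# Hypothetical stand-in: Unicode has no "DOUBLE ENCLOSING CIRCLE".
# A Private Use Area code point is used purely for illustration.
DOUBLE_ENCLOSING_CIRCLE = "\uE000"


def encircle(text: str, mark: str = DOUBLE_ENCLOSING_CIRCLE) -> str:
    """Place the double mark between each pair of adjacent characters.

    Produces <c1, mark, c2, mark, ... mark, cn>. Each code point of
    `text` is treated as one base character; input that already carries
    combining marks would need grapheme segmentation first.
    """
    return mark.join(text)


encoded = encircle("923.0")
print([hex(ord(ch)) for ch in encoded])
```

Five base characters come out as nine code points, since a mark is needed
between every adjacent pair.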

None of these solutions is specified in the standard. This is purely a
convention for using Unicode, and until some enhancement to the Unicode
character model is published that clearly defines ranges of characters to
which diacritics can be applied, without the too-simple ZWJ control, the
interpretation of text encoded this way will remain application-dependent.

This is all rather interesting speculation. There are surely a lot of potential cases in scripts where some kind of combining mark can be considered as applying to a sequence of an arbitrary number of characters. For example:

* Enclosing circles, squares and ellipses.
* Continuous underlines and overlines.
* Continuous tildes, slurs, contour tone marks etc. which may apply to several characters or whole words.
* The cartouche in Egyptian hieroglyphs, which surrounds a group of several characters.
* A number of mathematical functions, e.g. fraction dividers, extensions to root signs.
* Combining marks which are supposed to be centred over or under two or more characters or even a whole word, like the Hebrew masora circle.

Now I am sure it could be argued that some of these are not plain text and so should be dealt with by higher level markup. But maybe some of these need to be considered as part of plain text; for example, it is at least conceivable, and arguably true of the Egyptian cartouche, that these marks are required for proper understanding of the plain text, just as much as regular letters and combining marks are.

So how should they be represented? Philippe's suggestion of <c1, mark, c2, mark, c3, mark... mark, cn> would seem to work, but could be very inefficient. Jill's alternative <bracket1, c1, c2, c3... cn, bracket2, mark> is more efficient for long sequences. But perhaps better would be to have paired opening and closing marks: <mark1, c1, c2, c3... cn, mark2> - although this requires a new pair of characters for each such case.
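
The efficiency comparison can be sketched by counting code points. This is a rough illustration only: every mark and bracket below is a hypothetical Private Use Area stand-in, not an existing or proposed Unicode character, and the function names are made up for the example.

```python
# Hypothetical Private Use Area stand-ins; none of these are real
# Unicode characters with the semantics discussed here.
JOIN_MARK = "\uE000"      # double mark placed between adjacent characters
BRACKET_OPEN = "\uE001"   # bracket1
BRACKET_CLOSE = "\uE002"  # bracket2
GROUP_MARK = "\uE003"     # single mark applied to the bracketed group
MARK_OPEN = "\uE004"      # mark1
MARK_CLOSE = "\uE005"     # mark2


def interleaved(chars: str) -> str:
    """<c1, mark, c2, mark, ... mark, cn>: n + (n - 1) code points."""
    return JOIN_MARK.join(chars)


def bracketed(chars: str) -> str:
    """<bracket1, c1 ... cn, bracket2, mark>: n + 3 code points."""
    return BRACKET_OPEN + chars + BRACKET_CLOSE + GROUP_MARK


def paired(chars: str) -> str:
    """<mark1, c1 ... cn, mark2>: n + 2 code points."""
    return MARK_OPEN + chars + MARK_CLOSE


def lengths(chars: str) -> tuple[int, int, int]:
    """Encoded length in code points under each of the three schemes."""
    return len(interleaved(chars)), len(bracketed(chars)), len(paired(chars))


for sample in ("92", "923.0", "x" * 20):
    print(len(sample), lengths(sample))
```

Under these assumptions the interleaved form costs 2n - 1 code points against n + 3 and n + 2 for the other two, so it is shortest only for runs of one or two characters; from four characters onward the paired form is the shortest.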

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/
