On Sat, 29 Dec 2001 [EMAIL PROTECTED] wrote:

>Tex's example may or may not be realistic -- I have no way of knowing --
>but in suggesting a top-to-bottom directional override, I had hoped it
>would be possible to represent a run of text such as Tex describes
>without resorting to the infamous "higher protocol."
But it is. Unicode just does not take a stand on how it should be
formatted. See below.

>This may seem arbitrary to some; why should overrides of default
>horizontal directionality be a plain-text issue but overrides of default
>vertical directionality be a higher-level "formatting style" issue? I
>hope this discussion can shed some light on this question, and possibly
>help me see what I may be missing.

I think this has to do with the way people conceive of the term
"plaintext": anything beyond a simple line (or column) based flow layout
will likely be thought of as "rich" instead. The reason is both
historical and practical. The historical reason is that text is laid out
like this in most cultures, and early printing/computer/typewriter
technology followed suit. The matter of mixed writing directions is a
relatively new one, and so isn't really covered by the concept of
"plaintext". The practical reason is that comprehensive layout of fully
free direction text is really difficult, if not impossible, whereas
writing systems with identical line progression directions are more or
less compatible, using a simplish algorithm (Unicode BiDi).

If you look at the way text is normally displayed on 2D media, it's
printed in a unidirectional stream and then chopped into lines at the
sheet edge. As long as the lines progress in the same direction, you can
always manipulate the order of the symbols within the stream to get more
or less correct display of mixed script directionalities. (Yes, line
breaking and deeply nested BiDi levels are still troublesome.) This way,
lr-tb is sorta compatible with rl-tb. There are of course three more
pairs, not counting boustrophedon and the like, but AFAIK this is the
most common combination.

It's also where the ease stops. If you try to mix opposite line
progression directions, you will end up with something like the Unicode
BiDi algorithm, only applied at the paragraph level. That soon becomes
unreadable, and makes for really lousy APIs.
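
To make the reordering point above concrete, here is a toy Python sketch
of how symbols in a logical stream can be shuffled for display when lr-tb
and rl-tb text share a line. It is not the Unicode BiDi algorithm: it
only looks at strong classes from `unicodedata.bidirectional`, gives
neutral characters the direction of their neighbours (or the base
direction when the neighbours disagree), and assumes a left-to-right base
line. Digits, mirroring, explicit embeddings and line breaking are all
ignored, and the function names are made up for illustration.

```python
import unicodedata


def strong_direction(ch):
    """Return "L" or "R" for strongly directional characters, else None."""
    cls = unicodedata.bidirectional(ch)
    if cls in ("R", "AL"):
        return "R"
    if cls == "L":
        return "L"
    return None  # neutral or weak: resolved from context below


def resolve_directions(text, base="L"):
    """Assign a direction to every character (toy simplification)."""
    dirs = [strong_direction(ch) for ch in text]
    i = 0
    while i < len(dirs):
        if dirs[i] is not None:
            i += 1
            continue
        # Find the end of this run of unresolved characters.
        j = i
        while j < len(dirs) and dirs[j] is None:
            j += 1
        before = dirs[i - 1] if i > 0 else base
        after = dirs[j] if j < len(dirs) else base
        # Same direction on both sides: adopt it. Otherwise use the base.
        fill = before if before == after else base
        for k in range(i, j):
            dirs[k] = fill
        i = j
    return dirs


def visual_order(text):
    """Reorder logical text for display on a left-to-right base line."""
    dirs = resolve_directions(text, base="L")
    out = []
    i = 0
    while i < len(text):
        j = i
        while j < len(text) and dirs[j] == dirs[i]:
            j += 1
        run = text[i:j]
        # Right-to-left runs are reversed; left-to-right runs stay put.
        out.append(run[::-1] if dirs[i] == "R" else run)
        i = j
    return "".join(out)


if __name__ == "__main__":
    # Latin text with two Hebrew words embedded, in logical order.
    logical = "abc \u05d0\u05d1 \u05d2\u05d3 def"
    print(visual_order(logical))
```
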
(Even BiDi is difficult, as one usually needs to render entire
paragraphs at a time.) Mixing vertical and horizontal writing modes is
even more complicated, since you can no longer think of the text as a
directional, chopped-into-lines stream. You *can* use all sorts of funky
heuristics, but keeping the text both readable and "plain" is pretty much
impossible. (If you don't believe that, think about how you would format
a string of 1000 lr-tb, 100 tb-lr, 100 rl-bt and 1000 bt-lr characters.
This is not a realistic example, of course, but it illustrates the
general point.)

Now, there are many ways to cope with simplified variations of the
theme. One is to rotate nested characters of foreign directionality so
that the character progression direction remains the same for all the
scripts present. The XSL-FO documentation, for example, gives a number
of examples of this approach.

Another is to force the character progression direction to agree between
scripts, without rotation. This only works when characters are
graphically separate, like they are in the Latin script or scripts based
on Han ideographs. Top-to-bottom Latin within Japanese is a good example.
(It also illustrates the effects on readability of messing with the
natural directionality of text.)

You can also print short spans of foreign text in their natural
direction, within a line of text of differing native directionality.
Metric units, printed in Latin within tb-rl traditional Japanese, are
probably the most common case. I'm sure that people on this list could
cite countless weirder examples.

The point is, all such solutions are for special cases. They do not
solve the problem of how to fit longer, nested spans with arbitrary
directionality on a page without in some cases making the text as a
whole illegible and/or unaesthetic.
Hence, it's better to handle the special cases as what they are, instead
of bringing them all into Unicode and forcing every Unicode-compatible
application to incorporate a full page layout engine. I think this is the
ultimate reason why TUS 3.0 leaves this stuff to those "higher level
protocols".

We might in fact say that the Unicode Standard has two completely
separate parts. The first is the logical encoding of any character-based
script as a stream of character codes; the second is an actual 2D,
line-based rendering of the encoding for the very special case where two
scripts of identical line progression direction are mixed. Anything
beyond this could well be said to be beyond the scope of TUS. We might
indeed go as far as to say that certain combinations of scripts which
*can* be encoded in Unicode *cannot* actually be consistently rendered
on 2D graphical media.

(After all, one shouldn't neglect the possibility of there being a
script which cannot be line-broken at all. This would mean that it cannot
be printed on 2D media of fixed size, but can still be encoded in
Unicode. A cursive script written on an uninterrupted tape, like I hear
some Tibetan text is, could conceivably serve as an example. I think this
sort of thing does not point to a problem with Unicode, but rather
further illustrates the fact that a stream of Unicode characters and its
rendering are separate beings.)

BTW, something akin to the above should really go in a FAQ. Is there
anything resembling a Unicode FAQ in existence, anywhere?

>Actually, there is a more serious problem involved with vertical
>directional overrides: They would force the Unicode plain-text mechanism
>to become aware of both vertical directionality and directional
>priority. This sounds obvious, but in fact there are not two, but THREE
>issues involved with text directionality:

Beyond the fact that some characters are used in more than one writing
mode, I don't see a problem with incorporating such properties in the
character database. However, I don't think one should involve these
properties in any default plaintext rendering of Unicode. If anything,
I'd push all of the rendering details into a separate part of the Unicode
standard, and make it clear that not all streams of code points have the
consistent, readable rendering one would expect.

Sampo Syreeni, aka decoy - mailto:[EMAIL PROTECTED], tel:+358-50-5756111
student/math+cs/helsinki university, http://www.iki.fi/~decoy/front
openpgp: 050985C2/025E D175 ABE5 027C 9494 EEB0 E090 8BA9 0509 85C2

