In my particular case, I have citations in (for example) the Arabic Wikipedia which cite references on English or Turkish web pages (to take the arwiki article on 'Istanbul' as an example). The original author of the article did not explicitly mark the language of the reference, because the Unicode bidirectional algorithm did a perfect job of rendering the cited page title LTR in an otherwise RTL context. When I translate this to XeLaTeX, the entire citation is garbled: although XeLaTeX/polyglossia does render the individual words LTR (using directionality implied by the Unicode code block), the words themselves are laid out RTL and the punctuation is a mess, because XeLaTeX does not implement the bidi algorithm's mechanism for inferring the directionality of 'weak' and 'neutral' characters. (The original citations also don't necessarily add <bdi> tags where necessary, but that appears to be an easily fixed fault of the citation template.)
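To make that concrete: here is a minimal sketch (Python, written for this message, not part of any existing tool) of the kind of preprocessing that would fix such a citation. It wraps each strong-LTR run in the Unicode isolate controls U+2066 (LRI) ... U+2069 (PDI), which do in plain text roughly what <bdi> does in HTML, so a renderer without the full bidi algorithm still lays the run out LTR:

```python
import unicodedata

LRI, PDI = "\u2066", "\u2069"  # isolate controls, analogous to HTML <bdi>

def isolate_ltr_runs(text):
    """Wrap each maximal run of strong-LTR characters (together with the
    weak/neutral characters attached to it) in LRI ... PDI."""
    out, run = [], []
    for ch in text:
        cls = unicodedata.bidirectional(ch)
        if cls == "L":                    # strong left-to-right
            run.append(ch)
        elif cls in ("R", "AL", "AN"):    # strong right-to-left
            if run:                       # close any open LTR run first
                out.append(LRI + "".join(run) + PDI)
                run = []
            out.append(ch)
        else:                             # weak/neutral: spaces, punctuation,
            (run if run else out).append(ch)  # digits, etc.
    if run:
        out.append(LRI + "".join(run) + PDI)
    return "".join(out)
```

This is only a first approximation of the real algorithm (it attaches trailing neutrals to the LTR run, which is usually what you want for a cited title ending in a period), but it is enough to keep the word order and punctuation of an embedded English title intact in an Arabic paragraph.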
My understanding from this discussion is that I should implement the Unicode bidi algorithm myself in my article preprocessor, to explicitly annotate the directionality of weak and neutral characters before feeding the output to XeLaTeX. That work won't help others who find themselves in a similar situation (or document authors who would prefer not to have to explicitly annotate every LTR embedding), but it should be a reasonable solution to my particular problem.
--scott

On Mon, Dec 9, 2013 at 10:32 AM, <msk...@ansuz.sooke.bc.ca> wrote:
> On Mon, 9 Dec 2013, Khaled Hosny wrote:
>> > U+E0001 U+E0065 U+E006E U+0073 U+0061 U+006E U+0067
>>
>> And it is a kind of tagging, so beyond the scope of identifying the
>> language of *untagged* text (which is the claim that spurred all this
>> discussion).
>
> The claim was "A properly encoded UTF-8 string should contain everything
> you need!". If you forbid using Unicode tag characters, then you're
> saying "It is impossible to encode language in Unicode when you're not
> allowed to use the features designed for that purpose," which is not
> an interesting statement.
>
> Yes, of course some kind of tagging is needed. Keith seems to think that
> the tagging will magically come from "proper" UTF-8, and of course he's
> wrong. I think language tagging would be possible in pure Unicode, as the
> string above demonstrates, but that's not a good way to do it. The really
> original question had to do with RTL versus LTR detection, not language
> detection, and that's a different issue.
>
> Unicode specifies a way to detect RTL versus LTR, such that in many cases
> it doesn't require tagging. Unicode's way of doing it may or may not be a
> good one, but we cannot reasonably pretend that it doesn't exist. The
> Unicode bidi algorithm does exist. XeTeX does not implement the Unicode
> bidi algorithm. The interesting remaining question is whether XeTeX
> should implement it.
> I tend to think not - because if we implement it,
> people will blame us for its failings. It'd also be a lot of work, break
> compatibility with the rest of the TeX world, STILL require tagging in
> many cases, and so on.
>
> --
> Matthew Skala
> msk...@ansuz.sooke.bc.ca  People before principles.
> http://ansuz.sooke.bc.ca/
>
> --------------------------------------------------
> Subscriptions, Archive, and List information, etc.:
> http://tug.org/mailman/listinfo/xetex

--
( http://cscott.net/ )
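For readers decoding the tag-character string quoted above: Plane-14 tag characters mirror ASCII at U+E0000 plus the code point, prefixed by U+E0001 (LANGUAGE TAG). A throwaway sketch, shown only to make the quoted sequence legible (Unicode has since deprecated this mechanism for language tagging):

```python
def tag_language(lang, text):
    """Prefix `text` with a Plane-14 language tag: U+E0001 LANGUAGE TAG,
    then the language code shifted into the tag-character block.
    Deprecated in Unicode; illustrative only."""
    return "\U000E0001" + "".join(chr(0xE0000 + ord(c)) for c in lang) + text

# tag_language("en", "sang") yields exactly the quoted sequence:
# U+E0001 U+E0065 U+E006E U+0073 U+0061 U+006E U+0067
```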