[XXE] Language-specific processing

maxwell Fri, 02 Oct 2009 12:31:24 -0400

On Wed, 30 Sep 2009 15:05:50 +0900, "Susan F." <soofalk at gmail.com> wrote:
> ...is language-specific processing possible? not
> language-specific text generation, rather language-specific text
> alignment.
> such as, if lang=en then alignment=left, if lang=jp then
> alignment=justify.
> the logic is very simple and it seems possible but all my customization
> efforts were unsuccessful.


Apropos of this (and another msg today on multi-lingual work in XXE): for a
couple of years now, we have been using XXE to process grammars of languages
with complex scripts (specifically Bangla = Bengali, which is a more or
less typical Indic script, and Urdu, which uses a Perso-Arabic script).  We
have nothing in these languages longer than a couple of lines of text
(interlinear text, if you're familiar with that, as well as inline example
words and phrases).

For output, we decided the normal FO -> PDF route would not work, because
it didn't (as far as we could tell) handle non-Roman scripts well.  We
briefly experimented with converting the DocBook XML to Microsoft Word (XXE
has scripts to do that).  It worked passably for Bengali, but did not look
like a feasible route for Arabic script.  

Instead, we've been converting the DocBook XML to XeTeX (a Unicode-aware
TeX engine, which we use with LaTeX) using the dblatex program
(http://dblatex.sourceforge.net/).  We had to make a few enhancements to
handle interlinear text (thanks to work by Andy Black and his
collaborators--Andy appears on this list from time to time), and do a few
things with a specialized LaTeX style sheet.  The other work we had to do
to accommodate right-to-left text was to bracket that text with a LaTeX
command that tells XeTeX to process that stretch of text in a right-to-left
fashion (and to use a specific font).  We do the bracketing automatically
during the conversion process (so the bracketing is not visible in XXE),
using a Perl script to find stretches of Unicode characters in the Arabic
block (plus spaces etc.).  The only tricky part about that was when we had
non-Arabic punctuation intermixed with Arabic script.
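
To give an idea of what the bracketing step looks like, here is a minimal
sketch.  It is in Python rather than Perl, and it is not our actual script:
the command name \rtltext is hypothetical (the real name comes from whatever
the LaTeX style sheet defines), and the sketch only looks at the basic Arabic
block, U+0600 to U+06FF.

```python
import re

# Assumption: only the basic Arabic block (U+0600-U+06FF) is treated as
# right-to-left; a real script may need further blocks.
ARABIC = "\u0600-\u06FF"

# A run is one or more Arabic characters, optionally continued across
# spaces or zero-width (non-)joiners, provided more Arabic text follows.
RUN = re.compile(rf"[{ARABIC}]+(?:[ \u200c\u200d]+[{ARABIC}]+)*")


def bracket_rtl(text, command="rtltext"):
    """Wrap each Arabic-script run in \\<command>{...}.

    The command name is hypothetical; it stands in for whatever macro
    the style sheet defines for right-to-left text.
    """
    return RUN.sub(lambda m: "\\" + command + "{" + m.group(0) + "}", text)


if __name__ == "__main__":
    sample = "word \u0627\u0631\u062f\u0648 \u0632\u0628\u0627\u0646 gloss"
    print(bracket_rtl(sample))
```

Note that a Latin full stop or comma between two Arabic words ends the run
in this sketch, so the words on either side get bracketed separately.  That
is the mixed-punctuation case mentioned above, and it needs extra rules.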

Using dblatex to produce XeTeX output, and then running XeTeX to produce
the PDF, has given us excellent results for Urdu, which is a very
difficult language to typeset (it's not your typical Arabic script).  We
went back and re-ran the Bangla grammar through this process, and it comes
out well too.  (Disclaimer: we're running this process under Linux.  I
believe it would work in Windows as well, but we haven't tried that.)
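
For anyone who wants to try the two-stage route, the commands can be
assembled roughly as follows.  This is a sketch, not our actual build: the
file names are made up, the dblatex options (-b for the backend, -t for the
output type, -o for the output file) are as I understand them from the
dblatex documentation, and running xelatex by hand assumes the dblatex
style files are on your TeX search path.  The bracketing step would have to
run somewhere before XeTeX.

```python
from pathlib import Path


def build_pipeline(docbook_file):
    """Return the two command lines, as argument lists, without running them.

    Assumption: dblatex accepts -b xetex, -t tex and -o as described in
    its documentation; check against your installed version.
    """
    src = Path(docbook_file)
    tex = src.with_suffix(".tex")
    return [
        # Stage 1: DocBook XML -> LaTeX source aimed at the XeTeX backend.
        ["dblatex", "-b", "xetex", "-t", "tex", "-o", str(tex), str(src)],
        # Stage 2: LaTeX source -> PDF via XeTeX.
        ["xelatex", str(tex)],
    ]


if __name__ == "__main__":
    for cmd in build_pipeline("grammar.xml"):
        print(" ".join(cmd))
```

The function only builds the argument lists; each could be passed to
subprocess.run(cmd, check=True) once the options are confirmed.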

There is one shortcoming to using XXE to edit Arabic script: the cursor
movement does not work correctly in right-to-left script.  (More
specifically, I believe the insertion point has the correct behavior, but
the visible cursor does not.)  The folks at XMLmind have declined to fix
this--understandably, as I'm sure the Arabic script market for XXE is
rather small.  Also, persuading XXE to use a specific font for other
languages is a bit tricky, involving editing a file in the Java lib
directory.  I was successful at this for the Bangla script, but I could not
override the default Arabic script choice.

We have not tried any of this with CJK languages.

   Mike Maxwell
   CASL/ U MD
