Peter Jacobi wrote:
Markus Scherer <[EMAIL PROTECTED]> wrote:

ICU 2.8 has the ability to handle m:n character conversion mappings driven
by simple lines in Unicode conversion tables (text files).

That's a nice coincidence, to have this feature. I was wondering
if this would enable transcoding from legacy Tamil charsets (in visual
glyph order, like Thai) to Unicode.

Possible, but this is "just" m:n character conversion. This feature does not add arbitrary text reordering. If you can achieve what you need with a set of m:n mappings, then you can use it by itself.


Otherwise you would have to do line/paragraph chunking and use, for example, the ICU Transliterator classes for arbitrary Unicode-to-Unicode transforms after converting to or before converting out of Unicode.
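
As a rough illustration of what such a post-conversion reordering pass might look like, here is a minimal sketch in plain Python. It is only a stand-in for a Unicode-to-Unicode transform, not ICU Transliterator code; the function names are made up for this example, and it assumes the text has already been converted to Unicode with the vowel signs still in visual order.

```python
# Sketch only: a plain-Python stand-in for a Unicode-to-Unicode reordering pass.
# It does not use ICU; all names here are hypothetical.

# Tamil vowel signs that are drawn to the left of their consonant.
PREBASE_VOWEL_SIGNS = {"\u0BC6", "\u0BC7", "\u0BC8"}  # signs E, EE, AI


def is_tamil_consonant(ch):
    # Tamil consonants lie in U+0B95..U+0BB9 (the range has unassigned gaps).
    return "\u0B95" <= ch <= "\u0BB9"


def visual_to_logical(text):
    """Move a left-side vowel sign behind the consonant that follows it."""
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        has_next = i + 1 < len(text)
        if ch in PREBASE_VOWEL_SIGNS and has_next and is_tamil_consonant(text[i + 1]):
            out.append(text[i + 1])  # consonant first (logical order)
            out.append(ch)           # then the vowel sign
            i += 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)
```

For example, visual_to_logical("\u0BC6\u0B95") gives "\u0B95\u0BC6". The sketch ignores two-part vowels and paragraph chunking, which a real transform would have to handle.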

I've looked at the example data files for the m:n mappings, but
it's still opaque to me what has to go in the headers. Is there a
point to start reading from to gain further insights?

There will be by the time ICU 2.8 is released, and it will be in the User Guide. Sorry for not having written that yet.


However, there is actually nothing you need to do in the header. The makeconv tool will detect that you have multiple code points and/or multiple complete codepage character byte sequences and automatically put such mappings into an appropriate data structure. This is possible because it knows the structure of the codepage from the already necessary header information. (The structure of Unicode is known anyway, and trivial in .ucm files where code points are listed.)
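
To make the detection idea concrete, here is a small Python sketch. It is not makeconv's code. It assumes the simplest case, a codepage where every character is a fixed number of bytes (one by default), whereas the real tool derives the structure from the header. The function names and the byte values in the sample line are hypothetical.

```python
import re

# Sketch only: classifies one CHARMAP line of a .ucm file as 1:1 or m:n.
# Assumes a fixed-width codepage; this is a simplification, not makeconv's logic.

CODE_POINT = re.compile(r"<U([0-9A-Fa-f]{4,6})>")
BYTE = re.compile(r"\\x([0-9A-Fa-f]{2})")


def parse_mapping_line(line):
    """Return (code_points, byte_values) for a line like '<U0041> \\x41 |0'."""
    unicode_part, rest = line.split(None, 1)
    code_points = [int(h, 16) for h in CODE_POINT.findall(unicode_part)]
    byte_values = [int(h, 16) for h in BYTE.findall(rest)]
    return code_points, byte_values


def is_m_to_n(line, bytes_per_char=1):
    """True if the line has several code points or several codepage characters."""
    code_points, byte_values = parse_mapping_line(line)
    return len(code_points) > 1 or len(byte_values) > bytes_per_char
```

With this, is_m_to_n(r"<U0041> \x41 |0") is False, while a line such as r"<U0B95><U0BCC> \xA6\xAB\xAA |0" (byte values invented for illustration) is recognized as m:n.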

I'm especially wondering, whether the converter by default will
take the longest matching entry in an m:n table or whether
the sequence of entries is significant. (Something must be done
to e.g. disambiguate keLa from kau).

The sequence of entries is not significant. makeconv will sort the mappings internally for processing before the binary table is written.


The converter must and will use the longest match - otherwise it would not be able to handle Ka vs. Ka+semi-voiced-mark in the Japanese table.
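
The following Python sketch shows how longest-match lookup separates the keLa and kau cases. It is only an illustration of the principle, not the ICU implementation, and the byte values are hypothetical rather than taken from any real Tamil charset. The table is a dict, so the order of its entries plays no role.

```python
# Sketch only: longest-match m:n conversion from bytes to Unicode.
# Byte values below are invented for illustration.
E_SIGN, KA, LLA, AU_MARK = 0xA6, 0xAB, 0xC7, 0xAA

MAPPINGS = {
    bytes([KA]): "\u0B95",                         # ka
    bytes([LLA]): "\u0BB3",                        # La
    bytes([E_SIGN, KA]): "\u0B95\u0BC6",           # ke (visual order: sign, ka)
    bytes([E_SIGN, KA, AU_MARK]): "\u0B95\u0BCC",  # kau
}


def to_unicode(data, mappings):
    """Convert bytes to str, always taking the longest matching entry."""
    max_len = max(len(key) for key in mappings)
    out = []
    i = 0
    while i < len(data):
        # Try the longest possible chunk first, then shorter ones.
        for n in range(min(max_len, len(data) - i), 0, -1):
            chunk = bytes(data[i:i + n])
            if chunk in mappings:
                out.append(mappings[chunk])
                i += n
                break
        else:
            out.append("\uFFFD")  # no entry matched: emit a replacement character
            i += 1
    return "".join(out)
```

Here to_unicode(bytes([E_SIGN, KA, LLA]), MAPPINGS) yields ke followed by La, while to_unicode(bytes([E_SIGN, KA, AU_MARK]), MAPPINGS) yields kau, because the three-byte entry is tried before the two-byte one.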

For more contrived examples, see the test files test3.ucm and test4.ucm in icu/source/test/testdata/

Best regards,
markus



