Markus Scherer <[EMAIL PROTECTED]> wrote:
ICU 2.8 has the ability to handle m:n character conversion mappings driven
by simple lines in Unicode conversion tables (text files).
That's a nice coincidence, to have this feature. I was wondering
if this would enable transcoding from legacy Tamil charsets (in visual
glyph order, like Thai) to Unicode.
Possible, but this is "just" m:n character conversion. This feature does not add arbitrary text reordering. If you can achieve what you need with a set of m:n mappings, then you can use it by itself.
Otherwise you would have to do line/paragraph chunking and use, for example, the ICU Transliterator classes for arbitrary Unicode-to-Unicode transforms after converting to or before converting out of Unicode.
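To illustrate the kind of reordering step meant here, the following is a minimal sketch in plain Python, not ICU code and not a Transliterator rule set. It assumes the legacy text has already been converted to Unicode code points that are still in visual order, and it handles only the three Tamil vowel signs drawn to the left of their consonant. The function names are hypothetical. Two-part vowels (o, oo, au) are deliberately left out, since they are the ambiguous cases.

```python
# Tamil vowel signs that are drawn to the left of their consonant:
# U+0BC6 (e), U+0BC7 (ee), U+0BC8 (ai).
PREFIX_VOWEL_SIGNS = {"\u0BC6", "\u0BC7", "\u0BC8"}


def is_tamil_consonant(ch: str) -> bool:
    # Tamil consonants lie in U+0B95..U+0BB9 (KA..HA). The range also
    # covers some unassigned code points, which is fine for a sketch.
    return "\u0B95" <= ch <= "\u0BB9"


def visual_to_logical(text: str) -> str:
    """Move a left-side vowel sign from before its consonant to after it."""
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        has_next = i + 1 < len(text)
        if ch in PREFIX_VOWEL_SIGNS and has_next and is_tamil_consonant(text[i + 1]):
            # Visual order: sign, consonant. Logical order: consonant, sign.
            out.append(text[i + 1])
            out.append(ch)
            i += 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)
```

For example, visual_to_logical("\u0BC6\u0B95") swaps the pair into consonant-first order, while text without a left-side vowel sign passes through unchanged.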
I've looked at the example data files for the m:n mappings, but it's still opaque to me what has to go in the headers. Is there a point to start reading from to gain further insights?
There will be by the time ICU 2.8 is released, and it will be in the User Guide. Sorry for not having written that yet.
However, there is actually nothing you need to do in the header. The makeconv tool will detect that you have multiple code points and/or multiple complete codepage character byte sequences and automatically put such mappings into an appropriate data structure. This is possible because it knows the structure of the codepage from the already necessary header information. (The structure of Unicode is known anyway, and trivial in .ucm files where code points are listed.)
I'm especially wondering whether the converter will by default take the longest matching entry in an m:n table, or whether the sequence of entries is significant. (Something must be done to disambiguate, e.g., keLa from kau.)
The sequence of entries is not significant. makeconv will sort the mappings internally for processing before the binary table is written.
The converter must and will use the longest match - otherwise it would not be able to handle Ka vs. Ka+semi-voiced-mark in the Japanese table.
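The longest-match behavior can be sketched roughly as follows. This is a plain Python illustration, not how ICU implements it (ICU builds its own binary data structures via makeconv). The byte values in the table are made up for the example, and the function name is hypothetical. The table entries are listed in no particular order, since the matching does not depend on it.

```python
def encode_longest_match(text: str, table: dict) -> bytes:
    """Convert text to bytes, preferring the longest matching table key."""
    max_len = max(len(key) for key in table)
    out = bytearray()
    i = 0
    while i < len(text):
        # Try the longest candidate first, then progressively shorter ones.
        for n in range(min(max_len, len(text) - i), 0, -1):
            chunk = text[i:i + n]
            if chunk in table:
                out += table[chunk]
                i += n
                break
        else:
            raise ValueError(f"no mapping at offset {i}")
    return bytes(out)


# Hypothetical byte values, for illustration only.
TABLE = {
    "\u304D": b"\x11",        # HIRAGANA LETTER KI
    "\u304B\u309A": b"\x20",  # KA + combining semi-voiced sound mark (2:1)
    "\u304B": b"\x10",        # HIRAGANA LETTER KA alone
}
```

With this table, KA followed by the semi-voiced mark is converted as one unit (b"\x20"), whereas KA followed by any other character falls back to the single-character mapping (b"\x10").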
For more contrived examples, see the test files test3.ucm and test4.ucm in the icu/source/test/testdata/ directory.
Best regards, markus