Nathan Wells asked :

>> I am not sure if this is the right place to ask, but I am trying
>> to create hyphenation rules for a UTF-8 language (Khmer). I've
>> tried patgen, but I can't get it to work (some have said it
>> doesn't support UTF-8?).

to which Werner Lemberg replied via Stack Exchange. Since not all
subscribed to these lists will necessarily also read Stack Exchange
(I don't, for example), I have repeated some parts of his answer here,
to which I have added related questions of my own. I have also opened
up the distribution to the XeTeX list, since it seems extremely
relevant thereto.

> First of all, whatever you are going to achieve, it won't work with
> ‘classical’ TeX. This is due to a design decision of Knuth – today we
> know that this was unfortunate, but at the time of writing TeX this
> was far less obvious: Hyphenation patterns are applied to glyph
> indices and not to input character codes. Since there are more than
> 256 Khmer ligature glyphs, the standard hyphenation algorithm can't
> be applied.
>
> Today, this design problem can be circumvented natively by luatex
> only,

Does XeTeX also address this issue (open question, not one to which
I claim to already know the answer) ?

> Now back to your problem. The patgen program is completely agnostic
> of what it processes; the only limitation is that it cannot handle
> more than 243 entities: The 8bit range of 256 characters minus the
> digits 0-9 and characters ‘.’, ‘-’, and ‘*’ (which can be mapped to
> different characters if necessary). Since the number of Khmer
> characters is less than 128, patgen can be used to create patterns.

OK, so let's open up the question from just Khmer : if I were to want
to build patterns for a language that had more than 243 characters,
is there a variant of Patgen that can correctly handle such a task ?

Philip Taylor
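
For anyone wanting to check a word list against the 243-entity limit quoted above, a minimal Python sketch follows. It is an illustration only, not part of patgen: the names RESERVED, PATGEN_LIMIT, distinct_letters and fits_patgen are my own, and it assumes that the reserved characters are exactly the digits and the three symbols listed in the quoted answer.

```python
# Sketch only: these names are hypothetical and not part of patgen.
# Reserved characters as listed in the quoted answer: digits 0-9 and '.', '-', '*'.
RESERVED = set("0123456789.-*")

# 8-bit range (256 codes) minus the 13 reserved characters.
PATGEN_LIMIT = 256 - len(RESERVED)


def distinct_letters(words):
    """Collect the distinct non-reserved characters used in a word list."""
    letters = set()
    for word in words:
        letters.update(ch for ch in word.strip() if ch not in RESERVED)
    return letters


def fits_patgen(words):
    """True if the word list uses no more distinct characters than the limit."""
    return len(distinct_letters(words)) <= PATGEN_LIMIT


if __name__ == "__main__":
    sample = ["hy-phen-ation", "pat-tern"]
    print(PATGEN_LIMIT, len(distinct_letters(sample)), fits_patgen(sample))
```

A word list for which fits_patgen returns False is precisely the case asked about above, and this sketch does nothing to solve it; it only shows whether the limit is reached.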
