> Message: 4
> Date: Fri, 10 Oct 2014 08:54:04 +0700
> From: Nathan Wells <[email protected]>
> To: "About TeX hyphenation patterns." <[email protected]>
> Cc: Unicode-based TeX for Mac OS X and other platforms <[email protected]>
> Subject: Re: [tex-hyphen] Help with UTF-8 Language
> Message-ID:
>   <cafse7htpaagzyr4ocp5bn3davp+mw30_qi-tmk70xkfopn0...@mail.gmail.com>
> Content-Type: text/plain; charset="utf-8"
>
> Thank you all for your replies!
> My programming abilities are quite limited and I realize there aren't many
> people who need to make hyphenation dictionaries, hence the lack of good
> Unicode support. But would someone be willing to help with a little more
> step-by-step help? I am a little confused as how best to map the Khmer
> Unicode characters to 8-bit values.
> I think it would be quite useful to post a tutorial of the process once I
> am done so others can more easily create hyphenation dictionaries for
> languages that don't have them yet (I have yet to find a good tutorial
> anywhere).
> Thanks again for your help,
> Nathan

Hi Nathan,

Step 1. First you need a word database in your language. I wrote a small program for my case which accepts a text file (it can be text with mixed scripts), extracts the words in some "Lang", sorts them, and writes them to a file. Another program merges these word lists. In the end you need something like this:

Aggressive
Animal
Alphabet
Dosimeter
Guard

If you can make such a list some other way, that's fine.

Step 2. After this you need to know the hyphenation rules. These can differ from language to language. For example, in my case I can hyphenate a word after a vowel; if there are two or more consonants after the vowel, one stays on the same line and the others go to the next line, but there are some consonant pairs which cannot be split. After doing this with your word list you get something like this:

splitted_word_list.txt

Ag-gres-si-ve
A-ni-mal
Al-pha-bet
Do-si-me-ter
Gu-ard

Step 3.
After this comes patgen: pass splitted_word_list.txt as the dictionary file. For 'hyph_start' and 'hyph_finish' and for the left and right hyphenmin values you can use 3*N; "3" because a Khmer character is 3 bytes long in UTF-8. Using this trick I made patgen work with UTF-8.

I used the word list from step 1 and the patterns generated in step 3 to test hyphenation using hyph-utf8 and LuaTeX, and compared the result to the split word list from step 2.

For steps 1-2 I have written a program which does all the work. Unfortunately the script codes (language script detection, hyphenation rules, vowels) are "hardwired" in the code. I can send you the code and you can modify it, or you can send me text files in your language (the text can be mixed with other languages or with HTML markup, but no Word files please :) ), a list of vowels (Unicode code points), and the hyphenation rule set. I'll try to modify my code so that the program can accept: script_code_ranges, vowels_set, consonant_pairs_which_cannotbesplited
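
To make the rule from step 2 concrete, here is a rough Python sketch. It is not my actual program: the function name hyphenate, the Latin vowel set, and the handling of unsplittable pairs (the break moves to directly after the vowel) are assumptions for illustration only. Khmer will need its own character sets and rules.

```python
def hyphenate(word, vowels="aeiou", unsplittable=frozenset()):
    """Insert '-' using the simple vowel/consonant rule from step 2.

    vowels and unsplittable are illustrative parameters, not a real rule set.
    unsplittable holds lowercase two-letter strings that must stay together.
    """
    lower = word.lower()
    n = len(word)
    breaks = []
    i = 0
    while i < n:
        if lower[i] not in vowels:
            i += 1
            continue
        # Find the next vowel after position i.
        j = i + 1
        while j < n and lower[j] not in vowels:
            j += 1
        if j < n:  # only break if another vowel follows
            run = j - (i + 1)  # consonants between the two vowels
            if run >= 2 and lower[i + 1:i + 3] not in unsplittable:
                breaks.append(i + 2)  # one consonant stays on this line
            else:
                breaks.append(i + 1)  # break directly after the vowel
        i = j

    pieces = []
    start = 0
    for b in breaks:
        pieces.append(word[start:b])
        start = b
    pieces.append(word[start:])
    return "-".join(pieces)


if __name__ == "__main__":
    for w in ["Aggressive", "Animal", "Alphabet", "Dosimeter", "Guard"]:
        print(hyphenate(w))
```

Run on the five sample words, this reproduces the list in splitted_word_list.txt. The output, one word per line, is what you then pass to patgen.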

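The 3*N trick from step 3 only works if every character in your word list has the same UTF-8 length. A small Python check can confirm this before you run patgen. The helper names utf8_widths and scaled_value are made up for this sketch, and the sample word is just the Khmer word for "Khmer" written with escapes.

```python
def utf8_widths(text):
    """Return the set of UTF-8 byte lengths of the characters in text."""
    return {len(ch.encode("utf-8")) for ch in text}


def scaled_value(n_chars, sample_text):
    """Convert a value counted in characters to one counted in bytes.

    Raises ValueError if the sample mixes characters of different
    UTF-8 lengths, because a single multiplier would then be wrong.
    """
    widths = utf8_widths(sample_text)
    if len(widths) != 1:
        raise ValueError(f"mixed UTF-8 widths: {sorted(widths)}")
    return n_chars * widths.pop()


if __name__ == "__main__":
    khmer_sample = "\u1781\u17d2\u1798\u17c2\u179a"  # five Khmer code points
    print(utf8_widths(khmer_sample))      # every character is 3 bytes
    print(scaled_value(2, khmer_sample))  # 2 characters -> 6 bytes
```

If the check raises an error, the word list still contains characters from another script (Latin letters, for example, are 1 byte each) and should be cleaned first.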