Re: [tex-hyphen] tex-hyphen Digest, Vol 58, Issue 3

levan shoshiashvili Fri, 10 Oct 2014 03:59:08 -0700



> 
> Message: 4
> Date: Fri, 10 Oct 2014 08:54:04 +0700
> From: Nathan Wells <[email protected]>
> To: "About TeX hyphenation patterns." <[email protected]>
> Cc: Unicode-based TeX for Mac OS X and other platforms <[email protected]>
> Subject: Re: [tex-hyphen] Help with UTF-8 Language
> Message-ID:
>       <cafse7htpaagzyr4ocp5bn3davp+mw30_qi-tmk70xkfopn0...@mail.gmail.com>
> Content-Type: text/plain; charset="utf-8"
> 
> Thank you all for your replies!
> My programming abilities are quite limited and I realize there aren't many
> people who need to make hyphenation dictionaries, hence the lack of good
> Unicode support. But would someone be willing to help with a little more
> step-by-step help? I am a little confused as how best to map the Khmer
> Unicode characters to 8-bit values.
> I think it would be quite useful to post a tutorial of the process once I
> am done so others can more easily create hyphenation dictionaries for
> languages that don't have them yet (I have yet to find a good tutorial
> anywhere).
> Thanks again for your help,
> Nathan
Hi Nathan,

Step 1.
First you need a word database for your language.
I wrote a small program for my case that accepts a text file (it can be text
with mixed scripts), extracts the words written in a given script ("Lang"),
sorts them and writes them to a file.
Another program merges these word lists.
In the end you need something like this:

Aggressive
Animal
Alphabet
Dosimeter
Guard

If you can make such a list some other way, that's fine.
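
A minimal sketch of step 1 in Python, not my actual program. The function
names and the example code range are made up for illustration; the range
U+1780..U+17FF is the Khmer Unicode block and would be replaced by whatever
your script needs. It treats every unbroken run of characters from the script
as one "word", which only works if the text separates words (for a script
written without spaces between words you would need a segmentation step
first).

```python
# Hypothetical helper names; the code ranges are an assumption for illustration.
KHMER_RANGES = [(0x1780, 0x17FF)]  # Khmer Unicode block


def in_script(ch, script_code_ranges):
    """True if the character's code point falls in one of the given ranges."""
    return any(lo <= ord(ch) <= hi for lo, hi in script_code_ranges)


def extract_words(text, script_code_ranges):
    """Collect unbroken runs of in-script characters, deduplicated and sorted."""
    words = set()
    current = []
    for ch in text:
        if in_script(ch, script_code_ranges):
            current.append(ch)
        elif current:
            words.add("".join(current))
            current = []
    if current:  # flush a run that reaches the end of the text
        words.add("".join(current))
    return sorted(words)


def merge_wordlists(*wordlists):
    """Merge several word lists into one sorted list without duplicates."""
    merged = set()
    for words in wordlists:
        merged.update(words)
    return sorted(merged)
```

Writing the result out one word per line gives a list in the shape shown
above.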

Step 2.
After this you need to know the hyphenation rules.
These differ from language to language.
For example, in my case I can hyphenate a word after a vowel;
if there are two or more consonants after the vowel, one stays on the same
line and the others go to the next line, but there are some consonant pairs
which cannot be split.

After applying the rules to your word list you get something like this
(splitted_word_list.txt):
Ag-gres-si-ve
A-ni-mal
Al-pha-bet
Do-si-me-ter
Gu-ard
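
The rule from step 2 can be sketched roughly like this in Python. This is a
toy version for the Latin example words above, not my actual code: the
function name, the vowel set and the parameter names are made up, and the way
unsplittable pairs are handled (break before the pair instead of inside it)
is an assumption, since real rules depend on the language.

```python
def hyphenate(word, vowels=frozenset("aeiou"), unsplittable_pairs=frozenset()):
    """Insert '-' after a vowel, keeping one consonant on the line if 2+ follow.

    No break is made after the last vowel of the word.
    """
    lower = word.lower()
    n = len(lower)
    breaks = []
    i = 0
    while i < n:
        if lower[i] in vowels:
            # Find the next vowel; lower[i+1:j] is the consonant cluster.
            j = i + 1
            while j < n and lower[j] not in vowels:
                j += 1
            if j < n:  # only break if another vowel follows
                cluster = lower[i + 1:j]
                if len(cluster) >= 2 and cluster[:2] not in unsplittable_pairs:
                    breaks.append(i + 2)  # first consonant stays on this line
                else:
                    # 0 or 1 consonant, or (assumption) an unsplittable pair:
                    # break right after the vowel.
                    breaks.append(i + 1)
            i = j
        else:
            i += 1
    pieces = []
    start = 0
    for b in breaks:
        pieces.append(word[start:b])
        start = b
    pieces.append(word[start:])
    return "-".join(pieces)
```

Run over the five example words, this reproduces the list above; passing
unsplittable_pairs={"st"} turns "mas-ter" into "ma-ster".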

Step 3.
After this comes patgen: you pass splitted_word_list.txt as the dictionary
file. For 'hyph_start' and 'hyph_finish', left hyphen min and right hyphen min
you can use 3*N (N being the value you would otherwise use), "3" because each
Khmer character is 3 bytes long in UTF-8. Using this trick I made patgen work
with UTF-8.
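
The arithmetic behind the 3*N trick, as a small Python sketch. The helper
names are made up; the only fact used is that a character in the Khmer block
takes 3 bytes in UTF-8, so a limit counted in characters becomes three times
as large when counted in bytes.

```python
def utf8_width(sample_char):
    """Number of bytes one character occupies in UTF-8."""
    return len(sample_char.encode("utf-8"))


def scale_for_utf8(n_chars, sample_char):
    """Convert a length given in characters into a length in UTF-8 bytes."""
    return n_chars * utf8_width(sample_char)


khmer_ka = "\u1780"  # KHMER LETTER KA, first letter of the Khmer block
print(utf8_width(khmer_ka))         # bytes per Khmer character
print(scale_for_utf8(2, khmer_ka))  # a limit of 2 characters, in bytes
```

This only holds as long as every character in the word list has the same
byte width; mixed widths would need more care.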

I used the word list from step 1 and the patterns generated in step 3 to test
hyphenation using hyph-utf8 and LuaTeX, and compared the result to the split
word list from step 2.
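
The comparison itself can be as simple as the following Python sketch. It is
an illustration with made-up function names, assuming both the expected
hyphenation (from step 2) and the one produced by the patterns are available
as strings with '-' at the break points.

```python
def break_positions(hyphenated):
    """Return the set of character offsets at which a '-' break occurs."""
    positions = set()
    offset = 0
    for ch in hyphenated:
        if ch == "-":
            positions.add(offset)
        else:
            offset += 1
    return positions


def compare_hyphenation(expected, actual):
    """Report breaks the patterns missed and breaks they wrongly added."""
    exp = break_positions(expected)
    act = break_positions(actual)
    return {"missed": sorted(exp - act), "false": sorted(act - exp)}


print(compare_hyphenation("Al-pha-bet", "Alpha-bet"))
```

Counting missed and false breaks over the whole list gives a quick idea of
how well the patterns reproduce the rules.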

For steps 1-2 I have written a program which does all the work. Unfortunately
the script codes (language script detection, hyphenation rules, vowels) are
"hardwired" in the code.

I can send you the code and you can modify it, or you can send me text files
in your language (the text can be mixed with other languages or with HTML
markup, but no Word files please :) ), a list of vowels (Unicode code points)
and the hyphenation rule set.
I'll try to modify my code so that the program can accept: script_code_ranges,
vowels_set, consonant_pairs_which_cannotbesplited