Re: [tex-hyphen] Unicode Turkish Hyphenation Pattern

S. Ekin Kocabas Wed, 25 Jun 2008 11:13:36 -0700

Hi Mojca,

I'm not a member of the email list [email protected], so I had not received any of the emails below until your last email to Turgut Uyar, on which you cc'ed me. I'll do my best to reply to your queries.

On your first comment, the Turkish alphabet has more letters, and that's probably why there are more patterns.

On your second comment, about the missing 2bi. , 2bö. , 2bü. , 2ci. ... patterns: these don't make much sense in Turkish, since there are words that should be hyphenated at bi or ci, such as
ga-ra-bi
ge-mi-ci
bö-rül-ce
etc.
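
To make the effect of such a pattern concrete, here is a minimal Python sketch of how TeX-style (Liang) patterns are applied: each matching pattern contributes digits to the slots between letters, the largest digit in each slot wins, and an odd value allows a break. It uses only the single-letter patterns quoted later in this thread, which is enough for words that strictly alternate consonant and vowel (such as gemici and garabi) but is not a complete pattern set. The names `parse_pattern` and `apply_patterns` are made up for illustration, the input is assumed to be lowercase already, and \lefthyphenmin and \righthyphenmin are ignored.

```python
import re

VOWELS = "aeıioöuü"
CONSONANTS = "bcçdfgğhjklmnprsştvyz"

# Single-letter patterns as quoted in this thread: 2a1 ... 2ü1 and 1b1 ... 1z1.
BASE_PATTERNS = ["2" + v + "1" for v in VOWELS] + ["1" + c + "1" for c in CONSONANTS]


def parse_pattern(pattern):
    """Split a pattern such as '2ci.' into its letters and its slot values."""
    letters = re.sub(r"\d", "", pattern)
    values = [0] * (len(letters) + 1)
    pos = 0
    for ch in pattern:
        if ch.isdigit():
            values[pos] = int(ch)   # digit sits in the slot before letters[pos]
        else:
            pos += 1
    return letters, values


def apply_patterns(word, patterns):
    """Return `word` with '-' at every slot whose final value is odd."""
    text = "." + word + "."                 # '.' marks the word boundaries
    scores = [0] * (len(text) + 1)          # scores[j] = slot before text[j]
    for pattern in patterns:
        letters, values = parse_pattern(pattern)
        start = text.find(letters)
        while start != -1:
            for k, value in enumerate(values):
                scores[start + k] = max(scores[start + k], value)
            start = text.find(letters, start + 1)
    out = []
    for idx, ch in enumerate(word):
        # word[idx] is text[idx + 1], so its preceding slot is scores[idx + 1]
        if idx > 0 and scores[idx + 1] % 2 == 1:
            out.append("-")
        out.append(ch)
    return "".join(out)


print(apply_patterns("gemici", BASE_PATTERNS))             # ge-mi-ci
print(apply_patterns("gemici", BASE_PATTERNS + ["2ci."]))  # ge-mici
print(apply_patterns("garabi", BASE_PATTERNS + ["2bi."]))  # ga-rabi
```

With the base patterns alone the sketch gives ge-mi-ci; adding 2ci. or 2bi. suppresses the break before the final syllable, which is the behaviour objected to above.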

Could it be that Yannis' patterns mapped ö, ç, ü etc. to some other old Ottoman letters?

On your last comment, about the missing ".i2 .ö2 .ü2" patterns: again, there are many counterexamples:

i-yi-lik
ö-ner-ge
ü-züm
etc.

It's a pity that TeX has only a single hyphenation mechanism, and one that is specialized to English. Hyphenation in Turkish is so easy and mechanical that Turkish dictionaries don't even bother to show it, and one can write a very short and simple computer program that does it. An example written in Mathematica syntax is included below. It would have been great to extend the hyphenation mechanism in TeX, but that may be too much work...

I hope I was able to answer your queries. If you send me Yannis' patterns, I can probably be of more help. Also, my knowledge of TeX hyphenation is limited to what I read in Knuth's TeXbook; I did not delve into Liang's thesis. Hence, it is quite possible that some parts of my reply are nonsensical.

Best,

Ekin

PS: After I finished typing this email, I saw the thread at [email protected] and specifically the comments

7.) Can someone explain why "s with line below" has no unicode point?
(Yes, I know that it won't be added, ...)

In modern Turkish there are "s" and "ş". "ş" corresponds to "S WITH CEDILLA". I don't know how the "s with line below" character differs from "ş".

Ekin - do you by any chance know any programming language to update
the source for generating patterns?

I know C. I'll look into

http://www.ctan.org/tex-archive/language/turkish/hyphen/turk_hyf.c

to make it compatible with current day Turkish.

Unless someone objects, I would rename the old patterns and clean the
not-needed characters in Turkish ones. But we need to remove all the
\lccode and \catcode commands.

I agree. One other problem I saw is that small caps don't work with Turkish. In English, "i-I" form a lowercase-uppercase letter pair, whereas in Turkish there are two pairs, "ı-I" and "i-İ": the dotted letters are paired together, and so are the dotless ones, which is different from English. Some fonts (e.g. Adobe Caslon Pro) don't even have a small caps version of "ı", which is quite annoying. I have not been able to solve this problem...
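
The two Turkish letter pairs can be illustrated with a small Python sketch. Python's built-in `str.upper()` and `str.lower()` use the default, locale-independent mapping, so the hypothetical helpers `tr_upper` and `tr_lower` below remap the four i-like letters first. This only illustrates the pairing; it does not address the small caps problem in TeX or in fonts.

```python
def tr_upper(text):
    """Uppercase with the Turkish pairs i -> İ and ı -> I (illustrative helper)."""
    return text.replace("i", "İ").replace("ı", "I").upper()


def tr_lower(text):
    """Lowercase with the Turkish pairs İ -> i and I -> ı (illustrative helper)."""
    return text.replace("İ", "i").replace("I", "ı").lower()


print("iyilik".upper())    # IYILIK  (default mapping: the dot is lost)
print(tr_upper("iyilik"))  # İYİLİK
print(tr_upper("ışık"))    # IŞIK
print(tr_lower("IŞIK"))    # ışık
```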

PS2: I subscribed to tex-hyphen.

This is the hyphenating code in Mathematica. 'kelime' means word. The test words are not real ones but, thanks to the regularity of the rules in Turkish, you can hyphenate pretty much anything you can type. I've tested the function with those hypothetical words. Oh, and by the way, 'hecele' means hyphenate and 'sesli harfler' means vowels.

Kelime Heceleyici (Word Hyphenator) ver 0.1

kelime = "Şakşukacılaştıramayabileceklerimizdenkilerdensiniz";
kelime = "Tastamamdedirticiolmayacalisanlarinhuzunluuykusuzlugu"
Hecele[kelime]

"Tastamamdedirticiolmayacalisanlarinhuzunluuykusuzlugu"

"Tas-ta-mam-de-dir-ti-ci-ol-ma-ya-ca-li-san-la-rin-hu-zun-lu-uy-ku-suz-lu-gu"

x = x; Remove["Global`*"];
Off[General::spell];
Off[General::spell1];
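Hecele[x_] :=
Block[{SesliHarfler = {"a", "e", "ı", "i", "o", "ö", "u", "ü", "A", "E",
      "I", "İ", "O", "Ö", "U", "Ü"}, out = "", i, this, previous, next},
   (* SesliHarfler = vowels; out accumulates the hyphenated word. *)
   (* i, this, previous, next are localized so they do not leak into the session. *)
   For[i = 1, i <= StringLength[x], i++,
    this = StringTake[x, {i}];
    (* A blank stands in for the missing neighbour at either end of the word. *)
    previous = If[i != 1, StringTake[x, {i - 1}], " "];
    next = If[i < StringLength[x], StringTake[x, {i + 1}], " "];
    (* Rule 1: a consonant followed by a vowel starts a new syllable. *)
    If[! MemberQ[SesliHarfler, this],
     If[MemberQ[SesliHarfler, next],
       out = out <> "-";
       ];
     ];
    (* Rule 2: a vowel directly after another vowel starts a new syllable. *)
    If[MemberQ[SesliHarfler, this],
     If[MemberQ[SesliHarfler, previous],
       out = out <> "-";
       ];
     ];
    out = out <> this;
    ];
   (* Drop the hyphen produced when the word begins with consonant + vowel. *)
   If[StringLength[out] > 0 && StringTake[out, {1}] === "-",
    out = StringDrop[out, 1];
    ];
   out
   ];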
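
For readers without Mathematica, the same rule can be sketched in Python. This is a straightforward port under the same assumptions, not a replacement for TeX patterns: a hyphen goes before a consonant that is followed by a vowel, and before a vowel that directly follows another vowel; a leading hyphen is then dropped. The name `hecele` mirrors the function above.

```python
VOWELS = set("aeıioöuüAEIİOÖUÜ")  # same list as SesliHarfler above


def hecele(word):
    """Insert syllable hyphens into a Turkish word using the two rules above."""
    out = []
    for i, ch in enumerate(word):
        previous = word[i - 1] if i > 0 else " "
        following = word[i + 1] if i + 1 < len(word) else " "
        if ch not in VOWELS and following in VOWELS:
            out.append("-")   # consonant + vowel opens a new syllable
        elif ch in VOWELS and previous in VOWELS:
            out.append("-")   # vowel after vowel opens a new syllable
        out.append(ch)
    return "".join(out).lstrip("-")


for kelime in ["garabi", "gemici", "börülce", "iyilik", "önerge", "üzüm"]:
    print(hecele(kelime))
# ga-ra-bi, ge-mi-ci, bö-rül-ce, i-yi-lik, ö-ner-ge, ü-züm
```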

Mojca Miklavec said the following on 6/25/2008 9:50 AM:

Hello Turgut,

may I please ask you to comment on this issue concerning the Turkish
hyphenation patterns?

See http://tug.org/pipermail/tex-hyphen/2008-June/000243.html

Thanks a lot,
   Mojca


On Wed, Jun 25, 2008 at 6:44 PM, Mojca Miklavec wrote:

Hello Ekin,

When comparing your file with Yannis' (derived from Ottoman), I see the
following differences:

your:
2a1
2e1
2ı1
2i1
2o1
2ö1
2u1
2ü1
1b1
1c1
1ç1
1d1
1f1
1g1
1ğ1
1h1
1j1
1k1
1l1
1m1
1n1
1p1
1r1
1s1
1ş1
1t1
1v1
1y1
1z1

Yannis'/Ottoman:
2a1
2e1
2ı1
2o1
2u1
1b1
1c1
1d1
1f1
1g1
1h1
1j1
1k1
1l1
1m1
1n1
1p1
1r1
1s1
1t1
1v1
1y1
1z1

Which means that your patterns have more letters (makes sense to me -
I have no idea why the other 6 letters should be treated any
differently).
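
The difference between the two lists above can be checked mechanically. A small Python sketch, with both pattern lists copied from this message (the variable names are arbitrary):

```python
# Single-letter patterns from the "your:" list.
ekin = """2a1 2e1 2ı1 2i1 2o1 2ö1 2u1 2ü1
1b1 1c1 1ç1 1d1 1f1 1g1 1ğ1 1h1 1j1 1k1 1l1 1m1 1n1
1p1 1r1 1s1 1ş1 1t1 1v1 1y1 1z1""".split()

# Single-letter patterns from the "Yannis'/Otoman:" list.
yannis = """2a1 2e1 2ı1 2o1 2u1
1b1 1c1 1d1 1f1 1g1 1h1 1j1 1k1 1l1 1m1 1n1
1p1 1r1 1s1 1t1 1v1 1y1 1z1""".split()

# Patterns present in the first list only, kept in their original order.
only_in_ekin = [p for p in ekin if p not in set(yannis)]
print(only_in_ekin)       # ['2i1', '2ö1', '2ü1', '1ç1', '1ğ1', '1ş1']
print(len(only_in_ekin))  # 6
```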

Missing in your patterns (seems like a leftover from old patterns to
me - I guess your file is OK):

2bi.
2bö.
2bü.
2ci.
2cö.
2cü.
2ça.
2çe.
2çı.
2çi.
2ço.
2çö.
2çu.
2çü.
...

Missing from your file:
.i2
.ö2
.ü2

Also looks like a leftover to me.

Can you please comment on that?

Thanks,
   Mojca


On Wed, Jun 25, 2008 at 6:08 PM, Mojca Miklavec wrote:

On Wed, Jun 25, 2008 at 7:37 AM, S. Ekin Kocabas wrote:

I hope this file can be turned into one which may be included in the
hyph-utf8 package. Let me know if I can help in any way.

The file will be included in any case, but I have some questions. If
you take a look into the source of pattern generating script (or even
if you don't):

Vowels are divided into two groups. Some patterns only appear for one
group and some only for the other.

2a1
2e1
2ı1
2o1
2u1

.i2
.ö2
.ü2

Should they really be considered different (I'm esp. interested in
knowing why i and ı are so much different, but there probably is a
good reason for that) or was this partially a leftover from the Ottoman
rules?

Same question for consonants. One has

1b1
1c1
1d1
1f1
1g1
1h1
1j1
1k1
1l1
1m1
1n1
1p1
1r1
1s1
1t1
1v1
1y1
1z1

but I'm not sure why the other three letters are missing. I suspect
that this might be a leftover from the old encoding.

Thanks a lot,
  Mojca
