RE: FW: New version of TR29:

Marco Cimarosti Tue, 20 Aug 2002 08:20:50 -0700

Philipp Reichmuth wrote:
> MC> "O'zbek" would not split, because the apostrophe is not followed by
> MC> "a", "e", "i", "o", "u" or "y".
> 
> "G'iyosaddin" would (sorry for the silly word, it's the middle name of
> a medieval poet, but it's the first thing that came into my mind, and
> "g'" is not such a rare combination in Uzbek that this is the only
> case).
>
> You can't sensibly base a general-purpose word splitting
> algorithm on the French and Italian definition of "vowel".


It depends on what you mean by "sensibly".

To me, privileging French and Italian over Uzbek is a sensible choice. Text
written in English, or Swedish, or Chinese is more likely to contain French
quotations than Uzbek quotations.

> It is probably impossible to do that without looking at the language
> of your encoded string.

Definitely so. I thought that this was not in dispute, as DUTR#29 itself
clearly states that the general algorithm needs to be *tailored* for each
language.

An Uzbek tailoring will contain special rules for "G'" and "O'", which will
override the general mechanism.

But outside the cases covered by the Uzbek-specific tailored rules, Uzbek
text also needs to follow the default rules, so that terms or quotations
from other languages are handled correctly in the *majority* of cases...
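
To make the idea concrete, here is a rough sketch in Python of a default
rule with an Uzbek tailoring layered on top. It is only an illustration, not
the DUTR#29 algorithm: the function name, the `no_break_before` parameter,
and the choice to place the break right after the apostrophe are all my own
assumptions.

```python
# Illustrative sketch only; names and break placement are assumptions,
# not taken from DUTR#29.
VOWELS = set("aeiouy")
APOSTROPHES = {"'", "\u2019"}


def split_at_apostrophes(word, no_break_before=()):
    """Split `word` after each apostrophe that is followed by a vowel.

    `no_break_before` lists lowercase letters that, when they come
    immediately before the apostrophe, suppress the break (a tailoring).
    """
    pieces = []
    start = 0
    # Only apostrophes with a character on both sides are candidates.
    for i in range(1, len(word) - 1):
        if word[i] not in APOSTROPHES:
            continue
        prev_char = word[i - 1].lower()
        next_char = word[i + 1].lower()
        if next_char in VOWELS and prev_char not in no_break_before:
            pieces.append(word[start:i + 1])
            start = i + 1
    pieces.append(word[start:])
    return pieces


# Hypothetical Uzbek tailoring: "g'" and "o'" are treated as single letters.
UZBEK_NO_BREAK = ("g", "o")

print(split_at_apostrophes("l'amico"))
print(split_at_apostrophes("G'iyosaddin"))
print(split_at_apostrophes("G'iyosaddin", UZBEK_NO_BREAK))
print(split_at_apostrophes("l'amico", UZBEK_NO_BREAK))
```

With the tailoring in place, "G'iyosaddin" stays whole, while an Italian
quotation such as "l'amico" inside Uzbek text still falls through to the
default rule.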

_ Marco
