Philipp Reichmuth wrote:

> MC> "O'zbek" would not split, because the apostrophe is not
> MC> followed by "a", "e", "i", "o", "u" or "y".
>
> "G'iyosaddin" would (sorry for the silly word, it's the middle name of
> a medieval poet, but it's the first thing that came into my mind, and
> "g'" is not such a rare combination in Uzbek that this is the only
> case).
>
> You can't sensibly base a general-purpose word splitting
> algorithm on the French and Italian definition of "vowel".
It depends on what you mean by "sensibly". To me, privileging French and Italian over Uzbek is a sensible choice. Text written in English, or Swedish, or Chinese is more likely to contain French quotations than Uzbek quotations.

> It is probably impossible to do that without looking at the language
> of your encoded string.

Definitely so. I thought this was not in question, as DUTR#29 itself clearly states that the general algorithm needs to be *tailored* for each language. An Uzbek tailoring will contain special rules for "G'" and "O'", which will override the general mechanism.

But, outside the cases covered by the Uzbek-specific tailored rules, Uzbek text also needs to follow the default rules, in order for terms or quotations from other languages to be handled correctly in the *majority* of cases...

_ Marco
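
For illustration only, the "default rule plus language tailoring" idea discussed above might be sketched in Python roughly as follows. This is not the DUTR#29 algorithm: the function name `split_word`, the `no_break_after` parameter and the `UZBEK_TAILORING` tuple are all hypothetical, the break is assumed to fall right after the apostrophe (as in French "l'" + "avion"), and the vowel list is just the six letters quoted in this thread.

```python
# Hypothetical sketch: default apostrophe rule with an optional tailoring.
VOWELS = set("aeiouy")            # the six letters named in the thread
APOSTROPHES = {"'", "\u2019"}     # ASCII apostrophe and right single quote


def split_word(word, no_break_after=()):
    """Split `word` after each apostrophe that is followed by a vowel.

    `no_break_after` is an assumed tailoring hook: lowercase letters which,
    when they come immediately before the apostrophe, suppress the break.
    """
    pieces = []
    start = 0
    for i, ch in enumerate(word):
        if ch not in APOSTROPHES:
            continue
        if i == 0 or i + 1 >= len(word):
            continue  # leading or trailing apostrophe: nothing to split
        prev_ch = word[i - 1].lower()
        next_ch = word[i + 1].lower()
        if next_ch not in VOWELS:
            continue  # default rule: break only before a vowel
        if prev_ch in no_break_after:
            continue  # tailored rule overrides the default
        pieces.append(word[start:i + 1])
        start = i + 1
    pieces.append(word[start:])
    return pieces


# Assumed Uzbek tailoring: "g'" and "o'" are kept together.
UZBEK_TAILORING = ("g", "o")
```

With the default rule alone, "O'zbek" stays whole and "G'iyosaddin" is wrongly split into "G'" and "iyosaddin". With `UZBEK_TAILORING` passed in, "G'iyosaddin" stays whole, while a French quotation such as "l'avion" inside Uzbek text is still split by the default rule.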

