Dear Eric,

On 20 May 2016 at 18:21, Muller, Eric wrote:
> A few questions:
>
> 1. the hyphenation patterns are meant to work on text that has been
> "normalized" in some way;

In the early days of TeX it was sufficient if it worked with 8-bit fonts and whatever special treatment the macro package (like Babel) provided to set the catcodes of characters.

> I know that at least all uppercase letters should
> be converted to lowercase.

True.

> Looking at the French patterns, I see that they
> account for apostrophe by U+0027 but not for U+2019, so I suppose that
> U+2019 should be folded to U+0027.

This needs a bit of explanation and perhaps some further discussion.

If you take the patterns from
    https://github.com/hyphenation/tex-hyphen/blob/master/hyph-utf8/tex/generic/hyph-utf8/patterns/txt/hyph-fr.pat.txt
you'll see that both are present. If you were looking at hyph-fr.tex, then we do indeed only have U+0027 there, but that's because 8-bit TeX automatically "converts" U+0027 into U+2019 (or rather has the glyph for U+2019 in slot 0x27). In any case I believe that you should take the patterns from the plain text file, not from "*.tex".

However, a bit of further discussion might be in order. I believe that we should start supporting equivalence classes at some point. At least for my mother tongue many characters are absolutely equivalent (for example o = ó = ò = ô; they don't even change the meaning of the word). And hyphenation patterns for quite a few languages, like Turkish, just define equivalence classes and then write the same pattern repeated for all pairs of characters. It would be a lot "saner" if patterns defined equivalence classes (including lowercase and uppercase letters being in the same class; or apostrophes) and the engine then interpreted them properly.

With Ethiopic the only rule is "feel free to hyphenate anywhere" (except just before commas etc.). So we made hyphenation patterns saying just: for each letter <l> from the alphabet, add
    1<l>1
which could in fact be just a single pattern if we had support for equivalence classes.
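
To make the equivalence-class idea concrete, here is a minimal Python sketch. It does not describe how any TeX engine or existing pattern file works today: the class table and the names `EQUIVALENTS`, `fold` and `per_letter_patterns` are hypothetical and chosen only for illustration. It shows input being folded onto one representative per class before pattern lookup, next to the per-letter "1<l>1" expansion that is needed when classes are not available.

```python
# Hypothetical equivalence classes: each character maps to one representative.
# This table is illustrative only; it is not taken from any pattern file.
EQUIVALENTS = {
    "\u2019": "'",   # right single quotation mark -> ASCII apostrophe
    "\u00f3": "o",   # o with acute
    "\u00f2": "o",   # o with grave
    "\u00f4": "o",   # o with circumflex
}


def fold(word):
    """Map a word onto class representatives before pattern lookup."""
    lowered = word.lower()  # uppercase and lowercase share a class
    return "".join(EQUIVALENTS.get(ch, ch) for ch in lowered)


def per_letter_patterns(alphabet):
    """Without classes: emit one '1<l>1' pattern for every letter."""
    return ["1{}1".format(letter) for letter in alphabet]


if __name__ == "__main__":
    print(fold("L\u2019H\u00f4tel"))
    # Three Ethiopic syllables, just as sample input.
    print(per_letter_patterns("\u1200\u1201\u1202"))
```

With such a table the pattern file would only need patterns written in terms of the representatives, and the engine (or a preprocessing step) would apply `fold` to the text first.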

> It also seems that something should be
> done to fold combining sequences to precomposed characters. I could not find
> any documentation of what the normalization should be?

The old TeX did not support combining characters in any way. XeTeX does some "black magic" in the background (I believe it does some Unicode normalization, but I don't know the details). I'm not sure what (if anything) LuaTeX does.

We currently have "œ + combining acute" in one of the patterns, and that one is also "a bit problematic" because it should be treated as a single glyph (and probably isn't). So we also added "do not hyphenate before combining acute", which is a bit of a strange rule.

Thai and Lao are also a bit "weird" in a way, with hyphenation patterns actually trying to prevent "combining characters" from being split from the rest. And some hyphenation patterns (mostly for Indic languages) include rules for non-breaking space etc.

> 2. In a layout engine, the most likely organization is to use Unicode UAX#14
> (may be with tailorings for the locales) to determine linebreak
> opportunities, and then may be to try to hyphenate the pieces between two
> linebreak opportunities. Those fragments can contain pretty much arbitrary
> characters. I suspect that the text between linebreak opportunities should
> be broken into subruns, corresponding to some notion of word. For example,
> with the string "foo<NBSP>…<NBSP>bar" (… is U+2026), it seems that
> hyphenating that whole string returns an hyphenation opportunity after the
> second <NBSP>. I suspect that "foo" and "bar" should be isolated and
> presented independently to the hyphenation engine. But what are the rules
> for that tokenization?

I hope that someone else will answer that question.

(I just wanted to say that TeX has issues with compound words and situations like that. You probably shouldn't take TeX as your role model.)

> 3. I suspect that different languages may want different
> normalization/tokenization?
>
> 4. all that suggests that there normalization/tokenization rules should be
> captured with the hyphenation patterns, preferably in a way that can be
> exploited by code.
>
> Are my assumptions correct? has all this already been discussed? resolved?
>
> Incidentally, I found
> <https://wiki.openoffice.org/wiki/Documentation/SL/Using_TeX_hyphenation_patterns_in_OpenOffice.org#4._Add_hyphenation_rules_for_special_characters>,
> which seems to deal with the same problem.

I want to leave answering those questions to someone else.

Mojca
