> I've been thinking: Perhaps the final solution is to
> do away with \lccode and \uccode completely and instead base the
> system on unicode properties?
You don't say :-)
I see at least three things here:
First, the case at hand in this thread: some legacy 8-bit patterns would
result, ideally, in two different Unicode patterns. This is not
possible for the moment because those pattern pairs would be converted
back to the same pattern in 8-bit engines, and iniTeX doesn't want
duplicate patterns. It's one of the biggest problems we had to deal
with since the beginning, but what is the rationale behind this
no-duplicate policy? I couldn't find any justification for it; the
error is thrown in “TeX: The Program” part 43, section 963, with no
comment at all; and in spite of what the help message says, Appendix H
of the TeXbook doesn't mention anything about duplicate patterns (at
least not the definitive millennium edition). A simple explanation could
be that the patterns are suspected to be buggy if they contain
duplicates; but it seems a rather weak check (and not iniTeX's job, in
my opinion), and I don't really see the harm in duplicates (just
discarding them doesn't sound that horrible). Then again, I might be
missing something, of course.
The irony here is that LuaTeX doesn't complain about duplicate
patterns anymore since the hyphenation-handling code moved over to
libHnj last October, and part 43 of the original TeX code disappeared
entirely; Taco, can you comment on that?
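
To make the duplicate problem concrete, here is a minimal sketch in
Python (not TeX). The conversion table is purely hypothetical: it
pretends that two distinct Unicode letters share a single legacy slot,
written "S". The helper names to_legacy and drop_duplicates are mine,
not part of any existing tool. It only illustrates the "just discard
them" option mentioned above.

```python
def to_legacy(pattern, table):
    """Convert a Unicode pattern to its (hypothetical) 8-bit spelling."""
    return "".join(table.get(ch, ch) for ch in pattern)


def drop_duplicates(patterns):
    """Keep the first occurrence of each pattern, preserving order."""
    seen = set()
    kept = []
    for p in patterns:
        if p not in seen:
            seen.add(p)
            kept.append(p)
    return kept


# ASSUMPTION: illustrative table only; "S" stands for one shared legacy slot.
HYPOTHETICAL_TABLE = {"\u015F": "S", "\u0219": "S"}

unicode_patterns = ["a1\u015Fi", "a1\u0219i", "1ba"]
legacy = [to_legacy(p, HYPOTHETICAL_TABLE) for p in unicode_patterns]
deduped = drop_duplicates(legacy)
```

The two distinct Unicode patterns collapse to the same legacy pattern,
and discarding the second copy leaves a list that an engine rejecting
duplicates would accept.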
Second, I'm also tempted to say that we don't need \lccode's and
\uccode's for patterns, and that we should rely on Unicode properties only.
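
As a rough sketch of what "relying on Unicode properties" could mean,
the Python snippet below derives an \lccode-like value from the Unicode
General_Category and the lowercase mapping. The function name
lccode_from_unicode is hypothetical, and treating only category L* as
letters is my assumption; combining marks get 0 here, which would
probably be wrong for Indic scripts.

```python
import unicodedata


def lccode_from_unicode(ch):
    """Return an \\lccode-like value for one character.

    ASSUMPTION: only General_Category L* counts as a letter; everything
    else gets 0, mimicking a non-letter in TeX.
    """
    if not unicodedata.category(ch).startswith("L"):
        return 0
    lowered = ch.lower()
    # Full case mapping can yield several code points (e.g. U+0130);
    # in that case fall back to the character itself.
    return ord(lowered) if len(lowered) == 1 else ord(ch)


print(hex(lccode_from_unicode("\u00C9")))  # capital E with acute
print(lccode_from_unicode("1"))            # a digit is not a letter
```

Caseless letters such as Devanagari consonants simply map to
themselves, so no per-language table is needed for them.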
Finally, I had another thought that was raised by the Sanskrit
patterns: in Indic scripts, a single glyph can correspond to a great
many characters -- up to 5 or 6, apparently, in the current patterns
contributed by Yves Codet. This blends very badly with \lefthyphenmin
and \righthyphenmin, because if we want to, say, prevent such a
5-character glyph from being hyphenated at the end of a word, we would
have to set \righthyphenmin to 5; but this would of course prevent all
the other 5-character clusters from being hyphenated, some of them
possibly corresponding to 5 actual glyphs in the font. That is,
counting characters doesn't seem as relevant as it is in the Latin or
Cyrillic scripts. I believe we should consider what Unicode calls a
grapheme cluster (“what the user thinks of as a character”) instead of
characters. Needless to say, for the existing patterns the two concepts
overlap to a great extent; the vast majority of grapheme clusters can be
represented with a single Unicode character, if not all of them. This
does not hold at all, however, for Indic scripts (nor for Arabic,
but that's hardly relevant for hyphenation).
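
To illustrate the difference between counting characters and counting
clusters, here is a small Python sketch. It is only a rough
approximation and not the segmentation algorithm from the Unicode
standard: I assume that a cluster is a base character plus any
following combining marks, and that a Devanagari virama glues the next
consonant onto the current cluster. The function name rough_clusters
and the sample string (ta, virama, sa, virama, na) are mine.

```python
import unicodedata

VIRAMA = "\u094D"  # DEVANAGARI SIGN VIRAMA


def rough_clusters(word):
    """Split a word into approximate user-perceived characters.

    ASSUMPTION: marks (category M*) attach to the previous cluster, and
    so does any character that directly follows a virama.
    """
    clusters = []
    for ch in word:
        is_mark = unicodedata.category(ch).startswith("M")
        after_virama = bool(clusters) and clusters[-1].endswith(VIRAMA)
        if clusters and (is_mark or after_virama):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


conjunct = "\u0924\u094D\u0938\u094D\u0928"  # five code points
print(len(conjunct), len(rough_clusters(conjunct)))
print(len("hyphen"), len(rough_clusters("hyphen")))
```

With this kind of counting, a \righthyphenmin expressed in clusters
would treat the five-character conjunct as one unit while leaving
Latin words unaffected.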
Arthur