On 16 Jun 2008, at 4:16 pm, Mojca Miklavec wrote:

IMO, where some patterns have traditionally included the apostrophe (x27),
we should probably provide duplicate patterns with U+2019 as well.

Is there any chance of achieving the same thing some other way? It
seems like yet another hack to me, one that will prevent us from
converting directly to 8-bit patterns.

1.) create a list of equivalent characters

2.)
a) parse the contents of \patterns and, if a pattern contains some
character from that list, duplicate the pattern before it's passed to TeX

It ought to be possible to do this, I guess, but it's fairly painful as TeX macro programming. (For LuaTeX it could no doubt be done much more easily in Lua, but that doesn't help XeTeX.)
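
If the duplication were done outside TeX, as a preprocessing step when the pattern files are generated, it would work for both engines. The following is only a rough sketch in Python of what steps 1 and 2a might look like. The names EQUIVALENTS and expand_patterns are made up for illustration, the sample patterns used with it are not taken from any real pattern file, and a pattern containing several different listed characters only gets one duplicate per character rather than every combination.

```python
# Step 1: a hypothetical list of equivalent characters. Each key is the
# character traditionally used in the patterns; each value lists the
# variants for which a duplicate pattern should be generated.
EQUIVALENTS = {"'": ["\u2019"]}


def expand_patterns(patterns, equivalents=EQUIVALENTS):
    """Step 2a: return the patterns, each followed by its duplicates."""
    expanded = []
    for pattern in patterns:
        expanded.append(pattern)
        for char, variants in equivalents.items():
            if char in pattern:
                # One duplicate per variant, with every occurrence replaced.
                expanded.extend(pattern.replace(char, v) for v in variants)
    return expanded
```

Calling expand_patterns(["a1b", "l'1a"]) leaves the first pattern alone and follows the second with a copy that uses U+2019 in place of the apostrophe.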

b) extend the engine (only XeTeX/LuaTeX in that case) in some way to
accept hints that some characters are equivalent during hyphenation. I
guess that \lccode does exactly that, but I'm not sure what will
happen if I set lccode of "adiaeresis" to lccode of "a" for example,
when I want to use some macro to do uppercasing/lowercasing of words
for me.

Or to take the specific example of the apostrophe, we could set \lccode"2019="27 (or vice versa, depending which way we want to write the patterns). But then if someone applies \lowercase to a run of text that includes the ’ character, they'll be surprised to see it changed to '.

The trouble is that \lccode is overloaded, being used for multiple purposes that may not always want the same set of mappings. I suppose if we had a separate \hyphequiv table, that would help -- but you're not getting a new feature like that in time for the TL2008 release!
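
To make the conflict concrete, here is a toy model in Python, not a description of how any engine is implemented. It assumes a simplified \lowercase that maps each character through a table and leaves unlisted characters alone. The names CASE_ONLY, OVERLOADED, HYPH_EQUIV, lowercase and hyphenation_key are invented for this sketch, and HYPH_EQUIV stands in for the hypothetical \hyphequiv table, which does not exist.

```python
import string

# Ordinary case mappings only (ASCII letters, to keep the model small).
CASE_ONLY = {c: c.lower() for c in string.ascii_uppercase}

# The same table with the apostrophe mapping added for hyphenation,
# i.e. the effect of \lccode"2019="27.
OVERLOADED = {**CASE_ONLY, "\u2019": "'"}

# A separate, hypothetical equivalence table used only for hyphenation.
HYPH_EQUIV = {"\u2019": "'"}


def lowercase(text, table):
    """Toy \\lowercase: map each character through the given table."""
    return "".join(table.get(c, c) for c in text)


def hyphenation_key(text):
    """Form used for pattern matching when the two tables are separate."""
    lowered = lowercase(text, CASE_ONLY)
    return "".join(HYPH_EQUIV.get(c, c) for c in lowered)
```

In this model, lowercase("L\u2019A", OVERLOADED) turns the U+2019 into an ASCII apostrophe, which is the surprise described above, while lowercase("L\u2019A", CASE_ONLY) keeps it and hyphenation_key still sees the ASCII form.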

I would really prefer not to introduce new hacks in patterns.
Apostrophe represents a single character, so it should be left as a
single character in patterns (assuming that we leave it there), only
TeX might see it in a different way.

The correct Unicode character to use would be U+2019, I think, so we could simply use that in the patterns and ignore U+0027. The trouble is that there are sure to be users who have U+0027 in their text, and expect this to behave the same way; in order to support both the "best practice" and the "ASCII-like" encoding of the data, we need two versions of the patterns. That's not really a "hack in patterns", IMO, it's a concession to the fact that real-life data will not always be encoded in the purest and best Unicode Way, and it may be helpful to try and support these "variant spellings" where possible.

JK
