On 27/1/14 09:47, Ulrike Fischer wrote:
On Mon, 13 Jan 2014 08:24:30 +0000, Jonathan Kew wrote:


So is it really true that XeTeX is not able to apply the TeX hyphenation
mechanism correctly to some Unicode characters like „ß“?
I can't believe it.

That seems unlikely. It's almost certainly being affected by something
that latex or babel or whatever is setting up.

The problem seems to be the \lccode and \uccode of ß:

During format generation of xelatex (just before the patterns are
read) they are set to 255 and 223 by "\reserved@a{"C0}{"DF}".

Aha. That looks like it relates to a legacy 8-bit codepage (Cork?), and is incorrect for a Unicode world.

\lccode of ß should certainly be 223 (0xDF), corresponding to its Unicode value U+00DF LATIN SMALL LETTER SHARP S.

Its \uccode is debatable; it should probably also be 223, as ß is normally treated as non-uppercaseable (or as uppercasing to "SS", which can't be done via \uccode), but another option would be 0x1E9E, for the (relatively recently-encoded) Unicode letter U+1E9E LATIN CAPITAL LETTER SHARP S.
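
As a rough sketch of the Unicode-appropriate settings described above (assuming a Unicode engine such as XeTeX or LuaTeX, where character codes are Unicode values; the \uccode choice is the debatable part and is shown only as an illustration):

```latex
% Sketch only: assumes XeTeX or LuaTeX, where "DF is U+00DF (ß).
\lccode"DF="DF   % ß lowercases to itself
\uccode"DF="DF   % treat ß as having no single-character uppercase form
% Alternative mentioned above (an option, not a recommendation):
% \uccode"DF="1E9E  % U+1E9E LATIN CAPITAL LETTER SHARP S
```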

But later on xelatex.ini resets them both to 223, and this disturbs
the hyphenation:


\documentclass{article}

\textwidth=1in
\usepackage{fontspec}
\usepackage[german]{babel}

\begin{document}

\showthe\lccode`\ß
\showthe\uccode`\ß

\noindent wußte geißeln wußte geißeln wußte geißeln
   wußte geißeln wußte geißeln wußte geißeln
   wußte geißeln wußte geißeln wußte geißeln
   wußte geißeln wußte geißeln wußte geißeln
\par

% Restoring the values that were active at format generation works:
\lccode`\ß=255
\uccode`\ß=223

\noindent wußte geißeln wußte geißeln wußte geißeln
   wußte geißeln wußte geißeln wußte geißeln
   wußte geißeln wußte geißeln wußte geißeln
   wußte geißeln wußte geißeln wußte geißeln
\par

\end{document}

I don't know whether it is expected behaviour or a bug in XeTeX that
the lccode/uccode matters.

It's expected that it matters, because text is mapped via \lccode for matching against hyphenation patterns.

(AFAIR, \uccode should be irrelevant here.)

But you get the same behaviour with lualatex, only the other way
round: as the patterns are read only at the beginning of the document,
the first paragraph in my example works fine, but the second, with
the changed lccode/uccode, fails.


So this probably means that the code at the end of xelatex.ini which
resets catcodes, lccodes/uccodes, etc. should move to the beginning of
hyphen.cfg, so that the correct codes are active when the patterns
are read.

That sounds right, I think. Or maybe this can be fixed within the hyph-utf8 code somewhere.
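
To illustrate the idea only (this is not the actual content of hyphen.cfg or of any hyph-utf8 loader; the file name below is a hypothetical placeholder, and only ß is handled), the codes could be set locally around the point where the patterns are read during format generation:

```latex
% Hypothetical sketch; meaningful only while building a format (iniTeX).
\begingroup
  \lccode"DF="DF   % make ß (U+00DF) map to itself while patterns are read
  % Placeholder file name, assumed to contain the \patterns{...} for German:
  \input hypothetical-german-patterns.tex
\endgroup          % the \lccode change is local; \patterns are stored globally
```

The group keeps the temporary \lccode assignment from leaking out, while the patterns themselves persist because \patterns is not affected by grouping.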

Reading the patterns with incorrect lccodes (in particular, \lccode `ß = 255) may appear to work if you set that same lccode in the document, but it's still wrong - and seems likely to lead to confusion with ÿ, which is U+00FF.
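
A quick way to inspect the two code points involved, as a sketch assuming a Unicode engine (where "DF is ß and "FF is ÿ), is to write their current \lccode values to the terminal and log:

```latex
% Sketch: report the current \lccode of ß (U+00DF) and ÿ (U+00FF).
\message{lccode of U+00DF: \the\lccode"DF}
\message{lccode of U+00FF: \the\lccode"FF}
```

If both report 255, ß and ÿ are indistinguishable for the purposes of pattern matching.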

Thanks for the diagnosis!

JK


