From: "Anto'nio Martins-Tuva'lkin" <[EMAIL PROTECTED]>
> On 2003.05.25, 00:00, Philippe Verdy <[EMAIL PROTECTED]> wrote:
> > even if the Dutch language considers it as a single letter, in a
> > way similar to the Spanish "ch"
>
> I see one major difference: When you apply extra wide inter-char
> distance, you (should) get, f.i.:
> K o r t r ij k and not K o r t r i j k
> but E l c h e and not E l ch e
> This is common practice in both spanish and dutch typography, ISTK.
> I was told in this forum that the surest way to keep this working in
> Unicode texts is to use "i<WJ>j" for Dutch and plain "ij" for other
> languages.
My opinion about this is not related to the use or non-use of joiner and disjoiner
controls.
I think this belongs to the locale definition of breakers (I mean the set of breakers for
sentences, lines, words, hyphenation):
Shouldn't it go into the definition of locale-specific ***character (or
character-cluster) breakers***, going beyond what Unicode can provide in a single,
unified character model that only tries to represent international text independently
of the language?
After all, Unicode mostly defines only the abstract characters required to encode a
given script, outside of any typographical considerations of fonts and style effects;
it does not really address locale-specific needs for particular typographical uses
such as line justification...
Once again, Unicode should not attempt to be a markup language. It only represents
text as a linear stream of abstract characters encoded in strings that can be
transmitted. Unicode does not specify typographic needs. Those belong to other
systems such as HTML, SGML, or XSLT and CSS, plus other internationalization standards
such as transliteration rules, and domain specific conventions, or even the art of
text translation...
Regarding your request to handle ij specially in Dutch, nothing forbids a locale-aware
rendering application from remapping the i+j pair to a single ij character before
rendering it, if the text is labelled as Dutch...
So, with a few locale-specific character-cluster breaking rules, you could get:
K o r t r ij k and not K o r t r i j k
B i j e c t i e and not B ij e c t i e
(simply because i+j is a single combined Dutch ij character only if it is not followed
by a vowel)
For the same reason, a French text rendered with strict typography would give:
B oe u f and not B o e u f
(in this case it would render the oe ligature)
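As a rough illustration of such rules, here is a minimal sketch that letter-spaces a
word using per-language cluster patterns. The CLUSTER_RULES table and the letterspace()
function are hypothetical names of mine, not taken from any standard or library, and
the French "oe before u" rule is only a crude stand-in for what a real implementation
would look up in a dictionary. It keeps the pair together as one cluster rather than
remapping it to a single character.

```python
import re

# Hypothetical per-language cluster rules (illustrative only).
# Dutch: "ij" is one cluster unless a vowel follows (the heuristic stated above).
# French: "oe" is one cluster when "u" follows (assumed rule, for the example only).
CLUSTER_RULES = {
    "nl": re.compile(r"ij(?![aeiou])", re.IGNORECASE),
    "fr": re.compile(r"oe(?=u)", re.IGNORECASE),
}

def letterspace(text, lang, sep=" "):
    """Insert `sep` between display units, keeping locale clusters together."""
    rule = CLUSTER_RULES.get(lang)
    units = []
    pos = 0
    while pos < len(text):
        # Try to match a locale-specific cluster starting at this position.
        m = rule.match(text, pos) if rule else None
        if m:
            units.append(m.group())
            pos = m.end()
        else:
            units.append(text[pos])
            pos += 1
    return sep.join(units)

print(letterspace("Kortrijk", "nl"))   # K o r t r ij k
print(letterspace("Bijectie", "nl"))   # B i j e c t i e
print(letterspace("Boeuf", "fr"))      # B oe u f
print(letterspace("Elche", "es"))      # E l c h e
```

Text with no language label (or a label with no rules) simply falls back to one unit
per character, which is the "best we can" behaviour for unlabelled text.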
Such an approach is still much less complicated than what is actually needed for Brahmic
scripts, and even worse for Thai! And it could handle the deficiencies of some
conversions to legacy character sets, for example restoring the final form of a Greek
sigma when appropriate.
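The sigma case could be sketched roughly as follows. This is only a heuristic of my
own (the function name restore_final_sigma is hypothetical): it assumes precomposed
(NFC) text and treats any sigma that follows a word character and is not followed by
one as word-final. Real text has exceptions, such as abbreviations, that it ignores.

```python
import re

SIGMA = "\u03c3"        # GREEK SMALL LETTER SIGMA
FINAL_SIGMA = "\u03c2"  # GREEK SMALL LETTER FINAL SIGMA

# A sigma preceded by a word character and not followed by one is assumed to be
# word-final. Note that \w also covers digits and "_", which is good enough here.
_WORD_FINAL = re.compile(r"(?<=\w)" + SIGMA + r"(?!\w)")

def restore_final_sigma(text):
    """Replace word-final U+03C3 with U+03C2 (heuristic, assumes NFC input)."""
    return _WORD_FINAL.sub(FINAL_SIGMA, text)

print(restore_final_sigma("σοφοσ ανθρωποσ"))  # σοφος ανθρωπος
```

An isolated sigma (used as a symbol, say) is left alone because nothing precedes it.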
So the only good question to ask is whether we can label the text with its language,
using some markup system, or at least using the Unicode language tags needed as a
possible interface for font renderers that cannot interpret a markup system...
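For the plain-text route, here is a small sketch of how a renderer might split a string
into language runs from the Unicode tag characters: U+E0001 LANGUAGE TAG, the tag
characters U+E0020..U+E007E that mirror ASCII, and U+E007F CANCEL TAG. The function
name split_language_runs is hypothetical, and this handles only the simple case; keep
in mind that Unicode itself discourages these characters wherever markup is available.

```python
LANGUAGE_TAG = 0xE0001                    # U+E0001 LANGUAGE TAG
CANCEL_TAG = 0xE007F                      # U+E007F CANCEL TAG
TAG_FIRST, TAG_LAST = 0xE0020, 0xE007E    # tag characters mirroring ASCII 0x20..0x7E

def split_language_runs(text, default=None):
    """Return a list of (language, visible_text) runs, with tag characters removed."""
    runs = []        # finished (language, text) pairs
    lang = default   # language in effect for the current run
    buf = []         # visible characters of the current run
    tag = None       # tag letters being collected, or None outside a tag

    for ch in text:
        cp = ord(ch)
        in_tag_range = TAG_FIRST <= cp <= TAG_LAST
        if tag is not None and in_tag_range:
            tag.append(chr(cp - 0xE0000))  # map the tag character back to ASCII
            continue
        if tag is not None:
            # The tag has ended: close the previous run and switch language.
            if buf:
                runs.append((lang, "".join(buf)))
                buf = []
            lang = "".join(tag) or default
            tag = None
        if cp == LANGUAGE_TAG:
            tag = []
        elif cp == CANCEL_TAG:
            if buf:
                runs.append((lang, "".join(buf)))
                buf = []
            lang = default
        elif not in_tag_range:
            buf.append(ch)   # ordinary visible character
    if buf:
        runs.append((lang, "".join(buf)))
    return runs

tagged = "\U000E0001\U000E006E\U000E006C" + "Kortrijk" + "\U000E007F" + " Elche"
print(split_language_runs(tagged))  # [('nl', 'Kortrijk'), (None, ' Elche')]
```

Each run's language could then select the cluster-breaking rules to apply, with
untagged runs getting the default, language-neutral treatment.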
I would not be shocked to see the ligated or combined forms not rendered in a text
simply because the text is marked with the wrong language, or because such
markup is simply not available. This exception is similar to the common approach
of rendering the text as best we can with the tools we have, by using
canonical or compatibility equivalences.
But I see nothing in Unicode that would require the text to be encoded only with the
Unicode preferred character, merely because Unicode recommends it, when in practice
other standards exist that mandate input methods or keyboards with which such
composition is widely impractical. The strict typographic rules cannot be applied
without some smart algorithm, but the reader will always make the correct
interpretation of the text (it is the interpretation of text that Unicode
standardizes, not its rendering).
-- Philippe.