fantasai <[email protected]>:

>> The problem is that the hyphenation system in itself can't decide how
>> to change the spelling, without any "dictionary" functionality. It
>> can't know if I meant "mat-tjuv" ("food thief" in Swedish) or "matt-tjuv"
>> ("carpet thief") when I wrote "mat&shy;tjuv". So there has to be a way
>> to tell the hyphenation system that.

Imagine if there were also ‘matt·juv’ next to ‘mat·tjuv’ and ‘matt·tjuv’, or
even ‘mat·ttjuv’.

> Hm. I don't think I have a solution for that problem. :/
> Currently you'd just have to not hyphenate that word.

Smart-font solution (OpenType, AFDKO syntax):

  “mattjuv, matttjuv”

  # Show three consecutive t's as two.
  lookup tripleletters {
    sub t' t' t by t;
  } tripleletters;

  feature rlig {
    script latn;
    language SVE exclude_dflt;  # 'SVE ' is the OpenType tag for Swedish
    lookup tripleletters;
  } rlig;
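
As a rough illustration of what that lookup does to a glyph run, here is a
small Python sketch. It is not a shaping engine: the function name
collapse_triple_t is made up for this example, and it models only the single
rule for ‘t’.

```python
def collapse_triple_t(text: str) -> str:
    """Mimic `sub t' t' t by t;`: show three t's in a row as two."""
    out = []
    i = 0
    while i < len(text):
        if text[i:i + 3] == "ttt":
            # The first two t's become one glyph; the third t is only
            # context and is emitted on the next pass of the loop.
            out.append("t")
            i += 2
        else:
            out.append(text[i])
            i += 1
    return "".join(out)
```

With this, collapse_triple_t("matttjuv") gives "mattjuv", while "mattjuv" is
left alone. If the word is broken as "matt-" / "tjuv", neither fragment
contains three t's, so nothing is collapsed; this assumes the line breaker
works on the typed text before the font is applied.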

Combining Grapheme Joiner (U+034F, ‘CGJ’) could possibly be given an 
interpretation like this (XML syntax), but Zero-Width Non-Joiner (U+200C,
‘ZWNJ’) should probably not be repurposed:

  “mattjuv, mat&#x34F;tjuv”

Possible Unicode solution with a new combining character that makes the 
preceding character or grapheme – I’m not sure which – invisible except at the 
end of a line:

  “mattjuv, matt&#x2065;tjuv”

  U+2065 – Combining Collapse or Reduplicating Soft Hyphen or so
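
To make the intended behaviour concrete, here is a minimal Python sketch of
how a renderer might treat such a character. Everything in it is an
assumption for illustration: U+2065 is unassigned, the names COLLAPSE and
render are invented, the mark is taken to affect the preceding character
(not grapheme), and a visible hyphen is added at the break as a soft hyphen
would.

```python
COLLAPSE = "\u2065"  # unassigned code point; its role here is hypothetical


def render(word: str, break_here: bool) -> list[str]:
    """Return the displayed line(s) for a word with at most one COLLAPSE mark."""
    if COLLAPSE not in word:
        return [word]
    head, tail = word.split(COLLAPSE, 1)
    if break_here:
        # At a line end the marked letter stays visible and a hyphen is added.
        return [head + "-", tail]
    # Inside a line the letter before the mark is hidden.
    return [head[:-1] + tail]
```

So render("matt\u2065tjuv", False) yields ["mattjuv"], and with
break_here=True it yields ["matt-", "tjuv"].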

All solutions require author education. The latter two require changes to
existing software and specifications (including CSS); the first “just” needs
updated fonts. The second solution would fall back gracefully to ‘mattjuv’,
the others to ‘matttjuv’, maybe even with a .notdef glyph in there.

All of these approaches are too complicated for Joe Sixpack (or Jo Sexpack), so
I don’t think they will work in practice, except in environments that already
take care of border cases such as distinguishing the umlaut and diaeresis uses
of the trema dots.

JFTR, Swedish is not the only language with this orthographic feature. The
German orthography reform of 1996 did away with letter collapsing completely,
probably because of this very problem. Now the same letter can occur three
times in a row, which some consider ugly, but smart fonts can overcome most of
the perceived problems by ligating the first two letters of such a sequence or
by selecting an alternate glyph for the final one. The special treatment of
the double-‘k’ grapheme was also abolished: it used to look like ‘ck’ – often
a ligature – except at the end of a line, where it showed its real face,
‘k-k’; now it is always typed, encoded and displayed as ‘ck’ and cannot be
separated. The theoretical graphemes ‘zz’ and ‘hh’ still look like ‘tz’ and
‘ch’ respectively, of which only the former may be split, as ‘t-z’.
_______________________________________________
Unicode mailing list
[email protected]
http://unicode.org/mailman/listinfo/unicode
