On 19.05.2010 21:55, Stephan Hennig wrote:

    * a regular pattern set containing all valid hyphenations,
    * a compound-word pattern set that matches only word compounds,
    * an undesirable pattern set that recognizes valid but undesirable
      hyphenations

Realistically, I can think of five different pattern sets for hyphenating German (in order of decreasing weight):

  A) word compounds
       - hyphenation is much preferred
       - weight 20
       - examples: Text-illustration, Tal-entwässerung

  B) affix hyphenation
       - still appreciated; Knuth had experimented with this before
         Liang developed the dictionary algorithm
       - weight 15
       - examples: Textillustra-tion, Talent-wässe-rung

  C) all valid hyphenations
       - these correspond to the current patterns
       - weight 10
       - examples: Text-il-lus-tra-ti-on, Tal-ent-wäs-se-rung

  D) undesirable hyphenations
       - hyphenations near a word compound or word boundary
       - weight 5
       - examples: Textil-lustrati-on

  E) sense distorting
       - to be suppressed at all costs
       - weight 1 or 0
       - examples: Talent-wässerung, Textil-lustration

I hope the examples in case E are understandable even to non-German speakers.
Talentwässerung (valley drainage) has nothing to do with "talent" and Textillustration (text illustration) has nothing to do with "textiles".

Note how Talent-wässerung is matched by pattern sets B, C and E. Similarly, Textil-lustration is matched by pattern sets C, D and E. Both hyphenations are sense distorting and must be suppressed at all costs.

A sane ranking of the pattern sets would be (in order of decreasing priority):

  1. sense distorting        (E)
       - suppress

  2. word compounds          (A)
       - prefer

  3. undesirable             (D)
       - avoid

  4. affix                   (B)
       - prefer

  5. regular                 (C)
       - if nothing else fits

That results in the following hyphenation weights:

  Text -20- il -0- lus -10- tra -15- ti -5- on
  Tal -20- ent -0- wäs -10- se -15- rung
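
As a rough sketch of how the ranking could be turned into those weights (Python, purely illustrative): the break points below are typed in by hand from the examples above, not produced by real pattern sets, and the names parse, break_weights and render are made up for this sketch. Case E is assumed to get weight 0.

```python
# Weights per pattern set and the priority ranking from this message.
# E is assumed to be 0 here (the message allows "1 or zero").
WEIGHTS = {"A": 20, "B": 15, "C": 10, "D": 5, "E": 0}
PRIORITY = ["E", "A", "D", "B", "C"]  # highest priority first


def parse(hyphenated):
    """Return the character offsets of the break points in e.g. 'Tal-ent'."""
    offsets, pos = set(), 0
    for part in hyphenated.split("-")[:-1]:
        pos += len(part)
        offsets.add(pos)
    return offsets


def break_weights(matches):
    """Map each break offset to the weight of the highest-priority set matching it.

    matches: dict from pattern-set letter to a set of offsets.
    """
    result = {}
    # Walk from lowest to highest priority so higher-priority sets overwrite.
    for letter in reversed(PRIORITY):
        for offset in matches.get(letter, ()):
            result[offset] = WEIGHTS[letter]
    return result


def render(word, weights):
    """Format a word as 'Tal -20- ent -0- ...'."""
    out, prev = [], 0
    for offset in sorted(weights):
        out.append(word[prev:offset])
        out.append(f"-{weights[offset]}-")
        prev = offset
    out.append(word[prev:])
    return " ".join(out)


# Hand-entered matches, copied from the examples for sets A to E.
TEXT = {
    "A": parse("Text-illustration"),
    "B": parse("Textillustra-tion"),
    "C": parse("Text-il-lus-tra-ti-on"),
    "D": parse("Textil-lustrati-on"),
    "E": parse("Textil-lustration"),
}
TAL = {
    "A": parse("Tal-entwässerung"),
    "B": parse("Talent-wässe-rung"),
    "C": parse("Tal-ent-wäs-se-rung"),
    "E": parse("Talent-wässerung"),
}

print(render("Textillustration", break_weights(TEXT)))
# Text -20- il -0- lus -10- tra -15- ti -5- on
print(render("Talentwässerung", break_weights(TAL)))
# Tal -20- ent -0- wäs -10- se -15- rung
```

The point of resolving by priority rather than, say, taking the maximum weight is that E has the lowest weight but the highest priority: a break matched by both B and E must end up at 0, not at 15.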

For the German language, that level of granularity in hyphenation control would be great, even though finding a good set of weights (demerits) for the paragraph-breaking algorithm won't be easy. The current demerits are already awkward enough. And I wouldn't give up much grey value for more legible hyphenations. But if, say, one C-type hyphenation per page turns into an A-type hyphenation, or one D-type hyphenation turns into a B-type hyphenation, it pays off, IMO.

Best regards,
Stephan Hennig
