Re: Unicode soft hyphen and hyphenation

2007-01-13 Thread Vincent Hennebert
Jeremias Maerki wrote:
 On 12.01.2007 09:25:59 Vincent Hennebert wrote:
 Jeremias Maerki wrote:
 Good to see that happen! Here's my take:

 On 11.01.2007 13:24:16 Manuel Mall wrote:
 Hi,

 when I implemented the UAX#14 line breaking I noticed that fop doesn't 
 currently support the Unicode soft hyphen (SHY).

 I am thinking of adding support for this character to the line breaking 
 but am unsure of its correct behaviour in an XSL:FO environment. So I 
 have a few questions related to treatment of the SHY:

 1) If hyphenation is not enabled should a SHY still produce a valid 
 break opportunity or should it be ignored?
 I think it should represent a valid break opportunity.
 Well, I don't agree. See the description of SHY in section 15.2 of the
 Unicode standard: SHY is used as a hint for automatic hyphenators and
 overrides their behavior. I would typically use it for nicely rendering
 veryLongProgramVariablesLikeWeCanFindInJava in e.g. a portion of text
 describing them in some documentation. Here I obviously want to force
 hyphenation to occur between the words that make the variable name
 (Long-Program-Variables instead of LongPro-gramVar-iables or whatever).

 So, as a hint for hyphenators, SHY should be ignored when hyphenation is
 disabled, and when enabled have the priority over automatic hyphenation.
 
 Hmm, I'm used to different behaviour in word processors and I don't read

Except that I wouldn't trust any word processor when it comes to
high-quality typography :-P
Does anyone know what InDesign is supposed to do?


 the UCD spec like you do. Section 5.3 of UAX#14 also doesn't give me the
 impression that a SHY is only active when hyphenation is enabled. It
 says: "The action of a hyphenation algorithm is equivalent to the
 insertion of a SHY. However, when a word contains an explicit SHY, it is
 customarily treated as overriding the action of the hyphenator for that
 word." I read this as: SHY is the basic operator to add additional
 break points and a hyphenator can be added to do that task automatically.

Still don't agree. Overriding is not adding hyphenation points. The
following sentence in the description of SHY is pretty clear to me:
"The use of SHY is generally limited to situations where users need to
override the behavior of [an automatic] hyphenator."

[Manuel]
 Interesting but moot point I think. FOP is the automatic hyphenator in
 this case and the hyphenate property could be argued to control which
 hyphenation algorithm FOP is using. If hyphenate=true FOP is allowed
 to add its own hyphenation breaks. If hyphenate=false it uses only
 user specified hyphenation breaks (= soft hyphens).

Well, again, the description of the hyphenate property (§7.9.4) sounds
clear to me: when false, "Hyphenation may not be used in the
line-breaking algorithm".

snip/

To summarize, my opinion is that:
- if hyphenate = false, no automatic hyphenation is performed, and
  soft hyphens are discarded
- if hyphenate = true, automatic hyphenation is performed, except for
  any word that contains soft hyphens, in which case the soft hyphens
  are used to create legal breakpoints.
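
A minimal sketch of the rule summarized above, for illustration only. It assumes a hypothetical `auto_hyphenator` callable that takes a word and returns character offsets where a break is allowed; it is not FOP's API. Offsets are counted in the word with the soft hyphens removed.

```python
SHY = "\u00ad"  # U+00AD SOFT HYPHEN


def break_points_strict(word, hyphenate, auto_hyphenator):
    """Return (visible_word, break_offsets) under the rule summarized above.

    auto_hyphenator is a hypothetical callable: str -> list of offsets.
    """
    clean = word.replace(SHY, "")
    if not hyphenate:
        # hyphenate = false: soft hyphens are discarded, no breaks in the word
        return clean, []
    if SHY in word:
        # Explicit soft hyphens replace automatic hyphenation for this word
        offsets = []
        pos = 0
        for ch in word:
            if ch == SHY:
                offsets.append(pos)
            else:
                pos += 1
        return clean, offsets
    # No soft hyphen: fall back to automatic hyphenation
    return clean, auto_hyphenator(clean)
```

With this rule, a word such as `very<SHY>Long<SHY>Name` never reaches the automatic hyphenator when hyphenation is enabled, and has no break points at all when it is disabled.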

Now if the majority is against me, I'll shut up right now so as not to
hold things up.

Vincent


Re: Unicode soft hyphen and hyphenation

2007-01-13 Thread Manuel Mall
On Saturday 13 January 2007 19:57, Vincent Hennebert wrote:
 Jeremias Maerki wrote:
  On 12.01.2007 09:25:59 Vincent Hennebert wrote:
  Jeremias Maerki wrote:
  Good to see that happen! Here's my take:
 
  On 11.01.2007 13:24:16 Manuel Mall wrote:
  Hi,
 
snip/
 Still don't agree. Overriding is not adding hyphenation points. The
 following sentence in the description of SHY is pretty clear to me:
 The use of SHY is generally limited to situations where users need
 to override the behavior of [an automatic] hyphenator.

 [Manuel]

  Interesting but moot point I think. FOP is the automatic hyphenator
  in this case and the hyphenate property could be argued to control
  which hyphenation algorithm FOP is using. If hyphenate=true FOP
  is allowed to add its own hyphenation breaks. If hyphenate=false
  it uses only user specified hyphenation breaks (= soft hyphens).

 Well, again, the description of the hyphenate property (§7.9.4)
 sounds clear to me: when false, Hyphenation may not be used in the
 line-breaking algorithm.

I still think this can be interpreted both ways. It clearly forbids 
formatter-generated hyphenation, but does it also suppress 
user-specified hyphenation?

In HTML there is no hyphenation, but browsers are expected to honor the 
SHY: that is, treat it as a possible line break and, if that break is 
chosen, put a hyphen there; otherwise discard the SHY. Given that XSL:FO 
is derived from the HTML/CSS rendering model, one could argue that this 
is the default behaviour the XSL:FO authors most likely intended. If 
not, it would be difficult to construct an FO document that behaves 
like HTML with respect to hyphenation and the SHY.
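
The browser-style behaviour described above could be sketched roughly as follows. The function name and the `chosen` parameter are made up for this illustration; the line breaker that decides which soft hyphen (if any) to break at is assumed to exist elsewhere.

```python
SHY = "\u00ad"  # U+00AD SOFT HYPHEN


def split_at_shy(word, chosen=None):
    """Render a word that contains soft hyphens.

    chosen is the 0-based index of the soft hyphen at which the line
    breaker decided to break, or None if the word stays on one line.
    Returns the visible fragments (one or two strings).
    """
    parts = word.split(SHY)
    if chosen is None:
        # No break taken: every soft hyphen is simply discarded
        return ["".join(parts)]
    if not 0 <= chosen < len(parts) - 1:
        raise ValueError("no such soft hyphen")
    # Break taken: a visible hyphen ends the first fragment,
    # all other soft hyphens are discarded
    head = "".join(parts[: chosen + 1]) + "-"
    tail = "".join(parts[chosen + 1:])
    return [head, tail]
```

Only the soft hyphen that is actually used for a break becomes visible; the others leave no trace in the output.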

 snip/

 To summarize, my opinion is that:
 - if hyphenate = false, no automatic hyphenation is performed, and
   soft hyphens are discarded
 - if hyphenate = true, automatic hyphenation is performed, except
 for any word that contains soft hyphens, in which case the soft
 hyphens are used to create legal breakpoints.

 Now if the majority is against me, I'll shut up right now to not
 prevent things moving on.


Fully agree - happy to go with the majority either way.

 Vincent

Manuel


Re: Unicode soft hyphen and hyphenation

2007-01-13 Thread Simon Pepping
On Sat, Jan 13, 2007 at 08:27:20PM +0900, Manuel Mall wrote:
 On Saturday 13 January 2007 19:57, Vincent Hennebert wrote:

  Well, again, the description of the hyphenate property (§7.9.4)
  sounds clear to me: when false, Hyphenation may not be used in the
  line-breaking algorithm.
 
 I still think this can be interpreted both ways. It clearly forbids 
 formatter generated hyphenation but does it also suppress user 
 specified hyphenation?
 
 In HTML there is no hyphenation but browsers are expected to honor the 
 SHY, that is treat it as a possible line break and if chosen put a 
 hyphen there otherwise discard the SHY. Given that XSL:FO is derived 
 from the HTML/CSS rendering model one could argue that this is the 
 default behaviour the XSL:FO authors most likely intended. If not it 
 would be difficult to construct a FO document that behaves with respect 
 to hyphenation and the SHY similar to HTML.

I agree with Manuel here: SHY should always be taken into account, and
always represents a linebreak opportunity.

  snip/
 
  To summarize, my opinion is that:
  - if hyphenate = false, no automatic hyphenation is performed, and
    soft hyphens are discarded
  - if hyphenate = true, automatic hyphenation is performed, except
  for any word that contains soft hyphens, in which case the soft
  hyphens are used to create legal breakpoints.

I am not sure about this one.

Note that there is another way to let users override the automatic
hyphenation results. It is the equivalent of TeX's \hyphenation
command, which contains a list of fully hyphenated words which are
effectively added to the list of exceptions in the hyphenation
patterns. Every renderer has the freedom to provide a way for users to
specify such a list. This has nothing to do with the spec. It is part
of the hyphenation services of the renderer.
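
As a rough illustration of such an exception list (a sketch, not how any particular renderer does it): entries are written with ordinary hyphens at the allowed break points, in the style of TeX's \hyphenation argument, and take precedence over the pattern-based result. `pattern_hyphenator` is a hypothetical callable returning break offsets.

```python
def parse_exceptions(spec):
    """Parse entries such as 'ta-ble pro-gram-mer' into {word: offsets}."""
    table = {}
    for entry in spec.split():
        offsets = []
        pos = 0
        for ch in entry:
            if ch == "-":
                offsets.append(pos)
            else:
                pos += 1
        table[entry.replace("-", "").lower()] = offsets
    return table


def hyphenate_word(word, exceptions, pattern_hyphenator):
    """Use the exception entry if there is one, else the patterns."""
    key = word.lower()
    if key in exceptions:
        return exceptions[key]
    return pattern_hyphenator(word)
```

An entry without any hyphen yields an empty offset list, which means the word is never broken.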

Simon

-- 
Simon Pepping
home page: http://www.leverkruid.eu


Re: Unicode soft hyphen and hyphenation

2007-01-12 Thread Vincent Hennebert
Jeremias Maerki wrote:
 Good to see that happen! Here's my take:
 
 On 11.01.2007 13:24:16 Manuel Mall wrote:
 Hi,

 when I implemented the UAX#14 line breaking I noticed that fop doesn't 
 currently support the Unicode soft hyphen (SHY).

 I am thinking of adding support for this character to the line breaking 
 but am unsure of its correct behaviour in an XSL:FO environment. So I 
 have a few questions related to treatment of the SHY:

 1) If hyphenation is not enabled should a SHY still produce a valid 
 break opportunity or should it be ignored?
 
 I think it should represent a valid break opportunity.

Well, I don't agree. See the description of SHY in section 15.2 of the
Unicode standard: SHY is used as a hint for automatic hyphenators and
overrides their behavior. I would typically use it for nicely rendering
veryLongProgramVariablesLikeWeCanFindInJava in e.g. a portion of text
describing them in some documentation. Here I obviously want to force
hyphenation to occur between the words that make the variable name
(Long-Program-Variables instead of LongPro-gramVar-iables or whatever).

So, as a hint for hyphenators, SHY should be ignored when hyphenation is
disabled, and when enabled have the priority over automatic hyphenation.


 2) If hyphenation is enabled shall a word containing a SHY still undergo 
 hyphenation?
 Yes, IMO. A SHY may sometimes be used to handle a special case and if
 that is done in a longer word, I still expect the hyphenation to do its
 work on the rest of the word, but then taking the shy into account when
 doing word-splitting. Nothing fancy, though.

[Jörg]
 That's an interesting question. The problem is languages which use
 compound words and agglutination. Last time I looked, for the English
 language words containing shy were not automatically hyphenated, because
 this wouldn't make sense. German, Hungarian, Turkish etc. are somewhat
 more delicate.
 I think it's best to do automatic hyphenation, but remove shy (as well
 as other Unicode chars like joiners) before passing the word to the
 hyphenator. The shy position should however dominate the other
 hyphenation positions, perhaps by giving it a lower penalty.

We would just have to set the right penalties for SHY and automatic
hyphens, so that SHYs are preferred yet don't completely prevent
breaks from occurring at other hyphenation points in the word. This
will probably need some trial and error.
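
A small sketch of that idea, under stated assumptions: the soft hyphens are stripped before the word goes to a hypothetical `auto_hyphenator` callable, the two sets of break points are merged, and the SHY positions get the lower penalty. The penalty values are placeholders that would need the tuning mentioned above; they are not FOP's actual values.

```python
SHY = "\u00ad"     # U+00AD SOFT HYPHEN
SHY_PENALTY = 25   # assumed value: lower penalty means a preferred break
AUTO_PENALTY = 50  # assumed value for automatically found break points


def weighted_breaks(word, auto_hyphenator):
    """Return (visible_word, [(offset, penalty), ...]) sorted by offset."""
    clean_chars = []
    shy_offsets = set()
    for ch in word:
        if ch == SHY:
            # Record the position in the word without soft hyphens
            shy_offsets.add(len(clean_chars))
        else:
            clean_chars.append(ch)
    clean = "".join(clean_chars)
    # The hyphenator only ever sees the word without soft hyphens
    penalties = {off: AUTO_PENALTY for off in auto_hyphenator(clean)}
    # Where both agree, the explicit soft hyphen wins
    for off in shy_offsets:
        penalties[off] = SHY_PENALTY
    return clean, sorted(penalties.items())
```

A Knuth-style line breaker fed with these penalties would still be free to break at an automatic point, but would prefer the user's soft hyphen when both give an acceptable line.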


 
 3) Shall a break opportunity created by a SHY be given the same penalty 
 (in the Knuth sense) as a normal hyphenation break?
 
 Yes, IMO.

Well, I also thought yes at first, but given point 2 above...


Vincent



Re: Unicode soft hyphen and hyphenation

2007-01-12 Thread Jeremias Maerki

On 12.01.2007 09:25:59 Vincent Hennebert wrote:
 Jeremias Maerki wrote:
  Good to see that happen! Here's my take:
  
  On 11.01.2007 13:24:16 Manuel Mall wrote:
  Hi,
 
  when I implemented the UAX#14 line breaking I noticed that fop doesn't 
  currently support the Unicode soft hyphen (SHY).
 
  I am thinking of adding support for this character to the line breaking 
  but am unsure of its correct behaviour in an XSL:FO environment. So I 
  have a few questions related to treatment of the SHY:
 
  1) If hyphenation is not enabled should a SHY still produce a valid 
  break opportunity or should it be ignored?
  
  I think it should represent a valid break opportunity.
 
 Well, I don't agree. See the description of SHY in section 15.2 of the
 Unicode standard: SHY is used as a hint for automatic hyphenators and
 overrides their behavior. I would typically use it for nicely rendering
 veryLongProgramVariablesLikeWeCanFindInJava in e.g. a portion of text
 describing them in some documentation. Here I obviously want to force
 hyphenation to occur between the words that make the variable name
 (Long-Program-Variables instead of LongPro-gramVar-iables or whatever).
 
 So, as a hint for hyphenators, SHY should be ignored when hyphenation is
 disabled, and when enabled have the priority over automatic hyphenation.

Hmm, I'm used to different behaviour in word processors and I don't read
the UCD spec like you do. Section 5.3 of UAX#14 also doesn't give me the
impression that a SHY is only active when hyphenation is enabled. It
says: "The action of a hyphenation algorithm is equivalent to the
insertion of a SHY. However, when a word contains an explicit SHY, it is
customarily treated as overriding the action of the hyphenator for that
word." I read this as: SHY is the basic operator to add additional
break points and a hyphenator can be added to do that task automatically.

An example from the OpenOffice Help:
Definite separator
To support automatic hyphenation by entering a separator inside a word
yourself, use the keys Ctrl+minus sign. The word is separated at this
position when it is at the end of the line, even if automatic
hyphenation for this paragraph is switched off.

snip/
 
  2) If hyphenation is enabled shall a word containing a SHY still undergo 
  hyphenation?
  Yes, IMO. A SHY may sometimes be used to handle a special case and if
  that is done in a longer word, I still expect the hyphenation to do its
  work on the rest of the word, but then taking the shy into account when
  doing word-splitting. Nothing fancy, though.
 
 [Jörg]
  That's an interesting question. The problem is languages which use
  compound words and agglutination. Last time I looked, for the English
  language words containing shy were not automatically hyphenated, because
  this wouldn't make sense. German, Hungarian, Turkish etc. are somewhat
  more delicate.
  I think it's best to do automatic hyphenation, but remove shy (as well
  as other Unicode chars like joiners) before passing the word to the
  hyphenator. The shy position should however dominate the other
  hyphenation positions, perhaps by giving it a lower penalty.
 
 We would just have to set the right penalty for SHY and automatic
 hyphens, such that SHY are preferred yet don't completely prevent
 breaking to occur at other hyphens in the word. Will probably need some
 trial-and-error steps.
 
 
  
  3) Shall a break opportunity created by a SHY be given the same penalty 
  (in the Knuth sense) as a normal hyphenation break?
  
  Yes, IMO.
 
 Well, I was also thinking yes on the first time, but given point 2 above...

Given the wording of UAX#14 5.3, I stand by my opinion.


Jeremias Maerki



Re: Unicode soft hyphen and hyphenation

2007-01-12 Thread Manuel Mall
On Friday 12 January 2007 17:25, Vincent Hennebert wrote:
 Jeremias Maerki wrote:
  Good to see that happen! Here's my take:
 
  On 11.01.2007 13:24:16 Manuel Mall wrote:
  Hi,
 
  when I implemented the UAX#14 line breaking I noticed that fop
  doesn't currently support the Unicode soft hyphen (SHY).
 
  I am thinking of adding support for this character to the line
  breaking but am unsure of its correct behaviour in an XSL:FO
  environment. So I have a few questions related to treatment of the
  SHY:
 
  1) If hyphenation is not enabled should a SHY still produce a
  valid break opportunity or should it be ignored?
 
  I think it should represent a valid break opportunity.

 Well, I don't agree. See the description of SHY in section 15.2 of
 the Unicode standard: SHY is used as a hint for automatic hyphenators
 and overrides their behavior. I would typically use it for nicely
 rendering veryLongProgramVariablesLikeWeCanFindInJava in e.g. a
 portion of text describing them in some documentation. Here I
 obviously want to force hyphenation to occur between the words that
 make the variable name (Long-Program-Variables instead of
 LongPro-gramVar-iables or whatever).

 So, as a hint for hyphenators, SHY should be ignored when hyphenation
 is disabled, and when enabled have the priority over automatic
 hyphenation.

Interesting but moot point I think. FOP is the automatic hyphenator in 
this case and the hyphenate property could be argued to control which 
hyphenation algorithm FOP is using. If hyphenate=true FOP is allowed 
to add its own hyphenation breaks. If hyphenate=false it uses only 
user specified hyphenation breaks (= soft hyphens).

I am not saying you are wrong, just arguing that JM's initial response 
could also be construed as being compliant with both XSL:FO and Unicode.

Personally, I favour the view that a soft hyphen always presents a 
break opportunity. If a user goes to the length of adding these special 
characters, I think they would like them honoured. It especially allows 
them to work around odd behaviour caused by incomplete or incorrect 
hyphenation tables.
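
For comparison, here is a sketch of this reading, in which a soft hyphen is always a break opportunity and the hyphenate property only switches automatic hyphenation on or off. `auto_hyphenator` is a hypothetical callable returning break offsets. Whether automatic points are also added to a word that already contains soft hyphens is question 2, so it is left as a flag (`auto_in_shy_words`, an invented name).

```python
SHY = "\u00ad"  # U+00AD SOFT HYPHEN


def break_points_permissive(word, hyphenate, auto_hyphenator,
                            auto_in_shy_words=False):
    """Return (visible_word, break_offsets); soft hyphens always count."""
    clean = word.replace(SHY, "")
    offsets = set()
    pos = 0
    for ch in word:
        if ch == SHY:
            # User-specified break, honoured whatever hyphenate says
            offsets.add(pos)
        else:
            pos += 1
    if hyphenate and (not offsets or auto_in_shy_words):
        # hyphenate = true: the formatter may add its own break points
        offsets.update(auto_hyphenator(clean))
    return clean, sorted(offsets)
```

With `hyphenate` false this returns exactly the user's soft hyphen positions, which is what lets a user bypass a faulty hyphenation table.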

  2) If hyphenation is enabled shall a word containing a SHY still
  undergo hyphenation?
 
  Yes, IMO. A SHY may sometimes be used to handle a special case and
  if that is done in a longer word, I still expect the hyphenation to
  do its work on the rest of the word, but then taking the shy into
  account when doing word-splitting. Nothing fancy, though.

 [Jörg]

  That's an interesting question. The problem is languages which use
  compound words and agglutination. Last time I looked, for the
  English language words containing shy were not automatically
  hyphenated, because this wouldn't make sense. German, Hungarian,
  Turkish etc. are somewhat more delicate.
  I think it's best to do automatic hyphenation, but remove shy (as
  well as other Unicode chars like joiners) before passing the word
  to the hyphenator. The shy position should however dominate the
  other hyphenation positions, perhaps by giving it a lower penalty.


Well, if a user specifies explicit hyphenation points, isn't he telling 
the system "use mine and don't use yours"? Although it could be argued 
that the user could disable hyphenation altogether (assuming SHY is 
honoured in that case) if he doesn't like the automatic hyphenation. 
Unfortunately XSL:FO allows this to be controlled only on a block 
basis, so the user is constrained in his options: he cannot disable 
hyphenation for a particular word.

 We would just have to set the right penalty for SHY and automatic
 hyphens, such that SHY are preferred yet don't completely prevent
 breaking to occur at other hyphens in the word. Will probably need
 some trial-and-error steps.

  3) Shall a break opportunity created by a SHY be given the same
  penalty (in the Knuth sense) as a normal hyphenation break?
 
  Yes, IMO.

 Well, I was also thinking yes on the first time, but given point 2
 above...


 Vincent

Manuel


Re: Unicode soft hyphen and hyphenation

2007-01-11 Thread Jeremias Maerki
Good to see that happen! Here's my take:

On 11.01.2007 13:24:16 Manuel Mall wrote:
 Hi,
 
 when I implemented the UAX#14 line breaking I noticed that fop doesn't 
 currently support the Unicode soft hyphen (SHY).
 
 I am thinking of adding support for this character to the line breaking 
 but am unsure of its correct behaviour in an XSL:FO environment. So I 
 have a few questions related to treatment of the SHY:
 
 1) If hyphenation is not enabled should a SHY still produce a valid 
 break opportunity or should it be ignored?

I think it should represent a valid break opportunity.

 2) If hyphenation is enabled shall a word containing a SHY still undergo 
 hyphenation?

Yes, IMO. A SHY may sometimes be used to handle a special case and if
that is done in a longer word, I still expect the hyphenation to do its
work on the rest of the word, but then taking the shy into account when
doing word-splitting. Nothing fancy, though.

 3) Shall a break opportunity created by a SHY be given the same penalty 
 (in the Knuth sense) as a normal hyphenation break?

Yes, IMO.


Jeremias Maerki