Re: Why is / is valid line breaking char in FOP?

2005-10-26 Thread Jeremias Maerki
IMO, start it in XML Graphics Commons. We can always move it somewhere
else if the need arises.

On 26.10.2005 22:30:48 J.Pietschmann wrote:

> I'd rather have the code in a reusable library outside of the FOP
> project (in particular the infrastructure dealing with Unicode files
> and the table generator). Unfortunately, none of the jakarta commons
> modules showed much enthusiasm for integrating it, and I don't think
> I have enough time to maintain a new module for this.


Jeremias Maerki



Re: Why is / is valid line breaking char in FOP?

2005-10-26 Thread J.Pietschmann

Manuel Mall wrote:
> I like the idea of having a UNICODE conformant/compliant/based line
> breaking algorithm in FOP. Note this has nothing to do with the Knuth
> algorithm used in FOP. I am talking about using the UNICODE algorithm
> to determine line break opportunities.

That's exactly the purpose of both BreakIterator and my implementation.

> Shall we use your work in FOP

That was the basic idea.

> and if so how can we best integrate it?

Well, I'd rather have the code in a reusable library outside of the FOP
project (in particular the infrastructure dealing with Unicode files
and the table generator). Unfortunately, none of the jakarta commons
modules showed much enthusiasm for integrating it, and I don't think
I have enough time to maintain a new module for this.

> BTW, looking at http://www.unicode.org/reports/tr14/ with respect to the
> SOLIDUS, that is line breaking property SY, it is actually quite
> complex as it does not allow a break within a sequence of digits, e.g.
> 26/10/2005 and discourages breaking things like "w/o" or "A/S".


Oops! I should have read further.

J.Pietschmann


Re: Why is / is valid line breaking char in FOP?

2005-10-25 Thread Manuel Mall
On Wed, 26 Oct 2005 03:15 am, J.Pietschmann wrote:
> Manuel Mall wrote:
> > While investigating if we could use the standard
> > java.text.BreakIterator to determine line break points I noticed
> > that FOP uses in addition to space, zero width space, hyphen also
> > the forward slash as a valid line breaking character. The Java
> > BreakIterator does not recognize slash as a line breaking char (nor
> > FWIW does MS Word).
> >
> > What is the background to FOP allowing this? Is this consistent
> > with normal user expectations or is this specific to type setting
> > environments / Tex / Knuth?
>
> The BreakIterator class is supposed to implement the Unicode TR14
> standard annex
>   http://www.unicode.org/reports/tr14/
> The slash U+002F aka SOLIDUS is assigned a line breaking property
> value SY (Symbols Allowing Breaks)
>   http://www.unicode.org/Public/UNIDATA/LineBreak.txt
> which means "prevent a break before, and allow a break after". I
> suspect this is a recent change in Unicode, not implemented yet by
> your JDK release.
> BTW first breaking the text using whitespace, then applying the
> BreakIterator is unwise, because white space is significant for TR14
> line breaking. Unfortunately, combining whitespace normalization,
> line break detection and word parsing (for hyphenation) in a single
> pass is unwieldy if BreakIterator is used, that's why I tried to
> implement it differently some time ago
>   http://people.apache.org/~pietsch/linebreak.tar.gz
>
Joerg,

great stuff.

I like the idea of having a UNICODE conformant/compliant/based line 
breaking algorithm in FOP. Note this has nothing to do with the Knuth 
algorithm used in FOP. I am talking about using the UNICODE algorithm 
to determine line break opportunities. It is then up to the Knuth 
algorithm to convert the Knuth element lists generated from the line 
break opportunities into an optimal set of line breaks.

But how can we move forward? The current FOP code to determine line 
break opportunities looks a bit like a quick solution that works well 
for simple texts using only space, nbsp, zero width space, but not 
anything that uses more sophisticated UNICODE break characters. You 
have some code which does a better job at it but it's not in FOP. Shall 
we use your work in FOP and if so how can we best integrate it?

BTW, looking at http://www.unicode.org/reports/tr14/ with respect to the 
SOLIDUS, that is line breaking property SY, it is actually quite 
complex as it does not allow a break within a sequence of digits, e.g. 
26/10/2005 and discourages breaking things like "w/o" or "A/S".

> J.Pietschmann

Manuel
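
The SY behaviour described above can be sketched roughly as follows. This is only a simplified reading of TR14, not the full algorithm and not FOP code: the class SolidusBreakSketch, both method names and the minDistance parameter are hypothetical, and "next break opportunity" is approximated as the next space, the next slash or the end of the text.

```java
final class SolidusBreakSketch {

    /**
     * Hypothetical helper: may a line break follow the '/' at slashIndex?
     * Simplified from the SY description in TR14.
     */
    static boolean breakAllowedAfterSolidus(CharSequence text, int slashIndex) {
        if (text.charAt(slashIndex) != '/') {
            throw new IllegalArgumentException("no solidus at index " + slashIndex);
        }
        int next = slashIndex + 1;
        if (next >= text.length()) {
            return false; // end of text: nothing left to put on the next line
        }
        char following = text.charAt(next);
        if (Character.isDigit(following)) {
            return false; // keeps digit sequences such as 26/10/2005 together
        }
        // No break before another solidus (SY prevents a break before itself)
        // and none before a space.
        return following != '/' && following != ' ';
    }

    /**
     * Assumed policy for the "discouraged" cases such as "w/o": only use the
     * opportunity if at least minDistance characters follow before the next
     * space, the next solidus or the end of the text.
     */
    static boolean shouldUseBreak(CharSequence text, int slashIndex, int minDistance) {
        if (!breakAllowedAfterSolidus(text, slashIndex)) {
            return false;
        }
        int start = slashIndex + 1;
        int end = start;
        while (end < text.length() && text.charAt(end) != ' ' && text.charAt(end) != '/') {
            end++;
        }
        return end - start >= minDistance;
    }

    public static void main(String[] args) {
        System.out.println(breakAllowedAfterSolidus("26/10/2005", 2));
        System.out.println(shouldUseBreak("w/o", 1, 4));
        System.out.println(shouldUseBreak("http://www.unicode.org/reports/tr14/", 6, 4));
    }
}
```

With a threshold like this, long URL segments keep their break opportunities while short abbreviations stay in one piece; the right value for minDistance would have to be decided by the layout engine.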


Re: Why is / is valid line breaking char in FOP?

2005-10-25 Thread J.Pietschmann

Manuel Mall wrote:
> While investigating if we could use the standard java.text.BreakIterator
> to determine line break points I noticed that FOP uses in addition to
> space, zero width space, hyphen also the forward slash as a valid line
> breaking character. The Java BreakIterator does not recognize slash as
> a line breaking char (nor FWIW does MS Word).
>
> What is the background to FOP allowing this? Is this consistent with
> normal user expectations or is this specific to type setting
> environments / Tex / Knuth?



The BreakIterator class is supposed to implement the Unicode TR14
standard annex
 http://www.unicode.org/reports/tr14/
The slash U+002F aka SOLIDUS is assigned a line breaking property
value SY (Symbols Allowing Breaks)
 http://www.unicode.org/Public/UNIDATA/LineBreak.txt
which means "prevent a break before, and allow a break after". I suspect
this is a recent change in Unicode, not implemented yet by your JDK
release.
BTW first breaking the text using whitespace, then applying the
BreakIterator is unwise, because white space is significant for TR14
line breaking. Unfortunately, combining whitespace normalization, line
break detection and word parsing (for hyphenation) in a single pass is
unwieldy if BreakIterator is used, that's why I tried to implement it
differently some time ago
 http://people.apache.org/~pietsch/linebreak.tar.gz

J.Pietschmann
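
To check what a particular JDK release does with the slash, one can run the line BreakIterator over the whole text, whitespace included, rather than over tokens that were already split at spaces. The following is a minimal sketch; the class name BreakOpportunities and the sample string are made up for illustration, and whether an offset is reported after '/' depends on the JDK in use.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class BreakOpportunities {

    /**
     * Returns the offsets at which the JDK's line BreakIterator reports a
     * boundary, excluding offset 0. The end of the text is always included.
     */
    static List<Integer> find(String text) {
        BreakIterator it = BreakIterator.getLineInstance(Locale.ROOT);
        it.setText(text); // the complete text, not pre-split words
        List<Integer> offsets = new ArrayList<>();
        for (int pos = it.first(); pos != BreakIterator.DONE; pos = it.next()) {
            if (pos > 0) {
                offsets.add(pos);
            }
        }
        return offsets;
    }

    public static void main(String[] args) {
        String text = "see http://www.unicode.org/reports/tr14/ on 26/10/2005 w/o delay";
        int previous = 0;
        for (int offset : find(text)) {
            // Print each segment between consecutive boundaries.
            System.out.println(offset + ": [" + text.substring(previous, offset) + "]");
            previous = offset;
        }
    }
}
```

If a segment in the output ends with '/', that JDK treats the solidus as a break opportunity at that position; if the URL comes out as a single segment, it does not.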


Re: Why is / is valid line breaking char in FOP?

2005-10-25 Thread Luca Furini

Manuel Mall wrote:

> While investigating if we could use the standard java.text.BreakIterator
> to determine line break points I noticed that FOP uses in addition to
> space, zero width space, hyphen also the forward slash as a valid line
> breaking character. The Java BreakIterator does not recognize slash as a
> line breaking char (nor FWIW does MS Word).
>
> What is the background to FOP allowing this? Is this consistent with
> normal user expectations or is this specific to type setting
> environments / Tex / Knuth?


I don't remember whether it was already there or whether I added the slash
to the other allowed characters, but in my view it is useful when the text
of a block contains a long URL, which would otherwise have no feasible
breaks (apart maybe from some "-", which could be misleading).


Regards
Luca


Why is / is valid line breaking char in FOP?

2005-10-25 Thread Manuel Mall
My apologies if that has been discussed before. 

While investigating if we could use the standard java.text.BreakIterator 
to determine line break points I noticed that FOP uses in addition to 
space, zero width space, hyphen also the forward slash as a valid line 
breaking character. The Java BreakIterator does not recognize slash as 
a line breaking char (nor FWIW does MS Word).

What is the background to FOP allowing this? Is this consistent with 
normal user expectations or is this specific to type setting 
environments / Tex / Knuth?

Regards

Manuel