Re: Unicode compliant Line Breaking

2005-11-07 Thread Jeremias Maerki
1. +1
2. +1
3.b) +1 for the separable parts, although c) is also OK for now.

+1 to try to find synergies with the code in Batik.

If I were you I'd create a branch and put your stuff in there. It's
easier for everyone to follow and to help (wishful thinking).




Jeremias Maerki



Re: Unicode compliant Line Breaking

2005-11-02 Thread Simon Pepping
On Tue, Nov 01, 2005 at 11:17:08PM +0100, J.Pietschmann wrote:
> Simon Pepping wrote:
> > Is our current hyphenation method a subset of Unicode's method?
>
> Umm. What's the relation between hyphenation and TR14 (except for
> handling soft hyphens)? I guess you confuse finding line breaks
> in general and line breaking due to hyphenation.

I mean, will our current method of finding possible line breaking
points using the hyphenation tables be part of a TR14 compliant system
to find line break opportunities?

Simon

-- 
Simon Pepping
home page: http://www.leverkruid.nl



Re: Unicode compliant Line Breaking

2005-11-02 Thread J.Pietschmann

Simon Pepping wrote:

> I mean, will our current method of finding possible line breaking
> points using the hyphenation tables be part of a TR14 compliant system
> to find line break opportunities?


In some sense yes, but I'm not sure what you really mean.

Currently, spaces and slashes (/) as well as hyphenation points
are considered break opportunities. TR14 doesn't care about hyphenation
but expands significantly on the other points. For example, in the
string "foo-bar" the position after the dash is a break opportunity,
as people usually expect, but in "-1234" the position after the dash
isn't a break opportunity, also as people usually expect. The TR
encodes as much of such expectations as is possible with a limited
context.

A few places in TextLayoutManager which use BREAK_CHARS will have to
be changed, either to keep information from a previous scan with a
BreakIterator or similar, or to look up the Unicode line break
properties of the characters and check in the line-break matrix
whether a break may occur between them. Hyphenation points are
generated elsewhere and remain unaffected.
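
As a rough illustration of the matrix lookup described above, here is a
minimal Java sketch. It is not the code from linebreak.tar.gz: the class
name PairTableSketch, the reduction to four classes (AL, NU, HY, SP) and
the table entries are assumptions chosen only to reproduce the foo-bar
and -1234 examples. The real UAX#14 pair table has many more classes and
handles spaces through indirect breaks.

```java
import java.util.ArrayList;
import java.util.List;

final class PairTableSketch {
    // Hypothetical, heavily reduced set of line break classes:
    // AL = alphabetic, NU = numeric, HY = hyphen-minus, SP = space.
    enum LbClass { AL, NU, HY, SP }

    // MAY_BREAK[before][after] is true if a break is allowed between a
    // character of class "before" and a following character of class "after".
    private static final boolean[][] MAY_BREAK = {
        //          AL     NU     HY     SP
        /* AL */ { false, false, false, false },
        /* NU */ { false, false, false, false },
        /* HY */ { true,  false, false, false },  // break after hyphen, but not before a digit
        /* SP */ { true,  true,  true,  false },  // break after a space, never before one
    };

    // Simplified stand-in for a lookup of the Unicode line break property.
    static LbClass classify(char c) {
        if (c == ' ') return LbClass.SP;
        if (c == '-') return LbClass.HY;
        if (Character.isDigit(c)) return LbClass.NU;
        return LbClass.AL;
    }

    // Returns every offset i such that a break is allowed before text.charAt(i).
    static List<Integer> breakOpportunities(String text) {
        List<Integer> result = new ArrayList<Integer>();
        for (int i = 1; i < text.length(); i++) {
            LbClass before = classify(text.charAt(i - 1));
            LbClass after = classify(text.charAt(i));
            if (MAY_BREAK[before.ordinal()][after.ordinal()]) {
                result.add(i);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(breakOpportunities("foo-bar")); // [4]
        System.out.println(breakOpportunities("-1234"));   // []
    }
}
```

With this reduced table the only opportunity in foo-bar is at offset 4,
directly after the dash, and -1234 has none.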

J.Pietschmann


Re: Unicode compliant Line Breaking

2005-11-01 Thread Simon Pepping
On Mon, Oct 31, 2005 at 03:25:12PM +0800, Manuel Mall wrote:
> In a previous post Joerg pointed to the Unicode Standard Annex #14 on
> Line Breaking (http://www.unicode.org/reports/tr14/) and his initial
> implementation: http://people.apache.org/~pietsch/linebreak.tar.gz.
>
> I have since had a closer look at both UAX#14 and Joerg's code. Because I
> liked what I saw I went about adapting Joerg's code to Unicode 4.1
> and added fairly extensive JUnit test cases to it, mainly because it
> really helps to go through the various different cases mentioned in the
> spec in some structured fashion.

Is our current hyphenation method a subset of Unicode's method?

> Assuming now that this will be agreed as well the next step would be the
> more detailed design of the integration. But this is well beyond the
> scope of this e-mail as there are some tricky issues involved and they
> probably need to be tackled in conjunction with the white space
> handling issues. Many of the problems are related to our LayoutManager
> structures which create barriers when it comes to the need to process
> character sequences across those boundaries as is the case for both
> line breaking and white space handling. Add to that the design of the

I seem to recall that the hyphenation code collects words across LM
boundaries.

It seems a useful goal to implement Unicode hyphenation. But since it
is a major effort, it does not fit in working towards a release. In
any case it would have to be in a separate branch until it proves to
work and to implement a substantial part of hyphenation. Then it does
not immediately matter if it is a separate project or a part of FOP.

Simon

-- 
Simon Pepping
home page: http://www.leverkruid.nl



Re: Unicode compliant Line Breaking

2005-11-01 Thread J.Pietschmann

Simon Pepping wrote:

> Is our current hyphenation method a subset of Unicode's method?


Umm. What's the relation between hyphenation and TR14 (except for
handling soft hyphens)? I guess you confuse finding line breaks
in general and line breaking due to hyphenation.


> I seem to recall that the hyphenation code collects words across LM
> boundaries.


As it should. Word boundaries and FO boundaries are different things:
 <block>A w<wrapper text-decoration="underline">o</wrapper>rd</block>
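
One way to picture this: the text of nested FOs is flattened into a
single character sequence for word-level analysis, while a map back to
the owning FO is kept. The Java sketch below only illustrates that idea.
Run, flatten and ownerAt are hypothetical names, not FOP classes, and
the owner is represented by a plain string.

```java
import java.util.ArrayList;
import java.util.List;

final class CrossBoundaryTextSketch {
    /** A piece of text together with the (hypothetical) name of the FO that owns it. */
    static final class Run {
        final String owner;
        final String text;

        Run(String owner, String text) {
            this.owner = owner;
            this.text = text;
        }
    }

    /** Joins all runs into the character sequence that word-level code should see. */
    static String flatten(List<Run> runs) {
        StringBuilder sb = new StringBuilder();
        for (Run run : runs) {
            sb.append(run.text);
        }
        return sb.toString();
    }

    /** Maps an offset in the flattened text back to the FO owning that character. */
    static String ownerAt(List<Run> runs, int offset) {
        if (offset < 0) {
            throw new IndexOutOfBoundsException("offset " + offset);
        }
        int start = 0;
        for (Run run : runs) {
            if (offset < start + run.text.length()) {
                return run.owner;
            }
            start += run.text.length();
        }
        throw new IndexOutOfBoundsException("offset " + offset);
    }

    /** The word "word" split over a block and a nested wrapper. */
    static List<Run> example() {
        List<Run> runs = new ArrayList<Run>();
        runs.add(new Run("block", "A w"));
        runs.add(new Run("wrapper", "o"));
        runs.add(new Run("block", "rd"));
        return runs;
    }

    public static void main(String[] args) {
        List<Run> runs = example();
        System.out.println(flatten(runs));    // A word
        System.out.println(ownerAt(runs, 3)); // wrapper
    }
}
```

Hyphenation or line breaking code would then work on "A word" as a
whole, and translate any offset it finds back to the owning FO.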

J.Pietschmann



Re: Unicode compliant Line Breaking

2005-10-31 Thread Jingjing Lee
my votes:

1.) +1
2.) +1
3.c) +1

BTW,
further to 2.a, even the most up-to-date JDK (1.5.0_05)
is not fully UAX#14 compliant. It treats QU as (A)
instead of (XB/XA). So we REALLY need an independent
implementation to follow the Unicode standard.





Re: Unicode compliant Line Breaking

2005-10-31 Thread The Web Maestro
IMO, Unicode conformant line-breaking is an important goal for FOP to 
achieve. But before I vote, I have a question:


On Oct 30, 2005, at 11:25 PM, Manuel Mall wrote:

<snip>


> 2. I would also like to propose that the Unicode conformant line
> breaking be implemented using our own pair-table based implementation
> and not using Java's line breaker, because:


Does it make sense to make the use of our own implementation, rather than
Java's, 'configurable'? That way, our users could choose whether or not to use
it. In my case, we had no need for Unicode, and IIUC the extra code
would merely serve to hinder FOP's performance & increase FOP's memory
footprint (unless it's only called when using Unicode). In addition, a
future Java implementation could bring a robust (and maintained)
Unicode solution.


<snip>


> 3. Assuming we get enough +1 for the above proposals the first item to
> decide after that would be: Where should the code live?
> a) Joerg would like to see it in Jakarta Commons but hasn't got the time
> to start the project.
> b) Jeremias suggested XMLGraphics Commons.
> c) Personally I think it is too early to factor it out. More experience
> with its design and use cases should be gathered before making it
> standalone and at this point in time it really is only 2 core Java
> classes. I would like to suggest that it initially lives under FOP in
> something like org.apache.fop.text. Should the need and energy levels
> (= developer enthusiasm) become available later to make this into a
> Jakarta Commons or XMLGraphics Commons project so be it.


I would think it would be best to start it under XML Graphics Commons 
(as that's where I suspect it will likely end up), and move it if 
necessary from there.


Regards,

Web Maestro Clay
--
[EMAIL PROTECTED] - http://homepage.mac.com/webmaestro/
My religion is simple. My religion is kindness.
- HH The 14th Dalai Lama of Tibet



Re: Unicode compliant Line Breaking

2005-10-31 Thread thomas . deweese
Hi all,

Just an FYI, Batik also currently has an implementation of the
Unicode TR14 line breaking algorithm (org.apache.batik.gvt.flow.TextLineBreak).

As far as performance is concerned, it should be fairly fast, as it is
mostly just table based.




Unicode compliant Line Breaking

2005-10-30 Thread Manuel Mall
In a previous post Joerg pointed to the Unicode Standard Annex #14 on 
Line Breaking (http://www.unicode.org/reports/tr14/) and his initial 
implementation: http://people.apache.org/~pietsch/linebreak.tar.gz.

I have since had a closer look at both UAX#14 and Joerg's code. Because I 
liked what I saw I went about adapting Joerg's code to Unicode 4.1 
and added fairly extensive JUnit test cases to it, mainly because it 
really helps to go through the various different cases mentioned in the 
spec in some structured fashion.

The results are now available for public inspection: 
http://people.apache.org/~manuel/fop/linebreak.tar.gz

1. I would like to propose that Unicode conformant line breaking be 
integrated into FOP trunk because it:
a) Moves FOP more towards being a universal formatter and not just a 
formatter for western languages
b) Moves FOP more towards becoming a high quality typesetting system 
(something that was really started by integrating Knuth style breaking)
The reason I think this needs to be voted on is because Unicode line 
breaking will in subtle ways change the current line breaking behaviour 
and therefore constitutes a (significant) change in FOP's overall 
rendering.

2. I would also like to propose that the Unicode conformant line 
breaking be implemented using our own pair-table based implementation 
and not using Java's line breaker, because:
a) It gives us full control and allows FOP to follow the Unicode 
standard (and its updates and errata) closely and therefore keep FOP's 
Unicode compliance level independent of the Java version.
b) It allows us to tailor the algorithm to match the needs of XSL-FO and 
FOP.
c) It allows us to provide user customisation features (down the track) 
not available through using the Java APIs.

Of course there are downsides, like:
a) Are we falling for the 'not invented here' syndrome?
b) Duplicating code which is already in the Java base system
c) Increasing the memory footprint of FOP

3. Assuming we get enough +1 for the above proposals the first item to 
decide after that would be: Where should the code live?
a) Joerg would like to see it in Jakarta Commons but hasn't got the time 
to start the project. 
b) Jeremias suggested XMLGraphics Commons. 
c) Personally I think it is too early to factor it out. More experience 
with its design and use cases should be gathered before making it 
standalone and at this point in time it really is only 2 core Java 
classes. I would like to suggest that it initially lives under FOP in 
something like org.apache.fop.text. Should the need and energy levels 
(= developer enthusiasm) become available later to make this into a 
Jakarta Commons or XMLGraphics Commons project so be it.
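
For comparison with proposal 2, the JDK's own line breaker can be
exercised through java.text.BreakIterator. The sketch below is only
meant for trying out sample strings. The class name JdkLineBreakSketch
and the choice of Locale.ENGLISH are assumptions, and the offsets
reported depend on the Java version in use, which is exactly the
dependency point 2a) is about.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class JdkLineBreakSketch {
    /**
     * Returns the offsets at which the JDK line breaker reports a boundary,
     * excluding offset 0 and including the end of the text.
     */
    static List<Integer> breakOffsets(String text) {
        BreakIterator it = BreakIterator.getLineInstance(Locale.ENGLISH);
        it.setText(text);
        List<Integer> offsets = new ArrayList<Integer>();
        it.first(); // position the iterator at offset 0
        for (int pos = it.next(); pos != BreakIterator.DONE; pos = it.next()) {
            offsets.add(pos);
        }
        return offsets;
    }

    public static void main(String[] args) {
        String[] samples = { "foo-bar baz", "-1234", "say \"hello\" twice" };
        for (String sample : samples) {
            System.out.println(sample + " -> " + breakOffsets(sample));
        }
    }
}
```

Running the same samples through a pair-table implementation and
comparing the two lists is a cheap way to spot where the results differ.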

Assuming now that this will be agreed as well the next step would be the 
more detailed design of the integration. But this is well beyond the 
scope of this e-mail as there are some tricky issues involved and they 
probably need to be tackled in conjunction with the white space 
handling issues. Many of the problems are related to our LayoutManager 
structures which create barriers when it comes to the need to process 
character sequences across those boundaries as is the case for both 
line breaking and white space handling. Add to that the design of the 
different Knuth sequences required to model the different break cases 
in conjunction with conditional border/padding and white space removal 
around line breaking and different types of line justifications and 
there is some real work ahead.

Cheers

Manuel

Should add my votes:

1.) +1
2.) +1
3.c) +1