Re: Unicode compliant Line Breaking
1. +1
2. +1
3.b) +1 for the separable parts, although c) is also ok for now.

+1 to try to find synergies with the code in Batik. If I were you I'd create a branch and put your stuff in there. It's easier for everyone to follow and to help (wishful thinking).

On 31.10.2005 08:25:12 Manuel Mall wrote:
> In a previous post Joerg pointed to the Unicode Standard Annex #14 on Line Breaking (http://www.unicode.org/reports/tr14/) and his initial implementation: http://people.apache.org/~pietsch/linebreak.tar.gz. I have since had a closer look at both UAX#14 and Joerg's code. Because I liked what I saw, I went about adapting Joerg's code to Unicode 4.1 and added fairly extensive JUnit test cases to it, mainly because it really helps to go through the various cases mentioned in the spec in a structured fashion. The results are now available for public inspection: http://people.apache.org/~manuel/fop/linebreak.tar.gz
>
> 1. I would like to propose that Unicode conformant line breaking be integrated into FOP trunk because it:
> a) Moves FOP more towards being a universal formatter and not just a formatter for western languages
> b) Moves FOP more towards becoming a high quality typesetting system (something that was really started by integrating Knuth style breaking)
>
> The reason I think this needs to be voted on is that Unicode line breaking will in subtle ways change the current line breaking behaviour and therefore constitutes a (significant) change in FOP's overall rendering.
>
> 2. I would also like to propose that the Unicode conformant line breaking be implemented using our own pair-table based implementation and not using Java's line breaker, because:
> a) It gives us full control and allows FOP to follow the Unicode standard (and its updates and errata) closely and therefore keep FOP's Unicode compliance level independent of the Java version.
> b) It allows us to tailor the algorithm to match the needs of XSL-FO and FOP.
> c) It allows us to provide user customisation features (down the track) not available through using the Java APIs.
>
> Of course there are downsides, like:
> a) Are we falling for the 'not invented here' syndrome?
> b) Duplicating code which is already in the Java base system
> c) Increasing the memory footprint of FOP
>
> 3. Assuming we get enough +1 for the above proposals, the first item to decide after that would be: Where should the code live?
> a) Joerg would like to see it in Jakarta Commons but hasn't got the time to start the project.
> b) Jeremias suggested XMLGraphics Commons.
> c) Personally I think it is too early to factor it out. More experience with its design and use cases should be gathered before making it standalone, and at this point in time it really is only 2 core Java classes. I would like to suggest that it initially lives under FOP in something like org.apache.fop.text. Should the need and energy levels (= developer enthusiasm) become available later to make this into a Jakarta Commons or XMLGraphics Commons project, so be it.
>
> Assuming now that this will be agreed as well, the next step would be the more detailed design of the integration. But this is well beyond the scope of this e-mail, as there are some tricky issues involved and they probably need to be tackled in conjunction with the white space handling issues. Many of the problems are related to our LayoutManager structures, which create barriers when it comes to the need to process character sequences across those boundaries, as is the case for both line breaking and white space handling. Add to that the design of the different Knuth sequences required to model the different break cases in conjunction with conditional border/padding and white space removal around line breaking and different types of line justification, and there is some real work ahead.
>
> Cheers
> Manuel
>
> Should add my votes:
> 1.) +1
> 2.) +1
> 3.c) +1

Jeremias Maerki
Re: Unicode compliant Line Breaking
On Tue, Nov 01, 2005 at 11:17:08PM +0100, J.Pietschmann wrote:
> Simon Pepping wrote:
>> Is our current hyphenation method a subset of Unicode's method?
>
> Umm. What's the relation between hyphenation and TR14 (except for handling soft hyphens)? I guess you confuse finding line breaks in general and line breaking due to hyphenation.

I mean, will our current method of finding possible line breaking points using the hyphenation tables be part of a TR14 compliant system to find line break opportunities?

Simon

--
Simon Pepping
home page: http://www.leverkruid.nl
Re: Unicode compliant Line Breaking
Simon Pepping wrote:
> I mean, will our current method of finding possible line breaking points using the hyphenation tables be part of a TR14 compliant system to find line break opportunities?

In some sense yes, but I'm not sure what you really mean. Currently, spaces and slashes (/) as well as hyphenation points are considered break opportunities. TR14 doesn't care about hyphenation but expands significantly on the other points. For example, in the string foo-bar the position after the dash is a break opportunity, as people usually expect, but in -1234 the position after the dash isn't a break opportunity, also as people usually expect. The TR encodes as many of these expectations as is possible with a limited context.

A few places in TextLayoutManager which use BREAK_CHARS will have to be changed, either by keeping the information from a previous scan with a BreakIterator or something similar, or by looking up the Unicode line break properties of the characters and checking in the line-break matrix whether a break may occur between them. Hyphenation points are generated elsewhere and remain unaffected.

J.Pietschmann
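To make the "line break properties plus line-break matrix" lookup concrete, here is a minimal sketch in Java. It is not Joerg's or Manuel's code and not the real UAX#14 table: the class name PairTableSketch, the four-class enum and the tiny matrix are assumptions made up for illustration, and the real algorithm has many more classes and treats spaces through indirect breaks. It only reproduces the two cases mentioned above (a break after the dash in foo-bar, none in -1234).

```java
import java.util.ArrayList;
import java.util.List;

/** Toy pair-table line breaker; an illustrative sketch, not the UAX#14 table. */
final class PairTableSketch {

    /** Drastically reduced set of line-break classes (UAX#14 defines many more). */
    enum LineBreakClass { AL, NU, HY, SP }

    // PAIR[before][after] is true when a break is allowed between the two classes.
    // Rows and columns follow the enum order: AL, NU, HY, SP.
    private static final boolean[][] PAIR = {
        //          AL     NU     HY     SP
        /* AL */ { false, false, false, false },
        /* NU */ { false, false, false, false },
        /* HY */ { true,  false, false, false }, // break after '-' unless a digit follows
        /* SP */ { true,  true,  true,  false }, // break after a space, never before one
    };

    /** Simplified classification; anything unknown is treated as alphabetic. */
    static LineBreakClass classify(char c) {
        if (c == ' ') {
            return LineBreakClass.SP;
        }
        if (c == '-') {
            return LineBreakClass.HY;
        }
        if (Character.isDigit(c)) {
            return LineBreakClass.NU;
        }
        return LineBreakClass.AL;
    }

    /** Returns the offsets i at which a break is allowed before text.charAt(i). */
    static List<Integer> breakOpportunities(String text) {
        List<Integer> result = new ArrayList<>();
        for (int i = 1; i < text.length(); i++) {
            LineBreakClass before = classify(text.charAt(i - 1));
            LineBreakClass after = classify(text.charAt(i));
            if (PAIR[before.ordinal()][after.ordinal()]) {
                result.add(i);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println("foo-bar -> " + breakOpportunities("foo-bar")); // [4]
        System.out.println("-1234 -> " + breakOpportunities("-1234"));     // []
    }
}
```

The point of the table form is that adding or tailoring a rule means changing one cell rather than rewriting scanning logic, which is presumably what makes a pair-table implementation attractive for following updates to the standard.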
Re: Unicode compliant Line Breaking
On Mon, Oct 31, 2005 at 03:25:12PM +0800, Manuel Mall wrote:
> In a previous post Joerg pointed to the Unicode Standard Annex #14 on Line Breaking (http://www.unicode.org/reports/tr14/) and his initial implementation: http://people.apache.org/~pietsch/linebreak.tar.gz. I have since had a closer look at both UAX#14 and Joerg's code. Because I liked what I saw, I went about adapting Joerg's code to Unicode 4.1 and added fairly extensive JUnit test cases to it, mainly because it really helps to go through the various cases mentioned in the spec in a structured fashion.

Is our current hyphenation method a subset of Unicode's method?

> Assuming now that this will be agreed as well, the next step would be the more detailed design of the integration. But this is well beyond the scope of this e-mail, as there are some tricky issues involved and they probably need to be tackled in conjunction with the white space handling issues. Many of the problems are related to our LayoutManager structures, which create barriers when it comes to the need to process character sequences across those boundaries, as is the case for both line breaking and white space handling. Add to that the design of the

I seem to recall that the hyphenation code collects words across LM boundaries.

It seems a useful goal to implement Unicode hyphenation. But since it is a major effort, it does not fit in with working towards a release. In any case it would have to be in a separate branch until it proves to work and to implement a substantial part of hyphenation. Then it does not immediately matter whether it is a separate project or a part of FOP.

Simon

--
Simon Pepping
home page: http://www.leverkruid.nl
Re: Unicode compliant Line Breaking
Simon Pepping wrote:
> Is our current hyphenation method a subset of Unicode's method?

Umm. What's the relation between hyphenation and TR14 (except for handling soft hyphens)? I guess you confuse finding line breaks in general and line breaking due to hyphenation.

> I seem to recall that the hyphenation code collects words across LM boundaries.

As it should. Word boundaries and FO boundaries are different things:

<block>A w<wrapper text-decoration=underline>o</wrapper>rd</block>

J.Pietschmann
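One way to picture processing text across FO boundaries is to join the text runs of the individual inline areas into one string, analyse that string, and map any resulting offset back to the run it came from. The sketch below assumes a hypothetical CrossBoundaryText helper and plain strings standing in for the runs (for instance "A w", "o" and "rd" for a word split by an inline wrapper); it does not reflect FOP's actual LayoutManager classes.

```java
import java.util.Arrays;
import java.util.List;

/** Illustrative helper: joins text runs and maps global offsets back to runs. */
final class CrossBoundaryText {

    /** Position of a character inside one text run. */
    static final class RunPosition {
        final int runIndex;
        final int offsetInRun;

        RunPosition(int runIndex, int offsetInRun) {
            this.runIndex = runIndex;
            this.offsetInRun = offsetInRun;
        }

        @Override
        public String toString() {
            return "run " + runIndex + ", offset " + offsetInRun;
        }
    }

    /** Concatenates the runs so that words spanning runs become contiguous. */
    static String join(List<String> runs) {
        return String.join("", runs);
    }

    /** Maps an offset in the joined text back to the run that contains it. */
    static RunPosition locate(List<String> runs, int globalOffset) {
        int remaining = globalOffset;
        for (int i = 0; i < runs.size(); i++) {
            int length = runs.get(i).length();
            if (remaining < length) {
                return new RunPosition(i, remaining);
            }
            remaining -= length;
        }
        throw new IndexOutOfBoundsException("offset " + globalOffset);
    }

    public static void main(String[] args) {
        List<String> runs = Arrays.asList("A w", "o", "rd");
        System.out.println(join(runs));      // A word
        System.out.println(locate(runs, 2)); // run 0, offset 2 (the 'w')
        System.out.println(locate(runs, 3)); // run 1, offset 0 (the 'o')
    }
}
```

With this mapping, a break or hyphenation point found in the joined text can be reported back to whichever run owns that character.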
Re: Unicode compliant Line Breaking
my votes:
1.) +1
2.) +1
3.c) +1

BTW, adding to 2.a: even the most up-to-date JDK (1.5.0_05) is not fully UAX#14 compliant. It treats QU as (A) instead of (XB/XA). So we REALLY need an independent implementation to follow the Unicode standard.
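Anyone who wants to check how their own JDK handles quotation marks can list the boundaries reported by java.text.BreakIterator. The probe below is only a sketch: the class name JdkLineBreakProbe and the sample string are made up for illustration, and no expected output is given because the result depends on the Java version in use.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Prints the line-break boundaries the running JDK reports for a string. */
final class JdkLineBreakProbe {

    /** Returns every boundary after offset 0, ending with text.length(). */
    static List<Integer> boundaries(String text) {
        BreakIterator iterator = BreakIterator.getLineInstance(Locale.ROOT);
        iterator.setText(text);
        List<Integer> result = new ArrayList<>();
        iterator.first(); // positions the iterator at offset 0, which is not recorded
        for (int b = iterator.next(); b != BreakIterator.DONE; b = iterator.next()) {
            result.add(b);
        }
        return result;
    }

    public static void main(String[] args) {
        // Sample with quotation marks (class QU in UAX#14) next to letters and spaces.
        String sample = "say \"hello\" (twice)";
        System.out.println(sample + " -> " + boundaries(sample));
    }
}
```

Comparing the printed offsets against what UAX#14 prescribes for the same string shows whether a given JDK deviates for the QU class.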
Re: Unicode compliant Line Breaking
IMO, Unicode conformant line-breaking is an important goal for FOP to achieve. But before I vote, I have a question:

On Oct 30, 2005, at 11:25 PM, Manuel Mall wrote:
> <snip>
> 2. I would also like to propose that the Unicode conformant line breaking be implemented using our own pair-table based implementation and not using Java's line breaker, because:

Does it make sense to make the choice between our own implementation and Java's configurable? That way, our users could choose whether or not to use it. In my case, we had no need for Unicode, and IIUC the extra code would merely hinder FOP's performance and increase FOP's memory footprint (unless it's only called when using Unicode). In addition, a future Java implementation could bring a robust (and maintained) Unicode solution.

> <snip>
> 3. Assuming we get enough +1 for the above proposals, the first item to decide after that would be: Where should the code live?
> a) Joerg would like to see it in Jakarta Commons but hasn't got the time to start the project.
> b) Jeremias suggested XMLGraphics Commons.
> c) Personally I think it is too early to factor it out. More experience with its design and use cases should be gathered before making it standalone, and at this point in time it really is only 2 core Java classes. I would like to suggest that it initially lives under FOP in something like org.apache.fop.text. Should the need and energy levels (= developer enthusiasm) become available later to make this into a Jakarta Commons or XMLGraphics Commons project, so be it.

I would think it would be best to start it under XML Graphics Commons (as that's where I suspect it will likely end up), and move it if necessary from there.

Regards,

Web Maestro Clay
--
[EMAIL PROTECTED] - http://homepage.mac.com/webmaestro/
My religion is simple. My religion is kindness. - HH The 14th Dalai Lama of Tibet
Re: Unicode compliant Line Breaking
Hi all,

Just an FYI, Batik also currently has an implementation of the Unicode TR14 line breaking algorithm (org.apache.batik.gvt.flow.TextLineBreak). As far as performance is concerned, it should be fairly fast, as it is mostly just table based.

The Web Maestro [EMAIL PROTECTED] wrote on 10/31/2005 11:04:54 AM:
> IMO, Unicode conformant line-breaking is an important goal for FOP to achieve. But before I vote, I have a question:
>
> On Oct 30, 2005, at 11:25 PM, Manuel Mall wrote:
>> <snip>
>> 2. I would also like to propose that the Unicode conformant line breaking be implemented using our own pair-table based implementation and not using Java's line breaker, because:
>
> Does it make sense to make the choice between our own implementation and Java's configurable? That way, our users could choose whether or not to use it. In my case, we had no need for Unicode, and IIUC the extra code would merely hinder FOP's performance and increase FOP's memory footprint (unless it's only called when using Unicode). In addition, a future Java implementation could bring a robust (and maintained) Unicode solution.
>
>> <snip>
>> 3. Assuming we get enough +1 for the above proposals, the first item to decide after that would be: Where should the code live?
>> a) Joerg would like to see it in Jakarta Commons but hasn't got the time to start the project.
>> b) Jeremias suggested XMLGraphics Commons.
>> c) Personally I think it is too early to factor it out. More experience with its design and use cases should be gathered before making it standalone, and at this point in time it really is only 2 core Java classes. I would like to suggest that it initially lives under FOP in something like org.apache.fop.text. Should the need and energy levels (= developer enthusiasm) become available later to make this into a Jakarta Commons or XMLGraphics Commons project, so be it.
>
> I would think it would be best to start it under XML Graphics Commons (as that's where I suspect it will likely end up), and move it if necessary from there.
>
> Regards,
>
> Web Maestro Clay
> --
> [EMAIL PROTECTED] - http://homepage.mac.com/webmaestro/
> My religion is simple. My religion is kindness. - HH The 14th Dalai Lama of Tibet
Unicode compliant Line Breaking
In a previous post Joerg pointed to the Unicode Standard Annex #14 on Line Breaking (http://www.unicode.org/reports/tr14/) and his initial implementation: http://people.apache.org/~pietsch/linebreak.tar.gz. I have since had a closer look at both UAX#14 and Joerg's code. Because I liked what I saw, I went about adapting Joerg's code to Unicode 4.1 and added fairly extensive JUnit test cases to it, mainly because it really helps to go through the various cases mentioned in the spec in a structured fashion. The results are now available for public inspection: http://people.apache.org/~manuel/fop/linebreak.tar.gz

1. I would like to propose that Unicode conformant line breaking be integrated into FOP trunk because it:
a) Moves FOP more towards being a universal formatter and not just a formatter for western languages
b) Moves FOP more towards becoming a high quality typesetting system (something that was really started by integrating Knuth style breaking)

The reason I think this needs to be voted on is that Unicode line breaking will in subtle ways change the current line breaking behaviour and therefore constitutes a (significant) change in FOP's overall rendering.

2. I would also like to propose that the Unicode conformant line breaking be implemented using our own pair-table based implementation and not using Java's line breaker, because:
a) It gives us full control and allows FOP to follow the Unicode standard (and its updates and errata) closely and therefore keep FOP's Unicode compliance level independent of the Java version.
b) It allows us to tailor the algorithm to match the needs of XSL-FO and FOP.
c) It allows us to provide user customisation features (down the track) not available through using the Java APIs.

Of course there are downsides, like:
a) Are we falling for the 'not invented here' syndrome?
b) Duplicating code which is already in the Java base system
c) Increasing the memory footprint of FOP

3. Assuming we get enough +1 for the above proposals, the first item to decide after that would be: Where should the code live?
a) Joerg would like to see it in Jakarta Commons but hasn't got the time to start the project.
b) Jeremias suggested XMLGraphics Commons.
c) Personally I think it is too early to factor it out. More experience with its design and use cases should be gathered before making it standalone, and at this point in time it really is only 2 core Java classes. I would like to suggest that it initially lives under FOP in something like org.apache.fop.text. Should the need and energy levels (= developer enthusiasm) become available later to make this into a Jakarta Commons or XMLGraphics Commons project, so be it.

Assuming now that this will be agreed as well, the next step would be the more detailed design of the integration. But this is well beyond the scope of this e-mail, as there are some tricky issues involved and they probably need to be tackled in conjunction with the white space handling issues. Many of the problems are related to our LayoutManager structures, which create barriers when it comes to the need to process character sequences across those boundaries, as is the case for both line breaking and white space handling. Add to that the design of the different Knuth sequences required to model the different break cases in conjunction with conditional border/padding and white space removal around line breaking and different types of line justification, and there is some real work ahead.

Cheers
Manuel

Should add my votes:
1.) +1
2.) +1
3.c) +1