Re: DO NOT REPLY [Bug 41019] - Left-align oddness with long, unbreakable strings following
Vincent Hennebert wrote:

> I'd have to think more about it, but:
> - perhaps the compareNodes method should compare the line/page numbers for each node rather than the index in the Knuth sequence. Or some mixing of the two.

The index can tell us which node allows more content to be laid out; the line number ... I am not able to see it as a very informative measure ...

> - if you restart using the last deactivated node you are sure that immediately after that you'll have to restart using the last too-short/too-long node, because no feasible break will be found (otherwise the list of active nodes wouldn't have been emptied).

Yes, but I think we have a significant difference: in the first case we will have N good lines, a bad line and maybe some other good lines; in the second we have N-1 good lines, a quite-bad one (either too long or too short), then a bad one and finally some good ones.

I've prepared a very small patch fixing a couple of things:
- the TLM adds a zero-width infinite-value penalty to forbid breaks at the glue elements used for left/right aligned text (I'm going to check if a similar fix is needed elsewhere in the code)
- the BreakingAlgorithm uses (if possible) lastDeactivated instead of either lastTooShort or lastTooLong.

The patch is just a dozen lines long, and it was easy to apply it to the float branch. How should I proceed? Apply it to both trunk and branch? Only to the branch?

I'm also going to mark bug 41121 as a duplicate of 41109, as the problem is exactly the same: the algorithm restarts from a very bad break instead of a good one (in that case, after the first word).

Regards
Luca
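The second item of the patch can be pictured with a minimal Java sketch. The class and method names below (`RestartNode`, `RestartChoice.choose`) are hypothetical and are not FOP's actual code; the fallback order between lastTooShort and lastTooLong is also an assumption, since the thread only says that the choice depends on the adjustment.

```java
/** Hypothetical stand-in for a node of the breaking algorithm; not FOP's class. */
final class RestartNode {
    final String label;   // kind of node, for illustration only
    final int position;   // index of the break in the Knuth element sequence

    RestartNode(String label, int position) {
        this.label = label;
        this.position = position;
    }
}

final class RestartChoice {
    /**
     * Picks the node to restart from once the list of active nodes is empty.
     * A deactivated node was once active, so it is preferred when available.
     * The order of the two fallbacks is an assumption made for this sketch.
     */
    static RestartNode choose(RestartNode lastDeactivated,
                              RestartNode lastTooShort,
                              RestartNode lastTooLong) {
        if (lastDeactivated != null) {
            return lastDeactivated;
        }
        if (lastTooShort != null) {
            return lastTooShort;
        }
        return lastTooLong; // may be null if nothing was recorded at all
    }

    public static void main(String[] args) {
        RestartNode deactivated = new RestartNode("lastDeactivated", 12);
        RestartNode tooShort = new RestartNode("lastTooShort", 15);
        RestartNode tooLong = new RestartNode("lastTooLong", 17);
        System.out.println(choose(deactivated, tooShort, tooLong).label);
        System.out.println(choose(null, tooShort, tooLong).label);
    }
}
```

With the made-up nodes in `main`, the first call restarts from the deactivated node and the second falls back to the too-short one.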
Re: DO NOT REPLY [Bug 41019] - Left-align oddness with long, unbreakable strings following
Hi Luca,

Luca Furini wrote:

> <snip/>
> 1) TextLM breaks the text even when a / or a - is found, handling them as hyphenation points with the usual sequence of glue + penalty + glue elements. The LineLM tries, in the first instance, to avoid using hyphenation points, so the penalty is not taken into account. But this has the side effect of using the first glue element as a feasible break (if the penalty were a feasible break too, it would surely be a better one, thus preventing the glue from actually being chosen).

I don't follow you: IIUC the glue-penalty-glue triplet is generated only the second time, when the first breaking doesn't give acceptable results? What do you mean by the penalty is not taken into account? Also, I don't see why the penalty would be preferred over the glue, as it has a positive penalty value.

> This is probably the smallest of the problems, and can be solved just by adding an infinite penalty before the first glue element. But maybe we

This seems to be a good idea, anyway.

> want to prevent this break from happening, as we can now use zero-width spaces to explicitly insert breaking positions?

Good point. I'd say yes for '/'. This would add a burden to the user who would have to modify the FO generation step to add ZWSP for URLs or filenames; but we must also take into account cases where the user does /not/ want the word to be split at '/' characters. For hyphens, I would keep the current behavior, as this is the most expected one IMO. And it can also be prevented by adding a non-breaking zero-width space.

> 2) The presence of an inline object larger than the available width makes the algorithm deactivate all the active nodes and then restart with a second-hand node, as no line can be built that does not overflow. The restarting node was chosen, in BreakingAlgorithm.findBreakingPoints(), between lastTooShort and lastTooLong, neither of them being a good breaking point. There is a lastDeactivated node chosen among the deactivated nodes but it was not used. A deactivated node was previously an active one, so it is surely better than a node that failed to qualify; replacing either lastTooShort or lastTooLong (according to the adjustment) with lastDeactivated leads to a better set of breaks. However, this is not enough. The attached file small.20.pdf shows the result after fixing these first two problems.
>
> 3) At the moment, the LineLM can call findBreakingPoints() up to three times, the last one with a maximum adjusting ratio equal to 20. I came to the conclusion that this is really TOO much. I tried stopping after the second call (with max ratio = 5) and the result is much better (see attached file small.5.pdf).

Yes 20 is probably too much. We need perhaps to also differentiate the case where no acceptable line-breaking can be found because of a box too long to even fit alone on one line. In such a case even a very high max ratio won't help.

> A high maximum adjustment ratio means that the algorithm is allowed to stretch spaces a lot in order to find a set of breaks which is *globally* better; this means that it can choose some not-so-beautiful breaks in order to build a set spanning over a larger portion of the paragraph. In our example: there can be a break just before the long url (a line ending after Consider:) only if we use an enormous adjustment ratio. With a smaller, more appropriate threshold, Consider: can no longer end a line, so the algorithm will restart from a previous point.
>
> In conclusion: the first two items are easily fixed, and I'm going to commit the changes in the afternoon (if there are no objections); concerning the question of the automatic break at /- characters, I'll probably leave the code unchanged for the moment, until we decide what is best. Concerning point #3, I'm going to have a closer look at the restarting mechanism ...

Yes, the current mechanism doesn't seem to be good enough, but I'm wondering if we can find a better one. Currently a too-short/too-long node replaces another one if it has fewer demerits. The number of lines/pages handled so far isn't taken into account. So it is likely that a too-short/too-long node ending an earlier line/page will be preferred over a node going further in the Knuth sequence. Why should that be the case? In fact the main problem I think is to find the right heuristic to select too-short/too-long nodes, in order to end up with the most acceptable result. Easy to say...

Also, may I suggest that you look at the Temp_Floats branch, and perhaps even work on it instead of trunk? I've made quite heavy changes to the breaking code that might be difficult to merge back into the trunk if there are also changes there.

Cheers,
Vincent
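The difference between the current replacement rule and one that also looks at progress can be sketched as follows. This is only one possible heuristic, not FOP's compareNodes; `Candidate`, `replacesByDemerits` and `replacesByProgress` are hypothetical names, and the numbers in `main` are made up.

```java
/** Hypothetical too-short/too-long candidate; not FOP's node class. */
final class Candidate {
    final int line;        // number of the line (or page) the node ends
    final int position;    // index in the Knuth element sequence
    final double demerits; // total demerits accumulated up to this node

    Candidate(int line, int position, double demerits) {
        this.line = line;
        this.position = position;
        this.demerits = demerits;
    }
}

final class CandidatePolicy {
    /** Rule described in the thread: fewer demerits wins, progress is ignored. */
    static boolean replacesByDemerits(Candidate current, Candidate next) {
        return current == null || next.demerits < current.demerits;
    }

    /**
     * Assumed alternative: a node ending a later line wins;
     * demerits only break ties between nodes ending the same line.
     */
    static boolean replacesByProgress(Candidate current, Candidate next) {
        if (current == null) {
            return true;
        }
        if (next.line != current.line) {
            return next.line > current.line;
        }
        return next.demerits < current.demerits;
    }

    public static void main(String[] args) {
        Candidate early = new Candidate(2, 10, 100.0);
        Candidate late = new Candidate(5, 40, 900.0);
        System.out.println(replacesByDemerits(early, late));
        System.out.println(replacesByProgress(early, late));
    }
}
```

Under the demerits-only rule the early node is kept, because nodes further in the sequence naturally accumulate more demerits; the progress-aware rule replaces it with the later node.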
Re: DO NOT REPLY [Bug 41019] - Left-align oddness with long, unbreakable strings following
Hi guys,

J.Pietschmann wrote:

> Simon Pepping wrote:
>> Would this be a good moment to make these features of the breaking algorithm user configurable, like they are in TeX? This allows people to play with the various possibilities without having to modify the code.

This can be combined with parameters for configuring the handling of before-floats. We might want to have a coherent set of parameters here. I was thinking about creating extension parameters in the fox: namespace, as those are things that have to be set independently for each FO file IMO, rather than having them in the FOP config file. I'll try to work on that soon.

> Probably, if this can be combined with implementing UAX14.

This may be the time to look at Simon's generalized Knuth elements for linebreaking. I wanted to but haven't had the time yet, and I'm still missing some knowledge regarding UAX14.

Damn, so many things to do in so little time. Not to mention releasing 0.93...

Vincent
Re: DO NOT REPLY [Bug 41019] - Left-align oddness with long, unbreakable strings following
On Wed, Dec 06, 2006 at 12:19:01PM +0100, Luca Furini wrote:

> Simon Pepping wrote:
>> Would this be a good moment to make these features of the breaking algorithm user configurable, like they are in TeX? This allows people to play with the various possibilities without having to modify the code.
>
> I agree with you that it would be good to have them configurable. Is your idea to put them in the configuration file or in the FO file itself? At the moment, my preference would be for the first option.

I also prefer the configuration file. I would modify the configuration file when looking for the best result over a number of test runs.

Simon

--
Simon Pepping
home page: http://www.leverkruid.eu
Re: DO NOT REPLY [Bug 41019] - Left-align oddness with long, unbreakable strings following
On Wed, Dec 06, 2006 at 03:46:54PM +0100, Vincent Hennebert wrote:

> Luca Furini wrote:
>> Vincent wrote:
>>> I don't follow you: IIUC the glue-penalty-glue triplet is generated only the second time, when the first breaking doesn't give acceptable results? What do you mean by the penalty is not taken into account?
>>
>> No, the sequence is always the same: since the beginning it represents the hyphenation points too, but at the first call to findBreakingPoints() there is a parameter saying that only non-hyphenated breaks should be looked at.
>
> Doesn't that have an impact on performance? I know we are no longer at the time when TeX was created, but, still, performing hyphenation only when the first pass has failed should be more efficient?

As far as I understand, it is necessary to do hyphenation first, because it is not possible to modify the set of Knuth elements later: the LMs (?) already contain references to them by index.

>> Another question: should the hyphen characters in the text be feasible breaks even if hyphenation is disabled?
>
> I'd say yes. In French, at least, there are lots of compound words that it is totally acceptable to break even if hyphenation is disabled. They should simply be discouraged by setting a positive penalty value (the same as for other hyphens). When we hyphenate words it's as if we were adding soft hyphens at some places in the input text --these wouldn't be the same hyphens.

I agree with both points. And a soft hyphen inserted by the user is equivalent to a legal linebreak created by the hyphenation process.

Simon

--
Simon Pepping
home page: http://www.leverkruid.eu
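The single-sequence approach described above can be illustrated with a small Java sketch, assuming the usual Knuth rule that a glue is a legal break only when it directly follows a box. All names (`Elem`, `FeasibleBreaks.find`, the `hyphenation` flag, the `INFINITE` sentinel) are hypothetical and do not reproduce FOP's element classes.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical Knuth element; widths are omitted because only legality matters here. */
final class Elem {
    enum Kind { BOX, GLUE, PENALTY }

    static final int INFINITE = 1000; // assumed sentinel for a forbidden break

    final Kind kind;
    final int penalty;         // meaningful for PENALTY elements only
    final boolean hyphenation; // true if the penalty marks a hyphenation point

    private Elem(Kind kind, int penalty, boolean hyphenation) {
        this.kind = kind;
        this.penalty = penalty;
        this.hyphenation = hyphenation;
    }

    static Elem box() { return new Elem(Kind.BOX, 0, false); }
    static Elem glue() { return new Elem(Kind.GLUE, 0, false); }
    static Elem penalty(int value, boolean hyphenation) {
        return new Elem(Kind.PENALTY, value, hyphenation);
    }
}

final class FeasibleBreaks {
    /** Returns the indices of legal breaks; the sequence itself is never modified. */
    static List<Integer> find(List<Elem> seq, boolean allowHyphenation) {
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < seq.size(); i++) {
            Elem e = seq.get(i);
            if (e.kind == Elem.Kind.GLUE) {
                // A glue is a legal break only if it directly follows a box.
                if (i > 0 && seq.get(i - 1).kind == Elem.Kind.BOX) {
                    result.add(i);
                }
            } else if (e.kind == Elem.Kind.PENALTY) {
                // Hyphenation penalties are skipped while hyphenation is not allowed.
                if (e.penalty < Elem.INFINITE && (allowHyphenation || !e.hyphenation)) {
                    result.add(i);
                }
            }
        }
        return result;
    }
}
```

For the sequence box, glue, penalty(50, hyphenation), glue, box, a first pass with `allowHyphenation` set to false returns `[1]`: the penalty is skipped but the first glue remains a legal break, which is the side effect discussed in this thread. Putting `Elem.penalty(Elem.INFINITE, false)` between the first box and the first glue makes the first pass return `[]` and the second pass return only the index of the hyphenation penalty.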
Re: DO NOT REPLY [Bug 41019] - Left-align oddness with long, unbreakable strings following
On Dec 6, 2006, at 20:39, Simon Pepping wrote:

Hi folks,

> On Wed, Dec 06, 2006 at 12:19:01PM +0100, Luca Furini wrote:
> <snip />
>> I agree with you that it would be good to have them configurable. Is your idea to put them in the configuration file or in the FO file itself? At the moment, my preference would be for the first option.
>
> I also prefer the configuration file. I would modify the configuration file when looking for the best result over a number of test runs.

FWIW, I'm leaning more towards Vincent's proposal to implement this as extension attributes. The sketched 'best result' could then be found in a single run, copying the same fo-subtree multiple times -- different page-sequences?-- and specifying different values in the attributes.

In the worst case, to make the best result definitive, the user would have to re-run the document, keeping only the subtree that leads to the most desired layout. In the best case, maybe FOP itself could be taught to re-layout one and the same subtree with a subtle property change at the root (which would save re-parsing the XML, and --if applicable-- also the XSL transform).

Cheers,
Andreas
Re: DO NOT REPLY [Bug 41019] - Left-align oddness with long, unbreakable strings following
Chuck Bearden wrote:

> If in a left-aligned block some typical text words are followed by a string longer than the line-length and containing no spaces (e.g. a long URL), then the foregoing text will have premature line breaks, i.e. halfway to two-thirds the way into the line.

I had a look at this, and what I found out is that the strange-looking lines are the combined effect of three different problems. So, sorry in advance for the long post, but breaking is never an easy matter! :-)

1) TextLM breaks the text even when a / or a - is found, handling them as hyphenation points with the usual sequence of glue + penalty + glue elements. The LineLM tries, in the first instance, to avoid using hyphenation points, so the penalty is not taken into account. But this has the side effect of using the first glue element as a feasible break (if the penalty were a feasible break too, it would surely be a better one, thus preventing the glue from actually being chosen).

This is probably the smallest of the problems, and can be solved just by adding an infinite penalty before the first glue element. But maybe we want to prevent this break from happening, as we can now use zero-width spaces to explicitly insert breaking positions?

2) The presence of an inline object larger than the available width makes the algorithm deactivate all the active nodes and then restart with a second-hand node, as no line can be built that does not overflow. The restarting node was chosen, in BreakingAlgorithm.findBreakingPoints(), between lastTooShort and lastTooLong, neither of them being a good breaking point. There is a lastDeactivated node chosen among the deactivated nodes but it was not used. A deactivated node was previously an active one, so it is surely better than a node that failed to qualify; replacing either lastTooShort or lastTooLong (according to the adjustment) with lastDeactivated leads to a better set of breaks. However, this is not enough. The attached file small.20.pdf shows the result after fixing these first two problems.

3) At the moment, the LineLM can call findBreakingPoints() up to three times, the last one with a maximum adjusting ratio equal to 20. I came to the conclusion that this is really TOO much. I tried stopping after the second call (with max ratio = 5) and the result is much better (see attached file small.5.pdf).

A high maximum adjustment ratio means that the algorithm is allowed to stretch spaces a lot in order to find a set of breaks which is *globally* better; this means that it can choose some not-so-beautiful breaks in order to build a set spanning over a larger portion of the paragraph. In our example: there can be a break just before the long url (a line ending after Consider:) only if we use an enormous adjustment ratio. With a smaller, more appropriate threshold, Consider: can no longer end a line, so the algorithm will restart from a previous point.

In conclusion: the first two items are easily fixed, and I'm going to commit the changes in the afternoon (if there are no objections); concerning the question of the automatic break at /- characters, I'll probably leave the code unchanged for the moment, until we decide what is best. Concerning point #3, I'm going to have a closer look at the restarting mechanism ...

Regards
Luca

small.20.pdf Description: Adobe PDF document
small.5.pdf Description: Adobe PDF document
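The effect of the maximum adjustment ratio in point 3 can be reproduced with a short Java sketch based on the standard Knuth-Plass definition of the ratio. The names (`RatioCascade`, `firstAcceptingThreshold`) are hypothetical, the widths in `main` are invented, and the first threshold of 1 is an assumption: the thread only states the values 5 and 20 for the second and third calls.

```java
final class RatioCascade {
    /**
     * Knuth-Plass adjustment ratio: how much of the available stretch
     * (positive result) or shrink (negative result) a line needs to
     * reach the target width.
     */
    static double adjustmentRatio(int lineWidth, int natural, int stretch, int shrink) {
        int diff = lineWidth - natural;
        if (diff > 0) {
            return stretch > 0 ? (double) diff / stretch : Double.POSITIVE_INFINITY;
        }
        if (diff < 0) {
            return shrink > 0 ? (double) diff / shrink : Double.NEGATIVE_INFINITY;
        }
        return 0.0;
    }

    /**
     * Returns the first threshold under which the line is acceptable,
     * or -1 if none of them accepts it.
     */
    static double firstAcceptingThreshold(double ratio, double[] thresholds) {
        for (double threshold : thresholds) {
            if (ratio >= -1.0 && ratio <= threshold) {
                return threshold;
            }
        }
        return -1.0;
    }

    public static void main(String[] args) {
        // Invented numbers: a 400-unit line holding only a short word
        // of natural width 60 with 30 units of available stretch.
        double ratio = adjustmentRatio(400, 60, 30, 0);
        System.out.println(firstAcceptingThreshold(ratio, new double[] {1, 5, 20}));
        System.out.println(firstAcceptingThreshold(ratio, new double[] {1, 5}));
    }
}
```

With these invented widths the ratio is about 11.3, so the short line is accepted only once the threshold reaches 20; stopping at 5 rejects it, and the algorithm has to restart from an earlier point instead.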
Re: DO NOT REPLY [Bug 41019] - Left-align oddness with long, unbreakable strings following
Simon Pepping wrote:

> Would this be a good moment to make these features of the breaking algorithm user configurable, like they are in TeX? This allows people to play with the various possibilities without having to modify the code.

Probably, if this can be combined with implementing UAX14.

J.Pietschmann