Re: DO NOT REPLY [Bug 41019] - Left-align oddness with long, unbreakable strings following

2006-12-12 Thread Luca Furini

Vincent Hennebert wrote:


I'd have to think more about it, but:
- perhaps the compareNodes method should compare the line/page numbers
  for each node rather than the index in the Knuth sequence. Or some
  mixing of the two.


The index can tell us which node allows more content to be laid out; the line
number ... I am not able to see it as a very informative measure ...
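
To make the two criteria concrete, here is a minimal sketch of comparing nodes by index, by line number, or by a mix of both. The classes SketchNode and CompareNodesSketch and the fields position and line are hypothetical stand-ins, not FOP's actual node class or compareNodes method, and the tie-breaking rule in MIXED is only one possible choice.

```java
import java.util.Comparator;

// Hypothetical stand-in for a breakpoint node; not FOP's actual node class.
final class SketchNode {
    final int position; // index of the break in the Knuth sequence
    final int line;     // number of the line (or page) that ends at this break

    SketchNode(int position, int line) {
        this.position = position;
        this.line = line;
    }
}

final class CompareNodesSketch {
    // Ranks nodes by how far they reach into the sequence (more content laid out).
    static final Comparator<SketchNode> BY_POSITION =
            Comparator.comparingInt((SketchNode n) -> n.position);

    // Ranks nodes by line number alone.
    static final Comparator<SketchNode> BY_LINE =
            Comparator.comparingInt((SketchNode n) -> n.line);

    // One possible mix (an assumption): position first, then fewer lines wins a tie.
    static final Comparator<SketchNode> MIXED =
            BY_POSITION.thenComparing(BY_LINE.reversed());

    public static void main(String[] args) {
        SketchNode a = new SketchNode(40, 3);
        SketchNode b = new SketchNode(25, 3);
        // Same line number, so BY_LINE cannot separate them; BY_POSITION can.
        System.out.println("by line: " + BY_LINE.compare(a, b));
        System.out.println("by position: " + BY_POSITION.compare(a, b));
    }
}
```

With two nodes ending the same line, the line-number comparison returns a tie while the index comparison still distinguishes them, which is the point made above.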



- if you restart using the last deactivated node you are sure that
  immediately after that you'll have to restart using the last
  too-short/too-long node, because no feasible break will be found
  (otherwise the list of active nodes wouldn't have been emptied).


Yes, but I think we have a significant difference: in the first case we 
will have N good lines, a bad line and maybe some other good lines; in the 
second we have N-1 good lines, a quite-bad one (either too long or too 
short), then a bad one and finally some good ones.


I've prepared a very small patch fixing a couple of things:
- the TLM adds a zero-width infinite-value penalty to forbid breaks at the
  glue elements used for left/right aligned text (I'm going to check if a
  similar fix is needed elsewhere in the code)
- the BreakingAlgorithm uses (if possible) lastDeactivated instead of
  either lastTooShort or lastTooLong.

The patch is just a dozen lines long, and it was easy to apply it to
the float branch.
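
As a rough illustration of the second item, the restart choice could look like the sketch below. It is not the patch itself: RestartNode and chooseRestart are hypothetical names, and the fallback order between the too-short and too-long candidates is an arbitrary assumption, since the thread only says the choice depends on the adjustment.

```java
// Hypothetical sketch of the restart choice; names echo the thread but this is NOT FOP's code.
final class RestartNode {
    final String label;
    final int position; // index of the break in the Knuth sequence

    RestartNode(String label, int position) {
        this.label = label;
        this.position = position;
    }
}

final class RestartChoiceSketch {
    // Prefer the last deactivated node, which was a feasible break at some point.
    // Only when there is none, fall back to a too-short or too-long candidate
    // (the order of that fallback is an assumption made for this sketch).
    static RestartNode chooseRestart(RestartNode lastDeactivated,
                                     RestartNode lastTooShort,
                                     RestartNode lastTooLong) {
        if (lastDeactivated != null) {
            return lastDeactivated;
        }
        if (lastTooShort != null) {
            return lastTooShort;
        }
        return lastTooLong;
    }

    public static void main(String[] args) {
        RestartNode deactivated = new RestartNode("lastDeactivated", 18);
        RestartNode tooShort = new RestartNode("lastTooShort", 21);
        System.out.println(chooseRestart(deactivated, tooShort, null).label);
        System.out.println(chooseRestart(null, tooShort, null).label);
    }
}
```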


How should I proceed? Apply it to both trunk and branch? Only to the 
branch?


I'm also going to mark bug 41121 as a duplicate of 41109, as the problem 
is exactly the same: the algorithm restarts from a very bad break instead 
of a good one (in that case, after the first word).


Regards
Luca


Re: DO NOT REPLY [Bug 41019] - Left-align oddness with long, unbreakable strings following

2006-12-06 Thread Vincent Hennebert
Hi Luca,

Luca Furini wrote:
<snip/>
 1) TextLM breaks the text even when a / or a - is found, handling
 them as hyphenation points with the usual sequence of glue + penalty +
 glue elements.
 
 The LineLM tries, in the first instance, to avoid using hyphenation
 points, so the penalty is not taken into account. But this has the side
 effect of using the first glue element as a feasible break (if the
 penalty were a feasible break too, it would surely be a better one, thus
 preventing the glue from actually being chosen).

I don't follow you: IIUC the glue-penalty-glue triplet is generated only
the second time, when the first breaking doesn't give acceptable
results? What do you mean by the penalty is not taken into account?

Also, I don't see why the penalty would be preferred over the glue, as
it has a positive penalty value.


 This is probably the smaller of the problems, and can be solved just by
 adding an infinite penalty before the first glue element. But maybe we

This seems to be a good idea, anyway.


 want to prevent this breaking from happening, as we can now use
 zero-width-spaces to explicitly insert breaking positions?

Good point. I'd say yes for '/'. This would add a burden to the user who
would have to modify the FO generation step to add ZWSP for URLs or
filenames; but we must also take into account cases where the user does
/not/ want the word to be split at '/' characters.
For hyphens, I would keep the current behavior, as this is the most
expected one IMO. And it can also be prevented by adding non-breaking
zero-width space.


 2) The presence of an inline object larger than the available width
 makes the algorithm deactivate all the active nodes and then restart
 with a second-hand node, as no line can be built that does not
 overflow. The restarting node was chosen, in
 BreakingAlgorithm.findBreakingPoints(), between lastTooShort and
 lastTooLong, neither of them being a good breaking point. There is a
 lastDeactivated node chosen among the deactivated nodes but it was not
 used.
 
 A deactivated node previously was an active one, so it is surely better
 than a node that failed to qualify; replacing either lastTooShort or
 lastTooLong (according to the adjustment) with lastDeactivated leads to
 a better set of breaks. However, this is not enough. The attached file
 small.20.pdf shows the result after fixing these first two problems.
 
 
 3) At the moment, the LineLM can call findBreakingPoints() up to three
 times, the last one with a maximum adjusting ratio equal to 20. I came
 to the conclusion that this is really TOO much. I tried stopping after
 the second call (with max ratio = 5) and the result is much better (see
 attached file small.5.pdf).

Yes 20 is probably too much. We need perhaps to also differentiate the
case where no acceptable line-breaking can be found because of a box too
long to even fit alone on one line. In such a case even a very high max
ratio won't help.
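
A minimal sketch of such a check, assuming box widths and the line width are available as plain integers (for example in millipoints). The class and method names are hypothetical and do not correspond to anything in FOP.

```java
// Hypothetical pre-check; not part of FOP's BreakingAlgorithm.
final class OversizedBoxSketch {
    // Returns true if at least one unbreakable box is wider than the line.
    // In that case every set of breaks contains an overflowing line, so
    // raising the maximum adjustment ratio cannot make the paragraph fit.
    static boolean hasBoxWiderThanLine(int[] boxWidths, int lineWidth) {
        for (int width : boxWidths) {
            if (width > lineWidth) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        int[] widths = {30000, 45000, 520000}; // the last box stands for a long URL
        System.out.println(hasBoxWiderThanLine(widths, 400000));
    }
}
```

When the check returns true, a caller could skip the passes with very high ratios and go straight to the restarting mechanism.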


 A high maximum adjustment ratio means that the algorithm is allowed to
 stretch spaces a lot in order to find a set of breaks which is
 *globally* better; this means that it can choose some not-so-beautiful
 breaks in order to build a set spanning over a larger portion of the
 paragraph.
 
 In our example: there can be a break just before the long url (a line
 ending after Consider:) only if we use an enormous adjustment ratio.
 With a smaller, more appropriate threshold, Consider: can no longer end
 a line, so the algorithm will restart from a previous point.
 
 
 In conclusion: the first two items are easily fixed, and I'm going to
 commit the changes in the afternoon (if there are no objections);
 concerning the question of the automatic break at /- characters, I'll
 probably leave the code unchanged for the moment, until we decide what is
 best.
 
 Concerning point #3, I'm going to have a closer look at the restarting
 mechanism ...

Yes, the current mechanism doesn't seem to be good enough, but I'm
wondering if we can find a better one. Currently a too-short/too-long
node replaces another one if it has fewer demerits. The number of
lines/pages handled so far isn't taken into account. So it is likely
that a too-short/too-long node ending an earlier line/page will be
preferred over a node going further in the Knuth sequence. Why should
that be the case?
In fact the main problem I think is to find the right heuristic to
select too-short/too-long nodes, in order to end up with the most
acceptable result. Easy to say...
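
The effect of the demerits-only rule can be seen in a small sketch. FallbackNode and FallbackTracker are hypothetical names, and the demerit values are made up for illustration; only the replacement rule itself is taken from the description above.

```java
// Hypothetical sketch of the replacement rule described above; not FOP's code.
final class FallbackNode {
    final int position;    // index of the break in the Knuth sequence
    final double demerits; // accumulated demerits up to this break

    FallbackNode(int position, double demerits) {
        this.position = position;
        this.demerits = demerits;
    }
}

final class FallbackTracker {
    private FallbackNode current;

    // A candidate replaces the stored node only if it has fewer demerits;
    // how far it reaches into the sequence plays no role.
    void offer(FallbackNode candidate) {
        if (current == null || candidate.demerits < current.demerits) {
            current = candidate;
        }
    }

    FallbackNode current() {
        return current;
    }

    public static void main(String[] args) {
        FallbackTracker tracker = new FallbackTracker();
        tracker.offer(new FallbackNode(10, 500.0)); // early break, few demerits
        tracker.offer(new FallbackNode(30, 800.0)); // later break, more demerits
        System.out.println(tracker.current().position);
    }
}
```

The later node covers more of the paragraph but is rejected because it has accumulated more demerits, so the earlier node stays selected.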

Also, may I suggest that you look at the Temp_Floats branch, and perhaps
even work on it instead of trunk? I've made quite heavy changes to
the breaking code that might be difficult to merge back into the trunk
if there are also changes there.


Cheers,
Vincent


Re: DO NOT REPLY [Bug 41019] - Left-align oddness with long, unbreakable strings following

2006-12-06 Thread Vincent Hennebert
Hi guys,

J.Pietschmann wrote:
 Simon Pepping wrote:
 Would this be a good moment to make these features of the breaking
 algorithm user configurable, like they are in TeX? This allows people
 to play with the various possibilities without having to modify the
 code.

This can be combined with parameters for configuring the handling of
before-floats. We might want to have a coherent set of parameters here.
I was thinking about creating extension parameters in the fox:
namespace, since IMO these are things that have to be set independently
for each FO file rather than kept in the FOP config file. I'll
try to work on that soon.


 Probably, if this can be combined with implementing UAX14.

This may be time to look at Simon's generalized Knuth elements for
linebreaking. I wanted to but haven't had the time yet, and I'm still
missing some knowledge regarding UAX14. Damn, so many things to do in so
little time. Not to mention releasing 0.93...


Vincent


Re: DO NOT REPLY [Bug 41019] - Left-align oddness with long, unbreakable strings following

2006-12-06 Thread Simon Pepping
On Wed, Dec 06, 2006 at 12:19:01PM +0100, Luca Furini wrote:
 Simon Pepping wrote:
 
 Would this be a good moment to make these features of the breaking
 algorithm user configurable, like they are in TeX? This allows people
 to play with the various possibilities without having to modify the
 code.
 
 I agree with you that it would be good to have them configurable.
 
 Is your idea to put them in the configuration file or in the fo file
 itself? At the moment, my preference would be for the first option.

I also prefer the configuration file. I would modify the configuration
file when looking for the best result in a number of test
runs.

Simon

-- 
Simon Pepping
home page: http://www.leverkruid.eu


Re: DO NOT REPLY [Bug 41019] - Left-align oddness with long, unbreakable strings following

2006-12-06 Thread Simon Pepping
On Wed, Dec 06, 2006 at 03:46:54PM +0100, Vincent Hennebert wrote:
 Luca Furini wrote:
  Vincent wrote:
  
  I don't follow you: IIUC the glue-penalty-glue triplet is generated only
  the second time, when the first breaking doesn't give acceptable
  results? What do you mean by the penalty is not taken into account?
  
  No, the sequence is always the same: since the beginning it represents
  the hyphenation points too, but at the first call to
  findBreakingPoints() there is a parameter saying that only
  non-hyphenated breaks should be looked at.
 
 Doesn't that have an impact on performance? I know we are no longer at
 the time when TeX was created, but, still, performing hyphenation only
 when the first pass has failed should be more efficient?

As far as I understand, it is necessary to do hyphenation first,
because it is not possible to modify the set of Knuth elements later
because the LMs (?) already contain references to them by index.
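
A minimal sketch of that arrangement, assuming each candidate penalty carries a flag telling whether it comes from hyphenation. PenaltySketch, TwoPassSketch and allowHyphenation are hypothetical names, not FOP's API. The element list is built once, so the indices stay valid, and only the flag passed to each pass changes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical penalty element; not FOP's actual penalty class.
final class PenaltySketch {
    final int index;          // position in the element sequence, stable across passes
    final boolean hyphenated; // true if this break was produced by hyphenation

    PenaltySketch(int index, boolean hyphenated) {
        this.index = index;
        this.hyphenated = hyphenated;
    }
}

final class TwoPassSketch {
    // Returns the indices of the penalties that may be used as breaks in this pass.
    // The sequence is never rebuilt; hyphenation breaks are simply skipped
    // when allowHyphenation is false.
    static List<Integer> candidateBreaks(List<PenaltySketch> penalties,
                                         boolean allowHyphenation) {
        List<Integer> result = new ArrayList<>();
        for (PenaltySketch p : penalties) {
            if (allowHyphenation || !p.hyphenated) {
                result.add(p.index);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<PenaltySketch> penalties = Arrays.asList(
                new PenaltySketch(3, false),
                new PenaltySketch(7, true),
                new PenaltySketch(12, false));
        System.out.println(candidateBreaks(penalties, false)); // first pass
        System.out.println(candidateBreaks(penalties, true));  // later pass
    }
}
```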

  Another question: should the hyphen characters in the text be feasible
  breaks even if hyphenation is disabled?
 
 I'd say yes. In French, at least, there are lots of compound words that
 it is totally acceptable to break even if hyphenation is disabled. They
 should simply be discouraged by setting a positive penalty value (the
 same as for other hyphens).
 When we hyphenate words it's as if we were adding soft hyphens at some
 places in the input text --these wouldn't be the same hyphens.

I agree with both points. And a soft hyphen inserted by the user is
equivalent to a legal linebreak created by the hyphenation process.

Simon

-- 
Simon Pepping
home page: http://www.leverkruid.eu


Re: DO NOT REPLY [Bug 41019] - Left-align oddness with long, unbreakable strings following

2006-12-06 Thread Andreas L Delmelle

On Dec 6, 2006, at 20:39, Simon Pepping wrote:

Hi folks,


On Wed, Dec 06, 2006 at 12:19:01PM +0100, Luca Furini wrote:

<snip/>
I agree with you that it would be good to have them configurable.

Is your idea to put them in the configuration file or in the fo file
itself? At the moment, my preference would be for the first option.


I also prefer the configuration file. I would modify the configuration
file when looking for the best result in a number of test
runs.


FWIW, I'm leaning more towards Vincent's proposal to implement this
as extension attributes. The sketched 'best result' could then be  
found in a single run, copying the same fo-subtree multiple times -- 
different page-sequences?-- and specifying different values in the  
attributes.
In the worst case, to make the best result definitive, the user would  
have to re-run the document, maintaining only the subtree that leads  
to the most desired layout.
In the best case, maybe FOP itself could be taught to re-layout one  
and the same subtree with a subtle property change at the root (which  
would save re-parsing the XML, and --if applicable-- also the XSL  
transform).



Cheers,

Andreas



Re: DO NOT REPLY [Bug 41019] - Left-align oddness with long, unbreakable strings following

2006-12-05 Thread Luca Furini

Chuck Bearden wrote:


If in a left-aligned block some typical text words are followed by a string
longer than the line-length and containing no spaces (e.g. a long URL), then the
foregoing text will have premature line breaks, i.e. halfway to two-thirds the
way into the line.


I had a look at this, and what I found out is that the strange-looking 
lines are the combined effect of three different problems. So, sorry in 
advance for the long post, but breaking is never an easy matter! :-)



1) TextLM breaks the text even when a / or a - is found, handling them 
as hyphenation points with the usual sequence of glue + penalty + glue 
elements.


The LineLM tries, in the first instance, to avoid using hyphenation 
points, so the penalty is not taken into account. But this has the side 
effect of using the first glue element as a feasible break (if the penalty
were a feasible break too, it would surely be a better one, thus preventing
the glue from actually being chosen).


This is probably the smaller of the problems, and can be solved just by
adding an infinite penalty before the first glue element. But maybe we
want to prevent this breaking from happening, as we can now use
zero-width-spaces to explicitly insert breaking positions?
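
A minimal sketch of why the extra penalty helps, assuming the standard Knuth-Plass rule that a glue is a legal break only when it directly follows a box, and that a penalty is a legal break only when its value is finite. The classes Elem and LegalBreakSketch, the value 1000 for infinity and the penalty value 50 are assumptions made for the example, not FOP's element classes or constants.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical element type; not FOP's element classes.
final class Elem {
    enum Kind { BOX, GLUE, PENALTY }

    static final int INFINITE = 1000; // assumed sentinel meaning "never break here"

    final Kind kind;
    final int penalty; // only meaningful for PENALTY elements

    private Elem(Kind kind, int penalty) {
        this.kind = kind;
        this.penalty = penalty;
    }

    static Elem box() { return new Elem(Kind.BOX, 0); }
    static Elem glue() { return new Elem(Kind.GLUE, 0); }
    static Elem penalty(int value) { return new Elem(Kind.PENALTY, value); }
}

final class LegalBreakSketch {
    // Collects the indices of legal breaks: a glue directly after a box,
    // or a penalty whose value is below infinity.
    static List<Integer> legalBreaks(List<Elem> seq) {
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < seq.size(); i++) {
            Elem e = seq.get(i);
            boolean afterBox = i > 0 && seq.get(i - 1).kind == Elem.Kind.BOX;
            if (e.kind == Elem.Kind.GLUE && afterBox) {
                result.add(i);
            } else if (e.kind == Elem.Kind.PENALTY && e.penalty < Elem.INFINITE) {
                result.add(i);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // box, glue, penalty, glue, box: the first glue is itself a legal break.
        List<Elem> before = Arrays.asList(Elem.box(), Elem.glue(),
                Elem.penalty(50), Elem.glue(), Elem.box());
        // With an infinite penalty in front, the first glue no longer follows a box.
        List<Elem> after = Arrays.asList(Elem.box(), Elem.penalty(Elem.INFINITE),
                Elem.glue(), Elem.penalty(50), Elem.glue(), Elem.box());
        System.out.println(legalBreaks(before));
        System.out.println(legalBreaks(after));
    }
}
```

In the first sequence the glue at index 1 remains available even when hyphenation penalties are ignored; in the second only the finite penalty is left as a break.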



2) The presence of an inline object larger than the available width makes
the algorithm deactivate all the active nodes and then restart with a
second-hand node, as no line can be built that does not overflow. The 
restarting node was chosen, in BreakingAlgorithm.findBreakingPoints(), 
between lastTooShort and lastTooLong, neither of them being a good 
breaking point. There is a lastDeactivated node chosen among the 
deactivated nodes but it was not used.


A deactivated node previously was an active one, so it is surely better
than a node that failed to qualify; replacing either lastTooShort or
lastTooLong (according to the adjustment) with lastDeactivated leads to a
better set of breaks. However, this is not enough. The attached file
small.20.pdf shows the result after fixing these first two problems.



3) At the moment, the LineLM can call findBreakingPoints() up to three 
times, the last one with a maximum adjusting ratio equal to 20. I came to 
the conclusion that this is really TOO much. I tried stopping after the 
second call (with max ratio = 5) and the result is much better (see 
attached file small.5.pdf).


A high maximum adjustment ratio means that the algorithm is allowed to 
stretch spaces a lot in order to find a set of breaks which is *globally* 
better; this means that it can choose some not-so-beautiful breaks in 
order to build a set spanning over a larger portion of the paragraph.


In our example: there can be a break just before the long url (a line 
ending after Consider:) only if we use an enormous adjustment ratio. 
With a smaller, more appropriate threshold, Consider: can no longer end a
line, so the algorithm will restart from a previous point.
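
As a worked illustration, the adjustment ratio of a line in the Knuth-Plass model is the difference between the available and the natural width, divided by the total stretch (or shrink) of the line. The sketch below uses made-up widths for a short line such as the one ending after Consider:, and the class and method names are hypothetical.

```java
// Hypothetical helper; the widths used in main are invented for illustration.
final class AdjustmentRatioSketch {
    // Ratio > 0 means spaces must stretch, ratio < 0 means they must shrink.
    static double ratio(double lineWidth, double naturalWidth,
                        double totalStretch, double totalShrink) {
        double difference = lineWidth - naturalWidth;
        if (difference > 0) {
            return totalStretch > 0 ? difference / totalStretch : Double.POSITIVE_INFINITY;
        }
        if (difference < 0) {
            return totalShrink > 0 ? difference / totalShrink : Double.NEGATIVE_INFINITY;
        }
        return 0.0;
    }

    // A line is feasible if it does not shrink beyond -1 and does not
    // stretch beyond the maximum ratio allowed for the current pass.
    static boolean feasible(double ratio, double maxRatio) {
        return ratio >= -1.0 && ratio <= maxRatio;
    }

    public static void main(String[] args) {
        double r = ratio(400.0, 100.0, 20.0, 5.0); // very short line, little stretch
        System.out.println(r);
        System.out.println(feasible(r, 20.0)); // accepted with max ratio 20
        System.out.println(feasible(r, 5.0));  // rejected with max ratio 5
    }
}
```

With these numbers the ratio is 15, so the break is accepted under a maximum of 20 but rejected under a maximum of 5.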



In conclusion: the first two items are easily fixed, and I'm going to 
commit the changes in the afternoon (if there are no objections);
concerning the question of the automatic break at /- characters, I'll
probably leave the code unchanged for the moment, until we decide what is
best.


Concerning point #3, I'm going to have a closer look at the restarting 
mechanism ...


Regards
Luca


small.20.pdf
Description: Adobe PDF document


small.5.pdf
Description: Adobe PDF document


Re: DO NOT REPLY [Bug 41019] - Left-align oddness with long, unbreakable strings following

2006-12-05 Thread J.Pietschmann

Simon Pepping wrote:

Would this be a good moment to make these features of the breaking
algorithm user configurable, like they are in TeX? This allows people
to play with the various possibilities without having to modify the
code.


Probably, if this can be combined with implementing UAX14.

J.Pietschmann