Hello all,
I tracked down bugs 10374, 2106 and 6042. The last
bug was caused by a simple, easy-to-fix mistake in the
hyphenation framework. Bug 10374 is unfortunately
a duplicate of 2106, not 6042, and is a bit more interesting.
It is caused by the parser delivering character references
as a separate character chunk, thereby creating multiple
FOText children of the block (FObjMixed) for consecutive
text. This interferes badly with line breaking and
hyphenation. Take
e&#x78;tensible
with room up to the l on the line.
This is split into three FOText objects:
e &#x78; tensible
Each chunk is delivered separately to the line layout
algorithm. The e and the x do not fill the line, but
they are not words either, so they are appended to the
pendingAreas vector. The tensible then overflows the line
and is passed to hyphenation; let's say it is hyphenated
as tensi-ble. The tensi- is appended to the line without
flushing the pending areas, which instead end up at the
start of the next line.
As a first fix, I put a StringBuffer into FObjMixed to
accumulate consecutive addCharacters() events. This fixes
the problem with character references, but not cases like
e<fo:inline>X</fo:inline>tensible
(also noted somewhere in Bugzilla as a problem).
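
To make the buffering idea concrete, here is a minimal sketch.
It is not the actual FObjMixed code: the class name, the
children list and the addChildElement()/end() hooks are
hypothetical stand-ins, and only the idea of appending every
characters chunk to one StringBuffer comes from the fix above.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a mixed-content FO; not FOP's FObjMixed. */
class MixedContentSketch {
    // Text collected since the last flush.
    private final StringBuffer pendingText = new StringBuffer();
    // Children are recorded as plain strings to keep the sketch small.
    final List<String> children = new ArrayList<>();

    /** Called once per characters chunk; only appends to the buffer. */
    void addCharacters(char[] data, int start, int length) {
        pendingText.append(data, start, length);
    }

    /** A child element starts: flush buffered text first to keep order. */
    void addChildElement(String name) {
        flushText();
        children.add("<" + name + ">");
    }

    /** The element ends: flush whatever text is still buffered. */
    void end() {
        flushText();
    }

    private void flushText() {
        if (pendingText.length() > 0) {
            children.add("text:" + pendingText);
            pendingText.setLength(0);
        }
    }

    public static void main(String[] args) {
        MixedContentSketch block = new MixedContentSketch();
        // The parser delivers "e&#x78;tensible" as three chunks.
        for (String chunk : new String[] {"e", "x", "tensible"}) {
            block.addCharacters(chunk.toCharArray(), 0, chunk.length());
        }
        block.end();
        System.out.println(block.children); // [text:extensible]
    }
}
```

With an inline element between the chunks the buffer is flushed
at the element boundary, so the text is still split there. That
is why this approach alone does not help in the fo:inline case.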
The second fix is to flush pending areas in addWord(). This
fixes the lost characters problem but *still* does not
correctly hyphenate words split into inline FOs; only
the chunk actually overflowing the line is considered
for hyphenation.
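
The effect of flushing in addWord() can be shown with a toy
model. This is not LineArea's code: the class, the lineContent
list and the flushPendingFirst switch are hypothetical, and
areas are reduced to plain strings. It only illustrates the
ordering problem described above.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of one line under construction; all names are hypothetical. */
class PendingFlushSketch {
    // Areas that fit so far but are not yet committed to the line.
    final List<String> pendingAreas = new ArrayList<>();
    // Areas committed to the current line, in order.
    final List<String> lineContent = new ArrayList<>();

    void addPending(String text) {
        pendingAreas.add(text);
    }

    /** Appends a hyphenated word part, optionally flushing pending areas first. */
    void addWord(String part, boolean flushPendingFirst) {
        if (flushPendingFirst) {
            lineContent.addAll(pendingAreas);
            pendingAreas.clear();
        }
        lineContent.add(part);
    }

    /** Replays the e / x / tensi- sequence and returns the first line's text. */
    static String layoutFirstLine(boolean flushPendingFirst) {
        PendingFlushSketch line = new PendingFlushSketch();
        line.addPending("e");
        line.addPending("x");
        line.addWord("tensi-", flushPendingFirst);
        return String.join("", line.lineContent);
    }

    public static void main(String[] args) {
        System.out.println(layoutFirstLine(false)); // tensi-
        System.out.println(layoutFirstLine(true));  // extensi-
    }
}
```

Without the flush, the e and x stay in pendingAreas and would be
carried over to the next line, ahead of the rest of the word.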
More problems I noted:
- white space is handled inconsistently
- line break detection relies on white space only
- word detection for hyphenation relies on white space
and wrongly assumes there is a white space before the
word passed to doHyphenation()
- the LinkSet is not considered for hyphenated word parts
in addWord, and neither for page-number-citation nor
fo:character
- same for most of overlining, line-through and vertical
alignment
- characters are copied to FOText, and then copied *twice*
in LineArea.layout(), once purely for hyphenation. During
layout, character data is held in memory at least three
times, possibly four (counting the parser buffer)
Questions:
- Is it still worth doing major hacks in LineArea.java?
- Should we consider using Unicode break properties for
line break opportunity detection?
- How should words for hyphenation be detected?
- What happens to line breaks and word detection in case of
* inline graphics and other definitely non-text inlines
* inline foreign elements, like formulas
* inline-containers containing blocks, especially blocks
with text only
- Are there script or language dependencies to consider for
line break and word detection?
- At which point should white-space-collapse, linefeed-treatment
etc. be considered? Possibilities:
* while creating FOText
* while feeding it into the line area
* during line area layout
Applying white-space-collapse during FOText creation is
problematic when successive spaces fall into different inline FOs.
There are additional issues with consecutive spaces which have
been discussed here already, in particular how
foo <fo:inline text-decoration="underline"> bar</fo:inline>
should be handled. Will this result in two consecutive spaces,
one of them underlined? Has this issue been resolved in the meantime?
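
On the question about Unicode break properties above: as a
starting point for discussion, the JDK's java.text.BreakIterator
can list break opportunities without looking at white space
only. Using it here is my assumption, not something FOP does,
and whether its rules are close enough to the Unicode line
breaking properties would need checking. The class and method
names are hypothetical.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Lists line break opportunities as reported by the JDK; a sketch only. */
class LineBreakSketch {
    /** Returns every boundary offset, including 0 and text.length(). */
    static List<Integer> breakOpportunities(String text, Locale locale) {
        BreakIterator it = BreakIterator.getLineInstance(locale);
        it.setText(text);
        List<Integer> offsets = new ArrayList<>();
        for (int pos = it.first(); pos != BreakIterator.DONE; pos = it.next()) {
            offsets.add(pos);
        }
        return offsets;
    }

    public static void main(String[] args) {
        String text = "extensible stylesheet-language";
        // Each offset is a position where a line may end.
        System.out.println(breakOpportunities(text, Locale.ENGLISH));
    }
}
```

The locale parameter is where script or language dependencies
would enter. Hyphenation would still need its own word
detection on top of this.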
J.Pietschmann