Re: Current FOText implementation + Refinement whitespace handling

2005-11-03 Thread Manuel Mall
On Thu, 3 Nov 2005 04:08 am, Andreas L Delmelle wrote:
 Hi all,
 (Manuel, I guess this is mostly directed to you, as you may already
 have been browsing the same classes...)

 Just wandering a bit through the FOText source code (follow-up on
 Manuel's recent thread on whitespace handling), and I stumbled upon
 the following suspicious little detail:
 FOText has a static member 'lastFOTextProcessed', which doesn't seem
 to get cleared/flushed anywhere.

Actually I think I have the issue of white space handling during 
refinement which is implemented in the handleWhiteSpace method in Block 
under control.

The two issues I identified before:
a) the character iterator needs to indicate inline boundaries as they 
act as a fence with respect to white space collapse
b) we need to be able to remove leading white space before a hard break
have been solved.

a) The char iterator returned by inlines now returns NUL characters 
at the start and end of the inline, which the white space handling 
function automatically interprets as interrupting consecutive white space.
b) The LF look-ahead function already present has been enhanced to 
indicate LF for the end of the block, thereby allowing 
white-space-treatment to be applied to the white space leading up to 
the block end.
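
To make a) and b) a bit more concrete, here is a minimal sketch of the idea. It is not the code in Block.handleWhiteSpace; the class and method names (WhiteSpaceSketch, nextSignificant, refine) are made up for illustration. It also assumes that only plain spaces count as white space, that the look-ahead skips boundary markers as well as spaces, and it ignores white space following a linefeed.

```java
final class WhiteSpaceSketch {
    static final char BOUNDARY = '\u0000'; // NUL marks the start or end of an inline
    static final char LF = '\n';

    /**
     * Look-ahead: returns the next character that is neither a space nor a
     * boundary marker. The end of the block is reported as LF.
     */
    static char nextSignificant(String s, int from) {
        for (int i = from; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c != ' ' && c != BOUNDARY) {
                return c;
            }
        }
        return LF;
    }

    /**
     * Collapses runs of spaces, but never across a boundary marker, and drops
     * spaces that lead up to a linefeed or the end of the block.
     * Boundary markers themselves are not copied to the result.
     */
    static String refine(String s) {
        StringBuilder out = new StringBuilder();
        boolean prevSpace = false;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == BOUNDARY) {
                prevSpace = false; // fence: the next space starts a new run
            } else if (c == ' ') {
                boolean beforeBreak = nextSignificant(s, i + 1) == LF;
                if (!prevSpace && !beforeBreak) {
                    out.append(c);
                }
                prevSpace = true;
            } else {
                out.append(c);
                prevSpace = false;
            }
        }
        return out.toString();
    }
}
```

With this sketch, a space just before an inline start and a space just after it both survive, whereas three spaces inside one text run collapse to one.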

I have also rewritten the handleWhiteSpace method to behave in sync with 
the current understanding. My testcases so far indicate that this is 
working consistently with my expectations, other XSL-FO implementations, 
and comparable HTML use cases.

The only limitation I am aware of at the moment is that the 
suppress-at-line-break property is not supported. This is because the 
char iterator used only returns characters (in the pure Java sense) 
and not their related XSL-FO properties. At this point in time I 
consider support for the suppress-at-line-break property not a high 
priority.

The only bug (at this level) I am aware of is the already documented 
failure of a delete on a fo:character. But you already suggested some 
possible solutions, which I will have a look at.

In terms of completing white space handling, it now really boils down to 
the handling of white-space-treatment around formatter-generated 
breaks. This is currently being discussed in other e-mail threads and 
is quite complex (not logically but technically) because it interacts 
with the Knuth sequences / Knuth algorithm, the internal LM structures, 
as well as the proposed Unicode-compliant line breaking.


 The intention is quite clear, but the possible effects of the current
 implementation may turn out rather nasty. IIC, this is what the
 warning is about in the FOText javadoc as well as the TODO for that
 member variable.
 Rough guess: since the variable doesn't get cleared, it always
 contains a reference to a char array containing the last portion of
 accumulated text (or, more precisely, a FOText instance carrying that
 reference, as well as one to the previous FOText etc.) --even after
 the document has finished, into the next run if within the same JVM
 (+ possible multi-thread mayhem?)
 The TODO hints at a solution involving the page-sequence. I somehow
 feel that moving it to the block level would be enough... Logically,
 whitespace handling --which is one of the prime reasons of existence
 of this static variable-- deals with line-breaks, and
 start-block/end-block are implicit after- or before-eol.

 To follow up on that last sentence, the current refinement whitespace
 handling works roughly as follows:

 1. Add all text and inline children to the block, until the first
 non-inline child is encountered (or the block ends)
 2. Recursively iterate over *all* text nodes anywhere in the block up
 to here, converting/removing any superfluous whitespace in the
 process

 and (+/-) repeat the above for each uninterrupted sequence of
 text/inline children in the block.

 Seems to work nicely, for the most part.

 Manuel already raised the issue of inappropriate inter-FO
 whitespace-collapsing, but I have another question. Given this algorithm, and
 knowing that the inlines do not do any whitespace-handling
 themselves, what happens in the following case:

 <fo:block>
   <fo:inline>
     <fo:block>
       <fo:inline>
         <fo:block>
           ...
 ?

 My current best guess is that the inner block's underlying character
 sequence will be 'recursively' iterated over three times (?) That
 would be two too many, since all whitespace will have been collapsed
 the first time around.

The Block class has a flag which prevents multiple iterations I believe.

 I'm still chewing on some ideas to move part of this to InlineLevel,
 so that ultimately, we can do away with the recursion and let each
 level handle its own small part. The higher level then chains these
 small parts together with its own character content.

 One way to make this happen would be to overload
 Block.handleWhiteSpace() to deal with an InlineLevel parameter. This
 has the advantage of the whitespace-related properties being easily
 available.

Re: Current FOText implementation + Refinement whitespace handling

2005-11-03 Thread Andreas L Delmelle

On Nov 3, 2005, at 10:41, Manuel Mall wrote:


<snip />
The only limitation I am aware of at the moment is that the
suppress-at-line-break property is not supported. This is because the
char iterator used only returns characters (in the pure Java sense)
and not their related XSL-FO properties. At this point in time I
consider support for the suppress-at-line-break property not a high
priority.


Well, IMO it shouldn't be handled during refinement, exactly because
the iterators don't have access to the FO's properties. I know it can
be argued that in cases like:

...
<fo:character character=... suppress-at-line-break=true />
</fo:block>

it is already known during refinement that the character will be
suppressed. If the character is whitespace and the iterator is
modified to deal with that, it will be dropped, regardless of the
value of suppress-at-line-break (supposing default values for all
other props).


But in the most general cases, I think suppress-at-line-break is best  
dealt with during layout, not refinement.


Then again, modifying the OneCharIterator (or subclassing to
FOCharIterator?) to correctly deal with fo:characters
(removal/replacing) is easy enough. So, it's _possible_ to let it
access the FO's properties. It's only that I'm not sure whether it
would be _necessary_.
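
Just to illustrate what I mean, a rough sketch of a one-character iterator that writes removals and replacements back to its owner and can also expose one of the owner's properties. None of this is the existing OneCharIterator; CharacterFo, SingleCharIterator and all method names are hypothetical stand-ins, and the property is carried as a plain string.

```java
import java.util.NoSuchElementException;

final class CharacterIteratorSketch {
    /** Hypothetical stand-in for an fo:character and one of its properties. */
    static final class CharacterFo {
        char character;
        boolean removed;
        final String suppressAtLineBreak;

        CharacterFo(char character, String suppressAtLineBreak) {
            this.character = character;
            this.suppressAtLineBreak = suppressAtLineBreak;
        }
    }

    /** Iterates over the single character of its owner. */
    static final class SingleCharIterator {
        private final CharacterFo owner;
        private boolean consumed;

        SingleCharIterator(CharacterFo owner) {
            this.owner = owner;
        }

        boolean hasNext() {
            return !consumed && !owner.removed;
        }

        char nextChar() {
            if (!hasNext()) {
                throw new NoSuchElementException();
            }
            consumed = true;
            return owner.character;
        }

        /** Removal is recorded on the owner, so it survives the iterator. */
        void remove() {
            owner.removed = true;
        }

        /** Replacement is written back to the owner as well. */
        void replaceChar(char c) {
            owner.character = c;
        }

        /** Optional access to the owner's property; refinement may never need it. */
        String suppressAtLineBreak() {
            return owner.suppressAtLineBreak;
        }
    }
}
```

Whether refinement would ever call the property accessor is exactly the open question; the remove/replace part is what matters for whitespace handling.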


<snip />


My current best guess is that the inner block's underlying character
sequence will be 'recursively' iterated over three times (?) That
would be two too many, since all whitespace will have been collapsed
the first time around.



The Block class has a flag which prevents multiple iterations I believe.


See my immediate correction :-)
This became apparent just after I had clicked 'Send'...

<snip />

As I said above, I believe I got this sorted out. Maybe I should do an
early commit?


Would be nice. So at least the codebases we're talking about are  
completely in sync.


Cheers,

Andreas



Current FOText implementation + Refinement whitespace handling

2005-11-02 Thread Andreas L Delmelle

Hi all,
(Manuel, I guess this is mostly directed to you, as you may already  
have been browsing the same classes...)


Just wandering a bit through the FOText source code (follow-up on  
Manuel's recent thread on whitespace handling), and I stumbled upon  
the following suspicious little detail:
FOText has a static member 'lastFOTextProcessed', which doesn't seem  
to get cleared/flushed anywhere.


The intention is quite clear, but the possible effects of the current  
implementation may turn out rather nasty. IIC, this is what the  
warning is about in the FOText javadoc as well as the TODO for that  
member variable.
Rough guess: since the variable doesn't get cleared, it always  
contains a reference to a char array containing the last portion of  
accumulated text (or, more precisely, a FOText instance carrying that  
reference, as well as one to the previous FOText etc.) --even after  
the document has finished, into the next run if within the same JVM  
(+ possible multi-thread mayhem?)
The TODO hints at a solution involving the page-sequence. I somehow  
feel that moving it to the block level would be enough... Logically,  
whitespace handling --which is one of the prime reasons of existence  
of this static variable-- deals with line-breaks, and
start-block/end-block are implicit after- or before-eol.
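
To show the kind of effect I'm worried about, a small self-contained sketch. It does not reproduce FOText; StaticStateSketch, TextNode, BlockContext and the method names are invented for the example. The first variant keeps the "last processed" reference in a static field, the second keeps it in an object that lives only as long as one block.

```java
final class StaticStateSketch {
    /** Hypothetical text node holding its characters and a link to its predecessor. */
    static final class TextNode {
        final char[] chars;
        final TextNode previous;

        TextNode(char[] chars, TextNode previous) {
            this.chars = chars;
            this.previous = previous;
        }
    }

    // Variant 1: one reference shared by everything in the JVM, never cleared.
    static TextNode lastProcessedStatic;

    static TextNode processStatic(String text) {
        TextNode node = new TextNode(text.toCharArray(), lastProcessedStatic);
        lastProcessedStatic = node;
        return node;
    }

    // Variant 2: the reference lives in a per-block object and dies with it.
    static final class BlockContext {
        private TextNode last;

        TextNode process(String text) {
            TextNode node = new TextNode(text.toCharArray(), last);
            last = node;
            return node;
        }
    }

    /** Number of text nodes reachable by following the previous links. */
    static int chainLength(TextNode node) {
        int n = 0;
        for (TextNode t = node; t != null; t = t.previous) {
            n++;
        }
        return n;
    }
}
```

In the static variant, text processed in a second run is chained to text from the first run, so the first document's characters stay reachable. With one BlockContext per block the chain starts fresh each time.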


To follow up on that last sentence, the current refinement whitespace  
handling works roughly as follows:


1. Add all text and inline children to the block, until the first
non-inline child is encountered (or the block ends)
2. Recursively iterate over *all* text nodes anywhere in the block up
to here, converting/removing any superfluous whitespace in the process


and (+/-) repeat the above for each uninterrupted sequence of
text/inline children in the block.
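
In toy form, the two steps and the repetition look roughly like this. It is a simplified sketch, not the real Block code: Child, refineRuns and flush are hypothetical names, children are flattened to plain strings, and "superfluous whitespace" is reduced to collapsing runs of spaces.

```java
import java.util.ArrayList;
import java.util.List;

final class RunSplitSketch {
    /** Hypothetical child of a block: inline-level content or a nested block. */
    static final class Child {
        final boolean inlineLevel;
        final String text;

        Child(boolean inlineLevel, String text) {
            this.inlineLevel = inlineLevel;
            this.text = text;
        }
    }

    /**
     * Step 1: accumulate text/inline children until a non-inline child or the
     * end of the block. Step 2: refine the accumulated run. Repeat per run.
     */
    static List<String> refineRuns(List<Child> children) {
        List<String> refined = new ArrayList<>();
        StringBuilder run = new StringBuilder();
        for (Child child : children) {
            if (child.inlineLevel) {
                run.append(child.text);
            } else {
                flush(run, refined); // a block-level child ends the current run
            }
        }
        flush(run, refined); // the end of the block ends the last run
        return refined;
    }

    private static void flush(StringBuilder run, List<String> out) {
        if (run.length() > 0) {
            out.add(run.toString().replaceAll(" +", " "));
            run.setLength(0);
        }
    }
}
```

Note that the sketch collapses spaces straight across the boundary between a text child and an inline child, since the run is treated as one character sequence.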


Seems to work nicely, for the most part.

Manuel already raised the issue of inappropriate inter-FO
whitespace-collapsing, but I have another question. Given this algorithm, and
knowing that the inlines do not do any whitespace-handling  
themselves, what happens in the following case:


<fo:block>
  <fo:inline>
    <fo:block>
      <fo:inline>
        <fo:block>
          ...
?

My current best guess is that the inner block's underlying character  
sequence will be 'recursively' iterated over three times (?) That  
would be two too many, since all whitespace will have been collapsed  
the first time around.
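
The guess can be checked on a toy model of that nesting. This is only a sketch under my reading of the algorithm: each block, when it ends, iterates over everything nested in its text/inline children. The classes (Node, Text, Inline, Block) and the optional "already refined" guard are hypothetical and do not mirror the real FO tree classes.

```java
import java.util.ArrayList;
import java.util.List;

final class NestedPassSketch {
    static class Node {
        final List<Node> children = new ArrayList<>();

        Node add(Node child) {
            children.add(child);
            return this;
        }
    }

    static final class Text extends Node { int passes; }
    static final class Inline extends Node { }
    static final class Block extends Node { boolean refined; }

    /** Recursive iteration over all text nodes below n, counting each visit. */
    static void iterate(Node n, boolean useGuard) {
        if (n instanceof Text) {
            ((Text) n).passes++;
            return;
        }
        if (useGuard && n instanceof Block && ((Block) n).refined) {
            return; // guard: this block has already handled its own content
        }
        for (Node child : n.children) {
            iterate(child, useGuard);
        }
    }

    /** Children end before their parent; each block then handles its inline content. */
    static void finish(Node n, boolean useGuard) {
        for (Node child : n.children) {
            finish(child, useGuard);
        }
        if (n instanceof Block) {
            for (Node child : n.children) {
                if (child instanceof Text || child instanceof Inline) {
                    iterate(child, useGuard);
                }
            }
            ((Block) n).refined = true;
        }
    }

    /** Builds block > inline > block > inline > block > text and counts the passes. */
    static int passesOverInnermostText(boolean useGuard) {
        Text text = new Text();
        Node inner = new Block().add(text);
        Node middle = new Block().add(new Inline().add(inner));
        Node outer = new Block().add(new Inline().add(middle));
        finish(outer, useGuard);
        return text.passes;
    }
}
```

In this model the innermost text is visited three times without a guard and once with it, so a per-block flag is one way to avoid the extra passes.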


I'm still chewing on some ideas to move part of this to InlineLevel,  
so that ultimately, we can do away with the recursion and let each  
level handle its own small part. The higher level then chains these  
small parts together with its own character content.


One way to make this happen would be to overload  
Block.handleWhiteSpace() to deal with an InlineLevel parameter. This  
has the advantage of the whitespace-related properties being easily  
available. The call to this overloaded method would be made from  
InlineLevel.endOfNode().


If you're still following, I'd use a CharIterator that iterates over  
regular characters, fo:characters (and possibly the first and last  
characters of any nested FO). This iterator can operate very easily  
on both inlines and blocks. I don't immediately see any need to  
iterate backwards, at least not during refinement. Big advantage here  
would precisely be that we can wait until Block.endOfNode() to deal  
with any white-space for the entire block (leading and trailing), the  
nested bits will already have performed their parts at that point, so  
it is done sooner and far more efficiently IIC (guaranteed only one  
pass per level, no matter how deep the nesting goes).
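
A minimal sketch of the per-level idea, to make the "one pass per level" claim tangible. Everything here is an assumption for illustration: Level stands in for both inlines and blocks, endOfNode() here is not the FOP method of that name, collapsing is limited to plain spaces, and the way two adjacent parts are joined (dropping a leading space when the text so far already ends in one) is just one possible policy.

```java
import java.util.ArrayList;
import java.util.List;

final class PerLevelSketch {
    /** Total number of raw characters examined, to verify the single pass. */
    static int examined = 0;

    static final class Level {
        // Each part is either a String (this level's own text) or a nested Level.
        private final List<Object> parts = new ArrayList<>();
        private String refined; // set once, when this level ends

        Level text(String s) {
            parts.add(s);
            return this;
        }

        Level nested(Level level) {
            parts.add(level);
            return this;
        }

        /** Refines this level's own text and chains in the nested results. */
        String endOfNode() {
            StringBuilder out = new StringBuilder();
            for (Object part : parts) {
                String piece;
                if (part instanceof Level) {
                    Level child = (Level) part;
                    // Nested levels normally ended earlier; reuse their result.
                    piece = child.refined != null ? child.refined : child.endOfNode();
                } else {
                    piece = collapse((String) part);
                }
                // Junction check: only the last char so far and the first char
                // of the piece are inspected, never the piece's interior.
                if (out.length() > 0 && piece.length() > 0
                        && out.charAt(out.length() - 1) == ' ' && piece.charAt(0) == ' ') {
                    piece = piece.substring(1);
                }
                out.append(piece);
            }
            refined = out.toString();
            return refined;
        }
    }

    /** Collapses runs of spaces in raw text, counting every character examined. */
    static String collapse(String raw) {
        StringBuilder out = new StringBuilder();
        boolean prevSpace = false;
        for (int i = 0; i < raw.length(); i++) {
            examined++;
            char c = raw.charAt(i);
            if (c != ' ' || !prevSpace) {
                out.append(c);
            }
            prevSpace = (c == ' ');
        }
        return out.toString();
    }
}
```

The counter ends up equal to the total number of raw characters, however deep the nesting: each level looks at its own text once and at nested results only at their edges.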


Food for thought :-)

Cheers,

Andreas