Re: FOP and large documents: out of memory

2010-01-14 Thread Vincent Hennebert
Hi Stephan,

I’m not sure I would invest any energy into improving the
CachedRenderPagesModel (-conserve option). It doesn’t look like the
right approach to me, and as you noticed it doesn’t even work out of
the box currently.

Why store the Area Tree on disk? Why not render it directly into the
final output format? If the latter supports out-of-order pages, then
that’s great; otherwise we may as well store the final pages and order
them later on when the document is complete, instead of storing them in
a half-finished area tree format.

As for pages that hold unresolved references, and so obviously can’t be
rendered yet: there usually aren’t enough of them to make the area tree
solution vastly superior to a final-format one in terms of memory
consumption. Those pages could be kept in memory until all the
references they hold are resolved.

Also, the handling of forward references is currently less than optimal.
The resolution is made in the area tree instead of looping back to the
layout engine. ATM, a page reference is rendered using a placeholder
string (‘MMM’), and that placeholder is later replaced with the actual
value (e.g., ‘5’). This is fine for constructs like tables of contents,
but may produce ugly results if the page-number-citation is inside
a paragraph, ruining the even spacing. What’s the point of implementing
a high-quality line-breaking algorithm if its output is spoiled by poor
handling of page citations?
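
To make that concrete, here is a minimal sketch of the placeholder
mechanism (all names are invented for illustration; this is not FOP’s
actual code):

    class PageCitationArea {

        private final String targetId;   // id of the cited element
        private String text = "MMM";     // placeholder measured during line building
        private boolean resolved;

        PageCitationArea(String targetId) {
            this.targetId = targetId;
        }

        /** Called once the target's page number is finally known. */
        void resolve(int pageNumber) {
            // The line was justified against the width of "MMM"; if the
            // real value ("5") is narrower, the freed space is not
            // redistributed, which ruins the even spacing.
            this.text = Integer.toString(pageNumber);
            this.resolved = true;
        }
    }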

I think the two-pass approach is the best long-term solution, although
obviously less trivial. One challenge is to detect a possible infinite
loop. For example: the referenced item is at the beginning of page IX;
the reference is updated to ‘IX’, which takes less room than ‘MMM’, so
the document is re-laid out and the referenced item moves to page VIII;
the reference must be updated again, the document is laid out again, and
the referenced item ends up on page IX again. And again, and again...
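
A driver for such a multi-pass scheme might look like the sketch below.
Everything here is hypothetical (FOP has no such driver today); the
point is only to show how a fixed point, and the IX/VIII oscillation
above, could be detected:

    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    class MultiPassDriver {

        private static final int MAX_PASSES = 10;

        /** Re-runs layout until the id -> page-number assignment stabilizes. */
        Map<String, Integer> layoutUntilStable() {
            Map<String, Integer> refs = new HashMap<String, Integer>();
            Set<Map<String, Integer>> seen = new HashSet<Map<String, Integer>>();
            for (int pass = 0; pass < MAX_PASSES; pass++) {
                Map<String, Integer> result = layoutOnce(refs);
                if (result.equals(refs)) {
                    return result;  // fixed point: all citations are exact
                }
                if (!seen.add(result)) {
                    return result;  // assignment seen before: IX <-> VIII cycle
                }
                refs = result;
            }
            return refs;            // give up; accept possibly stale citations
        }

        /** One full layout pass: pages are discarded, only page numbers kept. */
        private Map<String, Integer> layoutOnce(Map<String, Integer> refs) {
            throw new UnsupportedOperationException("sketch only");
        }
    }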


One possible workaround for your use case is to generate your document
once, with a dummy TOC and plain “Page X” footers, into the intermediate
format; parse it to get the total number of pages and the page numbers
for each element of the TOC; then re-generate it with hardcoded values
for the page references.
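
In code, that workaround could look roughly like the following (FOP 0.95
API; the stylesheet parameter name and the page-counting heuristic are
assumptions, and extracting the TOC entries’ page numbers would need a
real parse of the area tree rather than the string matching shown here):

    import java.io.ByteArrayOutputStream;
    import java.io.File;
    import java.io.FileOutputStream;
    import java.io.OutputStream;

    import javax.xml.transform.Transformer;
    import javax.xml.transform.TransformerFactory;
    import javax.xml.transform.sax.SAXResult;
    import javax.xml.transform.stream.StreamSource;

    import org.apache.fop.apps.Fop;
    import org.apache.fop.apps.FopFactory;
    import org.apache.fop.apps.MimeConstants;

    public class TwoPassExample {

        public static void main(String[] args) throws Exception {
            FopFactory fopFactory = FopFactory.newInstance();
            File xml = new File("doc.xml");
            File xslt = new File("doc2fo.xsl");

            // Pass 1: render to the area tree (intermediate) format,
            // with the stylesheet emitting dummy "Page X" footers
            ByteArrayOutputStream areaTree = new ByteArrayOutputStream();
            Fop fop1 = fopFactory.newFop(MimeConstants.MIME_FOP_AREA_TREE,
                    fopFactory.newFOUserAgent(), areaTree);
            transform(xml, xslt, fop1, null);

            // Crude page count: one pageViewport element per page (assumed)
            int totalPages = count(areaTree.toString("UTF-8"), "<pageViewport");

            // Pass 2: re-run with the real total, rendering the final PDF
            OutputStream out = new FileOutputStream("doc.pdf");
            try {
                Fop fop2 = fopFactory.newFop(MimeConstants.MIME_PDF,
                        fopFactory.newFOUserAgent(), out);
                transform(xml, xslt, fop2, totalPages);
            } finally {
                out.close();
            }
        }

        private static void transform(File xml, File xslt, Fop fop,
                Integer totalPages) throws Exception {
            Transformer t = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(xslt));
            if (totalPages != null) {
                t.setParameter("total-pages", totalPages); // assumed param name
            }
            t.transform(new StreamSource(xml),
                    new SAXResult(fop.getDefaultHandler()));
        }

        private static int count(String s, String needle) {
            int n = 0;
            for (int i = s.indexOf(needle); i >= 0; i = s.indexOf(needle, i + 1)) {
                n++;
            }
            return n;
        }
    }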

HTH,
Vincent


Stephan Thesing wrote:
> Hello,
>
> as is well known, FOP can run out of heap memory when large documents
> are processed (http://xmlgraphics.apache.org/fop/0.95/running.html#memory).
>
> I have the situation that the documents I have to process mandate a
> footer on each page that contains a "page X of Y" element, and a TOC at
> the beginning of the document; i.e., FOP cannot lay out the pages until
> all referenced page-citations are known, which is after the last page
> of the document.
>
> When the page content is quite complicated (e.g. 2000 pages mostly
> filled with tables), the heap space does not suffice to hold all pages
> until all references can be resolved, so FOP aborts with an
> out-of-memory error.
>
> Since increasing the heap space does not always work (3 GB of heap
> space was required in one example), I need a better solution for this.
>
> 1. -conserve option
> One alternative would be the -conserve option, which serializes the
> pages to disk and reloads them as needed.
> Although slow, this would definitely be a solution if it worked, which
> it doesn't:
> Our documents include graphics (SVG, PNG), and the serialization with
> -conserve throws an exception, because some class in Batik is not
> serializable (e.g. SVGOMAnimatedString, IIRC); thus the page is
> missing, causing FOP to abort later.
> So Batik would have to be fixed for this.
>
> 2. Two passes
> Since the pages are kept because of unresolved references, one could do
> the same as e.g. LaTeX has always done: process the document twice.
> In a first run, pages are discarded after layout; only the references
> for page-citations are kept and reused at the end for the second pass
> (when all pages for the citations are finally known).
> For the second run, these id-refs are loaded up front, and no pages
> have to be kept.
> This would require more changes in FOP (and should obviously be made
> optional).
>
> I would appreciate any comments or other suggestions!
>
> Best regards
>   Stephan


Re: FOP and large documents: out of memory

2010-01-14 Thread Andreas Delmelle
On 13 Jan 2010, at 22:37, Stephan Thesing wrote:

 
>> On 13 Jan 2010, at 21:27, Stephan Thesing wrote:
>> ...
>>> Our documents include graphics (SVG, PNG), and the serialization with
>>> -conserve throws an exception, because some class in Batik is not
>>> serializable (e.g. SVGOMAnimatedString, IIRC); thus the page is
>>> missing, causing FOP to abort later.
>>> So Batik would have to be fixed for this.
>>
>> I think FOP can be 'fixed' for this too. If that is really the only
>> class that is causing trouble, then FOP could make a serializable
>> subclass for it, and use that in the area tree, instead of Batik's
>> default non-serializable implementation. Unless Batik really needs it,
>> why fix it there?
>
> I don't think that can work, as that class is used in elements nested
> in classes of Batik that represent the SVG.
> I.e., FOP never instantiates it, but the Batik code does, somewhere
> along the way.

OK, I see...

Just noticed that my idea of 'subclassing' is probably not entirely what I
meant...
Suppose, for the sake of the argument, that String were not serializable,
but we needed it for some reason and the Java vendor did not want to alter
their implementation. What could be done is to store only the info needed
to create a new String upon deserialization: serialize the char array, and
re-instantiate the String from it.
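
In code, that would be something like this rough sketch (pretending, as
above, that String were not serializable; the class name is invented):

    import java.io.IOException;
    import java.io.ObjectInputStream;
    import java.io.ObjectOutputStream;
    import java.io.Serializable;

    class StringHolder implements Serializable {

        private static final long serialVersionUID = 1L;

        // Pretend String were not serializable: mark it transient so the
        // default mechanism skips it, and persist only its raw data.
        private transient String value;

        StringHolder(String value) {
            this.value = value;
        }

        String getValue() {
            return value;
        }

        private void writeObject(ObjectOutputStream out) throws IOException {
            out.defaultWriteObject();
            out.writeObject(value.toCharArray()); // store just the char array
        }

        private void readObject(ObjectInputStream in)
                throws IOException, ClassNotFoundException {
            in.defaultReadObject();
            value = new String((char[]) in.readObject()); // re-instantiate
        }
    }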

I was thinking something similar should be possible here, but if it is really 
that far out of FOP's control, then never mind.


Regards

Andreas

Andreas Delmelle
mailto:andreas.delmelle.AT.telenet.be
---



FOP and large documents: out of memory

2010-01-13 Thread Stephan Thesing
Hello,

as is well known, FOP can run out of heap memory when large documents
are processed (http://xmlgraphics.apache.org/fop/0.95/running.html#memory).

I have the situation that the documents I have to process mandate a footer on
each page that contains a "page X of Y" element, and a TOC at the
beginning of the document; i.e., FOP cannot lay out the pages until all
referenced page-citations are known, which is after the last page of the
document.

When the page content is quite complicated (e.g. 2000 pages mostly filled
with tables), the heap space does not suffice to hold all pages until all
references can be resolved, so FOP aborts with an out-of-memory error.

Since increasing the heap space does not always work (3 GB of heap space
was required in one example), I need a better solution for this.

1. -conserve option
One alternative would be the -conserve option, which serializes the pages to
disk and reloads them as needed.
Although slow, this would definitely be a solution if it worked, which it
doesn't:
Our documents include graphics (SVG, PNG), and the serialization with
-conserve throws an exception, because some class in Batik is not
serializable (e.g. SVGOMAnimatedString, IIRC); thus the page is missing,
causing FOP to abort later.
So Batik would have to be fixed for this.

2. Two passes
Since the pages are kept because of unresolved references, one could do the
same as e.g. LaTeX has always done: process the document twice.
In a first run, pages are discarded after layout; only the references for
page-citations are kept and reused at the end for the second pass
(when all pages for the citations are finally known).
For the second run, these id-refs are loaded up front, and no pages have
to be kept.
This would require more changes in FOP (and should obviously be made
optional).
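
A rough sketch of the reference store this would need (the file format
and names are invented for illustration): pass 1 persists the
id -> page-number map, and pass 2 reloads it so that no pages have to
be retained in memory:

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.Properties;

    class CitationCache {

        /** Written at the end of pass 1, once every id's page is known. */
        static void save(Map<String, Integer> refs, File file) throws IOException {
            Properties props = new Properties();
            for (Map.Entry<String, Integer> e : refs.entrySet()) {
                props.setProperty(e.getKey(), e.getValue().toString());
            }
            OutputStream out = new FileOutputStream(file);
            try {
                props.store(out, "id -> page number, from pass 1");
            } finally {
                out.close();
            }
        }

        /** Loaded at the start of pass 2; pages can then be discarded freely. */
        static Map<String, Integer> load(File file) throws IOException {
            Properties props = new Properties();
            InputStream in = new FileInputStream(file);
            try {
                props.load(in);
            } finally {
                in.close();
            }
            Map<String, Integer> refs = new HashMap<String, Integer>();
            for (String id : props.stringPropertyNames()) {
                refs.put(id, Integer.parseInt(props.getProperty(id)));
            }
            return refs;
        }
    }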



I would appreciate any comments or other suggestions!


Best regards
  Stephan
-- 
Dr.-Ing. Stephan Thesing
Elektrastr. 50
81925 München
GERMANY



Re: FOP and large documents: out of memory

2010-01-13 Thread Andreas Delmelle

On 13 Jan 2010, at 21:27, Stephan Thesing wrote:

Hi Stephan,

<snip/>
> Since increasing the heap space does not always work (3 GB of heap
> space was required in one example), I need a better solution for this.
>
> 1. -conserve option
> One alternative would be the -conserve option, which serializes the
> pages to disk and reloads them as needed.
> Although slow, this would definitely be a solution if it worked, which
> it doesn't:
> Our documents include graphics (SVG, PNG), and the serialization with
> -conserve throws an exception, because some class in Batik is not
> serializable (e.g. SVGOMAnimatedString, IIRC); thus the page is
> missing, causing FOP to abort later.
> So Batik would have to be fixed for this.

I think FOP can be 'fixed' for this too. If that is really the only class that 
is causing trouble, then FOP could make a serializable subclass for it, and use 
that in the area tree, instead of Batik's default non-serializable 
implementation. Unless Batik really needs it, why fix it there?

It would require some thought on a (de)serialization routine, though... But
it seems much easier/faster to implement than the two-pass approach, if
time/effort is of the essence.



Regards,

Andreas
mailto:andreas.delmelle.AT.telenet.be

---