Re: FOP and large documents: out of memory
Hi Stephan,

I’m not sure I would invest any energy into improving the CachedRenderPagesModel (-conserve option). It doesn’t look like the right approach to me, and as you noticed it doesn’t even work out of the box currently. Why store the area tree on disk? Why not render it directly into the final output format? If the latter supports out-of-order pages, that’s great; otherwise we may as well store the final pages and order them later, once the document is complete, instead of storing them in a half-finished area tree format. As for pages that hold unresolved references, and so obviously can’t be rendered yet: there usually aren’t enough of them to make the area tree solution vastly superior to a final-format one in terms of memory consumption. Those pages could be kept in memory until all the references they hold are resolved.

Also, the handling of forward references is currently less than optimal. The resolution is done in the area tree instead of looping back to the layout engine. At the moment, a page reference is rendered using a placeholder string (‘MMM’), and that placeholder is later replaced with the actual value (e.g., ‘5’). This is fine for constructs like tables of contents, but may produce ugly results if the page-number-citation is inside a paragraph, ruining the even spacing. What’s the point of implementing a high-quality line-breaking algorithm if its output is spoiled by poor handling of page citations?

I think the two-pass approach is the best long-term solution, although obviously less trivial. One challenge is to detect a possible infinite loop. For example: the referenced item is at the beginning of page IX, so the reference is updated to ‘IX’, which takes less room than ‘MMM’; the document is re-laid out and the referenced item moves to page VIII; the reference must be updated again, the document is laid out again, and the referenced item ends up on page IX again. And again, and again...
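The oscillation Vincent describes amounts to searching for a fixed point of the layout function. A minimal toy sketch (all names here are invented for illustration; FOP has no such API) of a re-layout loop that stops either at a fixed point or when a previously seen id-to-page assignment recurs:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: re-run layout until page numbers reach a fixed
// point, detecting the IX <-> VIII oscillation by remembering every
// id -> page assignment already produced.
public class FixedPointLayout {

    // Stand-in for the layout engine: given the current id -> page
    // guesses, produce the resulting id -> page assignment.
    interface LayoutFunction {
        Map<String, Integer> run(Map<String, Integer> refs);
    }

    static Map<String, Integer> resolve(LayoutFunction layout) {
        Set<Map<String, Integer>> seen = new HashSet<>();
        Map<String, Integer> current = layout.run(new HashMap<>());
        while (seen.add(current)) {              // add() is false on a cycle
            Map<String, Integer> next = layout.run(current);
            if (next.equals(current)) {
                return current;                  // fixed point reached
            }
            current = next;
        }
        // Oscillation detected: give up and accept the last assignment
        // (a real implementation might instead pad citations to a fixed width).
        return current;
    }
}
```

A convergent document terminates at the first fixed point; the IX/VIII example cycles and is caught by the `seen` set instead of looping forever.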
One possible workaround for your use case is to generate your document once into the intermediate format, with a dummy TOC and just “Page X” in the footer; parse the result to get the total number of pages and the page number for each TOC entry; then re-generate the document with hardcoded values for the page references.

HTH,
Vincent

Stephan Thesing wrote:

Hello,

As is well known, FOP can run out of heap memory when large documents are processed (http://xmlgraphics.apache.org/fop/0.95/running.html#memory). The documents I have to process mandate a footer on each page containing a “page X of Y” element, and a TOC at the beginning of the document; i.e., FOP cannot lay out the pages until all referenced page citations are known, which is only after the last page of the document. When the page content is quite complicated (e.g. 2000 pages mostly full of tables), the heap space does not suffice to hold all pages until all references can be resolved, so FOP aborts with an out-of-memory error. Since increasing the heap space does not always work (3 GB of heap space was required in one example), I need a better solution.

1. -conserve option

One alternative would be the -conserve option, which serializes the pages to disk and reloads them as needed. Although slow, this would definitely be a solution if it worked, which it doesn’t: our documents include graphics (SVG, PNG), and serialization with -conserve throws an exception because some class in Batik is not serializable (e.g. SVGOMAnimatedString, IIRR); the page is then missing, causing FOP to abort later. Thus, Batik would have to be fixed for this.

2. Two passes

Since the pages are kept because of unresolved references, one could do the same as e.g. LaTeX has always done: process the document twice. In the first run, pages are discarded after layout; only the references for page citations are kept, to be reused in the second pass (when all pages for the citations are finally known).
For the second run, these id refs are loaded initially, so no pages have to be kept. This would require more changes in FOP (and should obviously be made optional).

I would appreciate any comments or other suggestions!

Best regards
Stephan
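The two-pass scheme above can be sketched as a toy model (all names invented; this is not FOP code — the "layout" is reduced to distributing ids over pages at a fixed rate, standing in for a real layout run whose pages are discarded immediately):

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy sketch of the LaTeX-style two-pass idea.
public class TwoPassDriver {

    // Pass 1: lay out, keeping only the id -> page-number entries;
    // the pages themselves are thrown away as soon as they are done.
    static Map<String, Integer> firstPass(List<String> idsInOrder, int itemsPerPage) {
        Map<String, Integer> refs = new LinkedHashMap<>();
        for (int i = 0; i < idsInOrder.size(); i++) {
            refs.put(idsInOrder.get(i), i / itemsPerPage + 1);
        }
        return refs;
    }

    // Pass 2: every page-number-citation resolves immediately from the
    // table, falling back to the 'MMM' placeholder for unknown ids.
    static String resolveCitation(Map<String, Integer> refs, String id) {
        Integer page = refs.get(id);
        return page != null ? page.toString() : "MMM";
    }
}
```

Because pass 2 never waits on an unresolved reference, no finished page needs to stay in memory, which is the whole point of the proposal.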
Re: FOP and large documents: out of memory
On 13 Jan 2010, at 22:37, Stephan Thesing wrote:

On 13 Jan 2010, at 21:27, Stephan Thesing wrote:
...
Our documents include graphics (SVG, PNG), and serialization with -conserve throws an exception because some class in Batik is not serializable (e.g. SVGOMAnimatedString, IIRR); the page is then missing, causing FOP to abort later. Thus, Batik would have to be fixed for this.

I think FOP can be 'fixed' for this too. If that is really the only class causing trouble, then FOP could make a serializable subclass for it, and use that in the area tree instead of Batik's default non-serializable implementation. Unless Batik really needs it, why fix it there?

I don't think that can work, as that class is used in elements nested inside the Batik classes that represent the SVG. I.e., FOP never instantiates it; the Batik code does, somewhere along the way.

OK, I see... I just noticed that my idea of 'subclassing' is probably not entirely what I meant. Suppose, for the sake of the argument, that String were not serializable, but we needed it for some reason and the Java vendor did not want to alter their implementation. What could be done is to store only the info needed to create a new String upon deserialization: serialize the char array, and re-instantiate the String from it instead. I was thinking something similar should be possible here, but if it is really that far out of FOP's control, then never mind.

Regards
Andreas

Andreas Delmelle
mailto:andreas.delmelle.AT.telenet.be
---
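Andreas's String analogy corresponds to the standard custom (de)serialization pattern in Java: a serializable holder with a transient field, whose writeObject/readObject methods store only the raw data needed to rebuild the object. A sketch under that assumption — all class names here are invented, and NonSerializableThing merely stands in for something like SVGOMAnimatedString:

```java
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical sketch: wrap a non-serializable object in a holder that
// serializes only the data needed to reconstruct it afterwards.
public class SerializationSketch {

    // Stands in for a Batik class that does not implement Serializable.
    static class NonSerializableThing {
        final String value;
        NonSerializableThing(String value) { this.value = value; }
    }

    static class ThingHolder implements Serializable {
        private static final long serialVersionUID = 1L;
        // transient: skipped by default serialization
        private transient NonSerializableThing thing;

        ThingHolder(NonSerializableThing thing) { this.thing = thing; }
        NonSerializableThing get() { return thing; }

        private void writeObject(ObjectOutputStream out) throws IOException {
            out.defaultWriteObject();
            out.writeObject(thing.value);          // write the raw data only
        }

        private void readObject(ObjectInputStream in)
                throws IOException, ClassNotFoundException {
            in.defaultReadObject();
            // rebuild the wrapped object from the stored data
            thing = new NonSerializableThing((String) in.readObject());
        }
    }
}
```

Whether this is feasible in practice depends, as the thread concludes, on whether FOP ever gets to wrap the instances before Batik buries them inside its own object graph.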
FOP and large documents: out of memory
Hello,

As is well known, FOP can run out of heap memory when large documents are processed (http://xmlgraphics.apache.org/fop/0.95/running.html#memory). The documents I have to process mandate a footer on each page containing a “page X of Y” element, and a TOC at the beginning of the document; i.e., FOP cannot lay out the pages until all referenced page citations are known, which is only after the last page of the document. When the page content is quite complicated (e.g. 2000 pages mostly full of tables), the heap space does not suffice to hold all pages until all references can be resolved, so FOP aborts with an out-of-memory error. Since increasing the heap space does not always work (3 GB of heap space was required in one example), I need a better solution.

1. -conserve option

One alternative would be the -conserve option, which serializes the pages to disk and reloads them as needed. Although slow, this would definitely be a solution if it worked, which it doesn’t: our documents include graphics (SVG, PNG), and serialization with -conserve throws an exception because some class in Batik is not serializable (e.g. SVGOMAnimatedString, IIRR); the page is then missing, causing FOP to abort later. Thus, Batik would have to be fixed for this.

2. Two passes

Since the pages are kept because of unresolved references, one could do the same as e.g. LaTeX has always done: process the document twice. In the first run, pages are discarded after layout; only the references for page citations are kept and at the end reused for the second pass (when all pages for the citations are finally known). For the second run, these id refs are loaded initially, so no pages have to be kept. This would require more changes in FOP (and should obviously be made optional).

I would appreciate any comments or other suggestions!

Best regards
Stephan

--
Dr.-Ing. Stephan Thesing
Elektrastr. 50
81925 München
GERMANY
Re: FOP and large documents: out of memory
On 13 Jan 2010, at 21:27, Stephan Thesing wrote:

Hi Stephan,

<snip/>

Since increasing the heap space does not always work (3 GB of heap space was required in one example), I need a better solution.

1. -conserve option

One alternative would be the -conserve option, which serializes the pages to disk and reloads them as needed. Although slow, this would definitely be a solution if it worked, which it doesn’t: our documents include graphics (SVG, PNG), and serialization with -conserve throws an exception because some class in Batik is not serializable (e.g. SVGOMAnimatedString, IIRR); the page is then missing, causing FOP to abort later. Thus, Batik would have to be fixed for this.

I think FOP can be 'fixed' for this too. If that is really the only class causing trouble, then FOP could make a serializable subclass for it, and use that in the area tree instead of Batik's default non-serializable implementation. Unless Batik really needs it, why fix it there?

It would require some thought on a (de)serialization routine, though... But it seems much easier/faster to implement than the two-pass approach, if time/effort is of the essence.

Regards,
Andreas
mailto:andreas.delmelle.AT.telenet.be
---