There's a bug in merging:
https://stackoverflow.com/questions/47140209/files-flattened-and-merged-with-pdfbox-are-sharing-common-cosstream
https://issues.apache.org/jira/browse/PDFBOX-3999

If you don't have a structure tree, then you can close the input files early.

Tilman

On 07.12.2017 at 19:24, David Fertig wrote:
I'm looking into merging multiple PDF files using more realistic memory/disk 
limits. For example, when merging 400 1-page files, PDFBox thinks it needs 30G 
of space. This is due to the way it segments the cache limits across all the 
input sources plus the output file, with the output cache limited to the same 
size as each input file. I've experimented with two easy modifications and one 
more involved modification.

   1.  Good: Split the cache in ½, give ½ to the output file, and segment the 
other ½ across the input files. (Still keeping them open until the end.)
   2.  Better: Split the cache in ½, give ½ to the output file and ½ to the 
current input file, and close each input file after merging it.
   3.  Best: Dynamically allocate 16-page (64K) chunks from memory or disk 
on demand, and release cache as documents are closed after merge.
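
As a rough illustration of approach 3, the accounting could be sketched as below. This is not PDFBox code: the class ChunkPoolSketch and its methods are hypothetical names, the pool only tracks reservations rather than real buffers, and the 4 KiB page size is an assumption inferred from "16 pages = 64K".

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical accounting for on-demand 64K chunks; not part of PDFBox. */
public class ChunkPoolSketch {
    // Assumed: 16 pages of 4 KiB each, matching the "16 page (64K)" figure.
    public static final int CHUNK_BYTES = 16 * 4096;

    private final long maxMemoryChunks;
    private final long maxDiskChunks;
    private long memoryChunksInUse;
    private long diskChunksInUse;
    // Per-owner reservation counts: index 0 = memory chunks, index 1 = disk chunks.
    private final Map<String, long[]> owned = new HashMap<>();

    public ChunkPoolSketch(long maxMemoryBytes, long maxDiskBytes) {
        this.maxMemoryChunks = maxMemoryBytes / CHUNK_BYTES;
        this.maxDiskChunks = maxDiskBytes / CHUNK_BYTES;
    }

    /** Reserves one chunk for the owner: memory first, then disk. False if both are full. */
    public boolean allocate(String owner) {
        long[] counts = owned.computeIfAbsent(owner, k -> new long[2]);
        if (memoryChunksInUse < maxMemoryChunks) {
            memoryChunksInUse++;
            counts[0]++;
            return true;
        }
        if (diskChunksInUse < maxDiskChunks) {
            diskChunksInUse++;
            counts[1]++;
            return true;
        }
        return false;
    }

    /** Returns all of an owner's chunks to the pool, e.g. when a merged input is closed. */
    public void release(String owner) {
        long[] counts = owned.remove(owner);
        if (counts == null) {
            return;
        }
        memoryChunksInUse -= counts[0];
        diskChunksInUse -= counts[1];
    }

    public long bytesInUse() {
        return (memoryChunksInUse + diskChunksInUse) * CHUNK_BYTES;
    }

    public static void main(String[] args) {
        ChunkPoolSketch pool = new ChunkPoolSketch(2L * CHUNK_BYTES, CHUNK_BYTES);
        pool.allocate("input-1");
        pool.allocate("output");
        System.out.println("in use before close: " + pool.bytesInUse());
        pool.release("input-1"); // input closed right after it was merged
        System.out.println("in use after close:  " + pool.bytesInUse());
    }
}
```

The point of the sketch is that the limit applies to the chunks actually reserved at any moment, so space freed by a closed input becomes available to the next input or to the output.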

All these approaches have reduced the memory limit requirements by 1-2 orders 
of magnitude. While I realize this doesn't change the actual memory and disk 
space used, it allows the limits to be a reasonable expectation of space used 
during the merge process.
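
To make the effect on the limits concrete, here is a small sketch of the arithmetic behind the equal-slice scheme and approaches 1 and 2. The class and method names are hypothetical, and the 80 MiB output and 512 KiB per-input figures are made-up illustration values, not measurements from PDFBox.

```java
/** Hypothetical arithmetic for the total cache limit each scheme would require. */
public class CacheBudgetSketch {

    /** Equal slices: every input and the output get total / (inputs + 1). */
    public static long requiredLimitEqualSlices(long outputNeed, long inputNeed, int inputs) {
        long perSlice = Math.max(outputNeed, inputNeed);
        return perSlice * (inputs + 1L);
    }

    /** Approach 1: half for the output, the other half segmented across all open inputs. */
    public static long requiredLimitHalfAndSegment(long outputNeed, long inputNeed, int inputs) {
        return Math.max(2L * outputNeed, 2L * inputNeed * inputs);
    }

    /** Approach 2: half for the output, half for the single input currently open. */
    public static long requiredLimitHalfAndHalf(long outputNeed, long inputNeed) {
        return Math.max(2L * outputNeed, 2L * inputNeed);
    }

    public static void main(String[] args) {
        long kib = 1024L;
        long mib = 1024L * kib;
        long outputNeed = 80 * mib; // hypothetical space the merged output needs
        long inputNeed = 512 * kib; // hypothetical space one 1-page input needs
        int inputs = 400;

        System.out.println("equal slices (MiB): "
                + requiredLimitEqualSlices(outputNeed, inputNeed, inputs) / mib);
        System.out.println("approach 1 (MiB):   "
                + requiredLimitHalfAndSegment(outputNeed, inputNeed, inputs) / mib);
        System.out.println("approach 2 (MiB):   "
                + requiredLimitHalfAndHalf(outputNeed, inputNeed) / mib);
    }
}
```

With equal slices the large output need is multiplied by the number of sources, which is where the inflated limit comes from; in the other two schemes the limit no longer scales with the output need times the file count.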

I have one question. Approaches #2 and #3 both close each input file right 
after it is merged and have shown no issues (in limited testing). Is there a 
reason the current merge utility keeps all the input files open during the 
merge and only closes them at the end? Closing each one after it is merged 
would save considerable cache space and reduce the number of file handles needed.

Thank you,
David