pdfbox-1.8.2
tika-app-1.4 (I'm including Apache Tika because I just found out that
it comes with pdfbox)
I have various existing PDFs that I need to merge into one PDF. The
number of PDFs to merge varies, anywhere from 2 to 10, 20 or more, and
the number of pages in each PDF also varies. These PDFs are mostly
scanned via an EDRMS like HP TRIM7, so documents such as medical
reports end up as PDFs in which each page is an image instead of
text.
Merging them into a single PDF is no problem using the PDFMergerUtility.
After I have merged them into a single PDF, I then need to add
bookmarks so that the person reading the PDF ( e.g. insurer, trustee )
can quickly jump to a section of the merged PDF to see one of the
merged PDFs.
The issue is memory consumption. The merged PDFs tend to be quite
large (anywhere from 200MB to 1GB, again because each individual PDF
was scanned via an EDRMS like HP TRIM7, so each page tends to be an
image). With several of these merges running in parallel, I can
easily consume the entire heap allocated to the JVM.
To create the bookmarks, I have to open the large / merged PDF.
So the question is: is there a better way of creating bookmarks, so
that the amount of memory consumed is minimal?
Note that I am making sure I am calling PDDocument.close() in a
finally clause. See snippets below.
1) To create the bookmarks, I have to find out the number of pages in
each PDF before they are merged. Something like this, in a loop:
    PDDocument document = null;
    try {
        document = PDDocument.load(aDownload.getLocalFile());
        aDownload.setNumberOfPages( document.getNumberOfPages() );
    } finally {
        if( document != null ) {
            document.close();
        }
    }
2) Then I have to open the large / merged PDF file and create the
bookmarks, using the page counts from above as the guide (I also have
to set the metadata on the PDF: the author, date/time, subject):
    private void finaliseDocument(
            final File pdfFile,
            final List<DocumentDownloadEntry> downloadEntries )
            throws Exception
    {
        logger.log(Level.INFO,
                String.format("Finalising PDF document %s", pdfFile.toString()));
        PDDocument document = null;
        try {
            document = PDDocument.load(pdfFile);
            document.getDocumentCatalog().setPageMode(PDDocumentCatalog.PAGE_MODE_USE_OUTLINES);
            document.getDocumentInformation().setCreationDate(Calendar.getInstance());
            document.getDocumentInformation().setAuthor(getUserName());
            document.getDocumentInformation().setTitle(
                    getClaimDocuments().getEquipId() + " - " + getSubmissionType());
            makeBookmarks( document, downloadEntries );
            document.save(pdfFile);
        } finally {
            if( document != null ) {
                document.close();
            }
        }
    }
    private void makeBookmarks(
            final PDDocument document,
            final List<DocumentDownloadEntry> downloadEntries)
            throws Exception
    {
        PDDocumentOutline outline = new PDDocumentOutline();
        document.getDocumentCatalog().setDocumentOutline( outline );

        // Single top-level node, titled after the document, that holds
        // one bookmark per merged PDF.
        PDOutlineItem pagesOutline = new PDOutlineItem();
        pagesOutline.setTitle( document.getDocumentInformation().getTitle() );
        outline.appendChild( pagesOutline );

        @SuppressWarnings("rawtypes")
        List pages = document.getDocumentCatalog().getAllPages();

        // Index of the first page of the current source PDF in the merged PDF.
        int pageIndex = 0;
        for( DocumentDownloadEntry aDownload : downloadEntries ) {
            if( aDownload.isDownload() && aDownload.isDownloaded() ) {
                PDPage page = (PDPage)pages.get( pageIndex );
                // Skip past this PDF's pages to reach the start of the next one.
                pageIndex += aDownload.getNumberOfPages();

                PDPageFitWidthDestination dest = new PDPageFitWidthDestination();
                dest.setPage( page );

                PDOutlineItem bookmark = new PDOutlineItem();
                bookmark.setDestination( dest );
                bookmark.setTitle( aDownload.getDocumentName() );
                pagesOutline.appendChild( bookmark );
            }
        }
        pagesOutline.openNode();
        outline.openNode();
    }
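
To make the page bookkeeping in makeBookmarks easier to follow, here is a
minimal sketch of just the offset calculation, with no PDFBox involved. The
class BookmarkOffsets and its method startPages are hypothetical names made up
for this illustration; they are not part of PDFBox or of my real code. Given
the page count of each source PDF in merge order, it returns the zero-based
index of the first page of each one in the merged PDF, which is the page each
bookmark should point at.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of PDFBox): maps per-document page counts
// to the zero-based index of each document's first page in the merged PDF.
final class BookmarkOffsets {

    private BookmarkOffsets() {
        // static utility, no instances
    }

    static List<Integer> startPages(List<Integer> pageCounts) {
        List<Integer> starts = new ArrayList<Integer>(pageCounts.size());
        int next = 0; // index of the first page of the next document
        for (int count : pageCounts) {
            starts.add(next);
            next += count; // the following document begins after these pages
        }
        return starts;
    }
}
```

For example, three PDFs of 3, 5 and 2 pages start at merged page indexes 0, 3
and 8. This only works if the counts are listed in the same order the PDFs
were merged in, and only for the PDFs that were actually merged.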