On 1/16/2012 4:24 AM, Nick Burch wrote:
> On Fri, 13 Jan 2012, P. Hill wrote:
>> Anyone know about the (future?) ability of Tika to parse PDF
>> Portfolio Files?
>> http://help.adobe.com/en_US/Acrobat/9.0/Standard/WSA2872EA8-9756-4a8c-9F20-8E93D59D91CE.html
>
> My hunch is that this'll need some PDFBox support too, to let us at
> the original files, and to let us know what parts are a portfolio.
> As a first step, I'd suggest you ask on the PDFBox list about their
> support for Portfolio files
>
> Nick
Nick,
I finally got a moment to ask about PDF Portfolio files and the folks
over at PDFBox directed me to:
http://pdfbox.apache.org/userguide/file_references.html
I'm passing that along for the Tika developers. It seems there may be
some issues with combining all the content in a portfolio, much like
those for e-mails with attachments or other compound documents
(http://wiki.apache.org/tika/MetadataDiscussion).
I can report that my company has seen at least one end user using
Portfolio files, but they don't seem very common.
-Paul