+1 to Nick's links and advice.
To use the RecursiveParserWrapper with tika-app, use the -J option; or if
you're using tika-server, use the /rmeta endpoint.
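For the tika-server route, here is a minimal Python sketch of calling /rmeta and listing what came back. It assumes a server is already running on localhost at the default port 9998. The helper names (fetch_rmeta, summarize) are made up for illustration. The "X-TIKA:" keys are the ones the RecursiveParserWrapper writes in the 1.x line as far as I know, so check them against your own output.

```python
import json
import urllib.request

# Assumption: tika-server is running locally on its default port (9998).
RMETA_URL = "http://localhost:9998/rmeta"

# Metadata keys as written by the RecursiveParserWrapper in Tika 1.x
# (assumed here; verify against the JSON your server returns).
CONTENT_KEY = "X-TIKA:content"
PATH_KEY = "X-TIKA:embedded_resource_path"


def fetch_rmeta(path, url=RMETA_URL):
    """PUT a file to /rmeta and return the decoded JSON list of metadata objects."""
    with open(path, "rb") as f:
        body = f.read()
    req = urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={"Accept": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def summarize(entries):
    """Return one (embedded path, content type, text length) tuple per entry.

    Entries without an embedded path (normally the container file itself)
    get an empty string in the first slot.
    """
    rows = []
    for entry in entries:
        text = entry.get(CONTENT_KEY) or ""
        rows.append((entry.get(PATH_KEY, ""), entry.get("Content-Type"), len(text)))
    return rows
```

Something like summarize(fetch_rmeta("report.docx")) (the file name is hypothetical) then gives one row for the container and one for each embedded document.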
The ecology of embedded docs is rich and understudied (IMHO); let us know what
you find!
Cheers,
Tim
-----Original Message-----
From: McGreevy, Anthony [mailto:[email protected]]
Sent: Tuesday, March 27, 2018 11:47 AM
To: [email protected]
Subject: RE: Subfile Extraction
Thanks for the information!
Much appreciated!
Anthony
-----Original Message-----
From: Nick Burch [mailto:[email protected]]
Sent: 27 March 2018 15:50
To: [email protected]
Subject: Re: Subfile Extraction
On Sun, 25 Mar 2018, McGreevy, Anthony wrote:
> I am currently playing with Tika to see how it works with regards to
> extraction of subfiles.
Do you mean files or resources embedded within another file?
If so... With the Tika App, you want -z to have these extracted. With the Tika
java classes, you want to pop something like a
https://tika.apache.org/1.17/api/org/apache/tika/parser/RecursiveParserWrapper.html
or a
https://tika.apache.org/1.17/api/org/apache/tika/extractor/ContainerExtractor.html
on your ParseContext to get called for embedded resources. See
https://wiki.apache.org/tika/RecursiveMetadata for more on how it works and how
to have Tika parse + return all the embedded files and resources.
Nick