Hi,

Following the example at http://wiki.apache.org/tika/RecursiveMetadata, I'm trying to parse nested documents and collect the content and metadata of each document separately. All is going well, in fact maybe a little too well ;) - parsers descend into the internal components of compound documents, so e.g. I'm getting all images from Word docs as separate nested documents. This is very cool - it's good to know that Tika supports this when you need it.

However, I'd like to have an option to avoid recursing into compound documents, while still being able to process nested archives (zip, tgz, etc.). Is there an easy way to express this preference? I thought about using the type of the handler passed to RecursiveParser.parse(...) to decide when to stop recursing, but I noticed that in both cases (embedded components and entries in archives) an EmbeddedContentHandler is passed to the parse(...) method.
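
To make the preference concrete, here is a rough sketch of the kind of decision I have in mind. It is plain Java and not an existing Tika API: the RecursionPolicy class, its shouldRecurse method and the list of archive types are all hypothetical, and it assumes the media type of the container (e.g. its Content-Type metadata value) is known at the point where the parser decides whether to descend.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of Tika: decides whether to descend into a
// container based only on the container's media type.
final class RecursionPolicy {

    // Assumed list of archive/compression media types; adjust as needed.
    private static final Set<String> ARCHIVE_TYPES = new HashSet<>(Arrays.asList(
            "application/zip",
            "application/x-tar",
            "application/gzip",
            "application/x-gzip",
            "application/x-bzip2",
            "application/x-7z-compressed"));

    private RecursionPolicy() {
    }

    // Returns true for archive-like containers, false for everything else
    // (including compound documents such as Word files, and unknown types).
    static boolean shouldRecurse(String containerType) {
        if (containerType == null) {
            return false;
        }
        // Drop media type parameters such as "; charset=UTF-8".
        int semi = containerType.indexOf(';');
        String base = semi >= 0 ? containerType.substring(0, semi) : containerType;
        return ARCHIVE_TYPES.contains(base.trim().toLowerCase(Locale.ROOT));
    }
}
```

With something like this, the recursing parser would call RecursionPolicy.shouldRecurse(parentType) before handling an embedded entry, and skip the entry when the parent is, say, application/msword. Whether the parent's type is actually available at that point in the current API is exactly what I'm unsure about.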

Oh, and I really would appreciate some further feedback on TIKA-675 - if this idea is ok I'd start working towards a patch.

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
