Hi,
Following the example at http://wiki.apache.org/tika/RecursiveMetadata,
I'm trying to parse nested documents and collect the content and metadata
of each document separately. All is going well, in fact maybe a little
too well ;) - parsers descend into internal components of compound
documents, so e.g. I'm getting all images from Word docs as separate
nested documents. This is very cool - it's good to know that Tika
supports this when you need it.
However, I'd like to have an option to avoid recursing into compound
documents while still being able to process nested archives (zip, tgz,
etc.). Is there an easy way to express this preference? I thought about
using the type of handler passed to RecursiveParser.parse(...) to decide
when to stop recursing, but I noticed that in both cases (embedded
components and entries in archives) an EmbeddedContentHandler is passed
to the parse(...) method.
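To make the preference concrete, here is a rough sketch of the kind of
decision I have in mind. It is plain Java and does not use any Tika API:
RecursionPolicy and shouldRecurse are hypothetical names, the list of
archive types is my own assumption, and it assumes the media type of the
enclosing container is somehow available at the point where the recursing
parser is invoked.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, NOT part of Tika: decides whether the entries of a
// container should be parsed, based only on the container's media type.
final class RecursionPolicy {

    // Assumed set of archive/compression types; the exact strings reported
    // by a given Tika version may differ and would need checking.
    private static final Set<String> ARCHIVE_TYPES =
        Collections.unmodifiableSet(new HashSet<>(Arrays.asList(
            "application/zip",
            "application/x-tar",
            "application/gzip",
            "application/x-gzip",
            "application/x-bzip2",
            "application/x-7z-compressed")));

    private RecursionPolicy() {}

    // Returns true only when the container is a known archive type, so
    // compound documents (e.g. Word files with images) are not descended into.
    static boolean shouldRecurse(String containerType) {
        if (containerType == null) {
            return false;
        }
        // Drop any parameters such as "; charset=..." and normalize case.
        int semi = containerType.indexOf(';');
        String base = (semi >= 0 ? containerType.substring(0, semi) : containerType)
            .trim()
            .toLowerCase(Locale.ROOT);
        return ARCHIVE_TYPES.contains(base);
    }
}
```

With something like this, the recursing parser would skip an embedded
entry whenever shouldRecurse(...) returns false for its parent's type.
What I'm missing is a clean way to learn the parent's type at that point.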
Oh, and I really would appreciate some further feedback on TIKA-675 - if
this idea is ok I'd start working towards a patch.
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com