Recursive parsing

Andrzej Bialecki Wed, 07 Dec 2011 01:08:59 -0800

Hi,

Following the example here http://wiki.apache.org/tika/RecursiveMetadata I'm trying to parse nested documents and collect separately the content and metadata of each document. All is going well, in fact maybe a little too well ;) - parsers descend into internal components of compound documents, so e.g. I'm getting all images from Word docs as separate nested documents. This is very cool - it's good to know that Tika supports this when you need it.

However, I'd like to have an option to avoid recursing into compound documents, while still being able to process nested archives (like zip, tgz, etc). Is there any easy way to express this preference? I thought about using the type of handler passed to the RecursiveParser.parse(..) to decide when to stop recursing, but I noticed that in both cases (embedded components and entries in archives) an EmbeddedContentHandler is passed to the parse(...) method.
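
To make the preference concrete, here is a minimal sketch of one possible check. It assumes the media type of the container document is known at the point where the decision to recurse is made. RecursionPolicy, shouldRecurse and the list of archive types are hypothetical names and values chosen for illustration; none of them is part of Tika.

```java
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not a Tika class: decides whether to descend into
// the children of a container based on the container's media type.
final class RecursionPolicy {

    // Assumed set of archive media types; adjust it to the types that the
    // detector in use actually reports.
    private static final Set<String> ARCHIVE_TYPES = Set.of(
            "application/zip",
            "application/x-tar",
            "application/gzip",
            "application/x-gzip",
            "application/x-bzip2");

    private RecursionPolicy() {
    }

    // Returns true only when the container is an archive, so entries of a
    // zip or tgz are processed while images embedded in a Word document
    // are skipped.
    static boolean shouldRecurse(String containerMediaType) {
        if (containerMediaType == null) {
            return false;
        }
        // Drop any parameters such as "; charset=..." before comparing.
        int semicolon = containerMediaType.indexOf(';');
        String base = semicolon >= 0
                ? containerMediaType.substring(0, semicolon)
                : containerMediaType;
        return ARCHIVE_TYPES.contains(base.trim().toLowerCase(Locale.ROOT));
    }
}
```

With such a check, a call like RecursionPolicy.shouldRecurse("application/msword") would return false and the embedded parts would be left alone, while "application/zip" would return true. Where the check would be called from is left open here.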

Oh, and I really would appreciate some further feedback on TIKA-675 - if this idea is ok I'd start working towards a patch.


--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
