Hi,

On Mon, Sep 14, 2009 at 3:53 PM, Ken Krugler <kkrugler_li...@transpac.com> wrote:
> Has this been discussed previously? Just curious, as I'd thought about
> changing my mbox parser to handle incremental calls to parse(), and save
> state in the context object being passed in. This would require a small
> change to how I call the parser, as it would then be a loop (while
> (is.available() > 0) { parser.parse(is, xxx); })

See TIKA-252 [1] for a related feature request.

Tika has been designed to deal with documents as single entities, since
there is no comprehensive composite document abstraction that we could
easily use. Trying to solve that problem, you quickly end up with
questions about whether an inline image should be treated the same as a
file attachment, or whether things like <img> tags in HTML documents
should be resolved and the images included in the parse output. It's not
an unsolvable problem, but it's complex enough that so far we've scoped
the issue outside Tika.

However, within the current Tika design there are a couple of options
you could try:

* As suggested in TIKA-252, you could extend the PackageParser to embed
  per-component metadata into the produced XHTML output. Your application
  would then need to detect the component boundaries and the included
  metadata from the XHTML output.

* Alternatively, you could inject a custom delegate parser that
  intercepts each component stream and handles it separately, without
  producing output to be included in the top-level parse result.

[1] https://issues.apache.org/jira/browse/TIKA-252

BR,

Jukka Zitting
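
A rough illustration of the second option (the intercepting delegate) might
look like the sketch below. It is deliberately self-contained and does not
use Tika's actual classes: SimpleParser, ContainerParser, InterceptingParser
and the "--" separator line are all hypothetical stand-ins, chosen only to
show the shape of the idea. A real implementation would plug into whatever
delegation mechanism the Tika version in use provides.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class DelegateSketch {

    // Hypothetical stand-in for a parser interface; NOT Tika's actual API.
    interface SimpleParser {
        void parse(InputStream stream, StringBuilder output) throws IOException;
    }

    // Hypothetical container parser: splits the input on "--" separator
    // lines and hands each component stream to the injected delegate.
    static class ContainerParser implements SimpleParser {
        private final SimpleParser delegate;

        ContainerParser(SimpleParser delegate) {
            this.delegate = delegate;
        }

        @Override
        public void parse(InputStream stream, StringBuilder output) throws IOException {
            String all = new String(stream.readAllBytes(), StandardCharsets.UTF_8);
            for (String part : all.split("\n--\n")) {
                InputStream component =
                        new ByteArrayInputStream(part.getBytes(StandardCharsets.UTF_8));
                delegate.parse(component, output);
            }
        }
    }

    // Intercepting delegate: records each component on its own and
    // deliberately writes nothing to the top-level output.
    static class InterceptingParser implements SimpleParser {
        final List<String> components = new ArrayList<>();

        @Override
        public void parse(InputStream stream, StringBuilder output) throws IOException {
            components.add(new String(stream.readAllBytes(), StandardCharsets.UTF_8));
        }
    }

    public static void main(String[] args) throws IOException {
        InterceptingParser interceptor = new InterceptingParser();
        SimpleParser container = new ContainerParser(interceptor);
        StringBuilder topLevel = new StringBuilder();

        byte[] input = "first message\n--\nsecond message".getBytes(StandardCharsets.UTF_8);
        container.parse(new ByteArrayInputStream(input), topLevel);

        System.out.println(interceptor.components.size()
                + " components, top-level length " + topLevel.length());
    }
}
```

The application keeps a reference to the intercepting delegate, so after the
top-level parse() call returns it can process the collected components one by
one, while the top-level output stays free of per-component content.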