Hi,

On Mon, Sep 14, 2009 at 3:53 PM, Ken Krugler
<kkrugler_li...@transpac.com> wrote:
> Has this been discussed previously? Just curious, as I'd thought about
> changing my mbox parser to handle incremental calls to parse(), and save
> state in the context object being passed in. This would require a small
> change to how I call the parser, as it would then be a loop (while
> (is.available() > 0) { parser.parse(is, xxx); })

See TIKA-252 [1] for a related feature request.

Tika has been designed to deal with documents as single entities,
since there is no comprehensive composite document abstraction that we
could easily use. Trying to solve that problem you quickly end up with
questions about whether an inline image should be treated the same as
a file attachment, or whether things like <img> tags in HTML documents
should be resolved and the images included in the parse output. It's
not an unsolvable problem, but it's complex enough that so far we've
kept it outside the scope of Tika.

However, within the current Tika design there are a couple of options
you could pursue:

* As suggested in TIKA-252, you could extend the PackageParser to
embed per-component metadata into the produced XHTML output. Your
application would then need to detect the component boundaries and the
included metadata from the XHTML output.

* Alternatively you could inject a custom delegate parser that
intercepts each component stream and handles it separately without
producing output to be included in the top-level parse result.
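
To make the second option a bit more concrete, here is a minimal,
self-contained sketch of the delegate idea. It deliberately does not use
Tika's real classes: ComponentParser, CollectingDelegate, ToyMboxParser
and DelegateSketch are all hypothetical names invented for illustration,
and the "From " line splitting is a toy stand-in for real mbox parsing
(it ignores ">From " escaping, for example). The point is only the shape:
the container parser hands each component stream to a delegate, and the
delegate records it separately instead of adding to the top-level output.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a parser callback; NOT Tika's actual Parser interface.
interface ComponentParser {
    void parse(InputStream stream, Map<String, String> metadata) throws IOException;
}

// Delegate that keeps each component as its own record instead of
// contributing anything to the top-level parse output.
class CollectingDelegate implements ComponentParser {
    final List<Map<String, String>> components = new ArrayList<>();

    @Override
    public void parse(InputStream stream, Map<String, String> metadata) throws IOException {
        int length = 0;
        while (stream.read() != -1) {
            length++; // a real delegate would parse or index the bytes here
        }
        Map<String, String> record = new LinkedHashMap<>(metadata);
        record.put("length", Integer.toString(length));
        components.add(record);
    }
}

// Toy container parser: every line starting with "From " opens a new message,
// and each finished message is handed to the delegate as its own stream.
class ToyMboxParser {
    private final ComponentParser delegate;

    ToyMboxParser(ComponentParser delegate) {
        this.delegate = delegate;
    }

    void parse(InputStream stream) throws IOException {
        BufferedReader reader =
                new BufferedReader(new InputStreamReader(stream, StandardCharsets.UTF_8));
        String envelope = null;
        StringBuilder body = new StringBuilder();
        String line;
        while ((line = reader.readLine()) != null) {
            if (line.startsWith("From ")) {
                flush(envelope, body);   // hand off the previous message, if any
                envelope = line;
                body.setLength(0);
            } else if (envelope != null) {
                body.append(line).append('\n');
            }
        }
        flush(envelope, body);           // hand off the last message
    }

    private void flush(String envelope, StringBuilder body) throws IOException {
        if (envelope == null) {
            return; // nothing seen yet
        }
        Map<String, String> metadata = new LinkedHashMap<>();
        metadata.put("envelope", envelope);
        byte[] bytes = body.toString().getBytes(StandardCharsets.UTF_8);
        delegate.parse(new ByteArrayInputStream(bytes), metadata);
    }
}

class DelegateSketch {
    // Convenience wrapper: split a toy mbox string into per-message records.
    static List<Map<String, String>> split(String mbox) {
        CollectingDelegate delegate = new CollectingDelegate();
        try {
            new ToyMboxParser(delegate).parse(
                    new ByteArrayInputStream(mbox.getBytes(StandardCharsets.UTF_8)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return delegate.components;
    }

    public static void main(String[] args) {
        String mbox = "From a@example.org\nSubject: one\n\nhello\n"
                + "From b@example.org\nSubject: two\n\nbye\n";
        for (Map<String, String> component : split(mbox)) {
            System.out.println(component);
        }
    }
}
```

With this shape the caller still makes a single parse() call on the
whole stream, so no outer while (is.available() > 0) loop is needed; the
per-message handling lives entirely in the delegate. How you would
actually register such a delegate with Tika depends on the version you
are using, so treat the wiring above as an assumption.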

[1] https://issues.apache.org/jira/browse/TIKA-252

BR,

Jukka Zitting
