Hi all,

I recently wrote an mbox parser for Tika, since I need that for my Bixo web crawler.

One issue I ran into: a single mbox file logically decomposes into multiple documents. I currently treat it as a single document and emit each message's headers as an XHTML <ul> list. But it would work better from the client's perspective if the metadata coming back from the parse() call could be used as expected, e.g. if DublinCore's SUBJECT, DATE, and CREATOR matched up with each email's subject, date, and author header fields.

Currently, though, the metadata can't be used this way, at least as far as I can see, because parse() is a single call that yields a single metadata instance for the whole file.

Has this been discussed previously? I'm curious because I've thought about changing my mbox parser to handle incremental calls to parse(), saving state in the context object being passed in. This would require a small change to how I call the parser, as it would then be a loop: while (is.available() > 0) { parser.parse(is, xxx); }
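
The incremental idea could be sketched roughly as follows. This is a minimal, self-contained illustration in plain Java, not Tika's API: the class MboxIncrementalSketch, the MboxState holder (standing in for state kept in the context object), and parseNext() are all hypothetical names. It assumes messages are separated by lines starting with "From " and ignores folded header lines. It also signals the end of input by returning null rather than checking is.available(), since available() only estimates how many bytes can be read without blocking and may return 0 before the stream is actually exhausted.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

public class MboxIncrementalSketch {

    /** Hypothetical holder for state carried between calls (stand-in for the context object). */
    static final class MboxState {
        final BufferedReader reader;
        String pendingFromLine; // a "From " separator already read while skipping the previous body

        MboxState(BufferedReader reader) {
            this.reader = reader;
        }
    }

    /**
     * Reads the next message and returns its headers keyed by lower-cased name,
     * or null when there are no more messages. Hypothetical method, not part of Tika.
     */
    static Map<String, String> parseNext(MboxState state) throws IOException {
        // Locate the "From " separator that starts the next message.
        String line = state.pendingFromLine;
        state.pendingFromLine = null;
        while (line == null || !line.startsWith("From ")) {
            line = state.reader.readLine();
            if (line == null) {
                return null; // end of input, no further message
            }
        }

        // Collect headers up to the first blank line.
        Map<String, String> headers = new LinkedHashMap<>();
        while ((line = state.reader.readLine()) != null && !line.isEmpty()) {
            if (Character.isWhitespace(line.charAt(0))) {
                continue; // folded continuation line, ignored in this sketch
            }
            int colon = line.indexOf(':');
            if (colon > 0) {
                String name = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
                headers.put(name, line.substring(colon + 1).trim());
            }
        }

        // Skip the body; remember the next separator so the following call can start there.
        while (line != null && (line = state.reader.readLine()) != null) {
            if (line.startsWith("From ")) {
                state.pendingFromLine = line;
                break;
            }
        }
        return headers;
    }

    public static void main(String[] args) throws IOException {
        String mbox = "From alice@example.com Mon Jan  4 10:00:00 2010\n"
                + "From: Alice\nSubject: First\nDate: Mon, 4 Jan 2010 10:00:00 +0000\n\nHello\n"
                + "From bob@example.com Tue Jan  5 11:00:00 2010\n"
                + "From: Bob\nSubject: Second\n\nBye\n";
        MboxState state = new MboxState(new BufferedReader(new StringReader(mbox)));

        // One call per message, each with its own metadata map.
        Map<String, String> headers;
        while ((headers = parseNext(state)) != null) {
            System.out.println(headers.get("subject") + " / " + headers.get("from"));
        }
    }
}
```

With something along these lines, each call would give the client one message's subject, date, and author, which could then be mapped onto the corresponding metadata fields per message.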

Thanks,

-- Ken

--------------------------
Ken Krugler
TransPac Software, Inc.
<http://www.transpac.com>
+1 530-210-6378
