Hi all,
I recently wrote an mbox parser for Tika, since I need that for my
Bixo web crawler.
One issue I ran into: a single mbox file logically decomposes into
multiple documents, one per message. I currently treat it as a single
document and emit each message's headers as an XHTML <ul> list. But
it would work better from the client's perspective if the metadata
returned by the parse() call could be used as expected, e.g. if
DublinCore's SUBJECT, DATE, and CREATOR matched up with each email's
subject, date and author header fields.
As far as I can see, though, the metadata can't currently be used this
way, because parse() is a single call that returns a single instance.
Has this been discussed previously? I ask because I've thought about
changing my mbox parser to handle incremental calls to parse(), saving
state in the context object being passed in. This would require a
small change to how I call the parser, as it would then be a loop
(while (is.available() > 0) { parser.parse(is, xxx); })
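
To make the idea concrete, here is a rough sketch of the "one call per
message, state kept between calls" pattern. It uses only the JDK, not
the Tika API: MboxMessageReader and its next() method are hypothetical
names made up for illustration, the object's own field stands in for
the context object, and the parsing is deliberately naive (it ignores
folded header lines and ">From " quoting).

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper (not part of Tika): yields one message's headers per call. */
class MboxMessageReader {
    private final BufferedReader reader;
    // State carried between calls: true once the "From " separator
    // of the next message has already been consumed.
    private boolean atMessageStart = false;

    MboxMessageReader(BufferedReader reader) {
        this.reader = reader;
    }

    /** Returns the header fields of the next message, or null at end of input. */
    Map<String, String> next() throws IOException {
        String line;
        // Skip ahead to a "From " separator unless the previous call consumed it.
        while (!atMessageStart) {
            line = reader.readLine();
            if (line == null) {
                return null;
            }
            atMessageStart = line.startsWith("From ");
        }
        atMessageStart = false;

        // Headers run until the first blank line; keys are lower-cased.
        Map<String, String> headers = new LinkedHashMap<>();
        while ((line = reader.readLine()) != null && !line.isEmpty()) {
            int colon = line.indexOf(':');
            if (colon > 0) {
                headers.put(line.substring(0, colon).trim().toLowerCase(Locale.ROOT),
                            line.substring(colon + 1).trim());
            }
        }

        // Consume the body up to the next separator (or end of input).
        while ((line = reader.readLine()) != null) {
            if (line.startsWith("From ")) {
                atMessageStart = true;
                break;
            }
        }
        return headers;
    }

    public static void main(String[] args) throws IOException {
        String mbox =
            "From alice@example.com Thu Jan  1 00:00:00 2009\n"
            + "Subject: first\nFrom: alice@example.com\n\nbody one\n"
            + "From bob@example.com Fri Jan  2 00:00:00 2009\n"
            + "Subject: second\nFrom: bob@example.com\n\nbody two\n";
        MboxMessageReader messages =
            new MboxMessageReader(new BufferedReader(new StringReader(mbox)));
        Map<String, String> headers;
        // The caller loops until the reader reports end of input.
        while ((headers = messages.next()) != null) {
            System.out.println(headers.get("subject") + " / " + headers.get("from"));
        }
    }
}
```

In this sketch the loop ends when next() returns null rather than on
is.available(), since available() may return 0 before the stream is
actually exhausted.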
Thanks,
-- Ken
--------------------------
Ken Krugler
TransPac Software, Inc.
<http://www.transpac.com>
+1 530-210-6378