Multiple documents per input stream

Ken Krugler Mon, 14 Sep 2009 06:55:37 -0700

Hi all,

I recently wrote an mbox parser for Tika, since I need that for my Bixo web crawler.

One issue I ran into - a single mbox file logically decomposes into multiple documents. I can and do currently treat it as a single document, where I use XHTML <ul> lists for each message's headers. But it would work better from the client perspective if the metadata being returned by the parse() call could be used as expected - e.g. DublinCore's SUBJECT, DATE, and CREATOR match up with each email's subject, date and author header fields.

Currently, though, the metadata can't be used this way, at least from what I see, because parse() is a single call that returns a single Metadata instance.

Has this been discussed previously? Just curious, as I'd thought about changing my mbox parser to handle incremental calls to parse(), and save state in the context object being passed in. This would require a small change to how I call the parser, as it would then be a loop (while (is.available() > 0) { parser.parse(is, xxx); }).
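
To make the incremental idea concrete, here is a minimal, self-contained sketch of the calling pattern. It does not use the Tika API: the class IncrementalMboxSketch and its hasNext()/parseNext() methods are hypothetical names invented for illustration, and a plain Map stands in for Tika's Metadata object. Each call consumes exactly one message and returns that message's subject, date, and from headers. The sketch assumes messages are separated by lines starting with "From " and ignores folded (multi-line) headers. It also tracks end-of-input itself rather than relying on is.available(), since InputStream.available() only estimates how many bytes can be read without blocking and can return 0 before the stream is exhausted.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical sketch: each parseNext() call consumes one mbox message. */
public class IncrementalMboxSketch {
    private final BufferedReader reader;
    // The "From " separator line of the next unread message, or null at end of input.
    private String pendingFromLine;

    public IncrementalMboxSketch(BufferedReader reader) throws IOException {
        this.reader = reader;
        // Skip any leading junk until the first "From " separator line.
        String line;
        while ((line = reader.readLine()) != null) {
            if (line.startsWith("From ")) {
                pendingFromLine = line;
                break;
            }
        }
    }

    /** True while at least one more message remains in the stream. */
    public boolean hasNext() {
        return pendingFromLine != null;
    }

    /** Reads one message and returns its subject/date/from headers (keys lowercased). */
    public Map<String, String> parseNext() throws IOException {
        if (pendingFromLine == null) {
            throw new IllegalStateException("no more messages");
        }
        pendingFromLine = null;
        Map<String, String> metadata = new LinkedHashMap<>();
        boolean inHeaders = true;
        String line;
        while ((line = reader.readLine()) != null) {
            if (line.startsWith("From ")) {
                // Start of the next message: remember it and stop here.
                pendingFromLine = line;
                break;
            }
            if (!inHeaders) {
                continue; // body text is skipped in this sketch
            }
            if (line.isEmpty()) {
                inHeaders = false; // a blank line ends the header block
                continue;
            }
            int colon = line.indexOf(':');
            if (colon > 0) {
                String name = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
                if (name.equals("subject") || name.equals("date") || name.equals("from")) {
                    metadata.put(name, line.substring(colon + 1).trim());
                }
            }
        }
        return metadata;
    }
}
```

With this shape the caller's loop becomes while (parser.hasNext()) { Map<String, String> md = parser.parseNext(); ... }, and each iteration sees metadata for exactly one email. In a real Tika parser the per-stream state held here in fields would have to live somewhere the parser can reach between calls, such as the context object mentioned above.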


Thanks,

-- Ken

--------------------------
Ken Krugler
TransPac Software, Inc.
<http://www.transpac.com>
+1 530-210-6378
