Rough cut of mbox parser
------------------------

                 Key: TIKA-295
                 URL: https://issues.apache.org/jira/browse/TIKA-295
             Project: Tika
          Issue Type: New Feature
    Affects Versions: 0.4
            Reporter: Ken Krugler


Attached is a patch for a first-cut at a parser that handles mailbox (.mbox, 
application/mbox) files.

* The first email headers are used to fill in metadata. Subsequent email 
headers are tossed.
* Charset handling needs to be fixed up. It's unclear (not spec'd) whether 
emails individually use the charset as specified in their individual header, or 
the entire file should be re-encoded (and the encoding is sent in the response 
header, or auto-detected).
* Multi-part emails won't be handled properly, though it's unclear what should 
be done in that case (if anything).


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Reply via email to