[ 
https://issues.apache.org/jira/browse/TIKA-295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jukka Zitting resolved TIKA-295.
--------------------------------

       Resolution: Fixed
    Fix Version/s: 0.5
         Assignee: Jukka Zitting

Nice work, thanks! I committed the patch in revision 820967, with tabs 
converted to spaces and a license header added to the test case.

For further work on this I would suggest using the Mime4J library [1] from 
Apache James, as they've already dealt with many of the questions you raise 
above.

I'm resolving this as Fixed as the basic feature is now there thanks to the 
patch. Please file additional issues on any future improvements.

[1] http://james.apache.org/mime4j/

> Rough cut of mbox parser
> ------------------------
>
>                 Key: TIKA-295
>                 URL: https://issues.apache.org/jira/browse/TIKA-295
>             Project: Tika
>          Issue Type: New Feature
>    Affects Versions: 0.4
>            Reporter: Ken Krugler
>            Assignee: Jukka Zitting
>             Fix For: 0.5
>
>         Attachments: tika-295.patch
>
>
> Attached is a patch for a first cut of a parser that handles mailbox (.mbox, 
> application/mbox) files.
> * The first email headers are used to fill in metadata. Subsequent email 
> headers are tossed.
> * Charset handling needs to be fixed up. It's unclear (not spec'd) whether 
> emails individually use the charset as specified in their individual header, 
> or the entire file should be re-encoded (and the encoding is sent in the 
> response header, or auto-detected).
> * Multi-part emails won't be handled properly, though it's unclear what 
> should be done in that case (if anything).
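
To illustrate the first bullet in the description above (headers of the first 
email fill in the metadata, later ones are dropped), here is a minimal sketch. 
It is not the code from tika-295.patch. The class name MboxFirstHeaders and 
the method firstMessageHeaders are hypothetical, and it assumes the common 
mbox convention that each message starts with a line beginning with "From ". 
Charset handling and multi-part messages are left out, matching the open 
questions in the description.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch only; not the parser committed for TIKA-295.
public class MboxFirstHeaders {

    // Returns the headers of the first message in mbox-formatted text.
    // Everything after the first header block is ignored.
    public static Map<String, String> firstMessageHeaders(String mbox) {
        Map<String, String> headers = new LinkedHashMap<>();
        boolean seenSeparator = false;
        String lastName = null;
        for (String line : mbox.split("\r?\n")) {
            if (!seenSeparator) {
                // Assumed convention: a message starts at a "From " line.
                seenSeparator = line.startsWith("From ");
                continue;
            }
            if (line.isEmpty()) {
                break; // a blank line ends the header block
            }
            if ((line.startsWith(" ") || line.startsWith("\t")) && lastName != null) {
                // Folded continuation of the previous header value.
                headers.put(lastName, headers.get(lastName) + " " + line.trim());
                continue;
            }
            int colon = line.indexOf(':');
            if (colon > 0) {
                // A repeated header name overwrites the earlier value.
                lastName = line.substring(0, colon).trim();
                headers.put(lastName, line.substring(colon + 1).trim());
            }
        }
        return headers;
    }

    public static void main(String[] args) {
        String mbox = "From alice@example.com Thu Oct  1 12:00:00 2009\n"
                + "Subject: First\n"
                + "From: alice@example.com\n"
                + "\n"
                + "Body one\n"
                + "From bob@example.com Thu Oct  1 12:05:00 2009\n"
                + "Subject: Second\n"
                + "\n"
                + "Body two\n";
        // Only the first message's headers are kept.
        System.out.println(firstMessageHeaders(mbox).get("Subject"));
    }
}
```

A real parser would map these values onto Tika metadata fields and would 
need to decide how to decode each message, which is the charset question 
raised in the description.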

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
