[ https://issues.apache.org/jira/browse/TIKA-295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting resolved TIKA-295.
--------------------------------

       Resolution: Fixed
    Fix Version/s: 0.5
         Assignee: Jukka Zitting

Nice work, thanks! I committed the patch (with tabs->spaces changes and an added license header for the test case) in revision 820967.

For further work on this I would suggest using the Mime4J library [1] from Apache James, as they've already dealt with many of the questions you raise above.

I'm resolving this as Fixed as the basic feature is now there thanks to the patch. Please file additional issues on any future improvements.

[1] http://james.apache.org/mime4j/

> Rough cut of mbox parser
> ------------------------
>
>                 Key: TIKA-295
>                 URL: https://issues.apache.org/jira/browse/TIKA-295
>             Project: Tika
>          Issue Type: New Feature
>    Affects Versions: 0.4
>            Reporter: Ken Krugler
>            Assignee: Jukka Zitting
>             Fix For: 0.5
>
>         Attachments: tika-295.patch
>
>
> Attached is a patch for a first-cut at a parser that handles mailbox (.mbox,
> application/mbox) files.
> * The first email headers are used to fill in metadata. Subsequent email
> headers are tossed.
> * Charset handling needs to be fixed up. It's unclear (not spec'd) whether
> emails individually use the charset as specified in their individual header,
> or the entire file should be re-encoded (and the encoding is sent in the
> response header, or auto-detected).
> * Multi-part emails won't be handled properly, though it's unclear what
> should be done in that case (if anything).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
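
For readers who want to see the behaviour listed in the issue description in concrete form, here is a minimal sketch in Python. It is not the code from tika-295.patch and does not use Tika's API; the function name parse_mbox and its return shape are made up for illustration. It assumes messages are separated by lines starting with "From " and that headers end at the first blank line, and, like the first-cut parser described above, it ignores charset handling and multi-part messages.

```python
from email.parser import HeaderParser


def parse_mbox(text):
    """Hypothetical helper: return (metadata, bodies) for mbox-style text.

    Only the first message's headers populate the metadata; headers of
    later messages are discarded, mirroring the issue description.
    """
    # Split into messages on "From " separator lines (simplifying assumption:
    # body lines that start with "From " are already escaped as ">From ").
    messages = []
    current = None
    for line in text.splitlines():
        if line.startswith("From "):
            current = []
            messages.append(current)
        elif current is not None:
            current.append(line)

    metadata = {}
    bodies = []
    for index, lines in enumerate(messages):
        # Headers run up to the first blank line; the rest is the body.
        try:
            split = lines.index("")
        except ValueError:
            split = len(lines)
        header_lines, body_lines = lines[:split], lines[split + 1:]

        if index == 0:
            headers = HeaderParser().parsestr("\n".join(header_lines))
            metadata = {name.lower(): value for name, value in headers.items()}
        # Headers of subsequent messages are intentionally tossed.
        bodies.append("\n".join(body_lines))

    return metadata, bodies
```

Calling parse_mbox on a file with two messages yields the first message's Subject and From as metadata, plus one body string per message. A production parser would more likely delegate header and MIME handling to a dedicated library, as the resolution comment suggests with Mime4J.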