[ https://issues.apache.org/jira/browse/TIKA-252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12757777#action_12757777 ]
Ken Krugler commented on TIKA-252:
----------------------------------

I've run into something similar. I recently wrote an mbox parser for Tika, since I need that for my Bixo web crawler. A single mbox file logically decomposes into multiple documents (one per email).

I can and do currently treat it as a single document, where I use XHTML <ul> lists for each message's headers. But it would work better from the client's perspective if the metadata returned by the parse() call could be used as expected, e.g. DublinCore's SUBJECT, DATE, and CREATOR would match up with each email's subject, date, and author header fields.

An alternative idea is to make the parse() API callable multiple times, where each call incrementally processes the input stream and returns a boolean indicating whether additional data remains. The parser becomes more complex, in that it would need to maintain some state (probably in the context param), but it would be a pretty minor change for the caller.

> PackageParser's XHTML should contain metadata of subfiles
> ---------------------------------------------------------
>
>                 Key: TIKA-252
>                 URL: https://issues.apache.org/jira/browse/TIKA-252
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 0.4
>            Reporter: Jonathan Koren
>            Priority: Minor
>
> Currently PackageParser only sets the Metadata based on the outermost file
> type. For instance, a gzipped tar containing PDFs will have
> Metadata.Content-Type set to application/gzip, and the MIME types of the
> internal files (the PDFs) will be lost.
> It would be nice if the metadata found when parsing the contained PDFs would
> be recoverable. Perhaps in a sequence like:
> <div class="metadata"><span
> class="METADATA-KEY">METADATA-VALUE</span>...</div> within the <div
> class="package-file">
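
For illustration, the "callable multiple times" idea from the comment could look roughly like the sketch below. This is not Tika's Parser API: the class IncrementalMboxSketch and its methods parseNext() and parseAll() are hypothetical names, the state lives in a field rather than in a context param, and the mbox handling is deliberately simplified (messages start at a line beginning with "From ", folded header lines are ignored). Headers are stored under their lowercased names; a real parser would presumably map them onto the DublinCore keys mentioned above.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical sketch of an incremental, one-message-per-call mbox parser. */
class IncrementalMboxSketch {
    private final BufferedReader in;
    // State kept between calls: a "From " separator line that was already read.
    private String pendingSeparator;

    IncrementalMboxSketch(BufferedReader in) {
        this.in = in;
    }

    /**
     * Parses the next message, filling metadata with its headers and body with its text.
     * Returns true if another message remains in the stream.
     */
    boolean parseNext(Map<String, String> metadata, StringBuilder body) throws IOException {
        String line = (pendingSeparator != null) ? pendingSeparator : in.readLine();
        pendingSeparator = null;
        // Skip anything before the first "From " separator.
        while (line != null && !line.startsWith("From ")) {
            line = in.readLine();
        }
        if (line == null) {
            return false; // no message left
        }
        boolean inHeaders = true;
        while ((line = in.readLine()) != null) {
            if (line.startsWith("From ")) {
                pendingSeparator = line; // start of the next message
                return true;
            }
            if (!inHeaders) {
                body.append(line).append('\n');
            } else if (line.isEmpty()) {
                inHeaders = false; // blank line ends the header block
            } else if (!Character.isWhitespace(line.charAt(0))) {
                int colon = line.indexOf(':');
                if (colon > 0) {
                    String name = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
                    metadata.put(name, line.substring(colon + 1).trim());
                }
            }
        }
        return false; // end of stream
    }

    /** Caller side: keep calling parseNext() until it reports no more data. */
    static List<Map<String, String>> parseAll(String mbox) {
        IncrementalMboxSketch parser =
                new IncrementalMboxSketch(new BufferedReader(new StringReader(mbox)));
        List<Map<String, String>> documents = new ArrayList<>();
        try {
            boolean more;
            do {
                Map<String, String> metadata = new LinkedHashMap<>();
                more = parser.parseNext(metadata, new StringBuilder());
                if (!metadata.isEmpty()) {
                    documents.add(metadata); // one metadata set per email
                }
            } while (more);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return documents;
    }
}
```

The caller's change is limited to the do/while loop in parseAll(): each pass gets a fresh metadata object, so per-message subject, date, and author fields no longer have to be folded into one document.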