[ 
https://issues.apache.org/jira/browse/TIKA-252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12757777#action_12757777
 ] 

Ken Krugler commented on TIKA-252:
----------------------------------

I've run into something similar. I recently wrote an mbox parser for Tika, since 
I need one for my Bixo web crawler.

A single mbox file logically decomposes into multiple documents (one per 
email). I can and do currently treat it as a single document, where I use XHTML 
<ul> lists for each message's headers. But it would work better from the client 
perspective if the metadata returned by the parse() call could be used as 
expected, e.g. with DublinCore's SUBJECT, DATE, and CREATOR matching each 
email's subject, date and author header fields.
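
To illustrate the "one document per email" decomposition, here is a minimal 
sketch in plain Java. It is not the Tika mbox parser: the class name 
MboxHeaderSplitter and its split() method are hypothetical, it assumes each 
message starts with an mbox "From " separator line, and it ignores folded 
(multi-line) headers and unescaped "From " lines inside message bodies.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper (not part of Tika): one header map per mbox message.
class MboxHeaderSplitter {

    static List<Map<String, String>> split(String mbox) {
        List<Map<String, String>> messages = new ArrayList<>();
        Map<String, String> current = null;
        boolean inHeaders = false;

        for (String line : mbox.split("\r?\n", -1)) {
            if (line.startsWith("From ")) {
                // An mbox "From " separator line starts a new message.
                current = new LinkedHashMap<>();
                messages.add(current);
                inHeaders = true;
            } else if (inHeaders && line.isEmpty()) {
                // The first blank line ends the header block; the body follows.
                inHeaders = false;
            } else if (inHeaders) {
                int colon = line.indexOf(':');
                if (colon > 0) {
                    // Store header names in lower case, e.g. "subject", "date", "from".
                    current.put(line.substring(0, colon).trim().toLowerCase(Locale.ROOT),
                                line.substring(colon + 1).trim());
                }
            }
        }
        return messages;
    }
}
```

Each returned map could then be copied into a separate per-message metadata 
object, which is what treating the whole mbox file as a single document 
cannot express.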

An alternative idea is that you could make the parse() API callable multiple 
times, where it incrementally processes the input stream, and returns a boolean 
for whether or not additional data remains. The parser becomes more complex, in 
that it would need to maintain some state (probably in the context param) but 
it would be a pretty minor change for the caller.
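
A minimal sketch of that repeated-call idea, in plain Java rather than 
against the real Tika Parser interface. Everything here is an assumption made 
for illustration: the names ParseState and IncrementalMboxSketch, the 
parseNext() signature, the use of an Iterator<String> in place of an input 
stream, and a plain Map in place of Tika's Metadata. ParseState stands in for 
the state that would be kept in the context param.

```java
import java.util.Iterator;
import java.util.Map;

// Hypothetical stand-in for the context parameter: state kept between calls.
class ParseState {
    // True when the previous call already consumed the next message's "From " line.
    boolean sawSeparator;
}

class IncrementalMboxSketch {

    /**
     * Fills metadata with the headers of the next message.
     * Returns true if another message remains in the input.
     */
    static boolean parseNext(Iterator<String> lines,
                             Map<String, String> metadata,
                             ParseState state) {
        metadata.clear();
        boolean started = state.sawSeparator;
        boolean inHeaders = started;
        state.sawSeparator = false;

        while (lines.hasNext()) {
            String line = lines.next();
            if (line.startsWith("From ")) {
                if (started) {
                    // This separator belongs to the next message: remember it and stop.
                    state.sawSeparator = true;
                    return true;
                }
                started = true;
                inHeaders = true;
            } else if (inHeaders && line.isEmpty()) {
                inHeaders = false; // blank line ends the headers
            } else if (inHeaders) {
                int colon = line.indexOf(':');
                if (colon > 0) {
                    metadata.put(line.substring(0, colon).trim(),
                                 line.substring(colon + 1).trim());
                }
            }
        }
        return false; // input exhausted
    }
}
```

The caller's change would then be a loop: call parseNext() with a fresh 
metadata map, handle the result, and repeat while the call returns true.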


> PackageParser's XHTML should contain metadata of subfiles
> ---------------------------------------------------------
>
>                 Key: TIKA-252
>                 URL: https://issues.apache.org/jira/browse/TIKA-252
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 0.4
>            Reporter: Jonathan Koren
>            Priority: Minor
>
> Currently PackageParser only sets the Metadata based on the outermost file 
> type.  For instance, a gzipped tar containing pdfs will have 
> Metadata.Content-Type set to application/gzip, and the mimetypes of the 
> internal files (the pdfs) will be lost.  
> It would be nice if the metadata found when parsing the contained pdfs would 
> be recoverable.  Perhaps in a sequence like:
> <div class="metadata"><span 
> class="METADATA-KEY">METADATA-VALUE</span>...</div> within the <div 
> class="package-file">

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
