[jira] [Commented] (TIKA-2478) RFC822 includes redundant copies of the text
[ https://issues.apache.org/jira/browse/TIKA-2478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16235659#comment-16235659 ]

Robert Letzler commented on TIKA-2478:
--------------------------------------

I am at a conference. I will respond to your message when I return.
Thanks!
-Rob

> RFC822 includes redundant copies of the text
> --------------------------------------------
>
> Key: TIKA-2478
> URL: https://issues.apache.org/jira/browse/TIKA-2478
> Project: Tika
> Issue Type: Bug
> Affects Versions: 1.16
> Reporter: Robert Letzler
> Assignee: Tim Allison
> Priority: Minor
> Fix For: 1.17
>
> Attachments: TIKA-2478.patch, UET6KCXR5FYIEJYKUCK2AKF3FLXTRNAT.eml, mixed-simple, mixed-with-pdf-inline
>
> MBOX messages often get parsed into four documents:
> a. The mbox file - outer container "/"
> b. The actual email -- "/embedded-1"
> c. The utf-8 text content of the email "/embedded-1/embedded-2"
> d. The utf-8 html content of the email "/embedded-1/embedded-3"
> entries C and D are redundant and distracting. The MSG parser parses the
> first non-null: email body and then it skips the rest. Please modify MBOX to
> not have separate "attached" documents for the html body and the text body.
> The attachment to https://issues.apache.org/jira/browse/TIKA-2471 is an
> example of input sufficient to generate this behavior.
> Thanks!

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
[jira] [Commented] (TIKA-2478) MBOX import includes redundant copies of the text
[ https://issues.apache.org/jira/browse/TIKA-2478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16208439#comment-16208439 ]

Robert Letzler commented on TIKA-2478:
--------------------------------------

Also, the current MBOX parser often puts the subject line in the "title" field of an embedded document that does not contain the text. It would be great to include the subject line, to, from, CC:, and date fields with the body in a single document.

> MBOX import includes redundant copies of the text
> -------------------------------------------------
>
> Key: TIKA-2478
> URL: https://issues.apache.org/jira/browse/TIKA-2478
> Project: Tika
> Issue Type: Bug
> Affects Versions: 1.16
> Reporter: Robert Letzler
> Priority: Minor
>
> MBOX messages often get parsed into four documents:
> a. The mbox file - outer container "/"
> b. The actual email -- "/embedded-1"
> c. The utf-8 text content of the email "/embedded-1/embedded-2"
> d. The utf-8 html content of the email "/embedded-1/embedded-3"
> entries C and D are redundant and distracting. The MSG parser parses the
> first non-null: email body and then it skips the rest. Please modify MBOX to
> not have separate "attached" documents for the html body and the text body.
> The attachment to https://issues.apache.org/jira/browse/TIKA-2471 is an
> example of input sufficient to generate this behavior.
> Thanks!
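The request in the comment above (subject, to, from, CC, and date kept together with the body in one document) can be sketched outside Tika. The following is only an illustration using Python's standard `email` package, not Tika's MBOX or RFC822 parser. `HEADER_FIELDS` and `flatten_message` are hypothetical names, and the preference for the plain-text body over the html body is an assumption.

```python
from email.message import EmailMessage

# Header fields to carry along with the body (hypothetical selection,
# taken from the fields named in the comment above).
HEADER_FIELDS = ("Subject", "From", "To", "Cc", "Date")


def flatten_message(msg: EmailMessage) -> str:
    """Return one text document: selected headers, a blank line, then the body."""
    header_lines = [
        f"{name}: {msg[name]}" for name in HEADER_FIELDS if msg[name] is not None
    ]
    # get_body() picks a single body part; preferring plain text is an assumption.
    body = msg.get_body(preferencelist=("plain", "html"))
    text = body.get_content() if body is not None else ""
    return "\n".join(header_lines) + "\n\n" + text


def sample_message() -> EmailMessage:
    """Build a small made-up message for demonstration."""
    msg = EmailMessage()
    msg["Subject"] = "Status report"
    msg["From"] = "alice@example.com"
    msg["To"] = "bob@example.com"
    msg.set_content("All good.\n")
    return msg


if __name__ == "__main__":
    print(flatten_message(sample_message()))
```

With this shape, the subject and addressing fields travel with the text instead of sitting in the metadata of a separate embedded document.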
[jira] [Created] (TIKA-2478) MBOX import includes redundant copies of the text
Robert Letzler created TIKA-2478:
---------------------------------

Summary: MBOX import includes redundant copies of the text
Key: TIKA-2478
URL: https://issues.apache.org/jira/browse/TIKA-2478
Project: Tika
Issue Type: Bug
Affects Versions: 1.16
Reporter: Robert Letzler
Priority: Minor

MBOX messages often get parsed into four documents:
a. The mbox file - outer container "/"
b. The actual email -- "/embedded-1"
c. The utf-8 text content of the email "/embedded-1/embedded-2"
d. The utf-8 html content of the email "/embedded-1/embedded-3"
entries C and D are redundant and distracting. The MSG parser parses the first non-null: email body and then it skips the rest. Please modify MBOX to not have separate "attached" documents for the html body and the text body.

The attachment to https://issues.apache.org/jira/browse/TIKA-2471 is an example of input sufficient to generate this behavior.

Thanks!
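To make the redundancy described above concrete, here is a minimal sketch using Python's standard `email` package rather than Tika itself. It builds a message with both a text and an html alternative, lists the body parts a naive walk would surface as separate documents, and then keeps only the first non-empty one, in the spirit of the MSG behaviour mentioned in the report. The function names and the sample message are hypothetical, and the order in which bodies are tried is an assumption, not Tika's actual rule.

```python
from email.message import EmailMessage


def build_sample(plain: str = "plain body\n") -> EmailMessage:
    """Build a made-up multipart/alternative message (text plus html)."""
    msg = EmailMessage()
    msg["Subject"] = "Hello"
    msg["From"] = "a@example.com"
    msg["To"] = "b@example.com"
    msg.set_content(plain)
    msg.add_alternative(
        "<html><body><p>html body</p></body></html>\n", subtype="html"
    )
    return msg


def body_parts(msg: EmailMessage):
    """Return (content_type, text) for every non-multipart text part.

    Treating each of these as its own embedded document reproduces
    the duplication described in the report (entries c and d).
    """
    parts = []
    for part in msg.walk():
        if part.is_multipart():
            continue
        if part.get_content_maintype() == "text":
            parts.append((part.get_content_type(), part.get_content()))
    return parts


def first_non_empty_body(msg: EmailMessage):
    """Return the first body part with non-blank text, or None.

    Assumption: parts are tried in the order they appear in the message.
    """
    for content_type, text in body_parts(msg):
        if text.strip():
            return content_type, text
    return None


if __name__ == "__main__":
    message = build_sample()
    print([ctype for ctype, _ in body_parts(message)])
    print(first_non_empty_body(message))
```

Selecting one body this way leaves a single document per email; whether text or html should win when both are present is a design choice the report leaves open.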