[jira] [Commented] (TIKA-2478) RFC822 includes redundant copies of the text

2017-11-02 Thread Robert Letzler (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16235659#comment-16235659
 ] 

Robert Letzler commented on TIKA-2478:
--

I am at a conference.  I will respond to your message when I return.

Thanks!

-Rob



> RFC822 includes redundant copies of the text
> 
>
> Key: TIKA-2478
> URL: https://issues.apache.org/jira/browse/TIKA-2478
> Project: Tika
> Issue Type: Bug
> Affects Versions: 1.16
> Reporter: Robert Letzler
> Assignee: Tim Allison
> Priority: Minor
> Fix For: 1.17
>
> Attachments: TIKA-2478.patch, UET6KCXR5FYIEJYKUCK2AKF3FLXTRNAT.eml, 
> mixed-simple, mixed-with-pdf-inline
>
>
> MBOX messages often get parsed into four documents:
> a. The mbox file - outer container "/"
> b. The actual email - "/embedded-1"
> c. The UTF-8 text content of the email "/embedded-1/embedded-2"
> d. The UTF-8 HTML content of the email "/embedded-1/embedded-3"
> Entries c and d are redundant and distracting.  The MSG parser parses the 
> first non-null email body and then skips the rest.  Please modify MBOX to 
> not have separate "attached" documents for the HTML body and the text body.
> The attachment to https://issues.apache.org/jira/browse/TIKA-2471 is an 
> example of input sufficient to generate this behavior.
> Thanks!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (TIKA-2478) MBOX import includes redundant copies of the text

2017-10-17 Thread Robert Letzler (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16208439#comment-16208439
 ] 

Robert Letzler commented on TIKA-2478:
--

Also, the current MBOX parser often puts the subject line in the "title" field 
of an embedded document that does not contain the text.  It would be great to 
include the Subject, To, From, CC, and Date fields with the body in a 
single document.
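
To illustrate the single-document idea above, here is a minimal sketch that uses 
Python's standard `email` package rather than Tika itself.  The helper names 
(`build_message`, `flatten`), the sample addresses, and the record layout are 
hypothetical and only show one way headers and body could be kept together; 
this is not how the Tika parser is implemented.

```python
from email.message import EmailMessage

# Header fields requested in the comment above.
HEADER_FIELDS = ("Subject", "From", "To", "Cc", "Date")


def build_message():
    """Build a small multipart/alternative message (made-up sample data)."""
    msg = EmailMessage()
    msg["Subject"] = "Status update"
    msg["From"] = "alice@example.com"
    msg["To"] = "bob@example.com"
    msg.set_content("Hello in plain text.\n")
    msg.add_alternative("<p>Hello in HTML.</p>\n", subtype="html")
    return msg


def flatten(msg):
    """Return one record holding the selected headers plus a single body."""
    record = {
        name.lower(): str(msg[name])
        for name in HEADER_FIELDS
        if msg[name] is not None  # skip headers the message does not carry
    }
    # Prefer the plain-text alternative and fall back to HTML.
    body = msg.get_body(preferencelist=("plain", "html"))
    record["body"] = body.get_content() if body is not None else ""
    return record


if __name__ == "__main__":
    print(flatten(build_message()))
```

The point of the sketch is only that the subject and address fields travel in 
the same record as the body text, instead of living on a separate embedded 
document.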

> MBOX import includes redundant copies of the text
> -
>
> Key: TIKA-2478
> URL: https://issues.apache.org/jira/browse/TIKA-2478
> Project: Tika
> Issue Type: Bug
> Affects Versions: 1.16
> Reporter: Robert Letzler
> Priority: Minor
>
> MBOX messages often get parsed into four documents:
> a. The mbox file - outer container "/"
> b. The actual email - "/embedded-1"
> c. The UTF-8 text content of the email "/embedded-1/embedded-2"
> d. The UTF-8 HTML content of the email "/embedded-1/embedded-3"
> Entries c and d are redundant and distracting.  The MSG parser parses the 
> first non-null email body and then skips the rest.  Please modify MBOX to 
> not have separate "attached" documents for the HTML body and the text body.
> The attachment to https://issues.apache.org/jira/browse/TIKA-2471 is an 
> example of input sufficient to generate this behavior.
> Thanks!





[jira] [Created] (TIKA-2478) MBOX import includes redundant copies of the text

2017-10-16 Thread Robert Letzler (JIRA)
Robert Letzler created TIKA-2478:


Summary: MBOX import includes redundant copies of the text
Key: TIKA-2478
URL: https://issues.apache.org/jira/browse/TIKA-2478
Project: Tika
Issue Type: Bug
Affects Versions: 1.16
Reporter: Robert Letzler
Priority: Minor


MBOX messages often get parsed into four documents:
a.  The mbox file - outer container "/"
b.  The actual email - "/embedded-1"
c.  The UTF-8 text content of the email "/embedded-1/embedded-2"
d.  The UTF-8 HTML content of the email "/embedded-1/embedded-3"

Entries c and d are redundant and distracting.  The MSG parser parses the first 
non-null email body and then skips the rest.  Please modify MBOX to not 
have separate "attached" documents for the HTML body and the text body.
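
The "first non-null body" behaviour described above can be sketched as follows. 
This is an illustration in Python's standard `email` package, not Tika code; 
the function names and the sample message are assumptions made for the example. 
It shows that a typical message carries the same text twice (once as 
text/plain, once as text/html) and how a parser could keep only the first 
non-empty one instead of emitting both as separate documents.

```python
from email.message import EmailMessage


def build_sample():
    """Build a multipart/alternative message with a text and an HTML body."""
    msg = EmailMessage()
    msg["Subject"] = "Status update"
    msg.set_content("Hello in plain text.\n")
    msg.add_alternative("<p>Hello in HTML.</p>\n", subtype="html")
    return msg


def leaf_types(msg):
    """List the content types of all non-container parts, in order."""
    return [p.get_content_type() for p in msg.walk() if not p.is_multipart()]


def first_non_empty_body(msg):
    """Return (content_type, text) of the first non-empty text part."""
    for part in msg.walk():
        if part.is_multipart():
            continue  # containers hold no text of their own
        if part.get_content_maintype() != "text":
            continue  # ignore images, PDFs, and other binary parts
        if part.get_content_disposition() == "attachment":
            continue  # a real attachment is not an alternative body
        text = part.get_content()
        if text.strip():
            return part.get_content_type(), text
    return None, ""


if __name__ == "__main__":
    sample = build_sample()
    print(leaf_types(sample))
    print(first_non_empty_body(sample)[0])
```

With this approach the HTML alternative is simply never reported once a usable 
plain-text body has been found, which mirrors the request to stop producing 
the extra "/embedded-1/embedded-2" and "/embedded-1/embedded-3" entries.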

The attachment to https://issues.apache.org/jira/browse/TIKA-2471 is an example 
of input sufficient to generate this behavior.

Thanks!




