[jira] [Commented] (TIKA-2122) Extract all email headers from Outlook .msg files into Metadata

2016-10-19 Thread Chris Knott (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15590005#comment-15590005
 ] 

Chris Knott commented on TIKA-2122:
---

Wow, thanks! Very fast turnaround.

> Extract all email headers from Outlook .msg files into Metadata
> ---
>
>                 Key: TIKA-2122
>                 URL: https://issues.apache.org/jira/browse/TIKA-2122
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 1.13
>            Reporter: Chris Knott
>            Priority: Minor
>             Fix For: 2.0, 1.14
>
>         Attachments: msg_raw_headers.xlsx
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Currently most email headers are not added to the Metadata when extracting 
> Outlook .msg files.
> http://svn.apache.org/repos/asf/tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/OutlookExtractor.java
> The headers - {{msg.getHeaders()}} - are already being looped through as a 
> way to estimate the date.
> All headers should be added to Metadata, using the name of the header with a 
> prefix such as {{"raw-header:"}}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-2122) Extract all email headers from Outlook .msg files into Metadata

2016-10-17 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15583223#comment-15583223
 ] 

Hudson commented on TIKA-2122:
--

SUCCESS: Integrated in Jenkins build Tika-trunk #1117 (See [https://builds.apache.org/job/Tika-trunk/1117/])
TIKA-2122 : add all headers from MSG and RFC822 files (tallison: rev 8e819c3caf3ff3b0492f600b4193d1b3ee74f51b)
* (edit) tika-core/src/test/java/org/apache/tika/TikaTest.java
* (edit) tika-parsers/src/test/java/org/apache/tika/parser/microsoft/OutlookParserTest.java
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/OutlookExtractor.java
* (edit) tika-core/src/main/java/org/apache/tika/metadata/Message.java
* (edit) tika-parsers/src/test/java/org/apache/tika/parser/mail/RFC822ParserTest.java
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/mail/MailContentHandler.java




[jira] [Commented] (TIKA-2122) Extract all email headers from Outlook .msg files into Metadata

2016-10-17 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15583175#comment-15583175
 ] 

Hudson commented on TIKA-2122:
--

FAILURE: Integrated in Jenkins build tika-2.x-windows #63 (See [https://builds.apache.org/job/tika-2.x-windows/63/])
TIKA-2122: Extract all headers from MSG/RFC822 (tallison: rev 30e03de89fd4b21cb91917c72aec12eede761be3)
* (edit) tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/microsoft/OutlookExtractor.java
* (edit) tika-parser-modules/tika-parser-web-module/pom.xml
* (edit) tika-parser-modules/tika-parser-web-module/src/main/java/org/apache/tika/parser/mail/MailContentHandler.java
* (edit) tika-parser-modules/tika-parser-office-module/pom.xml
* (edit) tika-parser-bundles/tika-parser-office-bundle/pom.xml
* (edit) CHANGES.txt
* (edit) tika-core/src/main/java/org/apache/tika/metadata/Message.java
* (edit) tika-parser-modules/pom.xml
* (edit) tika-core/src/test/java/org/apache/tika/TikaTest.java
* (edit) tika-parser-modules/tika-parser-office-module/src/test/java/org/apache/tika/parser/microsoft/OutlookParserTest.java
* (edit) tika-parser-modules/tika-parser-web-module/src/test/java/org/apache/tika/parser/mail/RFC822ParserTest.java




[jira] [Commented] (TIKA-2122) Extract all email headers from Outlook .msg files into Metadata

2016-10-17 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15582236#comment-15582236
 ] 

Tim Allison commented on TIKA-2122:
---

Er, how about {{mail:raw-header:}}?



[jira] [Commented] (TIKA-2122) Extract all email headers from Outlook .msg files into Metadata

2016-10-17 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15581955#comment-15581955
 ] 

Tim Allison commented on TIKA-2122:
---

We'll also have to start adding handling for encoding in headers:

{noformat}
H: From: =?iso-8859-1?Q?L'=C9quipe_Microsoft_Outlook_Express?= 

H: To: "Nouvel utilisateur de Outlook Express"
H: Subject: Microsoft Outlook Express 6
H: Date: Thu, 5 Apr 2007 09:26:06 -0700
H: MIME-Version: 1.0
H: Content-Type: text/html;
H:  charset="iso-8859-1"
H: Content-Transfer-Encoding: quoted-printable
H: X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2900.3028
{noformat}
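
A minimal sketch of what that decoding could look like, assuming the encoded parts follow the RFC 2047 "encoded-word" form ({{=?charset?Q-or-B?text?=}}) seen in the From header above. {{EncodedWordDecoder}} is a hypothetical helper written for illustration only; it is not an existing Tika or POI class, and it does not handle every corner of the RFC (for example, whitespace between adjacent encoded words is left alone, and an unknown charset name will throw).

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.util.Base64;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of Tika or POI.
class EncodedWordDecoder {

    // Matches =?charset?encoding?text?= where encoding is Q or B.
    private static final Pattern ENCODED_WORD =
            Pattern.compile("=\\?([^?]+)\\?([bBqQ])\\?([^?]*)\\?=");

    // Replaces every encoded word in the header value with its decoded text.
    static String decode(String header) {
        Matcher m = ENCODED_WORD.matcher(header);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            Charset charset = Charset.forName(m.group(1));
            byte[] bytes = "B".equalsIgnoreCase(m.group(2))
                    ? Base64.getDecoder().decode(m.group(3))
                    : decodeQ(m.group(3));
            m.appendReplacement(out,
                    Matcher.quoteReplacement(new String(bytes, charset)));
        }
        m.appendTail(out);
        return out.toString();
    }

    // "Q" encoding: '_' stands for a space, "=XX" is a byte given in hex.
    private static byte[] decodeQ(String text) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '_') {
                bytes.write(' ');
            } else if (c == '=' && i + 2 < text.length()) {
                int hi = Character.digit(text.charAt(i + 1), 16);
                int lo = Character.digit(text.charAt(i + 2), 16);
                if (hi >= 0 && lo >= 0) {
                    bytes.write((hi << 4) | lo);
                    i += 2;
                } else {
                    bytes.write(c); // malformed escape: keep the '=' as-is
                }
            } else {
                bytes.write(c);
            }
        }
        return bytes.toByteArray();
    }
}
```

Passing the encoded part of the From value above to {{EncodedWordDecoder.decode(...)}} should give back "L'Équipe Microsoft Outlook Express", since byte 0xC9 is "É" in ISO-8859-1.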



[jira] [Commented] (TIKA-2122) Extract all email headers from Outlook .msg files into Metadata

2016-10-17 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15581946#comment-15581946
 ] 

Tim Allison commented on TIKA-2122:
---

Yes, I think this is a really good idea, with a prefix, partly because it will 
expose areas for further work in .msg and, as [~gagravarr] pointed out, we 
still need some volunteer energy on other properties within .msg.

I suspect that folks interested in forensics would want both the raw headers 
and the other properties we might eventually pull out.

For now, how about {{raw-email-header:}}?

As an example of "areas for further work", it looks like POI is breaking 
headers on newlines or semicolons. Here is the output for one of our current 
test files, with "H:" prepended to each header that comes back:

{noformat}
H: Microsoft Mail Internet Headers Version 2.0
H: Received: from hq-ex3fe3.ptcnet.ptc.com ([132.253.201.67]) by HQ-MAIL3.ptcnet.ptc.com with Microsoft SMTPSVC(6.0.3790.3959);
H:   Thu, 29 Jan 2009 14:17:10 -0500
H: Received: from irp1.ptc.com ([12.11.148.83]) by hq-ex3fe3.ptcnet.ptc.com with Microsoft SMTPSVC(6.0.3790.3959);
H:   Thu, 29 Jan 2009 14:17:10 -0500
H: X-IronPort-Anti-Spam-Filtered: true
H: X-IronPort-Anti-Spam-Result: AskBALePgUmM0wsCk2dsb2JhbACMeYZdPwEBAQEJCQoJEQWpcoEDjWwBAwEDhA0G
H: X-IronPort-AV: E=Sophos;i="4.37,346,1231131600"; 
H:d="scan'208";a="51369639"
{noformat}
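
One way to put those pieces back together is sketched below. It assumes the raw header lines arrive as a {{String[]}} and that a line starting with a space or tab continues the previous header (the usual header-folding convention). {{HeaderUnfolder}} is a hypothetical name used only for this illustration, and the sketch collapses the leading whitespace of each continuation to a single space rather than preserving it exactly.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of Tika or POI.
class HeaderUnfolder {

    // Joins continuation lines (those starting with a space or tab)
    // onto the header line that precedes them.
    static List<String> unfold(String[] rawLines) {
        List<String> headers = new ArrayList<>();
        for (String line : rawLines) {
            boolean continuation = !line.isEmpty()
                    && (line.charAt(0) == ' ' || line.charAt(0) == '\t');
            if (continuation && !headers.isEmpty()) {
                int last = headers.size() - 1;
                headers.set(last, headers.get(last) + " " + line.trim());
            } else {
                // Start of a new header (or a stray first line).
                headers.add(line);
            }
        }
        return headers;
    }
}
```

With this, the two X-IronPort-AV pieces above would come back as a single header value, and each Received header would be rejoined with its date.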



[jira] [Commented] (TIKA-2122) Extract all email headers from Outlook .msg files into Metadata

2016-10-16 Thread Chris Knott (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15580045#comment-15580045
 ] 

Chris Knott commented on TIKA-2122:
---

Sorry, I am not particularly familiar with Tika or POI; I just needed this 
feature for a current project. What do you mean by HMEF?

My use case is extracting custom headers that start with "x-". I presume 
there's never going to be a way to do this properly, because the headers 
could be anything.

How about extracting just the headers that start with "x-" and prepending 
them with "custom-email-header:" or something?

---

On another note, what's the easiest workaround for this at the moment?
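
To make the "x-" suggestion concrete, here is a rough sketch, assuming the headers are available as plain "Name: value" strings. {{CustomHeaderCollector}} is a hypothetical name, and the plain {{Map}} only stands in for wherever the values would really be stored; none of this is existing Tika code.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper for illustration; not part of Tika or POI.
class CustomHeaderCollector {

    // Prefix taken from the suggestion above; the final name is undecided.
    static final String PREFIX = "custom-email-header:";

    // Keeps only headers whose name starts with "x-" (case-insensitive)
    // and stores their values under PREFIX + header name.
    static Map<String, List<String>> collect(String[] headerLines) {
        Map<String, List<String>> out = new LinkedHashMap<>();
        for (String line : headerLines) {
            int colon = line.indexOf(':');
            if (colon <= 0) {
                continue; // not a "Name: value" line
            }
            String name = line.substring(0, colon).trim();
            if (!name.toLowerCase(Locale.ROOT).startsWith("x-")) {
                continue;
            }
            String value = line.substring(colon + 1).trim();
            // A header may repeat, so each key maps to a list of values.
            out.computeIfAbsent(PREFIX + name, k -> new ArrayList<>()).add(value);
        }
        return out;
    }
}
```

Each key maps to a list because the same header name can appear more than once in a message.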



[jira] [Commented] (TIKA-2122) Extract all email headers from Outlook .msg files into Metadata

2016-10-16 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15579692#comment-15579692
 ] 

Nick Burch commented on TIKA-2122:
--

I'm not sure we want to be dumping these raw into the Tika metadata. Maybe 
we could do it with a prefix, though? (We would probably want to sync that up 
with the RFC822 and MBox parsers for consistency.)

Also note that HMEF doesn't currently pull out all the possible properties from 
the MSG level (support for fixed-length properties is incomplete and in need of 
volunteer energy), so there may be more bits of metadata we could get from the 
MSG file "properly", which may negate some of the need for this. (Pending 
suitable POI work!)
