[jira] [Commented] (TIKA-2122) Extract all email headers from Outlook .msg files into Metadata
[ https://issues.apache.org/jira/browse/TIKA-2122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15590005#comment-15590005 ]

Chris Knott commented on TIKA-2122:
---

Wow, thanks! Very fast turnaround.

> Extract all email headers from Outlook .msg files into Metadata
> ---
>
> Key: TIKA-2122
> URL: https://issues.apache.org/jira/browse/TIKA-2122
> Project: Tika
> Issue Type: Improvement
> Components: parser
> Affects Versions: 1.13
> Reporter: Chris Knott
> Priority: Minor
> Fix For: 2.0, 1.14
>
> Attachments: msg_raw_headers.xlsx
>
> Original Estimate: 24h
> Remaining Estimate: 24h
>
> Currently most email headers are not added to the Metadata when extracting Outlook .msg files.
> http://svn.apache.org/repos/asf/tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/OutlookExtractor.java
> The headers - {{msg.getHeaders()}} - are already being looped through as a way to estimate the date.
> All headers should be added to Metadata, using the name of the header with a prefix such as {{"raw-header:"}}

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
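For readers following the request in the issue description, here is a minimal sketch of what "add every header under a prefix" could look like. It is not the committed Tika change. It assumes the headers arrive as an array of "Name: value" strings, and it uses a plain multi-valued map as a stand-in for Tika's Metadata object. The class name, the `toMetadata` method and the `PREFIX` constant are hypothetical; the prefix value is just the one suggested in the issue text.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class RawHeaderSketch {
    // Hypothetical prefix, taken from the suggestion in the issue description.
    static final String PREFIX = "raw-header:";

    /**
     * Splits each "Name: value" string at its first colon and stores the
     * value under PREFIX + Name. The map stands in for a metadata object
     * that allows several values per key (e.g. repeated Received headers).
     */
    static Map<String, List<String>> toMetadata(String[] headers) {
        Map<String, List<String>> metadata = new LinkedHashMap<>();
        if (headers == null) {
            return metadata;
        }
        for (String header : headers) {
            int colon = header.indexOf(':');
            if (colon <= 0) {
                continue; // no field name, e.g. a banner line
            }
            String name = header.substring(0, colon);
            // Field names contain no whitespace; anything else is treated
            // as a stray continuation fragment and skipped.
            if (name.chars().anyMatch(Character::isWhitespace)) {
                continue;
            }
            String value = header.substring(colon + 1).trim();
            metadata.computeIfAbsent(PREFIX + name, k -> new ArrayList<>()).add(value);
        }
        return metadata;
    }

    public static void main(String[] args) {
        String[] headers = {
            "Microsoft Mail Internet Headers Version 2.0",
            "X-IronPort-Anti-Spam-Filtered: true",
            "MIME-Version: 1.0"
        };
        System.out.println(toMetadata(headers));
    }
}
```

Keeping the original header name after the prefix means custom headers need no per-header mapping, at the cost of metadata keys that are not known in advance.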
[jira] [Commented] (TIKA-2122) Extract all email headers from Outlook .msg files into Metadata
[ https://issues.apache.org/jira/browse/TIKA-2122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15583223#comment-15583223 ]

Hudson commented on TIKA-2122:
---

SUCCESS: Integrated in Jenkins build Tika-trunk #1117 (See [https://builds.apache.org/job/Tika-trunk/1117/])
TIKA-2122 : add all headers from MSG and RFC822 files (tallison: rev 8e819c3caf3ff3b0492f600b4193d1b3ee74f51b)
* (edit) tika-core/src/test/java/org/apache/tika/TikaTest.java
* (edit) tika-parsers/src/test/java/org/apache/tika/parser/microsoft/OutlookParserTest.java
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/OutlookExtractor.java
* (edit) tika-core/src/main/java/org/apache/tika/metadata/Message.java
* (edit) tika-parsers/src/test/java/org/apache/tika/parser/mail/RFC822ParserTest.java
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/mail/MailContentHandler.java
[jira] [Commented] (TIKA-2122) Extract all email headers from Outlook .msg files into Metadata
[ https://issues.apache.org/jira/browse/TIKA-2122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15583175#comment-15583175 ]

Hudson commented on TIKA-2122:
---

FAILURE: Integrated in Jenkins build tika-2.x-windows #63 (See [https://builds.apache.org/job/tika-2.x-windows/63/])
TIKA-2122: Extract all headers from MSG/RFC822 (tallison: rev 30e03de89fd4b21cb91917c72aec12eede761be3)
* (edit) tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/microsoft/OutlookExtractor.java
* (edit) tika-parser-modules/tika-parser-web-module/pom.xml
* (edit) tika-parser-modules/tika-parser-web-module/src/main/java/org/apache/tika/parser/mail/MailContentHandler.java
* (edit) tika-parser-modules/tika-parser-office-module/pom.xml
* (edit) tika-parser-bundles/tika-parser-office-bundle/pom.xml
* (edit) CHANGES.txt
* (edit) tika-core/src/main/java/org/apache/tika/metadata/Message.java
* (edit) tika-parser-modules/pom.xml
* (edit) tika-core/src/test/java/org/apache/tika/TikaTest.java
* (edit) tika-parser-modules/tika-parser-office-module/src/test/java/org/apache/tika/parser/microsoft/OutlookParserTest.java
* (edit) tika-parser-modules/tika-parser-web-module/src/test/java/org/apache/tika/parser/mail/RFC822ParserTest.java
[jira] [Commented] (TIKA-2122) Extract all email headers from Outlook .msg files into Metadata
[ https://issues.apache.org/jira/browse/TIKA-2122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15582236#comment-15582236 ]

Tim Allison commented on TIKA-2122:
---

Er, how about {{mail:raw-header:}}?
[jira] [Commented] (TIKA-2122) Extract all email headers from Outlook .msg files into Metadata
[ https://issues.apache.org/jira/browse/TIKA-2122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15581955#comment-15581955 ]

Tim Allison commented on TIKA-2122:
---

We'll also have to start adding handling for encoding in headers:

{noformat}
H: From: =?iso-8859-1?Q?L'=C9quipe_Microsoft_Outlook_Express?=
H: To: "Nouvel utilisateur de Outlook Express"
H: Subject: Microsoft Outlook Express 6
H: Date: Thu, 5 Apr 2007 09:26:06 -0700
H: MIME-Version: 1.0
H: Content-Type: text/html;
H: charset="iso-8859-1"
H: Content-Transfer-Encoding: quoted-printable
H: X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2900.3028
{noformat}
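The From value above is an RFC 2047 "encoded-word" (`=?charset?encoding?text?=`). As an illustration of the kind of handling this would need, here is a rough stand-alone decoder using only the JDK. It is a sketch, not what Tika or POI does: the class and method names are hypothetical, it does not drop whitespace between adjacent encoded-words, it does not accept RFC 2231 language suffixes, and an unknown charset name will throw.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.util.Base64;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class EncodedWordSketch {
    // =?charset?encoding?encoded-text?= where encoding is B (base64) or Q.
    private static final Pattern ENCODED_WORD =
            Pattern.compile("=\\?([^?\\s]+)\\?([bBqQ])\\?([^?\\s]*)\\?=");

    /** Replaces every encoded-word in the value with its decoded text. */
    static String decode(String value) {
        Matcher m = ENCODED_WORD.matcher(value);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            Charset charset = Charset.forName(m.group(1));
            byte[] bytes = m.group(2).equalsIgnoreCase("B")
                    ? Base64.getDecoder().decode(m.group(3))
                    : decodeQ(m.group(3));
            String decoded = new String(bytes, charset);
            m.appendReplacement(out, Matcher.quoteReplacement(decoded));
        }
        m.appendTail(out);
        return out.toString();
    }

    /** Q encoding: '_' means space, "=XX" is a hex-encoded byte. */
    private static byte[] decodeQ(String text) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '_') {
                bytes.write(' ');
            } else if (c == '=' && i + 2 < text.length() + 0
                    && isHex(text.charAt(i + 1)) && isHex(text.charAt(i + 2))) {
                bytes.write(Integer.parseInt(text.substring(i + 1, i + 3), 16));
                i += 2;
            } else {
                bytes.write(c); // plain ASCII character
            }
        }
        return bytes.toByteArray();
    }

    private static boolean isHex(char c) {
        return Character.digit(c, 16) >= 0;
    }

    public static void main(String[] args) {
        System.out.println(decode("=?iso-8859-1?Q?L'=C9quipe_Microsoft_Outlook_Express?="));
    }
}
```

For the From header in the sample, the byte 0xC9 is É in ISO-8859-1, so the decoded value is "L'Équipe Microsoft Outlook Express".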
[jira] [Commented] (TIKA-2122) Extract all email headers from Outlook .msg files into Metadata
[ https://issues.apache.org/jira/browse/TIKA-2122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15581946#comment-15581946 ]

Tim Allison commented on TIKA-2122:
---

Y, I think this is a really good idea with a prefix -- partly because it will expose areas for further work in .msg, and as [~gagravarr] pointed out, we still need some volunteer energy on other properties within .msg. I suspect that folks interested in forensics would want both the raw headers and the other properties we might eventually pull out.

For now, how about {{raw-email-header:}}?

As an example of "areas for further work", it looks like POI is breaking headers on new lines or semi-colons? On one of our current test files, I've prepended each header with "H:":

{noformat}
H: Microsoft Mail Internet Headers Version 2.0
H: Received: from hq-ex3fe3.ptcnet.ptc.com ([132.253.201.67]) by HQ-MAIL3.ptcnet.ptc.com with Microsoft SMTPSVC(6.0.3790.3959);
H: Thu, 29 Jan 2009 14:17:10 -0500
H: Received: from irp1.ptc.com ([12.11.148.83]) by hq-ex3fe3.ptcnet.ptc.com with Microsoft SMTPSVC(6.0.3790.3959);
H: Thu, 29 Jan 2009 14:17:10 -0500
H: X-IronPort-Anti-Spam-Filtered: true
H: X-IronPort-Anti-Spam-Result: AskBALePgUmM0wsCk2dsb2JhbACMeYZdPwEBAQEJCQoJEQWpcoEDjWwBAwEDhA0G
H: X-IronPort-AV: E=Sophos;i="4.37,346,1231131600";
H:d="scan'208";a="51369639"
{noformat}
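One possible way to cope with the split headers shown in that sample is to rejoin the fragments before splitting name from value. The sketch below is a heuristic under stated assumptions, not POI's or Tika's behaviour: a piece is treated as the start of a new header only if it begins with a run of printable, non-space, non-colon characters followed by a colon (the field-name shape from RFC 5322); anything else is appended to the previous header. The class and method names are hypothetical, and a continuation fragment that happens to look like `token:` at its start would be misread as a new header.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

final class HeaderUnfoldSketch {
    // Field name: printable ASCII without space or colon, then a colon.
    private static final Pattern FIELD_START =
            Pattern.compile("^[\\x21-\\x39\\x3B-\\x7E]+:");

    /** Joins continuation fragments onto the header they belong to. */
    static List<String> unfold(String[] pieces) {
        List<String> headers = new ArrayList<>();
        for (String piece : pieces) {
            String trimmed = piece.trim();
            if (trimmed.isEmpty()) {
                continue; // ignore blank fragments
            }
            // Test the untrimmed piece so that leading whitespace
            // (the classic folding marker) also counts as a continuation.
            boolean startsHeader = FIELD_START.matcher(piece).find();
            if (startsHeader || headers.isEmpty()) {
                headers.add(trimmed);
            } else {
                int last = headers.size() - 1;
                headers.set(last, headers.get(last) + " " + trimmed);
            }
        }
        return headers;
    }

    public static void main(String[] args) {
        String[] pieces = {
            "X-IronPort-AV: E=Sophos;i=\"4.37,346,1231131600\";",
            "d=\"scan'208\";a=\"51369639\""
        };
        for (String header : unfold(pieces)) {
            System.out.println(header);
        }
    }
}
```

With the fragments rejoined, the date that follows each Received line stays attached to its header instead of showing up as a separate entry.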
[jira] [Commented] (TIKA-2122) Extract all email headers from Outlook .msg files into Metadata
[ https://issues.apache.org/jira/browse/TIKA-2122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15580045#comment-15580045 ]

Chris Knott commented on TIKA-2122:
---

Sorry, I am not particularly familiar with Tika or POI, just needed this feature for a current project - what do you mean by HMEF?

My use case is needing to extract custom headers which start with "x-" - there's never going to be a way to do this properly I presume, because the headers could be anything. How about extracting just headers that start "x-" and prepending them with "custom-email-header:" or something?

---

On another note, what's the easiest workaround for this at the moment?
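The narrower proposal in this comment (only headers whose name starts with "x-", stored under a "custom-email-header:" prefix) could be sketched as below. This only illustrates the suggestion and is not an existing Tika feature: the class and method names are hypothetical, the prefix is the one floated in the comment, and a plain map stands in for Tika's Metadata. The name match is case-insensitive because header field names are not case-sensitive.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class CustomHeaderSketch {
    // Hypothetical prefix, as floated in the comment above.
    static final String PREFIX = "custom-email-header:";

    /** Keeps only "X-..." headers and stores them under PREFIX + name. */
    static Map<String, List<String>> extractCustom(String[] headers) {
        Map<String, List<String>> metadata = new LinkedHashMap<>();
        for (String header : headers) {
            // Case-insensitive check for a leading "x-".
            if (!header.regionMatches(true, 0, "x-", 0, 2)) {
                continue;
            }
            int colon = header.indexOf(':');
            if (colon <= 2) {
                continue; // no name after "x-", or no colon at all
            }
            String name = header.substring(0, colon).trim();
            String value = header.substring(colon + 1).trim();
            metadata.computeIfAbsent(PREFIX + name, k -> new ArrayList<>()).add(value);
        }
        return metadata;
    }

    public static void main(String[] args) {
        String[] headers = {
            "Subject: Microsoft Outlook Express 6",
            "X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2900.3028"
        };
        System.out.println(extractCustom(headers));
    }
}
```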
[jira] [Commented] (TIKA-2122) Extract all email headers from Outlook .msg files into Metadata
[ https://issues.apache.org/jira/browse/TIKA-2122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15579692#comment-15579692 ]

Nick Burch commented on TIKA-2122:
---

I'm not sure if we want to be dumping these raw into the Tika metadata - maybe we could do with a prefix though? (Would probably want syncing up with RFC822 and MBox parsers though for consistency)

Also note that HMEF doesn't currently pull out all the possible properties from the MSG level (support for fixed-length properties is incomplete and in need of volunteer energy), so there may be more bits of metadata we could get from the MSG file "properly", which may negate some of the need for this. (Pending suitable POI work!)