Hi Folks,

Can people check out:

https://issues.apache.org/jira/browse/TIKA-879


And Jeremy’s patch - how does that jibe with what is going on
re: TIKA-1602? I’d like to help get Jeremy’s fix in somehow
so he can move on.
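
For anyone catching up on the thread, here is my rough understanding of
the idea behind the patch, sketched in Python. This is only a toy prefix
matcher, not Tika's actual detector and not the real contents of
tika-mimetypes.xml; the prefix lists, the `detect` function, and the
sample messages are all made up for illustration.

```python
# Illustrative sketch only: a toy prefix-based ("magic") detector.
# It is NOT Tika's implementation; the prefix lists are assumptions.

# Hypothetical header names a magic rule might accept at offset 0.
BASELINE_MAGIC = [b"From: ", b"Received: ", b"Return-Path: ", b"Message-ID: "]

# Same list plus the non-standard header seen in the exported mail.
PATCHED_MAGIC = BASELINE_MAGIC + [b"Status: "]


def detect(data: bytes, magic_prefixes) -> str:
    """Return message/rfc822 if the data starts with a known prefix."""
    for prefix in magic_prefixes:
        if data.startswith(prefix):
            return "message/rfc822"
    return "application/octet-stream"  # toy fallback when nothing matches


# Made-up sample inputs.
exported = b"Status: RO\r\nFrom: someone@example.com\r\nSubject: hi\r\n\r\nbody\r\n"
plain_text = b"Status: all systems nominal\n"

print(detect(exported, BASELINE_MAGIC))   # falls through to the fallback
print(detect(exported, PATCHED_MAGIC))    # matched via the added prefix
print(detect(plain_text, PATCHED_MAGIC))  # also matched: possible false positive
```

The last call is the false-positive worry raised earlier in the thread:
any text file that happens to start with `Status: ` would match too,
which is why running the patch over a large regression corpus matters.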

Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: [email protected]
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++






-----Original Message-----
From: <Merrill>, Jeremy <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Friday, April 10, 2015 at 4:05 PM
To: "[email protected]" <[email protected]>
Subject: Re: Detecting standards-non-compliant emails as message/rfc822

>Hi Chris,
>
>
>Thanks for your quick, detailed and helpful response; I appreciate it!
>
>Unit tests in tika-core pass. Unit tests in tika-parsers fail on a GDAL
>test case. These failures also exist in trunk off of
>https://github.com/apache/tika/, so I think we're good on that front.
>
>
>I didn't quite follow how to run the Tika Batch set of tools. Do I need
>an additional document set to run it against?
>
>
>I've opened a JIRA issue <https://issues.apache.org/jira/browse/TIKA-1602>.
>Sorry if I messed it up -- I'm bad at even our in-house Jira, and y'all's
>is different...
>
>
>Thanks,
>
>---
>Jeremy B. Merrill
>The New York Times
>
>
>
>
>
>On Fri, Apr 10, 2015 at 2:56 PM, Mattmann, Chris A (3980)
><[email protected]> wrote:
>
>Dear Jeremy, thank you for using Tika, and it’s awesome that it’s
>being used in that particular use case at NYT! Rad.
>
>My suggestion: yes, this would be a useful way to fix it. Adding
>STATUS as a magic match priority is fine! It just gets considered
>along with the rest of them. The only thing I wonder about is potential
>false positives on it. Best way to check? Run the unit tests with your
>patch. Do they all pass?
>
>If so, then I would check out Tika Batch here:
>
>http://wiki.apache.org/tika/TikaBatchUsage
>
>https://wiki.apache.org/tika/TikaBatchOverview
>
>
>Tim Allison, Tyler Palsulich, I, and others have been setting up
>a govdocs regression test suite (and I have one for Polar data for
>TREC) that runs Tika over a large number of files and checks whether
>it still parses them the same way.
>
>First start with the unit tests, and then let’s try Tika batch. If
>it passes both of those I’d say this is something we definitely want
>to commit!
>
>In the meantime, please feel free to open up a JIRA:
>
>https://issues.apache.org/jira/browse/TIKA
>
>and then reference a Pull Request and your code in it, just to get
>that part going. Thanks for considering contributing!
>
>Cheers,
>Chris
>
>
>-----Original Message-----
>From: <Merrill>, Jeremy <[email protected]>
>Reply-To: "[email protected]" <[email protected]>
>Date: Friday, April 10, 2015 at 11:46 AM
>To: "[email protected]" <[email protected]>
>Subject: Detecting standards-non-compliant emails as message/rfc822
>
>>Hi friends,
>>
>>
>>*tl;dr*: I've added an extra line to tika-mimetypes.xml for detecting
>>certain rfc822 non-compliant emails that are exported by a certain U.S.
>>politician's email server. Would this be useful to add to the official
>>Tika repo?
>>
>>https://github.com/jeremybmerrill/tika/commit/32931d3438b868c2d2bcea754236756944ab5eb7
>>
>>
>>
>>longer version:
>>
>>
>>
>>Big fan of Tika -- we're using it a fair amount here to do document
>>search for emails/files we receive in big dumps from various public
>>officials.
>>
>>
>>These dumps frequently come directly from these officials' mailservers.
>>I believe that, since they're not intended to be transmitted over the
>>wire, the dumps are sometimes slightly non-compliant. Many begin with
>>the header `Status: `, which is non-standard under RFC822.
>>
>>
>>It's important to note that Tika (and the underlying library, James
>>Mime4J) does properly parse these emails, despite the non-compliant
>>header. The problem is getting Tika to *detect* the file as an email so
>>that Mime4J gets chosen to parse it.
>>
>>
>>
>>Tika does not properly detect these emails as `message/rfc822`. I've
>>added `Status: ` as a magic detection line in tika-mimetypes.xml. This
>>solves my problem and does not appear to cause test failures. Perhaps
>>there's another, easier solution? Also, I don't know if it'll cause
>>problems for other people or whether it would be useful to them --
>>that's why I'm asking you. If it is, I'd be happy to contribute it as a
>>patch. Please let me know.
>>
>>---
>>Jeremy B. Merrill
>>The New York Times
>>
