Hi Chris,

Thanks for your quick, detailed and helpful response; I appreciate it!

Unit tests in tika-core pass. Unit tests in tika-parsers fail on a GDAL
test case, but the same failures occur on unmodified trunk from
https://github.com/apache/tika/, so I think we're good on that front.
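
On the false-positive question, here's a minimal sketch of why they're
possible. It is not Tika's detection code: the prefix list and the function
name are made up for illustration, and I'm only assuming the new entry
behaves like a match on `Status: ` at the very start of the file.

```python
# Hypothetical stand-in for an offset-0 magic match; NOT Tika's implementation.
# The prefix list is assumed for illustration, not copied from tika-mimetypes.xml.
EMAIL_MAGIC_PREFIXES = (b"From: ", b"Received: ", b"Status: ")


def looks_like_rfc822(data: bytes) -> bool:
    """Return True if the data starts with one of the assumed header prefixes."""
    return data.startswith(EMAIL_MAGIC_PREFIXES)


# An exported email that begins with the non-standard header.
exported_email = b"Status: RO\r\nFrom: someone@example.com\r\nSubject: hi\r\n\r\nbody\r\n"
# A plain-text file that happens to begin with the same bytes.
plain_text_log = b"Status: all systems nominal\nnothing else to report\n"
unrelated_text = b"Hello world\n"

print(looks_like_rfc822(exported_email))  # True
print(looks_like_rfc822(plain_text_log))  # True (a false positive)
print(looks_like_rfc822(unrelated_text))  # False
```

The second case is the kind of file I'd want a larger regression run to
catch, since a prefix check alone can't tell it apart from a real email.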

I didn't quite follow how to run the Tika Batch set of tools. Do I need an
additional document set to run it against?

I've opened a JIRA issue <https://issues.apache.org/jira/browse/TIKA-1602>.
Sorry if I messed it up -- I'm bad at even our in-house Jira, and y'all's
is different...

Thanks,

---
Jeremy B. Merrill
The New York Times


On Fri, Apr 10, 2015 at 2:56 PM, Mattmann, Chris A (3980) <
[email protected]> wrote:

> Dear Jeremy, thank you for using Tika, and it’s awesome that it’s
> being used for that particular use case at NYT! Rad.
>
> My suggestion: yes, this would be a useful way to fix it. Adding
> STATUS as a magic match with a priority is fine! It just gets considered
> along with the rest of them. The only thing I wonder about is potential
> false positives on it. Best way to check? Run the unit tests with your
> patch. Do they all pass?
>
> If so, then I would check out Tika Batch here:
>
> http://wiki.apache.org/tika/TikaBatchUsage
>
> https://wiki.apache.org/tika/TikaBatchOverview
>
>
> Tim Allison, myself, Tyler Palsulich and others have been setting up
> a govdocs regression test suite (and I have one for Polar data for
> TREC) that runs Tika over many, many files and checks whether or not
> it still parses them the same way.
>
> First start with the unit tests, and then let’s try Tika Batch. If
> it passes both of those I’d say this is something we definitely want
> to commit!
>
> In the meantime please feel free to open up a JIRA:
>
> https://issues.apache.org/jira/browse/TIKA and then reference
> a Pull Request and your code in it, just to get that part going.
> Thanks for considering contributing!
>
> Cheers,
> Chris
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Chief Architect
> Instrument Software and Science Data Systems Section (398)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: [email protected]
> WWW:  http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
> -----Original Message-----
> From: <Merrill>, Jeremy <[email protected]>
> Reply-To: "[email protected]" <[email protected]>
> Date: Friday, April 10, 2015 at 11:46 AM
> To: "[email protected]" <[email protected]>
> Subject: Detecting standards-non-compliant emails as message/rfc822
>
> >Hi friends,
> >
> >
> >*tl;dr*: I've added an extra line to tika-mimetypes.xml for detecting
> >certain rfc822 non-compliant emails that are exported by a certain U.S.
> >politician's email server. Would this be useful to add to the official
> >Tika repo?
> >
> >
> >https://github.com/jeremybmerrill/tika/commit/32931d3438b868c2d2bcea754236756944ab5eb7
> >
> >
> >
> >longer version:
> >
> >
> >
> >Big fan of Tika -- we're using it a fair amount here to do document
> >search for emails/files we receive in big dumps from various public
> >officials.
> >
> >
> >These dumps frequently come directly from these officials' mailservers.
> >I believe that, since the dumps aren't intended to be transmitted over
> >the wire, they are sometimes slightly non-compliant. Many begin with the
> >non-standard RFC822 header `Status: `.
> >
> >
> >It's important to note that Tika (and the underlying library, James
> >Mime4J) does properly parse these emails, despite the non-compliant header.
> >The problem is getting Tika to *detect* the file as an email so that
> >Mime4J gets chosen to parse it.
> >
> >
> >
> >Tika does not properly detect these emails as `message/rfc822`. I've
> >added `Status: ` as a magic detection line in tika-mimetypes.xml. This
> >solves my problem and does not appear to cause test failures. Perhaps
> >there's another, easier solution? Also, I don't know if it'll cause
> >problems for other people or whether it would be useful to them --
> >that's why I'm asking you. If it is useful, I'd be happy to contribute
> >it as a patch. Please let me know.
> >
> >---
> >Jeremy B. Merrill
> >The New York Times
> >
