Hi Chris,

Thanks for your quick, detailed and helpful response; I appreciate it!

Unit tests in tika-core pass. Unit tests in tika-parsers fail on a GDAL test
case. These failures also exist in trunk off of https://github.com/apache/tika/,
so I think we're good on that front.

I didn't quite follow how to run the Tika Batch set of tools. Do I need an
additional document set to run it against?

I've opened a JIRA issue <https://issues.apache.org/jira/browse/TIKA-1602>.
Sorry if I messed it up -- I'm bad at even our in-house Jira, and y'all's is
different...

Thanks,

---
Jeremy B. Merrill
The New York Times

On Fri, Apr 10, 2015 at 2:56 PM, Mattmann, Chris A (3980) <[email protected]> wrote:

> Dear Jeremy thank you for using Tika and that’s awesome that it’s
> being used in that particular use case at NYT! Rad.
>
> My suggestion: yes, this would be a useful way to fix it. Adding
> STATUS as a match magic priority is fine! it just gets considered
> along with the rest of them. The only thing I wonder would be potential
> false positives on it. Best way to check? Run the unit tests with your
> patch. Do they all pass?
>
> If so, then I would check out Tika Batch here:
>
> http://wiki.apache.org/tika/TikaBatchUsage
>
> https://wiki.apache.org/tika/TikaBatchOverview
>
> Tim Allison, myself, Tyler Palsulich and others have been setting up
> a govdocs regression test suite (and I have one for Polar data for
> TREC) that runs Tika over many many files and checks whether or not
> it still parses them the same way.
>
> First start with the unit tests, and then let’s try Tika batch. If
> it passes both of those I’d say this is something we definitely want
> to commit!
>
> In the meanwhile please feel free to open up a JIRA:
>
> https://issues.apache.org/jira/browse/TIKA and then reference
> a Pull Request and your code in it, just to get that part going.
> Thanks for considering contributing!
>
> Cheers,
> Chris
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Chief Architect
> Instrument Software and Science Data Systems Section (398)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: [email protected]
> WWW: http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
> -----Original Message-----
> From: <Merrill>, Jeremy <[email protected]>
> Reply-To: "[email protected]" <[email protected]>
> Date: Friday, April 10, 2015 at 11:46 AM
> To: "[email protected]" <[email protected]>
> Subject: Detecting standards-non-compliant emails as message/rfc822
>
> >Hi friends,
> >
> >*tl;dr*: I've added an extra line to tika-mimetypes.xml for detecting
> >certain rfc822 non-compliant emails that are exported by a certain U.S.
> >politician's email server. Would this be useful to add to the official
> >Tika repo?
> >
> ><https://github.com/jeremybmerrill/tika/commit/32931d3438b868c2d2bcea754236756944ab5eb7>
> >
> >longer version:
> >
> >Big fan of Tika -- we're using it a fair amount here to do document
> >search for emails/files we receive in big dumps from various public
> >officials.
> >
> >These dumps frequently come directly from these officials' mailservers.
> >The dumps, I believe since they're not intended to be transmitted over
> >the wire, sometimes are slightly non-compliant. Many begin with the
> >non-standard header RFC822 `Status: `.
> >
> >It's important to note that Tika (and the underlying library, James
> >Mime4J) do properly parse these emails, despite the non-compliant header.
> >The problem is getting Tika to *detect* the file as an email so that
> >Mime4J gets chosen to parse it.
> >
> >Tika does not properly detect these emails as `message/rfc822`. I've
> >added `Status: ` as a magic detection line in tika-mimetypes.xml. This
> >solves my problem and does not appear to cause test failures. Perhaps
> >there's another, easier solution? Also, I don't know if it'll cause
> >problems for other people or whether it would be useful to them --
> >that's why I'm asking you. If it is, I'd be happy to contribute it as
> >a patch. Please let me know.
> >
> >---
> >Jeremy B. Merrill
> >The New York Times
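
For readers following the thread, the change under discussion can be pictured as a prefix check at offset 0 of the file. The sketch below is not Tika's detector and does not use Tika's API: the function name, the `allow_status` flag, and every prefix other than `Status: ` are assumptions made purely for illustration. It also shows the false-positive concern Chris raises, since any text file that happens to start with `Status: ` would match.

```python
# Illustrative sketch only: this is NOT Tika's detection code.
# Every prefix here is an assumed example except b"Status: ",
# which is the header named in the thread.
ASSUMED_RFC822_PREFIXES = (
    b"From: ",
    b"Received: ",
    b"Message-ID: ",
    b"Return-Path: ",
)
STATUS_PREFIX = b"Status: "


def looks_like_rfc822(head: bytes, allow_status: bool = True) -> bool:
    """Return True if the file's first bytes start with a known header prefix."""
    prefixes = ASSUMED_RFC822_PREFIXES
    if allow_status:
        # The extra "magic" line proposed in the thread.
        prefixes = prefixes + (STATUS_PREFIX,)
    # bytes.startswith accepts a tuple and checks each prefix at offset 0.
    return head.startswith(prefixes)


if __name__ == "__main__":
    dumped_mail = b"Status: RO\r\nFrom: someone@example.com\r\n\r\nbody"
    plain_text = b"Status: shipped\nOrder details follow\n"

    print(looks_like_rfc822(dumped_mail, allow_status=False))
    print(looks_like_rfc822(dumped_mail))
    # A non-email file that merely starts with "Status: " also matches,
    # which is the kind of false positive a regression run would surface.
    print(looks_like_rfc822(plain_text))
```

Running a check like this over a large, varied corpus is the same idea as the regression runs suggested above: compare what is detected before and after the extra prefix is enabled.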
