Hi Folks,

Can people check out:

https://issues.apache.org/jira/browse/TIKA-879

And Jeremy’s patch - how does that jibe with what is going on RE: TIKA-1602?
I’d like to help get Jeremy’s fix in somehow so he can move on.

Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: [email protected]
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

-----Original Message-----
From: <Merrill>, Jeremy <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Friday, April 10, 2015 at 4:05 PM
To: "[email protected]" <[email protected]>
Subject: Re: Detecting standards-non-compliant emails as message/rfc822

>Hi Chris,
>
>Thanks for your quick, detailed and helpful response; I appreciate it!
>
>Unit tests in tika-core pass. Unit tests in tika-parsers fail on a GDAL
>test case. These failures also exist in trunk off of
>https://github.com/apache/tika/, so I think we're good on that front.
>
>I didn't quite follow how to run the Tika Batch set of tools. Do I need
>an additional document set to run it against?
>
>I've opened a
>JIRA issue <https://issues.apache.org/jira/browse/TIKA-1602>. Sorry if I
>messed it up -- I'm bad at even our in-house Jira, and y'all's is
>different...
>
>Thanks,
>
>---
>Jeremy B. Merrill
>The New York Times
>
>On Fri, Apr 10, 2015 at 2:56 PM, Mattmann, Chris A (3980)
><[email protected]> wrote:
>
>Dear Jeremy, thank you for using Tika and that’s awesome that it’s
>being used in that particular use case at NYT! Rad.
>
>My suggestion: yes, this would be a useful way to fix it.
>Adding STATUS as a match magic priority is fine! It just gets considered
>along with the rest of them. The only thing I wonder would be potential
>false positives on it. Best way to check? Run the unit tests with your
>patch. Do they all pass?
>
>If so, then I would check out Tika Batch here:
>
>http://wiki.apache.org/tika/TikaBatchUsage
>
>https://wiki.apache.org/tika/TikaBatchOverview
>
>Tim Allison, myself, Tyler Palsulich and others have been setting up
>a govdocs regression test suite (and I have one for Polar data for
>TREC) that runs Tika over many many files and checks whether or not
>it still parses them the same way.
>
>First start with the unit tests, and then let’s try Tika batch. If
>it passes both of those I’d say this is something we definitely want
>to commit!
>
>In the meantime please feel free to open up a JIRA:
>
>https://issues.apache.org/jira/browse/TIKA and then reference
>a Pull Request and your code in it, just to get that part going.
>Thanks for considering contributing!
>
>Cheers,
>Chris
>
>++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>Chris Mattmann, Ph.D.
>Chief Architect
>Instrument Software and Science Data Systems Section (398)
>NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>Office: 168-519, Mailstop: 168-527
>Email: [email protected]
>WWW: http://sunset.usc.edu/~mattmann/
>++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>Adjunct Associate Professor, Computer Science Department
>University of Southern California, Los Angeles, CA 90089 USA
>++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
>-----Original Message-----
>From: <Merrill>, Jeremy <[email protected]>
>Reply-To: "[email protected]" <[email protected]>
>Date: Friday, April 10, 2015 at 11:46 AM
>To: "[email protected]" <[email protected]>
>Subject: Detecting standards-non-compliant emails as message/rfc822
>
>>Hi friends,
>>
>>*tl;dr*: I've added an extra line to tika-mimetypes.xml for detecting
>>certain rfc822 non-compliant emails that are exported by a certain U.S.
>>politician's email server. Would this be useful to add to the official
>>Tika repo?
>>
>>https://github.com/jeremybmerrill/tika/commit/32931d3438b868c2d2bcea754236756944ab5eb7
>>
>>longer version:
>>
>>Big fan of Tika -- we're using it a fair amount here to do document
>>search for emails/files we receive in big dumps from various public
>>officials.
>>
>>These dumps frequently come directly from these officials' mailservers.
>>The dumps, I believe since they're not intended to be transmitted over
>>the wire, sometimes are slightly non-compliant. Many begin with the
>>non-standard RFC822 header `Status: `.
>>
>>It's important to note that Tika (and the underlying library, James
>>Mime4J) do properly parse these emails, despite the non-compliant header.
>>The problem is getting Tika to *detect* the file as an email so that
>>Mime4J gets chosen to parse it.
>>
>>Tika does not properly detect these emails as `message/rfc822`. I've
>>added `Status: ` as a magic detection line in tika-mimetypes.xml. This
>>solves my problem and does not appear to cause test failures. Perhaps
>>there's another, easier solution? Also, I don't know if it'll cause
>>problems for other people or whether it would be useful to them --
>>that's why I'm asking you. If it is, I'd be happy to contribute it as
>>a patch. Please let me know.
>>
>>---
>>Jeremy B. Merrill
>>The New York Times
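
For readers following the thread, here is a rough illustration of the idea being discussed: deciding that a file is `message/rfc822` from how it begins, and why adding `Status:` to the list raises the false-positive question. This is only a sketch in Python, not Tika's code. The function names, the list of header prefixes, and the `application/octet-stream` fallback are all assumptions made for the example; the real rules live in tika-mimetypes.xml and may look quite different.

```python
# Hypothetical sketch of prefix ("magic") detection. Not Tika's implementation.

# Assumed header prefixes, chosen for illustration only; the actual entries
# in tika-mimetypes.xml may differ.
RFC822_MAGIC = (b"Received:", b"From:", b"Message-ID:", b"Status:")

# Assumed fallback type for anything that matches no prefix.
FALLBACK = "application/octet-stream"


def detect(data: bytes) -> str:
    """Guess a MIME type from the leading bytes, with `Status:` included."""
    if data.startswith(RFC822_MAGIC):
        return "message/rfc822"
    return FALLBACK


def detect_without_status(data: bytes) -> str:
    """Same check with the `Status:` entry removed (the pre-patch situation)."""
    magic = tuple(m for m in RFC822_MAGIC if m != b"Status:")
    if data.startswith(magic):
        return "message/rfc822"
    return FALLBACK


# A mail dump that begins with the non-standard `Status:` header.
dump = b"Status: RO\r\nFrom: someone@example.org\r\nSubject: hi\r\n\r\nbody"
print(detect(dump))                  # matched via the added prefix
print(detect_without_status(dump))   # missed without it

# A non-email file that happens to start the same way: the false-positive risk.
print(detect(b"Status: build passed\n"))
```

The last call is the case the unit tests and a Tika Batch regression run are meant to surface: any ordinary text file that starts with `Status:` would also match a rule this simple.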
