Grant, how do I test and find problematic emails? I am testing my FreeEed (http://freeeed.org/) package with the same Enron dataset. Also, please keep in mind that the Enron set found on EDRM is a conversion done by a third party, not the original emails as they were harvested.
Thank you,
Mark

On Thu, Mar 13, 2014 at 1:28 PM, Grant Ingersoll <[email protected]> wrote:

> Myself and a colleague were parsing the Enron dataset the other day and
> noticed that a number of emails that had message bodies in them were not
> getting extracted.
>
> In particular, when running our Tika parsing code in Hadoop distributed
> mode, the body was going missing. If I ran the exact same code in my
> IDE in Hadoop local mode (i.e. no cluster), the message body gets
> extracted fine.
>
> To isolate things down, we tried with the testLotusEml.eml file in
> Tika's test document suite (many of the Enron emails are Lotus) and
> noticed the same thing. Digging in further, I thought the issue might
> be something in the RFC822Parser, since this is the MIME type of the
> document. (In particular, I thought it would be a threading issue)
>
> Turns out, however, the problem seems to be in my understanding of how
> TikaConfig.getDefaultConfig().getParser works (or doesn't work).
> Namely, if you run the Test below (I added it to RFC822ParserTest
> locally), the first two checkParser methods pass just fine, the third
> one fails.
>
> So, I guess my questions are:
> - what's different between how I use getDefaultConfig in local mode vs.
>   Hadoop mode? I haven't customized the config at all in either case and
>   I am not aware of any SPIs registered. (i've also reproduced the
>   problem in non-dev environments -- i.e. machines only doing this
>   workload w/ a clean OS)
> - what's different in this test which is being run in the Tika
>   development environment and presumably has the same core configuration?
>
> (note to Julien Nioche, if you are reading this: this problem exists in
> Behemoth TikaProcessor or at least it did in the snapshot of the version
> I have)
>
> @Test
> public void testLotus() throws Exception {
>   checkParser(new RFC822Parser());
>   checkParser(new AutoDetectParser());
>   checkParser(TikaConfig.getDefaultConfig().getParser());
> }
>
> private void checkParser(Parser parser) {
>   Metadata metadata = new Metadata();
>   InputStream stream = getStream("test-documents/testLotusEml.eml");
>   ContentHandler handler = new BodyContentHandler();
>
>   try {
>     parser.parse(stream, handler, metadata, new ParseContext());
>     String bodyText = handler.toString();
>     assertTrue(bodyText.contains("Message body"));
>   } catch (Exception e) {
>     fail("Exception thrown: " + e.getMessage());
>   }
> }
>
> Thanks,
> Grant
>
> --------------------------------------------
> Grant Ingersoll | @gsingers
> http://www.lucidworks.com
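
On the question of finding problematic emails, one possible approach is to walk a directory of messages, run each file through the extraction step, and list the files whose body comes back empty. The sketch below is only an illustration under assumptions: the class name EmptyBodyScan and the naiveBody stand-in extractor are hypothetical and are not part of Tika or FreeEed. In a real run the extractor would presumably wrap the same parser.parse(...) call used in checkParser in the quoted test, executed in the environment where the bodies go missing.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical helper: lists message files whose extracted body is empty. */
public class EmptyBodyScan {

    /**
     * Walks root recursively and returns, in sorted order, every regular file
     * for which the supplied extractor yields null or only whitespace.
     */
    public static List<Path> findEmptyBodies(Path root, Function<Path, String> extractor)
            throws IOException {
        try (Stream<Path> files = Files.walk(root)) {
            return files
                    .filter(Files::isRegularFile)
                    .filter(p -> {
                        String body = extractor.apply(p);
                        return body == null || body.trim().isEmpty();
                    })
                    .sorted()
                    .collect(Collectors.toList());
        }
    }

    /**
     * Stand-in extractor (an assumption, NOT how Tika works): returns the text
     * after the first blank line, which separates headers from the body in a
     * plain RFC 822 message. Read failures count as an empty body so that the
     * file is flagged rather than silently skipped.
     */
    public static String naiveBody(Path file) {
        try {
            String text = new String(Files.readAllBytes(file), StandardCharsets.UTF_8)
                    .replace("\r\n", "\n");
            int separator = text.indexOf("\n\n");
            return separator < 0 ? "" : text.substring(separator + 2);
        } catch (IOException e) {
            return "";
        }
    }

    public static void main(String[] args) throws IOException {
        // Directory to scan; defaults to the current directory.
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        for (Path p : findEmptyBodies(root, EmptyBodyScan::naiveBody)) {
            System.out.println("empty body: " + p);
        }
    }
}
```

Passing the extractor in as a function keeps the scan independent of how the body is obtained, so the same listing can be produced once per parser setup (for example RFC822Parser, AutoDetectParser, and the default config parser) and the results compared.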
