A colleague and I were parsing the Enron dataset the other day and
noticed that a number of emails that have message bodies were not
getting those bodies extracted.

In particular, when running our Tika parsing code in Hadoop distributed
mode, the body goes missing.  If I run the exact same code in my IDE in
Hadoop local mode (i.e. no cluster), the message body gets extracted
fine.

To isolate things, we tried the testLotusEml.eml file in Tika's test
document suite (many of the Enron emails are Lotus) and noticed the same
thing.  Digging in further, I thought the issue might be something in
the RFC822Parser, since message/rfc822 is the MIME type of the document.
(In particular, I thought it would be a threading issue.)

Turns out, however, the problem seems to be in my understanding of how
TikaConfig.getDefaultConfig().getParser() works (or doesn't work).
Namely, if you run the test below (I added it to RFC822ParserTest
locally), the first two checkParser calls pass just fine, but the third
one fails.

So, I guess my questions are:
- What's different between how I use getDefaultConfig in local mode vs.
Hadoop mode?  I haven't customized the config at all in either case and
I am not aware of any SPIs registered.  (I've also reproduced the
problem in non-dev environments, i.e. machines only doing this workload
with a clean OS.)
- What's different in this test, which is being run in the Tika
development environment and presumably has the same core configuration?
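
On the SPI point, one way to compare the two environments is to list
which service registration files the context class loader can actually
see.  This is only a minimal diagnostic sketch, not a confirmed
explanation of the failure.  The class and method names
(ServiceFileLister, findServiceFiles) are made up for illustration; the
only Tika-specific assumption is that parsers are registered under
META-INF/services/org.apache.tika.parser.Parser.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper: lists every META-INF/services file for a given
// service interface that is visible to the current thread's class loader.
class ServiceFileLister {

    static List<String> findServiceFiles(String serviceName) {
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        if (loader == null) {
            // Fall back when no context class loader has been set.
            loader = ClassLoader.getSystemClassLoader();
        }
        List<String> found = new ArrayList<String>();
        try {
            String resource = "META-INF/services/" + serviceName;
            for (URL url : Collections.list(loader.getResources(resource))) {
                found.add(url.toString());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return found;
    }

    public static void main(String[] args) {
        // Prints one line per registration file found on the classpath.
        for (String location
                : findServiceFiles("org.apache.tika.parser.Parser")) {
            System.out.println(location);
        }
    }
}
```

Running the same call from the IDE and from inside a Hadoop task (for
example, logging the result from the mapper's setup) would show whether
the two classpaths expose the same parser registrations.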

(Note to Julien Nioche, if you are reading this: this problem exists in
Behemoth TikaProcessor, or at least it did in the snapshot of the
version I have.)

 @Test
 public void testLotus() throws Exception {
   // The first two pass; the third (default config parser) fails.
   checkParser(new RFC822Parser());
   checkParser(new AutoDetectParser());
   checkParser(TikaConfig.getDefaultConfig().getParser());
 }

 private void checkParser(Parser parser) {
   // Fresh, empty metadata: no content type is supplied up front.
   Metadata metadata = new Metadata();
   InputStream stream = getStream("test-documents/testLotusEml.eml");
   ContentHandler handler = new BodyContentHandler();

   try {
     parser.parse(stream, handler, metadata, new ParseContext());
     // The extracted text should include the body of the email.
     String bodyText = handler.toString();
     assertTrue(bodyText.contains("Message body"));
   } catch (Exception e) {
     fail("Exception thrown: " + e.getMessage());
   }
 }

Thanks,
Grant

--------------------------------------------
Grant Ingersoll | @gsingers
http://www.lucidworks.com
