A colleague and I were parsing the Enron dataset the other day and
noticed that a number of emails that do have message bodies were not
getting those bodies extracted.

In particular, when running our Tika parsing code in Hadoop distributed
mode, the body goes missing. If I run the exact same code in my IDE in
Hadoop local mode (i.e. no cluster), the message body gets extracted
fine.

To isolate the problem, we tried the testLotusEml.eml file in Tika's
test document suite (many of the Enron emails are Lotus) and noticed the
same thing. Digging in further, I thought the issue might be something
in the RFC822Parser, since it handles the MIME type of the document
(message/rfc822). (In particular, I thought it would be a threading
issue.)

Turns out, however, the problem seems to be in my understanding of how
TikaConfig.getDefaultConfig().getParser() works (or doesn't work).
Namely, if you run the test below (I added it to RFC822ParserTest
locally), the first two checkParser calls pass just fine, but the third
one fails.

So, I guess my questions are:
- What's different between how I use getDefaultConfig() in local mode
  vs. Hadoop distributed mode? I haven't customized the config at all in
  either case, and I am not aware of any SPIs registered. (I've also
  reproduced the problem in non-dev environments, i.e. machines with a
  clean OS that only run this workload.)
- What's different in this test, which is being run in the Tika
  development environment and presumably has the same core
  configuration?

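One thing that can be checked independently of Tika is whether the
parser service file is visible at all in each environment. Below is a
minimal sketch, assuming the default config discovers its parsers through
META-INF/services/org.apache.tika.parser.Parser. The class name
ServiceFileCheck and the helper visibleResources are made up for
illustration, and this may well not be the cause of the failure above.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical diagnostic helper; not part of Tika or Hadoop.
class ServiceFileCheck {
    // Assumption: this is the service file used for parser discovery.
    static final String PARSER_SERVICES =
            "META-INF/services/org.apache.tika.parser.Parser";

    /** Returns the URL of every copy of the resource the loader can see. */
    static List<URL> visibleResources(ClassLoader loader, String resource) {
        if (loader == null) {
            // No loader to ask (e.g. no context class loader set).
            return new ArrayList<>();
        }
        try {
            return Collections.list(loader.getResources(resource));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        ClassLoader context = Thread.currentThread().getContextClassLoader();
        ClassLoader own = ServiceFileCheck.class.getClassLoader();
        // Compare what the two loaders can see in this environment.
        System.out.println("context loader: "
                + visibleResources(context, PARSER_SERVICES));
        System.out.println("class loader:   "
                + visibleResources(own, PARSER_SERVICES));
    }
}
```

Calling visibleResources from inside a map task and again from the IDE
would show whether both environments see the same service files. An
empty list in one of them would point at the classpath or class loader
rather than at RFC822Parser.
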
(Note to Julien Nioche, if you are reading this: this problem exists in
Behemoth TikaProcessor, or at least it did in the snapshot of the
version I have.)

    @Test
    public void testLotus() throws Exception {
        checkParser(new RFC822Parser());
        checkParser(new AutoDetectParser());
        checkParser(TikaConfig.getDefaultConfig().getParser());
    }

    private void checkParser(Parser parser) {
        Metadata metadata = new Metadata();
        InputStream stream = getStream("test-documents/testLotusEml.eml");
        ContentHandler handler = new BodyContentHandler();
        try {
            parser.parse(stream, handler, metadata, new ParseContext());
            String bodyText = handler.toString();
            assertTrue(bodyText.contains("Message body"));
        } catch (Exception e) {
            fail("Exception thrown: " + e.getMessage());
        }
    }

Thanks,
Grant
--------------------------------------------
Grant Ingersoll | @gsingers
http://www.lucidworks.com