Grant,

How do I test for and find the problematic emails? I am testing my FreeEed
(http://freeeed.org/) package with the same Enron dataset. Also, please keep
in mind that the Enron dataset on EDRM is a conversion done by a third party,
not the original emails as they were harvested.

Thank you,
Mark


On Thu, Mar 13, 2014 at 1:28 PM, Grant Ingersoll <[email protected]> wrote:

> A colleague and I were parsing the Enron dataset the other day and
> noticed that the message bodies of a number of emails were not
> getting extracted.
>
> In particular, when running our Tika parsing code in Hadoop distributed
> mode, the body was going missing.  If I ran the exact same code in my
> IDE in Hadoop local mode (i.e. no cluster), the message body gets
> extracted fine.
>
> To narrow things down, we tried with the testLotusEml.eml file in
> Tika's test document suite (many of the Enron emails are Lotus) and
> noticed the same thing.  Digging in further, I thought the issue might
> be something in the RFC822Parser, since this is the MIME type of the
> document.  (In particular, I thought it would be a threading issue.)
>
> Turns out, however, the problem seems to be in my understanding of how
> TikaConfig.getDefaultConfig().getParser works (or doesn't work).
> Namely, if you run the test below (I added it to RFC822ParserTest
> locally), the first two checkParser calls pass just fine, but the third
> one fails.
>
> So, I guess my questions are:
> - what's different between how I use getDefaultConfig in local mode vs.
> Hadoop mode?  I haven't customized the config at all in either case and
> I am not aware of any SPIs registered.  (I've also reproduced the
> problem in non-dev environments -- i.e. machines only doing this
> workload w/ a clean OS)
> - what's different in this test which is being run in the Tika
> development environment and presumably has the same core configuration?
>
> (note to Julien Nioche, if you are reading this: this problem exists in
> Behemoth TikaProcessor or at least it did in the snapshot of the version
> I have)
>
>  @Test
>  public void testLotus() throws Exception {
>    checkParser(new RFC822Parser());
>    checkParser(new AutoDetectParser());
>    checkParser(TikaConfig.getDefaultConfig().getParser());
>  }
>
>  private void checkParser(Parser parser) {
>    Metadata metadata = new Metadata();
>    InputStream stream = getStream("test-documents/testLotusEml.eml");
>    ContentHandler handler = new BodyContentHandler();
>
>    try {
>      parser.parse(stream, handler, metadata, new ParseContext());
>      String bodyText = handler.toString();
>      assertTrue(bodyText.contains("Message body"));
>    } catch (Exception e) {
>      fail("Exception thrown: " + e.getMessage());
>    }
>  }
>
> Thanks,
> Grant
>
> --------------------------------------------
> Grant Ingersoll | @gsingers
> http://www.lucidworks.com
>
>
