Digging further, it seems message/rfc822 is being loaded as a MIME type when 
running locally, but not when running on Hadoop.
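
One way to narrow this down is to compare which parser service files each
environment can see. The sketch below is a rough diagnostic, not a confirmed
diagnosis: it assumes the default config discovers parsers through
META-INF/services/org.apache.tika.parser.Parser files on the classpath, and
the class name ServiceFileLister and its helper methods are hypothetical. It
uses only the JDK, so it can be dropped into a Hadoop task as well as run
from the IDE.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper: lists every service-provider file visible to a class
// loader, plus the provider class names each file declares.
final class ServiceFileLister {

    // Strip a trailing '#' comment and surrounding whitespace from one line.
    static String parseProviderLine(String line) {
        int hash = line.indexOf('#');
        String content = hash >= 0 ? line.substring(0, hash) : line;
        return content.trim();
    }

    // Find every copy of META-INF/services/<serviceName> the loader can see.
    static List<URL> findServiceFiles(ClassLoader loader, String serviceName)
            throws IOException {
        return Collections.list(
                loader.getResources("META-INF/services/" + serviceName));
    }

    // Read the non-empty, non-comment provider names from one service file.
    static List<String> readProviders(URL serviceFile) throws IOException {
        List<String> providers = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                serviceFile.openStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String name = parseProviderLine(line);
                if (!name.isEmpty()) {
                    providers.add(name);
                }
            }
        }
        return providers;
    }

    public static void main(String[] args) throws IOException {
        // Assumed service name; pass a different one as the first argument.
        String service = args.length > 0
                ? args[0] : "org.apache.tika.parser.Parser";
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        for (URL url : findServiceFiles(loader, service)) {
            System.out.println(url);
            for (String provider : readProviders(url)) {
                System.out.println("  " + provider);
            }
        }
    }
}
```

If the output in Hadoop distributed mode lists fewer service files or fewer
provider names than the local run, the difference is likely in how the job's
classpath is assembled rather than in the parsing code itself.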


On Mar 13, 2014, at 2:28 PM, Grant Ingersoll <[email protected]> wrote:

> A colleague and I were parsing the Enron dataset the other day and
> noticed that a number of emails that had message bodies in them were not
> getting extracted.
> 
> In particular, when running our Tika parsing code in Hadoop distributed
> mode, the body was going missing.  If I ran the exact same code in my
> IDE in Hadoop local mode (i.e. no cluster), the message body was
> extracted fine.
> 
> To isolate the problem, we tried the testLotusEml.eml file in
> Tika's test document suite (many of the Enron emails are Lotus) and
> noticed the same thing.  Digging in further, I thought the issue might
> be something in the RFC822Parser, since message/rfc822 is the MIME type
> of the document.  (In particular, I thought it would be a threading issue.)
> 
> It turns out, however, that the problem seems to be in my understanding of
> how TikaConfig.getDefaultConfig().getParser() works (or doesn't work).
> Namely, if you run the test below (I added it to RFC822ParserTest
> locally), the first two checkParser calls pass just fine, but the third
> one fails.
> 
> So, I guess my questions are:
> - What's different between how I use getDefaultConfig in local mode vs.
> Hadoop mode?  I haven't customized the config at all in either case and
> I am not aware of any SPIs registered.  (I've also reproduced the
> problem in non-dev environments, i.e. machines only doing this
> workload with a clean OS.)
> - What's different in this test, which is being run in the Tika
> development environment and presumably has the same core configuration?
> 
> (Note to Julien Nioche, if you are reading this: this problem exists in
> Behemoth TikaProcessor, or at least it did in the snapshot of the version
> I have.)
> 
>  @Test
>  public void testLotus() throws Exception {
>    checkParser(new RFC822Parser());
>    checkParser(new AutoDetectParser());
>    checkParser(TikaConfig.getDefaultConfig().getParser());
>  }
> 
>  // Parses the Lotus test email with the given parser and checks that
>  // the message body text was extracted.
>  private void checkParser(Parser parser) {
>    Metadata metadata = new Metadata();
>    ContentHandler handler = new BodyContentHandler();
> 
>    // try-with-resources closes the stream even if parsing fails
>    try (InputStream stream = getStream("test-documents/testLotusEml.eml")) {
>      parser.parse(stream, handler, metadata, new ParseContext());
>      String bodyText = handler.toString();
>      assertTrue(bodyText.contains("Message body"));
>    } catch (Exception e) {
>      fail("Exception thrown: " + e.getMessage());
>    }
>  }
> 
> Thanks,
> Grant
> 
> --------------------------------------------
> Grant Ingersoll | @gsingers
> http://www.lucidworks.com
> 

--------------------------------------------
Grant Ingersoll | @gsingers
http://www.lucidworks.com




