On Mar 13, 2014, at 3:53 PM, Jukka Zitting <[email protected]> wrote:

> Hi,
> 
> On Thu, Mar 13, 2014 at 3:41 PM, Grant Ingersoll <[email protected]> wrote:
>> But why would that test fail in the Tika dev environment?
> 
> The DefaultParser instance returned by
> TikaConfig.getDefaultConfig().getParser() doesn't auto-detect the
> content type of the input document, so unless you explicitly specify
> the content type in the input metadata, it won't know how to parse the
> document.

So why does the same code work in a non-Hadoop environment, where it does
auto-detect? I don't set the content type in either case. It's the same code;
the only difference, I believe, is that (as Nick pointed out) I missed
packaging something, so message/rfc822 is not a registered MIME type in the
Hadoop job jar. But according to this statement, no documents should be
auto-detected at all, right?
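
One way to test the packaging theory is to run a small check in both environments. This is a rough sketch, not anything from Tika itself: it prints whatever parser classes are listed in the service files visible on the classpath, so the output from the Hadoop job jar can be compared with the local run. It assumes Tika discovers parsers through META-INF/services/org.apache.tika.parser.Parser. The class name ServiceEntryLister and its methods are made up for illustration, and it uses only the JDK.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;

// Hypothetical diagnostic helper; not part of Tika or Hadoop.
class ServiceEntryLister {

    // Assumption: parser implementations are registered in this service file.
    static final String PARSER_SERVICES =
            "META-INF/services/org.apache.tika.parser.Parser";

    // Reads one service file: drops '#' comments and blank lines, keeps class names.
    static List<String> parseServiceLines(Reader source) {
        List<String> names = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                int hash = line.indexOf('#');
                if (hash >= 0) {
                    line = line.substring(0, hash);
                }
                line = line.trim();
                if (!line.isEmpty()) {
                    names.add(line);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return names;
    }

    // Collects entries from every copy of the resource the class loader can see.
    static List<String> listEntries(String resourcePath, ClassLoader loader) {
        List<String> all = new ArrayList<>();
        try {
            Enumeration<URL> urls = loader.getResources(resourcePath);
            while (urls.hasMoreElements()) {
                URL url = urls.nextElement();
                all.addAll(parseServiceLines(
                        new InputStreamReader(url.openStream(), StandardCharsets.UTF_8)));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return all;
    }

    public static void main(String[] args) {
        // Prints one registered class name per line; no output means none were found.
        for (String name : listEntries(PARSER_SERVICES, ClassLoader.getSystemClassLoader())) {
            System.out.println(name);
        }
    }
}
```

If the list printed inside the Hadoop job is shorter than the local one, or is missing the mail parser, that would point at the job jar packaging rather than at detection itself. Inside a Hadoop task the relevant class loader may be the thread context class loader rather than the system one, so that argument may need to change.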


FWIW, this strikes me as an unexpected default. I would expect the default to
"just work", and if you don't want that behavior, you configure it not to
auto-detect.


-Grant
