On Mar 13, 2014, at 3:53 PM, Jukka Zitting <[email protected]> wrote:
> Hi,
>
> On Thu, Mar 13, 2014 at 3:41 PM, Grant Ingersoll <[email protected]> wrote:
>> But why would that test fail in the Tika dev environment?
>
> The DefaultParser instance returned by
> TikaConfig.getDefaultConfig().getParser() doesn't auto-detect the
> content type of the input document, so unless you explicitly specify
> the content type in the input metadata, it won't know how to parse the
> document.

So how come the same code works in a non-Hadoop env, where it is auto-detecting? I don't set the content type in either case. It's the same code; the only difference, I believe, is what Nick pointed out: I missed packaging something, so message/rfc822 is not a registered mime type in the Hadoop job jar. But, according to this statement, no documents should be auto-detected, right?

FWIW, this strikes me as an unexpected default. I would expect the default to "just work", and if you don't want that behavior, you configure it not to auto-detect.

-Grant
