Hi,

On Fri, Mar 14, 2014 at 5:13 PM, Grant Ingersoll <[email protected]> wrote:
> On Mar 13, 2014, at 3:53 PM, Jukka Zitting <[email protected]> wrote:
>> On Thu, Mar 13, 2014 at 3:41 PM, Grant Ingersoll <[email protected]> wrote:
>>> But why would that test fail in the Tika dev environment?
>>
>> The DefaultParser instance returned by
>> TikaConfig.getDefaultConfig().getParser() doesn't auto-detect the
>> content type of the input document, so unless you explicitly specify
>> the content type in the input metadata, it won't know how to parse the
>> document.
>
> So how come the same code works in non-Hadoop env where it is auto-detecting?

What is the code you're using in your Hadoop deployment,
TikaConfig.getDefaultConfig().getParser()?

> FWIW, this strikes me as an unexpected default. I would expect the default
> to "just work"
> and that if you don't want that behavior, you configure to not auto-detect.

The Parser and Detector interfaces are distinct by design; you need
AutoDetectParser to make them work together. The
TikaConfig.getParser() method just returns a composite of all the
configured parsers, with none of the extra functionality like
auto-detection or zip-bomb prevention that higher level code provides.

For something that "just works", I would suggest using the Tika facade
that hides the details of how these different components are wired
together.

BR,

Jukka Zitting
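
To illustrate the Parser/Detector separation described in the reply above, here is a minimal, self-contained sketch. It is a toy model and not Tika's code: `ToyDetector`, `ToyParser`, `ToyCompositeParser`, `ToyAutoDetectParser`, the `Content-Type` key handling, and the empty-string result for an unknown type are all assumptions made for illustration. It only shows why a composite that dispatches on a declared content type extracts nothing when no type is supplied, while a wrapper that runs detection first does.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Toy model only: every type here is hypothetical and is NOT one of Tika's classes.
class DetectionSketch {

    static final String CONTENT_TYPE = "Content-Type";

    /** Guesses a content type from the raw bytes. */
    interface ToyDetector {
        String detect(byte[] input);
    }

    /** Extracts text from the bytes, consulting the supplied metadata. */
    interface ToyParser {
        String parse(byte[] input, Map<String, String> metadata);
    }

    /** Dispatches on the content type already declared in the metadata; never detects. */
    static class ToyCompositeParser implements ToyParser {
        private final Map<String, ToyParser> parsersByType = new HashMap<>();

        void register(String type, ToyParser parser) {
            parsersByType.put(type, parser);
        }

        @Override
        public String parse(byte[] input, Map<String, String> metadata) {
            ToyParser parser = parsersByType.get(metadata.get(CONTENT_TYPE));
            // Toy behavior: with no declared (or no known) type, nothing is extracted.
            return parser == null ? "" : parser.parse(input, metadata);
        }
    }

    /** Runs detection first, records the result in the metadata, then delegates. */
    static class ToyAutoDetectParser implements ToyParser {
        private final ToyDetector detector;
        private final ToyParser delegate;

        ToyAutoDetectParser(ToyDetector detector, ToyParser delegate) {
            this.detector = detector;
            this.delegate = delegate;
        }

        @Override
        public String parse(byte[] input, Map<String, String> metadata) {
            metadata.put(CONTENT_TYPE, detector.detect(input));
            return delegate.parse(input, metadata);
        }
    }

    /** Builds a composite with two deliberately naive parsers. */
    static ToyCompositeParser newComposite() {
        ToyCompositeParser composite = new ToyCompositeParser();
        composite.register("text/plain",
                (in, md) -> new String(in, StandardCharsets.UTF_8));
        composite.register("text/html",
                (in, md) -> new String(in, StandardCharsets.UTF_8).replaceAll("<[^>]*>", ""));
        return composite;
    }

    /** Naive detector: input starting with '<' is treated as HTML, anything else as text. */
    static ToyDetector newDetector() {
        return in -> in.length > 0 && in[0] == '<' ? "text/html" : "text/plain";
    }

    public static void main(String[] args) {
        byte[] doc = "<p>hello</p>".getBytes(StandardCharsets.UTF_8);
        ToyCompositeParser composite = newComposite();
        ToyParser auto = new ToyAutoDetectParser(newDetector(), composite);

        // No content type in the metadata: the composite has nothing to dispatch on.
        System.out.println("composite: [" + composite.parse(doc, new HashMap<>()) + "]");
        // The wrapper detects the type first, so the HTML parser is selected.
        System.out.println("auto:      [" + auto.parse(doc, new HashMap<>()) + "]");
    }
}
```

In this sketch the composite works only if the caller puts a content type into the metadata map before calling `parse`, which mirrors the "explicitly specify the content type in the input metadata" remark quoted earlier in the thread. In Tika itself, the pieces named in the reply (AutoDetectParser and the Tika facade) are the ones to look at for the real wiring.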
