On Mar 14, 2014, at 9:31 PM, Jukka Zitting <[email protected]> wrote:
> Hi,
>
> On Fri, Mar 14, 2014 at 5:13 PM, Grant Ingersoll <[email protected]> wrote:
>> On Mar 13, 2014, at 3:53 PM, Jukka Zitting <[email protected]> wrote:
>>> On Thu, Mar 13, 2014 at 3:41 PM, Grant Ingersoll <[email protected]> wrote:
>>>> But why would that test fail in the Tika dev environment?
>>>
>>> The DefaultParser instance returned by
>>> TikaConfig.getDefaultConfig().getParser() doesn't auto-detect the
>>> content type of the input document, so unless you explicitly specify
>>> the content type in the input metadata, it won't know how to parse the
>>> document.
>>
>> So how come the same code works in a non-Hadoop env where it is auto-detecting?
>
> What is the code you're using in your Hadoop deployment,
> TikaConfig.getDefaultConfig().getParser()?

Ah, I see what is going on. The code is from Behemoth (so not my code), but the error is mine in that I must be missing a library in my packaging, so it's not getting the supported type. Behemoth does set the content type independently of the parsers.

>> FWIW, this strikes me as an unexpected default. I would expect the default
>> to "just work", and that if you don't want that behavior, you configure it
>> to not auto-detect.
>
> The Parser and Detector interfaces are distinct by design; you need
> AutoDetectParser to make them work together. The
> TikaConfig.getParser() method just returns a composite of all the
> configured parsers, with none of the extra functionality like
> auto-detection or zip-bomb prevention that higher-level code provides.
>
> For something that "just works", I would suggest using the Tika facade
> that hides the details of how these different components are wired
> together.

Gotcha! Thanks everyone.
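[Editor's note: a minimal sketch of the distinction Jukka describes, assuming Tika 1.x (tika-core plus tika-parsers) on the classpath. The class name and sample text are illustrative, not from the thread.]

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import org.apache.tika.Tika;
import org.apache.tika.config.TikaConfig;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;

public class ParserSketch {
    public static void main(String[] args) throws Exception {
        byte[] doc = "Hello, Tika!".getBytes("UTF-8");

        // 1) The composite parser from TikaConfig does NOT auto-detect:
        //    the caller must supply the content type in the input metadata.
        Parser composite = TikaConfig.getDefaultConfig().getParser();
        Metadata meta = new Metadata();
        meta.set(Metadata.CONTENT_TYPE, "text/plain"); // without this, no parser matches
        BodyContentHandler explicitType = new BodyContentHandler();
        try (InputStream in = new ByteArrayInputStream(doc)) {
            composite.parse(in, explicitType, meta, new ParseContext());
        }
        System.out.println(explicitType.toString().trim());

        // 2) AutoDetectParser puts a Detector in front of the same parsers,
        //    so no explicit content type is needed.
        Parser auto = new AutoDetectParser();
        BodyContentHandler detected = new BodyContentHandler();
        try (InputStream in = new ByteArrayInputStream(doc)) {
            auto.parse(in, detected, new Metadata(), new ParseContext());
        }
        System.out.println(detected.toString().trim());

        // 3) The Tika facade hides the wiring entirely.
        Tika tika = new Tika();
        System.out.println(tika.parseToString(new ByteArrayInputStream(doc)).trim());
    }
}
```

All three branches extract the same text; the difference is who is responsible for determining the content type.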
