+1, was going to suggest the same thing - that is, using the Tika facade.

Cheers,
Chris
------------------------
Chris Mattmann
[email protected]

-----Original Message-----
From: Jukka Zitting <[email protected]>
Reply-To: <[email protected]>
Date: Friday, March 14, 2014 6:31 PM
To: Tika Users <[email protected]>
Subject: Re: Parsers, DefaultConfig and such

>Hi,
>
>On Fri, Mar 14, 2014 at 5:13 PM, Grant Ingersoll <[email protected]> wrote:
>> On Mar 13, 2014, at 3:53 PM, Jukka Zitting <[email protected]> wrote:
>>> On Thu, Mar 13, 2014 at 3:41 PM, Grant Ingersoll <[email protected]> wrote:
>>>> But why would that test fail in the Tika dev environment?
>>>
>>> The DefaultParser instance returned by
>>> TikaConfig.getDefaultConfig().getParser() doesn't auto-detect the
>>> content type of the input document, so unless you explicitly specify
>>> the content type in the input metadata, it won't know how to parse
>>> the document.
>>
>> So how come the same code works in a non-Hadoop env where it is
>> auto-detecting?
>
>What is the code you're using in your Hadoop deployment,
>TikaConfig.getDefaultConfig().getParser()?
>
>> FWIW, this strikes me as an unexpected default. I would expect the
>> default to "just work" and that if you don't want that behavior,
>> you configure it to not auto-detect.
>
>The Parser and Detector interfaces are distinct by design; you need
>AutoDetectParser to make them work together. The
>TikaConfig.getParser() method just returns a composite of all the
>configured parsers, with none of the extra functionality like
>auto-detection or zip-bomb prevention that higher level code provides.
>
>For something that "just works", I would suggest using the Tika facade
>that hides the details of how these different components are wired
>together.
>
>BR,
>
>Jukka Zitting
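[For the archives: a minimal sketch of the two approaches Jukka suggests, assuming Apache Tika (e.g. the tika-parsers artifact) is on the classpath. The file name "document.pdf" is just a placeholder. The point of contrast is that TikaConfig.getDefaultConfig().getParser() would require the content type to be set in the Metadata up front, while both variants below detect it from the stream.]

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.tika.Tika;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;

public class FacadeExample {
    public static void main(String[] args) throws Exception {
        // Simplest option: the Tika facade "just works" because it
        // runs type detection internally before picking a parser.
        Tika tika = new Tika();
        String text = tika.parseToString(Paths.get("document.pdf").toFile());
        System.out.println(text);

        // Slightly lower level: AutoDetectParser combines the
        // configured Detector with the composite of all parsers.
        // No content type needs to be set in the Metadata; the
        // parser detects it from the stream and records it there.
        AutoDetectParser parser = new AutoDetectParser();
        BodyContentHandler handler = new BodyContentHandler();
        Metadata metadata = new Metadata();
        try (InputStream stream = Files.newInputStream(Paths.get("document.pdf"))) {
            parser.parse(stream, handler, metadata);
        }
        System.out.println(metadata.get(Metadata.CONTENT_TYPE));
        System.out.println(handler.toString());
    }
}
```

Note the facade also handles resource management and zip-bomb protection for you, which the raw composite parser from TikaConfig does not.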
