Hi,

On Fri, Mar 14, 2014 at 5:13 PM, Grant Ingersoll <[email protected]> wrote:
> On Mar 13, 2014, at 3:53 PM, Jukka Zitting <[email protected]> wrote:
>> On Thu, Mar 13, 2014 at 3:41 PM, Grant Ingersoll <[email protected]> wrote:
>>> But why would that test fail in the Tika dev environment?
>>
>> The DefaultParser instance returned by
>> TikaConfig.getDefaultConfig().getParser() doesn't auto-detect the
>> content type of the input document, so unless you explicitly specify
>> the content type in the input metadata, it won't know how to parse the
>> document.
>
> So how come the same code works in non-Hadoop env where it is auto-detecting?

What code are you using in your Hadoop deployment? Is it
TikaConfig.getDefaultConfig().getParser()?

> FWIW, this strikes me as an unexpected default.  I would expect the
> default to "just work" and that if you don't want that behavior, you
> configure to not auto-detect.

The Parser and Detector interfaces are distinct by design; you need
AutoDetectParser to make them work together. The
TikaConfig.getParser() method just returns a composite of all the
configured parsers, with none of the extra functionality like
auto-detection or zip-bomb prevention that higher-level code provides.
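
To illustrate the split, here is a minimal, self-contained sketch of
the idea. It is a toy model, not Tika's actual code: ToyParser,
ToyDetector, ToyCompositeParser, ToyAutoDetectParser and ToyDemo are
all hypothetical names made up for this example. The composite only
dispatches on the content type it is given, and a separate wrapper
combines it with a detector.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-ins for illustration only; NOT Tika's real interfaces.
interface ToyDetector {
    // Guesses a content type from the raw bytes.
    String detect(byte[] input);
}

interface ToyParser {
    // Extracts text, given the content type the caller claims (may be null).
    String parse(byte[] input, String contentType);
}

// A composite of registered parsers. It dispatches purely on the supplied
// content type and never inspects the bytes to guess one.
final class ToyCompositeParser implements ToyParser {
    private final Map<String, ToyParser> parsers = new LinkedHashMap<>();

    ToyCompositeParser register(String contentType, ToyParser parser) {
        parsers.put(contentType, parser);
        return this;
    }

    @Override
    public String parse(byte[] input, String contentType) {
        ToyParser delegate = contentType == null ? null : parsers.get(contentType);
        if (delegate == null) {
            return ""; // no usable type: this toy model extracts nothing
        }
        return delegate.parse(input, contentType);
    }
}

// The wrapper that makes detector and parser work together: when the caller
// gives no type, it asks the detector first and then delegates.
final class ToyAutoDetectParser implements ToyParser {
    private final ToyDetector detector;
    private final ToyParser delegate;

    ToyAutoDetectParser(ToyDetector detector, ToyParser delegate) {
        this.detector = detector;
        this.delegate = delegate;
    }

    @Override
    public String parse(byte[] input, String contentType) {
        String type = contentType != null ? contentType : detector.detect(input);
        return delegate.parse(input, type);
    }
}

final class ToyDemo {
    // A composite that only knows how to handle plain text.
    static ToyCompositeParser composite() {
        return new ToyCompositeParser().register("text/plain",
                (input, type) -> new String(input, StandardCharsets.UTF_8));
    }

    // A deliberately naive detector: "%PDF" prefix means PDF, else plain text.
    static ToyDetector detector() {
        return input -> input.length >= 4 && input[0] == '%' && input[1] == 'P'
                && input[2] == 'D' && input[3] == 'F'
                ? "application/pdf" : "text/plain";
    }
}
```

In this model, ToyDemo.composite().parse(bytes, null) yields an empty
string for a plain-text document, while wrapping the same composite in
ToyAutoDetectParser with ToyDemo.detector() returns the text. That
mirrors the difference between the bare composite and the
auto-detecting wrapper described above.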

For something that "just works", I would suggest using the Tika facade
that hides the details of how these different components are wired
together.

BR,

Jukka Zitting
