On Mar 14, 2014, at 9:31 PM, Jukka Zitting <[email protected]> wrote:

> Hi,
> 
> On Fri, Mar 14, 2014 at 5:13 PM, Grant Ingersoll <[email protected]> wrote:
>> On Mar 13, 2014, at 3:53 PM, Jukka Zitting <[email protected]> wrote:
>>> On Thu, Mar 13, 2014 at 3:41 PM, Grant Ingersoll <[email protected]> 
>>> wrote:
>>>> But why would that test fail in the Tika dev environment?
>>> 
>>> The DefaultParser instance returned by
>>> TikaConfig.getDefaultConfig().getParser() doesn't auto-detect the
>>> content type of the input document, so unless you explicitly specify
>>> the content type in the input metadata, it won't know how to parse the
>>> document.
>> 
>> So how come the same code works in non-Hadoop env where it is auto-detecting?
> 
> What is the code you're using in your Hadoop deployment,
> TikaConfig.getDefaultConfig().getParser()?

Ah, I see what is going on.  The code is from Behemoth (so not my code), but 
the error is mine: I must be missing a library in my packaging, so the 
supported type isn't being registered.  Behemoth does set the content type 
independently of the parsers.
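For anyone following along, here is a minimal sketch of what Jukka describes: parsing with TikaConfig's default composite parser while supplying the content type explicitly in the metadata, as Behemoth does. This is illustrative only; it assumes tika-core and tika-parsers (Tika 1.x class names) are on the classpath, and the file path and MIME type are placeholders.

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.tika.config.TikaConfig;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;

public class ExplicitTypeExample {
    public static void main(String[] args) throws Exception {
        // Composite of all configured parsers; it does NOT auto-detect.
        Parser parser = TikaConfig.getDefaultConfig().getParser();

        Metadata metadata = new Metadata();
        // Without this hint the composite parser has no way to pick a
        // concrete parser for the stream, so no content is extracted.
        metadata.set(Metadata.CONTENT_TYPE, "application/pdf");

        BodyContentHandler handler = new BodyContentHandler();
        try (InputStream stream = Files.newInputStream(Paths.get(args[0]))) {
            parser.parse(stream, handler, metadata, new ParseContext());
        }
        System.out.println(handler.toString());
    }
}
```

If the parser for the declared type is missing from the packaged jars, the composite silently falls through to the empty parser, which matches the symptom above.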


> 
>> FWIW, this strikes me as an unexpected default.  I would expect the default 
>> to "just work"
>> and that if you don't want that behavior, you configure to not auto-detect.
> 
> The Parser and Detector interfaces are distinct by design; you need
> AutoDetectParser to make them work together. The
> TikaConfig.getParser() method just returns a composite of all the
> configured parsers, with none of the extra functionality like
> auto-detection or zip-bomb prevention that higher level code provides.
> 
> For something that "just works", I would suggest using the Tika facade
> that hides the details of how these different components are wired
> together.
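To make that concrete, here is a sketch of the two "just works" options Jukka mentions: AutoDetectParser, which wires a Detector in front of the configured parsers, and the Tika facade, which hides the wiring entirely. Again an assumption-laden example (tika-parsers on the classpath, file path as a placeholder argument), not a definitive implementation.

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.tika.Tika;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

public class AutoDetectExample {
    public static void main(String[] args) throws Exception {
        // Option 1: AutoDetectParser detects the content type itself,
        // so no type hint is needed in the metadata.
        AutoDetectParser parser = new AutoDetectParser();
        BodyContentHandler handler = new BodyContentHandler();
        try (InputStream stream = Files.newInputStream(Paths.get(args[0]))) {
            parser.parse(stream, handler, new Metadata(), new ParseContext());
        }
        System.out.println(handler.toString());

        // Option 2: the Tika facade, the simplest entry point.
        Tika tika = new Tika();
        System.out.println(tika.parseToString(Paths.get(args[0]).toFile()));
    }
}
```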

Gotcha!

Thanks everyone.
