+1, I was going to suggest the same thing: use the Tika facade.
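For anyone following along, a minimal sketch of what the facade looks like in practice (the file path is just a placeholder; adjust for your own input):

```java
import java.io.File;
import org.apache.tika.Tika;

public class TikaFacadeExample {
    public static void main(String[] args) throws Exception {
        // The facade wires together detection and parsing internally,
        // so there is no need to set the content type in the metadata.
        Tika tika = new Tika();

        File input = new File("document.pdf"); // placeholder path

        // Detect the media type (by name/content) and extract the text
        String type = tika.detect(input.getName());
        String text = tika.parseToString(input);

        System.out.println(type);
        System.out.println(text);
    }
}
```

Under the hood this is essentially AutoDetectParser plus sensible defaults, which is why it "just works" compared to TikaConfig.getDefaultConfig().getParser().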

Cheers,
Chris

------------------------
Chris Mattmann
[email protected]




-----Original Message-----
From: Jukka Zitting <[email protected]>
Reply-To: <[email protected]>
Date: Friday, March 14, 2014 6:31 PM
To: Tika Users <[email protected]>
Subject: Re: Parsers, DefaultConfig and such

>Hi,
>
>On Fri, Mar 14, 2014 at 5:13 PM, Grant Ingersoll <[email protected]>
>wrote:
>> On Mar 13, 2014, at 3:53 PM, Jukka Zitting <[email protected]>
>>wrote:
>>> On Thu, Mar 13, 2014 at 3:41 PM, Grant Ingersoll <[email protected]>
>>>wrote:
>>>> But why would that test fail in the Tika dev environment?
>>>
>>> The DefaultParser instance returned by
>>> TikaConfig.getDefaultConfig().getParser() doesn't auto-detect the
>>> content type of the input document, so unless you explicitly specify
>>> the content type in the input metadata, it won't know how to parse the
>>> document.
>>
>> So how come the same code works in non-Hadoop env where it is
>>auto-detecting?
>
>What is the code you're using in your Hadoop deployment,
>TikaConfig.getDefaultConfig().getParser()?
>
>> FWIW, this strikes me as an unexpected default.  I would expect the
>>default to "just work"
>> and that if you don't want that behavior, you configure to not
>>auto-detect.
>
>The Parser and Detector interfaces are distinct by design; you need
>AutoDetectParser to make them work together. The
>TikaConfig.getParser() method just returns a composite of all the
>configured parsers, with none of the extra functionality like
>auto-detection or zip-bomb prevention that higher-level code provides.
>
>For something that "just works", I would suggest using the Tika facade
>that hides the details of how these different components are wired
>together.
>
>BR,
>
>Jukka Zitting

