On 15 March 2014 10:11, Grant Ingersoll <[email protected]> wrote:

>
> On Mar 14, 2014, at 9:31 PM, Jukka Zitting <[email protected]>
> wrote:
>
> > Hi,
> >
> > On Fri, Mar 14, 2014 at 5:13 PM, Grant Ingersoll <[email protected]> wrote:
> >> On Mar 13, 2014, at 3:53 PM, Jukka Zitting <[email protected]> wrote:
> >>> On Thu, Mar 13, 2014 at 3:41 PM, Grant Ingersoll <[email protected]> wrote:
> >>>> But why would that test fail in the Tika dev environment?
> >>>
> >>> The DefaultParser instance returned by
> >>> TikaConfig.getDefaultConfig().getParser() doesn't auto-detect the
> >>> content type of the input document, so unless you explicitly specify
> >>> the content type in the input metadata, it won't know how to parse the
> >>> document.
> >>
> >> So how come the same code works in non-Hadoop env where it is
> >> auto-detecting?
> >
> > What is the code you're using in your Hadoop deployment,
> > TikaConfig.getDefaultConfig().getParser()?
>
> Ah, I see what is going on.  The code is from Behemoth (so not my code),
> but the error is mine in that I must be missing a library in my packaging
> so it's not getting the supported type.  Behemoth does set the content type
> independent of the parsers.
>

Yep, we do indeed separate detection from parsing and tell the parser what
the MIME type is. This gives us more flexibility, as the MIME type can
either be specified by the user or guessed by Tika.
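
To make the pattern concrete, here is a minimal sketch in plain Java. It deliberately does not use Tika or Behemoth classes: DetectThenParse, guessType, resolveType and the parser registry are all hypothetical names made up for illustration, and the extension-based guess is only a stand-in for real detection. It shows the type being resolved first (user-supplied value wins, otherwise a guess) and then handed to the parsing step, which fails if no parser is registered for that type, much like a parser library missing from the packaging.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

// Hypothetical stand-in for the pattern; none of these names are Tika or Behemoth APIs.
class DetectThenParse {

    // Parsers keyed by MIME type. A type with no entry models a parser
    // library that was left out of the deployment package.
    static final Map<String, Function<String, String>> PARSERS = new HashMap<>();
    static {
        PARSERS.put("text/plain", content -> content.trim());
        PARSERS.put("text/html", content -> content.replaceAll("<[^>]*>", "").trim());
    }

    // Detection step: a naive guess from the file extension (assumption:
    // a real detector would also look at the content itself).
    static String guessType(String fileName) {
        String lower = fileName.toLowerCase(Locale.ROOT);
        if (lower.endsWith(".html") || lower.endsWith(".htm")) return "text/html";
        if (lower.endsWith(".txt")) return "text/plain";
        if (lower.endsWith(".pdf")) return "application/pdf";
        return "application/octet-stream";
    }

    // A type specified by the user takes precedence over the guessed one.
    static String resolveType(String userType, String fileName) {
        return (userType != null && !userType.isEmpty()) ? userType : guessType(fileName);
    }

    // Parsing step: it is told the MIME type and never detects it itself.
    static String parse(String mimeType, String content) {
        Function<String, String> parser = PARSERS.get(mimeType);
        if (parser == null) {
            throw new IllegalStateException("No parser registered for " + mimeType);
        }
        return parser.apply(content);
    }

    public static void main(String[] args) {
        // Guessed type, parser available.
        String guessed = resolveType(null, "page.html");
        System.out.println(guessed + " -> " + parse(guessed, "<p>Hello</p>"));

        // User-specified type overrides the guess.
        String forced = resolveType("text/plain", "page.html");
        System.out.println(forced + " -> " + parse(forced, "<p>Hello</p>"));

        // Type is resolved fine, but no parser is registered for it.
        String pdf = resolveType(null, "report.pdf");
        try {
            parse(pdf, "%PDF-1.4");
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

The last case is the interesting one here: the type is known, but parsing still fails because nothing on the classpath handles it.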

J.


-- 

Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
