The parameter

<property>
  <name>mime.type.magic</name>
  <value>true</value>
  <description>Defines if the mime content type detector uses magic
  resolution.
  </description>
</property>

should trigger MIME type detection based on the content rather than on what
the server returns. This is not really a Tika issue, because the choice of
parser is based on the MIME type that Nutch determines.
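
To illustrate what content-based detection means, here is a minimal Java
sketch. It is not the Nutch or Tika code: the class MimeSniffSketch and the
method resolveMimeType are made up for this example, and it only recognises
documents that start with an HTML doctype or an <html> tag. The useMagic flag
plays the role that mime.type.magic plays in the configuration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Illustration only: a hypothetical stand-in for content-based ("magic")
// MIME detection. The real detector checks far more than this.
class MimeSniffSketch {

    // Returns the MIME type that would be used to pick a parser.
    // serverType: the Content-Type reported by the server
    // content:    the raw bytes of the fetched document
    // useMagic:   assumed equivalent of mime.type.magic
    static String resolveMimeType(String serverType, byte[] content, boolean useMagic) {
        if (!useMagic || content == null) {
            return serverType; // trust the server header as-is
        }
        // Inspect only the start of the document.
        int len = Math.min(content.length, 1024);
        String head = new String(content, 0, len, StandardCharsets.ISO_8859_1)
                .trim()
                .toLowerCase(Locale.ROOT);
        if (head.startsWith("<!doctype html") || head.startsWith("<html")) {
            return "text/html"; // content looks like HTML, override the header
        }
        return serverType; // nothing recognised, keep the server's answer
    }

    public static void main(String[] args) {
        byte[] page = "<!DOCTYPE html><html><body>hi</body></html>"
                .getBytes(StandardCharsets.UTF_8);
        // Server claims text/plain, but the content is HTML.
        System.out.println(resolveMimeType("text/plain", page, true));  // text/html
        System.out.println(resolveMimeType("text/plain", page, false)); // text/plain
    }
}
```

With detection on, the page served as text/plain is treated as text/html, so
an HTML parser would be selected instead of the plain-text one.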

The mime.type.magic parameter should be set to true by default. I thought we
had more options, but I am probably confusing this with the language
identification.

Julien


On 25 November 2012 14:16, Sourajit Basak <[email protected]> wrote:

> DEBUG tika.TikaParser - Using Tika parser
> org.apache.tika.parser.txt.TXTParser for mime-type text/plain
>
> The above indicates Tika is fired. But somehow I need to tell Tika to use
> HtmlParser for mime-type text/plain. Have to dig into Tika docs.
>
> Is it possible to do anything in Nutch ?
>
> On Sun, Nov 25, 2012 at 7:27 PM, Sourajit Basak <[email protected]> wrote:
>
> > Some of my target webpages return a mime type of text/plain though they
> > are htmls. I changed "http.accept" to include text/plain and configured
> > both tika & parse-html to see if those can be parsed. However, both seem
> > to produce no content.
> >
> > I changed parse-plugins.xml & the corresponding plugin.xml's to match
> > this mime type.
> >
> > Has anyone encountered this problem ?
> >
> >
> >
>



-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
