Thanks Julien.

I can get the outlinks now. Let me check whether I can get the raw content
as well; I will update this thread.
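
For anyone finding this thread later, here is a rough sketch of the idea behind content-based detection as I understand it from the mime.type.magic setting quoted below: look at the first bytes of the document and only fall back to the server-declared type when nothing matches. The class and method names are made up for illustration, and the matching rules are my own simplification, not what Nutch or Tika actually do internally.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical illustration only; NOT the Nutch or Tika implementation.
class MimeSniffSketch {

    // Guess a MIME type from the start of the content; if no known
    // marker is found, keep the type declared by the server.
    static String detect(byte[] content, String declaredType) {
        // Only inspect a small prefix, as magic-based detectors typically do.
        int len = Math.min(content.length, 512);
        String head = new String(content, 0, len, StandardCharsets.ISO_8859_1)
                .trim()
                .toLowerCase(Locale.ROOT);
        // Assumed markers for HTML; real detectors use a richer rule set.
        if (head.startsWith("<!doctype html") || head.startsWith("<html")) {
            return "text/html";
        }
        return declaredType;
    }

    public static void main(String[] args) {
        byte[] page = "<!DOCTYPE html><html><body>hi</body></html>"
                .getBytes(StandardCharsets.UTF_8);
        // Server said text/plain, but the content looks like HTML.
        System.out.println(detect(page, "text/plain"));
    }
}
```

With detection like this, a page served as text/plain but starting with an HTML doctype would be routed to an HTML parser, while genuine plain text keeps its declared type.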

On Mon, Nov 26, 2012 at 2:37 PM, Julien Nioche <
[email protected]> wrote:

> The parameter
>
> <property>
>   <name>mime.type.magic</name>
>   <value>true</value>
>   <description>Defines if the mime content type detector uses magic
>   resolution.
>   </description>
> </property>
>
> should trigger mime type detection based on the content rather than on what
> the server returns. This is not a Tika issue as such, since the selection of
> which parser to use is based on the mime type that Nutch uses.
>
> The param above should be set to true by default. I thought we had more
> options, but I am probably confusing this with language identification.
>
> Julien
>
>
> On 25 November 2012 14:16, Sourajit Basak <[email protected]>
> wrote:
>
> > DEBUG tika.TikaParser - Using Tika parser
> > org.apache.tika.parser.txt.TXTParser for mime-type text/plain
> >
> > The above indicates that Tika is invoked. But somehow I need to tell Tika
> > to use HtmlParser for mime-type text/plain. I will have to dig into the
> > Tika docs.
> >
> > Is it possible to do anything in Nutch?
> >
> > On Sun, Nov 25, 2012 at 7:27 PM, Sourajit Basak <[email protected]>
> > wrote:
> >
> > > Some of my target webpages return a mime type of text/plain though they
> > > are HTML. I changed "http.accept" to include text/plain and configured
> > > both tika & parse-html to see if those can be parsed. However, both seem
> > > to produce no content.
> > >
> > > I changed parse-plugins.xml & the corresponding plugin.xml files to
> > > match this mime type.
> > >
> > > Has anyone encountered this problem?
> > >
> > >
> > >
> >
>
>
>
> --
> Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
> http://twitter.com/digitalpebble
>
