Thanks, Julien. I can get the outlinks now; let me check whether I can get the raw content too. I will update this thread.
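
For anyone finding this thread later, here is a minimal sketch of how the override might look in conf/nutch-site.xml. It assumes the usual Nutch setup, where values in nutch-site.xml take precedence over nutch-default.xml. The property name and value are the ones Julien quotes below; the file location and the comment are my assumptions, so adjust them to your install.

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Assumed override: detect the MIME type from the content itself
       (magic resolution) rather than trusting the server's Content-Type. -->
  <property>
    <name>mime.type.magic</name>
    <value>true</value>
  </property>
</configuration>
```

If the property is already true in nutch-default.xml, as Julien expects, this entry only makes the setting explicit.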

On Mon, Nov 26, 2012 at 2:37 PM, Julien Nioche <[email protected]> wrote:

> The parameter
>
> <property>
>   <name>mime.type.magic</name>
>   <value>true</value>
>   <description>Defines if the mime content type detector uses magic
>   resolution.
>   </description>
> </property>
>
> should trigger the mime type detection based on the content and not on what
> the server returns. It is not a Tika issue as such as the selection of what
> parser to use is based on the mimetype that Nutch uses.
>
> The param above should be set to true by default. I thought we had more
> options but am probably confusing with the language identification
>
> Julien
>
>
> On 25 November 2012 14:16, Sourajit Basak <[email protected]> wrote:
>
> > DEBUG tika.TikaParser - Using Tika parser
> > org.apache.tika.parser.txt.TXTParser for mime-type text/plain
> >
> > The above indicates Tika is fired. But somehow I need to tell Tika to use
> > HtmlParser for mime-type text/plain. Have to dig into Tika docs.
> >
> > Is it possible to do anything in Nutch?
> >
> > On Sun, Nov 25, 2012 at 7:27 PM, Sourajit Basak <[email protected]> wrote:
> >
> > > Some of my target webpages return a mime type of text/plain though they
> > > are htmls. I changed "http.accept" to include text/plain and configured
> > > both tika & parse-html to see if those can be parsed. However, both seem
> > > to produce no content.
> > >
> > > I changed parse-plugins.xml & the corresponding plugin.xml's to match
> > > this mime type.
> > >
> > > Has anyone encountered this problem?
>
> --
> Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
> http://twitter.com/digitalpebble

