Enabling this property slows down the parse phase drastically when documents with mime-type image/jpeg are encountered.
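
For reference, a minimal sketch of how the property could be switched off again, assuming the override goes into conf/nutch-site.xml (the usual place for local overrides of nutch-default.xml). This is only an illustration of the trade-off, not a recommendation: with the value set to false there is no content-based detection, so pages served as text/plain that are really HTML would presumably be handed to the plain-text parser again.

```xml
<!-- Assumed location: conf/nutch-site.xml, inside the <configuration> element -->
<property>
  <name>mime.type.magic</name>
  <!-- false = no content-based (magic) mime type detection -->
  <value>false</value>
  <description>Defines if the mime content type detector uses magic
  resolution.
  </description>
</property>
```
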

On Mon, Nov 26, 2012 at 8:07 PM, Sourajit Basak <[email protected]> wrote:

> Thanks Julien.
>
> I can get the outlinks now, let me check if I can get the raw content. I
> will update this thread.
>
> On Mon, Nov 26, 2012 at 2:37 PM, Julien Nioche <[email protected]> wrote:
>
>> The parameter
>>
>> <property>
>>   <name>mime.type.magic</name>
>>   <value>true</value>
>>   <description>Defines if the mime content type detector uses magic
>>   resolution.
>>   </description>
>> </property>
>>
>> should trigger the mime type detection based on the content and not on
>> what the server returns. It is not a Tika issue as such as the selection
>> of what parser to use is based on the mimetype that Nutch uses.
>>
>> The param above should be set to true by default. I thought we had more
>> options but am probably confusing with the language identification
>>
>> Julien
>>
>> On 25 November 2012 14:16, Sourajit Basak <[email protected]> wrote:
>>
>> > DEBUG tika.TikaParser - Using Tika parser
>> > org.apache.tika.parser.txt.TXTParser for mime-type text/plain
>> >
>> > The above indicates Tika is fired. But somehow I need to tell Tika to
>> > use HtmlParser for mime-type text/plain. Have to dig into Tika docs.
>> >
>> > Is it possible to do anything in Nutch ?
>> >
>> > On Sun, Nov 25, 2012 at 7:27 PM, Sourajit Basak <[email protected]> wrote:
>> >
>> > > Some of my target webpages return a mime type of text/plain though
>> > > they are htmls. I changed "http.accept" to include text/plain and
>> > > configured both tika & parse-html to see if those can be parsed.
>> > > However, both seem to produce no content.
>> > >
>> > > I changed parse-plugins.xml & the corresponding plugin.xml's to
>> > > match this mime type.
>> > >
>> > > Has anyone encountered this problem ?
>>
>> --
>> Open Source Solutions for Text Engineering
>>
>> http://digitalpebble.blogspot.com/
>> http://www.digitalpebble.com
>> http://twitter.com/digitalpebble

