Ignore my last post. Neither Tika nor this property is slowing things down.

On Mon, Nov 26, 2012 at 10:50 PM, Sourajit Basak <[email protected]> wrote:
> Enabling this property slows down the parse phase drastically when
> encountered with mime-type image/jpeg.
>
>
> On Mon, Nov 26, 2012 at 8:07 PM, Sourajit Basak
> <[email protected]> wrote:
>
>> Thanks Julien.
>>
>> I can get the outlinks now, let me check if I can get the raw content. I
>> will update this thread.
>>
>>
>> On Mon, Nov 26, 2012 at 2:37 PM, Julien Nioche <
>> [email protected]> wrote:
>>
>>> The parameter
>>>
>>> <property>
>>>   <name>mime.type.magic</name>
>>>   <value>true</value>
>>>   <description>Defines if the mime content type detector uses magic
>>>   resolution.
>>>   </description>
>>> </property>
>>>
>>> should trigger the mime type detection based on the content and not on
>>> what the server returns. It is not a Tika issue as such as the selection
>>> of what parser to use is based on the mimetype that Nutch uses.
>>>
>>> The param above should be set to true by default. I thought we had more
>>> options but am probably confusing with the language identification
>>>
>>> Julien
>>>
>>>
>>> On 25 November 2012 14:16, Sourajit Basak <[email protected]>
>>> wrote:
>>>
>>> > DEBUG tika.TikaParser - Using Tika parser
>>> > org.apache.tika.parser.txt.TXTParser for mime-type text/plain
>>> >
>>> > The above indicates Tika is fired. But somehow I need to tell Tika to
>>> > use HtmlParser for mime-type text/plain. Have to dig into Tika docs.
>>> >
>>> > Is it possible to do anything in Nutch ?
>>> >
>>> > On Sun, Nov 25, 2012 at 7:27 PM, Sourajit Basak <
>>> > [email protected]> wrote:
>>> >
>>> > > Some of my target webpages return a mime type of text/plain though
>>> > > they are htmls. I changed "http.accept" to include text/plain and
>>> > > configured both tika & parse-html to see if those can be parsed.
>>> > > However, both seem to produce no content.
>>> > >
>>> > > I changed parse-plugins.xml & the corresponding plugin.xml's to
>>> > > match this mime type.
>>> > >
>>> > > Has anyone encountered this problem ?
>>> > >
>>>
>>> --
>>> *
>>> *Open Source Solutions for Text Engineering
>>>
>>> http://digitalpebble.blogspot.com/
>>> http://www.digitalpebble.com
>>> http://twitter.com/digitalpebble
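
For anyone reading this thread later: the idea behind content-based ("magic") detection, as described in the quoted reply, can be illustrated with a rough sketch. This is not how Nutch or Tika implement it; `sniff_mime`, `HTML_HINT` and `JPEG_MAGIC` are hypothetical names made up for the example, and real detectors check far more signatures. It only shows why a page served as text/plain can still end up treated as HTML once the bytes are inspected.

```python
import re

# Hypothetical, much-simplified stand-in for magic-based detection.
# Not part of Nutch or Tika; real detectors know many more signatures.
HTML_HINT = re.compile(rb"\s*(?:<!doctype\s+html|<html|<head|<body)", re.IGNORECASE)
JPEG_MAGIC = b"\xff\xd8\xff"  # JPEG files start with these three bytes


def sniff_mime(content: bytes, server_type: str) -> str:
    """Prefer what the bytes look like over the type the server declared."""
    head = content[:1024]  # only the start of the document is inspected
    if head.startswith(JPEG_MAGIC):
        return "image/jpeg"
    if HTML_HINT.match(head):
        return "text/html"
    # Nothing recognised: fall back to the server-declared type.
    return server_type


if __name__ == "__main__":
    page = b"<!DOCTYPE html><html><body>hello</body></html>"
    print(sniff_mime(page, "text/plain"))
```

With detection of this kind in front of parser selection, the HTML parser would be picked for such pages even though the server header says text/plain.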

