Can you briefly describe the problem here Sourajit?

On Thu, Jan 31, 2013 at 9:01 AM, Sourajit Basak <[email protected]> wrote:

> Seems to be related to NUTCH-374 but that shows as fixed.
>
> I have set Nutch to accept unlimited content size & this page is gzip
> encoded.
>
> On Thu, Jan 31, 2013 at 9:38 PM, Sourajit Basak <[email protected]> wrote:
>
>> Re-opening this thread.
>>
>> Using Nutch v1.5 try to get the parseText from this NYTimes url (Use
>> parse-html)
>>
>> http://www.nytimes.com/2013/01/31/technology/chinese-hackers-infiltrate-new-york-times-computers.html?pagewanted=2&_r=0&ref=global-home
>>
>> I do not get any content from the fetcher. This is my fetcher accept
>> params.
>> <property>
>> <name>http.accept</name>
>> <value>text/plain,text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8</value>
>> </property>
>>
>> On Mon, Nov 26, 2012 at 11:03 PM, Sourajit Basak <[email protected]> wrote:
>>
>>> Ignore my last post. Tika isn't slowing down, neither is this property.
>>>
>>> On Mon, Nov 26, 2012 at 10:50 PM, Sourajit Basak <[email protected]> wrote:
>>>
>>>> Enabling this property slows down the parse phase drastically when
>>>> encountered with mime-type image/jpeg.
>>>>
>>>> On Mon, Nov 26, 2012 at 8:07 PM, Sourajit Basak <[email protected]> wrote:
>>>>
>>>>> Thanks Julien.
>>>>>
>>>>> I can get the outlinks now, let me check if I can get the raw content.
>>>>> I will update this thread.
>>>>>
>>>>> On Mon, Nov 26, 2012 at 2:37 PM, Julien Nioche <[email protected]> wrote:
>>>>>
>>>>>> The parameter
>>>>>>
>>>>>> <property>
>>>>>> <name>mime.type.magic</name>
>>>>>> <value>true</value>
>>>>>> <description>Defines if the mime content type detector uses magic
>>>>>> resolution.
>>>>>> </description>
>>>>>> </property>
>>>>>>
>>>>>> should trigger the mime type detection based on the content and not on
>>>>>> what the server returns. It is not a Tika issue as such as the selection
>>>>>> of what parser to use is based on the mimetype that Nutch uses.
>>>>>>
>>>>>> The param above should be set to true by default. I thought we had more
>>>>>> options but am probably confusing with the language identification
>>>>>>
>>>>>> Julien
>>>>>>
>>>>>> On 25 November 2012 14:16, Sourajit Basak <[email protected]> wrote:
>>>>>>
>>>>>> > DEBUG tika.TikaParser - Using Tika parser
>>>>>> > org.apache.tika.parser.txt.TXTParser for mime-type text/plain
>>>>>> >
>>>>>> > The above indicates Tika is fired. But somehow I need to tell Tika to use
>>>>>> > HtmlParser for mime-type text/plain. Have to dig into Tika docs.
>>>>>> >
>>>>>> > Is it possible to do anything in Nutch ?
>>>>>> >
>>>>>> > On Sun, Nov 25, 2012 at 7:27 PM, Sourajit Basak <[email protected]> wrote:
>>>>>> >
>>>>>> > > Some of my target webpages return a mime type of text/plain though they
>>>>>> > > are htmls. I changed "http.accept" to include text/plain and configured
>>>>>> > > both tika & parse-html to see if those can be parsed. However, both seem
>>>>>> > > to produce no content.
>>>>>> > >
>>>>>> > > I changed parse-plugins.xml & the corresponding plugin.xml's to match
>>>>>> > > this mime type.
>>>>>> > >
>>>>>> > > Has anyone encountered this problem ?
>>>>>>
>>>>>> --
>>>>>> Open Source Solutions for Text Engineering
>>>>>>
>>>>>> http://digitalpebble.blogspot.com/
>>>>>> http://www.digitalpebble.com
>>>>>> http://twitter.com/digitalpebble

-- Lewis
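
For readers following the mime.type.magic discussion quoted above, here is a minimal sketch of what "detection based on the content and not on what the server returns" could look like. It is an illustration only and not the detector Nutch or Tika actually uses: the class name `MagicSniff`, the method `detect`, the 512-byte window and the two HTML prefixes are all assumptions made for this example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical helper, not part of Nutch or Tika.
public class MagicSniff {

    /**
     * Returns "text/html" when the leading bytes look like an HTML document;
     * otherwise falls back to the type declared by the server.
     */
    static String detect(byte[] content, String declaredType) {
        // Only inspect a small prefix; 512 bytes is an arbitrary choice.
        int n = Math.min(content.length, 512);
        String head = new String(content, 0, n, StandardCharsets.ISO_8859_1)
                .trim()
                .toLowerCase(Locale.ROOT);
        if (head.startsWith("<!doctype html") || head.startsWith("<html")) {
            return "text/html";
        }
        return declaredType;
    }

    public static void main(String[] args) {
        byte[] page = "<!DOCTYPE html><html><body>hi</body></html>"
                .getBytes(StandardCharsets.UTF_8);
        // The server claims text/plain, but the content itself says HTML.
        System.out.println(detect(page, "text/plain"));
    }
}
```

With a check of this kind the page from the original report would be routed to an HTML parser even though the server labels it text/plain, which is the behaviour Julien describes for mime.type.magic set to true.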

