This seems to be related to NUTCH-374, but that issue shows as fixed. I have set Nutch to accept unlimited content size, and this page is gzip-encoded.
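For reference, a minimal sketch of what an unlimited content size setting typically looks like in nutch-site.xml. This assumes the standard http.content.limit property, where a negative value disables truncation; the exact value used in this setup is an assumption, not taken from the thread.

```xml
<!-- Assumed setting: a negative limit means fetched content is not truncated -->
<property>
  <name>http.content.limit</name>
  <value>-1</value>
</property>
```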

On Thu, Jan 31, 2013 at 9:38 PM, Sourajit Basak <[email protected]> wrote:

> Re-opening this thread.
>
> Using Nutch v1.5 try to get the parseText from this NYTimes url (Use
> parse-html)
>
> http://www.nytimes.com/2013/01/31/technology/chinese-hackers-infiltrate-new-york-times-computers.html?pagewanted=2&_r=0&ref=global-home
>
> I do not get any content from the fetcher. This is my fetcher accept
> params.
> <property>
> <name>http.accept</name>
> <value>text/plain,text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8</value>
> </property>
>
>
> On Mon, Nov 26, 2012 at 11:03 PM, Sourajit Basak <[email protected]> wrote:
>
>> Ignore my last post. Tika isn't slowing down, neither is this property.
>>
>>
>> On Mon, Nov 26, 2012 at 10:50 PM, Sourajit Basak <[email protected]> wrote:
>>
>>> Enabling this property slows down the parse phase drastically when
>>> encountered with mime-type image/jpeg.
>>>
>>>
>>> On Mon, Nov 26, 2012 at 8:07 PM, Sourajit Basak <[email protected]> wrote:
>>>
>>>> Thanks Julien.
>>>>
>>>> I can get the outlinks now, let me check if I can get the raw content.
>>>> I will update this thread.
>>>>
>>>>
>>>> On Mon, Nov 26, 2012 at 2:37 PM, Julien Nioche <[email protected]> wrote:
>>>>
>>>>> The parameter
>>>>>
>>>>> <property>
>>>>> <name>mime.type.magic</name>
>>>>> <value>true</value>
>>>>> <description>Defines if the mime content type detector uses magic
>>>>> resolution.
>>>>> </description>
>>>>> </property>
>>>>>
>>>>> should trigger the mime type detection based on the content and not on
>>>>> what the server returns. It is not a Tika issue as such as the selection
>>>>> of what parser to use is based on the mimetype that Nutch uses.
>>>>>
>>>>> The param above should be set to true by default. I thought we had more
>>>>> options but am probably confusing with the language identification
>>>>>
>>>>> Julien
>>>>>
>>>>>
>>>>> On 25 November 2012 14:16, Sourajit Basak <[email protected]> wrote:
>>>>>
>>>>> > DEBUG tika.TikaParser - Using Tika parser
>>>>> > org.apache.tika.parser.txt.TXTParser for mime-type text/plain
>>>>> >
>>>>> > The above indicates Tika is fired. But somehow I need to tell Tika to
>>>>> > use HtmlParser for mime-type text/plain. Have to dig into Tika docs.
>>>>> >
>>>>> > Is it possible to do anything in Nutch ?
>>>>> >
>>>>> > On Sun, Nov 25, 2012 at 7:27 PM, Sourajit Basak <[email protected]> wrote:
>>>>> >
>>>>> > > Some of my target webpages return a mime type of text/plain though
>>>>> > > they are htmls. I changed "http.accept" to include text/plain and
>>>>> > > configured both tika & parse-html to see if those can be parsed.
>>>>> > > However, both seem to produce no content.
>>>>> > >
>>>>> > > I changed parse-plugins.xml & the corresponding plugin.xml's to
>>>>> > > match this mime type.
>>>>> > >
>>>>> > > Has anyone encountered this problem ?
>>>>>
>>>>> --
>>>>> Open Source Solutions for Text Engineering
>>>>>
>>>>> http://digitalpebble.blogspot.com/
>>>>> http://www.digitalpebble.com
>>>>> http://twitter.com/digitalpebble

