Hi, the given URL is a redirect (HTTP 303, at least when I try it) with no content, only the HTTP headers. I tried with curl and Nutch's parsechecker tool:

% bin/nutch parsechecker "http://www.nytimes.com/2013/01/31/technology/chinese-hackers-infiltrate-new-york-times-computers.html?pagewanted=2&_r=0&ref=global-home"
fetching: http://www.nytimes.com/...
...
Content Metadata: Vary=Host Date=Sat, 02 Feb 2013 15:01:18 GMT Content-Length=0 Location=http://www.nytimes.com/glogin?URI=http://www.nytimes.com/2013/01/31/technology/chinese-hackers-infiltrate-new-york-times-computers.html&OQ=pagewantedQ3D2Q26_rQ3D1Q26refQ3Dglobal-homeQ26&OP=548bb88dQ2FRtezRXVzRQ3DQ3DQ3DRfzDQ2AR(tgrHttzEREQ26Q271RQ26Q27R1Q27Rz5gfXtQ2At.VRgf_X5r5hfQ51gG5Hrh_X!_Q2AzHQ51z5hX5Q3DhVtHGhz_D5rhgtDeQ2Bz5HrQ5EfzDQ2A Set-Cookie=RMID=007f0100777d510d2a3e0045; Expires=Sun, 02 Feb 2014 15:01:18 GMT; Path=/; Domain=.nytimes.com; Content-Type=text/plain Connection=close Server=Apache
Parse Metadata: Content-Encoding=UTF-8 Content-Type=text/plain; charset=UTF-8

% curl -v "http://www.nytimes.com/2013/01/31/technology/chinese-hackers-infiltrate-new-york-times-computers.html?pagewanted=2&_r=0&ref=global-home" >/dev/null
...
> GET /2013/01/31/technology/chinese-hackers-infiltrate-new-york-times-computers.html?pagewanted=2&_r=0&ref=global-home HTTP/1.1
...
>
< HTTP/1.1 303 See Other
< Date: Sat, 02 Feb 2013 14:59:03 GMT
< Server: Apache
< Set-Cookie: RMID=007f01000e9f510d29b70033; Expires=Sun, 02 Feb 2014 14:59:03 GMT; Path=/; Domain=.nytimes.com;
< Vary: Host
< Location: http://www.nytimes.com/glogin?URI=http://www.nytimes.com/2013/01/31/technology/chinese-hackers-infiltrate-new-york-times-computers.html&OQ=pagewantedQ3D2Q26_rQ3D1Q26refQ3Dglobal-homeQ26&OP=f39d9b3aQ2FQ2AmQ51dQ2AKSdQ2A(((Q2AQ7Ddg_Q2ANm46JmmdUQ2AUCVMQ2ACVQ2AMVQ2AdQ274Q7DKm_mrSQ2A4Q7DtKQ276Q27!Q7DQ7E42Q27J6!tKyt_dJQ7EdQ27!KQ27(!SmJ2!dtgQ276!4mgQ51ndQ27J6GQ7Ddg_
< Content-Length: 0
< Connection: close
< Content-Type: text/plain

Sebastian

On 02/01/2013 05:47 AM, Sourajit Basak wrote:
> Here it goes.
>
> Try to dump the content from this url with the following settings.
> http://www.nytimes.com/2013/01/31/technology/chinese-hackers-infiltrate-new-york-times-computers.html?pagewanted=2&_r=0&ref=global-home
>
> <property>
>   <name>http.content.limit</name>
>   <value>-1</value>
> </property>
>
> This page is gzip encoded. You will see that the fetcher is unable to
> download any content. Check by inspecting the content-length.
> Initially I was thinking it to be a problem with the parse-html plugin but
> now it seems that the fetcher returns null content.
>
> This seemed related to NUTCH-374
>
> Let me know if you need further info.
>
> On Fri, Feb 1, 2013 at 1:54 AM, Lewis John Mcgibbney <[email protected]> wrote:
>
>> Can you briefly describe the problem here Sourajit?
>>
>> On Thu, Jan 31, 2013 at 9:01 AM, Sourajit Basak <[email protected]> wrote:
>>> Seems to be related to NUTCH-374 but that shows as fixed.
>>>
>>> I have set Nutch to accept unlimited content size & this page is gzip
>>> encoded.
>>>
>>> On Thu, Jan 31, 2013 at 9:38 PM, Sourajit Basak <[email protected]> wrote:
>>>
>>>> Re-opening this thread.
>>>>
>>>> Using Nutch v1.5 try to get the parseText from this NYTimes url (Use
>>>> parse-html)
>>>>
>>>> http://www.nytimes.com/2013/01/31/technology/chinese-hackers-infiltrate-new-york-times-computers.html?pagewanted=2&_r=0&ref=global-home
>>>>
>>>> I do not get any content from the fetcher. This is my fetcher accept
>>>> params.
>>>> <property>
>>>>   <name>http.accept</name>
>>>>   <value>text/plain,text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8</value>
>>>> </property>
>>>>
>>>> On Mon, Nov 26, 2012 at 11:03 PM, Sourajit Basak <[email protected]> wrote:
>>>>
>>>>> Ignore my last post. Tika isn't slowing down, neither is this property.
>>>>>
>>>>> On Mon, Nov 26, 2012 at 10:50 PM, Sourajit Basak <[email protected]> wrote:
>>>>>
>>>>>> Enabling this property slows down the parse phase drastically when
>>>>>> encountered with mime-type image/jpeg.
>>>>>>
>>>>>> On Mon, Nov 26, 2012 at 8:07 PM, Sourajit Basak <[email protected]> wrote:
>>>>>>
>>>>>>> Thanks Julien.
>>>>>>>
>>>>>>> I can get the outlinks now, let me check if I can get the raw content.
>>>>>>> I will update this thread.
>>>>>>>
>>>>>>> On Mon, Nov 26, 2012 at 2:37 PM, Julien Nioche <[email protected]> wrote:
>>>>>>>
>>>>>>>> The parameter
>>>>>>>>
>>>>>>>> <property>
>>>>>>>>   <name>mime.type.magic</name>
>>>>>>>>   <value>true</value>
>>>>>>>>   <description>Defines if the mime content type detector uses magic
>>>>>>>>   resolution.
>>>>>>>>   </description>
>>>>>>>> </property>
>>>>>>>>
>>>>>>>> should trigger the mime type detection based on the content and not on
>>>>>>>> what the server returns. It is not a Tika issue as such as the selection
>>>>>>>> of what parser to use is based on the mimetype that Nutch uses.
>>>>>>>>
>>>>>>>> The param above should be set to true by default. I thought we had more
>>>>>>>> options but am probably confusing with the language identification
>>>>>>>>
>>>>>>>> Julien
>>>>>>>>
>>>>>>>> On 25 November 2012 14:16, Sourajit Basak <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> DEBUG tika.TikaParser - Using Tika parser
>>>>>>>>> org.apache.tika.parser.txt.TXTParser for mime-type text/plain
>>>>>>>>>
>>>>>>>>> The above indicates Tika is fired. But somehow I need to tell Tika to
>>>>>>>>> use HtmlParser for mime-type text/plain. Have to dig into Tika docs.
>>>>>>>>>
>>>>>>>>> Is it possible to do anything in Nutch ?
>>>>>>>>>
>>>>>>>>> On Sun, Nov 25, 2012 at 7:27 PM, Sourajit Basak <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Some of my target webpages return a mime type of text/plain though
>>>>>>>>>> they are htmls. I changed "http.accept" to include text/plain and
>>>>>>>>>> configured both tika & parse-html to see if those can be parsed.
>>>>>>>>>> However, both seem to produce no content.
>>>>>>>>>>
>>>>>>>>>> I changed parse-plugins.xml & the corresponding plugin.xml's to
>>>>>>>>>> match this mime type.
>>>>>>>>>>
>>>>>>>>>> Has anyone encountered this problem ?
>>>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Open Source Solutions for Text Engineering
>>>>>>>>
>>>>>>>> http://digitalpebble.blogspot.com/
>>>>>>>> http://www.digitalpebble.com
>>>>>>>> http://twitter.com/digitalpebble
>>>>>>>>
>>
>> --
>> Lewis
>>
>
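
P.S. For anyone who wants to repeat the redirect check without curl, here is a minimal Python sketch. It relies on the fact that the standard library's http.client does not follow redirects on its own, so a 303 with Content-Length: 0 shows up as-is. The function names summarize and probe are made up for this illustration and have nothing to do with Nutch; this is only a rough equivalent of the curl call above, not what the Nutch fetcher does internally.

```python
import http.client
from urllib.parse import urlsplit


def summarize(status, headers):
    """Classify a response from its status code and headers (no network I/O)."""
    # Header names are case-insensitive, so normalize them before lookup.
    lower = {k.lower(): v for k, v in headers.items()}
    length = lower.get("content-length")
    return {
        "status": status,
        "is_redirect": status in (301, 302, 303, 307, 308),
        "location": lower.get("location"),
        # Only call the body empty when the server explicitly says so.
        "empty_body": length is not None and int(length) == 0,
    }


def probe(url):
    """Issue a single GET for url and summarize the first response."""
    parts = urlsplit(url)
    conn_cls = (http.client.HTTPSConnection if parts.scheme == "https"
                else http.client.HTTPConnection)
    conn = conn_cls(parts.netloc, timeout=10)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request("GET", path)
        resp = conn.getresponse()
        return summarize(resp.status, dict(resp.getheaders()))
    finally:
        conn.close()
```

Calling probe() on the URL in question and seeing is_redirect and empty_body both true would point to the same conclusion as the curl output: there is no page content to parse until the redirect target is fetched.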

