>> ./nutch plugin urlfilter-regex org.apache.nutch.urlfilter.regex.RegexURLFilter http://www.nytimes.com/2013/01/31/technology/chinese-hackers-infiltrate-new-york-times-computers.html?pagewanted=2&_r=0&ref=global-home
+ http://www.nytimes.com/2013/01/31/technology/chinese-hackers-infiltrate-new-york-times-computers.html?pagewanted=2&_r=0&ref=global-home
Here is the relevant part of the regex file.

#-[?*!@=]
+^http://www\.nytimes\.com/([^\*@!])*(\s|$)

On Fri, Feb 1, 2013 at 10:48 AM, Lewis John Mcgibbney <[email protected]> wrote:

> And your regex rules?
> So is the URL fetched?
>
> On Thu, Jan 31, 2013 at 8:47 PM, Sourajit Basak
> <[email protected]> wrote:
> > Here it goes.
> >
> > Try to dump the content from this url with the following settings.
> >
> > http://www.nytimes.com/2013/01/31/technology/chinese-hackers-infiltrate-new-york-times-computers.html?pagewanted=2&_r=0&ref=global-home
> >
> > <property>
> >   <name>http.content.limit</name>
> >   <value>-1</value>
> > </property>
> >
> > This page is gzip encoded. You will see that the fetcher is unable to
> > download any content. Check by inspecting the content-length.
> > Initially I was thinking it to be a problem with the parse-html plugin
> > but now it seems that the fetcher returns null content.
> >
> > This seemed related to NUTCH-374
> >
> > Let me know if you need further info.
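
For anyone who wants to check the two rules from the regex file outside of Nutch, here is a minimal sketch using plain java.util.regex. The class name RegexRuleCheck and the helper matches() are made up for illustration, and the sketch assumes a rule applies when its pattern is found anywhere in the URL; it does not call any Nutch code, so it only shows what the patterns themselves do with this URL.

```java
import java.util.regex.Pattern;

// Hypothetical stand-alone check; not part of Nutch.
public class RegexRuleCheck {
    // Accept rule from the regex file, without its leading '+' marker.
    static final Pattern ACCEPT =
        Pattern.compile("^http://www\\.nytimes\\.com/([^\\*@!])*(\\s|$)");

    // The rule that is commented out in the regex file ("#-[?*!@=]"),
    // without its leading '-' marker.
    static final Pattern SKIP_QUERY = Pattern.compile("[?*!@=]");

    // Assumption: a rule applies when the pattern is found anywhere in the URL.
    static boolean matches(Pattern rule, String url) {
        return rule.matcher(url).find();
    }

    public static void main(String[] args) {
        String url = "http://www.nytimes.com/2013/01/31/technology/"
            + "chinese-hackers-infiltrate-new-york-times-computers.html"
            + "?pagewanted=2&_r=0&ref=global-home";
        System.out.println("accept rule matches: " + matches(ACCEPT, url));
        System.out.println("commented-out skip rule would match: "
            + matches(SKIP_QUERY, url));
    }
}
```

The URL contains '?' and '=', so the skip rule would match it if it were still active; with that rule commented out, only the accept rule is left to match, which agrees with the "+" result from the plugin command.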

