What about your regex rules? Is the URL actually fetched?
On Thu, Jan 31, 2013 at 8:47 PM, Sourajit Basak <[email protected]> wrote:
> Here it goes.
>
> Try to dump the content from this url with the following settings.
> http://www.nytimes.com/2013/01/31/technology/chinese-hackers-infiltrate-new-york-times-computers.html?pagewanted=2&_r=0&ref=global-home
>
> <property>
>   <name>http.content.limit</name>
>   <value>-1</value>
> </property>
>
> This page is gzip encoded. You will see that the fetcher is unable to
> download any content. Check by inspecting the content-length.
> Initially I was thinking it to be a problem with the parse-html plugin but
> now it seems that the fetcher returns null content.
>
> This seemed related to NUTCH-374
>
> Let me know if you need further info.
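
To separate "nothing was downloaded" from "something was downloaded but never decompressed", it may help to compare the byte count on the wire with the byte count after decoding. The following is only a rough sketch in Python, not Nutch code; the helper names decode_body and describe are made up for illustration, and it assumes you already have the raw response bytes and the Content-Encoding header value from some other tool.

```python
import gzip


def decode_body(raw: bytes, content_encoding: str = "") -> bytes:
    """Return the body as-is, or gunzipped when the header says gzip.

    Hypothetical helper for illustration; it is not part of Nutch.
    """
    if content_encoding.strip().lower() in ("gzip", "x-gzip"):
        return gzip.decompress(raw)
    return raw


def describe(raw: bytes, content_encoding: str = "") -> str:
    """Summarise the wire size and the decoded size of a response body."""
    body = decode_body(raw, content_encoding)
    return f"wire={len(raw)} bytes, decoded={len(body)} bytes"


if __name__ == "__main__":
    # Stand-in for a real response: a gzip-compressed HTML snippet.
    page = b"<html><body>example</body></html>"
    print(describe(gzip.compress(page), "gzip"))
    # An empty body stays empty, which is what a failed fetch would look like.
    print(describe(b"", ""))
```

If the wire size is already zero, the problem is on the fetch side; if the wire size is non-zero but the decoded content is empty or unreadable, the gzip handling is the more likely suspect.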

