>> ./nutch plugin urlfilter-regex
org.apache.nutch.urlfilter.regex.RegexURLFilter
http://www.nytimes.com/2013/01/31/technology/chinese-hackers-infiltrate-new-york-times-computers.html?pagewanted=2&_r=0&ref=global-home
+http://www.nytimes.com/2013/01/31/technology/chinese-hackers-infiltrate-new-york-times-computers.html?pagewanted=2&_r=0&ref=global-home

Here is the relevant part of the regex file.

#-[?*!@=]
+^http://www\.nytimes\.com/([^\*@!])*(\s|$)
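
As a quick cross-check of the checker output above, here is a minimal sketch that replays the two rules against the URL. It assumes the filter applies rules in file order as an unanchored search and that the first matching rule decides; Python's `re` stands in for Java's regex engine here, which should behave the same for these two patterns. The helper name `first_match` is hypothetical and not part of Nutch.

```python
import re

# Rule copied from the regex file above; a leading "+" means "accept".
ACCEPT_RULE = r"^http://www\.nytimes\.com/([^\*@!])*(\s|$)"
# The commented-out rule; if enabled, a leading "-" would mean "reject".
QUERY_RULE = r"[?*!@=]"

URL = ("http://www.nytimes.com/2013/01/31/technology/"
       "chinese-hackers-infiltrate-new-york-times-computers.html"
       "?pagewanted=2&_r=0&ref=global-home")


def first_match(url, rules):
    """Return the sign of the first rule whose pattern is found in url.

    Hypothetical helper: assumes first-match-wins semantics and an
    unanchored search, and returns None when no rule matches.
    """
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign
    return None


# Rules as they stand now: the "-[?*!@=]" line is commented out.
active = first_match(URL, [("+", ACCEPT_RULE)])

# Rules as they would be with the "-[?*!@=]" line enabled.
with_default = first_match(URL, [("-", QUERY_RULE), ("+", ACCEPT_RULE)])

print(active, with_default)
```

Under those assumptions the URL is accepted with the current rules and would be rejected only if the `-[?*!@=]` line were re-enabled, so the URL filter does not look like the cause of the empty content.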

On Fri, Feb 1, 2013 at 10:48 AM, Lewis John Mcgibbney <
[email protected]> wrote:

> And your regex rules?
> So is the URL fetched?
>
> On Thu, Jan 31, 2013 at 8:47 PM, Sourajit Basak
> <[email protected]> wrote:
> > Here it goes.
> >
> > Try to dump the content from this url with the following settings.
> >
> > http://www.nytimes.com/2013/01/31/technology/chinese-hackers-infiltrate-new-york-times-computers.html?pagewanted=2&_r=0&ref=global-home
> >
> >   <property>
> >     <name>http.content.limit</name>
> >     <value>-1</value>
> >   </property>
> >
> > This page is gzip encoded. You will see that the fetcher is unable to
> > download any content. Check by inspecting the content-length.
> > Initially I thought it was a problem with the parse-html plugin, but
> > now it seems that the fetcher returns null content.
> >
> > This seemed related to NUTCH-374
> >
> > Let me know if you need further info.
>
