Hi,

The regular expression looks good.
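
As a quick sanity check outside Nutch, here is a minimal sketch that applies your three rules to the two URLs from your mail. It assumes the first rule whose pattern is found anywhere in the URL decides the outcome, and that a URL matching no rule is rejected. The helper name accepts and the RULES list are made up for illustration, the trailing comments from your rules are left out, and Python's re module stands in for Java's regex engine (the two behave the same for these simple patterns).

```python
import re

# Rules in file order, as quoted in the original message (comments dropped).
# The (sign, pattern) layout is an illustrative assumption, not Nutch's format.
RULES = [
    ("-", r"[*!@]"),           # skip certain queries
    ("-", r"pn=[0-9]{4,}$"),   # page number with four or more digits
    ("+", r"."),               # accept anything else
]


def accepts(url, rules=RULES):
    """Return True if the first rule whose pattern is found in the URL is '+'."""
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    return False  # assumed behaviour: no rule matched, so reject


print(accepts("far.boo.com/f?kw=SomeTopic&pn=649800"))  # False
print(accepts("far.boo.com/f?kw=SomeTopic&pn=150"))     # True
```

Note that the pattern also rejects pn=1000 itself, since it matches any page number with four or more digits.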
Which regex-urlfilter.txt did you change:
the one in conf/ or the one in runtime/local/conf/?
If you changed conf/regex-urlfilter.txt,
you need to run "ant runtime" again
to install the configuration changes
into runtime/local/conf.
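
To see whether the edited file actually reached the runtime directory, something like the following sketch could help. It only compares the two copies byte for byte. The function name config_installed and the nutch_home argument are hypothetical; the two relative paths are the ones mentioned above.

```python
import filecmp
from pathlib import Path


def config_installed(nutch_home, name="regex-urlfilter.txt"):
    """Return True if conf/<name> and runtime/local/conf/<name> are identical."""
    source = Path(nutch_home) / "conf" / name
    installed = Path(nutch_home) / "runtime" / "local" / "conf" / name
    if not (source.is_file() and installed.is_file()):
        return False  # one of the copies is missing
    # shallow=False forces a comparison of file contents, not just metadata.
    return filecmp.cmp(source, installed, shallow=False)


if __name__ == "__main__":
    # Assumes the script is started from the Nutch source directory.
    print(config_installed("."))
```

If this prints False after editing conf/regex-urlfilter.txt, the runtime copy is stale or missing.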
For distributed mode, you need to rebuild
and redeploy after any configuration change,
because the configuration files are included
in the job file.
Sebastian
On 01/13/2015 06:16 AM, fxmy wang wrote:
> Hi Nutch users,
>
>
> We are trying to crawl a forum site with the help of Nutch-2.2.1.
>
> The URLs are like far.boo.com/f?kw=SomeTopic&pn=150
> where pn means PageNumber.
>
> The goal is to filter out old posts; say, I want all posts with pn>1000
> filtered out.
>
> So in conf/regex-urlfilter.txt I added this above the '# accept anything
> else' line.
>
> -[*!@] # skip certain queries
> -pn=[0-9]{4,}$ # filter out pn>1000
> +. # accept anything else
>
> And... no effect :(
> After several generate-fetch-parse-updatedb cycles, the URL
> far.boo.com/f?kw=SomeTopic&pn=649800 still got fetched.
>
> To verify further, I ran the command below
> bin/nutch plugin urlfilter-regex
> org.apache.nutch.urlfilter.regex.RegexURLFilter [0]
> and pasted 'far.boo.com/f?kw=SomeTopic&pn=649800' in; the output is
> +far.boo.com/f?kw=SomeTopic&pn=649800
> It seems Nutch didn't filter it out.
>
> What is the proper way to deal with numbers in URLs?
> Did I do something wrong?
> Any advice would be much appreciated.
>
> ----------------------------------------------------------------------
> [0] http://www.mail-archive.com/user%40nutch.apache.org/msg09536.html
> ----------------------------------------------------------------------
>
> BR, fxmy
>