Hi

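The + character needs escaping outside a character class: use -\+ in the
filter (see
http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html).
The rule also has to come before the + (accept) rules, because the first
matching rule in regex-urlfilter.txt decides.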
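
To see the escaping on its own, here is a small sketch using plain
java.util.regex. The class and helper names are made up for illustration, and
I am assuming the filter searches for the pattern anywhere in the URL (find()
rather than matches()). Note the doubled backslash is only needed in Java
source; in regex-urlfilter.txt you write a single one.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Hypothetical demo class, not part of Nutch.
class PlusEscapeDemo {
    // True if the regex is found anywhere in the URL (find(), not matches()).
    static boolean found(String regex, String url) {
        return Pattern.compile(regex).matcher(url).find();
    }

    // True if java.util.regex accepts the pattern at all.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String url = "http://www.example.com/browse/category1/category2/category3?navid=1234567+2";
        System.out.println(found("\\+", url)); // true: an escaped + matches a literal plus
        System.out.println(compiles("+"));     // false: a bare + is a dangling quantifier
    }
}
```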

Nutch already has a tool for testing the URL filters. From
runtime/local/bin, run

./nutch org.apache.nutch.net.URLFilterChecker -allCombined

and pass it the URLs to check on standard input, one per line.
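
For a quick check outside Nutch (for example from Eclipse), here is a rough
sketch of how such a rule list could be evaluated. It assumes the regex filter
semantics as I understand them: the first character of each line is + (accept)
or - (reject), the rest is a java.util.regex pattern searched anywhere in the
URL, the first matching rule decides, and a URL that matches no rule is
rejected. The class and method names are hypothetical; this is not Nutch's own
code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical stand-in for a first-match-wins rule list, not Nutch's implementation.
class FirstMatchFilterSketch {
    private final List<Boolean> accept = new ArrayList<>();
    private final List<Pattern> patterns = new ArrayList<>();

    FirstMatchFilterSketch(String... lines) {
        for (String line : lines) {
            if (line.isEmpty() || line.startsWith("#")) {
                continue; // skip blank lines and comments
            }
            char sign = line.charAt(0);
            if (sign != '+' && sign != '-') {
                throw new IllegalArgumentException("bad rule: " + line);
            }
            accept.add(sign == '+');
            patterns.add(Pattern.compile(line.substring(1)));
        }
    }

    // The first rule whose pattern is found in the URL decides; no match means reject.
    boolean accepts(String url) {
        for (int i = 0; i < patterns.size(); i++) {
            if (patterns.get(i).matcher(url).find()) {
                return accept.get(i);
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String url = "http://www.example.com/browse/category1/category2/category3?navid=1234567+2";
        // Exclusion listed after the accept rule: never reached for /browse/ URLs.
        FirstMatchFilterSketch late = new FirstMatchFilterSketch(
                "+^http://www.example.com/browse/", "-\\+");
        // Exclusion listed first: the URL is rejected before the accept rule is tried.
        FirstMatchFilterSketch early = new FirstMatchFilterSketch(
                "-\\+", "+^http://www.example.com/browse/");
        System.out.println(late.accepts(url));  // true
        System.out.println(early.accepts(url)); // false
    }
}
```

If those assumptions hold, this is why the position of the exclusion rule in
the file matters as much as the escaping.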

HTH

Julien


On 19 July 2014 16:28, Bin Wang wrote:

> Hi there,
>
> I am using Nutch to crawl a site that has dynamic pages.
>
> http://www.example.com/browse/category1/category2/category3?navid=1234567
>
>
> I commented out the following line in regex-urlfilter.txt to allow dynamic
> pages, i.e. to allow URLs that contain a question mark:
>
> # skip URLs containing certain characters as probable queries, etc.
> # -[?*!@=]
>
> However, in those dynamic pages, there is a panel - "NARROW DOWN YOUR
> RESULTS BY:", and there are many filters which lead to hundreds of outlinks
> that won't bring any extra data, but will result in 100x+ page requests.
>
> http://www.example.com/browse/category1/category2/category3?navid=1234567+2
> http://www.example.com/browse/category1/category2/category3?navid=1234567+3
>
> http://www.example.com/browse/category1/category2/category3?navid=1234567+2+3
>
>
> To avoid putting unnecessary load on the target website, I want to
> filter out the URLs that contain a "+" sign. My regular expressions
> currently look like this:
>
> -^(file|ftp|mailto):
> -\.(gif|GIF|jpg|JPG|... omit ...|js|JS)$
> +^http://www.example.com/ProductsCategory
> +^http://www.example.com/browse/
> -[+]
>
>
> Here, /ProductsCategory holds the seed URLs that I start from and
> /browse.. are the pages that I want to collect.
> I am also assuming that "-[+]" will remove the URLs that contain a "+" sign.
> However, it is not doing what I expect:
> in the nohup file I can still see the robot fetching pages whose URLs
> contain "+".
>
> Question 1: How can I modify the regular expressions in
> regex-urlfilter.txt to fit my needs?
>
> I have also followed the NutchInEclipse
> <http://wiki.apache.org/nutch/RunNutchInEclipse> tutorial by Tejas on the
> Nutch Wiki, and I now have a working environment for testing the Nutch
> source code.
>
> Question 2: Is there an easy way in Eclipse to see which URLs from a list
> remain after they have been filtered by a given regular expression?
>
> I know Nutch uses java.util.regex, but I want to know how Nutch reads
> the rules from the configuration file, which characters I should escape, etc.
>
> Thanks!
>
> Bin
>



-- 

Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
