Hi Julien,

Thanks a lot for your reply. Escaping the + sign totally makes sense to me.
However, I don't think it is necessary to escape the + sign if you put it
inside [], just as the special characters in the rule for dynamic pages are
left unescaped inside []:
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
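
To make the point about [] concrete, here is a minimal sketch using plain java.util.regex. The class name PlusEscapeCheck and the helper found() are made up for illustration, and the use of find() (a match anywhere in the URL) is my assumption about how the filter applies each rule.

```java
import java.util.regex.Pattern;

public class PlusEscapeCheck {
    // Hypothetical helper: true if the regex is found anywhere in the URL.
    // find() is used on the assumption that the filter searches for a match
    // rather than requiring the whole URL to match.
    static boolean found(String regex, String url) {
        return Pattern.compile(regex).matcher(url).find();
    }

    public static void main(String[] args) {
        String url = "http://www.example.com/browse/category1/category2/category3?navid=1234567+2";
        // A plus inside a character class is a literal plus.
        System.out.println(found("[+]", url));
        // An escaped plus outside a character class is also a literal plus.
        System.out.println(found("\\+", url));
    }
}
```

Both calls should report a match, so as far as the regular expression itself is concerned, "[+]" and "\+" look interchangeable here.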

Also, in the end, I read the source code of Nutch and came up with a way to
test it in Eclipse using the RegexURLFilter class.
You can find more details of my testing here:
<http://datafireball.com/2014/07/20/nutch-how-regex-urlfilter-txt-really-works/>

Am I using the right class? I will definitely check out
"org.apache.nutch.net.URLFilterChecker". Thanks!
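
For reference, this is roughly the kind of test I mean, reduced to a standalone sketch. It is not the Nutch class itself: FirstMatchFilterSketch, Rule, parse() and filter() are hypothetical names, and the behaviour it models (rules tried in file order, the first matching rule decides, no match means the URL is dropped) is my reading of the source and of the comments in the stock regex-urlfilter.txt, so please treat it as an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class FirstMatchFilterSketch {
    // One rule line: a sign ('+' accept, '-' reject) followed by a regex.
    static final class Rule {
        final boolean accept;
        final Pattern pattern;

        Rule(String line) {
            this.accept = line.charAt(0) == '+';
            this.pattern = Pattern.compile(line.substring(1));
        }
    }

    // Turns rule lines into rules, skipping blank lines and comments.
    static List<Rule> parse(String... lines) {
        List<Rule> rules = new ArrayList<>();
        for (String line : lines) {
            if (line.isEmpty() || line.startsWith("#")) {
                continue;
            }
            rules.add(new Rule(line));
        }
        return rules;
    }

    // Assumed semantics: the first rule whose regex is found in the URL decides.
    // Returns the URL if accepted, null if rejected or if no rule matches.
    static String filter(List<Rule> rules, String url) {
        for (Rule rule : rules) {
            if (rule.pattern.matcher(url).find()) {
                return rule.accept ? url : null;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String url = "http://www.example.com/browse/category1/category2/category3?navid=1234567+2";
        // Same order as in my filter file: the accept rule comes before -[+].
        List<Rule> original = parse("+^http://www.example.com/browse/", "-[+]");
        // Same rules with the reject rule moved to the top.
        List<Rule> reordered = parse("-[+]", "+^http://www.example.com/browse/");
        System.out.println(filter(original, url));
        System.out.println(filter(reordered, url));
    }
}
```

Under that assumption, the first call returns the URL (the /browse/ rule matches before "-[+]" is ever tried) and the second returns null, which would point to the order of the rules rather than the escaping as the reason the "+" URLs were still fetched.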

Bin




On Mon, Jul 21, 2014 at 3:39 AM, Julien Nioche <
[email protected]> wrote:

> Hi
>
> The + character needs escaping: use -\+ in the filter (see
> http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html)
>
> There is a tool for testing the URLFilters in Nutch already, just do
>
> ./nutch org.apache.nutch.net.URLFilterChecker -allCombined
>
> from runtime/local/bin
>
> HTH
>
> Julien
>
>
> On 19 July 2014 16:28, Bin Wang <[email protected]> wrote:
>
> > Hi there,
> >
> > I am using Nutch to crawl a site that has dynamic pages.
> >
> >
> http://www.example.com/browse/category1/category2/category3?navid=1234567
> >
> >
> > I commented out the line in regex-urlfilter.txt to allow dynamic pages,
> > i.e. to allow URLs that contain a question mark.
> >
> > # skip URLs containing certain characters as probable queries, etc.
> > # -[?*!@=]
> >
> > However, in those dynamic pages, there is a panel - "NARROW DOWN YOUR
> > RESULTS BY:", and there are many filters which lead to hundreds of
> > outlinks that won't bring any extra data, but will result in 100x+ page
> > requests.
> >
> >
> http://www.example.com/browse/category1/category2/category3?navid=1234567+2
> >
> http://www.example.com/browse/category1/category2/category3?navid=1234567+3
> >
> >
> http://www.example.com/browse/category1/category2/category3?navid=1234567+2+3
> >
> >
> > To avoid placing an unnecessary burden on the target website, I want to
> > filter out the URLs that contain a "+" sign. The rules in
> > regex-urlfilter.txt look like this now:
> >
> > -^(file|ftp|mailto):
> > -\.(gif|GIF|jpg|JPG|... omit ...|js|JS)$
> > +^http://www.example.com/ProductsCategory
> > +^http://www.example.com/browse/
> > -[+]
> >
> >
> > Here, /ProductsCategory matches the seed URLs that I start from, and
> > /browse matches the pages that I want to collect.
> > I am assuming "-[+]" will remove the URLs that contain a "+" sign.
> > However, it is not doing what I expect:
> > I can still see in the nohup file that the robot is fetching pages whose
> > URLs contain "+".
> >
> > Question1: how can I modify the regular expressions in
> > regex-urlfilter.txt to fit my need?
> >
> > I have also followed the NutchInEclipse
> > <http://wiki.apache.org/nutch/RunNutchInEclipse> tutorial by tejas in the
> > Nutch Wiki, and now I have a working environment to test the Nutch source
> > code.
> >
> > Question2: Is there an easy way in Eclipse to test the output of a list
> > of URLs after being filtered by a certain regular expression?
> >
> > I know Nutch is using java.util.regex, but I want to know how Nutch reads
> > the configuration file, which characters I should escape, etc.
> >
> > Thanks!
> >
> > Bin
> >
>
>
>
> --
>
> Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
> http://twitter.com/digitalpebble
>
