Hi Julien,

Thanks a lot for your reply. Escaping the + sign totally makes sense to me. However, I don't think it is necessary to escape the + sign if you put it inside [], the same way the characters in the default rule for dynamic pages are left unescaped:

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

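For reference, the point about [] can be checked directly against java.util.regex, outside of Nutch. The sketch below is only an illustration: the class name PlusEscapeCheck and its helper methods are made up for this example, and the accepted() method is a simplified model that assumes the first matching rule in the file decides whether a URL is kept. That assumption is worth verifying with URLFilterChecker.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Hypothetical helper class for illustration; it is not part of Nutch.
public class PlusEscapeCheck {

    // True if the regex is found anywhere in the URL (find(), not a full match).
    static boolean found(String regex, String url) {
        return Pattern.compile(regex).matcher(url).find();
    }

    // True if java.util.regex accepts the expression at all.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    // Simplified model of a rule file (assumption: first matching rule wins).
    // Each rule is a sign ('+' keep, '-' skip) followed by a regex.
    static boolean accepted(String[] rules, String url) {
        for (String rule : rules) {
            if (found(rule.substring(1), url)) {
                return rule.charAt(0) == '+';
            }
        }
        return false; // assumption: a URL matching no rule is dropped
    }

    public static void main(String[] args) {
        String url =
            "http://www.example.com/browse/category1/category2/category3?navid=1234567+2";

        System.out.println(found("[+]", url));   // true: + is literal inside []
        System.out.println(found("\\+", url));   // true: escaped + is literal too
        System.out.println(compiles("+"));       // false: a bare + is a dangling quantifier

        // Same two rules, different order, under the first-match assumption.
        String[] original  = {"+^http://www.example.com/browse/", "-[+]"};
        String[] reordered = {"-[+]", "+^http://www.example.com/browse/"};
        System.out.println(accepted(original, url));   // true
        System.out.println(accepted(reordered, url));  // false
    }
}
```

If the first-match assumption holds, the order of the rules matters more than the escaping: the -[+] rule would have to come before the +^http://www.example.com/browse/ rule to have any effect.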
Also, in the end, I read the Nutch source code and came up with a way to test the filter in Eclipse using the RegexURLFilter class. You can find more details of my testing here:
<http://datafireball.com/2014/07/20/nutch-how-regex-urlfilter-txt-really-works/>

Am I using the right class? I will definitely check out "org.apache.nutch.net.URLFilterChecker".

Thanks!

Bin

On Mon, Jul 21, 2014 at 3:39 AM, Julien Nioche <[email protected]> wrote:

> Hi
>
> The + character needs escaping, use - \+ in the filter (see
> http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html)
>
> There is a tool for testing the URLFilters in Nutch already, just do
>
> ./nutch org.apache.nutch.net.URLFilterChecker -allCombined
>
> from runtime/local/bin
>
> HTH
>
> Julien
>
>
> On 19 July 2014 16:28, Bin Wang <[email protected]> wrote:
>
> > Hi there,
> >
> > I am using Nutch to crawl a site that has dynamic pages.
> >
> > http://www.example.com/browse/category1/category2/category3?navid=1234567
> >
> > I commented out the line in regex-urlfilter.txt to allow dynamic pages,
> > i.e. to allow the URLs that have the question mark character in them.
> >
> > # skip URLs containing certain characters as probable queries, etc.
> > # -[?*!@=]
> >
> > However, in those dynamic pages, there is a panel - "NARROW DOWN YOUR
> > RESULTS BY:", and there are many filters which lead to hundreds of
> > outlinks that won't bring any extra data, but will result in 100x+ page
> > requests.
> >
> > http://www.example.com/browse/category1/category2/category3?navid=1234567+2
> > http://www.example.com/browse/category1/category2/category3?navid=1234567+3
> > http://www.example.com/browse/category1/category2/category3?navid=1234567+2+3
> >
> > To avoid causing unnecessary burden for the target website, I want to
> > filter out the URLs that contain the "+" sign. The rules look like this
> > now:
> >
> > -^(file|ftp|mailto):
> > -\.(gif|GIF|jpg|JPG|... omit ...|js|JS)$
> > +^http://www.example.com/ProductsCategory
> > +^http://www.example.com/browse/
> > -[+]
> >
> > where /ProductsCategory is the seed URL that I need to start from and
> > /browse.. are the pages that I want to collect.
> > Also, I am assuming "-[+]" will remove the URLs that contain the "+" sign.
> > However, it is not doing what I expect now, and I can still see in the
> > nohup file that the robot is grabbing pages that contain "+".
> >
> > Question1: how can I modify the regular expression in
> > regex-urlfilter.txt to fit my need?
> >
> > I have also followed the NutchInEclipse
> > <http://wiki.apache.org/nutch/RunNutchInEclipse> tutorial by tejas in the
> > Nutch Wiki, and now I have a working environment to test the Nutch source
> > code.
> >
> > Question2: Is there an easy way in Eclipse to test the output of a list of
> > URLs after being filtered by a certain regular expression?
> >
> > I know Nutch is using java.util.regex, but I want to know how Nutch reads
> > from a configuration file, which characters I should escape, etc.
> >
> > Thanks!
> >
> > Bin
> >
>
> --
> Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
> http://twitter.com/digitalpebble

