It depends on the structure of your site; you can modify "regex-urlfilter.txt" to achieve your goal.
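As a rough sketch, the filter file could look like this for your case (the two exclusion rules are the ones discussed in this reply; the trailing "+." catch-all mirrors the accept-everything-else rule in Nutch's default conf/regex-urlfilter.txt — adjust to your actual configuration):

```
# exclude the bare section URLs (no further path segment after the host)
-^http://ww.mywebsite.com/[^/]*$

# exclude any URL that ends with "/"
-^http://ww.mywebsite.com/.*/$

# accept everything else (the usual catch-all; a URL matching no rule is rejected)
+.
```

Rules are applied top to bottom and the first match wins, so the exclusions must come before the catch-all.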
From the examples you gave, you can do this:

-^http://ww.mywebsite.com/[^/]*$

This will exclude http://ww.mywebsite.com/alpha, http://ww.mywebsite.com/beta, and http://ww.mywebsite.com/gamma.

-^http://ww.mywebsite.com/.*/$

This will exclude any URL that ends with "/".

I would suggest you get familiar with regular expressions, in case you aren't yet.

Remi

On Sun, Apr 1, 2012 at 6:27 PM, alessio crisantemi <
alessio.crisant...@gmail.com> wrote:

> Dear All,
> I would like to change my crawling setup, but I don't know how.
>
> To crawl my website I used the following command:
>
> $ bin/nutch crawl urls -solr http://localhost:8983/solr -threads 3 -depth
> 35 -topN 10
>
> to crawl with Nutch and index the results into Solr.
>
> But I would like to crawl only the single pages of my website, not its
> sections.
>
> For example, consider a site www.mywebsite.com composed of 3 sections:
>
> http://ww.mywebsite.com/alpha
>
> http://ww.mywebsite.com/beta
>
> http://ww.mywebsite.com/gamma
>
> I want only the single article pages among my results, not the lists of
> articles in these directories.
>
> So I would like, for example, the parsing of the files:
>
> http://ww.mywebsite.com/alpha/artcle1.html
>
> http://ww.mywebsite.com/alpha/artcle3.html
>
> ...
>
> and I do not want the parsing of the parent section:
>
> http://ww.mywebsite.com/alpha/
>
> How can I do this? Any suggestions?
>
> Sorry if it is not all clear.
> Thank you,
> alessio
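P.S. If you want to sanity-check the two rules before editing the filter file, here is a small Python approximation of how the filter behaves (the real filter is Java inside Nutch; here "-" rejects, "+" accepts, and the first matching rule wins — the patterns are taken verbatim from the reply above, and the accept() helper name is just for illustration, not a Nutch API):

```python
import re

# Rules in order: ('-' = reject, '+' = accept); first match decides.
RULES = [
    ("-", re.compile(r"^http://ww.mywebsite.com/[^/]*$")),  # bare section URLs
    ("-", re.compile(r"^http://ww.mywebsite.com/.*/$")),    # URLs ending in "/"
    ("+", re.compile(r".")),                                # accept the rest
]

def accept(url):
    for sign, pattern in RULES:
        if pattern.search(url):
            return sign == "+"
    return False  # a URL matching no rule is rejected

print(accept("http://ww.mywebsite.com/alpha"))               # False: section
print(accept("http://ww.mywebsite.com/alpha/"))              # False: ends in "/"
print(accept("http://ww.mywebsite.com/alpha/artcle1.html"))  # True: article page
```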