It would also help if you could provide the settings from nutch-site.xml that restrict your Nutch bot from crawling outside the specified domain.
At this stage I think that if your restrictions completely deny Nutch from following outlinks to other domains, then the use of regex filters is pointless. This is not what you want to configure. Instead, you want to allow Nutch to follow outlinks to other domains, but limit which domains it is allowed to crawl. I think it should be possible to add filters to your regex filter file like this:

# accept the following but block everything else
+^http://([a-z0-9]*\.)*somesite\.it/
+^http://([a-z0-9]*\.)*aaa\.it/
+^http://([a-z0-9]*\.)*bbb\.it/
etc.

I don't think you will need to explicitly deny everything else. However, you'll only find out by running a number of small test crawls to check whether your regex filters are working.

HTH

On Thu, Dec 1, 2011 at 8:57 AM, Adriana Farina <[email protected]> wrote:

> Hi!
>
> Thank you for your answer. You're right, maybe an example would explain
> better what I need to do.
>
> I have to perform the following task. I have to explore a specific domain
> (.gov.it) and I have an initial set of seeds, for example www.aaa.it,
> www.bbb.gov.it, www.ccc.it. I configured nutch so that it doesn't fetch
> pages outside that domain. However some resources I need to download
> (documents) are stored on web sites that are not inside the domain I'm
> interested in.
> For example: www.aaa.it/subfolder/albi redirects to www.somesite.it (where
> www.somesite.it is not inside "my" domain). Nutch will not fetch that page
> since I told it to behave that way, but I need to download documents stored
> on www.somesite.it. So I need nutch to go outside the domain I specified
> only when it sees the words "albi" or "albo" inside the url, since those
> words identify the documents I need. How can I do this?
>
> I hope I've been clear. :)
>
>
> 2011/11/30 Lewis John Mcgibbney <[email protected]>
>
> > Hi Adriana,
> >
> > This should be achievable through fine grained URL filters. It is kind of
> > hard to elaborate on this without you providing some examples of the
> > type of stuff you're trying to do!
> >
> > Lewis
> >
> > On Mon, Nov 28, 2011 at 11:14 AM, Adriana Farina <[email protected]>
> > wrote:
> >
> > > Hello,
> > >
> > > I've been using nutch 1.3 for just a month, so I'm not an expert. I
> > > configured it so that it doesn't fetch pages outside a specific domain.
> > > However now I need to let it fetch pages outside the domain I chose but
> > > only for some urls (not for all the urls I have to crawl). How can I do
> > > this? Do I have to write a new plugin?
> > >
> > > Thanks.
> > >
> >
> > --
> > *Lewis*
>
--
*Lewis*
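
P.S. Before running test crawls, the accept-only rules suggested in the reply above can be sanity-checked offline. The following is only a minimal sketch in Python, not Nutch code. It assumes the regex URL filter evaluates rules top to bottom, lets the first matching rule decide, and rejects a URL that matches no rule. The names parse_rules and accepts are hypothetical helpers, and the last rule (for URLs containing "albi" or "albo") is an illustrative assumption that was not proposed in the thread. Python's regex engine also differs from Java's in some details, so the real test is still a small crawl.

```python
import re

# Rules in the style of a regex URL filter file: "+" accepts, "-" rejects.
# Dots in host names are escaped so that "." matches only a literal dot.
RULES_TEXT = r"""
# accept the following but block everything else
+^http://([a-z0-9]*\.)*somesite\.it/
+^http://([a-z0-9]*\.)*aaa\.it/
+^http://([a-z0-9]*\.)*bbb\.it/
# hypothetical rule, not from the thread: any URL containing albi or albo
+^http://.*alb[io]
"""


def parse_rules(text):
    """Turn filter-file lines into (accept, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in ("+", "-"):
            raise ValueError("rule must start with + or -: " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """Assumed semantics: the first matching rule decides; no match rejects."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False


if __name__ == "__main__":
    rules = parse_rules(RULES_TEXT)
    for url in (
        "http://www.aaa.it/subfolder/albi",
        "http://www.somesite.it/docs/file.pdf",
        "http://www.other.org/albo/doc.pdf",
        "http://www.example.com/",
    ):
        print(url, "->", "accept" if accepts(rules, url) else "reject")
```

Under these assumptions no explicit deny-all rule is needed, because a URL that matches none of the "+" lines is rejected anyway. Note that the hypothetical albi/albo rule is deliberately broad and would accept matching URLs on any host.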

