Hi! Thank you for your answer. You're right, maybe an example would explain better what I need to do.
I have to perform the following task: I have to crawl a specific domain (.gov.it), starting from an initial set of seeds, for example www.aaa.it, www.bbb.gov.it, www.ccc.it. I configured Nutch so that it doesn't fetch pages outside that domain. However, some of the resources I need to download (documents) are stored on web sites that are outside the domain I'm interested in.

For example: www.aaa.it/subfolder/albi redirects to www.somesite.it (where www.somesite.it is not inside "my" domain). Nutch will not fetch that page, since I told it to behave that way, but I need to download the documents stored on www.somesite.it. So I need Nutch to go outside the domain I specified only when it sees the words "albi" or "albo" inside the URL, since those words identify the documents I need.

How can I do this? I hope I've been clear. :)

2011/11/30 Lewis John Mcgibbney <[email protected]>

> Hi Adriana,
>
> This should be achievable through fine grained URL filters. It is kindof
> hard to substantiate on this without you providing some examples of the
> type of stuff you're trying to do!
>
> Lewis
>
> On Mon, Nov 28, 2011 at 11:14 AM, Adriana Farina <
> [email protected]
> > wrote:
>
> > Hello,
> >
> > I'm using nutch 1.3 from just a month, so I'm not an expert. I configured
> > it so that it doesn't fetch pages outside a specific domain. However now I
> > need to let it fetch pages outside the domain I choosed but only for some
> > urls (not for all the urls I have to crawl). How can I do this? I have to
> > write a new plugin?
> >
> > Thanks.
> >
>
> --
> *Lewis*
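
P.S. To make the rule concrete, here is a rough sketch of the decision I would like the crawler to make for each URL. It is plain Python, not Nutch code: the function name should_fetch and the constants are hypothetical names I made up for illustration, and I am assuming that the seed hosts always count as in scope. It only checks the URL string itself, so a redirect target that does not contain "albi" or "albo" would still be rejected.

```python
from urllib.parse import urlparse

# Assumption: the crawl is scoped to hosts under this suffix.
ALLOWED_SUFFIX = ".gov.it"
# Assumption: seed hosts are always in scope, even when outside the suffix.
SEED_HOSTS = {"www.aaa.it", "www.bbb.gov.it", "www.ccc.it"}
# Words that identify the documents I need.
KEYWORDS = ("albi", "albo")


def should_fetch(url: str) -> bool:
    """Return True if the URL is in scope or contains one of the keywords."""
    # Accept bare URLs such as "www.aaa.it/page" by adding a scheme first.
    parsed = urlparse(url if "://" in url else "http://" + url)
    host = (parsed.hostname or "").lower()
    if host in SEED_HOSTS or host.endswith(ALLOWED_SUFFIX):
        return True
    # Outside the domain: follow only when the URL itself contains a keyword.
    return any(word in url.lower() for word in KEYWORDS)
```

With this rule, www.bbb.gov.it/page and www.somesite.it/albo/doc.pdf would be fetched, while www.somesite.it/ on its own would not. I suppose the same logic is what the URL filters would have to express.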

