I'll download Nutch 1.2 and try Arch; it seems interesting. Thank you. I think I need to run some tests to try all the solutions you all suggested.

2011/12/1 <[email protected]>

> Hi Adriana,
>
> You can try Arch for this:
>
> http://www.atnf.csiro.au/computing/software/arch
>
> You can configure it to crawl your web sites plus sets of miscellaneous
> URLs called "bookmarks" in Arch. Arch is a free extension of Nutch. Right
> now, only Arch based on Nutch 1.2 is available for downloading. We are
> about to release Arch based on Nutch 1.4.
>
> Regards,
>
> Arkadi
>
> > -----Original Message-----
> > From: Adriana Farina [mailto:[email protected]]
> > Sent: Thursday, 1 December 2011 7:58 PM
> > To: [email protected]
> > Subject: Re: Fetching just some urls outside domain
> >
> > Hi!
> >
> > Thank you for your answer. You're right, maybe an example would explain
> > better what I need to do.
> >
> > I have to perform the following task. I have to explore a specific
> > domain (.gov.it) and I have an initial set of seeds, for example
> > www.aaa.it, www.bbb.gov.it, www.ccc.it. I configured nutch so that it
> > doesn't fetch pages outside that domain. However some resources I need
> > to download (documents) are stored on web sites that are not inside the
> > domain I'm interested in.
> > For example: www.aaa.it/subfolder/albi redirects to www.somesite.it
> > (where www.somesite.it is not inside "my" domain). Nutch will not fetch
> > that page since I told it to behave that way, but I need to download
> > documents stored on www.somesite.it. So I need nutch to go outside the
> > domain I specified only when it sees the words "albi" or "albo" inside
> > the url, since that words identify the documents I need. How can I do
> > this?
> >
> > I hope I've been clear. :)
> >
> > 2011/11/30 Lewis John Mcgibbney <[email protected]>
> >
> > > Hi Adriana,
> > >
> > > This should be achievable through fine grained URL filters. It is
> > > kindof hard to substantiate on this without you providing some
> > > examples of the type of stuff you're trying to do!
> > >
> > > Lewis
> > >
> > > On Mon, Nov 28, 2011 at 11:14 AM, Adriana Farina
> > > <[email protected]> wrote:
> > >
> > > > Hello,
> > > >
> > > > I'm using nutch 1.3 from just a month, so I'm not an expert. I
> > > > configured it so that it doesn't fetch pages outside a specific
> > > > domain. However now I need to let it fetch pages outside the
> > > > domain I choosed but only for some urls (not for all the urls I
> > > > have to crawl). How can I do this? I have to write a new plugin?
> > > >
> > > > Thanks.
> > >
> > > --
> > > *Lewis*
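
The rule described in this thread (stay inside .gov.it, but also accept any URL containing "albi" or "albo") can be sketched as a small predicate. This is only an illustration of the decision logic, not Nutch code and not the Nutch plugin API: the function name `should_fetch` and the constants are hypothetical, and the pattern `alb[io]` is an assumption that would also match unrelated words such as "albino". In Nutch itself, the same idea would presumably be expressed as ordered accept/reject regex rules in conf/regex-urlfilter.txt, where the first matching rule decides.

```python
import re
from urllib.parse import urlparse

# Assumption: the crawl is restricted to hosts under gov.it, as in the
# example given in the thread.
ALLOWED_SUFFIX = "gov.it"

# Assumption: "albi" or "albo" anywhere in the URL marks a wanted document.
KEYWORD = re.compile(r"alb[io]", re.IGNORECASE)


def should_fetch(url: str) -> bool:
    """Return True if the URL is inside the allowed domain,
    or is outside it but contains one of the keywords."""
    host = (urlparse(url).hostname or "").lower()
    in_domain = host == ALLOWED_SUFFIX or host.endswith("." + ALLOWED_SUFFIX)
    return in_domain or bool(KEYWORD.search(url))


if __name__ == "__main__":
    for candidate in (
        "http://www.bbb.gov.it/index.html",
        "http://www.aaa.it/subfolder/albi",
        "http://www.somesite.it/albo/doc.pdf",
        "http://www.somesite.it/",
    ):
        print(candidate, should_fetch(candidate))
```

Note that a rule like this only looks at the URL being tested. In the redirect example from the thread, www.aaa.it/subfolder/albi passes, but the redirect target www.somesite.it carries no keyword, so it would still be rejected unless that host is allowed separately.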

