Hello,

I would be interested to know how one can put a filter on outlinks. I mean, if I have a regex, in which file should I put it? For example, I want Nutch to ignore outlinks ending with .info.

Thanks.
Alex.

-----Original Message-----
From: Arkadi.Kosmynin <[email protected]>
To: user <[email protected]>
Sent: Thu, Dec 1, 2011 1:44 pm
Subject: RE: Fetching just some urls outside domain

Hi Adriana,

You can try Arch for this:

http://www.atnf.csiro.au/computing/software/arch

You can configure it to crawl your web sites plus sets of miscellaneous URLs, called "bookmarks" in Arch. Arch is a free extension of Nutch. Right now, only Arch based on Nutch 1.2 is available for download. We are about to release Arch based on Nutch 1.4.

Regards,
Arkadi

> -----Original Message-----
> From: Adriana Farina [mailto:[email protected]]
> Sent: Thursday, 1 December 2011 7:58 PM
> To: [email protected]
> Subject: Re: Fetching just some urls outside domain
>
> Hi!
>
> Thank you for your answer. You're right, maybe an example would explain
> better what I need to do.
>
> I have to perform the following task. I have to explore a specific
> domain (.gov.it) and I have an initial set of seeds, for example
> www.aaa.it, www.bbb.gov.it, www.ccc.it. I configured Nutch so that it
> doesn't fetch pages outside that domain. However, some resources I need
> to download (documents) are stored on web sites that are not inside the
> domain I'm interested in.
> For example: www.aaa.it/subfolder/albi redirects to www.somesite.it
> (where www.somesite.it is not inside "my" domain). Nutch will not fetch
> that page since I told it to behave that way, but I need to download
> documents stored on www.somesite.it. So I need Nutch to go outside the
> domain I specified only when it sees the words "albi" or "albo" inside
> the URL, since those words identify the documents I need. How can I do
> this?
>
> I hope I've been clear. :)
>
>
> 2011/11/30 Lewis John Mcgibbney <[email protected]>
>
> > Hi Adriana,
> >
> > This should be achievable through fine-grained URL filters. It is
> > kind of hard to elaborate on this without you providing some examples
> > of the type of stuff you're trying to do!
> >
> > Lewis
> >
> > On Mon, Nov 28, 2011 at 11:14 AM, Adriana Farina
> > <[email protected]> wrote:
> >
> > > Hello,
> > >
> > > I've been using Nutch 1.3 for just a month, so I'm not an expert. I
> > > configured it so that it doesn't fetch pages outside a specific
> > > domain. However, now I need to let it fetch pages outside the
> > > domain I chose, but only for some URLs (not for all the URLs I have
> > > to crawl). How can I do this? Do I have to write a new plugin?
> > >
> > > Thanks.
> > >
> >
> > --
> > *Lewis*
> >
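
To make the "fine-grained URL filters" idea from this thread concrete, here is a minimal Python sketch of an ordered accept/reject rule list in the style of Nutch's conf/regex-urlfilter.txt, where each rule starts with '+' or '-' and the first pattern that matches decides. It is not Nutch code. The rule set, the `accept` helper, and the reading of "ending with .info" as "host ending in .info" are all assumptions made for illustration, and the patterns would need checking against a real crawl before use.

```python
import re

# Hypothetical rules, written in the spirit of regex-urlfilter.txt:
# '+' accepts, '-' rejects, and the first pattern found in the URL wins.
RULES = [
    ("-", r"^https?://[^/]+\.info(/|$)"),        # skip hosts ending in .info (assumed meaning)
    ("+", r"alb[io]"),                           # follow any URL containing "albi" or "albo"
    ("+", r"^https?://([^/]+\.)?gov\.it(/|$)"),  # stay inside the gov.it domain
    ("-", r"."),                                 # reject everything else
]

COMPILED = [(sign, re.compile(pattern)) for sign, pattern in RULES]


def accept(url: str) -> bool:
    """Return True if the first matching rule is an accept ('+') rule."""
    for sign, pattern in COMPILED:
        if pattern.search(url):
            return sign == "+"
    # No rule matched at all: treat the URL as rejected.
    return False


if __name__ == "__main__":
    for url in (
        "http://www.bbb.gov.it/page",
        "http://www.somesite.it/albi/doc.pdf",
        "http://www.somesite.it/news",
        "http://example.info/albi",
    ):
        print(url, accept(url))
```

Rule order matters in this sketch: the .info exclusion comes first so that it overrides the "albi"/"albo" exception, and the catch-all reject comes last so that it only applies when nothing more specific matched.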

