Hello,

I would like to know how one can put a filter on outlinks. I mean, if I 
have a regex, in which file should I put it?
For example, I want Nutch to ignore outlinks ending with .info.

Thanks.
Alex.
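
For illustration: in a stock Nutch 1.x install, regex URL filter rules
normally live in conf/regex-urlfilter.txt, one rule per line, each prefixed
with + (accept) or - (reject), and the first matching rule decides. The
Python sketch below only imitates that first-match-wins logic so the
patterns can be tried out; it is not Nutch code. The RULES list and the
accept() helper are hypothetical names, and since "ending with .info" is
ambiguous, the sketch assumes it covers both hosts in the .info TLD and URL
strings that end in .info.

```python
import re

# Hypothetical rule list in the "+regex" / "-regex" style.
# The first pattern that matches decides the outcome.
RULES = [
    # Assumption: reject hosts whose name ends with .info
    ("-", re.compile(r"^https?://[^/?#]+\.info(?::\d+)?(?:[/?#]|$)")),
    # Assumption: also reject URLs whose string ends with .info
    ("-", re.compile(r"\.info$")),
    # Accept everything else
    ("+", re.compile(r".")),
]

def accept(url):
    """Return True if the first matching rule is an accept rule."""
    for sign, pattern in RULES:
        if pattern.search(url):
            return sign == "+"
    # A URL that matches no rule is dropped.
    return False

if __name__ == "__main__":
    for candidate in ("http://example.info/page.html",
                      "http://example.org/page.html"):
        print(candidate, accept(candidate))
```

Whether a given Nutch version applies these filters to outlinks at parse
time or only at later stages is worth checking against its documentation.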

-----Original Message-----
From: Arkadi.Kosmynin <[email protected]>
To: user <[email protected]>
Sent: Thu, Dec 1, 2011 1:44 pm
Subject: RE: Fetching just some urls outside domain


Hi Adriana,

You can try Arch for this:

http://www.atnf.csiro.au/computing/software/arch

You can configure it to crawl your web sites plus sets of miscellaneous URLs 
called "bookmarks" in Arch. Arch is a free extension of Nutch. Right now, only 
Arch based on Nutch 1.2 is available for downloading. We are about to release 
Arch based on Nutch 1.4.

Regards,

Arkadi



> -----Original Message-----
> From: Adriana Farina [mailto:[email protected]]
> Sent: Thursday, 1 December 2011 7:58 PM
> To: [email protected]
> Subject: Re: Fetching just some urls outside domain
> 
> Hi!
> 
> Thank you for your answer. You're right, maybe an example would explain
> better what I need to do.
> 
> I have to perform the following task. I have to explore a specific
> domain (.gov.it) and I have an initial set of seeds, for example
> www.aaa.it, www.bbb.gov.it, www.ccc.it. I configured Nutch so that it
> doesn't fetch pages outside that domain. However, some resources I need
> to download (documents) are stored on web sites that are not inside the
> domain I'm interested in.
> For example: www.aaa.it/subfolder/albi redirects to www.somesite.it
> (where www.somesite.it is not inside "my" domain). Nutch will not fetch
> that page since I told it to behave that way, but I need to download
> documents stored on www.somesite.it. So I need Nutch to go outside the
> domain I specified only when it sees the words "albi" or "albo" inside
> the URL, since those words identify the documents I need. How can I do
> this?
> 
> I hope I've been clear. :)
> 
> 
> 
> 2011/11/30 Lewis John Mcgibbney <[email protected]>
> 
> > Hi Adriana,
> >
> > This should be achievable through fine-grained URL filters. It is kind
> > of hard to elaborate on this without you providing some examples of the
> > type of stuff you're trying to do!
> >
> > Lewis
> >
> > On Mon, Nov 28, 2011 at 11:14 AM, Adriana Farina
> > <[email protected]> wrote:
> >
> > > Hello,
> > >
> > > I've been using Nutch 1.3 for just a month, so I'm not an expert. I
> > > configured it so that it doesn't fetch pages outside a specific
> > > domain. However, now I need to let it fetch pages outside the domain
> > > I chose, but only for some URLs (not for all the URLs I have to
> > > crawl). How can I do this? Do I have to write a new plugin?
> > >
> > > Thanks.
> > >
> >
> >
> >
> > --
> > *Lewis*
> >
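
For illustration, the rule Adriana describes in the quoted message (stay
inside .gov.it, but also accept outside URLs that contain "albi" or "albo")
can be sketched as a small Python predicate. This is only a sketch of the
decision logic, not a Nutch plugin or filter; should_fetch(), ALLOWED_SUFFIX
and KEYWORDS are hypothetical names. It only inspects the URL string itself,
so a redirect target that does not contain either word would still be
rejected, and a plain substring test can also match unrelated words that
happen to contain "albo".

```python
from urllib.parse import urlparse

# Assumptions taken from the message: one allowed domain suffix and
# two keywords that mark the documents of interest.
ALLOWED_SUFFIX = ".gov.it"
KEYWORDS = ("albi", "albo")

def should_fetch(url):
    """Accept URLs inside the allowed domain, or outside it when the
    URL contains one of the keywords."""
    host = (urlparse(url).hostname or "").lower()
    in_domain = (host == ALLOWED_SUFFIX.lstrip(".")
                 or host.endswith(ALLOWED_SUFFIX))
    has_keyword = any(word in url.lower() for word in KEYWORDS)
    return in_domain or has_keyword

if __name__ == "__main__":
    for candidate in ("http://www.bbb.gov.it/",
                      "http://www.somesite.it/albi/doc.pdf",
                      "http://www.somesite.it/news"):
        print(candidate, should_fetch(candidate))
```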

 
