I'll download Nutch 1.2 and try Arch; it seems interesting. Thank you.

I think I need to do some tests to try all the solutions you all suggested.



2011/12/1 <[email protected]>

> Hi Adriana,
>
> You can try Arch for this:
>
> http://www.atnf.csiro.au/computing/software/arch
>
> You can configure it to crawl your web sites plus sets of miscellaneous
> URLs called "bookmarks" in Arch. Arch is a free extension of Nutch. Right
> now, only Arch based on Nutch 1.2 is available for downloading. We are
> about to release Arch based on Nutch 1.4.
>
> Regards,
>
> Arkadi
>
>
>
> > -----Original Message-----
> > From: Adriana Farina [mailto:[email protected]]
> > Sent: Thursday, 1 December 2011 7:58 PM
> > To: [email protected]
> > Subject: Re: Fetching just some urls outside domain
> >
> > Hi!
> >
> > Thank you for your answer. You're right, maybe an example would explain
> > better what I need to do.
> >
> > I have to perform the following task. I have to explore a specific
> > domain (.gov.it) and I have an initial set of seeds, for example
> > www.aaa.it, www.bbb.gov.it, www.ccc.it. I configured Nutch so that it
> > doesn't fetch pages outside that domain. However, some resources I need
> > to download (documents) are stored on web sites that are not inside the
> > domain I'm interested in.
> > For example: www.aaa.it/subfolder/albi redirects to www.somesite.it
> > (where www.somesite.it is not inside "my" domain). Nutch will not fetch
> > that page since I told it to behave that way, but I need to download
> > documents stored on www.somesite.it. So I need Nutch to go outside the
> > domain I specified only when it sees the words "albi" or "albo" inside
> > the URL, since those words identify the documents I need. How can I do
> > this?
> >
> > I hope I've been clear. :)
> >
> >
> >
> > 2011/11/30 Lewis John Mcgibbney <[email protected]>
> >
> > > Hi Adriana,
> > >
> > > This should be achievable through fine-grained URL filters. It is kind
> > > of hard to elaborate on this without you providing some examples of
> > > the type of stuff you're trying to do!
> > >
> > > Lewis
> > >
> > > On Mon, Nov 28, 2011 at 11:14 AM, Adriana Farina <[email protected]> wrote:
> > >
> > > > Hello,
> > > >
> > > > I've been using Nutch 1.3 for just a month, so I'm not an expert. I
> > > > configured it so that it doesn't fetch pages outside a specific
> > > > domain. However, I now need to let it fetch pages outside the domain
> > > > I chose, but only for some URLs (not for all the URLs I have to
> > > > crawl). How can I do this? Do I have to write a new plugin?
> > > >
> > > > Thanks.
> > > >
> > >
> > >
> > >
> > > --
> > > *Lewis*
> > >
>
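
For anyone testing the URL-filter suggestion from this thread, here is a minimal sketch of the rule logic in Python. It only imitates the general idea of an ordered list of +/- regex rules where the first matching rule decides, as in Nutch's regex-urlfilter.txt; it is not Nutch code. The patterns, the RULES list and the accepts() function are assumptions made up for illustration, and they would need checking against a real crawl.

```python
import re

# Hypothetical rules, evaluated top to bottom; the first match decides.
# "+" means accept the URL, "-" means reject it.
RULES = [
    ("+", r"alb[io]"),                                # any URL containing "albi" or "albo"
    ("+", r"^https?://([a-z0-9-]+\.)*gov\.it(/|$)"),  # hosts under gov.it
    ("-", r"."),                                      # everything else
]

def accepts(url, rules=RULES):
    """Return True if the first rule matching the URL is an accept rule."""
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    return False  # no rule matched: reject


if __name__ == "__main__":
    for candidate in (
        "http://www.bbb.gov.it/index.html",
        "http://www.somesite.it/albi/doc.pdf",
        "http://www.somesite.it/news.html",
    ):
        print(candidate, accepts(candidate))
```

Note that a rule like this only helps when the outside URL itself contains "albi" or "albo". If a page redirects to an address without those words, the target would still be rejected, and the broad pattern also matches unrelated words such as "albion".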
