Hi!

Thank you for your answer. You're right, maybe an example would explain
better what I need to do.

Here is the task. I need to crawl a specific domain (.gov.it), starting
from an initial set of seeds, for example www.aaa.it, www.bbb.gov.it and
www.ccc.it. I configured Nutch so that it doesn't fetch pages outside that
domain. However, some of the resources I need to download (documents) are
stored on web sites outside the domain I'm interested in.
For example: www.aaa.it/subfolder/albi redirects to www.somesite.it (where
www.somesite.it is not inside "my" domain). Nutch will not fetch that page,
since I told it to behave that way, but I need to download the documents
stored on www.somesite.it. So I need Nutch to leave the domain I specified
only when it sees the word "albi" or "albo" inside the URL, since those
words identify the documents I need. How can I do this?
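
To make the rule concrete, here is a rough sketch of the accept/reject
decision I'm after. It is plain Python, not Nutch code: the names
should_fetch, ALLOWED_HOSTS, ALLOWED_SUFFIX and KEYWORDS are made up for
illustration, and the host list is just the example seeds above.

```python
import re
from urllib.parse import urlparse

# Hypothetical names throughout: this only models the decision I want,
# it does not use any Nutch API.
ALLOWED_HOSTS = {"www.aaa.it", "www.bbb.gov.it", "www.ccc.it"}  # example seeds
ALLOWED_SUFFIX = ".gov.it"
KEYWORDS = re.compile(r"alb[io]", re.IGNORECASE)  # matches "albi" or "albo"


def should_fetch(url: str) -> bool:
    """Accept URLs inside the crawl scope, or any URL mentioning albi/albo."""
    host = (urlparse(url).hostname or "").lower()
    in_scope = host in ALLOWED_HOSTS or host.endswith(ALLOWED_SUFFIX)
    return in_scope or KEYWORDS.search(url) is not None
```

One thing I'm unsure about: this only works when the keyword is part of the
URL being checked. If the redirect target on www.somesite.it no longer
contains "albi" or "albo", a check like this would still reject it.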

I hope I've been clear. :)



2011/11/30 Lewis John Mcgibbney <[email protected]>

> Hi Adriana,
>
> This should be achievable through fine-grained URL filters. It is kind of
> hard to elaborate on this without you providing some examples of the
> type of stuff you're trying to do!
>
> Lewis
>
> On Mon, Nov 28, 2011 at 11:14 AM, Adriana Farina <[email protected]> wrote:
>
> > Hello,
> >
> > I've been using Nutch 1.3 for just a month, so I'm not an expert. I
> > configured it so that it doesn't fetch pages outside a specific domain.
> > However, now I need to let it fetch pages outside the domain I chose,
> > but only for some URLs (not for all the URLs I have to crawl). How can I
> > do this? Do I have to write a new plugin?
> >
> > Thanks.
> >
>
>
>
> --
> *Lewis*
>
