Hi Nguyen,

On Fri, Dec 13, 2013 at 4:28 AM, <user-digest-h...@nutch.apache.org> wrote:
> I am crawling a list of home pages to discover new articles; the crawler
> will stop at depth 1. But at depth 1 the crawler still adds many new URLs
> at depth 2, so even though I only crawl up to depth 1, the crawldb still
> has many, many URLs at depth 2. Is there any way to prevent that, or do we
> need to implement a custom plugin?
>
> And I only want to index articles discovered at depth 1, not the seeds. Do
> we have a feature to do that?

Assuming you are still deploying 2.x, please see the generate.max.distance property in nutch-site.xml. This will do what you want. It was a very neat patch committed by Ferdy. Some history can be found here: https://issues.apache.org/jira/browse/NUTCH-1431

Thanks
Lewis
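As a rough sketch, an entry in nutch-site.xml limiting generation to pages at most one hop from the seeds could look like the following. The value 1 and the description text are illustrative assumptions on my part; check NUTCH-1431 and your version's nutch-default.xml for the exact semantics and default:

```xml
<!-- nutch-site.xml (Nutch 2.x), standard Hadoop-style property format.
     Assumed semantics: seeds are distance 0, so a value of 1 lets the
     generator select only seeds and pages one hop away; a negative
     value disables the limit. -->
<property>
  <name>generate.max.distance</name>
  <value>1</value>
  <description>Maximum link distance from a seed URL that the generator
  is allowed to select for fetching; negative means unlimited.</description>
</property>
```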