RE: Effective way to crawling seed and discover new urls.

Markus Jelsma Fri, 13 Dec 2013 01:34:03 -0800

Although there is no real notion of depth, as you already figured out, you can 
keep track of it via a scoring filter. This earlier thread covers the approach:
http://grokbase.com/t/nutch/user/1092p10q5g/depth-information-not-being-available-in-crawl-datum
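
To make the idea a bit more concrete, here is a rough sketch of the bookkeeping such a filter would have to do. It is a plain Python simulation, not Nutch code: the function names (inject, update, indexable), the MAX_DEPTH setting and the dictionary standing in for the crawldb are all made up for illustration. A real scoring filter would presumably store the depth in the datum's metadata and pass it on to the outlinks instead.

```python
# Simulation only: a plain-Python stand-in for the depth bookkeeping a
# scoring filter could do. None of these names are Nutch APIs.

MAX_DEPTH = 1  # hypothetical limit: seeds are depth 0, articles are depth 1


def inject(seeds):
    """Seed the crawl db; every injected URL starts at depth 0."""
    return {url: 0 for url in seeds}


def update(crawldb, parent, outlinks, max_depth=MAX_DEPTH):
    """Add the outlinks of a fetched page, one level below their parent.

    Outlinks deeper than max_depth are dropped instead of stored, and a URL
    reached by several paths keeps its smallest depth.
    """
    child_depth = crawldb[parent] + 1
    if child_depth > max_depth:
        return crawldb
    for url in outlinks:
        crawldb[url] = min(crawldb.get(url, child_depth), child_depth)
    return crawldb


def indexable(crawldb, min_depth=1):
    """URLs to index: everything except the seeds at depth 0."""
    return sorted(url for url, depth in crawldb.items() if depth >= min_depth)


if __name__ == "__main__":
    db = inject(["http://example.com/"])
    update(db, "http://example.com/", ["http://example.com/a", "http://example.com/b"])
    # Outlinks of a depth-1 page would be depth 2, so they are not added.
    update(db, "http://example.com/a", ["http://example.com/a/next"])
    print(db)
    print(indexable(db))
```

The two points that matter for your case: URLs that would end up at depth 2 never reach the crawldb, and the seeds can be skipped at indexing time because their depth is 0.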


 
 
-----Original message-----
> From:Nguyen Manh Tien <[email protected]>
> Sent: Friday 13th December 2013 5:30
> To: [email protected]
> Subject: Effective way to crawling seed and discover new urls.
> 
> Hi,
> 
> I am crawling a list of home pages to discover new articles, and the crawler
> stops at depth 1. But at depth 1 the crawler still adds many new URLs with
> depth 2, so even though I only crawl up to depth 1, the crawldb still holds
> many URLs at depth 2. Is there any way to prevent that, or do we need to
> implement a custom plugin?
> 
> Also, I only want to index the articles discovered at depth 1, not the
> seeds. Is there a feature for that?
> 
> Thanks.
> Tien
>
