Thanks all, I found that scoring-depth in 1.x has the feature I want. In Nutch 2.x, https://issues.apache.org/jira/browse/NUTCH-1431 already implements the depth concept provided by scoring-depth in 1.x.
But scoring-depth can also truncate outlinks that have depth > maxDepth, and it can prioritize URLs with a smaller depth during generate. Should we port those features of the plugin to Nutch 2.x? I am willing to implement that and provide a patch.

For my second question, "And i only want to index discovered article at depth 1, not seed. do we have a feature to do that?", I assume that can be done with the current features in Nutch 1.x and 2.x, right?

Tien

On Fri, Dec 13, 2013 at 5:28 PM, Markus Jelsma <[email protected]> wrote:

> Seems i missed that one :)
>
> -----Original message-----
> > From:Julien Nioche <[email protected]>
> > Sent: Friday 13th December 2013 11:22
> > To: [email protected]
> > Subject: Re: Effective way to crawling seed and discover new urls.
> >
> > What's wrong with using the scoring-depth plugin?
> >
> > On 13 December 2013 09:33, Markus Jelsma <[email protected]> wrote:
> >
> > > Although there is no real notion of depth, as you already figured out, you
> > > can keep track of it via a scoring filter.
> > >
> > > http://grokbase.com/t/nutch/user/1092p10q5g/depth-information-not-being-available-in-crawl-datum
> > >
> > > -----Original message-----
> > > > From:Nguyen Manh Tien <[email protected]>
> > > > Sent: Friday 13th December 2013 5:30
> > > > To: [email protected]
> > > > Subject: Effective way to crawling seed and discover new urls.
> > > >
> > > > Hi,
> > > >
> > > > I am crawling a list of home pages to discover new articles, crawler will
> > > > stop at depth 1. But at depth 1, crawler still add many new urls with depth
> > > > 2, so even i only crawl up to depth 1 but crawldb still have many, many
> > > > urls at depth 2. Is there any way to prevent that or we need to implement a
> > > > custom plugin?
> > > >
> > > > And i only want to index discovered article at depth 1, not seed. do we
> > > > have a feature to do that?
> > > >
> > > > Thanks.
> > > > Tien
> > > >
> > >
> >
> > --
> > Open Source Solutions for Text Engineering
> >
> > http://digitalpebble.blogspot.com/
> > http://www.digitalpebble.com
> > http://twitter.com/digitalpebble
> >
>
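
To make the three behaviours discussed in this thread concrete (dropping outlinks beyond maxDepth, preferring smaller depth at generate time, and indexing only depth-1 pages), here is a rough standalone sketch. It does not use the Nutch plugin API and is not the scoring-depth code; the class and method names (DepthSketch, Entry, expand, prioritize, shouldIndex) are made up for illustration, and counting seeds as depth 0 is an assumption chosen to match the wording "article at depth 1, not seed".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration only; none of these names exist in Nutch.
final class DepthSketch {

    /** A URL together with the depth at which it was discovered (seed = 0, an assumption). */
    static final class Entry {
        final String url;
        final int depth;

        Entry(String url, int depth) {
            this.url = url;
            this.depth = depth;
        }
    }

    /** Outlinks get parent depth + 1; they are all dropped if that would exceed maxDepth. */
    static List<Entry> expand(Entry parent, List<String> outlinks, int maxDepth) {
        List<Entry> kept = new ArrayList<>();
        int childDepth = parent.depth + 1;
        if (childDepth > maxDepth) {
            return kept; // truncated: nothing deeper than maxDepth reaches the crawl db
        }
        for (String url : outlinks) {
            kept.add(new Entry(url, childDepth));
        }
        return kept;
    }

    /** Generate-time ordering: entries with a smaller depth come first (stable sort). */
    static List<Entry> prioritize(List<Entry> candidates) {
        List<Entry> sorted = new ArrayList<>(candidates);
        sorted.sort(Comparator.comparingInt((Entry e) -> e.depth));
        return sorted;
    }

    /** Index only pages found at exactly indexDepth, which skips the seeds at depth 0. */
    static boolean shouldIndex(Entry entry, int indexDepth) {
        return entry.depth == indexDepth;
    }

    public static void main(String[] args) {
        Entry seed = new Entry("http://example.com/", 0);
        List<Entry> articles = expand(seed,
                Arrays.asList("http://example.com/a", "http://example.com/b"), 1);
        for (Entry e : articles) {
            System.out.println(e.url + " depth=" + e.depth + " index=" + shouldIndex(e, 1));
        }
    }
}
```

With maxDepth = 1, expanding a depth-1 article returns an empty list, so no depth-2 URLs would pile up in the crawl db, which was the problem in my original question.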

