Thanks all,

I found that scoring-depth in 1.x has the feature I want. In Nutch 2.x,
https://issues.apache.org/jira/browse/NUTCH-1431 has already implemented the
depth concept provided by scoring-depth in 1.x.

But scoring-depth can also truncate outlinks that have depth > maxDepth, and
it can prioritize URLs with a smaller depth value at generate time. Should we
port those features of the plugin to Nutch 2.x?
I am willing to implement that and provide a patch.
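
To make the two features concrete, here is a minimal, self-contained Java
sketch of the logic only. It does not use the Nutch plugin API: the class
DepthSketch, its method names, and the convention that seeds sit at depth 0
are assumptions made for illustration, not the actual scoring-depth code.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch of the two behaviours discussed above.
// Not the real scoring-depth plugin; names and conventions are assumed.
final class DepthSketch {

    // A URL together with the depth at which it was discovered.
    static final class Link {
        final String url;
        final int depth;

        Link(String url, int depth) {
            this.url = url;
            this.depth = depth;
        }
    }

    // Outlinks are one level deeper than their parent page.
    // If that level exceeds maxDepth, none of them are kept,
    // so they never reach the crawl database.
    static List<Link> truncateOutlinks(int parentDepth, List<String> outlinks, int maxDepth) {
        List<Link> kept = new ArrayList<>();
        int childDepth = parentDepth + 1;
        if (childDepth > maxDepth) {
            return kept;
        }
        for (String url : outlinks) {
            kept.add(new Link(url, childDepth));
        }
        return kept;
    }

    // Orders candidates so that shallower URLs are selected first.
    // The sort is stable, so URLs at the same depth keep their relative order.
    static List<Link> orderForGenerate(List<Link> candidates) {
        List<Link> sorted = new ArrayList<>(candidates);
        sorted.sort(Comparator.comparingInt((Link l) -> l.depth));
        return sorted;
    }
}
```

With maxDepth = 1 and seeds at depth 0, outlinks of a seed are kept at depth
1, while outlinks of a depth-1 page are dropped instead of being stored at
depth 2.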

For my second question ("And I only want to index discovered articles at
depth 1, not seeds. Do we have a feature to do that?"),
I assume that can be done with the current features in Nutch 1.x and 2.x, right?

Tien


On Fri, Dec 13, 2013 at 5:28 PM, Markus Jelsma
<[email protected]> wrote:

> Seems i missed that one :)
>
>
> -----Original message-----
> > From:Julien Nioche <[email protected]>
> > Sent: Friday 13th December 2013 11:22
> > To: [email protected]
> > Subject: Re: Effective way to crawling seed and discover new urls.
> >
> > What's wrong with using the scoring-depth plugin?
> >
> >
> > On 13 December 2013 09:33, Markus Jelsma <[email protected]> wrote:
> >
> > > Although there is no real notion of depth, as you already figured out,
> > > you can keep track of it via a scoring filter.
> > >
> > > http://grokbase.com/t/nutch/user/1092p10q5g/depth-information-not-being-available-in-crawl-datum
> > >
> > >
> > >
> > > -----Original message-----
> > > > From:Nguyen Manh Tien <[email protected]>
> > > > Sent: Friday 13th December 2013 5:30
> > > > To: [email protected]
> > > > Subject: Effective way to crawling seed and discover new urls.
> > > >
> > > > Hi,
> > > >
> > > > I am crawling a list of home pages to discover new articles, and the
> > > > crawler will stop at depth 1. But at depth 1 the crawler still adds
> > > > many new URLs with depth 2, so even though I only crawl up to depth 1,
> > > > the crawldb still has many, many URLs at depth 2. Is there any way to
> > > > prevent that, or do we need to implement a custom plugin?
> > > >
> > > > And I only want to index discovered articles at depth 1, not seeds. Do
> > > > we have a feature to do that?
> > > >
> > > > Thanks.
> > > > Tien
> > > >
> > >
> >
> >
> >
> > --
> >
> > Open Source Solutions for Text Engineering
> >
> > http://digitalpebble.blogspot.com/
> > http://www.digitalpebble.com
> > http://twitter.com/digitalpebble
> >
>
