RE: Effective way to crawling seed and discover new urls.

Markus Jelsma Fri, 13 Dec 2013 02:30:09 -0800

Seems I missed that one :)
 
 
-----Original message-----
> From:Julien Nioche <[email protected]>
> Sent: Friday 13th December 2013 11:22
> To: [email protected]
> Subject: Re: Effective way to crawling seed and discover new urls.
> 
> What's wrong with using the scoring-depth plugin?
> 
> 
> On 13 December 2013 09:33, Markus Jelsma <[email protected]> wrote:
> 
> > Although there is no real notion of depth, as you already figured out, you
> > can keep track of it via a scoring filter.
> >
> > http://grokbase.com/t/nutch/user/1092p10q5g/depth-information-not-being-available-in-crawl-datum
> >
> >
> >
> > -----Original message-----
> > > From:Nguyen Manh Tien <[email protected]>
> > > Sent: Friday 13th December 2013 5:30
> > > To: [email protected]
> > > Subject: Effective way to crawling seed and discover new urls.
> > >
> > > Hi,
> > >
> > > I am crawling a list of home pages to discover new articles, and the
> > > crawler stops at depth 1. But at depth 1 the crawler still adds many new
> > > URLs at depth 2, so even though I only crawl up to depth 1, the crawldb
> > > still holds a great many URLs at depth 2. Is there any way to prevent
> > > that, or do we need to implement a custom plugin?
> > >
> > > Also, I only want to index the articles discovered at depth 1, not the
> > > seeds. Is there a feature for that?
> > >
> > > Thanks.
> > > Tien
> > >
> >
> 
> 
> 
> -- 
> 
> Open Source Solutions for Text Engineering
> 
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
> http://twitter.com/digitalpebble
>
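
For illustration only, here is a rough sketch of the depth-tracking idea discussed in this thread. It is not Nutch code and does not use the scoring-depth plugin or any Nutch API; it is a standalone Python model, and the names crawl, fetch_outlinks, depth_by_url and to_index are all made up for the example. Each URL carries the depth at which it was discovered, outlinks that would exceed the limit are dropped instead of being stored, and only non-seed pages are selected for indexing.

```python
from collections import deque


def crawl(seeds, fetch_outlinks, max_depth=1):
    """Breadth-first crawl that records the depth of every stored URL.

    fetch_outlinks is a hypothetical callable: url -> iterable of outlinks.
    Returns (depth_by_url, to_index).
    """
    # Seeds are depth 0; this dict plays the role of the crawl database.
    depth_by_url = {url: 0 for url in seeds}
    queue = deque(depth_by_url)
    fetched = []

    while queue:
        url = queue.popleft()
        depth = depth_by_url[url]
        links = fetch_outlinks(url)  # the page itself is fetched
        fetched.append(url)

        # Outlinks of a page at max_depth would land at max_depth + 1,
        # so they are discarded rather than added to the database.
        if depth + 1 > max_depth:
            continue

        for link in links:
            if link not in depth_by_url:
                depth_by_url[link] = depth + 1
                queue.append(link)

    # Index only discovered pages (depth > 0), never the seeds.
    to_index = [u for u in fetched if depth_by_url[u] > 0]
    return depth_by_url, to_index


# Toy link graph standing in for real pages (assumed data).
graph = {
    "home": ["article-1", "article-2"],
    "article-1": ["deep-page"],
    "article-2": ["home"],
}
depths, to_index = crawl(["home"], lambda u: graph.get(u, []))
```

With max_depth=1, "deep-page" never enters depths, and to_index holds the two articles but not "home". In Nutch itself the depth would be carried as metadata by a scoring filter, as suggested above, rather than in a dict like this.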
