Hi German,

The best way of doing this would be to track the depth of each URL since
injection and give priority to the lowest depths during the generation
step. This can be done by writing a custom scoring filter implementing the
ScoringFilter interface
(http://nutch.apache.org/apidocs-1.2/org/apache/nutch/scoring/ScoringFilter.html).
Not trivial, but that would definitely work.
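
Below is a rough, completely untested sketch of what such a filter could
look like against the 1.2 API. The metadata key name, the 1000/(depth+1)
sort formula and the class name are arbitrary choices, not anything that
ships with Nutch: seeds get depth 0 at injection, outlinks inherit the
parent's depth + 1 through the parse metadata, the crawldb keeps the
smallest depth seen for each URL, and the generator sorts the lowest
depths first.

import java.util.Collection;
import java.util.List;
import java.util.Map.Entry;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.indexer.NutchDocument;
import org.apache.nutch.parse.Parse;
import org.apache.nutch.parse.ParseData;
import org.apache.nutch.protocol.Content;
import org.apache.nutch.scoring.ScoringFilter;

// Hypothetical depth-based scoring filter: seeds are depth 0, outlinks are
// parent depth + 1, and the generator prefers the lowest depths.
public class DepthScoringFilter implements ScoringFilter {

  private static final Text DEPTH_KEY = new Text("_depth_");
  private Configuration conf;

  public void setConf(Configuration conf) { this.conf = conf; }
  public Configuration getConf() { return conf; }

  // Injected seeds start at depth 0.
  public void injectedScore(Text url, CrawlDatum datum) {
    datum.getMetaData().put(DEPTH_KEY, new Text("0"));
  }

  public void initialScore(Text url, CrawlDatum datum) {
    // the score itself is not used by this filter
  }

  // Lower depth => higher sort value, so the generator picks seeds first.
  public float generatorSortValue(Text url, CrawlDatum datum, float initSort) {
    return initSort + 1000.0f / (getDepth(datum, 1000) + 1);
  }

  // Carry the parent's depth into the fetched content's metadata...
  public void passScoreBeforeParsing(Text url, CrawlDatum datum, Content content) {
    content.getMetadata().set(DEPTH_KEY.toString(),
        Integer.toString(getDepth(datum, 1000)));
  }

  // ...and from the content into the parse metadata.
  public void passScoreAfterParsing(Text url, Content content, Parse parse) {
    String depth = content.getMetadata().get(DEPTH_KEY.toString());
    if (depth != null)
      parse.getData().getParseMeta().set(DEPTH_KEY.toString(), depth);
  }

  // Outlinks are one level deeper than the page that produced them.
  public CrawlDatum distributeScoreToOutlinks(Text fromUrl, ParseData parseData,
      Collection<Entry<Text, CrawlDatum>> targets, CrawlDatum adjust, int allCount) {
    String parentDepth = parseData.getParseMeta().get(DEPTH_KEY.toString());
    int depth = parentDepth == null ? 1000 : Integer.parseInt(parentDepth);
    Text childDepth = new Text(Integer.toString(depth + 1));
    for (Entry<Text, CrawlDatum> target : targets) {
      target.getValue().getMetaData().put(DEPTH_KEY, childDepth);
    }
    return adjust;
  }

  // Keep the smallest depth seen for a URL when the crawldb is updated.
  public void updateDbScore(Text url, CrawlDatum old, CrawlDatum datum,
      List<CrawlDatum> inlinked) {
    int depth = getDepth(datum, Integer.MAX_VALUE);
    if (old != null) depth = Math.min(depth, getDepth(old, Integer.MAX_VALUE));
    for (CrawlDatum inlink : inlinked) {
      depth = Math.min(depth, getDepth(inlink, Integer.MAX_VALUE));
    }
    if (depth != Integer.MAX_VALUE)
      datum.getMetaData().put(DEPTH_KEY, new Text(Integer.toString(depth)));
  }

  public float indexerScore(Text url, NutchDocument doc, CrawlDatum dbDatum,
      CrawlDatum fetchDatum, Parse parse, Inlinks inlinks, float initScore) {
    return initScore;
  }

  private int getDepth(CrawlDatum datum, int defaultDepth) {
    Text depth = (Text) datum.getMetaData().get(DEPTH_KEY);
    return depth == null ? defaultDepth : Integer.parseInt(depth.toString());
  }
}

You would then need to package it as a plugin (same layout as
scoring-opic, with its own plugin.xml) and activate it via the
plugin.includes property in nutch-site.xml.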

HTH

Julien

-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com


On 11 December 2010 19:55, Germán Biozzoli <[email protected]> wrote:

> Hi people
>
> I'm fairly new to Nutch and I'm starting to use it on a production
> site. My problem comes when I define the seed list with almost 20
> URLs. I also need the crawler to explore all of them and discover new
> ones. All of this works OK, i.e. the crawler obtains new URLs and
> explores them. I set the depth to 20 and use topN with 1000. My
> problem is that every crawl process hits a time limit or a number of
> URLs and never completes the seed list. Because of this, the current
> index has a lot of newly discovered URLs, but the core URL sites
> remain incomplete. I repeat the crawl and it of course picks up the
> incomplete links from the previous level... but it never manages to
> get through and complete my seed list.
>
> I'd appreciate it if someone could suggest a way to run the type of
> crawl that I need. Perhaps the best option would be to complete the
> seed list before going further with the discovered URLs; is there a
> way to do so? Another approach could be to use two Nutch instances,
> one for the seed list, to complete those sites, and another to go
> after the external URLs, then merge the segments and invert the links.
>
> Could someone give me a tip about this?
> Thanks a lot
> Germán
>
