Hi people

I'm fairly new to Nutch and am starting to use it on a production site. My
problem comes when I define the seed list with about 20 URLs. I also need
the crawler to explore all of them and discover new ones. That part works:
the crawler finds new URLs and explores them. I set the depth to 20 and
topN to 1000. The problem is that every crawl run hits a time limit or a
URL count and never finishes the seed list. Because of this, the current
index has a lot of newly discovered URLs, but the core seed sites remain
incomplete. If I repeat the crawl, it of course picks up the unfinished
links from the previous level, but it never gets to the end of my seed
list.
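For reference, I'm running roughly the one-step crawl command below (the directory names are just from my setup):

```shell
# One-shot crawl: "urls" holds the seed list, "crawl" is the output dir.
# Depth 20, at most 1000 URLs generated per round via -topN.
bin/nutch crawl urls -dir crawl -depth 20 -topN 1000
```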

Could someone suggest a way to run the kind of crawl I need? Perhaps the
best option would be to finish the seed list before going further with
discovered URLs; is there a way to do that? Another approach might be to
run two Nutch instances: one for the seed list, to crawl those sites
completely, and another for the external URLs. After that, merge the
segments and invert the links.
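As a sketch of that second idea (I haven't tried this yet; the directory names are examples, and I'm assuming the `db.ignore.external.links` property from nutch-default.xml does what I think):

```shell
# Pass 1: crawl only the seed sites. With db.ignore.external.links=true
# in conf/nutch-site.xml, outlinks leaving the seed hosts are dropped,
# so the rounds are spent completing the seed sites.
bin/nutch crawl urls -dir crawl-seeds -depth 20 -topN 1000

# Pass 2: a second instance (or config) with db.ignore.external.links=false
# follows the outlinks into external sites.
bin/nutch crawl urls -dir crawl-external -depth 20 -topN 1000

# Then merge the segments from both crawls and rebuild the linkdb:
bin/nutch mergesegs merged/segments \
    crawl-seeds/segments/* crawl-external/segments/*
bin/nutch invertlinks merged/linkdb -dir merged/segments
```

Would something like this work, or is there a cleaner way?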

Could someone give me a tip about this?
Thanks a lot
Germán
