Hi Germán, I think you've already picked up on the best ways to do what you want to do. I'd probably limit your first crawl to just your 20 sites using crawl-urlfilter.txt, and then (if you want external sites) move outside of that with a larger, unbounded crawl. The other thing you could do is increase your topN - in my experience, 1000 is pretty tiny. A depth of 20 suggests to me that your goal is to move outside of your 20 initial sites; if you assume the number of URLs grows by a factor of roughly 20 at each level, by level 3 you're already at 8,000 (20^3) URLs - you've outstripped your topN of 1000 by leaps and bounds before you're even a quarter of the way through a depth-20 crawl.
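For example, restricting the first crawl to your seed domains in conf/crawl-urlfilter.txt might look roughly like this (example1.com and example2.com are stand-ins for your real sites, and exact syntax can vary a bit by Nutch version):

    # conf/crawl-urlfilter.txt -- accept only the seed domains
    +^http://([a-z0-9]*\.)*example1\.com/
    +^http://([a-z0-9]*\.)*example2\.com/
    # ... one +^ line per seed domain ...
    # reject everything else
    -.

and the one-step crawl with a larger topN would be along the lines of:

    bin/nutch crawl urls -dir crawl-seeds -depth 20 -topN 50000

where urls/ holds your seed list and crawl-seeds is just a placeholder output directory.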
Hope this helps,
Rob

-----Original Message-----
From: Germán Biozzoli [mailto:[email protected]]
Sent: Saturday, December 11, 2010 11:55 AM
To: [email protected]
Subject: Difficult crawling

Hi people,

I'm fairly new to Nutch and am starting to use it on a production site. My problem comes when I define the URL seed list with almost 20 URLs. I also need the crawler to explore all of them and discover new ones. That part works: the crawler finds new URLs and explores them. I set the depth to 20 and use topN with 1000. The problem is that every crawl run hits a time limit or a URL count and never finishes the seed list, so the current index contains a lot of newly discovered URLs while the core seed sites remain incomplete. If I repeat the crawl, it of course picks up the incomplete links from the previous level, but it never manages to finish my seed list.

Could someone suggest a way to do the kind of crawl I need? Perhaps the best approach would be to complete the seed list before going further with discovered URLs - is there a way to do that? Another approach might be to run two Nutch instances, one for the seed list, to crawl those sites completely, and another for the external URLs, then merge the segments and invert the links. Could someone give me a tip about this?

Thanks a lot
Germán
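(For reference, a rough sketch of the two-crawl-and-merge approach described above, assuming Nutch 1.x command-line tools; all directory names are placeholders and exact options vary by version:)

    # Crawl 1: seed sites only (crawl-urlfilter.txt restricted to the 20 seed domains)
    bin/nutch crawl urls -dir crawl-seeds -depth 20 -topN 50000

    # Crawl 2: wider crawl that is allowed to follow external links
    bin/nutch crawl urls -dir crawl-external -depth 10 -topN 100000

    # Merge the fetched segments from both crawls, then rebuild the link database
    bin/nutch mergesegs crawl-merged/segments crawl-seeds/segments/* crawl-external/segments/*
    bin/nutch invertlinks crawl-merged/linkdb -dir crawl-merged/segments

After invertlinks, the merged segments and linkdb can be indexed as usual.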

