Hello, I need some help with my current project. In this project, URLs of sites can be added at any time, and what I'm trying to achieve is a full crawl of a site right after it's added. So imagine I've already crawled two websites and then I add another one: at that moment I want to start crawling the new site, but not recrawl the other two.
I also want a single index so I can search across all the websites that have been crawled. After that brief explanation, this is what I've done so far:

- I've set up a crawl cycle that runs every time a new URL is added.
- I've configured db.fetch.interval.default (2592000), db.fetch.schedule.class (Adaptive) and db.signature.class (org.apache.nutch.crawl.TextProfileSignature).

The problem is that, as far as I know, when I start a new crawl cycle all pages are checked for fetching, and the ones marked db_gone or that have reached their next fetch time are re-fetched. But I want only the pages from the new URL to be fetched, because I have a scheduled job that handles the recrawl process for all the URLs. Is there a way to achieve this?

I was trying to create a new crawldb for each crawl cycle and then merge it with the crawldb holding all the URLs, but I haven't been able to make it work. I also don't know whether this is a good idea, because a user can insert the same URL more than once, so if I create a new crawldb each time, the URL will be crawled again even though it has already been crawled.

Any help would be appreciated. Thanks.

P.S. I'm using Nutch from Java (well, Grails really, but Java is fine for now).

--
View this message in context: http://lucene.472066.n3.nabble.com/Crawling-different-websites-one-each-full-crawl-cycle-tp2278607p2278607.html
Sent from the Nutch - User mailing list archive at Nabble.com.
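For reference, the settings described in the message would look roughly like this in conf/nutch-site.xml. This is a sketch under two assumptions: that "Adaptive" refers to org.apache.nutch.crawl.AdaptiveFetchSchedule, and that the interval 2592000 is in seconds (30 days), which is how Nutch interprets db.fetch.interval.default:

```xml
<!-- Sketch of the settings mentioned above, for nutch-site.xml -->
<property>
  <name>db.fetch.interval.default</name>
  <value>2592000</value> <!-- 30 days, in seconds -->
</property>
<property>
  <name>db.fetch.schedule.class</name>
  <!-- assumption: "Adaptive" = the AdaptiveFetchSchedule class -->
  <value>org.apache.nutch.crawl.AdaptiveFetchSchedule</value>
</property>
<property>
  <name>db.signature.class</name>
  <value>org.apache.nutch.crawl.TextProfileSignature</value>
</property>
```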
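The "separate crawldb per new site, then merge" idea from the message could be sketched like this with the Nutch 1.x command-line tools. Directory names and the loop depth are placeholders, not anything from the original post; note that mergedb (CrawlDbMerger) merges entries keyed by URL, which is also relevant to the duplicate-URL concern, since merging the same URL twice does not produce two entries:

```sh
# Sketch: crawl only the newly added site in its own crawldb,
# then fold it into the combined crawldb used for search.
NEW=crawl_new            # per-site working dir (placeholder name)
MAIN=crawl/crawldb       # combined crawldb (placeholder name)

# 1. Inject only the new URL into a fresh crawldb
bin/nutch inject $NEW/crawldb urls_new/

# 2. Run generate/fetch/parse/updatedb rounds against the fresh crawldb only;
#    the main crawldb is untouched, so nothing else gets re-fetched
for i in 1 2 3; do
  bin/nutch generate $NEW/crawldb $NEW/segments
  SEG=$(ls -d $NEW/segments/* | tail -1)
  bin/nutch fetch $SEG
  bin/nutch parse $SEG
  bin/nutch updatedb $NEW/crawldb $SEG
done

# 3. Merge the per-site crawldb into the main one; entries with the
#    same URL are merged rather than duplicated
bin/nutch mergedb crawl/crawldb_merged $MAIN $NEW/crawldb
rm -r $MAIN && mv crawl/crawldb_merged $MAIN
```

Indexing would then run against the single merged crawldb and its segments, keeping one searchable index across all sites.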

