Hello, I need some help with my current project. In this project, URLs of sites can be added at any time, and what I'm trying to achieve is a full crawl of a site right after it's added. So imagine I've already crawled two websites and then I add another one: at that moment I want to start crawling the new site, but not recrawl the other two.
I also want a single index so I can search across all the websites that have been crawled. After that brief explanation, this is what I've done so far:

- I've set up a crawl cycle that runs every time a new URL is added.
- I've configured db.fetch.interval.default (2592000), db.fetch.schedule.class (Adaptive) and db.signature.class (org.apache.nutch.crawl.TextProfileSignature).

The problem is that, as far as I know, when I start a new crawl cycle all pages are checked for fetching, and the ones marked db_gone or that have reached their next fetch time are re-fetched. But I want only the pages from the new URL to be fetched, because I have a scheduled job that handles the recrawl process for all the URLs. Is there a way to achieve this?

I was trying to create a new crawldb for each crawl cycle and then merge it with the crawldb holding all the URLs, but I haven't been able to make it work. I also don't know whether this is a good idea, because a user can insert the same URL more than once, so if I create a new crawldb each time, the URL will be crawled again even though it has already been crawled.

Any help would be appreciated. Thanks.

P.S. I'm using Nutch from Java (well, Grails really, but Java is fine for now).

--
View this message in context: http://lucene.472066.n3.nabble.com/Crawling-different-websites-one-each-full-crawl-cycle-tp2278607p2278607.html
Sent from the Nutch - User mailing list archive at Nabble.com.
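For reference, the settings described in the message would look roughly like this in conf/nutch-site.xml. This is a sketch under two assumptions: that "Adaptive" refers to org.apache.nutch.crawl.AdaptiveFetchSchedule, and that the interval 2592000 is in seconds (30 days), which is how Nutch interprets db.fetch.interval.default:

```xml
<!-- Sketch of the settings mentioned above, for nutch-site.xml -->
<property>
  <name>db.fetch.interval.default</name>
  <value>2592000</value> <!-- 30 days, in seconds -->
</property>
<property>
  <name>db.fetch.schedule.class</name>
  <!-- assumption: "Adaptive" = the AdaptiveFetchSchedule class -->
  <value>org.apache.nutch.crawl.AdaptiveFetchSchedule</value>
</property>
<property>
  <name>db.signature.class</name>
  <value>org.apache.nutch.crawl.TextProfileSignature</value>
</property>
```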
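The "separate crawldb per new site, then merge" idea from the message could be sketched like this with the Nutch 1.x command-line tools. Directory names and the loop depth are placeholders, not anything from the original post; note that mergedb (CrawlDbMerger) merges entries keyed by URL, which is also relevant to the duplicate-URL concern, since merging the same URL twice does not produce two entries:

```sh
# Sketch: crawl only the newly added site in its own crawldb,
# then fold it into the combined crawldb used for search.
NEW=crawl_new            # per-site working dir (placeholder name)
MAIN=crawl/crawldb       # combined crawldb (placeholder name)

# 1. Inject only the new URL into a fresh crawldb
bin/nutch inject $NEW/crawldb urls_new/

# 2. Run generate/fetch/parse/updatedb rounds against the fresh crawldb only;
#    the main crawldb is untouched, so nothing else gets re-fetched
for i in 1 2 3; do
  bin/nutch generate $NEW/crawldb $NEW/segments
  SEG=$(ls -d $NEW/segments/* | tail -1)
  bin/nutch fetch $SEG
  bin/nutch parse $SEG
  bin/nutch updatedb $NEW/crawldb $SEG
done

# 3. Merge the per-site crawldb into the main one; entries with the
#    same URL are merged rather than duplicated
bin/nutch mergedb crawl/crawldb_merged $MAIN $NEW/crawldb
rm -r $MAIN && mv crawl/crawldb_merged $MAIN
```

Indexing would then run against the single merged crawldb and its segments, keeping one searchable index across all sites.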

