web crawler job settings

Ahmet Arslan Mon, 01 Jul 2013 06:57:14 -0700

Hi,

I am crawling the main pages of some online newspaper web sites.
I don't need deletes at all, and I am using the crawl-once model.


Here are the settings I use:

Schedule type: Scan every document once
Start method: Start at beginning of schedule window

Scheduled time: Any day of week at 1 am, 3 am, 5 am, 7 am, 9 am, 11 am,
1 pm, 3 pm, 5 pm, 7 pm, 9 pm, 11 pm plus 0 minutes
Maximum run time: No limit

Maximum hop count for link type 'link': 1
Maximum hop count for link type 'redirect': Unlimited
Hop count mode: No deletes, forever

Include only hosts matching seeds? yes
Seeds: A few URLs of the form http://main.page.com/{category}, where category
is Sports, Politics, etc.

By setting the hop count to 1 (or 2) and the hop count mode to 'No deletes,
forever', I am expecting this crawl to be as fast and efficient as possible,
with a minimal number of DB queries. Am I correct?
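
To make clear what I expect a hop count of 1 to mean, here is a rough sketch in Python. It is only an illustration of a hop-limited, visit-once traversal, not how the crawler itself is implemented; the function name crawl_once and the link graph below are made up for the example.

```python
from collections import deque


def crawl_once(seeds, links, max_hops):
    """Visit each URL at most once, following links up to max_hops from a seed."""
    seen = set(seeds)
    queue = deque((url, 0) for url in seeds)
    fetched = []
    while queue:
        url, hops = queue.popleft()
        fetched.append(url)
        if hops == max_hops:
            # Links found on this page are beyond the hop limit; do not follow them.
            continue
        for target in links.get(url, []):
            if target not in seen:
                seen.add(target)
                queue.append((target, hops + 1))
    return fetched


# Hypothetical link graph: one seed page, two articles, one deeper page.
links = {
    "http://main.page.com/Sports": [
        "http://main.page.com/Sports/article-1",
        "http://main.page.com/Sports/article-2",
    ],
    "http://main.page.com/Sports/article-1": [
        "http://main.page.com/Sports/deeper-page",
    ],
}

fetched = crawl_once(["http://main.page.com/Sports"], links, max_hops=1)
```

With max_hops=1, only the seed page and the pages it links to directly are fetched; the deeper page is never queued.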

Thanks,
Ahmet
