Re: web crawler job settings

Karl Wright Mon, 01 Jul 2013 07:12:42 -0700

Hi Ahmet,

I would say that would be pretty efficient.  ManifoldCF will still need to
keep records in its jobqueue table for the documents at hopcount=2, but it
will never fetch them.
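
To illustrate why records beyond the hop limit still exist, here is a minimal
sketch in Python. It is a toy model, not ManifoldCF code: the crawl() function,
the links dictionary (standing in for pages and the links found on them), and
the hopcount dictionary (standing in for queue records) are all assumptions
made up for this example.

```python
from collections import deque


def crawl(seeds, links, max_hops):
    """Toy breadth-first crawl with a hop limit.

    Every discovered URL gets a record with its hop count, but only URLs
    whose hop count is <= max_hops are ever fetched.
    """
    hopcount = {url: 0 for url in seeds}  # one record per known URL
    fetched = []                          # URLs actually "fetched", in order
    pending = deque(seeds)

    while pending:
        url = pending.popleft()
        fetched.append(url)
        # "Fetching" a page reveals its outgoing links.
        for target in links.get(url, []):
            if target in hopcount:
                continue  # already recorded with a smaller or equal hop count
            hopcount[target] = hopcount[url] + 1
            if hopcount[target] <= max_hops:
                pending.append(target)
            # Otherwise the record is kept but the URL is never fetched.

    return fetched, hopcount


# Hypothetical link graph: one seed page, two articles, one deeper page.
links = {
    "seed": ["a", "b"],
    "a": ["c"],
    "b": ["c", "seed"],
}
fetched, hopcount = crawl(["seed"], links, max_hops=1)
```

With a maximum hop count of 1, "c" ends up recorded at hop count 2 but is not
in the fetched list, which is the extra bookkeeping described above.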


Karl




On Mon, Jul 1, 2013 at 9:56 AM, Ahmet Arslan <[email protected]> wrote:

> Hi,
>
> I am crawling main pages of some online newspaper web sites.
> I don't need deletes at all. I am using the crawl-once model.
>
> Here are the settings I use:
>
> Schedule type: Scan every document once
> Start Method : Start at beginning of schedule window
>
> Scheduled time: Any day of week at 1 am 3 am 5 am 7 am 9 am 11 am 1 pm 3
> pm 5 pm 7 pm 9 pm 11 pm plus 0 minutes
> Maximum run time: No limit
>
> Maximum hop count for link type 'link': 1
> Maximum hop count for link type 'redirect': Unlimited
> Hop count mode: No deletes, forever
>
> Include only hosts matching seeds? yes
> Seeds: A few URLs in the form of http://main.page.com/{category} where
> category is Sports, Politics etc.
>
> By setting hop count to 1 (or 2) and 'no deletes, forever', I am
> expecting this crawl to be super fast and most efficient. Minimal DB
> queries etc. Am I correct?
>
> Thanks,
> Ahmet
>
>
