Hi,

Could you be a bit more specific about what type of URLs are refetched? In
general it is advisable to run the individual jobs explicitly, to have more
control over the crawling (inject, generate, fetch, parse, updatedb, etc.).
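For reference, a manual crawl cycle in Nutch 2.x typically looks something like the sketch below. The seed directory name (`urls`) and the `-topN` value are placeholders; depending on your setup you may pass the batch id printed by `generate` to `fetch` and `parse` instead of `-all`.

```sh
# Inject the seed URLs into the webpage store
./nutch inject urls

# One explicit crawl cycle: generate a batch, fetch it, parse it,
# then update the db so fetch intervals and new links are recorded
./nutch generate -topN 30000
./nutch fetch -all
./nutch parse -all
./nutch updatedb
```

Running the jobs separately like this makes it easier to see which step is scheduling the old pages for refetching.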

Ferdy.

On Thu, Aug 16, 2012 at 12:55 PM, Hugo Alves <[email protected]> wrote:

> Hi.
>
> I am using nutch 2.0 with hsql.
>
> I've created some plugins for parsing special content inside a company
> website. The plugins parse the content and then send some data to a
> SQL Server database; this is working fine. But the problem is the crawl
> command. I am starting Nutch with:
> ./nutch crawl -depth 300 -topN 30000
>
> In nutch-site.xml I configured the refetch interval to 30 days (the
> default value), but after each cycle Nutch fetches both the newly
> found pages and the old pages.
>
> What am I doing wrong?
>
