To store the crawldb and the segments in different directories, you will have 
to use the inject, generate, fetch, parse and updatedb commands. These 
commands let you specify the crawldb and segments paths separately.
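
For example, one round of the crawl cycle with explicit paths could look 
roughly like this (myCrawldb, mySegments and urlDirectoryToInject are 
placeholder paths of my own choosing, and the segment name is the timestamp 
created by the generate step):

# inject the seed URLs into a crawldb stored wherever you like
bin/nutch inject myCrawldb urlDirectoryToInject
# generate a fetch list in a separate segments directory
bin/nutch generate myCrawldb mySegments
# pick up the newly created segment (named by its timestamp)
segment=`ls -d mySegments/2* | tail -1`
bin/nutch fetch $segment
bin/nutch parse $segment
# fold the fetched/parsed results back into the crawldb
bin/nutch updatedb myCrawldb $segment

Repeat generate/fetch/parse/updatedb for each round, pointing mySegments at a 
different directory per crawl if you want the segments kept apart.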

The only way I see in Nutch 1.4 to do this using the crawl command is to move 
the segments around between consecutive crawls.

Example:

bin/nutch crawl urlDirectoryToInject -dir mainDirectory
mv mainDirectory/segments segmentsDirectoryOne
mkdir mainDirectory/segments
bin/nutch crawl urlDirectoryToInject -dir mainDirectory
mv mainDirectory/segments segmentsDirectoryTwo
mkdir mainDirectory/segments
...

Not super elegant, but it works :-)

Regards
RemyA

On 30 August 2012, at 12:06, Max Dzyuba wrote:

> Does anybody know the answer to my question below? Let me know if the
> question is not clear. 
> 
> Is it possible to use the same crawldb but store segment data in a different
> directory for consecutive crawls using the "bin/nutch crawl" command? I
> thought that there is no option to specify the path to crawldb or linkdb,
> but only the path to a directory where to save all crawl data (including
> crawldb, linkdb and segments) into. I'm using Nutch 1.5. If it's possible,
> what would the crawl command look like?
> 
> 
> Thanks in advance!
> Max
