Re: NutchTutorial Followed Crawldb Not Created

Sebastian Nagel Sat, 05 Jul 2014 13:07:28 -0700

In the second call of bin/crawl, the seed and crawl directories seem to be swapped:


> root@Walleye:~/nutch# bin/crawl crawl urls -dir crawl -depth 3 -topN 5

The first call

root@myserver:~/nutch# bin/crawl urls/seed.txt testcrawl -dir crawl -depth 3 -topN 50

seems to have the order correct, but it does not entirely follow the scheme
either (-dir, -depth and -topN are not part of it):

% bin/crawl <seedDir> <crawlDir> <solrURL> <numberOfRounds>

The Injector status messages also indicate that the seed directory given in
the second call does not contain any URLs:

(first call)
Injector: total number of urls rejected by filters: 0
Injector: total number of urls injected after normalization and filtering: 1
=> one URL injected

(second call)
> Injector: total number of urls rejected by filters: 0
> Injector: total number of urls injected after normalization and filtering: 0
=> zero URLs
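
A quick way to catch this before starting a crawl is to count the URLs in the
directory that is about to be passed as <seedDir>. The following is only a
minimal sketch: count_seed_urls is a hypothetical helper, not part of Nutch,
and it assumes that every seed URL sits on its own line and starts with
http:// or https://.

```shell
#!/bin/sh
# Hypothetical helper (not part of Nutch): print the number of lines that
# look like URLs in all plain files directly inside the given directory.
count_seed_urls() {
    seed_dir=$1
    if [ ! -d "$seed_dir" ]; then
        # A missing directory cannot contain any seed URLs.
        echo 0
        return 1
    fi
    # Assumption: one URL per line, starting with http:// or https://.
    # Errors from an empty directory or from subdirectories are ignored.
    cat "$seed_dir"/* 2>/dev/null | grep -c -E '^https?://'
}

# Example (run from the Nutch directory):
#   count_seed_urls urls     # the real seed directory
#   count_seed_urls crawl    # the directory that was passed by mistake
```

If the count is 0 for the directory given as the first argument, the Injector
will most likely report zero injected URLs as well. With the directories in
the right order the call would look like
% bin/crawl urls crawl <solrURL> 3
where <solrURL> still has to be filled in for the local setup.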

Sebastian


On 07/05/2014 02:57 AM, CdnGuy wrote:
> Wiped and rebuilt the server from scratch.
> Followed tutorial again.
> Here's the results:
> root@Walleye:~/nutch# bin/crawl crawl urls -dir crawl -depth 3 -topN 5
> Injector: starting at 2014-07-04 20:50:27
> Injector: crawlDb: urls/crawldb
> Injector: urlDir: crawl
> Injector: Converting injected urls to crawl db entries.
> Injector: total number of urls rejected by filters: 0
> Injector: total number of urls injected after normalization and filtering: 0
> Injector: Merging injected urls into crawl db.
> Injector: overwrite: false
> Injector: update: false
> Injector: finished at 2014-07-04 20:50:31, elapsed: 00:00:03
> 
> Still no files in the crawl directory.
> What am I missing?
> 
> 
> 
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/NutchTutorial-Followed-Crawldb-Not-Created-tp4145668p4145686.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
>
