I am following the tutorial at http://wiki.apache.org/nutch/NutchTutorial to
start crawling web pages. The all-in-one crawl command works:
bin/nutch crawl ../test-domain/urls -dir ../test-domain -depth 3
>& crawl.log
But when I switch to the lower-level commands described in the Whole-web
Crawling section, the Fetching step does not take effect. The command
bin/nutch generate ../test-domain/crawldb/ ../test-domain/segments
shows
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: filtering: true
Generator: normalizing: true
Generator: jobtracker is 'local', generating exactly one partition.
Generator: 0 records selected for fetching, exiting ...
But no segments dir is created in the target folder, as there is when
using the crawl command; creating the segments dir manually beforehand
gives the same result.
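For context, the lower-level sequence from the tutorial's Whole-web
Crawling section that I am trying to follow is roughly the following
(the ../test-domain paths are just my local layout, not from the
tutorial):

```shell
# Inject the seed URLs into the crawl database
bin/nutch inject ../test-domain/crawldb ../test-domain/urls

# Generate a fetch list into a new segment
bin/nutch generate ../test-domain/crawldb ../test-domain/segments

# Fetch the newest segment, then update the crawldb with the results
s1=`ls -d ../test-domain/segments/2* | tail -1`
bin/nutch fetch $s1
bin/nutch updatedb ../test-domain/crawldb $s1
```

It is at the generate step that I get the output below.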
Searching the internet, I found other people with similar issues
(http://osdir.com/ml/nutch-user.lucene.apache.org/2009-09/msg00062.html)
who solved them by copying the conf files to the slave nodes. However, I
do not use Hadoop; I simply downloaded Nutch 1.1 and executed the
commands as instructed in the tutorial.
Is setting up a Hadoop cluster necessary in order to crawl web pages? Or
what might be causing this issue, and how can I fix it?
Thanks
--
ChiaHung Lin @ nuk, tw.