I followed the tutorial at http://wiki.apache.org/nutch/NutchTutorial to
start crawling web pages. The all-in-one crawl command works:

    bin/nutch crawl ../test-domain/urls -dir ../test-domain -depth 3 >& crawl.log

But when I switch to the lower-level commands described in the Whole-web
Crawling section, the Fetching step has no effect. The command

    bin/nutch generate ../test-domain/crawldb/ ../test-domain/segments

prints

    Generator: Selecting best-scoring urls due for fetch.
    Generator: starting
    Generator: filtering: true
    Generator: normalizing: true
    Generator: jobtracker is 'local', generating exactly one partition.
    Generator: 0 records selected for fetching, exiting ...

But no segments dir is generated in the target folder, unlike with the
crawl command; creating the segments dir manually beforehand gives the
same result.
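For reference, the lower-level sequence I am following, as I understand it from the tutorial's Whole-web Crawling section (directory names adapted to my ../test-domain layout; the segment-name lookup is my own shell snippet, not from the tutorial):

```shell
# Whole-web crawling steps per the NutchTutorial (Nutch 1.1).
# Paths adapted to my ../test-domain layout.

# 1. Inject the seed URLs into the crawl database.
bin/nutch inject ../test-domain/crawldb ../test-domain/urls

# 2. Generate a fetch list (the step that prints "0 records selected" for me).
bin/nutch generate ../test-domain/crawldb ../test-domain/segments

# 3. Fetch the newest segment (its name is a timestamp printed by generate;
#    picking it up via ls is my assumption, adapt as needed).
SEGMENT=`ls -d ../test-domain/segments/2* | tail -1`
bin/nutch fetch $SEGMENT

# 4. Update the crawl database with the fetch results.
bin/nutch updatedb ../test-domain/crawldb $SEGMENT
```

Since generate selects 0 records, the later steps never get a segment to operate on.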

Searching the internet, I found other people with similar issues
(http://osdir.com/ml/nutch-user.lucene.apache.org/2009-09/msg00062.html)
who solved them by copying conf files to the slave nodes. However, I do
not use hadoop; I simply downloaded nutch 1.1 and executed the commands
as instructed in the tutorial.

Is setting up a hadoop cluster necessary in order to crawl web pages? Or
what might be causing this issue, and how can I fix it?

Thanks

-- 
ChiaHung Lin @ nuk, tw.
