On 2010-09-29 21:08, brad wrote:
I have tried to move from a local instance of Nutch to a Pseudo-Distributed
Mode Hadoop Nutch on a single machine. I set everything up using the How to
Setup Nutch (V1.1) and Hadoop instructions located here:
http://wiki.apache.org/nutch/NutchHadoopTutorial
Then I moved all my relevant files to HDFS using:
bin/hadoop dfs -put crawl_www/crawldb /crawl_www/crawldb
I then double-checked that the files had been copied correctly using:
bin/hadoop dfs -ls /crawl_www/crawldb
and that worked fine:
Found 1 items
drwxr-xr-x - root supergroup 0 2010-09-28 13:14 /crawl_www/crawldb/current
I went all the way down to the file level, and the files appear to exist:
bin/hadoop dfs -ls /crawl_www/crawldb/current/part-00000
Found 2 items
-rw-r--r-- 1 root supergroup 2375690617 2010-09-28 13:13 /crawl_www/crawldb/current/part-00000/data
-rw-r--r-- 1 root supergroup 23784625 2010-09-28 13:14 /crawl_www/crawldb/current/part-00000/index
Also, when I use Firefox to browse the HDFS filesystem at
localhost:50070, everything appears to work perfectly and I can see
all the files.
But, when I try a basic test run of Nutch, I get the following:
bin/nutch generate crawl_www/crawldb crawl_www/segments -topN 1000
INFO crawl.Generator - Generator: starting at 2010-09-29 11:54:15
INFO crawl.Generator - Generator: Selecting best-scoring urls due for fetch.
INFO crawl.Generator - Generator: filtering: true
INFO crawl.Generator - Generator: normalizing: true
INFO crawl.Generator - Generator: topN: 1000
ERROR crawl.Generator - Generator: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://localhost:9000/user/root/crawl_www/crawldb/current
Did I miss a configuration step? I believe I have checked and
double-checked everything, and it all appears to be correct.
Any ideas?
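Yes - you missed the leading slash in your path. The bin/nutch generate
command line that you quote above uses relative paths (no leading slash),
and Hadoop resolves a relative path against your HDFS home directory,
which is /user/${whoami}. You copied the data to /crawl_www/crawldb, but
the job looks for it in /user/root/crawl_www/crawldb, as the error shows.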
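To make the rule concrete, here is a rough shell sketch of how a path is qualified. The function resolve_hdfs_path is a hypothetical helper written only for illustration; it is not part of Hadoop or Nutch, and it only imitates the behaviour described above rather than reproducing Hadoop's actual code.

```shell
# Hypothetical helper (not a Hadoop or Nutch command): imitates how a path
# without a leading slash is resolved against the HDFS home directory.
resolve_hdfs_path() {
    path="$1"
    user="${2:-$(whoami)}"   # default to the current user, as in /user/${whoami}
    case "$path" in
        /*) printf '%s\n' "$path" ;;                  # absolute path: used as-is
        *)  printf '/user/%s/%s\n' "$user" "$path" ;; # relative path: placed under /user/<user>
    esac
}

resolve_hdfs_path crawl_www/crawldb root     # relative form, as used in the generate command
resolve_hdfs_path /crawl_www/crawldb root    # absolute form, as used in the -put command
```

Under that rule, passing the same absolute paths that were used for the upload, for example bin/nutch generate /crawl_www/crawldb /crawl_www/segments -topN 1000, should point the job at the data that was actually copied.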
By the way, I would strongly advise against running Hadoop as root.
--
Best regards,
Andrzej Bialecki <><