I am trying to move from a local instance of Nutch to a pseudo-distributed-mode Hadoop Nutch on a single machine. I set everything up following the "How to Setup Nutch (V1.1) and Hadoop" instructions located here: http://wiki.apache.org/nutch/NutchHadoopTutorial
Then I moved all the relevant files to HDFS using:

    bin/hadoop dfs -put crawl_www/crawldb /crawl_www/crawldb

I double-checked that the files copied over correctly using:

    bin/hadoop dfs -ls /crawl_www/crawldb

and that worked fine:

    Found 1 items
    drwxr-xr-x   - root supergroup          0 2010-09-28 13:14 /crawl_www/crawldb/current

I went all the way down to the file level, and the files appear to exist:

    bin/hadoop dfs -ls /crawl_www/crawldb/current/part-00000
    Found 2 items
    -rw-r--r--   1 root supergroup 2375690617 2010-09-28 13:13 /crawl_www/crawldb/current/part-00000/data
    -rw-r--r--   1 root supergroup   23784625 2010-09-28 13:14 /crawl_www/crawldb/current/part-00000/index

Also, when I browse the HDFS filesystem in Firefox at localhost:50070, everything appears to work perfectly and I can see everything. But when I try a basic test run of Nutch, I get the following:

    bin/nutch generate crawl_www/crawldb crawl_www/segments -topN 1000
    INFO  crawl.Generator - Generator: starting at 2010-09-29 11:54:15
    INFO  crawl.Generator - Generator: Selecting best-scoring urls due for fetch.
    INFO  crawl.Generator - Generator: filtering: true
    INFO  crawl.Generator - Generator: normalizing: true
    INFO  crawl.Generator - Generator: topN: 1000
    ERROR crawl.Generator - Generator: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://localhost:9000/user/root/crawl_www/crawldb/current

Did I miss a configuration step? I believe I have checked and double-checked everything, and it all appears correct. Any ideas?

Note: this is Nutch 1.2 on CentOS Linux 5.5.

Thanks,
Brad
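One thing I noticed while writing this up (I may be misunderstanding how Hadoop resolves paths, so this is just a sketch of my reasoning): the `-put` used an absolute HDFS path, but the `generate` command used a relative one, and Hadoop seems to resolve relative paths under `/user/<current user>` — which would explain why the error mentions `/user/root/...`:

```shell
# Sketch of the path resolution as I understand it (assumption, not verified):
# relative HDFS paths are resolved against the current user's home directory.
HDFS_ROOT="hdfs://localhost:9000"
PUT_PATH="/crawl_www/crawldb"                       # absolute path used with -put
RESOLVED="$HDFS_ROOT/user/root/crawl_www/crawldb"   # where 'crawl_www/crawldb' resolves for root

echo "data was put at:      $HDFS_ROOT$PUT_PATH"
echo "generate looked at:   $RESOLVED"
```

If that is right, the two locations differ, even though both listings above succeed against the absolute path.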

