Thanks, Andrzej. It did not occur to me that the path would need to change in my scripts.
As for root, is it a risk if I'm just using the box for testing?

-----Original Message-----
From: Andrzej Bialecki [mailto:[email protected]]
Sent: Wednesday, September 29, 2010 12:31 PM
To: [email protected]
Subject: Re: Error with Hadoop when moving from Local to HDFS Pseudo-Distributed Mode...

On 2010-09-29 21:08, brad wrote:
> I have tried to move from a local instance of Nutch to a
> Pseudo-Distributed Mode Hadoop Nutch on a single machine. I set
> everything up using the How to Setup Nutch (V1.1) and Hadoop
> instructions located here:
> http://wiki.apache.org/nutch/NutchHadoopTutorial
>
> Then I moved all my relevant files to HDFS using:
>
> bin/hadoop dfs -put crawl_www/crawldb /crawl_www/crawldb
>
> I then double-checked that the files moved OK using:
>
> bin/hadoop dfs -ls /crawl_www/crawldb
>
> and that worked fine:
>
> Found 1 items
> drwxr-xr-x - root supergroup 0 2010-09-28 13:14 /crawl_www/crawldb/current
>
> I went all the way down to the file level and it appears the files
> exist:
>
> bin/hadoop dfs -ls /crawl_www/crawldb/current/part-00000
>
> Found 2 items
> -rw-r--r-- 1 root supergroup 2375690617 2010-09-28 13:13 /crawl_www/crawldb/current/part-00000/data
> -rw-r--r-- 1 root supergroup 23784625 2010-09-28 13:14 /crawl_www/crawldb/current/part-00000/index
>
> Also, when I use Firefox to browse the HDFS filesystem at
> localhost:50070, everything appears to work perfectly and I can see
> everything.
>
> But when I try a basic test run of Nutch, I get the following:
>
> bin/nutch generate crawl_www/crawldb crawl_www/segments -topN 1000
>
> INFO crawl.Generator - Generator: starting at 2010-09-29 11:54:15
> INFO crawl.Generator - Generator: Selecting best-scoring urls due for fetch.
> INFO crawl.Generator - Generator: filtering: true
> INFO crawl.Generator - Generator: normalizing: true
> INFO crawl.Generator - Generator: topN: 1000
> ERROR crawl.Generator - Generator: org.apache.hadoop.mapred.InvalidInputException:
> Input path does not exist: hdfs://localhost:9000/user/root/crawl_www/crawldb/current
>
> Did I miss a configuration step? I believe I have checked and
> double-checked everything and it appears to be correct.
>
> Any ideas?

Yes - you missed the leading slash in your path. The command lines that you quote above use a relative path (no leading slash), and Hadoop assumes it is relative to your Hadoop home dir, which is /user/${whoami}.

By the way, I would strongly advise against running Hadoop as root.

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  || |   Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
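[Editor's note: the path-resolution rule Andrzej describes can be sketched in plain shell. The function name `resolve_hdfs_path` is hypothetical, purely for illustration; it mimics how HDFS treats a path with no leading slash as relative to the user's home dir, /user/${whoami}.]

```shell
# Sketch (assumption): how HDFS resolves a path argument.
# An absolute path (leading slash) is used as-is; a relative path
# is prefixed with the user's HDFS home directory, /user/<user>.
resolve_hdfs_path() {
  local p="$1" user="$2"
  case "$p" in
    /*) echo "$p" ;;               # absolute: used as-is
    *)  echo "/user/$user/$p" ;;   # relative: resolved under home dir
  esac
}

resolve_hdfs_path crawl_www/crawldb root    # -> /user/root/crawl_www/crawldb
resolve_hdfs_path /crawl_www/crawldb root   # -> /crawl_www/crawldb
```

So the files were uploaded to /crawl_www/crawldb, but the Nutch command looked for /user/root/crawl_www/crawldb; adding the leading slash to the generate arguments (or uploading under the home dir) makes the two agree.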

