Thanks Andrzej.  It did not occur to me that the path would need to change
in my scripts.

As for root, is it a risk if I'm just using the box for testing?

-----Original Message-----
From: Andrzej Bialecki [mailto:[email protected]] 
Sent: Wednesday, September 29, 2010 12:31 PM
To: [email protected]
Subject: Re: Error with Hadoop when moving from Local to HDFS
Pseudo-Distributed Mode...

On 2010-09-29 21:08, brad wrote:
> I have tried to move from a local instance of Nutch to a 
> Pseudo-Distributed Mode Hadoop Nutch on a single machine.  I set 
> everything up using the How to Setup Nutch (V1.1) and Hadoop instructions
> located here:
> http://wiki.apache.org/nutch/NutchHadoopTutorial
>
> Then I moved all my relevant files to the HDFS using:
>
> bin/hadoop dfs -put crawl_www/crawldb /crawl_www/crawldb .
>
> I then double checked the files moved ok using
>
> bin/hadoop dfs -ls /crawl_www/crawldb
>
> And that worked fine
> Found 1 items
> drwxr-xr-x   - root supergroup          0 2010-09-28 13:14
> /crawl_www/crawldb/current
>
> I went all the way down to the file level and it appears the files exist:
>
> bin/hadoop dfs -ls /crawl_www/crawldb/current/part-00000
>
> Found 2 items
> -rw-r--r--   1 root supergroup 2375690617 2010-09-28 13:13
> /crawl_www/crawldb/current/part-00000/data
> -rw-r--r--   1 root supergroup   23784625 2010-09-28 13:14
> /crawl_www/crawldb/current/part-00000/index
>
> Also, when I use firefox to browse the hdfs filesystem using 
> localhost:50070, everything appears to work perfectly and I can see 
> everything.
>
> But, when I try a basic test run of Nutch, I get the following:
> bin/nutch generate crawl_www/crawldb crawl_www/segments -topN 1000
>
>
> INFO  crawl.Generator - Generator: starting at 2010-09-29 11:54:15
> INFO  crawl.Generator - Generator: Selecting best-scoring urls due for fetch.
> INFO  crawl.Generator - Generator: filtering: true
> INFO  crawl.Generator - Generator: normalizing: true
> INFO  crawl.Generator - Generator: topN: 1000
> ERROR crawl.Generator - Generator:
> org.apache.hadoop.mapred.InvalidInputException:
> Input path does not exist:
> hdfs://localhost:9000/user/root/crawl_www/crawldb/current
>
>
> Did I miss a configuration step?  I believe I have checked and double
> checked everything and it appears to be correct.
>
> Any ideas?

Yes - you missed the leading slash in your path. The command lines that you
quote above use a relative path (no leading slash), and Hadoop resolves
relative paths against your HDFS home directory, which is /user/${whoami}.
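
To make that concrete, here is a small shell sketch of how Hadoop expands a
relative HDFS path (the path and the /user/${whoami} home-directory layout are
taken from this thread; the script itself is just an illustration, not Hadoop
code):

```shell
# Sketch: how a relative HDFS path becomes an absolute one.
# Assumption: the HDFS home directory is /user/$(whoami).
rel_path="crawl_www/crawldb/current"
hdfs_home="/user/$(whoami)"

case "$rel_path" in
  /*) abs_path="$rel_path" ;;             # leading slash: used as-is
  *)  abs_path="$hdfs_home/$rel_path" ;;  # no slash: resolved under home dir
esac

echo "$abs_path"
```

So since you did the -put to /crawl_www/crawldb, either pass the absolute
path to Nutch (bin/nutch generate /crawl_www/crawldb /crawl_www/segments
-topN 1000) or put the data under /user/root in the first place.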

By the way, I would strongly advise against running Hadoop as root.
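
Even on a test box, daemons started as root run every crawl job with full
system privileges. A minimal sketch of a pre-start check (uses only POSIX
id; the wrapper idea itself is hypothetical, not part of Hadoop):

```shell
# Sketch: classify the effective user before starting Hadoop daemons.
if [ "$(id -u)" -eq 0 ]; then
  user_check="root"      # daemons would inherit full system privileges
else
  user_check="non-root"  # preferred: limits what a runaway job can do
fi
echo "effective user is $user_check"
```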


--
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
 [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
 ___|||__||  \|  ||  |  Embedded Unix, System Integration
 http://www.sigram.com  Contact: info at sigram dot com

