Thank you Andrzej,

> This should not happen - are you sure that your hadoop config files are
> consistent across the cluster, especially the FS related properties?

Definitely. The script I wrote to manage the crawling/indexing process
scp-copies the whole conf directory to all active slaves before starting
the crawl.
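
Roughly, that step looks like this (a simplified sketch, not my actual script; NUTCH_HOME and the one-host-per-line conf/slaves file are placeholder assumptions):

```shell
#!/bin/sh
# Sketch: push the local conf directory to every active slave before a crawl,
# so the FS-related properties agree on all nodes.
# NUTCH_HOME and conf/slaves are assumptions; adjust to your layout.
NUTCH_HOME=${NUTCH_HOME:-/opt/nutch-1.0}
for slave in $(grep -v '^#' "$NUTCH_HOME/conf/slaves"); do
  scp -r "$NUTCH_HOME/conf" "$slave:$NUTCH_HOME/" \
    || echo "copy to $slave failed" >&2
done
```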

> When you start the job make sure that the classpath that your command
> uses pulls in the right hadoop config, one that correctly defines the
> filesystem.

I suppose it does. I configured the filesystem-related parameters in
conf/hadoop-site.xml. Most of the Nutch subprocesses run fine on DFS; only
"invertlinks" and perhaps one other (possibly "parse", but I can't verify
right now) run into trouble when they cannot find a local copy of the DFS
data, so I can't figure out where a configuration might be missing.
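
For reference, the filesystem part of my conf/hadoop-site.xml follows the usual pattern (hostname and port below are placeholders, not our real values):

```xml
<property>
  <name>fs.default.name</name>
  <value>hdfs://master:9000/</value>
  <description>Placeholder master host/port; the real file uses our cluster's values.</description>
</property>
```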

> Also, check for the presence of multiple copies of hadoop config files
> on your classpath - be aware that some of them may be inside jars, e.g.
> in your job jar.


I'm using a default configuration for Hadoop and have never touched any jar
files...

I just added hadoop-site.xml to the nutch-1.0.job archive, but still no luck.
Same error: LinkDb: java.io.IOException: No input paths specified in job

Not sure where to look for further configurations... Hints?


S

> -- 
> Best regards,
> Andrzej Bialecki     <><
>  ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  \|  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com