I'm running Nutch 1.0 on a 4-node Hadoop cluster, so all Nutch data must 
reside on the Hadoop distributed filesystem (HDFS) rather than on the local fs.

The odd thing is that some of the Nutch crawling "steps" seem to need a local 
copy of the data in the Nutch main directory in order to work correctly.

For example, if I run the invertlinks command without a local copy of the data, 
Nutch throws an exception:

LinkDb: java.io.IOException: No input paths specified in job
    at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:141)
    at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:39)
    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:190)
    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:797)
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1142)
    at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:170)
    at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:285)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:248)

If an exact copy of the DFS directories is present on the local fs, however, 
everything runs fine, even though Nutch actually operates on the DFS-stored data.
 
This forces me to perform an annoying, time-consuming copy of several GB of 
data to all cluster nodes at every crawl cycle.
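For what it's worth, my current suspicion (unconfirmed) is that the commands resolve relative input paths against the default filesystem, so if the HDFS default isn't picked up, the job looks for its inputs on the local disk. The sketch below shows what I mean; the hostname "master", port 9000, and paths are placeholders for my setup, not values from anyone's docs:

```
<!-- conf/hadoop-site.xml: make HDFS the default filesystem so that
     relative paths like crawl/segments resolve on DFS rather than
     on the local disk. "master:9000" is a placeholder namenode. -->
<property>
  <name>fs.default.name</name>
  <value>hdfs://master:9000/</value>
</property>
```

In the same spirit, passing fully-qualified hdfs:// URIs on the invertlinks command line (instead of relative paths) might sidestep the default-fs lookup entirely, but I haven't verified that yet either.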

Has anyone else run into this and found a workaround?

S
---------------------------------- 
"Anyone proposing to run Windows on servers should be prepared to explain 
what they know about servers that Google, Yahoo, and Amazon don't."
Paul Graham


"A mathematician is a device for turning coffee into theorems."
Paul Erdos (who obviously never met a sysadmin)




