So maybe try hacking CrawlDb#createJob() so that when you create the new
NutchJob object you pass in the uri parameter, e.g. at line 124:

    JobConf job = new NutchJob(uri, config);

as suggested in the thrown stack trace.
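Strictly speaking, what the exception message recommends is replacing
FileSystem.get(conf) with FileSystem.get(uri, conf), and in 1.4 that call
sits a few lines further down, where createJob() checks whether the
crawldb's "current" directory exists (CrawlDb.java:129 in your trace).
Here is a rough, untested sketch of that change; the surrounding names are
my reading of the 1.4 source, so treat them as assumptions:

    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileInputFormat;

    // inside CrawlDb#createJob(Configuration config, Path crawlDb),
    // after: JobConf job = new NutchJob(config);
    Path current = new Path(crawlDb, CrawlDb.CURRENT_NAME);
    // was (roughly): if (FileSystem.get(job).exists(current)) { ... }
    // Resolving the FileSystem from the path's own URI hands an s3n://
    // crawldb to the S3 filesystem instead of the cluster's default HDFS.
    FileSystem fs = FileSystem.get(current.toUri(), job);
    if (fs.exists(current)) {
      FileInputFormat.addInputPath(job, current);
    }

The equivalent one-liner, current.getFileSystem(job), does the same
URI-based lookup.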
Please get back to us with results. I've not been using anything like
Amazon EMR and would be really interested to find out if this is solved.
Fingers crossed.

On Wed, Feb 22, 2012 at 12:42 PM, Ali S Kureishy <[email protected]> wrote:

> Hi,
>
> [This might be more relevant to Amazon's EMR support; however, I have
> posted it here as I'm not sure whether the issue is on the EMR side or
> the Nutch side.]
>
> I'm trying to run a Nutch crawl (v1.4) on Amazon's EMR (Elastic
> MapReduce). I've set up the configuration parameters for the task as
> follows:
>
> *Job jar:* s3n://mybucket/engine/job/nutch-1.4.job
> *Arguments:* org.apache.nutch.crawl.Crawl s3n://mybucket/engine/seedurls/
> -dir s3n://mybucket/engine/crawls
>
> The job eventually fails with the below exception in the stderr log.
>
> Exception in thread "main" java.lang.IllegalArgumentException: This
> file system object (hdfs://10.2.21.205:9000) does not support access
> to the request path 's3n://mybucket/engine/crawls/crawldb/current' You
> possibly called FileSystem.get(conf) when you should have called
> FileSystem.get(uri, conf) to obtain a file system supporting your
> path.
>     at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:372)
>     at org.apache.hadoop.hdfs.DistributedFileSystem.checkPath(DistributedFileSystem.java:106)
>     at org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:162)
>     at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:530)
>     at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:709)
>     at org.apache.nutch.crawl.CrawlDb.createJob(CrawlDb.java:129)
>     at org.apache.nutch.crawl.Injector.inject(Injector.java:223)
>     at org.apache.nutch.crawl.Crawl.run(Crawl.java:127)
>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>     at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>     at java.lang.reflect.Method.invoke(Method.java:597)
>     at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
>
> I've read in a few places that previous versions of Nutch had a bug with
> handling different filesystems (such as s3n:// etc.). Is that still an
> issue with Nutch 1.4? If so, what is the workaround if I want to run a
> Nutch job on Amazon's EMR? (Not specifying the S3 filesystem would mean
> that the HDFS output would vanish once the EMR task completes.) And if
> it has been fixed, what do you think might be causing the issue below?
>
> Thanks,
> Safdar

--
*Lewis*

