Hi all,

While trying to run Nutch on Elastic MapReduce, I ran into an issue that I believe is the same as the one described here:
https://forums.aws.amazon.com/thread.jspa?threadID=54492

Exception in thread "main" java.lang.IllegalArgumentException: This file system object (hdfs://ip-10-122-99-48.ec2.internal:9000) does not support access to the request path 's3n://mybucketname/crawl/crawldb/current' You possibly called FileSystem.get(conf) when you should have called FileSystem.get(uri, conf) to obtain a file system supporting your path.
        at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:351)
        at org.apache.hadoop.hdfs.DistributedFileSystem.checkPath(DistributedFileSystem.java:99)
        at org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:155)
        at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:453)
        at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:688)
        at org.apache.nutch.crawl.CrawlDb.createJob(CrawlDb.java:122)
        at org.apache.nutch.crawl.Injector.inject(Injector.java:226)
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:119)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:156)

It appears that CrawlDb.java uses code that assumes all inputs live on HDFS. Is this a known bug? If so, could someone point me to the issue number and tell me whether a patch exists for it? If not, I'd be happy to contribute one.
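For reference, here is a minimal sketch of the pattern I suspect is at fault, and of the usual fix. This is only my guess at the shape of the code around CrawlDb.createJob(), not the actual source; the class name and path below are made up for illustration:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class FsLookupSketch {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Example path; substitute your own bucket.
        Path current = new Path("s3n://mybucketname/crawl/crawldb/current");

        // Problematic pattern: FileSystem.get(conf) returns the *default*
        // filesystem (HDFS on EMR), which then rejects the s3n:// path in
        // checkPath() with the IllegalArgumentException above.
        // FileSystem defaultFs = FileSystem.get(conf);
        // defaultFs.exists(current); // throws IllegalArgumentException

        // Fix: resolve the filesystem from the path itself, so hdfs://,
        // s3n://, file:// etc. are all handled correctly.
        FileSystem fs = current.getFileSystem(conf);
        System.out.println("exists: " + fs.exists(current));
      }
    }

As I understand it, Path.getFileSystem(conf) is equivalent to FileSystem.get(path.toUri(), conf), which is what the exception message itself suggests.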
I'm using Nutch 1.2, which I've patched for NUTCH-937 and NUTCH-993.

Cheers,
Viksit
