Hi
I'm running Nutch 1.2 with some changes on EMR.
Yes, there are some problems with paths.
This problem can be solved by setting "fs.default.name" to your S3 bucket.
And you should probably use the "s3n" URI scheme instead of "s3":
http://wiki.apache.org/hadoop/AmazonS3
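For example, a minimal sketch of the relevant core-site.xml properties. The bucket name and credential values here are placeholders, and this assumes the s3n filesystem properties from Hadoop 0.20-era releases (the ones Nutch 1.2/1.3 build against):

```xml
<!-- core-site.xml: point the default filesystem at the bucket via s3n.
     "your-bucket-name" and the key values below are placeholders. -->
<property>
  <name>fs.default.name</name>
  <value>s3n://your-bucket-name</value>
</property>
<property>
  <name>fs.s3n.awsAccessKeyId</name>
  <value>YOUR_AWS_ACCESS_KEY</value>
</property>
<property>
  <name>fs.s3n.awsSecretAccessKey</name>
  <value>YOUR_AWS_SECRET_KEY</value>
</property>
```

With the default filesystem set this way, paths like s3://... versus hdfs://... should stop colliding, since relative paths resolve against the bucket.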
Sergey Volkov
On 08/24/2011 05:45 AM, Doug Chang wrote:
Not sure, haven't tried it yet. Trying to get my Hadoop tutorials out the
door first before Fri. Can chat then.
I think some have tried it. The bixo dude has experience doing this.
Hmm, not sure what the cause of this is. If I had to guess: there are two
ways to run Nutch, using EC2+EBS or EMR. The problem with EMR is that when the job
stops, the data goes poof into thin air, so they transfer data to S3, and the bug
is in there somewhere. Not sure which layer the bug is in, the Hadoop file system
interface into S3 or Nutch itself. We can debug this on Fri if you like.
dc
On Tue, Aug 23, 2011 at 6:03 PM, Peter Harrington <[email protected]> wrote:
Does anyone use Nutch on EMR?
I am using Nutch 1.3 and I get an error saying:
FATAL org.apache.nutch.crawl.Generator (main): Generator:
java.lang.IllegalArgumentException: This file system object
(hdfs://ip-44-169-41-187.ec2.internal:9000) does not support access to the
request path 's3://Datasets/crawlResults/crawldb/.locked' You possibly
called FileSystem.get(conf) when you should have called FileSystem.get(uri,
conf) to obtain a file system supporting your path.
I have seen other posts with this same problem but no resolution. Does
anyone use Nutch 1.3 on EMR?
Thanks for the help,
Peter