Another wild guess: I've had issues trying to use the 's3' protocol from Hadoop, and got things working by using the 's3n' protocol instead.
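If you switch to s3n, you also need the AWS credential properties set, either in core-site.xml or on the job's Configuration (org.apache.hadoop.conf.Configuration). A minimal sketch -- the key values are placeholders, and the property names are the s3n ones documented on the Hadoop S3 wiki page Frank links below:

    Configuration conf = new Configuration();
    // Placeholder credentials -- substitute your own AWS keys,
    // or define the same properties in core-site.xml instead.
    conf.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY_ID");
    conf.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_ACCESS_KEY");
    // ...then reference paths as s3n://bucket-name/folder-name

There is also a minimal repro sketch at the bottom of this mail, below the quoted thread, along the lines Frank suggests.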
> On Mar 16, 2014, at 8:41 AM, Jay Vyas <[email protected]> wrote:
>
> I have specifically fixed MapReduce jobs by doing what the error message
> suggests.
>
> But maybe (hopefully) there is another workaround that is configuration
> driven.
>
> Just a hunch, but maybe Mahout needs to be refactored to create fs objects
> using the get(uri, conf) calls?
>
> As Hadoop evolves to support different flavors of HCFS, using API calls
> that are more flexible (i.e. the fs.get(uri, conf) one) will probably be
> a good thing to keep in mind.
>
>> On Mar 16, 2014, at 9:22 AM, Frank Scholten <[email protected]> wrote:
>>
>> Hi Konstantin,
>>
>> Good to hear from you.
>>
>> The link you mentioned points to EigenSeedGenerator, not
>> RandomSeedGenerator. The problem seems to be with the call to
>>
>> fs.getFileStatus(input).isDir()
>>
>> It's been a while and I don't remember, but perhaps you have to set
>> additional Hadoop fs properties to use S3. See
>> https://wiki.apache.org/hadoop/AmazonS3. Perhaps you can isolate the
>> cause of this by creating a small Java main app with that line of code
>> and running it in the debugger.
>>
>> Cheers,
>>
>> Frank
>>
>> On Sun, Mar 16, 2014 at 12:07 PM, Konstantin Slisenko
>> <[email protected]> wrote:
>>
>>> Hello!
>>>
>>> I run text-document clustering on a Hadoop cluster in Amazon Elastic
>>> MapReduce. As input and output I use the Amazon S3 file system. I specify
>>> all paths as "s3://bucket-name/folder-name".
>>>
>>> SparseVectorsFromSequenceFile works correctly with S3, but when I start
>>> the K-Means clustering job, I get this error:
>>>
>>> Exception in thread "main" java.lang.IllegalArgumentException: This
>>> file system object (hdfs://172.31.41.65:9000) does not support access
>>> to the request path
>>> 's3://by.kslisenko.bigdata/stackovweflow-small/out_new/sparse/tfidf-vectors'
>>> You possibly called FileSystem.get(conf) when you should have called
>>> FileSystem.get(uri, conf) to obtain a file system supporting your
>>> path.
>>>
>>> at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:375)
>>> at org.apache.hadoop.hdfs.DistributedFileSystem.checkPath(DistributedFileSystem.java:106)
>>> at org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:162)
>>> at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:530)
>>> at org.apache.mahout.clustering.kmeans.RandomSeedGenerator.buildRandom(RandomSeedGenerator.java:76)
>>> at org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:93)
>>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>> at bbuzz2011.stackoverflow.runner.RunnerWithInParams.cluster(RunnerWithInParams.java:121)
>>> at bbuzz2011.stackoverflow.runner.RunnerWithInParams.run(RunnerWithInParams.java:52)
>>> at bbuzz2011.stackoverflow.runner.RunnerWithInParams.main(RunnerWithInParams.java:41)
>>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>> at java.lang.reflect.Method.invoke(Method.java:597)
>>> at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
>>>
>>> I checked RandomSeedGenerator.buildRandom
>>> (http://grepcode.com/file/repo1.maven.org/maven2/org.apache.mahout/mahout-core/0.8/org/apache/mahout/clustering/kmeans/EigenSeedGenerator.java?av=f)
>>> and I assume it has the correct code:
>>>
>>> FileSystem fs = FileSystem.get(output.toUri(), conf);
>>>
>>> I cannot run the clustering because of this error. Do you have any ideas
>>> how to fix this?
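Coming back to Frank's suggestion above: a small standalone main along these lines should reproduce the checkPath failure outside of Mahout. This is an untested sketch -- the class name is arbitrary, and the bucket path and credentials are placeholders to replace with your own:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class S3PathCheck {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Placeholder credentials -- use your own keys (or core-site.xml).
    conf.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY_ID");
    conf.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_ACCESS_KEY");

    // Placeholder path -- point this at your bucket.
    Path input = new Path("s3n://bucket-name/folder-name");

    // What the error message warns about: get(conf) returns the default
    // filesystem (HDFS on EMR), whose checkPath() then rejects the s3 path.
    FileSystem wrong = FileSystem.get(conf);
    try {
      wrong.getFileStatus(input);
    } catch (IllegalArgumentException e) {
      System.out.println("get(conf) fails: " + e.getMessage());
    }

    // Resolving the filesystem from the path's own URI works -- the same
    // FileSystem.get(uri, conf) pattern Jay mentions.
    FileSystem right = FileSystem.get(input.toUri(), conf);
    System.out.println("isDir: " + right.getFileStatus(input).isDir());
  }
}

If the second call also fails, the problem is credentials/configuration rather than Mahout's FileSystem lookup.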
