I agree, it's best to be explicit when creating FileSystem instances by using the two-argument get(...). It's also time to update to the FileSystem 2.0 APIs. Can you file a JIRA for this? If not I will :)
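For reference, a minimal sketch of the difference the error message below describes. The Hadoop calls are shown in comments (FileSystem.get(conf), FileSystem.get(uri, conf), and Path.getFileSystem(conf) are the real Hadoop APIs); the sameScheme helper is a hypothetical, Hadoop-free illustration of the scheme check that FileSystem.checkPath performs, not Hadoop's actual implementation:

```java
import java.net.URI;

public class FsGetSketch {
    // FileSystem.get(conf) returns the *default* filesystem (fs.default.name),
    // e.g. hdfs://172.31.41.65:9000 on an EMR cluster. FileSystem.get(uri, conf)
    // instead resolves a filesystem for the URI's scheme, so s3:// paths work
    // even when HDFS is the default:
    //
    //   Path output = new Path("s3://bucket/out");
    //   FileSystem fs = FileSystem.get(output.toUri(), conf);  // scheme-aware
    //   // or, equivalently:
    //   FileSystem fs = output.getFileSystem(conf);
    //
    // The checkPath failure boils down to a scheme mismatch like this:
    static boolean sameScheme(URI fsUri, URI path) {
        String scheme = path.getScheme();
        // A scheme-less path is resolved against the filesystem's own scheme.
        return scheme == null || scheme.equalsIgnoreCase(fsUri.getScheme());
    }

    public static void main(String[] args) {
        URI hdfs = URI.create("hdfs://172.31.41.65:9000");
        URI s3Path = URI.create("s3://bucket/out/tfidf-vectors");
        System.out.println(sameScheme(hdfs, s3Path)); // prints "false"
    }
}
```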
> On Mar 16, 2014, at 12:37 PM, Sebastian Schelter <[email protected]> wrote:
>
> I've also encountered a similar error once. It's really just the
> FileSystem.get call that needs to be modified. I think it's a good idea to
> walk through the codebase and refactor this where necessary.
>
> --sebastian
>
>
>> On 03/16/2014 05:16 PM, Andrew Musselman wrote:
>> Another wild guess: I've had issues trying to use the 's3' protocol from
>> Hadoop and got things working by using the 's3n' protocol instead.
>>
>>> On Mar 16, 2014, at 8:41 AM, Jay Vyas <[email protected]> wrote:
>>>
>>> I specifically have fixed MapReduce jobs by doing what the error message
>>> suggests.
>>>
>>> But maybe (hopefully) there is another workaround that is configuration
>>> driven.
>>>
>>> Just a hunch, but maybe Mahout needs to be refactored to create fs objects
>>> using the get(uri, conf) calls?
>>>
>>> As Hadoop evolves to support different flavors of HCFS, using API
>>> calls that are more flexible (i.e. like the fs.get(uri, conf) one) will
>>> probably be a good thing to keep in mind.
>>>
>>>> On Mar 16, 2014, at 9:22 AM, Frank Scholten <[email protected]> wrote:
>>>>
>>>> Hi Konstantin,
>>>>
>>>> Good to hear from you.
>>>>
>>>> The link you mentioned points to EigenSeedGenerator, not
>>>> RandomSeedGenerator. The problem seems to be with the call to
>>>>
>>>> fs.getFileStatus(input).isDir()
>>>>
>>>> It's been a while and I don't remember, but perhaps you have to set
>>>> additional Hadoop fs properties to use S3. See
>>>> https://wiki.apache.org/hadoop/AmazonS3. Perhaps you can isolate the cause
>>>> of this by creating a small Java main app with that line of code and
>>>> running it in the debugger.
>>>>
>>>> Cheers,
>>>>
>>>> Frank
>>>>
>>>>
>>>> On Sun, Mar 16, 2014 at 12:07 PM, Konstantin Slisenko
>>>> <[email protected]> wrote:
>>>>
>>>>> Hello!
>>>>>
>>>>> I run a text-documents clustering on a Hadoop cluster in Amazon Elastic
>>>>> MapReduce.
>>>>> As input and output I use the Amazon S3 file system. I specify all
>>>>> paths as "s3://bucket-name/folder-name".
>>>>>
>>>>> SparseVectorsFromSequenceFiles works correctly with S3,
>>>>> but when I start the K-Means clustering job, I get this error:
>>>>>
>>>>> Exception in thread "main" java.lang.IllegalArgumentException: This
>>>>> file system object (hdfs://172.31.41.65:9000) does not support access
>>>>> to the request path
>>>>> 's3://by.kslisenko.bigdata/stackovweflow-small/out_new/sparse/tfidf-vectors'
>>>>> You possibly called FileSystem.get(conf) when you should have called
>>>>> FileSystem.get(uri, conf) to obtain a file system supporting your
>>>>> path.
>>>>>
>>>>> at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:375)
>>>>> at org.apache.hadoop.hdfs.DistributedFileSystem.checkPath(DistributedFileSystem.java:106)
>>>>> at org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:162)
>>>>> at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:530)
>>>>> at org.apache.mahout.clustering.kmeans.RandomSeedGenerator.buildRandom(RandomSeedGenerator.java:76)
>>>>> at org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:93)
>>>>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>>>> at bbuzz2011.stackoverflow.runner.RunnerWithInParams.cluster(RunnerWithInParams.java:121)
>>>>> at bbuzz2011.stackoverflow.runner.RunnerWithInParams.run(RunnerWithInParams.java:52)
>>>>> at bbuzz2011.stackoverflow.runner.RunnerWithInParams.main(RunnerWithInParams.java:41)
>>>>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>>>> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>>> at java.lang.reflect.Method.invoke(Method.java:597)
>>>>> at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
>>>>>
>>>>> I checked RandomSeedGenerator.buildRandom
>>>>> (http://grepcode.com/file/repo1.maven.org/maven2/org.apache.mahout/mahout-core/0.8/org/apache/mahout/clustering/kmeans/EigenSeedGenerator.java?av=f)
>>>>> and I assume it has correct code:
>>>>>
>>>>> FileSystem fs = FileSystem.get(output.toUri(), conf);
>>>>>
>>>>> I cannot run clustering because of this error. Maybe you have some
>>>>> ideas how to fix this?
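For completeness, the "additional Hadoop fs properties" Frank mentions are, for Hadoop of that era, the S3 credential settings in core-site.xml. A sketch, assuming the s3n (native) connector that Andrew suggests; the property names are the ones covered by the AmazonS3 wiki page linked above, and the values are placeholders:

```xml
<!-- core-site.xml: credentials for the s3n:// (native) connector -->
<property>
  <name>fs.s3n.awsAccessKeyId</name>
  <value>YOUR_ACCESS_KEY</value>
</property>
<property>
  <name>fs.s3n.awsSecretAccessKey</name>
  <value>YOUR_SECRET_KEY</value>
</property>
<!-- the block-based s3:// connector reads fs.s3.awsAccessKeyId and
     fs.s3.awsSecretAccessKey instead -->
```

Note these only make the S3 filesystem reachable; they do not help with the checkPath error above, which requires the FileSystem.get(uri, conf) refactoring regardless.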
