Hi, I created MAHOUT-1487. I also want to submit the patch; I can do it next weekend or later.
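To make the idea concrete, here is a rough sketch of the validation helper I have in mind. The class and method names are hypothetical, and it uses only java.net.URI rather than the Hadoop FileSystem API, so it is an illustration of the message format, not actual Mahout code:

```java
import java.net.URI;

/**
 * Sketch of the proposed path-validation utility: when a path cannot be
 * served by the filesystem in use, report BOTH the concrete filesystem
 * class (via fs.getClass().getName() in the real code) and the offending
 * path, so the user immediately sees which of the many paths is wrong.
 */
public class PathSchemeValidator {

    /** Builds the improved error message: filesystem class plus rejected path. */
    public static String describeMismatch(String fsUri, String fsClassName, String path) {
        return "File system " + fsClassName + " (" + fsUri
                + ") cannot access path '" + path + "'";
    }

    /**
     * Returns true when the path's scheme matches the filesystem's scheme.
     * A path with no scheme is accepted, since it resolves against the
     * default filesystem.
     */
    public static boolean schemeMatches(String fsUri, String path) {
        String fsScheme = URI.create(fsUri).getScheme();
        String pathScheme = URI.create(path).getScheme();
        return pathScheme == null || pathScheme.equalsIgnoreCase(fsScheme);
    }

    public static void main(String[] args) {
        String fsUri = "hdfs://172.31.41.65:9000";
        String path = "s3://bucket-name/folder-name";
        if (!schemeMatches(fsUri, path)) {
            // In Mahout this class name would come from fs.getClass().getName().
            System.out.println(describeMismatch(fsUri,
                    "org.apache.hadoop.hdfs.DistributedFileSystem", path));
        }
    }
}
```

Such a helper could be called wherever Mahout resolves a user-supplied input or output path, before the first filesystem operation.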
2014-03-23 17:14 GMT+03:00 Sebastian Schelter <[email protected]>:

> Hi Konstantin,
>
> Great to see that you located the error. Could you open a JIRA issue and
> submit a patch that contains an updated error message?
>
> Thank you,
> Sebastian
>
> On 03/23/2014 02:57 PM, Konstantin Slisenko wrote:
>
>> Hi!
>>
>> I investigated the situation. RandomSeedGenerator (
>> http://grepcode.com/file/repo1.maven.org/maven2/org.apache.mahout/mahout-core/0.8/org/apache/mahout/clustering/kmeans/RandomSeedGenerator.java?av=f)
>> has the following code:
>>
>> FileSystem fs = FileSystem.get(output.toUri(), conf);
>>
>> ...
>>
>> fs.getFileStatus(input).isDir()
>>
>> The FileSystem object is created from the output path, which I had not
>> specified correctly (I didn't use the "s3://" prefix for it). But
>> getFileStatus is then called with the input path, which was correct.
>> This made the error misleading.
>>
>> To prevent this confusion, I propose to improve the error message by
>> adding the following details:
>> 1. Which filesystem type is in use (DistributedFileSystem,
>> NativeS3FileSystem, etc., via fs.getClass().getName()).
>> 2. Which path could not be processed.
>>
>> This could be done by a validation utility applicable in many places in
>> Mahout. When we use Mahout we have to specify many paths, and we can use
>> many types of file systems: local for debugging, HDFS on a cluster, and
>> S3 on Amazon. Better error messages can save a lot of time here. I don't
>> think a larger refactoring is needed for this case.
>>
>> 2014-03-16 22:19 GMT+03:00 Jay Vyas <[email protected]>:
>>
>>> I agree it's best to be explicit when creating filesystem instances by
>>> using the two-argument get(...). It's time to update to the FileSystem
>>> 2.0 APIs. Can you file a JIRA for this? If not, I will :)
>>>
>>> On Mar 16, 2014, at 12:37 PM, Sebastian Schelter <[email protected]> wrote:
>>>
>>>> I've also encountered a similar error once.
>>>> It's really just the FileSystem.get call that needs to be modified. I
>>>> think it's a good idea to walk through the codebase and refactor this
>>>> where necessary.
>>>>
>>>> --sebastian
>>>>
>>>> On 03/16/2014 05:16 PM, Andrew Musselman wrote:
>>>>
>>>>> Another wild guess: I've had issues trying to use the 's3' protocol
>>>>> from Hadoop and got things working by using the 's3n' protocol
>>>>> instead.
>>>>>
>>>>> On Mar 16, 2014, at 8:41 AM, Jay Vyas <[email protected]> wrote:
>>>>>
>>>>>> I specifically have fixed MapReduce jobs by doing what the error
>>>>>> message suggests.
>>>>>>
>>>>>> But maybe (hopefully) there is another workaround that is
>>>>>> configuration driven.
>>>>>>
>>>>>> Just a hunch, but maybe Mahout needs to be refactored to create fs
>>>>>> objects using the get(uri, conf) calls?
>>>>>>
>>>>>> As Hadoop evolves to support different flavors of HCFS, using the
>>>>>> more flexible API calls (i.e. fs.get(uri, conf)) will probably be a
>>>>>> good thing to keep in mind.
>>>>>>
>>>>>> On Mar 16, 2014, at 9:22 AM, Frank Scholten <[email protected]> wrote:
>>>>>>
>>>>>>> Hi Konstantin,
>>>>>>>
>>>>>>> Good to hear from you.
>>>>>>>
>>>>>>> The link you mentioned points to EigenSeedGenerator, not
>>>>>>> RandomSeedGenerator. The problem seems to be with the call to
>>>>>>>
>>>>>>> fs.getFileStatus(input).isDir()
>>>>>>>
>>>>>>> It's been a while and I don't remember exactly, but perhaps you have
>>>>>>> to set additional Hadoop fs properties to use S3. See
>>>>>>> https://wiki.apache.org/hadoop/AmazonS3. Perhaps you can isolate the
>>>>>>> cause by creating a small Java main app with that line of code and
>>>>>>> running it in the debugger.
>>>>>>>
>>>>>>> Cheers,
>>>>>>>
>>>>>>> Frank
>>>>>>>
>>>>>>> On Sun, Mar 16, 2014 at 12:07 PM, Konstantin Slisenko
>>>>>>> <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hello!
>>>>>>>> I run a text-document clustering job on a Hadoop cluster in Amazon
>>>>>>>> Elastic MapReduce. As input and output I use the Amazon S3 file
>>>>>>>> system, and I specify all paths as "s3://bucket-name/folder-name".
>>>>>>>>
>>>>>>>> SparseVectorsFromSequenceFiles works correctly with S3, but when I
>>>>>>>> start the K-Means clustering job, I get this error:
>>>>>>>>
>>>>>>>> Exception in thread "main" java.lang.IllegalArgumentException: This
>>>>>>>> file system object (hdfs://172.31.41.65:9000) does not support
>>>>>>>> access to the request path
>>>>>>>> 's3://by.kslisenko.bigdata/stackovweflow-small/out_new/sparse/tfidf-vectors'
>>>>>>>> You possibly called FileSystem.get(conf) when you should have
>>>>>>>> called FileSystem.get(uri, conf) to obtain a file system supporting
>>>>>>>> your path.
>>>>>>>>   at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:375)
>>>>>>>>   at org.apache.hadoop.hdfs.DistributedFileSystem.checkPath(DistributedFileSystem.java:106)
>>>>>>>>   at org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:162)
>>>>>>>>   at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:530)
>>>>>>>>   at org.apache.mahout.clustering.kmeans.RandomSeedGenerator.buildRandom(RandomSeedGenerator.java:76)
>>>>>>>>   at org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:93)
>>>>>>>>   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>>>>>>>   at bbuzz2011.stackoverflow.runner.RunnerWithInParams.cluster(RunnerWithInParams.java:121)
>>>>>>>>   at bbuzz2011.stackoverflow.runner.RunnerWithInParams.run(RunnerWithInParams.java:52)
>>>>>>>>   at bbuzz2011.stackoverflow.runner.RunnerWithInParams.main(RunnerWithInParams.java:41)
>>>>>>>>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>>>>>   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>>>>>>>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>>>>>>   at java.lang.reflect.Method.invoke(Method.java:597)
>>>>>>>>   at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
>>>>>>>>
>>>>>>>> I checked RandomSeedGenerator.buildRandom (
>>>>>>>> http://grepcode.com/file/repo1.maven.org/maven2/org.apache.mahout/mahout-core/0.8/org/apache/mahout/clustering/kmeans/EigenSeedGenerator.java?av=f
>>>>>>>> ) and I assume it has correct code:
>>>>>>>>
>>>>>>>> FileSystem fs = FileSystem.get(output.toUri(), conf);
>>>>>>>>
>>>>>>>> I cannot run clustering because of this error. Maybe you have some
>>>>>>>> ideas how to fix this?
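The mismatch in the stack trace above can be reproduced without a cluster. This is a minimal sketch in the spirit of the small main app Frank suggested, but as a dependency-free approximation: it uses only java.net.URI to mimic the scheme comparison that rejects the path, not the actual Hadoop FileSystem.checkPath call, and the class name is hypothetical. The hdfs address and s3 path are copied from the error message:

```java
import java.net.URI;

// Approximates why FileSystem.get(conf) fails here: get(conf) returns the
// *default* filesystem (hdfs on this EMR cluster), which can only serve
// paths whose scheme matches its own. FileSystem.get(uri, conf), seeded
// with the path's own URI, would return an S3 filesystem instead.
public class CheckPathRepro {

    /** Mimics the rejection: compares the filesystem's scheme to the path's. */
    public static boolean supports(String fsUri, String path) {
        String fsScheme = URI.create(fsUri).getScheme();
        String pathScheme = URI.create(path).getScheme();
        // A scheme-less path resolves against the default filesystem.
        return pathScheme == null || pathScheme.equalsIgnoreCase(fsScheme);
    }

    public static void main(String[] args) {
        String defaultFs = "hdfs://172.31.41.65:9000";
        String s3Path = "s3://by.kslisenko.bigdata/stackovweflow-small/out_new/sparse/tfidf-vectors";
        System.out.println("hdfs fs supports s3 path: " + supports(defaultFs, s3Path));
        System.out.println("s3 fs supports s3 path:   " + supports(s3Path, s3Path));
    }
}
```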
