Hi, I created MAHOUT-1487. I also want to submit the patch; I can do it next weekend or later.
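To make the idea concrete, here is a rough sketch of the validation helper I have in mind. The class and method names are hypothetical, and it uses only java.net.URI rather than the Hadoop FileSystem API, so it is an illustration of the message format, not actual Mahout code:

```java
import java.net.URI;

/**
 * Sketch of the proposed path-validation utility: when a path cannot be
 * served by the filesystem in use, report BOTH the concrete filesystem
 * class (via fs.getClass().getName() in the real code) and the offending
 * path, so the user immediately sees which of the many paths is wrong.
 */
public class PathSchemeValidator {

    /** Builds the improved error message: filesystem class plus rejected path. */
    public static String describeMismatch(String fsUri, String fsClassName, String path) {
        return "File system " + fsClassName + " (" + fsUri
                + ") cannot access path '" + path + "'";
    }

    /**
     * Returns true when the path's scheme matches the filesystem's scheme.
     * A path with no scheme is accepted, since it resolves against the
     * default filesystem.
     */
    public static boolean schemeMatches(String fsUri, String path) {
        String fsScheme = URI.create(fsUri).getScheme();
        String pathScheme = URI.create(path).getScheme();
        return pathScheme == null || pathScheme.equalsIgnoreCase(fsScheme);
    }

    public static void main(String[] args) {
        String fsUri = "hdfs://172.31.41.65:9000";
        String path = "s3://bucket-name/folder-name";
        if (!schemeMatches(fsUri, path)) {
            // In Mahout this class name would come from fs.getClass().getName().
            System.out.println(describeMismatch(fsUri,
                    "org.apache.hadoop.hdfs.DistributedFileSystem", path));
        }
    }
}
```

Such a helper could be called wherever Mahout resolves a user-supplied input or output path, before the first filesystem operation.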
2014-03-23 17:14 GMT+03:00 Sebastian Schelter <[email protected]>:

> Hi Konstantin,
>
> Great to see that you located the error. Could you open a JIRA issue and
> submit a patch that contains an updated error message?
>
> Thank you,
> Sebastian
>
> On 03/23/2014 02:57 PM, Konstantin Slisenko wrote:
>
>> Hi!
>>
>> I investigated the situation. RandomSeedGenerator (
>> http://grepcode.com/file/repo1.maven.org/maven2/org.apache.mahout/mahout-core/0.8/org/apache/mahout/clustering/kmeans/RandomSeedGenerator.java?av=f)
>> has the following code:
>>
>> FileSystem fs = FileSystem.get(output.toUri(), conf);
>>
>> ...
>>
>> fs.getFileStatus(input).isDir()
>>
>> The FileSystem object is created from the output path, which I had not
>> specified correctly (I didn't use the "s3://" prefix for it). But
>> getFileStatus is then called with the input path, which was correct.
>> This made the error misleading.
>>
>> To prevent this confusion, I propose to improve the error message by
>> adding the following details:
>> 1. Which filesystem type is in use (DistributedFileSystem,
>> NativeS3FileSystem, etc., via fs.getClass().getName()).
>> 2. Which path could not be processed.
>>
>> This could be done by a validation utility applicable in many places in
>> Mahout. When we use Mahout we have to specify many paths, and we can use
>> many types of file systems: local for debugging, HDFS on a cluster, and
>> S3 on Amazon. Better error messages can save a lot of time here. I don't
>> think a larger refactoring is needed for this case.
>>
>> 2014-03-16 22:19 GMT+03:00 Jay Vyas <[email protected]>:
>>
>>> I agree it's best to be explicit when creating filesystem instances by
>>> using the two-argument get(...). It's time to update to the FileSystem
>>> 2.0 APIs. Can you file a JIRA for this? If not, I will :)
>>>
>>> On Mar 16, 2014, at 12:37 PM, Sebastian Schelter <[email protected]> wrote:
>>>
>>>> I've also encountered a similar error once.
>>>> It's really just the FileSystem.get call that needs to be modified. I
>>>> think it's a good idea to walk through the codebase and refactor this
>>>> where necessary.
>>>>
>>>> --sebastian
>>>>
>>>> On 03/16/2014 05:16 PM, Andrew Musselman wrote:
>>>>
>>>>> Another wild guess: I've had issues trying to use the 's3' protocol
>>>>> from Hadoop and got things working by using the 's3n' protocol
>>>>> instead.
>>>>>
>>>>> On Mar 16, 2014, at 8:41 AM, Jay Vyas <[email protected]> wrote:
>>>>>
>>>>>> I specifically have fixed MapReduce jobs by doing what the error
>>>>>> message suggests.
>>>>>>
>>>>>> But maybe (hopefully) there is another workaround that is
>>>>>> configuration driven.
>>>>>>
>>>>>> Just a hunch, but maybe Mahout needs to be refactored to create fs
>>>>>> objects using the get(uri, conf) calls?
>>>>>>
>>>>>> As Hadoop evolves to support different flavors of HCFS, using the
>>>>>> more flexible API calls (i.e. fs.get(uri, conf)) will probably be a
>>>>>> good thing to keep in mind.
>>>>>>
>>>>>> On Mar 16, 2014, at 9:22 AM, Frank Scholten <[email protected]> wrote:
>>>>>>
>>>>>>> Hi Konstantin,
>>>>>>>
>>>>>>> Good to hear from you.
>>>>>>>
>>>>>>> The link you mentioned points to EigenSeedGenerator, not
>>>>>>> RandomSeedGenerator. The problem seems to be with the call to
>>>>>>>
>>>>>>> fs.getFileStatus(input).isDir()
>>>>>>>
>>>>>>> It's been a while and I don't remember exactly, but perhaps you have
>>>>>>> to set additional Hadoop fs properties to use S3. See
>>>>>>> https://wiki.apache.org/hadoop/AmazonS3. Perhaps you can isolate the
>>>>>>> cause by creating a small Java main app with that line of code and
>>>>>>> running it in the debugger.
>>>>>>>
>>>>>>> Cheers,
>>>>>>>
>>>>>>> Frank
>>>>>>>
>>>>>>> On Sun, Mar 16, 2014 at 12:07 PM, Konstantin Slisenko
>>>>>>> <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hello!
>>>>>>>> I run a text-document clustering job on a Hadoop cluster in Amazon
>>>>>>>> Elastic MapReduce. As input and output I use the Amazon S3 file
>>>>>>>> system, and I specify all paths as "s3://bucket-name/folder-name".
>>>>>>>>
>>>>>>>> SparseVectorsFromSequenceFiles works correctly with S3, but when I
>>>>>>>> start the K-Means clustering job, I get this error:
>>>>>>>>
>>>>>>>> Exception in thread "main" java.lang.IllegalArgumentException: This
>>>>>>>> file system object (hdfs://172.31.41.65:9000) does not support
>>>>>>>> access to the request path
>>>>>>>> 's3://by.kslisenko.bigdata/stackovweflow-small/out_new/sparse/tfidf-vectors'
>>>>>>>> You possibly called FileSystem.get(conf) when you should have
>>>>>>>> called FileSystem.get(uri, conf) to obtain a file system supporting
>>>>>>>> your path.
>>>>>>>>   at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:375)
>>>>>>>>   at org.apache.hadoop.hdfs.DistributedFileSystem.checkPath(DistributedFileSystem.java:106)
>>>>>>>>   at org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:162)
>>>>>>>>   at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:530)
>>>>>>>>   at org.apache.mahout.clustering.kmeans.RandomSeedGenerator.buildRandom(RandomSeedGenerator.java:76)
>>>>>>>>   at org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:93)
>>>>>>>>   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>>>>>>>   at bbuzz2011.stackoverflow.runner.RunnerWithInParams.cluster(RunnerWithInParams.java:121)
>>>>>>>>   at bbuzz2011.stackoverflow.runner.RunnerWithInParams.run(RunnerWithInParams.java:52)
>>>>>>>>   at bbuzz2011.stackoverflow.runner.RunnerWithInParams.main(RunnerWithInParams.java:41)
>>>>>>>>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>>>>>   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>>>>>>>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>>>>>>   at java.lang.reflect.Method.invoke(Method.java:597)
>>>>>>>>   at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
>>>>>>>>
>>>>>>>> I checked RandomSeedGenerator.buildRandom (
>>>>>>>> http://grepcode.com/file/repo1.maven.org/maven2/org.apache.mahout/mahout-core/0.8/org/apache/mahout/clustering/kmeans/EigenSeedGenerator.java?av=f
>>>>>>>> ) and I assume it has correct code:
>>>>>>>>
>>>>>>>> FileSystem fs = FileSystem.get(output.toUri(), conf);
>>>>>>>>
>>>>>>>> I cannot run clustering because of this error. Maybe you have some
>>>>>>>> ideas how to fix this?
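The mismatch in the stack trace above can be reproduced without a cluster. This is a minimal sketch in the spirit of the small main app Frank suggested, but as a dependency-free approximation: it uses only java.net.URI to mimic the scheme comparison that rejects the path, not the actual Hadoop FileSystem.checkPath call, and the class name is hypothetical. The hdfs address and s3 path are copied from the error message:

```java
import java.net.URI;

// Approximates why FileSystem.get(conf) fails here: get(conf) returns the
// *default* filesystem (hdfs on this EMR cluster), which can only serve
// paths whose scheme matches its own. FileSystem.get(uri, conf), seeded
// with the path's own URI, would return an S3 filesystem instead.
public class CheckPathRepro {

    /** Mimics the rejection: compares the filesystem's scheme to the path's. */
    public static boolean supports(String fsUri, String path) {
        String fsScheme = URI.create(fsUri).getScheme();
        String pathScheme = URI.create(path).getScheme();
        // A scheme-less path resolves against the default filesystem.
        return pathScheme == null || pathScheme.equalsIgnoreCase(fsScheme);
    }

    public static void main(String[] args) {
        String defaultFs = "hdfs://172.31.41.65:9000";
        String s3Path = "s3://by.kslisenko.bigdata/stackovweflow-small/out_new/sparse/tfidf-vectors";
        System.out.println("hdfs fs supports s3 path: " + supports(defaultFs, s3Path));
        System.out.println("s3 fs supports s3 path:   " + supports(s3Path, s3Path));
    }
}
```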
