Re: Need help in executing SSVD for dimensionality reduction on Mahout
If the rows in the SSVD input are the data points you are trying to project into the reduced space, then the rows of USigma represent those same points in the PCA (reduced) space. The mapping between input rows and output rows is by identical keys in the sequence files. However, your input does not appear to use distinct keys (both rows shown have the key 1), which is not recommended. SSVD will also propagate names if NamedVector is used for the input rows; that is another way to map input rows to their PCA-space rows in USigma. However, the input does not appear to use NamedVectors in this case.

On Mon, Mar 17, 2014 at 10:22 PM, Vijaya Pratap wrote:
> Hi,
>
> I am trying to use SSVD for dimensionality reduction on Mahout. The input
> is sample data in CSV format. Below is a snippet of the input:
>
> 22,2,44,36,5,9,2824,2,4,733,285,169
> 25,1,150,175,3,9,4037,2,18,1822,254,171
>
> I executed the steps below.
>
> 1. Loaded the CSV file and vectorized the data by following the steps
> described at https://github.com/tdunning/pig-vector, with the key as
> TextConverter and the value as VectorWritable. Below is the output of
> this step. I believe the values 420468, 279945 are indices; please correct
> me if I am wrong.
>
> Key: 1: Value:
> {420468:733.0,279945:2.0,607618:285.0,107323:4.0,88330:2.0,263605:9.0,975378:169.0,796003:2824.0,899937:44.0,422862:5.0,723271:22.0,508675:36.0}
> Key: 1: Value:
> {420468:1822.0,279945:2.0,607618:254.0,107323:18.0,88330:1.0,263605:9.0,975378:171.0,796003:4037.0,899937:150.0,422862:3.0,723271:25.0,508675:175.0}
>
> 2. Passed the output of the above command to SSVD as follows:
>
> bin/mahout ssvd -i /user/cloudera/vectorized_data/ -o
> /user/cloudera/reduced_dimensions --rank 7 -us true -V false -U false -pca
> true -ow -t 1
>
> Below is a snippet of the output in the USigma folder:
>
> Key: 1: Value:
> {0:190.78376981262613,1:350.30406212052424,2:78.24932121461198,3:98.67283686605012,4:-122.95056058078157,5:-4.201436498582381,6:-1.4370820809434337}
> Key: 1: Value:
> {0:1295.933786837574,1:-698.5629072274602,2:-24.15996813349674,3:60.936737740013946,4:11.859426028893711,5:-6.379057682687426,6:0.9356299409590896}
>
> Please let me know if my approach is correct, and help me interpret
> the output in the USigma folder.
>
> Thanks in advance
> Pratap
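The key-based mapping described above can be sketched in plain Python, purely to illustrate the join: treat each sequence file as a list of (key, vector) pairs and match input rows to USigma rows by key. The keys and (truncated) values below are taken from the snippets in the question; in real use the keys would have to be distinct per row, which is exactly what the duplicate key 1 in the question breaks.

```python
# Illustrative sketch only (not Mahout code): join input rows to their
# reduced-space rows by sequence-file key, the way SSVD's USigma output
# maps back to its input. Distinct keys (1 and 2 here) make the join
# unambiguous; the duplicate key 1 from the question would not.

input_rows = [
    (1, {723271: 22.0, 279945: 2.0, 899937: 44.0}),   # sparse row, truncated
    (2, {723271: 25.0, 279945: 1.0, 899937: 150.0}),
]
usigma_rows = [
    (1, [190.78, 350.30, 78.25, 98.67, -122.95, -4.20, -1.44]),
    (2, [1295.93, -698.56, -24.16, 60.94, 11.86, -6.38, 0.94]),
]

reduced = dict(usigma_rows)
mapping = {key: (vec, reduced[key]) for key, vec in input_rows}
print(len(mapping[1][1]))  # -> 7, matching --rank 7
```

Each original row thus pairs with a 7-dimensional USigma row under the same key.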
Fwd: Need help in executing SSVD for dimensionality reduction on Mahout
Hi,

I am trying to use SSVD for dimensionality reduction on Mahout. The input is sample data in CSV format. Below is a snippet of the input:

22,2,44,36,5,9,2824,2,4,733,285,169
25,1,150,175,3,9,4037,2,18,1822,254,171

I executed the steps below.

1. Loaded the CSV file and vectorized the data by following the steps described at https://github.com/tdunning/pig-vector, with the key as TextConverter and the value as VectorWritable. Below is the output of this step. I believe the values 420468, 279945 are indices; please correct me if I am wrong.

Key: 1: Value: {420468:733.0,279945:2.0,607618:285.0,107323:4.0,88330:2.0,263605:9.0,975378:169.0,796003:2824.0,899937:44.0,422862:5.0,723271:22.0,508675:36.0}
Key: 1: Value: {420468:1822.0,279945:2.0,607618:254.0,107323:18.0,88330:1.0,263605:9.0,975378:171.0,796003:4037.0,899937:150.0,422862:3.0,723271:25.0,508675:175.0}

2. Passed the output of the above command to SSVD as follows:

bin/mahout ssvd -i /user/cloudera/vectorized_data/ -o /user/cloudera/reduced_dimensions --rank 7 -us true -V false -U false -pca true -ow -t 1

Below is a snippet of the output in the USigma folder:

Key: 1: Value: {0:190.78376981262613,1:350.30406212052424,2:78.24932121461198,3:98.67283686605012,4:-122.95056058078157,5:-4.201436498582381,6:-1.4370820809434337}
Key: 1: Value: {0:1295.933786837574,1:-698.5629072274602,2:-24.15996813349674,3:60.936737740013946,4:11.859426028893711,5:-6.379057682687426,6:0.9356299409590896}

Please let me know if my approach is correct, and help me interpret the output in the USigma folder.

Thanks in advance
Pratap
RE: reduce is too slow in StreamingKmeans
As mahout streamingkmeans has no problems in sequential mode, I would like to try sequential mode. However, "java.lang.OutOfMemoryError" occurs. Where do I set the JVM heap size for sequential mode? Is it the same as for mapreduce mode?

-----Original Message-----
From: fx MA XIAOJUN [mailto:xiaojun...@fujixerox.co.jp]
Sent: Tuesday, March 18, 2014 10:50 AM
To: Suneel Marthi; user@mahout.apache.org
Subject: RE: reduce is too slow in StreamingKmeans

Thank you for your extremely quick reply.

>> What do you mean by this? kmeans hasn't changed between 0.8 and 0.9. Did you mean Streaming KMeans here?

I want to try using -rskm in streaming kmeans, but in mahout 0.8, setting -rskm to true causes errors. I heard that the bug was fixed in 0.9, so I upgraded 0.8 -> 0.9.

The hadoop I installed is cdh5-MRv1, corresponding to hadoop 0.20, not hadoop 2.x (YARN). cdh5-MRv1 has a compatible version of mahout (mahout-0.8+cdh5.0.0b2+28) compiled by cloudera. So I uninstalled mahout-0.8+cdh5.0.0b2+28 and installed the apache mahout 0.9 distribution.

It turned out that "mahout kmeans" runs very well on mapreduce. However, "mahout streamingkmeans" runs properly in sequential mode but fails in mapreduce mode. If the problem were incompatibility between hadoop and mahout, I don't think "mahout kmeans" could run properly. Is mahout 0.9 compatible with Hadoop 0.20?

-----Original Message-----
From: Suneel Marthi [mailto:suneel_mar...@yahoo.com]
Sent: Monday, March 17, 2014 6:21 PM
To: fx MA XIAOJUN; user@mahout.apache.org
Subject: Re: reduce is too slow in StreamingKmeans

On Monday, March 17, 2014 3:43 AM, fx MA XIAOJUN wrote:

Thank you for your quick reply. As to -km, I thought it was log10 instead of ln. I was wrong... This time I set -km 14 and ran mahout streamingkmeans again (CDH 5.0 MRv1, Mahout 0.8). The maps run faster than before, but the reduce was still stuck at 76% forever.

>> This has been my experience too, both with 0.8 and 0.9.
So, I uninstalled mahout 0.8 and installed mahout 0.9 in order to use the -rskm option. Mahout kmeans can be executed properly, so I think the installation of mahout 0.9 is successful.

>> What do you mean by this? kmeans hasn't changed between 0.8 and 0.9. Did you mean Streaming KMeans here?

However, when executing mahout streamingkmeans, I got the following errors. The Hadoop I installed is cdh5-beta1-mapreduce version 1.

Exception in thread "main" java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.JobContext, but class was expected
    at org.apache.mahout.common.HadoopUtil.getCustomJobName(HadoopUtil.java:174)
    at org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansDriver.runMapReduce(StreamingKMeansDriver.java:464)
    at org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansDriver.run(StreamingKMeansDriver.java:419)
    at org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansDriver.run(StreamingKMeansDriver.java:240)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
    at org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansDriver.main(StreamingKMeansDriver.java:491)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
    at org.apache.hadoop.util.ProgramDriver.run(ProgramDriver.java:144)
    at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:152)
    at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:212)

It seems like you are trying to execute on Hadoop 2 while Mahout 0.9 has been built with the Hadoop 1.x profile, hence the error you are seeing. If you would like to test on Hadoop 2, work off of present trunk and build the code with the Hadoop 2 profile like below:

mvn clean install -Dhadoop2.profile=

Please give that a try.

-----Original Message-----
From: Suneel Marthi [mailto:suneel_mar...@yahoo.com]
Sent: Wednesday, February 19, 2014 1:08 AM
To: user@mahout.apache.org
Subject: Re: reduce is too slow in StreamingKmeans

Streaming KMeans runs with a single reducer that runs Ball KMe
Re: reduce is too slow in StreamingKmeans
The -rskm option works only in sequential mode and fails in MR; that's still an issue in present trunk that needs to be fixed. That should explain why Streaming KMeans with -rskm works only in sequential mode for you. Mahout 0.9 has been built with the Hadoop 1.2.1 profile; not sure if that's going to work with 0.20.

On Monday, March 17, 2014 9:50 PM, fx MA XIAOJUN wrote:

Thank you for your extremely quick reply.

>> What do you mean by this? kmeans hasn't changed between 0.8 and 0.9. Did you mean Streaming KMeans here?

I want to try using -rskm in streaming kmeans, but in mahout 0.8, setting -rskm to true causes errors. I heard that the bug was fixed in 0.9, so I upgraded 0.8 -> 0.9.

The hadoop I installed is cdh5-MRv1, corresponding to hadoop 0.20, not hadoop 2.x (YARN). cdh5-MRv1 has a compatible version of mahout (mahout-0.8+cdh5.0.0b2+28) compiled by cloudera. So I uninstalled mahout-0.8+cdh5.0.0b2+28 and installed the apache mahout 0.9 distribution.

It turned out that "mahout kmeans" runs very well on mapreduce. However, "mahout streamingkmeans" runs properly in sequential mode but fails in mapreduce mode. If the problem were incompatibility between hadoop and mahout, I don't think "mahout kmeans" could run properly. Is mahout 0.9 compatible with Hadoop 0.20?

-----Original Message-----
From: Suneel Marthi [mailto:suneel_mar...@yahoo.com]
Sent: Monday, March 17, 2014 6:21 PM
To: fx MA XIAOJUN; user@mahout.apache.org
Subject: Re: reduce is too slow in StreamingKmeans

On Monday, March 17, 2014 3:43 AM, fx MA XIAOJUN wrote:

Thank you for your quick reply. As to -km, I thought it was log10 instead of ln. I was wrong... This time I set -km 14 and ran mahout streamingkmeans again (CDH 5.0 MRv1, Mahout 0.8). The maps run faster than before, but the reduce was still stuck at 76% forever.

>> This has been my experience too, both with 0.8 and 0.9.

So, I uninstalled mahout 0.8 and installed mahout 0.9 in order to use the -rskm option.
Mahout kmeans can be executed properly, so I think the installation of mahout 0.9 is successful.

>> What do you mean by this? kmeans hasn't changed between 0.8 and 0.9. Did you mean Streaming KMeans here?

However, when executing mahout streamingkmeans, I got the following errors. The Hadoop I installed is cdh5-beta1-mapreduce version 1.

Exception in thread "main" java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.JobContext, but class was expected
    at org.apache.mahout.common.HadoopUtil.getCustomJobName(HadoopUtil.java:174)
    at org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansDriver.runMapReduce(StreamingKMeansDriver.java:464)
    at org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansDriver.run(StreamingKMeansDriver.java:419)
    at org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansDriver.run(StreamingKMeansDriver.java:240)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
    at org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansDriver.main(StreamingKMeansDriver.java:491)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
    at org.apache.hadoop.util.ProgramDriver.run(ProgramDriver.java:144)
    at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:152)
    at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:212)

It seems like you are trying to execute on Hadoop 2 while Mahout 0.9 has been built with the Hadoop 1.x profile, hence the error you are seeing. If you would like to test on Hadoop 2, work off of present trunk and build the code with the Hadoop 2 profile like below:

mvn clean install -Dhadoop2.profile=

Please give that a try.

-----Original Message-----
From: Suneel Marthi [mailto:suneel_mar...@yahoo.com]
Sent: Wednesday, February 19, 2014 1:08 AM
To: user@mahout.apache.org
Subject: Re: reduce is too slow in StreamingKmeans

Streaming KMeans runs with a single reducer that runs Ball KMeans and hence the slow performance that you have been experiencing. How did u c
RE: reduce is too slow in StreamingKmeans
Thank you for your extremely quick reply.

>> What do you mean by this? kmeans hasn't changed between 0.8 and 0.9. Did you mean Streaming KMeans here?

I want to try using -rskm in streaming kmeans, but in mahout 0.8, setting -rskm to true causes errors. I heard that the bug was fixed in 0.9, so I upgraded 0.8 -> 0.9.

The hadoop I installed is cdh5-MRv1, corresponding to hadoop 0.20, not hadoop 2.x (YARN). cdh5-MRv1 has a compatible version of mahout (mahout-0.8+cdh5.0.0b2+28) compiled by cloudera. So I uninstalled mahout-0.8+cdh5.0.0b2+28 and installed the apache mahout 0.9 distribution.

It turned out that "mahout kmeans" runs very well on mapreduce. However, "mahout streamingkmeans" runs properly in sequential mode but fails in mapreduce mode. If the problem were incompatibility between hadoop and mahout, I don't think "mahout kmeans" could run properly. Is mahout 0.9 compatible with Hadoop 0.20?

-----Original Message-----
From: Suneel Marthi [mailto:suneel_mar...@yahoo.com]
Sent: Monday, March 17, 2014 6:21 PM
To: fx MA XIAOJUN; user@mahout.apache.org
Subject: Re: reduce is too slow in StreamingKmeans

On Monday, March 17, 2014 3:43 AM, fx MA XIAOJUN wrote:

Thank you for your quick reply. As to -km, I thought it was log10 instead of ln. I was wrong... This time I set -km 14 and ran mahout streamingkmeans again (CDH 5.0 MRv1, Mahout 0.8). The maps run faster than before, but the reduce was still stuck at 76% forever.

>> This has been my experience too, both with 0.8 and 0.9.

So, I uninstalled mahout 0.8 and installed mahout 0.9 in order to use the -rskm option. Mahout kmeans can be executed properly, so I think the installation of mahout 0.9 is successful.

>> What do you mean by this? kmeans hasn't changed between 0.8 and 0.9. Did you mean Streaming KMeans here?

However, when executing mahout streamingkmeans, I got the following errors. The Hadoop I installed is cdh5-beta1-mapreduce version 1.
Exception in thread "main" java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.JobContext, but class was expected
    at org.apache.mahout.common.HadoopUtil.getCustomJobName(HadoopUtil.java:174)
    at org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansDriver.runMapReduce(StreamingKMeansDriver.java:464)
    at org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansDriver.run(StreamingKMeansDriver.java:419)
    at org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansDriver.run(StreamingKMeansDriver.java:240)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
    at org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansDriver.main(StreamingKMeansDriver.java:491)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
    at org.apache.hadoop.util.ProgramDriver.run(ProgramDriver.java:144)
    at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:152)
    at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:212)

It seems like you are trying to execute on Hadoop 2 while Mahout 0.9 has been built with the Hadoop 1.x profile, hence the error you are seeing. If you would like to test on Hadoop 2, work off of present trunk and build the code with the Hadoop 2 profile like below:

mvn clean install -Dhadoop2.profile=

Please give that a try.

-----Original Message-----
From: Suneel Marthi [mailto:suneel_mar...@yahoo.com]
Sent: Wednesday, February 19, 2014 1:08 AM
To: user@mahout.apache.org
Subject: Re: reduce is too slow in StreamingKmeans

Streaming KMeans runs with a single reducer that runs Ball KMeans, hence the slow performance that you have been experiencing.

How did you come up with -km 63000? Given that you would like 10,000 clusters (= k) and have 2,000,000 datapoints (= n), k * ln(n) = 10,000 * ln(2 * 10^6) = 145087 (rounded to the nearest integer), and that should be the value of -km in your case (km = k * ln(n), where ln is the natural log, not log10).

Not sure if that's going to fix your reduce being stuck at 76% forever, but it's definitely worth a try. If you would like go to wit
Re: Mahout parallel K-Means - algorithms analysis
You could take a look at org.apache.mahout.clustering.classify/ClusterClassificationMapper.

Enjoy,
Wei Shung

On Sat, Mar 15, 2014 at 2:51 PM, Suneel Marthi wrote:
> The clustering code is CIMapper and CIReducer. Following the clustering,
> there is cluster classification, which is mapper-only.
>
> Not sure about the reference paper; this stuff's been around for long, but
> the documentation for kmeans on mahout.apache.org should explain the approach.
>
> Sent from my iPhone
>
>> On Mar 15, 2014, at 5:36 PM, hiroshi leon wrote:
>>
>> Hello Ted,
>>
>> Thank you so much for your reply. The program that I was checking is the
>> KMeansDriver class with the run function, the buildCluster function in the
>> same class, and then the ClusterIterator class with the iterateMR function.
>>
>> I would like to know where I can check the code that is implemented
>> for the mapper and the reducer. Is it in CIMapper.class and CIReducer.class?
>>
>> Is there a research paper or pseudo-code on which Mahout parallel
>> K-means was based?
>>
>> Thank you so much and have a nice day.
>>
>> Best regards
>>
>>> From: ted.dunn...@gmail.com
>>> Date: Sat, 15 Mar 2014 13:56:56 -0700
>>> Subject: Re: Mahout parallel K-Means - algorithms analysis
>>> To: user@mahout.apache.org
>>>
>>> We would love to help.
>>>
>>> Can you say which program and which classes you are looking at?
>>>
>>> On Sat, Mar 15, 2014 at 12:58 PM, hiroshi leon <hiroshi_8...@hotmail.com> wrote:
>>>
>>>> To whom it may correspond,
>>>>
>>>> Hello, I have been checking the algorithm of Mahout 0.9 k-means
>>>> using MapReduce, and I would like to know where I can check the code of
>>>> what is happening inside the map function and in the reducer.
>>>>
>>>> I was debugging using NetBeans and I was not able to find what is exactly
>>>> implemented in the Map and Reduce functions...
>>>>
>>>> The reason I am doing this is that I would like to know what
>>>> is exactly implemented in Mahout 0.9 in order to see
>>>> which parts were optimized in the K-Means MapReduce algorithm.
>>>>
>>>> Do you know which research paper the Mahout K-means was based on, or
>>>> where can I read the pseudo-code?
>>>>
>>>> Thank you so much!
>>>>
>>>> Best regards!
>>>>
>>>> Hiroshi
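The CIMapper/CIReducer split discussed above follows the standard MapReduce formulation of Lloyd's k-means iteration. A toy sketch of that idea in plain Python (purely illustrative, not the actual Mahout classes): the "map" step assigns each point to its nearest centroid, and the "reduce" step averages each cluster's points to produce new centroids.

```python
# Toy sketch of one k-means MapReduce iteration (not Mahout code):
# map -> emit (cluster_id, point) for the nearest centroid,
# reduce -> recompute each centroid as the mean of its points.
from collections import defaultdict

def nearest(point, centroids):
    """Index of the centroid with the smallest squared Euclidean distance."""
    return min(range(len(centroids)),
               key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])))

def kmeans_iteration(points, centroids):
    groups = defaultdict(list)
    for p in points:                      # "map" phase
        groups[nearest(p, centroids)].append(p)
    new_centroids = list(centroids)
    for cid, pts in groups.items():       # "reduce" phase
        new_centroids[cid] = [sum(xs) / len(pts) for xs in zip(*pts)]
    return new_centroids

points = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
print(kmeans_iteration(points, [[0.0, 0.0], [10.0, 10.0]]))
# -> [[0.0, 0.5], [10.0, 10.5]]
```

Mahout's driver repeats this map/reduce round until convergence or maxIterations, which is what ClusterIterator.iterateMR coordinates.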
Re: Normalization in Mahout
On Monday, March 17, 2014 8:10 AM, Bikash Gupta wrote:

Want to achieve a few things:

1. Normalize input data of clustering and classification algorithms

Not sure what you consider as normalization, but: if you are trying to normalize text, Lucene's analyzers do it while generating term vectors. If you are trying to normalize the term vectors for clustering, the distance measure specified while clustering normalizes the values appropriately based on the chosen distance measure.

2. Normalize output data to plot in a graph

The output from clustering is already normalized based on the specified distanceMeasure (all of the clustered points are).

On Mon, Mar 17, 2014 at 5:32 PM, Suneel Marthi wrote:
> What are you trying to do?
>
> On Monday, March 17, 2014 7:45 AM, Bikash Gupta wrote:
>
> Hi,
>
> Do we have any utility for Column and Row normalization in Mahout?
>
> --
> Thanks & Regards
> Bikash Gupta

--
Thanks & Regards
Bikash Kumar Gupta
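For the row/column normalization asked about in the original question, here is a minimal sketch of what such a utility would compute, using L2 norms on a small dense matrix. This is plain illustrative Python, not an existing Mahout API:

```python
# Minimal sketch of row and column L2 normalization for a dense matrix
# (illustrative only, not a Mahout utility).
import math

def normalize_rows(matrix):
    out = []
    for row in matrix:
        norm = math.sqrt(sum(x * x for x in row)) or 1.0  # guard against /0
        out.append([x / norm for x in row])
    return out

def normalize_columns(matrix):
    cols = list(zip(*matrix))             # transpose, normalize, transpose back
    return [list(r) for r in zip(*normalize_rows(cols))]

m = [[3.0, 0.0], [0.0, 4.0]]
print(normalize_rows(m))  # -> [[1.0, 0.0], [0.0, 1.0]]
```

In practice, the choice of norm (L1, L2, max) should match the distance measure used downstream, as the reply above points out.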
Re: Normalization in Mahout
Want to achieve a few things:

1. Normalize input data of clustering and classification algorithms
2. Normalize output data to plot in a graph

On Mon, Mar 17, 2014 at 5:32 PM, Suneel Marthi wrote:
> What are you trying to do?
>
> On Monday, March 17, 2014 7:45 AM, Bikash Gupta wrote:
>
> Hi,
>
> Do we have any utility for Column and Row normalization in Mahout?
>
> --
> Thanks & Regards
> Bikash Gupta

--
Thanks & Regards
Bikash Kumar Gupta
Re: Normalization in Mahout
What are you trying to do?

On Monday, March 17, 2014 7:45 AM, Bikash Gupta wrote:

Hi,

Do we have any utility for Column and Row normalization in Mahout?

--
Thanks & Regards
Bikash Gupta
Normalization in Mahout
Hi,

Do we have any utility for Column and Row normalization in Mahout?

--
Thanks & Regards
Bikash Gupta
Re: Problem with FileSystem in Kmeans
I have a 3-node cluster of CDH 4.6; however, I have built Mahout 0.9 with the Hadoop 2.x profile. I have also created a mount point for these nodes, and the path URI is the same as HDFS. I have manually configured the filesystem parameters:

conf.set("fs.hdfs.impl", org.apache.hadoop.hdfs.DistributedFileSystem.class.getName());
conf.set("fs.file.impl", org.apache.hadoop.fs.LocalFileSystem.class.getName());

The input data (sequence file) and cluster centers (output of Canopy) are present in HDFS. After this I am executing KMeansDriver using ToolRunner, but got the error shown above. After debugging, I found that cluster-0 is created in the mount point and cluster-1 in HDFS if I don't provide a file system scheme. Once I provide the file system scheme, i.e. "hdfs://<<>>/", everything works like a charm.

On Mon, Mar 17, 2014 at 4:24 PM, Suneel Marthi wrote:
> Have not seen that behavior with KMeans; what were your settings again?
> Sorry, joining late onto this thread, hence have not looked at the entire history.
>
> On Monday, March 17, 2014 6:52 AM, Bikash Gupta <bikash.gupt...@gmail.com> wrote:
>
> Suneel,
>
> Just for information, I haven't found this issue in Canopy. The Canopy
> cluster-0 was created in HDFS only.
>
> However, the KMeans cluster-0 was created in the local file system and
> cluster-1 in HDFS, and after that it spit an error as it was unable to
> locate cluster-0.
>
> On Mon, Mar 17, 2014 at 3:10 PM, Suneel Marthi wrote:
>
> This problem is specifically to do with Canopy clustering and is not an
> issue with KMeans. I had seen this behavior with Canopy, and looking at the
> code it's indeed an issue wherein cluster-0 is created on the local file
> system and the remaining clusters land on HDFS.
>
> Please file a JIRA for this if not already done so.
>
> On Wednesday, March 12, 2014 3:02 AM, Bikash Gupta <bikash.gupt...@gmail.com> wrote:
>
> Hi,
>
> The problem is not with the input path; it's the way KMeans is executed. Let
> me explain.
>
> I have created CSV->Sequence using map-reduce, hence my data is in HDFS.
> After this I have run Canopy MR, hence that data is also in HDFS.
>
> Now these two things are pushed into the KMeans MR job.
>
> If you check the KMeansDriver class, it first tries to create the cluster-0
> folder with data; here, if you don't specify the scheme, it will write to
> the local file system. After that the MR job is started, which expects
> cluster-0 in HDFS.
>
> Path priorClustersPath = new Path(output, Cluster.INITIAL_CLUSTERS_DIR);
> ClusteringPolicy policy = new KMeansClusteringPolicy(convergenceDelta);
> ClusterClassifier prior = new ClusterClassifier(clusters, policy);
> prior.writeToSeqFiles(priorClustersPath);
>
> if (runSequential) {
>   ClusterIterator.iterateSeq(conf, input, priorClustersPath, output, maxIterations);
> } else {
>   ClusterIterator.iterateMR(conf, input, priorClustersPath, output, maxIterations);
> }
>
> Let me know if I am not able to explain clearly.
>
> On Wed, Mar 12, 2014 at 11:53 AM, Sebastian Schelter wrote:
>
>> Hi Bikash,
>>
>> Have you tried adding hdfs:// to your input path? Maybe that helps.
>>
>> --sebastian
>>
>> On 03/11/2014 11:22 AM, Bikash Gupta wrote:
>>
>>> Hi,
>>>
>>> I am running KMeans in a cluster where I am setting the configuration of
>>> fs.hdfs.impl and fs.file.impl beforehand, as mentioned below:
>>>
>>> conf.set("fs.hdfs.impl", org.apache.hadoop.hdfs.DistributedFileSystem.class.getName());
>>> conf.set("fs.file.impl", org.apache.hadoop.fs.LocalFileSystem.class.getName());
>>>
>>> The problem is that the cluster-0 directory is created in the local file
>>> system and cluster-1 is created in HDFS, and the KMeans map-reduce job is
>>> unable to find cluster-0. Please see the stack trace below:
>>>
>>> 2014-03-11 14:52:15 o.a.m.c.AbstractJob [INFO] Command line arguments:
>>> {--clustering=null, --clusters=[/3/clusters-0-final],
>>> --convergenceDelta=[0.1],
>>> --distanceMeasure=[org.apache.mahout.common.distance.EuclideanDistanceMeasure],
>>> --endPhase=[2147483647], --input=[/2/sequence], --maxIter=[100],
>>> --method=[mapreduce], --output=[/5], --overwrite=null, --startPhase=[0],
>>> --tempDir=[temp]}
>>> 2014-03-11 14:52:15 o.a.h.u.NativeCodeLoader [WARN] Unable to load
>>> native-hadoop library for your platform... using builtin-java classes
>>> where applicable
>>> 2014-03-11 14:52:15 o.a.m.c.k.KMeansDriver [INFO] Input: /2/sequence
>>> Clusters In: /3/clusters-0-final Out: /5
>>> 2014-03-11 14:52:15 o.a.m.c.k.KMeansDriver [INFO] convergence: 0.1 max
>>> Iterations: 100
>>> 2014-03-11 14:52:16 o.a.h.m.JobClient [WARN] Use GenericOptionsParser for
>>> parsing the arguments. Applications should implement Tool for the same.
>>> 2014-03-11 14:52:17 o.a.h.m.l.i.FileInputFormat [INFO] Total input paths
>>> to process : 3
>>> 2014-03-11 14:52:19 o.a.h.m.JobClient [INFO] Running job:
>>> job_201403111332_0011
>>> 2014-03-11
Re: Problem with FileSystem in Kmeans
Have not seen that behavior with KMeans; what were your settings again? Sorry, joining late onto this thread, hence have not looked at the entire history.

On Monday, March 17, 2014 6:52 AM, Bikash Gupta wrote:

Suneel,

Just for information, I haven't found this issue in Canopy. The Canopy cluster-0 was created in HDFS only.

However, the KMeans cluster-0 was created in the local file system and cluster-1 in HDFS, and after that it spit an error as it was unable to locate cluster-0.

On Mon, Mar 17, 2014 at 3:10 PM, Suneel Marthi wrote:

> This problem is specifically to do with Canopy clustering and is not an
> issue with KMeans. I had seen this behavior with Canopy, and looking at the
> code it's indeed an issue wherein cluster-0 is created on the local file
> system and the remaining clusters land on HDFS.
>
> Please file a JIRA for this if not already done so.
>
> On Wednesday, March 12, 2014 3:02 AM, Bikash Gupta <bikash.gupt...@gmail.com> wrote:
>
> Hi,
>
> The problem is not with the input path; it's the way KMeans is executed. Let
> me explain.
>
> I have created CSV->Sequence using map-reduce, hence my data is in HDFS.
> After this I have run Canopy MR, hence that data is also in HDFS.
>
> Now these two things are pushed into the KMeans MR job.
>
> If you check the KMeansDriver class, it first tries to create the cluster-0
> folder with data; here, if you don't specify the scheme, it will write to
> the local file system. After that the MR job is started, which expects
> cluster-0 in HDFS.
>
> Path priorClustersPath = new Path(output, Cluster.INITIAL_CLUSTERS_DIR);
> ClusteringPolicy policy = new KMeansClusteringPolicy(convergenceDelta);
> ClusterClassifier prior = new ClusterClassifier(clusters, policy);
> prior.writeToSeqFiles(priorClustersPath);
>
> if (runSequential) {
>   ClusterIterator.iterateSeq(conf, input, priorClustersPath, output, maxIterations);
> } else {
>   ClusterIterator.iterateMR(conf, input, priorClustersPath, output, maxIterations);
> }
>
> Let me know if I am not able to explain clearly.
>
> On Wed, Mar 12, 2014 at 11:53 AM, Sebastian Schelter wrote:
>
>> Hi Bikash,
>>
>> Have you tried adding hdfs:// to your input path? Maybe that helps.
>>
>> --sebastian
>>
>> On 03/11/2014 11:22 AM, Bikash Gupta wrote:
>>
>>> Hi,
>>>
>>> I am running KMeans in a cluster where I am setting the configuration of
>>> fs.hdfs.impl and fs.file.impl beforehand, as mentioned below:
>>>
>>> conf.set("fs.hdfs.impl", org.apache.hadoop.hdfs.DistributedFileSystem.class.getName());
>>> conf.set("fs.file.impl", org.apache.hadoop.fs.LocalFileSystem.class.getName());
>>>
>>> The problem is that the cluster-0 directory is created in the local file
>>> system and cluster-1 is created in HDFS, and the KMeans map-reduce job is
>>> unable to find cluster-0. Please see the stack trace below:
>>>
>>> 2014-03-11 14:52:15 o.a.m.c.AbstractJob [INFO] Command line arguments:
>>> {--clustering=null, --clusters=[/3/clusters-0-final],
>>> --convergenceDelta=[0.1],
>>> --distanceMeasure=[org.apache.mahout.common.distance.EuclideanDistanceMeasure],
>>> --endPhase=[2147483647], --input=[/2/sequence], --maxIter=[100],
>>> --method=[mapreduce], --output=[/5], --overwrite=null, --startPhase=[0],
>>> --tempDir=[temp]}
>>> 2014-03-11 14:52:15 o.a.h.u.NativeCodeLoader [WARN] Unable to load
>>> native-hadoop library for your platform... using builtin-java classes
>>> where applicable
>>> 2014-03-11 14:52:15 o.a.m.c.k.KMeansDriver [INFO] Input: /2/sequence
>>> Clusters In: /3/clusters-0-final Out: /5
>>> 2014-03-11 14:52:15 o.a.m.c.k.KMeansDriver [INFO] convergence: 0.1 max
>>> Iterations: 100
>>> 2014-03-11 14:52:16 o.a.h.m.JobClient [WARN] Use GenericOptionsParser for
>>> parsing the arguments. Applications should implement Tool for the same.
>>> 2014-03-11 14:52:17 o.a.h.m.l.i.FileInputFormat [INFO] Total input paths
>>> to process : 3
>>> 2014-03-11 14:52:19 o.a.h.m.JobClient [INFO] Running job:
>>> job_201403111332_0011
>>> 2014-03-11 14:52:20 o.a.h.m.JobClient [INFO] map 0% reduce 0%
>>> 2014-03-11 14:52:28 o.a.h.m.JobClient [INFO] Task Id :
>>> attempt_201403111332_0011_m_00_0, Status : FAILED
>>> 2014-03-11 14:52:28 STDIO [ERROR] java.lang.IllegalStateException: /5/clusters-0
>>>     at org.apache.mahout.common.iterator.sequencefile.SequenceFileDirValueIterable.iterator(SequenceFileDirValueIterable.java:78)
>>>     at org.apache.mahout.clustering.classify.ClusterClassifier.readFromSeqFiles(ClusterClassifier.java:208)
>>>     at org.apache.mahout.clustering.iterator.CIMapper.setup(CIMapper.java:44)
>>>     at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:138)
>>>     at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:672)
>>>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:330)
>>>     at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
>>>     at java.security.AccessController.doPrivileged(Native Method)
Re: Problem with FileSystem in Kmeans
Suneel, just for information: I haven't seen this issue in Canopy. Canopy's cluster-0 was created in HDFS only. However, KMeans created cluster-0 on the local file system and cluster-1 in HDFS, and it then threw an error because it was unable to locate cluster-0.

On Mon, Mar 17, 2014 at 3:10 PM, Suneel Marthi wrote:

> This problem is specifically to do with Canopy clustering and is not an
> issue with KMeans. I had seen this behavior with Canopy, and looking at the
> code it's indeed an issue wherein cluster-0 is created on the local file
> system and the remaining clusters land on HDFS.
>
> Please file a JIRA for this if not already done so.
>
> On Wednesday, March 12, 2014 3:02 AM, Bikash Gupta <
> bikash.gupt...@gmail.com> wrote:
>
> Hi,
>
> The problem is not with the input path, it's the way KMeans is executed.
> Let me explain.
>
> I created the CSV->Sequence files using map-reduce, so my data is in HDFS.
> After this I ran Canopy MR, so that data is also in HDFS.
>
> Now these two things are pushed into the KMeans MR.
>
> If you check the KMeansDriver class, it first tries to create the cluster-0
> folder with data; if you don't specify the scheme here, it will write to
> the local file system. After that the MR job starts, which expects
> cluster-0 in HDFS.
>
> Path priorClustersPath = new Path(output, Cluster.INITIAL_CLUSTERS_DIR);
> ClusteringPolicy policy = new KMeansClusteringPolicy(convergenceDelta);
> ClusterClassifier prior = new ClusterClassifier(clusters, policy);
> prior.writeToSeqFiles(priorClustersPath);
>
> if (runSequential) {
>   ClusterIterator.iterateSeq(conf, input, priorClustersPath, output, maxIterations);
> } else {
>   ClusterIterator.iterateMR(conf, input, priorClustersPath, output, maxIterations);
> }
>
> Let me know if I am not able to explain clearly.
>
> On Wed, Mar 12, 2014 at 11:53 AM, Sebastian Schelter wrote:
>
> > Hi Bikash,
> >
> > Have you tried adding hdfs:// to your input path? Maybe that helps.
> > > > --sebastian > > > > > > On 03/11/2014 11:22 AM, Bikash Gupta wrote: > > > >> Hi, > >> > >> I am running Kmeans in cluster where I am setting the configuration of > >> fs.hdfs.impl and fs.file.impl before hand as mentioned below > >> > >> conf.set("fs.hdfs.impl",org.apache.hadoop.hdfs. > >> DistributedFileSystem.class.getName()); > >> conf.set("fs.file.impl",org.apache.hadoop.fs. > >> LocalFileSystem.class.getName()); > >> > >> Problem is that cluster-0 directory is getting created in local file > >> system > >> and cluster-1 is getting created in HDFS, and Kmeans map reduce job is > >> unable to find cluster-0 . Please see below the stacktrace > >> > >> 2014-03-11 14:52:15 o.a.m.c.AbstractJob [INFO] Command line arguments: > >> {--clustering=null, --clusters=[/3/clusters-0-final], > >> --convergenceDelta=[0.1], > >> --distanceMeasure=[org.apache.mahout.common.distance. > >> EuclideanDistanceMeasure], > >> --endPhase=[2147483647], --input=[/2/sequence], --maxIter=[100], > >> --method=[mapreduce], --output=[/5], --overwrite=null, --startPhase=[0], > >> --tempDir=[temp]} > >> 2014-03-11 14:52:15 o.a.h.u.NativeCodeLoader [WARN] Unable to load > >> native-hadoop library for your platform... using builtin-java classes > >> where > >> applicable > >> 2014-03-11 14:52:15 o.a.m.c.k.KMeansDriver [INFO] Input: /2/sequence > >> Clusters In: /3/clusters-0-final Out: /5 > >> 2014-03-11 14:52:15 o.a.m.c.k.KMeansDriver [INFO] convergence: 0.1 max > >> Iterations: 100 > >> 2014-03-11 14:52:16 o.a.h.m.JobClient [WARN] Use GenericOptionsParser > for > >> parsing the arguments. Applications should implement Tool for the same. 
> >> 2014-03-11 14:52:17 o.a.h.m.l.i.FileInputFormat [INFO] Total input paths > >> to > >> process : 3 > >> 2014-03-11 14:52:19 o.a.h.m.JobClient [INFO] Running job: > >> job_201403111332_0011 > >> 2014-03-11 14:52:20 o.a.h.m.JobClient [INFO] map 0% reduce 0% > >> 2014-03-11 14:52:28 o.a.h.m.JobClient [INFO] Task Id : > >> attempt_201403111332_0011_m_00_0, Status : FAILED > >> 2014-03-11 14:52:28 STDIO [ERROR] java.lang.IllegalStateException: > >> /5/clusters-0 > >> at > >> org.apache.mahout.common.iterator.sequencefile. > >> SequenceFileDirValueIterable.iterator(SequenceFileDirValueIterable. > >> java:78) > >> at > >> > org.apache.mahout.clustering.classify.ClusterClassifier.readFromSeqFiles( > >> ClusterClassifier.java:208) > >> at > >> org.apache.mahout.clustering.iterator.CIMapper.setup(CIMapper.java:44) > >> at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:138) > >> at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask. > >> java:672) > >> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:330) > >> at org.apache.hadoop.mapred.Child$4.run(Child.java:268) > >> at java.security.AccessController.doPrivileged(Native Method) > >> at javax.security.auth.Subject.doAs(Subject.java:415) > >> at > >> org.apache.h
Re: java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected
Are you running on Hadoop 2.x? That seems to be the case here. Compile with the hadoop 2 profile:

mvn -DskipTests clean install -Dhadoop2.profile=

On Monday, March 17, 2014 5:57 AM, Margusja wrote:

Hi

Here is my output:

[speech@h14 ~]$ mahout/bin/mahout seqdirectory -c UTF-8 -i /user/speech/demo -o demo-seqfiles
MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Running on hadoop, using /usr/bin/hadoop and HADOOP_CONF_DIR=/etc/hadoop/conf
MAHOUT-JOB: /home/speech/mahout/examples/target/mahout-examples-1.0-SNAPSHOT-job.jar
14/03/17 11:47:30 INFO common.AbstractJob: Command line arguments: {--charset=[UTF-8], --chunkSize=[64], --endPhase=[2147483647], --fileFilterClass=[org.apache.mahout.text.PrefixAdditionFilter], --input=[/user/speech/demo], --keyPrefix=[], --method=[mapreduce], --output=[demo-seqfiles], --startPhase=[0], --tempDir=[temp]}
14/03/17 11:47:31 INFO Configuration.deprecation: mapred.input.dir is deprecated. Instead, use mapreduce.input.fileinputformat.inputdir
14/03/17 11:47:31 INFO Configuration.deprecation: mapred.compress.map.output is deprecated. Instead, use mapreduce.map.output.compress
14/03/17 11:47:31 INFO Configuration.deprecation: mapred.output.dir is deprecated. Instead, use mapreduce.output.fileoutputformat.outputdir
14/03/17 11:47:31 INFO Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
14/03/17 11:47:31 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
14/03/17 11:47:32 INFO input.FileInputFormat: Total input paths to process : 10
14/03/17 11:47:32 INFO input.CombineFileInputFormat: DEBUG: Terminated node allocation with : CompletedNodes: 4, size left: 29775
14/03/17 11:47:32 INFO mapreduce.JobSubmitter: number of splits:1
14/03/17 11:47:32 INFO Configuration.deprecation: user.name is deprecated. Instead, use mapreduce.job.user.name
14/03/17 11:47:32 INFO Configuration.deprecation: mapred.output.compress is deprecated.
Instead, use mapreduce.output.fileoutputformat.compress 14/03/17 11:47:32 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar 14/03/17 11:47:32 INFO Configuration.deprecation: mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces 14/03/17 11:47:32 INFO Configuration.deprecation: mapred.output.value.class is deprecated. Instead, use mapreduce.job.output.value.class 14/03/17 11:47:32 INFO Configuration.deprecation: mapred.mapoutput.value.class is deprecated. Instead, use mapreduce.map.output.value.class 14/03/17 11:47:32 INFO Configuration.deprecation: mapreduce.map.class is deprecated. Instead, use mapreduce.job.map.class 14/03/17 11:47:32 INFO Configuration.deprecation: mapred.job.name is deprecated. Instead, use mapreduce.job.name 14/03/17 11:47:32 INFO Configuration.deprecation: mapreduce.inputformat.class is deprecated. Instead, use mapreduce.job.inputformat.class 14/03/17 11:47:32 INFO Configuration.deprecation: mapred.max.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.maxsize 14/03/17 11:47:32 INFO Configuration.deprecation: mapreduce.outputformat.class is deprecated. Instead, use mapreduce.job.outputformat.class 14/03/17 11:47:32 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps 14/03/17 11:47:32 INFO Configuration.deprecation: mapred.output.key.class is deprecated. Instead, use mapreduce.job.output.key.class 14/03/17 11:47:32 INFO Configuration.deprecation: mapred.mapoutput.key.class is deprecated. Instead, use mapreduce.map.output.key.class 14/03/17 11:47:32 INFO Configuration.deprecation: mapred.working.dir is deprecated. 
Instead, use mapreduce.job.working.dir 14/03/17 11:47:32 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local42076163_0001 14/03/17 11:47:32 WARN conf.Configuration: file:/tmp/hadoop-speech/mapred/staging/speech42076163/.staging/job_local42076163_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring. 14/03/17 11:47:32 WARN conf.Configuration: file:/tmp/hadoop-speech/mapred/staging/speech42076163/.staging/job_local42076163_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring. 14/03/17 11:47:32 WARN conf.Configuration: file:/tmp/hadoop-speech/mapred/local/localRunner/speech/job_local42076163_0001/job_local42076163_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring. 14/03/17 11:47:32 WARN conf.Configuration: file:/tmp/hadoop-speech/mapred/local/localRunner/speech/job_local42076163_0001/job_local42076163_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring. 14/03/17 11:47:32 INFO mapreduce.Job: The url to track the job: http://localhost:8080/ 14/03/17 11:47:32 INFO mapreduce.Job: Running job: job_local42076163_0001 14/03/17 11:47:32 INFO ma
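For context on the error in this thread's subject line: it is a binary-compatibility break. org.apache.hadoop.mapreduce.TaskAttemptContext (and JobContext) were concrete classes in Hadoop 1.x but became interfaces in Hadoop 2.x, so a jar compiled against one generation fails at class-load time on the other. A minimal sketch of a reflection probe for which generation is on the classpath (this is an illustration, not Mahout code; it runs even without Hadoop jars, in which case it just reports absence):

```java
// Probe whether the Hadoop on the classpath is 1.x-style (class) or
// 2.x-style (interface). A job jar built against the wrong generation
// throws java.lang.IncompatibleClassChangeError, as seen in this thread.
public class HadoopProbe {
    public static void main(String[] args) {
        String name = "org.apache.hadoop.mapreduce.TaskAttemptContext";
        try {
            Class<?> c = Class.forName(name);
            System.out.println(name + " is "
                + (c.isInterface() ? "an interface (Hadoop 2.x)" : "a class (Hadoop 1.x)"));
        } catch (ClassNotFoundException e) {
            System.out.println("Hadoop is not on the classpath");
        }
    }
}
```

If the probe reports an interface, rebuild Mahout with the Hadoop 2 profile as suggested above.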
java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected
Hi Here is my output: [speech@h14 ~]$ mahout/bin/mahout seqdirectory -c UTF-8 -i /user/speech/demo -o demo-seqfiles MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath. Running on hadoop, using /usr/bin/hadoop and HADOOP_CONF_DIR=/etc/hadoop/conf MAHOUT-JOB: /home/speech/mahout/examples/target/mahout-examples-1.0-SNAPSHOT-job.jar 14/03/17 11:47:30 INFO common.AbstractJob: Command line arguments: {--charset=[UTF-8], --chunkSize=[64], --endPhase=[2147483647], --fileFilterClass=[org.apache.mahout.text.PrefixAdditionFilter], --input=[/user/speech/demo], --keyPrefix=[], --method=[mapreduce], --output=[demo-seqfiles], --startPhase=[0], --tempDir=[temp]} 14/03/17 11:47:31 INFO Configuration.deprecation: mapred.input.dir is deprecated. Instead, use mapreduce.input.fileinputformat.inputdir 14/03/17 11:47:31 INFO Configuration.deprecation: mapred.compress.map.output is deprecated. Instead, use mapreduce.map.output.compress 14/03/17 11:47:31 INFO Configuration.deprecation: mapred.output.dir is deprecated. Instead, use mapreduce.output.fileoutputformat.outputdir 14/03/17 11:47:31 INFO Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id 14/03/17 11:47:31 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId= 14/03/17 11:47:32 INFO input.FileInputFormat: Total input paths to process : 10 14/03/17 11:47:32 INFO input.CombineFileInputFormat: DEBUG: Terminated node allocation with : CompletedNodes: 4, size left: 29775 14/03/17 11:47:32 INFO mapreduce.JobSubmitter: number of splits:1 14/03/17 11:47:32 INFO Configuration.deprecation: user.name is deprecated. Instead, use mapreduce.job.user.name 14/03/17 11:47:32 INFO Configuration.deprecation: mapred.output.compress is deprecated. Instead, use mapreduce.output.fileoutputformat.compress 14/03/17 11:47:32 INFO Configuration.deprecation: mapred.jar is deprecated. 
Instead, use mapreduce.job.jar 14/03/17 11:47:32 INFO Configuration.deprecation: mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces 14/03/17 11:47:32 INFO Configuration.deprecation: mapred.output.value.class is deprecated. Instead, use mapreduce.job.output.value.class 14/03/17 11:47:32 INFO Configuration.deprecation: mapred.mapoutput.value.class is deprecated. Instead, use mapreduce.map.output.value.class 14/03/17 11:47:32 INFO Configuration.deprecation: mapreduce.map.class is deprecated. Instead, use mapreduce.job.map.class 14/03/17 11:47:32 INFO Configuration.deprecation: mapred.job.name is deprecated. Instead, use mapreduce.job.name 14/03/17 11:47:32 INFO Configuration.deprecation: mapreduce.inputformat.class is deprecated. Instead, use mapreduce.job.inputformat.class 14/03/17 11:47:32 INFO Configuration.deprecation: mapred.max.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.maxsize 14/03/17 11:47:32 INFO Configuration.deprecation: mapreduce.outputformat.class is deprecated. Instead, use mapreduce.job.outputformat.class 14/03/17 11:47:32 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps 14/03/17 11:47:32 INFO Configuration.deprecation: mapred.output.key.class is deprecated. Instead, use mapreduce.job.output.key.class 14/03/17 11:47:32 INFO Configuration.deprecation: mapred.mapoutput.key.class is deprecated. Instead, use mapreduce.map.output.key.class 14/03/17 11:47:32 INFO Configuration.deprecation: mapred.working.dir is deprecated. Instead, use mapreduce.job.working.dir 14/03/17 11:47:32 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local42076163_0001 14/03/17 11:47:32 WARN conf.Configuration: file:/tmp/hadoop-speech/mapred/staging/speech42076163/.staging/job_local42076163_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring. 
14/03/17 11:47:32 WARN conf.Configuration: file:/tmp/hadoop-speech/mapred/staging/speech42076163/.staging/job_local42076163_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring. 14/03/17 11:47:32 WARN conf.Configuration: file:/tmp/hadoop-speech/mapred/local/localRunner/speech/job_local42076163_0001/job_local42076163_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring. 14/03/17 11:47:32 WARN conf.Configuration: file:/tmp/hadoop-speech/mapred/local/localRunner/speech/job_local42076163_0001/job_local42076163_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring. 14/03/17 11:47:32 INFO mapreduce.Job: The url to track the job: http://localhost:8080/ 14/03/17 11:47:32 INFO mapreduce.Job: Running job: job_local42076163_0001 14/03/17 11:47:32 INFO mapred.LocalJobRunner: OutputCommitter set in config null 14/03/17 11:47:33 INFO mapred.LocalJobRunner: OutputCommitter is org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter 14/03/17 11:47:33
Re: Problem with FileSystem in Kmeans
This problem is specifically to do with Canopy clustering and is not an issue with KMeans. I had seen this behavior with Canopy, and looking at the code it's indeed an issue wherein cluster-0 is created on the local file system and the remaining clusters land on HDFS.

Please file a JIRA for this if not already done so.

On Wednesday, March 12, 2014 3:02 AM, Bikash Gupta wrote:

Hi,

The problem is not with the input path, it's the way KMeans is executed. Let me explain.

I created the CSV->Sequence files using map-reduce, so my data is in HDFS. After this I ran Canopy MR, so that data is also in HDFS.

Now these two things are pushed into the KMeans MR.

If you check the KMeansDriver class, it first tries to create the cluster-0 folder with data; if you don't specify the scheme here, it will write to the local file system. After that the MR job starts, which expects cluster-0 in HDFS.

Path priorClustersPath = new Path(output, Cluster.INITIAL_CLUSTERS_DIR);
ClusteringPolicy policy = new KMeansClusteringPolicy(convergenceDelta);
ClusterClassifier prior = new ClusterClassifier(clusters, policy);
prior.writeToSeqFiles(priorClustersPath);

if (runSequential) {
  ClusterIterator.iterateSeq(conf, input, priorClustersPath, output, maxIterations);
} else {
  ClusterIterator.iterateMR(conf, input, priorClustersPath, output, maxIterations);
}

Let me know if I am not able to explain clearly.

On Wed, Mar 12, 2014 at 11:53 AM, Sebastian Schelter wrote:

> Hi Bikash,
>
> Have you tried adding hdfs:// to your input path? Maybe that helps.
>
> --sebastian
>
> On 03/11/2014 11:22 AM, Bikash Gupta wrote:
>
>> Hi,
>>
>> I am running Kmeans in a cluster where I am setting the configuration of
>> fs.hdfs.impl and fs.file.impl beforehand, as mentioned below
>>
>> conf.set("fs.hdfs.impl",org.apache.hadoop.hdfs.
>> DistributedFileSystem.class.getName());
>> conf.set("fs.file.impl",org.apache.hadoop.fs.
>> LocalFileSystem.class.getName()); >> >> Problem is that cluster-0 directory is getting created in local file >> system >> and cluster-1 is getting created in HDFS, and Kmeans map reduce job is >> unable to find cluster-0 . Please see below the stacktrace >> >> 2014-03-11 14:52:15 o.a.m.c.AbstractJob [INFO] Command line arguments: >> {--clustering=null, --clusters=[/3/clusters-0-final], >> --convergenceDelta=[0.1], >> --distanceMeasure=[org.apache.mahout.common.distance. >> EuclideanDistanceMeasure], >> --endPhase=[2147483647], --input=[/2/sequence], --maxIter=[100], >> --method=[mapreduce], --output=[/5], --overwrite=null, --startPhase=[0], >> --tempDir=[temp]} >> 2014-03-11 14:52:15 o.a.h.u.NativeCodeLoader [WARN] Unable to load >> native-hadoop library for your platform... using builtin-java classes >> where >> applicable >> 2014-03-11 14:52:15 o.a.m.c.k.KMeansDriver [INFO] Input: /2/sequence >> Clusters In: /3/clusters-0-final Out: /5 >> 2014-03-11 14:52:15 o.a.m.c.k.KMeansDriver [INFO] convergence: 0.1 max >> Iterations: 100 >> 2014-03-11 14:52:16 o.a.h.m.JobClient [WARN] Use GenericOptionsParser for >> parsing the arguments. Applications should implement Tool for the same. >> 2014-03-11 14:52:17 o.a.h.m.l.i.FileInputFormat [INFO] Total input paths >> to >> process : 3 >> 2014-03-11 14:52:19 o.a.h.m.JobClient [INFO] Running job: >> job_201403111332_0011 >> 2014-03-11 14:52:20 o.a.h.m.JobClient [INFO] map 0% reduce 0% >> 2014-03-11 14:52:28 o.a.h.m.JobClient [INFO] Task Id : >> attempt_201403111332_0011_m_00_0, Status : FAILED >> 2014-03-11 14:52:28 STDIO [ERROR] java.lang.IllegalStateException: >> /5/clusters-0 >> at >> org.apache.mahout.common.iterator.sequencefile. >> SequenceFileDirValueIterable.iterator(SequenceFileDirValueIterable. 
>> java:78) >> at >> org.apache.mahout.clustering.classify.ClusterClassifier.readFromSeqFiles( >> ClusterClassifier.java:208) >> at >> org.apache.mahout.clustering.iterator.CIMapper.setup(CIMapper.java:44) >> at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:138) >> at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask. >> java:672) >> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:330) >> at org.apache.hadoop.mapred.Child$4.run(Child.java:268) >> at java.security.AccessController.doPrivileged(Native Method) >> at javax.security.auth.Subject.doAs(Subject.java:415) >> at >> org.apache.hadoop.security.UserGroupInformation.doAs( >> UserGroupInformation.java:1438) >> at org.apache.hadoop.mapred.Child.main(Child.java:262) >> Caused by: java.io.FileNotFoundException: File /5/clusters-0 >> >> Please suggest!!! >> >> >> > -- Thanks & Regards Bikash Kumar Gupta
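The scheme ambiguity Bikash describes (an unqualified path such as /5/clusters-0 landing on the local file system unless something supplies an hdfs:// scheme) can be sketched with plain java.net.URI, with no Hadoop dependency; Hadoop's Path resolution against the configured default filesystem behaves analogously. The authority namenode:8020 below is a placeholder, not from the thread:

```java
import java.net.URI;

// A path with no scheme is ambiguous: it resolves against whatever default
// filesystem the writing code sees. Qualifying it pins it to HDFS.
public class PathScheme {
    public static void main(String[] args) {
        URI defaultFs = URI.create("hdfs://namenode:8020/"); // placeholder authority
        URI bare = URI.create("/5/clusters-0");              // no scheme: ambiguous
        URI qualified = defaultFs.resolve(bare);             // scheme + authority from the default
        System.out.println(qualified); // hdfs://namenode:8020/5/clusters-0
    }
}
```

This is why Sebastian's suggestion of prefixing hdfs:// to the paths works around the bug: fully qualified paths are immune to whichever default filesystem KMeansDriver happens to see when it writes cluster-0.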
Re: reduce is too slow in StreamingKmeans
On Monday, March 17, 2014 3:43 AM, fx MA XIAOJUN wrote:

Thank you for your quick reply. As to -km, I thought it was log10 instead of ln. I was wrong... This time I set -km 14 and ran mahout streamingkmeans again (CDH 5.0 MRv1, Mahout 0.8). The maps ran faster than before, but the reduce was still stuck at 76% forever.

>> This has been my experience too, both with 0.8 and 0.9.

So I uninstalled Mahout 0.8 and installed Mahout 0.9 in order to use the -rskm option. Mahout kmeans can be executed properly, so I think the installation of Mahout 0.9 is successful.

>> What do you mean by this? kmeans hasn't changed between 0.8 and 0.9. Did you
>> mean Streaming KMeans here?

However, when executing mahout streamingkmeans, I got errors as follows. The Hadoop I installed is cdh5-beta1-mapreduce version 1.

Exception in thread "main" java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.JobContext, but class was expected
at org.apache.mahout.common.HadoopUtil.getCustomJobName(HadoopUtil.java:174)
at org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansDriver.runMapReduce(StreamingKMeansDriver.java:464)
at org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansDriver.run(StreamingKMeansDriver.java:419)
at org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansDriver.run(StreamingKMeansDriver.java:240)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
at org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansDriver.main(StreamingKMeansDriver.java:491)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
at org.apache.hadoop.util.ProgramDriver.run(ProgramDriver.java:144)
at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:152)
at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.main(RunJar.java:212)

Seems like you are trying to execute on Hadoop 2 while Mahout 0.9 has been built with the Hadoop 1.x profile, hence the error you are seeing. If you would like to test on Hadoop 2, work off of present trunk and build the code with the Hadoop 2 profile like below:

mvn clean install -Dhadoop2.profile=

Please give that a try.

-Original Message-
From: Suneel Marthi [mailto:suneel_mar...@yahoo.com]
Sent: Wednesday, February 19, 2014 1:08 AM
To: user@mahout.apache.org
Subject: Re: reduce is too slow in StreamingKmeans

Streaming KMeans runs with a single reducer that runs Ball KMeans, hence the slow performance that you have been experiencing.

How did you come up with -km 63000? Given that you would like 10,000 clusters (= k) and have 2,000,000 datapoints (= n), then k * ln(n) = 10000 * ln(2 * 10^6) = 145087 (rounded to the nearest integer), and that should be the value of -km in your case (km = k * ln(n)).

Not sure if that's gonna fix your reduce being stuck at 76% forever, but it's definitely worth a try.

If you would like to go with the -rskm option, please upgrade to Mahout 0.9. I still think there's an issue with the -rskm option with Mahout 0.9 and trunk today while executing in MR mode, but it definitely works in the nonMR (-xm sequential) mode in 0.9.

On Monday, February 17, 2014 9:05 PM, Sylvia Ma wrote:

I am using Mahout 0.8 embedded in cdh5.0.0 provided by Cloudera and found that the reduce of mahout streamingkmeans is extremely slow.
For example: With a dataset of 2,000,000 objects, 128 variables, I would like to get 10,000 clusters. The command executed is the following.

mahout streamingkmeans -i input -o output -ow -k 10000 -km 63000

I have 15 maps which were all completed in 4 hours. However, reduce took over 100 hours and it was still stuck at 76%.

I have tuned the performance of hadoop as follows.

map task jvm = 3g
reduce task jvm = 10g
io.sort.mb = 512
io.sort.factor = 50
mapred.reduce.parallel.copies = 10
mapred.inmem.merge.threshold = 0

I tried to assign enough memory but the reduce is still very very very slow. Why does it take so much time in reduce? And what can I do to speed up the job? I wonder if it will be helpful to set -rskm to be true.
RE: reduce is too slow in StreamingKmeans
Thank you for your quick reply.

As to -km, I thought it was log10 instead of ln. I was wrong... This time I set -km 14 and ran mahout streamingkmeans again (CDH 5.0 MRv1, Mahout 0.8). The maps ran faster than before, but the reduce was still stuck at 76% forever.

So I uninstalled Mahout 0.8 and installed Mahout 0.9 in order to use the -rskm option. Mahout kmeans can be executed properly, so I think the installation of Mahout 0.9 is successful. However, when executing mahout streamingkmeans, I got errors as follows. The Hadoop I installed is cdh5-beta1-mapreduce version 1.

Exception in thread "main" java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.JobContext, but class was expected
at org.apache.mahout.common.HadoopUtil.getCustomJobName(HadoopUtil.java:174)
at org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansDriver.runMapReduce(StreamingKMeansDriver.java:464)
at org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansDriver.run(StreamingKMeansDriver.java:419)
at org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansDriver.run(StreamingKMeansDriver.java:240)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
at org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansDriver.main(StreamingKMeansDriver.java:491)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
at org.apache.hadoop.util.ProgramDriver.run(ProgramDriver.java:144)
at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:152)
at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.main(RunJar.java:212)

-Original Message-
From: Suneel Marthi [mailto:suneel_mar...@yahoo.com]
Sent: Wednesday, February 19, 2014 1:08 AM
To: user@mahout.apache.org
Subject: Re: reduce is too slow in StreamingKmeans

Streaming KMeans runs with a single reducer that runs Ball KMeans, hence the slow performance that you have been experiencing.

How did you come up with -km 63000? Given that you would like 10,000 clusters (= k) and have 2,000,000 datapoints (= n), then k * ln(n) = 10000 * ln(2 * 10^6) = 145087 (rounded to the nearest integer), and that should be the value of -km in your case (km = k * ln(n)).

Not sure if that's gonna fix your reduce being stuck at 76% forever, but it's definitely worth a try.

If you would like to go with the -rskm option, please upgrade to Mahout 0.9. I still think there's an issue with the -rskm option with Mahout 0.9 and trunk today while executing in MR mode, but it definitely works in the nonMR (-xm sequential) mode in 0.9.

On Monday, February 17, 2014 9:05 PM, Sylvia Ma wrote:

I am using Mahout 0.8 embedded in cdh5.0.0 provided by Cloudera and found that the reduce of mahout streamingkmeans is extremely slow.

For example: With a dataset of 2,000,000 objects, 128 variables, I would like to get 10,000 clusters. The command executed is the following.

mahout streamingkmeans -i input -o output -ow -k 10000 -km 63000

I have 15 maps which were all completed in 4 hours. However, reduce took over 100 hours and it was still stuck at 76%.

I have tuned the performance of hadoop as follows.
map task jvm = 3g
reduce task jvm = 10g
io.sort.mb = 512
io.sort.factor = 50
mapred.reduce.parallel.copies = 10
mapred.inmem.merge.threshold = 0

I tried to assign enough memory but the reduce is still very very very slow. Why does it take so much time in reduce? And what can I do to speed up the job? I wonder if it will be helpful to set -rskm to be true. The -rskm option has a bug in Mahout 0.8, so I cannot give it a try...

Yours Sincerely,
Sylvia Ma
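Suneel's -km arithmetic can be checked in a few lines. The value k = 10,000 is implied by the thread's own numbers: 63000 ≈ k * log10(2 * 10^6) (the original log10 mix-up) and 145087 = k * ln(2 * 10^6) (the corrected natural-log formula). A quick sketch:

```java
// Recomputing the -km suggestion from the thread: km = k * ln(n).
// k = 10,000 desired clusters, n = 2,000,000 points.
public class KmEstimate {
    public static void main(String[] args) {
        int k = 10_000;
        long n = 2_000_000L;
        long kmNatural = Math.round(k * Math.log(n));   // ln, what Suneel used
        long kmBase10  = Math.round(k * Math.log10(n)); // log10, the earlier mix-up
        System.out.println("k * ln(n)    = " + kmNatural); // 145087
        System.out.println("k * log10(n) = " + kmBase10);  // 63010, ~ the original -km 63000
    }
}
```

The two results reproduce both numbers in the thread, confirming that the original -km 63000 came from using log base 10 instead of the natural log.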