RE: Mahout parallel K-Means - algorithms analysis
Thank you Wei and Suneel. By the way, does somebody know if the parallel k-means of Mahout uses Canopy clustering at the beginning to generate the initial K in the KMeansDriver class? Best regards, Hiroshi

Date: Mon, 17 Mar 2014 13:05:01 -0700 Subject: Re: Mahout parallel K-Means - algorithms analysis From: weish...@gmail.com To: user@mahout.apache.org CC: ted.dunn...@gmail.com
You could take a look at org.apache.mahout.clustering.classify.ClusterClassificationMapper. Enjoy, Wei Shung

On Sat, Mar 15, 2014 at 2:51 PM, Suneel Marthi suneel_mar...@yahoo.com wrote: The clustering code is CIMapper and CIReducer. Following the clustering, there is cluster classification, which is mapper-only. Not sure about the reference paper; this stuff's been around for long, but the documentation for k-means on mahout.apache.org should explain the approach. Sent from my iPhone

On Mar 15, 2014, at 5:36 PM, hiroshi leon hiroshi_8...@hotmail.com wrote: Hello Ted, Thank you so much for your reply. The program that I was checking is the KMeansDriver class with the run function, the buildCluster function in the same class, and then the ClusterIterator class with the iterateMR function. Where can I check the code that is implemented for the mapper and the reducer? Is it in CIMapper.class and CIReducer.class? Is there a research paper or pseudo-code on which Mahout parallel k-means was based? Thank you so much and have a nice day. Best regards

From: ted.dunn...@gmail.com Date: Sat, 15 Mar 2014 13:56:56 -0700 Subject: Re: Mahout parallel K-Means - algorithms analysis To: user@mahout.apache.org
We would love to help. Can you say which program and which classes you are looking at?
On Sat, Mar 15, 2014 at 12:58 PM, hiroshi leon hiroshi_8...@hotmail.com wrote: To whom it may concern, Hello, I have been checking the Mahout 0.9 k-means algorithm using MapReduce, and I would like to know where I can check the code of what happens inside the map function and the reducer. I was debugging using NetBeans and I was not able to find what exactly is implemented in the map and reduce functions... The reason I am doing this is that I would like to know exactly what is implemented in Mahout 0.9, in order to see which parts of the k-means MapReduce algorithm were optimized. Do you know which research paper the Mahout k-means was based on, or where I can read the pseudo-code? Thank you so much! Best regards! Hiroshi
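For readers tracing CIMapper/CIReducer: MapReduce k-means follows the standard Lloyd iteration, with the map phase assigning each point to its nearest centroid and the reduce phase averaging each cluster's points into a new centroid. A minimal, Mahout-independent Python sketch of one such iteration (the data and starting centroids below are invented for illustration):

```python
from collections import defaultdict

def nearest(point, centroids):
    """Map-side logic: index of the closest centroid (squared Euclidean)."""
    return min(range(len(centroids)),
               key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])))

def kmeans_iteration(points, centroids):
    """One MapReduce pass: 'map' emits (cluster, point), 'reduce' averages."""
    groups = defaultdict(list)
    for p in points:                              # map phase
        groups[nearest(p, centroids)].append(p)
    new_centroids = list(centroids)
    for k, members in groups.items():             # reduce phase
        dim = len(members[0])
        new_centroids[k] = [sum(m[d] for m in members) / len(members)
                            for d in range(dim)]
    return new_centroids

points = [[1, 1], [2, 1], [1, 2], [2, 2], [8, 8], [9, 8], [8, 9], [9, 9]]
print(kmeans_iteration(points, [[0, 0], [10, 10]]))  # [[1.5, 1.5], [8.5, 8.5]]
```

The driver repeats this job until the centroids converge or the iteration cap is reached; the sketch covers one pass only.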
Command line vector to sequence file
Hi, I am looking for a simple command-line way to convert vectors to a sequence file. For example, I have a data.txt file containing the vectors: 1,1 2,1 1,2 2,2 3,3 8,8 8,9 9,8 9,9. Is there a command-line way to convert that into a sequence file? I tried mahout seqdirectory, but afterwards hdfs dfs -text output2/part-m-0 gives me something like: /data.txt1,1 2,1 1,2 2,2 3,3 8,8 8,9 9,8 9,9 and that is not sequence file format as I understand it. I know there is a Java API, but I am looking for a command line. -- Best regards, Margus (Margusja) Roo +372 51 48 780 http://margus.roo.ee http://ee.linkedin.com/in/margusroo skype: margusja ldapsearch -x -h ldap.sk.ee -b c=EE (serialNumber=37303140314) -BEGIN PUBLIC KEY- MIGfMA0GCSqGSIb3DQEBAQUAA4GNADCBiQKBgQCvbeg7LwEC2SCpAEewwpC3ajxE 5ZsRMCB77L8bae9G7TslgLkoIzo9yOjPdx2NN6DllKbV65UjTay43uUDyql9g3tl RhiJIcoAExkSTykWqAIPR88LfilLy1JlQ+0RD8OXiWOVVQfhOHpQ0R/jcAkM2lZa BjM8j36yJvoBVsfOHQIDAQAB -END PUBLIC KEY-
Re: Command line vector to sequence file
Thank you, I am going to try it. Best regards, Margus (Margusja) Roo

On 18/03/14 10:58, Kevin Moulart wrote: Hi, I did the same search a few weeks back and found that there is nothing in the current API to do that from the command line. However, I did write a Java program that transforms a CSV into a SequenceFile, which can be used to train a naive Bayes classifier (amongst other things). Here are the sources: https://gist.github.com/kmoulart/9616125 You'll find all you need to make a jar with dependencies, runnable with a proper command line (using JCommander). Both the sequential version and the MapReduce one are in the given files. If you're lazy, I'll put the whole Maven project on my GitHub later today. Hope it helps you. Kévin Moulart
Re: Command line vector to sequence file
You're welcome! Here's the repository if need be: https://github.com/kmoulart/hadoop_mahout_utils Kévin Moulart

2014-03-18 10:00 GMT+01:00 Margusja mar...@roo.ee: Thank you, I am going to try it. Best regards, Margus (Margusja) Roo
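Writing an actual Hadoop SequenceFile requires the Java API (as in the gist linked above), but the core transformation such a job performs — parsing each CSV line like 1,1 into a numeric vector keyed by its record position — is simple. An illustrative Python sketch, with the resulting pairs standing in for the Text/VectorWritable records a SequenceFile.Writer would emit:

```python
def csv_to_keyed_vectors(lines):
    """Parse CSV lines such as '1,2' into (key, vector) pairs — the record
    shape a SequenceFile of VectorWritable would carry."""
    records = []
    for i, line in enumerate(lines):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        vector = [float(x) for x in line.split(",")]
        records.append((str(i), vector))
    return records

data = ["1,1", "2,1", "1,2", "2,2", "3,3", "8,8", "8,9", "9,8", "9,9"]
print(csv_to_keyed_vectors(data)[:2])  # [('0', [1.0, 1.0]), ('1', [2.0, 1.0])]
```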
Re: Mahout parallel K-Means - algorithms analysis
Canopy and KMeans run independently and do not call each other. For KMeans, the K value has to be specified when invoking KMeans. Typically you run Canopy first and then invoke KMeans with the appropriate K value as inferred from Canopy.

On Tuesday, March 18, 2014 4:33 AM, hiroshi leon hiroshi_8...@hotmail.com wrote: Thank you Wei and Suneel. By the way, does somebody know if the parallel k-means of Mahout uses Canopy clustering at the beginning to generate the initial K in the KMeansDriver class? Best regards, Hiroshi
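The "run Canopy first, then pass its cluster count to k-means" workflow Suneel describes rests on canopy clustering making a single cheap pass over the data with a loose threshold T1 and a tight threshold T2. A rough, illustrative Python sketch — the thresholds and points are invented for the example, and Mahout's implementation differs in detail:

```python
import math

def canopy_centers(points, t1, t2):
    """One-pass canopy clustering. T1 (loose) decides canopy membership,
    T2 (tight) removes points from further consideration. The number of
    canopies is a reasonable K for a subsequent k-means run."""
    assert t1 > t2, "T1 must be larger than T2"
    candidates = list(points)
    canopies = []
    while candidates:
        center = candidates.pop(0)
        members = [p for p in candidates if math.dist(center, p) < t1]
        canopies.append((center, members))
        # points tightly bound to this canopy never start their own
        candidates = [p for p in candidates if math.dist(center, p) >= t2]
    return [c for c, _ in canopies]

pts = [[1, 1], [1, 2], [2, 1], [8, 8], [9, 9], [8, 9]]
centers = canopy_centers(pts, t1=4.0, t2=2.0)
print(len(centers))  # 2 -> invoke k-means with K = 2
```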
RE: Mahout parallel K-Means - algorithms analysis
Thanks Suneel. Can someone please explain to me a little bit about the ClusteringPolicy and the ClusterClassifier, and what the benefits are when using them with parallel k-means? Thank you so much, Best regards.

Date: Tue, 18 Mar 2014 04:35:14 -0700 From: suneel_mar...@yahoo.com Subject: Re: Mahout parallel K-Means - algorithms analysis To: user@mahout.apache.org
Canopy and KMeans run independently and do not call each other. For KMeans, the K value has to be specified when invoking KMeans. Typically you run Canopy first and then invoke KMeans with the appropriate K value as inferred from Canopy.
Re: Naive Bayes classification
Hi Tharindu, If I understand correctly, seqdirectory creates labels based on the file name, but this is not what you want. What do you want the labels to be? Cheers, Frank

On Tue, Mar 18, 2014 at 2:22 PM, Tharindu Rusira tharindurus...@gmail.com wrote: Hi everyone, I'm developing an application where I need to train a naive Bayes classification model and use this model to classify new entities (in this case text files, based on their content). I observed that the seqdirectory command always adds the file/directory name as the key field for each document, which is then used as the label in classification jobs. This makes sense when I need to train a model and create the labelindex, since I have organized my training data according to their labels in separate directories. Now I'm trying to use this model to infer the best label for an unknown document. My requirement is to ask Mahout to read my new file and output the predicted category by looking at the labelindex and the TF-IDF vector of the new content. I tried creating vectors from the new content (seqdirectory and seq2sparse), and then using this vector to run the testnb command. But unfortunately seqdirectory adds file names as labels, which does not make sense in classification. The following error message further demonstrates this behavior. input0.txt is the file name of my new document.
[main] ERROR com.me.classifier.mahout.MahoutClassifier - Error while classifying documents
java.lang.IllegalArgumentException: Label not found: input0.txt
    at com.google.common.base.Preconditions.checkArgument(Preconditions.java:125)
    at org.apache.mahout.classifier.ConfusionMatrix.getCount(ConfusionMatrix.java:182)
    at org.apache.mahout.classifier.ConfusionMatrix.incrementCount(ConfusionMatrix.java:205)
    at org.apache.mahout.classifier.ConfusionMatrix.incrementCount(ConfusionMatrix.java:209)
    at org.apache.mahout.classifier.ConfusionMatrix.addInstance(ConfusionMatrix.java:173)
    at org.apache.mahout.classifier.ResultAnalyzer.addInstance(ResultAnalyzer.java:70)
    at org.apache.mahout.classifier.naivebayes.test.TestNaiveBayesDriver.analyzeResults(TestNaiveBayesDriver.java:160)
    at org.apache.mahout.classifier.naivebayes.test.TestNaiveBayesDriver.run(TestNaiveBayesDriver.java:125)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.mahout.classifier.naivebayes.test.TestNaiveBayesDriver.main(TestNaiveBayesDriver.java:66)
So how can I achieve what I'm trying to do here? Thanks, -- M.P. Tharindu Rusira Kumara Department of Computer Science and Engineering, University of Moratuwa, Sri Lanka. +94757033733 www.tharindu-rusira.blogspot.com
Introducing PredictionIO: A developer-friendly Mahout stack for production
Hi, After a year of work, I would like to present the PredictionIO project (https://github.com/PredictionIO) to this community. When a few of us were doing our PhD studies, Mahout was the de facto Java package that we used in much of our research work. It is a very powerful algorithm library, yet we saw that something needed to be done to make it more accessible to developers in production environments. Therefore, we started PredictionIO, which adds a developer-friendly REST API, a web admin UI and an integrated infrastructure on top of Mahout. The project is still at an early stage; the CF algorithm libraries of Mahout are currently supported.

*REST API and SDK* in Python, Ruby, Java, PHP, Node.js etc. Through the API layer, which supports both sync and async calls, users can:
- Record data. A sample SDK call:
  cli.identify("John")
  cli.record_action_on_item("view", "Mahout Page 1")
- Query recommendations in real time. A sample geo-based recommendation query:
  r = cli.get_itemrec_topn("myEngine", 5, {"pio_latlng": [37.9, 91.2]})

*Web Admin UI* Through the UI, users can:
- conduct algorithm evaluation with metrics such as MAP@k
- deploy / switch algorithms in production
- adjust recommendation preferences, such as Freshness, Serendipity, Unseen-only filter etc.

*Integrated Infrastructure* PredictionIO helps users link Mahout, Hadoop, the data store, the job scheduler etc. together. The whole stack can be installed and configured in minutes. It takes care of a lot of production issues, such as model re-training with new data and prediction result indexing.

We are working hard to make it extremely easy for developers to build machine learning into web and mobile apps. Hopefully, PredictionIO can get Mahout into the hands of a wider audience. Love to hear your feedback. If you are interested in the project, just remember that contributors are always welcome! Regards, Simon
Text clustering with hashing vector encoders
Hi all, Would it be possible to use hashing vector encoders for text clustering, just like when classifying? Currently we vectorize using a dictionary where we map each token to a fixed position in the dictionary. After the clustering we have to retrieve the dictionary to determine the cluster labels. This is quite a complex process where multiple outputs are read and written over the course of the clustering. I think it would be great if both algorithms could use the same encoding process, but I don't know if this is possible. The problem is that we lose the mapping between token and position when hashing, and we need this mapping to determine cluster labels. However, maybe we could make it so hashed encoders can be used, and determining top labels is left to the user. This might be a possibility, because I noticed a problem with the current cluster labeling code. This is what happens: first, documents are vectorized with TF-IDF and clustered. Then the labels are ranked, but again according to TF-IDF, instead of TF. So it is possible that a token becomes the top-ranked label even though it is rare within the cluster: the document with that token is in the cluster because of other tokens. If the labels were determined based on a TF score within the cluster, I think you would have better labels. But this requires a post-processing step on your original data and doing a TF count. Thoughts? Cheers, Frank
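Frank's proposal — ranking candidate labels by raw term frequency inside the cluster instead of by the TF-IDF weights used for clustering — could look roughly like this post-processing step (an illustrative Python sketch, not Mahout code):

```python
from collections import Counter

def top_labels_by_cluster_tf(cluster_docs, n=3):
    """Rank candidate cluster labels by raw term frequency within the
    cluster, so a rare but heavily IDF-weighted token cannot win."""
    tf = Counter()
    for doc in cluster_docs:
        tf.update(doc.split())
    return [term for term, _ in tf.most_common(n)]

docs = ["mahout clustering kmeans", "kmeans clustering demo", "rareword kmeans"]
print(top_labels_by_cluster_tf(docs))  # ['kmeans', 'clustering', 'mahout']
```

Under TF-IDF ranking, a token like "rareword" could outrank "kmeans" despite appearing once; the in-cluster TF count avoids that.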
Re: Text clustering with hashing vector encoders
Yes. Hashing vector encoders will preserve distances when used with multiple probes. Interpretation becomes somewhat difficult, but there is code available to reverse-engineer labels on hashed vectors. IDF weighting is slightly tricky, but quite doable if you keep a dictionary of, say, the most common 50-200 thousand words and assume all others have constant and equal frequency.

On Tue, Mar 18, 2014 at 2:40 PM, Frank Scholten fr...@frankscholten.nl wrote: Hi all, Would it be possible to use hashing vector encoders for text clustering just like when classifying?
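For readers unfamiliar with the technique, the hashed encoding with multiple probes that Ted mentions can be sketched as follows: each token increments several hashed positions, so a single collision cannot wipe out a feature. This is an illustrative Python sketch, not Mahout's actual encoder:

```python
import hashlib

def hashed_vector(tokens, dim=64, probes=2):
    """Hash each token into `probes` positions of a fixed-size vector.
    No dictionary is needed, and the encoding is deterministic."""
    v = [0.0] * dim
    for tok in tokens:
        for probe in range(probes):
            # derive an independent position per (token, probe) pair
            h = hashlib.md5(f"{tok}:{probe}".encode()).hexdigest()
            v[int(h, 16) % dim] += 1.0 / probes
    return v

a = hashed_vector("mahout kmeans clustering".split())
b = hashed_vector("mahout kmeans clustering".split())
assert a == b       # same text always maps to the same vector
print(sum(a))       # 3.0 -> total weight equals the token count
```

The trade-off Frank raised remains visible here: there is no reverse map from position back to token, which is why labeling needs the separate reverse-engineering step Ted refers to.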
Re: reduce is too slow in StreamingKmeans
When dealing with Streaming KMeans, it would be helpful for troubleshooting purposes if you could provide the values for k (number of clusters), km (= k log n) and n (number of data points). Try setting -Xmx to a higher heap size and run the sequential version again. I have seen OOM errors happen during the reduce phase while running the MR version; my reduce heap size was set to 2GB and I was trying to cluster about 2M data points, each of cardinality 100 (that's after running through SSVD-PCA). Speaking from my experience, either the reducer fails with OOM errors or it is stuck forever at 76% (and raises alarms with Operations because it's not making any progress). How big is your dataset, and how long did the map phase take to complete?

On Tuesday, March 18, 2014 12:54 AM, fx MA XIAOJUN xiaojun...@fujixerox.co.jp wrote: As mahout streamingkmeans has no problems in sequential mode, I would like to try sequential mode. However, java.lang.OutOfMemoryError occurs. Where do I set the JVM heap size for sequential mode? Is it the same as for MapReduce mode?

-----Original Message----- From: fx MA XIAOJUN [mailto:xiaojun...@fujixerox.co.jp] Sent: Tuesday, March 18, 2014 10:50 AM To: Suneel Marthi; user@mahout.apache.org Subject: RE: reduce is too slow in StreamingKmeans
Thank you for your extremely quick reply. "What do you mean by this? kmeans hasn't changed between 0.8 and 0.9. Did you mean Streaming KMeans here?" I want to try using -rskm in streaming k-means. But in Mahout 0.8, if I set -rskm to true, errors occur. I heard that the bug has been fixed in 0.9, so I upgraded 0.8 to 0.9. The Hadoop I installed is CDH5-MRv1, corresponding to Hadoop 0.20, not Hadoop 2.x (YARN). CDH5-MRv1 has a compatible version of Mahout (mahout-0.8+cdh5.0.0b2+28), compiled by Cloudera. So I uninstalled mahout-0.8+cdh5.0.0b2+28 and installed the Apache Mahout 0.9 distribution. It turned out that Mahout kmeans runs very well on MapReduce.
However, Mahout streamingkmeans runs properly in sequential mode but fails in MapReduce mode. If this were a general incompatibility between Hadoop and Mahout, I don't think Mahout kmeans could run properly. Is Mahout 0.9 compatible with Hadoop 0.20?

-----Original Message----- From: Suneel Marthi [mailto:suneel_mar...@yahoo.com] Sent: Monday, March 17, 2014 6:21 PM To: fx MA XIAOJUN; user@mahout.apache.org Subject: Re: reduce is too slow in StreamingKmeans
On Monday, March 17, 2014 3:43 AM, fx MA XIAOJUN xiaojun...@fujixerox.co.jp wrote: Thank you for your quick reply. As to -km, I thought it was log10 instead of ln. I was wrong... This time I set -km 14 and ran mahout streamingkmeans again (CDH 5.0 MRv1, Mahout 0.8). The maps ran faster than before, but the reduce was still stuck at 76% forever. [Suneel: This has been my experience too, both with 0.8 and 0.9.] So I uninstalled Mahout 0.8 and installed Mahout 0.9 in order to use the -rskm option. Mahout kmeans can be executed properly, so I think the installation of Mahout 0.9 was successful. [Suneel: What do you mean by this? kmeans hasn't changed between 0.8 and 0.9. Did you mean Streaming KMeans here?] However, when executing mahout streamingkmeans, I got the errors below. The Hadoop I installed is cdh5-beta1 MapReduce version 1.
Exception in thread main java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.JobContext, but class was expected
    at org.apache.mahout.common.HadoopUtil.getCustomJobName(HadoopUtil.java:174)
    at org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansDriver.runMapReduce(StreamingKMeansDriver.java:464)
    at org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansDriver.run(StreamingKMeansDriver.java:419)
    at org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansDriver.run(StreamingKMeansDriver.java:240)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
    at org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansDriver.main(StreamingKMeansDriver.java:491)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
    at org.apache.hadoop.util.ProgramDriver.run(ProgramDriver.java:144)
    at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:152)
    at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at
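As an aside on the -km discussion above (km = k · ln n, natural log rather than log10), the value is easy to compute. A one-line sketch, using the roughly 2M-point dataset size mentioned earlier and a hypothetical k of 10 for illustration:

```python
import math

def estimated_km(k, n):
    """Rule of thumb for streaming k-means: km ~ k * ln(n)."""
    return int(math.ceil(k * math.log(n)))

# e.g. 10 clusters over the ~2M data points mentioned above:
print(estimated_km(10, 2_000_000))  # 146
```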
Re: Naive Bayes classification
Tharindu, If I understand what you are trying to do: a) you have a trained Bayes model; b) you would like to classify new documents using this trained model; c) you were trying to use TestNaiveBayesDriver to classify the documents in (b).

Option 1:
---
You could write a custom MapReduce job that creates sequence files from the documents (without the label key). Feed these sequence files to seq2sparse to generate your vectors, then call TestNaiveBayes with this input. Let me know if you need code for the earlier part.

Option 2:
---
Work with your existing TF-IDF vectors generated from seqdirectory -> seq2sparse. But instead of invoking Mahout TestNaiveBayes, create a custom MapReduce job (or a plain Java program, if that's fine with you) that basically does the following:

a) Instantiate a classifier with the trained model (pseudo-code below):

NaiveBayesModel naiveBayesModel = NaiveBayesModel.materialize(new Path(outputDir.getAbsolutePath()), conf);
AbstractVectorClassifier classifier = new StandardNaiveBayesClassifier(naiveBayesModel);

// Parse through the input tf-idf vectors <Text, VectorWritable> and feed them to the classifier
for (Pair<Text, VectorWritable> vector :
    new SequenceFileDirIterable<Text, VectorWritable>(getInputPath(), PathType.LIST,
        PathFilters.logsCRCFilter(), null, true, conf)) {
  // invoke the classifier on the incoming vector
  Vector result = classifier.classifyFull(vector.getSecond().get());
  context.write(vector.getFirst(), new VectorWritable(result));
}

You can have the above code as part of a mapper in an MR job.

On Tuesday, March 18, 2014 5:49 PM, Kevin Moulart kevinmoul...@gmail.com wrote: To use naive Bayes you need a sequence file <Text, VectorWritable> with the key formatted like /label/..., for some reason; I checked with the sources to be sure, and it parses the key looking for a '/'. When you used seqdirectory, it told naive Bayes to classify the content of each file (e.g. file1.txt) with the label corresponding to its name (here, file1.txt).
So when you tried testing with input0.txt it failed, because input0.txt was not a valid label. I designed a MapReduce Java job that transforms a CSV with numeric values into a proper SequenceFile; if you want, you can take the source and update it to suit your needs: https://github.com/kmoulart/hadoop_mahout_utils Good luck. Kévin Moulart

2014-03-18 20:13 GMT+01:00 Frank Scholten fr...@frankscholten.nl: Hi Tharindu, If I understand correctly, seqdirectory creates labels based on the file name, but this is not what you want. What do you want the labels to be? Cheers, Frank
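The '/label/...' key convention Kevin describes, and the "Label not found" failure mode, can be illustrated with a small sketch of slash-based label extraction (illustrative only; Mahout's actual parsing lives in its Java sources):

```python
def extract_label(key):
    """Pull the label out of a '/label/docId'-style SequenceFile key,
    mirroring the slash-based parsing Kevin describes."""
    parts = key.strip("/").split("/")
    if len(parts) < 2:
        # a bare file name carries no '/label/' prefix
        raise ValueError(f"Label not found: {key}")
    return parts[0]

print(extract_label("/spam/doc42"))   # spam
# extract_label("input0.txt") raises ValueError, like the error Tharindu hit
```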
clusterdump samplePoints parameter
Hi all, Can someone please answer a quick question about the --samplePoints parameter in the clusterdump utility? I understand it specifies the number of points returned per cluster. But are the points per cluster ordered or ranked in any way before this truncation occurs? Thanks, Terry
Re: clusterdump samplePoints parameter
It's the maximum number of points to include from each cluster in the clusterdump. If not specified, all points will be included.

On Tuesday, March 18, 2014 11:25 PM, Terry Blankers te...@amritanet.com wrote: Hi all, Can someone please answer a quick question about the --samplePoints parameter in the clusterdump utility?
Re: Text clustering with hashing vector encoders
+1 to this. We could then use Hamming distance to compute the distances between hashed vectors. We have the code for HashedVector.java based on Moses Charikar's SimHash paper.

On Tuesday, March 18, 2014 7:14 PM, Ted Dunning ted.dunn...@gmail.com wrote: Yes. Hashing vector encoders will preserve distances when used with multiple probes. Interpretation becomes somewhat difficult, but there is code available to reverse engineer labels on hashed vectors. IDF weighting is slightly tricky, but quite doable if you keep a dictionary of, say, the most common 50-200 thousand words and assume all others have constant and equal frequency.

On Tue, Mar 18, 2014 at 2:40 PM, Frank Scholten fr...@frankscholten.nl wrote: Hi all, Would it be possible to use hashing vector encoders for text clustering, just like when classifying? Currently we vectorize using a dictionary where we map each token to a fixed position in the dictionary. After the clustering we have to retrieve the dictionary to determine the cluster labels. This is quite a complex process where multiple outputs are read and written in the entire clustering process. I think it would be great if both algorithms could use the same encoding process, but I don't know if this is possible. The problem is that we lose the mapping between token and position when hashing. We need this mapping to determine cluster labels. However, maybe we could make it so hashed encoders can be used and determining top labels is left to the user. This might be a possibility because I noticed a problem with the current cluster labeling code. This is what happens: first, documents are vectorized with TF-IDF and clustered. Then the labels are ranked, but again according to TF-IDF instead of TF. So it is possible that a token becomes the top-ranked label even though it is rare within the cluster. The document with that token is in the cluster because of other tokens.
If the labels are determined based on a TF score within the cluster, I think you would have better labels. But this requires a post-processing step on your original data and doing a TF count. Thoughts? Cheers, Frank
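To make the hashed-encoding idea above concrete, here is a minimal, Mahout-independent sketch of feature hashing with multiple probes (this is not Mahout's actual encoder; the class name, dimensionality, and probe count are illustrative). Each token adds its weight at several hashed positions, which is what lets distances survive occasional hash collisions:

```java
import java.util.Arrays;

// Sketch of feature hashing with multiple probes: each token contributes
// weight at PROBES hashed positions, spreading collision damage so that
// distances between documents are approximately preserved.
public class HashedEncoderSketch {
    static final int DIM = 1 << 10;   // vector dimensionality (illustrative)
    static final int PROBES = 2;      // probe positions per token

    static double[] encode(String[] tokens) {
        double[] v = new double[DIM];
        for (String t : tokens) {
            for (int p = 0; p < PROBES; p++) {
                // derive an independent hash per probe by salting with p
                int h = (t + "#" + p).hashCode();
                int idx = Math.floorMod(h, DIM);
                v[idx] += 1.0 / PROBES; // split the token's weight across probes
            }
        }
        return v;
    }

    public static void main(String[] args) {
        double[] a = encode(new String[]{"mahout", "clustering"});
        double[] b = encode(new String[]{"mahout", "clustering"});
        System.out.println(Arrays.equals(a, b)); // same text -> same vector
    }
}
```

Note the trade-off Frank describes: the token-to-position mapping is not invertible, so recovering cluster labels requires either multiple probes plus a reverse-engineering pass or an external dictionary.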
Re: Text clustering with hashing vector encoders
How does "with multiple probes" affect distance preservation, and how would IDF weighting get tricky just by hashing strings? Would we be computing distance between hashed strings, or distance between vectors based on counts of hashed strings?

On Tue, Mar 18, 2014 at 8:50 PM, Suneel Marthi suneel_mar...@yahoo.com wrote: +1 to this. We could then use Hamming distance to compute the distances between hashed vectors. We have the code for HashedVector.java based on Moses Charikar's SimHash paper.

On Tuesday, March 18, 2014 7:14 PM, Ted Dunning ted.dunn...@gmail.com wrote: Yes. Hashing vector encoders will preserve distances when used with multiple probes. Interpretation becomes somewhat difficult, but there is code available to reverse engineer labels on hashed vectors. IDF weighting is slightly tricky, but quite doable if you keep a dictionary of, say, the most common 50-200 thousand words and assume all others have constant and equal frequency.

On Tue, Mar 18, 2014 at 2:40 PM, Frank Scholten fr...@frankscholten.nl wrote: Hi all, Would it be possible to use hashing vector encoders for text clustering, just like when classifying? Currently we vectorize using a dictionary where we map each token to a fixed position in the dictionary. After the clustering we have to retrieve the dictionary to determine the cluster labels. This is quite a complex process where multiple outputs are read and written in the entire clustering process. I think it would be great if both algorithms could use the same encoding process, but I don't know if this is possible. The problem is that we lose the mapping between token and position when hashing. We need this mapping to determine cluster labels. However, maybe we could make it so hashed encoders can be used and determining top labels is left to the user. This might be a possibility because I noticed a problem with the current cluster labeling code. This is what happens: first, documents are vectorized with TF-IDF and clustered. Then the labels are ranked, but again according to TF-IDF instead of TF. So it is possible that a token becomes the top-ranked label even though it is rare within the cluster. The document with that token is in the cluster because of other tokens. If the labels are determined based on a TF score within the cluster, I think you would have better labels. But this requires a post-processing step on your original data and doing a TF count. Thoughts? Cheers, Frank
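The Hamming-distance idea Suneel raises can be illustrated without Mahout. This is a toy SimHash sketch (not Mahout's HashedVector; the mixing constant and 64-bit signature size are illustrative choices): each token votes on every bit, and the Hamming distance between two signatures approximates how different the token sets are.

```java
// Toy SimHash sketch: similar token sets produce 64-bit signatures with
// small Hamming distance. Each token's hash votes +1/-1 on every bit
// position; the sign of the tally decides the signature bit.
public class SimHashSketch {
    static long simhash(String[] tokens) {
        int[] counts = new int[64];
        for (String t : tokens) {
            long h = t.hashCode() * 0x9E3779B97F4A7C15L; // spread the 32-bit hash
            for (int i = 0; i < 64; i++) {
                counts[i] += ((h >>> i) & 1L) == 1L ? 1 : -1;
            }
        }
        long sig = 0L;
        for (int i = 0; i < 64; i++) {
            if (counts[i] > 0) sig |= 1L << i;
        }
        return sig;
    }

    static int hamming(long a, long b) {
        return Long.bitCount(a ^ b); // number of differing signature bits
    }

    public static void main(String[] args) {
        long a = simhash(new String[]{"mahout", "kmeans", "cluster"});
        long b = simhash(new String[]{"mahout", "kmeans", "clusters"});
        System.out.println(hamming(a, b)); // small for largely-overlapping token sets
    }
}
```

This answers Frank's second question in spirit: SimHash computes distance between compact signatures of documents, whereas Mahout's hashed encoders produce full count-based vectors; the two are related but not the same representation.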
Re: Naive Bayes classification
Hi, first of all I'm sorry that my previous mail was vague and poorly formulated. Yes, Suneel got exactly what I was asking. Both options will address my requirement. Thanks a lot. -Tharindu

On Mar 19, 2014 8:51 AM, Suneel Marthi suneel_mar...@yahoo.com wrote: Tharindu, If I understand what you are trying to do: a) You have a trained Bayes model. b) You would like to classify new documents using this trained model. c) You were trying to use TestNaiveBayesDriver to classify the documents in (b).

Option 1:
---
You could write a custom MapReduce job that creates sequence files from the documents (without the label key). You could feed these sequence files to seq2sparse to generate your vectors, then call TestNaiveBayes with this input. Let me know if you need code for the earlier part.

Option 2:
---
Work with your existing tf-idf vectors generated from seqdirectory -> seq2sparse. But instead of invoking Mahout's TestNaiveBayes, create a custom MapReduce job (or a plain Java program if that's fine with you) that basically does the following: a) Instantiate a classifier with the trained model (pseudo-code below):

NaiveBayesModel naiveBayesModel = NaiveBayesModel.materialize(new Path(outputDir.getAbsolutePath()), conf);
AbstractVectorClassifier classifier = new StandardNaiveBayesClassifier(naiveBayesModel);
// Parse through the input tf-idf vectors <Text, VectorWritable> and feed them to the classifier
for (Pair<Text, VectorWritable> vector : new SequenceFileDirIterable<Text, VectorWritable>(getInputPath(), PathType.LIST, PathFilters.logsCRCFilter(), null, true, conf)) {
    // invoke the classifier on the incoming vector
    Vector result = classifier.classifyFull(vector.getSecond().get());
    context.write(vector.getFirst(), new VectorWritable(result));
}

You can have the above code as part of a mapper in an MR job.
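One step Suneel's pseudo-code leaves implicit: classifyFull returns one score per label index, so turning the result vector into a predicted label still requires the index-to-label mapping from Mahout's labelindex. A minimal, Mahout-free sketch of that last step (the map is hard-coded here for illustration; in practice it would be read from the labelindex file):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of turning a per-label score vector (like the one classifyFull
// returns) into a predicted label via an index->label map.
public class BestLabelSketch {
    static String bestLabel(double[] scores, Map<Integer, String> labelIndex) {
        int best = 0;
        for (int i = 1; i < scores.length; i++) {
            if (scores[i] > scores[best]) best = i; // argmax over label scores
        }
        return labelIndex.get(best);
    }

    public static void main(String[] args) {
        Map<Integer, String> labels = new HashMap<>();
        labels.put(0, "sports");
        labels.put(1, "politics");
        // Naive Bayes scores are log-likelihoods, hence negative values
        double[] scores = {-120.5, -98.2};
        System.out.println(bestLabel(scores, labels)); // prints "politics"
    }
}
```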
On Tuesday, March 18, 2014 5:49 PM, Kevin Moulart kevinmoul...@gmail.com wrote: To use Naive Bayes you need a SequenceFile<Text, VectorWritable> with the key formatted like this: label/label, for some reason. I checked with the sources to be sure, and it parses the key looking for a '/'. When you used seqdirectory, it told Naive Bayes to classify the content of each file (e.g. file1.txt) with the label corresponding to its name (here, file1.txt). So when you tried testing with input0.txt, it failed because input0.txt was not a valid label. I designed a MapReduce Java job that transforms a CSV with numeric values into a proper SequenceFile; if you want, you can take the source and update it to suit your needs: https://github.com/kmoulart/hadoop_mahout_utils Good luck. Kévin Moulart

2014-03-18 20:13 GMT+01:00 Frank Scholten fr...@frankscholten.nl: Hi Tharindu, If I understand correctly, seqdirectory creates labels based on the file name, but this is not what you want. What do you want the labels to be? Cheers, Frank

On Tue, Mar 18, 2014 at 2:22 PM, Tharindu Rusira tharindurus...@gmail.com wrote: Hi everyone, I'm developing an application where I need to train a Naive Bayes classification model and use this model to classify new entities (in this case, text files based on their content). I observed that the seqdirectory command always adds the file/directory name as the key field for each document, which will be used as the label in classification jobs. This makes sense when I need to train a model and create the labelindex, since I have organized my training data according to their labels in separate directories. Now I'm trying to use this model and infer the best label for an unknown document. My requirement is to ask Mahout to read my new file and output the predicted category by looking at the labelindex and the tfidf vector of the new content. I tried creating vectors from the new content (seqdirectory and seq2sparse), and then using this vector to run the testnb command.
But unfortunately the seqdirectory command adds file names as labels, which does not make sense in classification. The following error message further demonstrates this behavior. input0.txt is the file name of my new document.

[main] ERROR com.me.classifier.mahout.MahoutClassifier - Error while classifying documents
java.lang.IllegalArgumentException: Label not found: input0.txt
    at com.google.common.base.Preconditions.checkArgument(Preconditions.java:125)
    at org.apache.mahout.classifier.ConfusionMatrix.getCount(ConfusionMatrix.java:182)
    at org.apache.mahout.classifier.ConfusionMatrix.incrementCount(ConfusionMatrix.java:205)
    at org.apache.mahout.classifier.ConfusionMatrix.incrementCount(ConfusionMatrix.java:209)
    at org.apache.mahout.classifier.ConfusionMatrix.addInstance(ConfusionMatrix.java:173)
    at
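Kevin's point about the key format can be sketched without Hadoop: when building your own sequence files for training or testing, the key for each document should carry the class label (which Mahout extracts by splitting on '/'), not the file name. A small illustrative helper, assuming training files live under one directory per label (the "train/politics/..." layout is an assumption, not from the thread):

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Sketch of the label-key convention: derive the SequenceFile key from the
// parent directory (the class label) rather than the file name, so that
// Mahout's '/'-based parsing recovers the intended label.
public class LabelKeySketch {
    static String keyFor(Path doc) {
        String label = doc.getParent().getFileName().toString();
        String docId = doc.getFileName().toString();
        return "/" + label + "/" + docId; // label sits between the slashes
    }

    public static void main(String[] args) {
        System.out.println(keyFor(Paths.get("train/politics/input0.txt")));
        // -> /politics/input0.txt : the label is "politics", not "input0.txt"
    }
}
```

With keys built this way, a new document's key never degenerates into a bare file name like input0.txt, which is exactly what triggered the "Label not found" error above.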
Multiple errors and messages
Hello, when I run the following command on Mahout 0.9 and Hadoop 1.2.1, I get multiple errors and I cannot figure out what the problem is. Sorry for the long post.

[hadoop@solaris ~]$ mahout wikipediaDataSetCreator -i wikipedia/chunks -o wikipediainput -c ~/categories.txt
Running on hadoop, using /export/home/hadoop/hadoop-1.2.1/bin/hadoop and HADOOP_CONF_DIR=
MAHOUT-JOB: /export/home/hadoop/mahout-distribution-0.9/mahout-examples-0.9-job.jar
14/03/18 20:28:28 WARN driver.MahoutDriver: No wikipediaDataSetCreator.props found on classpath, will use command-line arguments only
14/03/18 20:28:29 INFO wikipedia.WikipediaDatasetCreatorDriver: Input: wikipedia/chunks Out: wikipediainput Categories: /export/home/hadoop/categories.txt
14/03/18 20:28:30 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
14/03/18 20:28:32 INFO input.FileInputFormat: Total input paths to process : 699
14/03/18 20:28:32 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
14/03/18 20:28:32 WARN snappy.LoadSnappy: Snappy native library not loaded
14/03/18 20:28:37 INFO mapred.JobClient: Running job: job_201403181916_0001
14/03/18 20:28:38 INFO mapred.JobClient: map 0% reduce 0%
14/03/18 20:41:44 INFO mapred.JobClient: map 1% reduce 0%
14/03/18 20:52:57 INFO mapred.JobClient: map 2% reduce 0%
14/03/18 21:04:02 INFO mapred.JobClient: map 3% reduce 0%
14/03/18 21:15:13 INFO mapred.JobClient: map 4% reduce 0%
14/03/18 21:26:30 INFO mapred.JobClient: map 5% reduce 0%
14/03/18 21:29:07 INFO mapred.JobClient: map 5% reduce 1%
14/03/18 21:34:45 INFO mapred.JobClient: Task Id : attempt_201403181916_0001_m_40_0, Status : FAILED
14/03/18 21:34:46 WARN mapred.JobClient: Error reading task output http://solaris:50060/tasklog?plaintext=true&attemptid=attempt_201403181916_0001_m_40_0&filter=stdout
14/03/18 21:34:46 WARN mapred.JobClient: Error reading task output http://solaris:50060/tasklog?plaintext=true&attemptid=attempt_201403181916_0001_m_40_0&filter=stderr
14/03/18 21:38:29 INFO mapred.JobClient: map 6% reduce 1%
14/03/18 21:41:48 INFO mapred.JobClient: map 6% reduce 2%
14/03/18 21:50:05 INFO mapred.JobClient: map 7% reduce 2%
14/03/18 22:00:59 INFO mapred.JobClient: map 8% reduce 2%
14/03/18 22:12:38 INFO mapred.JobClient: map 9% reduce 2%
14/03/18 22:14:53 INFO mapred.JobClient: map 9% reduce 3%
14/03/18 22:24:30 INFO mapred.JobClient: map 10% reduce 3%
14/03/18 22:35:49 INFO mapred.JobClient: map 11% reduce 3%
14/03/18 22:47:41 INFO mapred.JobClient: map 12% reduce 3%
14/03/18 22:48:18 INFO mapred.JobClient: map 12% reduce 4%
14/03/18 22:59:26 INFO mapred.JobClient: map 13% reduce 4%
14/03/18 23:10:39 INFO mapred.JobClient: map 14% reduce 4%
14/03/18 23:21:32 INFO mapred.JobClient: map 15% reduce 4%
14/03/18 23:24:54 INFO mapred.JobClient: map 15% reduce 5%
14/03/18 23:32:48 INFO mapred.JobClient: map 16% reduce 5%
14/03/18 23:43:53 INFO mapred.JobClient: map 17% reduce 5%
14/03/18 23:54:57 INFO mapred.JobClient: map 18% reduce 5%
14/03/18 23:58:59 INFO mapred.JobClient: map 18% reduce 6%
14/03/19 00:05:59 INFO mapred.JobClient: map 19% reduce 6%
14/03/19 00:16:43 INFO mapred.JobClient: map 20% reduce 6%
14/03/19 00:17:30 INFO mapred.JobClient: Task Id : attempt_201403181916_0001_m_000137_0, Status : FAILED
Map output lost, rescheduling: getMapOutput(attempt_201403181916_0001_m_000137_0,0) failed : java.io.IOException: Error Reading IndexFile
    at org.apache.hadoop.mapred.IndexCache.readIndexFileToCache(IndexCache.java:113)
    at org.apache.hadoop.mapred.IndexCache.getIndexInformation(IndexCache.java:66)
    at org.apache.hadoop.mapred.TaskTracker$MapOutputServlet.doGet(TaskTracker.java:4070)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
    at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511)
    at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1221)
    at org.apache.hadoop.http.HttpServer$QuotingInputFilter.doFilter(HttpServer.java:914)
    at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
    at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
    at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
    at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
    at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
    at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
    at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
    at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
    at org.mortbay.jetty.Server.handle(Server.java:326)
    at