Hi, I'm doing an MSc at Northeastern and I'm working on analyzing some US election polls with k-means. I'm a beginner with both Mahout and Hadoop. I've been reading the docs, but I'd still appreciate some orientation on these questions:
* I can transform my input data into vectors and run k-means from the command line [1]. I downloaded Hadoop (1.0.4, running on a real cluster) and wrote a program for it. Then I downloaded Mahout and saw that it ships with its own Hadoop jar (0.20, single node: M2_REPO/org/apache/hadoop/hadoop-core/0.20.204.0/hadoop-core-0.20.204.0.jar). If I point HADOOP_HOME at my Hadoop installation, will Mahout use it? So far I have only set HADOOP_HOME in hadoop/conf/hadoop-env.sh.

* I might need to remove some columns from my data set. With Hadoop I could write a program that tokenizes the input, builds the data structures I need, and then calls KMeansDriver. Alternatively, I could strip the columns with bash and run Mahout from the command line. Should I write a program instead?

* How do I write a program for Mahout 0.7 (and Hadoop 1.x) from scratch? I need to transform the dataset so that the vectors contain only the features I want k-means to consider when clustering my data, and then call KMeansDriver. I think I can do both following the explanation in http://www.odbms.org/download/TamingTextCH06.pdf (a rough sketch of what I have in mind is at the end of this mail). Should the main class extend anything in particular? And how do I deploy it on a cluster with Hadoop?

* It is my understanding that Mahout is a framework. I read the code example in org.apache.mahout.clustering.syntheticcontrol.kmeans; it extends AbstractJob. I created a new project in Eclipse, copied the example, and tried to run it, both with "java -jar myjar.jar" and by passing my jar to hadoop. What is the correct way of running a Mahout program?

Thanks

[1] https://cwiki.apache.org/MAHOUT/k-means-commandline.html
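P.S. This is the rough skeleton I mention in the third bullet, based on my reading of the Taming Text chapter. The class name, output path, and column indices are just placeholders, and I left the actual clustering call as a comment because I'm not sure of the exact KMeansDriver.run signature in 0.7; corrections welcome:

import java.io.BufferedReader;
import java.io.FileReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.util.ToolRunner;
import org.apache.mahout.common.AbstractJob;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.NamedVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

public class PollVectorizerJob extends AbstractJob {

  // Indices of the CSV columns k-means should actually see (placeholders).
  private static final int[] FEATURE_COLUMNS = {2, 3, 5};

  @Override
  public int run(String[] args) throws Exception {
    Configuration conf = getConf();
    FileSystem fs = FileSystem.get(conf);
    Path vectorsPath = new Path("polls/vectors/part-00000");

    // Write one VectorWritable per poll row into a SequenceFile on HDFS,
    // keeping only the selected feature columns.
    SequenceFile.Writer writer =
        SequenceFile.createWriter(fs, conf, vectorsPath, Text.class, VectorWritable.class);
    BufferedReader reader = new BufferedReader(new FileReader(args[0]));
    try {
      String line;
      int row = 0;
      while ((line = reader.readLine()) != null) {
        String[] cols = line.split(",");
        double[] values = new double[FEATURE_COLUMNS.length];
        for (int i = 0; i < FEATURE_COLUMNS.length; i++) {
          values[i] = Double.parseDouble(cols[FEATURE_COLUMNS[i]]);
        }
        Vector vec = new NamedVector(new DenseVector(values), "poll-" + row);
        writer.append(new Text("poll-" + row), new VectorWritable(vec));
        row++;
      }
    } finally {
      reader.close();
      writer.close();
    }

    // Next I would seed initial clusters (RandomSeedGenerator) and call
    // KMeansDriver.run(...) on vectorsPath, but I still have to check the
    // exact 0.7 signatures for those calls.
    return 0;
  }

  public static void main(String[] args) throws Exception {
    ToolRunner.run(new Configuration(), new PollVectorizerJob(), args);
  }
}

My plan was to package this with the Mahout jars available (e.g. a job jar with the dependencies in lib/) and submit it with something like "hadoop jar myjar.jar PollVectorizerJob /local/path/polls.csv". Is that the right approach, or should I be running it through the mahout driver script instead?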
