Hey All, it's me again with probably another basic question, but I am having a hard time getting going. I have installed and configured both Mahout and Hadoop, and ran all the examples in the quickstart. Now I want to start writing my own code to cluster data and am wondering where to start.
I should admit that I am new to Hadoop too; I went through their WordCount quickstart app. I am taken aback by the vastness of Mahout and need some assistance getting started.

My data stream format (the data can come from the network or from disk):

String1 label1:rating/relevance label2:r label3:r
String2 label1:rating/relevance label2:r label3:r label4:r
String3 label1:rating/relevance label2:r

From my reading I have learned that I need to convert this text file to tf-idf vectors using one of the vectorizer classes.

I thought starting with the clustering example would be a good place. I imported the entire Mahout distribution as a Maven project in Eclipse and executed Job.java under clustering.syntheticcontrol.kmeans, but I got the exception below and I am not sure why. I have set JAVA_HOME, HADOOP_HOME, and HADOOP_CONF_DIR, but the app is still searching for the data in the current folder.

Feb 2, 2011 1:40:27 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: Running with default arguments
Feb 2, 2011 1:41:12 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: Preparing Input
Feb 2, 2011 1:42:27 PM org.apache.hadoop.metrics.jvm.JvmMetrics init
INFO: Initializing JVM Metrics with processName=JobTracker, sessionId=
Feb 2, 2011 1:45:35 PM org.apache.hadoop.mapred.JobClient configureCommandLineOptions
WARNING: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
Feb 2, 2011 1:45:35 PM org.apache.hadoop.mapred.JobClient configureCommandLineOptions
WARNING: No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
Exception in thread "main" org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: file:/Users/sjagannath/mahout-distribution-0.4/examples/testdata
        at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:224)
        at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:241)
        at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:885)
        at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:779)
        at org.apache.hadoop.mapreduce.Job.submit(Job.java:432)
        at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:447)
        at org.apache.mahout.clustering.conversion.InputDriver.runJob(InputDriver.java:108)
        at org.apache.mahout.clustering.syntheticcontrol.kmeans.Job.run(Job.java:133)
        at org.apache.mahout.clustering.syntheticcontrol.kmeans.Job.main(Job.java:58)

Given this, I need to know what is happening here and where I should start vectorizing my data.
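In case it helps diagnose it, here is the workaround I was about to try. My guess (and it is only a guess) is that Job.main, when run with no arguments, uses "testdata" as the input directory and resolves it against the working directory of my Eclipse launch, which is why it complains about file:/Users/sjagannath/mahout-distribution-0.4/examples/testdata. So I was going to stage the input there first before launching; the file name is the synthetic_control.data file I downloaded for the quickstart, and the paths are just from my machine:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class StageTestData {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Local filesystem unless core-site.xml points somewhere else,
    // which would also explain why the job is looking at file:/...
    FileSystem fs = FileSystem.get(conf);
    // My assumption: the example resolves the relative path "testdata"
    // against the current working directory of the launch.
    Path testdata = new Path("testdata");
    if (!fs.exists(testdata)) {
      fs.mkdirs(testdata);
    }
    // Copy the downloaded sample data into the directory the job expects.
    fs.copyFromLocalFile(new Path("synthetic_control.data"), testdata);
  }
}

Is that the right mental model, or is there a proper way to point the example at its input?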
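For the vectorization step, here is roughly what I was planning to try on my own data. Since every label already carries a rating/relevance weight, my impression is that each record is already a weighted sparse vector, so maybe I can skip tf-idf and build the vectors directly; please correct me if that is wrong. The class names are my best reading of the 0.4 javadocs, and the file names (ratings.txt, ratings-vectors) are just placeholders of mine:

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.math.NamedVector;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

public class VectorizeRatings {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Dictionary: each distinct label gets a fixed dimension index.
    Map<String, Integer> dict = new HashMap<String, Integer>();
    int maxDims = 10000; // my guess at an upper bound on distinct labels

    // k-means wants a SequenceFile of <key, VectorWritable> as input.
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, new Path("ratings-vectors/part-00000"),
        Text.class, VectorWritable.class);

    BufferedReader in = new BufferedReader(new FileReader("ratings.txt"));
    String line;
    while ((line = in.readLine()) != null) {
      line = line.trim();
      if (line.isEmpty()) {
        continue;
      }
      // Line format: "String1 label1:0.8 label2:0.5 ..." (fields[0] is the item name).
      String[] fields = line.split("\\s+");
      Vector vec = new RandomAccessSparseVector(maxDims);
      for (int i = 1; i < fields.length; i++) {
        String[] pair = fields[i].split(":");
        Integer idx = dict.get(pair[0]);
        if (idx == null) {
          idx = dict.size();
          dict.put(pair[0], idx);
        }
        // Assuming the rating/relevance is numeric and usable as the weight.
        vec.set(idx, Double.parseDouble(pair[1]));
      }
      // NamedVector keeps the item name attached through clustering.
      NamedVector named = new NamedVector(vec, fields[0]);
      writer.append(new Text(fields[0]), new VectorWritable(named));
    }
    in.close();
    writer.close();
  }
}

If the ratings really do need tf-idf reweighting for clustering to work well, a pointer to the right vectorizer class for that would be great.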
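And then, if I understand the synthetic control Job correctly, I would seed some random initial clusters and point KMeansDriver at the vectors above. This is pieced together from reading Job.run in the example, so the exact signatures may well be off; k=10, the iteration count, and the convergence delta are arbitrary numbers I picked:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.mahout.clustering.kmeans.KMeansDriver;
import org.apache.mahout.clustering.kmeans.RandomSeedGenerator;
import org.apache.mahout.common.distance.EuclideanDistanceMeasure;

public class ClusterRatings {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path vectors = new Path("ratings-vectors"); // output of VectorizeRatings above
    Path seeds = new Path("ratings-seeds");     // k random initial centroids
    Path output = new Path("ratings-kmeans");

    // Seed k=10 random initial clusters, as I believe the synthetic
    // control Job does; the buildRandom signature is my guess from the source.
    RandomSeedGenerator.buildRandom(vectors, seeds, 10,
        new EuclideanDistanceMeasure());

    // 0.001 convergence delta, at most 10 iterations, run the final
    // clustering pass (true), MapReduce rather than sequential (false).
    KMeansDriver.run(conf, vectors, seeds, output,
        new EuclideanDistanceMeasure(), 0.001, 10, true, false);
  }
}

Does that look like a sane overall plan, or is there a more standard pipeline I should be following?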
--
Thanks,
Sharath Jagannath