Hey All, it's me again with probably another basic question, but I am having a hard time getting going. I have installed and configured both Mahout and Hadoop, and ran all the examples in the quickstart. Now I want to start writing my own code to cluster data and am wondering where to start.
I should admit that I am new to Hadoop too; I went through their WordCount quickstart app. I am taken aback by the vastness of Mahout and need some assistance getting started.

My data stream format (the data can come from the network or from disk):

String1 label1:rating/relevance label2:r label3:r
String2 label1:rating/relevance label2:r label3:r label4:r
String3 label1:rating/relevance label2:r

From my reading I have learned that I need to convert this text file to tf-idf vectors using one of the vectorizer classes.

I thought starting with the clustering example would be a good place. I imported the entire Mahout distribution as a Maven project in Eclipse and executed Job.java under clustering.syntheticcontrol.kmeans, but I got the exception below and I am not sure why. I have set JAVA_HOME, HADOOP_HOME, and HADOOP_CONF_DIR, but the app is still searching for the data in the current folder.

Feb 2, 2011 1:40:27 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: Running with default arguments
Feb 2, 2011 1:41:12 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: Preparing Input
Feb 2, 2011 1:42:27 PM org.apache.hadoop.metrics.jvm.JvmMetrics init
INFO: Initializing JVM Metrics with processName=JobTracker, sessionId=
Feb 2, 2011 1:45:35 PM org.apache.hadoop.mapred.JobClient configureCommandLineOptions
WARNING: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
Feb 2, 2011 1:45:35 PM org.apache.hadoop.mapred.JobClient configureCommandLineOptions
WARNING: No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
Exception in thread "main" org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: file:/Users/sjagannath/mahout-distribution-0.4/examples/testdata
        at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:224)
        at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:241)
        at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:885)
        at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:779)
        at org.apache.hadoop.mapreduce.Job.submit(Job.java:432)
        at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:447)
        at org.apache.mahout.clustering.conversion.InputDriver.runJob(InputDriver.java:108)
        at org.apache.mahout.clustering.syntheticcontrol.kmeans.Job.run(Job.java:133)
        at org.apache.mahout.clustering.syntheticcontrol.kmeans.Job.main(Job.java:58)

Given this, I need to know what is happening here and where I should start vectorizing my data.
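In case it helps diagnose it, here is the workaround I was about to try. My guess (and it is only a guess) is that Job.main, when run with no arguments, uses "testdata" as the input directory and resolves it against the working directory of my Eclipse launch, which is why it complains about file:/Users/sjagannath/mahout-distribution-0.4/examples/testdata. So I was going to stage the input there first before launching; the file name is the synthetic_control.data file I downloaded for the quickstart, and the paths are just from my machine:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class StageTestData {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Local filesystem unless core-site.xml points somewhere else,
    // which would also explain why the job is looking at file:/...
    FileSystem fs = FileSystem.get(conf);
    // My assumption: the example resolves the relative path "testdata"
    // against the current working directory of the launch.
    Path testdata = new Path("testdata");
    if (!fs.exists(testdata)) {
      fs.mkdirs(testdata);
    }
    // Copy the downloaded sample data into the directory the job expects.
    fs.copyFromLocalFile(new Path("synthetic_control.data"), testdata);
  }
}

Is that the right mental model, or is there a proper way to point the example at its input?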
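For the vectorization step, here is roughly what I was planning to try on my own data. Since every label already carries a rating/relevance weight, my impression is that each record is already a weighted sparse vector, so maybe I can skip tf-idf and build the vectors directly; please correct me if that is wrong. The class names are my best reading of the 0.4 javadocs, and the file names (ratings.txt, ratings-vectors) are just placeholders of mine:

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.math.NamedVector;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

public class VectorizeRatings {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Dictionary: each distinct label gets a fixed dimension index.
    Map<String, Integer> dict = new HashMap<String, Integer>();
    int maxDims = 10000; // my guess at an upper bound on distinct labels

    // k-means wants a SequenceFile of <key, VectorWritable> as input.
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, new Path("ratings-vectors/part-00000"),
        Text.class, VectorWritable.class);

    BufferedReader in = new BufferedReader(new FileReader("ratings.txt"));
    String line;
    while ((line = in.readLine()) != null) {
      line = line.trim();
      if (line.isEmpty()) {
        continue;
      }
      // Line format: "String1 label1:0.8 label2:0.5 ..." (fields[0] is the item name).
      String[] fields = line.split("\\s+");
      Vector vec = new RandomAccessSparseVector(maxDims);
      for (int i = 1; i < fields.length; i++) {
        String[] pair = fields[i].split(":");
        Integer idx = dict.get(pair[0]);
        if (idx == null) {
          idx = dict.size();
          dict.put(pair[0], idx);
        }
        // Assuming the rating/relevance is numeric and usable as the weight.
        vec.set(idx, Double.parseDouble(pair[1]));
      }
      // NamedVector keeps the item name attached through clustering.
      NamedVector named = new NamedVector(vec, fields[0]);
      writer.append(new Text(fields[0]), new VectorWritable(named));
    }
    in.close();
    writer.close();
  }
}

If the ratings really do need tf-idf reweighting for clustering to work well, a pointer to the right vectorizer class for that would be great.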
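And then, if I understand the synthetic control Job correctly, I would seed some random initial clusters and point KMeansDriver at the vectors above. This is pieced together from reading Job.run in the example, so the exact signatures may well be off; k=10, the iteration count, and the convergence delta are arbitrary numbers I picked:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.mahout.clustering.kmeans.KMeansDriver;
import org.apache.mahout.clustering.kmeans.RandomSeedGenerator;
import org.apache.mahout.common.distance.EuclideanDistanceMeasure;

public class ClusterRatings {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path vectors = new Path("ratings-vectors"); // output of VectorizeRatings above
    Path seeds = new Path("ratings-seeds");     // k random initial centroids
    Path output = new Path("ratings-kmeans");

    // Seed k=10 random initial clusters, as I believe the synthetic
    // control Job does; the buildRandom signature is my guess from the source.
    RandomSeedGenerator.buildRandom(vectors, seeds, 10,
        new EuclideanDistanceMeasure());

    // 0.001 convergence delta, at most 10 iterations, run the final
    // clustering pass (true), MapReduce rather than sequential (false).
    KMeansDriver.run(conf, vectors, seeds, output,
        new EuclideanDistanceMeasure(), 0.001, 10, true, false);
  }
}

Does that look like a sane overall plan, or is there a more standard pipeline I should be following?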
--
Thanks,
Sharath Jagannath