Hi there, 

I am doing a fairly silly experiment to measure Hadoop performance. As part of 
this I have extracted emails from the Enron dataset and I am clustering them 
using a proprietary method for clustering short messages (i.e. tweets, emails, 
SMS messages) and benchmarking the clusters in various configurations. 

As part of this I have been benchmarking a single processing machine (my new 
laptop). This is an HP EliteBook with 32 GB RAM, SSDs, nice processors, etc. 
The point is that when explaining to people why we need Hadoop I can show them 
that a laptop is really, really useless and likely to remain so (I know this is 
obvious; come and work in a corporate and find out what else you have to do to 
earn a living! Then tell me that I am silly!) 

Anyhooo... I have seen reasonable behaviour from the algorithms I have built 
(i.e. for very small data MapReduce puts an overhead on the processing, but 
once the data gets reasonably large the parallelism wins), but when I try 
Mahout's k-means I get an odd behaviour. 

When I get to ~175k individual files / ~175 MB of input data I get an exception:

Exception in thread "main" java.lang.IllegalStateException: Job failed!
        at 
org.apache.mahout.vectorizer.DictionaryVectorizer.makePartialVectors(DictionaryVectorizer.java:329)
        at 
org.apache.mahout.vectorizer.DictionaryVectorizer.createTermFrequencyVectors(DictionaryVectorizer.java:199)
        at 
org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles.run(SparseVectorsFromSequenceFiles.java:271)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
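
For context, the stack trace points at the vectorisation step, i.e. the 
standard Mahout seq2sparse pipeline (SparseVectorsFromSequenceFiles is what 
the seq2sparse driver runs). A typical invocation looks something like the 
sketch below; the paths and flag values are illustrative, not my exact run:

```shell
# Pack the directory of ~175k small mail files into SequenceFiles
# (enron/mail and enron/seqfiles are placeholder paths)
mahout seqdirectory -i enron/mail -o enron/seqfiles -c UTF-8

# Vectorise the SequenceFiles; this is the stage that dies in
# DictionaryVectorizer.makePartialVectors
# -nv = emit named vectors, -ow = overwrite existing output
mahout seq2sparse -i enron/seqfiles -o enron/vectors -nv -ow
```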

Is this because I am entirely inept and have missed something, or is it because 
of a limitation of Mahout sequence files, since they were not aimed at loads of 
short messages that really can't be clustered anyway, due to them having no 
information in them? 

Simon



----
Dr. Simon Thompson
