Hi there,
I am doing a fairly silly experiment to measure Hadoop performance. As part of
this I have extracted emails from the Enron corpus and I am clustering them
using a proprietary method for clustering short messages (i.e. tweets, emails,
SMS messages) and benchmarking the clusters in various configurations.
As part of this I have been benchmarking a single processing machine (my new
laptop). This is an HP EliteBook with 32 GB RAM, SSDs, nice processors, etc.
The point is that when explaining to people why we need Hadoop, I can show them
that a laptop is really, really useless and likely to remain so (I know this is
obvious; come and work in a corporate and find out what else you have to do to
earn a living! Then tell me that I am silly!)
Anyhooo... I have seen reasonable behaviour from the algorithms I have built
(i.e. for very small data MapReduce adds an overhead to the processing, but
once the data gets reasonably large the parallelism wins), but when I try
Mahout's k-means I get an odd behaviour.
When I get to ~175k individual files / 175 MB of input data I get an exception:
Exception in thread "main" java.lang.IllegalStateException: Job failed!
    at org.apache.mahout.vectorizer.DictionaryVectorizer.makePartialVectors(DictionaryVectorizer.java:329)
    at org.apache.mahout.vectorizer.DictionaryVectorizer.createTermFrequencyVectors(DictionaryVectorizer.java:199)
    at org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles.run(SparseVectorsFromSequenceFiles.java:271)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
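For reference, the pipeline I'm running is roughly the standard one below (a sketch only: the paths, heap size, and k-means parameters are illustrative placeholders, not the exact values from my run):

```shell
# Convert the directory of raw email text files into a Hadoop SequenceFile
# (input/output paths here are placeholders).
mahout seqdirectory -i enron/mail -o enron/seqfiles -c UTF-8

# Vectorize: this is the step that fails inside DictionaryVectorizer.
# If the failed map/reduce task logs show OutOfMemoryError, raising the
# child-task heap is one thing worth trying.
mahout seq2sparse \
  -Dmapred.child.java.opts=-Xmx1024m \
  -i enron/seqfiles -o enron/vectors -wt tfidf -nv

# Cluster the resulting TF-IDF vectors with k-means (k and -x are placeholders).
mahout kmeans -i enron/vectors/tfidf-vectors -c enron/centroids \
  -o enron/clusters -k 20 -x 10 \
  -dm org.apache.mahout.common.distance.CosineDistanceMeasure -cl
```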
Is this because I am entirely inept and have missed something? Or is it a
limitation of Mahout's sequence-file handling, given that it was never aimed at
huge numbers of short messages that really can't be clustered anyway, since
they carry almost no information?
Simon
----
Dr. Simon Thompson