Hi Simon,

That looks like an error from the seq2sparse job you're using to vectorize your text. It's surprising to get an error at the vectorization stage, but others more experienced than me should probably comment. :)
The line numbers don't match what I have in my version of Mahout (a forked version of trunk). If I'm not mistaken, there should be an "inner" exception thrown by a mapper or reducer that tells us more. Can you please look through the error log and see if there's anything else?

As a side note, I'm clustering the 20 newsgroups data set (~20K documents, ~20MB in total) and it's working fine.

Thanks!
Dan

On Sat, Mar 9, 2013 at 5:44 PM, <[email protected]> wrote:
> Hi there,
>
> I am doing a fairly silly experiment to measure Hadoop performance. As part
> of this I have extracted emails from the Enron database and I am clustering
> them using a proprietary method for clustering short messages (i.e. tweets,
> emails, SMSs) and benchmarking clusters in various configurations.
>
> As part of this I have been benchmarking a single processing machine (my
> new laptop). This is an HP EliteBook with 32 GB of RAM, SSDs, nice
> processors etc. etc. The point is that when explaining to people that we
> need Hadoop, I can show them that a laptop is really, really useless and
> likely to remain so (I know this is obvious; come and work in a corporate
> and find out what else you have to do to earn a living! Then tell me that I
> am silly!)
>
> Anyhooo... I have seen reasonable behaviour from the algorithms I have
> built (i.e. for very small data MapReduce puts an overhead on the
> processing, but once you get reasonably large the parallelism wins), but
> when I try Mahout's k-means I get odd behaviour.
>
> When I get to ~175k individual files / ~175MB of input data I get an
> exception:
>
> Exception in thread "main" java.lang.IllegalStateException: Job failed!
> at org.apache.mahout.vectorizer.DictionaryVectorizer.makePartialVectors(DictionaryVectorizer.java:329)
> at org.apache.mahout.vectorizer.DictionaryVectorizer.createTermFrequencyVectors(DictionaryVectorizer.java:199)
> at org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles.run(SparseVectorsFromSequenceFiles.java:271)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
>
> Is this because I am entirely inept and have missed something, or is it
> because of a limitation of Mahout sequence files, due to them not being
> aimed at loads of short messages that really can't be clustered anyway,
> having no information in them?
>
> Simon
>
> ----
> Dr. Simon Thompson
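P.S. If it helps, here is a minimal sketch of how to fish that "inner" exception out of a task log, assuming a Hadoop 1.x layout where each task attempt writes a syslog somewhere under $HADOOP_HOME/logs/userlogs/. The sample log contents and the OutOfMemoryError in it are invented for illustration; point the function at one of your real attempt logs instead.

```shell
# Sketch only: print the first "Caused by:" line (the root cause) plus a
# couple of lines of context from a task-attempt log. On a real cluster you
# would run something like:
#   grep -R -A2 'Caused by:' "$HADOOP_HOME"/logs/userlogs/
find_root_cause() {
    grep -m1 -A2 'Caused by:' "$1"
}

# Synthetic stand-in for a real syslog (contents are made up):
LOG=$(mktemp)
cat > "$LOG" <<'EOF'
java.lang.IllegalStateException: Job failed!
	at org.apache.mahout.vectorizer.DictionaryVectorizer.makePartialVectors(DictionaryVectorizer.java:329)
Caused by: java.lang.OutOfMemoryError: Java heap space
	at java.util.Arrays.copyOf(Arrays.java:2882)
EOF

# Prints the "Caused by:" line and the stack frame below it
find_root_cause "$LOG"
rm -f "$LOG"
```

Whatever shows up after "Caused by:" in your logs is the exception worth posting back to the list.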
