I've seen this issue a few times before. There are a few edge conditions that need to be fixed in the Streaming KMeans code, and you are right that the generated clusters differ on successive runs given the same input.
IIRC this stack trace is due to BallKMeans failing to read any input centroids. I can't recall the sequence that leads to this off the top of my head and will have to look.

What's the size of your input (the number of points you are trying to cluster), and how are you setting the value for --estimatedNumMapClusters?

Streaming KMeans is still experimental and has scalability issues that need to be worked out. There are a few other scenarios in which Streaming KMeans fails that you should be aware of; see https://issues.apache.org/jira/browse/MAHOUT-1469.

Let me take a look at this.

On Thu, Oct 9, 2014 at 5:39 AM, Marko Dinić <[email protected]> wrote:

> Hello everyone,
>
> I'm using Mahout Streaming K Means multiple times in a loop, every time
> for same input data, and output path is always different. Concretely, I'm
> increasing number of clusters in each iteration. Currently it is run on a
> single machine.
>
> A couple of times (maybe 3 of 20 runs) I get this exception
>
> Oct 09, 2014 11:30:40 AM org.apache.hadoop.mapred.Merger$MergeQueue merge
> INFO: Merging 1 sorted segments
> Oct 09, 2014 11:30:40 AM org.apache.hadoop.mapred.Merger$MergeQueue merge
> INFO: Down to the last merge-pass, with 1 segments left of total size: 1623 bytes
> Oct 09, 2014 11:30:40 AM org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
> INFO:
> Oct 09, 2014 11:30:40 AM org.apache.hadoop.mapred.LocalJobRunner$Job run
> WARNING: job_local1196467414_0036
> java.lang.NullPointerException
>     at com.google.common.base.Preconditions.checkNotNull(Preconditions.java:213)
>     at org.apache.mahout.math.random.WeightedThing.<init>(WeightedThing.java:31)
>     at org.apache.mahout.math.neighborhood.ProjectionSearch.searchFirst(ProjectionSearch.java:191)
>     at org.apache.mahout.clustering.streaming.cluster.BallKMeans.iterativeAssignment(BallKMeans.java:395)
>     at org.apache.mahout.clustering.streaming.cluster.BallKMeans.cluster(BallKMeans.java:208)
>     at org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansReducer.getBestCentroids(StreamingKMeansReducer.java:107)
>     at org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansReducer.reduce(StreamingKMeansReducer.java:73)
>     at org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansReducer.reduce(StreamingKMeansReducer.java:37)
>     at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:177)
>     at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:649)
>     at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:418)
>     at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:398)
>
> I'm running it like this:
>
> String[] args1 = new String[] {"-i", dataPath, "-o", plusOneCentroids,
>     "-k", String.valueOf(i+1),
>     "--estimatedNumMapClusters", String.valueOf((i+1)*3),
>     "-ow"};
> StreamingKMeansDriver.main(args1);
>
> I'm using the same configuration, and the same dataset, but I see no
> reason why I get this exception, and it's even stranger that it doesn't
> always occur.
>
> Any ideas?
>
> Thanks
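
On --estimatedNumMapClusters: as far as I recall, the option's help text says the value should be around k * log(n), where k is the final number of clusters and n is the number of points, whereas the invocation quoted above passes 3 * k. I have not confirmed that this is what triggers the NullPointerException, so treat the following as a rough sketch only. The class and method names are made up for illustration, the natural log is my assumption (the help text does not name a base), and it assumes you know n up front.

```java
// Hypothetical helper, not part of Mahout: sizes --estimatedNumMapClusters
// using the k * log(n) rule of thumb instead of a fixed multiple of k.
final class MapClusterSizing {

    private MapClusterSizing() {
    }

    // Returns ceil(k * ln(n)), but never less than k.
    // Assumption: "log" in the rule of thumb means the natural logarithm.
    static int estimate(int k, long numPoints) {
        if (k < 1 || numPoints < 1) {
            throw new IllegalArgumentException("k and numPoints must be >= 1");
        }
        int byRule = (int) Math.ceil(k * Math.log((double) numPoints));
        return Math.max(k, byRule);
    }

    // Builds the same argument list as the quoted invocation, with only the
    // --estimatedNumMapClusters value changed.
    static String[] buildArgs(String dataPath, String outputPath, int k, long numPoints) {
        return new String[] {
            "-i", dataPath,
            "-o", outputPath,
            "-k", String.valueOf(k),
            "--estimatedNumMapClusters", String.valueOf(estimate(k, numPoints)),
            "-ow"
        };
    }
}
```

In your loop this would replace the hand-built array, e.g. StreamingKMeansDriver.main(MapClusterSizing.buildArgs(dataPath, plusOneCentroids, i + 1, n)), where n is the number of input points. For k = 10 and n = 1000 it passes 70 rather than the 30 that 3 * k gives.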
