Heh... your data size is indeed tiny. One of the edge conditions I was alluding to is that this implementation fails on tiny datasets.
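To illustrate the kind of tiny-dataset condition I mean, here is a minimal standalone sketch. The guard and its 10x margin are purely hypothetical and not part of Mahout's API; it just shows how the requested number of map clusters can collide with a small point count:

```java
public class TinyDatasetGuard {

    // Hypothetical sanity check, not part of Mahout: the map phase is
    // expected to produce on the order of estimatedNumMapClusters sketch
    // centroids, so if that estimate is not comfortably smaller than the
    // number of input points, BallKMeans can end up with too few (or no)
    // centroids to cluster. The 10x margin here is an arbitrary guess.
    static boolean enoughPoints(int numPoints, int estimatedNumMapClusters) {
        return estimatedNumMapClusters < numPoints / 10;
    }

    public static void main(String[] args) {
        // 942 points with 60 estimated map clusters: comfortably sized.
        System.out.println(enoughPoints(942, 60));
        // 30 points with 60 estimated map clusters: the tiny-data edge case.
        System.out.println(enoughPoints(30, 60));
    }
}
```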
Do you see any output clusters? If so, how many points do they contain? Would it be possible to share your dataset so I can troubleshoot?

On Thu, Oct 9, 2014 at 9:18 AM, Marko Dinić <[email protected]> wrote:

> Suneel,
>
> Thank you for your answer, this was rather strange to me.
>
> The number of points is 942. I have multiple runs; in each run I have a
> loop in which the number of clusters is increased on each iteration, and
> I multiply that number by 3, since I'm expecting log(n) initial centroids
> before the Ball K-Means step. It's actually an attempt at an elbow-method
> implementation. It's very strange that this crash happens only
> occasionally.
>
> Can I expect that problems like this will be fixed in the future? I'm
> using Streaming K-Means since it gives better results, both in speed and
> clustering quality, but it would be a problem if it keeps crashing like
> this.
>
> On Thursday, 09 October 2014 14:54:28 CEST, Suneel Marthi wrote:
>
>> I've seen this issue happen a few times before; there are a few edge
>> conditions that need to be fixed in the Streaming KMeans code, and you
>> are right that the generated clusters are different on successive runs
>> given the same input.
>>
>> IIRC this stack trace is due to BallKMeans failing to read any input
>> centroids - I can't recall the sequence that leads to this off the top
>> of my head; I'll have to look.
>>
>> What's the size of your input - the number of points you are trying to
>> cluster - and how are you setting the value for
>> --estimatedNumMapClusters? Streaming KMeans is still experimental and
>> has scalability issues that need to be worked out.
>>
>> There are a few other scenarios in which Streaming KMeans fails that
>> you should be aware of; see
>> https://issues.apache.org/jira/browse/MAHOUT-1469.
>>
>> Let me take a look at this.
>>
>> On Thu, Oct 9, 2014 at 5:39 AM, Marko Dinić <[email protected]>
>> wrote:
>>
>>> Hello everyone,
>>>
>>> I'm using Mahout Streaming K-Means multiple times in a loop, every
>>> time with the same input data, and the output path is always
>>> different. Concretely, I'm increasing the number of clusters in each
>>> iteration. Currently it is run on a single machine.
>>>
>>> A couple of times (maybe 3 out of 20 runs) I get this exception:
>>>
>>> Oct 09, 2014 11:30:40 AM org.apache.hadoop.mapred.Merger$MergeQueue merge
>>> INFO: Merging 1 sorted segments
>>> Oct 09, 2014 11:30:40 AM org.apache.hadoop.mapred.Merger$MergeQueue merge
>>> INFO: Down to the last merge-pass, with 1 segments left of total size: 1623 bytes
>>> Oct 09, 2014 11:30:40 AM org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
>>> INFO:
>>> Oct 09, 2014 11:30:40 AM org.apache.hadoop.mapred.LocalJobRunner$Job run
>>> WARNING: job_local1196467414_0036
>>> java.lang.NullPointerException
>>>     at com.google.common.base.Preconditions.checkNotNull(Preconditions.java:213)
>>>     at org.apache.mahout.math.random.WeightedThing.<init>(WeightedThing.java:31)
>>>     at org.apache.mahout.math.neighborhood.ProjectionSearch.searchFirst(ProjectionSearch.java:191)
>>>     at org.apache.mahout.clustering.streaming.cluster.BallKMeans.iterativeAssignment(BallKMeans.java:395)
>>>     at org.apache.mahout.clustering.streaming.cluster.BallKMeans.cluster(BallKMeans.java:208)
>>>     at org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansReducer.getBestCentroids(StreamingKMeansReducer.java:107)
>>>     at org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansReducer.reduce(StreamingKMeansReducer.java:73)
>>>     at org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansReducer.reduce(StreamingKMeansReducer.java:37)
>>>     at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:177)
>>>     at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:649)
>>>     at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:418)
>>>     at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:398)
>>>
>>> I'm running it like this:
>>>
>>> String[] args1 = new String[] {"-i", dataPath, "-o", plusOneCentroids,
>>>     "-k", String.valueOf(i + 1),
>>>     "--estimatedNumMapClusters", String.valueOf((i + 1) * 3),
>>>     "-ow"};
>>> StreamingKMeansDriver.main(args1);
>>>
>>> I'm using the same configuration and the same dataset, but I see no
>>> reason why I get this exception, and it's even stranger that it
>>> doesn't always occur.
>>>
>>> Any ideas?
>>>
>>> Thanks
>
> --
> Regards,
> Marko Dinić
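P.S. Since the crash is intermittent (3 out of 20 runs above), retrying the same iteration may be a usable stopgap until the edge conditions are fixed. Below is a self-contained sketch of that idea; runIteration is a hypothetical stand-in for the real StreamingKMeansDriver.main call, kept empty so the sketch compiles without Mahout on the classpath:

```java
public class ElbowLoopSketch {

    // Hypothetical stand-in for StreamingKMeansDriver.main(args), kept
    // empty so this sketch is self-contained. The real call can throw an
    // intermittent NullPointerException, as in the stack trace above.
    static void runIteration(String[] args) {
    }

    // Mirrors the invocation from the thread: k = i + 1 clusters and
    // estimatedNumMapClusters = 3 * k, as an elbow-method attempt.
    static String[] argsFor(int i, String dataPath, String outPath) {
        return new String[] {
            "-i", dataPath, "-o", outPath,
            "-k", String.valueOf(i + 1),
            "--estimatedNumMapClusters", String.valueOf((i + 1) * 3),
            "-ow"
        };
    }

    // Because the failure does not always occur, rerunning the same
    // iteration with identical arguments may succeed; rethrow only after
    // exhausting the attempts.
    static void runWithRetry(String[] args, int maxAttempts) {
        for (int attempt = 1; ; attempt++) {
            try {
                runIteration(args);
                return;
            } catch (NullPointerException e) {
                if (attempt >= maxAttempts) {
                    throw e;
                }
            }
        }
    }

    public static void main(String[] unused) {
        for (int i = 0; i < 3; i++) {
            String[] a = argsFor(i, "/input/points", "/output/run-" + i);
            runWithRetry(a, 3);
            System.out.println("k=" + a[5] + " estimatedNumMapClusters=" + a[7]);
        }
    }
}
```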
