Suneel,

Thank you for your answer. This behaviour seemed rather strange to me.

The number of points is 942. I have multiple runs; in each run I have a loop in which the number of clusters is increased on every iteration, and I multiply that number by 3 to get the estimated number of map clusters, since I'm expecting about log(n) initial centroids before the Ball K Means step. It's actually an attempt at implementing the elbow method. It's very strange that the crash happens only occasionally.
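To make the elbow idea concrete, here is a minimal sketch of how the elbow could be picked once a clustering cost has been collected for each k. The class name, the `elbowIndex` helper, and the cost values are all hypothetical, and the largest-second-difference rule is only one common heuristic, not something Mahout provides.

```java
class ElbowSketch {
    /**
     * Returns the index i (0-based) at which the discrete second difference
     * costs[i-1] - 2*costs[i] + costs[i+1] is largest, i.e. where the cost
     * curve bends most sharply. Returns -1 if there are fewer than 3 costs.
     */
    static int elbowIndex(double[] costs) {
        int best = -1;
        double bestCurvature = Double.NEGATIVE_INFINITY;
        for (int i = 1; i < costs.length - 1; i++) {
            double curvature = costs[i - 1] - 2 * costs[i] + costs[i + 1];
            if (curvature > bestCurvature) {
                bestCurvature = curvature;
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Made-up total clustering costs for k = 1..6 (illustration only).
        double[] costs = {1000.0, 600.0, 300.0, 250.0, 230.0, 220.0};
        int idx = elbowIndex(costs);
        // costs[0] corresponds to k = 1, so the chosen k is idx + 1.
        System.out.println("elbow at k = " + (idx + 1));
    }
}
```

In the loop described above, `costs[i]` would be filled from whatever quality measure is computed on the centroids written for k = i + 1.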

Can I expect problems like this to be fixed in the future? I'm using Streaming K Means because it gives better results, both in speed and clustering quality, but it would be a problem if it crashes like this.

On Thursday, 9 October 2014, 14:54:28 CEST, Suneel Marthi wrote:
Seen this issue happen a few times before. There are a few edge conditions
that need to be fixed in the Streaming KMeans code, and you are right that
the generated clusters are different on successive runs given the same
input.

IIRC this stacktrace is due to BallKMeans failing to read any input
centroids - can't recall the sequence that leads to this off the top of my
head, will have to look.

What's the size of your input (the number of points you are trying to
cluster), and how are you setting the value for --estimatedNumMapClusters?
Streaming KMeans is still experimental and has scalability issues that need
to be worked out.

There are a few other scenarios wherein Streaming KMeans fails that you should
be aware of, see https://issues.apache.org/jira/browse/MAHOUT-1469.

Lemme take a look at this.



On Thu, Oct 9, 2014 at 5:39 AM, Marko Dinić <[email protected]>
wrote:

Hello everyone,

I'm using Mahout Streaming K Means multiple times in a loop, every time
on the same input data and with a different output path. Concretely, I'm
increasing the number of clusters in each iteration. Currently it is run on a
single machine.

A couple of times (maybe 3 of 20 runs) I get this exception:

Oct 09, 2014 11:30:40 AM org.apache.hadoop.mapred.Merger$MergeQueue merge
INFO: Merging 1 sorted segments
Oct 09, 2014 11:30:40 AM org.apache.hadoop.mapred.Merger$MergeQueue merge
INFO: Down to the last merge-pass, with 1 segments left of total size: 1623 bytes
Oct 09, 2014 11:30:40 AM org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
INFO:
Oct 09, 2014 11:30:40 AM org.apache.hadoop.mapred.LocalJobRunner$Job run
WARNING: job_local1196467414_0036
java.lang.NullPointerException
     at com.google.common.base.Preconditions.checkNotNull(Preconditions.java:213)
     at org.apache.mahout.math.random.WeightedThing.<init>(WeightedThing.java:31)
     at org.apache.mahout.math.neighborhood.ProjectionSearch.searchFirst(ProjectionSearch.java:191)
     at org.apache.mahout.clustering.streaming.cluster.BallKMeans.iterativeAssignment(BallKMeans.java:395)
     at org.apache.mahout.clustering.streaming.cluster.BallKMeans.cluster(BallKMeans.java:208)
     at org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansReducer.getBestCentroids(StreamingKMeansReducer.java:107)
     at org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansReducer.reduce(StreamingKMeansReducer.java:73)
     at org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansReducer.reduce(StreamingKMeansReducer.java:37)
     at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:177)
     at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:649)
     at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:418)
     at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:398)

I'm running it like this:

String[] args1 = new String[] {
    "-i", dataPath,
    "-o", plusOneCentroids,
    "-k", String.valueOf(i + 1),
    "--estimatedNumMapClusters", String.valueOf((i + 1) * 3),
    "-ow"
};
StreamingKMeansDriver.main(args1);
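For comparison, here is a self-contained sketch of building the same argument array with the map-cluster estimate tied to the data size instead of a fixed factor of 3. It uses only the flags shown above. The class and helper names are hypothetical, and the ceil(k * ln(n)) rule with a natural logarithm is an assumption based on the usual "around k * log(n)" guidance for this option, not a verified fix for the exception.

```java
class SweepArgsSketch {
    /** Assumed estimate: ceil(k * ln(numPoints)); the log base is a guess. */
    static int estimateFor(int k, int numPoints) {
        return (int) Math.ceil(k * Math.log(numPoints));
    }

    /** Builds the argument array using only the flags from the call above. */
    static String[] buildArgs(String input, String output, int k, int estimate) {
        return new String[] {
            "-i", input,
            "-o", output,
            "-k", String.valueOf(k),
            "--estimatedNumMapClusters", String.valueOf(estimate),
            "-ow"
        };
    }

    public static void main(String[] args) {
        int numPoints = 942; // size of the dataset in this thread
        for (int k = 1; k <= 5; k++) {
            // Placeholder paths; the real run would pass dataPath and a fresh output path.
            String[] a = buildArgs("data", "centroids-" + k, k, estimateFor(k, numPoints));
            System.out.println(String.join(" ", a));
            // StreamingKMeansDriver.main(a) would be invoked here in the real loop.
        }
    }
}
```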

I'm using the same configuration and the same dataset every time, so I see
no reason why I get this exception, and it's even stranger that it doesn't
always occur.

Any ideas?

Thanks



--
Regards,
Marko Dinić
