Heh... your data size is indeed tiny. One of the edge conditions I was alluding to is that this implementation fails on tiny datasets.
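To illustrate the kind of tiny-dataset condition I mean, here is a minimal standalone sketch. The guard and its 10x margin are purely hypothetical and not part of Mahout's API; it just shows how the requested number of map clusters can collide with a small point count:

```java
public class TinyDatasetGuard {

    // Hypothetical sanity check, not part of Mahout: the map phase is
    // expected to produce on the order of estimatedNumMapClusters sketch
    // centroids, so if that estimate is not comfortably smaller than the
    // number of input points, BallKMeans can end up with too few (or no)
    // centroids to cluster. The 10x margin here is an arbitrary guess.
    static boolean enoughPoints(int numPoints, int estimatedNumMapClusters) {
        return estimatedNumMapClusters < numPoints / 10;
    }

    public static void main(String[] args) {
        // 942 points with 60 estimated map clusters: comfortably sized.
        System.out.println(enoughPoints(942, 60));
        // 30 points with 60 estimated map clusters: the tiny-data edge case.
        System.out.println(enoughPoints(30, 60));
    }
}
```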
Do you see any output clusters? If so, how many points do they contain? Would it be possible to share your dataset so I can troubleshoot?

On Thu, Oct 9, 2014 at 9:18 AM, Marko Dinić <[email protected]> wrote:

> Suneel,
>
> Thank you for your answer, this was rather strange to me.
>
> The number of points is 942. I have multiple runs; in each run I have a
> loop in which the number of clusters is increased on each iteration, and
> I multiply that number by 3, since I'm expecting log(n) initial centroids
> before the Ball K-Means step. It's actually an attempt at an elbow-method
> implementation. It's very strange that this crash happens only
> occasionally.
>
> Can I expect that problems like this will be fixed in the future? I'm
> using Streaming K-Means since it gives better results, both in speed and
> clustering quality, but it would be a problem if it keeps crashing like
> this.
>
> On Thursday, 09 October 2014 14:54:28 CEST, Suneel Marthi wrote:
>
>> I've seen this issue happen a few times before; there are a few edge
>> conditions that need to be fixed in the Streaming KMeans code, and you
>> are right that the generated clusters are different on successive runs
>> given the same input.
>>
>> IIRC this stack trace is due to BallKMeans failing to read any input
>> centroids - I can't recall the sequence that leads to this off the top
>> of my head; I'll have to look.
>>
>> What's the size of your input - the number of points you are trying to
>> cluster - and how are you setting the value for
>> --estimatedNumMapClusters? Streaming KMeans is still experimental and
>> has scalability issues that need to be worked out.
>>
>> There are a few other scenarios in which Streaming KMeans fails that
>> you should be aware of; see
>> https://issues.apache.org/jira/browse/MAHOUT-1469.
>>
>> Let me take a look at this.
>>
>> On Thu, Oct 9, 2014 at 5:39 AM, Marko Dinić <[email protected]>
>> wrote:
>>
>>> Hello everyone,
>>>
>>> I'm using Mahout Streaming K-Means multiple times in a loop, every
>>> time with the same input data, and the output path is always
>>> different. Concretely, I'm increasing the number of clusters in each
>>> iteration. Currently it is run on a single machine.
>>>
>>> A couple of times (maybe 3 out of 20 runs) I get this exception:
>>>
>>> Oct 09, 2014 11:30:40 AM org.apache.hadoop.mapred.Merger$MergeQueue merge
>>> INFO: Merging 1 sorted segments
>>> Oct 09, 2014 11:30:40 AM org.apache.hadoop.mapred.Merger$MergeQueue merge
>>> INFO: Down to the last merge-pass, with 1 segments left of total size: 1623 bytes
>>> Oct 09, 2014 11:30:40 AM org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
>>> INFO:
>>> Oct 09, 2014 11:30:40 AM org.apache.hadoop.mapred.LocalJobRunner$Job run
>>> WARNING: job_local1196467414_0036
>>> java.lang.NullPointerException
>>>     at com.google.common.base.Preconditions.checkNotNull(Preconditions.java:213)
>>>     at org.apache.mahout.math.random.WeightedThing.<init>(WeightedThing.java:31)
>>>     at org.apache.mahout.math.neighborhood.ProjectionSearch.searchFirst(ProjectionSearch.java:191)
>>>     at org.apache.mahout.clustering.streaming.cluster.BallKMeans.iterativeAssignment(BallKMeans.java:395)
>>>     at org.apache.mahout.clustering.streaming.cluster.BallKMeans.cluster(BallKMeans.java:208)
>>>     at org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansReducer.getBestCentroids(StreamingKMeansReducer.java:107)
>>>     at org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansReducer.reduce(StreamingKMeansReducer.java:73)
>>>     at org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansReducer.reduce(StreamingKMeansReducer.java:37)
>>>     at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:177)
>>>     at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:649)
>>>     at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:418)
>>>     at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:398)
>>>
>>> I'm running it like this:
>>>
>>> String[] args1 = new String[] {"-i", dataPath, "-o", plusOneCentroids,
>>>     "-k", String.valueOf(i + 1),
>>>     "--estimatedNumMapClusters", String.valueOf((i + 1) * 3),
>>>     "-ow"};
>>> StreamingKMeansDriver.main(args1);
>>>
>>> I'm using the same configuration and the same dataset, but I see no
>>> reason why I get this exception, and it's even stranger that it
>>> doesn't always occur.
>>>
>>> Any ideas?
>>>
>>> Thanks
>
> --
> Regards,
> Marko Dinić
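P.S. Since the crash is intermittent (3 out of 20 runs above), retrying the same iteration may be a usable stopgap until the edge conditions are fixed. Below is a self-contained sketch of that idea; runIteration is a hypothetical stand-in for the real StreamingKMeansDriver.main call, kept empty so the sketch compiles without Mahout on the classpath:

```java
public class ElbowLoopSketch {

    // Hypothetical stand-in for StreamingKMeansDriver.main(args), kept
    // empty so this sketch is self-contained. The real call can throw an
    // intermittent NullPointerException, as in the stack trace above.
    static void runIteration(String[] args) {
    }

    // Mirrors the invocation from the thread: k = i + 1 clusters and
    // estimatedNumMapClusters = 3 * k, as an elbow-method attempt.
    static String[] argsFor(int i, String dataPath, String outPath) {
        return new String[] {
            "-i", dataPath, "-o", outPath,
            "-k", String.valueOf(i + 1),
            "--estimatedNumMapClusters", String.valueOf((i + 1) * 3),
            "-ow"
        };
    }

    // Because the failure does not always occur, rerunning the same
    // iteration with identical arguments may succeed; rethrow only after
    // exhausting the attempts.
    static void runWithRetry(String[] args, int maxAttempts) {
        for (int attempt = 1; ; attempt++) {
            try {
                runIteration(args);
                return;
            } catch (NullPointerException e) {
                if (attempt >= maxAttempts) {
                    throw e;
                }
            }
        }
    }

    public static void main(String[] unused) {
        for (int i = 0; i < 3; i++) {
            String[] a = argsFor(i, "/input/points", "/output/run-" + i);
            runWithRetry(a, 3);
            System.out.println("k=" + a[5] + " estimatedNumMapClusters=" + a[7]);
        }
    }
}
```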
