RE: Mahout-279/kmeans++

Whitmore, Mattie Thu, 23 Aug 2012 07:25:45 -0700

Yes, unique names will be my next plan -- I just can't kick off that job until 
after the weekend.  If this makes no difference I will also try the noise idea, 
and I'll follow up about both.
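
Since the unique-names run hinges on whether exact duplicates are present, a quick count on a sample of the input could be sketched as below. This is plain Java with no Mahout types: `DuplicateCounter` and `countDuplicates` are hypothetical names, and the points are assumed to be dense `double[]` rows that already fit in memory, which will not hold for the full data set.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper, not part of Mahout: counts exact duplicate points in a sample. */
final class DuplicateCounter {

    /** Returns how many points are exact copies of a point seen earlier in the list. */
    static long countDuplicates(List<double[]> points) {
        // Arrays do not define value equality, so each point is boxed into a List<Double>.
        // Double.equals compares bit patterns: 0.0 and -0.0 count as different values.
        Set<List<Double>> seen = new HashSet<>();
        long duplicates = 0;
        for (double[] point : points) {
            List<Double> key = new ArrayList<>(point.length);
            for (double value : point) {
                key.add(value);
            }
            if (!seen.add(key)) {
                duplicates++; // an identical point was already recorded
            }
        }
        return duplicates;
    }
}
```

If the count is well above zero on a sample, that would at least be consistent with the duplicate-points theory, though it says nothing about where they get dropped.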


My next question is regarding clusterDump.  Is there a way to run this in 
parallel? I have found some code to execute in java (the preferable method for 
me) but I would like the method to be faster and not in memory.  Is this a 
possibility? Or in the works?

Thanks!

-----Original Message-----
From: Paritosh Ranjan [mailto:[email protected]] 
Sent: Wednesday, August 22, 2012 9:09 PM
To: [email protected]
Subject: Re: Mahout-279/kmeans++

Can you also try to provide distinct names to vectors and then cluster?
It should not have any effect, but would be good to know the behavior.

On 22-08-2012 23:10, Whitmore, Mattie wrote:
> Yes, I have data which is exactly the same.  If I give every vector a name 
> which is distinct (albeit the data point is the same as other points in the 
> set) will this keep the algorithm from dropping non-distinct vectors/data 
> points (which is what I THINK but have yet to verify is what is going on)?
>
> Thanks,
>
> Mattie
>
> -----Original Message-----
> From: Ted Dunning [mailto:[email protected]]
> Sent: Wednesday, August 22, 2012 1:18 PM
> To: [email protected]
> Subject: Re: Mahout-279/kmeans++
>
> Just an off thought, do you have duplicate input points?
>
> On Wed, Aug 22, 2012 at 10:00 AM, Whitmore, Mattie <[email protected]> wrote:
>
>> ... I have also verified by running canopy multiple times with 0.5 and 0.7
>> that there is a continual discrepancy between the two clustering versions.
>>   The max/min vectors in a cluster using 0.5 is: 19192158/215  and 0.7 is:
>> 921998/5.  They should not necessarily be the same, since I am using canopy
>> clustering to find initial centroids, however I would think they would have
>> the same sum, which they do not (45901885 vs 1599154).
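
The sum comparison described above amounts to adding up the per-cluster vector counts and subtracting from the number of input vectors. A minimal sketch, assuming the counts have already been collected into a map keyed by cluster id (`ClusterTotals` and both method names are hypothetical, and nothing here reads Mahout output directly):

```java
import java.util.Map;

/** Hypothetical helper, not part of Mahout: checks whether cluster sizes add up to the input size. */
final class ClusterTotals {

    /** Sums the number of vectors assigned across all clusters. */
    static long totalAssigned(Map<Integer, Long> sizeByClusterId) {
        long total = 0;
        for (long size : sizeByClusterId.values()) {
            total += size;
        }
        return total;
    }

    /** Number of input vectors that do not appear in any cluster; 0 means nothing was dropped. */
    static long missing(long inputCount, Map<Integer, Long> sizeByClusterId) {
        return inputCount - totalAssigned(sizeByClusterId);
    }
}
```

Running the same check against both versions' output would show whether the totals differ from the input count or only from each other.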
>>
>> Here is the method I am running:
>>
>> public static void KmeansClusteringCanopy(String outputDir, String T, String itMax)
>>         throws IOException, InterruptedException, ClassNotFoundException,
>>                InstantiationException, IllegalAccessException {
>>
>>     Configuration conf = new Configuration();
>>
>>     DistanceMeasure measure = new EuclideanDistanceMeasure();
>>
>>     Path vectorsFolder = new Path(outputDir, "vectors");
>>     Path clusterCenters = new Path(outputDir + "-canopy/centriods");
>>     Path clusterOutput = new Path(outputDir + "-canopy/clusters");
>>
>>     // create canopies instead of initial vectors
>>     CanopyDriver.run(conf, vectorsFolder, clusterCenters, measure,
>>             Double.parseDouble(T), Double.parseDouble(T), false, 0, false);
>>
>>     // kmeans cluster operation
>>     KMeansDriver.run(conf, vectorsFolder,
>>             new Path(clusterCenters, "clusters-0-final/part-r-00000"),
>>             clusterOutput, measure, 0.01,
>>             Integer.parseInt(itMax), true, 0.0, false);
>>
>>     // post process by putting completed clusters into their own files.
>>     ClusterOutputPostProcessorDriver.run(clusterOutput,
>>             new Path(clusterOutput + "/CanopyClusterVectorFolders"), false);
>>
>> }
>>
>> What do you think?
>>
>> On another but related note: Is there a plan to have a method -- say
>> ClusterOutputPostProcessorDriver -- which when run outputs the vectors
>> within clusters as well as a separate folder containing pruned outliers?
>>
>> Thanks!
>>
>> Mattie
>>
>> -----Original Message-----
>> From: Paritosh Ranjan [mailto:[email protected]]
>> Sent: Friday, August 17, 2012 12:16 PM
>> To: [email protected]
>> Subject: Re: Mahout-279/kmeans++
>>
>> The clustering algorithm has also changed internally. So, expect the
>> results to be different ( and better ).
>>
>> I can think of one reason for this behavior. Maybe lots of clusters have
>> only one vector in them, and, AFAIK, clusterdumper will not output any
>> cluster with a single vector.
>> So, I think it's clusterdumper which is doing the invisible "pruning" (by
>> not outputting clusters with single vectors).
>>
>> Can you cross check the output once with ClusterOutputPostProcessorDriver?
>>
>> No, no tool can output the pruned vectors. The only way to see all
>> vectors assigned to any cluster is to set clusterClassificationThreshold
>> to 0.
>>
>> If you still face the problem, then please provide the parameters with
>> which you are calling kmeans.
>>
>> Regarding "I should also mention I have vectors which are exactly the
>> same (even their names), perhaps they are the ones being pruned, is that
>> possible? "
>>
>> The name of the vector has nothing to do with clustering, I am not sure
>> whether it will have any effect when clusterdumper is in action. So,
>> crosschecking with ClusterOutputPostProcessorDriver will answer this.
>>
>> Good luck.
>> Paritosh
>>
>> On 17-08-2012 21:07, Whitmore, Mattie wrote:
>>> Sure, I have a dataset which I wish to cluster using Kmeans.  Previously
>>> (v0.5) when I did a clusterdump the total amount of vectors within the
>>> resultant clusters was the same as the total amount fed to the algorithm.
>>> I wish this to be the case when clustering with v0.7.  The only change in
>>> the algorithm is clusterClassificationThreshold,  I set this value to be 0
>>> so that it will in fact cluster all vectors in the dataset.
>>>
>>> My logic here was no vector should have a probability of being in some
>>> cluster less than 0 and therefore all vectors should cluster.
>>>
>>> However after running a clusterdump I find that vectors (1/3 roughly)
>>> have been pruned.
>>>
>>> Is this a bug, or me just not understanding the new capabilities?
>>>
>>> I should also mention I have vectors which are exactly the same (even
>>> their names), perhaps they are the ones being pruned, is that possible?
>>>
>>> Another question if I may: I will eventually want to use the pruning
>>> capabilities, does the ClusterOutputPostProcessorDriver method (or a
>>> similar method) have the capability of outputting the pruned vectors into a
>>> folder?
>>>
>>> Thanks! Please let me know if I'm still not being clear enough.
>>>
>>> Mattie
>>>
>>> -----Original Message-----
>>> From: Paritosh Ranjan [mailto:[email protected]]
>>> Sent: Friday, August 17, 2012 11:20 AM
>>> To: [email protected]
>>> Subject: Re: Mahout-279/kmeans++
>>>
>>> clusterClassificationThreshold is for outlier removal, and this is the
>>> way it should be used.
>>>
>>> Can you provide some more information about your job and the way you are
>>> calling it?
>>>
>>> And if I look at the code, the vector should be clustered even if the
>>> pdf is 0. The method which decides whether the vector should be assigned to
>>> a particular cluster or not -
>>>
>>> /**
>>>  * Decides whether the vector should be classified or not based on the max pdf
>>>  * value of the clusters and threshold value.
>>>  *
>>>  * @return whether the vector should be classified or not.
>>>  */
>>> private static boolean shouldClassify(Vector pdfPerCluster, Double clusterClassificationThreshold) {
>>>   return pdfPerCluster.maxValue() >= clusterClassificationThreshold;
>>> }
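
Read on its own, the comparison in the method above should admit every vector when the threshold is 0, because a pdf value is never negative. A standalone sketch of the same comparison, using a plain `double[]` in place of Mahout's `Vector` (the class name `ThresholdCheck` is illustrative and this is not Mahout's code):

```java
/** Illustrative stand-in for the threshold comparison; not Mahout's implementation. */
final class ThresholdCheck {

    /** Returns true when the largest per-cluster pdf reaches the threshold. */
    static boolean shouldClassify(double[] pdfPerCluster, double threshold) {
        double max = Double.NEGATIVE_INFINITY;
        for (double pdf : pdfPerCluster) {
            // In this sketch a NaN pdf propagates through Math.max and fails the comparison below.
            max = Math.max(max, pdf);
        }
        return max >= threshold;
    }

    public static void main(String[] args) {
        // With threshold 0, even a vector whose pdf is 0 for every cluster passes.
        System.out.println(shouldClassify(new double[] {0.0, 0.0}, 0.0));
        // With threshold 0.5, a vector whose best pdf is 0.2 is treated as an outlier.
        System.out.println(shouldClassify(new double[] {0.1, 0.2}, 0.5));
    }
}
```

If vectors still go missing with the threshold at 0, this comparison alone would not explain it, which points toward the dump or post-processing step.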
>>>
>>> On 17-08-2012 20:06, Whitmore, Mattie wrote:
>>>
>>>> Hi Ted,
>>>>
>>>> Yes this is great!  I hope to start working with this algorithm in the
>>>> next couple weeks.
>>>>
>>>> I have a question about the 0.7 implementation of kmeans and the
>>>> clusterClassificationThreshold.  I have this value set at zero, but the
>>>> output is still showing that about 1/3 of my data is not assigned to a
>>>> cluster.  Am I using this value incorrectly?  I did a
>>>> kmeansdriver.run with the 0.5 and 0.7 api, and had the data pruned despite
>>>> the clusterClassificationThreshold = 0.
>>>>
>>>> Thanks,
>>>>
>>>> Mattie
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: Ted Dunning [mailto:[email protected]]
>>>> Sent: Wednesday, August 15, 2012 5:20 PM
>>>> To: [email protected]
>>>> Subject: Re: Mahout-279/kmeans++
>>>>
>>>> Mattie,
>>>>
>>>> Would this help?
>>>>
>>>> https://github.com/tdunning/knn/blob/master/src/main/java/org/apache/mahout/knn/cluster/BallKmeans.java
>>>>
>>>> and
>>>>
>>>> https://github.com/tdunning/knn/blob/master/docs/scaling-k-means/scaling-k-means.pdf
>>>>
>>>> On Wed, Aug 15, 2012 at 10:45 AM, Whitmore, Mattie <[email protected]> wrote:
>>>>> Hi!
>>>>>
>>>>> I have been using RandomSeedGenerator, and was hoping it had a patch like
>>>>> that described in Mahout-279 since I want only 10 vectors out of a set of
>>>>> more than 100,000,000.  I have been using canopy clustering for better
>>>>> results, but still need to do a few passes of kmeans to determine my T, and
>>>>> the random seed does take a long time.
>>>>>
>>>>> The comments say that you are working on a kmeans++, I searched around but
>>>>> couldn't confirm any more information about it.  Is a scalable kmeans++ in
>>>>> the works? (I know research on the subject is quite new)
>>>>>
>>>>> Thanks!
>>>>>
>>>>>
>>>>>
>>>>> Mattie Whitmore
>>>>> Mathematician/IR&D Software Engineer
>>>>> HARRIS  Corporation - Advanced Information Solutions
>>>>> 301.837.5278
>>>>> [email protected]<mailto:[email protected]>
>>>>>
>>>>>
>>>>>
>>>>>
