Ah, I forgot to mention: I had also called computeParameters() to compute numPoints. Sorry for the inconvenience.

On 26-09-2011 23:49, Jeff Eastman wrote:
I tried this patch and found it does not work correctly, since computeCentroid()
does not compute numPoints and since it is called after getNumPoints(). Here is
mine, which addresses this. BTW, I'm working on a patch which generalizes this
limit to arbitrary limit values and which affects sequential operation and
reducer outputs too:

CanopyMapper:

  @Override
  protected void cleanup(Context context) throws IOException,
      InterruptedException {
    for (Canopy canopy : canopies) {
      canopy.computeParameters();
      if (canopy.getNumPoints() > 1) {
        context.write(new Text("centroid"),
            new VectorWritable(canopy.getCenter()));
      }
    }
    super.cleanup(context);
  }
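
For illustration only, a rough sketch of what that generalization might look
like; the "canopy.cluster.filter" configuration key here is only an assumption,
not the actual API of the patch:

  @Override
  protected void cleanup(Context context) throws IOException,
      InterruptedException {
    // assumed key; the real patch may name and plumb this differently
    int clusterFilter = context.getConfiguration()
        .getInt("canopy.cluster.filter", 1);
    for (Canopy canopy : canopies) {
      canopy.computeParameters();
      if (canopy.getNumPoints() > clusterFilter) {
        context.write(new Text("centroid"),
            new VectorWritable(canopy.getCenter()));
      }
    }
    super.cleanup(context);
  }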

-----Original Message-----
From: Paritosh Ranjan [mailto:[email protected]]
Sent: Saturday, September 24, 2011 8:53 AM
To: [email protected]
Subject: Re: Clustering : Number of Reducers

Just a correction: the code change is in CanopyMapper.

On 24-09-2011 21:20, Paritosh Ranjan wrote:
I have changed the code in CanopyDriver and the performance/memory
consumption of the reducer has improved a lot.
Thanks for this fix.

On 21-09-2011 00:51, Konstantin Shmakov wrote:
This became technical, but I believe a single product requirement should not
drive the generic implementation. Canopy is supposed to produce a fast "hint"
for other clustering techniques; one can experiment with custom variations to
do just that. For instance, for 1) I'd suggest trying to add one line in
CanopyMapper to output only canopies with >1 points:

    protected void cleanup(Context context) throws IOException,
        InterruptedException {
      for (Canopy canopy : canopies) {
-       context.write(new Text("centroid"),
-           new VectorWritable(canopy.computeCentroid()));
+       if (canopy.getNumPoints() > 1) {
+         context.write(new Text("centroid"),
+             new VectorWritable(canopy.computeCentroid()));
+       }
      }

Even though it filters canopies at an earlier stage, and can potentially
filter canopies with up to #mappers points, it can be an effective data
reduction technique. One could even write these canopies with a different
key and cluster them separately, as in the sketch below, but that would be
a more custom variation.
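
A minimal sketch of that variation, assuming a hypothetical "singleton" key
for the single-point canopies (the key name is made up here):

  protected void cleanup(Context context) throws IOException,
      InterruptedException {
    for (Canopy canopy : canopies) {
      canopy.computeParameters();
      // route single-point canopies to their own (hypothetical) key
      String key = canopy.getNumPoints() > 1 ? "centroid" : "singleton";
      context.write(new Text(key), new VectorWritable(canopy.getCenter()));
    }
    super.cleanup(context);
  }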

--Konstantin

On Tue, Sep 20, 2011 at 11:20 AM, Paritosh Ranjan <[email protected]> wrote:

The bigger problem, in my opinion, is the existence of canopies containing
single vectors. Since these canopies with only one vector inside are not
clusters, there would be almost a billion canopies formed if the vectors
are far from each other.

I think two improvements can be applied to the current algorithm.

1) Require a minimum number of vectors inside a canopy/cluster, or the
cluster is discarded.
2) Change this "in memory" version of clustering to a "persisted" one.
The current implementation is not scalable. I have a valid business
scenario with 5 million clusters, and I think there will be more users
with bigger datasets/cluster numbers.


Thanks and Regards,
Paritosh Ranjan

On 20-09-2011 23:35, Jeff Eastman wrote:

As all the Mahout clustering implementations keep their clusters in
memory, I don't believe any of them will handle that many clusters. I'm
a bit skeptical, however, that 5 million clusters over a billion 300-d
vectors will produce anything useful by way of analytics. You've got the
curse of dimensionality working against you, and your vectors will be
nearly equidistant from each other. This means that very small (= noise)
differences in distance will be driving the clustering.
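
A quick standalone check of that effect (plain Java, no Mahout
dependencies): as the dimensionality grows, the ratio between the smallest
and largest pairwise distance among uniform random points climbs toward 1,
so the distances carry less and less signal.

  import java.util.Random;

  public class DistanceConcentration {
    public static void main(String[] args) {
      Random rnd = new Random(42);
      for (int dim : new int[] {3, 30, 300}) {
        // 100 random points in the unit cube of the given dimension
        double[][] points = new double[100][dim];
        for (double[] p : points) {
          for (int d = 0; d < dim; d++) {
            p[d] = rnd.nextDouble();
          }
        }
        double min = Double.MAX_VALUE;
        double max = 0.0;
        for (int i = 0; i < points.length; i++) {
          for (int j = i + 1; j < points.length; j++) {
            double sum = 0.0;
            for (int d = 0; d < dim; d++) {
              double diff = points[i][d] - points[j][d];
              sum += diff * diff;
            }
            double dist = Math.sqrt(sum);
            min = Math.min(min, dist);
            max = Math.max(max, dist);
          }
        }
        // the ratio approaches 1 as dim grows: distances concentrate
        System.out.printf("dim=%d  min/max distance ratio=%.2f%n",
            dim, min / max);
      }
    }
  }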


-----Original Message-----
From: Paritosh Ranjan [mailto:[email protected]]
Sent: Tuesday, September 20, 2011 10:41 AM
To: [email protected]
Subject: Re: Clustering : Number of Reducers


The max load I expect is 1 billion vectors, with around 300 dimensions per
vector. The number of clusters with more than one vector inside can be
around 5 million, with an average of 10-20 vectors per cluster.

But when most of the vectors are really far apart in the worst case
(apart from the similar ones, which will be inside a canopy), most of
the canopies might contain only one vector. So the number of canopies
will be really high (as lots of canopies will result in clusters
containing a single vector).

On 20-09-2011 22:56, Jeff Eastman wrote:

I guess it depends upon what you expect from your HUGE data set: how many
clusters do you believe it contains? A hundred? A thousand? A million? A
billion? With the right T-values I believe Canopy can handle the first
three but not the last. It will also depend upon the size of your vectors.
This is because, as canopy centroids are calculated, the centroid vectors
become more dense, and these take up more space in memory. So a million
really wide clusters might have trouble fitting into a 4 GB reducer
memory. But what are you really going to do with a million clusters? This
number seems vastly larger than one might find useful in summarizing a
data set. I would think a couple hundred clusters would be the limit of
human-understandable clustering. Canopy can do that with no problem.
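
A back-of-the-envelope estimate of that dense-centroid footprint (the
numbers below are illustrative, not measured):

  public class CentroidFootprint {
    public static void main(String[] args) {
      long clusters = 1000000L;   // hypothetical cluster count
      long dims = 300L;           // width of each dense centroid
      long bytes = clusters * dims * 8L;  // 8 bytes per double
      // ~2.4 GB before any JVM object overhead; real usage is higher
      System.out.printf("~%.1f GB for the centers alone%n", bytes / 1e9);
    }
  }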

MeanShiftCanopy, as its name implies, is really just an iterative canopy
implementation. It allows the specification of an arbitrary number of
initial reducers, but it counts them down to 1 in each iteration in order
to properly process all the input. It is an agglomerative clustering
algorithm, and the clusters it builds contain the indices of each of the
input points that have been agglomerated. This makes a mean shift canopy
larger in memory than vanilla canopies, since the list of points is
maintained too. It is possible to avoid the point accumulation, though: it
won't happen unless the -cl option is provided, and in that case the
memory consumption will be about the same as vanilla canopy.

Bottom line: How many clusters do you expect to find?




-----Original Message-----
From: Paritosh Ranjan [mailto:[email protected]]
Sent: Tuesday, September 20, 2011 9:46 AM
To: [email protected]
Subject: Re: Clustering : Number of Reducers

"but all the canopies gotta fit in memory."

If this is true, then CanopyDriver would not be able to cluster HUGE
data (as the memory might blow up).

I am using MeanShiftCanopyDriver of 0.6-SNAPSHOT, which can use any
number of reducers. Will it also need all the canopies in memory?

Or, which clustering technique would you suggest to cluster really big
data (considering performance and big size as parameters)?

Thanks and Regards,
Paritosh Ranjan

On 20-09-2011 21:35, Jeff Eastman wrote:

Well, while it is true that the CanopyDriver writes all its canopies to
the file system, they are written at the end of the reduce method. The
mappers all output the same key, so the one reducer gets all the mapper
pairs, and these must fit into memory before they can be output. With
T1/T2 values that are too small given the data, there will be a very
large number of clusters output by each mapper and a corresponding
deluge of clusters at the reducer. T3/T4 may be used to supply different
thresholds in the reduce step, but all the canopies gotta fit in memory.

-----Original Message-----
From: Paritosh Ranjan [mailto:[email protected]]
Sent: Tuesday, September 20, 2011 12:31 AM
To: [email protected]
Subject: Re: Clustering : Number of Reducers

"The limit is that all the canopies need to fit into memory."
I don't think so. I think you can use CanopyDriver to write canopies to
a filesystem; this is done as a map-reduce job. Then the KMeansDriver
needs these canopy points as input to run k-means, as in the sketch below.
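
A sketch of that pipeline; the driver signatures follow the 0.5-era API
and have shifted between Mahout releases, and the paths and T-values here
are made up, so treat this as approximate:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.mahout.clustering.canopy.CanopyDriver;
  import org.apache.mahout.clustering.kmeans.KMeansDriver;
  import org.apache.mahout.common.distance.EuclideanDistanceMeasure;

  public class CanopyKMeansPipeline {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      Path input = new Path("input/vectors");       // hypothetical paths
      Path canopyOut = new Path("output/canopies");
      Path kmeansOut = new Path("output/kmeans");
      EuclideanDistanceMeasure measure = new EuclideanDistanceMeasure();

      // write canopies to the filesystem as a map-reduce job
      CanopyDriver.run(conf, input, canopyOut, measure, 3.0, 1.5,
          false, false);
      // seed k-means with the canopy centroids
      KMeansDriver.run(conf, input, new Path(canopyOut, "clusters-0"),
          kmeansOut, measure, 0.001, 10, true, false);
    }
  }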

On 20-09-2011 01:39, Jeff Eastman wrote:

Actually, most of the clustering jobs (including DirichletDriver) accept
the -Dmapred.reduce.tasks=n argument as noted below. Canopy is the only
job which forces n=1, and this is so the reducer will see all of the
mapper outputs. Generally, by adjusting T2 & T1 to suitably-large values
you can get canopy to handle pretty large datasets. The limit is that all
the canopies need to fit into memory.
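
For the jobs that do honor it, the same thing can be done programmatically;
a minimal sketch using the standard Hadoop Job API:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.mapreduce.Job;

  public class ReducerCount {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      // equivalent of passing -Dmapred.reduce.tasks=8 on the command line
      conf.setInt("mapred.reduce.tasks", 8);
      Job job = new Job(conf, "clustering-job");
      job.setNumReduceTasks(8);  // programmatic form; Canopy ignores both
    }
  }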

-----Original Message-----
From: Paritosh Ranjan [mailto:[email protected]]
Sent: Sunday, September 18, 2011 10:03 PM
To: [email protected]
Subject: Re: Clustering : Number of Reducers

So, does this mean that Mahout cannot support clustering for large data?

Even in DirichletDriver the number of reducers is hardcoded to 1. And we
need canopies to run KMeansDriver.

Paritosh

On 19-09-2011 01:47, Konstantin Shmakov wrote:

For most of the tasks one can force the number of reducers with
mapred.reduce.tasks=<N>, where <N> is the desired number of reducers.

It will not necessarily increase the performance, though: with kmeans and
fuzzykmeans, combiners do the reducers' job, and increasing the number of
reducers won't usually affect performance.

With canopy, the distributed algorithm
<http://svn.apache.org/viewvc/mahout/trunk/core/src/main/java/org/apache/mahout/clustering/canopy/CanopyDriver.java?revision=1134456&view=markup>
has no combiners and has 1 reducer hardcoded; trying to increase
#reducers won't have any effect, as the algorithm doesn't work with >1
reducer. My experience is that canopy won't scale to large data and needs
improvement.

-- Konstantin



On Sun, Sep 18, 2011 at 10:50 AM, Paritosh Ranjan <[email protected]> wrote:

Hi,

I have been trying to cluster some hundreds of millions of records using
Mahout clustering techniques.

The number of reducers is always one, which I am not able to change. This
is affecting the performance. I am using Mahout 0.5.

In 0.6-SNAPSHOT, I see that the MeanShiftCanopyDriver has been changed to
use any number of reducers. Will other ClusterDrivers also get changed to
use any number of reducers in 0.6?

Thanks and Regards,
Paritosh Ranjan


