This has been asked several times before; if you search the mailing list
archives you should find similar posts.

There is no clear formula for picking ideal T1 and T2 values. The other
problem with Canopy is that it runs with a single reducer, so depending on
how big the data you are trying to cluster is, you are likely to hit an
OutOfMemoryError (OOME).
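
To make the trial and error a bit less blind, here is a minimal in-memory
sketch of the canopy rule itself (illustrative class and method names only,
this is not Mahout's CanopyDriver). T2 is the knob that decides how many
canopies, and hence what candidate 'k', you end up with; T1 only controls
how loosely points are shared between canopies. Since both are raw distances,
the "right" values depend entirely on your distance measure and data scale,
which is why there is no general formula.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedList;
import java.util.List;

// Minimal in-memory illustration of the canopy rule, not Mahout's CanopyDriver.
public class CanopySketch {

    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Returns the canopies found for the given thresholds (requires T1 > T2 > 0).
    static List<List<double[]>> canopies(List<double[]> points, double t1, double t2) {
        LinkedList<double[]> candidates = new LinkedList<>(points);
        List<List<double[]>> result = new ArrayList<>();
        while (!candidates.isEmpty()) {
            double[] center = candidates.getFirst();      // promote a remaining point to a canopy center
            List<double[]> canopy = new ArrayList<>();
            for (double[] p : points) {
                if (distance(center, p) < t1) {
                    canopy.add(p);                        // T1: loose, possibly overlapping membership
                }
            }
            result.add(canopy);
            // T2: anything this close to the center can never seed another canopy,
            // so T2 is what really decides how many canopies you get.
            candidates.removeIf(p -> distance(center, p) < t2);
        }
        return result;
    }

    public static void main(String[] args) {
        List<double[]> points = Arrays.asList(
            new double[]{0, 0}, new double[]{0.4, 0.3}, new double[]{5, 5},
            new double[]{5.3, 5.1}, new double[]{10, 10});
        System.out.println("T1=3, T2=1 -> " + canopies(points, 3.0, 1.0).size() + " canopies");
        System.out.println("T1=9, T2=8 -> " + canopies(points, 9.0, 8.0).size() + " canopies");
    }
}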

If the intent is to come up with a useful 'k' value based on the output of
Canopy, I would suggest looking at Streaming KMeans instead. Canopy has been
marked for deprecation and will be removed in a future release, not to
mention its scalability issues.
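
To give a rough idea of why Streaming KMeans scales better and still helps
with 'k': it makes one pass over the data, keeps a bounded set of weighted
"sketch" centroids, and then clusters that small sketch with ball k-means.
Below is a heavily simplified, single-machine illustration of that one-pass
sketching step (illustrative names, a fixed 1.5 cutoff growth factor and a
simplified merge rule are my own; this is not Mahout's StreamingKMeans code).

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Heavily simplified, single-machine illustration of the one-pass "sketch"
// idea behind streaming k-means. Not Mahout's StreamingKMeans implementation.
public class StreamingSketch {

    static double dist2(double[] a, double[] b) {
        double s = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            s += d * d;
        }
        return s;
    }

    // One pass over the data: merge each point into a nearby weighted centroid,
    // or start a new centroid if it is far away (or unlucky). When the sketch
    // grows past its budget, relax the cutoff so future points merge more often.
    static List<double[]> sketch(Iterable<double[]> stream, int sketchBudget, long seed) {
        List<double[]> centroids = new ArrayList<>();
        List<Double> weights = new ArrayList<>();
        Random rnd = new Random(seed);
        double cutoff = 1e-6;

        for (double[] x : stream) {
            int best = -1;
            double bestD2 = Double.MAX_VALUE;
            for (int i = 0; i < centroids.size(); i++) {
                double d2 = dist2(x, centroids.get(i));
                if (d2 < bestD2) { bestD2 = d2; best = i; }
            }
            if (best < 0 || rnd.nextDouble() < Math.min(1.0, bestD2 / cutoff)) {
                centroids.add(x.clone());          // far (or unlucky) point: new sketch centroid
                weights.add(1.0);
            } else {
                double[] c = centroids.get(best);  // close point: fold into the weighted mean
                double w = weights.get(best);
                for (int i = 0; i < c.length; i++) {
                    c[i] = (c[i] * w + x[i]) / (w + 1.0);
                }
                weights.set(best, w + 1.0);
            }
            if (centroids.size() > sketchBudget) {
                cutoff *= 1.5;                     // sketch too big: make merging easier
            }
        }
        // Cluster these weighted centroids with plain k-means (or ball k-means)
        // to pick or validate 'k'; the sketch is small enough to do that cheaply.
        return centroids;
    }
}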

On Tue, Jun 3, 2014 at 2:31 AM, David Noel <[email protected]> wrote:

> Is there some specific methodology for determining the most useful t1
> and t2 threshold values or is it largely a matter of trial and error?
>
