BTW it seems odd that I get large numbers for distance from the centroid
when clustering. Shouldn't I expect small numbers for the closest docs?
I have assumed the real distance is 1 - reported distance, but the
distances reported by rowsimilarity are very small, as I'd expect. I was
using tanimoto as the distance measure in both cases, and also tried
cosine with similar results.
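To make the "1 - reported value" question concrete, here is a minimal sketch of the textbook Tanimoto coefficient for weighted sparse vectors and the distance derived from it. It is plain Python, not Mahout code; the function names and the tiny three-document data set are made up for illustration, and whether each Mahout job reports the similarity or the distance is something to verify against the job's own output.

```python
# Sketch only: textbook Tanimoto similarity/distance on sparse vectors
# represented as {term: weight} dicts. Names here are hypothetical.

def dot(a, b):
    """Dot product of two sparse vectors."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())

def tanimoto_similarity(a, b):
    """a.b / (|a|^2 + |b|^2 - a.b); 1.0 for identical, 0.0 for disjoint."""
    ab = dot(a, b)
    denom = dot(a, a) + dot(b, b) - ab
    # Assumed convention: two all-zero vectors count as identical.
    return ab / denom if denom else 1.0

def tanimoto_distance(a, b):
    """Distance is the complement of the similarity."""
    return 1.0 - tanimoto_similarity(a, b)

def centroid(docs):
    """Component-wise mean of a list of sparse vectors."""
    total = {}
    for d in docs:
        for t, w in d.items():
            total[t] = total.get(t, 0.0) + w
    return {t: w / len(docs) for t, w in total.items()}

# Three toy documents that share one term and differ in another.
docs = [{"a": 1.0, "b": 1.0}, {"a": 1.0, "c": 1.0}, {"a": 1.0, "d": 1.0}]
center = centroid(docs)

for d in docs:
    print(tanimoto_distance(d, center), tanimoto_similarity(d, center))
```

If one tool prints a similarity and the other a distance, the two columns printed above show how the same pair of vectors yields numbers at opposite ends of the 0..1 range.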
On 5/8/12 9:12 AM, Pat Ferrel wrote:
Here is a sample data set. In this case I asked for 30 clusters and got
28. In other cases the discrepancy has been greater, e.g. asking for 200
and getting 38, but that was for a much larger data set.
Running on my Mac laptop in a single-node pseudo-cluster: Hadoop
0.20.205, Mahout 0.6
command line:
mahout kmeans \
-i b2/bixo-vectors/tfidf-vectors/ \
-c b2/bixo-kmeans-centroids \
-cl \
-o b2/bixo-kmeans-clusters \
-k 30 \
-ow \
-cd 0.01 \
-x 20 \
-dm org.apache.mahout.common.distance.TanimotoDistanceMeasure
Find the data here:
http://cloud.occamsmachete.com/apps/files_sharing/get.php?token=0b2dacddca05c0ee48cbebd05048434425b86740
BTW when I run rowsimilarity asking for 20 similar docs I get a max of
20 but sometimes far fewer. Shouldn't this always return the requested
number? I'll post this question again to get the attention of the
right person.
On 5/8/12 6:15 AM, Paritosh Ranjan wrote:
I looked at the 0.6 version's code but was not able to find any reason.
If possible, can you share the data you are trying to cluster along
with the execution parameters?
You can also open a Jira for this and provide the info there.
On 07-05-2012 19:45, Pat Ferrel wrote:
0.6
I take it this is not expected behavior? I could be doing something
stupid. I only look in the "final" directory. Looking in the others
with clusterdump shows the same number of clusters, and I assumed
those were the intermediate iterations.
On 5/7/12 1:21 AM, Paritosh Ranjan wrote:
Which version are you using? 0.6 or the current 0.7-snapshot?
On 07-05-2012 02:19, Pat Ferrel wrote:
What would cause kmeans to not return k clusters? As I tweak
parameters I get different numbers of clusters, but it's usually
fewer than the k I pass in. Since I am not using canopies at
present, I would expect k to always be honored, with the quality of
the clusters depending on the convergence delta and the number of
iterations allowed. No?
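One generic way a k-means run can end with fewer than k clusters is that a seed attracts no points during assignment and the implementation discards the empty cluster. The sketch below shows that effect with a toy one-dimensional Lloyd's iteration in plain Python. It is not Mahout's code, the function name and data are invented for illustration, and nothing in this thread confirms that this is what Mahout 0.6 does.

```python
# Sketch only: 1-D Lloyd's k-means that drops clusters left empty
# after assignment. Hypothetical helper, not taken from Mahout.

def lloyd_drop_empty(points, seeds, max_iter=20):
    centroids = list(seeds)
    for _ in range(max_iter):
        # Assign every point to its nearest centroid (ties go to the first).
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Assumed policy: a cluster with no members is discarded.
        kept = [c for c in clusters if c]
        new_centroids = [sum(c) / len(c) for c in kept]
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return centroids

points = [0, 1, 2, 10, 11, 12]

# k = 3, but the seed at 100 never wins a point, so it disappears.
print(len(lloyd_drop_empty(points, [0, 1, 100])))

# k = 3 with seeds that each attract at least one point.
print(len(lloyd_drop_empty(points, [0, 2, 11])))
```

Under this policy the number of clusters returned depends on the initial seeds as well as on the convergence delta and iteration limit, which would be consistent with the count changing as parameters are tweaked.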