Speed of clustering documents

Samir Raiyani Wed, 29 Dec 2010 14:40:09 -0800

Hello,

We have been testing Mahout in a few different configurations, and clustering
seems to take a long time (from several minutes to over an hour) even for
small document sets (3,000 and 7,000 documents). Is this kind of
performance normal?


Thanks,
Samir

*Results*

Phases in clustering
------------------------------
   1) Preprocessing the data
   2) Creating TF-IDF vectors
   3) Getting centroids (Canopy)
   4) Building clusters (K-means/LDA)
   5) Results extraction
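
As a rough illustration of phase 2 only, here is a minimal pure-Python sketch
of TF-IDF weighting. It is not Mahout code and not what produced the timings
in this message. The function name `tfidf`, the whitespace tokenizer, the
sample documents, and the formula idf = ln(N / df) are all assumptions chosen
for illustration; Mahout's own weighting may differ.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document.

    Assumption: raw term frequency times idf = ln(N / df).
    This is an illustrative weighting, not necessarily Mahout's.
    """
    # Assumption: lowercase + whitespace split stands in for real preprocessing.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))

    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw term counts within this document
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors


# Hypothetical sample documents, for illustration only.
docs = [
    "apache mahout clustering",
    "mahout kmeans clustering",
    "hadoop cluster",
]
vectors = tfidf(docs)
```

Terms that occur in fewer documents get a larger weight, so in this sample
"apache" outweighs "mahout" in the first document. Vectors of this kind are
what the Canopy and K-means phases would consume.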

Running on 3,000 documents
-------------------------------------------
                          1)   2)    3)                            4)                   5)
a) Hadoop oriented        L    12m   N                             N                    L
b) Sequential oriented    L    12m   2m (optimized)                12-15m (optimized)   L
c) Pseudo distributed     L    12m   Taking more time (>30min)     12-15m (optimized)   L

Running on 7,000 documents
------------------------------------------

b)Sequential oriented      90m

UNITS
------------
m - minutes
L - Less time (1 to 3 minutes)
N - Not running (for various reasons, such as heap memory)
