Hello,
We have been testing Mahout in a few different configurations and it seems
to take a significant amount of time (several minutes to over an hour) for
small document sets (3,000 documents and 7,000 documents). Is this type of
performance normal?
Thanks,
Samir
*Results*
Phases in clustering
------------------------------
1) Preprocessing the data
2) Creating TF-IDF vectors
3) Getting centroids (Canopy)
4) Building clusters (K-means/LDA)
5) Results extraction
Running on 3,000 documents
-------------------------------------------
                         1)   2)    3)                         4)                   5)
a) Hadoop oriented       L    12m   N                          N                    L
b) Sequential oriented   L    12m   2m (optimized)             12-15m (optimized)   L
c) Pseudo distributed    L    12m   >30m (taking more time)    12-15m (optimized)   L
Running on 7,000 documents
------------------------------------------
b) Sequential oriented   90m
UNITS
------------
m - minutes
L - Less time (1 to 3 minutes)
N - Not running (for various reasons, such as heap memory)