1) An outlierThreshold can be provided; then only those vectors whose
pdf is greater than this threshold will be included in the cluster.
Try playing around with this value; it might help.
2) This question is not very clear. However, I think you are talking
about the clustering phase. If that is the case, then it won't be
possible to use the closest point instead of centroid without change in
the code base.
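The outlier-threshold behavior described in 1) can be sketched roughly as follows. This is a minimal illustration of the idea only; the class and method names are hypothetical, not Mahout's actual API:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of pdf-threshold outlier filtering: a point is
// kept in a cluster only when its pdf under that cluster's model
// exceeds the outlier threshold.
public class OutlierFilterSketch {
    static List<double[]> filterByPdf(List<double[]> points,
                                      double[] pdfs,
                                      double outlierThreshold) {
        List<double[]> kept = new ArrayList<>();
        for (int i = 0; i < points.size(); i++) {
            if (pdfs[i] > outlierThreshold) {
                kept.add(points.get(i));
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<double[]> pts = new ArrayList<>();
        pts.add(new double[]{1, 0});
        pts.add(new double[]{9, 9});
        // pdf of each point under the cluster (illustrative values)
        double[] pdfs = {0.8, 0.05};
        List<double[]> kept = filterByPdf(pts, pdfs, 0.5);
        System.out.println(kept.size()); // prints 1
    }
}
```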
On 09-08-2012 14:18, Abramov Pavel wrote:
Hello,
I think Zipf's law is relevant for my data. Thanks for the idea about
memory-mapping.
1) How can I "drop" extremely small/large clusters? 50% of my clusters
are small with only 1 member, while 1 large cluster holds 50% of all
members. Is there a way to "split" large clusters with KMeans?
2) Can I force Mahout to use not the exact centroid but the closest point
from my set? Each point has ~10 non-zero components, while the exact
centroid is very dense (~200k).
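One way to approximate question 2) outside of Mahout would be a post-processing pass that swaps each computed centroid for the nearest actual point in the dataset (a medoid-like step). This is a hypothetical sketch, not a Mahout feature:

```java
// Hypothetical post-processing sketch: after k-means finishes, replace
// each dense centroid with the closest real (sparse) point.
public class NearestPointSketch {
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    // Returns the dataset point closest to the given centroid.
    static double[] nearestPoint(double[] centroid, double[][] points) {
        double best = Double.POSITIVE_INFINITY;
        double[] bestPoint = null;
        for (double[] p : points) {
            double d = squaredDistance(centroid, p);
            if (d < best) {
                best = d;
                bestPoint = p;
            }
        }
        return bestPoint;
    }

    public static void main(String[] args) {
        double[][] data = {{0, 0}, {10, 0}, {0, 10}};
        double[] centroid = {1, 1}; // dense centroid from k-means
        double[] rep = nearestPoint(centroid, data);
        System.out.println(rep[0] + "," + rep[1]); // prints 0.0,0.0
    }
}
```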
Thanks!
Pavel
On 09.08.12 5:43, "Lance Norskog" <[email protected]> wrote:
If Zipf's Law is relevant, locality will be much better than random.
Maybe you need a Vector implementation that is backed by memory-mapped
files?
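A minimal sketch of what a memory-mapped vector backing could look like, using java.nio so the components live outside the Java heap. Illustrative only; the class here is hypothetical, and Mahout's Vector interface has many more methods:

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.DoubleBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical dense vector whose components are stored in a
// memory-mapped file instead of on the JVM heap.
public class MmapVectorSketch {
    private final DoubleBuffer buf;

    MmapVectorSketch(Path file, int cardinality) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "rw");
             FileChannel ch = raf.getChannel()) {
            // The mapping stays valid after the channel is closed.
            buf = ch.map(FileChannel.MapMode.READ_WRITE, 0, 8L * cardinality)
                    .asDoubleBuffer();
        }
    }

    double get(int i) { return buf.get(i); }
    void set(int i, double v) { buf.put(i, v); }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("vec", ".bin");
        MmapVectorSketch v = new MmapVectorSketch(tmp, 200_000);
        v.set(123, 4.5);
        System.out.println(v.get(123)); // prints 4.5
    }
}
```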
On Wed, Aug 8, 2012 at 12:26 PM, Abramov Pavel <[email protected]>
wrote:
Thank you Jeff, Paritosh,
Reducing cluster count from 1000 to 100 made my day. I will also try
Canopy for initial cluster count.
Unfortunately I don't know how to reduce my 200k dictionary. There are
no infrequent terms.
I will provide you with Hadoop config shortly. But I am pretty sure I
can't decrease number of Mappers/Reducers per node or increase memory
limits. It will affect the whole cluster.
Thank you!
Pavel
On 08.08.12 16:15, "Jeff Eastman" <[email protected]> wrote:
Consider that each cluster retains 4 vectors in memory in each mapper
and reducer, and that these vectors tend to become more dense (through
addition of multiple sparse components) as iterations proceed. With 1000
clusters and 200k terms in your dictionary this will cause the heap
space to be consumed rapidly, as you have noted. Sometimes you can work
around this problem by increasing your heap size on a per-job basis or
reducing the number of mappers and reducers allowed on each node. Also
be sure you are not launching reducers until all of your mapper tasks
have completed.
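Jeff's estimate can be checked with quick arithmetic. Assuming 8 bytes per component once the vectors densify, and ignoring JVM object overhead:

```java
// Back-of-envelope heap estimate for the scenario in this thread:
// 1000 clusters, 4 retained vectors per cluster, a 200k-term
// dictionary, 8 bytes per double once the vectors densify.
public class HeapEstimate {
    static long denseBytes(long clusters, long vectorsPerCluster,
                           long dimensions) {
        return clusters * vectorsPerCluster * dimensions * 8L;
    }

    public static void main(String[] args) {
        long bytes = denseBytes(1000, 4, 200_000);
        System.out.println(bytes); // prints 6400000000 (~6.4 GB)
    }
}
```

That is roughly 6.4 GB per mapper or reducer, far more than a single task's share of a 12 GB node, which is consistent with the heap errors reported below.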
In order to provide more help, it would be useful to understand more
about how your cluster is "well tuned": how many mappers and reducers
you are launching in parallel, the heap-space limits set for tasks on
each node, etc.
For a quick test, try reducing k to 500 or 100 to see how many
clusters you can reasonably compute with your dataset on your cluster.
Canopy is also a good way to get a feel for a good initial k, though it
can be hard to arrive at good T-values in some text clustering cases.
You can also try hierarchical clustering with reduced k to stay under
your memory limits.
On 8/8/12 5:40 AM, Paritosh Ranjan wrote:
A stack trace would have helped in finding the exact error.
However, a high number of clusters can create heap-space problems (if
the vector dimension is also high).
Either try to reduce the number of initial clusters (in my opinion, the
best way to determine initial clusters is Canopy Clustering:
https://cwiki.apache.org/confluence/display/MAHOUT/Canopy+Clustering)
or try to reduce the dimension of the vectors.
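Reducing the dimension of the vectors could be prototyped by keeping only the top-N weighted components of each sparse vector. This is a hypothetical sketch, not a Mahout API, and since the dataset reportedly has no infrequent terms it may lose information:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: prune a sparse vector (term index -> weight)
// down to its n highest-magnitude components.
public class TopTermsSketch {
    static Map<Integer, Double> topN(Map<Integer, Double> vector, int n) {
        List<Map.Entry<Integer, Double>> entries =
            new ArrayList<>(vector.entrySet());
        // Sort by descending absolute weight.
        entries.sort(Comparator.comparingDouble(
            (Map.Entry<Integer, Double> e) -> -Math.abs(e.getValue())));
        Map<Integer, Double> pruned = new HashMap<>();
        for (Map.Entry<Integer, Double> e
                : entries.subList(0, Math.min(n, entries.size()))) {
            pruned.put(e.getKey(), e.getValue());
        }
        return pruned;
    }

    public static void main(String[] args) {
        Map<Integer, Double> v = new HashMap<>();
        v.put(10, 0.9);
        v.put(20, 0.1);
        v.put(30, 0.5);
        System.out.println(topN(v, 2).size()); // prints 2
    }
}
```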
PS: you are also providing numClusters twice:
--numClusters 1000 \ --numClusters 5 \
On 08-08-2012 10:42, Abramov Pavel wrote:
Hello,
I am trying to run the KMeans example on 15,000,000 documents
(seq2sparse output).
There are 1,000 clusters, a 200,000-term dictionary, and documents of
3-10 terms (titles). seq2sparse produces 200 files of 80 MB each.
My job fails with a Java heap space error. The 1st iteration passes
while the 2nd fails. In the Map phase of buildClusters I see a lot of
warnings, but it passes. The Reduce phase of buildClusters fails with
"Java heap space".
I cannot increase reducer/mapper memory in Hadoop. My cluster is well
tuned.
How can I avoid this situation? My cluster runs 300 mappers and 220
reducers on 40 servers, each with 8 cores and 12 GB RAM.
Thanks in advance!
Here are the KMeans parameters:
------------------------------------------------
mahout kmeans -Dmapred.reduce.tasks=200 \
-i ...tfidf-vectors/ \
-o /tmp/clustering_results_kmeans/ \
--clusters /tmp/clusters/ \
--numClusters 1000 \
--numClusters 5 \
--overwrite \
--clustering
------------------------------------------------
Pavel
--
Lance Norskog
[email protected]