The upcoming knn package has a file-based matrix implementation that uses
memory mapping to allow sharing a single copy of a large matrix between
processes and threads.
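
To make the idea concrete, here is a minimal sketch of a file-backed dense matrix built on the JDK's FileChannel.map. It is not the knn package's code: the MappedMatrix class and its get/set/flush methods are hypothetical names chosen for illustration, and the row-major layout of 8-byte doubles is an assumption.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical class for illustration only; not the knn package's API.
final class MappedMatrix {
    private final int rows;
    private final int cols;
    private final MappedByteBuffer buffer;

    MappedMatrix(Path file, int rows, int cols) throws IOException {
        long bytes = (long) rows * cols * Double.BYTES;
        if (rows < 0 || cols < 0 || bytes > Integer.MAX_VALUE) {
            // One FileChannel.map call covers at most Integer.MAX_VALUE bytes;
            // a larger matrix would need several mapped regions.
            throw new IllegalArgumentException("matrix does not fit in one mapping");
        }
        this.rows = rows;
        this.cols = cols;
        try (FileChannel channel = FileChannel.open(file,
                StandardOpenOption.CREATE,
                StandardOpenOption.READ,
                StandardOpenOption.WRITE)) {
            // The mapping remains valid after the channel is closed.
            this.buffer = channel.map(FileChannel.MapMode.READ_WRITE, 0, bytes);
        }
    }

    // Byte offset of an element, assuming row-major storage of doubles.
    private int offset(int row, int col) {
        if (row < 0 || row >= rows || col < 0 || col >= cols) {
            throw new IndexOutOfBoundsException("(" + row + ", " + col + ")");
        }
        return (row * cols + col) * Double.BYTES;
    }

    double get(int row, int col) {
        return buffer.getDouble(offset(row, col));
    }

    void set(int row, int col, double value) {
        buffer.putDouble(offset(row, col), value);
    }

    // Ask the OS to write modified pages back to the file.
    void flush() {
        buffer.force();
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("matrix", ".bin");
        file.toFile().deleteOnExit();

        MappedMatrix writer = new MappedMatrix(file, 3, 4);
        writer.set(1, 2, 42.5);
        writer.flush();

        // A second mapping of the same file reads the value back.
        MappedMatrix reader = new MappedMatrix(file, 3, 4);
        System.out.println(reader.get(1, 2));
    }
}
```

Because the data lives in the OS page cache rather than on the Java heap, several tasks on one node can map the same file without each holding a private copy.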

On Aug 9, 2012, at 1:48 AM, Abramov Pavel <[email protected]> wrote:

> Hello, 
> 
> I think Zipf's law is relevant to my data. Thanks for the idea about
> memory-mapping.
> 
> 1) How can I "drop" extremely small/large clusters? 50% of the clusters
> are small, with only 1 member, while 1 large cluster holds 50% of the
> members. Is there a way to "split" large clusters with Kmeans?
> 
> 2) Can I force Mahout to use the point from my set that is closest to the
> centroid instead of the exact centroid? Each point has ~10 non-zero
> components, while the exact centroid is very dense (~200k).
> 
> 
> Thanks!
> 
> Pavel
> 
> 
> On 09.08.12 5:43, "Lance Norskog" <[email protected]> wrote:
> 
>> If Zipf's Law is relevant, locality will be much better than random.
>> Maybe you need a Vector implementation that is backed by memory-mapped
>> files?
>> 
>> On Wed, Aug 8, 2012 at 12:26 PM, Abramov Pavel <[email protected]>
>> wrote:
>>> Thank you Jeff, Paritosh,
>>> 
>>> Reducing the cluster count from 1000 to 100 made my day. I will also try
>>> Canopy for the initial cluster count.
>>> Unfortunately I don't know how to reduce my 200k dictionary. There are
>>> no infrequent terms.
>>> 
>>> I will provide you with Hadoop config shortly. But I am pretty sure I
>>> can't decrease number of Mappers/Reducers per node or increase memory
>>> limits. It will affect the whole cluster.
>>> 
>>> 
>>> Thank you!
>>> 
>>> Pavel
>>> 
>>> 
>>> On 08.08.12 16:15, "Jeff Eastman" <[email protected]> wrote:
>>> 
>>>> Consider that each cluster retains 4 vectors in memory in each mapper
>>>> and reducer, and that these vectors tend to become more dense (through
>>>> addition of multiple sparse components) as iterations proceed. With 1000
>>>> clusters and 200k terms in your dictionary this will cause the heap
>>>> space to be consumed rapidly as you have noted. Sometimes you can work
>>>> around this problem by increasing your heap size on a per-job basis or
>>>> reducing the number of mappers and reducers allowed on each node. Also
>>>> be sure you are not launching reducers until all of your mapper tasks
>>>> have completed.
>>>> 
>>>> In order to provide more help to you, it would be useful to understand
>>>> more about how your cluster is "well tuned": how many mappers & reducers
>>>> you are launching in parallel, the heap space limits set for tasks on
>>>> each node, etc.
>>>> 
>>>> For a quick test, try reducing the k to 500 or 100 to see how many
>>>> clusters you can reasonably compute with your dataset on your cluster.
>>>> Canopy is also a good way to get a feel for a good initial k, though it
>>>> can be hard to arrive at good T-values in some text clustering cases.
>>>> You can also try hierarchical clustering with reduced k to stay under
>>>> your memory limits.
>>>> 
>>>> 
>>>> On 8/8/12 5:40 AM, Paritosh Ranjan wrote:
>>>>> A stack trace of the error would have helped in finding the exact cause.
>>>>> 
>>>>> However, the number of clusters can create heap space problems (if the
>>>>> vector dimension is also high).
>>>>> Either try to reduce the number of initial clusters (in my opinion,
>>>>> the best way to know about initial clusters is Canopy Clustering
>>>>> https://cwiki.apache.org/confluence/display/MAHOUT/Canopy+Clustering )
>>>>> 
>>>>> or, try to reduce the dimension of the vectors.
>>>>> 
>>>>> PS : you are also providing numClusters twice
>>>>> 
>>>>> --numClusters 1000 \ --numClusters 5 \
>>>>> 
>>>>> On 08-08-2012 10:42, Abramov Pavel wrote:
>>>>>> Hello,
>>>>>> 
>>>>>> I am trying to run the KMeans example on 15 000 000 documents
>>>>>> (seq2sparse output).
>>>>>> There are 1 000 clusters, a 200 000-term dictionary, and documents of
>>>>>> 3-10 terms (titles). seq2sparse produces 200 files of 80 MB each.
>>>>>> 
>>>>>> My job failed with a Java heap space error. The 1st iteration passes
>>>>>> while the 2nd iteration fails. In the Map phase of buildClusters I see
>>>>>> a lot of warnings, but it passes. The Reduce phase of buildClusters
>>>>>> fails with "Java Heap space".
>>>>>> 
>>>>>> I cannot increase reducer/mapper memory in Hadoop. My cluster is
>>>>>> tuned well.
>>>>>> 
>>>>>> How can I avoid this situation? My cluster has 300 Mappers and 220
>>>>>> Reducers running on 40 8-core servers with 12 GB RAM.
>>>>>> 
>>>>>> Thanks in advance!
>>>>>> 
>>>>>> Here are the KMeans parameters:
>>>>>> 
>>>>>> ------------------------------------------------
>>>>>> mahout kmeans -Dmapred.reduce.tasks=200 \
>>>>>> -i ...tfidf-vectors/  \
>>>>>> -o /tmp/clustering_results_kmeans/ \
>>>>>> --clusters /tmp/clusters/ \
>>>>>> --numClusters 1000 \
>>>>>> --numClusters 5 \
>>>>>> --overwrite \
>>>>>> --clustering
>>>>>> ------------------------------------------------
>>>>>> 
>>>>>> Pavel
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>> 
>>> 
>> 
>> 
>> 
>> -- 
>> Lance Norskog
>> [email protected]
> 

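For reference, the heap pressure Jeff describes in the quoted thread can be put into rough numbers. The sketch below is only a worst-case estimate under stated assumptions: every retained vector has become fully dense, each component is an 8-byte double, and object overhead is ignored. The HeapEstimate class and denseClusterBytes method are hypothetical names, not part of Mahout.

```java
// Hypothetical helper for illustration only; not part of Mahout.
final class HeapEstimate {

    // Worst-case bytes held per task if every retained vector is fully dense.
    static long denseClusterBytes(long clusters, long vectorsPerCluster, long dictionarySize) {
        return clusters * vectorsPerCluster * dictionarySize * Double.BYTES;
    }

    public static void main(String[] args) {
        long dictionarySize = 200_000L; // terms in the dictionary, from the thread
        long vectorsPerCluster = 4L;    // vectors retained per cluster, from the thread
        for (long k : new long[] {1000L, 500L, 100L}) {
            long bytes = denseClusterBytes(k, vectorsPerCluster, dictionarySize);
            System.out.printf("k=%d: about %.2f GB per task (worst case)%n", k, bytes / 1e9);
        }
    }
}
```

With 1000 clusters this upper bound is 6.4 GB per task, while 100 clusters gives 0.64 GB, which is consistent with the job fitting only after k was reduced.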