Re: java.lang.OutOfMemoryError with Mahout 0.10 and Spark 1.1.1

2015-07-20 Thread Dmitriy Lyubimov
Assuming task memory x number of cores does not exceed ~5g, and the block cache manager ratio does not have some really weird setting, the next best thing to look at is the initial task split size. I don't think the driver manages initial off-dfs splits satisfactorily in the release you are looking at

java.lang.OutOfMemoryError with Mahout 0.10 and Spark 1.1.1

2015-07-20 Thread Rodolfo Viana
I’m trying to run Mahout 0.10 with Spark 1.1.1. I have input files of 8k, 10M, 20M, and 25M. So far I have run with the following configurations: 8k with 1, 2, 3 slaves; 10M with 1, 2, 3 slaves; 20M with 1, 2, 3 slaves. But when I try to run bin/mahout spark-itemsimilarity --master spark://node1:7077 --input

Re: Kmeans clusterdump Interpretation

2015-07-20 Thread Ankit Goel
Oh, I thought kmeans gave me a point vector as a centroid, not a calculated point central to a cluster. I guess in this case I would be looking for the most central point vector (from the index) that I can use as a representative of the cluster. On Tue, Jul 21, 2015 at 6:41 AM, Andrew Musselman

Re: Kmeans clusterdump Interpretation

2015-07-20 Thread Ted Dunning
The most central point in a cluster is often referred to as a medoid (similar to median, but multi-dimensional). The Mahout code does not compute medoids. In general, they are difficult to compute and implementing a full k-medoid clustering algorithm even more so. On Mon, Jul 20, 2015 at 6:25
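To make the medoid definition concrete, here is a minimal brute-force sketch in plain Python. It is not Mahout code; the function name `medoid`, the use of Euclidean distance, and the sample points are assumptions for illustration only. It compares every pair of points, which hints at why this gets expensive on large clusters.

```python
import math

def medoid(points):
    """Return the member of `points` whose summed distance to all
    other members is smallest (brute force, O(n^2) distance calls)."""
    def total_distance(p):
        return sum(math.dist(p, q) for q in points)
    return min(points, key=total_distance)

# Hypothetical 2-D data with one outlier.
points = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (1.0, 1.0), (10.0, 10.0)]
print(medoid(points))  # (1.0, 1.0)
```

Unlike the mean of these points, which is (2.6, 2.6), the medoid is always one of the input points.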

Re: Kmeans clusterdump Interpretation

2015-07-20 Thread Ankit Goel
That kind of puts me in a tough position. I was planning to use kmeans as a method for aggregating similar articles from multiple news sources, and then getting a representative article from those. Here I mean similar as in the articles are from different news sources but are about the exact same

Re: Kmeans clusterdump Interpretation

2015-07-20 Thread Andrew Musselman
It's possible you could write a post-processing step to find the closest point to the centroid based on the distance property if I'm recalling it correctly. On Mon, Jul 20, 2015 at 6:45 PM, Ankit Goel ankitgoel2...@gmail.com wrote: That kind of puts me in a tough position. I was planning to use
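A post-processing step of this kind could look roughly like the sketch below. It assumes the centroid and the clustered document vectors have already been exported into plain Python structures; the names `closest_to_centroid`, `centroid`, and `articles` are hypothetical, and Euclidean distance stands in for whatever distance measure was used during clustering.

```python
import math

def closest_to_centroid(centroid, vectors, distance=math.dist):
    """Return the id of the vector nearest to `centroid`.

    vectors:  dict mapping document id -> vector (sequence of floats)
    distance: callable taking two vectors; ideally the same measure
              that was used for clustering (assumed Euclidean here)
    """
    return min(vectors, key=lambda doc_id: distance(centroid, vectors[doc_id]))

# Hypothetical cluster: one centroid and three member articles.
centroid = (1.0, 1.0)
articles = {
    "article-a": (0.9, 1.2),
    "article-b": (3.0, 3.0),
    "article-c": (1.5, 0.2),
}
print(closest_to_centroid(centroid, articles))  # article-a
```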

Re: Kmeans clusterdump Interpretation

2015-07-20 Thread Andrew Musselman
I'm not sure centroid id is even a defined thing, especially since the centroid, in my understanding, is just a point in space, not necessarily a point in your data. Are you trying to find the most-central point in a given cluster? On Mon, Jul 20, 2015 at 5:18 PM, Ankit Goel
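A small sketch of why a centroid need not be one of the data points, assuming the centroid is computed as the component-wise mean of the cluster members. The function name `mean_centroid` and the sample points are made up for illustration.

```python
def mean_centroid(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(points)
    return tuple(sum(component) / n for component in zip(*points))

# Hypothetical cluster of three 2-D points.
points = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0)]
c = mean_centroid(points)
print(c)            # roughly (0.667, 0.667)
print(c in points)  # False: the centroid is not a member of the data
```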

Partial Solr Index Clustering

2015-07-20 Thread Ankit Goel
Hi, I was wondering if it's possible to use only part of a Solr index for clustering. For example, my crawler updates my Solr index every hour with new documents, and I just want to cluster those new documents, not the old ones. If I was programming normally, I could query Solr for the latest

Re: Kmeans clusterdump Interpretation

2015-07-20 Thread Ankit Goel
Hmm, kmeans algorithmically is supposed to only anoint existing vectors (documents) as the centroid for a cluster every step (or so I believe). If Mahout is generating a non-document vector as a centroid, it changes a lot of things. That would also explain the -distanceMeasure option in

Re: Kmeans clusterdump Interpretation

2015-07-20 Thread Ted Dunning
You can always just pick the article closest to the centroid. But I think that you may find that with simple k-means, clusters are going to be about more than one thing. On Mon, Jul 20, 2015 at 8:21 PM, Ankit Goel ankitgoel2...@gmail.com wrote: Hmm, kmeans algorithmically is supposed to

Re: Kmeans clusterdump Interpretation

2015-07-20 Thread Ankit Goel
True that. Kmeans is just a first step anyways. Definitely needs tuning. Thanks guys. On Tue, Jul 21, 2015 at 9:46 AM, Ted Dunning ted.dunn...@gmail.com wrote: You can always just pick the article closest to the centroid. But I think that you may find that with simple k-means, clusters are