Re: Column Similarities using DIMSUM fails with GC overhead limit exceeded

2015-03-02 Thread Sabarish Sasidharan
Thanks Debasish, Reza and Pat. In my case, I am doing an SVD and then
computing the similarities. So a rowSimilarities() would be a good fit;
looking forward to it.

In the meanwhile, I will see if I can further limit the number of similarities
computed in some other fashion, or use k-means instead, or a combination of
both. I have also been looking at Mahout's Spark-based similarity recommenders,
but I am not sure the row similarity would apply in my case, as my matrix is
pretty dense.

Regards
Sab



On Tue, Mar 3, 2015 at 7:11 AM, Pat Ferrel  wrote:

> Sab, not sure what you require for the similarity metric or your use case
> but you can also look at spark-rowsimilarity or spark-itemsimilarity
> (column-wise) here
> http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html.
> These are optimized for LLR-based “similarity”, which is very simple to
> calculate since it uses neither the item weight nor the entire row or
> column vector values. Downsampling is done by the number of values per
> column (or row) and by LLR strength. This keeps it to O(n).
>
> They run pretty fast and only use memory if you use the version that
> attaches application IDs to the rows and columns. Using
> SimilarityAnalysis.cooccurrence may help. It’s in the Spark/Scala part of
> Mahout.
>
> On Mar 2, 2015, at 12:56 PM, Reza Zadeh  wrote:
>
> Hi Sab,
> The current method is optimized for having many rows and few columns. In
> your case it is exactly the opposite. We are working on your case, tracked
> by this JIRA: https://issues.apache.org/jira/browse/SPARK-4823
> Your case is very common, so I will put some time into building it.
>
> In the meantime, if you're looking for groups of similar points, consider
> using K-means - it will get you clusters of similar rows with Euclidean
> distance.
>
> Best,
> Reza
>
>
> On Sun, Mar 1, 2015 at 9:36 PM, Sabarish Sasidharan <
> sabarish.sasidha...@manthan.com> wrote:
>
>>
>> Hi Reza
>>
>> I see that ((int, int), double) pairs are generated for any combination that
>> meets the criteria controlled by the threshold. But assuming a simple 1x10K
>> matrix, that means I would need at least 12GB of memory per executor for the
>> flat map, just for these pairs, excluding any other overhead. Is that correct?
>> How can we make this scale for even larger n (when m stays small), like
>> 100 x 5 million? One option is to use higher thresholds. Another is to use a
>> SparseVector to begin with. Are there any other optimizations I can take
>> advantage of?
>>
>>
>>
>>
>> ​Thanks
>> Sab
>>
>>
>
>


-- 

Architect - Big Data
Ph: +91 99805 99458

Manthan Systems | *Company of the year - Analytics (2014 Frost and Sullivan
India ICT)*


Re: Column Similarities using DIMSUM fails with GC overhead limit exceeded

2015-03-02 Thread Pat Ferrel
Sab, not sure what you require for the similarity metric or your use case but 
you can also look at spark-rowsimilarity or spark-itemsimilarity (column-wise) 
here http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html.
These are optimized for LLR-based “similarity”, which is very simple to
calculate since it uses neither the item weight nor the entire row or column
vector values. Downsampling is done by the number of values per column (or
row) and by LLR strength. This keeps it to O(n).
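
If you want to see what the LLR score itself boils down to, here is a rough
Scala sketch of Dunning's G^2 form of it (my own simplification for
illustration, not the actual Mahout source):

// LLR score from a 2x2 co-occurrence table: k11 = both items together,
// k12/k21 = one item without the other, k22 = neither item.
def xLogX(x: Long): Double = if (x == 0) 0.0 else x * math.log(x)

def entropy(counts: Long*): Double =
  xLogX(counts.sum) - counts.map(xLogX).sum

def llr(k11: Long, k12: Long, k21: Long, k22: Long): Double = {
  val rowEntropy    = entropy(k11 + k12, k21 + k22)
  val columnEntropy = entropy(k11 + k21, k12 + k22)
  val matrixEntropy = entropy(k11, k12, k21, k22)
  // max(0, ...) guards against tiny negative values from floating point
  math.max(0.0, 2.0 * (rowEntropy + columnEntropy - matrixEntropy))
}

llr(10, 5, 5, 10000)  // a strong association produces a large score

Note that the score only needs those four counts, which is why it stays so
cheap to compute.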

They run pretty fast and only use memory if you use the version that attaches 
application IDs to the rows and columns. Using SimilarityAnalysis.cooccurrence 
may help. It’s in the Spark/Scala part of Mahout.

On Mar 2, 2015, at 12:56 PM, Reza Zadeh  wrote:

Hi Sab,
The current method is optimized for having many rows and few columns. In your 
case it is exactly the opposite. We are working on your case, tracked by this 
JIRA: https://issues.apache.org/jira/browse/SPARK-4823 

Your case is very common, so I will put some time into building it.

In the meantime, if you're looking for groups of similar points, consider using 
K-means - it will get you clusters of similar rows with Euclidean distance.

Best,
Reza


On Sun, Mar 1, 2015 at 9:36 PM, Sabarish Sasidharan <
sabarish.sasidha...@manthan.com> wrote:

​Hi Reza
​​
I see that ((int, int), double) pairs are generated for any combination that
meets the criteria controlled by the threshold. But assuming a simple 1x10K
matrix, that means I would need at least 12GB of memory per executor for the
flat map, just for these pairs, excluding any other overhead. Is that correct?
How can we make this scale for even larger n (when m stays small), like
100 x 5 million? One option is to use higher thresholds. Another is to use a
SparseVector to begin with. Are there any other optimizations I can take
advantage of?




​Thanks
Sab





Re: Column Similarities using DIMSUM fails with GC overhead limit exceeded

2015-03-02 Thread Reza Zadeh
Hi Sab,
The current method is optimized for having many rows and few columns. In
your case it is exactly the opposite. We are working on your case, tracked
by this JIRA: https://issues.apache.org/jira/browse/SPARK-4823
Your case is very common, so I will put some time into building it.

In the meantime, if you're looking for groups of similar points, consider
using K-means - it will get you clusters of similar rows with Euclidean
distance.
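
For example, something along these lines (just a sketch on random data; k and
the number of iterations are placeholders you would tune):

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// One dense vector per row of your matrix (random data here for illustration)
val rows = sc.parallelize(1 to 30, 4).map { _ =>
  Vectors.dense(Array.fill(1000)(scala.math.random))
}.cache()

// Cluster the rows; k = 10 and 20 iterations are arbitrary placeholders
val model = KMeans.train(rows, 10, 20)

// Rows that land in the same cluster are similar in Euclidean distance
val clustered = rows.map(v => (model.predict(v), v))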

Best,
Reza


On Sun, Mar 1, 2015 at 9:36 PM, Sabarish Sasidharan <
sabarish.sasidha...@manthan.com> wrote:

> Hi Reza
>
> I see that ((int, int), double) pairs are generated for any combination that
> meets the criteria controlled by the threshold. But assuming a simple 1x10K
> matrix, that means I would need at least 12GB of memory per executor for the
> flat map, just for these pairs, excluding any other overhead. Is that correct?
> How can we make this scale for even larger n (when m stays small), like
> 100 x 5 million? One option is to use higher thresholds. Another is to use a
> SparseVector to begin with. Are there any other optimizations I can take
> advantage of?
>
> ​Thanks
> Sab
>
>


Re: Column Similarities using DIMSUM fails with GC overhead limit exceeded

2015-03-01 Thread Debasish Das
Column-based similarities work well if the number of columns is modest (10K,
100K; we actually scaled it to 1.5M columns, but that really stress tests the
shuffle and you need to tune the shuffle parameters)...You can either use
DIMSUM sampling or come up with your own threshold, based on your application,
that you apply in reduceByKey (you have to change the code to use combineByKey
and add your filters before shuffling the keys to the reducer)...
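
To make the two options concrete, a rough sketch against the public MLlib API
(assuming your data is already in a RowMatrix called mat; the 0.5 threshold is
just a placeholder you would tune for your application)...

import org.apache.spark.mllib.linalg.distributed.MatrixEntry

// Option 1: DIMSUM sampling -- pairs expected to fall below the threshold
// are dropped probabilistically inside the computation itself
val sampled = mat.columnSimilarities(0.5).entries

// Option 2: compute everything and filter the output afterwards; simpler,
// but the full set of pairs still goes through the shuffle (pushing the
// filter before the shuffle means changing the MLlib code as noted above)
val filtered = mat.columnSimilarities().entries
  .filter { case MatrixEntry(i, j, v) => v > 0.5 }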

The other variant you are mentioning is the row-based similarity flow, which
is tracked in the following JIRA, where I am interested in doing no shuffle
but using broadcast and mapPartitions. I will open up the PR soon, but it is
compute intensive and I am experimenting with BLAS optimizations...

https://issues.apache.org/jira/browse/SPARK-4823
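
Very roughly, the idea looks like the sketch below (my simplification, not the
actual PR; it assumes the normalized matrix is small enough to broadcast, and
all the names are placeholders)...

import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// Normalize a row to unit length so dot products become cosine similarities
def normalize(v: Vector): Array[Double] = {
  val a = v.toArray
  val norm = math.sqrt(a.map(x => x * x).sum)
  if (norm == 0.0) a else a.map(_ / norm)
}

// All-pairs row similarities with no shuffle: broadcast the normalized rows
// and do the dot products locally inside mapPartitions
def rowSimilarities(rows: RDD[(Long, Vector)]): RDD[((Long, Long), Double)] = {
  val normalized = rows.sparkContext.broadcast(
    rows.map { case (i, v) => (i, normalize(v)) }.collect())
  rows.mapPartitions { iter =>
    iter.flatMap { case (i, v) =>
      val vi = normalize(v)
      normalized.value.iterator.collect {
        case (j, vj) if j > i =>
          ((i, j), vi.zip(vj).map { case (x, y) => x * y }.sum)
      }
    }
  }
}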

Your case of 100 x 5 million (or the transpose of it), for example, is very
common in matrix factorization, where you have user factors and product
factors that will typically be a 5 million x 100 dense matrix, and you want
to compute user->user and item->item similarities...

You are right that sparsity helps, but you can't apply sparsity (for example,
picking top-K) before doing the dot products...so it is still a
compute-intensive operation...

On Sun, Mar 1, 2015 at 9:36 PM, Sabarish Sasidharan <
sabarish.sasidha...@manthan.com> wrote:

> Hi Reza
>
> I see that ((int, int), double) pairs are generated for any combination that
> meets the criteria controlled by the threshold. But assuming a simple 1x10K
> matrix, that means I would need at least 12GB of memory per executor for the
> flat map, just for these pairs, excluding any other overhead. Is that correct?
> How can we make this scale for even larger n (when m stays small), like
> 100 x 5 million? One option is to use higher thresholds. Another is to use a
> SparseVector to begin with. Are there any other optimizations I can take
> advantage of?
>
> ​Thanks
> Sab
>
>


Re: Column Similarities using DIMSUM fails with GC overhead limit exceeded

2015-03-01 Thread Sabarish Sasidharan
​Hi Reza
​​
I see that ((int, int), double) pairs are generated for any combination that
meets the criteria controlled by the threshold. But assuming a simple 1x10K
matrix, that means I would need at least 12GB of memory per executor for the
flat map, just for these pairs, excluding any other overhead. Is that correct?
How can we make this scale for even larger n (when m stays small), like
100 x 5 million? One option is to use higher thresholds. Another is to use a
SparseVector to begin with. Are there any other optimizations I can take
advantage of?

​Thanks
Sab


Re: Column Similarities using DIMSUM fails with GC overhead limit exceeded

2015-03-01 Thread Reza Zadeh
Hi Sab,
In this dense case, the output will contain 10,000 x 10,000 entries, i.e. 100
million doubles, which doesn't fit in 1GB with overheads.
For a dense matrix, columnSimilarities() scales quadratically in the number of
columns, so you need more memory across the cluster.
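
Back of the envelope, for the dense 10,000-column case:

// ~100 million doubles at 8 bytes each, before any tuple/JVM object
// overhead (which multiplies the footprint several times over)
val cols = 10000L
val rawMB = cols * cols * 8L / (1024 * 1024)  // ~760 MB of raw doubles alone
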
Reza


On Sun, Mar 1, 2015 at 7:06 PM, Sabarish Sasidharan <
sabarish.sasidha...@manthan.com> wrote:

> Sorry, I actually meant a 30 x 10,000 matrix (missed a 0)
>
>
> Regards
> Sab
>
>


Re: Column Similarities using DIMSUM fails with GC overhead limit exceeded

2015-03-01 Thread Sabarish Sasidharan
Sorry, I actually meant a 30 x 10,000 matrix (missed a 0)


Regards
Sab


Re: Column Similarities using DIMSUM fails with GC overhead limit exceeded

2015-03-01 Thread Reza Zadeh
Hi Sabarish,

Works fine for me with less than those settings (30x1000 dense matrix, 1GB
driver, 1GB executor):

bin/spark-shell --driver-memory 1G --executor-memory 1G

Then running the following finished without trouble and in a few seconds.
Are you sure your driver is actually getting the RAM you think you gave it?

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// Create a 30x1000 dense matrix of random values, in 4 partitions
val rows = sc.parallelize(1 to 30, 4).map { line =>
  val values = Array.tabulate(1000)(x => scala.math.random)
  Vectors.dense(values)
}.cache()
val mat = new RowMatrix(rows)

// Compute similar columns perfectly, with brute force.
val exact = mat.columnSimilarities().entries.map(x => x.value).sum()



On Sun, Mar 1, 2015 at 3:31 PM, Sabarish Sasidharan <
sabarish.sasidha...@manthan.com> wrote:

> I am trying to compute column similarities on a 30x1000 RowMatrix of
> DenseVectors. The size of the input RDD is 3.1MB and it's all in one
> partition. I am running on a single node with 15G, giving the driver 1G
> and the executor 9G. This is on single-node Hadoop. In the first attempt
> the BlockManager doesn't respond within the heartbeat interval. In the
> second attempt I am seeing a GC overhead limit exceeded error, and it is
> almost always in RowMatrix.columnSimilaritiesDIMSUM ->
> mapPartitionsWithIndex (line 570)
>
> java.lang.OutOfMemoryError: GC overhead limit exceeded
> at
> org.apache.spark.mllib.linalg.distributed.RowMatrix$$anonfun$19$$anonfun$apply$2.apply(RowMatrix.scala:570)
> at
> org.apache.spark.mllib.linalg.distributed.RowMatrix$$anonfun$19$$anonfun$apply$2.apply(RowMatrix.scala:528)
> at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
>
>
> It also really seems to be running out of memory. I am seeing the
> following in the attempt log
> Heap
>  PSYoungGen  total 2752512K, used 2359296K
>   eden space 2359296K, 100% used
>   from space 393216K, 0% used
>   to   space 393216K, 0% used
>  ParOldGen   total 6291456K, used 6291376K [0x00058000,
> 0x0007, 0x0007)
>   object space 6291456K, 99% used
>  Metaspace   used 39225K, capacity 39558K, committed 39904K, reserved
> 1083392K
>   class space  used 5736K, capacity 5794K, committed 5888K, reserved
> 1048576K
>
> What could be going wrong?
>
> Regards
> Sab
>


Column Similarities using DIMSUM fails with GC overhead limit exceeded

2015-03-01 Thread Sabarish Sasidharan
I am trying to compute column similarities on a 30x1000 RowMatrix of
DenseVectors. The size of the input RDD is 3.1MB and it's all in one
partition. I am running on a single node with 15G, giving the driver 1G
and the executor 9G. This is on single-node Hadoop. In the first attempt
the BlockManager doesn't respond within the heartbeat interval. In the
second attempt I am seeing a GC overhead limit exceeded error, and it is
almost always in RowMatrix.columnSimilaritiesDIMSUM ->
mapPartitionsWithIndex (line 570)

java.lang.OutOfMemoryError: GC overhead limit exceeded
at
org.apache.spark.mllib.linalg.distributed.RowMatrix$$anonfun$19$$anonfun$apply$2.apply(RowMatrix.scala:570)
at
org.apache.spark.mllib.linalg.distributed.RowMatrix$$anonfun$19$$anonfun$apply$2.apply(RowMatrix.scala:528)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)


It also really seems to be running out of memory. I am seeing the following
in the attempt log
Heap
 PSYoungGen  total 2752512K, used 2359296K
  eden space 2359296K, 100% used
  from space 393216K, 0% used
  to   space 393216K, 0% used
 ParOldGen   total 6291456K, used 6291376K [0x00058000,
0x0007, 0x0007)
  object space 6291456K, 99% used
 Metaspace   used 39225K, capacity 39558K, committed 39904K, reserved
1083392K
  class space  used 5736K, capacity 5794K, committed 5888K, reserved
 1048576K

What could be going wrong?

Regards
Sab