[jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2015-03-05 Thread Andrew Palumbo (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14349780#comment-14349780
 ] 

Andrew Palumbo commented on MAHOUT-1464:


We can close this, right?

> Cooccurrence Analysis on Spark
> --
>
> Key: MAHOUT-1464
> URL: https://issues.apache.org/jira/browse/MAHOUT-1464
> Project: Mahout
>  Issue Type: Improvement
>  Components: Collaborative Filtering
> Environment: hadoop, spark
>Reporter: Pat Ferrel
>Assignee: Pat Ferrel
>  Labels: DSL, scala, spark
> Fix For: 1.0
>
> Attachments: MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, 
> MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, run-spark-xrsj.sh
>
>
> Create a version of Cooccurrence Analysis (RowSimilarityJob with LLR) that 
> runs on Spark. This should be compatible with Mahout Spark DRM DSL so a DRM 
> can be used as input. 
> Ideally this would extend to cover MAHOUT-1422. This cross-cooccurrence has 
> several applications including cross-action recommendations. 
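The LLR scoring mentioned in the description operates on a 2x2 contingency table of event counts. A minimal sketch in plain Java, following Dunning's G-test formulation (an illustration only, not Mahout's LogLikelihood class):

```java
// Log-likelihood ratio (G^2) for a 2x2 cooccurrence table -- a sketch of the
// statistic named in the issue, not Mahout's actual LogLikelihood implementation.
public class Llr {
  static double xLogX(double x) { return x == 0.0 ? 0.0 : x * Math.log(x); }

  // "raw" (N-scaled) entropy of a set of counts
  static double entropy(double... counts) {
    double sum = 0.0, sumXLogX = 0.0;
    for (double c : counts) { sum += c; sumXLogX += xLogX(c); }
    return xLogX(sum) - sumXLogX;
  }

  // k11 = both events, k12/k21 = exactly one event, k22 = neither
  public static double llr(double k11, double k12, double k21, double k22) {
    double row = entropy(k11 + k12, k21 + k22);
    double col = entropy(k11 + k21, k12 + k22);
    double mat = entropy(k11, k12, k21, k22);
    double g = 2.0 * (row + col - mat);
    return g < 0.0 ? 0.0 : g; // clamp round-off error below zero
  }

  public static void main(String[] args) {
    System.out.println(llr(10, 0, 0, 10));   // strong association: ~27.73
    System.out.println(llr(10, 10, 10, 10)); // independent events: ~0.0
  }
}
```

Independent counts score near zero while strongly associated counts score high, which is what makes the statistic usable as a similarity filter in a RowSimilarityJob-style cooccurrence computation.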



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-07-13 Thread Ted Dunning (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14060237#comment-14060237
 ] 

Ted Dunning commented on MAHOUT-1464:

{quote}
If A is the primary self-similarity matrix we want to do A'A with, and B is the 
secondary matrix, we want to do A'B with it since for use in a recommender we 
want rows of the cooccurrence to be IDed by the columns of A (items in A), 
right?
{quote}

Yes, I think I understand the question.

A'B should have row ids that are the column ids of A, and column ids that are the column ids of B.
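A tiny worked example of that id mapping, in plain Java arrays (my own illustration; the real computation happens on DRMs):

```java
// Shape/id check for A'B: with A of size (users x itemsOfA) and B of size
// (users x itemsOfB), A'B is (itemsOfA x itemsOfB) -- row ids come from A's
// columns, column ids from B's columns. Plain-array sketch, not Mahout code.
public class AtBShape {
  // c = A'B, computed with one pass over the shared user rows
  public static double[][] atb(double[][] a, double[][] b) {
    int users = a.length, itemsA = a[0].length, itemsB = b[0].length;
    double[][] c = new double[itemsA][itemsB];
    for (int u = 0; u < users; u++)
      for (int i = 0; i < itemsA; i++)
        for (int j = 0; j < itemsB; j++)
          c[i][j] += a[u][i] * b[u][j];
    return c;
  }

  public static void main(String[] args) {
    double[][] a = {{1, 0}, {0, 1}, {1, 1}};           // 3 users x 2 items of A
    double[][] b = {{1, 0, 1}, {0, 1, 0}, {1, 1, 0}};  // 3 users x 3 items of B
    double[][] c = atb(a, b);
    System.out.println(c.length + " x " + c[0].length); // prints "2 x 3"
  }
}
```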





[jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-06-27 Thread Pat Ferrel (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14046423#comment-14046423
 ] 

Pat Ferrel commented on MAHOUT-1464:


Finally got a Spark cluster working and this does run on it, also using HDFS for I/O.

I think there is more to do here, but I'll treat the rest as enhancements and close this now.



[jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-06-14 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14031751#comment-14031751
 ] 

ASF GitHub Bot commented on MAHOUT-1464:


Github user tdunning commented on a diff in the pull request:

https://github.com/apache/mahout/pull/18#discussion_r13783816
  
--- Diff: math-scala/src/main/scala/org/apache/mahout/math/scalabindings/MatrixOps.scala ---
@@ -188,8 +188,8 @@ object MatrixOps {
     def apply(f: Vector): Double = f.sum
   }
 
-  private def vectorCountFunc = new VectorFunction {
-    def apply(f: Vector): Double = f.aggregate(Functions.PLUS, Functions.greater(0))
+  private def vectorCountNonZeroElementsFunc = new VectorFunction {
+    def apply(f: Vector): Double = f.aggregate(Functions.PLUS, Functions.notEqual(0))
--- End diff --

The issue I have is with the rowAggregation and columnAggregation API: it enforces row-by-row evaluation. A map-reduce API could evaluate in many different orders, could iterate by rows or by columns for either aggregation, and wouldn't require a custom VectorFunction for simple aggregations.
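Ted's point can be sketched in plain Java (an illustration of the idea, not a proposed Mahout API): a single row-ordered sweep can still produce per-column aggregates, so the traversal order need not follow the aggregation axis.

```java
// Per-column non-zero counts computed by iterating rows -- no per-column
// VectorFunction, and the iteration order is independent of the result axis.
public class RowWiseAgg {
  public static double[] nonZeroPerColumn(double[][] m) {
    double[] counts = new double[m[0].length];
    for (double[] row : m)                   // sweep row by row...
      for (int j = 0; j < row.length; j++)
        if (row[j] != 0.0) counts[j] += 1.0; // ...accumulating into columns
    return counts;
  }

  public static void main(String[] args) {
    double[][] m = {{2, 3, 4}, {3, 4, 5}, {-5, 0, -1}, {0, 0, 1}};
    System.out.println(java.util.Arrays.toString(nonZeroPerColumn(m))); // [3.0, 2.0, 4.0]
  }
}
```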




[jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-06-14 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14031707#comment-14031707
 ] 

ASF GitHub Bot commented on MAHOUT-1464:


Github user pferrel commented on a diff in the pull request:

https://github.com/apache/mahout/pull/18#discussion_r13782871
  
--- Diff: math-scala/src/main/scala/org/apache/mahout/math/scalabindings/MatrixOps.scala ---
@@ -188,8 +188,8 @@ object MatrixOps {
     def apply(f: Vector): Double = f.sum
   }
 
-  private def vectorCountFunc = new VectorFunction {
-    def apply(f: Vector): Double = f.aggregate(Functions.PLUS, Functions.greater(0))
+  private def vectorCountNonZeroElementsFunc = new VectorFunction {
+    def apply(f: Vector): Double = f.aggregate(Functions.PLUS, Functions.notEqual(0))
--- End diff --

Used Sebastian's getNumNonZeroElements here. Is your issue with Dmitriy's suggestion? This is only for in-core matrices; the code used for DRMs is in SparkEngine, which accumulates using the nonZero iterator on row Vectors.




[jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-06-14 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14031632#comment-14031632
 ] 

ASF GitHub Bot commented on MAHOUT-1464:


Github user tdunning commented on a diff in the pull request:

https://github.com/apache/mahout/pull/18#discussion_r13782027
  
--- Diff: math-scala/src/main/scala/org/apache/mahout/math/scalabindings/MatrixOps.scala ---
@@ -188,8 +188,8 @@ object MatrixOps {
     def apply(f: Vector): Double = f.sum
   }
 
-  private def vectorCountFunc = new VectorFunction {
-    def apply(f: Vector): Double = f.aggregate(Functions.PLUS, Functions.greater(0))
+  private def vectorCountNonZeroElementsFunc = new VectorFunction {
+    def apply(f: Vector): Double = f.aggregate(Functions.PLUS, Functions.notEqual(0))
--- End diff --

Using the vector aggregation framework will be very inefficient here.  We 
should either use Seb's suggestion or add properly scalable aggregation that 
doesn't depend on getting a vector view of a column.




[jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-06-14 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14031621#comment-14031621
 ] 

Hudson commented on MAHOUT-1464:


FAILURE: Integrated in Mahout-Quality #2655 (See 
[https://builds.apache.org/job/Mahout-Quality/2655/])
MAHOUT-1464 fixed bug counting only positive column elements, now counts all 
non-zero (pat) closes apache/mahout#18 (pat: rev 
c20eee89c6cc669494cf7edbb80255a83e194a15)
* math/src/main/java/org/apache/mahout/math/function/Functions.java
* spark/src/test/scala/org/apache/mahout/sparkbindings/drm/RLikeDrmOpsSuite.scala
* math-scala/src/main/scala/org/apache/mahout/math/scalabindings/MatrixOps.scala
* spark/src/main/scala/org/apache/mahout/sparkbindings/SparkEngine.scala
* spark/src/test/scala/org/apache/mahout/cf/CooccurrenceAnalysisSuite.scala
* math-scala/src/test/scala/org/apache/mahout/math/scalabindings/MatrixOpsSuite.scala




[jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-06-14 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14031619#comment-14031619
 ] 

ASF GitHub Bot commented on MAHOUT-1464:


Github user asfgit closed the pull request at:

https://github.com/apache/mahout/pull/18




[jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-06-14 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14031613#comment-14031613
 ] 

ASF GitHub Bot commented on MAHOUT-1464:


Github user pferrel commented on a diff in the pull request:

https://github.com/apache/mahout/pull/18#discussion_r13781773
  
--- Diff: math/src/main/java/org/apache/mahout/math/function/Functions.java ---
@@ -1393,6 +1393,17 @@ public double apply(double a) {
     };
   }
 
+  /** Constructs a function that returns a != b ? 1 : 0. a is a variable, b is fixed. */
+  public static DoubleFunction notEqual(final double b) {
+    return new DoubleFunction() {
+
+      @Override
+      public double apply(double a) {
+        return a != b ? 1 : 0;
--- End diff --

Whenever I modify a mature file that someone else created, my general rule is to stay with the style of the collective authors. I agree that 1.0 and 0.0 are better here, but I'm hesitant to change it when 1 and 0 are used throughout the file, and I don't want to change it everywhere. There is probably more chance of me accidentally breaking something than actually fixing something if I change the whole file. If this seems wrong, let me know, but in past jobs we did this to avoid constant thrash over minor style disagreements.




[jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-06-14 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14031608#comment-14031608
 ] 

ASF GitHub Bot commented on MAHOUT-1464:


Github user pferrel commented on a diff in the pull request:

https://github.com/apache/mahout/pull/18#discussion_r13781567
  
--- Diff: math-scala/src/main/scala/org/apache/mahout/math/scalabindings/MatrixOps.scala ---
@@ -188,8 +188,8 @@ object MatrixOps {
     def apply(f: Vector): Double = f.sum
   }
 
-  private def vectorCountFunc = new VectorFunction {
-    def apply(f: Vector): Double = f.aggregate(Functions.PLUS, Functions.greater(0))
+  private def vectorCountNonZeroElementsFunc = new VectorFunction {
+    def apply(f: Vector): Double = f.aggregate(Functions.PLUS, Functions.notEqual(0))
--- End diff --

Nice. I didn't look deeply enough to see that f is the column vector. I'll change that.

While we are at it: I now know that A'A (which is the slim calc?) doesn't really compute A'. If you do the same for two different matrices, `B.t %*% A`, does B.t ever get checkpointed?




[jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-06-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14031292#comment-14031292
 ] 

ASF GitHub Bot commented on MAHOUT-1464:


Github user dlyubimov commented on a diff in the pull request:

https://github.com/apache/mahout/pull/18#discussion_r13775208
  
--- Diff: math-scala/src/main/scala/org/apache/mahout/math/scalabindings/MatrixOps.scala ---
@@ -188,8 +188,8 @@ object MatrixOps {
     def apply(f: Vector): Double = f.sum
   }
 
-  private def vectorCountFunc = new VectorFunction {
-    def apply(f: Vector): Double = f.aggregate(Functions.PLUS, Functions.greater(0))
+  private def vectorCountNonZeroElementsFunc = new VectorFunction {
+    def apply(f: Vector): Double = f.aggregate(Functions.PLUS, Functions.notEqual(0))
--- End diff --

I mean, shouldn't this specialized version be more efficient than an aggregate:

    def apply(f: Vector): Double = f.getNumNonZeroElements().toDouble
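The trade-off Dmitriy is pointing at, sketched in plain Java (the arrays stand in for a sparse vector's storage; this is not Mahout's Vector API):

```java
// Aggregate-style counting touches every element; a sparse vector can count
// non-zeros from its populated slots alone, skipping the zeros entirely.
public class NonZeroCount {
  // the aggregate(PLUS, notEqual(0)) pattern: maps then folds all n elements
  public static double countViaAggregate(double[] dense) {
    double acc = 0.0;
    for (double x : dense) acc += (x != 0.0) ? 1.0 : 0.0;
    return acc;
  }

  // the specialized version: only the populated slots are visited
  public static double countDirect(double[] storedValues) {
    int n = 0;
    for (double v : storedValues) if (v != 0.0) n++; // guard against stored zeros
    return n;
  }

  public static void main(String[] args) {
    double[] dense = {2, 0, -1, 0, 5};
    double[] stored = {2, -1, 5};  // only the populated slots of the same vector
    System.out.println(countViaAggregate(dense) == countDirect(stored)); // true
  }
}
```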




[jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-06-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14031187#comment-14031187
 ] 

ASF GitHub Bot commented on MAHOUT-1464:


Github user dlyubimov commented on a diff in the pull request:

https://github.com/apache/mahout/pull/18#discussion_r13772128
  
--- Diff: math/src/main/java/org/apache/mahout/math/function/Functions.java ---
@@ -1393,6 +1393,17 @@ public double apply(double a) {
     };
   }
 
+  /** Constructs a function that returns a != b ? 1 : 0. a is a variable, b is fixed. */
+  public static DoubleFunction notEqual(final double b) {
+    return new DoubleFunction() {
+
+      @Override
+      public double apply(double a) {
+        return a != b ? 1 : 0;
--- End diff --

Just relaying some historical discussion in Mahout: this once created a bug in my own Mahout commit, which sparked the discussion about constant formatting.




[jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-06-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14031177#comment-14031177
 ] 

ASF GitHub Bot commented on MAHOUT-1464:


Github user pferrel commented on a diff in the pull request:

https://github.com/apache/mahout/pull/18#discussion_r13771890
  
--- Diff: spark/src/test/scala/org/apache/mahout/cf/CooccurrenceAnalysisSuite.scala ---
@@ -118,8 +118,8 @@ class CooccurrenceAnalysisSuite extends FunSuite with MahoutSuite with MahoutLoc
   }
 
   test("cooccurrence [A'A], [B'A] integer data using LLR") {
-    val a = dense((1000, 10, 0, 0, 0), (0, 0, 1, 10, 0), (0, 0, 0, 0, 100), (1, 0, 0, 1000, 0))
-    val b = dense((100, 1000, 1, 1, 0), (1, 1000, 100, 10, 0), (0, 0, 10, 0, 100), (10, 100, 0, 1000, 0))
+    val a = dense((1000, 10, 0, 0, 0), (0, 0, -1, 10, 0), (0, 0, 0, 0, 100), (1, 0, 0, 1000, 0))
--- End diff --

We may want to check for illegal values at some place in the pipeline. This 
is here so I don't forget. At present a negative value is legal. If we make it 
illegal I want this to fail.




[jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-06-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14031165#comment-14031165
 ] 

ASF GitHub Bot commented on MAHOUT-1464:


Github user pferrel commented on a diff in the pull request:

https://github.com/apache/mahout/pull/18#discussion_r13771632
  
--- Diff: math-scala/src/main/scala/org/apache/mahout/math/scalabindings/MatrixOps.scala ---
@@ -188,8 +188,8 @@ object MatrixOps {
     def apply(f: Vector): Double = f.sum
   }
 
-  private def vectorCountFunc = new VectorFunction {
-    def apply(f: Vector): Double = f.aggregate(Functions.PLUS, Functions.greater(0))
+  private def vectorCountNonZeroElementsFunc = new VectorFunction {
+    def apply(f: Vector): Double = f.aggregate(Functions.PLUS, Functions.notEqual(0))
--- End diff --

This is creating a Vector of non-zero counts per column, just like colSums is summing the columns' values. The function is simply needed by aggregateColumns. If you are suggesting another way to do this, you'll have to be more explicit.

#17 has nothing to do with that, afaict. It is about finding non-zero elements in a Vector.




[jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-06-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14031166#comment-14031166
 ] 

ASF GitHub Bot commented on MAHOUT-1464:


Github user pferrel commented on a diff in the pull request:

https://github.com/apache/mahout/pull/18#discussion_r13771652
  
--- Diff: math/src/main/java/org/apache/mahout/math/function/Functions.java ---
@@ -1393,6 +1393,17 @@ public double apply(double a) {
     };
   }
 
+  /** Constructs a function that returns a != b ? 1 : 0. a is a variable, b is fixed. */
+  public static DoubleFunction notEqual(final double b) {
+    return new DoubleFunction() {
+
+      @Override
+      public double apply(double a) {
+        return a != b ? 1 : 0;
--- End diff --

Taken from "equal" in the same file. Changed one character. But I'll note 
the point.




[jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-06-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14031157#comment-14031157
 ] 

ASF GitHub Bot commented on MAHOUT-1464:


Github user dlyubimov commented on a diff in the pull request:

https://github.com/apache/mahout/pull/18#discussion_r13771395
  
--- Diff: spark/src/test/scala/org/apache/mahout/cf/CooccurrenceAnalysisSuite.scala ---
@@ -118,8 +118,8 @@ class CooccurrenceAnalysisSuite extends FunSuite with MahoutSuite with MahoutLoc
   }
 
   test("cooccurrence [A'A], [B'A] integer data using LLR") {
-    val a = dense((1000, 10, 0, 0, 0), (0, 0, 1, 10, 0), (0, 0, 0, 0, 100), (1, 0, 0, 1000, 0))
-    val b = dense((100, 1000, 1, 1, 0), (1, 1000, 100, 10, 0), (0, 0, 10, 0, 100), (10, 100, 0, 1000, 0))
+    val a = dense((1000, 10, 0, 0, 0), (0, 0, -1, 10, 0), (0, 0, 0, 0, 100), (1, 0, 0, 1000, 0))
--- End diff --

Not sure if cooccurrence test changes like that are necessary.




[jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-06-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14031145#comment-14031145
 ] 

ASF GitHub Bot commented on MAHOUT-1464:


Github user dlyubimov commented on a diff in the pull request:

https://github.com/apache/mahout/pull/18#discussion_r13771023
  
--- Diff: math-scala/src/main/scala/org/apache/mahout/math/scalabindings/MatrixOps.scala ---
@@ -188,8 +188,8 @@ object MatrixOps {
     def apply(f: Vector): Double = f.sum
   }
 
-  private def vectorCountFunc = new VectorFunction {
-    def apply(f: Vector): Double = f.aggregate(Functions.PLUS, Functions.greater(0))
+  private def vectorCountNonZeroElementsFunc = new VectorFunction {
+    def apply(f: Vector): Double = f.aggregate(Functions.PLUS, Functions.notEqual(0))
--- End diff --

Hm, isn't that made obsolete by Sebastian's PR #17?




[jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-06-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14031138#comment-14031138
 ] 

ASF GitHub Bot commented on MAHOUT-1464:


Github user dlyubimov commented on a diff in the pull request:

https://github.com/apache/mahout/pull/18#discussion_r13770924
  
--- Diff: math-scala/src/test/scala/org/apache/mahout/math/scalabindings/MatrixOpsSuite.scala ---
@@ -123,4 +123,16 @@ class MatrixOpsSuite extends FunSuite with MahoutSuite {
 
   }
 
+  test("numNonZeroElementsPerColumn") {
+    val a = dense(
+      (2, 3, 4),
+      (3, 4, 5),
+      (-5, 0, -1),
+      (0, 0, 1)
+    )
+
+    a.numNonZeroElementsPerColumn() should equal(dvec(3,2,4))
--- End diff --

style: spacing?




[jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-06-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14031135#comment-14031135
 ] 

ASF GitHub Bot commented on MAHOUT-1464:


Github user dlyubimov commented on a diff in the pull request:

https://github.com/apache/mahout/pull/18#discussion_r13770844
  
--- Diff: math/src/main/java/org/apache/mahout/math/function/Functions.java 
---
@@ -1393,6 +1393,17 @@ public double apply(double a) {
 };
   }
 
+  /** Constructs a function that returns a != b ? 1 : 0. 
a is a variable, b is fixed. */
+  public static DoubleFunction notEqual(final double b) {
+return new DoubleFunction() {
+
+  @Override
+  public double apply(double a) {
+return a != b ? 1 : 0;
--- End diff --

Since you are returning doubles, correct style is to say 1.0 or 0.0 not 1 
or 0
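Applying the reviewer's style suggestion, here is a hedged, standalone sketch of the factory method using double literals, with `java.util.function.DoubleUnaryOperator` standing in for Mahout's `DoubleFunction` so the snippet compiles without Mahout on the classpath. The sum-of-indicator loop mimics the `aggregate(Functions.PLUS, Functions.notEqual(0))` idiom from the diff above.

```java
import java.util.function.DoubleUnaryOperator;

public class NotEqualSketch {
    // Sketch of Functions.notEqual(b): returns a function mapping
    // a -> (a != b ? 1.0 : 0.0), using double literals as the review suggests.
    static DoubleUnaryOperator notEqual(final double b) {
        return a -> a != b ? 1.0 : 0.0;
    }

    // Counting non-zero entries of a vector by PLUS-folding notEqual(0) over it,
    // mirroring the aggregate(Functions.PLUS, Functions.notEqual(0)) idiom.
    static double numNonZeroElements(double[] vector) {
        double count = 0.0;
        DoubleUnaryOperator indicator = notEqual(0.0);
        for (double v : vector) {
            count += indicator.applyAsDouble(v);
        }
        return count;
    }

    public static void main(String[] args) {
        // Negative entries must be counted too -- the bug this PR fixes.
        double[] column = {2.0, 3.0, -5.0, 0.0};
        System.out.println(numNonZeroElements(column)); // prints 3.0
    }
}
```

Note the `a != b` comparison makes no positivity assumption, which is exactly why negative entries are counted correctly.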




[jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-06-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14031133#comment-14031133
 ] 

ASF GitHub Bot commented on MAHOUT-1464:


Github user dlyubimov commented on the pull request:

https://github.com/apache/mahout/pull/18#issuecomment-46057614
  
Bringing up a second PR from the same branch. You really just need to rebase 
the changes over the current master. This may not merge well.




[jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-06-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14030848#comment-14030848
 ] 

ASF GitHub Bot commented on MAHOUT-1464:


Github user pferrel commented on the pull request:

https://github.com/apache/mahout/pull/18#issuecomment-46035704
  
Accounting for possible negative values in the columns of a DRM.

The DRM case was a simple fix, but for the in-core case Functions.java was 
modified to include a "notEqual(value)" function. There may be some other way 
to do this, but it is a trivial function and the intent is now rather obvious. 




[jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-06-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14030843#comment-14030843
 ] 

ASF GitHub Bot commented on MAHOUT-1464:


GitHub user pferrel opened a pull request:

https://github.com/apache/mahout/pull/18

MAHOUT-1464

The numNonZeroElementsPerColumn additions did not account for negative 
values; they counted only the positive non-zero values. Fixed this in both the 
in-core and distributed cases.

I added Functions.notEqual to Functions.java. It may be possible to express 
this with the existing functions, but it wasn't obvious, so I wrote a new one. 
The test is in MatrixOpsSuite, where it is used.

The distributed case was much simpler.

Changed tests to include negative values.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/pferrel/mahout mahout-1464

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/mahout/pull/18.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #18


commit 107a0ba9605241653a85b113661a8fa5c055529f
Author: pferrel 
Date:   2014-06-04T19:54:22Z

added Sebastian's CooccurrenceAnalysis patch updated it to use current 
Mahout-DSL

commit 16c03f7fa73c156859d1dba3a333ef9e8bf922b0
Author: pferrel 
Date:   2014-06-04T21:32:18Z

added Sebastian's MurmurHash changes

Signed-off-by: pferrel 

commit c6adaa44c80bba99d41600e260bbb1ad5c972e69
Author: pferrel 
Date:   2014-06-05T16:52:23Z

MAHOUT-1464 import cleanup, minor changes to examples for running on Spark 
Cluster

commit 1d66e5726e71e297ef4a7a27331463ba363098c0
Author: pferrel 
Date:   2014-06-06T20:19:32Z

scalatest for cooccurrence cross and self along with other 
CooccurrenceAnalysis methods

commit 766db0f9e7feb70520fbd444afcb910788f01e76
Author: pferrel 
Date:   2014-06-06T20:20:46Z

Merge branch 'master' into mahout-1464

commit e492976688cb8860354bb20a362d370405f560e1
Author: pferrel 
Date:   2014-06-06T20:50:07Z

cleaned up test comments

commit a49692eb1664de4b15de1864b95701a6410c80c8
Author: pferrel 
Date:   2014-06-06T21:09:55Z

got those cursed .DS_Stores out of the branch and put an exclude in 
.gitignore

commit 268290d28d4f83cc47a7e6baebc5eb4c53d7c8da
Author: pferrel 
Date:   2014-06-07T21:50:04Z

Merge branch 'master' into mahout-1464

commit 63b10704390e18f513cca30596b1d25e146a6edd
Author: pferrel 
Date:   2014-06-08T15:26:36Z

Merge branch 'master' into mahout-1464

commit ac00d7655c4cba5f6c6dcb4882be95656b17a834
Author: pferrel 
Date:   2014-06-09T14:11:43Z

Merge branch 'master' into mahout-1464

commit fb008efeae3d5f6f6ba350fbc2ef3944da1dcaef
Author: pferrel 
Date:   2014-06-12T02:17:27Z

added 'colCounts' to a drm using the SparkEngine and MatrixOps, which, when 
used in cooccurrence, fixes the problem with non-boolean preference values

commit 5b04cb31403e2521d9874ad5e14f28cd0af26c26
Author: pferrel 
Date:   2014-06-12T02:18:29Z

Merge branch 'master' into mahout-1464

commit e451a2a596f5ceda8d1b4990e97ad3d5673fdb5f
Author: pferrel 
Date:   2014-06-12T16:02:26Z

fixed some things from Dmitriy's comments, the primary one being that the 
SparkEngine accumulator was doing >= 0 instead of > 0

commit 411e0e92b4721626b736d66c292926fa4fdbb530
Author: pferrel 
Date:   2014-06-12T17:43:21Z

changing the name of drm.colCounts to drm.getNumNonZeroElements

commit 9655fd70f69ed97eb2d6765928a0a1f7dd760281
Author: pferrel 
Date:   2014-06-12T18:32:03Z

meant to say changing drm.colCounts to drm.numNonZeroElementsPerColumn

commit a2001375d46c5946b671f89f5a7cff2e6a094ea8
Author: pferrel 
Date:   2014-06-12T18:34:32Z

Merge branch 'master' into mahout-1464

commit 2db06b5566c8dcccb382733613b2fab6c223b5de
Author: pferrel 
Date:   2014-06-12T18:51:54Z

typo

commit 0b689b8b879c4ac03b71cf504a9d0d78ffa6bfa5
Author: pferrel 
Date:   2014-06-12T20:03:45Z

clean up test

commit 32afbe5e552ab94979dd545d14cda17ebc9c018e
Author: pferrel 
Date:   2014-06-12T23:42:08Z

one more fat finger error

commit b91e5e98c47829a5cc099289f83e99e6bf317dd6
Author: pferrel 
Date:   2014-06-13T16:18:33Z

did not account for negative values in the purely mathematical MatrixOps 
and SparkEngine version of numNonZeroElementsPerColumn so fixed this and added 
to tests

commit 9f6fd902f95c7daf687ecb59698f78217dbf6b6b
Author: pferrel 
Date:   2014-06-13T16:43:46Z

merging master to run new tests





[jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-06-12 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14030125#comment-14030125
 ] 

Hudson commented on MAHOUT-1464:


SUCCESS: Integrated in Mahout-Quality #2653 (See 
[https://builds.apache.org/job/Mahout-Quality/2653/])
MAHOUT-1464 Cooccurrence Analysis on Spark (pat) closes apache/mahout#12 (pat: 
rev c1ca30872c622e513e49fc1bb111bc4b8a527d3b)
* spark/src/test/scala/org/apache/mahout/cf/CooccurrenceAnalysisSuite.scala
* math-scala/src/main/scala/org/apache/mahout/math/drm/DistributedEngine.scala
* spark/src/main/scala/org/apache/mahout/cf/CooccurrenceAnalysis.scala
* spark/src/main/scala/org/apache/mahout/sparkbindings/SparkEngine.scala
* math-scala/src/main/scala/org/apache/mahout/math/scalabindings/MatrixOps.scala
* math/src/main/java/org/apache/mahout/math/MurmurHash.java
* math-scala/src/main/scala/org/apache/mahout/math/drm/CheckpointedOps.scala
* 
spark/src/test/scala/org/apache/mahout/sparkbindings/drm/RLikeDrmOpsSuite.scala
* spark/src/main/scala/org/apache/mahout/sparkbindings/package.scala
* 
math-scala/src/test/scala/org/apache/mahout/math/scalabindings/MatrixOpsSuite.scala
* CHANGELOG




[jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-06-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14030109#comment-14030109
 ] 

ASF GitHub Bot commented on MAHOUT-1464:


Github user asfgit closed the pull request at:

https://github.com/apache/mahout/pull/12




[jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-06-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14029631#comment-14029631
 ] 

ASF GitHub Bot commented on MAHOUT-1464:


Github user tdunning commented on the pull request:

https://github.com/apache/mahout/pull/12#issuecomment-45934491
  
I hate abbreviations.  If you are asking about naming, use the long name.

If you can assure binary, then going with what we already have would be
nice.



On Thu, Jun 12, 2014 at 10:34 AM, Pat Ferrel 
wrote:

> numNonZeroElementsPerColumn? vs colSums?
>
> OK
>
> —
> Reply to this email directly or view it on GitHub
> .
>




[jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-06-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14029474#comment-14029474
 ] 

ASF GitHub Bot commented on MAHOUT-1464:


Github user pferrel commented on the pull request:

https://github.com/apache/mahout/pull/12#issuecomment-45923020
  
numNonZeroElementsPerColumn? vs colSums?

OK




[jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-06-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14029469#comment-14029469
 ] 

ASF GitHub Bot commented on MAHOUT-1464:


Github user pferrel commented on a diff in the pull request:

https://github.com/apache/mahout/pull/12#discussion_r13714354
  
--- Diff: 
spark/src/main/scala/org/apache/mahout/cf/CooccurrenceAnalysis.scala ---
@@ -0,0 +1,214 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.mahout.cf
+
+import org.apache.mahout.math._
+import scalabindings._
+import RLikeOps._
+import drm._
+import RLikeDrmOps._
+import org.apache.mahout.sparkbindings._
+import scala.collection.JavaConversions._
+import org.apache.mahout.math.stats.LogLikelihood
+import collection._
+import org.apache.mahout.common.RandomUtils
+import org.apache.mahout.math.function.{VectorFunction, Functions}
+
+
+/**
+ * based on "Ted Dunning & Ellen Friedman: Practical Machine Learning, 
Innovations in Recommendation",
+ * available at http://www.mapr.com/practical-machine-learning
+ *
+ * see also "Sebastian Schelter, Christoph Boden, Volker Markl:
+ * Scalable Similarity-Based Neighborhood Methods with MapReduce
+ * ACM Conference on Recommender Systems 2012"
+ */
+object CooccurrenceAnalysis extends Serializable {
+
+  /** Compares (Int,Double) pairs by the second value */
+  private val orderByScore = Ordering.fromLessThan[(Int, Double)] { case 
((_, score1), (_, score2)) => score1 > score2}
+
+  def cooccurrences(drmARaw: DrmLike[Int], randomSeed: Int = 0xdeadbeef, 
maxInterestingItemsPerThing: Int = 50,
+maxNumInteractions: Int = 500, drmBs: 
Array[DrmLike[Int]] = Array()): List[DrmLike[Int]] = {
+
+implicit val distributedContext = drmARaw.context
+
+// Apply selective downsampling, pin resulting matrix
+val drmA = sampleDownAndBinarize(drmARaw, randomSeed, 
maxNumInteractions)
+
+// num users, which equals the maximum number of interactions per item
+val numUsers = drmA.nrow.toInt
+
+// Compute & broadcast the number of interactions per thing in A
+val bcastInteractionsPerItemA = drmBroadcast(drmA.colCounts)
+
+// Compute co-occurrence matrix A'A
+val drmAtA = drmA.t %*% drmA
+
+// Compute loglikelihood scores and sparsify the resulting matrix to 
get the indicator matrix
+val drmIndicatorsAtA = computeIndicators(drmAtA, numUsers, 
maxInterestingItemsPerThing, bcastInteractionsPerItemA,
+  bcastInteractionsPerItemA, crossCooccurrence = false)
+
+var indicatorMatrices = List(drmIndicatorsAtA)
+
+// Now look at cross-co-occurrences
+for (drmBRaw <- drmBs) {
+  // Down-sample and pin other interaction matrix
+  val drmB = sampleDownAndBinarize(drmBRaw, randomSeed, 
maxNumInteractions).checkpoint()
+
+  // Compute & broadcast the number of interactions per thing in B
+  val bcastInteractionsPerThingB = drmBroadcast(drmB.colCounts)
+
+  // Compute cross-co-occurrence matrix B'A
+  val drmBtA = drmB.t %*% drmA
+
+  val drmIndicatorsBtA = computeIndicators(drmBtA, numUsers, 
maxInterestingItemsPerThing,
+bcastInteractionsPerThingB, bcastInteractionsPerItemA)
+
+  indicatorMatrices = indicatorMatrices :+ drmIndicatorsBtA
+
+  drmB.uncache()
+}
+
+// Unpin downsampled interaction matrix
+drmA.uncache()
+
+// Return list of indicator matrices
+indicatorMatrices
+  }
+
+  /**
+   * Compute loglikelihood ratio
+   * see http://tdunning.blogspot.de/2008/03/surprise-and-coincidence.html 
for details
+   **/
+  def loglikelihoodRatio(numInteractionsWithA: Long, numInteractionsWithB: 
Long,
+ numInteractionsWithAandB: Long, numInteractions: 
Long) = {
+
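The archive cuts the diff off before the body of loglikelihoodRatio. For reference, the log-likelihood ratio described in Ted Dunning's linked post is conventionally computed from a 2x2 contingency table of counts via un-normalized entropies. The following is a standalone sketch of that formulation (an assumption for illustration, not Mahout's actual `LogLikelihood` class):

```java
public class LlrSketch {
    // x * ln(x), with the convention 0 * ln(0) = 0.
    static double xLogX(long x) {
        return x == 0 ? 0.0 : x * Math.log(x);
    }

    // Un-normalized Shannon entropy of a list of counts.
    static double entropy(long... counts) {
        long sum = 0;
        double result = 0.0;
        for (long x : counts) {
            result += xLogX(x);
            sum += x;
        }
        return xLogX(sum) - result;
    }

    // G^2 statistic for the 2x2 table:
    //   k11 = interactions with both A and B
    //   k12 = interactions with A only
    //   k21 = interactions with B only
    //   k22 = interactions with neither
    static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
        double rowEntropy = entropy(k11 + k12, k21 + k22);
        double columnEntropy = entropy(k11 + k21, k12 + k22);
        double matrixEntropy = entropy(k11, k12, k21, k22);
        // Clamp tiny negative values caused by floating-point rounding.
        return Math.max(0.0, 2.0 * (rowEntropy + columnEntropy - matrixEntropy));
    }

    public static void main(String[] args) {
        // Perfectly correlated events score high; independent events score 0.
        System.out.println(logLikelihoodRatio(10, 0, 0, 10)); // large (~27.7)
        System.out.println(logLikelihoodRatio(1, 1, 1, 1));   // 0.0
    }
}
```

A high score marks an "anomalously" frequent co-occurrence, which is what computeIndicators keeps when it sparsifies A'A into the indicator matrix.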
 

[jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-06-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14029464#comment-14029464
 ] 

ASF GitHub Bot commented on MAHOUT-1464:


Github user pferrel commented on a diff in the pull request:

https://github.com/apache/mahout/pull/12#discussion_r13714291
  
--- Diff: 
spark/src/main/scala/org/apache/mahout/cf/CooccurrenceAnalysis.scala ---
@@ -0,0 +1,214 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.mahout.cf
+
+import org.apache.mahout.math._
+import scalabindings._
+import RLikeOps._
+import drm._
+import RLikeDrmOps._
+import org.apache.mahout.sparkbindings._
+import scala.collection.JavaConversions._
+import org.apache.mahout.math.stats.LogLikelihood
+import collection._
+import org.apache.mahout.common.RandomUtils
+import org.apache.mahout.math.function.{VectorFunction, Functions}
+
+
+/**
+ * based on "Ted Dunning & Ellen Friedman: Practical Machine Learning, 
Innovations in Recommendation",
+ * available at http://www.mapr.com/practical-machine-learning
+ *
+ * see also "Sebastian Schelter, Christoph Boden, Volker Markl:
+ * Scalable Similarity-Based Neighborhood Methods with MapReduce
+ * ACM Conference on Recommender Systems 2012"
+ */
+object CooccurrenceAnalysis extends Serializable {
+
+  /** Compares (Int,Double) pairs by the second value */
+  private val orderByScore = Ordering.fromLessThan[(Int, Double)] { case 
((_, score1), (_, score2)) => score1 > score2}
+
+  def cooccurrences(drmARaw: DrmLike[Int], randomSeed: Int = 0xdeadbeef, 
maxInterestingItemsPerThing: Int = 50,
+maxNumInteractions: Int = 500, drmBs: 
Array[DrmLike[Int]] = Array()): List[DrmLike[Int]] = {
+
+implicit val distributedContext = drmARaw.context
+
+// Apply selective downsampling, pin resulting matrix
+val drmA = sampleDownAndBinarize(drmARaw, randomSeed, 
maxNumInteractions)
+
+// num users, which equals the maximum number of interactions per item
+val numUsers = drmA.nrow.toInt
+
+// Compute & broadcast the number of interactions per thing in A
+val bcastInteractionsPerItemA = drmBroadcast(drmA.colCounts)
--- End diff --

colCounts, or whatever we call it, is just as efficient, is distributed, and 
tells the reader what the important value is. 




[jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-06-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14029446#comment-14029446
 ] 

ASF GitHub Bot commented on MAHOUT-1464:


Github user sscdotopen commented on a diff in the pull request:

https://github.com/apache/mahout/pull/12#discussion_r13713968
  
--- Diff: 
spark/src/main/scala/org/apache/mahout/cf/CooccurrenceAnalysis.scala ---
@@ -0,0 +1,214 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.mahout.cf
+
+import org.apache.mahout.math._
+import scalabindings._
+import RLikeOps._
+import drm._
+import RLikeDrmOps._
+import org.apache.mahout.sparkbindings._
+import scala.collection.JavaConversions._
+import org.apache.mahout.math.stats.LogLikelihood
+import collection._
+import org.apache.mahout.common.RandomUtils
+import org.apache.mahout.math.function.{VectorFunction, Functions}
+
+
+/**
+ * based on "Ted Dunning & Ellen Friedman: Practical Machine Learning, 
Innovations in Recommendation",
+ * available at http://www.mapr.com/practical-machine-learning
+ *
+ * see also "Sebastian Schelter, Christoph Boden, Volker Markl:
+ * Scalable Similarity-Based Neighborhood Methods with MapReduce
+ * ACM Conference on Recommender Systems 2012"
+ */
+object CooccurrenceAnalysis extends Serializable {
+
+  /** Compares (Int,Double) pairs by the second value */
+  private val orderByScore = Ordering.fromLessThan[(Int, Double)] { case 
((_, score1), (_, score2)) => score1 > score2}
+
+  def cooccurrences(drmARaw: DrmLike[Int], randomSeed: Int = 0xdeadbeef, 
maxInterestingItemsPerThing: Int = 50,
+maxNumInteractions: Int = 500, drmBs: 
Array[DrmLike[Int]] = Array()): List[DrmLike[Int]] = {
+
+implicit val distributedContext = drmARaw.context
+
+// Apply selective downsampling, pin resulting matrix
+val drmA = sampleDownAndBinarize(drmARaw, randomSeed, 
maxNumInteractions)
+
+// num users, which equals the maximum number of interactions per item
+val numUsers = drmA.nrow.toInt
+
+// Compute & broadcast the number of interactions per thing in A
+val bcastInteractionsPerItemA = drmBroadcast(drmA.colCounts)
+
+// Compute co-occurrence matrix A'A
+val drmAtA = drmA.t %*% drmA
+
+// Compute loglikelihood scores and sparsify the resulting matrix to 
get the indicator matrix
+val drmIndicatorsAtA = computeIndicators(drmAtA, numUsers, 
maxInterestingItemsPerThing, bcastInteractionsPerItemA,
+  bcastInteractionsPerItemA, crossCooccurrence = false)
+
+var indicatorMatrices = List(drmIndicatorsAtA)
+
+// Now look at cross-co-occurrences
+for (drmBRaw <- drmBs) {
+  // Down-sample and pin other interaction matrix
+  val drmB = sampleDownAndBinarize(drmBRaw, randomSeed, 
maxNumInteractions).checkpoint()
+
+  // Compute & broadcast the number of interactions per thing in B
+  val bcastInteractionsPerThingB = drmBroadcast(drmB.colCounts)
--- End diff --

drmB is already binary here, so we could use colSums
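The point of this review thread: after sampleDownAndBinarize the matrix holds only 0.0 and 1.0, so a plain column sum already equals the per-column non-zero count, while a matrix with raw (possibly negative) values needs the `a != 0` test. A minimal sketch of that equivalence, with plain Java arrays standing in for a DRM (an illustrative assumption; the real operators are distributed):

```java
public class ColSumsSketch {
    // Column sums of a row-major matrix.
    static double[] colSums(double[][] m) {
        double[] sums = new double[m[0].length];
        for (double[] row : m) {
            for (int j = 0; j < row.length; j++) {
                sums[j] += row[j];
            }
        }
        return sums;
    }

    // Per-column count of entries different from zero (handles negatives).
    static double[] numNonZeroElementsPerColumn(double[][] m) {
        double[] counts = new double[m[0].length];
        for (double[] row : m) {
            for (int j = 0; j < row.length; j++) {
                counts[j] += row[j] != 0.0 ? 1.0 : 0.0;
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // On a binarized 0/1 matrix the two agree, so the cheaper colSums suffices.
        double[][] binary = {{1, 0, 1}, {0, 1, 1}, {1, 1, 0}};
        System.out.println(java.util.Arrays.equals(
            colSums(binary), numNonZeroElementsPerColumn(binary))); // true

        // With raw values they differ, which is why the general operator must
        // count via a != 0 rather than summing (the matrix from MatrixOpsSuite).
        double[][] raw = {{2, 3, 4}, {3, 4, 5}, {-5, 0, -1}, {0, 0, 1}};
        System.out.println(java.util.Arrays.toString(
            numNonZeroElementsPerColumn(raw))); // [3.0, 2.0, 4.0]
    }
}
```

The second matrix is the one from the numNonZeroElementsPerColumn test earlier in this thread, whose expected result is dvec(3, 2, 4).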



[jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-06-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14029445#comment-14029445
 ] 

ASF GitHub Bot commented on MAHOUT-1464:


Github user sscdotopen commented on a diff in the pull request:

https://github.com/apache/mahout/pull/12#discussion_r13713951
  
--- Diff: 
spark/src/main/scala/org/apache/mahout/cf/CooccurrenceAnalysis.scala ---
@@ -0,0 +1,214 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.mahout.cf
+
+import org.apache.mahout.math._
+import scalabindings._
+import RLikeOps._
+import drm._
+import RLikeDrmOps._
+import org.apache.mahout.sparkbindings._
+import scala.collection.JavaConversions._
+import org.apache.mahout.math.stats.LogLikelihood
+import collection._
+import org.apache.mahout.common.RandomUtils
+import org.apache.mahout.math.function.{VectorFunction, Functions}
+
+
+/**
+ * based on "Ted Dunning & Ellen Friedman: Practical Machine Learning, 
Innovations in Recommendation",
+ * available at http://www.mapr.com/practical-machine-learning
+ *
+ * see also "Sebastian Schelter, Christoph Boden, Volker Markl:
+ * Scalable Similarity-Based Neighborhood Methods with MapReduce
+ * ACM Conference on Recommender Systems 2012"
+ */
+object CooccurrenceAnalysis extends Serializable {
+
+  /** Compares (Int,Double) pairs by the second value */
+  private val orderByScore = Ordering.fromLessThan[(Int, Double)] { case 
((_, score1), (_, score2)) => score1 > score2}
+
+  def cooccurrences(drmARaw: DrmLike[Int], randomSeed: Int = 0xdeadbeef, 
maxInterestingItemsPerThing: Int = 50,
+maxNumInteractions: Int = 500, drmBs: 
Array[DrmLike[Int]] = Array()): List[DrmLike[Int]] = {
+
+implicit val distributedContext = drmARaw.context
+
+// Apply selective downsampling, pin resulting matrix
+val drmA = sampleDownAndBinarize(drmARaw, randomSeed, 
maxNumInteractions)
+
+// num users, which equals the maximum number of interactions per item
+val numUsers = drmA.nrow.toInt
+
+// Compute & broadcast the number of interactions per thing in A
+val bcastInteractionsPerItemA = drmBroadcast(drmA.colCounts)
--- End diff --

drmA is already binary here, so we could use colSums
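To illustrate the point: once a matrix has been binarized to 0/1 entries, the per-column sum and the per-column non-zero count coincide, so colSums would do the same job. A minimal in-memory sketch (plain arrays, not the DRM API):

```scala
// For a binarized (0/1) matrix, each column's sum equals its non-zero count,
// so colSums is an equivalent (and cheaper) substitute for colCounts here.
object ColSumsSketch {
  def colSums(m: Array[Array[Double]]): Array[Double] =
    m.reduce((a, b) => a.zip(b).map { case (x, y) => x + y })

  def colCounts(m: Array[Array[Double]]): Array[Double] =
    m.map(_.map(x => if (x != 0.0) 1.0 else 0.0))
      .reduce((a, b) => a.zip(b).map { case (x, y) => x + y })
}
```

On a binarized interaction matrix the two functions return identical vectors, which is exactly why the cheaper one suffices after sampleDownAndBinarize.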


> Cooccurrence Analysis on Spark
> --
>
> Key: MAHOUT-1464
> URL: https://issues.apache.org/jira/browse/MAHOUT-1464
> Project: Mahout
>  Issue Type: Improvement
>  Components: Collaborative Filtering
> Environment: hadoop, spark
>Reporter: Pat Ferrel
>Assignee: Pat Ferrel
> Fix For: 1.0
>
> Attachments: MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, 
> MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, run-spark-xrsj.sh
>
>
> Create a version of Cooccurrence Analysis (RowSimilarityJob with LLR) that 
> runs on Spark. This should be compatible with Mahout Spark DRM DSL so a DRM 
> can be used as input. 
> Ideally this would extend to cover MAHOUT-1422. This cross-cooccurrence has 
> several applications including cross-action recommendations. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-06-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14029449#comment-14029449
 ] 

ASF GitHub Bot commented on MAHOUT-1464:


Github user sscdotopen commented on a diff in the pull request:

https://github.com/apache/mahout/pull/12#discussion_r13714024
  
--- Diff: 
spark/src/main/scala/org/apache/mahout/cf/CooccurrenceAnalysis.scala ---
@@ -0,0 +1,214 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.mahout.cf
+
+import org.apache.mahout.math._
+import scalabindings._
+import RLikeOps._
+import drm._
+import RLikeDrmOps._
+import org.apache.mahout.sparkbindings._
+import scala.collection.JavaConversions._
+import org.apache.mahout.math.stats.LogLikelihood
+import collection._
+import org.apache.mahout.common.RandomUtils
+import org.apache.mahout.math.function.{VectorFunction, Functions}
+
+
+/**
+ * based on "Ted Dunnning & Ellen Friedman: Practical Machine Learning, 
Innovations in Recommendation",
+ * available at http://www.mapr.com/practical-machine-learning
+ *
+ * see also "Sebastian Schelter, Christoph Boden, Volker Markl:
+ * Scalable Similarity-Based Neighborhood Methods with MapReduce
+ * ACM Conference on Recommender Systems 2012"
+ */
+object CooccurrenceAnalysis extends Serializable {
+
+  /** Compares (Int,Double) pairs by the second value */
+  private val orderByScore = Ordering.fromLessThan[(Int, Double)] { case 
((_, score1), (_, score2)) => score1 > score2}
+
+  def cooccurrences(drmARaw: DrmLike[Int], randomSeed: Int = 0xdeadbeef, 
maxInterestingItemsPerThing: Int = 50,
+maxNumInteractions: Int = 500, drmBs: 
Array[DrmLike[Int]] = Array()): List[DrmLike[Int]] = {
+
+implicit val distributedContext = drmARaw.context
+
+// Apply selective downsampling, pin resulting matrix
+val drmA = sampleDownAndBinarize(drmARaw, randomSeed, 
maxNumInteractions)
+
+// num users, which equals the maximum number of interactions per item
+val numUsers = drmA.nrow.toInt
+
+// Compute & broadcast the number of interactions per thing in A
+val bcastInteractionsPerItemA = drmBroadcast(drmA.colCounts)
+
+// Compute co-occurrence matrix A'A
+val drmAtA = drmA.t %*% drmA
+
+// Compute loglikelihood scores and sparsify the resulting matrix to 
get the indicator matrix
+val drmIndicatorsAtA = computeIndicators(drmAtA, numUsers, 
maxInterestingItemsPerThing, bcastInteractionsPerItemA,
+  bcastInteractionsPerItemA, crossCooccurrence = false)
+
+var indicatorMatrices = List(drmIndicatorsAtA)
+
+// Now look at cross-co-occurrences
+for (drmBRaw <- drmBs) {
+  // Down-sample and pin other interaction matrix
+  val drmB = sampleDownAndBinarize(drmBRaw, randomSeed, 
maxNumInteractions).checkpoint()
+
+  // Compute & broadcast the number of interactions per thing in B
+  val bcastInteractionsPerThingB = drmBroadcast(drmB.colCounts)
+
+  // Compute cross-co-occurrence matrix B'A
+  val drmBtA = drmB.t %*% drmA
+
+  val drmIndicatorsBtA = computeIndicators(drmBtA, numUsers, 
maxInterestingItemsPerThing,
+bcastInteractionsPerThingB, bcastInteractionsPerItemA)
+
+  indicatorMatrices = indicatorMatrices :+ drmIndicatorsBtA
+
+  drmB.uncache()
+}
+
+// Unpin downsampled interaction matrix
+drmA.uncache()
+
+// Return list of indicator matrices
+indicatorMatrices
+  }
+
+  /**
+   * Compute loglikelihood ratio
+   * see http://tdunning.blogspot.de/2008/03/surprise-and-coincidence.html 
for details
+   **/
+  def loglikelihoodRatio(numInteractionsWithA: Long, numInteractionsWithB: 
Long,
+ numInteractionsWithAandB: Long, numInteractions: 
Long) = {
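The body that follows this truncated signature computes Dunning's G² test. A self-contained sketch of that computation (the entropy formulation from the linked blog post; the mapping from the interaction counts in the signature to the 2x2 contingency table is spelled out in comments, and this is a sketch rather than Mahout's exact code):

```scala
// LLR (G^2) over a 2x2 contingency table, entropy formulation:
//   k11 = interactions with both A and B  = numInteractionsWithAandB
//   k12 = interactions with A but not B   = numInteractionsWithA - k11
//   k21 = interactions with B but not A   = numInteractionsWithB - k11
//   k22 = interactions with neither       = numInteractions - k11 - k12 - k21
object LlrSketch {
  private def xLogX(x: Long): Double =
    if (x == 0L) 0.0 else x * math.log(x.toDouble)

  // "entropy" here is unnormalized: n*log(n) - sum(x_i * log(x_i))
  private def entropy(elems: Long*): Double =
    xLogX(elems.sum) - elems.map(xLogX).sum

  def logLikelihoodRatio(k11: Long, k12: Long, k21: Long, k22: Long): Double = {
    val rowEntropy    = entropy(k11 + k12, k21 + k22)
    val columnEntropy = entropy(k11 + k21, k12 + k22)
    val matrixEntropy = entropy(k11, k12, k21, k22)
    2.0 * (rowEntropy + columnEntropy - matrixEntropy)
  }
}
```

A perfectly independent table (all counts proportional) scores 0, while strongly associated counts score high, which is what makes the score usable for sparsifying A'A into an indicator matrix.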

[jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-06-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14029441#comment-14029441
 ] 

ASF GitHub Bot commented on MAHOUT-1464:


Github user sscdotopen commented on the pull request:

https://github.com/apache/mahout/pull/12#issuecomment-45921381
  
I think the name _colCounts_ is misleading; we should stick to something 
like numNonZeroElementsPerColumn or similar. Not sure here.




Re: [jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-06-12 Thread Pat Ferrel
Not sure how this relates to the PR.

If you look here you can see all the PR files and diffs from master. Comments 
can be attached to the files in question.
https://github.com/apache/mahout/pull/12/files

iterateNonZero is not in question afaik, and is used in a couple places. If 
someone wants to write an alternative I’ll be happy to change things.

On Jun 12, 2014, at 10:06 AM, Sebastian Schelter  wrote:

Ok, but the current implementation still gives the correct number, as it checks 
for accidental zeros.

I think we should add some custom implementations here to not have to go 
through the non-zeroes iterator.

--sebastian





Re: [jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-06-12 Thread Ted Dunning
The reason is that sparse implementations may have recorded a non-zero that
later got assigned a zero, but they didn't bother to remove the memory cell.




On Thu, Jun 12, 2014 at 9:50 AM, Sebastian Schelter  wrote:

> I'm a bit lost in this discussion. Why do we assume that
> getNumNonZeroElements() on a Vector only returns an upper bound? The code
> in AbstractVector clearly returns the non-zeros only:
>
> int count = 0;
> Iterator<Element> it = iterateNonZero();
> while (it.hasNext()) {
>   if (it.next().get() != 0.0) {
>     count++;
>   }
> }
> return count;
>
> On the other hand, the internal code seems broken here, why does
> iterateNonZero potentially return 0's?
>
> --sebastian
>
>
>
>
>
>


Re: [jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-06-12 Thread Sebastian Schelter
Ok, but the current implementation still gives the correct number, as it 
checks for accidental zeros.


I think we should add some custom implementations here to not have to go 
through the non-zeroes iterator.


--sebastian

On 06/12/2014 07:00 PM, Ted Dunning wrote:

The reason is that sparse implementations may have recorded a non-zero that
later got assigned a zero, but they didn't bother to remove the memory cell.














Re: [jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-06-12 Thread Pat Ferrel
The SparkEngine colCounts function was checking for >= 0. But because it was 
iterating over non-zeros it never got an == 0, so it was a bug that didn't 
surface. It's already been fixed.

The primary question at present is: what should we call colCounts? Currently it 
is used in cooccurrence:

val bcastInteractionsPerItemA = drmBroadcast(drmA.colCounts)

Dmitriy wanted you to see if this fits R-Like semantics and suggest an 
alternative, if possible. I was commenting on the possible Java related naming 
so ignore any misstatements.
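For concreteness, what colCounts computes for drmA above is the number of non-zero entries in each column, i.e. the number of users who interacted with each item. Sketched on an in-memory matrix (a hypothetical stand-in, not the actual SparkEngine implementation):

```scala
// Count the non-zero entries in each column of a (rows x cols) matrix.
// On a user-by-item interaction matrix this is "interactions per item".
object ColCountsSketch {
  def colCounts(m: Array[Array[Double]]): Array[Double] = {
    val counts = new Array[Double](m.head.length)
    for (row <- m; i <- row.indices if row(i) != 0.0) counts(i) += 1.0
    counts
  }
}
```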

On Jun 12, 2014, at 9:50 AM, Sebastian Schelter  wrote:

I'm a bit lost in this discussion. Why do we assume that 
getNumNonZeroElements() on a Vector only returns an upper bound? The code in 
AbstractVector clearly returns the non-zeros only:

    int count = 0;
    Iterator<Element> it = iterateNonZero();
    while (it.hasNext()) {
      if (it.next().get() != 0.0) {
        count++;
      }
    }
    return count;

On the other hand, the internal code seems broken here, why does iterateNonZero 
potentially return 0's?

--sebastian









Re: [jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-06-12 Thread Sebastian Schelter
I'm a bit lost in this discussion. Why do we assume that 
getNumNonZeroElements() on a Vector only returns an upper bound? The 
code in AbstractVector clearly returns the non-zeros only:


int count = 0;
Iterator<Element> it = iterateNonZero();
while (it.hasNext()) {
  if (it.next().get() != 0.0) {
    count++;
  }
}
return count;

On the other hand, the internal code seems broken here, why does 
iterateNonZero potentially return 0's?


--sebastian
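To make Ted's answer concrete: a sparse implementation that does not reclaim a cell when its value is set back to 0.0 will still hand that cell to iterateNonZero, which is why getNumNonZeroElements has to re-check each value. A toy sketch (a hypothetical SparseVec, not Mahout's classes):

```scala
import scala.collection.mutable

// Toy sparse vector: set() never removes a cell, even when the value is 0.0,
// mimicking sparse implementations that leave "accidental zeros" in storage.
class SparseVec {
  private val cells = mutable.LinkedHashMap.empty[Int, Double]
  def set(i: Int, v: Double): Unit = cells(i) = v
  // May yield entries whose value is 0.0 -- the iterator's name is only an upper bound.
  def iterateNonZero: Iterator[(Int, Double)] = cells.iterator
  // The safe count re-checks each stored value, as AbstractVector does.
  def getNumNonZeroElements: Int = cells.valuesIterator.count(_ != 0.0)
}
```

Setting a previously non-zero cell back to 0.0 leaves a stale cell behind, so the iterator yields one more entry than the value-checked count.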










[jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-06-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14029374#comment-14029374
 ] 

ASF GitHub Bot commented on MAHOUT-1464:


Github user pferrel commented on the pull request:

https://github.com/apache/mahout/pull/12#issuecomment-45917234
  
I already fixed the header.

I agree with Ted; this is kinda what functional programming is for. The reason I 
didn't use the Java aggregate is that it wasn't distributed. Still probably 
beyond this ticket. I'll refactor if a Scala journeyman wants to provide a 
general mechanism. I'm still on training wheels.

This still needs to be tested in a distributed Spark+HDFS environment and 
MAHOUT-1561 will make testing easy. I'd be happy to merge this and move on, 
which will have the side effect of testing larger datasets and clusters.

If someone wants to test this now on a Spark+HDFS cluster, please do!




[jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-06-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14029345#comment-14029345
 ] 

ASF GitHub Bot commented on MAHOUT-1464:


Github user dlyubimov commented on the pull request:

https://github.com/apache/mahout/pull/12#issuecomment-45915940
  
fix header to say MAHOUT-1464, then hit close and reopen, it will restart 
the echo.




[jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-06-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14029339#comment-14029339
 ] 

ASF GitHub Bot commented on MAHOUT-1464:


Github user dlyubimov commented on a diff in the pull request:

https://github.com/apache/mahout/pull/12#discussion_r13711381
  
--- Diff: 
math-scala/src/main/scala/org/apache/mahout/math/scalabindings/MatrixOps.scala 
---
@@ -188,4 +188,8 @@ object MatrixOps {
 def apply(f: Vector): Double = f.sum
   }
 
+  private def vectorCountFunc = new VectorFunction {
+def apply(f: Vector): Double = f.aggregate(Functions.PLUS, 
Functions.greater(0))
+  }
+
 }
--- End diff --

It looks like it, to me. I don't have time to look in depth, but the distributed 
code definitely counts positives with an explicit inline conditional > 0.
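The aggregate(Functions.PLUS, Functions.greater(0)) call in the diff maps each element through greater(0) (1.0 when x > 0, else 0.0) and folds the mapped values with +, i.e. it counts the positive entries. Sketched on a plain array (not Mahout's Vector API):

```scala
object AggregateSketch {
  // Fold map(x) over all elements with the combiner -- the shape of
  // Vector.aggregate(combiner, mapper) used by vectorCountFunc above.
  def aggregate(v: Array[Double])(combine: (Double, Double) => Double)(map: Double => Double): Double =
    v.iterator.map(map).reduce(combine)

  // Functions.greater(0): 1.0 when x > 0, else 0.0
  val greater0: Double => Double = x => if (x > 0) 1.0 else 0.0
}
```

Structural zeros contribute 0.0 to the sum, which matches the "explicit inline conditional > 0" in the distributed code.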




[jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-06-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14029344#comment-14029344
 ] 

ASF GitHub Bot commented on MAHOUT-1464:


Github user dlyubimov commented on a diff in the pull request:

https://github.com/apache/mahout/pull/12#discussion_r13711414
  
--- Diff: 
math-scala/src/main/scala/org/apache/mahout/math/scalabindings/MatrixOps.scala 
---
@@ -188,4 +188,8 @@ object MatrixOps {
 def apply(f: Vector): Double = f.sum
   }
 
+  private def vectorCountFunc = new VectorFunction {
+def apply(f: Vector): Double = f.aggregate(Functions.PLUS, 
Functions.greater(0))
+  }
+
 }
--- End diff --

It is very easy to tweak the tests to check, though, if in doubt.




[jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-06-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14029315#comment-14029315
 ] 

ASF GitHub Bot commented on MAHOUT-1464:


Github user tdunning commented on the pull request:

https://github.com/apache/mahout/pull/12#issuecomment-45913357
  
This discussion isn't getting echoed to the mailing list.  I didn't even 
know it was happening.

I think that a non-zero counter is nice, but it would be better to have a 
more general aggregator. We already have two instances of this pattern and 
there will be more (sum of the absolute values is common).

Why not implement a general aggregator? This is different from our current 
aggregateColumns because that function is not parallelizable.

Something like def columnAggregator(combiner, mapper) is what I am aiming 
for.  Positive counter would be m.columnAggregator(_ + _, _ > 0)

 




[jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-06-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14029310#comment-14029310
 ] 

ASF GitHub Bot commented on MAHOUT-1464:


Github user pferrel commented on the pull request:

https://github.com/apache/mahout/pull/12#issuecomment-45912084
  
Awaiting Sebastian's take on the naming of 'colCounts' to better fit R-like 
semantics.




[jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-06-11 Thread Dmitriy Lyubimov (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14028835#comment-14028835
 ] 

Dmitriy Lyubimov commented on MAHOUT-1464:
--

put comments on PR. 

BTW, in order for PR comments to mirror in JIRA, you need to use the phrase 
MAHOUT-1464 in the PR name, not Mahout 146e or Mahout-1646 (that's the way the 
ASF bot apparently works).



[jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-06-11 Thread Pat Ferrel (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14028805#comment-14028805
 ] 

Pat Ferrel commented on MAHOUT-1464:


Check the code and let me know if there are problems. It uses a Spark 
accumulator Vector to keep track of the non-zero column counts. Accumulators 
seem like a nice simplification.

Point #2: I can still only read about 50% of D's code and can only keep in my 
head about 10% of the overlapping traits and classes. Not concerned with 
authorship, just correctness.




[jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-06-11 Thread Ted Dunning (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14028773#comment-14028773
 ] 

Ted Dunning commented on MAHOUT-1464:
-

Should there be a dedicated colCounts function, or a more general accumulator?

Basically, a row-by-row or column-by-column map-reduce aggregator is a common 
thing to need.  This is different from the aggregateColumns we have now, since 
what we have now requires access to the entire column.

What I would be more interested in would be something like

{code}
  Vector r = v.aggregateByRows(DoubleDoubleFunction combine, DoubleFunction map)
{code}

The virtue here is that iteration by rows is an efficient way to handle 
row-major arrangements, but iteration by column works as well:

{code}
  for (MatrixSlice row : m) {
    for (int i = 0; i < columns; i++) {
      r.setQuick(i, combine.apply(r.getQuick(i), map.apply(row.getQuick(i))));
    }
  }
{code}

or

{code}
  for (MatrixSlice col : m.columnIterator()) {
    r.setQuick(col.index(), col.aggregate(combine, map));
  }
{code}

These are approximate and we don't really have a columnIterator, but you can 
imagine how some kinds of matrices would have such a thing internally.  You can 
also see how trivial these would be to parallelize.  Arrangements which have 
row-wise patches of column-major data would also be easy to handle by combining 
these patterns.
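The proposed pattern can be made concrete with plain Java arrays and `java.util.function` interfaces standing in for Mahout's `DoubleDoubleFunction`/`DoubleFunction` (a sketch of the idea, not the Mahout API):

```java
import java.util.function.DoubleBinaryOperator;
import java.util.function.DoubleUnaryOperator;

public class AggregateDemo {
    // Column-wise aggregate computed in a single row-major pass:
    // r[i] = combine(r[i], map(m[row][i])) for each row in turn.
    // r starts at zeros, so combine must treat 0 as its identity (true for sum).
    static double[] aggregateByRows(double[][] m,
                                    DoubleBinaryOperator combine,
                                    DoubleUnaryOperator map) {
        double[] r = new double[m[0].length];
        for (double[] row : m) {
            for (int i = 0; i < row.length; i++) {
                r[i] = combine.applyAsDouble(r[i], map.applyAsDouble(row[i]));
            }
        }
        return r;
    }
}
```

For example, per-column non-zero counts are `aggregateByRows(m, Double::sum, x -> x != 0.0 ? 1.0 : 0.0)` — the same count discussed for downsampling, without ever materializing a column.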







[jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-06-11 Thread Ted Dunning (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14028775#comment-14028775
 ] 

Ted Dunning commented on MAHOUT-1464:
-

{quote}
Scary adding to Dmitriy's code though so I'll invite him to look at it. Added a 
couple tests but I don't see many for SparkEngine.
{quote}

We don't have author tags.  This is *our* code now and we all have to feel a 
bit of ownership and entitlement.  Go for it.



[jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-06-11 Thread Pat Ferrel (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14028760#comment-14028760
 ] 

Pat Ferrel commented on MAHOUT-1464:


Ok, learned something today.

As to using Java's x.aggregateColumns, it looks like there are distributed 
Spark versions of colSums and the rest. They use Spark accumulators to avoid 
pulling the entire matrix into memory. I followed those models and created 
"colCounts" in MatrixOps and SparkEngine, then used it instead of colSums.

Cooccurrence now passes tests with non-boolean data.

Scary adding to Dmitriy's code though so I'll invite him to look at it. Added a 
couple tests but I don't see many for SparkEngine.

https://github.com/pferrel/mahout/compare/mahout-1464

Still having problems getting mr-legacy to pass tests; spark and math-scala 
pass tests.



[jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-06-11 Thread Sebastian Schelter (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14028360#comment-14028360
 ] 

Sebastian Schelter commented on MAHOUT-1464:


Hi,

The computation of A'A is usually done without explicitly forming A'. 
Instead A'A is computed as the sum of outer products of rows of A.

--sebastian
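The outer-product formulation can be sketched with plain arrays (illustration only, not the Mahout implementation): row k of A contributes a[k][i] * a[k][j] to entry (i, j) of A'A, so the transpose is never formed.

```java
public class AtADemo {
    // A'A computed without forming the transpose: each row of A contributes
    // the outer product of that row with itself, summed into the result.
    static double[][] ata(double[][] a) {
        int n = a[0].length;
        double[][] result = new double[n][n];
        for (double[] row : a) {
            for (int i = 0; i < n; i++) {
                if (row[i] == 0.0) continue; // sparse rows contribute few terms
                for (int j = 0; j < n; j++) {
                    result[i][j] += row[i] * row[j];
                }
            }
        }
        return result;
    }
}
```

This is also the shape of the map-reduce flow Ted describes: each row can be processed independently and the partial outer products grouped by row index of the result.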






[jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-06-11 Thread Ted Dunning (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14028342#comment-14028342
 ] 

Ted Dunning commented on MAHOUT-1464:
-

I don't understand the question.

In fact, the transpose is never computed explicitly.  There is a special 
operation that does A' A in a single pass and step.  It is possible to fuse the 
down-sampling into this multiplication, but not possible to fuse the column 
counts.  For large sparse A, the value of A'A is computed using a map-reduce 
style data flow where each row is examined and all cooccurrence counts are 
emitted to be grouped by row number later.

In order to save memory, it is probably a good idea to discard the original 
counts as soon as they are reduced to binary form and down-sampled.

For computing counts, it is possible to accumulate column sums in a row-wise 
accumulator at the same time that row sums are accumulated one element at a 
time.  This avoids a pass over A and probably helps significantly on speed.
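The single-pass accumulation of both statistics can be sketched like this (plain Java, names illustrative, not the Mahout code):

```java
public class SumsDemo {
    // One sweep over A accumulating per-row and per-column sums together,
    // avoiding a second pass over A just for the column statistics.
    // Returns {rowSums, colSums}.
    static double[][] rowAndColSums(double[][] a) {
        double[] rowSums = new double[a.length];
        double[] colSums = new double[a[0].length];
        for (int r = 0; r < a.length; r++) {
            for (int c = 0; c < a[r].length; c++) {
                rowSums[r] += a[r][c];
                colSums[c] += a[r][c];
            }
        }
        return new double[][]{rowSums, colSums};
    }
}
```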




[jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-06-11 Thread Pat Ferrel (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14027986#comment-14027986
 ] 

Pat Ferrel commented on MAHOUT-1464:


The algo transposes A (the primary) before self-cooccurrence. That gives us a 
point where columns are rows, which in turn makes distributed ops on the DRM 
simple. So rather than looking at the counts for columns, my earlier proposal 
was to look at the same data when it is a row. Might this be better, since it 
can easily be a distributed calculation?

In other words, since A.t * A is calculated, we can split this into transpose 
and multiply, taking column counts from the rows of A.t and then doing the 
multiply. In the list of calculations A.t * A, B.t * A, ... each includes a 
state where the columns turn into rows, so the same approach can be used.

This turns what was a bug into a significant optimization: if the data is 
already boolean, use colSums and no distributed counting is needed.

Not sure if the above is all true, so read it as a question.
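The identity behind the proposal is simply that per-column non-zero counts of A equal per-row non-zero counts of A.t, so they can be read off while the transpose exists. A plain-Java sketch (illustrative names, not the Mahout code):

```java
public class TransposeCountsDemo {
    static double[][] transpose(double[][] a) {
        double[][] t = new double[a[0].length][a.length];
        for (int r = 0; r < a.length; r++)
            for (int c = 0; c < a[0].length; c++)
                t[c][r] = a[r][c];
        return t;
    }

    // Non-zero count per column: needs a full sweep when rows are distributed.
    static int[] colNonZeroCounts(double[][] a) {
        int[] counts = new int[a[0].length];
        for (double[] row : a)
            for (int c = 0; c < row.length; c++)
                if (row[c] != 0.0) counts[c]++;
        return counts;
    }

    // Non-zero count per row: cheap when each row is local, as in A.t.
    static int[] rowNonZeroCounts(double[][] a) {
        int[] counts = new int[a.length];
        for (int r = 0; r < a.length; r++)
            for (double x : a[r])
                if (x != 0.0) counts[r]++;
        return counts;
    }
}
```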





Re: [jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-06-10 Thread Pat Ferrel
facepalm, missed that. Thanks.

On Jun 10, 2014, at 4:29 PM, Ted Dunning (JIRA)  wrote:


   [ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14027208#comment-14027208
 ] 

Ted Dunning commented on MAHOUT-1464:
-

Matrix and Vector already have something that can be used:

{code}
Vector counts = x.aggregateColumns(new VectorFunction() {
  @Override
  public double apply(Vector f) {
    return f.aggregate(Functions.PLUS, Functions.greater(0));
  }
});
{code}




[jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-06-10 Thread Ted Dunning (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14027208#comment-14027208
 ] 

Ted Dunning commented on MAHOUT-1464:
-

Matrix and Vector already have something that can be used:

{code}
Vector counts = x.aggregateColumns(new VectorFunction() {
  @Override
  public double apply(Vector f) {
    return f.aggregate(Functions.PLUS, Functions.greater(0));
  }
});
{code}



[jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-06-10 Thread Pat Ferrel (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14027202#comment-14027202
 ] 

Pat Ferrel commented on MAHOUT-1464:


OK, good to know. So the fix above for rows is not good either, oh bother.

If I have to write specific code, might it be better put in the DRM and/or 
Vector?



[jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-06-10 Thread Ted Dunning (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14027182#comment-14027182
 ] 

Ted Dunning commented on MAHOUT-1464:
-

I don't think that numNonZero can be trusted here.  The contract it provides is 
to return an upper bound on the number of non-zeros, not a precise value.

Better to write specific code.
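The distinction Ted draws can be shown with a tiny sketch (plain Java, not the Mahout Vector API): a sparse vector may carry explicitly stored zeros, so the number of stored entries is only an upper bound on the true non-zero count, and the exact count requires inspecting the values.

```java
public class NonZeroDemo {
    // storedValues models the stored entries of a sparse vector; an explicitly
    // stored 0.0 can appear (e.g. after an entry is overwritten with zero), so
    // storedValues.length is only an upper bound on the real non-zero count.
    static int exactNonZeros(double[] storedValues) {
        int n = 0;
        for (double x : storedValues) {
            if (x != 0.0) n++;
        }
        return n;
    }
}
```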





[jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-06-10 Thread Pat Ferrel (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14027159#comment-14027159
 ] 

Pat Ferrel commented on MAHOUT-1464:


I think the same thing is happening with number of item interactions:

// Broadcast vector containing the number of interactions with each thing
val bcastNumInteractions = drmBroadcast(drmI.colSums) // sums?

This broadcasts a vector of sums. We need a getNumNonZeroElements() for column 
vectors, really a way to get a Vector of non-zero counts per column. We could 
get them from the rows of the transposed matrix before doing the multiply of 
A.t %*% A or B.t %*% A, in which case we'd get non-zero counts from the rows. 
Either way I don't see a way to get a vector of these values without doing a 
mapBlock on the transposed matrix. Am I missing something?

Currently the IndexedDataset is a very thin wrapper, but I could add two 
vectors containing the number of non-zero elements for rows and columns. In 
that case I would perhaps have it extend CheckpointedDrm. Since CheckpointedDrm 
extends DrmLike it could be used in the DSL algebra directly, in which case it 
would be simple to do the right thing with these vectors as well as the two id 
dictionaries for transpose and multiply, but it's a slippery slope.

Before I go off in the wrong direction is there an existing way to get a vector 
of non-zero counts for rows or columns?




[jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-06-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14026742#comment-14026742
 ] 

ASF GitHub Bot commented on MAHOUT-1464:


Github user dlyubimov commented on the pull request:

https://github.com/apache/mahout/pull/12#issuecomment-45646307
  
I assume this is the current PR for MAHOUT-1464?




[jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-06-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14026738#comment-14026738
 ] 

ASF GitHub Bot commented on MAHOUT-1464:


Github user sscdotopen commented on the pull request:

https://github.com/apache/mahout/pull/8#issuecomment-45646072
  
The cooccurrence analysis code should go in the math-scala module, not spark, 
as it is independent of the underlying engine.




[jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-06-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14026719#comment-14026719
 ] 

ASF GitHub Bot commented on MAHOUT-1464:


Github user dlyubimov closed the pull request at:

https://github.com/apache/mahout/pull/8




Re: [jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-06-10 Thread Sebastian Schelter

Hi Pat,

We truncate the indicators to the top-k and you don't want the 
self-comparison in there. So I don't see a reason to not exclude it as 
early as possible.


--sebastian

On 06/10/2014 05:28 PM, Pat Ferrel wrote:

Still getting the wrong values with non-boolean input so I'll continue to look 
at it.

Another question: computeIndicators seems to exclude self-comparison during 
A'A and, of course, not for B'A. Since this returns the indicator matrix for 
the general case, shouldn't it include those values? It seems like they should 
be filtered out in the output phase, if anywhere, and then only as an option. 
If we were actually returning a multiply we'd include those.

 // exclude co-occurrences of the item with itself
 if (crossCooccurrence || thingB != thingA) {

On Jun 10, 2014, at 1:49 AM, Sebastian Schelter  wrote:

Oh good catch! I had an extra binarize method before, so that the data was 
already binary. I merged that into the downsample code and must have overlooked 
that thing. You are right, numNonZeros is the way to go!


On 06/10/2014 01:11 AM, Ted Dunning wrote:

Sounds like a very plausible root cause.





On Mon, Jun 9, 2014 at 4:03 PM, Pat Ferrel (JIRA)  wrote:



 [
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14025893#comment-14025893
]

Pat Ferrel commented on MAHOUT-1464:


seems like the downsampleAndBinarize method is returning the wrong values.
It is actually summing the values where it should be counting the non-zero
elements?

// Downsample the interaction vector of each user
for (userIndex <- 0 until keys.size) {

  val interactionsOfUser = block(userIndex, ::) // this is a Vector
  // if the values are non-boolean the sum will not be the number of interactions,
  // it will be a sum of strength-of-interaction, right?
  // val numInteractionsOfUser = interactionsOfUser.sum // doesn't this sum strength of interactions?
  val numInteractionsOfUser = interactionsOfUser.getNumNonZeroElements() // should do this I think

  val perUserSampleRate = math.min(maxNumInteractions, numInteractionsOfUser) / numInteractionsOfUser

  interactionsOfUser.nonZeroes().foreach { elem =>
    val numInteractionsWithThing = numInteractions(elem.index)
    val perThingSampleRate = math.min(maxNumInteractions, numInteractionsWithThing) / numInteractionsWithThing

    if (random.nextDouble() <= math.min(perUserSampleRate, perThingSampleRate)) {
      // We ignore the original interaction value and create a binary 0-1 matrix
      // as we only consider whether interactions happened or did not happen
      downsampledBlock(userIndex, elem.index) = 1
    }
  }
}
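To see why summing breaks the sampling rate for non-boolean data, here is a plain-Java sketch of the per-user rate computed both ways (names are illustrative, not the Mahout code):

```java
public class SampleRateDemo {
    // Correct: per-user sampling rate from the exact interaction count.
    static double rateFromCount(double[] interactions, int maxNumInteractions) {
        int n = 0;
        for (double x : interactions) {
            if (x != 0.0) n++;
        }
        return Math.min(maxNumInteractions, n) / (double) n;
    }

    // The buggy variant: summing interaction strengths instead of counting
    // interactions, so strong interactions wrongly trigger downsampling.
    static double rateFromSum(double[] interactions, int maxNumInteractions) {
        double s = 0.0;
        for (double x : interactions) {
            s += x;
        }
        return Math.min(maxNumInteractions, s) / s;
    }
}
```

A user with two interactions of strengths 5 and 3 is under a cap of 4 interactions (rate 1.0), but the strength sum of 8 exceeds the cap and halves the sampling rate.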













Re: [jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-06-10 Thread Pat Ferrel
Still getting the wrong values with non-boolean input so I'll continue to look 
at it.

Another question: computeIndicators seems to exclude self-comparison during 
A'A and, of course, not for B'A. Since this returns the indicator matrix for 
the general case, shouldn't it include those values? It seems like they should 
be filtered out in the output phase, if anywhere, and then only as an option. 
If we were actually returning a multiply we'd include those.

// exclude co-occurrences of the item with itself
if (crossCooccurrence || thingB != thingA) {

On Jun 10, 2014, at 1:49 AM, Sebastian Schelter  wrote:

Oh good catch! I had an extra binarize method before, so that the data was 
already binary. I merged that into the downsample code and must have overlooked 
that thing. You are right, numNonZeros is the way to go!


On 06/10/2014 01:11 AM, Ted Dunning wrote:
> Sounds like a very plausible root cause.
> 
> 
> 
> 
> 
> On Mon, Jun 9, 2014 at 4:03 PM, Pat Ferrel (JIRA)  wrote:
> 
>> 
>> [
>> https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14025893#comment-14025893
>> ]
>> 
>> Pat Ferrel commented on MAHOUT-1464:
>> 
>> 
>> seems like the downsampleAndBinarize method is returning the wrong values.
>> It is actually summing the values where it should be counting the non-zero
>> elements?
>> 
>> // Downsample the interaction vector of each user
>> for (userIndex <- 0 until keys.size) {
>> 
>>   val interactionsOfUser = block(userIndex, ::) // this is a Vector
>>   // if the values are non-boolean the sum will not be the number
>> of interactions it will be a sum of strength-of-interaction, right?
>>   // val numInteractionsOfUser = interactionsOfUser.sum // doesn't
>> this sum strength of interactions?
>>   val numInteractionsOfUser =
>> interactionsOfUser.getNumNonZeroElements()  // should do this I think
>> 
>>   val perUserSampleRate = math.min(maxNumInteractions,
>> numInteractionsOfUser) / numInteractionsOfUser
>> 
>>   interactionsOfUser.nonZeroes().foreach { elem =>
>> val numInteractionsWithThing = numInteractions(elem.index)
>> val perThingSampleRate = math.min(maxNumInteractions,
>> numInteractionsWithThing) / numInteractionsWithThing
>> 
>> if (random.nextDouble() <= math.min(perUserSampleRate,
>> perThingSampleRate)) {
>>   // We ignore the original interaction value and create a
>> binary 0-1 matrix
>>   // as we only consider whether interactions happened or did
>> not happen
>>   downsampledBlock(userIndex, elem.index) = 1
>> }
>>   }
>> 
>> 
> 




[jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-06-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14026549#comment-14026549
 ] 

ASF GitHub Bot commented on MAHOUT-1464:


Github user pferrel commented on the pull request:

https://github.com/apache/mahout/pull/8#issuecomment-45626614
  
Go ahead and hit the button. Still have a bit more to do here.


On Jun 9, 2014, at 6:47 PM, Dmitriy Lyubimov  
wrote:

you can close -- but since i originated the PR, it is easier for me (I have 
access to the "close" button on it while everyone else would have to use 
"close apache/mahout#8" commit to do the same.) 


On Mon, Jun 9, 2014 at 5:20 PM, Pat Ferrel  
wrote: 

> According to the instructions I merge from my branch anyway. I can close 
> it right? The instruction for closing without merging? 
> 
> I assume you got my mail about finding the blocker now there are some 
> questions about the cooccurrence algo itself. 
> 
> — 
> Reply to this email directly or view it on GitHub 
> . 
>
—
Reply to this email directly or view it on GitHub.




Re: [jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-06-10 Thread Sebastian Schelter
Oh good catch! I had an extra binarize method before, so that the data 
was already binary. I merged that into the downsample code and must have 
overlooked that thing. You are right, numNonZeros is the way to go!



On 06/10/2014 01:11 AM, Ted Dunning wrote:

Sounds like a very plausible root cause.





On Mon, Jun 9, 2014 at 4:03 PM, Pat Ferrel (JIRA)  wrote:



 [
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14025893#comment-14025893
]

Pat Ferrel commented on MAHOUT-1464:


seems like the downsampleAndBinarize method is returning the wrong values.
It is actually summing the values where it should be counting the non-zero
elements?

 // Downsample the interaction vector of each user
 for (userIndex <- 0 until keys.size) {

   val interactionsOfUser = block(userIndex, ::) // this is a Vector
   // if the values are non-boolean the sum will not be the number of
   // interactions, it will be a sum of strength-of-interaction, right?
   // val numInteractionsOfUser = interactionsOfUser.sum // doesn't this sum strength of interactions?
   val numInteractionsOfUser = interactionsOfUser.getNumNonZeroElements() // should do this I think

   val perUserSampleRate = math.min(maxNumInteractions, numInteractionsOfUser) / numInteractionsOfUser

   interactionsOfUser.nonZeroes().foreach { elem =>
     val numInteractionsWithThing = numInteractions(elem.index)
     val perThingSampleRate = math.min(maxNumInteractions, numInteractionsWithThing) / numInteractionsWithThing

     if (random.nextDouble() <= math.min(perUserSampleRate, perThingSampleRate)) {
       // We ignore the original interaction value and create a
       // binary 0-1 matrix as we only consider whether interactions
       // happened or did not happen
       downsampledBlock(userIndex, elem.index) = 1
     }
   }










[jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-06-09 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14026021#comment-14026021
 ] 

ASF GitHub Bot commented on MAHOUT-1464:


Github user dlyubimov commented on the pull request:

https://github.com/apache/mahout/pull/8#issuecomment-45565428
  
you can close -- but since I originated the PR, it is easier for me (I have
access to the "close" button on it, while everyone else would have to use a
"close apache/mahout#8" commit to do the same).


On Mon, Jun 9, 2014 at 5:20 PM, Pat Ferrel  wrote:

> According to the instructions I merge from my branch anyway. I can close
> it right? The instruction for closing without merging?
>
> I assume you got my mail about finding the blocker now there are some
> questions about the cooccurrence algo itself.
>
> —
> Reply to this email directly or view it on GitHub
> .
>




[jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-06-09 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14025966#comment-14025966
 ] 

ASF GitHub Bot commented on MAHOUT-1464:


Github user pferrel commented on the pull request:

https://github.com/apache/mahout/pull/8#issuecomment-45560733
  
According to the instructions I merge from my branch anyway, so I can close it,
right? Is there an instruction for closing without merging?

I assume you got my mail about finding the blocker; now there are some
questions about the cooccurrence algo itself.




[jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-06-09 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14025933#comment-14025933
 ] 

ASF GitHub Bot commented on MAHOUT-1464:


Github user dlyubimov commented on the pull request:

https://github.com/apache/mahout/pull/8#issuecomment-45557670
  
Pat, so we are not going to use this PR for merging, I take it?
I will close it; you can keep working on your other requests.




Re: [jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-06-09 Thread Ted Dunning
Sounds like a very plausible root cause.





On Mon, Jun 9, 2014 at 4:03 PM, Pat Ferrel (JIRA)  wrote:

>
> [
> https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14025893#comment-14025893
> ]
>
> Pat Ferrel commented on MAHOUT-1464:
> 
>
> seems like the downsampleAndBinarize method is returning the wrong values.
> It is actually summing the values where it should be counting the non-zero
> elements?
>
> // Downsample the interaction vector of each user
> for (userIndex <- 0 until keys.size) {
>
>   val interactionsOfUser = block(userIndex, ::) // this is a Vector
>   // if the values are non-boolean the sum will not be the number of
>   // interactions, it will be a sum of strength-of-interaction, right?
>   // val numInteractionsOfUser = interactionsOfUser.sum // doesn't this sum strength of interactions?
>   val numInteractionsOfUser = interactionsOfUser.getNumNonZeroElements()  // should do this I think
>
>   val perUserSampleRate = math.min(maxNumInteractions, numInteractionsOfUser) / numInteractionsOfUser
>
>   interactionsOfUser.nonZeroes().foreach { elem =>
>     val numInteractionsWithThing = numInteractions(elem.index)
>     val perThingSampleRate = math.min(maxNumInteractions, numInteractionsWithThing) / numInteractionsWithThing
>
>     if (random.nextDouble() <= math.min(perUserSampleRate, perThingSampleRate)) {
>       // We ignore the original interaction value and create a
>       // binary 0-1 matrix as we only consider whether interactions
>       // happened or did not happen
>       downsampledBlock(userIndex, elem.index) = 1
>     }
>   }
>
>


[jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-06-09 Thread Pat Ferrel (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14025893#comment-14025893
 ] 

Pat Ferrel commented on MAHOUT-1464:


Seems like the downsampleAndBinarize method is returning the wrong values. It
is actually summing the values where it should be counting the non-zero
elements:

// Downsample the interaction vector of each user
for (userIndex <- 0 until keys.size) {

  val interactionsOfUser = block(userIndex, ::) // this is a Vector
  // if the values are non-boolean the sum will not be the number of
  // interactions, it will be a sum of strength-of-interaction, right?
  // val numInteractionsOfUser = interactionsOfUser.sum // doesn't this sum strength of interactions?
  val numInteractionsOfUser = interactionsOfUser.getNumNonZeroElements() // should do this I think

  val perUserSampleRate = math.min(maxNumInteractions, numInteractionsOfUser) / numInteractionsOfUser

  interactionsOfUser.nonZeroes().foreach { elem =>
    val numInteractionsWithThing = numInteractions(elem.index)
    val perThingSampleRate = math.min(maxNumInteractions, numInteractionsWithThing) / numInteractionsWithThing

    if (random.nextDouble() <= math.min(perUserSampleRate, perThingSampleRate)) {
      // We ignore the original interaction value and create a
      // binary 0-1 matrix as we only consider whether interactions
      // happened or did not happen
      downsampledBlock(userIndex, elem.index) = 1
    }
  }
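The difference matters whenever interaction values carry strengths rather than
0/1 flags. Here is a minimal plain-Scala sketch (a Map stands in for the Mahout
sparse Vector; the values and maxNumInteractions are hypothetical) showing how
summing versus counting non-zeros changes the per-user sample rate:

```scala
object DownsampleRateSketch {
  // Sample rate as in the downsampling loop: cap the count at maxNumInteractions
  def sampleRate(count: Double, maxNumInteractions: Int): Double =
    math.min(maxNumInteractions, count) / count

  def main(args: Array[String]): Unit = {
    // Hypothetical sparse interaction vector: item index -> strength of interaction
    val interactionsOfUser = Map(0 -> 1000.0, 1 -> 10.0)
    val maxNumInteractions = 500

    val sum = interactionsOfUser.values.sum                // 1010.0: total strength, not a count
    val numNonZero = interactionsOfUser.count(_._2 != 0.0) // 2: the actual interaction count

    // Summing makes a 2-interaction user look "heavy" and downsamples it spuriously
    println(f"rate from sum = ${sampleRate(sum, maxNumInteractions)}%.3f")                   // 0.495
    println(f"rate from non-zero count = ${sampleRate(numNonZero, maxNumInteractions)}%.1f") // 1.0
  }
}
```

With weighted values the sum-based rate would drop most of this user's two
interactions, while the count-based rate correctly keeps them all.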




[jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-06-09 Thread Pat Ferrel (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14025655#comment-14025655
 ] 

Pat Ferrel commented on MAHOUT-1464:


The indicator matrix contains self-similarity values. The code seems to imply
that self-similarity should be excluded. Certainly the Mahout itemsimilarity
job doesn't return them.
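A tiny plain-Scala sketch (array-based and purely illustrative, not the Mahout
DRM API; the matrix values are made up) of what excluding self-similarity
amounts to: zero the diagonal of the item-item similarity matrix so an item is
never reported as an indicator for itself.

```scala
object ExcludeSelfSimilarity {
  def main(args: Array[String]): Unit = {
    // Hypothetical 3x3 item-item similarity matrix; the diagonal holds self-similarity
    val sim = Array(
      Array(9.0, 1.7, 0.0),
      Array(1.7, 9.0, 0.4),
      Array(0.0, 0.4, 9.0))

    // Zero the diagonal before extracting top-N indicators per item
    for (i <- sim.indices) sim(i)(i) = 0.0

    sim.foreach(row => println(row.mkString(" ")))
  }
}
```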



[jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-06-09 Thread Pat Ferrel (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14025549#comment-14025549
 ] 

Pat Ferrel commented on MAHOUT-1464:


While I was waiting for the build to settle down I wrote some more tests for
different value types. The same rows/columns are used for each input, so all
the LLR indicator matrices should be the same and should match the Hadoop code.
But using integers with magnitude larger than 1 returns an empty indicator
matrix.

input:

val a = dense((1000, 10, 0, 0, 0), (0, 0, 1, 10, 0), (0, 0, 0, 0, 100), 
(1, 0, 0, 1000, 0))

should produce

val matrixLLRCoocAtAControl = dense(
  (0.0, 1.7260924347106847, 0, 0, 0),
  (1.7260924347106847, 0, 0, 0, 0),
  (0, 0, 0, 1.7260924347106847, 0),
  (0, 0, 1.7260924347106847, 0, 0),
  (0, 0, 0, 0, 0)
)

However, the following returns an empty matrix:

val drmCooc = CooccurrenceAnalysis.cooccurrences(drmARaw = drmA, drmBs = 
Array(drmB))
//var cp = drmSelfCooc(0).checkpoint()
//cp.writeDRM("/tmp/cooc-spark/")//to get values written
val matrixSelfCooc = drmCooc(0).checkpoint().collect

matrixSelfCooc is always empty. Feeding the same input to the Hadoop Mahout
version using LLR produces the correct result, matching matrixLLRCoocAtAControl.

Still investigating why this happens.
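For reference, the control values above come from Dunning's log-likelihood
ratio on a 2x2 contingency table. Below is a self-contained sketch written to
mirror what I understand Mahout's LogLikelihood helper to compute (an
assumption; check org.apache.mahout.math.stats.LogLikelihood for the canonical
version):

```scala
object LlrSketch {

  // x * ln(x), with the convention 0 * ln(0) = 0
  private def xLogX(x: Double): Double = if (x == 0.0) 0.0 else x * math.log(x)

  // Entropy of a list of unnormalized counts
  private def entropy(counts: Double*): Double =
    xLogX(counts.sum) - counts.map(xLogX).sum

  // k11: users with both items, k12: first item only,
  // k21: second item only, k22: neither item
  def logLikelihoodRatio(k11: Double, k12: Double, k21: Double, k22: Double): Double = {
    val rowEntropy = entropy(k11 + k12, k21 + k22)
    val colEntropy = entropy(k11 + k21, k12 + k22)
    val matEntropy = entropy(k11, k12, k21, k22)
    math.max(0.0, 2.0 * (rowEntropy + colEntropy - matEntropy))
  }

  def main(args: Array[String]): Unit = {
    // One co-occurrence, one user with only the first item, two with neither:
    // this reproduces the 1.7260924347106847 entries in the control matrix above
    println(logLikelihoodRatio(1, 1, 0, 2))
  }
}
```

Note that LlrSketch and its helper names are my own; only the contingency-table
counts and the resulting score are taken from the test case above.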



[jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-06-06 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14020385#comment-14020385
 ] 

ASF GitHub Bot commented on MAHOUT-1464:


GitHub user pferrel opened a pull request:

https://github.com/apache/mahout/pull/12

Mahout 1464

MAHOUT-1464 looks ready to me but can't push it yet.

My build is broken by an unrelated mr-legacy test, so I'll wait to push it 
until my local build passes, but I wanted to get this out for review if anyone 
cares.

I took out the epinions and movielens examples, will add them back in with 
the CLI driver maybe. 

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/pferrel/mahout mahout-1464

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/mahout/pull/12.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #12


commit 107a0ba9605241653a85b113661a8fa5c055529f
Author: pferrel 
Date:   2014-06-04T19:54:22Z

added Sebastian's CooccurrenceAnalysis patch updated it to use current 
Mahout-DSL

commit 16c03f7fa73c156859d1dba3a333ef9e8bf922b0
Author: pferrel 
Date:   2014-06-04T21:32:18Z

added Sebastian's MurmurHash changes

Signed-off-by: pferrel 

commit c6adaa44c80bba99d41600e260bbb1ad5c972e69
Author: pferrel 
Date:   2014-06-05T16:52:23Z

MAHOUT-1464 import cleanup, minor changes to examples for running on Spark 
Cluster

commit 1d66e5726e71e297ef4a7a27331463ba363098c0
Author: pferrel 
Date:   2014-06-06T20:19:32Z

scalatest for cooccurrence cross and self along with other 
CooccurrenceAnalysis methods

commit 766db0f9e7feb70520fbd444afcb910788f01e76
Author: pferrel 
Date:   2014-06-06T20:20:46Z

Merge branch 'master' into mahout-1464

commit e492976688cb8860354bb20a362d370405f560e1
Author: pferrel 
Date:   2014-06-06T20:50:07Z

cleaned up test comments

commit a49692eb1664de4b15de1864b95701a6410c80c8
Author: pferrel 
Date:   2014-06-06T21:09:55Z

got those cursed .DS_Stores out of the branch and put an exclude in 
.gitignore






[jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-06-05 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14018916#comment-14018916
 ] 

ASF GitHub Bot commented on MAHOUT-1464:


Github user pferrel commented on the pull request:

https://github.com/apache/mahout/pull/8#issuecomment-45241940
  
Hah, that's me looking at my own code




[jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-06-05 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14018902#comment-14018902
 ] 

ASF GitHub Bot commented on MAHOUT-1464:


Github user dlyubimov commented on the pull request:

https://github.com/apache/mahout/pull/8#issuecomment-45239528
  
I don't even know that cartoon, just thought it was funny. Yeah, thinking 
of it, it is how I used to feel most of the time looking at my colleagues' code 
at work 




[jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-06-05 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14018897#comment-14018897
 ] 

ASF GitHub Bot commented on MAHOUT-1464:


Github user pferrel commented on the pull request:

https://github.com/apache/mahout/pull/8#issuecomment-45238498
  
Are you a Ren and Stimpy fan, or is it just the way you feel sometimes?




[jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-06-05 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14018887#comment-14018887
 ] 

ASF GitHub Bot commented on MAHOUT-1464:


Github user dlyubimov commented on the pull request:

https://github.com/apache/mahout/pull/8#issuecomment-45237262
  
I Like Pat's github avatar :+1: 




[jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-06-05 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14018878#comment-14018878
 ] 

ASF GitHub Bot commented on MAHOUT-1464:


Github user pferrel commented on the pull request:

https://github.com/apache/mahout/pull/8#issuecomment-45235053
  
I guess I'm suggesting that examples like these might be good in the right 
place: not as build tests but as usage examples, as long as they use only 
supported code (read/write, for instance).




[jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-06-05 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14018872#comment-14018872
 ] 

ASF GitHub Bot commented on MAHOUT-1464:


Github user pferrel commented on the pull request:

https://github.com/apache/mahout/pull/8#issuecomment-45234551
  
Yes, but downloading is described in the comments.




[jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-06-05 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14018871#comment-14018871
 ] 

ASF GitHub Bot commented on MAHOUT-1464:


Github user sscdotopen commented on the pull request:

https://github.com/apache/mahout/pull/8#issuecomment-45234269
  
It's not allowed to redistribute the MovieLens dataset.

On 06/05/2014 05:28 PM, Pat Ferrel wrote:
> I could use a little advice here. The epinions and movielens tests in the 
examples folder. Should they be put into the build?
>
> Pros: good example data.
> Cons: the reading and writing are not parallel and so only work locally. 
It is easy to change the Spark context to use a cluster but the data still has 
to be local. These tests would be easier to maintain if they were attached to 
the ItemSimilarityDriver, which will handle cluster storage and execution and 
will be maintained better.
>
> I'd rather move them out into an ItemSimilarityDriver examples folder and 
will do this if no one objects. They will not be build tests, obviously, since 
they take too long.
>
> ---
> Reply to this email directly or view it on GitHub:
> https://github.com/apache/mahout/pull/8#issuecomment-45234064
>




[jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-06-05 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14018870#comment-14018870
 ] 

ASF GitHub Bot commented on MAHOUT-1464:


Github user pferrel commented on the pull request:

https://github.com/apache/mahout/pull/8#issuecomment-45234064
  
I could use a little advice here. The epinions and movielens tests in the 
examples folder. Should they be put into the build?

Pros: good example data.
Cons: the reading and writing are not parallel and so only work locally. It 
is easy to change the Spark context to use a cluster but the data still has to 
be local. These tests would be easier to maintain if they were attached to the 
ItemSimilarityDriver, which will handle cluster storage and execution and will 
be maintained better.

I'd rather move them out into an ItemSimilarityDriver examples folder and 
will do this if no one objects. They will not be build tests, obviously, since 
they take too long.




[jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-06-05 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14018818#comment-14018818
 ] 

ASF GitHub Bot commented on MAHOUT-1464:


Github user pferrel commented on the pull request:

https://github.com/apache/mahout/pull/8#issuecomment-45224589
  
OK, I tend to use IDEA import optimization, which works about 90% of the 
time. Notice that the mutable import messes things up so D removed that.




[jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-06-04 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14018494#comment-14018494
 ] 

ASF GitHub Bot commented on MAHOUT-1464:


Github user sscdotopen commented on a diff in the pull request:

https://github.com/apache/mahout/pull/8#discussion_r13425122
  
--- Diff: 
spark/src/main/scala/org/apache/mahout/cf/CooccurrenceAnalysis.scala ---
@@ -0,0 +1,210 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.mahout.cf
+
+import org.apache.mahout.math._
+import scalabindings._
+import RLikeOps._
+import drm._
+import RLikeDrmOps._
+import org.apache.mahout.sparkbindings._
+
+import scala.collection.JavaConversions._
+import org.apache.mahout.math.stats.LogLikelihood
+import collection._
+// import scala.collection.parallel.mutable
--- End diff --

we can remove the .parallel. import




[jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-06-02 Thread Pat Ferrel (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14016102#comment-14016102
 ] 

Pat Ferrel commented on MAHOUT-1464:


My problem is that my cluster is on Hadoop 1.2.1, and to upgrade it, everything 
I run on it has to go to Hadoop 2. Oh bother.

I think the best thing is to commit this and see if someone will run one of the 
several included tests on a cluster. It works locally and seems to work 
clustered, but the write fails. The write is not part of the core code.

Anyway, unless someone vetoes, I'll commit it once I get at least one 
build-integrated test included.



[jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-06-02 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14016013#comment-14016013
 ] 

ASF GitHub Bot commented on MAHOUT-1464:


GitHub user dlyubimov reopened a pull request:

https://github.com/apache/mahout/pull/8

MAHOUT-1464 Cooccurrence Analysis on Spark

Grabbed Pat's branch; submitting as a PR (WIP at this point). 

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/dlyubimov/mahout MAHOUT-1464

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/mahout/pull/8.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #8


commit 70654fa58dd4b801c551429945fa2f1377a60b2e
Author: pferrel 
Date:   2014-06-02T21:11:55Z

starting to merge the cooccurrence stuff, import errors

commit fc5fb6ac37e4c12d25c35ddb7912a32aac06e449
Author: pferrel 
Date:   2014-06-02T21:33:45Z

tried changing the imports in CooccurrenceAnalysis.scala to no avail

commit 242aed0e0921afe9a87ee8973ba8077cbe65fffa
Author: Dmitriy Lyubimov 
Date:   2014-06-02T22:42:57Z

Compilation fixes, updates for MAHOUT-1529 changes






[jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-06-02 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14016012#comment-14016012
 ] 

ASF GitHub Bot commented on MAHOUT-1464:


Github user dlyubimov closed the pull request at:

https://github.com/apache/mahout/pull/8




[jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-06-02 Thread Pat Ferrel (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14015824#comment-14015824
 ] 

Pat Ferrel commented on MAHOUT-1464:


import scala.collection.JavaConversions._

is included. I'll pare back to just this ticket and send a PR



[jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-06-02 Thread Dmitriy Lyubimov (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14015810#comment-14015810
 ] 

Dmitriy Lyubimov commented on MAHOUT-1464:
--

If you want me to verify this, please convert it to a pull request so I can 
painlessly sync to exactly what you are testing.



[jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-06-02 Thread Dmitriy Lyubimov (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14015806#comment-14015806
 ] 

Dmitriy Lyubimov commented on MAHOUT-1464:
--

I think this has nothing to do with anything in Spark or the Scala bindings.

.nonZeroes() is a mahout-math (Java) method which produces a Java iterator, 
which is then implicitly converted to a Scala iterator (since .foreach is a 
Scala operation).

Is JavaConversions._ still imported?
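A minimal plain-Scala sketch (no Mahout dependency; the names below are illustrative, not from the patch) of why that import matters: a Java collection has no Scala `foreach` of its own, so iteration like the one over `nonZeroes()` only compiles when an implicit conversion such as `scala.collection.JavaConversions._` is in scope.

```scala
// Assumes Scala 2.10/2.11, where JavaConversions was the idiomatic choice
// (later Scala versions prefer scala.collection.JavaConverters instead).
import scala.collection.JavaConversions._

// A Java list standing in for the Java iterator returned by nonZeroes().
val javaList: java.util.List[Int] = java.util.Arrays.asList(1, 2, 3)

var sum = 0
// Without the import above, this line fails to compile with a type error,
// because java.util.List has no Scala `foreach` method of its own.
javaList.foreach { elem => sum += elem }
println(sum) // prints 6
```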



[jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-06-02 Thread Pat Ferrel (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14015788#comment-14015788
 ] 

Pat Ferrel commented on MAHOUT-1464:


Looks like DrmLike may have been refactored since this patch was written.

[~dlyubimov] The following patch code has an error at "elem" saying "Missing 
parameter type 'elem'". Looking at the scaladocs I tracked back to the DrmLike 
trait and see no way to call .mapBlock on it. Has something been refactored here? 
The .nonZeroes() is a Java sparse vector iterator, I think. This worked about a 
month ago, so I thought you might have an idea how things have changed.

{code:scala}
  def computeIndicators(drmBtA: DrmLike[Int], numUsers: Int,
      maxInterestingItemsPerThing: Int,
      bcastNumInteractionsB: Broadcast[Vector],
      bcastNumInteractionsA: Broadcast[Vector],
      crossCooccurrence: Boolean = true) = {
    drmBtA.mapBlock() {
      case (keys, block) =>

        val llrBlock = block.like()
        val numInteractionsB: Vector = bcastNumInteractionsB
        val numInteractionsA: Vector = bcastNumInteractionsA

        for (index <- 0 until keys.size) {

          val thingB = keys(index)

          // PriorityQueue to select the top-k items
          val topItemsPerThing = new mutable.PriorityQueue[(Int, Double)]()(orderByScore)

          block(index, ::).nonZeroes().foreach { elem => //! Error: "Missing parameter type 'elem'"
            val thingA = elem.index
            val cooccurrences = elem.get
{code}



Re: [jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-06-02 Thread Sebastian Schelter
The important thing here is that we test the code on a sufficiently large
dataset on a real cluster. Take that on, if you want!
On 02.06.2014 20:08, "Pat Ferrel (JIRA)" wrote:

>
> [
> https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14015667#comment-14015667
> ]
>
> Pat Ferrel commented on MAHOUT-1464:
> 
>
> [~ssc] Should I reassign to me for now so we can get this committed?
>


[jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-06-02 Thread Pat Ferrel (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14015667#comment-14015667
 ] 

Pat Ferrel commented on MAHOUT-1464:


[~ssc] Should I reassign to me for now so we can get this committed?



[jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-05-27 Thread Pat Ferrel (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14010653#comment-14010653
 ] 

Pat Ferrel commented on MAHOUT-1464:


There have been no commits afaik. The status is for Sebastian to say but I've 
used the cooccurrence analysis and it works correctly. I can't verify Spark 
cluster execution with HDFS due to what I think is my own bad setup.

If someone else could test it on a cluster, I'd say it should be committed. If 
we can wait, I'm trying to get my cluster upgraded to Hadoop 2 and to 
reconfigure Spark for that; then I'll try testing this on the new setup.

There are no Scala tests for this, though there are some in the patches. I'm 
adding some Scala tests that will cover this code while doing a CLI in 
MAHOUT-1541, which is a few weeks away from being committable.

I'm not sure it's packaged correctly; the tests supplied here are really 
examples, since they use large datasets and take a long time to execute.

Bottom line: it needs to be verified on a cluster and checked for package 
structure. I'm happy to do this if we don't need it committed right away. Both 
of these things need to be done as part of MAHOUT-1541, which I'm actively 
working on but which is not really ready to review yet.



[jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-05-27 Thread Dmitriy Lyubimov (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14010278#comment-14010278
 ] 

Dmitriy Lyubimov commented on MAHOUT-1464:
--

Is there anything else to commit here?



[jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-05-22 Thread Pat Ferrel (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14006038#comment-14006038
 ] 

Pat Ferrel commented on MAHOUT-1464:


Agreed about LLR. Never saw it underperform the other measures for CF.

I'm using a very small dataset with minSplits = 2 so I can check the actual 
values, then running the epinions dataset, which I can't really check for 
correctness. Testing goes through the ItemSimilarityDriver from MAHOUT-1541, so 
the write is not using your patch. 

Unfortunately I lied. It works with HDFS in and out, but only on a 
multi-threaded local[4] standalone Spark. Pointing it at my cluster master 
still fails. The error message is about a connection being refused, so 
something is still not configured correctly on my cluster. I had to use fully 
qualified URIs to the data because Spark was defaulting to the wrong locations. 
It all points to a bad Spark build or conf. Spark-shell seems to work fine on 
the cluster. Anyway, I'll reinstall Spark and try again. Sorry for the false 
alarm.



[jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-05-21 Thread Sebastian Schelter (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14005631#comment-14005631
 ] 

Sebastian Schelter commented on MAHOUT-1464:


[~pferrel] Great, how large was your test dataset?

I'd vote against other similarity types for the sake of simplicity; LLR also 
works best in my experience.
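Since the thread settles on LLR as the single similarity measure, here is a hedged, dependency-free Scala sketch of the log-likelihood ratio formula along the lines of Mahout's `org.apache.mahout.math.stats.LogLikelihood` (a re-derivation for illustration, not the project's actual code): `k11` is the cooccurrence count, `k12`/`k21` are counts of one item without the other, `k22` is neither.

```scala
// x * log(x), with the conventional 0 * log(0) = 0.
def xLogX(x: Long): Double = if (x == 0) 0.0 else x * math.log(x)

// Unnormalized Shannon entropy of a set of counts.
def entropy(elements: Long*): Double =
  xLogX(elements.sum) - elements.map(xLogX).sum

def logLikelihoodRatio(k11: Long, k12: Long, k21: Long, k22: Long): Double = {
  val rowEntropy    = entropy(k11 + k12, k21 + k22)
  val columnEntropy = entropy(k11 + k21, k12 + k22)
  val matrixEntropy = entropy(k11, k12, k21, k22)
  // Guard against tiny negative results from floating-point rounding.
  if (rowEntropy + columnEntropy < matrixEntropy) 0.0
  else 2.0 * (rowEntropy + columnEntropy - matrixEntropy)
}

// Independent counts score ~0; strongly associated counts score high.
println(logLikelihoodRatio(10, 10, 10, 10))   // ~0 (no association)
println(logLikelihoodRatio(100, 1, 1, 100))   // large (strong association)
```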



[jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-05-21 Thread Pat Ferrel (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14005484#comment-14005484
 ] 

Pat Ferrel commented on MAHOUT-1464:


Runs correctly on clustered Spark and HDFS.

Is there more to do here? Are the other similarity types needed?



[jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-04-18 Thread Pat Ferrel (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13974188#comment-13974188
 ] 

Pat Ferrel commented on MAHOUT-1464:


It looks like our setups are pretty much identical as far as I can tell. The 
primary difference is in using the IDE to launch and that may be causing the 
problem. 

Therefore I'll put the testing aside for a while and work on getting Dmitriy's 
Spark Scala shell working, since we know that writing from there works--at 
least writeDRM does.

As I've said, it looks like the cluster cooccurrence computation (both cross 
and self similarity) is being executed properly on the epinions data but I'm 
unable to get a file output. 



Re: [jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-04-17 Thread Pat Ferrel
I have no trouble reading from HDFS using the spark-shell. I assume I would 
also have no trouble writing but that is using the basic shell that comes with 
Spark.

scala> val textFile = sc.textFile("xrsj/ratings_data.txt")
scala> textFile.count()

This works with local, pseudo-cluster, or even full cluster. I just can’t write 
using the RSJ code. 

Are you using your custom mahout+spark Scala shell on github, doing a writeDRM? 
At home you are using cdh 4.3.2 on a single machine pseudo-cluster? Which 
versions of hadoop and spark are you running? Did you install spark outside of 
cdh? What os?

If nothing else I can try to duplicate the environment. We know your writeDRM 
works so if I can duplicate that I can start debugging the RSJ stuff.

BTW data for the RSJ code is here: 
https://cloud.occamsmachete.com/public.php?service=files&t=0011a9651691ee38e905a36e99a0f125

On Apr 17, 2014, at 1:23 PM, Dmitriy Lyubimov (JIRA)  wrote:


   [ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13973347#comment-13973347
 ] 

Dmitriy Lyubimov commented on MAHOUT-1464:
--

Hm. At home i don't have any trouble reading/writing from/to hdfs. 

There are some minor differences in configuration plus i am running hdfs cdh 
4.3.2 at home vs. 4.3.0 at work computer. That's the only difference. 

(some patchlevel specific?)






[jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-04-17 Thread Dmitriy Lyubimov (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13973347#comment-13973347
 ] 

Dmitriy Lyubimov commented on MAHOUT-1464:
--

Hm. At home I don't have any trouble reading/writing from/to HDFS.

There are some minor differences in configuration, plus I am running HDFS CDH 
4.3.2 at home vs. 4.3.0 on my work computer. That's the only difference.

(Something patchlevel-specific?)





[jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-04-17 Thread Dmitriy Lyubimov (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13973290#comment-13973290
 ] 

Dmitriy Lyubimov commented on MAHOUT-1464:
--

Also just discovered that the sbt build in 0.9.1 screws up the HBase 
dependency. Not likely to be much of a reason, but who knows. 




[jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-04-17 Thread Dmitriy Lyubimov (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13973288#comment-13973288
 ] 

Dmitriy Lyubimov commented on MAHOUT-1464:
--

The IDEA run most likely has a broken Hadoop dependency, because by default it 
inherits Mahout's default Hadoop dependency. I used to run this stuff (or rather 
our internal variant of this stuff) from my own project, which has very strict 
control over dependencies (esp. Hadoop dependencies). I also added a CDH4 
profile to the spark module which overrides Mahout's default Hadoop dependency, 
and that should help -- but it is still a pain; I gave up on running it from 
IDEA with Mahout Maven dependencies. Something is broken there in the end.

I haven't experimented with RSJ yet -- I guess I will leave it to Sebastian at 
this point.

What I do is run the following script on my "shell" branch on GitHub via the 
Mahout shell:

{code:title="simple.mscala"} 
val a = dense((1,2,3),(3,4,5))
val drmA = drmParallelize(a,numPartitions = 2)
val drmAtA = drmA.t %*% drmA

val r = drmAtA.mapBlock() {
  case (keys, block) =>
block += 1.0
keys -> block
}.checkpoint(/*StorageLevel.NONE*/)

r.collect

// local write
r.writeDRM("file:///home/dmitriy/A")

// hdfs write
r.writeDRM("hdfs://localhost:11010/A")
{code}


This actually runs totally fine in local mode, and _sometimes_ also runs OK in 
"standalone"/HDFS mode, but sometimes there are strange after-effects of hangs 
and bailing out with OOM when run on a remote cluster with "standalone". 

I am pretty sure it is either dependency issues again in the Mahout Maven build, 
or something that happened in the Spark 0.9.x release. The Spark 0.6.x -- 0.8.x 
releases and earlier had absolutely no trouble working with HDFS sequence files.
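For a quick sanity check of what the script computes, the A'A product (plus the elementwise +1.0 applied in the mapBlock step) can be reproduced in plain Scala without the Mahout DSL or Spark. This is a standalone sketch; the object and method names are made up for illustration:

```scala
// Plain-Scala check of the A'A product the DSL script computes,
// including the elementwise +1.0 from the mapBlock step.
object AtACheck {
  def atA(a: Array[Array[Double]]): Array[Array[Double]] = {
    val cols = a(0).length
    // (A'A)(i)(j) = sum over rows k of A(k)(i) * A(k)(j)
    Array.tabulate(cols, cols) { (i, j) =>
      a.iterator.map(row => row(i) * row(j)).sum
    }
  }

  def main(args: Array[String]): Unit = {
    val a = Array(Array(1.0, 2.0, 3.0), Array(3.0, 4.0, 5.0))
    val r = atA(a).map(_.map(_ + 1.0)) // mapBlock adds 1.0 everywhere
    r.foreach(row => println(row.mkString(" ")))
    // Row values: (11, 15, 19), (15, 21, 27), (19, 27, 35)
  }
}
```

Comparing this against the `r.collect` output is a cheap way to tell whether a bad run is a numerical problem or purely an environment/dependency one.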


