Re: [jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-06-12 Thread Pat Ferrel
Not sure how this relates to the PR.

If you look here you can see all the PR files and diffs from master. Comments 
can be attached to the files in question.
https://github.com/apache/mahout/pull/12/files

iterateNonZero is not in question afaik, and is used in a couple places. If 
someone wants to write an alternative I’ll be happy to change things.

On Jun 12, 2014, at 10:06 AM, Sebastian Schelter  wrote:

Ok, but the current implementation still gives the correct number, as it checks 
for accidental zeros.

I think we should add some custom implementations here to not have to go 
through the non-zeroes iterator.

--sebastian

On 06/12/2014 07:00 PM, Ted Dunning wrote:
> The reason is that sparse implementations may have recorded a non-zero that
> later got assigned a zero, but they didn't bother to remove the memory cell.
> 
> 
> 
> 
> On Thu, Jun 12, 2014 at 9:50 AM, Sebastian Schelter  wrote:
> 
>> I'm a bit lost in this discussion. Why do we assume that
>> getNumNonZeroElements() on a Vector only returns an upper bound? The code
>> in AbstractVector clearly returns the non-zeros only:
>> 
>> int count = 0;
>> Iterator<Element> it = iterateNonZero();
>> while (it.hasNext()) {
>>   if (it.next().get() != 0.0) {
>>     count++;
>>   }
>> }
>> return count;
>> 
>> On the other hand, the internal code seems broken here, why does
>> iterateNonZero potentially return 0's?
>> 
>> --sebastian
>> 
>> 
>> 
>> 
>> 
>> 
>> On 06/12/2014 06:38 PM, ASF GitHub Bot (JIRA) wrote:
>> 
>>> 
>>>  [ https://issues.apache.org/jira/browse/MAHOUT-1464?page=
>>> com.atlassian.jira.plugin.system.issuetabpanels:comment-
>>> tabpanel&focusedCommentId=14029345#comment-14029345 ]
>>> 
>>> ASF GitHub Bot commented on MAHOUT-1464:
>>> 
>>> 
>>> Github user dlyubimov commented on the pull request:
>>> 
>>>  https://github.com/apache/mahout/pull/12#issuecomment-45915940
>>> 
>>>  fix header to say MAHOUT-1464, then hit close and reopen, it will
>>> restart the echo.
>>> 
>>> 
>>>  Cooccurrence Analysis on Spark
>>>> --
>>>>
>>>>  Key: MAHOUT-1464
>>>>  URL: https://issues.apache.org/jira/browse/MAHOUT-1464
>>>>  Project: Mahout
>>>>   Issue Type: Improvement
>>>>   Components: Collaborative Filtering
>>>>  Environment: hadoop, spark
>>>> Reporter: Pat Ferrel
>>>> Assignee: Pat Ferrel
>>>>  Fix For: 1.0
>>>>
>>>>  Attachments: MAHOUT-1464.patch, MAHOUT-1464.patch,
>>>> MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch,
>>>> run-spark-xrsj.sh
>>>>
>>>>
>>>> Create a version of Cooccurrence Analysis (RowSimilarityJob with LLR)
>>>> that runs on Spark. This should be compatible with Mahout Spark DRM DSL so
>>>> a DRM can be used as input.
>>>> Ideally this would extend to cover MAHOUT-1422. This cross-cooccurrence
>>>> has several applications including cross-action recommendations.
>>>>
>>> 
>>> 
>>> 
>>> --
>>> This message was sent by Atlassian JIRA
>>> (v6.2#6252)
>>> 
>>> 
>> 
> 




Re: [jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-06-12 Thread Ted Dunning
The reason is that sparse implementations may have recorded a non-zero that
later got assigned a zero, but they didn't bother to remove the memory cell.
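
To make the stale-zero case concrete, here is a minimal Java sketch. It assumes
a toy map-backed vector; StaleZeroDemo and its methods are made-up names for
illustration and are not Mahout's sparse vector classes. It shows why the number
of stored cells is only an upper bound, while a count that checks each value
stays exact.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for a sparse vector; not Mahout's implementation.
class StaleZeroDemo {
  // index -> value; a cell stays allocated even after it is set to 0.0
  private final Map<Integer, Double> cells = new TreeMap<>();

  void set(int index, double value) {
    cells.put(index, value); // deliberately no removal when value is zero
  }

  // Number of stored cells: only an upper bound on the non-zero count.
  int storedCells() {
    return cells.size();
  }

  // Exact count: walk the stored cells and skip the stale zeros.
  int countNonZero() {
    int count = 0;
    for (double v : cells.values()) {
      if (v != 0.0) {
        count++;
      }
    }
    return count;
  }

  public static void main(String[] args) {
    StaleZeroDemo v = new StaleZeroDemo();
    v.set(3, 1.5);
    v.set(7, 2.0);
    v.set(3, 0.0); // cell 3 is still stored, but now holds zero
    System.out.println(v.storedCells() + " stored, " + v.countNonZero() + " non-zero");
  }
}
```

With the three assignments in main, two cells remain stored but only one of
them is non-zero.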




On Thu, Jun 12, 2014 at 9:50 AM, Sebastian Schelter  wrote:

> I'm a bit lost in this discussion. Why do we assume that
> getNumNonZeroElements() on a Vector only returns an upper bound? The code
> in AbstractVector clearly returns the non-zeros only:
>
> int count = 0;
> Iterator<Element> it = iterateNonZero();
> while (it.hasNext()) {
>   if (it.next().get() != 0.0) {
>     count++;
>   }
> }
> return count;
>
> On the other hand, the internal code seems broken here, why does
> iterateNonZero potentially return 0's?
>
> --sebastian
>
>
>
>
>
>
> On 06/12/2014 06:38 PM, ASF GitHub Bot (JIRA) wrote:
>
>>
>>  [ https://issues.apache.org/jira/browse/MAHOUT-1464?page=
>> com.atlassian.jira.plugin.system.issuetabpanels:comment-
>> tabpanel&focusedCommentId=14029345#comment-14029345 ]
>>
>> ASF GitHub Bot commented on MAHOUT-1464:
>> 
>>
>> Github user dlyubimov commented on the pull request:
>>
>>  https://github.com/apache/mahout/pull/12#issuecomment-45915940
>>
>>  fix header to say MAHOUT-1464, then hit close and reopen, it will
>> restart the echo.
>>
>>
>


Re: [jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-06-12 Thread Sebastian Schelter
Ok, but the current implementation still gives the correct number, as it 
checks for accidental zeros.


I think we should add some custom implementations here to not have to go 
through the non-zeroes iterator.
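
A custom implementation of that kind might look like the following for a
dense, array-backed vector. This is only a sketch under assumptions:
DenseCountDemo is a made-up class, not Mahout's DenseVector, and it simply
scans its backing array instead of going through a generic iterator.

```java
// Hypothetical dense vector; illustrates a storage-specific count, not Mahout code.
class DenseCountDemo {
  private final double[] values;

  DenseCountDemo(double[] values) {
    this.values = values.clone(); // defensive copy of the caller's array
  }

  // Scans the backing array directly; no iterator is involved.
  int getNumNonZeroElements() {
    int count = 0;
    for (double v : values) {
      if (v != 0.0) {
        count++;
      }
    }
    return count;
  }

  public static void main(String[] args) {
    DenseCountDemo v = new DenseCountDemo(new double[] {0.0, 2.5, 0.0, -1.0});
    System.out.println(v.getNumNonZeroElements());
  }
}
```

A sparse implementation would need its own variant that walks its stored cells
and still skips values that were later set to zero.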


--sebastian

On 06/12/2014 07:00 PM, Ted Dunning wrote:

The reason is that sparse implementations may have recorded a non-zero that
later got assigned a zero, but they didn't bother to remove the memory cell.




On Thu, Jun 12, 2014 at 9:50 AM, Sebastian Schelter  wrote:


I'm a bit lost in this discussion. Why do we assume that
getNumNonZeroElements() on a Vector only returns an upper bound? The code
in AbstractVector clearly returns the non-zeros only:

 int count = 0;
 Iterator<Element> it = iterateNonZero();
 while (it.hasNext()) {
   if (it.next().get() != 0.0) {
     count++;
   }
 }
 return count;

On the other hand, the internal code seems broken here, why does
iterateNonZero potentially return 0's?

--sebastian






On 06/12/2014 06:38 PM, ASF GitHub Bot (JIRA) wrote:



  [ https://issues.apache.org/jira/browse/MAHOUT-1464?page=
com.atlassian.jira.plugin.system.issuetabpanels:comment-
tabpanel&focusedCommentId=14029345#comment-14029345 ]

ASF GitHub Bot commented on MAHOUT-1464:


Github user dlyubimov commented on the pull request:

  https://github.com/apache/mahout/pull/12#issuecomment-45915940

  fix header to say MAHOUT-1464, then hit close and reopen, it will
restart the echo.




Re: [jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-06-12 Thread Pat Ferrel
The SparkEngine colCounts function was checking for >= 0. But because it was 
iterating over the non-zeros it never saw an == 0, so the bug didn’t surface. 
It’s already been fixed.

The primary question at present is: what should we call colCounts? Currently it 
is used in cooccurrence:

val bcastInteractionsPerItemA = drmBroadcast(drmA.colCounts)

Dmitriy wanted you to see if this fits R-like semantics and to suggest an 
alternative, if possible. I was commenting on the possible Java-related naming, 
so ignore any misstatements.

On Jun 12, 2014, at 9:50 AM, Sebastian Schelter  wrote:

I'm a bit lost in this discussion. Why do we assume that 
getNumNonZeroElements() on a Vector only returns an upper bound? The code in 
AbstractVector clearly returns the non-zeros only:

   int count = 0;
   Iterator<Element> it = iterateNonZero();
   while (it.hasNext()) {
     if (it.next().get() != 0.0) {
       count++;
     }
   }
   return count;

On the other hand, the internal code seems broken here, why does iterateNonZero 
potentially return 0's?

--sebastian





On 06/12/2014 06:38 PM, ASF GitHub Bot (JIRA) wrote:
> 
> [ 
> https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14029345#comment-14029345
>  ]
> 
> ASF GitHub Bot commented on MAHOUT-1464:
> 
> 
> Github user dlyubimov commented on the pull request:
> 
> https://github.com/apache/mahout/pull/12#issuecomment-45915940
> 
> fix header to say MAHOUT-1464, then hit close and reopen, it will restart 
> the echo.
> 
> 




Re: [jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-06-12 Thread Sebastian Schelter
I'm a bit lost in this discussion. Why do we assume that 
getNumNonZeroElements() on a Vector only returns an upper bound? The 
code in AbstractVector clearly returns the non-zeros only:


int count = 0;
Iterator<Element> it = iterateNonZero();
while (it.hasNext()) {
  if (it.next().get() != 0.0) {
    count++;
  }
}
return count;

On the other hand, the internal code seems broken here, why does 
iterateNonZero potentially return 0's?


--sebastian





On 06/12/2014 06:38 PM, ASF GitHub Bot (JIRA) wrote:


 [ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14029345#comment-14029345
 ]

ASF GitHub Bot commented on MAHOUT-1464:


Github user dlyubimov commented on the pull request:

 https://github.com/apache/mahout/pull/12#issuecomment-45915940

 fix header to say MAHOUT-1464, then hit close and reopen, it will restart 
the echo.






Re: [jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-06-10 Thread Pat Ferrel
facepalm, missed that. Thanks.

On Jun 10, 2014, at 4:29 PM, Ted Dunning (JIRA)  wrote:


   [ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14027208#comment-14027208
 ] 

Ted Dunning commented on MAHOUT-1464:
-

Matrix and Vector already have something that can be used:

{code}
   Vector counts = x.aggregateColumns(new VectorFunction() {
     @Override
     public double apply(Vector f) {
       return f.aggregate(Functions.PLUS, Functions.greater(0));
     }
   });
{code}
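
For readers without the Mahout classes at hand, the same aggregate pattern
(map every entry, then combine the mapped values) can be sketched in plain
Java. Everything here is an assumption for illustration: ColumnCountDemo,
aggregateColumn and columnCounts are made-up names, and a double[][] stands in
for the Matrix.

```java
import java.util.Arrays;
import java.util.function.DoubleBinaryOperator;
import java.util.function.DoubleUnaryOperator;

// Plain-Java illustration of the aggregate(combiner, mapper) pattern; not Mahout code.
class ColumnCountDemo {
  // Folds one column: map each entry, then combine the mapped values.
  // Assumes the matrix has at least one row.
  static double aggregateColumn(double[][] m, int col,
                                DoubleBinaryOperator combiner,
                                DoubleUnaryOperator mapper) {
    double result = mapper.applyAsDouble(m[0][col]);
    for (int row = 1; row < m.length; row++) {
      result = combiner.applyAsDouble(result, mapper.applyAsDouble(m[row][col]));
    }
    return result;
  }

  // Counts the strictly positive entries in every column.
  static double[] columnCounts(double[][] m) {
    DoubleBinaryOperator plus = (a, b) -> a + b;
    DoubleUnaryOperator greaterThanZero = a -> a > 0 ? 1.0 : 0.0;
    double[] counts = new double[m[0].length];
    for (int col = 0; col < counts.length; col++) {
      counts[col] = aggregateColumn(m, col, plus, greaterThanZero);
    }
    return counts;
  }

  public static void main(String[] args) {
    double[][] m = {
      {1.0, 0.0, 3.0},
      {0.0, 0.0, 2.0},
      {4.0, 0.0, 0.5}
    };
    System.out.println(Arrays.toString(columnCounts(m))); // [2.0, 0.0, 3.0]
  }
}
```

Note that a strict "greater than zero" mapper ignores negative entries as well
as zeros, which is fine for interaction counts but would matter for signed data.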




Re: [jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-06-10 Thread Sebastian Schelter

Hi Pat,

We truncate the indicators to the top-k and you don't want the 
self-comparison in there. So I don't see a reason to not exclude it as 
early as possible.
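
A small Java sketch of why the self-comparison matters for top-k truncation.
It is not the computeIndicators code: TopKDemo and topK are made-up names, and
the scores are invented, with the item's own score assumed to be the largest in
its row.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical top-k selection over one row of an item-item score matrix.
class TopKDemo {
  // Returns the indices of the k highest scores, optionally skipping
  // the column that belongs to the item itself.
  static List<Integer> topK(double[] scores, int self, int k, boolean excludeSelf) {
    List<Integer> candidates = new ArrayList<>();
    for (int j = 0; j < scores.length; j++) {
      if (!excludeSelf || j != self) {
        candidates.add(j);
      }
    }
    // Sort by descending score; the sort is stable, so ties keep index order.
    candidates.sort(Comparator.comparingDouble((Integer j) -> -scores[j]));
    return new ArrayList<>(candidates.subList(0, Math.min(k, candidates.size())));
  }

  public static void main(String[] args) {
    double[] scoresForItem0 = {9.0, 4.0, 7.0, 1.0};
    System.out.println(topK(scoresForItem0, 0, 2, true));  // [2, 1]
    System.out.println(topK(scoresForItem0, 0, 2, false)); // [0, 2]
  }
}
```

With the self-comparison left in, item 0 takes one of the two slots in its own
row and pushes item 1 out, which is the effect of filtering late rather than
early.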


--sebastian

On 06/10/2014 05:28 PM, Pat Ferrel wrote:

Still getting the wrong values with non-boolean input, so I’ll continue to look 
at it.

Another question is: computeIndicators seems to exclude self-comparison during 
A’A and, of course, not for B’A. Since this returns the indicator matrix for 
the general case, shouldn’t it include those values? It seems they should be 
filtered out in the output phase, if anywhere, and then only as an option. If 
we were actually returning a multiply we’d include those.

 // exclude co-occurrences of the item with itself
 if (crossCooccurrence || thingB != thingA) {

On Jun 10, 2014, at 1:49 AM, Sebastian Schelter  wrote:

Oh good catch! I had an extra binarize method before, so that the data was 
already binary. I merged that into the downsample code and must have overlooked 
that thing. You are right, numNonZeros is the way to go!


On 06/10/2014 01:11 AM, Ted Dunning wrote:

Sounds like a very plausible root cause.





On Mon, Jun 9, 2014 at 4:03 PM, Pat Ferrel (JIRA)  wrote:



 [
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14025893#comment-14025893
]

Pat Ferrel commented on MAHOUT-1464:


seems like the downsampleAndBinarize method is returning the wrong values.
It is actually summing the values where it should be counting the non-zero
elements?

 // Downsample the interaction vector of each user
 for (userIndex <- 0 until keys.size) {

   val interactionsOfUser = block(userIndex, ::) // this is a Vector
   // if the values are non-boolean the sum will not be the number of
   // interactions it will be a sum of strength-of-interaction, right?
   // val numInteractionsOfUser = interactionsOfUser.sum // doesn't this sum strength of interactions?
   val numInteractionsOfUser = interactionsOfUser.getNumNonZeroElements() // should do this I think

   val perUserSampleRate = math.min(maxNumInteractions, numInteractionsOfUser) / numInteractionsOfUser

   interactionsOfUser.nonZeroes().foreach { elem =>
     val numInteractionsWithThing = numInteractions(elem.index)
     val perThingSampleRate = math.min(maxNumInteractions, numInteractionsWithThing) / numInteractionsWithThing

     if (random.nextDouble() <= math.min(perUserSampleRate, perThingSampleRate)) {
       // We ignore the original interaction value and create a binary 0-1 matrix
       // as we only consider whether interactions happened or did not happen
       downsampledBlock(userIndex, elem.index) = 1
     }
   }





Re: [jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-06-10 Thread Pat Ferrel
Still getting the wrong values with non-boolean input, so I’ll continue to look 
at it.

Another question is: computeIndicators seems to exclude self-comparison during 
A’A and, of course, not for B’A. Since this returns the indicator matrix for 
the general case, shouldn’t it include those values? It seems they should be 
filtered out in the output phase, if anywhere, and then only as an option. If 
we were actually returning a multiply we’d include those.

// exclude co-occurrences of the item with itself
if (crossCooccurrence || thingB != thingA) {

On Jun 10, 2014, at 1:49 AM, Sebastian Schelter  wrote:

Oh good catch! I had an extra binarize method before, so that the data was 
already binary. I merged that into the downsample code and must have overlooked 
that thing. You are right, numNonZeros is the way to go!


On 06/10/2014 01:11 AM, Ted Dunning wrote:
> Sounds like a very plausible root cause.
> 
> 
> 
> 
> 
> On Mon, Jun 9, 2014 at 4:03 PM, Pat Ferrel (JIRA)  wrote:
> 
>> 
>> [
>> https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14025893#comment-14025893
>> ]
>> 
>> Pat Ferrel commented on MAHOUT-1464:
>> 
>> 
>> seems like the downsampleAndBinarize method is returning the wrong values.
>> It is actually summing the values where it should be counting the non-zero
>> elements?
>> 
>> // Downsample the interaction vector of each user
>> for (userIndex <- 0 until keys.size) {
>>
>>   val interactionsOfUser = block(userIndex, ::) // this is a Vector
>>   // if the values are non-boolean the sum will not be the number of
>>   // interactions it will be a sum of strength-of-interaction, right?
>>   // val numInteractionsOfUser = interactionsOfUser.sum // doesn't this sum strength of interactions?
>>   val numInteractionsOfUser = interactionsOfUser.getNumNonZeroElements() // should do this I think
>>
>>   val perUserSampleRate = math.min(maxNumInteractions, numInteractionsOfUser) / numInteractionsOfUser
>>
>>   interactionsOfUser.nonZeroes().foreach { elem =>
>>     val numInteractionsWithThing = numInteractions(elem.index)
>>     val perThingSampleRate = math.min(maxNumInteractions, numInteractionsWithThing) / numInteractionsWithThing
>>
>>     if (random.nextDouble() <= math.min(perUserSampleRate, perThingSampleRate)) {
>>       // We ignore the original interaction value and create a binary 0-1 matrix
>>       // as we only consider whether interactions happened or did not happen
>>       downsampledBlock(userIndex, elem.index) = 1
>>     }
>>   }
>> 
>> 
> 




Re: [jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-06-10 Thread Sebastian Schelter
Oh good catch! I had an extra binarize method before, so that the data 
was already binary. I merged that into the downsample code and must have 
overlooked that thing. You are right, numNonZeros is the way to go!
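
The difference is easy to see on a row with non-binary values. The Java sketch
below is an illustration only: SampleRateDemo and its methods are made-up
names, and the cast to double in sampleRate is an assumption, since the types
in the quoted Scala excerpt are not shown. With two integer operands the
division would truncate to 0 or 1.

```java
// Hypothetical helper contrasting "sum of strengths" with "number of interactions".
class SampleRateDemo {
  // Sum of interaction strengths; equals the interaction count only for 0/1 data.
  static double sum(double[] row) {
    double total = 0.0;
    for (double v : row) {
      total += v;
    }
    return total;
  }

  // Number of interactions, regardless of their strength.
  static int numNonZero(double[] row) {
    int count = 0;
    for (double v : row) {
      if (v != 0.0) {
        count++;
      }
    }
    return count;
  }

  // min(max, n) / n as a fraction in (0, 1]; yields NaN for n == 0,
  // so callers should skip rows without interactions.
  static double sampleRate(int maxNumInteractions, int numInteractions) {
    return Math.min(maxNumInteractions, numInteractions) / (double) numInteractions;
  }

  public static void main(String[] args) {
    double[] row = {5.0, 0.0, 3.0, 1.0};
    System.out.println(sum(row));        // 9.0
    System.out.println(numNonZero(row)); // 3
    System.out.println(sampleRate(2, numNonZero(row)));
  }
}
```

For this row the sum is 9.0 although the user has only 3 interactions, so a
sample rate derived from the sum would be too low.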



On 06/10/2014 01:11 AM, Ted Dunning wrote:

Sounds like a very plausible root cause.





On Mon, Jun 9, 2014 at 4:03 PM, Pat Ferrel (JIRA)  wrote:



 [
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14025893#comment-14025893
]

Pat Ferrel commented on MAHOUT-1464:


seems like the downsampleAndBinarize method is returning the wrong values.
It is actually summing the values where it should be counting the non-zero
elements?

 // Downsample the interaction vector of each user
 for (userIndex <- 0 until keys.size) {

   val interactionsOfUser = block(userIndex, ::) // this is a Vector
   // if the values are non-boolean the sum will not be the number of
   // interactions it will be a sum of strength-of-interaction, right?
   // val numInteractionsOfUser = interactionsOfUser.sum // doesn't this sum strength of interactions?
   val numInteractionsOfUser = interactionsOfUser.getNumNonZeroElements() // should do this I think

   val perUserSampleRate = math.min(maxNumInteractions, numInteractionsOfUser) / numInteractionsOfUser

   interactionsOfUser.nonZeroes().foreach { elem =>
     val numInteractionsWithThing = numInteractions(elem.index)
     val perThingSampleRate = math.min(maxNumInteractions, numInteractionsWithThing) / numInteractionsWithThing

     if (random.nextDouble() <= math.min(perUserSampleRate, perThingSampleRate)) {
       // We ignore the original interaction value and create a binary 0-1 matrix
       // as we only consider whether interactions happened or did not happen
       downsampledBlock(userIndex, elem.index) = 1
     }
   }





Re: [jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-06-09 Thread Ted Dunning
Sounds like a very plausible root cause.





On Mon, Jun 9, 2014 at 4:03 PM, Pat Ferrel (JIRA)  wrote:

>
> [
> https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14025893#comment-14025893
> ]
>
> Pat Ferrel commented on MAHOUT-1464:
> 
>
> seems like the downsampleAndBinarize method is returning the wrong values.
> It is actually summing the values where it should be counting the non-zero
> elements?
>
> // Downsample the interaction vector of each user
> for (userIndex <- 0 until keys.size) {
>
>   val interactionsOfUser = block(userIndex, ::) // this is a Vector
>   // if the values are non-boolean the sum will not be the number of
>   // interactions it will be a sum of strength-of-interaction, right?
>   // val numInteractionsOfUser = interactionsOfUser.sum // doesn't this sum strength of interactions?
>   val numInteractionsOfUser = interactionsOfUser.getNumNonZeroElements() // should do this I think
>
>   val perUserSampleRate = math.min(maxNumInteractions, numInteractionsOfUser) / numInteractionsOfUser
>
>   interactionsOfUser.nonZeroes().foreach { elem =>
>     val numInteractionsWithThing = numInteractions(elem.index)
>     val perThingSampleRate = math.min(maxNumInteractions, numInteractionsWithThing) / numInteractionsWithThing
>
>     if (random.nextDouble() <= math.min(perUserSampleRate, perThingSampleRate)) {
>       // We ignore the original interaction value and create a binary 0-1 matrix
>       // as we only consider whether interactions happened or did not happen
>       downsampledBlock(userIndex, elem.index) = 1
>     }
>   }
>
>


Re: [jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-06-02 Thread Sebastian Schelter
The important thing here is that we test the code on a sufficiently large
dataset on a real cluster. Take that on, if you want!
On 02.06.2014 at 20:08, "Pat Ferrel (JIRA)" wrote:

>
> [
> https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14015667#comment-14015667
> ]
>
> Pat Ferrel commented on MAHOUT-1464:
> 
>
> [~ssc] Should I reassign to me for now so we can get this committed?
>
> > Cooccurrence Analysis on Spark
> > --
> >
> > Key: MAHOUT-1464
> > URL: https://issues.apache.org/jira/browse/MAHOUT-1464
> > Project: Mahout
> >  Issue Type: Improvement
> >  Components: Collaborative Filtering
> > Environment: hadoop, spark
> >Reporter: Pat Ferrel
> >Assignee: Sebastian Schelter
> > Fix For: 1.0
> >
> > Attachments: MAHOUT-1464.patch, MAHOUT-1464.patch,
> MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch,
> run-spark-xrsj.sh
> >
> >
> > Create a version of Cooccurrence Analysis (RowSimilarityJob with LLR)
> that runs on Spark. This should be compatible with Mahout Spark DRM DSL so
> a DRM can be used as input.
> > Ideally this would extend to cover MAHOUT-1422. This cross-cooccurrence
> has several applications including cross-action recommendations.
>
>
>
> --
> This message was sent by Atlassian JIRA
> (v6.2#6252)
>


Re: [jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-04-17 Thread Pat Ferrel
I have no trouble reading from HDFS using the spark-shell. I assume I would 
also have no trouble writing but that is using the basic shell that comes with 
Spark.

scala> val textFile = sc.textFile("xrsj/ratings_data.txt")
scala> textFile.count()

This works with local, pseudo-cluster, or even full cluster. I just can’t write 
using the RSJ code. 

Are you using your custom mahout+spark Scala shell on github, doing a writeDRM? 
At home you are using cdh 4.3.2 on a single machine pseudo-cluster? Which 
versions of hadoop and spark are you running? Did you install spark outside of 
cdh? What os?

If nothing else I can try to duplicate the environment. We know your writeDRM 
works so if I can duplicate that I can start debugging the RSJ stuff.

BTW data for the RSJ code is here: 
https://cloud.occamsmachete.com/public.php?service=files&t=0011a9651691ee38e905a36e99a0f125

On Apr 17, 2014, at 1:23 PM, Dmitriy Lyubimov (JIRA)  wrote:


   [ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13973347#comment-13973347
 ] 

Dmitriy Lyubimov commented on MAHOUT-1464:
--

Hm. At home i don't have any trouble reading/writing from/to hdfs. 

There are some minor differences in configuration plus i am running hdfs cdh 
4.3.2 at home vs. 4.3.0 at work computer. That's the only difference. 

(some patchlevel specific?)






Re: [jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-04-16 Thread Dmitriy Lyubimov
I actually see this behavior too on occasion -- hanging on write to HDFS
in front-end.

So i am looking into it.

My working hypothesis is that it is caused by the front-end Hadoop dependencies,
during the HDFS moves and renames that the front end does once all partitions
are generated. The backend seems to be able to write files just fine.


Re: [jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-04-14 Thread Dmitriy Lyubimov
PS like i said, the "Client" feature only appeared in 0.9. Nobody missed it
before that and it never was a prerequisite to run anything.


On Mon, Apr 14, 2014 at 2:14 PM, Dmitriy Lyubimov (JIRA) wrote:

>
> [
> https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968849#comment-13968849]
>
> Dmitriy Lyubimov commented on MAHOUT-1464:
> --
>
>
>
> IDEA is the driver, but the output is written by the Spark workers. That is
> not the same environment and, in most cases, not the same machine, just as
> with MR reducers. Unless it is the "local" master URL, which I assume it was
> not.
>
>
> This is strange. I can, was able to, and will be able to. Why wouldn't it be
> able to, unless there are network or security issues? There's nothing
> fundamentally different between reading/writing HDFS from a worker process
> or any other process.
>
>
>
> No. The Spark Client is about shipping the driver and having it run somewhere
> else. It is as if somebody were running the mahout CLI command on one of the
> worker nodes. That is it. It knows nothing about HDFS, or even what the
> driver program is going to do. One might use the Client code to print out
> "Hello, World" and exit on one of the worker nodes; the Client wouldn't
> know or care. Using a worker to run driver programs is all it does.
>
>
>
>


Re: [jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-04-14 Thread Dmitriy Lyubimov
inline


On Mon, Apr 14, 2014 at 11:21 AM, Pat Ferrel (JIRA)  wrote:

>
> [
> https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968613#comment-13968613]
>
> Pat Ferrel commented on MAHOUT-1464:
> 
>
> @Dmitriy, no clue what email you are talking about, you have written a lot
> lately. Where is it, on a Jira?
>
no, on @dev... basically you want to run it as a standalone application
(just like SparkPI example). The easiest way to do it is just import all
mahout tree into idea and launch Sebastian's driver program directly, that
much should work -- especially since you only care about local mode in fact
(just to be clear, "local" master means same jvm, single thread, really
useful for debugging only).

>
> I did my setup and tried launching with Hadoop and Mahout running locally
> (MAHOUT_LOCAL=true),
>
This environment variable would have no bearing on a Spark program. The only
thing that is important is the master URL, per the above.


> One localhost instance of Spark, passing in the 'mvn package' mahout spark
> jar from the localfs and pointing at data on the localfs.  This is per
> instructions of the Spark site. There is no firewall issue since it is
> always localhost talking to localhost.
>

You need to be a bit more specific here.

Yes, you can run Spark as a single-node cluster (just like a Hadoop single-node
cluster), but that would still be the "standalone" master, not "local".
"local", as I indicated, is entirely the same JVM, single thread; it does not
require starting any additional Spark processes.

As long as you want "standalone" (i.e. real thing, albeit single-node) you
need not use Client. It won't work. Launch program directly, just like they
do with examples such as SparkPi. this Client thing will not work for our
Mahout programs without additional considerations.
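
To spell out the distinction between the two kinds of master URL, here is a
small Java sketch. MasterUrlDemo and describe are made-up names for
illustration and are not part of Spark or Mahout; the only facts assumed are
the URL forms themselves ("local", "local[N]" and "spark://host:port").

```java
// Hypothetical helper for illustration; it is not part of Spark or Mahout.
class MasterUrlDemo {
  static String describe(String master) {
    if (master.equals("local") || master.startsWith("local[")) {
      // "local" runs one worker thread, "local[N]" runs N, all in the driver's JVM
      return "local: driver and tasks share one JVM, no Spark daemons needed";
    }
    if (master.startsWith("spark://")) {
      // standalone mode, even when master and worker run on a single machine
      return "standalone: driver connects to a separately started master and workers";
    }
    return "other cluster manager";
  }

  public static void main(String[] args) {
    System.out.println(describe("local"));
    System.out.println(describe("spark://localhost:7077"));
  }
}
```

A single-node setup reached through spark://localhost:7077 therefore still
counts as "standalone", because the work runs in separate Spark processes.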


>
> Anyway if I could find your "running mahout on spark" email it would
> probably explain what I'm doing wrong.
>
> You did see I was using Spark 0.9.1?
>
In all likelihood this should be fine if you also change the dependency in the
root pom.xml and recompile with it. Otherwise there's no way of reliably
telling whether different versions on the client and the backend may trigger
incompatibilities, other than trying (e.g. if they changed the akka or netty
version between 0.9.0 and 0.9.1).



>