Re: [jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark
Not sure how this relates to the PR.
If you look here you can see all the PR files and diffs from master. Comments
can be attached to the files in question.
https://github.com/apache/mahout/pull/12/files
iterateNonZero is not in question afaik, and is used in a couple places. If
someone wants to write an alternative I’ll be happy to change things.
On Jun 12, 2014, at 10:06 AM, Sebastian Schelter wrote:
Ok, but the current implementation still gives the correct number, as it checks
for accidental zeros.
I think we should add some custom implementations here to not have to go
through the non-zeroes iterator.
--sebastian
On 06/12/2014 06:38 PM, ASF GitHub Bot (JIRA) wrote:
> [ https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14029345#comment-14029345 ]
>
> ASF GitHub Bot commented on MAHOUT-1464:
>
> Github user dlyubimov commented on the pull request:
>
> https://github.com/apache/mahout/pull/12#issuecomment-45915940
>
> fix header to say MAHOUT-1464, then hit close and reopen, it will
> restart the echo.
>
>> Cooccurrence Analysis on Spark
>> --
>>
>> Key: MAHOUT-1464
>> URL: https://issues.apache.org/jira/browse/MAHOUT-1464
>> Project: Mahout
>> Issue Type: Improvement
>> Components: Collaborative Filtering
>> Environment: hadoop, spark
>> Reporter: Pat Ferrel
>> Assignee: Pat Ferrel
>> Fix For: 1.0
>>
>> Attachments: MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch,
>> MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, run-spark-xrsj.sh
>>
>> Create a version of Cooccurrence Analysis (RowSimilarityJob with LLR) that
>> runs on Spark. This should be compatible with Mahout Spark DRM DSL so a DRM
>> can be used as input.
>> Ideally this would extend to cover MAHOUT-1422. This cross-cooccurrence has
>> several applications including cross-action recommendations.
>
> --
> This message was sent by Atlassian JIRA
> (v6.2#6252)
Re: [jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark
The reason is that sparse implementations may have recorded a non-zero that
later got assigned a zero, but they didn't bother to remove the memory cell.
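To make the point about leftover cells concrete, here is a minimal Java sketch.
It does not use Mahout's classes: ToySparseVector and its methods are
hypothetical stand-ins that only mimic the behaviour described, namely that
assigning 0.0 to an existing entry leaves the memory cell in place. Under that
assumption the number of stored cells is only an upper bound, and an exact
count has to skip the stored zeros.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy sparse vector (hypothetical, not Mahout's implementation). */
public class ToySparseVector {
  // index -> value; only cells that were ever written are stored
  private final Map<Integer, Double> cells = new LinkedHashMap<>();

  /** Stores the value; deliberately keeps the cell even when value is 0.0. */
  public void set(int index, double value) {
    cells.put(index, value);
  }

  /** Number of stored cells: only an upper bound on the non-zero count. */
  public int numStoredCells() {
    return cells.size();
  }

  /** Exact count: walks the stored cells and skips the accidental zeros. */
  public int numNonZeroElements() {
    int count = 0;
    for (double v : cells.values()) {
      if (v != 0.0) {
        count++;
      }
    }
    return count;
  }

  public static void main(String[] args) {
    ToySparseVector v = new ToySparseVector();
    v.set(3, 2.5);
    v.set(7, 1.0);
    v.set(3, 0.0); // the cell for index 3 stays allocated
    System.out.println("stored cells: " + v.numStoredCells());
    System.out.println("non-zeros:    " + v.numNonZeroElements());
  }
}
```

In this toy version the two numbers differ as soon as one stored entry is
overwritten with zero, which is the case the extra check guards against.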
Re: [jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark
Ok, but the current implementation still gives the correct number, as it
checks for accidental zeros.
I think we should add some custom implementations here to not have to go
through the non-zeroes iterator.
--sebastian
On 06/12/2014 07:00 PM, Ted Dunning wrote:
The reason is that sparse implementations may have recorded a non-zero that
later got assigned a zero, but they didn't bother to remove the memory cell.
Re: [jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark
The SparkEngine colCounts function was checking for >= 0. But because it was
iterating over the non-zeros only, it never saw an == 0, so the bug didn’t
surface. It’s already been fixed.
The primary question at present is: what should we call colCounts? Currently it
is used in cooccurrence:
val bcastInteractionsPerItemA = drmBroadcast(drmA.colCounts)
Dmitriy wanted you to see if this fits R-like semantics and suggest an
alternative, if possible. I was commenting on the possible Java-related naming,
so ignore any misstatements.
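For readers who have not seen the operation, here is a rough Java sketch of
what a "column counts" function computes, assuming the intended semantics are
"number of rows with an entry greater than zero in each column". It works on a
plain dense array rather than a DRM, and ColCountsSketch and its colCounts
method are made-up names for illustration, not the SparkEngine code.

```java
import java.util.Arrays;

public class ColCountsSketch {
  /** For each column, counts the rows whose entry is strictly greater than zero. */
  public static int[] colCounts(double[][] matrix, int numCols) {
    int[] counts = new int[numCols];
    for (double[] row : matrix) {
      for (int col = 0; col < numCols; col++) {
        // a '>= 0' test here would also count explicit zeros
        if (row[col] > 0) {
          counts[col]++;
        }
      }
    }
    return counts;
  }

  public static void main(String[] args) {
    double[][] a = {
      {1, 0, 3},
      {0, 0, 2},
      {5, 0, 0.5}
    };
    System.out.println(Arrays.toString(colCounts(a, 3)));
  }
}
```

On a dense array the difference between > 0 and >= 0 shows up immediately;
when only stored non-zeros are visited, the two tests can agree by accident.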
Re: [jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark
I'm a bit lost in this discussion. Why do we assume that
getNumNonZeroElements() on a Vector only returns an upper bound? The
code in AbstractVector clearly returns the non-zeros only:
  int count = 0;
  Iterator<Element> it = iterateNonZero();
  while (it.hasNext()) {
    if (it.next().get() != 0.0) {
      count++;
    }
  }
  return count;
On the other hand, the internal code seems broken here, why does
iterateNonZero potentially return 0's?
--sebastian
Re: [jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark
facepalm, missed that. Thanks.
On Jun 10, 2014, at 4:29 PM, Ted Dunning (JIRA) wrote:
[
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14027208#comment-14027208
]
Ted Dunning commented on MAHOUT-1464:
-
Matrix and Vector already have something that can be used:
{code}
Vector counts = x.aggregateColumns(new VectorFunction() {
  @Override
  public double apply(Vector f) {
    return f.aggregate(Functions.PLUS, Functions.greater(0));
  }
});
{code}
Re: [jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark
Hi Pat,
We truncate the indicators to the top-k and you don't want the
self-comparison in there. So I don't see a reason to not exclude it as
early as possible.
--sebastian
On Mon, Jun 9, 2014 at 4:03 PM, Pat Ferrel (JIRA) wrote:
[ https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14025893#comment-14025893 ]
Pat Ferrel commented on MAHOUT-1464:
seems like the downsampleAndBinarize method is returning the wrong values.
It is actually summing the values where it should be counting the non-zero
elements?
  // Downsample the interaction vector of each user
  for (userIndex <- 0 until keys.size) {
    val interactionsOfUser = block(userIndex, ::) // this is a Vector
    // if the values are non-boolean the sum will not be the number of
    // interactions it will be a sum of strength-of-interaction, right?
    // val numInteractionsOfUser = interactionsOfUser.sum // doesn't this sum strength of interactions?
    val numInteractionsOfUser = interactionsOfUser.getNumNonZeroElements() // should do this I think
    val perUserSampleRate = math.min(maxNumInteractions, numInteractionsOfUser) / numInteractionsOfUser
    interactionsOfUser.nonZeroes().foreach { elem =>
      val numInteractionsWithThing = numInteractions(elem.index)
      val perThingSampleRate = math.min(maxNumInteractions, numInteractionsWithThing) / numInteractionsWithThing
      if (random.nextDouble() <= math.min(perUserSampleRate, perThingSampleRate)) {
        // We ignore the original interaction value and create a binary 0-1 matrix
        // as we only consider whether interactions happened or did not happen
        downsampledBlock(userIndex, elem.index) = 1
      }
    }
Re: [jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark
Still getting the wrong values with non-boolean input, so I’ll continue to look
at it.
Another question: computeIndicators seems to exclude self-comparison during A’A
and, of course, not for B’A. Since this returns the indicator matrix for the
general case, shouldn’t it include those values? It seems like they should be
filtered out in the output phase, if anywhere, and then only as an option. If we
were actually returning a matrix product we’d include those.
// exclude co-occurrences of the item with itself
if (crossCooccurrence || thingB != thingA) {
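As a small illustration of what that condition does, here is a Java sketch
over a plain count matrix. It is not the computeIndicators code:
SelfComparisonSketch and indicatorPairs are hypothetical names, and the input
is assumed to be an already computed co-occurrence count matrix. Only the
guard itself is taken from the snippet above.

```java
import java.util.ArrayList;
import java.util.List;

public class SelfComparisonSketch {
  /** Collects "a,b" for every non-zero cell, skipping a == b unless cross-cooccurrence. */
  public static List<String> indicatorPairs(int[][] counts, boolean crossCooccurrence) {
    List<String> pairs = new ArrayList<>();
    for (int thingA = 0; thingA < counts.length; thingA++) {
      for (int thingB = 0; thingB < counts[thingA].length; thingB++) {
        if (counts[thingA][thingB] == 0) {
          continue; // nothing co-occurred for this pair
        }
        // exclude co-occurrences of the item with itself
        if (crossCooccurrence || thingB != thingA) {
          pairs.add(thingA + "," + thingB);
        }
      }
    }
    return pairs;
  }

  public static void main(String[] args) {
    int[][] counts = { {2, 1}, {1, 3} };
    System.out.println("A'A style: " + indicatorPairs(counts, false));
    System.out.println("B'A style: " + indicatorPairs(counts, true));
  }
}
```

With the flag off the diagonal pairs are dropped before anything else sees
them; moving the filter to the output phase would mean returning all pairs
here and dropping the diagonal later.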
Re: [jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark
Oh good catch! I had an extra binarize method before, so that the data
was already binary. I merged that into the downsample code and must have
overlooked that thing. You are right, numNonZeros is the way to go!
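A tiny Java sketch of why the two quantities differ once the input is no
longer binary. The class and method names (SampleRateSketch, sum, numNonZero,
sampleRate) are invented for this illustration and the numbers are made up;
only the formula min(max, n) / n comes from the downsampling code under
discussion. The rate is computed in floating point on purpose, since an
integer count divided by an integer would truncate.

```java
public class SampleRateSketch {
  /** Sum of the values: the total interaction strength. */
  public static double sum(double[] xs) {
    double total = 0.0;
    for (double x : xs) {
      total += x;
    }
    return total;
  }

  /** Number of non-zero values: the number of interactions. */
  public static int numNonZero(double[] xs) {
    int count = 0;
    for (double x : xs) {
      if (x != 0.0) {
        count++;
      }
    }
    return count;
  }

  /** min(max, n) / n, using floating-point division. */
  public static double sampleRate(int maxNumInteractions, double numInteractions) {
    return Math.min(maxNumInteractions, numInteractions) / numInteractions;
  }

  public static void main(String[] args) {
    double[] interactionsOfUser = {5.0, 0.0, 3.0, 4.0}; // non-boolean strengths
    int maxNumInteractions = 3;
    System.out.println("rate from sum:   " + sampleRate(maxNumInteractions, sum(interactionsOfUser)));
    System.out.println("rate from count: " + sampleRate(maxNumInteractions, numNonZero(interactionsOfUser)));
  }
}
```

For binary input the sum and the count coincide, which is why the difference
only shows up with non-boolean values.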
Re: [jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark
Sounds like a very plausible root cause.
Re: [jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark
The important thing here is that we test the code on a sufficiently large
dataset on a real cluster. Take that on, if you want!
On 02.06.2014 at 20:08, "Pat Ferrel (JIRA)" wrote:
> [ https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14015667#comment-14015667 ]
>
> Pat Ferrel commented on MAHOUT-1464:
>
> [~ssc] Should I reassign to me for now so we can get this committed?
>
>> Cooccurrence Analysis on Spark
>> --
>>
>> Key: MAHOUT-1464
>> URL: https://issues.apache.org/jira/browse/MAHOUT-1464
>> Project: Mahout
>> Issue Type: Improvement
>> Components: Collaborative Filtering
>> Environment: hadoop, spark
>> Reporter: Pat Ferrel
>> Assignee: Sebastian Schelter
>> Fix For: 1.0
>>
>> Attachments: MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch,
>> MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, run-spark-xrsj.sh
>>
>> Create a version of Cooccurrence Analysis (RowSimilarityJob with LLR) that
>> runs on Spark. This should be compatible with Mahout Spark DRM DSL so a DRM
>> can be used as input.
>> Ideally this would extend to cover MAHOUT-1422. This cross-cooccurrence has
>> several applications including cross-action recommendations.
>
> --
> This message was sent by Atlassian JIRA
> (v6.2#6252)
Re: [jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark
I have no trouble reading from HDFS using the spark-shell. I assume I would
also have no trouble writing but that is using the basic shell that comes with
Spark.
scala> val textFile = sc.textFile("xrsj/ratings_data.txt")
scala> textFile.count()
This works with local, pseudo-cluster, or even full cluster. I just can’t write
using the RSJ code.
Are you using your custom mahout+spark Scala shell on github, doing a writeDRM?
At home you are using CDH 4.3.2 on a single-machine pseudo-cluster? Which
versions of Hadoop and Spark are you running? Did you install Spark outside of
CDH? What OS?
If nothing else I can try to duplicate the environment. We know your writeDRM
works so if I can duplicate that I can start debugging the RSJ stuff.
BTW data for the RSJ code is here:
https://cloud.occamsmachete.com/public.php?service=files&t=0011a9651691ee38e905a36e99a0f125
On Apr 17, 2014, at 1:23 PM, Dmitriy Lyubimov (JIRA) wrote:
[
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13973347#comment-13973347
]
Dmitriy Lyubimov commented on MAHOUT-1464:
--
Hm. At home i don't have any trouble reading/writing from/to hdfs.
There are some minor differences in configuration plus i am running hdfs cdh
4.3.2 at home vs. 4.3.0 at work computer. That's the only difference.
(some patchlevel specific?)
Re: [jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark
I actually see this behavior too on occasion: hanging on write to HDFS in the
front end. So i am looking into it. My working hypothesis is that the cause is
the front-end Hadoop dependencies, which come into play during the HDFS moves
and renames that the front end does once all partitions are generated. The
backend seems to be able to write files just fine.
Re: [jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark
PS like i said, the "Client" feature only appeared in 0.9. Nobody missed it
before that and it never was a prerequisite to run anything.
On Mon, Apr 14, 2014 at 2:14 PM, Dmitriy Lyubimov (JIRA) wrote:
> [ https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968849#comment-13968849 ]
>
> Dmitriy Lyubimov commented on MAHOUT-1464:
> --
>
> IDEA is the driver, but output is written by spark workers. Not the same
> environment, and in most cases, not the same machine. Just like it happens
> for MR reducers. Unless it is "local" master url. Which i assume it was
> not.
>
> This is strange. I can, was able to and will be able to. Why wouldn't it be
> able to? Unless there are network or security issues. There's nothing
> fundamentally different between reading/writing hdfs from a worker process
> or any other process.
>
> No. Spark client is about shipping the driver and having it run somewhere
> else. It is as if somebody was running a mahout cli command on one of the
> worker nodes. This is it. It knows nothing about hdfs, nor even what the
> driver program is going to do. One might use the Client code to print out
> "Hello, World" and exit on some of the worker nodes, the Client wouldn't
> know or care. Using a worker to run driver programs, that's all it does.
Re: [jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark
inline
On Mon, Apr 14, 2014 at 11:21 AM, Pat Ferrel (JIRA) wrote:
> [ https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968613#comment-13968613 ]
>
> Pat Ferrel commented on MAHOUT-1464:
>
> @Dmitriy, no clue what email you are talking about, you have written a lot
> lately. Where is it, on a Jira?
>
no, on @dev... basically you want to run it as a standalone application (just
like SparkPI example). The easiest way to do it is just import all mahout tree
into idea and launch Sebastian's driver program directly, that much should
work -- especially since you only care about local mode in fact (just to be
clear, "local" master means same jvm, single thread, really useful for
debugging only).
>
> I did my setup and tried launching with Hadoop and Mahout running locally
> (MAHOUT_LOCAL=true),
>
this environment variable would have no bearing on spark program. The only
thing that is important is master url per above.
>
> One localhost instance of Spark, passing in the 'mvn package' mahout spark
> jar from the localfs and pointing at data on the localfs. This is per
> instructions of the Spark site. There is no firewall issue since it is
> always localhost talking to localhost.
>
You need to be a bit more specific here. Yes you can run spark as a single
node cluster (just like hadoop single node cluster), but that would be still
"standalone" master, not "local". "local" is as i indicated, is totally same
jvm, single thread, it does not require starting any additional spark
processes. As long as you want "standalone" (i.e. real thing, albeit
single-node) you need not use Client. It won't work. Launch program directly,
just like they do with examples such as SparkPi. this Client thing will not
work for our Mahout programs without additional considerations.
>
> Anyway if I could find your "running mahout on spark" email it would
> probably explain what I'm doing wrong.
>
> You did see I was using Spark 0.9.1?
>
In all likelihood this should be fine if you also change dependency and
recompile with it in root pom.xml. Otherwise there's no way of reliably
telling if different versions on client on backend may trigger
incompatibilities other than trying. (e.g. if they changed akka or netty
version between 0.9.0 and 0.9.1).
