Re: Duplicate entries in output of mllib column similarities

2015-05-12 Thread Reza Zadeh
Great!
Reza



Re: Duplicate entries in output of mllib column similarities

2015-05-12 Thread Richard Bolkey
Hi Reza,

That was the fix we needed. After sorting, the transposed entries are gone!

Thanks a bunch,
rick



Re: Duplicate entries in output of mllib column similarities

2015-05-09 Thread Reza Zadeh
Hi Richard,
One possible cause is that the rows of your matrix are SparseVectors
whose entries aren't sorted by index. Is that the case? SparseVectors
<https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/linalg/Vectors.scala>
need their indices sorted in increasing order.
Reza
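
Editor's note: a minimal sketch of the kind of fix described above, under the assumption that each row starts out as parallel arrays of indices and values in arbitrary order. The object `SortSparseRow` and the helper `sortByIndex` are hypothetical names used only for illustration and are not part of Spark. The helper does not merge duplicate indices.

```scala
// Hypothetical helper (not part of Spark): reorder a sparse row so that
// its indices are increasing, keeping each value paired with its index.
object SortSparseRow {
  def sortByIndex(indices: Array[Int],
                  values: Array[Double]): (Array[Int], Array[Double]) = {
    require(indices.length == values.length,
      "indices and values must have the same length")
    // Zip so each value travels with its index, then sort on the index.
    val sorted = indices.zip(values).sortBy(_._1)
    (sorted.map(_._1), sorted.map(_._2))
  }

  def main(args: Array[String]): Unit = {
    val (idx, vals) = sortByIndex(Array(7, 2, 5), Array(0.7, 0.2, 0.5))
    println(idx.mkString(","))  // prints 2,5,7
    println(vals.mkString(",")) // prints 0.2,0.5,0.7
    // With Spark on the classpath, the sorted arrays could then be passed to
    // org.apache.spark.mllib.linalg.Vectors.sparse(size, idx, vals).
  }
}
```

The sorting is done in plain Scala so the sketch runs without a cluster; only the final commented step depends on MLlib.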



Re: Duplicate entries in output of mllib column similarities

2015-05-09 Thread Richard Bolkey
Hi Reza,

After a bit of digging, it turns out I described my previous issue slightly
wrong. We're not getting duplicate (i,j) entries, but we are getting
transposed entries (i,j) and (j,i) with potentially different scores. We
assumed the output would be a triangular matrix. Still, let me know if
that's expected. A transposed entry occurs for about 5% of our output entries.

scala> matrix.entries.filter(x => (x.i,x.j) == (22769,539029)).collect()
res23: Array[org.apache.spark.mllib.linalg.distributed.MatrixEntry] =
Array(MatrixEntry(22769,539029,0.00453050595770095))

scala> matrix.entries.filter(x => (x.i,x.j) == (539029,22769)).collect()
res24: Array[org.apache.spark.mllib.linalg.distributed.MatrixEntry] =
Array(MatrixEntry(539029,22769,0.002265252978850475))

I saved a subset of vectors to object files that replicates the issue.
It's about 300 MB. Should I try to whittle that down some more? What would
be the best way to get that to you?

Many thanks,
Rick
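
Editor's note: the two REPL lookups above check one known pair by hand. A rough sketch of how all transposed pairs might be found at once is given below, using plain tuples `(i, j, score)` as a stand-in for `MatrixEntry`. The names `TransposedPairs`, `Entry`, and `findTransposed` are hypothetical, and the sketch works on a local `Seq` rather than an RDD. The third sample entry is invented for illustration.

```scala
// Hypothetical sketch (not part of Spark): find entries that appear as
// both (i, j) and (j, i) in a collection of similarity scores.
object TransposedPairs {
  // (row index, column index, similarity); stands in for MatrixEntry(i, j, value).
  type Entry = (Long, Long, Double)

  // Group entries by the unordered pair {i, j} and keep only the groups
  // that contain more than one entry.
  def findTransposed(entries: Seq[Entry]): Map[(Long, Long), Seq[Entry]] =
    entries
      .groupBy { case (i, j, _) => (math.min(i, j), math.max(i, j)) }
      .filter { case (_, group) => group.size > 1 }

  def main(args: Array[String]): Unit = {
    val entries: Seq[Entry] = Seq(
      (22769L, 539029L, 0.00453050595770095),  // from the REPL output above
      (539029L, 22769L, 0.002265252978850475), // its transposed counterpart
      (1L, 2L, 0.5)                            // invented entry with no counterpart
    )
    findTransposed(entries).foreach { case (pair, group) =>
      println(s"$pair -> ${group.map(_._3).mkString(", ")}")
    }
  }
}
```

Keying on `(min(i, j), max(i, j))` treats (i, j) and (j, i) as the same pair, so any key with more than one entry marks output that is not strictly triangular.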



Re: Duplicate entries in output of mllib column similarities

2015-05-07 Thread Reza Zadeh
This shouldn't be happening. Do you have an example that reproduces it?



Duplicate entries in output of mllib column similarities

2015-05-07 Thread rbolkey
Hi,

I have a question regarding one of the oddities we encountered while running
mllib's column similarities operation. When we examine the output, we find
duplicate matrix entries (the same i,j). Sometimes the entries have the same
value/similarity score, but they're frequently different too.

Is this a known issue? An artifact of the probabilistic nature of the
output? Which output score should we trust (lower vs higher one when
different)? We're using a threshold of 0.3, and running Spark 1.3.1 on a
10-node cluster.

Thanks
Rick



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Duplicate-entries-in-output-of-mllib-column-similarities-tp22807.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org