Re: Duplicate entries in output of mllib column similarities
Great!

Reza

On Tue, May 12, 2015 at 7:42 AM, Richard Bolkey wrote:

> Hi Reza,
>
> That was the fix we needed. After sorting, the transposed entries are gone!
>
> Thanks a bunch,
> rick
Re: Duplicate entries in output of mllib column similarities
Hi Reza,

That was the fix we needed. After sorting, the transposed entries are gone!

Thanks a bunch,
rick

On Sat, May 9, 2015 at 5:17 PM, Reza Zadeh wrote:

> Hi Richard,
> One reason that could be happening is that the rows of your matrix are
> using SparseVectors, but the entries in your vectors aren't sorted by
> index. Is that the case? Sparse Vectors
> <https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/linalg/Vectors.scala>
> need sorted indices.
> Reza
Re: Duplicate entries in output of mllib column similarities
Hi Richard,

One reason that could be happening is that the rows of your matrix are using SparseVectors, but the entries in your vectors aren't sorted by index. Is that the case? Sparse Vectors
<https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/linalg/Vectors.scala>
need sorted indices.

Reza

On Sat, May 9, 2015 at 8:51 AM, Richard Bolkey wrote:

> Hi Reza,
>
> After a bit of digging, I had my previous issue a little bit wrong. We're
> not getting duplicate (i,j) entries, but we are getting transposed entries
> (i,j) and (j,i) with potentially different scores. We assumed the output
> would be a triangular matrix. Still, let me know if that's expected. A
> transposed entry occurs for about 5% of our output entries.
>
> scala> matrix.entries.filter(x => (x.i,x.j) == (22769,539029)).collect()
> res23: Array[org.apache.spark.mllib.linalg.distributed.MatrixEntry] =
> Array(MatrixEntry(22769,539029,0.00453050595770095))
>
> scala> matrix.entries.filter(x => (x.i,x.j) == (539029,22769)).collect()
> res24: Array[org.apache.spark.mllib.linalg.distributed.MatrixEntry] =
> Array(MatrixEntry(539029,22769,0.002265252978850475))
>
> I saved a subset of vectors to object files that replicates the issue.
> It's about 300 MB. Should I try to whittle that down some more? What would
> be the best way to get that to you?
>
> Many thanks,
> Rick
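For readers hitting the same symptom, here is a minimal sketch of the fix Reza describes: putting a sparse row's indices (and their matching values) into ascending order before the vector is built. `sortByIndex` is a hypothetical helper, not an MLlib function, and it is plain Scala so it can be pasted into a Scala or Spark shell as-is.

```scala
// Hypothetical helper (not part of MLlib): reorder parallel index/value
// arrays so that the indices are ascending, as sparse vectors expect.
def sortByIndex(indices: Array[Int], values: Array[Double]): (Array[Int], Array[Double]) = {
  require(indices.length == values.length, "indices and values must have the same length")
  // Pair each index with its value, sort the pairs by index, then split them again.
  val (sortedIdx, sortedVals) = indices.zip(values).sortBy(_._1).unzip
  (sortedIdx, sortedVals)
}

// Illustrative input with out-of-order indices.
val (idx, vals) = sortByIndex(Array(5, 1, 3), Array(0.5, 0.1, 0.3))

// In a Spark shell, the sorted arrays could then be passed on, e.g.
//   org.apache.spark.mllib.linalg.Vectors.sparse(size, idx, vals)
// where `size` is the vector length (an assumption about your data).
```

MLlib's `Vectors.sparse` also has an overload that takes a `Seq[(Int, Double)]`, which appears to sort the pairs by index itself; that is worth confirming against the Spark version in use before relying on it.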
Re: Duplicate entries in output of mllib column similarities
Hi Reza,

After a bit of digging, I had my previous issue a little bit wrong. We're not getting duplicate (i,j) entries, but we are getting transposed entries (i,j) and (j,i) with potentially different scores. We assumed the output would be a triangular matrix. Still, let me know if that's expected. A transposed entry occurs for about 5% of our output entries.

scala> matrix.entries.filter(x => (x.i,x.j) == (22769,539029)).collect()
res23: Array[org.apache.spark.mllib.linalg.distributed.MatrixEntry] =
Array(MatrixEntry(22769,539029,0.00453050595770095))

scala> matrix.entries.filter(x => (x.i,x.j) == (539029,22769)).collect()
res24: Array[org.apache.spark.mllib.linalg.distributed.MatrixEntry] =
Array(MatrixEntry(539029,22769,0.002265252978850475))

I saved a subset of vectors to object files that replicates the issue. It's about 300 MB. Should I try to whittle that down some more? What would be the best way to get that to you?

Many thanks,
Rick

On Thu, May 7, 2015 at 8:58 PM, Reza Zadeh wrote:

> This shouldn't be happening, do you have an example to reproduce it?
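The two `filter` calls above check a single pair by hand. As a rough sketch of how the same check could be made for every entry at once, the snippet below finds all (i,j) entries whose mirror (j,i) is also present. `Entry` is a stand-in for MLlib's `MatrixEntry` so the code runs without Spark, `transposedPairs` is a hypothetical helper, and the third sample entry is made up for illustration; the first two are the values from the message.

```scala
// Stand-in for MLlib's MatrixEntry so the sketch runs without Spark (assumption).
case class Entry(i: Long, j: Long, value: Double)

// Hypothetical helper: return each upper-triangular entry (i < j) together
// with its mirror image (j, i), for the pairs where both are present.
def transposedPairs(entries: Seq[Entry]): Seq[(Entry, Entry)] = {
  // Index the entries by their (row, column) coordinates.
  val byKey = entries.map(e => (e.i, e.j) -> e).toMap
  entries.filter(e => e.i < e.j).flatMap { e =>
    byKey.get((e.j, e.i)).map(mirror => (e, mirror))
  }
}

val sample = Seq(
  Entry(22769L, 539029L, 0.00453050595770095),   // from the message above
  Entry(539029L, 22769L, 0.002265252978850475),  // from the message above
  Entry(1L, 2L, 0.9)                             // made-up entry with no mirror
)
val pairs = transposedPairs(sample)
```

This version works on a local collection. On a full-size `RDD` of entries the same idea would more likely be expressed as a keyed join against the swapped coordinates, which is not shown here.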
Re: Duplicate entries in output of mllib column similarities
This shouldn't be happening, do you have an example to reproduce it?

On Thu, May 7, 2015 at 4:17 PM, rbolkey wrote:

> Hi,
>
> I have a question regarding one of the oddities we encountered while
> running mllib's column similarities operation. When we examine the output,
> we find duplicate matrix entries (the same i,j). Sometimes the entries have
> the same value/similarity score, but they're frequently different too.
>
> Is this a known issue? An artifact of the probabilistic nature of the
> output? Which output score should we trust (lower vs higher one when
> different)? We're using a threshold of 0.3, and running Spark 1.3.1 on a 10
> node cluster.
>
> Thanks
> Rick
Duplicate entries in output of mllib column similarities
Hi,

I have a question regarding one of the oddities we encountered while running mllib's column similarities operation. When we examine the output, we find duplicate matrix entries (the same i,j). Sometimes the entries have the same value/similarity score, but they're frequently different too.

Is this a known issue? An artifact of the probabilistic nature of the output? Which output score should we trust (lower vs higher one when different)? We're using a threshold of 0.3, and running Spark 1.3.1 on a 10 node cluster.

Thanks
Rick

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Duplicate-entries-in-output-of-mllib-column-similarities-tp22807.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
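On the question of which score to trust: the column similarities operation reports cosine similarities between columns, and cosine similarity is symmetric, so one way to settle a disputed pair is to compute the exact value for those two columns directly and compare. The sketch below is plain Scala on dense arrays; `cosine` is a hypothetical helper and the sample columns are made up, so treat it as an illustration rather than how MLlib computes its scores.

```scala
// Hypothetical helper: exact cosine similarity between two equal-length columns.
def cosine(a: Array[Double], b: Array[Double]): Double = {
  require(a.length == b.length, "columns must have the same length")
  val dot = a.zip(b).map { case (x, y) => x * y }.sum
  val normA = math.sqrt(a.map(x => x * x).sum)
  val normB = math.sqrt(b.map(x => x * x).sum)
  // Define the similarity with an all-zero column as 0 to avoid dividing by zero.
  if (normA == 0.0 || normB == 0.0) 0.0 else dot / (normA * normB)
}

// Made-up columns for illustration.
val colA = Array(1.0, 2.0, 0.0, 3.0)
val colB = Array(0.0, 1.0, 4.0, 1.0)

val ab = cosine(colA, colB)
val ba = cosine(colB, colA)
```

Because swapping the arguments gives the same value, two different scores for (i,j) and (j,i) point to a problem with the input rather than to a meaningful choice between a lower and a higher score.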