Re: groupBy gives non deterministic results

2014-09-10 Thread redocpot
Hi, I am using Spark 1.0.0. The bug is fixed in 1.0.1. Hao -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/groupBy-gives-non-deterministic-results-tp13698p13864.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: groupBy gives non deterministic results

2014-09-10 Thread redocpot
Ah, thank you. I did not notice that. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/groupBy-gives-non-deterministic-results-tp13698p13871.html

Re: groupBy gives non deterministic results

2014-09-09 Thread redocpot
Thank you for your replies. More details: the program runs in local mode (single node) with default environment parameters. The test code and the result are in this gist: https://gist.github.com/coderh/0147467f0b185462048c Here are the first 10 lines of the data: 3 fields per row, the delimiter

groupBy gives non deterministic results

2014-09-08 Thread redocpot
Hi, I have a key-value RDD called rdd below. After a groupBy, I tried to count rows, but the result is not unique; it is somehow non-deterministic. Here is the test code: val step1 = ligneReceipt_cleTable.persist val step2 = step1.groupByKey val s1size = step1.count val s2size =
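The snippet above is cut off, but the check it describes (total row count versus distinct-key count after groupByKey) can be sketched on plain Scala collections. The names mirror the post (step1, s1size, s2size), while the sample records are hypothetical. On local collections the result is deterministic; the non-determinism reported here was a Spark 1.0.0 bug, fixed in 1.0.1 according to the follow-up message.

```scala
// Hypothetical sample records standing in for ligneReceipt_cleTable.
val records = Seq(("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5))

// Analogue of step1.count: total number of rows.
val s1size = records.size

// Analogue of step1.groupByKey followed by count: one row per distinct key.
val step2 = records.groupBy(_._1).map { case (k, vs) => (k, vs.map(_._2)) }
val s2size = step2.size

println(s"rows = $s1size, keys = $s2size") // rows = 5, keys = 3
```

Running the same groupBy repeatedly on the same input should always yield the same s2size; that invariant is exactly what the post reports as broken.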

Re: groupBy gives non deterministic results

2014-09-08 Thread redocpot
Update: Just tested with HashPartitioner(8) and counted rows on each partition: List((0,657824), (1,658549), (2,659199), (3,658684), (4,659394), *(5,657591*), (*6,658327*), (*7,658434*)), List((0,657824), (1,658549), (2,659199), (3,658684), (4,659394), *(5,657594)*, (6,658326), (*7,658434*)),
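For context on the per-partition counts above: Spark's HashPartitioner places a key in partition nonNegativeMod(key.hashCode, numPartitions). The assignment itself is deterministic, so a correct groupByKey should produce identical counts on every run. A minimal sketch with made-up keys (not the poster's data):

```scala
// How Spark's HashPartitioner assigns a key to one of n partitions.
def nonNegativeMod(x: Int, mod: Int): Int = {
  val raw = x % mod
  if (raw < 0) raw + mod else raw
}

// Made-up keys; the real post counts ~658k rows per partition.
val keys = (1 to 1000).map(i => s"key$i")
val numPartitions = 8

// Tally how many keys land in each partition, as in the post's List((0, n0), ...).
val perPartition = keys
  .groupBy(k => nonNegativeMod(k.hashCode, numPartitions))
  .toSeq.sortBy(_._1)
  .map { case (p, ks) => (p, ks.size) }
```

Because String.hashCode is stable, perPartition is the same on every run; partitions whose counts differ across runs (the starred entries above) therefore point to a bug in the shuffle, not in the partitioner.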

sbt directory missed

2014-07-28 Thread redocpot
Hi, I have started an EC2 cluster using Spark by running the spark-ec2 script. Just a little confused: I cannot find the sbt/ directory under /spark. I have checked the Spark version; it's 1.0.0 (default). When I was working with 0.9.x, sbt/ was there. Was the script changed in 1.0.x? I can not find any

Re: sbt directory missed

2014-07-28 Thread redocpot
Update: Just checked the Python launch script; when retrieving Spark, it refers to this script: https://github.com/mesos/spark-ec2/blob/v3/spark/init.sh where each version number is mapped to a tar file, 0.9.2) if [[ $HADOOP_MAJOR_VERSION == 1 ]]; then wget

Re: sbt directory missed

2014-07-28 Thread redocpot
Thank you for your reply. I need sbt to package my project and then submit it. Could you tell me how to run a Spark project on the 1.0 AMI without sbt? I don't understand why 1.0 only contains the prebuilt packages. I don't think it makes sense, since sbt is essential. A user has to download sbt

Re: implicit ALS dataSet

2014-06-23 Thread redocpot
Hi, The real-world dataset is a bit larger, so I tested on the MovieLens dataset and found the same results:

alpha  lambda  rank  top1  top5  EPR_in   EPR_out
40     0.001   50    297   559   0.05855

Re: implicit ALS dataSet

2014-06-05 Thread redocpot
Thank you for your quick reply. As far as I know, the update does not require negative observations, because the update rule Xu = (Yt Cu Y + lambda*I)^-1 Yt Cu p(u) can be simplified by taking advantage of its algebraic structure, so negative observations are not needed. This is what I think at the
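The algebraic simplification being referred to is the identity Yt Cu Y = Yt Y + Yt (Cu - I) Y from the implicit-feedback ALS formulation (Hu, Koren & Volinsky): since Cu - I is zero for unobserved items, only positive observations enter the second term, and Yt Y is precomputed once and shared across users. A numeric sanity check on toy rank-2 factors (all numbers made up for illustration):

```scala
// 3 items with rank-2 factors; rows of Y are item vectors y_i.
val Y = Array(Array(1.0, 2.0), Array(3.0, 4.0), Array(5.0, 6.0))

// Diagonal of Cu = I + alpha * r_u with alpha = 40: only item 1 observed (r = 1).
val cu = Array(1.0, 41.0, 1.0)

// Computes Yt * diag(w) * Y, i.e. sum_i w_i * y_i * y_i^T.
def gram(w: Array[Double]): Array[Array[Double]] = {
  val g = Array.fill(2, 2)(0.0)
  for (i <- Y.indices; a <- 0 until 2; b <- 0 until 2)
    g(a)(b) += w(i) * Y(i)(a) * Y(i)(b)
  g
}

val dense  = gram(cu)                       // Yt Cu Y, touches every item
val ytY    = gram(Array.fill(3)(1.0))       // Yt Y, user-independent
val delta  = gram(cu.map(_ - 1.0))          // Yt (Cu - I) Y: nonzero only for item 1
val sparse = Array.tabulate(2, 2)((a, b) => ytY(a)(b) + delta(a)(b))
```

In the actual solve one then adds lambda*I and solves the rank-by-rank system for Xu; the win is that the expensive all-items product Yt Y never depends on the user, so per-user work scales with that user's observed items only.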