Re: groupBy gives non deterministic results
Hi, I am using Spark 1.0.0. The bug is fixed in 1.0.1. Hao
Re: groupBy gives non deterministic results
Ah, thank you. I did not notice that.
Re: groupBy gives non deterministic results
Thank you for your replies. More details here:

The program is executed in local mode (single node), with the default environment parameters. The test code and the result are in this gist: https://gist.github.com/coderh/0147467f0b185462048c

Here are the first 10 lines of the data, 3 fields per row, delimited by ";":

3801959;11775022;118
3801960;14543202;118
3801984;11781380;20
3801984;13255417;20
3802003;11777557;91
3802055;11781159;26
3802076;11782793;102
3802086;17881551;102
3802087;19064728;99
3802105;12760994;99
...

There are 27 partitions (small files). The total size is about 100 MB.

We find that this problem is most probably caused by the bug SPARK-2043: https://issues.apache.org/jira/browse/SPARK-2043

Could someone give more details on this bug? The pull request says:

"The current implementation reads one key with the next hash code as it finishes reading the keys with the current hash code, which may cause it to miss some matches of the next key. This can cause operations like join to give the wrong result when reduce tasks spill to disk and there are hash collisions, as values won't be matched together. This PR fixes it by not reading in that next key, using a peeking iterator instead."

I don't understand why reading a key with the next hash code will cause it to miss some matches of the next key. If someone could show me some code to dig in, it would be highly appreciated. =)

Thank you.

Hao.
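For reference, here is a minimal standalone sketch of the idea behind the fix (not the actual Spark code, which, if I read the PR right, lives in ExternalAppendOnlyMap): a buffered iterator lets the merge peek at the next key's hash code and stop exactly at the group boundary, instead of consuming the first element of the next group and risking losing its matches.

// A minimal sketch of the peeking-iterator idea, independent of Spark.
// 'buffered' turns an Iterator into a BufferedIterator whose 'head' peeks
// at the next element without consuming it, so a merge loop can stop at a
// hash-code boundary and leave the next group intact for the next pass.
object PeekingIteratorSketch {
  def main(args: Array[String]): Unit = {
    // pairs pre-sorted by key hash code, as in the spilled streams
    val pairs = Seq(("a", 1), ("a", 2), ("b", 3), ("b", 4)).sortBy(_._1.hashCode)
    val it = pairs.iterator.buffered

    val currentHash = it.head._1.hashCode
    val currentGroup = scala.collection.mutable.ArrayBuffer[(String, Int)]()
    while (it.hasNext && it.head._1.hashCode == currentHash) {
      currentGroup += it.next()
    }
    println("current group: " + currentGroup)   // all pairs for the first hash code
    println("left untouched: " + it.toList)     // the next group has not been consumed
  }
}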
groupBy gives non deterministic results
Hi,

I have a key-value RDD. After a groupBy, I tried to count the rows, but the result is not unique; it is non-deterministic. Here is the test code:

val step1 = ligneReceipt_cleTable.persist
val step2 = step1.groupByKey

val s1size = step1.count
val s2size = step2.count

val t = step2 // rdd after groupBy
val t1 = t.count
val t2 = t.count
val t3 = t.count
val t4 = t.count
val t5 = t.count
val t6 = t.count
val t7 = t.count
val t8 = t.count

println("s1size = " + s1size)
println("s2size = " + s2size)
println("1 = " + t1)
println("2 = " + t2)
println("3 = " + t3)
println("4 = " + t4)
println("5 = " + t5)
println("6 = " + t6)
println("7 = " + t7)
println("8 = " + t8)

Here are the results:

s1size = 5338864
s2size = 5268001
1 = 5268002
2 = 5268001
3 = 5268001
4 = 5268002
5 = 5268001
6 = 5268002
7 = 5268002
8 = 5268001

Even if the difference is just one row, that's annoying. Any idea?

Thank you.
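For what it's worth, one way to cross-check the group count without relying on groupByKey is to count distinct keys directly. A sketch, reusing step1 from above (the other variable names are just for illustration):

// Cross-check sketch: the number of groups after groupByKey should equal the
// number of distinct keys, so comparing these three counts over several runs
// shows whether groupByKey itself is losing or duplicating a group.
val distinctKeys = step1.keys.distinct.count()
val viaReduce    = step1.mapValues(_ => 1L).reduceByKey(_ + _).count()
val viaGroup     = step1.groupByKey.count()
println("distinct keys   = " + distinctKeys)
println("via reduceByKey = " + viaReduce)
println("via groupByKey  = " + viaGroup)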
Re: groupBy gives non deterministic results
Update: I just tested with HashPartitioner(8) and counted the rows in each partition across eight runs:

List((0,657824), (1,658549), (2,659199), (3,658684), (4,659394), (5,657591), (6,658327), (7,658434))
List((0,657824), (1,658549), (2,659199), (3,658684), (4,659394), (5,657594), (6,658326), (7,658434))
List((0,657824), (1,658549), (2,659199), (3,658684), (4,659394), (5,657592), (6,658326), (7,658435))
List((0,657824), (1,658549), (2,659199), (3,658684), (4,659394), (5,657591), (6,658326), (7,658434))
List((0,657824), (1,658549), (2,659199), (3,658684), (4,659394), (5,657592), (6,658326), (7,658435))
List((0,657824), (1,658549), (2,659199), (3,658684), (4,659394), (5,657592), (6,658326), (7,658435))
List((0,657824), (1,658549), (2,659199), (3,658684), (4,659394), (5,657592), (6,658326), (7,658435))
List((0,657824), (1,658549), (2,659199), (3,658684), (4,659394), (5,657591), (6,658326), (7,658435))

The result is not identical across executions: the counts for partitions 5, 6 and 7 vary from run to run.
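The per-partition counts above can be produced with something like the following sketch (standard Spark APIs; step2 is the grouped RDD from the first post):

// Partition the grouped RDD into 8 hash partitions and count the rows that
// land in each one; collect brings the small (index, count) list to the driver.
import org.apache.spark.HashPartitioner

val counts = step2
  .partitionBy(new HashPartitioner(8))
  .mapPartitionsWithIndex((idx, iter) => Iterator((idx, iter.size)), preservesPartitioning = true)
  .collect()
  .toList
println(counts)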
sbt directory missed
Hi,

I have started an EC2 cluster with Spark by running the spark-ec2 script. Just a little confused: I cannot find the sbt/ directory under spark/. I have checked spark-version; it's 1.0.0 (the default). When I was working with 0.9.x, sbt/ was there. Has the script changed in 1.0.x? I cannot find any change log on this, or maybe I am missing something. Certainly, I can download sbt and make things work; I just want to make things clear.

Thank you.

Here is the file list of spark/:

root@ip-10-81-154-223:~# ls -l spark
total 384
drwxrwxr-x 10 1000 1000   4096 Jul 28 14:58 .
drwxr-xr-x 20 root root   4096 Jul 28 14:58 ..
drwxrwxr-x  2 1000 1000   4096 Jul 28 13:34 bin
-rw-rw-r--  1 1000 1000 281471 May 26 07:02 CHANGES.txt
drwxrwxr-x  2 1000 1000   4096 Jul 28 08:22 conf
drwxrwxr-x  4 1000 1000   4096 May 26 07:02 ec2
drwxrwxr-x  3 1000 1000   4096 May 26 07:02 examples
drwxrwxr-x  2 1000 1000   4096 May 26 07:02 lib
-rw-rw-r--  1 1000 1000  29983 May 26 07:02 LICENSE
drwxr-xr-x  2 root root   4096 Jul 28 14:42 logs
-rw-rw-r--  1 1000 1000  22559 May 26 07:02 NOTICE
drwxrwxr-x  6 1000 1000   4096 May 26 07:02 python
-rw-rw-r--  1 1000 1000   4221 May 26 07:02 README.md
-rw-rw-r--  1 1000 1000     35 May 26 07:02 RELEASE
drwxrwxr-x  2 1000 1000   4096 May 26 07:02 sbin
Re: sbt directory missed
Update: I just checked the Python launch script. When retrieving Spark, it refers to this script: https://github.com/mesos/spark-ec2/blob/v3/spark/init.sh where each version number is mapped to a tar file:

0.9.2)
  if [[ $HADOOP_MAJOR_VERSION == 1 ]]; then
    wget http://s3.amazonaws.com/spark-related-packages/spark-0.9.2-bin-hadoop1.tgz
  else
    wget http://s3.amazonaws.com/spark-related-packages/spark-0.9.2-bin-cdh4.tgz
  fi
  ;;
1.0.0)
  if [[ $HADOOP_MAJOR_VERSION == 1 ]]; then
    wget http://s3.amazonaws.com/spark-related-packages/spark-1.0.0-bin-hadoop1.tgz
  else
    wget http://s3.amazonaws.com/spark-related-packages/spark-1.0.0-bin-cdh4.tgz
  fi
  ;;
1.0.1)
  if [[ $HADOOP_MAJOR_VERSION == 1 ]]; then
    wget http://s3.amazonaws.com/spark-related-packages/spark-1.0.1-bin-hadoop1.tgz
  else
    wget http://s3.amazonaws.com/spark-related-packages/spark-1.0.1-bin-cdh4.tgz
  fi
  ;;

I just checked the last three tar files. I found the sbt/ directory, as well as many other directories like bagel, mllib, etc., in the 0.9.2 tar file. However, they are not in the 1.0.0 and 1.0.1 tar files. I am not sure that the 1.0.x versions are mapped to the correct tar files.
Re: sbt directory missed
Thank you for your reply. I need sbt for packaging my project and then submitting it. Could you tell me how to run a Spark project on the 1.0 AMI without sbt?

I don't understand why 1.0 only contains the prebuilt packages. I don't think it makes sense, since sbt is essential. Users have to download sbt or clone the GitHub repo, whereas on the 0.9 AMI sbt is pre-installed, and a command like

$ sbt/sbt package run

could do the job.

Thanks. =)
Re: implicit ALS dataSet
Hi,

The real-world dataset is a bit larger, so I tested on the MovieLens data set and found the same results:

alpha  lambda  rank  top1  top5  EPR_in   EPR_out
40     0.001   50    297   559   0.05855  0.17299
40     0.01    50    295   559   0.05854  0.17298
40     0.1     50    296   560   0.05846  0.17287
40     1       50    309   564   0.05819  0.17227
40     25      50    287   537   0.05699  0.14855
40     50      50    267   496   0.05795  0.13389
40     100     50    247   444   0.06504  0.11920
40     200     50    145   306   0.09558  0.11388
40     300     50    77    178   0.11340  0.12264

To be clear, there are 1650 items in this MovieLens data set. Top1 and Top5 in the table are the number of distinct items that appear at rank 1 and within the top 5 of the per-user preference lists produced by ALS. Top1, Top5 and EPR_in are computed on the training set; only EPR_out is on the test set. For Top1 and Top5, all items are taken into account, whether purchased or not.

The table shows that a small lambda (up to 1) always leads to overfitting, while a big lambda like 300 removes the overfitting but leaves very few distinct items at the top 1 and top 5 of the preference lists (not personalized).
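For context, the sweep behind the table can be set up with MLlib's implicit ALS along these lines. This is only a sketch: the file path and field layout are assumptions (a MovieLens-style "user, item, count, timestamp" file), the iteration count is arbitrary, and the EPR / top-k evaluation itself is not shown.

// Sketch of a lambda sweep with implicit ALS on MovieLens-style data.
import org.apache.spark.mllib.recommendation.{ALS, Rating}

val ratings = sc.textFile("ml-100k/u.data").map { line =>          // hypothetical path
  val Array(user, item, count, _) = line.split("\t")
  Rating(user.toInt, item.toInt, count.toDouble)                   // counts as implicit confidence
}.cache()

val alpha = 40.0
val rank  = 50
for (lambda <- Seq(0.001, 0.01, 0.1, 1.0, 25.0, 50.0, 100.0, 200.0, 300.0)) {
  val model = ALS.trainImplicit(ratings, rank, 20, lambda, alpha)
  // ... rank all items per user with model.predict, then compute the number of
  // distinct top-1 / top-5 items and EPR on the training and test splits ...
  println("trained with lambda = " + lambda)
}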
Re: implicit ALS dataSet
Thank you for your quick reply.

As far as I know, the update does not require negative observations, because the update rule

x_u = (Y^T C^u Y + λI)^-1 Y^T C^u p(u)

can be simplified by taking advantage of its algebraic structure, so negative observations are not needed. This is what I thought the first time I read the paper. What confuses me is that, after that, the paper (in the Discussion section) says:

"Unlike explicit datasets, here *the model should take all user-item preferences as an input, including those which are not related to any input observation (thus hinting to a zero preference).* This is crucial, as the given observations are inherently biased towards a positive preference, and thus do not reflect well the user profile. However, taking all user-item values as an input to the model raises serious scalability issues – the number of all those pairs tends to significantly exceed the input size since a typical user would provide feedback only on a small fraction of the available items. We address this by exploiting the algebraic structure of the model, leading to an algorithm that scales linearly with the input size *while addressing the full scope of user-item pairs* without resorting to any sub-sampling."

If my understanding is right, it seems that we need negative observations as input but do not use them during the update. That seems strange to me, because it would generate far too many user-item pairs, which is not feasible.

Thanks for the confirmation. I will read the ALS implementation for more details.

Hao
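For what it's worth, the algebraic trick the Discussion section alludes to is spelled out in the paper itself (Hu, Koren & Volinsky, "Collaborative Filtering for Implicit Feedback Datasets"). A short restatement, in the notation of the update rule above and not specific to Spark:

\[
  x_u \;=\; \left( Y^\top C^u Y + \lambda I \right)^{-1} Y^\top C^u\, p(u)
\]
\[
  Y^\top C^u Y \;=\; Y^\top Y \;+\; Y^\top \left( C^u - I \right) Y
\]

Here Y^T Y is computed once per sweep and shared by all users, while C^u - I is nonzero only on the items user u actually interacted with; likewise Y^T C^u p(u) only sums over the items with p_ui = 1. So the zero-preference pairs do shape the solution, through the shared Y^T Y term, but they never have to be materialized as explicit negative observations in the input.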