Re: groupBy gives non deterministic results

2014-09-10 Thread redocpot
Hi, 

I am using Spark 1.0.0. The bug is fixed in 1.0.1.

Hao






Re: groupBy gives non deterministic results

2014-09-10 Thread redocpot
Ah, thank you. I did not notice that.






Re: groupBy gives non deterministic results

2014-09-09 Thread redocpot
Thank you for your replies.

More details here:

The program runs in local mode (single node). Default environment parameters are used.

The test code and the result are in this gist:
https://gist.github.com/coderh/0147467f0b185462048c

Here are the first 10 lines of the data; each row has 3 fields, delimited by ';':

3801959;11775022;118
3801960;14543202;118
3801984;11781380;20
3801984;13255417;20
3802003;11777557;91
3802055;11781159;26
3802076;11782793;102
3802086;17881551;102
3802087;19064728;99
3802105;12760994;99
...

There are 27 partitions (small files). The total size is about 100 MB.

We find that this problem is most probably caused by bug SPARK-2043:
https://issues.apache.org/jira/browse/SPARK-2043

Could someone give more details on this bug?

The pull request says:

The current implementation reads one key with the next hash code as it
finishes reading the keys with the current hash code, which may cause it to
miss some matches of the next key. This can cause operations like join to
give the wrong result when reduce tasks spill to disk and there are hash
collisions, as values won't be matched together. This PR fixes it by not
reading in that next key, using a peeking iterator instead.

I don't understand why reading a key with the next hash code causes it to miss some matches of the next key. If someone could show me some code to dig into, it would be highly appreciated. =)
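
To make the question concrete, here is how I currently picture the peeking-iterator idea. This is only my own sketch (the method name is mine, it is not the actual Spark code from the PR): the merge walks a stream of pairs sorted by key hash code, and uses BufferedIterator.head to detect the end of a hash group without consuming the first pair of the next group.

def groupByHashCode[K, V](pairs: Iterator[(K, V)]): Iterator[Seq[(K, V)]] = {
  val buf = pairs.buffered                      // BufferedIterator: head() peeks without consuming
  new Iterator[Seq[(K, V)]] {
    def hasNext: Boolean = buf.hasNext
    def next(): Seq[(K, V)] = {
      val hash  = buf.head._1.hashCode          // peek at the current hash group
      val group = scala.collection.mutable.ArrayBuffer.empty[(K, V)]
      // the first pair of the NEXT hash group is never consumed here
      while (buf.hasNext && buf.head._1.hashCode == hash) group += buf.next()
      group
    }
  }
}

If the code instead called buf.next() just to look at the first pair of the next hash group, that pair would already be consumed and could never be matched with later pairs sharing its key, which I guess is the wrong-result case the PR describes when reduce tasks spill to disk and hash collisions occur.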

Thank you.

Hao.





groupBy gives non deterministic results

2014-09-08 Thread redocpot
Hi,

I have a key-value RDD (step1 in the code below). After a groupBy, I tried to count the rows, but the result is not unique; it is somehow non-deterministic.

Here is the test code:

  val step1 = ligneReceipt_cleTable.persist
  val step2 = step1.groupByKey
  
  val s1size = step1.count
  val s2size = step2.count

  val t = step2 // rdd after groupBy

  val t1 = t.count
  val t2 = t.count
  val t3 = t.count
  val t4 = t.count
  val t5 = t.count
  val t6 = t.count
  val t7 = t.count
  val t8 = t.count

  println("s1size = " + s1size)
  println("s2size = " + s2size)
  println("1 = " + t1)
  println("2 = " + t2)
  println("3 = " + t3)
  println("4 = " + t4)
  println("5 = " + t5)
  println("6 = " + t6)
  println("7 = " + t7)
  println("8 = " + t8)

Here are the results:

s1size = 5338864
s2size = 5268001
1 = 5268002
2 = 5268001
3 = 5268001
4 = 5268002
5 = 5268001
6 = 5268002
7 = 5268002
8 = 5268001

Even if the difference is just one row, it is annoying.

Any idea?

Thank you.






Re: groupBy gives non deterministic results

2014-09-08 Thread redocpot
Update:

I just tested with HashPartitioner(8) and counted the rows in each partition, over 8 runs:

List((0,657824), (1,658549), (2,659199), (3,658684), (4,659394), (5,657591), (6,658327), (7,658434))
List((0,657824), (1,658549), (2,659199), (3,658684), (4,659394), (5,657594), (6,658326), (7,658434))
List((0,657824), (1,658549), (2,659199), (3,658684), (4,659394), (5,657592), (6,658326), (7,658435))
List((0,657824), (1,658549), (2,659199), (3,658684), (4,659394), (5,657591), (6,658326), (7,658434))
List((0,657824), (1,658549), (2,659199), (3,658684), (4,659394), (5,657592), (6,658326), (7,658435))
List((0,657824), (1,658549), (2,659199), (3,658684), (4,659394), (5,657592), (6,658326), (7,658435))
List((0,657824), (1,658549), (2,659199), (3,658684), (4,659394), (5,657592), (6,658326), (7,658435))
List((0,657824), (1,658549), (2,659199), (3,658684), (4,659394), (5,657591), (6,658326), (7,658435))

The results are not identical from one execution to the next: only the counts of partitions 5, 6 and 7 vary.
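
For reference, the counts above come from something like this (my own sketch; countPerPartition is just a helper name I made up, and step1 is the RDD from my first message):

import org.apache.spark.HashPartitioner
import org.apache.spark.rdd.RDD

// Count the rows of each partition and return (partitionId, count) pairs.
def countPerPartition[K, V](rdd: RDD[(K, V)]): List[(Int, Int)] =
  rdd.mapPartitionsWithIndex((id, iter) => Iterator((id, iter.size)))
     .collect()
     .toList

val grouped = step1.groupByKey(new HashPartitioner(8))
println(countPerPartition(grouped))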






sbt directory missed

2014-07-28 Thread redocpot
Hi, 

I have started an EC2 cluster by running the spark-ec2 script.

Just a little confused: I cannot find the sbt/ directory under spark/.

I have checked the Spark version: it is 1.0.0 (the default). When I was working with 0.9.x, sbt/ was there.

Has the script changed in 1.0.x? I cannot find any changelog about this, or maybe I am missing something.

Certainly, I can download sbt and make things work. I just want to make things clear.

Thank you.

Here is the file list of spark/

root@ip-10-81-154-223:~# ls -l spark
total 384
drwxrwxr-x 10 1000 1000   4096 Jul 28 14:58 .
drwxr-xr-x 20 root root   4096 Jul 28 14:58 ..
drwxrwxr-x  2 1000 1000   4096 Jul 28 13:34 bin
-rw-rw-r--  1 1000 1000 281471 May 26 07:02 CHANGES.txt
drwxrwxr-x  2 1000 1000   4096 Jul 28 08:22 conf
drwxrwxr-x  4 1000 1000   4096 May 26 07:02 ec2
drwxrwxr-x  3 1000 1000   4096 May 26 07:02 examples
drwxrwxr-x  2 1000 1000   4096 May 26 07:02 lib
-rw-rw-r--  1 1000 1000  29983 May 26 07:02 LICENSE
drwxr-xr-x  2 root root   4096 Jul 28 14:42 logs
-rw-rw-r--  1 1000 1000  22559 May 26 07:02 NOTICE
drwxrwxr-x  6 1000 1000   4096 May 26 07:02 python
-rw-rw-r--  1 1000 1000   4221 May 26 07:02 README.md
-rw-rw-r--  1 1000 1000 35 May 26 07:02 RELEASE
drwxrwxr-x  2 1000 1000   4096 May 26 07:02 sbin




Re: sbt directory missed

2014-07-28 Thread redocpot
Update:

I just checked the Python launch script; when retrieving Spark, it refers to this script:
https://github.com/mesos/spark-ec2/blob/v3/spark/init.sh

where each version number is mapped to a tarball:

0.9.2)
  if [[ $HADOOP_MAJOR_VERSION == 1 ]]; then
    wget http://s3.amazonaws.com/spark-related-packages/spark-0.9.2-bin-hadoop1.tgz
  else
    wget http://s3.amazonaws.com/spark-related-packages/spark-0.9.2-bin-cdh4.tgz
  fi
  ;;
1.0.0)
  if [[ $HADOOP_MAJOR_VERSION == 1 ]]; then
    wget http://s3.amazonaws.com/spark-related-packages/spark-1.0.0-bin-hadoop1.tgz
  else
    wget http://s3.amazonaws.com/spark-related-packages/spark-1.0.0-bin-cdh4.tgz
  fi
  ;;
1.0.1)
  if [[ $HADOOP_MAJOR_VERSION == 1 ]]; then
    wget http://s3.amazonaws.com/spark-related-packages/spark-1.0.1-bin-hadoop1.tgz
  else
    wget http://s3.amazonaws.com/spark-related-packages/spark-1.0.1-bin-cdh4.tgz
  fi
  ;;

I just checked the last three tarballs. I find the sbt/ directory and many other directories (bagel, mllib, etc.) in the 0.9.2 tarball. However, they are not in the 1.0.0 and 1.0.1 tarballs.

I am not sure that the 1.0.x versions are mapped to the correct tarballs.






Re: sbt directory missed

2014-07-28 Thread redocpot
Thank you for your reply.

I need sbt to package my project and then submit it.

Could you tell me how to run a Spark project on the 1.0 AMI without sbt?

I don't understand why 1.0 only contains the prebuilt packages. I don't think it makes sense, since sbt is essential.

The user has to download sbt or clone the GitHub repo, whereas on the 0.9 AMI, sbt is pre-installed.

A command like:
$ sbt/sbt package run
could do the job.

Thanks. =)





Re: implicit ALS dataSet

2014-06-23 Thread redocpot
Hi, 

The real-world dataset is somewhat larger, so I tested on the MovieLens dataset and found the same results:


alpha  lambda  rank  top1  top5  EPR_in   EPR_out
40     0.001   50    297   559   0.05855  0.17299
40     0.01    50    295   559   0.05854  0.17298
40     0.1     50    296   560   0.05846  0.17287
40     1       50    309   564   0.05819  0.17227
40     25      50    287   537   0.05699  0.14855
40     50      50    267   496   0.05795  0.13389
40     100     50    247   444   0.06504  0.11920
40     200     50    145   306   0.09558  0.11388
40     300     50    77    178   0.11340  0.12264



To be clear, there are 1650 items in this MovieLens dataset. Top1 and top5 in the table are the number of distinct items appearing at rank 1 and within the top 5 of each user's preference list after ALS has run. Top1, top5 and EPR_in are computed on the training set; only EPR_out is on the test set. For top1 and top5, all items are taken into account, whether purchased or not.

The table shows that a small lambda (≤ 1) always leads to overfitting, while a big lambda like 300 removes the overfitting, but then the number of distinct items at the top 1 and top 5 of the preference lists is very small (not personalized).
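
In case it helps, this is roughly how I compute the top1/top5 columns (my own sketch, not MLlib code; distinctTopItems is a name I made up, and the two factor RDDs come from the trained MatrixFactorizationModel):

import org.apache.spark.rdd.RDD

// Score every (user, item) pair from the learned factors, keep each user's
// k best-ranked items, and count how many distinct items appear overall.
def distinctTopItems(userFeatures: RDD[(Int, Array[Double])],
                     productFeatures: RDD[(Int, Array[Double])],
                     k: Int): Long = {
  // With only 1650 items, the item factors fit comfortably in memory.
  val items = productFeatures.collect()
  userFeatures
    .flatMap { case (_, uf) =>
      items
        .map { case (item, pf) => (item, uf.zip(pf).map { case (a, b) => a * b }.sum) }
        .sortBy(-_._2)               // highest predicted preference first
        .take(k)
        .map(_._1)
    }
    .distinct()
    .count()
}

// e.g. distinctTopItems(model.userFeatures, model.productFeatures, 5)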







Re: implicit ALS dataSet

2014-06-05 Thread redocpot
Thank you for your quick reply.

As far as I know, the update does not require materializing the negative observations, because the update rule

x_u = (Y^T C^u Y + λI)^-1 Y^T C^u p(u)

can be simplified by taking advantage of its algebraic structure (Y^T C^u Y = Y^T Y + Y^T (C^u - I) Y, where Y^T Y is precomputed once and C^u - I is non-zero only for the observed items), so the negative observations are not needed. This is what I thought the first time I read the paper.
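
Concretely, I picture the simplified user update roughly like this (my own sketch using Breeze, not the MLlib implementation; updateUser and the parameter names are mine):

import breeze.linalg.{DenseMatrix, DenseVector}

// Implicit-feedback user update: only the items observed by user u are
// touched. YtY is computed once for all users; the loop adds the
// Yt (Cu - I) Y and Yt Cu p(u) contributions of the observed items.
def updateUser(Y: DenseMatrix[Double],          // item factors, one row per item
               YtY: DenseMatrix[Double],        // Y.t * Y, precomputed once
               observed: Seq[(Int, Double)],    // (itemIndex, r_ui) for user u only
               alpha: Double,
               lambda: Double): DenseVector[Double] = {
  val k = Y.cols
  val A = YtY + DenseMatrix.eye[Double](k) * lambda
  val b = DenseVector.zeros[Double](k)
  for ((i, r) <- observed) {
    val yi = Y(i, ::).t                         // factor vector of item i
    val c  = 1.0 + alpha * r                    // confidence c_ui
    A += (yi * yi.t) * (c - 1.0)                // Yt (Cu - I) Y term
    b += yi * c                                 // Yt Cu p(u) term, p_ui = 1 if observed
  }
  A \ b                                         // x_u = (Yt Cu Y + λI)^-1 Yt Cu p(u)
}

Written this way, the per-user cost depends only on the number of observed items, which I believe is the "algebraic structure" the paper refers to.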

What confuses me is that, after that, the paper (in the Discussion section) says:

Unlike explicit datasets, here *the model should take all user-item
preferences as an input, including those which are not related to any input
observation (thus hinting to a zero preference).* This is crucial, as the
given observations are inherently biased towards a positive preference, and
thus do not reflect well the user profile. 
However, taking all user-item values as an input to the model raises serious
scalability issues – the number of all those pairs tends to significantly
exceed the input size since a typical user would provide feedback only on a
small fraction of the available items. We address this by exploiting the
algebraic structure of the model, leading to an algorithm that scales
linearly with the input size *while addressing the full scope of user-item
pairs* without resorting to any sub-sampling.

If my understanding is right, it seems that we need the negative observations as input, but we do not use them during the update. That seems strange to me, because it would generate far too many user-item pairs, which is not feasible.

Thanks for the confirmation. I will read the ALS implementation for more details.

Hao


