Can you provide a small sample or test data that reproduces this problem? Also, what is your environment setup: a single node or a cluster?
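For a reduced test case, the eight repeated counts in your test could be condensed into a small helper, which makes it easy to rerun against a smaller dataset. This is only a rough sketch: `repeatCounts` and `isStable` are hypothetical names, not Spark APIs, and the local collection in `main` is stand-in data so the snippet runs without a cluster. With your RDD, the call would presumably be `repeatCounts(8)(step2.count)`.

```scala
object RepeatCount {
  // Hypothetical helper: evaluate the by-name `count` expression n times
  // and collect each result in order.
  def repeatCounts(n: Int)(count: => Long): Seq[Long] =
    (1 to n).map(_ => count)

  // True when every run returned the same value (or there were no runs).
  def isStable(counts: Seq[Long]): Boolean = counts.distinct.size <= 1

  def main(args: Array[String]): Unit = {
    // Stand-in data (an assumption, not the original dataset):
    // a local key-value collection grouped by key.
    val pairs = Seq("a" -> 1, "b" -> 2, "a" -> 3, "c" -> 4)
    val grouped = pairs.groupBy(_._1)

    // With an RDD this would be: repeatCounts(8)(step2.count)
    val counts = repeatCounts(8)(grouped.size.toLong)

    counts.zipWithIndex.foreach { case (c, i) => println(s"${i + 1} => $c") }
    println("stable = " + isStable(counts))
  }
}
```

If `isStable` returns false on a small input, that input would be a good candidate to share on the list.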
Sent from my iPhone

> On Sep 8, 2014, at 22:29, redocpot <julien19890...@gmail.com> wrote:
>
> Hi,
>
> I have a key-value RDD called rdd below. After a groupBy, I tried to count
> rows. But the result is not unique; it is somehow nondeterministic.
>
> Here is the test code:
>
> val step1 = ligneReceipt_cleTable.persist
> val step2 = step1.groupByKey
>
> val s1size = step1.count
> val s2size = step2.count
>
> val t = step2 // rdd after groupBy
>
> val t1 = t.count
> val t2 = t.count
> val t3 = t.count
> val t4 = t.count
> val t5 = t.count
> val t6 = t.count
> val t7 = t.count
> val t8 = t.count
>
> println("s1size = " + s1size)
> println("s2size = " + s2size)
> println("1 => " + t1)
> println("2 => " + t2)
> println("3 => " + t3)
> println("4 => " + t4)
> println("5 => " + t5)
> println("6 => " + t6)
> println("7 => " + t7)
> println("8 => " + t8)
>
> Here are the results:
>
> s1size = 5338864
> s2size = 5268001
> 1 => 5268002
> 2 => 5268001
> 3 => 5268001
> 4 => 5268002
> 5 => 5268001
> 6 => 5268002
> 7 => 5268002
> 8 => 5268001
>
> Even if the difference is just one row, that's annoying.
>
> Any idea?
>
> Thank you.
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/groupBy-gives-non-deterministic-results-tp13698.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org