Can you provide a small sample or test data that reproduces this problem? Also,
what is your environment setup: a single node or a cluster?
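
Something along these lines might serve as a starting point for a small reproduction. It is only a sketch under assumptions: the object name GroupCountCheck, the helper isStable, the local[2] master and the generated sample data are all made up for illustration and are not taken from your job. The idea is that the number of groups after groupByKey should equal the number of distinct keys, and should be the same on every count.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // pair-RDD implicits, needed on Spark 1.x

object GroupCountCheck {
  // True when every repeated count returned the same value.
  def isStable(counts: Seq[Long]): Boolean = counts.distinct.size <= 1

  def main(args: Array[String]): Unit = {
    // Hypothetical local setup; on a cluster, drop setMaster and use spark-submit.
    val conf = new SparkConf().setAppName("GroupCountCheck").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Made-up sample data: 1000 records spread over 100 keys.
      val pairs = sc.parallelize(0 until 1000).map(i => (i % 100, i)).persist()
      val grouped = pairs.groupByKey()

      // The number of groups should equal the number of distinct keys.
      val distinctKeys = pairs.keys.distinct().count()
      // Count the grouped RDD several times, as in the original test.
      val counts = (1 to 8).map(_ => grouped.count())

      println("distinct keys = " + distinctKeys)
      println("group counts  = " + counts.mkString(", "))
      println("stable        = " + (isStable(counts) && counts.forall(_ == distinctKeys)))
    } finally {
      sc.stop()
    }
  }
}
```

If the counts are stable on generated data like this but not on your real data, that would point toward the input or the key type rather than groupByKey itself.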

> On Sep 8, 2014, at 22:29, redocpot <julien19890...@gmail.com> wrote:
> 
> Hi,
> 
> I have a key-value RDD (ligneReceipt_cleTable in the code below). After a
> groupByKey, I count the rows, but the result is not stable: it changes from
> one count to the next, so it appears to be non-deterministic.
> 
> Here is the test code:
> 
>  val step1 = ligneReceipt_cleTable.persist
>  val step2 = step1.groupByKey
> 
>  val s1size = step1.count
>  val s2size = step2.count
> 
>  val t = step2 // rdd after groupBy
> 
>  val t1 = t.count
>  val t2 = t.count
>  val t3 = t.count
>  val t4 = t.count
>  val t5 = t.count
>  val t6 = t.count
>  val t7 = t.count
>  val t8 = t.count
> 
>  println("s1size = " + s1size)
>  println("s2size = " + s2size)
>  println("1 => " + t1)
>  println("2 => " + t2)
>  println("3 => " + t3)
>  println("4 => " + t4)
>  println("5 => " + t5)
>  println("6 => " + t6)
>  println("7 => " + t7)
>  println("8 => " + t8)
> 
> Here are the results:
> 
> s1size = 5338864
> s2size = 5268001
> 1 => 5268002
> 2 => 5268001
> 3 => 5268001
> 4 => 5268002
> 5 => 5268001
> 6 => 5268002
> 7 => 5268002
> 8 => 5268001
> 
> Even though the difference is just one row, it is annoying.
> 
> Any ideas?
> 
> Thank you.
> 
> 
> 
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/groupBy-gives-non-deterministic-results-tp13698.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
> 

