That’s a good catch, but I believe HiveContext is the recommended entry point at the moment.
( https://github.com/apache/spark/tree/master/sql)

Catalyst$> sbt/sbt hive/console
case class Foo(k: String, v: Int)
val rows = List.fill(100)(Foo("a", 1)) ++ List.fill(200)(Foo("b", 2)) ++ List.fill(300)(Foo("c", 3))
sparkContext.makeRDD(rows).registerAsTable("foo")
sql("select k,count(*) from foo group by k").collect
res1: Array[org.apache.spark.sql.Row] = Array([b,200], [a,100], [c,300])
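
For reference, the same session can also be run from a plain spark-shell against a HiveContext instead of the sbt hive/console. This is only a sketch assuming the Spark 1.0 API, where HiveContext lives in org.apache.spark.sql.hive and hql() invokes the HiveQL parser:

import org.apache.spark.sql.hive.HiveContext

case class Foo(k: String, v: Int)

// sc is the SparkContext that spark-shell provides
val hiveContext = new HiveContext(sc)
import hiveContext._  // implicit conversions from RDDs of case classes

val rows = List.fill(100)(Foo("a", 1)) ++ List.fill(200)(Foo("b", 2)) ++ List.fill(300)(Foo("c", 3))
sc.makeRDD(rows).registerAsTable("foo")

// hql() parses with HiveQL rather than the basic SQL parser
hql("select k, count(*) from foo group by k").collect()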

Cheng Hao
From: Pei-Lun Lee [mailto:pl...@appier.com]
Sent: Wednesday, June 11, 2014 6:01 PM
To: user@spark.apache.org
Subject: Spark SQL incorrect result on GROUP BY query

Hi,

I am using Spark 1.0.0 and have found that some Spark SQL queries using GROUP BY give
incorrect results.
To reproduce, run the following commands in a spark-shell connected to a
standalone master:

case class Foo(k: String, v: Int)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._
val rows = List.fill(100)(Foo("a", 1)) ++ List.fill(200)(Foo("b", 2)) ++ List.fill(300)(Foo("c", 3))
sc.makeRDD(rows).registerAsTable("foo")
sql("select k,count(*) from foo group by k").collect

the result will be something random like:
res1: Array[org.apache.spark.sql.Row] = Array([b,180], [3,18], [a,75], [c,270], [4,56], [1,1])

If I run the same query again, the result is correct:
sql("select k,count(*) from foo group by k").collect
res2: Array[org.apache.spark.sql.Row] = Array([b,200], [a,100], [c,300])

Should I file a bug?

--
Pei-Lun Lee
