Hello everyone,

I need urgent help with a data consistency issue I am seeing.

Setup: a standalone cluster of five servers. sqlContext is an instance of HiveContext (the default in spark-shell). No special options are set other than driver memory and executor memory. The Parquet data has 512 partitions and the cluster has 160 cores. The data is nearly 2 billion rows.
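In case it helps to reproduce the environment, the figures above can be double-checked from the same shell with something like this (just a sketch; the comments only restate the numbers quoted above, they are not new measurements, and setupCheck is a throwaway name):

val setupCheck = sqlContext.read.parquet("/var/Saif/data_pqt")
setupCheck.rdd.partitions.length  // 512 Parquet input partitions
setupCheck.count                  // nearly 2 billion rows
sc.defaultParallelism             // 160 cores, assuming default parallelism on the standalone cluster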
The issue happens with the following sequence in the shell:

val data = sqlContext.read.parquet("/var/Saif/data_pqt")

val res = data.groupBy("product", "band", "age", "vint", "mb", "yyyymm").agg(
  count($"account_id").as("N"),
  sum($"balance").as("balance_eom"),
  sum($"balancec").as("balance_eoc"),
  sum($"spend").as("spend"),
  sum($"payment").as("payment"),
  sum($"feoc").as("feoc"),
  sum($"cfintbal").as("cfintbal"),
  count($"newacct" === 1).as("newacct")).persist()

val z = res.select("vint", "yyyymm").filter("vint = '2007-01-01'").select("yyyymm").distinct.collect
z.length
>>> res0: Int = 102

res.unpersist()

val z = res.select("vint", "yyyymm").filter("vint = '2007-01-01'").select("yyyymm").distinct.collect
z.length
>>> res1: Int = 103

So the exact same query returns 102 distinct yyyymm values while res is persisted, and 103 after res.unpersist().

Please help,
Saif
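P.S. In case it is useful for comparison, the same distinct-month check can also be run directly against the source Parquet data, bypassing the persisted aggregate entirely; a minimal sketch using only the path and columns already mentioned above:

// distinct yyyymm for the same vintage, read straight from the Parquet files
val direct = sqlContext.read.parquet("/var/Saif/data_pqt")
  .filter("vint = '2007-01-01'")
  .select("yyyymm")
  .distinct
  .count

Since the aggregate groups by yyyymm (among other columns), this count should match whichever of 102 or 103 is actually correct.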