Hello, I have a requirement where I need to get the total row count and the count of failed rows per group.
The code looks like this:

    myDataset.createOrReplaceTempView("temp_view");
    Dataset<Row> countDataset = sparkSession.sql(
        "SELECT column1, column2, column3, column4, column5, column6, column7, column8, "
      + "count(*) AS totalRows, "
      + "sum(CASE WHEN column8 IS NULL THEN 1 ELSE 0 END) AS failedRows "
      + "FROM temp_view "
      + "GROUP BY column1, column2, column3, column4, column5, column6, column7, column8");

Up to around 50 million records the query performed acceptably; beyond that it fails, mostly with an out-of-memory exception. I have read the documentation and several blog posts, but most of them give examples using RDD.reduceByKey, whereas here I have a Dataset and Spark SQL. What am I missing? Any help will be appreciated. Thanks!
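For reference, the same aggregation can also be written with the Dataset API instead of a SQL string. This is only a sketch using the column names from the query above; it compiles to essentially the same physical plan as the SQL version (Spark already performs partial aggregation, like reduceByKey, for both), so it is a readability alternative rather than a guaranteed fix for the out-of-memory behaviour:

    // Equivalent aggregation via the untyped Dataset API.
    // Assumes the same myDataset as above; static import of
    // org.apache.spark.sql.functions.* for count/sum/when/col/lit.
    import static org.apache.spark.sql.functions.*;

    Dataset<Row> countDataset = myDataset
        .groupBy("column1", "column2", "column3", "column4",
                 "column5", "column6", "column7", "column8")
        .agg(
            // total rows in each group
            count(lit(1)).alias("totalRows"),
            // rows in the group where column8 is NULL
            sum(when(col("column8").isNull(), 1).otherwise(0)).alias("failedRows"));

(Note that since column8 is itself a grouping key, failedRows will be either 0 or equal to totalRows within any single group; if that is not the intent, column8 may belong in the aggregation only, not in the GROUP BY.)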