Hi. I'm using Spark SQL 1.2. I have this query:
CREATE TABLE test_MA STORED AS PARQUET AS SELECT field1 ,field2 ,field3 ,field4 ,field5 ,COUNT(1) AS field6 ,MAX(field7) ,MIN(field8) ,SUM(field9 / 100) ,COUNT(field10) ,SUM(IF(field11 < -500, 1, 0)) ,MAX(field12) ,SUM(IF(field13 = 1, 1, 0)) ,SUM(IF(field13 in (3,4,5,6,10,104,105,107), 1, 0)) ,SUM(IF(field13 = 2012 , 1, 0)) ,SUM(IF(field13 in (0,100,101,102,103,106), 1, 0)) FROM table1 CL JOIN table2 netw ON CL.field15 = netw.id WHERE AND field3 IS NOT NULL AND field4 IS NOT NULL AND field5 IS NOT NULL GROUP BY field1,field2,field3,field4, netw.field5 spark-submit --master spark://master:7077 *--driver-memory 20g --executor-memory 60g* --class "GMain" project_2.10-1.0.jar --driver-class-path '/opt/cloudera/parcels/CDH/lib/hive/lib/*' --driver-java-options '-Dspark.executor.extraClassPath=/opt/cloudera/parcels/CDH/lib/hive/lib/*' 2> ./error Input data is 8GB in parquet format. Many times crash by *GC overhead*. I've fixed spark.shuffle.partitions to 1024 but my worker nodes (with 128GB RAM/node) is collapsed. *Is it a query too difficult to Spark SQL? * *Would It be better to do it in Spark?* *Am I doing something wrong?* Thanks -- Regards. Miguel Ángel