>> Using large memory for executors (*--executor-memory 120g*). Not really good advice.
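Oversized heaps tend to just prolong GC pauses. A more common approach is to keep executor memory moderate and instead spread the aggregation over more shuffle partitions and fewer concurrent cores. A minimal sketch of such a submission (the numbers are placeholders, assuming Spark 1.2 on a standalone master; they would need tuning per cluster):

    # hypothetical values; tune per cluster
    spark-submit --master spark://master:7077 \
      --driver-memory 20g \
      --executor-memory 20g \
      --total-executor-cores 24 \
      --conf spark.sql.shuffle.partitions=1024 \
      --class "GMain" project_2.10-1.0.jar

With more, smaller tasks, each in-memory hash map built for the GROUP BY holds fewer rows, which is usually what keeps the GC overhead under control.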
On Thu, Apr 2, 2015 at 9:17 AM, Cheng, Hao <[email protected]> wrote:

> Spark SQL tries to load the entire partition data and organize it as
> in-memory HashMaps; it does eat a lot of memory if there are not many
> duplicated group-by keys among a large number of records.
>
> A couple of things you can try, case by case:
>
> · Increase the number of partitions (the record count in each partition
>   will drop)
>
> · Use large memory for executors (*--executor-memory 120g*)
>
> · Reduce the Spark cores (to reduce the number of parallel running threads)
>
> We are trying to improve this by using sort-merge aggregation, which
> should reduce memory utilization significantly, but that work is still
> ongoing.
>
> Cheng Hao
>
> *From:* Masf [mailto:[email protected]]
> *Sent:* Thursday, April 2, 2015 11:47 PM
> *To:* [email protected]
> *Subject:* Spark SQL. Memory consumption
>
> Hi,
>
> I'm using Spark SQL 1.2. I have this query:
>
> CREATE TABLE test_MA STORED AS PARQUET AS
> SELECT
>     field1
>     ,field2
>     ,field3
>     ,field4
>     ,field5
>     ,COUNT(1) AS field6
>     ,MAX(field7)
>     ,MIN(field8)
>     ,SUM(field9 / 100)
>     ,COUNT(field10)
>     ,SUM(IF(field11 < -500, 1, 0))
>     ,MAX(field12)
>     ,SUM(IF(field13 = 1, 1, 0))
>     ,SUM(IF(field13 IN (3,4,5,6,10,104,105,107), 1, 0))
>     ,SUM(IF(field13 = 2012, 1, 0))
>     ,SUM(IF(field13 IN (0,100,101,102,103,106), 1, 0))
> FROM table1 CL
> JOIN table2 netw
>   ON CL.field15 = netw.id
> WHERE field3 IS NOT NULL
>   AND field4 IS NOT NULL
>   AND field5 IS NOT NULL
> GROUP BY field1, field2, field3, field4, netw.field5
>
> spark-submit --master spark://master:7077 *--driver-memory 20g
> --executor-memory 60g* --class "GMain" project_2.10-1.0.jar
> --driver-class-path '/opt/cloudera/parcels/CDH/lib/hive/lib/*'
> --driver-java-options
> '-Dspark.executor.extraClassPath=/opt/cloudera/parcels/CDH/lib/hive/lib/*'
> 2> ./error
>
> The input data is 8 GB in Parquet format. The job often crashes with *GC
> overhead* errors. I've set spark.shuffle.partitions to 1024, but my worker
> nodes (128 GB RAM each) still collapse.
>
> *Is this query too difficult for Spark SQL?*
>
> *Would it be better to do it directly in Spark?*
>
> *Am I doing something wrong?*
>
> Thanks
>
> --
> Regards,
> Miguel Ángel
