>> Using large memory for executors (*--executor-memory 120g*).

Not really good advice.

On Thu, Apr 2, 2015 at 9:17 AM, Cheng, Hao <[email protected]> wrote:

>  Spark SQL loads the entire partition's data and organizes it as in-memory
> HashMaps; this does eat a lot of memory when there are few duplicated GROUP BY
> keys across a large number of records.
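>
> A rough illustration of that memory pattern (a hypothetical Scala sketch, not
> Spark's actual code): each partition builds a HashMap keyed by the GROUP BY
> columns, so an entry stays alive for every distinct key until the whole
> partition has been consumed:
>
>     import scala.collection.mutable
>
>     // One map per partition: memory grows with the number of distinct
>     // grouping keys held at once, not just with the number of input rows.
>     def hashAggregate(rows: Iterator[(String, Long)]): Iterator[(String, Long)] = {
>       val buffer = mutable.HashMap.empty[String, Long]
>       for ((key, value) <- rows)
>         buffer(key) = buffer.getOrElse(key, 0L) + value
>       buffer.iterator
>     }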
>
>
>
> A couple of things you can try, case by case (a configuration sketch follows
> the list):
>
> ·        Increasing the number of partitions (the record count in each
> partition will drop)
>
> ·        Using large memory for executors (*--executor-memory 120g*).
>
> ·        Reducing the Spark cores (to reduce the number of concurrently
> running threads)
>
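> A minimal sketch of how those knobs could be set programmatically in Spark
> 1.2 (the names GMain, sc and sqlContext, and the concrete values, are only
> illustrations taken from this thread, not a recommendation):
>
>     import org.apache.spark.{SparkConf, SparkContext}
>     import org.apache.spark.sql.hive.HiveContext
>
>     val conf = new SparkConf()
>       .setAppName("GMain")
>       .set("spark.executor.memory", "60g") // larger executor heaps
>       .set("spark.cores.max", "16")        // cap the concurrently running tasks
>     val sc = new SparkContext(conf)
>     val sqlContext = new HiveContext(sc)
>
>     // More (and therefore smaller) partitions for the GROUP BY / join shuffle.
>     sqlContext.setConf("spark.sql.shuffle.partitions", "1024")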
>
>
> We are trying to improve this by using sort-merge aggregation, which should
> reduce the memory utilization significantly, but that work is still ongoing.
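>
> The idea, in a rough Scala sketch (hypothetical, not the real
> implementation): sort the rows by the grouping key first and then stream
> through them, so only the running aggregate of the current group has to be
> kept in memory at any one time:
>
>     // Assumes the input iterator is already sorted by the grouping key.
>     def sortedAggregate(rows: Iterator[(String, Long)]): Iterator[(String, Long)] =
>       new Iterator[(String, Long)] {
>         private val it = rows.buffered
>         def hasNext: Boolean = it.hasNext
>         def next(): (String, Long) = {
>           val key = it.head._1
>           var sum = 0L
>           while (it.hasNext && it.head._1 == key) sum += it.next()._2
>           (key, sum)
>         }
>       }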
>
>
>
> Cheng Hao
>
>
>
> *From:* Masf [mailto:[email protected]]
> *Sent:* Thursday, April 2, 2015 11:47 PM
> *To:* [email protected]
> *Subject:* Spark SQL. Memory consumption
>
>
>
>
>  Hi.
>
>
>
> I'm using Spark SQL 1.2. I have this query:
>
>
>
> CREATE TABLE test_MA STORED AS PARQUET AS
> SELECT
>     field1
>     ,field2
>     ,field3
>     ,field4
>     ,field5
>     ,COUNT(1) AS field6
>     ,MAX(field7)
>     ,MIN(field8)
>     ,SUM(field9 / 100)
>     ,COUNT(field10)
>     ,SUM(IF(field11 < -500, 1, 0))
>     ,MAX(field12)
>     ,SUM(IF(field13 = 1, 1, 0))
>     ,SUM(IF(field13 IN (3,4,5,6,10,104,105,107), 1, 0))
>     ,SUM(IF(field13 = 2012, 1, 0))
>     ,SUM(IF(field13 IN (0,100,101,102,103,106), 1, 0))
> FROM table1 CL
>     JOIN table2 netw ON CL.field15 = netw.id
> WHERE field3 IS NOT NULL
>     AND field4 IS NOT NULL
>     AND field5 IS NOT NULL
> GROUP BY field1, field2, field3, field4, netw.field5
>
>
>
>
>
> spark-submit --master spark://master:7077 *--driver-memory 20g
> --executor-memory 60g* --class "GMain" project_2.10-1.0.jar
> --driver-class-path '/opt/cloudera/parcels/CDH/lib/hive/lib/*'
> --driver-java-options
> '-Dspark.executor.extraClassPath=/opt/cloudera/parcels/CDH/lib/hive/lib/*'
> 2> ./error
>
>
>
>
>
> The input data is 8 GB in Parquet format. The job often crashes with *GC
> overhead* errors. I've set spark.shuffle.partitions to 1024, but my worker
> nodes (with 128 GB RAM/node) still collapse.
>
>
>
> *Is this query too difficult for Spark SQL?*
>
> *Would it be better to do it directly in Spark?*
>
> *Am I doing something wrong?*
>
>
>
>
>
> Thanks
>
> --
>
>
>
>
> Regards.
> Miguel Ángel
>
