GitHub user witgo opened a pull request:

    https://github.com/apache/spark/pull/15512

    The SerializerInstance instance used when deserializing a TaskResult is not reused

    ## What changes were proposed in this pull request?
    The following code is called when the DirectTaskResult instance is deserialized:
    
    ```scala
    // DirectTaskResult.value()
    def value(): T = {
      if (valueObjectDeserialized) {
        valueObject
      } else {
        // Each deserialization creates a new SerializerInstance,
        // which is very time-consuming.
        val resultSer = SparkEnv.get.serializer.newInstance()
        valueObject = resultSer.deserialize(valueBytes)
        valueObjectDeserialized = true
        valueObject
      }
    }
    ```
    
    When a stage has a large number of tasks, reusing a single SerializerInstance can improve scheduling performance by about a factor of three.
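
    The sketch below illustrates the reuse idea: cache one SerializerInstance per result-fetching thread and use it for every task result, instead of calling `newInstance()` each time. It is illustrative only (the `ResultDeserializer` object and its names are not part of the patch); a thread-local cache is used because a SerializerInstance is not guaranteed to be thread-safe.

    ```scala
    import java.nio.ByteBuffer

    import scala.reflect.ClassTag

    import org.apache.spark.SparkEnv
    import org.apache.spark.serializer.SerializerInstance

    // Illustrative sketch: one SerializerInstance per thread, reused for
    // every task result instead of a fresh newInstance() per call.
    object ResultDeserializer {
      // SerializerInstance is not thread-safe, so cache one per thread.
      private val cachedSerializer = new ThreadLocal[SerializerInstance] {
        override def initialValue(): SerializerInstance =
          SparkEnv.get.serializer.newInstance()
      }

      def deserialize[T: ClassTag](valueBytes: ByteBuffer): T =
        cachedSerializer.get().deserialize[T](valueBytes)
    }
    ```

    With a cache like this, `value()` could take the shared instance as a parameter rather than constructing a fresh one on every call.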
     
    The test data is TPC-DS at the 2 TB scale (Parquet), and the SQL statement (query 2) is as follows:
    
    
    ```sql
    select  i_item_id, 
            avg(ss_quantity) agg1,
            avg(ss_list_price) agg2,
            avg(ss_coupon_amt) agg3,
            avg(ss_sales_price) agg4 
     from store_sales, customer_demographics, date_dim, item, promotion
     where ss_sold_date_sk = d_date_sk and
           ss_item_sk = i_item_sk and
           ss_cdemo_sk = cd_demo_sk and
           ss_promo_sk = p_promo_sk and
           cd_gender = 'M' and 
           cd_marital_status = 'M' and
           cd_education_status = '4 yr Degree' and
           (p_channel_email = 'N' or p_channel_event = 'N') and
           d_year = 2001 
     group by i_item_id
     order by i_item_id
     limit 100;
    ```
    
    `spark-defaults.conf` file:
    
    ```
    spark.master                           yarn-client
    spark.executor.instances               20
    spark.driver.memory                    16g
    spark.executor.memory                  30g
    spark.executor.cores                   5
    spark.default.parallelism              100 
    spark.sql.shuffle.partitions           100000 
    spark.serializer                       org.apache.spark.serializer.KryoSerializer
    spark.driver.maxResultSize              0
    spark.rpc.netty.dispatcher.numThreads   8
    spark.executor.extraJavaOptions          -XX:+UseG1GC -XX:+UseStringDeduplication -XX:G1HeapRegionSize=16M -XX:MetaspaceSize=256M
    spark.cleaner.referenceTracking.blocking true
    spark.cleaner.referenceTracking.blocking.shuffle true
    
    ```
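
    For reference, a minimal Scala driver for the benchmark is sketched below; the file name `q2.sql` is hypothetical, and it assumes the TPC-DS tables are already registered in the catalog and the settings above are picked up from `spark-defaults.conf`:

    ```scala
    import org.apache.spark.sql.SparkSession

    // Runs the benchmark query once and reports wall-clock time.
    val spark = SparkSession.builder().appName("SPARK-17930-benchmark").getOrCreate()
    val query = scala.io.Source.fromFile("q2.sql").mkString  // hypothetical path

    val start = System.nanoTime()
    spark.sql(query).collect()
    println(f"elapsed: ${(System.nanoTime() - start) / 1e9}%.1f s")
    ```

    Note that `spark.sql.shuffle.partitions` is set to 100000, so the job schedules a very large number of small tasks, which is what exposes the per-result deserialization overhead.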
    
    
    Performance test results are as follows (the `SPARK-17930` branch includes the patch; `ed14633` is the baseline commit):
    
    [SPARK-17930](https://github.com/witgo/spark/tree/SPARK-17930) | [ed14633](https://github.com/witgo/spark/commit/ed1463341455830b8867b721a1b34f291139baf3)
    ------------ | -------------
    54.5 s | 231.7 s
    
    
    ## How was this patch tested?
    
    Existing tests.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/witgo/spark SPARK-17930

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/15512.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #15512
    
----
commit 037871d8843760fbbdeab344d8228bfaeba6f6ae
Author: Guoqiang Li <wi...@qq.com>
Date:   2016-10-16T03:18:00Z

    The SerializerInstance instance used when deserializing a TaskResult is not reused

----

