Hi:

   I am using Spark SQL to read Kudu data, and I cannot increase the number of 
parallel Spark tasks.

   My Kudu table is range partitioned, and each range partition has 6 hash 
partitions. For example, my Kudu table is A, and its partition layout is as 
follows:
HASH (scan_record_id, stowage_no) PARTITIONS 6,
RANGE (creater_time) (
    PARTITION "2019-12-01" <= VALUES < "2020-01-01",
    PARTITION "2020-01-01" <= VALUES < "2020-02-01",
    PARTITION "2020-02-01" <= VALUES < "2020-03-01",
    PARTITION "2020-03-01" <= VALUES < "2020-04-01",
    PARTITION "2020-04-01" <= VALUES < "2020-05-01",
    PARTITION "2020-05-01" <= VALUES < "2020-06-01",
    PARTITION "2020-06-01" <= VALUES < "2020-07-01",
    PARTITION "2020-07-01" <= VALUES < "2020-08-01",
    PARTITION "2020-08-01" <= VALUES < "2020-09-01",
    PARTITION "2020-09-01" <= VALUES < "2020-10-01",
    PARTITION "2020-10-01" <= VALUES < "2020-11-01",
    PARTITION "2020-11-01" <= VALUES < "2020-12-01",
    PARTITION "2020-12-01" <= VALUES < "2021-01-01"
)
My SQL is: select * from A where creater_time > '2020-11-05' and 
creater_time < '2020-11-27'
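
     For reference, this is roughly how the table is loaded and the query is run 
in the job (a minimal sketch; the master address and app name below are 
placeholders, not our real ones):

import org.apache.spark.sql.SparkSession

// Build the session and load table A through the kudu-spark connector.
val spark = SparkSession.builder().appName("test").getOrCreate()
val df = spark.read
  .options(Map("kudu.master" -> "kudu-master-1:7051", "kudu.table" -> "A"))
  .format("org.apache.kudu.spark.kudu")
  .load()

// Register the DataFrame and run the same query through Spark SQL.
df.createOrReplaceTempView("A")
spark.sql("select * from A where creater_time > '2020-11-05' and creater_time < '2020-11-27'").show()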
     When I run the Spark SQL job I specify up to 20 executors, but only 6 tasks 
are created to scan Kudu, so only 6 executors are actually used (the filter 
only touches one range partition, which has 6 hash buckets, i.e. 6 tablets). 
The spark-submit command is:
  spark-submit --master yarn --deploy-mode cluster --name test --queue bigdata_pro \
    --conf spark.dynamicAllocation.maxExecutors=20 \
    --executor-cores 1 --executor-memory 8g --driver-memory 8g \
    --class uc.com.Test \
    hdfs://ns1/user/hue/Test.jar

   Spark and Kudu versions: Spark 2.4.0 and Kudu 1.10.0.
   Apart from increasing the number of hash partitions under each range 
partition, is there any parameter that can increase the number of tasks Spark 
uses to read Kudu data?

   Thanks!




