Hello!

Starting in Kudu 1.10, you should be able to supply 'splitSizeBytes' as a
KuduReadOption in Spark, allowing you to generate Kudu scan tokens that
operate on smaller chunks of data. Here's an example test:
<https://github.com/apache/kudu/blob/4cbee427475396c9cf1e05a402c3f952d3fafde7/java/kudu-spark/src/test/scala/org/apache/kudu/spark/kudu/SparkSQLTest.scala#L465>
It isn't straightforward to generate exactly 20 scan tokens this way, but it
does offer finer-grained scans than full tablet scans.
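
For some context on the 6 you're seeing: kudu-spark builds one scan token
(and thus one Spark task) per tablet by default, and your predicate on
creater_time prunes the scan to a single range partition with 6 hash
buckets, so there are only 6 tasks and dynamic allocation never needs more
than 6 executors. Below is a rough, untested sketch of how the option can be
passed from Spark SQL; the master addresses, app name, and 128 MiB split
size are placeholders, and it's worth double-checking the exact option key
('kudu.splitSizeBytes') against the linked test:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("kudu-split-size").getOrCreate()

    // Ask kudu-spark for smaller scan tokens so a single tablet can be split
    // into multiple Spark tasks. 128 MiB here is just an example value.
    val df = spark.read
      .format("kudu")
      .option("kudu.master", "master1:7051,master2:7051,master3:7051") // placeholders
      .option("kudu.table", "A")
      .option("kudu.splitSizeBytes", (128L * 1024 * 1024).toString)
      .load()

    df.createOrReplaceTempView("A")
    spark.sql("select * from A where creater_time > '2020-11-05' " +
      "and creater_time < '2020-11-27'").show()

With a smaller split size the number of tasks is driven by the amount of data
scanned rather than by the 6 hash buckets, so for a large enough range you
should see more than 6 tasks (and executors).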

Hope this helps!

On Sat, Nov 28, 2020 at 12:32 AM 冯宝利 <fengba...@uce.cn> wrote:

> Hi:
>
> I use Spark SQL to read Kudu data, and the number of parallel Spark tasks
> cannot be increased.
>
> My Kudu table uses range partitioning, and the number of hash partitions in
> each range partition is 6. For example, my Kudu table is A, and its
> partition range information is as follows:
>
> HASH (scan_record_id, stowage_no) PARTITIONS 6,
> RANGE (creater_time) (
>     PARTITION "2019-12-01" <= VALUES < "2020-01-01",
>     PARTITION "2020-01-01" <= VALUES < "2020-02-01",
>     PARTITION "2020-02-01" <= VALUES < "2020-03-01",
>     PARTITION "2020-03-01" <= VALUES < "2020-04-01",
>     PARTITION "2020-04-01" <= VALUES < "2020-05-01",
>     PARTITION "2020-05-01" <= VALUES < "2020-06-01",
>     PARTITION "2020-06-01" <= VALUES < "2020-07-01",
>     PARTITION "2020-07-01" <= VALUES < "2020-08-01",
>     PARTITION "2020-08-01" <= VALUES < "2020-09-01",
>     PARTITION "2020-09-01" <= VALUES < "2020-10-01",
>     PARTITION "2020-10-01" <= VALUES < "2020-11-01",
>     PARTITION "2020-11-01" <= VALUES < "2020-12-01",
>     PARTITION "2020-12-01" <= VALUES < "2021-01-01"
> )
>
> My SQL is: select * from A where creater_time > '2020-11-05' and
> creater_time < '2020-11-27'
>
>
> When I run Spark SQL, I specify 20 executors, but the number of Spark
> executors is still 6. The spark-submit command is:
>
>     spark-submit --master yarn --deploy-mode cluster --name test \
>       --queue bigdata_pro --conf spark.dynamicAllocation.maxExecutors=20 \
>       --executor-cores 1 --executor-memory 8g --driver-memory 8g \
>       --class uc.com.Test hdfs://ns1/user/hue/Test.jar
>
> Spark and Kudu versions: the Spark version is 2.4.0 and the Kudu version is
> 1.10.0.
>
> Other than increasing the number of hash partitions under each range
> partition, is there any parameter that can increase the number of tasks
> Spark uses to read Kudu data?
>
>    Thanks!
>

-- 
Andrew Wong
