[ 
https://issues.apache.org/jira/browse/IMPALA-12765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17812543#comment-17812543
 ] 

ASF subversion and git services commented on IMPALA-12765:
----------------------------------------------------------

Commit 62a3168eca955119ad7c01b1f4d91a9702efd397 in impala's branch 
refs/heads/master from Zoltan Borok-Nagy
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=62a3168ec ]

IMPALA-12765: Balance consecutive partitions better for Iceberg tables

During remote read scheduling Impala does the following:

Non-Iceberg tables
 * The scheduler processes the scan ranges in partition key order
 * The scheduler selects N executors as candidates
 * The scheduler chooses the executor from the candidates based on
   minimum number of assigned bytes
 * So consecutive partitions are more likely to be assigned to
   different executors

Iceberg tables
 * The scheduler processes the scan ranges in random order
 * The scheduler selects N executors as candidates
 * The scheduler chooses the executor from the candidates based on
   minimum number of assigned bytes
 * So consecutive partitions (by partition key order) are assigned
   randomly, i.e. there's a higher chance of clustering

With this patch, IcebergScanNode orders its file descriptors based on
their paths, so we will have a more balanced scheduling for consecutive
partitions. It is especially important for queries that prune partitions
via runtime filters (e.g. due to a JOIN), because it doesn't matter that
we schedule the scan ranges evenly, the scan ranges that survive the
runtime filters can still be clustered on certain executors.

E.g. TPC-DS Q22 has the following JOIN and WHERE predicates:

 inv_date_sk=d_date_sk and
 d_month_seq between 1199 and 1199 + 11

The Inventory table is partitioned by column inv_date_sk, and we filter
the rows in the joined table by 'd_month_seq between 1199 and
1199 + 11'. This means that we will only need a range of partitions from
the Inventory table, but that range will only be revealed during
runtime. Scheduling neighbouring partitions to different executors means
that the surviving partitions are spread across executors more evenly.

Testing:
 * e2e test

Change-Id: I60773965ecbb4d8e659db158f1f0ac76086d5578
Reviewed-on: http://gerrit.cloudera.org:8080/20973
Reviewed-by: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenk...@cloudera.com>


> Balance consecutive partitions better for Iceberg tables
> --------------------------------------------------------
>
>                 Key: IMPALA-12765
>                 URL: https://issues.apache.org/jira/browse/IMPALA-12765
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Frontend
>            Reporter: Zoltán Borók-Nagy
>            Assignee: Zoltán Borók-Nagy
>            Priority: Major
>              Labels: impala-iceberg
>
> During scheduling Impala does the following:
> * Non-Iceberg tables
> ** The scheduler processes the scan ranges in partition key order
> ** The scheduler selects N replicas as candidates
> ** The scheduler chooses the executor from the candidates based on minimum 
> number of assigned bytes
> ** So consecutive partitions are more likely to be assigned to different 
> executors
> * Iceberg tables
> ** The scheduler processes the scan ranges in random order
> ** The scheduler selects N replicas as candidates
> ** The scheduler chooses the executor from the candidates based on minimum 
> number of assigned bytes
> ** So consecutive partitions (by partition key order) are assigned randomly, 
> i.e. there's a higher chances of clustering
> If the IcebergScanNode ordered its file descriptors based on their paths we 
> would have a more balanced scheduling for consecutive partitions. Queries 
> that operate on a range of partitions are quite common, so it makes sense to 
> optimize that case.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org

Reply via email to