[jira] [Commented] (SPARK-25411) Implement range partition in Spark
[ https://issues.apache.org/jira/browse/SPARK-25411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16936176#comment-16936176 ]

Christopher Hoshino-Fish commented on SPARK-25411:
--------------------------------------------------

I've also done this in the past to balance skewed partitions and get uniform query performance. Really effective; it would be great to see this combined with computed/derived partitions.

> Implement range partition in Spark
> ----------------------------------
>
>                 Key: SPARK-25411
>                 URL: https://issues.apache.org/jira/browse/SPARK-25411
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: Wang, Gang
>            Priority: Major
>         Attachments: range partition design doc.pdf
>
>
> In our production environment, there are some partitioned fact tables, all of which are quite large. To accelerate join execution, we need to make them bucketed as well. Then comes the problem: if the bucket number is large enough, there may be too many files (file count = bucket number * partition count), which may put pressure on HDFS. And if the bucket number is small, Spark will launch an equal number of tasks to read/write it.
>
> So, can we implement a new partition type that supports range values, just like range partitioning in Oracle/MySQL
> ([https://docs.oracle.com/cd/E17952_01/mysql-5.7-en/partitioning-range.html])?
> Say, we could partition by a date column and make every two months one partition, or partition by an integer column and make an interval of 1 one partition.
>
> Ideally, a feature like range partitioning should be implemented in Hive. However, it has always been hard to update the Hive version in a production environment, and it is much more lightweight and flexible to implement it in Spark.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
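The arithmetic behind the file-count pressure and the proposed range scheme can be sketched in plain Python. This is illustrative only; the function names are hypothetical and not part of any Spark API:

```python
from datetime import date

def range_partition_index_by_months(d: date, months_per_partition: int = 2) -> int:
    """Map a date to a range-partition index, e.g. every two months is one partition."""
    return (d.year * 12 + (d.month - 1)) // months_per_partition

def range_partition_index_by_int(value: int, interval: int = 1) -> int:
    """Map an integer to a range-partition index with a fixed interval."""
    return value // interval

# File-count pressure described in the issue: files = bucket number * partition count.
# With 1024 buckets and one year of daily partitions:
buckets, partitions = 1024, 365
total_files = buckets * partitions

# Dates within the same two-month window land in the same range partition,
# so the partition count (and hence the file count) drops accordingly.
assert range_partition_index_by_months(date(2018, 1, 15)) == \
       range_partition_index_by_months(date(2018, 2, 28))
```

With two-month range partitions, the same year collapses from 365 partitions to 6, which is the file-count reduction the issue is after.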
[jira] [Commented] (SPARK-25411) Implement range partition in Spark
[ https://issues.apache.org/jira/browse/SPARK-25411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16929690#comment-16929690 ]

Xiao Li commented on SPARK-25411:
---------------------------------

I think it is very specific to the implementation of data sources. We can discuss this in the DSv2 design instead of as feature parity with PostgreSQL.
[jira] [Commented] (SPARK-25411) Implement range partition in Spark
[ https://issues.apache.org/jira/browse/SPARK-25411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16885028#comment-16885028 ]

Wang, Gang commented on SPARK-25411:
------------------------------------

I referred to ORACLE DDLs, quite close to PG.
[jira] [Commented] (SPARK-25411) Implement range partition in Spark
[ https://issues.apache.org/jira/browse/SPARK-25411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16884290#comment-16884290 ]

Yuming Wang commented on SPARK-25411:
-------------------------------------

PostgreSQL supports range partitioning:
[https://www.postgresql.org/docs/11/ddl-partitioning.html#DDL-PARTITIONING-OVERVIEW]
[https://www.postgresql.org/docs/11/sql-createtable.html]
[jira] [Commented] (SPARK-25411) Implement range partition in Spark
[ https://issues.apache.org/jira/browse/SPARK-25411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16661610#comment-16661610 ]

Wang, Gang commented on SPARK-25411:
------------------------------------

[~cloud_fan] What do you think of this feature? In our internal benchmark, it does improve performance a lot for joins of huge tables with predicates.
[jira] [Commented] (SPARK-25411) Implement range partition in Spark
[ https://issues.apache.org/jira/browse/SPARK-25411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16627145#comment-16627145 ]

Wang, Gang commented on SPARK-25411:
------------------------------------

Added a design doc; please help review.