[jira] [Updated] (SPARK-20240) SparkSQL support limitations of max dynamic partitions when inserting hive table
[ https://issues.apache.org/jira/browse/SPARK-20240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-20240: - Labels: bulk-closed (was: ) > SparkSQL support limitations of max dynamic partitions when inserting hive > table > > > Key: SPARK-20240 > URL: https://issues.apache.org/jira/browse/SPARK-20240 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.6.2, 1.6.3 >Reporter: zenglinxi >Priority: Major > Labels: bulk-closed > > We found that HDFS problem occurs sometimes when user have a typo in their > code while using SparkSQL inserting data into a partition table. > For Example: > create table: > {quote} > create table test_tb ( >price double, > ) PARTITIONED BY (day_partition string ,hour_partition string) > {quote} > normal sql for inserting table: > {quote} > insert overwrite table test_tb partition(day_partition, hour_partition) > select price, day_partition, hour_partition from other_table; > {quote} > sql with typo: > {quote} > insert overwrite table test_tb partition(day_partition, hour_partition) > select hour_partition, day_partition, price from other_table; > {quote} > This typo makes SparkSQL take column "price" as "hour_partition", which may > create million HDFS files in short time if the "other_table" has large data > with a wide range of "price" and give rise to awful performance of NameNode > RPC. > We think it's a good idea to limit the maximum number of files allowed to be > create by each task for protecting HDFS NameNode from unconscious error. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20240) SparkSQL support limitations of max dynamic partitions when inserting hive table
[ https://issues.apache.org/jira/browse/SPARK-20240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zenglinxi updated SPARK-20240: -- Affects Version/s: (was: 2.1.0) > SparkSQL support limitations of max dynamic partitions when inserting hive > table > > > Key: SPARK-20240 > URL: https://issues.apache.org/jira/browse/SPARK-20240 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.6.2, 1.6.3 >Reporter: zenglinxi > > We found that HDFS problem occurs sometimes when user have a typo in their > code while using SparkSQL inserting data into a partition table. > For Example: > create table: > {quote} > create table test_tb ( >price double, > ) PARTITIONED BY (day_partition string ,hour_partition string) > {quote} > normal sql for inserting table: > {quote} > insert overwrite table test_tb partition(day_partition, hour_partition) > select price, day_partition, hour_partition from other_table; > {quote} > sql with typo: > {quote} > insert overwrite table test_tb partition(day_partition, hour_partition) > select hour_partition, day_partition, price from other_table; > {quote} > This typo makes SparkSQL take column "price" as "hour_partition", which may > create million HDFS files in short time if the "other_table" has large data > with a wide range of "price" and give rise to awful performance of NameNode > RPC. > We think it's a good idea to limit the maximum number of files allowed to be > create by each task for protecting HDFS NameNode from unconscious error. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20240) SparkSQL support limitations of max dynamic partitions when inserting hive table
[ https://issues.apache.org/jira/browse/SPARK-20240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zenglinxi updated SPARK-20240: -- Affects Version/s: 1.6.3 > SparkSQL support limitations of max dynamic partitions when inserting hive > table > > > Key: SPARK-20240 > URL: https://issues.apache.org/jira/browse/SPARK-20240 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.6.2, 1.6.3, 2.1.0 >Reporter: zenglinxi > > We found that HDFS problem occurs sometimes when user have a typo in their > code while using SparkSQL inserting data into a partition table. > For Example: > create table: > {quote} > create table test_tb ( >price double, > ) PARTITIONED BY (day_partition string ,hour_partition string) > {quote} > normal sql for inserting table: > {quote} > insert overwrite table test_tb partition(day_partition, hour_partition) > select price, day_partition, hour_partition from other_table; > {quote} > sql with typo: > {quote} > insert overwrite table test_tb partition(day_partition, hour_partition) > select hour_partition, day_partition, price from other_table; > {quote} > This typo makes SparkSQL take column "price" as "hour_partition", which may > create million HDFS files in short time if the "other_table" has large data > with a wide range of "price" and give rise to awful performance of NameNode > RPC. > We think it's a good idea to limit the maximum number of files allowed to be > create by each task for protecting HDFS NameNode from unconscious error. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org