[jira] [Updated] (SPARK-20240) SparkSQL support limitations of max dynamic partitions when inserting hive table

2019-05-20 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-20240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-20240:
-
Labels: bulk-closed  (was: )

> SparkSQL support limitations of max dynamic partitions when inserting hive 
> table
> 
>
> Key: SPARK-20240
> URL: https://issues.apache.org/jira/browse/SPARK-20240
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.2, 1.6.3
>Reporter: zenglinxi
>Priority: Major
>  Labels: bulk-closed
>
> We found that HDFS problems sometimes occur when a user has a typo in their 
> code while using SparkSQL to insert data into a partitioned table.
> For example:
> create table:
>  {quote}
>  create table test_tb (
>price double
> ) PARTITIONED BY (day_partition string, hour_partition string)
> {quote}
> normal SQL for inserting into the table:
>  {quote}
> insert overwrite table test_tb partition(day_partition, hour_partition) 
> select price, day_partition, hour_partition from other_table;
>  {quote}
> SQL with a typo:
>  {quote}
> insert overwrite table test_tb partition(day_partition, hour_partition) 
> select hour_partition, day_partition, price from other_table;
>  {quote}
> Because dynamic partition columns are matched by position rather than by name, 
> this typo makes SparkSQL treat the column "price" as "hour_partition", which may 
> create millions of HDFS files in a short time if "other_table" holds a large 
> amount of data with a wide range of "price" values, and severely degrades 
> NameNode RPC performance.
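> One possible sanity check before running such an insert is to count the 
> distinct values of the trailing SELECT columns, since those are the ones that 
> become the dynamic partition values; with the typo above the count explodes. 
> A sketch against the example tables:
>  {quote}
> -- correct ordering: one row per (day_partition, hour_partition) pair
> select count(distinct day_partition, hour_partition) from other_table;
> -- typo ordering: "price" silently becomes the second partition column
> select count(distinct day_partition, price) from other_table;
>  {quote}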
> We think it would be a good idea to limit the maximum number of files each task 
> is allowed to create, to protect the HDFS NameNode from such inadvertent errors.
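> For reference, Hive itself exposes guards such as 
> hive.exec.max.dynamic.partitions, hive.exec.max.dynamic.partitions.pernode and 
> hive.exec.max.created.files; whether SparkSQL enforces them depends on the 
> Spark/Hive version, so the following is only a sketch of how such limits could 
> be set for the session before the insert:
>  {quote}
> set hive.exec.dynamic.partition=true;
> set hive.exec.dynamic.partition.mode=nonstrict;
> -- cap the total number of dynamic partitions one query may create
> set hive.exec.max.dynamic.partitions=1000;
> -- cap the dynamic partitions created per node/task
> set hive.exec.max.dynamic.partitions.pernode=100;
> -- cap the total number of files one job may create (Hive default is 100000)
> set hive.exec.max.created.files=100000;
>  {quote}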



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20240) SparkSQL support limitations of max dynamic partitions when inserting hive table

2017-04-06 Thread zenglinxi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zenglinxi updated SPARK-20240:
--
Affects Version/s: (was: 2.1.0)




[jira] [Updated] (SPARK-20240) SparkSQL support limitations of max dynamic partitions when inserting hive table

2017-04-06 Thread zenglinxi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zenglinxi updated SPARK-20240:
--
Affects Version/s: 1.6.3
