[jira] [Updated] (SPARK-31751) spark serde property path overwrites table property location

2020-05-25 Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-31751:
-
Description: 
This is an issue that has caused us many data errors.

1) Using Spark (with Hive support enabled):

{code}
df = spark.createDataFrame([{"a": "x", "b": "y", "c": "3"}])
df.write.format("orc").option("compression", "ZLIB").mode("overwrite").saveAsTable("test_spark")
{code}
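For reference, the SerDe property in question is visible immediately after this write; a quick way to inspect it (a sketch, assuming a Hive-enabled pyspark session):

{code}
# Optional check (assumes a Hive-enabled session): the Spark-specific
# 'path' entry shows up among the table's storage/serde properties
# right after saveAsTable.
spark.sql("DESCRIBE FORMATTED test_spark").show(100, truncate=False)
{code}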

 

2) From Hive:

{code}
ALTER TABLE test_spark RENAME TO test_spark2;
{code}

 

3) From the spark-sql command line (note: not pyspark or spark-shell):

{code}
SELECT * FROM test_spark2;
{code}

 

This gives the output:

{code}
NULL NULL NULL
Time taken: 0.334 seconds, Fetched 1 row(s)
{code}

 

The query returns NULLs because the PySpark write API adds a SerDe property called {{path}} to the table's metadata in the Hive metastore. When Hive renames the table, it does not understand this property and leaves it unchanged, so it still points at the old directory. When spark-sql later reads the table, it honours the SerDe {{path}} property before the metastore location and tries to read from the now non-existent HDFS path. Failing with an error would have been acceptable, but silently returning NULLs makes applications fail badly. Since Spark claims to support Hive tables, it should respect the Hive metastore location property rather than its own SerDe {{path}} property when reading a table. This cannot be classified as expected behaviour.
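One possible manual mitigation (a sketch, not a verified fix across the affected versions): re-point the stale SerDe property at the table's real location. The HDFS path below is an assumed default warehouse layout; the actual value is the "Location" that DESCRIBE FORMATTED reports for the renamed table.

{code}
# Sketch of a possible manual workaround, from a Hive-enabled pyspark session.
# ASSUMPTION: the renamed table's data lives at the default warehouse path
# below; substitute the Location reported by DESCRIBE FORMATTED.
spark.sql(
    "ALTER TABLE test_spark2 SET SERDEPROPERTIES "
    "('path'='hdfs:///user/hive/warehouse/test_spark2')"
)

# Reads should now resolve against the correct directory instead of
# silently returning NULLs from the old, non-existent one.
spark.sql("SELECT * FROM test_spark2").show()
{code}

The proper fix remains on the Spark side: prefer the metastore location over the SerDe {{path}} property when the two disagree.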



> spark serde property path overwrites table property location
> 
>
> Key: SPARK-31751
> URL: https://issues.apache.org/jira/browse/SPARK-31751
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1, 2.4.5
>Reporter: Nithin
>Priority: Major
>






[jira] [Updated] (SPARK-31751) spark serde property path overwrites table property location

2020-05-25 Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-31751:
-
Flags:   (was: Important)







[jira] [Updated] (SPARK-31751) spark serde property path overwrites table property location

2020-05-25 Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-31751:
-
Priority: Major  (was: Critical)







[jira] [Updated] (SPARK-31751) spark serde property path overwrites table property location

2020-05-22 Nithin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nithin updated SPARK-31751:
---
Affects Version/s: 2.4.5



