[jira] [Updated] (SPARK-31751) spark serde property path overwrites table property location
[ https://issues.apache.org/jira/browse/SPARK-31751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon updated SPARK-31751:
---------------------------------
    Description:
This is an issue that has caused us so many data errors.

1) Using Spark (with Hive context enabled):
{code}
df = spark.createDataFrame([{"a": "x", "b": "y", "c": "3"}])
df.write.format("orc").option("compression", "ZLIB").mode("overwrite").saveAsTable('test_spark')
{code}

2) From Hive:
{code}
alter table test_spark rename to test_spark2
{code}

3) From the spark-sql command line (note: not pyspark or spark-shell):
{code}
select * from test_spark2
{code}

This gives the output:
{code}
NULL NULL NULL
Time taken: 0.334 seconds, Fetched 1 row(s)
{code}

The query returns NULL because the pyspark write API adds a serde property called path to the Hive metastore. When Hive renames the table, it does not understand this serde property and hence keeps it as it is. When spark-sql then tries to read the table, it honors the serde property first and tries to read from the now non-existent HDFS location. If it had raised an error, that would have been fine, but returning NULL causes applications to fail badly. Spark claims to support Hive tables, so it should respect the Hive metastore location property rather than the Spark serde property when reading a table. This cannot be classified as expected behaviour.

  was:
This is an issue that has caused us so many data errors.

1) Using Spark (with Hive context enabled):
df = spark.createDataFrame([\{"a": "x", "b": "y", "c": "3"}])
df.write.format("orc").option("compression", "ZLIB").mode("overwrite").saveAsTable('test_spark')

2) From Hive:
alter table test_spark rename to test_spark2

3) From the spark-sql command line (note: not pyspark or spark-shell):
select * from test_spark2

This gives the output:
NULL NULL NULL
Time taken: 0.334 seconds, Fetched 1 row(s)

The query returns NULL because the pyspark write API adds a serde property called path to the Hive metastore. When Hive renames the table, it does not understand this serde property and hence keeps it as it is. When spark-sql then tries to read the table, it honors the serde property first and tries to read from the now non-existent HDFS location. If it had raised an error, that would have been fine, but returning NULL causes applications to fail badly. Spark claims to support Hive tables, so it should respect the Hive metastore location property rather than the Spark serde property when reading a table. This cannot be classified as expected behaviour.


> spark serde property path overwrites table property location
> ------------------------------------------------------------
>
>                 Key: SPARK-31751
>                 URL: https://issues.apache.org/jira/browse/SPARK-31751
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.1, 2.4.5
>            Reporter: Nithin
>            Priority: Major
>
> This is an issue that has caused us so many data errors.
>
> 1) Using Spark (with Hive context enabled):
> {code}
> df = spark.createDataFrame([{"a": "x", "b": "y", "c": "3"}])
> df.write.format("orc").option("compression", "ZLIB").mode("overwrite").saveAsTable('test_spark')
> {code}
>
> 2) From Hive:
> {code}
> alter table test_spark rename to test_spark2
> {code}
>
> 3) From the spark-sql command line (note: not pyspark or spark-shell):
> {code}
> select * from test_spark2
> {code}
>
> This gives the output:
> {code}
> NULL NULL NULL
> Time taken: 0.334 seconds, Fetched 1 row(s)
> {code}
>
> The query returns NULL because the pyspark write API adds a serde property
> called path to the Hive metastore. When Hive renames the table, it does not
> understand this serde property and hence keeps it as it is. When spark-sql
> then tries to read the table, it honors the serde property first and tries
> to read from the now non-existent HDFS location. If it had raised an error,
> that would have been fine, but returning NULL causes applications to fail
> badly. Spark claims to support Hive tables, so it should respect the Hive
> metastore location property rather than the Spark serde property when
> reading a table. This cannot be classified as expected behaviour.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
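
The mechanism described in the report (a serde property path that Hive leaves untouched on rename, and a reader that prefers it over the table location) can be illustrated with a small toy model. This is only a sketch under stated assumptions: it is not Spark or Hive code, and the dict layout, the /warehouse root, and every function name are hypothetical. It also assumes that the Hive rename moves the table's location, which the report implies when it says the old HDFS location no longer exists.

```python
# Toy model of the behaviour reported in SPARK-31751.
# NOT Spark or Hive code: the dict layout, the /warehouse paths and all
# function names below are hypothetical, chosen only for illustration.

WAREHOUSE = "/warehouse"  # assumed warehouse root


def save_as_table(metastore, name):
    """Model the Spark write: store a location and a serde property 'path'."""
    location = f"{WAREHOUSE}/{name}"
    metastore[name] = {
        "location": location,
        "serde_properties": {"path": location},
    }


def hive_rename(metastore, old_name, new_name):
    """Model the Hive rename: the location moves, serde properties are kept as is."""
    table = metastore.pop(old_name)
    table["location"] = f"{WAREHOUSE}/{new_name}"
    metastore[new_name] = table


def resolve_read_path(table):
    """Model the reported read: the serde 'path' is honored before the location."""
    return table["serde_properties"].get("path", table["location"])


def has_stale_path(table):
    """True when a serde 'path' exists and no longer matches the location."""
    path = table["serde_properties"].get("path")
    return path is not None and path != table["location"]


metastore = {}
save_as_table(metastore, "test_spark")
hive_rename(metastore, "test_spark", "test_spark2")

renamed = metastore["test_spark2"]
print(renamed["location"])         # /warehouse/test_spark2
print(resolve_read_path(renamed))  # /warehouse/test_spark (the old directory)
print(has_stale_path(renamed))     # True
```

In this model the read resolves to the directory of the old table name, which matches the report's explanation of why the query does not see the data. A check like has_stale_path is one way such a mismatch could be detected before reading; whether Spark should warn, fail, or fall back to the location is the question the issue raises.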
[jira] [Updated] (SPARK-31751) spark serde property path overwrites table property location
[ https://issues.apache.org/jira/browse/SPARK-31751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon updated SPARK-31751:
---------------------------------
    Flags:   (was: Important)

> spark serde property path overwrites table property location
> ------------------------------------------------------------
>
>                 Key: SPARK-31751
>                 URL: https://issues.apache.org/jira/browse/SPARK-31751
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.1, 2.4.5
>            Reporter: Nithin
>            Priority: Major
[jira] [Updated] (SPARK-31751) spark serde property path overwrites table property location
[ https://issues.apache.org/jira/browse/SPARK-31751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon updated SPARK-31751:
---------------------------------
    Priority: Major  (was: Critical)

> spark serde property path overwrites table property location
> ------------------------------------------------------------
>
>                 Key: SPARK-31751
>                 URL: https://issues.apache.org/jira/browse/SPARK-31751
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.1, 2.4.5
>            Reporter: Nithin
>            Priority: Major
[jira] [Updated] (SPARK-31751) spark serde property path overwrites table property location
[ https://issues.apache.org/jira/browse/SPARK-31751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nithin updated SPARK-31751:
---------------------------
    Affects Version/s: 2.4.5

> spark serde property path overwrites table property location
> ------------------------------------------------------------
>
>                 Key: SPARK-31751
>                 URL: https://issues.apache.org/jira/browse/SPARK-31751
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.1, 2.4.5
>            Reporter: Nithin
>            Priority: Critical