[
https://issues.apache.org/jira/browse/SPARK-10562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Cheng Lian updated SPARK-10562:
---
Description:
When using DataFrame.write.partitionBy().saveAsTable() it creates the
partition-by columns in all lowercase in the metastore. However, it writes the
data to the filesystem using mixed case.
This causes an error when running a select against the table.
{noformat}
from pyspark.sql import Row
# Create a data frame with mixed case column names
myRDD = sc.parallelize([Row(Name="John Terry", Goals=1, Year=2015),
Row(Name="Frank Lampard", Goals=15, Year=2012)])
myDF = sqlContext.createDataFrame(myRDD)
# Write this data out to a parquet file and partition by the Year (which is a
# mixedCase name)
myDF.write.partitionBy("Year").saveAsTable("chelsea_goals")
%sql show create table chelsea_goals;
--The metastore is showing a partition column name of all lowercase "year"
# Verify that the data is written with appropriate partitions
display(dbutils.fs.ls("/user/hive/warehouse/chelsea_goals"))
{noformat}
{code:sql}
%sql -- Now try to run a query against this table
select * from chelsea_goals
{code}
{noformat}
Error in SQL statement: UncheckedExecutionException:
java.lang.RuntimeException: Partition column year not found in schema
StructType(StructField(Goals,LongType,true), StructField(Name,StringType,true),
StructField(Year,LongType,true))
{noformat}
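The error above is consistent with a case-sensitive name lookup: the metastore supplies "year", while the schema written to disk only contains "Year". A minimal sketch of that mismatch in plain Python follows. It is an illustration only, not Spark's actual resolution code, and find_partition_column is a hypothetical helper name.

```python
# Field names as reported in the StructType of the error message.
schema_fields = ["Goals", "Name", "Year"]

# Partition column name as stored (lowercased) in the metastore.
metastore_partition_column = "year"


def find_partition_column(name, fields, case_sensitive=True):
    """Hypothetical helper: return the matching schema field, or None."""
    for field in fields:
        if case_sensitive and field == name:
            return field
        if not case_sensitive and field.lower() == name.lower():
            return field
    return None


# A case-sensitive lookup finds nothing, mirroring "not found in schema".
exact_match = find_partition_column(metastore_partition_column, schema_fields)

# A case-insensitive lookup would resolve to the mixed-case field.
relaxed_match = find_partition_column(
    metastore_partition_column, schema_fields, case_sensitive=False
)
```

Whether the right fix is a case-insensitive lookup or preserving the original case in the metastore is not decided by this sketch.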
{noformat}
# Now let's try this again using a lowercase column name
myRDD2 = sc.parallelize([Row(Name="John Terry", Goals=1, year=2015),
Row(Name="Frank Lampard", Goals=15, year=2012)])
myDF2 = sqlContext.createDataFrame(myRDD2)
myDF2.write.partitionBy("year").saveAsTable("chelsea_goals2")
{noformat}
{code:sql}
%sql select * from chelsea_goals2;
--Now everything works
{code}
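Since the lowercase example works, one possible workaround is to lowercase every column name before writing. The sketch below assumes no two columns differ only by case. lowercase_names, with_lowercase_columns and the table name chelsea_goals_lower are hypothetical names chosen for illustration; DataFrame.toDF(*cols) is used to rename the columns.

```python
def lowercase_names(names):
    """Lowercase column names, refusing names that would collide."""
    lowered = [n.lower() for n in names]
    if len(set(lowered)) != len(lowered):
        raise ValueError("column names collide after lowercasing: %r" % (names,))
    return lowered


def with_lowercase_columns(df):
    """Return a DataFrame with the same data and all-lowercase column names."""
    return df.toDF(*lowercase_names(df.columns))


# Intended usage in the notebook above (assumption, not executed here):
# lowerDF = with_lowercase_columns(myDF)
# lowerDF.write.partitionBy("year").saveAsTable("chelsea_goals_lower")
```

This only sidesteps the problem by changing the column names in the stored table; it does not address the mismatch itself.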
was:
When using DataFrame.write.partitionBy().saveAsTable() it creates the
partition-by columns in all lowercase in the metastore. However, it writes the
data to the filesystem using mixed case.
This causes an error when running a select against the table.
--
from pyspark.sql import Row
# Create a data frame with mixed case column names
myRDD = sc.parallelize([Row(Name="John Terry", Goals=1, Year=2015),
Row(Name="Frank Lampard", Goals=15, Year=2012)])
myDF = sqlContext.createDataFrame(myRDD)
# Write this data out to a parquet file and partition by the Year (which is a
# mixedCase name)
myDF.write.partitionBy("Year").saveAsTable("chelsea_goals")
%sql show create table chelsea_goals;
--The metastore is showing a partition column name of all lowercase "year"
# Verify that the data is written with appropriate partitions
display(dbutils.fs.ls("/user/hive/warehouse/chelsea_goals"))
%sql
--Now try to run a query against this table
select * from chelsea_goals
Error in SQL statement: UncheckedExecutionException:
java.lang.RuntimeException: Partition column year not found in schema
StructType(StructField(Goals,LongType,true), StructField(Name,StringType,true),
StructField(Year,LongType,true))
# Now let's try this again using a lowercase column name
myRDD2 = sc.parallelize([Row(Name="John Terry", Goals=1, year=2015),
Row(Name="Frank Lampard", Goals=15, year=2012)])
myDF2 = sqlContext.createDataFrame(myRDD2)
myDF2.write.partitionBy("year").saveAsTable("chelsea_goals2")
%sql select * from chelsea_goals2;
--Now everything works
> .partitionBy() creates the metastore partition columns in all lowercase, but
> persists the data path as MixedCase, resulting in an error when the data is
> later queried.
> -
>
> Key: SPARK-10562
> URL: https://issues.apache.org/jira/browse/SPARK-10562
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.4.1
> Reporter: Jason Pohl
> Assignee: Wenchen Fan
> Attachments: MixedCasePartitionBy.dbc
>
>
> When using DataFrame.write.partitionBy().saveAsTable() it creates the
> partition-by columns in all lowercase in the metastore. However, it writes
> the data to the filesystem using mixed case.
> This causes an error when running a select against the table.
> {noformat}
> from pyspark.sql import Row
> # Create a data frame with mixed case column names
> myRDD = sc.parallelize([Row(Name="John Terry", Goals=1, Year=2015),
> Row(Name="Frank Lampard", Goals=15, Year=2012)])
> myDF = sqlContext.createDataFrame(myRDD)
> # Write this data out to a parquet file and partition by the Year (which is a
> # mixedCase name)
> myDF.write.partitionBy("Year").saveAsTable("chelsea_goals")
> %sql show create table chelsea_goals;
>