[ https://issues.apache.org/jira/browse/SPARK-27498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bruce Robbins updated SPARK-27498:
----------------------------------
    Summary: Built-in parquet code path (convertMetastoreParquet=true) does not 
respect hive.enforce.bucketing  (was: Built-in parquet code path does not 
respect hive.enforce.bucketing)

> Built-in parquet code path (convertMetastoreParquet=true) does not respect 
> hive.enforce.bucketing
> -------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-27498
>                 URL: https://issues.apache.org/jira/browse/SPARK-27498
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.0, 3.0.0
>            Reporter: Bruce Robbins
>            Priority: Major
>
> _Caveat: I can see how this could be intentional if Spark believes that the 
> built-in Parquet code path is creating Hive-compatible bucketed files. 
> However, I assume that is not the case and that this is an actual bug._
>   
>  Spark makes an effort to avoid corrupting bucketed Hive tables unless the 
> user overrides this behavior by setting hive.enforce.bucketing and 
> hive.enforce.sorting to false.
> However, this protection is not applied when Spark uses the built-in Parquet 
> code path (spark.sql.hive.convertMetastoreParquet=true) to write to the Hive 
> table.
> Here's an example.
> In Hive, do this (I create a source table, plus one bucketed table where 
> things work as expected and one where they don't):
> {noformat}
> hive> create table sourcetable as select 1 a, 3 b, 7 c;
> hive> drop table hivebuckettext1;
> hive> create table hivebuckettext1 (a int, b int, c int) clustered by (a, b) 
> sorted by (a, b asc) into 10 buckets stored as textfile;
> hive> insert into hivebuckettext1 select * from sourcetable;
> hive> drop table hivebucketparq1;
> hive> create table hivebucketparq1 (a int, b int, c int) clustered by (a, b) 
> sorted by (a, b asc) into 10 buckets stored as parquet;
> hive> insert into hivebucketparq1 select * from sourcetable;
> {noformat}
> For the text table, things seem to work as expected:
> {noformat}
> scala> sql("insert into hivebuckettext1 select 1, 2, 3")
> 19/04/17 10:26:08 WARN ObjectStore: Failed to get database global_temp, 
> returning NoSuchObjectException
> org.apache.spark.sql.AnalysisException: Output Hive table 
> `default`.`hivebuckettext1` is bucketed but Spark currently does NOT populate 
> bucketed output which is compatible with Hive.;
> {noformat}
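> For reference, the guard can be overridden deliberately. A minimal sketch of 
> the opt-out described above (assuming settings made via SET flow into the 
> Hadoop conf that the Hive insert path reads, and accepting that the resulting 
> files are not Hive-compatible buckets):
> {noformat}
> scala> sql("set hive.enforce.bucketing=false")
> scala> sql("set hive.enforce.sorting=false")
> scala> sql("insert into hivebuckettext1 select 1, 2, 3")
> (with both flags off, the insert is expected to go through)
> {noformat}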
> For the parquet table, the insert just happens:
> {noformat}
> scala> sql("insert into hivebucketparq1 select 1, 2, 3")
> res1: org.apache.spark.sql.DataFrame = []
> scala> 
> {noformat}
> Note also that Spark has changed the table definition of hivebucketparq1 (in 
> the HMS!) so that it is no longer a bucketed table. I will file a separate 
> Jira on this (SPARK-27497).
> If you specify "spark.sql.hive.convertMetastoreParquet=false", things work as 
> expected.
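> As a concrete illustration of that workaround (a sketch, not captured output; 
> the expected failure mirrors the text-table case above):
> {noformat}
> scala> sql("set spark.sql.hive.convertMetastoreParquet=false")
> scala> sql("insert into hivebucketparq1 select 1, 2, 3")
> (expected to fail with the same AnalysisException shown above for 
> hivebuckettext1)
> {noformat}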
> Basically, InsertIntoHiveTable respects hive.enforce.bucketing, but 
> InsertIntoHadoopFsRelationCommand does not. Probably the check should be made 
> in an analyzer rule while the InsertIntoTable node still holds a 
> HiveTableRelation.
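> For what it's worth, here is a rough sketch of what such an analyzer-side 
> check might look like (hypothetical rule name, not existing Spark code; a 
> real fix would also need to honor hive.enforce.bucketing and 
> hive.enforce.sorting the way InsertIntoHiveTable does today):
> {noformat}
> import org.apache.spark.sql.AnalysisException
> import org.apache.spark.sql.catalyst.catalog.HiveTableRelation
> import org.apache.spark.sql.catalyst.plans.logical.{InsertIntoTable, LogicalPlan}
> import org.apache.spark.sql.catalyst.rules.Rule
> 
> // Runs while the plan still holds a HiveTableRelation, i.e. before
> // RelationConversions rewrites it for convertMetastoreParquet=true, so the
> // check would fire on both code paths.
> object CheckHiveBucketedWrite extends Rule[LogicalPlan] {
>   override def apply(plan: LogicalPlan): LogicalPlan = plan transform {
>     case InsertIntoTable(r: HiveTableRelation, _, _, _, _)
>         if r.tableMeta.bucketSpec.isDefined =>
>       throw new AnalysisException(
>         s"Output Hive table ${r.tableMeta.identifier} is bucketed but Spark " +
>           "currently does NOT populate bucketed output which is compatible with Hive.")
>   }
> }
> {noformat}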
>  
>  


