Bruce Robbins created SPARK-27498:
-------------------------------------

             Summary: Built-in parquet code path does not respect hive.enforce.bucketing
                 Key: SPARK-27498
                 URL: https://issues.apache.org/jira/browse/SPARK-27498
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.4.0, 3.0.0
            Reporter: Bruce Robbins
_Caveat: I can see how this could be intentional if Spark believes that the built-in Parquet code path is creating Hive-compatible bucketed files. However, I assume that is not the case and that this is an actual bug._

Spark makes an effort to avoid corrupting Hive-bucketed tables unless the user explicitly overrides this behavior by setting hive.enforce.bucketing and hive.enforce.sorting to false. However, this protection breaks down when Spark uses the built-in Parquet code path to write to the table.

Here's an example. In Hive, do this (I create one table where things work as expected, and one where they don't):

{noformat}
hive> create table sourcetable as select 1 a, 3 b, 7 c;
hive> drop table hivebuckettext1;
hive> create table hivebuckettext1 (a int, b int, c int) clustered by (a, b) sorted by (a, b asc) into 10 buckets stored as textfile;
hive> insert into hivebuckettext1 select * from sourcetable;
hive> drop table hivebucketparq1;
hive> create table hivebucketparq1 (a int, b int, c int) clustered by (a, b) sorted by (a, b asc) into 10 buckets stored as parquet;
hive> insert into hivebucketparq1 select * from sourcetable;
{noformat}

For the text table, things work as expected: the insert is rejected.

{noformat}
scala> sql("insert into hivebuckettext1 select 1, 2, 3")
19/04/17 10:26:08 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
org.apache.spark.sql.AnalysisException: Output Hive table `default`.`hivebuckettext1` is bucketed but Spark currently does NOT populate bucketed output which is compatible with Hive.;
{noformat}

For the parquet table, the insert just happens:

{noformat}
scala> sql("insert into hivebucketparq1 select 1, 2, 3")
res1: org.apache.spark.sql.DataFrame = []

scala>
{noformat}

Note also that Spark has changed the table definition of hivebucketparq1 (in the HMS!) so that it is no longer a bucketed table. I have filed a separate Jira on this (SPARK-27497).

If you specify "spark.sql.hive.convertMetastoreParquet=false", things work as expected: the insert into the parquet table is rejected too. Basically, InsertIntoHiveTable respects hive.enforce.bucketing, but InsertIntoHadoopFsRelationCommand does not.
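To spell out the workaround, here is a sketch of a spark-shell session (assuming Hive support is enabled and hivebucketparq1 was freshly created by the Hive session above, since the unprotected insert also rewrites the table definition per SPARK-27497):

{noformat}
// Disable the built-in parquet conversion so the insert is planned as
// InsertIntoHiveTable rather than InsertIntoHadoopFsRelationCommand:
spark.conf.set("spark.sql.hive.convertMetastoreParquet", "false")

// With the conversion off, the insert now fails with the same
// AnalysisException that the text table produces:
sql("insert into hivebucketparq1 select 1, 2, 3")
{noformat}

Setting the conf at runtime appears sufficient, since the conversion is decided per query, but setting it at launch via --conf works as well.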