[ https://issues.apache.org/jira/browse/SPARK-27498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Bruce Robbins updated SPARK-27498:
----------------------------------
    Summary: Built-in parquet code path (convertMetastoreParquet=true) does not respect hive.enforce.bucketing  (was: Built-in parquet code path does not respect hive.enforce.bucketing)

> Built-in parquet code path (convertMetastoreParquet=true) does not respect
> hive.enforce.bucketing
> -------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-27498
>                 URL: https://issues.apache.org/jira/browse/SPARK-27498
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.0, 3.0.0
>            Reporter: Bruce Robbins
>            Priority: Major
>
> _Caveat: I can see how this could be intentional if Spark believes that the
> built-in Parquet code path is creating Hive-compatible bucketed files.
> However, I assume that is not the case and that this is an actual bug._
>
> Spark makes an effort to avoid corrupting bucketed Hive tables unless the
> user overrides this behavior by setting hive.enforce.bucketing and
> hive.enforce.sorting to false.
> However, this behavior falls down when Spark uses the built-in Parquet code
> path to write to the Hive table.
> Here's an example.
> In Hive, do this (I create a table where things work as expected, and one
> where things don't work as expected):
> {noformat}
> hive> create table sourcetable as select 1 a, 3 b, 7 c;
> hive> drop table hivebuckettext1;
> hive> create table hivebuckettext1 (a int, b int, c int) clustered by (a, b)
> sorted by (a, b asc) into 10 buckets stored as textfile;
> hive> insert into hivebuckettext1 select * from sourcetable;
> hive> drop table hivebucketparq1;
> hive> create table hivebucketparq1 (a int, b int, c int) clustered by (a, b)
> sorted by (a, b asc) into 10 buckets stored as parquet;
> hive> insert into hivebucketparq1 select * from sourcetable;
> {noformat}
> For the text table, things seem to work as expected:
> {noformat}
> scala> sql("insert into hivebuckettext1 select 1, 2, 3")
> 19/04/17 10:26:08 WARN ObjectStore: Failed to get database global_temp,
> returning NoSuchObjectException
> org.apache.spark.sql.AnalysisException: Output Hive table
> `default`.`hivebuckettext1` is bucketed but Spark currently does NOT populate
> bucketed output which is compatible with Hive.;
> {noformat}
> For the parquet table, the insert just happens:
> {noformat}
> scala> sql("insert into hivebucketparq1 select 1, 2, 3")
> res1: org.apache.spark.sql.DataFrame = []
> scala>
> {noformat}
> Note also that Spark has changed the table definition of hivebucketparq1 (in
> the HMS!) so that it is no longer a bucketed table. I will file a separate
> Jira on this (SPARK-27497).
> If you specify "spark.sql.hive.convertMetastoreParquet=false", things work as
> expected.
> Basically, InsertIntoHiveTable respects hive.enforce.bucketing, but
> InsertIntoHadoopFsRelationCommand does not. Probably the check should be made
> in an analyzer rule while the InsertIntoTable node still holds a
> HiveTableRelation.
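The decision logic the reporter describes can be sketched in plain Scala. This is a minimal, self-contained illustration of the check that the InsertIntoHiveTable path applies but the converted Parquet path skips; the names `TableInfo` and `checkBucketing` are hypothetical, not actual Spark internals, and a real fix would live in an analyzer rule as suggested above.

```scala
// Hypothetical stand-in for the catalog info an analyzer rule could read
// from a HiveTableRelation before conversion discards it.
case class TableInfo(isBucketed: Boolean, isSorted: Boolean)

// Mirrors the behavior seen on the text-table path: writing to a bucketed
// (or sorted) Hive table is rejected unless the user has set
// hive.enforce.bucketing / hive.enforce.sorting to false.
def checkBucketing(
    table: TableInfo,
    enforceBucketing: Boolean,
    enforceSorting: Boolean): Either[String, Unit] = {
  if (table.isBucketed && enforceBucketing) {
    Left("Output Hive table is bucketed but Spark currently does NOT " +
      "populate bucketed output which is compatible with Hive.")
  } else if (table.isSorted && enforceSorting) {
    Left("Output Hive table is sorted but Spark cannot produce " +
      "Hive-compatible sorted output.")
  } else {
    Right(()) // user explicitly opted out of enforcement, allow the write
  }
}
```

On the bug's repro, `hivebucketparq1` would correspond to `TableInfo(isBucketed = true, isSorted = true)`, so with both enforcement flags at their defaults the insert should be rejected on the Parquet path exactly as it is on the text path.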
>
>
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org