Bruce Robbins created SPARK-27498:
-------------------------------------

             Summary: Built-in parquet code path does not respect 
hive.enforce.bucketing
                 Key: SPARK-27498
                 URL: https://issues.apache.org/jira/browse/SPARK-27498
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.4.0, 3.0.0
            Reporter: Bruce Robbins


_Caveat: I can see how this could be intentional if Spark believes that the 
built-in Parquet code path is creating Hive-compatible bucketed files. However, 
I assume that is not the case and that this is an actual bug._
  
 Spark tries to avoid corrupting Hive-bucketed tables by refusing to write to 
them, unless the user overrides this behavior by setting both 
hive.enforce.bucketing and hive.enforce.sorting to false.
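
(For reference, the override would look roughly like this from a spark-shell 
session, using the hivebuckettext1 table from the repro below. Whether a plain 
SET is enough to reach the Hadoop configuration that the write path consults is 
an assumption on my part; passing --conf spark.hadoop.hive.enforce.bucketing=false 
and --conf spark.hadoop.hive.enforce.sorting=false at launch is the more direct 
route.)
{noformat}
scala> // deliberately declare that we don't care about preserving Hive's bucketing
scala> sql("SET hive.enforce.bucketing=false")
scala> sql("SET hive.enforce.sorting=false")
scala> // with both flags false, Spark allows the write to the bucketed table
scala> sql("insert into hivebuckettext1 select 1, 2, 3")
{noformat}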

However, this protection is not applied when Spark writes to the table via its 
built-in Parquet code path.

Here's an example.

In Hive, do this (I create a source table plus two bucketed tables: one where 
things work as expected, and one where they don't):
{noformat}
hive> create table sourcetable as select 1 a, 3 b, 7 c;
hive> drop table hivebuckettext1;
hive> create table hivebuckettext1 (a int, b int, c int) clustered by (a, b) 
sorted by (a, b asc) into 10 buckets stored as textfile;
hive> insert into hivebuckettext1 select * from sourcetable;
hive> drop table hivebucketparq1;
hive> create table hivebucketparq1 (a int, b int, c int) clustered by (a, b) 
sorted by (a, b asc) into 10 buckets stored as parquet;
hive> insert into hivebucketparq1 select * from sourcetable;
{noformat}
For the text table, the write is rejected as expected:
{noformat}
scala> sql("insert into hivebuckettext1 select 1, 2, 3")
19/04/17 10:26:08 WARN ObjectStore: Failed to get database global_temp, 
returning NoSuchObjectException
org.apache.spark.sql.AnalysisException: Output Hive table 
`default`.`hivebuckettext1` is bucketed but Spark currently does NOT populate 
bucketed output which is compatible with Hive.;
{noformat}
For the Parquet table, however, the insert silently succeeds:
{noformat}
scala> sql("insert into hivebucketparq1 select 1, 2, 3")
res1: org.apache.spark.sql.DataFrame = []
scala> 
{noformat}
Note also that Spark has changed the table definition of hivebucketparq1 (in 
the HMS!) so that it is no longer a bucketed table. I will file a separate Jira 
on this (SPARK-27497).
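
(One way to see the metadata change, not shown in the transcript above, is to 
compare the table description in Hive before and after the Spark insert; Num 
Buckets and Bucket Columns should read 10 and [a, b] right after creation and 
are gone afterwards.)
{noformat}
hive> describe formatted hivebucketparq1;
{noformat}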

If you set spark.sql.hive.convertMetastoreParquet=false, things work as 
expected: the insert into the Parquet table is rejected just like the 
text-table case.
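
(I have not pasted that run here, but it should look roughly like this:)
{noformat}
scala> sql("SET spark.sql.hive.convertMetastoreParquet=false")
scala> // the write now goes through InsertIntoHiveTable, which checks
scala> // hive.enforce.bucketing / hive.enforce.sorting and should fail with:
scala> //   "Output Hive table `default`.`hivebucketparq1` is bucketed but Spark
scala> //    currently does NOT populate bucketed output which is compatible with Hive."
scala> sql("insert into hivebucketparq1 select 1, 2, 3")
{noformat}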

Basically, InsertIntoHiveTable respects hive.enforce.bucketing, but 
InsertIntoHadoopFsRelationCommand does not.
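
For whoever picks this up: below is a rough sketch of the guard that the Hive 
write path applies and the built-in Parquet path skips. It is paraphrased from 
the error above, not copied from Spark source; the method and parameter names 
are placeholders, the default of true for the two flags is an assumption, and a 
plain exception stands in for the AnalysisException Spark actually raises.
{noformat}
import org.apache.hadoop.conf.Configuration

// Sketch only, not actual Spark code: refuse a write to a Hive-bucketed table
// unless the user has explicitly disabled both enforcement flags.
def checkBucketedWriteAllowed(
    tableName: String,
    isBucketed: Boolean,            // e.g. the target table has a bucket spec
    hadoopConf: Configuration): Unit = {
  if (isBucketed) {
    val enforceBucketing = hadoopConf.getBoolean("hive.enforce.bucketing", true)
    val enforceSorting = hadoopConf.getBoolean("hive.enforce.sorting", true)
    if (enforceBucketing || enforceSorting) {
      // inside Spark this is an AnalysisException with the message shown earlier
      throw new RuntimeException(
        s"Output Hive table `$tableName` is bucketed but Spark currently does NOT " +
          "populate bucketed output which is compatible with Hive.")
    }
  }
}
{noformat}
The Parquet path (InsertIntoHadoopFsRelationCommand) performs no equivalent 
check, which is what this issue is about.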
