[ https://issues.apache.org/jira/browse/SPARK-27592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16961404#comment-16961404 ]

Dongjoon Hyun commented on SPARK-27592:
---------------------------------------

Ping [~Patnaik] since you asked about this on SPARK-29234. Originally, this was 
registered as an improvement, so we don't backport it to the older branches. 
However, given the situation, I'm also fine with [~yumwang] backporting it 
since he is the author.

BTW, please don't expect a backport to EOL branches like `branch-2.3`.
 - [https://spark.apache.org/versioning-policy.html]
{quote}Feature release branches will, generally, be maintained with bug fix 
releases for a period of 18 months. For example, branch 2.3.x is no longer 
considered maintained as of September 2019, 18 months after the release of 
2.3.0 in February 2018. No more 2.3.x releases should be expected after that 
point, even for bug fixes.
{quote}

> Set the bucketed data source table SerDe correctly
> --------------------------------------------------
>
>                 Key: SPARK-27592
>                 URL: https://issues.apache.org/jira/browse/SPARK-27592
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Yuming Wang
>            Assignee: Yuming Wang
>            Priority: Major
>             Fix For: 3.0.0
>
>
> We hint Hive to use an incorrect 
> InputFormat (org.apache.hadoop.mapred.SequenceFileInputFormat) to read Spark's 
> Parquet data source bucketed table:
> {noformat}
> spark-sql> CREATE TABLE t (c1 INT, c2 INT) USING parquet CLUSTERED BY (c1) 
> SORTED BY (c1) INTO 2 BUCKETS;
>  2019-04-29 17:52:05 WARN HiveExternalCatalog:66 - Persisting bucketed data 
> source table `default`.`t` into Hive metastore in Spark SQL specific format, 
> which is NOT compatible with Hive.
>  spark-sql> DESC EXTENDED t;
>  c1 int NULL
>  c2 int NULL
>  # Detailed Table Information
>  Database default
>  Table t
>  Owner yumwang
>  Created Time Mon Apr 29 17:52:05 CST 2019
>  Last Access Thu Jan 01 08:00:00 CST 1970
>  Created By Spark 2.4.0
>  Type MANAGED
>  Provider parquet
>  Num Buckets 2
>  Bucket Columns [`c1`]
>  Sort Columns [`c1`]
>  Table Properties [transient_lastDdlTime=1556531525]
>  Location file:/user/hive/warehouse/t
>  Serde Library org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>  InputFormat org.apache.hadoop.mapred.SequenceFileInputFormat
>  OutputFormat org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
>  Storage Properties [serialization.format=1]
> {noformat}
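>  For comparison, here is a rough sketch of what the storage section should 
> report for a Parquet table that Hive can read. The table name t_hive is 
> hypothetical and the output is abbreviated; the SerDe/format classes are the 
> standard Hive Parquet ones, shown here as an assumption of the desired result, 
> not output taken from this ticket:
> {noformat}
> spark-sql> CREATE TABLE t_hive (c1 INT, c2 INT) STORED AS PARQUET;
>  spark-sql> DESC EXTENDED t_hive;
>  ...
>  Serde Library org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
>  InputFormat org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
>  OutputFormat org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
> {noformat}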
>  We can see the incompatibility warning when creating the table:
> {noformat}
> WARN HiveExternalCatalog:66 - Persisting bucketed data source table 
> `default`.`t` into Hive metastore in Spark SQL specific format, which is NOT 
> compatible with Hive.
> {noformat}
>  But downstream engines don't know about this incompatibility. I'd like to 
> write this table's write information into the metadata so that each engine 
> can decide compatibility for itself.
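>  As a rough illustration of that idea (the property keys shown are ones Spark 
> already stores for data source tables, listed here as an assumption; which 
> additional keys this ticket would write is not specified, and the output is 
> abbreviated), a downstream engine could inspect the table properties in the 
> metastore and decide for itself whether it can read the data:
> {noformat}
> hive> SHOW TBLPROPERTIES t;
> spark.sql.sources.provider	parquet
> spark.sql.sources.schema.numBuckets	2
> transient_lastDdlTime	1556531525
> ...
> {noformat}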


