Never mind, I got the point: Spark replaces Hive's Parquet SerDe with its own reader, so spark.sql.hive.convertMetastoreParquet should be set to false to use Hive's. Thanks
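For reference, a minimal sketch of applying the option when building a session (illustrative only: the app name is made up, and the query just mirrors the aggregate/filter shape of the plans quoted below; the table name comes from the thread):

import org.apache.spark.sql.SparkSession

// With convertMetastoreParquet=false, Spark reads the table through its Hive
// SerDe/InputFormat (HiveTableScan) instead of the built-in Parquet reader
// (FileScan parquet), so the custom HoodieInputFormat is actually used.
val spark = SparkSession.builder()
  .appName("hoodie-read-via-hive-serde")   // hypothetical app name
  .enableHiveSupport()
  .config("spark.sql.hive.convertMetastoreParquet", "false")
  .getOrCreate()

// Query shaped after the physical plans in the quoted message.
spark.sql(
  """SELECT event_id, datestr, count(1) AS c
    |FROM default.hoodie_test_as_reportads_new
    |GROUP BY event_id, datestr
    |HAVING count(1) > 1
    |ORDER BY datestr ASC, event_id DESC
    |LIMIT 10""".stripMargin).show()

Since this is a SQL session option, it should also be possible to flip it at runtime with spark.conf.set("spark.sql.hive.convertMetastoreParquet", "false"); for the Thrift Server it can go into spark-defaults.conf or be passed with --conf when starting the server.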
On Thu, Apr 25, 2019 at 5:00 PM Jun Zhu <jun....@vungle.com> wrote:

> Hi,
> We are using plugins from Apache Hudi, which define a Hive external table
> with a custom InputFormat:
>
> ROW FORMAT SERDE
>   'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
> WITH SERDEPROPERTIES (
>   'serialization.format' = '1'
> )
> STORED AS
>   INPUTFORMAT 'com.uber.hoodie.hadoop.HoodieInputFormat'
>   OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
> LOCATION 's3a://vungle2-dataeng/jun-test/stage20190424new'
>
> It works when queried in spark-shell, but not in the Spark Thrift Server
> with the same config. After debugging I found that the spark-shell
> execution plan differs from the Spark Thrift Server's:
>
> 1. In spark-shell:
>
> == Physical Plan ==
> TakeOrderedAndProject(limit=10, orderBy=[datestr#130 ASC NULLS FIRST,event_id#81 DESC NULLS LAST], output=[event_id#81,datestr#130,c#74L])
> +- *(2) Filter (c#74L > 1)
>    +- *(2) HashAggregate(keys=[event_id#81, datestr#130], functions=[count(1)])
>       +- Exchange hashpartitioning(event_id#81, datestr#130, 200)
>          +- *(1) HashAggregate(keys=[event_id#81, datestr#130], functions=[partial_count(1)])
>             +- *HiveTableScan* [event_id#81, datestr#130], *HiveTableRelation* `default`.`hoodie_test_as_reportads_new`, org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe, [_hoodie_record_key#78, _hoodie_commit_time#79, _hoodie_commit_seqno#8...
>
> 2. In the Spark Thrift Server:
>
> == Physical Plan ==
> TakeOrderedAndProject(limit=10, orderBy=[datestr#63 ASC NULLS FIRST,event_id#14 DESC NULLS LAST], output=[event_id#14,datestr#63,c#7L])
> +- *(2) Filter (c#7L > 1)
>    +- *(2) HashAggregate(keys=[event_id#14, datestr#63], functions=[count(1)])
>       +- Exchange hashpartitioning(event_id#14, datestr#63, 200)
>          +- *(1) HashAggregate(keys=[event_id#14, datestr#63], functions=[partial_count(1)])
>             +- *(1) *FileScan* *parquet* default.hoodie_test_as_reportads_new[event_id#14,datestr#63] Batched: true, Format: *Parquet*, Location: PrunedInMemoryFileIndex[s3a://vungle2-dataeng/jun-test/stage20190424new/2019-04-24_08, s3
>
> It looks like the Thrift Server fails to recognize the custom InputFormat.
> Any thoughts? Or can I configure the FileScan to be a HiveTableScan? Thanks~
>
> Best,
> Jun Zhu
> Sr. Engineer I, Data
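A hedged way to check the same thing against the Thrift Server over JDBC (host, port, and user below are placeholders, and it assumes the Hive JDBC driver is on the classpath; if a session-level SET is not honored for this option in your build, set it in spark-defaults.conf or via --conf when starting the server instead):

import java.sql.DriverManager

// Placeholder connection details for the Spark Thrift Server.
val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "spark", "")
val stmt = conn.createStatement()

// Ask this session to keep the Hive SerDe path for metastore Parquet tables.
stmt.execute("SET spark.sql.hive.convertMetastoreParquet=false")

// The plan should now show HiveTableScan / HiveTableRelation rather than
// "FileScan parquet" for the Hudi-backed table.
val rs = stmt.executeQuery(
  "EXPLAIN SELECT event_id, datestr FROM default.hoodie_test_as_reportads_new LIMIT 10")
while (rs.next()) println(rs.getString(1))

rs.close(); stmt.close(); conn.close()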