Hi,
We are using the Apache Hudi plugin, which registers a Hive external
table with a custom InputFormat:

ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
WITH SERDEPROPERTIES (
  'serialization.format' = '1'
)
STORED AS
  INPUTFORMAT 'com.uber.hoodie.hadoop.HoodieInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION 's3a://vungle2-dataeng/jun-test/stage20190424new'

Queries on this table work in spark-shell, but not in the Spark Thrift
Server with the same configuration. After debugging, we found that the
spark-shell execution plan differs from the Thrift Server plan.
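For context, the plans below come from a query shaped roughly like the
following (reconstructed from the plans; the exact original query may
differ, but the table and column names are taken from the plans):

```sql
-- Reconstructed from the physical plans below: group by event_id/datestr,
-- filter on the count, and take the top 10 ordered rows.
SELECT event_id, datestr, count(1) AS c
FROM hoodie_test_as_reportads_new
GROUP BY event_id, datestr
HAVING count(1) > 1
ORDER BY datestr ASC, event_id DESC
LIMIT 10;
```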
1. In spark-shell:
== Physical Plan ==
TakeOrderedAndProject(limit=10, orderBy=[datestr#130 ASC NULLS FIRST, event_id#81 DESC NULLS LAST], output=[event_id#81, datestr#130, c#74L])
+- *(2) Filter (c#74L > 1)
   +- *(2) HashAggregate(keys=[event_id#81, datestr#130], functions=[count(1)])
      +- Exchange hashpartitioning(event_id#81, datestr#130, 200)
         +- *HiveTableScan* [event_id#81, datestr#130], *HiveTableRelation* `default`.`hoodie_test_as_reportads_new`, org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe, [_hoodie_record_key#78, _hoodie_commit_time#79, _hoodie_commit_seqno#8...

2. In the Spark Thrift Server:

== Physical Plan ==
TakeOrderedAndProject(limit=10, orderBy=[datestr#63 ASC NULLS FIRST, event_id#14 DESC NULLS LAST], output=[event_id#14, datestr#63, c#7L])
+- *(2) Filter (c#7L > 1)
   +- *(2) HashAggregate(keys=[event_id#14, datestr#63], functions=[count(1)])
      +- Exchange hashpartitioning(event_id#14, datestr#63, 200)
         +- *(1) HashAggregate(keys=[event_id#14, datestr#63], functions=[partial_count(1)])
            +- *(1) *FileScan* *parquet* default.hoodie_test_as_reportads_new[event_id#14,datestr#63] Batched: true, Format: *Parquet*, Location: PrunedInMemoryFileIndex[s3a://vungle2-dataeng/jun-test/stage20190424new/2019-04-24_08, s3

It looks like the Thrift Server does not pick up the custom InputFormat.
Any thoughts? Or is there a way to configure it to use HiveTableScan
instead of FileScan? Thanks~
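One guess we have not verified yet: Spark converts Hive tables that use
the parquet SerDe to its native parquet reader unless that conversion is
disabled, which would explain the FileScan. If that is the cause,
something like the following might restore HiveTableScan:

```sql
-- Assumption (untested on our side): the Thrift Server is applying
-- Spark's metastore parquet conversion. Disabling it per session, or via
-- --conf spark.sql.hive.convertMetastoreParquet=false when starting the
-- Thrift Server, may restore the HiveTableScan path:
SET spark.sql.hive.convertMetastoreParquet=false;
```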
Best,

-- 
*Jun Zhu*
Sr. Engineer I, Data