Never mind, I got the point: Spark replaces Hive's Parquet SerDe with its own reader, so spark.sql.hive.convertMetastoreParquet should be set to false to use Hive's. Thanks
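For reference, a minimal sketch of applying the option when building a session (illustrative only: the app name is made up, and the query just mirrors the aggregate/filter shape of the plans quoted below; the table name comes from the thread):

import org.apache.spark.sql.SparkSession

// With convertMetastoreParquet=false, Spark reads the table through its Hive
// SerDe/InputFormat (HiveTableScan) instead of the built-in Parquet reader
// (FileScan parquet), so the custom HoodieInputFormat is actually used.
val spark = SparkSession.builder()
  .appName("hoodie-read-via-hive-serde")   // hypothetical app name
  .enableHiveSupport()
  .config("spark.sql.hive.convertMetastoreParquet", "false")
  .getOrCreate()

// Query shaped after the physical plans in the quoted message.
spark.sql(
  """SELECT event_id, datestr, count(1) AS c
    |FROM default.hoodie_test_as_reportads_new
    |GROUP BY event_id, datestr
    |HAVING count(1) > 1
    |ORDER BY datestr ASC, event_id DESC
    |LIMIT 10""".stripMargin).show()

Since this is a SQL session option, it should also be possible to flip it at runtime with spark.conf.set("spark.sql.hive.convertMetastoreParquet", "false"); for the Thrift Server it can go into spark-defaults.conf or be passed with --conf when starting the server.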
On Thu, Apr 25, 2019 at 5:00 PM Jun Zhu <jun....@vungle.com> wrote:

> Hi,
> We are using plugins from Apache Hudi, which define a Hive external table
> with a custom InputFormat:
>
> ROW FORMAT SERDE
>   'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
> WITH SERDEPROPERTIES (
>   'serialization.format' = '1'
> )
> STORED AS
>   INPUTFORMAT 'com.uber.hoodie.hadoop.HoodieInputFormat'
>   OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
> LOCATION 's3a://vungle2-dataeng/jun-test/stage20190424new'
>
> It works when queried in spark-shell, but not in the Spark Thrift Server
> with the same config. After debugging I found that the spark-shell
> execution plan differs from the Spark Thrift Server's:
>
> 1. In spark-shell:
>
> == Physical Plan ==
> TakeOrderedAndProject(limit=10, orderBy=[datestr#130 ASC NULLS FIRST,event_id#81 DESC NULLS LAST], output=[event_id#81,datestr#130,c#74L])
> +- *(2) Filter (c#74L > 1)
>    +- *(2) HashAggregate(keys=[event_id#81, datestr#130], functions=[count(1)])
>       +- Exchange hashpartitioning(event_id#81, datestr#130, 200)
>          +- *(1) HashAggregate(keys=[event_id#81, datestr#130], functions=[partial_count(1)])
>             +- *HiveTableScan* [event_id#81, datestr#130], *HiveTableRelation* `default`.`hoodie_test_as_reportads_new`, org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe, [_hoodie_record_key#78, _hoodie_commit_time#79, _hoodie_commit_seqno#8...
>
> 2. In the Spark Thrift Server:
>
> == Physical Plan ==
> TakeOrderedAndProject(limit=10, orderBy=[datestr#63 ASC NULLS FIRST,event_id#14 DESC NULLS LAST], output=[event_id#14,datestr#63,c#7L])
> +- *(2) Filter (c#7L > 1)
>    +- *(2) HashAggregate(keys=[event_id#14, datestr#63], functions=[count(1)])
>       +- Exchange hashpartitioning(event_id#14, datestr#63, 200)
>          +- *(1) HashAggregate(keys=[event_id#14, datestr#63], functions=[partial_count(1)])
>             +- *(1) *FileScan* *parquet* default.hoodie_test_as_reportads_new[event_id#14,datestr#63] Batched: true, Format: *Parquet*, Location: PrunedInMemoryFileIndex[s3a://vungle2-dataeng/jun-test/stage20190424new/2019-04-24_08, s3
>
> It looks like the Thrift Server fails to recognize the custom InputFormat.
> Any thoughts? Or can I configure the FileScan to be a HiveTableScan? Thanks~
>
> Best,
> Jun Zhu
> Sr. Engineer I, Data
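A hedged way to check the same thing against the Thrift Server over JDBC (host, port, and user below are placeholders, and it assumes the Hive JDBC driver is on the classpath; if a session-level SET is not honored for this option in your build, set it in spark-defaults.conf or via --conf when starting the server instead):

import java.sql.DriverManager

// Placeholder connection details for the Spark Thrift Server.
val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "spark", "")
val stmt = conn.createStatement()

// Ask this session to keep the Hive SerDe path for metastore Parquet tables.
stmt.execute("SET spark.sql.hive.convertMetastoreParquet=false")

// The plan should now show HiveTableScan / HiveTableRelation rather than
// "FileScan parquet" for the Hudi-backed table.
val rs = stmt.executeQuery(
  "EXPLAIN SELECT event_id, datestr FROM default.hoodie_test_as_reportads_new LIMIT 10")
while (rs.next()) println(rs.getString(1))

rs.close(); stmt.close(); conn.close()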