Re: Why custom parquet format hive table execute ParquetTableScan physical plan, not HiveTableScan?
|spark.sql.parquet.filterPushdown| defaults to |false| because there is a bug in Parquet that may cause NPEs; please refer to http://spark.apache.org/docs/latest/sql-programming-guide.html#configuration. This bug hasn't been fixed in Parquet master yet. We'll turn this option on by default once the bug is fixed.

Cheng
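For readers who still want to experiment with the flag despite the known Parquet bug, it can be enabled per session from the spark shell. This is a sketch, assuming a Spark 1.2 spark-shell where `sqlContext` is a HiveContext; only the property name comes from this thread:

```scala
// Enable Parquet filter pushdown for this session only.
// It is off by default because of the Parquet NPE bug mentioned above.
sqlContext.setConf("spark.sql.parquet.filterPushdown", "true")

// Equivalently, via SQL:
sqlContext.sql("SET spark.sql.parquet.filterPushdown=true")
```

Session-level `setConf` avoids baking a known-buggy setting into spark-defaults.conf for every job.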
Re: Why custom parquet format hive table execute ParquetTableScan physical plan, not HiveTableScan?
In Spark SQL, Parquet filter pushdown doesn't cover |HiveTableScan| for now. May I ask why you prefer |HiveTableScan| rather than |ParquetTableScan|?

Cheng
Re: Why custom parquet format hive table execute ParquetTableScan physical plan, not HiveTableScan?
spark.sql.parquet.filterPushdown=true has been turned on. But with spark.sql.hive.convertMetastoreParquet set to false, the first parameter no longer takes effect!!!
Re: Why custom parquet format hive table execute ParquetTableScan physical plan, not HiveTableScan?
If you're talking about filter pushdown for Parquet files, this also has to be turned on explicitly. Try spark.sql.parquet.filterPushdown=true; it's off by default.
Re: Why custom parquet format hive table execute ParquetTableScan physical plan, not HiveTableScan?
Yes, it works! But the filter can't be pushed down!!! Does a custom Parquet InputFormat have to implement the data sources API for that? https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala
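For context, a relation implemented against the interfaces.scala API linked above receives the pushed-down filters directly. The following is a rough, hypothetical sketch only: the names `MyParquetSource` and `MyParquetRelation` are invented, the imports match Spark 1.2 (where the scan interfaces were abstract classes; in later releases they became mix-in traits), and the actual reading logic is elided:

```scala
package com.a

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext, StructType}
import org.apache.spark.sql.sources._

// Registered from SQL as:
//   CREATE TEMPORARY TABLE t USING com.a.MyParquetSource OPTIONS (path '...')
class MyParquetSource extends RelationProvider {
  override def createRelation(
      sqlContext: SQLContext,
      parameters: Map[String, String]): BaseRelation =
    new MyParquetRelation(parameters("path"), sqlContext)
}

// PrunedFilteredScan hands buildScan the required columns and the filters
// Spark could push down, so a custom reader can honor both.
class MyParquetRelation(path: String, @transient val sqlContext: SQLContext)
    extends PrunedFilteredScan {

  override def schema: StructType =
    ??? // e.g. derive the schema from the Parquet file footers under `path`

  override def buildScan(
      requiredColumns: Array[String],
      filters: Array[Filter]): RDD[Row] =
    ??? // read `path` with the custom reader, applying columns and filters
}
```

This path bypasses both HiveTableScan and the built-in ParquetTableScan, which is why the thread suggests it for custom formats that need pushdown.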
Why custom parquet format hive table execute ParquetTableScan physical plan, not HiveTableScan?
Hi all!

In Spark SQL 1.2.0, I create a Hive table with a custom Parquet InputFormat and OutputFormat, like this:

CREATE TABLE test(
  id string,
  msg string)
CLUSTERED BY (id)
SORTED BY (id ASC)
INTO 10 BUCKETS
ROW FORMAT SERDE 'com.a.MyParquetHiveSerDe'
STORED AS
  INPUTFORMAT 'com.a.MyParquetInputFormat'
  OUTPUTFORMAT 'com.a.MyParquetOutputFormat';

And in the spark shell, the plan of "select * from test" is:

[== Physical Plan ==]
[!OutputFaker [id#5,msg#6]]
[ ParquetTableScan [id#12,msg#13], (ParquetRelation hdfs://hadoop/user/hive/warehouse/test.db/test, Some(Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml, hdfs-default.xml, hdfs-site.xml), org.apache.spark.sql.hive.HiveContext@6d15a113, []), []]

Not HiveTableScan!!! So it doesn't execute my custom InputFormat! Why? How can I make it execute my custom InputFormat?

Thanks!
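For reference, a physical plan like the one above can be printed from the spark shell. A sketch, assuming `sqlContext` is the HiveContext that spark-shell provides in Spark 1.2:

```scala
// Print the plan via SQL:
sqlContext.sql("EXPLAIN SELECT * FROM test").collect().foreach(println)

// Or inspect the executed plan on the SchemaRDD itself:
println(sqlContext.sql("SELECT * FROM test").queryExecution.executedPlan)
```

Seeing `ParquetTableScan` here rather than `HiveTableScan` is the symptom discussed in this thread.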
RE: Why custom parquet format hive table execute ParquetTableScan physical plan, not HiveTableScan?
I think you might need to set spark.sql.hive.convertMetastoreParquet to false, if I understand that flag correctly.

Sent on the new Sprint Network from my Samsung Galaxy S®4.
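A sketch of how that flag can be set for the current session, assuming a Spark 1.2 spark-shell where `sqlContext` is a HiveContext:

```scala
// Stop Spark SQL from converting metastore Parquet tables into its native
// ParquetRelation, so the table's Hive SerDe and InputFormat are used instead:
sqlContext.setConf("spark.sql.hive.convertMetastoreParquet", "false")

// Equivalently, via SQL:
sqlContext.sql("SET spark.sql.hive.convertMetastoreParquet=false")
```

With the conversion disabled, the plan should use HiveTableScan, at the cost of losing the native Parquet scan's optimizations (including filter pushdown, as later messages in this thread note).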
Re: Why custom parquet format hive table execute ParquetTableScan physical plan, not HiveTableScan?
Thanks yana! I will try it!