Re: Why custom parquet format hive table execute ParquetTableScan physical plan, not HiveTableScan?

2015-01-20 Thread Cheng Lian
|spark.sql.parquet.filterPushdown| defaults to |false| because there is a 
bug in Parquet that may cause an NPE; please refer to 
http://spark.apache.org/docs/latest/sql-programming-guide.html#configuration


This bug hasn’t been fixed in Parquet master. We’ll turn this on once 
the bug is fixed.


Cheng
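(For anyone who wants to experiment despite the caveat above: the flag can be toggled per session. A minimal sketch, assuming |sqlContext| is the HiveContext bound in the Spark 1.2 spark-shell:)

```scala
// Check the current value (second argument is the fallback when unset),
// then opt in for this session only -- at your own risk, given the NPE bug.
println(sqlContext.getConf("spark.sql.parquet.filterPushdown", "false"))
sqlContext.setConf("spark.sql.parquet.filterPushdown", "true")
```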

On 1/19/15 5:02 PM, Xiaoyu Wang wrote:

*spark.sql.parquet.filterPushdown=true* has been turned on, but with 
*spark.sql.hive.convertMetastoreParquet* set to *false*, the first 
parameter has no effect!




Re: Why custom parquet format hive table execute ParquetTableScan physical plan, not HiveTableScan?

2015-01-20 Thread Cheng Lian
In Spark SQL, Parquet filter pushdown doesn't cover |HiveTableScan| for 
now. May I ask why you prefer |HiveTableScan| over |ParquetTableScan|?


Cheng

On 1/19/15 5:02 PM, Xiaoyu Wang wrote:

*spark.sql.parquet.filterPushdown=true* has been turned on, but with 
*spark.sql.hive.convertMetastoreParquet* set to *false*, the first 
parameter has no effect!




Re: Why custom parquet format hive table execute ParquetTableScan physical plan, not HiveTableScan?

2015-01-19 Thread Xiaoyu Wang
*spark.sql.parquet.filterPushdown=true* has been turned on, but with
*spark.sql.hive.convertMetastoreParquet* set to *false*, the first parameter
has no effect!
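To make the interaction concrete, here is a sketch of the two session settings side by side (Spark 1.2-era spark-shell, assuming *sqlContext* is the HiveContext bound in the shell):

```scala
// With conversion disabled, the table is planned as a HiveTableScan...
sqlContext.setConf("spark.sql.hive.convertMetastoreParquet", "false")
// ...and this flag only influences ParquetTableScan, so it no longer applies.
sqlContext.setConf("spark.sql.parquet.filterPushdown", "true")
```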

2015-01-20 6:52 GMT+08:00 Yana Kadiyska yana.kadiy...@gmail.com:

 If you're talking about filter pushdowns for Parquet files, this
 also has to be turned on explicitly. Try
 *spark.sql.parquet.filterPushdown=true*. It's off by default.



Re: Why custom parquet format hive table execute ParquetTableScan physical plan, not HiveTableScan?

2015-01-19 Thread Yana Kadiyska
If you're talking about filter pushdowns for Parquet files, this also has
to be turned on explicitly. Try *spark.sql.parquet.filterPushdown=true*.
It's off by default.

On Mon, Jan 19, 2015 at 3:46 AM, Xiaoyu Wang wangxy...@gmail.com wrote:

 Yes, it works!
 But the filter can't be pushed down!

 Does a custom Parquet InputFormat have to implement the data source API?


 https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala





Re: Why custom parquet format hive table execute ParquetTableScan physical plan, not HiveTableScan?

2015-01-19 Thread Xiaoyu Wang
Yes, it works!
But the filter can't be pushed down!

Does a custom Parquet InputFormat have to implement the data source API?

https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala
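For reference, a rough outline of a source written against that interfaces.scala. The names below (MyParquetSource, the "path" option, the ??? bodies) are hypothetical placeholders, and package locations changed between Spark 1.2 and 1.3, so verify the exact signatures against the linked file rather than treating this as compilable code:

```scala
// Hypothetical sketch of the Spark 1.2-era external data sources API.
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext, StructType}
import org.apache.spark.sql.sources.{RelationProvider, TableScan}

// Spark instantiates this provider for tables created USING this source.
class MyParquetSource extends RelationProvider {
  override def createRelation(sqlContext: SQLContext,
                              parameters: Map[String, String]) =
    MyParquetRelation(parameters("path"))(sqlContext)
}

// A scan-only relation: report a schema and produce an RDD of rows.
case class MyParquetRelation(path: String)
                            (@transient val sqlContext: SQLContext)
  extends TableScan {
  override def schema: StructType = ???     // e.g. derived from the Parquet footer
  override def buildScan(): RDD[Row] = ???  // e.g. read via the custom InputFormat
}
```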

2015-01-16 21:51 GMT+08:00 Xiaoyu Wang wangxy...@gmail.com:

 Thanks yana!
 I will try it!

 On Jan 16, 2015, at 20:51, yana yana.kadiy...@gmail.com wrote:

 I think you might need to set
 spark.sql.hive.convertMetastoreParquet to false if I understand that flag
 correctly





Why custom parquet format hive table execute ParquetTableScan physical plan, not HiveTableScan?

2015-01-16 Thread Xiaoyu Wang
Hi all!

In Spark SQL 1.2.0, I create a Hive table with a custom Parquet
InputFormat and OutputFormat, like this:
CREATE TABLE test(
  id string,
  msg string)
CLUSTERED BY (
  id)
SORTED BY (
  id ASC)
INTO 10 BUCKETS
ROW FORMAT SERDE
  'com.a.MyParquetHiveSerDe'
STORED AS INPUTFORMAT
  'com.a.MyParquetInputFormat'
OUTPUTFORMAT
  'com.a.MyParquetOutputFormat';

And in the spark shell, the plan of select * from test is:

[== Physical Plan ==]
[!OutputFaker [id#5,msg#6]]
[ *ParquetTableScan* [id#12,msg#13], (ParquetRelation
hdfs://hadoop/user/hive/warehouse/test.db/test, Some(Configuration:
core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml,
yarn-default.xml, yarn-site.xml, hdfs-default.xml, hdfs-site.xml),
org.apache.spark.sql.hive.HiveContext@6d15a113, []), []]

*Not HiveTableScan*!!!
So it doesn't execute my custom InputFormat!
Why? How can I make it execute my custom InputFormat?

Thanks!
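One way to confirm what is reported above, and to test the spark.sql.hive.convertMetastoreParquet flag suggested in the replies, is to print the physical plan before and after changing the setting. A sketch for the Spark 1.2 spark-shell (assuming sqlContext is a HiveContext):

```scala
// Default: the metastore Parquet table is converted by the planner, so the
// plan shows ParquetTableScan and the custom InputFormat is bypassed.
println(sqlContext.sql("SELECT * FROM test").queryExecution.executedPlan)

// Disable the conversion; the planner should now fall back to HiveTableScan,
// which reads through the table's declared InputFormat.
sqlContext.setConf("spark.sql.hive.convertMetastoreParquet", "false")
println(sqlContext.sql("SELECT * FROM test").queryExecution.executedPlan)
```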


RE: Why custom parquet format hive table execute ParquetTableScan physical plan, not HiveTableScan?

2015-01-16 Thread yana
I think you might need to set 
spark.sql.hive.convertMetastoreParquet to false if I understand that flag 
correctly

Sent on the new Sprint Network from my Samsung Galaxy S®4.

 Original message 
From: Xiaoyu Wang wangxy...@gmail.com
Date: 01/16/2015 5:09 AM (GMT-05:00)
To: user@spark.apache.org
Subject: Why custom parquet format hive table execute ParquetTableScan physical plan, not HiveTableScan?

Hi all!

In Spark SQL 1.2.0, I create a Hive table with a custom Parquet
InputFormat and OutputFormat, like this:
CREATE TABLE test(
  id string, 
  msg string)
CLUSTERED BY ( 
  id) 
SORTED BY ( 
  id ASC) 
INTO 10 BUCKETS
ROW FORMAT SERDE
  'com.a.MyParquetHiveSerDe'
STORED AS INPUTFORMAT 
  'com.a.MyParquetInputFormat' 
OUTPUTFORMAT 
  'com.a.MyParquetOutputFormat';

And in the spark shell, the plan of select * from test is:

[== Physical Plan ==]
[!OutputFaker [id#5,msg#6]]
[ ParquetTableScan [id#12,msg#13], (ParquetRelation 
hdfs://hadoop/user/hive/warehouse/test.db/test, Some(Configuration: 
core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, 
yarn-default.xml, yarn-site.xml, hdfs-default.xml, hdfs-site.xml), 
org.apache.spark.sql.hive.HiveContext@6d15a113, []), []]

Not HiveTableScan!!!
So it doesn't execute my custom InputFormat!
Why? How can I make it execute my custom InputFormat?

Thanks!

Re: Why custom parquet format hive table execute ParquetTableScan physical plan, not HiveTableScan?

2015-01-16 Thread Xiaoyu Wang
Thanks yana!
I will try it!

 On Jan 16, 2015, at 20:51, yana yana.kadiy...@gmail.com wrote:
 
 I think you might need to set 
 spark.sql.hive.convertMetastoreParquet to false if I understand that flag 
 correctly
 