[jira] [Commented] (SPARK-14172) Hive table partition predicate not passed down correctly

2017-10-03 Thread Saktheesh Balaraj (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16189481#comment-16189481
 ] 

Saktheesh Balaraj commented on SPARK-14172:
---

Similar problem is observed while joining 2 hive tables based on partition 
columns.

*Example*
Table A having 1000 partitions (date partition and hour sub-partition)  and 
Table B having 2 partition. when joining 2 tables based on partition it's going 
for full table scan in table A i.e using all 1000 partitions instead of taking 
2 partitions from Table A and join with Table B.

*select * from tableA a, tableB b where a.trans_date=b.trans_date and 
a.trans_hour=b.trans_hour;*

(Here trans_date is the partition and trans_hour is the sub-partition on both 
the tables)

*Workaround: *
selecting 2 partitions from table B and then do lookup on Table A 
step1: select trans_date and a.trans_hour from table B 
step2: select * from tableA where trans_date= and 
a.trans_hour =



> Hive table partition predicate not passed down correctly
> 
>
> Key: SPARK-14172
> URL: https://issues.apache.org/jira/browse/SPARK-14172
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Yingji Zhang
>Priority: Critical
>
> When the hive sql contains nondeterministic fields,  spark plan will not push 
> down the partition predicate to the HiveTableScan. For example:
> {code}
> -- consider following query which uses a random function to sample rows
> SELECT *
> FROM table_a
> WHERE partition_col = 'some_value'
> AND rand() < 0.01;
> {code}
> The spark plan will not push down the partition predicate to HiveTableScan 
> which ends up scanning all partitions data from the table.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14172) Hive table partition predicate not passed down correctly

2017-01-18 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15828179#comment-15828179
 ] 

Hyukjin Kwon commented on SPARK-14172:
--

Oh wait, sorry the linked PR above has a different JIRA.

> Hive table partition predicate not passed down correctly
> 
>
> Key: SPARK-14172
> URL: https://issues.apache.org/jira/browse/SPARK-14172
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Yingji Zhang
>Priority: Critical
>
> When the hive sql contains nondeterministic fields,  spark plan will not push 
> down the partition predicate to the HiveTableScan. For example:
> {code}
> -- consider following query which uses a random function to sample rows
> SELECT *
> FROM table_a
> WHERE partition_col = 'some_value'
> AND rand() < 0.01;
> {code}
> The spark plan will not push down the partition predicate to HiveTableScan 
> which ends up scanning all partitions data from the table.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14172) Hive table partition predicate not passed down correctly

2017-01-18 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15828177#comment-15828177
 ] 

Hyukjin Kwon commented on SPARK-14172:
--

Hi [~hvanhovell], this JIRA seems mistakenly not resolved.

> Hive table partition predicate not passed down correctly
> 
>
> Key: SPARK-14172
> URL: https://issues.apache.org/jira/browse/SPARK-14172
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Yingji Zhang
>Priority: Critical
>
> When the hive sql contains nondeterministic fields,  spark plan will not push 
> down the partition predicate to the HiveTableScan. For example:
> {code}
> -- consider following query which uses a random function to sample rows
> SELECT *
> FROM table_a
> WHERE partition_col = 'some_value'
> AND rand() < 0.01;
> {code}
> The spark plan will not push down the partition predicate to HiveTableScan 
> which ends up scanning all partitions data from the table.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14172) Hive table partition predicate not passed down correctly

2016-08-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15418532#comment-15418532
 ] 

Apache Spark commented on SPARK-14172:
--

User 'jiangxb1987' has created a pull request for this issue:
https://github.com/apache/spark/pull/14620

> Hive table partition predicate not passed down correctly
> 
>
> Key: SPARK-14172
> URL: https://issues.apache.org/jira/browse/SPARK-14172
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Yingji Zhang
>Priority: Critical
>
> When the hive sql contains nondeterministic fields,  spark plan will not push 
> down the partition predicate to the HiveTableScan. For example:
> {code}
> -- consider following query which uses a random function to sample rows
> SELECT *
> FROM table_a
> WHERE partition_col = 'some_value'
> AND rand() < 0.01;
> {code}
> The spark plan will not push down the partition predicate to HiveTableScan 
> which ends up scanning all partitions data from the table.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14172) Hive table partition predicate not passed down correctly

2016-06-27 Thread Jiang Xingbo (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15350704#comment-15350704
 ] 

Jiang Xingbo commented on SPARK-14172:
--

Yes you are right. I thought the deterministic part can always be PPDed safely 
but it was not, in fact, the order of each part should also be considered. For 
example:
{code:sql}rand() < 0.01 AND partition_col = 'some_value'{code} should not be 
PPDed, but
{code:sql}partition_col = 'some_value' AND rand() < 0.01{code} still could be.

Thank you for your kindly reply!

> Hive table partition predicate not passed down correctly
> 
>
> Key: SPARK-14172
> URL: https://issues.apache.org/jira/browse/SPARK-14172
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Yingji Zhang
>Priority: Critical
>
> When the hive sql contains nondeterministic fields,  spark plan will not push 
> down the partition predicate to the HiveTableScan. For example:
> {code}
> -- consider following query which uses a random function to sample rows
> SELECT *
> FROM table_a
> WHERE partition_col = 'some_value'
> AND rand() < 0.01;
> {code}
> The spark plan will not push down the partition predicate to HiveTableScan 
> which ends up scanning all partitions data from the table.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14172) Hive table partition predicate not passed down correctly

2016-06-27 Thread Wenchen Fan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15350624#comment-15350624
 ] 

Wenchen Fan commented on SPARK-14172:
-

I'm not sure if it's safe to push it down. For non-deterministic expressions, 
the order(or number) of input rows matters. If we push down the deterministic 
part of filter condition, then the input rows to the remaining filter condition 
will change and may result to wrong answer.

> Hive table partition predicate not passed down correctly
> 
>
> Key: SPARK-14172
> URL: https://issues.apache.org/jira/browse/SPARK-14172
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Yingji Zhang
>Priority: Critical
>
> When the hive sql contains nondeterministic fields,  spark plan will not push 
> down the partition predicate to the HiveTableScan. For example:
> {code}
> -- consider following query which uses a random function to sample rows
> SELECT *
> FROM table_a
> WHERE partition_col = 'some_value'
> AND rand() < 0.01;
> {code}
> The spark plan will not push down the partition predicate to HiveTableScan 
> which ends up scanning all partitions data from the table.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14172) Hive table partition predicate not passed down correctly

2016-06-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15348184#comment-15348184
 ] 

Apache Spark commented on SPARK-14172:
--

User 'jiangxb1987' has created a pull request for this issue:
https://github.com/apache/spark/pull/13893

> Hive table partition predicate not passed down correctly
> 
>
> Key: SPARK-14172
> URL: https://issues.apache.org/jira/browse/SPARK-14172
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Yingji Zhang
>Priority: Critical
>
> When the hive sql contains nondeterministic fields,  spark plan will not push 
> down the partition predicate to the HiveTableScan. For example:
> {code}
> -- consider following query which uses a random function to sample rows
> SELECT *
> FROM table_a
> WHERE partition_col = 'some_value'
> AND rand() < 0.01;
> {code}
> The spark plan will not push down the partition predicate to HiveTableScan 
> which ends up scanning all partitions data from the table.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14172) Hive table partition predicate not passed down correctly

2016-06-23 Thread Pedro Osorio (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15347556#comment-15347556
 ] 

Pedro Osorio commented on SPARK-14172:
--

[~lucasmf] I have been able to reproduce the issue by running the following 
code in a spark shell:

{code}
sqlContext.sql("drop table test_partition_predicate")
sqlContext.sql("create table test_partition_predicate (col1 string) partitioned 
by (partition_col string)")
sqlContext.sql("explain extended select * from test_partition_predicate where 
partition_col = '1' and rand() < 0.9").collect().foreach(println)
{code}

I have verified using community.cloud.databricks.com that in spark1.4 the 
physical plan uses the partition_col predicate in the HiveTableScan, but in 
versions 1.6 and 2.0, this shows up in the filters section, which means the 
whole table is scanned.

As [~jiangxb1987] pointed out, the issue is in collectProjectsAndFilters and 
was introduced in this patch, I believe: 
https://github.com/apache/spark/pull/8486

> Hive table partition predicate not passed down correctly
> 
>
> Key: SPARK-14172
> URL: https://issues.apache.org/jira/browse/SPARK-14172
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Yingji Zhang
>Priority: Critical
>
> When the hive sql contains nondeterministic fields,  spark plan will not push 
> down the partition predicate to the HiveTableScan. For example:
> {code}
> -- consider following query which uses a random function to sample rows
> SELECT *
> FROM table_a
> WHERE partition_col = 'some_value'
> AND rand() < 0.01;
> {code}
> The spark plan will not push down the partition predicate to HiveTableScan 
> which ends up scanning all partitions data from the table.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14172) Hive table partition predicate not passed down correctly

2016-06-23 Thread Jiang Xingbo (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15346209#comment-15346209
 ] 

Jiang Xingbo commented on SPARK-14172:
--

In collectProjectsAndFilters function, currently we only collect filters when 
condition.deterministic is true. Consider one condition like this:
"SELECT value, hr FROM srcpart1 WHERE ds = '2008-04-08' AND rand() < 0.9"
where srcpart1 has ds as its partition key. In this case, condition is not 
determinstic, but we should collect "ds = '2008-04-08'" because it's a 
determinstic part and is connected with the other part using 'AND'.
To fix this problem, I suggest we recursively split the condition which is an 
'AND', and then we put the derminstic parts into filters, and the other parts 
into child.
Were these thoughts plausible, I'm willing to create a pr to improve this.

> Hive table partition predicate not passed down correctly
> 
>
> Key: SPARK-14172
> URL: https://issues.apache.org/jira/browse/SPARK-14172
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Yingji Zhang
>Priority: Critical
>
> When the hive sql contains nondeterministic fields,  spark plan will not push 
> down the partition predicate to the HiveTableScan. For example:
> {code}
> -- consider following query which uses a random function to sample rows
> SELECT *
> FROM table_a
> WHERE partition_col = 'some_value'
> AND rand() < 0.01;
> {code}
> The spark plan will not push down the partition predicate to HiveTableScan 
> which ends up scanning all partitions data from the table.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14172) Hive table partition predicate not passed down correctly

2016-06-21 Thread MIN-FU YANG (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15343121#comment-15343121
 ] 

MIN-FU YANG commented on SPARK-14172:
-

I cannot reproduce it on 1.6.1 either. Could you give more detailed description?

> Hive table partition predicate not passed down correctly
> 
>
> Key: SPARK-14172
> URL: https://issues.apache.org/jira/browse/SPARK-14172
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Yingji Zhang
>Priority: Critical
>
> When the hive sql contains nondeterministic fields,  spark plan will not push 
> down the partition predicate to the HiveTableScan. For example:
> {code}
> -- consider following query which uses a random function to sample rows
> SELECT *
> FROM table_a
> WHERE partition_col = 'some_value'
> AND rand() < 0.01;
> {code}
> The spark plan will not push down the partition predicate to HiveTableScan 
> which ends up scanning all partitions data from the table.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14172) Hive table partition predicate not passed down correctly

2016-06-21 Thread MIN-FU YANG (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15342917#comment-15342917
 ] 

MIN-FU YANG commented on SPARK-14172:
-

Hi, I cannot reproduce the problem in master branch. Maybe it's resolved in 
newer version. I'll try reproduce it in 1.6.1

> Hive table partition predicate not passed down correctly
> 
>
> Key: SPARK-14172
> URL: https://issues.apache.org/jira/browse/SPARK-14172
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Yingji Zhang
>Priority: Critical
>
> When the hive sql contains nondeterministic fields,  spark plan will not push 
> down the partition predicate to the HiveTableScan. For example:
> {code}
> -- consider following query which uses a random function to sample rows
> SELECT *
> FROM table_a
> WHERE partition_col = 'some_value'
> AND rand() < 0.01;
> {code}
> The spark plan will not push down the partition predicate to HiveTableScan 
> which ends up scanning all partitions data from the table.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org