[jira] [Commented] (SPARK-23304) Spark SQL coalesce() against hive not working

2018-02-05 Thread Wenchen Fan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16353325#comment-16353325
 ] 

Wenchen Fan commented on SPARK-23304:
-

I didn't follow the discussion here closely, but FYI, the documentation for 
`Dataset.coalesce` says:

 
{code:java}
* Returns a new Dataset that has exactly `numPartitions` partitions, when the fewer partitions
* are requested. If a larger number of partitions is requested, it will stay at the current
* number of partitions{code}
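
The documented behavior can be checked on a toy Dataset. This is only a minimal sketch, assuming a local Spark session; the `local[2]` master, the app name, and the 8-partition `spark.range` input are illustrative stand-ins and not part of the reported setup.

```scala
import org.apache.spark.sql.SparkSession

// Local session for illustration only; the thread ran against a Hive/ORC cluster.
val spark = SparkSession.builder()
  .master("local[2]")
  .appName("coalesce-doc-sketch") // hypothetical name
  .getOrCreate()

// A toy Dataset with 8 partitions (stand-in for the real table scan).
val ds = spark.range(0L, 1000L, 1L, 8)
val startParts = ds.rdd.getNumPartitions

// Fewer partitions requested: the result has exactly that many.
val fewerParts = ds.coalesce(4).rdd.getNumPartitions

// More partitions requested: coalesce does not shuffle, so the count stays as it was.
val moreParts = ds.coalesce(16).rdd.getNumPartitions

println(s"start=$startParts fewer=$fewerParts more=$moreParts")
```

So a `coalesce(16)` applied to something that already has fewer than 16 partitions should be a no-op as far as the partition count goes.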

> Spark SQL coalesce() against hive not working
> -
>
> Key: SPARK-23304
> URL: https://issues.apache.org/jira/browse/SPARK-23304
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1, 2.3.0
>Reporter: Thomas Graves
>Assignee: Xiao Li
>Priority: Major
> Attachments: spark22_oldorc_explain.txt, spark23_oldorc_explain.txt, 
> spark23_oldorc_explain_convermetastoreorcfalse.txt
>
>
> The query below seems to ignore the coalesce. This is running spark 2.2 or 
> spark 2.3 against hive, which is reading orc:
>  
>  Query:
>  spark.sql("SELECT COUNT(DISTINCT(something)) FROM sometable WHERE dt >= 
> '20170301' AND dt <= '20170331' AND something IS NOT 
> NULL").coalesce(16).show()
>   



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23304) Spark SQL coalesce() against hive not working

2018-02-02 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16350440#comment-16350440
 ] 

Thomas Graves commented on SPARK-23304:
---

OK, so I guess by that logic the coalesce won't ever work with 
COUNT(DISTINCT()), since it's the intermediate query I want it to apply to. It 
will work on the plain select of bcookie. 

I tested that and verified it: 

spark.sql("SELECT something FROM sometable WHERE dt >= '20170301' AND dt <= 
'20170331' AND something IS NOT NULL").coalesce(8).show()

That actually works.

So I guess we can close this; it was my misunderstanding.
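
The distinction drawn here can be reproduced on toy data. The following is a hedged sketch, not the reporter's job: the local session, the `spark.range` input, and the value names are assumptions made for illustration. It shows that coalescing the result of a global aggregate has nothing left to reduce, while coalescing the input narrows the scan side before the aggregation runs.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, countDistinct}

val spark = SparkSession.builder()
  .master("local[2]")
  .appName("coalesce-agg-sketch") // hypothetical name
  .getOrCreate()

// Toy stand-in for the filtered table: 8 scan partitions, 100 distinct values.
val base = spark.range(0L, 10000L, 1L, 8).selectExpr("id % 100 AS something")

// A global aggregate produces a single output partition,
// so coalescing its result cannot reduce anything further.
val agg = base.agg(countDistinct(col("something")))
val aggParts = agg.rdd.getNumPartitions
val aggCoalescedParts = agg.coalesce(16).rdd.getNumPartitions

// Coalescing the input instead narrows the scan side before the aggregation.
val narrowed = base.coalesce(4)
val scanParts = narrowed.rdd.getNumPartitions
val distinctCount = narrowed.agg(countDistinct(col("something"))).first().getLong(0)

println(s"agg=$aggParts aggCoalesced=$aggCoalescedParts scan=$scanParts distinct=$distinctCount")
```

Whether narrowing the scan like this is a good idea for a table with hundreds of thousands of splits is a separate question; the sketch only illustrates where the coalesce takes effect.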




[jira] [Commented] (SPARK-23304) Spark SQL coalesce() against hive not working

2018-02-02 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16350423#comment-16350423
 ] 

Thomas Graves commented on SPARK-23304:
---

It doesn't look like sql("xyz").rdd.partitions.length comes back correct in 
either spark 2.2 or 2.3.  

But if I change the query from SELECT COUNT(DISTINCT(bcookie)) ... to just SELECT 
bcookie, then partitions.length works.  So perhaps it is something with the 
count.

 

spark 2.3 SELECT COUNT(DISTINCT(bcookie))

scala> query.rdd.partitions.length
res4: Int = 1

scala> query.count()
[Stage 5:===> (15420 + 619) / 16039]

 

spark 2.2 SELECT COUNT(DISTINCT(bcookie)):

scala> query.rdd.partitions.length
res0: Int = 1

scala> query.count()
[Stage 0:==> (1136 + 1600) / 5346]

 

spark 2.2 Query with just select bcookie:

scala> query.rdd.partitions.length
res1: Int = 5346

spark 2.3 Query with just select bcookie:

scala> query.rdd.partitions.length
res9: Int = 16039

 

Note if I change to just be SELECT DISTINCT(bcookie) then I get 200:

scala> query.rdd.partitions.length
res10: Int = 200
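
The 200 above matches the default value of `spark.sql.shuffle.partitions`, which sets the partition count after the shuffle that DISTINCT introduces. A small sketch under assumptions (local session, toy `spark.range` input, and an arbitrary value of 50 chosen only to make the effect visible):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[2]")
  .appName("shuffle-partitions-sketch") // hypothetical name
  .getOrCreate()

// On newer Spark versions adaptive execution may merge shuffle partitions at runtime;
// turning it off keeps the count equal to the configured value.
spark.conf.set("spark.sql.adaptive.enabled", "false")

// Arbitrary illustrative value; the default is 200.
spark.conf.set("spark.sql.shuffle.partitions", "50")

val distinctParts = spark.range(0L, 1000L, 1L, 8)
  .selectExpr("id % 10 AS something")
  .distinct()
  .rdd
  .getNumPartitions

println(s"partitions after DISTINCT: $distinctParts")
```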

 




[jira] [Commented] (SPARK-23304) Spark SQL coalesce() against hive not working

2018-02-02 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16350428#comment-16350428
 ] 

Thomas Graves commented on SPARK-23304:
---

Well, I guess that gives you the final # of partitions and not the # it will 
initially be reading.




[jira] [Commented] (SPARK-23304) Spark SQL coalesce() against hive not working

2018-02-01 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16349555#comment-16349555
 ] 

Thomas Graves commented on SPARK-23304:
---

I don't have any hive tables backed by parquet to compare to.




[jira] [Commented] (SPARK-23304) Spark SQL coalesce() against hive not working

2018-02-01 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16349511#comment-16349511
 ] 

Xiao Li commented on SPARK-23304:
-

Based on the new plan, it sounds like the plan has not changed. Could you try 
using {{sql("xyz").rdd.partitions.length}} to get the number of partitions? Are 
they the same?

 




[jira] [Commented] (SPARK-23304) Spark SQL coalesce() against hive not working

2018-02-01 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16349498#comment-16349498
 ] 

Dongjoon Hyun commented on SPARK-23304:
---

I updated the affected version according to the latest updates.




[jira] [Commented] (SPARK-23304) Spark SQL coalesce() against hive not working

2018-02-01 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16349480#comment-16349480
 ] 

Thomas Graves commented on SPARK-23304:
---

I just ran the query (show()) and saw the # of partitions. 

spark23_oldorc_explain_convermetastoreorcfalse.txt is the explain with  --conf 
spark.sql.orc.impl=hive --conf spark.sql.orc.filterPushdown=false --conf 
spark.sql.hive.convertMetastoreOrc=false

 

 

> Spark SQL coalesce() against hive not working
> -
>
> Key: SPARK-23304
> URL: https://issues.apache.org/jira/browse/SPARK-23304
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Thomas Graves
>Assignee: Xiao Li
>Priority: Major
> Attachments: spark22_oldorc_explain.txt, spark23_oldorc_explain.txt, 
> spark23_oldorc_explain_convermetastoreorcfalse.txt
>
>
> The query below seems to ignore the coalesce. This is running spark 2.2 or 
> spark 2.3 against hive, which is reading orc:
>  
>  Query:
>  spark.sql("SELECT COUNT(DISTINCT(something)) FROM sometable WHERE dt >= 
> '20170301' AND dt <= '20170331' AND something IS NOT 
> NULL").coalesce(16).show()
>   






[jira] [Commented] (SPARK-23304) Spark SQL coalesce() against hive not working

2018-02-01 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16349461#comment-16349461
 ] 

Xiao Li commented on SPARK-23304:
-

That is fine.

Obviously, at the least, we need to submit a PR to document the behavior changes 
introduced by the native ORC reader. This is missing. 

I am just trying to confirm that we made no behavior change in this release. 

 

How did you get the number of partitions? Is it through something like the following?

{noformat}

sql("xyz").rdd.partitions.length

{noformat}

> Spark SQL coalesce() against hive not working
> -
>
> Key: SPARK-23304
> URL: https://issues.apache.org/jira/browse/SPARK-23304
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Thomas Graves
>Assignee: Xiao Li
>Priority: Major
> Attachments: spark22_oldorc_explain.txt, spark23_oldorc_explain.txt
>
>
> The query below seems to ignore the coalesce. This is running spark 2.2 or 
> spark 2.3 against hive, which is reading orc:
>  
>  Query:
>  spark.sql("SELECT COUNT(DISTINCT(something)) FROM sometable WHERE dt >= 
> '20170301' AND dt <= '20170331' AND something IS NOT 
> NULL").coalesce(16).show()
>   






[jira] [Commented] (SPARK-23304) Spark SQL coalesce() against hive not working

2018-02-01 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16349446#comment-16349446
 ] 

Thomas Graves commented on SPARK-23304:
---

It still seems like a bug to me since the coalesce isn't happening, but I wanted 
to make sure you saw that.  I apologize for the original mistake on my side.

> Spark SQL coalesce() against hive not working
> -
>
> Key: SPARK-23304
> URL: https://issues.apache.org/jira/browse/SPARK-23304
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Thomas Graves
>Assignee: Xiao Li
>Priority: Major
> Attachments: spark22_oldorc_explain.txt, spark23_oldorc_explain.txt
>
>
> The query below seems to ignore the coalesce. This is running spark 2.2 or 
> spark 2.3 against hive, which is reading orc:
>  
>  Query:
>  spark.sql("SELECT COUNT(DISTINCT(something)) FROM sometable WHERE dt >= 
> '20170301' AND dt <= '20170331' AND something IS NOT 
> NULL").coalesce(16).show()
>   






[jira] [Commented] (SPARK-23304) Spark SQL coalesce() against hive not working

2018-02-01 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16349444#comment-16349444
 ] 

Thomas Graves commented on SPARK-23304:
---

[~smilegator] just to make sure you saw my comment above: this isn't a 
regression from spark 2.2. I made a mistake; spark 2.2 had a small number of 
partitions to start with, and I missed that.

> Spark SQL coalesce() against hive not working
> -
>
> Key: SPARK-23304
> URL: https://issues.apache.org/jira/browse/SPARK-23304
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Thomas Graves
>Assignee: Xiao Li
>Priority: Major
> Attachments: spark22_oldorc_explain.txt, spark23_oldorc_explain.txt
>
>
> The query below seems to ignore the coalesce. This is running spark 2.2 or 
> spark 2.3 against hive, which is reading orc:
>  
>  Query:
>  spark.sql("SELECT COUNT(DISTINCT(something)) FROM sometable WHERE dt >= 
> '20170301' AND dt <= '20170331' AND something IS NOT 
> NULL").coalesce(16).show()
>   






[jira] [Commented] (SPARK-23304) Spark SQL coalesce() against hive not working

2018-02-01 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16349436#comment-16349436
 ] 

Xiao Li commented on SPARK-23304:
-

In this release, we also changed the default of another SQLConf, 
`spark.sql.hive.convertMetastoreOrc`. Could you also set this conf to `false` 
and rerun the query in the 2.3 release?

 

I am wondering whether this is in your original query. I am unable to reproduce 
this one.

```NOT (something#226 = )))```

 

> Spark SQL coalesce() against hive not working
> -
>
> Key: SPARK-23304
> URL: https://issues.apache.org/jira/browse/SPARK-23304
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Thomas Graves
>Assignee: Xiao Li
>Priority: Major
> Attachments: spark22_oldorc_explain.txt, spark23_oldorc_explain.txt
>
>
> The query below seems to ignore the coalesce. This is running spark 2.2 or 
> spark 2.3 against hive, which is reading orc:
>  
>  Query:
>  spark.sql("SELECT COUNT(DISTINCT(something)) FROM sometable WHERE dt >= 
> '20170301' AND dt <= '20170331' AND something IS NOT 
> NULL").coalesce(16).show()
>   






[jira] [Commented] (SPARK-23304) Spark SQL coalesce() against hive not working

2018-02-01 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16349430#comment-16349430
 ] 

Thomas Graves commented on SPARK-23304:
---

 

I filed Jira https://issues.apache.org/jira/browse/SPARK-23309 for the 
performance issue.

> Spark SQL coalesce() against hive not working
> -
>
> Key: SPARK-23304
> URL: https://issues.apache.org/jira/browse/SPARK-23304
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Thomas Graves
>Assignee: Xiao Li
>Priority: Major
> Attachments: spark22_oldorc_explain.txt, spark23_oldorc_explain.txt
>
>
> The query below seems to ignore the coalesce. This is running spark 2.2 or 
> spark 2.3 against hive, which is reading orc:
>  
>  Query:
>  spark.sql("SELECT COUNT(DISTINCT(something)) FROM sometable WHERE dt >= 
> '20170301' AND dt <= '20170331' AND something IS NOT 
> NULL").coalesce(16).show()
>   






[jira] [Commented] (SPARK-23304) Spark SQL coalesce() against hive not working

2018-02-01 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16349369#comment-16349369
 ] 

Thomas Graves commented on SPARK-23304:
---

Note I've removed some of the columns from the output, if you need them I can 
anonymize them instead.

> Spark SQL coalesce() against hive not working
> -
>
> Key: SPARK-23304
> URL: https://issues.apache.org/jira/browse/SPARK-23304
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Thomas Graves
>Assignee: Xiao Li
>Priority: Major
> Attachments: spark22_oldorc_explain.txt, spark23_oldorc_explain.txt
>
>
> The query below seems to ignore the coalesce. This is running spark 2.2 or 
> spark 2.3 against hive, which is reading orc:
>  
>  Query:
>  spark.sql("SELECT COUNT(DISTINCT(something)) FROM sometable WHERE dt >= 
> '20170301' AND dt <= '20170331' AND something IS NOT 
> NULL").coalesce(16).show()
>   






[jira] [Commented] (SPARK-23304) Spark SQL coalesce() against hive not working

2018-02-01 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16349367#comment-16349367
 ] 

Thomas Graves commented on SPARK-23304:
---

OK, I've attached 2 files, one with spark 2.3 and one with spark 2.2, both run 
with the options: --conf spark.sql.orc.impl=hive --conf 
spark.sql.orc.filterPushdown=false.

That was query.coalesce(8), then query.explain(true).

> Spark SQL coalesce() against hive not working
> -
>
> Key: SPARK-23304
> URL: https://issues.apache.org/jira/browse/SPARK-23304
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Thomas Graves
>Assignee: Xiao Li
>Priority: Major
>
> The query below seems to ignore the coalesce. This is running spark 2.2 or 
> spark 2.3 against hive, which is reading orc:
>  
>  Query:
>  spark.sql("SELECT COUNT(DISTINCT(something)) FROM sometable WHERE dt >= 
> '20170301' AND dt <= '20170331' AND something IS NOT 
> NULL").coalesce(16).show()
>   






[jira] [Commented] (SPARK-23304) Spark SQL coalesce() against hive not working

2018-02-01 Thread Sameer Agarwal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16349346#comment-16349346
 ] 

Sameer Agarwal commented on SPARK-23304:


Also, is there a JIRA/repro for the caching issue you mentioned? We can 
continue to investigate that in parallel (cc [~kiszk])

> Spark SQL coalesce() against hive not working
> -
>
> Key: SPARK-23304
> URL: https://issues.apache.org/jira/browse/SPARK-23304
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Thomas Graves
>Assignee: Xiao Li
>Priority: Major
>
> The query below seems to ignore the coalesce. This is running spark 2.2 or 
> spark 2.3 against hive, which is reading orc:
>  
>  Query:
>  spark.sql("SELECT COUNT(DISTINCT(something)) FROM sometable WHERE dt >= 
> '20170301' AND dt <= '20170331' AND something IS NOT 
> NULL").coalesce(16).show()
>   






[jira] [Commented] (SPARK-23304) Spark SQL coalesce() against hive not working

2018-02-01 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16349340#comment-16349340
 ] 

Xiao Li commented on SPARK-23304:
-

I do not think our native ORC reader respects `hive.exec.orc.split.strategy`. 
cc [~dongjoon] [~cloud_fan] for their confirmation.

> Spark SQL coalesce() against hive not working
> -
>
> Key: SPARK-23304
> URL: https://issues.apache.org/jira/browse/SPARK-23304
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Thomas Graves
>Assignee: Xiao Li
>Priority: Major
>
> The query below seems to ignore the coalesce. This is running spark 2.2 or 
> spark 2.3 against hive, which is reading orc:
>  
>  Query:
>  spark.sql("SELECT COUNT(DISTINCT(something)) FROM sometable WHERE dt >= 
> '20170301' AND dt <= '20170331' AND something IS NOT 
> NULL").coalesce(16).show()
>   






[jira] [Commented] (SPARK-23304) Spark SQL coalesce() against hive not working

2018-02-01 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16349339#comment-16349339
 ] 

Xiao Li commented on SPARK-23304:
-

Hi, [~tgraves], could you change the following two SQLConf settings?

`spark.sql.orc.impl` -> `hive`. This is to use the original Hive ORC reader. 

`spark.sql.orc.filterPushdown` -> `false`

Then, could you provide the plans for both 2.2 and 2.3? For example, 
`query.explain(true)`
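
For reference, the requested setup might look roughly like the sketch below when done from code rather than with `--conf` flags. It assumes Hive support is available on the classpath and that the reporter's table `sometable` exists; the app name is hypothetical, and `coalesce(8)` is just one of the values tried in this thread.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("orc-plan-comparison") // hypothetical name
  // Use the original Hive ORC reader instead of the native one.
  .config("spark.sql.orc.impl", "hive")
  .config("spark.sql.orc.filterPushdown", "false")
  .enableHiveSupport()
  .getOrCreate()

// The query from the issue description; `sometable` is the reporter's Hive table.
val query = spark.sql(
  "SELECT COUNT(DISTINCT(something)) FROM sometable " +
  "WHERE dt >= '20170301' AND dt <= '20170331' AND something IS NOT NULL")

// Prints the parsed, analyzed, optimized and physical plans for comparison across versions.
query.coalesce(8).explain(true)
```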

 

> Spark SQL coalesce() against hive not working
> -
>
> Key: SPARK-23304
> URL: https://issues.apache.org/jira/browse/SPARK-23304
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Thomas Graves
>Assignee: Xiao Li
>Priority: Major
>
> The query below seems to ignore the coalesce. This is running spark 2.2 or 
> spark 2.3 against hive, which is reading orc:
>  
>  Query:
>  spark.sql("SELECT COUNT(DISTINCT(something)) FROM sometable WHERE dt >= 
> '20170301' AND dt <= '20170331' AND something IS NOT 
> NULL").coalesce(16).show()
>   






[jira] [Commented] (SPARK-23304) Spark SQL coalesce() against hive not working

2018-02-01 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16349270#comment-16349270
 ] 

Thomas Graves commented on SPARK-23304:
---

So with the new ORC code, is there any way to control the # of partitions being 
read initially?  In spark 2.2 you could set hive.exec.orc.split.strategy, 
but that doesn't appear to work in 2.3.

> Spark SQL coalesce() against hive not working
> -
>
> Key: SPARK-23304
> URL: https://issues.apache.org/jira/browse/SPARK-23304
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Thomas Graves
>Assignee: Xiao Li
>Priority: Major
>
> The query below seems to ignore the coalesce. This is running spark 2.2 or 
> spark 2.3 against hive, which is reading orc:
>  
>  Query:
>  spark.sql("SELECT COUNT(DISTINCT(something)) FROM sometable WHERE dt >= 
> '20170301' AND dt <= '20170331' AND something IS NOT 
> NULL").coalesce(16).show()
>   






[jira] [Commented] (SPARK-23304) Spark SQL coalesce() against hive not working

2018-02-01 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16349227#comment-16349227
 ] 

Thomas Graves commented on SPARK-23304:
---

OK, I just realized what you are getting at. I tried on 2.2 to coalesce to a 
small number, 8, and it's not doing it.

Sorry, I guess this isn't a regression then.

> Spark SQL coalesce() against hive not working
> -
>
> Key: SPARK-23304
> URL: https://issues.apache.org/jira/browse/SPARK-23304
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Thomas Graves
>Assignee: Xiao Li
>Priority: Blocker
>
> Testing with spark 2.3 and I see a difference in the sql coalesce talking to 
> hive vs spark 2.2. It seems spark 2.3 ignores the coalesce.
>  
> Query:
> spark.sql("SELECT COUNT(DISTINCT(something)) FROM sometable WHERE dt >= 
> '20170301' AND dt <= '20170331' AND something IS NOT 
> NULL").coalesce(16).show()
>  
> in spark 2.2 the coalesce works here, but in spark 2.3, it doesn't.






[jira] [Commented] (SPARK-23304) Spark SQL coalesce() against hive not working

2018-02-01 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16349220#comment-16349220
 ] 

Thomas Graves commented on SPARK-23304:
---

If it helps , spark 2.3 # partitions is 317531 and spark 2.2 is 166290

> Spark SQL coalesce() against hive not working
> -
>
> Key: SPARK-23304
> URL: https://issues.apache.org/jira/browse/SPARK-23304
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Thomas Graves
>Assignee: Xiao Li
>Priority: Blocker
>
> Testing with spark 2.3 and I see a difference in the sql coalesce talking to 
> hive vs spark 2.2. It seems spark 2.3 ignores the coalesce.
>  
> Query:
> spark.sql("SELECT COUNT(DISTINCT(something)) FROM sometable WHERE dt >= 
> '20170301' AND dt <= '20170331' AND something IS NOT 
> NULL").coalesce(16).show()
>  
> in spark 2.2 the coalesce works here, but in spark 2.3, it doesn't.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23304) Spark SQL coalesce() against hive not working

2018-02-01 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16349212#comment-16349212
 ] 

Thomas Graves commented on SPARK-23304:
---

Yes, there is a difference in the # of partitions between 2.2 and 2.3.  I was 
assuming that was the new orc functionality.  The results of the query are the 
same.

> Spark SQL coalesce() against hive not working
> -
>
> Key: SPARK-23304
> URL: https://issues.apache.org/jira/browse/SPARK-23304
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Thomas Graves
>Assignee: Xiao Li
>Priority: Blocker
>
> Testing with spark 2.3 and I see a difference in the sql coalesce talking to 
> hive vs spark 2.2. It seems spark 2.3 ignores the coalesce.
>  
> Query:
> spark.sql("SELECT COUNT(DISTINCT(something)) FROM sometable WHERE dt >= 
> '20170301' AND dt <= '20170331' AND something IS NOT 
> NULL").coalesce(16).show()
>  
> in spark 2.2 the coalesce works here, but in spark 2.3, it doesn't.






[jira] [Commented] (SPARK-23304) Spark SQL coalesce() against hive not working

2018-02-01 Thread Sameer Agarwal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16349147#comment-16349147
 ] 

Sameer Agarwal commented on SPARK-23304:


[~tgraves] just to rule out the obvious, was there a difference in the number 
of partitions in {{spark.sql("SELECT COUNT(DISTINCT(something)) FROM sometable 
WHERE dt >= '20170301' AND dt <= '20170331' AND something IS NOT NULL")}} in 
Spark 2.2 and 2.3?

> Spark SQL coalesce() against hive not working
> -
>
> Key: SPARK-23304
> URL: https://issues.apache.org/jira/browse/SPARK-23304
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Thomas Graves
>Assignee: Xiao Li
>Priority: Blocker
>
> Testing with spark 2.3 and I see a difference in the sql coalesce talking to 
> hive vs spark 2.2. It seems spark 2.3 ignores the coalesce.
>  
> Query:
> spark.sql("SELECT COUNT(DISTINCT(something)) FROM sometable WHERE dt >= 
> '20170301' AND dt <= '20170331' AND something IS NOT 
> NULL").coalesce(16).show()
>  
> in spark 2.2 the coalesce works here, but in spark 2.3, it doesn't.


