[jira] [Commented] (SPARK-23304) Spark SQL coalesce() against hive not working
[ https://issues.apache.org/jira/browse/SPARK-23304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16353325#comment-16353325 ]

Wenchen Fan commented on SPARK-23304:
-------------------------------------

I didn't follow the discussion here closely, but FYI, the documentation of `Dataset.coalesce` says:

{code:java}
 * Returns a new Dataset that has exactly `numPartitions` partitions, when the fewer partitions
 * are requested. If a larger number of partitions is requested, it will stay at the current
 * number of partitions{code}

> Spark SQL coalesce() against hive not working
> ---------------------------------------------
>
> Key: SPARK-23304
> URL: https://issues.apache.org/jira/browse/SPARK-23304
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.2.1, 2.3.0
> Reporter: Thomas Graves
> Assignee: Xiao Li
> Priority: Major
> Attachments: spark22_oldorc_explain.txt, spark23_oldorc_explain.txt, spark23_oldorc_explain_convermetastoreorcfalse.txt
>
> The query below seems to ignore the coalesce. This is running spark 2.2 or spark 2.3 against hive, which is reading orc:
>
> Query:
> spark.sql("SELECT COUNT(DISTINCT(something)) FROM sometable WHERE dt >= '20170301' AND dt <= '20170331' AND something IS NOT NULL").coalesce(16).show()

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
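The `Dataset.coalesce` contract quoted in Wenchen Fan's comment can be checked in isolation. The following is only a minimal sketch: it assumes a local SparkSession and a synthetic 8-partition Dataset built with `spark.range`, not the Hive ORC table from the report, and the object and method names are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession

object CoalesceContract {
  // Returns the partition counts (original, after coalesce(4), after coalesce(16))
  // for a synthetic Dataset that starts with 8 partitions.
  def partitionCounts(spark: SparkSession): (Int, Int, Int) = {
    val ds = spark.range(0L, 1000L, 1L, 8) // assumption: stand-in for the real table
    (
      ds.rdd.getNumPartitions,
      ds.coalesce(4).rdd.getNumPartitions,  // fewer requested: reduced to 4
      ds.coalesce(16).rdd.getNumPartitions  // more requested: stays at the current 8
    )
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[4]")
      .appName("coalesce-contract")
      .getOrCreate()
    try println(partitionCounts(spark))
    finally spark.stop()
  }
}
```

The point of the sketch is that `coalesce` only ever merges existing partitions, so asking for 16 on a Dataset that already has fewer is a no-op.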
[jira] [Commented] (SPARK-23304) Spark SQL coalesce() against hive not working
[ https://issues.apache.org/jira/browse/SPARK-23304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16350440#comment-16350440 ]

Thomas Graves commented on SPARK-23304:
---------------------------------------

OK, so I guess by that logic the coalesce won't ever work with the COUNT(DISTINCT()), since it's the intermediate query I want it to apply to. It will work on the plain select of bcookie. I tested that and verified:

spark.sql("SELECT something FROM sometable WHERE dt >= '20170301' AND dt <= '20170331' AND something IS NOT NULL").coalesce(8).show()

That actually works. So I guess we can close this; it was my misunderstanding.
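To illustrate the conclusion in the comment above: the result of a global COUNT(DISTINCT ...) is already a single partition, so a `coalesce(16)` applied to it has nothing to merge, while the same call on a plain SELECT does reduce the partition count. This is a hedged sketch only; it assumes a local SparkSession and a synthetic 8-partition temp view named `sometable` in place of the Hive table, and the object name is hypothetical.

```scala
import org.apache.spark.sql.SparkSession

object CoalesceAfterAggregate {
  // Returns partition counts for, in order:
  //   the aggregate, the aggregate after coalesce(16),
  //   the plain select, the plain select after coalesce(4).
  def partitionCounts(spark: SparkSession): (Int, Int, Int, Int) = {
    // Assumption: synthetic 8-partition data standing in for the ORC-backed table.
    spark.range(0L, 1000L, 1L, 8).toDF("something").createOrReplaceTempView("sometable")

    val agg = spark.sql(
      "SELECT COUNT(DISTINCT(something)) FROM sometable WHERE something IS NOT NULL")
    val plain = spark.sql(
      "SELECT something FROM sometable WHERE something IS NOT NULL")

    (
      agg.rdd.getNumPartitions,              // final aggregate: a single partition
      agg.coalesce(16).rdd.getNumPartitions, // cannot grow, so still a single partition
      plain.rdd.getNumPartitions,            // plain select keeps the scan's partitions
      plain.coalesce(4).rdd.getNumPartitions // coalesce applies directly to the scan output
    )
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[4]")
      .appName("coalesce-after-aggregate")
      .getOrCreate()
    try println(partitionCounts(spark))
    finally spark.stop()
  }
}
```

In both cases `coalesce` acts on the output of the query it is called on, not on the table scan underneath the aggregation.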
[jira] [Commented] (SPARK-23304) Spark SQL coalesce() against hive not working
[ https://issues.apache.org/jira/browse/SPARK-23304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16350423#comment-16350423 ]

Thomas Graves commented on SPARK-23304:
---------------------------------------

It doesn't look like sql("xyz").rdd.partitions.length comes back correct in either spark 2.2 or 2.3. But if I change the query from SELECT COUNT(DISTINCT(bcookie)) ... to just SELECT bcookie, then the partitions.length works. So perhaps it is something with the count.

spark 2.3, SELECT COUNT(DISTINCT(bcookie)):

scala> query.rdd.partitions.length
res4: Int = 1

scala> query.count()
[Stage 5:===> (15420 + 619) / 16039]

spark 2.2, SELECT COUNT(DISTINCT(bcookie)):

scala> query.rdd.partitions.length
res0: Int = 1

scala> query.count()
[Stage 0:==> (1136 + 1600) / 5346]

spark 2.2, query with just select bcookie:

scala> query.rdd.partitions.length
res1: Int = 5346

spark 2.3, query with just select bcookie:

scala> query.rdd.partitions.length
res9: Int = 16039

Note that if I change it to just SELECT DISTINCT(bcookie), then I get 200:

scala> query.rdd.partitions.length
res10: Int = 200
[jira] [Commented] (SPARK-23304) Spark SQL coalesce() against hive not working
[ https://issues.apache.org/jira/browse/SPARK-23304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16350428#comment-16350428 ]

Thomas Graves commented on SPARK-23304:
---------------------------------------

Well, I guess that gives you the end # of partitions and not the # it will be initially reading.
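One possible way to look at the number of partitions being read initially, as opposed to the final count that `query.rdd.partitions.length` reports, is to ask the leaf (scan) operators of the physical plan. This is a rough sketch under stated assumptions: it relies on the developer-level `queryExecution.sparkPlan` API, uses a synthetic 8-partition temp view instead of the Hive table, and the object and method names are invented for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object ScanPartitions {
  // Partition counts of the leaf (scan) operators of the physical plan,
  // i.e. how many partitions the initial read is split into.
  def leafPartitionCounts(df: DataFrame): Seq[Int] =
    df.queryExecution.sparkPlan.collectLeaves().map(_.execute().getNumPartitions)

  // Returns (scan partition counts, final partition count) for an aggregate query.
  def scanVersusFinal(spark: SparkSession): (Seq[Int], Int) = {
    // Assumption: synthetic 8-partition data standing in for the ORC-backed table.
    spark.range(0L, 1000L, 1L, 8).toDF("something").createOrReplaceTempView("sometable")
    val query = spark.sql("SELECT COUNT(DISTINCT(something)) FROM sometable")
    (leafPartitionCounts(query), query.rdd.partitions.length)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[4]")
      .appName("scan-partitions")
      .getOrCreate()
    try println(scanVersusFinal(spark))
    finally spark.stop()
  }
}
```

The first element reflects the read side and the second the end of the query, which is the distinction drawn in the comment above.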
[jira] [Commented] (SPARK-23304) Spark SQL coalesce() against hive not working
[ https://issues.apache.org/jira/browse/SPARK-23304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16349555#comment-16349555 ]

Thomas Graves commented on SPARK-23304:
---------------------------------------

I don't have any hive tables backed by parquet to compare to.
[jira] [Commented] (SPARK-23304) Spark SQL coalesce() against hive not working
[ https://issues.apache.org/jira/browse/SPARK-23304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16349511#comment-16349511 ]

Xiao Li commented on SPARK-23304:
---------------------------------

Based on the new plan, it sounds like the plan is not changed. Could you try to use {{sql("xyz").rdd.partitions.length}} to get the number of partitions? Are they the same?
[jira] [Commented] (SPARK-23304) Spark SQL coalesce() against hive not working
[ https://issues.apache.org/jira/browse/SPARK-23304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16349498#comment-16349498 ]

Dongjoon Hyun commented on SPARK-23304:
---------------------------------------

I updated the affected version according to the latest updates.
[jira] [Commented] (SPARK-23304) Spark SQL coalesce() against hive not working
[ https://issues.apache.org/jira/browse/SPARK-23304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16349480#comment-16349480 ]

Thomas Graves commented on SPARK-23304:
---------------------------------------

I just ran the query (show()) and saw the # of partitions.

spark23_oldorc_explain_convermetastoreorcfalse.txt is the explain with --conf spark.sql.orc.impl=hive --conf spark.sql.orc.filterPushdown=false --conf spark.sql.hive.convertMetastoreOrc=false
[jira] [Commented] (SPARK-23304) Spark SQL coalesce() against hive not working
[ https://issues.apache.org/jira/browse/SPARK-23304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16349461#comment-16349461 ]

Xiao Li commented on SPARK-23304:
---------------------------------

That is fine. Obviously, at least, we need to submit a PR to document the behavior changes introduced by the native ORC reader. This is missing. I am just trying to confirm that we made no behavior change in this release. How did you get the number of partitions? Is it through something like the following?

{noformat}
sql("xyz").rdd.partitions.length
{noformat}
[jira] [Commented] (SPARK-23304) Spark SQL coalesce() against hive not working
[ https://issues.apache.org/jira/browse/SPARK-23304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16349446#comment-16349446 ]

Thomas Graves commented on SPARK-23304:
---------------------------------------

It still seems like a bug to me since the coalesce isn't happening, but I wanted to make sure you saw that. I apologize for the original mistake on my side.
[jira] [Commented] (SPARK-23304) Spark SQL coalesce() against hive not working
[ https://issues.apache.org/jira/browse/SPARK-23304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16349444#comment-16349444 ]

Thomas Graves commented on SPARK-23304:
---------------------------------------

[~smilegator] just to make sure you saw my comment above: this isn't a regression from spark 2.2. I made a mistake; spark 2.2 had a small number of partitions to start with, and I missed that.
[jira] [Commented] (SPARK-23304) Spark SQL coalesce() against hive not working
[ https://issues.apache.org/jira/browse/SPARK-23304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16349436#comment-16349436 ]

Xiao Li commented on SPARK-23304:
---------------------------------

In this release, we also made a change in the default of another SQLConf, `spark.sql.hive.convertMetastoreOrc`. Could you also set this conf to `false` and rerun the query in the 2.3 release?

I am wondering if this is in your original query? I am unable to reproduce this one.

```NOT (something#226 = )))```
[jira] [Commented] (SPARK-23304) Spark SQL coalesce() against hive not working
[ https://issues.apache.org/jira/browse/SPARK-23304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16349430#comment-16349430 ]

Thomas Graves commented on SPARK-23304:
---------------------------------------

I filed Jira https://issues.apache.org/jira/browse/SPARK-23309 for the performance issue.
[jira] [Commented] (SPARK-23304) Spark SQL coalesce() against hive not working
[ https://issues.apache.org/jira/browse/SPARK-23304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16349369#comment-16349369 ]

Thomas Graves commented on SPARK-23304:
---------------------------------------

Note I've removed some of the columns from the output; if you need them I can anonymize them instead.
[jira] [Commented] (SPARK-23304) Spark SQL coalesce() against hive not working
[ https://issues.apache.org/jira/browse/SPARK-23304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16349367#comment-16349367 ]

Thomas Graves commented on SPARK-23304:
---------------------------------------

OK, I've attached 2 files, one with spark 2.3 and one with spark 2.2, both run with the options --conf spark.sql.orc.impl=hive --conf spark.sql.orc.filterPushdown=false. That was query.coalesce(8) and then query.explain(true).
[jira] [Commented] (SPARK-23304) Spark SQL coalesce() against hive not working
[ https://issues.apache.org/jira/browse/SPARK-23304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16349346#comment-16349346 ]

Sameer Agarwal commented on SPARK-23304:
----------------------------------------

Also, is there a JIRA/repro for the caching issue you mentioned? We can continue to investigate that in parallel (cc [~kiszk])
[jira] [Commented] (SPARK-23304) Spark SQL coalesce() against hive not working
[ https://issues.apache.org/jira/browse/SPARK-23304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16349340#comment-16349340 ]

Xiao Li commented on SPARK-23304:
---------------------------------

I do not think our native ORC reader respects `hive.exec.orc.split.strategy`. cc [~dongjoon] [~cloud_fan] for their confirmation.
[jira] [Commented] (SPARK-23304) Spark SQL coalesce() against hive not working
[ https://issues.apache.org/jira/browse/SPARK-23304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16349339#comment-16349339 ]

Xiao Li commented on SPARK-23304:
---------------------------------

Hi, [~tgraves], could you change the following two SQLConfs?

`spark.sql.orc.impl` -> `hive` (this is to use the original Hive ORC reader)
`spark.sql.orc.filterPushdown` -> `false`

Then, could you provide the plans for both 2.2 and 2.3? For example, `query.explain(true)`.
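The request above might look roughly like the following from Scala. This is a sketch rather than the exact steps used in the thread, where the settings were passed with `--conf` at launch: it assumes both confs can also be set on a running session through `spark.conf.set`, uses a synthetic placeholder query because the Hive table is not available here, and the object and method names are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

object OrcReaderConf {
  // Switch the session to the original Hive ORC reader with filter pushdown off,
  // mirroring the two settings requested in the comment above.
  def useHiveOrcReader(spark: SparkSession): Unit = {
    spark.conf.set("spark.sql.orc.impl", "hive")
    spark.conf.set("spark.sql.orc.filterPushdown", "false")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[4]")
      .appName("orc-reader-conf")
      .getOrCreate()
    try {
      useHiveOrcReader(spark)
      // Assumption: placeholder query; the report ran this against a Hive ORC table.
      val query = spark.range(0L, 1000L, 1L, 16).toDF("something").coalesce(8)
      // Prints the parsed, analyzed, optimized and physical plans.
      query.explain(true)
    } finally spark.stop()
  }
}
```

Comparing the `explain(true)` output from 2.2 and 2.3 under the same settings is what separates a reader change from a planner change.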
[jira] [Commented] (SPARK-23304) Spark SQL coalesce() against hive not working
[ https://issues.apache.org/jira/browse/SPARK-23304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16349270#comment-16349270 ]

Thomas Graves commented on SPARK-23304:
---------------------------------------

So with the new ORC code, is there any way to control the # of partitions being read initially? In spark 2.2 you could set hive.exec.orc.split.strategy, but that doesn't appear to work in 2.3.
[jira] [Commented] (SPARK-23304) Spark SQL coalesce() against hive not working
[ https://issues.apache.org/jira/browse/SPARK-23304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16349227#comment-16349227 ]

Thomas Graves commented on SPARK-23304:
---------------------------------------

OK, I just realized what you are getting at. I tried on 2.2 to coalesce to a small number (8) and it's not doing it. Sorry, I guess this isn't a regression then.

> Spark SQL coalesce() against hive not working
> ---------------------------------------------
>
> Key: SPARK-23304
> URL: https://issues.apache.org/jira/browse/SPARK-23304
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.3.0
> Reporter: Thomas Graves
> Assignee: Xiao Li
> Priority: Blocker
>
> Testing with spark 2.3 and I see a difference in the sql coalesce talking to hive vs spark 2.2. It seems spark 2.3 ignores the coalesce.
>
> Query:
> spark.sql("SELECT COUNT(DISTINCT(something)) FROM sometable WHERE dt >= '20170301' AND dt <= '20170331' AND something IS NOT NULL").coalesce(16).show()
>
> in spark 2.2 the coalesce works here, but in spark 2.3, it doesn't.
[jira] [Commented] (SPARK-23304) Spark SQL coalesce() against hive not working
[ https://issues.apache.org/jira/browse/SPARK-23304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16349220#comment-16349220 ]

Thomas Graves commented on SPARK-23304:
---------------------------------------

If it helps, the spark 2.3 # of partitions is 317531 and spark 2.2 is 166290.
[jira] [Commented] (SPARK-23304) Spark SQL coalesce() against hive not working
[ https://issues.apache.org/jira/browse/SPARK-23304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16349212#comment-16349212 ]

Thomas Graves commented on SPARK-23304:
---------------------------------------

Yes, there is a difference in the # of partitions between 2.2 and 2.3. I was assuming that was the new orc functionality. The results of the query are the same.
[jira] [Commented] (SPARK-23304) Spark SQL coalesce() against hive not working
[ https://issues.apache.org/jira/browse/SPARK-23304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16349147#comment-16349147 ]

Sameer Agarwal commented on SPARK-23304:
----------------------------------------

[~tgraves] just to rule out the obvious, was there a difference in the number of partitions in {{spark.sql("SELECT COUNT(DISTINCT(something)) FROM sometable WHERE dt >= '20170301' AND dt <= '20170331' AND something IS NOT NULL")}} in Spark 2.2 and 2.3?